
Information Retrieval

IR models: Boolean model

Dr. Bassel ALKHATIB


IR Models

User Task:
– Retrieval: Ad hoc, Filtering
– Browsing

Classic Models: Boolean, Vector, Probabilistic
Set Theoretic: Fuzzy, Extended Boolean
Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
Probabilistic: Inference Network, Belief Network
Structured Models: Non-Overlapping Lists, Proximal Nodes
Browsing: Flat, Structure Guided, Hypertext
Slide 2
The Boolean Model

• Simple model based on set theory

• Queries and documents specified as Boolean expressions

– precise semantics
– neat formalism
– Query = ka ∧ (kb ∨ ¬kc)

• Terms are either present or absent. Thus, wij ∈ {0,1}

Slide 3
The Boolean Model

• q = ka ∧ (kb ∨ ¬kc)

• sim(q,dj) = 1, if the document satisfies the Boolean query
              0, otherwise

– no in-between, only 0 or 1

Slide 4


Example
• q = ka ∧ (kb ∨ ¬kc)
• vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
• Disjunctive Normal Form

• Matching documents
• md1 = [ka kd ke] => (1,0,0)
• md2 = [ka kb kc] => (1,1,1)
• Unmatched documents
• ud1 = [ka kc] => (1,0,1)
• ud2 = [kd] => (0,0,0)
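The matching above can be sketched in a few lines: build a binary weight vector for each document and test it against the components of the query's DNF. The document and term names are the illustrative ones from this slide, not a real collection.

```python
def weights(doc_terms, index_terms):
    """Binary weight vector: 1 if the index term occurs in the document."""
    return tuple(1 if t in doc_terms else 0 for t in index_terms)

INDEX_TERMS = ["ka", "kb", "kc"]
# DNF of q = ka AND (kb OR NOT kc): (1,1,1) OR (1,1,0) OR (1,0,0)
QDNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def sim(doc_terms):
    """1 if the document's weight vector matches a DNF component, else 0."""
    return 1 if weights(doc_terms, INDEX_TERMS) in QDNF else 0

print(sim({"ka", "kd", "ke"}))  # md1 -> (1,0,0): matches, prints 1
print(sim({"ka", "kc"}))        # ud1 -> (1,0,1): no match, prints 0
```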

Slide 5
Boolean Model: Example

• Set of index terms = {k1, k2, k3, k4, k5}

• D1 = (0,1,1,0,0), D2 = (1,1,0,1,0), D3 = (1,1,1,1,0)
• Q1 = k1 ∧ k3 ∧ ¬k5
• The DNF of Q1 = (1,0,1,0,0) ∨ (1,1,1,0,0) ∨ (1,0,1,1,0) ∨ (1,1,1,1,0)
• D3 is relevant

Slide 6
Boolean Model: Example
Set of index terms = {k1, k2, k3, k4, k5}
D1 = (0,1,1,0,0), D2 = (1,1,0,1,0), D3 = (1,1,1,1,0)
Q = (k1 ∨ ¬k3) ∧ k5

Slide 7
Boolean Model: Example
Set of index terms = {k1, k2, k3, k4, k5}
D1 = (0,1,1,0,0), D2 = (1,1,0,1,0), D3 = (1,1,1,1,0)
Q = (k1 ∨ ¬k3) ∧ k5
The DNF of
Q = (1,0,1,0,1) ∨ (1,0,1,1,1) ∨ (1,1,1,0,1) ∨ (1,1,1,1,1)
  ∨ (1,0,0,0,1) ∨ (1,0,0,1,1) ∨ (1,1,0,0,1) ∨ (1,1,0,1,1)
  ∨ (0,0,0,0,1) ∨ (0,0,0,1,1) ∨ (0,1,0,0,1) ∨ (0,1,0,1,1)

Every DNF component requires k5 = 1, and no document contains k5: no document is relevant

Slide 8
Exercise

• D1 = “computer information retrieval”


• D2 = “computer retrieval”
• D3 = “information”
• D4 = “computer information”

• Q1 = “information ∧ retrieval”
• Q2 = “information ∧ ¬computer”

Slide 9
Exercise
0 (no authors)

1 Swift

2 Shakespeare

3 Shakespeare Swift

4 Milton

5 Milton Swift

6 Milton Shakespeare

7 Milton Shakespeare Swift

8 Chaucer

9 Chaucer Swift

10 Chaucer Shakespeare

11 Chaucer Shakespeare Swift


12 Chaucer Milton

13 Chaucer Milton Swift

14 Chaucer Milton Shakespeare

15 Chaucer Milton Shakespeare Swift

((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
Slide 10
Exercise
((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))

Term order: (c, m, h, w) = (chaucer, milton, shakespeare, swift)

( (c ∨ m) ∧ ¬w ) ∨ ( ¬c ∧ (w ∨ h) )

Equivalently: (c ∨ m ∨ w ∨ h) ∧ (¬w ∨ ¬c)

Q = (0,0,1,0) ∨ (0,1,0,0) ∨ (0,1,1,0)
  ∨ (0,0,0,1) ∨ (0,0,1,1) ∨ (0,1,0,1) ∨ (0,1,1,1)
  ∨ (1,0,0,0) ∨ (1,0,1,0) ∨ (1,1,0,0) ∨ (1,1,1,0)

Relevant documents = {1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14}

Non-relevant documents = {0, 9, 11, 13, 15}
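The exercise can be checked mechanically: the document numbering encodes authors as bits (Swift = bit 0, Shakespeare = bit 1, Milton = bit 2, Chaucer = bit 3), so the 16 documents and the query can be enumerated directly. A quick sketch (structure and names are mine, not from the slides):

```python
# Each document number n is decoded into its set of authors via its bits.
DOCS = {
    n: {author
        for bit, author in enumerate(["swift", "shakespeare", "milton", "chaucer"])
        if n & (1 << bit)}
    for n in range(16)
}

def matches(terms):
    """((chaucer OR milton) AND NOT swift) OR (NOT chaucer AND (swift OR shakespeare))"""
    c, m, h, w = ("chaucer" in terms, "milton" in terms,
                  "shakespeare" in terms, "swift" in terms)
    return ((c or m) and not w) or (not c and (w or h))

relevant = sorted(n for n, terms in DOCS.items() if matches(terms))
print(relevant)  # [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14]
```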

Slide 11
Drawbacks of the Boolean Model
• Retrieval based on binary decision criteria with no notion of
partial matching
• No ranking of the documents is provided (absence of a
grading scale)
• Information need has to be translated into a Boolean
expression which most users find awkward
• The Boolean queries formulated by the users are most often
too simplistic
• As a consequence, the Boolean model frequently returns
either too few or too many documents in response to a user
query
Slide 12
• The question of how to extend the Boolean model to
accommodate partial matching and ranking has attracted
considerable attention in the past
• Two extensions of the Boolean model:
– Extended Boolean Model
– Fuzzy Set Model

Slide 13
Extended Boolean Model:

• Disadvantages of the Boolean model:

Consider the query q = Kx AND Ky.

A document containing just one of the terms, e.g. Kx, is
considered as irrelevant as a document containing
neither of them.

• No term weights are used

Slide 14
Extended Boolean Model

• Boolean model is simple and elegant.


• But, no provision for a ranking
• As with the fuzzy model, a ranking can be obtained by
relaxing the condition on set membership
• Extend the Boolean model with the notions of partial
matching and term weighting

Slide 15
The Idea

• The extended Boolean model (introduced by Salton, Fox,
and Wu, 1983) is based on a critique of a basic
assumption in Boolean algebra

Slide 16
The Idea:
• qand = kx ∧ ky; wxj = x and wyj = y

[Figure: documents plotted in the (kx, ky) plane from (0,0) to (1,1); for AND, dj+1 lies closer to (1,1) than dj]

We want a document to be as close as possible to (1,1):

  sim(qand, dj) = 1 − sqrt( ((1−x)² + (1−y)²) / 2 )

Slide 17
The Idea:
• qor = kx ∨ ky; wxj = x and wyj = y

[Figure: for OR, dj+1 lies farther from (0,0) than dj]

We want a document to be as far as possible from (0,0):

  sim(qor, dj) = sqrt( (x² + y²) / 2 )
Slide 18
Remarks

• If the weights are all Boolean, i.e. in {0,1}, a document is always
positioned in one of the four corners (0,0), (0,1), (1,0)
and (1,1):
• sim(qor, d) ∈ {0, 1/√2, 1}
• sim(qand, d) ∈ {0, 1 − 1/√2, 1}

Slide 19
Generalizing the Idea

• We can extend the previous model to consider Euclidean
distances in a t-dimensional space (t terms).
• This can be done using p-norms, which extend the notion
of distance to include p-distances, where 1 ≤ p ≤ ∞ is a
new parameter
• A generalized disjunctive query is given by
– qor = k1 ∨p k2 ∨p … ∨p kt
• A generalized conjunctive query is given by
– qand = k1 ∧p k2 ∧p … ∧p kt
Slide 20
Generalizing the Idea

– sim(qor, dj) = ( (x1^p + x2^p + … + xm^p) / m )^(1/p)

– sim(qand, dj) = 1 − ( ((1−x1)^p + (1−x2)^p + … + (1−xm)^p) / m )^(1/p)

Slide 21
Properties:

• The p-norm as defined above enjoys a couple of
interesting properties. First, when p = 1 it can
be verified that

  sim(qor, dj) = sim(qand, dj) = (x1 + … + xm) / m

• Second, when p = ∞ it can be verified that
• sim(qor, dj) = max(xi)
• sim(qand, dj) = min(xi)
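The two p-norm similarities and the properties above can be sketched directly; the weight vector here is illustrative, and a large finite p stands in for p = ∞:

```python
import math

def sim_or(xs, p):
    """sim(q_or, d) = ((x1^p + ... + xm^p) / m)^(1/p)"""
    m = len(xs)
    return (sum(x ** p for x in xs) / m) ** (1 / p)

def sim_and(xs, p):
    """sim(q_and, d) = 1 - (((1-x1)^p + ... + (1-xm)^p) / m)^(1/p)"""
    m = len(xs)
    return 1 - (sum((1 - x) ** p for x in xs) / m) ** (1 / p)

xs = [0.9, 0.4, 0.1]
# p = 1: both similarities reduce to the average of the weights
print(round(sim_or(xs, 1), 4), round(sim_and(xs, 1), 4))        # 0.4667 0.4667
# large p approximates p = infinity: sim_or -> max(xs), sim_and -> min(xs)
print(round(sim_or(xs, 1000), 2), round(sim_and(xs, 1000), 2))  # 0.9 0.1
```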
Slide 22
Example:

• For instance, consider the query q = (k1 ∧p k2) ∨p k3. The
similarity sim(q,dj) between a document dj and this
query is then computed as

  sim(q, dj) = ( ( (1 − ( ((1−x1)^p + (1−x2)^p) / 2 )^(1/p) )^p + x3^p ) / 2 )^(1/p)

• Any Boolean query can be expressed as a numerical formula in this way.

Slide 23
Example:

• Q1 = k1 ∧ k2 ∧ k3

• Q2 = (k1 ∧ k2) ∨ k3
• D = (1, 0, 1)

Slide 24
Example:

• Q1 = k1 ∧ k2 ∧ k3
• Q2 = (k1 ∧ k2) ∨ k3
• D = (1, 0, 1), with p = 2

  sim(Q1, D) = 1 − sqrt( ((1−1)² + (1−0)² + (1−1)²) / 3 ) ≈ 0.42

  sim(Q2, D) = sqrt( ( (1 − sqrt( ((1−1)² + (1−0)²) / 2 ))² + 1² ) / 2 ) ≈ 0.74
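The worked example can be checked with plain arithmetic for p = 2 (the variable names are mine):

```python
import math

x1, x2, x3 = 1.0, 0.0, 1.0  # D = (1, 0, 1)

# Q1 = k1 AND k2 AND k3: pure conjunction over the three terms
sim_q1 = 1 - math.sqrt(((1 - x1) ** 2 + (1 - x2) ** 2 + (1 - x3) ** 2) / 3)

# Q2 = (k1 AND k2) OR k3: inner conjunction, then disjunction with k3
inner = 1 - math.sqrt(((1 - x1) ** 2 + (1 - x2) ** 2) / 2)
sim_q2 = math.sqrt((inner ** 2 + x3 ** 2) / 2)

print(round(sim_q1, 2), round(sim_q2, 2))  # 0.42 0.74
```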
Slide 25
Exercise:

1. Give the numerical formula in the extended Boolean
model for the query
q = (k1 or k2 or k3) and (not k4 or k5). (Assume that
there are 5 terms in total.)

2. Assume that the document is represented by
the vector (0.8, 0.1, 0.0, 0.0, 1.0).
What is sim(q, d) in the extended Boolean model?

Slide 26
Exercise:
q = (k1 or k2 or k3) and (not k4 or k5).
D = (x1, x2, x3, x4, x5) = (0.8, 0.1, 0.0, 0.0, 1.0).

Slide 27
Exercise:
q = (k1 or k2 or k3) and (not k4 or k5), with p = 2.
D = (x1, x2, x3, x4, x5) = (0.8, 0.1, 0.0, 0.0, 1.0).

  a = sim(k1 or k2 or k3) = sqrt( (x1² + x2² + x3²) / 3 ) = sqrt(0.65/3) ≈ 0.47

  b = sim(not k4 or k5) = sqrt( ((1−x4)² + x5²) / 2 ) = sqrt((1+1)/2) = 1

  sim(Q, D) = 1 − sqrt( ((1−a)² + (1−b)²) / 2 ) ≈ 0.62
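Recomputing the exercise directly from the p-norm formulas with p = 2 (a sketch; the weight of NOT k4 in the document is taken as 1 − x4):

```python
import math

x1, x2, x3, x4, x5 = 0.8, 0.1, 0.0, 0.0, 1.0

a = math.sqrt((x1 ** 2 + x2 ** 2 + x3 ** 2) / 3)        # (k1 or k2 or k3)
b = math.sqrt(((1 - x4) ** 2 + x5 ** 2) / 2)            # (not k4 or k5)
sim = 1 - math.sqrt(((1 - a) ** 2 + (1 - b) ** 2) / 2)  # AND of the two clauses

print(round(a, 2), round(b, 2), round(sim, 2))  # 0.47 1.0 0.62
```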

Slide 28
Extended Boolean Model

• Model is quite powerful

• Properties are interesting and might be useful
• Computation is somewhat complex
• However, distributivity does not hold for the
ranking computation:
– q1 = (k1 ∨ k2) ∧ k3
– q2 = (k1 ∧ k3) ∨ (k2 ∧ k3)
– sim(q1, dj) ≠ sim(q2, dj)

Slide 29
Fuzzy Set Model

• Queries and docs represented by sets of index terms:


matching is approximate from the start
• This vagueness can be modeled using a fuzzy framework, as
follows:
– with each term is associated a fuzzy set
– each doc has a degree of membership in this fuzzy set

• This interpretation provides the foundation for many models


for IR based on fuzzy theory
• Here we discuss the model proposed by Ogawa, Morita, and
Kobayashi (1991)

Slide 30
Fuzzy Set Theory
• Framework for representing classes whose boundaries are not
well defined
• Key idea is to introduce the notion of a degree of membership
associated with the elements of a set
• This degree of membership varies from 0 to 1 and allows
modeling the notion of marginal membership
• Thus, membership is now a gradual notion, contrary to the
notion enforced by classic Boolean logic

Slide 31
Fuzzy Set Theory
• Definition
– A fuzzy subset A of U is characterized by a membership function
  μ(A,u) : U → [0,1]
  which associates with each element u of U a number μ(u) in the interval [0,1]

• Definition
– Let A and B be two fuzzy subsets of U. Also, let ¬A be the complement of
  A. Then,
  • μ(¬A,u) = 1 − μ(A,u)
  • μ(A∪B,u) = max(μ(A,u), μ(B,u))
  • μ(A∩B,u) = min(μ(A,u), μ(B,u))

Slide 32
Fuzzy Information Retrieval
• Fuzzy sets are modeled based on a thesaurus
• This thesaurus is built as follows:
– Let c be a term-term correlation matrix
– Let c(i,l) be a normalized correlation factor for (ki, kl):

  c(i,l) = n(i,l) / (ni + nl − n(i,l))

– ni: number of docs which contain ki
– nl: number of docs which contain kl
– n(i,l): number of docs which contain both ki and kl

• We now have a notion of proximity among index terms.
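The correlation factor can be computed from a document-term incidence list; a small sketch using the four toy documents from the earlier exercise (slide 9):

```python
DOCS = [
    {"computer", "information", "retrieval"},  # D1
    {"computer", "retrieval"},                 # D2
    {"information"},                           # D3
    {"computer", "information"},               # D4
]

def c(ki, kl):
    """Normalized correlation c(i,l) = n(i,l) / (n_i + n_l - n(i,l))."""
    n_i = sum(1 for d in DOCS if ki in d)
    n_l = sum(1 for d in DOCS if kl in d)
    n_il = sum(1 for d in DOCS if ki in d and kl in d)
    return n_il / (n_i + n_l - n_il)

print(round(c("computer", "retrieval"), 3))     # 2/(3+2-2) = 2/3 -> 0.667
print(round(c("information", "retrieval"), 3))  # 1/(3+2-1) -> 0.25
```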

Slide 33
Fuzzy Information Retrieval
• The correlation factor c(i,l) can be used to define fuzzy set
membership for a document dj as follows:

  μ(i,j) = 1 − ∏_{kl ∈ dj} (1 − c(i,l))

  μ(i,j): membership of doc dj in the fuzzy subset associated with ki

• The above expression computes an algebraic sum over all
terms in the doc dj
• A doc dj belongs to the fuzzy set for ki if its own terms are
associated with ki
• If doc dj contains a term kl which is closely related to ki, we
have
– c(i,l) ~ 1
– μ(i,j) ~ 1
Slide 34
Fuzzy Information Retrieval
• Disjunctive set: algebraic sum

  μ(cc1 | cc2 | cc3, j) = 1 − ∏_{i=1..3} (1 − μ(cci, j))

• Conjunctive set: algebraic product

  μ(cc1 & cc2 & cc3, j) = ∏_{i=1..3} μ(cci, j)

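Both fuzzy combinators are one-liners over a list of membership degrees; the same algebraic sum also yields μ(i,j) when applied to the correlations c(i,l) of a document's terms. A sketch with illustrative degrees:

```python
from functools import reduce

def alg_sum(mus):
    """Disjunctive set: 1 - prod(1 - mu_i) (algebraic sum)."""
    return 1 - reduce(lambda acc, mu: acc * (1 - mu), mus, 1.0)

def alg_product(mus):
    """Conjunctive set: prod(mu_i) (algebraic product)."""
    return reduce(lambda acc, mu: acc * mu, mus, 1.0)

mus = [0.5, 0.4, 0.2]  # illustrative membership degrees for cc1, cc2, cc3
print(round(alg_sum(mus), 2))      # 1 - 0.5*0.6*0.8 -> 0.76
print(round(alg_product(mus), 2))  # 0.5*0.4*0.2 -> 0.04
```

Note that the algebraic sum dominates the max used for plain fuzzy union, so a document related to several components scores higher than one related to a single component.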
Slide 35
Fuzzy IR: An Example

[Figure: Venn diagram of Ka, Kb, Kc; cc1, cc2, cc3 are the three conjunctive components of the query]

• q = ka ∧ (kb ∨ ¬kc)
• vec(qdnf) = (1,1,1) + (1,1,0) + (1,0,0)
            = vec(cc1) + vec(cc2) + vec(cc3)

  μ(q,dj) = μ(cc1 + cc2 + cc3, j)
          = 1 − ∏_{i=1..3} (1 − μ(cci, j))
          = 1 − (1 − μ(a,j) μ(b,j) μ(c,j)) ×
                (1 − μ(a,j) μ(b,j) (1 − μ(c,j))) ×
                (1 − μ(a,j) (1 − μ(b,j)) (1 − μ(c,j)))
Slide 36
Fuzzy Information Retrieval

• Fuzzy IR models have been discussed mainly in the literature


associated with fuzzy theory
• Experiments with standard test collections are not available
• Difficult to compare at this time

Slide 37
