Information Retrieval: IR Models: Boolean Model
Slide 3
The Boolean Model
• q = ka ∧ (kb ∨ ¬kc)
• Similar/Matching documents
• md1 = [ka ka kd ke] => (1,0,0)
• md2 = [ka kb kc] => (1,1,1)
• Unmatched documents
• ud1 = [ka kc] => (1,0,1)
• ud2 = [kd] => (0,0,0)
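The matching on this slide can be checked mechanically. The sketch below assumes the query reads q = ka ∧ (kb ∨ ¬kc), which is the reading consistent with the matched and unmatched vectors listed; the function name `matches` is illustrative.

```python
# Each document is reduced to a binary vector over the query terms (ka, kb, kc).
def matches(vec):
    ka, kb, kc = vec
    # Assumed query: ka AND (kb OR NOT kc)
    return bool(ka and (kb or not kc))

docs = {
    "md1": (1, 0, 0),
    "md2": (1, 1, 1),
    "ud1": (1, 0, 1),
    "ud2": (0, 0, 0),
}

results = {name: matches(vec) for name, vec in docs.items()}
print(results)
```

Under that reading, md1 and md2 satisfy the query and ud1 and ud2 do not, with no notion of a partial match in between.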
Slide 5
Boolean Model : Example
Slide 6
Boolean Model : Example
Set of index term = {k1, k2, k3, k4, k5}
D1= (0,1,1,0,0), D2= (1,1, 0,1,0), D3=(1,1,1,1,0)
Q = (k1 ∨ ¬k3) ∧ k5
Slide 7
Boolean Model : Example
Set of index term = {k1, k2, k3, k4, k5}
D1= (0,1,1,0,0), D2= (1,1, 0,1,0), D3=(1,1,1,1,0)
Q = (k1 ∨ ¬k3) ∧ k5
The DNF of Q, as vectors over (k1, k2, k3, k4, k5):
Q = (1,0,1,0,1) ∨ (1,0,1,1,1) ∨ (1,1,1,0,1) ∨ (1,1,1,1,1)
∨ (1,0,0,0,1) ∨ (1,0,0,1,1) ∨ (1,1,0,0,1) ∨ (1,1,0,1,1)
∨ (0,0,0,0,1) ∨ (0,0,0,1,1) ∨ (0,1,0,0,1) ∨ (0,1,0,1,1)
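The DNF can be generated by enumerating all 32 term assignments. This is a minimal sketch that assumes the query is Q = (k1 ∨ ¬k3) ∧ k5, the reading that yields exactly the twelve vectors listed on the slide; the names `q`, `dnf` and `slide_dnf` are illustrative.

```python
from itertools import product

def q(k1, k2, k3, k4, k5):
    # Assumed query: (k1 OR NOT k3) AND k5
    return bool((k1 or not k3) and k5)

# Every assignment of (k1..k5) that satisfies Q is one conjunctive component.
dnf = [bits for bits in product((0, 1), repeat=5) if q(*bits)]

# The twelve vectors as they appear on the slide.
slide_dnf = {
    (1,0,1,0,1), (1,0,1,1,1), (1,1,1,0,1), (1,1,1,1,1),
    (1,0,0,0,1), (1,0,0,1,1), (1,1,0,0,1), (1,1,0,1,1),
    (0,0,0,0,1), (0,0,0,1,1), (0,1,0,0,1), (0,1,0,1,1),
}

docs = {"D1": (0,1,1,0,0), "D2": (1,1,0,1,0), "D3": (1,1,1,1,0)}
# A document is retrieved when its vector equals one of the components.
retrieved = [name for name, vec in docs.items() if vec in dnf]

print(len(dnf), set(dnf) == slide_dnf, retrieved)
```

None of D1, D2, D3 contains k5, so under this reading the query retrieves nothing from the collection.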
Slide 8
Exercise
• Q1 = “information retrieval”
• Q2 = “information ¬computer”
Slide 9
Exercise
0
1 Swift
2 Shakespeare
3 Shakespeare Swift
4 Milton
5 Milton Swift
6 Milton Shakespeare
8 Chaucer
9 Chaucer Swift
10 Chaucer Shakespeare
((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
Slide 10
Exercise
((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
(c, m, h, w)
((c ∨ m) ∧ ¬w) ∨ (¬c ∧ (w ∨ h))
≡ (c ∨ m ∨ w ∨ h) ∧ (¬w ∨ ¬c)
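The simplification can be verified by brute force, and the query can then be run against the numbered author table from the previous slide. In this sketch c, m, h, w are taken to stand for chaucer, milton, shakespeare and swift (an assumption read off the query), and the `table` dictionary copies only the rows that appear in the exercise.

```python
from itertools import product

def original(c, m, h, w):
    return ((c or m) and not w) or ((not c) and (w or h))

def simplified(c, m, h, w):
    return (c or m or w or h) and (not w or not c)

# Compare both forms on all 16 truth assignments.
equivalent = all(
    original(*v) == simplified(*v)
    for v in product((False, True), repeat=4)
)

# Rows of the exercise table: document number -> authors it contains.
table = {
    0: set(),
    1: {"swift"},
    2: {"shakespeare"},
    3: {"shakespeare", "swift"},
    4: {"milton"},
    5: {"milton", "swift"},
    6: {"milton", "shakespeare"},
    8: {"chaucer"},
    9: {"chaucer", "swift"},
    10: {"chaucer", "shakespeare"},
}

def satisfies(authors):
    return simplified(
        "chaucer" in authors, "milton" in authors,
        "shakespeare" in authors, "swift" in authors,
    )

hits = [doc_id for doc_id, authors in table.items() if satisfies(authors)]
print(equivalent, hits)  # True [1, 2, 3, 4, 5, 6, 8, 10]
```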
Slide 11
Drawbacks of the Boolean Model
• Retrieval based on binary decision criteria with no notion of
partial matching
• No ranking of the documents is provided (absence of a
grading scale)
• Information need has to be translated into a Boolean
expression which most users find awkward
• The Boolean queries formulated by the users are most often
too simplistic
• As a consequence, the Boolean model frequently returns
either too few or too many documents in response to a user
query
Slide 12
• The question of how to extend the Boolean model to
accommodate partial matching and ranking has attracted
considerable attention in the past
• Two extensions of the Boolean model:
– Extended Boolean Model
– Fuzzy Set Model
Slide 13
Extended Boolean Model:
Slide 14
Extended Boolean Model
Slide 15
The Idea
Slide 16
The Idea:
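• qand = kx ∧ ky; wxj = x and wyj = y
[Figure: AND case. Unit square from (0,0) to (1,1) with kx on the horizontal axis (x = wxj) and ky on the vertical axis (y = wyj); documents dj and dj+1 plotted as points]
We want a document to be as close as possible to (1,1):
$\mathrm{sim}(q_{and}, d_j) = 1 - \sqrt{\dfrac{(1-x)^2 + (1-y)^2}{2}}$
Slide 17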
The Idea:
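• qor = kx ∨ ky; wxj = x and wyj = y
[Figure: OR case. Unit square from (0,0) to (1,1) with kx on the horizontal axis (x = wxj) and ky on the vertical axis (y = wyj); documents dj and dj+1 plotted as points]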
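The geometric idea for two terms can be sketched as follows. The AND formula is the one given on the slide; the OR formula does not survive in the text, so the code assumes the usual counterpart, the normalized distance from (0,0), which is also the p = 2, m = 2 case of the generalized formula. The sample points are hypothetical.

```python
import math

def sim_and(x, y):
    # 1 minus the normalized Euclidean distance from the ideal point (1,1).
    return 1 - math.sqrt(((1 - x) ** 2 + (1 - y) ** 2) / 2)

def sim_or(x, y):
    # Assumed OR counterpart: normalized distance from the worst point (0,0).
    return math.sqrt((x ** 2 + y ** 2) / 2)

# Hypothetical term weights (x = wxj, y = wyj) for a few documents.
for x, y in [(0.0, 0.0), (1.0, 0.0), (0.5, 0.5), (1.0, 1.0)]:
    print((x, y), round(sim_and(x, y), 3), round(sim_or(x, y), 3))
```

A document with only one of the two terms, such as (1, 0), scores about 0.29 for AND and about 0.71 for OR, instead of the 0 and 1 that the plain Boolean model would assign.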
Slide 19
Generalizing the Idea
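• A generalized disjunctive query is given by
– $q_{or} = k_1 \lor^p k_2 \lor^p \dots \lor^p k_m$
• A generalized conjunctive query is given by
– $q_{and} = k_1 \land^p k_2 \land^p \dots \land^p k_m$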
Slide 20
Generalizing the Idea
– $\mathrm{sim}(q_{or}, d_j) = \left( \dfrac{x_1^p + x_2^p + \dots + x_m^p}{m} \right)^{1/p}$
– $\mathrm{sim}(q_{and}, d_j) = 1 - \left( \dfrac{(1-x_1)^p + (1-x_2)^p + \dots + (1-x_m)^p}{m} \right)^{1/p}$
Slide 21
Properties:
• First, when p = 1 it can be verified that
$\mathrm{sim}(q_{or}, d_j) = \mathrm{sim}(q_{and}, d_j) = \dfrac{x_1 + \dots + x_m}{m}$
• Second, when p = ∞ it can be verified that
• sim(qor,dj) = max(xi)
• sim(qand,dj) = min(xi)
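Both limiting cases can be checked numerically. The sketch below implements the generalized p-norm similarities and treats p = ∞ as an explicit special case; the weight vector is hypothetical, and a large finite p (200 here) is used only to show the values approaching max and min.

```python
import math

def sim_or(xs, p):
    if math.isinf(p):
        return max(xs)  # limiting case p = infinity
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

def sim_and(xs, p):
    if math.isinf(p):
        return min(xs)  # limiting case p = infinity
    return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)

weights = [0.8, 0.1, 0.4]  # hypothetical term weights x1..xm

mean = sum(weights) / len(weights)
print(sim_or(weights, 1), sim_and(weights, 1), mean)   # all three coincide at p = 1
print(sim_or(weights, 200), sim_and(weights, 200))     # close to max and min
print(sim_or(weights, math.inf), sim_and(weights, math.inf))
```

With p = 1 the operators lose their Boolean character and both reduce to the average weight; as p grows the behaviour moves toward strict max/min evaluation.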
Slide 22
Example:
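For q = (k1 ∧^p k2) ∨^p k3:
$\mathrm{sim}(q, d) = \left( \dfrac{\left( 1 - \left( \dfrac{(1-x_1)^p + (1-x_2)^p}{2} \right)^{1/p} \right)^p + x_3^p}{2} \right)^{1/p}$
• Any Boolean query can be expressed as a numerical formula.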
Slide 23
Example:
Slide 24
Example:
$\mathrm{sim}(Q_2, D) = \left( \dfrac{\cdots}{2} \right)^{1/2} = 0.73$
Slide 25
Exercise:
Slide 26
Exercise:
q = (k1 or k2 or k3) and (not k4 or k5).
D = (x1, x2, x3, x4, x5) = (0.8, 0.1, 0.0, 0.0, 1.0).
Slide 27
Exercise:
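q = (k1 or k2 or k3) and (not k4 or k5).
D = (x1, x2, x3, x4, x5) = (0.8, 0.1, 0.0, 0.0, 1.0).
With p = 2:
$a = \sqrt{\dfrac{x_1^2 + x_2^2 + x_3^2}{3}} = \sqrt{\dfrac{0.65}{3}} \approx 0.47$
$b = \sqrt{\dfrac{(1-x_4)^2 + x_5^2}{2}} = \sqrt{\dfrac{2}{2}} = 1.00$
$\mathrm{sim}(Q, D) = 1 - \sqrt{\dfrac{(1-a)^2 + (1-b)^2}{2}} \approx 0.62$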
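The exercise can be evaluated directly from the generalized similarity formulas. This sketch assumes p = 2 (read off the squared terms in the solution) and models "not k4" as 1 - x4; the helper names `p_or` and `p_and` are illustrative.

```python
def p_or(xs, p=2):
    # Generalized disjunction: p-norm distance from the all-zero point.
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

def p_and(xs, p=2):
    # Generalized conjunction: 1 minus p-norm distance from the all-one point.
    return 1 - (sum((1 - x) ** p for x in xs) / len(xs)) ** (1 / p)

x1, x2, x3, x4, x5 = 0.8, 0.1, 0.0, 0.0, 1.0

a = p_or([x1, x2, x3])    # k1 or k2 or k3
b = p_or([1 - x4, x5])    # not k4 or k5, with negation taken as 1 - x4
sim = p_and([a, b])       # a and b

print(round(a, 2), round(b, 2), round(sim, 2))
```

Nested queries are evaluated inside out: each parenthesized group is reduced to a single score, and those scores become the inputs of the enclosing operator.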
Slide 28
Extended Boolean Model
Slide 29
Fuzzy Set Model
Slide 30
Fuzzy Set Theory
• Framework for representing classes whose boundaries are not
well defined
• Key idea is to introduce the notion of a degree of membership
associated with the elements of a set
• This degree of membership varies from 0 to 1 and allows
modeling the notion of marginal membership
• Thus, membership is now a gradual notion, contrary to the
notion enforced by classic Boolean logic
Slide 31
Fuzzy Set Theory
• Definition
– A fuzzy subset A of U is characterized by a membership function
μ(A,u) : U → [0,1]
which associates with each element u of U a number μ(A,u) in the interval
[0,1]
• Definition
– Let A and B be two fuzzy subsets of U. Also, let ¬A be the complement of
A. Then,
• μ(¬A,u) = 1 - μ(A,u)
• μ(A∪B,u) = max(μ(A,u), μ(B,u))
• μ(A∩B,u) = min(μ(A,u), μ(B,u))
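The three operations translate directly into code. In this minimal sketch a fuzzy subset is a dictionary from elements of U to membership degrees; the sets A and B and their values are hypothetical.

```python
# Hypothetical fuzzy subsets of U = {u1, u2, u3}: element -> degree of membership.
A = {"u1": 0.9, "u2": 0.4, "u3": 0.0}
B = {"u1": 0.2, "u2": 0.7, "u3": 0.5}

def complement(a):
    return {u: 1 - mu for u, mu in a.items()}

def union(a, b):
    return {u: max(a[u], b[u]) for u in a}

def intersection(a, b):
    return {u: min(a[u], b[u]) for u in a}

print(complement(A))
print(union(A, B))         # {'u1': 0.9, 'u2': 0.7, 'u3': 0.5}
print(intersection(A, B))  # {'u1': 0.2, 'u2': 0.4, 'u3': 0.0}
```

When every degree is restricted to 0 or 1, these definitions reduce to the ordinary set complement, union and intersection.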
Slide 32
Fuzzy Information Retrieval
• Fuzzy sets are modeled based on a thesaurus
• This thesaurus is built as follows:
– Let vec(c) be a term-term correlation matrix
– Let c(i,l) be a normalized correlation factor for (ki,kl):
$c(i,l) = \dfrac{n(i,l)}{n_i + n_l - n(i,l)}$
- ni: number of docs which contain ki
- nl: number of docs which contain kl
- n(i,l): number of docs which contain both ki and kl
Slide 33
Fuzzy Information Retrieval
• The correlation factor c(i,l) can be used to define fuzzy set
membership for a document dj as follows:
$\mu(i,j) = 1 - \prod_{k_l \in d_j} \left( 1 - c(i,l) \right)$
μ(i,j): membership of doc dj in the fuzzy subset associated with ki
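The thesaurus-based membership can be sketched on a toy collection. The three documents and their terms below are hypothetical; `correlation` computes c(i,l) from document counts and `membership` computes the degree to which a document belongs to the fuzzy set of a term.

```python
# Hypothetical collection: document id -> set of index terms.
docs = {
    "d1": {"retrieval", "information"},
    "d2": {"information", "computer"},
    "d3": {"retrieval", "ranking"},
}

def correlation(ki, kl):
    n_i = sum(ki in terms for terms in docs.values())
    n_l = sum(kl in terms for terms in docs.values())
    n_il = sum(ki in terms and kl in terms for terms in docs.values())
    denom = n_i + n_l - n_il
    return n_il / denom if denom else 0.0

def membership(ki, doc_id):
    # 1 minus the product of (1 - c(i,l)) over all terms kl in the document.
    prod = 1.0
    for kl in docs[doc_id]:
        prod *= 1 - correlation(ki, kl)
    return 1 - prod

print(correlation("information", "retrieval"))
print(membership("information", "d1"))
print(membership("information", "d3"))
```

Document d3 does not contain "information", yet it receives a non-zero membership (1/3 here) because it contains "retrieval", which co-occurs with "information" elsewhere in the collection.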
Slide 35
Fuzzy IR: An Example
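[Figure: sets Ka, Kb and Kc with the regions cc1, cc2 and cc3 marked]
• q = ka ∧ (kb ∨ ¬kc)
• vec(qdnf) = (1,1,1) + (1,1,0) + (1,0,0)
= vec(cc1) + vec(cc2) + vec(cc3)
$\mu(q,d_j) = \mu(cc_1 + cc_2 + cc_3, j)$
$= 1 - \prod_{i=1}^{3} \left( 1 - \mu(cc_i, j) \right)$
$= 1 - \left( 1 - \mu(a,j)\,\mu(b,j)\,\mu(c,j) \right) \times \left( 1 - \mu(a,j)\,\mu(b,j)\,(1-\mu(c,j)) \right) \times \left( 1 - \mu(a,j)\,(1-\mu(b,j))\,(1-\mu(c,j)) \right)$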
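The final expression can be evaluated for any document once its three term memberships are known. This sketch assumes the query q = ka ∧ (kb ∨ ¬kc) with conjunctive components (1,1,1), (1,1,0) and (1,0,0) as listed; the membership values passed in the last call are hypothetical.

```python
def query_membership(mu_a, mu_b, mu_c):
    # Membership in each conjunctive component of the DNF:
    # a product of mu for positive terms and (1 - mu) for negated terms.
    cc1 = mu_a * mu_b * mu_c              # (1,1,1)
    cc2 = mu_a * mu_b * (1 - mu_c)        # (1,1,0)
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)  # (1,0,0)
    # The disjunction of the components uses the algebraic sum.
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)

# Crisp 0/1 memberships give the ordinary Boolean answer.
print(query_membership(1, 0, 0), query_membership(1, 0, 1))  # 1 0

# Hypothetical graded memberships give a rank between 0 and 1.
print(query_membership(0.8, 0.5, 0.2))
```

With crisp inputs the result agrees with the plain Boolean evaluation of the same query, while graded inputs produce a score that can be used for ranking.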
Slide 36
Fuzzy Information Retrieval
Slide 37