
Information Retrieval

IR models: Boolean model

Dr. Bassel ALKHATIB


IR Models

User Task:
– Retrieval: Ad hoc, Filtering
– Browsing

Classic Models: Boolean, Vector, Probabilistic
Set Theoretic: Fuzzy, Extended Boolean
Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
Probabilistic: Inference Network, Belief Network
Structured Models: Non-Overlapping Lists, Proximal Nodes
Browsing: Flat, Structure Guided, Hypertext
Slide 2
The Boolean Model

• Simple model based on set theory

• Queries and documents specified as Boolean expressions

– precise semantics
– neat formalism
– Query = ka ∧ (kb ∨ ¬kc)

• Terms are either present or absent. Thus, wij ∈ {0,1}

Slide 3
The Boolean Model

• q = ka ∧ (kb ∨ ¬kc)

• sim(q,dj) = 1, if the document satisfies the Boolean query
              0, otherwise

– no in-between, only 0 or 1

Slide 4


Example
• q = ka ∧ (kb ∨ ¬kc)
• vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
• Disjunctive Normal Form

• Matching documents
• md1 = [ka kd ke] => (1,0,0)
• md2 = [ka kb kc] => (1,1,1)
• Unmatched documents
• ud1 = [ka kc] => (1,0,1)
• ud2 = [kd] => (0,0,0)
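The matching above can be sketched in a few lines: build a binary weight vector for each document and test it against the components of the query's DNF. The document and term names are the illustrative ones from this slide, not a real collection.

```python
def weights(doc_terms, index_terms):
    """Binary weight vector: 1 if the index term occurs in the document."""
    return tuple(1 if t in doc_terms else 0 for t in index_terms)

INDEX_TERMS = ["ka", "kb", "kc"]
# DNF of q = ka AND (kb OR NOT kc): (1,1,1) OR (1,1,0) OR (1,0,0)
QDNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}

def sim(doc_terms):
    """1 if the document's weight vector matches a DNF component, else 0."""
    return 1 if weights(doc_terms, INDEX_TERMS) in QDNF else 0

print(sim({"ka", "kd", "ke"}))  # md1 -> (1,0,0): matches, prints 1
print(sim({"ka", "kc"}))        # ud1 -> (1,0,1): no match, prints 0
```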

Slide 5
Boolean Model: Example

• Set of index terms = {k1, k2, k3, k4, k5}

• D1 = (0,1,1,0,0), D2 = (1,1,0,1,0), D3 = (1,1,1,1,0)
• Q1 = k1 ∧ k3 ∧ ¬k5
• The DNF of Q1 = (1,0,1,0,0) ∨ (1,1,1,0,0) ∨ (1,0,1,1,0) ∨ (1,1,1,1,0)
• D3 is relevant

Slide 6
Boolean Model: Example
Set of index terms = {k1, k2, k3, k4, k5}
D1 = (0,1,1,0,0), D2 = (1,1,0,1,0), D3 = (1,1,1,1,0)
Q = (k1 ∨ ¬k3) ∧ k5

Slide 7
Boolean Model: Example
Set of index terms = {k1, k2, k3, k4, k5}
D1 = (0,1,1,0,0), D2 = (1,1,0,1,0), D3 = (1,1,1,1,0)
Q = (k1 ∨ ¬k3) ∧ k5
The DNF of
Q = (1,0,1,0,1) ∨ (1,0,1,1,1) ∨ (1,1,1,0,1) ∨ (1,1,1,1,1)
  ∨ (1,0,0,0,1) ∨ (1,0,0,1,1) ∨ (1,1,0,0,1) ∨ (1,1,0,1,1)
  ∨ (0,0,0,0,1) ∨ (0,0,0,1,1) ∨ (0,1,0,0,1) ∨ (0,1,0,1,1)

Every DNF component requires k5 = 1, and no document contains k5: no document is relevant

Slide 8
Exercise

• D1 = “computer information retrieval”


• D2 = “computer retrieval”
• D3 = “information”
• D4 = “computer information”

• Q1 = “information ∧ retrieval”
• Q2 = “information ∧ ¬computer”

Slide 9
Exercise
0 (no authors)

1 Swift

2 Shakespeare

3 Shakespeare Swift

4 Milton

5 Milton Swift

6 Milton Shakespeare

7 Milton Shakespeare Swift

8 Chaucer

9 Chaucer Swift

10 Chaucer Shakespeare

11 Chaucer Shakespeare Swift


12 Chaucer Milton

13 Chaucer Milton Swift

14 Chaucer Milton Shakespeare

15 Chaucer Milton Shakespeare Swift

((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
Slide 10
Exercise
((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))

Term order: (c, m, h, w) = (chaucer, milton, shakespeare, swift)

( (c ∨ m) ∧ ¬w ) ∨ ( ¬c ∧ (w ∨ h) )

Equivalently: (c ∨ m ∨ w ∨ h) ∧ (¬w ∨ ¬c)

Q = (0,0,1,0) ∨ (0,1,0,0) ∨ (0,1,1,0)
  ∨ (0,0,0,1) ∨ (0,0,1,1) ∨ (0,1,0,1) ∨ (0,1,1,1)
  ∨ (1,0,0,0) ∨ (1,0,1,0) ∨ (1,1,0,0) ∨ (1,1,1,0)

Relevant documents = {1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14}

Non-relevant documents = {0, 9, 11, 13, 15}
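The exercise can be checked mechanically: the document numbering encodes authors as bits (Swift = bit 0, Shakespeare = bit 1, Milton = bit 2, Chaucer = bit 3), so the 16 documents and the query can be enumerated directly. A quick sketch (structure and names are mine, not from the slides):

```python
# Each document number n is decoded into its set of authors via its bits.
DOCS = {
    n: {author
        for bit, author in enumerate(["swift", "shakespeare", "milton", "chaucer"])
        if n & (1 << bit)}
    for n in range(16)
}

def matches(terms):
    """((chaucer OR milton) AND NOT swift) OR (NOT chaucer AND (swift OR shakespeare))"""
    c, m, h, w = ("chaucer" in terms, "milton" in terms,
                  "shakespeare" in terms, "swift" in terms)
    return ((c or m) and not w) or (not c and (w or h))

relevant = sorted(n for n, terms in DOCS.items() if matches(terms))
print(relevant)  # [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14]
```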

Slide 11
Drawbacks of the Boolean Model
• Retrieval based on binary decision criteria with no notion of
partial matching
• No ranking of the documents is provided (absence of a
grading scale)
• Information need has to be translated into a Boolean
expression which most users find awkward
• The Boolean queries formulated by the users are most often
too simplistic
• As a consequence, the Boolean model frequently returns
either too few or too many documents in response to a user
query
Slide 12
• The question of how to extend the Boolean model to
accommodate partial matching and ranking has attracted
considerable attention in the past
• Two extensions of the Boolean model:
– Extended Boolean Model
– Fuzzy Set Model

Slide 13
Extended Boolean Model:

• Disadvantages of the Boolean model:

Consider the query q = Kx AND Ky.

A document containing just one of the terms, e.g. Kx, is
considered as irrelevant as a document containing
neither of them.

• No term weights are used

Slide 14
Extended Boolean Model

• Boolean model is simple and elegant.


• But, no provision for a ranking
• As with the fuzzy model, a ranking can be obtained by
relaxing the condition on set membership
• Extend the Boolean model with the notions of partial
matching and term weighting

Slide 15
The Idea

• The extended Boolean model (introduced by Salton, Fox,
and Wu, 1983) is based on a critique of a basic
assumption in Boolean algebra

Slide 16
The Idea:
• qand = kx ∧ ky; wxj = x and wyj = y

[Figure: documents plotted in the (kx, ky) plane from (0,0) to (1,1); for AND, dj+1 lies closer to (1,1) than dj]

We want a document to be as close as possible to (1,1):

  sim(qand, dj) = 1 − sqrt( ((1−x)² + (1−y)²) / 2 )

Slide 17
The Idea:
• qor = kx ∨ ky; wxj = x and wyj = y

[Figure: for OR, dj+1 lies farther from (0,0) than dj]

We want a document to be as far as possible from (0,0):

  sim(qor, dj) = sqrt( (x² + y²) / 2 )
Slide 18
Remarks

• If the weights are all Boolean, i.e. in {0,1}, a document is always
positioned in one of the four corners (0,0), (0,1), (1,0)
and (1,1):
• sim(qor, d) ∈ {0, 1/√2, 1}
• sim(qand, d) ∈ {0, 1 − 1/√2, 1}

Slide 19
Generalizing the Idea

• We can extend the previous model to consider Euclidean
distances in a t-dimensional space (t terms).
• This can be done using p-norms, which extend the notion
of distance to include p-distances, where 1 ≤ p ≤ ∞ is a
new parameter
• A generalized disjunctive query is given by
– qor = k1 ∨p k2 ∨p … ∨p kt
• A generalized conjunctive query is given by
– qand = k1 ∧p k2 ∧p … ∧p kt
Slide 20
Generalizing the Idea

– sim(qor, dj) = ( (x1^p + x2^p + … + xm^p) / m )^(1/p)

– sim(qand, dj) = 1 − ( ((1−x1)^p + (1−x2)^p + … + (1−xm)^p) / m )^(1/p)

Slide 21
Properties:

• The p-norm as defined above enjoys a couple of
interesting properties. First, when p = 1 it can
be verified that

  sim(qor, dj) = sim(qand, dj) = (x1 + … + xm) / m

• Second, when p = ∞ it can be verified that
• sim(qor, dj) = max(xi)
• sim(qand, dj) = min(xi)
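The two p-norm similarities and the properties above can be sketched directly; the weight vector here is illustrative, and a large finite p stands in for p = ∞:

```python
import math

def sim_or(xs, p):
    """sim(q_or, d) = ((x1^p + ... + xm^p) / m)^(1/p)"""
    m = len(xs)
    return (sum(x ** p for x in xs) / m) ** (1 / p)

def sim_and(xs, p):
    """sim(q_and, d) = 1 - (((1-x1)^p + ... + (1-xm)^p) / m)^(1/p)"""
    m = len(xs)
    return 1 - (sum((1 - x) ** p for x in xs) / m) ** (1 / p)

xs = [0.9, 0.4, 0.1]
# p = 1: both similarities reduce to the average of the weights
print(round(sim_or(xs, 1), 4), round(sim_and(xs, 1), 4))        # 0.4667 0.4667
# large p approximates p = infinity: sim_or -> max(xs), sim_and -> min(xs)
print(round(sim_or(xs, 1000), 2), round(sim_and(xs, 1000), 2))  # 0.9 0.1
```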
Slide 22
Example:

• For instance, consider the query q = (k1 ∧p k2) ∨p k3. The
similarity sim(q,dj) between a document dj and this
query is then computed as

  sim(q, dj) = ( ( (1 − ( ((1−x1)^p + (1−x2)^p) / 2 )^(1/p) )^p + x3^p ) / 2 )^(1/p)

• Any Boolean query can be expressed as a numerical formula in this way.

Slide 23
Example:

• Q1 = k1 ∧ k2 ∧ k3

• Q2 = (k1 ∧ k2) ∨ k3
• D = (1, 0, 1)

Slide 24
Example:

• Q1 = k1 ∧ k2 ∧ k3
• Q2 = (k1 ∧ k2) ∨ k3
• D = (1, 0, 1), with p = 2

  sim(Q1, D) = 1 − sqrt( ((1−1)² + (1−0)² + (1−1)²) / 3 ) ≈ 0.42

  sim(Q2, D) = sqrt( ( (1 − sqrt( ((1−1)² + (1−0)²) / 2 ))² + 1² ) / 2 ) ≈ 0.74
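The worked example can be checked with plain arithmetic for p = 2 (the variable names are mine):

```python
import math

x1, x2, x3 = 1.0, 0.0, 1.0  # D = (1, 0, 1)

# Q1 = k1 AND k2 AND k3: pure conjunction over the three terms
sim_q1 = 1 - math.sqrt(((1 - x1) ** 2 + (1 - x2) ** 2 + (1 - x3) ** 2) / 3)

# Q2 = (k1 AND k2) OR k3: inner conjunction, then disjunction with k3
inner = 1 - math.sqrt(((1 - x1) ** 2 + (1 - x2) ** 2) / 2)
sim_q2 = math.sqrt((inner ** 2 + x3 ** 2) / 2)

print(round(sim_q1, 2), round(sim_q2, 2))  # 0.42 0.74
```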
Slide 25
Exercise:

1. Give the numerical formula in the extended Boolean
model for the query
q = (k1 or k2 or k3) and (not k4 or k5). (Assume that
there are 5 terms in total.)

2. Assume that the document is represented by
the vector (0.8, 0.1, 0.0, 0.0, 1.0).
What is sim(q, d) in the extended Boolean model?

Slide 26
Exercise:
q = (k1 or k2 or k3) and (not k4 or k5).
D = (x1, x2, x3, x4, x5) = (0.8, 0.1, 0.0, 0.0, 1.0).

Slide 27
Exercise:
q = (k1 or k2 or k3) and (not k4 or k5), with p = 2.
D = (x1, x2, x3, x4, x5) = (0.8, 0.1, 0.0, 0.0, 1.0).

  a = sim(k1 or k2 or k3) = sqrt( (x1² + x2² + x3²) / 3 ) = sqrt(0.65/3) ≈ 0.47

  b = sim(not k4 or k5) = sqrt( ((1−x4)² + x5²) / 2 ) = sqrt((1+1)/2) = 1

  sim(Q, D) = 1 − sqrt( ((1−a)² + (1−b)²) / 2 ) ≈ 0.62
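Recomputing the exercise directly from the p-norm formulas with p = 2 (a sketch; the weight of NOT k4 in the document is taken as 1 − x4):

```python
import math

x1, x2, x3, x4, x5 = 0.8, 0.1, 0.0, 0.0, 1.0

a = math.sqrt((x1 ** 2 + x2 ** 2 + x3 ** 2) / 3)        # (k1 or k2 or k3)
b = math.sqrt(((1 - x4) ** 2 + x5 ** 2) / 2)            # (not k4 or k5)
sim = 1 - math.sqrt(((1 - a) ** 2 + (1 - b) ** 2) / 2)  # AND of the two clauses

print(round(a, 2), round(b, 2), round(sim, 2))  # 0.47 1.0 0.62
```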

Slide 28
Extended Boolean Model

• Model is quite powerful

• Properties are interesting and might be useful
• Computation is somewhat complex
• However, distributivity does not hold for the
ranking computation:
– q1 = (k1 ∨ k2) ∧ k3
– q2 = (k1 ∧ k3) ∨ (k2 ∧ k3)
– sim(q1, dj) ≠ sim(q2, dj)

Slide 29
Fuzzy Set Model

• Queries and docs represented by sets of index terms:


matching is approximate from the start
• This vagueness can be modeled using a fuzzy framework, as
follows:
– with each term is associated a fuzzy set
– each doc has a degree of membership in this fuzzy set

• This interpretation provides the foundation for many models


for IR based on fuzzy theory
• Here we discuss the model proposed by Ogawa, Morita, and
Kobayashi (1991)

Slide 30
Fuzzy Set Theory
• Framework for representing classes whose boundaries are not
well defined
• Key idea is to introduce the notion of a degree of membership
associated with the elements of a set
• This degree of membership varies from 0 to 1 and allows
modeling the notion of marginal membership
• Thus, membership is now a gradual notion, contrary to the
notion enforced by classic Boolean logic

Slide 31
Fuzzy Set Theory
• Definition
– A fuzzy subset A of U is characterized by a membership function
  μ(A,u) : U → [0,1]
  which associates with each element u of U a number μ(u) in the interval [0,1]

• Definition
– Let A and B be two fuzzy subsets of U. Also, let ¬A be the complement of
  A. Then,
  • μ(¬A,u) = 1 − μ(A,u)
  • μ(A∪B,u) = max(μ(A,u), μ(B,u))
  • μ(A∩B,u) = min(μ(A,u), μ(B,u))

Slide 32
Fuzzy Information Retrieval
• Fuzzy sets are modeled based on a thesaurus
• This thesaurus is built as follows:
– Let c be a term-term correlation matrix
– Let c(i,l) be a normalized correlation factor for (ki, kl):

  c(i,l) = n(i,l) / (ni + nl − n(i,l))

– ni: number of docs which contain ki
– nl: number of docs which contain kl
– n(i,l): number of docs which contain both ki and kl

• We now have a notion of proximity among index terms.
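The correlation factor can be computed from a document-term incidence list; a small sketch using the four toy documents from the earlier exercise (slide 9):

```python
DOCS = [
    {"computer", "information", "retrieval"},  # D1
    {"computer", "retrieval"},                 # D2
    {"information"},                           # D3
    {"computer", "information"},               # D4
]

def c(ki, kl):
    """Normalized correlation c(i,l) = n(i,l) / (n_i + n_l - n(i,l))."""
    n_i = sum(1 for d in DOCS if ki in d)
    n_l = sum(1 for d in DOCS if kl in d)
    n_il = sum(1 for d in DOCS if ki in d and kl in d)
    return n_il / (n_i + n_l - n_il)

print(round(c("computer", "retrieval"), 3))     # 2/(3+2-2) = 2/3 -> 0.667
print(round(c("information", "retrieval"), 3))  # 1/(3+2-1) -> 0.25
```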

Slide 33
Fuzzy Information Retrieval
• The correlation factor c(i,l) can be used to define fuzzy set
membership for a document dj as follows:

  μ(i,j) = 1 − ∏_{kl ∈ dj} (1 − c(i,l))

  μ(i,j): membership of doc dj in the fuzzy subset associated with ki

• The above expression computes an algebraic sum over all
terms in the doc dj
• A doc dj belongs to the fuzzy set for ki if its own terms are
associated with ki
• If doc dj contains a term kl which is closely related to ki, we
have
– c(i,l) ~ 1
– μ(i,j) ~ 1
Slide 34
Fuzzy Information Retrieval
• Disjunctive set: algebraic sum

  μ(cc1 | cc2 | cc3, j) = 1 − ∏_{i=1..3} (1 − μ(cci, j))

• Conjunctive set: algebraic product

  μ(cc1 & cc2 & cc3, j) = ∏_{i=1..3} μ(cci, j)

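Both fuzzy combinators are one-liners over a list of membership degrees; the same algebraic sum also yields μ(i,j) when applied to the correlations c(i,l) of a document's terms. A sketch with illustrative degrees:

```python
from functools import reduce

def alg_sum(mus):
    """Disjunctive set: 1 - prod(1 - mu_i) (algebraic sum)."""
    return 1 - reduce(lambda acc, mu: acc * (1 - mu), mus, 1.0)

def alg_product(mus):
    """Conjunctive set: prod(mu_i) (algebraic product)."""
    return reduce(lambda acc, mu: acc * mu, mus, 1.0)

mus = [0.5, 0.4, 0.2]  # illustrative membership degrees for cc1, cc2, cc3
print(round(alg_sum(mus), 2))      # 1 - 0.5*0.6*0.8 -> 0.76
print(round(alg_product(mus), 2))  # 0.5*0.4*0.2 -> 0.04
```

Note that the algebraic sum dominates the max used for plain fuzzy union, so a document related to several components scores higher than one related to a single component.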
Slide 35
Fuzzy IR: An Example

[Figure: Venn diagram of Ka, Kb, Kc; cc1, cc2, cc3 are the three conjunctive components of the query]

• q = ka ∧ (kb ∨ ¬kc)
• vec(qdnf) = (1,1,1) + (1,1,0) + (1,0,0)
            = vec(cc1) + vec(cc2) + vec(cc3)

  μ(q,dj) = μ(cc1 + cc2 + cc3, j)
          = 1 − ∏_{i=1..3} (1 − μ(cci, j))
          = 1 − (1 − μ(a,j) μ(b,j) μ(c,j)) ×
                (1 − μ(a,j) μ(b,j) (1 − μ(c,j))) ×
                (1 − μ(a,j) (1 − μ(b,j)) (1 − μ(c,j)))
Slide 36
Fuzzy Information Retrieval

• Fuzzy IR models have been discussed mainly in the literature


associated with fuzzy theory
• Experiments with standard test collections are not available
• Difficult to compare at this time

Slide 37
