Information Retrieval
Retrieval Models:
Boolean and Vector Space
Jamie Callan
Carnegie Mellon University
callan@cs.cmu.edu
Lecture Outline
Introduction to retrieval models
Exact-match retrieval
© 2003, Jamie Callan
Basic IR Processes

[Diagram: an information need is represented as a query; documents are represented as indexed objects; comparison of the query against the indexed objects produces retrieved objects, which feed evaluation/feedback back into the query.]
Boolean
Vector space
Boolean: Exact Match
Vector space: Best Match
  Basic vector space
  Extended Boolean model
  Latent Semantic Indexing (LSI)
Probabilistic models: Best Match
  Basic probabilistic model
  Bayesian inference networks
  Language models
Citation analysis models
  Hubs & authorities (Kleinberg, IBM Clever): Best Match
  Page rank (Google): Exact Match
Unranked Boolean:
Example
A typical Boolean query
(distributed NEAR/1 information NEAR/1 retrieval) OR
((collection OR database OR resource) NEAR/1 selection) OR
(multi NEAR/1 database NEAR/1 search) OR
((FIELD author callan) AND CORI) OR
((FIELD author hawking) AND NOT stephen)
Studies show that people are not good at creating Boolean queries
People overestimate the quality of the queries they create
Queries are too strict: Few relevant documents found
Queries are too loose: Too many documents found (few relevant)
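Exact-match Boolean evaluation can be sketched as set operations over an inverted index. A minimal sketch; the documents and the AND/OR/NOT helpers below are illustrative, and proximity operators such as NEAR would additionally require a positional index:

```python
from collections import defaultdict

# Invented toy collection for illustration.
docs = {
    1: "distributed information retrieval with database selection",
    2: "stephen hawking writes about black holes",
    3: "resource selection for multi database search",
}

# Build an inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

all_docs = set(docs)

def AND(a, b): return a & b          # intersection
def OR(a, b):  return a | b          # union
def NOT(a):    return all_docs - a   # complement

# ((collection OR database OR resource) AND selection), evaluated with sets:
hits = AND(OR(OR(index["collection"], index["database"]), index["resource"]),
           index["selection"])
print(sorted(hits))   # -> [1, 3]
```

Every document either matches or does not; the result is an unranked set, which is exactly why loose queries return unmanageably many documents.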
Unranked Boolean:
WESTLAW
Large commercial system
Serves legal and professional markets
Unranked Boolean:
WESTLAW
Boolean operators
Proximity operators
Unranked Boolean:
WESTLAW
Queries are typically developed incrementally
Unranked Boolean:
WESTLAW
Unranked Boolean:
Summary
Advantages:
Very efficient
Predictable, easy to explain
Structured queries
Works well when the searcher knows exactly what is wanted
Disadvantages:
Most people find it difficult to create good Boolean queries
Difficulty increases with size of collection
Precision and recall usually have strong inverse correlation
Predictability of results causes people to overestimate recall
Documents that are close matches, but not exact matches, are not retrieved
Term Weights:
A Brief Introduction
The words of a text are not equally indicative of its meaning
Most scientists think that butterflies use the position of the sun
in the sky as a kind of compass that allows them to determine
which way is north. Scientists think that butterflies may use other
cues, such as the earth's magnetic field, but we have a lot to learn
about monarchs' sense of direction.
Important: Butterflies, monarchs, scientists, direction, compass
Unimportant: Most, think, kind, sky, determine, cues, learn
Term weights reflect the (estimated) importance of each term
There are many variations on how they are calculated
Historically, tf (term frequency) weights were used
The standard approach for many IR systems is tf.idf weights
A document D_j is represented as a vector of term weights:

  D_j = (t_1,j, t_2,j, ..., t_n,j)

where t_i,j is the weight of term i in document j, e.g. a binary value or a term frequency (tf_i,j)
A document is represented as a vector of binary values
One dimension per term in the corpus vocabulary

          Term1  Term2  Term3  Term4  ...  Termn
  Doc1      1     ...
  Doc2      0     ...
  Doc3      1     ...
   :        :

An unstructured query can also be represented as a vector

  Query     0      0      1      0    ...    1

Linear algebra is used to determine which vectors are similar
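The binary representation can be sketched directly; the vocabulary and texts below are invented for illustration:

```python
# Invented five-term vocabulary.
vocab = ["term1", "term2", "term3", "term4", "term5"]

def to_binary_vector(text):
    """One dimension per vocabulary term; 1 if the term occurs, else 0."""
    terms = set(text.split())
    return [1 if t in terms else 0 for t in vocab]

doc = to_binary_vector("term1 term4 term4 term5")    # -> [1, 0, 0, 1, 1]
query = to_binary_vector("term3 term5")              # -> [0, 0, 1, 0, 1]

# The inner product counts dimensions where both vectors are nonzero.
overlap = sum(d * q for d, q in zip(doc, query))     # -> 1
```

Repeated occurrences of a term are lost in the binary representation, which motivates the weighted vectors discussed next.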
[Figure: basis vectors for 2 dimensions and basis vectors for 3 dimensions]
Term weights estimate each term's importance, or representativeness
The vector space model does not specify how to set term weights
Some common elements:
  Document term weight: Importance of the term in this document
  Collection term weight: Importance of the term in this collection
  Length normalization: Compensate for documents of different lengths
Naming convention for term weight functions: DCL.DCL
  First triple is document term weights, second triple is query term weights
  D = document term weight, C = collection term weight, L = length normalization
  n = none (no weighting on that factor)
  Example: lnc.ltc
Cosine similarity with lnc.ltc weighting:

  sim(D, Q) = (Σ_i d_i * q_i) / (sqrt(Σ_i d_i^2) * sqrt(Σ_i q_i^2))

where, for lnc.ltc:

  d_i = log(tf_i) + 1
  q_i = (log(qtf_i) + 1) * log(N / df_i)

(dividing by the two vector lengths is the 'c' cosine normalization)
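Under stated assumptions (a toy three-document corpus, whitespace tokenization), lnc.ltc scoring can be sketched as:

```python
import math
from collections import Counter

# Invented toy corpus: documents weighted lnc (log tf, no idf, cosine norm),
# queries weighted ltc (log tf, idf = log(N/df), cosine norm).
docs = ["a b a c", "b d d", "a d"]
N = len(docs)
df = Counter(term for d in docs for term in set(d.split()))

def l_weight(tf):
    """'l': sublinear tf, i.e. 1 + log(tf), 0 if tf = 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0

def cosine_normalize(w):
    """'c': divide by the vector's Euclidean length."""
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}

def lnc(doc):
    tf = Counter(doc.split())
    return cosine_normalize({t: l_weight(f) for t, f in tf.items()})

def ltc(query):
    tf = Counter(query.split())
    return cosine_normalize({t: l_weight(f) * math.log(N / df[t])
                             for t, f in tf.items() if df[t] > 0})

def sim(doc_vec, query_vec):
    # Both vectors are already normalized, so the dot product is the cosine.
    return sum(w * query_vec.get(t, 0.0) for t, w in doc_vec.items())

scores = sorted(((sim(lnc(d), ltc("a d")), d) for d in docs), reverse=True)
```

The document identical to the query ("a d") scores 1.0, since cosine similarity of two normalized vectors with the same direction is 1.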
Term Weights:
A Brief Introduction
Inverse document frequency (idf):
Terms that occur in many documents in the collection are less useful
for discriminating among documents
Document frequency (df): number of documents containing the term
idf often calculated as:

  idf = log(N / df) + 1

(N = number of documents in the collection)
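A quick sketch of this idf formula, with invented collection statistics:

```python
import math

# Invented counts: N documents, df = documents containing each term.
N = 1000
df = {"the": 950, "retrieval": 40, "hubs": 3}

# idf = log(N / df) + 1, as above.
idf = {t: math.log(N / d) + 1 for t, d in df.items()}

# Rare terms get high idf; terms in nearly every document get idf near 1,
# so they contribute little to discriminating among documents.
```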
[Figure: pivoted document length normalization; the new normalization factor pivots about the average of the old normalization]
d: 1 + ln (1 + ln (tf)),
(0 if tf = 0)
t: log ((N+1)/df)
b: 1 / (0.8 + 0.2 * doclen / avg_doclen)
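The three components listed above can be sketched as functions (the names d_weight, t_weight, and b_weight are mine, not from the slide; the example values are invented):

```python
import math

def d_weight(tf):
    """d: 1 + ln(1 + ln(tf)), 0 if tf = 0."""
    return 1 + math.log(1 + math.log(tf)) if tf > 0 else 0.0

def t_weight(N, df):
    """t: log((N + 1) / df)."""
    return math.log((N + 1) / df)

def b_weight(doclen, avg_doclen):
    """b: pivoted length normalization, 1 / (0.8 + 0.2 * doclen / avg_doclen)."""
    return 1 / (0.8 + 0.2 * doclen / avg_doclen)

def dtb(tf, N, df, doclen, avg_doclen):
    """Full dtb term weight: the product of the three components."""
    return d_weight(tf) * t_weight(N, df) * b_weight(doclen, avg_doclen)
```

Note that b is 1 for an average-length document, below 1 for longer documents, and above 1 for shorter ones, which is the pivoting behavior shown in the figure.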
What next?
A document is represented as a vector of tf.idf values
One dimension per term in the corpus vocabulary

  TermID    T1     T2     T3     T4    ...    Tn
  Doc 1    0.33   0.00   0.00   0.47   ...   0.53
  Doc 2    0.00   0.64   0.53   0.00   ...   0.00
  Doc 3    0.43   0.00   0.25   0.00   ...   0.00
    :        :

An unstructured query can also be represented as a vector

  Q        0.00   0.00   1.00   0.00   ...   2.00

Linear algebra is used to determine which vectors are similar
[Figure: the Query, Doc1, and Doc2 plotted as vectors in a space with axes Term1, Term2, Term3]

Similarity is inversely related to the angle between the vectors.
Doc2 is the most similar to the Query.
Rank the documents by their similarity to the Query.
Inner product:
  sim(X, Y) = Σ_i x_i * y_i
  (for binary vectors this is |X ∩ Y|, the # of shared nonzero dimensions)

Dice coefficient (length-normalized inner product):
  sim(X, Y) = 2|X ∩ Y| / (|X| + |Y|)  =  (2 Σ_i x_i y_i) / (Σ_i x_i^2 + Σ_i y_i^2)

Cosine coefficient:
  sim(X, Y) = |X ∩ Y| / (sqrt(|X|) * sqrt(|Y|))  =  (Σ_i x_i y_i) / (sqrt(Σ_i x_i^2) * sqrt(Σ_i y_i^2))

Jaccard coefficient (like Dice, but penalizes small overlap more):
  sim(X, Y) = |X ∩ Y| / (|X| + |Y| - |X ∩ Y|)  =  (Σ_i x_i y_i) / (Σ_i x_i^2 + Σ_i y_i^2 - Σ_i x_i y_i)
Example: compute Sim(D1, Q) and Sim(D2, Q)

Term Weights
  Doc 1:  0.3  0.1  0.4
  Doc 2:  0.8  0.5  0.6
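The weighted forms of these measures are easy to sketch; Doc 1 and Doc 2 reuse the term weights from the example above, while the query weights here are invented for illustration:

```python
import math

def inner(x, y):
    """Inner product: sum of pairwise products."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    return inner(x, y) / math.sqrt(inner(x, x) * inner(y, y))

def dice(x, y):
    return 2 * inner(x, y) / (inner(x, x) + inner(y, y))

def jaccard(x, y):
    return inner(x, y) / (inner(x, x) + inner(y, y) - inner(x, y))

d1 = [0.3, 0.1, 0.4]   # Doc 1 term weights from the slide
d2 = [0.8, 0.5, 0.6]   # Doc 2 term weights from the slide
q  = [1.0, 1.0, 1.0]   # hypothetical query weights (not from the slide)

# Under this query, Doc 2 gets the higher cosine score.
print(cosine(d1, q), cosine(d2, q))
```

All four measures are maximal when a vector is compared with itself, which is a quick sanity check for any implementation.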
The p-norm extended Boolean operators:

  Sim_or(Q, D)  = [ (Σ_i q_i^p * d_i^p) / (Σ_i q_i^p) ]^(1/p)

  Sim_and(Q, D) = 1 - [ (Σ_i q_i^p * (1 - d_i)^p) / (Σ_i q_i^p) ]^(1/p)

  Sim_not(Q, D) = 1 - Sim(Q, D)
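A sketch of these p-norm operators (the document and query weights below are invented); at p = 1 both reduce to a normalized inner product, and as p grows they approach strict Boolean min/max behavior:

```python
# Term weights d_i and query weights q_i are assumed to lie in [0, 1]; p >= 1.

def sim_or(q, d, p):
    num = sum(qi**p * di**p for qi, di in zip(q, d))
    den = sum(qi**p for qi in q)
    return (num / den) ** (1 / p)

def sim_and(q, d, p):
    num = sum(qi**p * (1 - di)**p for qi, di in zip(q, d))
    den = sum(qi**p for qi in q)
    return 1 - (num / den) ** (1 / p)

def sim_not(sim):
    return 1 - sim

# Invented example: one term matches strongly, the other weakly.
d = [0.8, 0.2]
q = [1.0, 1.0]
print(sim_and(q, d, p=2), sim_or(q, d, p=2))
```

A document matching every query term perfectly scores 1 under both operators, while partial matches score between 0 and 1 instead of being rejected outright, which is the point of softening Boolean semantics.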
The vector space model is the most popular retrieval model (today)
G. Salton (editor). The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall. 1971.
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill. 1983.
G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of
Information by Computer. Addison-Wesley. 1989.
A. Singhal. AT&T at TREC-6. In The Sixth Text Retrieval Conference (TREC-6). National
Institute of Standards and Technology special publication 500-240.
A. Singhal, J. Choi, D. Hindle, D.D. Lewis, and F. Pereira. AT&T at TREC-7. In The Seventh
Text Retrieval Conference (TREC-7). National Institute of Standards and Technology special
publication 500-242.
A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. SIGIR-96.
S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent
semantic analysis. Journal of the American Society for Information Science, 41, pp. 391-407. 1990.
E. A. Fox. Extending the Boolean and Vector Space models of Information Retrieval with P-Norm
queries and multiple concept types. PhD thesis, Cornell University. 1983.