Retrieval Models: Boolean and Vector Space


11-741:

Information Retrieval

Retrieval Models:
Boolean and Vector Space
Jamie Callan
Carnegie Mellon University
callan@cs.cmu.edu

Lecture Outline
Introduction to retrieval models
Exact-match retrieval

Unranked Boolean retrieval model


Ranked Boolean retrieval model
Best-match retrieval
Vector space retrieval model
Term weights
Boolean query operators (p-norm)

© 2003, Jamie Callan

What is a Retrieval Model?

A model is an abstract representation of a process or object

Used to study properties, draw conclusions, make predictions


The quality of the conclusions depends upon how closely the
model represents reality
A retrieval model describes the human and computational
processes involved in ad-hoc retrieval
Example: A model of human information seeking behavior
Example: A model of how documents are ranked
computationally
Components: Users, information needs, queries, documents, relevance assessments, ...
Retrieval models define relevance, explicitly or implicitly

Basic IR Processes
Information Need

Document

Representation

Representation

Query

Indexed Objects
Comparison

Evaluation/Feedback

Retrieved Objects

Overview of Major Retrieval Models

Boolean
Vector space

Basic vector space


Extended Boolean model
Probabilistic models
Basic probabilistic model
Bayesian inference networks
Language models
Citation analysis models
Hubs & authorities (Kleinberg, IBM Clever)
Page rank (Google)

Types of Retrieval Models:


Exact Match vs. Best Match Retrieval
Exact Match
Query specifies precise retrieval criteria
Every document either matches or fails to match query
Result is a set of documents
Usually in no particular order
Often in reverse-chronological order
Best Match
Query describes retrieval criteria for desired documents
Every document matches a query to some degree
Result is a ranked list of documents, best first

Overview of Major Retrieval Models

Boolean: Exact Match
Vector space: Best Match
Basic vector space
Extended Boolean model
Latent Semantic Indexing (LSI)
Probabilistic models: Best Match
Basic probabilistic model
Bayesian inference networks
Language models
Citation analysis models
Hubs & authorities (Kleinberg, IBM Clever): Best Match
Page rank (Google): Exact Match

Exact Match vs. Best Match Retrieval

Best-match models are usually more accurate / effective


Good documents appear at the top of the rankings
Good documents often don't exactly match the query
Query was too strict
Multimedia AND NOT (Apple OR IBM OR Microsoft)

Document didn't match user expectations


"multi database retrieval" vs. "multi db retrieval"

Exact match still prevalent in some markets

Large installed base that doesn't want to migrate to new software


Efficiency: Exact match is very fast, good for high volume tasks
Sufficient for some tasks
Web advanced search

Retrieval Model #1:


Unranked Boolean
Unranked Boolean is the most common Exact Match model
Model
Retrieve documents iff they satisfy a Boolean expression
Query specifies precise relevance criteria
Documents returned in no particular order
Operators:
Logical operators: AND, OR, AND NOT
Unconstrained NOT is expensive, so often not included
Distance operators: Near/1, Sentence/1, Paragraph/1, ...
String matching operators: Wildcard
Field operators: Date, Author, Title
Note: The unranked Boolean model is not the same thing as Boolean queries

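As a sketch of the model above, Boolean operators can be evaluated directly over posting sets: AND is intersection, OR is union, AND NOT is difference. The postings lists below are invented for illustration.

```python
# Unranked Boolean sketch: each term maps to the set of documents that
# contain it; AND / OR / AND NOT become set intersection, union, and
# difference. The postings lists are invented for illustration.

postings = {
    "multimedia": {1, 2, 5},
    "apple": {2, 7},
    "ibm": {3, 5},
}

def docs(term):
    """Posting set for a term (empty if the term is unindexed)."""
    return postings.get(term, set())

# multimedia AND NOT (apple OR ibm)
result = docs("multimedia") - (docs("apple") | docs("ibm"))
# The result is an unordered set of matching documents, not a ranking.
```

Note that nothing here scores documents: a document is either in the result set or not, which is exactly the Exact Match property discussed above.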

Unranked Boolean:
Example
A typical Boolean query
(distributed NEAR/1 information NEAR/1 retrieval) OR
((collection OR database OR resource) NEAR/1 selection) OR
(multi NEAR/1 database NEAR/1 search) OR
((FIELD author callan) AND CORI) OR
((FIELD author hawking) AND NOT stephen)
Studies show that people are not good at creating Boolean queries
People overestimate the quality of the queries they create
Queries are too strict: Few relevant documents found
Queries are too loose: Too many documents found (few relevant)

Unranked Boolean:
WESTLAW
Large commercial system
Serves legal and professional markets

Legal: Court cases, statutes, regulations, ...


News: Newspapers, magazines, journals, ...
Financial: Stock quotes, SEC materials, financial analyses
Total collection size: 5-7 Terabytes
700,000 users
In operation since 1974
Best-match and free text queries added in 1992

Unranked Boolean:
WESTLAW
Boolean operators
Proximity operators

Phrases: "Carnegie Mellon"


Word proximity: Language /3 Technologies
Same sentence (/s) or paragraph (/p): Microsoft /s anti trust
Restrictions: Date (After 1996 & Before 1999)
Query expansion:
Wildcard and truncation: Al*n Lev!
Automatic expansion of plurals and possessives
Document structure (fields): Title (gGlOSS)
Citations: Cites (Callan) & Date (After 1996)

Unranked Boolean:
WESTLAW
Queries are typically developed incrementally

Implicit relevance feedback


V1: machine AND learning
V2: (machine AND learning) OR (neural AND networks) OR
(decision AND tree)
V3: (machine AND learning) OR (neural AND networks) OR
(decision AND tree) AND C4.5 OR Ripper OR EG OR EM
Queries are complex
Proximity operators used often
NOT is rare
Queries are long (9-10 words, on average)

Unranked Boolean:
WESTLAW

Two stopword lists: 1 word for indexing, 50 words for queries


Legal thesaurus: To help people expand queries
Document ordering: Determined by user (often date)
Query term highlighting: Help people see how document matched
Performance:
Response time controlled by varying amount of parallelism
Average response time set to 7 seconds
Even for collections larger than 100 GB

Unranked Boolean:
Summary
Advantages:
Very efficient
Predictable, easy to explain
Structured queries
Works well when the searcher knows exactly what is wanted
Disadvantages:
Most people find it difficult to create good Boolean queries
Difficulty increases with size of collection
Precision and recall usually have strong inverse correlation
Predictability of results causes people to overestimate recall
Documents that are "close" to matching are not retrieved

Term Weights:
A Brief Introduction
The words of a text are not equally indicative of its meaning

Most scientists think that butterflies use the position of the sun
in the sky as a kind of compass that allows them to determine
which way is north. Scientists think that butterflies may use other
cues, such as the earth's magnetic field, but we have a lot to learn
about monarchs' sense of direction.
Important: Butterflies, monarchs, scientists, direction, compass
Unimportant: Most, think, kind, sky, determine, cues, learn
Term weights reflect the (estimated) importance of each term
There are many variations on how they are calculated
Historically it was tf weights
The standard approach for many IR systems is tf.idf weights

Retrieval Model #2:


Ranked Boolean
Ranked Boolean is another common Exact Match retrieval model
Model

Retrieve documents iff they satisfy a Boolean expression


Query specifies precise relevance criteria
Documents returned ranked by frequency of query terms
Operators:
Logical operators: AND, OR, AND NOT
Unconstrained NOT is expensive, so often not included
Distance operators: Proximity
String matching operators: Wildcard
Field operators: Date, Author, Title

Exact Match Retrieval:


Ranked Boolean
How document scores are calculated:
Term weight: How often term i occurs in document j (tf_i,j), or a tf_i,j * idf_i weight
AND weight: Minimum of argument weights
AND(t_1, t_2, ..., t_n) = Min(t_1,j , t_2,j , ..., t_n,j)
OR weight: Maximum of argument weights, or sum of argument weights
OR(t_1, t_2, ..., t_n) = Max(t_1,j , t_2,j , ..., t_n,j), or Sum_i t_i,j
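The Min/Max scoring rules above can be sketched as a small recursive evaluator. Raw tf values serve as term weights here; the documents and the query tree are invented for illustration.

```python
# Ranked Boolean sketch: AND = min of argument weights, OR = max of
# argument weights. Term weights are raw tf values; the term counts
# below are invented for illustration.

tf = {  # tf[doc][term]
    "d1": {"machine": 3, "learning": 2},
    "d2": {"machine": 1, "neural": 4, "networks": 4},
}

def score(node, doc):
    """Score a query tree: ("AND", ...), ("OR", ...), or a bare term."""
    if isinstance(node, str):                 # leaf: term weight = tf
        return tf.get(doc, {}).get(node, 0)
    op, *args = node
    weights = [score(a, doc) for a in args]
    return min(weights) if op == "AND" else max(weights)

query = ("OR", ("AND", "machine", "learning"),
               ("AND", "neural", "networks"))
# Unlike unranked Boolean, the result is a ranked list, best first
ranked = sorted(tf, key=lambda d: score(query, d), reverse=True)
```

Documents that satisfy the query more redundantly (higher tf on the binding terms) rise to the top, which is the advantage claimed on the next slide.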

Exact Match Retrieval:


Ranked Boolean
Advantages:
All of the advantages of the unranked Boolean model
Very efficient, predictable, easy to explain, structured queries,
works well when searchers know exactly what is wanted
Result set is ordered by how redundantly a document satisfies a query
Usually enables a person to find relevant documents more quickly
Other term weighting methods can be used, too
Example: tf, tf.idf, ...

Exact Match Retrieval:


Ranked Boolean
Disadvantages:
It's still an Exact Match model
Good Boolean queries hard to create
Difficulty increases with size of collection
Precision and recall usually have strong inverse correlation
Predictability of results causes people to overestimate recall
The returned documents match expectations,
so it is easy to forget that many relevant documents are missed
Documents that are "close" to matching are not retrieved

Are Boolean Retrieval Models Still Relevant?

Many people prefer Boolean

Professional searchers (e.g., librarians, paralegals)


Some Web surfers (e.g., Advanced Search feature)
About 80% of WESTLAW & Dialog searches are Boolean
What do they like? Control, predictability, understandability
Boolean and free-text queries find different documents
Solution: Retrieval models that support free-text and Boolean queries
Recall that almost any retrieval model can be Exact Match
Extended Boolean (vector space) retrieval model
Bayesian inference networks

Retrieval Model #3:


Vector Space Model
Retrieval Model:
Any text object can be represented by a term vector
Examples: Documents, queries, sentences, ...
Similarity is determined by distance in a vector space
Example: The cosine of the angle between the vectors
The SMART system:
Developed at Cornell University, 1960-1999
Still used widely

How Different Retrieval Models


View Ad-hoc Retrieval
Boolean
Query: A set of first-order logic (FOL) conditions a document must satisfy
Retrieval: Deductive inference
Vector space
Query: A short document
Retrieval: Finding similar objects

Vector Space Retrieval Model:


Introduction
How are documents represented in the binary vector model?

        Term1  Term2  Term3  Term4  ...  Termn
Doc1      1     ...
Doc2      0     ...
Doc3      1     ...
 :        :

A document is represented as a vector of binary values
One dimension per term in the corpus vocabulary
An unstructured query can also be represented as a vector

Query     0      0      1      0    ...    1

Linear algebra is used to determine which vectors are similar

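The binary representation above can be sketched in a few lines; the five-term vocabulary and the texts are invented for illustration.

```python
# Binary vector sketch: one dimension per vocabulary term; a document's
# vector holds 1 where the term occurs, 0 elsewhere. The vocabulary and
# texts are invented for illustration.

vocab = ["boolean", "vector", "space", "model", "query"]

def to_vector(text):
    words = set(text.lower().split())
    return [1 if t in words else 0 for t in vocab]

doc = to_vector("the vector space model")
query = to_vector("vector model")

# The inner product counts shared terms: a first, crude similarity
overlap = sum(d * q for d, q in zip(doc, query))
```

Terms outside the vocabulary (like "the" here) simply fall off the vector, which is one way stopword removal shows up in this representation.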

Vector Space Representation:


Linear Algebra Review
Formally, a vector space is defined by a set of linearly independent basis vectors.


Basis vectors:
correspond to the dimensions or directions in the vector space;
determine what can be described in the vector space; and
must be orthogonal, or linearly independent, i.e., a value along
one dimension implies nothing about a value along another.

[Figure: basis vectors for 2 dimensions and for 3 dimensions]

Vector Space Representation

What should be the basis vectors for information retrieval?


Basic concepts:
Difficult to determine (Philosophy? Cognitive science?)
Orthogonal (by definition)
A relatively static vector space (there are no new ideas)
Terms (Words, Word stems):
Easy to determine
Not really orthogonal (orthogonal enough?)
A constantly growing vector space
new vocabulary creates new dimensions

Vector Space Representation:


Vector Coefficients
The coefficients (vector elements, term weights) represent term presence, importance, or representativeness
The vector space model does not specify how to set term weights
Some common elements:
Document term weight: Importance of the term in this document
Collection term weight: Importance of the term in this collection
Length normalization: Compensate for documents of different lengths
Naming convention for term weight functions: DCL.DCL
First triple is document term weights, second triple is query term weights
n=none (no weighting on that factor)
Example: lnc.ltc

Vector Space Representation:


Term Weights
There have been many vector-space term weight functions
lnc.ltc: A very popular and commonly used choice
l: log(tf) + 1
n: No weight / normalization (i.e., 1.0)
t: log(N / df)
c: cosine normalization (divide by the vector's length)
For example:

sim(d, q) = Sum_i d_i * q_i / ( sqrt(Sum_i d_i^2) * sqrt(Sum_i q_i^2) )

where d_i = log(tf_i) + 1 and q_i = (log(qtf_i) + 1) * log(N / df_i)
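The lnc.ltc scheme can be sketched as follows. Natural log is assumed here, and the collection statistics (N and the df table) are invented for illustration.

```python
import math

# lnc.ltc sketch: document vectors use l (log tf + 1), n (no collection
# weight), c (cosine normalization); query vectors use l, t (log N/df),
# c. Natural log is assumed; N and df are invented statistics.

N = 1000                                   # collection size (assumed)
df = {"retrieval": 50, "boolean": 200}     # document frequencies (assumed)

def cosine_normalize(w):
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()}

def lnc(tf_counts):
    return cosine_normalize(
        {t: math.log(tf) + 1 for t, tf in tf_counts.items()})

def ltc(tf_counts):
    return cosine_normalize(
        {t: (math.log(tf) + 1) * math.log(N / df[t])
         for t, tf in tf_counts.items()})

def sim(d, q):
    # Both vectors are unit length, so the dot product is the cosine
    return sum(w * q.get(t, 0.0) for t, w in d.items())

d = lnc({"retrieval": 3, "boolean": 1})
q = ltc({"retrieval": 1})
```

Because both sides are cosine-normalized, the final similarity is just a dot product over the terms they share.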

Term Weights:
A Brief Introduction
Inverse document frequency (idf):
Terms that occur in many documents in the collection are less useful
for discriminating among documents
Document frequency (df): number of documents containing the term
idf is often calculated as: idf = log(N / df) + 1
idf is essentially -log(p) with p = df/N, i.e., the information content of observing the term

Vector Space Representation:


Term Weights
Cosine normalization favors short documents
Pivoting helps, but shifts the bias towards long documents

[Figure: normalization factor vs. document length, comparing the old (cosine) normalization, its average, and pivoted slopes]

Empirically, cosine normalization approximates the number of unique terms
lnu and Lnu document weighting:
u: pivoted unique normalization
u = 1 / (0.8 + 0.2 * (num unique terms / avg num unique terms))
L: compensate for higher tfs in long documents
L = (log(tf) + 1) / (log(avg tf in doc) + 1)
(Singhal, 1996; Singhal, 1998)

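The "u" and "L" factors above can be sketched directly; the slope 0.2 and pivot 0.8 come from the slide, while the document statistics are invented for illustration.

```python
import math

# Sketch of the lnu/Lnu factors above. "u" replaces cosine normalization
# with a pivot on the number of unique terms (slope 0.2, pivot 0.8);
# "L" damps tf in long documents. Document statistics are invented.

def u_norm(num_unique, avg_unique, slope=0.2):
    return 1.0 / ((1.0 - slope) + slope * (num_unique / avg_unique))

def big_l(tf, avg_tf):
    return (math.log(tf) + 1.0) / (math.log(avg_tf) + 1.0)

average_doc = u_norm(100, 100)   # at the pivot: factor is exactly 1.0
long_doc = u_norm(200, 100)      # long docs pushed down, but gently
short_doc = u_norm(50, 100)      # short docs boosted only slightly
```

Compared to cosine normalization, the pivoted factor changes much more slowly with length, which is exactly the bias correction the slide describes.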

Vector Space Representation:


Term Weights
dnb.dtn: A more recent attempt to handle varied document lengths

d: 1 + ln(1 + ln(tf)), (0 if tf = 0)
t: log((N+1) / df)
b: 1 / (0.8 + 0.2 * (doclen / avg_doclen))
(Singhal, et al., 1999)
What next?
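The three dnb.dtn factors follow directly from the definitions above; the collection size N and the example statistics are invented for illustration.

```python
import math

# dnb.dtn sketch, per the slide: d = 1 + ln(1 + ln(tf)) (0 if tf = 0),
# t = log((N + 1) / df), b = 1 / (0.8 + 0.2 * doclen / avg_doclen).
# N and the example statistics are invented.

N = 1000

def d_weight(tf):
    return 0.0 if tf == 0 else 1.0 + math.log(1.0 + math.log(tf))

def t_weight(df):
    return math.log((N + 1) / df)

def b_norm(doclen, avg_doclen):
    return 1.0 / (0.8 + 0.2 * doclen / avg_doclen)

# dnb document weight for one term: d * n (none) * b
doc_w = d_weight(5) * b_norm(120, 100)
# dtn query weight for the same term: d * t
query_w = d_weight(2) * t_weight(50)
```

The double log in "d" flattens tf even more aggressively than the "l" and "L" variants, which is the point of this weighting for very long documents.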

Vector Space Retrieval Model

How are documents represented using a tf.idf representation?

        T1     T2     T3     T4    ...   Tn
Doc 1   0.33   0.00   0.00   0.47  ...   0.53
Doc 2   0.00   0.64   0.53   0.00  ...   0.00
Doc 3   0.43   0.00   0.25   0.00  ...   0.00
 :       :      :      :      :           :

A document is represented as a vector of tf.idf values
One dimension per term in the corpus vocabulary
An unstructured query can also be represented as a vector

Q       0.00   0.00   1.00   0.00  ...   2.00

Linear algebra is used to determine which vectors are similar

Vector Space Similarity

[Figure: Doc1, Doc2, and Query vectors plotted along axes Term1, Term2, Term3]

Similarity is inversely related to the angle between the vectors.
Doc2 is the most similar to the Query.
Rank the documents by their similarity to the Query.

Vector Space Similarity:


Common Measures
Sim(X,Y)             Binary Term Vectors                Weighted Term Vectors

Inner product        |X ∩ Y|                            Sum_i x_i * y_i
                     (# nonzero dimensions in common)

Dice coefficient     2 |X ∩ Y| / (|X| + |Y|)            2 Sum x_i y_i / (Sum x_i^2 + Sum y_i^2)
(length normalized inner product)

Cosine coefficient   |X ∩ Y| / (sqrt|X| * sqrt|Y|)      Sum x_i y_i / (sqrt(Sum x_i^2) * sqrt(Sum y_i^2))
(like Dice, but lower penalty with different # features)

Jaccard coefficient  |X ∩ Y| / (|X| + |Y| - |X ∩ Y|)    Sum x_i y_i / (Sum x_i^2 + Sum y_i^2 - Sum x_i y_i)
(like Dice, but penalizes low overlap cases)

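The weighted-vector column can be sketched as follows; x and y are invented example vectors.

```python
import math

# Sketch of the four weighted-vector similarity measures above;
# x and y are invented example vectors.

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

def dice(x, y):
    return 2 * inner(x, y) / (inner(x, x) + inner(y, y))

def cosine(x, y):
    return inner(x, y) / (math.sqrt(inner(x, x)) * math.sqrt(inner(y, y)))

def jaccard(x, y):
    return inner(x, y) / (inner(x, x) + inner(y, y) - inner(x, y))

x = [1.0, 2.0, 0.0]
y = [2.0, 1.0, 1.0]
```

All four are built from the same inner products, which makes their family resemblance (and their different normalizations) easy to see.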

Vector Space Similarity:


Example
           Term Weights
Query   0.0   0.2   0.0
Doc 1   0.3   0.1   0.4
Doc 2   0.8   0.5   0.6

Sim(D1, Q) = (0.0*0.3 + 0.2*0.1 + 0.0*0.4) / ( sqrt(0.0^2 + 0.2^2 + 0.0^2) * sqrt(0.3^2 + 0.1^2 + 0.4^2) )
           = 0.02 / 0.10 = 0.20

Sim(D2, Q) = (0.0*0.8 + 0.2*0.5 + 0.0*0.6) / ( sqrt(0.0^2 + 0.2^2 + 0.0^2) * sqrt(0.8^2 + 0.5^2 + 0.6^2) )
           = 0.10 / 0.22 = 0.45

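These numbers can be checked with a short cosine computation over the same three weights:

```python
import math

# Check of the cosine example: Q = (0.0, 0.2, 0.0),
# D1 = (0.3, 0.1, 0.4), D2 = (0.8, 0.5, 0.6).

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

q = (0.0, 0.2, 0.0)
d1 = (0.3, 0.1, 0.4)
d2 = (0.8, 0.5, 0.6)

sim1 = cosine(d1, q)   # about 0.20
sim2 = cosine(d2, q)   # about 0.45, so Doc 2 ranks first
```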

Vector Space Similarity:


The Extended Boolean (p-norm) Model
Some similarity measures mimic Boolean AND, OR, NOT

Sim_OR(Q, D) = [ Sum_i (q_i^p * d_i^p) / Sum_i q_i^p ]^(1/p)

Sim_AND(Q, D) = 1 - [ Sum_i (q_i^p * (1 - d_i)^p) / Sum_i q_i^p ]^(1/p)

Sim_NOT(Q, D) = 1 - Sim(Q, D)

p indicates the degree of strictness:
1 is the least strict.
Infinity is the most strict, i.e., the Boolean case.
2 tends to be a good choice.

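The p-norm operators above can be sketched directly; q_i are query term weights and d_i document term weights, both assumed to lie in [0, 1], with invented values.

```python
# p-norm (extended Boolean) sketch, per the formulas above. Query
# weights q_i and document weights d_i are assumed in [0, 1];
# the example values are invented.

def sim_or(q, d, p):
    num = sum(qi ** p * di ** p for qi, di in zip(q, d))
    den = sum(qi ** p for qi in q)
    return (num / den) ** (1 / p)

def sim_and(q, d, p):
    num = sum(qi ** p * (1 - di) ** p for qi, di in zip(q, d))
    den = sum(qi ** p for qi in q)
    return 1 - (num / den) ** (1 / p)

def sim_not(s):
    return 1 - s

q = [1.0, 1.0]          # both query terms equally important
d = [1.0, 0.0]          # document matches only the first term

# As p grows, the operators approach strict Boolean AND / OR:
strict_and = sim_and(q, d, p=50)   # near 0: a missing term fails AND
strict_or = sim_or(q, d, p=50)     # near 1: one match satisfies OR
soft_and = sim_and(q, d, p=2)      # partial credit at p = 2
```

At p = 1 both operators collapse to the same averaged score, which is why intermediate values like p = 2 give the useful middle ground the slide mentions.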

Vector Space Similarity:


The Extended Boolean (p-norm) Model
Characteristics:
Full Boolean queries
(White AND House) OR
(Bill AND Clinton AND (NOT Hillary))
Ranked retrieval
Handles tf.idf weighting
Effective
Boolean operators are not used often with the vector space model
It is not clear why

Vector Space Retrieval Model:


Summary
Standard vector space

each dimension corresponds to a term in the vocabulary


vector elements are real-valued, reflecting term importance
any vector (document, query, ...) can be compared to any other
cosine correlation is the similarity metric used most often
Ranked Boolean
vector elements are binary
Extended Boolean
multiple nested similarity measures

Vector Space Retrieval Model:


Disadvantages
Assumes terms are independent of one another
Lack of justification for some vector operations

e.g. choice of similarity function


e.g., choice of term weights
Barely a retrieval model
Doesn't explicitly model relevance, a person's information need,
language models, etc.
Assumes a query and a document can be treated the same
Lack of a cognitive (or other) justification

Vector Space Retrieval Model:


Advantages

Simplicity: Easy to implement


Effectiveness: It works very well
Ability to incorporate any kind of term weights
Can measure similarities between almost anything:
documents and queries, documents and documents, queries and
queries, sentences and sentences, etc.
Used in a variety of IR tasks:
Retrieval, classification, summarization, SDI (selective dissemination of information), visualization, ...

The vector space model is the most popular retrieval model (today)

For Additional Information

G. Salton. Automatic Information Organization and Retrieval. McGraw-Hill. 1968.
G. Salton. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall. 1971.
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill. 1983.
G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley. 1989.
A. Singhal. AT&T at TREC-6. In The Sixth Text Retrieval Conference (TREC-6). National Institute of Standards and Technology special publication 500-240.
A. Singhal, J. Choi, D. Hindle, D. D. Lewis, and F. Pereira. AT&T at TREC-7. In The Seventh Text Retrieval Conference (TREC-7). National Institute of Standards and Technology special publication 500-242.
A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. SIGIR-96.
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, pp. 391-407. 1990.
E. A. Fox. Extending the Boolean and Vector Space models of Information Retrieval with P-Norm queries and multiple concept types. PhD thesis, Cornell University. 1983.
