Retrieval Models: Boolean and Vector Space


11-741:

Information Retrieval

Retrieval Models:
Boolean and Vector Space
Jamie Callan
Carnegie Mellon University
callan@cs.cmu.edu

Lecture Outline
Introduction to retrieval models
Exact-match retrieval

Unranked Boolean retrieval model


Ranked Boolean retrieval model
Best-match retrieval
Vector space retrieval model
Term weights
Boolean query operators (p-norm)

© 2003, Jamie Callan

What is a Retrieval Model?

A model is an abstract representation of a process or object

Used to study properties, draw conclusions, make predictions


The quality of the conclusions depends upon how closely the
model represents reality
A retrieval model describes the human and computational
processes involved in ad-hoc retrieval
Example: A model of human information seeking behavior
Example: A model of how documents are ranked
computationally
Components: Users, information needs, queries, documents, relevance assessments, ...
Retrieval models define relevance, explicitly or implicitly

Basic IR Processes
Information Need

Document

Representation

Representation

Query

Indexed Objects
Comparison

Evaluation/Feedback

Retrieved Objects

Overview of Major Retrieval Models

Boolean
Vector space

Basic vector space


Extended Boolean model
Probabilistic models
Basic probabilistic model
Bayesian inference networks
Language models
Citation analysis models
Hubs & authorities (Kleinberg, IBM Clever)
Page rank (Google)

Types of Retrieval Models:


Exact Match vs. Best Match Retrieval
Exact Match
Query specifies precise retrieval criteria
Every document either matches or fails to match query
Result is a set of documents
Usually in no particular order
Often in reverse-chronological order
Best Match
Query describes retrieval criteria for desired documents
Every document matches a query to some degree
Result is a ranked list of documents, best first

Overview of Major Retrieval Models

Boolean: Exact Match
Vector space: Best Match
Basic vector space
Extended Boolean model
Latent Semantic Indexing (LSI)
Probabilistic models: Best Match
Basic probabilistic model
Bayesian inference networks
Language models
Citation analysis models
Hubs & authorities (Kleinberg, IBM Clever): Best Match
Page rank (Google): Exact Match

Exact Match vs. Best Match Retrieval

Best-match models are usually more accurate / effective


Good documents appear at the top of the rankings
Good documents often don't exactly match the query
Query was too strict
Multimedia AND NOT (Apple OR IBM OR Microsoft)

Document didn't match user expectations


"multi database retrieval" vs. "multi db retrieval"

Exact match still prevalent in some markets

Large installed base that doesn't want to migrate to new software


Efficiency: Exact match is very fast, good for high volume tasks
Sufficient for some tasks
Web advanced search

Retrieval Model #1:


Unranked Boolean
Unranked Boolean is the most common Exact Match model
Model
Retrieve documents iff they satisfy a Boolean expression
Query specifies precise relevance criteria
Documents returned in no particular order
Operators:
Logical operators: AND, OR, AND NOT
Unconstrained NOT is expensive, so often not included
Distance operators: Near/1, Sentence/1, Paragraph/1, ...
String matching operators: Wildcard
Field operators: Date, Author, Title
Note: The unranked Boolean model is not the same thing as Boolean queries

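As a sketch of the model above, Boolean operators can be evaluated directly over posting sets: AND is intersection, OR is union, AND NOT is difference. The postings lists below are invented for illustration.

```python
# Unranked Boolean sketch: each term maps to the set of documents that
# contain it; AND / OR / AND NOT become set intersection, union, and
# difference. The postings lists are invented for illustration.

postings = {
    "multimedia": {1, 2, 5},
    "apple": {2, 7},
    "ibm": {3, 5},
}

def docs(term):
    """Posting set for a term (empty if the term is unindexed)."""
    return postings.get(term, set())

# multimedia AND NOT (apple OR ibm)
result = docs("multimedia") - (docs("apple") | docs("ibm"))
# The result is an unordered set of matching documents, not a ranking.
```

Note that nothing here scores documents: a document is either in the result set or not, which is exactly the Exact Match property discussed above.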

Unranked Boolean:
Example
A typical Boolean query
(distributed NEAR/1 information NEAR/1 retrieval) OR
((collection OR database OR resource) NEAR/1 selection) OR
(multi NEAR/1 database NEAR/1 search) OR
((FIELD author callan) AND CORI) OR
((FIELD author hawking) AND NOT stephen)
Studies show that people are not good at creating Boolean queries
People overestimate the quality of the queries they create
Queries are too strict: Few relevant documents found
Queries are too loose: Too many documents found (few relevant)

Unranked Boolean:
WESTLAW
Large commercial system
Serves legal and professional markets

Legal: Court cases, statutes, regulations, ...


News: Newspapers, magazines, journals, ...
Financial: Stock quotes, SEC materials, financial analyses
Total collection size: 5-7 Terabytes
700,000 users
In operation since 1974
Best-match and free text queries added in 1992

Unranked Boolean:
WESTLAW
Boolean operators
Proximity operators

Phrases: "Carnegie Mellon"


Word proximity: Language /3 Technologies
Same sentence (/s) or paragraph (/p): Microsoft /s anti trust
Restrictions: Date (After 1996 & Before 1999)
Query expansion:
Wildcard and truncation: Al*n Lev!
Automatic expansion of plurals and possessives
Document structure (fields): Title (gGlOSS)
Citations: Cites (Callan) & Date (After 1996)

Unranked Boolean:
WESTLAW
Queries are typically developed incrementally

Implicit relevance feedback


V1: machine AND learning
V2: (machine AND learning) OR (neural AND networks) OR
(decision AND tree)
V3: (machine AND learning) OR (neural AND networks) OR
(decision AND tree) AND C4.5 OR Ripper OR EG OR EM
Queries are complex
Proximity operators used often
NOT is rare
Queries are long (9-10 words, on average)

Unranked Boolean:
WESTLAW

Two stopword lists: 1 word for indexing, 50 words for queries


Legal thesaurus: To help people expand queries
Document ordering: Determined by user (often date)
Query term highlighting: Help people see how document matched
Performance:
Response time controlled by varying amount of parallelism
Average response time set to 7 seconds
Even for collections larger than 100 GB

Unranked Boolean:
Summary
Advantages:
Very efficient
Predictable, easy to explain
Structured queries
Works well when the searcher knows exactly what is wanted
Disadvantages:
Most people find it difficult to create good Boolean queries
Difficulty increases with size of collection
Precision and recall usually have strong inverse correlation
Predictability of results causes people to overestimate recall
Documents that are "close" to matching are not retrieved

Term Weights:
A Brief Introduction
The words of a text are not equally indicative of its meaning

Most scientists think that butterflies use the position of the sun
in the sky as a kind of compass that allows them to determine
which way is north. Scientists think that butterflies may use other
cues, such as the earth's magnetic field, but we have a lot to learn
about monarchs' sense of direction.
Important: Butterflies, monarchs, scientists, direction, compass
Unimportant: Most, think, kind, sky, determine, cues, learn
Term weights reflect the (estimated) importance of each term
There are many variations on how they are calculated
Historically it was tf weights
The standard approach for many IR systems is tf.idf weights

Retrieval Model #2:


Ranked Boolean
Ranked Boolean is another common Exact Match retrieval model
Model

Retrieve documents iff they satisfy a Boolean expression


Query specifies precise relevance criteria
Documents returned ranked by frequency of query terms
Operators:
Logical operators: AND, OR, AND NOT
Unconstrained NOT is expensive, so often not included
Distance operators: Proximity
String matching operators: Wildcard
Field operators: Date, Author, Title

Exact Match Retrieval:


Ranked Boolean
How document scores are calculated:
Term weight: How often term i occurs in document j (tf_i,j), or a tf_i,j * idf_i weight
AND weight: Minimum of argument weights
AND(t_1, t_2, ..., t_n) = Min(t_1,j , t_2,j , ..., t_n,j)
OR weight: Maximum of argument weights, or sum of argument weights
OR(t_1, t_2, ..., t_n) = Max(t_1,j , t_2,j , ..., t_n,j), or Sum_i t_i,j
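The Min/Max scoring rules above can be sketched as a small recursive evaluator. Raw tf values serve as term weights here; the documents and the query tree are invented for illustration.

```python
# Ranked Boolean sketch: AND = min of argument weights, OR = max of
# argument weights. Term weights are raw tf values; the term counts
# below are invented for illustration.

tf = {  # tf[doc][term]
    "d1": {"machine": 3, "learning": 2},
    "d2": {"machine": 1, "neural": 4, "networks": 4},
}

def score(node, doc):
    """Score a query tree: ("AND", ...), ("OR", ...), or a bare term."""
    if isinstance(node, str):                 # leaf: term weight = tf
        return tf.get(doc, {}).get(node, 0)
    op, *args = node
    weights = [score(a, doc) for a in args]
    return min(weights) if op == "AND" else max(weights)

query = ("OR", ("AND", "machine", "learning"),
               ("AND", "neural", "networks"))
# Unlike unranked Boolean, the result is a ranked list, best first
ranked = sorted(tf, key=lambda d: score(query, d), reverse=True)
```

Documents that satisfy the query more redundantly (higher tf on the binding terms) rise to the top, which is the advantage claimed on the next slide.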

Exact Match Retrieval:


Ranked Boolean
Advantages:
All of the advantages of the unranked Boolean model
Very efficient, predictable, easy to explain, structured queries,
works well when searchers know exactly what is wanted
Result set is ordered by how redundantly a document satisfies a query
Usually enables a person to find relevant documents more quickly
Other term weighting methods can be used, too
Example: tf, tf.idf, ...

Exact Match Retrieval:


Ranked Boolean
Disadvantages:
It's still an Exact Match model
Good Boolean queries hard to create
Difficulty increases with size of collection
Precision and recall usually have strong inverse correlation
Predictability of results causes people to overestimate recall
The returned documents match expectations,
so it is easy to forget that many relevant documents are missed
Documents that are "close" to matching are not retrieved

Are Boolean Retrieval Models Still Relevant?

Many people prefer Boolean

Professional searchers (e.g., librarians, paralegals)


Some Web surfers (e.g., Advanced Search feature)
About 80% of WESTLAW & Dialog searches are Boolean
What do they like? Control, predictability, understandability
Boolean and free-text queries find different documents
Solution: Retrieval models that support free-text and Boolean queries
Recall that almost any retrieval model can be Exact Match
Extended Boolean (vector space) retrieval model
Bayesian inference networks

Retrieval Model #3:


Vector Space Model
Retrieval Model:
Any text object can be represented by a term vector
Examples: Documents, queries, sentences, ...
Similarity is determined by distance in a vector space
Example: The cosine of the angle between the vectors
The SMART system:
Developed at Cornell University, 1960-1999
Still used widely

How Different Retrieval Models


View Ad-hoc Retrieval
Boolean
Query: A set of first-order logic (FOL) conditions a document must satisfy
Retrieval: Deductive inference
Vector space
Query: A short document
Retrieval: Finding similar objects

Vector Space Retrieval Model:


Introduction
How are documents represented in the binary vector model?

        Term1  Term2  Term3  Term4  ...  Termn
Doc1      1     ...
Doc2      0     ...
Doc3      1     ...
 :        :

A document is represented as a vector of binary values
One dimension per term in the corpus vocabulary
An unstructured query can also be represented as a vector

Query     0      0      1      0    ...    1

Linear algebra is used to determine which vectors are similar

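The binary representation above can be sketched in a few lines; the five-term vocabulary and the texts are invented for illustration.

```python
# Binary vector sketch: one dimension per vocabulary term; a document's
# vector holds 1 where the term occurs, 0 elsewhere. The vocabulary and
# texts are invented for illustration.

vocab = ["boolean", "vector", "space", "model", "query"]

def to_vector(text):
    words = set(text.lower().split())
    return [1 if t in words else 0 for t in vocab]

doc = to_vector("the vector space model")
query = to_vector("vector model")

# The inner product counts shared terms: a first, crude similarity
overlap = sum(d * q for d, q in zip(doc, query))
```

Terms outside the vocabulary (like "the" here) simply fall off the vector, which is one way stopword removal shows up in this representation.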

Vector Space Representation:


Linear Algebra Review
Formally, a vector space is defined by a set of linearly independent basis vectors.


Basis vectors:
correspond to the dimensions or directions in the vector space;
determine what can be described in the vector space; and
must be orthogonal, or linearly independent, i.e., a value along
one dimension implies nothing about a value along another.

[Figure: basis vectors for 2 dimensions and for 3 dimensions]

Vector Space Representation

What should be the basis vectors for information retrieval?


Basic concepts:
Difficult to determine (Philosophy? Cognitive science?)
Orthogonal (by definition)
A relatively static vector space (there are no new ideas)
Terms (Words, Word stems):
Easy to determine
Not really orthogonal (orthogonal enough?)
A constantly growing vector space
new vocabulary creates new dimensions

Vector Space Representation:


Vector Coefficients
The coefficients (vector elements, term weights) represent term presence, importance, or representativeness
The vector space model does not specify how to set term weights
Some common elements:
Document term weight: Importance of the term in this document
Collection term weight: Importance of the term in this collection
Length normalization: Compensate for documents of different lengths
Naming convention for term weight functions: DCL.DCL
First triple is document term weights, second triple is query term weights
n=none (no weighting on that factor)
Example: lnc.ltc

Vector Space Representation:


Term Weights
There have been many vector-space term weight functions
lnc.ltc: A very popular and commonly used choice
l: log(tf) + 1
n: No weight / normalization (i.e., 1.0)
t: log(N / df)
c: cosine normalization (divide by the vector's length)
For example:

sim(d, q) = Sum_i d_i * q_i / ( sqrt(Sum_i d_i^2) * sqrt(Sum_i q_i^2) )

where d_i = log(tf_i) + 1 and q_i = (log(qtf_i) + 1) * log(N / df_i)
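The lnc.ltc scheme can be sketched as follows. Natural log is assumed here, and the collection statistics (N and the df table) are invented for illustration.

```python
import math

# lnc.ltc sketch: document vectors use l (log tf + 1), n (no collection
# weight), c (cosine normalization); query vectors use l, t (log N/df),
# c. Natural log is assumed; N and df are invented statistics.

N = 1000                                   # collection size (assumed)
df = {"retrieval": 50, "boolean": 200}     # document frequencies (assumed)

def cosine_normalize(w):
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()}

def lnc(tf_counts):
    return cosine_normalize(
        {t: math.log(tf) + 1 for t, tf in tf_counts.items()})

def ltc(tf_counts):
    return cosine_normalize(
        {t: (math.log(tf) + 1) * math.log(N / df[t])
         for t, tf in tf_counts.items()})

def sim(d, q):
    # Both vectors are unit length, so the dot product is the cosine
    return sum(w * q.get(t, 0.0) for t, w in d.items())

d = lnc({"retrieval": 3, "boolean": 1})
q = ltc({"retrieval": 1})
```

Because both sides are cosine-normalized, the final similarity is just a dot product over the terms they share.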

Term Weights:
A Brief Introduction
Inverse document frequency (idf):
Terms that occur in many documents in the collection are less useful
for discriminating among documents
Document frequency (df): number of documents containing the term
idf is often calculated as: idf = log(N / df) + 1
idf is essentially -log(p) with p = df/N, i.e., the information content of observing the term

Vector Space Representation:


Term Weights
Cosine normalization favors short documents
Pivoting helps, but shifts the bias towards long documents

[Figure: normalization factor vs. document length, comparing the old (cosine) normalization, its average, and pivoted slopes]

Empirically, cosine normalization approximates the number of unique terms
lnu and Lnu document weighting:
u: pivoted unique normalization
u = 1 / (0.8 + 0.2 * (num unique terms / avg num unique terms))
L: compensate for higher tfs in long documents
L = (log(tf) + 1) / (log(avg tf in doc) + 1)
(Singhal, 1996; Singhal, 1998)

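The "u" and "L" factors above can be sketched directly; the slope 0.2 and pivot 0.8 come from the slide, while the document statistics are invented for illustration.

```python
import math

# Sketch of the lnu/Lnu factors above. "u" replaces cosine normalization
# with a pivot on the number of unique terms (slope 0.2, pivot 0.8);
# "L" damps tf in long documents. Document statistics are invented.

def u_norm(num_unique, avg_unique, slope=0.2):
    return 1.0 / ((1.0 - slope) + slope * (num_unique / avg_unique))

def big_l(tf, avg_tf):
    return (math.log(tf) + 1.0) / (math.log(avg_tf) + 1.0)

average_doc = u_norm(100, 100)   # at the pivot: factor is exactly 1.0
long_doc = u_norm(200, 100)      # long docs pushed down, but gently
short_doc = u_norm(50, 100)      # short docs boosted only slightly
```

Compared to cosine normalization, the pivoted factor changes much more slowly with length, which is exactly the bias correction the slide describes.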

Vector Space Representation:


Term Weights
dnb.dtn: A more recent attempt to handle varied document lengths

d: 1 + ln(1 + ln(tf)), (0 if tf = 0)
t: log((N+1) / df)
b: 1 / (0.8 + 0.2 * (doclen / avg_doclen))
(Singhal, et al., 1999)
What next?
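The three dnb.dtn factors follow directly from the definitions above; the collection size N and the example statistics are invented for illustration.

```python
import math

# dnb.dtn sketch, per the slide: d = 1 + ln(1 + ln(tf)) (0 if tf = 0),
# t = log((N + 1) / df), b = 1 / (0.8 + 0.2 * doclen / avg_doclen).
# N and the example statistics are invented.

N = 1000

def d_weight(tf):
    return 0.0 if tf == 0 else 1.0 + math.log(1.0 + math.log(tf))

def t_weight(df):
    return math.log((N + 1) / df)

def b_norm(doclen, avg_doclen):
    return 1.0 / (0.8 + 0.2 * doclen / avg_doclen)

# dnb document weight for one term: d * n (none) * b
doc_w = d_weight(5) * b_norm(120, 100)
# dtn query weight for the same term: d * t
query_w = d_weight(2) * t_weight(50)
```

The double log in "d" flattens tf even more aggressively than the "l" and "L" variants, which is the point of this weighting for very long documents.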

Vector Space Retrieval Model

How are documents represented using a tf.idf representation?

        T1     T2     T3     T4    ...   Tn
Doc 1   0.33   0.00   0.00   0.47  ...   0.53
Doc 2   0.00   0.64   0.53   0.00  ...   0.00
Doc 3   0.43   0.00   0.25   0.00  ...   0.00
 :       :      :      :      :           :

A document is represented as a vector of tf.idf values
One dimension per term in the corpus vocabulary
An unstructured query can also be represented as a vector

Q       0.00   0.00   1.00   0.00  ...   2.00

Linear algebra is used to determine which vectors are similar

Vector Space Similarity

[Figure: Doc1, Doc2, and Query vectors plotted along axes Term1, Term2, Term3]

Similarity is inversely related to the angle between the vectors.
Doc2 is the most similar to the Query.
Rank the documents by their similarity to the Query.

Vector Space Similarity:


Common Measures
Sim(X,Y)             Binary Term Vectors                Weighted Term Vectors

Inner product        |X ∩ Y|                            Sum_i x_i * y_i
                     (# nonzero dimensions in common)

Dice coefficient     2 |X ∩ Y| / (|X| + |Y|)            2 Sum x_i y_i / (Sum x_i^2 + Sum y_i^2)
(length normalized inner product)

Cosine coefficient   |X ∩ Y| / (sqrt|X| * sqrt|Y|)      Sum x_i y_i / (sqrt(Sum x_i^2) * sqrt(Sum y_i^2))
(like Dice, but lower penalty with different # features)

Jaccard coefficient  |X ∩ Y| / (|X| + |Y| - |X ∩ Y|)    Sum x_i y_i / (Sum x_i^2 + Sum y_i^2 - Sum x_i y_i)
(like Dice, but penalizes low overlap cases)

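The weighted-vector column can be sketched as follows; x and y are invented example vectors.

```python
import math

# Sketch of the four weighted-vector similarity measures above;
# x and y are invented example vectors.

def inner(x, y):
    return sum(a * b for a, b in zip(x, y))

def dice(x, y):
    return 2 * inner(x, y) / (inner(x, x) + inner(y, y))

def cosine(x, y):
    return inner(x, y) / (math.sqrt(inner(x, x)) * math.sqrt(inner(y, y)))

def jaccard(x, y):
    return inner(x, y) / (inner(x, x) + inner(y, y) - inner(x, y))

x = [1.0, 2.0, 0.0]
y = [2.0, 1.0, 1.0]
```

All four are built from the same inner products, which makes their family resemblance (and their different normalizations) easy to see.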

Vector Space Similarity:


Example
           Term Weights
Query   0.0   0.2   0.0
Doc 1   0.3   0.1   0.4
Doc 2   0.8   0.5   0.6

Sim(D1, Q) = (0.0*0.3 + 0.2*0.1 + 0.0*0.4) / ( sqrt(0.0^2 + 0.2^2 + 0.0^2) * sqrt(0.3^2 + 0.1^2 + 0.4^2) )
           = 0.02 / 0.10 = 0.20

Sim(D2, Q) = (0.0*0.8 + 0.2*0.5 + 0.0*0.6) / ( sqrt(0.0^2 + 0.2^2 + 0.0^2) * sqrt(0.8^2 + 0.5^2 + 0.6^2) )
           = 0.10 / 0.22 = 0.45

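These numbers can be checked with a short cosine computation over the same three weights:

```python
import math

# Check of the cosine example: Q = (0.0, 0.2, 0.0),
# D1 = (0.3, 0.1, 0.4), D2 = (0.8, 0.5, 0.6).

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

q = (0.0, 0.2, 0.0)
d1 = (0.3, 0.1, 0.4)
d2 = (0.8, 0.5, 0.6)

sim1 = cosine(d1, q)   # about 0.20
sim2 = cosine(d2, q)   # about 0.45, so Doc 2 ranks first
```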

Vector Space Similarity:


The Extended Boolean (p-norm) Model
Some similarity measures mimic Boolean AND, OR, NOT

Sim_OR(Q, D) = [ Sum_i (q_i^p * d_i^p) / Sum_i q_i^p ]^(1/p)

Sim_AND(Q, D) = 1 - [ Sum_i (q_i^p * (1 - d_i)^p) / Sum_i q_i^p ]^(1/p)

Sim_NOT(Q, D) = 1 - Sim(Q, D)

p indicates the degree of strictness:
1 is the least strict.
Infinity is the most strict, i.e., the Boolean case.
2 tends to be a good choice.

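The p-norm operators above can be sketched directly; q_i are query term weights and d_i document term weights, both assumed to lie in [0, 1], with invented values.

```python
# p-norm (extended Boolean) sketch, per the formulas above. Query
# weights q_i and document weights d_i are assumed in [0, 1];
# the example values are invented.

def sim_or(q, d, p):
    num = sum(qi ** p * di ** p for qi, di in zip(q, d))
    den = sum(qi ** p for qi in q)
    return (num / den) ** (1 / p)

def sim_and(q, d, p):
    num = sum(qi ** p * (1 - di) ** p for qi, di in zip(q, d))
    den = sum(qi ** p for qi in q)
    return 1 - (num / den) ** (1 / p)

def sim_not(s):
    return 1 - s

q = [1.0, 1.0]          # both query terms equally important
d = [1.0, 0.0]          # document matches only the first term

# As p grows, the operators approach strict Boolean AND / OR:
strict_and = sim_and(q, d, p=50)   # near 0: a missing term fails AND
strict_or = sim_or(q, d, p=50)     # near 1: one match satisfies OR
soft_and = sim_and(q, d, p=2)      # partial credit at p = 2
```

At p = 1 both operators collapse to the same averaged score, which is why intermediate values like p = 2 give the useful middle ground the slide mentions.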

Vector Space Similarity:


The Extended Boolean (p-norm) Model
Characteristics:
Full Boolean queries
(White AND House) OR
(Bill AND Clinton AND (NOT Hillary))
Ranked retrieval
Handles tf.idf weighting
Effective
Boolean operators are not used often with the vector space model
It is not clear why

Vector Space Retrieval Model:


Summary
Standard vector space

each dimension corresponds to a term in the vocabulary


vector elements are real-valued, reflecting term importance
any vector (document, query, ...) can be compared to any other
cosine correlation is the similarity metric used most often
Ranked Boolean
vector elements are binary
Extended Boolean
multiple nested similarity measures

Vector Space Retrieval Model:


Disadvantages
Assumes terms are independent of one another
Lack of justification for some vector operations

e.g. choice of similarity function


e.g., choice of term weights
Barely a retrieval model
Doesn't explicitly model relevance, a person's information need,
language models, etc.
Assumes a query and a document can be treated the same
Lack of a cognitive (or other) justification

Vector Space Retrieval Model:


Advantages

Simplicity: Easy to implement


Effectiveness: It works very well
Ability to incorporate any kind of term weights
Can measure similarities between almost anything:
documents and queries, documents and documents, queries and
queries, sentences and sentences, etc.
Used in a variety of IR tasks:
Retrieval, classification, summarization, SDI (selective dissemination of information), visualization, ...

The vector space model is the most popular retrieval model (today)

For Additional Information

G. Salton. Automatic Information Organization and Retrieval. McGraw-Hill. 1968.
G. Salton. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall. 1971.
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill. 1983.
G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley. 1989.
A. Singhal. AT&T at TREC-6. In The Sixth Text Retrieval Conference (TREC-6). National Institute of Standards and Technology special publication 500-240.
A. Singhal, J. Choi, D. Hindle, D. D. Lewis, and F. Pereira. AT&T at TREC-7. In The Seventh Text Retrieval Conference (TREC-7). National Institute of Standards and Technology special publication 500-242.
A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. SIGIR-96.
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, pp. 391-407. 1990.
E. A. Fox. Extending the Boolean and Vector Space models of Information Retrieval with P-Norm queries and multiple concept types. PhD thesis, Cornell University. 1983.
