Basic IR: Modeling


- Basic IR task: match a subset of documents to the user's query.
- Slightly more complex: also rank the resulting documents by predicted relevance.

The derivation of relevance leads to the different IR models.
Concepts: Term-Document Incidence

Imagine a matrix of terms x documents, with 1 when the term appears in the document and 0 otherwise:

         search   segment   select   semantic   ...
  MIR      1        0         1        1
  AI       1        1         0        1

Queries satisfied how? Problems?
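A minimal sketch of how such a matrix answers a conjunctive (AND) query, using the two documents above; the query terms are chosen for illustration:

```python
# Term-document incidence: rows are documents, columns are terms;
# 1 if the term appears in the document, 0 otherwise.
terms = ["search", "segment", "select", "semantic"]
incidence = {
    "MIR": [1, 0, 1, 1],
    "AI":  [1, 1, 0, 1],
}

def and_query(query_terms):
    """Return documents whose rows hold a 1 for every query term."""
    cols = [terms.index(t) for t in query_terms]
    return [doc for doc, row in incidence.items() if all(row[c] for c in cols)]

print(and_query(["search", "semantic"]))  # ['MIR', 'AI']
print(and_query(["segment", "select"]))   # [] -- no document contains both
```

One problem is already visible: both documents satisfy the first query equally, with no way to rank one above the other.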
Concepts: Term Frequency

- To support document ranking, we need more than term incidence.
- Term frequency records the number of times a given term appears in each document.
- Intuition: the more times a term appears in a document, the more central it is to the topic of that document.
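A quick sketch of recording term frequencies; the toy sentence is invented for illustration:

```python
from collections import Counter

# Term frequency: count how many times each term appears in the document.
doc = "search engines segment text then search a term index".split()
tf = Counter(doc)

print(tf["search"])   # 2 -- repeated term, likely central to the topic
print(tf["segment"])  # 1
```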
Concept: Term Weight

- Weights represent the importance of a given term for characterizing a document.
- w_ij is the weight of term i in document j.
Mapping Task and Document Type to Model

                        Index Terms   Full Text         Full Text + Structure
Searching (Retrieval)   Classic       Classic           Structured
Surfing (Browsing)      Flat          Flat, Hypertext   Structure Guided, Hypertext
IR Models (from the MIR text)

User task: Retrieval (Ad hoc, Filtering) or Browsing.

Retrieval models:
- Classic Models: Boolean, Vector, Probabilistic
  - Set Theoretic extensions: Fuzzy, Extended Boolean
  - Algebraic extensions: Generalized Vector, Latent Semantic Indexing, Neural Networks
  - Probabilistic extensions: Inference Network, Belief Network
- Structured Models: Non-Overlapping Lists, Proximal Nodes

Browsing models: Flat, Structure Guided, Hypertext
Classic Models: Basic Concepts

- k_i is an index term
- d_j is a document
- t is the total number of index terms
- K = (k_1, k_2, ..., k_t) is the set of all index terms
- w_ij >= 0 is a weight associated with (k_i, d_j)
- w_ij = 0 indicates that the term does not appear in the document
- vec(d_j) = (w_1j, w_2j, ..., w_tj) is the weighted vector associated with document d_j
- g_i(vec(d_j)) = w_ij is a function that returns the weight associated with the pair (k_i, d_j)
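This notation maps directly onto a term-by-document weight matrix; a small sketch in NumPy with placeholder sizes and values:

```python
import numpy as np

t, N = 4, 3                # t index terms, N documents (placeholders)
W = np.zeros((t, N))       # W[i, j] = w_ij, weight of term k_i in document d_j
W[0, 1] = 0.5              # k_0 carries weight 0.5 in d_1; 0 means "term absent"

d_1 = W[:, 1]              # vec(d_1) = (w_{1,1}, ..., w_{t,1}), a column of W

def g(i, d_vec):
    """g_i(vec(d_j)) = w_ij: pick the weight of term i out of a document vector."""
    return d_vec[i]

print(g(0, d_1))           # 0.5
```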
Classic: Boolean Model

- Based on set theory: map queries with Boolean operations to set operations.
- Select documents from the term-document incidence matrix.

Pros:
- exact matching

Cons -- ignores:
- term frequency in the document
- term scarcity in the corpus
- size of the document
- ranking (all matching documents are returned unranked)
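As a sketch of the set-theoretic view, each term can be stored with the set of documents containing it, and Boolean operators become set operations; the postings below are illustrative:

```python
# Each term maps to the set of documents that contain it.
postings = {
    "search":   {"d1", "d2", "d4"},
    "segment":  {"d2", "d3"},
    "semantic": {"d1", "d3", "d4"},
}
corpus = {"d1", "d2", "d3", "d4"}

# AND -> intersection, OR -> union, NOT -> difference from the corpus.
# Query: search AND semantic AND NOT segment
result = (postings["search"] & postings["semantic"]) - postings["segment"]
print(sorted(result))  # ['d1', 'd4'] -- an exact, but unranked, answer set
```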
Vector Model

- A vector of term weights based on term frequency.
- Compute similarity between query and document, where both are vectors:
  vec(d_j) = (w_1j, w_2j, ..., w_tj) and vec(q) = (w_1q, w_2q, ..., w_tq)
- Similarity is the cosine of the angle between the vectors.
Cosine Measure

$$\mathrm{sim}(d_j, q) = \cos(\theta) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$

Since w_{i,j} >= 0 and w_{i,q} >= 0 for all i, j:  0 <= sim(d_j, q) <= 1.

from MIR notes
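A direct transcription of the cosine formula; a sketch over dense vectors (a real engine would work from sparse postings instead):

```python
import math

def cosine_sim(d, q):
    """sim(d_j, q) = (d_j . q) / (|d_j| * |q|) for weight vectors d and q."""
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q)

# With non-negative weights the similarity always falls in [0, 1]:
print(round(cosine_sim([0.33, 0.0, 0.42], [0.22, 0.47, 0.85]), 2))  # 0.81
```

The vectors in the call are the d1 and query weights from the worked example later in these slides.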
How to Set w_ij Weights? TF-IDF

- Within a document: term frequency
  - tf measures term density within a document.
- Across documents: inverse document frequency
  - idf measures the informativeness or rarity of a term across the corpus:

$$idf_i = \log\left(\frac{n}{df_i}\right)$$
TF * IDF Computation

$$w_{i,d} = tf_{i,d} \times \log\left(\frac{n}{df_i}\right)$$

where:
- tf_{i,d} = frequency of term i in document d
- n = total number of documents
- df_i = the number of documents that contain term i

Questions:
- What happens as the number of occurrences of a term in a document increases?
- What happens as a term becomes rarer?
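A sketch of the computation, using the normalized tf from the next slide and natural log (which is what the worked example below uses):

```python
import math

def tfidf_weight(freq, max_freq, n, df):
    """w_{i,d} = (freq_{i,d} / max_l freq_{l,d}) * log(n / df_i)."""
    tf = freq / max_freq       # more occurrences in the doc -> larger tf
    idf = math.log(n / df)     # rarer in the corpus (smaller df) -> larger idf
    return tf * idf

print(round(tfidf_weight(freq=2, max_freq=2, n=7, df=5), 2))  # 0.34 (slides truncate to .33)
print(round(tfidf_weight(freq=1, max_freq=2, n=7, df=3), 2))  # 0.42
```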
TF * IDF

- TF may be normalized:
  tf(i,d) = freq(i,d) / max_l(freq(l,d))
- IDF is computed:
  - normalized to the size of the corpus
  - as a log, to make TF and IDF values comparable
- IDF requires a static corpus.
How to Set w_iq Weights?

1. Create the vector directly from the query, or
2. Use a modified tf-idf:

$$w_{i,q} = \left(0.5 + \frac{0.5 \, freq(i,q)}{\max_l freq(l,q)}\right) \times \log\left(\frac{n}{df_i}\right)$$
The Vector Model: Example

Raw term frequencies:

       k1   k2   k3
d1     2    0    1
d2     1    0    0
d3     0    1    3
d4     2    0    0
d5     1    2    4
d6     1    2    0
d7     0    5    0
q      1    2    3

(figure: documents d1-d7 plotted as points in the k1-k2-k3 term space)

from MIR notes


The Vector Model: Example (cont.)

1. Compute the tf-idf vector for each document.

For the first document (using natural log):
  k1: (2/2) * log(7/5) = .33
  k2: 0 * log(7/4) = 0
  k3: (1/2) * log(7/3) = .42

For the rest:
  d2: [.34 0 0], d3: [0 .19 .85], d4: [.34 0 0],
  d5: [.08 .28 .85], d6: [.17 .56 0], d7: [0 .56 0]

from MIR notes


The Vector Model: Example (cont.)

2. Compute the tf-idf vector for the query [1 2 3]:
  k1: (.5 + (.5 * 1)/3) * log(7/5)
  k2: (.5 + (.5 * 2)/3) * log(7/4)
  k3: (.5 + (.5 * 3)/3) * log(7/3)

which gives: [.22 .47 .85]
The Vector Model: Example (cont.)

3. Compute the similarity for each document:

  d1: d1 . q = (.33 * .22) + (0 * .47) + (.42 * .85) = .43
      |d1| = sqrt(.33^2 + .42^2) = .53
      |q| = sqrt(.22^2 + .47^2 + .85^2) = 1.0
      sim = .43 / (.53 * 1.0) = .81

  d2: .22   d3: .93   d4: .22   d5: .97   d6: .51   d7: .47

(d2 and d4 have identical weight vectors, so their similarities are equal.)
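The whole worked example fits in a short script; a sketch that recomputes the similarities from the raw frequency table, natural log throughout:

```python
import math

# Raw term frequencies from the example: d1..d7 over terms k1..k3.
docs = {"d1": [2, 0, 1], "d2": [1, 0, 0], "d3": [0, 1, 3], "d4": [2, 0, 0],
        "d5": [1, 2, 4], "d6": [1, 2, 0], "d7": [0, 5, 0]}
q_freq = [1, 2, 3]
n = len(docs)

# df_i = number of documents containing term i.
df = [sum(1 for f in docs.values() if f[i] > 0) for i in range(3)]   # [5, 4, 3]

def doc_weights(freq):                 # w_{i,d} = normalized tf * idf
    m = max(freq)
    return [(f / m) * math.log(n / df[i]) for i, f in enumerate(freq)]

def query_weights(freq):               # modified tf-idf for queries
    m = max(freq)
    return [(0.5 + 0.5 * f / m) * math.log(n / df[i]) for i, f in enumerate(freq)]

def cosine(d, q):
    dot = sum(a * b for a, b in zip(d, q))
    return dot / (math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in q)))

wq = query_weights(q_freq)
for name, freq in docs.items():
    print(name, round(cosine(doc_weights(freq), wq), 2))
# d1 0.81, d2 0.23, d3 0.93, d4 0.23, d5 0.97, d6 0.51, d7 0.47
# (full precision gives 0.23 for d2/d4; the slides' .22 comes from rounding
#  the intermediate weights before dividing)
```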
Vector Model: Implementation Issues

- The term x document matrix is sparse.
- Store the term count, the term weight, or the count weighted by idf_i?
- What if the corpus is not fixed (e.g., the Web)? What happens to IDF?
- How to efficiently compute the cosine for a large index?
Heuristics for Computing Cosine for a Large Index

- Consider only non-zero cosines.
- Focus on non-zero cosines for rare (high-idf) words.
- Pre-compute document adjacency (see the sketch below):
  - for each term, pre-compute the k nearest docs
  - for a t-term query, compute cosines from the query to the union of the t pre-computed lists, and choose the top k
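A sketch of the pre-computed-lists idea (often called champion lists); the weights below are illustrative:

```python
# For each term, keep only the k documents where that term's weight is highest.
k = 2
term_weights = {                     # term -> {doc: tf-idf weight}
    "search":   {"d1": 0.9, "d2": 0.4, "d3": 0.1},
    "semantic": {"d3": 0.8, "d4": 0.7, "d1": 0.2},
}
champions = {t: sorted(w, key=w.get, reverse=True)[:k]
             for t, w in term_weights.items()}

# At query time, score only the union of the query terms' champion lists
# instead of every document in the corpus.
query = ["search", "semantic"]
candidates = set().union(*(champions[t] for t in query))
print(sorted(candidates))  # ['d1', 'd2', 'd3', 'd4'] -- cosine computed only here
```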
The TF-IDF Vector Model: Pros/Cons

- Pros:
  - term weighting improves retrieval quality
  - the cosine ranking formula sorts documents by their degree of similarity to the query
- Cons:
  - assumes index terms are independent
