Basic IR: Modeling
Basic IR Task:
Match a subset of documents to the user’s
query
Slightly more complex:
and rank the resulting documents by predicted
relevance
The derivation of relevance leads to different
IR models.
Concepts:
Term-Document Incidence
Imagine a matrix of terms × documents, with 1 when
the term appears in the document and 0 otherwise.
      search  segment  select  semantic  …
MIR   1       0        1       1
AI    1       1        0       1
…
Queries satisfied how?
Problems?
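As a rough illustration of how a query could be satisfied against the incidence matrix, the sketch below treats a query as a conjunction of terms and returns the documents whose row has a 1 in every queried column. The two rows are the ones shown on the slide; the function name `boolean_and` and the AND-only query form are assumptions made for this example.

```python
# Rows are documents, columns are terms (values taken from the slide).
terms = ["search", "segment", "select", "semantic"]
incidence = {
    "MIR": [1, 0, 1, 1],
    "AI":  [1, 1, 0, 1],
}

def boolean_and(query_terms):
    """Return the documents that contain every query term (exact match).

    Hypothetical helper: supports only AND queries, for illustration.
    """
    cols = [terms.index(t) for t in query_terms]
    return [doc for doc, row in incidence.items()
            if all(row[c] == 1 for c in cols)]

print(boolean_and(["search", "semantic"]))  # ['MIR', 'AI']
print(boolean_and(["search", "select"]))    # ['MIR']
```

Every matching document comes back on equal footing, which hints at one of the problems: the incidence matrix alone gives no basis for ranking.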
Concepts:
Term Frequency
To support document ranking, need
more than just term incidence.
Term frequency records number of
times a given term appears in each
document.
Intuition: The more times a term appears in
a document, the more central it is to the
topic of the document.
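A minimal sketch of counting term frequency for one document. Lowercasing and whitespace tokenization are simplifying assumptions, and `term_frequencies` is a hypothetical helper name.

```python
from collections import Counter

def term_frequencies(text):
    """Count how many times each term appears in one document.

    Assumes lowercase, whitespace-separated tokens (no stemming or
    punctuation handling).
    """
    tokens = text.lower().split()
    return Counter(tokens)

tf = term_frequencies("search engines rank search results")
print(tf["search"])  # 2
print(tf["rank"])    # 1
```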
Concept: Term Weight
Weights represent the importance of a
given term for characterizing a document.
w_{i,j} is the weight for term i in document j.
Mapping Task and Document Type to Model
                        Index Terms   Full Text   Full Text + Structure
Searching (Retrieval)   Classic       Classic     Structured
incidence matrix
Pros:
Cons:
Exact Matching Ignores…
term frequency in document
term scarcity in corpus
size of document
ranking
Vector Model
Vector of term weights based on term
frequency
Compute similarity between query and
document where both are vectors
vec(d_j) = (w_{1,j}, w_{2,j}, ..., w_{t,j})
vec(q) = (w_{1,q}, w_{2,q}, ..., w_{t,q})
Similarity is the cosine of the angle between
the vectors.
Cosine Measure
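sim(d_j, q) = \cos(\theta) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \; \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}
where \theta is the angle between \vec{d_j} and \vec{q}.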
0 <= sim(q,dj) <=1
from MIR notes
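The cosine measure can be sketched in a few lines. The version below assumes document and query are equal-length lists of non-negative term weights; `cosine_sim` is a hypothetical name, and returning 0.0 for an all-zero vector is a convention chosen here to avoid division by zero.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        # Assumption: an empty document or query matches nothing.
        return 0.0
    return dot / (norm_d * norm_q)

print(cosine_sim([1, 0], [0, 1]))  # 0.0 (no shared terms)
```

Because the measure normalizes by vector length, a long document is not favored simply for containing more terms.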
How to Set Wij Weights?
TF-IDF
Within document: Term-Frequency
tf measures term density within a document
Across documents: Inverse Document
Frequency
idf measures informativeness or rarity of term
across corpus.
idf_i = \log \frac{n}{df_i}
TF * IDF Computation
w_{i,d} = tf_{i,d} \times \log(n / df_i)
tf_{i,d} = frequency of term i in document d
n = total number of documents
df_i = the number of documents that contain term i
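The weight formula can be tried out on a toy corpus. This is a sketch under stated assumptions: the three documents are made up for illustration, tokens are split on whitespace, and the natural logarithm is used because the slide does not specify a base. `tf_idf` is a hypothetical helper name.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using
    w = tf * log(n / df) with the natural log (an assumption)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # df: number of documents that contain each term at least once.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Made-up corpus, for illustration only.
docs = ["search search index", "index ranking", "search ranking model"]
w = tf_idf(docs)
print(round(w[0]["search"], 3))  # tf = 2, df = 2, n = 3
print(round(w[2]["model"], 3))   # tf = 1, df = 1, n = 3
```

A term that appears in every document gets log(n / n) = 0, so it contributes nothing to the similarity score regardless of its frequency.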
Example (cont.)
[Figure: documents d1 through d7]
for rest:
[.34 0 0], [0 .19 .85], [.34 0 0], [.08 .28 .85],
[.17 .56 0], [0 .56 0]
Example (cont.)
[Figure: documents d1 through d6; axis k3]
Example (cont.)
[Figure: documents d1 through d6; axis k3]