Professional Documents
Culture Documents
Vector Space Model: TF - IDF: Adapted From Lectures by
Vector Space Model: TF - IDF: Adapted From Lectures by
Prasad
L08VSM-tfidf
Postings compression
Prasad
Scoring documents
Term frequency
Collection statistics
Weighting schemes
Prasad
L08VSM-tfidf
Ranked retrieval
Prasad
Prasad
L08VSM-tfidf
Prasad
L08VSM-tfidf
Prasad
L08VSM-tfidf
Prasad
L08VSM-tfidf
Prasad
L08VSM-tfidf
Prasad
10
L08VSM-tfidf
11
Prasad
L08VSM-tfidf
12
Prasad
L08VSM-tfidf
13
Term frequency tf
Prasad
L08VSM-tfidf
14
Log-frequency weighting
if tf t,d 0
otherwise
tqd
Prasad
15
Document frequency
Prasad
L08VSM-tfidf
16
Prasad
L08VSM-tfidf
17
idf weight
Prasad
L08VSM-tfidf
18
dft
idft
calpurnia
animal
100
sunday
1,000
10,000
100,000
1,000,000
fly
under
the
L08VSM-tfidf
19
Collection frequency
Document frequency
insurance
10440
3997
try
10422
8760
Prasad
L08VSM-tfidf
20
tf-idf weighting
w t ,d (1 log tf t ,d ) log N / df t
L08VSM-tfidf
21
L08VSM-tfidf
22
Documents as vectors
Prasad
L08VSM-tfidf
23
Queries as vectors
Prasad
L08VSM-tfidf
24
Euclidean distance?
Prasad
L08VSM-tfidf
25
Prasad
L08VSM-tfidf
26
Prasad
L08VSM-tfidf
27
Prasad
L08VSM-tfidf
28
Length normalization
Prasad
i i
29
cosine(query,document)
Dot product
qd
cos( q , d )
qd
Unit vectors
q di
i 1 i
2
i 1 i
2
d
i1 i
V
L08VSM-tfidf
30
term
SaS
PaP
WH
affection
115
58
20
jealous
10
11
gossip
wuthering
38
L08VSM-tfidf
31
SaS
PaP
WH
After normalization
term
SaS
PaP
WH
affection
3.06
2.76
2.30
affection
0.789
0.832
0.524
jealous
2.00
1.85
2.04
jealous
0.515
0.555
0.465
gossip
1.30
1.78
gossip
0.335
0.405
2.58
wuthering
0.588
wuthering
cos(SaS,PaP)
0.789 0.832 + 0.515 0.555 + 0.335 0.0 + 0.0 0.0
0.94
cos(SaS,WH) 0.79
cos(PaP,WH) 0.69
Why do we have cos(SaS,PaP) > cos(SAS,WH)?
Prasad
Query
tf-raw
tf-wt
df
Document
idf
wt
tf-raw
tf-wt
wt
Prod
nlized
auto
5000 2.3
0.52
best
50000 1.3
1.3
car
10000 2.0
2.0
0.52
1.04
insurance
1000 3.0
3.0
1.3
3.9
2.03
6.09
Prasad
L08VSM-tfidf
37