Chap 5
Presenter: 林秉儀
Student ID: 89522022
Introduction
• Define:
– Weight:
Let $k_i$ be a generic index term in the set $K = \{k_1, \ldots, k_t\}$.
A weight $w_{i,j} > 0$ is associated with each index term $k_i$ of a document $d_j$.
– Document index term vector:
The document $d_j$ is associated with an index term vector $\vec{d}_j$ represented by
$\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
Vector Model (cont’d)
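• Define
– From Chapter 2:
The term weighting: $w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}$
The normalized frequency: $f_{i,j} = \frac{\mathit{freq}_{i,j}}{\max_l \mathit{freq}_{l,j}}$
where $\mathit{freq}_{i,j}$ is the raw frequency of $k_i$ in the document $d_j$, $N$ is the number of documents in the collection, and $n_i$ is the number of documents containing $k_i$
The inverse document frequency for $k_i$: $\mathit{idf}_i = \log \frac{N}{n_i}$
The query term weight: $w_{i,q} = \left(0.5 + \frac{0.5 \, \mathit{freq}_{i,q}}{\max_l \mathit{freq}_{l,q}}\right) \times \log \frac{N}{n_i}$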
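The weighting formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the dictionary-based inputs are assumptions, and the natural logarithm is used because the slides do not state a log base.

```python
import math

def document_weights(freqs, doc_counts, n_docs):
    """Compute w_ij = f_ij * log(N / n_i) for one document.

    freqs:      term -> raw frequency freq_ij in the document
    doc_counts: term -> n_i, the number of documents containing the term
    n_docs:     N, the number of documents in the collection
    """
    max_freq = max(freqs.values())
    return {
        term: (freq / max_freq) * math.log(n_docs / doc_counts[term])
        for term, freq in freqs.items()
    }

def query_weights(freqs, doc_counts, n_docs):
    """Compute w_iq = (0.5 + 0.5 * freq_iq / max_l freq_lq) * log(N / n_i)."""
    max_freq = max(freqs.values())
    return {
        term: (0.5 + 0.5 * freq / max_freq) * math.log(n_docs / doc_counts[term])
        for term, freq in freqs.items()
    }

doc_w = document_weights({"a": 2, "b": 1}, {"a": 10, "b": 50}, 100)
query_w = query_weights({"a": 2, "b": 1}, {"a": 10, "b": 50}, 100)
```

Both functions assume at least one term with a non-zero frequency, since they divide by the maximum frequency.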
Vector Model (cont’d)
• Define:
– Query vector:
the query vector $\vec{q}$ is defined as $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
– $D_r$: set of relevant documents identified by the user
– $D_n$: set of non-relevant documents among the retrieved documents
– $C_r$: set of relevant documents among all documents in the collection
– $\alpha, \beta, \gamma$: tuning constants
Query Expansion and Term Reweighting
for the Vector Model
• Ideal case
$C_r$: the complete set of relevant documents to a given query $q$
– The best query vector is given by
$\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j$
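– Ide_Regular: $\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j$
– Ide_Dec_Hi: $\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \, \max_{non\text{-}relevant}(\vec{d}_j)$
where $\max_{non\text{-}relevant}(\vec{d}_j)$ is the highest-ranked non-relevant document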
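As a rough sketch of the two Ide formulas, the modified query adds the relevant document vectors to the original query and subtracts either all non-relevant vectors (Ide_Regular) or only the highest-ranked one (Ide_Dec_Hi). The function names, the list-of-floats vector representation, and the default value 1.0 for the tuning constants are assumptions made for this example.

```python
def add_scaled(acc, vec, scale):
    """Return acc + scale * vec, element by element."""
    return [a + scale * v for a, v in zip(acc, vec)]

def ide_regular(q, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    """q_m = alpha*q + beta*sum(D_r) - gamma*sum(D_n)."""
    qm = [alpha * w for w in q]
    for d in relevant:
        qm = add_scaled(qm, d, beta)
    for d in non_relevant:
        qm = add_scaled(qm, d, -gamma)
    return qm

def ide_dec_hi(q, relevant, top_non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    """q_m = alpha*q + beta*sum(D_r) - gamma*(highest-ranked non-relevant doc)."""
    qm = [alpha * w for w in q]
    for d in relevant:
        qm = add_scaled(qm, d, beta)
    return add_scaled(qm, top_non_relevant, -gamma)

q = [1, 0, 1]
relevant = [[1, 1, 0], [0, 1, 0]]
non_relevant = [[0, 0, 1], [1, 0, 1]]

regular = ide_regular(q, relevant, non_relevant)    # [1.0, 2.0, -1.0]
dec_hi = ide_dec_hi(q, relevant, non_relevant[0])   # [2.0, 2.0, 0.0]
```

The sketch leaves negative weights in place; whether to clip them to zero is a design choice the slides do not address.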
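Term Reweighting for the Probabilistic Model
$sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \log \left[ \frac{|D_{r,i}|}{|D_r| - |D_{r,i}|} \div \frac{n_i - |D_{r,i}|}{N - |D_r| - (n_i - |D_{r,i}|)} \right]$
where $D_{r,i}$ is the subset of $D_r$ whose documents contain $k_i$
• No query expansion occurs in the procedure.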
Term Reweighting for the Probabilistic
Model (cont’d)
• Adjustment factor
– Because $|D_r|$ and $|D_{r,i}|$ are usually small, a 0.5 adjustment factor is added to the estimates of $P(k_i|R)$ and $P(k_i|\bar{R})$:
$P(k_i|R) = \frac{|D_{r,i}| + 0.5}{|D_r| + 1}$
$P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + 0.5}{N - |D_r| + 1}$
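A small sketch can show why the adjustment matters: without it, a term that appears in none (or all) of the relevant documents would give a zero probability or a division by zero inside the logarithm. The function and parameter names below are assumptions, and the term weight is written as the log-odds form $\log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})}$.

```python
import math

def adjusted_probabilities(n_rel, n_rel_with_term, n_docs, n_docs_with_term):
    """Estimate P(k_i|R) and P(k_i|not R) with the 0.5 adjustment.

    n_rel:            |D_r|, relevant documents identified by the user
    n_rel_with_term:  |D_r,i|, relevant documents that contain k_i
    n_docs:           N, documents in the collection
    n_docs_with_term: n_i, documents that contain k_i
    """
    p_rel = (n_rel_with_term + 0.5) / (n_rel + 1)
    p_non = (n_docs_with_term - n_rel_with_term + 0.5) / (n_docs - n_rel + 1)
    return p_rel, p_non

def term_relevance_weight(p_rel, p_non):
    """Log-odds weight of a term given the two probability estimates."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_non) / p_non)

# A term absent from all 4 relevant documents still gets a finite weight.
p_rel, p_non = adjusted_probabilities(4, 0, 100, 10)
weight_absent = term_relevance_weight(p_rel, p_non)

# A term present in all 4 relevant documents gets a positive weight.
weight_present = term_relevance_weight(*adjusted_probabilities(4, 4, 100, 10))
```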
• Feedback searches:
$F_{i,j,q} = \left( C + \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right) f_{i,j}$
Automatic Local Analysis
• Clustering : the grouping of documents which satisfy a set
of common properties.
• Attempting to obtain a description for a larger cluster of relevant documents automatically:
identify terms which are related to the query terms, such as:
– Synonyms
– Stemming
– Variations
– Terms with a distance of at most k words from a query
term
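The last item in the list, terms within at most k words of a query term, can be illustrated with a short sketch. The function name and the pre-tokenized input are assumptions for this example; the slides do not prescribe a particular procedure.

```python
def terms_near_query(tokens, query_terms, k):
    """Collect terms that occur within k words of any query term."""
    query_terms = set(query_terms)
    near = set()
    for pos, token in enumerate(tokens):
        if token in query_terms:
            start = max(0, pos - k)
            end = min(len(tokens), pos + k + 1)
            near.update(tokens[start:end])
    # The query terms themselves are not expansion candidates.
    return near - query_terms

tokens = "the quick brown fox jumps over the lazy dog".split()
neighbors = terms_near_query(tokens, {"fox"}, 1)   # {"brown", "jumps"}
```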
Automatic Local Analysis (cont’d)
– Normalized: $s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}$
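The normalization can be checked with a minimal sketch. It assumes that the correlation $c_{u,v}$ is the sum over the local documents of the product of the two terms' frequencies, a definition that does not appear in this excerpt; the function names are also hypothetical.

```python
def correlation(freq_u, freq_v):
    """Assumed correlation c_uv: sum over documents of freq_u[j] * freq_v[j]."""
    return sum(fu * fv for fu, fv in zip(freq_u, freq_v))

def normalized_association(freq_u, freq_v):
    """s_uv = c_uv / (c_uu + c_vv - c_uv)."""
    c_uv = correlation(freq_u, freq_v)
    c_uu = correlation(freq_u, freq_u)
    c_vv = correlation(freq_v, freq_v)
    return c_uv / (c_uu + c_vv - c_uv)

# Frequencies of two terms across two local documents.
s_uv = normalized_association([2, 1], [1, 1])   # 3 / (5 + 2 - 3) = 0.75
```

Under this assumption, two terms with identical frequency profiles score 1 and terms that never co-occur score 0. The sketch does not guard against the case where both frequency lists are all zeros.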
Association Clusters (cont’d)
[Figure: $S_v(n)$, with terms $s_u$ and $s_v$]
Interactive Search Formulation (cont’d)