Retrieval Model


11/4/2019

MULTIMEDIA INFORMATION RETRIEVAL

[Figure: search engine architecture. Components: Crawler, Indexed corpus, Doc Analyzer, Doc Representation, Indexer, Index, User, (Query), Query Rep, Ranker, results, Feedback, Evaluation. The ranking procedure is marked as the focus of research attention.]

Retrieval model

Algorithms that find the most relevant documents for the given information need

Indexed corpus

• Crawler
  1. Visiting strategy
  2. Avoid duplicate visits
  3. Re-visit policy
• Doc Analyzer
  1. HTML parsing
  2. Tokenization
  3. Stemming/normalization
  4. Stopword/controlled vocabulary filter
• Doc Representation
  • Bag-of-words representation!
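The Doc Analyzer steps listed above can be sketched roughly as follows. This is a minimal illustration, not the pipeline of any particular search engine: the stopword list, the suffix-stripping `naive_stem` helper, and the regex-based tag removal are all simplifying assumptions introduced here.

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}

def naive_stem(token: str) -> str:
    """Toy normalization: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(html: str) -> Counter:
    """Turn a raw HTML string into a bag-of-words (term -> count)."""
    text = re.sub(r"<[^>]+>", " ", html)              # 1. crude HTML tag removal
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # 2. tokenization
    stems = [naive_stem(t) for t in tokens]           # 3. stemming/normalization
    kept = [t for t in stems if t not in STOPWORDS]   # 4. stopword filter
    return Counter(kept)                              # word order is discarded

bow = analyze("<p>The crawler is crawling the indexed pages</p>")
```

A real system would typically use a proper HTML parser and an established stemmer instead of these toy helpers.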

Some models in IR

1. Boolean model

• Boolean query
  • E.g., “obama” AND “healthcare” NOT “news”
• Procedures
  • Look up each query term in the dictionary
  • Retrieve the posting lists
• Operation
  • AND: intersect the posting lists
  • OR: union the posting lists
  • NOT: diff the posting lists
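The three posting-list operations can be sketched as below. The `index` contents and doc IDs are made up for illustration, and only the AND operation uses the linear merge that sorted posting lists allow; the other two use sets for brevity.

```python
# Toy inverted index: term -> sorted posting list of doc IDs (hypothetical data).
index = {
    "obama": [1, 2, 4, 7],
    "healthcare": [2, 4, 5, 7],
    "news": [4, 6],
}

def postings(term):
    """Dictionary lookup; an unknown term has an empty posting list."""
    return index.get(term, [])

def intersect(p1, p2):
    """AND: linear merge of two sorted posting lists."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def union(p1, p2):
    """OR: every doc that appears in either list, kept sorted."""
    return sorted(set(p1) | set(p2))

def difference(p1, p2):
    """NOT: docs in p1 that do not appear in p2."""
    excluded = set(p2)
    return [d for d in p1 if d not in excluded]

# "obama" AND "healthcare" NOT "news"
result = difference(
    intersect(postings("obama"), postings("healthcare")),
    postings("news"),
)
```

Note that the result is an unordered set of matching documents: the Boolean model selects documents but does not rank them.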


Selection vs. Ranking


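[Figure: the true relevant set Rel(q) compared with the approximation Rel’(q) produced in two ways. Doc selection asks f(d,q)=? and outputs 1 or 0 for each doc. Doc ranking asks rel(d,q)=? and takes Rel’(q) from the top of the sorted list. "+" marks a relevant doc, "-" a non-relevant one.]

Doc ranking example:

0.98 d1 +
0.95 d2 +
0.83 d3 -
0.80 d4 +
0.76 d5 -
0.56 d6 -
0.34 d7 -
0.21 d8 +
0.21 d9 -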
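The difference between the two strategies can be sketched with the scores from the ranking list above. The threshold value and the cutoff `k` are assumptions chosen for illustration; the slide does not specify them.

```python
# Scores copied from the ranking list in the figure.
scored = [
    ("d1", 0.98), ("d2", 0.95), ("d3", 0.83), ("d4", 0.80), ("d5", 0.76),
    ("d6", 0.56), ("d7", 0.34), ("d8", 0.21), ("d9", 0.21),
]

def select(scored_docs, threshold):
    """Doc selection: a binary decision per doc (hypothetical threshold)."""
    return [doc for doc, score in scored_docs if score >= threshold]

def rank(scored_docs, k):
    """Doc ranking: sort by score and let the user stop after k docs."""
    ordered = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ordered[:k]]

selected = select(scored, threshold=0.8)
top3 = rank(scored, k=3)
```

With ranking, the cutoff can be moved without recomputing anything, whereas selection commits to a single absolute decision per document.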

Relevance = Similarity

Vector Space Model

• Represent both doc and query by concept vectors
  • Each concept defines one dimension
  • K concepts define a high-dimensional space
  • Element of vector corresponds to concept weight
  • E.g., d=(x1,…,xk), xi is “importance” of concept i
• Measure relevance
  • Distance between the query vector and document vector in this concept space
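As a small sketch of this idea, the snippet below places a query and two documents in a three-concept space and picks the document nearest to the query. The concept weights are invented for illustration, and Euclidean distance is only one possible choice of distance measure.

```python
import math

# Hypothetical weights on three concepts: (Sports, Education, Finance).
query = (0.9, 0.1, 0.0)
docs = {
    "D1": (0.8, 0.3, 0.1),
    "D2": (0.1, 0.2, 0.9),
}

def euclidean(u, v):
    """Straight-line distance between two concept vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# The doc with the smallest distance to the query is treated as most relevant.
closest = min(docs, key=lambda name: euclidean(query, docs[name]))
```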

VS model: an illustration

• Which document is closer to the query?

[Figure: documents D1 to D5 and the query plotted in a concept space with axes Finance, Sports, and Education.]

What the VS model doesn’t say

• How to define/select the “basic concept”
  • Concepts are assumed to be orthogonal
• How to assign weights
  • Weight in query indicates importance of the concept
  • Weight in doc indicates how well the concept characterizes the doc
• How to define the similarity/distance measure

What is a good “basic concept”?

• Orthogonal
  • Linearly independent basis vectors
  • “Non-overlapping” in meaning
  • No ambiguity
• Weights can be assigned automatically and accurately
• Existing solutions
  • Terms or N-grams, i.e., bag-of-words
  • Topics, i.e., topic model


How to assign weights?

• Important!
• Why?
  • Query side: not all terms are equally important
  • Doc side: some terms carry more information about the content
• How?
  • Two basic heuristics
    • TF (Term Frequency) = Within-doc-frequency
    • IDF (Inverse Document Frequency)
  • TF-IDF

TF weighting

TF normalization

• Two views of document length
  • A doc is long because it is verbose
  • A doc is long because it has more content
• Raw TF is inaccurate
  • Document length variation
  • “Repeated occurrences” are less informative than the “first occurrence”
  • Relevance does not increase proportionally with the number of term occurrences
• Generally penalize long docs, but avoid over-penalizing
  • Pivoted length normalization
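The two ideas above can be sketched as follows. The formulas are commonly used textbook forms and are assumed here rather than taken from the slides: logarithmic (sublinear) TF scaling to dampen repeated occurrences, and a pivoted length normalizer with a hypothetical slope parameter `b`.

```python
import math

def sublinear_tf(count: int) -> float:
    """Dampen repeated occurrences: 1 + log(count), or 0 if the term is absent."""
    return 1.0 + math.log(count) if count > 0 else 0.0

def pivoted_norm(doc_len: int, avg_doc_len: float, b: float = 0.2) -> float:
    """Length normalizer that equals 1 for an average-length doc.

    b in [0, 1] controls how strongly long docs are penalized (assumed value).
    """
    return 1.0 - b + b * (doc_len / avg_doc_len)

def normalized_tf(count: int, doc_len: int, avg_doc_len: float, b: float = 0.2) -> float:
    """Sublinear TF divided by the pivoted length normalizer."""
    return sublinear_tf(count) / pivoted_norm(doc_len, avg_doc_len, b)

# Same raw count, but the longer doc gets a smaller weight.
short_doc = normalized_tf(count=3, doc_len=100, avg_doc_len=100.0)
long_doc = normalized_tf(count=3, doc_len=200, avg_doc_len=100.0)
```

Setting `b = 0` turns length normalization off, while `b = 1` divides by the full relative length, which is one way to see how the slope avoids over-penalizing long documents.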

TF normalization - scaled frequency

TF normalization - scaled length

Document frequency

• Idea: a term is more discriminative if it occurs in fewer documents

IDF weighting

[Formula annotations: non-linear scaling; total number of docs in collection]
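One classical form consistent with these annotations is IDF(t) = 1 + log(N / df(t)), where N is the total number of docs in the collection and df(t) is the number of docs containing t. The sketch below assumes that form; the toy corpus and the choice to return 0 for unseen terms are illustrative assumptions.

```python
import math

# Toy corpus: each doc is a list of tokens (hypothetical data).
corpus = [
    ["obama", "healthcare", "bill"],
    ["obama", "news"],
    ["healthcare", "news", "news"],
    ["sports", "news"],
]

def document_frequency(term, docs):
    """Number of docs that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)

def idf(term, docs):
    """Assumed form: 1 + log(N / df). Unseen terms get 0 by convention here."""
    df = document_frequency(term, docs)
    if df == 0:
        return 0.0
    return 1.0 + math.log(len(docs) / df)

rare = idf("sports", corpus)   # occurs in 1 of 4 docs
common = idf("news", corpus)   # occurs in 3 of 4 docs
```

The logarithm is the non-linear scaling: a term that is ten times rarer does not become ten times more important.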

Example

TF-IDF weighting
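A minimal sketch of TF-IDF weighting, assuming the weight of a term in a doc is its raw count multiplied by 1 + log(N / df). Both the exact formula and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus of tokenized docs (hypothetical data).
corpus = [
    "information retrieval ranks documents".split(),
    "vector space model for retrieval".split(),
    "sports news and sports scores".split(),
]

def tfidf_vector(doc, docs):
    """Map each term of `doc` to count * (1 + log(N / df)).

    Assumes `doc` is itself a member of `docs`, so df is never zero.
    """
    n_docs = len(docs)
    vector = {}
    for term, count in Counter(doc).items():
        df = sum(1 for other in docs if term in other)
        vector[term] = count * (1.0 + math.log(n_docs / df))
    return vector

weights = tfidf_vector(corpus[2], corpus)
```

Here "sports" gets twice the weight of "news" in the third doc because it occurs twice, while "retrieval" in the first doc is discounted because it appears in two of the three docs.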

“Salton was perhaps the leading computer scientist working in the field of information retrieval during his time.” - Wikipedia

Gerard Salton Award: highest achievement award in IR

How to define a good similarity measure?

[Figure: documents D1 to D5 and the query plotted in TF-IDF space with axes Finance, Sports, and Education.]

Similarity measure

Inner product: example
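As an illustration of the inner product as a similarity measure, the sketch below scores two documents against a query using sparse term-weight vectors. All weights are hypothetical. The length-normalized variant (cosine similarity) is a standard refinement included here as an assumption, since it keeps long documents from winning on size alone.

```python
import math

def dot(u, v):
    """Inner product of two sparse vectors stored as term -> weight dicts."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)

def cosine(u, v):
    """Inner product divided by both vector lengths; 0 if either is empty."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)

# Hypothetical term weights.
query = {"retrieval": 1.0, "model": 1.0}
d1 = {"retrieval": 2.0, "model": 1.0, "vector": 1.0}
d2 = {"sports": 3.0, "news": 1.0}

score_d1 = dot(query, d1)      # shares two terms with the query
score_d2 = dot(query, d2)      # shares no terms with the query
cos_d1 = cosine(query, d1)
```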

Vector space model

Vector model: advantages

References

Slides adapted from:

• http://www.cs.virginia.edu/~hw5x/Course/IR2015/_site/lectures/
• https://nlp.stanford.edu/IR-book/newslides.html

• https://course.ccs.neu.edu/cs6200s14/slides.html
