Retrieval Model
MULTIMEDIA INFORMATION RETRIEVAL
[Figure: search engine architecture; visible labels: Crawler, Indexed corpus, Ranking procedure, Research attention]
Doc Analyzer
1. HTML parsing
2. Tokenization
3. Stemming/normalization
4. Stopword/controlled vocabulary filter
Doc Representation
• Bag-of-words representation!
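The four Doc Analyzer steps can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the regex-based tag stripping, the toy plural-stripping rule standing in for a real stemmer, the tiny `STOPWORDS` set, and the function names are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}

def strip_html(html):
    """Step 1: crude HTML parsing; drop tags and keep the text."""
    return re.sub(r"<[^>]+>", " ", html)

def tokenize(text):
    """Step 2: split the text into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)

def normalize(token):
    """Step 3: lowercase and strip a trailing plural 's' (toy stand-in for stemming)."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token

def analyze(html):
    """Run steps 1-4 and return a bag of words (term -> count)."""
    tokens = [normalize(t) for t in tokenize(strip_html(html))]
    # Step 4: stopword filter.
    return Counter(t for t in tokens if t not in STOPWORDS)

bag = analyze("<p>The crawler fetches pages and the indexer indexes pages.</p>")
```

The resulting `Counter` discards word order and keeps only term counts, which is what the bag-of-words representation means.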
CS@UVa
11/4/2019
• Boolean query
• E.g., “obama” AND “healthcare” NOT “news”
• Procedures
• Lookup query term in the dictionary
• Retrieve the posting lists
• Operation
• AND: intersect the posting lists
• OR: union the posting lists
• NOT: diff the posting lists
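The procedure above can be sketched with a small in-memory inverted index. The three toy documents and the helper names `build_index` and `postings` are hypothetical, and Python sets stand in for posting lists for brevity; production systems typically keep sorted lists and merge them.

```python
# Toy collection (made up for illustration).
docs = {
    1: "obama signs healthcare bill",
    2: "obama healthcare news roundup",
    3: "sports news today",
}

def build_index(docs):
    """Map each term in the dictionary to the set of doc ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_index(docs)

def postings(term):
    """Look up a query term; unknown terms have an empty posting list."""
    return index.get(term, set())

# "obama" AND "healthcare" NOT "news"
and_not = (postings("obama") & postings("healthcare")) - postings("news")

# "obama" OR "sports"
either = postings("obama") | postings("sports")
```

Here AND maps to set intersection, OR to union, and NOT to set difference against the running result.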
Relevance = Similarity
[Figure: documents D1, D3, D4, D5 and the query shown as vectors in a concept space; visible axis labels: Sports, Education]
CS@UVa CS 4501: Information Retrieval
• Weight in query indicates importance of the concept
• Weight in doc indicates how well the concept characterizes the doc
• How to define the similarity/distance measure
• Weights can be assigned automatically and accurately
• Existing solutions
• Terms or N-grams, i.e., bag-of-words
• Topics, i.e., topic model
•Important!
•Why?
• Query side: not all terms are equally important
• Doc side: some terms carry more information about the content
•How?
• Two basic heuristics
• TF (Term Frequency) = Within-doc-frequency
• IDF (Inverse Document Frequency)
• TF-IDF
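A minimal sketch of the two heuristics combined. It assumes raw within-document counts for TF and the variant IDF(t) = 1 + ln(N / df(t)); other TF and IDF variants are equally common, and the toy documents and function names are made up for illustration.

```python
import math
from collections import Counter

# Toy tokenized collection (made up for illustration).
docs = [
    ["obama", "healthcare", "bill"],
    ["obama", "news"],
    ["sports", "news", "news"],
]

N = len(docs)  # total number of docs in the collection
# Document frequency: number of docs containing each term.
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Assumed variant: 1 + ln(N / df). Rare terms get larger weights."""
    return 1.0 + math.log(N / df[term])

def tf_idf(doc):
    """Weight each term by (within-doc frequency) * IDF."""
    tf = Counter(doc)
    return {term: count * idf(term) for term, count in tf.items()}

weights = [tf_idf(doc) for doc in docs]
```

In the first document, "healthcare" (found in one doc) ends up with a larger weight than "obama" (found in two), which matches the intuition that some terms carry more information about the content.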
Non-linear scaling
Total number of docs in collection
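These two labels read like annotations on an IDF formula. One widely used form that matches them is given here as an assumption, with $\mathrm{df}(t)$ denoting the number of docs that contain term $t$ (a symbol introduced for this sketch):

```latex
\mathrm{IDF}(t) = 1 + \log\!\left(\frac{N}{\mathrm{df}(t)}\right)
```

Here $N$ is the total number of docs in the collection, and the logarithm supplies the non-linear scaling: a term that is ten times rarer gains a constant additive boost rather than a tenfold one.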
[Figure: TF-IDF space with axes Finance, Sports, and Education; documents D1 to D5 and the query shown as vectors]
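With documents and the query in the same TF-IDF space, "Relevance = Similarity" turns ranking into a vector comparison. The sketch below uses cosine similarity, which is one common choice and an assumption here; the three-dimensional vectors are invented for illustration and are not read off the figure.

```python
import math

# Axis order for every vector below (mirrors the figure's axis labels).
AXES = ("Finance", "Sports", "Education")

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF weights per axis.
query = (0.0, 1.0, 1.0)
docs = {
    "D1": (0.1, 0.8, 0.9),
    "D2": (1.0, 0.1, 0.0),
    "D3": (0.2, 1.0, 0.1),
}

# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

Cosine similarity ignores vector length, so a long document is not favored merely for containing more words.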
• http://www.cs.virginia.edu/~hw5x/Course/IR2015/_site/lectures/
• https://nlp.stanford.edu/IR-book/newslides.html
• https://course.ccs.neu.edu/cs6200s14/slides.html