Chapter - Three


ISR

IR Models
What is a Retrieval Model?
• A retrieval model describes the human and computational processes involved in retrieval
• Example: a model of human information-seeking behavior
• Example: a model of how documents are ranked computationally
• Components: users, information needs, queries, documents, relevance assessments, ...
• Retrieval models define relevance, explicitly or implicitly
Boolean IR Model
• One of the earliest and simplest retrieval methods
• Uses exact matching to match documents to a user "query"
• Answers an information request by finding documents that are "relevant" in the sense that they match the words in the query
• Any number of logical statements can be combined using the three Boolean operators
• Operators: AND, OR, NOT

AND: Finds only documents containing all of the specified words or phrases.

OR: Finds documents containing at least one of the specified words or phrases.

NOT: Excludes documents containing the specified word or phrase.

P      Q      NOT P  P AND Q  P OR Q
False  False  True   False    False
False  True   True   False    True
True   False  False  False    True
True   True   False  True     True

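The three operators map directly onto set operations over the documents that contain each word. The following is a minimal sketch under stated assumptions, not a definitive implementation: the four sample documents and the helper names `build_inverted_index` and `postings` are hypothetical and chosen only for illustration.

```python
def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index


# Hypothetical sample collection (not from the slides).
docs = {
    1: "dog bites man",
    2: "man bites dog",
    3: "cat chases frog",
    4: "dog chases cat",
}
index = build_inverted_index(docs)
all_ids = set(docs)


def postings(term):
    """Return the ids of documents containing the term (empty set if none)."""
    return index.get(term, set())


dog_and_cat = postings("dog") & postings("cat")  # AND: set intersection
dog_or_cat = postings("dog") | postings("cat")   # OR: set union
dog_not_cat = postings("dog") - postings("cat")  # NOT: set difference
not_dog = all_ids - postings("dog")              # NOT alone: complement

print(dog_and_cat)  # {4}
print(dog_or_cat)   # {1, 2, 3, 4}
print(dog_not_cat)  # {1, 2}
print(not_dog)      # {3}
```

A document is either in the result set or not. There is no ranking, which is why exact matching is predictable but cannot return "close" documents.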
Advantages:
• Very efficient
• Predictable, easy to explain
• Structured queries
• Works well when the searcher knows exactly what is wanted

Disadvantages:
• Most people find it difficult to create good Boolean queries
• Precision and recall usually have a strong inverse correlation
• Predictability of results causes people to overestimate recall
• Documents that are “close” to the query but do not match it exactly are not retrieved

Vector Space Model
• A way of representing documents through the words that they contain
• Indexing terms are coordinates in a multidimensional information space
• Term weighting determines the value of each coordinate
• Documents and queries are compared using the cosine similarity measure

• Each document is broken down into a word frequency table
• The tables are called vectors and can be stored as arrays
• A vocabulary is built from all the words in all documents in the system
• Each document is represented as a vector over the vocabulary

Example Vector Space
• Document A: “A dog and a cat.”

  a  dog  and  cat
  2  1    1    1

• Document B: “A frog.”

  a  frog
  1  1
Vector Example (continued)

• The vocabulary contains all words used: a, dog, and, cat, frog
• The vocabulary needs to be sorted: a, and, cat, dog, frog

Vector Example (continued)
• Document A: “A dog and a cat.”
• Vector: (2,1,1,1,0)

  a  and  cat  dog  frog
  2  1    1    1    0

• Document B: “A frog.”
• Vector: (1,0,0,0,1)

  a  and  cat  dog  frog
  1  0    0    0    1
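
The steps of this example (build a sorted vocabulary, count words per document, compare the vectors) can be sketched in a few lines. This is an illustrative sketch, not a definitive implementation: the function names are hypothetical, raw counts are used as term weights, and tokenization simply lowercases the text and keeps alphabetic words.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(docs):
    """Collect every distinct word across all documents, sorted."""
    return sorted({word for doc in docs for word in tokenize(doc)})


def to_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_u * norm_v)


docs = ["A dog and a cat.", "A frog."]
vocabulary = build_vocabulary(docs)
vec_a = to_vector(docs[0], vocabulary)
vec_b = to_vector(docs[1], vocabulary)

print(vocabulary)  # ['a', 'and', 'cat', 'dog', 'frog']
print(vec_a)       # [2, 1, 1, 1, 0]
print(vec_b)       # [1, 0, 0, 0, 1]
print(round(cosine_similarity(vec_a, vec_b), 4))  # 0.5345
```

The two documents share only the word "a", so their dot product is 2 and the similarity is 2 / (√7 · √2) ≈ 0.5345.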

For simplicity, let’s assume three index terms: dog, man, bite (i.e., V = 3)

0 = the term does not appear; 1 = the term appears at least once

        dog  man  bite
doc_1   1    1    1
doc_2   1    0    1
doc_3   0    1    1

• Doc_1: “dog bite man” [1,1,1]
• Doc_2: “dog bite” [1,0,1]
• Doc_3: “man bite” [0,1,1]

[Figure: the three documents plotted as points in a three-dimensional space with axes dog, man, and bite]
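
The binary (presence/absence) representation can be sketched as follows. This is a minimal illustration with a hypothetical helper name, `binary_vector`; it assumes the term order dog, man, bite used for the vectors and takes doc_3 to read "man bite", the text that matches its vector [0,1,1].

```python
# Index terms in the order used for the vectors (V = 3).
index_terms = ["dog", "man", "bite"]


def binary_vector(text, terms):
    """Return 1 for each term that appears at least once, else 0."""
    words = set(text.lower().replace(",", " ").split())
    return [1 if term in words else 0 for term in terms]


docs = {
    "doc_1": "dog bite man",
    "doc_2": "dog bite",
    "doc_3": "man bite",  # assumed text; matches the vector [0, 1, 1]
}
vectors = {name: binary_vector(text, index_terms) for name, text in docs.items()}

for name, vector in vectors.items():
    print(name, vector)
# doc_1 [1, 1, 1]
# doc_2 [1, 0, 1]
# doc_3 [0, 1, 1]
```

Unlike the word frequency vectors, repeating a word changes nothing here: "dog dog bite" would still map to [1, 0, 1].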

Probabilistic Model
The probabilistic retrieval model is based on the Probability Ranking Principle.
Documents are retrieved in order of their probability of being relevant to the query. Mathematically, the scoring function is given by

• P(R = 1 | d, q)

where R = 1 denotes that document d is relevant to query q.

A document is deemed relevant if its probability of being relevant is greater than its probability of being non-relevant:

• P(R = 1 | d, q) > P(R = 0 | d, q)
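
As a rough illustration of the two rules above, the sketch below ranks documents by an estimated probability of relevance and then applies the decision rule. The probability values are invented purely for illustration, and the function names are hypothetical; how such estimates are obtained is outside the scope of this sketch.

```python
def rank_by_relevance(prob_relevant):
    """Order document ids by decreasing P(R = 1 | d, q)."""
    return sorted(prob_relevant, key=prob_relevant.get, reverse=True)


def is_relevant(p):
    """Decision rule: P(R = 1 | d, q) > P(R = 0 | d, q).

    Since P(R = 0 | d, q) = 1 - P(R = 1 | d, q), this is the same as p > 0.5.
    """
    return p > 1 - p


# Hypothetical estimates of P(R = 1 | d, q) for one query.
estimates = {"doc_1": 0.82, "doc_2": 0.35, "doc_3": 0.57}

ranking = rank_by_relevance(estimates)
relevant = [doc for doc in ranking if is_relevant(estimates[doc])]

print(ranking)   # ['doc_1', 'doc_3', 'doc_2']
print(relevant)  # ['doc_1', 'doc_3']
```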

Probabilistic Model (continued)

• Ranking is based on the calculation of probability, not similarity
• Uses a ranking function to order retrieved documents
• May use term frequency data to estimate probability

Basic Probability Principle
Let a, b be two events.

Bayesian formulas

$$p(a \mid b)\,p(b) = p(a \cap b) = p(b \mid a)\,p(a)$$

$$p(a \mid b) = \frac{p(b \mid a)\,p(a)}{p(b)}$$

$$p(a \mid b)\,p(b) = p(b \mid a)\,p(a)$$
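
These identities can be checked numerically. The sketch below uses exact fractions and made-up probabilities for two events a and b; the specific values are assumptions chosen only so the arithmetic is easy to follow.

```python
from fractions import Fraction

# Hypothetical probabilities for two events a and b.
p_a_and_b = Fraction(3, 20)  # p(a ∩ b)
p_a = Fraction(3, 10)        # p(a)
p_b = Fraction(1, 4)         # p(b)

# Conditional probabilities from their definitions.
p_a_given_b = p_a_and_b / p_b  # p(a | b)
p_b_given_a = p_a_and_b / p_a  # p(b | a)

# Product rule: p(a | b) p(b) = p(a ∩ b) = p(b | a) p(a)
assert p_a_given_b * p_b == p_a_and_b == p_b_given_a * p_a

# Bayes' rule: p(a | b) = p(b | a) p(a) / p(b)
bayes = p_b_given_a * p_a / p_b

print(p_a_given_b)  # 3/5
print(p_b_given_a)  # 1/2
print(bayes)        # 3/5
```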

End of Chapter - Three

Questions
