3 Retrieval Models


INFORMATION

RETRIEVAL
(3170718 )
Prepared by
Prof. Karuna Patel
Assistant Professor
Computer Engineering Department
C. K. Pithawala College of Engineering and Technology, Surat
Unit-3 Retrieval Models

2
Outline
• Boolean
• Vector space
• TF-IDF
• Okapi (BM25)
• Probabilistic
• Language modeling
• Latent semantic indexing
• Vector space scoring
• The cosine measure
• Efficiency considerations
• Document length normalization
• Relevance feedback and query expansion
What is a Retrieval Model?
• Defines a way to represent the contents of a document and a query.
• Defines a way to compare a document representation to a query representation so as to produce a
document ranking.
• An idealization or abstraction of an actual process (retrieval)
• Results in a measure of similarity between query and document
• May describe the computational process
• e.g. how documents are ranked
• Note that an inverted file is an implementation, not a model
• May attempt to describe the human process
• e.g. the information need, search strategy, etc.

4
• Retrieval variables:
• queries, documents, terms, relevance judgements, users, information needs
Different Types of Models
1. Boolean model
2. Vector space model
3. TF-IDF model
4. Okapi (BM25)
5. Probabilistic model

5
Boolean Model
• The Boolean model of information retrieval is a classical information retrieval (IR) model and, at the
same time, the first and most widely adopted one.
• Retrieving documents that satisfy a Boolean expression constitutes the Boolean exact-match
retrieval model
• The query specifies precise retrieval criteria
• Every document either matches or fails to match the query
• The result is a set of documents (no order)

• In the Boolean retrieval model we can pose any query in the form of a Boolean expression of terms,
i.e. one in which terms are combined with the operators AND, OR, and NOT.
• Output: a document is either relevant or not. No partial matches or ranking.

6
Example:Find the relevant document using Boolean Model
d1=big cats are nice and funny.
d2=small dogs are better than big dogs.
d3=small cats are afraid of small dogs.
d4=big cats are not afraid of small dogs.
d5=funny cats are not afraid of small dogs.

Q:1 Retrieve all documents with funny and dog


Q:2 Retrieve all documents with big and dog and not funny

7
term     d1  d2  d3  d4  d5
big       1   1   0   1   0
cat       1   0   1   1   1
nice      1   0   0   0   0
funny     1   0   0   0   1
small     0   1   1   1   1
dog       0   1   1   1   1
better    0   1   0   0   0
than      0   1   0   0   0
afraid    0   0   1   1   1
not       0   0   0   1   1

(an entry is 1 if the document in column d contains the term in row t, 0 otherwise)

8
1. funny ∧ dog
Dfunny = {d1, d5}, Ddog = {d2, d3, d4, d5}
Dfunny ∧ Ddog = Dfunny ∩ Ddog = {d5}

2. big ∧ dog ∧ (¬funny)

Dbig = {d1, d2, d4}, Ddog = {d2, d3, d4, d5}
D′funny = {d2, d3, d4} (the complement of Dfunny)
Dbig ∧ Ddog ∧ (¬funny) = Dbig ∩ Ddog ∩ D′funny
= {d2, d4}
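The two queries can be evaluated with set operations over a simple inverted index. A minimal sketch in Python (the index stores tokens verbatim from the documents, so the table's "dog" row corresponds to the token "dogs"):

```python
# The five example documents.
docs = {
    "d1": "big cats are nice and funny",
    "d2": "small dogs are better than big dogs",
    "d3": "small cats are afraid of small dogs",
    "d4": "big cats are not afraid of small dogs",
    "d5": "funny cats are not afraid of small dogs",
}

# Build an inverted index: term -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_docs = set(docs)

# Q1: funny AND dog -> intersection of posting sets.
q1 = index["funny"] & index["dogs"]
# Q2: big AND dog AND NOT funny -> intersection with a complement.
q2 = index["big"] & index["dogs"] & (all_docs - index["funny"])

print(sorted(q1))  # ['d5']
print(sorted(q2))  # ['d2', 'd4']
```

Note that the result is an unordered set, exactly as the model prescribes; any ranking would have to be imposed afterwards.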

9
Advantages:
1. Very efficient and easy to implement
2. Predictable, easy to explain
3. Structured queries
4. Works well when the searcher knows exactly what is wanted

Disadvantages:
1. Difficult to create good Boolean queries
2. Documents that are "close" are not retrieved
3. No ranking

10
Vector Space Model
• It is also called the 'term vector model' or 'vector processing model'.
• Represents both documents and queries by term sets and compares global similarities between
queries and documents.
• Used in information filtering, information retrieval, indexing and relevancy ranking.
• Any text object can be represented by a term vector
Examples: documents, queries, sentences, ...
A query is viewed as a short document.
• Similarity is determined by the relationship between two vectors.
e.g. the cosine of the angle between the vectors, or the distance between the vectors.
• The SMART system:
Developed at Cornell University, 1960-1999
Still widely used

11
12
• Represent both documents and queries by word-histogram vectors.
• N: the number of unique words.
• A query q = (q1, q2, ..., qN)
qi = occurrences of the i-th word in the query
• A document dk = (dk,1, dk,2, ..., dk,N)
dk,i = occurrences of the i-th word in the document
• Similarity of a query q to a document dk:
sim(q, dk) = (q · dk) / (|q| |dk|)

13
Example:
Queries can be represented as vectors in the same way as documents.
using cosine measure:
For two vectors q and d the cosine similarity between q and d is given as

14
15
16
1. d = (2,1,1,1,0), d′ = (0,0,0,1,0)
d · d′ = 2×0 + 1×0 + 1×0 + 1×1 + 0×0 = 1
|d| = √(2² + 1² + 1² + 1² + 0²) = √7 = 2.646
|d′| = √(0² + 0² + 0² + 1² + 0²) = √1 = 1
similarity = 1 / (1 × 2.646) = 0.378
2. d = (1,0,0,0,1), d′ = (0,0,0,1,0)
d · d′ = 1×0 + 0×0 + 0×0 + 0×1 + 1×0 = 0
|d| = √(1² + 0² + 0² + 0² + 1²) = √2
|d′| = √(0² + 0² + 0² + 1² + 0²) = √1 = 1
similarity = 0 / (√2 × 1) = 0
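The two worked examples can be reproduced with a small cosine-similarity function:

```python
import math

def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

d_prime = (0, 0, 0, 1, 0)
print(round(cosine((2, 1, 1, 1, 0), d_prime), 3))  # 0.378
print(cosine((1, 0, 0, 0, 1), d_prime))            # 0.0
```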

17
• H.W.
1 d1=(1,3) d2=(10,30) d3=(3,1)
Find the similarity between d1 and d2 and d1 and d3.

18
Advantages:
• Simple model based on linear algebra.
• Term weights not binary.
• Allows computing a continuous degree of similarity between queries and documents.
• Allows ranking documents according to their possible relevance.
• Allows partial matching.

Disadvantages:
• Long documents are poorly represented because they have poor similarity values (a small scalar product
and a large dimensionality)
• Search keywords must precisely match document terms; word substrings might result in a "false positive
match"

19
Term Weights: Term Frequency
• More frequent terms in a document are more important, i.e. more indicative of the topic.

fij=frequency of term i in document j

• May want to normalize term frequency(tf) across the entire corpus:

tfij=fij/max{fij}

20
Term Weights:Inverse Document Frequency
• Terms that appear in many different documents are less indicative of overall topic.
dfi = document frequency of term i
= number of documents containing term i
idfi = inverse document frequency of term i,
=log(N/dfi)
(N:total number of documents)
• An indication of a term’s discrimination power
• Log used to dampen the effect relative to tf.

21
TFIDF
TF: Term Frequency
IDF: inverse Document Frequency
TF = (number of occurrences of the word in the sentence) / (total number of words in the sentence)

IDF = log((number of sentences) / (number of sentences containing the word))

TF*IDF

22
Example:1
Sent 1-good boy
Sent 2-good girl
Sent 3-boy girl good

word frequency
good 3
boy 2
girl 2

23
TF
Sent 1 sent 2 Sent 3
good 1/2 1/2 1/3
boy 1/2 0 1/3
girl 0 1/2 1/3

IDF
words IDF
good log(3/3)=0
boy log(3/2)
girl log(3/2)

24
TF*IDF

        good   boy            girl
sent 1  0      1/2 × log(3/2)  0
sent 2  0      0               1/2 × log(3/2)
sent 3  0      1/3 × log(3/2)  1/3 × log(3/2)
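Example 1 can be reproduced in a few lines; base-10 logarithms are assumed here, matching the table's log(3/2) entries:

```python
import math

# The three "sentences" of Example 1.
sents = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
vocab = ["good", "boy", "girl"]
N = len(sents)

def tf(term, sent):
    # Occurrences of the term divided by sentence length.
    return sent.count(term) / len(sent)

def idf(term):
    # log(N / number of sentences containing the term), base 10 assumed.
    df = sum(1 for s in sents if term in s)
    return math.log10(N / df)

tfidf = [[round(tf(t, s) * idf(t), 3) for t in vocab] for s in sents]
for row in tfidf:
    print(row)
# 'good' appears in every sentence, so its idf is log(3/3) = 0:
# [0.0, 0.088, 0.0]
# [0.0, 0.0, 0.088]
# [0.0, 0.059, 0.059]
```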

25
Example:2
Sent 1-Think like fox
Sent 2-Play like rabbit
Sent 3-Eat like horse
words frequency

think 1
like 3
fox 1
play 1
rabbit 1
eat 1
horse 1
26
TF
TF Sent 1 Sent 2 Sent 3
Think 1/3 0/3 0/3
like 1/3 1/3 1/3
fox 1/3 0/3 0/3
play 0/3 1/3 0/3
rabbit 0/3 1/3 0/3
eat 0/3 0/3 1/3
horse 0/3 0/3 1/3

27
IDF
words IDF
Think Log(3/1)
like Log(3/3)
fox Log(3/1)
play Log(3/1)
rabbit Log(3/1)
eat Log(3/1)
horse Log(3/1)

28
TF*IDF

29
• H.W.
1.Find the TFIDF for following table.

30
OKAPI (BM25)
• It is also known as "best matching".
• It is a ranking function used by search engines to estimate the relevance of documents to a given
search query.
• It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen
E. Robertson and Karen Spärck Jones.
• The name of the actual ranking function is BM25.
• The fuller name, Okapi BM25, includes the name of the first system to use it, which was the Okapi
information retrieval system.
• BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms
appearing in each document, regardless of their proximity within the document.
• It is a family of scoring functions with slightly different components and parameters.
• One of the most prominent instantiations of the function is as follows.

31
Given a query Q containing keywords q1, ..., qn, the BM25 score of a document D is:

score(D, Q) = Σi IDF(qi) × f(qi, D) × (k1 + 1) / (f(qi, D) + k1 × (1 − b + b × |D| / avgdl))

• where f(qi, D) is qi's term frequency in the document D.
• |D| = the length of the document D in words.
• avgdl = the average document length in the text collection from which documents are drawn.
• k1, b = free parameters, usually k1 ∈ [1.2, 2.0] and b = 0.75.
• IDF(qi) = the inverse document frequency weight of the query term qi.
• It is usually computed as:

IDF(qi) = ln(1 + (N − n(qi) + 0.5) / (n(qi) + 0.5))

32
• where N is the total number of documents in the collection and n(qi) is the number of documents
containing qi.
Example 1:
A search query "president lincoln" is being scored against a document D. The terms "president" and
"lincoln" both appear just once in D. The length of this document D is 90% of the average length of
all documents in the corpus. There are 40,000 documents that contain the term "president" and there
are 300 documents that contain the term "lincoln". Assume that the IDF for all query terms is 1, and
k = 1.2 and b = 0.75. What is the BM25 score for the query against the document D?
Solution:
BM25(Q) = BM25("president") + BM25("lincoln")
= 1 × 1 × (k+1) / (1 + k(1 − b + b×0.9)) + 1 × 1 × (k+1) / (1 + k(1 − b + b×0.9))
= (1.2+1) / (1 + 1.2(1 − 0.75 + 0.75×0.9)) + (1.2+1) / (1 + 1.2(1 − 0.75 + 0.75×0.9))
= 1.043 + 1.043
≈ 2.09
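Example 1 can be checked with a few lines of Python; this is a sketch of the per-term BM25 contribution under the stated assumptions (tf = 1 for both terms, IDF fixed at 1, |D|/avgdl = 0.9):

```python
def bm25_term(tf, idf, dl_ratio, k1=1.2, b=0.75):
    """One term's BM25 contribution: idf * tf*(k1+1) / (tf + k1*(1-b+b*|D|/avgdl))."""
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl_ratio))

# "president" and "lincoln" each appear once, IDF assumed 1, doc is 90% of avg length.
score = bm25_term(1, 1, 0.9) + bm25_term(1, 1, 0.9)
print(round(score, 2))  # 2.09
```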

33
Example 2:

34
• Query with two terms, "president lincoln" (qf = 1). No relevance information (r and R are
zero). N = 500,000 documents.
"president" occurs in 40,000 documents (n1 = 40,000)
"lincoln" occurs in 300 documents (n2 = 300)
"president" occurs 15 times in the document (f1 = 15)
"lincoln" occurs 25 times in the document (f2 = 25)
Document length is 90% of the average length (dl/avdl = 0.9)
k1 = 1.2, b = 0.75, and k2 = 100
K = 1.2 × (0.25 + 0.75 × 0.9) = 1.11

35
• For "president" (n1 = 40,000):
IDF = ln(1 + (N − n1 + 0.5) / (n1 + 0.5))
= ln(1 + (500,000 − 40,000 + 0.5) / (40,000 + 0.5))
= ln(1 + 459,960.5 / 40,000.5)
= ln(12.50)
= 2.53

36
• For "lincoln" (n2 = 300):
IDF = ln(1 + (N − n2 + 0.5) / (n2 + 0.5))
= ln(1 + (500,000 − 300 + 0.5) / (300 + 0.5))
= ln(1 + 499,700.5 / 300.5)
= ln(1664.0)
= 7.42
(The query-frequency component qf(k2 + 1)/(k2 + qf) = 1 × 101/101 = 1 for qf = 1, so it does not change the score.)
BM25(D, Q) = 2.53 × 15 × (1.2 + 1) / (1.11 + 15) + 7.42 × 25 × (1.2 + 1) / (1.11 + 25)
= 2.53 × 33 / 16.11 + 7.42 × 55 / 26.11
= 2.53 × 2.048 + 7.42 × 2.106
= 5.18 + 15.63
≈ 20.81

37
Advantages:
1. Very good retrieval performance.
2. Well tunable to different retrieval scenarios.
3. Most terms can be precomputed at indexing time.

Disadvantages:
1. Departure from the theoretical probabilistic foundation.
2. BM25 can actually be viewed as an empirical (probabilistic) model.

38
Probabilistic Model

39
D t1 t2
D1 1 0
D2 0 1
D3 1 0
D4 1 1
D5 0 1

P(t1|rel) = 1/2    P(rel) = 2/5
P(t1|rel′) = 2/3   P(rel′) = 3/5
P(t2|rel) = 2/2
P(t2|rel′) = 1/3
(here D4 and D5 are taken as the relevant documents)
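These estimates follow directly from the binary matrix, assuming D4 and D5 form the relevant set (which is what the slide's numbers imply):

```python
# Binary term incidence from the table above.
docs = {"D1": {"t1"}, "D2": {"t2"}, "D3": {"t1"},
        "D4": {"t1", "t2"}, "D5": {"t2"}}
relevant = {"D4", "D5"}          # assumed relevance judgements
nonrelevant = set(docs) - relevant

def p(term, doc_set):
    """Fraction of the documents in doc_set that contain the term."""
    return sum(1 for d in doc_set if term in docs[d]) / len(doc_set)

print(p("t1", relevant))              # P(t1|rel)  = 0.5
print(p("t2", relevant))              # P(t2|rel)  = 1.0
print(p("t1", nonrelevant))           # P(t1|rel') = 2/3
print(len(relevant) / len(docs))      # P(rel)     = 0.4
```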

40
41
42
Advantage:
Documents are ranked in decreasing order of their probability of relevance.

Disadvantage:
We need to predict the initial separation of documents into relevant and non-relevant.

43
Language Modeling

44
Language Modeling(cont...)
The language model provides context to distinguish between words and phrases that sound similar.
For example, in American English the phrases "recognize speech" and "wreck a nice beach" sound
similar, but mean different things.
Finite automata and language models:
• A language model is a function that puts a probability measure over strings drawn from some
vocabulary; that is, a language model M over an alphabet Σ assigns a probability to every string
drawn from Σ.

• One simple kind of language model is equivalent to a probabilistic finite automaton consisting of
just a single node with a single probability distribution over producing different terms, so that
Σ_{t∈V} P(t) = 1.

45
Why a Language Model?
• Suppose a machine is required to translate: "The human race".
• The word "race" has at least 2 meanings; which one to choose?
• Obviously the choice depends on the "history" or the "context" preceding the word "race",
e.g. "The human race" versus "The dogs race".
• A statistical language model can resolve this ambiguity by giving higher probability to the correct
meaning.

46
A simple finite automaton and some of the strings in the language it generates. → shows the
start state of the automaton and a double circle indicates a (possible) finishing state.

47
To find the probability of a word sequence we just multiply the probabilities which the
model gives to each word in the sequence, together with the probability of continuing or
stopping after producing each word.
For example,
P(frog said that toad likes frog) = (0.01 × 0.03 × 0.04 × 0.01 × 0.02 × 0.01) × (0.8 × 0.8
× 0.8 × 0.8 × 0.8 × 0.2)
≈ 0.000000000001573
(five continuation probabilities of 0.8 between the six words, and one stop probability of 0.2 at the end)
48
Uses of Language Model
• Speech recognition:
“I ate a cherry”is a more likely sentence than “Eye eight uh Jerry”.
• OCR & Handwriting recognition:
More probable sentences are more likely correct readings.
• Machine translation:
More likely sentences are probably better translations.
• Generation:
More likely sentences are better NL generations.
• Context sensitive spelling correction:
“Their are problems wit this sentence”.

49
Types of Language Models
• How do we build probabilities over sequences of terms? We can always use the chain rule to
decompose the probability of a sequence of events into the probability of each successive event
conditioned on earlier events:
P(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t1t2)P(t4|t1t2t3)
• The simplest form of language model simply throws away all conditioning context and estimates
each term independently. Such a model is called a unigram language model:
Puni(t1t2t3t4) = P(t1)P(t2)P(t3)P(t4)
• There are many more complex kinds of language models, such as bigram
Pbi(t1t2t3t4) = P(t1)P(t2|t1)P(t3|t2)P(t4|t3)
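The unigram and bigram decompositions can be sketched with counts over a toy corpus; the corpus and the resulting probabilities here are illustrative assumptions, not from the slides:

```python
from collections import Counter

corpus = "did you call your mother did you call your doctor".split()
unigram = Counter(corpus)                    # term counts
bigrams = Counter(zip(corpus, corpus[1:]))   # adjacent-pair counts
total = len(corpus)

def p_uni(seq):
    """Puni(t1..tn) = P(t1) P(t2) ... P(tn), each term independent."""
    prob = 1.0
    for w in seq:
        prob *= unigram[w] / total
    return prob

def p_bi(seq):
    """Pbi(t1..tn) = P(t1) P(t2|t1) ... P(tn|tn-1)."""
    prob = unigram[seq[0]] / total
    for prev, w in zip(seq, seq[1:]):
        prob *= bigrams[(prev, w)] / unigram[prev]
    return prob

print(round(p_uni(["did", "you", "call"]), 3))  # 0.008: each word has P = 2/10
print(round(p_bi(["did", "you", "call"]), 2))   # 0.2: only the first word is uncertain
```

The bigram model assigns this phrase a much higher probability because, in the toy corpus, "you" always follows "did" and "call" always follows "you".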

50
Example:
1. "Did you call your ...?"
-How can we guess or predict the next word?
-Possible words to follow:
mother,doctor,child...
-unlikely words to follow:
dinosaur,oven.
Estimate
P(w|did you call your...)
Assign a probability to each possible next word, to predict the next word:
P(mother|Did you call your...)
P(dinosaur|Did you call your...)

51
Query Likelihood Model

• Language modeling is a quite general formal approach to IR, with many variant realizations.
• The original and basic method for using language models in IR is the query likelihood model.
• In it, we construct from each document d in the collection a language model Md.
• Our goal is to rank documents by P(d|q), where the probability of a document is interpreted as
the likelihood that it is relevant to the query. Using Bayes' rule we have:

P(d|q) = P(q|d) P(d) / P(q)

P(q) is the same for all documents, and so can be ignored.


The prior probability of a document P(d) is often treated as uniform across all d, and so it can also be
ignored.

52
• P(q|Md) is the probability of the query q under the language model derived from d.
• The language modeling approach thus attempts to model the query generation process:
documents are ranked by the probability that the query would be observed as a random
sample from the respective document model.

• The multinomial coefficient for the query is henceforth ignored, since it is a constant
for a particular query.
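A minimal sketch of query-likelihood ranking over three toy documents (the collection reuses the "gold/silver/truck" documents from the LSI example later in this unit; the unsmoothed maximum-likelihood estimate is an assumption made for brevity):

```python
docs = {
    "d1": "shipment of gold damaged in a fire".split(),
    "d2": "delivery of silver arrived in a silver truck".split(),
    "d3": "shipment of gold arrived in a truck".split(),
}
query = "gold truck".split()

def query_likelihood(q, doc):
    """P(q|Md) under a maximum-likelihood unigram model of doc (no smoothing)."""
    prob = 1.0
    for term in q:
        prob *= doc.count(term) / len(doc)
    return prob

ranked = sorted(docs, key=lambda d: query_likelihood(query, docs[d]), reverse=True)
print(ranked[0])  # d3: the only document containing both 'gold' and 'truck'
```

In practice the document model is smoothed (e.g. with collection statistics), since any query term missing from a document would otherwise zero out the whole product, as it does for d1 and d2 here.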

53
Latent Semantic Indexing
• Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical
technique called singular value decomposition (SVD) to identify patterns in the relationships
between the terms and concepts contained in an unstructured collection of text.
• LSI is an automatic indexing method that projects both documents and terms into a low-
dimensional space which, by intent, represents the semantic concepts in the document.
• By projecting documents into the semantic space, LSI enables the analysis of documents at a
conceptual level, purportedly overcoming the drawbacks of purely term-based analysis.
• LSI addresses the issues of synonymy and polysemy that plague term-based information retrieval.
• LSI is based on the singular value decomposition (SVD) of the term-document matrix, which
constructs a low-rank approximation of the original matrix while preserving the similarity between the
documents.

54
Latent Semantic Indexing:Example
• A “collection” consists of the following “documents”:
d1: Shipment of gold damaged in a fire.
d2: Delivery of silver arrived in a silver truck.
d3: Shipment of gold arrived in a truck.
• Suppose that we use the term frequency as term weights and query weights.
The following document indexing rules are also used:
• stop words were not ignored
• text was tokenized and lowercased
• no stemming was used
• terms were sorted alphabetically

55
Problem: Use Latent Semantic Indexing (LSI) to rank these documents for the query
gold silver truck.
Step 1: Set term weights and construct the term-document matrix A and query matrix:

56
Step 2: Decompose matrix A and find the U, S and V matrices, where
A = U S Vᵀ

57
58
Step 4: Find the new document vector coordinates in this reduced 2-dimensional space.
• Rows of V hold the eigenvector values. These are the coordinates of the individual document
vectors, hence
d1(-0.4945, 0.6492)
d2(-0.6458, -0.7194)
d3(-0.5817, 0.2469)
Step 5: Find the new query vector coordinates in the reduced 2-dimensional space.
• q′ = qᵀ Uk Sk⁻¹
• Note: these are the new coordinates of the query vector in two dimensions.
• Note how this vector is now different from the original query matrix q given in Step 1.
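Steps 2, 4 and 5 can be sketched with NumPy (assuming it is available); the small term-document matrix here is an illustrative stand-in, not the one from the worked example, so the coordinates differ:

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
A = np.array([
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 2.0, 1.0],
    [1.0, 0.0, 0.0],
])

# Step 2: full SVD, A = U S V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep k = 2 singular values (the rank-k approximation).
k = 2
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

# Step 4: rows of Vk are the document coordinates in the reduced space.
# Step 5: fold the query into the same space, q' = q^T Uk Sk^-1.
q = np.array([0.0, 1.0, 1.0, 0.0])
q_k = q @ Uk @ np.linalg.inv(Sk)

# Step 6: cosine similarity of the query against each reduced document vector.
sims = [float(q_k @ d / (np.linalg.norm(q_k) * np.linalg.norm(d))) for d in Vk]
print([round(v, 2) for v in sims])
```

The signs of the SVD factors are not unique, so individual coordinates may differ in sign from one library to another; the cosine similarities and the resulting ranking are what matter.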

59
60
Step 6: Rank documents in decreasing order of query-document cosine similarities.

61
• LSI works best in applications where there is little overlap between queries and documents.

62
Advantages:
1. The entire training set can be learned at the same time.
2. No intermediate model needs to be built.
3. Good when the training set is predefined.

Disadvantages:
1. When a new document is added, the matrix X changes and the SVD needs to be re-calculated.
2. Time consuming.
3. A real classifier needs the ability to change the training set.

63
Vector Space Scoring
• The representation of a set of documents as vectors in a common vector space is known as
the vector space model and is fundamental to a host of information retrieval operations, ranging
from scoring documents on a query to document classification.
• We need a way of assigning a score to a query/document pair
• Let's start with a one-term query
If the query term does not occur in the document: the score should be 0
If the query term occurs in the document: score 1
• For a multi-term query
View the query as well as the document as sets of words
Compute some similarity measure between the two sets

64
Score(q, d) = Σ_{t ∈ q ∩ d} tf-idf(t, d)

• There are many variants:

How "tf" is computed (with/without logs)
Whether the terms in the query are also weighted

• Documents as vectors: we have a |V|-dimensional vector space. Terms are the axes of the space;
documents are points or vectors in this space.

• Very high-dimensional space: tens of millions of dimensions in the case of a web search engine.
These are very sparse vectors - most entries are zero.

65
Queries as vectors
Key idea 1: Do the same for queries: represent queries as vectors in the space
Key idea 2: Rank documents according to their proximity to the query in this space
proximity = similarity of vectors
proximity ≈ inverse of distance

66
Document Length Normalization

Using the dot product of the (unit) vectors:

cos(q, d) = (q · d) / (|q| |d|) = Σ_{i=1..|V|} qi di / (√(Σ_{i=1..|V|} qi²) × √(Σ_{i=1..|V|} di²))

qi is the tf-idf weight of term i in the query
di is the tf-idf weight of term i in the document

cos(q, d) is the cosine similarity of q and d ... or equivalently the cosine of the angle between q and
d.
Document length normalization adjusts the term frequency or the relevance score in order to
normalize the effect of document length on the document ranking.
67
Cosine for length-normalized vectors
• For length-normalized vectors cosine similarity is simply the dot product (or scalar product):

for q, d length-normalized.

68
Cosine similarity amongst 3 documents
• How similar are the novels
• SaS: Sense and Sensibility
• PaP: Pride and Prejudice
• WH: Wuthering Heights?

Term frequencies (counts):

term       SaS  PaP  WH
affection  115   58  20
jealous     10    7  11
gossip       2    0   6
wuthering    0    0  38

69
Log frequency weighting:

term       SaS   PaP   WH
affection  3.06  2.76  2.30
jealous    2.00  1.85  2.04
gossip     1.30  0     1.78
wuthering  0     0     2.58

After length normalization:

term       SaS    PaP    WH
affection  0.789  0.832  0.524
jealous    0.515  0.555  0.465
gossip     0.335  0      0.405
wuthering  0      0      0.588

cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0 + 0 × 0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
70
tf-idf weighting has many variants

Columns headed ‘n’ are acronyms for weight schemes.

71
Weighting may differ in queries vs documents
• Many search engines allow for different weightings for queries vs. documents
• SMART Notation: denotes the combination in use in an engine with the notation ddd.qqq, using
the acronyms from the previous table
• A very standard weighting scheme is: lnc.ltc
• Document: logarithmic tf (l as first character), no idf and cosine normalization

• Query: logarithmic tf (l in leftmost column), idf (t in second column), cosine normalization …

72
• Represent the query as a weighted tf-idf vector
• Represent each document as a weighted tf-idf vector
• Compute the cosine similarity score for the query vector and each document vector
• Rank documents with respect to the query by score
• Return the top K (e.g., K = 10) to the user

73
Example
• We now consider the query best car insurance on a fictitious collection with N = 1,000,000
documents where the document frequencies of auto, best, car and insurance are respectively 5000,
50000, 10000 and 1000.

In this example the weight of a term in the query is simply the idf (and zero for a term not in the query, such as
auto); this is reflected in the column header wt,q (the entry for auto is zero because the query does not contain the
term auto). For documents, we use tf weighting with no use of idf but with Euclidean normalization. The former is
shown under the column headed wf, while the latter is shown under the column headed wt,d. This gives a net
score of 0 + 0 + 0.82 + 2.46 = 3.28.
74
Relevance Feedback and Query Expansion
• The same word can have different meanings (polysemy)
• Two different words can have the same meaning (synonymy)
• The vocabulary of the searcher may not match that of the documents.
• Consider the query = {plane fuel}
• While this is relatively unambiguous (w.r.t. the meaning of each word in context), exact matching will
miss documents containing aircraft, airplane, or jet.
• Relevance feedback and query expansion aim to overcome the problem of synonymy.

75
• Global methods are techniques for expanding or reformulating the query:
Query expansion/reformulation with a thesaurus
Query expansion via automatic thesaurus generation
Techniques like spelling correction
• Local methods adjust a query relative to the documents that initially appear to match the query. The
basic methods here are:
Relevance feedback
Pseudo relevance feedback, also known as blind relevance feedback
(Global) indirect relevance feedback

76
• The idea of relevance feedback (RF) is to involve the user in the retrieval process so as to improve
the final result set. The basic procedure is:
• The user issues a (short, simple) query.
• The system returns an initial set of retrieval results.
• The user marks some returned documents as relevant or non-relevant.
• The system computes a better representation of the information need based on the user feedback.
• The system displays a revised set of retrieval results.

77
Rocchio Algorithm: Relevance Feedback
• We want to find a query vector, denoted q, that maximizes similarity with relevant documents
while minimizing similarity with non-relevant documents.

• where sim(q, Cr) is the similarity between a query q and the set of relevant documents Cr. Under
cosine similarity, the optimal query vector q_opt for separating the relevant and non-relevant documents
is:

78
• which is the difference between the centroid of the relevant and the centroid of the non-relevant
document vectors.

Figure: The Rocchio optimal query for separating relevant and non-relevant documents.
79
Figure: An application of Rocchio's algorithm: some documents are labeled as relevant and
non-relevant, and the initial query vector is moved according to this feedback.

80
• However, we usually do not know the full relevant and non-relevant sets.
• For example, a user might only label a few documents as relevant.
• Therefore, in practice Rocchio is often parametrised as follows:

qm = α·q0 + β·(1/|Dr|) Σ_{dj∈Dr} dj − γ·(1/|Dnr|) Σ_{dj∈Dnr} dj

• where α, β, and γ are weights attached to each component (Dr = known relevant documents,
Dnr = known non-relevant documents).

81
Rocchio Algorithm:Example

82
Example-1
Q = "news about presidential campaign"
Q = (1, 1, 1, 1, 0, 0, ...)
D1 = news about ...
−D1 = (1.5, 0.1, 0, 0, 0, 0, ...)
D2 = news about organic food ...
−D2 = (1.5, 0.1, 0, 2.0, 2.0, 0, ...)
D3 = news of presidential campaign ...
+D3 = (1.5, 0, 3.0, 2.0, 0, 0, ...)
D4 = news of presidential campaign ...
+D4 = (1.5, 0, 4.0, 2.0, 0, 0, ...)
83
D5 = news of organic food campaign ...
−D5 = (1.5, 0, 0, 6.0, 2.0, 0, ...)
Here D3, D4 are positive documents (i.e. relevant documents);
D1, D2, D5 are negative documents (i.e. non-relevant documents).
Now find the centroid vectors:
(+) positive centroid = [(1.5+1.5)/2, 0, (3.0+4.0)/2, (2.0+2.0)/2, 0, 0, ...]
= (1.5, 0, 3.5, 2.0, 0, 0)
(−) negative centroid = [(1.5+1.5+1.5)/3, (0.1+0.1+0)/3, 0, (0+2.0+6.0)/3, (0+2.0+2.0)/3, 0, ...]
= (1.5, 0.067, 0, 2.67, 1.33, 0)
New query:
Q′ = (α×1 + β×1.5 − γ×1.5, α×1 − γ×0.067, α×1 + β×3.5, α×1 + β×2.0 − γ×2.67, −γ×1.33, 0, ...)
Hence we can say that we get a better result after query modification, i.e. relevance feedback
and query expansion.
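The centroid computation and the Rocchio update for Example 1 can be sketched as follows; the parameter values α = 1.0, β = 0.75, γ = 0.15 are the "reasonable" defaults the slides mention, assumed here since the example leaves them symbolic:

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """q' = alpha*q + beta*centroid(rel) - gamma*centroid(nonrel)."""
    c_rel, c_non = centroid(rel), centroid(nonrel)
    return [alpha * qi + beta * r - gamma * s
            for qi, r, s in zip(q, c_rel, c_non)]

q = [1, 1, 1, 1, 0, 0]
rel = [[1.5, 0, 3.0, 2.0, 0, 0],          # +D3
       [1.5, 0, 4.0, 2.0, 0, 0]]          # +D4
nonrel = [[1.5, 0.1, 0, 0, 0, 0],         # -D1
          [1.5, 0.1, 0, 2.0, 2.0, 0],     # -D2
          [1.5, 0, 0, 6.0, 2.0, 0]]       # -D5

print(centroid(rel))          # [1.5, 0.0, 3.5, 2.0, 0.0, 0.0]
print(rocchio(q, rel, nonrel))
```

Negative components of the modified query (such as the fifth one here) are often clipped to zero in practice, since a negative term weight has no natural interpretation in retrieval.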

84
Example -2 H.W.
Q = cheap car best car insurance
d1 = best car insurance
d2 = cheap auto insurance best insurance
d3 = cheap insurance best car
d4 = cheap car cheap auto
α = 1, β = 0.75, γ = 0.5
Using standard Rocchio algorithm what is the query vector after relevance feedback?

85
Advantages:
• Rocchio has been shown useful for increasing recall.
• Contains aspects of positive and negative feedback.
• Positive feedback is much more valuable (i.e. indications of what is relevant), so γ < β.
• Reasonable values of the parameters are α = 1.0, β = 0.75, γ = 0.15.

Disadvantages:
Relevance feedback is expensive:
• Relevance feedback creates long modified queries.
• Long queries are expensive to process.
• Users are reluctant to provide explicit feedback.
• It is often hard to understand why a particular document was retrieved after applying relevance feedback.

86
Thank You

87
