Probabilistic Information Retrieval
Information Retrieval
Fall 2019
Probabilistic Information Retrieval
Material from:
Manning, Raghavan, Schutze: Chapter 11
Supplemented by additional material
Srihari-CSE535-Fall2019
Why probabilities in IR?
(Figure: the core matching problem. A user query is understood as an information need, but the representation of the user's need is uncertain. Documents are turned into document representations, an uncertain guess of whether a document has relevant content. How do we match the two?)
Odds:

$$O(a) = \frac{p(a)}{p(\bar{a})} = \frac{p(a)}{1 - p(a)}$$

Revise the odds (posterior) based on an evidence event b, comparing the cases where a does and does not hold.
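As a quick sketch (in Python, with made-up numbers), converting a probability to odds and revising it with a likelihood ratio for evidence b looks like:

```python
def odds(p):
    """Convert a probability p(a) into odds O(a) = p(a) / (1 - p(a))."""
    return p / (1.0 - p)

def posterior_odds(prior_odds, likelihood_ratio):
    """Bayes' rule in odds form: O(a | b) = O(a) * p(b | a) / p(b | not a)."""
    return prior_odds * likelihood_ratio

prior = odds(0.2)                  # O(a) = 0.2 / 0.8 = 0.25
post = posterior_odds(prior, 3.0)  # evidence b is 3x as likely when a holds
print(post)                        # -> 0.75
```

Working in odds rather than probabilities makes the update a single multiplication, which is why the derivations below stay in odds form.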
The Probability Ranking Principle
If a reference retrieval system's response to each request
is a ranking of the documents in the collection in order of
decreasing probability of relevance to the user who
submitted the request, where the probabilities are
estimated as accurately as possible on the basis of
whatever data have been made available to the system
for this purpose, the overall effectiveness of the system to
its user will be the best that is obtainable on the basis of
those data.
- Theorem: using the PRP is optimal, in that it minimizes the loss (Bayes risk) under 1/0 loss
  - 1/0 loss: you lose a point either for returning a non-relevant doc or for failing to return a relevant doc
  - Provable if all probabilities are correct, etc. [e.g., Ripley 1996]
Probability Ranking Principle
- More complex case: retrieval costs
  - Let d be a document
  - C: cost of retrieving a relevant document
  - C': cost of retrieving a non-relevant document
- Probability Ranking Principle: if

  $$C \cdot p(R \mid d) + C' \cdot (1 - p(R \mid d)) \le C \cdot p(R \mid d') + C' \cdot (1 - p(R \mid d'))$$

  for all documents d' not yet retrieved, then d is the next document to be retrieved
- We won't further consider loss/utility from now on
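The cost-based selection rule above can be sketched directly; the probabilities and costs below are made-up numbers for illustration:

```python
def expected_cost(p_rel, c_rel, c_nonrel):
    """Expected cost of retrieving d: C * p(R|d) + C' * (1 - p(R|d))."""
    return c_rel * p_rel + c_nonrel * (1.0 - p_rel)

# Hypothetical relevance probabilities for the not-yet-retrieved docs
probs = {"d1": 0.8, "d2": 0.3, "d3": 0.5}
C, C_prime = 1.0, 2.0  # non-relevant docs cost more to retrieve

# PRP with costs: retrieve the doc minimizing expected cost next
next_doc = min(probs, key=lambda d: expected_cost(probs[d], C, C_prime))
print(next_doc)  # -> d1 (highest p(R|d) wins whenever C < C')
```

Note that when C < C', minimizing expected cost is the same as ranking by decreasing p(R|d), recovering the plain PRP.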
Probability Ranking Principle
- How do we compute all those probabilities?
  - We do not know the exact probabilities, so we have to use estimates
  - The Binary Independence Retrieval (BIR) model, which we discuss next, is the simplest model
- Questionable assumptions
  - "Relevance" of each document is independent of the relevance of other documents
    - In reality, it is bad to keep returning duplicates
  - Term independence
  - Boolean model of relevance
  - Assumes a single-step information need
    - Seeing a range of results might let the user refine the query
Probabilistic Retrieval Strategy
- Estimate how terms contribute to relevance
  - How do things like tf, df, and document length influence your judgments about document relevance?
  - One answer is the Okapi formulae (S. Robertson)

Derivation notes for the odds ratio that follows:
- Step 1: the left product is over query terms found in the doc; the right product is over query terms not in the doc
- Step 2: manipulate the expression: fold the query terms found in the doc into the right product, and simultaneously divide them out of the left product
Binary Independence Model
$$O(R \mid \vec{q}, \vec{x}) = O(R \mid \vec{q}) \cdot \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} \cdot \prod_{q_i = 1} \frac{1 - p_i}{1 - r_i}$$

The last product is constant for each query; the middle product is the only quantity that must be estimated for ranking.

- Retrieval Status Value:

$$RSV = \log \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$
Binary Independence Model
- Everything boils down to computing the RSV:

$$RSV = \log \prod_{x_i = q_i = 1} \frac{p_i (1 - r_i)}{r_i (1 - p_i)} = \sum_{x_i = q_i = 1} \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$

$$RSV = \sum_{x_i = q_i = 1} c_i; \qquad c_i = \log \frac{p_i (1 - r_i)}{r_i (1 - p_i)}$$

  Here $p_i / (1 - p_i)$ is the odds of the term appearing if the doc is relevant, and $r_i / (1 - r_i)$ is the odds of the term appearing if the doc is non-relevant.

- Estimates, with N docs in the collection (S of them relevant) and n docs containing the term (s of them relevant); for now, assume no zero terms:

$$p_i \approx \frac{s}{S}, \qquad r_i \approx \frac{n - s}{N - S}$$

$$c_i \approx K(N, n, S, s) = \log \frac{s / (S - s)}{(n - s) / (N - n - S + s)}$$
Smoothing
- In practice, do some smoothing to account for zero weights: add 0.5 to all quantities
- Note that we are adding log-scaled components, not multiplying
- If a term appears in more than half the docs, this generates negative weights; stop-word removal mitigates this issue
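A minimal sketch of the smoothed c_i weight (the estimate above with 0.5 added to each quantity); the collection statistics below are made-up:

```python
import math

def rsj_weight(N, n, S, s):
    """Smoothed term weight c_i = log of the relevance odds ratio.

    N: docs in the collection   n: docs containing the term
    S: relevant docs            s: relevant docs containing the term
    Adding 0.5 to each quantity avoids zeros.
    """
    return math.log(((s + 0.5) / (S - s + 0.5)) /
                    ((n - s + 0.5) / (N - n - S + s + 0.5)))

# A rare term concentrated in the relevant set: strongly positive weight
print(rsj_weight(N=1000, n=10, S=20, s=8))

# A term in over half the docs gets a negative weight
print(rsj_weight(N=1000, n=600, S=20, s=5))
```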
Okapi BM-25: Non-Binary Model
Recall the RSV. BM25 replaces the binary term incidence with a tf- and length-sensitive weight; in the standard form (without relevance information):

$$RSV_d = \sum_{t \in q} \log \frac{N}{df_t} \cdot \frac{(k_1 + 1)\, tf_{td}}{k_1 \left( (1 - b) + b \cdot (L_d / L_{ave}) \right) + tf_{td}}$$

- tf_td is the tf of term t in doc d; L_d and L_ave are the length of doc d and the average doc length
- k_1 is a positive tuning parameter that calibrates the doc tf scaling: k_1 = 0 gives the binary model, while a large value of k_1 is similar to using raw tf
- b determines scaling by doc length: b = 1 represents full doc-length scaling, b = 0 represents no scaling
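A sketch of a BM25 scorer along these lines (simple df-based idf, no relevance information; the statistics are toy values):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_len, df, N,
               k1=1.2, b=0.75):
    """Score one doc: sum over query terms of
    idf * (k1 + 1) * tf / (k1 * ((1 - b) + b * doc_len / avg_len) + tf)."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue  # terms absent from the doc contribute nothing
        idf = math.log(N / df[t])
        norm = k1 * ((1.0 - b) + b * doc_len / avg_len) + tf
        score += idf * (k1 + 1.0) * tf / norm
    return score

# Toy example: a slightly-longer-than-average doc, two query terms
score = bm25_score(["gold", "mine"], {"gold": 3, "mine": 1},
                   doc_len=120, avg_len=100,
                   df={"gold": 40, "mine": 500}, N=10000)
```

Setting k1=0 collapses each term's contribution to its idf alone, matching the binary-model limit described above.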
Link matrix example for the Triple Latte (t) node, P(t | g):

             g      ¬g
     t      0.99    0.1
    ¬t      0.01    0.9
Independence Assumptions

- The joint probability factorizes along the network; for the example network with root nodes f and d, No Sleep (n), Gloom (g), and Triple Latte (t):

$$P(f, d, n, g, t) = P(f)\, P(d)\, P(n \mid f)\, P(g \mid f, d)\, P(t \mid g)$$
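As a sketch of that factorization, here is the joint computed from per-node tables; only the P(t|g) values come from the table above, and the priors and other CPTs are made-up numbers for illustration:

```python
from itertools import product

# Hypothetical priors and CPTs (only P(t|g) is from the slide's table)
P_f = 0.1                                # P(f)
P_d = 0.2                                # P(d)
P_n_given_f = {True: 0.7, False: 0.1}    # P(n | f)
P_g_given_fd = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.8, (False, False): 0.05}  # P(g | f, d)
P_t_given_g = {True: 0.99, False: 0.1}   # P(t | g), from the slide

def joint(f, d, n, g, t):
    """P(f, d, n, g, t) = P(f) P(d) P(n|f) P(g|f,d) P(t|g)."""
    pf = P_f if f else 1.0 - P_f
    pd = P_d if d else 1.0 - P_d
    pn = P_n_given_f[f] if n else 1.0 - P_n_given_f[f]
    pg = P_g_given_fd[(f, d)] if g else 1.0 - P_g_given_fd[(f, d)]
    pt = P_t_given_g[g] if t else 1.0 - P_t_given_g[g]
    return pf * pd * pn * pg * pt

# Sanity check: the joint sums to 1 over all 2^5 assignments
total = sum(joint(*a) for a in product([True, False], repeat=5))
```

The factorization needs only 1 + 1 + 2 + 4 + 2 = 10 parameters instead of the 31 a full joint over five binary variables would require.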
Chained inference
- Evidence: a node takes on some value
- Inference
  - Compute beliefs (probabilities) of other nodes, conditioned on the known evidence
  - Two kinds of inference: diagnostic and predictive
- Computational complexity
  - General networks: NP-hard
  - Tree-like networks are easily tractable
  - Much other work on efficient exact and approximate Bayesian network inference
    - Clever dynamic programming
    - Approximate inference ("loopy belief propagation")
Model for Text Retrieval
- Goal
  - Given a user's information need (evidence), find the probability that a doc satisfies the need
- Retrieval model
  - Model docs in a document network
  - Model the information need in a query network
Bayesian Nets for IR: Idea
(Figure: two stacked networks.)

Document Network (large, but computed once for each document collection):
- d1 ... dn: documents
- t1 ... tn: document representations
- r1 ... rk: concepts

Query Network (small, computed once for every query):
- c1 ... cm: query concepts
- q1, q2: high-level concepts
- I: goal node
Bayesian Nets for IR
- Construct the Document Network (once!)
- For each query:
  - Construct the best Query Network
  - Attach it to the Document Network
  - Find the subset of di's which maximizes the probability value of node I (the best subset)
  - Retrieve these di's as the answer to the query
Bayesian nets for text retrieval
(Figure: layered network, top to bottom.)

Document Network:
- d1, d2: documents
- r1, r2, r3: terms/concepts

Query Network:
- c1, c2, c3: concepts
- q1, q2: query operators (AND/OR/NOT)
- i: information need
Link matrices and probabilities
- Prior doc probability P(d) = 1/n
- P(r|d)
  - within-document term frequency
  - tf × idf based
- P(c|r)
  - 1-to-1
  - thesaurus
- P(q|c): canonical forms of query operators
  - Always use things like AND and NOT; never store a full CPT (conditional probability table)
(Example: a document network over the docs Hamlet and Macbeth, with term nodes reason, trouble, and double, matched against a user query.)
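The canonical operator forms mentioned above never need a stored CPT; assuming independent parent beliefs, they collapse to closed-form combinations (a sketch, not the only evaluation scheme in use):

```python
def p_not(p):
    """Canonical NOT: P(q) = 1 - P(parent)."""
    return 1.0 - p

def p_and(parent_probs):
    """Canonical AND: q holds only if every parent holds,
    so P(q) is the product of the parent probabilities."""
    result = 1.0
    for p in parent_probs:
        result *= p
    return result

def p_or(parent_probs):
    """Canonical OR: q fails only if every parent fails."""
    none_hold = 1.0
    for p in parent_probs:
        none_hold *= (1.0 - p)
    return 1.0 - none_hold

print(p_and([0.5, 0.5]))  # -> 0.25
print(p_or([0.5, 0.5]))   # -> 0.75
```

This is why canonical link matrices are cheap: evaluation is linear in the number of parents, rather than exponential as a full CPT would be.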
Extensions
- Prior probabilities don't have to be 1/n
- The user information need doesn't have to be a query: it can be words typed, docs read, any combination ...
- Phrases, inter-document links
- Link matrices can be modified over time
  - User feedback
  - The promise of "personalization"
Computational details
- Document network built at indexing time
- Query network built/scored at query time
- Representation:
  - Link matrices from docs to any single term are like the postings entry for that term
  - Canonical link matrices are efficient to store and compute
- Attach evidence only at the roots of the network
  - Can do a single pass from roots to leaves
Bayes Nets in IR
- Flexible ways of combining term weights, which can generalize previous approaches
  - Boolean model
  - Binary independence model
  - Probabilistic models with weaker assumptions