Unit 2
Exercise: give intuitions for all the ‘0’ entries. Why do some
zero entries correspond to big deltas in other columns?
Lossless vs. lossy compression
Lossless compression: All information is
preserved.
What we mostly do in IR.
Lossy compression: Discard some information
Several of the preprocessing steps can be
viewed as lossy compression: case folding, stop
words, stemming, number elimination.
Optimization: Prune postings entries that are
unlikely to turn up in the top k list for any query.
Almost no loss of quality for the top k list.
Vocabulary vs. collection size
How big is the term vocabulary?
That is, how many distinct words are there?
Can we assume an upper bound?
Not really: at least 70^20 ≈ 10^37 different words of
length 20
In practice, the vocabulary will keep growing with
the collection size
Especially with Unicode
Vocabulary vs. collection size
Heaps’ law: M = kT^b
M is the size of the vocabulary, T is the number
of tokens in the collection
Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
In a log-log plot of vocabulary size M vs. T,
Heaps’ law predicts a line with slope about ½
It is the simplest possible relationship between the
two in log-log space
An empirical finding (“empirical law”)
Heaps’ Law
For RCV1, the dashed line
log10 M = 0.49 log10 T + 1.64 is
the best least-squares fit.
Thus, M = 10^1.64 T^0.49, so k =
10^1.64 ≈ 44 and b = 0.49.
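As a quick illustration, here is a minimal Python sketch that plugs the RCV1 fit above (k ≈ 44, b ≈ 0.49) into Heaps’ law to predict vocabulary size; these constants are the fitted values for this one collection, not universal ones.

def heaps_vocabulary_size(num_tokens: int, k: float = 44.0, b: float = 0.49) -> int:
    # Heaps' law: M = k * T^b, with M the vocabulary size and T the token count.
    return round(k * num_tokens ** b)

if __name__ == "__main__":
    for tokens in (10_000, 1_000_000, 100_000_000):
        print(f"T = {tokens:>11,} tokens -> M ~ {heaps_vocabulary_size(tokens):,} terms")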
Compression
Now, we will consider schemes for compressing the
space used by the dictionary and the postings
(basic Boolean index only).
DICTIONARY COMPRESSION
Why compress the dictionary?
Search begins with the dictionary
We want to keep it in memory
Memory footprint competition with other
applications
Embedded/mobile devices may have very little
memory
Even if the dictionary isn’t in memory, we want it
to be small for a fast search startup time
So, compressing the dictionary is important
Dictionary storage - first cut
Array of fixed-width entries
~400,000 terms; 28 bytes/term = 11.2 MB.
Dictionary stored as one long string of all terms: ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
Assuming each dictionary term is equally likely in a query (not really so in practice!), the average
number of comparisons is (1 + 2·2 + 4·3 + 4)/8 ≈ 2.6.
Front coding (term lengths stored, shared prefix written once):
8automata8automate9automatic10automation → 8automat*a1◊e2◊ic3◊ion
Postings are compressed by storing gaps between successive docIDs (e.g., gaps such as 824 and 5) rather than the docIDs themselves.
Unary code: represent a gap G as G 1s followed by a 0.
The unary code for 40 is 11111111111111111111111111111111111111110.
The unary code for 80 is eighty 1s followed by a 0, which is clearly too long for large gaps.
Gamma codes
We can compress better with bit-level codes
The Gamma code is the best known of these.
Represent a gap G as a pair length and offset
offset is G in binary, with the leading bit cut off
For example 13 → 1101 → 101
length is the length of offset
For 13 (offset 101), this is 3.
We encode length with unary code: 1110.
Gamma code of 13 is the concatenation of length
and offset: 1110101
Gamma code examples
number   length         offset        γ-code
0        none
1        0                            0
2        10             0             10,0
3        10             1             10,1
4        110            00            110,00
9        1110           001           1110,001
13       1110           101           1110,101
24       11110          1000          11110,1000
511      111111110      11111111      111111110,11111111
1025     11111111110    0000000001    11111111110,0000000001
Gamma code properties
G is encoded using 2⌊log2 G⌋ + 1 bits
Length of offset is ⌊log2 G⌋ bits
Length of length is ⌊log2 G⌋ + 1 bits
All gamma codes have an odd number of bits
Almost within a factor of 2 of the best possible, log2 G
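A small Python sketch of the length-plus-offset construction described above, encoding and decoding one gap at a time; the function names are illustrative.

def unary(n: int) -> str:
    # Unary code: n 1s followed by a 0.
    return "1" * n + "0"

def gamma_encode(gap: int) -> str:
    if gap < 1:
        raise ValueError("gamma codes are defined for gaps >= 1")
    binary = bin(gap)[2:]               # e.g. 13 -> '1101'
    offset = binary[1:]                 # drop the leading bit -> '101'
    return unary(len(offset)) + offset  # unary length, then offset: '1110' + '101'

def gamma_decode(bits: str) -> int:
    length = bits.index("0")                        # number of leading 1s
    offset = bits[length + 1 : length + 1 + length]
    return int("1" + offset, 2) if length else 1

if __name__ == "__main__":
    for g in (1, 2, 9, 13, 24, 511, 1025):
        code = gamma_encode(g)
        assert gamma_decode(code) == g
        print(g, code)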
Introduction to
Information Retrieval
Lecture 5: Scoring, Term Weighting and the
Vector Space Model
Introduction to Information Retrieval Ch. 6
Ranked retrieval
Thus far, our queries have all been Boolean.
Documents either match or don’t.
Good for expert users with precise understanding of
their needs and the collection
Not good for the majority of users.
Most users are incapable of writing Boolean queries (or they
are, but they think it’s too much work).
Most users don’t want to wade through 1000s of results.
This is particularly true of web search.
Introduction to Information Retrieval Ch. 6
Jaccard coefficient
A commonly used measure of overlap of two sets A
and B
jaccard(A,B) = |A ∩ B| / |A ∪ B|
jaccard(A,A) = 1
jaccard(A,B) = 0 if A ∩ B = ∅
A and B don’t have to be the same size.
Always assigns a number between 0 and 1.
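A minimal Python sketch of the Jaccard coefficient on term sets; the query and documents below are toy examples.

def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0                     # convention: two empty sets fully overlap
    return len(a & b) / len(a | b)

query = {"ides", "of", "march"}
doc1 = {"caesar", "died", "in", "march"}
doc2 = {"the", "long", "march"}
print(jaccard(query, doc1))            # 1/6
print(jaccard(query, doc2))            # 1/5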
Introduction to Information Retrieval Ch. 6
term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      157                    73              0             0        0         0
Brutus      4                      157             0             1        0         0
Caesar      232                    227             0             2        1         1
Calpurnia   0                      10              0             0        0         0
Cleopatra   57                     0               0             0        0         0
mercy       2                      0               3             5        5         1
worser      2                      0               1             1        1         0
Introduction to Information Retrieval
Term frequency tf
The term frequency tft,d of term t in document d is
defined as the number of times that t occurs in d.
We want to use tf when computing query-document
match scores. But how?
Raw term frequency is not what we want:
A document with 10 occurrences of the term is more
relevant than a document with 1 occurrence of the term.
But not 10 times more relevant.
Relevance does not increase proportionally with
term frequency.
NB: frequency = count in IR
Introduction to Information Retrieval Sec. 6.2
Log-frequency weighting
The log frequency weight of term t in d is:
wt,d = 1 + log10(tft,d)   if tft,d > 0
wt,d = 0                  otherwise
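A one-function Python sketch of this weighting:

import math

def log_tf_weight(tf: int) -> float:
    # w_{t,d} = 1 + log10(tf) for tf > 0, and 0 otherwise.
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

for tf in (0, 1, 2, 10, 1000):
    print(tf, log_tf_weight(tf))       # 0, 1, 1.3, 2, 4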
Document frequency
Rare terms are more informative than frequent terms
Recall stop words
Consider a term in the query that is rare in the
collection (e.g., arachnocentric)
A document containing this term is very likely to be
relevant to the query arachnocentric
→ We want a high weight for rare terms like
arachnocentric.
Introduction to Information Retrieval Sec. 6.2.1
idf weight
dft is the document frequency of t: the number of
documents that contain t
dft is an inverse measure of the informativeness of t
dft ≤ N
We define the idf (inverse document frequency) of t
by
idft = log10(N/dft)
We use log (N/dft) instead of N/dft to “dampen” the effect
of idf.
Introduction to Information Retrieval Sec. 6.2.1
tf-idf weighting
The tf-idf weight of a term is the product of its tf
weight and its idf weight.
wt,d = log(1 + tft,d) × log10(N/dft)
Best known weighting scheme in information retrieval
Note: the “-” in tf-idf is a hyphen, not a minus sign!
Alternative names: tf.idf, tf x idf
Increases with the number of occurrences of term
within a document
Increases with the rarity of the term in the collection
Introduction to Information Retrieval Sec. 6.2.2
Score(q,d) = Σ t ∈ q∩d  tf-idft,d
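A hedged Python sketch of tf-idf weighting and this overlap score; the tiny corpus and helper names below are illustrative assumptions, not part of the original slides.

import math
from collections import Counter

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    # w_{t,d} = log(1 + tf_{t,d}) * log10(N / df_t)
    if tf == 0 or df == 0:
        return 0.0
    return math.log10(1 + tf) * math.log10(n_docs / df)

def overlap_score(query_terms, doc_counts: Counter, df: dict, n_docs: int) -> float:
    # Score(q, d): sum of tf-idf_{t,d} over terms t occurring in both q and d.
    return sum(tf_idf(doc_counts[t], df.get(t, 0), n_docs)
               for t in set(query_terms) if t in doc_counts)

docs = [Counter("the caesar was ambitious".split()),
        Counter("brutus killed caesar".split()),
        Counter("the noble brutus".split())]
df = Counter(t for d in docs for t in d)          # document frequency per term
print([round(overlap_score(["brutus", "caesar"], d, df, len(docs)), 3) for d in docs])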
Introduction to Information Retrieval Sec. 6.3
Documents as vectors
So we have a |V|-dimensional vector space
Terms are axes of the space
Documents are points or vectors in this space
Queries as vectors
Key idea 1: Do the same for queries: represent
queries as vectors in the space
Key idea 2: Rank documents according to their
proximity to the query in this space
proximity = similarity of vectors
proximity ≈ inverse of distance
Recall: We do this because we want to get away from
the you’re-either-in-or-out Boolean model.
Instead: rank more relevant documents higher than
less relevant documents
Introduction to Information Retrieval Sec. 6.3
cosine(query, document)
cos(q, d) = (q · d) / (|q| |d|) = Σ i=1..|V| qi di / (√(Σ i qi²) √(Σ i di²))
(numerator: dot product; denominator: product of the vector lengths, i.e., q/|q| and d/|d| are unit vectors)
cos(q, d) = q · d = Σ i=1..|V| qi di   for q, d length-normalized.
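A short Python sketch of this computation over sparse term-weight dictionaries (an assumed representation chosen for readability).

import math

def normalize(v: dict) -> dict:
    # Divide every weight by the vector's Euclidean length.
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {t: w / norm for t, w in v.items()} if norm else dict(v)

def cosine(q: dict, d: dict) -> float:
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

q = {"jealous": 1.0, "gossip": 1.0}
d = {"jealous": 2.0, "gossip": 1.3, "affection": 3.1}
print(cosine(q, d))
print(sum(w * normalize(d).get(t, 0.0) for t, w in normalize(q).items()))  # same value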
Introduction to Information Retrieval Sec. 6.3
Term frequencies (counts):
term        SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38
SaS: Sense and Sensibility; PaP: Pride and Prejudice; WH: Wuthering Heights
After log-frequency weighting and length normalization:
cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?
Introduction to Information Retrieval Sec. 6.4
Points to note
A document may have a high cosine similarity score
for a query, even if it does not contain all terms in
the query
How can we speed up vector space retrieval?
Can store the inverse document frequency (e.g., N/dft) at
the head of the postings list for term t
Store the term-frequency (e.g., tft,d) in each postings entry
of the postings list for term t
For a multi-word query, the postings lists of the various
query terms can even be traversed concurrently
Introduction to Information Retrieval
COMPUTING SCORES IN A
COMPLETE SEARCH SYSTEM
Slides by Manning, Raghavan, Schutze
Introduction to Information Retrieval Ch. 6
Recap: tf‐idf weighting
The tf‐idf weight of a term is the product of its tf
weight and its idf weight.
wt,d = (1 + log10 tft,d) × log10(N/dft)
Best known weighting scheme in information retrieval
Increases with the number of occurrences within a
document
Increases with the rarity of the term in the collection
Introduction to Information Retrieval Ch. 6
Recap: Queries as vectors
Key idea 1: Do the same for queries: represent them
as vectors in the space
Key idea 2: Rank documents according to their
proximity to the query in this space
proximity = similarity of vectors
Introduction to Information Retrieval Ch. 6
Recap: cosine(query, document)
cos(q, d) = (q · d) / (|q| |d|) = Σ i=1..|V| qi di / (√(Σ i qi²) √(Σ i di²))
(dot product of the two vectors, divided by the product of their lengths, i.e., the dot product of the corresponding unit vectors)
This lecture
Speeding up vector space ranking
Putting together a complete search
system
Will require learning about a number of
miscellaneous topics and heuristics
Introduction to Information Retrieval Sec. 6.3.3
Computing cosine scores
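A hedged term-at-a-time sketch of the cosine scoring algorithm this slide refers to, assuming a postings index mapping each term to (docID, weight) pairs and a table of document vector lengths; these names are illustrative.

import heapq

def cosine_scores(query_weights: dict, index: dict, lengths: dict, k: int):
    scores = {}
    # Accumulate w_{t,q} * w_{t,d} for every posting of every query term.
    for term, w_tq in query_weights.items():
        for doc_id, w_td in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_tq * w_td
    # Normalize by document length, then return the top K docs.
    for doc_id in scores:
        scores[doc_id] /= lengths[doc_id]
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])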
Introduction to Information Retrieval Sec. 7.1
Efficient cosine ranking
Find the K docs in the collection "nearest" to the
query ⇒ the K largest query-doc cosines.
Efficient ranking:
Computing a single cosine efficiently.
Choosing the K largest cosine values efficiently.
Can we do this without computing all N cosines?
Introduction to Information Retrieval Sec. 7.1
Efficient cosine ranking
What we’re doing in effect: solving the K‐nearest
neighbor problem for a query vector
In general, we do not know how to do this efficiently
for high‐dimensional spaces
But it is solvable for short queries, and standard
indexes support this well
Introduction to Information Retrieval Sec. 7.1
Special case – unweighted queries
No weighting on query terms
Assume each query term occurs only once
Then for ranking, don’t need to normalize query
vector
Slight simplification of algorithm from Lecture 6
Introduction to Information Retrieval Sec. 7.1
Computing the K largest cosines:
selection vs. sorting
Typically we want to retrieve the top K docs (in the
cosine ranking for the query)
not to totally order all docs in the collection
Can we pick off docs with K highest cosines?
Let J = number of docs with nonzero cosines
We seek the K best of these J
Introduction to Information Retrieval Sec. 7.1
Use heap for selecting top K
Binary tree in which each node’s value > the values
of children
Takes 2J operations to construct, then each of K
“winners” read off in 2log J steps.
For J=1M, K=100, this is about 10% of the cost of
sorting.
(Figure: a binary max-heap of scores, with root 1 and children .9 and .3.)
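A minimal Python illustration of heap-based selection of the K largest of J scores, using the standard library rather than a hand-rolled heap.

import heapq, random

scores = [random.random() for _ in range(1_000_000)]   # J = 1M nonzero cosines
top_k = heapq.nlargest(100, scores)                    # K = 100 winners
print(top_k[:3])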
Introduction to Information Retrieval Sec. 7.1.1
Bottlenecks
Primary computational bottleneck in scoring: cosine
computation
Can we avoid all this computation?
Yes, but may sometimes get it wrong
a doc not in the top K may creep into the list of K
output docs
Is this such a bad thing?
Introduction to Information Retrieval Sec. 7.1.1
Cosine similarity is only a proxy
User has a task and a query formulation
Cosine matches docs to query
Thus cosine is anyway a proxy for user happiness
If we get a list of K docs “close” to the top K by cosine
measure, should be ok
Introduction to Information Retrieval Sec. 7.1.1
Generic approach
Find a set A of contenders, with K < |A| << N
A does not necessarily contain the top K, but has
many docs from among the top K
Return the top K docs in A
Think of A as pruning non‐contenders
The same approach is also used for other (non‐
cosine) scoring functions
Will look at several schemes following this approach
Introduction to Information Retrieval Sec. 7.1.2
Index elimination
The basic cosine computation algorithm only
considers docs containing at least one query term
Take this further:
Only consider high‐idf query terms
Only consider docs containing many query terms
Introduction to Information Retrieval Sec. 7.1.2
High‐idf query terms only
For a query such as catcher in the rye
Only accumulate scores from catcher and rye
Intuition: in and the contribute little to the scores
and so don’t alter rank‐ordering much
Benefit:
Postings of low-idf terms have many docs → these (many)
docs get eliminated from the set A of contenders
Introduction to Information Retrieval Sec. 7.1.2
Docs containing many query terms
Any doc with at least one query term is a candidate
for the top K output list
For multi‐term queries, only compute scores for docs
containing several of the query terms
Say, at least 3 out of 4
Imposes a “soft conjunction” on queries seen on web
search engines (early Google)
Easy to implement in postings traversal
Introduction to Information Retrieval Sec. 7.1.2
3 of 4 query terms
Antony     3 4 8 16 32 64 128
Brutus     2 4 8 16 32 64 128
Caesar     1 2 3 5 8 13 21 34
Calpurnia  13 16 32
Scores are only computed for docs containing at least 3 of the 4 query terms: docs 8, 16 and 32.
Introduction to Information Retrieval Sec. 7.1.3
Champion lists
Precompute for each dictionary term t, the r docs of
highest weight in t’s postings
Call this the champion list for t
(aka fancy list or top docs for t)
Note that r has to be chosen at index build time
Thus, it’s possible that r < K
At query time, only compute scores for docs in the
champion list of some query term
Pick the K top‐scoring docs from amongst these
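A hedged sketch of champion-list construction and candidate selection, assuming the same term -> (docID, weight) postings representation as in the earlier sketches.

import heapq

def build_champion_lists(index: dict, r: int) -> dict:
    # Keep only the r highest-weighted postings per term.
    return {t: heapq.nlargest(r, plist, key=lambda p: p[1]) for t, plist in index.items()}

def candidate_docs(query_terms, champions: dict) -> set:
    # Only docs appearing in some query term's champion list are scored.
    return {doc for t in query_terms for doc, _ in champions.get(t, [])}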
Introduction to Information Retrieval Sec. 7.1.3
Exercises
How do Champion Lists relate to Index Elimination?
Can they be used together?
How can Champion Lists be implemented in an
inverted index?
Note that the champion list has nothing to do with small
docIDs
Introduction to Information Retrieval Sec. 7.1.4
Static quality scores
We want top‐ranking documents to be both relevant
and authoritative
Relevance is being modeled by cosine scores
Authority is typically a query‐independent property
of a document
Examples of authority signals
Wikipedia among websites
Articles in certain newspapers
Quantitative
A paper with many citations
Many bitly’s, diggs or del.icio.us marks
(Pagerank)
Introduction to Information Retrieval Sec. 7.1.4
Modeling authority
Assign a query-independent
quality score in [0,1] to each document d
Denote this by g(d)
Thus, a quantity like the number of citations is scaled
into [0,1]
Exercise: suggest a formula for this.
Introduction to Information Retrieval Sec. 7.1.4
Net score
Consider a simple total score combining cosine
relevance and authority
net‐score(q,d) = g(d) + cosine(q,d)
Can use some other linear combination
Indeed, any function of the two “signals” of user happiness
– more later
Now we seek the top K docs by net score
Introduction to Information Retrieval Sec. 7.1.4
Top K by net score – fast methods
First idea: Order all postings by g(d)
Key: this is a common ordering for all postings
Thus, can concurrently traverse query terms’
postings for
Postings intersection
Cosine score computation
Exercise: write pseudocode for cosine score
computation if postings are ordered by g(d)
Introduction to Information Retrieval Sec. 7.1.4
Why order postings by g(d)?
Under g(d)‐ordering, top‐scoring docs likely to
appear early in postings traversal
In time‐bound applications (say, we have to return
whatever search results we can in 50 ms), this allows
us to stop postings traversal early
Short of computing scores for all docs in postings
Introduction to Information Retrieval Sec. 7.1.4
Champion lists in g(d)‐ordering
Can combine champion lists with g(d)‐ordering
Maintain for each term a champion list of the r docs
with highest g(d) + tf-idft,d
Seek top‐K results from only the docs in these
champion lists
Introduction to Information Retrieval Sec. 7.1.4
High and low lists
For each term, we maintain two postings lists called
high and low
Think of high as the champion list
When traversing postings on a query, only traverse
high lists first
If we get more than K docs, select the top K and stop
Else proceed to get docs from the low lists
Can be used even for simple cosine scores, without
global quality g(d)
A means for segmenting index into two tiers
Introduction to Information Retrieval Sec. 7.1.5
Impact‐ordered postings
We only want to compute scores for docs for which
wft,d is high enough
We sort each postings list by wft,d
Now: not all postings in a common order!
How do we compute scores in order to pick off top K?
Two ideas follow
Introduction to Information Retrieval Sec. 7.1.5
1. Early termination
When traversing t’s postings, stop early after either:
a fixed number of docs r, or
wft,d drops below some threshold
Take the union of the resulting sets of docs
One from the postings of each query term
Compute only the scores for docs in this union
Introduction to Information Retrieval Sec. 7.1.5
2. idf‐ordered terms
When considering the postings of query terms
Look at them in order of decreasing idf
High idf terms likely to contribute most to score
As we update score contribution from each query
term
Stop if doc scores relatively unchanged
Can apply to cosine or some other net scores
Introduction to Information Retrieval Sec. 7.1.6
Cluster pruning: preprocessing
Pick √N docs at random: call these leaders
For every other doc, pre-compute its nearest
leader
Docs attached to a leader: its followers;
Likely: each leader has ~ √N followers.
Introduction to Information Retrieval Sec. 7.1.6
Cluster pruning: query processing
Process a query as follows:
Given query Q, find its nearest leader L.
Seek K nearest docs from among L’s
followers.
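A hedged Python sketch of cluster pruning with √N random leaders; the cosine function and the docID -> vector dictionary are assumed inputs, not part of the original slides.

import heapq, math, random

def preprocess(docs: dict, cosine):
    # Pick ~sqrt(N) leaders at random and attach every doc to its nearest leader.
    leaders = random.sample(list(docs), k=max(1, int(math.sqrt(len(docs)))))
    followers = {leader: [] for leader in leaders}
    for d in docs:
        nearest = max(leaders, key=lambda l: cosine(docs[d], docs[l]))
        followers[nearest].append(d)
    return leaders, followers

def query(q_vec, docs: dict, leaders, followers, cosine, k: int):
    # Route the query to its nearest leader and rank only that leader's followers.
    leader = max(leaders, key=lambda l: cosine(q_vec, docs[l]))
    return heapq.nlargest(k, followers[leader], key=lambda d: cosine(q_vec, docs[d]))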
Introduction to Information Retrieval Sec. 7.1.6
Visualization
(Figure: query, leaders, and their followers in the vector space.)
Introduction to Information Retrieval Sec. 7.1.6
Why use random sampling
Fast
Leaders reflect data distribution
Introduction to Information Retrieval Sec. 7.1.6
General variants
Have each follower attached to b1=3 (say) nearest
leaders.
From query, find b2=4 (say) nearest leaders and their
followers.
Can recurse on leader/follower construction.
Introduction to Information Retrieval Sec. 7.1.6
Exercises
To find the nearest leader in step 1, how many cosine
computations do we do?
Why did we have √N in the first place?
What is the effect of the constants b1, b2 on the
previous slide?
Devise an example where this is likely to fail – i.e., we
miss one of the K nearest docs.
Likely under random sampling.
Introduction to Information Retrieval Sec. 6.1
Parametric and zone indexes
Thus far, a doc has been a sequence of terms
In fact documents have multiple parts, some with
special semantics:
Author
Title
Date of publication
Language
Format
etc.
These constitute the metadata about a document
Introduction to Information Retrieval Sec. 6.1
Fields
We sometimes wish to search by these metadata
E.g., find docs authored by William Shakespeare in the
year 1601, containing alas poor Yorick
Year = 1601 is an example of a field
Also, author last name = shakespeare, etc.
Field or parametric index: postings for each field
value
Sometimes build range trees (e.g., for dates)
Field query typically treated as conjunction
(doc must be authored by shakespeare)
Introduction to Information Retrieval Sec. 6.1
Zone
A zone is a region of the doc that can contain an
arbitrary amount of text, e.g.,
Title
Abstract
References …
Build inverted indexes on zones as well to permit
querying
E.g., “find docs with merchant in the title zone and
matching the query gentle rain”
Introduction to Information Retrieval Sec. 6.1
Example zone indexes
Introduction to Information Retrieval Sec. 7.2.1
Tiered indexes
Break postings up into a hierarchy of lists
Most important
…
Least important
Can be done by g(d) or another measure
Inverted index thus broken up into tiers of decreasing
importance
At query time use top tier unless it fails to yield K
docs
If so drop to lower tiers
Introduction to Information Retrieval Sec. 7.2.1
Example tiered index
Introduction to Information Retrieval Sec. 7.2.2
Query term proximity
Free text queries: just a set of terms typed into the
query box – common on the web
Users prefer docs in which query terms occur within
close proximity of each other
Let w be the smallest window in a doc containing all
query terms, e.g.,
For the query strained mercy the smallest window in
the doc The quality of mercy is not strained is 4
(words)
Would like scoring function to take this into account
– how?
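One possible building block is computing the window width w itself; below is a small Python sketch, assuming each query term's sorted word positions in the document are available (e.g., from a positional index).

def smallest_window(positions: dict) -> float:
    # positions: query term -> sorted list of word positions in the doc.
    events = sorted((p, t) for t, plist in positions.items() for p in plist)
    best, last_seen = float("inf"), {}
    for pos, term in events:
        last_seen[term] = pos
        if len(last_seen) == len(positions):           # window now covers every term
            best = min(best, pos - min(last_seen.values()) + 1)
    return best

# "The quality of mercy is not strained", query {strained, mercy}:
print(smallest_window({"mercy": [3], "strained": [6]}))   # 4 words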
Introduction to Information Retrieval Sec. 7.2.3
Query parsers
Free text query from user may in fact spawn one or
more queries to the indexes, e.g., query rising
interest rates
Run the query as a phrase query
If <K docs contain the phrase rising interest rates, run the
two phrase queries rising interest and interest rates
If we still have <K docs, run the vector space query rising
interest rates
Rank matching docs by vector space scoring
This sequence is issued by a query parser
Introduction to Information Retrieval Sec. 7.2.3
Aggregate scores
We’ve seen that score functions can combine cosine,
static quality, proximity, etc.
How do we know the best combination?
Some applications – expert‐tuned
Increasingly common: machine‐learned
See May 19th lecture
Introduction to Information Retrieval Sec. 7.2.4
Putting it all together
Evaluation in information retrieval
To measure ad hoc information retrieval effectiveness in the standard way, we need a test
collection consisting of three things:
1. A document collection
2. A test suite of information needs, expressible as queries
3. A set of relevance judgments, standardly a binary assessment of either relevant or
nonrelevant for each query-document pair.
The standard approach to information retrieval system evaluation revolves around the notion of
relevant and nonrelevant documents. With respect to a user information need, a document in
the test collection is given a binary classification as either relevant or nonrelevant. This decision
is referred to as the gold standard or ground truth judgment of relevance. The test document
collection and suite of information needs have to be of a reasonable size: you need to average
performance over fairly large test sets, as results are highly variable over different documents
and information needs. As a rule of thumb, 50 information needs has usually been found to be a
sufficient minimum.
Relevance is assessed relative to an information need, not a query. For example, an information need might be:
Information on whether drinking red wine is more effective at reducing your risk of
heart attacks than white wine.
This might be translated into a query such as:
wine and red and white and heart and attack and effective
A document is relevant if it addresses the stated information need, not because it just happens
to contain all the words in the query. This distinction is often misunderstood in practice, because
the information need is not overt. But, nevertheless, an information need is present. If a user
types python into a web search engine, they might be wanting to know where they can purchase
a pet python. Or they might be wanting information on the programming language Python. From
a one word query, it is very difficult for a system to know what the information need is. But,
nevertheless, the user has one, and can judge the returned results on the basis of their
relevance to it. To evaluate a system, we require an overt expression of an information need,
which can be used for judging returned documents as relevant or nonrelevant. At this point, we
make a simplification: relevance can reasonably be thought of as a scale, with some documents
highly relevant and others marginally so. But for the moment, we will use just a binary decision
of relevance. We discuss the reasons for using binary relevance judgments and alternatives in
Section 8.5.1 .
Many systems contain various weights (often known as parameters) that can be adjusted to
tune system performance. It is wrong to report results on a test collection which were obtained
by tuning these parameters to maximize performance on that collection. That is because such
tuning overstates the expected performance of the system, because the weights will be set to
maximize performance on one particular set of queries rather than for a random sample of
queries. In such cases, the correct procedure is to have one or more development test
collections , and to tune the parameters on the development test collection. The tester then runs
the system with those weights on the test collection and reports the results on that collection as
an unbiased estimate of performance.
Standard test collections
Here is a list of the most standard test collections and evaluation series. We focus particularly
on test collections for ad hoc information retrieval system evaluation, but also mention a couple
of similar test collections for text classification.
The Cranfield collection. This was the pioneering test collection in allowing precise quantitative
measures of information retrieval effectiveness, but is nowadays too small for anything but the
most elementary pilot experiments. Collected in the United Kingdom starting in the late 1950s, it
contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive
relevance judgments of all (query, document) pairs.
Text Retrieval Conference (TREC) . The U.S. National Institute of Standards and Technology
(NIST) has run a large IR test bed evaluation series since 1992. Within this framework, there
have been many tracks over a range of different test collections, but the best known test
collections are the ones used for the TREC Ad Hoc track during the first 8 TREC evaluations
between 1992 and 1999. In total, these test collections comprise 6 CDs containing 1.89 million
documents (mainly, but not exclusively, newswire articles) and relevance judgments for 450
information needs, which are called topics and specified in detailed text passages. Individual
test collections are defined over different subsets of this data. The early TRECs each consisted
of 50 information needs, evaluated over different but overlapping sets of documents. TRECs 6-8
provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information
Service articles. This is probably the best subcollection to use in future work, because it is the
largest and the topics are more consistent. Because the test document collections are so large,
there are no exhaustive relevance judgments. Rather, NIST assessors' relevance judgments are
available only for the documents that were among the top returned for some system which
was entered in the TREC evaluation for which the information need was developed.
In more recent years, NIST has done evaluations on larger document collections, including the
25 million page GOV2 web page collection. From the beginning, the NIST test document
collections were orders of magnitude larger than anything available to researchers previously
and GOV2 is now the largest Web collection easily available for research purposes.
Nevertheless, the size of GOV2 is still more than 2 orders of magnitude smaller than the current
size of the document collections indexed by the large web search companies.
NII Test Collections for IR Systems ( NTCIR ). The NTCIR project has built various test
collections of similar sizes to the TREC collections, focusing on East Asian language and
cross-language information retrieval , where queries are made in one language over a
document collection containing documents in one or more other languages. See:
http://research.nii.ac.jp/ntcir/data/data-en.html
Cross Language Evaluation Forum ( CLEF ). This evaluation series has concentrated on
European languages and cross-language information retrieval. See:
http://www.clef-campaign.org/
Reuters-21578 and Reuters-RCV1. For text classification, the most used test collection has been the
Reuters-21578 collection of 21578 newswire articles; see Chapter 13 , page 13.6 . More
recently, Reuters released the much larger Reuters Corpus Volume 1 (RCV1), consisting of
806,791 documents; see Chapter 4 , page 4.2 . Its scale and rich annotation makes it a better
basis for future research.
20 Newsgroups . This is another widely used text classification collection, collected by Ken
Lang. It consists of 1000 articles from each of 20 Usenet newsgroups (the newsgroup name
being regarded as the category). After the removal of duplicate articles, as it is usually used, it
contains 18941 articles.
Evaluation of unranked retrieval sets
Given these ingredients, how is system effectiveness measured? The two most frequent and
basic measures for information retrieval effectiveness are precision and recall. These are first
defined for the simple case where an IR system returns a set of documents for a query. We will
see later how to extend these notions to ranked retrieval situations.
Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant | retrieved)    (36)
Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved | relevant)    (37)
These notions can be made clear by examining the following contingency table:
                 Relevant                Nonrelevant
Retrieved        true positives (tp)     false positives (fp)
Not retrieved    false negatives (fn)    true negatives (tn)
Then:
P = tp / (tp + fp)    (38)
R = tp / (tp + fn)    (39)
An obvious alternative that may occur to the reader is to judge an information retrieval system
by its accuracy, that is, the fraction of its classifications that are correct. In terms of the
contingency table above, accuracy = (tp + tn) / (tp + fp + fn + tn).
There is a good reason why accuracy is not an appropriate measure for information retrieval
problems. In almost all circumstances, the data is extremely skewed: normally over 99.9% of the
documents are in the nonrelevant category. A system tuned to maximize accuracy can appear
to perform well by simply deeming all documents nonrelevant to all queries. Even if the system
is quite good, trying to label some documents as relevant will almost always lead to a high rate
of false positives. However, labeling all documents as nonrelevant is completely unsatisfying to
an information retrieval system user. Users are always going to want to see some documents,
and can be assumed to have a certain tolerance for seeing some false positives providing that
they get some useful information. The measures of precision and recall concentrate the
evaluation on the return of true positives, asking what percentage of the relevant documents
have been found and how many false positives have also been returned.
The advantage of having the two numbers for precision and recall is that one is more important
than the other in many circumstances. Typical web surfers would like every result on the first
page to be relevant (high precision) but have not the slightest interest in knowing let alone
looking at every document that is relevant. In contrast, various professional searchers such as
paralegals and intelligence analysts are very concerned with trying to get as high recall as
possible, and will tolerate fairly low precision results in order to get it. Individuals searching their
hard disks are also often interested in high recall searches. Nevertheless, the two quantities
clearly trade off against one another: you can always get a recall of 1 (but very low precision) by
retrieving all documents for all queries! Recall is a non-decreasing function of the number of
documents retrieved. On the other hand, in a good system, precision usually decreases as the
number of documents retrieved is increased. In general we want to get some amount of recall
while tolerating only a certain percentage of false positives.
A single measure that trades off precision versus recall is the F measure, which is the weighted
harmonic mean of precision and recall:
F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R),  where β² = (1 − α)/α    (40)
where α ∈ [0, 1] and thus β² ∈ [0, ∞]. The default balanced F measure equally weights precision
and recall, which means making α = 1/2 or β = 1. It is commonly written as F1,
which is short for F_{β=1}, even though the formulation in terms of α more transparently exhibits
the F measure as a weighted harmonic mean. When using β = 1, the formula on the right
simplifies to:
F_{β=1} = 2PR / (P + R)    (41)
However, using an even weighting is not the only choice. Values of β < 1 emphasize precision,
while values of β > 1 emphasize recall.
Why do we use a harmonic mean rather than the simpler average (arithmetic mean)? Recall
that we can always get 100% recall by just returning all documents, and therefore we can
always get a 50% arithmetic mean by the same process. This strongly suggests that the
arithmetic mean is an unsuitable measure to use. In contrast, if we assume that 1 document in
10,000 is relevant to the query, the harmonic mean score of this strategy is 0.02%. The
harmonic mean is always less than or equal to the arithmetic mean and the geometric mean.
When the values of two numbers differ greatly, the harmonic mean is closer to their minimum
than to their arithmetic mean; see Figure 8.1 .
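A short Python sketch of set-based precision, recall, and the weighted F measure of equation (40), on made-up retrieved and relevant sets.

def precision(retrieved: set, relevant: set) -> float:
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved: set, relevant: set) -> float:
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    # beta = 1 gives the balanced F1 = 2PR / (P + R).
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)

retrieved, relevant = {1, 2, 3, 4}, {2, 4, 5, 6, 7}
p, r = precision(retrieved, relevant), recall(retrieved, relevant)
print(p, r, f_measure(p, r))   # 0.5 0.4 0.444...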
Exercises.
● The balanced F measure (a.k.a. F1) is defined as the harmonic mean of precision and
recall. What is the advantage of using the harmonic mean rather than ``averaging''
(using the arithmetic mean)?
Evaluation of ranked retrieval results
Precision, recall, and the F measure are set-based measures. They are computed using
unordered sets of documents. We need to extend these measures (or to define new measures)
if we are to evaluate the ranked retrieval results that are now standard with search engines. In a
ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k
retrieved documents. For each such set, precision and recall values can be plotted to give a
precision-recall curve, such as the one shown in Figure 8.2. Precision-recall curves have a
distinctive saw-tooth shape: if the (k+1)th document retrieved is nonrelevant then recall is the
same as for the top k documents, but precision has dropped. If it is relevant, then both
precision and recall increase, and the curve jags up and to the right. It is often useful to remove
these jiggles and the standard way to do this is with an interpolated precision: the interpolated
precision p_interp at a certain recall level r is defined as the highest precision found for any
recall level r' ≥ r:
p_interp(r) = max_{r' ≥ r} p(r')    (42)
The justification is that almost anyone would be prepared to look at a few more documents if it
would increase the percentage of the viewed set that were relevant (that is, if the precision of
the larger set is higher). Interpolated precision is shown by a thinner line in Figure 8.2 . With this
definition, the interpolated precision at a recall of 0 is well-defined (Exercise 8.4 ).
Recall   Interpolated Precision
0.0      1.00
0.1      0.67
0.2      0.63
0.3      0.55
0.4      0.45
0.5      0.41
0.6      0.36
0.7      0.29
0.8      0.13
0.9      0.10
1.0      0.08
Most standard among the TREC community is Mean Average Precision (MAP), which provides a
single-figure measure of quality across recall levels. For a single information need, Average
Precision is the average of the precision value obtained for the set of top k documents existing after each
relevant document is retrieved, and this value is then averaged over information needs. That is,
if the set of relevant documents for an information need qj ∈ Q is {d1, …, dmj} and Rjk is
the set of ranked retrieval results from the top result until you get to document dk, then
MAP(Q) = (1/|Q|) Σ j=1..|Q| (1/mj) Σ k=1..mj Precision(Rjk)    (43)
When a relevant document is not retrieved at all, the precision value in the above equation is
taken to be 0. For a single information need, the average precision approximates the area under
the uninterpolated precision-recall curve, and so the MAP is roughly the average area under the
precision-recall curve for a set of queries.
Using MAP, fixed recall levels are not chosen, and there is no interpolation. The MAP value for
a test collection is the arithmetic mean of average precision values for individual information
needs. (This has the effect of weighting each information need equally in the final reported
number, even if many documents are relevant to some queries whereas very few are relevant to
other queries.) Calculated MAP scores normally vary widely across information needs when
measured within a single system, for instance, between 0.1 and 0.7. Indeed, there is normally
more agreement in MAP for an individual information need across systems than for MAP scores
for different information needs for the same system. This means that a set of test information
needs must be large and diverse enough to be representative of system effectiveness across
different queries.
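A small Python sketch of Average Precision for one ranked list and MAP over several information needs, following equation (43); the example ranking and relevance set are made up.

def average_precision(ranking: list, relevant: set) -> float:
    hits, precisions = 0, []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)   # precision just after each relevant doc
    # Relevant docs never retrieved contribute precision 0 (they add nothing to the sum).
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs: list) -> float:
    # runs: list of (ranking, relevant_set) pairs, one per information need.
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

print(average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"}))   # (1/2 + 2/4) / 3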
The above measures factor in precision at all recall levels. For many prominent applications,
particularly web search, this may not be germane to users. What matters is rather how many
good results there are on the first page or the first three pages. This leads to measuring
precision at fixed low levels of retrieved results, such as 10 or 30 documents. This is referred to
as ``Precision at k'', for example ``Precision at 10''. It has the advantage of not requiring any
estimate of the size of the set of relevant documents but the disadvantages that it is the least
stable of the commonly used evaluation measures and that it does not average well, since the
total number of relevant documents for a query has a strong influence on precision at k.
An alternative, which alleviates this problem, is R-precision. It requires having a set of known
relevant documents Rel, from which we calculate the precision of the top |Rel| documents
returned. (The set Rel may be incomplete, such as when Rel is formed by creating relevance
judgments for the pooled top k results of particular systems in a set of experiments.)
R-precision adjusts for the size of the set of relevant documents: A perfect system could score 1
on this metric for each query, whereas, even a perfect system could only achieve a precision at
20 of 0.4 if there were only 8 documents in the collection relevant to an information need.
Averaging this measure across queries thus makes more sense. This measure is harder to
explain to naive users than Precision at k but easier to explain than MAP. If there are |Rel|
relevant documents for a query, we examine the top |Rel| results of a system, and find that r
are relevant, then by definition, not only is the precision (and hence R-precision) r/|Rel|, but
the recall of this result set is also r/|Rel|. Thus, R-precision turns out to be identical to the
break-even point, another measure which is sometimes used, defined in terms of this equality
relationship holding. Like Precision at k, R-precision describes only one point on the
precision-recall curve, rather than attempting to summarize effectiveness across the curve, and
it is somewhat unclear why you should be interested in the break-even point rather than either
the best point on the curve (the point with maximal F measure) or a retrieval level of interest to
a particular application (Precision at k).
Another concept sometimes used in evaluation is an ROC curve . (``ROC'' stands for ``Receiver
Operating Characteristics'', but knowing that doesn't help most people.) An ROC curve plots the
true positive rate or sensitivity against the false positive rate or (1 − specificity). Here,
sensitivity is just another term for recall. The false positive rate is given by fp/(fp + tn).
Figure 8.4 shows the ROC curve corresponding to the precision-recall curve in Figure 8.2. An
ROC curve always goes from the bottom left to the top right of the graph. For a good system,
the graph climbs steeply on the left side. For unranked result sets, specificity, given by
tn/(fp + tn), was not seen as a very useful notion. Because the set of true negatives is
always so large, its value would be almost 1 for all information needs (and, correspondingly, the
value of the false positive rate would be almost 0). That is, the ``interesting'' part of Figure 8.2
(the low-recall region) is compressed into a small corner of Figure 8.4.
A final approach that has seen increasing adoption, especially when employed with machine
learning approaches to ranking, is measures of cumulative gain, and in particular
normalized discounted cumulative gain (NDCG). NDCG is designed for situations of non-binary
notions of relevance (cf. Section 8.5.1). Like precision at k, it is evaluated over some number k
of top search results. For a set of queries Q, let R(j, d) be the relevance score assessors
gave to document d for query j. Then,
NDCG(Q, k) = (1/|Q|) Σ j=1..|Q| Z_kj Σ m=1..k (2^R(j,m) − 1) / log2(1 + m)    (44)
where Z_kj is a normalization factor calculated so that a perfect ranking's NDCG at k for query j
is 1. For queries for which fewer than k documents are retrieved, the last summation is done up
to the number of retrieved documents.
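A hedged Python sketch of DCG and NDCG at k in the spirit of equation (44), normalizing by the DCG of an ideal reordering of the same relevance scores (which plays the role of Z_kj).

import math

def dcg_at_k(relevances: list, k: int) -> float:
    # Gain 2^R - 1 for the document at rank m, discounted by log2(1 + m).
    return sum((2 ** r - 1) / math.log2(1 + m)
               for m, r in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances: list, k: int) -> float:
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=6))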
Given information needs and documents, you need to collect relevance assessments. This is a
time-consuming and expensive process involving human beings. For tiny collections like
Cranfield, exhaustive judgments of relevance for each query and document pair were obtained.
For large modern collections, it is usual for relevance to be assessed only for a subset of the
documents for each query. The most standard approach is pooling, where relevance is
assessed over a subset of the collection that is formed from the top k documents returned by a
number of different IR systems (usually the ones to be evaluated), and perhaps other sources
such as the results of Boolean keyword searches or documents found by expert searchers in an
interactive process.
(Table 8.2: a contingency table of two judges' yes/no relevance judgments, from which observed agreement, pooled marginals, and the kappa statistic are computed.)
Kappa statistic
A human is not a device that reliably reports a gold standard judgment of relevance of a
document to a query. Rather, humans and their relevance judgments are quite idiosyncratic and
variable. But this is not a problem to be solved: in the final analysis, the success of an IR system
depends on how good it is at satisfying the needs of these idiosyncratic humans, one
information need at a time.
Nevertheless, it is interesting to consider and measure how much agreement between judges
there is on relevance judgments. In the social sciences, a common measure for agreement
between judges is the kappa statistic . It is designed for categorical judgments and corrects a
simple agreement rate for the rate of chance agreement.
kappa = (P(A) − P(E)) / (1 − P(E))    (46)
where P(A) is the proportion of the times the judges agreed, and P(E) is the proportion of
the times they would be expected to agree by chance. There are choices in how the latter is
estimated: if we simply say we are making a two-class decision and assume nothing more, then
the expected chance agreement rate is 0.5. However, normally the class distribution assigned is
skewed, and it is usual to use marginal statistics to calculate expected agreement. There are
still two ways to do it depending on whether one pools the marginal distribution across judges or
uses the marginals for each judge separately; both forms have been used, but we present the
pooled version because it is more conservative in the presence of systematic differences in
assessments across judges. The calculations are shown in Table 8.2 . The kappa value will be
1 if two judges always agree, 0 if they agree only at the rate given by chance, and negative if
they are worse than random. If there are more than two judges, it is normal to calculate an
average pairwise kappa value. As a rule of thumb, a kappa value above 0.8 is taken as good
agreement, a kappa value between 0.67 and 0.8 is taken as fair agreement, and agreement
below 0.67 is seen as data providing a dubious basis for an evaluation, though the precise
cutoffs depend on the purposes for which the data will be used.
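A small Python sketch of the kappa computation with pooled marginals; the 2x2 counts below are hypothetical and only exercise the formula.

def kappa_pooled(table):
    # table[i][j]: count of (judge 1 = i, judge 2 = j), index 0 = relevant, 1 = nonrelevant.
    total = sum(sum(row) for row in table)
    p_agree = (table[0][0] + table[1][1]) / total
    # Pooled marginals: average the two judges' rates of judging "relevant".
    p_rel = (sum(table[0]) + table[0][0] + table[1][0]) / (2 * total)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

print(round(kappa_pooled([[300, 20], [10, 70]]), 3))   # hypothetical counts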
Interjudge agreement of relevance has been measured within the TREC evaluations and for
medical IR collections. Using the above rules of thumb, the level of agreement normally falls in
the range of ``fair'' (0.67-0.8). The fact that human agreement on a binary relevance judgment is
quite modest is one reason for not requiring more fine-grained relevance labeling from the test
set creator. To answer the question of whether IR evaluation results are valid despite the
variation of individual assessors' judgments, people have experimented with evaluations taking
one or the other of two judges' opinions as the gold standard. The choice can make a
considerable absolute difference to reported scores, but has in general been found to have little
impact on the relative effectiveness ranking of either different systems or variants of a single
system which are being compared for effectiveness.