Unit 2
Exercise: give intuitions for all the ‘0’ entries. Why do some
zero entries correspond to big deltas in other columns?
Lossless vs. lossy compression
Lossless compression: All information is
preserved.
What we mostly do in IR.
Lossy compression: Discard some information
Several of the preprocessing steps can be
viewed as lossy compression: case folding, stop
words, stemming, number elimination.
Optimization: Prune postings entries that are
unlikely to turn up in the top k list for any query.
Almost no loss of quality for the top k list.
Vocabulary vs. collection size
How big is the term vocabulary?
That is, how many distinct words are there?
Can we assume an upper bound?
Not really: at least 70^20 ≈ 10^37 different words of
length 20
In practice, the vocabulary will keep growing with
the collection size
Especially with Unicode
Vocabulary vs. collection size
Heaps’ law: M = kT^b
M is the size of the vocabulary, T is the number
of tokens in the collection
Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
In a log-log plot of vocabulary size M vs. T,
Heaps’ law predicts a line with slope about ½
It is the simplest possible relationship between the
two in log-log space
An empirical finding (“empirical law”)
Heaps’ Law
For RCV1, the dashed line
log10 M = 0.49 log10 T + 1.64 is
the best least-squares fit.
Thus, M = 10^1.64 T^0.49, so k =
10^1.64 ≈ 44 and b = 0.49.
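As a quick illustration, here is a minimal Python sketch that plugs the RCV1 fit above (k ≈ 44, b ≈ 0.49) into Heaps’ law to predict vocabulary size; these constants are the fitted values for this one collection, not universal ones.

def heaps_vocabulary_size(num_tokens: int, k: float = 44.0, b: float = 0.49) -> int:
    # Heaps' law: M = k * T^b, with M the vocabulary size and T the token count.
    return round(k * num_tokens ** b)

if __name__ == "__main__":
    for tokens in (10_000, 1_000_000, 100_000_000):
        print(f"T = {tokens:>11,} tokens -> M ~ {heaps_vocabulary_size(tokens):,} terms")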
Compression
Now, we will consider schemes for compressing the
space used by the dictionary and the postings
(basic Boolean index only).
DICTIONARY COMPRESSION
Why compress the dictionary?
Search begins with the dictionary
We want to keep it in memory
Memory footprint competition with other
applications
Embedded/mobile devices may have very little
memory
Even if the dictionary isn’t in memory, we want it
to be small for a fast search startup time
So, compressing the dictionary is important
Dictionary storage - first cut
Array of fixed-width entries
~400,000 terms; 28 bytes/term = 11.2 MB.
Dictionary stored as one long string of all terms: ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo….
Assuming each dictionary term is equally likely in a query (not really so in practice!), the average
number of comparisons is (1 + 2·2 + 4·3 + 4)/8 ≈ 2.6.
Front coding (term lengths stored, shared prefix written once):
8automata8automate9automatic10automation → 8automat*a1◊e2◊ic3◊ion
Postings are compressed by storing gaps between successive docIDs (e.g., gaps such as 824 and 5) rather than the docIDs themselves.
Unary code: represent a gap G as G 1s followed by a 0.
The unary code for 40 is 11111111111111111111111111111111111111110.
The unary code for 80 is eighty 1s followed by a 0, which is clearly too long for large gaps.
Gamma codes
We can compress better with bit-level codes
The Gamma code is the best known of these.
Represent a gap G as a pair length and offset
offset is G in binary, with the leading bit cut off
For example 13 → 1101 → 101
length is the length of offset
For 13 (offset 101), this is 3.
We encode length with unary code: 1110.
Gamma code of 13 is the concatenation of length
and offset: 1110101
Gamma code examples
number   length         offset        γ-code
0        none
1        0                            0
2        10             0             10,0
3        10             1             10,1
4        110            00            110,00
9        1110           001           1110,001
13       1110           101           1110,101
24       11110          1000          11110,1000
511      111111110      11111111      111111110,11111111
1025     11111111110    0000000001    11111111110,0000000001
Gamma code properties
G is encoded using 2⌊log2 G⌋ + 1 bits
Length of offset is ⌊log2 G⌋ bits
Length of length is ⌊log2 G⌋ + 1 bits
All gamma codes have an odd number of bits
Almost within a factor of 2 of the best possible, log2 G
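A small Python sketch of the length-plus-offset construction described above, encoding and decoding one gap at a time; the function names are illustrative.

def unary(n: int) -> str:
    # Unary code: n 1s followed by a 0.
    return "1" * n + "0"

def gamma_encode(gap: int) -> str:
    if gap < 1:
        raise ValueError("gamma codes are defined for gaps >= 1")
    binary = bin(gap)[2:]               # e.g. 13 -> '1101'
    offset = binary[1:]                 # drop the leading bit -> '101'
    return unary(len(offset)) + offset  # unary length, then offset: '1110' + '101'

def gamma_decode(bits: str) -> int:
    length = bits.index("0")                        # number of leading 1s
    offset = bits[length + 1 : length + 1 + length]
    return int("1" + offset, 2) if length else 1

if __name__ == "__main__":
    for g in (1, 2, 9, 13, 24, 511, 1025):
        code = gamma_encode(g)
        assert gamma_decode(code) == g
        print(g, code)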
Introduction to
Information Retrieval
Lecture 5: Scoring, Term Weighting and the
Vector Space Model
Introduction to Information Retrieval Ch. 6
Ranked retrieval
Thus far, our queries have all been Boolean.
Documents either match or don’t.
Good for expert users with precise understanding of
their needs and the collection
Not good for the majority of users.
Most users are incapable of writing Boolean queries (or they
are, but they think it’s too much work).
Most users don’t want to wade through 1000s of results.
This is particularly true of web search.
Introduction to Information Retrieval Ch. 6
Jaccard coefficient
A commonly used measure of overlap of two sets A
and B
jaccard(A,B) = |A ∩ B| / |A ∪ B|
jaccard(A,A) = 1
jaccard(A,B) = 0 if A ∩ B = ∅
A and B don’t have to be the same size.
Always assigns a number between 0 and 1.
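A minimal Python sketch of the Jaccard coefficient on term sets; the query and documents below are toy examples.

def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0                     # convention: two empty sets fully overlap
    return len(a & b) / len(a | b)

query = {"ides", "of", "march"}
doc1 = {"caesar", "died", "in", "march"}
doc2 = {"the", "long", "march"}
print(jaccard(query, doc1))            # 1/6
print(jaccard(query, doc2))            # 1/5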
Introduction to Information Retrieval Ch. 6
term        Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      157                    73              0             0        0         0
Brutus      4                      157             0             1        0         0
Caesar      232                    227             0             2        1         1
Calpurnia   0                      10              0             0        0         0
Cleopatra   57                     0               0             0        0         0
mercy       2                      0               3             5        5         1
worser      2                      0               1             1        1         0
Introduction to Information Retrieval
Term frequency tf
The term frequency tft,d of term t in document d is
defined as the number of times that t occurs in d.
We want to use tf when computing query-document
match scores. But how?
Raw term frequency is not what we want:
A document with 10 occurrences of the term is more
relevant than a document with 1 occurrence of the term.
But not 10 times more relevant.
Relevance does not increase proportionally with
term frequency.
NB: frequency = count in IR
Introduction to Information Retrieval Sec. 6.2
Log-frequency weighting
The log frequency weight of term t in d is:
wt,d = 1 + log10(tft,d)   if tft,d > 0
wt,d = 0                  otherwise
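A one-function Python sketch of this weighting:

import math

def log_tf_weight(tf: int) -> float:
    # w_{t,d} = 1 + log10(tf) for tf > 0, and 0 otherwise.
    return 1.0 + math.log10(tf) if tf > 0 else 0.0

for tf in (0, 1, 2, 10, 1000):
    print(tf, log_tf_weight(tf))       # 0, 1, 1.3, 2, 4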
Document frequency
Rare terms are more informative than frequent terms
Recall stop words
Consider a term in the query that is rare in the
collection (e.g., arachnocentric)
A document containing this term is very likely to be
relevant to the query arachnocentric
→ We want a high weight for rare terms like
arachnocentric.
Introduction to Information Retrieval Sec. 6.2.1
idf weight
dft is the document frequency of t: the number of
documents that contain t
dft is an inverse measure of the informativeness of t
dft ≤ N
We define the idf (inverse document frequency) of t
by
idft = log10(N/dft)
We use log (N/dft) instead of N/dft to “dampen” the effect
of idf.
Introduction to Information Retrieval Sec. 6.2.1
tf-idf weighting
The tf-idf weight of a term is the product of its tf
weight and its idf weight.
wt,d = log(1 + tft,d) × log10(N/dft)
Best known weighting scheme in information retrieval
Note: the “-” in tf-idf is a hyphen, not a minus sign!
Alternative names: tf.idf, tf x idf
Increases with the number of occurrences of term
within a document
Increases with the rarity of the term in the collection
Introduction to Information Retrieval Sec. 6.2.2
Score(q,d) = Σ t ∈ q∩d  tf-idft,d
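A hedged Python sketch of tf-idf weighting and this overlap score; the tiny corpus and helper names below are illustrative assumptions, not part of the original slides.

import math
from collections import Counter

def tf_idf(tf: int, df: int, n_docs: int) -> float:
    # w_{t,d} = log(1 + tf_{t,d}) * log10(N / df_t)
    if tf == 0 or df == 0:
        return 0.0
    return math.log10(1 + tf) * math.log10(n_docs / df)

def overlap_score(query_terms, doc_counts: Counter, df: dict, n_docs: int) -> float:
    # Score(q, d): sum of tf-idf_{t,d} over terms t occurring in both q and d.
    return sum(tf_idf(doc_counts[t], df.get(t, 0), n_docs)
               for t in set(query_terms) if t in doc_counts)

docs = [Counter("the caesar was ambitious".split()),
        Counter("brutus killed caesar".split()),
        Counter("the noble brutus".split())]
df = Counter(t for d in docs for t in d)          # document frequency per term
print([round(overlap_score(["brutus", "caesar"], d, df, len(docs)), 3) for d in docs])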
Introduction to Information Retrieval Sec. 6.3
Documents as vectors
So we have a |V|-dimensional vector space
Terms are axes of the space
Documents are points or vectors in this space
Queries as vectors
Key idea 1: Do the same for queries: represent
queries as vectors in the space
Key idea 2: Rank documents according to their
proximity to the query in this space
proximity = similarity of vectors
proximity ≈ inverse of distance
Recall: We do this because we want to get away from
the you’re-either-in-or-out Boolean model.
Instead: rank more relevant documents higher than
less relevant documents
Introduction to Information Retrieval Sec. 6.3
cosine(query, document)
cos(q, d) = (q · d) / (|q| |d|) = Σ i=1..|V| qi di / (√(Σ i qi²) √(Σ i di²))
(numerator: dot product; denominator: product of the vector lengths, i.e., q/|q| and d/|d| are unit vectors)
cos(q, d) = q · d = Σ i=1..|V| qi di   for q, d length-normalized.
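A short Python sketch of this computation over sparse term-weight dictionaries (an assumed representation chosen for readability).

import math

def normalize(v: dict) -> dict:
    # Divide every weight by the vector's Euclidean length.
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {t: w / norm for t, w in v.items()} if norm else dict(v)

def cosine(q: dict, d: dict) -> float:
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

q = {"jealous": 1.0, "gossip": 1.0}
d = {"jealous": 2.0, "gossip": 1.3, "affection": 3.1}
print(cosine(q, d))
print(sum(w * normalize(d).get(t, 0.0) for t, w in normalize(q).items()))  # same value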
Introduction to Information Retrieval Sec. 6.3
Term frequencies (counts):
term        SaS   PaP   WH
affection   115   58    20
jealous     10    7     11
gossip      2     0     6
wuthering   0     0     38
SaS: Sense and Sensibility; PaP: Pride and Prejudice; WH: Wuthering Heights
After log-frequency weighting and length normalization:
cos(SaS,PaP) ≈ 0.789 × 0.832 + 0.515 × 0.555 + 0.335 × 0.0 + 0.0 × 0.0 ≈ 0.94
cos(SaS,WH) ≈ 0.79
cos(PaP,WH) ≈ 0.69
Why do we have cos(SaS,PaP) > cos(SaS,WH)?
Introduction to Information Retrieval Sec. 6.4
Points to note
A document may have a high cosine similarity score
for a query, even if it does not contain all terms in
the query
How can we speed up vector space retrieval?
Can store the inverse document frequency (e.g., N/dft) at
the head of the postings list for term t
Store the term-frequency (e.g., tft,d) in each postings entry
of the postings list for term t
For a multi-word query, the postings lists of the various
query terms can even be traversed concurrently
Introduction to Information Retrieval
COMPUTING SCORES IN A
COMPLETE SEARCH SYSTEM
Slides by Manning, Raghavan, Schutze
Introduction to Information Retrieval Ch. 6
Recap: tf‐idf weighting
The tf‐idf weight of a term is the product of its tf
weight and its idf weight.
wt,d = (1 + log10 tft,d) × log10(N/dft)
Best known weighting scheme in information retrieval
Increases with the number of occurrences within a
document
Increases with the rarity of the term in the collection
Introduction to Information Retrieval Ch. 6
Recap: Queries as vectors
Key idea 1: Do the same for queries: represent them
as vectors in the space
Key idea 2: Rank documents according to their
proximity to the query in this space
proximity = similarity of vectors
Introduction to Information Retrieval Ch. 6
Recap: cosine(query, document)
cos(q, d) = (q · d) / (|q| |d|) = Σ i=1..|V| qi di / (√(Σ i qi²) √(Σ i di²))
(dot product of the two vectors, divided by the product of their lengths, i.e., the dot product of the corresponding unit vectors)
This lecture
Speeding up vector space ranking
Putting together a complete search
system
Will require learning about a number of
miscellaneous topics and heuristics
Introduction to Information Retrieval Sec. 6.3.3
Computing cosine scores
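A hedged term-at-a-time sketch of the cosine scoring algorithm this slide refers to, assuming a postings index mapping each term to (docID, weight) pairs and a table of document vector lengths; these names are illustrative.

import heapq

def cosine_scores(query_weights: dict, index: dict, lengths: dict, k: int):
    scores = {}
    # Accumulate w_{t,q} * w_{t,d} for every posting of every query term.
    for term, w_tq in query_weights.items():
        for doc_id, w_td in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_tq * w_td
    # Normalize by document length, then return the top K docs.
    for doc_id in scores:
        scores[doc_id] /= lengths[doc_id]
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])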
Introduction to Information Retrieval Sec. 7.1
Efficient cosine ranking
Find the K docs in the collection "nearest" to the
query ⇒ the K largest query-doc cosines.
Efficient ranking:
Computing a single cosine efficiently.
Choosing the K largest cosine values efficiently.
Can we do this without computing all N cosines?
Introduction to Information Retrieval Sec. 7.1
Efficient cosine ranking
What we’re doing in effect: solving the K‐nearest
neighbor problem for a query vector
In general, we do not know how to do this efficiently
for high‐dimensional spaces
But it is solvable for short queries, and standard
indexes support this well
Introduction to Information Retrieval Sec. 7.1
Special case – unweighted queries
No weighting on query terms
Assume each query term occurs only once
Then for ranking, don’t need to normalize query
vector
Slight simplification of algorithm from Lecture 6
Introduction to Information Retrieval Sec. 7.1
Computing the K largest cosines:
selection vs. sorting
Typically we want to retrieve the top K docs (in the
cosine ranking for the query)
not to totally order all docs in the collection
Can we pick off docs with K highest cosines?
Let J = number of docs with nonzero cosines
We seek the K best of these J
Introduction to Information Retrieval Sec. 7.1
Use heap for selecting top K
Binary tree in which each node’s value > the values
of children
Takes 2J operations to construct, then each of K
“winners” read off in 2log J steps.
For J=1M, K=100, this is about 10% of the cost of
sorting.
(Figure: a binary max-heap of scores, with root 1 and children .9 and .3.)
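A minimal Python illustration of heap-based selection of the K largest of J scores, using the standard library rather than a hand-rolled heap.

import heapq, random

scores = [random.random() for _ in range(1_000_000)]   # J = 1M nonzero cosines
top_k = heapq.nlargest(100, scores)                    # K = 100 winners
print(top_k[:3])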
Introduction to Information Retrieval Sec. 7.1.1
Bottlenecks
Primary computational bottleneck in scoring: cosine
computation
Can we avoid all this computation?
Yes, but may sometimes get it wrong
a doc not in the top K may creep into the list of K
output docs
Is this such a bad thing?
Introduction to Information Retrieval Sec. 7.1.1
Cosine similarity is only a proxy
User has a task and a query formulation
Cosine matches docs to query
Thus cosine is anyway a proxy for user happiness
If we get a list of K docs “close” to the top K by cosine
measure, should be ok
Introduction to Information Retrieval Sec. 7.1.1
Generic approach
Find a set A of contenders, with K < |A| << N
A does not necessarily contain the top K, but has
many docs from among the top K
Return the top K docs in A
Think of A as pruning non‐contenders
The same approach is also used for other (non‐
cosine) scoring functions
Will look at several schemes following this approach
Introduction to Information Retrieval Sec. 7.1.2
Index elimination
The basic cosine computation algorithm only
considers docs containing at least one query term
Take this further:
Only consider high‐idf query terms
Only consider docs containing many query terms
Introduction to Information Retrieval Sec. 7.1.2
High‐idf query terms only
For a query such as catcher in the rye
Only accumulate scores from catcher and rye
Intuition: in and the contribute little to the scores
and so don’t alter rank‐ordering much
Benefit:
Postings of low-idf terms have many docs → these (many)
docs get eliminated from the set A of contenders
Introduction to Information Retrieval Sec. 7.1.2
Docs containing many query terms
Any doc with at least one query term is a candidate
for the top K output list
For multi‐term queries, only compute scores for docs
containing several of the query terms
Say, at least 3 out of 4
Imposes a “soft conjunction” on queries seen on web
search engines (early Google)
Easy to implement in postings traversal
Introduction to Information Retrieval Sec. 7.1.2
3 of 4 query terms
Antony     3 4 8 16 32 64 128
Brutus     2 4 8 16 32 64 128
Caesar     1 2 3 5 8 13 21 34
Calpurnia  13 16 32
Scores are only computed for docs containing at least 3 of the 4 query terms: docs 8, 16 and 32.
Introduction to Information Retrieval Sec. 7.1.3
Champion lists
Precompute for each dictionary term t, the r docs of
highest weight in t’s postings
Call this the champion list for t
(aka fancy list or top docs for t)
Note that r has to be chosen at index build time
Thus, it’s possible that r < K
At query time, only compute scores for docs in the
champion list of some query term
Pick the K top‐scoring docs from amongst these
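A hedged sketch of champion-list construction and candidate selection, assuming the same term -> (docID, weight) postings representation as in the earlier sketches.

import heapq

def build_champion_lists(index: dict, r: int) -> dict:
    # Keep only the r highest-weighted postings per term.
    return {t: heapq.nlargest(r, plist, key=lambda p: p[1]) for t, plist in index.items()}

def candidate_docs(query_terms, champions: dict) -> set:
    # Only docs appearing in some query term's champion list are scored.
    return {doc for t in query_terms for doc, _ in champions.get(t, [])}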
Introduction to Information Retrieval Sec. 7.1.3
Exercises
How do Champion Lists relate to Index Elimination?
Can they be used together?
How can Champion Lists be implemented in an
inverted index?
Note that the champion list has nothing to do with small
docIDs
Introduction to Information Retrieval Sec. 7.1.4
Static quality scores
We want top‐ranking documents to be both relevant
and authoritative
Relevance is being modeled by cosine scores
Authority is typically a query‐independent property
of a document
Examples of authority signals
Wikipedia among websites
Articles in certain newspapers
Quantitative
A paper with many citations
Many bitly’s, diggs or del.icio.us marks
(Pagerank)
Introduction to Information Retrieval Sec. 7.1.4
Modeling authority
Assign a query-independent
quality score in [0,1] to each document d
Denote this by g(d)
Thus, a quantity like the number of citations is scaled
into [0,1]
Exercise: suggest a formula for this.
Introduction to Information Retrieval Sec. 7.1.4
Net score
Consider a simple total score combining cosine
relevance and authority
net‐score(q,d) = g(d) + cosine(q,d)
Can use some other linear combination
Indeed, any function of the two “signals” of user happiness
– more later
Now we seek the top K docs by net score
Introduction to Information Retrieval Sec. 7.1.4
Top K by net score – fast methods
First idea: Order all postings by g(d)
Key: this is a common ordering for all postings
Thus, can concurrently traverse query terms’
postings for
Postings intersection
Cosine score computation
Exercise: write pseudocode for cosine score
computation if postings are ordered by g(d)
Introduction to Information Retrieval Sec. 7.1.4
Why order postings by g(d)?
Under g(d)‐ordering, top‐scoring docs likely to
appear early in postings traversal
In time‐bound applications (say, we have to return
whatever search results we can in 50 ms), this allows
us to stop postings traversal early
Short of computing scores for all docs in postings
Introduction to Information Retrieval Sec. 7.1.4
Champion lists in g(d)‐ordering
Can combine champion lists with g(d)‐ordering
Maintain for each term a champion list of the r docs
with highest g(d) + tf-idft,d
Seek top‐K results from only the docs in these
champion lists
Introduction to Information Retrieval Sec. 7.1.4
High and low lists
For each term, we maintain two postings lists called
high and low
Think of high as the champion list
When traversing postings on a query, only traverse
high lists first
If we get more than K docs, select the top K and stop
Else proceed to get docs from the low lists
Can be used even for simple cosine scores, without
global quality g(d)
A means for segmenting index into two tiers
Introduction to Information Retrieval Sec. 7.1.5
Impact‐ordered postings
We only want to compute scores for docs for which
wft,d is high enough
We sort each postings list by wft,d
Now: not all postings in a common order!
How do we compute scores in order to pick off top K?
Two ideas follow
Introduction to Information Retrieval Sec. 7.1.5
1. Early termination
When traversing t’s postings, stop early after either:
a fixed number of docs r, or
wft,d drops below some threshold
Take the union of the resulting sets of docs
One from the postings of each query term
Compute only the scores for docs in this union
Introduction to Information Retrieval Sec. 7.1.5
2. idf‐ordered terms
When considering the postings of query terms
Look at them in order of decreasing idf
High idf terms likely to contribute most to score
As we update score contribution from each query
term
Stop if doc scores relatively unchanged
Can apply to cosine or some other net scores
Introduction to Information Retrieval Sec. 7.1.6
Cluster pruning: preprocessing
Pick √N docs at random: call these leaders
For every other doc, pre-compute its nearest
leader
Docs attached to a leader: its followers;
Likely: each leader has ~ √N followers.
Introduction to Information Retrieval Sec. 7.1.6
Cluster pruning: query processing
Process a query as follows:
Given query Q, find its nearest leader L.
Seek K nearest docs from among L’s
followers.
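A hedged Python sketch of cluster pruning with √N random leaders; the cosine function and the docID -> vector dictionary are assumed inputs, not part of the original slides.

import heapq, math, random

def preprocess(docs: dict, cosine):
    # Pick ~sqrt(N) leaders at random and attach every doc to its nearest leader.
    leaders = random.sample(list(docs), k=max(1, int(math.sqrt(len(docs)))))
    followers = {leader: [] for leader in leaders}
    for d in docs:
        nearest = max(leaders, key=lambda l: cosine(docs[d], docs[l]))
        followers[nearest].append(d)
    return leaders, followers

def query(q_vec, docs: dict, leaders, followers, cosine, k: int):
    # Route the query to its nearest leader and rank only that leader's followers.
    leader = max(leaders, key=lambda l: cosine(q_vec, docs[l]))
    return heapq.nlargest(k, followers[leader], key=lambda d: cosine(q_vec, docs[d]))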
Introduction to Information Retrieval Sec. 7.1.6
Visualization
(Figure: query, leaders, and their followers in the vector space.)
Introduction to Information Retrieval Sec. 7.1.6
Why use random sampling
Fast
Leaders reflect data distribution
Introduction to Information Retrieval Sec. 7.1.6
General variants
Have each follower attached to b1=3 (say) nearest
leaders.
From query, find b2=4 (say) nearest leaders and their
followers.
Can recurse on leader/follower construction.
Introduction to Information Retrieval Sec. 7.1.6
Exercises
To find the nearest leader in step 1, how many cosine
computations do we do?
Why did we have √N in the first place?
What is the effect of the constants b1, b2 on the
previous slide?
Devise an example where this is likely to fail – i.e., we
miss one of the K nearest docs.
Likely under random sampling.
Introduction to Information Retrieval Sec. 6.1
Parametric and zone indexes
Thus far, a doc has been a sequence of terms
In fact documents have multiple parts, some with
special semantics:
Author
Title
Date of publication
Language
Format
etc.
These constitute the metadata about a document
Introduction to Information Retrieval Sec. 6.1
Fields
We sometimes wish to search by these metadata
E.g., find docs authored by William Shakespeare in the
year 1601, containing alas poor Yorick
Year = 1601 is an example of a field
Also, author last name = shakespeare, etc.
Field or parametric index: postings for each field
value
Sometimes build range trees (e.g., for dates)
Field query typically treated as conjunction
(doc must be authored by shakespeare)
Introduction to Information Retrieval Sec. 6.1
Zone
A zone is a region of the doc that can contain an
arbitrary amount of text, e.g.,
Title
Abstract
References …
Build inverted indexes on zones as well to permit
querying
E.g., “find docs with merchant in the title zone and
matching the query gentle rain”
Introduction to Information Retrieval Sec. 6.1
Example zone indexes
Introduction to Information Retrieval Sec. 7.2.1
Tiered indexes
Break postings up into a hierarchy of lists
Most important
…
Least important
Can be done by g(d) or another measure
Inverted index thus broken up into tiers of decreasing
importance
At query time use top tier unless it fails to yield K
docs
If so drop to lower tiers
Introduction to Information Retrieval Sec. 7.2.1
Example tiered index
Introduction to Information Retrieval Sec. 7.2.2
Query term proximity
Free text queries: just a set of terms typed into the
query box – common on the web
Users prefer docs in which query terms occur within
close proximity of each other
Let w be the smallest window in a doc containing all
query terms, e.g.,
For the query strained mercy the smallest window in
the doc The quality of mercy is not strained is 4
(words)
Would like scoring function to take this into account
– how?
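One possible building block is computing the window width w itself; below is a small Python sketch, assuming each query term's sorted word positions in the document are available (e.g., from a positional index).

def smallest_window(positions: dict) -> float:
    # positions: query term -> sorted list of word positions in the doc.
    events = sorted((p, t) for t, plist in positions.items() for p in plist)
    best, last_seen = float("inf"), {}
    for pos, term in events:
        last_seen[term] = pos
        if len(last_seen) == len(positions):           # window now covers every term
            best = min(best, pos - min(last_seen.values()) + 1)
    return best

# "The quality of mercy is not strained", query {strained, mercy}:
print(smallest_window({"mercy": [3], "strained": [6]}))   # 4 words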
Introduction to Information Retrieval Sec. 7.2.3
Query parsers
Free text query from user may in fact spawn one or
more queries to the indexes, e.g., query rising
interest rates
Run the query as a phrase query
If <K docs contain the phrase rising interest rates, run the
two phrase queries rising interest and interest rates
If we still have <K docs, run the vector space query rising
interest rates
Rank matching docs by vector space scoring
This sequence is issued by a query parser
Introduction to Information Retrieval Sec. 7.2.3
Aggregate scores
We’ve seen that score functions can combine cosine,
static quality, proximity, etc.
How do we know the best combination?
Some applications – expert‐tuned
Increasingly common: machine‐learned
See May 19th lecture
Introduction to Information Retrieval Sec. 7.2.4
Putting it all together
Evaluation in information retrieval
To measure ad hoc information retrieval effectiveness in the standard way, we need a test
collection consisting of three things:
1. A document collection
2. A test suite of information needs, expressible as queries
3. A set of relevance judgments, standardly a binary assessment of either relevant or
nonrelevant for each query-document pair.
The standard approach to information retrieval system evaluation revolves around the notion of
relevant and nonrelevant documents. With respect to a user information need, a document in
the test collection is given a binary classification as either relevant or nonrelevant. This decision
is referred to as the gold standard or ground truth judgment of relevance. The test document
collection and suite of information needs have to be of a reasonable size: you need to average
performance over fairly large test sets, as results are highly variable over different documents
and information needs. As a rule of thumb, 50 information needs has usually been found to be a
sufficient minimum.
Relevance is assessed relative to an information need, not a query. For example, an information need might be:
Information on whether drinking red wine is more effective at reducing your risk of
heart attacks than white wine.
This might be translated into a query such as:
wine and red and white and heart and attack and effective
A document is relevant if it addresses the stated information need, not because it just happens
to contain all the words in the query. This distinction is often misunderstood in practice, because
the information need is not overt. But, nevertheless, an information need is present. If a user
types python into a web search engine, they might be wanting to know where they can purchase
a pet python. Or they might be wanting information on the programming language Python. From
a one word query, it is very difficult for a system to know what the information need is. But,
nevertheless, the user has one, and can judge the returned results on the basis of their
relevance to it. To evaluate a system, we require an overt expression of an information need,
which can be used for judging returned documents as relevant or nonrelevant. At this point, we
make a simplification: relevance can reasonably be thought of as a scale, with some documents
highly relevant and others marginally so. But for the moment, we will use just a binary decision
of relevance. We discuss the reasons for using binary relevance judgments and alternatives in
Section 8.5.1 .
Many systems contain various weights (often known as parameters) that can be adjusted to
tune system performance. It is wrong to report results on a test collection which were obtained
by tuning these parameters to maximize performance on that collection. That is because such
tuning overstates the expected performance of the system, because the weights will be set to
maximize performance on one particular set of queries rather than for a random sample of
queries. In such cases, the correct procedure is to have one or more development test
collections , and to tune the parameters on the development test collection. The tester then runs
the system with those weights on the test collection and reports the results on that collection as
an unbiased estimate of performance.
Standard test collections
Here is a list of the most standard test collections and evaluation series. We focus particularly
on test collections for ad hoc information retrieval system evaluation, but also mention a couple
of similar test collections for text classification.
The Cranfield collection. This was the pioneering test collection in allowing precise quantitative
measures of information retrieval effectiveness, but is nowadays too small for anything but the
most elementary pilot experiments. Collected in the United Kingdom starting in the late 1950s, it
contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive
relevance judgments of all (query, document) pairs.
Text Retrieval Conference (TREC) . The U.S. National Institute of Standards and Technology
(NIST) has run a large IR test bed evaluation series since 1992. Within this framework, there
have been many tracks over a range of different test collections, but the best known test
collections are the ones used for the TREC Ad Hoc track during the first 8 TREC evaluations
between 1992 and 1999. In total, these test collections comprise 6 CDs containing 1.89 million
documents (mainly, but not exclusively, newswire articles) and relevance judgments for 450
information needs, which are called topics and specified in detailed text passages. Individual
test collections are defined over different subsets of this data. The early TRECs each consisted
of 50 information needs, evaluated over different but overlapping sets of documents. TRECs 6-8
provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information
Service articles. This is probably the best subcollection to use in future work, because it is the
largest and the topics are more consistent. Because the test document collections are so large,
there are no exhaustive relevance judgments. Rather, NIST assessors' relevance judgments are
available only for the documents that were among the top returned for some system which
was entered in the TREC evaluation for which the information need was developed.
In more recent years, NIST has done evaluations on larger document collections, including the
25 million page GOV2 web page collection. From the beginning, the NIST test document
collections were orders of magnitude larger than anything available to researchers previously
and GOV2 is now the largest Web collection easily available for research purposes.
Nevertheless, the size of GOV2 is still more than 2 orders of magnitude smaller than the current
size of the document collections indexed by the large web search companies.
NII Test Collections for IR Systems ( NTCIR ). The NTCIR project has built various test
collections of similar sizes to the TREC collections, focusing on East Asian language and
cross-language information retrieval , where queries are made in one language over a
document collection containing documents in one or more other languages. See:
http://research.nii.ac.jp/ntcir/data/data-en.html
Cross Language Evaluation Forum ( CLEF ). This evaluation series has concentrated on
European languages and cross-language information retrieval. See:
http://www.clef-campaign.org/
Reuters-21578 and Reuters-RCV1. For text classification, the most used test collection has been the
Reuters-21578 collection of 21578 newswire articles; see Chapter 13 , page 13.6 . More
recently, Reuters released the much larger Reuters Corpus Volume 1 (RCV1), consisting of
806,791 documents; see Chapter 4 , page 4.2 . Its scale and rich annotation makes it a better
basis for future research.
20 Newsgroups . This is another widely used text classification collection, collected by Ken
Lang. It consists of 1000 articles from each of 20 Usenet newsgroups (the newsgroup name
being regarded as the category). After the removal of duplicate articles, as it is usually used, it
contains 18941 articles.
Evaluation of unranked retrieval sets
Given these ingredients, how is system effectiveness measured? The two most frequent and
basic measures for information retrieval effectiveness are precision and recall. These are first
defined for the simple case where an IR system returns a set of documents for a query. We will
see later how to extend these notions to ranked retrieval situations.
Precision = #(relevant items retrieved) / #(retrieved items) = P(relevant | retrieved)    (36)
Recall = #(relevant items retrieved) / #(relevant items) = P(retrieved | relevant)    (37)
These notions can be made clear by examining the following contingency table:
                 Relevant                Nonrelevant
Retrieved        true positives (tp)     false positives (fp)
Not retrieved    false negatives (fn)    true negatives (tn)
Then:
P = tp / (tp + fp)    (38)
R = tp / (tp + fn)    (39)
An obvious alternative that may occur to the reader is to judge an information retrieval system
by its accuracy, that is, the fraction of its classifications that are correct. In terms of the
contingency table above, accuracy = (tp + tn) / (tp + fp + fn + tn).
There is a good reason why accuracy is not an appropriate measure for information retrieval
problems. In almost all circumstances, the data is extremely skewed: normally over 99.9% of the
documents are in the nonrelevant category. A system tuned to maximize accuracy can appear
to perform well by simply deeming all documents nonrelevant to all queries. Even if the system
is quite good, trying to label some documents as relevant will almost always lead to a high rate
of false positives. However, labeling all documents as nonrelevant is completely unsatisfying to
an information retrieval system user. Users are always going to want to see some documents,
and can be assumed to have a certain tolerance for seeing some false positives providing that
they get some useful information. The measures of precision and recall concentrate the
evaluation on the return of true positives, asking what percentage of the relevant documents
have been found and how many false positives have also been returned.
The advantage of having the two numbers for precision and recall is that one is more important
than the other in many circumstances. Typical web surfers would like every result on the first
page to be relevant (high precision) but have not the slightest interest in knowing let alone
looking at every document that is relevant. In contrast, various professional searchers such as
paralegals and intelligence analysts are very concerned with trying to get as high recall as
possible, and will tolerate fairly low precision results in order to get it. Individuals searching their
hard disks are also often interested in high recall searches. Nevertheless, the two quantities
clearly trade off against one another: you can always get a recall of 1 (but very low precision) by
retrieving all documents for all queries! Recall is a non-decreasing function of the number of
documents retrieved. On the other hand, in a good system, precision usually decreases as the
number of documents retrieved is increased. In general we want to get some amount of recall
while tolerating only a certain percentage of false positives.
A single measure that trades off precision versus recall is the F measure, which is the weighted
harmonic mean of precision and recall:
F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R),  where β² = (1 − α)/α    (40)
where α ∈ [0, 1] and thus β² ∈ [0, ∞]. The default balanced F measure equally weights precision
and recall, which means making α = 1/2 or β = 1. It is commonly written as F1,
which is short for F_{β=1}, even though the formulation in terms of α more transparently exhibits
the F measure as a weighted harmonic mean. When using β = 1, the formula on the right
simplifies to:
F_{β=1} = 2PR / (P + R)    (41)
However, using an even weighting is not the only choice. Values of β < 1 emphasize precision,
while values of β > 1 emphasize recall.
Why do we use a harmonic mean rather than the simpler average (arithmetic mean)? Recall
that we can always get 100% recall by just returning all documents, and therefore we can
always get a 50% arithmetic mean by the same process. This strongly suggests that the
arithmetic mean is an unsuitable measure to use. In contrast, if we assume that 1 document in
10,000 is relevant to the query, the harmonic mean score of this strategy is 0.02%. The
harmonic mean is always less than or equal to the arithmetic mean and the geometric mean.
When the values of two numbers differ greatly, the harmonic mean is closer to their minimum
than to their arithmetic mean; see Figure 8.1 .
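A short Python sketch of set-based precision, recall, and the weighted F measure of equation (40), on made-up retrieved and relevant sets.

def precision(retrieved: set, relevant: set) -> float:
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved: set, relevant: set) -> float:
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    # beta = 1 gives the balanced F1 = 2PR / (P + R).
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)

retrieved, relevant = {1, 2, 3, 4}, {2, 4, 5, 6, 7}
p, r = precision(retrieved, relevant), recall(retrieved, relevant)
print(p, r, f_measure(p, r))   # 0.5 0.4 0.444...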
Exercises.
● The balanced F measure (a.k.a. F1) is defined as the harmonic mean of precision and
recall. What is the advantage of using the harmonic mean rather than ``averaging''
(using the arithmetic mean)?
Evaluation of ranked retrieval results
Precision, recall, and the F measure are set-based measures. They are computed using
unordered sets of documents. We need to extend these measures (or to define new measures)
if we are to evaluate the ranked retrieval results that are now standard with search engines. In a
ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k
retrieved documents. For each such set, precision and recall values can be plotted to give a
precision-recall curve, such as the one shown in Figure 8.2. Precision-recall curves have a
distinctive saw-tooth shape: if the (k+1)th document retrieved is nonrelevant then recall is the
same as for the top k documents, but precision has dropped. If it is relevant, then both
precision and recall increase, and the curve jags up and to the right. It is often useful to remove
these jiggles and the standard way to do this is with an interpolated precision: the interpolated
precision p_interp at a certain recall level r is defined as the highest precision found for any
recall level r' ≥ r:
p_interp(r) = max_{r' ≥ r} p(r')    (42)
The justification is that almost anyone would be prepared to look at a few more documents if it
would increase the percentage of the viewed set that were relevant (that is, if the precision of
the larger set is higher). Interpolated precision is shown by a thinner line in Figure 8.2 . With this
definition, the interpolated precision at a recall of 0 is well-defined (Exercise 8.4 ).
Recall   Interpolated Precision
0.0      1.00
0.1      0.67
0.2      0.63
0.3      0.55
0.4      0.45
0.5      0.41
0.6      0.36
0.7      0.29
0.8      0.13
0.9      0.10
1.0      0.08
Most standard among the TREC community is Mean Average Precision (MAP), which provides a
single-figure measure of quality across recall levels. For a single information need, Average
Precision is the average of the precision value obtained for the set of top k documents existing after each
relevant document is retrieved, and this value is then averaged over information needs. That is,
if the set of relevant documents for an information need qj ∈ Q is {d1, …, dmj} and Rjk is
the set of ranked retrieval results from the top result until you get to document dk, then
MAP(Q) = (1/|Q|) Σ j=1..|Q| (1/mj) Σ k=1..mj Precision(Rjk)    (43)
When a relevant document is not retrieved at all, the precision value in the above equation is
taken to be 0. For a single information need, the average precision approximates the area under
the uninterpolated precision-recall curve, and so the MAP is roughly the average area under the
precision-recall curve for a set of queries.
Using MAP, fixed recall levels are not chosen, and there is no interpolation. The MAP value for
a test collection is the arithmetic mean of average precision values for individual information
needs. (This has the effect of weighting each information need equally in the final reported
number, even if many documents are relevant to some queries whereas very few are relevant to
other queries.) Calculated MAP scores normally vary widely across information needs when
measured within a single system, for instance, between 0.1 and 0.7. Indeed, there is normally
more agreement in MAP for an individual information need across systems than for MAP scores
for different information needs for the same system. This means that a set of test information
needs must be large and diverse enough to be representative of system effectiveness across
different queries.
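A small Python sketch of Average Precision for one ranked list and MAP over several information needs, following equation (43); the example ranking and relevance set are made up.

def average_precision(ranking: list, relevant: set) -> float:
    hits, precisions = 0, []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)   # precision just after each relevant doc
    # Relevant docs never retrieved contribute precision 0 (they add nothing to the sum).
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs: list) -> float:
    # runs: list of (ranking, relevant_set) pairs, one per information need.
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

print(average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"}))   # (1/2 + 2/4) / 3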
The above measures factor in precision at all recall levels. For many prominent applications,
particularly web search, this may not be germane to users. What matters is rather how many
good results there are on the first page or the first three pages. This leads to measuring
precision at fixed low levels of retrieved results, such as 10 or 30 documents. This is referred to
as ``Precision at k'', for example ``Precision at 10''. It has the advantage of not requiring any
estimate of the size of the set of relevant documents but the disadvantages that it is the least
stable of the commonly used evaluation measures and that it does not average well, since the
total number of relevant documents for a query has a strong influence on precision at k.
An alternative, which alleviates this problem, is R-precision. It requires having a set of known
relevant documents Rel, from which we calculate the precision of the top |Rel| documents
returned. (The set Rel may be incomplete, such as when Rel is formed by creating relevance
judgments for the pooled top k results of particular systems in a set of experiments.)
R-precision adjusts for the size of the set of relevant documents: A perfect system could score 1
on this metric for each query, whereas, even a perfect system could only achieve a precision at
20 of 0.4 if there were only 8 documents in the collection relevant to an information need.
Averaging this measure across queries thus makes more sense. This measure is harder to
explain to naive users than Precision at k but easier to explain than MAP. If there are |Rel|
relevant documents for a query, we examine the top |Rel| results of a system, and find that r
are relevant, then by definition, not only is the precision (and hence R-precision) r/|Rel|, but
the recall of this result set is also r/|Rel|. Thus, R-precision turns out to be identical to the
break-even point, another measure which is sometimes used, defined in terms of this equality
relationship holding. Like Precision at k, R-precision describes only one point on the
precision-recall curve, rather than attempting to summarize effectiveness across the curve, and
it is somewhat unclear why you should be interested in the break-even point rather than either
the best point on the curve (the point with maximal F measure) or a retrieval level of interest to
a particular application (Precision at k).
Another concept sometimes used in evaluation is an ROC curve . (``ROC'' stands for ``Receiver
Operating Characteristics'', but knowing that doesn't help most people.) An ROC curve plots the
true positive rate or sensitivity against the false positive rate or (1 − specificity). Here,
sensitivity is just another term for recall. The false positive rate is given by fp/(fp + tn).
Figure 8.4 shows the ROC curve corresponding to the precision-recall curve in Figure 8.2. An
ROC curve always goes from the bottom left to the top right of the graph. For a good system,
the graph climbs steeply on the left side. For unranked result sets, specificity, given by
tn/(fp + tn), was not seen as a very useful notion. Because the set of true negatives is
always so large, its value would be almost 1 for all information needs (and, correspondingly, the
value of the false positive rate would be almost 0). That is, the ``interesting'' part of Figure 8.2
(the low-recall region) is compressed into a small corner of Figure 8.4.
A final approach that has seen increasing adoption, especially when employed with machine
learning approaches to ranking, is measures of cumulative gain, and in particular
normalized discounted cumulative gain (NDCG). NDCG is designed for situations of non-binary
notions of relevance (cf. Section 8.5.1). Like precision at k, it is evaluated over some number k
of top search results. For a set of queries Q, let R(j, d) be the relevance score assessors
gave to document d for query j. Then,
NDCG(Q, k) = (1/|Q|) Σ j=1..|Q| Z_kj Σ m=1..k (2^R(j,m) − 1) / log2(1 + m)    (44)
where Z_kj is a normalization factor calculated so that a perfect ranking's NDCG at k for query j
is 1. For queries for which fewer than k documents are retrieved, the last summation is done up
to the number of retrieved documents.
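A hedged Python sketch of DCG and NDCG at k in the spirit of equation (44), normalizing by the DCG of an ideal reordering of the same relevance scores (which plays the role of Z_kj).

import math

def dcg_at_k(relevances: list, k: int) -> float:
    # Gain 2^R - 1 for the document at rank m, discounted by log2(1 + m).
    return sum((2 ** r - 1) / math.log2(1 + m)
               for m, r in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances: list, k: int) -> float:
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=6))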
Given information needs and documents, you need to collect relevance assessments. This is a
time-consuming and expensive process involving human beings. For tiny collections like
Cranfield, exhaustive judgments of relevance for each query and document pair were obtained.
For large modern collections, it is usual for relevance to be assessed only for a subset of the
documents for each query. The most standard approach is pooling, where relevance is
assessed over a subset of the collection that is formed from the top k documents returned by a
number of different IR systems (usually the ones to be evaluated), and perhaps other sources
such as the results of Boolean keyword searches or documents found by expert searchers in an
interactive process.
(Table 8.2: a contingency table of two judges' yes/no relevance judgments, from which observed agreement, pooled marginals, and the kappa statistic are computed.)
Kappa statistic
A human is not a device that reliably reports a gold standard judgment of relevance of a
document to a query. Rather, humans and their relevance judgments are quite idiosyncratic and
variable. But this is not a problem to be solved: in the final analysis, the success of an IR system
depends on how good it is at satisfying the needs of these idiosyncratic humans, one
information need at a time.
Nevertheless, it is interesting to consider and measure how much agreement between judges
there is on relevance judgments. In the social sciences, a common measure for agreement
between judges is the kappa statistic . It is designed for categorical judgments and corrects a
simple agreement rate for the rate of chance agreement.
kappa = (P(A) − P(E)) / (1 − P(E))    (46)
where P(A) is the proportion of the times the judges agreed, and P(E) is the proportion of
the times they would be expected to agree by chance. There are choices in how the latter is
estimated: if we simply say we are making a two-class decision and assume nothing more, then
the expected chance agreement rate is 0.5. However, normally the class distribution assigned is
skewed, and it is usual to use marginal statistics to calculate expected agreement. There are
still two ways to do it depending on whether one pools the marginal distribution across judges or
uses the marginals for each judge separately; both forms have been used, but we present the
pooled version because it is more conservative in the presence of systematic differences in
assessments across judges. The calculations are shown in Table 8.2 . The kappa value will be
1 if two judges always agree, 0 if they agree only at the rate given by chance, and negative if
they are worse than random. If there are more than two judges, it is normal to calculate an
average pairwise kappa value. As a rule of thumb, a kappa value above 0.8 is taken as good
agreement, a kappa value between 0.67 and 0.8 is taken as fair agreement, and agreement
below 0.67 is seen as data providing a dubious basis for an evaluation, though the precise
cutoffs depend on the purposes for which the data will be used.
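A small Python sketch of the kappa computation with pooled marginals; the 2x2 counts below are hypothetical and only exercise the formula.

def kappa_pooled(table):
    # table[i][j]: count of (judge 1 = i, judge 2 = j), index 0 = relevant, 1 = nonrelevant.
    total = sum(sum(row) for row in table)
    p_agree = (table[0][0] + table[1][1]) / total
    # Pooled marginals: average the two judges' rates of judging "relevant".
    p_rel = (sum(table[0]) + table[0][0] + table[1][0]) / (2 * total)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

print(round(kappa_pooled([[300, 20], [10, 70]]), 3))   # hypothetical counts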
Interjudge agreement of relevance has been measured within the TREC evaluations and for
medical IR collections. Using the above rules of thumb, the level of agreement normally falls in
the range of ``fair'' (0.67-0.8). The fact that human agreement on a binary relevance judgment is
quite modest is one reason for not requiring more fine-grained relevance labeling from the test
set creator. To answer the question of whether IR evaluation results are valid despite the
variation of individual assessors' judgments, people have experimented with evaluations taking
one or the other of two judges' opinions as the gold standard. The choice can make a
considerable absolute difference to reported scores, but has in general been found to have little
impact on the relative effectiveness ranking of either different systems or variants of a single
system which are being compared for effectiveness.