
ISCL – Winter Semester 2007 – IR – Midterm Exam

17 December 2007

Non-electronic documents and calculators are permitted.


Name:
Semester:

Exercise 1: Definitions
Define the following terms:
– tokenization

– permuterm index

– champion list

Exercise 2: Characteristics of a collection and its index


Consider a collection of 500 000 documents, each containing 800 words on average. The
number of distinct words (i.e. not counting duplicates) is estimated at 700 000.
Show your computation for every question; a worked sketch in Python follows the questions.

– What is the size (in megabytes or gigabytes) of the collection when stored, uncompressed, on disk?

– Assuming the best dictionary reduction rate achievable with linguistic preprocessing
(stop-word removal, stemming), what is the size (in number of terms) of the dictionary?

– Consider an index where the average length of a non-positional posting list is 200. What
is the estimated total number of postings in this index?

– How many bytes would you allocate for encoding, without compression, (a) a dictionary
term and (b) a non-positional posting?

– What are the sizes (in megabytes or gigabytes) of the resulting dictionary and postings lists?

– If you compress your dictionary using the dictionary-as-a-string method, what is the new
size of the dictionary?
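
A worked sketch of the computations above, in Python. The per-word cost of 6 bytes,
the roughly one-third dictionary reduction from preprocessing, the field widths
(20 bytes per term, 4 bytes per posting) and the dictionary-as-a-string figures are
illustrative assumptions in the spirit of Manning, Raghavan & Schütze, not values
fixed by the exam:

    N_DOCS = 500_000
    AVG_WORDS = 800
    DISTINCT_WORDS = 700_000

    # Collection size on disk, assuming ~6 bytes per word (incl. whitespace).
    collection_bytes = N_DOCS * AVG_WORDS * 6
    print(f"collection: {collection_bytes / 1e9:.1f} GB")

    # Dictionary after preprocessing, assuming a roughly one-third reduction
    # from stop-word removal and stemming combined.
    dict_terms = int(DISTINCT_WORDS * (1 - 1 / 3))
    print(f"dictionary: {dict_terms} terms")

    # Total postings, on one reading: one posting list per (reduced)
    # dictionary term, each of average length 200.
    postings = dict_terms * 200
    print(f"postings:   {postings}")

    # Uncompressed sizes, assuming 20 bytes per term, 4 bytes per posting.
    print(f"dictionary: {dict_terms * 20 / 1e6:.1f} MB")
    print(f"postings:   {postings * 4 / 1e6:.1f} MB")

    # Dictionary-as-a-string: concatenated terms (~8 bytes average length)
    # plus, per term, a 4-byte frequency, a 4-byte postings pointer and a
    # 3-byte into-the-string term pointer.
    print(f"dict-as-a-string: {dict_terms * (8 + 4 + 4 + 3) / 1e6:.1f} MB")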

Exercise 3: Querying an index


What kinds of queries can be applied to the collection? For each of them, which index is needed? (A sketch of one query type follows.)
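
For instance, a Boolean AND query needs only a standard inverted index with sorted
posting lists; a minimal sketch of the merge (phrase or proximity queries would
additionally need a positional index, wildcard queries a permuterm index):

    def intersect(p1, p2):
        """Merge two sorted posting lists for a Boolean AND query."""
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:          # docID in both lists: keep it
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([1, 3, 7, 12], [3, 7, 9]))   # -> [3, 7]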

Exercise 4: Linguistic preprocessing
Are the following statements true or false? Justify your answer.

a) Stemming increases retrieval precision.

b) Stemming only slightly reduces the size of the dictionary.

c) Stop lists contain all of the most frequent terms.

Exercise 5: Porter stemming


What would be the output of the Porter stemmer for the following words?
– busses

– rely

– realised

What is the Porter measure of each of the following words? Show your computation (see the sketch after this list).
– crepuscular

– rigorous

– placement
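
A minimal sketch of the measure computation. The Porter measure m is the exponent
in the decomposition [C](VC)^m[V] of a word into vowel and consonant runs; this
ignores the rest of the stemming algorithm:

    def porter_measure(word):
        """Count the VC sequences in the [C](VC)^m[V] decomposition."""
        vowels = set("aeiou")
        kinds = []                      # True = vowel, False = consonant
        for i, ch in enumerate(word.lower()):
            if ch in vowels or (ch == "y" and i > 0 and not kinds[i - 1]):
                kinds.append(True)      # 'y' counts as a vowel after a consonant
            else:
                kinds.append(False)
        # m = number of vowel-run to consonant-run transitions.
        return sum(1 for prev, cur in zip(kinds, kinds[1:]) if prev and not cur)

    print(porter_measure("tree"))      # 0, matching Porter's own example
    print(porter_measure("trouble"))   # 1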

Exercise 6: Index architecture
Propose a MapReduce architecture for creating language-specific indexes from a
heterogeneous collection. You may illustrate the architecture with a figure.
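
A minimal sketch of the map and reduce steps. The language detector and tokenizer
here are toy stand-ins (a real system would plug in e.g. a character n-gram
language identifier and language-aware tokenizers):

    from collections import defaultdict

    def detect_language(text):
        # Toy stand-in for a real language identifier.
        return "en" if " the " in f" {text.lower()} " else "de"

    def tokenize(text):
        # Toy whitespace tokenizer.
        return text.lower().split()

    def map_phase(doc_id, text):
        """Map: route each (term, docID) pair to a per-language partition."""
        lang = detect_language(text)
        for term in tokenize(text):
            yield (lang, term), doc_id

    def build_indexes(docs):
        """Driver simulating the shuffle and the reduce step, which turns
        each (language, term) group into a sorted posting list."""
        groups = defaultdict(list)
        for doc_id, text in docs.items():
            for key, doc in map_phase(doc_id, text):
                groups[key].append(doc)
        indexes = defaultdict(dict)
        for (lang, term), doc_ids in groups.items():
            indexes[lang][term] = sorted(set(doc_ids))
        return indexes

    print(build_indexes({1: "the movie trailer", 2: "der Film"}))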

Exercise 7: Index compression


– What is the largest gap that can be encoded in 2 bytes using the variable-byte encoding?

– Which posting list is decoded from the variable-byte code
10001001 00000001 10000010 11111111? (A decoder sketch follows this list.)

– What would be the encoding of the same posting list using a γ-code?
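
A minimal decoder sketch, assuming the convention of Manning, Raghavan & Schütze
in which the high bit is set on the final byte of each gap and the remaining
7 bits of every byte carry the payload:

    def vb_decode(bytestream):
        """Decode a variable-byte stream into a list of gaps."""
        gaps, n = [], 0
        for byte in bytestream:
            if byte < 128:                        # continuation byte
                n = n * 128 + byte
            else:                                 # final byte of this gap
                gaps.append(n * 128 + byte - 128)
                n = 0
        return gaps

    def gaps_to_postings(gaps):
        """Cumulative sums turn gaps back into absolute docIDs."""
        postings, total = [], 0
        for gap in gaps:
            total += gap
            postings.append(total)
        return postings

    stream = [0b10001001, 0b00000001, 0b10000010, 0b11111111]
    print(gaps_to_postings(vb_decode(stream)))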

Exercise 8: Vector space model
Consider a collection consisting of the documents d1, d2, d3, whose characteristics
are the following:

Term      tf_d1   tf_d2   tf_d3    df
actor        12      35      55   123
movie        15      24      48   240
trailer      52      13      12    85

– Compute the vector representations of d1, d2 and d3 using the tf-idf_{t,d} weighting
and Euclidean normalization (see the sketch after this list).

– Compute the cosine similarities between these documents.

– Give the ranking retrieved by the system for the query “movie trailer”.
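
A minimal sketch of these computations. The collection size N is not restated in
this exercise, so the sketch borrows N = 500 000 from Exercise 2, and it uses raw
tf times log10 idf (one common tf-idf variant) and a binary query vector; both are
assumptions:

    import math

    N = 500_000   # assumed: taken from Exercise 2
    tf = {"actor": [12, 35, 55], "movie": [15, 24, 48], "trailer": [52, 13, 12]}
    df = {"actor": 123, "movie": 240, "trailer": 85}
    terms = list(tf)

    def normalize(w):
        norm = math.sqrt(sum(x * x for x in w))
        return [x / norm for x in w]

    def tfidf_vector(d):
        """Raw tf times log10(N/df), then Euclidean normalization."""
        return normalize([tf[t][d] * math.log10(N / df[t]) for t in terms])

    def cosine(u, v):
        # Inputs are unit length, so cosine similarity is the dot product.
        return sum(a * b for a, b in zip(u, v))

    docs = [tfidf_vector(d) for d in range(3)]
    for i in range(3):
        for j in range(i + 1, 3):
            print(f"cos(d{i+1}, d{j+1}) = {cosine(docs[i], docs[j]):.3f}")

    # Ranking for the query "movie trailer" with a binary query vector.
    q = normalize([1.0 if t in ("movie", "trailer") else 0.0 for t in terms])
    print(sorted(range(1, 4), key=lambda d: cosine(q, docs[d - 1]), reverse=True))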

Exercise 9: Term weighting
Compute the vector representations of the documents introduced in the previous exercise
using the ltn weighting scheme.
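
A minimal sketch, reading ltn in SMART notation as logarithmic tf (1 + log10 tf),
idf weighting (log10 N/df) and no normalization, and again borrowing N = 500 000
from Exercise 2:

    import math

    N = 500_000   # assumed, as in the previous sketch
    tf = {"actor": [12, 35, 55], "movie": [15, 24, 48], "trailer": [52, 13, 12]}
    df = {"actor": 123, "movie": 240, "trailer": 85}

    def ltn_vector(d):
        """ltn weight: (1 + log10 tf) * log10(N / df), unnormalized."""
        return [(1 + math.log10(tf[t][d])) * math.log10(N / df[t]) for t in tf]

    for d in range(3):
        print(f"d{d+1}:", [round(w, 3) for w in ltn_vector(d)])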

Exercise 10: Index architecture (extra credit)


Consider a hashtable as a structure mapping keys to values through a hash
function h, so that the value for a key is stored in slot h(key).
– What problem may arise from such a structure when inserting new key-value pairs?

– What workaround would you propose for this insertion? Give an algorithm for
inserting a key-value pair (a sketch follows this list).
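
One classic workaround for collisions (two keys hashing to the same slot) is
separate chaining; a minimal sketch:

    class ChainedHashTable:
        """Each slot holds a list of (key, value) pairs, so keys that
        collide under the hash function can coexist in one slot."""

        def __init__(self, size=1024):
            self.slots = [[] for _ in range(size)]

        def insert(self, key, value):
            slot = self.slots[hash(key) % len(self.slots)]
            for i, (k, _) in enumerate(slot):
                if k == key:                  # key already present: update
                    slot[i] = (key, value)
                    return
            slot.append((key, value))         # new key (possibly a collision)

        def get(self, key):
            slot = self.slots[hash(key) % len(self.slots)]
            for k, v in slot:
                if k == key:
                    return v
            raise KeyError(key)

Open addressing (probing for the next free slot) is the other standard answer.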
