ISCL - Wintersemester 2007 - IR - Midterm Exam
17 December 2007
Exercise 1 : Definitions
Define the following terms :
– tokenization
– permuterm index
– champion list
– What is the size (in megabytes or gigabytes) of the collection when stored uncompressed on disc ?
– With the best dictionary reduction rate achieved by linguistic preprocessing (noise-word
removal, stemming), what is the size (number of terms) of the dictionary ?
– Consider an index in which the average length of a non-positional posting list is 200. What
is the estimated total number of postings in this index ?
– How many bytes do you allow for encoding (without compression) a dictionary term ?
A non-positional posting ?
– What are the sizes (in megabytes or gigabytes) of the resulting dictionary and posting lists ?
– If you compress your dictionary using the dictionary-as-a-string method, what is the new
size of the dictionary ?
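The two dictionary layouts can be compared with a small size calculation. This is only a sketch: the default byte counts (20 bytes per fixed-width term, 4 bytes for the document frequency, 4 bytes for the postings pointer, 3 bytes for the pointer into the term string, 8 bytes per term on average) are the usual textbook figures and are assumptions here, as are the function names and the term count used in the usage lines.

```python
def fixed_width_dictionary_bytes(num_terms: int,
                                 term_bytes: int = 20,
                                 freq_bytes: int = 4,
                                 postings_ptr_bytes: int = 4) -> int:
    """Size of a dictionary stored as an array of fixed-width entries."""
    return num_terms * (term_bytes + freq_bytes + postings_ptr_bytes)


def dictionary_as_string_bytes(num_terms: int,
                               avg_term_bytes: int = 8,
                               term_ptr_bytes: int = 3,
                               freq_bytes: int = 4,
                               postings_ptr_bytes: int = 4) -> int:
    """Size when all terms are concatenated into one long string.

    Each entry keeps its frequency and postings pointer, plus a pointer
    into the string; the term itself costs only its actual length.
    """
    per_term = freq_bytes + postings_ptr_bytes + term_ptr_bytes + avg_term_bytes
    return num_terms * per_term


# Hypothetical dictionary of 400,000 terms (not a figure from this exam).
print(fixed_width_dictionary_bytes(400_000))  # 11200000
print(dictionary_as_string_bytes(400_000))    # 7600000
```

Changing the default parameters to the byte counts chosen in the previous question gives the corresponding sizes for the exam's collection.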
Exercise 4 : Linguistic preprocessing
Are the following statements true or false (justify your answer) ?
– rely
– realised
What is the Porter measure of the following words (give your computation) ?
– crepuscular
– rigorous
– placement
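A hand computation of the measure can be checked with a short script. The sketch below assumes Porter's original definition: a word has the form [C](VC)^m[V], where a vowel is one of a, e, i, o, u, or a y preceded by a consonant, and m is the measure. The function names are illustrative.

```python
VOWELS = set("aeiou")


def is_consonant(word: str, i: int) -> bool:
    """Porter's rule: y counts as a vowel only when preceded by a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def porter_measure(word: str) -> int:
    """Return m in the decomposition [C](VC)^m[V]."""
    word = word.lower()
    pattern = "".join("C" if is_consonant(word, i) else "V"
                      for i in range(len(word)))
    # Each vowel-to-consonant boundary closes one VC group.
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "V" and b == "C")


# Examples taken from Porter's description of the algorithm.
print(porter_measure("tree"))     # 0
print(porter_measure("trouble"))  # 1
print(porter_measure("orrery"))   # 2
```

Counting vowel-to-consonant boundaries avoids collapsing runs of letters explicitly: a run of vowels followed by a run of consonants contributes exactly one boundary.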
Exercise 6 : Index architecture
Propose a Map-Reduce architecture for creating language-specific indexes from a heterogeneous
collection. You may illustrate this architecture with a figure.
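One possible shape for such an architecture, simulated in a single process, is sketched below. It is not the expected answer, only an illustration of one design: the map step tags every term with the language of its document, and the reduce step groups postings by (language, term). `detect_language` is a hypothetical stand-in for a real language identifier, and all other names are assumptions as well.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Key = Tuple[str, str]  # (language, term)


def detect_language(text: str) -> str:
    """Hypothetical placeholder: a real system would call a language identifier."""
    return "de" if " der " in f" {text.lower()} " else "en"


def map_phase(doc_id: int, text: str) -> Iterator[Tuple[Key, int]]:
    """Emit ((language, term), doc_id) pairs for one document."""
    lang = detect_language(text)
    for term in text.lower().split():
        yield (lang, term), doc_id


def reduce_phase(pairs: Iterable[Tuple[Key, int]]) -> Dict[str, Dict[str, List[int]]]:
    """Group the pairs by key and build one inverted index per language."""
    grouped = defaultdict(set)
    for key, doc_id in pairs:
        grouped[key].add(doc_id)
    indexes: Dict[str, Dict[str, List[int]]] = defaultdict(dict)
    for (lang, term), doc_ids in grouped.items():
        indexes[lang][term] = sorted(doc_ids)
    return dict(indexes)


def build_indexes(docs: Dict[int, str]) -> Dict[str, Dict[str, List[int]]]:
    pairs = [pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text)]
    return reduce_phase(pairs)


# Hypothetical three-document collection.
docs = {1: "the movie trailer", 2: "der film trailer", 3: "the film"}
indexes = build_indexes(docs)
print(indexes["en"]["the"])      # [1, 3]
print(indexes["de"]["trailer"])  # [2]
```

Putting the language into the key means that, in a distributed setting, the partitioner could route all keys of one language to the same group of reducers, so that each language's index is written independently.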
– What is the posting list that can be decoded from the variable byte code
10001001 00000001 10000010 11111111 ?
– What would be the encoding of the same posting list using a γ-code ?
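Both codes can be sketched in a few lines. The sketch assumes the common textbook conventions: in the variable byte code, the high bit of a byte is 1 only on the last byte of a number and the other 7 bits carry the payload; in the γ-code, a gap is written as the unary length of its offset followed by the offset (the binary representation without its leading 1). It also assumes that the encoded numbers are gaps between document IDs. The function names are illustrative.

```python
from itertools import accumulate
from typing import List


def vb_decode(bits: str) -> List[int]:
    """Decode a space-separated string of 8-bit groups into a list of gaps."""
    gaps, n = [], 0
    for byte in bits.split():
        value = int(byte, 2)
        if value < 128:            # continuation bit 0: more bytes follow
            n = 128 * n + value
        else:                      # continuation bit 1: last byte of the number
            gaps.append(128 * n + (value - 128))
            n = 0
    return gaps


def gaps_to_doc_ids(gaps: List[int]) -> List[int]:
    """Turn gaps back into document IDs by running summation."""
    return list(accumulate(gaps))


def gamma_encode(gap: int) -> str:
    """Gamma code of a positive integer, as a string of bits."""
    if gap < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    offset = bin(gap)[3:]              # drop the '0b' prefix and the leading 1
    length = "1" * len(offset) + "0"   # unary code of the offset length
    return length + offset


# Well-known textbook values, used here as a sanity check.
print(vb_decode("00000110 10111000 10000101"))  # [824, 5]
print(gaps_to_doc_ids([824, 5]))                # [824, 829]
print(gamma_encode(13))                         # 1110101
```

Applying `vb_decode` to the byte sequence of the question, then `gamma_encode` to each resulting gap, gives a way to check a hand computation.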
Exercise 8 : Vector Space Model
Consider a collection consisting of the documents d1, d2 and d3, whose characteristics are
the following :
– Compute the vector representations of d1, d2 and d3 using the tf-idf_{t,d} weighting and
Euclidean normalisation.
– Give the ranking returned by the system for the query “movie trailer”.
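The computation asked for here can be sketched as follows. The sketch assumes the weighting tf-idf_{t,d} = tf_{t,d} × log10(N / df_t), Euclidean (length) normalisation of each document vector, the same weighting for the query, and ranking by cosine similarity. The three documents below are hypothetical placeholders, not the collection of this exercise, and the function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def normalise(vec: Vector) -> Vector:
    """Divide by the Euclidean length; an all-zero vector is left unchanged."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: (w / norm if norm else 0.0) for t, w in vec.items()}


def tfidf_vectors(docs: Dict[str, str]) -> Tuple[Dict[str, Vector], Vector]:
    """Return length-normalised tf-idf vectors and the idf table."""
    n = len(docs)
    tfs = {d: Counter(text.split()) for d, text in docs.items()}
    df = Counter(t for tf in tfs.values() for t in tf)
    idf = {t: math.log10(n / df[t]) for t in df}
    vectors = {d: normalise({t: tf[t] * idf[t] for t in tf})
               for d, tf in tfs.items()}
    return vectors, idf


def rank(query: str, vectors: Dict[str, Vector], idf: Vector) -> List[str]:
    """Order documents by decreasing cosine similarity with the query."""
    q_tf = Counter(query.split())
    q = normalise({t: q_tf[t] * idf.get(t, 0.0) for t in q_tf})
    scores = {d: sum(w * vec.get(t, 0.0) for t, w in q.items())
              for d, vec in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical collection (the exercise's own data is not reproduced here).
docs = {"d1": "movie trailer trailer",
        "d2": "movie review",
        "d3": "car trailer review"}
vectors, idf = tfidf_vectors(docs)
print(rank("movie trailer", vectors, idf))  # ['d1', 'd2', 'd3']
```

Because every vector has unit length after normalisation, the dot product used in `rank` is the cosine similarity.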
Exercise 9 : Term weighting
Compute the vector representations of the documents introduced in the previous exercise
using the ltn weighting scheme.
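In SMART notation, ltn is usually read as: l for logarithmic term frequency, 1 + log10(tf_{t,d}); t for the idf factor, log10(N / df_t); n for no normalisation. A minimal sketch under that reading is given below; the documents are hypothetical placeholders and the function name is an assumption.

```python
import math
from collections import Counter
from typing import Dict


def ltn_vectors(docs: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Weight each term by (1 + log10 tf) * log10(N / df), without normalisation."""
    n = len(docs)
    tfs = {d: Counter(text.split()) for d, text in docs.items()}
    df = Counter(t for tf in tfs.values() for t in tf)
    return {d: {t: (1 + math.log10(tf[t])) * math.log10(n / df[t]) for t in tf}
            for d, tf in tfs.items()}


# Hypothetical collection (not the data of the previous exercise).
docs = {"d1": "movie trailer trailer",
        "d2": "movie review",
        "d3": "car trailer review"}
for doc_id, vec in ltn_vectors(docs).items():
    print(doc_id, {t: round(w, 3) for t, w in vec.items()})
```

Compared with raw tf-idf, the logarithm dampens the effect of repeated occurrences: a term that appears twice weighs about 1.3 times, not twice, as much as a term that appears once.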
– What workaround would you propose for this insertion ? Give an algorithm for inserting
a key-value pair.