Information Retrieval Exam
Name and Group:
Series ONE
Email:
1. If my main concern is producing the fastest possible dictionary search, which will be the best possible data structure to use?
a) An array of words and word IDs over a contiguous chunk of memory, because this would allow very fast sequential passes through all dictionary words.
b) A sorted linked list with skip pointers, because this would allow both very fast sequential passes and very fast look-ups.
c) A hash table, because this would allow very fast look-ups.
d) A B-tree, because this would allow on average the fastest look-up time for a word.

2. What will a basic inverted index for this corpus look like?
a) a => D1 -> D2 -> D3;
   b => D1 -> D2 -> D3;
   c => D1 -> D2 -> D3
b) a => D1 -> D1 -> D2 -> D2 -> D2 -> D2;
   b => D1 -> D1 -> D1 -> D1 -> D2 -> D2 -> D3 -> D3 -> D3 -> D3 -> D3 -> D3;
   c => D1 -> D3 -> D3
c) a => D1 -> D2;
   b => D1 -> D2 -> D3;
   c => D1 -> D3
d) a => D1:1,4 -> D2:1,2,4,6;
   b => D1:2,3,5,6 -> D2:3,5 -> D3:1,2,3,4,5,6;
   c => D1:7 -> D3:7,8.

3. What is the best order for processing the Boolean query << a AND b AND c >>?

4. Suppose the corpus above is dynamic. Which of the following issues will you have to deal with?
a) New or changed documents.
b) Spell corrections.
c) Deleted documents.
d) Document access control lists, if any.

5. If I use simple TFxIDF for ranking, which will be the order generated for the regular query << a b >>?
a) D1, D2, D3
b) D2, D1, D3
c) D1, D3, D2
d) D2, D3, D1.

6. Using un-normalized cosine similarity on TFxIDF (just the dot product between the TFxIDF vectors of each document), which two of the three documents above are most similar to each other?
a) (D1, D3)
b) (D2, D3)
c) All are equally similar
d) (D1, D2)

7. Given the following sequence of gamma coded gaps, reconstruct the postings sequence: 110111011110010100.
a) 2 -> 5 -> 9 -> 10 -> 11
b) 4 -> 6 -> 10 -> 11
c) 11 -> 14 -> 16400 -> 16402
d) 7 -> 10 -> 20 -> 22.
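
As a reminder of how gamma coded gaps are read, here is a minimal sketch of a decoder, run on a made-up bit string rather than the one in question 7. It assumes the usual textbook convention: each gap is written as a unary length (that many 1s followed by a 0) and then the offset, which is the binary form of the gap with its leading 1 removed. The function names `decode_gamma_gaps` and `gaps_to_postings` are illustrative only.

```python
def decode_gamma_gaps(bits: str) -> list[int]:
    """Decode a concatenation of gamma codes into a list of gaps."""
    gaps = []
    i = 0
    while i < len(bits):
        # Unary part: count the 1s up to the terminating 0.
        length = 0
        while bits[i] == "1":
            length += 1
            i += 1
        i += 1  # skip the terminating 0
        # Offset part: `length` bits; restore the leading 1 to get the gap.
        offset = bits[i:i + length]
        i += length
        gaps.append(int("1" + offset, 2))
    return gaps


def gaps_to_postings(gaps: list[int]) -> list[int]:
    """Turn gaps into document IDs by taking a running sum."""
    postings = []
    doc_id = 0
    for gap in gaps:
        doc_id += gap
        postings.append(doc_id)
    return postings


# Assumed example: the gamma codes of the gaps 5, 1 and 4, concatenated.
example_bits = "11001" + "0" + "11000"
print(gaps_to_postings(decode_gamma_gaps(example_bits)))  # [5, 6, 10]
```

The sketch does no validation, so a truncated or malformed bit string will raise an `IndexError` instead of returning a partial result.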
8. What is the best way to implement a basic indexing pipeline for Chinese?
a) Just like a European one, but making sure we use UTF-8 in order to maintain the character encodings.
b) Just like a European one, but splitting words using a specialized dictionary (in UTF-8).
c) Just like a European one, but (1) splitting words using a specialized dictionary (in UTF-8) and (2) without removing stopwords, because they are not present in Asian languages.
d) Just like a European one, but (1) making sure we use UTF-8 in order to maintain the character encodings and (2) indexing numbers separately, because they are used differently in Asian texts.

9. Suppose two words are considered similar if the Jaccard coefficient corresponding to their trigrams is greater than 0.5. Which of the following pairs contains similar words?
a) (alone, alane)
b) (alone, along)
c) (alike, alite)
d) (restaurant, restaurate)

a) Informational query
b) Shopping query
c) Downloads and documentation query
d) Navigational query

13. Given the bow-tie structure of the web, which is the best place to start a crawl of all its documents?
a) IN area
b) OUT area
c) SCC / CENTER area
d) It does not matter, because the crawl will reach all pages anyway.

14. Which of the following must be considered when crawling?
a) Last modification date for each page
b) Frequency of site accesses
c) Robots rules
d) Order in which pages are crawled.

Suppose documents D1, D2, D3 have the following hyperlink structure between them:

D1 D2