Information Retrieval Exam
Name and Group: Series ONE
Email:

Suppose we have the following documents:

Document  Words
D1        abbabbc
D2        aababa
D3        bbbbbbcc

1. If my main concern is producing the fastest possible dictionary search, which will be the best possible data structure to use?
a) An array of words and word IDs over a contiguous chunk of memory, because this would allow very fast sequential passes through all dictionary words.
b) A sorted linked list with skip pointers, because this would allow both very fast sequential passes and very fast look-ups.
c) A hash table, because this would allow very fast look-ups.
d) A B-tree, because this would allow on average the fastest look-up time for a word.

2. What will a basic inverted index for this corpus look like?
a) a => D1 -> D2 -> D3; b => D1 -> D2 -> D3; c => D1 -> D2 -> D3
b) a => D1 -> D1 -> D2 -> D2 -> D2 -> D2; b => D1 -> D1 -> D1 -> D1 -> D2 -> D2 -> D3 -> D3 -> D3 -> D3 -> D3 -> D3; c => D1 -> D3 -> D3
c) a => D1 -> D2; b => D1 -> D2 -> D3; c => D1 -> D3
d) a => D1:1,4 -> D2:1,2,4,6; b => D1:2,3,5,6 -> D2:3,5 -> D3:1,2,3,4,5,6; c => D1:7 -> D3:7,8.

3. What is the best order for processing the Boolean query << a AND b AND c >>?
a) Intersect postings for „a” AND „c” first, then intersect the result with postings for „b”.
b) Intersect postings for „a” AND „b” first, then intersect the result with postings for „c”.
c) Intersect postings for „b” AND „c” first, then intersect the result with postings for „a”.
d) It does not matter.

4. Suppose the corpus above is dynamic. Which of the following issues will you have to deal with?
a) New or changed documents.
b) Spell corrections.
c) Deleted documents.
d) Document access control lists, if any.

5. If I use simple TFxIDF for ranking, which will be the order generated for the regular query << a b >>?
a) D1, D2, D3
b) D2, D1, D3
c) D1, D3, D2
d) D2, D3, D1.

6. Using un-normalized cosine similarity on TFxIDF (just the dot product between the TFxIDF vectors of each document), which two of the three documents above are most similar to each other?
a) (D1, D3)
b) (D2, D3)
c) All are equally similar
d) (D1, D2)

7. Given the following sequence of gamma coded gaps, reconstruct the postings sequence: 110111011110010100.
a) 2 -> 5 -> 9 -> 10 -> 11
b) 4 -> 6 -> 10 -> 11
c) 11 -> 14 -> 16400 -> 16402
d) 7 -> 10 -> 20 -> 22.
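
As a study aid for question 7, the following is a minimal sketch of how a string of gamma coded gaps can be turned back into postings. It assumes the common textbook convention (not stated in the exam itself) that each code is a unary length of n ones terminated by a zero, followed by n offset bits, with the leading 1 of the binary value left implicit. The function names `decode_gamma` and `gaps_to_postings` and the sample bit string are illustrative only, and the input is assumed to be well formed.

```python
from itertools import accumulate


def decode_gamma(bits: str) -> list[int]:
    """Decode a concatenation of gamma codes into the list of encoded integers."""
    values = []
    i = 0
    while i < len(bits):
        # Unary part: count the 1s up to the terminating 0.
        length = 0
        while bits[i] == "1":
            length += 1
            i += 1
        i += 1  # skip the terminating 0
        # Offset part: the next `length` bits, with the implicit leading 1 restored.
        offset = bits[i:i + length]
        values.append(int("1" + offset, 2))
        i += length
    return values


def gaps_to_postings(gaps: list[int]) -> list[int]:
    """Turn a list of gaps into absolute document IDs by running sum."""
    return list(accumulate(gaps))


# Illustrative input (not the exam's sequence): codes for the gaps 2, 3, 1.
gaps = decode_gamma("1001010")
print(gaps)                    # [2, 3, 1]
print(gaps_to_postings(gaps))  # [2, 5, 6]
```

The same two steps, decoding the gaps and then accumulating them, apply to any gamma coded sequence that follows this convention.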
8. How is it best to implement a basic indexing pipeline for Chinese?
a) Just like a European one, but making sure we use UTF-8 in order to maintain the character encodings.
b) Just like a European one, but splitting words using a specialized dictionary (in UTF-8).
c) Just like a European one, but (1) splitting words using a specialized dictionary (in UTF-8) and (2) without removing stopwords, because they are not present in Asian languages.
d) Just like a European one, but (1) making sure we use UTF-8 in order to maintain the character encodings and (2) indexing numbers separately, because they are used differently in Asian texts.

9. Suppose two words are considered similar if the Jaccard coefficient corresponding to their trigrams is greater than 0.5. Which of the following pairs contains similar words?
a) (alone, alane)
b) (alone, along)
c) (alike, alite)
d) (restaurant, restaurate)

10. Suppose a search returns documents D1, D2, and D3 in this order. The correct results in the system would have been D2, D1, D4, and D5 in this order. What are the precision and recall for the engine in this case?
a) P = 0.67; R = 0.5
b) P = 0.5; R = 0.67
c) P = 0.67; R = 0.4
d) P = 0.4; R = 0.67

11. Which of the following are drawbacks of Mean Average Precision?
a) It is not used in practice.
b) There can be duplicate results in the output.
c) There can be very similar results in the output.
d) It does not provide information about the system itself, but about how people perceive it.

12. What is the most probable type of query for the web query “Microsoft”?
a) Informational query
b) Shopping query
c) Downloads and documentation query
d) Navigational query

13. Given the bow-tie structure of the web, which is the best place to start a crawl of all its documents?
a) IN area
b) OUT area
c) SCC / CENTER area
d) It does not matter, because the crawl will reach all pages anyway.

14. Which of the following must be considered when crawling?
a) Last modification date for each page
b) Frequency of site accesses
c) Robots rules
d) Order in which pages are crawled.

Suppose documents D1, D2, D3 have the following hyperlink structure between them:

[Figure: hyperlink graph over D1, D2, and D3]

15. Which will be their scores and rankings if we order them using PageRank, computed with α = 0.1 (probability to jump to a random page) for two iterations?
a) D2=0.39, D3=0.32, D1=0.29
b) D1=0.34, D2=0.34, D3=0.34
c) D3=0.39, D2=0.32, D1=0.29
d) D2=0.42, D3=0.40, D1=0.18.

16. Which of the following should be top priority design principles for Information Retrieval user interfaces?
a) Accuracy
b) Usability
c) Feeling of user accomplishment
d) Previous user experience.
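
As a study aid for the similarity rule in question 9, the following is a minimal sketch of a Jaccard coefficient computed over character trigrams. It assumes plain overlapping three-character substrings with no boundary padding, which the exam does not specify; a padded variant would give different values. The function names and the sample word pairs are illustrative only and are not taken from the exam.

```python
def trigrams(word: str) -> set[str]:
    """Return the set of overlapping 3-character substrings of a word (no padding)."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


def jaccard_trigrams(a: str, b: str) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over the two trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    if not union:
        return 0.0  # both words are shorter than three characters
    return len(ta & tb) / len(union)


def similar(a: str, b: str, threshold: float = 0.5) -> bool:
    """Apply the rule from the question: similar if the coefficient exceeds 0.5."""
    return jaccard_trigrams(a, b) > threshold


# Illustrative pairs (not the exam's options).
print(similar("search", "searcher"))  # True: 4 shared trigrams out of 6 distinct
print(similar("index", "inverted"))   # False: no shared trigrams
```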
