
1. The statement “The vector space model cannot be combined outright with phrase
queries” is [1 mark]
a. Correct because the concept of idf can be easily extended to phrase queries.
b. Correct because of the assumption that terms are not semantically
connected to each other in the vector space model.
c. Incorrect because the words form independent axes and can be successfully
combined to form phrase queries.
d. Incorrect because the phrase queries behave just like a multi-word query.
2. Suppose in a collection of 1000 documents, there are 8 documents relevant to a query.
The system has managed to retrieve 6 of them in the top 15. The sequence of
relevant (R) and non-relevant (N) documents is as follows:
RNNRRNNRRNNRNNN.
Find the interpolated precision value at recall of 0.5. [1 mark]
a. 0.50
b. 0.56
c. 0.60
d. 0.42
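The arithmetic can be checked with a minimal sketch, assuming the usual definition of interpolated precision (the highest precision observed at any recall level at or above the target). The function name `interpolated_precision` is illustrative and not part of the question.

```python
def interpolated_precision(ranking, total_relevant, recall_level):
    """Return the maximum precision over all ranks whose recall >= recall_level."""
    best = 0.0
    hits = 0
    for rank, label in enumerate(ranking, start=1):
        if label == "R":
            hits += 1
        recall = hits / total_relevant
        precision = hits / rank
        if recall >= recall_level:
            best = max(best, precision)
    return best


ranking = "RNNRRNNRRNNRNNN"  # sequence given in the question
value = interpolated_precision(ranking, total_relevant=8, recall_level=0.5)
print(round(value, 2))
```

Recall first reaches 0.5 at rank 8, but the precision at rank 9 (5 relevant out of 9) is higher, and interpolation takes that maximum.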
3. Suppose in a collection of 10000 documents, there are 6 documents relevant to a query
and the annotator scores the documents following a 3-scale annotation scheme where the
most relevant documents are scored as 2, somewhat relevant documents as 1 and
non-relevant documents as 0. If among the relevant documents are most relevant and
others are somewhat relevant, then find the normalization constant Z for computing the
NDCG for the given query on the document collection. [1 mark]
a. 6.57
b. 0.152
c. 4.79
d. 0.21
4. Query drift in relevance feedback most likely [1 mark]
a. increases recall and decreases precision
b. decreases recall and increases precision
c. decreases recall and decreases precision
d. increases recall and increases precision
5. Suppose two annotators (A1 and A2) are asked to judge a collection of documents. A2
comes to know about the query which will be issued against the information need and
biases her annotations to increase system performance. Which of the following will give
you reasons to suspect A2? [1 mark]
a. High kappa agreement between the system output and the annotations of A2.
b. Low kappa agreement between the annotations of A1 and A2.
c. Lower kappa agreement between A1 and A2 as compared to kappa agreement
between system and A2 and high keyword match between documents marked
relevant by A1 and A2.
d. Higher kappa agreement between A2 and system compared to A1 and A2
and high keyword agreement between documents marked relevant A2 and
system.
6. Suppose A is the term-term co-occurrence matrix. You compute C = AA^T. For a pair of
terms ti and tj, the entry Cij indicates the [1 mark]
a. number of contexts in which both ti and tj appear.
b. number of contexts in which either ti or tj appears.
c. number of words that have occurred in the contexts of both ti and tj.
d. number of words that have occurred in the contexts of either ti or tj.
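To see what an entry of AA^T counts, here is a small sketch assuming A is binary, with A[i][k] = 1 when term k occurs in the context of term i. The toy matrix and the helper `times_transpose` are made up for illustration.

```python
# Hypothetical binary term-term co-occurrence matrix (toy values):
# A[i][k] = 1 if term k occurs in the context of term i.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
]


def times_transpose(M):
    """Compute M * M^T, i.e. C[i][j] = sum over k of M[i][k] * M[j][k]."""
    n = len(M)
    return [
        [sum(M[i][k] * M[j][k] for k in range(len(M[i]))) for j in range(n)]
        for i in range(n)
    ]


C = times_transpose(A)
for row in C:
    print(row)
```

Each product A[i][k] * A[j][k] is 1 only when term k appears in the contexts of both term i and term j, so under the binary assumption C[i][j] counts the context words the two terms share.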
7. Relevance feedback increases recall but does not necessarily increase precision
because [1 mark]
a. It increases the number of documents retrieved but may not increase the actual
number of relevant documents.
b. There is a limit on the number of relevant documents for a given query and an
increase in the number of documents will decrease the ratio.
c. Precision is independent of documents retrieved and will go up irrespective of the
number of documents retrieved.
d. It increases the total number of documents retrieved and the ratio of
relevant documents retrieved to total documents retrieved may not
increase.
8. Suppose you have indexed a collection of 100 documents. For a query q, 20 out of the
100 documents are actually relevant. You issue this query q to your search engine. Your
search engine fails to return any document as relevant. What is the accuracy of your
search engine with respect to the query? [1 mark]
a. 0% (Accuracy can only be calculated when we get some result).
b. 80%
c. 0% (The system finds all documents irrelevant)
d. 50%
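A short sketch of the confusion-matrix arithmetic, assuming accuracy is defined as the fraction of all documents classified correctly, (TP + TN) / N. The variable names are illustrative.

```python
total_docs = 100
relevant_docs = 20
retrieved_docs = 0  # the engine returns nothing

tp = 0                                   # relevant and retrieved
fp = retrieved_docs - tp                 # non-relevant but retrieved
fn = relevant_docs - tp                  # relevant but missed
tn = total_docs - relevant_docs - fp     # non-relevant and correctly not retrieved

accuracy = (tp + tn) / total_docs
print(f"accuracy = {accuracy:.0%}")
```

Because non-relevant documents dominate the collection, a system that retrieves nothing still classifies most documents correctly, which is why accuracy is rarely a useful measure for retrieval.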
9. In the ROC curve concept of evaluating a system, we plot sensitivity against
(1-specificity) because usually [1 mark]
a. (1-specificity) is a measure closer to precision as compared to specificity
b. the number of non-relevant documents in a collection for a query is too
large compared to the number of relevant documents.
c. The true negative rate is a much more reliable measure compared to the false
positive rate.
d. None of the above
10. Suppose the vocabulary size of a collection is 30,000. For what value of k will k/i give the
relative frequency of a word, where i is the rank of the word in order of frequency, such
that the relative frequencies sum to 1? [1 mark]
a. 0.067
b. 0.097
c. 0.223
d. 10.31
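The constraint that the relative frequencies k/i sum to 1 over ranks 1 to 30,000 gives k = 1/H, where H is the 30,000th harmonic number. The sketch below computes k both from the exact harmonic sum and from the common approximation H ≈ ln(V); which of the two the question intends is an assumption.

```python
import math

V = 30_000  # vocabulary size from the question

# Exact normalisation: sum_{i=1..V} k/i = 1  =>  k = 1 / H_V
harmonic = sum(1.0 / i for i in range(1, V + 1))
k_exact = 1.0 / harmonic

# Common approximation: H_V ~ ln(V), ignoring the Euler-Mascheroni constant
k_log = 1.0 / math.log(V)

print(f"k (exact harmonic sum) = {k_exact:.4f}")
print(f"k (ln V approximation) = {k_log:.4f}")
```

The two values differ slightly (roughly 0.092 versus 0.097) because the logarithmic approximation drops a constant of about 0.577 from the harmonic number.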
11. In cluster pruning, we attach each follower to n (> 1) leaders and at query time consider
m (> 1) leaders closest to q, to address which of the following characteristics of the data?
[1 mark]
a. Non-uniformity of cluster structure
b. Unimodality in data
c. Multimodality in data
d. Variance in number of members in each cluster.
12. In the binary independence model, under the assumption that the number of documents
relevant to a query is very small compared to the number of non-relevant documents, the log odds of
relevance of a document for a query is estimated as [1 mark]
a. the sum of the idfs of the query terms
b. the sum of the idfs of each query term in the document
c. the sum of the probabilities of the query terms occurring in the relevant
documents.
d. The sum of the logs of probabilities of the query terms occurring in the relevant
documents.
13. In a corpus of 500 terms, the term frequencies of three terms are as follows:
dogs -> 70
cats -> 80
hate -> 40
If “hate” occurs immediately after “dogs” 15 times and “cats” occurs immediately after
“hate” 20 times, then estimate the probability of the sequence “dogs hate cats” in the
corpus using the bigram language model. [1 mark]
a. 1/70
b. 3/3500
c. 3.86 × 10^-5
d. 4.2 × 10^-5
14. Suppose we wish to implement wildcard queries in a system which uses the vector
space retrieval model. Which of the following schemes may be applied for this purpose?
[1 mark]
a. Map different possibilities for a given wildcard to a common term in the
vocabulary.
b. Use separate indices for wildcard query processing and vector space model
followed by a combination of results.
c. Apply query term expansion corresponding to wildcard query terms.
d. For the different possibilities of the wildcard query term, separately compute the
sets of documents and return the intersection of the results.
15. In an image search system, you upload a picture as a search query. In Rocchio’s
algorithm, which weight settings will you most likely use?
a. alpha = 0, beta = 1, gamma = 0
b. alpha = 1, beta = 0, gamma = 0
c. alpha = 1, beta = 1, gamma = 0
d. alpha = 0, beta = 1, gamma = 1
16. Consider the two documents
D1 = “virus china vaccine china discovered virus”
D2 = “virus discovered spread china vaccine”
and the query q = “virus china vaccine”
Under the boolean model, vector space model and the binary independence model,
what will be the relative ranks of the two documents? [1 mark]
a. D1 ranked higher than D2, same, same
b. same, D1 ranked higher than D2, same
c. same, D1 ranked higher than D2, D1 ranked higher than D2
d. same, same, same
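The three models can be compared on these two documents with a rough sketch. It assumes a conjunctive Boolean query, raw term-frequency weights with cosine similarity for the vector space model (no idf), and, for the binary independence model, a score that depends only on which query terms are present. The helper names are illustrative, and `binary_overlap` is a stand-in for the BIM score rather than the full formula.

```python
import math
from collections import Counter

docs = {
    "D1": "virus china vaccine china discovered virus".split(),
    "D2": "virus discovered spread china vaccine".split(),
}
query = "virus china vaccine".split()


def boolean_match(doc, q):
    """Conjunctive Boolean retrieval: every query term must be present."""
    return all(term in doc for term in q)


def cosine_tf(doc, q):
    """Cosine similarity using raw term frequencies as weights."""
    d, qv = Counter(doc), Counter(q)
    dot = sum(d[t] * qv[t] for t in qv)
    d_norm = math.sqrt(sum(v * v for v in d.values()))
    q_norm = math.sqrt(sum(v * v for v in qv.values()))
    return dot / (d_norm * q_norm)


def binary_overlap(doc, q):
    """Presence-only evidence: number of distinct query terms in the document."""
    return len(set(doc) & set(q))


for name, doc in docs.items():
    print(name, boolean_match(doc, query), round(cosine_tf(doc, query), 3),
          binary_overlap(doc, query))
```

Both documents contain all three query terms, so any model that only looks at term presence cannot separate them; only the term-frequency weights distinguish D1 from D2.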
17. An IR system produces the following interpolated precision-recall curve on a particular
query (based on 20 results). Assume that the number of relevant documents is known
to be 10. Based on this curve, find the precision value after the system has
retrieved 3 documents, rounded to 2 decimal places.

a. 0.43
18. For the following query types
1: Official site of Arsenal football club
2: Looking for verdicts given on cases related to a particular situation while preparing
for a case hearing
3: Information about neural diseases
Which evaluation measure will be most effective for each query type, respectively? [1 mark]
a. Recall, Precision, F-measure
b. Recall, Recall, Precision
c. Precision, Recall, F-measure
d. Precision, Recall, Recall
19. Suppose in a document collection, the scripts are very old and the scanned copies are
not clearly legible. Hence, while indexing, your terms may contain many scanner errors.
In this case, which additional index would be suitable for processing the documents
during indexing?
a. Permuterm index
b. Standard inverted index
c. Phonetic index
d. K-gram index
20. How will stemming and case-folding of the tokens affect the parameter b in the
expression of Heaps’ law?
a. Increase
b. Decrease
c. No change
d. Nothing can be predicted
21. Suppose you are provided with the global relevance scores of the documents and you are
asked to incorporate the information in a language model based IR system. Which of the
following strategies do you think is most appropriate for doing it? [1 mark]
a. Scale the individual term probabilities with the normalized global score to
enhance the contribution of a term to that particular document.
b. Add the normalized global score to the likelihood probability of the document
given a query.
c. Use the normalized global score as the document prior.
d. Add the global score to the language model based score.
22. Given the following documents, which of them is most likely to generate the query q
under the language modelling scheme most commonly used in IR systems? If a term is
absent from a document, use its value from the collection language model. [1 mark]
D1: recent ukraine russia conflict usa
D2: ukraine crisis russia china ukraine russia recent usa china india
D3: ukraine china crisis conflict russia ukraine

q: russia ukraine conflict

a. D1
b. D2
c. D3
d. Both D1 and D3 are equally likely.
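A minimal query-likelihood sketch for this question follows. It takes the instruction literally: each query term uses the document's maximum-likelihood estimate, and falls back to the collection estimate only when the term is absent from the document. Treating the three documents as the entire collection is an assumption, and this back-off is not the same as Jelinek-Mercer mixing, which would require a smoothing weight the question does not give.

```python
from collections import Counter

docs = {
    "D1": "recent ukraine russia conflict usa".split(),
    "D2": "ukraine crisis russia china ukraine russia recent usa china india".split(),
    "D3": "ukraine china crisis conflict russia ukraine".split(),
}
query = "russia ukraine conflict".split()

# Collection language model built from the three documents only (assumption).
collection = Counter(term for doc in docs.values() for term in doc)
collection_size = sum(collection.values())


def query_likelihood(doc, q):
    """P(q | doc): document MLE per term, collection MLE if the term is absent."""
    counts = Counter(doc)
    score = 1.0
    for term in q:
        if counts[term] > 0:
            score *= counts[term] / len(doc)
        else:
            score *= collection[term] / collection_size
    return score


scores = {name: query_likelihood(doc, query) for name, doc in docs.items()}
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(name, round(score, 5))
print("most likely:", best)
```

D2 is penalised because "conflict" never occurs in it, while D3 edges out D1 because "ukraine" occurs twice in a document of similar length.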
