
Assignments 2-3

Exercise : Suppose your search engine has just retrieved the top 50 documents from your collection
based on scores from a ranking function R(Q,D). Your user interface can show only 10 results, but you
can pick any of the top 50 documents to show. Why might you choose to show the user something other
than the top 10 documents from the retrieved document set?

Answer:

There are several reasons to show something other than the top 10 documents. For example, some of
the documents may be sponsored, and a search engine can give paid documents more weight so that
they appear among the 10 results shown. The results can also be adjusted based on the user's
history.

Exercise : Differentiate between simple inverted index, inverted index with counts, and inverted index
with positions. State the advantages and disadvantages for each one.

Answer:

Simple inverted index: An inverted index maps words to their locations in a set of documents. In its
simplest form it is a hash table that maps each word to the identifiers of the documents that
contain it.

Inverted index with counts: Document postings can store any information needed for efficient ranking.
Here each posting also stores how many times the word occurs in the document. This supports better
ranking algorithms, because word occurrence counts help rank the most relevant documents first.

Inverted index with positions: Each posting also records where the word occurs. A posting contains
two numbers: a document number first, followed by a word position.
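
The three variants can be sketched side by side. This is only a minimal illustration, assuming a toy two-document collection and whitespace tokenization; the documents and function names are hypothetical and not part of the exercise.

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> text.
docs = {
    1: "tropical fish include fish found in tropical environments",
    2: "fish live in water",
}

def build_simple(docs):
    """Term -> sorted list of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def build_with_counts(docs):
    """Term -> list of (document id, occurrence count)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(counts.items()) for term, counts in index.items()}

def build_with_positions(docs):
    """Term -> list of (document id, word position); positions start at 1."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for pos, term in enumerate(docs[doc_id].split(), start=1):
            index[term].append((doc_id, pos))
    return dict(index)
```

In broad terms, the simple index is the smallest but only records presence, so it cannot support frequency-based ranking. Counts add a little space and enable such ranking. Positions take the most space but make phrase and proximity queries possible.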

Exercise : State 3 different options to handle searches that prefer and benefit from looking at document
fields.

Exercise : Given the information below, which documents contain the word “fish” in their title?

Document 1. Its title includes the word “fish”, because the inverted list for “fish” tells us that
“fish” is the second word in document 1.
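
A minimal sketch of how such a check could work, assuming the title of each document is stored as an extent (start, end) of word positions with the end exclusive. The posting list and extents are hypothetical stand-ins, since the exercise's data is not reproduced in the text; only the fact that “fish” is the second word of document 1 comes from the answer.

```python
# Hypothetical postings for "fish": (document id, word position) pairs.
fish_postings = [(1, 2), (1, 4), (2, 7), (3, 2)]

# Hypothetical title extents: document id -> (start, end), end exclusive.
title_extents = {1: (1, 3), 2: (1, 5), 3: (1, 2)}

def docs_with_term_in_title(postings, extents):
    """Return ids of documents where some occurrence falls inside the title."""
    hits = set()
    for doc_id, pos in postings:
        start, end = extents.get(doc_id, (0, 0))
        if start <= pos < end:
            hits.add(doc_id)
    return sorted(hits)

title_hits = docs_with_term_in_title(fish_postings, title_extents)
```

With this assumed data, only document 1 has an occurrence of the word inside its title extent.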

Exercise : Complete the following skip pointers structure

(17,3), (34,6), (45,9), (52,12)
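
One way to see how such a structure continues is to generate it from a postings list. The sketch below assumes each pair is (document id, position) with one pointer every three postings. The postings list itself is an assumption, chosen only because it agrees with the four pairs shown; the exercise's own list is not given in the text.

```python
def build_skip_pointers(doc_ids, interval):
    """Return (d, p) pairs, one every `interval` postings.

    p is the 0-based position of the next posting to decode, and d is the
    document id of the posting just before it.
    """
    return [
        (doc_ids[p - 1], p)
        for p in range(interval, len(doc_ids), interval)
    ]

# Assumed postings list (not from the exercise text).
postings = [5, 11, 17, 21, 26, 34, 36, 37, 45, 48,
            51, 52, 57, 80, 89, 91, 94, 101, 104, 119]

skips = build_skip_pointers(postings, interval=3)
```

Under this assumption the first four pointers match the pairs above, and the structure would continue with (89,15) and (101,18).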

Exercise : Compare Document-at-a-time and Term-at-a-time query processing

• Document-at-a-time

– Calculates complete scores for documents by processing all term lists, one document at
a time.

• Term-at-a-time

– Accumulates scores for documents by processing term lists one at a time.

Both approaches have optimization techniques that significantly reduce time required to generate
scores.
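
The difference in loop order can be sketched as follows. This is a simplified illustration, assuming a hypothetical in-memory index with term counts and a score that is just the sum of counts; a real document-at-a-time implementation would advance through sorted inverted lists in parallel instead of using dictionary lookups.

```python
from collections import defaultdict

# Hypothetical index with counts: term -> {document id: term count}.
index = {
    "tropical": {1: 2, 3: 1},
    "fish": {1: 2, 2: 1, 3: 3},
}

def term_at_a_time(query_terms, index):
    """Process one inverted list fully before moving to the next term."""
    accumulators = defaultdict(int)
    for term in query_terms:
        for doc_id, count in index.get(term, {}).items():
            accumulators[doc_id] += count  # partial score so far
    return dict(accumulators)

def document_at_a_time(query_terms, index):
    """Finish each document's score before moving to the next document."""
    lists = [index.get(term, {}) for term in query_terms]
    all_docs = sorted(set().union(*lists))
    scores = {}
    for doc_id in all_docs:
        # Every term list is consulted for this one document.
        scores[doc_id] = sum(postings.get(doc_id, 0) for postings in lists)
    return scores
```

Both functions return the same scores; what differs is that term-at-a-time keeps partial scores (accumulators) for many documents at once, while document-at-a-time holds a complete score for one document at a time.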

Exercise : What is meant by Query-Based Stemming?

• The decision about stemming is made at query time rather than during indexing. This improves
flexibility and effectiveness. The query is expanded using word variants.
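
A minimal sketch of query-time expansion, assuming the index stores unstemmed words and that groups of word variants (stem classes) are available. The stem classes and the `use_stemming` flag are hypothetical names introduced for illustration.

```python
# Hypothetical stem classes: stem -> word variants sharing that stem.
stem_classes = {
    "fish": ["fish", "fishes", "fishing"],
    "bank": ["bank", "banks", "banking"],
}

# Reverse lookup: variant -> stem.
variant_to_stem = {
    variant: stem
    for stem, variants in stem_classes.items()
    for variant in variants
}

def expand_query(query, use_stemming=True):
    """Replace each query word by its list of variants, if stemming is wanted."""
    expanded = []
    for word in query.lower().split():
        stem = variant_to_stem.get(word)
        if use_stemming and stem is not None:
            expanded.append(stem_classes[stem])
        else:
            expanded.append([word])
    return expanded
```

Because the choice is made per query, the same index can serve both an expanded query and an exact one, which is where the added flexibility comes from.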

Exercise : Compare relevance feedback and pseudo-relevance feedback

Pseudo-relevance feedback is a method for improving search engine results without involving the
user. Information is extracted automatically from the initial search results, the original query is
expanded into a new query, and the search is run again.

Relevance feedback involves the user in the retrieval process, for example by having the user
indicate which results are relevant, so as to improve the final result set.
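
The pseudo-relevance loop can be sketched as below. This is a simplification, assuming the top k documents are treated as relevant and that expansion terms are picked by raw frequency; practical systems usually weight terms and remove stopwords. The function name and parameters are hypothetical.

```python
from collections import Counter

def pseudo_relevance_expand(query_terms, ranked_docs, k=2, n_terms=2):
    """Add the n_terms most frequent new words from the top-k documents."""
    counts = Counter()
    for text in ranked_docs[:k]:
        counts.update(
            word for word in text.lower().split() if word not in query_terms
        )
    new_terms = [word for word, _ in counts.most_common(n_terms)]
    return list(query_terms) + new_terms

# Hypothetical initial ranking for the query "fish".
ranked = [
    "tropical fish tank aquarium",
    "fish tank aquarium tank",
    "unrelated text",
]
expanded = pseudo_relevance_expand(["fish"], ranked)
```

For relevance feedback, the same expansion step would run over documents the user marked as relevant instead of over the top k.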

10. Snippet generation involves more features than just the significance factor. State 4 of these
factors.

• Weighted combination of features used to rank sentences

• Web pages are less structured than news stories

• Snippet sentences are often selected from other sources

• Snippets can be generated from text of pages like Wikipedia



11. Explain pooling

In pooling, the top k results (for TREC, k varied between 50 and 200) from the rankings obtained by
different search engines (or retrieval algorithms) are merged into a pool, duplicates are removed,
and the documents are presented in some random order to the people doing the relevance judgments.
Pooling produces a large number of relevance judgments for each query.
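
The merge, de-duplicate, and shuffle steps can be sketched as follows. The rankings, document ids, and the `seed` parameter are hypothetical and serve only to make the example repeatable.

```python
import random

def build_pool(rankings, k, seed=None):
    """Merge the top-k of each ranking, drop duplicates, shuffle the result."""
    pool = []
    seen = set()
    for ranking in rankings:
        for doc_id in ranking[:k]:
            if doc_id not in seen:
                seen.add(doc_id)
                pool.append(doc_id)
    # Random order hides which system retrieved a document and at what rank.
    random.Random(seed).shuffle(pool)
    return pool

# Hypothetical rankings from two retrieval systems.
pool = build_pool(
    [["d1", "d2", "d3", "d4"], ["d3", "d5", "d1", "d6"]],
    k=3,
    seed=0,
)
```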

1- Damerau-Levenshtein distance counts the minimum number of insertions, deletions, substitutions,
or transpositions of single characters required to transform one string into another. List a number
of techniques used to speed up the calculation of edit distances.
• Techniques used to speed up calculation of edit distances
– restrict to words starting with same character
– restrict to words of same or similar length
– restrict to words that sound the same
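
The sketch below shows the distance computation together with the first two filters. Note the assumption: it implements the optimal string alignment variant (adjacent transpositions only, no substring edited more than once), which is a common restricted form of Damerau-Levenshtein rather than the unrestricted distance. The filter for words that sound the same is not covered, and the function names are hypothetical.

```python
def osa_distance(a, b):
    """Edit distance with adjacent transpositions (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[rows - 1][cols - 1]

def candidates(word, dictionary, max_len_diff=1):
    """Cheap filters applied before the costly distance computation."""
    return [
        w for w in dictionary
        if w and word
        and w[0] == word[0]                         # same first character
        and abs(len(w) - len(word)) <= max_len_diff  # similar length
    ]
```

The filters shrink the set of dictionary words for which the quadratic-time distance has to be computed, at the risk of missing corrections whose first letter is itself misspelled.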

12. What are the typical contents of a query log?

• Typical contents

– User identifier or user session identifier

– Query terms - stored exactly as user entered

– List of URLs of results, their ranks on the result list, and whether they were clicked on

– Timestamp(s) - records the time of user events such as query submission, clicks
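
As a rough illustration, one log record with these contents might be modeled as below. The class and field names are hypothetical, and real logs differ in layout and usually record separate timestamps for clicks as well.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class ResultEntry:
    url: str
    rank: int              # position on the result list
    clicked: bool = False  # whether the user clicked this result

@dataclass
class QueryLogRecord:
    session_id: str        # user or user session identifier
    query: str             # stored exactly as the user entered it
    timestamp: datetime    # time of query submission
    results: List[ResultEntry] = field(default_factory=list)

    def clicked_urls(self):
        """URLs the user clicked, in rank order as stored."""
        return [r.url for r in self.results if r.clicked]

# Hypothetical record.
record = QueryLogRecord(
    session_id="session-42",
    query="tropical fish",
    timestamp=datetime(2024, 1, 1, 12, 0, 0),
    results=[
        ResultEntry("https://example.com/a", rank=1),
        ResultEntry("https://example.com/b", rank=2, clicked=True),
    ],
)
```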
