
ISCL – Winter Semester 2007 – IR – Midterm Exam

17 December 2007

Non-electronic documents and calculators are permitted.


Name:
Semester:

Exercise 1: Definitions
Define the following terms:
– tokenization

– permuterm index

– champion list

Exercise 2: Characteristics of a collection and its index


Consider a collection of 500 000 documents, each containing 800 words on average. The
number of distinct words (i.e. not counting duplicates) is estimated at 700 000.
Show your computation for every question; a worked sketch in Python follows the questions.

– What is the size (in megabytes or gigabytes) of the collection when stored, uncompressed, on disk?

– Assuming the best dictionary reduction rate achievable with linguistic preprocessing
(stop-word removal, stemming), what is the size (in number of terms) of the dictionary?

– Consider an index where the average length of a non-positional posting list is 200. What
is the estimated total number of postings in this index?

– How many bytes would you allocate for encoding, without compression, (a) a dictionary
term and (b) a non-positional posting?

– What are the sizes (in megabytes or gigabytes) of the resulting dictionary and postings lists?

– If you compress your dictionary using the dictionary-as-a-string method, what is the new
size of the dictionary?
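
A worked sketch of the computations above, in Python. The per-word cost of 6 bytes,
the roughly one-third dictionary reduction from preprocessing, the field widths
(20 bytes per term, 4 bytes per posting) and the dictionary-as-a-string figures are
illustrative assumptions in the spirit of Manning, Raghavan & Schütze, not values
fixed by the exam:

    N_DOCS = 500_000
    AVG_WORDS = 800
    DISTINCT_WORDS = 700_000

    # Collection size on disk, assuming ~6 bytes per word (incl. whitespace).
    collection_bytes = N_DOCS * AVG_WORDS * 6
    print(f"collection: {collection_bytes / 1e9:.1f} GB")

    # Dictionary after preprocessing, assuming a roughly one-third reduction
    # from stop-word removal and stemming combined.
    dict_terms = int(DISTINCT_WORDS * (1 - 1 / 3))
    print(f"dictionary: {dict_terms} terms")

    # Total postings, on one reading: one posting list per (reduced)
    # dictionary term, each of average length 200.
    postings = dict_terms * 200
    print(f"postings:   {postings}")

    # Uncompressed sizes, assuming 20 bytes per term, 4 bytes per posting.
    print(f"dictionary: {dict_terms * 20 / 1e6:.1f} MB")
    print(f"postings:   {postings * 4 / 1e6:.1f} MB")

    # Dictionary-as-a-string: concatenated terms (~8 bytes average length)
    # plus, per term, a 4-byte frequency, a 4-byte postings pointer and a
    # 3-byte into-the-string term pointer.
    print(f"dict-as-a-string: {dict_terms * (8 + 4 + 4 + 3) / 1e6:.1f} MB")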

Exercise 3: Querying an index


What kinds of queries can be applied to the collection? For each of them, which index is needed? (A sketch of one query type follows.)
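
For instance, a Boolean AND query needs only a standard inverted index with sorted
posting lists; a minimal sketch of the merge (phrase or proximity queries would
additionally need a positional index, wildcard queries a permuterm index):

    def intersect(p1, p2):
        """Merge two sorted posting lists for a Boolean AND query."""
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:          # docID in both lists: keep it
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    print(intersect([1, 3, 7, 12], [3, 7, 9]))   # -> [3, 7]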

Exercise 4: Linguistic preprocessing
Are the following statements true or false? Justify your answer.

a) Stemming increases retrieval precision.

b) Stemming only slightly reduces the size of the dictionary.

c) Stop lists contain all of the most frequent terms.

Exercise 5: Porter stemming


What would be the output of the Porter stemmer for the following words?
– busses

– rely

– realised

What is the Porter measure of each of the following words? Show your computation (see the sketch after this list).
– crepuscular

– rigorous

– placement
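
A minimal sketch of the measure computation. The Porter measure m is the exponent
in the decomposition [C](VC)^m[V] of a word into vowel and consonant runs; this
ignores the rest of the stemming algorithm:

    def porter_measure(word):
        """Count the VC sequences in the [C](VC)^m[V] decomposition."""
        vowels = set("aeiou")
        kinds = []                      # True = vowel, False = consonant
        for i, ch in enumerate(word.lower()):
            if ch in vowels or (ch == "y" and i > 0 and not kinds[i - 1]):
                kinds.append(True)      # 'y' counts as a vowel after a consonant
            else:
                kinds.append(False)
        # m = number of vowel-run to consonant-run transitions.
        return sum(1 for prev, cur in zip(kinds, kinds[1:]) if prev and not cur)

    print(porter_measure("tree"))      # 0, matching Porter's own example
    print(porter_measure("trouble"))   # 1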

Exercise 6: Index architecture
Propose a MapReduce architecture for creating language-specific indexes from a
heterogeneous collection. You may illustrate the architecture with a figure.
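
A minimal sketch of the map and reduce steps. The language detector and tokenizer
here are toy stand-ins (a real system would plug in e.g. a character n-gram
language identifier and language-aware tokenizers):

    from collections import defaultdict

    def detect_language(text):
        # Toy stand-in for a real language identifier.
        return "en" if " the " in f" {text.lower()} " else "de"

    def tokenize(text):
        # Toy whitespace tokenizer.
        return text.lower().split()

    def map_phase(doc_id, text):
        """Map: route each (term, docID) pair to a per-language partition."""
        lang = detect_language(text)
        for term in tokenize(text):
            yield (lang, term), doc_id

    def build_indexes(docs):
        """Driver simulating the shuffle and the reduce step, which turns
        each (language, term) group into a sorted posting list."""
        groups = defaultdict(list)
        for doc_id, text in docs.items():
            for key, doc in map_phase(doc_id, text):
                groups[key].append(doc)
        indexes = defaultdict(dict)
        for (lang, term), doc_ids in groups.items():
            indexes[lang][term] = sorted(set(doc_ids))
        return indexes

    print(build_indexes({1: "the movie trailer", 2: "der Film"}))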

Exercise 7: Index compression


– What is the largest gap that can be encoded in 2 bytes using the variable-byte encoding?

– Which posting list is decoded from the variable-byte code
10001001 00000001 10000010 11111111? (A decoder sketch follows this list.)

– What would be the encoding of the same posting list using a γ-code?
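
A minimal decoder sketch, assuming the convention of Manning, Raghavan & Schütze
in which the high bit is set on the final byte of each gap and the remaining
7 bits of every byte carry the payload:

    def vb_decode(bytestream):
        """Decode a variable-byte stream into a list of gaps."""
        gaps, n = [], 0
        for byte in bytestream:
            if byte < 128:                        # continuation byte
                n = n * 128 + byte
            else:                                 # final byte of this gap
                gaps.append(n * 128 + byte - 128)
                n = 0
        return gaps

    def gaps_to_postings(gaps):
        """Cumulative sums turn gaps back into absolute docIDs."""
        postings, total = [], 0
        for gap in gaps:
            total += gap
            postings.append(total)
        return postings

    stream = [0b10001001, 0b00000001, 0b10000010, 0b11111111]
    print(gaps_to_postings(vb_decode(stream)))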

Exercise 8: Vector space model
Consider a collection consisting of the documents d1, d2, d3, whose characteristics
are the following:

Term      tf_d1   tf_d2   tf_d3    df
actor        12      35      55   123
movie        15      24      48   240
trailer      52      13      12    85

– Compute the vector representations of d1, d2 and d3 using the tf-idf_{t,d} weighting
and Euclidean normalization (see the sketch after this list).

– Compute the cosine similarities between these documents.

– Give the ranking retrieved by the system for the query “movie trailer”.
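
A minimal sketch of these computations. The collection size N is not restated in
this exercise, so the sketch borrows N = 500 000 from Exercise 2, and it uses raw
tf times log10 idf (one common tf-idf variant) and a binary query vector; both are
assumptions:

    import math

    N = 500_000   # assumed: taken from Exercise 2
    tf = {"actor": [12, 35, 55], "movie": [15, 24, 48], "trailer": [52, 13, 12]}
    df = {"actor": 123, "movie": 240, "trailer": 85}
    terms = list(tf)

    def normalize(w):
        norm = math.sqrt(sum(x * x for x in w))
        return [x / norm for x in w]

    def tfidf_vector(d):
        """Raw tf times log10(N/df), then Euclidean normalization."""
        return normalize([tf[t][d] * math.log10(N / df[t]) for t in terms])

    def cosine(u, v):
        # Inputs are unit length, so cosine similarity is the dot product.
        return sum(a * b for a, b in zip(u, v))

    docs = [tfidf_vector(d) for d in range(3)]
    for i in range(3):
        for j in range(i + 1, 3):
            print(f"cos(d{i+1}, d{j+1}) = {cosine(docs[i], docs[j]):.3f}")

    # Ranking for the query "movie trailer" with a binary query vector.
    q = normalize([1.0 if t in ("movie", "trailer") else 0.0 for t in terms])
    print(sorted(range(1, 4), key=lambda d: cosine(q, docs[d - 1]), reverse=True))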

Exercise 9: Term weighting
Compute the vector representations of the documents introduced in the previous exercise
using the ltn weighting scheme.
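
A minimal sketch, reading ltn in SMART notation as logarithmic tf (1 + log10 tf),
idf weighting (log10 N/df) and no normalization, and again borrowing N = 500 000
from Exercise 2:

    import math

    N = 500_000   # assumed, as in the previous sketch
    tf = {"actor": [12, 35, 55], "movie": [15, 24, 48], "trailer": [52, 13, 12]}
    df = {"actor": 123, "movie": 240, "trailer": 85}

    def ltn_vector(d):
        """ltn weight: (1 + log10 tf) * log10(N / df), unnormalized."""
        return [(1 + math.log10(tf[t][d])) * math.log10(N / df[t]) for t in tf]

    for d in range(3):
        print(f"d{d+1}:", [round(w, 3) for w in ltn_vector(d)])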

Exercise 10: Index architecture (extra credit)


Consider a hashtable as a structure mapping keys to values through a hash
function h, so that the value for a key is stored in slot h(key).
– What problem may arise from such a structure when inserting new key-value pairs?

– What workaround would you propose for this insertion? Give an algorithm for
inserting a key-value pair (a sketch follows this list).
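
One classic workaround for collisions (two keys hashing to the same slot) is
separate chaining; a minimal sketch:

    class ChainedHashTable:
        """Each slot holds a list of (key, value) pairs, so keys that
        collide under the hash function can coexist in one slot."""

        def __init__(self, size=1024):
            self.slots = [[] for _ in range(size)]

        def insert(self, key, value):
            slot = self.slots[hash(key) % len(self.slots)]
            for i, (k, _) in enumerate(slot):
                if k == key:                  # key already present: update
                    slot[i] = (key, value)
                    return
            slot.append((key, value))         # new key (possibly a collision)

        def get(self, key):
            slot = self.slots[hash(key) % len(self.slots)]
            for k, v in slot:
                if k == key:
                    return v
            raise KeyError(key)

Open addressing (probing for the next free slot) is the other standard answer.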
