Professional Documents
Culture Documents
DD2475 Information Retrieval Written Exam, December 13, 2010, 14.00 - 19.00
DD2475 Information Retrieval Written Exam, December 13, 2010, 14.00 - 19.00
WrittenExam,December13,2010,14.0019.00
Theexamconsistsoftwoparts.Part1consistsof5questionswithatotalmaximumscore
of20.Part2consistsof5questionswithatotalmaximumscoreof30.Thequestionscan
beansweredinEnglishorSwedish.Thelimitsfordifferentgradesare:
A:4250
B:3741AND(10onPart1)
C:3336AND((10onPart1)AND(15onPart2))
D:2932AND((10onPart1)AND(15onPart2))
E:2528AND((10onPart1)AND(15onPart2))
Fx:25AND((<10onPart1)OR(<15onPart2))
F:<25
IntheeventofgradeF,thewrittenexamhastobetakenagain.IntheeventofgradeFx,
youwillbeallowedtoorallycompletepartsoftheexamtoreachgradeE.
Part1:No aidsallowed.Hand inyour solutions toPart 1inanassigned envelopebefore
producingtheaidstoPart2.
Part2:AllaidsallowedEXCEPTI)otherhumans,II)phones,computers,readingpads.
Part1:Theory(20credits)
Weexpecttheanswerstoall5questionstocoveruptotwo(2)handwrittenA4.
1.1(4credits)
Termdocument incidence matrices are most often not stored as 2Darrays, but as
anotherstructure.Why?
1.2(4credits)
Whatisthedifferencebetweenstemmingandlemmatization?Whataretheadvantages
anddisadvantageswitheachmethod?
1.3(4credits)
What is the difference between Boolean retrieval and ranked retrieval? Why are most
publiclyavailablesearchenginesusingrankedretrievalnowadays?
1.4(4credits)
Belowisapieceofapositionalindex,whereentriesareintheform:docID1:<position1,
position2,...>;docID2:<position1,position2,...>;etc.
zbigniew:
0:<1024,6555>;1:<2,232>;2:<0,233>;3:<5,105,268,1323,3214>;
4:<234,8526,13283,17500,32698>
kazimierz: 1:<3>;5:<32699>
brzeszynski:1:<4,812>;2:<234>;4:<8525>;6:<78,378,4798,16788,19234>;
7:<2,207,654,891,2024,11026,22704>
politial:
1:<7>;4:<452,986>;5:<25,512,765,2187,7453,13818>;6:<6526>;
7:<2064,11027>;8:<4362,19254>
advisor: 1:<8,20560>,7:<490,11030>;8:<44,786,4360>
Suppose the query "zbigniew NEAR brzeszynski" returns the documents 1,2 and 4.
Furthermore,supposethequery"politicalNEARadvisor"returnsthedocuments1and
8.DescribeaprobableinterpretationoftheNEARoperator.
1.5(4credits)
Thefollowing20rankedresultshavebeenreturnedasaresponsetoaquery.Sincewe
areexpertsinthearea,wefindresults2,6,8,9,10,12,15,16,17,18,and20relevant.
12345
678910
1112131415
1617181920
Weknowthatthereareintotal20relevantdocumentsinthecollection.
(a)Drawtheprecisionrecallcurveforthefirst20resultsshownabove.
(b)Whatisprecisionat10forthislistofresults?
Part2:Problems(30credits)
Weexpecttheanswerstoeachquestiontocoveruptoone(1)handwrittenA4.
2.1(6credits)
Considerthequery"MichaelJackson",andthetwofollowingdocuments:
1)TheWikipediaarticleaboutMichaelJackson,whichbegins:
MichaelJosephJackson(August29,1958June25,2009)wasanAmericanrecording
artist, dancer, singersongwriter, and philanthropist. Referred to as the King of Pop,
JacksonisrecognizedasthemostsuccessfulentertainerofalltimebyGuinnessWorld
Records. His contribution to music, dance and fashion, along with a muchpublicized
personal life, made him a global figure in popular culture for over four decades. The
eighth child of the Jackson family, he debuted on the professional music scene along
with his brothers as a member of The Jackson 5 in the mid1960s, and began his solo
careerin1971"
2)Thedocumentwhoseentirecontentsis:
"MichaelJackson"
Ifthedocumentsarerepresentedbytheirtfidfweights,whichdocumentwouldreceive
thehighercosinescorewiththequery?Inthelightofthisexample,discusstheuseoftf
idfasameasureforrankingsearchresults.Whichimprovementswouldyousuggest?
2.2(6credits)
HowcouldanIRsystemcombineuseofapositionalindexanduseofstopwords?What
isthepotentialproblem,andhowcoulditbehandled?
2.3(6credits)
Consideranindexwithvectorrepresentationofdocuments.Considerthecaseofaquery
term that is not in the set of M indexed terms; thus, our standard construction of the
query vector results in V(q) not being in the vector space created from the collection.
Howwouldoneadaptthevectorspacerepresentationtohandlethiscase?
2.4(6credits)
TheRocchioalgorithmforrelevancefeedbacksuggestsupdatingthequeryaccordingto:
qm = qo +
!
1
1 !
di
dj
|Dr |
|Dnr |
di Dr
dj Dnr
As a response to the initial query qo, the system returns the following top10 list of
rankeddocuments:
docA,docB,docC,docD,docE,docF,docG,docH,docI,docJ
Theuserthentagsthereturneddocumentswitha+ifitisrelevantorwithaifitis
notrelevant.Thisishowtheuserlabeledtheinitialrankedtop10list:
docA
+docB
+docC
docD
+docE
+docF
docG
docH
+docI
docJ
AnewqueryqmisformulatedaccordingtotheRocchioformula,andusedtoretrievea
newlist.Whataretheapproximativevaluesof , and inthethreefollowingcases?
Motivateyouranswers.
(a)AfteroneiterationoftheRocchioalgorithm,theupdatedtop10listlookslikethis:
docA,docB,docC,docD,docE,docF,docG,docH,docI,docJ
(b)AfteroneiterationoftheRocchioalgorithm,theupdatedtop10listlookslikethis:
docB,docC,docE,docF,docI,docA,docD,docG,docH,docJ
(c)AfteroneiterationoftheRocchioalgorithm,theupdatedtop10listlookslikethis:
docB,docC,docE,docF,docI,docK,docL,docM,docN,docO
2.5(6credits)
Youareaninvestigatingreporter,andsomeonehasgivenyouamemorystickwithan
indexofasetofsecretdocuments(butnotthedocumentsthemselves).
(a) Assuming that the index is not encrypted (i.e., that you can access the data
structure), what information can you obtain about the original documents from the
index?
(b)Assumethattheindexisencrypted,butthereisaqueryinterfaceinwhichyoucan
posequeriesandgetdocIDsofmatchingdocuments.Inthiscase,whatcanlearnabout
theindexandthedocuments?