
DD2475 Information Retrieval

Written Exam, December 13, 2010, 14.00-19.00

The exam consists of two parts. Part 1 consists of 5 questions with a total maximum score
of 20. Part 2 consists of 5 questions with a total maximum score of 30. The questions can
be answered in English or Swedish. The limits for different grades are:
A: 42-50
B: 37-41 AND (≥10 on Part 1)
C: 33-36 AND ((≥10 on Part 1) AND (≥15 on Part 2))
D: 29-32 AND ((≥10 on Part 1) AND (≥15 on Part 2))
E: 25-28 AND ((≥10 on Part 1) AND (≥15 on Part 2))
Fx: ≥25 AND ((<10 on Part 1) OR (<15 on Part 2))
F: <25
In the event of grade F, the written exam has to be taken again. In the event of grade Fx,
you will be allowed to orally complete parts of the exam to reach grade E.
Part 1: No aids allowed. Hand in your solutions to Part 1 in an assigned envelope before
producing the aids to Part 2.
Part 2: All aids allowed EXCEPT I) other humans, II) phones, computers, reading pads.

Part 1: Theory (20 credits)
We expect the answers to all 5 questions to cover up to two (2) handwritten A4 pages.

1.1 (4 credits)
Term-document incidence matrices are most often not stored as 2D arrays, but as
another structure. Why?

1.2 (4 credits)
What is the difference between stemming and lemmatization? What are the advantages
and disadvantages of each method?

1.3 (4 credits)
What is the difference between Boolean retrieval and ranked retrieval? Why do most
publicly available search engines use ranked retrieval nowadays?

1.4 (4 credits)
Below is a piece of a positional index, where entries are in the form: docID1:<position1,
position2,...>; docID2:<position1,position2,...>; etc.
zbigniew: 0:<1024,6555>; 1:<2,232>; 2:<0,233>; 3:<5,105,268,1323,3214>;
4:<234,8526,13283,17500,32698>
kazimierz: 1:<3>; 5:<32699>
brzeszynski: 1:<4,812>; 2:<234>; 4:<8525>; 6:<78,378,4798,16788,19234>;
7:<2,207,654,891,2024,11026,22704>
political: 1:<7>; 4:<452,986>; 5:<25,512,765,2187,7453,13818>; 6:<6526>;
7:<2064,11027>; 8:<4362,19254>
advisor: 1:<8,20560>; 7:<490,11030>; 8:<44,786,4360>
Suppose the query "zbigniew NEAR brzeszynski" returns the documents 1, 2 and 4.
Furthermore, suppose the query "political NEAR advisor" returns the documents 1 and
8. Describe a probable interpretation of the NEAR operator.
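A candidate interpretation of NEAR can be checked mechanically against the index. The following is a minimal sketch, not the exam's reference solution: the function name `near`, the window parameter `k`, and the `ordered` flag are assumptions introduced here for illustration, and the index is transcribed from the question with the fourth key spelled `political` as in the query.

```python
from typing import Dict, List, Set

# Positional index transcribed from question 1.4: term -> {docID: positions}.
INDEX: Dict[str, Dict[int, List[int]]] = {
    "zbigniew": {0: [1024, 6555], 1: [2, 232], 2: [0, 233],
                 3: [5, 105, 268, 1323, 3214],
                 4: [234, 8526, 13283, 17500, 32698]},
    "kazimierz": {1: [3], 5: [32699]},
    "brzeszynski": {1: [4, 812], 2: [234], 4: [8525],
                    6: [78, 378, 4798, 16788, 19234],
                    7: [2, 207, 654, 891, 2024, 11026, 22704]},
    "political": {1: [7], 4: [452, 986],
                  5: [25, 512, 765, 2187, 7453, 13818], 6: [6526],
                  7: [2064, 11027], 8: [4362, 19254]},
    "advisor": {1: [8, 20560], 7: [490, 11030], 8: [44, 786, 4360]},
}


def near(index: Dict[str, Dict[int, List[int]]], term1: str, term2: str,
         k: int, ordered: bool = False) -> Set[int]:
    """Return docIDs in which term1 and term2 occur at most k positions apart.

    Hypothetical semantics: with ordered=True, term1 must also precede term2.
    """
    postings1 = index.get(term1, {})
    postings2 = index.get(term2, {})
    hits: Set[int] = set()
    # Only documents containing both terms can match.
    for doc_id in postings1.keys() & postings2.keys():
        for p1 in postings1[doc_id]:
            for p2 in postings2[doc_id]:
                gap = p2 - p1
                matched = (0 < gap <= k) if ordered else (0 < abs(gap) <= k)
                if matched:
                    hits.add(doc_id)
    return hits
```

Calling `near(INDEX, "zbigniew", "brzeszynski", k)` and `near(INDEX, "political", "advisor", k)` for several values of `k`, with and without `ordered`, and comparing the output with the result sets stated in the question is one way to test a hypothesis. The nested loop is quadratic in the posting lengths; a production system would merge the sorted position lists instead.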

1.5 (4 credits)
The following 20 ranked results have been returned as a response to a query. Since we
are experts in the area, we find results 2, 6, 8, 9, 10, 12, 15, 16, 17, 18, and 20 relevant.

1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20

We know that there are in total 20 relevant documents in the collection.
(a) Draw the precision-recall curve for the first 20 results shown above.
(b) What is precision at 10 for this list of results?
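The quantities asked for in 1.5 follow directly from the definitions of precision and recall at each rank. Below is a minimal sketch under the question's stated data; the function names `precision_recall_points` and `precision_at_k` are illustrative choices, not taken from the exam.

```python
from typing import List, Set, Tuple


def precision_recall_points(ranked: List[int], relevant: Set[int],
                            total_relevant: int) -> List[Tuple[float, float]]:
    """Return one (recall, precision) pair per rank position."""
    points: List[Tuple[float, float]] = []
    hits = 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
        # Recall is measured against all relevant documents in the
        # collection, not only those that were retrieved.
        points.append((hits / total_relevant, hits / rank))
    return points


def precision_at_k(ranked: List[int], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


# Data from question 1.5.
RANKED = list(range(1, 21))
RELEVANT = {2, 6, 8, 9, 10, 12, 15, 16, 17, 18, 20}
TOTAL_RELEVANT = 20
```

Plotting the pairs from `precision_recall_points(RANKED, RELEVANT, TOTAL_RELEVANT)` with recall on the x-axis gives the sawtooth curve requested in (a); `precision_at_k(RANKED, RELEVANT, 10)` gives the value for (b).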

Part 2: Problems (30 credits)
We expect the answers to each question to cover up to one (1) handwritten A4 page.

2.1 (6 credits)
Consider the query "Michael Jackson", and the two following documents:
1) The Wikipedia article about Michael Jackson, which begins:
"Michael Joseph Jackson (August 29, 1958 - June 25, 2009) was an American recording
artist, dancer, singer-songwriter, and philanthropist. Referred to as the King of Pop,
Jackson is recognized as the most successful entertainer of all time by Guinness World
Records. His contribution to music, dance and fashion, along with a much-publicized
personal life, made him a global figure in popular culture for over four decades. The
eighth child of the Jackson family, he debuted on the professional music scene along
with his brothers as a member of The Jackson 5 in the mid-1960s, and began his solo
career in 1971"
2) The document whose entire contents is:
"Michael Jackson"
If the documents are represented by their tf-idf weights, which document would receive
the higher cosine score with the query? In the light of this example, discuss the use of
tf-idf as a measure for ranking search results. Which improvements would you suggest?
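The effect of length normalization on the cosine score can be explored with a small experiment. The sketch below makes two loud assumptions: every term gets idf = 1 (so the weights reduce to raw term frequency, since no collection statistics are given in the question), and the long document is cut down to the first few words of the quoted article. The helper names `tf_vector` and `cosine` are illustrative.

```python
import math
from collections import Counter
from typing import Dict


def tf_vector(text: str) -> Dict[str, float]:
    """Bag-of-words weights; ASSUMPTION: idf = 1 for every term."""
    return {term: float(count)
            for term, count in Counter(text.lower().split()).items()}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


query = tf_vector("Michael Jackson")
short_doc = tf_vector("Michael Jackson")
# Truncated fragment of the article text quoted in the question.
long_doc = tf_vector("Michael Joseph Jackson was an American recording artist")

score_short = cosine(query, short_doc)
score_long = cosine(query, long_doc)
```

Comparing `score_short` with `score_long` shows how every additional distinct term in a document enlarges its vector norm and so lowers its cosine with a two-word query. Real idf values would change the magnitudes, so the numbers are only indicative.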

2.2 (6 credits)
How could an IR system combine use of a positional index and use of stop words? What
is the potential problem, and how could it be handled?

2.3 (6 credits)
Consider an index with vector representation of documents. Consider the case of a query
term that is not in the set of M indexed terms; thus, our standard construction of the
query vector results in V(q) not being in the vector space created from the collection.
How would one adapt the vector space representation to handle this case?

2.4 (6 credits)
The Rocchio algorithm for relevance feedback suggests updating the query according to:

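$$\vec{q}_m = \alpha\,\vec{q}_o + \beta\,\frac{1}{|D_r|} \sum_{\vec{d}_i \in D_r} \vec{d}_i - \gamma\,\frac{1}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j$$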

As a response to the initial query q_o, the system returns the following top-10 list of
ranked documents:
docA, docB, docC, docD, docE, docF, docG, docH, docI, docJ

The user then tags each returned document with a + if it is relevant or with a - if it is
not relevant. This is how the user labeled the initial ranked top-10 list:
- docA
+ docB
+ docC
- docD
+ docE
+ docF
- docG
- docH
+ docI
- docJ
A new query q_m is formulated according to the Rocchio formula, and used to retrieve a
new list. What are the approximate values of α, β, and γ in the three following cases?
Motivate your answers.
(a) After one iteration of the Rocchio algorithm, the updated top-10 list looks like this:
docA, docB, docC, docD, docE, docF, docG, docH, docI, docJ
(b) After one iteration of the Rocchio algorithm, the updated top-10 list looks like this:
docB, docC, docE, docF, docI, docA, docD, docG, docH, docJ
(c) After one iteration of the Rocchio algorithm, the updated top-10 list looks like this:
docB, docC, docE, docF, docI, docK, docL, docM, docN, docO
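The Rocchio update in 2.4 can be sketched in a few lines, which makes it easier to reason about how each weight moves the query. This is a minimal sketch under assumptions: vectors are plain Python lists of equal length, the names `centroid` and `rocchio` are illustrative, and negative components are left as they are (some implementations clip them to zero, which is not done here).

```python
from typing import List, Sequence

Vector = List[float]


def centroid(docs: Sequence[Vector], dim: int) -> Vector:
    """Component-wise mean of the vectors; the zero vector if docs is empty."""
    if not docs:
        return [0.0] * dim
    return [sum(d[i] for d in docs) / len(docs) for i in range(dim)]


def rocchio(q_o: Vector, relevant: Sequence[Vector],
            non_relevant: Sequence[Vector],
            alpha: float, beta: float, gamma: float) -> Vector:
    """q_m = alpha*q_o + beta*centroid(D_r) - gamma*centroid(D_nr)."""
    dim = len(q_o)
    c_r = centroid(relevant, dim)
    c_nr = centroid(non_relevant, dim)
    return [alpha * q_o[i] + beta * c_r[i] - gamma * c_nr[i]
            for i in range(dim)]


# Toy two-term example with made-up vectors (not from the exam).
q_m = rocchio([1.0, 0.0],
              relevant=[[0.0, 2.0], [0.0, 4.0]],
              non_relevant=[[2.0, 0.0]],
              alpha=1.0, beta=0.5, gamma=0.25)
```

Setting `beta` and `gamma` to zero leaves the query unchanged up to scaling, while raising them pulls the query toward the relevant centroid and pushes it away from the non-relevant one; those two extremes are useful reference points when reasoning about cases (a) to (c).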

2.5 (6 credits)
You are an investigative reporter, and someone has given you a memory stick with an
index of a set of secret documents (but not the documents themselves).
(a) Assuming that the index is not encrypted (i.e., that you can access the data
structure), what information can you obtain about the original documents from the
index?
(b) Assume that the index is encrypted, but there is a query interface in which you can
pose queries and get docIDs of matching documents. In this case, what can you learn about
the index and the documents?
