
Search Engines

Information Retrieval in Practice

All slides © Addison Wesley, 2008

Search and Information Retrieval
Search on the Web [1] is a daily activity for many people throughout the world
Search and communication are most popular uses of the computer
Applications involving search are everywhere
The field of computer science that is most involved with R&D for search is information retrieval (IR)

[1] or is it web?

Information Retrieval
"Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information." (Salton, 1968)
General definition that can be applied to many types of information and search applications
Primary focus of IR since the 50s has been on text and documents

What is a Document?
Examples:
web pages, email, books, news stories, scholarly papers, text messages, Word, Powerpoint, PDF, forum postings, patents, IM sessions, etc.

Common properties
Significant text content
Some structure (e.g., title, author, date for papers; subject, sender, destination for email)

Documents vs. Database Records
Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes)
e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.

Easy to compare fields with well-defined semantics to queries in order to find matches
Text is more difficult

Documents vs. Records
Example bank database query
Find records with balance > $50,000 in branches located in Amherst, MA.
Matches easily found by comparison with field values of records

Example search engine query
bank scandals in western mass
This text must be compared to the text of entire news stories
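The contrast between the two example queries can be sketched in a few lines of Python. This is only an illustration: the `AccountRecord` class, its field names, the sample records, and the sample headlines are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class AccountRecord:
    # Hypothetical fields, chosen to mirror the bank example.
    owner: str
    branch_city: str
    branch_state: str
    balance: float


records = [
    AccountRecord("A. Jones", "Amherst", "MA", 72_500.0),
    AccountRecord("B. Smith", "Amherst", "MA", 12_000.0),
    AccountRecord("C. Lee", "Boston", "MA", 90_000.0),
]

# Database-style query: every condition is a comparison against a typed field.
db_matches = [
    r.owner
    for r in records
    if r.balance > 50_000 and r.branch_city == "Amherst" and r.branch_state == "MA"
]

stories = [
    "Bank director in Amherst steals funds",
    "Springfield credit union fined over missing deposits",
]

# Search-style query: there are no fields to compare against, only free text.
query = "bank scandals in western mass"
literal_hits = [s for s in stories if query in s.lower()]

print(db_matches)    # the field comparison finds the one qualifying account
print(literal_hits)  # the literal query phrase appears in neither story
```

The record query is answered by simple field comparisons, while looking for the query phrase literally in the stories finds nothing, even though the first story is plausibly what the searcher wants.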

Comparing Text
Comparing the query text to the document text and determining what is a good match is the core issue of information retrieval
Exact matching of words is not enough
Many different ways to write the same thing in a natural language like English
e.g., does a news story containing the text "bank director in Amherst steals funds" match the query?
Some stories will be better matches than others
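One way to see why exact word matching falls short is to score stories by the fraction of query words they contain. The sketch below assumes the query "bank scandals in western mass" from the bank example; the `tokens` and `overlap` helpers and the two extra headlines are made up for illustration, and word overlap is a deliberately naive stand-in, not a method the slides propose.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap(query, document):
    """Fraction of distinct query words that also occur in the document."""
    query_words = tokens(query)
    return len(query_words & tokens(document)) / len(query_words)


query = "bank scandals in western mass"
stories = [
    "Bank director in Amherst steals funds",   # on topic, different wording
    "Western Mass bank scandals widen",        # on topic, same wording
    "Red Sox win in extra innings",            # off topic
]

# Rank stories from best to worst word overlap with the query.
ranked = sorted(stories, key=lambda s: overlap(query, s), reverse=True)
for story in ranked:
    print(f"{overlap(query, story):.1f}  {story}")
```

The Amherst story shares only "bank" and "in" with the query, so it scores barely above the baseball story even though it is about a bank scandal in western Massachusetts. Matching on words alone misses "steals funds" as a way of saying "scandal" and "Amherst" as a place in "western mass".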

Dimensions of IR
IR is more than just text, and more than just web search
although these are central

People doing IR work with different media, different types of search applications, and different tasks

Other Media
New applications increasingly involve new media
e.g., video, photos, music, speech

Like text, content is difficult to describe and compare
text may be used to represent them (e.g. tags)

IR approaches to search and evaluation are appropriate

Dimensions of IR
Content: text, images, video, scanned docs, audio, music
Applications: web search, vertical search, enterprise search, desktop search, forum search, P2P search, literature search
Tasks: ad hoc search, filtering, classification, question answering

IR Tasks
Ad hoc search
Find relevant documents for an arbitrary text query

Filtering
Identify relevant user profiles for a new document

Classification
Identify relevant labels for documents

Question answering
Give a specific answer to a question

Big Issues in IR
Relevance
What is it?
Simple (and simplistic) definition: A relevant document contains the information that a person was looking for when they submitted a query to the search engine
Many factors influence a person's decision about what is relevant: e.g., task, context, novelty, style
Topical relevance (same topic) vs. user relevance (everything else)

Big Issues in IR
Relevance
Retrieval models define a view of relevance
Ranking algorithms used in search engines are based on retrieval models
Most models describe statistical properties of text rather than linguistic
i.e. counting simple text features such as words instead of parsing and analyzing the sentences
Statistical approach to text processing started with Luhn in the 50s
Linguistic features can be part of a statistical model
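To make "counting simple text features" concrete, here is a minimal sketch that ranks documents by how often the query words occur in them. Raw term counting is an assumption chosen for brevity; it is not one of the retrieval models the course covers, and the `term_counts` and `score` helpers and the sample documents are hypothetical.

```python
import re
from collections import Counter


def term_counts(text):
    """Count occurrences of each lowercase alphabetic word in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def score(query, document):
    """Sum, over the distinct query words, of their counts in the document."""
    counts = term_counts(document)
    return sum(counts[term] for term in term_counts(query))


documents = [
    "Town meeting postponed",
    "River bank erosion worsens",
    "The bank said the bank scandal cost millions",
]

query = "bank scandal"
# Highest-scoring documents first; no parsing or sentence analysis involved.
ranking = sorted(documents, key=lambda d: score(query, d), reverse=True)
print(ranking)
```

Note that the river story still gets a nonzero score because "bank" is counted as a string, not as a concept. That is the kind of limitation a purely statistical view of text has to work around.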

Big Issues in IR
Evaluation
Experimental procedures and measures for comparing system output with user expectations
Originated in Cranfield experiments in the 60s

IR evaluation methods now used in many fields
Typically use test collection of documents, queries, and relevance judgments
Most commonly used are TREC collections

Recall and precision are two examples of effectiveness measures
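As a small worked example, precision and recall can be computed for one query from the set of retrieved documents and the set judged relevant, using the standard set-based definitions (precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that were retrieved). The document identifiers and judgments below are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, treating both inputs as sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical system output and relevance judgments for a single query.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Two of the four retrieved documents are relevant, so precision is 0.5; two of the five relevant documents were found, so recall is 0.4.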

Big Issues in IR
Users and Information Needs
Search evaluation is user-centered
Keyword queries are often poor descriptions of actual information needs
Interaction and context are important for understanding user intent
Query refinement techniques such as query expansion, query suggestion, relevance feedback improve ranking
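A rough sketch of query expansion, one of the refinement techniques named above: related words are added to the user's query so that documents using different vocabulary can still match. The `RELATED_TERMS` table here is hand-written and hypothetical; where a real system gets its related terms is outside the scope of this slide.

```python
# Hypothetical, hand-written table of related terms (an assumption for this sketch).
RELATED_TERMS = {
    "scandals": ["fraud", "embezzlement"],
    "mass": ["massachusetts"],
}


def expand_query(query):
    """Return the original query terms followed by any related terms."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for extra in RELATED_TERMS.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded


expanded = expand_query("bank scandals in western mass")
print(expanded)
```

The original terms are kept in order and the additions are appended, so a ranking function can still weight the user's own words more heavily if it chooses to.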

IR and Search Engines
A search engine is the practical application of information retrieval techniques to large scale text collections
Web search engines are best-known examples, but many others
Open source search engines are important for research and development
e.g., Lucene, Lemur/Indri, Galago

Big issues include main IR issues but also some others

IR and Search Engines
Information Retrieval
Relevance: effective ranking
Evaluation: testing and measuring
Information needs: user interaction

Search Engines
Performance: efficient search and indexing
Incorporating new data: coverage and freshness
Scalability: growing with data and users
Adaptability: tuning for applications
Specific problems: e.g. spam

Search Engine Issues
Performance
Measuring and improving the efficiency of search
e.g., reducing response time, increasing query throughput, increasing indexing speed

Indexes are data structures designed to improve search efficiency
designing and implementing them are major issues for search engines
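The slide does not say which index structure is meant. A common choice in search engines is the inverted index, which maps each word to the set of documents containing it, so a query only touches the lists for its own words instead of scanning every document. The sketch below is a toy in-memory version under that assumption; `build_index`, `search`, and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query word (boolean AND)."""
    terms = re.findall(r"[a-z]+", query.lower())
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "Bank director steals funds",
    2: "Bank opens new branch",
    3: "Director resigns",
}
index = build_index(docs)
print(search(index, "bank director"))
```

The cost of a lookup depends on the lengths of the lists for the query words rather than on the size of the whole collection, which is where the efficiency gain comes from.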

Search Engine Issues
Dynamic data
The collection for most real applications is constantly changing in terms of updates, additions, deletions
e.g., web pages

Acquiring or crawling the documents is a major task
Typical measures are coverage (how much has been indexed) and freshness (how recently was it indexed)

Updating the indexes while processing queries is also a design issue
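Coverage and freshness are described only informally above. One possible way to turn them into numbers, offered purely as an assumption, is to measure coverage as the fraction of known documents that are in the index and freshness as the fraction of indexed documents that were indexed within some maximum age. The function names, the seven-day threshold, and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta


def coverage(indexed_ids, known_ids):
    """Fraction of known documents that have been indexed."""
    known = set(known_ids)
    return len(known & set(indexed_ids)) / len(known) if known else 0.0


def fresh_fraction(last_indexed, now, max_age):
    """Fraction of indexed documents whose index entry is at most max_age old."""
    if not last_indexed:
        return 0.0
    fresh = sum(1 for stamp in last_indexed.values() if now - stamp <= max_age)
    return fresh / len(last_indexed)


now = datetime(2008, 1, 15, 12, 0)           # fixed clock for a repeatable example
known_pages = {"a", "b", "c", "d"}           # pages the crawler knows about
last_indexed = {                             # when each indexed page was last indexed
    "a": now - timedelta(hours=1),
    "b": now - timedelta(days=3),
    "c": now - timedelta(days=10),
}

print(coverage(last_indexed.keys(), known_pages))
print(fresh_fraction(last_indexed, now, timedelta(days=7)))
```

Here three of the four known pages are indexed, and two of those three were indexed within the last week. Other definitions are equally reasonable, for example weighting pages by how often they change.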

Search Engine Issues
Scalability
Making everything work with millions of users every day, and many terabytes of documents
Distributed processing is essential

Adaptability
Changing and tuning search engine components such as ranking algorithm, indexing strategy, interface for different applications

Spam
For Web search, spam in all its forms is one of the major issues
Affects the efficiency of search engines and, more seriously, the effectiveness of the results
Many types of spam
e.g. spamdexing or term spam, link spam, optimization

New subfield called adversarial IR, since spammers are adversaries with different goals

Course Goals
To help you to understand search engines, evaluate and compare them, and modify them for specific applications
Provide broad coverage of the important issues in information retrieval and search engines
includes underlying models and current research directions
