Search Engines: Information Retrieval in Practice
All slides © Addison Wesley, 2008
Search and Information Retrieval
Search on the Web [1] is a daily activity for many people throughout the world
Search and communication are the most popular uses of the computer
Applications involving search are everywhere
The field of computer science that is most involved with R&D for search is information retrieval (IR)
[1] or is it web?
Information Retrieval
"Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information." (Salton, 1968)
General definition that can be applied to many types of information and search applications
Primary focus of IR since the 50s has been on text and documents
What is a Document?
Examples:
web pages, email, books, news stories, scholarly papers, text messages, Word, PowerPoint, PDF, forum postings, patents, IM sessions, etc.
Common properties
Significant text content
Some structure (e.g., title, author, date for papers; subject, sender, destination for email)
Documents vs. Database Records
Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes)
e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.
Documents vs. Records
Example bank database query
Find records with balance > $50,000 in branches located in Amherst, MA.
Matches are easily found by comparison with field values of records
Example search engine query
bank scandals in western mass
This text must be compared to the text of entire news stories
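The bank database query above can be sketched in a few lines, because every condition is a direct comparison with a field value. This is a minimal illustration only: the `AccountRecord` type, its field names, and the sample records are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class AccountRecord:
    # Hypothetical well-defined fields of a bank record.
    name: str
    balance: float
    branch_city: str
    branch_state: str


# Invented sample data, for illustration only.
records = [
    AccountRecord("A. Jones", 72_000.0, "Amherst", "MA"),
    AccountRecord("B. Smith", 12_500.0, "Amherst", "MA"),
    AccountRecord("C. Lee", 90_000.0, "Boston", "MA"),
]


def large_amherst_accounts(rows):
    """Balance > $50,000 in branches located in Amherst, MA."""
    return [
        r for r in rows
        if r.balance > 50_000
        and r.branch_city == "Amherst"
        and r.branch_state == "MA"
    ]


matches = large_amherst_accounts(records)
print([r.name for r in matches])
```

A news story offers no such fields to compare against: the query "bank scandals in western mass" has to be matched against free text.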
Comparing Text
Comparing the query text to the document text and determining what is a good match is the core issue of information retrieval
Exact matching of words is not enough
There are many different ways to write the same thing in a natural language like English
e.g., does a news story containing the text "bank director in Amherst steals funds" match the query?
Some stories will be better matches than others
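A small sketch of why exact matching falls short, using the query and the news story text from the slides. The helper names (`tokenize`, `exact_match`, `shared_terms`) and the whitespace tokenization are assumptions made for this illustration.

```python
def tokenize(text):
    # Simplistic assumption: lowercase and split on whitespace.
    return text.lower().split()


def exact_match(query, document):
    """True only if the whole query string appears in the document."""
    return query.lower() in document.lower()


def shared_terms(query, document):
    """Words that occur in both the query and the document."""
    return sorted(set(tokenize(query)) & set(tokenize(document)))


query = "bank scandals in western mass"
story = "bank director in Amherst steals funds"

print(exact_match(query, story))
print(shared_terms(query, story))
```

The story is plausibly about the right topic, yet the only shared words are "bank" and "in". Nothing at the word level connects "scandals" with "steals funds", or "western mass" with "Amherst".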
Dimensions of IR
IR is more than just text, and more than just web search
although these are central
Other Media
New applications increasingly involve new media
e.g., video, photos, music, speech
Like text, content is difficult to describe and compare
text may be used to represent them (e.g., tags)
IR approaches to search and evaluation are appropriate
Dimensions of IR
Content          Applications         Tasks
Text             Web search           Ad hoc search
Images           Vertical search      Filtering
Video            Enterprise search    Classification
Scanned docs     Desktop search       Question answering
Audio            Forum search
Music            P2P search
                 Literature search
IR Tasks
Ad hoc search
Find relevant documents for an arbitrary text query
Filtering
Identify relevant user profiles for a new document
Classification
Identify relevant labels for documents
Question answering
Give a specific answer to a question
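Filtering reverses the direction of ad hoc search: the profiles are stored and the document is what arrives. The sketch below is a toy under stated assumptions: each user profile is a hypothetical set of interest terms, and a profile counts as relevant when it shares at least `min_overlap` words with the new document.

```python
# Hypothetical stored user profiles, each a set of interest terms.
profiles = {
    "alice": {"bank", "fraud", "scandal"},
    "bob": {"football", "scores"},
}


def matching_profiles(document, profiles, min_overlap=1):
    """Return the users whose profile shares enough words with the document."""
    words = set(document.lower().split())
    return sorted(
        user for user, terms in profiles.items()
        if len(words & terms) >= min_overlap
    )


new_document = "Bank fraud investigation widens"
print(matching_profiles(new_document, profiles))
```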
Big Issues in IR
Relevance
What is it?
Simple (and simplistic) definition: A relevant document contains the information that a person was looking for when they submitted a query to the search engine
Many factors influence a person's decision about what is relevant: e.g., task, context, novelty, style
Topical relevance (same topic) vs. user relevance (everything else)
Big Issues in IR
Relevance
Retrieval models define a view of relevance
Ranking algorithms used in search engines are based on retrieval models
Most models describe statistical properties of text rather than linguistic properties
i.e., counting simple text features such as words instead of parsing and analyzing the sentences
Statistical approach to text processing started with Luhn in the 50s
Linguistic features can be part of a statistical model
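To make "counting simple text features" concrete, here is a minimal ranking sketch that scores each document by how often the query words occur in it. It is not any specific retrieval model from the course: the scoring function, the document ids, and the sample texts are all assumptions chosen for illustration.

```python
from collections import Counter


def term_counts(text):
    # Word counts are the only text feature used here.
    return Counter(text.lower().split())


def score(query, document):
    """Sum of document occurrences of each distinct query word."""
    counts = term_counts(document)
    return sum(counts[term] for term in set(query.lower().split()))


def rank(query, documents):
    """Ids of documents with a non-zero score, best first (ties by id)."""
    scored = [(score(query, text), doc_id) for doc_id, text in documents.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ordered if s > 0]


# Invented sample collection.
documents = {
    "d1": "bank scandal rocks western mass bank",
    "d2": "local bakery wins award",
    "d3": "western mass weather report",
}

print(rank("bank scandals in western mass", documents))
```

Note that "scandals" in the query earns no credit for "scandal" in d1, and d3 is ranked despite being off topic. Pure word counting involves no parsing or analysis of meaning.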
Big Issues in IR
Evaluation
Experimental procedures and measures for comparing system output with user expectations
Originated in Cranfield experiments in the 60s
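The slide does not name particular measures. As an assumption, the sketch below uses precision and recall, two standard effectiveness measures, to compare a system's output with a set of documents judged relevant. The document ids and judgments are invented.

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


system_output = ["d1", "d3", "d7", "d9"]   # what the system returned
judged_relevant = ["d1", "d2", "d3"]       # what a person judged relevant

p, r = precision_recall(system_output, judged_relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```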
Big Issues in IR
Users and Information Needs
Search evaluation is user-centered
Keyword queries are often poor descriptions of actual information needs
Interaction and context are important for understanding user intent
Query refinement techniques such as query expansion, query suggestion, and relevance feedback improve ranking
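Query expansion, one of the refinement techniques listed above, can be sketched as adding related words to the user's query. The `RELATED_TERMS` table is a hand-written, hypothetical stand-in for whatever source of related words a real system would use.

```python
# Hypothetical table of related terms, hand-written for this example.
RELATED_TERMS = {
    "mass": ["massachusetts"],
    "scandals": ["scandal", "fraud"],
}


def expand_query(query, related=RELATED_TERMS):
    """Return the query words plus their related terms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + related.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("bank scandals in western mass"))
```

With the expanded query, a story that mentions "fraud" or "Massachusetts" can match even though the user typed neither word.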
IR and Search Engines
A search engine is the practical application of information retrieval techniques to large-scale text collections
Web search engines are the best-known examples, but there are many others
Open source search engines are important for research and development
e.g., Lucene, Lemur/Indri, Galago
Big issues include the main IR issues but also some others
IR and Search Engines
Information Retrieval                   Search Engines
Relevance: effective ranking            Performance: efficient search and indexing
Evaluation: testing and measuring       Incorporating new data: coverage and freshness
Information needs: user interaction     Scalability: growing with data and users
                                        Adaptability: tuning for applications
                                        Specific problems: e.g., spam
Search Engine Issues
Performance
Measuring and improving the efficiency of search
e.g., reducing response time, increasing query throughput, increasing indexing speed
Indexes are data structures designed to improve search efficiency
designing and implementing them are major issues for search engines
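One common index structure is the inverted index, which maps each word to the set of documents containing it, so a query only touches the entries for its own words instead of scanning every document. The slide does not say which structure is meant. The version below is a minimal in-memory sketch, with invented documents and hypothetical function names.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query word."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Invented sample collection.
docs = {
    "d1": "bank director steals funds",
    "d2": "bank opens new branch",
    "d3": "director of local museum retires",
}

index = build_index(docs)
print(sorted(search(index, "bank director")))
```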
Search Engine Issues
Dynamic data
The collection for most real applications is constantly changing in terms of updates, additions, deletions
e.g., web pages
Acquiring or crawling the documents is a major task
Typical measures are coverage (how much has been indexed) and freshness (how recently was it indexed)
Updating the indexes while processing queries is also a design issue
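Coverage and freshness can each be reduced to a number in more than one way. The sketch below picks one simple reading of each as an assumption: coverage is the fraction of known documents that are in the index, and freshness is reported as the mean time since documents were last indexed. The URLs and timestamps (in hours) are invented.

```python
def coverage(known_urls, indexed_at):
    """Fraction of known documents that have been indexed."""
    known = set(known_urls)
    if not known:
        return 0.0
    return len(known & set(indexed_at)) / len(known)


def mean_index_age(indexed_at, now):
    """Average time since each indexed document was last indexed."""
    if not indexed_at:
        return 0.0
    return sum(now - t for t in indexed_at.values()) / len(indexed_at)


known_urls = ["a.example/1", "a.example/2", "b.example/1", "b.example/2"]
# Time (in hours) at which each document was last indexed.
indexed_at = {"a.example/1": 100, "a.example/2": 160, "b.example/1": 190}
now = 200

print(coverage(known_urls, indexed_at))
print(mean_index_age(indexed_at, now))
```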
Search Engine Issues
Scalability
Making everything work with millions of users every day, and many terabytes of documents
Distributed processing is essential
Adaptability
Changing and tuning search engine components such as ranking algorithm, indexing strategy, and interface for different applications
Spam
For Web search, spam in all its forms is one of the major issues
Affects the efficiency of search engines and, more seriously, the effectiveness of the results
Many types of spam
e.g., spamdexing or term spam, link spam, optimization
Course Goals
To help you to understand search engines, evaluate and compare them, and modify them for specific applications
Provide broad coverage of the important issues in information retrieval and search engines
includes underlying models and current research directions