
Search Engines

Information Retrieval in Practice

All slides ©Addison Wesley, 2008

Processing Text
Converting documents to index terms. Why?
Matching the exact string of characters typed by the user is too restrictive
i.e., it doesn't work very well in terms of effectiveness

Not all words are of equal value in a search
Sometimes it is not clear where words begin and end
Not even clear what a word is in some languages
e.g., Chinese, Korean

Text Statistics
Huge variety of words used in text, but many statistical characteristics of word occurrences are predictable
e.g., distribution of word counts

Retrieval models and ranking algorithms depend heavily on statistical properties of words
e.g., important words occur often in documents but are not high frequency in the collection

Zipf's Law
Distribution of word frequencies is very skewed
a few words occur very often, many words hardly ever occur
e.g., the two most common words ("the", "of") make up about 10% of all word occurrences in text documents

Zipf's law:
observation that rank (r) of a word times its frequency (f) is approximately a constant (k)
assuming words are ranked in order of decreasing frequency

i.e., r · f ≈ k, or r · Pr ≈ c, where Pr is the probability of word occurrence and c ≈ 0.1 for English
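A minimal sketch of checking Zipf's law empirically in Python (corpus.txt is a hypothetical placeholder for any plain-text collection): rank words by decreasing frequency and print r · Pr at a few ranks, which should hover near 0.1 for English text.

    from collections import Counter
    import re

    # Count word frequencies in a plain-text corpus (corpus.txt is hypothetical).
    text = open("corpus.txt", encoding="utf-8").read().lower()
    counts = Counter(re.findall(r"[a-z0-9]+", text))
    total = sum(counts.values())

    # Zipf's law predicts r * Pr is roughly constant (~0.1 for English).
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        if rank in (1, 10, 100, 1000):
            print(rank, word, rank * freq / total)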

Zipf's Law

News Collection (AP89) Statistics:
Total documents: 84,678
Total word occurrences: 39,749,179
Vocabulary size: 198,763
Words occurring > 1000 times: 4,169
Words occurring once: 70,064

Word         Freq.   r         Pr (%)        r · Pr
assistant    5,095   1,021     .013          0.13
sewers       100     17,110    2.56 × 10⁻⁴   0.04
toothbrush   10      51,555    2.56 × 10⁻⁵   0.01
hazmat       1       166,945   2.56 × 10⁻⁶   0.04

Top 50 Words from AP89

Zipf's Law for AP89

Note problems at high and low frequencies

Zipf's Law
What is the proportion of words with a given frequency?
A word that occurs n times has rank r_n = k/n
The number of words with frequency n is
r_n − r_{n+1} = k/n − k/(n + 1) = k/(n(n + 1))

The proportion is found by dividing by the total number of words = the highest rank = k (the word with frequency 1 has rank k/1 = k)
So, the proportion with frequency n is 1/(n(n + 1))
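A quick check of this result (illustrative arithmetic, not from the slides): the predicted proportions fall off as 1/(n(n + 1)), so about half of a vocabulary should consist of words that occur exactly once.

    # Predicted proportion of vocabulary occurring exactly n times: 1/(n(n+1)).
    for n in range(1, 6):
        print(n, 1 / (n * (n + 1)))   # 0.5, 0.167, 0.083, 0.05, 0.033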

Zipf's Law
Example word frequency ranking

To compute the number of words with frequency 5,099:
rank of "chemical" minus the rank of "summit": 1006 − 1002 = 4

Example

Proportions of words occurring n times in 336,310 TREC documents
Vocabulary size is 508,209

Vocabulary Growth
As corpus grows, so does vocabulary size
Fewer new words when corpus is already large

Observed relationship (Heaps' Law): v = k · n^β

where v is vocabulary size (number of unique words), n is the number of words in the corpus, and k, β are parameters that vary for each corpus (typical values given are 10 ≤ k ≤ 100 and β ≈ 0.5)
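A minimal sketch of a Heaps' law prediction. The parameter values below are assumptions for illustration (roughly an AP89-style fit; k and β must be estimated for each corpus):

    # Heaps' law: vocabulary size v = k * n**beta for a corpus of n words.
    k, beta = 62.95, 0.455   # assumed AP89-style fit; refit for your own corpus

    def predicted_vocabulary(n_words: int) -> float:
        return k * n_words ** beta

    print(round(predicted_vocabulary(10_879_522)))  # ~100,000, close to the slide's 100,151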

AP89 Example

Heaps' Law Predictions
Predictions for TREC collections are accurate for large numbers of words
e.g., first 10,879,522 words of the AP89 collection scanned:
prediction is 100,151 unique words
actual number is 100,024

Predictions for small numbers of words (i.e., < 1000) are much worse

GOV2 (Web) Example

Web Example
Heaps' Law works with very large corpora
new words occurring even after seeing 30 million!
parameter values different than typical TREC values

New words come from a variety of sources
spelling errors, invented words (e.g., product and company names), code, other languages, email addresses, etc.

Search engines must deal with these large and growing vocabularies

Estimating Result Set Size

How many pages contain all of the query terms?
For the query "a b c":

f_abc = N · (f_a/N) · (f_b/N) · (f_c/N) = (f_a · f_b · f_c)/N²

Assuming that terms occur independently
f_abc is the estimated size of the result set
f_a, f_b, f_c are the number of documents that terms a, b, and c occur in
N is the number of documents in the collection
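A minimal sketch of this independence estimate, using numbers that appear on later slides (the GOV2 collection size and the document frequencies of "lincoln" and "tropical"):

    # Independence estimate: f = N * (f_a/N) * (f_b/N) * ... for any number of terms.
    def estimate_result_size(doc_freqs, n_docs):
        est = n_docs
        for f in doc_freqs:
            est *= f / n_docs
        return est

    N = 25_205_179                                             # GOV2 collection size
    print(round(estimate_result_size([771_326, 120_990], N)))  # ~3,702
    # The slides give the true co-occurrence count for lincoln AND tropical: 3,018.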

GOV2 Example

Collection size (N) is 25,205,179

Result Set Size Estimation
Poor estimates because words are not independent
Better estimates possible if co-occurrence information available
P(a ∧ b ∧ c) = P(a ∧ b) · P(c | a ∧ b)

f_tropical∧fish∧aquarium = f_tropical∧aquarium · f_fish∧aquarium / f_aquarium = (1,921 · 9,722)/26,480 = 705
f_tropical∧fish∧breeding = f_tropical∧breeding · f_fish∧breeding / f_breeding = (5,510 · 36,427)/81,885 = 2,451
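The same calculation as a sketch, with the GOV2 counts from this slide:

    # Co-occurrence estimate: f_abc ~= f_ab * f_cb / f_b, i.e.,
    # P(a ^ b ^ c) = P(a ^ b) * P(c | a ^ b), approximating P(c | a ^ b) by P(c | b).
    def cooccurrence_estimate(f_ab, f_cb, f_b):
        return f_ab * f_cb / f_b

    print(round(cooccurrence_estimate(1_921, 9_722, 26_480)))    # 705   (tropical fish aquarium)
    print(round(cooccurrence_estimate(5_510, 36_427, 81_885)))   # 2,451 (tropical fish breeding)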

Result Set Estimation
Even better estimates using the initial result set
Estimate is simply C/s
where s is the proportion of the total documents that have been ranked, and C is the number of documents found that contain all the query words

E.g., "tropical fish aquarium" in GOV2:
after processing 3,000 out of the 26,480 documents that contain "aquarium", C = 258
f_tropical∧fish∧aquarium = 258/(3,000/26,480) = 2,277
After processing 20% of the documents, f_tropical∧fish∧aquarium = 1,778 (1,529 is the real value)
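A sketch of the same estimate:

    # Estimate from a partial scan: f ~= C / s, where s is the fraction of the
    # candidate documents ranked so far and C is the number of matches found.
    def partial_scan_estimate(matches_found, docs_processed, docs_total):
        return matches_found / (docs_processed / docs_total)

    print(round(partial_scan_estimate(258, 3_000, 26_480)))   # ~2,277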

Estimating Collection Size
Important issue for Web search engines
Simple technique: use independence model
Given two words a and b that are independent:
f_ab/N = (f_a/N) · (f_b/N), so N = (f_a · f_b)/f_ab

e.g., for GOV2:
f_lincoln = 771,326; f_tropical = 120,990; f_lincoln∧tropical = 3,018
N = (120,990 × 771,326)/3,018 = 30,922,045
(actual number is 25,205,179)
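And the corresponding sketch:

    # Collection size estimate from two words assumed independent:
    # f_ab/N = (f_a/N) * (f_b/N)  =>  N = f_a * f_b / f_ab
    def estimate_collection_size(f_a, f_b, f_ab):
        return f_a * f_b / f_ab

    print(round(estimate_collection_size(120_990, 771_326, 3_018)))  # ~30,922,045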

Tokenizing
Forming words from a sequence of characters
Surprisingly complex in English, can be harder in other languages
Early IR systems:
any sequence of alphanumeric characters of length 3 or more
terminated by a space or other special character
uppercase changed to lowercase

Tokenizing
Example:
"Bigcorp's 2007 bi-annual report showed profits rose 10%." becomes
"bigcorp 2007 annual report showed profits rose"

Too simple for search applications or even large-scale experiments
Why? Too much information lost

Small decisions in tokenizing can have a major impact on the effectiveness of some queries

Tokenizing Problems
Small words can be important in some queries, usually in combinations
xp, ma, pm, ben e king, el paso, master p, gm, j lo, world war II

Both hyphenated and non-hyphenated forms of many words are common
Sometimes the hyphen is not needed
e-bay, wal-mart, active-x, cd-rom, t-shirts

At other times, hyphens should be considered either as part of the word or a word separator
winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking

Tokenizing Problems
Special characters are an important part of tags, URLs, code in documents
Capitalized words can have different meaning from lowercase words
Bush, Apple

Apostrophes can be a part of a word, a part of a possessive, or just a mistake
rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's

Tokenizing Problems
Numbers can be important, including decimals
nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358

Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
I.B.M., Ph.D., cs.umass.edu, F.E.A.R.

Note: tokenizing steps for queries must be identical to steps for documents

Tokenizing Process
First step is to use a parser to identify appropriate parts of the document to tokenize
Defer complex decisions to other components
word is any sequence of alphanumeric characters, terminated by a space or special character, with everything converted to lowercase
everything indexed
example: "92.3" → "92 3", but search finds documents with 92 and 3 adjacent
incorporate some rules to reduce dependence on query transformation components

Tokenizing Process
Not that different than the simple tokenizing process used in the past
Examples of rules used with TREC:
Apostrophes in words ignored
o'connor → oconnor, bob's → bobs

Periods in abbreviations ignored
I.B.M. → ibm, Ph.D. → phd
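A minimal sketch of a tokenizer implementing these rules (not the actual TREC code):

    import re

    def tokenize(text: str) -> list[str]:
        text = text.replace("'", "")                             # o'connor -> oconnor, bob's -> bobs
        text = re.sub(r"(?<=[A-Za-z])\.(?=[A-Za-z])", "", text)  # I.B.M. -> IBM, Ph.D. -> PhD
        return re.findall(r"[a-z0-9]+", text.lower())            # lowercase, split on everything else

    print(tokenize("I.B.M. filed Bob's 92.3 report."))
    # ['ibm', 'filed', 'bobs', '92', '3', 'report']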

Stopping
Function words (determiners, prepositions) have little meaning on their own
High occurrence frequencies
Treated as stopwords (i.e., removed)
reduce index space, improve response time, improve effectiveness

Can be important in combinations
e.g., "to be or not to be"

Stopping
Stopword list can be created from high-frequency words or based on a standard list
Lists are customized for applications, domains, and even parts of documents
e.g., "click" is a good stopword for anchor text

Best policy is to index all words in documents, make decisions about which words to use at query time

Stemming
Many morphological variations of words
inflectional (plurals, tenses)
derivational (making verbs nouns, etc.)

In most cases, these have the same or very similar meanings
Stemmers attempt to reduce morphological variations of words to a common stem
usually involves removing suffixes

Can be done at indexing time or as part of query processing (like stopwords)

Stemming
Generally a small but significant effectiveness improvement
can be crucial for some languages
e.g., 5-10% improvement for English, up to 50% in Arabic

Words with the Arabic root ktb

Stemming
Two basic types:
Dictionary-based: uses lists of related words
Algorithmic: uses a program to determine related words

Algorithmic stemmers
suffix-s: remove endings assuming plural
e.g., cats → cat, lakes → lake, wiis → wii
Many false negatives: supplies → supplie
Some false positives: ups → up
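The suffix-s stemmer in full (a sketch; its simplicity is exactly what produces the errors above):

    # Suffix-s stemming: strip a trailing 's', assuming it marks a plural.
    def suffix_s_stem(word: str) -> str:
        return word[:-1] if word.endswith("s") else word

    print([suffix_s_stem(w) for w in ["cats", "lakes", "wiis", "supplies", "ups"]])
    # ['cat', 'lake', 'wii', 'supplie', 'up'] -- note the last two errors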

Porter Stemmer
Algorithmic stemmer used in IR experiments since the 70s
Consists of a series of rules designed to strip the longest possible suffix at each step
Effective in TREC
Produces stems, not words
Makes a number of errors and is difficult to modify
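For experiments, an off-the-shelf implementation is the usual choice; a sketch using NLTK's Porter stemmer (assumes the nltk package is installed):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in ["caresses", "ponies", "ties", "cats"]])
    # ['caress', 'poni', 'ti', 'cat'] -- outputs like 'poni' are stems, not words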

Porter Stemmer
Example step (1 of 5)

Porter Stemmer

The Porter2 stemmer addresses some of these issues
The approach has been used with other languages

Krovetz Stemmer
Hybrid algorithmic-dictionary stemmer
Word checked in dictionary
If present, either left alone or replaced with an exception entry
If not present, word is checked for suffixes that could be removed
After removal, dictionary is checked again

Produces words, not stems
Comparable effectiveness
Lower false positive rate, somewhat higher false negative rate
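A minimal sketch of this dictionary-check-then-strip idea (a toy, not the actual Krovetz algorithm; the dictionary and exception entries are invented):

    # Consult a dictionary before and after cautious suffix removal.
    DICTIONARY = {"lake", "cat", "calf"}       # toy stand-in for a real lexicon
    EXCEPTIONS = {"calves": "calf"}            # irregular forms mapped directly

    def krovetz_like_stem(word: str) -> str:
        if word in EXCEPTIONS:
            return EXCEPTIONS[word]
        if word in DICTIONARY:                 # already a word: leave it alone
            return word
        for suffix in ("es", "s"):             # cautious suffix rules
            if word.endswith(suffix):
                candidate = word[: -len(suffix)]
                if candidate in DICTIONARY:    # only accept words, not stems
                    return candidate
        return word

    print([krovetz_like_stem(w) for w in ["lakes", "calves", "supplies"]])
    # ['lake', 'calf', 'supplies'] -- unknown words are left untouched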

Stemmer Comparison

Phrases
Many queries are 2-3 word phrases
Phrases are:
More precise than single words
e.g., documents containing "black sea" vs. the two words "black" and "sea"

Less ambiguous
e.g., "big apple" vs. "apple"

Can be difficult for ranking
e.g., given the query "fishing supplies", how do we score documents with:
the exact phrase many times, the exact phrase just once, the individual words in the same sentence, same paragraph, whole document, variations on the words?

Phrases
Text processing issue: how are phrases recognized?
Three possible approaches:
Identify syntactic phrases using a part-of-speech (POS) tagger
Use word n-grams
Store word positions in indexes and use proximity operators in queries

POS Tagging
POS taggers use statistical models of text to predict syntactic tags of words
Example tags:
NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., "and", "or"), PRP (pronoun), and MD (modal auxiliary, e.g., "can", "will")

Phrases can then be defined as simple noun groups, for example (see the sketch below)
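A minimal sketch of POS tagging with NLTK (assumes nltk plus its tokenizer and tagger models are downloaded; the sentence and expected tags are illustrative):

    import nltk

    tokens = nltk.word_tokenize("Document will describe marketing strategies")
    print(nltk.pos_tag(tokens))
    # expected roughly: [('Document', 'NN'), ('will', 'MD'), ('describe', 'VB'),
    #                    ('marketing', 'NN'), ('strategies', 'NNS')]

Noun groups can then be read off as maximal runs of NN/NNS tags.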

POS Tagging Example

Example Noun Phrases

Word N-Grams
POS tagging too slow for large collections
Simpler definition: a phrase is any sequence of n words, known as n-grams
bigram: 2 word sequence, trigram: 3 word sequence, unigram: single words
N-grams also used at the character level for applications such as OCR

N-grams typically formed from overlapping sequences of words
i.e., move an n-word window one word at a time through the document

N-Grams
Frequent n-grams are more likely to be meaningful phrases
N-grams form a Zipf distribution
Better fit than words alone

Could index all n-grams up to a specified length
Much faster than POS tagging
Uses a lot of storage
e.g., a document containing 1,000 words would contain 3,990 instances of word n-grams of length 2 ≤ n ≤ 5
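A sketch of overlapping n-gram extraction, with a check of the storage figure above:

    # Slide an n-word window one word at a time to form overlapping n-grams.
    def word_ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    print(word_ngrams(["to", "be", "or", "not", "to", "be"], 2))
    # [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]

    # A 1,000-word document yields 999 + 998 + 997 + 996 = 3,990 n-grams for 2 <= n <= 5.
    print(sum(len(word_ngrams(["w"] * 1000, n)) for n in range(2, 6)))   # 3990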

Google N-Grams
Web search engines index n-grams
Google sample:

Most frequent trigram in English is "all rights reserved"
In Chinese, "limited liability corporation"

Document Structure and Markup
Some parts of documents are more important than others
Document parser recognizes structure using markup, such as HTML tags
Headers, anchor text, bolded text all likely to be important
Metadata can also be important
Links used for link analysis

Example Web Page

Example Web Page

Link Analysis
Links are a key component of the Web
Important for navigation, but also for search
e.g., <a href="http://example.com">Example website</a>
"Example website" is the anchor text
"http://example.com" is the destination link
both are used by search engines

Anchor Text
Used as a description of the content of the destination page
i.e., collection of anchor text in all links pointing to a page used as an additional text field

Anchor text tends to be short, descriptive, and similar to query text
Retrieval experiments have shown that anchor text has significant impact on effectiveness for some types of queries
i.e., more than PageRank

PageRank
Billions of web pages, some more informative than others
Links can be viewed as information about the popularity (authority?) of a web page
can be used by ranking algorithm

Inlink count could be used as a simple measure
Link analysis algorithms like PageRank provide more reliable ratings
less susceptible to link spam

Random Surfer Model
Browse the Web using the following algorithm:
Choose a random number r between 0 and 1
If r < λ:
Go to a random page
If r ≥ λ:
Click a link at random on the current page
Start again

PageRank of a page is the probability that the random surfer will be looking at that page
links from popular pages will increase PageRank of pages they point to

Dangling Links
Random jump prevents getting stuck on pages that:
do not have links
contain only links that no longer point to other pages
have links forming a loop

Links that point to the first two types of pages are called dangling links
may also be links to pages that have not yet been crawled

PageRank

PageRank (PR) of page C = PR(A)/2 + PR(B)/1
More generally,

PR(u) = Σ_{v ∈ Bu} PR(v)/Lv

where Bu is the set of pages that point to u, and Lv is the number of outgoing links from page v (not counting duplicate links)

PageRank
Don't know PageRank values at start
Assume equal values (1/3 in this case), then iterate:
first iteration: PR(C) = 0.33/2 + 0.33 = 0.5, PR(A) = 0.33, and PR(B) = 0.17
second: PR(C) = 0.33/2 + 0.17 = 0.33, PR(A) = 0.5, PR(B) = 0.17
third: PR(C) = 0.42, PR(A) = 0.33, PR(B) = 0.25

Converges to PR(C) = 0.4, PR(A) = 0.4, and PR(B) = 0.2

PageRank
Taking the random page jump into account, 1/3 chance of going to any page when r < λ
PR(C) = λ/3 + (1 − λ) · (PR(A)/2 + PR(B)/1)
More generally,

PR(u) = λ/N + (1 − λ) · Σ_{v ∈ Bu} PR(v)/Lv

where N is the number of pages and λ is typically 0.15
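A minimal sketch of the iteration on the three-page example, assuming the link graph implied by the numbers above (A links to B and C, B links to C, C links to A):

    # Iterative PageRank with random jumps: PR(u) = lam/N + (1-lam) * sum(PR(v)/L_v).
    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # assumed graph
    lam = 0.15
    n = len(links)
    pr = {page: 1 / n for page in links}                # start with equal values

    for _ in range(100):                                # iterate until (near) convergence
        new = {page: lam / n for page in links}         # random-jump component
        for src, outs in links.items():
            for dst in outs:
                new[dst] += (1 - lam) * pr[src] / len(outs)
        pr = new

    print({p: round(v, 3) for p, v in pr.items()})
    # With lam = 0 this converges to the slide's values: A = 0.4, B = 0.2, C = 0.4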

A PageRank Implementation
Preliminaries:
1) Extract links from the source text. You'll also want to extract the URL from each document in a separate file. Now you have all the links (source-destination pairs) and all the source documents.
2) Remove all links from the list that do not connect two documents in the corpus. The easiest way to do this is to sort all links by destination, then compare that against the corpus URLs list (also sorted).
3) Create a new file I that contains a (url, pagerank) pair for each URL in the corpus. The initial PageRank value is 1/#D (#D = number of urls).

At this point there are two interesting files:
[L] links (trimmed to contain only corpus links, sorted by source URL)
[I] URL/PageRank pairs, initialized to a constant

A PageRank Implementation
Preliminaries: link extraction from a .corpus file using Galago

DocumentSplit -> IndexReaderSplitParser -> TagTokenizer

    split = new DocumentSplit(filename, filetype, new byte[0], new byte[0])
    index = new IndexReaderSplitParser(split)
    tokenizer = new TagTokenizer()
    tokenizer.setProcessor(NullProcessor(Document.class))
    doc = index.nextDocument()
    tokenizer.process(doc)

doc.identifier contains the file's name
doc.tags now contains all tags
Links can be extracted by finding all tags with name "a"
Links should be processed so that they can be compared with some filename in the corpus

A PageRank Implementation
Iteration steps:
1. Make a new output file, R.
2. Read L and I in parallel (since they're all sorted by URL).
3. For each unique source URL, determine whether it has any outgoing links.
4. If not, add its current PageRank value to the sum T (terminals).
5. If it does have outgoing links, write (source_url, dest_url, Ip/|Q|), where Ip is the current PageRank value, |Q| is the number of outgoing links, and dest_url is a link destination. Do this for all outgoing links. Write this to R.
6. Sort R by destination URL.
7. Scan R and I at the same time. The new value of Rp is:
(1 − lambda)/#D (a fraction of the sum of all pages)
plus: lambda · sum(T)/#D (the total effect from terminal pages)
plus: lambda · all incoming mass from step 5
8. Check for convergence.
9. Write new Rp values to a new I file.

A PageRank Implementation
Convergence check:
The stopping criterion for this type of PR algorithm is typically of the form ||new − old|| < tau, where new and old are the new and old PageRank vectors, respectively.
Tau is set depending on how much precision you need. Reasonable values include 0.1 or 0.01. If you want really fast, but inaccurate, convergence, then you can use something like tau = 1.
The setting of tau also depends on N (= number of documents in the collection), since ||new − old|| (for a fixed numerical precision) increases as N increases, so you can alternatively formulate your convergence criterion as ||new − old||/N < tau.
Either the L1 or L2 norm can be used.
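A sketch of the check with the L1 norm:

    # Stop when the (optionally N-normalized) L1 distance between the old and
    # new PageRank vectors drops below tau.
    def converged(new, old, tau=0.01, normalize=False):
        dist = sum(abs(new[u] - old[u]) for u in new)   # L1 norm; L2 also works
        if normalize:
            dist /= len(new)
        return dist < tau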

Link Quality
Link quality is affected by spam and other factors
e.g., link farms to increase PageRank
trackback links in blogs can create loops
links from the comments section of popular blogs

Blog services modify comment links to contain the rel="nofollow" attribute
e.g., "Come visit my <a rel="nofollow" href="http://www.page.com">web page</a>."

Trackback Links

Information Extraction
Automatically extract structure from text
annotate the document using tags to identify extracted structure

Named entity recognition
identify words that refer to something of interest in a particular application
e.g., people, companies, locations, dates, product names, prices, etc.

Named Entity Recognition

Example showing semantic annotation of text using XML tags
Information extraction also includes document structure and more complex features such as relationships and events

Named Entity Recognition
Rule-based
Uses lexicons (lists of words and phrases) that categorize names
e.g., locations, people's names, organizations, etc.

Rules also used to verify or find new entity names
e.g., "<number> <word> street" for addresses
"<street address>, <city>" or "in <city>" to verify city names
"<street address>, <city>, <state>" to find new cities
"<title> <name>" to find new names

Named Entity Recognition
Rules either developed manually by trial and error or using machine learning techniques
Statistical
uses a probabilistic model of the words in and around an entity
probabilities estimated using training data (manually annotated text)
Hidden Markov Model (HMM) is one approach

HMM for Extraction
Resolve ambiguity in a word using context
e.g., "marathon" is a location or a sporting event; "boston marathon" is a specific sporting event

Model context using a generative model of the sequence of words
Markov property: the next word in a sequence depends only on a small number of the previous words

HMM for Extraction
Markov Model describes a process as a collection of states with transitions between them
each transition has a probability associated with it
next state depends only on current state and transition probabilities

Hidden Markov Model
each state has a set of possible outputs
outputs have probabilities

HMM Sentence Model

Each state is associated with a probability distribution over words (the output)

HMM for Extraction
Could generate sentences with this model
To recognize named entities, find the sequence of labels that gives the highest probability for the sentence
only the outputs (words) are visible or observed
states are hidden
e.g., <start><name><not-an-entity><location><not-an-entity><end>

Viterbi algorithm used for recognition
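A minimal sketch of Viterbi decoding for a tagger like this. All states, words, and probabilities below are toy values invented for illustration:

    # Viterbi: most probable hidden label sequence for an observed word sequence.
    states = ["name", "location", "other"]
    start_p = {"name": 0.3, "location": 0.3, "other": 0.4}
    trans_p = {s: {t: 1 / 3 for t in states} for s in states}   # uniform transitions
    emit_p = {
        "name":     {"boston": 0.1, "marathon": 0.1, "won": 0.0},
        "location": {"boston": 0.6, "marathon": 0.3, "won": 0.0},
        "other":    {"boston": 0.0, "marathon": 0.1, "won": 0.5},
    }

    def viterbi(words):
        # best[i][s] = (probability of best path ending in state s, backpointer)
        best = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
        for w in words[1:]:
            prev = best[-1]
            best.append({
                s: max((prev[p][0] * trans_p[p][s] * emit_p[s].get(w, 0.0), p)
                       for p in states)
                for s in states
            })
        path = [max(states, key=lambda s: best[-1][s][0])]      # best final state
        for row in reversed(best[1:]):                          # follow backpointers
            path.append(row[path[-1]][1])
        return list(reversed(path))

    print(viterbi(["boston", "marathon", "won"]))   # ['location', 'location', 'other']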

Named Entity Recognition
Accurate recognition requires about 1M words of training data (1,500 news stories)
may be more expensive than developing rules for some applications

Both rule-based and statistical approaches can achieve about 90% effectiveness for categories such as names, locations, organizations
others, such as product name, can be much worse

Internationalization
2/3 of the Web is in English
About 50% of Web users do not use English as their primary language
Many (maybe most) search applications have to deal with multiple languages
monolingual search: search in one language, but with many possible languages
cross-language search: search in multiple languages at the same time

Internationalization
Many aspects of search engines are language-neutral
Major differences:
Text encoding (converting to Unicode)
Tokenizing (many languages have no word separators)
Stemming

Cultural differences may also impact interface design and features provided

Chinese Tokenizing
