
Search Engines

Information Retrieval in Practice

All slides ©Addison Wesley, 2008

Processing Text
Converting documents to index terms. Why?
Matching the exact string of characters typed by the user is too restrictive
i.e., it doesn't work very well in terms of effectiveness

Not all words are of equal value in a search
Sometimes it is not clear where words begin and end
Not even clear what a word is in some languages
e.g., Chinese, Korean

Text Statistics
Huge variety of words used in text, but many statistical characteristics of word occurrences are predictable
e.g., distribution of word counts

Retrieval models and ranking algorithms depend heavily on statistical properties of words
e.g., important words occur often in documents but are not high frequency in the collection

Zipf's Law
Distribution of word frequencies is very skewed
a few words occur very often, many words hardly ever occur
e.g., the two most common words ("the", "of") make up about 10% of all word occurrences in text documents

Zipf's law:
observation that rank (r) of a word times its frequency (f) is approximately a constant (k)
assuming words are ranked in order of decreasing frequency

i.e., r · f ≈ k, or r · Pr ≈ c, where Pr is the probability of word occurrence and c ≈ 0.1 for English
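A minimal sketch of checking Zipf's law empirically in Python (corpus.txt is a hypothetical placeholder for any plain-text collection): rank words by decreasing frequency and print r · Pr at a few ranks, which should hover near 0.1 for English text.

    from collections import Counter
    import re

    # Count word frequencies in a plain-text corpus (corpus.txt is hypothetical).
    text = open("corpus.txt", encoding="utf-8").read().lower()
    counts = Counter(re.findall(r"[a-z0-9]+", text))
    total = sum(counts.values())

    # Zipf's law predicts r * Pr is roughly constant (~0.1 for English).
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        if rank in (1, 10, 100, 1000):
            print(rank, word, rank * freq / total)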

Zipf's Law

News Collection (AP89) Statistics:
Total documents: 84,678
Total word occurrences: 39,749,179
Vocabulary size: 198,763
Words occurring > 1000 times: 4,169
Words occurring once: 70,064

Word         Freq.   r         Pr (%)        r · Pr
assistant    5,095   1,021     .013          0.13
sewers       100     17,110    2.56 × 10⁻⁴   0.04
toothbrush   10      51,555    2.56 × 10⁻⁵   0.01
hazmat       1       166,945   2.56 × 10⁻⁶   0.04

Top 50 Words from AP89

Zipf's Law for AP89

Note problems at high and low frequencies

Zipf's Law
What is the proportion of words with a given frequency?
A word that occurs n times has rank r_n = k/n
The number of words with frequency n is
r_n − r_{n+1} = k/n − k/(n + 1) = k/(n(n + 1))

The proportion is found by dividing by the total number of words = the highest rank = k (the word with frequency 1 has rank k/1 = k)
So, the proportion with frequency n is 1/(n(n + 1))
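A quick check of this result (illustrative arithmetic, not from the slides): the predicted proportions fall off as 1/(n(n + 1)), so about half of a vocabulary should consist of words that occur exactly once.

    # Predicted proportion of vocabulary occurring exactly n times: 1/(n(n+1)).
    for n in range(1, 6):
        print(n, 1 / (n * (n + 1)))   # 0.5, 0.167, 0.083, 0.05, 0.033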

Zipf's Law
Example word frequency ranking

To compute the number of words with frequency 5,099:
rank of "chemical" minus the rank of "summit": 1006 − 1002 = 4

Example

Proportions of words occurring n times in 336,310 TREC documents
Vocabulary size is 508,209

Vocabulary Growth
As corpus grows, so does vocabulary size
Fewer new words when corpus is already large

Observed relationship (Heaps' Law): v = k · n^β

where v is vocabulary size (number of unique words), n is the number of words in the corpus, and k, β are parameters that vary for each corpus (typical values given are 10 ≤ k ≤ 100 and β ≈ 0.5)
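A minimal sketch of a Heaps' law prediction. The parameter values below are assumptions for illustration (roughly an AP89-style fit; k and β must be estimated for each corpus):

    # Heaps' law: vocabulary size v = k * n**beta for a corpus of n words.
    k, beta = 62.95, 0.455   # assumed AP89-style fit; refit for your own corpus

    def predicted_vocabulary(n_words: int) -> float:
        return k * n_words ** beta

    print(round(predicted_vocabulary(10_879_522)))  # ~100,000, close to the slide's 100,151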

AP89 Example

Heaps' Law Predictions
Predictions for TREC collections are accurate for large numbers of words
e.g., first 10,879,522 words of the AP89 collection scanned:
prediction is 100,151 unique words
actual number is 100,024

Predictions for small numbers of words (i.e., < 1000) are much worse

GOV2 (Web) Example

Web Example
Heaps' Law works with very large corpora
new words occurring even after seeing 30 million!
parameter values different than typical TREC values

New words come from a variety of sources
spelling errors, invented words (e.g., product and company names), code, other languages, email addresses, etc.

Search engines must deal with these large and growing vocabularies

Estimating Result Set Size

How many pages contain all of the query terms?
For the query "a b c":

f_abc = N · (f_a/N) · (f_b/N) · (f_c/N) = (f_a · f_b · f_c)/N²

Assuming that terms occur independently
f_abc is the estimated size of the result set
f_a, f_b, f_c are the number of documents that terms a, b, and c occur in
N is the number of documents in the collection
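A minimal sketch of this independence estimate, using numbers that appear on later slides (the GOV2 collection size and the document frequencies of "lincoln" and "tropical"):

    # Independence estimate: f = N * (f_a/N) * (f_b/N) * ... for any number of terms.
    def estimate_result_size(doc_freqs, n_docs):
        est = n_docs
        for f in doc_freqs:
            est *= f / n_docs
        return est

    N = 25_205_179                                             # GOV2 collection size
    print(round(estimate_result_size([771_326, 120_990], N)))  # ~3,702
    # The slides give the true co-occurrence count for lincoln AND tropical: 3,018.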

GOV2 Example

Collection size (N) is 25,205,179

Result Set Size Estimation
Poor estimates because words are not independent
Better estimates possible if co-occurrence information available
P(a ∧ b ∧ c) = P(a ∧ b) · P(c | a ∧ b)

f_tropical∧fish∧aquarium = f_tropical∧aquarium · f_fish∧aquarium / f_aquarium = (1,921 · 9,722)/26,480 = 705
f_tropical∧fish∧breeding = f_tropical∧breeding · f_fish∧breeding / f_breeding = (5,510 · 36,427)/81,885 = 2,451
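The same calculation as a sketch, with the GOV2 counts from this slide:

    # Co-occurrence estimate: f_abc ~= f_ab * f_cb / f_b, i.e.,
    # P(a ^ b ^ c) = P(a ^ b) * P(c | a ^ b), approximating P(c | a ^ b) by P(c | b).
    def cooccurrence_estimate(f_ab, f_cb, f_b):
        return f_ab * f_cb / f_b

    print(round(cooccurrence_estimate(1_921, 9_722, 26_480)))    # 705   (tropical fish aquarium)
    print(round(cooccurrence_estimate(5_510, 36_427, 81_885)))   # 2,451 (tropical fish breeding)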

Result Set Estimation
Even better estimates using the initial result set
Estimate is simply C/s
where s is the proportion of the total documents that have been ranked, and C is the number of documents found that contain all the query words

E.g., "tropical fish aquarium" in GOV2:
after processing 3,000 out of the 26,480 documents that contain "aquarium", C = 258
f_tropical∧fish∧aquarium = 258/(3,000/26,480) = 2,277
After processing 20% of the documents, f_tropical∧fish∧aquarium = 1,778 (1,529 is the real value)
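A sketch of the same estimate:

    # Estimate from a partial scan: f ~= C / s, where s is the fraction of the
    # candidate documents ranked so far and C is the number of matches found.
    def partial_scan_estimate(matches_found, docs_processed, docs_total):
        return matches_found / (docs_processed / docs_total)

    print(round(partial_scan_estimate(258, 3_000, 26_480)))   # ~2,277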

Estimating Collection Size
Important issue for Web search engines
Simple technique: use independence model
Given two words a and b that are independent:
f_ab/N = (f_a/N) · (f_b/N), so N = (f_a · f_b)/f_ab

e.g., for GOV2:
f_lincoln = 771,326; f_tropical = 120,990; f_lincoln∧tropical = 3,018
N = (120,990 × 771,326)/3,018 = 30,922,045
(actual number is 25,205,179)
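And the corresponding sketch:

    # Collection size estimate from two words assumed independent:
    # f_ab/N = (f_a/N) * (f_b/N)  =>  N = f_a * f_b / f_ab
    def estimate_collection_size(f_a, f_b, f_ab):
        return f_a * f_b / f_ab

    print(round(estimate_collection_size(120_990, 771_326, 3_018)))  # ~30,922,045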

Tokenizing
Forming words from a sequence of characters
Surprisingly complex in English, can be harder in other languages
Early IR systems:
any sequence of alphanumeric characters of length 3 or more
terminated by a space or other special character
uppercase changed to lowercase

Tokenizing
Example:
"Bigcorp's 2007 bi-annual report showed profits rose 10%." becomes
"bigcorp 2007 annual report showed profits rose"

Too simple for search applications or even large-scale experiments
Why? Too much information lost

Small decisions in tokenizing can have a major impact on the effectiveness of some queries

Tokenizing Problems
Small words can be important in some queries, usually in combinations
xp, ma, pm, ben e king, el paso, master p, gm, j lo, world war II

Both hyphenated and non-hyphenated forms of many words are common
Sometimes the hyphen is not needed
e-bay, wal-mart, active-x, cd-rom, t-shirts

At other times, hyphens should be considered either as part of the word or a word separator
winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking

Tokenizing Problems
Special characters are an important part of tags, URLs, code in documents
Capitalized words can have different meaning from lowercase words
Bush, Apple

Apostrophes can be a part of a word, a part of a possessive, or just a mistake
rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's

Tokenizing Problems
Numbers can be important, including decimals
nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358

Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
I.B.M., Ph.D., cs.umass.edu, F.E.A.R.

Note: tokenizing steps for queries must be identical to steps for documents

Tokenizing Process
First step is to use a parser to identify appropriate parts of the document to tokenize
Defer complex decisions to other components
word is any sequence of alphanumeric characters, terminated by a space or special character, with everything converted to lowercase
everything indexed
example: "92.3" → "92 3", but search finds documents with 92 and 3 adjacent
incorporate some rules to reduce dependence on query transformation components

Tokenizing Process
Not that different than the simple tokenizing process used in the past
Examples of rules used with TREC:
Apostrophes in words ignored
o'connor → oconnor, bob's → bobs

Periods in abbreviations ignored
I.B.M. → ibm, Ph.D. → phd
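A minimal sketch of a tokenizer implementing these rules (not the actual TREC code):

    import re

    def tokenize(text: str) -> list[str]:
        text = text.replace("'", "")                             # o'connor -> oconnor, bob's -> bobs
        text = re.sub(r"(?<=[A-Za-z])\.(?=[A-Za-z])", "", text)  # I.B.M. -> IBM, Ph.D. -> PhD
        return re.findall(r"[a-z0-9]+", text.lower())            # lowercase, split on everything else

    print(tokenize("I.B.M. filed Bob's 92.3 report."))
    # ['ibm', 'filed', 'bobs', '92', '3', 'report']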

Stopping
Function words (determiners, prepositions) have little meaning on their own
High occurrence frequencies
Treated as stopwords (i.e., removed)
reduce index space, improve response time, improve effectiveness

Can be important in combinations
e.g., "to be or not to be"

Stopping
Stopword list can be created from high-frequency words or based on a standard list
Lists are customized for applications, domains, and even parts of documents
e.g., "click" is a good stopword for anchor text

Best policy is to index all words in documents, make decisions about which words to use at query time

Stemming
Many morphological variations of words
inflectional (plurals, tenses)
derivational (making verbs nouns, etc.)

In most cases, these have the same or very similar meanings
Stemmers attempt to reduce morphological variations of words to a common stem
usually involves removing suffixes

Can be done at indexing time or as part of query processing (like stopwords)

Stemming
Generally a small but significant effectiveness improvement
can be crucial for some languages
e.g., 5-10% improvement for English, up to 50% in Arabic

Words with the Arabic root ktb

Stemming
Two basic types:
Dictionary-based: uses lists of related words
Algorithmic: uses a program to determine related words

Algorithmic stemmers
suffix-s: remove endings assuming plural
e.g., cats → cat, lakes → lake, wiis → wii
Many false negatives: supplies → supplie
Some false positives: ups → up
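The suffix-s stemmer in full (a sketch; its simplicity is exactly what produces the errors above):

    # Suffix-s stemming: strip a trailing 's', assuming it marks a plural.
    def suffix_s_stem(word: str) -> str:
        return word[:-1] if word.endswith("s") else word

    print([suffix_s_stem(w) for w in ["cats", "lakes", "wiis", "supplies", "ups"]])
    # ['cat', 'lake', 'wii', 'supplie', 'up'] -- note the last two errors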

Porter Stemmer
Algorithmic stemmer used in IR experiments since the 70s
Consists of a series of rules designed to strip the longest possible suffix at each step
Effective in TREC
Produces stems, not words
Makes a number of errors and is difficult to modify
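For experiments, an off-the-shelf implementation is the usual choice; a sketch using NLTK's Porter stemmer (assumes the nltk package is installed):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in ["caresses", "ponies", "ties", "cats"]])
    # ['caress', 'poni', 'ti', 'cat'] -- outputs like 'poni' are stems, not words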

Porter Stemmer
Example step (1 of 5)

Porter Stemmer

The Porter2 stemmer addresses some of these issues
The approach has been used with other languages

Krovetz Stemmer
Hybrid algorithmic-dictionary stemmer
Word checked in dictionary
If present, either left alone or replaced with an exception entry
If not present, word is checked for suffixes that could be removed
After removal, dictionary is checked again

Produces words, not stems
Comparable effectiveness
Lower false positive rate, somewhat higher false negative rate
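A minimal sketch of this dictionary-check-then-strip idea (a toy, not the actual Krovetz algorithm; the dictionary and exception entries are invented):

    # Consult a dictionary before and after cautious suffix removal.
    DICTIONARY = {"lake", "cat", "calf"}       # toy stand-in for a real lexicon
    EXCEPTIONS = {"calves": "calf"}            # irregular forms mapped directly

    def krovetz_like_stem(word: str) -> str:
        if word in EXCEPTIONS:
            return EXCEPTIONS[word]
        if word in DICTIONARY:                 # already a word: leave it alone
            return word
        for suffix in ("es", "s"):             # cautious suffix rules
            if word.endswith(suffix):
                candidate = word[: -len(suffix)]
                if candidate in DICTIONARY:    # only accept words, not stems
                    return candidate
        return word

    print([krovetz_like_stem(w) for w in ["lakes", "calves", "supplies"]])
    # ['lake', 'calf', 'supplies'] -- unknown words are left untouched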

Stemmer Comparison

Phrases
Many queries are 2-3 word phrases
Phrases are:
More precise than single words
e.g., documents containing "black sea" vs. the two words "black" and "sea"

Less ambiguous
e.g., "big apple" vs. "apple"

Can be difficult for ranking
e.g., given the query "fishing supplies", how do we score documents with:
the exact phrase many times, the exact phrase just once, the individual words in the same sentence, same paragraph, whole document, variations on the words?

Phrases
Text processing issue: how are phrases recognized?
Three possible approaches:
Identify syntactic phrases using a part-of-speech (POS) tagger
Use word n-grams
Store word positions in indexes and use proximity operators in queries

POS Tagging
POS taggers use statistical models of text to predict syntactic tags of words
Example tags:
NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., "and", "or"), PRP (pronoun), and MD (modal auxiliary, e.g., "can", "will")

Phrases can then be defined as simple noun groups, for example (see the sketch below)
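A minimal sketch of POS tagging with NLTK (assumes nltk plus its tokenizer and tagger models are downloaded; the sentence and expected tags are illustrative):

    import nltk

    tokens = nltk.word_tokenize("Document will describe marketing strategies")
    print(nltk.pos_tag(tokens))
    # expected roughly: [('Document', 'NN'), ('will', 'MD'), ('describe', 'VB'),
    #                    ('marketing', 'NN'), ('strategies', 'NNS')]

Noun groups can then be read off as maximal runs of NN/NNS tags.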

POS Tagging Example

Example Noun Phrases

Word N-Grams
POS tagging too slow for large collections
Simpler definition: a phrase is any sequence of n words, known as n-grams
bigram: 2 word sequence, trigram: 3 word sequence, unigram: single words
N-grams also used at the character level for applications such as OCR

N-grams typically formed from overlapping sequences of words
i.e., move an n-word window one word at a time through the document

N-Grams
Frequent n-grams are more likely to be meaningful phrases
N-grams form a Zipf distribution
Better fit than words alone

Could index all n-grams up to a specified length
Much faster than POS tagging
Uses a lot of storage
e.g., a document containing 1,000 words would contain 3,990 instances of word n-grams of length 2 ≤ n ≤ 5
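A sketch of overlapping n-gram extraction, with a check of the storage figure above:

    # Slide an n-word window one word at a time to form overlapping n-grams.
    def word_ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    print(word_ngrams(["to", "be", "or", "not", "to", "be"], 2))
    # [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]

    # A 1,000-word document yields 999 + 998 + 997 + 996 = 3,990 n-grams for 2 <= n <= 5.
    print(sum(len(word_ngrams(["w"] * 1000, n)) for n in range(2, 6)))   # 3990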

Google N-Grams
Web search engines index n-grams
Google sample:

Most frequent trigram in English is "all rights reserved"
In Chinese, "limited liability corporation"

Document Structure and Markup
Some parts of documents are more important than others
Document parser recognizes structure using markup, such as HTML tags
Headers, anchor text, bolded text all likely to be important
Metadata can also be important
Links used for link analysis

Example Web Page

Example Web Page

Link Analysis
Links are a key component of the Web
Important for navigation, but also for search
e.g., <a href="http://example.com">Example website</a>
"Example website" is the anchor text
"http://example.com" is the destination link
both are used by search engines

Anchor Text
Used as a description of the content of the destination page
i.e., collection of anchor text in all links pointing to a page used as an additional text field

Anchor text tends to be short, descriptive, and similar to query text
Retrieval experiments have shown that anchor text has significant impact on effectiveness for some types of queries
i.e., more than PageRank

PageRank
Billions of web pages, some more informative than others
Links can be viewed as information about the popularity (authority?) of a web page
can be used by ranking algorithm

Inlink count could be used as a simple measure
Link analysis algorithms like PageRank provide more reliable ratings
less susceptible to link spam

Random Surfer Model
Browse the Web using the following algorithm:
Choose a random number r between 0 and 1
If r < λ:
Go to a random page
If r ≥ λ:
Click a link at random on the current page
Start again

PageRank of a page is the probability that the random surfer will be looking at that page
links from popular pages will increase PageRank of pages they point to

Dangling Links
Random jump prevents getting stuck on pages that:
do not have links
contain only links that no longer point to other pages
have links forming a loop

Links that point to the first two types of pages are called dangling links
may also be links to pages that have not yet been crawled

PageRank

PageRank (PR) of page C = PR(A)/2 + PR(B)/1
More generally,

PR(u) = Σ_{v ∈ Bu} PR(v)/Lv

where Bu is the set of pages that point to u, and Lv is the number of outgoing links from page v (not counting duplicate links)

PageRank
Don't know PageRank values at start
Assume equal values (1/3 in this case), then iterate:
first iteration: PR(C) = 0.33/2 + 0.33 = 0.5, PR(A) = 0.33, and PR(B) = 0.17
second: PR(C) = 0.33/2 + 0.17 = 0.33, PR(A) = 0.5, PR(B) = 0.17
third: PR(C) = 0.42, PR(A) = 0.33, PR(B) = 0.25

Converges to PR(C) = 0.4, PR(A) = 0.4, and PR(B) = 0.2

PageRank
Taking the random page jump into account, 1/3 chance of going to any page when r < λ
PR(C) = λ/3 + (1 − λ) · (PR(A)/2 + PR(B)/1)
More generally,

PR(u) = λ/N + (1 − λ) · Σ_{v ∈ Bu} PR(v)/Lv

where N is the number of pages and λ is typically 0.15
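A minimal sketch of the iteration on the three-page example, assuming the link graph implied by the numbers above (A links to B and C, B links to C, C links to A):

    # Iterative PageRank with random jumps: PR(u) = lam/N + (1-lam) * sum(PR(v)/L_v).
    links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # assumed graph
    lam = 0.15
    n = len(links)
    pr = {page: 1 / n for page in links}                # start with equal values

    for _ in range(100):                                # iterate until (near) convergence
        new = {page: lam / n for page in links}         # random-jump component
        for src, outs in links.items():
            for dst in outs:
                new[dst] += (1 - lam) * pr[src] / len(outs)
        pr = new

    print({p: round(v, 3) for p, v in pr.items()})
    # With lam = 0 this converges to the slide's values: A = 0.4, B = 0.2, C = 0.4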

A PageRank Implementation
Preliminaries:
1) Extract links from the source text. You'll also want to extract the URL from each document in a separate file. Now you have all the links (source-destination pairs) and all the source documents.
2) Remove all links from the list that do not connect two documents in the corpus. The easiest way to do this is to sort all links by destination, then compare that against the corpus URLs list (also sorted).
3) Create a new file I that contains a (url, pagerank) pair for each URL in the corpus. The initial PageRank value is 1/#D (#D = number of urls).

At this point there are two interesting files:
[L] links (trimmed to contain only corpus links, sorted by source URL)
[I] URL/PageRank pairs, initialized to a constant

A PageRank Implementation
Preliminaries: link extraction from a .corpus file using Galago

DocumentSplit -> IndexReaderSplitParser -> TagTokenizer

    split = new DocumentSplit(filename, filetype, new byte[0], new byte[0])
    index = new IndexReaderSplitParser(split)
    tokenizer = new TagTokenizer()
    tokenizer.setProcessor(NullProcessor(Document.class))
    doc = index.nextDocument()
    tokenizer.process(doc)

doc.identifier contains the file's name
doc.tags now contains all tags
Links can be extracted by finding all tags with name "a"
Links should be processed so that they can be compared with some filename in the corpus

A PageRank Implementation
Iteration steps:
1. Make a new output file, R.
2. Read L and I in parallel (since they're all sorted by URL).
3. For each unique source URL, determine whether it has any outgoing links.
4. If not, add its current PageRank value to the sum T (terminals).
5. If it does have outgoing links, write (source_url, dest_url, Ip/|Q|), where Ip is the current PageRank value, |Q| is the number of outgoing links, and dest_url is a link destination. Do this for all outgoing links. Write this to R.
6. Sort R by destination URL.
7. Scan R and I at the same time. The new value of Rp is:
(1 − lambda)/#D (a fraction of the sum of all pages)
plus: lambda · sum(T)/#D (the total effect from terminal pages)
plus: lambda · all incoming mass from step 5
8. Check for convergence.
9. Write new Rp values to a new I file.

A PageRank Implementation
Convergence check:
The stopping criterion for this type of PR algorithm is typically of the form ||new − old|| < tau, where new and old are the new and old PageRank vectors, respectively.
Tau is set depending on how much precision you need. Reasonable values include 0.1 or 0.01. If you want really fast, but inaccurate, convergence, then you can use something like tau = 1.
The setting of tau also depends on N (= number of documents in the collection), since ||new − old|| (for a fixed numerical precision) increases as N increases, so you can alternatively formulate your convergence criterion as ||new − old||/N < tau.
Either the L1 or L2 norm can be used.
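A sketch of the check with the L1 norm:

    # Stop when the (optionally N-normalized) L1 distance between the old and
    # new PageRank vectors drops below tau.
    def converged(new, old, tau=0.01, normalize=False):
        dist = sum(abs(new[u] - old[u]) for u in new)   # L1 norm; L2 also works
        if normalize:
            dist /= len(new)
        return dist < tau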

Link Quality
Link quality is affected by spam and other factors
e.g., link farms to increase PageRank
trackback links in blogs can create loops
links from the comments section of popular blogs

Blog services modify comment links to contain the rel="nofollow" attribute
e.g., "Come visit my <a rel="nofollow" href="http://www.page.com">web page</a>."

Trackback Links

Information Extraction
Automatically extract structure from text
annotate the document using tags to identify extracted structure

Named entity recognition
identify words that refer to something of interest in a particular application
e.g., people, companies, locations, dates, product names, prices, etc.

Named Entity Recognition

Example showing semantic annotation of text using XML tags
Information extraction also includes document structure and more complex features such as relationships and events

Named Entity Recognition
Rule-based
Uses lexicons (lists of words and phrases) that categorize names
e.g., locations, people's names, organizations, etc.

Rules also used to verify or find new entity names
e.g., "<number> <word> street" for addresses
"<street address>, <city>" or "in <city>" to verify city names
"<street address>, <city>, <state>" to find new cities
"<title> <name>" to find new names

Named Entity Recognition
Rules either developed manually by trial and error or using machine learning techniques
Statistical
uses a probabilistic model of the words in and around an entity
probabilities estimated using training data (manually annotated text)
Hidden Markov Model (HMM) is one approach

HMM for Extraction
Resolve ambiguity in a word using context
e.g., "marathon" is a location or a sporting event; "boston marathon" is a specific sporting event

Model context using a generative model of the sequence of words
Markov property: the next word in a sequence depends only on a small number of the previous words

HMM for Extraction
Markov Model describes a process as a collection of states with transitions between them
each transition has a probability associated with it
next state depends only on current state and transition probabilities

Hidden Markov Model
each state has a set of possible outputs
outputs have probabilities

HMM Sentence Model

Each state is associated with a probability distribution over words (the output)

HMM for Extraction
Could generate sentences with this model
To recognize named entities, find the sequence of labels that gives the highest probability for the sentence
only the outputs (words) are visible or observed
states are hidden
e.g., <start><name><not-an-entity><location><not-an-entity><end>

Viterbi algorithm used for recognition
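A minimal sketch of Viterbi decoding for a tagger like this. All states, words, and probabilities below are toy values invented for illustration:

    # Viterbi: most probable hidden label sequence for an observed word sequence.
    states = ["name", "location", "other"]
    start_p = {"name": 0.3, "location": 0.3, "other": 0.4}
    trans_p = {s: {t: 1 / 3 for t in states} for s in states}   # uniform transitions
    emit_p = {
        "name":     {"boston": 0.1, "marathon": 0.1, "won": 0.0},
        "location": {"boston": 0.6, "marathon": 0.3, "won": 0.0},
        "other":    {"boston": 0.0, "marathon": 0.1, "won": 0.5},
    }

    def viterbi(words):
        # best[i][s] = (probability of best path ending in state s, backpointer)
        best = [{s: (start_p[s] * emit_p[s].get(words[0], 0.0), None) for s in states}]
        for w in words[1:]:
            prev = best[-1]
            best.append({
                s: max((prev[p][0] * trans_p[p][s] * emit_p[s].get(w, 0.0), p)
                       for p in states)
                for s in states
            })
        path = [max(states, key=lambda s: best[-1][s][0])]      # best final state
        for row in reversed(best[1:]):                          # follow backpointers
            path.append(row[path[-1]][1])
        return list(reversed(path))

    print(viterbi(["boston", "marathon", "won"]))   # ['location', 'location', 'other']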

Named Entity Recognition
Accurate recognition requires about 1M words of training data (1,500 news stories)
may be more expensive than developing rules for some applications

Both rule-based and statistical approaches can achieve about 90% effectiveness for categories such as names, locations, organizations
others, such as product name, can be much worse

Internationalization
2/3 of the Web is in English
About 50% of Web users do not use English as their primary language
Many (maybe most) search applications have to deal with multiple languages
monolingual search: search in one language, but with many possible languages
cross-language search: search in multiple languages at the same time

Internationalization
Many aspects of search engines are language-neutral
Major differences:
Text encoding (converting to Unicode)
Tokenizing (many languages have no word separators)
Stemming

Cultural differences may also impact interface design and features provided

Chinese Tokenizing
