Search Engines: Information Retrieval in Practice
All slides © Addison Wesley, 2008
Processing Text
Converting documents to index terms
Why?
Matching the exact string of characters typed by the user is too restrictive
i.e., it doesn't work very well in terms of effectiveness
Not all words are of equal value in a search
Sometimes not clear where words begin and end
Not even clear what a word is in some languages
e.g., Chinese, Korean
Text Statistics
Huge variety of words used in text, but many statistical characteristics of word occurrences are predictable
e.g., distribution of word counts
Zipf's Law
Distribution of word frequencies is very skewed
a few words occur very often, many words hardly ever occur
e.g., the two most common words ("the", "of") make up about 10% of all word occurrences in text documents
Zipf's law:
observation that rank (r) of a word times its frequency (f) is approximately a constant (k): r · f ≈ k
assuming words are ranked in order of decreasing frequency
Zipf's Law
News Collection (AP89) Statistics
Total documents: 84,678
Total word occurrences: 39,749,179
Vocabulary size: 198,763
Words occurring > 1000 times: 4,169
Words occurring once: 70,064

Word        Freq.   r         Pr (%)
assistant   5,095   1,021     .013
sewers      100     17,110    2.56 × 10⁻⁴
toothbrush  10      51,555    2.56 × 10⁻⁵
hazmat      1       166,945   2.56 × 10⁻⁶
Top 50 Words from AP89
Zipf's Law for AP89
Note problems at high and low frequencies
Zipf's Law
What is the proportion of words with a given frequency?
A word that occurs n times has rank r_n = k/n
Number of words with frequency n is
r_n − r_{n+1} = k/n − k/(n+1) = k/(n(n+1))
so the proportion of words with frequency n is 1/(n(n+1)), since the total number of distinct words is k (the rank of the words that occur only once)
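A minimal sketch (not from the slides) that evaluates this prediction; the AP89 vocabulary size comes from the statistics slide above:

public class ZipfProportion {
    public static void main(String[] args) {
        // Zipf's law predicts that the proportion of the vocabulary with frequency n is 1/(n(n+1)).
        long vocabularySize = 198_763;  // AP89 vocabulary size from the AP89 statistics slide
        for (int n = 1; n <= 5; n++) {
            double proportion = 1.0 / (n * (n + 1.0));
            System.out.printf("freq %d: proportion %.3f, predicted words %.0f%n",
                    n, proportion, proportion * vocabularySize);
        }
        // For n = 1 this predicts roughly 99,000 singleton words; AP89 actually has 70,064.
    }
}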
Zipf's Law
Example: word frequency ranking
Example
Vocabulary Growth
As corpus grows, so does vocabulary size
Fewer new words when corpus is already large
Observed relationship (Heaps' Law): v = k · n^β
where v is the vocabulary size, n is the number of words in the corpus, and k, β are parameters that vary for each collection
AP89 Example
Heaps' Law Predictions
Predictions for TREC collections are accurate for large numbers of words
e.g., first 10,879,522 words of the AP89 collection scanned
prediction is 100,151 unique words
actual number is 100,024
Predictions for small numbers of words (i.e., < 1000) are much worse
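A minimal sketch of a Heaps' law prediction; the k and β values below are assumed for illustration (they roughly reproduce the AP89 prediction quoted above) and are not taken from the slides:

public class HeapsLawPrediction {
    public static void main(String[] args) {
        double k = 62.95, beta = 0.455;      // assumed parameter values, chosen for illustration
        double n = 10_879_522;               // words scanned (from the AP89 example above)
        double v = k * Math.pow(n, beta);    // Heaps' law: v = k * n^beta
        System.out.printf("predicted vocabulary size: %.0f%n", v);
        // The slide's prediction is 100,151 unique words; the actual number is 100,024.
    }
}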
GOV2 (Web) Example
Web Example
Heaps' Law works with very large corpora
new words occurring even after seeing 30 million!
parameter values different than typical TREC values
New words come from a variety of sources
spelling errors, invented words (e.g., product, company names), code, other languages, email addresses, etc.
Search engines must deal with these large and growing vocabularies
Estimating Result Set Size
GOV2 Example
Collection size (N) is 25,205,179
Result Set Size Estimation
Poor estimates because words are not independent
Better estimates possible if co-occurrence information available
P(a ∩ b ∩ c) = P(a ∩ b) · P(c | (a ∩ b))
f_tropical∩fish∩aquarium = f_tropical∩aquarium · f_fish∩aquarium / f_aquarium = 1,921 · 9,722 / 26,480 = 705
f_tropical∩fish∩breeding = f_tropical∩breeding · f_fish∩breeding / f_breeding = 5,510 · 36,427 / 81,885 = 2,451
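A minimal sketch of the co-occurrence estimate above, using the GOV2 document frequencies from the slide:

public class CooccurrenceEstimate {
    public static void main(String[] args) {
        // f(tropical ^ fish ^ aquarium) ~= f(tropical ^ aquarium) * f(fish ^ aquarium) / f(aquarium)
        double est1 = 1_921.0 * 9_722.0 / 26_480.0;
        // f(tropical ^ fish ^ breeding) ~= f(tropical ^ breeding) * f(fish ^ breeding) / f(breeding)
        double est2 = 5_510.0 * 36_427.0 / 81_885.0;
        System.out.printf("aquarium query: %.0f, breeding query: %.0f%n", est1, est2);  // 705 and 2451
    }
}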
Result Set Estimation
Even better estimates using initial result set
Estimate is simply C/s
where s is the proportion of the total documents that have been ranked, and C is the number of documents found that contain all the query words
E.g., "tropical fish aquarium" in GOV2
after processing 3,000 out of the 26,480 documents that contain "aquarium", C = 258
f_tropical∩fish∩aquarium = 258 / (3,000 / 26,480) = 2,277
After processing 20% of the documents, f_tropical∩fish∩aquarium = 1,778 (1,529 is the real value)
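A minimal sketch of the sample-based estimate C/s for the same query:

public class SampleBasedEstimate {
    public static void main(String[] args) {
        double C = 258;                      // matches for "tropical fish aquarium" found so far
        double ranked = 3_000;               // documents processed so far
        double candidates = 26_480;          // documents containing "aquarium"
        double s = ranked / candidates;      // proportion of candidate documents ranked
        System.out.printf("estimated result set size: %.0f%n", C / s);  // about 2,277
    }
}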
Estimating Collection Size
Important issue for Web search engines
Simple technique: use independence model
Given two words a and b that are independent:
f_ab / N = (f_a / N) · (f_b / N)
N = (f_a · f_b) / f_ab
e.g., for GOV2:
f_lincoln = 771,326, f_tropical = 120,990, f_lincoln∩tropical = 3,018
N = (120,990 × 771,326) / 3,018 = 30,922,045 (actual number is 25,205,179)
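A minimal sketch of the collection size estimate N = (f_a · f_b) / f_ab with the GOV2 numbers from the slide:

public class CollectionSizeEstimate {
    public static void main(String[] args) {
        double fLincoln = 771_326, fTropical = 120_990, fBoth = 3_018;
        double n = fLincoln * fTropical / fBoth;
        System.out.printf("estimated collection size: %.0f%n", n);  // about 30.9 million (actual: 25,205,179)
    }
}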
Tokenizing
Forming words from sequence of characters
Surprisingly complex in English, can be harder in other languages
Early IR systems:
any sequence of alphanumeric characters of length 3 or more
terminated by a space or other special character
uppercase changed to lowercase
Tokenizing
Example:
"Bigcorp's 2007 bi-annual report showed profits rose 10%." becomes
"bigcorp 2007 annual report showed profits rose"
Tokenizing Problems
Small words can be important in some queries, usually in combinations
xp, ma, pm, ben e king, el paso, master p, gm, j lo, world war II
Both hyphenated and non-hyphenated forms of many words are common
Sometimes hyphen is not needed
e-bay, wal-mart, active-x, cd-rom, t-shirts
At other times, hyphens should be considered either as part of the word or a word separator
winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking
Tokenizing Problems
Special characters are an important part of tags, URLs, code in documents
Capitalized words can have different meaning from lower case words
Bush, Apple
Apostrophes can be a part of a word, a part of a possessive, or just a mistake
rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's
Tokenizing Problems
Numbers can be important, including decimals
nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358
Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
I.B.M., Ph.D., cs.umass.edu, F.E.A.R.
Note: tokenizing steps for queries must be identical to steps for documents
Tokenizing Process
First step is to use parser to identify appropriate parts of document to tokenize
Defer complex decisions to other components
word is any sequence of alphanumeric characters, terminated by a space or special character, with everything converted to lowercase
everything indexed
example: "92.3" → "92 3", but search finds documents with 92 and 3 adjacent
incorporate some rules to reduce dependence on query transformation components
Tokenizing Process
Not that different than the simple tokenizing process used in the past
Examples of rules used with TREC
Apostrophes in words ignored
o'connor → oconnor, bob's → bobs
Periods in abbreviations ignored
I.B.M. → ibm, Ph.D. → phd
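A minimal sketch (not the TREC rules or the Galago tokenizer) of the rules above: lowercase, drop apostrophes and periods that appear between letters, and split on everything else (so 92.3 still becomes 92 and 3):

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SimpleTokenizer {
    static List<String> tokenize(String text) {
        String t = text.toLowerCase(Locale.ROOT);
        // ignore apostrophes in words and periods in abbreviations: o'connor -> oconnor, i.b.m. -> ibm
        t = t.replaceAll("(?<=\\p{L})['.](?=\\p{L})", "");
        List<String> tokens = new ArrayList<>();
        for (String tok : t.split("[^\\p{L}\\p{N}]+")) {
            if (!tok.isEmpty()) tokens.add(tok);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I.B.M. Ph.D. rosie o'donnell 92.3 the beat"));
        // [ibm, phd, rosie, odonnell, 92, 3, the, beat]
    }
}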
Stopping
Function words (determiners, prepositions) have little meaning on their own
High occurrence frequencies
Treated as stopwords (i.e., removed)
reduce index space, improve response time, improve effectiveness
Can be important in combinations
e.g., "to be or not to be"
Stopping
Stopword list can be created from high-frequency words or based on a standard list
Lists are customized for applications, domains, and even parts of documents
e.g., "click" is a good stopword for anchor text
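A minimal sketch of stopping with a tiny illustrative stopword list (a real list would be larger and customized for the application, as noted above):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopwordFilter {
    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "of", "to", "a", "and", "in"));
        List<String> tokens = Arrays.asList("the", "tropical", "fish", "of", "the", "aquarium");
        List<String> kept = tokens.stream()
                .filter(t -> !stopwords.contains(t))  // drop every token that appears in the stopword list
                .collect(Collectors.toList());
        System.out.println(kept);  // [tropical, fish, aquarium]
    }
}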
Stemming
Many morphological variations of words
inflectional (plurals, tenses)
derivational (making verbs into nouns, etc.)
Stemmers attempt to reduce these variations to a common stem
Can be done at indexing time or as part of query processing (like stopwords)
Stemming
Generally a small but significant effectiveness improvement
can be crucial for some languages
e.g., 5-10% improvement for English, up to 50% in Arabic
Words with the Arabic root ktb
Stemming
Two basic types
Dictionary-based: uses lists of related words
Algorithmic: uses program to determine related words
Algorithmic stemmers
suffix-s: remove endings assuming plural
e.g., cats → cat, lakes → lake, wiis → wii
Many false negatives: supplies → supplie
Some false positives: ups → up
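A minimal sketch of the suffix-s stemmer described above, using the slide's examples:

public class SuffixSStemmer {
    // strip a trailing 's', assuming it marks a plural
    static String stem(String word) {
        return word.length() > 1 && word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
    }

    public static void main(String[] args) {
        System.out.println(stem("cats"));      // cat
        System.out.println(stem("lakes"));     // lake
        System.out.println(stem("wiis"));      // wii
        System.out.println(stem("supplies"));  // supplie (false negative: does not match "supply")
        System.out.println(stem("ups"));       // up      (false positive)
    }
}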
Porter Stemmer
Algorithmic stemmer used in IR experiments since the 70s
Consists of a series of rules designed to strip off the longest possible suffix at each step
Effective in TREC
Produces stems, not words
Makes a number of errors and is difficult to modify
Porter Stemmer
Example step (1 of 5)
Porter Stemmer
Porter2 stemmer addresses some of these issues
Approach has been used with other languages
Krovetz Stemmer
Hybrid algorithmic-dictionary
Word checked in dictionary
If present, either left alone or replaced with "exception"
If not present, word is checked for suffixes that could be removed
After removal, dictionary is checked again
Stemmer Comparison
Phrases
Many queries are 2-3 word phrases
Phrases are
More precise than single words
e.g., documents containing "black sea" vs. two words "black" and "sea"
Less ambiguous
e.g., "big apple" vs. "apple"
Can be difficult for ranking
e.g., given query "fishing supplies", how do we score documents with
exact phrase many times, exact phrase just once, individual words in same sentence, same paragraph, whole document, variations on words?
Phrases
Text processing issue: how are phrases recognized?
Three possible approaches:
Identify syntactic phrases using a part-of-speech (POS) tagger
Use word n-grams
Store word positions in indexes and use proximity operators in queries
POS Tagging
POS taggers use statistical models of text to predict syntactic tags of words
Example tags:
NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., and, or), PRP (pronoun), and MD (modal auxiliary, e.g., can, will)
Phrases can then be defined as simple noun groups, for example
POS Tagging Example
Example Noun Phrases
Word N-Grams
POS tagging too slow for large collections
Simpler definition: phrase is any sequence of n words, known as n-grams
bigram: 2 word sequence, trigram: 3 word sequence, unigram: single words
N-grams also used at character level for applications such as OCR
N-grams typically formed from overlapping sequences of words
i.e., move n-word window one word at a time in document
N-Grams
Frequent n-grams are more likely to be meaningful phrases
N-grams form a Zipf distribution
Better fit than words alone
Could index all n-grams up to specified length
Much faster than POS tagging
Uses a lot of storage
e.g., document containing 1,000 words would contain 3,990 instances of word n-grams of length 2 ≤ n ≤ 5
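A minimal sketch of forming overlapping word n-grams with a sliding window, plus a check of the 3,990 figure above:

import java.util.ArrayList;
import java.util.List;

public class WordNGrams {
    // slide an n-word window over the document one word at a time
    static List<String> ngrams(List<String> words, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= words.size(); i++) {
            result.add(String.join(" ", words.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ngrams(List.of("tropical", "fish", "supplies"), 2));  // [tropical fish, fish supplies]
        int count = 0;
        for (int n = 2; n <= 5; n++) {
            count += 1000 - n + 1;  // a 1,000-word document has (1000 - n + 1) n-grams of length n
        }
        System.out.println(count);  // 3990
    }
}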
Google N-Grams
Web search engines index n-grams
Google sample:
Most frequent trigram in English is "all rights reserved"
In Chinese, "limited liability corporation"
Document Structure and Markup
Some parts of documents are more important than others
Document parser recognizes structure using markup, such as HTML tags
Headers, anchor text, bolded text all likely to be important
Metadata can also be important
Links used for link analysis
Example Web Page
Example Web Page
Link Analysis
Links are a key component of the Web
Important for navigation, but also for search
e.g., <a href="http://example.com">Example website</a>
"Example website" is the anchor text
"http://example.com" is the destination link
both are used by search engines
Anchor Text
Used as a description of the content of the destination page
i.e., collection of anchor text in all links pointing to a page used as an additional text field
PageRank
Billions of web pages, some more informative than others
Links can be viewed as information about the popularity (authority?) of a web page
can be used by ranking algorithm
Random Surfer Model
Browse the Web using the following algorithm:
Choose a random number r between 0 and 1
If r < λ:
Go to a random page
If r ≥ λ:
Click a link at random on the current page
Start again
PageRank of a page is the probability that the random surfer will be looking at that page
links from popular pages will increase PageRank of pages they point to
Dangling Links
Random jump prevents getting stuck on pages that
do not have links
contain only links that no longer point to other pages
have links forming a loop
Links that point to the first two types of pages are called dangling links
may also be links to pages that have not yet been crawled
PageRank
PageRank (PR) of page C = PR(A)/2 + PR(B)/1
More generally,
PR(u) = Σ_{v ∈ B_u} PR(v) / L_v
where B_u is the set of pages that point to u, and L_v is the number of outgoing links from page v
PageRank
Don't know PageRank values at start
Assume equal values (1/3 in this case), then iterate:
first iteration: PR(C) = 0.33/2 + 0.33 = 0.5, PR(A) = 0.33, and PR(B) = 0.17
second: PR(C) = 0.33/2 + 0.17 = 0.33, PR(A) = 0.5, PR(B) = 0.17
third: PR(C) = 0.42, PR(A) = 0.33, PR(B) = 0.25
Converges to PR(C) = 0.4, PR(A) = 0.4, and PR(B) = 0.2
PageRank
Taking random page jump into account, 1/3 chance of going to any page when r < λ
PR(C) = λ/3 + (1 − λ) · (PR(A)/2 + PR(B)/1)
More generally,
PR(u) = λ/N + (1 − λ) · Σ_{v ∈ B_u} PR(v) / L_v
where N is the number of pages
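A minimal sketch of this iteration for the three-page example above, assuming the link structure A → {B, C}, B → {C}, C → {A} (which matches the numbers on the previous slide); setting λ > 0 adds the random jump:

public class PageRankExample {
    public static void main(String[] args) {
        double lambda = 0.0;  // random-jump probability; 0 reproduces the iteration on the previous slide
        double prA = 1.0 / 3, prB = 1.0 / 3, prC = 1.0 / 3;
        for (int i = 0; i < 100; i++) {
            double newA = lambda / 3 + (1 - lambda) * prC;              // C's only link points to A
            double newB = lambda / 3 + (1 - lambda) * prA / 2;          // A splits its PageRank over two links
            double newC = lambda / 3 + (1 - lambda) * (prA / 2 + prB);  // C is pointed to by A and B
            prA = newA; prB = newB; prC = newC;
        }
        System.out.printf("PR(A)=%.2f PR(B)=%.2f PR(C)=%.2f%n", prA, prB, prC);  // 0.40 0.20 0.40
    }
}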
A PageRank Implementation
Preliminaries:
1) Extract links from the source text. You'll also want to extract the URL from each document in a separate file. Now you have all the links (source-destination pairs) and all the source documents
2) Remove all links from the list that do not connect two documents in the corpus. The easiest way to do this is to sort all links by destination, then compare that against the corpus URLs list (also sorted)
3) Create a new file I that contains a (url, pagerank) pair for each URL in the corpus. The initial PageRank value is 1/#D (#D = number of urls)
At this point there are two interesting files:
[L] links (trimmed to contain only corpus links, sorted by source URL)
[I] URL/PageRank pairs, initialized to a constant
A PageRank Implementation
Preliminaries: link extraction from .corpus file using Galago
DocumentSplit → IndexReaderSplitParser → TagTokenizer

split = new DocumentSplit(filename, filetype, new byte[0], new byte[0]);
index = new IndexReaderSplitParser(split);
tokenizer = new TagTokenizer();
tokenizer.setProcessor(new NullProcessor(Document.class));
doc = index.nextDocument();
tokenizer.process(doc);
A PageRank Implementation
Iteration steps:
1. Make a new output file, R.
2. Read L and I in parallel (since they're all sorted by URL).
3. For each unique source URL, determine whether it has any outgoing links:
4. If not, add its current PageRank value to the sum T (terminals).
5. If it does have outgoing links, write (source_url, dest_url, Ip/|Q|), where Ip is the current PageRank value, |Q| is the number of outgoing links, and dest_url is a link destination. Do this for all outgoing links. Write this to R.
6. Sort R by destination URL.
7. Scan R and I at the same time. The new value of Rp is:
(1 − λ)/#D (a fraction of the sum of all pages)
plus: λ · sum(T)/#D (the total effect from terminal pages)
plus: λ · all incoming mass from step 5
8. Check for convergence.
9. Write new Rp values to a new I file.
A PageRank Implementation
Convergence check
Stopping criteria for this type of PR algorithm is typically of the form ||new − old|| < τ, where new and old are the new and old PageRank vectors, respectively.
τ is set depending on how much precision you need. Reasonable values include 0.1 or 0.01. If you want really fast but inaccurate convergence, you can use something like τ = 1.
The setting of τ also depends on N (the number of documents in the collection), since ||new − old|| (for a fixed numerical precision) increases as N increases, so you can alternatively formulate your convergence criteria as ||new − old|| / N < τ.
Either the L1 or L2 norm can be used.
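A minimal sketch of the L1-norm convergence test described above:

public class ConvergenceCheck {
    // returns true when the L1 distance between successive PageRank vectors drops below tau
    static boolean converged(double[] oldPr, double[] newPr, double tau) {
        double diff = 0.0;
        for (int i = 0; i < oldPr.length; i++) {
            diff += Math.abs(newPr[i] - oldPr[i]);
        }
        return diff < tau;  // or (diff / oldPr.length) < tau to normalize by collection size
    }

    public static void main(String[] args) {
        double[] oldPr = {0.33, 0.33, 0.33};
        double[] newPr = {0.33, 0.17, 0.50};
        System.out.println(converged(oldPr, newPr, 0.01));  // false: still far from convergence
    }
}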
Link Quality
Link quality is affected by spam and other factors
e.g., link farms to increase PageRank
trackback links in blogs can create loops
links from comments section of popular blogs
Blog services modify comment links to contain rel="nofollow" attribute
e.g., "Come visit my <a rel="nofollow" href="http://www.page.com">web page</a>."
Trackback Links
Information Extraction
Automatically extract structure from text
annotate document using tags to identify extracted structure
Named entity recognition
identify words that refer to something of interest in a particular application
e.g., people, companies, locations, dates, product names, prices, etc.
Named Entity Recognition
Named Entity Recognition
Rule-based
Uses lexicons (lists of words and phrases) that categorize names
e.g., locations, people's names, organizations, etc.
Rules also used to verify or find new entity names
e.g., "<number> <word> street" for addresses
"<street address>, <city>" or "in <city>" to verify city names
"<street address>, <city>, <state>" to find new cities
"<title> <name>" to find new names
Named Entity Recognition
Rules either developed manually by trial and error or using machine learning techniques
Statistical
uses a probabilistic model of the words in and around an entity
probabilities estimated using training data (manually annotated text)
Hidden Markov Model (HMM) is one approach
HMM for Extraction
Resolve ambiguity in a word using context
e.g., "marathon" is a location or a sporting event, "boston marathon" is a specific sporting event
HMM for Extraction
Markov Model describes a process as a collection of states with transitions between them
each transition has a probability associated with it
next state depends only on current state and transition probabilities
Hidden Markov Model
each state has a set of possible outputs
outputs have probabilities
HMM Sentence Model
Each state is associated with a probability distribution over words (the output)
HMM for Extraction
Could generate sentences with this model
To recognize named entities, find sequence of labels that give highest probability for the sentence
only the outputs (words) are visible or observed
states are hidden
e.g., <start><name><not-an-entity><location><not-an-entity><end>
Viterbi algorithm used for recognition
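A minimal Viterbi sketch for an HMM tagger; the two states, the tiny vocabulary, and all probabilities below are illustrative assumptions, not the model described in the book:

import java.util.Arrays;

public class ViterbiTagger {
    // returns the most probable state sequence for the observed word indices
    static int[] viterbi(double[] start, double[][] trans, double[][] emit, int[] obs) {
        int nStates = start.length, len = obs.length;
        double[][] score = new double[len][nStates];  // best log-probability of a path ending in state s at position t
        int[][] back = new int[len][nStates];         // back pointers for recovering the best path
        for (int s = 0; s < nStates; s++) {
            score[0][s] = Math.log(start[s]) + Math.log(emit[s][obs[0]]);
        }
        for (int t = 1; t < len; t++) {
            for (int s = 0; s < nStates; s++) {
                double best = Double.NEGATIVE_INFINITY;
                int bestPrev = 0;
                for (int p = 0; p < nStates; p++) {
                    double cand = score[t - 1][p] + Math.log(trans[p][s]);
                    if (cand > best) { best = cand; bestPrev = p; }
                }
                score[t][s] = best + Math.log(emit[s][obs[t]]);
                back[t][s] = bestPrev;
            }
        }
        int[] path = new int[len];
        double best = Double.NEGATIVE_INFINITY;
        for (int s = 0; s < nStates; s++) {
            if (score[len - 1][s] > best) { best = score[len - 1][s]; path[len - 1] = s; }
        }
        for (int t = len - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        // states: 0 = <not-an-entity>, 1 = <location>; vocabulary: 0 = "the", 1 = "boston", 2 = "marathon"
        double[] start = {0.8, 0.2};
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] emit = {{0.8, 0.1, 0.1}, {0.1, 0.6, 0.3}};
        int[] sentence = {0, 1, 2};  // "the boston marathon"
        System.out.println(Arrays.toString(viterbi(start, trans, emit, sentence)));  // [0, 1, 1]
    }
}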
Named Entity Recognition
Accurate recognition requires about 1M words of training data (1,500 news stories)
may be more expensive than developing rules for some applications
Internationalization
2/3 of the Web is in English
About 50% of Web users do not use English as their primary language
Many (maybe most) search applications have to deal with multiple languages
monolingual search: search in one language, but with many possible languages
cross-language search: search in multiple languages at the same time
Internationalization
Many aspects of search engines are language-neutral
Major differences:
Text encoding (converting to Unicode)
Tokenizing (many languages have no word separators)
Stemming
Cultural differences may also impact interface design and features provided
Chinese Tokenizing