1 Automatic keyword extraction from individual documents

Stuart Rose, Dave Engel, Nick Cramer and Wendy Cowley

1.1 Introduction

Keywords, which we define as a sequence of one or more words, provide a compact representation of a document's content. Ideally, keywords represent in condensed form the essential content of a document. Keywords are widely used to define queries within information retrieval (IR) systems as they are easy to define, revise, remember, and share. In comparison to mathematical signatures, keywords are independent of any corpus and can be applied across multiple corpora and IR systems.

Keywords have also been applied to improve the functionality of IR systems. Jones and Paynter (2002) describe Phrasier, a system that lists documents related to a primary document's keywords, and that supports the use of keyword anchors as hyperlinks between documents, enabling a user to quickly access related material. Gutwin et al. (1999) describe Keyphind, which uses keywords from documents as the basic building block for an IR system. Keywords can also be used to enrich the presentation of search results. Hulth (2004) describes Keegle, a system that dynamically provides keyword extracts for web pages returned from a Google search. Andrade and Valencia (1998) present a system that automatically annotates protein function with keywords extracted from the scientific literature that are associated with a given protein.

1.1.1 Keyword extraction methods

Despite their utility for analysis, indexing, and retrieval, most documents do not have assigned keywords. Most existing approaches focus on the manual assignment of keywords by professional curators who may use a fixed taxonomy, or rely on the authors' judgment to provide a representative list. Research has therefore focused on methods to automatically extract keywords from documents as an aid either to suggest keywords for a professional indexer or to generate summary features for documents that would otherwise be inaccessible.

Early approaches to automatically extract keywords focus on evaluating corpus-oriented statistics of individual words. Jones (1972) and Salton et al. (1975) describe positive results of selecting for an index vocabulary the statistically discriminating words across a corpus. Later keyword extraction research applies these metrics to select discriminating words as keywords for individual documents.
For example, Andrade and Valencia (1998) base their approach on comparison of word frequency distributions within a text against distributions from a reference corpus.

While some keywords are likely to be evaluated as statistically discriminating within the corpus, keywords that occur in many documents within the corpus are not likely to be selected as statistically discriminating. Corpus-oriented methods also typically operate only on single words. This further limits the measurement of statistically discriminating words because single words are often used in multiple and different contexts.

To avoid these drawbacks, we focus our interest on methods of keyword extraction that operate on individual documents. Such document-oriented methods will extract the same keywords from a document regardless of the current state of a corpus. Document-oriented methods therefore provide context-independent document features, enabling additional analytic methods such as those described in Engel et al. (2009) and Whitney et al. (2009) that characterize changes within a text stream over time. These document-oriented methods are suited to corpora that change, such as collections of published technical abstracts that grow over time or streams of news articles. Furthermore, by operating on a single document, these methods inherently scale to vast collections and can be applied in many contexts to enhance IR systems and analysis tools.

Previous work on document-oriented methods of keyword extraction has combined natural language processing approaches to identify part-of-speech (POS) tags that are combined with supervised learning, machine-learning algorithms, or statistical methods.

Hulth (2003) compares the effectiveness of three term selection approaches: noun-phrase (NP) chunks, n-grams, and POS tags, with four discriminative features of these terms as inputs for automatic keyword extraction using a supervised machine-learning algorithm.

Mihalcea and Tarau (2004) describe a system that applies a series of syntactic filters to identify POS tags that are used to select words to evaluate as keywords. Co-occurrences of the selected words within a fixed-size sliding window are accumulated within a word co-occurrence graph. A graph-based ranking algorithm (TextRank) is applied to rank words based on their associations in the graph, and then top-ranking words are selected as keywords. Keywords that are adjacent in the document are combined to form multi-word keywords.
Mihalcea and Tarau (2004) report that TextRank achieves its best performance when only nouns and adjectives are selected as potential keywords.

Matsuo and Ishizuka (2004) apply a chi-square measure to calculate how selectively words and phrases co-occur within the same sentences as a particular subset of frequent terms in the document text. The chi-square measure is applied to determine the bias of word co-occurrences in the document text, which is then used to rank words and phrases as keywords of the document. Matsuo and Ishizuka (2004) state that the degree of biases is not reliable when term frequency is small. The authors present an evaluation on full-text articles and a working example on a 27-page document, showing that their method operates effectively on large documents.

In the following sections, we describe Rapid Automatic Keyword Extraction (RAKE), an unsupervised, domain-independent, and language-independent method for extracting keywords from individual documents. We provide details of the algorithm and its configuration parameters, and present results on a benchmark dataset of technical abstracts, showing that RAKE is more computationally efficient than TextRank while achieving higher precision and comparable recall scores. We then describe a novel method for generating stoplists, which we use to configure RAKE for specific domains and corpora. Finally, we apply RAKE to a corpus of news articles and define metrics for evaluating the exclusivity, essentiality, and generality of extracted keywords, enabling a system to identify keywords that are essential or general to documents in the absence of manual annotations.

1.2 Rapid automatic keyword extraction

In developing RAKE, our motivation has been to develop a keyword extraction method that is extremely efficient, operates on individual documents to enable application to dynamic collections, is easily applied to new domains, and operates well on multiple types of documents, particularly those that do not follow specific grammar conventions. Figure 1.1 contains the title and text for a typical abstract, as well as its manually assigned keywords.

RAKE is based on our observation that keywords frequently contain multiple words but rarely contain standard punctuation or stop words, such as the function words and, the, and of, or other words with minimal lexical meaning.
Reviewing the manually assigned keywords for the abstract in Figure 1.1, there is only one keyword that contains a stop word ("of" in "set of natural numbers"). Stop words are typically dropped from indexes within IR systems and not included in various text analyses as they are considered to be uninformative or meaningless. This reasoning is based on the expectation that such words are too frequently and broadly used to aid users in their analyses or search tasks. Words that do carry meaning within a document are described as content bearing and are often referred to as content words.

[Figure 1.1 A sample abstract from the Inspec test set and its manually assigned keywords.]

The input parameters for RAKE comprise a list of stop words (or stoplist), a set of phrase delimiters, and a set of word delimiters. RAKE uses stop words and phrase delimiters to partition the document text into candidate keywords, which are sequences of content words as they occur in the text. Co-occurrences of words within these candidate keywords are meaningful and allow us to identify word co-occurrence without the application of an arbitrarily sized sliding window. Word associations are thus measured in a manner that automatically adapts to the style and content of the text, enabling adaptive and fine-grained measurement of word co-occurrences that will be used to score candidate keywords.

1.2.1 Candidate keywords

RAKE begins keyword extraction on a document by parsing its text into a set of candidate keywords. First, the document text is split into an array of words by the specified word delimiters. This array is then split into sequences of contiguous words at phrase delimiters and stop word positions. Words within a sequence are assigned the same position in the text and together are considered a candidate keyword.

Figure 1.2 shows the candidate keywords in the order that they are parsed from the sample technical abstract shown in Figure 1.1. The candidate keyword [...]

[Figure 1.6 Top 100 words in the generated stoplist.]

[...] (2004) are included for comparison. The highest values for precision, recall, and F-measure are shown in bold.
As noted, perfect precision is not possible with any of the techniques as the manually assigned keywords do not always appear in the abstract text. The highest precision and F-measure are achieved using RAKE with a generated stoplist based on keyword adjacency, a subset of which is listed in Figure 1.6. With this stoplist RAKE yields the best results in terms of F-measure and precision, and provides comparable recall. With Fox's stoplist, RAKE achieves a high recall while experiencing a drop in precision.

1.3.2 Evaluating efficiency

Because of increasing interest in energy conservation in large data centers, we also evaluated the computational cost associated with extracting keywords with RAKE and TextRank. TextRank applies syntactic filters to a document text to identify content words and accumulates a graph of word co-occurrences in a window size of 2. A rank for each word in the graph is calculated through a series of iterations until convergence below a threshold is achieved.

We set TextRank's damping factor d = 0.85 and its convergence threshold to 0.0001, as recommended in Mihalcea and Tarau (2004). We do not have access to the syntactic filters referenced in Mihalcea and Tarau (2004), so were unable to evaluate their computational cost.

To minimize disparity, all parsing stages in the respective extraction methods are identical: TextRank accumulates co-occurrences in a window of size 2, and RAKE accumulates word co-occurrences within candidate keywords. After co-occurrences are tallied, the algorithms compute keyword scores according to their respective methods. The benchmark was implemented in Java and executed in the Java SE Runtime Environment (JRE) 6 on a Dell Precision T7400 workstation.

We calculated the total time for RAKE and TextRank (as an average over 100 iterations) to extract keywords from the Inspec testing set of 500 abstracts, after the abstracts were read from files and loaded in memory. RAKE extracted keywords from the 500 abstracts in 160 milliseconds. TextRank extracted keywords in 1002 milliseconds, over 6 times the time of RAKE.
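
The iterative ranking step that dominates TextRank's cost can be illustrated with a minimal Python sketch. It is not the Java code used in the benchmark: the function names are hypothetical, the input is assumed to be a list of content words that has already been filtered, and the stopping rule (largest per-word change below the threshold) is one plausible reading of "convergence below a threshold". The update rule is the unweighted, undirected TextRank formula, with the damping factor and threshold set to the values quoted above.

```python
from collections import defaultdict


def cooccurrence_graph(words, window=2):
    """Link words that co-occur within `window` consecutive tokens (undirected)."""
    graph = defaultdict(set)
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != word:
                graph[word].add(words[j])
                graph[words[j]].add(word)
    return graph


def textrank(graph, d=0.85, threshold=0.0001, max_iter=1000):
    """Repeat S(v) = (1 - d) + d * sum(S(u) / degree(u)) until scores settle.

    Returns the final scores and the number of passes over the graph.
    """
    scores = {v: 1.0 for v in graph}
    iteration = 0
    for iteration in range(1, max_iter + 1):
        new_scores = {
            v: (1 - d) + d * sum(scores[u] / len(graph[u]) for u in graph[v])
            for v in graph
        }
        # Assumed convergence test: largest change in any word's score.
        delta = max(abs(new_scores[v] - scores[v]) for v in graph)
        scores = new_scores
        if delta < threshold:
            break
    return scores, iteration


# Usage on a toy sequence of content words (illustrative input only).
tokens = "linear constraints linear systems linear equations".split()
ranks, passes = textrank(cooccurrence_graph(tokens))
for word in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{word}: {ranks[word]:.3f}")
print("passes over the graph:", passes)
```

Every pass revisits each word and its neighbours, so the total work grows with the number of passes needed to converge, whereas a single-pass scorer touches each co-occurrence once.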

Referring to Figure 1.7, we can see that as the number of content words for a document increases, the performance advantage of RAKE over TextRank increases. This is due to RAKE's ability to score keywords in a single pass, whereas TextRank requires repeated iterations to achieve convergence on word ranks.

Based on this benchmark evaluation, it is clear that RAKE effectively extracts keywords and outperforms the current state of the art in terms of precision, efficiency, and simplicity. As RAKE can be put to use in many different systems and applications, in the next section we discuss a method for stoplist generation that may be used to configure RAKE on particular corpora, domains, and languages.

1.4 Stoplist generation

Stoplists are widely used in IR and text analysis applications. However, there is remarkably little information describing methods for their creation. Fox (1989) presents an analysis of stoplists, noting discrepancies between stated conventions and actual instances and implementations of stoplists. The lack of technical rigor associated with the creation of stoplists presents a challenge when comparing text analysis methods. In practice, stoplists are often based on common function words and hand-tuned for particular applications, domains, or specific languages.

We evaluated the use of term frequency as a metric for automatically selecting words for a stoplist. Table 1.3 lists the top 50 words by term frequency in the training set of abstracts in the benchmark dataset. Additional metrics shown for each word are document frequency, adjacency frequency, and keyword frequency. Adjacency frequency reflects the number of times the word occurred adjacent to an abstract's keywords. Keyword frequency reflects the number of times the word occurred within an abstract's keywords.

[Figure 1.7 Comparison of TextRank and RAKE extraction times on individual documents.]

Looking at the top 50 frequent words, in addition to the typical function words, we can see system, control, and method as words that are highly frequent within technical abstracts and highly frequent within those abstracts' keywords.
Selecting solely by term frequency will therefore cause content-bearing words to be added to the stoplist, particularly if the corpus of documents is focused on a particular domain or topic. In those circumstances, selecting stop words by term frequency presents a risk of removing important content-bearing words from analysis.

We therefore present the following method for automatically generating a stoplist from a set of documents for which keywords are defined. The algorithm is based on the intuition that words adjacent to, and not within, keywords are less likely to be meaningful and therefore are good choices for stop words.

To generate our stoplist we identified for each abstract in the Inspec training set the words occurring adjacent to words in the abstract's uncontrolled keyword list. The frequency of each word occurring adjacent to a keyword was accumulated across the abstracts. Words that occurred more frequently within keywords than adjacent to them were excluded from the stoplist.

[Table 1.3 The 50 most frequent words in the Inspec training set, listed in descending order by term frequency. Columns: word, term frequency, document frequency, adjacency frequency, keyword frequency.]

To evaluate this method of generating stoplists, we created six stoplists, three of which select words for the stoplist by term frequency (TF), and three which select words by term frequency but also exclude words from the stoplist whose keyword frequency was greater than their keyword adjacency frequency.
We refer to this latter set of stoplists as keyword adjacency (KA) stoplists since they primarily include words that are adjacent to and not within keywords.

[Table 1.4 Comparison of RAKE performance using stoplists based on term frequency (TF) and keyword adjacency (KA). Columns: method, stoplist size, extracted keywords (total, mean), correct keywords (total, mean), precision, recall, F-measure.]

Each of the stoplists was set as the input stoplist for RAKE, which was then run on the testing set of the Inspec corpus of technical abstracts. Table 1.4 lists the precision, recall, and F-measure for the keywords extracted by each of these runs. The KA stoplists generated by our method outperformed the TF stoplists generated by term frequency. A notable difference between results achieved using the two types of stoplists is evident in Table 1.4: the F-measure improves as more words are added to a KA stoplist, whereas when more words are added to a TF stoplist the F-measure degrades. Furthermore, the best TF stoplist underperforms the worst KA stoplist. This verifies that our algorithm for generating stoplists is adding the right stop words and excluding content words from the stoplist.

Because the generated KA stoplists leverage manually assigned keywords, we envision that an ideal application would be within existing digital libraries or IR systems and collections where defined keywords exist or are easily identified for a subset of the documents. Stoplists only need to be generated once for particular domains, enabling RAKE to be applied to new and future articles, facilitating the annotation and indexing of new documents.

1.5 Evaluation on news articles

While we have shown that a simple set of configuration parameters enables RAKE to efficiently extract keywords from individual documents, it is worth investigating how well extracted keywords represent the essential content within a corpus of documents for which keywords have not been manually assigned. The following section presents results on application of RAKE to the Multi-Perspective Question Answering (MPQA) Corpus (CERATOPS 2009).

1.5.1 The MPQA Corpus

The MPQA Corpus consists of 535 news articles provided by the Center for the Extraction and Summarization of Events and Opinions in Text (CERATOPS). Articles in the MPQA Corpus are from 187 different foreign and US news sources and date from June 2001 to May 2002.

1.5.2 Extracting keywords from news articles

We extracted keywords from title and text fields of documents in the MPQA Corpus and set a minimum document threshold of two because we are interested in keywords that are associated with multiple documents.

Candidate keyword scores were based on word scores as deg(w)/freq(w) and as deg(w). Calculating word scores as deg(w)/freq(w), RAKE extracted 517 keywords referenced by an average of 4.9 documents. Calculating word scores as deg(w), RAKE extracted 711 keywords referenced by an average of 8.1 documents.

This difference in average number of referenced document counts is the result of longer keywords having lower frequency across documents. The metric deg(w)/freq(w) favors longer keywords and therefore results in extracted keywords that occur in fewer documents in the MPQA Corpus.

In many cases a subject is occasionally presented in its long form and more frequently referenced in its shorter form. For example, referring to Table 1.5, kyoto protocol on climate change and 1997 kyoto protocol occur less frequently than the shorter kyoto protocol. Because our interest in the analysis of news articles is to connect articles that reference related content, we set RAKE to score words by deg(w) in order to favor shorter keywords that occur across more documents.

[Table 1.5 Keywords extracted with word scores by deg(w) and deg(w)/freq(w). Columns: keyword; edf(w) and rdf(w) when scored by deg(w); edf(w) and rdf(w) when scored by deg(w)/freq(w).]

Because most documents are unique within any given corpus, we expect to find variability in what documents are essentially about as well as how each document represents specific subjects. While some documents may be primarily about the kyoto protocol, greenhouse gas emissions, and climate change, other documents may only make references to those subjects. Documents in the former set will likely have kyoto protocol, greenhouse gas emissions, and climate change extracted as keywords whereas documents in the latter set will not.

In many applications, users have a desire to capture all references to extracted keywords. For the purposes of evaluating extracted keywords, we accumulate counts on how often each extracted keyword is referenced by documents in the corpus. The referenced document frequency of a keyword, rdf(k), is the number of documents in which the keyword occurred as a candidate keyword. The extracted document frequency of a keyword, edf(k), is the number of documents from which the keyword was extracted.

A keyword that is extracted from all of the documents in which it is referenced can be characterized as exclusive or essential, whereas a keyword that is referenced in many documents but extracted from a few may be characterized as general. Comparing the relationship of edf(k) and rdf(k) allows us to characterize the exclusivity of a particular keyword. We therefore define keyword exclusivity exc(k) as shown in Equation (1.1):

\[ \mathrm{exc}(k) = \frac{\mathrm{edf}(k)}{\mathrm{rdf}(k)} \tag{1.1} \]

Of the 711 extracted keywords, 395 have an exclusivity score of 1, indicating that they were extracted from every document in which they were referenced. Within that set of 395 exclusive keywords, some occur in more documents than others and can therefore be considered more essential to the corpus of documents. In order to measure how essential a keyword is, we define the essentiality of a keyword, ess(k), as shown in Equation (1.2):

\[ \mathrm{ess}(k) = \mathrm{exc}(k) \times \mathrm{edf}(k) \tag{1.2} \]

Figure 1.8 lists the top 50 essential keywords extracted from the MPQA Corpus, listed in descending order by their ess(k) scores. According to CERATOPS, the MPQA Corpus comprises 10 primary topics, listed in Table 1.6, which are well represented by the 50 most essential keywords as extracted and ranked by RAKE.

[Figure 1.8 Top 50 essential keywords from the MPQA Corpus, with corresponding ess(k) score in parentheses.]

Table 1.6 MPQA Corpus topics and definitions.

Topic          Description
argentina      Economic collapse in Argentina
axisofevil     Reaction to President Bush's 2002 State of the Union Address
guantanamo     US holding prisoners in Guantanamo Bay
humanrights    Reaction to US State Department report on human rights
kyoto          Ratification of Kyoto Protocol
mugabe         2002 Presidential election in Zimbabwe
settlements    Israeli settlements in Gaza and West Bank
spacestation   Space missions of various countries
taiwan         Relations between Taiwan and China
venezuela      Presidential coup in Venezuela

In addition to keywords that are essential to documents, we can also characterize keywords by how general they are to the corpus. In other words, how often was a keyword referenced by documents from which it was not extracted? In this case we define generality of a keyword, gen(k), as shown in Equation (1.3):

\[ \mathrm{gen}(k) = \mathrm{rdf}(k) \times \bigl(1.0 - \mathrm{exc}(k)\bigr) \tag{1.3} \]

Figure 1.9 lists the top 50 general keywords extracted from the MPQA Corpus, listed in descending order by their gen(k) scores. It should be noted that general keywords and essential keywords are not mutually exclusive. Within the top 50 for both metrics, there are several shared keywords: united states, president, bush, prisoners, election, rights, bush administration, human rights, and north korea. Keywords that are both highly essential and highly general are essential to a set of documents within the corpus but also referenced by a significantly greater number of documents within the corpus than other keywords.

[Figure 1.9 Top 50 general keywords from the MPQA Corpus, with corresponding gen(k) score in parentheses.]

1.6 Summary

We have shown that our automatic keyword extraction technology, RAKE, achieves higher precision and similar recall in comparison to existing techniques. In contrast to methods that depend on natural language processing techniques to achieve their results, RAKE takes a simple set of input parameters and automatically extracts keywords in a single pass, making it suitable for a wide range of documents and collections.

Finally, RAKE's simplicity and efficiency enable its use in many applications where keywords can be leveraged.
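Based on the variety and volume of existing collections and the rate at which documents are created and collected, RAKE provides advantages and frees computing resources for other analytic methods.

1.7 Acknowledgements

This work was supported by the National Visualization and Analytics Center™ (NVAC™), which is sponsored by the US Department of Homeland Security Program and located at the Pacific Northwest National Laboratory (PNNL), and by Laboratory Directed Research and Development at PNNL. PNNL is managed for the US Department of Energy by Battelle Memorial Institute under Contract DE-AC05-76RL01830.

We also thank Anette Hulth, for making available the dataset used in her experiments.

References

Andrade M and Valencia A 1998 Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics 14(7), 600-607.

CERATOPS 2009 MPQA Corpus http://www.cs.pitt.edu/mpqa/ceratops/corpora.html.

Engel D, Whitney P, Calapristi A and Brockman F 2009 Mining for emerging technologies within text streams and documents. Proceedings of the Ninth SIAM International Conference on Data Mining. Society for Industrial and Applied Mathematics.

Fox C 1989 A stop list for general text. ACM SIGIR Forum, vol. 24, pp. 19-21. ACM, New York, USA.

Gutwin C, Paynter G, Witten I, Nevill-Manning C and Frank E 1999 Improving browsing in digital libraries with keyphrase indexes. Decision Support Systems 27(1-2), 81-104.

Hulth A 2003 Improved automatic keyword extraction given more linguistic knowledge. Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, vol. 10, pp. 216-223. Association for Computational Linguistics, Morristown, NJ, USA.

Hulth A 2004 Combining machine learning and natural language processing for automatic keyword extraction. Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences (together with KTH).

Jones K 1972 A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28(1), 11-21.

Jones S and Paynter G 2002 Automatic extraction of document keyphrases for use in digital libraries: evaluation and applications. Journal of the American Society for Information Science and Technology,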
