yl loot One ocd \y' je On. Jvc ckig bh Crea (Ok O* b, jon ls pyewdag, CX Chien" & + CxcarCoett )* Vo (ovtto' oy. cfbCbic Yo. (acttste'Y* \d7 | olokuir a RE Ap ctu lorquoge Covsistngy 4 Bhs 4 ay QA 4 wun bagi . Paths ah an (uot. hig ab, dub, ed b'o chouste 2 by euyohy sviy (ar -Lobot netBey J NER) ef C00. toby lout Hes” | 2204 le) olsun o RE fo au} 0 loxsya0g e Gunsishrg 4 Arg ah & ba 9 odd ogi. Ly Combonabi Of 201, ao jb £0. (cuore cub-t bo 4 bK)™ (oxtla) tt (un eh yptat ov (oct) (00+ ab> a+b) * If) | obs RE dul ted LLRD 2 fucken E bop vadte oF lout Hor Lonseahur ole > (Ord* oveloit® w L(p) >AoHYooe( 1 |m2o Lnroy 16 —— aac @ scanned with OKEN Scanner as = WwW i? > psgin pdhu a RE td anyst 47m 4 a's 2 ' Crcbty nulla “7 hos me Aubshng Od Le > Cbhtosyt => we subi aa Cpiaryh > CA fb ends wth b . (btes) (baby UP) = 4 Coro [nzit pldb 0 RE fo ast Shi f) 0 414 Posey mo tuo Cove Wrof* =ty (1.400% 2 (1H Ore COE) istey gol As et fo ctu dius bouguege (ows ba ie wou poi 4 colud ol apheow Pes of chow ead 1's. two Covache Pro « (4M (ote) © + anew? (He) . ©83@ 5 * Cot Loy (142) (Ho * COPE t a pba 0 RE fo uuph Abnig a's &P Ly OF will eto teokodo> petete- G lugin zt J (eb ot 3° @ scanned with OKEN Scanner jy obrus @ RE te wtp? Aris 4 WA ADA afoubry Wut a! gp dolly wut’ b. +> Gr ok Hed any? Shut wth o cae b. afoiryib. as 12] obtun a RE do acupt Bing ) aa f b's wluxe lo Symbol qe tHe wat ud wa. Tb (ett (os) (ate) CY vvite @to? Se) — * @iny*o (oth).Codb) -- >> wer), © . i A wade ra) Dbbur We RE fo actph the wok wile Or Iter lout bemainrag A enduy ith poh Y eA ee Ly as £64 ty as F tectnyt do Cut | onde yal dows “ ww: acaty)* oth ath)* cpt ite Bea wlude i RE do cept rap oA ") orn a ale wer raha 3d both. Lug ) clk BA . [cod) (as) Wage raul 4 pe 4% (Cott) Cot) Coats) * os (toinyCaan)) 24 (CoB) (app)(oth).)” @ scanned with OKEN Scanner \°) obleun a RE po aia Advi, ,) os dbs ju fleck ae Aywhol Jom Alan yigbed 24 0. 4 4h Ayrdol -byown Vu evict 4 I. a5 wget tok With & 9 yu dynbol —~\\ Wb”? 
UM ayvaa Yoon (cx MS) (cull) aie ar 4 eo byA (ety ba (at) at) IPP ob a RE to catuyst Arras qardbls Aula Hot UH, blade 4H rwuds Dryrast Covkule 00 Wot J ot Ly hd ARCDax ay rabold. PR (ALR om aA wha da C oon (out) (ot? AC (AFC AK OA awh on BEDE Foy on (at) ca (ot) po (a4 pom aia phar or BEC efa,by er(oth) (od) (0) ge (RA can as wher of 4D Edaubl Coxtip an (oth). ; BAD on aA whave mm Bp ee tadh) cr COW) cp. COED oun Oh vhuuw a fd & Gai ao 18 4D Ch aby ASCE (bh fa bh @ scanned with OKEN Scanner | es | : [on (cw) (tn) caloetle Doclad>) jealod B) (la 4 Oto Woxlod) | 4 (oad) (ed + (aH) lob) oo] 20) Prove thot gvim ~ Cfo a wl pe or the folly, | he dove Jw Elo D*) ai ae ie i A ub. n ac wp JA t wu FAL yd bak 0 tye doe f Xe VE Gee Onn? Ne | vA ax = | ie wl lh, ge Jou yf - | RedewWher ss | Ah e008 Jalan dpict He shy % vty vw . Juyjen 4 Wh23 hey & \Uken-! A Wir) 4 yy felt sai Flan >» BY ponps lean uv'wet pak 0 Leb v daca oyu’ eae [cana su OS | he dil Etonyty x ost SO dD) hefore )nzob xv vst spun ae , ” Atake wm i he lt gel on aw q tc 0,1,e--- @ scanned with OKEN Scanner 20) | a ath 2 jwl=an 4 Gyyeockw Howe Tt, GLA IC Who Us duds dou luv} en g lel. > bwy Te Se | | Vien A eialaien +! Alle? (®y pores nod 4, luvlele : uv’ w &b yo CeO ben y ceo ,v dau oye POD Bim | epdous : najwnphiow Ane 4o> bape pn vy po ie Janta zo yaa wet vag ula pefam bln taba wat Magulee’: Ae be fens ag “Mohs oo FA hi ie op gn pagum, be ace Fen eater fare m0 4 5 “ULC L aa tyukvas “a Pusyubw K va auaned 4 aly k (ey, Diao ooh) LE Pe (te) yi Halew 2b edat bejnzoy eb ade) bY dwt sgulev dee ful fe a 35 @ scanned with OKEN Scanner “ | ay) Ls §au Jnzo\ a ube Peogulow. @ ty Mis [X29 Br-- h | [ Lonnie dn ares wea pun , Spit aye luvl ze 4 vl?) oe at! = a alk a Sjules ik 12 lezy> So Het luviclult wh Sup Eb freon? - os (al we Wats en nbd (tee ydga der Live uner) = 20 cL niga Clee) ih as wv twthust ine lune) n! 
en (u | Lzfo* | |r oli Wash Sesgulon ay defo" l wis Fresh ot be, afte en AN fs Joo, boo, P9PPO = Mats rr aa Wo f Aedr ve FE eo Eb: Mal ppv % tito Wwe 7 Vebbr hs rsh ust en d 2] a @ scanned with OKEN Scanner Jr? —? Le fe oo, bo: Lin pasulew 4 Le {wl nals) = mo an X= o"= 03 ol prets uvw = eG divi =21 ~ fuvl elude \vle(tls eve “wth GL so) 1;2 wo! [oo no MeL (By paps Luna) i Jalern sole endtk( ft) 4 n prall cro Us 4 the “ ‘menti-n adlen= 1 Les) ler) |e wot afd ° Late "ina prime ‘vast ast Sylow ot daqulor, BL # Seyulor) cucu , oo , phan i nw i yar wo r= ae” et \xufedh en puts x wo || wizl we oes cals b : a ne nn) vvir EL fh iro)l2 + ©, UBy Pex cad(ett) “bb Jace 0,1,2-- ha $ wol yo) > sinplind Anat TOO Regul @ scanned with OKEN Scanner WwW oS 25? pha Sion. ha woh Hag uur b% ola ul Y) Le drolrale) «Mol? iby how Ey io gesqulaw 4 nae eo Mod “Tb. “ye aht pret 1 feud 1 mio Uv pu ton 2 i . jul én Atle t re : aptcor ot bt ae al b mae a wieh-t al Wl 216 Juvpe tale! By Ly buna uv'w Et fev bach b= an ct bY ek fs ot eo ght BEL ey onerteh LiKe!) oy Led Jredwd pG& i -roAltd > fows SlAwy=& A->vw SH w= Produk Trouschows weld ay $e, F) £ oA S (S00 a A 1918 BUI=B Ge$H,6,8 49,54} f2—> of 6(8,0)> oe for 1G Bo.) SLB Aa4y Ge 8 (2G ea kag Vs ee Gow @ scanned with OKEN Scanner Produ uhinet Worl his S—rah 6lga24 Ce: g-ofiod A->af Gln,o) 2A Nn -> bs SA4)= 8 Noe A oy prad- B—> bB 6 (B,5)* L eae © ofa b Fa fth cy shied 6626) Covstuat a RG fi te. oa FL. POS b@V FV Sougthon Frochuuur S(ee)cA 8 Dott S(gwee Sobec Csi, P. §A,e=C Co ve {,4,6,c4 & (Arb) 6 Aol Tafaby 618%) =6 Koa8 he §18,4)= 6 eG sg anlb© Sa) ¢ oa n -2a¢ [8 it wed Cobe R->4B BIE cc [b¢ Ja @ scanned with OKEN Scanner b) | 62) Cowrhe 4 ky 4 Mossy rel Qa. ‘ Ko Io eet HAD sah se) YQ) Ot : Barsihon “Healers , StS aoa earl Cel vii, PS SUH): S $->be Ve mcd Stale Holl redeaby SCH Me 8 Nob ? 
) f dle S(R,c) 24 Ratt Coan 7 SUB) = € (Rabe fi ae S(c,a)2C Cat td seat ng { “ Ca be CPaclhe Slows © \ eoGlud Obkuin om NL (PA polucl acyl Atv 4 wd {ba tlauhey puidh Hu Aduy Ob. r? OO => 640) a - | =) _ ae iy oP > ot me) vig Oe G @ ; (cet) =O) a ab OPO ab (ot bY (amr ¢ 4 h 2(abolouy) 33 @ scanned with OKEN Scanner sft 9 a on gee w the RE OM) 6%) (o*) SS QEG CRO. gsi OD ESO te is epee otic an PA dita RE (0rt5}anr/adbl* > (HS) (oetI* (ol & 4 OPOLs 5 y EO, “ee ae eo ot? € > E046) 34 @ scanned with OKEN Scanner be) @® An(oln)* 0" 4s ae a) ey ay) Os va a —O*0- Wee bag” ~ ie : (ato yoa. (oct 5 ae ON) rege oy) 2 : Y what isthe bugvage Covrerpwols fo te DED 1 Qt AWE? | Go, 5 =m en) Or @-1 749 ¢ Ri au Rif Fe Rs Me ie fete y Rye 7 obka 0 RC. ds dc PN slow below: Rar! Cot | wo) TH ©. eh liegt? Lei)” Re oo. enna tein Cyten) <(eanar" ale = @ scanned with OKEN Scanner echt gi 6 Ce," Ro? zotlen(en? o zott*®O a0) Doe Ra 8 2 gi Rt RA =h4 6 ED*(E4D zh perc ge TRL HR Roe? aoe ces o = (thes 3D keh a 7, {2 "Eo la ft ans RLFRS? ~ ait ss ee ~ ou Ww ¢ Pap’ See 2, Ba i ns ON leon = tot! ‘o {4 0" = volo? qe Ryle Re ply” [Pe OP RO m4 | Tea e01D" p Sy ol) re alee [athe | | Rs Z Ty ? ) on a Sona (E404 AC exaateno" =)" (oD & 3 Heo) 24) TO-6 0. Qa le > Ro (. evo) 285 (oD) Rs _—. ‘aby er FR wet Ra coH Qt oO =e Oo ot ]tp 6°) Re UR 7 Re i ee eb (eayrto eerot? 2, op e tu &Q) gs Ret Fle, aa Roy. i 1 " F 09) TE > \torl *Olot 41) =\ #6 (o1D* you @ scanned with OKEN Scanner ® 8) 95] a Obkun the Vigh luuw Grown na Hu houguase @) te fatto” |nz2,moah b> no yar me bs m2 devil, 8) v= 63,0,84 F148) "4 § Pooh A204 | bobB R->beWle j State $ dade RE (Couty*btb)™ alo rods Qfun,09) AICS AW) Tc Subd fe * sas boil DA 2GbH \oobstlé RneboflabM Sp @ scanned with OKEN Scanner yaad h Keyl? dae Noes (@ Kofi fahue Qoww” _ | Jade OFA {yi lig Dame é>iclre R—> oglealoc foiale . ‘lagi dud frau GrlviT 9D vif G48) Tefoy P- ee Siclie | BK Golto|Co Q-ro@foal oc fails A-SIAle a K=aC Atel? L . 
. at —— = S A SoS 1 x 9) pbb a lyf hice Groomer Js dhe DE ((eabYou) —> —>®) -onseayyor ze ae an x Ge 14 Severe the WE Wy 7 tats Oe 0, OV pon} + Ged (Oe oo UNo0% Q ei lhisan peu Lash beeen Gpamens A> hast bea} baslé A faded Bob) ale 13> to KprwAl aS a froada |6 SS => AB SY Tadu iby PHL 5 abl Gobalduble. (> Kedo| Aaalo Galo Qrée § ope ©) obbro LLG br RLY. 8 > ub A--PbaS Boantb ——) DFA dye RLY egw 35 Proven de OFF Cus be Oe OPER OKO Regd LG Jat h& CbbB C—>Bbb Rab # Bofba A halos A Danie. EL. TPS) ae ic.n,8) Te fayby Q - Ai dc oRb) (J Aho DA ob| CH Voce 7. ——— 36

Chapter 3
Lexical Analysis

In this chapter we show how to construct a lexical analyzer. To implement a lexical analyzer by hand, it helps to start with a diagram or other description for the lexemes of each token. We can then write code to identify each occurrence of each lexeme on the input and to return information about the token identified.

We can also produce a lexical analyzer automatically by specifying the lexeme patterns to a lexical-analyzer generator and compiling those patterns into code that functions as a lexical analyzer. This approach makes it easier to modify a lexical analyzer, since we have only to rewrite the affected patterns, not the entire program. It also speeds up the process of implementing the lexical analyzer, since the programmer specifies the software at the very high level of patterns and relies on the generator to produce the detailed code. We shall introduce in Section 3.5 a lexical-analyzer generator called Lex (or Flex in a more recent embodiment).

We begin the study of lexical-analyzer generators by introducing regular expressions, a convenient notation for specifying lexeme patterns. We show how this notation can be transformed, first into nondeterministic automata and then into deterministic automata. The latter two notations can be used as input to a "driver," that is, code which simulates these automata and uses them as a guide to determining the next token. This driver and the specification of the automaton form the nucleus of the lexical analyzer.

3.1 The Role of the Lexical Analyzer

As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens for each lexeme in the source program. The stream of tokens is sent to the parser for syntax analysis. It is common for the lexical analyzer to interact with the symbol table as well. When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table. In some cases, information regarding the kind of identifier may be read from the symbol table by the lexical analyzer to assist it in determining the proper token it must pass to the parser.

These interactions are suggested in Fig. 3.1. Commonly, the interaction is implemented by having the parser call the lexical analyzer. The call, suggested by the getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.

Figure 3.1: Interactions between the lexical analyzer and the parser

Since the lexical analyzer is the part of the compiler that reads the source text, it may perform certain other tasks besides identification of lexemes. One such task is stripping out comments and whitespace (blank, newline, tab, and perhaps other characters that are used to separate tokens in the input). Another task is correlating error messages generated by the compiler with the source program. For instance, the lexical analyzer may keep track of the number of newline characters seen, so it can associate a line number with each error message. In some compilers, the lexical analyzer makes a copy of the source program with the error messages inserted at the appropriate positions. If the source program uses a macro-preprocessor, the expansion of macros may also be performed by the lexical analyzer.

Sometimes, lexical analyzers are divided into a cascade of two processes:

a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.

b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.

3.1.1 Lexical Analysis Versus Parsing

There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing (syntax analysis) phases.

1. Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks.

2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.

3. Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.

3.1.2 Tokens, Patterns, and Lexemes

When discussing lexical analysis, we use three related but distinct terms:

A token is a pair consisting of a token name and an optional attribute value. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input symbols that the parser processes. In what follows, we shall generally write the name of a token in boldface. We will often refer to a token by its token name.

A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.

A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Example 3.1: Figure 3.2 gives some typical tokens, their informally described patterns, and some sample lexemes. To see how these concepts are used in practice, in the C statement

    printf("Total = %d\n", score);

both printf and score are lexemes matching the pattern for token id, and "Total = %d\n" is a lexeme matching literal.

In many programming languages, the following classes cover most or all of the tokens:
TOKEN        INFORMAL DESCRIPTION                      SAMPLE LEXEMES
if           characters i, f                           if
else         characters e, l, s, e                     else
comparison   < or > or <= or >= or == or !=            <=, !=
id           letter followed by letters and digits     pi, score, D2
number       any numeric constant                      3.14159, 0, 6.02e23
literal      anything but ", surrounded by "'s         "core dumped"

Figure 3.2: Examples of tokens

1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.

2. Tokens for the operators, either individually or in classes such as the token comparison mentioned in Fig. 3.2.

3. One token representing all identifiers.

4. One or more tokens representing constants, such as numbers and literal strings.

5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.

3.1.3 Attributes for Tokens

When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched. For example, the pattern for token number matches both 0 and 1, but it is extremely important for the code generator to know which lexeme was found in the source program. Thus, in many cases the lexical analyzer returns to the parser not only a token name, but an attribute value that describes the lexeme represented by the token; the token name influences parsing decisions, while the attribute value influences translation of tokens after the parse.

We shall assume that tokens have at most one associated attribute, although this attribute may have a structure that combines several pieces of information. The most important example is the token id, where we need to associate with the token a great deal of information. Normally, information about an identifier (e.g., its lexeme, its type, and the location at which it is first found, in case an error message about that identifier must be issued) is kept in the symbol table. Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table entry for that identifier.

Tricky Problems When Recognizing Tokens

Usually, given the pattern describing the lexemes of a token, it is relatively simple to recognize matching lexemes when they occur on the input. However, in some languages it is not immediately apparent when we have seen an instance of a lexeme corresponding to a token. The following example is taken from Fortran, in the fixed-format still allowed in Fortran 90. In the statement

    DO 5 I = 1.25

it is not apparent that the first lexeme is DO5I, an instance of the identifier token, until we see the dot following the 1. Note that blanks in fixed-format Fortran are ignored (an archaic convention). Had we seen a comma instead of the dot, we would have had a do-statement

    DO 5 I = 1,25

in which the first lexeme is the keyword DO.

Example 3.2: The token names and associated attribute values for the Fortran statement

    E = M * C ** 2

are written below as a sequence of pairs.

    <id, pointer to symbol-table entry for E>
    <assign_op>
    <id, pointer to symbol-table entry for M>
    <mult_op>
    <id, pointer to symbol-table entry for C>
    <exp_op>
    <number, integer value 2>

Note that in certain pairs, especially operators, punctuation, and keywords, there is no need for an attribute value. In this example, the token number has been given an integer-valued attribute. In practice, a typical compiler would instead store a character string representing the constant and use as an attribute value for number a pointer to that string.

3.1.4 Lexical Errors

It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error. For instance, if the string fi is encountered for the first time in a C program in the context:

    fi ( a == f(x)) ...

a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler (probably the parser in this case) handle an error due to transposition of the letters.

However, suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input. The simplest recovery strategy is "panic mode" recovery. We delete successive characters from the remaining input, until the lexical analyzer can find a well-formed token at the beginning of what input is left. This recovery technique may confuse the parser, but in an interactive computing environment it may be quite adequate.

Other possible error-recovery actions are:

1. Delete one character from the remaining input.

2. Insert a missing character into the remaining input.

3. Replace a character by another character.

4. Transpose two adjacent characters.

Transformations like these may be tried in an attempt to repair the input. The simplest such strategy is to see whether a prefix of the remaining input can be transformed into a valid lexeme by a single transformation.
This strategy makes sense, since in practice most lexical errors involve a single character. A more general correction strategy is to find the smallest number of transformations needed to convert the source program into one that consists only of valid lexemes, but this approach is considered too expensive in practice to be worth the effort.

3.1.5 Exercises for Section 3.1

Exercise 3.1.1: Divide the following C++ program:

    float limitedSquare(x) float x; {
        /* returns x-squared, but never more than 100 */
        return (x<=-10.0||x>=10.0)?100:x*x;
    }

into appropriate lexemes, using the discussion of Section 3.1.2 as a guide. Which lexemes should get associated lexical values? What should those values be?

Exercise 3.1.2: Tagged languages like HTML or XML are different from conventional programming languages in that the punctuation (tags) are either very numerous (as in HTML) or a user-definable set (as in XML). Further, tags can often have parameters. Suggest how to divide the following HTML document:

    Here is a photo of <B>my house</B>:
    <P><IMG SRC = "house.gif"><BR>
    See <A HREF = "morePix.html">More Pictures</A> if you
    liked that one.<P>

into appropriate lexemes. Which lexemes should get associated lexical values, and what should those values be?

3.2 Input Buffering

Before discussing the problem of recognizing lexemes in the input, let us examine some ways that the simple but important task of reading the source program can be speeded. This task is made difficult by the fact that we often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme. The box on "Tricky Problems When Recognizing Tokens" in Section 3.1 gave an extreme example, but there are many situations where we need to look at least one additional character ahead. For instance, we cannot be sure we've seen the end of an identifier until we see a character that is not a letter or digit, and therefore is not part of the lexeme for id. In C, single-character operators like -, =, or < could also be the beginning of a two-character operator like ->, ==, or <=. Thus, we shall introduce a two-buffer scheme that handles large lookaheads safely. We then consider an improvement involving "sentinels" that saves time checking for the ends of buffers.

3.2.1 Buffer Pairs

Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character. An important scheme involves two buffers that are alternately reloaded, as suggested in Fig. 3.3.

Figure 3.3: Using a pair of input buffers

Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes. Using one system read command we can read N characters into a buffer, rather than using one system call per character. If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file and is different from any possible character of the source program.

Two pointers to the input are maintained:

1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.

2. Pointer forward scans ahead until a pattern match is found; the exact strategy whereby this determination is made will be covered in the balance of this chapter.

Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found.
In Fig. 3.3, we see forward has passed the end of the next lexeme, ** (the Fortran exponentiation operator), and must be retracted one position to its left.

Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input, and move forward to the beginning of the newly loaded buffer. As long as we never need to look so far ahead of the actual lexeme that the sum of the lexeme's length plus the distance we look ahead is greater than N, we shall never overwrite the lexeme in its buffer before determining it.

3.2.2 Sentinels

If we use the scheme of Section 3.2.1 as described, we must check, each time we advance forward, that we have not moved off one of the buffers; if we do, then we must also reload the other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character is read (the latter may be a multiway branch). We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.

Figure 3.4 shows the same arrangement as Fig. 3.3, but with the sentinels added. Note that eof retains its use as a marker for the end of the entire input. Any eof that appears other than at the end of a buffer means that the input is at an end. Figure 3.5 summarizes the algorithm for advancing forward. Notice how the first test, which can be part of a multiway branch based on the character pointed to by forward, is the only test we make, except in the case where we actually are at the end of a buffer or the end of the input.

Figure 3.4: Sentinels at the end of each buffer

    switch ( *forward++ ) {
        case eof:
            if ( forward is at end of first buffer ) {
                reload second buffer;
                forward = beginning of second buffer;
            }
            else if ( forward is at end of second buffer ) {
                reload first buffer;
                forward = beginning of first buffer;
            }
            else /* eof within a buffer marks the end of input */
                terminate lexical analysis;
            break;
        Cases for the other characters
    }

Figure 3.5: Lookahead code with sentinels

Can We Run Out of Buffer Space?

In most modern languages, lexemes are short, and one or two characters of lookahead is sufficient. Thus a buffer size N in the thousands is ample, and the double-buffer scheme of Section 3.2.1 works without problem. However, there are some risks. For example, if character strings can be very long, extending over many lines, then we could face the possibility that a lexeme is longer than N. To avoid problems with long character strings, we can treat them as a concatenation of components, one from each line over which the string is written. For instance, in Java it is conventional to represent long strings by writing a piece on each line and concatenating pieces with a + operator at the end of each piece.

A more difficult problem occurs when arbitrarily long lookahead may be needed. For example, some languages like PL/I do not treat keywords as reserved; that is, you can use identifiers with the same name as a keyword like DECLARE. If the lexical analyzer is presented with text of a PL/I program that begins DECLARE ( ARG1, ARG2, ... it cannot be sure whether DECLARE is a keyword, and ARG1 and so on are variables being declared, or whether DECLARE is a procedure name with its arguments. For this reason, modern languages tend to reserve their keywords. However, if not, one can treat a keyword like DECLARE as an ambiguous identifier, and let the parser resolve the issue, perhaps in conjunction with symbol-table lookup.

Implementing Multiway Branches

We might imagine that the switch in Fig. 3.5 requires many steps to execute, and that placing the case eof first is not a wise choice. Actually, it doesn't matter in what order we list the cases for each character. In practice, a multiway branch depending on the input character is made in one step by jumping to an address found in an array of addresses, indexed by characters.

3.3 Specification of Tokens

Regular expressions are an important notation for specifying lexeme patterns. While they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens. In this section we shall study the formal notation for regular expressions, and in Section 3.5 we shall see how these expressions are used in a lexical-analyzer generator. Then, Section 3.7 shows how to build the lexical analyzer by converting regular expressions to automata that perform the recognition of the specified tokens.

3.3.1 Strings and Languages

An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and punctuation. The set {0, 1} is the binary alphabet. ASCII is an important example of an alphabet; it is used in many software systems.

If we think of concatenation as a product, we can define the "exponentiation" of strings as follows. Define $s^0$ to be $\epsilon$, and for all $i > 0$, define $s^i$ to be $s^{i-1}s$. Since $\epsilon s = s$, it follows that $s^1 = s$. Then $s^2 = ss$, $s^3 = sss$, and so on.

3.3.2 Operations on Languages

In lexical analysis, the most important operations on languages are union, concatenation, and closure, which are defined formally in Fig. 3.6. Union is the familiar operation on sets.
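The concatenation of languages is all strings formed by taking a string from the first language and a string from the second language, in all possible ways, and concatenating them.

Example 3.6: Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234, 6.336E4, or 1.89E-4. The regular definition

    digit            → 0 | 1 | ... | 9
    digits           → digit digit*
    optionalFraction → . digits | ε
    optionalExponent → ( E ( + | - | ε ) digits ) | ε
    number           → digits optionalFraction optionalExponent

is a precise specification for this set of strings. That is, an optionalFraction is either a decimal point (dot) followed by one or more digits, or it is missing (the empty string). An optionalExponent, if not missing, is the letter E followed by an optional + or - sign, followed by one or more digits. Note that at least one digit must follow the dot, so number does not match 1., but does match 1.0.

3.3.5 Extensions of Regular Expressions

Since Kleene introduced regular expressions with the basic operators for union, concatenation, and Kleene closure in the 1950s, many extensions have been added to regular expressions to enhance their ability to specify string patterns. Here we mention a few notational extensions that were first incorporated into Unix utilities such as Lex and that are particularly useful in the specification of lexical analyzers. The references to this chapter contain a discussion of some regular-expression variants in use today.

1. One or more instances. The unary, postfix operator + represents the positive closure of a regular expression and its language. That is, if r is a regular expression, then $(r)^+$ denotes the language $(L(r))^+$. The operator + has the same precedence and associativity as the operator *. Two useful algebraic laws, $r^* = r^+ | \epsilon$ and $r^+ = rr^* = r^*r$, relate the Kleene closure and positive closure.

2. Zero or one instance. The unary postfix operator ? means "zero or one occurrence." That is, r? is equivalent to r|ε, or put another way, $L(r?) = L(r) \cup \{\epsilon\}$. The ? operator has the same precedence and associativity as * and +.

3. Character classes. A regular expression $a_1 | a_2 | \cdots | a_n$, where the $a_i$'s are each symbols of the alphabet, can be replaced by the shorthand $[a_1 a_2 \cdots a_n]$. More importantly, when $a_1, a_2, \ldots, a_n$ form a logical sequence, e.g., consecutive uppercase letters, lowercase letters, or digits, we can replace them by $a_1$-$a_n$, that is, just the first and last separated by a hyphen. Thus, [abc] is shorthand for a|b|c, and [a-z] is shorthand for a|b|...|z.

Example 3.7: Using these shorthands, we can rewrite the regular definition of Example 3.5 as:

    letter_ → [A-Za-z_]
    digit   → [0-9]
    id      → letter_ ( letter_ | digit )*

The regular definition of Example 3.6 can also be simplified:

    digit  → [0-9]
    digits → digit+
    number → digits ( . digits )? ( E [+-]? digits )?

3.3.6 Exercises for Section 3.3

Exercise 3.3.1: Consult the language reference manuals to determine (i) the sets of characters that form the input alphabet (excluding those that may only appear in character strings or comments), (ii) the lexical form of numerical constants, and (iii) the lexical form of identifiers, for each of the following languages: (a) C (b) C++ (c) C# (d) Fortran (e) Java (f) Lisp (g) SQL.

Exercise 3.3.2: Describe the languages denoted by the following regular expressions:

a) a(a|b)*a.

b) ((ε|a)b*)*.

c) (a|b)*a(a|b)(a|b).

d) a*ba*ba*ba*.

!! e) (aa|bb)*((ab|ba)(aa|bb)*(ab|ba)(aa|bb)*)*.

Exercise 3.3.3: In a string of length n, how many of the following are there?

a) Prefixes.

b) Suffixes.

c) Proper prefixes.

! d) Substrings.

! e) Subsequences.

Exercise 3.3.4: Most languages are case sensitive, so keywords can be written only one way, and the regular expressions describing their lexemes are very simple. However, some languages, like SQL, are case insensitive, so a keyword can be written either in lowercase or in uppercase, or in any mixture of cases. Thus, the SQL keyword SELECT can also be written select, Select, or sElEcT, for instance. Show how to write a regular expression for a keyword in a case-insensitive language. Illustrate the idea by writing the expression for "select" in SQL.

Exercise 3.3.5: Write regular definitions for the following languages:

a) All strings of lowercase letters that contain the five vowels in order.

b) All strings of lowercase letters in which the letters are in ascending lexicographic order.

c) Comments, consisting of a string surrounded by /* and */, without an intervening */, unless it is inside double-quotes (").

d) All strings of digits with no repeated digits. Hint: Try this problem first with a few digits, such as {0, 1, 2}.

e) All strings of digits with at most one repeated digit.

f) All strings of a's and b's with an even number of a's and an odd number of b's.

g) The set of Chess moves, in the informal notation, such as p-k4 or kbp×qn.

h) All strings of a's and b's that do not contain the substring abb.

i) All strings of a's and b's that do not contain the subsequence abb.

Exercise 3.3.6: Write character classes for the following sets of characters:

a) The first ten letters (up to "j") in either upper or lower case.

b) The lowercase consonants.

c) The "digits" in a hexadecimal number (choose either upper or lower case for the "digits" above 9).

d) The characters that can appear at the end of a legitimate English sentence (e.g., exclamation point).

The following exercises, up to and including Exercise 3.3.10, discuss the extended regular-expression notation from Lex (the lexical-analyzer generator that we shall discuss extensively in Section 3.5). The extended notation is listed in Fig. 3.8.

Exercise 3.3.7: Note that these regular expressions give all of the following symbols (operator characters) a special meaning:

    \ " . ^ $ [ ] * + ? { } | /

Their special meaning must be turned off if they are needed to represent themselves in a character string. We can do so by quoting the character within a string of length one or more; e.g., the regular expression "**" matches the string **. We can also get the literal meaning of an operator character by preceding it by a backslash. Thus, the regular expression \*\* also matches the string **. Write a regular expression that matches the string "\.

Exercise 3.3.8: In Lex, a complemented character class represents any character except the ones listed in the character class. We denote a complemented class by using ^ as the first character; this symbol (caret) is not itself part of the class being complemented, unless it is listed within the class itself. Thus, [^A-Za-z] matches any character that is not an uppercase or lowercase letter, and [^\^] represents any character but the caret (or newline, since newline cannot be in any character class). Show that for every regular expression with complemented character classes, there is an equivalent regular expression without complemented character classes.

    EXPRESSION   MATCHES                                  EXAMPLE
    c            the one non-operator character c         a
    \c           character c literally                    \*
    "s"          string s literally                       "**"
    .            any character but newline                a.*b
    ^            beginning of a line                      ^abc
    $            end of a line                            abc$
    [s]          any one of the characters in string s    [abc]
    [^s]         any one character not in string s        [^abc]
    r*           zero or more strings matching r          a*
    r+           one or more strings matching r           a+
    r?           zero or one r                            a?
    r{m,n}       between m and n occurrences of r         a{1,5}
    r1r2         an r1 followed by an r2                  ab
    r1|r2        an r1 or an r2                           a|b
    (r)          same as r                                (a|b)
    r1/r2        r1 when followed by r2                   abc/123

Figure 3.8: Lex regular expressions

Exercise 3.3.9: The regular expression r{m,n} matches from m to n occurrences of the pattern r. For example, a{1,5} matches a string of one to five a's. Show that for every regular expression containing repetition operators of this form, there is an equivalent regular expression without repetition operators.

Exercise 3.3.10: The operator ^ matches the left end of a line, and $ matches the right end of a line. The operator ^ is also used to introduce complemented character classes, but the context always makes it clear which meaning is intended. For example, ^[^aeiou]*$ matches any complete line that does not contain a lowercase vowel.

a) How do you tell which meaning of ^ is intended?

b) Can you always replace a regular expression using the ^ and $ operators by an equivalent expression that does not use either of these operators?

Exercise 3.3.11: The UNIX shell command sh uses the operators in Fig. 3.9 in filename expressions to describe sets of file names. For example, the filename expression *.o matches all file names ending in .o; sort1.? matches all file names of the form sort1.c, where c is any character. Show how sh filename expressions can be replaced by equivalent regular expressions.

    EXPRESSION   MATCHES                EXAMPLE
    's'          string s literally     '\'
    \c           character c literally  \'
    *            any string             *.o
    ?            any character          sort1.?
    [s]          any character in s     sort1.[cso]

Figure 3.9: Filename expressions used by the shell command sh

Exercise 3.3.12: SQL allows a rudimentary form of patterns in which two characters have special meaning: underscore (_) stands for any one character and percent-sign (%) stands for any string of 0 or more characters. In addition, the programmer may define any character, say e, to be the escape character, so e preceding _, %, or another e gives the character that follows its literal meaning. Show how to express any SQL pattern as a regular expression, given that we know which character is the escape character.

3.4 Recognition of Tokens

In the previous section we learned how to express patterns using regular expressions. Now, we must study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns. Our discussion will make use of the following running example.

    stmt → if expr then stmt
         | if expr then stmt else stmt
         | ε
    expr → term relop term
         | term
    term → id
         | number

Figure 3.10: A grammar for branching statements

Example 3.8: The grammar fragment of Fig. 3.10 describes a simple form of branching statements and conditional expressions. This syntax is similar to that of the language Pascal, in that then appears explicitly after conditions. For relop, we use the comparison operators of languages like Pascal or SQL, where = is "equals" and <> is "not equals," because it presents an interesting structure of lexemes.

The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as far as the lexical analyzer is concerned. The patterns for these tokens are described using regular definitions, as in Fig. 3.11. The patterns for id and number are similar to what we saw in Example 3.7.

    digit  → [0-9]
    digits → digit+
    number → digits ( . digits )? ( E [+-]? digits )?
    letter → [A-Za-z]
    id     → letter ( letter | digit )*
    if     → if
    then   → then
    else   → else
    relop  → < | > | <= | >= | = | <>

Figure 3.11: Patterns for tokens of Example 3.8

For this language, the lexical analyzer will recognize the keywords if, then, and else, as well as lexemes that match the patterns for relop, id, and number. To simplify matters, we make the common assumption that keywords are also reserved words: that is, they are not identifiers, even though their lexemes match the pattern for identifiers.

In addition, we assign the lexical analyzer the job of stripping out whitespace, by recognizing the "token" ws defined by:

    ws → ( blank | tab | newline )+

Here, blank, tab, and newline are abstract symbols that we use to express the ASCII characters of the same names. Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the whitespace. It is the following token that gets returned to the parser.

Our goal for the lexical analyzer is summarized in Fig. 3.12. That table shows, for each lexeme or family of lexemes, which token name is returned to the parser and what attribute value, as discussed in Section 3.1.3, is returned. Note that for the six relational operators, symbolic constants LT, LE, and so on are used as the attribute value, in order to indicate which instance of the token relop we have found. The particular operator found will influence the code that is output from the compiler.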
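The token patterns of the running example and the token/attribute pairs that the lexical analyzer is expected to return can be tried out with a small sketch. It is only an illustration under stated assumptions: it uses Python's `re` module instead of the transition diagrams developed in this section, the names `TOKEN_SPEC`, `tokenize`, and `table` are hypothetical, and a list index stands in for the "pointer to table entry" attribute.

```python
import re

# Patterns for ws, number, id and relop, rewritten in Python's re syntax.
# Python tries alternatives left to right, so two-character relops are
# listed before their one-character prefixes.
TOKEN_SPEC = [
    ("ws",     r"[ \t\n]+"),
    ("number", r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("relop",  r"<=|<>|>=|<|=|>"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

KEYWORDS = {"if", "then", "else"}          # reserved words of the example
RELOP_ATTR = {"<": "LT", "<=": "LE", "=": "EQ",
              "<>": "NE", ">": "GT", ">=": "GE"}


def tokenize(text):
    """Return (tokens, table); each token is a (name, attribute) pair."""
    tokens = []
    table = []                             # hypothetical stand-in for the symbol table
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"no token pattern matches at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                       # whitespace is never returned to the parser
        if kind == "id" and lexeme in KEYWORDS:
            tokens.append((lexeme, None))  # keyword tokens carry no attribute
        elif kind == "relop":
            tokens.append(("relop", RELOP_ATTR[lexeme]))
        else:
            if lexeme not in table:
                table.append(lexeme)
            tokens.append((kind, table.index(lexeme)))
    return tokens, table


tokens, table = tokenize("if x1 <= 3.5E2 then y <> x1")
for tok in tokens:
    print(tok)
```

Because the pattern for number requires a digit after the dot, an input such as `1.` makes this sketch raise an error at the dot rather than accept it as part of the number.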
ene fognyy sec F310 dace a inp iat tn, “Thin ata ile ‘iar expiily after coud @ scanned with OKEN Scanner CHAPTER HESICAL ayy, e Pointer to table entry aumnber | Pointer to table entry rt Le 3.4.1 Transition Diagrams Asan terete spin the costco of leical analyzer, we fst omen yttens ito slid fora called “ransiton diagrams.” Tn this tng ‘peor the sven fm egul-expesion patterns to tranlon a {Gams by land, bin Sion 3.6, we shal ee that there is @ mechani wy ‘oqmsres thse diagrams fom coleton of regular expressions Transition agra havea collection ofuades or ctces, called sates. Ea ‘2ir prs s condom that could our ding the process of s he tig rs ene at maces oe of svt patiern. Wea ‘snmarig ll ened to know about what ine set oem i embod ‘situation of Fig. 3.3), eee Fly sc eel fen oe ; afi del tom oe sae of he wansidon dagen to aie, fate ae aba ot of symbol If wt ee acne Ha ty aed LPM tba welick ran edge cet a ana a 28 ford yo el) Ewe Bnd such an de, acm eB hal wore a ean nea emit etn et transition dag haben more han one ee out of 4 8 tab ating Fret cai of deeming HSS Starting in Section 35, We of a an ai AME ema ean da i 7 rt dag inpementer, Some ipa { + Ceuttarn ae sie ti, All posits beeen ty, OO th eH 30 forward pointers, We a fal, Those states indicate th, actual exe consi me may not as OGNITION OF TOKEN gh RE Inlieate an accepting state by a double cite, nnd if thee is an and tobe taken — typically returning a token and an attibte vale tot parser — we shall attach that action tothe secepting state 2, In audition, if tis necessary to retract the forwerd pointer one peniton (es the lexeme des not inch the symb that got us to the necting, state), then we shall additionally plare a * near that accepting state, Ig ‘our example, itis never necessary to retract forward by more than on position, but if were, we could attach any aumber of "tothe accepting state. 
3. One state is designated the start state, or initial state; it is indicated by an edge, labeled "start," entering from nowhere. The transition diagram always begins in the start state before any input symbols have been read.

Example 3.9: Figure 3.13 is a transition diagram that recognizes the lexemes matching the token relop. We begin in state 0, the start state. If we see < as the first input symbol, then among the lexemes that match the pattern for relop we can only be looking at <, <>, or <=. We therefore go to state 1, and look at the next character. If it is =, then we recognize lexeme <=, enter state 2, and return the token relop with attribute LE, the symbolic constant representing this particular comparison operator. If in state 1 the next character is >, then instead we have lexeme <>, and enter state 3 to return an indication that the not-equals operator has been found. On any other character, the lexeme is <, and we enter state 4 to return that information. Note, however, that state 4 has a * to indicate that we must retract the input one position.

Figure 3.13: Transition diagram for relop

On the other hand, if in state 0 the first character we see is =, then this one character must be the lexeme. We immediately return that fact from state 5. The remaining possibility is that the first character is >. Then, we must enter state 6 and decide, on the basis of the next character, whether the lexeme is >= (if we next see the = sign), or just > (on any other character). Note that if, in state 0, we see any character besides <, =, or >, we cannot possibly be seeing a relop lexeme, so this transition diagram will not be used.

3.4.2 Recognition of Reserved Words and Identifiers

Recognizing keywords and identifiers presents a problem. Usually, keywords like if or then are reserved (as they are in our running example), so they are not identifiers even though they look like identifiers. Thus, although we typically use a transition diagram like that of Fig. 3.14 to search for identifier lexemes, this diagram will also recognize the keywords if, then, and else of our running example.

Figure 3.14: A transition diagram for id's and keywords

There are two ways that we can handle reserved words that look like identifiers:

1. Install the reserved words in the symbol table initially.
A field of the symbol-table entry indicates that these strings are never ordinary identifiers, and tells which token they represent. We have supposed that this method is in use in Fig. 3.14. When we find an identifier, a call to installID places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found. Of course, any identifier not in the symbol table during lexical analysis cannot be a reserved word, so its token is id. The function getToken examines the symbol table entry for the lexeme found, and returns whatever token name the symbol table says this lexeme represents: either id or one of the keyword tokens that was initially installed in the table.

2. Create separate transition diagrams for each keyword; an example for the keyword then is shown in Fig. 3.15. Note that such a transition diagram consists of states representing the situation after each successive letter of the keyword is seen, followed by a test for a "nonletter-or-digit," i.e., any character that cannot be the continuation of an identifier. It is necessary to check that the identifier has ended, or else we would return token then in situations where the correct token was id, with a lexeme like thenextvalue that has then as a proper prefix. If we adopt this approach, then we must prioritize the tokens so that the reserved-word tokens are recognized in preference to id, when the lexeme matches both patterns. We do not use this approach in our example, which is why the states in Fig. 3.15 are unnumbered.

Figure 3.15: Hypothetical transition diagram for the keyword then

3.4.3 Completion of the Running Example

The transition diagram for id's that we saw in Fig. 3.14 has a simple structure. Starting in state 9, it checks that the lexeme begins with a letter and goes to state 10 if so. We stay in state 10 as long as the input contains letters and digits. When we first encounter anything but a letter or digit, we go to state 11 and accept the lexeme found. Since the last character is not part of the identifier, we must retract the input one position, and as discussed in Section 3.4.2, we enter what we have found in the symbol table and determine whether we have a keyword or a true identifier.

The transition diagram for token number is shown in Fig. 3.16, and is so far the most complex diagram we have seen. Beginning in state 12, if we see a digit, we go to state 13. In that state, we can read any number of additional digits. However, if we see anything but a digit, dot, or E, we have seen a number in the form of an integer; 123 is an example. That case is handled by entering state 20, where we return token number and a pointer to a table of constants where the found lexeme is entered. These mechanics are not shown on the diagram but are analogous to the way we handled identifiers.

Figure 3.16: A transition diagram for unsigned numbers

If we instead see a dot in state 13, then we have an "optional fraction." State 14 is entered, and we look for one or more additional digits; state 15 is used for that purpose. If we see an E, then we have an "optional exponent," whose recognition is the job of states 16 through 19. Should we, in state 15, instead see anything but E or a digit, then we have come to the end of the fraction, there is no exponent, and we return the lexeme found, via state 21.

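The transition diagrams for relop (Fig. 3.13), for id's and keywords (Fig. 3.14), and for unsigned numbers (Fig. 3.16) can be simulated directly in code. The following is only a minimal sketch under stated assumptions, not the text's own implementation: the names KEYWORDS, peek, get_relop, get_id, get_number, and tokenize are hypothetical, the lexeme itself stands in for the "pointer to table entry" attribute, and a keyword set plays the role of the reserved words installed in the symbol table. Retraction (the * on an accepting state) is modeled by simply not consuming the lookahead character. Where the number diagram would get stuck (a dot or E not followed by a digit), the sketch stops at the longest valid number, which is a simplification of its own.

```python
# Sketch only: every name here is an illustrative assumption, not from the text.
KEYWORDS = {"if", "then", "else"}  # stands in for reserved words in the symbol table

def peek(s, i):
    """Return the character at index i, or '' at end of input."""
    return s[i] if i < len(s) else ""

def is_letter(c):
    return c.isascii() and c.isalpha()

def is_digit(c):
    return c.isascii() and c.isdigit()

def get_relop(s, begin):
    """Simulate the relop diagram; return (token, attribute, next index) or None."""
    c = peek(s, begin)
    if c == "<":                                  # state 0 -> state 1
        nxt = peek(s, begin + 1)
        if nxt == "=":
            return ("relop", "LE", begin + 2)     # state 2
        if nxt == ">":
            return ("relop", "NE", begin + 2)     # state 3
        return ("relop", "LT", begin + 1)         # state 4 (*): lookahead not consumed
    if c == "=":
        return ("relop", "EQ", begin + 1)         # state 5
    if c == ">":                                  # state 0 -> state 6
        if peek(s, begin + 1) == "=":
            return ("relop", "GE", begin + 2)
        return ("relop", "GT", begin + 1)         # retract: lookahead not consumed
    return None                                   # this diagram does not apply

def get_id(s, begin):
    """Simulate the id/keyword diagram (states 9 to 11)."""
    if not is_letter(peek(s, begin)):             # state 9 requires a letter
        return None
    forward = begin + 1
    while is_letter(peek(s, forward)) or is_digit(peek(s, forward)):
        forward += 1                              # state 10: letters and digits
    lexeme = s[begin:forward]                     # state 11 (*): stop before lookahead
    if lexeme in KEYWORDS:                        # plays the role of getToken
        return (lexeme, None, forward)
    return ("id", lexeme, forward)

def get_number(s, begin):
    """Simulate the unsigned-number diagram (states 12 to 21)."""
    if not is_digit(peek(s, begin)):              # state 12 requires a digit
        return None
    forward = begin + 1
    while is_digit(peek(s, forward)):             # state 13: more digits
        forward += 1
    if peek(s, forward) == "." and is_digit(peek(s, forward + 1)):
        forward += 2                              # states 14 and 15: optional fraction
        while is_digit(peek(s, forward)):
            forward += 1
    if peek(s, forward) == "E":                   # states 16 to 19: optional exponent
        j = forward + 1
        if peek(s, j) in ("+", "-"):
            j += 1
        if is_digit(peek(s, j)):
            while is_digit(peek(s, j)):
                j += 1
            forward = j
    return ("number", s[begin:forward], forward)

def tokenize(s):
    """Return (token, attribute) pairs for s; whitespace (ws) is skipped, not returned."""
    tokens, i = [], 0
    while i < len(s):
        if s[i] in " \t\n":
            i += 1
            continue
        for recognizer in (get_relop, get_id, get_number):
            result = recognizer(s, i)
            if result is not None:
                token, attr, i = result
                tokens.append((token, attr))
                break
        else:
            raise ValueError(f"unexpected character {s[i]!r} at position {i}")
    return tokens
```

With this sketch, tokenize("if a <= 12 then b") yields the keyword tokens if and then, two id tokens, one relop token with attribute LE, and one number token. Because the identifier loop runs until a nonletter-or-digit is seen before the keyword set is consulted, a lexeme such as thenextvalue comes back as a single id rather than as the keyword then.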