Module2 ATCD
[...] appropriate lexemes. Which lexemes should get associated lexical values? What should those values be?

3.2 Input Buffering

Before discussing the problem of recognizing lexemes in the input, let us examine some ways that the simple but important task of reading the source program can be speeded. This task is made difficult by the fact that we often have to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme. The box on "Tricky Problems When Recognizing Tokens" in Section 3.1 gave an extreme example, but there are many situations where we need to look at least one additional character ahead. For instance, we cannot be sure we've seen the end of an identifier until we see a character that is not a letter or digit, and therefore is not part of the lexeme for id. In C, single-character operators like -, =, or < could also be the beginning of a two-character operator like ->, ==, or <=. Thus, we shall introduce a two-buffer scheme that handles large lookaheads safely. We then consider an improvement involving "sentinels" that saves time checking for the ends of buffers.

3.2.1 Buffer Pairs

Because of the amount of time taken to process characters and the large number of characters that must be processed during the compilation of a large source program, specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character. An important scheme involves two buffers that are alternately reloaded, as suggested in Fig. 3.3.

Figure 3.3: Using a pair of input buffers

Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096 bytes. Using one system read command we can read N characters into a buffer, rather than using one system call per character. If fewer than N characters remain in the input file, then a special character, represented by eof, marks the end of the source file and is different from any possible character of the source program.

Two pointers to the input are maintained:

1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.

2. Pointer forward scans ahead until a pattern match is found; the exact strategy whereby this determination is made will be covered in the balance of this chapter.

Once the next lexeme is determined, forward is set to the character at its right end. Then, after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin is set to the character immediately after the lexeme just found.
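The buffer-pair idea can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the scheme's definitive implementation: the class name BufferPair, the method next_char, and the reads counter are hypothetical names introduced here; an empty result from read stands in for eof; and lexemeBegin is left out so that only the movement of forward is shown.

```python
import io

class BufferPair:
    """Two n-character buffers that are reloaded alternately (illustrative names)."""

    def __init__(self, stream, n=4096):
        self.stream = stream
        self.n = n
        self.buffers = [stream.read(n), ""]  # one read call fills the first buffer
        self.active = 0    # index of the buffer that `forward` is scanning
        self.forward = 0   # position of the next character in that buffer
        self.reads = 1     # number of read calls issued so far

    def next_char(self):
        """Return the next input character, or None at end of input."""
        buf = self.buffers[self.active]
        if self.forward == len(buf):          # have we run off this buffer?
            if len(buf) < self.n:             # a short buffer means the input ended
                return None
            self.active = 1 - self.active     # switch to the other buffer
            self.buffers[self.active] = self.stream.read(self.n)
            self.reads += 1
            self.forward = 0
            buf = self.buffers[self.active]
            if not buf:                       # nothing was left to load
                return None
        ch = buf[self.forward]
        self.forward += 1
        return ch
```

With a deliberately tiny buffer, say BufferPair(io.StringIO("x = y ** 2"), n=4), the ten characters come back in order while only three read calls are issued, which is the saving the scheme is after: one read per buffer rather than one per character.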
In Fig. 3.3, we see forward has passed the end of the next lexeme, ** (the Fortran exponentiation operator), and must be retracted one position to its left.

Advancing forward requires that we first test whether we have reached the end of one of the buffers, and if so, we must reload the other buffer from the input, and move forward to the beginning of the newly loaded buffer. As long as we never need to look so far ahead of the actual lexeme that the sum of the lexeme's length plus the distance we look ahead is greater than N, we shall never overwrite the lexeme in its buffer before determining it.

3.2.2 Sentinels

If we use the scheme of Section 3.2.1 as described, we must check, each time we advance forward, that we have not moved off one of the buffers; if we do, then we must also reload the other buffer. Thus, for each character read, we make two tests: one for the end of the buffer, and one to determine what character is read (the latter may be a multiway branch). We can combine the buffer-end test with the test for the current character if we extend each buffer to hold a sentinel character at the end. The sentinel is a special character that cannot be part of the source program, and a natural choice is the character eof.

Figure 3.4 shows the same arrangement as Fig. 3.3, but with the sentinels added. Note that eof retains its use as a marker for the end of the entire input. Any eof that appears other than at the end of a buffer means that the input is at an end. Figure 3.5 summarizes the algorithm for advancing forward. Notice how the first test, which can be part of a multiway branch based on the character pointed to by forward, is the only test we make, except in the case where we actually are at the end of a buffer or the end of the input.

Figure 3.4: Sentinels at the end of each buffer

    switch ( *forward++ ) {
        case eof:
            if (forward is at end of first buffer) {
                reload second buffer;
                forward = beginning of second buffer;
            }
            else if (forward is at end of second buffer) {
                reload first buffer;
                forward = beginning of first buffer;
            }
            else /* eof within a buffer marks the end of input */
                terminate lexical analysis;
            break;
        Cases for the other characters
    }

Figure 3.5: Lookahead code with sentinels

Implementing Multiway Branches

We might imagine that the switch in Fig. 3.5 requires many steps to execute, and that placing the case eof first is not a wise choice. Actually, it doesn't matter in what order we list the cases for each character. In practice, a multiway branch depending on the input character is made in one step by jumping to an address found in an array of addresses, indexed by characters.

Can We Run Out of Buffer Space?

In most modern languages, lexemes are short, and one or two characters of lookahead is sufficient. Thus a buffer size N in the thousands is ample, and the double-buffer scheme of Section 3.2.1 works without problem. However, there are some risks. For example, if character strings can be very long, extending over many lines, then we could face the possibility that a lexeme is longer than N. To avoid problems with long character strings, we can treat them as a concatenation of components, one from each line over which the string is written. For instance, in Java it is conventional to represent long strings by writing a piece on each line and concatenating pieces with a + operator at the end of each piece.

A more difficult problem occurs when arbitrarily long lookahead may be needed. For example, some languages like PL/I do not treat keywords as reserved; that is, you can use identifiers with the same name as a keyword like DECLARE. If the lexical analyzer is presented with text of a PL/I program that begins DECLARE ( ARG1, ARG2, ... it cannot be sure whether DECLARE is a keyword, and ARG1 and so on are variables being declared, or whether DECLARE is a procedure name with its arguments. For this reason, modern languages tend to reserve their keywords. However, if not, one can treat a keyword like DECLARE as an ambiguous identifier, and let the parser resolve the issue, perhaps in conjunction with symbol-table lookup.

3.3 Specification of Tokens

Regular expressions are an important notation for specifying lexeme patterns. While they cannot express all possible patterns, they are very effective in specifying those types of patterns that we actually need for tokens. In this section we shall study the formal notation for regular expressions, and in Section 3.5 we shall see how these expressions are used in a lexical-analyzer generator. Then, Section 3.7 shows how to build the lexical analyzer by converting regular expressions to automata that perform the recognition of the specified tokens.

3.3.1 Strings and Languages

An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and punctuation. The set {0, 1} is the binary alphabet. ASCII is an important example of an alphabet; it is used in many software systems. [...]

For the "exponentiation" of a string s, we define $s^0$ to be $\epsilon$, and for all $i > 0$, we define $s^i$ to be $s^{i-1}s$. Since $\epsilon s = s$, it follows that $s^1 = s$. Then $s^2 = ss$, $s^3 = sss$, and so on.

3.3.2 Operations on Languages

In lexical analysis, the most important operations on languages are union, concatenation, and closure, which are defined formally in Fig. 3.6. Union is the familiar operation on sets.
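These operations can be tried out on small finite languages. The Python sketch below is an illustration under our own assumptions, not notation from the text: a language is modeled as a set of strings, and the function names str_power, union, concat, power, and closure_up_to are hypothetical. Concatenation here pairs every string of the first language with every string of the second. Because the closure of a language that contains any nonempty string is infinite, closure_up_to is only a bounded approximation that collects the powers $L^0$ through $L^k$.

```python
def str_power(s, i):
    """String exponentiation: s^0 is the empty string, s^i is s^(i-1) followed by s."""
    result = ""
    for _ in range(i):
        result = result + s
    return result

def union(L, M):
    """All strings that are in L or in M."""
    return L | M

def concat(L, M):
    """All strings st with s taken from L and t taken from M."""
    return {s + t for s in L for t in M}

def power(L, i):
    """L^0 is {""}; L^i is L^(i-1) concatenated with L."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def closure_up_to(L, k):
    """Bounded stand-in for the closure: the union of L^0, L^1, ..., L^k."""
    result = set()
    for i in range(k + 1):
        result |= power(L, i)
    return result
```

For example, with L = {"a", "b"} and D = {"0", "1"}, concat(L, D) yields the four two-character strings that begin with a letter and end with a digit, and closure_up_to({"a"}, 3) yields the empty string together with a, aa, and aaa.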
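The concatenation of languages is all strings formed by taking a string from the first language and a string from the second language, in all possible ways, and concatenating them. [...]

Example 3.6: For unsigned numbers (integer or floating point), the regular definition

    digit            → 0 | 1 | ... | 9
    digits           → digit digit*
    optionalFraction → . digits | ε
    optionalExponent → ( E ( + | - | ε ) digits ) | ε
    number           → digits optionalFraction optionalExponent

is a precise specification for this set of strings. That is, an optionalFraction is either a decimal point (dot) followed by one or more digits, or it is missing (the empty string). An optionalExponent, if not missing, is the letter E followed by an optional + or - sign, followed by one or more digits. Note that at least one digit must follow the dot, so number does not match 1., but does match 1.0.

3.3.5 Extensions of Regular Expressions

Since Kleene introduced regular expressions with the basic operators for union, concatenation, and Kleene closure in the 1950s, many extensions have been added to regular expressions to enhance their ability to specify string patterns. Here we mention a few notational extensions that were first incorporated into Unix utilities such as Lex and that are particularly useful in the specification of lexical analyzers. The references to this chapter contain a discussion of some regular-expression variants in use today.

1. One or more instances. The unary, postfix operator + represents the positive closure of a regular expression and its language. That is, if r is a regular expression, then $(r)^+$ denotes the language $(L(r))^+$. The operator + has the same precedence and associativity as the operator *. Two useful algebraic laws, $r^* = r^+ \mid \epsilon$ and $r^+ = rr^* = r^*r$, relate the Kleene closure and positive closure.

2. Zero or one instance. The unary postfix operator ? means "zero or one occurrence." That is, r? is equivalent to r|ε, or put another way, $L(r?) = L(r) \cup \{\epsilon\}$. The ? operator has the same precedence and associativity as * and +.

3. Character classes. A regular expression a1|a2|...|an, where the ai's are each symbols of the alphabet, can be replaced by the shorthand [a1a2...an]. More importantly, when a1, a2, ..., an form a logical sequence, e.g., consecutive uppercase letters, lowercase letters, or digits, we can replace them by a1-an, that is, just the first and last separated by a hyphen. Thus, [abc] is shorthand for a|b|c, and [a-z] is shorthand for a|b|...|z.

Example 3.7: Using these shorthands, we can rewrite the regular definition of Example 3.5 as:

    letter_ → [A-Za-z_]
    digit   → [0-9]
    id      → letter_ ( letter_ | digit )*

The regular definition of Example 3.6 can also be simplified:

    digit  → [0-9]
    digits → digit+
    number → digits ( . digits )? ( E [+-]? digits )?

3.3.6 Exercises for Section 3.3

Exercise 3.3.1: [...] character strings or comments), (ii) the lexical form of numerical constants, and (iii) the lexical form of identifiers [...]

Exercise 3.3.2: Describe the languages denoted by the following regular expressions:

a) a(a|b)*a.
b) ((ε|a)b*)*.
c) (a|b)*a(a|b)(a|b).
d) a*ba*ba*ba*.
e) (aa|bb)*((ab|ba)(aa|bb)*(ab|ba)(aa|bb)*)*.

Exercise 3.3.3: In a string of length n, how many of the following are there?

a) Prefixes.
b) Suffixes.
c) Proper prefixes.
d) Substrings.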
e) Subsequences.

Exercise 3.3.4: Most languages are case sensitive, so keywords can be written only one way, and the regular expressions describing their lexemes are very simple. However, some languages, like SQL, are case insensitive, so a keyword can be written either in lowercase or in uppercase, or in any mixture of cases. Thus, the SQL keyword SELECT can also be written select, Select, or sElEcT, for instance. Show how to write a regular expression for a keyword in a case-insensitive language. Illustrate the idea by writing the expression for "select" in SQL.

Exercise 3.3.5: Write regular definitions for the following languages:

a) All strings of lowercase letters that contain the five vowels in order.
b) All strings of lowercase letters in which the letters are in ascending lexicographic order.
c) Comments, consisting of a string surrounded by /* and */, without an intervening */, unless it is inside double-quotes (").
d) All strings of digits with no repeated digits. Hint: Try this problem first with a few digits, such as {0, 1, 2}.
e) All strings of digits with at most one repeated digit.
f) All strings of a's and b's with an even number of a's and an odd number of b's.
g) The set of Chess moves, in the informal notation, such as p-k4 or kbp×qn.
h) All strings of a's and b's that do not contain the substring abb.
i) All strings of a's and b's that do not contain the subsequence abb.

Exercise 3.3.6: Write character classes for the following sets of characters:

a) The first ten letters (up to "j") in either upper or lower case.
b) The lowercase consonants.
c) The "digits" in a hexadecimal number (choose either upper or lower case for the "digits" above 9).
d) The characters that can appear at the end of a legitimate English sentence (e.g., exclamation point).

The following exercises, up to and including Exercise 3.3.10, discuss the extended regular-expression notation from Lex (the lexical-analyzer generator that we shall discuss extensively in Section 3.5). The extended notation is listed in Fig. 3.8.

Exercise 3.3.7: Note that these regular expressions give all of the following symbols (operator characters) a special meaning:

    \ " . ^ $ [ ] * + ? { } | /

Their special meaning must be turned off if they are needed to represent themselves in a character string. We can do so by quoting the character within a string of length one or more; e.g., the regular expression "**" matches the string **. We can also get the literal meaning of an operator character by preceding it by a backslash. Thus, the regular expression \*\* also matches the string **. Write a regular expression that matches the string "\.

Exercise 3.3.8: In Lex, a complemented character class represents any character except the ones listed in the character class. We denote a complemented class by using ^ as the first character; this symbol (caret) is not itself part of the class being complemented, unless it is listed within the class itself. Thus, [^A-Za-z] matches any character that is not an uppercase or lowercase letter, and [^\^] represents any character but the caret (or newline, since newline cannot be in any character class). Show that for every regular expression with complemented character classes, there is an equivalent regular expression without complemented character classes.

    EXPRESSION   MATCHES                                  EXAMPLE
    c            the one non-operator character c         a
    \c           character c literally                    \*
    "s"          string s literally                       "**"
    .            any character but newline                a.*b
    ^            beginning of a line                      ^abc
    $            end of a line                            abc$
    [s]          any one of the characters in string s    [abc]
    [^s]         any one character not in string s        [^abc]
    r*           zero or more strings matching r          a*
    r+           one or more strings matching r           a+
    r?           zero or one r                            a?
    r{m,n}       between m and n occurrences of r         a{1,5}
    r1r2         an r1 followed by an r2                  ab
    r1 | r2      an r1 or an r2                           a|b
    (r)          same as r                                (a|b)
    r1/r2        r1 when followed by r2                   abc/123

Figure 3.8: Lex regular expressions

Exercise 3.3.9: The regular expression r{m,n} matches from m to n occurrences of the pattern r. For example, a{1,5} matches a string of one to five a's. Show that for every regular expression containing repetition operators of this form, there is an equivalent regular expression without repetition operators.

Exercise 3.3.10: The operator ^ matches the left end of a line, and $ matches the right end of a line. The operator ^ is also used to introduce complemented character classes, but the context always makes it clear which meaning is intended. For example, ^[^aeiou]*$ matches any complete line that does not contain a lowercase vowel.

a) How do you tell which meaning of ^ is intended?
b) Can you always replace a regular expression using the ^ and $ operators by an equivalent expression that does not use either of these operators?

Exercise 3.3.11: The UNIX shell command sh uses the operators in Fig. 3.9 in filename expressions to describe sets of file names. For example, the filename expression *.o matches all file names ending in .o; sort1.? matches all filenames of the form sort1.c, where c is any character. Show how sh filename expressions can be replaced by equivalent regular expressions using only the basic union, concatenation, and closure operators.

Figure 3.9: Filename expressions used by the shell command sh (one surviving example entry: sort1.[cso])

Exercise 3.3.12: SQL allows a rudimentary form of patterns in which two characters have special meaning: underscore (_) stands for any one character and percent-sign (%) stands for any string of 0 or more characters. In addition, the programmer may define any character, say e, to be the escape character, so e preceding _, %, or another e gives the character that follows its literal meaning. Show how to express any SQL pattern as a regular expression, given that we know which character is the escape character.

3.4 Recognition of Tokens

In the previous section we learned how to express patterns using regular expressions. Now, we must study how to take the patterns for all the needed tokens and build a piece of code that examines the input string and finds a prefix that is a lexeme matching one of the patterns. Our discussion will make use of the following running example.

    stmt → if expr then stmt
         | if expr then stmt else stmt
         | ε
    expr → term relop term
         | term
    term → id
         | number

Figure 3.10: A grammar for branching statements

Example 3.8: The grammar fragment of Fig. 3.10 describes a simple form of branching statements and conditional expressions. This syntax is similar to that of the language Pascal, in that then appears explicitly after conditions. For relop, we use the comparison operators of languages like Pascal or SQL, where = is "equals" and <> is "not equals," because it presents an interesting structure of lexemes.

The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as far as the lexical analyzer is concerned. The patterns for these tokens are described using regular definitions, as in Fig. 3.11. The patterns for id and number are similar to what we saw in Example 3.7.

    digit  → [0-9]
    digits → digit+
    number → digits ( . digits )? ( E [+-]? digits )?
    letter → [A-Za-z]
    id     → letter ( letter | digit )*
    if     → if
    then   → then
    else   → else
    relop  → < | > | <= | >= | = | <>

Figure 3.11: Patterns for tokens of Example 3.8

For this language, the lexical analyzer will recognize the keywords if, then, and else, as well as lexemes that match the patterns for relop, id, and number. To simplify matters, we make the common assumption that keywords are also reserved words: that is, they are not identifiers, even though their lexemes match the pattern for identifiers.

In addition, we assign the lexical analyzer the job of stripping out whitespace, by recognizing the "token" ws defined by:

    ws → ( blank | tab | newline )+

Here, blank, tab, and newline are abstract symbols that we use to express the ASCII characters of the same names. Token ws is different from the other tokens in that, when we recognize it, we do not return it to the parser, but rather restart the lexical analysis from the character that follows the whitespace. It is the following token that gets returned to the parser.

Our goal for the lexical analyzer is summarized in Fig. 3.12. That table shows, for each lexeme or family of lexemes, which token name is returned to the parser and what attribute value, as discussed in Section 3.1, is returned. Note that for the six relational operators, symbolic constants LT, LE, and so on are used as the attribute value, in order to indicate which instance of the token relop we have found. The particular operator found will influence the code that is output from the compiler.

    LEXEMES      TOKEN NAME   ATTRIBUTE VALUE
    Any ws       -            -
    if           if           -
    then         then         -
    else         else         -
    Any id       id           Pointer to table entry
    Any number   number       Pointer to table entry
    <            relop        LT
    <=           relop        LE
    =            relop        EQ
    <>           relop        NE
    >            relop        GT
    >=           relop        GE

Figure 3.12: Tokens, their patterns, and attribute values

3.4.1 Transition Diagrams

As an intermediate step in the construction of a lexical analyzer, we first convert patterns into stylized flowcharts, called "transition diagrams." In this section, we perform the conversion from regular-expression patterns to transition diagrams by hand, but in Section 3.6, we shall see that there is a mechanical way to construct these diagrams from collections of regular expressions.

Transition diagrams have a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns. We may think of a state as summarizing all we need to know about what characters we have seen between the lexemeBegin pointer and the forward pointer (as in the situation of Fig. 3.3).

Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols. If we are in some state s, and the next input symbol is a, we look for an edge out of state s labeled by a (and perhaps by other symbols, as well). If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads. We shall assume that all our transition diagrams are deterministic, meaning that there is never more than one edge out of a given state with a given symbol among its labels. Starting in Section 3.5, we shall relax the condition of determinism, making life much easier for the designer of a lexical analyzer, although trickier for the implementer. Some important conventions about transition diagrams are:

1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers. We always indicate an accepting state by a double circle, and if there is an action to be taken (typically returning a token and an attribute value to the parser), we shall attach that action to the accepting state.

2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we shall additionally place a * near that accepting state. In our example, it is never necessary to retract forward by more than one position, but if it were, we could attach any number of *'s to the accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge, labeled "start," entering from nowhere. The transition diagram always begins in the start state before any input symbols have been read.

Example 3.9: Figure 3.13 is a transition diagram that recognizes the lexemes matching the token relop. We begin in state 0, the start state. If we see < as the first input symbol, then among the lexemes that match the pattern for relop we can only be looking at <, <>, or <=. We therefore go to state 1, and look at the next character. If it is =, then we recognize lexeme <=, enter state 2, and return the token relop with attribute LE, the symbolic constant representing this particular comparison operator. If in state 1 the next character is >, then instead we have lexeme <>, and enter state 3 to return an indication that the not-equals operator has been found. On any other character, the lexeme is <, and we enter state 4 to return that information. Note, however, that state 4 has a * to indicate that we must retract the input one position.

Figure 3.13: Transition diagram for relop

On the other hand, if in state 0 the first character we see is =, then this one character must be the lexeme. We immediately return that fact from state 5. The remaining possibility is that the first character is >. Then, we must enter state 6 and decide, on the basis of the next character, whether the lexeme is >= (if we next see the = sign), or just > (on any other character). Note that if, in state 0, we see any character besides <, =, or >, we cannot possibly be seeing a relop lexeme, so this transition diagram will not be used.

3.4.2 Recognition of Reserved Words and Identifiers

Recognizing keywords and identifiers presents a problem. Usually, keywords like if or then are reserved (as they are in our running example), so they are not identifiers even though they look like identifiers. Thus, although we typically use a transition diagram like that of Fig. 3.14 to search for identifier lexemes, this diagram will also recognize the keywords if, then, and else of our running example.

Figure 3.14: A transition diagram for id's and keywords

There are two ways that we can handle reserved words that look like identifiers:

1. Install the reserved words in the symbol table initially.
A field of the symbol-table entry indicates that these strings are never ordinary identifiers, and tells which token they represent. We have supposed that this method is in use in Fig. 3.14. When we find an identifier, a call to installID places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found. Of course, any identifier not in the symbol table during lexical analysis cannot be a reserved word, so its token is id. The function getToken examines the symbol-table entry for the lexeme found, and returns whatever token name the symbol table says this lexeme represents: either id or one of the keyword tokens that was initially installed in the table.

2. Create separate transition diagrams for each keyword; an example for the keyword then is shown in Fig. 3.15. Note that such a transition diagram consists of states representing the situation after each successive letter of the keyword is seen, followed by a test for a "nonletter-or-digit," i.e., any character that cannot be the continuation of an identifier. It is necessary to check that the identifier has ended, or else we would return token then in situations where the correct token was id, with a lexeme like thenextvalue that has then as a proper prefix. If we adopt this approach, then we must prioritize the tokens so that the reserved-word tokens are recognized in preference to id, when the lexeme matches both patterns. We do not use this approach in our example, which is why the states in Fig. 3.15 are unnumbered.

Figure 3.15: Hypothetical transition diagram for the keyword then

3.4.3 Completion of the Running Example

The transition diagram for id's that we saw in Fig. 3.14 has a simple structure. Starting in state 9, it checks that the lexeme begins with a letter and goes to state 10 if so. We stay in state 10 as long as the input contains letters and digits. When we first encounter anything but a letter or digit, we go to state 11 and accept the lexeme found. Since the last character is not part of the identifier, we must retract the input one position, and as discussed in Section 3.4.2, we enter what we have found in the symbol table and determine whether we have a keyword or a true identifier.

The transition diagram for token number is shown in Fig. 3.16, and is so far the most complex diagram we have seen. Beginning in state 12, if we see a digit, we go to state 13. In that state, we can read any number of additional digits. However, if we see anything but a digit, dot, or E, we have seen a number in the form of an integer; 123 is an example. That case is handled by entering state 20, where we return token number and a pointer to a table of constants where the found lexeme is entered. These mechanics are not shown on the diagram but are analogous to the way we handled identifiers.

Figure 3.16: A transition diagram for unsigned numbers

If we instead see a dot in state 13, then we have an "optional fraction." State 14 is entered, and we look for one or more additional digits; state 15 is used for that purpose. If we see an E, then we have an "optional exponent," whose recognition is the job of states 16 through 19. Should we, in state 15, instead see anything but E or a digit, then we have come to the end of the fraction, there is no exponent, and we return the lexeme found, via state 21.
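The walk through Fig. 3.16 can be mimicked by a small state machine. The following Python sketch is an illustration under stated assumptions rather than the text's implementation: the names is_digit and scan_number are hypothetical, the transitions for states 16 through 19 (an optional sign, then one or more digits) are inferred from the regular definition of number rather than from the prose above, and a failed match simply returns None instead of handing control to another diagram.

```python
def is_digit(c):
    # True only for a single ASCII digit; "" (end of input) is not a digit.
    return len(c) == 1 and "0" <= c <= "9"

def scan_number(text, start=0):
    """Simulate the number diagram from position `start`.

    Returns (lexeme, index_after_lexeme), or None if no number starts here.
    """
    state = 12
    forward = start
    while True:
        c = text[forward] if forward < len(text) else ""  # "" marks end of input
        if state == 12:                       # start state: need a digit
            if not is_digit(c):
                return None
            state = 13
        elif state == 13:                     # integer part
            if c == ".":
                state = 14
            elif c == "E":
                state = 16
            elif not is_digit(c):
                return text[start:forward], forward   # state 20: an integer
        elif state == 14:                     # just saw the dot: need a digit
            if not is_digit(c):
                return None
            state = 15
        elif state == 15:                     # fraction digits
            if c == "E":
                state = 16
            elif not is_digit(c):
                return text[start:forward], forward   # state 21: no exponent
        elif state == 16:                     # just saw E: sign or digit (assumed)
            if c in ("+", "-"):
                state = 17
            elif is_digit(c):
                state = 18
            else:
                return None
        elif state == 17:                     # after the sign: need a digit (assumed)
            if not is_digit(c):
                return None
            state = 18
        elif state == 18:                     # exponent digits (assumed)
            if not is_digit(c):
                return text[start:forward], forward   # state 19: with exponent
        forward += 1
```

Each accepting branch returns before forward is advanced past the character that ended the lexeme, which plays the role of the one-position retraction marked by * on the accepting states. For instance, scan_number("123+") returns ("123", 3), leaving + to be scanned next, while scan_number("1.") returns None because at least one digit must follow the dot.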