Professional Documents
Culture Documents
090 Top-Down Parsing
090 Top-Down Parsing
TopDownParsing
Handout09 July2th,2012
HandoutwrittenbyMaggieJohnsonandrevisedbyJulieZelenski.
Ab AB S
Increatingaparserforacompiler,wenormallyhavetoplacesomerestrictionsonhow weprocesstheinput.Intheaboveexample,itwaseasyforustoseewhichproductions wereappropriatebecausewecouldseetheentirestringaaab.Inacompilersparser, however,wedonthavelongdistancevision.Weareusuallylimitedtojustonesymbol oflookahead.Thelookaheadsymbolisthenextsymbolcomingupintheinput.This restrictioncertainlymakestheparsingmorechallenging.Usingthesamegrammarfrom above,iftheparserseesonlyasinglebintheinputanditcannotlookaheadanyfurther thanthesymbolweareon,itcantknowwhethertousetheproductionB >borB > bB. Backtracking Onesolutiontoparsingwouldbetoimplementbacktracking.Basedontheinformation theparsercurrentlyhasabouttheinput,adecisionismadetogowithoneparticular production.Ifthischoiceleadstoadeadend,theparserwouldhavetobacktracktothat decisionpoint,movingbackwardsthroughtheinput,andstartagainmakingadifferent choiceandsoonuntiliteitherfoundtheproductionthatwastheappropriateoneorran outofchoices.Forexample,considerthissimplegrammar:
S A > > bab | bA d | cA
weeitherfindaworkingparseorhaveexhaustivelytriedallcombinationswithout success. Anumberofauthorshavedescribedbacktrackingparsers;theappealisthattheycanbe usedforavarietyofgrammarswithoutrequiringthemtofitanyspecificform.Fora smallgrammarsuchasabove,abacktrackingapproachmaybetractable,butmost programminglanguagegrammarshavedozensofnonterminalseachwithseveral optionsandtheresultingcombinatorialexplosionmakesathisapproachveryslowand impractical.Wewillinsteadlookatwaystoparseviaefficientmethodsthathave restrictionsabouttheformofthegrammar,butusuallythoserequirementsarenotso onerousthatwecannotrearrangeaprogramminglanguagegrammartomeetthem. TopDownPredictiveParsing First,wewillfocusinontopdownparsing.Wewilllookattwodifferentwaysto implementanonbacktrackingtopdownparsercalledapredictiveparser.Apredictive parserischaracterizedbyitsabilitytochoosetheproductiontoapplysolelyonthebasis ofthenextinputsymbolandthecurrentnonterminalbeingprocessed.Toenablethis, thegrammarmusttakeaparticularform.WecallsuchagrammarLL(1).Thefirst"L" meanswescantheinputfromlefttoright;thesecond"L"meanswecreatealeftmost derivation;andthe1meansoneinputsymboloflookahead.Informally,anLL(1)hasno leftrecursiveproductionsandhasbeenleftfactored.Notethatthesearenecessary conditionsforLL(1)butnotsufficient,i.e.,thereexistgrammarswithnoleftrecursionor commonprefixesthatarenotLL(1).Notealsothatthereexistmanygrammarsthat cannotbemodifiedtobecomeLL(1).Insuchcases,anotherparsingtechniquemustbe employed,orspecialrulesmustbeembeddedintothepredictiveparser. RecursiveDescent Thefirsttechniqueforimplementingapredictiveparseriscalledrecursivedescent.A recursivedescentparserconsistsofseveralsmallfunctions,oneforeachnonterminalin thegrammar.Asweparseasentence,wecallthefunctionsthatcorrespondtotheleft sidenonterminaloftheproductionsweareapplying.Iftheseproductionsarerecursive, weendupcallingthefunctionsrecursively. LetsstartbyexaminingsomeproductionsfromagrammarforasimplePascallike programminglanguage.Inthisprogramminglanguage,allfunctionsareprecededby thereservedwordFUNC:
program function_list function > > > function_list function_list function | function FUNC identifier ( parameter_list ) statements
NowwecantidyuptheParseFunctionroutineandmakeitclearerwhatitdoes:
void ParseFunction() { MatchToken(T_FUNC); ParseIdentifier(); MatchToken(T_LPAREN); ParseParameterList(); MatchToken(T_RPAREN); ParseStatements(); }
Thefollowingdiagramillustrateshowtheparsetreeisbuilt:
program function_list function identifier parameter_list by the call to ParseParameterList statements by the call to ParseStatements
FUNC call to
ParseIdentifier
Hereistheproductionforanifstatementinthislanguage:
if_statement > IF expression THEN statement ENDIF | IF expression THEN statement ELSE statement ENDIF
Topreparethisgrammarforrecursivedescent,wemustleftfactortosharethecommon parts:
if_statement > IF expression THEN statement close_if close_if > ENDIF | ELSE statement ENDIF
Now,letslookattherecursivedescentfunctionstoparseanifstatement:
void ParseIfStatement() { MatchToken(T_IF); ParseExpression(); MatchToken(T_THEN); ParseStatement(); ParseCloseIf(); } void ParseCloseIf() { if (lookahead == T_ENDIF) lookahead = yylex(); else { MatchToken(T_ELSE); ParseStatement(); MatchToken(T_ENDIF); } }
// if we immediately find ENDIF // predict close_if -> ENDIF // otherwise we look for ELSE // predict close_if -> ELSE stmt ENDIF
Navigatingthroughtwochoicesseemedsimpleenough,however,whathappenswhere wehavemanyalternativesontherightside?
statement > assg_statement | return_statement | print_statement | null_statement | if_statement | while_statement | block_of_statements
WhenimplementingtheParseStatementfunction,howarewegoingtobeableto determinewhichofthesevenoptionstomatchforanygiveninput?Remember,weare tryingtodothiswithoutbacktracking,andjustonetokenoflookahead,sowehavetobe abletomakeimmediatedecisionwithminimalinformationthiscanbeachallenge! Tounderstandhowtorecognizeandsolveproblem,weneedadefinition: Thefirstsetofasequenceofsymbolsu,writtenasFirst(u)isthesetofterminals whichstartthesequencesofsymbolsderivablefromu.Abitmoreformally, considerallstringsderivablefromu.Ifu=>*v,wherevbeginswithsome terminal,thatterminalisinFirst(u). Ifu=>*,then isinFirst(u). Informally,thefirstsetofasequenceisalistofallthepossibleterminalsthatcouldstart astringderivedfromthatsequence.Wewillworkanexampleofcalculatingthefirst setsabitlater.Fornow,justkeepinmindtheintuitivemeaning.Findingourlookahead tokeninoneofthefirstsetsofthepossibleexpansionstellsusthatisthepathtofollow. Givenaproductionwithanumberofalternatives:A > u1 | u2 | ...,wecanwritea recursivedescentroutineonlyifallthesetsFirst(ui)aredisjoint.Thegeneralformof sucharoutinewouldbe:
void ParseA() { // case below not quite legal C, need to list symbols individually switch (lookahead) { case First(u1): // predict production A -> u1 /* code to recognize u1 */ return; case First(u2): // predict production A -> u2 /* code to recognize u2 */ return; .... default: printf("syntax error \n"); exit(0); } }
Ifthefirstsetsofthevariousproductionsforanonterminalarenotdisjoint,apredictive parserdoesn'tknowwhichchoicetomake.Wewouldeitherneedtorewritethe grammaroruseadifferentparsingtechniqueforthisnonterminal.Forprogramming languages,itisusuallypossibletorestructuretheproductionsorembedcertainrules intotheparsertoresolveconflicts,butthisconstraintisoneoftheweaknessesofthetop downnonbacktrackingapproach. Itisabittrickierifthenonterminalwearetryingtorecognizeisnullable.Anonterminal AisnullableifthereisaderivationofAthatresultsin (i.e.,thatnonterminalwould completelydisappearintheparsestring)i.e., First(A).InthiscaseAcouldbereplaced bynothingandthenexttokenwouldbethefirsttokenofthesymbolfollowing Ainthe sentencebeingparsed.ThusifAisnullable,ourpredictiveparseralsoneedstoconsider thepossibilitythatthepathtochooseistheonecorrespondingto A=>*.Todealwith thiswedefinethefollowing: ThefollowsetofanonterminalAisthesetofterminalsymbolsthatcanappear immediatelytotherightofAinavalidsentence.Abitmoreformally,forevery validsentenceS =>* uAv,wherevbeginswithsometerminal,andthatterminalis inFollow(A). Informally,youcanthinkaboutthefollowsetlikethis:Acanappearinvariousplaces withinavalidsentence.Thefollowsetdescribeswhatterminalscouldhavefollowed thesententialformthatwasexpandedfromA.Wewilldetailhowtocalculatethefollow setabitlater.Fornow,realizefollowsetsareusefulbecausetheydefinetheright contextconsistentwithagivennonterminalandprovidethelookaheadthatmightsignal anullablenonterminalshouldbeexpandedto. Withthesetwodefinitions,wecannowgeneralizehowtohandle A > u1 | u2 | ...,ina recursivedescentparser.Inallsituations,weneedacasetohandleeachmemberin First(ui).Inadditionifthereisaderivationfromanyuithatcouldyield(i.e.ifitis nullable)thenwealsoneedtohandlethemembersinFollow(A).
void ParseA() { switch (lookahead) { case First(u1): /* code to recognize u1 */ return; case First(u2): /* code to recognize u2 */ return; ... case Follow(A): // predict production A->epsilon if A is nullable /* usually do nothing here */ default: printf("syntax error \n"); exit(0); } }
becomes
function_list > function function_list | function
thenwemustleftfactorthecommonparts
function_list > more_functions > function more_functions function more_functions |
Andnowtheparsingfunctionlookslikethis:
Computingfirstandfollow Thesearethealgorithmsusedtocomputethefirstandfollowsets: Calculatingfirstsets.TocalculateFirst(u)whereuhastheformX1X2...Xn,dothe following: a) IfX1isaterminal,addX1toFirst(u)andyou'refinished. b) ElseX1isanonterminal,soaddFirst(X1) - toFirst(u). a. IfX1isanullablenonterminal,i.e.,X1=>*,addFirst(X2) - toFirst(u). Furthermore,ifX2canalsogoto,thenaddFirst(X3)- andsoon,throughallXn untilthefirstnonnullablesymbolisencountered. b. IfX1X2...Xn =>*,addtothefirstset. Calculatingfollowsets.Foreachnonterminalinthegrammar,dothefollowing: 1. PlaceEOFinFollow(S)whereSisthestartsymbolandEOFistheinput'sright endmarker.Theendmarkermightbeendoffile,newline,oraspecialsymbol, whateveristheexpectedendofinputindicationforthisgrammar.Wewill typicallyuse$astheendmarker. 2. ForeveryproductionA > uBvwhereuandvareanystringofgrammarsymbols andBisanonterminal,everythinginFirst(v)exceptisplacedinFollow(B). 3. ForeveryproductionA > uB,oraproductionA > uBvwhereFirst(v)contains (i.e.visnullable),theneverythinginFollow(A) isaddedtoFollow(B). Hereisacompleteexampleoffirstandfollowsetcomputation,startingwiththis grammar:
S A B C > > > > AB Ca | BaAC | c b|
NoticewehavealeftrecursiveproductionthatmustbefixedifwearetouseLL(1) parsing:
B > BaAC | c
becomes
B B'
Thenewgrammaris:
S A B B' C > > > > > AB Ca | cB' aACB' | b|
Thefirstsetsforeachnonterminalare:
First(C) = {b } First(B') = {a } First(B) = {c} First(A) = {b a } StartwithFirst(C) - ,add a
(sinceCisnullable)and (sinceAitselfisnullable)
Follow(S) = {$} S doesntappearintherighthandsideofanyproductions.Weput $inthe followsetbecauseSisthestartsymbol. Follow(B) = {$} B appearsontherighthandsideoftheS > AB production.Itsfollowsetisthe sameasS. Follow(B') = {$} B' appearsontherighthandsideoftwoproductions.TheB' > aACB' productiontellsusitsfollowsetincludesthefollowsetofB',whichis tautological.FromB > cB', welearnitsfollowsetisthesameasB. Follow(C) = {a $} C appearsintherighthandsideoftwoproductions.TheproductionA > Ca tellsusaisinthefollowset.FromB' > aACB' , weaddthe First(B')whichis justaagain.Because B' isnullable,wemustalsoadd Follow(B') whichis$. Follow(A) = {c b a $} A appearsintherighthandsideoftwoproductions.From S > AB weadd First(B) whichisjust c. B isnotnullable. From B' > aACB' , weadd First(C) whichis b. SinceCisnullable,sowealsoinclude First(B') whichis a. B'isalso nullable,soweinclude Follow(B') whichadds $.
Itcanbeconvenienttocomputethefollowssetsforthenonterminalsthatappeartoward thetopoftheparsetreeandworkyourwaydown,butsometimesyouhavetocircle aroundcomputingthefollowsetsofothernonterminalsinordertocompletetheone youreon. Thecalculationofthefirstandfollowsetsfollowmechanicalalgorithms,butitisvery easytogettrippedupinthedetailsandmakemistakesevenwhenyouknowtherules. Becareful! TabledrivenLL(1)Parsing Inarecursivedescentparser,theproductioninformationisembeddedintheindividual parsefunctionsforeachnonterminalandtheruntimeexecutionstackiskeepingtrackof ourprogressthroughtheparse.Thereisanothermethodforimplementingapredictive parserthatusesatabletostorethatproductionalongwithanexplicitstacktokeeptrack ofwhereweareintheparse. Thisgrammarforadd/multiplyexpressionsisalreadysetuptohandleprecedenceand associativity:
E T F > > > E+T|T T*F|F (E) | int
Afterremovalofleftrecursion,weget:
E E' T T' F > > > > > TE' + TE' | FT' * FT' | (E) | int
Onewaytoillustratetheprocessistostudysometransitiongraphsthatrepresentthe grammar:
Apredictiveparserbehavesasfollows.Letsassumetheinputstringis3 + 4 * 5. ParsingbeginsinthestartstateofthesymbolEandmovestothenextstate.This transitionismarkedwithaT,whichsendsustothestartstateforT.Thisinturn,sends ustothestartstateforF.Fhasonlyterminals,sowereadatokenfromtheinputstring. Itmusteitherbeanopenparenthesisoranintegerinorderforthisparsetobevalid.We consumetheintegertoken,andthuswehavehitafinalstateinthe Ftransitiondiagram, sowereturntowherewecamefromwhichistheTdiagram;wehavejustfinished processingtheFnonterminal.WecontinuewithT',andgotothatstartstate.The currentlookaheadis+whichdoesntmatchthe*requiredbythefirstproduction,but+ isinthefollowsetforT'sowematchthesecondproductionwhichallowsT'todisappear entirely.WefinishT'andreturntoT,wherewearealsoinafinalstate.WereturntotheE diagramwherewehavejustfinishedprocessingtheT.WemoveontoE',andsoon. Atabledrivenpredictiveparserusesastacktostoretheproductionstowhichitmust return.Aparsingtablestorestheactionstheparsershouldtakebasedontheinput tokenandwhatvalueisontopofthestack.$istheendofinputsymbol.
( E> TE'
E'> +TE' T> FT' T'> F> int T'> *FT' F> (E) T> FT'
E'>
E'>
T'>
T'>
Tracing Hereishowapredictiveparserworks.Wepushthestartsymbolonthestackandread thefirstinputtoken.Astheparserworksthroughtheinput,therearethefollowing possibilitiesforthetopstacksymbolXandtheinputtokenausingtableM: 1. If X = aanda=endofinput($),parserhaltsandparsecompletedsuccessfully. 2. IfX = a anda!=$,successfulmatch,popXandadvancetonextinputtoken. Thisiscalledamatchaction. 3. IfX != a andXisanonterminal,popXandconsulttableatM[X,a]toseewhich productionapplies,pushrightsideofproductiononstack.Thisiscalleda predictaction. 4. Ifnoneoftheprecedingcasesappliesorthetableentryfromstep3isblank, therehasbeenaparseerror. Hereisanexampleparseofthestringint + int * int:
PARSESTACK REMAININGINPUT PARSERACTION
E$
int + int * int$ Predict E > TE', pop E from stack, push TE', no change in input int + int * int$ Predict T> FT' int + int * int$ Predict F> int int + int * int$ Match int, pop from stack, move ahead in input + int * int$ Predict T'> + int * int$ Predict E'> +TE' + int * int$ Match +, pop int * int$ Predict T>FT' int * int$ Predict F> int
int * int$ Match int, pop * int$ Predict T'> *FT' * int$ Match *, pop int$ Predict F> int int$ Match int, pop $ Predict T'> $ Predict E'> $ Match $, pop, success!
First(E) = First(T) = First(F) = { ( int } First(T') = { * } First(E') = { + } Follow(E) = Follow(E') { $ ) } Follow(T) = Follow(T') = { + $ ) } Follow(F) = { * + $ ) }
Oncewehavethefirstandfollowsets,webuildatableMwiththeleftmostcolumn labeledwithallthenonterminalsinthegrammar,andthetoprowlabeledwithallthe terminalsinthegrammar,alongwith$.Thefollowingalgorithmfillsinthetablecells: 1. ForeachproductionA > uofthegrammar,dosteps2and3 2. ForeachterminalainFirst(u),addA > utoM[A,a] 3. If inFirst(u),(i.e.Aisnullable)addA > utoM[A,b]foreachterminalbin Follow(A),If inFirst(u),and$isinFollow(A),addA > utoM[A,$] 4. Allundefinedentriesareerrors TheconceptusedhereistoconsideraproductionA > uwithainFirst(u).Theparser shouldexpandA to uwhenthecurrentinputsymbolisa.Itsalittletrickierwhenu=
oru=>*.Inthiscase,weshouldexpandA to uifthecurrentinputsymbolisin Follow(A),orifthe$attheendoftheinputhasbeenreached,and$isinFollow(A). Iftheprocedureevertriestofillinanentryofthetablethatalreadyhasanonerror entry,theprocedurefailsthegrammarisnotLL(1). LL(1)Grammars:Properties Thesepredictivetopdowntechniques(eitherrecursivedescentortabledriven)require agrammarthatisLL(1).OnefullygeneralwaytodetermineifagrammarisLL(1)isto buildthetableandseeifyouhaveconflicts.Insomecases,youwillbeabletodetermine thatagrammarisorisn'tLL(1)viaashortcut(suchasidentifyingobviousleftfactors). TogiveaformalstatementofwhatisrequiredforagrammartobeLL(1): o Noambiguity o Noleftrecursion o AgrammarGisLL(1)iffwheneverA > u | varetwodistinctproductionsofG,the followingconditionshold: o fornoterminaladobothu andvderivestringsbeginningwitha(i.e.,first setsaredisjoint) o atmostoneofu andvcanderivetheemptystring o ifv=>*thenudoesnotderiveanystringbeginningwithaterminalin Follow(A) (i.e.,firstandfollowmustbedisjointifnullable) AllofthistranslatesintuitivelythatwhentryingtorecognizeA,theparsermustbeable toexaminejustoneinputsymboloflookaheadanduniquelydeterminewhich productiontouse. ErrorDetectionandRecovery Afewgeneralprinciplesapplytoerrorsfoundregardlessofparsingtechniquebeing used: o Aparsershouldtrytodeterminethatanerrorhasoccurredassoonaspossible. Waitingtoolongbeforedeclaringanerrorcancausetheparsertolosetheactual locationoftheerror. o Asuitableandcomprehensivemessageshouldbereported.Missingsemicolon online36ishelpful,unabletoshiftinstate425isnot. o Afteranerrorhasoccurred,theparsermustpickareasonableplacetoresumethe parse.Ratherthangivingupatthefirstproblem,aparsershouldalwaystryto parseasmuchofthecodeaspossibleinordertofindasmanyrealerrorsas possibleduringasinglerun.
o Aparsershouldavoidcascadingerrors,whichiswhenoneerrorgeneratesa lengthysequenceofspuriouserrormessages. Recognizingthattheinputisnotsyntacticallyvalidcanberelativelystraightforward.An errorisdetectedinpredictiveparsingwhentheterminalontopofthestackdoesnot matchthenextinputsymbolorwhennonterminalAisontopofthestack,aisthenext inputsymbolandtheparsingtableentryM[A,a]isempty. Decidinghowtohandletheerrorisbitmorecomplicated.Byinsertingspecificerror actionsintotheemptyslotsofthetable,youcandeterminehowapredictiveparserwill handleagivenerrorcondition.Attheleast,youcanprovideapreciseerrormessagethat describesthemismatchbetweenexpectedandfound. Recoveringfromerrorsandbeingabletoresumeandsuccessfullyparseismoredifficult. Theentirecompilationcouldbeabortedonthefirsterror,butmostuserswouldliketo findoutmorethanoneerrorpercompilation.Theproblemishowtofixtheerrorin somewaytoallowparsingtocontinue. Manyerrorsarerelativelyminorandinvolvesyntacticviolationsforwhichtheparser hasacorrectionthatitbelievesislikelytobewhattheprogrammerintended.For example,amissingsemicolonattheendofthelineoramisspelledkeywordcanusually berecognized.Formanyminorerrors,theparsercan"fix"theprogrambyguessingat whatwasintendedandreportingawarning,butallowingcompilationtoproceed unhindered.Theparsermightskipwhatappearstobeanerroneoustokenintheinput orinsertanecessary,butmissing,tokenorchangeatokenintotheoneexpected (substitutingBEGINforBGEIN).Formoremajororcomplexerrors,theparsermayhave noreliablecorrection.Theparserwillattempttocontinuebutwillprobablyhavetoskip overpartoftheinputortakesomeotherexceptionalactiontodoso. Panicmodeerrorrecoveryisasimpletechniquethatjustbailsoutofthecurrent construct,lookingforasafesymbolatwhichtorestartparsing.Theparserjustdiscards inputtokensuntilitfindswhatiscalledasynchronizingtoken.Thesetofsynchronizing tokensarethosethatwebelieveconfirmtheendoftheinvalidstatementandallowusto pickupatthenextpieceofcode.ForanonterminalA,wecouldplaceallthesymbolsin Follow(A)intoitssynchronizingset.IfAisthenonterminalforavariabledeclarationand thegarbledinputissomethinglikeduoble d;theparsermightskipaheadtothesemi colonandactasthoughthedeclarationdidntexist.Thiswillsurelycausesomemore cascadingerrorswhenthevariableislaterused,butitmightgetthroughthetrouble spot.Wecouldalsousethesymbolsin First(A) asasynchronizingsetforrestartingthe parseofA.Thiswouldallowinputjunk double d;toparseasavalidvariable declaration.
Bibliography A. Aho, R. Sethi, J. Ullman, Compilers: Principles, Techniques, and Tools. Reading, MA: Addison-Wesley, 1986. J.P. Bennett, Introduction to Compiling Techniques. Berkshire, England: McGraw-Hill, 1990. D. Cohen, Introduction to Computer Theory. New York: Wiley, 1986. C. Fischer, R. LeBlanc, Crafting a Compiler. Menlo Park, CA: Benjamin/Cummings, 1988. K. Loudon, Compiler Construction. Boston, MA: PWS, 1997.