Professional Documents
Culture Documents
Syntax Analyzer RNP19 Unit 2
Syntax Analyzer RNP19 Unit 2
Outline
1. Top-Down Parser
– the parse tree is created top to bottom, starting from the root.
2. Bottom-Up Parser
– the parse is created bottom to top; starting from the leaves
• Both top-down and bottom-up parsers scan the input from left to right
(one symbol at a time).
• Efficient top-down and bottom-up parsers can be implemented only for
sub-classes of context-free grammars.
– LL for top-down parsing
– LR for bottom-up parsing
• unambiguous grammar
unique selection of the parse tree for a sentence
stmt stmt
E2 S1 E2 S1 S2
1 2
Compiler Design 40106 22
Ambiguity (cont.)
• We prefer the second parse tree (else matches with closest if).
• So, we have to disambiguate our grammar to reflect this choice.
A
+
A for some string
In general,
A A 1 | ... | A m | 1 | ... | n where 1 ... n do not start with A
eliminate immediate left recursion
A 1 A’ | ... | n A’
A’ 1 A’ | ... | m A’ | an equivalent grammar
E E+T | T
T T*F | F
F id | (E)
• when we see if, we cannot figure out which production rule to choose
to re-write stmt in the derivation.
convert it into
A A’ | 1 | ... | m
A’ 1 | ... | n
A aA’ | cdg | cdeB | cdfB
A’ bB | B
A aA’ | cdA’’
A’ bB | B
A’’ g | eB | fB
– Predictive Parsing
• no backtracking
• efficient
• needs a special form of grammars (LL(1) grammars).
• Non-Recursive (Table Driven) Predictive Parser is also known as LL(1) parser.
S cAd
A ab | a
S S
input: cad
c A d c A d
fails, backtrack
a b a
current token
stmt if ...... |
while ...... |
begin ...... |
for .....
• When we are trying to write the non-terminal stmt, if the current token
is if we have to choose first production rule.
• When we are trying to write the non-terminal stmt, we can uniquely
choose the production rule by just looking the current token.
• We eliminate the left recursion in the grammar, and left factor it. But it
may not be suitable for predictive parsing (not LL(1) grammar).
Compiler Design 40106 35
Recursive Predictive Parsing
• Each non-terminal corresponds to a procedure.
proc A {
- match the current token with a, and move to the next token;
- call ‘B’;
- match the current token with b, and move to the next token;
}
A aBe | cBd | C
B bB |
Cf
input buffer
Parsing Table
output
– a production rule representing a step of the derivation sequence (left-most derivation) of the
string in the input buffer.
stack
– contains the grammar symbols
– at the bottom of the stack, there is a special end marker symbol $.
– initially the stack contains only the symbol $ and the starting symbol S. $S initial stack
– when the stack is emptied (ie. only $ left in the stack), the parsing is completed.
parsing table
– a two-dimensional array M[A,a]
– each row is a non-terminal symbol
– each column is a terminal symbol or the special symbol $
– each entry holds a production rule.
Derivation(left-most): SaBaabBaabbBaabba
S
parse tree
a B a
b B
b B
Compiler Design 40106 42
LL(1) Parser – Example2
E TE’
E’ +TE’ |
T FT’
T’ *FT’ |
F (E) | id
id + * ( ) $
E E TE’ E TE’
E’ E’ +TE’ E’ E’
T T FT’ T FT’
T’ T’ T’ *FT’ T’ T’
F F id F (E)
Compiler Design 40106 43
LL(1) Parser – Example2
stack input output
$E id+id$ E TE’
$E’T id+id$ T FT’
$E’ T’F id+id$ F id
$ E’ T’id id+id$
$ E’ T ’ +id$ T’
$ E’ +id$ E’ +TE’
$ E’ T+ +id$
$ E’ T id$ T FT’
$ E’ T ’ F id$ F id
$ E’ T’id id$
$ E’ T ’ $ T’
$ E’ $ E’
$ $ accept
• If ( A B is a production rule ) or
( A B is a production rule and is in FIRST() )
everything in FOLLOW(A) is in FOLLOW(B).
We apply these rules until nothing more can be added to any follow set.
FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, ), $ }
FOLLOW(T’) = { +, ), $ }
FOLLOW(F) = {+, *, ), $ }
– If in FIRST()
for each terminal a in FOLLOW(A) add A to M[A,a]
• All other undefined entries of the parsing table are error entries.
E’ FIRST()={} none
but since in FIRST()
and FOLLOW(E’)={$,)} E’ into M[E’,$] and M[E’,)]
T’ FIRST()={} none
but since in FIRST()
and FOLLOW(T’)={$,),+} T’ into M[T’,$], M[T’,)] and M[T’,+]
• The parsing table of a grammar may contain more than one production
rule. In this case, we say that it is not a LL(1) grammar.
FIRST(iCtSE) = {i}
a b e i t $
FIRST(a) = {a}
S Sa S iCtSE
FIRST(eS) = {e}
FIRST() = {} E EeS E
E
FIRST(b) = {b}
C Cb
Problem ambiguity
Compiler Design 40106 53
Properties of LL(1) Grammars
• A grammar G is LL(1) if and only if the following conditions hold for
two distinctive production rules A and A
• Error-Productions
– If we have a good idea of the common errors that might be encountered, we can augment
the grammar with productions that generate erroneous constructs.
– When an error production is used by the parser, we can generate appropriate error
diagnostics.
– Since it is almost impossible to know all the errors that can be made by the programmers,
this method is not practical.
S AbS | e |
A a | cAd
FOLLOW(S)={$}
A A a sync A cAd sync sync sync
FOLLOW(A)={b,d}
• At each reduction step, a substring of the input matching to the right side of a
production rule is replaced by the non-terminal at the left side of that production rule.
• If the substring is chosen correctly, the right most derivation of that string is created in
the reverse order.
Rightmost Derivation: S
*
rm
aABb
S rm aaAbb
rm aAbb rm rm aaabb
1. Shift : The next input symbol is shifted onto the top of the stack.
2. Reduce: Replace the handle on the top of the stack by the non-
terminal.
3. Accept: Successful completion of parsing.
4. Error: Parser discovers a syntax error, and calls an error recovery
routine.
LR(k) parsing.
4. Error -- Parser detected an error (an empty entry in the action table)
Compiler Design 40106 73
Reduce Action
• pop 2|| (r is the length of ) items from the stack;
let us assume that = Y1Y2...Yr
• then push A and s where s=goto[sm-r,A]
( So X1 S1 ... Xm-r Sm-r Y1 Sm-r+1 ...Yr Sm, ai ai+1 ... an $ )
( So X1 S1 ... Xm-r Sm-r A s, ai ... an $ )
2) ET 0 s5 s4 1 2 3
1 s6 acc
3) T T*F 2 r2 s7 r2 r2
4) TF 3 r4 r4 r4 r4
5) F (E) 4 s5 s4 8 2 3
6) F id 5 r6 r6 r6 r6
6 s5 s4 9 3
7 s5 s4 10
8 s6 s11
9 r1 s7 r1 r1
10 r3 r3 r3 r3
11 r5 r5 r5 r5
. .
F (E), F id }
goto(I,E) = { ----------}
goto(I,T) = { -----------}
goto(I,F) = {------------}
goto(I,() = { ---------- }
goto(I,id) = {----------}
.
goto(I,id) = { F id }
Compiler Design 40106 83
Construction of The Canonical LR(0) Collection
• To create the SLR parsing tables for a grammar G, we will create the
canonical LR(0) collection of the grammar G’.
• Algorithm:
.
C is { closure({S’ S}) }
repeat the followings until no more set of LR(0) items can be added to C.
for each I in C and each grammar symbol X
if goto(I,X) is not empty and not in C
add goto(I,X) to C
I5: F id.
Problem
FOLLOW(A)={a,b}
FOLLOW(B)={a,b}
a reduce by A b reduce by A
reduce by B reduce by B
reduce/reduce conflict reduce/reduce conflict
can be written as
.
A ,a1/a2/.../an
I9:S L=R.,$
R I13:L *R.,$
I6:S L=.R,$ to I9
R .L,$ L I10:R L.,$
to I10
L .*R,$ *
R
L .id,$ to I11 I11:L *.R,$ to I13
id R .L,$ L
to I12 to I10
L .*R,$ *
I7:L *R.,$/= L .id,$ to I11
id
I8: R L.,$/= to I12
I12:L id.,$
Compiler Design 40106 95
Canonical LR(1) Collection – Example2
S’ S I0:S’ .S,$ I1:S’ S.,$ I4:L *.R,$/= R to I7
1) S L=R S .L=R,$ S * R .L,$/= L
to I8
2) S R S .R,$ L I2:S L.=R,$ to I6 L .*R,$/= *
3) L *R L .*R,$/= R L.,$ L .id,$/= to I4
id
4) L id L .id,$/= R to I5
I3:S R.,$ id
I5:L id.,$/=
5) R L R .L,$
I9:S L=R.,$
R I13:L *R.,$
I6:S L=.R,$ to I9
R .L,$ L I10:R L.,$
to I10
L .*R,$ *
R
I4 and I11
L .id,$ to I11 I11:L *.R,$ to I13
id R .L,$ L I5 and I12
to I12 to I10
L .*R,$ *
I7:L *R.,$/= L .id,$ to I11 I7 and I13
id
I8: R L.,$/= to I12 I8 and I10
I12:L id.,$
Compiler Design 40106 96
LR(1) Parsing Tables – (for Example2)
id * = $ S L R
0 s5 s4 1 2 3
1 acc
2 s6 r5
3 r2
4 s5 s4 8 7
5 r4 r4 no shift/reduce or
6 s12 s11 10 9 no reduce/reduce conflict
7
8
r3
r5
r3
r5
9 r1 so, it is a LR(1) grammar
10 r5
11 s12 s11 10 13
12 r4
13 r3
c I6 I9
C
c I7
d d
I3 I8
c
I4
d C
d
An Example
c d $ S C
0 s3 s4 1 2
1 acc
2 s6 s7 5
3 s3 s4 8
4 r3 r3
5 r1
6 s6 s7 9
7 r3
8 r2 r2
9 r2
The Core of LR(1) Items
• LALR parsers are often used in practice because LALR parsing tables
are smaller than LR(1) parsing tables.
• The number of states in SLR and LALR parsing tables for a grammar G
are equal.
• But LALR parsers recognize more grammars than SLR parsers.
• yacc creates a LALR parser for the given grammar.
• A state of LALR parser will be again a set of LR(1) items.
• We will find the states (sets of LR(1) items) in a canonical LR(1) parser with same
cores. Then we will merge them as a single state.
.
I1:L id ,= A new state: .
I12: L id ,=
.
L id ,$
.
I2:L id ,$ have same core, merge them
• We will do this for all states of a canonical LR(1) parser to get the states of the LALR
parser.
• In fact, the number of the states of the LALR parser for a grammar will be equal to the
number of states of the SLR parser for that grammar.
I7 d
d
C d , $
c
C c C, c/d C I89
c C c C, c/d
C d, c/d C c C , c/d/$
I3
d
d I4
C d , c/d
S’ S, $ I1
S C C, $ S (S’ S , $
C c C, c/d
C d, c/d
I0 S C C, $ I5
C C
C c C, $ S C C , $
C d, $
I2
c I6
d C c C, $ C
c C c C, $
C d, $
d
d I47
C d , c/d/$
c
C c C, c/d
d
c C I89
C c C, c/d
C d, c/d C c C , c/d/$
I3
S’ S, $ I1
S C C, $ S (S’ S , $
C c C, c/d
C d, c/d
I0 S C C, $ I5
C C
C c C, $ S C C , $
C d, $
I2
c I36
C c C, c/d/$
c C c C,c/d/$
C
C d,c/d/$
c
d
d I47
C d , c/d/$
d
I89
C c C , c/d/$
LALR Parse Table
c d $ S C
0 s36 s47 1 2
1 acc
2 s36 s47 5
36 s36 s47 89
47 r3 r3 r3
5 r1
89 r2 r2 r2
Shift/Reduce Conflict
• We say that we cannot introduce a shift/reduce conflict during the
shrink process for the creation of the states of a LALR parser.
• Assume that we can introduce a shift/reduce conflict. In this case, a state
of LALR parser must have:
.
A ,a and B a,b .
• This means that a state of the canonical LR(1) parser must have:
.
A ,a and B a,c .
But, this state has also a shift/reduce conflict. i.e. The original canonical
LR(1) parser has a conflict.
(Reason for this, the shift operation does not depend on lookaheads)
.
I1 : A ,a I2: A ,b .
B .,b B .,c
.
I12: A ,a/b reduce/reduce conflict
.
B ,b/c
I9:S L=R.,$
R I13:L *R.,$
I6:S L=.R,$ to I9
R .L,$ L I10:R L.,$
to I10
L .*R,$ *
R
L .id,$ to I11 I11:L *.R,$ to I13
id R .L,$ L
to I12 to I10
L .*R,$ *
I7:L *R.,$/= L .id,$ to I11
id
I8: R L.,$/= to I12
I12:L id.,$
Compiler Design 40106 114
Canonical LR(1) Collection – Example2
S’ S I0:S’ .S,$ I1:S’ S.,$ I4:L *.R,$/= R to I7
1) S L=R S .L=R,$ S * R .L,$/= L
to I8
2) S R S .R,$ L I2:S L.=R,$ to I6 L .*R,$/= *
3) L *R L .*R,$/= R L.,$ L .id,$/= to I4
id
4) L id L .id,$/= R to I5
I3:S R.,$ id
I5:L id.,$/=
5) R L R .L,$
I9:S L=R.,$
R I13:L *R.,$
I6:S L=.R,$ to I9
R .L,$ L I10:R L.,$
to I10
L .*R,$ *
R
I4 and I11
L .id,$ to I11 I11:L *.R,$ to I13
id R .L,$ L I5 and I12
to I12 to I10
L .*R,$ *
I7:L *R.,$/= L .id,$ to I11 I7 and I13
id
I8: R L.,$/= to I12 I8 and I10
I12:L id.,$
Compiler Design 40106 115
Canonical LALR(1) Collection – Example2
S’ S I0:S’ .S,$ I1:S’ S ,$. I411:L * R,$/=. R
1) S L=R S .
L=R,$ S *
.. R L,$/=. L
to I713
2) S R S .
R,$ L I2:S L =R,$ to I6 L *R,$/= . *
to I810
3) L *R L .
*R,$/= R L ,$ .
L id,$/= to I411
4) L id L .
id,$/= R
I3:S R ,$. id
.
id
to I512
5) R L R .
L,$
I :L id ,$/=
512
.
I6:S L= R,$
R
to I9 I9:S L=R ,$ .
..
R L,$
L *R,$
L
*
to I810
Same Cores
I4 and I11
.
L id,$
id
to I411
to I512
I5 and I12
.
I713:L *R ,$/=
I7 and I13
.
I810: R L ,$/=
I8 and I10
E
E
..(E)
id
*
E id
..
E (E)
id
I3
(
I : E E *.E
(
I2: E ( ..E+E
5
E .E+E (
E
.. .
I8: E E*E + I4
E .E*E
E)
id I2 E E +E * I
E
E .(E)
E E *E 5
E .E*E E .id
I3
id
E
E
..(E)
id
E
I : E (E.) .
id ) I9: E (E)
E E.+E
6
+
I3: E id. E E.*E * I4
I5
• LALR(1) parsers may perform a few reductions after the error has been
encountered, and then detect it
Error recovery in LR parsing
• Phase-level recovery
– Associate error routines with the empty table slots. Figure out what
situation may have cause the error and make an appropriate
recovery.
• Panic-mode recovery
– Discard symbols from the stack until a non-terminal is found.
Discard input symbols until a possible lookahead for that non-
terminal is found. Try to continue parsing.
• Error productions
– Common errors can be added as rules to the grammar.
Error recovery in LR parsing
• Phase-level recovery
– Consider the table for grammar EE+E | id
+ id $ E
0 s2 1
1 s3 accept
2 r(Eid) r(Eid)
3 s2 4
4 r(EE+E) r(EE+E)
Error recovery in LR parsing
• Phase-level recovery
– Consider the table for grammar EE+E | id
+ id $ E
0 e1 s2 e1 1
1 s3 e2 accept
2 r(Eid) e3 r(Eid)
3 e1 s2 e1 4
4 r(EE+E) e2 r(EE+E)
• Error productions
– Allow the parser to "recognize" erroneous input.
– Example:
• statement : expression SEMI
| error SEMI
| error NEWLINE
– If the expression cannot be matched, discard
everything up to the next semicolon or the next new
line
Error recovery in LR parsing
E E + E | E * E | ( E ) | id ---- (1)
0 s3 s2 1
1 s4 s5 acc
2 s3 s2 6
3 r4 r4 r4 r4
4 s3 s2 7
5 s3 s2 8
6 s4 s5 s9
7 r1 s5 r1 r1
8 r2 r2 r2 r2
9 r3 r3 r3 r3
The LR parsing table with error routines
Id + * ( ) $ E
0 s3 e1 e1 s2 e2 e1 1
1 e3 s4 s5 e3 e2 acc
2 s3 e1 e1 s2 e2 e1 6
3 r4 r4 r4 r4 r4 r4
4 s3 e1 e1 s2 e2 e1 7
5 s3 e1 e1 s2 e2 e1 8
6 e3 s4 s5 e3 s9 e4
7 r1 r1 s5 r1 r1 r1
8 r2 r2 r2 r2 r2 r2
9 r3 r3 r3 r3 r3 r3
Error routines : e1
• This routine is called from states 0,2,4 and 5, all of which the beginning
of the operand, either an id of left parenthesis. Instead an operator, + or
*, or the end of input was found.
• Action: Push an imaginary id on to the stack and cover it with a state 3.
(the goto of the states 0, 2, 4 and 5)
• Print: Issue diagnostic “missing operand”
Error routines : e2
• This routine is called from states 0, 1, 2, 4 and 5 on the finding a right
parenthesis.
• Action: Remove the right parenthesis from the input
• Print: Issue diagnostic “Unbalanced right parenthesis”
Error routines : e3
• This routine is called from states 1 or 6 when expecting an operator, and
an id or right parenthesis is found.
• Action: Push + onto the stack and cover it with state 4.
• Print: Issue diagnostic “Missing operator”
Error routines : e4
• This routine is called from state 6 when the end of input is found while
expecting operator or a right parenthesis.
• Action: Push a right parenthesis onto the stack and cover it with a state
9.
• Print: Issue diagnostic “Missing right parenthesis”