Unit I
b) Machine code
c) Object code
d) Macro
e) Mnemonic
A word or string which is intended to be easier to remember than the thing it stands for. Most often used in "instruction mnemonic": instruction mnemonics are so called because they are easier to remember than the binary patterns they stand for.
f) Opcode
The part or parts of a machine language instruction which determine what kind of action the computer should take, e.g. add, jump, load, store.
g) Instruction set
h) High-level language
i) Assembly language
j) Interpreter
Interpreting code is slower than running the compiled code because the interpreter
must analyse each statement in the program each time it is executed and then perform
the desired action whereas the compiled code just performs the action. This run-time
analysis is known as "interpretive overhead". Access to variables is also slower in an
interpreter because the mapping of identifiers to storage locations must be done
repeatedly at run time rather than at compile time.
k) Compiler
A program that converts another program from some source language (or
programming language) to machine language (object code). Some compilers output
assembly language which is then converted to machine language by a separate
assembler.
l) Peephole optimisation
A kind of code optimisation that considers only a few adjacent instructions at a time
and looks for certain combinations which can be replaced with more efficient
sequences, e.g. the pair
ADD R0, #1
ADD R0, #1
can be replaced by the single instruction ADD R0, #2.
m) Register allocation
The phase of a compiler that determines which values will be placed in registers.
Register allocation may be combined with register assignment, which picks the specific register for each value.
n) C preprocessor
The standard macro-expansion utility run as the first phase of the C compiler, cc. Cpp interprets lines beginning with "#", such as "#include" and "#define".
o) Execution
p) Grammar
q) Instruction mnemonic
E.g. ADD, B (branch), BLT (branch if less than), SVC, MOVE, LDR (load register).
r) Lexical analysis
The first stage of processing a language. The stream of characters making up the source program or other input is read one at a time and grouped into lexemes (or "tokens") - word-like pieces such as keywords, identifiers, literals and punctuation. The lexemes are then passed to the parser.
s) Lexical analyser (Or "scanner")
The initial input stage of a language processor (e.g. a compiler), the part that performs
lexical analysis.
t) Parser
u) Parser generator
A program which takes a formal description of a grammar (e.g. in BNF) and outputs
source code for a parser which will recognise valid strings obeying that grammar and
perform associated actions. Unix's yacc is a well-known example.
v) Run time
The period of time during which a program is being executed, as opposed to compile-
time or load time.
w) Token
Understanding the operation of a compiler gives better insight into the structure of
programming languages.
The concept of syntax-directed processing is also of wider importance.
COMPILERS
(Diagram: source program -> compiler -> target program, with error messages as a side output.)
Other Applications
Phases of a Compiler
source program -> lexical analysis -> syntax analysis -> semantic analysis -> code optimisation -> code generation -> target program
Each phase transforms the source program from one representation into another
representation.
They communicate with error handlers.
They communicate with the symbol table.
Lexical Analyzer
Lexical Analyzer reads the source program character by character and returns the
tokens of the source program.
A token describes a pattern of characters having the same meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters and so on).
Ex: newval := oldval + 12 => tokens:
o newval identifier
o := assignment operator
o oldval identifier
o + add operator
o 12 number
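As an illustration of this tokenization step, here is a minimal sketch in Python; the token names NUMBER, ASSIGN, PLUS and ID are illustrative, not part of any particular compiler:

import re

# Minimal tokenizer sketch for statements like "newval := oldval + 12".
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ASSIGN", r":="),
    ("PLUS",   r"\+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("SKIP",   r"[ \t]+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokens(source):
    # Scan the source left to right, yielding (token-name, lexeme) pairs.
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokens("newval := oldval + 12")))
# [('ID', 'newval'), ('ASSIGN', ':='), ('ID', 'oldval'),
#  ('PLUS', '+'), ('NUMBER', '12')]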
Syntax Analyzer
A Syntax Analyzer creates the syntactic structure (generally a parse tree) of the
given program.
A syntax analyzer is also called a parser.
A parse tree describes the syntactic structure of the input. For newval := oldval + 12:
(Parse tree: assgstmt -> identifier := expression, where the identifier is newval; the expression has subtrees expression + expression, the left deriving the identifier oldval and the right deriving the number 12.)
Syntax Analyzer (CFG)
Parsing Techniques
Depending on how the parse tree is created, there are different parsing techniques.
These parsing techniques are categorized into two groups:
Top-Down Parsing, Bottom-Up Parsing
Top-Down Parsing:
Construction of the parse tree starts at the root, and proceeds towards the leaves.
Efficient top-down parsers can be easily constructed by hand.
Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL Parsing).
Bottom-Up Parsing:
Construction of the parse tree starts at the leaves, and proceeds towards the root.
Normally efficient bottom-up parsers are created with the help of some software
tools.
Bottom-up parsing is also known as shift-reduce parsing.
Operator-Precedence Parsing – simple, restrictive, easy to implement
LR Parsing – a more general form of shift-reduce parsing: LR, SLR, LALR
Semantic Analyzer
A semantic analyzer checks the source program for semantic errors and collects
the type information for the code generation.
Type-checking is an important part of the semantic analyzer.
Normally, semantic information cannot be represented by the context-free grammars used in syntax analysis.
Context-free grammars used in the syntax analysis are integrated with attributes (semantic rules);
the result is a syntax-directed translation:
attribute grammars.
Ex:
newval := oldval + 12
The type of the identifier newval must match the type of the expression (oldval+12).
The code optimizer optimizes the code produced by the intermediate code generator in terms of time and space.
Ex:
MULT id2,id3,temp1
ADD temp1,#1,id1
Code Generator
MOVE id2,R1
MULT id3,R1
ADD #1,R1
MOVE R1,id1
Some software tools for analysis
1. Structure Editors:
2. Pretty printers:
A pretty printer analyzes a program and prints it in such a way that the
structure of the program becomes clearly visible.
3. Static Checkers:
A static checker reads a program, analyzes it, and attempts to discover potential bugs without running the program.
4. Interpreters:
Instead of producing a target program as a translation, an interpreter performs the operations implied by the source program (for an assignment statement, for example). Interpreters are frequently used to execute command languages, since each operator executed in a command language is usually an invocation of a complex routine such as an editor or compiler.
The analysis portion of the following tools is similar to that of a conventional compiler:
1. Text formatters:
A text formatter takes input that is a stream of characters, most of which is text to be typeset, but some of which includes commands to indicate paragraphs, figures, or mathematical structures like subscripts and superscripts.
2. Silicon Compilers:
A silicon compiler has a source language that is similar or identical to a conventional programming language. However, the variables of the language represent not locations in memory but logical signals (0 or 1) or groups of signals in a switching circuit. The output is a circuit design in an appropriate language.
3. Query interpreters:
A query interpreter translates a predicate containing relational and Boolean
operators into commands to search a database for records satisfying that predicate.
1. Parser generators
2. Scanner generators
3. Syntax directed translation engines
4. Automatic code generators
5. Data flow engines
1. Parser generator
Parser generators produce a syntax analyzer from input that is based on a context-free grammar. Historically, syntax analysis consumed a large fraction of the running time of a compiler, and a large fraction of the intellectual effort of writing one. Many parser generators utilize powerful parsing algorithms that are too complex to be carried out by hand.
2. Scanner generators
3. Syntax directed translation engines
Syntax directed translation engines produce collections of routines that walk the parse tree and produce intermediate code. Translations are associated with each node of
the parse tree. Each translation is defined in terms of translations at its neighbor nodes of
the tree.
4. Automatic code generators
Automatic code generators take a collection of rules that define the translation of each operation of the intermediate language into the machine language for the target machine. The rules must have sufficient detail to handle the different possible access methods for data.
E.g. variables may be in the registers, or location in memory or allocated a
position on a stack.
Each intermediate code statement is replaced by a template that represents a sequence of machine instructions, in such a way that the assumptions about the storage of variables match from template to template. This technique is called template matching.
5. Data flow engines
Data flow engines help in data flow analysis, which is needed to perform good code optimisation. Data flow analysis also tells us how values are transmitted from one part of a program to another. The user supplies a detailed relationship between the intermediate code statements; the information is gathered and the data flow analysis is done.
Token specification
Alphabet :
a finite set of symbols (ASCII characters)
String :
Finite sequence of symbols on an alphabet
Sentence and word are also used in terms of string
ε is the empty string
|s| is the length of string s.
Language:
sets of strings over some fixed alphabet
∅, the empty set, is a language.
{ε}, the set containing only the empty string, is a language.
The set of well-formed C programs is a language.
The set of all possible identifiers is a language.
Operators on Strings:
Parts of string:
Operations on Languages
•Concatenation:
o L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }
•Union:
o L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
•Exponentiation:
o L0 = {ε}, L1 = L, L2 = LL, ...
•Kleene Closure:
o L* = zero or more occurrences
•Positive Closure:
o L+ = one or more occurrences
Example
L1 = {a,b,c,d} L2 = {1,2}
L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
L1 ∪ L2 = {a,b,c,d,1,2}
L1^3 = all strings of length three (using a, b, c, d)
L1* = all strings using the letters a, b, c, d, including the empty string
L1+ = the same, but not including the empty string
Regular Expressions
(r)+ = (r)(r)*
(r)? = (r) | ε
o * highest
o concatenation next
o | lowest
ab*|c means (a(b)*)|(c)
Examples:
Σ = {0,1}
0|1 => {0,1}
(0|1)(0|1) => {00,01,10,11}
0* => {ε, 0, 00, 000, 0000, ...}
(0|1)* => all strings of 0s and 1s, including the empty string
Regular Definitions
To write regular expression for some languages can be difficult, because their
regular expressions can be quite complex. In those cases, we may use regular
definitions.
We can give names to regular expressions, and we can use these names as
symbols to define other regular expressions.
A regular definition is a sequence of definitions
d1 → r1
d2 → r2
...
dn → rn
where each ri is a regular expression over the basic symbols and the previously defined names d1, ..., di-1.
Ex: unsigned numbers in Pascal
digit → 0 | 1 | ... | 9
digits → digit+
opt-fraction → ( . digits )?
opt-exponent → ( E (+|-)? digits )?
unsigned-num → digits opt-fraction opt-exponent
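The same regular definition can be tried out directly; a sketch in Python, using its re module as the matching engine:

import re

# The unsigned-num definition above as one Python regular expression.
digits = r"[0-9]+"
unsigned_num = re.compile(rf"{digits}(\.{digits})?(E[+-]?{digits})?")

for s in ["12", "3.14", "6E-2", "1.5E+10"]:
    print(s, bool(unsigned_num.fullmatch(s)))   # all four match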
Recognition machine
Finite Automata
A recognizer for a language is a program that takes a string x, and answers “yes”
if x is a sentence of that language, and “no” otherwise.
(Diagram: input string x -> recogniser -> "yes" or "no")
o F – a set of accepting states (final states)
ε-transitions are allowed in NFAs. In other words, we can move from one state to another without consuming any symbol.
An NFA accepts a string x if and only if there is a path from the starting state to one of the accepting states such that the edge labels along this path spell out x.
NFA (Example)
(Diagram: start state 0, with 0 -a-> {0,1}, 0 -b-> {0}, 1 -b-> {2}; state 2 is accepting.)
The transition table (part of the five-tuple describing the NFA):
state     a       b
0       {0,1}    {0}
1         –      {2}
2         –       –
o Σ – a set of input symbols (alphabet)
o move – a transition function from a state-symbol pair to a single state (not a set of states)
o q0 – a start (initial) state
o F – a set of accepting states (final states)
(Diagram: an example DFA over {a,b}.)
Implementing a DFA
Let us assume that the end of a string is marked with a special symbol (say eos).
The algorithm for recognition will be as follows: (an efficient implementation)
s ← q0 { start from the initial state }
c ← nextchar { get the next character from the input string }
while (c != eos) do { do until the end of the string }
begin
s ← move(s,c) { transition function }
c ← nextchar
end
if (s in F) then { if s is an accepting state }
return "yes"
else
return "no"
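A direct Python transcription of this recognition algorithm, as a sketch; the example DFA below is hypothetical (it accepts strings over {a,b} ending in ab):

# DFA given as a dict-of-dicts transition function.
move = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 0},
}
start, accepting = 0, {2}

def recognize(string):
    s = start                        # s <- q0
    for c in string:                 # c <- nextchar, until end of string
        if c not in move[s]:
            return "no"              # no transition defined: reject
        s = move[s][c]               # s <- move(s, c)
    return "yes" if s in accepting else "no"

print(recognize("aab"))   # yes
print(recognize("aba"))   # no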
Implementing a NFA
(Thompson's construction builds, for each regular expression, an NFA with a single start state i and a single accepting state f.)
For the regular expression r1 | r2, a new start state i has ε-edges into N(r1) and N(r2), and both feed into a new accepting state f.
(Diagrams: the NFAs for r1 | r2 and for r*.)
(Example - (a|b) * a )
(Diagrams: the NFAs for a, b, a|b, (a|b)* and (a|b)*a, built step by step by Thompson's construction.)
put ε-closure({q0}) into DS as an unmarked state
while (there is an unmarked set Q1 in DS) do
begin
mark Q1
for each input symbol a do
begin
Q2 ← ε-closure(move(Q1,a))
if (Q2 is not in DS) then
add Q2 into DS as an unmarked state
transfunc[Q1,a] ← Q2
end
end
•a set Q in DS is an accepting state of the DFA if it contains an accepting state of the NFA
•the start state of the DFA is ε-closure({q0})
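A sketch of this subset construction in Python, using the example NFA from the transition table above (0 -a-> {0,1}, 0 -b-> {0}, 1 -b-> {2}, accepting state 2); the dict encoding is an assumption of the sketch:

from collections import deque

nfa_move = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
nfa_start, nfa_accept, alphabet = 0, {2}, "ab"

def eps_closure(states):
    # This example NFA has no epsilon edges, so the closure is the set
    # itself; with epsilon edges you would chase them here.
    return frozenset(states)

def move(states, a):
    return {t for s in states for t in nfa_move.get((s, a), set())}

start = eps_closure({nfa_start})
DS, unmarked, transfunc = {start}, deque([start]), {}
while unmarked:                      # while there is an unmarked set in DS
    Q1 = unmarked.popleft()          # mark Q1
    for a in alphabet:
        Q2 = eps_closure(move(Q1, a))
        if Q2 and Q2 not in DS:
            DS.add(Q2)               # add Q2 into DS as an unmarked state
            unmarked.append(Q2)
        if Q2:
            transfunc[(Q1, a)] = Q2

accepting = {S for S in DS if S & nfa_accept}
print(len(DS), "DFA states,", len(accepting), "accepting")  # 3 DFA states, 1 accepting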
(Diagram: the Thompson NFA for (a|b)*a, states 0-8: 0 ε-> 1 and 7; 1 ε-> 2 and 4; 2 -a-> 3; 4 -b-> 5; 3 and 5 ε-> 6; 6 ε-> 1 and 7; 7 -a-> 8, with 8 accepting.)
S0 = ε-closure({0}) = {0,1,2,4,7}   mark S0
ε-closure(move(S0,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1   transfunc[S0,a] ← S1
ε-closure(move(S0,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2   transfunc[S0,b] ← S2
mark S1
ε-closure(move(S1,a)) = ε-closure({3,8}) = {1,2,3,4,6,7,8} = S1   transfunc[S1,a] ← S1
ε-closure(move(S1,b)) = ε-closure({5}) = {1,2,4,5,6,7} = S2   transfunc[S1,b] ← S2
mark S2
ε-closure(move(S2,a)) = {1,2,3,4,6,7,8} = S1   transfunc[S2,a] ← S1
ε-closure(move(S2,b)) = {1,2,4,5,6,7} = S2   transfunc[S2,b] ← S2
S1 contains the NFA accepting state 8, so S1 is an accepting state of the DFA; S0 is the start state.
(Diagram: the resulting DFA with S0 -a-> S1, S0 -b-> S2, S1 -a-> S1, S1 -b-> S2, S2 -a-> S1, S2 -b-> S2.)
Converting Regular Expressions Directly to DFAs
We may convert a regular expression into a DFA (without creating a NFA first).
First we augment the given regular expression by concatenating it with a special
symbol #.
r → (r)#   (the augmented regular expression)
Then, we create a syntax tree for this augmented regular expression.
In this syntax tree, all alphabet symbols (plus # and the empty string) in the
augmented regular expression will be on the leaves, and all inner nodes will be
the operators in that augmented regular expression.
Then each alphabet symbol (plus #) will be numbered (position numbers).
Syntax tree of (a|b)*a#
(a|b)*a → (a|b)*a#   (augmented regular expression)
(Diagram: the syntax tree, with cat-nodes on the spine and a star-node over the or-node of a and b; the leaves a, b, a, # carry position numbers 1, 2, 3, 4.)
followpos(i) is the set of positions which can follow position i in a string generated by the augmented regular expression.
For example, in ( a | b ) * a # (positions 1 2 3 4):
followpos(1) = {1,2,3}
followpos(2) = {1,2,3}
followpos(3) = {4}
followpos(4) = {}
•To evaluate followpos, we need three more functions defined on the nodes (not just the leaves) of the syntax tree:
•firstpos(n)
the set of the positions of the first symbols of strings generated by the sub-expression rooted at n.
•lastpos(n)
the set of the positions of the last symbols of strings generated by the sub-expression rooted at n.
•nullable(n)
true if the empty string is among the strings generated by the sub-expression rooted at n; false otherwise.
How to evaluate firstpos, lastpos, nullable
If n is concatenation-node with left child c1 and right child c2, and i is a position
in lastpos(c1), then all positions in firstpos(c2) are in followpos(i).
If n is a star-node, and i is a position in lastpos(n), then all positions in
firstpos(n) are in followpos(i).
If firstpos and lastpos have been computed for each node, followpos of each
position can be computed by making one depth-first traversal of the syntax tree.
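That traversal can be written out as a sketch in Python; the tuple encoding of the syntax tree for (a|b)*a# is an assumption of this sketch:

# Leaves carry a position number: ('leaf', pos).
# Inner nodes: ('cat', left, right), ('or', left, right), ('star', child).
tree = ('cat',
        ('cat', ('star', ('or', ('leaf', 1), ('leaf', 2))), ('leaf', 3)),
        ('leaf', 4))                 # positions: a=1, b=2, a=3, #=4

followpos = {1: set(), 2: set(), 3: set(), 4: set()}

def visit(n):
    """Return (nullable, firstpos, lastpos) of node n, filling followpos."""
    kind = n[0]
    if kind == 'leaf':
        return False, {n[1]}, {n[1]}
    if kind == 'or':
        n1, f1, l1 = visit(n[1]); n2, f2, l2 = visit(n[2])
        return n1 or n2, f1 | f2, l1 | l2
    if kind == 'cat':
        n1, f1, l1 = visit(n[1]); n2, f2, l2 = visit(n[2])
        for i in l1:                 # concatenation rule for followpos
            followpos[i] |= f2
        return (n1 and n2,
                f1 | (f2 if n1 else set()),
                l2 | (l1 if n2 else set()))
    if kind == 'star':
        _, f, l = visit(n[1])
        for i in l:                  # star rule for followpos
            followpos[i] |= f
        return True, f, l

visit(tree)
print(followpos)   # {1: {1,2,3}, 2: {1,2,3}, 3: {4}, 4: set()}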
After we calculate the follow positions, we are ready to create the DFA for the regular expression.
Example -- ( a | b ) * a #   (positions: a=1, b=2, a=3, #=4)
followpos(1)={1,2,3}
followpos(2)={1,2,3}
followpos(3)={4}
followpos(4)={}
S1 = firstpos(root) = {1,2,3}
mark S1
a: followpos(1) ∪ followpos(3) = {1,2,3,4} = S2   move(S1,a) = S2
b: followpos(2) = {1,2,3} = S1   move(S1,b) = S1
mark S2
a: followpos(1) ∪ followpos(3) = {1,2,3,4} = S2   move(S2,a) = S2
b: followpos(2) = {1,2,3} = S1   move(S2,b) = S1
start state: S1
accepting states: {S2} (S2 contains position 4, the position of #)
(Diagram: the DFA with S1 -a-> S2, S1 -b-> S1, S2 -a-> S2, S2 -b-> S1.)
Example -- ( a | ε ) b c* #   (positions: a=1, b=2, c=3, #=4)
followpos(1)={2}
followpos(2)={3,4}
followpos(3)={3,4}
followpos(4)={}
S1=firstpos(root)={1,2}
mark S1
a: followpos(1)={2}=S2 move(S1,a)=S2
b: followpos(2)={3,4}=S3 move(S1,b)=S3
mark S2
b: followpos(2)={3,4}=S3 move(S2,b)=S3
mark S3
c: followpos(3)={3,4}=S3 move(S3,c)=S3
start state: S1
accepting states: {S3} (S3 contains position 4, the position of #)
(Diagram: the DFA with S1 -a-> S2, S1 -b-> S3, S2 -b-> S3, S3 -c-> S3.)
Minimizing Number of States of a DFA
Start state of the minimized DFA is the group containing the start state of the
original DFA.
Accepting states of the minimized DFA are the groups containing the accepting
states of the original DFA.
Groups: G1 = {2}, G2 = {1,3}
So, the minimized DFA (with minimum states) has the states {1,3} and {2}:
(Diagram: {1,3} -a-> {2}.)
Some Other Issues in Lexical Analyzer
•What is the end of a token? Is there any character which marks the end of a
token?
It is normally not defined.
If the number of characters in a token is fixed, there is no problem: + or -.
But what about < and <= (or <> in Pascal)?
The end of an identifier: a character that cannot appear in an identifier can mark the end of the token.
We may need a lookahead. In Prolog:
p :- X is 1.     p :- X is 1.5.
The dot followed by a white-space character can mark the end of a number.
But if that is not the case, the dot must be treated as a part of the number.
Skipping comments
Normally we don’t return a comment as a token.
We skip a comment, and return the next token (which is not a comment) to the
parser.
So, the comments are only processed by the lexical analyzer, and they don't complicate the syntax of the language.
Each empty entry in the parsing table is filled with a pointer to a specific error routine to take care of that error case.
Error-Productions
If we have a good idea of the common errors that might be encountered, we can
augment the grammar with productions that generate erroneous constructs.
When an error production is used by the parser, we can generate appropriate error
diagnostics.
Since it is almost impossible to know all the errors that can be made by the
programmers, this method is not practical.
Panic-Mode Error Recovery
In panic-mode error recovery, we skip all the input symbols until a synchronizing token is found.
The error routines may also insert or delete symbols, or pop items from the stack.
We should be careful when we design these error routines, because we may put the parser into an infinite loop.
Lex specification -> LEX COMPILER -> lex.yy.c
lex.yy.c -> C COMPILER -> a.out
input stream -> a.out -> sequence of tokens
Lex specification
Declaration
Variables
Manifest constants: an identifier that is declared to represent a constant
Regular definition:
To write regular expression for some languages can be difficult,
because their regular expressions can be quite complex. In those
cases, we may use regular definitions.
We can give names to regular expressions, and we can use these
names as symbols to define other regular expressions.
Translation rules
The translation rules of the Lex program are statements of the form
p1 {action1}
p2 {action2}
.. …
pn {actionn}
Auxiliary procedures:
Example
%{
/* definitions of manifest constants LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitions */
delim [ \t\n]
ws {delim}+
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
%%
{ws} {/* no action and no return */}
if {return (IF);}
then {return (THEN);}
else {return (ELSE);}
{id} {yylval = install_id(); return (ID);}
{number} {yylval = install_num(); return (NUMBER);}
"<" {yylval = LT; return (RELOP);}
"<=" {yylval = LE; return (RELOP);}
"=" {yylval = EQ; return (RELOP);}
"<>" {yylval = NE; return (RELOP);}
">" {yylval = GT; return (RELOP);}
">=" {yylval = GE; return (RELOP);}
%%
install_id()
{
/* procedure to install the lexeme, whose first character is pointed to by yytext and whose length is yyleng, into the symbol table; returns a pointer to the entry */
}
install_num()
{
/* similar procedure to install a lexeme that is a number */
}
p1 {action1}
p2 {action2}
.. …
pn {actionn}
Our problem is to construct a recognizer that looks for the lexemes in the input buffer. If
more than one matches, the recognizer is to choose the longest lexeme matched.
(Diagram: the LEX COMPILER turns the Lex specification into a transition table; at run time a finite-automaton simulator reads lexemes from the input buffer and consults the transition table.)
(Diagram: the combined NFA, with a new start state S0 and ε-edges into N(p1), N(p2), ..., N(pn).)
NFA recognizes the longest prefix of the input string that is matched by a pattern
Simulate the NFA using the algorithm
Construct the sequence of sets of state that the combined NFA can be in after
seeing each input symbol
We must continue to simulate NFA until it reaches termination.
Example
a    {}
abb  {}
a*b+ {}
(Diagrams: NFAs for the three patterns, states 1-2 for a, 3-4-5-6 for abb, 7-8 for a*b+, plus the combined NFA with a common start state 0; a simulation trace shows which patterns, e.g. p1 and p3, match prefixes of the input.)
Syntax Analyzer
Syntax Analyzer creates the syntactic structure of the given source program.
This syntactic structure is mostly a parse tree.
Syntax Analyzer is also known as parser.
The syntax of a programming language is described by a context-free grammar (CFG). We will use BNF (Backus-Naur Form) notation in the description of CFGs.
The syntax analyzer (parser) checks whether a given source program satisfies the
rules implied by a context-free grammar or not.
If it satisfies, the parser creates the parse tree of that program.
Otherwise the parser gives the error messages.
(Diagram: a context-free grammar plus the token stream go into the parser, which produces a parse tree.)
Parser works on a stream of tokens.
1.Top-Down Parser
–the parse tree is created top to bottom, starting from the root.
2.Bottom-Up Parser
–the parse tree is created bottom to top, starting from the leaves
•Both top-down and bottom-up parsers scan the input from left to right (one symbol at a
time).
•Efficient top-down and bottom-up parsers can be implemented only for sub-classes of
context-free grammars.
LL for top-down parsing
LR for bottom-up parsing
Context-Free Grammars
In a context-free grammar,
we have:
A finite set of terminals (in our case, this will be the set of tokens)
A finite set of non-terminals (syntactic-variables)
A finite set of production rules in the following form:
A → α   where A is a non-terminal and α is a string of terminals and non-terminals (including the empty string)
A start symbol (one of the non-terminal symbols)
•Example:
E → E+E | E–E | E*E | E/E | -E
E → (E)
E → id
Derivations
E ⇒ E+E
•E+E derives from E
–we can replace E by E+E
–to be able to do this, we have to have a production rule E → E+E in our grammar.
E ⇒ E+E ⇒ id+E ⇒ id+id
•A sequence of replacements of non-terminal symbols is called a derivation of id+id from E.
⇒* : derives in zero or more steps
⇒+ : derives in one or more steps
CFG – Terminology
L(G) is the language of G (the language generated by G) which is a set of
sentences.
A sentence of L(G) is a string of terminal symbols of G.
If S is the start symbol of G, then ω is a sentence of L(G) iff S ⇒+ ω, where ω is a string of terminals of G.
If G is a context-free grammar, L(G) is a context-free language.
Two grammars are equivalent if they produce the same language.
For S ⇒* α:
- If α contains non-terminals, it is called a sentential form of G.
- If α does not contain non-terminals, it is called a sentence of G.
Derivation Example
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
OR
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
•At each derivation step, we can choose any of the non-terminals in the sentential form of G for the replacement.
Right-Most Derivation
E ⇒rm -E ⇒rm -(E) ⇒rm -(E+E) ⇒rm -(E+id) ⇒rm -(id+id)
•We will see that the top-down parsers try to find the left-most derivation of the given
source program.
•We will see that the bottom-up parsers try to find the right-most derivation of the given
source program in the reverse order.
Parse Tree
A parse tree pictures a derivation: inner nodes are non-terminals, leaves are terminals. For E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id), the tree is built step by step: the root E gets children - and E; that E gets children ( E ); the inner E gets children E + E; and the two inner Es are finally replaced by id.
(Diagrams: the sequence of partial parse trees for -E, -(E), -(E+E), -(id+E) and -(id+id).)
Ambiguity
A grammar that produces more than one parse tree for some sentence is called an ambiguous grammar. For id+id*id:
E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
(Diagrams: the two parse trees, one grouping id+(id*id), the other (id+id)*id.)
unambiguous grammar
→ unique selection of the parse tree for a sentence
We should eliminate the ambiguity in the grammar during the design phase of the compiler.
An ambiguous grammar should be rewritten to eliminate the ambiguity.
We have to prefer one of the parse trees of a sentence (generated by an ambiguous grammar); disambiguating the grammar restricts it to this choice.
Ambiguity (the dangling else):
stmt → if expr then stmt else stmt | if expr then stmt | ...
For "if E1 then if E2 then S1 else S2" there are two parse trees:
1. the else matches the outer if: if E1 then (if E2 then S1) else S2
2. the else matches the inner if: if E1 then (if E2 then S1 else S2)
(Diagrams: the two parse trees.)
We prefer the second parse tree (else matches with closest if).
So, we have to disambiguate our grammar to reflect this choice.
Left Recursion
A grammar is left recursive if it has a non-terminal A such that there is a derivation A ⇒+ Aα for some string α.
Immediate Left-Recursion
A → Aα | β   where β does not start with A
⇓ eliminate immediate left recursion
A → βA'
A' → αA' | ε   (an equivalent grammar)
In general,
A → Aα1 | ... | Aαm | β1 | ... | βn   (where no βi starts with A)
⇓
A → β1A' | ... | βnA'
A' → α1A' | ... | αmA' | ε
Ex: E → E+T | T, T → T*F | F, F → id | (E) becomes:
E → T E'
E' → +T E' | ε
T → F T'
T' → *F T' | ε
F → id | (E)
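The transformation for a single non-terminal can be sketched in Python; the dict encoding of grammars and the primed-name convention are assumptions of this sketch:

def eliminate_immediate(grammar, A):
    # Split A's productions into left-recursive (A alpha) and others (beta).
    rec  = [p[1:] for p in grammar[A] if p and p[0] == A]
    base = [p     for p in grammar[A] if not p or p[0] != A]
    if not rec:
        return
    Ap = A + "'"
    grammar[A]  = [b + [Ap] for b in base]          # A  -> beta A'
    grammar[Ap] = [a + [Ap] for a in rec] + [[]]    # A' -> alpha A' | eps

g = {"E": [["E", "+", "T"], ["T"]]}
eliminate_immediate(g, "E")
print(g)   # {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], []]}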
Left-Recursion – Problem
S → Aa | b
A → Sc | d
This grammar is not immediately left-recursive, but it is still left-recursive:
S ⇒ Aa ⇒ Sca, or
A ⇒ Sc ⇒ Aac, causes a left-recursion.
So, we have to eliminate all left-recursions from our grammar.
Ex: S → Aa | b, A → Ac | Sd | f   (order the non-terminals as A, then S)
for A:
- we do not enter the inner loop.
- Eliminate the immediate left-recursion in A:
A → SdA' | fA'
A' → cA' | ε
for S:
- Replace S → Aa with S → SdA'a | fA'a
So, we will have S → SdA'a | fA'a | b
- Eliminate the immediate left-recursion in S:
S → fA'aS' | bS'
S' → dA'aS' | ε
The resulting grammar, free of left-recursion:
S → fA'aS' | bS'
S' → dA'aS' | ε
A → SdA' | fA'
A' → cA' | ε
Left-Factoring
A predictive parser (a top-down parser without backtracking) insists that the grammar must be left-factored.
Ex: stmt → if expr then stmt else stmt | if expr then stmt
•when we see if, we cannot know which production rule to choose to rewrite stmt in the derivation.
In general,
A → αβ1 | αβ2   where α is non-empty and the first symbols of β1 and β2 (if they have one) are different.
When processing α we cannot know whether to expand A to αβ1 or to αβ2.
But, if we rewrite the grammar as
A → αA'
A' → β1 | β2
we can immediately expand A to αA'.
Left-Factoring -- Algorithm
For each non-terminal A with two or more alternatives (production rules) with a common non-empty prefix, let us say
A → αβ1 | ... | αβm | γ1 | ... | γn
convert it into
A → αA' | γ1 | ... | γn
A' → β1 | ... | βm
Left-Factoring – Example1
Left-Factoring – Example2
A → ad | a | ab | abc | b
⇓
A → aA' | b
A' → d | ε | b | bc
⇓
A → aA' | b
A' → d | ε | bA''
A'' → ε | c
There are some language constructions in the programming languages which are
not context-free. This means that, we cannot write a context-free grammar for
these constructions.
L1 = { ωcω | ω is in (a|b)* } is not context-free
→ it abstracts declaring an identifier and checking later whether it is declared. We cannot do this with a context-free language; we need a semantic analyzer (which is not context-free).
L2 = { a^n b^m c^n d^m | n ≥ 1 and m ≥ 1 } is not context-free
→ it abstracts declaring two functions (one with n parameters, the other with m parameters), and then calling them with matching numbers of actual parameters.
Top-Down Parsing
The parse tree is created top to bottom.
Top-down parsers:
–Recursive-Descent Parsing (may require backtracking)
–Predictive Parsing
Recursive Predictive Parsing is a special form of Recursive Descent parsing
without backtracking.
Non-Recursive (Table Driven) Predictive Parser is also known as LL(1) parser.
Ex (backtracking): S → aBc, B → bc | b; input: abc
(Trying B → bc first fails, since the final c of the input cannot then be matched; the parser backtracks and tries B → b, which succeeds.)
Predictive Parser
stmt → if ...... |
       while ...... |
       begin ...... |
       for .....
When we are trying to rewrite the non-terminal stmt, if the current token is if, we have to choose the first production rule.
When we are trying to rewrite a non-terminal, we can uniquely choose the production rule by just looking at the current token.
We eliminate the left recursion in the grammar and left-factor it. But it may still not be suitable for predictive parsing (it may not be an LL(1) grammar).
Ex: A → aBb | bAB
proc A {
case of the current token {
‘a’: - match the current token with a, and move to the next token;
- call ‘B’;
- match the current token with b, and move to the next token;
‘b’: - match the current token with b, and move to the next token;
- call ‘A’;
- call ‘B’;
}
}
A → aA | bB | ε
If all other productions fail, we should apply an ε-production. For example, if the current token is not a or b, we may apply the ε-production.
Most correct choice: we should apply an ε-production for a non-terminal A when the current token is in the follow set of A (the set of terminals that can follow A in the sentential forms).
Ex: A → aBe | cBd | C
B → bB | ε
C → f
proc C
{
match the current token with f, and move to the next token;
}
proc A
{
case of the current token
{
a: - match the current token with a, and move to the next token;
- call B;
- match the current token with e, and move to the next token;
c: - match the current token with c, and move to the next token;
- call B;
- match the current token with d, and move to the next token;
f: - call C
}
}
proc B
{
case of the current token
{
b: - match the current token with b, and move to the next token;
- call B;
e,d: do nothing   /* apply B → ε */
}
}
f – first set of C
e,d – follow set of B
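The same three procedures in Python, as a sketch (toks, pos and match are names invented for this sketch):

toks, pos = [], 0

def match(t):
    global pos
    if pos < len(toks) and toks[pos] == t:
        pos += 1
    else:
        raise SyntaxError(f"expected {t!r} at position {pos}")

def A():
    if toks[pos] == 'a':
        match('a'); B(); match('e')
    elif toks[pos] == 'c':
        match('c'); B(); match('d')
    else:
        C()                          # first set of C is {f}

def B():
    if pos < len(toks) and toks[pos] == 'b':
        match('b'); B()
    # else: apply B -> eps (the next token is in follow(B) = {e, d})

def C():
    match('f')

toks, pos = list("abbbe"), 0
A()
print("accepted" if pos == len(toks) else "error")   # accepted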
(Diagram: the non-recursive predictive parser uses an input buffer, a stack, a parsing table, and produces output.)
LL(1) Parser
input buffer
our string to be parsed. We will assume that its end is marked with a special
symbol $.
output
a production rule representing a step of the derivation sequence (left-most
derivation) of the string in the input buffer.
stack
–contains the grammar symbols
at the bottom of the stack, there is a special end marker symbol $.
initially the stack contains only the symbol $ and the starting symbol S.
$S initial stack
when the stack is emptied (i.e. only $ is left in the stack), the parsing is completed.
parsing table
a two-dimensional array M[A,a]
each row is a non-terminal symbol
each column is a terminal symbol or the special symbol $
each entry holds a production rule.
The symbol at the top of the stack (say X) and the current symbol in the input string (say
a) determine the parser action.
LL(1) Parser – Example1
S → aBa
B → bB | ε
LL(1) parsing table:
        a          b        $
S     S → aBa
B     B → ε      B → bB
Derivation (left-most): S ⇒ aBa ⇒ abBa ⇒ abbBa ⇒ abba
(Parse tree for abba: S → a B a; B → b B; B → b B; B → ε.)
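A table-driven sketch of this parser in Python, using the Example1 grammar and table; encoding ε as an empty list is an assumption of the sketch:

table = {
    ('S', 'a'): ['a', 'B', 'a'],
    ('B', 'a'): [],                  # B -> eps
    ('B', 'b'): ['b', 'B'],
}

def parse(inp):
    stack = ['$', 'S']               # initial stack: $S
    inp += '$'
    i = 0
    while stack:
        X, a = stack.pop(), inp[i]
        if X == a:                   # terminal (or $) matches the input
            i += 1
        elif (X, a) in table:
            rhs = table[(X, a)]
            stack.extend(reversed(rhs))             # push the right side
            print(X, '->', ''.join(rhs) or 'eps')   # output the production
        else:
            return "error"
    return "accepted" if i == len(inp) else "error"

print(parse("abba"))   # prints the left-most derivation steps, then: accepted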
LL(1) Parser – Example2
E → T E'
E' → +T E' | ε
T → F T'
T' → *F T' | ε
F → (E) | id
        id         +          *          (         )         $
E     E → TE'                         E → TE'
E'              E' → +TE'                       E' → ε     E' → ε
T     T → FT'                         T → FT'
T'              T' → ε     T' → *FT'            T' → ε     T' → ε
F     F → id                          F → (E)
FIRST(α) is the set of the terminal symbols which occur as first symbols in strings derived from α, where α is any string of grammar symbols.
If α derives ε, then ε is also in FIRST(α).
FOLLOW(A) is the set of the terminals which occur immediately after (follow) the non-terminal A in the strings derived from the starting symbol.
A terminal a is in FOLLOW(A) if S ⇒* αAaβ
$ is in FOLLOW(A) if S ⇒* αA
To compute FIRST(X):
If X is a terminal symbol: FIRST(X) = {X}
If X is ε: FIRST(X) = {ε}
If X is a non-terminal and X → Y1Y2..Yn is a production rule:
– if a terminal a is in FIRST(Yi) and ε is in all FIRST(Yj) for j=1,...,i-1, then a is in FIRST(X)
– if ε is in all FIRST(Yj) for j=1,...,n, then ε is in FIRST(X)
FIRST Example
E → T E'
E' → +T E' | ε
T → F T'
T' → *F T' | ε
F → (E) | id
FIRST(F) = {(, id}   FIRST(T') = {*, ε}   FIRST(T) = {(, id}
FIRST(E') = {+, ε}   FIRST(E) = {(, id}
Computing FOLLOW (for all non-terminals):
– If S is the start symbol, $ is in FOLLOW(S).
– If A → αBβ is a production rule, everything in FIRST(β) except ε is in FOLLOW(B).
– If A → αB is a production rule, or A → αBβ where ε is in FIRST(β), everything in FOLLOW(A) is in FOLLOW(B).
We apply these rules until nothing more can be added to any follow set.
FOLLOW Example
E → T E'
E' → +T E' | ε
T → F T'
T' → *F T' | ε
F → (E) | id
FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, ), $ }
FOLLOW(T’) = { +, ), $ }
FOLLOW(F) = {+, *, ), $ }
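These sets can be computed by iterating the FIRST rules to a fixed point; a sketch in Python for the grammar above, with "eps" standing for ε (an encoding choice of this sketch):

grammar = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
first = {A: set() for A in grammar}

def first_of(symbols):
    # FIRST of a string of grammar symbols, using the current sets.
    out = set()
    for X in symbols:
        f = first[X] if X in grammar else {X}   # terminal: FIRST(X)={X}
        out |= f - {"eps"}
        if "eps" not in f:
            return out
    return out | {"eps"}             # every symbol can derive eps

changed = True
while changed:                       # repeat until nothing can be added
    changed = False
    for A, prods in grammar.items():
        for p in prods:
            f = first_of(p)
            if not f <= first[A]:
                first[A] |= f
                changed = True

for A in grammar:
    print(A, sorted(first[A]))
# FIRST(E)=FIRST(T)=FIRST(F)={'(', 'id'}; FIRST(E')={'+','eps'}; FIRST(T')={'*','eps'}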
for each production rule A → α of a grammar G:
–for each terminal a in FIRST(α), add A → α to M[A,a]
–if ε is in FIRST(α), for each terminal a in FOLLOW(A), add A → α to M[A,a]
–if ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A,$]
All other undefined entries of the parsing table are error entries.
E' → ε   FIRST(ε) = {ε}: no terminal entries, but since ε is in FIRST(ε) and FOLLOW(E') = {$, )}: E' → ε into M[E',$] and M[E',)]
T' → ε   FIRST(ε) = {ε}: no terminal entries, but since ε is in FIRST(ε) and FOLLOW(T') = {$, ), +}: T' → ε into M[T',$], M[T',)] and M[T',+]
F → (E)   FIRST((E)) = {(}: F → (E) into M[F,(]
F → id   FIRST(id) = {id}: F → id into M[F,id]
LL(1) Grammars
LL(1):
L : input scanned from left to right
L : left-most derivation
(1) : one input symbol used as a lookahead symbol to determine the parser action
If an entry of the parsing table of a grammar contains more than one production rule, we say that the grammar is not an LL(1) grammar. Ex:
S → iCtSE | a
E → eS | ε
C → b
FIRST(iCtSE) = {i}
FIRST(a) = {a}
FIRST(eS) = {e}
FIRST(ε) = {ε}
FIRST(b) = {b}
(Here M[E,e] contains both E → eS and E → ε, since e is also in FOLLOW(E); so this grammar is not LL(1).)
What do we have to do if the resulting parsing table contains multiply defined entries?
If we didn’t eliminate left recursion, eliminate the left recursion in the grammar.
If the grammar is not left factored, we have to left factor the grammar.
If its (new grammar’s) parsing table still contains multiply defined entries, that
grammar is ambiguous or it is inherently not a LL(1) grammar.
A left-recursive grammar cannot be an LL(1) grammar:
A → Aα | β
– any terminal that appears in FIRST(β) also appears in FIRST(Aα), because Aα ⇒ βα.
– If β is ε, any terminal that appears in FIRST(α) also appears in FIRST(Aα) and FOLLOW(A).
A grammar that is not left-factored cannot be an LL(1) grammar:
A → αβ1 | αβ2
– any terminal that appears in FIRST(αβ1) also appears in FIRST(αβ2).
An ambiguous grammar cannot be a LL(1) grammar.
A grammar G is LL(1) if and only if the following conditions hold for every pair of distinct production rules A → α and A → β:
1. Both α and β cannot derive strings starting with the same terminal.
2. At most one of α and β can derive the empty string.
3. If β ⇒* ε, then α cannot derive any string starting with a terminal in FOLLOW(A).
Bottom-Up Parsing
A bottom-up parser creates the parse tree of the given input starting from leaves
towards the root.
A bottom-up parser tries to find the right-most derivation of the given input in the reverse
order.
S ⇒ ... ⇒ ω   (the right-most derivation of ω)
(the bottom-up parser finds the right-most derivation in the reverse order)
•Bottom-up parsing is also known as shift-reduce parsing because its two main actions are shift and reduce.
–At each shift action, the current symbol in the input string is pushed onto a stack.
–At each reduction step, the symbols at the top of the stack (this symbol sequence is the right side of a production) are replaced by the non-terminal at the left side of that production.
–There are also two more actions: accept and error.
Shift-Reduce Parsing
A shift-reduce parser tries to reduce the given input string into the starting symbol.
At each reduction step, a substring of the input matching the right side of a production rule is replaced by the non-terminal at the left side of that production rule.
If the substring is chosen correctly, the right most derivation of that string is created in
the reverse order.
How do we know which substring to be replaced at each reduction step?
Handle
Informally, a handle of a string is a substring that matches the right side of a production
rule.
–But not every substring that matches the right side of a production rule is a handle.
Formally: if S ⇒*rm αAω ⇒rm αβω, then the handle of the right-sentential form αβω is the production A → β at the position following α.
If the grammar is unambiguous, then every right-sentential form of the grammar has exactly one handle.
We will see that ω is a string of terminals.
Handle Pruning
A right-most derivation in reverse can be obtained by handle pruning. Let
S = γ0 ⇒ γ1 ⇒ ... ⇒ γn-1 ⇒ γn = ω
Start from γn: find a handle An → βn in γn, and replace βn by An to get γn-1.
Then find a handle An-1 → βn-1 in γn-1, and replace βn-1 by An-1 to get γn-2.
Repeat this, until we reach S.
A Shift-Reduce Parser
E → E+T | T
T → T*F | F
F → (E) | id
Right-most derivation of id+id*id:
E ⇒ E+T ⇒ E+T*F ⇒ E+T*id ⇒ E+F*id ⇒ E+id*id ⇒ T+id*id ⇒ F+id*id ⇒ id+id*id
A shift-reduce parser finds this derivation in reverse, by handle pruning:
id+id*id    reduce by F → id
F+id*id     reduce by T → F
T+id*id     reduce by E → T
E+id*id     reduce by F → id
E+F*id      reduce by T → F
E+T*id      reduce by F → id
E+T*F       reduce by T → T*F
E+T         reduce by E → E+T
E
The substring reduced at each step is the handle of that right-sentential form.
1.Shift : The next input symbol is shifted onto the top of the stack.
2.Reduce: Replace the handle on the top of the stack by the non-terminal.
3.Accept: Successful completion of parsing.
4.Error: Parser discovers a syntax error, and calls an error recovery routine.
Parse Tree
The stack contents and the next input symbol may not be enough to decide the parser action (there may be a shift/reduce or a reduce/reduce conflict).
If a shift-reduce parser cannot be used for a grammar, that grammar is called a non-LR(k) grammar.
LR(k): Left-to-right scanning, Right-most derivation, k lookahead symbols.
•An ambiguous grammar can never be an LR grammar.
Shift-Reduce Parsers
1.Operator-Precedence Parser
– simple, but handles only a small class of grammars
2.LR-Parsers
–covers wide range of grammars.
SLR – simple LR parser
LR – most general LR parser
LALR – intermediate LR parser (lookahead LR parser)
–SLR, LR and LALR work the same way; only their parsing tables are different.
(Diagram: among context-free grammars G, the SLR grammars are contained in the LALR grammars, which are contained in the LR grammars.)
Operator-Precedence Parser
Operator grammar
–small, but an important class of grammars
–we may have an efficient operator precedence parser (a shift-reduce parser) for an
operator grammar.
In an operator grammar, no production rule can have:
–ε on the right side
–two adjacent non-terminals on the right side.
•Ex:
E → AB, A → a, B → b — not an operator grammar (adjacent non-terminals)
E → EOE, E → id, O → + | * | / — not an operator grammar (O is a non-terminal between two non-terminals)
E → E+E | E*E | E/E | id — an operator grammar
Precedence Relations
In operator-precedence parsing, we define three disjoint precedence relations between
certain pairs of terminals.
a <. b b has higher precedence than a
a =· b b has same precedence as a
a .> b b has lower precedence than a
•The determination of correct precedence relations between terminals are based on the
traditional notions of associativity and precedence of operators. (Unary minus causes a
problem).
•The intention of the precedence relations is to find the handle of a right-sentential form: <. marks the left end, =· appears in the interior, and .> marks the right end of a handle.
•In our input string $a1a2...an$, we insert the precedence relation between the pairs of
terminals (the precedence relation holds between the terminals in that pair).
•Then the input string id+id*id with the precedence relations inserted will be:
$ <. id .> + <. id .> * <. id .> $
The precedence relations for id, +, * and $:
       id    +    *    $
id          .>   .>   .>
+     <.    .>   <.   .>
*     <.    .>   .>   .>
$     <.    <.   <.
(End of the parsing trace for id+id*id: with $ ... * .> $ on top, the handle is reduced by E → E*E; then $ <. + .> $ reduces by E → E+E; finally the configuration $E$ with input $ means accept.)
•Also, let:
o ( =· )    $ <. (    id .> )    ) .> $
o ( <. (    $ <. id   id .> $    ) .> )
o ( <. id
Operator-Precedence Relations
        +    -    *    /    ^    id   (    )    $
+      .>   .>   <.   <.   <.   <.   <.   .>   .>
-      .>   .>   <.   <.   <.   <.   <.   .>   .>
*      .>   .>   .>   .>   <.   <.   <.   .>   .>
/      .>   .>   .>   .>   <.   <.   <.   .>   .>
^      .>   .>   .>   .>   <.   <.   <.   .>   .>
id     .>   .>   .>   .>   .>             .>   .>
(      <.   <.   <.   <.   <.   <.   <.   =·
)      .>   .>   .>   .>   .>             .>   .>
$      <.   <.   <.   <.   <.   <.   <.
Precedence Functions
Compilers using operator precedence parsers do not need to store the table of
precedence relations.
The table can be encoded by two precedence functions f and g that map terminal
symbols to integers.
For symbols a and b.
f(a) < g(b) whenever a <. b
f(a) = g(b) whenever a =· b
f(a) > g(b) whenever a .> b
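A sketch of how a parser uses the two functions; the integer values below are one possible encoding for the table with +, *, id and $, and are an assumption of this sketch:

f = {"+": 2, "*": 4, "id": 4, "$": 0}
g = {"+": 1, "*": 3, "id": 5, "$": 0}

def relation(a, b):
    # Compare f(top-of-stack terminal) with g(next input terminal).
    return "<." if f[a] < g[b] else (".>" if f[a] > g[b] else "=.")

print(relation("+", "*"))   # <.  (shift: * has higher precedence)
print(relation("*", "+"))   # .>  (reduce)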
Disadvantages:
It cannot handle the unary minus (the lexical analyzer should handle the unary minus).
Only a small class of grammars can be parsed this way.
Advantages:
simple
powerful enough for expressions in programming languages
Error Cases:
No relation holds between the terminal on the top of stack and the next input
symbol.
A handle is found (reduction step), but there is no production with this handle as a
right side
Error Recovery:
1.Each empty entry is filled with a pointer to an error routine.
2.When a handle is found but there is no production with this handle as a right side, the error routine decides which right-hand side the popped handle "looks like", and tries to recover from that situation.
LR Parsers
The most powerful shift-reduce parsing (yet efficient) is:
LR(k) parsing.
L : left to right scanning
R : right-most derivation
(k) : k lookahead symbols (when k is omitted, it is 1)
The class of grammars that can be parsed using LR methods is a proper superset
of the class of grammars that can be parsed with predictive parsers.
LL(1)-Grammars ⊂ LR(1)-Grammars
An LR-parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input.
LR-Parsers
LR Parsing Algorithm
A Configuration of LR Parsing Algorithm
A configuration of a LR parsing is:
( So X1 S1 ... Xm Sm, ai ai+1 ... an $ )
Sm and ai decide the parser action by consulting the parsing action table. (The initial stack contains just S0.)
Actions of A LR-Parser
1.shift s -- shifts the next input symbol and the state s onto the stack
( So X1 S1 ... Xm Sm, ai ai+1 ... an $ ) → ( So X1 S1 ... Xm Sm ai s, ai+1 ... an $ )
2.reduce A → β -- see below
3.Accept -- parsing is successfully completed
4.Error -- parser detected an error (an empty entry in the action table)
Reduce Action
pop 2|β| (= 2r) items from the stack; let us assume that β = Y1Y2...Yr
then push A and s, where s = goto[sm-r, A]:
( So X1 S1 ... Xm Sm, ai ... an $ ) → ( So X1 S1 ... Xm-r Sm-r A s, ai ... an $ )
1) E → E+T
2) E → T
3) T → T*F
4) T → F
5) F → (E)
6) F → id
state    id     +     *     (     )     $      E    T    F
0        s5                s4                  1    2    3
1               s6                     acc
2               r2    s7         r2    r2
3               r4    r4         r4    r4
4        s5                s4                  8    2    3
5               r6    r6         r6    r6
6        s5                s4                       9    3
7        s5                s4                            10
8               s6               s11
9               r1    s7         r1    r1
10              r3    r3         r3    r3
11              r5    r5         r5    r5
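A driver loop for this table, sketched in Python; the stack holds only states, and just enough action/goto entries are encoded here to parse id+id (the tuple encoding of actions is an assumption of the sketch):

prods = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3),
         4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}   # production: (lhs, |rhs|)
action = {(0, "id"): ("s", 5), (1, "+"): ("s", 6), (1, "$"): "acc",
          (2, "+"): ("r", 2), (2, "$"): ("r", 2),
          (3, "+"): ("r", 4), (3, "$"): ("r", 4),
          (5, "+"): ("r", 6), (5, "$"): ("r", 6),
          (6, "id"): ("s", 5), (9, "$"): ("r", 1)}
goto = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3, (6, "T"): 9, (6, "F"): 3}

def parse(tokens):
    stack, toks, i = [0], tokens + ["$"], 0
    while True:
        act = action.get((stack[-1], toks[i]))
        if act == "acc":
            return "accepted"
        if act is None:
            return "error"
        if act[0] == "s":                    # shift state s
            stack.append(act[1]); i += 1
        else:                                # reduce by production act[1]
            lhs, r = prods[act[1]]
            del stack[len(stack) - r:]       # pop |rhs| states
            stack.append(goto[(stack[-1], lhs)])

print(parse(["id", "+", "id"]))   # accepted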
Actions of A (S)LR-Parser -- Example
If I is a set of LR(0) items for a grammar G, then closure(I) is the set of LR(0) items constructed from I by the two rules:
1. Initially, every LR(0) item in I is added to closure(I).
2. If A → α.Bβ is in closure(I) and B → γ is a production rule of G, then B → .γ will be in closure(I).
We apply these rules until no more new LR(0) items can be added to closure(I).
Ex: for the grammar
E' → E
E → E+T | T
T → T*F | F
F → (E) | id
closure({E' → .E}) = { E' → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }
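A sketch of the closure computation in Python, with an item encoded as a (lhs, rhs, dot-position) tuple (an assumption of this sketch):

grammar = {
    "E'": [("E",)],
    "E":  [("E", "+", "T"), ("T",)],
    "T":  [("T", "*", "F"), ("F",)],
    "F":  [("(", "E", ")"), ("id",)],
}

def closure(items):
    items = set(items)
    changed = True
    while changed:                   # until no more items can be added
        changed = False
        for (lhs, rhs, dot) in list(items):
            if dot < len(rhs) and rhs[dot] in grammar:  # dot before non-terminal B
                for prod in grammar[rhs[dot]]:
                    item = (rhs[dot], prod, 0)          # add B -> .gamma
                    if item not in items:
                        items.add(item); changed = True
    return items

I0 = closure({("E'", ("E",), 0)})
for lhs, rhs, dot in sorted(I0):
    print(lhs, "->", " ".join(rhs[:dot]) + "." + " ".join(rhs[dot:]))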
Goto Operation
If I is a set of LR(0) items and X is a grammar symbol (terminal or non-terminal), then goto(I,X) is defined as follows:
–If A → α.Xβ is in I, then every item in closure({A → αX.β}) will be in goto(I,X).
Example:
I = { E' → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }
goto(I,E) = { E' → E., E → E.+T }
goto(I,T) = { E → T., T → T.*F }
goto(I,F) = { T → F. }
goto(I,() = { F → (.E), E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }
goto(I,id) = { F → id. }
To create the SLR parsing tables for a grammar G, we will create the canonical LR(0)
collection of the grammar G’.
Algorithm:
C is { closure({S' → .S}) }
repeat the following until no more sets of LR(0) items can be added to C:
for each I in C and each grammar symbol X:
if goto(I,X) is not empty and not in C, add goto(I,X) to C
The canonical LR(0) collection for the grammar:
I0: E' → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id
I1: E' → E., E → E.+T
I2: E → T., T → T.*F
I3: T → F.
I4: F → (.E), E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id
I5: F → id.
I6: E → E+.T, T → .T*F, T → .F, F → .(E), F → .id
I7: T → T*.F, F → .(E), F → .id
I8: F → (E.), E → E.+T
I9: E → E+T., T → T.*F
I10: T → T*F.
I11: F → (E).
(Transition diagram: goto(I0,E)=I1, goto(I0,T)=I2, goto(I0,F)=I3, goto(I0,()=I4, goto(I0,id)=I5; goto(I1,+)=I6; goto(I2,*)=I7; goto(I4,E)=I8, with I4 going to I2, I3, I4, I5 on T, F, (, id; goto(I6,T)=I9, with I6 going to I3, I4, I5 on F, (, id; goto(I7,F)=I10, with I7 going to I4, I5 on (, id; goto(I8,))=I11, goto(I8,+)=I6; goto(I9,*)=I7.)
SLR(1) Grammar
An LR parser using SLR(1) parsing tables for a grammar G is called as the SLR(1) parser
for G.
If a grammar G has an SLR(1) parsing table, it is called SLR(1) grammar (or SLR
grammar in short).
Every SLR grammar is unambiguous, but not every unambiguous grammar is an SLR grammar.
If a state does not know whether it will make a shift operation or reduction for a terminal,
we say that there is a shift/reduce conflict.
If a state does not know whether it will make a reduction operation using the production
rule i or j for a terminal, we say that there is a reduce/reduce conflict.
If the SLR parsing table of a grammar G has a conflict, we say that that grammar is not an SLR grammar.
Conflict Example
S → L=R
S → R
L → *R
L → id
R → L
I0: S' → .S, S → .L=R, S → .R, L → .*R, L → .id, R → .L
I1: S' → S.
I2: S → L.=R, R → L.
I3: S → R.
I6: S → L=.R, R → .L, L → .*R, L → .id
I9: S → L=R.
In I2, on the input symbol = the parser can shift (S → L.=R) and can also reduce by R → L (since = is in FOLLOW(R)): a shift/reduce conflict, so the grammar is not SLR(1).
Conflict Example2
S → AaAb
S → BbBa
A → ε
B → ε
I0: S' → .S, S → .AaAb, S → .BbBa, A → ., B → .
Problem:
FOLLOW(A) = {a,b}
FOLLOW(B) = {a,b}
In I0, on both a and b the parser must reduce by A → ε and also by B → ε: a reduce/reduce conflict, so the grammar is not SLR(1).
LR(1) Item
To avoid some of invalid reductions, the states need to carry more information.
Extra information is put into a state by including a terminal symbol as a second
component in an item.
A LR(1) item is:
A → α.β, a   where a is the lookahead of the LR(1) item (a is a terminal or the end-marker $).
When β (in the LR(1) item A → α.β, a) is not empty, the lookahead does not have any effect.
When β is empty (A → α., a), we do the reduction by A → α only if the next input symbol is a (not for any terminal in FOLLOW(A)).
A state will contain A → α., a1 ... A → α., an, where {a1,...,an} ⊆ FOLLOW(A).
The construction of the canonical collection of the sets of LR(1) items are similar to the
construction of the canonical collection of the sets of LR(0) items, except that closure
and goto operations work a little bit different.
closure operation:
•If A → α.Bβ, a is in closure(I) and B → γ is a production rule of G, then B → .γ, b will be in closure(I) for each terminal b in FIRST(βa).
goto operation:
•If I is a set of LR(1) items and X is a grammar symbol (terminal or non-terminal), then goto(I,X) is defined as follows:
–If A → α.Xβ, a is in I, then every item in closure({A → αX.β, a}) will be in goto(I,X).
A Short Notation for The Sets of LR(1) Items
A set containing the items A → α.β, a1, ..., A → α.β, an is written as A → α.β, a1/a2/.../an.
Construction of LR(1) Parsing Tables
LALR parsers are often used in practice because LALR parsing tables are smaller than
LR(1) parsing tables.
The number of states in SLR and LALR parsing tables for a grammar G are equal.
But LALR parsers recognize more grammars than SLR parsers.
yacc creates a LALR parser for the given grammar.
A state of LALR parser will be again a set of LR(1) items.
This shrink process may introduce a reduce/reduce conflict in the resulting LALR parser (in which case the grammar is NOT LALR).
But this shrink process does not produce a shift/reduce conflict.
The Core of A Set of LR(1) Items
The core of a set of LR(1) items is the set of its first components.
Ex: the set { S → L.=R, $ ; R → L., $ } has the core { S → L.=R ; R → L. }
We will find the states (sets of LR(1) items) in a canonical LR(1) parser with same cores.
Then we will merge them as a single state.
I1: L → id., =
I2: L → id., $
These have the same core; merge them into a new state I12: L → id., =/$
We will do this for all states of a canonical LR(1) parser to get the states of the LALR
parser.
In fact, the number of the states of the LALR parser for a grammar will be equal to the
number of states of the SLR parser for that grammar
Create the canonical LR(1) collection of the sets of LR(1) items for the given grammar.
Find each core; find all sets having that same core; replace those sets having same cores
with a single set which is their union.
C = {I0,...,In} → C' = {J1,...,Jm} where m ≤ n
Create the parsing tables (action and goto tables) same as the construction of the parsing
tables of LR(1) parser.
–Note that: if J = I1 ∪ ... ∪ Ik, then since I1,...,Ik have the same core, the cores of goto(I1,X),...,goto(Ik,X) must also be the same.
–So, goto(J,X) = K, where K is the union of all sets of items having the same core as goto(I1,X).
If no conflict is introduced, the grammar is LALR(1) grammar. (We may only
introduce reduce/reduce conflicts; we cannot introduce a shift/reduce conflict)
Shift/Reduce Conflict
We say that we cannot introduce a shift/reduce conflict during the shrink process for the
creation of the states of a LALR parser.
Assume that we can introduce a shift/reduce conflict. In this case, a state of the LALR parser must have:
A → α., a   (a reduce action on a)   and   B → β.aγ, c   (a shift action on a)
Since merging preserves cores, one of the original LR(1) states already contains both A → α., a and an item B → β.aγ with some lookahead; i.e. the original canonical LR(1) parser already has the same shift/reduce conflict.
(The reason: the shift operation does not depend on lookaheads.)
Reduce/Reduce Conflict
But, we may introduce a reduce/reduce conflict during the shrink process for the creation
of the states of a LALR parser.
I1: A → α., a
    B → β., b
I2: A → α., b
    B → β., c
These two states have the same core; merging them gives
I12: A → α., a/b
     B → β., b/c   → a reduce/reduce conflict (on b)
–Some ambiguous grammars are more natural, and a corresponding unambiguous grammar can be very complex.
–Usage of an ambiguous grammar may eliminate unnecessary reductions.
Ex.
E → E+E | E*E | (E) | id   (ambiguous)
corresponds to the unambiguous grammar
E → E+T | T
T → T*F | F
F → (E) | id
Error recovery
Scan down the stack until a state s with a goto on a particular nonterminal A is
found. (Get rid of everything from the stack before this state s).
Discard zero or more input symbols until a symbol a is found that can legitimately
follow A.
o –The simplest choice is to take a from FOLLOW(A), but this may not work for all situations.
The parser stacks the nonterminal A and the state goto[s,A], and it resumes the
normal parsing.
This non-terminal A is normally a basic programming construct (there can be more than one choice for A):
o –stmt, expr, block, ...
Phrase-Level Error Recovery in LR Parsing
Each empty entry in the action table is marked with a specific error routine.
An error routine reflects the error that the user most likely will make in that case.
An error routine inserts the symbols into the stack or the input (or it deletes the
symbols from the stack and the input, or it can do both insertion and deletion).
o –missing operand
o –unbalanced right parenthesis
Intermediate languages
The front end translates a source program into an intermediate representation from which the back end generates the target code.
Intermediate code
(Diagram: parser -> static checker -> intermediate code generator -> intermediate code -> code generator)
Intermediate languages
Syntax trees
Postfix notation
Three address codes: [the semantic rules for generating three address code
from common programming languages]
Graphical representation
postfix notation
The syntax tree for an assignment statement is produced by syntax-directed translation:
Non-terminal S generates an assignment statement.
+ and - are operators typical of such languages; operator associativity and precedence are the usual ones.
o Fields – the operator and pointers to the children
o Nodes are allocated from an array
Three address codes
Ex: for a := b * -c + b * -c:
1. t1 := -c
2. t2 := b * t1
3. t3 := -c
4. t4 := b * t3
5. t5 := t2 + t4
6. a := t5
o similar to assembly code
o statements can have symbolic labels, and there are statements for flow of control.
o A symbolic label represents the index of a three-address statement in the array holding the intermediate code.
Declarations
As a new name is seen, the name is entered in the symbol table with the offset equal to the current value of offset.
For each local name, an entry is created in the symbol table along with other information (its type and the relative address of the storage for the name).
declaration in a procedure
The procedure enter(name, type, offset) creates a symbol table entry for name, gives it type type and relative address offset in its data area.
The attribute type represents a type expression constructed from the basic types integer and real.
P → D {offset := 0}
D → D ; D
Local names of each procedure can be assigned relative addresses using the above approach.
For nested procedures, the processing of declarations is temporarily suspended:
P → D
D → D ; D | id : T | proc id ; D ; S
(productions for statements S and types T are not shown)
A new symbol table is created when a procedure declaration D → proc id ; D1 ; S is seen, and the entries for the declarations in D1 are created in the new table.
The new table points back to the symbol table of the enclosing procedure.
The name represented by id is local to the enclosing procedure.
Symbol tables of the procedures are shown in the figure below.
The symbol tables of the procedures readarray, exchange and quicksort point back to the containing procedure sort. Since partition is declared within quicksort, its table points to that of quicksort.
(Figure: the symbol table for sort, with a nil header and entries a, x, readarray, exchange and quicksort; pointers lead to the tables of readarray, exchange and quicksort, each pointing back to sort, and partition's table pointing back to quicksort.)
1. mktable(previous) creates a new symbol table and returns a pointer to the new
table
a. The argument previous points to the previously created symbol table
2. enter(table, name, type, offset) creates an entry for name name in the symbol table pointed to by table, and places the type type and the relative address offset in fields within that entry.
3. addwidth(table, width) records the cumulative width of all the entries in table in the header associated with this table.
These operations are used in the actions of the translation scheme for declarations.
Case statement
Switch expression
Begin
Case value : statement
Case value : statement
Case value : statement
…………..
Case value : statement
default : statement
end
Switch expression
Begin
Case v1 : s1
Case v2 : s2
Case v3 : s3
…………..
Case vn-1 : sn-1
default : sn
end
Procedure calls
Calling sequence
1. S → call id (Elist)
2. Elist → Elist, E
3. Elist → E
Calling sequence
1. When a procedure call occurs, space must be allocated for the activation
record of the called procedure.
2. The argument of the called procedure must be evaluated and made available to
the called procedure in a known place
3. Environment pointers must be established to enable the called procedure to
access data in the enclosing blocks
4. The state of the calling procedure must be Saved so it can resume execution
after the call.
5. Also the return address of the calling procedure is saved in a known place
6. The return address is the location of the the instruction that follows the call in
the calling procedure
7. Finally a jump to the beginning of the code for the called procedure must be
generated
When a procedure returns:
1. If the procedure is a function, the result must be stored in a known place.
2. The activation record of the calling procedure must be restored.
3. A jump to the return address in the calling procedure must be generated.
There is no exact division of the run-time tasks between the calling and the called procedure.
Syntax-directed translation:
S → call id (Elist)
  { for each item p on queue do
      emit('param' p);
    emit('call' id.place) }
Elist → Elist, E
  { append E.place to the end of queue }
Elist → E
  { initialize queue to contain only E.place }
Code Generation
o –The output code must be high quality.
o –It should make effective use of the resources of the target machine.
o –It should run efficiently.
In theory, the problem of generating optimal code is undecidable.
In practice, we use heuristic techniques to generate sub-optimal (good, but not
optimal) target code. The choice of the heuristic is important since a carefully
designed code generation algorithm can produce much better code than a naive
code generation algorithm.
Memory Management
How are the names in the intermediate code converted into addresses in the target code?
The labels in the intermediate code must be converted into the addresses of the target machine instructions.
A quadruple may map to more than one machine instruction. If that quadruple has a label, this label will be the address of the first machine instruction corresponding to that quadruple.
Instruction Selection
The structure of the instruction set of the target machine determines the difficulty of instruction selection.
–The uniformity and completeness of the instruction set are important factors.
Instruction speeds are also important.
–If we do not care about speed, code generation is a straightforward job: we can map each quadruple into a set of machine instructions. Naive code generation:
ADD y,z,x   ⇒   MOV y, R0
                ADD z, R0
                MOV R0, x
The quality of the generated code is determined by its speed and size.
Instruction speeds are needed to design good code sequences.
Register Allocation
Instructions involving register operands are usually shorter and faster than those
involving operands in memory.
The efficient utilization of registers is important in generating good code
sequence.
The use of registers is divided into two sub-problems:
o –Register Allocation – we select the set of variables that will reside in registers at a point in the program.
o –Register Assignment – we pick the specific register that a variable will reside in.
Finding an optimal assignment of registers is difficult.
o –In theory, the problem is NP-complete.
o –The problem is further complicated because some architectures may
require certain register-usage conventions such as address vs data
registers, even vs odd registers for certain instructions.
Choice of Evaluation Order
Target Machine
To design a code generator, we should be familiar with the structure of the target
machine and its instruction set.
Instead of a specific architecture, we will design our own simple target machine for code generation.
– We will decide the instruction set, but it will be close to actual machine instructions.
– We will decide the sizes and speeds of the instructions, and we will use them in the creation of good code generators.
– Although we do not use an actual target machine, our discussions are also
applicable to actual target machines.
ADD add source to destination
SUB subtract source from destination
MOV move source to destination
•The source and destination fields are not long enough to hold memory addresses.
Certain bit-patterns in these fields specify that words following the instruction
(the instruction is also one word) contain operand addresses (or constants).
•Of course, there will be cost for having memory addresses and constants in
instructions.
•We will use different addressing modes to get addresses of source and
destination.
Static Allocation -- the static allocation can be performed by just reserving
enough memory space for static data objects.
o –Static variables can be accessible by just using absolute memory address.
Stack Allocation – the code generator should produce machine codes to allocate
the activation records (corresponding to intermediate codes).
o –Normally we will use a specific register to point (the beginning of) the
activation record, and we will use this register to access variables residing
in that activation record.
o We cannot know actual address of these stack variables until run-time.
(Activation record layout, addressed from SP: offset 0 – return address; offset 4 – return value; offsets 8, 12, ... – actual parameters; then other stuff, local variables and temporaries.)
All values in the activation record can be accessible from SP by a positive offset.
ADD #caller.recordsize,SP
MOV PARAM1,*8(SP) // save parameters
MOV PARAM2,*12(SP)
.
MOV PARAMn,*4+4n(SP)
. // saving other stuff
MOV #here+16,*SP // save return address
GOTO callee.codearea // jump to procedure
SUB #caller.recordsize,SP // restore SP after the call returns
Possible Return from A Procedure Call
MOV RETVAL,*4(SP) // save the return value
GOTO *SP // return to caller
Run-Time Addresses
Static Variables:
o static[12] ⇒ staticaddressblock + 12
So, static variables have absolute addresses, and these absolute addresses are evaluated at compile time (or load time).
Run-Time Addresses
Stack Variables
o Stack variables are accessed using offsets from the beginning of the activation records.
non-local variable
o access links
o displays
Basic Blocks
A basic block is a sequence of consecutive statements in which flow of control enters at the beginning and leaves at the end, without halting or branching except at the end. Ex:
t1 := a * a
t2 := a * b
t3 := t1 – t2
Input: A sequence of three-address codes
Output: A list of basic blocks with each three-address statement in exactly one block.
Algorithm:
1.Determine the list of leaders. The first statement of each basic block will be a
leader.
o The first statement is a leader
o Any statement that is the target of a jump instruction (conditional or
unconditional) is a leader.
o Any statement immediately following a jump instruction (conditional or
unconditional) is a leader.
2.For each leader, its basic block consists of the leader and all statements up to
but not including the next leader or the end of the program.
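A sketch of this partitioning in Python; instructions are dicts and a 'goto' key holds a 1-based target index (this encoding is an assumption of the sketch):

def basic_blocks(code):
    leaders = {1}                            # the first statement is a leader
    for i, instr in enumerate(code, start=1):
        if "goto" in instr:
            leaders.add(instr["goto"])       # target of a jump is a leader
            if i + 1 <= len(code):
                leaders.add(i + 1)           # statement after a jump is a leader
    leaders = sorted(leaders)
    ends = leaders[1:] + [len(code) + 1]
    return [code[l - 1:e - 1] for l, e in zip(leaders, ends)]

# The quadruples of the example below: statements 1-2, then 3-12
# ending with "if i<=20 goto 3".
code = ([{"op": "prod := 0"}, {"op": "i := 1"}]
        + [{"op": f"stmt{j}"} for j in range(3, 12)]
        + [{"op": "if i<=20", "goto": 3}])
print([len(b) for b in basic_blocks(code)])   # [2, 10]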
begin
prod := 0;
i := 1;
do begin
prod := prod + a[i] * b[i];
i := i + 1;
end
while i <= 20
end
Corresponding Quadruples
1: prod := 0
2: i := 1
3: t1 := 4*i
4: t2 := a[t1]
5: t3 := 4*i
6: t4 := b[t3]
7: t5 := t2*t4
8: t6 := prod+t5
9: prod := t6
10: t7 := i+1
11: i := t7
12: if i<=20 goto 3
Basic Blocks (Another Example)
Live Variables
A name in a basic block is said to be live at a given point if its value is used after
that point in the program (in that basic block or in another basic block).
A basic block computes a set of expressions: the values of the names that are live on exit from the block.
Two blocks are said to be equivalent if they compute the same set of expressions.
A number of transformations can be applied to a basic block without changing the
set of expressions computed by the block.
Many of these transformations are useful for improving the quality of code that
will be generated from that block.
These transformations are categorized into two groups:
o –Structure-Preserving Transformations
o –Algebraic Transformations.
Structure-Preserving Transformations
Common sub-expression elimination (before on the left, after on the right):
1: a := b+c        a := b+c
2: b := a-d        b := a-d
3: c := b+c        c := b+c
4: d := a-d        d := b
The occurrences of a-d in statements 2 and 4 are common sub-expressions.
But the occurrences of b+c in statements 1 and 3 are not common sub-expressions,
because the value of b differs at those two statements (b is redefined in
statement 2).
Dead-Code Elimination
We say that x is dead at a certain point if it is not used after that point in the
block (or in the following blocks).
If x is dead at the point of the statement x := y op z, this statement can be safely
eliminated without changing the meaning of the block.
Renaming Temporary Variables
Without changing the meaning of a block, we may rename the temporary variables in
that block.
The new block with renamed variables is equivalent to the original block. E.g.:
t1 := a+b t2 := a+b
t2 := t1*c t1 := t2*c
Interchange of Statements
If two adjacent statements are independent they can be interchanged without affecting the
value of the basic block.
t1 := a+b t2 := x*y
t2 := x*y t1 := a+b
Algebraic Transformations
x := x+0 eliminate this statement
x := y+0 x := y
x := x+1 INC x
x := y**2 x := y*y
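These rewrites are easy to apply mechanically; here is a minimal Python sketch
(the tuple encoding and operator spellings are illustrative assumptions, and the
target-specific INC case is left out):

def simplify(stmt):
    # stmt is (x, op, y, z) for "x := y op z"
    x, op, y, z = stmt
    if op == '+' and z == 0:
        # x := x+0 is dropped entirely; x := y+0 becomes a plain copy
        return None if x == y else (x, 'copy', y, None)
    if op == '**' and z == 2:
        return (x, '*', y, y)              # x := y**2 becomes x := y*y
    return stmt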
Flow Graphs
We can add the flow-of-control information to the set of basic blocks making up a
program by constructing a directed graph called a flow graph.
There is a directed edge from block B1 to block B2 if B2 can immediately follow B1
in some execution sequence; that is, if:
o –there is a conditional or unconditional jump from the last statement of B1
to the first statement of B2, or
o –B2 immediately follows B1 in the order of the program, and B1 does not
end with an unconditional jump.
We say that B2 is a successor of B1, and B1 is a predecessor of B2.
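A sketch of edge construction under these two rules, reusing the instruction
encoding and the sorted leader list from the earlier partitioning sketch (the
'unconditional' flag is an illustrative assumption):

def flow_graph_edges(code, leaders):
    block_of = {}
    for b, start in enumerate(leaders):
        end = leaders[b + 1] if b + 1 < len(leaders) else len(code)
        for i in range(start, end):
            block_of[i] = b                          # statement -> its block
    edges = set()
    for b, start in enumerate(leaders):
        end = leaders[b + 1] if b + 1 < len(leaders) else len(code)
        last = code[end - 1]
        if last.get('target') is not None:           # rule 1: jump edge
            edges.add((b, block_of[last['target']]))
        if not last.get('unconditional') and end < len(code):
            edges.add((b, b + 1))                    # rule 2: fall-through edge
    return edges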
Flow Graphs - Example
B1: prod := 0
    i := 1
B2: t1 := 4*i
    t2 := a[t1]
    t3 := 4*i
    t4 := b[t3]
    t5 := t2*t4
    t6 := prod+t5
    prod := t6
    t7 := i+1
    i := t7
    if i<=20 goto 3
The edges are B1 → B2 (fall-through) and B2 → B2 (the conditional jump back to
statement 3).
Loops
What is a loop in a flow graph?
A loop in a flow graph is a collection of nodes such that:
o 1.All nodes in the loop are strongly connected. In other words, from any
node in the loop to any other node there is a path of length one or more, and
all nodes on that path are in the loop.
o 2.The collection of nodes in the loop has a unique entry. In other words, the
only way to reach a node of the loop from a node outside the loop is to first go
through that unique entry.
A loop that contains no other loops is called an inner loop.
We will assume that we can find all loops in a given flow graph.
Next Uses
We say that a quadruple (x := y op z) uses the names y and z.
For each name in a quadruple, we would like to know the next use of that name.
We also would like to know which names are live after the block; i.e., which
names will be used after this block.
o –Without a global live-variable analysis, we cannot determine which names will
be live after a block.
o –For simplicity, we may assume that all variables and some of the temporaries
are live after each block.
We use the next-use and liveness information about names to determine which names
will be kept in registers.
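Next-use information is conventionally computed by a single backward scan over the
block; here is a hedged Python sketch (the statement and name encodings are
illustrative). It records, for each statement, whether x, y, z are live afterwards
and where their next use is:

def next_use_info(block, live_on_exit):
    # block: list of (x, y, z) triples for "x := y op z"; absent fields are None
    state = {n: (True, None) for n in live_on_exit}  # name -> (live?, next use)
    info = [None] * len(block)
    for i in range(len(block) - 1, -1, -1):
        x, y, z = block[i]
        info[i] = {n: state.get(n, (False, None)) for n in (x, y, z) if n}
        if x:
            state[x] = (False, None)   # x is (re)defined here: dead above this
        for n in (y, z):
            if n:
                state[n] = (True, i)   # y and z are used here
    return info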
If two temporaries are not live at the same time, we can pack these temporaries into
the same location.
We can use the next-use information to pack temporaries. E.g., six temporaries
packed into two:
t1 := a*a t1 := a*a
t2 := a*b t2 := a*b
t3 := 2*t2 t2 := 2*t2
t4 := t1+t3 t1 := t1+t2
t5 := b*b t2 := b*b
t6 := t4+t5 t1 := t1+t2
For simplicity, we will assume that for each intermediate code operator we have a
corresponding target code operator.
We will also assume that computed results can be left in registers as long as
possible.
o –If the register is needed for another computation, the value in the register
must be stored.
o –Before we leave a basic block, everything must be stored in memory
locations.
We will try to produce reasonable code for a given basic block.
The code-generation algorithm will use descriptors to keep track of register
contents and the addresses of names.
Register Descriptors
A register descriptor keeps track of what is currently in each register.
It will be consulted when a new register is needed by the code-generation algorithm.
We assume that all registers are initially empty before we enter into a basic block.
This is not true if the registers are assigned across blocks.
At a certain time, each register descriptor will hold zero or more names.
R1 is empty
MOV a,R1
R1 holds a
MOV R1,b
R1 holds both a and b
Address Descriptors
An address descriptor keeps track of the locations where the current value of a
name can be found at run-time.
The location can be a register, a stack location or a memory location (in static
area). The location can be a set of these.
This information can be stored in the symbol table.
a is in the memory
MOV a,R1
a is in R1 and in the memory
MOV R1,b
b is in R1 and in the memory
A Simple Code-Generation Algorithm
This code-generation algorithm takes a basic block of three-address codes and
produces machine code for our target architecture.
For each three-address code we perform certain actions.
We assume that we have a getreg routine to determine the location of the result of
an operation.
getreg – choosing a location Lx for the result of x := y op z (a Python sketch of
these cases follows the list):
1.If y is in a register that holds no other names, and y is not live and has
no next use after x := y op z, then return the register of y as Lx. Update the
address descriptor of y to indicate that y is no longer in Lx.
2.If (1) fails, return an empty register for Lx if there is one.
3.If (2) fails and x has a next use in the block (or op is a special instruction
that needs a register), find a suitable occupied register and empty it (storing
its value back to memory).
4.If x is not used in the block, or no suitable register can be found, select the
memory location of x as Lx.
A more sophisticated getreg function can be designed.
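A hedged sketch of this case analysis (the descriptor representations and the naive
spill heuristic are illustrative assumptions; a real generator also emits the store
instructions when it spills):

def getreg(y, regs, reg_desc, addr_desc, y_dies_here):
    # reg_desc: register -> set of names; addr_desc: name -> set of locations
    for r in regs:                           # case 1: reuse y's own register
        if reg_desc[r] == {y} and y_dies_here:
            addr_desc[y].discard(r)
            return r
    for r in regs:                           # case 2: any empty register
        if not reg_desc[r]:
            return r
    victim = regs[0]                         # case 3: naive spill choice
    for name in reg_desc[victim]:
        addr_desc[name] = {'memory'}         # emit "MOV victim,name" here
    reg_desc[victim] = set()
    return victim
    # case 4, returning x's own memory location, is omitted from this sketch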
For a := *p, we have to move p into a register first (if it is not already in one).
Most machines use a set of condition codes to indicate whether the last
quantity computed or loaded into a register is negative, zero, or positive.
A compare instruction (CMP) can be used to set these condition codes without
storing the result of the subtraction it performs.
Then a conditional jump instruction (based on these condition codes) can be
used for the designated condition: <, =, >, <=, >=.
Register Allocation
Just holding values in registers during a single block may not produce good
results.
We need some other techniques for register allocation:
o –Putting specific values into fixed registers – activation-record pointer, stack
pointer, ...
o –Global analysis (across the blocks) of the variables, to decide which
variables will be kept in registers. Frequently used variables are kept in
registers; in particular, a frequently used variable in an inner loop should be
in a register.
o –Or, in some programming languages (such as C), the programmer can use
register declarations to keep certain variables in registers.
Usage Counts
Keeping a variable x in a register for the duration of a loop L saves
one unit of cost for each reference to x while x is in the register.
If a variable x has subsequent uses in following blocks, it should stay in a
register.
So, if x stays in the register:
o –we save one unit every time x is referenced prior to any definition of x;
o –we save two units in each block that assigns x and leaves it live, because we
avoid a store of x at the end of that block.
Benefit (approximate) of keeping x in the register is:
∑ (use( x , B)+2∗live( x , B ))
bocksBinL
[Figure: flow graph for the usage-count example. The loop L consists of four blocks:
 B1: a := b+c; d := d-b; e := a+f
 B2: f := a-d
 B3: b := d+f; e := a-c
 B4: b := d+c
The edges are annotated with the live variables: b, c, d, f on entry to B1, and
b, c, d, e, f live on the exits from the loop.]
Usage Counts – Example
So, the benefits are: a: 4, b: 6, c: 3, d: 6, e: 4, f: 4.
Thus, if we can use only two registers, b and d must be in registers. If we have a
third register, we can put one of a, e, f into it – for example, a.
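As a check, these benefits follow directly from the formula; in this Python fragment
the use/live values are transcribed from the flow graph (an assumption based on my
reading of the figure above):

use = {'a': {'B2': 1, 'B3': 1}, 'b': {'B1': 2},
       'c': {'B1': 1, 'B3': 1, 'B4': 1},
       'd': {'B1': 1, 'B2': 1, 'B3': 1, 'B4': 1},
       'e': {}, 'f': {'B1': 1, 'B3': 1}}
live = {'a': {'B1'}, 'b': {'B3', 'B4'}, 'c': set(),
        'd': {'B1'}, 'e': {'B1', 'B3'}, 'f': {'B2'}}
for x in 'abcdef':
    print(x, sum(use[x].values()) + 2 * len(live[x]))
# prints: a 4, b 6, c 3, d 6, e 4, f 4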
Code generated with b in R1, d in R2, and a in R0:
B1: MOV R1,R0   // a := b+c
    ADD c,R0
    SUB R1,R2   // d := d-b
    MOV R0,R3   // e := a+f
    ADD f,R3
    MOV R3,e
B2: MOV R0,R3   // f := a-d
    SUB R2,R3
    MOV R3,f
B3: MOV R2,R1   // b := d+f
    ADD f,R1
    MOV R0,R3   // e := a-c
    SUB c,R3
    MOV R3,e
B4: MOV R2,R1   // b := d+c
    ADD c,R1
On each exit from the loop, the register-resident variables are stored:
    MOV R1,b
    MOV R2,d
•First, for each symbol we assign a symbolic register (we assume that we have
as many symbolic registers as we need).
•A register-interference graph is created, in which the nodes are the symbolic
registers, and there is an edge between two nodes if the names in those nodes are
live at the same time.
•Then, if we have N machine registers:
o –we apply the N-coloring problem to this register-interference graph to find a
solution. If two nodes share an edge, they cannot have the same color (i.e.
they cannot be assigned to the same machine register, because they are live at
the same time).
o –Since the graph-coloring problem is NP-complete, we use certain heuristics
to find approximate solutions.
source       symbolic registers    machine registers (N = 3)
x := 1       s1 := 1               r1 := 1
y := 2       s2 := 2               r2 := 2
w := x+y     s3 := s1+s2           r3 := r1+r2
z := x+1     s4 := s1+1            r3 := r1+1
u := x*y     s5 := s1*s2           r1 := r1*r2
t := z*2     s6 := s4*2            r2 := r3*2
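A hedged sketch of the Chaitin-style heuristic alluded to above: repeatedly remove
a node of degree less than N (it can always be colored later), then color in
reverse removal order; if no such node exists, some register would have to be
spilled. All names here are illustrative:

def color_registers(interference, n_colors):
    # interference: symbolic register -> set of neighbouring symbolic registers
    work = {v: set(ns) for v, ns in interference.items()}
    stack = []
    while work:
        v = next((u for u in work if len(work[u]) < n_colors), None)
        if v is None:
            return None                    # stuck: a spill would be needed
        stack.append(v)
        for u in work[v]:
            work[u].discard(v)
        del work[v]
    color = {}
    for v in reversed(stack):              # color in reverse removal order
        used = {color[u] for u in interference[v] if u in color}
        color[v] = min(c for c in range(n_colors) if c not in used)
    return color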
•Directed Acyclic Graphs (dags) can be useful data structures for implementing
transformations on basic blocks.
•Using dags
o –we can easily determine common sub-expressions
o –We can determine which names are evaluated outside of the block, but
used in the block.
•First, we will construct a dag for a basic block.
•Then, we apply transformations on this dag.
•Later, we will produce target code from a dag.
[Figure: a basic block and its corresponding DAG.]
Construction of DAGs
•We can systematically create a corresponding dag for a given basic block.
•Each name is associated with a node of the dag. Initially, all names are undefined
(i.e. they are not associated with nodes of the dag).
•For each three-address code x := y op z:
o –Find node(y). If node(y) is undefined, create a leaf node labeled y and let
node(y) be this node.
o –Find node(z). If node(z) is undefined, create a leaf node labeled z and let
node(z) be this node.
o –If there is already a node with label op, node(y) as its left child, and node(z)
as its right child, this node is also made node(x).
o –Otherwise, create node(x) with label op, node(y) as its left child, and node(z)
as its right child.
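A direct Python transcription of this construction (the encodings are illustrative;
interior nodes are hashed on (op, left, right), so repeated sub-expressions reuse
the same node):

def build_dag(block):
    # block: list of (x, op, y, z) tuples for "x := y op z"
    dag, node_of, interior = [], {}, {}
    def node(name):
        if name not in node_of:           # undefined: create a leaf labeled name
            dag.append((name, None, None))
            node_of[name] = len(dag) - 1
        return node_of[name]
    for x, op, y, z in block:
        l, r = node(y), node(z)
        key = (op, l, r)
        if key not in interior:           # no matching node yet: create one
            dag.append((op, l, r))
            interior[key] = len(dag) - 1
        node_of[x] = interior[key]        # x now names this node
    return dag, node_of

Run on a block such as the quadruples of the product loop earlier, the two
occurrences of 4*i map to a single node, exposing the common sub-expression
automatically.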
Applications of DAGs
We automatically detect common sub-expressions.
We can determine which identifiers have their values used in the block (the
identifiers at the leaves).
We can create simplified quadruples for a block using its dag:
o –taking advantage of common sub-expressions, and
o –avoiding unnecessary move instructions.
In general, the interior nodes of the dag can be evaluated in any order that is a
topological sort of the dag.
o –In a topological sort, a node is not evaluated until all of its children are
evaluated.
o –So, a different evaluation order may correspond to a better code
sequence.
Original quadruples:
1: t1 := 4*i
2: t2 := a[t1]
3: t3 := 4*i
4: t4 := b[t3]
5: t5 := t2*t4
6: t6 := prod+t5
7: prod := t6
8: t7 := i+1
9: i := t7
10: if i<=20 goto 1

Simplified quadruples produced from the DAG (the duplicate 4*i and the copies
through t6 and t7 disappear):
t1 := 4*i
t2 := a[t1]
t4 := b[t1]
t5 := t2*t4
prod := prod+t5
i := i+1
if i<=20 goto (1)
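A hedged sketch of re-emitting quadruples from a dag built by the build_dag sketch
above (preferring non-temporary names, and ignoring the subtlety of names that are
redefined later in the block, are simplifying assumptions):

def dag_to_quads(dag, node_of):
    name_of = {}
    for name, n in node_of.items():        # pick one name per node,
        if n not in name_of or not name.startswith('t'):
            name_of[n] = name              # preferring programmer variables
    quads = []
    for i, (label, l, r) in enumerate(dag):
        if l is None:
            name_of.setdefault(i, label)   # a leaf is named by its own label
        else:
            quads.append((name_of[i], label, name_of[l], name_of[r]))
    return quads                           # dag order is already topological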
Peephole Optimization
Peephole optimization is a method of improving the performance of the target
program by examining a short sequence of target instructions (called the peephole)
and replacing this sequence with a shorter and faster one.
o –Peephole optimization is applicable to both intermediate code and
target code.
o –The peephole is usually within a basic block (but can sometimes span blocks).
o –We may need multiple passes to get the best improvement in the target code.
o –We will look at certain program transformations that can be seen as
peephole optimizations.
Unreachable Code
We may remove unreachable code. For example:
#define debug 0
.
.
if (debug==1) { print debugging info }
Since debug is defined as the constant 0, the test debug==1 can never succeed, so
the guarded statement is unreachable and can be removed.
Flow-of-Control Optimizations
before:                     after:
goto L1                     goto L2
.                           .
L1: goto L2                 L1: goto L2
----------------------------------------------
if a<b goto L1              if a<b goto L2
.                           .
L1: goto L2                 L1: goto L2
----------------------------------------------
goto L1                     if a<b goto L2
.                           goto L3
L1: if a<b goto L2          .
L3:                         L3:
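The first two transformations amount to collapsing chains of unconditional jumps; a
small Python sketch (the tuple encoding of the code and labels is an illustrative
assumption):

def collapse_jump_chains(code, labels):
    # labels: label -> index of the statement carrying that label
    def final(lab):
        seen = set()
        while lab not in seen:             # stop if the gotos form a cycle
            seen.add(lab)
            inst = code[labels[lab]]
            if inst[0] != 'goto':
                break
            lab = inst[1]                  # follow "L1: goto L2" one more step
        return lab
    out = []
    for inst in code:
        if inst[0] == 'goto':
            out.append(('goto', final(inst[1])))
        elif inst[0] == 'if-goto':
            out.append(('if-goto', inst[1], final(inst[2])))
        else:
            out.append(inst)
    return out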
Reduction in Strength
o –x := y*2 x := lshift(y,1)
Specific Machine Instructions
o –The target machine may have specific instructions that implement certain
operations efficiently.
o –auto-increment, auto-decrement, ...
Consider the basic block:
t1 := a+b
t2 := c+d
t3 := e-t2
t4 := t1-t3
[Figure: its DAG – the root '-' (t4) has children '+' (t1, over leaves a and b) and
'-' (t3, over leaf e and '+' (t2, over leaves c and d)).]
Target Codes
Suppose only t4 is live on exit from the block, and we have only two registers, R0
and R1. Two possible evaluation orders are:
(1) t1:=a+b; t2:=c+d; t3:=e-t2; t4:=t1-t3
(2) t2:=c+d; t3:=e-t2; t1:=a+b; t4:=t1-t3
With order (1), t1 and t2 occupy both registers when t3 = e-t2 must be computed, so
t1 has to be spilled to memory and reloaded. Order (2) avoids the spill; the revised
code sequence is:
MOV c,R1
ADD d,R1
MOV e,R0
SUB R1,R0
MOV a,R1
ADD b,R1
SUB R0,R1
MOV R1,t4
[Figure: DAG for the block below – node 1: * (t1); node 2: + (t2); node 3: - (t3);
node 4: * (t4); node 5: - (t5); node 6: + (t6); node 7: + (t7); leaves: c (8),
a (9), b (10), d (11), e (12). Nodes 4 (t4) and 6 (t6) are shared.]
t6:=d+e
t7:=a+b
t5:=t7-c
t4:=t5*t6
t3:=t4-e
t2:=t6+t4
t1:=t2*t3
Labeling – Algorithm
if (n is a leaf) {
  if (n is the leftmost child of its parent)
    label(n) := 1;
  else
    label(n) := 0;
}
else { // interior node (assume a binary operator)
  if (label(child1) = label(child2))
    label(n) := label(child1) + 1;
  else
    label(n) := max(label(child1), label(child2));
}
In general, if c1, c2, ..., ck are the children of n ordered by label, so that
label(c1) >= label(c2) >= ... >= label(ck), then
label(n) := max over i = 1..k of ( label(ci) + i - 1 ).
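A compact Python version of the binary-operator case (the tree encoding is an
illustrative assumption):

def label(n, leftmost=True):
    # n is ('leaf', name) or (op, left, right)
    if n[0] == 'leaf':
        return 1 if leftmost else 0        # a left leaf must be loaded first
    l = label(n[1], True)                  # the left child is a leftmost child
    r = label(n[2], False)
    return l + 1 if l == r else max(l, r)

For the tree in the example below it returns 2, agreeing with the label of t4.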
Labeling – Example
[Figure: labeled tree for t1 := a+b; t2 := c+d; t3 := e-t2; t4 := t1-t3 –
 t4: 2
   t1: 1 (a: 1, b: 0)
   t3: 2 (e: 1, t2: 1 (c: 1, d: 0))]
gencode – Example (evaluating the labeled tree above with registers R0 and R1;
indentation shows the recursive calls):
gencode(t4)        case 2    (R1, R0)
  gencode(t3)      case 3    (R0, R1)
    gencode(e)     case 0    (R0, R1)
      MOV e,R1
    gencode(t2)    case 1    (R0)
      gencode(c)   case 0    (R0)
        MOV c,R0
      ADD d,R0
    SUB R0,R1
  gencode(t1)      case 1    (R0)
    gencode(a)     case 0    (R0)
      MOV a,R0
    ADD b,R0
  SUB R1,R0
Optimal Code
gencode produces optimal code for our target machine if we assume that:
o –there are no common sub-expressions, and
o –no algebraic properties of the operators affect the cost.
Algebraic properties of operators (such as commutativity and associativity) may
affect the generated code.
When there are common sub-expressions, the dag is no longer a tree. In this
case, we cannot apply the algorithm directly.
If the dag of a basic block is not a tree, we cannot apply the gencode procedure
directly to that dag.
We partition the dag into trees and apply gencode to these trees; then we
combine the solutions. This may not be an optimal solution, but it is usually a
very good one.
Each shared node will be the root of a tree.
Then, we can put these trees into an evaluation order.
[Figure: a DAG with shared nodes (left) and its partition into trees (right). Each
shared node becomes the root of its own tree, and the trees that use its value
refer to it as a leaf.]
Contiguous Evaluation
[Figure: an operator node with subtrees E1 and E2.]
A program evaluates an expression tree contiguously if, at every node, it completes
the evaluation of one subtree (say E1) before beginning the other (E2), and then
evaluates the node itself. In non-contiguous evaluations, we may mix the
evaluations of the sub-trees.
For any given machine-language program P (for register machines) that evaluates an
expression tree T, we can find an equivalent program Q such that:
1.Q does not have a higher cost than P,
2.Q uses no more registers than P, and
3.Q evaluates the tree in a contiguous fashion.
–This means that every expression tree can be evaluated optimally by a
contiguous program.
Example
Assume that we have the following machine instructions, and that the cost of each
of them is one unit:
o –mov M,Ri
o –mov Ri,M
o –mov Ri,Rj
o –OP M,Ri
o –OP Rj,Ri
Assume that we have only two registers R0 and R1.
First, we have to evaluate the cost arrays for the tree.
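A hedged sketch of such cost arrays for this instruction set, with every
instruction costing one unit (binary operators only; commutativity is ignored, and
the encoding is illustrative). Here c[r] is the minimum cost of computing a node
into a register with r registers available (r = 1 or 2), and c[0] the minimum cost
of computing it into memory:

def costs(n):
    if n[0] == 'leaf':
        return {0: 0, 1: 1, 2: 1}          # already in memory; MOV M,Ri loads it
    cl, cr = costs(n[1]), costs(n[2])
    c = {}
    c[1] = 1 + cr[0] + cl[1]               # right to memory first, then OP M,Ri
    c[2] = 1 + min(cl[2] + cr[1],          # left first, then right, OP Rj,Ri
                   cr[2] + cl[1],          # right first, then left, OP Rj,Ri
                   cl[2] + cr[0])          # right taken from memory, OP M,Ri
    c[0] = min(c[1], c[2]) + 1             # compute, then MOV Ri,M stores it
    return c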