
Compiler Design (KCS-502)

3rd year (Semester – V)


Session – 2023 - 24
Unit – I

Ratish Srivastava
Asst. Prof.
CSE Dept.
UCER, Prayagraj
Introduction
Translator:
• A program written in a high-level language is called source
code. To convert the source code into machine code, translators
are needed.
• A translator takes a program written in source language as input
and converts it into a program in target language as output.
Purpose of Translator
• Translating the high-level language program input into an
equivalent machine language program.
• Providing diagnostic messages wherever the programmer
violates the specification of the high-level language.
• Different Types of Translators
Compiler, Interpreter and Assembler
Introduction
• Compiler:
– A compiler is a computer program (or set of
programs) that transforms source code written in a
programming language (the source language) into
another computer language (the target language,
often having a binary form known as object code)

– An important role of the compiler is to report any
errors in the source program that it detects during the
translation process.


Introduction

Source Program → Compiler → Target Program
                    ↓
                  Errors

Fig.: A Compiler


Introduction
• Interpreter:
– An interpreter reads the source code one
instruction or line at a time, converts this line into
machine code and executes it.



Introduction

Source Program + Input → Interpreter → Output

Fig.: An Interpreter


Difference between Compiler and
Interpreter
• The main difference between an interpreter
and a compiler is that compilation requires
analysis and the generation of machine code
only once, whereas an interpreter may need to
analyze and interpret the same program statements
each time it meets them, e.g., instructions appearing
within a loop.


Difference between Compiler and
Interpreter
• A compiler converts the high-level
instructions into machine language, while an
interpreter converts the high-level instructions
into an intermediate form.

• Before execution, the entire program is translated
by the compiler, whereas an interpreter translates the
first line, executes it, and so on.


Difference between Compiler and
Interpreter
• A list of errors is created by the compiler after the
compilation process, while an interpreter stops
translating after the first error.

• An independent executable file is created by the
compiler, whereas an interpreted program requires the
interpreter each time it is run.

• The compiler produces object code, whereas an
interpreter does not produce object code.


Assembler
• Assemblers are a third type of translator.
• The purpose of an assembler is to
translate assembly language into object (machine)
code.
• It uses opcode (mnemonics) for the instructions.
For example: ADD A,B
• Here, ADD is the mnemonic that tells the
processor that it has to perform the addition operation.
Moreover, A and B are the operands.


Language-processing System
• A source program may be divided into modules stored
in separate files. The task of collecting the source
program is sometimes entrusted to a separate
program, called a preprocessor. The preprocessor may
also expand shorthands, called macros, into source
language statements.

• The modified source program is then fed to a compiler.
The compiler may produce an assembly language
program as its output, because assembly language is
easier to produce as output and is easier to debug.


Language-processing System
• The assembly language is then processed by a
program called an assembler that produces
relocatable machine code as its output.

• The linker resolves external memory addresses,
where the code in one file may refer to a location
in another file.

• The loader then puts together all of the
executable object files into memory for
execution.
Source program with macros
         ↓
    Preprocessor
         ↓
   Source program
         ↓
      Compiler
         ↓
Target assembly program
         ↓
     Assembler
         ↓
Relocatable machine code
         ↓
   Linker / Loader
         ↓
Absolute machine code


A language processing system
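As a concrete illustration (assuming the GNU C toolchain; the file
names are arbitrary, and other toolchains provide equivalent steps),
each stage of this pipeline can be run separately:

gcc -E prog.c -o prog.i    # preprocess: expand macros and #includes
gcc -S prog.i -o prog.s    # compile to assembly
gcc -c prog.s -o prog.o    # assemble to relocatable object code
gcc prog.o -o prog         # link into an executable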
The Structure of a Compiler
• Up to this point, we have treated a compiler as
a single box that maps a source program into a
semantically equivalent target program. If we
open up this box a little, we see that there are
two parts to this mapping: analysis and
synthesis.


The Structure of a Compiler
• Analysis:
– Breaks up source program into pieces and
imposes a grammatical structure.
– Creates intermediate representation (IR) of source
program.
– Determines the operations and records them in a
tree structure, the syntax tree.
– Known as the front end of the compiler.


The Structure of a Compiler
• Synthesis:
– Constructs target program from intermediate
representation.
– Takes the tree structure and translates the
operations into the target program.
– Known as the back end of the compiler.


Phases of a Compiler
• Conceptually, a compiler operates in phases, each
of which transforms the source program from
one representation to another.

• In greater detail, the compiler first performs the
lexical, syntax and semantic analysis of the source
program. Then, from the information gathered
during this threefold analysis, it generates the
intermediate code of the source program, optimizes
it and creates the resulting target code.


Phases of a Compiler
• As a whole, the compilation thus consists of
these six compilation phases.
– Lexical Analysis
– Syntax Analysis
– Semantic Analysis
– Intermediate Code Generation
– Code Optimization
– Code Generation.



Phases of a Compiler
• The first three phases form the bulk of the
analysis portion of a compiler and the last
three phases form the synthesis portion of a
compiler.

• Symbol-table management and error handling
are shown interacting with the six phases.


Phases of a Compiler

Fig.: The phases of a compiler


Lexical Analysis
• The lexical analysis is also called scanning.

• Lexical analysis breaks up the source program
into lexemes, that is, logically cohesive lexical
entities, such as identifiers or integers.

• It checks that these entities are well-formed,
produces tokens that uniformly represent lexemes
in a fixed-size way and sends these tokens to the
syntax analyzer.
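For example, for the assignment statement
position = initial + rate * 60
(the classic textbook example, reused in the symbol-table slide
later), the lexical analyzer would produce a token stream such as

⟨id,1⟩ ⟨=⟩ ⟨id,2⟩ ⟨+⟩ ⟨id,3⟩ ⟨*⟩ ⟨60⟩

where the numbers are indices of symbol-table entries.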
Lexical Analysis
• Different lexical classes or tokens or lexemes
are
– Identifiers
– Constants
– Keywords
– Operators



Syntax Analysis
• The syntax analysis is also called parsing.
• The parser uses the first components of the
tokens produced by the lexical analyzer to create
a tree-like intermediate representation that
depicts the grammatical structure of the token
stream.

• A typical representation is a syntax tree in which
each interior node represents an operation and
the children of the node represent the
arguments of the operation.
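Continuing the example above, the parser would build a syntax tree
for position = initial + rate * 60 such as:

            =
          /   \
     ⟨id,1⟩     +
              /   \
        ⟨id,2⟩      *
                  /   \
            ⟨id,3⟩     60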
Semantic Analysis
• Semantic analysis checks that the source program
satisfies the semantic conventions of the source
language.

• Perhaps most importantly, it performs type checking to
verify that each operator has operands permitted by
the source language specification.

• If the operands are not permitted, this compilation
phase takes an appropriate action to handle this
incompatibility; that is, it either indicates an error or
performs type coercion, during which the operands are
converted so that they are compatible.
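In the running example, if position, initial and rate are declared
as floating-point numbers while the lexeme 60 is an integer, the
type checker inserts a conversion, conventionally written
inttofloat(60), so that the multiplication has operands of one type.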
Intermediate Code Generation
• In the process of translating a source program into
target code, a compiler may construct one or more
intermediate representations, which can have a variety
of forms.
– Syntax trees are a form of intermediate representation;
they are commonly used during syntax and semantic
analysis.

• In this phase, an intermediate form called three-
address code is generated.

• The three-address code consists of a sequence of
assembly-like instructions with three operands per
instruction. Each operand can act like a register.
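For the running example, a typical three-address translation is:

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3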
Code Optimization
• Optimization reshapes the intermediate code so that it
works in a more efficient way. This phase usually
involves numerous sub phases, many of which are
applied repeatedly.

• In greater detail, we distinguish two kinds of
optimizations:
– Machine-independent optimization and
– Machine-dependent optimization

• The former operates on the intermediate code while the
latter is applied to the target code.
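A machine-independent optimizer can shorten the three-address code
shown earlier, for instance by converting the constant at compile
time and eliminating the copy into id1:

t1 = id3 * 60.0
id1 = id2 + t1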



Code Generation
• The code generator takes as input an
intermediate representation of the source
program and maps it into the target language.

• If the target language is machine code, registers
or memory locations are selected for each of the
variables used by the program; then the
intermediate instructions are translated into
sequences of machine instructions that perform
the same task.
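For the optimized code above, a code generator might emit something
like the following (a hypothetical register machine in the style of
the textbook example; # marks an immediate constant, R1 and R2 are
registers):

LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1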



The Structure of a Compiler
Scanner [Lexical Analyzer]
        ↓ Tokens
Parser [Syntax Analyzer]
        ↓ Parse tree
Semantic Process [Semantic Analyzer]
        ↓ Abstract Syntax Tree w/ Attributes
Intermediate Code Generator
        ↓ Non-optimized Intermediate Code
Code Optimizer
        ↓ Optimized Intermediate Code
Code Generator
        ↓ Target machine code
Error Handling
• The three analysis phases can encounter various errors.
– For instance, the lexical analysis can find out that the
upcoming sequence of numeric characters represents no
number in the source language.

– The syntax analysis can find out that the tokenized version
of the source program cannot be parsed by the
grammatical rules.

– Finally, the semantic analysis may detect an incompatibility
regarding the operands attached to an operator.

• The error handler must be able to detect any error of
this kind.
Symbol-Table Management
• An essential function of a compiler is to record the variable
names used in the source program and collect information
about various attributes of each name.

• These attributes may provide information about the
storage allocated for a name, its type, its scope (where in
the program its value may be used), and in the case of
procedure names, such things as the number and types of
its arguments, the method of passing each argument (for
ex., by value or by reference), and the type returned.

• The symbol table is a data structure containing a record for
each variable name, with fields for the attributes of the
name.
Symbol-Table Management
• The data structure should be designed to
allow the compiler to find the record for each
name quickly and to store or retrieve data
from that record quickly.

• Example:

#   Name       Type   Kind
1   position   real   var
2   initial    real   var
3   rate       real   var
4   60         int    const
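A minimal sketch of such a record and a lookup routine in C (the
field and function names here are illustrative, not from the slides):

#include <string.h>
#include <stddef.h>

#define MAX_SYMBOLS 256

struct symbol {
    char name[32];   /* the lexeme, e.g. "position" */
    char type[8];    /* e.g. "real", "int"          */
    char kind[8];    /* e.g. "var", "const"         */
};

static struct symbol table[MAX_SYMBOLS];
static int n_symbols = 0;

/* Linear search keeps the sketch short; a production compiler
   would use hashing so that records are found quickly.       */
static struct symbol *lookup(const char *name)
{
    for (int i = 0; i < n_symbols; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}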
Grouping of Phases into Passes
• The discussion of phases deals with the logical organization of a
compiler. In an implementation, activities from several phases may
be grouped together into a pass that reads an input file and writes
an output file.

• Phase is used to classify compilers according to construction, while
pass is used to classify compilers according to how they operate.
– For ex., the front-end phases of lexical analysis, syntax analysis,
semantic analysis, and intermediate code generation might be
grouped together into one pass. Code optimization might be an
optional pass. Then there could be a back-end pass consisting of code
generation for a particular target machine.

• Basically, there are two kinds of compiler in this respect:
– Multi-pass compiler
– One-pass compiler
Multi-pass Compilers
• A compiler that scans the input source code once,
produces a first modified form, then scans the
first modified form and produces a second
modified form, and so on, until the required
object form is produced, is called a
multi-pass compiler.

• In a multi-pass compiler, each function of the
compiler can be performed by one pass of the
compiler.


One-pass Compiler
• A one-pass compiler is a compiler that passes through the source
code of each compilation unit only once.

• In a one-pass compiler, when a source line is processed, it is
scanned and the tokens are extracted.

• Then the syntax of the line is analyzed and the tree structure and
some tables containing information about each token are built.

• Finally, after the semantic part is checked for correctness, the
code is generated.

• The same process is repeated for each line of code until the entire
program is compiled.


Grouping of Phases into Passes
• Compilers are broken into several passes, and the passes
communicate with each other via temporary files.

• When the source program is input to the compiler, it reads
the source program and stores values, variables,
functions etc. in temporary files; this is done in one pass,
and in the second or subsequent passes the meaning of
each is substituted.

• It is up to the compiler developer to create a one-pass,
two-pass, three-pass or four-pass compiler.
Grouping of Phases into Passes
• A larger number of passes increases the execution time,
as it takes time to first store something in
temporary files and then substitute its value in the
second and subsequent passes.

• The reasons for multiple passes in compilers are
as follows:
– Forward references
– Storage limitations
– Optimization


Difference between Single-pass
Compiler and Multi-pass Compiler
• Single-pass compiler is a compiler that passes through
the source code of each compilation unit only once. A
multi-pass compiler is a type of compiler that
processes the source code several times.

• A one-pass compiler is fast, since all compiler code is
loaded in memory at once. In a multi-pass compiler, on
the other hand, the output of each pass is stored on disk
and must be read in each time the next pass starts.

• The components of a one-pass compiler are much more
closely inter-related than the components of a multi-
pass compiler.
Difference between Single-pass
Compiler and Multi-pass Compiler
• A one-pass compiler has a limited scope of passes, but a
multi-pass compiler has a wide scope of passes.

• A one-pass compiler tends to impose some restrictions
upon the program: constants, types, variables and
procedures must be defined before they are used. A
multi-pass compiler does not impose this type of
restriction upon the user.

• Many programming languages cannot be compiled
with a single-pass compiler; e.g., Java requires a
multi-pass compiler.


Bootstrapping
• When writing a compiler, one will usually prefer to
write it in a high-level language. A possible choice is to
use a language that is already available on the machine
where the compiler should eventually run.

• It is, however, quite common to be in the following
situation:
– You have a completely new processor for which no
compilers exist yet. Nevertheless, you want to have a
compiler that not only targets this processor, but also runs
on it. In other words, you want to write a compiler for a
language A, targeting language B (the machine language)
and written in language B.


Bootstrapping
• The most obvious approach is to write the compiler in
language B. But if B is machine language, it is a horrible
job to write any non-trivial compiler in this language.
Instead, it is customary to use a process called
“bootstrapping”, referring to the seemingly impossible
task of pulling oneself up by the bootstraps.

• The idea of bootstrapping is simple:
– You write your compiler in language A (but still let it target
B) and then let it compile itself. The result is a compiler
from A to B written in B.


Bootstrapping
• It may sound a bit paradoxical to let the compiler compile
itself:
– In order to use the compiler to compile a program, we must
already have compiled it, and to do this we must use the
compiler. In a way, it is a bit like the chicken-and-egg-paradox.

• In other words, bootstrapping is the process of writing a
compiler (or assembler) in the source programming
language which it is intended to compile. Applying this
technique leads to a self-hosting compiler.

• Many compilers for many programming languages are
bootstrapped, including compilers for BASIC, ALGOL, C, etc.


Advantages of bootstrapping
• It is a non-trivial test of the language being compiled.

• Compiler developers only need to know the language
being compiled.

• Compiler development can be done in the higher-level
language being compiled.

• It is a comprehensive consistency check, as the compiler
should be able to produce its own object code.


Cross Compiler
• A compiler is characterized by three languages:
its source language, its object (target) language, and the
language in which it is written.
• These languages may be quite different.
• A cross compiler can run on one machine and
produce target code for another machine.

 Source language = Host language ⇒ Self compiler
 Target language = Host language ⇒ Self-resident compiler
 Target language ≠ Host language ⇒ Cross compiler

Cross Compiler
Example − Create a cross compiler using bootstrapping when SsM
(source S, target M, written in S) runs on SAA (source S, target A,
written in A).
Solution − First of all, represent the two compilers with T-diagrams.

• When SsM is compiled by SAA, SAM is generated: an S-to-M compiler
written in machine language A, i.e., a cross compiler that runs on A
and produces code for M.


Regular expression examples

1. Determine the regular expression for all strings containing
exactly one ‘a’ over ∑ = {a, b, c}.


Regular expression examples

Solution:
The input alphabet is ∑ = {a, b, c}.
The objective of the problem is to find the regular
expression for all strings containing exactly one ‘a’.
For this, first find the regular expression for any string at
all over the given ∑ which does not contain any ‘a’. It is
(b + c)*
Thus, to write a regular expression which denotes all strings
containing exactly one ‘a’, we simply put the regular expression
for any string not containing ‘a’ on both sides of ‘a’. Thus,
the resultant regular expression is
(b + c)* a (b + c)*
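As a quick sanity check, the expression can be tested with the POSIX
regex API in C (a minimal sketch; POSIX ERE writes this pattern as
^[bc]*a[bc]*$, since alternation uses | rather than + and the anchors
force a whole-string match):

#include <regex.h>
#include <stdio.h>

int main(void)
{
    regex_t re;
    const char *tests[] = { "bca", "abc", "aa", "bcb", "bacb" };

    /* exactly one 'a' over {a, b, c} */
    regcomp(&re, "^[bc]*a[bc]*$", REG_EXTENDED | REG_NOSUB);

    for (int i = 0; i < 5; i++)
        printf("%-5s -> %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "accept"
                                                       : "reject");
    regfree(&re);
    return 0;
}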
Regular expression examples

2. Find the regular expression for the set of all strings over
the input alphabet ∑ = {a, b} with three consecutive b’s.


Regular expression examples

Solution:
The input alphabet is ∑ = {a, b}.
According to the problem given, the resultant regular expression
represents all strings which have three consecutive b’s (which means,
strings always have a substring of b’s of length 3, i.e. bbb) over the
given input alphabet ∑ = {a, b}.
The regular expression for any string at all over the given input
alphabet ∑ = {a, b} is (a + b)*.
Thus, to write a regular expression which denotes all strings with
three consecutive b’s, we simply put the regular expression for any
string at all over the given ∑ on both sides of three consecutive b’s,
i.e., bbb. Thus, the resultant regular expression is
(a + b)* bbb (a + b)*


Regular expression examples

3. Find the regular expression for the set of all strings over
{0, 1} beginning with 00.

Solution:
The regular expression which fulfils the
requirement of the given question can be written as
00 (0 + 1)*


Regular expression examples

4. Find the regular expression for the set of all strings over
{0, 1} ending with 00 and beginning with 1.

Solution:
The regular expression which fulfils the
requirement of the given question can be written as
1 (0 + 1)* 00


Regular expression examples

5. Find the regular expression for all strings over the input
alphabet ∑ = {0, 1} ending in either 010 or 0010.

Solution:
The regular expression which fulfils the
requirement of the given question can be written as
(0 + 1)* (010 + 0010)


Regular expression examples

6. Find the regular expression for all strings containing no more
than three a’s over the input alphabet ∑ = {a, b, c}.
Solution:
We can write either zero, one, two or three a’s as
(λ + a) (λ + a) (λ + a)
We must also allow, at each place marked by ‘X’ below, an arbitrary
regular expression representing strings not containing any ‘a’ over
the given input alphabet ∑ = {a, b, c}:
X (λ + a) X (λ + a) X (λ + a) X
Here, the regular expression for the place X can be written as
(b + c)*
Thus, putting this regular expression at the places marked by X in
the form given above, we get
(b + c)* (λ + a) (b + c)* (λ + a) (b + c)* (λ + a) (b + c)*


Non-Deterministic Finite Automata

In a Non-Deterministic Finite Automaton:

• For some current state and input symbol,
there may exist more than one next state.
• A string is accepted only if there exists at
least one transition path starting at the initial
state and ending at a final state.

Converting NFA to DFA
The following steps are followed to convert a given NFA to a DFA-
Step-01:
• Let Q’ be a new set of states of the DFA. Q’ is empty
initially.
• Let T’ be a new transition table of the DFA.

Step-02:
• Add start state of the NFA to Q’.
• Add transitions of the start state to the transition table T’.
• If start state makes transition to multiple states for some input
alphabet, then treat those multiple states as a single state in
the DFA.



Converting NFA to DFA
Step-03:
If any new state is present in the transition table T’,
• Add the new state in Q’.
• Add transitions of that state in the transition table T’.

Step-04:
• Keep repeating Step-03 until no new state is present in the
transition table T’.
• Finally, the transition table T’ so obtained is the complete
transition table of the required DFA.
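The following is a compact sketch of this subset construction in C
(assumptions: state sets are stored as bitmasks, there are no
ε-moves, and the hard-coded transition table encodes the NFA of
Problem-01 below, so all names and sizes here are illustrative):

#include <stdio.h>

#define N_STATES 3            /* NFA states q0, q1, q2       */
#define N_SYMS   2            /* input symbols: 0 = a, 1 = b */

/* nfa[s][c] is the bitmask of states reachable from qs on c:
   q0 --a--> {q0},  q0 --b--> {q0, q1},  q1 --b--> {q2}.     */
static const unsigned nfa[N_STATES][N_SYMS] = {
    { 1u << 0, (1u << 0) | (1u << 1) },
    { 0,        1u << 2 },
    { 0,        0 },
};

int main(void)
{
    unsigned dstates[1u << N_STATES]; /* DFA state = set of NFA states */
    int n = 0;

    dstates[n++] = 1u << 0;           /* start state of the DFA: {q0} */

    for (int i = 0; i < n; i++) {     /* Steps 03/04: process each new state */
        for (int c = 0; c < N_SYMS; c++) {
            unsigned next = 0;
            for (int s = 0; s < N_STATES; s++)
                if (dstates[i] & (1u << s))
                    next |= nfa[s][c];

            int j;                    /* add 'next' to Q' if it is new */
            for (j = 0; j < n; j++)
                if (dstates[j] == next) break;
            if (j == n)
                dstates[n++] = next;

            printf("d%d --%c--> d%d\n", i, "ab"[c], j);
        }
    }
    printf("%d DFA states in total\n", n);
    return 0;
}

Running it reproduces the transition table derived by hand below:
d0 = {q0}, d1 = {q0, q1} and d2 = {q0, q1, q2}.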



PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
Problem-01:

Convert the following Non-Deterministic Finite Automata (NFA)
to Deterministic Finite Automata (DFA)-


PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
Solution-
Transition table for the given Non-Deterministic Finite
Automata (NFA) is-

State / Alphabet |  a  |    b
→q0              |  q0 | q0, q1
q1               |  –  |  *q2
*q2              |  –  |   –


PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
Step-01:
• Let Q’ be a new set of states of the Deterministic Finite Automata
(DFA).
• Let T’ be a new transition table of the DFA.

Step-02:
Add transitions of start state q0 to the transition table T’.

State / Alphabet |  a  |    b
→q0              |  q0 | {q0, q1}


PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
Step-03:
• New state present in state Q’ is {q0, q1}.
• Add transitions for set of states {q0, q1} to the transition table T’.

State / Alphabet |  a  |      b
→q0              |  q0 | {q0, q1}
{q0, q1}         |  q0 | {q0, q1, q2}


PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
Step-04:
• New state present in state Q’ is {q0, q1, q2}.
• Add transitions for set of states {q0, q1, q2} to the transition table T’.

State / Alphabet |  a  |      b
→q0              |  q0 | {q0, q1}
{q0, q1}         |  q0 | {q0, q1, q2}
{q0, q1, q2}     |  q0 | {q0, q1, q2}


PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
Step-05:
• Since no new states are left to be added in the transition table T’, so
we stop.
• States containing q2 as its component are treated as final states of
the DFA.

State / Alphabet |  a  |       b
→q0              |  q0 | {q0, q1}
{q0, q1}         |  q0 | *{q0, q1, q2}
*{q0, q1, q2}    |  q0 | *{q0, q1, q2}


PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
• Now, Deterministic Finite Automata (DFA) may be drawn as-



PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
Problem-02:

Convert the following Non-Deterministic Finite Automata (NFA)
to Deterministic Finite Automata (DFA)-


PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
Solution-
• Deterministic Finite Automata (DFA) may be drawn
as-



Minimization of DFA-

• The process of reducing a given DFA to its
minimal form is called minimization of the DFA.
• The minimal form contains the minimum number of states.
• The DFA in its minimal form is called
a Minimal DFA.


How To Minimize DFA?

Step-01:
Eliminate all the dead states and inaccessible states from
the given DFA (if any).

• Dead State
All those non-final states which transit to themselves for all
input symbols in ∑ are called dead states.

• Inaccessible State
All those states which can never be reached from the
initial state are called as inaccessible states.
How To Minimize DFA?

Step-02:
• Draw a state transition table for the given DFA.
• Transition table shows the transition of all states on all
input symbols in Σ.

Step-03:
Now, start applying equivalence theorem.
• Take a counter variable k and initialize it with value 0.
• Divide Q (set of states) into two sets such that one set
contains all the non-final states and other set contains
all the final states.
• This partition is called P0.
How To Minimize DFA?

Step-04:
• Increment k by 1.
• Find Pk by partitioning the different sets of Pk-1 .
• In each set of Pk-1 , consider all the possible pair of
states within each set and if the two states are
distinguishable, partition the set into different sets
in Pk.



How To Minimize DFA?

Step-05:
• Repeat step-04 until no change in partition occurs.
• In other words, when you find Pk = Pk-1, stop.

Step-06:
• All those states which belong to the same set are
equivalent.
• The equivalent states are merged to form a single state
in the minimal DFA.
Number of states in Minimal DFA = Number of sets in Pk
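A rough C sketch of this equivalence-theorem refinement (hard-coded
with the DFA of Problem-01 below; cls[s] is the index of the set of
Pk that state qs belongs to, and all sizes and names are illustrative):

#include <stdio.h>

#define NS 5                          /* DFA states q0..q4   */
#define NC 2                          /* input symbols: a, b */

static const int delta[NS][NC] = {    /* transition table    */
    {1, 2}, {1, 3}, {1, 2}, {1, 4}, {1, 2}
};
static const int is_final[NS] = { 0, 0, 0, 0, 1 };

int main(void)
{
    int cls[NS], nxt[NS];

    for (int s = 0; s < NS; s++)      /* P0: final vs non-final */
        cls[s] = is_final[s];

    for (int changed = 1, k = 1; changed; k++) {
        /* Two states stay together iff they are in the same set
           of P(k-1) and reach the same sets on every symbol.   */
        for (int s = 0; s < NS; s++) {
            nxt[s] = -1;
            for (int t = 0; t < s && nxt[s] < 0; t++) {
                int same = cls[s] == cls[t];
                for (int c = 0; c < NC && same; c++)
                    same = cls[delta[s][c]] == cls[delta[t][c]];
                if (same) nxt[s] = nxt[t];
            }
            if (nxt[s] < 0) {         /* open a new set of Pk */
                int m = 0;
                for (int t = 0; t < s; t++)
                    if (nxt[t] >= m) m = nxt[t] + 1;
                nxt[s] = m;
            }
        }
        changed = 0;
        printf("P%d:", k);
        for (int s = 0; s < NS; s++) {
            if (nxt[s] != cls[s]) changed = 1;
            cls[s] = nxt[s];
            printf(" q%d->set%d", s, cls[s]);
        }
        printf("\n");                 /* stops when Pk = P(k-1) */
    }
    return 0;
}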



PRACTICE PROBLEMS BASED ON
MINIMIZATION OF DFA
Problem-01:
Minimize the given DFA-



PRACTICE PROBLEMS BASED ON
MINIMIZATION OF DFA
Solution-
Step-01:
The given DFA contains no dead states and inaccessible states.

Step-02:
Draw a state transition table-
State |  a  |  b
→q0   |  q1 |  q2
q1    |  q1 |  q3
q2    |  q1 |  q2
q3    |  q1 | *q4
*q4   |  q1 |  q2
PRACTICE PROBLEMS BASED ON
MINIMIZATION OF DFA
Step-03:
Now using Equivalence Theorem, we have-
P0 = { q0 , q1 , q2 , q3 } { q4 }
P1 = { q0 , q1 , q2 } { q3 } { q4 }
P2 = { q0 , q2 } { q1 } { q3 } { q4 }
P3 = { q0 , q2 } { q1 } { q3 } { q4 }

Since P3 = P2, we stop.

From P3, we infer that states q0 and q2 are equivalent and can be
merged together.
So, our minimal DFA is-


PRACTICE PROBLEMS BASED ON
MINIMIZATION OF DFA
Problem-02:
Minimize the given DFA-



PRACTICE PROBLEMS BASED ON
MINIMIZATION OF DFA
Solution-
The minimal DFA is-



Context free grammar
• Context free grammar is a formal grammar which is
used to generate all possible strings in a given formal
language.
• Context free grammar G can be defined by four tuples
as:
G = (V, T, P, S)
Where,
• G describes the grammar
• V describes a finite set of non-terminal symbols
• T describes a finite set of terminal symbols
• P describes a set of production rules
• S is the start symbol.
Context free grammar
• In CFG, the start symbol is used to derive the string. You can
derive the string by repeatedly replacing a non-terminal by
the right-hand side of a production, until all non-terminals
have been replaced by terminal symbols.
• Production rules:
S → aSa
S → bSb
S→c
• Now check that the string abbcbba can be derived from the
given CFG:
S ⇒ aSa
  ⇒ abSba
  ⇒ abbSbba
  ⇒ abbcbba
Syntax Analyzer (Parser)
• Syntax analyzer creates the syntactic structure of the
given source program. This syntactic structure is mostly a
parse tree.
• The syntax of programming is described by Context-Free
Grammar (CFG).
• BNF (Backus-Naur Form) notation is used in the description of
CFGs.
• The syntax analyzer (parser) checks whether a given
source program satisfies the rules implied by a context-
free grammar or not.
• If it satisfies, the parser creates the parse tree of that
program. Otherwise, the parser gives error messages.
Syntax Analyzer (Parser)
What syntax analysis cannot do!
• To check whether variables are of types on which
operations are allowed
• To check whether a variable has been declared
before use
• To check whether a variable has been initialized
• These issues will be handled in semantic analysis



Backus-Naur form
• In 1960, John Backus and Peter Naur introduced a
formal method for describing the syntax of
programming languages, which is known as Backus-
Naur Form or simply BNF.
• BNF was basically designed for ALGOL 60.
• BNF and context-free grammars are nearly
identical.
• BNF is a metalanguage for primary languages.
• A metalanguage is a language that is used to
describe another language.


Backus-Naur form
BNF: Symbols used to describe a syntax
• ::= means 'is defined as'
• <> means 'can be described as'
• | means 'or'
• Terminals: symbols that can appear in the output of
a language because of its rules themselves.
• Non-terminals: syntactic entities that define a
part of the grammar.
• Non-terminals can be replaced; they are strings
composed of terminals and non-terminals.




Backus-Naur form
Example
if(condition)
{
body
}
else
{
body
}
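A BNF description of this construct might look like the following
(an illustrative sketch, not necessarily the slide's exact rules):

<if-statement> ::= if ( <condition> ) { <body> }
                 | if ( <condition> ) { <body> } else { <body> }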





Context Free Grammar (CFG) or BNF
• For specifying syntax, a widely used notation is
presented, called context-free grammars or BNF
(for Backus-Naur Form).

• A context-free grammar has four components:
– A set of terminal symbols, sometimes referred to as
tokens. The terminals are the elementary symbols of
the language defined by the grammar.
– A set of nonterminals, sometimes called syntactic
variables. Each non-terminal represents a set of
strings of terminals.
Context Free Grammar (CFG) or BNF
– A set of productions, where each production
consists of a nonterminal, called the head or left
side of the production, an arrow, and a sequence
of terminals and/or nonterminals, called the body or
right side of the production.
– A designation of one of the nonterminals as the
start symbol.



Capabilities of CFG
• Give precise syntactic specification of programming
languages.

• A parser can be constructed automatically from a CFG.

• The syntax entities specified in a CFG can be used for
translating into object code.

• Useful for describing nested structures such as
balanced parentheses, matching begin-end’s,
corresponding if-then-else, etc.
Derivations
• A grammar derives strings by beginning with the start
symbol and repeatedly replacing a nonterminal by the
body of a production for that nonterminal.

• The terminal strings that can be derived from the start
symbol form the language defined by the grammar.

• Parsing is the problem of taking a string of terminals
and figuring out how to derive it from the start symbol
of the grammar, and if it cannot be derived from the
start symbol of the grammar, then reporting syntax
errors within the string.


Parse Trees
• A parse tree pictorially shows how the start
symbol of a grammar derives a string in the
language.

• If nonterminal A has a production A→XYZ,
then a parse tree may have an interior node
labeled A with three children labeled X, Y and
Z, from left to right.


Parse Trees
• Formally, given a context-free grammar, a parse
tree according to the grammar is a tree with the
following properties:
– The root is labeled by the start symbol.
– Each leaf is labeled by a terminal or by Є.
– Each interior node is labeled by a nonterminal.
– If A is the nonterminal labeling some interior node
and X1, X2, ….., Xn are the labels of the children of that
node from left to right, then there must be a
production A→X1X2….Xn.
• Here X1, X2, ….., Xn each stand for a symbol that is either a
terminal or a nonterminal.
• As a special case, if A→Є is a production, then a node
labeled A may have a single child labeled Є.
Parse Trees
• From left to right, the leaves of a parse tree
form the yield of the tree, which is the string
generated or derived from the nonterminal at
the root of the parse tree.



Parse Trees
Example-
Consider the following grammar-
S → aB / bA
A → aS / bAA / a
B → bS / aBB / b
Let us consider a string w = aaabbabbba
Now, let us derive the string w using leftmost
derivation.



Parse Trees
Leftmost Derivation-
S → aB
→ aaBB (Using B → aBB)
→ aaaBBB (Using B → aBB)
→ aaabBB (Using B → b)
→ aaabbB (Using B → b)
→ aaabbaBB (Using B → aBB)
→ aaabbabB (Using B → b)
→ aaabbabbS (Using B → bS)
→ aaabbabbbA (Using S → bA)
→ aaabbabbba (Using A → a)
Parse Trees

Fig.: Parse tree for w = aaabbabbba


Parse Trees
Problem-
Consider the following grammar-
S → A1B
A → 0A / ∈
B → 0B / 1B / ∈
For the string w = 00101,
find-
Leftmost derivation
Rightmost derivation
Parse Tree
Parse Trees
Solution-
1. Leftmost Derivation-
S → A1B
→ 0A1B (Using A → 0A)
→ 00A1B (Using A → 0A)
→ 001B (Using A → ∈)
→ 0010B (Using B → 0B)
→ 00101B (Using B → 1B)
→ 00101 (Using B → ∈)
Parse Trees
2. Rightmost Derivation-
S → A1B
→ A10B (Using B → 0B)
→ A101B (Using B → 1B)
→ A101 (Using B → ∈)
→ 0A101 (Using A → 0A)
→ 00A101 (Using A → 0A)
→ 00101 (Using A → ∈)



Parse Trees
3. Parse Tree-



Ambiguity
• A grammar can have more than one parse tree
generating a given string of terminals. Such a grammar
is said to be ambiguous.

• To show that a grammar is ambiguous, all we need to
do is find a terminal string that is the yield of more
than one parse tree.

• Since a string with more than one parse tree usually
has more than one meaning, we need to design
unambiguous grammars for compiling applications, or
to use ambiguous grammars with additional rules to
resolve the ambiguities.
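A standard illustration (not from these slides): the grammar
E → E + E | E * E | id
is ambiguous, because the string id + id * id has two parse trees,
one grouping the operands as id + (id * id) and the other as
(id + id) * id.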
Left Factoring
• Grammar With Common Prefixes-
• If RHS of more than one production starts with the
same symbol, then such a grammar is called as
Grammar With Common Prefixes.

• This kind of grammar creates a problematic
situation for top-down parsers.
• Top-down parsers cannot decide which production
must be chosen to parse the string in hand.
• To remove this confusion, we use left factoring.


Left Factoring
• Left factoring is a process which isolates the common
parts of two productions into a single production.

• Any production of the form
A→αβ1| αβ2|……| αβn
can be replaced by
A→αA’
A’→β1| β2|……| βn

• Left factoring is useful for producing a grammar
suitable for a predictive parser.


Left Factoring

Example:

Consider the following grammar
A→aAB|aA|a
B→bB|b
Apply left factoring to it.


Left Factoring
Solution:
A→aAB|aA|a
It can be replaced by
A→aA’
A’→AB|A|Є
Similarly, B→bB|b can be replaced by
B→bB’
B’→B|Є
By left factoring the grammar will be
A→aA’
A’→AB|A|Є
B→bB’
B’→B|Є
Left Factoring
Solution:
A’→AB|A|Є
It can be replaced by
A'→AC|Є
C→B|Є
By left factoring the grammar will be
A→aA’
A'→AC|Є
C→B|Є
B→bB’
B’→B|Є



Left Factoring

Example:

Consider the following grammar

S → a / ab / abc / abcd

Apply left factoring to it.


Left Factoring
Solution:
S → aS’
S’ → b / bc / bcd / ∈
Again, this is a grammar with common prefixes. So now,
S → aS’
S’ → bA / ∈
A → c / cd / ∈
Again, this is a grammar with common prefixes. So now,
S → aS’
S’ → bA / ∈
A → cB / ∈
B→d/∈
This is a left factored grammar.
Left Recursion
• The production is left-recursive if the leftmost
symbol on the right side is same as the non-
terminal on the left side.

• For example,
A→Aα
Here A can be any non-terminal and α denotes
some input string.

• Because of left recursion, the top-down parser
can enter an infinite loop.
Left Recursion
• To eliminate left recursion we need to modify the
grammar.

• Let G be a CFG having a production rule with left
recursion
A→Aα|β
Then we eliminate left recursion by rewriting the
production rule as
A→βA’
A’→αA’
A’→Є


Left Recursion
• General algorithm for removing left recursion
A→Aα1|Aα2|……|Aαm|β1|β2|……|βn
where no βi begins with an A.

We can write the above productions as
A→β1A’|β2A’|…….|βnA’
A’→α1A’|α2A’|……|αmA’|Є


Left Recursion
Example:

Consider the following CFG
S→Bb|a
B→Bc|Sd|e
Find the left recursion and remove it.


Left Recursion
Solution:
We can write the given productions as
S→Bb|a
B→Bc|Bbd|ad|e
We will remove left recursion as follows:
B→adB’|eB’
B’→cB’|bdB’|Є



Left Recursion
Example:
Consider the following grammar for arithmetic
expressions
E→E+T|T
T →T*F|F
F →(E)|id
Remove the left recursion.



Left Recursion
Solution:
Eliminating the immediate left recursion to the
production for E and then for T, we obtain
E →TE’
E’ →+TE’|Є
T →FT’
T’ →*FT’|Є
F →(E)|id



The Role of Lexical Analyzer
• As the first phase of a compiler, the main task
of the lexical analyzer is to read the input
characters of the source program, group them
into lexemes, and produce as output a
sequence of tokens for each lexeme in the
source program.

• The stream of tokens is sent to the parser for
syntax analysis.


The Role of Lexical Analyzer
• It is common for the lexical analyzer to interact
with the symbol table as well. When the lexical
analyzer discovers a lexeme constituting an
identifier, it needs to enter that lexeme into the
symbol table.

• In some cases, information regarding the kind of
identifier may be read from the symbol table by
the lexical analyzer to assist it in determining the
proper token it must pass to the parser.


The Role of Lexical Analyzer
• Since the lexical analyzer is the part of the
compiler that reads the source text, it may
perform certain other tasks besides
identification of lexemes.

• One such task is stripping out comments and
whitespace (blank, newline, tab and perhaps
other characters that are used to separate
tokens in the input).


The Role of Lexical Analyzer
• Another task is correlating error messages
generated by the compiler with the source
program.
– For instance, the lexical analyzer may keep track of the
number of newline characters seen, so it can associate
a line number with each error message.

• If the source program uses a macro-processor,
the expansion of macros may also be performed
by the lexical analyzer.


Lexical Analysis versus Parsing
• There are a number of reasons why the analysis
portion of a compiler is normally separated into lexical
analysis and parsing (syntax analysis) phases.
– Simplicity of design is the most important consideration.
The separation of lexical and syntactic analysis often allows
us to simplify at least one of these tasks.
• For ex., a parser that had to deal with comments and whitespace
as syntactic units would be considerably more complex than one
that can assume comments and whitespace have already been
removed by the lexical analyzer.
• If we are designing a new language, separating lexical and
syntactic concerns can lead to a cleaner overall language design.



Lexical Analysis versus Parsing
– Compiler efficiency is improved.
• A separate lexical analyzer allows us to apply
specialized techniques that serve only the lexical task,
not the job of parsing.
• In addition, specialized buffering techniques for reading
input characters can speed up the compiler
significantly.

– Compiler portability is enhanced.


• Input-device-specific peculiarities can be restricted to
the lexical analyzer.



Tokens, Patterns, and Lexemes
• Tokens:
– A token is a pair consisting of a token name and an
optional attribute value.
– The token name is an abstract symbol
representing a kind of lexical unit.
• Ex., a particular keyword, or a sequence of input
characters denoting an identifier.
– The token names are the input symbols that the
parser processes.
– We will often refer to a token by its token name.



Tokens, Patterns, and Lexemes
• Patterns:
– A pattern is a description of the form that the
lexemes of a token may take.
– In the case of a keyword as a token, the pattern is
just the sequence of characters that form the
keyword.
– For identifiers and some other tokens, the pattern
is a more complex structure that is matched by
many strings.



Tokens, Patterns, and Lexemes
• Lexemes:
– A lexeme is a sequence of characters in source
program that matches the pattern for a token and
is identified by the lexical analyzer as an instance
of that token.



Tokens, Patterns, and Lexemes
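For example (typical rows, following the standard textbook table;
the original slide's figure is not reproduced here):

Token        Informal description of pattern         Sample lexemes
if           the characters i, f                     if
comparison   <, >, <=, >=, ==, or !=                 <=, !=
id           letter followed by letters and digits   pi, score, D2
number       any numeric constant                    3.14159, 0, 6.02e23
literal      anything but ", surrounded by "s        "core dumped"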



Tokens, Patterns, and Lexemes
For example, consider the program:
int main()
{
// variable declaration
int a, b;
a = 10;
return 0;
}
All the valid tokens are:
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';' 'a' '=' '10' ';' 'return' '0' ';' '}'



Tokens, Patterns, and Lexemes
Exercise 1:
Count number of tokens :
int main()
{
int a = 10, b = 20;
printf("sum is :%d",a+b);
return 0;
}

Answer: Total number of token: 27



Tokens, Patterns, and Lexemes
Exercise 2:
// comment
printf("string %d ",++i++&&&i***a);
return(x?y:z)

printf ( "string %d " , ++ i ++ && & i * * * a ) ; return ( x


? y : z )

Answer: Total number of token: 24


: Distinct number of tokens will be = 18

Compiler Design, KCS-502 131


The Lexical Analyzer Generator

LEX
LEX, or in a more recent implementation FLEX, is a tool
that allows one to specify a lexical analyzer by
specifying regular expressions to describe patterns for
tokens.

• The input notation for the LEX tool is referred to as the
LEX language and the tool itself is the LEX compiler.

• The LEX compiler transforms the input patterns into a
transition diagram and generates code, in a file called
lex.yy.c, that simulates this transition diagram.
An Overview of Lex

Lex source program lex.l → Lex compiler → lex.yy.c

lex.yy.c → C compiler → a.out

input stream → a.out → sequence of tokens


An Overview of Lex
• An input file, which we call lex.l, is written in the LEX
language and describes the lexical analyzer to be
generated.

• The LEX compiler transforms lex.l to a C program, in a file
that is always named lex.yy.c.

• The latter file is compiled by the C compiler into a file called
a.out, as always.

• The C-compiler output is a working lexical analyzer that can
take a stream of input characters and produce a stream of
tokens.


Use of LEX
• The normal use of the compiled C program, referred to
as a.out is as a subroutine of the parser.
– It is a C function that returns an integer, which is a code for
one of the possible token names.

• The attribute value, whether it be another numeric
code, a pointer to the symbol table, or nothing, is
placed in a global variable yylval, which is shared
between the lexical analyzer and parser, thereby
making it simple to return both the name and an
attribute value of a token.


Structure of LEX programs
• Lex source is separated into three sections by %%
delimiters.
• The general format of Lex source is
{definitions}
%%
{transition rules}
%%
{user subroutines}
where the first %% is required and the second %%, together
with the user subroutines section, is optional.
• The absolute minimum Lex program is thus
%%
Structure of LEX programs
Example:

digit    [0-9]
letter   [a-zA-Z]
%%
{letter}({letter}|{digit})*   printf("id: %s\n", yytext);
\n                            printf("new line\n");
%%
main()
{
    yylex();
}
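Assuming the program above is saved as id.l, it can be built and run
with the usual commands (flex shown here; with the original AT&T lex,
link with -ll instead of -lfl):

flex id.l
cc lex.yy.c -lfl
./a.out < input.txt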



Structure of LEX programs
• The declarations section includes declaration of variables,
manifest constants (identifiers declared to stand for a
constant, ex., the name of a token), and regular definitions.

• The transition rules each have the form
Pattern { Action }

• Each pattern is a regular expression, which may use the
regular definitions of the declarations section.

• The actions are fragments of code, typically written in C,
although many variants of LEX using other languages have
been created.


Structure of LEX programs
• The third section holds whatever additional
functions are used in the actions.
Alternatively, these functions can be compiled
separately and loaded with the lexical
analyzer.



YACC
• YACC stands for Yet Another Compiler
Compiler, reflecting the popularity of parser
generators in the early 1970s when the first
version of YACC was created by S.C. Johnson.

• YACC is available as a command on the UNIX
system, and has been used to help implement
many production compilers.


Introduction

• What is YACC ?
– Tool which will produce a parser for a given
grammar.
– YACC (Yet Another Compiler Compiler) is a
program designed to compile a LALR(1)
grammar and to produce the source code of
the syntactic analyzer of the language
produced by this grammar.



How YACC Works
(1) Parser generation time:
YACC source (*.y) → yacc → y.tab.c (plus y.tab.h and y.output)

(2) Compile time:
y.tab.c → C compiler/linker → a.out

(3) Run time:
Token stream → a.out → Abstract Syntax Tree
YACC File Format
%{
C declarations
%}
yacc declarations
%%
Grammar rules
%%
Additional C code
– Comments enclosed in /* ... */ may appear in any
of the sections.



A YACC File Example
%{
#include <stdio.h>
%}

%token NAME NUMBER

%%

statement: NAME '=' expression
         | expression              { printf("= %d\n", $1); }
         ;

expression: expression '+' NUMBER  { $$ = $1 + $3; }
          | expression '-' NUMBER  { $$ = $1 - $3; }
          | NUMBER                 { $$ = $1; }
          ;
%%
int yyerror(char *s)
{
    fprintf(stderr, "%s\n", s);
    return 0;
}

int main(void)
{
    yyparse();
    return 0;
}


The Parser Generator YACC
• First, a file, say translate.y, containing a YACC
specification of the translator is prepared.

• The UNIX system command yacc translate.y translates
the file translate.y into a C program called y.tab.c using
the LALR method.

• By compiling y.tab.c along with the ly library that
contains the LR parsing program, using the command
cc y.tab.c -ly, we obtain the desired object program
a.out that performs the translation specified by the
original YACC program.
