
Compiler Design (KCS-502)

3rd year (Semester – V)


Session – 2023 - 24
Unit – I

Ratish Srivastava
Asst. Prof.
CSE Dept.
UCER, Prayagraj
Introduction
Translator:
• A program written in a high-level language is called source
code. To convert the source code into machine code, translators
are needed.
• A translator takes a program written in source language as input
and converts it into a program in target language as output.
Purpose of Translator
• Translating the high-level language program input into an
equivalent machine language program.
• Providing diagnostic messages wherever the programmer
violates the specification of the high-level language.
• Different Types of Translators
Compiler, Interpreter and Assembler
Introduction
• Compiler:
– A compiler is a computer program (or set of
programs) that transforms source code written in a
programming language (the source language) into
another computer language (the target language,
often having a binary form known as object code)

– An important role of the compiler is to report any
errors in the source program that it detects during the
translation process.


Introduction

Source Program → Compiler → Target Program
                    ↓
                  Errors

Fig.: A Compiler


Introduction
• Interpreter:
– An interpreter reads the source code one
instruction or line at a time, converts this line into
machine code and executes it.



Introduction

Source Program + Input → Interpreter → Output

Fig.: An Interpreter


Difference between Compiler and
Interpreter
• The main difference between an interpreter
and a compiler is that compilation requires
analysis and the generation of machine code
only once, whereas an interpreter may need to
analyze and interpret the same program statements
each time it meets them, e.g., instructions appearing
within a loop.


Difference between Compiler and
Interpreter
• A compiler converts the high-level
instructions into machine language, while an
interpreter converts the high-level instructions
into an intermediate form.

• Before execution, the entire program is translated
by the compiler, whereas an interpreter translates the
first line, executes it, and so on.


Difference between Compiler and
Interpreter
• A list of errors is created by the compiler after the
compilation process, while an interpreter stops
translating after the first error.

• An independent executable file is created by the
compiler, whereas an interpreted program requires the
interpreter each time it is run.

• The compiler produces object code, whereas an
interpreter does not produce object code.


Assembler
• Assemblers are a third type of translator.
• The purpose of an assembler is to
translate assembly language into object (machine)
code.
• It uses opcode (mnemonics) for the instructions.
For example: ADD A,B
• Here, ADD is the mnemonic that tells the
processor that it has to perform the addition operation.
Moreover, A and B are the operands.


Language-processing System
• A source program may be divided into modules stored
in separate files. The task of collecting the source
program is sometimes entrusted to a separate
program, called a preprocessor. The preprocessor may
also expand shorthands, called macros, into source
language statements.

• The modified source program is then fed to a compiler.
The compiler may produce an assembly language
program as its output, because assembly language is
easier to produce as output and is easier to debug.


Language-processing System
• The assembly language is then processed by a
program called an assembler that produces
relocatable machine code as its output.

• The linker resolves external memory addresses,
where the code in one file may refer to a location
in another file.

• The loader then puts together all of the
executable object files into memory for
execution.
Source program with macros
         ↓
    Preprocessor
         ↓
   Source program
         ↓
      Compiler
         ↓
Target assembly program
         ↓
     Assembler
         ↓
Relocatable machine code
         ↓
   Linker / Loader
         ↓
Absolute machine code


A language processing system
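As a concrete illustration (assuming the GNU C toolchain; the file
names are arbitrary, and other toolchains provide equivalent steps),
each stage of this pipeline can be run separately:

gcc -E prog.c -o prog.i    # preprocess: expand macros and #includes
gcc -S prog.i -o prog.s    # compile to assembly
gcc -c prog.s -o prog.o    # assemble to relocatable object code
gcc prog.o -o prog         # link into an executable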
The Structure of a Compiler
• Up to this point, we have treated a compiler as
a single box that maps a source program into a
semantically equivalent target program. If we
open up this box a little, we see that there are
two parts to this mapping: analysis and
synthesis.


The Structure of a Compiler
• Analysis:
– Breaks up source program into pieces and
imposes a grammatical structure.
– Creates intermediate representation (IR) of source
program.
– Determines the operations and records them in a
tree structure, the syntax tree.
– Known as the front end of the compiler.


The Structure of a Compiler
• Synthesis:
– Constructs target program from intermediate
representation.
– Takes the tree structure and translates the
operations into the target program.
– Known as the back end of the compiler.


Phases of a Compiler
• Conceptually, a compiler operates in phases, each
of which transforms the source program from
one representation to another.

• In greater detail, the compiler first performs the
lexical, syntax and semantic analysis of the source
program. Then, from the information gathered
during this threefold analysis, it generates the
intermediate code of the source program, optimizes
it and creates the resulting target code.


Phases of a Compiler
• As a whole, the compilation thus consists of
these six compilation phases.
– Lexical Analysis
– Syntax Analysis
– Semantic Analysis
– Intermediate Code Generation
– Code Optimization
– Code Generation.



Phases of a Compiler
• The first three phases form the bulk of the
analysis portion of a compiler and the last
three phases form the synthesis portion of a
compiler.

• Symbol-table management and error handling
are shown interacting with the six phases.


Phases of a Compiler

Fig.: The phases of a compiler


Lexical Analysis
• The lexical analysis is also called scanning.

• Lexical analysis breaks up the source program
into lexemes, that is, logically cohesive lexical
entities, such as identifiers or integers.

• It checks that these entities are well-formed,
produces tokens that uniformly represent lexemes
in a fixed-size way and sends these tokens to the
syntax analyzer.
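For example, for the assignment statement
position = initial + rate * 60
(the classic textbook example, reused in the symbol-table slide
later), the lexical analyzer would produce a token stream such as

⟨id,1⟩ ⟨=⟩ ⟨id,2⟩ ⟨+⟩ ⟨id,3⟩ ⟨*⟩ ⟨60⟩

where the numbers are indices of symbol-table entries.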
Lexical Analysis
• Different lexical classes or tokens or lexemes
are
– Identifiers
– Constants
– Keywords
– Operators



Syntax Analysis
• The syntax analysis is also called parsing.
• The parser uses the first components of the
tokens produced by the lexical analyzer to create
a tree-like intermediate representation that
depicts the grammatical structure of the token
stream.

• A typical representation is a syntax tree in which
each interior node represents an operation and
the children of the node represent the
arguments of the operation.
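Continuing the example above, the parser would build a syntax tree
for position = initial + rate * 60 such as:

            =
          /   \
     ⟨id,1⟩     +
              /   \
        ⟨id,2⟩      *
                  /   \
            ⟨id,3⟩     60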
Semantic Analysis
• Semantic analysis checks that the source program
satisfies the semantic conventions of the source
language.

• Perhaps most importantly, it performs type checking to
verify that each operator has operands permitted by
the source language specification.

• If the operands are not permitted, this compilation
phase takes an appropriate action to handle this
incompatibility; that is, it either indicates an error or
performs type coercion, during which the operands are
converted so that they are compatible.
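In the running example, if position, initial and rate are declared
as floating-point numbers while the lexeme 60 is an integer, the
type checker inserts a conversion, conventionally written
inttofloat(60), so that the multiplication has operands of one type.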
Intermediate Code Generation
• In the process of translating a source program into
target code, a compiler may construct one or more
intermediate representations, which can have a variety
of forms.
– Syntax trees are a form of intermediate representation;
they are commonly used during syntax and semantic
analysis.

• In this phase, an intermediate form called three-
address code is generated.

• The three-address code consists of a sequence of
assembly-like instructions with three operands per
instruction. Each operand can act like a register.
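For the running example, a typical three-address translation is:

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3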
Code Optimization
• Optimization reshapes the intermediate code so that it
works in a more efficient way. This phase usually
involves numerous sub phases, many of which are
applied repeatedly.

• In greater detail, we distinguish two kinds of
optimizations:
– Machine-independent optimization and
– Machine-dependent optimization

• The former operates on the intermediate code while the
latter is applied to the target code.
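A machine-independent optimizer can shorten the three-address code
shown earlier, for instance by converting the constant at compile
time and eliminating the copy into id1:

t1 = id3 * 60.0
id1 = id2 + t1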



Code Generation
• The code generator takes as input an
intermediate representation of the source
program and maps it into the target language.

• If the target language is machine code, registers
or memory locations are selected for each of the
variables used by the program; then the
intermediate instructions are translated into
sequences of machine instructions that perform
the same task.
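For the optimized code above, a code generator might emit something
like the following (a hypothetical register machine in the style of
the textbook example; # marks an immediate constant, R1 and R2 are
registers):

LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1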



The Structure of a Compiler
Scanner [Lexical Analyzer]
        ↓ Tokens
Parser [Syntax Analyzer]
        ↓ Parse tree
Semantic Process [Semantic Analyzer]
        ↓ Abstract Syntax Tree w/ Attributes
Intermediate Code Generator
        ↓ Non-optimized Intermediate Code
Code Optimizer
        ↓ Optimized Intermediate Code
Code Generator
        ↓ Target machine code
Error Handling
• The three analysis phases can encounter various errors.
– For instance, the lexical analysis can find out that the
upcoming sequence of numeric characters represents no
number in the source language.

– The syntax analysis can find out that the tokenized version
of the source program cannot be parsed by the
grammatical rules.

– Finally, the semantic analysis may detect an incompatibility
regarding the operands attached to an operator.

• The error handler must be able to detect any error of
this kind.
Symbol-Table Management
• An essential function of a compiler is to record the variable
names used in the source program and collect information
about various attributes of each name.

• These attributes may provide information about the
storage allocated for a name, its type, its scope (where in
the program its value may be used), and in the case of
procedure names, such things as the number and types of
its arguments, the method of passing each argument (for
ex., by value or by reference), and the type returned.

• The symbol table is a data structure containing a record for
each variable name, with fields for the attributes of the
name.
Symbol-Table Management
• The data structure should be designed to
allow the compiler to find the record for each
name quickly and to store or retrieve data
from that record quickly.

• Example:

#   Name       Type   Kind
1   position   real   var
2   initial    real   var
3   rate       real   var
4   60         int    const
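A minimal sketch of such a record and a lookup routine in C (the
field and function names here are illustrative, not from the slides):

#include <string.h>
#include <stddef.h>

#define MAX_SYMBOLS 256

struct symbol {
    char name[32];   /* the lexeme, e.g. "position" */
    char type[8];    /* e.g. "real", "int"          */
    char kind[8];    /* e.g. "var", "const"         */
};

static struct symbol table[MAX_SYMBOLS];
static int n_symbols = 0;

/* Linear search keeps the sketch short; a production compiler
   would use hashing so that records are found quickly.       */
static struct symbol *lookup(const char *name)
{
    for (int i = 0; i < n_symbols; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return NULL;
}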
Grouping of Phases into Passes
• The discussion of phases deals with the logical organization of a
compiler. In an implementation, activities from several phases may
be grouped together into a pass that reads an input file and writes
an output file.

• Phase is used to classify compilers according to construction, while
pass is used to classify compilers according to how they operate.
– For ex., the front-end phases of lexical analysis, syntax analysis,
semantic analysis, and intermediate code generation might be
grouped together into one pass. Code optimization might be an
optional pass. Then there could be a back-end pass consisting of code
generation for a particular target machine.

• Basically, there are two kinds of compiler in this respect:
– Multi-pass compiler
– One-pass compiler
Multi-pass Compilers
• A compiler that scans the input source code once,
produces a first modified form, then scans the
first modified form and produces a second
modified form, and so on, until the required
object form is produced, is called a
multi-pass compiler.

• In a multi-pass compiler, each function of the
compiler can be performed by one pass of the
compiler.


One-pass Compiler
• A one-pass compiler is a compiler that passes through the source
code of each compilation unit only once.

• In a one-pass compiler, when a source line is processed, it is
scanned and the tokens are extracted.

• Then the syntax of the line is analyzed and the tree structure and
some tables containing information about each token are built.

• Finally, after the semantic part is checked for correctness, the
code is generated.

• The same process is repeated for each line of code until the entire
program is compiled.


Grouping of Phases into Passes
• Compilers are broken into several passes, and the passes
communicate with each other via temporary files.

• When the source program is input to the compiler, it reads
the source program and stores values, variables,
functions etc. in temporary files; this is done in one pass,
and in the second or subsequent passes the meaning of
each is substituted.

• It is up to the compiler developer to create a one-pass,
two-pass, three-pass or four-pass compiler.
Grouping of Phases into Passes
• A larger number of passes increases the execution time,
as it takes time to first store something in
temporary files and then substitute its value in the
second and subsequent passes.

• The reasons for multiple passes in compilers are
as follows:
– Forward references
– Storage limitations
– Optimization


Difference between Single-pass
Compiler and Multi-pass Compiler
• Single-pass compiler is a compiler that passes through
the source code of each compilation unit only once. A
multi-pass compiler is a type of compiler that
processes the source code several times.

• A one-pass compiler is fast, since all compiler code is
loaded in memory at once. In a multi-pass compiler, on
the other hand, the output of each pass is stored on disk
and must be read in each time the next pass starts.

• The components of a one-pass compiler are much more
closely inter-related than the components of a multi-
pass compiler.
Difference between Single-pass
Compiler and Multi-pass Compiler
• A one-pass compiler has a limited scope of passes, but a
multi-pass compiler has a wide scope of passes.

• A one-pass compiler tends to impose some restrictions
upon the program: constants, types, variables and
procedures must be defined before they are used. A
multi-pass compiler does not impose this type of
restriction upon the user.

• Many programming languages cannot be compiled
with a single-pass compiler; e.g., Java requires a
multi-pass compiler.


Bootstrapping
• When writing a compiler, one will usually prefer to
write it in a high-level language. A possible choice is to
use a language that is already available on the machine
where the compiler should eventually run.

• It is, however, quite common to be in the following
situation:
– You have a completely new processor for which no
compilers exist yet. Nevertheless, you want to have a
compiler that not only targets this processor, but also runs
on it. In other words, you want to write a compiler for a
language A, targeting language B (the machine language)
and written in language B.


Bootstrapping
• The most obvious approach is to write the compiler in
language B. But if B is machine language, it is a horrible
job to write any non-trivial compiler in this language.
Instead, it is customary to use a process called
“bootstrapping”, referring to the seemingly impossible
task of pulling oneself up by the bootstraps.

• The idea of bootstrapping is simple:
– You write your compiler in language A (but still let it target
B) and then let it compile itself. The result is a compiler
from A to B written in B.


Bootstrapping
• It may sound a bit paradoxical to let the compiler compile
itself:
– In order to use the compiler to compile a program, we must
already have compiled it, and to do this we must use the
compiler. In a way, it is a bit like the chicken-and-egg-paradox.

• In other words, bootstrapping is the process of writing a
compiler (or assembler) in the source programming
language which it is intended to compile. Applying this
technique leads to a self-hosting compiler.

• Many compilers for many programming languages are
bootstrapped, including compilers for BASIC, ALGOL, C, etc.


Advantages of bootstrapping
• It is a non-trivial test of the language being compiled.

• Compiler developers only need to know the language
being compiled.

• Compiler development can be done in the higher-level
language being compiled.

• It is a comprehensive consistency check, as the compiler
should be able to produce its own object code.


Cross Compiler
• A compiler is characterized by three languages:
its source language, its object (target) language, and the
language in which it is written.
• These languages may be quite different.
• A cross compiler can run on one machine and
produce target code for another machine.

 Source language = Host language ⇒ Self compiler
 Target language = Host language ⇒ Self-resident compiler
 Target language ≠ Host language ⇒ Cross compiler

Cross Compiler
Example − Create a cross compiler using bootstrapping when SsM
(source S, target M, written in S) runs on SAA (source S, target A,
written in A).
Solution − First of all, represent the two compilers with T-diagrams.

• When SsM is compiled by SAA, SAM is generated: an S-to-M compiler
written in machine language A, i.e., a cross compiler that runs on A
and produces code for M.


Regular expression examples

1. Determine the regular expression for all strings containing
exactly one ‘a’ over ∑ = {a, b, c}.


Regular expression examples

Solution:
The input alphabet is ∑ = {a, b, c}.
The objective of the problem is to find the regular
expression for all strings containing exactly one ‘a’.
For this, first find the regular expression for any string at
all over the given ∑ which does not contain any ‘a’. It is
(b + c)*
Thus, to write a regular expression which denotes all strings
containing exactly one ‘a’, we simply put the regular expression
for any string not containing ‘a’ on both sides of ‘a’. Thus,
the resultant regular expression is
(b + c)* a (b + c)*
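As a quick sanity check, the expression can be tested with the POSIX
regex API in C (a minimal sketch; POSIX ERE writes this pattern as
^[bc]*a[bc]*$, since alternation uses | rather than + and the anchors
force a whole-string match):

#include <regex.h>
#include <stdio.h>

int main(void)
{
    regex_t re;
    const char *tests[] = { "bca", "abc", "aa", "bcb", "bacb" };

    /* exactly one 'a' over {a, b, c} */
    regcomp(&re, "^[bc]*a[bc]*$", REG_EXTENDED | REG_NOSUB);

    for (int i = 0; i < 5; i++)
        printf("%-5s -> %s\n", tests[i],
               regexec(&re, tests[i], 0, NULL, 0) == 0 ? "accept"
                                                       : "reject");
    regfree(&re);
    return 0;
}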
Regular expression examples

2. Find the regular expression for the set of all strings over
the input alphabet ∑ = {a, b} with three consecutive b’s.


Regular expression examples

Solution:
The input alphabet is ∑ = {a, b}.
According to the problem given, the resultant regular expression
represents all strings which have three consecutive b’s (which means,
strings always have a substring of b’s of length 3, i.e. bbb) over the
given input alphabet ∑ = {a, b}.
The regular expression for any string at all over the given input
alphabet ∑ = {a, b} is (a + b)*.
Thus, to write a regular expression which denotes all strings with
three consecutive b’s, we simply put the regular expression for any
string at all over the given ∑ on both sides of three consecutive b’s,
i.e., bbb. Thus, the resultant regular expression is
(a + b)* bbb (a + b)*


Regular expression examples

3. Find the regular expression for the set of all strings over
{0, 1} beginning with 00.

Solution:
The regular expression which fulfils the
requirement of the given question can be written as
00 (0 + 1)*


Regular expression examples

4. Find the regular expression for the set of all strings over
{0, 1} ending with 00 and beginning with 1.

Solution:
The regular expression which fulfils the
requirement of the given question can be written as
1 (0 + 1)* 00


Regular expression examples

5. Find the regular expression for all strings over the input
alphabet ∑ = {0, 1} ending in either 010 or 0010.

Solution:
The regular expression which fulfils the
requirement of the given question can be written as
(0 + 1)* (010 + 0010)


Regular expression examples

6. Find the regular expression for all strings containing no more
than three a’s over the input alphabet ∑ = {a, b, c}.
Solution:
We can write either zero, one, two or three a’s as
(λ + a) (λ + a) (λ + a)
We must also allow, at each place marked by ‘X’ below, an arbitrary
regular expression representing strings not containing any ‘a’ over
the given input alphabet ∑ = {a, b, c}:
X (λ + a) X (λ + a) X (λ + a) X
Here, the regular expression for the place X can be written as
(b + c)*
Thus, putting this regular expression at the places marked by X in
the form given above, we get
(b + c)* (λ + a) (b + c)* (λ + a) (b + c)* (λ + a) (b + c)*


Non-Deterministic Finite Automata

In a Non-Deterministic Finite Automaton:

• For some current state and input symbol,
there may exist more than one next state.
• A string is accepted only if there exists at
least one transition path starting at the initial
state and ending at a final state.

Converting NFA to DFA
The following steps are followed to convert a given NFA to a DFA-
Step-01:
• Let Q’ be a new set of states of the DFA. Q’ is empty
initially.
• Let T’ be a new transition table of the DFA.

Step-02:
• Add start state of the NFA to Q’.
• Add transitions of the start state to the transition table T’.
• If start state makes transition to multiple states for some input
alphabet, then treat those multiple states as a single state in
the DFA.



Converting NFA to DFA
Step-03:
If any new state is present in the transition table T’,
• Add the new state in Q’.
• Add transitions of that state in the transition table T’.

Step-04:
• Keep repeating Step-03 until no new state is present in the
transition table T’.
• Finally, the transition table T’ so obtained is the complete
transition table of the required DFA.
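The following is a compact sketch of this subset construction in C
(assumptions: state sets are stored as bitmasks, there are no
ε-moves, and the hard-coded transition table encodes the NFA of
Problem-01 below, so all names and sizes here are illustrative):

#include <stdio.h>

#define N_STATES 3            /* NFA states q0, q1, q2       */
#define N_SYMS   2            /* input symbols: 0 = a, 1 = b */

/* nfa[s][c] is the bitmask of states reachable from qs on c:
   q0 --a--> {q0},  q0 --b--> {q0, q1},  q1 --b--> {q2}.     */
static const unsigned nfa[N_STATES][N_SYMS] = {
    { 1u << 0, (1u << 0) | (1u << 1) },
    { 0,        1u << 2 },
    { 0,        0 },
};

int main(void)
{
    unsigned dstates[1u << N_STATES]; /* DFA state = set of NFA states */
    int n = 0;

    dstates[n++] = 1u << 0;           /* start state of the DFA: {q0} */

    for (int i = 0; i < n; i++) {     /* Steps 03/04: process each new state */
        for (int c = 0; c < N_SYMS; c++) {
            unsigned next = 0;
            for (int s = 0; s < N_STATES; s++)
                if (dstates[i] & (1u << s))
                    next |= nfa[s][c];

            int j;                    /* add 'next' to Q' if it is new */
            for (j = 0; j < n; j++)
                if (dstates[j] == next) break;
            if (j == n)
                dstates[n++] = next;

            printf("d%d --%c--> d%d\n", i, "ab"[c], j);
        }
    }
    printf("%d DFA states in total\n", n);
    return 0;
}

Running it reproduces the transition table derived by hand below:
d0 = {q0}, d1 = {q0, q1} and d2 = {q0, q1, q2}.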



PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
Problem-01:

Convert the following Non-Deterministic Finite Automata (NFA)
to Deterministic Finite Automata (DFA)-


PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
Solution-
Transition table for the given Non-Deterministic Finite
Automata (NFA) is-

State / Alphabet |  a  |    b
→q0              |  q0 | q0, q1
q1               |  –  |  *q2
*q2              |  –  |   –


PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
Step-01:
• Let Q’ be a new set of states of the Deterministic Finite Automata
(DFA).
• Let T’ be a new transition table of the DFA.

Step-02:
Add transitions of start state q0 to the transition table T’.

State / Alphabet |  a  |    b
→q0              |  q0 | {q0, q1}


PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
Step-03:
• New state present in state Q’ is {q0, q1}.
• Add transitions for set of states {q0, q1} to the transition table T’.

State / Alphabet |  a  |      b
→q0              |  q0 | {q0, q1}
{q0, q1}         |  q0 | {q0, q1, q2}


PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
Step-04:
• New state present in state Q’ is {q0, q1, q2}.
• Add transitions for set of states {q0, q1, q2} to the transition table T’.

State / Alphabet |  a  |      b
→q0              |  q0 | {q0, q1}
{q0, q1}         |  q0 | {q0, q1, q2}
{q0, q1, q2}     |  q0 | {q0, q1, q2}


PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
Step-05:
• Since no new states are left to be added in the transition table T’, so
we stop.
• States containing q2 as its component are treated as final states of
the DFA.

State / Alphabet |  a  |       b
→q0              |  q0 | {q0, q1}
{q0, q1}         |  q0 | *{q0, q1, q2}
*{q0, q1, q2}    |  q0 | *{q0, q1, q2}


PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
• Now, Deterministic Finite Automata (DFA) may be drawn as-



PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
Problem-02:

Convert the following Non-Deterministic Finite Automata (NFA)
to Deterministic Finite Automata (DFA)-


PRACTICE PROBLEMS BASED ON
CONVERTING NFA TO DFA
Solution-
• Deterministic Finite Automata (DFA) may be drawn
as-



Minimization of DFA-

• The process of reducing a given DFA to its
minimal form is called minimization of the DFA.
• The minimal form contains the minimum number of states.
• The DFA in its minimal form is called
a Minimal DFA.


How To Minimize DFA?

Step-01:
Eliminate all the dead states and inaccessible states from
the given DFA (if any).

• Dead State
All those non-final states which transit to themselves for all
input symbols in ∑ are called dead states.

• Inaccessible State
All those states which can never be reached from the
initial state are called as inaccessible states.
How To Minimize DFA?

Step-02:
• Draw a state transition table for the given DFA.
• Transition table shows the transition of all states on all
input symbols in Σ.

Step-03:
Now, start applying equivalence theorem.
• Take a counter variable k and initialize it with value 0.
• Divide Q (set of states) into two sets such that one set
contains all the non-final states and other set contains
all the final states.
• This partition is called P0.
How To Minimize DFA?

Step-04:
• Increment k by 1.
• Find Pk by partitioning the different sets of Pk-1 .
• In each set of Pk-1 , consider all the possible pair of
states within each set and if the two states are
distinguishable, partition the set into different sets
in Pk.



How To Minimize DFA?

Step-05:
• Repeat step-04 until no change in partition occurs.
• In other words, when you find Pk = Pk-1, stop.

Step-06:
• All those states which belong to the same set are
equivalent.
• The equivalent states are merged to form a single state
in the minimal DFA.
Number of states in Minimal DFA = Number of sets in Pk
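A rough C sketch of this equivalence-theorem refinement (hard-coded
with the DFA of Problem-01 below; cls[s] is the index of the set of
Pk that state qs belongs to, and all sizes and names are illustrative):

#include <stdio.h>

#define NS 5                          /* DFA states q0..q4   */
#define NC 2                          /* input symbols: a, b */

static const int delta[NS][NC] = {    /* transition table    */
    {1, 2}, {1, 3}, {1, 2}, {1, 4}, {1, 2}
};
static const int is_final[NS] = { 0, 0, 0, 0, 1 };

int main(void)
{
    int cls[NS], nxt[NS];

    for (int s = 0; s < NS; s++)      /* P0: final vs non-final */
        cls[s] = is_final[s];

    for (int changed = 1, k = 1; changed; k++) {
        /* Two states stay together iff they are in the same set
           of P(k-1) and reach the same sets on every symbol.   */
        for (int s = 0; s < NS; s++) {
            nxt[s] = -1;
            for (int t = 0; t < s && nxt[s] < 0; t++) {
                int same = cls[s] == cls[t];
                for (int c = 0; c < NC && same; c++)
                    same = cls[delta[s][c]] == cls[delta[t][c]];
                if (same) nxt[s] = nxt[t];
            }
            if (nxt[s] < 0) {         /* open a new set of Pk */
                int m = 0;
                for (int t = 0; t < s; t++)
                    if (nxt[t] >= m) m = nxt[t] + 1;
                nxt[s] = m;
            }
        }
        changed = 0;
        printf("P%d:", k);
        for (int s = 0; s < NS; s++) {
            if (nxt[s] != cls[s]) changed = 1;
            cls[s] = nxt[s];
            printf(" q%d->set%d", s, cls[s]);
        }
        printf("\n");                 /* stops when Pk = P(k-1) */
    }
    return 0;
}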



PRACTICE PROBLEMS BASED ON
MINIMIZATION OF DFA
Problem-01:
Minimize the given DFA-



PRACTICE PROBLEMS BASED ON
MINIMIZATION OF DFA
Solution-
Step-01:
The given DFA contains no dead states and inaccessible states.

Step-02:
Draw a state transition table-
State |  a  |  b
→q0   |  q1 |  q2
q1    |  q1 |  q3
q2    |  q1 |  q2
q3    |  q1 | *q4
*q4   |  q1 |  q2
PRACTICE PROBLEMS BASED ON
MINIMIZATION OF DFA
Step-03:
Now using Equivalence Theorem, we have-
P0 = { q0 , q1 , q2 , q3 } { q4 }
P1 = { q0 , q1 , q2 } { q3 } { q4 }
P2 = { q0 , q2 } { q1 } { q3 } { q4 }
P3 = { q0 , q2 } { q1 } { q3 } { q4 }

Since P3 = P2, we stop.

From P3, we infer that states q0 and q2 are equivalent and can be
merged together.
So, our minimal DFA is-


PRACTICE PROBLEMS BASED ON
MINIMIZATION OF DFA
Problem-02:
Minimize the given DFA-



PRACTICE PROBLEMS BASED ON
MINIMIZATION OF DFA
Solution-
The minimal DFA is-



Context free grammar
• Context free grammar is a formal grammar which is
used to generate all possible strings in a given formal
language.
• Context free grammar G can be defined by four tuples
as:
G = (V, T, P, S)
Where,
• G describes the grammar
• V describes a finite set of non-terminal symbols
• T describes a finite set of terminal symbols
• P describes a set of production rules
• S is the start symbol.
Context free grammar
• In CFG, the start symbol is used to derive the string. You can
derive the string by repeatedly replacing a non-terminal by
the right-hand side of a production, until all non-terminals
have been replaced by terminal symbols.
• Production rules:
S → aSa
S → bSb
S→c
• Now check that the string abbcbba can be derived from the
given CFG:
S ⇒ aSa
  ⇒ abSba
  ⇒ abbSbba
  ⇒ abbcbba
Syntax Analyzer (Parser)
• Syntax analyzer creates the syntactic structure of the
given source program. This syntactic structure is mostly a
parse tree.
• The syntax of programming is described by Context-Free
Grammar (CFG).
• BNF (Backus-Naur Form) notation is used in the description of
CFGs.
• The syntax analyzer (parser) checks whether a given
source program satisfies the rules implied by a context-
free grammar or not.
• If it satisfies, the parser creates the parse tree of that
program. Otherwise, the parser gives error messages.
Syntax Analyzer (Parser)
What syntax analysis cannot do!
• To check whether variables are of types on which
operations are allowed
• To check whether a variable has been declared
before use
• To check whether a variable has been initialized
• These issues will be handled in semantic analysis



Backus-Naur form
• In 1960, John Backus and Peter Naur introduced a
formal method for describing the syntax of
programming languages, which is known as Backus-
Naur Form or simply BNF.
• BNF was basically designed for ALGOL 60.
• BNF and context-free grammars are nearly
identical.
• BNF is a metalanguage for primary languages.
• A metalanguage is a language that is used to
describe another language.


Backus-Naur form
BNF: Symbols used to describe a syntax
• ::= means 'is defined as'
• <> means 'can be described as'
• | means 'or'
• Terminals: symbols that can appear in the output of
a language because of its rules themselves.
• Non-terminals: syntactic entities that define a
part of the grammar.
• Non-terminals can be replaced; they are strings
composed of terminals and non-terminals.




Backus-Naur form
Example
if(condition)
{
body
}
else
{
body
}
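A BNF description of this construct might look like the following
(an illustrative sketch, not necessarily the slide's exact rules):

<if-statement> ::= if ( <condition> ) { <body> }
                 | if ( <condition> ) { <body> } else { <body> }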





Context Free Grammar (CFG) or BNF
• For specifying syntax, a widely used notation is
presented, called context-free grammars or BNF
(for Backus-Naur Form).

• A context-free grammar has four components:
– A set of terminal symbols, sometimes referred to as
tokens. The terminals are the elementary symbols of
the language defined by the grammar.
– A set of nonterminals, sometimes called syntactic
variables. Each non-terminal represents a set of
strings of terminals.
Context Free Grammar (CFG) or BNF
– A set of productions, where each production
consists of a nonterminal, called the head or left
side of the production, an arrow, and a sequence
of terminals and/or nonterminals, called the body or
right side of the production.
– A designation of one of the nonterminals as the
start symbol.



Capabilities of CFG
• Give precise syntactic specification of programming
languages.

• A parser can be constructed automatically from a CFG.

• The syntax entities specified in a CFG can be used for
translating into object code.

• Useful for describing nested structures such as
balanced parentheses, matching begin-end’s,
corresponding if-then-else, etc.
Derivations
• A grammar derives strings by beginning with the start
symbol and repeatedly replacing a nonterminal by the
body of a production for that nonterminal.

• The terminal strings that can be derived from the start
symbol form the language defined by the grammar.

• Parsing is the problem of taking a string of terminals
and figuring out how to derive it from the start symbol
of the grammar, and if it cannot be derived from the
start symbol of the grammar, then reporting syntax
errors within the string.


Parse Trees
• A parse tree pictorially shows how the start
symbol of a grammar derives a string in the
language.

• If nonterminal A has a production A→XYZ,
then a parse tree may have an interior node
labeled A with three children labeled X, Y and
Z, from left to right.


Parse Trees
• Formally, given a context-free grammar, a parse
tree according to the grammar is a tree with the
following properties:
– The root is labeled by the start symbol.
– Each leaf is labeled by a terminal or by Є.
– Each interior node is labeled by a nonterminal.
– If A is the nonterminal labeling some interior node
and X1, X2, ….., Xn are the labels of the children of that
node from left to right, then there must be a
production A→X1X2….Xn.
• Here X1, X2, ….., Xn each stand for a symbol that is either a
terminal or a nonterminal.
• As a special case, if A→Є is a production, then a node
labeled A may have a single child labeled Є.
Parse Trees
• From left to right, the leaves of a parse tree
form the yield of the tree, which is the string
generated or derived from the nonterminal at
the root of the parse tree.



Parse Trees
Example-
Consider the following grammar-
S → aB / bA
A → aS / bAA / a
B → bS / aBB / b
Let us consider a string w = aaabbabbba
Now, let us derive the string w using leftmost
derivation.



Parse Trees
Leftmost Derivation-
S → aB
→ aaBB (Using B → aBB)
→ aaaBBB (Using B → aBB)
→ aaabBB (Using B → b)
→ aaabbB (Using B → b)
→ aaabbaBB (Using B → aBB)
→ aaabbabB (Using B → b)
→ aaabbabbS (Using B → bS)
→ aaabbabbbA (Using S → bA)
→ aaabbabbba (Using A → a)
Parse Trees

Fig.: Parse tree for w = aaabbabbba


Parse Trees
Problem-
Consider the following grammar-
S → A1B
A → 0A / ∈
B → 0B / 1B / ∈
For the string w = 00101,
find-
Leftmost derivation
Rightmost derivation
Parse Tree
Parse Trees
Solution-
1. Leftmost Derivation-
S → A1B
→ 0A1B (Using A → 0A)
→ 00A1B (Using A → 0A)
→ 001B (Using A → ∈)
→ 0010B (Using B → 0B)
→ 00101B (Using B → 1B)
→ 00101 (Using B → ∈)
Parse Trees
2. Rightmost Derivation-
S → A1B
→ A10B (Using B → 0B)
→ A101B (Using B → 1B)
→ A101 (Using B → ∈)
→ 0A101 (Using A → 0A)
→ 00A101 (Using A → 0A)
→ 00101 (Using A → ∈)



Parse Trees
3. Parse Tree-



Ambiguity
• A grammar can have more than one parse tree
generating a given string of terminals. Such a grammar
is said to be ambiguous.

• To show that a grammar is ambiguous, all we need to
do is find a terminal string that is the yield of more
than one parse tree.

• Since a string with more than one parse tree usually
has more than one meaning, we need to design
unambiguous grammars for compiling applications, or
to use ambiguous grammars with additional rules to
resolve the ambiguities.
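A standard illustration (not from these slides): the grammar
E → E + E | E * E | id
is ambiguous, because the string id + id * id has two parse trees,
one grouping the operands as id + (id * id) and the other as
(id + id) * id.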
Left Factoring
• Grammar With Common Prefixes-
• If RHS of more than one production starts with the
same symbol, then such a grammar is called as
Grammar With Common Prefixes.

• This kind of grammar creates a problematic
situation for top-down parsers.
• Top-down parsers cannot decide which production
must be chosen to parse the string in hand.
• To remove this confusion, we use left factoring.


Left Factoring
• Left factoring is a process which isolates the common
parts of two productions into a single production.

• Any production of the form
A→αβ1| αβ2|……| αβn
can be replaced by
A→αA’
A’→β1| β2|……| βn

• Left factoring is useful for producing a grammar
suitable for a predictive parser.


Left Factoring

Example:

Consider the following grammar
A→aAB|aA|a
B→bB|b
Apply left factoring to it.


Left Factoring
Solution:
A→aAB|aA|a
It can be replaced by
A→aA’
A’→AB|A|Є
Similarly, B→bB|b can be replaced by
B→bB’
B’→B|Є
By left factoring the grammar will be
A→aA’
A’→AB|A|Є
B→bB’
B’→B|Є
Left Factoring
Solution:
A’→AB|A|Є
It can be replaced by
A'→AC|Є
C→B|Є
By left factoring the grammar will be
A→aA’
A'→AC|Є
C→B|Є
B→bB’
B’→B|Є



Left Factoring

Example:

Consider the following grammar

S → a / ab / abc / abcd

Apply left factoring to it.


Left Factoring
Solution:
S → aS’
S’ → b / bc / bcd / ∈
Again, this is a grammar with common prefixes. So now,
S → aS’
S’ → bA / ∈
A → c / cd / ∈
Again, this is a grammar with common prefixes. So now,
S → aS’
S’ → bA / ∈
A → cB / ∈
B→d/∈
This is a left factored grammar.
Left Recursion
• The production is left-recursive if the leftmost
symbol on the right side is same as the non-
terminal on the left side.

• For example,
A→Aα
Here A can be any non-terminal and α denotes
some input string.

• Because of left recursion, the top-down parser
can enter an infinite loop.
Left Recursion
• To eliminate left recursion we need to modify the
grammar.

• Let G be a CFG having a production rule with left
recursion
A→Aα|β
Then we eliminate left recursion by rewriting the
production rule as
A→βA’
A’→αA’
A’→Є


Left Recursion
• General algorithm for removing left recursion
A→Aα1|Aα2|……|Aαm|β1|β2|……|βn
where no βi begins with an A.

We can write the above productions as
A→β1A’|β2A’|…….|βnA’
A’→α1A’|α2A’|……|αmA’|Є


Left Recursion
Example:

Consider the following CFG
S→Bb|a
B→Bc|Sd|e
Find the left recursion and remove it.


Left Recursion
Solution:
We can write the given productions as
S→Bb|a
B→Bc|Bbd|ad|e
We will remove left recursion as follows:
B→adB’|eB’
B’→cB’|bdB’|Є



Left Recursion
Example:
Consider the following grammar for arithmetic
expressions
E→E+T|T
T →T*F|F
F →(E)|id
Remove the left recursion.



Left Recursion
Solution:
Eliminating the immediate left recursion to the
production for E and then for T, we obtain
E →TE’
E’ →+TE’|Є
T →FT’
T’ →*FT’|Є
F →(E)|id



The Role of Lexical Analyzer
• As the first phase of a compiler, the main task
of the lexical analyzer is to read the input
characters of the source program, group them
into lexemes, and produce as output a
sequence of tokens for each lexeme in the
source program.

• The stream of tokens is sent to the parser for
syntax analysis.


The Role of Lexical Analyzer
• It is common for the lexical analyzer to interact
with the symbol table as well. When the lexical
analyzer discovers a lexeme constituting an
identifier, it needs to enter that lexeme into the
symbol table.

• In some cases, information regarding the kind of
identifier may be read from the symbol table by
the lexical analyzer to assist it in determining the
proper token it must pass to the parser.


The Role of Lexical Analyzer
• Since the lexical analyzer is the part of the
compiler that reads the source text, it may
perform certain other tasks besides
identification of lexemes.

• One such task is stripping out comments and
whitespace (blank, newline, tab and perhaps
other characters that are used to separate
tokens in the input).


The Role of Lexical Analyzer
• Another task is correlating error messages
generated by the compiler with the source
program.
– For instance, the lexical analyzer may keep track of the
number of newline characters seen, so it can associate
a line number with each error message.

• If the source program uses a macro-processor,
the expansion of macros may also be performed
by the lexical analyzer.


Lexical Analysis versus Parsing
• There are a number of reasons why the analysis
portion of a compiler is normally separated into lexical
analysis and parsing (syntax analysis) phases.
– Simplicity of design is the most important consideration.
The separation of lexical and syntactic analysis often allows
us to simplify at least one of these tasks.
• For ex., a parser that had to deal with comments and whitespace
as syntactic units would be considerably more complex than one
that can assume comments and whitespace have already been
removed by the lexical analyzer.
• If we are designing a new language, separating lexical and
syntactic concerns can lead to a cleaner overall language design.



Lexical Analysis versus Parsing
– Compiler efficiency is improved.
• A separate lexical analyzer allows us to apply
specialized techniques that serve only the lexical task,
not the job of parsing.
• In addition, specialized buffering techniques for reading
input characters can speed up the compiler
significantly.

– Compiler portability is enhanced.


• Input-device-specific peculiarities can be restricted to
the lexical analyzer.



Tokens, Patterns, and Lexemes
• Tokens:
– A token is a pair consisting of a token name and an
optional attribute value.
– The token name is an abstract symbol
representing a kind of lexical unit.
• Ex., a particular keyword, or a sequence of input
characters denoting an identifier.
– The token names are the input symbols that the
parser processes.
– We will often refer to a token by its token name.



Tokens, Patterns, and Lexemes
• Patterns:
– A pattern is a description of the form that the
lexemes of a token may take.
– In the case of a keyword as a token, the pattern is
just the sequence of characters that form the
keyword.
– For identifiers and some other tokens, the pattern
is a more complex structure that is matched by
many strings.



Tokens, Patterns, and Lexemes
• Lexemes:
– A lexeme is a sequence of characters in source
program that matches the pattern for a token and
is identified by the lexical analyzer as an instance
of that token.



Tokens, Patterns, and Lexemes
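For example (typical rows, following the standard textbook table;
the original slide's figure is not reproduced here):

Token        Informal description of pattern         Sample lexemes
if           the characters i, f                     if
comparison   <, >, <=, >=, ==, or !=                 <=, !=
id           letter followed by letters and digits   pi, score, D2
number       any numeric constant                    3.14159, 0, 6.02e23
literal      anything but ", surrounded by "s        "core dumped"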



Tokens, Patterns, and Lexemes
For example, consider the program:
int main()
{
// variable declaration
int a, b;
a = 10;
return 0;
}
All the valid tokens are:
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';' 'a' '=' '10' ';' 'return' '0' ';' '}'



Tokens, Patterns, and Lexemes
Exercise 1:
Count number of tokens :
int main()
{
int a = 10, b = 20;
printf("sum is :%d",a+b);
return 0;
}

Answer: Total number of token: 27



Tokens, Patterns, and Lexemes
Exercise 2:
// comment
printf("string %d ",++i++&&&i***a);
return(x?y:z)

printf ( "string %d " , ++ i ++ && & i * * * a ) ; return ( x


? y : z )

Answer: Total number of token: 24


: Distinct number of tokens will be = 18

Compiler Design, KCS-502 131


The Lexical Analyzer Generator

LEX
LEX, or in a more recent implementation FLEX, is a tool
that allows one to specify a lexical analyzer by
specifying regular expressions to describe patterns for
tokens.

• The input notation for the LEX tool is referred to as the
LEX language and the tool itself is the LEX compiler.

• The LEX compiler transforms the input patterns into a
transition diagram and generates code, in a file called
lex.yy.c, that simulates this transition diagram.
An Overview of Lex

Lex source program lex.l → Lex compiler → lex.yy.c

lex.yy.c → C compiler → a.out

input stream → a.out → sequence of tokens


An Overview of Lex
• An input file, which we call lex.l, is written in the LEX
language and describes the lexical analyzer to be
generated.

• The LEX compiler transforms lex.l to a C program, in a file
that is always named lex.yy.c.

• The latter file is compiled by the C compiler into a file called
a.out, as always.

• The C-compiler output is a working lexical analyzer that can
take a stream of input characters and produce a stream of
tokens.


Use of LEX
• The normal use of the compiled C program, referred to
as a.out is as a subroutine of the parser.
– It is a C function that returns an integer, which is a code for
one of the possible token names.

• The attribute value, whether it be another numeric
code, a pointer to the symbol table, or nothing, is
placed in a global variable yylval, which is shared
between the lexical analyzer and parser, thereby
making it simple to return both the name and an
attribute value of a token.


Structure of LEX programs
• Lex source is separated into three sections by %%
delimiters.
• The general format of Lex source is
{definitions}
%%
{transition rules}
%%
{user subroutines}
where the first %% is required and the second %%, together
with the user subroutines section, is optional.
• The absolute minimum Lex program is thus
%%
Structure of LEX programs
Example:

digit    [0-9]
letter   [a-zA-Z]
%%
{letter}({letter}|{digit})*   printf("id: %s\n", yytext);
\n                            printf("new line\n");
%%
main()
{
    yylex();
}
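Assuming the program above is saved as id.l, it can be built and run
with the usual commands (flex shown here; with the original AT&T lex,
link with -ll instead of -lfl):

flex id.l
cc lex.yy.c -lfl
./a.out < input.txt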



Structure of LEX programs
• The declarations section includes declaration of variables,
manifest constants (identifiers declared to stand for a
constant, ex., the name of a token), and regular definitions.

• The transition rules each have the form
Pattern { Action }

• Each pattern is a regular expression, which may use the
regular definitions of the declarations section.

• The actions are fragments of code, typically written in C,
although many variants of LEX using other languages have
been created.


Structure of LEX programs
• The third section holds whatever additional
functions are used in the actions.
Alternatively, these functions can be compiled
separately and loaded with the lexical
analyzer.



YACC
• YACC stands for Yet Another Compiler
Compiler, reflecting the popularity of parser
generators in the early 1970s when the first
version of YACC was created by S.C. Johnson.

• YACC is available as a command on the UNIX
system, and has been used to help implement
many production compilers.


Introduction

• What is YACC ?
– Tool which will produce a parser for a given
grammar.
– YACC (Yet Another Compiler Compiler) is a
program designed to compile a LALR(1)
grammar and to produce the source code of
the syntactic analyzer of the language
produced by this grammar.



How YACC Works
(1) Parser generation time:
YACC source (*.y) → yacc → y.tab.c (plus y.tab.h and y.output)

(2) Compile time:
y.tab.c → C compiler/linker → a.out

(3) Run time:
Token stream → a.out → Abstract Syntax Tree
YACC File Format
%{
C declarations
%}
yacc declarations
%%
Grammar rules
%%
Additional C code
– Comments enclosed in /* ... */ may appear in any
of the sections.



A YACC File Example
%{
#include <stdio.h>
%}

%token NAME NUMBER

%%

statement: NAME '=' expression
         | expression              { printf("= %d\n", $1); }
         ;

expression: expression '+' NUMBER  { $$ = $1 + $3; }
          | expression '-' NUMBER  { $$ = $1 - $3; }
          | NUMBER                 { $$ = $1; }
          ;
%%
int yyerror(char *s)
{
    fprintf(stderr, "%s\n", s);
    return 0;
}

int main(void)
{
    yyparse();
    return 0;
}


The Parser Generator YACC
• First, a file, say translate.y, containing a YACC
specification of the translator is prepared.

• The UNIX system command yacc translate.y translates
the file translate.y into a C program called y.tab.c using
the LALR method.

• By compiling y.tab.c along with the ly library that
contains the LR parsing program, using the command
cc y.tab.c -ly, we obtain the desired object program
a.out that performs the translation specified by the
original YACC program.
