
Introduction to Compiler Design

The compiler is software that converts a program written in a high-level language (the source language) into a low-level language (object/target/machine language, i.e., 0s and 1s).

A translator or language processor is a program that translates an input program written in a programming language into an equivalent program in another language. The compiler is a type of translator, which takes a program written in a high-level programming language as input and translates it into an equivalent program in low-level languages such as machine language or assembly language.

The program written in a high-level language is known as a source program, and the program
converted into a low-level language is known as an object (or target) program. Without compilation,
no program written in a high-level language can be executed. For every programming language, we
have a different compiler; however, the basic tasks performed by every compiler are the same. The
process of translating the source code into machine code involves several stages, including lexical
analysis, syntax analysis, semantic analysis, code generation, and optimization.

Compared to an assembler, a compiler is an intelligent program: it verifies limits, ranges, errors, and so on. A compiler takes more time to run and occupies a large amount of memory, and it is slower than other system software because it goes through the entire program and then translates the whole program. When a compiler runs on a machine and produces machine code for the same machine on which it is running, it is called a self compiler or resident compiler. When a compiler runs on one machine but produces machine code for another computer, it is called a cross compiler.

High-Level Programming Language

A high-level programming language is a language that provides an abstraction of the attributes of the computer. High-level programming is more convenient for the user when writing a program.

Low-Level Programming Language

A low-level programming language is a language that provides little or no abstraction from the machine: it works directly with the computer's instructions, registers, and memory.

Stages of Compiler Design

1. Lexical Analysis: The first stage of compiler design is lexical analysis, also known as scanning.
In this stage, the compiler reads the source code character by character and breaks it down
into a series of tokens, such as keywords, identifiers, and operators. These tokens are then
passed on to the next stage of the compilation process.

2. Syntax Analysis: The second stage of compiler design is syntax analysis, also known as
parsing. In this stage, the compiler checks the syntax of the source code to ensure that it
conforms to the rules of the programming language. The compiler builds a parse tree, which
is a hierarchical representation of the program’s structure, and uses it to check for syntax
errors.

3. Semantic Analysis: The third stage of compiler design is semantic analysis. In this stage, the
compiler checks the meaning of the source code to ensure that it makes sense. The compiler
performs type checking, which ensures that variables are used correctly and that operations
are performed on compatible data types. The compiler also checks for other semantic errors,
such as undeclared variables and incorrect function calls.

4. Code Generation: The fourth stage of compiler design is code generation. In this stage, the
compiler translates the parse tree into machine code that can be executed by the computer.
The code generated by the compiler must be efficient and optimized for the target platform.

5. Optimization: The final stage of compiler design is optimization. In this stage, the compiler
analyzes the generated code and makes optimizations to improve its performance. The
compiler may perform optimizations such as constant folding, loop unrolling, and function
inlining.

Overall, compiler design is a complex process that involves multiple stages and requires a deep
understanding of both the programming language and the target platform. A well-designed compiler
can greatly improve the efficiency and performance of software programs, making them more useful
and valuable for users.

Compiler

 Cross Compiler – a compiler that runs on a machine ‘A’ and produces code for another machine ‘B’. It is capable of creating code for a platform other than the one on which the compiler is running.

 Source-to-source Compiler or transcompiler or transpiler – a compiler that translates source code written in one programming language into the source code of another programming language.

Language Processing Systems

We know a computer is a logical assembly of software and hardware. The hardware understands a language that is hard for us to grasp; consequently, we tend to write programs in a high-level language, which is much less complicated for us to comprehend and maintain in our thoughts. Now, these programs go through a series of transformations so that they can readily be used by machines. This is where language processing systems come in handy.
High-Level Language to Machine Code

 High-Level Language: A program that contains pre-processor directives such as #include or #define is written in a high-level language (HLL). Such languages are closer to humans but far from machines. The (#) tags are called pre-processor directives; they direct the pre-processor about what to do.

 Pre-Processor: The pre-processor resolves all the #include directives by inserting the included files (file inclusion) and expands all the #define directives using macro expansion. It performs file inclusion, augmentation, macro-processing, etc.

 Assembly Language: It’s neither in binary form nor high level. It is an intermediate state that
is a combination of machine instructions and some other useful data needed for execution.

 Assembler: For every platform (Hardware + OS) we will have an assembler. They are not universal, since for each platform we have one. The output of the assembler is called an object file. It translates assembly language into machine code.
 Interpreter: An interpreter converts high-level language into low-level machine language, just like a compiler, but the two differ in the way they read the input. A compiler reads the input in one go, scans the entire program, and translates it as a whole into machine code, whereas an interpreter translates and executes the program one statement at a time. Interpreted programs are therefore usually slower than compiled ones. For example, if the source program contains #include "stdio.h", the pre-processor replaces this directive with the contents of that file in the produced output.

 Linker: The basic work of a linker is to merge the object codes (which have not yet been linked) produced by the compiler and assembler with standard library functions and operating system resources. The code generated by the compiler and assembler is generally relocatable in nature, which means its starting location is not determined; it can be placed anywhere in computer memory. The basic task of the loader is therefore to find/calculate the exact addresses of these memory locations.

 Relocatable Machine Code: It can be loaded at any point in memory and run. The addresses within the program are arranged in such a way that the code can be moved without breaking.

 Loader/Linker: Loader/Linker converts the relocatable code into absolute code and tries to
run the program resulting in a running program or an error message (or sometimes both can
happen). Linker loads a variety of object files into a single file to make it executable. Then
loader loads it in memory and executes it.

Types of Compiler

There are mainly three types of compilers.

 Single Pass Compilers

 Two Pass Compilers

 Multipass Compilers

Single Pass Compiler

When all the phases of the compiler are present inside a single module, it is simply called a single-
pass compiler. It performs the work of converting source code to machine code.

Two Pass Compiler

A two-pass compiler is a compiler in which the program is translated in two passes: the front end translates the source into an intermediate representation, and the back end translates that representation into the target code.

Multipass Compiler

When several intermediate codes are created for a program and the syntax tree is processed many times, the compiler is called a multi-pass compiler. It breaks the compilation into smaller passes over the program.

Phases of a Compiler
There are two major phases of compilation, which in turn have many parts. Each of them takes input
from the output of the previous level and works in a coordinated way.

Phases of Compiler

Analysis Phase

An intermediate representation is created from the given source code :

 Lexical Analyzer

 Syntax Analyzer
 Semantic Analyzer

 Intermediate Code Generator

The lexical analyzer divides the program into “tokens”, the Syntax analyzer recognizes “sentences” in
the program using the syntax of the language and the Semantic analyzer checks the static semantics
of each construct. Intermediate Code Generator generates “abstract” code.

Synthesis Phase

An equivalent target program is created from the intermediate representation. It has two parts :

 Code Optimizer

 Code Generator

Code Optimizer optimizes the abstract code, and the final Code Generator translates abstract
intermediate code into specific machine instructions.

Operations of Compiler

These are some operations that are done by the compiler.

 It breaks source programs into smaller parts.

 It enables the creation of symbol tables and intermediate representations.

 It helps in code compilation and error detection.

 It saves all codes and variables.

 It analyses the full program and translates it.

 It converts source code to machine code.

Advantages of Compiler Design

1. Efficiency: Compiled programs are generally more efficient than interpreted programs
because the machine code produced by the compiler is optimized for the specific hardware
platform on which it will run.

2. Portability: Once a program is compiled, the resulting machine code can be run on any
computer or device that has the appropriate hardware and operating system, making it
highly portable.

3. Error Checking: Compilers perform comprehensive error checking during the compilation
process, which can help catch syntax, semantic, and logical errors in the code before it is
run.

4. Optimizations: Compilers can make various optimizations to the generated machine code,
such as eliminating redundant instructions or rearranging code for better performance.

Disadvantages of Compiler Design

1. Longer Development Time: Developing a compiler is a complex and time-consuming process that requires a deep understanding of both the programming language and the target hardware platform.
2. Debugging Difficulties: Debugging compiled code can be more difficult than debugging
interpreted code because the generated machine code may not be easy to read or
understand.

3. Lack of Interactivity: Compiled programs are typically less interactive than interpreted
programs because they must be compiled before they can be run, which can slow down the
development and testing process.

4. Platform-Specific Code: If the compiler is designed to generate machine code for a specific
hardware platform, the resulting code may not be portable to other platforms.

Why are compilers important?


Compilers are important to us for the following reasons:

1) Enabling the concept of high-level languages

They enable developers to write code in high-level programming languages, which are easier to
understand and more human-readable than machine code.

2) One-time effort (Efficiency)

Compilers convert high-level code into machine code. This machine code is stored in an executable file (for example, one with a .exe extension) which is directly executed by the computer. So the programmer needs only a one-time effort to compile the code into machine code; after that, the program can be run whenever needed from the executable file. This process allows programs to run much faster and more efficiently than if they were interpreted at runtime.

3) Portability

Compilers translate source code written in a high-level programming language into machine code. This enables portability: because the high-level source is platform independent, the same program can be compiled to run on many different operating systems and hardware architectures (the generated machine code itself is specific to each target).

4) Security

A compiler can provide the programmer with security by preventing memory-related errors, such as buffer overflows, by analyzing and optimizing the code. It can also generate warnings or errors if it detects potential memory issues.

There are different Compilers :

 Cross-Compiler – The compiled program can run on a computer whose CPU or Operating
System is different from the one on which the compiler runs.

 Bootstrap Compiler – The compiler written in the language that it intends to compile.

 Decompiler – The compiler that translates from a low-level language to a higher level one.

 Transcompiler – The compiler that translates source code written in one high-level language into another high-level language.


A compiler can translate only those source programs which have been written in the language for which the compiler is meant. Each high-level programming language requires a separate compiler for the conversion.

For example, a FORTRAN compiler is capable of translating only FORTRAN programs. A computer system may have more than one compiler to work with more than one high-level language.

Popular compilers and development tools, by programming language –

 C– Turbo C, Tiny C Compiler, GCC, Clang, Portable C Compiler

 C++ -GCC, Clang, Dev C++, Intel C++, Code Block

 JAVA– IntelliJ IDEA, Eclipse IDE, NetBeans, BlueJ, JDeveloper

 Kotlin– IntelliJ IDEA, Eclipse IDE

 Python– CPython, JPython, Wing, Spyder

 JavaScript– WebStorm, Atom IDE, Visual Studio Code, Komodo Edit

Explore the Phases of Compiler


The 6 phases of a compiler are:

1. Lexical Analysis
2. Syntactic Analysis or Parsing
3. Semantic Analysis
4. Intermediate Code Generation
5. Code Optimization
6. Code Generation

1. Lexical Analysis: Lexical analysis, carried out by the lexical analyzer, is the initial stage or phase of the compiler. This phase scans the source code and transforms the input program into a series of tokens.

A token is basically the arrangement of characters that defines a unit of information in the source code.

NOTE: In computer science, a program that executes the process of lexical analysis
is called a scanner, tokenizer, or lexer.

Roles and Responsibilities of Lexical Analyzer

 It is responsible for removing the comments and white spaces from the source program.
 It helps in identifying the tokens.
 Categorization of lexical units.

2. Syntax Analysis:

In the compilation procedure, syntax analysis is the second stage. Here the provided input is checked to validate its structure against the grammar of the language. Basically, in the second phase, the compiler analyses the syntactic structure and checks whether the given input is correct or not in terms of programming syntax.

It accepts tokens as input and provides a parse tree as output. It is also known as
parsing in a compiler.
Roles and Responsibilities of Syntax Analyzer

 Note syntax errors.
 Helps in building a parse tree.
 Acquire tokens from the lexical analyzer.
 Scan for syntax errors, if any.

3. Semantic Analysis: In the process of compilation, semantic analysis is the third phase. It checks whether the parse tree follows the rules of the language. It also helps in keeping track of identifiers and expressions. In simple words, we can say that the semantic analyzer establishes the validity of the parse tree, and an annotated syntax tree comes out as the output.

Roles and Responsibilities of Semantic Analyzer:

 Saving collected data to symbol tables or syntax trees.
 It notifies semantic errors.
 Scanning for semantic errors.

4. Intermediate Code Generation: Once the parse tree is semantically confirmed, an intermediate code generator develops three-address code. A middle-level code generated by a compiler during the translation of a source program into object code is known as intermediate code or text.

Few Important Pointers:

 A code that is neither high-level nor machine code, but a middle-level code, is an intermediate code.
 We can translate this code to machine code later.
 This stage serves as a bridge or way from analysis to synthesis.

Roles and Responsibilities:

 Helps in maintaining the evaluation (priority) order of the source language.
 Holds the operands of instructions explicitly (as in three-address code).
 Produces code that the later phases can translate into machine code.

5. Code Optimizer: Now we come to a phase that is entirely optional: code optimization. It is used to enhance the intermediate code so that the output of the program runs faster and consumes less space. To improve the speed of the program, it eliminates unnecessary lines of code and reorganises the sequence of statements.

Roles and Responsibilities:

 Remove the unused variables and unreachable code.
 Enhance runtime and execution of the program.
 Produce streamlined code from the intermediate expression.

6. Code Generator: The final stage of the compilation process is code generation. In this phase, the code generator takes the fully optimised intermediate code as input and maps it to the target machine language; that is, it translates the intermediate code into machine code.

Roles and Responsibilities:

 Translate the intermediate code to target machine code.
 Select and allocate memory locations and registers.

Phases of a Compiler
We basically have two phases of compilers, namely the Analysis phase and Synthesis phase. The
analysis phase creates an intermediate representation from the given source code. The synthesis
phase creates an equivalent target program from the intermediate representation.

A compiler is a software program that converts the high-level source code written in a programming
language into low-level machine code that can be executed by the computer hardware. The process
of converting the source code into machine code involves several phases or stages, which are
collectively known as the phases of a compiler. The typical phases of a compiler are:

1. Lexical Analysis: The first phase of a compiler is lexical analysis, also known as scanning. This
phase reads the source code and breaks it into a stream of tokens, which are the basic units
of the programming language. The tokens are then passed on to the next phase for further
processing.

2. Syntax Analysis: The second phase of a compiler is syntax analysis, also known as parsing.
This phase takes the stream of tokens generated by the lexical analysis phase and checks
whether they conform to the grammar of the programming language. The output of this
phase is usually an Abstract Syntax Tree (AST).

3. Semantic Analysis: The third phase of a compiler is semantic analysis. This phase checks
whether the code is semantically correct, i.e., whether it conforms to the language’s type
system and other semantic rules. In this stage, the compiler checks the meaning of the
source code to ensure that it makes sense. The compiler performs type checking, which
ensures that variables are used correctly and that operations are performed on compatible
data types. The compiler also checks for other semantic errors, such as undeclared variables
and incorrect function calls.
4. Intermediate Code Generation: The fourth phase of a compiler is intermediate code
generation. This phase generates an intermediate representation of the source code that
can be easily translated into machine code.

5. Optimization: The fifth phase of a compiler is optimization. This phase applies various
optimization techniques to the intermediate code to improve the performance of the
generated machine code.

6. Code Generation: The final phase of a compiler is code generation. This phase takes the
optimized intermediate code and generates the actual machine code that can be executed
by the target hardware.

In summary, the phases of a compiler are: lexical analysis, syntax analysis, semantic analysis,
intermediate code generation, optimization, and code generation.
Symbol Table – It is a data structure created and maintained by the compiler, consisting of all the identifiers’ names along with their types. It helps the compiler function smoothly by letting it find identifiers quickly.
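As a rough, hypothetical sketch (not tied to any particular compiler), a symbol table can be modelled as a dictionary from identifier names to their recorded attributes; the field names below are invented for the example.

# A minimal symbol-table sketch: maps identifier names to their attributes.
# The attribute fields (type, scope) are illustrative, not from any real compiler.
class SymbolTable:
    def __init__(self):
        self.entries = {}            # name -> {"type": ..., "scope": ...}

    def insert(self, name, type_, scope="global"):
        self.entries[name] = {"type": type_, "scope": scope}

    def lookup(self, name):
        # Fast lookup is the whole point: the parser and semantic analyzer
        # repeatedly ask "what do we know about this identifier?"
        return self.entries.get(name)

table = SymbolTable()
table.insert("value", "int")
print(table.lookup("value"))     # {'type': 'int', 'scope': 'global'}
print(table.lookup("missing"))   # None -> could be reported as "undeclared variable"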

The analysis of a source program is divided into mainly three phases. They are:

1. Linear Analysis-
This involves a scanning phase where the stream of characters is read from left to right. It is
then grouped into various tokens having a collective meaning.

2. Hierarchical Analysis-
In this analysis phase, based on a collective meaning, the tokens are categorized
hierarchically into nested groups.

3. Semantic Analysis-
This phase is used to check whether the components of the source program are meaningful
or not.

The compiler has two modules namely the front end and the back end. Front-end constitutes the
Lexical analyzer, semantic analyzer, syntax analyzer, and intermediate code generator. And the rest
are assembled to form the back end.

1. Lexical Analyzer –
It is also called a scanner. It takes the output of the preprocessor (which performs file
inclusion and macro expansion) as the input which is in a pure high-level language. It reads
the characters from the source program and groups them into lexemes (sequence of
characters that “go together”). Each lexeme corresponds to a token. Tokens are defined by
regular expressions which are understood by the lexical analyzer. It also removes lexical
errors (e.g., erroneous characters), comments, and white space.

2. Syntax Analyzer – It is sometimes called a parser. It constructs the parse tree. It takes all the
tokens one by one and uses Context-Free Grammar to construct the parse tree.

Why Grammar?
The rules of programming can be entirely represented in a few productions. Using these productions
we can represent what the program actually is. The input has to be checked whether it is in the
desired format or not.
The parse tree is also called the derivation tree. Parse trees are generally constructed to check for
ambiguity in the given grammar. There are certain rules associated with the derivation tree.

 Any identifier is an expression

 Any number can be called an expression

 Performing any operations in the given expression will always result in an


expression. For example, the sum of two expressions is also an expression.

 The parse tree can be compressed to form a syntax tree

Syntax error can be detected at this level if the input is not in accordance with the grammar.
 Semantic Analyzer – It verifies the parse tree, whether it’s meaningful or not. It furthermore
produces a verified parse tree. It also does type checking, Label checking, and Flow control
checking.

 Intermediate Code Generator – It generates intermediate code, which is a form that can be readily executed by a machine. We have many popular intermediate codes; three-address code is one example. Intermediate code is converted to machine language by the last two phases, which are platform dependent.

Till intermediate code, it is the same for every compiler out there, but after that, it depends on the
platform. To build a new compiler we don’t need to build it from scratch. We can take the
intermediate code from the already existing compiler and build the last two parts.

 Code Optimizer – It transforms the code so that it consumes fewer resources and produces
more speed. The meaning of the code being transformed is not altered. Optimization can be
categorized into two types: machine-dependent and machine-independent.

 Target Code Generator – The main purpose of the target code generator is to produce code that the machine can understand; it also performs register allocation, instruction selection, etc. The output depends on the type of assembler. This is the final stage of compilation. The optimized code is converted into relocatable machine code, which then forms the input to the linker and loader.

All these six phases are associated with the symbol table manager and the error handler.

The advantages of using a compiler to translate high-level programming languages into machine
code are:

1. Portability: Compilers allow programs to be written in a high-level programming language, which can be executed on different hardware platforms without the need for modification. This means that programs can be written once and run on multiple platforms, making them more portable.

2. Optimization: Compilers can apply various optimization techniques to the code, such as loop
unrolling, dead code elimination, and constant propagation, which can significantly improve
the performance of the generated machine code.
3. Error Checking: Compilers perform a thorough check of the source code, which can detect
syntax and semantic errors at compile-time, thereby reducing the likelihood of runtime
errors.

4. Maintainability: Programs written in high-level languages are easier to understand and maintain than programs written in low-level assembly language. Compilers help in translating high-level code into machine code, making programs easier to maintain and modify.

5. Productivity: High-level programming languages and compilers help in increasing the productivity of developers. Developers can write code faster in high-level languages, which can be compiled into efficient machine code.

In summary, compilers provide advantages such as portability, optimization, error checking, maintainability, and productivity.

Compiler architecture and components


A compiler can broadly be divided into two phases based on the way they compile.

Analysis Phase

Known as the front-end of the compiler, the analysis phase of the compiler reads the source program, divides it into core parts, and then checks for lexical, grammar and syntax errors. The analysis phase generates an intermediate representation of the source program and a symbol table, which are fed to the synthesis phase as input.

Synthesis Phase

Known as the back-end of the compiler, the synthesis phase generates the target program with the
help of intermediate source code representation and symbol table.

Lexical Analysis
Lexical analysis is the first phase of a compiler. It takes the modified source code from the language pre-processor, written in the form of sentences. The lexical analyzer breaks this text into a series of tokens, removing any whitespace or comments in the source code.

If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works closely
with the syntax analyzer. It reads character streams from the source code, checks for legal tokens,
and passes the data to the syntax analyzer when it demands.

Tokens
Lexemes are said to be a sequence of characters (alphanumeric) in a token. There are some
predefined rules for every lexeme to be identified as a valid token. These rules are defined by
grammar rules, by means of a pattern. A pattern explains what can be a token, and these patterns
are defined by means of regular expressions.

In a programming language, keywords, constants, identifiers, strings, numbers, operators and punctuation symbols can be considered as tokens.

For example, in C language, the variable declaration line

int value = 100;

contains the tokens:

int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
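To make the token stream concrete, here is a minimal, hypothetical tokenizer sketch in Python that splits the declaration above into exactly those five tokens; the token categories and regular expressions are illustrative assumptions, not the rules of any real C compiler.

import re

# Token categories and the (illustrative) patterns that recognize them.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:int|float|char|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("CONSTANT",   r"\d+"),
    ("OPERATOR",   r"="),
    ("SYMBOL",     r";"),
    ("SKIP",       r"\s+"),          # whitespace is discarded, not tokenized
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("int value = 100;")))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'value'), ('OPERATOR', '='),
#  ('CONSTANT', '100'), ('SYMBOL', ';')]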

Specifications of Tokens

Let us understand how the language theory undertakes the following terms:

Alphabets

Any finite set of symbols {0,1} is a set of binary alphabets, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of
Hexadecimal alphabets, {a-z, A-Z} is a set of English language alphabets.

Strings

Any finite sequence of alphabets (characters) is called a string. The length of a string is the total number of occurrences of alphabets in it; e.g., the length of the string tutorialspoint is 14 and is denoted by |tutorialspoint| = 14. A string having no alphabets, i.e. a string of zero length, is known as an empty string and is denoted by ε (epsilon).

Special symbols

A typical high-level language contains the following symbols:

Arithmetic Symbols: Addition (+), Subtraction (-), Modulo (%), Multiplication (*), Division (/)

Punctuation: Comma (,), Semicolon (;), Dot (.), Arrow (->)

Assignment: =

Special Assignment: +=, /=, *=, -=

Comparison: ==, !=, <, <=, >, >=

Preprocessor: #

Location Specifier: &

Logical: &, &&, |, ||, !

Shift Operators: >>, >>>, <<, <<<

Language
A language is considered as a finite set of strings over some finite set of alphabets. Computer
languages are considered as finite sets, and mathematically set operations can be performed on
them. Finite languages can be described by means of regular expressions.

Regular Expressions

The lexical analyzer needs to scan and identify only a finite set of valid string/token/lexeme that
belong to the language in hand. It searches for the pattern defined by the language rules.

Regular expressions have the capability to express finite languages by defining a pattern for finite
strings of symbols. The grammar defined by regular expressions is known as regular grammar. The
language defined by regular grammar is known as regular language.

Regular expression is an important notation for specifying patterns. Each pattern matches a set of
strings, so regular expressions serve as names for a set of strings. Programming language tokens can
be described by regular languages. The specification of regular expressions is an example of a
recursive definition. Regular languages are easy to understand and have efficient implementation.

There are a number of algebraic laws that are obeyed by regular expressions, which can be used to
manipulate regular expressions into equivalent forms.

Operations
The various operations on languages are:

 Union of two languages L and M is written as

L U M = {s | s is in L or s is in M}

 Concatenation of two languages L and M is written as

LM = {st | s is in L and t is in M}

 The Kleene Closure of a language L is written as

L* = Zero or more occurrence of language L.
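These set operations can be tried out directly on small finite languages; the snippet below is a toy illustration (the Kleene closure is truncated to a bounded number of concatenations, since the full closure is infinite).

# Toy illustration of language operations on finite sets of strings.
L = {"a", "ab"}
M = {"c", "d"}

union = L | M                                    # {s | s is in L or s is in M}
concatenation = {s + t for s in L for t in M}    # {st | s is in L and t is in M}

def kleene_up_to(lang, n):
    """Strings formed by concatenating 0..n members of lang (a truncated L*)."""
    result = {""}                                # epsilon: zero occurrences
    current = {""}
    for _ in range(n):
        current = {s + t for s in current for t in lang}
        result |= current
    return result

print(union)                  # contents: a, ab, c, d (set order may vary)
print(concatenation)          # contents: ac, ad, abc, abd
print(kleene_up_to(L, 2))     # contents: '', a, ab, aa, aab, aba, abab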

Notations
If r and s are regular expressions denoting the languages L(r) and L(s), then

 Union : (r)|(s) is a regular expression denoting L(r) U L(s)

 Concatenation : (r)(s) is a regular expression denoting L(r)L(s)

 Kleene closure : (r)* is a regular expression denoting (L(r))*

 (r) is a regular expression denoting L(r)

Precedence and Associativity

 *, concatenation (.), and | (pipe sign) are left associative

 * has the highest precedence

 Concatenation (.) has the second highest precedence.

 | (pipe sign) has the lowest precedence of all.

Representing valid tokens of a language in regular expression

If x is a regular expression, then:

x* means zero or more occurrence of x.

i.e., it can generate { e, x, xx, xxx, xxxx, … }

x+ means one or more occurrence of x.

i.e., it can generate { x, xx, xxx, xxxx … } or x.x*

x? means at most one occurrence of x

i.e., it can generate either {x} or {e}.

[a-z] is all lower-case alphabets of English language.

[A-Z] is all upper-case alphabets of English language.

[0-9] is all natural digits used in mathematics.

Representation occurrence of symbols using regular expressions

letter = [a – z] or [A – Z]

digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]

sign = [ + | - ]

Representation of language tokens using regular expressions

Decimal = (sign)?(digit)+

Identifier = (letter)(letter | digit)*
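These token patterns translate almost directly into a regular-expression library. The sketch below is an illustrative assumption using Python's re module; the pattern strings mirror the letter, digit, sign, Decimal and Identifier definitions above.

import re

letter = "[A-Za-z]"
digit  = "[0-9]"
sign   = "[+-]"

# Decimal = (sign)?(digit)+   and   Identifier = (letter)(letter | digit)*
decimal_pattern    = f"{sign}?{digit}+"
identifier_pattern = f"{letter}({letter}|{digit})*"

for lexeme in ["-42", "100", "value", "x1", "1x", "+"]:
    if re.fullmatch(decimal_pattern, lexeme):
        kind = "decimal"
    elif re.fullmatch(identifier_pattern, lexeme):
        kind = "identifier"
    else:
        kind = "no match"
    print(lexeme, "->", kind)
# -42 and 100 match Decimal; value and x1 match Identifier; 1x and + match neither.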

The only problem left with the lexical analyzer is how to verify the validity of a regular expression
used in specifying the patterns of keywords of a language. A well-accepted solution is to use finite
automata for verification.
Finite Automata
Finite automata is a state machine that takes a string of symbols as input and changes its state
accordingly. Finite automata is a recognizer for regular expressions. When a regular expression string
is fed into finite automata, it changes its state for each literal. If the input string is successfully
processed and the automata reaches its final state, it is accepted, i.e., the string just fed was said to
be a valid token of the language in hand.

The mathematical model of finite automata consists of:

 Finite set of states (Q)

 Finite set of input symbols (Σ)

 One Start state (q0)

 Set of final states (qf)

 Transition function (δ)

The transition function (δ) maps a pair of a state (from the finite set of states Q) and an input symbol (from Σ) to a state: Q × Σ ➔ Q

Finite Automata Construction

Let L(r) be a regular language recognized by some finite automata (FA).

 States : States of FA are represented by circles. The name of the state is written inside its circle.

 Start state : The state from where the automata starts, is known as start state. Start state
has an arrow pointed towards it.

 Intermediate states : All intermediate states have at least two arrows; one pointing to and another pointing out from them.

 Final state : If the input string is successfully parsed, the automata is expected to be in this state. The final state is represented by double circles.

 Transition : The transition from one state to another state happens when a desired symbol
in the input is found. Upon transition, automata can either move to next state or stay in the
same state. Movement from one state to another is shown as a directed arrow, where the
arrows points to the destination state. If automata stays on the same state, an arrow
pointing from a state to itself is drawn.

Example : We assume an FA that accepts any three-digit binary value ending in digit 1. FA = {Q(q0, qf), Σ(0,1), q0, qf, δ}
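As a hedged sketch of this idea, the dictionary-driven simulator below encodes a small DFA over {0, 1} that accepts exactly the three-digit binary strings ending in 1; the state names are invented for the example.

# A DFA that accepts three-digit binary strings ending in 1 (e.g. "101", "011").
# State names (q0, q1, q2, q3, qf, dead) are invented for this sketch.
DELTA = {
    ("q0", "0"): "q1", ("q0", "1"): "q1",     # first digit read
    ("q1", "0"): "q2", ("q1", "1"): "q2",     # second digit read
    ("q2", "0"): "q3", ("q2", "1"): "qf",     # third digit decides acceptance
    # any further input, or a rejecting third digit, leads to a dead end
}
START, FINAL = "q0", {"qf"}

def accepts(string):
    state = START
    for symbol in string:
        state = DELTA.get((state, symbol), "dead")
    return state in FINAL

for w in ["101", "011", "110", "1", "1011"]:
    print(w, accepts(w))
# 101 True, 011 True, 110 False, 1 False, 1011 False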

Longest Match Rule


When the lexical analyzer reads the source code, it scans the code letter by letter; when it encounters a whitespace, operator symbol, or special symbol, it decides that a word is complete.

For example:

int intvalue;

While scanning the input up to ‘int’, the lexical analyzer cannot determine whether it is the keyword int or the first characters of the identifier intvalue.

The Longest Match Rule states that the lexeme scanned should be determined based on the longest
match among all the tokens available.

The lexical analyzer also follows rule priority, where a reserved word, e.g., a keyword, of the language is given priority over user-defined names. That is, if the lexical analyzer finds a lexeme that matches an existing reserved word, it tokenizes it as that reserved word rather than as an identifier.
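A small, hypothetical sketch of how a lexer might apply the longest-match rule and keyword priority to the int intvalue; example above — the keyword list and patterns are assumptions made for illustration.

import re

KEYWORDS = {"int", "float", "return"}                 # illustrative reserved words
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def lex(source):
    tokens, i = [], 0
    while i < len(source):
        if source[i].isspace():                       # whitespace ends a lexeme
            i += 1
            continue
        m = IDENT.match(source, i)
        if m:                                         # longest match: the regex engine
            lexeme = m.group()                        # consumes as many characters as it can
            # Rule priority: a lexeme that exactly equals a reserved word
            # becomes a keyword token, otherwise an identifier token.
            tokens.append(("KEYWORD" if lexeme in KEYWORDS else "IDENTIFIER", lexeme))
            i = m.end()
        else:
            tokens.append(("SYMBOL", source[i]))      # e.g. ';'
            i += 1
    return tokens

print(lex("int intvalue;"))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'intvalue'), ('SYMBOL', ';')]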

Roles and Responsibility of Lexical Analyzer

The lexical analyzer performs the following tasks-

 The lexical analyzer is responsible for removing the white spaces and comments from the
source program.

 It helps to correlate error messages with the source program.

 It helps to identify the tokens.

 The input characters are read by the lexical analyzer from the source code.

Advantages of Lexical Analysis

 Lexical analysis helps the browsers to format and display a web page with the help of parsed
data.

 It is responsible to create a compiled binary executable code.

 It helps to create a more efficient and specialised processor for the task.
Disadvantages of Lexical Analysis

 It requires additional runtime overhead to generate the lexer table and construct the tokens.

 It requires much effort to debug and develop the lexer and its token description.

 Much significant time is required to read the source code and partition it into tokens.


Syntax analysis or parsing is the second phase of a compiler. In this chapter, we shall learn the basic
concepts used in the construction of a parser.

We have seen that a lexical analyzer can identify tokens with the help of regular expressions and
pattern rules. But a lexical analyzer cannot check the syntax of a given sentence due to the
limitations of regular expressions. Regular expressions cannot check balanced tokens, such as parentheses. Therefore, this phase uses context-free grammar (CFG), which is recognized by push-down automata.

CFG, on the other hand, is a superset of Regular Grammar. It implies that every Regular Grammar is also context-free, but there exist some problems that are beyond the scope of Regular Grammar. CFG is a helpful tool in describing the syntax of programming languages.

Context-Free Grammar
In this section, we will first see the definition of context-free grammar and introduce terminologies
used in parsing technology.

A context-free grammar has four components:

 A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings.
The non-terminals define sets of strings that help define the language generated by the
grammar.

 A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which
strings are formed.

 A set of productions (P). The productions of a grammar specify the manner in which the terminals and non-terminals can be combined to form strings. Each production consists of a non-terminal called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called the right side of the production.

 One of the non-terminals is designated as the start symbol (S); from where the production
begins.

The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the
start symbol) by the right side of a production, for that non-terminal.

Example

We take the problem of palindrome language, which cannot be described by means of Regular
Expression. That is, L = { w | w = wR } is not a regular language. But it can be described by means of
CFG, as illustrated below:

G = ( V, Σ, P, S )

Where:

V = { Q, Z, N }

Σ = { 0, 1 }

P = { Q → Z | Q → N | Q → ℇ | Q → 0 | Q → 1 | Z → 0Q0 | N → 1Q1 }

S={Q}

This grammar describes palindrome language, such as: 1001, 11100111, 00100, 1010101, 11111, etc.

Syntax Analyzers

A syntax analyzer or parser takes the input from a lexical analyzer in the form of token streams. The
parser analyzes the source code (token stream) against the production rules to detect any errors in
the code. The output of this phase is a parse tree.
This way, the parser accomplishes two tasks, i.e., parsing the code, looking for errors and generating
a parse tree as the output of the phase.

Parsers are expected to parse the whole code even if some errors exist in the program. Parsers use
error recovering strategies, which we will learn later in this chapter.

Derivation

A derivation is basically a sequence of production rules, in order to get the input string. During
parsing, we take two decisions for some sentential form of input:

 Deciding the non-terminal which is to be replaced.

 Deciding the production rule, by which, the non-terminal will be replaced.

To decide which non-terminal to be replaced with production rule, we can have two options.

Left-most Derivation

If the sentential form of an input is scanned and replaced from left to right, it is called left-most
derivation. The sentential form derived by the left-most derivation is called the left-sentential form.

Right-most Derivation

If we scan and replace the input with production rules, from right to left, it is known as right-most
derivation. The sentential form derived from the right-most derivation is called the right-sentential
form.

Example

Production rules:

E→E+E

E→E*E

E → id

Input string: id + id * id

The left-most derivation is:

E→E*E

E→E+E*E
E → id + E * E

E → id + id * E

E → id + id * id

Notice that the left-most side non-terminal is always processed first.

The right-most derivation is:

E→E+E

E→E+E*E

E → E + E * id

E → E + id * id

E → id + id * id

Parse Tree

A parse tree is a graphical depiction of a derivation. It is convenient for seeing how strings are derived from the start symbol. The start symbol of the derivation becomes the root of the parse tree. Let us see this with the example from the last topic.

We take the left-most derivation of id + id * id

The left-most derivation is:

E→E*E

E→E+E*E

E → id + E * E

E → id + id * E

E → id + id * id

Step 1:

E→E*E

Step 2:
E→E+E*E

Step 3:

E → id + E * E

Step 4:

E → id + id * E

Step 5:
E → id + id * id

In a parse tree:

 All leaf nodes are terminals.

 All interior nodes are non-terminals.

 In-order traversal gives original input string.

A parse tree depicts associativity and precedence of operators. The deepest sub-tree is traversed
first, therefore the operator in that sub-tree gets precedence over the operator which is in the
parent nodes.

Ambiguity

A grammar G is said to be ambiguous if it has more than one parse tree (left or right derivation) for
at least one string.

Example

E→E+E

E→E–E

E → id

For the string id + id – id, the above grammar generates two parse trees:
A language is said to be inherently ambiguous if every grammar that generates it is ambiguous. Ambiguity in a grammar is not good for compiler construction. No method can detect and remove ambiguity automatically, but it can be removed either by re-writing the whole grammar without ambiguity, or by setting and following associativity and precedence constraints.

Associativity

If an operand has operators on both sides, the side on which the operator takes this operand is
decided by the associativity of those operators. If the operation is left-associative, then the operand
will be taken by the left operator or if the operation is right-associative, the right operator will take
the operand.

Example

Operations such as Addition, Multiplication, Subtraction, and Division are left associative. If the
expression contains:

id op id op id

it will be evaluated as:

(id op id) op id

For example, (id + id) + id

Operations like Exponentiation are right associative, i.e., the order of evaluation in the same
expression will be:

id op (id op id)

For example, id ^ (id ^ id)

Precedence

If two different operators share a common operand, the precedence of operators decides which will
take the operand. That is, 2+3*4 can have two different parse trees, one corresponding to (2+3)*4
and another corresponding to 2+(3*4). By setting precedence among operators, this problem can be
easily removed. As in the previous example, mathematically * (multiplication) has precedence over +
(addition), so the expression 2+3*4 will always be interpreted as:
2 + (3 * 4)

These methods decrease the chances of ambiguity in a language or its grammar.

Left Recursion

A grammar becomes left-recursive if it has any non-terminal ‘A’ whose derivation contains ‘A’ itself
as the left-most symbol. Left-recursive grammar is considered to be a problematic situation for top-
down parsers. Top-down parsers start parsing from the Start symbol, which in itself is non-terminal.
So, when the parser encounters the same non-terminal in its derivation, it becomes hard for it to
judge when to stop parsing the left non-terminal and it goes into an infinite loop.

Example:

(1) A => Aα | β

(2) S => Aα | β

A => Sd

(1) is an example of immediate left recursion, where A is any non-terminal symbol and α represents a string of grammar symbols (terminals and/or non-terminals).

(2) is an example of indirect-left recursion.

A top-down parser will first parse the A, which in-turn will yield a string consisting of A itself and the
parser may go into a loop forever.

Removal of Left Recursion

One way to remove left recursion is to use the following technique:

The production

A => Aα | β

is converted into following productions

A => βA'

A'=> αA' | ε

This does not impact the strings derived from the grammar, but it removes immediate left recursion.
Second method is to use the following algorithm, which should eliminate all direct and indirect left
recursions.

START

Arrange the non-terminals in some order: A1, A2, A3, …, An

for each i from 1 to n
    for each j from 1 to i-1
        replace each production of the form Ai ⟹ Aj𝜸
        with Ai ⟹ δ1𝜸 | δ2𝜸 | δ3𝜸 | … | δn𝜸
        where Aj ⟹ δ1 | δ2 | … | δn are the current Aj productions
    eliminate the immediate left-recursion in the Ai productions

END

Example

The production set

S => Aα | β

A => Sd

after applying the above algorithm, should become

S => Aα | β

A => Aαd | βd

and then, remove immediate left recursion using the first technique.

A => βdA'

A' => αdA' | ε

Now none of the production has either direct or indirect left recursion.
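The first technique can be applied mechanically. Below is a small, hedged sketch that eliminates immediate left recursion for a single non-terminal (indirect recursion and ε-subtleties are ignored to keep it short); representing productions as lists of symbol names is an assumption made for the example.

def eliminate_immediate_left_recursion(nonterminal, productions):
    """A => A a1 | ... | A am | b1 | ... | bn   becomes
       A  => b1 A' | ... | bn A'
       A' => a1 A' | ... | am A' | epsilon
    Productions are lists of symbols; 'epsilon' marks the empty string."""
    recursive    = [p[1:] for p in productions if p and p[0] == nonterminal]
    nonrecursive = [p for p in productions if not p or p[0] != nonterminal]
    if not recursive:
        return {nonterminal: productions}          # nothing to do
    new = nonterminal + "'"
    return {
        nonterminal: [beta + [new] for beta in nonrecursive],
        new:         [alpha + [new] for alpha in recursive] + [["epsilon"]],
    }

# A => A alpha | beta, from the text:
print(eliminate_immediate_left_recursion("A", [["A", "alpha"], ["beta"]]))
# {'A': [['beta', "A'"]], "A'": [['alpha', "A'"], ['epsilon']]}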

Left Factoring

If more than one grammar production rule has a common prefix string, then the top-down parser cannot make a choice as to which of the productions it should take to parse the string in hand.
Example

If a top-down parser encounters a production like

A ⟹ αβ | α𝜸 | …

Then it cannot determine which production to follow to parse the string as both productions are
starting from the same terminal (or non-terminal). To remove this confusion, we use a technique
called left factoring.

Left factoring transforms the grammar to make it useful for top-down parsers. In this technique, we make one production for each common prefix, and the rest of the derivation is added by new productions.

Example

The above productions can be written as

A => αA'

A'=> β | 𝜸 | …

Now the parser has only one production per prefix which makes it easier to take decisions.
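A rough sketch of left factoring for productions that share a one-symbol common prefix (a simplifying assumption; real implementations factor the longest common prefix and repeat until no common prefixes remain).

from collections import defaultdict

def left_factor(nonterminal, productions):
    """Group alternatives by their first symbol; if several share it,
    emit A -> prefix A' and move the differing suffixes into A'."""
    groups = defaultdict(list)
    for p in productions:
        groups[p[0]].append(p)
    result, counter = {nonterminal: []}, 0
    for prefix, alternatives in groups.items():
        if len(alternatives) == 1:
            result[nonterminal].append(alternatives[0])     # nothing to factor
        else:
            counter += 1
            new = nonterminal + "'" * counter
            result[nonterminal].append([prefix, new])
            result[new] = [p[1:] if len(p) > 1 else ["epsilon"] for p in alternatives]
    return result

# A => alpha beta | alpha gamma, from the text:
print(left_factor("A", [["alpha", "beta"], ["alpha", "gamma"]]))
# {'A': [['alpha', "A'"]], "A'": [['beta'], ['gamma']]}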

First and Follow Sets

An important part of parsing-table construction is creating the First and Follow sets. These sets indicate which terminal can appear at a given position in a derivation, and they are used while building the parsing table to decide which production rule α should fill the entry T[A, t].

First Set

This set is created to know what terminal symbol is derived in the first position by a non-terminal.
For example,

α→tβ

That is α derives t (terminal) in the very first position. So, t ∈ FIRST(α).

Algorithm for calculating First set

Look at the definition of the FIRST(α) set:

 if α is a terminal, then FIRST(α) = { α }.

 if α is a non-terminal and α → ℇ is a production, then ℇ is in FIRST(α).

 if α is a non-terminal and α → 𝜸1 𝜸2 𝜸3 … 𝜸n is a production, then everything in FIRST(𝜸1) except ℇ is in FIRST(α); if 𝜸1 can derive ℇ, the symbols of FIRST(𝜸2) are added as well, and so on; if all 𝜸i can derive ℇ, then ℇ is in FIRST(α).

The First set can be seen as: FIRST(α) = { t | α ⇒* tβ } ∪ { ℇ | α ⇒* ℇ }

Follow Set
Likewise, we calculate what terminal symbol immediately follows a non-terminal α in production
rules. We do not consider what the non-terminal can generate but instead, we see what would be
the next terminal symbol that follows the productions of a non-terminal.

Algorithm for calculating the Follow set:

 if α is the start symbol, then $ is in FOLLOW(α).

 if α is a non-terminal and has a production α → AB, then everything in FIRST(B) except ℇ is in FOLLOW(A).

 if α is a non-terminal and has a production α → AB, where B can derive ℇ (or B is empty), then everything in FOLLOW(α) is in FOLLOW(A).

The Follow set can be seen as: FOLLOW(α) = { t | S ⇒* β α t 𝜸 }
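As a hedged illustration, FIRST and FOLLOW can be computed with a fixed-point iteration. The grammar below (the expression grammar after left-recursion removal) and the conventions (keys of GRAMMAR are non-terminals, 'eps' stands for ℇ, '$' is the end marker) are assumptions made for this sketch.

# Fixed-point computation of FIRST and FOLLOW for a toy LL(1) grammar.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["eps"]],
    "T":  [["id"]],
}
START = "E"

def is_nonterminal(sym):
    return sym in GRAMMAR

def first_of_sequence(seq, first):
    """FIRST of a string of grammar symbols."""
    out = set()
    for sym in seq:
        sym_first = first[sym] if is_nonterminal(sym) else {sym}
        out |= sym_first - {"eps"}
        if "eps" not in sym_first:
            return out
    out.add("eps")                       # every symbol in seq could vanish
    return out

def compute_first():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for nt, productions in GRAMMAR.items():
            for prod in productions:
                new = first_of_sequence(prod, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def compute_follow(first):
    follow = {nt: set() for nt in GRAMMAR}
    follow[START].add("$")               # rule 1: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for nt, productions in GRAMMAR.items():
            for prod in productions:
                for i, sym in enumerate(prod):
                    if not is_nonterminal(sym):
                        continue
                    rest = first_of_sequence(prod[i + 1:], first)
                    new = rest - {"eps"}            # rule 2: FIRST of what follows sym
                    if "eps" in rest:               # rule 3: the tail can vanish
                        new |= follow[nt]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

first = compute_first()
print(first)                   # E: {id}, E': {+, eps}, T: {id}   (set order may vary)
print(compute_follow(first))   # E: {$}, E': {$}, T: {+, $}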

Limitations of Syntax Analyzers

Syntax analyzers receive their inputs, in the form of tokens, from lexical analyzers. Lexical analyzers are responsible for the validity of the tokens supplied to the syntax analyzer. Syntax analyzers have the following drawbacks -

 it cannot determine if a token is valid,

 it cannot determine if a token is declared before it is being used,

 it cannot determine if a token is initialized before it is being used,

 it cannot determine if an operation performed on a token type is valid or not.

These tasks are accomplished by the semantic analyzer, which we shall study in Semantic Analysis.

Compiler Design - Types of Parsing

Syntax analyzers follow production rules defined by means of context-free grammar. The way the
production rules are implemented (derivation) divides parsing into two types : top-down parsing and
bottom-up parsing.

Top-down Parsing

When the parser starts constructing the parse tree from the start symbol and then tries to transform
the start symbol to the input, it is called top-down parsing.
 Recursive descent parsing : It is a common form of top-down parsing. It is called recursive as
it uses recursive procedures to process the input. Recursive descent parsing suffers from
backtracking.

 Backtracking : It means, if one derivation of a production fails, the syntax analyzer restarts
the process using different rules of same production. This technique may process the input
string more than once to determine the right production.

Bottom-up Parsing

As the name suggests, bottom-up parsing starts with the input symbols and tries to construct the
parse tree up to the start symbol.

Example:

Input string : a + b * c

Production rules:

S→E

E→E+T

E→E*T

E→T

T → id

Let us start bottom-up parsing

a+b*c

Read the input and check if any production matches with the input:

a+b*c

T+b*c

E+b*c

E+T*c

E*c

E*T

E

S
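The reduction sequence above can be imitated with a small hand-rolled shift-reduce loop. This is only a demonstration of the shift and reduce moves for this particular grammar (with a, b, c read as id); the decisions about when to reduce are hard-wired, so it is not a general bottom-up parser.

# Hand-rolled shift-reduce demonstration for the grammar
#   S -> E,  E -> E + T | E * T | T,  T -> id
def parse(tokens):
    stack = []
    tokens = tokens + ["$"]                           # end-of-input marker
    i = 0
    while True:
        # Reduce whenever the top of the stack forms a handle (hard-wired rules).
        if stack[-1:] == ["id"]:
            stack[-1:] = ["T"]                        # T -> id
        elif stack[-3:] in (["E", "+", "T"], ["E", "*", "T"]):
            stack[-3:] = ["E"]                        # E -> E + T  |  E * T
        elif stack == ["T"]:
            stack = ["E"]                             # E -> T
        elif stack == ["E"] and tokens[i] == "$":
            stack = ["S"]                             # S -> E
            print(stack, "accept")
            return True
        elif tokens[i] == "$":
            print(stack, "error")                     # nothing left to shift or reduce
            return False
        else:
            stack.append(tokens[i])                   # shift the next input symbol
            i += 1
        print(stack, "remaining input:", tokens[i:-1])

parse(["id", "+", "id", "*", "id"])   # mirrors the reduction sequence shown above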

Compiler Design - Top-Down Parser


We have learnt in the last chapter that the top-down parsing technique parses the input, and starts
constructing a parse tree from the root node gradually moving down to the leaf nodes. The types of
top-down parsing are depicted below:

Recursive Descent Parsing

Recursive descent is a top-down parsing technique that constructs the parse tree from the top and
the input is read from left to right. It uses procedures for every terminal and non-terminal entity.
This parsing technique recursively parses the input to make a parse tree, which may or may not
require back-tracking. But the grammar associated with it (if not left factored) cannot avoid back-
tracking. A form of recursive-descent parsing that does not require any back-tracking is known
as predictive parsing.

This parsing technique is regarded recursive as it uses context-free grammar which is recursive in
nature.

Back-tracking

Top- down parsers start from the root node (start symbol) and match the input string against the
production rules to replace them (if matched). To understand this, take the following example of
CFG:

S → rXd | rZd

X → oa | ea

Z → ai

For the input string “read”, a top-down parser will behave like this:

It will start with S from the production rules and will match its yield to the left-most letter of the input, i.e. ‘r’. The first production of S (S → rXd) matches it. So the top-down parser advances to the next input letter (i.e. ‘e’). The parser tries to expand the non-terminal ‘X’ and checks its production from the left (X → oa). It does not match the next input symbol. So the top-down parser backtracks to obtain the next production rule of X, (X → ea).

Now the parser matches all the input letters in an ordered manner. The string is accepted.
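A hedged sketch of a backtracking recognizer for this grammar: each alternative of a production is tried in order, and the parser falls back to the next alternative when one fails. Representing intermediate results as sets of input positions is an implementation choice made for this example.

# Backtracking recursive-descent recognizer for
#   S -> r X d | r Z d,   X -> o a | e a,   Z -> a i
GRAMMAR = {
    "S": [["r", "X", "d"], ["r", "Z", "d"]],
    "X": [["o", "a"], ["e", "a"]],
    "Z": [["a", "i"]],
}

def match(symbol, s, pos):
    """Return the set of input positions reachable after matching `symbol` at `pos`."""
    if symbol not in GRAMMAR:                      # terminal: must match one character
        return {pos + 1} if pos < len(s) and s[pos] == symbol else set()
    positions = set()
    for production in GRAMMAR[symbol]:             # try each alternative (backtracking)
        current = {pos}
        for sym in production:
            current = {q for p in current for q in match(sym, s, p)}
        positions |= current
    return positions

def accepts(s):
    return len(s) in match("S", s, 0)

print(accepts("read"))   # True  (S -> r X d, X -> e a)
print(accepts("raid"))   # True  (S -> r Z d, Z -> a i)
print(accepts("red"))    # False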

Predictive Parser

Predictive parser is a recursive descent parser, which has the capability to predict which production
is to be used to replace the input string. The predictive parser does not suffer from backtracking.

To accomplish its tasks, the predictive parser uses a look-ahead pointer, which points to the next
input symbols. To make the parser back-tracking free, the predictive parser puts some constraints on
the grammar and accepts only a class of grammar known as LL(k) grammar.

Predictive parsing uses a stack and a parsing table to parse the input and generate a parse tree. Both
the stack and the input contains an end symbol $ to denote that the stack is empty and the input is
consumed. The parser refers to the parsing table to take any decision on the input and stack
element combination.
In recursive descent parsing, the parser may have more than one production to choose from for a single instance of input, whereas in a predictive parser, each step has at most one production to choose. There might be instances where there is no production matching the input string, causing the parsing procedure to fail.

LL Parser

An LL Parser accepts LL grammar. LL grammar is a subset of context-free grammar but with some
restrictions to get the simplified version, in order to achieve easy implementation. LL grammar can
be implemented by means of both algorithms namely, recursive-descent or table-driven.

An LL parser is denoted as LL(k). The first L in LL(k) stands for scanning the input from left to right, the second L stands for left-most derivation, and k represents the number of look-aheads. Generally k = 1, so LL(k) may also be written as LL(1).

LL Parsing Algorithm

We may stick to deterministic LL(1) for parser explanation, as the size of table grows exponentially
with the value of k. Secondly, if a given grammar is not LL(1), then usually, it is not LL(k), for any
given k.

Given below is an algorithm for LL(1) Parsing:


Input:

    string ω
    parsing table M for grammar G

Output:

    If ω is in L(G), then the left-most derivation of ω; error otherwise.

Initial State : $S on the stack (with S being the start symbol); ω$ in the input buffer.

SET ip to point to the first symbol of ω$.

repeat
    let X be the top stack symbol and a the symbol pointed to by ip.
    if X ∈ Vt or $
        if X = a
            POP X and advance ip.
        else
            error()
        endif
    else    /* X is non-terminal */
        if M[X,a] = X → Y1, Y2,... Yk
            POP X
            PUSH Yk, Yk-1,... Y1    /* Y1 on top */
            Output the production X → Y1, Y2,... Yk
        else
            error()
        endif
    endif
until X = $ /* empty stack */
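For illustration, here is a compact Python rendering of the table-driven loop above, using a tiny hand-written parsing table for the grammar E → T E', E' → + T E' | ℇ, T → id; the table entries are supplied by hand for the example rather than computed.

# Table-driven LL(1) parser sketch for:  E -> T E',  E' -> + T E' | eps,  T -> id
TABLE = {                                  # M[non-terminal, lookahead] -> production body
    ("E",  "id"): ["T", "E'"],
    ("E'", "+"):  ["+", "T", "E'"],
    ("E'", "$"):  [],                      # E' -> eps
    ("T",  "id"): ["id"],
}
NONTERMINALS = {"E", "E'", "T"}

def parse(tokens):
    stack = ["$", "E"]                     # start symbol on top of the end marker
    tokens = tokens + ["$"]
    i = 0
    while stack:
        top = stack.pop()
        if top == "$" and tokens[i] == "$":
            return True                    # input consumed and stack empty
        if top not in NONTERMINALS:        # terminal: must match the lookahead
            if top == tokens[i]:
                i += 1
                continue
            return False
        body = TABLE.get((top, tokens[i])) # non-terminal: consult the parsing table
        if body is None:
            return False
        stack.extend(reversed(body))       # push the production body right-to-left
        print(f"{top} -> {' '.join(body) or 'eps'}")
    return False

print(parse(["id", "+", "id"]))    # prints the left-most derivation, then True
print(parse(["id", "+"]))          # False (missing operand)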

A grammar G is LL(1) if, for any two distinct productions A → α | β of G:

 for no terminal a do both α and β derive strings beginning with a;

 at most one of α and β can derive the empty string;

 if β ⇒* ℇ, then α does not derive any string beginning with a terminal in FOLLOW(A).

Compiler Design - Bottom-Up Parser

Bottom-up parsing starts from the leaf nodes of a tree and works in an upward direction till it reaches the root node. Here, we start from a sentence and then apply production rules in reverse in order to reach the start symbol.

Shift-Reduce Parsing

Shift-reduce parsing uses two unique steps for bottom-up parsing. These steps are known as shift-
step and reduce-step.

 Shift step: The shift step refers to the advancement of the input pointer to the next input
symbol, which is called the shifted symbol. This symbol is pushed onto the stack. The shifted
symbol is treated as a single node of the parse tree.

 Reduce step : When the parser finds a complete grammar rule (RHS) on the stack and replaces it with the corresponding (LHS), it is known as a reduce-step. This occurs when the top of the stack contains a handle. To reduce, a POP function is performed on the stack, which pops off the handle and replaces it with the LHS non-terminal symbol.

LR Parser

The LR parser is a non-recursive, shift-reduce, bottom-up parser. It handles a wide class of context-free
grammars, which makes it one of the most efficient syntax analysis techniques. LR parsers are also known as
LR(k) parsers, where L stands for left-to-right scanning of the input stream; R stands for the
construction of right-most derivation in reverse, and k denotes the number of lookahead symbols to
make decisions.

There are three widely used algorithms available for constructing an LR parser:

 SLR(1) – Simple LR Parser:

o Works on the smallest class of grammars

o Few states, hence a very small table

o Simple and fast construction

 LR(1) – LR Parser:

o Works on the complete set of LR(1) grammars

o Generates a large table and a large number of states

o Slow construction

 LALR(1) – Look-Ahead LR Parser:

o Works on an intermediate-sized class of grammars

o Number of states is the same as in SLR(1)

LR Parsing Algorithm

Here we describe a skeleton algorithm of an LR parser:

token = next_token()

repeat forever
    s = top of stack

    if action[s, token] = "shift si" then
        PUSH token
        PUSH si
        token = next_token()
    else if action[s, token] = "reduce A ::= β" then
        POP 2 * |β| symbols
        s = top of stack
        PUSH A
        PUSH goto[s, A]
    else if action[s, token] = "accept" then
        return
    else
        error()
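
The following Python sketch fills in this skeleton with a concrete, hand-built SLR(1) table for the toy grammar S → ( S ) | x; the grammar, the table contents, and the decision to keep only states (not symbols) on the stack are illustrative assumptions rather than part of the original text.

# LR parser driver for the toy grammar:
#   0: S' -> S      1: S -> ( S )      2: S -> x
PRODUCTIONS = {1: ('S', 3), 2: ('S', 1)}       # rule number -> (LHS, length of RHS)

ACTION = {
    (0, '('): ('shift', 2), (0, 'x'): ('shift', 3),
    (1, '$'): ('accept', None),
    (2, '('): ('shift', 2), (2, 'x'): ('shift', 3),
    (3, ')'): ('reduce', 2), (3, '$'): ('reduce', 2),
    (4, ')'): ('shift', 5),
    (5, ')'): ('reduce', 1), (5, '$'): ('reduce', 1),
}
GOTO = {(0, 'S'): 1, (2, 'S'): 4}

def lr_parse(tokens):
    tokens = list(tokens) + ['$']
    stack = [0]                                 # stack of states only, for brevity
    ip = 0
    while True:
        s, a = stack[-1], tokens[ip]
        act = ACTION.get((s, a))
        if act is None:
            raise SyntaxError(f"unexpected {a!r} in state {s}")
        kind, arg = act
        if kind == 'shift':
            stack.append(arg)                   # push the new state, advance the input
            ip += 1
        elif kind == 'reduce':
            lhs, size = PRODUCTIONS[arg]
            del stack[len(stack) - size:]       # pop |β| states
            stack.append(GOTO[(stack[-1], lhs)])   # push goto[s, A]
            print(f"reduce by rule {arg} ({lhs} -> ...)")
        else:                                   # accept
            return True

lr_parse(['(', '(', 'x', ')', ')'])             # parses "((x))"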

LL vs LR:

 LL: Does a leftmost derivation.
 LR: Does a rightmost derivation in reverse.

 LL: Starts with the root nonterminal on the stack.
 LR: Ends with the root nonterminal on the stack.

 LL: Ends when the stack is empty.
 LR: Starts with an empty stack.

 LL: Uses the stack for designating what is still to be expected.
 LR: Uses the stack for designating what has already been seen.

 LL: Builds the parse tree top-down.
 LR: Builds the parse tree bottom-up.

 LL: Continuously pops a nonterminal off the stack and pushes the corresponding right-hand side.
 LR: Tries to recognize a right-hand side on the stack, pops it, and pushes the corresponding nonterminal.

 LL: Expands the non-terminals.
 LR: Reduces the non-terminals.

 LL: Reads the terminals when it pops them off the stack.
 LR: Reads the terminals while it pushes them on the stack.

 LL: Pre-order traversal of the parse tree.
 LR: Post-order traversal of the parse tree.

Compiler Design - Semantic Analysis

We have learnt how a parser constructs parse trees in the syntax analysis phase. The plain parse-
tree constructed in that phase is generally of no use to a compiler, as it does not carry any
information about how to evaluate the tree. The productions of the context-free grammar, which make
up the rules of the language, do not specify how they should be interpreted.

For example
E→E+T

The above CFG production has no semantic rule associated with it, and it cannot help in making any
sense of the production.

Semantics

Semantics of a language provide meaning to its constructs, like tokens and syntax structure.
Semantics help interpret symbols, their types, and their relations with each other. Semantic analysis
judges whether the syntax structure constructed in the source program derives any meaning or not.

CFG + semantic rules = Syntax Directed Definitions

For example:

int a = “value”;

does not raise an error in the lexical or syntax analysis phases, as it is lexically and structurally correct,
but it should generate a semantic error because the type of the value being assigned differs from the
declared type of the variable. These rules are set by the grammar of the language and evaluated in
semantic analysis. The following tasks should be performed in semantic analysis:

 Scope resolution

 Type checking

 Array-bound checking

Semantic Errors

Some of the semantic errors that the semantic analyzer is expected to recognize are:

 Type mismatch

 Undeclared variable

 Reserved identifier misuse.

 Multiple declaration of variable in a scope.

 Accessing an out of scope variable.

 Actual and formal parameter mismatch.

Attribute Grammar

Attribute grammar is a special form of context-free grammar in which some additional information
(attributes) is appended to one or more of its non-terminals in order to provide context-sensitive
information. Each attribute has a well-defined domain of values, such as integer, float, character,
string, and expressions.

Attribute grammar is a medium to provide semantics to the context-free grammar and it can help
specify the syntax and semantics of a programming language. Attribute grammar (when viewed as a
parse-tree) can pass values or information among the nodes of a tree.
Example:

E → E + T { E.value = E.value + T.value }

The right part of the CFG contains the semantic rules that specify how the grammar should be
interpreted. Here, the values of non-terminals E and T are added together and the result is copied to
the non-terminal E.

Semantic attributes are assigned values from their domains at the time of parsing, and these values are
used when evaluating assignments or conditions. Based on the way the attributes get their values,
they can be broadly divided into two categories: synthesized attributes and inherited attributes.

Synthesized attributes

These attributes get values from the attribute values of their child nodes. To illustrate, assume the
following production:

S → ABC

If S takes its value from its child nodes (A, B, C), then it is said to have a synthesized attribute, as the
values of A, B, and C are synthesized into S.

As in our previous example (E → E + T), the parent node E gets its value from its child node.
Synthesized attributes never take values from their parent nodes or any sibling nodes.
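
As a small illustration of synthesized attributes, the following Python sketch computes a value attribute by a post-order (bottom-up) walk over a parse tree, in the spirit of the rule E → E + T { E.value = E.value + T.value }; the Node class and the tiny tree are assumptions made for the example.

# Bottom-up evaluation of a synthesized 'value' attribute.
class Node:
    def __init__(self, label, children=(), value=None):
        self.label = label
        self.children = list(children)
        self.value = value                      # the synthesized attribute

def evaluate(node):
    # Post-order walk: a node's value is synthesized from its children's values.
    for child in node.children:
        evaluate(child)
    if node.label == '+':
        node.value = node.children[0].value + node.children[1].value
    elif node.label == '*':
        node.value = node.children[0].value * node.children[1].value
    # leaf nodes (constants) already carry their value
    return node.value

# parse tree for the expression 2 + 3 * 4
tree = Node('+', [Node('num', value=2),
                  Node('*', [Node('num', value=3), Node('num', value=4)])])
print(evaluate(tree))                           # prints 14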

Inherited attributes

In contrast to synthesized attributes, inherited attributes can take values from parent and/or
siblings. As in the following production,

S → ABC

A can get values from S, B and C. B can take values from S, A, and C. Likewise, C can take values from
S, A, and B.

Expansion : When a non-terminal is expanded to terminals as per a grammatical rule.

Reduction : When the right-hand side of a rule (terminals and/or non-terminals) is reduced to its
corresponding non-terminal according to the grammar rules.

Syntax trees are parsed top-down and left to right. Whenever a reduction occurs, we apply its
corresponding semantic rules (actions).
Semantic analysis uses Syntax Directed Translations to perform the above tasks.

Semantic analyzer receives AST (Abstract Syntax Tree) from its previous stage (syntax analysis).

Semantic analyzer attaches attribute information with AST, which are called Attributed AST.

An attribute is a two-tuple value, <attribute name, attribute value>

For example:

int value = 5;

<type, “integer”>

<presentvalue, “5”>

For every production, we attach a semantic rule.

S-attributed SDT

If an SDT uses only synthesized attributes, it is called an S-attributed SDT. In S-attributed SDTs, the
semantic actions are written at the end of the production (after the right-hand side).

Attributes in S-attributed SDTs are evaluated during bottom-up parsing, as the values
of the parent nodes depend upon the values of the child nodes.

L-attributed SDT

This form of SDT uses both synthesized and inherited attributes, with the restriction that values are
never taken from right siblings.

In L-attributed SDTs, a non-terminal can get values from its parent, child, and sibling nodes. As in the
following production

S → ABC

S can take values from A, B, and C (synthesized). A can take values from S only. B can take values
from S and A. C can get values from S, A, and B. No non-terminal can get values from the sibling to its
right.

Attributes in L-attributed SDTs are evaluated in a depth-first, left-to-right manner.
We may conclude that if a definition is S-attributed, then it is also L-attributed, as the class of
L-attributed definitions encloses the class of S-attributed definitions.

Compiler Design - Symbol Table

Symbol table is an important data structure created and maintained by compilers in order to store
information about the occurrence of various entities such as variable names, function names,
objects, classes, interfaces, etc. Symbol table is used by both the analysis and the synthesis parts of a
compiler.

A symbol table may serve the following purposes depending upon the language in hand:

 To store the names of all entities in a structured form at one place.

 To verify if a variable has been declared.

 To implement type checking, by verifying assignments and expressions in the source code
are semantically correct.

 To determine the scope of a name (scope resolution).

A symbol table is simply a table which can be either linear or a hash table. It maintains an entry for
each name in the following format:

<symbol name, type, attribute>

For example, if a symbol table has to store information about the following variable declaration:

static int interest;

then it should store the entry such as:

<interest, int, static>

The attribute clause contains the entries related to the name.


Implementation

If a compiler is to handle a small amount of data, then the symbol table can be implemented as an
unordered list, which is easy to code but suitable only for small tables. In general, a symbol table can
be implemented in one of the following ways:

 Linear (sorted or unsorted) list

 Binary Search Tree

 Hash table

Among all, symbol tables are mostly implemented as hash tables, where the source code symbol
itself is treated as a key for the hash function and the return value is the information about the
symbol.

Operations

A symbol table, either linear or hash, should provide the following operations.

insert()

This operation is used more frequently by the analysis phase, i.e., the first half of the compiler, where
tokens are identified and names are stored in the table. This operation is used to add information to
the symbol table about unique names occurring in the source code. The format or structure in which
the names are stored depends upon the compiler in hand.

An attribute for a symbol in the source code is the information associated with that symbol. This
information contains the value, state, scope, and type about the symbol. The insert() function takes
the symbol and its attributes as arguments and stores the information in the symbol table.

For example:

int a;

should be processed by the compiler as:

insert(a, int);

lookup()

lookup() operation is used to search a name in the symbol table to determine:

 if the symbol exists in the table.

 if it is declared before it is being used.

 if the name is used in the scope.

 if the symbol is initialized.

 if the symbol is declared multiple times.

The format of lookup() function varies according to the programming language. The basic format
should match the following:

lookup(symbol)
This method returns 0 (zero) if the symbol does not exist in the symbol table. If the symbol exists in
the symbol table, it returns its attributes stored in the table.
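
A minimal hash-table-based symbol table supporting these two operations might look like the sketch below; the attribute layout and the decision to raise an error on redeclaration are assumptions made for illustration.

# Minimal hash-table symbol table with insert() and lookup().
class SymbolTable:
    def __init__(self):
        self.table = {}                        # name -> attribute record

    def insert(self, name, **attributes):
        # Add a new name; redeclaration in the same table is a semantic error.
        if name in self.table:
            raise KeyError(f"'{name}' is declared multiple times")
        self.table[name] = attributes

    def lookup(self, name):
        # Return the attributes of a name, or 0 if it is not declared (as above).
        return self.table.get(name, 0)

symbols = SymbolTable()
symbols.insert('a', type='int')                          # from the declaration: int a;
symbols.insert('interest', type='int', storage='static')
print(symbols.lookup('interest'))   # {'type': 'int', 'storage': 'static'}
print(symbols.lookup('missing'))    # 0 (the symbol does not exist)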

Scope Management

A compiler maintains two types of symbol tables: a global symbol table which can be accessed by all
the procedures and scope symbol tables that are created for each scope in the program.

To determine the scope of a name, symbol tables are arranged in hierarchical structure as shown in
the example below:

...

int value=10;

void pro_one()
   {
      int one_1;
      int one_2;

      {              \
         int one_3;  |_ inner scope 1
         int one_4;  |
      }              /

      int one_5;

      {              \
         int one_6;  |_ inner scope 2
         int one_7;  |
      }              /
   }

void pro_two()
   {
      int two_1;
      int two_2;

      {              \
         int two_3;  |_ inner scope 3
         int two_4;  |
      }              /

      int two_5;
   }

...

The above program can be represented in a hierarchical structure of symbol tables: the global symbol
table contains the name of one global variable (int value) and the two procedure names, and these
are available to all the child scope tables. The names recorded in the pro_one symbol table (and in
all its child tables) are not available to pro_two and its child tables.

This symbol table hierarchy is maintained by the semantic analyzer, and whenever a name needs to be
searched in a symbol table, it is searched using the following algorithm (a small sketch follows the list):

 first, the symbol is searched in the current scope, i.e., the current symbol table;

 if the name is found, the search is complete; otherwise, it is searched in the parent symbol
table, and so on,

 until either the name is found or the global symbol table has been searched for the name.
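
The lookup just described can be sketched with parent-linked scope tables; the class layout and the example scopes (modeled on the program above) are assumptions for illustration.

# Hierarchical (scoped) symbol tables: lookup walks from the current scope
# outward through its parents until the global table has been searched.
class Scope:
    def __init__(self, parent=None):
        self.symbols = {}
        self.parent = parent

    def insert(self, name, attrs):
        self.symbols[name] = attrs

    def lookup(self, name):
        scope = self
        while scope is not None:               # current scope, then enclosing scopes
            if name in scope.symbols:
                return scope.symbols[name]
            scope = scope.parent
        return None                            # not found even in the global table

global_scope = Scope()
global_scope.insert('value', {'type': 'int'})
pro_one = Scope(parent=global_scope)           # scope table for pro_one()
pro_one.insert('one_1', {'type': 'int'})
inner_1 = Scope(parent=pro_one)                # inner scope 1
print(inner_1.lookup('one_1'))                 # found in the enclosing pro_one scope
print(inner_1.lookup('two_1'))                 # None: not visible from this scope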

Compiler - Intermediate Code Generation

If source code can be translated directly into its target machine code, why do we need to translate it
first into an intermediate code that is then translated into the target code? Let us see the reasons why
an intermediate code is needed.

 If a compiler translates the source language to its target machine language without having
the option for generating intermediate code, then for each new machine, a full native
compiler is required.

 Intermediate code eliminates the need of a new full compiler for every unique machine by
keeping the analysis portion same for all the compilers.

 The second part of compiler, synthesis, is changed according to the target machine.

 It becomes easier to apply the source code modifications to improve code performance by
applying code optimization techniques on the intermediate code.

Intermediate Representation

Intermediate code can be represented in a variety of ways, each with its own benefits.

 High Level IR - High-level intermediate representation is very close to the source
language itself. It can be generated easily from the source code, and code modifications to
enhance performance can be applied easily. However, it is less preferred for target-machine
optimization.

 Low Level IR - This one is close to the target machine, which makes it suitable for register
and memory allocation, instruction set selection, etc. It is good for machine-dependent
optimizations.

Intermediate code can be either language specific (e.g., Byte Code for Java) or language independent
(three-address code).

Three-Address Code

The intermediate code generator receives input from its predecessor phase, the semantic analyzer, in
the form of an annotated syntax tree. That syntax tree can then be converted into a linear
representation, e.g., postfix notation. Intermediate code tends to be machine-independent code;
therefore, the code generator assumes that an unlimited number of memory locations (registers) is
available for generating code.
For example:

a = b + c * d;

The intermediate code generator will try to divide this expression into sub-expressions and then
generate the corresponding code.

r1 = c * d;

r2 = b + r1;

a = r2

where r1 and r2 are temporaries used as registers in the target program.

A three-address code has at most three address locations to calculate the expression. A three-
address code can be represented in two forms : quadruples and triples.

Quadruples

Each instruction in the quadruple representation is divided into four fields: op, arg1, arg2, and
result. The three-address code above is represented in quadruple format as follows:

Op    arg1    arg2    result
*     c       d       r1
+     b       r1      r2
=     r2              a

Triples

Each instruction in the triple representation has three fields: op, arg1, and arg2. The result of a
sub-expression is referred to by the position of the instruction that computes it. Triples are similar
to a DAG or a syntax tree; they are equivalent to a DAG when representing expressions.

#      Op    arg1    arg2
(0)    *     c       d
(1)    +     b       (0)
(2)    =     a       (1)
Triples face the problem of code immovability during optimization, as the results are positional and
changing the order or position of an expression may cause problems.

Indirect Triples

This representation is an enhancement over the triple representation. It uses a separate list of
pointers to the triples, instead of their positions, to refer to results. This enables the optimizers to
freely re-position a sub-expression to produce optimized code.
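
To make the idea of three-address code generation concrete, the following Python sketch flattens an expression tree for a = b + c * d into quadruples; the tuple-based tree layout and the temporary-naming scheme are assumptions made for illustration, not a fixed compiler interface.

# Flattening an expression tree into three-address code (quadruples).
quads = []           # list of (op, arg1, arg2, result)
_temp = 0

def new_temp():
    global _temp
    _temp += 1
    return f"r{_temp}"

def gen(node):
    # Return the 'address' (a name or a temporary) holding the node's value.
    if isinstance(node, str):                  # a leaf: variable name or constant
        return node
    op, left, right = node
    a1, a2 = gen(left), gen(right)             # generate code for the operands first
    t = new_temp()
    quads.append((op, a1, a2, t))
    return t

# a = b + c * d
rhs = ('+', 'b', ('*', 'c', 'd'))
quads.append(('=', gen(rhs), None, 'a'))

for q in quads:
    print(q)
# ('*', 'c', 'd', 'r1')
# ('+', 'b', 'r1', 'r2')
# ('=', 'r2', None, 'a')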

Declarations

A variable or procedure has to be declared before it can be used. Declaration involves allocation of
space in memory and entry of type and name in the symbol table. A program may be coded and
designed keeping the target machine structure in mind, but it may not always be possible to
accurately convert a source code to its target language.

Taking the whole program as a collection of procedures and sub-procedures, it becomes possible to
declare all the names local to a procedure. Memory allocation is done consecutively, and names are
allocated memory in the sequence in which they are declared in the program. We use an offset
variable, initialized to zero {offset = 0}, to denote the base address.

The source programming language and the target machine architecture may vary in the way names
are stored, so relative addressing is used. The first name is allocated memory starting from memory
location 0 {offset = 0}, and each name declared later is allocated memory next to the previous one.

Example:

We take the example of a C-like language in which an integer variable is assigned 2 bytes of
memory and a float variable is assigned 4 bytes of memory.

int a;

float b;

Allocation process:

{offset = 0}

int a;

id.type = int

id.width = 2

offset = offset + id.width

{offset = 2}

float b;
id.type = float

id.width = 4

offset = offset + id.width

{offset = 6}

To enter this detail in a symbol table, a procedure enter can be used. This method may have the
following structure:

enter(name, type, offset)

This procedure should create an entry in the symbol table, for variable name, having its type set to
type and relative address offset in its data area.
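
A tiny Python sketch of this allocation scheme follows; the width table mirrors the example above (int = 2 bytes, float = 4 bytes), while the dictionary-based table layout is an assumption for illustration.

# Sequential offset allocation for declarations.
WIDTH = {'int': 2, 'float': 4}
symbol_table = {}
offset = 0                              # base address of the data area

def enter(name, typ, off):
    # Record the name with its type and its relative address in the data area.
    symbol_table[name] = {'type': typ, 'offset': off}

for name, typ in [('a', 'int'), ('b', 'float')]:   # int a; float b;
    enter(name, typ, offset)
    offset += WIDTH[typ]                # offset = offset + id.width

print(symbol_table)  # {'a': {'type': 'int', 'offset': 0}, 'b': {'type': 'float', 'offset': 2}}
print(offset)        # 6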

Three address code in Compiler


Prerequisite – Intermediate Code Generation


Three address code is a type of intermediate code which is easy to generate and can be easily
converted to machine code. It makes use of at most three addresses and one operator to represent
an expression, and the value computed at each instruction is stored in a temporary variable generated
by the compiler. The compiler decides the order of operations given by the three address code.

Three address code is used in several compiler applications:

Optimization: Three address code is often used as an intermediate representation of code during
optimization phases of the compilation process. The three address code allows the compiler to
analyze the code and perform optimizations that can improve the performance of the generated
code.

Code generation: Three address code can also be used as an intermediate representation of code
during the code generation phase of the compilation process. The three address code allows the
compiler to generate code that is specific to the target platform, while also ensuring that the
generated code is correct and efficient.

Debugging: Three address code can be helpful in debugging the code generated by the compiler.
Since three address code is a low-level language, it is often easier to read and understand than the
final generated code. Developers can use the three address code to trace the execution of the
program and identify errors or issues that may be present.

Language translation: Three address code can also be used to translate code from one programming
language to another. By translating code to a common intermediate representation, it becomes
easier to translate the code to multiple target languages.

General representation –

a = b op c

where a, b, and c represent operands such as names, constants, or compiler-generated temporaries,
and op represents the operator.

Example-1: Convert the expression a * – (b + c) into three address code.

Example-2: Write three address code for following code

for(i = 1; i <= 10; i++)
{
    a[i] = x * 5;
}
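
The expected answers are not reproduced in the text above; one possible translation of each example (temporary and label names assumed) is shown below.

Example-1: a * – (b + c)

t1 = b + c
t2 = uminus t1
t3 = a * t2

Example-2 (assuming 4-byte array elements):

i = 1
L1: if i > 10 goto L2
    t1 = x * 5
    t2 = i * 4
    a[t2] = t1
    i = i + 1
    goto L1
L2: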
Implementation of Three Address Code –
There are 3 representations of three address code namely

1. Quadruple

2. Triples

3. Indirect Triples

1. Quadruple – It is a structure which consists of 4 fields, namely op, arg1, arg2 and result. op
denotes the operator, arg1 and arg2 denote the two operands, and result is used to store the
result of the expression.

Advantage –

 Easy to rearrange code for global optimization.

 One can quickly access the value of temporary variables using the symbol table.

Disadvantage –

 Contains a lot of temporaries.

 Temporary variable creation increases time and space complexity.

Example – Consider expression a = b * – c + b * – c. The three address code is:

t1 = uminus c

t2 = b * t1
t3 = uminus c

t4 = b * t3

t5 = t2 + t4

a = t5
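
The quadruple table for this three-address code (the original table is not reproduced here; this is one conventional layout) is:

#      op       arg1    arg2    result
(0)    uminus   c               t1
(1)    *        b       t1      t2
(2)    uminus   c               t3
(3)    *        b       t3      t4
(4)    +        t2      t4      t5
(5)    =        t5              a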

2. Triples – This representation doesn’t make use of an extra temporary variable to represent a single
operation; instead, when a reference to another triple’s value is needed, a pointer to that triple is
used. So, it consists of only three fields, namely op, arg1 and arg2.

Disadvantage –

 Temporaries are implicit, which makes it difficult to rearrange the code.

 It is difficult to optimize, because optimization involves moving intermediate code. When a
triple is moved, any other triple referring to it must be updated as well. With the help of a
pointer, one can directly access a symbol table entry.

Example – Consider expression a = b * – c + b * – c
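
The corresponding triple table (not reproduced in the text; one conventional layout, with operands referred to by the position of the triple that computes them) is:

#      op       arg1    arg2
(0)    uminus   c
(1)    *        b       (0)
(2)    uminus   c
(3)    *        b       (2)
(4)    +        (1)     (3)
(5)    =        a       (4)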


3. Indirect Triples – This representation makes use of pointers to a listing of all references to
computations, which is made separately and stored. It is similar in utility to the quadruple
representation but requires less space. Temporaries are implicit, and it is easier to rearrange the
code.

Example – Consider expression a = b * – c + b * – c
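
One possible indirect-triple layout (statement numbers are assumed): the triples shown above are stored in a separate table, and the program itself is kept as an ordered list of pointers into that table.

Statement    Pointer to triple
(35)         (0)
(36)         (1)
(37)         (2)
(38)         (3)
(39)         (4)
(40)         (5)

Reordering the program only reorders this pointer list; the triples themselves need not change.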


Question – Write quadruple, triples and indirect triples for following expression : (x + y) * (y + z) + (x
+ y + z)

Explanation – The three address code is:

t1 = x + y

t2 = y + z

t3 = t1 * t2

t4 = t1 + z

t5 = t3 + t4
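
The question also asks for the quadruple, triple, and indirect-triple forms, which are not reproduced in the text; one possible layout for this three-address code is:

Quadruples:

#      op    arg1    arg2    result
(0)    +     x       y       t1
(1)    +     y       z       t2
(2)    *     t1      t2      t3
(3)    +     t1      z       t4
(4)    +     t3      t4      t5

Triples:

#      op    arg1    arg2
(0)    +     x       y
(1)    +     y       z
(2)    *     (0)     (1)
(3)    +     (0)     z
(4)    +     (2)     (3)

Indirect triples: the five triples above stored in a separate table, with the program represented as an ordered list of pointers to them (e.g., (35)→(0), (36)→(1), …, (39)→(4)).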