Narayana Engineering College::Nellore: Department of Computer Science and Engineering
Course Details
Class: IIIrd B.Tech IInd Semester Branch: CSE Year: 2019-20
Course Title : Compiler Design Course Code: 15A05601 Credits: 3
Program/Dept.: Computer Science & Engineering (CSE) Batch: 2017-21
Regulation: R-15 Faculty: C. Rama Mohan
Unit - I
Introduction: Language processors, The Structure of a Compiler, the science of building a compiler.
Lexical Analysis: The Role of the lexical analyzer, Input buffering, Specification of tokens, Recognition
of tokens, The lexical analyzer generator Lex, Design of a Lexical Analyzer generator.
Introduction
• In order to reduce the complexity of designing and building computers, nearly all computers are made
to execute relatively simple commands (but do so very quickly).
• A program for a computer must be built by combining these very simple commands into a program in
what is called machine language.
• Since this is a tedious and error-prone process, most programming is instead done using a high-
level programming language.
• This language (HLL) can be very different from the machine language that the computer can
execute, so some means of bridging the gap is required. This is where the compiler comes in.
• A COMPILER translates (or compiles) a program written in a high-level programming language that
is suitable for human programmers into the low-level machine language that is required by computers.
During this process, the compiler will also attempt to spot and report obvious programmer mistakes.
Language processors
Preprocessor:
• A preprocessor produces input to compilers.
It may perform the following functions.
1. Macro processing: A preprocessor may allow a
user to define macros that are shorthands for
longer constructs.
2. File inclusion: A preprocessor may include
header files into the program text.
3. Rational preprocessors: These preprocessors
augment older languages with more modern
flow-of-control and data-structuring facilities.
4. Language extensions: These preprocessors
attempt to add capabilities to the language
by means of built-in macros.
Compiler:
• A compiler is a translator program that
takes a program written in a high-level
language (HLL), the source program, and
translates it into an equivalent program in
machine-level language (MLL), the target
program.
Assembler:
• An assembler translates assembly language programs into machine code. The input to an assembler
is called the source program; the output is a machine language translation.
Interpreter:
• Interpreter is also a language translator like a compiler.
• Interpreter directly executes the operations in the source program on inputs provided by user rather than
producing a target program as a translation.
• Interpreter is a common type of language processor.
• It executes the source program statement by statement and therefore provides better error diagnostics
than a compiler.
Phases of a compiler:
• A compiler operates in phases. A phase is a logically cohesive operation that takes the source program
in one representation and produces output in another representation.
• The phases of a compiler are shown below. There are two parts of compilation.
a. Analysis (Machine Independent/Language Dependent)
b. Synthesis (Machine Dependent/Language Independent)
Lexical Analysis:
• Lexical Analysis (also called Scanning or Linear Analysis) reads the source program one character at a
time, carving the source program into a sequence of atomic units called tokens. The input of lexical
analysis is the source program and the output is a stream of tokens.
• Functions of the lexical analyzer are
a) Removing white space
b) Recognizing constants, identifiers and keywords
c) Removing comments
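The functions above can be sketched as a small regex-driven scanner. This is a minimal illustration, not a production lexer; the token names, patterns, and keyword set are assumptions chosen for the example.

```python
import re

# Illustrative token patterns (names and patterns are assumptions for this sketch)
TOKEN_SPEC = [
    ("WS",      r"\s+"),           # white space: discarded
    ("COMMENT", r"//[^\n]*"),      # comments: discarded
    ("NUM",     r"\d+"),           # constants
    ("ID",      r"[A-Za-z_]\w*"),  # identifiers and keywords
    ("OP",      r"[+\-*/=]"),      # operators
]
KEYWORDS = {"if", "then", "else"}

def tokenize(source):
    """Scan the source one match at a time, emitting (token, lexeme) pairs."""
    pattern = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))
    tokens = []
    for m in pattern.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind in ("WS", "COMMENT"):
            continue                      # removed, not passed on to the parser
        if kind == "ID" and lexeme in KEYWORDS:
            kind = "KEYWORD"              # keywords are reserved identifiers
        tokens.append((kind, lexeme))
    return tokens
```

For instance, `tokenize("if x1 = 42 // note")` drops the white space and the comment and reports the keyword, identifier, operator, and constant as tokens.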
Syntax Analysis:
• The second stage of translation is called Syntax analysis or Parsing. In this phase expressions,
statements, declarations, etc. are identified by using the results of lexical analysis. Syntax analysis is
aided by using techniques based on the formal grammar of the programming language.
Semantic Analysis:
• Semantic analysis checks whether the parse tree constructed follows the rules of language
Intermediate Code Generations:
• An intermediate representation of the final machine language code is produced. This phase bridges the
analysis and synthesis phases of translation.
Code Optimization:
• This is an optional phase that improves the intermediate code so that the output runs faster
and takes less space.
Code Generation:
• The last phase of translation is code generation. A number of optimizations to reduce the length of
machine language program are carried out during this phase. The output of the code generator is
the machine language program of the specified computer.
Translator:
• It is a program that takes as input a program written in one language (source language) and produces as
output a program in another language (object language). Types of translators are compiler, interpreter,
and assembler.
Lexical Analysis
Introduction
• To identify the tokens we need some method of describing the possible tokens that can appear in the
input stream. For this purpose we introduce regular expression, a notation that can be used to describe
essentially all the tokens of programming language.
• Secondly, having decided what the tokens are, we need some mechanism to recognize these in the
input stream. This is done by the token recognizers, which are designed using transition diagrams and
finite automata.
• This phase scans the source code as a stream of characters and converts it into meaningful lexemes.
• The lexical analyzer (LA) scans the characters of the source program one at a time to discover tokens. Because a large
amount of time can be consumed scanning characters, specialized buffering techniques have been
developed to reduce the amount of overhead required to process an input character.
• Buffering techniques: 1. Buffer pairs 2. Sentinels.
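The sentinel idea can be sketched briefly: a character that cannot occur in the source (here `"\0"`, an assumption) is placed after the real input, so the inner scanning loop needs only one test per character instead of a separate end-of-buffer check.

```python
EOF = "\0"  # sentinel character, assumed not to occur in the source text

def scan_with_sentinel(buffer):
    """Count the letters in the buffer using a sentinel: the loop performs
    a single comparison per character instead of also testing the index
    against the buffer length on every read."""
    buf = buffer + EOF          # sentinel placed just past the real input
    i, letters = 0, 0
    while True:
        ch = buf[i]
        i += 1
        if ch == EOF:           # the only boundary check needed
            break
        if ch.isalpha():
            letters += 1
    return letters
```

In the real buffer-pair scheme two halves of a buffer are reloaded alternately, each half ending in its own sentinel; this sketch shows only the single-test loop that the sentinel makes possible.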
Token, Lexeme, Pattern
Token:
• Token is a sequence of characters that can be treated as a single logical entity. Typical tokens are,
1) Identifiers 2) keywords 3) Operators 4) Special symbols 5) Constants
Pattern:
• A set of strings in the input for which the same token is produced as output. This set of strings is
described by a rule called a pattern associated with the token.
Lexeme:
• A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
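The three terms can be related in a one-line example. Here the pattern for the token `id` is written as a regular expression (the exact pattern is an assumption in the style of the notes):

```python
import re

# Pattern associated with the token "id": a letter followed by letters/digits
ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")

source = "count = count + 1"
# The lexemes are the character sequences in the source matched by the pattern;
# each one is reported to the parser as the same token, id.
lexemes = ID_PATTERN.findall(source)
```

Here `id` is the token, `[A-Za-z][A-Za-z0-9]*` is its pattern, and each occurrence of `count` in the source is a lexeme.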
Recognition of tokens
• The lexical analyzer will recognize the keywords if, then, else, as well as the lexemes denoted by relop,
id, and num. To simplify matters, we assume keywords are reserved; that is, they cannot be used as
identifiers.
• To recognize the tokens in the input stream transition diagram and finite automata are convenient
ways of designing recognizers.
• Design of a Lexical Analyzer generator: there are two approaches, NFA-based and DFA-based. The
Lex compiler is implemented using the second (DFA-based) approach.
Lex specifications:
• A Lex program (the .l file ) consists of three parts:
declarations
%%
translation rules
%%
auxiliary procedures
• The declarations section includes declarations of variables, manifest constants (a manifest constant is
an identifier declared to represent a constant, e.g. #define PI 3.14), and regular definitions.
• The translation rules of a Lex program are statements of the form :
p1 {action 1}
p2 {action 2}
p3 {action 3}
……
……
where each p is a regular expression and each action is a program fragment describing what action the
lexical analyzer should take when a pattern p matches a lexeme. In Lex the actions are written in C.
• The third section holds whatever auxiliary procedures are needed by the actions. Alternatively these
procedures can be compiled separately and loaded with the lexical analyzer.
Syntax Analysis
Concept: Syntax Analysis or Parsing is the second phase, i.e. after lexical analysis. It checks the
syntactical structure of the given input, i.e. whether the given input is in the correct syntax or not.
Definition: Syntax analysis is the process of analyzing a string of symbols, either in natural
language, computer languages or data structures, conforming to the rules of a formal grammar.
Definition: A CFG is a set of recursive rewriting rules (or productions) used to generate patterns
of strings.
A set of terminal symbols, which are the characters of the alphabet that appear in the
strings generated by the grammar.
A set of non-terminal symbols, which are placeholders for patterns of terminal symbols
that can be generated by the non-terminal symbols.
A set of productions, which are rules for replacing (or rewriting) non-terminal symbols
(on the left side of the production) in a string with other non-terminal or terminal symbols
(on the right side of the production).
A start symbol, which is a special non-terminal symbol that appears in the initial string
generated by the grammar.
Ambiguous Grammar
Concept: While deriving a string from a given grammar, we may find that more than one derivation exists; such a grammar is ambiguous.
Definition: A grammar that produces more than one parse tree for some sentence is said to be
ambiguous.
(or)
An ambiguous grammar is one that produces more than one leftmost or rightmost derivation for
the same sentence. Ex: E → E+E | E*E | id
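The ambiguity of E → E+E | E*E | id can be made concrete: for the sentence 2 + 3 * 4 the grammar admits two parse trees, and the two groupings even compute different values. The tuple encoding of the trees below is an illustrative convention, not part of the notes.

```python
# Two parse trees for 2 + 3 * 4 under the ambiguous grammar E -> E+E | E*E | id,
# encoded as (operator, left subtree, right subtree); integers are leaves.
tree1 = ("+", 2, ("*", 3, 4))   # grouping 2 + (3 * 4): the usual precedence
tree2 = ("*", ("+", 2, 3), 4)   # grouping (2 + 3) * 4: the other derivation

def evaluate(node):
    """Evaluate a parse tree bottom-up."""
    if isinstance(node, int):
        return node
    op, left, right = node
    l, r = evaluate(left), evaluate(right)
    return l + r if op == "+" else l * r
```

Since the two trees evaluate to 14 and 20 respectively, the grammar cannot assign the sentence a single meaning, which is why ambiguity matters for translation.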
Parsing
Concept: This is the second phase of the compiler, which comes after the lexical analyzer.
Parse tree
Concept: A kind of tree structure generated while deriving string from the grammar.
Definition: A parse tree may be viewed as a graphical representation for a derivation that filters
out the choice regarding replacement order. Each interior node of a parse tree is labeled by some
non-terminal A and that the children of the node are labeled from left to right by symbols in the
right side of the production by which this A was replaced in the derivation. The leaves of the
parse tree are terminal symbols.
Definition: top-down parsing is a parsing strategy where one first looks at the highest level of
the parse tree and works down the parse tree by using the rewriting rules of a formal
grammar. LL parsers are a type of parser that uses a top-down parsing strategy.
Definition: In recursive descent parsing (RDP) we execute a set of recursive procedures to process the input. A procedure is
associated with each non-terminal of a grammar.
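The one-procedure-per-non-terminal idea can be sketched for a tiny grammar. The grammar E → T '+' E | T, T → 'id' is an assumption chosen for the example; each method below corresponds to one non-terminal.

```python
# A minimal recursive-descent parser for the illustrative grammar
#   E -> T '+' E | T
#   T -> 'id'

class RDP:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def look(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def match(self, expected):
        if self.look() != expected:
            raise SyntaxError(f"expected {expected!r}, got {self.look()!r}")
        self.pos += 1

    def E(self):                  # procedure for non-terminal E
        self.T()
        if self.look() == "+":    # look-ahead decides which alternative to take
            self.match("+")
            self.E()

    def T(self):                  # procedure for non-terminal T
        self.match("id")

def accepts(tokens):
    p = RDP(tokens)
    try:
        p.E()
        return p.pos == len(tokens)   # succeed only if all input is consumed
    except SyntaxError:
        return False
```

Because a single look-ahead token selects the alternative in `E`, no backtracking is needed here, which is exactly the situation the predictive-parsing entry below describes.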
Predictive parsing
Concept: A special form of Recursive Descent parsing, in which the look-ahead symbol
unambiguously determines the procedure selected for each non-terminal, where no backtracking
is required.
Definition: Predictive parsing is a Top down parsing method use to parse the given string.
Bottom up Parsing
Definition: Parsing method in which construction starts at the leaves and proceeds towards the
root is called as Bottom Up Parsing.
Shift-Reduce parsing
Definition: A general style of bottom-up syntax analysis, which attempts to construct a parse
tree for an input string beginning at the leaves and working up towards the root.
Operator grammar
Definition: A grammar is operator grammar if,
1. No production rule involves “ε” on the right side.
2. No production has two adjacent non-terminals on the right side.
LR (k) parsing
Concept: The “L” is for left-to-right scanning of the input, the “R” for constructing a rightmost
derivation in reverse, and the k for the number of input symbols of look ahead that are used in
making parsing decisions.
Definition: It is a bottom-up parsing method used to parse the input string by constructing a rightmost
derivation in reverse (a canonical derivation). The class of grammars that can be parsed using LR methods is a proper superset of
the class of grammars that can be parsed with predictive parsers.
GOTO function
Concept: It is a function used to construct the LR parse table. The function goto takes a state and a
grammar symbol as arguments and produces a state.
Definition: The goto function of a parsing table constructed from a grammar G is the transition
function of a DFA that recognizes the viable prefixes of G.
Ex: goto(I, X), where I is a set of items and X is a grammar symbol, is defined to be the closure of the
set of all items [A → αX.β] such that [A → α.Xβ] is in I.
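The closure and goto operations can be sketched directly on LR(0) items. The tiny grammar S' → S, S → ( S ) | a is an assumption for illustration; an item is encoded as (head, body, dot position).

```python
# LR(0) items for the illustrative grammar  S' -> S,  S -> ( S ) | a
GRAMMAR = {
    "S'": [["S"]],
    "S":  [["(", "S", ")"], ["a"]],
}

def closure(items):
    """If [A -> alpha . B beta] is in the set, add [B -> . gamma] for every
    production B -> gamma, repeating until no new items appear."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(items):
            if dot < len(body) and body[dot] in GRAMMAR:  # dot before a non-terminal
                for prod in GRAMMAR[body[dot]]:
                    item = (body[dot], tuple(prod), 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return items

def goto(items, symbol):
    """Move the dot over `symbol` in every item that allows it, then close."""
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == symbol}
    return closure(moved)
```

Starting from the initial item [S' → .S], closure adds the two S-productions with the dot at the left end, and goto on "(" advances the dot past the parenthesis and closes again, exactly as in LR table construction.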
LR grammar
Definition: A grammar for which we can construct an LR parsing table is said to be an LR grammar.
Kernel items
Concept: These are the items in which the dot is not at the left end of the production (together with the initial item).
Definition: The set of items which include the initial item and all items whose dots are not at the
left end are known as kernel items.
Non kernel items
Definition: The set of items which have their dots at the left end are known as non kernel items.
Parser Generator
Concept: A parser generator is a tool that takes a grammar specification and produces a parser that builds parse trees for the given input program fragments.
Syntax Directed Translation
• The Principle of Syntax Directed Translation states that the meaning of an input sentence is
related to its syntactic structure, i.e., to its Parse-Tree.
• By Syntax Directed Translations we indicate those formalisms for specifying translations
for programming language constructs guided by context-free grammars.
1. We associate Attributes to the grammar symbols representing the language constructs.
2. Values for attributes are computed by Semantic Rules associated with grammar
productions.
• Evaluation of Semantic Rules may:
1. Generate Code;
2. Insert information into the Symbol Table;
3. Perform Semantic Check;
4. Issue error messages;
• Such formalism generates Annotated Parse-Trees where each node of the tree is a record
with a field for each attribute (e.g., X.a indicates the attribute a of the grammar symbol X).
• The value of an attribute of a grammar symbol at a given parse-tree node is defined by a
semantic rule
associated with the production used at that node.
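The attribute-and-rule idea can be shown with the classic expression grammar, where every attribute is synthesized (an S-attributed definition). The grammar and rules below are the standard textbook example, used here as an illustrative sketch:

```python
# Production          Semantic rule (all attributes synthesized)
#   E -> E1 '+' T     E.val = E1.val + T.val
#   E -> T            E.val = T.val
#   T -> digit        T.val = digit.lexval
#
# Annotated parse-tree nodes are ('+', left, right) tuples; leaves are the
# lexical values of digits.

def E_val(node):
    """Compute the synthesized attribute `val` at a parse-tree node by first
    evaluating the attributes of its children (a bottom-up evaluation)."""
    if isinstance(node, int):
        return node                       # T.val = digit.lexval
    _, left, right = node
    return E_val(left) + E_val(right)     # E.val = E1.val + T.val
```

Each call evaluates the children before the parent, which is why S-attributed rules fit naturally with bottom-up parsing: the attribute at a node is ready as soon as the corresponding reduction is made.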
L-attributed definition
Definition: An SDD is L-attributed if each inherited attribute of Xi in the RHS of A → X1 X2 … Xn
depends only on
1. attributes of X1, X2, …, Xi−1 (the symbols to the left of Xi in the RHS), and
2. inherited attributes of A.
S-attributed grammars
Definition: These are a class of attribute grammars characterized by having no inherited
attributes, only synthesized attributes (inherited attributes must be passed down from
parent nodes to child nodes of the abstract syntax tree). Attribute evaluation in S-attributed
grammars can be incorporated conveniently in both top-down parsing and bottom-up parsing.
L-attributed grammars
Definition: These are a special type of attribute grammars. They allow the attributes to be
evaluated in one depth-first left-to-right traversal of the abstract syntax tree. As a result, attribute
evaluation in L-attributed grammars can be incorporated conveniently in top-down parsing.
Dependency Graphs
Concept: Dependency graphs are a useful tool for determining an evaluation order for the
attribute instances in a given parse tree.
Definition: A dependency graph depicts the flow of information among the attribute instances in
a particular parse tree; an edge from one attribute instance to another means that the value of the
first is needed to compute the second. Edges express constraints implied by the semantic rules.
Annotated parse tree
Definition: It is a parse tree showing the values of the attributes at each node. The process of
computing the attribute values at the nodes is called annotating or decorating the parse tree.
Syntax directed translation scheme
Concept: The syntax directed translation scheme is used to specify the order in which semantic rules are evaluated.
Definition: The syntax directed translation scheme is a context-free grammar in which
the semantic rules are embedded within the right side of the productions. The position at
which an action is to be executed is shown by enclosing the action between braces within the
right side of the production.
Intermediate code
Concept: It is an intermediate form of the source program used internally by a compiler.
Definition: It is a machine-independent code, commonly represented using three address fields.
Generating three-address code
Concept: The three-address code is generated using semantic rules that are similar to those for
constructing syntax trees for generating postfix notation.
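Those semantic rules can be sketched directly: each interior node of the expression tree gets a fresh temporary, and one three-address statement is emitted per node. The temporary-naming scheme `t1, t2, …` is a common convention assumed for the example.

```python
temp_count = 0

def new_temp():
    """Return a fresh temporary name t1, t2, ... (naming is a convention)."""
    global temp_count
    temp_count += 1
    return f"t{temp_count}"

def gen(node, code):
    """Return the address holding the node's value, appending three-address
    statements to `code`.  A node is either a name (string) or a tuple
    (operator, left subtree, right subtree)."""
    if isinstance(node, str):
        return node                        # a name: no code needed
    op, left, right = node
    l = gen(left, code)                    # code for the left operand first
    r = gen(right, code)
    t = new_temp()
    code.append(f"{t} = {l} {op} {r}")     # at most three addresses per statement
    return t
```

For the assignment a := b * c + d, calling `gen` on the tree ("+", ("*", "b", "c"), "d") emits `t1 = b * c` followed by `t2 = t1 + d`, with t2 holding the value to store into a.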
Syntax Tree
Concept: A syntax tree depicts the natural hierarchical structure of a source program.
Definition: It is a condensed form of the parse tree.
Postfix notation
Definition:
A postfix notation is a linearized representation of a syntax tree. It is a list of the nodes of the tree
in which a node appears immediately after its children.
Three address code
Definition: It is one of the forms of intermediate code used to represent a source program
fragment using a maximum of three address fields.
Quadruple
Definition: A quadruple is a record structure with four fields, which we call op, arg1, arg2 and
result.
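The four-field record can be shown concretely. The statement being translated (a = b * c + d) and the temporary names are assumptions for the example:

```python
from collections import namedtuple

# A quadruple as a record with the four fields named in the definition.
Quadruple = namedtuple("Quadruple", ["op", "arg1", "arg2", "result"])

# Three-address code for  a = b * c + d  written as quadruples:
quads = [
    Quadruple("*", "b",  "c",  "t1"),
    Quadruple("+", "t1", "d",  "t2"),
    Quadruple("=", "t2", None, "a"),   # copy statements leave arg2 empty
]
```

Storing the code as records rather than text makes later passes (such as optimization) easier, since each field can be inspected and rewritten directly.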
Boolean Expression
Definition: Expressions which are composed of the Boolean operators (and, or, and not) applied
to elements that are Boolean variables or relational expressions are known as Boolean
expressions
Viable prefixes
Definition: Viable prefixes are the prefixes of right sentential forms that can appear on the
stack of a shift-reduce parser. It is always possible to add terminal
symbols to the end of a viable prefix to obtain a right sentential form.
Calling sequence
Definition: A sequence of actions taken on entry to and exit from each procedure is known as
calling sequence.
Back patching
Definition: Back patching is the activity of filling in unspecified label information using
appropriate semantic actions during the code generation process.
UNIT - IV
Run Time Environment: storage organization, Stack allocation of space, Access to non-local
data on stack , Heap management
Symbol Table: Introduction, symbol table entries, operations on the symbol table, symbol
table organizations, non block structured language, block structured language.
Concept of Run Time Environment: Runtime environment is a state of the target machine,
which may include software libraries, environment variables, etc., to provide services to the
processes running in the system.
Definition of Runtime storage: It holds the generated target code, data objects, and a counterpart
of the control stack to keep track of procedure activations.
Code: This area is used to place the executable target code, as the size of the generated code is
fixed at compile time
Concept of Standard storage allocation strategies: Storage allocation strategies are Static
allocation, Stack allocation, Heap allocations.
Definition of Static allocation: It lays out storage for all data objects at compile time.
Definition of Heap allocation: It allocates and deallocates storage as needed at runtime from a
data area.
Definition of An activation tree: It depicts the way control enters and leaves activations. It is
used to efficiently describe the nesting of procedure calls to make the stack allocation feasible.
An activation tree is used to represent the activations of procedures during the execution of the
entire program.
In the tree, NODE - represents the activation of a procedure. ROOT - represents activation of
the "main" procedure. CHILDREN NODES - represents activations of procedures called by the
parent procedure.
Calling Sequence is code that allocates an activation record (AR) on the stack and enters information into its
fields. Return Sequence is code used to restore the state of the machine so the calling
procedure can continue its execution after the call.
Concept of Control stack: It keeps track of live procedure activations. Push the node for
activation onto the control stack as the activation begins and to pop the node when the activation
ends.
Concept of Scope of declaration: The portion of the program to which a declaration applies is
called the scope of that declaration. An occurrence of a name in a procedure is said to be local to
the procedure if it is in the scope of declaration within the procedure; otherwise the occurrence is
said to be nonlocal.
Definition of Access link: It refers to nonlocal data held in other activation records.
Concept of Static scope rule: It determines the declaration that applies to a name by examining
the Program text alone.
Concept of Dynamic scope rule: It determines the declaration applicable to a name at run time
by considering the current activations.
Heap Management: Heap is the unused memory space available for allocation dynamically. It
is used for data that lives indefinitely, and is freed only explicitly. The existence of such data is
independent of the procedure that created it.
Memory manager is used to keep account of the free space available in the heap area.
Functions include: Allocation, and Deallocation.
Concept of Symbol table: The information entered into the symbol table includes the string of
characters denoting the name, attributes of the name, parameters, and the offset for the name.
Definition of Symbol table: Symbol table is a data structure that contains all variables in the
program and Temporary storage and any information needed to reference or allocate storage for
them.
Concept of Programming languages with block structures: Those languages in which some
identifiers only exist inside some sections of the code, and not in others. For instance, Algol,
PL/I, C, Java, Pascal.
Concept of Non block structured language, block structured language: Compiler must carry
out the storage allocation and provide access to variables and data. Allocation can be done in two
ways.
UNIT - V
Code Generation: Issues in the design of a code generator, The Target language, Basic blocks
and flow graphs, optimization of basic blocks, a simple code generator, register allocation and
assignment, optimal code generation for expressions, dynamic programming code generation.
Code Optimization: Introduction, where and how to optimize, principle source of
optimization, function preserving transformations, loop optimizations, global flow analysis,
machine dependent optimization
Code Generation
Code Generator(CG) is the final phase in the compiler model. The input to CG is the
intermediate representation (IR) produced by the front end of the compiler along with required
symbol table information. The output of a CG is a semantically equivalent target program. Code
generation and code optimization phases are referred as backend of the compiler.
Main tasks of CG are: instruction selection, register allocation and assignment, and instruction
ordering. The output of a CG is object code which can take following forms: Absolute code,
relocatable machine code or assembly language.
Absolute machine-code can be placed in a fixed location in memory and immediately executed.
Relocatable machine-language program allows subprograms to be compiled separately. By
using a linking loader, relocatable object modules can be linked together and it can be loaded for
execution. An assembly-language program makes the process of CG somewhat easier.
Definition: Code optimization techniques consist of detecting patterns in the program and replacing these
patterns by equivalent and more efficient constructs.
Constant folding
Definition: Deducing at compile time that the value of an expression is a constant and using the
constant instead is known as constant folding.
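A minimal sketch of constant folding, using Python's own AST for convenience (a real compiler folds on its intermediate representation, and would handle nested expressions; this only folds a single top-level binary operation):

```python
import ast

def fold(expr):
    """If `expr` is a binary operation on two constants, evaluate it at
    'compile time' and return the result as text; otherwise return the
    expression unchanged."""
    tree = ast.parse(expr, mode="eval")
    node = tree.body
    if isinstance(node, ast.BinOp) and all(
        isinstance(side, ast.Constant) for side in (node.left, node.right)
    ):
        return str(eval(compile(tree, "<folded>", "eval")))
    return expr   # not a constant expression: leave it alone
```

For example, a source expression like 2 * 60 (seconds per minute times minutes) is replaced by 120 before any code is generated, while x * 60 is left for runtime.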
Inner loops
Definition: The most heavily traveled parts of a program, the inner loops, are an obvious target
for optimization. Typical loop optimizations are the removal of loop invariant computations and
the elimination of induction variables.
Code motion
Definition: Code motion is an important modification that decreases the amount of code in a
loop by moving loop-invariant computations (expressions whose value does not change inside the
loop) to just before the loop.
Local transformation & Global Transformation
Definition: A transformation of a program is called Local, if it can be performed by looking only
at the statements in a basic block otherwise it is called global.
Common Sub-expressions
Definition: An occurrence of an expression E is called a common sub-expression, if E was
previously computed, and the values of variables in E have not changed since the previous
computation.
Dead Code
Definition: A variable is live at a point in a program if its value can be used subsequently;
otherwise, it is dead at that point. A statement that computes values that never get used is
known as dead code or useless code.
Reduction in strength
Definition: Reduction in strength is the one which replaces an expensive operation by a cheaper
one such as a multiplication by an addition.
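Two classic strength reductions can be sketched on a textual operation (the rewriting rules are standard; representing the operation as strings is a simplification for the example):

```python
def reduce_strength(op, var, const):
    """Replace an expensive operation on a constant with a cheaper
    equivalent, when one is known."""
    if op == "*" and const == 2:
        return f"{var} + {var}"            # x * 2    ->  x + x
    if op == "*" and const > 0 and const & (const - 1) == 0:
        shift = const.bit_length() - 1
        return f"{var} << {shift}"         # x * 2^k  ->  x << k
    return f"{var} {op} {const}"           # no cheaper form known
```

Multiplication by two becomes an addition, and multiplication by any power of two becomes a shift; other constants are left unchanged.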
Dangling reference
Concept: A dangling reference occurs when there is storage that has been deallocated.
Definition: It is a logical error to use dangling references, since the value of deallocated storage is
undefined according to the semantics of most languages.