Unit 1
Compiler Design
© 2014 SIRTS Dr. R R JANGHEL, CS VII, Compiler, Unit 1
Translator – a program that takes as input a program written in one programming language (the Source Language) and produces as output a program in another language (the Object or Target Language).
Types of translator – Preprocessor, Compiler, Interpreter and Assembler.

Preprocessor: (#)
A preprocessor converts a high-level language into another (or the same) simplified high-level language; it does not generate target code.
It converts structured HLL to conventional HLL.
It is also responsible for the expansion of macros.
It combines source modules in different files (a skeletal source program) into a single source program.

Compiler:
A compiler is a translator that converts a high-level programming language into a low-level programming language such as assembly language.
Compilers are machine dependent.

Assembler – an assembler is a translator that converts an assembly language program into a relocatable machine language program.

Interpreter
During interpretation, the HLL program remains in source form or in a simplified intermediate-code form, and the actions implied by the program are executed by the interpreter.
Actually, an interpreter is not a translator but an executer, like a CPU.
Advantage – no overheads of program translation; smaller than a compiler.
Disadvantage – analysis of the program during interpretation, which is inefficient in loops; slower than a compiler.
An interpreter is suitable for debugging purposes. Any programming language that provides a debugging facility implies that it has an interpreter. Java has both a compiler and an interpreter.
Linker & Loader

Linker – it allows us to make a single program by linking the machine code of the user program with the machine code of library files. Library files contain relocatable machine code of routines provided by the system, available to any program that needs them.

Loader – it takes relocatable machine code, alters the relocatable addresses, and places the altered instructions and data in memory at the proper locations. This code is called absolute machine code.

Sometimes the linker and loader are collectively termed the loader.

Phases of a Compiler

A compiler operates in phases, each of which transforms the source program from one representation to another. The phases of a typical compiler are:
1. Lexical Analyzer
2. Syntax Analyzer
3. Semantic Analyzer
4. Intermediate-code Generator
5. Code Optimizer
6. Code Generator

Phases are organized into a Front End and a Back End.
The front end includes the phases that depend on the source language and are independent of the target machine: the Lexical Analyzer, Syntax Analyzer, Semantic Analyzer, Intermediate-code Generator and some portion of the Code Optimizer. The symbol-table creation and error handling of these phases are also included in the front end.
The back end includes the phases that depend on the target machine: some portion of the Code Optimizer and the Code Generator, along with the necessary symbol-table and error-handling operations.
Several phases of a compiler are generally grouped into a single unit called a Pass. Activities of phases within a pass are interleaved. There is one input file and one output file for each pass. A compiler can be structured as single-pass or multi-pass.
Lexical Analyzer

The lexical analyzer takes as input a stream of characters and gives as output a stream of tokens, which the parser uses for syntax analysis. The parser sends a "get next token" command to the scanner, which then sends the token.
Apart from tokenizing the input stream, some secondary tasks of the lexical analyzer are:
• Removing comments, white space, tab and newline characters.
• Correlating error messages with the source program.

[figure: Source Program → Lexical Analyzer ⇄ Parser (token / "get next token"), both consulting the Symbol Table – Interaction of Lexical Analyzer with Parser]

• Whitespace: a sequence of space, tab, newline, carriage-return, form-feed characters etc.
• Lexeme: a sequence of non-whitespace characters delimited by whitespace or special characters (e.g. operators like +, -, *). The character sequence forming a token is called the lexeme for the token.
A typical entry of a symbol table includes – lexeme ptr, lexeme, token, and token attributes such as type, dimension, value etc.
Examples of lexemes:
• reserved words, keywords, identifiers etc.
• each comment is usually a single lexeme
• preprocessor directives
Lexical Analyzer

The Lexical Analyzer (or Scanner) takes as input a stream of characters from the source program and groups them logically into tokens.
• Token: a sequence of characters to be treated as a single unit.
• Examples of tokens:
– Reserved words (e.g. begin, end, struct, if etc.)
– Keywords (integer, true etc.)
– Operators (+, &&, ++ etc.)
– Identifiers (variable names, procedure names, parameter names)
– Literal constants (numeric, string, character constants etc.)
– Punctuation marks (:, , etc.)
• Identification of tokens is usually done by a Deterministic Finite-state Automaton (DFA).
• The set of tokens of a language is represented by a large regular expression.
• This regular expression is fed to a lexical-analyzer generator such as Lex, Flex or ML-Lex.
• A giant DFA is created by the lexical-analyzer generator.
Lexical Analyzer – Tokens, Lexemes and Patterns

Token – a name given to a logical group of characters; the name reflects the category of the group.
Lexeme – a string of characters.
Pattern – a rule describing the set of lexemes that can represent a token. Patterns are specified by regular expressions and implemented by programming their DFAs.

Token | Lexeme | Pattern
relation | <, <=, >, >=, <> | < or <= or > or >= or <>
id | avg, count, pi, k1 | Letter followed by letters or digits
num | 378, 3.02 | Any numeric constant (integer or real)
literal | "God is great" | Any characters between " and " except "
if | if | if
const | const | const

Attributes for Tokens

A token influences the parsing decisions, and the attributes of a token influence the translation of tokens.
Typical attributes of a token are – type, value, dimension, length, line-number.
Practically a token has only one attribute – a "pointer to the symbol-table entry".
A token is generally written as <token, ptr to ST>.
Ex. E = M * V ** 2 will be written as
<id1, 123><assign_op><id2, 125><mult_op><id3, 130><exp_op><num, 2>

Symbol Table:
Pointer | Lexeme | Token | Attributes
123 | E | id1 | Type = real, Value = 3e10
125 | M | id2 | Type = real, Value = 20
130 | V | id3 | Type = real, Value = 3e8
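The token/lexeme/pattern table above maps directly onto a table-driven scanner. The following sketch (not part of the slides; the token names come from the table, the regular-expression patterns are illustrative) pairs each token name with its pattern and emits <token, lexeme> pairs:

```python
import re

# Each token name paired with a regular expression for its pattern,
# mirroring the token/lexeme/pattern table. Keywords are listed before
# "id" so that "if" is not scanned as an identifier.
TOKEN_SPEC = [
    ("num",      r"\d+(?:\.\d+)?"),        # integer or real constant
    ("if",       r"\bif\b"),
    ("const",    r"\bconst\b"),
    ("id",       r"[A-Za-z][A-Za-z0-9]*"), # letter followed by letters/digits
    ("relation", r"<=|>=|<>|<|>"),
    ("literal",  r"\"[^\"]*\""),           # characters between quotes
    ("skip",     r"[ \t\n]+"),             # whitespace: discarded, not a token
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def scan(source):
    """Return the stream of <token, lexeme> pairs for `source`."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

A real scanner generator such as Lex builds one DFA for this whole alternation; the `re` module plays that role in the sketch.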
Lexical Analyzer – Lexical Errors

In a statement like fi (x== g(y) )… a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an identifier.
If the lexical analyzer is able to match a pattern for a lexeme, it generates the token; otherwise it flags an error.

Syntax Analyzer

The stream of tokens from the lexical analyzer is passed to the next phase, the Syntax Analyzer or Parser.
This phase takes the list of tokens produced by lexical analysis and arranges them in a tree structure (called the syntax tree) that reflects the structure of the program. This phase is often called parsing.
The Syntax Analyzer groups tokens into syntactic structures, like expressions, according to the grammar of the language.
Syntactic structures are represented by a Parse Tree, whose leaves represent tokens and whose interior nodes represent strings of tokens (expressions).
The parse tree is further decomposed into a Syntax Tree – an internal representation of the syntactic structures.
Ex. For the grammar S → id := E, E → E + E | E * E | id | num
and the token stream id1 := id2 + id3 * num, the parse tree and syntax tree will be:

[parse tree: S ⇒ id1 := E, with E ⇒ E + E ⇒ id2 + (E * E) ⇒ id2 + id3 * num]
[syntax tree: := (id1, + (id2, * (id3, num)))]
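The example above can be sketched in code. The grammar as written is ambiguous, so this sketch (an illustration, not the slides' algorithm) assumes the usual precedence, * binding tighter than +, and builds the syntax tree as nested tuples:

```python
# Minimal recursive-descent sketch for the grammar
#   S -> id := E        E -> E + E | E * E | id | num
# assuming * binds tighter than +. Trees are nested tuples.

def parse(tokens):
    """tokens: a list of token names like ["id1", ":=", "id2", ...]."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        tok = tokens[pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected}, got {tok}")
        pos += 1
        return tok

    def factor():                # id | num
        return eat()

    def term():                  # factor ( * factor )*
        node = factor()
        while peek() == "*":
            eat("*")
            node = ("*", node, factor())
        return node

    def expr():                  # term ( + term )*
        node = term()
        while peek() == "+":
            eat("+")
            node = ("+", node, term())
        return node

    target = eat()               # the id on the left of :=
    eat(":=")
    return (":=", target, expr())
```

For the token stream of the slide, the result is exactly the syntax tree drawn above, with * nested below +.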
Semantic Analyzer

The most important task of the Semantic Analyzer is Type-Checking – it checks that each operator has operands of the types permitted by the source-language specification.
If the source language permits type coercions, the semantic analyzer converts operand types to suitable ones.
Other tasks of the semantic analyzer are: disambiguating overloaded operators, control-flow checking, name checks etc.
Ex. If an integer and a real are applied to * (multiplication), the semantic analyzer converts the integer to real using some internal operator inttoreal.
Thus the modified syntax tree for id1 := id2 + id3 * num will be:
[syntax tree: := (id1, + (id2, * (id3, inttoreal(num))))]

Intermediate-Code Generator

The program is translated to a simple machine-independent intermediate language.
The intermediate representation of the source program is generated by traversing the syntax tree obtained from the semantic analyzer. It generates an intermediate representation of the source program for an abstract machine.
The intermediate code should have 2 properties – easy to produce and easy to translate into the target program.
The intermediate code can be of various types, but the most common is three-address code, which is close to assembly language.
Three-address-code instructions contain at most 3 operands, typically of the form "result = op1 operator op2".
Three-address code for id1 := id2 + id3 * inttoreal (60) will be:
temp1 = inttoreal(num)
temp2 = id3 * temp1
temp3 = id2 + temp2
id1 = temp3
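The three-address code above can be produced by a post-order walk of the syntax tree. A minimal sketch, assuming trees are the nested tuples used earlier and inttoreal is a unary node:

```python
# Sketch: generate three-address code by a post-order walk of a syntax
# tree given as nested tuples, e.g.
#   (":=", "id1", ("+", "id2", ("*", "id3", ("inttoreal", "num"))))
# Each interior node gets a fresh temporary.

def gen_tac(tree):
    code = []
    counter = 0

    def new_temp():
        nonlocal counter
        counter += 1
        return f"temp{counter}"

    def walk(node):
        if isinstance(node, str):          # leaf: an id or num
            return node
        op, *kids = node
        if op == ":=":                     # assignment: target = rhs
            target, rhs = kids
            code.append(f"{target} = {walk(rhs)}")
            return target
        if op == "inttoreal":              # unary coercion operator
            t = new_temp()
            code.append(f"{t} = inttoreal({walk(kids[0])})")
            return t
        left, right = (walk(k) for k in kids)
        t = new_temp()
        code.append(f"{t} = {left} {op} {right}")
        return t

    walk(tree)
    return code
```

Applied to the modified syntax tree from the semantic analyzer, this reproduces the four instructions listed above.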
Code Optimizer

The code optimizer optimizes the code produced by the intermediate-code generator in terms of time and space. It attempts to improve the intermediate code so that faster-running machine code will result, producing better, semantically equivalent code.
Extensive optimization slows down compilation but speeds up the execution phase.
Optimized code of the previous example:
temp1 = id3 * 60.0
id1 = id2 + temp1

Code Generator

The code generator generates assembly code or relocatable machine code from the optimized intermediate code. The code generated depends on the machine and the number of registers available.
Assembly / machine code of the above optimized code will be:
MOVF id3, R2 …

Compilation Phase-to-Phase

position := initial + rate * 60
→ (Lexical Analyzer) → id1 := id2 + id3 * num
→ (Syntax Analyzer) → [syntax tree: := (id1, + (id2, * (id3, num)))]
→ (Semantic Analyzer) → [syntax tree: := (id1, + (id2, * (id3, inttoreal(num))))]
→ (Intermediate Code Generator) →
temp1 = inttoreal(num)
temp2 = id3 * temp1
temp3 = id2 + temp2
→ (Code Optimizer) → …
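The optimization shown above, folding inttoreal(num) into the constant 60.0, can be sketched as one pass over the three-address code. This sketch does constant folding and substitution only; a real optimizer would also apply copy propagation to reach the slides' two-instruction form. The constants dictionary is an assumption of the sketch:

```python
# Sketch: fold inttoreal(c) for a known integer constant c, then
# substitute the folded value wherever that temporary is used.

def fold_inttoreal(code, constants):
    """code: list of "x = ..." strings; constants: e.g. {"num": 60}."""
    value, out = {}, []
    for line in code:
        lhs, rhs = [s.strip() for s in line.split("=", 1)]
        # "inttoreal(" is 10 characters, so rhs[10:-1] is the operand.
        if rhs.startswith("inttoreal(") and rhs[10:-1] in constants:
            value[lhs] = str(float(constants[rhs[10:-1]]))
            continue                       # the instruction disappears
        # Substitute known constant values into the operands.
        out.append(lhs + " = " + " ".join(value.get(p, p) for p in rhs.split()))
    return out
```

On the four-instruction example this removes the inttoreal instruction and rewrites the multiplication to use 60.0 directly.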
Single-Pass Vs Multi-Pass

• Several phases of a compiler are generally grouped into a single unit called a Pass. Activities of phases within a pass are interleaved. There is one input file and one output file for each pass. A compiler can be structured as single-pass or multi-pass.
• The number of passes depends on the machine and the language for which the compiler is designed.
• Certain languages allow the declaration of a variable to occur after the use of that variable. Such languages require at least 2 passes.
• A multi-pass compiler requires less memory than a single-pass compiler, because the space occupied by one pass can be reused by the next pass.
• A multi-pass compiler is slower than a single-pass compiler, as it reads and writes an intermediate file during each pass.

[figures: Single-Pass Compiler – HLL program file → single pass in memory → assembly program. Multi-Pass Compiler – Pass-1 → Internal Representation-1 → Pass-2 → Internal Representation-2 → Pass-3 → …]
Bootstrapping

• Bootstrapping is the process of writing a compiler for a computer language using the language itself.
• A compiler can be characterized by three languages:
– the source language that it compiles (S),
– the implementation language (I) that it is written in,
– the target language (T) that it generates code for.
• These three languages can be quite different.
• A T-diagram is used to show a compiler with its 3 languages: S and T across the top, I below (S T / I).
• Compilers are of two kinds: native and cross.
– Native compilers are written in the same language as the target language. For example, LMM is a compiler for language L, written in a language that runs on machine M, generating output code that runs on machine M.
– Cross compilers are written in a different language than the target language. For example, LMN is a compiler for language L, running on machine M, that generates code for machine N.
• Suppose we want to write a cross-compiler LMN. For this we use an LSN compiler written in language S. We compile LSN through its native compiler SMM on machine M to get LMN.
Bootstrapping…

Bootstrapping a compiler for a computer language L using the language L itself on machine M:
• Suppose we want a compiler for language L that runs on machine M and generates code for machine M.
• First we write a small compiler SMM, where S is a subset of L and M is the assembly language of machine M. This compiler translates the subset S of language L into machine language M.
• Second, we write a compiler for the complete language L, written in the simple language S, generating assembly code M for machine M, i.e. LSM.
• We compile LSM through SMM and get LMM, which is a native compiler for language L on machine M.
[T-diagram: Bootstrapping a Compiler – LSM compiled through SMM yields LMM]

Bootstrapping a compiler to a second machine:
Let us assume that we have 2 machines M and N, whose assembly languages are M and N respectively. We want to bootstrap a compiler LLN to obtain a native compiler for L on N, i.e. LNN.
• First we obtain LMM – a native compiler for L on machine M (as explained above).
• We compile LLN through LMM and get LMN, which is a cross compiler.
• Next we again compile LLN through LMN and get LNN.
[T-diagram: Bootstrapping a compiler to a second machine – LLN through LMM yields LMN; LLN through LMN yields LNN]
Input Buffering

• Two pointers are used to read the buffer – lexeme_beginning and forward.
• The string of characters between the two pointers is the lexeme read so far.
• Initially both pointers point to the first character of the next lexeme to be found.

[figure: a buffer split into two halves of N characters each, holding E = M * V * * 2 eof, with the lexeme_beginning and forward pointers]

• The forward pointer scans ahead until a match for a pattern is found.
• If a pattern is found, the lexeme_beginning pointer moves to the beginning of the next pattern to be found, skipping all white space.
• If the lexeme_beginning pointer is in the left half and the forward pointer moves across the halfway mark, the right half is filled with N new input characters.
• If the lexeme_beginning pointer is in the right half and the forward pointer moves across the right end of the buffer, the left half is filled with N new input characters and the forward pointer wraps to the beginning of the buffer.

Limitation:
• If the lexeme_beginning pointer is in the left half and the forward pointer moves across the right end of the buffer, the left half cannot be filled with N new input characters. In this case the token cannot be recognized.
• If the lexeme_beginning pointer is in the right half and the forward pointer moves across the halfway mark by wrapping around to the left half, the right half cannot be filled with N new input characters. In this case the token cannot be recognized.
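The pointer movements described above can be sketched as follows. This is an illustration only, with a tiny N and no lexeme_beginning bookkeeping, so it shows just the refill-and-wrap logic of the forward pointer, using a NUL character as the eof sentinel:

```python
import io

# Sketch of the two-halves input buffer: each half holds N characters;
# when the forward pointer crosses into the other half, that half is
# refilled with the next N characters of input.

N = 4                     # half-buffer size (tiny, for illustration)
EOF = "\0"                # sentinel marking end of input

class TwoBufferReader:
    def __init__(self, stream):
        self.stream = stream             # any object with .read(n)
        self.buf = [EOF] * (2 * N)
        self.forward = 0
        self._fill(0)                    # preload the left half

    def _fill(self, start):
        data = self.stream.read(N)
        for i in range(N):
            self.buf[start + i] = data[i] if i < len(data) else EOF

    def next_char(self):
        ch = self.buf[self.forward]
        self.forward += 1
        if self.forward == N:            # crossed the halfway mark
            self._fill(N)                # refill the right half
        elif self.forward == 2 * N:      # crossed the right end: wrap
            self._fill(0)                # refill the left half
            self.forward = 0
        return ch
```

Because this sketch refills unconditionally, it does not exhibit the limitation described above; checking lexeme_beginning before refilling is what a real scanner adds.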
SPECIFICATION OF TOKENS

Operations on languages:
The following operations can be applied to languages:
1. Union
2. Concatenation
3. Kleene closure
4. Positive closure
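These four operations can be demonstrated on small finite languages represented as sets of strings. Since the two closures are infinite in general, the sketch truncates them at a maximum string length (an assumption of this illustration):

```python
# Sketch: the four operations on languages, on finite sets of strings.

def union(l1, l2):
    return l1 | l2

def concat(l1, l2):
    return {x + y for x in l1 for y in l2}

def kleene(l, max_len):
    """L* truncated to strings of length <= max_len."""
    result = {""}                 # epsilon is always in L*
    frontier = {""}
    while True:
        frontier = {w for w in concat(frontier, l) if len(w) <= max_len}
        if not (frontier - result):
            break
        result |= frontier
    return result

def positive(l, max_len):
    """L+ = L concatenated with L*, truncated to length <= max_len."""
    return {w for w in concat(l, kleene(l, max_len)) if len(w) <= max_len}
```

Note that the empty string belongs to L* regardless of L, but belongs to L+ only if it belongs to L.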
NFA

• A DFA is easily simulated via an algorithm.
• Every NFA can be converted to an equivalent DFA by subset construction; the resulting DFA can then be minimized (minimal in its number of states).
• There are programs that take a regular expression and produce a program, based on a minimal DFA, to recognize the strings defined by the RE:
regular expression → scanner generator → minimized DFA → DFA simulation
• You can find out more in 451 (automata theory) and/or 431 (Compiler design).
• Examples of automata in everyday machines: automatic machine tools, automatic packing machines, and automatic photo printing machines.

[figure: an automaton as a box with inputs I1 … Ip, outputs O1 … Op, and internal states q1 … qn]
Finite Automata (FA)

Analytically, a finite automaton can be represented by a 5-tuple (Q, ∑, δ, q0, F), where
1. Q is a finite nonempty set of states.
2. ∑ is a nonempty set of inputs called the input alphabet.
3. δ is a function which maps Q × ∑ into Q, usually called the direct transition function. This is the function that describes the change of state during a transition; the mapping is usually represented by a transition table or a transition diagram.
4. q0 ∈ Q is the initial state.
5. F ⊆ Q is the set of final states. There may be more than one final state.

DFA (Deterministic Finite Automata)

• An FA is also called a Finite State Machine (FSM):
– an abstract model of a computing entity;
– it decides whether to accept or reject a string;
– every regular expression can be represented as an FA and vice versa.
• Two types of FAs:
– Non-deterministic (NFA): has more than one alternative action for the same input symbol.
– Deterministic (DFA): has at most one action for a given input symbol.
• Example: how do we write a program to recognize the Java keyword "int"?
q0 —i→ q1 —n→ q2 —t→ q3
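The 5-tuple definition and the "int" example can be written out directly; in this sketch a missing δ entry acts as a dead state, so any unexpected character rejects:

```python
# The keyword-"int" recognizer as a 5-tuple:
# Q = {q0, q1, q2, q3}, Sigma = {i, n, t}, start q0, final {q3}.

Q = {"q0", "q1", "q2", "q3"}
SIGMA = {"i", "n", "t"}
DELTA = {("q0", "i"): "q1", ("q1", "n"): "q2", ("q2", "t"): "q3"}
START, FINAL = "q0", {"q3"}

def accepts(word):
    state = START
    for ch in word:
        state = DELTA.get((state, ch))   # missing entry = dead state
        if state is None:
            return False
    return state in FINAL
```

The transition table representation mentioned above is exactly the DELTA dictionary here.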
Deterministic Finite Automaton (DFA)

• A Deterministic Finite Automaton (DFA) is a special form of an NFA in which:
– no state has an ε-transition;
– for each symbol a and state s, there is at most one edge labeled a leaving s;
i.e. the transition function maps a state-symbol pair to a single state (not a set of states).

[transition graph of the DFA: start state 0; 0 —a→ 1, 0 —b→ 0, 1 —a→ 1, 1 —b→ 2, 2 —a→ 1, 2 —b→ 0]

The language recognized by this DFA is (a|b)*ab.

State/∑ | a | b
→0 | 1 | 0
1 | 1 | 2
2 | 1 | 0

Nondeterministic Finite Automaton

[transition graph of the NFA: start state 0; 0 —a→ {0, 1}, 0 —b→ 0, 1 —b→ 2]

The language recognized by this NFA is (a|b)*ab.
0 is the start state s0; F = {2} is the set of final states; ∑ = {a, b}; S = {0, 1, 2}.

Transition function:
State/∑ | a | b
→0 | 0, 1 | 0
1 | – | 2
2 | – | –
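The NFA above can be simulated by tracking the set of states the automaton could currently be in — the same idea that underlies the subset construction:

```python
# Sketch: simulating the NFA for (a|b)*ab from the transition table
# above; missing (state, symbol) entries mean the empty set.

NFA_DELTA = {
    ("0", "a"): {"0", "1"},   # state 0 on a -> {0, 1}
    ("0", "b"): {"0"},
    ("1", "b"): {"2"},
}
NFA_START, NFA_FINAL = "0", {"2"}

def nfa_accepts(word):
    current = {NFA_START}
    for ch in word:
        # Union of moves from every state we might be in.
        current = set().union(*(NFA_DELTA.get((s, ch), set()) for s in current))
    return bool(current & NFA_FINAL)
```

A string is accepted when at least one of the possible current states is final, which matches the usual definition of NFA acceptance.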
[slides 43–44: a transition diagram with input/output edge labels (1/0, 0/0; processing the input string 1011) and the transition table of an NFA:]
State/∑ | a | b
→q0 | q0, q1 | q2
q1 | q0 | q1
q2 | – | q0, q1
Solution

For the NFA M = ({q0, q1, q2}, {a, b}, δ, q0, {q2}), the equivalent deterministic automaton is
M1 = (2^Q, {a, b}, δ, [q0], F),
F = {[q2], [q0, q2], [q1, q2], [q0, q1, q2]}

State/∑ | a | b
→[q0] | [q0, q1] | [q2]
[q2] | ø | [q0, q1]
[q0, q1] | [q0, q1] | [q1, q2]
[q1, q2] | [q0] | [q0, q1]

Q.) Construct a deterministic automaton equivalent to M = ({q0, q1, q2, q3}, {0, 1}, δ, q0, {q3}), where δ is defined by its state table:

State/∑ | 0 | 1
→q0 | q0, q1 | q0
q1 | q2 | q1
q2 | q3 | q3
q3 | – | q2
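The construction used in the solution can be sketched generically and applied to the exercise NFA, with δ taken from the state table above:

```python
# Sketch: subset construction. delta maps (state, symbol) to a set of
# states; the resulting DFA table is keyed by frozensets of NFA states.

def subset_construction(delta, start, alphabet):
    start_set = frozenset([start])
    table, todo = {}, [start_set]
    while todo:
        current = todo.pop()
        if current in table:
            continue
        table[current] = {}
        for sym in alphabet:
            # Union of the NFA moves of every state in the subset.
            nxt = frozenset().union(*(delta.get((q, sym), set()) for q in current))
            table[current][sym] = nxt
            if nxt not in table:
                todo.append(nxt)
    return table

# The exercise NFA M = ({q0..q3}, {0, 1}, delta, q0, {q3}):
DELTA = {
    ("q0", "0"): {"q0", "q1"}, ("q0", "1"): {"q0"},
    ("q1", "0"): {"q2"},       ("q1", "1"): {"q1"},
    ("q2", "0"): {"q3"},       ("q2", "1"): {"q3"},
    ("q3", "1"): {"q2"},
}
DFA = subset_construction(DELTA, "q0", ["0", "1"])
```

A DFA state is final exactly when the subset it represents contains q3.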
Thompson's Construction

NFA for r1 | r2:
[figure: a new initial state i with ε-moves to the initial states of N(r1) and N(r2); the final states of N(r1) and N(r2) have ε-moves to a new final state f]
Thompson's Construction (cont.)

• For the regular expression r1 r2:
[figure: i → N(r1) → N(r2) → f; the NFAs are chained, and the final state of N(r2) becomes the final state of N(r1 r2)]
• For the regular expression r*:
[figure: new states i and f; ε-moves from i into N(r) and directly to f, and from the final state of N(r) back into N(r) and out to f]

Thompson's Construction (Example – (a|b)*a)

[figures building the NFA step by step: a; b; (a|b); (a|b)*; (a|b)*a]
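Thompson's construction can be sketched compactly if the regular expression is supplied in postfix form with an explicit concatenation operator "." — an input convention assumed by this sketch, not notation used on the slides. In that form (a|b)*a becomes "ab|*a.":

```python
# Sketch of Thompson's construction: build an NFA with epsilon-moves
# from a postfix regular expression, then match by epsilon-closure.

def thompson(postfix):
    """Return (start, final, eps, trans): eps maps a state to its set of
    epsilon-successors, trans maps (state, char) to a state."""
    eps, trans, counter, stack = {}, {}, [0], []

    def new_state():
        counter[0] += 1
        eps.setdefault(counter[0], set())
        return counter[0]

    for ch in postfix:
        if ch == ".":                       # concatenation r1 r2
            s2, f2 = stack.pop()
            s1, f1 = stack.pop()
            eps[f1].add(s2)
            stack.append((s1, f2))
        elif ch == "|":                     # union r1 | r2
            s2, f2 = stack.pop()
            s1, f1 = stack.pop()
            s, f = new_state(), new_state()
            eps[s] |= {s1, s2}
            eps[f1].add(f); eps[f2].add(f)
            stack.append((s, f))
        elif ch == "*":                     # Kleene star
            s1, f1 = stack.pop()
            s, f = new_state(), new_state()
            eps[s] |= {s1, f}               # enter N(r), or skip it
            eps[f1] |= {s1, f}              # loop back, or leave
            stack.append((s, f))
        else:                               # a single input character
            s, f = new_state(), new_state()
            trans[(s, ch)] = f
            stack.append((s, f))
    return (*stack.pop(), eps, trans)

def closure(states, eps):
    todo, seen = list(states), set(states)
    while todo:
        for n in eps[todo.pop()]:
            if n not in seen:
                seen.add(n); todo.append(n)
    return seen

def matches(postfix, word):
    start, final, eps, trans = thompson(postfix)
    current = closure({start}, eps)
    for ch in word:
        moved = {trans[(s, ch)] for s in current if (s, ch) in trans}
        current = closure(moved, eps)
    return final in current
```

Each operator case mirrors one of the figures above: union and star introduce fresh start/final states joined by ε-moves, and concatenation splices the two machines with a single ε-move.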
[figures: transition diagrams for a (start 0 —a→ 1), a+, and (a|b)*]
TRANSITION SYSTEM CONTAINING Λ-MOVES

[slides 53–56: figures only]
Conversion from Regular Expression (Transition Diagram) to Finite Automata

Q.) Construct the finite automaton equivalent to the regular expression (0 + 1)*(00 + 11)(0 + 1)*

Answer:
Step 1 (construction of the transition graph): First of all we construct the transition graph with Λ-moves.
Step 2: We eliminate the concatenations in the given r.e. by introducing new vertices q1 and q2.
Step 3: We eliminate the * operations in Figure 2 by introducing two new vertices q5 and q6 and the Λ-moves shown in Figure 3.
Step 4: We eliminate the concatenations and + in Fig. 3 and get Fig. 4.
Step 5: We eliminate the Λ-moves in Fig. 4 and get Fig. 5, which gives the NDFA equivalent to the given r.e.
Step 6 (construction of the DFA): We construct the transition table for the NDFA:

State/∑ | 0 | 1
→q0 | q0, q3 | q0, q4
q3 | qf | –
q4 | – | qf
qf | qf | qf

Step 7: Transition table for the DFA:

State/∑ | 0 | 1
→[q0] | [q0, q3] | [q0, q4]
[q0, q3] | [q0, q3, qf] | [q0, q4]
[q0, q4] | [q0, q3] | [q0, q4, qf]
[q0, q4, qf] | [q0, q3, qf] | [q0, q4, qf]
[q0, q3, qf] | [q0, q3, qf] | [q0, q4, qf]
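The DFA of Step 7 can be checked by direct simulation: the regular expression (0 + 1)*(00 + 11)(0 + 1)* describes exactly the binary strings containing "00" or "11". The state names below abbreviate the subsets in the table (e.g. q0q3qf for [q0, q3, qf]):

```python
# Sketch: simulating the Step-7 DFA. A state is final exactly when its
# subset contains qf.

DFA_00_11 = {
    "q0":     {"0": "q0q3",   "1": "q0q4"},
    "q0q3":   {"0": "q0q3qf", "1": "q0q4"},
    "q0q4":   {"0": "q0q3",   "1": "q0q4qf"},
    "q0q4qf": {"0": "q0q3qf", "1": "q0q4qf"},
    "q0q3qf": {"0": "q0q3qf", "1": "q0q4qf"},
}
FINALS = {"q0q3qf", "q0q4qf"}

def dfa_accepts(word):
    state = "q0"
    for ch in word:
        state = DFA_00_11[state][ch]
    return state in FINALS
```

Intuitively, q3 remembers "the previous symbol was 0", q4 remembers "the previous symbol was 1", and qf records that a double symbol has been seen.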
Step 8: The state diagram for the required DFA is given below.
[figure]
Step 9: [figure]
Transition Table for the NDFA:

State/∑ | 0 | 1
→q0 | q3 | q1, q2
q1 | qf | –
q2 | – | q3
q3 | q3 | qf
qf | – | –

Transition Table for the DFA:

State/∑ | 0 | 1
→q0 | q3 | q1, q2
q3 | q3 | qf
q1, q2 | qf | q3
qf | Ø | Ø
Ø | Ø | Ø