Purpose of Translator Different Types of Translators: Compiler
Compiler
A compiler is a translator that converts a high-level programming language into a low-level programming language. It converts the whole program in one session and reports any errors detected after the conversion. A compiler takes time to do its work, as it translates the high-level code to lower-level code all at once and then saves the result to memory.
A compiler is processor-dependent and platform-dependent. This limitation has been addressed by special kinds of compiler: the cross-compiler and the source-to-source compiler. Before choosing a compiler, the user first has to identify the Instruction Set Architecture (ISA), the operating system (OS), and the programming language that will be used, to ensure compatibility.
Interpreter
Just like a compiler, an interpreter is a translator used to convert a high-level programming language to a low-level programming language. It converts the program one statement at a time and reports errors as soon as they are detected, while doing the conversion. This makes errors easier to detect than with a compiler. An interpreter also begins running a program sooner than a compiler, as it executes each statement immediately upon reading it.
It is often used as a debugging tool during software development, as it can execute a single line of code at a time. An interpreter is also more portable than a compiler: it is not processor-dependent, so you can work across hardware architectures.
Assembler
An assembler is a translator used to translate assembly language into machine language. It is like a compiler for assembly language, but interactive in the way an interpreter is. Assembly language is difficult to understand, as it is a low-level programming language. An assembler translates a low-level language, assembly language, into an even lower-level language: machine code. The machine code can be directly understood by the CPU.
Examples of Translators
Here are some examples of translators per type:

Translator – Examples
Compiler – OCaml
Interpreter – List Processing (LISP), Python

Here are some advantages of the Interpreter:
You discover errors before you complete the program, so you learn from your mistakes.
The program can be run before it is completed, so you get partial results immediately.
You can work on small parts of the program and link them later into a whole program.

Here are some disadvantages of the Interpreter:
Interpreted code runs more slowly, since every statement must be translated each time it is executed.
The interpreter itself must be present on the machine whenever the program runs.
Preprocessor
A preprocessor, generally considered a part of the compiler, is a tool that produces input for compilers. It deals with macro processing, augmentation, file inclusion, language extension, etc.
Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine language.
The difference lies in the way they read the source code or input. A compiler reads the whole
source code at once, creates tokens, checks semantics, generates intermediate code, executes
the whole program and may involve many passes. In contrast, an interpreter reads a statement
from the input, converts it to an intermediate code, executes it, then takes the next statement in
sequence. If an error occurs, an interpreter stops execution and reports it, whereas a compiler reads the whole program even if it encounters several errors.
Assembler
An assembler translates assembly language programs into machine code. The output of an assembler is called an object file, which contains a combination of machine instructions as well as the data required to place these instructions in memory.
Linker
A linker is a computer program that links and merges various object files together in order to make an executable file. All these files might have been compiled by separate assemblers. The major task of a linker is to search for and locate the referenced modules/routines in a program and to determine the memory locations where these codes will be loaded, so that the program instructions have absolute references.
Loader
The loader is a part of the operating system and is responsible for loading executable files into memory and executing them. It calculates the size of a program (instructions and data) and creates memory space for it. It also initializes various registers to initiate execution.
Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable code for platform (B)
is called a cross-compiler.
Source-to-source Compiler
A compiler that takes the source code of one programming language and translates it into the
source code of another programming language is called a source-to-source compiler.
Compiler Architecture
A compiler can broadly be divided into two phases based on the way they compile.
Analysis Phase
Known as the front-end of the compiler, the analysis phase reads the source program, divides it into core parts, and then checks for lexical, grammar, and syntax errors. The analysis phase generates an intermediate representation of the source program and a symbol table, which are fed to the synthesis phase as input.
Synthesis Phase
Known as the back-end of the compiler, the synthesis phase generates the target program with
the help of intermediate source code representation and symbol table.
A compiler can have many phases and passes.
Pass : A pass refers to the traversal of a compiler through the entire program.
Phase : A phase of a compiler is a distinguishable stage, which takes input from the
previous stage, processes and yields output that can be used as input for the next stage. A
pass can have more than one phase.
Phases of Compiler
The compilation process is a sequence of various phases. Each phase takes input from its
previous stage, has its own representation of source program, and feeds its output to the next
phase of the compiler. Let us understand the phases of a compiler.
Lexical Analysis
The first phase of the compiler works as a text scanner. This phase scans the source code as a stream of characters and converts it into meaningful lexemes. The lexical analyzer represents these lexemes in the form of tokens as:
<token-name, attribute-value>
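To make this token representation concrete, here is a minimal tokenizer sketch in Python; the token names and patterns are invented for illustration, not taken from any real compiler. Each lexeme is matched against a pattern and emitted as a (token-name, attribute-value) pair.

```python
import re

# A minimal lexical-analysis sketch (illustrative only): each lexeme is
# matched against a pattern and emitted as a <token-name, attribute-value>
# pair, as described above.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("STAR",   r"\*"),
    ("SKIP",   r"\s+"),   # whitespace is discarded, not tokenized
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(code):
    tokens = []
    for m in MASTER.finditer(code):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("total = count + 10"))
# [('ID', 'total'), ('ASSIGN', '='), ('ID', 'count'), ('PLUS', '+'), ('NUMBER', '10')]
```

Note how whitespace matches the SKIP pattern and is simply dropped, mirroring the removal of extra spaces mentioned later in the lexical-analysis discussion.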
Syntax Analysis
The next phase is called the syntax analysis or parsing. It takes the token produced by lexical
analysis as input and generates a parse tree (or syntax tree). In this phase, token arrangements
are checked against the source code grammar, i.e. the parser checks if the expression made by
the tokens is syntactically correct.
Semantic Analysis
Semantic analysis checks whether the parse tree constructed follows the rules of the language, for example that values are assigned between compatible data types; adding a string to an integer would be flagged as an error.
Also, the semantic analyzer keeps track of identifiers, their types and expressions; whether
identifiers are declared before use or not etc. The semantic analyzer produces an annotated
syntax tree as an output.
Code Optimization
The next phase does code optimization of the intermediate code. Optimization can be assumed
as something that removes unnecessary code lines, and arranges the sequence of statements in
order to speed up the program execution without wasting resources (CPU, memory).
Code Generation
In this phase, the code generator takes the optimized representation of the intermediate code and
maps it to the target machine language. The code generator translates the intermediate code into
a sequence of (generally) re-locatable machine code. Sequence of instructions of machine code
performs the task as the intermediate code would do.
Symbol Table
It is a data structure maintained throughout all the phases of a compiler. All the identifiers' names along with their types are stored here. The symbol table makes it easier for the compiler to quickly search for an identifier record and retrieve it. The symbol table is also used for scope management.
Organization of the Compiler
The C compilation system consists of a compiler, an assembler, and a link editor. The cc command invokes each of these
components automatically unless you use command-line options to specify otherwise.
1) Preprocessor
It converts the HLL (high-level language) source into pure high-level language: it includes all the header files and expands any macros that are used. It is optional, because a preprocessor is not needed for a language that does not support #include or macros.
2) Compiler
It takes pure high-level language as input and converts it into assembly code.
3) Assembler
It takes the assembly code as input and converts it into object code (machine code).
4) Linker
It combines all the executable object modules into a single executable file, searching for and locating the referenced modules/routines.
5) Loader
It loads the executable file into memory for execution. Its functions include:
1. Allocation:
Getting memory portions from the operating system and storing the object data there.
2. Relocation:
Mapping relative addresses to physical addresses and relocating the object code.
What is Lexical Analysis?
Lexical analysis is the very first phase in compiler design. A lexer takes the modified source code, which is written in the form of sentences. In other words, it helps you convert a sequence of characters into a sequence of tokens. The lexical analyzer breaks the syntax into a series of tokens, and it removes any extra spaces or comments written in the source code.
Programs that perform lexical analysis in compiler design are called lexical analyzers or lexers. A lexer contains a tokenizer or scanner. If the lexical analyzer detects that a token is invalid, it generates an error. The role of the lexical analyzer in compiler design is to read character streams from the source code, check for legal tokens, and pass the data to the syntax analyzer when it demands it.
Example
How Pleasant Is The Weather?
See this lexical analysis example: here, we can easily recognize that there are five words: How, Pleasant, Is, The, Weather. This is very natural for us, as we can recognize the separators, blanks, and the punctuation symbol.
HowPl easantIs Th ewe ather?
Now check this example: we can still read it, but it takes some time, because the separators are put in odd places. It is not something which comes to you immediately.
Basic Terminologies
What’s a lexeme?
A lexeme is a sequence of characters in the source program that matches the pattern for a token. It is nothing but an instance of a token.
What’s a token?
Tokens in compiler design are the sequence of characters which represents a unit of information
in the source program.
What is Pattern?
A pattern is a description of the form that the lexemes of a token may take. In the case of a keyword used as a token, the pattern is simply the sequence of characters that forms the keyword.
Lexical analyzer scans the entire source code of the program. It identifies each token one by one.
Scanners are usually implemented to produce tokens only when requested by a parser. Here is how the recognition of tokens in compiler design works:
Lexical Analyzer Architecture
1. “Get next token” is a command which is sent from the parser to the lexical analyzer.
2. On receiving this command, the lexical analyzer scans the input until it finds the next
token.
3. It returns the token to Parser.
Lexical Analyzer skips whitespaces and comments while creating these tokens. If any error is
present, then Lexical analyzer will correlate that error with the source file and line number.
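The "get next token" protocol above can be sketched as a Python generator; the token categories and the comment syntax used here are simplifying assumptions made for the example. Each next() call from the parser yields exactly one token, with whitespace and comments skipped along the way.

```python
# Illustrative sketch of the "get next token" protocol: the lexer is a
# generator, and each next() call from the parser yields one token,
# skipping whitespace and line comments along the way.
def lexer(source):
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():                  # skip whitespace
            i += 1
        elif source.startswith("//", i):  # skip a line comment
            i = source.find("\n", i)
            if i == -1:
                break
        elif ch.isdigit():
            j = i
            while j < len(source) and source[j].isdigit():
                j += 1
            yield ("NUMBER", source[i:j])
            i = j
        elif ch.isalpha():
            j = i
            while j < len(source) and source[j].isalnum():
                j += 1
            yield ("ID", source[i:j])
            i = j
        else:
            yield ("OP", ch)
            i += 1

toks = lexer("x = 42 // answer")
print(next(toks))  # ('ID', 'x')   <- the parser's first "get next token"
print(next(toks))  # ('OP', '=')
print(next(toks))  # ('NUMBER', '42')
```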
Lexical Errors
A character sequence which is not possible to scan into any valid token is a lexical error.
Important facts about the lexical error:
Lexical errors are not very common, but they should be managed by the scanner.
Misspellings of identifiers, operators, and keywords are considered lexical errors.
Generally, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token.
Lexeme: A lexeme is the lowest level syntactic unit of a language (e.g., total, start).
Keywords and reserved words – It is an identifier which is used as a fixed part of the
syntax of a statement. It is a reserved word which you can’t use as a variable name or
identifier.
Noise words – Noise words are optional which are inserted in a statement to enhance the
readability of the sentence.
Comments – A very important part of the documentation, mostly marked by /* */ or //.
Blanks (spaces) – Separate tokens and make the program readable.
Delimiters – A syntactic element which marks the start or end of some syntactic unit, like a statement or expression: "begin"…"end", or {}.
Identifiers – Restrictions on identifier length can reduce the readability of the program.
A parser also checks that the input string is well-formed and, if it is not, rejects it.
Important tasks performed by the parser in compiler design include building the parse tree, reporting and recovering from syntax errors, and helping to build the symbol table.
Parsing Techniques
Parsing techniques are divided into two different groups:
Top-Down Parsing,
Bottom-Up Parsing
Top-Down Parsing:
In top-down parsing, construction of the parse tree starts at the root and then proceeds towards the leaves.
1. Predictive Parsing:
A predictive parser can predict which production should be used to replace the specific input string. It uses a look-ahead pointer, which points to the next input symbols. Backtracking is not an issue with this parsing technique. It is known as an LL(1) parser.
2. Recursive Descent Parsing:
This parsing technique recursively parses the input to make a parse tree. It consists of several small functions, one for each nonterminal in the grammar.
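A minimal recursive-descent sketch in Python may help; the grammar (expr -> term ('+' term)*, term -> NUMBER) and the function names are chosen just for illustration. Note the structure: one function per nonterminal, one token of look-ahead, no backtracking.

```python
# A tiny recursive-descent sketch for the grammar:
#   expr -> term ('+' term)*
#   term -> NUMBER
# One function per nonterminal; the current position acts as the
# single token of look-ahead, so no backtracking is needed.
def parse_expr(tokens, pos=0):
    value, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        rhs, pos = parse_term(tokens, pos + 1)
        value += rhs
    return value, pos

def parse_term(tokens, pos):
    tok = tokens[pos]
    if not tok.isdigit():
        raise SyntaxError(f"expected a number, got {tok!r}")
    return int(tok), pos + 1

print(parse_expr(["1", "+", "2", "+", "3"]))  # (6, 5)
```

Here the parser evaluates the expression as it goes; a real compiler would build parse-tree nodes instead of computing a value.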
Bottom-Up Parsing:
In bottom-up parsing in compiler design, the construction of the parse tree starts with the leaves and then proceeds towards the root. It is also called shift-reduce parsing. This type of parsing in compiler design is often created with the help of software tools.
A parser should be able to detect and report any error found in the program, and whenever an error occurs, it should be able to handle it and carry on parsing the remaining input. A program can have the following types of errors at various stages of the compilation process. There are five common error-recovery methods which can be implemented in the parser:
Panic-Mode Recovery
When the parser encounters an error, this mode ignores the rest of the statement, not processing the input from the erroneous point until a delimiter, such as a semicolon, is reached. This is a simple error-recovery method.
In this type of recovery, the parser discards input symbols one by one until one of a designated set of synchronizing tokens is found; the synchronizing tokens are generally delimiters, such as a semicolon.
Phrase-Level Recovery:
The compiler corrects the program by inserting or deleting tokens, allowing it to proceed to parse from where it was. It performs correction on the remaining input: it can replace a prefix of the remaining input with some string, which helps the parser to continue the process.
Error Productions
Error-production recovery expands the grammar for the language so that it also generates the erroneous constructs. The parser then performs error diagnostics for those constructs.
Global Correction:
The compiler should make as few changes as possible while processing an incorrect input string. Given an incorrect input string a and a grammar G, the algorithm searches for a parse tree for a related string b, such that the number of insertions, deletions, and modifications of tokens needed to transform a into b is as small as possible.
Grammar:
A grammar is a set of structural rules which describe a language. Grammars assign structure to any sentence. The term also refers to the study of these rules, a field that includes morphology, phonology, and syntax. A grammar is capable of describing much of the syntax of programming languages.
A non-terminal symbol should appear on the left-hand side of at least one production.
The goal symbol should never appear on the right-hand side of the ::= of any production.
A rule is recursive if its LHS appears in its RHS.
Notational Conventions
An optional element may be indicated by enclosing the element in square brackets, [ … ]. An arbitrary sequence of instances of an element can be indicated by enclosing the element in braces followed by an asterisk symbol, { … }*. A choice among alternatives may be indicated with the symbol | within a single rule, and parentheses, ( … ), may be used for grouping when needed.
Two types of symbols appear in these notational conventions: terminals and non-terminals.
1. Terminals: the basic symbols of the language from which strings are formed; they cannot be subdivided further.
2. Nonterminals: syntactic variables that denote sets of strings and define the structure of the grammar.
Error handling is concerned with failures due to many causes: errors in the compiler or its environment
(hardware, operating system), design errors in the program being compiled, an incomplete understanding of
the source language, transcription errors, incorrect data, etc. The tasks of the error handling process are to
detect each error, report it to the user, and possibly make some repair to allow processing to continue. It
cannot generally determine the cause of the error, but can only diagnose the visible symptoms. Similarly,
any repair cannot be considered a correction (in the sense that it carries out the user’s intent); it merely
neutralizes the symptom so that processing may continue.
Type Systems
SEMANTIC CHECKING. A compiler must perform many semantic checks on a source program.
Many of these are straightforward and are usually done in conjunction with syntax analysis.
Checks done during compilation are called static checks to distinguish from the dynamic
checks done during the execution of the target program.
Static checks include:
Type checks.
The compiler checks that names and values are used in accordance with type rules of
the language.
Type conversions.
Detection of implicit type conversions.
Dereferencing checks.
The compiler checks that dereferencing is applied only to a pointer.
Indexing checks.
The compiler checks that indexing is applied only to an array.
Function call checks.
The compiler checks that a function (or procedure) is applied to the correct number and
type of arguments.
Uniqueness checks.
In many cases an identifier must be declared exactly once.
Flow-of-control checks.
The compiler checks that if a statement causes the flow of control to leave a construct,
then there is a place to transfer this flow. For instance when using break in C.
Symbol Table is an important data structure created and maintained by the compiler in order to keep track of the semantics of variables: it stores information about scope and binding for names, and about instances of various entities such as variable and function names, classes, objects, etc.
It is built in the lexical and syntax analysis phases.
The information is collected by the analysis phases of the compiler and is used
by the synthesis phases of the compiler to generate code.
It is used by the compiler to achieve compile-time efficiency.
It is used by various phases of the compiler as follows:-
1. Lexical Analysis: Creates new entries in the table, for example
entries for tokens.
2. Syntax Analysis: Adds information regarding attribute type, scope,
dimension, line of reference, use, etc in the table.
3. Semantic Analysis: Uses available information in the table to check
for semantics i.e. to verify that expressions and assignments are
semantically correct(type checking) and update it accordingly.
4. Intermediate Code generation: Refers to the symbol table to know how
much run-time space is allocated and of what type; the table also helps
in adding temporary variable information.
5. Code Optimization: Uses information present in the symbol table for
machine-dependent optimization.
6. Target Code generation: Generates code by using address
information of identifier present in the table.
Symbol Table entries – Each entry in the symbol table is associated with attributes that
support the compiler in different phases.
Items stored in Symbol table:
Variable names and constants
Procedure and function names
Literal constants and strings
Compiler generated temporaries
Labels in source languages
Information used by the compiler from Symbol table:
Data type and name
Declaring procedures
Offset in storage
For a structure or record, a pointer to the structure table.
For parameters, whether parameter passing by value or by reference
Number and type of arguments passed to function
Base Address
Operations of Symbol table – The basic operations defined on a symbol table include insert (adding a name with its attributes) and lookup (searching for a name and retrieving its information).
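These basic operations, together with simple scope management, can be sketched in Python. The class design below, a stack of dictionaries, is an assumption made for illustration, not the data structure of any real compiler.

```python
# A minimal symbol-table sketch with the two basic operations, insert
# and lookup, plus simple scope management using a stack of
# dictionaries (an assumed, simplified design).
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]          # start with the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def insert(self, name, attrs):
        self.scopes[-1][name] = attrs

    def lookup(self, name):
        # search the innermost scope first, then enclosing scopes
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None

st = SymbolTable()
st.insert("x", {"type": "int", "offset": 0})
st.enter_scope()
st.insert("x", {"type": "float", "offset": 4})
print(st.lookup("x"))   # {'type': 'float', 'offset': 4}  (inner x shadows outer)
st.exit_scope()
print(st.lookup("x"))   # {'type': 'int', 'offset': 0}
```

The reversed-scope search is what makes an inner declaration shadow an outer one, which is the essence of the scope management mentioned above.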
INTERMEDIATE CODE
Intermediate code is the interface between the front-end and the back-end of a compiler.
Ideally the details of the source language are confined to the front-end and the details of the target machine to the back-end.
Source code could be translated directly into target machine code, so why do we need to translate the source code into an intermediate code which is then translated into its target code?
Why Intermediate Code?
If a compiler translates the source language to its target machine language without
having the option for generating intermediate code, then for each new machine, a
full native compiler is required
Intermediate code eliminates the need for a new full compiler for every unique machine by keeping the analysis portion the same for all compilers.
The second part of the compiler, synthesis, is changed according to the target machine.
It also becomes easier to apply source-code modifications that improve performance, by applying code-optimization techniques to the intermediate code.
Why Intermediate Code?
While generating machine code directly from source code is possible, it entails
problems
With m languages and n target machines, we would need to write m front ends, m x n optimizers, and m x n code generators.
The code optimizer, one of the largest and most difficult-to-write components of a compiler, could not be reused.
By converting source code to an intermediate code, a machine-independent code optimizer may be written.
This means just m front ends, n code generators, and 1 optimizer.
Intermediate Representation
Intermediate codes can be represented in a variety of ways and they have their own
benefits
High Level IR - High-level intermediate code representation is very close to the
source language itself. They can be easily generated from the source code and we
can easily apply code modifications to enhance performance. But for target machine
optimization, it is less preferred
Low Level IR -This one is close to the target machine, which makes it suitable for
register and memory allocation, instruction set selection, etc. It is good for machine-
dependent optimizations
Intermediate code can be either language specific (e.g., Byte Code for Java) or
language independent (three-address code).
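As a concrete illustration of language-independent three-address code, the Python sketch below breaks a = b + c * d into instructions with at most one operator each, using compiler-generated temporaries. The helper names (new_temp, emit) are invented for the example.

```python
# An illustrative sketch of three-address code: each instruction has at
# most one operator on the right-hand side, so compound expressions are
# broken down using compiler-generated temporaries (t1, t2, ...).
temp_count = 0
code = []

def new_temp():
    global temp_count
    temp_count += 1
    return f"t{temp_count}"

def emit(dest, lhs, op, rhs):
    code.append(f"{dest} = {lhs} {op} {rhs}")
    return dest

# a = b + c * d  becomes:
t1 = emit(new_temp(), "c", "*", "d")   # t1 = c * d
t2 = emit(new_temp(), "b", "+", t1)    # t2 = b + t1
code.append(f"a = {t2}")
print("\n".join(code))
# t1 = c * d
# t2 = b + t1
# a = t2
```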
Intermediate Code Generation
Intermediate code must be easy to produce and easy to translate to machine code.
It is a sort of universal assembly language: it should not contain any machine-specific parameters (registers, addresses, etc.).
Intermediate code is commonly represented as three-address code, but the type of intermediate code implementation is up to the compiler designer.
Quadruples, triples, and indirect triples are the classical forms used for machine-independent optimizations and machine code generation.
Static Single Assignment form (SSA) is a more recent form that enables more effective optimizations.
Quadruples
Has four fields: op, arg1, arg2 and result
Temporaries are used
Triples
Temporaries are not used and instead references to instructions are made
Indirect triples
In addition to triples we use a list of pointers to triples
Quadruples
In quadruples representation, there are four fields for each instruction: op, arg1,
arg2, result
Binary ops have the obvious representation.
Unary ops do not use arg2.
Operators like param use neither arg2 nor result.
Jumps put the target label into the result field.
As an example, quadruples can implement the three-address code for the expression a = b * (-c) + b * (-c).
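That classical example can be written out as (op, arg1, arg2, result) records, sketched here as Python tuples; "minus" denotes unary minus, so its arg2 field is empty.

```python
# Quadruples for a = b * (-c) + b * (-c), as (op, arg1, arg2, result)
# records; "minus" is unary minus, so arg2 is left empty (None).
quads = [
    ("minus", "c",  None, "t1"),   # t1 = -c
    ("*",     "b",  "t1", "t2"),   # t2 = b * t1
    ("minus", "c",  None, "t3"),   # t3 = -c
    ("*",     "b",  "t3", "t4"),   # t4 = b * t3
    ("+",     "t2", "t4", "t5"),   # t5 = t2 + t4
    ("=",     "t5", None, "a"),    # a  = t5
]
for op, a1, a2, res in quads:
    print(op, a1, a2, res)
```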
Triples
Triples have only three fields for each instruction: op, arg1, and arg2.
The result of an operation x op y is referred to by its position.
Triples are equivalent to signatures of nodes in a DAG or syntax tree.
Triples and DAGs are equivalent representations only for expressions.
A ternary operation like X[i] = Y requires two entries in the triple structure; similarly for Y = X[i].
Indirect Triples
These consist of a listing of pointers to triples, rather than a listing of the triples themselves.
The triples themselves consist of three fields: op, arg1, and arg2.
The arg1 or arg2 fields can be pointers (references to other triples).
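For comparison, the same expression a = b * (-c) + b * (-c) can be sketched in triples and indirect triples; here list positions stand in for temporaries, and plain indices stand in for pointers (simplifications made for illustration).

```python
# The expression a = b * (-c) + b * (-c) in triples: no temporaries;
# results are referred to by instruction position, shown as (0), (1), ...
triples = [
    ("minus", "c", None),   # (0) -c
    ("*",     "b", 0),      # (1) b * (0)
    ("minus", "c", None),   # (2) -c
    ("*",     "b", 2),      # (3) b * (2)
    ("+",     1,   3),      # (4) (1) + (3)
    ("=",     "a", 4),      # (5) a = (4)
]

# Indirect triples add a list of pointers (here, indices) into the
# triple table, so instructions can be reordered by permuting this list
# without renumbering the position references inside the triples.
instruction_list = [0, 1, 2, 3, 4, 5]
for idx in instruction_list:
    print(idx, triples[idx])
```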
Run-Time Environment
A translation needs to relate the static source text of a program to the dynamic actions that must occur at runtime to implement the program. The program consists of names for procedures, identifiers, etc., that require mapping to actual memory locations at runtime.
Runtime environment is a state of the target machine, which may include software
libraries, environment variables, etc., to provide services to the processes running in the
system.
SOURCE LANGUAGE ISSUES
Activation Tree
A program consists of procedures. A procedure definition is a declaration that, in its simplest form, associates an identifier (the procedure name) with a statement (the body of the procedure). Each execution of a procedure is referred to as an activation of the procedure. The lifetime of an activation is the sequence of steps present in the execution of the procedure. If 'a' and 'b' are two procedures, then their activations will be either non-overlapping (when one is called after the other) or nested (for nested calls). A procedure is recursive if a new activation begins before an earlier activation of the same procedure has ended. An activation tree shows the way control enters and leaves activations.
Properties of activation trees are :-
Each node represents an activation of a procedure.
The root shows the activation of the main function.
The node for procedure ‘x’ is the parent of node for procedure ‘y’ if and only if
the control flows from procedure x to procedure y.
Example – Consider the following program of Quicksort
main() {
    int n;
    readarray();
    quicksort(1, n);
}
quicksort(int m, int n) {
    int i = partition(m, n);
    quicksort(m, i - 1);
    quicksort(i + 1, n);
}
The activation tree for this program will be:
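The activation tree can also be traced at run time. The Python sketch below, with a concrete partition and input array invented for the example, prints each activation indented by its nesting depth, so nested calls appear as children and sequential calls as siblings.

```python
# A runnable sketch of the quicksort example: each call prints its
# activation indented by nesting depth, tracing the activation tree.
def partition(a, m, n):
    # Lomuto partition: a[n] is the pivot; returns its final index.
    pivot = a[n]
    i = m - 1
    for j in range(m, n):
        if a[j] <= pivot:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[n] = a[n], a[i + 1]
    return i + 1

def quicksort(a, m, n, depth=0):
    print("  " * depth + f"quicksort({m},{n})")
    if m < n:
        i = partition(a, m, n)
        quicksort(a, m, i - 1, depth + 1)   # left child activation
        quicksort(a, i + 1, n, depth + 1)   # right child activation

data = [9, 3, 7, 1]
quicksort(data, 0, len(data) - 1)   # root of the activation tree
print(data)                          # [1, 3, 7, 9]
```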
Function of preprocessor
A preprocessor produces input for compilers. It may perform the following functions.
1. Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessor: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.
4. Language extensions: These preprocessors attempt to add capabilities to the language by means of built-in macros.
Transition Table
The transition table is basically a tabular representation of the transition function. It takes two
arguments (a state and a symbol) and returns a state (the "next state").
Example 1:
Solution:
Present State   0    1
→q0             q1   q2
q1              q0   q2
*q2             q2   q2
Explanation:
o In the above table, the first column indicates all the current states. Under column 0 and 1,
the next states are shown.
o The first row of the transition table can be read as, when the current state is q0, on input 0
the next state will be q1 and on input 1 the next state will be q2.
o In the second row, when the current state is q1, on input 0, the next state will be q0, and on
1 input the next state will be q2.
o In the third row, when the current state is q2 on input 0, the next state will be q2, and on 1
input the next state will be q2.
o The arrow marked to q0 indicates that it is a start state and circle marked to q2 indicates
that it is a final state.
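The Example 1 table can be turned directly into a transition-function dictionary, so that delta[(state, symbol)] gives the next state. The Python sketch below simulates the DFA over an input string; the helper name accepts is invented for illustration.

```python
# The Example 1 transition table as a dict: delta[(state, symbol)]
# gives the next state, so simulating the DFA is a simple fold over
# the characters of the input string.
delta = {
    ("q0", "0"): "q1", ("q0", "1"): "q2",
    ("q1", "0"): "q0", ("q1", "1"): "q2",
    ("q2", "0"): "q2", ("q2", "1"): "q2",
}
start, finals = "q0", {"q2"}

def accepts(s):
    state = start
    for ch in s:
        state = delta[(state, ch)]
    return state in finals

print(accepts("01"))   # True  (q0 -0-> q1 -1-> q2, a final state)
print(accepts("00"))   # False (q0 -0-> q1 -0-> q0, not final)
```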
Example 2:
Solution:
Present State   0        1
→q0             q0       q1
q1              q1, q2   q2
q2              q1       q3
*q3             q2       q2
Explanation:
o The first row of the transition table can be read as, when the current state is q0, on input 0
the next state will be q0 and on input 1 the next state will be q1.
o In the second row, when the current state is q1, on input 0 the next state will be either q1 or
q2, and on 1 input the next state will be q2.
o In the third row, when the current state is q2 on input 0, the next state will be q1, and on 1
input the next state will be q3.
o In the fourth row, when the current state is q3 on input 0, the next state will be q2, and on 1
input the next state will be q2.
TYPE CHECKING
Introduction:
The compiler must perform static checking (checking done at compile time). This ensures that certain types of programming errors will be detected and reported.
A compiler must check that the source program follows both the syntactic and semantic conventions of the source language. This checking is called static checking. Examples of static checks include:
Flow-of-control checks: Statements that cause the flow of control to leave a construct must have some place to which to transfer the flow of control. For example, a branch to a non-existent label is an error.
Name-related checks: Sometimes, the same name must appear two or more times. For example, in Ada
the name of a block must appear both at the beginning of the block and at the end.
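A minimal sketch of a static type check may make this concrete; the toy expression representation used below (tuples for operator nodes, (type, value) pairs for leaves) is an assumption made purely for illustration.

```python
# A minimal sketch of static type checking on an expression tree:
# leaves are (type, value) pairs, operator nodes are (op, left, right)
# tuples, and the checker rejects operations on incompatible types
# (e.g. adding a string to an integer).
def type_of(node):
    if node[0] in ("int", "string"):
        return node[0]                      # a typed leaf
    op, left, right = node
    lt, rt = type_of(left), type_of(right)
    if op == "+" and lt == rt == "int":
        return "int"
    raise TypeError(f"cannot apply {op!r} to {lt} and {rt}")

print(type_of(("+", ("int", 1), ("int", 2))))     # int
try:
    type_of(("+", ("int", 1), ("string", "x")))   # flagged statically
except TypeError as e:
    print(e)                                      # cannot apply '+' to int and string
```

Because this check needs only the tree and the declared types, it can run entirely at compile time, before the program executes.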