Compiler Design Notes


Discuss in detail the operation of a compiler, which transforms the source program from one representation to
another.
A compiler operates in phases; each phase transforms the source program from one representation to another.
Every phase takes its input from the previous stage and feeds its output to the next phase of the compiler.
There are six phases in a compiler. Each of these phases helps convert the high-level language into machine code. The
phases of a compiler are:

Phase 1: Lexical Analysis


Lexical analysis is the first phase, in which the compiler scans the source code. The scan proceeds left to right,
character by character, grouping the characters into tokens.
Here, the character stream from the source program is grouped into meaningful sequences by identifying the tokens. The
lexical analyzer makes an entry for each token in the symbol table and passes the token to the next phase.
The primary functions of this phase are:
 Identify the lexical units in the source code
 Classify lexical units into classes such as constants and reserved words, and enter them in different tables;
comments in the source program are ignored
 Identify tokens which are not part of the language
Example:
x = y + 10
Tokens:

x    identifier
=    assignment operator
y    identifier
+    addition operator
10   number
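
As a rough illustration (not the implementation of any particular compiler), a minimal tokenizer for statements like
x = y + 10 could be sketched in Python; the token classes and regular expressions below are assumptions made for this
example:

import re

# Hypothetical token specification: (token class, regular expression).
TOKEN_SPEC = [
    ("NUMBER",     r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("ASSIGN",     r"="),
    ("PLUS",       r"\+"),
    ("SKIP",       r"\s+"),          # whitespace (and, similarly, comments) is ignored
]

def tokenize(source):
    pos = 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                if name != "SKIP":   # skip whitespace instead of emitting a token
                    yield (name, m.group())
                pos += m.end()
                break
        else:
            # a character that is not part of the language
            raise SyntaxError(f"illegal character {source[pos]!r}")

print(list(tokenize("x = y + 10")))
# [('IDENTIFIER', 'x'), ('ASSIGN', '='), ('IDENTIFIER', 'y'), ('PLUS', '+'), ('NUMBER', '10')]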

Phase 2: Syntax Analysis


Syntax analysis is all about discovering structure in code. It determines whether or not the text follows the expected
format. The main aim of this phase is to check whether the source code written by the programmer is syntactically
correct.
Syntax analysis applies the rules of the specific programming language by constructing a parse tree from the tokens. It
also determines the structure of the source language and the grammar, or syntax, of the language.
Here is a list of tasks performed in this phase:
 Obtain tokens from the lexical analyzer
 Check whether the expression is syntactically correct or not
 Report all syntax errors
 Construct a hierarchical structure known as a parse tree
Example
Any identifier/number is an expression.
If x is an identifier and y + 10 is an expression, then x = y + 10 is a statement.
Consider the parse tree for the following example:
(a+b)*c

In the parse tree:
 Interior node: a record with an operator field and two fields for the children
 Leaf: a record with two or more fields; one for the token and the others for information about the token
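
As a hedged sketch of how such a parse tree might be built, here is a minimal recursive-descent parser (the grammar,
tuple-based node representation, and single-character identifiers are assumptions made for this example): interior
nodes are (operator, left, right) tuples and leaves are the tokens themselves.

# Grammar assumed: E -> T ('+' T)* ; T -> F ('*' F)* ; F -> id | '(' E ')'
def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def eat(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {peek()!r}")
        pos += 1
    def expr():                          # E -> T ('+' T)*
        node = term()
        while peek() == "+":
            eat("+")
            node = ("+", node, term())   # interior node: operator field + two children
        return node
    def term():                          # T -> F ('*' F)*
        node = factor()
        while peek() == "*":
            eat("*")
            node = ("*", node, factor())
        return node
    def factor():                        # F -> id | '(' E ')'
        if peek() == "(":
            eat("(")
            node = expr()
            eat(")")
            return node
        tok = peek()
        eat(tok)
        return tok                       # leaf: the token itself
    return expr()

print(parse(list("(a+b)*c")))            # ('*', ('+', 'a', 'b'), 'c')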
Phase 3: Semantic Analysis
Semantic analysis checks the semantic consistency of the code. It uses the syntax tree of the previous phase along with
the symbol table to verify that the given source code is semantically consistent. It also checks whether the code is
conveying an appropriate meaning.
The semantic analyzer will check for type mismatches, incompatible operands, a function called with improper
arguments, an undeclared variable, etc.
Functions of the semantic analysis phase are:
 Stores the type information gathered in the symbol table or syntax tree
 Performs type checking
 In the case of a type mismatch, where there are no type-conversion rules which satisfy the desired
operation, a semantic error is reported
 Collects type information and checks for type compatibility
 Checks whether the source language permits the operands or not
Example
float x = 20.2;
float y = x*30;
In the above code, the semantic analyzer will typecast the integer 30 to the float 30.0 before the multiplication.
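
A simplified type-checking sketch (the tree shape, type names, and symbol table contents below are illustrative
assumptions): it annotates an expression tree with types and inserts an int-to-float conversion where the two operand
types differ, as in float y = x * 30 above.

SYMBOL_TABLE = {"x": "float"}            # filled in by earlier phases

def check(node):
    if isinstance(node, int):
        return ("int_const", node), "int"
    if isinstance(node, str):            # identifier: look up its declared type
        if node not in SYMBOL_TABLE:
            raise TypeError(f"undeclared variable {node!r}")
        return ("id", node), SYMBOL_TABLE[node]
    op, left, right = node
    lnode, ltype = check(left)
    rnode, rtype = check(right)
    if ltype == rtype:
        return (op, lnode, rnode), ltype
    if {ltype, rtype} == {"int", "float"}:   # coerce the int operand to float
        if ltype == "int":
            lnode = ("int_to_float", lnode)
        else:
            rnode = ("int_to_float", rnode)
        return (op, lnode, rnode), "float"
    raise TypeError(f"incompatible operands: {ltype} {op} {rtype}")

tree, ty = check(("*", "x", 30))
print(ty, tree)   # float ('*', ('id', 'x'), ('int_to_float', ('int_const', 30)))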
Phase 4: Intermediate Code Generation
Once the semantic analysis phase is over, the compiler generates intermediate code for the target machine. It
represents a program for some abstract machine.
Intermediate code lies between the high-level language and the machine-level language. This intermediate code needs to
be generated in such a manner that it is easy to translate into the target machine code.
Functions of intermediate code generation:
 It should be generated from the semantic representation of the source program
 It holds the values computed during the process of translation
 It helps you translate the intermediate code into the target language
 It maintains the precedence ordering of the source language
 It holds the correct number of operands for each instruction
Example
For the statement
total = count + rate * 5
the intermediate code, using the three-address code method, is:

t1 := int_to_float(5)
t2 := rate * t1
t3 := count + t2
total := t3
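
A minimal three-address-code generator might be sketched as follows (the tree representation and temporary naming are
assumptions for this example; the int_to_float conversion shown above would already have been inserted by the semantic
phase, so this sketch assumes operands of matching types). It walks an expression tree and emits one instruction per
operator, using fresh temporaries t1, t2, ...:

counter = 0
def new_temp():
    global counter
    counter += 1
    return f"t{counter}"

def gen(node, code):
    if not isinstance(node, tuple):
        return str(node)                 # leaf: an identifier or a constant
    op, left, right = node
    l = gen(left, code)                  # operands first (preserves precedence order)
    r = gen(right, code)
    t = new_temp()
    code.append(f"{t} := {l} {op} {r}")  # one operator, the correct number of operands
    return t

code = []
result = gen(("+", "count", ("*", "rate", 5.0)), code)
code.append(f"total := {result}")
print("\n".join(code))
# t1 := rate * 5.0
# t2 := count + t1
# total := t2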
Phase 5: Code Optimization
The next phase is code optimization of the intermediate code. This phase removes unnecessary lines of code and arranges
the sequence of statements to speed up execution of the program without wasting resources. The main goal of this
phase is to improve on the intermediate code to generate code that runs faster and occupies less space.
The primary functions of this phase are:
 It helps you establish a trade-off between execution speed and compilation speed
 It improves the running time of the target program
 It generates streamlined code that is still in the intermediate representation
 It removes unreachable code and gets rid of unused variables
 It moves loop-invariant statements (statements whose values are not altered in the loop) out of the loop
Example:
Consider the following code:
a = intofloat(10)
b = c * a
d = e + b
f = d
It can become:
b = c * 10.0
f = e + b
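
A toy pass illustrating this transformation (the instruction encoding is an assumption for this example; real
optimizers use systematic dataflow analyses): it folds the constant a = intofloat(10), propagates it into b = c * a,
and, since f = d is a mere copy and d is not used afterwards, rewrites the instruction computing d to write f directly.

code = [
    ("a", ("intofloat", 10)),   # a = intofloat(10)
    ("b", ("*", "c", "a")),     # b = c * a
    ("d", ("+", "e", "b")),     # d = e + b
    ("f", ("copy", "d")),       # f = d
]

env = {}       # name -> constant value or copy source that can replace it
kept = []
for dest, (op, *args) in code:
    args = [env.get(arg, arg) for arg in args]         # substitute known values
    if op == "intofloat" and isinstance(args[0], int):
        env[dest] = float(args[0])                     # fold the conversion
    elif op == "copy":
        env[dest] = args[0]                            # remember the copy
    else:
        kept.append([dest, op] + args)

# rename destinations that were only copied (f = d  =>  write f directly)
renames = {src: dst for dst, src in env.items() if isinstance(src, str)}
for instr in kept:
    instr[0] = renames.get(instr[0], instr[0])

for dest, op, a, b in kept:
    print(f"{dest} = {a} {op} {b}")
# b = c * 10.0
# f = e + b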
Phase 6: Code Generation
Code generation is the last and final phase of a compiler. It takes input from the code optimization phase and produces
the target code or object code as a result. The objective of this phase is to allocate storage and generate relocatable
machine code.
It also allocates memory locations for the variables. The instructions in the intermediate code are converted into
machine instructions. This phase converts the optimized intermediate code into the target language.
The target language is the machine code. Therefore, all the memory locations and registers are selected and
allotted during this phase. The code generated by this phase is executed to take inputs and generate the expected outputs.
Example
a = b + 60.0
might be translated into register-based code such as:
MOVF b, R1
ADDF #60.0, R1
MOVF R1, a
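
A hedged sketch of this step (the opcode names MOVF/ADDF/MULF and register names follow the pseudo-assembly above and
are not a real instruction set): it translates a single three-address statement with a constant operand into
register-based instructions.

def gen_assign(dest, left, op, right):
    opcode = {"+": "ADDF", "*": "MULF"}[op]
    return [f"MOVF {left}, R1",          # load the first operand into a register
            f"{opcode} #{right}, R1",    # apply the operator with the immediate constant
            f"MOVF R1, {dest}"]          # store the result back to memory

print("\n".join(gen_assign("a", "b", "+", "60.0")))
# MOVF b, R1
# ADDF #60.0, R1
# MOVF R1, a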
Symbol Table Management
A symbol table contains a record for each identifier, with fields for the attributes of the identifier. This data
structure allows the compiler to find the record for each identifier quickly and to store or retrieve its data quickly.
Error Handling Routine


In the compiler design process, errors may occur in any of the phases given below:
 Lexical analyzer: wrongly spelled tokens
 Syntax analyzer: missing parentheses
 Intermediate code generator: mismatched operands for an operator
 Code optimizer: when a statement is not reachable
 Code generator: when the memory is full or proper registers are not allocated
 Symbol tables: errors from multiply declared identifiers

Why is a two-buffer scheme used in lexical analysis? Elaborate on the input buffering strategy used in the lexical analysis phase.
Lexical analysis, the first phase of a compiler, breaks down the source code into meaningful units called tokens
(keywords, identifiers, operators, etc.). During this process, the lexical analyzer needs to look ahead at the input
stream to correctly identify tokens. However, relying on a single buffer can lead to inefficiencies when dealing with
long lexemes (tokens) that might span across the buffer boundary.
The two-buffer scheme addresses this by utilizing two buffers of a fixed size (typically the same size):
1. Active Buffer: This buffer currently holds the input characters being scanned.
2. Inactive Buffer: This buffer is filled with the next chunk of input characters in anticipation of needing more
data.
The lexical analyzer works as follows:
 It scans the active buffer for a complete token using patterns or rules.
 If the potential token reaches the end of the active buffer, the following steps occur:
o The inactive buffer becomes the active buffer.

o New characters are read into the (now inactive) buffer.

o Scanning resumes in the newly active buffer, continuing the token identification process.
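
A minimal sketch of this buffer-swap logic (the class name, buffer size, and file API below are illustrative
assumptions, not part of any standard scanner):

import io

N = 4096                                         # size of each buffer half

class TwoBuffer:
    def __init__(self, f):
        self.f = f
        self.buffers = [f.read(N), f.read(N)]    # active half + pre-filled half
        self.active = 0
        self.pos = 0

    def next_char(self):
        buf = self.buffers[self.active]
        if self.pos == len(buf):                 # reached the end of the active buffer
            self.active = 1 - self.active        # the inactive buffer becomes active
            self.buffers[1 - self.active] = self.f.read(N)   # refill the old one
            self.pos = 0
            buf = self.buffers[self.active]
        if not buf:
            return ""                            # end of input
        ch = buf[self.pos]
        self.pos += 1
        return ch

scanner = TwoBuffer(io.StringIO("int a, b;"))
print("".join(iter(scanner.next_char, "")))      # int a, b;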

Advantages of the Two-Buffer Scheme:


 Handles Long Lexemes: By having two buffers, the lexical analyzer can seamlessly scan across buffer
boundaries without losing track of the token.
 Fewer System Calls: Input buffering reduces the number of system calls required to read input from the source
code, which can improve performance.
 Reduces I/O Operations: Since characters are read in larger chunks and buffered, the need for frequent disk
or file access is minimized, improving performance.
Input Buffering Strategy
Input buffering is a general technique in lexical analysis that involves using a buffer (or buffers) to store input
characters instead of reading them directly from the source code one at a time. This strategy offers several benefits:
 Efficiency: Reading characters in bulk reduces the overhead associated with individual I/O operations.
 Lookahead Capability: The lexical analyzer can look ahead at multiple characters in the buffer, which is
essential for identifying certain tokens that might require context (e.g., keywords that can also be prefixes of
identifiers).
 Flexibility: The buffer size can be adjusted based on the expected size of lexemes and the trade-off between
memory usage and performance.
Common Buffering Schemes:
 One-Buffer Scheme: While simpler, this scheme can be inefficient if lexemes are potentially larger than the
buffer size.
 Two-Buffer Scheme: As discussed above, this scheme effectively addresses the limitations of the one-buffer
approach.
 Multiple-Buffer Schemes: In some cases, more than two buffers might be used for further performance
optimization.
Choosing a Buffering Scheme:
The choice of buffering scheme depends on factors such as:
 Lexeme Size: If lexemes are typically short, a one-buffer scheme might suffice.
 Performance Requirements: For performance-critical applications, a two-buffer scheme or a multiple-buffer
scheme might be necessary.
 Memory Constraints: If memory is limited, a smaller buffer size or a less complex buffering scheme might
be preferred.

What is Input Buffering in compiler design?
To identify tokens, the lexical analyzer would otherwise have to visit secondary memory each time, which takes a long
time and costs a lot. As a result, the input strings are buffered before being examined by the lexical analyzer.
The lexical analyzer reads the input string one character at a time, from left to right, to detect tokens. To scan
tokens, it employs two pointers:
 The begin pointer (bptr) points to the start of the string to be read.
 The lookahead pointer (lptr) moves ahead to find the end of the token.

Sample Example
Example: For the statement int a,b;

 Both pointers begin at the start of the string, which is saved in the buffer.
 The lookahead pointer scans the buffer until it finds the end of the token.

 Before the token ("int") can be identified, the character beyond the token (the blank space) must be
checked.

 After the token ("int") is processed, both pointers are set to the next token ('a'), and this procedure
continues throughout the program.

A buffer can be divided into two halves. If the lookahead pointer moves past the halfway point, the second half is
filled with fresh characters to read. If the lookahead pointer moves to the right end of the second half of the buffer,
the first half is filled with new characters, and so on.
Sentinels − Each time the forward pointer is advanced, a check must be made to ensure that it has not moved off one
half of the buffer; if it has, the other half must be reloaded. Placing a sentinel character at the end of each
half combines this buffer-end test with the end-of-input test, so the common case needs only one test per character.
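
A hedged sketch of the sentinel idea (the class name, toy buffer size, and the choice of "\0" as the sentinel are
assumptions for illustration; "\0" is assumed never to occur in the source text):

import io

EOF = "\0"    # sentinel appended to each buffer half
N = 8         # toy buffer-half size

class SentinelBuffer:
    def __init__(self, f):
        self.f = f
        self.halves = [f.read(N) + EOF, EOF]   # second half loaded on demand
        self.active = 0
        self.forward = 0

    def next_char(self):
        ch = self.halves[self.active][self.forward]
        self.forward += 1
        if ch != EOF:
            return ch                   # common case: a single test per character
        if self.forward <= N:           # sentinel arrived before the half was full:
            return ""                   # genuine end of input
        self.active = 1 - self.active   # end of a full half: switch halves
        self.halves[self.active] = self.f.read(N) + EOF   # and reload the other one
        self.forward = 0
        return self.next_char()

buf = SentinelBuffer(io.StringIO("int a, b;"))
print("".join(iter(buf.next_char, "")))   # int a, b;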
Buffer Pairs
Specialized buffering techniques decrease the overhead required to process an input character while moving
characters:
 The scheme consists of two buffers, each of N-character size, which are alternately reloaded.
 There are two pointers: lexemeBegin and forward.
 lexemeBegin denotes the start of the current lexeme, whose extent has yet to be discovered.
 forward scans ahead until it finds a match for a pattern.
 When a lexeme is discovered, forward is set to the character at the right end of the lexeme; afterwards,
lexemeBegin is set to the character immediately after the newly discovered lexeme.
 The collection of characters between the two pointers is the current lexeme.

Preliminary Scanning − Pre-processing the character stream before lexical analysis (for example, collapsing runs of
blanks) saves the trouble of moving the lookahead pointer back and forth over a string of blanks.

List the cousins of the compiler and explain the role of any one of them.

Preprocessor
The preprocessor is one of the cousins of the Compiler. It is a program that performs
preprocessing. It performs processing on the given data and produces an output. The output
generated is used as an input for some other program.
The preprocessor increases the readability of the code by replacing a complex expression with a
simpler one by using a macro.
A preprocessor performs multiple types of functionality and operations on the data.
Some of them are-
Macro processing
Macro processing is mapping the input to output data based on a certain set of rules and defined
processes. These rules are known as macros.
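As a rough, hypothetical illustration of object-like macro substitution (real preprocessors, such as the C
preprocessor, also handle macro arguments, rescanning rules, and conditionals):

import re

# Toy macro table; bodies may mention other macros and are assumed not to
# be mutually recursive (otherwise this loop would not terminate).
macros = {"PI": "3.14159", "AREA_FACTOR": "(PI * 2)"}

def expand(text, macros):
    changed = True
    while changed:                       # keep substituting until the text is stable
        changed = False
        for name, body in macros.items():
            new = re.sub(rf"\b{name}\b", body, text)
            if new != text:
                text, changed = new, True
    return text

print(expand("area = AREA_FACTOR * r", macros))
# area = (3.14159 * 2) * r
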
Rational Preprocessors
Rational preprocessors augment older languages with modern flow-of-control and data-structuring
facilities.
File Inclusion
The preprocessor is also used to include header files in the program text. A header file is a text file
included in our source program file during compilation. When the preprocessor finds an #include
directive in the program, it replaces it with the entire content of the specified header file.
Language extension
Language extension is used to add new capabilities to the existing language. This is done by
including certain libraries in our program, which provides extra functionality. An example of this is
Equel, a database query language embedded in C.
Error Detection
Some preprocessors are capable of performing error checking on the source code that is given as
input to them. For example, they can check whether the header files are included properly and whether the
macros are defined correctly.
Conditional Compilation
Certain preprocessors are capable of including or excluding certain pieces of code based on the
result of a condition. They provide more flexibility to the programmers for writing the code as they
allow the programmers to include or exclude certain features of the program based upon some
condition.
Assembler
Assembler is also one of the cousins of the compiler. A compiler takes the preprocessed code and
then converts it into assembly code. This assembly code is given as input to the assembler, and
the assembler converts it into the machine code. Assembler comes into effect in the compilation
process after the Compiler has finished its job.
There are two types of assemblers-
 One-Pass assembler: It goes through the source code (the output of the compiler) only
once and assumes that all symbols will be defined before any instruction that references
them.

 Two-Pass assembler: Two-pass assemblers work by creating a symbol table with the
symbols and their values in the first pass, and then using the symbol table in a second
pass, they generate code.
Linker
The linker takes the object files produced by the assembler as input and combines them to create an
executable file. It merges two or more object files that might be created by different assemblers
and creates a link between them. It also appends all the libraries that will be required for the
execution of the file. A linker's primary function is to search and find the referred modules in a program
and establish the memory addresses where these codes will be loaded.
Multiple tasks that can be performed by linkers include-
 Library Management: Linkers can be used to add external libraries to our code to add
additional functionalities. By adding those libraries, our code can now use the functions
defined in those libraries.

 Code Optimization: Linkers are also used to optimize the code generated by the
compiler by reducing the code size and increasing the program's performance.

 Memory Management: Linkers are also responsible for managing the memory
requirement of the executable code. It allocates the memory to the variables used in
the program and ensures they have a consistent memory location when the code is
executed.

 Symbol Resolution: Linkers link multiple object files, and a symbol can be redefined
in multiple files, giving rise to a conflict. The linker resolves these conflicts by choosing
one definition to use.
Loader
The loader works after the linker has performed its task and created the executable code. It takes
the input of executable files generated from the linker, loads it to the main memory, and prepares
this loaded code for execution by a computer. It also allocates memory space to the program. The
loader is also responsible for the execution of programs by allocating RAM to the program and
initializing specific registers.

The following tasks are performed by the loader:


 Loading: The loader loads the executable files in the memory and provides memory
for executing the program.

 Relocation: The loader adjusts the memory addresses of the program to relocate its
location in memory.

 Symbol Resolution: The loader is used to resolve the symbols not defined directly in
the program. They do this by looking for the definition of that symbol in a library linked
to the executable file.

 Dynamic Linking: The loader dynamically links the libraries into the executable file at
runtime to add additional functionality to our program.

Define top-down parsing. What are the key problems with top-down parsing?
In top-down parsing, the parse tree is generated from top to bottom, i.e., from the root to the leaves, and is expanded
until all leaves are generated.
The parse tree is generated with the start symbol of the grammar as the root. Parsing starts the derivation from the
start symbol of the grammar and performs a leftmost derivation at each step.
Drawbacks of Top-Down Parsing
 Top-down parsing tries to identify the left-most derivation for an input string ω, which is equivalent to generating
a parse tree for the input string ω that starts from the root and produces the nodes in a pre-defined order.
 The reason that top-down parsing follows the left-most derivation for an input string ω, and not the right-most
derivation, is that the input string ω is scanned by the parser from left to right, one symbol/token at a time. The
left-most derivation generates the leaves of the parse tree in left-to-right order, which matches the input
scan order.
 In top-down parsing, each terminal symbol produced by a predicted production of the grammar is compared with the
input symbol pointed to by the string marker. If the match is successful, the parser can continue. If a mismatch
occurs, the prediction has gone wrong.
 At this point it is necessary to reject the previous prediction. The prediction which led to the mismatching
terminal symbol is rejected, and the string marker (pointer) is reset to the position it had when the rejected
prediction was made. This is known as backtracking; a short sketch of it follows this list.
 Backtracking is the major drawback of top-down parsing.
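
A toy backtracking sketch (the grammar S -> c A d, A -> a | a b and the inputs are assumptions chosen to demonstrate
the idea): on "cabd", the parser first predicts A -> a, which matches locally but leaves "bd" unmatched, so the
marker is reset and A -> a b is tried instead.

def match_A(s, pos):
    # try the alternatives of A in order; each failed alternative is abandoned
    # and the next one is tried from the same position
    for alt in ("a", "ab"):              # A -> a | a b
        if s[pos:pos + len(alt)] == alt:
            yield pos + len(alt)         # position just after this alternative

def parse_S(s):
    if not s.startswith("c"):            # S -> c A d: match the leading 'c'
        return False
    for after_A in match_A(s, 1):        # one prediction for A at a time
        if s[after_A:] == "d":           # the rest of the input must be 'd'
            return True
        # the prediction matched locally but the rest failed: reset the
        # marker and try A's next alternative (backtracking)
    return False

print(parse_S("cad"))    # True: A -> a, then 'd'
print(parse_S("cabd"))   # True, but only after backtracking from A -> a to A -> a b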

Consider the following grammar-


S → T L;
T → int | float
L → L , id | id
Parse the input string int id , id ; using a shift-reduce parser.

Solution-

Stack             Input Buffer        Parsing Action

$                 int id , id ; $     Shift
$ int             id , id ; $         Reduce T → int
$ T               id , id ; $         Shift
$ T id            , id ; $            Reduce L → id
$ T L             , id ; $            Shift
$ T L ,           id ; $              Shift
$ T L , id        ; $                 Reduce L → L , id
$ T L             ; $                 Shift
$ T L ;           $                   Reduce S → T L ;
$ S               $                   Accept
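
A hedged sketch of a shift-reduce parser for this grammar (hand-written for this example: it reduces greedily whenever
a right-hand side appears on top of the stack, trying longer rules first; a real LR parser would instead consult a
parsing table with lookahead):

RULES = [                      # right-hand sides listed longest first
    ("S", ["T", "L", ";"]),
    ("L", ["L", ",", "id"]),
    ("T", ["int"]),
    ("L", ["id"]),
]

def shift_reduce(tokens):
    stack = []
    tokens = tokens + ["$"]
    i = 0
    while True:
        reduced = True
        while reduced:                   # reduce while a handle is on top of the stack
            reduced = False
            for lhs, rhs in RULES:
                if len(stack) >= len(rhs) and stack[-len(rhs):] == rhs:
                    stack[-len(rhs):] = [lhs]
                    print(f"reduce {lhs} -> {' '.join(rhs):<10} stack: {' '.join(stack)}")
                    reduced = True
                    break
        if stack == ["S"] and tokens[i] == "$":
            print("accept")
            return True
        if tokens[i] == "$":
            return False                 # input exhausted without reaching S
        stack.append(tokens[i])          # shift the next token onto the stack
        print(f"shift  {tokens[i]:<16} stack: {' '.join(stack)}")
        i += 1

shift_reduce(["int", "id", ",", "id", ";"])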

High-Level Languages (HLLs):
 Human-Readable: HLLs use syntax that resembles natural language or mathematical notation, making them
easier for programmers to understand and write compared to low-level languages (machine code, assembly).
 Abstraction: HLLs provide a layer of abstraction from the underlying hardware details. Programmers don't
need to worry about memory management, register usage, or specific machine instructions.
 Platform Independence: HLL code is generally portable across different hardware platforms with the help of
compilers or interpreters. The code itself is written independently of the target machine.
Factors Affecting "Purity" of HLLs:
 Side Effects: Some HLLs allow for side effects, meaning they can modify data outside the current function or
scope, potentially impacting program behavior. Examples of side effects include modifying global variables,
reading/writing to files, or interacting with external devices. Languages that emphasize minimizing side
effects might be considered "purer."
 Memory Management: High-level languages might handle memory management automatically (garbage
collection) or require manual memory allocation/deallocation by the programmer. Languages with automatic
memory management can be considered "purer" from the perspective of programmer convenience and
reducing potential errors.
 Low-Level Access: Some HLLs offer mechanisms to access low-level hardware features or perform
operations closer to the machine. This can be useful for performance optimization in certain scenarios, but it
introduces some dependency on the underlying architecture and reduces portability. Languages that minimize
low-level access might be considered "purer."
Criteria for Code Optimization:
 The optimization must be correct; it must not, in any way, change the meaning of the program or introduce new
bugs.
 Optimization should increase the speed and performance of the program.
 The compilation time must be kept reasonable; the optimization process should not delay the overall compiling
process.
 Optimized code often promotes re-usability.
 Optimization should reduce the amount of memory the program needs to run.
