
1. What are the two parts of compilation?

a. Analysis phases of compilation


b. Synthesis phases of compilation
2. Name the types of a language processing system.
a. Preprocessors
b. Compilers
c. Assembler
d. Loaders and link editors
3. What is symbol table manager?
The table-management or bookkeeping portion of the compiler keeps track of the names
used by the program and records essential information about each, such as its type (int, real,
etc.). The data structure used to record this information is called a symbol table.
4. List some compiler construction tools.
a. Parser generators
b. Scanner generators
c. Syntax-directed translation engine
d. Automatic code generators
e. Data-flow engine
5. Define patterns/lexeme/tokens.
A pattern is a rule describing the set of strings in the input for which the same
token is produced as output.
A lexeme is a sequence of characters in the source program that is matched by
the pattern for a token.
A token is a sequence of characters that can be treated as a single logical entity.

6. Define regular expression.


Regular expressions describe languages by defining patterns for strings of symbols.
The grammar defined by regular expressions is known as regular grammar, and the
language defined by a regular grammar is known as a regular language.
Regular expression is an important notation for specifying patterns. Each pattern matches
a set of strings, so regular expressions serve as names for sets of strings.
7. Show the R.E. for the following language.
Set of strings over {a, b, c} that contain no two consecutive b’s
(b | ε)(a | c | ab | cb)*
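As a quick sanity check, the defining property (no "bb" substring) can be tested directly; the following C function is an illustrative sketch, not part of the original answer:

```c
#include <string.h>

/* Returns 1 if s (a string over {a,b,c}) contains no two consecutive
 * b's, i.e. s belongs to the language described above. */
int no_double_b(const char *s) {
    return strstr(s, "bb") == NULL;
}
```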
8. Define Context Free Grammar (CFG).

Context free grammar is a formal grammar which is used to generate all possible strings
in a given formal language.

Context free grammar G can be defined by four tuples as:

G= (V, T, P, S)

Where,
G is the grammar,
V is a finite set of non-terminal symbols,
T is a finite set of terminal symbols,
P is a set of production rules, and
S is the start symbol.

9. Define Ambiguity.
A grammar that produces more than one parse tree for some sentences is said to be
ambiguous.
10. What is a Parser? List its types.
A parser for grammar G is a program that takes as input a string w and produces as output
either a parse tree for w, if w is a sentence of G, or an error message indicating that w is
not a sentence of G. It obtains a string of tokens from the lexical analyzer and verifies that
the string can be generated by the grammar of the source language.
a) Top down parsing b) Bottom up parsing

11. Discuss the role of lexical analyzer in detail with necessary example.

The main task of lexical analysis is to read input characters in the code and produce tokens.

Lexical analyzer scans the entire source code of the program. It identifies each token one by one.
Scanners are usually implemented to produce tokens only when requested by the parser. Here is how
this works:

1. "Get next token" is a command which is sent from the parser to the lexical analyzer.
2. On receiving this command, the lexical analyzer scans the input until it finds the next
token.
3. It returns the token to Parser.
Lexical Analyzer skips whitespaces and comments while creating these tokens. If any error is
present, then Lexical analyzer will correlate that error with the source file and line number.
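The request/response protocol above can be sketched in C as follows (all names are illustrative; a real scanner would classify tokens rather than return raw lexemes):

```c
#include <ctype.h>
#include <string.h>

/* Minimal sketch of the parser/lexer protocol: each call to
 * get_next_token() skips whitespace and returns the next lexeme
 * from the input, or NULL at end of input. */
static const char *input = "if x > y";
static char lexeme[64];

const char *get_next_token(void) {
    while (*input == ' ' || *input == '\t' || *input == '\n')
        input++;                      /* whitespace is skipped, not tokenized */
    if (*input == '\0')
        return NULL;                  /* end of input */
    int i = 0;
    if (isalnum((unsigned char)*input)) {
        while (isalnum((unsigned char)*input) && i < 63)
            lexeme[i++] = *input++;   /* identifier / number / keyword */
    } else {
        lexeme[i++] = *input++;       /* single-character operator */
    }
    lexeme[i] = '\0';
    return lexeme;
}
```

Each call corresponds to one "get next token" command from the parser.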

Roles of the Lexical analyzer

Lexical analyzer performs below given tasks:

• Enters identified tokens into the symbol table
• Removes white spaces and comments from the source program
• Correlates error messages with the source program
• Expands macros if they are found in the source program
• Reads input characters from the source program

Example of Lexical Analysis, Tokens, Non-Tokens

Consider the following code that is fed to Lexical Analyzer

#include <stdio.h>

int maximum(int x, int y) {
    // This will compare 2 numbers
    if (x > y)
        return x;
    else {
        return y;
    }
}
Examples of Tokens created
Lexeme    Token

int       Keyword
maximum   Identifier
(         Operator
int       Keyword
x         Identifier
,         Operator
int       Keyword
y         Identifier
)         Operator
{         Operator
if        Keyword

Examples of Nontokens
Type                      Examples

Comment                   // This will compare 2 numbers
Pre-processor directive   #include <stdio.h>
Pre-processor directive   #define NUMS 8,9
Macro                     NUMS
Whitespace                \n \b \t

Lexical Errors

A character sequence which is not possible to scan into any valid token is a lexical error.
Important facts about the lexical error:

• Lexical errors are not very common, but they should be managed by the scanner
• Misspellings of identifiers, operators, and keywords are considered lexical errors
• Generally, a lexical error is caused by the appearance of some illegal character, mostly at
the beginning of a token.
Error Recovery in Lexical Analyzer

Here are a few of the most common error recovery techniques:

• Remove one character from the remaining input
• In panic mode, successive characters are ignored until we reach a well-formed token
• Insert a missing character into the remaining input
• Replace a character with another character
• Transpose two adjacent characters

Lexical Analyzer vs. Parser

Lexical Analyser                        Parser

Scans the input program                 Performs syntax analysis
Identifies tokens                       Creates an abstract representation of the code
Inserts tokens into the symbol table    Updates symbol table entries
Reports lexical errors                  Generates a parse tree of the source code


13. Cousins of Compiler

1. Preprocessor 2. Assembler 3. Loader and Link-editor

Preprocessor

A preprocessor is a program that processes its input data to produce output that is used as
input to another program. The output is said to be a preprocessed form of the input data, which is
often used by some subsequent programs like compilers.
They may perform the following functions:
1. Macro processing
2. File inclusion
3. Rational preprocessors
4. Language extension

1. Macro processing:

A macro is a rule or pattern that specifies how a certain input sequence should be mapped
to an output sequence according to a defined procedure. The mapping process that instantiates a
macro into a specific output sequence is known as macro expansion.

2. File Inclusion:
Preprocessor includes header files into the program text. When the preprocessor finds an
#include directive it replaces it by the entire content of the specified file.

3. Rational Preprocessors:

These preprocessors augment older languages with more modern flow-of-control and
data-structuring facilities.

4. Language extension :

These processors attempt to add capabilities to the language by what amounts to built-in
macros. For example, the language Equel is a database query language embedded in C.

Assembler

Assembler creates object code by translating assembly instruction mnemonics into


machine code. There are two types of assemblers:

· One-pass assemblers go through the source code once and assume that all symbols
will be defined before any instruction that references them.

· Two-pass assemblers create a table with all symbols and their values in the first pass,
and then use the table in a second pass to generate code
Fig. 1.7 Translation of a statement
Linker and Loader

A linker or link editor is a program that takes one or more objects generated by a compiler
and combines them into a single executable program. Three tasks of the linker are

1. Searches the program to find library routines used by the program, e.g. printf(), math routines.
2. Determines the memory locations that code from each module will occupy and relocates its
instructions by adjusting absolute references.
3. Resolves references among files.
A loader is the part of an operating system that is responsible for loading programs in
memory, one of the essential stages in the process of starting a program.
12. Build the transition diagram for relational operators, keywords, identifier and constant.

Recognition of Tokens

Our current goal is to perform the lexical analysis needed for the following grammar.

stmt → if expr then stmt


| if expr then stmt else stmt

expr → term relop term // relop is relational operator =, >, etc
| term
term → id
| number

Recall that the terminals are the tokens, the nonterminals produce terminals.

A regular definition for the terminals is

digit → [0-9]
digits → digit+
number → digits (. digits)? (E[+-]? digits)?
letter → [A-Za-z]
id → letter ( letter | digit )*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
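The definitions for id and digits can be transcribed almost directly into code; the following C checkers are an illustrative sketch (function names are my own, assuming ASCII input):

```c
#include <ctype.h>

/* Sketch of recognizers for the regular definitions above:
 *   id     -> letter ( letter | digit )*
 *   digits -> digit+                      */
int is_id(const char *s) {
    if (!isalpha((unsigned char)*s)) return 0;   /* must start with a letter */
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s)) return 0;
    return 1;
}

int is_digits(const char *s) {
    if (!isdigit((unsigned char)*s)) return 0;   /* at least one digit */
    for (s++; *s; s++)
        if (!isdigit((unsigned char)*s)) return 0;
    return 1;
}
```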

Lexeme Token Attribute


Whitespace ws —
if if —
then then —
else else —
An identifier id Pointer to table entry
A number number Pointer to table entry
< relop LT
<= relop LE
= relop EQ
<> relop NE
> relop GT
>= relop GE
Transition Diagrams

A transition diagram is similar to a flowchart for (a part of) the lexer. We draw one for each
possible token. It shows the decisions that must be made based on the input seen. The two main
components are circles representing states (think of them as decision points of the lexer) and
arrows representing edges (think of them as the decisions made).

The transition diagram (3.13) for relop is described below.

1. The double circles represent accepting or final states at which point a lexeme has been
found. There is often an action to be done (e.g., returning the token), which is written to
the right of the double circle.
2. If we have moved one (or more) characters too far in finding the token, one (or more) stars
are drawn.
3. An imaginary start state exists and has an arrow coming from it to indicate where to begin
the process.

It is fairly clear how to write code corresponding to this diagram. You look at the first character:
if it is <, you look at the next character. If that character is =, you return (relop, LE) to the parser.
If instead that character is >, you return (relop, NE). If it is any other character, return (relop, LT)
and adjust the input buffer so that you will read this character again, since you have not used it for
the current lexeme. If the first character was =, you return (relop, EQ).
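That description can be sketched in C as follows (an illustrative transcription of the diagram; the attribute names LT, LE, etc. are returned as strings here for simplicity):

```c
#include <string.h>

/* Transcription of the relop transition diagram: *pp is advanced past
 * the operator; on the "star" states one character is pushed back by
 * simply not consuming it. Returns NULL if no relop starts here. */
const char *relop(const char **pp) {
    const char *p = *pp;
    const char *tok = NULL;
    if (*p == '<') {
        p++;
        if (*p == '=')      { p++; tok = "LE"; }
        else if (*p == '>') { p++; tok = "NE"; }
        else                { tok = "LT"; }   /* retract: next char unread */
    } else if (*p == '=') {
        p++; tok = "EQ";
    } else if (*p == '>') {
        p++;
        if (*p == '=') { p++; tok = "GE"; }
        else           { tok = "GT"; }        /* retract */
    }
    *pp = p;
    return tok;
}
```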
Recognition of Reserved Words and Identifiers

The transition diagram below corresponds to the regular definition given previously.

Note again the star affixed to the final state.

Two questions remain.

1. How do we distinguish between identifiers and keywords such as then, which also match
the pattern in the transition diagram?
2. What do gettoken() and installID() do?

We will continue to assume that the keywords are reserved, i.e., may not be used as identifiers.
(What if this is not the case, as in PL/I, which had no reserved words? Then the lexer does not
distinguish between keywords and identifiers and the parser must.)

We will use the method mentioned last chapter and have the keywords installed into the
identifier table prior to any invocation of the lexer. The table entry will indicate that the entry is a
keyword.

installID() checks if the lexeme is already in the table. If it is not present, the lexeme is installed
as an id token. In either case a pointer to the entry is returned.

gettoken() examines the lexeme and returns the token name, either id or a name corresponding to
a reserved keyword.
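A minimal sketch of this scheme in C (the table layout and capacity here are illustrative assumptions, not the text's data structure):

```c
#include <string.h>

/* Keywords are pre-installed in the (tiny, illustrative) symbol table,
 * so gettoken() only needs a lookup to tell keywords from identifiers. */
static const char *table[16] = { "if", "then", "else" };  /* pre-installed */
static int nentries = 3;

int installID(const char *lexeme) {            /* returns index into table */
    for (int i = 0; i < nentries; i++)
        if (strcmp(table[i], lexeme) == 0)
            return i;                          /* already present */
    table[nentries] = lexeme;                  /* install new identifier */
    return nentries++;
}

const char *gettoken(int entry) {
    return entry < 3 ? table[entry] : "id";    /* keyword name, or "id" */
}
```

In either case installID() returns a pointer (here, an index) to the entry, exactly as described above.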

The text also gives another method to distinguish between identifiers and keywords.

Completion of the Running Example

So far we have transition diagrams for identifiers (this diagram also handles keywords) and the
relational operators. What remains are whitespace, and numbers, which are respectively the
simplest and most complicated diagrams seen so far.

Recognizing Whitespace

The diagram itself is quite simple, reflecting the simplicity of the corresponding regular
expression.

• The delim in the diagram represents any of the whitespace characters, say space, tab, and newline.
• The final star is there because we needed to find a non-whitespace character in order to
know when the whitespace ends and this character begins the next token.
• There is no action performed at the accepting state. Indeed the lexer does not return to the
parser, but starts again from its beginning as it still must find the next token.
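In code, the whitespace diagram amounts to a simple loop (an illustrative sketch):

```c
/* Consume delim characters; stop at the first non-delim character
 * (the "star" state: that character is left unconsumed so it can
 * begin the next token). No token is returned to the parser. */
const char *skip_ws(const char *p) {
    while (*p == ' ' || *p == '\t' || *p == '\n')
        p++;
    return p;    /* scanning resumes here */
}
```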

Recognizing Numbers

14. How do you reduce the number of states of DFA without affecting the language? Explain
with an example. (Available in notes also)
DFA minimization means converting a given DFA to an equivalent DFA with the minimum
number of states.
Minimization of DFA
Suppose there is a DFA D < Q, Σ, q0, δ, F > which recognizes a language L. Then the minimized
DFA D < Q’, Σ, q0, δ’, F’ > can be constructed for language L as:
Step 1: We will divide Q (set of states) into two sets. One set will contain all final states and
other set will contain non-final states. This partition is called P0.
Step 2: Initialize k = 1
Step 3: Find Pk by partitioning the different sets of Pk-1. In each set of Pk-1, we will take all
possible pair of states. If two states of a set are distinguishable, we will split the sets into
different sets in Pk.
Step 4: Stop when Pk = Pk-1 (No change in partition)
Step 5: All states of one set are merged into one. No. of states in minimized DFA will be equal
to no. of sets in Pk.
How to find whether two states in partition Pk are distinguishable ?
Two states (qi, qj) are distinguishable in partition Pk if for some input symbol a, δ(qi, a) and
δ(qj, a) are in different sets in partition Pk-1.
Example
Consider the following DFA shown in figure.
Step 1. P0 will have two sets of states. One set will contain q1, q2, q4 which are final states of
DFA and another set will contain remaining states. So P0 = { { q1, q2, q4 }, { q0, q3, q5 } }.
Step 2. To calculate P1, we will check whether sets of partition P0 can be partitioned or not:

i) For set { q1, q2, q4 } :


δ ( q1, 0 ) = δ ( q2, 0 ) = q2 and δ ( q1, 1 ) = δ ( q2, 1 ) = q5, So q1 and q2 are not
distinguishable.
Similarly, δ ( q1, 0 ) = δ ( q4, 0 ) = q2 and δ ( q1, 1 ) = δ ( q4, 1 ) = q5, So q1 and q4 are not
distinguishable.
Since, q1 and q2 are not distinguishable and q1 and q4 are also not distinguishable, So q2 and q4
are not distinguishable. So, { q1, q2, q4 } set will not be partitioned in P1.
ii) For set { q0, q3, q5 } :
δ ( q0, 0 ) = q3 and δ ( q3, 0 ) = q0
δ ( q0, 1) = q1 and δ( q3, 1 ) = q4
Moves of q0 and q3 on input symbol 0 are q3 and q0 respectively, which are in the same set in
partition P0. Similarly, moves of q0 and q3 on input symbol 1 are q1 and q4, which are in the
same set in partition P0. So, q0 and q3 are not distinguishable.
δ ( q0, 0 ) = q3 and δ ( q5, 0 ) = q5 and δ ( q0, 1 ) = q1 and δ ( q5, 1 ) = q5
Moves of q0 and q5 on input symbol 1 are q1 and q5 respectively, which are in different sets in
partition P0. So, q0 and q5 are distinguishable. So, set { q0, q3, q5 } will be partitioned into { q0,
q3 } and { q5 }. So,
P1 = { { q1, q2, q4 }, { q0, q3}, { q5 } }
To calculate P2, we will check whether sets of partition P1 can be partitioned or not:
iii)For set { q1, q2, q4 } :
δ ( q1, 0 ) = δ ( q2, 0 ) = q2 and δ ( q1, 1 ) = δ ( q2, 1 ) = q5, So q1 and q2 are not
distinguishable.
Similarly, δ ( q1, 0 ) = δ ( q4, 0 ) = q2 and δ ( q1, 1 ) = δ ( q4, 1 ) = q5, So q1 and q4 are not
distinguishable.
Since, q1 and q2 are not distinguishable and q1 and q4 are also not distinguishable, So q2 and q4
are not distinguishable. So, { q1, q2, q4 } set will not be partitioned in P2.
iv)For set { q0, q3 } :
δ ( q0, 0 ) = q3 and δ ( q3, 0 ) = q0
δ ( q0, 1 ) = q1 and δ ( q3, 1 ) = q4
Moves of q0 and q3 on input symbol 0 are q3 and q0 respectively, which are in the same set in
partition P1. Similarly, moves of q0 and q3 on input symbol 1 are q1 and q4, which are in the
same set in partition P1. So, q0 and q3 are not distinguishable.
v) For set { q5 }:
Since we have only one state in this set, it can’t be further partitioned. So,
P2 = { { q1, q2, q4 }, { q0, q3 }, { q5 } }
Since, P1=P2. So, this is the final partition. Partition P2 means that q1, q2 and q4 states are
merged into one. Similarly, q0 and q3 are merged into one. Minimized DFA corresponding to
DFA of Figure 1 is shown in Figure 2 as:
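The five steps above can be sketched in C for this example DFA. The transition table below encodes the moves used in the worked example, and the refinement-by-signature loop is one standard way to implement Step 3 (a sketch, not the only formulation):

```c
#include <string.h>

/* Partition refinement on the example DFA: states q0..q5,
 * finals {q1, q2, q4}, alphabet {0, 1}. delta[i][a] is the
 * state reached from qi on input a; part[i] is qi's block. */
static const int delta[6][2] = {
    /* q0 */ {3, 1},  /* q1 */ {2, 5},  /* q2 */ {2, 5},
    /* q3 */ {0, 4},  /* q4 */ {2, 5},  /* q5 */ {5, 5},
};

int minimize(int part[6]) {
    const int is_final[6] = {0, 1, 1, 0, 1, 0};
    for (int i = 0; i < 6; i++)
        part[i] = is_final[i];                 /* P0: finals vs non-finals */
    for (;;) {
        int newpart[6], nblocks = 0;
        for (int i = 0; i < 6; i++) {
            newpart[i] = -1;
            /* qi joins an earlier state's block only if it is in the
             * same block AND its moves land in the same blocks. */
            for (int j = 0; j < i; j++)
                if (part[j] == part[i] &&
                    part[delta[j][0]] == part[delta[i][0]] &&
                    part[delta[j][1]] == part[delta[i][1]]) {
                    newpart[i] = newpart[j];
                    break;
                }
            if (newpart[i] == -1)
                newpart[i] = nblocks++;        /* distinguishable: new block */
        }
        if (memcmp(newpart, part, sizeof newpart) == 0)
            return nblocks;                    /* Pk = Pk-1: stop (Step 4) */
        memcpy(part, newpart, sizeof newpart);
    }
}
```

Running this reproduces the result above: three blocks, with q1, q2, q4 merged, q0, q3 merged, and q5 alone.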

15. List the reasons for writing a grammar. (Available in notes)


16. Phases of Compiler. ( Available in notes)
