
SYSTEM SOFTWARE
STUDY MATERIAL

INDEX
UNIT – I
1.1 Analysis of the source program
1.2 Phases of a compiler
1.3 Cousins of the Compiler
1.4 Grouping of Phases
1.5 Compiler Construction tools
1.6 Lexical Analysis
1.6.1 A role of lexical analyzer
1.6.2 Input Buffering
1.6.3 Specification of tokens
1.6.4 Recognition of tokens.

UNIT – II
2.1 Role of the parser
2.2 Context free grammars
2.3 Top down parsing:
2.3.1 Recursive Descent Parsing
2.3.2 Predictive parsing
2.4 Bottom up parsing:
2.4.1 Handles
2.4.2 Handle pruning
2.4.3 Stack implementation of shift-reducing parsing.

UNIT – III
3.1 Intermediate Code Generation
3.2 Intermediate Languages
3.3 Graphical Representations
3.4 Three-Address Code
3.5 Implementations of three address statements
3.6 Code Generation
3.7 Issues In The Design Of A Code Generator
3.8 Run-Time Storage Management

UNIT – IV
4.1 Elements of assembly language programming
4.2 Assembly Language Statements:
4.3 Advantage of assembly Language:
4.4 Interpreters
4.4.1 Uses Of Interpreters
4.4.2 Overview Of Interpretation
4.4.3 Pure And Impure Interpreters
4.5 Macros And Macro Processors
4.5.1 Macro definition and call
4.6 Macro expansion
4.6.1 Lexical substitution
4.6.2 Positional Parameters
4.6.3 Keyword Parameters
4.7 Nested Macro Calls
4.8 Advanced Macro Facilities

UNIT – V
5.1 Linkers
5.2 Relocation And Linking Concepts
5.2.1 Program Relocation
5.2.2 Linking
5.2.3 Binary Programs
5.2.4 Object Module
5.3 Self-Relocating Programs
5.4 Linking For Overlays
5.5 Loaders
5.6 Software Tools
5.6.1 Software Tools For Program Development
5.6.1.1 Program Design And Coding
5.6.1.2 Program Entry And Editing
5.6.1.3 Program Testing And Debugging
5.6.1.4 Enhancement Of Program Performance
5.6.1.5 Program Documentation
5.6.1.6 Design Of Software Tools
5.7 Editors
5.7.1 Screen Editors
5.7.2 Word Processors
5.7.3 Structure Editors
5.7.4 Design Of An Editor
5.8 Debug Monitors
5.8.1 Testing Assertions

UNIT - I
SYLLABUS
Compilers – Analysis of the source program – Phases of a compiler –Cousins of the
Compiler – Grouping of Phases-Compiler Construction tools.
Lexical analysis: A role of lexical analyzer – Input Buffering – Specification of tokens
- Recognition of tokens.

COMPILERS

* A compiler is a program that reads a program written in one language- the source
language- and translates it into an equivalent program in another language – the
target language.

    Source Program --> Compiler --> Target Program
                          |
                    Error messages

* There are thousands of source languages, ranging from traditional programming
languages such as Fortran and Pascal to specialized languages that have arisen in
virtually every area of computer application.
* A target language may be another programming language, or the machine language of
any computer, ranging from a microprocessor to a supercomputer.
* Compilers are classified as
i) single-pass (one-pass) – the source is read once, generating only modest
input/output activity. Example: the code for a procedure body is compiled in
memory and written out as a unit to secondary storage.
ii) multi-pass
iii) load-and-go
iv) debugging (symbolic debugging) and
v) optimizing (code optimization)
depending on how they have been constructed or on what function they are
supposed to perform.
* The first compilers started to appear in the early 1950s.

The Analysis-Synthesis Model of Compilation

There are two parts to compilation:

i) Analysis

It breaks up the source program into pieces and creates an intermediate
representation of the source program.

ii) Synthesis

It constructs the desired target program from the intermediate
representation.

During analysis, the operations implied by the source program are determined
and recorded in a hierarchical structure called a tree.
* A special kind of tree called a syntax tree is used, in which each node represents
an operation and the children of a node represent the arguments of the operation.
For example, an assignment statement in Pascal is given as
position := initial + rate * 60. Its syntax tree is given below.

            :=
          /    \
    position    +
              /   \
       initial     *
                 /   \
             rate     60

EXAMPLES OF SOFTWARE TOOLS

i) Structure Editors

It takes as input a sequence of commands to build a source program. It
also analyzes the program text, putting an appropriate hierarchical
structure on the source program.
Example: When the user types while, the editor supplies the matching
do, and it can jump from a begin or left parenthesis to its matching end or
right parenthesis.

ii) Pretty printers

It analyzes a program and prints it in such a way that the structure of the
program becomes clearly visible.
Example : Comments may appear in a special font and statements may
appear with an amount of indentation proportional to the depth of their
nesting in the hierarchical organization of the statements.

iii) Static Checkers

It reads a program, analyzes it, and attempts to identify potential bugs
without running the program.
Example: It can catch logical errors, such as the use of a real variable
as a pointer.

iv) Interpreters

Instead of producing a target program as a translation, an interpreter performs
the operations implied by the source program. For the syntax tree of
position := initial + rate * 60 shown above:

At the root, it would discover it had an assignment to perform, so it would call
a routine to evaluate the expression on the right, and store the result in the
identifier position.
At the right child of the root, the routine would discover it had to compute the
sum of two expressions. It would call itself recursively to compute the value of
rate * 60. It would then add that value to the value of the variable initial.
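The tree-walking evaluation just described can be sketched in C. This is a minimal
illustration only; the Node type, the little variable table, and the eval routine
are all invented here for the example.

    #include <stdio.h>
    #include <string.h>

    typedef enum { NUM, VAR, ADD, MUL, ASSIGN } Kind;

    typedef struct Node {
        Kind kind;
        double value;              /* used when kind == NUM */
        const char *name;          /* used when kind == VAR or ASSIGN */
        struct Node *left, *right;
    } Node;

    /* A tiny table of variables for the example. */
    static struct { const char *name; double value; } table[] = {
        { "initial", 100.0 }, { "rate", 4.5 }, { "position", 0.0 }
    };

    static double *lookup(const char *name)
    {
        for (int i = 0; i < 3; i++)
            if (strcmp(table[i].name, name) == 0)
                return &table[i].value;
        return NULL;               /* not reached in this example */
    }

    /* Evaluate children first, then apply the node's operation. */
    static double eval(Node *n)
    {
        switch (n->kind) {
        case NUM:    return n->value;
        case VAR:    return *lookup(n->name);
        case ADD:    return eval(n->left) + eval(n->right);
        case MUL:    return eval(n->left) * eval(n->right);
        case ASSIGN: return *lookup(n->name) = eval(n->right);
        }
        return 0.0;
    }

    int main(void)
    {
        /* the syntax tree of position := initial + rate * 60 */
        Node sixty = { NUM, 60.0, NULL, NULL, NULL };
        Node rate  = { VAR, 0, "rate", NULL, NULL };
        Node mul   = { MUL, 0, NULL, &rate, &sixty };
        Node init  = { VAR, 0, "initial", NULL, NULL };
        Node add   = { ADD, 0, NULL, &init, &mul };
        Node asg   = { ASSIGN, 0, "position", NULL, &add };
        printf("position = %g\n", eval(&asg));   /* 100 + 4.5*60 = 370 */
        return 0;
    }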

EXAMPLES OF ANALYSIS

i) Text formatters

It takes as input a stream of characters, most of which is text to be typeset,
but some of which includes commands to indicate paragraphs, figures or
mathematical structures like superscripts and subscripts.

ii) Silicon Compiler

It has a source language that is similar or identical to a conventional
programming language. Here, the variables represent not locations in memory,
but logical signals (0 or 1) or groups of signals in a switching circuit.
The output is a circuit design in an appropriate language.

iii) Query Interpreters

It translates a predicate containing relational and Boolean operators into
commands to search a database for records satisfying that predicate.

THE CONTEXT OF A COMPILER

* A source program may be divided into modules stored in separate files.


* The task of collecting the source program is sometimes entrusted to a separate
program called a preprocessor. The preprocessor may also expand shorthands, called
macros, into source language statements.
* The target program created by the compiler may require further processing before it
can be run.
* The compiler creates assembly code that is translated by an assembler into machine
code and then linked together with some library routines into the code that actually
runs on the machine.
* The following figure shows a typical "compilation".

    Skeletal source program
            |
       Preprocessor
            |
      Source program
            |
        Compiler
            |
    Target assembly program
            |
        Assembler
            |
    Relocatable machine code
            |
    Loader/link-editor  <--  library, relocatable object files
            |
    Absolute machine code

1.1 ANALYSIS OF THE SOURCE PROGRAM

In compiling, analysis consists of three phases

1. Linear Analysis or Lexical Analysis or Scanning

The stream of characters making up the source program is read from left-to-right
and grouped into tokens that are sequences of characters having a collective
meaning. For example, consider the statement position := initial + rate * 60.

In lexical analysis, the characters in the assignment statement would be
grouped into the following tokens:
a) the identifier position
b) the assignment symbol :=
c) the identifier initial
d) the plus sign +
e) the identifier rate
f) the multiplication sign *
g) the number 60

The blanks separating the characters of these tokens would normally be
eliminated.
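The grouping just described can be demonstrated with a small C sketch. It only
prints the tokens it finds; a real lexical analyzer would instead return token
codes and attributes to the parser.

    #include <ctype.h>
    #include <stdio.h>

    int main(void)
    {
        const char *p = "position := initial + rate * 60";
        char lexeme[32];
        while (*p) {
            int n = 0;
            if (isspace((unsigned char)*p)) { p++; continue; } /* blanks eliminated */
            if (isalpha((unsigned char)*p)) {                  /* an identifier */
                while (isalnum((unsigned char)*p)) lexeme[n++] = *p++;
                lexeme[n] = '\0';
                printf("id(%s)\n", lexeme);
            } else if (isdigit((unsigned char)*p)) {           /* a number */
                while (isdigit((unsigned char)*p)) lexeme[n++] = *p++;
                lexeme[n] = '\0';
                printf("num(%s)\n", lexeme);
            } else if (p[0] == ':' && p[1] == '=') {           /* assignment symbol */
                printf("assign(:=)\n");
                p += 2;
            } else {                                           /* + or * here */
                printf("op(%c)\n", *p++);
            }
        }
        return 0;
    }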

2. Hierarchical Analysis or Parsing or Syntax Analysis

The characters or tokens are grouped hierarchically into nested collections with
collective meaning. It involves grouping the tokens of the source program into
grammatical phrases that are then used by the compiler to synthesize output.

The grammatical phrases of the source program are represented by a parse tree.
Lexical constructs do not require recursion, while syntactic constructs do.
Context-free grammars are a formalization of recursive rules that can be used to
guide syntactic analysis. For example, the parse tree for
position := initial + rate * 60 is given below.

    assignment statement
     |-- identifier: position
     |-- :=
     `-- expression
          |-- expression -- identifier: initial
          |-- +
          `-- expression
               |-- expression -- identifier: rate
               |-- *
               `-- expression -- number: 60

The hierarchical structure of a program is expressed by recursive rules:

1. any identifier is an expression
2. any number is an expression
3. if expression1 and expression2 are expressions, then so are
       expression1 + expression2
       expression1 * expression2
       ( expression1 )
Symbol Table

Once the characters forming a token are grouped, they are removed from the input
so that processing of the next token can begin; the token is recorded in a table.

3. Semantic Analysis

It checks the source program for semantic errors. It uses the hierarchical
structure determined by the syntax-analysis phase to identify the operators and
operands of expressions and statements.
An important component of semantic analysis is type checking. For example,
when a binary arithmetic operator is applied to an integer and a real, the compiler
may need to convert the integer to a real.

ANALYSIS IN TEXT FORMATTERS

The input to a text formatter specifies a hierarchy of boxes: rectangular regions
to be filled by bit patterns, representing light and dark pixels to be printed by
the output device.

Consecutive characters not separated by "white space" are grouped into words,
consisting of a sequence of horizontally arranged boxes. Boxes in the TeX system
may be built from smaller boxes, combined horizontally and vertically. For example,

    \hbox{ <list of boxes> }
    \hbox{ \vbox{ ! 1 } \vbox{ @ 2 } }

places the vertical box (! above 1) next to the vertical box (@ above 2).

Mathematical expressions are built from operators like sub and sup for
subscripts and superscripts. In the EQN preprocessor for mathematics,

    BOX sub box          produces BOX with box as a subscript
    a sub { i sup 2 }    produces a with subscript i squared, i.e. a_{i^2}
1.2 THE PHASES OF A COMPILER

Each of the phases transforms the source program from one representation to
another.

SYMBOL-TABLE MANAGEMENT

The symbol table is a data structure containing a record for each identifier, with
fields for the attributes of the identifier. The data structure allows us to store
or retrieve data from a record quickly.
Example: int position, initial, rate;
Each of the identifiers position, initial and rate is entered into the symbol
table when it is seen by the lexical analyzer.

ERROR DETECTION AND REPORTING

Each phase can encounter errors. A compiler that stops when it finds the first
error is not as helpful as one that keeps going. Errors where the token stream
violates the structure rules of the language are detected by the syntax-analysis
phase.

THE ANALYSIS PHASES

As lexical analysis proceeds, the lexical analyzer enters each identifier lexeme
into the symbol table and returns the token id together with a pointer to the
entry. The character sequence forming a token is called the lexeme for the token.
Syntax analysis then imposes a hierarchical structure on the token stream,
represented by trees.

INTERMEDIATE CODE GENERATION

After syntax and semantic analysis, some compilers generate an explicit
intermediate representation of the source program, which is then translated
into the target program.

CODE OPTIMIZATION

This phase improves the intermediate code, so as to improve the running time
of the target program without slowing down compilation.

CODE GENERATION

The final phase generates the target code, consisting of relocatable machine code or
assembly code. Memory locations are selected for each of the variables used by the
program. The intermediate instructions are each translated into a sequence of
machine instructions that perform the same task.

    Source Program
          |
    Lexical Analyzer
          |
    Syntax Analyzer
          |                    Symbol-table Management
    Semantic Analyzer          and Error Handler
          |                    (interact with all six phases)
    Intermediate Code Generator
          |
    Code Optimizer
          |
    Code Generator
          |
    Target Program

    Fig. Phases of a compiler

1.3 COUSINS OF THE COMPILER

* The input to a compiler may be produced by one or more preprocessors.
* The cousins of the compiler are:
1. preprocessor
2. assembler
3. loader and
4. link editor

1. Preprocessor functions
* Preprocessors produce input to compilers.
* They perform the following functions.

a) Macro processing

A preprocessor may allow a user to define macros that are shorthands for longer
constructs.
Eg: # define M 6
main()
{ …
total = M*value;
printf(“m=%d”,M);
}
In the above example, M is a macro.

b) File inclusion

A preprocessor may include header files into the program text. For example, the C
preprocessor replaces the statement #include <global.h> with the contents of the
file global.h when it processes a file containing this statement.

c) Rational preprocessors

These preprocessors augment older languages with flow-of-control and
data-structuring facilities. For example, they may provide built-in macros for
constructs like while statements or if statements, where none exist in the
language itself:

    #define TEST if (X > Y)
    #define AND
    #define PRINT printf("X is big");

    main()
    {
        TEST AND PRINT
    }

d) Language extension

These preprocessors attempt to add capabilities to the language by means of
built-in macros. For example, statements beginning with ## might be taken by the
preprocessor to be database-access statements, translated into procedure calls on
routines that perform the database access.

MACRO PROCESSORS

Macro definitions are introduced by keywords like define or macro.
The general syntax is
    #define identifier string
Eg: #define M 5

The use of a macro consists of naming the macro and supplying actual
parameters, i.e., values for its formal parameters. The macro processor
substitutes the actual parameters for the formal parameters in the body of the
macro; the transformed body then replaces the macro use itself.
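As a concrete illustration with the C preprocessor, the macro below has formal
parameters X and Y; at each use, the actual parameters are substituted for them
throughout the body, and the transformed body replaces the call. The name MAX is
chosen here just for the example.

    #include <stdio.h>

    #define MAX(X, Y) ((X) > (Y) ? (X) : (Y))   /* X, Y are formal parameters */

    int main(void)
    {
        int a = 3, b = 7;
        /* MAX(a, b + 1) expands to ((a) > (b + 1) ? (a) : (b + 1)) */
        printf("%d\n", MAX(a, b + 1));           /* prints 8 */
        return 0;
    }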

2. Assemblers

Some compilers produce assembly code, which is passed to an assembler for
further processing. Other compilers perform the job of the assembler, producing
relocatable machine code that can be passed directly to the loader/link-editor.

Assembly code is a mnemonic version of machine code, in which names are used
instead of binary codes for operations, and names are also given to memory addresses.
Example: To compute b=a+2, the code is
mov a, R1
add #2, R1
mov R1, b

The above code moves the contents of the address a into register R1, then adds
the constant 2 to it and stores the result in the location named by b.

Two-pass assembly

* A pass consists of reading an input file once.

* In the first pass, all the identifiers that denote storage locations are found and
stored in a symbol table. Identifiers are assigned storage locations as they are
encountered, so after reading the program the symbol table contains:

    Identifier    Address
    a             0
    b             4

* In second pass, again the assembler scans the input. This time, it translates each
operation code into the sequence of bits representing that operation in machine
language, and it translates each identifier representing a location into the address
given for that identifier in the symbol table.
* The result of this pass is relocatable machine code.
* Example: the following is the machine code into which the above assembly
instructions would be translated.

    Load    0001 01 00 00000000 *
    Add     0011 01 10 00000010
    Store   0010 01 00 00000100 *
            code reg  tag operand

* In the above machine code the first 4 bits are the instruction code: Load (0001),
Add (0011) and Store (0010). The next 2 bits name register 1.
* The 2 bits after that are a tag:
    00 is an ordinary address – the last eight bits refer to a memory address.
    10 is an immediate address – the last 8 bits are taken as the operand.
* In the first and third instructions, the * is a relocation bit: it means L, the
load address, must be added to the address field of the instruction.
* In this example, the address of a is 0 and the address of b is 4.
* If L = 15 (00001111), then after adding L to the original addresses, a = 15 and
b = 19, and the above instructions become

    0001 01 00 00001111
    0011 01 10 00000010
    0010 01 00 00010011

which is absolute machine code.
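The relocation step can be sketched in C. The 16-bit word layout (4-bit opcode,
2-bit register, 2-bit tag, 8-bit operand) and the reloc array standing in for the
* bits are assumptions taken from the example above, not a real object-file format.

    #include <stdio.h>

    int main(void)
    {
        unsigned short code[3] = { 0x1400, 0x3602, 0x2404 };
        /* 0001 01 00 00000000   load  a  (address 0)
           0011 01 10 00000010   add   #2 (immediate)
           0010 01 00 00000100   store b  (address 4)  */
        int reloc[3] = { 1, 0, 1 };   /* the '*' relocation bits */
        unsigned L = 15;              /* load address, 00001111  */

        for (int i = 0; i < 3; i++)   /* add L to marked operand fields */
            if (reloc[i])
                code[i] = (code[i] & 0xFF00) | ((code[i] + L) & 0x00FF);

        for (int i = 0; i < 3; i++)
            printf("%04x\n", code[i]);    /* prints 140f 3602 2413 */
        return 0;
    }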

3. Loaders

A loader is a program that performs the two functions of loading and link-editing.
The process of loading consists of taking relocatable machine code, altering the
relocatable addresses, and placing the altered instructions and data in memory at
the proper locations.

4. Link-Editors

The link-editor makes a single program from several files of relocatable machine
code. These files may have been the result of several different compilations, and
one or more may be library files of routines provided by the system.

1.4 THE GROUPING OF PHASES

In an implementation, more than one phase is often grouped together.

Front and back ends

The phases are collected into a front end and a back end. The front end consists of
those phases that depend primarily on the source language and are largely
independent of the target machine. It includes lexical and syntactic analysis, the
creation of the symbol table, semantic analysis and the generation of intermediate
code. A certain amount of code optimization can be done by the front end as well,
and the front end also includes its share of error handling.

The back end includes those portions of the compiler that depend on the target
machine and generally do not depend on the source language. Aspects of this part
include code optimization and code generation, along with the necessary error
handling and symbol-table operations.

Passes

A pass consists of reading an input file and writing an output file. Several phases
may be grouped into one pass. For example, lexical analysis, syntax analysis,
semantic analysis, and intermediate code generation might be grouped into one pass:
the token stream after lexical analysis is translated directly into intermediate
code.

Reducing the number of passes:

Since it takes time to read and write intermediate files, it is desirable to have
few passes. But if we group several phases into one pass, we may be forced to keep
the entire program in memory, because one phase may need information in a different
order than a previous phase produces it.

Some pairs of phases, when grouped into one pass, present few problems.

Example: The interface between the lexical and syntactic analyzers can often be
limited to a single token. Intermediate and target code generation can be merged
into one pass using a technique called "backpatching".

Backpatching is a term from assembly: when the machine code is generated, labels
in three-address statements have to be converted to addresses of instructions. A
three-address statement has the form x := y op z, where op is any operator and
x, y and z are names or constants.

Since the intermediate and final representations of code for an assembler are
roughly the same, and of approximately the same length, backpatching over the
length of the entire assembly program is feasible.

1.5 COMPILER CONSTRUCTION TOOLS

* The compiler writer, like any programmer, can profitably use software tools such
as debuggers, version managers, profilers and so on.
* In addition to these software-development tools, other more specialized tools have
been developed for helping implement various phases of a compiler; examples are
parser generators and scanner generators.
1.6 Lexical analysis

A sequence of input characters that comprises a single token is called a lexeme.


A lexical analyzer can insulate a parser from the lexeme representation of tokens.

1.6.1 The role of the Lexical Analyzer

The lexical analyzer is the first phase of a compiler.


Its main task is to read the input characters and produce as output a sequence of
tokens that the parser uses for syntax analysis.

    Source     +----------+   token    +--------+
    program -->| Lexical  |----------->| Parser |
               | Analyzer |<-----------|        |
               +----------+ get next   +--------+
                    |         token         |
                    +----> Symbol table <---+

    Fig. Interaction of lexical analyzer with parser

The lexical analyzer acts as a subroutine or coroutine, which is called by the
parser whenever it needs a new token. This organization eliminates the need for
an intermediate file of tokens.

The lexical analyzer returns to the parser a representation for the token it has
found. The representation is an integer code if the token is a simple construct such as
left parentheses, comma or colon.

Lexical analyzers are divided into two phases:

1. scanning
2. lexical analysis

Scanning

The scanner is responsible for the simple tasks, such as eliminating blanks from
the input.

Lexical Analysis

The lexical analyzer proper is responsible for the more complex operations.

Functions of the lexical analyzers

* To keep track of line numbers, so that a line number can be associated with an
error message.
* If the source language supports macro preprocessor functions, these may be
implemented as part of lexical analysis.
* To strip out white space (redundant blanks and tabs).
* To delete comments.

Removal of white space and comments


White space (blanks, tabs and newlines) appears between tokens. Comments are also
treated as white space and are ignored by the parser and translator. If white space
is eliminated by the lexical analyzer, the parser will never have to consider it.

The alternative, modifying the grammar to incorporate white space into the
syntax, is not nearly as easy to implement.

Issues in lexical analysis

i) Simpler Design

The separation of lexical analysis from syntax analysis allows us to simplify one
or the other of these phases. For example, comments and white space have already
been removed when the parser runs.

ii) Compiler efficiency is improved

A separate lexical analyzer allows us to construct a specialized and potentially
more efficient processor for the task. A large amount of time is spent reading the
source program and partitioning it into tokens. Specialized buffering techniques for
reading input characters and processing tokens can significantly speed up the
performance of a compiler.
iii) Compiler portability is enhanced

Input-alphabet peculiarities and other device-specific anomalies can be
restricted to the lexical analyzer. The representation of special or non-standard
symbols can be isolated in the lexical analyzer.

Tokens, Patterns, Lexemes:

Tokens

A token is a set of strings in the input for which the same symbol is produced as output.

Lexemes

A lexeme is a sequence of characters in the source program that is matched by the
pattern for a token. Tokens are the terminal symbols in the grammar for the source
language. The lexemes matched by the pattern for a token represent strings of
characters in the source program that can be treated together as a lexical unit.

Tokens are treated as keywords, operators, identifiers, constants, literal strings
and punctuation symbols such as parentheses, commas and semicolons. A token
representing an identifier is returned to the parser. The lexical analyzer must
distinguish between a keyword and a user-defined identifier.

Patterns

It is a rule describing the set of strings that can represent a particular token
in the source program.
To describe the patterns for more complex tokens, we use the regular-expression
notation developed in the next section. Examples:

    TOKEN      SAMPLE LEXEMES          INFORMAL DESCRIPTION OF PATTERN
    const      const                   const
    if         if                      if
    relation   <, <=, >, >=, ==, !=    < or <= or > or >= or == or !=
    id         pi, count, D2           letter followed by letters and digits
    num        3.1416, 0, 6.02E3       any numeric constant
    literal    "core dumped"           any characters between " and " except "

If the statement

    position := initial + rate * 60

is given as input to the lexical analyzer, the output is

    id1 := id2 + id3 * 60

Attributes for tokens

When more than one pattern matches a lexeme, the lexical analyzer must
provide additional information about the particular lexeme that matched to the
subsequent phases of the compiler. The lexical analyzer collects information about
tokens into their associated attributes.

The tokens influence parsing decisions; the attributes influence the translation
of tokens. In practice, a token has usually only a single attribute: a pointer to
the symbol-table entry in which the information about the token is kept. The
pointer becomes the attribute of the token. In certain pairs there is no need for
an attribute value; the first component is sufficient to identify the lexeme. For
example, the tokens and associated attribute values for E = M * C * 2 are written
as a sequence of pairs:

    < id, pointer to symbol-table entry for E >
    < assign-op, >
    < id, pointer to symbol-table entry for M >
    < mult-op, >
    < id, pointer to symbol-table entry for C >
    < mult-op, >
    < num, integer value 2 >

Here num has been given an integer-valued attribute. The compiler may store the
character string that forms a number in a symbol table and let the attribute of
token num be a pointer to the table entry.

Lexical errors

For example, consider fi (a == f(x)) …

Here fi could be a misspelling of the keyword if, or an undeclared function
identifier. Since fi is a valid identifier, the lexical analyzer must return the
token for an identifier and let some other phase of the compiler handle the error.

Error-recovery actions:
i. Deleting an extraneous character.
ii. Inserting a missing character.
iii. Replacing an incorrect character by a correct character
iv. Transposing two adjacent characters.

1.6.2 Input buffering

There are two input buffer schemes

a. lookahead – to identify tokens


b. sentinels – for speeding up the lexical analyzer

Three approaches to the implementation of a lexical analyzer :

1. Use a lexical analyzer generator(Lex compiler), to produce the lexical analyzer


from a regular expression based specification. The generator provides routines for
reading and buffering the input.

2. Write the lexical analyzer in a conventional systems-programming language, using


the I/O facilities of that language to read the input.

3. Write the lexical analyzer in assembly language and explicitly manage the reading
of input.

The lexical analyzer is the only phase of the compiler that reads the source program
character by character, so it is possible to spend a considerable amount of time in
the lexical analysis phase. All these processes may be carried out with an extra
buffer into which the source file is read and then copied, with modification, into
the buffer.

Preprocessing the character stream in this way saves the trouble of moving the
lookahead pointer back and forth over comments or strings of blanks. Even so, the
lexical analyzer may need to look ahead several characters beyond the lexeme for a
pattern before a match can be announced.

The lexical analyzers used a function to push lookahead characters back into the
input stream. Because a large amount of time can be consumed moving characters,
specialized buffering techniques have been developed to reduce the amount of
overhead required to process an input character.

    : : E : = : M : * : C : * : * : 2 : eof : : : :
              ^                       ^
        lexeme-beginning           forward

    Fig. An input buffer in two halves

A buffer is divided into two N-character halves, where N is the number of characters
on one disk block (e.g., 1024 or 4096). N input characters are read into each half of
the buffer with one system read command, rather than invoking a read command for each
input character. eof marks the end of the source file and is different from any input
character.

Sentinels

Each buffer half holds a sentinel character at its end. The sentinel is a special
character that cannot be part of the source program; the character eof is a natural
choice. The same buffer arrangement, with the sentinels added:

    : : E : = : M : * : eof : C : * : * : 2 : eof : : : : eof
              ^                           ^
        lexeme-beginning               forward

    Fig. Sentinels at the end of each buffer half
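The payoff of the sentinels is that, in the common case, advancing the input
pointer costs a single test per character. The C sketch below illustrates the
scheme; the buffer size, the SENTINEL value and the fill_half helper are
assumptions made for this illustration, not a fixed design.

    #include <stdio.h>

    #define N 16                 /* illustrative; normally one disk block, e.g. 4096 */
    #define SENTINEL '\0'        /* assumed sentinel: cannot occur in the source     */

    static char buf[2 * N + 2];  /* two N-char halves, each followed by a sentinel   */
    static char *forward = buf + 2 * N + 1;  /* first advance triggers a reload      */

    /* Fill one half from stdin and plant the sentinel after it.  A short read
       (end of file) leaves a real SENTINEL inside the half. */
    static void fill_half(char *half)
    {
        size_t got = fread(half, 1, N, stdin);
        half[got] = SENTINEL;
        half[N]   = SENTINEL;
    }

    /* Return the next source character: usually one sentinel test per character. */
    static int next_char(void)
    {
        for (;;) {
            char c = *forward++;
            if (c != SENTINEL)
                return (unsigned char)c;
            if (forward == buf + N + 1)            /* hit end of first half  */
                fill_half(buf + N + 1);            /* reload second half     */
            else if (forward == buf + 2 * N + 2) { /* hit end of second half */
                fill_half(buf);                    /* reload first half      */
                forward = buf;
            } else
                return EOF;                        /* real end of input      */
        }
    }

    int main(void)
    {
        int c;
        while ((c = next_char()) != EOF)
            putchar(c);                            /* echo the buffered input */
        return 0;
    }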

1.6.3 Specification of tokens

The term alphabet or character class denotes any finite set of symbols. Examples
of symbols are letters and characters. The set { 0, 1 } is the binary alphabet.
Examples of computer alphabets: ASCII and EBCDIC.

A string (or sentence, or word) is a finite sequence of symbols drawn from that
alphabet. The length of a string s, written |s|, is the number of occurrences of
symbols in s. For example, VARMA is a string of length 5. The empty string,
denoted ε, is a special string of length zero.

A language denotes any set of strings over some fixed alphabet. Abstract
languages like ∅, the empty set, or { ε }, the set containing only the empty
string, are languages under this definition.

The concatenation of strings x and y is written xy. The empty string is the
identity element under concatenation: sε = εs = s.

    Term                    Definition

    Prefix of s             A string obtained by removing zero or more trailing
                            symbols of string s. For example, ban is a prefix of
                            banana.

    Suffix of s             A string obtained by removing zero or more leading
                            symbols of string s. For example, nana is a suffix of
                            banana.

    Substring of s          A string obtained by deleting a prefix and a suffix
                            from s. For example, nan is a substring of banana.
                            Every prefix and every suffix of s is a substring of
                            s, but not every substring of s is a prefix or a
                            suffix of s. For every string s, both s and ε are
                            prefixes, suffixes, and substrings of s.

    Proper prefix, suffix,  Any nonempty string x that is, respectively, a
    or substring of s       prefix, suffix, or substring of s such that s ≠ x.

    Subsequence of s        Any string formed by deleting zero or more not
                            necessarily contiguous symbols from s. For example,
                            baaa is a subsequence of banana.

Operations on Languages

The "exponentiation" operator extends to languages by defining L⁰ to be { ε } and
Lⁱ to be L Lⁱ⁻¹; that is, Lⁱ is L concatenated with itself i−1 times. For lexical
analysis, union, concatenation and closure are the operations of interest; they
apply to languages as in the following table.

    Operation                               Definition

    Union of L and M, written L ∪ M         L ∪ M = { s | s is in L or s is in M }

    Concatenation of L and M, written LM    LM = { st | s is in L and t is in M }

    Kleene closure of L, written L*         L* = union of Lⁱ for i = 0, 1, 2, …
                                            L* denotes "zero or more
                                            concatenations of" L

    Positive closure of L, written L⁺       L⁺ = union of Lⁱ for i = 1, 2, …
                                            L⁺ denotes "one or more
                                            concatenations of" L

Example

Let L = { A, B, …, Z, a, b, …, z } and D = { 0, 1, …, 9 }; that is, L is the
alphabet consisting of the set of upper- and lower-case letters, and D is the
alphabet consisting of the set of the ten decimal digits. Since a symbol can be
regarded as a string of length one, the sets L and D are each finite languages:

1. L ∪ D is the set of letters and digits.
2. LD is the set of strings consisting of a letter followed by a digit.
3. L⁴ is the set of all 4-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a
letter.

Regular expressions

An identifier is a letter followed by zero or more letters or digits. A regular
expression is built from a set of defining rules. Each regular expression r denotes
a language L(r). The defining rules specify how L(r) is formed by combining in
various ways the languages denoted by the subexpressions of r. The rules define the
regular expressions over the alphabet Σ. Associated with each rule is a
specification of the language denoted by the regular expression being defined.

a. ε is a regular expression that denotes { ε }, that is, the set containing the
empty string.
b. If a is a symbol in Σ, then a is a regular expression that denotes { a }, i.e.,
the set containing the string a.
c. Suppose r and s are regular expressions denoting the languages L(r) and L(s).
Then
    i.   (r)|(s) is a regular expression denoting L(r) ∪ L(s).
    ii.  (r)(s) is a regular expression denoting L(r)L(s).
    iii. (r)* is a regular expression denoting (L(r))*.
    iv.  (r) is a regular expression denoting L(r).

A language denoted by a regular expression is said to be a regular set. The
specification of a regular expression is an example of a recursive definition.
Rules (a) and (b) form the basis of the definition; a basic symbol refers to ε or
a symbol in Σ appearing in a regular expression. Rule (c) provides the inductive
step, building compound regular expressions.

Unnecessary parentheses can be avoided in regular expressions if we adopt the
conventions:

1. the unary operator * has the highest precedence and is left associative.
2. concatenation has the second highest precedence and is left associative.
3. | has the lowest precedence and is left associative.

There are a number of algebraic laws obeyed by regular expressions; they can be
used to manipulate regular expressions into equivalent forms. Algebraic laws that
hold for regular expressions r, s and t:

    Axiom                              Description

    r|s = s|r                          | is commutative
    r|(s|t) = (r|s)|t                  | is associative
    (rs)t = r(st)                      concatenation is associative
    r(s|t) = rs|rt ; (s|t)r = sr|tr    concatenation distributes over |
    εr = r ; rε = r                    ε is the identity element for concatenation
    r* = (r|ε)*                        relation between * and ε
    r** = r*                           * is idempotent

Regular Definitions

If Σ is an alphabet of basic symbols, then a regular definition is a sequence of
definitions of the form

    d₁ → r₁, d₂ → r₂, d₃ → r₃, …, dₙ → rₙ

where each dᵢ is a distinct name, and each rᵢ is a regular expression over the
symbols in Σ ∪ { d₁, d₂, …, dᵢ₋₁ }. If rᵢ used dⱼ for some j ≥ i, then rᵢ might be
recursively defined, which is not allowed. For example, the set of strings such as
5280, 39.37 and 6.39E01 is described by the regular definition:

    digit    → 0 | 1 | … | 9
    digits   → digit digit*
    opt_fra  → . digits | ε
    opt_exp  → ( E ( + | - | ε ) digits ) | ε
    num      → digits opt_fra opt_exp
Notational Shorthands

One or more instances

If r is a regular expression that denotes the language L(r), then (r)⁺ is a regular
expression that denotes the language (L(r))⁺. The unary postfix operator ⁺ means
"one or more instances of". The regular expression a⁺ denotes the set of all strings
of one or more a's. The operator ⁺ has the same precedence and associativity as the
operator *. The two algebraic identities r* = r⁺ | ε and r⁺ = r r* relate the Kleene
and positive closure operators.

Zero or one instance

The unary postfix operator ? means "zero or one instance of". The notation r? is a
shorthand for r | ε. If r is a regular expression, then (r)? is a regular expression
that denotes the language L(r) ∪ { ε }. The example above can be rewritten as

    digit    → 0 | 1 | … | 9
    digits   → digit⁺
    opt_fra  → ( . digits )?
    opt_exp  → ( E ( + | - )? digits )?
    num      → digits opt_fra opt_exp
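The regular definition for num can be checked directly as a POSIX extended regular
expression with the standard regcomp/regexec interface. The pattern string below is
a hand translation of the definition above; it is an illustration, not part of any
compiler.

    #include <regex.h>
    #include <stdio.h>

    int main(void)
    {
        regex_t re;
        /* digits ( . digits )? ( E ( + | - )? digits )?  anchored at both ends */
        const char *pat = "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$";
        const char *tests[] = { "5280", "39.37", "6.39E01", "1.894E-4", "E12" };

        regcomp(&re, pat, REG_EXTENDED | REG_NOSUB);
        for (int i = 0; i < 5; i++)
            printf("%-10s %s\n", tests[i],
                   regexec(&re, tests[i], 0, NULL, 0) == 0 ? "num" : "not num");
        regfree(&re);
        return 0;
    }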

Character classes

The notation [abc] where a, b, and c are alphabet symbol denotes the regular
expression a | b | c.

The character class [a-z] denotes the regular expression a | b | … | z. Using
character classes, identifiers can be described as the strings generated by the
regular expression [A-Za-z][A-Za-z0-9]*.

Non-regular sets

Regular expressions can be used to denote only a fixed number of repetitions or an
unspecified number of repetitions of a given construct. Regular expressions cannot
describe balanced or nested constructs. Repeating strings also cannot be described
by regular expressions: the set { ww | w is a string of a's and b's } cannot be
denoted by any regular expression, nor can it be described by a context-free
grammar.

1.6.4 Recognition of Tokens

Lexical analyzer will isolate the lexeme for the next token in the input buffer and
produce as output a pair consisting of the appropriate token and attribute value.

Transition Diagram

As an intermediate step in the construction of a lexical analyzer, we produce a
stylized flowchart called a transition diagram. Transition diagrams depict the
actions that take place when a lexical analyzer is called by the parser to get the
next token. Positions in a transition diagram are drawn as circles and are called
states. The states are connected by arrows called edges. The edges leaving state s
have labels indicating the input characters that can next appear after the
transition diagram has reached state s. The initial state of the transition diagram
is labeled the start state, where control resides when we begin to recognize the
token. The double-circle state is an accepting state
or final state, in which the token has been found. A * indicates states on which
input retraction must take place. If failure occurs in all transition diagrams,
then a lexical error has been detected and an error-recovery routine is invoked.
For example, the transition diagram for >= is:

    start       >         =
    ------>(0)------>(1)------>((2))
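A transition diagram translates directly into code: each state becomes a branch,
and each edge becomes a test on the next input character. The C sketch below
simulates the diagram above; the token names are invented for the illustration,
and returning GT without consuming input plays the role of the * retraction used
when only > was seen.

    #include <stdio.h>

    typedef enum { GE, GT, FAIL } Token;

    /* state 0 --'>'--> state 1 --'='--> state 2 (accept >=) */
    static Token relop(const char **p)
    {
        if (**p != '>')               /* state 0 */
            return FAIL;
        (*p)++;                       /* state 1 */
        if (**p == '=') {
            (*p)++;                   /* state 2: accept >= */
            return GE;
        }
        return GT;                    /* accept bare >; next char not consumed (*) */
    }

    int main(void)
    {
        const char *s = ">= 5";
        printf("%s\n", relop(&s) == GE ? "token GE" : "other");
        return 0;
    }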

UNIT – II
SYLLABUS
Syntax Analysis: Role of the parser – Context free grammars –Top down parsing:
Recursive Descent Parsing. Predictive parsing – Bottom up parsing: Handles, Handle
pruning, Stack implementation of shift-reducing parsing.

SYNTAX ANALYSIS:

2.1 ROLE OF THE PARSER


* The parser obtains a string of tokens from the lexical analyzer and verifies that
the string can be generated by the grammar for the source language.
* The parser reports any syntax errors.
* It tries to recover from commonly occurring errors so that it can continue
processing the remainder of its input.

Three types of parsers for grammars

Universal Parser

* The Cocke-Younger-Kasami (CYK) algorithm and Earley's algorithm can parse any
grammar. These methods are too inefficient to use in production compilers.

Top-down Parser

* Builds parse trees from the top (root) to the bottom (leaves).

Bottom-up Parser

* Builds parse trees from the bottom to the top.

    Source    +----------+  token   +--------+  parse  +-----------+  intermediate
    program ->| Lexical  |--------->| Parser |-------->| Rest of   |-------------->
              | analyzer |<---------|        |  tree   | front end |  representation
              +----------+ get next +--------+         +-----------+
                   |         token       |
                   +----> Symbol table <-+

    Fig. Position of parser in compiler model

* In both the top-down and bottom-up cases, the input to the parser is scanned from
left to right, one symbol at a time, and the parsers work only on subclasses of
grammars, such as the LL and LR grammars. Automated tools can construct parsers for
the larger class of LR grammars.
* The output of the parser is a representation of the parse tree for the stream of
tokens produced by the lexical analyzer.

Tasks conducted during parsing

* Collecting information about various tokens into the symbol table.
* Performing type checking and other kinds of semantic analysis.
* Generating intermediate code.
* Handling syntactic errors; two general recovery strategies are called panic-mode
and phrase-level recovery.

Syntax error handling

* If a compiler had to process only correct programs, its design and implementation
would be greatly simplified.
* When programmers write incorrect programs, a good compiler should assist the
programmer in identifying and locating errors.
* A compiler is required to check the program for syntactic accuracy in the
computer language.
* Planning error handling right from the start can both simplify the structure of a
compiler and improve its response to errors.

Different levels of errors

* Lexical: misspelling an identifier, keyword, or operator.
* Syntactic: an arithmetic expression with unbalanced parentheses.
* Semantic: an operator applied to an incompatible operand.
* Logical: an infinitely recursive call.

Error detection and recovery in the syntax-analysis phase

* Many errors are syntactic in nature, or are exposed when the stream of tokens
coming from the lexical analyzer disobeys the grammatical rules defining the
programming language.
* Parsing methods can detect syntactic errors in programs very efficiently.
* Accurately detecting semantic and logical errors at compile time is a very
difficult task.

The error handler in a parser has simple-to-state goals

* It should report the presence of errors clearly and accurately.
* It should recover from each error quickly enough to be able to detect subsequent
errors.
* It should not significantly slow down the processing of correct programs.
* The LL and LR methods of parsing detect an error as soon as possible. They have
the viable-prefix property, meaning they detect that an error has occurred as soon
as they see a prefix of the input that is not a prefix of any string in the
language.

Error-recovery strategies

* Error-recovery strategies are the different general strategies a parser uses to
recover from a syntactic error.

1. Panic-mode recovery

The parser discards input symbols one at a time until one of a designated set of
synchronizing tokens is found. The compiler designer must select the synchronizing
tokens appropriate for the source language. Panic-mode correction skips an amount of
input without checking it for additional errors. The synchronizing tokens are usually
delimiters, such as semicolon or end, whose role in the source program is clear. It
can be used by most parsing methods. In situations where multiple errors in the same
statement are rare, this method is quite adequate.

Advantage: Simplest to implement and guaranteed not to go into an infinite loop.

2. Phrase-level recovery

A parser may perform local correction on the remaining input; that is, it may replace
a prefix of the remaining input by some string that allows the parser to continue. A
typical local correction would be to replace a comma by a semicolon, delete an
extraneous semicolon, or insert a missing semicolon. The choice of the local
correction is left to the compiler designer, who must choose replacements that do not
lead to infinite loops. This type of replacement can correct any input string and has
been used in several error-repairing compilers. The method was first used with
top-down parsing.
Drawback: It has difficulty coping with situations in which the actual error occurred
before the point of detection.

3. Error-productions

If the common errors that may be encountered are known, the grammar for the language
at hand can be augmented with productions that generate the erroneous constructs. A
parser is then constructed from the grammar augmented by these error productions. If
the parser uses an error production, it can generate appropriate error diagnostics to
indicate the erroneous construct that has been recognized in the input.

4. Global correction

Ideally, a compiler should make as few changes as possible in processing an incorrect
input string. There are algorithms for choosing a minimal sequence of changes to obtain
a globally least-cost correction. These methods are too costly to implement in terms of
time and space, so they are currently only of theoretical interest. Given an incorrect
input string x and grammar G, these algorithms will find a parse tree for a related
string y, such that the number of insertions, deletions, and changes of tokens required
to transform x into y is as small as possible. The notion of least-cost correction
provides a yardstick for evaluating error-recovery techniques, and it has been used for
finding optimal replacement strings for phrase-level recovery.

2.2 CONTEXT-FREE GRAMMARS (CFG)

* Many programming language constructs have a recursive structure that can be defined
by context-free grammars. For example, a conditional statement may be defined by the
rule: if S1 and S2 are statements and E is an expression, then "if E then S1 else S2"
is a statement.
* Regular expressions can specify the lexical structure of tokens. Using the syntactic
variable stmt to denote the class of statements and expr the class of expressions, the
grammar production is
    stmt → if expr then stmt else stmt
* A CFG consists of
    1. terminals
    2. non-terminals
    3. a start symbol and
    4. productions.

* Terminals are the basic symbols from which strings are formed. The word "token" is a
synonym for "terminal" when we talk about grammars for programming languages. In the
production stmt → if expr then stmt else stmt, each of the keywords if, then and else
is a terminal.

* Non-terminals are syntactic variables that denote sets of strings. The non-terminals
define sets of strings that help define the language generated by the grammar. They
impose a hierarchical structure on the language that is useful for both syntax
analysis and translation. In the above example, stmt and expr are non-terminals.

* In a grammar, one non-terminal is distinguished as the start symbol, and the set of
strings it denotes is the language defined by the grammar.
* The productions of a grammar specify the ways in which the terminals and
non-terminals can be combined to form strings. Each production consists of a
non-terminal, followed by an arrow, followed by a string of non-terminals and
terminals.

Example 1:

    expr → expr op expr
    expr → ( expr )
    expr → - expr
    expr → id
    op → +
    op → -
    op → *
    op → /
    op → ↑

* In this grammar, the terminal symbols are id + - * / ↑ ( ); the non-terminal
symbols are expr and op, and expr is the start symbol.

CFG: Example 2

G = ({S}, {a, b}, P, S), where S → ab and S → aSb are the only productions in P.
Derivations look like this:

    S ⇒ ab
    S ⇒ aSb ⇒ aabb
    S ⇒ aSb ⇒ aaSbb ⇒ aaabbb

L(G), the language generated by G, is { aⁿbⁿ | n > 0 }.

Notational Conventions

These symbols are terminals:

* Lower-case letters early in the alphabet, such as a, b, c.
* Operator symbols such as +, -, etc.
* Punctuation symbols such as parentheses, comma, etc.
* The digits 0, 1, …, 9.
* Boldface strings such as id and if.

These symbols are non-terminals:

* Upper-case letters early in the alphabet, such as A, B, C.
* The letter S, which, when it appears, is usually the start symbol.
* Lower-case italic names such as expr or stmt.

* Upper-case letters late in the alphabet, such as X, Y, Z, represent grammar symbols,
that is, either non-terminals or terminals.
* Lower-case letters late in the alphabet, such as u, v, …, z, represent strings of
terminals.
* Lower-case Greek letters α, β, γ represent strings of grammar symbols. A production
can be written as A → α, indicating a single non-terminal A on the left side of the
production and a string of grammar symbols α on the right side of the production.
* If A → α1, A → α2, …, A → αk are all productions with A on the left, we may write
A → α1 | α2 | … | αk, where α1, α2, …, αk are the alternatives for A.
* Unless otherwise stated, the left side of the first production is the start symbol.
* For example, E → E A E | ( E ) | - E | id and A → + | - | * | / | ↑. Here E and A
are non-terminals, with E the start symbol; the remaining symbols are terminals.

Derivations

* A derivation step replaces the non-terminal on the left of a production by the right
side of the production. For example, consider E → E + E | E * E | ( E ) | - E | id.
* The production E → -E says that an expression preceded by a minus sign is also an
expression. This production can be used to generate more complex expressions from
simpler expressions by allowing us to replace any instance of an E by -E. We write
E ⇒ -E, read "E derives -E".
* In the abstract, αAβ ⇒ αγβ if A → γ is a production and α and β are arbitrary
strings of grammar symbols. If α1 ⇒ α2 ⇒ … ⇒ αn, we say α1 derives αn. The symbol ⇒
means "derives in one step", ⇒* means "derives in zero or more steps", and ⇒+ means
"derives in one or more steps".
    1. α ⇒* α for any string α,
    2. if α ⇒* β and β ⇒ γ, then α ⇒* γ.
* A language that can be generated by a grammar is said to be a context-free language.
If two grammars generate the same language, the grammars are said to be equivalent.
* Strings in L(G) may contain only terminal symbols of G. A string of terminals w is
in L(G) if and only if S ⇒+ w. The string w is called a sentence of G.
* If S ⇒* α, where α may contain non-terminals, then α is a sentential form of G. A
sentence is a sentential form with no non-terminals.
* For example, the string -(id + id) is a sentence of the grammar
E → E + E | E * E | ( E ) | -E | id because there is the derivation
E ⇒ -E ⇒ -( E ) ⇒ -( E + E ) ⇒ -( id + E ) ⇒ -( id + id ).
* The strings E, -E, -( E ), …, -( id + id ) appearing in this derivation are all
sentential forms of this grammar. We write E ⇒* -( id + id ) to indicate that
-( id + id ) can be derived from E.
* In a leftmost derivation, only the leftmost non-terminal in any sentential form is
replaced at each step. If α ⇒ β by a step in which the leftmost non-terminal in α is
replaced, we write α ⇒lm β.
* The leftmost derivation here is
E ⇒lm -E ⇒lm -( E ) ⇒lm -( E + E ) ⇒lm -( id + E ) ⇒lm -( id + id ).
* Using the notational conventions, every leftmost step can be written wAγ ⇒lm wδγ,
where w consists of terminals only, A → δ is the production applied, and γ is a
string of grammar symbols. If α derives β by a leftmost derivation, we write
α ⇒*lm β. If S ⇒*lm α, then α is a left-sentential form of the grammar.
* In a rightmost derivation, the rightmost non-terminal is replaced at each step.
Rightmost derivations are also called canonical derivations.

Parse tree and derivations

* A parse tree is a graphical representation for a derivation.
* Each interior node of a parse tree is labeled by a non-terminal A, and the children
of the node are labeled, from left to right, by the symbols in the right side of the
production by which A was replaced in the derivation.
* The leaves of the parse tree are labeled by non-terminals or terminals and, read
from left to right, they constitute a sentential form, called the yield or frontier
of the tree.
* Consider a derivation α1 ⇒ α2 ⇒ … ⇒ αn, where α1 is a single non-terminal A. For
each sentential form αi in the derivation, we construct a parse tree whose yield is
αi. The process is an induction on i.
* Suppose we already have a parse tree whose yield is αi-1 = X1 X2 … Xk, and αi is
derived from αi-1 by replacing Xj, a non-terminal, by β = Y1 Y2 … Yr. That is, at the
ith step of the derivation, production Xj → β is applied to αi-1 to derive
αi = X1 X2 … Xj-1 β Xj+1 … Xk.
* For example, the parse tree for -( id + id ) implied by the derivation above:

        E
       / \
      -   E
        / | \
       (  E  )
         /|\
        E + E
        |   |
        id  id

    Fig. Parse tree for -( id + id )
Example: The sentence id + id * id has two distinct leftmost derivations:

    E ⇒ E + E                      E ⇒ E * E
      ⇒ id + E                       ⇒ E + E * E
      ⇒ id + E * E                   ⇒ id + E * E
      ⇒ id + id * E                  ⇒ id + id * E
      ⇒ id + id * id                 ⇒ id + id * id

with the two corresponding parse trees:

         E                               E
       / | \                           / | \
      E  +  E                         E  *  E
      |   / | \                     / | \   |
     id  E  *  E                   E  +  E  id
         |     |                   |     |
         id    id                  id    id

Note: the first tree treats the * operator as having higher precedence than +.

Ambiguity

* A grammar that produces more than one parse tree for some sentence is said to be
ambiguous.
* Equivalently, an ambiguous grammar is one that produces more than one leftmost or
rightmost derivation for the same sentence.

2.3 Top-down Parsing

* The goal is to construct an efficient non-backtracking form of top-down parser
called a predictive parser.
* Predictive parsers can be constructed automatically from the class of LL(1)
grammars.

2.3.1 Recursive-descent parsing

* Top-down parsing can be viewed as an attempt to find a leftmost derivation for an
input string.
* It can also be viewed as an attempt to construct a parse tree for the input,
starting from the root and creating the nodes of the parse tree in preorder.
* The general form of top-down parsing, called recursive descent, may involve
backtracking, that is, making repeated scans of the input. Recursive-descent parsing
that needs no backtracking is called predictive parsing.
* Backtracking parsers are not very efficient, and backtracking is rarely needed to
parse programming language constructs.
* For example, consider the grammar S → c A d, A → a b | a and the input string
w = cad.

       S            S            S
      /|\          /|\          /|\
     c A d        c A d        c A d
                   / \           |
                  a   b          a

    Fig. Steps in top-down parse

* To construct a parse tree for this string top-down, we create a tree consisting of
a single node labeled S.
* The leftmost leaf, labeled c, matches the first symbol of w.
* We advance the input pointer to a, the second symbol of w; the next leaf, labeled
A, must match the second input symbol, so we expand A by its first alternative a b.
* We advance the input pointer to d, the third input symbol, and compare it against
the next leaf, labeled b.
* Since b does not match d, we go back to A and reset the input pointer to position
2, which means that the procedure for A must store the input pointer in a local
variable.
* We then try the second alternative for A to obtain the third tree. The leaf a
matches the second symbol of w and the leaf d matches the third symbol.
* Having produced a parse tree for w, the parse announces successful completion; a
sketch in code is given after this list.
* A left-recursive grammar can cause a recursive-descent parser, even one with
backtracking, to go into an infinite loop.
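Here is that sketch, as minimal C. The function names mirror the non-terminals; a
real parser would also build the parse tree rather than merely accept or reject.

    #include <stdio.h>

    static const char *input;
    static int pos;                       /* input pointer */

    static int match(char c)
    {
        if (input[pos] == c) { pos++; return 1; }
        return 0;
    }

    /* A -> a b | a : try the first alternative; on failure, reset the
       input pointer (backtrack) and try the second. */
    static int A(void)
    {
        int save = pos;                   /* store input pointer locally */
        if (match('a') && match('b')) return 1;
        pos = save;                       /* backtrack */
        return match('a');
    }

    /* S -> c A d */
    static int S(void)
    {
        return match('c') && A() && match('d');
    }

    int main(void)
    {
        input = "cad"; pos = 0;
        printf("%s\n", S() && input[pos] == '\0' ? "accepted" : "rejected");
        return 0;
    }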

2.3.2 Predictive Parsers

* We eliminate left recursion from the grammar, and left-factor the resulting
grammar.
* We thereby obtain a grammar that can be parsed by a recursive-descent parser that
needs no backtracking, i.e., a predictive parser.
* To construct a predictive parser, we must know, given the current input symbol a
and the non-terminal A to be expanded, which one of the alternatives of production
A → α1 | α2 | … | αn is the unique alternative that derives a string beginning
with a.
* For example, consider the productions
    stmt → if expr then stmt else stmt | while expr do stmt | begin stmt-list end
* The keywords if, while, and begin tell us which alternative is the only one that
could possibly succeed if we are to find a statement.

Transition diagrams for Predictive Parsers

* A transition diagram is a useful plan or flowchart for a predictive parser, just as
it is for a lexical analyzer.
* The labels of edges are tokens and non-terminals.
* A transition on a token means that we take that transition if that token is the
next input symbol.
* To construct the transition diagrams of a predictive parser from a grammar, first
eliminate left recursion from the grammar, and then left-factor the grammar. Then
for each non-terminal A:
    a) Create an initial and a final state.
    b) For each production A → X1 X2 … Xn, create a path from the initial to the
       final state, with edges labeled X1, X2, …, Xn.
* More than one transition from a state on the same input is an ambiguity.
* Such diagrams can be turned into a recursive-descent parser that backtracks
systematically.
* For example, the grammar E → E + T | T, T → T * F | F, F → ( E ) | id becomes,
after eliminating left recursion,
    E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → ( E ) | id
and yields the collection of transition diagrams below.
    E:   0 --T--> 1 --E'--> 2

    E':  3 --+--> 4 --T--> 5 --E'--> 6
         3 --ε--> 6

    T:   7 --F--> 8 --T'--> 9

    T':  10 --*--> 11 --F--> 12 --T'--> 13
         10 --ε--> 13

    F:   14 --(--> 15 --E--> 16 --)--> 17
         14 --id--> 17

    Fig. Transition diagrams for the grammar

Substituting one diagram in another, and similar transformations, can simplify the
transition diagrams. In the diagram for E', the final E'-edge can be replaced by a
jump back to state 3, since E' begins at state 3; substituting the result into the
diagram for E gives a diagram in which, after the first T, control loops on "+ T".
An analogous simplification applies to T and T'. The resulting diagrams:

    E:   0 --T--> 3 ;  3 --+--> 4 ;  4 --T--> 3 ;  3 --ε--> 6
    T:   7 --F--> 10 ;  10 --*--> 11 ;  11 --F--> 10 ;  10 --ε--> 13
    F:   14 --(--> 15 ;  15 --E--> 16 ;  16 --)--> 17 ;  14 --id--> 17

    Fig. Simplified transition diagrams for arithmetic expressions
Non-recursive predictive parsing

* A non-recursive predictive parser can be built by maintaining a stack explicitly.
The key problem during predictive parsing is determining the production to be
applied for a non-terminal.
* The non-recursive parser looks up the production to be applied in a parsing table.
* The table can be constructed directly from the grammar.
* A table-driven predictive parser has an input buffer, a stack, a parsing table,
and an output stream.
* The input buffer contains the string to be parsed, followed by $, a symbol used
as a right end-marker to indicate the end of the input string.
* The stack contains a sequence of grammar symbols with $ on the bottom; initially
the start symbol of the grammar is on top of $.
* The parsing table is a two-dimensional array M[A, a], where A is a non-terminal,
and a is a terminal or the symbol $.
* Let X be the symbol on top of the stack and a the current input symbol.

    INPUT:   a  +  b  $

    STACK:                 +----------------------+
       X  <-- top          | Predictive Parsing   | --> OUTPUT
       Y                   | Program              |
       Z                   +----------------------+
       $                             |
                               Parsing Table M

    Fig. Non-recursive predictive parser

These two symbols determine the action of the parser. There are three possibilities:

1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to
the next input symbol.
3. If X is a non-terminal, the program consults entry M[X, a] of the parsing table M.
This entry will be either an X-production of the grammar or an error entry.
For example, if M[X, a] = { X → U V W }, the parser replaces X on top of the stack
by W V U (with U on top).
If M[X, a] = error, the parser calls an error-recovery routine.

Algorithm: Non-recursive predictive parsing.

Input: A string w and a parsing table M for grammar G.
Output: If w is in L(G), a leftmost derivation of w; otherwise, an error indication.
Method: Initially the parser has $S on the stack, with S, the start symbol of G, on
top, and w$ in the input buffer. The program that utilizes the predictive parsing
table M to produce a parse for the input is:

    set ip to point to the first symbol of w$;
    repeat
        let X be the top stack symbol and a the symbol pointed to by ip;
        if X is a terminal or $ then
            if X = a then
                pop X from the stack and advance ip
            else error()
        else /* X is a non-terminal */
            if M[X, a] = X → Y1 Y2 … Yk then begin
                pop X from the stack;
                push Yk, Yk-1, …, Y1 onto the stack, with Y1 on top;
                output the production X → Y1 Y2 … Yk
            end
            else error()
    until X = $ /* stack is empty */

    Fig. Predictive parsing algorithm
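The algorithm above can be made concrete with a compact C sketch for the expression
grammar traced below. Encoding E' as e, T' as t, and the token id as i is purely a
convenience of this illustration.

    #include <stdio.h>
    #include <string.h>

    /* M[X, a]: right side of the production to apply, "" for an
       epsilon-production, NULL for an error entry. */
    static const char *M(char X, char a)
    {
        switch (X) {
        case 'E': if (a == 'i' || a == '(') return "Te";  break;
        case 'e': if (a == '+') return "+Te";
                  if (a == ')' || a == '$') return "";    break;
        case 'T': if (a == 'i' || a == '(') return "Ft";  break;
        case 't': if (a == '*') return "*Ft";
                  if (a == '+' || a == ')' || a == '$') return ""; break;
        case 'F': if (a == 'i') return "i";
                  if (a == '(') return "(E)";             break;
        }
        return NULL;
    }

    int main(void)
    {
        const char *w = "i+i*i$";            /* id + id * id $ */
        char stack[64] = "$E";               /* start symbol on top of $ */
        int top = 1, ip = 0;

        while (stack[top] != '$') {
            char X = stack[top], a = w[ip];
            if (X == a) { top--; ip++; }     /* match a terminal */
            else if (strchr("EeTtF", X)) {
                const char *rhs = M(X, a);
                if (!rhs) { puts("error"); return 1; }
                top--;                       /* pop X ... */
                for (int k = (int)strlen(rhs) - 1; k >= 0; k--)
                    stack[++top] = rhs[k];   /* ... push rhs reversed */
                printf("%c -> %s\n", X, *rhs ? rhs : "epsilon");
            }
            else { puts("error"); return 1; }
        }
        puts(w[ip] == '$' ? "accepted" : "error");
        return 0;
    }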

For example, consider the grammar E → E + T | T, T → T * F | F, F → ( E ) | id,
transformed into
    E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → ( E ) | id.

* On input id + id * id $ the predictive parser makes the sequence of moves below.
* The input pointer points to the leftmost symbol of the string in the INPUT column.
* The productions output form a leftmost derivation of the input.
* The input symbols already scanned, followed by the grammar symbols on the stack
(from top to bottom), make up the left-sentential forms in the derivation.

    STACK        INPUT             OUTPUT

    $E           id + id * id $
    $E'T         id + id * id $    E → T E'
    $E'T'F       id + id * id $    T → F T'
    $E'T'id      id + id * id $    F → id
    $E'T'        + id * id $
    $E'          + id * id $       T' → ε
    $E'T+        + id * id $       E' → + T E'
    $E'T         id * id $
    $E'T'F       id * id $         T → F T'
    $E'T'id      id * id $         F → id
    $E'T'        * id $
    $E'T'F*      * id $            T' → * F T'
    $E'T'F       id $
    $E'T'id      id $              F → id
    $E'T'        $
    $E'          $                 T' → ε
    $            $                 E' → ε

    Fig. Moves made by predictive parser on input id + id * id

FIRST and FOLLOW

 The construction of a predictive parser is aided by two functions associated with a grammar G:
i) FIRST
ii) FOLLOW
 Sets of tokens yielded by the FOLLOW function can also be used as synchronizing tokens during panic-mode error recovery.
 FIRST(α), where α is any string of grammar symbols, is the set of terminals that begin the strings derived from α. If α ⇒* ε, then ε is also in FIRST(α).
 FOLLOW(A), for a non-terminal A, is the set of terminals a that can appear immediately to the right of A in some sentential form; that is, the set of terminals a such that there exists a derivation of the form S ⇒* αAaβ for some α and β.
Note : During the derivation, there may have been symbols between A and a, but if so, they derived ε and disappeared. If A can be the rightmost symbol in some sentential form, then $ is in FOLLOW(A).
Rules :

To compute FIRST(X) for all grammar symbols X, apply the following rules until no more terminals or ε can be added to any FIRST set.

i) If X is a terminal, then FIRST(X) is { X }.
ii) If X → ε is a production, then add ε to FIRST(X).
iii) If X is a non-terminal and X → Y1 Y2 … Yk is a production, then place a in FIRST(X) if for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), … , FIRST(Yi-1); that is, Y1 … Yi-1 ⇒* ε. If ε is in FIRST(Yj) for all j = 1, 2, …, k, then add ε to FIRST(X).
Rules :

To compute FOLLOW(A) for all non-terminals A, apply the following rules until nothing can be added to any FOLLOW set.

i) Place $ in FOLLOW(S), where S is the start symbol and $ is the input right end marker.
ii) If there is a production A → αBβ, then everything in FIRST(β) except ε is placed in FOLLOW(B).
iii) If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε (i.e., β ⇒* ε), then everything in FOLLOW(A) is in FOLLOW(B).

For example, consider the grammar

E → E + T | T , T → T * F | F , F → ( E ) | id
After eliminating the left recursion:
E → TE′ , E′ → +TE′ | ε , T → FT′ , T′ → *FT′ | ε , F → ( E ) | id.

FIRST(E) = { ( , id }       FOLLOW(E) = { $ , ) }
FIRST(E′) = { + , ε }       FOLLOW(E′) = { $ , ) }
FIRST(T) = { ( , id }       FOLLOW(T) = { + , $ , ) }
FIRST(T′) = { * , ε }       FOLLOW(T′) = { + , $ , ) }
FIRST(F) = { ( , id }       FOLLOW(F) = { * , + , $ , ) }
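These sets can be computed mechanically as a fixed-point iteration. The following Python sketch (illustrative only; the dictionary encoding of the grammar is an assumption, and 'e' stands for ε) applies the FIRST and FOLLOW rules above until nothing changes:

GRAMMAR = {
    'E':  [['T', "E'"]],
    "E'": [['+', 'T', "E'"], []],     # [] is the epsilon body
    'T':  [['F', "T'"]],
    "T'": [['*', 'F', "T'"], []],
    'F':  [['(', 'E', ')'], ['id']],
}
START = 'E'
FIRST = {A: set() for A in GRAMMAR}
FOLLOW = {A: set() for A in GRAMMAR}
FOLLOW[START].add('$')                 # FOLLOW rule (i)

def first_of(symbols):
    """FIRST of a string of grammar symbols; a terminal X has FIRST(X) = {X}."""
    result = set()
    for X in symbols:
        f = FIRST[X] if X in GRAMMAR else {X}
        result |= f - {'e'}
        if 'e' not in f:
            return result
    return result | {'e'}              # every symbol can derive epsilon

changed = True
while changed:                         # apply the rules until nothing is added
    changed = False
    for A, bodies in GRAMMAR.items():
        for body in bodies:
            for t in first_of(body):   # FIRST rules (ii) and (iii)
                if t not in FIRST[A]:
                    FIRST[A].add(t); changed = True
            for i, B in enumerate(body):
                if B not in GRAMMAR:
                    continue
                rest = first_of(body[i + 1:])
                new = rest - {'e'}     # FOLLOW rule (ii)
                if 'e' in rest:
                    new |= FOLLOW[A]   # FOLLOW rule (iii)
                if not new <= FOLLOW[B]:
                    FOLLOW[B] |= new; changed = True

print(FIRST["E'"], FOLLOW['T'])        # {'+', 'e'} and {'+', ')', '$'} (order may vary)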

Error recovery in Predictive parsing

 An error is detected during predictive parsing when the terminal on top of the stack does not match
the next input symbol or when non-terminal A is on top of the stack, a is the next input symbol, and
the parsing table entry M [A , a ] is empty.
 Panic-mode error recovery skips symbols on the input until a token in a selected set of synchronizing tokens appears. Its effectiveness depends on the choice of the synchronizing set; the set should be chosen so that the parser recovers quickly from errors that are likely to occur in practice.
 As a starting point, place all symbols in FOLLOW(A) into the synchronizing set for non-terminal A. If we skip tokens until an element of FOLLOW(A) is seen and pop A from the stack, it is likely that parsing can continue.
 FOLLOW(A) alone is not always enough as the synchronizing set for A.
 If symbols in FIRST(A) are added to the synchronizing set for non-terminal A, then it may be possible to resume parsing according to A if a symbol in FIRST(A) appears in the input.
 If a non-terminal can generate the empty string, then the production deriving ε can be used as a default. This reduces the number of non-terminals that have to be considered during error recovery.
 If a terminal on top of the stack cannot be matched, a simple idea is to pretend that the terminal was inserted and continue parsing.

Phrase-level recovery

 Phrase-level error recovery is implemented by filling in the blank entries in the predictive parsing table with pointers to error routines.

Example :

 Using FIRST and FOLLOW symbols as synchronizing tokens works well when expressions are parsed according to the grammar E → E + T | T , T → T * F | F , F → ( E ) | id.
 The parsing table for this grammar is constructed with “synch” indicating synchronizing tokens obtained from the FOLLOW set of the non-terminal.

2.4 BOTTOM-UP PARSING

 Bottom-up syntax analysis is known as shift-reduce parsing. It is easy to implement.


 The general method of shift-reduce parsing is called LR parsing.
 LR parsing is used in a number of automatic parser generators.
 Shift-reduce parsing attempts to construct a parse tree for an input string beginning at the leaves and working up towards the root.
 At each reduction step a particular substring matching the right side of a production is replaced by the symbol on the left of that production; if the substring is chosen correctly at each step, a rightmost derivation is traced out in reverse.

Example :

Consider the grammar

S → aABe
A → Abc | b
B → d

The sentence abbcde can be reduced to S as follows:

a b b c d e
a A b c d e     (A → b)
a A d e         (A → Abc)
a A B e         (B → d)
S               (S → aABe)

 This sequence of four reductions traces out the following rightmost derivation in reverse:
S ⇒rm a A B e ⇒rm a A d e ⇒rm a A b c d e ⇒rm a b b c d e

2.4.1 HANDLES

 A “handle” of a string is a substring that matches the right side of a production, and whose reduction to the non-terminal on the left side of the production represents one step along the reverse of a rightmost derivation.
 Formally, a handle of a right-sentential form γ is a production A → β and a position of γ where the string β may be found and replaced by A to produce the previous right-sentential form in a rightmost derivation of γ.
 If a grammar is unambiguous, then every right-sentential form of the grammar has exactly one handle.
 In a parse tree, the handle represents the leftmost complete subtree consisting of a node and all its children.
 Consider the grammar E → E + E | E * E | ( E ) | id. Two rightmost derivations of id1 + id2 * id3 are:

E ⇒rm E + E
  ⇒rm E + E * E
  ⇒rm E + E * id3
  ⇒rm E + id2 * id3
  ⇒rm id1 + id2 * id3

(or)

E ⇒rm E * E
  ⇒rm E * id3
  ⇒rm E + E * id3
  ⇒rm E + id2 * id3
  ⇒rm id1 + id2 * id3

Note : The string appearing to the right of a handle contains only terminal symbols.
2.4.2 HANDLE PRUNING

 A rightmost derivation in reverse can be obtained by “handle pruning”.

 If w is a sentence of the grammar at hand, then w = γn, where γn is the nth right-sentential form of some as yet unknown rightmost derivation
S = γ0 ⇒rm γ1 ⇒rm γ2 ⇒rm … ⇒rm γn-1 ⇒rm γn = w.
 To reconstruct this derivation in reverse order, we locate the handle βn in γn and replace βn by the left side of the production An → βn to obtain the (n-1)st right-sentential form γn-1.
 We then repeat this process: locate the handle βn-1 in γn-1 and reduce this handle to obtain the right-sentential form γn-2.
 If, by continuing this process, we produce a right-sentential form consisting only of the start symbol S, then we halt and announce successful completion of parsing.
 The reverse of the sequence of productions used in the reductions is a rightmost derivation for the input string.
 For example, consider the grammar E → E + E | E * E | ( E ) | id.

RIGHT-SENTENTIAL FORM    HANDLE    REDUCING PRODUCTION
id1 + id2 * id3          id1       E → id
E + id2 * id3            id2       E → id
E + E * id3              id3       E → id
E + E * E                E * E     E → E * E
E + E                    E + E     E → E + E
E
Fig. Reductions made by shift-reduce parser

2.4.3 STACK IMPLEMENTATION OF SHIFT-REDUCE PARSING

 Two problems must be solved to parse by handle pruning:

i) locating the substring to be reduced in a right-sentential form, and
ii) determining which production to choose when more than one production has that substring on its right side.
 A convenient way to implement a shift-reduce parser is to use a stack to hold grammar symbols and an input buffer to hold the string w to be parsed.
 Initially, the stack is empty, and the string w is on the input:
STACK        INPUT
$            w$
 The parser operates by shifting zero or more input symbols onto the stack until a handle β is on top of the stack.
 The parser then reduces β to the left side of the appropriate production.
 The parser repeats this cycle until it has detected an error or until the stack contains the start symbol and the input is empty:
STACK        INPUT
$S           $
 After entering this configuration, the parser halts and announces successful completion of parsing.

Example:

A shift-reduce parser parses the input string id1 + id2 * id3 according to the grammar
E → E + E | E * E | ( E ) | id as follows:

STACK            INPUT                  ACTION
$                id1 + id2 * id3 $      Shift
$ id1            + id2 * id3 $          Reduce by E → id
$ E              + id2 * id3 $          Shift
$ E +            id2 * id3 $            Shift
$ E + id2        * id3 $                Reduce by E → id
$ E + E          * id3 $                Shift
$ E + E *        id3 $                  Shift
$ E + E * id3    $                      Reduce by E → id
$ E + E * E      $                      Reduce by E → E * E
$ E + E          $                      Reduce by E → E + E
$ E              $                      Accept

 Primary operations of the parser are shift and reduce.


 A shift-reduce parser can make four possible actions:
a) shift b) reduce c) accept and d) error
a) In a shift action, the next input symbol is shifted onto the top of the stack.
b) In a reduce action, the parser knows that the right end of the handle is at the top of the stack. It must then locate the left end of the handle within the stack and decide with what non-terminal to replace the handle.
c) In an accept action, the parser announces successful completion of parsing.
d) In an error action, the parser discovers that a syntax error has occurred and calls an error recovery routine.
A small sketch of a parser making these four actions is given below.
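The sketch below is an illustration only: to pick between shifting and reducing it uses operator precedences, which is not the general LR method, but it reproduces the trace above for the grammar E → E + E | E * E | ( E ) | id.

PREC = {'+': 1, '*': 2}                # precedences used to pick the handle

def shift_reduce(tokens):
    stack = ['$']
    tokens = tokens + ['$']
    i = 0
    def reduce_once(lookahead):
        if stack[-1] == 'id':
            stack[-1] = 'E'; print('reduce by E -> id'); return True
        if stack[-3:] == ['(', 'E', ')']:
            stack[-3:] = ['E']; print('reduce by E -> ( E )'); return True
        if stack[-1] == 'E' and len(stack) >= 4 and stack[-2] in PREC and stack[-3] == 'E':
            # reduce E op E unless the lookahead operator binds tighter
            if PREC.get(lookahead, 0) <= PREC[stack[-2]]:
                op = stack[-2]
                stack[-3:] = ['E']; print('reduce by E ->', 'E', op, 'E'); return True
        return False
    while True:
        a = tokens[i]
        if reduce_once(a):
            continue                   # a reduce action
        if a == '$':                   # no handle on top of the stack
            if stack == ['$', 'E']:
                print('accept'); return True    # an accept action
            print('error'); return False        # an error action
        stack.append(a); i += 1        # a shift action
        print('shift', a)

shift_reduce(['id', '+', 'id', '*', 'id'])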
The important use of a stack in shift-reduce parsing:

 The handle will always eventually appear on top of the stack, never inside. Consider the two possible cases of two successive steps in a rightmost derivation:
1) S ⇒*rm α A z ⇒rm α β B y z ⇒rm α β γ y z    (here A → βBy and B → γ)
2) S ⇒*rm α B x A z ⇒rm α B x y z ⇒rm α γ x y z    (here A → y and B → γ)
 In reverse, the shift-reduce parser reaches these configurations as follows:

1) STACK          INPUT
$ α β γ           y z $
$ α β B           y z $      (B → γ)
$ α β B y         z $        (Shift)
$ α A             z $        (A → βBy)
$ α A z           $          (Shift)
$ S               $          (S → αAz)

2) STACK          INPUT
$ α γ             x y z $
$ α B             x y z $    (B → γ)
$ α B x           y z $      (Shift)
$ α B x y         z $        (Shift)
$ α B x A         z $        (A → y)
$ α B x A z       $          (Shift)
$ S               $          (S → αBxAz)
 After making a reduction the parser had to shift zero or more symbols to get the next handle
onto the stack. It never had to go into the stack to find the handle.

Viable prefixes

 The set of prefixes of right sentential forms that can appear on the stack of a shift-reduce
parser are called viable prefix. (or)
 It is a prefix of a right-sentential form that does not continue past the right end of the right
most handle of that sentential form.

Conflicts during shift-reduce parsing

 There are context-free grammars for which shift-reduce parsing cannot be used.


 Every shift-reduce parser for such a grammar can reach a configuration in which the
parser, knowing the entire stack contents and the next input symbol, cannot decide
whether to shift or to reduce (a shift/reduce conflict), or cannot decide which of several
reductions to make (a reduce/reduce conflict).
 LR(k) class of grammars, the k in LR(k) refers to the number of symbols of look ahead
on the input.
 Grammars used in compiling fall in the LR(1) class, with one symbol look ahead.
 Non-LR-ness occurs when the stack contents and the next input symbol are not sufficient to determine which production should be used in a reduction.
UNIT – III
SYLLABUS:
Intermediate Code Generation: Intermediate Languages - Graphical Representations - Three-Address Code - Implementations of three-address statements: quadruples, triples, indirect triples.
Code Generation: Issues in the Design of a Code Generator - Run-Time Storage Management

3.1 INTERMEDIATE CODE GENERATION

 The front end translates a source program into an intermediate representation from
which the back end generates target code.
 Although a source program can be translated into the target language directly, using a machine-independent intermediate form offers two benefits:
 Retargeting is facilitated; a compiler for a different machine can be created by attaching a back end for the new machine to an existing front end.
 A machine-independent code optimizer can be applied to the intermediate representation.
 The syntax directed method can be used to translate into an intermediate form
programming language constructs such as declarations, assignments and flow of
controls statements.
 The source program has been parsed and statically checked.
 Intermediate code generation can be folded into parsing, if desired.

Parser → Static Checker → Intermediate Code Generator → (intermediate code) → Code Generator

Fig. Position of intermediate code generator.

3.2 Intermediate Languages

The semantic rules for generating three-address code from common programming language constructs are similar to those for constructing syntax trees or for generating postfix notation.
3.3 Graphical Representations

 A syntax tree depicts the hierarchical structure of a source program.


 A DAG gives the same information but in a more compact way because common sub
expressions are identified.
 For example, consider a syntax tree and a DAG for the assignment statement a := b * -c + b * -c.

Fig. Graphical representation of a := b * -c + b * -c: (a) the syntax tree has assign at the root, with children a and +, and the + node has two separate * subtrees, each computing b * (uminus c); (b) the DAG is the same except that the common subexpression b * (uminus c) is represented by a single shared subtree.

 Postfix notation is a linearized representation of a syntax tree; it is a list of the nodes of the tree in which a node appears immediately after its children.
 The postfix notation for the syntax tree above is a b c - * b c - * + := (here - denotes unary minus).
 Syntax trees for assignment statements are produced by the following syntax-directed definition, in which non-terminal S generates an assignment statement.
PRODUCTION           SEMANTIC RULE

S → id := E          S.nptr := mknode ( ‘:=’ , mkleaf (id, id.place), E.nptr )
E → E1 + E2          E.nptr := mknode ( ‘+’ , E1.nptr, E2.nptr )
E → E1 * E2          E.nptr := mknode ( ‘*’ , E1.nptr, E2.nptr )
E → - E1             E.nptr := mknode ( ‘uminus’ , E1.nptr )
E → id               E.nptr := mkleaf ( id, id.place )
Fig. Syntax-directed definition to produce syntax trees for assignment statements

Two representations of the syntax tree appear in the following figure. Each node is represented as a record with a field for its operator and additional fields for pointers to its children.

a) In the first representation, nodes are allocated from an array of records, and the index or position of a node serves as the pointer to the node. For a := b * -c + b * -c the array is:

0    id       b
1    id       c
2    uminus   1
3    *        0   2
4    id       b
5    id       c
6    uminus   5
7    *        4   6
8    +        3   7
9    id       a
10   assign   9   8
11   …

b) In the second representation, nodes are allocated dynamically and linked through explicit pointer fields (the pointer-based tree drawing is omitted here).

3.4 Three-Address Code


 Three-address code is a sequence of statements of the general form x := y op z, where x, y, and z are names, constants, or compiler-generated temporaries, and op stands for any operator.
 No built-up arithmetic expressions are permitted; there is only one operator on the right side of a statement.
 The name “three-address code” reflects the fact that each statement usually contains three addresses: two for the operands and one for the result.

Types Of Three-Address Statements

a. Statements can have symbolic labels, and there are statements for flow of control.
b. A symbolic label represents the index of a three-address statement in the array holding the intermediate code.
c. Indices can be substituted for the labels either by making a separate pass, or by using “backpatching”.

1. Assignment statements

a. x := y op z, where op is a binary arithmetic or logical operation.
b. x := op y, where op is a unary operation such as unary minus, logical negation, a shift operator, or a conversion operator (for example, to convert a fixed-point number to a floating-point number).

2. Copy Statements

x := y , where the value of y is assigned to x.

3. Unconditional jump

goto L , where the statement with label L is the next to be executed.

4. Conditional jump

if x relop y goto L , where relop is a relational operator ( < , = , >= , … ). This statement applies relop to x and y, and executes the statement with label L next if x stands in relation relop to y; otherwise, the following statement is executed.

5. Procedure Calls

param x
call p, n      for a call of procedure p with n parameters
return y       where y, the returned value, is optional.

6. Indexed assignment

x := y[i] and x[i] := y , where x, y and i are data objects.

7. Address and pointer assignments

x := &y, x := *y and *x := y.

Syntax-Directed Translation into Three-address Code

 When three-address code is generated, temporary names are made up for the
interior nodes of a syntax tree.
 The synthesized attribute represents the three-address code for the assignment.
 The non terminal has two attributes :
a) The name that will hold the value, and
b) The sequence of three address statements evaluated.

 Three-address statements may be sent to an output file, rather than built up into the
code attributes.
 Flow of control statements can be added to the assignments in productions and
semantic rules.
 Such productions concatenate only the operator after the code for the operands.
 The intermediate form produced by the syntax-directed translations can be changed
by making modifications to the semantic rules.
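As an illustration of the idea (a sketch, not the book's exact semantic rules), the following Python function makes up a new temporary name for each interior node of a syntax tree and emits one three-address statement per node; the tuple encoding of trees is an assumption:

import itertools

CODE = []
_temps = itertools.count(1)

def newtemp():
    return 't' + str(next(_temps))      # fresh compiler-generated temporary

def gen(node):
    """Emit three-address code for a syntax-tree node; return its place."""
    if isinstance(node, str):
        return node                     # leaf: the name itself holds the value
    op, *children = node
    places = [gen(child) for child in children]
    t = newtemp()                       # temporary for the interior node
    if op == 'uminus':
        CODE.append(t + ' := - ' + places[0])
    else:
        CODE.append(t + ' := ' + places[0] + ' ' + op + ' ' + places[1])
    return t

# a := b * -c + b * -c, as a syntax tree (no sharing of subexpressions)
rhs = ('+', ('*', 'b', ('uminus', 'c')), ('*', 'b', ('uminus', 'c')))
CODE.append('a := ' + gen(rhs))
print('\n'.join(CODE))
# t1 := - c      t2 := b * t1    t3 := - c
# t4 := b * t3   t5 := t2 + t4   a := t5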

3.5 Implementations of three address statements

a. Quadruples

A quadruple is a record structure with four fields, which we call op, arg1, arg2 and result. The op field contains an internal code for the operator. The contents of fields arg1, arg2 and result are pointers to the symbol table entries for the names represented by the fields. Temporary names must be entered into the symbol table as they are created.

b. Triples

To avoid entering temporary names into the symbol table, we can refer to a temporary value by the position of the statement that computes it. Three-address statements can then be represented by records with only three fields: op, arg1, and arg2. The fields arg1 and arg2, for the arguments of op, are either pointers into the symbol table or pointers into the triple structure. Since three fields are used, this intermediate code format is known as triples; it is also referred to as “two-address code”. Parenthesized numbers represent pointers into the triple structure.

c. Indirect triples

Another implementation of three-address code is to list pointers to triples, rather than listing the triples themselves. This implementation is called indirect triples.

Quadruple Notation

With quadruples, a three-address statement defining or using a temporary can immediately access the location for that temporary via the symbol table. The symbol table interposes an extra degree of indirection between the computation of a value and its use, so statements can be moved around without changing the references to them. Benefit: this flexibility helps an optimizing compiler.

Triples
With triples, allocation of storage to those temporaries needing it must be deferred to the code generation phase, and moving a statement that defines a temporary value requires changing all references to that statement. Problem: triples are difficult to use in an optimizing compiler.

Indirect Triples

Indirect triples can save space compared with quadruples if the same temporary value is used more than once, because two or more entries in the statement array can point to the same line of the op-arg1-arg2 structure.
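For comparison, the three formats for the statement a := b * -c + b * -c can be pictured as Python data (an illustrative sketch; (n) denotes a pointer to triple number n):

# Quadruples: op, arg1, arg2, result (temporaries go into the symbol table).
QUADS = [
    ('uminus', 'c',  None, 't1'),
    ('*',      'b',  't1', 't2'),
    ('uminus', 'c',  None, 't3'),
    ('*',      'b',  't3', 't4'),
    ('+',      't2', 't4', 't5'),
    (':=',     't5', None, 'a'),
]
# Triples: only op, arg1, arg2; a value is named by its statement position.
TRIPLES = [
    ('uminus', 'c',   None),   # (0)
    ('*',      'b',   (0,)),   # (1)
    ('uminus', 'c',   None),   # (2)
    ('*',      'b',   (2,)),   # (3)
    ('+',      (1,),  (3,)),   # (4)
    ('assign', 'a',   (4,)),   # (5)
]
# Indirect triples: a statement list of pointers into the triple structure,
# so statements can be reordered without renumbering the triples themselves.
STATEMENTS = [0, 1, 2, 3, 4, 5]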

3.6 CODE GENERATION

 To transform the intermediate code into a form from which more efficient target
code can be produced.

source program → front end → (intermediate code) → code optimizer → (intermediate code) → code generator → target program, with all phases consulting the symbol table

3.7 ISSUES IN THE DESIGN OF A CODE GENERATOR


Input to the Code Generator

 The input to the code generator consists of the intermediate representation of the
source program produced by the front end, together with information in the symbol
table that is used to determine the run-time addresses of the data objects denoted by
the names in the intermediate representation.
Linear representations such as postfix notations.

Three-address representations such as quadruples.

Virtual machine representations such as syntax trees and dags.

 Prior to code generation, the front end has scanned, parsed and translated the source program into the intermediate representation.
 We assume that its input is free of errors.

Target Programs

 The output of the code generator is the target program. The output may take a variety of forms:
absolute machine language;
re-locatable machine language or assembly language.
 An absolute machine language program can be placed in a fixed location in memory and immediately executed.
 A re-locatable machine language program allows subprograms to be compiled separately; a set of re-locatable object modules can be linked together and loaded for execution by a linking loader.
 Producing an assembly-language program simplifies the process of code generation: we can generate symbolic instructions and use the macro facilities of the assembler to help generate code.

Memory Management

 If machine code is being generated, labels in three-address statements have to be


converted to addresses of instructions.

Instruction Selection

 The instruction set of the target machine determines the difficulty of instruction
selection.
 The uniformity and completeness of the instruction set are important factors.
 If the target machine does not support each data type in a uniform manner, then each exception to the general rule requires special handling.
 For example, the sequence of statements
a := b + c
d := a + e
would be translated into:
Mov b, R0
Add c, R0
Mov R0, a
Add e, R0
Mov R0, d
Register Allocation

 Instructions involving register operands are shorter and faster than those involving
operands in memory.
 The use of registers is subdivided into two subproblems:
 During register allocation, we select the set of variables that will reside in registers at a point in the program.
 During a subsequent register assignment phase, we pick the specific register that a variable will reside in.
 Finding an optimal assignment is an NP-complete problem; we avoid the problem by generating code for the three-address statements in the order in which they have been produced by the intermediate code generator.

The Target Machine

 Target computer is a byte-addressable machine with four bytes to a word and n


general purpose registers R0, R1, … , Rn-1.
 It has two-address instructions of the form op source, destination, in which op is an op-code, and source and destination are data fields.
 Since these fields are not long enough to hold a memory address, operands are specified by combining registers and memory locations with address modes.
 The address modes, their assembly-language forms, and their added costs are:
MODE                 FORM    ADDRESS                        ADDED COST

Absolute             M       M                              1
Register             R       R                              0
Indexed              C(R)    C + contents(R)                1
Indirect register    *R      contents(R)                    0
Indirect indexed     *C(R)   contents(C + contents(R))      1
Literal              #C      C                              1

Example:

Mov R0, M      stores the contents of register R0 into memory location M.
Mov 4(R0), M   stores the value contents(4 + contents(R0)) into memory location M.
Mov *4(R0), M  stores the value contents(contents(4 + contents(R0))) into memory location M.
Mov #1, R0     stores the constant 1 into register R0.

Instruction Set

 The cost of an instruction to be one plus the costs associated with the source and
destination address modes.
 This cost corresponds to the length of the instruction.
 Address modes involving registers have cost zero, while those with a memory
location or literal in them have cost one, because such operands have to be stored
with the instruction.
 Minimizing the instruction length also tends to minimize the time taken to perform the instruction.
Example :

Mov R0, R1  to copy the contents of register R0 into register R1.

Mov R5, M  to copy contents of register R5 into memory location M.

Add #1, R3  to add the constant 1 to the contents of register R3.

Sub 4(R0), *12(R1)    stores the value
contents(contents(12 + contents(R1))) − contents(4 + contents(R0))
into the destination *12(R1).
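Under this cost model, the cost of any instruction can be computed mechanically; the following small Python sketch (names are illustrative) adds 1 to the added costs of the two operands' address modes:

# Cost = 1 (the instruction itself) + added cost of each operand's address mode.
MODE_COST = {'absolute': 1, 'register': 0, 'indexed': 1,
             'indirect register': 0, 'indirect indexed': 1, 'literal': 1}

def instruction_cost(source_mode, destination_mode):
    return 1 + MODE_COST[source_mode] + MODE_COST[destination_mode]

print(instruction_cost('register', 'register'))         # Mov R0, R1        -> 1
print(instruction_cost('register', 'absolute'))         # Mov R5, M         -> 2
print(instruction_cost('literal', 'register'))          # Add #1, R3        -> 2
print(instruction_cost('indexed', 'indirect indexed'))  # Sub 4(R0), *12(R1) -> 3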

3.8 Run-Time Storage Management

 Information needed by an execution of a procedure is kept in a block of storage called an activation record.

STORAGE-ALLOCATION STRATEGIES

* Three strategies can be used for allocating the data areas: static allocation, stack allocation, and heap allocation.

Static Allocation

 Static allocation lays out storage for all data objects at compile time.
 The position of an activation record in a memory is fixed at compile time.
 The addresses at which information is to be saved when a procedure call occurs are
known at compile time.
 Limitations of static allocation:
 The size of a data object and constraints on its position in memory must be known at compile time.
 Recursive procedures are restricted, because all activations of procedure use the
same bindings for local names.
 Data structures cannot be created dynamically, since there is no mechanism for
storage allocation at run-time.

Stack Allocation

 Stack allocation manages the run-time storage as a stack.


 A new activation record is pushed onto the stack for each execution of a procedure.
 The record is popped when the activation ends.
 Run-time memory is divided into areas for code, static data, and a stack.
 Run-time allocation and de-allocation of activation records occur as part of the procedure call and return sequences.
 Call and return sequences
1. call
2. return
3. halt, and
4. action , a place holder for other statements.
 In static allocation, the intermediate code is implemented by a sequence of two
target machine instructions,
mov # here+20, callee.static-area

goto callee.code-area

 The attributes callee.static-area and callee.code-area are constants referring to the


address of the activation record and the first instruction for the called procedure.
 A return from procedure callee is implemented by
goto *callee.static-area

which transfers control to the address saved at the beginning of the activation

record.

 In stack allocation, relative addresses in an activation record can be taken as offsets


from any known position in the activation record.
 Example: 1) mov #stackstart, sp /* initialize the stack */
halt /* terminate execution */

2) add #caller.recordsize, sp

mov #here + 16, *sp

goto callee.code-area

The attribute caller.recordsize represents the size of an activation record. The source #here + 16 is saved in the address pointed to by sp.

Runtime Address for Names

Advantage
 It makes the compiler more portable: the front end need not be changed even if the compiler is moved to a different machine where a different run-time organization is needed.
 Generating the specific sequence of access steps while generating intermediate code can be a significant advantage in an optimizing compiler.

 Stack allocation is based on the control stack, storage is organized stack, and
activation records are pushed and popped as activations begins and end,
respectively.
 Locals are bound to fresh storage in each activation, because a new activation record is pushed onto the stack when a call is made.
 The values of locals are deleted when the activation ends, that is, the values are lost
because the storage for locals disappears when the activation record is popped.
 At run-time, an activation record can be allocated and de-allocated by incrementing
and decrementing the top of the stack, respectively, by the size of the record.
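The push/pop arithmetic can be pictured with a toy Python sketch (purely illustrative; the record sizes are assumptions):

# Activation records are allocated and de-allocated by moving the stack top
# by the size of the record.
top = 0
control_links = []              # saved values of the previous record's base

def call(record_size):
    global top
    control_links.append(top)   # remember where the caller's record ended
    top += record_size          # push the callee's activation record

def ret(record_size):
    global top
    top -= record_size          # pop the record: storage for locals is lost
    control_links.pop()

call(48); call(32)              # p calls q
print(top)                      # 80: records for p and q are on the stack
ret(32)                         # q returns; its locals' storage disappears
print(top)                      # 48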

Calling Sequences

 Procedure calls are implemented by generating what are known as calling


sequences in the target code. A call sequence allocates an activation record and
enters and information into its fields.
 A return sequence restores the state of the machine so the calling procedure can
continue execution.
 The code in a calling sequence is divided between the calling procedure( the
caller ) and the procedure it calls ( the callee ).
 A principle in designing calling sequences and activation records is that fields whose sizes are fixed early are placed in the middle.

Advantage :

 Placing the fields for parameters and a potential returned value next to the activation record of the caller allows the caller to access these fields using offsets from the end of its own activation record, without knowing the complete layout of the record for the callee.
 The register top-sp points to the end of the machine-status field in an activation record.
 This position is known to the caller, so the caller can be made responsible for setting top-sp before control flows to the called procedure.
 The code for the callee can access its temporaries and local data using offsets from top-sp.
Fig. Division of tasks between caller and callee: the caller's activation record holds its parameters and returned value, control link, links and saved status, and temporaries and local data. In the callee's activation record, the fields for parameters and returned value, control link, and links and saved status are the caller's responsibility, while the temporaries and local data (below top-sp) are the callee's responsibility.

 The call sequence,

 The caller evaluates the actuals. The caller stores a return address and the old value of top-sp into the callee's activation record.
 The caller increments top-sp to the position, that is , top-sp is moved past the
caller’s local data and temporaries and the callee’s parameter and status
fields.
 The callee saves register values and other status information. The callee
initializes its local data and begins execution.
 The return sequence,

 The callee places a return value next to the activation record of the caller.
 Using the information in the status field, the callee restores top-sp and other
registers and branches to a return address in the caller’s code.
 Although top-sp has been decremented, the caller can copy the returned value
into its own activation record and use it to evaluate an expression.
 The calling sequences allow the number of arguments of the called procedure to
depend on the call.
 At compile time, the target code of the caller knows the number of arguments it
is supplying to the callee. Hence the callee knows the size of the parameter field.
 The target code of the callee must be prepared to handle other calls, it waits until
it is called, and then examines the parameter field.
Variable-Length Data

 Variable-length data is accessed through pointers kept in the activation record; the relative addresses of these pointers are known at compile time, so the target code can access array elements through the pointers.
 In the figure, the activation record for procedure q, called by p, begins after the arrays of p, and the variable-length arrays of q begin beyond that.

Fig. Access to dynamically allocated arrays: the activation record for p contains a control link and pointers to arrays A, B and C; the arrays themselves are stored after the record. The activation record for procedure q called by p begins after the arrays of p, with top-sp pointing into q's record and top marking the end of the arrays of q.

 Access to data on the stack is through two pointers: top and top-sp.
 Top marks the actual top of the stack; it points to the position at which the next activation record will begin.
 To find local data, for consistency with the organization above, suppose top-sp points to the end of the machine-status field; in the figure, top-sp points to the end of this field in the activation record for q.
 Within this field is a control link to the previous value of top-sp when control was in the calling activation of p. The code to reposition top and top-sp can be generated at compile time, using the sizes of the fields in the activation records.
 When q returns, the new value of top is top-sp minus the length of the machine-
status and parameters fields in q’s activation record.
 This length is known at compile time, at least to the callee.
 After adjusting top, the new value of top-sp can be copied from the control link
of q.

Dangling References

A dangling reference occurs when there is a reference to storage that has been de-allocated.

Heap Allocation

 Heap allocation allocates and de-allocates storage as needed at run-time from a data
area known as a heap.
 The values of local names must be retained when an activation ends.
 A called activation may outlive the caller. This possibility cannot occur for those languages where activation trees correctly depict the flow of control between procedures.
 To handle small activation records or records of a predictable size as a special case :
 For each size of interest, keep a linked list of free blocks of that size.
 If possible, fill a request for size s with a block of size s’ , where s’ is the
smallest size greater than or equal to s.
 When the block is eventually de-allocated, it is returned to the linked list it came
from.
 For large blocks of storage use the heap manager.


 For the allocation and de-allocation of small amounts of storage, taking and returning a block from a linked list are efficient operations.
 For large amounts of storage, the time taken by the allocator is negligible compared with the time taken to do the computation on the storage, so the heap manager is used.
UNIT – IV
SYLLABUS:
Assemblers: Elements of Assembly language programming: Elements-Assembly language
statements-Advantages of Assembly language-Interpreters: Uses-overview-pure and impure
interpreters
Macros and Macroprocessors: Macro-Macro definition and call-Macro expansion: Lexical
substitution-positional parameters-keyword parameters-nested macro calls-advanced macro
facilities

4 Assemblers
4.1 Elements of assembly language programming
* An assembly language is a machine dependent low level programming language which is
specific to a certain computer system.
* It provides 3 basic features which simplify programming:

* Mnemonic operation code:


 Use of mnemonic operation codes for machine instructions eliminates the need to
memorize numeric operation codes.
 It also enables the assembler to provide helpful diagnostic for example indication
of misspelt operation codes.
* Symbolic operands:
 Symbolic names can be associated with data or instructions.
 The symbolic names can be used as operands in assembly statements.
 The assembler performs memory binding to these names.
 The programmer need not know the details of memory binding performed by the
assembler.
* Data declarations:
 Data can be declared in a variety of notations.
 This avoids manual conversion of constants into the internal machine representation.

Statement Format:
* An assembly language statement has the following format:
[label] <opcode> <operand spec> [, <operand spec> ..]
* If a label is specified in a statement, it is associated as a symbolic name with the memory words generated for the statement.
* <operand spec> has the following syntax:
<symbolic name> [+ <displacement>] [(<index register>)]
* Some possible operand formats are : AREA, AREA+5, AREA(4) and AREA+5(4).
* The first specification refers to the memory word with which the name AREA is associated.
* The second specification refers to the memory word 5 words away from the word with the
name AREA. Here 5 is the displacement or offset from AREA
* The third specification implies indexing with index register 4 – that is, the operand address is
obtained by adding the contents of index register 4 to the address of AREA.
* The last specification is a combination of the previous two specifications.

A simple assembly language:


In this language each statement has two operands; the first operand is always a register, and the second refers to a memory word using a symbolic name and an optional displacement.

Instruction opcode Assembly mnemonic Remarks


00 STOP Stops execution
01 ADD First operand is modified
02 SUB condition code is set
03 MULT
04 MOVER Register<-memory move
05 MOVEM Memory<-register move
06 COMP Sets condition code.
07 BC Branch on condition
08 DIV Analogous to SUB
09 READ First operand is not used
10 PRINT
Fig: Mnemonic operation codes

* The MOVE instructions (MOVER and MOVEM) move a value between a memory word and a register.
* In the MOVER instruction the second operand is the source operand and the first operand is
the target operand.
* Converse is true for the MOVEM instruction.
* All arithmetic is performed in a register and sets a condition code.
* A comparison instruction sets a condition code analogous to the subtract instruction without
affecting the values of the operand.
* The condition operand can be tested by a branch on condition instruction.
* The assembly statement corresponding to it has the format
BC <condition code spec>, <memory address>
* In a machine language program we show all addresses and constants in decimal rather than
octal or hexa decimal.
* The following figure shows the machine instructions format.
* The opcode,register operand and memory operand occupy 2,1 and 3 digits , respectively.
* The sign is not a part of the instruction
* The condition code specified in a BC statement is encoded into the first operand position using
the codes 1-6 for the specification LT,LE,EQ,GT,GE and ANY, respectively
Fig: Instruction format: sign | opcode (2 digits) | register operand (1 digit) | memory operand (3 digits)

Fig: An assembly and equivalent machine language program


START 101
READ N 101) +09 0 113
MOVER BREG, ONE 102) +04 2 115
MOVEM BREG, TERM 103) +05 2 116
AGAIN MULT BREG,TERM 104) +03 2 116
MOVER CREG, TERM 105) +04 3 116
ADD CREG, ONE 106) + 01 3 115
MOVEM,CREG, TERM 107) + 05 3 116
COMP CREG , N 108) + 06 3 113
BC LE, AGAIN 109) + 07 2 104
MOVEM BREG,RESULT 110) + 05 2 114
PRINT RESULT 111) + 10 0 114
STOP 112) + 00 0 000
N DS 1 113)
RESULT DS 1 114)
ONE DC ‘1’ 115) + 00 0 001
TERM DS 1 116)
END

4.2 Assembly Language Statements:

An assembly program contains 3 kind of statements:


1. Imperative statements
2. Declaration statements
3. Assembler directives

1. Imperative statements:
* An imperative statement indicates an action to be performed during the execution of the
assembled program.
* Each imperative statement typically translates into one machine instruction.

2. Declaration statements:
* The syntax of declaration statements is as follows:
[label] DS <constant>
[label] DC ‘<value>’
* The DS (short for declare storage) statement reserves areas of memory and associates names with them. Consider the following DS statements:

A DS 1
G DS 200
* The first statement reserves a memory area of one word and associates the name A with it.
* The second statement reserves a block of 200 memory words.
* The name G is associated with the first word of the block.
* Other words in the block can be accessed through offsets from G,e.g. G+5 is the sixth word of
the memory block etc.,

* The DC(short for declare constant) statement constructs memory words containing constants.
Consider the following DC statement:

ONE DC ‘1’

* The statement associates the name ONE with a memory word containing the value ‘1’.
* The programmer can declare constants in different forms –
decimal,binary,hexadecimal,etc.
* The assembler converts them to the appropriate internal form.
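A toy Python sketch of how an assembler might bind such declarations to addresses with a location counter (the starting address 113 and the names follow the program figure above; the helper functions are assumptions):

symtab = {}
lc = 113                        # location counter after the last instruction

def DS(label, words):
    global lc
    symtab[label] = lc          # reserve <words> memory words for <label>
    lc += words

def DC(label, value):
    global lc
    symtab[label] = lc          # one word initialized to <value>
    lc += 1

DS('N', 1); DS('RESULT', 1); DC('ONE', 1); DS('TERM', 1)
print(symtab)                   # {'N': 113, 'RESULT': 114, 'ONE': 115, 'TERM': 116}
print(symtab['N'] + 5)          # the operand N+5: displacement 5 from N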

Use of Constants:
* Contrary to its name, the DC (declare constant) statement does not really implement constants; it merely initializes memory words to given values.
* These values are not protected by the assembler; they may be changed by moving a new value into the memory word.
* An assembly program can use constants in the sense implemented in a HLL in two ways-as
immediate operands and as literals.
* Immediate operands can be used in an assembly statement only if the architecture of the target machine includes the necessary features.
* On such a machine, the statement
ADD AREG, 5
is translated into an instruction with two operands: AREG and the value ‘5’ as an immediate operand.
* A Simple assembly language does not support this feature, but assembly language of intel
8086 supports it.
* A literal is an operand with the syntax =’<value>’.
* It differs from a constant because its location cannot be specified in the assembly program.
* This helps to ensure that its value is not changed during execution of a program.
* It differs from an immediate operand because no architectural provision is needed to support its use.
* The value of a literal is protected by the fact that the name and address of its memory word are not known to the assembly language programmer.

3. Assembler Directives:
* Assembler directives instruct the assembler to perform certain actions during the assembly of a
program.
* Some assembler directives are described in the following.
START <constant>
* This directive indicates that the first word of the target program generated by the assembler should be placed in the memory word with address <constant>.
END [<operand spec>]
* This directive indicates the end of the source program.
* The optional <operand spec> indicates the address of the instruction where the execution of the program should begin.

4.3 Advantage of assembly Language:

* The primary advantage of assembly language programming vis-a-vis machine language programming arises from the use of symbolic operand specifications.
* Consider the machine and assembly language statements of the previous figure. The program computes N!.
* The following figure shows a changed program to compute ½ * N!. One statement has been inserted before the PRINT statement to implement division by 2.
* In the machine language program, this leads to changes in the addresses of constants and reserved memory areas.
* Because of this, the addresses used in most of the program had to change. Such changes are not needed in the assembly program, since operand specifications are symbolic in nature.
Fig: Modified assembly and machine language programs

START 101
READ N 101) +09 0 114
MOVER BREG, ONE 102) +04 2 116
MOVEM BREG, TERM 103) +05 2 117
AGAIN MULT BREG,TERM 104) +03 2 117
MOVER CREG, TERM 105) +04 3 117
ADD CREG, ONE 106) + 01 3 116
MOVEM,CREG, TERM 107) + 05 3 117
COMP CREG , N 108) + 06 3 114
BC LE, AGAIN 109) + 07 2 104
DIV BREG, TWO 110) + 08 2 118
MOVEM BREG,RESULT 111) + 05 2 115
PRINT RESULT 112) + 10 0 115
STOP 113) + 00 0 000
N DS 1 114)
RESULT DS 1 115)
ONE DC ‘1’ 116) + 00 0 001
TERM DS 1 117)
TWO DC ‘2’ 118) + 00 0 002
END
* Assembly language programming holds an edge over HLL programming in situations where it
is necessary or desirable to use specific architectural features of a computer – for example,
special instructions supported by the CPU.
4.4 INTERPRETERS
* An interpreter is a program that executes instructions written in a high-level language.

* There are two ways to run programs written in a high-level language.

* The most common is to compile the program; the other method is to pass the program through
an interpreter.

* Compiler VS Interpreter

 An interpreter translates high-level instructions into an intermediate form, which


it then executes.

 In contrast, a compiler translates high-level instructions directly into machine


language.

 Compiled programs generally run faster than interpreted programs.

 The advantage of an interpreter, however, is that it does not need to go through


the compilation stage during which machine instructions are generated.

 This process can be time-consuming if the program is long.

 The interpreter, on the other hand, can immediately execute high-level programs.
For this reason, interpreters are sometimes used during the development of a
program, when a programmer wants to add small sections at a time and test them
quickly.

 In addition, interpreters are often used in education because they allow students to
program interactively.

 Both interpreters and compilers are available for most high-level languages

 However, BASIC and LISP are especially designed to be executed by an


interpreter.

 In addition, page description languages, such as PostScript, use an interpreter.


Every PostScript printer, for example, has a built-in interpreter that executes
PostScript instructions.

 An interpreter translates some form of source code into a target representation


that it can immediately execute and evaluate.

 The structure of the interpreter is similar to that of a compiler, but the amount of
time it takes to produce the executable representation will vary as will the amount
of optimization.

 Compiler characteristics:

* spends a lot of time analyzing and processing the program

* the resulting executable is some form of machine- specific binary code


* the computer hardware interprets (executes) the resulting code

* program execution is fast

 Interpreter characteristics:

* relatively little time is spent analyzing and processing the program

* the resulting code is some sort of intermediate code

* the resulting code is interpreted by another program

* program execution is relatively slow

* Interpreters avoid the overheads of compilation.


* This is an advantage during program development, because a program may be modified very
often.
* Interpretation is expensive in terms of CPU time .Consider an example:
tc : average compilation time per statement
te : average execution time per statement
ti : average interpretation time per statement
* Both compilers and interpreters analyze a source statement to determine its meaning.
* During compilation, analysis of a statement is followed by code generation, while during
interpretation it is followed by actions which implement its meaning.
* We could assume tc ≈ ti.
* te , which is the execution time of the compiler generated code for a statement, can be several
times smaller than tc
* Let us assume tc = 20 · te.
* Consider a program P. Let sizep and stmts_executedp represent the number of statements in P and the number of statements executed in some execution of P, respectively.
* We use these parameters to compute the CPU time required to execute a program using compilation or interpretation.
* Let sizep = 200, and let program P execute as follows: 20 statements are executed for initialization; this is followed by 10 iterations of a loop containing 8 statements, followed by the execution of 20 statements for printing the results. Thus, stmts_executedp = 20 + 10 * 8 + 20 = 120.

Total execution time using compilation model ≈ 200 · tc + 120 · te ≈ 206 · tc
Total execution time using interpretation model = 120 · ti ≈ 120 · tc
Hence, interpretation is beneficial in this case.
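The arithmetic can be checked in a few lines of Python (an illustration of the example's numbers, expressed in units of tc, with te = tc/20 and ti ≈ tc):

size_p = 200
stmts_executed_p = 20 + 10 * 8 + 20        # initialization + loop + printing

compilation = size_p * 1 + stmts_executed_p * (1 / 20)   # 200*tc + 120*te
interpretation = stmts_executed_p * 1                     # 120*ti, ti ~ tc

print(stmts_executed_p, compilation, interpretation)      # 120 206.0 120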

4.4.1 Uses Of Interpreters


* Use of an interpreter is motivated by two reasons: efficiency in certain environments, and simplicity.
* The findings of above example can be generalized as follows:
 It is better to use interpretation for a program P if P is modified between
executions, and stmts_executedp < sizep.
 These conditions are satisfied during program development, hence interpretation
should be preferred during program development.
 In all other situations, it is best to use compilation.
* It is simpler to develop an interpreter than to develop a compiler because interpretation does
not involve code generation.
* This simplicity makes interpretation more attractive in situations where programs or
commands are not executed repeatedly.
* Hence interpretation is a popular choice for commands to an operating system or an editor.
* User interfaces of many software packages prefer interpretation for similar reasons.

4.4.2 OVERVIEW OF INTERPRETATION


* The interpreter consists of three main components:
1. Symbol table: the symbol table holds information concerning entities in the source
program
2. Data store: The data store contains values of the data items declared in the program being
interpreted. The data store consists of a set of components {comp i}. A component {comp
i} is an array named namei containing elements of a distinct type typei.
3. Data manipulation routines: A set of data manipulation routines exist. This set contains a
routine for every legal data manipulation action in the source language.
* On analyzing a declaration statement, say a statement declaring an array alpha of type typ, the interpreter locates a component compj of its data store such that typej = typ. alpha is now mapped into a part of namej (this is memory allocation). The memory mapping for alpha is remembered in its symbol table entry.
* An executable statement is analyzed to identify the actions which constitute its meaning. For each action, the interpreter finds the appropriate data manipulation routine and invokes it with appropriate parameters. E.g., the meaning of the statement a := b + c, where a, b and c are of the same type, can be implemented by executing the calls
Add (b, c, result);
Assign (a, result);
in the interpreter.
* This schematic has 2 important advantages.
* First, the meaning of a source statement is implemented through execution of the interpreter
routines rather than through code generation.
* This simplifies the interpreter.
* Second, avoiding generation of machine language instructions helps to make the interpreter
portable.
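A toy Python sketch of this schematic (illustrative only): one data-store component per type, a symbol table recording the memory mapping of each name, and data manipulation routines such as Add and Assign:

data_store = {'int': [0] * 100}           # one component (array) per type
next_free = {'int': 0}
symtab = {}                               # name -> (type, location)

def declare(name, typ):                   # memory allocation for a declaration
    symtab[name] = (typ, next_free[typ])
    next_free[typ] += 1

def value(name):
    typ, loc = symtab[name]
    return data_store[typ][loc]

def Assign(name, result):                 # data manipulation routines
    typ, loc = symtab[name]
    data_store[typ][loc] = result

def Add(b, c):
    return value(b) + value(c)

for n in ('a', 'b', 'c'):
    declare(n, 'int')
Assign('b', 2); Assign('c', 3)
Assign('a', Add('b', 'c'))                # implements a := b + c
print(value('a'))                         # 5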

4.4.3 PURE AND IMPURE INTERPRETERS

* The schematic of figure (a) is called a pure interpreter.


* The source program is retained in the source form all through its interpretation.
* This arrangement incurs substantial analysis overheads while interpreting a statement.
Figure (a): source program + data → Interpreter → results
Figure (b): source program → Preprocessor → IR, then IR + data → Interpreter → results

* An impure interpreter performs some preliminary processing of the source program to reduce
the analysis overheads during interpretation. Figure (b) contains a schematic of impure
interpretation. The preprocessor converts the program to an intermediate representation (IR)
which is used during interpretation. This speeds up interpretation as the code component of the
IR, i.e., the IC, can be analyzed more efficiently than the source form of the program.

4.5 MACROS AND MACRO PROCESSORS:


* Macros provide a program generation facility through macro expansion.
* Many language provide built in facility for writing macros.
* Well known examples are higher level languages like PL/I, C, ADA and C++.
* Assembly languages of most computer systems also provide such facilities.
* When a language does not support built-in macro facilities, a programmer may achieve an equivalent effect by using a generalized preprocessor or software tools like Awk of Unix.

Definition:
* A Macro is a unit of specification for program generation through expansion.
* A macro consists of a name, a set of formal parameters and body of code.
* In macro expansion, a macro name used with a set of actual parameters is replaced by the code generated from its body.
* Lexical expansion is replacement of a character string by another character string during
program generation.
* Semantic expansion is generation of instructions tailored to the requirements of specific
usage.
4.5.1 Macro definition and call
* A macro definition is enclosed between a macro header statement and a macro end statement.
* Macro definitions are typically located at the start of a program.
* A macro definition consists of
1.A macro prototype
2.One or more model statements
3.Macro preprocessor statements

* The macro prototype statement declares the name of a macro and the names and kinds of its
parameters.
* A model statement is a statement from which an assembly language statement may be
generated during macro expansion.
* A preprocessor statement is used to perform auxiliary functions during macro expansion.
* The macro prototype statement has the following syntax:
<macro name> [<formal parameter spec>[,..]]
* A macro is called by writing the macro name in the mnemonic field of an assembly statement.
* The macro call has the syntax:
<macro name> [<actual parameter spec>[,..]]
where the actual parameter typically resembles an operand specification in an assembly language
statement.
* The following example shows the definition of macro INCR. MACRO and MEND are the macro header and macro end statements, respectively.
* The prototype statement indicates that three parameters called MEM_VAL, INCR_VAL and REG exist for the macro.
* Since parameter kind is not specified for any of the parameters, they are all of the default kind ‘positional parameter’.
* Statements with the operation codes MOVER, ADD and MOVEM are model statements.
* No preprocessor statements are used in this macro.
MACRO
INCR &MEM_VAL, &INCR_VAL, &REG
MOVER &REG, &MEM_VAL
ADD &REG, &INCR_VAL
MOVEM &REG, &MEM_VAL
MEND

4.6 Macro expansion


* A macro call leads to macro expansion.
* During macro expansion the macro call statement is replaced by a sequence of assembly
statements.
* To differentiate between the original statements of a program and the statements resulting from macro expansion, each expanded statement is marked with a ‘+’.
* Two key notions concerning macro expansion:
 Expansion time control flow: this determines the order in which model statements
are visited during the macro expansion.
 Lexical Substitution: Lexical substitution is used to generate an assembly
statement from a model statement.
* Flow of control during expansion:
The default flow of control during macro expansion is sequential. Thus in the absence of
preprocessor statements the model statements of a macro are visited sequentially starting with
the statement following the macro prototype and ending with the statement preceding the MEND
statement.
The flow of control during macro expansion is implemented using a macro expansion counter
(MEC)
* Algorithm: (Outline of macro expansion)
1. MEC := statement number of first statement following the prototype statement;
2. While statement pointed by MEC is not a MEND statement
(a) If a model statement then
(i) Expand the statement.
(ii) MEC:=MEC+1;
(b) Else
(i) MEC:=new value specified in the statement;
3. Exit from macro expansion
MEC is set to point at the statement following the prototype statement .It is incremented by 1
after expanding a model statement. Execution of a preprocessor statement can set MEC to a new
value to implement conditional expansion or expansion time loops.

4.6.1 Lexical substitution


A model statement consists of 3 types of strings:
1. An ordinary string.
2. The name of a formal parameter, which is preceded by the character ‘&’.
3. The name of a preprocessor variable, which is also preceded by the character ‘&’.
During lexical expansion, strings of type 1 are retained without substitution. Strings of types 2 and 3 are replaced by the ‘values’ of the formal parameters or preprocessor variables. The value of a formal parameter is the corresponding actual parameter string. The rules for determining the value of a formal parameter depend on the kind of parameter.

4.6.2 Positional Parameters:


* A positional formal parameter is written as & <parameter name>, eg., &SAMPLE where
SAMPLE is the name of a parameter.
* The value of a positional formal parameter XYZ is determined by the rule of positional
association as follows:

1. Find the ordinal position of XYZ in the list of formal parameters in the macro prototype statement.
2. Find the actual parameter specification occupying the same ordinal position in the list of actual parameters in the macro call statement.

4.6.3 KeyWord Parameters


* For keyword parameter, <parameter name> is an ordinary string and <parameter kind> is the
string ‘=’ in syntax rule.
* The <actual parameter spec > is written as <formal parameter name>=<ordinary string>.
* The value of a formal parameter XYZ is determined by the rule of keyword association as
follows:
1. Find the actual parameter specification which has the form XYZ=<ordinary string>.
2. Let <ordinary string> in the specification be the string ABC. Then the value of parameter XYZ is ABC.
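A minimal Python sketch of lexical substitution with positional and keyword association, using the INCR macro of the text (the string representation of the body and the helper names are assumptions; keyword parameters are assumed to follow the positional ones):

FORMALS = ['MEM_VAL', 'INCR_VAL', 'REG']      # from the prototype statement
BODY = ['MOVER &REG, &MEM_VAL',
        'ADD   &REG, &INCR_VAL',
        'MOVEM &REG, &MEM_VAL']

def expand(actuals):
    value = {}
    for position, actual in enumerate(actuals):
        if '=' in actual:                     # keyword association: XYZ=ABC
            name, _, val = actual.partition('=')
            value[name] = val
        else:                                 # positional association
            value[FORMALS[position]] = actual
    for model in BODY:                        # visit model statements in order
        line = model
        for name, val in value.items():
            line = line.replace('&' + name, val)   # lexical substitution
        print('+', line)                      # '+' marks generated statements

expand(['A', 'B', 'REG=AREG'])
# + MOVER AREG, A
# + ADD   AREG, B
# + MOVEM AREG, A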

4.7 Nested Macro Calls


* A model statement in a macro may constitute a call on another macro.
* Such calls are known as nested macro calls.
* We refer to the macro containing the nested call as the outer macro, and the called macro is called the inner macro.
* Expansion of nested macro calls follows the LIFO rule.
* Thus, in a structure of nested macro calls, expansion of the latest macro call is completed first.
* Macro COMPUTE of fig-1 contains a nested call on Macro INCR_D.
* Fig 2 shows the expanded code for a nested macro call COMPUTE X, Y.
* After lexical expansion, the second model statement of COMPUTE is recognized to be a call
on macro INCR_D.
* Expansion of this macro is now performed.
* This leads to generation of statements marked 2, 3 and 4 in fig 2.
* The third model statement of COMPUTE is now expanded.
* Thus the expanded code for the call on COMPUTE is as follows:
+ MOVEM BREG, TMP
+ MOVER BREG, X
+ ADD BREG, Y
+ MOVEM BREG, X
+ MOVER BREG, TMP

Fig 1: A nested macro call

MACRO
COMPUTE &FIRST, &SECOND
MOVEM BREG, TMP
INCR_D &FIRST, &SECOND, REG=BREG
MOVER BREG, TMP
MEND

Fig 2: Expanded code for the nested macro call COMPUTE X, Y

COMPUTE X, Y      + MOVEM BREG, TMP      1
                  + MOVER BREG, X        2
+ INCR_D X, Y     + ADD BREG, Y          3
                  + MOVEM BREG, X        4
                  + MOVER BREG, TMP      5
4.8 Advanced Macro Facilities:
* Advanced Macro Facilities are aimed at supporting semantic expansion.
* These facilities are grouped into
1.Facilities for alteration of flow of control during expansion.
2.Expansion time variables
3.Attributes of parameter
* This section describes some advanced facilities and illustrates their use in performing
conditional expansion of model statements and in writing expansion time loops.

Alteration of flow of control during expansion:


Two features are provided to facilitate alteration of flow of control during expansion:
1. Expansion time sequencing symbols (SS)
2. Expansion time statements AIF, AGO and ANOP
* A sequencing symbol (SS) has the syntax:
.<ordinary string>
* An SS is defined by putting it in the label field of a statement in the macro body.
* It is used as an operand in an AIF or AGO statement to designate the destination of an expansion time control transfer.
* An AIF statement has the syntax:
AIF (<expression>) <sequencing symbol>
where <expression> is a relational expression involving ordinary strings, formal parameters and their attributes, and expansion time variables. If the relational expression evaluates to true, expansion time control is transferred to the statement containing <sequencing symbol> in its label field.
* An AGO statement has the syntax:
AGO <sequencing symbol>
and unconditionally transfers expansion time control to the statement containing <sequencing symbol> in its label field.
* An ANOP statement is written as:
<sequencing symbol> ANOP
and simply has the effect of defining the sequencing symbol.

Expansion time variables:
* Expansion time variables (EVs) are variables that can be used only during the expansion of macro calls.
* A local EV is created for use only during a particular macro call.
* A global EV exists across all macro calls situated in a program and can be used in any macro that has a declaration for it.
* Local and global EVs are created through declaration statements with the following syntax:

LCL <EV specification> [, <EV specification> ..]

GBL <EV specification> [, <EV specification> ..]

* Values of EVs can be manipulated through the preprocessor statement SET.

* A SET statement is written as <EV specification> SET <SET-expression>

Attributes of formal parameters:

* An attribute is written using the syntax <attribute name>’<formal parameter spec> and
represents information about the value of the formal parameter, i.e. about the corresponding
actual parameter.
* For example, the type attribute T’&X (used in the CREATE_CONST macro below) yields the
type of the actual parameter corresponding to &X.
Conditional Expansion
* While writing a general purpose macro it is important to ensure the execution efficiency of its
generated code.
* Conditional expansion helps in generating assembly code specially suited to the parameters in a
macro call.
* This is achieved by ensuring that a model statement is visited only under specific conditions
during the expansion of the macro.
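
As an illustration, the following sketch of a macro EVAL (the macro and its parameters are
hypothetical) uses AIF to skip redundant code when its first two parameters are identical:

MACRO
EVAL &X, &Y, &Z
AIF (&Y EQ &X) .ONLY
MOVER AREG, &X
SUB AREG, &Y
ADD AREG, &Z
AGO .OVER
.ONLY MOVER AREG, &Z
.OVER MEND

A call EVAL A, B, C generates all four assembly statements to compute A - B + C, whereas
EVAL A, A, B generates the single statement MOVER AREG, B, since A - A + B = B.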

Expansion time loops

* It is often necessary to generate many similar statements during the expansion of a macro.
* This can be achieved by writing similar model statements in the macro, or more compactly by
writing an expansion time loop which visits a model statement, or a set of model statements,
repeatedly during expansion (see the sketch below).
* Most expansion time loops can be replaced by execution time loops.
* For example, instead of generating many MOVEM statements it is possible to write an
execution time loop which moves 0 into B, B+1 and B+2.
* An execution time loop leads to a more compact assembly program.
* However, such a program would execute slower than a program containing expansion time
loops.
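
A sketch of the CLEAR macro referred to in the next subsection, written as an expansion time
loop over the facilities described above (details such as the literal ='0' are illustrative):

MACRO
CLEAR &X, &N
LCL &M
&M SET 0
MOVER AREG, ='0'
.MORE MOVEM AREG, &X+&M
&M SET &M+1
AIF (&M NE &N) .MORE
MEND

A call CLEAR B, 3 would expand into MOVER AREG, ='0' followed by MOVEM AREG, B;
MOVEM AREG, B+1 and MOVEM AREG, B+2, i.e. the expansion time loop is traversed three times.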

Semantic Expansion
* Semantic expansion is the generation of instructions tailored to the requirements of a specific
usage.
* It can be achieved by a combination of advanced macro facilities like AIF,AGO statements and
expansion time variables.
* The CLEAR macro (sketched under expansion time loops above) is an instance of semantic
expansion.
* Here the number of MOVEM AREG statements generated by a call on CLEAR is determined
by the value of its second parameter.
* Example:
MACRO
CREATE_CONST &X,&Y
AIF (T’ &X EQ B) .BYTE
&Y DW 25
AGO .OVER
.BYTE ANOP
&Y DB 25
.OVER MEND
This macro creates a constant ‘25’ with the name given by the second parameter. The type of the
constant (byte or word) matches the type of the first parameter.
UNIT – V
SYLLABUS:
Linkers: Relocation and linking concept (Program relocation-linking object module)-self
relocating programs-linking for overlays-loaders.
Software tools-software tools for program development-editors-debug monitors.

5.1 LINKERS
Execution of a program written in a language L involves the following steps:
1. Translation of the program.
2. Linking of the program with other programs needed for its execution.
3. Relocation of the program to execute from the specific memory area allocated to it.
4. Loading of the program in the memory for the purpose of execution.
These steps are performed by different language processors. Step 1 is performed by the
translator for language L. Steps 2 and 3 are performed by a linker, while Step 4 is performed by
a loader.
Source program → Translator → Object modules → Linker → Binary programs → Loader →
Binary program in memory (which operates on its data during execution)

The schematic above shows steps 1 to 4 in the execution of a program.


The translator outputs a program form called object module for the program. The linker
processes a set of object modules to produce a ready-to-execute program form, which we call a
binary program. The loader loads this program into the memory for the purpose of execution. As
shown in the schematic, the object module(s) and ready-to-execute program forms can be stored
in the form of files for repeated use.

Translated, linked and load time addresses


* While compiling a program P, a translator is given an origin specification for P.
* This is called the translated origin of P.
* The translator uses the value of the translated origin to perform memory allocation for the
symbols declared in P.
* This results in the assignment of a translation time address t_symb to each symbol symb in the
program.
* The execution start address or simply the start address of a program is the address of the
instruction from which its execution must begin.
* The start address specified by the translator is the translated start address of the program.
* The origin of a program may have to be changed by the linker or loader for one of two
reasons.
 First, the same set of translated addresses may have been used in different object
modules constituting a program, e.g. object modules of library routines often have
the same translated origin. Memory allocation to such programs would conflict
unless their origins are changed.
 Second, an operating system may require that a program should execute from a
specific area of memory. This may require a change in its origin. The change of
origin leads to changes in the execution start address and in the addresses
assigned to symbols.

* The following terminology is used to refer to the address of a program entity at different times:
70/81
1. Translation time address: Address assigned by the translator.
2. Linked address: Address assigned by the linker.
3. Load time address: Address assigned by the loader.
* The same prefixes translation time, linked and load time are used with the origin and execution
start address of a program. Thus,

1. Translated origin: Address of the origin assumed by the translator. This is the address
specified by the programmer in an ORIGIN statement.
2. Linked origin: Address of the origin assigned by the linker while producing a binary
program.
3. Load origin: Address of the origin assigned by the loader while loading the program for
execution.
* The linked and load origins may differ from the translated origin of a program due to one of
the reasons mentioned earlier.

5.2 RELOCATION AND LINKING CONCEPTS


5.2.1 Program Relocation
* Let AA be the set of absolute addresses—instruction or data addresses—used in the
instructions of a program P.
* AA ≠ ∅ implies that program P assumes its instructions and data to occupy memory words
with specific addresses.
* Such a program, called an address sensitive program, contains one or more of the following:

1. An address sensitive instruction: an instruction which uses an address a_i ∈ AA.

2. An address constant: a data word which contains an address a_i ∈ AA.
* In the following, we discuss relocation of programs containing address sensitive instructions.
* Address constants are handled analogously.
* An address sensitive program P can execute correctly only if the start address of the memory
area allocated to it is the same as its translated origin.
* To execute correctly from any other memory area, the address used in each address sensitive
instruction of P must be ‘corrected’.

Definition 5.2.1 (Program relocation): Program relocation is the process of modifying the
addresses used in the address sensitive instructions of a program such that the program can
execute correctly from the designated area of memory.

* If linked origin ≠ translated origin, relocation must be performed by the linker. If load origin ≠
linked origin, relocation must be performed by the loader.
* In general, a linker always performs relocation, whereas some loaders do not.
* (In the definition above, it would have been more precise to use the term ‘linked origin’ rather
than ‘designated area of memory’.)
* Example
Statement Address Code
START 500
ENTRY TOTAL
EXTRN MAX,ALPHA
READ A 500) + 09 0 540
LOOP 501)
.
.
MOVER AREG,ALPHA 518) + 04 1 000
BC ANY,MAX 519) + 06 6 000
.
.
BC LT,LOOP 538) + 06 1 501
STOP 539) + 00 0 000
A DS 1 540)
TOTAL DS 1 541)
END

* The translated origin of the above program is 500.


* The translation time address of symbol A is 540.
* The instruction corresponding to the statement READ A (existing in translated memory word
500) uses the address 540; hence it is an address sensitive instruction.
* If the linked origin is 900, A would have the link time address 940.
* Hence the address in the READ instruction should be corrected to 940. Similarly the
instruction in translated memory word 538 contains 501, the address of LOOP.
* This should be corrected to 901. (Note that the operand addresses in the instruction with the
address 518 and 519 also need to be corrected.)

Performing relocation
Let the translated and linked origins of program P be t_origin_P and l_origin_P, respectively.
Consider a symbol symb in P. Let its translation time address be t_symb and its link time address
be l_symb. The relocation factor of P is defined as
       relocation_factor_P = l_origin_P - t_origin_P                      (5.1)
Note that relocation_factor_P can be positive, negative or zero.
Consider a statement which uses symb as an operand. The translator puts the address t_symb in
the instruction generated for it. Now,
       t_symb = t_origin_P + d_symb
where d_symb is the offset of symb in P. Hence
       l_symb = l_origin_P + d_symb
Using (5.1),
       l_symb = t_origin_P + relocation_factor_P + d_symb
              = t_origin_P + d_symb + relocation_factor_P
              = t_symb + relocation_factor_P                              (5.2)
Let IRR_P designate the set of instructions requiring relocation in program P. Following (5.2),
relocation of program P can be performed by computing the relocation factor for P and adding it
to the translation time address(es) in every instruction i ∈ IRR_P.

Example: For the above program,
       relocation_factor_P = 900 – 500 = 400
* Relocation is performed as follows: IRR_P contains the instructions with translated addresses
500 and 538.
* The instruction with translated address 500 contains the address 540 in the operand field.
* This address is changed to (540+400) = 940.
* Similarly, 400 is added to the operand address in the instruction with translated address 538.
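
A minimal sketch of this computation (Python, illustrative; it assumes each instruction's operand
address is stored in a separate field and that RELOCTAB lists the translated addresses of the
instructions in IRR_P):

from dataclasses import dataclass

@dataclass
class Instruction:
    opcode: str
    operand: int                 # address used by the instruction

def relocate(code, reloctab, t_origin, l_origin):
    """code maps a translated address to the Instruction stored there."""
    factor = l_origin - t_origin              # relocation factor, eq. (5.1)
    for t_addr in reloctab:                   # every instruction in IRR_P
        code[t_addr].operand += factor        # corrected address, eq. (5.2)

# The two address sensitive instructions of the example program:
code = {500: Instruction("READ", 540), 538: Instruction("BC LT", 501)}
relocate(code, reloctab=[500, 538], t_origin=500, l_origin=900)
print(code[500].operand, code[538].operand)   # prints: 940 901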

5.2.2 Linking
Consider an application program AP consisting of a set of program units SP = {P_i}. A
program unit P_i interacts with another program unit P_j by using addresses of P_j’s instructions
and data in its own instructions. To realize such interactions, P_j and P_i must contain public
definitions and external references as defined in the following:

Public definition: a symbol pub_symb defined in a program unit which may be
referenced in other program units.
External reference: a reference to a symbol ext_symb which is not defined in the
program unit containing the reference.
The handling of public definitions and external reference is described in the following.
EXTRN and ENTRY statements
The ENTRY statement lists the public definitions of a program unit, i.e. it lists those
symbols defined in the program unit which may be referenced in other program units. The
EXTRN statement lists the symbols to which external reference are made in the program unit.

Example: In the assembly program, the ENTRY statement indicates that a public definition of
TOTAL exists in the program. Note that LOOP and A are not public definitions even though
they are defined in the program. The EXTRN statement indicates that the program contains
external references to MAX and ALPHA. The assembler does not know the address of an
external symbol. Hence it puts zeros in the address fields of the instructions corresponding to the
statements MOVER AREG, ALPHA and BC ANY, MAX. If the EXTRN statement did not
exist, the assembler would have flagged the references to MAX and ALPHA as errors.

Resolving external references


Before the application program AP can be executed, it is necessary that for each P_i in SP, every
external reference in P_i should be bound to the correct link time address.
Definition 5.2.2 (Linking): Linking is the process of binding an external reference to the correct
link time address.
An external reference is said to be unresolved until linking is performed for it. It is said to be
resolved when its linking is completed.
Statement address code
START 200
ENTRY ALPHA
- -
ALPHA DS 25 231 +00 0 025
END
Example: Let the program unit P be linked with the program unit Q described in the figure.
Program unit P contains an external reference to symbol ALPHA, which is a public definition in
Q with the translation time address 231. Let the link origin of P be 900 and its size be 42
words. The link origin of Q is therefore 942, and the link time address of ALPHA is 973. Linking
is performed by putting the link time address of ALPHA in the instruction of P that uses ALPHA,
i.e., by putting the address 973 in the instruction with the translation time address 518 in P.
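
The same arithmetic as in relocation applies here; a sketch (Python, illustrative; names are
assumptions) of how the example's numbers arise:

# Link time address of a public definition = its translated address plus
# the relocation factor of the unit that defines it.
def link_time_address(t_addr_of_pub_def, t_origin, l_origin):
    return t_addr_of_pub_def + (l_origin - t_origin)

# Q: translated origin 200, linked origin 942 (= 900 + 42, just past P).
l_alpha = link_time_address(231, t_origin=200, l_origin=942)
print(l_alpha)   # prints: 973 -- planted in P's word with translated address 518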

5.2.3 Binary programs


Definition 5.2.3 (Binary program): A binary program is a machine language program
comprising a set of program units SP such that ∀ P_i ∈ SP

1. P_i has been relocated to the memory area starting at its link origin, and
2. Linking has been performed for each external reference in P_i.
To form a binary program from a set of object modules, the programmer invokes the linker using
the command
Linker <link origin>,<object module names>
[,<execution start address>]
* where <link origin> specifies the memory address to be given to the first word of the binary
program. <execution start address> is usually a pair (program unit name, offset within the
unit). The linker converts this into the linked start address, which is stored along with the binary
program for use when the program is to be executed. If the specification of <execution start
address> is omitted, the execution start address is assumed to be the same as the linked origin.
* Note that a linker converts the object modules in the set of program units SP into a binary
program. Since we have assumed link address = load address, the loader simply loads the
binary program into the appropriate area of memory for the purpose of execution.

5.2.4 Object Module


The object module of a program contains all information necessary to relocate and link the
program with other programs. The object module of a program P consists of 4 components:
1. Header: The header contains the translated origin, size and execution start address of P.
2. Program: This component contains the machine language program corresponding to
P.
3. Relocation table (RELOCTAB): This table describes IRR_P. Each RELOCTAB entry
contains a single field:
Translated address: Translated address of an address sensitive instruction.

4.Linking table (LINKTAB) : This table contains information concerning the public
definitions and external references in P.

Each LINKTAB entry contains three fields:

Symbol             : Symbolic name.
Type               : PD/EXT, indicating whether public definition or external
                     reference.
Translated address : For a public definition, this is the address of the first
                     memory word allocated to the symbol. For an external
                     reference, it is the address of the memory word which
                     is required to contain the address of the symbol.

Example 6: Consider the assembly program of the figure. The object module of the program
contains the following information:

1. Translated origin = 500, size = 42, execution start address = 500.
2. Machine language instructions shown in the figure.
3. Relocation table:
       500
       538
4. Linking table:
       ALPHA   EXT   518
       MAX     EXT   519
       TOTAL   PD    541

Note that the symbol LOOP does not appear in the linking table. This is because it is not
declared as a public definition, i.e., it does not appear in an ENTRY statement.
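
The object module can be pictured as a simple record structure (a Python sketch; the field layout
is illustrative, not a standard format):

from dataclasses import dataclass

@dataclass
class LinktabEntry:
    symbol: str          # symbolic name
    kind: str            # "PD" (public definition) or "EXT" (external reference)
    t_address: int       # translated address, as described above

@dataclass
class ObjectModule:
    t_origin: int        # header: translated origin
    size: int            # header: size
    start_address: int   # header: execution start address
    program: list        # machine language program
    reloctab: list       # translated addresses of address sensitive instructions
    linktab: list        # public definitions and external references

# The object module of the example program:
om = ObjectModule(500, 42, 500, program=[], reloctab=[500, 538],
                  linktab=[LinktabEntry("ALPHA", "EXT", 518),
                           LinktabEntry("MAX", "EXT", 519),
                           LinktabEntry("TOTAL", "PD", 541)])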

5.3 SELF-RELOCATING PROGRAMS


* The manner in which a program can be modified, or can modify itself, to execute from a given
load origin can be used to classify programs into the following:

1.Non-relocatable programs.
2.Relocatable programs.
3.Self-relocating programs.
* A non relocatable program is a program which cannot be executed in any memory area other
than the area starting at its translated origin.
* Non relocatability is the result of address sensitivity of a program and lack of information
concerning the address sensitive instructions in a program.
* The difference between a relocatable program and a non relocatable program is the availability
of information concerning the address sensitive instructions in it.
* A relocatable program can be processed to relocate it to a desired area of memory.
* Representative examples of non relocatable and relocatable programs are a hand coded
machine language program and an object module, respectively.

* A self relocating program is a program which can perform the relocation of its own address
sensitive instructions. It contains the following two provisions for this purpose:

1. A table of information concerning its address sensitive instructions exists as a part of the program.
2. Code to perform the relocation of address sensitive instructions also exists as a part of the
program. This is called the relocating logic.
* The start address of the relocating logic is specified as the execution start address of the
program.
* Thus the relocating logic gains control when the program is loaded in memory for execution.
* It uses the load address and the information concerning address sensitive instructions to
perform its own relocation.
* Execution control is now transferred to the relocated program.
* A self-relocating program can execute in any area of the memory.
* This is very important in time sharing operating systems where the load address of a program
is likely to be different for different executions.

5.4 LINKING FOR OVERLAYS


Definition: An overlay is a part of a program which has the same load origin as some other
parts of the program.
Overlays are used to reduce the main memory requirement of a program.
Overlay Structured Programs
We refer to a program containing overlays as an overlay structured program. Such a
program consists of

1. A permanently resident portion, called the root.
2. A set of overlays.
* Execution of an overlay structured program proceeds as follows:
* To start with, the root is loaded in memory and given control for the purpose of execution.
Other overlays are loaded as and when needed.
* Note that the loading of an overlay overwrites a previously loaded overlay with the same load
origin. This reduces the memory requirement of a program.
* It also makes it possible to execute programs whose size exceeds the amount of memory which
can be allocated to them.
* The overlay structure of a program is designed by identifying mutually exclusive modules-that
is, modules which do not call each other.
* Such modules do not need to reside simultaneously in memory. Hence they are located in
different overlays with the same load origin.

Example : The MS-DOS LINK command to implement the overlay structure.


LINK init + read + write + (trans_a) + (trans_b) + (trans_c), <executable file>,<library
files>.
The object modules within a pair of parentheses become one overlay of the program. The object
modules not included in any overlay become part of the root. LINK produces a single binary
program containing all overlays and stores it in <executable file>.
Execution of an overlay structured program:
For linking and execution of an overlay structured program in MS DOS, the linker
produces a single executable file at the output, which contains two provisions to support
overlays. First, an overlay manager module is included in the executable file. This module is
responsible for loading the overlays when needed. Second, all calls that cross overlay boundaries
are replaced by an interrupt producing instruction. To start with, the overlay manager receives
control and loads the root. A procedure call which crosses overlay boundaries leads to an
interrupt. This interrupt is processed by the overlay manager and the appropriate overlay is
loaded into memory.
When each overlay is structured into a separate binary program, as in IBM mainframe
systems, a call which crosses overlay boundaries leads to an interrupt which is attended by the
OS kernel. Control is now transferred to the OS loader to load the appropriate binary program.
Changes in LINKER algorithms
The basic change required in LINKER is in the assignment of load addresses to
segments. program_load_origin can be used as before while processing the root portion of a
program. The size of the root would decide the load address of the overlays;
program_load_origin would be initialized to this value while processing every overlay (see the
sketch below). Another change in the LINKER algorithm would be in the handling of procedure
calls that cross overlay boundaries. LINKER has to identify an inter-overlay call and determine
the destination overlay. This information must be encoded in the software interrupt instruction.
An open issue in the linking of overlay structured programs is the handling of object
modules added during autolinking. Should these object modules be added to the current overlay
or to the root of the program? The former has the advantage of cleaner semantics; however, it
may increase the memory requirement of the program.
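
A sketch of the modified load-origin assignment (Python, illustrative; program_load_origin
follows the text above, the remaining names are assumptions):

def assign_load_origins(root_size, overlay_names, program_load_origin):
    """Mutually exclusive overlays share one load origin, just past the root."""
    overlay_origin = program_load_origin + root_size
    origins = {"root": program_load_origin}
    for name in overlay_names:            # every overlay gets the same origin
        origins[name] = overlay_origin
    return origins

print(assign_load_origins(2000, ["trans_a", "trans_b", "trans_c"],
                          program_load_origin=1000))
# {'root': 1000, 'trans_a': 3000, 'trans_b': 3000, 'trans_c': 3000}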
5.5 LOADERS
* An absolute loader can only load programs with load origin = linked origin.
* This can be inconvenient if the load address of a program is likely to be different for different
executions of a program.
* A relocating loader performs relocation while loading a program for execution.
* This permits a program to be executed in different parts of the memory.

Linking and loading in MS DOS


* The MS DOS operating system supports two object program forms.
* A file with a .COM extension contains an absolute (non-relocatable) object program, whereas a
file with a .EXE extension contains a relocatable program. MS DOS contains a program
EXE2BIN which converts a .EXE program into a .COM program.
* The system contains two loaders—an absolute loader and a relocating loader.
* When the user types a filename with the .COM extension, the absolute loader is invoked to
simply load the object program into memory.
* When a filename with the .EXE extension is given, the relocating loader relocates the program
to the designated load area before passing control to it for execution.

5.6 SOFTWARE TOOLS


* Computing involves two main activities – program development and use of application
software.
* Language processors and operating systems play an obvious role in these activities.
* A less obvious but vital role is played by programs that help in developing and using other
programs.
* These programs, called software tools, perform various housekeeping tasks involved in
program development and application usage.

Definition : A software tool is a system program which


1. interfaces a program with the entity generating its input data, or
2. interfaces the results of a program with the entity consuming them.
The entity generating the data or consuming the results may be a program or a user. The figure
below shows a schematic of a software tool.

Originator → [Software tool] → Consumer
(raw program or data)    (transformed program or data)
Example : A file rewriting utility organizes the data in a file in a format suitable for processing
by a program. The utility may perform blocking/deblocking of data, padding and truncation of
fields and records, etc.
* In this chapter we discuss two kinds of software tools – software tools for program
development , and user interfaces.

5.6.1 SOFTWARE TOOLS FOR PROGRAM DEVELOPMENT


The fundamental steps in program development are:
1.Program design, coding and documentation.
2. Preparation of programs in machine readable form.
3. Program translation, linking and loading.
4. Program testing and debugging.
5. Performance tuning.
6.Reformatting the data and/ or results of a program to suit other programs.

5.6.1.1 Program Design and Coding


Two categories of tools used in program design and coding are
1. Program generators.
2. Programming environments.

A program generator generates a program which performs a set of functions described in its
specification. Use of a program generator saves substantial design effort, since a programmer
merely specifies what functions a program should perform rather than how the functions should
be implemented. Coding effort is saved since the program is generated rather than coded by
hand. A programming environment supports coding by incorporating awareness of the
programming language syntax and semantics in the language editor.
5.6.1.2 Program Entry and Editing
These tools are text editors or more sophisticated programs with text editors as front
ends. The editor functions in two modes. In the command mode, it accepts user commands
specifying the editing function to be performed. In the data mode, the user keys in the text to be
added to the file. Failure to recognize the current mode of the editor can lead to mix up of
commands and data. This can be avoided in two ways. In one approach, a quick exit is provided
from the data mode, e.g. by pressing the escape key, such that editor enters the command mode.
Another approach is to use the screen mode, wherein the editor is in the data mode most of the
time. The user is provided special keys to move the cursor on the screen. A stroke of any other
key is taken to imply input of the corresponding character at the current cursor position. Certain
keys pressed along with the control key signify commands like erase character, delete line, etc.
Thus end of data need not be explicitly indicated by the user. Most Turbo editors on PC’s use this
approach.
5.6.1.3 Program Testing and Debugging
Important steps in program testing and debugging are selection of test data for the
program, analysis of test results to detect errors, and debugging, i.e. localization and removal of
errors. Software tools to assist the programmer in these steps come in the following forms:
1. Test data generators help the user in selecting test data for the program. Their use
helps in ensuring that a program is thoroughly tested.
2. Automated test drivers help in regression testing, wherein a program’s correctness is
verified by subjecting it to a standard set of tests after every modification. Regression testing is
performed as follows: Many sets of test data are prepared for the program. These are given as
inputs to the test driver. The driver selects one set of test data at a time and organizes execution
of the program on the data.
3. Debug monitors help in obtaining information for localization of errors.
4. Source code control systems help to keep track of modifications in the source code.
Test data selection uses the notion of an execution path, which is a sequence of program
statements visited during an execution. For testing, a program can be viewed as a set of
execution paths. A test data generator determines the conditions which must be satisfied by the
program’s inputs for control to flow along a specific execution path. A test data is a set of input
values which satisfy these conditions.
Producing debug information
Classically, localization and removal of errors has been aided by special purpose debug
information. Such information can be produced statically by analyzing the source program or
dynamically during program execution. Statically produced debug information takes the form of
cross reference listings, lists of undefined variables and unreachable statements, etc. All these are
useful in determining the cause of program malfunction. Techniques of data flow analysis are
employed to collect such information.
Example: The data flow concept of reaching definitions can be used to determine whether
variable x may have a value when execution reaches statement 10 in the following program:
No.    Statement
10     sum := x + 10;
If no definitions of x reach statement 10, then x is surely undefined in statement 10. If some
definitions reach statement 10, then x may have a value when control reaches the statement.
Whether x is defined in a specific execution of the program would depend on how control flows
during the execution.
Dynamically produced debug information takes the form of value dumps and execution traces
produced during the execution of a program. This information helps to determine the execution
paths followed during an execution and the sequence of values assumed by a variable. Most
programming languages provide facilities to produce dynamic debug information.
5.6.1.4 Enhancement of program performance
Program efficiency depends on two factors—the efficiency of the algorithm and the
efficiency of its coding. An optimizing compiler can improve efficiency of the code but it cannot
improve efficiency of the algorithm. Only a program designer can improve efficiency of an
algorithm by rewriting it. However, this is a time consuming process, hence some help should be
provided to improve its cost-effectiveness. For example, it is better to focus on only those
sections of a program which consume a considerable amount of execution time. A performance
tuning tool helps in identifying such parts. It is empirically observed that less than three percent of
program code generally accounts for more than 50 percent of program execution time. This
observation promises major economies of effort in improving a program.
Example: A program consists of three modules A, B and C. The sizes of the modules and the
execution times consumed by them in a typical execution are given in the table below.
Name    # of statements    % of total execution time
A       150                 4.00
B        80                 6.00
C        35                90.00

It is seen that module C, which is roughly 13 % of the total program size, consumes 90% of the
program execution time. Hence optimization of module C would result in optimization for 90%
of program execution time at only 13% of the cost.
A profile monitor is a software tool that collects information regarding the execution behavior of
a program, e.g. the amount of execution time consumed by its modules, and presents it in the
form of an execution profile. Using this information, the programmer can focus attention on the
program sections consuming a significant amount of execution time.
5.6.1.5 Program Documentation
Most programming projects suffer from lack of up-to-date documentation. Automatic
documentation tools are motivated by the desire to overcome this deficiency. These tools work
on the source program to produce different forms of documentation, e.g. flow charts, IO
specifications showing files and their records, etc.
5.6.1.6 Design of Software tools
Program preprocessing techniques are used to support static analysis of programs. Tools
generating cross reference listings and lists of unreferenced symbols, test data generators, and
documentation aids use this technique. Program instrumentation implies insertion of statements
in a program. The instrumented program is translated using a standard translator. During
execution, the inserted statements perform a set of desired functions. Profile and debug monitors
typically use this technique. In a profile monitor, an inserted statement updates a counter
indicating the number of times a statement is executed, whereas in a debug monitor an inserted
statement indicates that execution has reached a specific point in the source program.
Example: A debug monitor instruments a program to insert statements of the form
Call debug_mon(const_i);
before every statement in the program, where const_i is an integer constant indicating the serial
number of the statement in the program. During execution of the instrumented program, the
debug monitor receives control after every statement. Suppose the user has issued the debug
commands
At 10, display total
At 20, display term
Every time the debug monitor receives control, it checks to see if statement 10 or 20 is about to
be executed. It then performs the debug action indicated by the user.
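
A sketch of what debug_mon might do (Python, illustrative; the command table and the display
action are assumptions made for this sketch):

# statement number -> debug action requested by the user
DEBUG_ACTIONS = {10: "display total", 20: "display term"}

def debug_mon(stmt_no):
    """Called before every statement of the instrumented program."""
    action = DEBUG_ACTIONS.get(stmt_no)
    if action is not None:
        print(f"At {stmt_no}: {action}")   # perform the user's debug action

debug_mon(9)    # no action registered -- control returns immediately
debug_mon(10)   # prints: At 10: display total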
Program interpretation and program generation
Use of interpreters in software tools is motivated by the same reasons that motivate the
use of interpreters in program development. Since most requirements met by software tools are
ad hoc, it is useful to eliminate the translation phase. However, interpreter based tools suffer
from poor efficiency and poor portability, since an interpreter based tool is only as portable as
the interpreter it uses. A generated program is more efficient and can be made portable.
5.7 EDITORS
Text editors come in the following forms:
1. Line editors.
2. Stream editors
3. Screen editors
4. Word processors
5. Structure editors

* The scope of edit operations in a line editor is limited to a line of text. The line is designated
positionally, e.g. by specifying its serial number in the text, or contextually, e.g. by specifying
a context which uniquely identifies it.
* The primary advantage of line editors is their simplicity.
* A stream editor views the entire text as a stream of characters.
* This permits edit operations to cross line boundaries.
* Stream editors typically support character, line and context oriented commands based on the
current editing context indicated by the position of a text pointer.
* The pointer can be manipulated using positioning or search commands.

5.7.1 Screen Editors


A line or stream editor does not display the text in the manner it would appear if printed.
A screen editor uses the what-you-see-is-what-you-get principle in editor design. The
editor displays a screenful of text at a time. The user can move the cursor over the screen,
position it at the point where he desires to perform some editing, and proceed with the
editing directly. This is very useful while formatting the text to produce printed
documents.
5.7.2 Word processors
Word processors are basically document editors with additional features to produce well
formatted hard copy output. Essential features of word processors are commands for
moving sections of text from one place to another, merging of text, and searching and
replacement of words. Many word processors support a spell-check option. With the
advent of personal computers, word processors have seen widespread use amongst
authors, office personnel and computer professionals.
5.7.3 Structure Editors
A structure editor incorporates an awareness of the structure of a document. This is useful
in browsing through a document, e.g. if a programmer wishes to edit a specific function
in a program file. The structure is specified by the user while creating or modifying the
document. Editing requirements are specified using the structure. A special class of
structure editors, called syntax directed editors, are used in programming environments.

5.7.4 Design of an Editor


* The fundamental functions in editing are travelling, editing, viewing, and display.

* Travelling implies movement of the editing context to a new position within the text

* This may be done explicitly by the user or may be implied in a user command. Viewing
implies formatting the text in a manner desired by the user.

* The display component maps this view into the physical characteristics of the display
device being used.

* This determines where a particular view may appear on the user’s screen.

* The separation of viewing and display functions gives rise to interesting
possibilities like multiple windows on the same screen, concurrent edit operations
using the same display terminal, etc.

* A simple text editor may choose to combine the viewing and display functions.

* For a given position of the editing context, the editing filters operate on the internal
form of text to prepare the forms suitable for editing and viewing.

* These forms are put in the editing and viewing buffers.

* The viewing and display manager makes provision for appropriate display of text.

* When the cursor position changes or editing is performed, the editing filter reflects
the changes into the internal form and updates the contents of the viewing buffer.

* Apart from the fundamental editing functions, most editors support an undo function to
nullify one or more of the previous edit operations performed by the user.

* The undo function can be implemented by storing a stack of previous views or by
devising an inverse for each edit operation.
* Multilevel undo commands pose obvious difficulties in implementing overlapping
edits.

The Editor Structure is given below:

                       Command processor
                     /        |         \
             Editing      Travelling     Viewing
             manager       manager       manager
                |                           |
            Editing                      Viewing
            buffer                       buffer
                |                           |
           Editing filter            Viewing filter
                      \                 /
                            Text

5.8 DEBUG MONITORS


Debug monitors provide the following facilities for dynamic debugging:
1. Setting breakpoints in the program
2. Initiating a debug conversation when control reaches a breakpoint.
3. Displaying values of variables
4. Assigning new values to variables
5. Testing user-defined assertions and predicates involving program variables.

* The debug monitor functions can be easily implemented in an interpreter.
* However, interpretation incurs considerable execution time penalties.
* Debug monitors therefore rely on instrumentation of a compiled program to implement their
functions.
* To enable use of the debug monitor, the user must compile the program under the debug
option.
* The compiler then inserts the following instructions:

A few no-op instructions of the form

No-op <statement no>

* before each statement, where <statement no> is a constant indicating the serial number of the
statement in the program.
* The compiler also generates a table containing the pairs (<statement no>, <address of the
corresponding no-op instruction>).
* When a user gives a command to set a breakpoint at, say, statement 100, the debug monitor
instruments the program to introduce a software interrupt instruction <SI_instrn> in place of the
no-op instruction preceding statement 100.
* The sequence of steps involved in dynamic debugging of a program is as follows:

1. The user compiles the program under the debug option. The compiler produces
two files – the compiled code file and the debug information file.
2. The user activates the debug monitor and indicates the name of the program to be
debugged. The debug monitor opens the compiled code and debug information
files for the program.
3. The user specifies his debug requirements—a list of breakpoints and actions to be
performed at breakpoints. The debug monitor instruments the program, and builds
a debug table containing (breakpoint, debug actions) pairs.
4. The instrumented program gets control and executes up to a breakpoint.
5. A software interrupt is generated when the <SI_instrn> is executed. Control is
given to the debug monitor which consults the debug table and performs the
debug actions specified for the breakpoint. A debug conversation is now opened
during which the user may issue some debug commands or modify breakpoints
and debug actions associated with breakpoints.
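
A sketch of steps 3 to 5 (Python, illustrative; <SI_instrn> is modelled as a string and the table
contents are assumptions):

noop_address = {100: 1040}        # from the compiler's debug information file
debug_table = {}                  # breakpoint -> debug actions

def set_breakpoint(code, stmt_no, actions):
    """Replace the no-op preceding the statement by <SI_instrn>."""
    code[noop_address[stmt_no]] = "SI_instrn"
    debug_table[stmt_no] = actions

def on_software_interrupt(stmt_no):
    for action in debug_table[stmt_no]:   # perform the specified debug actions,
        print(action)                     # then open a debug conversation

code = {1040: "No-op 100"}
set_breakpoint(code, 100, ["display total"])
on_software_interrupt(100)                # prints: display total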

5.8.1 Testing Assertions


A debug assertion is a relation between the values of program variables. An assertion can
be associated with a program statement. The debug monitor verifies the assertion when
execution reaches that statement. Program execution continues if the assertion is fulfilled, else a
debug conversation is opened. The user can now perform actions to locate the cause of the
program malfunction. Use of debug assertions eliminates the need to produce voluminous
information for debugging purposes.
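
A sketch of assertion testing (Python, illustrative; the assertion is supplied as a predicate over
the program's variables):

def check_assertion(stmt_no, predicate, variables):
    """Verify a debug assertion when execution reaches its statement."""
    if predicate(variables):
        return                    # assertion fulfilled: execution continues
    # assertion failed: open a debug conversation so the user can locate
    # the cause of the program malfunction
    print(f"Assertion failed at statement {stmt_no}")

check_assertion(10, lambda v: v["total"] >= 0, {"total": 5})    # silent
check_assertion(10, lambda v: v["total"] >= 0, {"total": -1})   # reports failure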
