
CSC 401: Compiler Construction

2023/2024 Academic Year

1 / 76
Mode of Assessment

Three (3) major assignments – 20 Marks

Three Quizzes – 20 Marks

Mid – Semester Exams (1) – 15 marks + 5 marks from Quizzes

Attendance – 10 Marks

Main Exams – 30 Marks

Reading Material
– A Practical Approach to Compiler Construction, Des Watson (2017), Springer.
– Compilers: Principles, Techniques, and Tools, by A.V. Aho, R. Sethi & J.D. Ullman, Pearson
Education
– Principles of Compiler Design, A.V. Aho and J.D. Ullman, Addison-Wesley
2 / 76
Course Outline

Module-I :
– Introduction to Compiling,
– A Simple One-Pass Compiler,
– Lexical analysis

Module-II:
– Syntax Analysis,
– Syntax-Directed Translation

Module-III
– Type Checking,
– Run-Time Environments

Module-IV
– Intermediate Code Generation,
– Code generation and Optimization
3 / 76
Module I
Introduction to Compiling

4 / 76
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM

5 / 76
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Preprocessor

A preprocessor produces input to compilers.

It may perform the following functions:
– Macro processing: A preprocessor may allow a user to define
macros that are shorthands for longer constructs.
– File inclusion: A preprocessor may include header files into the
program text.
– Rational preprocessing: these preprocessors augment older
languages with more modern flow-of-control and data
structuring facilities.
– Language extensions: these preprocessors attempt to add
capabilities to the language in the form of built-in macros.
6 / 76
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Compiler

A compiler is a translator program that translates a program
written in a source language (HLL) into an equivalent program in
a target language (MLL).

An important part of a compiler is the error messaging system.

Structure of a Compiler

7 / 76
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Compiler

Executing a program written in an HLL programming language
basically involves two parts:
– the source program must first be compiled/translated into an
object program.
– Then the resulting object program is loaded into memory and
executed.

Execution process of source program in Compiler


8 / 76
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Assembler

Writing or reading programs in machine language is difficult.

Programmers therefore use mnemonics (symbols) for machine
instructions, which are subsequently translated into machine
language.

Such a mnemonic machine language is called an assembly
language.

Programs known as assemblers were written to automate the
translation of assembly language into machine language.

The input to an assembler is called the source program; the
output is its machine-language translation (the object program).
9 / 76
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Interpreter

An interpreter is a program that appears to execute a source
program as if it were machine language.
– Languages such as BASIC, SNOBOL, and LISP can be translated using
interpreters.
– Java also uses an interpreter.

The process of interpretation can be carried out in following phases:
– Lexical analysis
– Syntax analysis
– Semantic analysis
– Direct Execution
10 / 76
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Interpreter

Advantages:
– Modification of user program can be easily made and
implemented as execution proceeds.
– Type of object that denotes a variables may change dynamically.
– Debugging a program and finding errors is simplified for a
program used for interpretation.
– The interpreter for the language makes it machine independent.

Disadvantages:
– The execution of the program is slower.
– Memory consumption is higher.
11 / 76
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Loader and Link-Editor

Object programs produced by assemblers are normally placed into
memory and executed.

The assembler could place the object program directly in memory
and transfer control to it, thereby causing the machine language
program to be executed.

Core (memory) is wasted when the assembler is left in memory while
the program is being executed.

Also the programmer would have to retranslate his program with
each execution, thus wasting translation time.

To overcome these problems of wasted translation time and memory,

system programmers developed another component called the loader.
12 / 76
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Loader and Link-Editor

A loader is a program that places programs into memory and
prepares them for execution.

It would be more efficient if subroutines could be translated into an
object form that the loader could relocate directly behind the user’s
program.

The task of adjusting programs so they may be placed in arbitrary
core locations is called relocation.

13 / 76
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Loader and Link-Editor

Large programs are often compiled in pieces, so the relocatable
machine code may have to be linked together with other
relocatable object files and library files into the code that actually
runs on the machine.

The linker resolves external memory addresses, where the code in
one file may refer to a location in another file.

The loader then puts together all of the executable object files into
memory for execution.

14 / 76
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM


LIST OF COMPILERS
– Ada compilers
– ALGOL compilers
– BASIC compilers
– C# compilers
– C compilers
– C++ compilers
– COBOL compilers
– Common Lisp compilers
– ECMAScript interpreters
– Fortran compilers
– Java compilers
– Pascal compilers
– PL/I compilers
– Python compilers
– Smalltalk compilers

15 / 76
1.2 TRANSLATOR

A translator is a program that takes as input a program written in
one language and produces as output a program in another
language.

Besides program translation, the translator performs another very
important role: error detection.

Any violation of the HLL specification would be detected and
reported to the programmer.

The important roles of a translator are:
– Translating the HLL program input into an equivalent
machine-language (ML) program.
– Providing diagnostic messages wherever the programmer
violates the specification of the HLL.
16 / 76
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler

A compiler operates in phases.

A phase is a logically interrelated operation that takes the source
program in one representation and produces output in another
representation.

There are two main phases of compilation.
– Analysis (Machine Independent/Language Dependent)
– Synthesis (Machine Dependent/Language Independent)

The compilation process is partitioned into a number of
sub-processes called ‘phases’.

17 / 76
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler

The analysis part breaks up the source program into
constituent pieces and imposes a grammatical structure on
them.

It then uses this structure to create an intermediate
representation of the source program.

Both the syntactic correctness and the semantic soundness of the source
code are checked.

Information about the source program is collected and stored in
a data structure called a symbol table, which is passed along
with the intermediate representation to the synthesis part.

18 / 76
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler

The synthesis part constructs the desired target program from
the intermediate representation and the information in the
symbol table.

The analysis part is often called the front end of the compiler
and the synthesis part the back end.

19 / 76
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler

20 / 76
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler

Lexical Analysis
– The lexical analyzer reads the stream of characters making
up the source program and groups the characters into
meaningful sequences called lexemes.
– For each lexeme, the lexical analyzer produces as output a
token of the form:
<token-name, attribute-value>
– For example: position = initial + rate * 60

21 / 76
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler

Lexical Analysis
– position is a lexeme that is mapped into a token <id,1>,
– id is an abstract symbol for identifier and 1 points to the
symbol-table entry for position.
– The symbol-table entry for an identifier holds information
about the identifier, such as its name and type.
– = is mapped into the token <=>
– initial to <id,2>, + to <+>, rate to <id,3>, * to <*>, and
60 to <60>
<id,1> <=> <id,2> <+> <id,3> <*> <60>
22 / 76
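As a supplement to the slide (not part of the original notes), here is a minimal sketch in Python of a lexical analyzer for this single statement; the patterns, the num token name for the literal 60, and the symbol-table layout are assumptions made only for the example.

import re

def tokenize(source):
    # Sketch: split one statement into <token-name, attribute-value> pairs.
    # Identifiers are entered into a symbol table; the token's attribute is
    # the (1-based) index of the symbol-table entry, as on the slide.
    symbol_table = []
    tokens = []
    for number, identifier, operator in re.findall(
            r"\s*(?:(\d+)|([A-Za-z_]\w*)|(\S))", source):
        if number:                                   # 60 -> <num, 60>
            tokens.append(("num", int(number)))
        elif identifier:                             # position -> <id, 1>
            if identifier not in symbol_table:
                symbol_table.append(identifier)
            tokens.append(("id", symbol_table.index(identifier) + 1))
        else:                                        # operators carry no attribute
            tokens.append((operator, None))
    return tokens, symbol_table

tokens, table = tokenize("position = initial + rate * 60")
print(tokens)  # [('id', 1), ('=', None), ('id', 2), ('+', None), ('id', 3), ('*', None), ('num', 60)]
print(table)   # ['position', 'initial', 'rate']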
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler

Syntax Analysis
– The second phase of the compiler is syntax analysis or
parsing.
– Creates a tree-like intermediate representation that depicts
the grammatical structure of the token stream.
– A typical representation is a syntax tree in which each
interior node represents an operation and the children of
the node represent the arguments of the operation.

23 / 76
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler

24 / 76
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler

25 / 76
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler

Semantic Analysis
– The semantic analyzer uses the syntax tree and the
information in the symbol table to check the source
program for semantic consistency with the language
definition.
– It also gathers type information and saves it in either the
syntax tree or the symbol table, for subsequent use during
intermediate-code generation.
– An important part of semantic analysis is type checking,
where the compiler checks that each operator has matching
operands.
26 / 76
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler

Intermediate Code Generation
– In translating a source program into target code, one or more
intermediate representations may be constructed.
– E.g. Syntax trees, commonly used during syntax and semantic
analysis.
– After syntax and semantic analysis, explicit low-level or machine-like
intermediate representation may be generated (a program for an
abstract machine).
– This intermediate representation should have two important
properties:

it should be easy to produce and it should be easy to translate into
the target machine.
27 / 76
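One widely used machine-like intermediate representation is three-address code; it is not named on the slide, so treat the following as an illustrative sketch only. It walks a nested-tuple syntax tree for position = initial + rate * 60 and emits one instruction per operator, introducing temporaries t1, t2, and so on.

def emit_expr(node, code, counter):
    # Return the name (identifier, literal, or temporary) holding node's value.
    if isinstance(node, str):              # leaf: identifier or literal
        return node
    op, left, right = node                 # interior node: (operator, left, right)
    left_name = emit_expr(left, code, counter)
    right_name = emit_expr(right, code, counter)
    counter[0] += 1
    temp = f"t{counter[0]}"                # fresh temporary
    code.append(f"{temp} = {left_name} {op} {right_name}")
    return temp

def emit_assignment(target, expr):
    code, counter = [], [0]
    result = emit_expr(expr, code, counter)
    code.append(f"{target} = {result}")
    return code

# Syntax tree for: position = initial + rate * 60
print("\n".join(emit_assignment("position", ("+", "initial", ("*", "rate", "60")))))
# t1 = rate * 60
# t2 = initial + t1
# position = t2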
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler

Code Optimization
– The machine-independent code-optimization phase attempts to
improve the intermediate code so that better target code will result.
– Usually better means faster, but other objectives may be desired,
such as shorter code, or target code that consumes less power.

28 / 76
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler

Code Generation
– The code generator takes as input an intermediate representation
of the source program and maps it into the target language.
– If the target language is machine code, registers or memory
locations are selected for each of the variables used by the
program.
– Then, the intermediate instructions are translated into sequences
of machine instructions that perform the same task.
– A crucial aspect of code generation is the judicious assignment of
registers to hold variables.

29 / 76
2.0 A Simple Syntax-Directed Translator
Introduction

This section is an introduction to the compiling techniques that will be
discussed in the rest of the course.

Emphasis is on the front end of a compiler,
– in particular on lexical analysis, parsing, and intermediate code
generation.

30 / 76
2.0 A Simple Syntax-Directed Translator
Introduction

The syntax of a programming language describes the proper form of
its programs,

Semantics of the language defines what its programs mean;
– i.e. what each program does when it executes.

Syntax is specified using notations such as context-free grammars
or BNF (Backus-Naur Form).

It is difficult to specify the semantics of a language using these
notations.

Therefore, informal descriptions and suggestive examples are often
used.

31 / 76
Metalanguages: BNF

Backus–Naur Form or Backus Normal Form is a metalanguage which was
popularised by its use in the definition of the syntax of ALGOL 60.

A BNF specification consists of a set of rules, each rule defining a symbol (strictly a
non-terminal symbol) of the language.

To appreciate how it works, let’s take an example defined using a trivial language:


This set of rules can be used to generate random “sentences”.

32 / 76
Metalanguages: BNF

33 / 76
2.1 Syntax Definition

A grammar naturally describes the hierarchical structure of most
programming language constructs.

For example, an if-else statement can have the form:
if ( expression ) statement else statement

Using the variable expr to denote an expression and stmt to denote a
statement, this structuring rule can be expressed as:

stmt → if ( expr ) stmt else stmt

This is a production rule,
– where if, (, and ) are terminals, and expr and stmt are nonterminals.
– expr and stmt represent sequences of terminals.

34 / 76
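To make the idea concrete (a sketch that is not part of the slides), a production like the one above can be stored as data inside a compiler; here, purely hypothetically, nonterminal names are written in uppercase and everything else is a terminal.

# Hypothetical in-memory form of the production
#   stmt -> if ( expr ) stmt else stmt
# Uppercase names stand for nonterminals; the rest are terminals.
PRODUCTIONS = {
    "STMT": [["if", "(", "EXPR", ")", "STMT", "else", "STMT"]],
    "EXPR": [["id"]],   # placeholder body so the example is self-contained
}

def symbols(grammar):
    # Split the grammar's symbols into nonterminals (production heads) and terminals.
    nonterminals = set(grammar)
    terminals = {sym for bodies in grammar.values()
                 for body in bodies for sym in body} - nonterminals
    return nonterminals, terminals

print(symbols(PRODUCTIONS))
# nonterminals: STMT, EXPR; terminals: if ( ) else id (set order may vary)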
2.1 Syntax Definition
Definition of Grammars

A context-free grammar has four components:
i. A set of terminal symbols, sometimes referred to as tokens.

They are the elementary symbols of the language defined by the grammar.
ii. A set of nonterminals, sometimes called syntactic variables.

Each nonterminal represents a set of strings of terminals.
iii. A set of productions ,

which consists of a nonterminal as the head or left side, an arrow, and a
sequence of terminals and/or nonterminals, called the body or right side.
iv. A designation of one of the nonterminals as the start symbol.

35 / 76
2.1 Syntax Definition
Definition of Grammars

Grammar G for a language L={9-5+2, 3-1, ...}
G=(N,T,P,S)
N={list,digit}
T={0,1,2,3,4,5,6,7,8,9,-,+}
P:
list → list + digit | list - digit | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Equivalently it can be defined as:

S = list

36 / 76
2.1 Syntax Definition
Derivations

A grammar derives strings by beginning with the start symbol and
repeatedly replacing a nonterminal by the body of a production for
that nonterminal.

The terminal strings that can be derived from the start symbol form
the language defined by the grammar.
– For example, we can deduce that 9-5+2 is a list as follows:

9 is a list by production, since 9 is a digit.

9-5 is a list by production, since 9 is a list and 5 is a digit.

9-5+2 is a list by production, since 9-5 is a list and 2 is a digit.

37 / 76
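As an aside (a sketch, assuming the productions implied by the derivation above: list → list + digit | list - digit | digit and digit → 0 | 1 | ... | 9), a simple left-to-right scan can decide membership in the language.

def is_list(s):
    # Recognise strings derivable from `list` in the grammar
    #   list  -> list + digit | list - digit | digit
    #   digit -> 0 | 1 | ... | 9
    # i.e. single digits separated by + or -, such as 9-5+2.
    if not s or not s[0].isdigit():
        return False
    i = 1
    while i < len(s):
        # each further step must consume an operator followed by a digit
        if s[i] not in "+-" or i + 1 >= len(s) or not s[i + 1].isdigit():
            return False
        i += 2
    return True

print(is_list("9-5+2"), is_list("3-1"), is_list("9-"), is_list("95"))
# True True False False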
Metalanguages: BNF

E.g.


These rules define the syntax of simple arithmetic expressions using integer
values, the operators +, -, * and /, together with parentheses for grouping.

Expressions such as 1+2*3, 2+3+4/5 and 2+3*(44-567) can be generated via
this set of BNF rules.

Let’s consider the generation or derivation of the expression 1+2*3 using
this set of rules

38 / 76
2.1 Syntax Definition
Parsing/Parse Tree

Parsing is the task of taking a string of terminals and figuring out how
to derive it from the start symbol of the grammar.

If it cannot be derived from the start symbol of the grammar, then a
syntax error is reported for the string.

A parse tree pictorially shows how the start symbol of a grammar
derives a string in the language.

Nodes:
– root: the start symbol
– internal nodes: nonterminals
– leaf nodes: terminals

39 / 76
2.1 Syntax Definition
Parsing/Parse Tree

E.g. Parse tree for 9-5+2 according to the grammar in slide 36 is as
follows:

Grammar Parse Tree

40 / 76
2.1 Syntax Definition
Ambiguity

A grammar is said to be ambiguous if the grammar has more than one
parse tree for a given string of tokens.

E.g. consider a grammar G that cannot distinguish between lists and
digits as above.


Two parse trees can be generated for 9-5+2: (9-5)+2 and 9-(5+2)

41 / 76
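The two parse trees are not a mere formality; assuming each tree is evaluated according to its grouping, they give different answers for the same string, which is why ambiguity must be resolved.

# The two groupings that the ambiguous grammar allows for 9-5+2.
left_grouping = (9 - 5) + 2    # tree that groups 9-5 first
right_grouping = 9 - (5 + 2)   # tree that groups 5+2 first
print(left_grouping, right_grouping)   # 6 2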
2.1 Syntax Definition
Associativity of Operators

Conventions are needed for deciding which operator applies to an operand
that has operators to its left and right.

An operator associates to the left, when an operand with that operator on
both sides of it belongs to the operator to its left.
– e.g. 9+5+2 ≡ (9+5)+2

In most programming languages the four arithmetic operators, addition,
subtraction, multiplication, and division are left-associative.

Other operators such as exponentiation and assignment operators are
right-associative.
– a=b=c ≡ a=(b=c)

42 / 76
2.1 Syntax Definition
Associativity of Operators

Grammars and parse trees for left- and right-associative operators

43 / 76
2.1 Syntax Definition
Precedence of operators

An operator (*) has higher precedence than another operator (+) if it
takes its operands before the other operator (+) does.
– e.g. 9+5*2 ≡ 9+(5*2), 9*5+2 ≡ (9*5)+2
– left-associative operators: +, -, *, /
– right-associative operators: =, **

How does this play out in our language’s grammar?

Let’s consider the example below.

44 / 76
2.1 Syntax Definition
Precedence of operators

Consider the four arithmetic operators. We want to create the grammar below:


Consider nonterminals expr and term for the two levels of precedence,

Another nonterminal, factor, is used for generating the basic units in expressions.

Presently, the basic units in expressions are digits and parenthesized
expressions.

45 / 76
2.1 Syntax Definition
Precedence of operators

Now consider the binary operators * and /, which have the highest
precedence.

46 / 76
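The full grammar is shown in the figures; assuming it is the usual two-level layering (expr → expr + term | expr - term | term, term → term * factor | term / factor | factor, factor → digit | ( expr )), the sketch below shows how a recursive-descent evaluator mirrors that layering, with the left-recursive productions realised as loops.

def evaluate(s):
    # One function per precedence level, assuming the grammar
    #   expr   -> expr + term | expr - term | term
    #   term   -> term * factor | term / factor | factor
    #   factor -> digit | ( expr )
    # Left recursion is realised as iteration; input contains no spaces.
    pos = [0]

    def peek():
        return s[pos[0]] if pos[0] < len(s) else ""

    def factor():
        if peek() == "(":
            pos[0] += 1          # consume '('
            value = expr()
            pos[0] += 1          # consume ')'
            return value
        value = int(peek())      # a single digit
        pos[0] += 1
        return value

    def term():
        value = factor()
        while peek() in ("*", "/"):
            op = peek()
            pos[0] += 1
            value = value * factor() if op == "*" else value / factor()
        return value

    def expr():
        value = term()
        while peek() in ("+", "-"):
            op = peek()
            pos[0] += 1
            value = value + term() if op == "+" else value - term()
        return value

    return expr()

print(evaluate("9+5*2"), evaluate("(9+5)*2"), evaluate("9-5+2"))   # 19 28 6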
2.1 Syntax Definition
Precedence of operators

47 / 76
48 / 76
3.0 Lexical Analysis
Overview of Lexical Analysis

To identify the tokens we need some method of describing the possible
tokens that can appear in the input stream.

For this purpose regular expressions are often used,
– a notation that can be used to describe essentially all the tokens of a
programming language.

Secondly, having decided what the tokens are, we need some
mechanism to recognize these in the input stream.
– This is done by the token recognizers, which are designed using
transition diagrams and finite automata.

49 / 76
3.0 Lexical Analysis
Role of Lexical Analyzer

The LA is the first phase of a compiler.

Its main task is to read the input characters from the source code and
produce as output a sequence of tokens that the parser uses for
syntax analysis.

50 / 76
3.0 Lexical Analysis
Role of Lexical Analyzer

It may also enter lexemes such as identifiers into the symbol table
and also read from it to determine the proper token it must pass to
the parser.

It may perform other tasks such as:
– stripping out comments and whitespace (blank, newline, tab, and
other characters that are used to separate tokens in the input).
– Another task is correlating error messages generated by the
compiler with the source program.

51 / 76
3.0 Lexical Analysis
Role of Lexical Analyzer

Lexical analyzers may be divided into a cascade of two processes:
– Scanning: consists of the simple processes that do not require
tokenization of the input, such as deletion of comments and
compaction of consecutive whitespace characters into one.
– Lexical analysis proper is the more complex portion, which produces
tokens from the output of the scanner.

52 / 76
3.0 Lexical Analysis
Why Separate Lexical Analysis & Syntactic Analysis (Parsing)?

Simplicity of design: Separation allows simplification of each task.
– E.g. a parser that deals with comments and whitespace as syntactic units
would be considerably more complex than one that can assume comments and
whitespace have already been removed by the lexical analyzer.

Improved Compiler Efficiency:
– A separate lexical analyzer allows us to apply specialized techniques that serve
only the lexical task, not the job of parsing.
– In addition, specialized buffering techniques for reading input characters can
speed up the compiler significantly.

Compiler portability is enhanced.
– Input-device-specific peculiarities can be restricted to the lexical analyzer.

53 / 76
3.0 Lexical Analysis
Tokens, Patterns, and Lexemes

Three related but distinct terms are often used:
– Token: A token is a group of characters having collective meaning:

typically a word or punctuation mark, separated by a lexical analyzer and
passed to a parser.

The token names are the input symbols that the parser processes.
– Pattern: A rule that describes the set of strings associated with a token.

Often expressed as a regular expression and describing how a particular
token can be formed.
– A lexeme is a sequence of characters in the source program that
matches the pattern for a token and is identified by the lexical
analyzer as an instance of that token.

54 / 76
3.0 Lexical Analysis
Tokens, Patterns, and Lexemes


Note: In the Pascal statement:
const pi = 3.1416

the substring pi is a lexeme for the token “identifier”
55 / 76
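A small sketch of the distinction (the patterns below are assumptions for illustration only): the token is the name, the pattern is a regular expression, and the lexemes are the substrings of const pi = 3.1416 that match it.

import re

# Assumed patterns for two token names.
PATTERNS = {
    "id":  r"[A-Za-z][A-Za-z0-9]*",   # pattern for the token "identifier"
    "num": r"\d+(?:\.\d+)?",          # pattern for the token "number"
}

statement = "const pi = 3.1416"
for token_name, pattern in PATTERNS.items():
    lexemes = [m.group(0) for m in re.finditer(pattern, statement)]
    print(token_name, "->", lexemes)
# id -> ['const', 'pi']
# num -> ['3.1416']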
3.0 Lexical Analysis
Tokens, Patterns, and Lexemes

When more than one pattern matches a lexeme, the lexical
analyzer must provide additional information about the particular
lexeme.
– It is essential for the code generator to know what string was actually matched.

The lexical analyzer collects information about tokens into their
associated attributes.

In practice, a token usually has only a single attribute: a pointer
to the symbol-table entry in which the information about the token
is kept, such as:
– the lexeme, the line number on which it was first seen, etc.

56 / 76
3.0 Lexical Analysis
Tokens, Patterns, and Lexemes

Example: The token names and associated attribute values for
the Fortran statement are written as a sequence of pairs

57 / 76
3.0 Lexical Analysis
Lexical Errors

Lexical analyzers mostly cannot detect errors without the aid of
other components.

For instance, if the string fi is encountered for the first time in a C
program in the context: fi ( a == f(x)) …

It cannot tell whether fi is a misspelling of the keyword if or an
undeclared function identifier.

In general, when an error is found, the lexical analyzer stops (but
other actions are also possible).

The simplest recovery strategy is “panic mode" recovery.

58 / 76
3.0 Lexical Analysis
Assignment

Divide the following C++ program into appropriate lexemes.

Which lexemes should get associated lexical values?

What should those values be?

59 / 76
3.0 Lexical Analysis
Stages of a lexical analyzer

Scanner
– Based on a finite state machine.
– If it lands on an accepting state, it takes note of the type and
position of the acceptance, and continues.
– When the lexical analyzer lands on a dead state, it is done.

The last accepting state is the one that represents the type
and length of the longest valid lexeme.
– The "extra" invalid character is "returned" to the input
buffer.

60 / 76
3.0 Lexical Analysis
Stages of a lexical analyzer

Evaluator
– Goes over the characters of the lexeme to produce a value.
– The lexeme’s type combined with its value is what properly
constitutes a token, which can be given to a parser.
– Some tokens such as parentheses do not really have values, and so
the evaluator function for these can return nothing.
– The evaluators for integers, identifiers, and strings can be
considerably more complex.
– Sometimes evaluators can suppress a lexeme entirely, concealing
it from the parser, which is useful for whitespace and comments.

61 / 76
3.0 Lexical Analysis
Stages of a lexical analyzer

Example

62 / 76
3.0 Lexical Analysis
Input Buffering

The LA scans the characters of the source program one at a time
to discover tokens.

Often, many characters beyond a token may have to be examined
before the token itself can be determined.

For instance, we cannot be sure we've seen the end of an
identifier until we see a character that is not a letter or digit or
underscore.

Also, in C, single-character operators like - , = , or < could also be
the beginning of a two-character operator like -> , == , or <= .

63 / 76
3.0 Lexical Analysis
Input Buffering

For this and other reasons, it is desirable for the lexical analyzer to
read its input from an input buffer.

Because a large amount of time can be consumed scanning
characters,
– specialized buffering techniques have been developed to
reduce the amount of overhead required to process an input
character.

Buffering techniques:
– Buffer pairs
– Sentinels

64 / 76
3.0 Lexical Analysis
Input Buffering: Buffer pairs

Each buffer is of the same size N,
– N is usually the size of a disk block, e.g., 4096 bytes.

Using one system read command N characters can be read into a
buffer, rather than using one system call per character.

If fewer than N characters remain in the input file,
– then a special character, represented by eof, marks the end of
the source.

65 / 76
3.0 Lexical Analysis
Input Buffering: Buffer pairs


Two pointers to the input are maintained:
– Pointer lexemeBegin , marks the beginning of the current lexeme.
– Pointer forward scans ahead until a pattern match is found.

Once the next lexeme is determined, forward is set to the character at
its right end.

After recording the lexeme, lexemeBegin is set to the character
immediately after the lexeme just found.
66 / 76
3.0 Lexical Analysis
Input Buffering: Sentinels

Checks are made each time we advance forward, to ensure that we have
not moved off one of the buffers;
– if so, then the other buffer must be reloaded.

Thus, for each character read, two tests are made:
– one for the end of the buffer using a sentinel,
– and one to determine what character is read.

The eof character can serve both as the sentinel and as the end-of-file marker.

67 / 76
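Below is a sketch of the sentinel idea (the buffer size, the eof character, and the way blocks are pre-cut instead of two buffers being reloaded in place are all simplifying assumptions): each buffer ends in an eof sentinel, so advancing forward costs a single comparison in the common case, and the buffer-boundary logic runs only when the sentinel is actually hit.

EOF = "\0"   # assumed sentinel character; also marks the real end of input

def buffers_for(source, n=8):
    # Cut the input into blocks of N characters, each terminated by the sentinel.
    # (A real lexer would reload two fixed buffers instead of pre-cutting.)
    blocks = [source[i:i + n] for i in range(0, len(source), n)] or [""]
    return [block + EOF for block in blocks]

def scan(source, n=8):
    buffers = buffers_for(source, n)
    out, buf, pos = [], 0, 0
    while True:
        ch = buffers[buf][pos]
        if ch != EOF:                  # common case: one comparison per character
            out.append(ch)
            pos += 1
        elif buf + 1 < len(buffers):   # sentinel at a buffer boundary: switch buffer
            buf, pos = buf + 1, 0
        else:                          # sentinel with no buffer left: end of input
            return "".join(out)

print(scan("position = initial + rate * 60"))   # position = initial + rate * 60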
3.0 Lexical Analysis
Regular Expressions

Suppose we wanted to describe the set of valid C identifiers.

It could be described as:


The vertical bar above means union, the parentheses are used to group
sub-expressions, the star means zero or more occurrences of, and the
juxtaposition of letter_ with the remainder of the expression signifies
concatenation.


A regular expression is a formula that describes a set of strings.

68 / 76
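The same description, written as a practical regular expression (a sketch using Python's re module, with character classes standing in for the letter_ and digit alternations):

import re

# letter_ ( letter_ | digit )*  written with character classes
c_identifier = re.compile(r"[A-Za-z_][A-Za-z_0-9]*\Z")

for candidate in ["rate", "_tmp1", "2fast", "init_value", "x-y"]:
    print(candidate, bool(c_identifier.match(candidate)))
# rate True, _tmp1 True, 2fast False, init_value True, x-y False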
3.0 Lexical Analysis
Regular Expressions

Components of a regular expression:

69 / 76
3.0 Lexical Analysis
Regular Expressions

Here are the rules that define the regular expressions over an alphabet Σ.
– ϵ is a regular expression denoting { ϵ }, that is, the language
containing only the empty string.
– For each ‘a’ in Σ, a is a regular expression denoting { a }, the language
with only one string, consisting of the single symbol ‘a’.
– If R and S are regular expressions,

then (R) | (S) denotes L(R) ∪ L(S)

R.S denotes L(R).L(S)

R* denotes (L(R))*

70 / 76
3.0 Lexical Analysis
Regular Definitions

For notational convenience, names may be given to certain regular
expressions and used in subsequent expressions, as if the names were
themselves symbols.

E.g. C identifiers are strings of letters, digits, and underscores.
– Here is a regular definition for the language of C identifiers.

71 / 76
3.0 Lexical Analysis
Assignment: Extensions of Regular Expressions

Discuss what extensions of regular expressions are and explain how the
regular definitions:


and can be redefined as

72 / 76
3.0 Lexical Analysis
Recognition of Tokens

We learned how to express patterns using regular expressions.

Next we will consider how to take the patterns for all the needed tokens
and build a piece of code that examines the input string and finds a prefix
that is a lexeme matching one of the patterns.

The following example of a grammar for branching statements and
conditional expressions will be used.

73 / 76
3.0 Lexical Analysis
Recognition of Tokens

For relop, comparison operators such as =, <, >, >=, and <= are considered.

The terminals of the grammar, which are if , then , else, relop , id , and
number, are the names of tokens as far as the lexical analyzer is
concerned.

The patterns for these tokens are described using regular definitions as
below:

74 / 76
3.0 Lexical Analysis
Recognition of Tokens

For this language, the lexical analyzer will recognize the keywords if,
then, and else ,
– as well as lexemes that match the patterns for relop, id, and
number.

The lexical analyzer is assigned the job of stripping out whitespace, by
recognizing the "token" ws defined by:

blank, tab, and newline are abstract symbols.


75 / 76
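Putting the pieces together, here is a sketch of a recognizer for this token set; the exact regular definitions are given in the figure above, so the patterns below (and the choice to drop ws rather than pass it on) are assumptions in the spirit of the slide.

import re

KEYWORDS = {"if", "then", "else"}

# Assumed patterns for the tokens of the branching-statement language.
TOKEN_SPEC = [
    ("ws",     r"[ \t\n]+"),                        # blank, tab, newline
    ("number", r"\d+(?:\.\d+)?(?:E[+-]?\d+)?"),
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("relop",  r"<=|>=|<>|<|>|="),
]

def tokens(source):
    pattern = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))
    for match in pattern.finditer(source):
        name, lexeme = match.lastgroup, match.group(0)
        if name == "ws":
            continue                                # ws is stripped, not passed to the parser
        if name == "id" and lexeme in KEYWORDS:
            yield (lexeme, None)                    # keywords are their own token names
        else:
            yield (name, lexeme)

print(list(tokens("if a1 <= 60 then b else c")))
# [('if', None), ('id', 'a1'), ('relop', '<='), ('number', '60'),
#  ('then', None), ('id', 'b'), ('else', None), ('id', 'c')]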
76 / 76
