CD Lect5

Compiler Design
Assoc. Prof. Ahmed Moustafa Elmahalawy
Communication and Computer Engineering Department
Lecture Charter

Punctuality - Participation - Defining the goal - Silencing mobile phones - Mutual respect


Chapter 3
Lexical Analysis
Compiler Design Chapter 3: Lexical Analysis

Contents:-
1- The Role of the Lexical Analyzer
2- Lexical Analysis Versus Parsing
3- Tokens, Patterns, and Lexemes
4- Attributes for Tokens
5- Example of Lexical Analysis, Tokens, Non-Tokens
6- Lexical Errors
7- Advantages and Disadvantages of Lexical Analysis
8- Summary

A lexical analyzer reads characters from the input and groups them into "token objects." Along with a terminal symbol that is used for parsing decisions, a token object carries additional information in the form of attribute values.

A sequence of input characters that comprises a single token is called a lexeme. Thus, we can say that the lexical analyzer insulates the parser from the lexeme representation of tokens.

1- The Role of the Lexical Analyzer

As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program.

It is common for the lexical analyzer to interact with the symbol table as well. When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.

These interactions are suggested in Fig. 3.1.

Figure 3.1: Interactions between the lexical analyzer and the parser

Sometimes, lexical analyzers are divided into a cascade of two processes:

a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.

b) Lexical analysis proper is the more complex portion, where the scanner's output is turned into the sequence of tokens.

2- Lexical Analysis Versus Parsing

There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing (syntax analysis) phases:

1. Simplicity of design is the most important consideration.
2. Compiler efficiency is improved.
3. Compiler portability is enhanced.

Lexical Analyser                        Parser
Scans the input program                 Performs syntax analysis
Identifies tokens                       Creates an abstract representation of the code
Inserts tokens into the symbol table    Updates symbol table entries
Generates lexical errors                Generates a parse tree of the source code

3- Tokens, Patterns, and Lexemes

When discussing lexical analysis, we use three related but distinct terms:
- A token is a pair consisting of a token name and an optional attribute value.
- A pattern is a description of the form that the lexemes of a token may take.
- A lexeme is a sequence of characters in the source program that matches the pattern for a token; it is nothing but an instance of a token.
Compiler Design Chapter 2: Model of a Compiler

The lexical analyser or scanner is the section that fuses characters of the source text into groups that logically make up the tokens of the language - symbols like identifiers, strings, numeric constants, keywords like while and if, operators like <=, and so on.

The source program is input to a lexical analyzer or scanner whose purpose is to separate the incoming text into pieces or tokens such as constants, variable names, keywords (such as DO, IF, and THEN in PL/I), and operators.

For efficiency reasons, each class of tokens is given a unique internal representation number (ID).

For example, a variable name may be given a representation number of 1, a constant a value of 2, a label the number 3, the addition operator (+) a value of 4, etc.

WHILE A > 3 * B DO A := A - 1 END

easily decodes into lexemes, so the tokens are:

<WHILE>
<A>
<>>
<3>
<*>
<B>
<DO>
<A>
<:=>
<A>
<->
<1>
<END>

as we read it from left to right, but the Fortran statement

10 DO 20 I = 1.30

is more deceptive.

Readers familiar with Fortran might see it as decoding into the tokens of a DO-loop header (DO, the label 20, the loop index I, and its bounds), while those who enjoy perversity might like to see it as it really is: since Fortran ignores blanks and the separator here is a period rather than a comma, the statement is an assignment of the constant 1.30 to a variable named DO20I.

Note that in scanning the source statement and generating the representation number of each token we have ignored spaces (or blanks) in the statement.

Some scanners place constants, labels, and variable names in appropriate tables.

The lexical analyzer supplies tokens to the syntax analyzer. These tokens may take the form of a pair of items:

- The first item gives the address or location of the token in some symbol table.
- The second item is the representation number of the token.

Example 3.1: Figure 3.2 gives some typical tokens, their informally described patterns, and some sample lexemes.

To see how these concepts are used in practice, consider the C statement

printf("Total = %d\n", score);

Both printf and score are lexemes matching the pattern for token id, and "Total = %d\n" is a lexeme matching literal.

Figure 3.2: Examples of tokens



In many programming languages, the following classes cover most or all of the tokens:

1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes such as the token comparison.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.

4- Attributes for Tokens

When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched.

Thus, the lexical analyzer returns to the parser not only a token name, but an attribute value that describes the lexeme represented by the token; the token name influences parsing decisions, while the attribute value influences translation of tokens after the parse.

We shall assume that tokens have at most one associated attribute, although this attribute may have a structure that combines several pieces of information.

The most important example is the token id, where we need to associate with the token a great deal of information.

Normally, information about an identifier - e.g., its lexeme, its type, and the location at which it is first found (in case an error message about that identifier must be issued) - is kept in the symbol table.

Example 3.2: The token names and associated attribute values for the Fortran statement

E = M * C ** 2

are written below as a sequence of pairs:

<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>

Since the lexical analyzer is the part of the compiler that reads the source text, it may perform certain other tasks besides identification of lexemes, such as:

1- Removal of White Space and Comments
2- Reading Ahead
3- Constants
4- Recognizing Keywords and Identifiers

The lexical analyzer solves two problems by using a table to hold character strings:

- Single Representation. A string table can insulate the rest of the compiler from the representation of strings, since the phases of the compiler can work with references or pointers to the string in the table.

- Reserved Words. Reserved words can be implemented by initializing the string table with the reserved strings and their tokens.

5- Example of Lexical Analysis, Tokens, Non-Tokens

Consider the following code that is fed to the Lexical Analyzer:

#include <stdio.h>
int maximum(int x, int y) {
    // This will compare 2 numbers
    if (x > y)
        return x;
    else {
        return y;
    }
}

Examples of Tokens created:

Lexeme     Token
int        Keyword
maximum    Identifier
(          Operator
int        Keyword
x          Identifier
,          Operator
int        Keyword
y          Identifier
)          Operator
{          Operator
if         Keyword

Examples of Nontokens:

Type                      Examples
Comment                   // This will compare 2 numbers
Pre-processor directive   #include <stdio.h>
Pre-processor directive   #define NUMS 8,9
Macro                     NUMS
Whitespace                \n \b \t

6- Lexical Errors

A character sequence which is not possible to scan into any valid token is a lexical error. Important facts about lexical errors:
• Lexical errors are not very common, but they should be managed by the scanner
• Misspellings of identifiers, operators, and keywords are considered lexical errors
• Generally, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token

* Error Recovery in Lexical Analyzer

Here are a few of the most common error recovery techniques:
• Remove one character from the remaining input
• In panic mode, successive characters are ignored until we reach a well-formed token
• Insert the missing character into the remaining input
• Replace a character with another character
• Transpose two adjacent characters

7- Advantages of Lexical Analysis

• The lexical analyzer method is used by programs like compilers, which can use the parsed data from a programmer's code to create compiled binary executable code
• It is used by web browsers to format and display a web page with the help of parsed data from JavaScript, HTML, and CSS
• A separate lexical analyzer helps you to construct a specialized and potentially more efficient processor for the task

* Disadvantages of Lexical Analysis

• You need to spend significant time reading the source program and partitioning it into tokens
• Some regular expressions are quite difficult to understand compared to PEG or EBNF rules
• More effort is needed to develop and debug the lexer and its token descriptions
• Additional runtime overhead is required to generate the lexer tables and construct the tokens

8- Summary

• Lexical analysis is the very first phase in compiler design
• A lexeme is a sequence of characters in the source program that matches the pattern of a token
• The lexical analyzer is implemented to scan the entire source code of the program

• The lexical analyzer helps to enter identifier tokens into the symbol table
• A character sequence which is not possible to scan into any valid token is a lexical error
• Removing one character from the remaining input is a useful error recovery method
• The lexical analyser scans the input program while the parser performs syntax analysis

• It eases the processes of lexical analysis and syntax analysis by eliminating unwanted tokens
• The lexical analyzer is used by web browsers to format and display a web page with the help of parsed data from JavaScript, HTML, and CSS
• The biggest drawback of using a lexical analyzer is the additional runtime overhead required to generate the lexer tables and construct the tokens