CD Lect5

Compiler Design
Assoc. Prof. Ahmed Moustafa Elmahalawy
Communication and Computer Engineering Department
Lecture Charter

Punctuality - Participation - Defining the goal - Silencing mobile phones - Mutual respect


Chapter 3
Lexical Analysis
Compiler Design Chapter 3: Lexical Analysis

Contents:-
1- The Role of the Lexical Analyzer
2- Lexical Analysis Versus Parsing
3- Tokens, Patterns, and Lexemes
4- Attributes for Tokens
5- Example of Lexical Analysis, Tokens, Non-Tokens
6- Lexical Errors
7- Advantages and Disadvantages of Lexical Analysis
8- Summary

A lexical analyzer reads characters from the input and groups them into "token objects." Along with a terminal symbol that is used for parsing decisions, a token object carries additional information in the form of attribute values.

A sequence of input characters that comprises a single token is called a lexeme. Thus, we can say that the lexical analyzer insulates the parser from the lexeme representation of tokens.

1- The Role of the Lexical Analyzer

As the first phase of a compiler, the main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program.

It is common for the lexical analyzer to interact with the symbol table as well. When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.

These interactions are suggested in Fig. 3.1.

Figure 3.1: Interactions between the lexical analyzer and the parser

Sometimes, lexical analyzers are divided into a cascade of two processes:

a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.

b) Lexical analysis proper is the more complex portion, where the scanner's output is turned into the sequence of tokens.

2- Lexical Analysis Versus Parsing

There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing (syntax analysis) phases:

1. Simplicity of design is the most important consideration.
2. Compiler efficiency is improved.
3. Compiler portability is enhanced.

Lexical Analyser                        Parser
Scans the input program                 Performs syntax analysis
Identifies tokens                       Creates an abstract representation of the code
Inserts tokens into the symbol table    Updates symbol table entries
Generates lexical errors                Generates a parse tree of the source code

3- Tokens, Patterns, and Lexemes

When discussing lexical analysis, we use three related but distinct terms:
- A token is a pair consisting of a token name and an optional attribute value.
- A pattern is a description of the form that the lexemes of a token may take.
- A lexeme is a sequence of characters in the source program that matches the pattern for a token; it is nothing but an instance of a token.
Compiler Design Chapter 2: Model of a Compiler

The lexical analyser or scanner is the section that fuses characters of the source text into groups that logically make up the tokens of the language - symbols like identifiers, strings, numeric constants, keywords like while and if, operators like <=, and so on.

The source program is input to a lexical analyzer or scanner whose purpose is to separate the incoming text into pieces or tokens such as constants, variable names, keywords (such as DO, IF, and THEN in PL/I), and operators.

For efficiency reasons, each class of tokens is given a unique internal representation number (ID).

For example, a variable name may be given a representation number of 1, a constant a value of 2, a label the number 3, the addition operator (+) a value of 4, etc.

WHILE A > 3 * B DO A := A - 1 END

easily decodes into lexemes, so the tokens are:

<WHILE>
<A>
<>>
<3>
<*>
<B>
<DO>
<A>
<:=>
<A>
<->
<1>
<END>

as we read it from left to right, but the Fortran statement

10 DO 20 I = 1.30

is more deceptive.

Readers familiar with Fortran might see it as decoding into the tokens of a DO-loop header (DO, the label 20, the loop index I, and its bounds), while those who enjoy perversity might like to see it as it really is: since Fortran ignores blanks and the separator here is a period rather than a comma, the statement is an assignment of the constant 1.30 to a variable named DO20I.

Note that in scanning the source statement and generating the representation number of each token we have ignored spaces (or blanks) in the statement.

Some scanners place constants, labels, and variable names in appropriate tables.

The lexical analyzer supplies tokens to the syntax analyzer. These tokens may take the form of a pair of items:

- The first item gives the address or location of the token in some symbol table.
- The second item is the representation number of the token.

Example 3.1: Figure 3.2 gives some typical tokens, their informally described patterns, and some sample lexemes.

To see how these concepts are used in practice, consider the C statement

printf("Total = %d\n", score);

Both printf and score are lexemes matching the pattern for token id, and "Total = %d\n" is a lexeme matching literal.

Figure 3.2: Examples of tokens



In many programming languages, the following classes cover most or all of the tokens:

1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes such as the token comparison.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.

4- Attributes for Tokens

When more than one lexeme can match a pattern, the lexical analyzer must provide the subsequent compiler phases additional information about the particular lexeme that matched.

Thus, the lexical analyzer returns to the parser not only a token name, but an attribute value that describes the lexeme represented by the token; the token name influences parsing decisions, while the attribute value influences translation of tokens after the parse.

We shall assume that tokens have at most one associated attribute, although this attribute may have a structure that combines several pieces of information.

The most important example is the token id, where we need to associate with the token a great deal of information.

Normally, information about an identifier - e.g., its lexeme, its type, and the location at which it is first found (in case an error message about that identifier must be issued) - is kept in the symbol table.

Example 3.2: The token names and associated attribute values for the Fortran statement

E = M * C ** 2

are written below as a sequence of pairs:

<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>

Since the lexical analyzer is the part of the compiler that reads the source text, it may perform certain other tasks besides identification of lexemes, such as:

1- Removal of White Space and Comments
2- Reading Ahead
3- Constants
4- Recognizing Keywords and Identifiers

The lexical analyzer solves two problems by using a table to hold character strings:

- Single Representation. A string table can insulate the rest of the compiler from the representation of strings, since the phases of the compiler can work with references or pointers to the string in the table.

- Reserved Words. Reserved words can be implemented by initializing the string table with the reserved strings and their tokens.

5- Example of Lexical Analysis, Tokens, Non-Tokens

Consider the following code that is fed to the Lexical Analyzer:

#include <stdio.h>
int maximum(int x, int y) {
    // This will compare 2 numbers
    if (x > y)
        return x;
    else {
        return y;
    }
}

Examples of Tokens created:

Lexeme     Token
int        Keyword
maximum    Identifier
(          Operator
int        Keyword
x          Identifier
,          Operator
int        Keyword
y          Identifier
)          Operator
{          Operator
if         Keyword

Examples of Nontokens:

Type                      Examples
Comment                   // This will compare 2 numbers
Pre-processor directive   #include <stdio.h>
Pre-processor directive   #define NUMS 8,9
Macro                     NUMS
Whitespace                \n \b \t

6- Lexical Errors

A character sequence which is not possible to scan into any valid token is a lexical error. Important facts about lexical errors:
• Lexical errors are not very common, but they should be managed by the scanner
• Misspellings of identifiers, operators, and keywords are considered lexical errors
• Generally, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token

* Error Recovery in Lexical Analyzer

Here are a few of the most common error recovery techniques:
• Remove one character from the remaining input
• In panic mode, successive characters are ignored until we reach a well-formed token
• Insert the missing character into the remaining input
• Replace a character with another character
• Transpose two adjacent characters

7- Advantages of Lexical Analysis

• The lexical analyzer method is used by programs like compilers, which can use the parsed data from a programmer's code to create compiled binary executable code
• It is used by web browsers to format and display a web page with the help of parsed data from JavaScript, HTML, and CSS
• A separate lexical analyzer helps you to construct a specialized and potentially more efficient processor for the task

* Disadvantages of Lexical Analysis

• You need to spend significant time reading the source program and partitioning it into tokens
• Some regular expressions are quite difficult to understand compared to PEG or EBNF rules
• More effort is needed to develop and debug the lexer and its token descriptions
• Additional runtime overhead is required to generate the lexer tables and construct the tokens

8- Summary

• Lexical analysis is the very first phase in compiler design
• A lexeme is a sequence of characters in the source program that matches the pattern of a token
• The lexical analyzer is implemented to scan the entire source code of the program

• The lexical analyzer helps to enter identifier tokens into the symbol table
• A character sequence which is not possible to scan into any valid token is a lexical error
• Removing one character from the remaining input is a useful error recovery method
• The lexical analyser scans the input program while the parser performs syntax analysis

• It eases the processes of lexical analysis and syntax analysis by eliminating unwanted tokens
• The lexical analyzer is used by web browsers to format and display a web page with the help of parsed data from JavaScript, HTML, and CSS
• The biggest drawback of using a lexical analyzer is the additional runtime overhead required to generate the lexer tables and construct the tokens