Lexical Analysis

Chapter 2
Introduction
[Figure: Source Program → Lexical Analyzer (Scanner) → Tokens → Syntax Analyzer (Parser), with the scanner and parser connected to the Symbol Table Manager]
01/20/2024 CD 2020 2
• Lexical analysis is the first phase of the compiler
• A lexical analyzer is a pattern matcher for character strings
• It is a “front-end” for the parser
• Identifies substrings of the source program that belong together -
lexemes
• Lexemes match a character pattern, which is associated with a
lexical category called a token
• sum is a lexeme; its token may be IDENT
• The lexical analyzer also strips comments and whitespace (blank, tab,
and newline characters) from the source program
• It also correlates error messages from the compiler with the source
program
• The lexical analyzer is usually a function that is called by
the parser when it needs the next token
• Three approaches to building a lexical analyzer:
1. Write a formal description of the tokens and use a software
tool that constructs a table-driven lexical analyzer from that
description
• E.g. lex or flex
• Easiest to implement but least efficient
2. Design a state diagram that describes the tokens and write a
program that implements the state diagram
• Intermediate in both ease of implementation and efficiency
3. Design a state diagram that describes the tokens and hand-construct
a table-driven implementation of the state diagram
• Hardest to implement, but most efficient
Reasons to separate lexical and
syntax analysis
• Simplicity - less complex approaches can be used
for lexical analysis; separating them simplifies the
parser
• Efficiency - separation allows optimization of the
lexical analyzer
• Portability - parts of the lexical analyzer may not be
portable, but the parser is always portable
Formal Languages (Revision)
• A formal language is one that can be specified precisely and
is amenable for use with computers
• E.g. the syntax of the Java programming language
• A natural language, by contrast, is one that is normally spoken
by people
• E.g. Kiswahili, Arabic
• A formal language is a set of strings over a given alphabet
• E.g. 1. Alphabet: {0,1}
• {0, 10, 1011} is a finite language; {} is the empty (null) language, which is also finite
• {ε, 0, 00, 000, 0000, …} is an infinite language (ε is the null string)
• The set of all strings of zeros and ones having an even number of ones
• E.g. 2. Alphabet: characters on a computer keyboard
• {0, 10, 1011}, {ε}, Java syntax, English syntax
A formal language can be specified using a Finite State Machine or a
Regular Expression
Finite State Machine (FSM)
• It is a theoretical machine consisting of:
1. A finite set of states: one starting state and zero or more accepting
states
2. A state transition function:
• Takes two arguments: a state and an input symbol
• Returns a state as its result
• How the FSM works:
• The input is a string of symbols from the input alphabet
• The machine is initially in the starting state
• For each symbol read from the input string, the machine proceeds to a
new state as indicated by the transition function
• When the input is exhausted, the machine is either in an accepting state
(the input is accepted) or in a non-accepting state (the input is rejected)
• The set of all input strings accepted by the machine
forms a language
• An FSM can be represented by a state diagram, a table, or
other methods
• State Diagram Representation
[Figure: annotated state diagram]
• Double circle: represents an accepting state (final state)
• Starting state: marked by an arc with no state at its source
• Arc: represents the transition function
• Labels: represent inputs
• Table Representation
        Input symbols
          0    1
   *A     A    B
    B     B    A
• Each table entry is the next state
• The accepting state is marked by an asterisk
• The starting state is listed in the first row
• Example: strings containing an odd number of zeros
from input alphabet {0,1}
[Figure: state diagram representation]
Table representation:
          0    1
    A     B    A
   *B     A    B
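The table above can be run directly as a table-driven recognizer. The following is a minimal sketch, not a definitive implementation: the class name `OddZeros` and the encoding of states A and B as 0 and 1 are assumptions made for illustration.

```java
final class OddZeros {
    // Rows are states (0 = A, 1 = B); columns are the input symbols 0 and 1.
    static final int[][] NEXT = {
        {1, 0}, // state A: on 0 go to B, on 1 stay in A
        {0, 1}, // state B: on 0 go to A, on 1 stay in B
    };
    static final boolean[] ACCEPT = {false, true}; // only B is accepting

    static boolean accepts(String input) {
        int state = 0; // the starting state is A
        for (char c : input.toCharArray()) {
            if (c != '0' && c != '1') {
                return false; // symbol outside the alphabet {0,1}
            }
            state = NEXT[state][c - '0'];
        }
        return ACCEPT[state];
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0", "1101", "1001", "000"}) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

Here `accepts("1101")` is true (one zero) and `accepts("1001")` is false (two zeros).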
Regular Expression
• These are formulas or expressions consisting of
three possible operations on languages:
1. Union
2. Concatenation, and
3. Kleene star
1. Union: since a language is a set, this operation is
the union operation as defined in set theory
• E.g. {abc, ab, ba} + {ba, bb} = {abc, ab, ba, bb}
• Note: L + {} = L
2. Concatenation: the concatenation of two languages is the language
formed by concatenating each string in one language with each string
in the other language
• E.g. {ab, a, c} · {b, ε} = {ab · b, ab · ε, a · b, a · ε, c · b, c · ε} = {abb, ab, a, cb, c}
• In general, L1 · L2 ≠ L2 · L1
• L · {ε} = L
• L · {} = {}
3. Kleene star (Closure): if L is a language, we define L* as follows
• L^0 = {ε}
• L^1 = L
• L^2 = L · L^1
…
• L^n = L · L^(n-1)
• L* = L^0 + L^1 + L^2 + L^3 + …
• Note: {}* = {ε}
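Union and concatenation of finite languages can be tried out with ordinary sets of strings. This is a small sketch under stated assumptions: the class `LanguageOps` and its method names are hypothetical, and the empty Java string `""` stands in for the null string ε.

```java
import java.util.Set;
import java.util.TreeSet;

final class LanguageOps {
    // Union: plain set union of the two languages.
    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // Concatenation: every string of a followed by every string of b.
    static Set<String> concat(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>();
        for (String x : a) {
            for (String y : b) {
                result.add(x + y);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(union(Set.of("abc", "ab", "ba"), Set.of("ba", "bb")));
        System.out.println(concat(Set.of("ab", "a", "c"), Set.of("b", "")));
    }
}
```

Concatenating with the empty language `Set.of()` yields the empty language, and concatenating with `Set.of("")` returns the original language, matching the two identities listed for concatenation.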
• E.g. A regular expression specifying the language
containing all strings of zeros and ones: (0+1)*
• To understand what strings are in this
language, let L = {0,1}. We need to find L*:
• L^0 = {ε}
• L^1 = {0, 1}
• L^2 = L · L^1 = {00, 01, 10, 11}
• L^3 = L · L^2 = {000, 001, 010, 011, 100, 101, 110, 111}
...
• L* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101,
110, 111, 0000, ...}
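Since L* is infinite, a program can only enumerate it up to some bound. The sketch below computes the powers of L one concatenation at a time, following the definition of closure; the class `KleenePowers` and the cut-off parameter `n` are assumptions for illustration, and `""` again stands in for the null string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class KleenePowers {
    // Every string of a followed by every string of b.
    static Set<String> concat(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>();
        for (String x : a) {
            for (String y : b) {
                result.add(x + y);
            }
        }
        return result;
    }

    // Returns the list of powers of l, from power 0 up to power n.
    static List<Set<String>> powers(Set<String> l, int n) {
        List<Set<String>> result = new ArrayList<>();
        Set<String> current = new TreeSet<>(Set.of("")); // power 0 holds only the null string
        result.add(current);
        for (int i = 1; i <= n; i++) {
            current = concat(l, current); // power i is l concatenated with power i-1
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(powers(Set.of("0", "1"), 3));
    }
}
```

The union of the returned sets is the part of L* containing strings built from at most `n` members of L.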
Input Buffering
• Scanner performance is crucial:
• This is the only part of the compiler that examines the
entire input program one character at a time
• Disk input can be slow
• The scanner accounts for ~25-30% of total compile time
• We need lookahead to determine when a match
has been found
• Scanners use double-buffering to minimize the
overheads associated with this
Buffer Pairs
• Use two N-byte buffers (N = size of a disk block; typically, N
= 1024 or 4096).
• Read N bytes into one half of the buffer each time. If fewer
than N bytes of input remain, put a special EOF marker in the
buffer after the last byte read.
• When one buffer has been processed, read N bytes into the
other buffer (“circular buffers”).
Buffer Pairs cont’d
Code:
if (fwd at end of first half) {
    reload second half;
    set fwd to point to beginning of second half;
} else if (fwd at end of second half) {
    reload first half;
    set fwd to point to beginning of first half;
} else {
    fwd++;
}
• It takes two tests for each advance of the fwd pointer

Buffer Pairs With Sentinels
• Objective: Optimize the common case by reducing the
number of tests to one per advance of fwd
• Idea: Extend each buffer half to hold a sentinel at the
end
• This is a special character that cannot occur in a
program (e.g., EOF)
• It signals the need for some special action (fill other
buffer-half, or terminate processing)
Buffer Pairs With Sentinels cont’d
Code:
fwd++;
if (*fwd == EOF) { /* special processing needed */
    if (fwd at end of first half)
        . . .
    else if (fwd at end of second half)
        . . .
    else /* end of input */
        terminate processing;
}
• Common case now needs just a single test per character
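The sentinel scheme can be sketched in Java roughly as follows. This is an illustration under stated assumptions rather than a production scanner: the class `SentinelBuffer`, its methods, and the choice of the NUL character as sentinel (assumed never to occur in source text) are all hypothetical, and the lexeme-start pointer a real scanner also tracks is left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class SentinelBuffer {
    static final char SENTINEL = '\0'; // assumption: never occurs in source text
    private final Reader in;
    private final int n;      // capacity of each half
    private final char[] buf; // half 1 = [0..n-1], sentinel at n, half 2 = [n+1..2n], sentinel at 2n+1
    private int fwd = -1;     // the first advance lands on index 0

    SentinelBuffer(Reader in, int n) throws IOException {
        this.in = in;
        this.n = n;
        this.buf = new char[2 * n + 2];
        buf[n] = SENTINEL;
        buf[2 * n + 1] = SENTINEL;
        fill(0);
    }

    // Loads up to n characters at start; a short read is closed off with a sentinel.
    private void fill(int start) throws IOException {
        int count = 0;
        while (count < n) {
            int r = in.read(buf, start + count, n - count);
            if (r < 0) break;
            count += r;
        }
        if (count < n) buf[start + count] = SENTINEL; // end of input inside this half
    }

    // Returns the next character, or -1 at end of input.
    int next() throws IOException {
        fwd++;
        if (buf[fwd] == SENTINEL) {        // the only test in the common case
            if (fwd == n) {                // end of first half: reload the second
                fill(n + 1);
                fwd = n + 1;
            } else if (fwd == 2 * n + 1) { // end of second half: reload the first
                fill(0);
                fwd = 0;
            }
            if (buf[fwd] == SENTINEL) {    // sentinel inside a half: end of input
                fwd--;                     // stay put so later calls also report the end
                return -1;
            }
        }
        return buf[fwd];
    }

    // Convenience helper for trying the buffer on an in-memory string.
    static String readAll(String text, int n) {
        try {
            SentinelBuffer b = new SentinelBuffer(new StringReader(text), n);
            StringBuilder out = new StringBuilder();
            for (int c = b.next(); c >= 0; c = b.next()) out.append((char) c);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, `SentinelBuffer.readAll("int fee = 12;", 4)` walks the text through several reloads of both halves and returns it unchanged.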
Lexical Tokens
• The lexical analyzer scans the input string (the source code) and attempts to
isolate the words (lexemes)
• The lexemes are treated as units (tokens) and passed to the next phase of
compilation
• Some of the kinds of words include:
1. Keywords/Reserved words: while, if, else, for, int, …
2. Identifiers: constructed by programmers
• May be used to identify variables, classes, functions, constants, etc.
3. Operators: symbols used for arithmetic, logical or character operations:
+, -, <=, =, …
4. Numeric constants: such as the integer 43 or the float 4.25
5. Character constants: single characters or strings of characters enclosed
by quotes
6. Special characters: -, (, ), ,, ;, …
7. Comments
8. White spaces
9. Newline
Lexical Tokens cont’d
• Example: int fee = 12; //example comment
   Lexeme               Class
   int                  1 (Keyword)
   fee                  2 (Identifier)
   =                    3 (Operator)
   12                   4 (Constant)
   ;                    6 (Special character)
   //example comment    7 (Comment)
• Each token consists of two parts:

• Class: indicating which kind of token
• Value: indicating which member of the class
Lexical Tokens cont’d
• Example:
   Class    Value
   1        [code for int]
   2        [Pointer to symbol table entry for fee]
   3        [code for =]
   4        [Pointer to constant table entry for 12]
   6        [code for ;]
• Note that the lexical analysis phase does not check
for proper syntax
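The class/value pairs above could be produced by a scanner along the following lines. This is a deliberately small sketch: the class `TinyScanner`, its keyword and operator sets, and the use of the lexeme text as the token value (instead of a code or a table pointer) are simplifying assumptions, not part of the original slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class TinyScanner {
    static final Set<String> KEYWORDS = Set.of("while", "if", "else", "for", "int");

    // Returns tokens as "class:lexeme"; class numbers follow the list of word kinds.
    static List<String> scan(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            int start = i;
            if (Character.isWhitespace(c)) { // whitespace is discarded
                i++;
                continue;
            }
            if (src.startsWith("//", i)) {   // a comment runs to the end of the line
                while (i < src.length() && src.charAt(i) != '\n') i++;
                continue;                    // comments are stripped, not passed on
            }
            int cls;
            if (Character.isLetter(c)) {
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                cls = KEYWORDS.contains(src.substring(start, i)) ? 1 : 2;
            } else if (Character.isDigit(c)) {
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                cls = 4;                     // integer constants only in this sketch
            } else if ("+-<=".indexOf(c) >= 0) {
                i++;
                if (c == '<' && i < src.length() && src.charAt(i) == '=') i++;
                cls = 3;
            } else {
                i++;
                cls = 6;                     // everything else counts as a special character
            }
            tokens.add(cls + ":" + src.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [1:int, 2:fee, 3:=, 4:12, 6:;]
        System.out.println(scan("int fee = 12; //example comment"));
    }
}
```

Note that `scan("fee int = ;")` also succeeds: the scanner classifies lexemes but, as stated above, does not check syntax.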
Implementation with Finite State Machines
• Finite state machines can be used to simplify lexical
analysis
• A finite state machine can be implemented very
simply by an array in which there is a row for each
state of the machine and a column for each
possible input symbol
• It may be necessary or desirable to code the states
and/or input symbols as integers, depending on the
implementation programming language
Implementation with Finite State
Machines cont’d
import java.io.IOException;

public class FSM {
    // Example sizes only (assumed); set these to match the machine being implemented.
    static final int STATES = 2;
    static final int INPUTS = 2;

    public static void main(String[] args) {
        boolean[] accept = new boolean[STATES];
        int[][] fsm = new int[STATES][INPUTS]; // state table
        // initialize table here...
        int inp = 0;   // input symbol (0..INPUTS-1)
        int state = 0; // starting state
        try {
            inp = System.in.read() - '0'; // character input, convert to int
            while (inp >= 0 && inp < INPUTS) {
                state = fsm[state][inp];      // next state
                inp = System.in.read() - '0'; // get next input
            }
        } catch (IOException ioe) {
            System.out.println("IO error " + ioe);
        }
        if (accept[state])
            System.out.println("Accepted");
        else
            System.out.println("Rejected");
    }
}
Examples of Finite State
Machines for Lexical Analysis
• Example 1: An example of a finite state machine which accepts any
identifier beginning with a letter and followed by any number of
letters and digits
• The letter ‘L’ represents any letter (a-z), and the letter ‘D’
represents any numeric digit (0-9)
• This implies that a preprocessor would be needed to convert input
characters to tokens suitable for input to the finite state machine
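A possible shape for Example 1 is sketched below, since the state diagram itself is not reproduced here. The class `IdentifierFsm`, the state numbering, and the explicit dead state are assumptions; `classify` plays the role of the preprocessor that maps each character to L, D, or "other".

```java
final class IdentifierFsm {
    static final int L = 0, D = 1, OTHER = 2; // input classes produced by the preprocessor

    // States: 0 = start, 1 = inside an identifier (accepting), 2 = dead state.
    static final int[][] NEXT = {
        {1, 2, 2}, // start: a letter begins an identifier, anything else fails
        {1, 1, 2}, // identifier: letters and digits may follow
        {2, 2, 2}, // dead: no way back
    };

    // Preprocessor: converts a character to its input class.
    static int classify(char c) {
        if (c >= 'a' && c <= 'z') return L;
        if (c >= '0' && c <= '9') return D;
        return OTHER;
    }

    static boolean isIdentifier(String s) {
        int state = 0;
        for (char c : s.toCharArray()) {
            state = NEXT[state][classify(c)];
        }
        return state == 1;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"sum", "x1", "1x"}) {
            System.out.println(s + " -> " + isIdentifier(s));
        }
    }
}
```

With this table, `sum` and `x1` are accepted while `1x` is rejected because it does not begin with a letter.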
• Example 2: A finite state machine which accepts
numeric constants.
• These constants must begin with a digit, and
numbers such as .099 are not acceptable
• Example 3: FSM that accepts the keywords if, int,
import, for, float.
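One way to obtain a machine like Example 3 is to build it from the keyword list so that keywords with a common prefix share states. The sketch below does this with one transition map per state; the class `KeywordFsm` and its layout are assumptions, and the resulting machine need not match the state diagram from the original slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class KeywordFsm {
    private final List<Map<Character, Integer>> next = new ArrayList<>(); // one transition map per state
    private final List<Boolean> accepting = new ArrayList<>();

    KeywordFsm(String... keywords) {
        newState(); // state 0 is the starting state
        for (String keyword : keywords) {
            int state = 0;
            for (char c : keyword.toCharArray()) {
                Integer target = next.get(state).get(c);
                if (target == null) {          // no arc yet for this character: add a state
                    target = newState();
                    next.get(state).put(c, target);
                }
                state = target;
            }
            accepting.set(state, true);        // the state reached by a whole keyword accepts
        }
    }

    private int newState() {
        next.add(new HashMap<>());
        accepting.add(false);
        return next.size() - 1;
    }

    boolean accepts(String s) {
        int state = 0;
        for (char c : s.toCharArray()) {
            Integer target = next.get(state).get(c);
            if (target == null) return false;  // missing arc: reject
            state = target;
        }
        return accepting.get(state);
    }

    public static void main(String[] args) {
        KeywordFsm fsm = new KeywordFsm("if", "int", "import", "for", "float");
        for (String s : new String[] {"if", "import", "in", "floats"}) {
            System.out.println(s + " -> " + fsm.accepts(s));
        }
    }
}
```

Here `if`, `int`, and `import` all leave the start state on the same `i` arc, and a prefix such as `in` is rejected because it ends in a non-accepting state.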
Actions for Finite State Machines
• Lexical analysis involves more than simply
recognizing words
• It may involve:
• Building a symbol table
• Converting numeric constants to the appropriate data
type and
• Putting out tokens.
• For this reason, we wish to associate an action, or
function to be invoked, with each state transition in
the finite state machine
Actions for Finite State Machines
cont’d
• Design a finite state machine, with actions, to read
numeric strings and convert them to an appropriate
internal format, such as floating point
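One possible answer to this exercise is sketched below. It assumes a restricted number format (digits, optionally followed by a point and at least one more digit, with no sign or exponent); the class `NumberFsm`, the state numbering, and the use of NaN to signal rejection are all choices made for illustration. Each transition carries an action that updates the running value.

```java
final class NumberFsm {
    // States: 0 = start, 1 = integer part, 2 = just read '.', 3 = fraction part.
    // Returns the converted value, or NaN if the string is rejected.
    static double parse(String s) {
        int state = 0;
        long mantissa = 0;    // all digits read so far (overflow on very long input is ignored)
        double divisor = 1.0; // 10 raised to the number of fraction digits
        for (char c : s.toCharArray()) {
            boolean digit = c >= '0' && c <= '9';
            if (digit && (state == 0 || state == 1)) {
                mantissa = mantissa * 10 + (c - '0'); // action: extend the integer part
                state = 1;
            } else if (c == '.' && state == 1) {
                state = 2;                            // no action on the decimal point
            } else if (digit && (state == 2 || state == 3)) {
                mantissa = mantissa * 10 + (c - '0'); // action: take in a fraction digit
                divisor *= 10;                        // and remember its position
                state = 3;
            } else {
                return Double.NaN;                    // no transition defined: reject
            }
        }
        boolean accepted = state == 1 || state == 3;
        return accepted ? mantissa / divisor : Double.NaN;
    }

    public static void main(String[] args) {
        System.out.println(parse("43"));   // 43.0
        System.out.println(parse("4.25")); // 4.25
        System.out.println(parse(".099")); // NaN: must begin with a digit
    }
}
```

Collecting the digits in an integer and dividing once at the end avoids the rounding drift that repeated multiplication by 0.1 would introduce.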
Lexical Tables
• One of the most important functions of the lexical
analysis phase is the creation of tables which are used
later in the compiler. These include:
• Symbol table for identifiers
• Table of numeric constants
• Table of string constants
• Table of statement labels
• Table of line numbers
• These can be implemented using:
• Sequential search
• Binary search tree
• Hash table
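As an illustration of the hash table option, a symbol table for identifiers might look like the sketch below. The class `SymbolTable` and its methods are hypothetical; the idea is simply that each distinct identifier gets one entry number, which the scanner can use as the token value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SymbolTable {
    private final Map<String, Integer> index = new HashMap<>(); // name -> entry number
    private final List<String> names = new ArrayList<>();       // entry number -> name

    // Returns the entry number for name, creating the entry on first sight.
    int intern(String name) {
        Integer existing = index.get(name);
        if (existing != null) {
            return existing;
        }
        names.add(name);
        index.put(name, names.size() - 1);
        return names.size() - 1;
    }

    String nameAt(int entry) {
        return names.get(entry);
    }

    int size() {
        return names.size();
    }

    public static void main(String[] args) {
        SymbolTable table = new SymbolTable();
        System.out.println(table.intern("fee")); // 0
        System.out.println(table.intern("sum")); // 1
        System.out.println(table.intern("fee")); // 0 again: same entry
    }
}
```

A hash table gives expected constant-time lookup per identifier, whereas sequential search is linear in the table size and a binary search tree is logarithmic only while it stays balanced.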
