Lexical Analysis

Chapter 2
[Figure: block diagram showing the Source Program, the Lexical Analyzer (Scanner), the Syntax Analyzer (Parser), and the Symbol Table]

• Lexical analysis is the first phase of the compiler
• A lexical analyzer is a pattern matcher for character strings
• It is a “front-end” for the parser
• Identifies substrings of the source program that belong together (lexemes)
• Lexemes match a character pattern, which is associated with a
lexical category called a token
• sum is a lexeme; its token may be IDENT
• The lexical analyzer also strips out comments and white space
(blank, tab, and newline characters) from the source program
• It also correlates error messages from the compiler with the source
program
• The lexical analyzer is usually a function that is called by
the parser when it needs the next token
• Three approaches to building a lexical analyzer:
  • Write a formal description of the tokens and use a software
  tool that constructs a table-driven lexical analyzer from such a
  description
    • E.g. lex or flex
    • Easiest to implement but least efficient
  • Design a state diagram that describes the tokens and write a
  program that implements the state diagram
    • Intermediate in ease and efficiency
  • Design a state diagram that describes the tokens and hand-construct
  a table-driven implementation of the state diagram
    • Hardest to implement, but most efficient
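
As a minimal sketch of the second approach, the hypothetical Java class below hand-codes a very small state diagram and exposes it as a function that a parser could call whenever it needs the next token. The class name `SimpleLexer`, the method `nextToken`, and the categories `IDENT`, `INT_LIT`, `SYMBOL`, and `EOF` are assumptions made for illustration; comment stripping is left out to keep the sketch short.

```java
// Hypothetical sketch: a hand-coded scanner that hands out one token per call.
class SimpleLexer {
    private final String src;
    private int pos = 0;

    SimpleLexer(String src) { this.src = src; }

    /** Returns the next token as "CATEGORY:lexeme", or "EOF" at end of input. */
    String nextToken() {
        // Strip white space (blank, tab, newline) before the next lexeme.
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
        if (pos >= src.length()) return "EOF";

        int start = pos;
        char c = src.charAt(pos);
        if (Character.isLetter(c)) {              // state: inside an identifier
            while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
            return "IDENT:" + src.substring(start, pos);
        }
        if (Character.isDigit(c)) {               // state: inside an integer literal
            while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
            return "INT_LIT:" + src.substring(start, pos);
        }
        pos++;                                    // any other single character
        return "SYMBOL:" + c;
    }

    public static void main(String[] args) {
        // The loop plays the role of the parser asking for tokens one at a time.
        SimpleLexer lexer = new SimpleLexer("sum = sum + 42");
        for (String t = lexer.nextToken(); !t.equals("EOF"); t = lexer.nextToken()) {
            System.out.println(t);
        }
    }
}
```

The lexeme `sum` comes back in the category `IDENT`, matching the lexeme/token distinction above.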

Reasons to separate lexical and syntax analysis
• Simplicity - less complex approaches can be used
for lexical analysis; separating them simplifies the
parser
• Efficiency - separation allows optimization of the
lexical analyzer
• Portability - parts of the lexical analyzer may not be
portable, but the parser is always portable

Formal Languages (Revision)
• A formal language is one that can be specified precisely and
is amenable for use with computers
  • E.g. the syntax of the Java programming language
• A natural language, by contrast, is one that is normally spoken
by people
  • E.g. Kiswahili, Arabic
• A formal language is a set of strings over a given alphabet
• E.g. 1. Alphabet: {0,1}
  • {0, 10, 1011} (a finite language) and {} (the empty language, also finite)
  • {ε, 0, 00, 000, 0000, …}
  • The set of all strings of zeros and ones having an even number of ones
• E.g. 2. Alphabet: characters on a computer keyboard
  • {0, 10, 1011}, {ε}, Java syntax, English syntax
A formal language can be specified using a finite state machine or
a regular expression
Finite State Machine (FSM)
• It is a theoretical machine consisting of
  1. A finite set of states: one starting state and zero or more accepting
  states
  2. A state transition function:
    • Takes two arguments: a state and an input symbol
    • Returns a state as its result
• How the FSM works:
  • The input is a string of symbols from the input alphabet
  • The machine is initially in the starting state
  • For each symbol read from the input string, the machine proceeds
  to a new state as indicated by the transition function
  • When the input is exhausted, the machine is either in an accepting
  state (input is accepted) or in a non-accepting state (input is rejected)
• The set of all input strings accepted by the machine forms a
language
• An FSM can be represented by a state diagram, a table, or
other methods
• State Diagram Representation
[Figure: state diagram. A double circle represents an accepting (final) state; the starting state is marked by an arrow with no state at its source; each arc represents the transition function.]

• Table Representation

        0    1      (input symbols)
  *A    A    B
   B    B    A

  • An accepting state is marked by an asterisk
  • Each entry gives the next state
  • The starting state is listed in the first row

• Example: strings containing an odd number of zeros
from the input alphabet {0,1}

Table representation:

        0    1
   A    B    A
  *B    A    B

[Figure: state diagram representation of the same machine]
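
The odd-number-of-zeros machine can be sketched as a transition table in Java. The sketch assumes state A is the non-accepting starting state and state B is the accepting state, so that reading a 0 switches between A and B while reading a 1 leaves the state unchanged; the class and method names are hypothetical.

```java
// Sketch of the odd-number-of-zeros machine as a transition table.
class OddZeros {
    static final int A = 0, B = 1;                   // A = starting state, B = accepting state
    static final int[][] NEXT = {
        /* A */ { B, A },                            // on 0 go to B, on 1 stay in A
        /* B */ { A, B },                            // on 0 go to A, on 1 stay in B
    };
    static final boolean[] ACCEPT = { false, true };

    /** Returns true if input is over {0,1} and contains an odd number of zeros. */
    static boolean accepts(String input) {
        int state = A;
        for (char c : input.toCharArray()) {
            if (c != '0' && c != '1') return false;  // symbol outside the alphabet
            state = NEXT[state][c - '0'];
        }
        return ACCEPT[state];
    }

    public static void main(String[] args) {
        System.out.println(accepts("0"));      // one zero
        System.out.println(accepts("1001"));   // two zeros
        System.out.println(accepts("01010"));  // three zeros
    }
}
```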

Regular Expression
• These are formulas or expressions consisting of
three possible operations on languages:
1. Union
2. Concatenation, and
3. Kleene star
1. Union: since a language is a set, this operation is
the union operation as defined in set theory
• E.g. {abc, ab, ba} + {ba, bb} = {abc, ab, ba, bb}
• Note: L + {} = L

2. Concatenation: the concatenation of two languages is the language
formed by concatenating each string in one language with each string
in the other language
• E.g. {ab, a, c} · {b, ε} = {ab·b, ab·ε, a·b, a·ε, c·b, c·ε} = {abb, ab, a, cb, c}
• In general, L1 · L2 ≠ L2 · L1
• L · {ε} = L
• L · {} = {}
3. Kleene star (closure): if L is a language, L* is defined as follows
• L^0 = {ε}
• L^1 = L
• L^2 = L · L^1
• …
• L^n = L · L^(n-1)
• L* = L^0 + L^1 + L^2 + L^3 + …
• Note: {}* = {ε}
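
The three operations can be tried out on small finite languages. The following Java sketch represents a language as a set of strings, with the empty Java string standing for ε; the class `LanguageOps` and its method names are hypothetical. Since L* is infinite in general, only the finite powers L^n are computed.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch: union, concatenation, and powers of finite languages.
class LanguageOps {
    /** Union: ordinary set union. */
    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> result = new LinkedHashSet<>(a);
        result.addAll(b);
        return result;
    }

    /** Concatenation: every string of a followed by every string of b. */
    static Set<String> concat(Set<String> a, Set<String> b) {
        Set<String> result = new LinkedHashSet<>();
        for (String x : a)
            for (String y : b)
                result.add(x + y);
        return result;
    }

    /** Power: L^0 = {ε}, L^n = L · L^(n-1). */
    static Set<String> power(Set<String> l, int n) {
        Set<String> result = new LinkedHashSet<>(List.of(""));   // {ε}
        for (int i = 0; i < n; i++) result = concat(l, result);
        return result;
    }

    public static void main(String[] args) {
        Set<String> l1 = new LinkedHashSet<>(List.of("ab", "a", "c"));
        Set<String> l2 = new LinkedHashSet<>(List.of("b", ""));
        System.out.println(concat(l1, l2));                       // the concatenation example
        System.out.println(power(new LinkedHashSet<>(List.of("0", "1")), 2));
    }
}
```

Because the result is a set, the duplicate string ab (from ab·ε and a·b) appears only once.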
• E.g. a regular expression specifying the language
consisting of all strings of zeros and ones:
(0+1)*
• To understand what strings are in this
language, let L = {0,1}. We need to find L*:
• L^0 = {ε}
• L^1 = {0, 1}
• L^2 = L · L^1 = {00, 01, 10, 11}
• L^3 = L · L^2 = {000, 001, 010, 011, 100, 101, 110, 111}
• L* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101,
110, 111, 0000, ...}
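
Membership in this language can be checked with Java's `java.util.regex` package. This is only a sketch: Java's regular-expression syntax writes union as `|` rather than `+`, so the expression becomes `(0|1)*`; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Sketch: testing whether a string belongs to the language of (0+1)*.
class BinaryStrings {
    // Union is written '|' in Java regex syntax.
    static final Pattern ZEROS_AND_ONES = Pattern.compile("(0|1)*");

    static boolean inLanguage(String s) {
        return ZEROS_AND_ONES.matcher(s).matches();   // the whole string must match
    }

    public static void main(String[] args) {
        for (String s : new String[] { "", "0", "0110", "012" }) {
            System.out.println("\"" + s + "\" -> " + inLanguage(s));
        }
    }
}
```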
Input Buffering
• Scanner performance is crucial:
• This is the only part of the compiler that examines the
entire input program one character at a time
• Disk input can be slow
• The scanner accounts for ~25-30% of total compile time
• We need lookahead to determine when a match
has been found
• Scanners use double-buffering to minimize the
overheads associated with this

Buffer Pairs

• Use two N-byte buffers (N = size of a disk block; typically, N
= 1024 or 4096).
• Read N bytes into one half of the buffer each time. If the input
has fewer than N bytes, put a special EOF marker in the
buffer.
• When one buffer has been processed, read N bytes into the
other buffer (“circular buffers”).
Buffer Pairs cont’d

if (fwd at end of first half) {
    reload second half;
    set fwd to point to beginning of second half;
} else if (fwd at end of second half) {
    reload first half;
    set fwd to point to beginning of first half;
} else {
    fwd++;    /* common case: advance within the current half */
}

• It takes two tests for each advance of the fwd pointer

Buffer Pairs With Sentinels

• Objective: optimize the common case by reducing the
number of tests to one per advance of fwd
• Idea: extend each buffer half to hold a sentinel at the
end
  • This is a special character that cannot occur in a
  program (e.g., EOF)
  • It signals the need for some special action (fill the other
  buffer half, or terminate processing)
Buffer Pairs With Sentinels cont’d

if (*fwd == EOF) {    /* special processing needed */
    if (fwd at end of first half)
        . . .
    else if (fwd at end of second half)
        . . .
    else    /* end of input */
        terminate processing;
}
• Common case now needs just a single test per character
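
The sentinel scheme might look as follows in Java. This is a sketch under several assumptions: the sentinel is the character `'\0'`, which is assumed never to occur in source text; the class `SentinelBuffer` and its members are hypothetical names; and the pointer that marks the start of the current lexeme is omitted, so only the `fwd` pointer is modelled.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Sketch: a buffer pair with one sentinel slot at the end of each half.
class SentinelBuffer {
    static final char SENTINEL = '\0';   // assumption: never occurs in source text
    private final Reader in;
    private final int n;                 // size of each half
    private final char[] buf;            // [0..n-1] half 1, [n] sentinel, [n+1..2n] half 2, [2n+1] sentinel
    private int fwd = 0;

    SentinelBuffer(Reader in, int n) {
        this.in = in;
        this.n = n;
        this.buf = new char[2 * n + 2];
        buf[n] = SENTINEL;
        buf[2 * n + 1] = SENTINEL;
        fill(0);
    }

    /** Loads up to n characters at offset; a short read is closed by an extra sentinel. */
    private void fill(int offset) {
        try {
            int count = 0;
            while (count < n) {
                int r = in.read(buf, offset + count, n - count);
                if (r < 0) break;
                count += r;
            }
            if (count < n) buf[offset + count] = SENTINEL;   // end of input inside this half
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the next character, or -1 at end of input. */
    int next() {
        char c = buf[fwd];
        if (c != SENTINEL) { fwd++; return c; }   // common case: a single test
        if (fwd == n) {                           // sentinel at end of first half
            fill(n + 1);
            fwd = n + 1;
        } else if (fwd == 2 * n + 1) {            // sentinel at end of second half
            fill(0);
            fwd = 0;
        } else {
            return -1;                            // sentinel inside a half: end of input
        }
        return next();
    }

    public static void main(String[] args) {
        SentinelBuffer sb = new SentinelBuffer(new StringReader("int fee = 12;"), 4);
        for (int c = sb.next(); c != -1; c = sb.next()) System.out.print((char) c);
        System.out.println();
    }
}
```

The half size of 4 in `main` is chosen only to force several reloads on a short input; a realistic N would be the disk block size mentioned earlier.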
Lexical Tokens
• The lexical analyzer scans the input string (source code) and attempts to
isolate the words (lexemes)
• Words/lexemes are taken as units and passed to the next phase of
the compiler
• Some of the kinds of words include:
1. Keywords/reserved words: while, if, else, for, int, …
2. Identifiers: constructed by programmers
• May be used to identify variables, classes, functions, constants, etc.
3. Operators: symbols used for arithmetic, logical, or character operations:
+, -, <=, =, …
4. Numeric constants: such as the integer 43 or the float 4.25
5. Character constants: single characters or strings of characters enclosed
by quotes
6. Special characters: -, (, ), ,, ;, …
7. Comments
8. White space
9. Newline
Lexical Tokens cont’d
• Example:

  int    fee    =    12    ;    //example comment
   1      2     3     4    6            7

  1 = keyword, 2 = identifier, 3 = operator, 4 = numeric constant,
  6 = special character, 7 = comment

• Each token consists of two parts:

• Class: indicating which kind of token
• Value: indicating which member of the class

• Example:

  Class   Value
  1       [code for int]
  2       [pointer to symbol table entry for fee]
  3       [code for =]
  4       [pointer to constant table entry for 12]
  6       [code for ;]

• Note that the lexical analysis phase does not check
for proper syntax
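
The class/value pairs for the example line can be produced by a short Java sketch. It rests on several assumptions made only for illustration: tokens are `{class, value}` integer pairs, a "pointer" is an index into a list, the code for the keyword `int` is 0, and the code for an operator or special character is its character code. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: turning one source line into (class, value) token pairs.
class TokenDemo {
    static final int KEYWORD = 1, IDENT = 2, OPERATOR = 3, NUM_CONST = 4, SPECIAL = 6;

    static final List<String> symbolTable = new ArrayList<>();
    static final List<String> constantTable = new ArrayList<>();

    /** Returns the index of s in table, adding it first if absent (sequential search). */
    static int intern(List<String> table, String s) {
        int i = table.indexOf(s);
        if (i < 0) { table.add(s); i = table.size() - 1; }
        return i;
    }

    /** Scans one line and returns its tokens as {class, value} pairs. */
    static List<int[]> scan(String line) {
        List<int[]> tokens = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }   // white space is dropped
            if (line.startsWith("//", i)) break;                // comment runs to end of line
            int start = i;
            if (Character.isLetter(c)) {
                while (i < line.length() && Character.isLetterOrDigit(line.charAt(i))) i++;
                String word = line.substring(start, i);
                if (word.equals("int")) tokens.add(new int[] { KEYWORD, 0 });   // 0: assumed code for int
                else tokens.add(new int[] { IDENT, intern(symbolTable, word) });
            } else if (Character.isDigit(c)) {
                while (i < line.length() && Character.isDigit(line.charAt(i))) i++;
                tokens.add(new int[] { NUM_CONST, intern(constantTable, line.substring(start, i)) });
            } else {
                i++;
                tokens.add(new int[] { c == '=' ? OPERATOR : SPECIAL, c });     // value: character code
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (int[] t : scan("int fee = 12; //example comment")) {
            System.out.println("class " + t[0] + ", value " + t[1]);
        }
    }
}
```

The scanner happily tokenizes `12 = fee int ;` as well, which illustrates that no syntax checking happens in this phase.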

Implementation with Finite State Machines

• Finite state machines can be used to simplify lexical
analysis
• A finite state machine can be implemented very
simply by an array in which there is a row for each
state of the machine and a column for each
possible input symbol
• It may be necessary or desirable to code the states
and/or input symbols as integers, depending on the
implementation programming language

Implementation with Finite State Machines cont’d
// Requires: import java.io.IOException;
// STATES and INPUTS are the numbers of states and input symbols.
boolean[] accept = new boolean[STATES];     // accept[s] is true if state s is accepting
int[][] fsm = new int[STATES][INPUTS];      // state table
// initialize table here...
int inp = 0;                                // input symbol (0..INPUTS-1)
int state = 0;                              // starting state
try {
    inp = System.in.read() - '0';           // character input, convert to int
    while (inp >= 0 && inp < INPUTS) {
        state = fsm[state][inp];            // next state
        inp = System.in.read() - '0';       // get next input
    }
} catch (IOException ioe) {
    System.out.println("IO error " + ioe);
}
if (accept[state])
    System.out.println("Accepted");
else
    System.out.println("Rejected");

Examples of Finite State Machines for Lexical Analysis
• Example 1: An example of a finite state machine which accepts any
identifier beginning with a letter and followed by any number of
letters and digits

• The letter ‘L’ represents any letter (a-z), and the letter ‘D’
represents any numeric digit (0-9)
• This implies that a preprocessor would be needed to convert input
characters to tokens suitable for input to the finite state machine
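
A possible shape for this machine in Java is sketched below. The method `classify` plays the role of the preprocessor, mapping each raw character to the input class L or D. The state names, the extra input class `OTHER`, and the explicit dead state are assumptions added so that the table is complete; they are not taken from the original diagram.

```java
// Sketch: identifier recognizer driven by input classes L (letter) and D (digit).
class IdentifierFsm {
    static final int L = 0, D = 1, OTHER = 2;           // input classes
    static final int START = 0, IN_ID = 1, DEAD = 2;    // states; IN_ID is accepting
    static final int[][] NEXT = {
        /* START */ { IN_ID, DEAD,  DEAD },             // must begin with a letter
        /* IN_ID */ { IN_ID, IN_ID, DEAD },             // then any letters and digits
        /* DEAD  */ { DEAD,  DEAD,  DEAD },             // trap state: no way back
    };

    /** The "preprocessor": maps a raw character to an input class. */
    static int classify(char c) {
        if (c >= 'a' && c <= 'z') return L;
        if (c >= '0' && c <= '9') return D;
        return OTHER;
    }

    static boolean isIdentifier(String s) {
        int state = START;
        for (int i = 0; i < s.length(); i++) {
            state = NEXT[state][classify(s.charAt(i))];
        }
        return state == IN_ID;
    }

    public static void main(String[] args) {
        for (String s : new String[] { "sum", "x1", "1x" }) {
            System.out.println(s + " -> " + isIdentifier(s));
        }
    }
}
```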
• Example 2: A finite state machine which accepts
numeric constants.
• These constants must begin with a digit, and
numbers such as .099 are not acceptable

• Example 3: FSM that accepts the keywords if, int,
import, for, float.
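
One way to obtain such a machine is to build it from the keyword list, sharing states between keywords that have a common prefix (for example if, int, and import all leave the start state on i). The Java sketch below does this; the representation, with one transition map per state, and all names are assumptions for illustration and need not match the original diagram state for state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: a keyword-recognizing FSM built from the list of keywords.
class KeywordFsm {
    // transitions.get(s) maps an input character to the next state; state 0 is the start.
    static final List<Map<Character, Integer>> transitions = new ArrayList<>();
    static final Set<Integer> accepting = new HashSet<>();

    static {
        transitions.add(new HashMap<>());                    // state 0
        for (String kw : new String[] { "if", "int", "import", "for", "float" }) {
            int state = 0;
            for (char c : kw.toCharArray()) {
                Integer next = transitions.get(state).get(c);
                if (next == null) {                          // no arc yet: add a new state
                    next = transitions.size();
                    transitions.add(new HashMap<>());
                    transitions.get(state).put(c, next);
                }
                state = next;
            }
            accepting.add(state);                            // end of a keyword: accepting state
        }
    }

    static boolean isKeyword(String s) {
        int state = 0;
        for (char c : s.toCharArray()) {
            Integer next = transitions.get(state).get(c);
            if (next == null) return false;                  // missing arc: reject
            state = next;
        }
        return accepting.contains(state);
    }

    public static void main(String[] args) {
        for (String s : new String[] { "if", "import", "in", "form" }) {
            System.out.println(s + " -> " + isKeyword(s));
        }
    }
}
```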

Actions for Finite State Machines
• Lexical analysis involves more than simply
recognizing words
• It may involve:
• Building a symbol table
• Converting numeric constants to the appropriate data
type and
• Putting out tokens.
• For this reason, we wish to associate an action, or
function to be invoked, with each state transition in
the finite state machine

• Design a finite state machine, with actions, to read
numeric strings and convert them to an appropriate
internal format, such as floating point
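
One possible design is sketched below in Java, with the action for each transition written next to the state change. It assumes a simplified numeric syntax: one or more digits, optionally followed by a decimal point and one or more digits, with no sign and no exponent. The state names and the choice to return NaN on rejection are likewise assumptions.

```java
// Sketch: an FSM whose transitions carry actions that accumulate a numeric value.
class NumericFsm {
    enum State { START, INT_PART, POINT, FRACTION, ERROR }

    /** Returns the value of the numeric string s, or NaN if the machine rejects it. */
    static double convert(String s) {
        State state = State.START;
        double value = 0.0;   // built up by the actions
        double scale = 1.0;   // weight of the next fractional digit
        for (char c : s.toCharArray()) {
            boolean digit = c >= '0' && c <= '9';
            switch (state) {
                case START:
                case INT_PART:
                    if (digit) {                              // action: shift in one integer digit
                        value = value * 10 + (c - '0');
                        state = State.INT_PART;
                    } else if (c == '.' && state == State.INT_PART) {
                        state = State.POINT;                  // no action needed
                    } else {
                        state = State.ERROR;
                    }
                    break;
                case POINT:
                case FRACTION:
                    if (digit) {                              // action: add one fractional digit
                        scale /= 10;
                        value += (c - '0') * scale;
                        state = State.FRACTION;
                    } else {
                        state = State.ERROR;
                    }
                    break;
                default:
                    break;                                    // ERROR is a trap state
            }
        }
        return (state == State.INT_PART || state == State.FRACTION) ? value : Double.NaN;
    }

    public static void main(String[] args) {
        System.out.println(convert("43"));
        System.out.println(convert("4.25"));
        System.out.println(convert(".099"));   // rejected: does not begin with a digit
    }
}
```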

Lexical Tables
• One of the most important functions of the lexical
analysis phase is the creation of tables which are used
later in the compiler. These include:
• Symbol table for identifiers
• Table of numeric constants
• Table of string constants
• Table of statement labels
• Table of line numbers
• These can be implemented using:
• Sequential search
• Binary search tree
• Hash table
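
As a minimal sketch of the hash table option, the Java class below keeps identifiers in a `HashMap` from name to entry number, with a list to map entry numbers back to names. The class `SymbolTable` and its methods are hypothetical; a real symbol table would store more per entry (type, scope, and so on).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a symbol table for identifiers backed by a hash table.
class SymbolTable {
    private final Map<String, Integer> index = new HashMap<>();   // name -> entry number
    private final List<String> names = new ArrayList<>();         // entry number -> name

    /** Returns the entry number for name, inserting it the first time it is seen. */
    int lookupOrInsert(String name) {
        Integer entry = index.get(name);
        if (entry == null) {
            entry = names.size();
            names.add(name);
            index.put(name, entry);
        }
        return entry;
    }

    String nameAt(int entry) { return names.get(entry); }

    int size() { return names.size(); }

    public static void main(String[] args) {
        SymbolTable table = new SymbolTable();
        for (String id : new String[] { "fee", "sum", "fee" }) {
            System.out.println(id + " -> entry " + table.lookupOrInsert(id));
        }
    }
}
```

The second occurrence of an identifier maps to the same entry as the first, so each token's value can simply be the entry number.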
