Professional Documents
Culture Documents
Lexical Analysis
Lexical Analysis
Lexical Analysis
Chapter 2
Introduction
Tokens
Source Lexical Analyzer Syntax Analyzer
Program (Scanner) (Parser)
Symbol Table
Manager
01/20/2024 CD 2020 2
Introduction
• Lexical analysis is the first phase of the compiler
• A lexical analyzer is a pattern matcher for character strings
• It is a “front-end” for the parser
• Identifies substrings of the source program that belong together -
lexemes
• Lexemes match a character pattern, which is associated with a
lexical category called a token
• sum is a lexeme; its token may be IDENT
• Lexical analyzer also strips out comments and white spaces in the
form of blank tab and newline characters from the source program
• It also correlates error messages from the compiler with the source
program
01/20/2024 CD 2020 3
Introduction
• The lexical analyzer is usually a function that is called by
the parser when it needs the next token
• Three approaches to build a lexical analyzer:
• Write a formal description of the tokens and use a software
tool that constructs table-driven lexical analyzers given such a
description
• E.g. lex or flex
• Easiest to implement but least efficient
• Design a state diagram that describes the tokens and write a
program that implements the state diagram
• Intermediate in ease, and efficiency
• Design a state diagram that describes the tokens and hand-
construct a table-driven implementation of the state diagram
• Hardest to implement, but most efficient
01/20/2024 CD 2020 4
Reasons to separate lexical and
syntax analysis
• Simplicity - less complex approaches can be used
for lexical analysis; separating them simplifies the
parser
• Efficiency - separation allows optimization of the
lexical analyzer
• Portability - parts of the lexical analyzer may not be
portable, but the parser always is portable
01/20/2024 CD 2020 5
Formal Languages (Revision)
• A formal language is one that can be specified precisely and
is amenable for use with computers
• E.g. syntax of java programming language
• Whereas a natural language is one which is normally spoken
by people
• E.g. Kiswahili, Arabic
• Formal language is a set of strings from a given alphabet
• E.g. 1. Alphabet: {0,1}
• {0, 10, 1011} finite string and {} finite and null string
• {Є, 0, 00, 000,0000, …}
• The set of all strings of zeros & ones having an even number of ones
• E.g. 2. Alphabet: charcters on a computer keyboard
• {0, 10, 1011}, {Є}, java syntax, English systax
01/20/2024 CD 2020 6
Formal Languages (Revision)
cont’d
Formal language can be specified using Finite State Machine or
Regular Expressions
Finite State Machine (FSM )
• It is a theoretical machine consists of
1. A finite set of states: one starting state and zero or more accepting
states
2. A state transition function:
• Two arguments: a state and input symbol
• Returns a state as it’s result
• How the FSM works:
• The input is a string of symbols from the input alphabet
• The machine is initially in the starting state
• Each symbol read from the input string
• Machine proceeds to a new state as indicated by the transition function
• Finally machine is either in accepting state (input is accepted) or non-
accepting state (input is rejected)
• The set of all input strings which will be accepted by the machine
form a language
01/20/2024 CD 2020 7
Formal Languages (Revision)
cont’d
Starting state – no
state at it’s source Arc representing
transition function
Labels
representing
01/20/2024 CD 2020 inputs 8
Formal Languages (Revision)
cont’d
• Table Representation
Input symbols
0 1
Accepting state *A A B
marked by asterisk
B B A The next state
01/20/2024 CD 2020 9
Formal Languages (Revision)
cont’d
• Example: strings containing an odd number of zeros
from input alphabet {0,1}
0 1
A B A
*B A B
01/20/2024 CD 2020 10
Formal Languages (Revision)
cont’d
Regular Expression
• These are formulas or expressions consisting of
three possible operations on languages:
1. Union
2. Concatenation, and
3. Kleene star
1. Union: since a language is a set, this operation is
the union operation as defined in set theory
• E.g. {abc, ab, ba} + {ba, bb} = {abc, ab, ba, bb}
• Note: L + {} = L
01/20/2024 CD 2020 11
Formal Languages (Revision)
cont’d
2. Concatenation: concatenation of two languages is that language
formed by concatenating each string in one language with each string
in the other language
• E.g. {ab, a, c} · {b, ǫ} = {ab · b, ab · ǫ, a · b, a · ǫ, c · b, c · ǫ} = {abb, ab, a, cb, c}
• L1●L2 ≠ L2●L1
• L● =L
• L● =
3. Klen star (Closure): if L is a language, we define it as follows
• L0 = {}
• L1 = L
• L2 = L ● L 1
…
• Ln = L ● Ln-1
• L* = L 0 + L 1 + L 2 + L 3 + …
• Note: * = {}
01/20/2024 CD 2020 12
Formal Languages (Revision)
cont’d
01/20/2024 CD 2020 14
Buffer Pairs
Code:
if (fwd at end of first half)
reload second half;
set fwd to point to beginning of second half;
else if (fwd at end of second half)
reload first half;
set fwd to point to beginning of first half;
else
fwd++;
Code:
fwd++;
if ( *fwd == EOF ) { /* special
processing needed */
if (fwd at end of first half)
. . .
else if (fwd at end of second half)
. . .
else /* end of input */
terminate processing.
}
• Common case now needs just a single test per character
01/20/2024 CD 2020 18
Lexical Tokens
• The lexical analyzer scans the input strings/source code and attempts to
isolate the words/lexims
• Words/lexims are token as a units & passed to the next phase of
compilation
• Some of the words include:
1. Keywords/Reserved words: while, if, else, for, int…
2. Identifiers: constructed by programmers
• May be used to identify variables classes, functions constants etc.
3. Operators: symbols used for arithmetic, logical or character operations
+, -, <=, = …
4. Numeric constants: like integer 43, float 4.25
5. Character constants: single characters or strings of characters enclosed
by quotes
6. Special characters: -, (, ), ,, ; ……
7. Comments
8. White spaces
9. Newline
01/20/2024 CD 2020 19
Lexical Tokens cont’d
• Example:
6 7
1 2 3 4
01/20/2024 CD 2020 20
Lexical Tokens cont’d
• Example:
Class Value
1 [code for int]
2 [Pointer to symbol table entry for fee]
3 [code for =]
4 [Pointer to constant table entry for 12]
6 [code for ;]
01/20/2024 CD 2020 21
Implementation with Finite State Machines
01/20/2024 CD 2020 22
Implementation with Finite State
Machines cont’d
boolean [] accept = new boolean [STATES];
int [][] fsm = new int[STATES][INPUTS]; // state table
// initialize table here...
int inp = 0; // input symbol (0..INPUTS)
int state = 0; // starting state;
try
{ inp = System.in.read() - ’0’; // character input,
// convert to int.
while (inp>=0 && inp<INPUTS)
{ state = fsm[state][inp]; // next state
inp = System.in.read() - ’0’; // get next input
}
} catch (IOException ioe)
{ System.out.println ("IO error " + ioe); }
if (accept[state])
System.out.println ("Accepted"); System.out.println
("Rejected");
01/20/2024 CD 2020 23
Examples of Finite State
Machines for Lexical Analysis
• Example 1: An example of a finite state machine which accepts any
identifier beginning with a letter and followed by any number of
letters and digits
• The letter ‘L’ represents any letter (a-z), and the letter ‘D’
represents any numeric digit (0-9)
• This implies that a preprocessor would be needed to convert input
characters to tokens suitable for input to the finite state machine
01/20/2024 CD 2020 24
Examples of Finite State
Machines for Lexical Analysis
cont’d
• Example 2: A finite state machine which accepts
numeric constants.
• These constants must begin with a digit, and
numbers such as .099 are not acceptable
01/20/2024 CD 2020 25
Examples of Finite State
Machines for Lexical Analysis
cont’d
• Example 3: FSM that accepts the keywords if, int,
import, for, float.
01/20/2024 CD 2020 26
Actions for Finite State Machines
• Lexical analysis involves more than simply
recognizing words
• It may involve:
• Building a symbol table
• Converting numeric constants to the appropriate data
type and
• Putting out tokens.
• For this reason, we wish to associate an action, or
function to be invoked, with each state transition in
the finite state machine
01/20/2024 CD 2020 27
Actions for Finite State Machines
cont’d
01/20/2024 CD 2020 28
Lexical Tables
• One of the most important functions of the lexical
analysis phase is the creation of tables which are used
later in the compiler. These include:
• Symbol table for identifiers
• Table of numeric constants
• Table of string constants
• Table of statement table
• Table of line numbers
• These can be implemented using:
• Sequential search
• Binary search tree
• Hash table
01/20/2024 CD 2020 29