UNIT1 - Lexical Analysis1

Unit I
Lexical Analysis
06/22/2009 Department of Computer Science, ER&DCI Institute of Technology
Some Important Basic Definitions
lexical: of or relating to the morphemes of a language.
morpheme: a meaningful linguistic unit that cannot be divided into smaller meaningful parts.
lexical analysis: the task concerned with breaking an input into its smallest meaningful units, called tokens.

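The Role of a Lexical Analyzer
[Figure: the lexical analyzer reads characters from the source program and passes a token to the parser whenever the parser asks it to get the next token; both the lexical analyzer and the parser use the symbol table.]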

Lexical Analyzer
• Functions
  • Grouping input characters into tokens
  • Stripping out comments and white space
  • Correlating error messages with the source program

Why Separate?
• Reasons to separate lexical analysis from parsing:
  • Simpler design
  • Improved efficiency
  • Portability
• Tools exist to help implement lexical analyzers and parsers independently

Typical Tokens in a PL
• Symbols: +, -, *, /, =, <, >, ->, …
• Keywords: if, while, struct, float, int, …
• Integer and real (floating-point) literals: 123, 123.45
• Char (string) literals
• Identifiers
• Comments
• White space

Introducing Basic Terminology
• What are the major terms for lexical analysis?
  • TOKEN
    • A classification for a common set of strings
    • Examples include <identifier>, <number>, etc.
  • PATTERN
    • The rules which characterize the set of strings for a token
  • LEXEME
    • The actual sequence of characters that matches a pattern and is classified by a token
    • Identifiers: x, count, name, etc.

Introducing Basic Terminology (cont.)
Token      Sample Lexemes          Informal Description of Pattern
const      const                   const
if         if                      if
relation   <, <=, =, <>, >, >=     < or <= or = or <> or >= or >
id         pi, count, D2           letter followed by letters and digits
num        3.1416, 0, 6.02E23      any numeric constant
literal    "core dumped"           any characters between " and " except "

The token classifies the pattern. The actual values (lexemes) are critical. This information is:
1. Stored in the symbol table
2. Returned to the parser

Token Attribute
• E = C1 ** 10
Token   Attribute
ID      Index to symbol table entry for E
=
ID      Index to symbol table entry for C1
**
NUM     10

Case Study
• When blanks are not significant (as in Fortran and Algol 68)
DO 5 I = 1.25          DO 5 I = 1, 25
DO5I is an ID          This is a DO loop

So 7 tokens are generated for the DO loop statement: DO, 5, I, =, 1, the comma, and 25.
DO 10 I = 1, 100
   STATEMENT
   STATEMENT
   STATEMENT
10 CONTINUE

Case Study (cont.)
• When keywords are not reserved words (such as in PL/1)
Example 1:
IF THEN THEN THEN = ELSE ELSE ELSE = THEN;
Which THEN is an identifier?
Which THEN is the keyword?

Handling Lexical Errors
• Error handling is very localized with respect to the input source
  • For example: fi ( a == f(x) ) … generates no lexical errors in C, because fi is a valid identifier; only a later phase can tell that "if" was intended
• In what situations do errors occur?
  • A prefix of the remaining input doesn't match any defined token
• Possible error recovery actions:
  • Deleting or inserting input characters
  • Replacing or transposing characters
  • Or, skip over to the next separator to "ignore" the problem

Implementation of a Lexical Analyzer
• Use a lexical analyzer generator to produce a lexical analyzer from a regular-expression-based specification
• Write the lexical analyzer in a conventional systems-programming language
• Write the lexical analyzer in assembly language

I/O: Key for Successful Lexical Analysis
• Character-at-a-time I/O
• Block / buffered I/O
  • Utilize a block of memory
  • Stage data from the source to the buffer a block at a time
  • Maintain two blocks
    • Asynchronous I/O for one block
    • While lexical analysis proceeds on the 2nd block
[Figure: Block 1 | Block 2. When the pointer is done with Block 1, issue I/O for it while still processing the token in the 2nd block.]

Code to Advance the forward Pointer
if forward at end of first half then begin         /* checking if forward ptr is at the end of the 1st half */
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin   /* checking if forward ptr is at the end of the 2nd half */
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;

[Figure: two-half buffer holding "E = M *" and "C * * 2 eof", with the lexeme_beginning and forward pointers marked.]

Algorithm: Buffered I/O with Sentinels
[Figure: buffer "E = M * eof | C * * 2 eof | eof". lexeme_beginning marks the start of the current token; forward scans ahead to find a pattern match.]
forward := forward + 1;
if forward is at eof then begin
    if forward at end of first half then begin
        reload second half;                        /* block I/O */
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;                         /* block I/O */
        move forward to beginning of first half
    end
    else /* eof within buffer signifying end of input */
        terminate lexical analysis                 /* 2nd eof: no more input! */
end
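
The sentinel scheme can be sketched in C as follows. This is only an illustration under stated assumptions: the source is an in-memory string rather than a file, the NUL character stands in for the slides' eof sentinel, the half size HALF is deliberately tiny, and the names Scanner, reload, scanner_init and scanner_next are hypothetical. Unlike the pseudocode, scanner_next delivers the character under forward and then advances, but it performs the same three-way test when it meets a sentinel.

```c
#include <stddef.h>
#include <string.h>

#define HALF 4         /* tiny half size so that reloads happen often */
#define SENTINEL '\0'  /* stands in for the slides' eof marker (assumption) */

typedef struct {
    const char *src;         /* source text not yet staged into the buffer */
    char buf[2 * HALF + 2];  /* layout: [half 1][eof][half 2][eof] */
    size_t forward;          /* index of the next character to deliver */
} Scanner;

/* Copy up to HALF characters from the source into the half at `start`. */
static void reload(Scanner *s, size_t start) {
    size_t n = strlen(s->src);
    if (n > HALF) n = HALF;
    memcpy(s->buf + start, s->src, n);
    s->src += n;
    s->buf[start + n] = SENTINEL;  /* the half's sentinel slot, or an early eof */
}

void scanner_init(Scanner *s, const char *text) {
    memset(s->buf, SENTINEL, sizeof s->buf);
    s->src = text;
    s->forward = 0;
    reload(s, 0);
}

/* Return the next input character, or -1 when the input is exhausted. */
int scanner_next(Scanner *s) {
    for (;;) {
        char c = s->buf[s->forward];
        if (c != SENTINEL) {            /* common case: one test per character */
            s->forward++;
            return (unsigned char)c;
        }
        if (s->forward == HALF) {       /* sentinel at end of first half */
            reload(s, HALF + 1);
            s->forward = HALF + 1;
        } else if (s->forward == 2 * HALF + 1) {  /* end of second half */
            reload(s, 0);
            s->forward = 0;
        } else {
            return -1;                  /* eof within a half: end of input */
        }
    }
}
```

The point of the sentinel is visible in scanner_next: the ordinary path makes a single comparison per character, and the end-of-half checks run only when a sentinel is actually seen.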
Specification of Patterns for Tokens: Terminology
• An alphabet Σ is a finite set of symbols (characters / letters)
• A string s is a finite sequence of symbols from Σ
  • |s| denotes the length of string s
  • ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over some fixed alphabet Σ

Examples
• Alphabet
  • {0,1} is the binary alphabet
• String
  • s = 010
  • |s| = 3
• Language
  • {010, 011, 00, …} is a language over the alphabet {0,1}

Language Concepts
A language, L, is simply any set of strings over a fixed alphabet.
Alphabet                              Languages
{0,1}                                 {0, 10, 100, 1000, 10000, …}
                                      {0, 1, 00, 11, 000, 111, …}
{a,b,c}                               {abc, aabbcc, aaabbbccc, …}
{A, …, Z}                             {TEE, FORE, BALL, …}
                                      {FOR, WHILE, GOTO, …}
{A,…,Z,a,…,z,0,…,9,+,-,…,<,>,…}       {all legal PASCAL programs}
                                      {all grammatically correct English sentences}
Special languages: ∅   - the EMPTY LANGUAGE
                   {ε} - contains the ε string only

Terms for Parts of a String
EXAMPLES AND OTHER CONCEPTS:
Suppose S is the string banana
Prefix: ban, banana
Suffix: ana, banana
Substring: nan, ban, ana, banana
Subsequence: bnan, nn
A proper prefix, suffix, or substring cannot be all of S.
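
These terms translate directly into small predicates. The sketch below is illustrative only; the function names are hypothetical, and is_proper implements just the condition stated on the slide (the part must not be all of S).

```c
#include <stdbool.h>
#include <string.h>

/* x is a prefix of s if s starts with x. */
bool is_prefix(const char *x, const char *s) {
    return strncmp(x, s, strlen(x)) == 0;
}

/* x is a suffix of s if s ends with x. */
bool is_suffix(const char *x, const char *s) {
    size_t nx = strlen(x), ns = strlen(s);
    return nx <= ns && strcmp(s + (ns - nx), x) == 0;
}

/* x is a substring of s if it occurs contiguously somewhere in s. */
bool is_substring(const char *x, const char *s) {
    return strstr(s, x) != NULL;
}

/* x is a subsequence of s if it can be formed by deleting
   zero or more (not necessarily adjacent) symbols of s. */
bool is_subsequence(const char *x, const char *s) {
    while (*x && *s) {
        if (*x == *s) x++;
        s++;
    }
    return *x == '\0';
}

/* A proper prefix, suffix, or substring is not all of s. */
bool is_proper(const char *x, const char *s) {
    return strcmp(x, s) != 0;
}
```

Note that bnan is a subsequence of banana but not a substring, since its symbols are not adjacent in banana.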

Formal Language Operations
OPERATION                                DEFINITION
union of L and M, written L ∪ M          L ∪ M = {s | s is in L or s is in M}
concatenation of L and M, written LM     LM = {st | s is in L and t is in M}
Kleene closure of L, written L*          $L^* = \bigcup_{i=0}^{\infty} L^i$
                                         L* denotes "zero or more concatenations of" L
positive closure of L, written L+        $L^+ = \bigcup_{i=1}^{\infty} L^i$
                                         L+ denotes "one or more concatenations of" L

Formal Language Operations Examples
L = {A, B, C, D}    D = {1, 2, 3}
L ∪ D = {A, B, C, D, 1, 2, 3}
LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3}
L^2 = {AA, AB, AC, AD, BA, BB, BC, BD, CA, …, DD}
L^4 = L^2 L^2 = {set of all 4-letter strings}
L* = {all possible strings over L, plus ε}
L+ = L* - {ε}
L (L ∪ D)* = ??

Language & Regular Expressions
• A regular expression is a set of rules / techniques for constructing sequences of strings from an alphabet.
  • Suitable for specifying the structure of tokens in programming languages
• Let Σ be an alphabet and r a regular expression. Then L(r) is the language that is characterized by the rules of r.

Rules for Specifying Regular Expressions
Regular expressions over alphabet Σ:
• ε is a regular expression denoting {ε}
• If a is in Σ, a is a regular expression that denotes {a}
• Let r and s be regular expressions with languages L(r) and L(s). Then
  (a) (r) | (s) is a regular expression denoting L(r) ∪ L(s)
  (b) (r)(s) is a regular expression denoting L(r) L(s)
  (c) (r)* is a regular expression denoting (L(r))*
  (d) (r) is a regular expression denoting L(r)
All are left-associative. Parentheses are dropped as allowed by the precedence rules: * binds tightest, then concatenation, then |.

Example
• An identifier in Pascal can be defined as letter (letter | digit)*
• More examples
  • a | b denotes the set {a, b}
  • (a|b)(a|b) denotes the set {aa, ab, ba, bb}
  • a* denotes {ε, a, aa, aaa, …}
  • (a|b)* denotes all strings of a's and b's (including ε); it is also equal to (a*b*)*

EXAMPLES of Regular Expressions
L = {A, B, C, D}    D = {1, 2, 3}
A | B | C | D = L
(A | B | C | D)(A | B | C | D) = L^2
(A | B | C | D)* = L*
(A | B | C | D)((A | B | C | D) | (1 | 2 | 3)) = L (L ∪ D)

Towards Token Definition
Regular definitions: associate names with regular expressions
For example: PASCAL IDs
letter → A | B | C | … | Z | a | b | … | z
digit → 0 | 1 | 2 | … | 9
id → letter ( letter | digit )*
Shorthand notation:
"+" : one or more        r* = r+ | ε  and  r+ = r r*
"?" : zero or one        r? = r | ε
[range] : set range of characters (replaces "|")        [A-Z] = A | B | C | … | Z
Example using shorthand: PASCAL IDs
id → [A-Za-z][A-Za-z0-9]*

Algebraic Properties of Regular Expressions
AXIOM                          DESCRIPTION
r | s = s | r                  | is commutative
r | (s | t) = (r | s) | t      | is associative
(r s) t = r (s t)              concatenation is associative
r (s | t) = r s | r t
(s | t) r = s r | t r          concatenation distributes over |
εr = r
rε = r                         ε is the identity element for concatenation
r* = (r | ε)*                  relation between * and ε
r** = r*                       * is idempotent

Non-regular Sets
• An RE can denote a fixed number or an unspecified number of repetitions of a given construct. Its best use is for describing identifiers, constants, etc.
• An RE cannot be used to describe balanced or nested structures, such as nested loops or nested if-then-else.

Token Recognition
How can we use the concepts developed so far to assist in recognizing tokens of a source language?
Assume the following tokens: if, then, else, relop, id, num
Given the tokens, what are the patterns?

Grammar:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term | term
term → id | num

Regular definitions:
if → if
then → then
else → else
relop → < | <= | > | >= | = | <>
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
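
As a hedged illustration, the num definition can be hand-coded as a longest-prefix matcher. The function names digits and match_num are hypothetical; the optional fraction and exponent are accepted only when they are complete, so an input such as "12.x" yields the lexeme "12".

```c
#include <ctype.h>
#include <stddef.h>

/* Skip a run of digits starting at s[i]; returns the index after the run
   (equal to i when there is no digit at s[i]). */
static size_t digits(const char *s, size_t i) {
    while (isdigit((unsigned char)s[i])) i++;
    return i;
}

/* Length of the longest prefix of s that matches
   num -> digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
   or 0 if no prefix matches. */
size_t match_num(const char *s) {
    size_t i = digits(s, 0);
    if (i == 0) return 0;              /* the leading digit+ is mandatory */

    if (s[i] == '.') {                 /* optional fraction: . digit+ */
        size_t j = digits(s, i + 1);
        if (j > i + 1) i = j;
    }
    if (s[i] == 'E') {                 /* optional exponent: E (+|-)? digit+ */
        size_t k = i + 1;
        if (s[k] == '+' || s[k] == '-') k++;
        size_t j = digits(s, k);
        if (j > k) i = j;
    }
    return i;
}
```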

What Else Does the Lexical Analyzer Do?
Scan away blanks (b), newlines (nl), and tabs

Can we define tokens for these?
blank → b
tab → ^T
newline → ^M
delim → blank | tab | newline
ws → delim+

Overall
Regular Expression   Token   Attribute-Value
ws                   -       -
if                   if      -
then                 then    -
else                 else    -
id                   id      pointer to table entry
num                  num     pointer to table entry
<                    relop   LT
<=                   relop   LE
=                    relop   EQ
<>                   relop   NE
>                    relop   GT
>=                   relop   GE
Note: Each token has a unique token identifier to define its category of lexemes.

Constructing Transition Diagrams for Tokens
• Transition diagrams (TDs) are used to represent the tokens
• As characters are read, the relevant TDs are used to attempt to match a lexeme to a pattern
• Each TD has:
  • States: represented by circles
  • Actions (transitions): represented by arrows between states
  • Start state: beginning of a pattern (arrowhead)
  • Final state(s): end of a pattern (concentric circles)

Transition Diagram Symbols
[Figure: legend showing a state, the start state, an accepting state, and a transition labeled a.]

Example TDs
>= :
state 0 (start): '>' → 6
state 6: '=' → 7, other → 8
state 7: RTN(GE)
state 8*: RTN(GT)
* : we've accepted ">" and have read one other character that must be unread.

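Example: All RELOPs
state 0 (start): '<' → 1, '=' → 5, '>' → 6
state 1: '=' → 2, '>' → 3, other → 4
state 2: return(relop, LE)
state 3: return(relop, NE)
state 4*: return(relop, LT)
state 5: return(relop, EQ)
state 6: '=' → 7, other → 8
state 7: return(relop, GE)
state 8*: return(relop, GT)
(* marks states in which the last character read must be unread.)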
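
The relop diagram can be coded state by state. The following is a sketch, not the slides' own code: the Relop enum, the NOT_RELOP value and the scan_relop signature are assumptions made for illustration. The case labels match the state numbers 0 to 8, and the two starred states report a lexeme one character shorter than what was read, which models the retraction.

```c
#include <stddef.h>

typedef enum { LT, LE, EQ, NE, GT, GE, NOT_RELOP } Relop;

/* Run the relop transition diagram on the start of s.
   Stores the lexeme length in *len (0 when s does not start with a relop). */
Relop scan_relop(const char *s, size_t *len) {
    int state = 0;
    size_t forward = 0;
    for (;;) {
        char c;
        switch (state) {
        case 0:
            c = s[forward++];
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else { *len = 0; return NOT_RELOP; }
            break;
        case 1:
            c = s[forward++];
            if (c == '=') state = 2;
            else if (c == '>') state = 3;
            else state = 4;                      /* "other" edge */
            break;
        case 2: *len = forward; return LE;
        case 3: *len = forward; return NE;
        case 4: *len = forward - 1; return LT;   /* starred: retract one char */
        case 5: *len = forward; return EQ;
        case 6:
            c = s[forward++];
            state = (c == '=') ? 7 : 8;
            break;
        case 7: *len = forward; return GE;
        case 8: *len = forward - 1; return GT;   /* starred: retract one char */
        }
    }
}
```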

Example TDs: id and delim
id:
state 9 (start): letter → 10
state 10: letter or digit → 10, other → 11
state 11*: return(get_token(), install_id())
install_id() returns either a ptr to the symbol table entry or "0" if the lexeme is a reserved word.

delim:
state 28 (start): delim → 29
state 29: delim → 29, other → 30
state 30*

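Example TDs: Unsigned Numbers
Numbers with an exponent:
state 12 (start): digit → 13
state 13: digit → 13, '.' → 14, E → 16
state 14: digit → 15
state 15: digit → 15, E → 16
state 16: '+' or '-' → 17, digit → 18
state 17: digit → 18
state 18: digit → 18, other → 19
state 19*

Numbers with a fraction but no exponent:
state 20 (start): digit → 21
state 21: digit → 21, '.' → 22
state 22: digit → 23
state 23: digit → 23, other → 24
state 24*

Integers:
state 25 (start): digit → 26
state 26: digit → 26, other → 27
state 27*

Each accepting state executes return(num, install_num()).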

QUESTION
What would the transition diagram (TD) for strings of letters that contain the five vowels in their strict lexicographical order look like?

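Answer
cons → B | C | D | F | … | Z
string → cons* A cons* E cons* I cons* O cons* U cons*
[Figure: TD with six states in a row, each with a cons self-loop; the edges between them are labeled A, E, I, O, U in that order, and the last state goes to accept on other. Every state also has an error path.]
Note: The error path is taken if the character is other than a cons or the vowel in the lex order.
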
What Else Does the Lexical Analyzer Do?
All keywords / reserved words are matched as ids
• After the match, the symbol table or a special keyword table is consulted
• The keyword table contains string versions of all keywords and their associated token values
    if      15
    then    16
    begin   17
    ...     ...
• When a match is found, the token is returned along with its symbolic value, i.e., "then", 16
• If a match is not found, then it is assumed that an id has been discovered
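
A minimal sketch of the keyword-table lookup follows. The three entries and their token values come from the table above; the Keyword struct, the classify function and the TOKEN_ID value are hypothetical, since the slides give no code for the id token.

```c
#include <stddef.h>
#include <string.h>

#define TOKEN_ID 1   /* hypothetical token value for id; not given in the slides */

typedef struct {
    const char *lexeme;  /* string version of the keyword */
    int token;           /* associated token value */
} Keyword;

static const Keyword keywords[] = {
    { "if", 15 }, { "then", 16 }, { "begin", 17 },
};

/* Called after the id pattern has matched: returns the keyword's token value
   if the lexeme is in the keyword table, and TOKEN_ID otherwise. */
int classify(const char *lexeme) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i].lexeme) == 0)
            return keywords[i].token;
    return TOKEN_ID;
}
```

A linear search is enough for a table this small; a real scanner would more likely use a hash table or a sorted array.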
Implementing Transition Diagrams
TD: state 25 --digit--> state 26 (digit self-loop) --other--> state 27*
.............
case 25: c = nextchar();                 /* nextchar() advances forward */
         if (isdigit(c)) state = 26;
         else state = fail();
         break;
case 26: c = nextchar();                 /* case numbers correspond to transition diagram states! */
         if (isdigit(c)) state = 26;
         else state = 27;
         break;
case 27: retract(1);                     /* retracts forward */
         lexical_value = install_num();  /* looks at the region lexeme_beginning ... forward */
         return ( NUM );
.............

Implementing Transition Diagram
• Mapping transition diagrams into C code
TD: start → state 9 --letter--> state 10 (letter or digit self-loop) --other--> state 11, return(id)
switch (state) {
…
case 9: c = nextchar();
        if (isletter(c)) state = 10; else state = failure();
        break;
case 10: ….
case 11: retract(1); insert(id); return;
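
For reference, here is one way the id diagram could be written out as a complete function. It is a sketch under assumptions: the input is a NUL-terminated string, nextchar() is replaced by indexing with a local forward counter, failure() is replaced by returning 0, and the body shown for case 10 is a guess at what the slide elides. The name scan_id is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns the length of the identifier at the start of s,
   or 0 if state 9 fails (s does not start with a letter). */
size_t scan_id(const char *s) {
    int state = 9;
    size_t forward = 0;
    for (;;) {
        int c;
        switch (state) {
        case 9:
            c = (unsigned char)s[forward++];
            if (isalpha(c)) state = 10;
            else return 0;                 /* stands in for failure() */
            break;
        case 10:
            c = (unsigned char)s[forward++];
            if (!isalnum(c)) state = 11;   /* "other" edge; else stay in 10 */
            break;
        case 11:
            forward--;                     /* retract(1) */
            return forward;
        }
    }
}
```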

RE and Finite Automata
• Regular expressions => specification
• Finite automata => implementation

Generative Versus Recognition
• Regular expressions give you a way to generate all strings in a language
• Automata give you a way to recognize whether a specific string is in a language
  • Philosophically very different
  • Theoretically equivalent (for regular expressions and automata)
• Standard approach
  • Use regular expressions when defining the language
  • Translate them into automata for implementation

Finite Automata & Language Theory
Finite automaton: a recognizer that takes an input string and determines whether it's a valid sentence of the language
[Figure: String (x) → FA → Yes / No]
An FA contains:
• a set of states (S)
• a set of input symbols (Σ)
• a start state (s0)
• a set of final (accepting) states (F)
• a set of transitions (δ)

Finite Automata: 2 Types
Non-deterministic: may have more than one alternative action for the same input symbol.
Deterministic: has at most one action for a given input symbol.
Both types are used to recognize regular expressions.

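NFA vs. DFA
• DFA
  • No ε transitions
  • At most one transition from each state for each letter
  [Figure: a state with two outgoing transitions both labeled a is NOT OK; a state with one transition labeled a and one labeled b is OK.]
• NFA: neither restriction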

NFAs & DFAs
Non-deterministic finite automata (NFAs) easily represent regular expressions, but are somewhat less precise.
Deterministic finite automata (DFAs) require more complexity to represent regular expressions, but offer more precision (and faster recognition).
We'll review both, plus conversion algorithms, i.e., NFA → DFA and DFA → NFA.

Non-Deterministic Finite Automata
An NFA is a mathematical model that consists of:
• S, a set of states
• Σ, the symbols of the input alphabet
• move, a transition function
  • move(state, symbol) → set of states
  • move : S × (Σ ∪ {ε}) → Pow(S)
• A state s0 ∈ S, the start state
• F ⊆ S, a set of final or accepting states

Representing NFAs
Transition diagrams: numbered states (circles), arcs, final states, …
Transition tables: more suitable for representation within a computer
    Advantage: faster access to the transitions of a given state on a given character
We'll see examples of both!

Example NFA: (a|b)*abb
S = {0, 1, 2, 3}
s0 = 0
F = {3}
Σ = {a, b}
Transition diagram: start → 0; 0 --a--> 0; 0 --b--> 0; 0 --a--> 1; 1 --b--> 2; 2 --b--> 3
Transition table:
          input
state     a         b
0         {0, 1}    {0}
1         --        {2}
2         --        {3}
3         --        --
ε (null) moves are possible: an edge i --ε--> j switches state but does not use any input symbol.

Acceptance of NFA
• An NFA accepts an input string s iff there is some path in the transition diagram from the start state to some final state such that the edge labels along this path spell out s

How Does an NFA Work?
[Figure: NFA for (a|b)*abb: start → 0; 0 loops on a and b; 0 --a--> 1 --b--> 2 --b--> 3; state 3 is accepting.]
• Given an input string, we trace moves
• If no more input & in final state, ACCEPT
EXAMPLE: Input: ababb
1st path                       2nd path (-OR-)
move(0, a) = 1                 move(0, a) = 0
move(1, b) = 2                 move(0, b) = 0
move(2, a) = ? (undefined)     move(0, a) = 1
REJECT!                        move(1, b) = 2
                               move(2, b) = 3
                               ACCEPT!
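
Instead of trying paths one at a time, a program can follow all of them at once by tracking the set of states the NFA could be in. The sketch below does this for the (a|b)*abb NFA, with each set stored as a bitmask (bit i stands for state i). The bitmask encoding and the names move and nfa_accepts are illustrative choices, not part of the slides.

```c
#include <stdbool.h>

/* Apply move to every state in the set and union the results. */
static unsigned move(unsigned states, char c) {
    unsigned next = 0;
    if (states & 1u) {                          /* state 0: a -> {0,1}, b -> {0} */
        if (c == 'a') next |= 1u | 2u;
        if (c == 'b') next |= 1u;
    }
    if ((states & 2u) && c == 'b') next |= 4u;  /* state 1: b -> {2} */
    if ((states & 4u) && c == 'b') next |= 8u;  /* state 2: b -> {3} */
    return next;
}

/* Accept iff some path spelling out the input ends in final state 3. */
bool nfa_accepts(const char *input) {
    unsigned current = 1u;                      /* start in {0} */
    for (; *input; input++)
        current = move(current, *input);
    return (current & 8u) != 0;
}
```

On ababb the set evolves as {0}, {0,1}, {0,2}, {0,1}, {0,2}, {0,3}, so the accepting 2nd path is found without backtracking from the 1st.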
Question
Draw a TD for an NFA accepting aa* | bb*

Transition Diagram: aa* | bb*
[Figure: start → 0; 0 --ε--> 1; 1 --a--> 2; 2 --a--> 2; 0 --ε--> 3; 3 --b--> 4; 4 --b--> 4; states 2 and 4 are accepting.]

Deterministic Finite Automata
• A DFA is an NFA with the following restrictions:
  • ε moves are not allowed
  • For every state s ∈ S, there is one and only one path from s for every input symbol a ∈ Σ.
Since transition tables don't have any alternative options, DFAs are easily simulated via an algorithm.

Deterministic Finite Automata
Input: An input string x terminated by an end-of-file character eof. A DFA D with start state s0 and set of accepting states F.
Output: the answer "yes" if D accepts x; "no" otherwise.
s ← s0
c ← nextchar;
while c ≠ eof do
    s ← move(s, c);
    c ← nextchar;
end;
if s is in F then return "yes"
else return "no"

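Example - DFA Transition Table
          input
state     a    b
0         1    0
1         1    2
2         1    3
3         1    0
[Figure: DFA diagram with start state 0 and accepting state 3; its edges follow the table above.]
Recall the original NFA:
[Figure: NFA for (a|b)*abb: start → 0; 0 loops on a and b; 0 --a--> 1 --b--> 2 --b--> 3; state 3 is accepting.]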
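
The simulation algorithm and this transition table combine into a few lines of C. In this sketch the NUL terminator of a C string plays the role of eof, and a character outside {a, b} is rejected outright; both choices, and the names dfa_move and dfa_accepts, are assumptions made for illustration.

```c
#include <stdbool.h>

/* dfa_move[s][0] is the move on a, dfa_move[s][1] the move on b. */
static const int dfa_move[4][2] = {
    /* a  b */
    {  1, 0 },   /* state 0 */
    {  1, 2 },   /* state 1 */
    {  1, 3 },   /* state 2 */
    {  1, 0 },   /* state 3 */
};

/* Returns true ("yes") iff the DFA accepts x. */
bool dfa_accepts(const char *x) {
    int s = 0;                                     /* s <- s0 */
    for (; *x; x++) {                              /* while c != eof */
        if (*x != 'a' && *x != 'b') return false;  /* not in the alphabet */
        s = dfa_move[s][*x == 'b'];                /* s <- move(s, c) */
    }
    return s == 3;                                 /* is s in F = {3}? */
}
```

Each input character costs exactly one table lookup, which is why the DFA form is preferred for implementation.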
Conversion of NFA to DFA
• Why?
  • A DFA is difficult to construct directly from REs
  • An NFA is difficult to represent in a computer program and inefficient to compute

Conversion: NFA → DFA Algorithm
• The algorithm constructs a transition table for the DFA from the NFA
• Each state in the DFA corresponds to a SET of states of the NFA
• Why does this occur?
  • ε moves
  • non-determinism
  Both require us to characterize multiple situations that occur for accepting the same string.
  (Recall: the same input can have multiple paths in an NFA)
• Key issue: reconciling AMBIGUITY!

From an NFA to a DFA
A set of NFA states corresponds to a DFA state
• Find the initial state of the DFA
• Find all the states in the DFA
• Construct the transition table
• Find the final states of the DFA

Construction of DFA from NFA
Algorithm: subset construction
Input: NFA (N)
Output: DFA (D)
Method:
Initial state of D: the set of states consisting of s0, the initial state of N, together with all states of N that can be reached from s0 by means of ε transitions only
Accepting states of D: every set of states that contains at least 1 accepting state of N

Subset Construction Algorithm
while there is an unmarked state x = {s0, s1, s2, …, sn} of D do
begin
    mark x
    for each input symbol a do
    begin
        let T be the set of states to which there is a transition on a from some state si in x
        y = ε-closure of T
        if y has not yet been added to the set of states of D then
            make y an unmarked state of D
        add a transition from x to y labeled a if not already present
    end
end

Computing the ε-closure
push all states in T onto stack;
initialize ε-closure(T) to T;
while stack is not empty do begin
    pop t, the top element, off the stack;
    for each state u with an edge from t to u labeled ε do
        if u is not in ε-closure(T) do begin
            add u to ε-closure(T);
            push u onto stack
        end
end
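
The ε-closure computation maps onto C almost line by line. This sketch assumes an NFA for aa* | bb* with five states in which the only ε edges run from state 0 to states 1 and 3; that edge set is an assumption about the earlier diagram, whose ε labels did not survive. Sets are bitmasks (bit i stands for state i), and the names eps and eps_closure are hypothetical.

```c
#define NSTATES 5

/* eps[t]: bitmask of the states reachable from t by a single ε edge.
   Assumed NFA for aa* | bb*: 0 --ε--> 1 and 0 --ε--> 3. */
static const unsigned eps[NSTATES] = { (1u << 1) | (1u << 3), 0, 0, 0, 0 };

unsigned eps_closure(unsigned T) {
    int stack[NSTATES];    /* each state is pushed at most once */
    int top = 0;
    unsigned closure = T;                     /* initialize ε-closure(T) to T */

    for (int s = 0; s < NSTATES; s++)         /* push all states in T */
        if (T & (1u << s)) stack[top++] = s;

    while (top > 0) {
        int t = stack[--top];                 /* pop t off the stack */
        for (int u = 0; u < NSTATES; u++) {
            unsigned bit = 1u << u;
            if ((eps[t] & bit) && !(closure & bit)) {
                closure |= bit;               /* add u to ε-closure(T) */
                stack[top++] = u;             /* push u onto the stack */
            }
        }
    }
    return closure;
}
```

Under the assumed edges, the closure of {0} is {0, 1, 3}, which would be the initial state of the corresponding DFA.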
NFA to DFA conversion - example 1
[Figure: NFA for (a|b)*abb: start → 0; 0 loops on a and b; 0 --a--> 1 --b--> 2 --b--> 3; state 3 is accepting.]
δ(0, a) = {0,1}
δ(0, b) = {0}
δ({0,1}, a) = {0,1}
δ({0,1}, b) = {0,2}
δ({0,2}, a) = {0,1}
δ({0,2}, b) = {0,3}
δ({0,3}, a) = {0,1}
δ({0,3}, b) = {0}

New states: A = {0}, B = {0,1}, C = {0,2}, D = {0,3}
        a    b
A       B    A
B       B    C
C       B    D
D       B    A
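
The same table can be produced mechanically. The sketch below runs the subset construction on the (a|b)*abb NFA, storing each set of NFA states as a bitmask. Because this NFA has no ε edges, the ε-closure step is the identity and is omitted. The Dfa struct and the function names are hypothetical.

```c
#define MAX_DFA 16   /* 4 NFA states give at most 2^4 subsets */

/* Union of the NFA moves of every state in `set`; sym 0 is a, sym 1 is b. */
static unsigned nfa_move(unsigned set, int sym) {
    unsigned out = 0;
    if (set & 1u) out |= (sym == 0) ? 3u : 1u;   /* 0: a -> {0,1}, b -> {0} */
    if ((set & 2u) && sym == 1) out |= 4u;       /* 1: b -> {2} */
    if ((set & 4u) && sym == 1) out |= 8u;       /* 2: b -> {3} */
    return out;
}

typedef struct {
    unsigned sets[MAX_DFA];   /* NFA-state set behind each DFA state */
    int trans[MAX_DFA][2];    /* trans[x][sym] = index of the target DFA state */
    int count;                /* number of DFA states found so far */
} Dfa;

/* Return the index of `set`, adding it as a new unmarked state if needed. */
static int find_or_add(Dfa *d, unsigned set) {
    for (int i = 0; i < d->count; i++)
        if (d->sets[i] == set) return i;
    d->sets[d->count] = set;
    return d->count++;
}

void subset_construction(Dfa *d) {
    d->count = 0;
    find_or_add(d, 1u);                       /* initial DFA state {0} */
    for (int x = 0; x < d->count; x++)        /* states below x are marked */
        for (int sym = 0; sym < 2; sym++)
            d->trans[x][sym] = find_or_add(d, nfa_move(d->sets[x], sym));
}
```

The states are discovered in the order A, B, C, D (indices 0 to 3), and a DFA state is accepting when its set contains NFA state 3, which here is only D.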
NFA to DFA conversion (cont.)
[Figure: resulting DFA: start → A; A --a--> B; A --b--> A; B --a--> B; B --b--> C; C --a--> B; C --b--> D; D --a--> B; D --b--> A; D is accepting.]
        a    b
A       B    A
B       B    C
C       B    D
D       B    A
