Professional Documents
Culture Documents
Compiler Design
Compiler Design
CH3.1
CSE309N
Lexical Analysis
Basic Concepts & Regular Expressions
What does a Lexical Analyzer do?
How does it Work?
Formalizing Token Definition & Recognition
CH3.2
CSE309N
Lexical Analyzer in Perspective
token
source lexical
analyzer parser
program
get next
token
symbol
table
Important Issue:
What are Responsibilities of each Box ?
Focus on Lexical Analyzer and Parser.
CH3.3
CSE309N
Lexical Analyzer in Perspective
LEXICAL ANALYZER PARSER
Scan Input Perform Syntax Analysis
Remove WS, NL, … Actions Dictated by Token
Order
Identify Tokens
Update Symbol Table Entries
Create Symbol Table
Create Abstract Rep. of
Insert Tokens into ST Source
Generate Errors Generate Errors
Send Tokens to Parser And More…. (We’ll see later)
CH3.4
CSE309N
What Factors Have Influenced the
Functional Division of Labor ?
Separation of Lexical Analysis From Parsing Presents a
Simpler Conceptual Model
A parser embodying the conventions for comments and white
space is significantly more complex that one that can assume
comments and white space have already been removed by lexical
analyzer.
CH3.5
CSE309N
Introducing Basic Terminology
What are Major Terms for Lexical Analysis?
TOKEN
A classification for a common set of strings
Examples Include <Identifier>, <number>, etc.
PATTERN
The rules which characterize the set of strings for a token
Recall File and OS Wildcards ([A-Z]*.*)
LEXEME
Actual sequence of characters that matches pattern and is
classified by a token
Identifiers: x, count, name, etc…
CH3.6
CSE309N
Introducing Basic Terminology
CH3.7
CSE309N
Attributes for Tokens
Example: E = M * C ** 2
CH3.9
CSE309N
Buffer Pairs
Lexical analyzer needs to look ahead several characters beyond the
lexeme for a pattern before a match can be announced.
CH3.10
CSE309N
Buffer Pairs (2)
E = M * C * * 2 eof
lexeme beginning
forward (scans
ahead to find
pattern match)
Pitfalls
1. This buffering scheme works quite well most of the time but with
it amount of lookahead is limited.
2. Limited lookahead makes it impossible to recognize tokens in
situations where the distance, forward pointer must travel is more
than the length of buffer.
CH3.12
CSE309N
Algorithm: Buffered I/O with Sentinels
Current token
CH3.14
CSE309N
Language Concepts
A language, L, is simply any set of strings
over a fixed alphabet.
Alphabet Languages
{0,1} {0,10,100,1000,100000…}
{0,1,00,11,000,111,…}
{a,b,c} {abc,aabbcc,aaabbbccc,…}
{A, … ,Z} {TEE,FORE,BALL,…}
{FOR,WHILE,GOTO,…}
{A,…,Z,a,…,z,0,…9, { All legal PASCAL progs}
+,-,…,<,>,…}
OPERATION DEFINITION
union of L and M L M = {s | s is in L or s is in M}
written L M
concatenation of L LM = {st | s is in L and t is in M}
and M written LM
Kleene closure of L L*= Li
written L* i 0
CH3.16
CSE309N
Formal Language Operations
Examples
L = {A, B, C, D } D = {1, 2, 3}
L D = {A, B, C, D, 1, 2, 3 }
LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3 }
L2 = { AA, AB, AC, AD, BA, BB, BC, BD, CA, … DD}
L4 = L2 L2 = ??
L* = { All possible strings of L plus }
L+ = L* -
L (L D ) = ??
L (L D )* = ??
CH3.17
CSE309N
Language & Regular Expressions
A Regular Expression is a Set of Rules / Techniques for Constructing
Sequences of Symbols (Strings) From an Alphabet.
CH3.18
CSE309N
Rules for Specifying Regular Expressions:
L = {A, B, C, D } D = {1, 2, 3}
A|B |C|D =L
(A | B | C | D ) (A | B | C | D ) = L2
(A | B | C | D )* = L*
(A | B | C | D ) ((A | B | C | D ) | ( 1 | 2 | 3 )) = L (L D)
CH3.20
CSE309N
Algebraic Properties of
Regular Expressions
AXIOM DESCRIPTION
r|s=s|r | is commutative
r | (s | t) = (r | s) | t | is associative
(r s) t = r (s t) concatenation is associative
r(s|t)=rs|rt
(s|t)r=sr|tr concatenation distributes over |
r = r
r = r Is the identity element for concatenation
CH3.21
CSE309N
Regular Expression Examples
CH3.22
CSE309N
Regular Definitions
We may give names to regular expressions and to define regular
expression using these names as if they were symbols.
dn rn
Where,
each di is a distinct name, and
each ri is a regular expression over the symbols in
{d1, d2, …, di-1 }
CH3.23
CSE309N
Towards Token Definition
id [A-Za-z][A-Za-z0-9]*
CH3.24
CSE309N
Unsigned Number 1240, 39.45, 6.33E15, or 1.578E-41
digit 0 | 1 | 2 | … | 9
digits digit digit*
optional_fraction . digits |
optional_exponent ( E ( + | -| ) digits) |
num digits optional_fraction optional_exponent
Shorthand
digit 0 | 1 | 2 | … | 9
digits digit+
optional_fraction (. digits ) ?
optional_exponent ( E ( + | -) ? digits) ?
num digits optional_fraction optional_exponent
CH3.25
CSE309N
Token Recognition
How can we use concepts developed so far to assist in
recognizing tokens of a source language ?
Assume Following Tokens:
if, then, else, relop, id, num
blank blank
tab tab
newline newline
delim blank | tab | newline
ws delim +
Note: Each token has a unique token identifier to define category of lexemes
CH3.28
CSE309N
Constructing Transition Diagrams for Tokens
CH3.29
CSE309N
Example TDs
other
8 * RTN(GT)
start < =
0 1 2 return(relop, LE)
>
3 return(relop, NE)
other
= 4
*
return(relop, LT)
5 return(relop, EQ)
>
=
6 7 return(relop, GE)
other
8
*
return(relop, GT)
CH3.31
CSE309N
Example TDs : id and delim
id :
letter or digit
CH3.32
CSE309N
Example TDs : Unsigned #s
digit digit digit
E digit
digit digit
return(num, install_num())
digit
CH3.34
CSE309N
Answer
cons B | C | D | F | … | Z
string cons* A cons* E cons* I cons* O cons* U cons*
accept
CH3.36
CSE309N
Implementing Transition Diagrams
CH3.37
CSE309N
Implementing Transition Diagrams
int state = 0, start = 0
lexeme_beginning = forward;
token nexttoken()
{ while(1) {
switch (state) {
case 0: c = nextchar();
repeat /* c is lookahead character */
if (c== blank || c==tab || c== newline) {
until start < =
state = 0;
a “return” 0 1 2
lexeme_beginning++;
occurs >
/* advance 3
beginning of lexeme */ other
} *
= 4
else if (c == „<„) state = 1;
else if (c == „=„) state = 5; 5
>
else if (c == „>‟) state = 6;
else state = fail(); =
6 7
break;
other
… /* cases 1-8 here */ CH3.38 8
*
CSE309N
Implementing Transition Diagrams, II
digit
*
digit other
25 26 27
advances
............. forward
case 25; c = nextchar();
if (isdigit(c)) state = 26;
else state = fail();
Case numbers
break;
case 26; c = nextchar();
correspond to transition
if (isdigit(c)) state = 26;
diagram states !
else state = 27;
break;
case 27; retract(1); lexical_value = install_num();
return ( NUM );
.............
looks at the region
retracts lexeme_beginning ... forward
forward CH3.39
CSE309N
Implementing Transition Diagrams, III
.............
case 9: c = nextchar();
if (isletter(c)) state = 10;
else state = fail();
break;
case 10; c = nextchar();
if (isletter(c)) state = 10;
else if (isdigit(c)) state = 10;
else state = 11;
break;
case 11; retract(1); lexical_value = install_id();
return ( gettoken(lexical_value) );
.............
letter or digit
*
letter other
reads token 9 10 11
name from ST
CH3.40
CSE309N
When Failures Occur:
int fail()
{
forward = lexeme beginning;
switch (start) {
case 0: start = 9; break;
case 9: start = 12; break;
case 12: start = 20; break;
case 20: start = 25; break;
Switch to
case 25: recover(); break;
next transition
default: /* lex error */
diagram
}
return start;
}
CH3.41
CSE309N
Finite Automata & Language Theory
CH3.42
CSE309N
NFAs & DFAs
CH3.43
CSE309N
Non-Deterministic Finite Automata
CH3.44
CSE309N
Representing NFAs
CH3.45
CSE309N
Example NFA
S = { 0, 1, 2, 3 } a
start a b b
s0 = 0 0 1 2 3
F={3}
b
= { a, b }
What Language is defined ?
a
start a b b
0 1 2 3
CH3.47
CSE309N
Conversion of NFA to DFA
CH3.48
CSE309N
Handling Undefined Transitions
a
start a b b
0 1 2 3
b a
a
a, b
4
CH3.49
CSE309N
NFA- Regular Expressions & Compilation
Problems with NFAs for Regular Expressions:
1. Valid input might not be accepted
2. NFA may behave differently on the same input
CH3.50
CSE309N
Second NFA Example
CH3.51
CSE309N
Second NFA Example - Solution
start a b
0 1
c
3 c 5
String abbc can be accepted.
CH3.52
CSE309N
Alternative Solution Strategy
b
a c
a (b*c) 1 2 3
a (b | c+)? 4 a 5 b
c
c
Now that you have the individual 7
diagrams, “or” them as follows:
CH3.53
CSE309N
Using Null Transitions to “OR” NFAs
b
a c
1 2 3
0 6
4 a 5 b
c
c
7
CH3.54
CSE309N
Other Concepts
CH3.55
CSE309N
Deterministic Finite Automata
CH3.56
CSE309N
Example - DFA
b
a
start a b b
0 1 2 3
a
b a
What Language is Accepted?
a
start a b b
0 1 2 3
CH3.57
CSE309N
Conversion : NFA DFA Algorithm
CH3.58
CSE309N
Converting NFA to DFA – 1st Look
a 3 b 4
2
0 1 5 8
6 c 7
0, 1, 2, 6, 8 a 3
a
c b
1, 2, 5, 6, 7, 8
c 1, 2, 4, 5, 6, 8
c
Which States are FINAL States ?
a
A How do we handle
B
a a alphabet symbols not
c b defined for A, B, C, D ?
D
c C
c
CH3.60
CSE309N
Algorithm Concepts
NFA N = ( S, , s0, F, MOVE )
-Closure(s) : s S
: set of states in S that are reachable
No input is from s via -moves of N that originate
consumed
from s.
-Closure(T) : T S
: NFA states reachable from all t T
on -moves only.
move(T,a) : T S, a
: Set of states to which there is a
transition on input a from some t T
start a b
0 1 6 7 8 9
b
b
4 5 10
b : -closure(move(A,b)) = -closure(move({0,1,2,4,7},b))
adds {5} ( since move(4,b)=5)
b : -closure(move(B,b)) = -closure(move({1,2,3,4,6,7,8},b))}
= {1,2,4,5,6,7,9} = D
Define Dtran[B,b] = D.
b : -closure(move(C,b)) = -closure(move({1,2,4,5,6,7},b))}
= {1,2,4,5,6,7} = C
Define Dtran[C,b] = C.
CH3.64
CSE309N
Conversion Example – continued (3)
5th , we calculate for state D on {a,b}
a : -closure(move(D,a)) = -closure(move({1,2,4,5,6,7,9},a))}
= {1,2,3,4,6,7,8} = B
Define Dtran[D,a] = B.
b : -closure(move(D,b)) = -closure(move({1,2,4,5,6,7,9},b))}
= {1,2,4,5,6,7,10} = E
Define Dtran[D,b] = E.
b : -closure(move(E,b)) = -closure(move({1,2,4,5,6,7,10},b))}
= {1,2,4,5,6,7} = C
Define Dtran[E,b] = C.
CH3.65
CSE309N
Conversion Example – continued (4)
This gives the transition table Dtran for the DFA of:
Input Symbol
Dstates a b
A B C
B B D
C B C
D B E
E B C
b C b
start A a B b D b E
a
a a
CH3.66
CSE309N
Algorithm For Subset Construction
CH3.67
CSE309N
Algorithm For Subset Construction – (2)
CH3.68
CSE309N
Regular Expression to NFA Construction
We now focus on transforming a Reg. Expr. to an NFA
This construction allows us to take:
• Regular Expressions (which describe tokens)
• To an NFA (to characterize language)
• To a DFA (which can be “computerized”)
The construction process is component-wise
Builds NFA from components of the regular
expression in a special order with particular
techniques.
NOTE: Construction is “syntax-directed” translation, i.e., syntax of
regular expression is determining factor for NFA construction and
structure. CH3.69
CSE309N
Motivation: Construct NFA For:
:
a:
b:
ab:
| ab :
a*
( | ab )* :
CH3.70
CSE309N
Motivation: Construct NFA For:
: start i f
a
a: start 0 1
start b
b: A B
ab: start a b
0 1 A B
| ab :
a*
( | ab )* :
CH3.71
CSE309N
Construction Algorithm : R.E. NFA
Construction Process :
1st : Identify subexpressions of the regular expression
symbols
r|s
rs
r*
CH3.72
CSE309N
Piecing Together NFAs
Algorithm: Thompson’s Construction
start L()
i f
start a
i f L(a)
CH3.73
CSE309N
Piecing Together NFAs – continued(1)
CH3.74
CSE309N
Piecing Together NFAs – continued(2)
3.(b) If s, t are regular expressions, N(s), N(t) their NFAs
st (concatenation) has NFA:
start
i N(s) N(t) f L(s) L(t)
overlap
Alternative:
start i f
N(s) N(t)
start i f
N(s)
CH3.76
CSE309N
Properties of Construction
CH3.77
CSE309N
Detailed Example
See example in textbook for (a | b)*abb
2nd Example - (ab*c) | (a(b|c*))
Parse Tree for this regular expression:
r13
r5 | r12
r3 r4 r11 r10
( )
a a r9
r1 r2
r7 | r8
r0 c
* r6
b *
b
c
What is the NFA? Let’s construct it ! CH3.78
CSE309N
Detailed Example – Construction(1)
r0: b
r3: a r1: b
r2: c
r4 : r1 r2 b c
a b c
r5 : r3 r4
CH3.79
CSE309N
Detailed Example – Construction(2)
r7: b
r8: c
r11: a
b
r6: c
r9 : r7 | r 8 c
r10 : r9
b
r13 : r5 | r12
a b c
2 3 4 5 6 7
17
1 b
10 11
a c
8 9 12 13 14 15 16
CH3.81
CSE309N
Direct Simulation of an NFA
s s0
c nextchar;
while c eof do
s move(s,c); DFA
c nextchar;
end;
simulation
if s is in F then return “yes”
else return “no”
S -closure({s0})
c nextchar;
while c eof do
NFA
S -closure(move(S,c));
c nextchar; simulation
end;
if SF then return “yes”
else return “no”
CH3.82
CSE309N
Final Notes : R.E. to NFA Construction
FA
Simulator
P1 : a {action}
P2 : abb {action} 3 patterns
P3 : a*b+ {action}
NFA’s :
P1
start a
1 2
P2
start a b b
3 4 5 6
a b
P3
start
7 8
b
CH3.87
CSE309N
Pattern Matching Based on NFA
continued (2)
Combined NFA :
a P1
1 2
start a b b
0 3 4 5 6 P2
a b
P3
7 8
b
Examples a a b a
{0,1,3,7} {2,4,7} {7} {8} death
pattern matched: - P1 - P3 -
a b b
{0,1,3,7} {2,4,7} {5,8} {6,8}
break tie in
pattern matched: - P1 P3 P2,P3 favor of P2
CH3.88
CSE309N
DFA for Lexical Analyzers
Alternatively Construct DFA:
keep track of correspondence between patterns and new accepting states
Input Symbol
STATE a b Pattern
{0,1,3,7} {2,4,7} {8} none
{2,4,7} {7} {5,8} P1
{8} - {8} P3
{7} {7} {8} none
{5,8} - {6,8} P3
break tie in
{6,8} - {8} P2 favor of P2
CH3.89
CSE309N
Example
Input: aaba
{0,1,3,7} {2,4,7} {7} {8}
Input: aba
{0,1,3,7} {2,4,7} {5,8} P3
Input Symbol
STATE a b Pattern
{0,1,3,7} {2,4,7} {8} none
{2,4,7} {7} {5,8} P1
{8} - {8} P3
{7} {7} {8} none
{5,8} - {6,8} P3
{6,8} - {8} P2
CH3.90
CSE309N
Optimization of DFA based Pattern Matching
Our Target:
1. Construct DFA directly from Regular Expression
2. Minimizes no of states of DFA
3. Produce fast but more compact representations for
transition table of DFA than a straightforward two
dimensional table
CH3.91
CSE309N
Important States of an NFA
Important States of NFA:
if it has a non- out-transition
• The subset construction algorithm uses only the important
states in a subset T when it determines -closure(move(T,a))
• The resulting NFA has exactly one accepting state, but the accepting
state is not important because it has no transition leaving it.
• We give the accepting state a transition on #, making it important state
of NFA.
• By using augmented regular expression (r)#, we can forget about
accepting states as the subset construction proceeds.
• When the construction is complete, any DFA state with a transition on #
must be an accepting state.
CH3.92
CSE309N
Example: (a | b)*abb
#
6
b
5
b
4
a •Each leaf may a symbol /
3 •If not attach a unique integer
a b
1 2
CH3.93
CSE309N NFA Vs DFA
a
1 C
start a b
A B E 3 4 5
b
b #
2 D F 6
b
b
a a a
CH3.94
CSE309N
Minimizing the Number of States of DFA
a
C b
A B D E
G1 G2
E ABCD
No Change a G2
b G2 G1
(ABC) (D)
CH3.96
CSE309N
a
C b
A B D E
G1 G2 G3
ABC D E
a G1 No Change No change
b G1 G2
(AC) (B)
CH3.97
CSE309N
a
C b
A B D E
G1 G2 G3 G4
AC B D E
No Change No Change No Change No change
CH3.98
CSE309N
a
C b
A B D E
a
b
A B D E
CH3.99
CSE309N
example
a
a
A a F
B
a
b b
a
D
b C
b
b
a
A,C,D a
B,F
b
Minimized DFA:
b
CH3.100
CSE309N
Concluding Remarks
Looking Ahead:
The next step in the compilation process is Parsing:
- Top-down vs. Bottom-up
-- Relationship to Language Theory
CH3.101
CSE309N
The End
CH3.102