Professional Documents
Culture Documents
Lexical Analysis: DFA Minimization & Wrap Up
Lexical Analysis: DFA Minimization & Wrap Up
Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.
Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these
materials for their personal use.
Automating Scanner Construction
T
S
S has transitions
on to R, Q, & T
Q
R
T
S1
S2
S2 is everything Q
in S - S1 R
a a a b
s0
a
s1
b
s3
b
s4 s0 s1 s2
s1 s1 s3
a
b a b s2 s1 s2
s2 s3 s1 s4
b s4 s1 s2
final state
a a b a
a
s0 a s1 b s3 b s4 a b b
s 0 , s2 s1 s3 s4
b a a b a b
s2
b
DFA Minimization
What about a ( b | c )* ?
b q5
q4
q0 a q1 q2 q3 q8 q9
q6 c q7
First, the subset construction:
Final states
DFA Minimization
S p lit on s2
b
Cu rren t Pa rt ition a b c
a
s0 s1 b c
P0 { s 1 , s 2 , s 3 } {s 0 } n o ne n o ne n o ne
c
s3
final states c
To produce the minimal DFA
minimal
RE NFA DFA
DFA
Abbreviated Register Specification
Thompson’s construction produces
r 0
r 1
r 2
… …
…
s0 sf
r 8
r 9
sf0
0 1 sf1
r 2
…
sf 2
s0
8
9
sf8
sf9
This is a DFA, but it has a lot of states …
minimal
RE NFA DFA
DFA
Abbreviated Register Specification
The DFA minimization algorithm builds
0,1,2,3,4,
r 5,6,7,8,9
s0 sf
minimal
RE NFA DFA
DFA
Limits of Regular Languages
Advantages of Regular Expressions
• Simple & powerful notation for specifying patterns
• Automatic construction of fast recognizers
• Many kinds of syntax can be specified with REs
Example — an expression grammar
Term [a-zA-Z] ([a-zA-z] | [0-9])*
Op +|-||/
Expr ( Term Op )* Term
Of course, this would generate a DFA …
INTEGERFUNC TIONA
P ARAMETER(A=6 ,B=2 )
IMPLICIT CHARA CTER*(A-B)(A-B)
INTEGER FORMAT(1 0), IF(1 0), DO9 E1
10 0 FOR MAT (4H) =(3)
20 0 FOR MAT (4 )=(3 )
DO9 E1=1
How does a compiler scan this?
DO9 E1=1,2 • First pass finds & inserts blanks
9 IF(X)=1 • Can add extra words or tags to
IF(X)H=1 create a scannable language
IF(X)3 00 ,2 00
• Second pass is normal scanner
30 0 CONTINUE
END
C THIS IS A “COMMENT CARD”
$ FILE(1)
END Example due to Dr. F.K. Zadeck
Building Faster Scanners from the DFA
Table-driven recognizers waste effort
• Read (& classify) the next character
char next character;
• Find the next state state s0 ;
• Assign to the state variable call action(state,char);
• Trip through case logic in action() while (char eof)
state (state,char);
• Branch back to the top call action(state,char);
char next character;
We can do better
if (state) = final then
• Encode state & actions in the code report acceptance;
• Do transition tests locally else
report failure;
• Generate ugly, spaghetti-like code
• Takes (many) fewer operations per input character
Building Faster Scanners from the DFA
A direct-coded recognizer for r Digit Digit*
goto s0;
s0: word Ø; s2: word word + char;
char next character; char next character;
if (char = ‘r’) if (‘0’ ≤ char ≤ ‘9’)
then goto s1; then goto s2;
else goto se; else if (char = eof)
s1: word word + char; then report success;
char next character; else goto se;
if (‘0’ ≤ char ≤ ‘9’) se: print error message;
then goto s2; return failure;
else goto se;
• Many fewer operations per character
• Almost no memory operations
• Even faster with careful use of fall-through cases
Building Faster Scanners
Hashing keywords versus encoding them directly
• Some (well-known) compilers recognize keywords as
identifiers and check them in a hash table
• Encoding keywords in the DFA is a better idea
O(1) cost per transition
Avoids hash lookup on each identifier
I
S
R