Radu Prodan


Phases of a Compiler
[Figure: compiler pipeline divided into front-end and back-end: Source Code → Syntax Tree → Annotated Tree → Intermediate Representation → Target Code → Target Code. The Literal Table, Symbol Table and Error Handler are drawn below the phases.]


▪ Introduction

▪ Regular expressions

▪ Deterministic Finite Automata (DFA)

▪ Automatic DFA generation

▪ Conclusions

▪ Lexical analysis or scanning
– Input: source program (ASCII file)
– Output: sequence of tokens

▪ Token
– Corresponds to a word in a natural language
– Keywords: if, while, for, int, float
– Identifiers: user-defined names of variable length, beginning with a letter
– Special symbols: arithmetic or logical operators such as +, *, /, <, >, =, <>

▪ Scanning is a special case of pattern matching

– Specified using regular expressions
– Implemented using finite automata

▪ Must be as efficient and fast as possible

– Other compiler phases have a much higher time overhead

▪ Token types are usually defined as an enumerated type
– typedef enum { IF, THEN, ELSE, PLUS, MINUS, NUM, ID, … } TokenType;

▪ Reserved words (fixed strings of characters, also called keywords)

– IF, THEN, ELSE represent the strings of characters “if”, “then”, “else”

▪ Special symbols
– PLUS, MINUS represent the strings of characters “+”, “-”

▪ Identifiers
– ID can represent many user-defined variables or identifiers (“a”, “b”, “c”)

▪ Numbers
– NUM can represent many numbers or values (1, 2, 3, ...)

▪ Attribute
– A token has one or more associated attributes

▪ Lexeme or string value

– String matched by the token

▪ NUM token type

– Lexeme: “32767” (string value)
– Attribute: 32767 (integer)

▪ PLUS token type

– Lexeme: “+” (string value)
– Attribute: + (arithmetic operator)

▪ ID token type
– Lexeme: “x”, “i”, “j”, “tmp”, “var” (string value)

Token Record
▪ One field per attribute

typedef struct
{ TokenType tokenval;
  char *stringval;
  int intval;
  float floatval;
} TokenRecord;

▪ Attributes stored in a union

typedef struct
{ TokenType tokenval;
  union
  { char *stringval;
    int intval;
    float floatval;
  } attribute;
} TokenRecord;
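
The token record can also be sketched in Python. This is only an illustration, not part of the original slides: the class layout, the `lexeme` field name and the single `attribute` field (standing in for the C union) are assumptions of this sketch.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Union


class TokenType(Enum):
    # Same token types as in the enumerated type of the slides.
    IF = auto()
    THEN = auto()
    ELSE = auto()
    PLUS = auto()
    MINUS = auto()
    NUM = auto()
    ID = auto()


@dataclass
class TokenRecord:
    tokenval: TokenType                             # token type
    lexeme: str                                     # string matched by the token
    attribute: Union[str, int, float, None] = None  # plays the role of the union


# NUM token: lexeme "32767" (string value), attribute 32767 (integer).
num = TokenRecord(TokenType.NUM, "32767", int("32767"))
# PLUS token: the lexeme "+" is the only information needed.
plus = TokenRecord(TokenType.PLUS, "+")
```

Only one attribute is live at a time, which is why a union (here a single optional field) is sufficient.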

▪ Parser
– Controls scanner operation
– Calls a function that returns the next token on demand
– TokenType getToken(void);
– The call also computes additional token attributes

▪ Example: a[index] = 4 + 2
– First call to getToken consumes the character a
– Returned token: ID
– String value: “a”

[Figure: Source program → Scanner → getToken → Parser (analysis), together with the Symbol Table]
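
The on-demand interface between parser and scanner can be sketched as follows. This is a hypothetical Python illustration: `make_scanner`, the token names `ID`, `NUM`, `SYM`, `EOF` and the small symbol set are assumptions of the sketch, not the interface used in the slides.

```python
import re

# One named group per token class; only the symbols needed for the example.
TOKEN_RE = re.compile(
    r"(?P<ID>[a-zA-Z][a-zA-Z0-9]*)|(?P<NUM>[0-9]+)|(?P<SYM>[\[\]=+])"
)


def make_scanner(source):
    """Return a get_token() function that yields one token per call."""
    pos = 0

    def get_token():
        nonlocal pos
        # Skip whitespace, which only delimits tokens.
        while pos < len(source) and source[pos].isspace():
            pos += 1
        if pos == len(source):
            return ("EOF", "")
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        pos = m.end()
        return (m.lastgroup, m.group())   # (token type, string value)

    return get_token


get_token = make_scanner("a[index] = 4 + 2")
first = get_token()   # the parser asks for the first token
```

Each further call to `get_token()` continues where the previous one stopped, so the parser stays in control of how far the input is scanned.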

▪ Complete automation of lexical analysis phase

▪ Regular expressions
– Specify program tokens

▪ Automatic lexical analysis generation tool

Regular Expressions
▪ Alphabet Σ
– Set of legal symbols

▪ Language L(r)
– Generated by regular expression r

▪ Basic regular expressions
– Single character a ∈ Σ: L(a) = { a }
– Empty string: L(ε) = { ε }
– Empty set: L(∅) = { }

▪ OR operator: r|s
– L(r|s) = L(r) ∪ L(s)

▪ Concatenation: rs
– L(rs) = L(r)L(s)

▪ Repetition: r*
– L(r*) = L(r)*

▪ Sub-expressions: (r)
– L((r)) = L(r)

▪ Precedence order: *, concatenation, |

Examples of Regular Expressions
▪ Σ = { a, b, c }

▪ Set of strings that contain exactly one b

– (a|c)*b(a|c)*

▪ Set of strings that contain at most one b

– (a|c)*|(a|c)*b(a|c)*
– (a|c)*(b|ε)(a|c)*

▪ Set of strings that contain no consecutive b’s

– notb = a | c
– (notb | b notb)*(b|ε)

▪ Set of strings that contain an even number of b’s

– (notb* b notb* b)* notb*
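
The four examples can be checked mechanically. The sketch below translates them into Python `re` syntax (ε becomes an empty alternative) and compares each pattern with a plain string predicate on all strings over { a, b, c } up to a small length. The helper `agrees` and the dictionary layout are assumptions of this sketch.

```python
import re
from itertools import product

NOTB = "(a|c)"   # notb = a | c

# description -> (regular expression, equivalent direct predicate)
EXAMPLES = {
    "exactly one b": (f"{NOTB}*b{NOTB}*", lambda s: s.count("b") == 1),
    "at most one b": (f"{NOTB}*(b|){NOTB}*", lambda s: s.count("b") <= 1),
    "no consecutive b's": (f"({NOTB}|b{NOTB})*(b|)", lambda s: "bb" not in s),
    "even number of b's": (f"({NOTB}*b{NOTB}*b)*{NOTB}*",
                           lambda s: s.count("b") % 2 == 0),
}


def agrees(pattern, predicate, max_len=6):
    """True if pattern and predicate accept the same strings up to max_len."""
    for n in range(max_len + 1):
        for chars in product("abc", repeat=n):
            s = "".join(chars)
            if (re.fullmatch(pattern, s) is not None) != predicate(s):
                return False
    return True


results = {name: agrees(p, f) for name, (p, f) in EXAMPLES.items()}
```

An exhaustive test up to a bounded length is not a proof, but it catches most mistakes made when writing such expressions by hand.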
Extensions to Regular Expressions
▪ + describes one or more repetitions
– Binary numbers: (0|1)(0|1)*
– (0|1)+

▪ . describes any character
– All strings containing at least one b: .*b.*

▪ Character classes
– [a-z] instead of a|b|...|z
– [0-9] instead of 0|1|...|9
– Multiple ranges: [a-zA-Z]

▪ ~ describes any character not in a given set
– Any character but a, b, and c: ~(a|b|c)
– [^abc] in Lex

▪ ? describes optional sub-expressions
– Number with optional leading sign
– nat = [0-9]+
– signedNat = nat | + nat | - nat
– signedNat = (+|-)? nat

Programming Language Tokens
▪ Numbers: 2.71E-2 = 0.0271
– nat = [0-9]+
– signedNat = (+|-)? nat
– number = signedNat(“.” nat)?(E signedNat)?

▪ Keywords
– keyword = if | while | do | ...

▪ Identifiers
– letter = [a-zA-Z]
– digit = [0-9]
– identifier = letter(letter|digit)*
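
The number and identifier definitions translate almost literally into Python `re` syntax. This is a sketch only: the helper names `is_number` and `is_identifier` are assumptions, and `(+|-)` is written as the character class `[+-]` because `+` is an operator in Python regular expressions.

```python
import re

NAT = r"[0-9]+"                                       # nat = [0-9]+
SIGNED_NAT = rf"[+-]?{NAT}"                           # signedNat = (+|-)? nat
NUMBER = rf"{SIGNED_NAT}(\.{NAT})?(E{SIGNED_NAT})?"   # number
IDENTIFIER = r"[a-zA-Z][a-zA-Z0-9]*"                  # letter(letter|digit)*


def is_number(s):
    """True if the whole string s is a number token."""
    return re.fullmatch(NUMBER, s) is not None


def is_identifier(s):
    """True if the whole string s is an identifier token."""
    return re.fullmatch(IDENTIFIER, s) is not None
```

For example, `is_number("2.71E-2")` holds, while `is_number("2.")` does not, because the fraction part requires at least one digit after the dot.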

▪ { this is a Pascal comment }
– {(~})*}

▪ -- this is an Ada comment

– --(~newline)*

▪ /* this is a C comment */
– ba(~(ab))*ab, writing b for / and a for *
• Not valid because the “not” operator is restricted to single characters
– (~(ab))* could be written as: b*(a*~(a|b)b*)*a*

▪ In some programming languages, comments can be nested

– (* this is (* a Modula-2 *) comment *)
– (* this is (* illegal in Modula-2 *)
• Cannot be specified with regular expressions, since the scanner cannot count comment delimiters

▪ Strings may be matched by several regular expressions
– Are if, while, for keywords or identifiers?
– Are the strings <>, ==, <=, >= one token or two?

▪ Regular expressions alone cannot resolve such ambiguities

▪ Disambiguating rules
– Keywords (reserved words) first
– Principle of longest substring

▪ Token delimiters or separators

– Indicate that a longer substring cannot represent a token

▪ Lookahead problem
– Token delimiters must not be consumed but returned to input stream
– Often one single lookahead character is enough, but sometimes not
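
Both disambiguating rules can be illustrated with a small hand-written tokenizer. This is a minimal sketch under several assumptions: the keyword set, the operator list and the function name `tokenize` are made up for illustration. Python's `re` alternation picks the first alternative that matches rather than the longest one, so the longer operators are listed before their prefixes.

```python
import re

KEYWORDS = {"if", "while", "for", "do"}

# "<=" and "<>" come before "<" so that the longest substring is preferred.
TOKEN_RE = re.compile(
    r"(?P<ID>[a-zA-Z][a-zA-Z0-9]*)|(?P<NUM>[0-9]+)|(?P<OP><=|>=|<>|==|<|>|=)"
)


def tokenize(source):
    tokens, pos = [], 0
    while pos < len(source):
        if source[pos].isspace():          # whitespace only delimits tokens
            pos += 1
            continue
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        # Keywords first: a lexeme that looks like an identifier but is
        # reserved is reported as a keyword.
        if kind == "ID" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, lexeme))
        pos = m.end()                      # the delimiter itself is not consumed
    return tokens


example = tokenize("while x<=ifx")
```

In the example, `ifx` is a single identifier and not the keyword `if` followed by `x`, and `<=` is one token and not `<` followed by `=`.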

Token Delimiters
▪ Whitespaces
– Discarded token delimiters
– Free format languages
– whitespace = (newline | blank | tab | comment)+

▪ while x …
– Keyword while, identifier x
– Space as token delimiter

▪ xtemp=ytemp
– Identifiers xtemp and ytemp
– = as token delimiter

▪ Comments usually serve as delimiters

– do/**/if
– Two reserved words (do and if) rather than identifier doif
– /**/ as token delimiter

▪ Fixed format language
– Whitespaces are not delimiters, but removed by a preprocessor before scanning

▪ I F ( X 2 .EQ. 0 ) THE N
– Equivalent to IF(X2.EQ.0) THEN

▪ No reserved words in FORTRAN (all keywords can also be identifiers)

– Character position in each line of input is important

– First IF and THEN are keywords
– Second IF and THEN are variables (identifiers)

▪ Scanner may have to backtrack to arbitrary positions in a code line

– DO99I=1.10 → assigns the value 1.1 to the variable DO99I
– DO99I=1,10 → loop header, comparable to for (i = 1; i <= 10; i++) in C

Finite Automaton
▪ Mathematical method of recognising patterns in input strings

▪ identifier = letter(letter|digit)*
– Transition graph with two states
– Start state: 1
– Accepting state: 2
– Transitions: arrowed lines (1 → 2 on letter)

▪ Process of recognising xtemp as identifier

1 --x--> 2 --t--> 2 --e--> 2 --m--> 2 --p--> 2

Deterministic Finite Automaton (DFA)
▪ Deterministic Finite Automaton (DFA)
– Next state is uniquely given by current state and current input character
– A symbol can label only one transition out of a state

▪ A DFA M consists of
– An alphabet Σ
– A set of states S
– A transition function T: S × Σ → S
– A start state s0 ∈ S
– A set of accepting states A ⊆ S

▪ L(M): Language accepted by M

▪ c1c2…cn ∈ L(M) ⇔ ci ∈ Σ, ∀ i ∈ [1..n] ∧
∃ s1=T(s0,c1), s2=T(s1,c2), …, sn=T(sn-1,cn) ∧ sn ∈ A

s0 --c1--> s1 --c2--> s2 --c3--> ... --cn-1--> sn-1 --cn--> sn
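
The acceptance condition can be sketched directly in Python. The dictionary representation of T and the function name `accepts` are assumptions of this sketch; the example automaton is a two-state DFA for strings over { a, b, c } that contain exactly one b.

```python
def accepts(transitions, start, accepting, text):
    """Follow s_i = T(s_(i-1), c_i) and check that the last state is in A."""
    state = start
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:        # no entry: T is undefined, so the string is rejected
            return False
    return state in accepting


# T as a dictionary from (state, character) to the next state.
EXACTLY_ONE_B = {
    (1, "a"): 1, (1, "c"): 1, (1, "b"): 2,
    (2, "a"): 2, (2, "c"): 2,
}

ok = accepts(EXACTLY_ONE_B, 1, {2}, "acbca")
```

Because the next state is uniquely determined, the loop never has to try alternatives, which is what makes a DFA cheap to execute.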
Error Transitions
▪ letter = [a-zA-Z]

▪ T(start, c) defined only if c is a letter

▪ T(in_id, c) defined only if c is a letter or digit

▪ Error transitions
– Not drawn but assumed to exist
– other1 = ~letter
– other2 = ~(letter|digit)

[Figure: start → in_id on letter; start → error on other1; in_id → error on other2; error → error on any]
DFA Examples
▪ Set of strings that contain exactly one b
– (a|c)*b(a|c)*
– [Figure: DFA with transitions labelled notb]

▪ Set of strings that contain at most one b

– (a|c)*|(a|c)*b(a|c)*
– (a|c)*(b|ε)(a|c)*
– [Figure: DFA with transitions labelled notb]

Numeric Constants
digit = [0-9]
nat = digit+
signedNat = (+|-)? nat
number = signedNat(“.” nat)?(E signedNat)?

[Figure: DFAs for nat, signedNat and number, with transitions labelled digit, +, -, . and E]
Unnested Comments
▪ Pascal comments { }
– {(~})*}
– other = ~}

▪ C comments
– 1 – start
– 2 – entering comment
– 3 – inside comment
– 4 – exiting comment
– 5 – finish

▪ Easier to write DFA than regular expression

[Figure: DFAs for Pascal and C comments, with transitions labelled other and *]

DFA Implementation with State Variables
[Figure: DFA for identifiers with states 1, 2, 3; 1 → 2 on letter, 2 → 2 on letter or digit, 2 → 3 on [other]]

state := 1; { start }
while state = 1 or 2 do
  case state of
    1: case input character of
         letter: advance input;
                 state := 2;
         else state := . . .; { error or other }
       end case;
    2: case input character of
         letter, digit: advance input;
                        state := 2; { unnecessary }
         else state := 3;
       end case;
  end case;
end while;

if state = 3 then accept else error;

C Comments with State Variables
[Figure: DFA for C comments; 1 → 2 on /, 2 → 3 on *, 3 → 3 on other, 3 → 4 on *, 4 → 4 on *, 4 → 3 on other, 4 → 5 on /]

state := 1; { start }

while state = 1, 2, 3 or 4 do
  case state of
    1: case input character of
         “/”: state := 2;
         else state := 6; { error or other }
       end case;
    2: case input character of
         “*”: state := 3;
         else state := 6; { error or other }
       end case;
    3: case input character of
         “*”: state := 4;
         else { stay in state 3 }
       end case;
    4: case input character of
         “/”: state := 5;
         “*”: { stay in state 4 }
         else state := 3;
       end case;
  end case;
  advance input;
end while;

if state = 5 then accept else error;
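
The same state-variable technique can be written in Python. This sketch keeps the state numbers of the pseudocode; the function name `is_c_comment` and the treatment of the end of input (the loop simply stops, and the whole string must have been consumed) are assumptions, since the pseudocode does not say what happens when the input runs out.

```python
def is_c_comment(text):
    """True if text is exactly one unnested C comment."""
    state = 1                                  # start
    pos = 0
    while state in (1, 2, 3, 4) and pos < len(text):
        ch = text[pos]
        if state == 1:
            state = 2 if ch == "/" else 6      # 6: error or other
        elif state == 2:
            state = 3 if ch == "*" else 6
        elif state == 3:
            if ch == "*":
                state = 4                      # otherwise stay in state 3
        elif state == 4:
            if ch == "/":
                state = 5
            elif ch != "*":
                state = 3                      # on "*" stay in state 4
        pos += 1                               # advance input
    return state == 5 and pos == len(text)


closed = is_c_comment("/* this is a C comment */")
```

The state is encoded in a variable and the transitions in nested conditionals, so the structure of the automaton is hard-wired into the code.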

Transition Table
▪ Transition table
– Two-dimensional array indexed by state and input character
– Expresses values of transition function T
– Extra column indicates accepting states
– Square brackets indicate “noninput-consuming” transitions

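[Figure: DFA for identifiers with states 1, 2, 3; 1 → 2 on letter, 2 → 3 on [other]]

Transition function T, indexed by state (rows) and input character (columns):

State   letter   digit   other   Accepting
1       2                        No
2       2        2       [3]     No
3                                Yes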

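C Comments
DFA Transition Table
[Figure: DFA for C comments with states 1 to 5; transitions labelled /, *, other]

Transition function T, indexed by state (rows) and input character (columns):

State   /   *   other   Accepting
1       2               No
2           3           No
3       3   4   3       No
4       5   4   3       No
5                       Yes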

Table Driven DFA Implementation
state := 1;
ch := next input character;
while not Accept[state] and not error(state) do
  newstate := T[state, ch];
  if Advance[state, ch] then
    ch := next input char;
  state := newstate;
end while;
if Accept[state] then accept;

▪ Advantage
– Small code size
– Easy to maintain and change

▪ Disadvantage
– Tables can get very large
– Table compression methods are slow and rarely used
– Sparse array representations
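
A table-driven scanner for the identifier DFA might look as follows in Python. This is a sketch under stated assumptions: the names `T`, `NO_ADVANCE`, `ACCEPT`, `char_class` and `scan_identifier` are hypothetical, the bracketed table entry is modelled as a set of non-consuming transitions, and the end of input is treated as a character of class other.

```python
def char_class(ch):
    """Map an input character to a column of the transition table."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"               # also used for "" (end of input)


T = {
    1: {"letter": 2},
    2: {"letter": 2, "digit": 2, "other": 3},
}
NO_ADVANCE = {(2, "other")}      # the bracketed entry [3]
ACCEPT = {3}


def scan_identifier(text, pos=0):
    """Return the identifier starting at pos, or None on an error transition."""
    state, start = 1, pos
    while state not in ACCEPT:
        ch = text[pos] if pos < len(text) else ""
        cls = char_class(ch)
        newstate = T.get(state, {}).get(cls)
        if newstate is None:     # empty table entry: error
            return None
        if (state, cls) not in NO_ADVANCE:
            pos += 1             # consume the character
        state = newstate
    return text[start:pos]


lexeme = scan_identifier("xtemp=ytemp")
```

The loop is independent of the automaton: scanning a different token class only requires different tables, which is the maintenance advantage listed above.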

Single DFA Generation
▪ Automatically construct DFAs for all regular expressions
– <= : return LE
– <> : return NE
– < : return LT

▪ Combine all DFAs in one single DFA
– After <, the next character selects the token: = returns LE, > returns NE, [other] returns LT

Automatic DFA Generation
▪ Nondeterministic Finite Automaton (NFA)
– Constructed for each regular expression
– Thompson’s construction

▪ ε-transitions
– Connect NFAs of all tokens
– “Spontaneous” transition without consuming any input characters
– “Match” of empty string

▪ Convert NFA into DFA using a fast and direct algorithm
– Subset construction

[Figure: NFAs for the tokens :=, <= and = connected by ε-transitions]

Automatic Scanner Generation

Regular Expressions → Nondeterministic Finite Automaton → Deterministic Finite Automaton

• Thompson’s Construction: regular expressions to NFA
• Subset Construction: NFA to DFA
• Table Driven DFA Implementation

NFA Definition
▪ NFA M consists of
– An alphabet Σ
– A set of states S
– A transition function T: S × (Σ ∪ { ε }) → ℘(S)
– A start state s0 ∈ S
– A set of accepting states A ⊆ S

▪ L(M): language accepted by M

▪ c1c2…cn ∈ L(M) ⇔ ci ∈ Σ ∪ { ε }, ∀ i ∈ [1..n] ∧
∃ s1 ∈ T(s0,c1), s2 ∈ T(s1,c2), …, sn ∈ T(sn-1,cn) ∧ sn ∈ A

▪ Observations
– Any ci may be ε
– c1c2…cn may have fewer than n characters (ε removed)
– Sequence of states s1s2…sn is chosen from the sets of states T(s0,c1), T(s1,c2), …, T(sn-1,cn) and is not always uniquely determined
– An arbitrary number of ε may occur in the input string, corresponding to any number of NFA ε-transitions

NFA Example 2
▪ String abb can be accepted by any of the following transition sequences
– 1 --a--> 2 --b--> 4 --ε--> 2 --b--> 4
– 1 --a--> 3 --ε--> 4 --ε--> 2 --b--> 4 --ε--> 2 --b--> 4

▪ NFA matches multiple regular expressions
– ab: T(1,a)=2, T(2,b)=4
– ab+: T(1,a)=2, T(2,b)=4, T(4,ε)=2, T(2,b)=4, …
– ab*: T(1,a)=3, T(3,ε)=4, T(4,ε)=2, T(2,b)=4, …
– b*: T(1,ε)=4, T(4,ε)=2, T(2,b)=4, …

▪ NFA accepts same language as regular expression: ab*|b*
– Equivalent to (a|ε)b*

[Figure: NFA with states 1 to 4 and transitions labelled a, b and ε]
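
The example NFA can be simulated by tracking the set of all states it could be in. The transition sets below are read off the T(…) lines of the example; taking state 4 as the only accepting state is an assumption of this sketch (the accepting sequences all end there), as are the names `NFA`, `eps_closure` and `nfa_accepts`.

```python
EPS = ""   # label standing for an ε-transition

NFA = {
    (1, "a"): {2, 3},
    (1, EPS): {4},
    (2, "b"): {4},
    (3, EPS): {4},
    (4, EPS): {2},
}
START, ACCEPTING = 1, {4}


def eps_closure(states):
    """All states reachable from `states` by zero or more ε-transitions."""
    closure, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                todo.append(t)
    return closure


def nfa_accepts(text):
    current = eps_closure({START})
    for ch in text:
        moved = set()
        for s in current:
            moved |= NFA.get((s, ch), set())
        current = eps_closure(moved)
    return bool(current & ACCEPTING)


abb = nfa_accepts("abb")
```

Tracking a set of states instead of guessing one sequence is the same idea that the subset construction later turns into DFA states.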

Thompson’s Construction
▪ Construct an NFA from a regular expression

▪ Basic regular expression
– a: start state and accepting state connected by a transition labelled a

▪ To do
– Concatenation
– Or
– Repetition

Concatenation
▪ Input
– Two regular expressions r and s
– Two NFAs (of r and of s)
▪ Goal
– NFA for regular expression rs
▪ Connect accepting state of r with start state of s through one ε-transition
▪ L(rs) = L(r) L(s)

Or
▪ Input
– Two regular expressions r and s
– Two NFAs (of r and of s)
▪ Goal
– NFA for regular expression r|s
▪ Two new states: start and accepting
▪ Connected with ε-transitions
▪ L(r|s) = L(r) ∪ L(s)

Repetition
▪ Input
– One regular expression r
– One NFA (of r)
▪ Goal
– NFA for regular expression r*
▪ Two new states: start and accepting, connected through ε-transitions
▪ Repetition through ε-transition from accepting to start state of r
▪ Empty string is accepted by ε-transition from start to accepting state

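
The three construction steps can be sketched as functions that combine NFA fragments. Everything named here (`Fragment`, `symbol`, `concat`, `union`, `star`, `matches`, the edge-list representation and the state numbering) is an assumption of this sketch; only the way the ε-transitions are added follows the rules listed above.

```python
from itertools import count

EPS = ""            # label standing for an ε-transition
_ids = count(1)     # source of fresh state numbers


class Fragment:
    """NFA with exactly one start state and one accepting state."""

    def __init__(self, start, accept, edges):
        self.start, self.accept = start, accept
        self.edges = edges                  # list of (source, label, target)


def symbol(a):
    s, t = next(_ids), next(_ids)
    return Fragment(s, t, [(s, a, t)])


def concat(r, s):
    # accepting state of r --ε--> start state of s
    return Fragment(r.start, s.accept,
                    r.edges + s.edges + [(r.accept, EPS, s.start)])


def union(r, s):
    start, accept = next(_ids), next(_ids)  # two new states
    glue = [(start, EPS, r.start), (start, EPS, s.start),
            (r.accept, EPS, accept), (s.accept, EPS, accept)]
    return Fragment(start, accept, r.edges + s.edges + glue)


def star(r):
    start, accept = next(_ids), next(_ids)  # two new states
    glue = [(start, EPS, r.start), (r.accept, EPS, accept),
            (r.accept, EPS, r.start),       # repetition
            (start, EPS, accept)]           # empty string
    return Fragment(start, accept, r.edges + glue)


def matches(nfa, text):
    """Simulate the fragment on text by tracking sets of states."""
    def closure(states):
        seen, todo = set(states), list(states)
        while todo:
            s = todo.pop()
            for src, label, dst in nfa.edges:
                if src == s and label == EPS and dst not in seen:
                    seen.add(dst)
                    todo.append(dst)
        return seen

    current = closure({nfa.start})
    for ch in text:
        current = closure({dst for src, label, dst in nfa.edges
                           if src in current and label == ch})
    return nfa.accept in current


ab_or_a = union(concat(symbol("a"), symbol("b")), symbol("a"))   # ab | a
a_star = star(symbol("a"))                                       # a*
```

Each operator adds at most two states, so the NFA stays linear in the size of the regular expression; the NFA built here for ab | a has eight states.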
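Example: ab | a
▪ a and b: one basic NFA each
▪ ab: NFAs of a and b connected by an ε-transition
▪ ab | a: new start and accepting states connected by ε-transitions to the NFAs of ab and a

Example: letter (letter | digit)*
▪ letter and digit: one basic NFA each
▪ letter | digit: new start and accepting states connected by ε-transitions to the NFAs of letter and digit
▪ (letter | digit)*: new start and accepting states around the NFA of letter | digit, with ε-transitions for repetition and for the empty string
▪ letter (letter | digit)*: NFA of letter connected by an ε-transition to the NFA of (letter | digit)*
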

Subset Construction
▪ Convert an NFA into a DFA

▪ We need some method for eliminating

– ε-transitions
– Multiple transitions from a state on same input character

▪ ε-closure of a state s is
– Set of states reachable from s by a series of zero or more ε-transitions
– Denoted as s̄ or ε-closure(s)

▪ The states of the DFA are sets of states of the original NFA

Example: a*
▪ DFA start state
– ε-closure of the NFA start state
– ε-closure(1) = { 1, 2, 4 }

▪ For each DFA state S and each a ∈ Σ, compute the state S_a
– S_a = { t | ∃ s ∈ S ∧ T(s,a) = t }
– Add new transition T(S,a) = ε-closure(S_a)
– { 1, 2, 4 }_a = { 3 }, ε-closure({ 3 }) = { 2, 3, 4 }, so T({ 1, 2, 4 }, a) = { 2, 3, 4 }
– { 2, 3, 4 }_a = { 3 }, ε-closure({ 3 }) = { 2, 3, 4 }, so T({ 2, 3, 4 }, a) = { 2, 3, 4 }

▪ DFA accepting states
– All states that contain accepting NFA states

[Figure: NFA for a* with states 1 to 4; resulting DFA {1,2,4} --a--> {2,3,4} --a--> {2,3,4}]

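
Example: ab | a
– ε-closure({ 1 }) = { 1, 2, 6 }
– { 1, 2, 6 }_a = { 3, 7 }, ε-closure({ 3, 7 }) = { 3, 4, 7, 8 }
– T({ 1, 2, 6 }, a) = { 3, 4, 7, 8 }
– { 3, 4, 7, 8 }_b = { 5 }, ε-closure({ 5 }) = { 5, 8 }
– T({ 3, 4, 7, 8 }, b) = { 5, 8 }

[Figure: NFA for ab | a with states 1 to 8 (2 → 3 on a, 4 → 5 on b, 6 → 7 on a, all other transitions ε); resulting DFA {1,2,6} --a--> {3,4,7,8} --b--> {5,8}]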


Example: letter (letter | digit)*
– ε-closure({ 1 }) = { 1 }
– { 1 }_letter = { 2 }, ε-closure({ 2 }) = { 2, 3, 4, 5, 7, 10 }
– T({ 1 }, letter) = { 2, 3, 4, 5, 7, 10 }
– { 2, 3, 4, 5, 7, 10 }_letter = { 6 }, ε-closure({ 6 }) = { 4, 5, 6, 7, 9, 10 }
– T({ 2, 3, 4, 5, 7, 10 }, letter) = { 4, 5, 6, 7, 9, 10 }
– { 2, 3, 4, 5, 7, 10 }_digit = { 8 }, ε-closure({ 8 }) = { 4, 5, 7, 8, 9, 10 }
– T({ 2, 3, 4, 5, 7, 10 }, digit) = { 4, 5, 7, 8, 9, 10 }
– { 4, 5, 6, 7, 9, 10 }_letter = { 6 }, ε-closure({ 6 }) = { 4, 5, 6, 7, 9, 10 }
– T({ 4, 5, 6, 7, 9, 10 }, letter) = { 4, 5, 6, 7, 9, 10 }
– { 4, 5, 6, 7, 9, 10 }_digit = { 8 }, ε-closure({ 8 }) = { 4, 5, 7, 8, 9, 10 }
– T({ 4, 5, 6, 7, 9, 10 }, digit) = { 4, 5, 7, 8, 9, 10 }
– { 4, 5, 7, 8, 9, 10 }_letter = { 6 }, ε-closure({ 6 }) = { 4, 5, 6, 7, 9, 10 }
– T({ 4, 5, 7, 8, 9, 10 }, letter) = { 4, 5, 6, 7, 9, 10 }
– { 4, 5, 7, 8, 9, 10 }_digit = { 8 }, ε-closure({ 8 }) = { 4, 5, 7, 8, 9, 10 }
– T({ 4, 5, 7, 8, 9, 10 }, digit) = { 4, 5, 7, 8, 9, 10 }

[Figure: NFA for letter (letter | digit)* with states 1 to 10 (1 → 2 on letter, 5 → 6 on letter, 7 → 8 on digit, all other transitions ε); resulting DFA with states {1}, {2,3,4,5,7,10}, {4,5,6,7,9,10} and {4,5,7,8,9,10}]

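
The subset construction itself fits in a few lines of Python. The sketch below runs it on the NFA of the ab | a example; the dictionary encoding of the NFA, the choice of state 8 as the accepting NFA state and the names `eps_closure` and `subset_construction` are assumptions of this sketch.

```python
EPS = ""   # label standing for an ε-transition

# NFA for ab | a with states 1 to 8.
NFA = {
    (1, EPS): {2, 6},
    (2, "a"): {3}, (3, EPS): {4}, (4, "b"): {5}, (5, EPS): {8},
    (6, "a"): {7}, (7, EPS): {8},
}
NFA_START, NFA_ACCEPTING = 1, {8}
ALPHABET = "ab"


def eps_closure(states):
    """ε-closure of a set of NFA states, as a hashable frozenset."""
    closure, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                todo.append(t)
    return frozenset(closure)


def subset_construction():
    start = eps_closure({NFA_START})
    dfa, seen, todo = {}, {start}, [start]
    while todo:
        S = todo.pop()
        for a in ALPHABET:
            moved = set()                      # S_a
            for s in S:
                moved |= NFA.get((s, a), set())
            if not moved:
                continue                       # no transition: implicit error
            target = eps_closure(moved)        # T(S, a) = ε-closure(S_a)
            dfa[(S, a)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    accepting = {S for S in seen if S & NFA_ACCEPTING}
    return start, dfa, accepting


start, dfa, accepting = subset_construction()
```

Only subsets that are actually reachable from the start state are created, so the DFA here has three states rather than the 2^8 subsets that exist in principle.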
Conclusions

▪ Lexical analysis or scanning

▪ Operates under parser control

▪ Tokens specified through regular expressions

▪ Regular expressions implemented as finite automata

– Automatic NFA generation through Thompson’s construction
– NFA conversion into DFA through subset construction

▪ Automatic lexical analyser generator (e.g., Lex)

