Lexical Analysis2 Modefied

You might also like

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 33

Lexical Analysis

Course 338- Compilers


Techniques and Tools
Chapter 3
Lexical Analysis

 Lexical analysis recognizes the vocabulary of


the programming language and transforms a
string of characters into a string of words or
tokens
 Lexical analysis discards white spaces and
comments between the tokens
 Lexical analyzer (or scanner) is the program
that performs lexical analysis
2
Contents

 Scanners
 Tokens
 Regular expressions
 Finite automata

3
Scanners

token
characters Scanner Parser
next token

Symbol
Table

4
Tokens

 A token is a sequence of characters that can


be treated as a unit in the grammar of a
programming language
 A programming language classifies tokens
into a finite set of token types
Type Examples
ID foo i n
NUM 73 13
IF if
5 COMMA ,
Semantic Values of Tokens

 Semantic values are used to distinguish


different tokens in a token type
– < ID, foo>, < ID, i >, < ID, n >
– < NUM, 73>, < NUM, 13 >
– < IF, >
– < COMMA, >
 Token types affect syntax analysis and
semantic values affect semantic analysis
6
Scanner Generators

Scanner
Scanner
definition in Scanner
Generator
Mata Language

Program in
Scanner Token types &
programming
semantic values
language

7
Languages

 A language is a set of strings


 A string is a finite sequence of symbols taken
from a finite alphabet
– The C language is the (infinite) set of all strings that
constitute legal C programs
– The language of C reserved words is the (finite) set
of all alphabetic strings that cannot be used as
identifiers in the C programs
– Each token type is a language
8
Regular Expressions (RE)

 Language allows us to use finite descriptions


to specify (possibly infinite) sets
 RE is the metalanguage used to define the
token types of a programming language

9
Regular Expressions

  is a RE denoting L = {}
 If a  alphabet, then a is a RE denoting L = {a}
 Suppose r and s are RE denoting L(r) and L(s)
 alternation: (r) | (s) is a RE denoting L(r)  L(s)
 concatenation: (r) • (s) is a RE denoting L(r)L(s)
 repetition: (r)* is a RE denoting (L(r))*
 (r) is a RE denoting L(r)
10
Examples

 a|b {a, b}
 (a | b)(a | b) {aa, ab, ba, bb}
 a* {, a, aa, aaa, ...}
 (a | b)* the set of all strings of a’s and b’s
 a | a*b the set containing the string a and all
strings consisting of zero or more a’s followed
by a b

11
Regular Definitions

 Names for regular expressions


d1  r1
d2  r2
..
dn  rn
where ri over alphabet  {d1, d2, ..., di-1}
 Examples:
letter  A | B | ... | Z | a | b | ... | z
digit  0 | 1 | ... | 9
12 identifier  letter ( letter | digit )*
Notational Abbreviations

 One or more instances


(r)+ denoting (L(r))+
r* = r+ |  r+ = r r*
 Zero or one instance
r? = r | 
 Character classes
[abc] = a | b | c [a-z] = a | b | ... | z
[^abc] = any character except a | b | c
 Any character except newline
.
13
Examples

 if {return IF;}
 [a-z][a-z0-9]* {return ID;}
 [0-9]+ {return NUM;}
 ([0-9]+“.”[0-9]*)|([0-9]*“.”[0-9]+) {return REAL;}
 (“--”[a-z]*“\n”)|(“ ” | “\n” | “\t”)+
{/*do nothing for white spaces and comments*/}
 . { error(); }
14
Finite Automata

 A finite automaton is a finite-state transition


diagram that can be used to model the
recognition of a token type specified by a
regular expression
 A finite automaton can be a nondeterministic
finite automaton or a deterministic finite
automaton

15
Nondeterministic Finite Automata
(NFA)
 An NFA consists of
– A finite set of states
– A finite set of input symbols
– A transition function that maps (state,
symbol) pairs to sets of states
– A state distinguished as start state
– A set of states distinguished as final states
16
An Example

 RE: (a | b)*abb start


 States: {1, 2, 3, 4} 1 a,b
 Input symbols: {a, b} a
 Transition function: 2
(1,a) = {1,2}, (1,b) = {1} b
(2,b) = {3}, (3,b) = {4} 3
 Start state: 1 b
 Final state: {4} 4
17
Acceptance of NFA

 An NFA accepts an input string s iff there is


some path in the finite-state transition diagram
from the start state to some final state such
that the edge labels along this path spell out s
 The language recognized by an NFA
automaton is the set of strings it accepts

18
An Example

(a | b)*abb aabb

start a b b
0 1 2 3

19
An Example

(a | b)*abb aaba

start a b b
0 1 2 3

a
b

20
Another Example

 RE: aa* | bb*


 States: {1, 2, 3, 4, 5}
 Input symbols: {a, b}
 Transition function:
(1, ) = {2, 4}, (2, a) = {3}, (3, a) = {3},
(4, b) = {5}, (5, b) = {5}
 Start state: 1
 Final states: {3, 5}
21
Finite-State Transition Diagram

aa* | bb*
a
a
2 3

start
1

4 5
b
b

22
Deterministic Finite Automata (DFA)

 A DFA is a special case of an NFA in which


 no state has an -transition
 for each state s and input symbol a, there is at
most one edge labeled a leaving s

23
An Example

 RE: (a | b)*abb
 States: {1, 2, 3, 4}
 Input symbols: {a, b}
 Transition function:
(1,a) = {2}, (2,a) = {2}, (3,a) = {2}, (4,a) = {2}
(1,b) = {1}, (2,b) = {3}, (3,b) = {4}, (4,b) = {1}
 Start state: 1
 Final state: {4}
24
Finite-State Transition Diagram

(a | b)*abb

b
a
start b b
1 a 2 3 4
a
b a

25
Acceptance of DFA

 A DFA accepts an input string s iff there is one


path in the finite-state transition diagram from
the start state to some final state such that the
edge labels along this path spell out s
 The language recognized by a DFA
automaton is the set of strings it accepts

26
An Example

(a | b)*abb aabb

b
a
start b b
1 a 2 3 4
a
b a

27
An Example

(a | b)*abb aaba

b
a
start b b
1 a 2 3 4
a
b a

28
Combined Finite Automata

start i f
if 1 2 3 IF
ID
start a-z a-z,0-9
[a-z][a-z0-9]* 1 2

0-9 REAL
([0-9]+“.”[0-9]*) 0-9
.
2 3 0-9
| start
1 0-9
([0-9]*“.”[0-9]+) . 4 5 0-9
29 REAL
Combined Finite Automata

i f
2 3 4 IF
 ID
start a-z a-z,0-9
1  5 6

 0-9 REAL
0-9
.
8 9 0-9
7
. 0-9
NFA 10 11 0-9
REAL
30
Combined Finite Automata

f IF
2 3
g-z
a-e a-z,0-9
i 4
j-z a-z,0-9
ID
start 1 a-h 0-9 REAL
0-9 .
5 6 0-9
.
0-9
DFA 7 8 0-9
31 REAL
Recognizing the Longest Match

 The automaton must keep track of the longest


match seen so far and the position of that
match until a dead state is reached
 Use two variables Last-Final (the state
number of the most recent final state
encountered) and Input-Position-at-Last-Final
to remember the last time the automaton was
in a final state

32
An Example

f IF
2 3
iffail+ g-z S C L P
a-e a-z,0-9 1 0 0
i 4
j-z a-z,0-9 i 2 0 0
ID f 3 3 2
start a-h 0-9 f 4 4 3
1 0-9 REAL
. a 4 4 4
5 6 0-9i 4 4 5
.
0-9 l 4 4 6
DFA 7 8 0-9+ ?
REAL
33

You might also like