Compiler Design
Kenesa B. (getkennyo@gmail.com)
CHAPTER TWO
LEXICAL ANALYSIS
OUTLINE
q Lexical Analysis
§ Token Specification
§ Recognition of Tokens
q Recognition of Machines
§ NFA to DFA Conversion
q Error Recovery
q A typical Lexical Analyzer Generator
q DFA Analysis
[Figure: source program; symbol table (contains a record for each identifier)]
token: the smallest meaningful sequence of characters of interest in the source program
Kenesa B. (Ambo University) Compiler Design 5
Input Buffering
q Reading the input character by character from secondary storage is slow and
time-consuming.
q It is necessary to look ahead several characters beyond the lexeme for a pattern
before a match can be announced.
§ A buffering technique is used to reduce this cost and increase efficiency.
q Many times, a scanner has to look ahead several characters from the current
character in order to recognize the token.
q The lexical analyzer scans the input string from left to right, one character at a time.
§ It uses two pointers, begin_ptr (bp) and forward_ptr (fp), to keep track of the portion of the
input scanned.
q Initially, both pointers point to the first character of the input string.
q The forward_ptr moves ahead to search for the end of the lexeme.
• A blank space indicates the end of the lexeme.
q In the above example, as soon as forward_ptr (fp) encounters a blank space, the
lexeme "int" is identified.
• When fp encounters whitespace, it ignores it and moves ahead.
• Then both begin_ptr (bp) and forward_ptr (fp) are set to the start of the next token, i.
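The two-pointer scan described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the slides' implementation: the function name scan_lexemes is hypothetical, and only whitespace is treated as a lexeme boundary, as in the description above.

```python
def scan_lexemes(text):
    """Split text into whitespace-delimited lexemes using two indices.

    bp (begin pointer) marks the start of the current lexeme;
    fp (forward pointer) moves ahead looking for its end.
    """
    lexemes = []
    bp = fp = 0
    n = len(text)
    while bp < n:
        # Skip whitespace so that bp sits on the first character of a lexeme.
        while bp < n and text[bp].isspace():
            bp += 1
        fp = bp
        # Advance fp until whitespace (or end of input) ends the lexeme.
        while fp < n and not text[fp].isspace():
            fp += 1
        if fp > bp:
            lexemes.append(text[bp:fp])
        # Both pointers now continue from the end of the lexeme just found.
        bp = fp
    return lexemes
```

A real scanner would also stop at operators and punctuation, which this sketch deliberately leaves out.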
q Solutions:
a) L1 ∪ L2 = {a,b,c,d,1,2}
b) L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
c) L1* = the set of all strings of zero or more letters a, b, c, d, including the empty string: {ε, a, b, c, d, aa, ab, ac, ad, ba,
bb, bc, bd, aaa, . . . }
d) L1+ = the set of all strings of one or more letters a, b, c, d; the empty string is not included
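These operations can be checked with a short Python sketch. It assumes L1 = {a, b, c, d} and L2 = {1, 2}, which is what the solutions above imply, since the exercise statement itself is not shown. The helper names are hypothetical, and because L1* and L1+ are infinite, the sketch only enumerates strings built from a bounded number of members.

```python
L1 = {"a", "b", "c", "d"}   # assumed from the solutions above
L2 = {"1", "2"}             # assumed from the solutions above

def concat(A, B):
    """Concatenation AB: every string of A followed by every string of B."""
    return {x + y for x in A for y in B}

def power(A, k):
    """A^k: concatenation of exactly k members of A (A^0 = {empty string})."""
    result = {""}
    for _ in range(k):
        result = concat(result, A)
    return result

def star_up_to(A, max_k):
    """Finite slice of A*: union of A^0 .. A^max_k."""
    out = set()
    for k in range(max_k + 1):
        out |= power(A, k)
    return out

def plus_up_to(A, max_k):
    """Finite slice of A+: union of A^1 .. A^max_k (no A^0)."""
    out = set()
    for k in range(1, max_k + 1):
        out |= power(A, k)
    return out
```

The only difference between the two closures is whether A^0, the set holding just the empty string, is included.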
Regular Expressions for Tokens
q Regular expressions are used to specify the patterns of tokens.
q Each pattern matches a set of strings. Tokens fall into several categories:
q Reserved (key) words: these are represented by their fixed sequence of
characters.
§ Ex. if, while and do....
§ To collect all the reserved words into one definition, we could write:
• Reserved = if | while | do |...
q Special symbols: including arithmetic operators, assignment and equality such
as =, :=, +, -, *
q Identifiers: a sequence of letters and digits beginning with a letter. We can
express this with regular definitions as follows:
letter = A|B|…|Z|a|b|…|z, or equivalently letter = [a-zA-Z]
digit = 0|1|…|9, or equivalently digit = [0-9]
identifiers = letter(letter|digit)*
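The regular definitions above map directly onto Python's re module. The sketch below is only an illustration: the names RESERVED and classify are hypothetical, and the reserved-word set contains just the three words listed above.

```python
import re

LETTER = r"[a-zA-Z]"
DIGIT = r"[0-9]"
# identifiers = letter(letter|digit)*
IDENTIFIER = re.compile(rf"{LETTER}({LETTER}|{DIGIT})*")

# Only the reserved words named in the text; a real language has more.
RESERVED = {"if", "while", "do"}

def classify(word):
    """Classify a whole word as reserved, identifier, or neither."""
    if word in RESERVED:
        return "reserved"
    if IDENTIFIER.fullmatch(word):
        return "identifier"
    return "no match"
```

Reserved words are checked first because every reserved word also matches the identifier pattern.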
q Numbers: Numbers can be:
– sequence of digits (natural numbers), or decimal numbers, or
– numbers with exponent (indicated by an e or E).
q Example: 2.71E-2 represents the number 0.0271. We can write regular definitions for these
numbers as follows:
• nat = [0-9]+
• signedNat = (+|-)? nat
• number = signedNat("." nat)?((E|e) signedNat)?
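The regular definitions for numbers can be tried out with Python's re module. This is a sketch under the assumption that the exponent marker may be either e or E, as the text states; the function name is_number is hypothetical.

```python
import re

NAT = r"[0-9]+"                      # nat = [0-9]+
SIGNED_NAT = rf"[+-]?{NAT}"          # signedNat = (+|-)? nat
# number = signedNat ("." nat)? ((E|e) signedNat)?
NUMBER = re.compile(rf"{SIGNED_NAT}(\.{NAT})?([eE]{SIGNED_NAT})?")

def is_number(s):
    """Return True if the whole string s matches the number definition."""
    return NUMBER.fullmatch(s) is not None
```

Under these definitions a string such as "2." or ".5" is rejected, because the fractional part requires digits on both sides of the point.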
q Literals or constants: numeric constants such as 42, and string literals such as "hello, world".
q Relational operators: relop → < | <= | = | <> | > | >=
q Delimiters: delimiter → newline | blank | tab | comment
q White space: whitespace = (delimiter)+
• For example, the following regular expression describes string constants in which the
allowed symbols are alphanumeric characters and sequences consisting of the backslash
symbol followed by a letter (where each such pair is intended to represent a non-alphanumeric
symbol): "([a-zA-Z0-9]|\[a-zA-Z])*"
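As a quick check of the string-constant pattern, here is a Python sketch. The only change from the expression in the text is that the backslash is escaped, because in Python's regex syntax a literal backslash is written as two backslashes. The function name is_string_constant is hypothetical.

```python
import re

# Opening quote, then any number of alphanumerics or backslash-letter pairs,
# then the closing quote.
STRING_CONST = re.compile(r'"([a-zA-Z0-9]|\\[a-zA-Z])*"')

def is_string_constant(s):
    """Return True if s, including its quotes, is a valid string constant."""
    return STRING_CONST.fullmatch(s) is not None
```

A space is not an alphanumeric character, so under this definition it would have to be written as a backslash pair rather than typed directly.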
Exercise
q Describe the languages denoted by the following regular expressions:
a. (ab) | ε
b. ((a|b)a)*
c. a(a|b)*a
d. ((ε|a)b*)*
e. (a|b)*a(a|b)(a|b)
f. a*ba*ba*ba*
g. (aa|bb)*((ab|ba)(aa|bb)*(ab|ba)(aa|bb)*)*
h. Even binary numbers -> (0|1)*0 (Solved)
i. An alphabet consisting of just three alphabetic characters: Σ = {a, b, c}.
Consider the set of all strings over this alphabet that contains exactly one b.
{b, abc, abaca, baaaac, ccbaca, cccccb...} -> (a | c)*b(a|c)* (Solved)
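The two solved items can be sanity-checked mechanically. The Python sketch below compares each regular expression with a direct statement of the property on every short string over the alphabet. The helper names and the length bound of 5 are arbitrary choices for illustration.

```python
import re
from itertools import product

EVEN_BINARY = re.compile(r"(0|1)*0")          # item h
EXACTLY_ONE_B = re.compile(r"(a|c)*b(a|c)*")  # item i

def all_strings(alphabet, max_len):
    """Yield every string over alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def agrees(regex, predicate, alphabet, max_len=5):
    """True if the regex and the predicate accept exactly the same short strings."""
    return all(
        (regex.fullmatch(s) is not None) == predicate(s)
        for s in all_strings(alphabet, max_len)
    )

h_ok = agrees(EVEN_BINARY, lambda s: s.endswith("0"), "01")
i_ok = agrees(EXACTLY_ONE_B, lambda s: s.count("b") == 1, "abc")
```

The same agrees helper can be pointed at the unsolved items once a candidate description has been written as a predicate.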
Recognition of Tokens
§ The simplest explanation of an algorithm to recognize words is often a character-by-character
formulation.
§ Consider the problem of recognizing the keyword new.
§ Assuming the presence of a routine NextChar that returns the next character, the code might look
like the fragment shown below.
§ The code tests for n followed by e followed by w.
§ At each step, failure to match the appropriate character causes the code to reject the string and
"try something else."

    c ← NextChar();
    if (c = 'n')
      then begin;
        c ← NextChar();
        if (c = 'e')
          then begin;
            c ← NextChar();
            if (c = 'w')
              then report success;
              else try something else;
          end;
          else try something else;
      end;
      else try something else;

§ E.g.: A recognizer for while produces the following transition diagram:
[Figure: transition diagram for the keyword while]
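The character-by-character recognizer for new can be written as runnable Python. This is a sketch: make_next_char is a hypothetical stand-in for the NextChar routine, and "try something else" is reduced to returning False.

```python
def make_next_char(text):
    """Build a NextChar-style routine that yields one character per call.

    Returns None once the input is exhausted.
    """
    it = iter(text)

    def next_char():
        return next(it, None)

    return next_char

def recognize_new(next_char):
    """Test for 'n', then 'e', then 'w'; any mismatch rejects."""
    c = next_char()
    if c == "n":
        c = next_char()
        if c == "e":
            c = next_char()
            if c == "w":
                return True   # report success
    return False              # "try something else"
```

Like the fragment it mirrors, this only inspects the first three characters, so it also succeeds on input such as "newt". Deciding whether the word ends there is a separate step.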
Step 2: b|c
Step 3: (b|c)*
Step 4: a(b|c)*
q The NFA is:
[Figure: NFA transition diagram]
Step 2: Each set of NFA states produced by the transition function becomes a
new DFA state. For every such state, compute the set of states reachable on
every input symbol.
Subset Construction (Example)
Transition table
q Step 3: Repeat this process (Step 2) until no new states are reachable.
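The repeat-until-nothing-new loop can be sketched in Python for an NFA without ε-transitions. The NFA used here is a small hypothetical one (it accepts strings over {a, b} that end in ab), not the one from the slides, and the function name subset_construction is likewise an assumption.

```python
def subset_construction(nfa, start, alphabet):
    """Build the DFA transition table for an NFA with no epsilon edges.

    nfa maps (state, symbol) to a set of next states.
    Each DFA state is a frozenset of NFA states.
    """
    dfa = {}
    worklist = [frozenset([start])]
    while worklist:                      # repeat until no new states appear
        S = worklist.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for sym in alphabet:
            # Union of the NFA moves from every member of S on sym.
            T = frozenset(t for s in S for t in nfa.get((s, sym), ()))
            dfa[S][sym] = T
            if T not in dfa:
                worklist.append(T)
    return dfa

# Hypothetical NFA: 1 loops on a and b, 1 -a-> 2, 2 -b-> 3.
NFA = {(1, "a"): {1, 2}, (1, "b"): {1}, (2, "b"): {3}}
DFA = subset_construction(NFA, 1, "ab")
```

Using frozenset for DFA states lets each set of NFA states serve directly as a dictionary key, so a state is recognized as "already seen" regardless of the order in which its members were found.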
[Figure: DFA obtained by subset construction; state labels include 1, 2, 3, 4, 5, 35, 45, 245, 12345 and ∅, with edges labeled a and b]
Subset Construction (ε-closure)
q ε-closure(n) is the set of NFA states reachable from NFA state n by zero or more
ε-transitions.
§ The ε-closure of a state A is the set of states that can be reached from A
using only ε-moves, including A itself.
q ε-closure(S') of a set of states S' is the set of states with the following characteristics:
1. Every state in S' is in ε-closure(S').
2. If t ∈ ε-closure(S') and there is an edge labeled ε from t to v, then v ∈ ε-closure(S').
3. Repeat step 2 until no more states can be added to ε-closure(S').
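The three rules translate almost line for line into Python. The ε-edges below are an assumption: they are those of the standard textbook NFA for (a|b)*abb, which is not shown in this section but is consistent with the set A = {0,1,2,4,7} used in the example. The function name epsilon_closure is hypothetical.

```python
# Assumed epsilon edges (standard textbook NFA for (a|b)*abb).
EPS = {0: {1, 7}, 1: {2, 4}, 3: {6}, 5: {6}, 6: {1, 7}}

def epsilon_closure(states, eps):
    """Return the set of states reachable from `states` by epsilon edges only."""
    closure = set(states)                  # rule 1: S' itself is included
    changed = True
    while changed:                         # rule 3: repeat until nothing is added
        changed = False
        for t in list(closure):
            for v in eps.get(t, ()):       # rule 2: follow an epsilon edge t -> v
                if v not in closure:
                    closure.add(v)
                    changed = True
    return closure
```

With these assumed edges, the closure of the start state 0 is {0, 1, 2, 4, 7}.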
A = {0,1,2,4,7}
B = {1,2,3,4,6,7,8}
C = {1,2,4,5,6,7}
§ Now that we have B and C we can move on to find the states that have a and b
transitions from B and C.
Subset Construction - Example
q Find the states reachable from B on a: move(B,a) = {3,8}
• ε-closure(move(B,a)) = {1,2,3,4,6,7,8}, which is the same as the state B itself. In
other words, B has an edge on a back to itself.
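The move and ε-closure steps of this example can be reproduced in Python. As before, the NFA is an assumption: the edges are those of the standard textbook NFA for (a|b)*abb, chosen because they yield the sets A, B and C listed in this section. The names DELTA, move and dfa_edge are hypothetical.

```python
# Assumed NFA edges (standard textbook NFA for (a|b)*abb).
EPS = {0: {1, 7}, 1: {2, 4}, 3: {6}, 5: {6}, 6: {1, 7}}
DELTA = {(2, "a"): {3}, (4, "b"): {5}, (7, "a"): {8},
         (8, "b"): {9}, (9, "b"): {10}}

def epsilon_closure(states):
    """States reachable from `states` using epsilon edges only."""
    closure, stack = set(states), list(states)
    while stack:
        t = stack.pop()
        for v in EPS.get(t, ()):
            if v not in closure:
                closure.add(v)
                stack.append(v)
    return closure

def move(states, sym):
    """States reachable from `states` by one edge labeled sym."""
    return {t for s in states for t in DELTA.get((s, sym), ())}

def dfa_edge(states, sym):
    """Target DFA state: epsilon-closure of move(states, sym)."""
    return epsilon_closure(move(states, sym))

A = epsilon_closure({0})
B = dfa_edge(A, "a")
C = dfa_edge(A, "b")
```

Evaluating dfa_edge(B, "a") returns B again, which is the edge from B back to itself described above.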
[Figure: Input stream → a.out (Lexical Analyzer) → Sequence of tokens]
Lex specification
q Lex is a program that generates lexical analyzers.
§ It is used with the YACC parser generator.
q The lexical analyzer is a program that transforms an input stream into a sequence of
tokens.
q Lex reads a specification and produces, as output, source code that
implements the lexical analyzer in C or C++.
q Program structure
    C declarations in %{ %}
    %%
    ...rule section...
        P1 { action1 }
        P2 { action2 }
    %%
    ...user defined functions...
q Rules section – pairs of regular expression <--> action.
• The actions are C code.
q Declaration section – variables, constants
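To show what a rules section amounts to, here is a Python analogue of the pattern/action idea. It is not Lex itself and does not reproduce generated code: the rule list, token names and tokenize function are all hypothetical. It follows the usual Lex convention that the longest match wins and that ties go to the earlier rule.

```python
import re

# (pattern, token kind) pairs; a kind of None means "match and discard".
RULES = [
    (r"[ \t\n]+", None),                  # white space
    (r"if|while|do", "RESERVED"),
    (r"[a-zA-Z][a-zA-Z0-9]*", "ID"),
    (r"[0-9]+", "NAT"),
    (r":=|=|\+|-|\*", "SYMBOL"),
]

def tokenize(text):
    """Turn text into a list of (kind, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_len, best_kind = 0, None
        for pattern, kind in RULES:
            m = re.compile(pattern).match(text, pos)
            # Strictly longer matches win, so ties keep the earlier rule.
            if m and m.end() - pos > best_len:
                best_len, best_kind = m.end() - pos, kind
        if best_len == 0:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        if best_kind is not None:
            tokens.append((best_kind, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

Because ties go to the earlier rule, while is reported as RESERVED even though it also matches the ID pattern, whereas a longer word such as iffy is reported as ID.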