
Ambo University

Hachalu Hundessa Campus


School of Informatics and Electrical Engineering
Department of Computer Science

Kenesa B. (getkennyo@gmail.com)
CHAPTER TWO
LEXICAL ANALYSIS
O U T L I N E
q Lexical Analysis
§ Token Specification
§ Recognition of Tokens
q Recognition of Machines
§ NFA to DFA Conversion
q Error Recovery
q A typical Lexical Analyzer Generator
q DFA Analysis

Kenesa B. (Ambo University) Compiler Design 3


Introduction
What does “lexical” mean?

q The word “lexical” in the traditional sense means “pertaining to words”.


q In terms of programming languages, words are objects like variable names,
numbers, keywords, etc.
§ Such words are traditionally called tokens.
q A program or function which performs lexical analysis is called a lexical analyzer,
lexer or scanner.
q The role of the lexical analyzer (lexer or scanner) is:
§ to read a sequence of characters from the source program
§ group them into lexemes and
§ produce as output a token for each lexeme in the source program.
q The tokens are then sent to the parser for syntax analysis.
Introduction...
q The scanner can also perform the following secondary tasks:
§ stripping out whitespaces (in the form of blanks, tabs, new lines,...)
§ stripping out comments
§ keeping track of line numbers (for error reporting),...
q If the lexical analyzer finds a token invalid, it generates an error.
[Figure: source program, lexical analyzer and syntax analyzer connected by next_char() (get next char) and next_token() (get next token), with a symbol table that contains a record for each identifier]
q Token: smallest meaningful sequence of characters of interest in the source program
Input Buffering
q Reading character by character from secondary storage is a slow,
time-consuming process.
q It is necessary to look ahead several characters beyond the lexeme for a pattern
before a match can be announced.
§ Buffer technique is used to eliminate this problem and increase efficiency.
q Many times, a scanner has to look ahead several characters from the current
character in order to recognize the token.
q The lexical analyzer scans the input string from left to right one character at a time.
§ It uses two pointers begin_ptr (bp) and forward_ptr(fp) to keep track of the portion of the
input scanned.



Input Buffering

q Initially both pointers point to the first character of the input string.
q The forward_ptr moves ahead to search for the end of the lexeme.
• As soon as a blank space is encountered, it indicates the end of the lexeme.
q In the above example, as soon as forward_ptr (fp) encounters a blank space the
lexeme "int" is identified.
• When fp encounters whitespace, it ignores it and moves ahead.
• Then both begin_ptr (bp) and forward_ptr (fp) are set at the start of the next token, i.
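A minimal Python sketch of the two-pointer scan described above. It assumes lexemes are separated only by whitespace, as on the slide, and uses the indices bp and fp in place of real buffer pointers. split_lexemes is a hypothetical helper name chosen for this illustration.

```python
def split_lexemes(text):
    """Split text into whitespace-delimited lexemes using two indices."""
    lexemes = []
    n = len(text)
    bp = 0                          # begin pointer: start of the current lexeme
    while bp < n:
        if text[bp].isspace():      # whitespace is ignored
            bp += 1
            continue
        fp = bp                     # forward pointer: scans ahead for the end
        while fp < n and not text[fp].isspace():
            fp += 1
        lexemes.append(text[bp:fp])
        bp = fp                     # restart both pointers after the lexeme
    return lexemes


print(split_lexemes("int i = 10 ;"))
```

A real scanner ends a lexeme on more than blanks (for example, "i=10" holds three lexemes), which is why the later slides move to patterns and automata.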



Token, pattern, lexeme
q A token is a sequence of characters from the source program having a collective
meaning.
§ keywords, constants, identifiers, literals, operators, punctuations symbols and
special characters
§ For example: id and num
q Lexemes are the specific character strings that make up a token.
§ Lexemes are said to be a sequence of characters (alphanumeric) in a token.
§ There are some predefined rules for every lexeme to be identified as a valid token.
§ These rules are defined by grammar rules, by means of a pattern.
§ For example: abc, 123, if, t, ...
q Patterns are rules describing the set of lexemes belonging to a token.
§ For example: “a letter followed by letters and digits”
– Patterns are usually specified using regular expressions, here [a-zA-Z][a-zA-Z0-9]*
Token, pattern, lexeme
q For example, in C language, the variable declaration line: int value = 100;
§ contains lexemes and tokens: int (keyword), value (identifier), = (operator), 100 (constant)
and ; (symbol).
q A token consists of a token name and an optional attribute value.
q The token name is an abstract symbol that represents a kind of lexical unit and the
optional attribute value is commonly referred to as token value.



Attributes of tokens
q When more than one lexeme matches a pattern, the scanner must provide
additional information about the particular lexeme to the subsequent phases of
the compiler.
§ For example, both 0 and 1 match the pattern for the token num.
q But the code generator needs to know which number is recognized.
q The lexical analyzer collects information about tokens into their associated
attributes.
• Tokens influence parsing decisions;
• Attributes influence the translation of tokens after parse
q Practically, a token has one attribute:
§ a pointer to the symbol table entry in which information about the token is kept.
q The symbol table entry contains various information about the token
§ such as its lexeme, type, the line number in which it was first seen...
§ Ex. y = 31 + 28 * x
Attributes of tokens
§ The tokens and their attributes are written as:
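As a rough illustration, the statement y = 31 + 28 * x can be turned into (token name, attribute) pairs with the Python sketch below. The conventions are assumptions made for this example: an id carries an index into a small symbol table (standing in for a pointer to the entry), a num carries its integer value, and operators carry no attribute. TOKEN_SPEC and tokenize are hypothetical names.

```python
import re

# Token names and patterns (assumed for this example only).
TOKEN_SPEC = [
    ("num", r"[0-9]+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("op", r"[=+*]"),
    ("ws", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    symbol_table = []   # one entry (here just the lexeme) per distinct identifier
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                        # whitespace produces no token
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme)))
        elif kind == "num":
            tokens.append(("num", int(lexeme)))
        else:
            tokens.append((lexeme, None))   # operator: the name is enough
    return tokens, symbol_table


tokens, table = tokenize("y = 31 + 28 * x")
print(tokens)
print(table)
```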



Lexical Errors
q Very few errors are detected by the lexical analyzer.
q For example:
§ if the programmer mistakes ebgin for begin, the lexical analyzer cannot detect
the error since it will consider ebgin as an identifier.
§ Since ebgin is a valid lexeme for the token id, the lexical analyzer must return
the token id to the parser
§ and let some other phase of the compiler, probably the parser in this case,
handle an error due to transposition of the letters
q Nonetheless, if a certain sequence of characters follows none of the specified
patterns, the lexical analyzer can detect the error.
§ The simplest recovery strategy is panic mode recovery:
• skipping (deleting) successive characters from the remaining input until the
lexical analyzer can find a well-formed token
Lexical Errors - Recovery
q When an error occurs, the lexical analyzer recovers by:
§ panic mode recovery
§ deleting one character from the remaining input
§ inserting missing characters into the remaining input
§ replacing an incorrect character by a correct character
§ transposing two adjacent characters
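Panic mode recovery can be sketched in Python as follows. The token pattern is an assumption made for illustration (identifiers, naturals and a few one-character operators), and scan_with_panic_mode is a hypothetical name. When no pattern matches, the scanner skips characters until a token can start again and records what it skipped.

```python
import re

# Assumed token pattern for this sketch: identifiers, naturals, some operators.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*|[0-9]+|[=+\-*/;]")


def scan_with_panic_mode(text):
    tokens, errors, pos = [], [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(text, pos)
        if m:
            tokens.append(m.group())
            pos = m.end()
            continue
        # Panic mode: delete characters until a well-formed token can start.
        start = pos
        while (pos < len(text) and not text[pos].isspace()
               and not TOKEN.match(text, pos)):
            pos += 1
        errors.append((start, text[start:pos]))
    return tokens, errors


print(scan_with_panic_mode("x = 3 @# y;"))
```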

You are recommended to revise the Formal Languages & Automata Theory course.



Formal Languages & Automata Theory (a review in one slide)
§ Alphabet (Σ): a finite set of symbols (characters)
§ String: a finite sequence, possibly empty, of symbols drawn from some alphabet Σ
– A string over Σ is a finite sequence of elements from Σ.
– The empty string of no characters is denoted ε; it is the shortest string that can be formed from Σ
§ Language: a set, often infinite, of strings
– Example: Even numbers, Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
• L = {0, 2, 4, 6, 8, 10, 12, 14,...}
§ Finite specifications of (possibly infinite) languages
– Automaton – a recognizer; a machine that accepts all strings in a language (and rejects all
other strings)
– Grammar – a generator; a system for producing all strings in the language (and no other
strings)
§ A particular language may be specified by many different grammars and automata
• A grammar or automaton specifies only one language
Regular Expressions
q RE is an algebraic notation for describing sets of strings.
• RE represents patterns of strings of characters.
q A regular expression s is a string which denotes L(s), a set of strings drawn from
an alphabet Σ.
– L(s) is known as the “language of s.”
q L(s) is defined inductively with the following base cases:
• If a ∈ Σ then a is a regular expression and L(a) = {a}.
• ε is a regular expression and L(ε) contains only the empty string.
q Then, for any regular expressions s and t:
a) s|t or s + t is a RE such that L(s|t) = L(s) ∪ L(t) - (alternation)
b) st is a RE such that L(st) contains all strings formed by the concatenation of a string in L(s)
followed by a string in L(t), i.e. L(st) = {xy | x ∈ L(s), y ∈ L(t)} - (concatenation)
c) s* is a RE such that L(s*) = L(s) concatenated zero or more times, i.e. L(s*) = {ε} ∪ L(s) ∪ L(s)L(s) ∪ … - (Kleene
closure)
Regular Expressions - Examples
q Regular Expression s Language L(s)
• hello { hello }
• d(o|i)g { dog, dig }
• moo* { mo, moo, mooo,... }
• (moo)* {ε, moo, moomoo, moomoomoo,... }
• a(b|a)*a { aa, aaa, aba, aaaa, abaa, abba,... }
q The following shorthands are often used:
§ s? indicates that s is optional.
• s? can be written as (s|ε)
§ s+ indicates that s is repeated one or more times and it can be written as ss*
§ [a-z] indicates any character in that range and it can be written as (a|b|...|z)
§ [^x] indicates any character except x and it can be written as Σ - {x}
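The example expressions and shorthands above can be tried out with Python's re module, whose syntax for |, *, +, ?, [a-z] and [^x] coincides with the notation used here. This is only a quick check; matches is a hypothetical helper that tests whether a whole string belongs to L(s).

```python
import re


def matches(pattern, text):
    """True if the whole of text is in the language of pattern."""
    return re.fullmatch(pattern, text) is not None


# Examples from the table above.
print(matches(r"d(o|i)g", "dig"))      # in { dog, dig }
print(matches(r"moo*", "mo"))          # the second o is under the star
print(matches(r"(moo)*", ""))          # ε is in L((moo)*)
print(matches(r"a(b|a)*a", "abba"))

# Shorthands.
print(matches(r"ab?", "a"))            # s? is (s|ε)
print(matches(r"a+", ""))              # s+ is ss*, so ε is excluded
print(matches(r"[^x]", "x"))           # any single character except x
```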
Regular Expressions - Examples
q Given that L1 = {a, b, c, d}, L2 = {1, 2} based on this, answer the following
questions.
a) What is L1 ∪ L2?
b) What will L1L2 be?
c) What is L1*?
d) What is L1+?

q Solutions:
a) L1 ∪ L2 = {a,b,c,d,1,2}
b) L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
c) L1* = the set of all strings of the letters a, b, c, d, including the empty string: {ε, a, b, c, d, aa, ab, ac, ad, ba,
bb, bc, bd, aaa, . . . }
d) L1+ = the set of all strings of one or more of the letters a, b, c, d; the empty string is not included
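The solutions above can be checked on finite sets in Python. Since L1* is infinite, the sketch only enumerates strings built from a bounded number of concatenations. concat and kleene_up_to are hypothetical helper names, and the empty Python string stands for ε.

```python
from itertools import product


def concat(l1, l2):
    """L1L2: every string of l1 followed by every string of l2."""
    return {x + y for x, y in product(l1, l2)}


def kleene_up_to(lang, max_repeats):
    """The part of L* reachable with at most max_repeats concatenations."""
    result = {""}              # ε, i.e. zero concatenations
    layer = {""}
    for _ in range(max_repeats):
        layer = concat(layer, lang)
        result |= layer
    return result


L1 = {"a", "b", "c", "d"}
L2 = {"1", "2"}

print(sorted(L1 | L2))                      # union
print(sorted(concat(L1, L2)))               # concatenation
print(len(kleene_up_to(L1, 2)))             # ε + 4 + 16 strings
print("" in kleene_up_to(L1, 2) - {""})     # positive closure drops ε here
```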
Regular Expressions for Tokens
q Regular expressions are used to specify the patterns of tokens.
q Each pattern matches a set of strings. It falls into different categories:
q Reserved (Key) words: They are represented by their fixed sequence of
characters,
§ Ex. if, while and do....
§ If we want to collect all the reserved words into one definition, we could write it as follows:
• Reserved = if | while | do |...
q Special symbols: including arithmetic operators, assignment and equality such
as =, :=, +, -, *
q Identifiers: which are defined to be a sequence of letters and digits beginning
with letter, we can express this in terms of regular definitions as follows:
letter = A|B|…|Z|a|b|…|z or in other way letter = [a-zA-Z]
digit = 0|1|…|9 or digit = [0-9]
identifiers = letter(letter|digit)*
Regular Expressions for Tokens
q Numbers: Numbers can be:
– sequence of digits (natural numbers), or decimal numbers, or
– numbers with exponent (indicated by an e or E).
q Example: 2.71E-2 represents the number 0.0271. We can write regular definitions for these
numbers as follows:
• nat = [0-9]+
• signedNat = (+|-)? nat
• number = signedNat(“.” nat)?(E signedNat)?
q Literals or constants: numeric constants such as 42, and string literals such as “hello, world”.
q Relational operators: relop → < | <= | = | <> | > | >=
q Delimiters: delimiter → newline | blank | tab | comment
q White space: whitespace = (delimiter)+
• For example, the following regular expression describes string constants where the
allowed symbols are alphanumeric characters and sequences consisting of the backslash
symbol followed by a letter (where each such pair is intended to represent a non-alphanumeric
symbol): "([a-zA-Z0-9]|\[a-zA-Z])*"
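The regular definitions for identifiers and numbers can be written almost verbatim as Python patterns. This is a sketch under the slide's definitions only: the exponent marker is the upper-case E used in the definition of number, and the names NAT, SIGNED_NAT, NUMBER, IDENTIFIER and is_number are chosen for this example.

```python
import re

NAT = r"[0-9]+"                                        # nat = [0-9]+
SIGNED_NAT = rf"[+-]?{NAT}"                            # signedNat = (+|-)? nat
NUMBER = rf"{SIGNED_NAT}(\.{NAT})?(E{SIGNED_NAT})?"    # optional fraction, exponent
IDENTIFIER = r"[a-zA-Z][a-zA-Z0-9]*"                   # letter(letter|digit)*


def is_number(text):
    return re.fullmatch(NUMBER, text) is not None


def is_identifier(text):
    return re.fullmatch(IDENTIFIER, text) is not None


for s in ["2.71E-2", "42", "-7", "2.", "E5", "value", "9lives"]:
    print(s, is_number(s), is_identifier(s))
```

Building the larger pattern from the smaller ones mirrors how a regular definition names a sub-expression and reuses it.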
Exercise
q Describe the languages denoted by the following regular expressions:
a. (ab) | ε
b. ((a|b)a)*
c. a(a|b)*a
d. ((ε|a)b*)*
e. (a|b)*a(a|b)(a|b)
f. a*ba*ba*ba*
g. (aa|bb)*((ab|ba)(aa|bb)*(ab|ba)(aa|bb)*)*
h. Even binary numbers -> (0|1)*0 (Solved)
i. An alphabet consisting of just three alphabetic characters: Σ = {a, b, c}.
Consider the set of all strings over this alphabet that contains exactly one b.
{b, abc, abaca, baaaac, ccbaca, cccccb...} -> (a | c)*b(a|c)* (Solved)




To be continued...

Recognition of Tokens
§ The simplest explanation of an algorithm to recognize words is often a character-by-character
formulation.
§ Consider the problem of recognizing the keyword new.
§ Assuming the presence of a routine NextChar that returns the next character, the code might look
like the fragment shown below.
    c ← NextChar();
    if (c = ‘n’)
      then begin;
        c ← NextChar();
        if (c = ‘e’)
          then begin;
            c ← NextChar();
            if (c = ‘w’)
              then report success;
              else try something else;
          end;
          else try something else;
      end;
      else try something else;
§ The code tests for n followed by e followed by w.
§ At each step, failure to match the appropriate character causes the code to reject the string and
“try something else.”
§ E.g: A recognizer for while produces the following transition diagram:
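The nested tests for new above can be sketched in Python. The next_char helper plays the role of the assumed NextChar routine and returns None at end of input. recognize_new is a hypothetical name, and "try something else" is reduced to returning False.

```python
def recognize_new(text):
    """Character-by-character test for n, then e, then w."""
    chars = iter(text)

    def next_char():
        return next(chars, None)    # None signals end of input

    c = next_char()
    if c == "n":
        c = next_char()
        if c == "e":
            c = next_char()
            if c == "w":
                return True         # report success
    return False                    # "try something else"


print(recognize_new("new"), recognize_new("net"), recognize_new("not"))
```

Like the pseudocode, this only inspects the first three characters, so it does not check what follows the w.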



Initial MiniJava lexical specification
public class Dog {
public static void main(String args[]) {
int x = 5 * 2;
System.out.println(x);
}
}
Program ::= (Token | Whitespace)*
Token ::= ID | Integer | ReservedWord | Operator | Delimiter
ID ::= Letter (Letter | Digit)*
Letter ::= a | ... | z | A | ... | Z
Digit ::= 0 | ... | 9
Integer ::= Digit+
ReservedWord::= class | public | static | extends | void | int | boolean | if | else |
while | return | true | false | this | new | String | main | System.out.println
Operator ::= + | - | * | / | < | <= | >= | > | == | != | && | !
Delimiter ::= ; | . | , | = | ( | ) | { | } | [ | ]
Implementing Regular Expressions
q Regular expressions can be implemented using finite automata.
§ Once we have all our tokens defined using regular expressions, we can create a finite
automaton for recognizing them.
q A finite automaton (FA) is an abstract machine that can be used to represent
certain forms of computation.
q Graphically, an FA consists of a number of states (represented by numbered
circles) and a number of edges (represented by labeled arrows) between those
states.
q Each edge is labeled with one or more symbols drawn from an alphabet Σ.
§ The machine begins in a start state S0.
q There are two kinds of finite automata:
§ NFAs (nondeterministic finite automata) - have no restrictions on the labels of their edges.
§ DFAs (deterministic finite automata) - have, for each state, and for each symbol of its input
alphabet exactly one edge with that symbol leaving that state.
Finite Automata
q A finite automaton has:
§ A finite set of states, one of which is designated the initial state or start state,
and some of which are designated as final states.
§ An alphabet ∑ of possible input symbols.
§ A finite set of transitions that specifies for each state and for each symbol of
the input alphabet, which state to go to next
q For each input symbol presented to the FA, it moves to the state indicated by the
edge with the same label as the input symbol.
§ Some states of the FA are known as accepting states and are indicated by a double circle.
§ We say that the FA accepts the input string if it ends in an accepting state, and rejects it if it ends in a non-accepting state
§ For example, here is an FA for the keyword for:



Finite Automata (Examples)
q Here is an FA for identifiers of the form [a-z][a-z0-9]+

q And here is an FA for numbers of the form ([1-9][0-9]*)|0
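A table-driven recognizer for numbers of the form ([1-9][0-9]*)|0 might look like the Python sketch below. The state names (start, zero, digits) and the helper names are assumptions made for this example, since the slide's diagram is not reproduced in the text. A missing table entry means there is no edge, so the input is rejected.

```python
def classify(ch):
    """Map an input character to a column of the transition table."""
    if ch == "0":
        return "zero"
    if ch in "123456789":
        return "nonzero"
    return "other"


# (state, character class) -> next state
TRANSITIONS = {
    ("start", "zero"): "zero",          # the single number 0
    ("start", "nonzero"): "digits",     # [1-9]
    ("digits", "zero"): "digits",       # followed by [0-9]*
    ("digits", "nonzero"): "digits",
}
ACCEPTING = {"zero", "digits"}


def accepts_number(text):
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, classify(ch)))
        if state is None:               # no edge for this symbol
            return False
    return state in ACCEPTING


for s in ["0", "10", "907", "007", "01", "", "12a"]:
    print(repr(s), accepts_number(s))
```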



Non-Deterministic Finite Automata (NFA)
q Definition: An NFA, M, is a 5-tuple (Σ, S, δ, s0, F):
• a set of input symbols Σ, the input alphabet
• a finite set of states S,
• a transition function δ: S × (Σ ∪ {ε}) → 2^S (the set of possible next states),
• a start state s0 from S, and
• a set of accepting/final states F from S.
q The language accepted by M, written L(M), is defined as:
§ The set of strings of characters c1c2...cn with each ci from Σ U { ε} such that there exist states
s1 in δ(s0,c1), s2 in δ(s1,c2), ... , sn in δ(sn-1,cn) with sn an element of F.
q An NFA is a finite automaton which has a choice of edges
• The same symbol can label edges from one state to several different states.
q An edge may be labeled by ε, the empty string
• We can have transitions without any input character consumption.
Simulating an NFA
q Keep track of a set of states, initially the start state and everything reachable by
ε-moves.
q For each character in the input:
§ Maintain a set of next states, initially empty.
§ For each current state:
• Follow all transitions labeled with the current letter.
• Add these states to the set of next states.
§ Add every state reachable by an ε-move to the set of next states.
q Accept if at least one state in the set of states is accepting.
– Once a final state is reached and the associated action is performed, we pick up where we left
off at the first character of the next token and begin again at the start state.
– If we do not end in a final state or encounter an unexpected symbol while in any state, we have
an error condition.
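The simulation steps above can be sketched in Python. The transition table is a dictionary from (state, symbol) to a set of states, with the empty string used as the label for ε-moves; this representation and the names eps_closure and nfa_accepts are assumptions of the sketch. The first table is the four-state NFA for (a|b)*abb; the second is a small hypothetical NFA for a?b that exercises an ε-move.

```python
EPS = ""    # label used for ε-moves in the tables below


def eps_closure(states, delta):
    """All states reachable from `states` by ε-moves alone."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def nfa_accepts(text, delta, start, accepting):
    current = eps_closure({start}, delta)
    for ch in text:
        next_states = set()
        for s in current:                       # follow all edges labeled ch
            next_states |= set(delta.get((s, ch), ()))
        current = eps_closure(next_states, delta)
    return bool(current & accepting)            # any accepting state suffices


# NFA for (a|b)*abb: states 0..3, start 0, accepting {3}.
ABB = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
# Hypothetical NFA for a?b with an ε-move from 0 to 1.
OPT_A_B = {(0, EPS): {1}, (0, "a"): {1}, (1, "b"): {2}}

print(nfa_accepts("aabb", ABB, 0, {3}), nfa_accepts("ab", ABB, 0, {3}))
print(nfa_accepts("b", OPT_A_B, 0, {2}), nfa_accepts("aab", OPT_A_B, 0, {2}))
```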



Transition Graph
q The transition graph for an NFA recognizing the language of regular expression
(a|b)*abb
§ all strings of a's and b's ending in the particular string abb
§ S = {0,1,2,3},
§ Σ = {a,b},
§ s0 = 0,
§ F = {3}
§ The mapping δ of an NFA can be represented in
a transition table:
δ(0,a) = {0,1}
δ(0,b) = {0}
δ(1,b) = {2}
δ(2,b) = {3}
Transition diagram for keywords



Transition diagram for relops
q Transition diagram that recognizes the lexemes matching the token relop and id.



NFA - Example
q An NFA can also have an ε (epsilon) transition, which represents the empty string.
§ An ε-transition is taken without consuming any character from the input.
q We could represent the regular expression a*(ab|ac) with this NFA:

q In principle, one can implement an NFA in software or hardware by simply keeping
track of all of the possible states.
§ But this is inefficient.
– In the worst case, we would need to evaluate all states for all characters on each input transition.
§ A better approach is to convert the NFA into an equivalent DFA
NFA - Example

q The string aab is recognised by the sequence of transitions shown in the figure.
q State 1 is the starting state and state 3 is accepting.
• There is an epsilon transition from state 1 to state 2
q This NFA recognises the language described by the regular expression a*(a|b).
q A program that decides if a string is accepted by a given NFA will have to check all
possible paths to see if any of these accepts the string.
• This requires either backtracking until a successful path found or simultaneously following all
possible paths
• both of which are too time-consuming to make NFAs suitable for efficient recognisers.
Deterministic Finite Automata (DFA)
q A deterministic finite automaton is a special case of an NFA
§ No state has an ε-transition
§ For each state S and input symbol a there is at most one edge labeled a leaving S
q Each entry in the transition table is a single state
• At most one path exists to accept a string
• Simulation algorithm is simple
q Example: A DFA that accepts (a|b)*abb



Conversion Algorithms
q Regular expressions and finite automata (NFA & DFA) are all equally powerful.
§ There is an algorithm for converting any RE into an NFA.
§ There is an algorithm for converting any NFA to a DFA.
§ There is an algorithm for converting any DFA to a RE.
• DFA is by far the most straightforward of the three to implement in software.
q Direct construction of a nondeterministic finite automaton (NFA) to recognize a given
regular expression.
• Easy to build in an algorithmic way
• Requires ε-transitions to combine regular subexpressions
q Construct a deterministic finite automaton (DFA) to simulate the NFA
• Use a set-of-state construction
• Minimize the number of states in the DFA
• Generate the scanner code.
Converting REs to NFAs
q We will construct an NFA compositionally from a regular expression
• from each subexpression construct an NFA fragment and then combine these
fragments into bigger fragments
• a fragment is not a complete NFA, so we finish by adding the necessary components to make a
complete NFA
q An NFA fragment consists of a number of states with transitions between these
and additionally two incomplete transitions:
• One pointing into the fragment and one pointing out of the fragment.
q The incoming half-transition is not labelled by a symbol, but the outgoing half-
transition is labelled by either ε or an alphabet symbol.
• These half-transitions are the entry and exit to the fragment and are used to connect it to
other fragments or additional “glue” states.
q There is a (beautiful!) procedure for converting a regular expression to an NFA.
§ We can follow an algorithm given by Ken Thompson (Thompson construction)
Converting REs to NFAs (Thompson Construction)

1. Empty string ε is a regular expression denoting {ε}:

2. The NFA for any character {a} is:

3. For the regular expression {AB}:


(concatenation)



Converting REs to NFAs (Thompson Construction)

4. For the regular expression {A|B}:


(union or alternation)

5. The Kleene closure A* is constructed by:



Converting REs to NFAs (Example)
q Let’s consider the process for an example regular expression a(cat|cow)*
q We start with the innermost expression cat then, do the same thing for cow

q The alternation of the two expressions cat|cow is accomplished by adding a new
starting and accepting node, with epsilon transitions.



Converting REs to NFAs (Example)
q Then, the Kleene closure (cat|cow)* is accomplished by adding another starting and
accepting state around the previous FA, with epsilon transitions:

q Finally, the concatenation of a(cat|cow)* is achieved by adding a single state at the
beginning for a:



Example 2
q Construct an NFA for the given RE: a(b|c)*
Step 1: a, b, c
Step 2: b|c
Step 3: (b|c)*
Step 4: a(b|c)*



Example 3
q Construct an NFA for the given RE: (1+0)*1

q The NFA is



Converting NFAs to DFAs (Subset Construction)
q Surprisingly, for any NFA there is a DFA that accepts the same language.
q The basic idea is to create a DFA such that each state in the DFA corresponds to
multiple states in the NFA
• The conversion is done by simulating all possible paths in an NFA at once.
q This means that we operate with sets of NFA states
• When we have several choices of a next state, we take all of the choices
simultaneously and form a set of the possible next-states.
q The idea is that such a set of NFA states will become a single DFA state.
q For any given symbol we form the set of all possible next-states in the NFA, so we get
a single transition (labelled by that symbol) going from one set of NFA states to
another set.
q Hence, the transition becomes deterministic in the DFA that is formed from the sets of
NFA states.
Subset Construction
q There are three main cases of non-determinism in NFAs:
1. Transition to a state without consuming any input.
2. Multiple transitions on the same input symbol.
3. No transition on an input symbol.
q To convert NFAs to DFAs we need to get rid of non-determinism from NFAs.
q Given an NFA with states S, inputs Σ, transition function δN, start state s0, and final
states F, construct an equivalent DFA with:
§ States 2^S (the set of subsets of S), Inputs Σ, Start state {s0}, Final states = all those
subsets with a member of F.
q But as a DFA state, an expression like {p,q} must be read as a single symbol, not as a
set.
§ The transition function δD is defined by: δD({s1,…,sk}, a) is the union over all i =
1,…,k of δN(si, a).
Subset Construction
q Algorithm
• Input − An NFA
• Output − An equivalent DFA
– Step 1 − Create state table from the given NFA.
– Step 2 − Create a blank state table under possible input alphabets for the
equivalent DFA.
– Step 3 − Mark the start state of the DFA by s0 (Same as the NFA).
– Step 4 − Find out the combination of States {s0, s1,... , sn} for each possible
input alphabet.
– Step 5 − Each time we generate a new DFA state under the input alphabet
columns, we have to apply step 4 again, otherwise go to step 6.
– Step 6 − The states which contain any of the final states of the NFA are the
final states of the equivalent DFA.
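The steps above can be sketched in Python for an NFA without ε-moves, using the four-state NFA for (a|b)*abb with δ(0,a) = {0,1}, δ(0,b) = {0}, δ(1,b) = {2}, δ(2,b) = {3}. As a simplification, the sketch only generates the subsets reachable from {s0} rather than all 2^S subsets. The function and variable names are hypothetical.

```python
from collections import deque


def subset_construction(delta, start, accepting, alphabet):
    """Build the reachable part of the DFA equivalent to an ε-free NFA."""
    start_set = frozenset({start})
    dfa_delta = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for sym in alphabet:
            # Union of the NFA moves of every member of the current set.
            target = frozenset(
                t for s in current for t in delta.get((s, sym), ())
            )
            dfa_delta[(current, sym)] = target
            if target not in seen:          # a new DFA state: process it too
                seen.add(target)
                queue.append(target)
    dfa_accepting = {S for S in seen if S & accepting}
    return seen, dfa_delta, start_set, dfa_accepting


NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
states, table, start, finals = subset_construction(NFA, 0, {3}, "ab")

for (src, sym), dst in sorted(table.items(), key=lambda kv: (sorted(kv[0][0]), kv[0][1])):
    print(sorted(src), sym, "->", sorted(dst))
print("accepting:", [sorted(S) for S in finals])
```

Each DFA state is a frozenset, so it can be used as a dictionary key and treated as a single symbol, as the slides require.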
Subset Construction ( Example)
q Example:
[Figure: NFA with states 1 to 5 and edges labeled a, b]
Step1: Construct a transition table showing all reachable states for
every state for every input symbol.
[Table: columns S, δ(s,a), δ(s,b)]
Step2: The set of states resulting from every transition function
constitutes a new state. Calculate all reachable states for every such
state for every input symbol.
Subset Construction ( Example)
Transition table

q Step3: Repeat this process(step2) until no more new states are reachable.
Subset Construction ( Example)
[Figure: resulting DFA over the state sets 1, 2, 3, 4, 5, 35, 45, 245, 12345 and ∅, with edges labeled a, b]

Stops here as there are no more reachable states


Exercise
q Draw DFAs for the strings matched by the following definitions:
digit = [0-9]
nat = digit+
signednat = (+|-)?nat
number = signednat(“.”nat)?(E signednat)?
q Convert the given NFAs a) and b) to DFAs
q Construct an NFA for the token identifier:
letter(letter|digit)*
q Construct an NFA for the following RE:
(a|b)*abb




To be continued...

Subset Construction (ε -closure)
q ε-closure(n) is the set of NFA states reachable from NFA state n by zero or more
ε-transitions.
§ ε-closure for a given state A means the set of states which can be reached from the
state A with only ε-moves, including the state A itself.
q ε-closure(S’), for a set of states S’, is the set of states with the following characteristics:
1. Every state in S’ is in ε-closure(S’)
2. if t ∈ ε-closure(S’) and there is an edge labeled ε from t to v, then v ∈ ε-closure(S’)
3. Repeat step 2 until no more states can be added to ε-closure(S’).

q Example: find ε-closure for NFA of (a|b)*abb



Subset Construction - Example
q Example: find ε-closure for NFA of (a|b)*abb

§ ε-closure (0)= {0, 1, 2, 4, 7}


§ ε-closure (1)= {1, 2, 4}
§ ε-closure (3)= {1, 2, 3, 4, 6, 7}
§ ε-closure (5)= {1, 2, 4, 5, 6, 7}...
Subset Construction - Example
q Example: Construct an equivalent DFA for above NFA with ε-closure
q Solution:
1. Begin with the start state 0 and calculate ε-closure(0).
• the set of states reachable by ε-transitions which includes 0 itself is { 0,1,2,4,7}.
This defines a new state A in the DFA A = {0,1,2,4,7}
2. We must now find the states that A connects to. There are two symbols in the
language (a, b) so in the DFA we expect only two edges: from A on a and from A on b.
Call these states B and C:



Subset Construction - Example
q We find B and C in the following way:
§ Find the state B that has an edge on a from A
a) start with A{0,1,2,4,7}. Find which states in A have states reachable by a transitions. This set
is called move(A,a) The set is {3,8}:
move(A,a) = {3,8}
b) now do an ε-closure on move(A,a). Find all the states which are reachable from move(A,a) with
ε-transitions. We have 3 and 8 to consider. Starting with 3 we can get to 3 and 6 and from 6 to 1
and 7, and from 1 to 2 and 4. Starting with 8 we can get to 8 only. So the complete set is
{1,2,3,4,6,7,8}.
So ε-closure (move(A,a)) = B = {1,2,3,4,6,7,8}
• This defines the new state B that has an edge on a from A
§ Find the state C that has an edge on b from A
c) start with A{0,1,2,4,7}. Find which states in A have states reachable by b transitions. This set
is called move(A,b) The set is {5}: move(A,b) = {5}
Subset Construction - Example
q We find B and C in the following way:
d) now do an ε-closure on move(A,b). Find all the states which are reachable from move(A,b) with
ε-transitions. We have only state 5 to consider. From 5 we can get to 5, 6, 7, 1, 2, 4. So the complete
set is {1,2,4,5,6,7}.
So ε-closure(move(A,b)) = C = {1,2,4,5,6,7}
§ This defines the new state C that has an edge on b from A

A = {0,1,2,4,7}
B = {1,2,3,4,6,7,8}
C = {1,2,4,5,6,7}

§ Now that we have B and C we can move on to find the states that have a and b
transitions from B and C.
Subset Construction - Example
q Find the state that has an edge on a from B: move(B,a) = {3,8}
• ε-closure(move(B,a)) = {1,2,3,4,6,7,8}, which is the same as the state B itself. In
other words, we have a repeating edge to B:

q Find the state D that has an edge on b from B: move(B,b) = {5,9}
• ε-closure(move(B,b)) = D = {1,2,4,5,6,7,9}



Subset Construction - Example
q Find the state that has an edge on a from D: move(D,a) = {3,8}
• ε-closure(move(D,a)) = {1,2,3,4,6,7,8} = B
q Find the state E that has an edge on b from D: move(D,b) = {5,10}
• ε-closure(move(D,b)) = E = {1,2,4,5,6,7,10}
q Find the state that has an edge on a from C: move(C,a) = {3,8}
• ε-closure(move(C,a)) = B
q Find the state that has an edge on b from C: move(C,b) = {5}
• ε-closure(move(C,b)) = C
q Find the state that has an edge on a from E: move(E,a) = {3,8}
• ε-closure(move(E,a)) = B
q Find the state that has an edge on b from E: move(E,b) = {5}
• ε-closure(move(E,b)) = C
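The whole walk-through can be replayed with a short Python sketch. The NFA edges below are an assumption: they are reconstructed from the move and ε-closure values quoted in the steps (the diagram itself is not in the text), with states 0 to 10. The names EPS_EDGES, SYM_EDGES and build_dfa are hypothetical.

```python
# ε-edges and labeled edges of the NFA for (a|b)*abb, reconstructed from
# the closures and moves in the walk-through (an assumption of this sketch).
EPS_EDGES = {0: {1, 7}, 1: {2, 4}, 3: {6}, 5: {6}, 6: {1, 7}}
SYM_EDGES = {(2, "a"): {3}, (4, "b"): {5}, (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10}}


def eps_closure(states):
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in EPS_EDGES.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(states, sym):
    return {t for s in states for t in SYM_EDGES.get((s, sym), ())}


def build_dfa(start=0, alphabet="ab"):
    start_state = eps_closure({start})
    names = {start_state: "A"}          # DFA states are named in discovery order
    order = [start_state]
    table = {}
    for state in order:                 # `order` grows as new states are found
        for sym in alphabet:
            target = eps_closure(move(state, sym))
            if target not in names:
                names[target] = chr(ord("A") + len(names))
                order.append(target)
            table[(names[state], sym)] = names[target]
    return names, table


names, table = build_dfa()
for subset, name in sorted(names.items(), key=lambda kv: kv[1]):
    print(name, sorted(subset))
print(table)
```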



Example
q DFA for above NFA



Exercise
q Construct an equivalent DFA for the following NFAs with ε-closure



Reading Assignment
q DFA minimization
§ How to minimize a DFA? (see Dragon Book 3.9.6, p. 180)
§ How to convert an RE to a DFA directly? (see Dragon Book 3.9.5, p. 179)
q DFA Analysis



Lexical- Analyzer Generator: Lex
q The first phase in a compiler reads the input source and converts strings in the source
to tokens.
q Lex: generates a scanner (lexical analyzer or lexer) given a specification of the tokens
using REs.
§ The input notation for the Lex tool is referred to as the Lex language and
§ The tool itself is the Lex compiler.
q The Lex compiler transforms the input patterns into a transition diagram and
generates code, in a file called lex.yy.c, that simulates this transition diagram.
q By using regular expressions, we can specify patterns to lex that allow it to scan and
match strings in the input.
q Each pattern in lex has an associated action.
§ Typically an action returns a token, representing the matched string, for subsequent use by the
parser.
§ It uses patterns that match strings in the input and converts the strings to tokens.
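The idea of pairing each pattern with an action can be imitated in a few lines of Python. This is an analogy only, not Lex input and not the code Lex generates: the rule list, the token names and scan are hypothetical. It follows the usual Lex convention that the longest match wins and that ties go to the rule listed first.

```python
import re

# (pattern, action) pairs; an action returns a token, or None to skip the lexeme.
RULES = [
    (re.compile(r"if|while|do"), lambda lexeme: ("KEYWORD", lexeme)),
    (re.compile(r"[a-zA-Z][a-zA-Z0-9]*"), lambda lexeme: ("ID", lexeme)),
    (re.compile(r"[0-9]+"), lambda lexeme: ("NUM", int(lexeme))),
    (re.compile(r"[ \t\n]+"), lambda lexeme: None),
]


def scan(text):
    pos, tokens = 0, []
    while pos < len(text):
        best_len, best_action = 0, None
        for pattern, action in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_len, best_action = m.end() - pos, action
        if best_action is None:
            raise ValueError(f"no rule matches at position {pos}")
        token = best_action(text[pos:pos + best_len])
        if token is not None:
            tokens.append(token)
        pos += best_len
    return tokens


print(scan("if x1 while 42 dog"))
```

Here dog becomes an ID rather than the keyword do followed by g, because the identifier rule matches a longer lexeme.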
Scanner, Parser, Lex and Yacc



Generating a Lexical Analyzer using Lex
q Lex is a scanner generator - it takes lexical specification as input, and produces a
lexical analyzer written in C.
Lex source program (lex.l) → Lex compiler → lex.yy.c
lex.yy.c → C compiler → a.out
Input stream → a.out (lexical analyzer) → Sequence of tokens
Lex specification
q Lex is a program that generates a lexical analyzer.
§ It is used with the YACC parser generator.
q The lexical analyzer is a program that transforms an input stream into a sequence of
tokens.
q Lex reads the specification and produces, as output, C source code that
implements the lexical analyzer.
q Program structure
C declarations in %{ %}
%%
...rule section...
P1 { action1 }
P2 { action2 }
%%
...user defined functions...
q Rules section – regular expression <--> action.
• The actions are C program.
q Declaration section – variables, constants

T H E E N D !