
AUTOMATA THEORY AND COMPILER DESIGN- 21CS51


Ms. Savitha T, Assistant Professor


DEPT. OF CSE | RNSIT

DR. SAMPADA K S, CSE, RNSIT 1



MODULE-2
Regular Expressions and Languages: Regular Expressions, Finite Automata and Regular
Expressions, Proving Languages Not to Be Regular

Lexical Analysis Phase of compiler Design: Role of Lexical Analyzer, Input Buffering,
Specification of Token, Recognition of Token.

PART 1: REGULAR LANGUAGE:


One way of defining a regular language is via regular expressions. A regular expression is a
combination of symbols from some alphabet ∑, parentheses and the operators + (union),
. (concatenation) and * (closure).
Definition of regular expressions:
Let ∑ be a given alphabet. Then
• Ø, Ɛ and each a ∈ ∑ are primitive regular expressions.
• If R and S are regular expressions, then
o R+S,
o R.S,
o R* are also regular expressions.
• A string is a regular expression iff it can be derived from the primitive regular expressions by
a finite number of applications of these rules.

Language associated with regular expressions.

Definition: Let M = (Q, ∑, δ, q0, A) be a DFA. A language L is regular if there exists a machine
M such that L = L(M).
• Definition: A regular expression, and the language it denotes, is defined recursively as follows.
1. Ø is a regular expression denoting the empty language.
2. Ɛ (epsilon) is a regular expression denoting the language {Ɛ}, which contains only the empty string.
3. a is a regular expression denoting the language {a}.
4. If R is a regular expression denoting the language L(R) and S is a regular expression denoting
L(S), then
a. R+S is a regular expression corresponding to the language L(R) U L(S).
b. R.S is a regular expression corresponding to the language L(R).L(S).
c. R* is a regular expression corresponding to the language (L(R))*.
5. Only the expressions obtained by applying rules 1-4 are regular expressions.
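
The recursive definition above can be tried out on small finite languages. The following is a minimal Python sketch, not part of the original notes: the helper names `union`, `concat` and `star` are illustrative, and because L(R*) is infinite in general, `star` is cut off at a chosen maximum length.

```python
def union(l1, l2):
    """L(R+S) = L(R) U L(S)."""
    return l1 | l2

def concat(l1, l2):
    """L(R.S): every string of L(R) followed by every string of L(S)."""
    return {u + v for u in l1 for v in l2}

def star(lang, max_len):
    """L(R*) restricted to strings of length <= max_len (the full set is infinite)."""
    result = {""}                      # the empty string is always in L*
    frontier = {""}
    while frontier:
        # extend the newest strings by one more string of lang
        frontier = {u + v for u in frontier for v in lang
                    if len(u + v) <= max_len} - result
        result |= frontier
    return result

EMPTY = set()          # language of the expression Ø
EPSILON = {""}         # language of the expression Ɛ
A, B = {"a"}, {"b"}    # languages of the expressions a and b

# (a+b)*abb, keeping only strings of length <= 4
sample = {w for w in concat(star(union(A, B), 4), {"abb"}) if len(w) <= 4}
print(sorted(sample))
```

Note that `star(EMPTY, n)` returns `{""}`, which matches the fact that Ø* denotes {Ɛ}.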

SAVITHA T, CSE, RNSIT 2


The following table shows examples of regular expressions and the languages corresponding to
these regular expressions. In the table, a⁺ means one or more a's (that is, aa*); everywhere else
+ denotes union.

Regular expression    Meaning
a*                    Strings consisting of any number of a's (0 or more)
a⁺                    Strings consisting of at least one a (1 or more)
(a+b)                 Strings consisting of either one a or one b
(a+b)*                Set of strings of a's and b's of any length, including the NULL string
(a+b)*abb             Set of strings of a's and b's ending with the string abb
ab(a+b)*              Set of strings of a's and b's starting with the string ab
(a+b)*aa(a+b)*        Set of strings of a's and b's having the substring aa
a*b*c*                Set of strings consisting of any number of a's (may be the empty string)
                      followed by any number of b's (may be the empty string)
                      followed by any number of c's (may be the empty string)
a⁺b⁺c⁺                Set of strings consisting of at least one 'a' followed by at least one 'b'
                      followed by at least one 'c'
aa*bb*cc*             Same language as a⁺b⁺c⁺: at least one 'a', then at least one 'b',
                      then at least one 'c'
(a+b)*(a+bb)          Set of strings of a's and b's ending with either a or bb
(aa)*(bb)*b           Set of strings consisting of an even number of a's followed by an odd
                      number of b's
01*+1                 The string 1, or a 0 followed by any number of 1's (0 or more)
(01)*+1               Zero or more repetitions of 01, or the string 1
0(1*+1)               Set of strings consisting of a zero followed by any number of 1's
(1+Ɛ)(00*1)*0*        Strings of 0's and 1's without any consecutive 1's
(0+10)*1*             Strings of 0's and 1's in which consecutive 1's occur only at the end
                      (no substring 110)
(a+b)(a+b)            Strings of a's and b's of length 2
(0+1)*000             Set of strings of 0's and 1's ending with three consecutive zeros
(11)*                 Set of strings consisting of an even number of 1's
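
Rows of this table can be checked by brute force over all short strings. The sketch below is an illustration added to the notes and uses Python's `re` module, where union is written `|` instead of `+` and (R+Ɛ) is written `R?`; the helper names `all_strings` and `language` are made up for this example, and the check only covers strings up to `MAX_LEN`.

```python
import re
from itertools import product

def all_strings(alphabet, max_len):
    """Every string over `alphabet` of length 0..max_len."""
    return ["".join(p) for n in range(max_len + 1)
            for p in product(alphabet, repeat=n)]

def language(pattern, alphabet, max_len):
    """The strings up to max_len that the whole pattern matches."""
    rx = re.compile(pattern)
    return {w for w in all_strings(alphabet, max_len) if rx.fullmatch(w)}

MAX_LEN = 8
binary = all_strings("01", MAX_LEN)

# (1+Ɛ)(00*1)*0*  : no two consecutive 1's
no_consecutive_ones = language(r"1?(00*1)*0*", "01", MAX_LEN)
# (0+1)*000       : ends with three consecutive zeros
ends_in_000 = language(r"(0|1)*000", "01", MAX_LEN)
# (0+10)*1*       : consecutive 1's only at the end
no_110 = language(r"(0|10)*1*", "01", MAX_LEN)

print(no_consecutive_ones == {w for w in binary if "11" not in w})
print(ends_in_000 == {w for w in binary if w.endswith("000")})
print(no_110 == {w for w in binary if "110" not in w})
```

A check of this kind cannot prove that two descriptions agree on all strings, but a mismatch on short strings quickly exposes a wrong description.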


Finite automata and Regular Expressions

To Obtain NFA from the regular expression
Theorem: Let R be a regular expression. Then there exists a finite automaton M = (Q, ∑, δ, q0, A)
which accepts L(R).

Proof: By definition, Ø, Ɛ and a are regular expressions. The corresponding machines that recognize
these expressions are shown in the figure.

[Figure: three NFAs, each with start state q0 and final state qf: (a) for Ø, (b) for Ɛ, with an
Ɛ-transition from q0 to qf, (c) for a, with a transition on a from q0 to qf]

The schematic representation of a machine M that accepts the language L(R) of a regular expression R
is shown in the figure, where q is the start state and f is the final state of machine M.

[Figure: Schematic representation of FA accepting L(R), with start state q and final state f]

From the definition of a regular expression it is clear that if R and S are regular expressions, then R+S,
R.S and R* are regular expressions, which use the three operators '+', '.' and '*'. Let us take
each case separately and construct the equivalent machine. Let M1 = (Q1, ∑1, δ1, q1, f1) be a machine
which accepts the language L(R1) corresponding to the regular expression R1. Let M2 = (Q2, ∑2, δ2,
q2, f2) be a machine which accepts the language L(R2) corresponding to the regular expression R2.

Case 1: R = R1 + R2. We can construct an NFA which accepts either L(R1) or L(R2), which can be
represented as L(R1 + R2), as shown in figure 3.3.

[Figure 3.3: NFA to accept the language L(R1 + R2). A new start state q0 has Ɛ-transitions to q1 and
q2, the start states of M1 and M2; f1 and f2 have Ɛ-transitions to a new final state qf.]

It is clear from the figure that the machine can accept either L(R1) or L(R2). Here, q0 is the start state of
the combined machine and qf is the final state of the combined machine M.

Case 2: R = R1 . R2. We can construct an NFA which accepts L(R1) followed by L(R2), which can be
represented as L(R1 . R2), as shown in figure 3.4.

[Figure 3.4: NFA to accept the language L(R1 . R2). M1 (from q1 to f1) is followed by M2 (from q2
to f2), and the two are linked by an Ɛ-transition from f1 to q2.]

It is clear from the figure that the machine, after accepting L(R1), moves from state q1 to f1. Since there is
an Ɛ-transition, the machine moves from state f1 to state q2 without consuming any input. In state q2, upon
accepting L(R2), the machine moves to f2, which is the final state. Thus q1, which is the start state of
machine M1, becomes the start state of the combined machine M, and f2, which is the final state of
machine M2, becomes the final state of machine M, which accepts the language L(R1.R2).

Case 3: R = (R1)*. We can construct an NFA which accepts L(R1)* as shown in figure 3.5.a. It
can also be represented as shown in figure 3.5.b.

[Figure 3.5 (a) and (b): NFA to accept the language L(R1)*. New states q0 and qf are joined to M1 by
Ɛ-transitions from q0 to q1 and from f1 to qf; further Ɛ-transitions let the machine bypass M1 or
repeat it.]

It is clear from the figure that the machine can accept either Ɛ or any number of strings from L(R1), thus accepting
the language L(R1)*. Here, q0 is the start state and qf is the final state.

Example
Obtain an NFA which accepts strings of a’s and b’s starting with the string ab.

The regular expression corresponding to this language is ab(a+b)*.

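Step 1: The machine to accept 'a' is shown below.

4 --a--> 5

Step 2: The machine to accept 'b' is shown below.

6 --b--> 7

Step 3: The machine to accept (a + b) is shown below.

[Figure: new start state 3 with Ɛ-transitions to states 4 and 6; 4 --a--> 5 and 6 --b--> 7;
Ɛ-transitions from states 5 and 7 to the new final state 8]

Step 4: The machine to accept (a+b)* is shown below.

[Figure: the machine of Step 3 placed between a new start state 2 and a new final state 9,
connected by Ɛ-transitions as in Case 3]

Step 5: The machine to accept ab is shown below.

0 --a--> 1 --b--> 2

Step 6: The machine to accept ab(a+b)* is shown below.

[Figure: 0 --a--> 1 --b--> 2 followed by the machine of Step 4 (states 2 to 9); state 0 is the start
state and state 9 is the final state]

To accept the language L(ab(a+b)*)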
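
The three cases and the worked example above can be sketched in Python. This is an illustrative sketch rather than the notes' own construction: the class `NFA` and the functions `symbol`, `union`, `concat`, `star` and `accepts` are names chosen here, state numbers are generated automatically (so they differ from the figure), and `concat` always inserts the Ɛ-transition of Case 2 instead of merging states as Step 5 does.

```python
from itertools import count

_ids = count()      # source of fresh state numbers
EPS = None          # label used for Ɛ-transitions

class NFA:
    def __init__(self, start, final, edges):
        self.start, self.final = start, final
        self.edges = edges              # list of (source, label, target)

def symbol(a):
    """Machine for a single symbol: q --a--> f."""
    q, f = next(_ids), next(_ids)
    return NFA(q, f, [(q, a, f)])

def union(m1, m2):
    """Case 1: R1 + R2, with a new start state and a new final state."""
    q, f = next(_ids), next(_ids)
    extra = [(q, EPS, m1.start), (q, EPS, m2.start),
             (m1.final, EPS, f), (m2.final, EPS, f)]
    return NFA(q, f, m1.edges + m2.edges + extra)

def concat(m1, m2):
    """Case 2: R1 . R2, linking f1 to q2 by an Ɛ-transition."""
    return NFA(m1.start, m2.final,
               m1.edges + m2.edges + [(m1.final, EPS, m2.start)])

def star(m):
    """Case 3: (R1)*, allowing M1 to be bypassed or repeated."""
    q, f = next(_ids), next(_ids)
    extra = [(q, EPS, m.start), (m.final, EPS, f),
             (q, EPS, f), (m.final, EPS, m.start)]
    return NFA(q, f, m.edges + extra)

def eps_closure(nfa, states):
    """All states reachable from `states` using only Ɛ-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for src, label, dst in nfa.edges:
            if src == s and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def accepts(nfa, w):
    """Simulate the NFA on w by tracking the set of current states."""
    current = eps_closure(nfa, {nfa.start})
    for ch in w:
        moved = {dst for src, label, dst in nfa.edges
                 if src in current and label == ch}
        current = eps_closure(nfa, moved)
    return nfa.final in current

# ab(a+b)*
machine = concat(concat(symbol("a"), symbol("b")),
                 star(union(symbol("a"), symbol("b"))))
print(accepts(machine, "abba"), accepts(machine, "ba"))
```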


Obtain the regular expression from FA (KLEENE'S Theorem)


Theorem: Let M = (Q, ∑, δ, q0, A) be an FA recognizing the language L. Then there exists an
equivalent regular expression R for the regular language L such that L = L(R).

The general procedure to obtain a regular expression from an FA is shown below. Consider the
generalized graph

[Figure 3.9: Generalized transition graph with states q0 and q1 and edges labelled r1, r2, r3 and r4]

where r1, r2, r3 and r4 are the regular expressions that label the edges. Equation 3.1 below
corresponds to r1 labelling the loop on q0, r2 the edge from q0 to q1, r3 the edge from q1 back to q0
and r4 the loop on q1, with q0 as the start state and q1 as the final state. The
regular expression for this graph takes the form:

r = r1* r2 (r4 + r3 r1* r2)*          (3.1)

Note:
1. Any graph can be reduced to the graph shown in figure 3.9. Then substitute the regular
expressions appropriately in equation 3.1 to obtain the final regular expression.
2. If r3 is not present in figure 3.9, the regular expression takes the form
r = r1* r2 r4*          (3.2)

3. If r3 is not present and both q0 and q1 are final states, the regular expression takes the form
r = r1* + r1* r2 r4*          (3.3)
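
Equation 3.1 can be sanity-checked on a small instance. The sketch below is an added illustration and rests on assumptions: each label is taken to be a single symbol (r1 = a, r2 = b, r3 = c, r4 = d), the edge layout is the one that matches equation 3.1 (loop r1 on q0, r2 from q0 to q1, r3 from q1 to q0, loop r4 on q1, q1 final), and the comparison only covers strings up to length 6.

```python
import re
from itertools import product

# Transition table of the two-state graph under the assumptions stated above.
DELTA = {("q0", "a"): "q0", ("q0", "b"): "q1",
         ("q1", "c"): "q0", ("q1", "d"): "q1"}

def accepted_by_graph(w):
    """Run the graph from q0; accept if the run ends in q1."""
    state = "q0"
    for ch in w:
        state = DELTA.get((state, ch))
        if state is None:           # no edge for this symbol
            return False
    return state == "q1"

# r = r1* r2 (r4 + r3 r1* r2)*  with the single-symbol labels substituted
FORMULA = re.compile(r"a*b(d|ca*b)*")

mismatches = [w
              for n in range(7)
              for w in map("".join, product("abcd", repeat=n))
              if accepted_by_graph(w) != bool(FORMULA.fullmatch(w))]
print(len(mismatches))
```

An empty `mismatches` list means the graph and the expression agree on every string that was tried.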

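Example
Obtain a regular expression for the FA shown below:

[Figure: FA with states q0, q1, q2 and q3. The labels recoverable from the figure are a path
q0 --0--> q1 --1--> q0, a path q0 --1--> q2 --0--> q0, and a state q3 with a loop on 0,1.]

The figure can be reduced as shown below:

[Figure: single state q0 with two loops labelled 01 and 10]

It is clear from this figure that the machine accepts strings of 01's and 10's of any length, and the
regular expression takes the form

(01 + 10)*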


What is the language accepted by the following FA?

[Figure: FA with states q0, q1 and q2. q0 loops on 0 and goes to q1 on 1; q1 loops on 1 and goes to
q2 on 0; q2 loops on 0,1.]

Since state q2 is the dead state, it can be removed and the following FA is obtained.

[Figure: q0 loops on 0 and goes to q1 on 1; q1 loops on 1]

The state q0 is a final state, and at this point the machine can accept any number of 0's, which can be
represented using the notation 0*

q1 is also a final state. To reach q1 one can input any number of 0's followed by 1, followed
by any number of 1's, which can be represented as
0*11*

So the final regular expression is obtained by adding 0* and 0*11*:

R.E = 0* + 0*11*
    = 0* (Ɛ + 11*)
    = 0* (Ɛ + 1⁺)
    = 0* (1*) = 0*1*
It is clear from the regular expression that the language consists of any number of 0's (possibly Ɛ) followed
by any number of 1's (possibly Ɛ).


Regular languages

In theoretical computer science and formal language theory, a regular language is a formal
language that can be expressed using a regular expression. Note that the "regular
expression" features provided with many programming languages are augmented with
features that make them capable of recognizing languages that cannot be expressed by the
formal regular expressions. In the Chomsky hierarchy, regular languages are defined to be
the languages that are generated by Type-3 grammars (regular grammars). Regular
languages are very useful in input parsing and programming language design.

The collection of regular languages over an alphabet Σ is defined recursively as follows:

• The empty language Ø is a regular language.
• For each a ∈ Σ (a belongs to Σ), the singleton language {a} is a regular language.
• If A and B are regular languages, then
o A U B (union),
o A • B (concatenation), and
o A* (Kleene star) are regular languages.
• No other languages over Σ are regular.

Proving languages not to be regular languages

Pumping Lemma (PL) for Regular Languages:

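Theorem: Let M = (Q, ∑, δ, q0, A) be an FA with n states, and let L be the regular language
accepted by M. Then every string w in L with |w| ≥ n can be broken into three substrings
x, y and z, with w = xyz, such that
1. |y| > 0
2. |xy| ≤ n
3. For all k ≥ 0, the string x y^k z is also in L.
PROOF:

Let L be a regular language accepted by an FA having n states. Let w = a1 a2 a3 ... an be a string in L, so that |w| = n ≥ n.
Let the start state be P1. Let w = xyz where, for example, x = a1 a2 a3 ... a(n-1), y = an and z = Ɛ.
In general, reading the n symbols of w takes the FA through n + 1 states, so by the pigeonhole
principle some state is visited twice; y is the part of w read between the two visits, and it can be
repeated any number of times (or dropped) without changing the state reached at the end.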


Uses of Pumping Lemma: - It is used to show that certain languages are not regular. It
should never be used to show that some language is regular. To show that a language is
regular, write a regular expression, DFA or NFA for it.

General Method of proof: -

(i) Select w such that |w| ≥ n
(ii) Select y such that |y| ≥ 1
(iii) Select x such that |xy| ≤ n
(iv) Assign the remaining string to z
Select k suitably to show that the resulting string is not in L.
Example 1: To prove that L = {w | w = a^n b^n, where n ≥ 1} is not regular
Proof:

Let L be regular. Let n be the constant (PL Definition). Consider a word w in L.

Let w = a^n b^n, such that |w| = 2n. Since 2n > n and L is regular, it must satisfy PL.

xy contains only a's (because |xy| ≤ n).

Let |y| = l, where l > 0 (because |y| > 0).

Then the break up of x, y and z can be as follows:

x = a^(n-l), y = a^l, z = b^n

From the definition of PL, x y^k z, where k = 0, 1, 2, ..., should belong to L.

That is, a^(n-l) (a^l)^k b^n ∈ L, for all k = 0, 1, 2, ...

Put k = 0. We get a^(n-l) b^n ∉ L, because it has fewer a's than b's. Contradiction.

Hence the language is not regular.
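
The argument of Example 1 can be replayed mechanically for a fixed value of the constant. The sketch below is an added illustration: n = 5 is only a stand-in for the unknown pumping-lemma constant, and `in_L` and `splits` are helper names chosen here. It enumerates every decomposition allowed by conditions 1 and 2 and checks that pumping with k = 0 always leaves the language.

```python
def in_L(w):
    """Membership test for L = { a^n b^n | n >= 1 }."""
    half = len(w) // 2
    return len(w) >= 2 and len(w) % 2 == 0 and w == "a" * half + "b" * half

def splits(w, n):
    """Every decomposition w = xyz with |y| > 0 and |xy| <= n."""
    for i in range(n):                    # i = |x|
        for j in range(i + 1, n + 1):     # j = |xy|
            yield w[:i], w[i:j], w[j:]

n = 5                                     # stand-in for the PL constant
w = "a" * n + "b" * n

# Decompositions that survive pumping with k = 0, i.e. with xz still in L
survivors = [(x, y, z) for x, y, z in splits(w, n) if in_L(x + z)]
print(survivors)
```

No decomposition survives, which is exactly the contradiction used in the proof; the proof itself makes the same point for every n, not only for the value tried here.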

Example 2: To prove that L = {w | w is a palindrome on {a,b}*} is not regular, i.e., L = {aabaa,
aba, abbbba, ...}

Proof:

Let L be regular. Let n be the constant (PL Definition). Consider a word w in L. Let w =
a^n b a^n, such that |w| = 2n+1. Since 2n+1 > n and L is regular, it must satisfy PL.

xy contains only a's (because |xy| ≤ n).

Let |y| = l, where l > 0 (because |y| > 0).

That is, the break up of x, y and z can be as follows:

x = a^(n-l), y = a^l, z = b a^n

From the definition of PL, x y^k z, where k = 0, 1, 2, ..., should belong to L.

That is, a^(n-l) (a^l)^k b a^n ∈ L, for all k = 0, 1, 2, ...

Put k = 0. We get a^(n-l) b a^n ∉ L, because it is not a palindrome.

Contradiction, hence the language is not regular.

Example 3: To prove that L = {all strings of 1's whose length is prime} is not regular, i.e.,
L = {1^2, 1^3, 1^5, 1^7, 1^11, ...}

Proof: Let L be regular. Let w = 1^p where p is prime and p ≥ n + 2.

Let |y| = m, so that 1 ≤ m ≤ n.
By PL, x y^k z ∈ L for all k ≥ 0.
Let k = p - m. Then
|x y^k z| = |xz| + |y^k|
          = (p - m) + m (p - m)
          = (p - m) (1 + m)
This cannot be prime if p - m ≥ 2 and 1 + m ≥ 2:
1. (1 + m) ≥ 2 because m ≥ 1
2. (p - m) ≥ 2 because p ≥ n + 2 and m ≤ n (limiting case p = n + 2)
So x y^(p-m) z is not in L, which contradicts PL. Hence the language is not regular.
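
The arithmetic of Example 3 can be checked numerically. In this added sketch n = 6 is only a stand-in for the pumping-lemma constant and `is_prime` is a helper written for the illustration; for every possible |y| = m it computes the pumped length with k = p - m.

```python
def is_prime(k):
    """Trial-division primality test, adequate for small numbers."""
    return k >= 2 and all(k % d for d in range(2, int(k ** 0.5) + 1))

n = 6                                                    # stand-in for the PL constant
p = next(q for q in range(n + 2, 10 * n) if is_prime(q)) # smallest prime >= n + 2

# For each m = |y| with 1 <= m <= n, pump with k = p - m
lengths = []
for m in range(1, n + 1):
    k = p - m
    pumped = (p - m) + m * k          # |xz| + |y^k|
    lengths.append((m, pumped))

for m, pumped in lengths:
    print(m, pumped, (p - m) * (1 + m), is_prime(pumped))
```

Each pumped length equals (p - m)(1 + m) and is composite, as the proof claims.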

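Example 4: To prove that L = {0^(i^2) | i is an integer and i > 0} is not regular, i.e.,
L = {0^1, 0^4, 0^9, 0^16, 0^25, ...}
Proof: Let L be regular. Let w = 0^(n^2), where |w| = n^2 ≥ n.
By PL, x y^k z ∈ L for all k = 0, 1, ...
Select k = 2.
|x y^2 z| = |xyz| + |y|
          = n^2 + |y|, where |y| is at least 1 and at most n
Therefore n^2 < |x y^2 z| ≤ n^2 + n
n^2 < |x y^2 z| < n^2 + n + 1 + n     (adding 1 + n on the right; note that ≤ is replaced by <)
n^2 < |x y^2 z| < (n + 1)^2
So |x y^2 z| lies strictly between two consecutive perfect squares and cannot itself be a perfect
square. Hence x y^2 z is not in L, a contradiction, and the language is not regular.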


PART 2: LEXICAL ANALYSIS PHASE OF COMPILER DESIGN

OVERVIEW OF LEXICAL ANALYSIS
• To identify the tokens we need some method of describing the possible
tokens that can appear in the input stream. For this purpose we introduce regular
expressions, a notation that can be used to describe essentially all the tokens
of a programming language.
• Secondly, having decided what the tokens are, we need some
mechanism to recognize these in the input stream. This is done by the token
recognizers, which are designed using transition diagrams and finite automata.

ROLE OF LEXICAL ANALYZER


The LA is the first phase of a compiler. Its main task is to read the input characters
and produce as output a sequence of tokens that the parser uses for syntax analysis.

Upon receiving a 'get next token' command from the parser, the lexical analyzer reads input
characters until it can identify the next token. The LA returns to the parser a representation of the
token it has found. The representation will be an integer code if the token is a simple construct
such as a parenthesis, comma or colon.
The LA may also perform certain secondary tasks at the user interface. One such task is
stripping out from the source program the comments and white space in the form of blank, tab
and newline characters. Another is correlating error messages from the compiler with the source
program.


LEXICAL ANALYSIS VS PARSING:


Lexical analysis:
• The scanner simply turns an input string (say a file) into a list of tokens. These tokens
represent things like identifiers, parentheses, operators etc.
• The lexical analyzer (the "lexer") parses individual symbols from the source code file into
tokens. From there, the "parser" proper turns those whole tokens into sentences of your grammar.

Parsing:
• A parser converts this list of tokens into a tree-like object to represent how the tokens fit
together to form a cohesive whole (sometimes referred to as a sentence).
• A parser does not give the nodes any meaning beyond structural cohesion. The next thing to do is
extract meaning from this structure (sometimes called contextual analysis).

TOKEN, LEXEME, PATTERN:

Token: A token is a sequence of characters that can be treated as a single logical
entity. Typical tokens are
1) identifiers 2) keywords 3) operators 4) special symbols 5) constants
Pattern: A set of strings in the input for which the same token is produced as output. This set
of strings is described by a rule called a pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by
the pattern for a token.
Example:
Description of tokens

Token       Lexeme                    Pattern
const       const                     const
if          if                        if
relation    <, <=, =, <>, >=, >       < or <= or = or <> or >= or >
id          pi                        letter followed by letters and digits
num         3.14                      any numeric constant
literal     "core"                    any characters between " and " except "

A pattern is a rule describing the set of lexemes that can represent a particular token in source
program.
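
The relationship between token, lexeme and pattern in the table can be sketched with Python's `re` module. This is an added illustration, not the notes' own notation: the regular expressions are one plausible rendering of the informal patterns in the table, and `token_for` is a name chosen here.

```python
import re

# (token name, pattern) pairs; keywords are listed before id so they win.
PATTERNS = [
    ("const",    r"const"),
    ("if",       r"if"),
    ("relation", r"<=|<>|>=|<|=|>"),
    ("id",       r"[A-Za-z][A-Za-z0-9]*"),   # letter followed by letters and digits
    ("num",      r"[0-9]+(?:\.[0-9]+)?"),    # a simple numeric constant
    ("literal",  r'"[^"]*"'),                # characters between " and " except "
]

def token_for(lexeme):
    """Return the first token whose pattern matches the whole lexeme, else None."""
    for name, pattern in PATTERNS:
        if re.fullmatch(pattern, lexeme):
            return name
    return None

for lexeme in ["const", "if", "<=", "pi", "3.14", '"core"', "$"]:
    print(repr(lexeme), "->", token_for(lexeme))
```

Many lexemes (pi, count, x1, ...) map to the same token id, which is why the pattern is said to describe a set of strings.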

LEXICAL ERRORS:
Lexical errors are the errors thrown by the lexer when it is unable to continue, which means that
there is no way to recognize a lexeme as a valid token.
These errors are detected during the lexical analysis phase. Typical lexical errors are:
➢ Exceeding the permitted length of an identifier or numeric constant
➢ The appearance of illegal characters
➢ Unmatched (unterminated) strings
Example 1 : printf("Geeksforgeeks");$
This is a lexical error since an illegal character $ appears at the end of the statement.

Example 2 : This is a comment */
This is a lexical error since the end of a comment is present but its beginning is not.

Syntax errors, on the other hand, are thrown by the parser when a given sequence of already
recognized valid tokens does not match any of the right sides of the grammar rules. A simple
panic-mode error handling system requires that we return to a high-level parsing function when a
parsing or lexical error is detected.

Error recovery for lexical errors:


Panic Mode Recovery
➢ In this method, successive characters from the input are removed one at a time until one of a designated
set of synchronizing tokens is found. Synchronizing tokens are delimiters such as ; or }
➢ The advantage is that it is easy to implement and is guaranteed not to go into an infinite loop
➢ The disadvantage is that a considerable amount of input is skipped without checking it for
additional errors
Other error-recovery actions are:
i. Delete one character from the remaining input.
ii. Insert a missing character into the remaining input.
iii. Replace a character by another character.
iv. Transpose two adjacent characters.
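
Panic-mode recovery as described above can be sketched in a few lines of Python. This is an added illustration under simple assumptions: the synchronizing set contains only the two delimiters named in the text, and `panic_mode_skip` is a name chosen here.

```python
SYNC = {";", "}"}      # synchronizing delimiters named in the text

def panic_mode_skip(source, pos):
    """Discard characters from `pos` until a synchronizing delimiter is reached.

    Returns (skipped_text, new_pos); new_pos is the index of the delimiter,
    or len(source) if no delimiter is left.
    """
    start = pos
    while pos < len(source) and source[pos] not in SYNC:
        pos += 1                      # delete one character at a time
    return source[start:pos], pos

source = "x = 3 @# 4; y = 1;"
skipped, resume = panic_mode_skip(source, 6)   # the error is detected at '@'
print(repr(skipped), resume, repr(source[resume]))
```

The loop always terminates because `pos` only increases, which reflects the advantage listed above; the skipped text `"@# 4"` is never examined for further errors, which reflects the disadvantage.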

INPUT BUFFERING
Without buffering, the lexical analyzer would have to access secondary memory each time it identifies a
token, which is time-consuming and costly. So the input is stored in a buffer and then scanned by the
lexical analyzer. The lexical analyzer scans the input from left to right, one character at a time. It uses
two pointers, the lexeme begin pointer (bp) and the forward pointer (fp), to keep track of the portion of
the input scanned. Initially both pointers point to the first character of the input string.

The forward pointer moves ahead to search for the end of the lexeme. As soon as a blank space is encountered,
it indicates the end of the lexeme. For an input that begins with int, as soon as fp encounters a blank
space the lexeme "int" is identified.


When fp encounters white space, it ignores it and moves ahead. Then both the begin pointer (bp)
and the forward pointer (fp) are set at the start of the next token.
The input characters are thus read from secondary storage, but reading in this way from secondary
storage is costly. Hence a buffering technique is used. A block of data is first read into a buffer and
then scanned by the lexical analyzer. There are two methods used in this context: the One Buffer Scheme and
the Two Buffer Scheme. These are explained below.

One Buffer Scheme:

In this scheme, only one buffer is used to store the input string. The problem with this scheme is
that if a lexeme is very long it crosses the buffer boundary; to scan the rest of the lexeme the buffer
has to be refilled, which overwrites the first part of the lexeme.

Two Buffer Scheme:

To overcome the problem of the one buffer scheme, this method uses two buffers to store the
input string. The first buffer and the second buffer are scanned alternately. When the end of the current
buffer is reached, the other buffer is filled. The only problem with this method is that if the length of a
lexeme is longer than the length of the buffer, the input cannot be scanned completely.

A specialized buffering technique is used to reduce the amount of overhead required to
process an input character.


Buffer pairs
• Consists of two buffers, each of size N characters, which are reloaded alternately.
• Two pointers, lexemeBegin and forward, are maintained.
• lexemeBegin points to the beginning of the current lexeme, whose end is yet to be found.
• forward scans ahead until a match for a pattern is found.
• Once a lexeme is found, lexemeBegin is set to the character immediately after the lexeme
just found, and forward is set to the character at its right end.
• The current lexeme is the set of characters between the two pointers.

What are Sentinels?
Each time the forward pointer is moved, a check must be made to ensure that it has not moved off
one half of the buffer. If it has, then the other half must be reloaded.
Therefore, at the ends of the buffer halves, two tests are required for each advance of the forward
pointer. Test 1: for end of buffer. Test 2: to determine what character is read. The use of a sentinel
reduces the two tests to one by extending each buffer half to hold a sentinel character at the end.

The sentinel is a special character that cannot be part of the source program (the eof character is used as
the sentinel).
Disadvantages

• This scheme works well most of the time, but the amount of lookahead is limited.
• This limited lookahead may make it impossible to recognize tokens in situations where the
distance that the forward pointer must travel is more than the length of the buffer.
Advantages
• Most of the time, it performs only one test, to see whether the forward pointer points to an eof.
• Only when it reaches the end of a buffer half or the eof does it perform more tests.
• Since N input characters are encountered between eofs, the average number of tests per input
character is very close to 1.
With sentinels, instead of the two tests per character of the plain buffer-pair technique, there is
only one test, i.e., testing for the eof marker.
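
The buffer-pair scheme with sentinels can be sketched as follows. This is an added Python illustration, not the algorithm from the notes: the class name `TwoBuffer`, the use of the NUL character as the eof sentinel and the `tests` counter are all assumptions made for the example.

```python
import io

EOF = "\0"     # sentinel character; assumed never to occur in the source text

class TwoBuffer:
    def __init__(self, stream, n):
        self.stream, self.n = stream, n
        # two halves of n characters, each followed by one sentinel slot
        self.buf = [EOF] * (2 * n + 2)
        self.forward = 0
        self.tests = 0               # number of sentinel tests, for illustration
        self._load(0)

    def _load(self, start):
        """Fill one half; unused slots get EOF to mark the real end of input."""
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        """Return the next character, or None at the real end of input."""
        c = self.buf[self.forward]
        self.tests += 1
        if c != EOF:                           # common case: one test only
            self.forward += 1
            return c
        if self.forward == self.n:             # sentinel closing the first half
            self._load(self.n + 1)
            self.forward = self.n + 1
            return self.next_char()
        if self.forward == 2 * self.n + 1:     # sentinel closing the second half
            self._load(0)
            self.forward = 0
            return self.next_char()
        return None                            # eof inside a half: input is finished

def read_all(text, n):
    tb = TwoBuffer(io.StringIO(text), n)
    out = []
    while True:
        c = tb.next_char()
        if c is None:
            return "".join(out), tb.tests
        out.append(c)

print(read_all("int i = 10;", 4))
```

Only when the single test finds the sentinel does the code look at where `forward` is, so the extra tests are paid once per buffer half rather than once per character.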

SPECIFICATION OF TOKENS
1. In the theory of compilation, regular expressions are used to formalize the specification of tokens.
2. Regular expressions are a means of specifying regular languages.
3. Example: letter_ (letter_ | digit)*
4. Each regular expression is a pattern specifying the form of strings.

REGULAR EXPRESSIONS
1. Ɛ is a regular expression, L(Ɛ) = {Ɛ}
2. If a is a symbol in ∑ then a is a regular expression, L(a) = {a}
3. (r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
4. (r)(s) is a regular expression denoting the language L(r)L(s)
5. (r)* is a regular expression denoting (L(r))*
6. (r) is a regular expression denoting L(r)

A regular expression is a formula that describes a possible set of strings.


Components of regular expressions:

x          the character x
.          any character, usually except a newline
[xyz]      any of the characters x, y, z, ...
R?         an R or nothing (= optionally an R)
R*         zero or more occurrences of R
R+         one or more occurrences of R
R1R2       an R1 followed by an R2
R1|R2      either an R1 or an R2

A token is either a single string or one of a collection of strings of a certain type. If we view
the set of strings in each token class as a language, we can use the regular-expression
notation to describe tokens.

Consider an identifier, which is defined to be a letter followed by zero or more letters
or digits. In regular expression notation we would write

Identifier = letter (letter | digit)*

Here are the rules that define the regular expressions over an alphabet ∑.
o Ɛ is a regular expression denoting {Ɛ}, that is, the language containing only the
empty string.
o For each 'a' in ∑, a is a regular expression denoting {a}, the language with
only one string, consisting of the single symbol 'a'.
o If R and S are regular expressions denoting the languages L(R) and L(S), then
(R) | (S) denotes L(R) U L(S)
R.S denotes L(R).L(S)
R* denotes (L(R))*

REGULAR DEFINITIONS
For notational convenience, we may wish to give names to regular expressions and
to define regular expressions using these names as if they were symbols.
Identifiers are the set of strings of letters and digits beginning with a letter.
The following regular definition provides a precise specification for this class of strings.
Example-1,
ab*|cd? is equivalent to (a(b*)) | (c(d?))
Pascal identifier
letter → A | B | ... | Z | a | b | ... | z
digit  → 0 | 1 | 2 | ... | 9
id     → letter (letter | digit)*

RECOGNITION OF TOKENS:

We have learnt how to express patterns using regular expressions. Now we must study how to take the
patterns for all the needed tokens and build a piece of code that examines the input string and finds a
prefix that is a lexeme matching one of the patterns. The two functions that are used while designing
the lexical analyzer are:
retract(): This function is invoked only if we want to unread the last character read. In a transition
diagram it is identified by an edge labeled "other" leading to a final state marked with *.
install_ID(): Once the identifier is identified, the function install_ID() is called. This function
checks whether the identifier is already in the symbol table. If it is not there, it is entered into
the symbol table and the function returns a pointer to that entry. If it is already there, the function
returns the pointer to that entry.

stmt → if expr then stmt
     | if expr then stmt else stmt
     | є
expr → term relop term
     | term
term → id
     | number
For relop, we use the comparison operators of languages like Pascal or SQL, where = is
"equals" and <> is "not equals", because this presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id and number, are the
names of tokens as far as the lexical analyzer is concerned. The patterns for the
tokens are described using regular definitions.

digit  --> [0-9]
digits --> digit+
number --> digits (. digits)? (E [+-]? digits)?
letter --> [A-Za-z]
id     --> letter (letter | digit)*
if     --> if
then   --> then
else   --> else
relop  --> < | > | <= | >= | = | <>

In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing
the "token" ws defined by:
ws --> (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the
ASCII characters of the same names. Token ws is different from the other tokens in
that, when we recognize it, we do not return it to the parser, but rather restart the
lexical analysis from the character that follows the white space. It is the following
token that gets returned to the parser.

Lexemes       Token Name    Attribute Value
Any ws        _             _
if            if            _
then          then          _
else          else          _
Any id        id            pointer to table entry
Any number    number        pointer to table entry
<             relop         LT
<=            relop         LE
=             relop         EQ
<>            relop         NE
>             relop         GT
>=            relop         GE
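
The regular definitions and the token table can be combined into a small lexer. The sketch below is an added Python illustration using the `re` module: the names `tokenize`, `TOKEN_SPEC` and `symtab` are chosen here, the "pointer to table entry" is modelled as an index into a list, and keywords are recognized by looking up identifiers in a keyword set, which is one common approach rather than the one the notes prescribe.

```python
import re

TOKEN_SPEC = [
    ("ws",     r"[ \t\n]+"),                                   # (blank | tab | newline)+
    ("number", r"[0-9]+(?:\.[0-9]+)?(?:E[+-]?[0-9]+)?"),       # digits (. digits)? (E [+-]? digits)?
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),                       # letter (letter | digit)*
    ("relop",  r"<=|<>|>=|<|=|>"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))
KEYWORDS = {"if", "then", "else"}
RELOP_ATTR = {"<": "LT", "<=": "LE", "=": "EQ", "<>": "NE", ">": "GT", ">=": "GE"}

def tokenize(text):
    """Return (tokens, symtab); each token is a (name, attribute) pair."""
    symtab, tokens, pos = [], [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"lexical error at position {pos}: {text[pos]!r}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                          # ws is never returned to the parser
        if kind == "id" and lexeme in KEYWORDS:
            tokens.append((lexeme, None))     # if, then, else carry no attribute
        elif kind == "relop":
            tokens.append(("relop", RELOP_ATTR[lexeme]))
        else:
            if lexeme not in symtab:          # enter into the table on first sight
                symtab.append(lexeme)
            tokens.append((kind, symtab.index(lexeme)))
    return tokens, symtab

tokens, symtab = tokenize("if x1 <= 42 then y <> x1 else y = 3.5E+2")
print(tokens)
print(symtab)
```

The second occurrence of x1 gets the same table index as the first, which mirrors the behaviour described for install_ID().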

TRANSITION DIAGRAM:
A transition diagram has a collection of nodes or circles, called states. Each state
represents a condition that could occur during the process of scanning the input
looking for a lexeme that matches one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is labeled
by a symbol or set of symbols.
If we are in some state s, and the next input symbol is a, we look for an edge out
of state s labeled by a. If we find such an edge, we advance the forward pointer
and enter the state of the transition diagram to which that edge leads.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting or final. These states indicate that a
lexeme has been found, although the actual lexeme may not consist of all
positions between the lexemeBegin and forward pointers. We always indicate an
accepting state by a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then
we shall additionally place a * near that accepting state.
3. One state is designated the start state, or initial state. It is indicated by an edge
labeled "start" entering from nowhere. The transition diagram always begins in
the start state before any input symbols have been used.

As an intermediate step in the construction of a LA, we first produce
a stylized flowchart, called a transition diagram. Positions in a transition diagram
are drawn as circles and are called states.

Architecture of a transition-diagram-based lexical analyzer
TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1)
    {   /* repeat character processing until a return or failure occurs */
        switch (state)
        {
        case 0: c = nextchar();
                if (c == '<') state = 1;
                else if (c == '=') state = 5;
                else if (c == '>') state = 6;
                else fail();    /* lexeme is not a relop */
                break;
        case 1: …

        case 8: retract();      /* starred state: give back the extra character read */
                retToken.attribute = GT;
                return(retToken);
        }
    }
}
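
A complete, runnable version of the same idea can be sketched in Python. This is an added illustration: the source shows only cases 0 and 8, so the behaviour of states 1 to 7 below is an assumption, filled in with the usual relop diagram (states 4 and 8 are the starred states that retract). The function name `get_relop` and its return convention are also chosen here.

```python
def get_relop(text, pos=0):
    """Scan a relop starting at text[pos].

    Returns (attribute, position after the lexeme), or None if the input
    at pos does not start with a relational operator.
    """
    def next_char():
        nonlocal forward
        c = text[forward] if forward < len(text) else ""   # "" acts as 'other' at end of input
        forward += 1
        return c

    state, forward = 0, pos
    while True:
        if state == 0:
            c = next_char()
            if c == "<": state = 1
            elif c == "=": state = 5
            elif c == ">": state = 6
            else: return None                 # fail(): lexeme is not a relop
        elif state == 1:
            c = next_char()
            if c == "=": state = 2
            elif c == ">": state = 3
            else: state = 4
        elif state == 2: return "LE", forward
        elif state == 3: return "NE", forward
        elif state == 4:                      # starred state: retract one character
            forward -= 1
            return "LT", forward
        elif state == 5: return "EQ", forward
        elif state == 6:
            c = next_char()
            state = 7 if c == "=" else 8
        elif state == 7: return "GE", forward
        elif state == 8:                      # starred state: retract one character
            forward -= 1
            return "GT", forward

for sample in ["<= 3", "<x", "<>", ">", ">=", "=", "a"]:
    print(repr(sample), get_relop(sample))
```

Retracting in states 4 and 8 hands the lookahead character back, so the next call starts on the character that follows the operator.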

➢ Transition diagram for identifier

The TD above is for an identifier, defined to be a letter followed by any number of letters
or digits. A sequence of transition diagrams can be converted into a program that looks for
the tokens specified by the diagrams. Each state gets a segment of code.
/* isalpha() and isdigit() come from <ctype.h> */
state = 0;
for (;;)
{
    switch (state)
    {
    case 0:                              /* start state */
        ch = getchar();
        if (isalpha(ch)) state = 1;      /* first character is a letter */
        else state = 3;                  /* not an identifier: identify the next token */
        break;
    case 1:                              /* inside an identifier */
        ch = getchar();
        if (isalpha(ch) || isdigit(ch)) state = 1;   /* stay while letters or digits follow */
        else state = 2;
        break;
    case 2:                              /* accepting state, marked with * */
        retract();                       /* undo the last character read */
        return (ID, install_ID());
    case 3: ;                            /* identify the next token (not shown) */
    }
}
➢ Transition diagram for whitespace

➢ Transition diagram for unsigned numbers
