
Lexical Analysis

Find the FIRST and FOLLOW


• S -> ABCD | ε
• A -> a | ε
• B -> bA
• C -> a | ε
• D -> d
Introduction
• The task of analyzing syntax is divided into two parts.
– Lexical analysis – deals with small-scale language constructs, such as
names and numeric literals.
– Syntax analysis – deals with large-scale constructs, such as expressions,
statements, and program units.

• Reasons for separating the two:
– Simplicity: each part is less complex on its own.
– Efficiency: allows the lexical analyzer to be optimized separately.
– Portability: the lexical analyzer can hold the machine-dependent parts, while the parser stays machine independent.
Lexical Analyzer
• A pattern matcher
• To find a substring of a given string of
characters that matches a given character
pattern.
Lexical Analyzer
• Token
• Pattern
• Lexeme

printf("Total = %d\n", score);

Id – printf & score (Lexemes)
Literal – "Total = %d\n" (Lexeme)
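The token/lexeme distinction for the printf statement can be sketched with a small regular-expression tokenizer. The token names and patterns below are assumptions made for this illustration only, not a fixed standard.

```python
import re

# Hypothetical token classes and their patterns (illustration only).
TOKEN_SPEC = [
    ("literal", r'"(?:\\.|[^"\\])*"'),      # string literal, escapes allowed
    ("id",      r"[A-Za-z_][A-Za-z0-9_]*"),
    ("lparen",  r"\("),
    ("rparen",  r"\)"),
    ("comma",   r","),
    ("semi",    r";"),
    ("ws",      r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return (token, lexeme) pairs; whitespace is skipped."""
    pairs = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"no pattern matches at position {pos}")
        if m.lastgroup != "ws":
            pairs.append((m.lastgroup, m.group()))
        pos = m.end()
    return pairs


pairs = tokenize('printf("Total = %d\\n", score);')
for token, lexeme in pairs:
    print(f"{token:8} {lexeme}")
```

Each pair couples the token (the class) with the lexeme (the matched substring); the pattern is the regular expression that decides the match.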
Tokens
• Keyword
• Operator
• Identifiers
• Constant - Numbers & Literals
• Punctuation Symbols – Left & Right
parentheses, comma, and semicolon.
Lexical Analyzer Responsibilities
• Lexical analyzer [Scanner]
– Scan input
– Remove white spaces
– Remove comments
– Manufacture tokens
– Generate lexical errors
– Pass token to parser

Two Processes
• Scanner – deletes comments and compacts
consecutive whitespace characters into
one.

• Lexical analysis – produces tokens from the
output of the scanner.
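A minimal sketch of the scanner stage, assuming C-style comments (the slides do not name a comment syntax) and ignoring the complication that comment markers may appear inside string literals:

```python
import re

# Assumption: C-style /* ... */ and // comments; string literals are not
# treated specially in this sketch.
COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
WHITESPACE = re.compile(r"\s+")


def scan(source):
    """Delete comments, then compact each whitespace run into one blank."""
    without_comments = COMMENT.sub(" ", source)
    return WHITESPACE.sub(" ", without_comments).strip()


cleaned = scan("x = 1;   /* set x */\n\n  y = x + 2; // use x\n")
print(cleaned)
```

The output of scan is what the second process, lexical analysis proper, would turn into tokens.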
Attributes for Tokens
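E = M * C ** 2

<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>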
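As a rough sketch of how attributes could be attached to tokens, the code below uses a list index as the "pointer to symbol table". The token names mult_op, exp_op, and number, and the list-based symbol table, are assumptions for illustration.

```python
import re

TOKEN_RE = re.compile(
    r"\s*(?:(?P<id>[A-Za-z]\w*)|(?P<number>\d+)|(?P<exp_op>\*\*)"
    r"|(?P<mult_op>\*)|(?P<assign_op>=))"
)


def tokens_with_attributes(source):
    """Return (tokens, symbol_table); each token is a (name, attribute) pair."""
    symbol_table = []            # identifier names; the index acts as the pointer
    tokens = []
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"no pattern matches at position {pos}")
        kind, lexeme = m.lastgroup, m.group(m.lastgroup)
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append((kind, symbol_table.index(lexeme)))
        elif kind == "number":
            tokens.append((kind, int(lexeme)))
        else:
            tokens.append((kind, None))   # operators need no attribute
        pos = m.end()
    return tokens, symbol_table


tokens, table = tokens_with_attributes("E = M * C ** 2")
print(tokens)
print(table)
```

Note that the exp_op pattern is tried before mult_op so that ** is not split into two * tokens.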
Lexical Errors
• It is very hard for the lexical analyzer (LA) to tell that
there is an error in the code without the aid of
the other components.

• fi(a==f(x))

• How does the LA know whether fi is a misspelling of if
or an undefined identifier?
Lexical Errors
• However, in some situations the LA is unable to
proceed because none of the patterns for
tokens matches any prefix of the remaining
input.

• The simplest recovery strategy is “panic mode”
recovery: delete successive characters from the
remaining input until the LA can find a well-formed token.
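Panic-mode recovery can be sketched as follows. The token patterns are a hypothetical toy set; the point is only the inner loop that skips characters until some pattern matches again.

```python
import re

# Hypothetical toy patterns: numbers, identifiers, a few operators, blanks.
TOKEN = re.compile(r"(?P<num>\d+)|(?P<id>[A-Za-z_]\w*)|(?P<op>[-+*/=()])|(?P<ws>\s+)")


def tokenize_with_panic_mode(source):
    """Return (tokens, errors); errors are (position, deleted_text) pairs."""
    tokens, errors = [], []
    pos = 0
    while pos < len(source):
        m = TOKEN.match(source, pos)
        if m is None:
            start = pos
            # Panic mode: delete characters until a pattern matches again.
            while pos < len(source) and TOKEN.match(source, pos) is None:
                pos += 1
            errors.append((start, source[start:pos]))
            continue
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens, errors


tokens, errors = tokenize_with_panic_mode("total = 3 @# 4")
print(tokens)
print(errors)
```

The characters @# match no pattern, so they are deleted and reported, and scanning resumes at the following blank.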
Other Possible recovery actions
• Delete one character from the remaining input.

• Insert a missing character into the remaining input.

• Replace a character by another character.

• Transpose two adjacent characters.
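The four actions are single-character edits. As an illustration, the sketch below applies them to one lexeme (rather than to the whole remaining input) and checks which results are keywords; the keyword set and the lowercase alphabet are assumptions.

```python
import string


def single_edit_repairs(lexeme, alphabet=string.ascii_lowercase):
    """All strings reachable from lexeme by one delete, insert, replace, or transpose."""
    splits = [(lexeme[:i], lexeme[i:]) for i in range(len(lexeme) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    inserts = {left + ch + right for left, right in splits for ch in alphabet}
    replaces = {left + ch + right[1:] for left, right in splits if right for ch in alphabet}
    transposes = {left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1}
    return deletes | inserts | replaces | transposes


KEYWORDS = {"if", "else", "while", "for"}     # hypothetical keyword set
candidates = single_edit_repairs("fi") & KEYWORDS
print(candidates)
```

For the lexeme fi, only the transposition yields a keyword, which is why a single-edit repair would turn fi into if.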


Tricky problems in Token recognition

DO index_variable = start, end, step
    statements
END DO

Or equivalently:

do 100 n=2,10,1
100 nfac=nfac*n
Tricky problems in Token recognition

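• Assignment: DO 5 I = 1.25 (blanks are insignificant in fixed-form Fortran, so this assigns 1.25 to the variable DO5I)
• Do loop: DO 5 I = 1,25 (only the comma reveals that DO is a keyword)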
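A sketch of the lookahead this requires: after removing blanks, the statement is a DO loop only if a comma follows the first expression after the = sign. The function name and the simplified statement shapes are assumptions; real Fortran has many more cases.

```python
import re

# After blank removal: DO <label> <variable> = <start> , <rest>
DO_LOOP = re.compile(r"DO(\d+)([A-Z][A-Z0-9]*)=([^,=]+),(.+)")


def classify(statement):
    """Classify a statement as a DO loop or an assignment (simplified)."""
    compact = statement.replace(" ", "").upper()   # blanks are not significant
    m = DO_LOOP.fullmatch(compact)
    if m:
        label, variable, start, rest = m.groups()
        return ("do_loop", label, variable, start, rest)
    name, _, value = compact.partition("=")
    return ("assignment", name, value)


print(classify("DO 5 I = 1.25"))
print(classify("DO 5 I = 1,25"))
```

The scanner cannot decide what DO means until it has looked ahead past the = sign, which is what makes this case tricky.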
Input Buffering
• Two-Buffer Scheme
Input Buffering
• Examining ways of speeding up the reading of the source program
– With a single buffer, the lexeme being processed can be overwritten when the
buffer is reloaded.
– The two-buffer scheme handles large lookahead safely.
Buffer Pairs
• Two buffers of the same size, say 4096, are alternately reloaded.
• Two pointers to the input are maintained:
– Pointer lexeme_Begin marks the beginning of the current
lexeme.
– Pointer forward scans ahead until a pattern match is found.
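The buffer pair can be sketched as one array of two halves with the two pointers moving through it. The class and method names are hypothetical, and the sketch assumes that no lexeme is longer than one half, so a half is never reloaded while it still holds the start of the current lexeme.

```python
import io


class BufferPair:
    """Two halves of n characters each, reloaded alternately."""

    def __init__(self, stream, n=4096):
        self.stream = stream
        self.n = n
        self.buf = [""] * (2 * n)
        self.filled = [0, 0]          # valid characters in each half
        self.lexeme_begin = 0
        self.forward = 0
        self._load(0)

    def _load(self, half):
        chunk = self.stream.read(self.n)
        self.filled[half] = len(chunk)
        start = half * self.n
        self.buf[start:start + len(chunk)] = chunk

    def next_char(self):
        """Return the character at forward and advance; None at end of input."""
        half, offset = divmod(self.forward, self.n)
        if offset >= self.filled[half]:
            return None
        ch = self.buf[self.forward]
        self.forward += 1
        if self.forward % self.n == 0:            # crossed into the other half
            self.forward %= 2 * self.n
            self._load(self.forward // self.n)
        return ch

    def take_lexeme(self):
        """Return the text from lexeme_begin up to forward; start a new lexeme."""
        if self.lexeme_begin <= self.forward:
            chars = self.buf[self.lexeme_begin:self.forward]
        else:                                     # lexeme wraps around the end
            chars = self.buf[self.lexeme_begin:] + self.buf[:self.forward]
        self.lexeme_begin = self.forward
        return "".join(chars)


def split_words(text, n=4):
    """Use the buffer pair to split text at blanks (a stand-in for real patterns)."""
    reader = BufferPair(io.StringIO(text), n)
    words = []
    while True:
        ch = reader.next_char()
        if ch is None or ch == " ":
            lexeme = reader.take_lexeme().strip()
            if lexeme:
                words.append(lexeme)
            if ch is None:
                return words


print(split_words("ab cde f", n=4))
```

A tiny half size such as n=4 is used only to force reloads in the demonstration; the slides suggest 4096.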
Regular Expression
• Regular expressions describe all the languages that can be built
from the operators union, concatenation, and closure applied to the symbols
of some alphabet.

letter(letter|digit)*

Regular expressions are built recursively out of smaller regular expressions, using a set of rules.
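The expression letter(letter|digit)* can be tried out with Python's re module. Taking letter as [A-Za-z] and digit as [0-9] is an assumption; the slides do not define the two classes.

```python
import re

LETTER = "[A-Za-z]"     # assumed definition of letter
DIGIT = "[0-9]"         # assumed definition of digit
IDENTIFIER = re.compile(f"{LETTER}(?:{LETTER}|{DIGIT})*")


def is_identifier(s):
    """True if the whole string matches letter(letter|digit)*."""
    return IDENTIFIER.fullmatch(s) is not None


for candidate in ["count1", "x", "1count", ""]:
    print(repr(candidate), is_identifier(candidate))
```

The larger expression is assembled from the smaller ones by concatenation, alternation, and closure, mirroring the recursive construction.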
Specification of Patterns for Tokens:
Definitions
• An alphabet Σ is a finite set of symbols
(characters)
• A string s is a finite sequence of symbols from Σ
– |s| denotes the length of string s
– ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over
some fixed alphabet Σ
Specification of Patterns for Tokens: String
Operations
• The concatenation of two strings x and y is
denoted by xy
• The exponentiation of a string s is defined by

s^0 = ε (the empty string: a string of length zero)
s^i = s^{i-1}s for i > 0

Note that sε = εs = s
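The recursive definition translates directly into code. This is a small sketch in which the Python empty string "" stands for ε and the function name is hypothetical.

```python
def power(s, i):
    """s^0 is the empty string; s^i is s^(i-1) followed by s."""
    if i == 0:
        return ""
    return power(s, i - 1) + s


print(power("ab", 3))
```

Concatenating with the empty string on either side leaves a string unchanged, which is the identity noted above.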

Recognition of Tokens
Transition Diagrams
• Patterns are converted into stylized flowcharts (transition diagrams).

• Two pointers are used: lexeme_Begin and forward.


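Automaton: transition diagram for relop
state  input  next  action
0      <      1
1      =      2     return(relop, LE)
1      >      3     return(relop, NE)
1      other  4 *   return(relop, LT)
0      =      5     return(relop, EQ)
0      >      6
6      =      7     return(relop, GE)
6      other  8 *   return(relop, GT)
(start state: 0; * marks states that retract the forward pointer by one character)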
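The relop transition diagram can be simulated with a state variable, as in the sketch below. The function name and the convention of returning the new input position are assumptions; the state numbers follow the diagram.

```python
def get_relop(text, pos=0):
    """Simulate states 0-8; return ((token, attribute), new_pos) or None."""
    state = 0
    forward = pos
    while True:
        # An empty string stands for "other" when the input is exhausted.
        ch = text[forward] if forward < len(text) else ""
        if state == 0:
            if ch == "<":
                state = 1
            elif ch == "=":
                return ("relop", "EQ"), forward + 1      # state 5
            elif ch == ">":
                state = 6
            else:
                return None                              # not a relop
        elif state == 1:
            if ch == "=":
                return ("relop", "LE"), forward + 1      # state 2
            if ch == ">":
                return ("relop", "NE"), forward + 1      # state 3
            return ("relop", "LT"), forward              # state 4: retract
        elif state == 6:
            if ch == "=":
                return ("relop", "GE"), forward + 1      # state 7
            return ("relop", "GT"), forward              # state 8: retract
        forward += 1


for sample in ["<=", "<>", "<a", "=", ">=", ">"]:
    print(sample, get_relop(sample))
```

In the retracting states the "other" character is not consumed, so the returned position still points at it.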
Two More...
id:
state  input            next
9      letter           10
10     letter or digit  10
10     other            11 *
(start state: 9)

delim:
state  input  next
28     delim  29
29     delim  29
29     other  30 *
(start state: 28)
RE to Automata
Minimizing
• The DFA for a(b|c)*
Example #2: Applying Minimization
Example # 4
• Minimize the following DFA:

[Figure: the DFA to minimize, with states A, B, C, D, E over {a, b}, alongside a DFA with states A, B, D, E]
From Regular Expression to DFA Directly

• The “important states” of an NFA are those
with a non-ε out-transition, that is, if
move({s}, a) ≠ ∅ for some a then s is an
important state
• The subset construction algorithm uses only
the important states when it determines
ε-closure(move(T, a))
From Regular Expression to DFA Directly
(Algorithm)
• Augment the regular expression r with a
special end symbol # to make accepting states
important: the new expression is r#
• Construct a syntax tree for r#
• Traverse the tree to construct functions
nullable, firstpos, lastpos, and followpos

From Regular Expression to DFA Directly:
Syntax Tree of (a|b)*abb#

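concatenation
├── concatenation
│   ├── concatenation
│   │   ├── concatenation
│   │   │   ├── * (closure)
│   │   │   │   └── | (alternation)
│   │   │   │       ├── a  (position 1)
│   │   │   │       └── b  (position 2)
│   │   │   └── a  (position 3)
│   │   └── b  (position 4)
│   └── b  (position 5)
└── #  (position 6)

Position numbers are assigned to the leaves.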
From Regular Expression to DFA Directly:
Annotating the Tree
• nullable(n): true if the subtree at node n generates a language
that includes the empty string

• firstpos(n): the set of positions that can match the first
symbol of a string generated by the subtree at node n

• lastpos(n): the set of positions that can match the last
symbol of a string generated by the subtree at node n

• followpos(i): the set of positions that can follow position
i in the tree
From Regular Expression to DFA Directly:
Annotating the Tree
Node n      nullable(n)        firstpos(n)                      lastpos(n)

Leaf ε      true               ∅                                ∅

Leaf i      false              {i}                              {i}

c1 | c2     nullable(c1) or    firstpos(c1) ∪ firstpos(c2)      lastpos(c1) ∪ lastpos(c2)
            nullable(c2)

c1 • c2     nullable(c1) and   if nullable(c1) then             if nullable(c2) then
            nullable(c2)         firstpos(c1) ∪ firstpos(c2)      lastpos(c1) ∪ lastpos(c2)
                               else firstpos(c1)                else lastpos(c2)

c1*         true               firstpos(c1)                     lastpos(c1)
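The rules in the table can be written as one recursive function. The tuple encoding of tree nodes used here ("leaf", "eps", "or", "cat", "star") is an assumption chosen for brevity; the tree built at the end is the one for (a|b)*abb#.

```python
def annotate(node):
    """Return (nullable, firstpos, lastpos) for a syntax-tree node."""
    kind = node[0]
    if kind == "eps":
        return True, frozenset(), frozenset()
    if kind == "leaf":
        _, _symbol, position = node
        return False, frozenset({position}), frozenset({position})
    if kind == "star":
        _, first, last = annotate(node[1])
        return True, first, last
    n1, f1, l1 = annotate(node[1])
    n2, f2, l2 = annotate(node[2])
    if kind == "or":
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        return n1 and n2, (f1 | f2 if n1 else f1), (l1 | l2 if n2 else l2)
    raise ValueError(f"unknown node kind: {kind}")


def leaf(symbol, position):
    return ("leaf", symbol, position)


# Build (a|b)*abb# as a left-leaning chain of cat-nodes.
star = ("star", ("or", leaf("a", 1), leaf("b", 2)))
tree = star
for symbol, position in [("a", 3), ("b", 4), ("b", 5), ("#", 6)]:
    tree = ("cat", tree, leaf(symbol, position))

nullable, firstpos, lastpos = annotate(tree)
print(nullable, sorted(firstpos), sorted(lastpos))
```

Because the star-node is nullable, firstpos of the first cat-node also includes position 3, which is how {1, 2, 3} reaches the root.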
From Regular Expression to DFA Directly:
Syntax Tree of (a|b)*abb#

Node (leaf position)        firstpos     lastpos
concatenation (root)        {1, 2, 3}    {6}
  concatenation             {1, 2, 3}    {5}
    concatenation           {1, 2, 3}    {4}
      concatenation         {1, 2, 3}    {3}
        * (nullable)        {1, 2}       {1, 2}
          |                 {1, 2}       {1, 2}
            a (1)           {1}          {1}
            b (2)           {2}          {2}
        a (3)               {3}          {3}
      b (4)                 {4}          {4}
    b (5)                   {5}          {5}
  # (6)                     {6}          {6}
From Regular Expression to DFA Directly:
followpos

for each node n in the tree do
    if n is a cat-node with left child c1 and right child c2 then
        for each i in lastpos(c1) do
            followpos(i) := followpos(i) ∪ firstpos(c2)
        end do
    else if n is a star-node then
        for each i in lastpos(n) do
            followpos(i) := followpos(i) ∪ firstpos(n)
        end do
    end if
end do
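A sketch of the followpos loop in Python. To keep it short, the cat-nodes and star-nodes are given as pairs of precomputed sets, with the values taken from the annotated tree of (a|b)*abb#; this data layout is an assumption, not part of the algorithm.

```python
from collections import defaultdict

# (lastpos(c1), firstpos(c2)) for each cat-node, from the bottom of the tree up.
CAT_NODES = [
    ({1, 2}, {3}),
    ({3}, {4}),
    ({4}, {5}),
    ({5}, {6}),
]
# (lastpos(n), firstpos(n)) for each star-node.
STAR_NODES = [({1, 2}, {1, 2})]


def compute_followpos(cat_nodes, star_nodes):
    """Apply the cat-node and star-node rules; return position -> set of positions."""
    followpos = defaultdict(set)
    for lastpos_c1, firstpos_c2 in cat_nodes:
        for i in lastpos_c1:
            followpos[i] |= firstpos_c2
    for lastpos_n, firstpos_n in star_nodes:
        for i in lastpos_n:
            followpos[i] |= firstpos_n
    return dict(followpos)


followpos = compute_followpos(CAT_NODES, STAR_NODES)
for position in sorted(followpos):
    print(position, sorted(followpos[position]))
```

Position 6 (the end marker #) never appears in a lastpos(c1) set, so it gets no followpos entry.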

From Regular Expression to DFA Directly:
Algorithm
s0 := firstpos(root) where root is the root of the syntax tree
Dstates := {s0} and s0 is unmarked
while there is an unmarked state T in Dstates do
    mark T
    for each input symbol a ∈ Σ do
        let U be the set of positions that are in followpos(p)
            for some position p in T,
            such that the symbol at position p is a
        if U is not empty and not in Dstates then
            add U as an unmarked state to Dstates
        end if
        Dtran[T,a] := U
    end do
end do
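The construction can be sketched in Python with the followpos values of (a|b)*abb# supplied as data. The names FOLLOWPOS, SYMBOL_AT, and build_dfa are hypothetical, and a list index plays the role of the marked/unmarked flag.

```python
FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}


def build_dfa(start, followpos, symbol_at, alphabet):
    """Return (dstates, dtran) built from firstpos(root) and followpos."""
    s0 = frozenset(start)
    dstates = [s0]          # discovery order; states before `marked` are marked
    dtran = {}
    marked = 0
    while marked < len(dstates):
        T = dstates[marked]
        marked += 1
        for a in alphabet:
            # Union of followpos(p) over positions p in T labeled with a.
            U = frozenset(q for p in T if symbol_at[p] == a for q in followpos[p])
            if U and U not in dstates:
                dstates.append(U)
            dtran[T, a] = U
    return dstates, dtran


dstates, dtran = build_dfa({1, 2, 3}, FOLLOWPOS, SYMBOL_AT, "ab")
accepting = [T for T in dstates if 6 in T]   # states containing the position of #
for T in dstates:
    print(sorted(T), {a: sorted(dtran[T, a]) for a in "ab"})
```

A state is accepting when it contains the position of the end marker #, which is why r is augmented to r# in the first place.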

From Regular Expression to DFA Directly:
Example

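Node  followpos
1     {1, 2, 3}
2     {1, 2, 3}
3     {4}
4     {5}
5     {6}
6     -

Resulting DFA (start state {1,2,3}; {1,2,3,6} is accepting because it contains position 6):
State        on a         on b
{1,2,3}      {1,2,3,4}    {1,2,3}
{1,2,3,4}    {1,2,3,4}    {1,2,3,5}
{1,2,3,5}    {1,2,3,4}    {1,2,3,6}
{1,2,3,6}    {1,2,3,4}    {1,2,3}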
Thank You
