Lexical Analysis

Compiler Construction / Compiler Design

CS F 363/ IS F342
D.C.KIRAN
dck@pilani.bits-pilani.ac.in
BITS Pilani, Pilani Campus

Lexical Analysis
Scanning Perspective

A scanner produces a stream of tokens (token <type>, value pairs) from a stream of characters.

Two modes of operation:
• A complete pass produces the full token stream, which is then passed to the parser.
• The parser calls the scanner to request the next token (a sketch follows below).
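A minimal C sketch of the second mode, where the parser pulls tokens on demand; the Token struct, the token kinds, and next_token() are illustrative names made up for this example, not part of any particular scanner:

#include <stdio.h>
#include <string.h>

/* A token is a pair: token name plus an optional attribute value. */
typedef enum { TK_ID, TK_NUM, TK_RELOP, TK_EOF } TokenType;

typedef struct {
    TokenType type;        /* token name, e.g. TK_ID            */
    char      lexeme[64];  /* attribute value: the matched text */
} Token;

/* Stub scanner: a real one would read characters and run the automaton.
 * Here it returns one identifier and then end of input. */
Token next_token(void) {
    static int calls = 0;
    Token t;
    if (calls++ == 0) { t.type = TK_ID;  strcpy(t.lexeme, "d2"); }
    else              { t.type = TK_EOF; t.lexeme[0] = '\0';     }
    return t;
}

/* Mode 2: the parser calls the scanner to request the next token. */
int main(void) {
    for (Token t = next_token(); t.type != TK_EOF; t = next_token())
        printf("token %d, lexeme \"%s\"\n", t.type, t.lexeme);
    return 0;
}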
Jargon

• A Token is a pair consisting of a token name and an optional attribute value.
  Token <type>s: keywords, operators, identifiers, constants, punctuation symbols (such as comma, semicolon)
  A token can be:
  – A terminal symbol in a grammar
  – A class of character sequences with a collective meaning, e.g., IDENT
  – Constants, operators, punctuation, reserved words (keywords)
• A Lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token. E.g., the relational operators {<, <=, >, >=, ==, <>}.
• A Pattern is a description of the form that the lexemes of a token may take.
• A Delimiter can be a blank, tab, or newline character.



Token

L(G) = {lex1, lex2, lex3, lex4, lex5, lex6, …, lexn}

Each group of lexemes is described by a regular expression, and each regular expression defines a token:
RE1 → Token1,  RE2 → Token2,  …,  REm → Token-m

For example:
L(G) = {abc, siff, d2d, if, int, for, …, <, <>, <=, >=}
Regular expressions: RE1, RE2, RE3, RE4, …, REm
Tokens:              IDENTIFIER, IF_T, INT_T, FOR_T, RELOP



Token example

TK_if → if
TK_then → then
TK_else → else
TK_relop → < | <= | > | >= | = | <>
TK_id → letter ( letter | digit )*
TK_num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

TK_blank → b
TK_tab → ^T
TK_newline → ^M
TK_delim → blank | tab | newline
TK_ws → delim +
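
As a rough illustration of how such a pattern maps to hand-written code, here is a small check for TK_num (digit+ ( . digit+ )? ( E ( + | - )? digit+ )?); the function name is made up for the example:

#include <ctype.h>

/* Returns 1 if s matches TK_num: digit+ ( . digit+ )? ( E ( + | - )? digit+ )? */
int matches_tk_num(const char *s) {
    if (!isdigit((unsigned char)*s)) return 0;     /* leading digit+ */
    while (isdigit((unsigned char)*s)) s++;
    if (*s == '.') {                               /* optional fraction */
        s++;
        if (!isdigit((unsigned char)*s)) return 0;
        while (isdigit((unsigned char)*s)) s++;
    }
    if (*s == 'E') {                               /* optional exponent */
        s++;
        if (*s == '+' || *s == '-') s++;
        if (!isdigit((unsigned char)*s)) return 0;
        while (isdigit((unsigned char)*s)) s++;
    }
    return *s == '\0';                             /* whole string matched */
}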



Lexical Analyzer Responsibilities

• Lexical analyzer [Scanner]
  – Scan the input
  – Remove white space
  – Remove comments
  – Manufacture tokens
  – Report lexical errors
  – Pass tokens to the parser



Implementation Options

• Hand-written lexer
  – Implement by string comparison
  – Implement a finite state automaton (a table-driven sketch follows below)
    • start in some initial state
    • look at each input character in sequence, updating the lexer state accordingly
    • if the state at the end of the input is an accepting state, the input string matches the RE
• Lexer generator
  – Generates the tokenizer automatically (e.g., lex, flex, JLex)
  – Uses the RE to NFA to DFA algorithm
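
A minimal table-driven sketch of the finite-state-automaton option for a single pattern, TK_id = letter ( letter | digit )*; the state numbers, character classes, and transition table are assumptions made up for this example:

#include <ctype.h>
#include <stdio.h>

enum { S_START, S_IN_ID, S_ERROR, NUM_STATES };       /* lexer states      */
enum { C_LETTER, C_DIGIT, C_OTHER, NUM_CLASSES };     /* character classes */

/* delta[state][character class] = next state */
static const int delta[NUM_STATES][NUM_CLASSES] = {
    /* S_START */ { S_IN_ID, S_ERROR, S_ERROR },
    /* S_IN_ID */ { S_IN_ID, S_IN_ID, S_ERROR },
    /* S_ERROR */ { S_ERROR, S_ERROR, S_ERROR },
};

static int char_class(int c) {
    if (isalpha(c)) return C_LETTER;
    if (isdigit(c)) return C_DIGIT;
    return C_OTHER;
}

/* Start in the initial state, update the state per input character,
 * and accept iff the final state is an accepting state (S_IN_ID). */
int is_identifier(const char *s) {
    int state = S_START;
    for (; *s; s++)
        state = delta[state][char_class((unsigned char)*s)];
    return state == S_IN_ID;
}

int main(void) {
    printf("%d %d\n", is_identifier("d2"), is_identifier("2d"));  /* prints 1 0 */
    return 0;
}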



Extensions to pure DFAs

• One accepting state per token <type>
• Keywords are not identifiers
  – Look up each identifier in a keyword table (e.g., a hash table) to see whether it is in fact a keyword
• “Look-ahead” to distinguish tokens with a common prefix (e.g., 100 vs 100.5)
  – Try to find the longest possible match by continuing to scan from an accepting state.
  – Backtrack to the last accepting state when “stuck” (see the sketch below).
  char str[5] = {'B', 'I', 'T', 'S'};
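
A small hand-coded sketch of that look-ahead for the 100 vs 100.5 case; the pushback helpers, the use of stdin, and the name scan_number() are all assumptions made for the example (buf is assumed large enough):

#include <ctype.h>
#include <stdio.h>

/* Tiny pushback on top of stdin; real scanners keep their own buffer
 * because ungetc() only guarantees one character of pushback. */
static int pushed[2], npushed = 0;
static int  next_char(void)   { return npushed ? pushed[--npushed] : getchar(); }
static void push_back(int c)  { pushed[npushed++] = c; }

/* Scan an integer or real literal into buf.
 * Returns 1 for a real (digits . digits), 0 for an integer.
 * "100.5" -> real; "100" or "100." -> integer, with the extra characters
 * pushed back (backtracking to the last accepting state). */
int scan_number(char *buf) {
    int c, i = 0;
    while (isdigit(c = next_char()))
        buf[i++] = (char)c;
    if (c == '.') {
        int after = next_char();
        if (isdigit(after)) {                    /* longest match wins */
            buf[i++] = '.';
            buf[i++] = (char)after;
            while (isdigit(c = next_char()))
                buf[i++] = (char)c;
            push_back(c);
            buf[i] = '\0';
            return 1;
        }
        push_back(after);                        /* stuck: give both back */
        push_back('.');
    } else {
        push_back(c);
    }
    buf[i] = '\0';
    return 0;                                    /* accepted as an integer */
}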



Hand Written Lexer

Basic Scanning Technique

• Use 1 character of look-ahead
• Obtain the character with getc()
• Do a case analysis
  – Based on the lookahead character
  – Based on the current lexeme
• Outcome
  – If the character can extend the lexeme, all is well, go on.
  – If the character cannot extend the lexeme:
    • Figure out what the complete lexeme is and return its token
    • Put the lookahead character back into the symbol stream



The Model of Recognition of Tokens

[Figure: the input stream "i f  d 2  = …" with the lexeme_beginning pointer feeding an FA simulator, which emits the tokens (IF, if), (ID, d2), (Assign, ) and skips delimiters.]
Identifier vs Keywords

Identifiers and keywords match the same letter/digit pattern; once a lexeme has been scanned, it is looked up in the keyword table to decide whether to return a keyword token or TK_id.



Principle Of Longest Match

elsex=0; is scanned as the single identifier elsex followed by = and 0 (the longest possible match), not as the keyword else followed by x = 0.



Principle Of Longest Match

Transition diagram for relational operators (states marked * push back the lookahead character before returning):

state 0 (start): '<' → state 1,  '=' → state 5,  '>' → state 6
state 1: '='   → state 2,  return (relop, LE)
         '>'   → state 3,  return (relop, NE)
         other → state 4*, Pushback(LA), return (relop, LT)
state 5: return (relop, EQ)
state 6: '='   → state 7,  return (relop, GE)
         other → state 8*, Pushback(LA), return (relop, GT)
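
A direct hand-coded rendering of this diagram, again only a sketch: the token codes and the use of stdin/ungetc() for the pushback are assumptions made for the example:

#include <stdio.h>

enum { LT, LE, EQ, NE, GT, GE };

/* Recognize a relational operator whose first character c ('<', '=' or '>')
 * has already been read.  States 4 and 8 of the diagram push the lookahead back. */
int scan_relop(int c) {
    int la;
    if (c == '<') {
        la = getchar();
        if (la == '=') return LE;          /* state 2 */
        if (la == '>') return NE;          /* state 3 */
        ungetc(la, stdin);                 /* state 4*: Pushback(LA) */
        return LT;
    }
    if (c == '=') return EQ;               /* state 5 */
    la = getchar();                        /* c == '>' */
    if (la == '=') return GE;              /* state 7 */
    ungetc(la, stdin);                     /* state 8*: Pushback(LA) */
    return GT;
}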



Regular expression rules:
r_1 { action_1 }
r_2 { action_2 }
.
.
.
r_n { action_n }

Build an automaton A_r_i for each regular expression r_i. Add a new start state s0 with an ε-transition to the start state of every A_r_i; the final states of the A_r_i become the new final states of the combined NFA, each associated with its action.

For faster scanning, convert this NFA to a DFA and minimize the states



[Figure: building the automaton for TK_relOp in steps.
Step 1: two separate automata, one for "<=" returning (TK_relOp, LE) and one for "<>" returning (TK_relOp, NE).
Step 2: the two automata are joined by ε-transitions from a common start state.
Step 3: the merged DFA: after reading '<', an '=' edge returns (TK_relOp, LE) and a '>' edge returns (TK_relOp, NE).]



Lexical Analyzer

[Figure]

Example

[Figure: lexemes]


Simple Lexical Analyzer

/* Global Variables */
int charClass, lexLen, LETTER = 0, DIGIT = 1, UNKNOWN = -1;
char lexeme[100], nextChar;

• Convenient utility subprograms:
  – getChar - gets the next character of input, puts it in nextChar, determines its class and puts the class in charClass
  – addChar - appends the character in nextChar to the lexeme being built up in lexeme
  – lookup - determines whether the string in lexeme is a reserved word (returns a code)



Simple Lexical Analyzer

/* Global Variables */
int charClass, lexLen, LETTER = 0, DIGIT = 1, UNKNOWN = -1;
char lexeme[100], nextChar;

/* addChar - append nextChar to lexeme */
void addChar() {
    if (lexLen <= 99) lexeme[lexLen++] = nextChar;
    else printf("error - lexeme too long\n");
}

/* getChar - read the next input character (stdin here) and classify it */
void getChar() {
    if ((nextChar = getchar()) != EOF) {
        if (isalpha(nextChar)) charClass = LETTER;
        else if (isdigit(nextChar)) charClass = DIGIT;
        else charClass = UNKNOWN;
    } else charClass = EOF;
}

/* getNonblank - skip white space */
void getNonblank() { while (isspace(nextChar)) getChar(); }



Simple Lexical Analyzer

/* Lookup - is str a keyword? If so, copy its table entry into the current token. */
int Lookup(char *str)
{
    int z = 0, lt = strlen(str), key = 0;
    while (z < lt)                  /* compute the hash key */
    {
        key += str[z];
        z++;
    }
    key = key % HTSMAX;

    if (keyTable[key].flag == 1)
    {
        if (strcmp(keyTable[key].lex, str) == 0)
        {
            strcpy(tok.lexeme, str);
            /* copying to the symbol table */
            strcpy(tok.token, keyTable[key].token);
            tok.sym = keyTable[key].sym;
            tok.lineno = lineno;
            state = 100;
            return 1;               /* keyword */
        }
    }
    return 0;                       /* not a keyword */
}
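
The fragment above refers to several globals that are not shown here (HTSMAX, keyTable, tok, lineno, state); one plausible shape for them, purely an assumption to make the fragment self-contained, is:

#define HTSMAX 101                  /* keyword hash table size (assumed)   */

struct keyEntry {
    int  flag;                      /* 1 if this slot holds a keyword      */
    char lex[16];                   /* the keyword text, e.g. "int"        */
    char token[16];                 /* its token name, e.g. "TK_INT"       */
    int  sym;                       /* numeric symbol/token code           */
} keyTable[HTSMAX];

struct {                            /* the token currently being built     */
    char lexeme[64];
    char token[16];
    int  sym;
    int  lineno;
} tok;

int lineno = 1;                     /* current source line number          */
int state;                          /* scanner state (set to 100 above)    */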
Simple Lexical Analyzer

int lex() {
    getChar();
    switch (charClass) {
        case LETTER:
            addChar();
            getChar();
            while (charClass == LETTER || charClass == DIGIT) {
                addChar();
                getChar();
            }
            return lookup(lexeme);
            break;



Simple Lexical Analyzer

        case DIGIT:
            addChar();
            getChar();
            while (charClass == DIGIT) {
                addChar();
                getChar();
            }
            return INT_LIT;
            break;
    } /* End of switch */
} /* End of function lex */



I/O - Key For Successful Lexical Analysis

• Character-at-a-time I/O vs. Block/Buffered I/O: a tradeoff
• Block/Buffered I/O
  – Utilize a block of memory
  – Stage data from the source into the buffer a block at a time
  – Maintain two blocks - Why?

[Figure: Block 1 and Block 2. When the pointer is done with Block 1, issue I/O to refill it while still processing a token that continues into the 2nd block.]



Buffered I/O

[Figure: the source program
  main
  {
  int aa =55,bag=10;
  …
  }
is read into fixed-size buffers of n characters ("m a i n { i n", "{ i n t a a = 5", "5 5 , b a g = 1"), each indexed 0 … n-1, with the forward pointer advancing through the current buffer.]
Buffered I/O

[Figure: Buffer 1 (positions 0 … n-1) and Buffer 2 (positions n … 2n-1) holding "… m a i n { i n t aa=55, ba", with the forward pointer moving across the block boundary.]

Code to advance the forward pointer, with eof used as a sentinel at the end of each block:

forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first block then begin
        reload second block ;
        forward := forward + 1
    end
    else if forward at end of second block then begin
        reload first block ;
        move forward to beginning of first block
    end
    else /* eof within buffer signifying end of input */
        terminate lexical analysis
end
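
A rough C rendering of the same two-block scheme, using '\0' as the sentinel in place of eof; the buffer size, the names, and the stdio-based reload are assumptions made for the example:

#include <stdio.h>

#define N 4096                          /* size of each buffer half (assumed)        */

static char  buf[2 * N + 2];            /* two halves, each followed by a sentinel   */
static char *forward;                   /* scanning pointer                          */
static FILE *src;                       /* the source program                        */

/* Fill one half with source text and terminate it with the sentinel.
 * '\0' plays the role of eof here, so the source must not contain '\0'. */
static void reload(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = '\0';
}

/* Prepare the buffers before scanning starts. */
void init_input(FILE *fp) {
    src = fp;
    reload(buf);
    forward = buf;
}

/* Advance forward by one character, reloading a half when its sentinel is hit;
 * returns the next character, or EOF at the real end of input. */
int next_char(void) {
    int c = (unsigned char)*forward++;
    if (c != '\0')
        return c;
    if (forward == buf + N + 1) {        /* sentinel at end of first half   */
        reload(buf + N + 1);             /* reload second half              */
        return next_char();
    }
    if (forward == buf + 2 * N + 2) {    /* sentinel at end of second half  */
        reload(buf);                     /* reload first half               */
        forward = buf;                   /* wrap to start of first half     */
        return next_char();
    }
    forward--;                           /* sentinel inside a half: real end of input */
    return EOF;
}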



Lexical errors

• If the user omits the space and writes “realf” instead of “real f”
  – No lexical error: the single token IDENTIFIER(“realf”) is produced instead of the sequence REAL, IDENTIFIER(f)

• One more mistake:
  whil ( x := 0 ) do
  generates no lexical errors, since whil matches the identifier pattern

• Typically only a few types of lexical error:
  – illegal characters (see the sketch below)
  – unterminated comments
  – ill-formed constants
  – identifier length exceeds the maximum
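
For the first of these, the scanner typically reports the offending character and continues; a minimal illustrative helper (the name and message format are assumptions):

#include <stdio.h>

/* Report an illegal character and let scanning continue. */
void lexError(int lineno, int bad) {
    fprintf(stderr, "line %d: lexical error: illegal character '%c'\n",
            lineno, bad);
}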



What is hard about Lexical Analysis?

Poor language design can complicate scanning

• Reserved words are important
  – In PL/I there are no reserved keywords, so you can write a valid statement like:
    if then then then = else; else else = then

• Significant blanks
  – In Fortran blanks are not significant:
    do 10 i = 1,25    is a do loop
    do 10 i = 1.25    is an assignment to the variable do10i
  – How does a compiler handle this?
    • A first pass finds & inserts blanks
    • A second pass is the normal scanner

Grammar:
1) <Program> → <functions> comma <functionbody> comma
2) <functions> → <function> <functions> / ε
3) <function> → <funsignature> <functionbody>
4) <funsignature> → <type> id ( <params> )
5) <type> → int / float / string
6) <params> → <type> id , <params> / ε
7) <functionbody> → { <declarations> <statements> *<E>; }
8) <declarations> → <type> id ; <declarations> / ε
9) <statements> → id := <E>; / id := id <more>; <statements> / ε
10) <E> → <E> + <E> / <E> * <E> / id / <integerliteral> / <floatliteral> / <stringliteral>
11) <more> → ( <args> )
12) <args> → id comma <args> / ε

Sample program:
string fun1(int x, string y,){
    int a;
    string b;
    b = b + y;
    *y;
}
float fun2(int I, float j,){
    int k;
    k = I *j+k;
    *k;
}
,
{
    float z; string s= “bits”;
    int p;
    Z=10.5; p=5;
    z = fun1(p,s,);
    p = fun2(p,z,);
    *z;
}
,



Token Stream

1. TK_Semicolon : the semicolon ;
2. TK_Comma : the comma ,
3. TK_LFBK : the left bracket (
4. TK_RTBK : the right bracket )
5. TK_LFBR : the left brace {
6. TK_RTBR : the right brace }
7. TK_ASSIGN : the assignment operator :=
8. TK_PLUS : the + operator
9. TK_STAR : the multiplication operator *
10. TK_INTLIT : a string of one or more digits, with any unnecessary leading zeros removed
11. TK_REAL_LIT : a real number, which has a decimal point (0.22, 00.15, 3.414, etc.)
12. TK_STRINGLIT : a string enclosed in a pair of double quotes; may contain escaped double quotes within
13. TK_INT : the keyword int
14. TK_FLOAT : the keyword float
15. TK_STRING : the keyword string
16. TK_ID : an identifier name; starts with a letter (a-z, A-Z), can contain digits, maximum length 10


