Lexical Analysis

Compiler Construction / Compiler Design

CS F 363/ IS F342
D.C.KIRAN
dck@pilani.bits-pilani.ac.in
BITS Pilani, Pilani Campus

Lexical Analysis
Scanning Perspective

A scanner produces a stream of tokens (token <type>, value pairs) from a stream of characters.

Two modes of operation:
• A complete pass produces the full token stream, which is then passed to the parser.
• The parser calls the scanner to request the next token (a sketch follows below).
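A minimal C sketch of the second mode, where the parser pulls tokens on demand; the Token struct, the token kinds, and next_token() are illustrative names made up for this example, not part of any particular scanner:

#include <stdio.h>
#include <string.h>

/* A token is a pair: token name plus an optional attribute value. */
typedef enum { TK_ID, TK_NUM, TK_RELOP, TK_EOF } TokenType;

typedef struct {
    TokenType type;        /* token name, e.g. TK_ID            */
    char      lexeme[64];  /* attribute value: the matched text */
} Token;

/* Stub scanner: a real one would read characters and run the automaton.
 * Here it returns one identifier and then end of input. */
Token next_token(void) {
    static int calls = 0;
    Token t;
    if (calls++ == 0) { t.type = TK_ID;  strcpy(t.lexeme, "d2"); }
    else              { t.type = TK_EOF; t.lexeme[0] = '\0';     }
    return t;
}

/* Mode 2: the parser calls the scanner to request the next token. */
int main(void) {
    for (Token t = next_token(); t.type != TK_EOF; t = next_token())
        printf("token %d, lexeme \"%s\"\n", t.type, t.lexeme);
    return 0;
}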
Jargon

• A Token is a pair consisting of a token name and an optional attribute value.
  Token <type>s: keywords, operators, identifiers, constants, punctuation symbols (such as comma, semicolon)
  A token can be:
  – A terminal symbol in a grammar
  – A class of character sequences with a collective meaning, e.g., IDENT
  – Constants, operators, punctuation, reserved words (keywords)
• A Lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token. E.g., the relational operators {<, <=, >, >=, ==, <>}.
• A Pattern is a description of the form that the lexemes of a token may take.
• A Delimiter can be a blank, tab, or newline character.



Token

L(G) = {lex1, lex2, lex3, lex4, lex5, lex6, …, lexn}

Each group of lexemes is described by a regular expression, and each regular expression defines a token:
RE1 → Token1,  RE2 → Token2,  …,  REm → Token-m

For example:
L(G) = {abc, siff, d2d, if, int, for, …, <, <>, <=, >=}
Regular expressions: RE1, RE2, RE3, RE4, …, REm
Tokens:              IDENTIFIER, IF_T, INT_T, FOR_T, RELOP



Token example

TK_if → if
TK_then → then
TK_else → else
TK_relop → < | <= | > | >= | = | <>
TK_id → letter ( letter | digit )*
TK_num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

TK_blank → b
TK_tab → ^T
TK_newline → ^M
TK_delim → blank | tab | newline
TK_ws → delim +
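
As a rough illustration of how such a pattern maps to hand-written code, here is a small check for TK_num (digit+ ( . digit+ )? ( E ( + | - )? digit+ )?); the function name is made up for the example:

#include <ctype.h>

/* Returns 1 if s matches TK_num: digit+ ( . digit+ )? ( E ( + | - )? digit+ )? */
int matches_tk_num(const char *s) {
    if (!isdigit((unsigned char)*s)) return 0;     /* leading digit+ */
    while (isdigit((unsigned char)*s)) s++;
    if (*s == '.') {                               /* optional fraction */
        s++;
        if (!isdigit((unsigned char)*s)) return 0;
        while (isdigit((unsigned char)*s)) s++;
    }
    if (*s == 'E') {                               /* optional exponent */
        s++;
        if (*s == '+' || *s == '-') s++;
        if (!isdigit((unsigned char)*s)) return 0;
        while (isdigit((unsigned char)*s)) s++;
    }
    return *s == '\0';                             /* whole string matched */
}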



Lexical Analyzer Responsibilities

• Lexical analyzer [Scanner]
  – Scan the input
  – Remove white space
  – Remove comments
  – Manufacture tokens
  – Report lexical errors
  – Pass tokens to the parser



Implementation Options

• Hand-written lexer
  – Implement by string comparison
  – Implement a finite state automaton (a table-driven sketch follows below)
    • start in some initial state
    • look at each input character in sequence, updating the lexer state accordingly
    • if the state at the end of the input is an accepting state, the input string matches the RE
• Lexer generator
  – Generates the tokenizer automatically (e.g., lex, flex, JLex)
  – Uses the RE to NFA to DFA algorithm
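
A minimal table-driven sketch of the finite-state-automaton option for a single pattern, TK_id = letter ( letter | digit )*; the state numbers, character classes, and transition table are assumptions made up for this example:

#include <ctype.h>
#include <stdio.h>

enum { S_START, S_IN_ID, S_ERROR, NUM_STATES };       /* lexer states      */
enum { C_LETTER, C_DIGIT, C_OTHER, NUM_CLASSES };     /* character classes */

/* delta[state][character class] = next state */
static const int delta[NUM_STATES][NUM_CLASSES] = {
    /* S_START */ { S_IN_ID, S_ERROR, S_ERROR },
    /* S_IN_ID */ { S_IN_ID, S_IN_ID, S_ERROR },
    /* S_ERROR */ { S_ERROR, S_ERROR, S_ERROR },
};

static int char_class(int c) {
    if (isalpha(c)) return C_LETTER;
    if (isdigit(c)) return C_DIGIT;
    return C_OTHER;
}

/* Start in the initial state, update the state per input character,
 * and accept iff the final state is an accepting state (S_IN_ID). */
int is_identifier(const char *s) {
    int state = S_START;
    for (; *s; s++)
        state = delta[state][char_class((unsigned char)*s)];
    return state == S_IN_ID;
}

int main(void) {
    printf("%d %d\n", is_identifier("d2"), is_identifier("2d"));  /* prints 1 0 */
    return 0;
}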



Extensions to pure DFAs

• One accepting state per token <type>
• Keywords are not identifiers
  – Look up each identifier in a keyword table (e.g., a hash table) to see whether it is in fact a keyword
• “Look-ahead” to distinguish tokens with a common prefix (e.g., 100 vs 100.5)
  – Try to find the longest possible match by continuing to scan from an accepting state.
  – Backtrack to the last accepting state when “stuck” (see the sketch below).
  char str[5] = {'B', 'I', 'T', 'S'};
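
A small hand-coded sketch of that look-ahead for the 100 vs 100.5 case; the pushback helpers, the use of stdin, and the name scan_number() are all assumptions made for the example (buf is assumed large enough):

#include <ctype.h>
#include <stdio.h>

/* Tiny pushback on top of stdin; real scanners keep their own buffer
 * because ungetc() only guarantees one character of pushback. */
static int pushed[2], npushed = 0;
static int  next_char(void)   { return npushed ? pushed[--npushed] : getchar(); }
static void push_back(int c)  { pushed[npushed++] = c; }

/* Scan an integer or real literal into buf.
 * Returns 1 for a real (digits . digits), 0 for an integer.
 * "100.5" -> real; "100" or "100." -> integer, with the extra characters
 * pushed back (backtracking to the last accepting state). */
int scan_number(char *buf) {
    int c, i = 0;
    while (isdigit(c = next_char()))
        buf[i++] = (char)c;
    if (c == '.') {
        int after = next_char();
        if (isdigit(after)) {                    /* longest match wins */
            buf[i++] = '.';
            buf[i++] = (char)after;
            while (isdigit(c = next_char()))
                buf[i++] = (char)c;
            push_back(c);
            buf[i] = '\0';
            return 1;
        }
        push_back(after);                        /* stuck: give both back */
        push_back('.');
    } else {
        push_back(c);
    }
    buf[i] = '\0';
    return 0;                                    /* accepted as an integer */
}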



Hand Written Lexer

Basic Scanning Technique

• Use 1 character of look-ahead
• Obtain the character with getc()
• Do a case analysis
  – Based on the lookahead character
  – Based on the current lexeme
• Outcome
  – If the character can extend the lexeme, all is well, go on.
  – If the character cannot extend the lexeme:
    • Figure out what the complete lexeme is and return its token
    • Put the lookahead character back into the symbol stream



The Model of Recognition of Tokens

[Figure: the input stream "i f  d 2  = …" with the lexeme_beginning pointer feeding an FA simulator, which emits the tokens (IF, if), (ID, d2), (Assign, ) and skips delimiters.]
Identifier vs Keywords

Identifiers and keywords match the same letter/digit pattern; once a lexeme has been scanned, it is looked up in the keyword table to decide whether to return a keyword token or TK_id.



Principle Of Longest Match

elsex=0; is scanned as the single identifier elsex followed by = and 0 (the longest possible match), not as the keyword else followed by x = 0.



Principle Of Longest Match

Transition diagram for relational operators (states marked * push back the lookahead character before returning):

state 0 (start): '<' → state 1,  '=' → state 5,  '>' → state 6
state 1: '='   → state 2,  return (relop, LE)
         '>'   → state 3,  return (relop, NE)
         other → state 4*, Pushback(LA), return (relop, LT)
state 5: return (relop, EQ)
state 6: '='   → state 7,  return (relop, GE)
         other → state 8*, Pushback(LA), return (relop, GT)
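
A direct hand-coded rendering of this diagram, again only a sketch: the token codes and the use of stdin/ungetc() for the pushback are assumptions made for the example:

#include <stdio.h>

enum { LT, LE, EQ, NE, GT, GE };

/* Recognize a relational operator whose first character c ('<', '=' or '>')
 * has already been read.  States 4 and 8 of the diagram push the lookahead back. */
int scan_relop(int c) {
    int la;
    if (c == '<') {
        la = getchar();
        if (la == '=') return LE;          /* state 2 */
        if (la == '>') return NE;          /* state 3 */
        ungetc(la, stdin);                 /* state 4*: Pushback(LA) */
        return LT;
    }
    if (c == '=') return EQ;               /* state 5 */
    la = getchar();                        /* c == '>' */
    if (la == '=') return GE;              /* state 7 */
    ungetc(la, stdin);                     /* state 8*: Pushback(LA) */
    return GT;
}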



Regular expression rules:
r_1 { action_1 }
r_2 { action_2 }
.
.
.
r_n { action_n }

Build an automaton A_r_i for each regular expression r_i. Add a new start state s0 with an ε-transition to the start state of every A_r_i; the final states of the A_r_i become the new final states of the combined NFA, each associated with its action.

For faster scanning, convert this NFA to a DFA and minimize the states



[Figure: building the automaton for TK_relOp in steps.
Step 1: two separate automata, one for "<=" returning (TK_relOp, LE) and one for "<>" returning (TK_relOp, NE).
Step 2: the two automata are joined by ε-transitions from a common start state.
Step 3: the merged DFA: after reading '<', an '=' edge returns (TK_relOp, LE) and a '>' edge returns (TK_relOp, NE).]



Lexical Analyzer

[Figure]

Example

[Figure: lexemes]


Simple Lexical Analyzer

/* Global Variables */
int charClass, lexLen, LETTER = 0, DIGIT = 1, UNKNOWN = -1;
char lexeme[100], nextChar;

• Convenient utility subprograms:
  – getChar - gets the next character of input, puts it in nextChar, determines its class and puts the class in charClass
  – addChar - appends the character in nextChar to the lexeme being built up in lexeme
  – lookup - determines whether the string in lexeme is a reserved word (returns a code)



Simple Lexical Analyzer

/* Global Variables */
int charClass, lexLen, LETTER = 0, DIGIT = 1, UNKNOWN = -1;
char lexeme[100], nextChar;

/* addChar - append nextChar to lexeme */
void addChar() {
    if (lexLen <= 99) lexeme[lexLen++] = nextChar;
    else printf("error - lexeme too long\n");
}

/* getChar - read the next input character (stdin here) and classify it */
void getChar() {
    if ((nextChar = getchar()) != EOF) {
        if (isalpha(nextChar)) charClass = LETTER;
        else if (isdigit(nextChar)) charClass = DIGIT;
        else charClass = UNKNOWN;
    } else charClass = EOF;
}

/* getNonblank - skip white space */
void getNonblank() { while (isspace(nextChar)) getChar(); }



Simple Lexical Analyzer

/* Lookup - is str a keyword? If so, copy its table entry into the current token. */
int Lookup(char *str)
{
    int z = 0, lt = strlen(str), key = 0;
    while (z < lt)                  /* compute the hash key */
    {
        key += str[z];
        z++;
    }
    key = key % HTSMAX;

    if (keyTable[key].flag == 1)
    {
        if (strcmp(keyTable[key].lex, str) == 0)
        {
            strcpy(tok.lexeme, str);
            /* copying to the symbol table */
            strcpy(tok.token, keyTable[key].token);
            tok.sym = keyTable[key].sym;
            tok.lineno = lineno;
            state = 100;
            return 1;               /* keyword */
        }
    }
    return 0;                       /* not a keyword */
}
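
The fragment above refers to several globals that are not shown here (HTSMAX, keyTable, tok, lineno, state); one plausible shape for them, purely an assumption to make the fragment self-contained, is:

#define HTSMAX 101                  /* keyword hash table size (assumed)   */

struct keyEntry {
    int  flag;                      /* 1 if this slot holds a keyword      */
    char lex[16];                   /* the keyword text, e.g. "int"        */
    char token[16];                 /* its token name, e.g. "TK_INT"       */
    int  sym;                       /* numeric symbol/token code           */
} keyTable[HTSMAX];

struct {                            /* the token currently being built     */
    char lexeme[64];
    char token[16];
    int  sym;
    int  lineno;
} tok;

int lineno = 1;                     /* current source line number          */
int state;                          /* scanner state (set to 100 above)    */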
Simple Lexical Analyzer

int lex() {
    getChar();
    switch (charClass) {
        case LETTER:
            addChar();
            getChar();
            while (charClass == LETTER || charClass == DIGIT) {
                addChar();
                getChar();
            }
            return lookup(lexeme);
            break;



Simple Lexical Analyzer

        case DIGIT:
            addChar();
            getChar();
            while (charClass == DIGIT) {
                addChar();
                getChar();
            }
            return INT_LIT;
            break;
    } /* End of switch */
} /* End of function lex */



I/O - Key For Successful Lexical Analysis

• Character-at-a-time I/O vs. Block/Buffered I/O: a tradeoff
• Block/Buffered I/O
  – Utilize a block of memory
  – Stage data from the source into the buffer a block at a time
  – Maintain two blocks - Why?

[Figure: Block 1 and Block 2. When the pointer is done with Block 1, issue I/O to refill it while still processing a token that continues into the 2nd block.]



Buffered I/O

[Figure: the source program
  main
  {
  int aa =55,bag=10;
  …
  }
is read into fixed-size buffers of n characters ("m a i n { i n", "{ i n t a a = 5", "5 5 , b a g = 1"), each indexed 0 … n-1, with the forward pointer advancing through the current buffer.]
Buffered I/O

[Figure: Buffer 1 (positions 0 … n-1) and Buffer 2 (positions n … 2n-1) holding "… m a i n { i n t aa=55, ba", with the forward pointer moving across the block boundary.]

Code to advance the forward pointer, with eof used as a sentinel at the end of each block:

forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first block then begin
        reload second block ;
        forward := forward + 1
    end
    else if forward at end of second block then begin
        reload first block ;
        move forward to beginning of first block
    end
    else /* eof within buffer signifying end of input */
        terminate lexical analysis
end
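
A rough C rendering of the same two-block scheme, using '\0' as the sentinel in place of eof; the buffer size, the names, and the stdio-based reload are assumptions made for the example:

#include <stdio.h>

#define N 4096                          /* size of each buffer half (assumed)        */

static char  buf[2 * N + 2];            /* two halves, each followed by a sentinel   */
static char *forward;                   /* scanning pointer                          */
static FILE *src;                       /* the source program                        */

/* Fill one half with source text and terminate it with the sentinel.
 * '\0' plays the role of eof here, so the source must not contain '\0'. */
static void reload(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = '\0';
}

/* Prepare the buffers before scanning starts. */
void init_input(FILE *fp) {
    src = fp;
    reload(buf);
    forward = buf;
}

/* Advance forward by one character, reloading a half when its sentinel is hit;
 * returns the next character, or EOF at the real end of input. */
int next_char(void) {
    int c = (unsigned char)*forward++;
    if (c != '\0')
        return c;
    if (forward == buf + N + 1) {        /* sentinel at end of first half   */
        reload(buf + N + 1);             /* reload second half              */
        return next_char();
    }
    if (forward == buf + 2 * N + 2) {    /* sentinel at end of second half  */
        reload(buf);                     /* reload first half               */
        forward = buf;                   /* wrap to start of first half     */
        return next_char();
    }
    forward--;                           /* sentinel inside a half: real end of input */
    return EOF;
}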



Lexical errors

• If the user omits the space and writes “realf” instead of “real f”
  – No lexical error: the single token IDENTIFIER(“realf”) is produced instead of the sequence REAL, IDENTIFIER(f)

• One more mistake:
  whil ( x := 0 ) do
  generates no lexical errors, since whil matches the identifier pattern

• Typically only a few types of lexical error:
  – illegal characters (see the sketch below)
  – unterminated comments
  – ill-formed constants
  – identifier length exceeds the maximum
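
For the first of these, the scanner typically reports the offending character and continues; a minimal illustrative helper (the name and message format are assumptions):

#include <stdio.h>

/* Report an illegal character and let scanning continue. */
void lexError(int lineno, int bad) {
    fprintf(stderr, "line %d: lexical error: illegal character '%c'\n",
            lineno, bad);
}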



What is hard about Lexical Analysis?

Poor language design can complicate scanning

• Reserved words are important
  – In PL/I there are no reserved keywords, so you can write a valid statement like:
    if then then then = else; else else = then

• Significant blanks
  – In Fortran blanks are not significant:
    do 10 i = 1,25    is a do loop
    do 10 i = 1.25    is an assignment to the variable do10i
  – How does a compiler handle this?
    • A first pass finds & inserts blanks
    • A second pass is the normal scanner

Grammar:
1) <Program> → <functions> comma <functionbody> comma
2) <functions> → <function> <functions> / ε
3) <function> → <funsignature> <functionbody>
4) <funsignature> → <type> id ( <params> )
5) <type> → int / float / string
6) <params> → <type> id , <params> / ε
7) <functionbody> → { <declarations> <statements> *<E>; }
8) <declarations> → <type> id ; <declarations> / ε
9) <statements> → id := <E>; / id := id <more>; <statements> / ε
10) <E> → <E> + <E> / <E> * <E> / id / <integerliteral> / <floatliteral> / <stringliteral>
11) <more> → ( <args> )
12) <args> → id comma <args> / ε

Sample program:
string fun1(int x, string y,){
    int a;
    string b;
    b = b + y;
    *y;
}
float fun2(int I, float j,){
    int k;
    k = I *j+k;
    *k;
}
,
{
    float z; string s= “bits”;
    int p;
    Z=10.5; p=5;
    z = fun1(p,s,);
    p = fun2(p,z,);
    *z;
}
,



Token Stream

1. TK_Semicolon : the semicolon ;
2. TK_Comma : the comma ,
3. TK_LFBK : the left bracket (
4. TK_RTBK : the right bracket )
5. TK_LFBR : the left brace {
6. TK_RTBR : the right brace }
7. TK_ASSIGN : the assignment operator :=
8. TK_PLUS : the + operator
9. TK_STAR : the multiplication operator *
10. TK_INTLIT : a string of one or more digits, with any unnecessary leading zeros removed
11. TK_REAL_LIT : a real number, which has a decimal point (0.22, 00.15, 3.414, etc.)
12. TK_STRINGLIT : a string enclosed in a pair of double quotes; may contain escaped double quotes within
13. TK_INT : the keyword int
14. TK_FLOAT : the keyword float
15. TK_STRING : the keyword string
16. TK_ID : an identifier name; starts with a letter (a-z, A-Z), can contain digits, maximum length 10


