
Compiler Construction / Compiler Design

CS F 363/ IS F342
Lexical Analysis
Scanning Perspective

A scanner produces a stream of tokens, each a (token type, value) pair, from a character stream.

Two modes of operation
• Complete pass produces the token stream, which is then passed to the parser.
• Parser calls scanner to request next token.
• A Token is a pair consisting of a token name and an
optional attribute value.
Token type: keywords, operators, identifiers, constants,
punctuation symbols (such as comma, semicolon)
A token type can be:
– A terminal symbol in a grammar
– A class of character sequences with a collective meaning, e.g.,
constants, operators, punctuation, reserved words (keywords)
• A Lexeme is a sequence of characters in the source program
that matches the pattern for a token and is identified by the
lexical analyzer as an instance of that token. E.g. Relation
• A Pattern is a description of the form that the lexemes of
a token may take.
• A Delimiter can be a blank, tab, or newline character.

[Figure: the lexemes of the language, L(G) = {lex1, lex2, lex3, …, lexn}, are grouped into token classes Token1, Token2, …, Token-m, and each class is described by a regular expression RE1, RE2, …, REm. Example: L(G) = {abc, siff, d2d, if, int, for, …, <, <>, <=, >=, =}.]



Token example

TK_if → if
TK_then → then
TK_else → else
TK_relop → < | <= | > | >= | = | <>
TK_id → letter ( letter | digit )*
TK_num → digit + ( . digit + ) ? ( E ( + | - ) ? digit + ) ?

TK_blank → b
TK_tab → ^T
TK_newline → ^M
TK_delim → blank | tab | newline
TK_ws → delim +
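
The TK_num pattern can be checked by hand with a few loops. The sketch below is one possible reading of it, assuming the fraction and the exponent may each appear at most once and that the exponent sign is optional. The function name match_num is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns the length of the longest prefix of s that matches
 *   digit+ ( . digit+ )? ( E (+|-)? digit+ )?
 * or 0 if s does not start with a number. (Hypothetical helper.) */
size_t match_num(const char *s)
{
    size_t i = 0;

    if (!isdigit((unsigned char)s[i])) return 0;
    while (isdigit((unsigned char)s[i])) i++;            /* digit+ */

    /* optional fraction: taken only if a digit follows the dot */
    if (s[i] == '.' && isdigit((unsigned char)s[i + 1])) {
        i++;
        while (isdigit((unsigned char)s[i])) i++;
    }

    /* optional exponent: taken only if digits follow E and the sign */
    if (s[i] == 'E') {
        size_t j = i + 1;
        if (s[j] == '+' || s[j] == '-') j++;
        if (isdigit((unsigned char)s[j])) {
            while (isdigit((unsigned char)s[j])) j++;
            i = j;
        }
    }
    return i;
}
```

For the input 100. the function reports a lexeme of length 3, so the trailing dot is left for the next token.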

Lexical Analyzer Responsibilities

• Lexical analyzer [Scanner]

– Scan input
– Remove white spaces
– Remove comments
– Manufacture tokens
– Generate lexical errors
– Pass token to parser

Implementation Options

• Hand written lexer

– Implementing by string comparison
– Implement a finite state automaton
• start in some initial state
• look at each input character in sequence, update
lexer state accordingly
• if state at end of input is an accepting state, the
input string matches the RE
• Lexer generator
– Generate tokenizer automatically (e.g., lex, flex, JLex)
– Uses RE to NFA to DFA algorithm
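
The finite-state-automaton option can be sketched as a table-driven DFA. The example below recognizes letter ( letter | digit )*, following the three steps listed above: start in the initial state, update the state on every character, and accept if the final state is accepting. The state names, table layout and the function is_identifier are illustrative assumptions.

```c
#include <ctype.h>

enum { S_START, S_IN_ID, S_DEAD, N_STATES };      /* DFA states */
enum { C_LETTER, C_DIGIT, C_OTHER, N_CLASSES };   /* input classes */

/* Transition table: next_state[current state][input class]. */
static const int next_state[N_STATES][N_CLASSES] = {
    /* letter    digit     other  */
    { S_IN_ID,  S_DEAD,   S_DEAD },   /* S_START */
    { S_IN_ID,  S_IN_ID,  S_DEAD },   /* S_IN_ID (accepting) */
    { S_DEAD,   S_DEAD,   S_DEAD },   /* S_DEAD  */
};

static int char_class(int c)
{
    if (isalpha(c)) return C_LETTER;
    if (isdigit(c)) return C_DIGIT;
    return C_OTHER;
}

/* Returns 1 if the whole string s matches letter (letter | digit)*. */
int is_identifier(const char *s)
{
    int state = S_START;                          /* initial state */
    for (; *s != '\0'; s++)                       /* one step per character */
        state = next_state[state][char_class((unsigned char)*s)];
    return state == S_IN_ID;                      /* accepting at end? */
}
```

Adding a new token usually means adding rows and columns to the table rather than new control flow, which is why generated lexers tend to use this layout.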

Extensions to pure DFAs

• One accepting state per token type

• Keywords are not identifiers

– Look up identifier in keyword table (e.g., hash table) to see whether it
is in fact a keyword
• “Look-ahead” to distinguish tokens with common prefix (e.g., 100 vs 100.5)
– Try to find the longest possible match by continuing to scan from an
accepting state.
– Backtrack to last accepting state when “stuck”.
char str[5] = {'B', 'I', 'T', 'S'};
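
The look-ahead rule for 100 vs 100.5 can be illustrated with a small scanner that remembers the last accepting state it passed through. This is only a sketch: the state names, the enum num_kind and the function scan_number are hypothetical, and exponents are left out to keep it short.

```c
#include <ctype.h>
#include <stddef.h>

enum num_kind { NUM_NONE, NUM_INT, NUM_REAL };

/* Scans digit+ ( . digit+ )? from s and remembers the last accepting
 * position. *len receives the length of the lexeme finally chosen. */
enum num_kind scan_number(const char *s, size_t *len)
{
    enum { START, IN_INT, SEEN_DOT, IN_FRAC } state = START;
    enum num_kind last_kind = NUM_NONE;   /* kind at last accepting state */
    size_t last_len = 0;                  /* input position at that state */
    size_t i = 0;

    for (;;) {
        unsigned char c = (unsigned char)s[i];
        if (state == START && isdigit(c))           state = IN_INT;
        else if (state == IN_INT && isdigit(c))     state = IN_INT;
        else if (state == IN_INT && c == '.')       state = SEEN_DOT;
        else if (state == SEEN_DOT && isdigit(c))   state = IN_FRAC;
        else if (state == IN_FRAC && isdigit(c))    state = IN_FRAC;
        else break;                       /* stuck: no transition on c */
        i++;
        if (state == IN_INT)  { last_kind = NUM_INT;  last_len = i; }
        if (state == IN_FRAC) { last_kind = NUM_REAL; last_len = i; }
    }
    *len = last_len;                      /* backtrack to last accept */
    return last_kind;
}
```

On the input 100.x the scanner reads the dot, gets stuck on x, and falls back to the integer 100 of length 3.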

Hand Written Lexer

Basic Scanning Technique

• Use 1 character of look-ahead
– Obtain char with getc()
• Do a case analysis
– Based on lookahead char
– Based on current lexeme
• If char can extend lexeme, all is well, go on.
• If char cannot extend lexeme:
– Figure out what the complete lexeme is and return its token
– Put the lookahead back into the symbol stream
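
A minimal sketch of this technique for identifiers, using getc() for the look-ahead and the standard ungetc() to put it back. The function name read_identifier and its buffer convention are assumptions made for the example.

```c
#include <ctype.h>
#include <stdio.h>

/* Reads one identifier, letter (letter | digit)*, from fp into buf
 * (capacity cap, cap >= 2). Returns its length, or 0 if the next
 * character cannot start an identifier. Characters beyond the buffer
 * capacity are consumed but not stored. */
int read_identifier(FILE *fp, char *buf, int cap)
{
    int len = 0;
    int c = getc(fp);                      /* one character of look-ahead */

    if (!isalpha(c)) {                     /* cannot start a lexeme */
        if (c != EOF) ungetc(c, fp);
        buf[0] = '\0';
        return 0;
    }
    while (isalnum(c)) {                   /* char extends the lexeme */
        if (len < cap - 1) buf[len++] = (char)c;
        c = getc(fp);
    }
    if (c != EOF) ungetc(c, fp);           /* put the look-ahead back */
    buf[len] = '\0';
    return len;
}
```

After scanning d2 from the input d2=5, the next getc() on the same stream returns the = character, because it was pushed back.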

The Model of Recognition of Tokens

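[Figure: the FA simulator reads the input stream "if d2 = …" character by character. After i, f and a delimiter it emits (IF, if); after d, 2 and the look-ahead = it emits (ID, d2); it then emits (Assign, ) for =.]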

Identifier vs Keywords

Principle Of Longest Match

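The input elsex=0; could be split in two ways:

else x = 0 (keyword else followed by identifier x)
elsex = 0 (identifier elsex)

Under the principle of longest match the scanner chooses the second: elsex is a single identifier.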

Principle Of Longest Match

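Transition diagram for relop:
state 0 (start): on < go to 1; on = go to 5; on > go to 6
state 1: on = go to 2; on > go to 3; on any other character go to 4
state 2: return(relop, LE)
state 3: return(relop, NE)
state 4: return(relop, LT), retracting the extra character read
state 5: return(relop, EQ)
state 6: on = go to 7; on any other character go to 8
state 7: return(relop, GE)
state 8: return(relop, GT), retracting the extra character read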
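
The relop transition diagram can be simulated directly with a switch on the current character. The sketch below works on a string and an index so that retracting simply means not advancing the index. The enum relop, the function scan_relop and the number 8 for the GT state are assumptions made for the example.

```c
enum relop { RELOP_NONE, LT, LE, NE, EQ, GT, GE };

/* Simulates the relop transition diagram on s starting at *pos.
 * On success, advances *pos past the lexeme and returns the attribute.
 * In states 4 and 8 the extra character examined is not consumed. */
enum relop scan_relop(const char *s, int *pos)
{
    int i = *pos;
    enum relop result;

    switch (s[i]) {                                   /* state 0 */
    case '<':                                         /* state 1 */
        i++;
        if (s[i] == '=')      { i++; result = LE; }   /* state 2 */
        else if (s[i] == '>') { i++; result = NE; }   /* state 3 */
        else                  result = LT;            /* state 4 */
        break;
    case '=':                                         /* state 5 */
        i++;
        result = EQ;
        break;
    case '>':                                         /* state 6 */
        i++;
        if (s[i] == '=')      { i++; result = GE; }   /* state 7 */
        else                  result = GT;            /* state 8 */
        break;
    default:
        return RELOP_NONE;                /* not a relational operator */
    }
    *pos = i;
    return result;
}
```

Because <= is tried before falling back to <, the function also follows the principle of longest match.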

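Regular expression rules:
r_1 { action_1 }
r_2 { action_2 }
. . .
r_n { action_n }

[Figure: an automaton is built for each regular expression r_i. A new start state is connected by ε-transitions to the start state of each of these automata, and their final states become the new final states.]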

For faster scanning, convert this NFA to a DFA and minimize the states

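Step 1: build a separate automaton for each pattern.
< = Return (TK_relOp, LE)
< > Return (TK_relOp, NE)

Step 2: add a new start state with ϵ-transitions to both automata.
ϵ < = Return (TK_relOp, LE)
ϵ < > Return (TK_relOp, NE)

Step 3: convert to a DFA; the two paths now share the transition on <.
< then = Return (TK_relOp, LE)
< then > Return (TK_relOp, NE)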

Lexical Analyzer

Simple-Lexical Analyzer

/*Global Variables*/

int charClass, lexLen, LETTER=0, DIGIT=1, UNKNOWN=-1;

char lexeme [100], nextChar;

• Convenient utility subprograms:

– getChar - gets the next character of input, puts it in nextChar,
determines its class and puts the class in charClass
– addChar - puts the character from nextChar into the place the
lexeme is being accumulated, lexeme
– lookup - determines whether the string in lexeme is a reserved
word (returns a code)

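/* Append nextChar to lexeme, keeping it null-terminated. */
void addChar() {
    if (lexLen <= 98) { lexeme[lexLen++] = nextChar; lexeme[lexLen] = '\0'; }
    else printf("error - lexeme too long\n");
}
/* Read the next character and classify it (input source assumed: stdin). */
void getChar() {
    nextChar = getchar();
    if (isalpha(nextChar)) charClass = LETTER;
    else if (isdigit(nextChar)) charClass = DIGIT;
    else charClass = UNKNOWN;
}
void getNonblank() { while (isspace(nextChar)) getChar(); }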

int lookup(char *str) {
    int z = 0, lt = strlen(str), key = 0;
    while (z < lt) /* Computing the hash key */
        ...
    /* Copying to Symbol table */
    ...
    state = 100; return 1;
    ...
    return 0; /* Not Keyword */
}
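
As a rough illustration of the keyword check, the sketch below uses a plain linear search over a small table instead of the hash key computed above. The token codes, the table contents and the function name keyword_lookup are assumptions made for the example.

```c
#include <string.h>

enum { TK_ID = 0, TK_IF, TK_THEN, TK_ELSE };   /* assumed token codes */

/* Reserved words and their token codes (illustrative subset). */
static const struct { const char *text; int code; } keywords[] = {
    { "if", TK_IF }, { "then", TK_THEN }, { "else", TK_ELSE },
};

/* Returns the keyword's token code, or TK_ID if str is not a keyword. */
int keyword_lookup(const char *str)
{
    size_t i;
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(str, keywords[i].text) == 0)
            return keywords[i].code;
    return TK_ID;
}
```

The scanner first recognizes the whole lexeme as an identifier and only then consults the table, so elsex stays an identifier while else becomes a keyword.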
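int lex() {
    lexLen = 0;
    getNonblank();
    switch (charClass) {
    case LETTER: /* identifier or reserved word */
        addChar(); getChar();
        while (charClass == LETTER || charClass == DIGIT) {
            addChar(); getChar();
        }
        return lookup(lexeme);
    case DIGIT: /* integer literal */
        addChar(); getChar();
        while (charClass == DIGIT) {
            addChar(); getChar();
        }
        return INT_LIT;
    default: /* any other character: consume it and report it */
        addChar(); getChar();
        return UNKNOWN;
    } /* End of switch */
} /* End of function lex */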

I/O - Key For Successful Lexical Analysis

• Character-at-a-time I/O
• Block / Buffered I/O Tradeoff
• Block/Buffered I/O
– Utilize Block of memory
– Stage data from source to buffer block at a time
– Maintain two blocks - Why ?

[Figure: Block 1 and Block 2 side by side. When the pointer is done with one block, I/O is issued to refill it while the token in the 2nd block is still being processed.]

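Buffered I/O

[Figure: a buffer of n characters (positions 0 to n-1) is filled from the source program "main{ int aa =55,bag=10; ……" and scanned by the forward pointer. Successive buffer contents shown: "main{in", "{int aa=5", "55,bag=1".]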
Buffered I/O

[Figure: two buffers, Buffer 1 and Buffer 2, hold "………main{int aa=55, ba" from the source program "int aa=55,bag=10; ……}". Positions in a buffer run from 0 to n-1, and position n holds the eof marker tested below.]

forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first block then begin
        reload second block;
        forward := forward + 1
    end
    else if forward at end of second block then begin
        reload first block;
        move forward to beginning of first block
    end
    else /* eof within buffer signifying end of input */
        terminate lexical analysis
end
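
The two-block scheme can be sketched in C as follows. This is an illustration under stated assumptions, not the course's implementation: the NUL character is used as the end marker (so the source text must not contain NUL), the block size is kept tiny so that reloads happen often, and the names struct dbuf, dbuf_init and dbuf_next are hypothetical. Unlike the pseudocode, this version reads the character under forward first and advances afterwards.

```c
#include <stdio.h>

#define BLOCK 8                 /* block size; tiny so reloads are frequent */
#define SENTINEL '\0'           /* assumed: source text contains no NUL */

struct dbuf {
    FILE *fp;
    char  buf[2 * BLOCK + 2];   /* block 1, sentinel, block 2, sentinel */
    char *forward;
};

/* Fill one block at `start` and place a sentinel right after the data. */
static void reload(struct dbuf *b, char *start)
{
    size_t got = fread(start, 1, BLOCK, b->fp);
    start[got] = SENTINEL;      /* end of block, or end of input if short */
}

void dbuf_init(struct dbuf *b, FILE *fp)
{
    b->fp = fp;
    reload(b, b->buf);
    b->forward = b->buf;
}

/* Returns the next character, or EOF at end of input. */
int dbuf_next(struct dbuf *b)
{
    for (;;) {
        char c = *b->forward;
        if (c != SENTINEL) {                      /* common case: one test */
            b->forward++;
            return (unsigned char)c;
        }
        if (b->forward == b->buf + BLOCK) {       /* end of first block */
            reload(b, b->buf + BLOCK + 1);
            b->forward = b->buf + BLOCK + 1;
        } else if (b->forward == b->buf + 2 * BLOCK + 1) { /* end of second */
            reload(b, b->buf);
            b->forward = b->buf;
        } else {
            return EOF;         /* sentinel inside a block: end of input */
        }
    }
}
```

The point of the sentinel is that the common path tests each character once; the end-of-block checks run only when the sentinel is actually seen.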

Lexical errors

• If the user omits the space in “real f” and writes “realf”

– No lexical error: a single token IDENTIFIER(“realf”) is
produced instead of the sequence REAL, IDENTIFIER(f)

• One more mistake

whil ( x := 0 ) do
generates no lexical errors: whil is scanned as an identifier

• Typically few lexical error types

– illegal chars
– unterminated comments
– ill-formed constants
– Identifier length exceeds max
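
As an example of how one of these errors is detected, the sketch below skips a comment and reports an error if the input ends before the comment is closed. The slides do not fix a comment syntax, so C-style delimiters are assumed here, and the function name skip_comment is hypothetical.

```c
/* Skips a comment that starts at s[*pos] with the two characters
 * slash-star (assumed syntax). Returns 0 and moves *pos past the
 * closing star-slash on success; returns -1, a lexical error, if
 * the input ends before the comment is closed. */
int skip_comment(const char *s, int *pos)
{
    int i = *pos + 2;                     /* step over the opener */

    while (s[i] != '\0') {
        if (s[i] == '*' && s[i + 1] == '/') {
            *pos = i + 2;                 /* resume after the closer */
            return 0;
        }
        i++;
    }
    return -1;                            /* unterminated comment */
}
```

On failure the position is left unchanged, so the caller can report where the unterminated comment began.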

What is hard about Lexical Analysis?

Poor language design can complicate scanning

• Reserved words are important
– In PL/I there are no reserved keywords, so you can write a valid
statement like:
if then then then = else; else else = then

• Significant blanks
– In Fortran blanks are not significant
do 10 i = 1,25 (do loop)
do 10 i = 1.25 (assignment to variable named do10i)
How does a compiler do this?
• First pass finds & inserts blanks
• Second pass is normal scanner

Grammar:
1) <Program> → <functions>comma<functionbody>comma
2) <functions> → <function><functions> / ε
3) <function> → <funsignature><functionbody>
4) <funsignature> → <type> id ( <params> )
5) <type> → int / float / string
6) <params> → <type> id , <params> / ε
7) <functionbody> → { <declarations> <statements> *<E>; }
8) <declarations> → <type> id ; <declarations> / ε
9) <statements> → id := <E>; / id := id <more>; <statements> / ε
10) <E> → <E>+<E> /
11) <more> → (<args>) ,
12) <args> → id comma <args> / ε

Sample program:
string fun1(int x, string y,){
int a;
string b;
b = b + y;
*y;
}
float fun2(int I, float j,){
int k;
k = I *j+k;
*k;
}
float z; string s= "bits";
int p;
Z=10.5; p=5;
z = fun1(p,s,);
p = fun2(p,z,);

Token Stream

1. TK_Semicolon : the semicolon ;
2. TK_Comma : the comma ,
3. TK_LFBK : the left bracket (
4. TK_RTBK : the right bracket )
5. TK_LFBR : the left brace {
6. TK_RTBR : the right brace }
7. TK_ASSIGN : the assignment operator :=
8. TK_PLUS : the + operator
9. TK_STAR : the multiplication operator *
10. TK_INTLIT : a string of one or more digits. Any unnecessary leading zeros
11. TK_REAL_LIT : a real number which has a decimal point (0.22, 00.15, 3.414)
12. TK_STRINGLIT : a string enclosed in a pair of double quotes, may contain
escaped double quotes within
13. TK_INT : the keyword int
14. TK_FLOAT : the keyword float
15. TK_STRING : the keyword string
16. TK_ID : an identifier name; starts with a letter (a-z, A-Z), can contain
digits, maximum length 10
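
The single-character and two-character tokens in this list can be recognized with one switch. The sketch below covers tokens 1 to 9 only. The enum values, the extra TK_ERROR code and the function name scan_punct are assumptions made for the example.

```c
enum token {
    TK_ERROR,                     /* hypothetical code for "no token here" */
    TK_Semicolon, TK_Comma, TK_LFBK, TK_RTBK, TK_LFBR, TK_RTBR,
    TK_ASSIGN, TK_PLUS, TK_STAR
};

/* Recognizes the operator or punctuation token at s[*pos] and
 * advances *pos past it. Leaves *pos unchanged on TK_ERROR. */
enum token scan_punct(const char *s, int *pos)
{
    enum token t;

    switch (s[*pos]) {
    case ';': t = TK_Semicolon; break;
    case ',': t = TK_Comma;     break;
    case '(': t = TK_LFBK;      break;
    case ')': t = TK_RTBK;      break;
    case '{': t = TK_LFBR;      break;
    case '}': t = TK_RTBR;      break;
    case '+': t = TK_PLUS;      break;
    case '*': t = TK_STAR;      break;
    case ':':                     /* needs one character of look-ahead */
        if (s[*pos + 1] != '=') return TK_ERROR;
        *pos += 2;
        return TK_ASSIGN;
    default:
        return TK_ERROR;
    }
    *pos += 1;
    return t;
}
```

Only := needs look-ahead here; a lone colon is reported as an error because the token list defines no token for it.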

