Compilers: Topic 2: Lexical Analysis

2.1. Introduction
Mick O´Donnell : michael.odonnell@uam.es
Introduction
• The Role of the Lexical Analyser
FRONT END: Source Code → Lexical Analyser → Syntactic Analyser → Semantic Analyser
Lexical Analyser
Introduction
• Also known as tokeniser or scanner.

• In Spanish, called analizador morfológico (morphological analyser).
• Purpose: translation of the source code into a sequence of
symbols.
• The symbols identified by the lexical analyser will be
considered terminal symbols in the grammar used by the
syntactic analyser.
• Example source code:
    begin
    int A;
    A := 100;
    A := A+A;
    output A
    end
• Resulting symbols, one source line per row:
    (reserved-word,begin)
    (type,int)(<id>,A)(<symb>,;)
    (<id>,A)(<mult-symb>,:=)(<cons int>,100)(<symb>,;)
    (<id>,A)(<mult-symb>,:=)(<id>,A)(<symb>,+)(<id>,A)(<symb>,;)
    (reserved-word,output)(<id>,A)
    (reserved-word,end)
• Other tasks:
• Identification of lexical errors,
• e.g., starting an identifier with a digit where the language does not
allow this: 2abc
• Deletion of white-space:
• Usually, the function of white-space is only to separate tokens.
• Exceptions: languages where white-space indicates code blocks, e.g.,
Python:
    if 1 == 2:
        print(1)
    print(2)
• Deletion of comments: not relevant to execution of program.
Drawing the border between symbols
What are Symbols?
• How do we determine what are the symbols of a given language?
• Case:
• Assume we have a language with assignment operator :=
• The ‘assignment statement’ has syntax: A := 1 + 2
STATEMENT → ID ASSIGNOP EXPR ‘;’
• The rule for ASSIGNOP could be:
ASSIGNOP → ‘:=’
…meaning ‘:=’ is a symbol, and thus a unit of lexical analysis.
• However, the rule might have been:

ASSIGNOP → ‘:’ ‘=’
…meaning ‘:’ and ‘=’ are two symbols for lexical analysis
• General Rules:
• A symbol is a sequence of characters that cannot be
separated from each other by white space.
• Symbols can be separated from other symbols by white
space.
• With A:=1 + 2
• ‘:=’ can be separated from ‘A’ and ‘1’
• BUT ‘:’ cannot be separated from ‘=’
• Thus ‘:=’ should be treated as a symbol
Determining the Token set
What token labels to use?

• To determine which token labels we assign to symbols, we first need to
derive the syntactic grammar of the language.
• THEN, we extract out the terminal symbols of this grammar, which become
the token labels in lexical analysis.
• This ensures that the labels assigned in lexical analysis are what we need in
syntactic analysis.
• For example, we might assign the label “reserved_word” to both “begin” and
“end”.
• But it is clear we cannot use such a label in parsing:
Program -> reserved_word Statement* reserved_word
• … would allow “end A=1 begin” as a program.

• Each token label has to reflect the different roles that the token class can
serve in a program.
Identifying the scope of the lexical analysis in the grammar of the language
1 : <program> ::= begin <dcl train> ; <stm train> end
2 : <dcl train> ::= <declaration>
3 : | <declaration> ; <dcl train>
4 : <stm train> ::= <statement>
5 : | <statement> ; <stm train>
6 : <declaration>::= <mode> <idlist>
7 : <mode> ::= bool
8 : | int
9 : | ref <mode>
10 : <idlist> ::= <id>
11 : | <id> , <idlist>
12 : <statement> ::= <asgt stm>
13 : | <cond stm>
14 : | <loop stm>
15 : | <transput stm>
15 : | <case stm>
16 : | call <id>
17 : <asgt stm> ::= <id> := <exp>
18 : <cond stm> ::= if <exp> then <stm train> fi
19 : | if <exp> then <stm train> else <stm train> fi
20 : <loop stm> ::= while <exp> do <stm train> end
21 : | repeat <stm train> until <exp>
22 : <transput stm> ::= input <id>
23 : | output <exp>
24 : <exp> ::= <factor>
25 : | <exp> + <factor>
26 : | <exp> - <factor>
27 : | - <exp>
28 : <factor> ::= <primary>
29 : | <factor> * <primary>
30 : <primary> ::= <id>
31 : | <constant>
32 : | ( <exp> )
33 : | ( <compare> )
34 : <compare> ::= <exp> = <exp>
35 : | <exp> <= <exp>
36 : | <exp> > <exp>
Topic 2
One and Two Pass Lexical Analysis
One Pass Lexical Analyser
• Identifies symbols and immediately assigns a token label to each
symbol:
• Source code:
    begin
    int A;
    A := 100;
    A := A+A;
    print A
    end
• Tokens, one source line per row:
    (begin,begin)
    (type,int) (id,A) (semic,;)
    (id,A) (eqsgn,:=) (int,100) (semic,;)
    (id,A) (eqsgn,:=) (id,A) (symb,+) (id,A) (semic,;)
    (print,print) (id,A)
    (end,end)
Two Pass Lexical Analysis
• In a two-pass lexical analyser:
• First pass groups characters into symbols
    Source code:
        begin
        int A;
        A := 100;
        A := A+A;
        print A
        end
    Symbols:
        “begin” “int” “A” “;”
        “A” “:=” “100” “;”
        “A” “:=” “A” “+” “A” “;”
        “print” “A” “end”
• Second pass assigns token labels to symbols
    Tokens:
        (begin,begin) (type,int) (id,A) (semic,;)
        (id,A) (eqsgn,:=) (int,100) (semic,;)
        (id,A) (eqsgn,:=) (id,A) (symb,+) (id,A) (semic,;)
        (print,print) (id,A) (end,end)
• Most programming languages are designed such that the code can be
segmented into tokens without any knowledge at all of the meaning of the
token.
• Simple rules are adhered to:
• White-space ends a symbol
• Multiple white-space ignored
• identifiers contain only alphanumeric chars or _
• identifiers never start with a number
• a symbol starting with a number IS a number: 1, 34, 10.0
• Some chars are always symbol by themselves: } { ; ( ) ,
• mathematical chars can be solo or followed by “=“
• =, >, <, +, -, /, *
• ==, >=, <=, +=, -=, /=, *=
• The first char of the symbol tells us which group it is in
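The rules above can be sketched as a first pass that splits text into symbols by looking only at the first character of each symbol. This is a minimal illustration, not the course's implementation: the function name `split_symbols` is hypothetical, and letters and digits are tested with Python's `str.isalpha` / `str.isdigit`, which also accept non-ASCII characters.

```python
def split_symbols(text):
    """First pass: group characters into symbols (no token labels yet)."""
    symbols, i, n = [], 0, len(text)
    while i < n:
        c, start = text[i], i
        if c.isspace():                  # white space ends a symbol and is ignored
            i += 1
            continue
        if c.isalpha() or c == "_":      # identifier or reserved word
            while i < n and (text[i].isalnum() or text[i] == "_"):
                i += 1
        elif c.isdigit():                # a symbol starting with a digit IS a number
            while i < n and text[i].isdigit():
                i += 1
            if i + 1 < n and text[i] == "." and text[i + 1].isdigit():
                i += 1
                while i < n and text[i].isdigit():
                    i += 1
        elif c in "{};(),":              # always a symbol by itself
            i += 1
        elif c in "=><+-/*":             # solo, or followed by "="
            i += 1
            if i < n and text[i] == "=":
                i += 1
        else:
            raise ValueError(f"unknown character {c!r} at position {i}")
        symbols.append(text[start:i])
    return symbols


print(split_symbols("x1 = 10.0; y += x1 * 34"))
# ['x1', '=', '10.0', ';', 'y', '+=', 'x1', '*', '34']
```

Note that no branch needs to know what a symbol means; the first character alone selects the rule that finds its end.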

• Identifier rules:
• Java: consists of Unicode letters, digits 0-9, _ and $
Cannot start with 0-9.
• C: consists of a-z A-Z 0-9 _
Cannot start with 0-9.
• Exceptions:
• Lisp:
• Identifier consists of a-z A-Z 0-9 _ + - * / @ $ = < > . etc.
• No restriction on starting char
• If the char sequence can be interpreted as a number, it is a number
• Else it is an ‘identifier’
• E.g., ‘1+’ is an ‘identifier’
‘+1’ is a number
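The Lisp rule ("a number if it can be read as one, otherwise an identifier") can be sketched as follows. The regular expression is an assumed, simplified number syntax (optional sign, digits, optional fraction) chosen for illustration; a real Lisp reader accepts more forms. The name `classify_lisp_atom` is hypothetical.

```python
import re

# Assumed simplified number syntax: optional sign, then digits with an
# optional fraction, or a fraction with no leading digits.
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")


def classify_lisp_atom(text):
    """Return 'number' if the whole atom reads as a number, else 'identifier'."""
    return "number" if NUMBER.fullmatch(text) else "identifier"


print(classify_lisp_atom("+1"))   # number
print(classify_lisp_atom("1+"))   # identifier
```

Here the first character no longer decides the class: the whole atom has to be read before it can be labelled.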
Topic 3
Methods of Lexical Analysis
Lexical Analyser: Using grammars

Approaches to Lexical Analysis
Three main Approaches:

1) Ad-Hoc Coding : code is written to recognise each type of
token.
2) Regular expressions: e.g.,

• float: “[0-9]*\.[0-9]+”
• Id: “[a-zA-Z_][a-zA-Z_0-9]*”
3) Context free grammar, e.g.,

Token :- Id | Int | Literal | …
Id :- Alfa | Alfa Id2
Id2 :- Alfa | Digit | Alfa Id2|Digit Id2
…
Topic 2.1
Ad Hoc Coding of Lexical Analysis:

Recognising Symbols
Lexical Analyser
Two Pass Lexical Analysis with ad-hoc code
• Common approach (1):
• Human writes code to recognise the tokens of the source language:
def tokenise():
    symbolList = []
    while not eof():
        # process next chars until end of symbol
        # add symbol to symbolList
        . . .
    return symbolList
def tokenise():
    symbolList = []
    while not eof():
        case type(nextc):
            'whitespace': ...
            'alpha': ...
            'digit': ...
            etc.
    return symbolList
def type(char):
    # character ranges are tested explicitly; 'alpha' includes '_'
    if 'a' <= char <= 'z' or 'A' <= char <= 'Z' or char == '_': return 'alpha'
    if '0' <= char <= '9': return 'digit'
    if char in ' \t\n': return 'whitespace'
    if char in '{};,': return 'sepchar'
    if char in '><=+-/*': return 'mathchar'
def tokenise():
    symbolList = []
    while not eof():
        case type(nextc):
            'alpha':  # alpha here includes '_'
                symbol = "" + getc()
                while type(nextc) in ['alpha', 'digit']:
                    symbol += getc()
                symbolList.append(symbol)
            'whitespace': getc()
            'digit': ...
            ...
            . . .
            'mathchar':  # = > < + - * /
                symbol = "" + getc()
                if nextc == '=':
                    symbol += getc()
                symbolList.append(symbol)
            'sepchar':  # { } ; ,
                symbolList.append(getc())
            default: print("ERROR: Unknown Char: " + getc())
Numbers:
• Formats: 1, 34, 34.001, .0
• Procedure
1) Read digits until we reach a nondigit
2) If nextchar is “.”, then read digits until we reach a nondigit
'digit':
    symbol = "" + getc()
    while nextc in "0123456789":
        symbol += getc()
    if nextc == ".":
        symbol += getc()
        while nextc in "0123456789":
            symbol += getc()
    symbolList.append(symbol)
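A runnable sketch of the number procedure, under stated assumptions: the slides leave `eof()`, `nextc` and `getc()` undefined, so the small `Reader` class below is a hypothetical stand-in for them, and `read_number` is a hypothetical name for the 'digit' branch.

```python
class Reader:
    """Hypothetical input cursor providing the eof / nextc / getc helpers."""

    def __init__(self, text):
        self.text, self.pos = text, 0

    def eof(self):
        return self.pos >= len(self.text)

    @property
    def nextc(self):
        # Look at the next character without consuming it ("" at end of input).
        return "" if self.eof() else self.text[self.pos]

    def getc(self):
        # Consume and return the next character.
        c = self.text[self.pos]
        self.pos += 1
        return c


DIGITS = "0123456789"


def read_number(r):
    """Read digits, then an optional '.' followed by more digits."""
    symbol = r.getc()
    # The `r.nextc and` guard matters: "" in DIGITS is True in Python.
    while r.nextc and r.nextc in DIGITS:
        symbol += r.getc()
    if r.nextc == ".":
        symbol += r.getc()
        while r.nextc and r.nextc in DIGITS:
            symbol += r.getc()
    return symbol


print(read_number(Reader("34.001;")))   # 34.001
```

As in the procedure on the slide, the caller is expected to enter this branch only when the next character is a digit, so a format such as ".0" would need its own branch.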
Topic 2.2

Assigning Token Labels
Two Pass Lexical Analyser
• Second Stage: assigning token labels to symbols
1. Reserved words matched by comparison (or hash lookup)

If Symbol in RESERVED_WORDS: Token = symbol
2. Use regular expressions for user-supplied symbols:

• Int : “[0-9]+”
• Float : “[0-9]*\.[0-9]+”
• Id : “[a-zA-Z_][a-zA-Z0-9_]*”
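The second stage can be sketched with Python's `re` module. The reserved-word set and the fallback label `symb` are assumptions made for the example; the three patterns are the ones listed above, with the dot escaped so that it matches only a literal ".".

```python
import re

RESERVED_WORDS = {"begin", "end", "if", "then", "else", "while", "do"}  # assumed set

PATTERNS = [
    ("int", re.compile(r"[0-9]+")),
    ("float", re.compile(r"[0-9]*\.[0-9]+")),
    ("id", re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")),
]


def label(symbol):
    """Second pass: assign a token label to an already-delimited symbol."""
    if symbol in RESERVED_WORDS:          # set membership is a hash lookup
        return symbol
    for name, pattern in PATTERNS:
        if pattern.fullmatch(symbol):     # the whole symbol must match
            return name
    return "symb"                         # assumed fallback for punctuation


print([(label(s), s) for s in ["begin", "A", "100", "3.14", ";"]])
# [('begin', 'begin'), ('id', 'A'), ('int', '100'), ('float', '3.14'), ('symb', ';')]
```

Reserved words are checked before the patterns because every reserved word also matches the Id pattern.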
Topic 2.2

Single Pass Approach
The code shown earlier was for recognising symbols.
• Symbols of different types were recognised in different parts.
• We can use this to simplify token labelling

• The specific code used to identify the symbol knows if it is a
number, alphanumerical, mathematical or separator.
• Thus, we can use this code to assign token label as well
Single Pass Lexical Analyser
def tokenise():
    symbolList = []
    while not eof():
        while nextc in WHITE_SPACE_CHARS: getc()
        symbtype = type(nextc)
        case symbtype:
            'alpha': ...
            'digit': ...
            'sepchar': ...
            'mathchar': ...
        symbolList.append([token, symbol])
    return symbolList
def tokenise():
    symbolList = []
    while not eof():
        while nextc in WHITE_SPACE_CHARS: getc()
        symbtype = type(nextc)
        case symbtype:
            'alpha':
                symbol = "" + getc()
                while type(nextc) in ['alpha', 'digit']:
                    symbol += getc()
                if symbol in RESERVED_WORDS:
                    token = symbol
                else:
                    token = 'id'
            ...
        symbolList.append([token, symbol])
    return symbolList
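A self-contained sketch of the single-pass idea: the branch that finds the end of a symbol also knows its class, so it assigns the label at once. The reserved-word set and the label names `sepchar` and `mathop` are assumptions for this example, and the function works on a string rather than on the slides' `getc()` stream.

```python
RESERVED_WORDS = {"begin", "end", "int", "print"}  # assumed set


def tokenise(text):
    """Single pass: delimit each symbol and label it in the same branch."""
    tokens, i, n = [], 0, len(text)
    while i < n:
        c, start = text[i], i
        if c.isspace():
            i += 1
            continue
        if c.isalpha() or c == "_":
            while i < n and (text[i].isalnum() or text[i] == "_"):
                i += 1
            word = text[start:i]
            token = word if word in RESERVED_WORDS else "id"
        elif c.isdigit():
            while i < n and text[i].isdigit():
                i += 1
            token = "int"
            if i < n and text[i] == ".":
                i += 1
                while i < n and text[i].isdigit():
                    i += 1
                token = "float"
        elif c in "{};,()":
            i += 1
            token = "sepchar"
        elif c in "=><+-/*":
            i += 1
            if i < n and text[i] == "=":
                i += 1
            token = "mathop"
        else:
            raise ValueError(f"unknown character {c!r} at position {i}")
        tokens.append((token, text[start:i]))
    return tokens


print(tokenise("int A; A = A+100;"))
# [('int', 'int'), ('id', 'A'), ('sepchar', ';'), ('id', 'A'), ('mathop', '='),
#  ('id', 'A'), ('mathop', '+'), ('int', '100'), ('sepchar', ';')]
```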
Topic 2.3
Lexical Analysis using

Regular Expressions
Lexical Analyser: Using grammars
Lexical analysis using regular expressions and grammars
• The previous section looked at lexical analysis informally, just in

terms of a computer program written by hand to recognise the
tokens of a language.
• The rules of lexical syntax are only represented implicitly in the
code. One has to interpret the code to see that an identifier must
start with an alpha char.
• In earlier days, this was sufficient.
• However, there are problems with this approach:
• Portability: a change in syntax requires editing of the source code (it may
be better to state the lexical structure in an external data file, requiring no
need to edit the source code)
• Difficult to prove that the tokenising code actually conforms to the
specification of the language – does it do as it should?
• This section will explore lexical analysis from a more formal
perspective.
Lexical analysis using regular expressions
• One approach is to describe the tokens in terms of regular

expressions
• A program can read in these regular expressions and generate

code to perform the tokenisation
• FLEX and LEX
Standard Compiler Architecture
• LEX and YACC (Yet Another Compiler Compiler) are often used
to build a compiler quickly
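[Figure: Lex reads the lexical rules and generates My LexAnalyser; Yacc reads the syntactic rules and generates My SynAnalyser. The source code is fed to My LexAnalyser, whose output is passed to My SynAnalyser.]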
• One does not write the lexical analyser directly, just the patterns
to recognise tokens
Flex input for a simple tokeniser
%option noyywrap
DIGIT [0-9]
ID [a-z][a-z0-9]*
%%
{DIGIT}+ { printf("(integer, '%s')", yytext); }
{DIGIT}+"."{DIGIT}* { printf("(float, '%s')", yytext); }
if|then|begin|end|procedure|function { printf("(%s, '%s')", yytext, yytext); }
{ID} { printf("(id, '%s')", yytext); }
"+"|"-"|"*"|"/" { printf("(mathop, '%s')", yytext); }
[ \t\n]+ /* eat up whitespace */
. { printf("Unrecognized character: %s\n", yytext); }
%%
int main(void) { yylex(); return 0; }
Flex processing
• Flex generates C code that proceeds char by char through the input text.
• When the generated scanner is run, it analyses its input looking for strings
which match any of its patterns.
• If it finds more than one match, it takes the one matching the most text.
Thus “23.56” will be recognised as float, not integer:
{DIGIT}+ { printf("(integer, '%s')", yytext); }
{DIGIT}+"."{DIGIT}* { printf("(float, '%s')", yytext); }
• If two patterns match the same amount of text, the rule listed first is chosen.
Thus “if” is recognised as a reserved word, not as an id.
• Once the match is determined, the text corresponding to the match is made
available in the global character pointer yytext.
• The action corresponding to the matched pattern is then executed.
• After recognising a token, input scanning for all patterns restarts from that
point (any partially matched pattern is discarded).
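The matching policy described above (longest match wins, input scanning restarts after each token) can be imitated in a few lines of Python. This is only an illustration of the policy, not of how Flex works internally: Flex compiles all patterns into a single automaton, whereas this sketch simply tries every pattern at the current position. The names `RULES` and `scan` are hypothetical.

```python
import re

# Rules in priority order, mirroring the Flex input shown earlier.
RULES = [
    ("integer", r"[0-9]+"),
    ("float", r"[0-9]+\.[0-9]*"),
    ("keyword", r"if|then|begin|end|procedure|function"),
    ("id", r"[a-z][a-z0-9]*"),
    ("mathop", r"[+\-*/]"),
    ("skip", r"[ \t\n]+"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def scan(text):
    pos, out = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in COMPILED:
            m = pattern.match(text, pos)
            # Strict ">" keeps the earlier rule when two matches tie in length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:                 # plays the role of the "." rule
            out.append(("unrecognized", text[pos]))
            pos += 1
            continue
        if best_name != "skip":
            out.append((best_name, text[pos:pos + best_len]))
        pos += best_len                       # restart scanning after the token
    return out


print(scan("if x1 23.56 + 7"))
# [('keyword', 'if'), ('id', 'x1'), ('float', '23.56'), ('mathop', '+'), ('integer', '7')]
```

With input "ifx" the id rule wins because it matches three characters against the keyword rule's two.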
Topic 2.4
Lexical Analysis using CFGs

(Context Free Grammars)
Using a CFG
Using Grammars for token recognition
• More formally, one can describe the possible tokens of a

language using a context-free grammar
<Token> ::= <Id> | <ReservedWd> | <Number> | ...
<Id> ::= <Letter> | <Letter><IdC>
<IdC> ::= <Letter> | <Digit> | <Letter><IdC> | <Digit><IdC>
• This has the advantage that both the syntactic description of the
language and its lexical description are in the same language.
• Processing of such grammars is however slower than for regular
expressions.
Deterministic Finite Automata
• However, some context free grammars can be automatically

translated into right-regular grammars, where rules are of two
forms: ( ‘a’ is a terminal; ‘A’ is a nonterminal )
• A→a
• A → aB
• A grammar in such a form can then be represented as a

deterministic finite automaton, (DFA) which allows efficient
processing of the input.
• a DFA is a finite state machine where, for each pair of state and
input symbol, there is one and only one transition to a next state.
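A DFA can be stored as a transition table keyed by (state, input class). The sketch below recognises identifiers and unsigned integers; the state names and the helper `char_class` are assumptions made for the example. Pairs missing from the table stand for transitions into an implicit dead state, which keeps the table small while still giving exactly one outcome per state and input.

```python
def char_class(c):
    """Map a character to the input symbol used by the transition table."""
    if c.isalpha():
        return "letter"
    if c.isdigit():
        return "digit"
    return "other"


# (state, input class) -> next state; absent pairs mean "dead state".
TRANSITIONS = {
    ("start", "letter"): "id",
    ("start", "digit"): "int",
    ("id", "letter"): "id",
    ("id", "digit"): "id",
    ("int", "digit"): "int",
}
ACCEPTING = {"id", "int"}


def run_dfa(text):
    """Return the accepting state reached on text, or None if it is rejected."""
    state = "start"
    for c in text:
        state = TRANSITIONS.get((state, char_class(c)))
        if state is None:
            return None
    return state if state in ACCEPTING else None


print(run_dfa("a1"), run_dfa("42"), run_dfa("4a"))   # id int None
```

Each input character costs one table lookup, which is why this representation allows efficient processing.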
Deriving a right-regular grammar
To derive a right-regular grammar from a full context-free grammar:
1. For each rule whose RHS starts with a nonterminal, replace
the nonterminal with its expansion(s)
E.g., starting from:
Number ::= Integer
Integer ::= IntegerSS | - IntegerSS
IntegerSS ::= digit | digit IntegerSS
replacing Integer with its expansions gives:
Number ::= IntegerSS | - IntegerSS
and replacing the leading IntegerSS with its expansions gives:
Number ::= digit | digit IntegerSS | - IntegerSS
2. At the end of replacements, eliminate any rule which cannot

be reached from the START symbol
e.g., assume grammar has start symbol: Token

<Token> ::= <Id> | <ReservedWd> | <Number> | <StringLit> | …
We only preserve nonterminals referenced in this rule or referenced in the
nonterminals it contains, etc.
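Step 2 is a reachability search over the grammar. In the sketch below a grammar is represented (by assumption) as a dict from nonterminal to a list of alternatives, each alternative being a list of symbols; any symbol that is not a key of the dict is treated as a terminal. The function name `reachable_rules` is hypothetical.

```python
def reachable_rules(grammar, start):
    """Keep only the rules for nonterminals reachable from the start symbol."""
    seen, todo = set(), [start]
    while todo:
        nonterminal = todo.pop()
        if nonterminal in seen or nonterminal not in grammar:
            continue
        seen.add(nonterminal)
        for alternative in grammar[nonterminal]:
            # Queue every nonterminal mentioned on the right-hand side.
            todo.extend(sym for sym in alternative if sym in grammar)
    return {nt: alts for nt, alts in grammar.items() if nt in seen}


# The Number grammar after the replacements of step 1:
grammar = {
    "Number": [["digit"], ["digit", "IntegerSS"], ["-", "IntegerSS"]],
    "Integer": [["IntegerSS"], ["-", "IntegerSS"]],
    "IntegerSS": [["digit"], ["digit", "IntegerSS"]],
}
print(sorted(reachable_rules(grammar, "Number")))   # ['IntegerSS', 'Number']
```

Integer disappears because, after the replacements, no rule reachable from Number mentions it.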
A sample CFG grammar for tokens
<Token> ::= <Id> | <ReservedWd> | <Number> | <StringLit> | <CharLit> | <SS> | <MS>
<Id> ::= <Letter> | <Letter> <IdC>
<IdC> ::= <Letter> | <Digit> | <Letter> <IdC> | <Digit> <IdC>
<Number> ::= <Integer> | <Real>
<Integer> ::= <IntegerSS> | - <IntegerSS>
<IntegerSS> ::= <Digit> | <Digit> <IntegerSS>
<Real> ::= <FixedPoint> | <FixedPoint> <Exponent>
<FixedPoint> ::= <Integer> . <IntegerSS> | . <IntegerSS> | <Integer> .
<Exponent> ::= E <Integer>
<StringLit> ::= "" | " <CharSeq> "
<CharSeq> ::= <Character> | <Character> <CharSeq>
<CharLit> ::= ' <Character> '
<SS> ::= + | - | * | / | = | < | > | ( | ) / Simple Symbol/
<MS> ::= =+ | != | <= | >= | ++ | -- / Multiple Symbol /
<Character> ::= <Letter> | <Digit> | <SS> | ! | . | , | b | \' | \" | \n
<Letter> ::= A | B | ... | Z | a | b | ... | z
<Digit> ::= 0 | 1 | ... | 9
Same grammar converted to a RRG (almost)
<Token> ::= <Letter>
| <Letter> <IdC>
| <ReservedWd>
| <Digit>
| <Digit> <IntegerSS>
| - <IntegerSS>
| <Digit>. <IntegerSS>
| <Digit> <IntegerSS> . <IntegerSS>
| - <IntegerSS> . <IntegerSS>
| . <IntegerSS>
| <Digit> . | <Digit> <IntegerSS> .
| - <IntegerSS> .
…
| ""
| " <CharSeq> "
| ' <Character> '
| + | - | * | / | = | < | > | ( | ) | =+ | != | <= | >= | ++ | --
<IdC> ::= <Letter> | <Digit> | <Letter> <IdC> | <Digit> <IdC>
<IntegerSS> ::= <Digit> | <Digit> <IntegerSS>
RRG converted to a DFA
A more complex example
Complex Grammar of ASPLE (lexical and syntactic)
1 : <program> ::= begin <dcl train> ; <stm train> end
2 : <dcl train> ::= <declaration>
3 : | <declaration> ; <dcl train>
4 : <stm train> ::= <statement>
5 : | <statement> ; <stm train>
6 : <declaration>::= <mode> <idlist>
7 : <mode> ::= bool
8 : | int
9 : | ref <mode>
10 : <idlist> ::= <id>
11 : | <id> , <idlist>
12 : <statement> ::= <asgt stm>
13 : | <cond stm>
14 : | <loop stm>
15 : | <transput stm>
15 : | <case stm>
16 : | call <id>
17 : <asgt stm> ::= <id> := <exp>
18 : <cond stm> ::= if <exp> then <stm train> fi
19 : | if <exp> then <stm train> else <stm train> fi
20 : <loop stm> ::= while <exp> do <stm train> end
21 : | repeat <stm train> until <exp>
22 : <transput stm> ::= input <id>
23 : | output <exp>
24 : <exp> ::= <factor>
25 : | <exp> + <factor>
26 : | <exp> - <factor>
27 : | - <exp>
28 : <factor> ::= <primary>
29 : | <factor> * <primary>
30 : <primary> ::= <id>
31 : | <constant>
32 : | ( <exp> )
33 : | ( <compare> )
34 : <compare> ::= <exp> = <exp>
35 : | <exp> <= <exp>
36 : | <exp> > <exp>
37 : <constant> ::= <Boolean constant>
38 : | <int constant>
39 : <Boolean constant> ::= true
40 : | false
41 : <int constant> ::= <number>
42 : <number> ::= <digit>
43 : | <number> <digit>
44 : <id> ::= <letter>
45 : | <letter><rest id>
46 : <rest id> ::= <alphanumeric>
47 : | <alphanumeric><rest id>
46 : <digit> ::= 0 | 1 | ... | 9
47 : <letter> ::= a | b | ... | z | A | B | ... | Z
48 : <case stm> ::= case ( <exp> ) <constant case train> esac
49 : <constant case train> ::= <constant case>
50 : | <constant case> <constant case train>
51 : <constant case> ::= <int constant> : <stm train>
52 : <alphanumeric> ::= <digit>
53 : | <letter>
54 : <procedures> ::= <procedure> <procedures>
55 : |
56 : <procedure> ::= procedure <id> begin <stm train> end
• We create a new Nonterminal to represent the tokens we wish to recognise
<Token> ::= <Id> | <ReservedWd> | <Number> | <StringLit> | <CharLit> | <SS> | <MS>
• We then derive from this a RRG
The Right-regular grammar
<Token> ::= begin | ; | end | bool | int | ref | ,
| call | := | if | then | fi | else
| while | do | repeat | until | input
| output | + | - | * | ( | ) | = | <= | >
| case | esac | : | procedure
| true | false
| 0 | ··· | 9 | 0 <int constant> | ···
| 9 <int constant>
| A | ··· | Z | a | ··· | z
| A <rest id> | ··· | Z <rest id>
| a <rest id> | ··· | z <rest id>
<int constant> ::= 0 | ··· | 9 | 0 <int constant> | ···

| 9 <int constant>
<rest id> ::= A | ··· | Z | a | ··· | z | 0 | ··· | 9
| 0 <rest id> | ··· | 9 <rest id>
| A <rest id> | ··· | Z <rest id>
| a <rest id> | ··· | z <rest id>
Graph associated to the grammar
[Figure: transition graph for the right-regular grammar above, with states for <rest id> and <int constant>. Edges are labelled with the reserved words and symbols, the letters A,...,Z, a,...,z and the digits 0,...,9.]
Other patterns
• ASPLE just provides a few data types, but there are others which are also very
common:
• Real numbers, e.g. 3.45, .44, -5., 3.45E2, .44E-2, -5.E123
<real>::=<fixed point>
| <fixed point><exponent>
<integer>::=<int constant>
|-<int constant>
<fixed point>::=<integer>.<int constant>
| .<int constant>
| <integer>.
<exponent>::=E<integer>
• Character and strings, e.g. “”, “hello world”, ‘a’,...,‘z’
<literal>::=“” | “<string>”
<character>::=‘<symbol>’
<string> ::= <symbol> | <symbol><string>
Topic 3
Semantic Actions in
Lexical Analysis
Lexical analyser: Semantic actions
Previous concepts
• The compiler can delegate some semantic tasks to the lexical

analyser:
• Storing the information about the identifiers in the symbols
table.
• Calculating the numeric values (in binary code) for each
numeric constant.
• etc.
• These tasks vary according to:

• The objectives of the translators / interpreters.
• The division of tasks between the different components in
the translator / interpreter.

Semantic actions
• These actions are sometimes expressed by inserting actions between the
symbols in the rules. For instance:
<id>::= actionf0 <letter> actionf1 actionf2 <rest id> actionf3
<rest id>::= <letter> actionf1 actionf2 <rest id>
| <digit> actionf1 actionf2 <rest id>
| λ
where
• actionf0 might be: initialise a counter
• actionf1 add 1 to the counter
• actionf2 copy the character which has just been recognised inside
a buffer.
• actionf3 add to the buffer an end-of-string mark. Check that the
number of characters is not higher than the maximum length
allowed. If this happens, notify the error. Otherwise, insert the
identifier in the symbols table and return a pointer to the element
inside the table.
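The four actions can be sketched in Python as follows. The maximum length, the dict used as symbol table and the function name `recognise_id` are assumptions for the example; joining the buffer into a string plays the role of adding the end-of-string mark, and the returned table entry stands in for the pointer.

```python
MAX_ID_LENGTH = 32     # assumed limit; the slides do not give a value
symbol_table = {}      # assumed representation: identifier name -> entry


def recognise_id(text):
    """Recognise an identifier at the start of text and return its table entry."""
    count = 0                         # actionf0: initialise the counter
    buffer = []
    for c in text:
        # <id> starts with a letter; <rest id> allows letters and digits.
        if not (c.isalpha() or (count > 0 and c.isdigit())):
            break
        count += 1                    # actionf1: add 1 to the counter
        buffer.append(c)              # actionf2: copy the character to the buffer
    # actionf3: close the buffer, check the length, insert in the symbol table
    name = "".join(buffer)
    if count == 0:
        raise ValueError("no identifier at start of input")
    if count > MAX_ID_LENGTH:
        raise ValueError(f"identifier longer than {MAX_ID_LENGTH} characters")
    return symbol_table.setdefault(name, {"name": name})


entry = recognise_id("abc1+2")
print(entry["name"])   # abc1
```

Calling `recognise_id` again on the same identifier returns the same entry, so later phases can attach type or address information to it.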
• Another example:
<integer>::= actiong0 <int constant>
| actiong0 -<int constant> actiong2
<int constant>::= <digit> actiong1
| <digit> actiong1 <int constant>
• where
• actiong0 can be the initialisation of an integer variable
value←0
• actiong1 performs the calculation of the value of the number which
has been read until now
value←(10*value) + value(digit)
• actiong2 changes the sign of the value calculated:
value← -value
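These actions amount to evaluating the number digit by digit while it is being recognised. A minimal sketch, assuming the input is a complete <integer> string; the function name `integer_value` is hypothetical.

```python
def integer_value(text):
    """Compute the value of an <integer> (optional leading '-') from its digits."""
    value = 0                                        # actiong0: value <- 0
    negative = text.startswith("-")
    digits = text[1:] if negative else text
    if not digits or any(d not in "0123456789" for d in digits):
        raise ValueError(f"not an integer: {text!r}")
    for d in digits:
        value = 10 * value + (ord(d) - ord("0"))     # actiong1
    if negative:
        value = -value                               # actiong2: change the sign
    return value


print(integer_value("305"), integer_value("-305"))   # 305 -305
```

For "305" the running value goes 0, 3, 30, 305, one step per digit read.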
Lexical Analyser: Summary

Summary
Three main Approaches:

1) Ad-Hoc Coding
2) Regular expressions: e.g.,

• float: “[0-9]*\.[0-9]+”
• Id: “[a-zA-Z_][a-zA-Z_0-9]*”
• Id: “[a-zA-Z_][a-zA-Z_0-9]*”
3) Context free grammar, converted to RRG and DFA

Token :- Id | Int | Literal | …
Id :- Alfa | Alfa Id2
Id2 :- Alfa | Digit | Alfa Id2 | Digit Id2
…