Compilers: Topic 2: Lexical Analysis
2.1. Introduction
Lexical Analyser
• Other tasks:
• Identification of lexical errors,
• e.g., starting an identifier with a digit where the language does not
allow this: 2abc
• Deletion of white-space:
• Usually, the function of white-space is only to separate tokens.
• Exceptions: languages where white-space indicates a code block, e.g.,
Python:
if 1 == 2:
    print(1)
print(2)
Drawing the border between symbols
What are Symbols?
• Case:
• Assume we have a language with assignment operator :=
• The ‘assignment statement’ has syntax: A := 1 + 2
STATEMENT → ID ASSIGNOP EXPR ‘;’
• The rule for ASSIGNOP could be:
ASSIGNOP → ‘:=’
…meaning ‘:=’ is a symbol, and thus a unit of lexical analysis.
• General Rules:
• A symbol is a sequence of characters that cannot be
separated from each other by white space.
• Symbols can be separated from other symbols by white
space.
• With A:=1 + 2
• ‘:=’ can be separated from ‘A’ and ‘1’
• BUT ‘:’ cannot be separated from ‘=’
• Thus ‘:=’ should be treated as a symbol
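The rule can be checked with a small sketch. The pattern and the function name `symbols` below are assumptions made for illustration only, not part of any language definition: the pattern tries ':=' before any single character, so the two characters are kept together whether or not white space surrounds the operator.

```python
import re

# Hypothetical symbol pattern: ':=' is tried before single characters,
# so it is never split into ':' and '='.
SYMBOL = re.compile(r":=|[A-Za-z_]\w*|\d+|\S")

def symbols(text):
    """Return the symbols found in text; white space only separates them."""
    return SYMBOL.findall(text)

print(symbols("A:=1 + 2"))   # ['A', ':=', '1', '+', '2']
print(symbols("A := 1+2"))   # ['A', ':=', '1', '+', '2']
print(symbols("A : = 1"))    # ':' and '=' come out as two separate symbols
```

The last call shows the other half of the rule: once white space is placed between ':' and '=', they are no longer one symbol.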
Determining the Token set
Identifying the scope of the lexical analysis in the grammar of the language
1 : <program> ::= begin <dcl train> ; <stm train> end
2 : <dcl train> ::= <declaration>
3 : | <declaration> ; <dcl train>
4 : <stm train> ::= <statement>
5 : | <statement> ; <stm train>
6 : <declaration>::= <mode> <idlist>
7 : <mode> ::= bool
8 : | int
9 : | ref <mode>
10 : <idlist> ::= <id>
11 : | <id> , <idlist>
12 : <statement> ::= <asgt stm>
13 : | <cond stm>
14 : | <loop stm>
15 : | <transput stm>
15 : | <case stm>
16 : | call <id>
17 : <asgt stm> ::= <id> := <exp>
18 : <cond stm> ::= if <exp> then <stm train> fi
19 : | if <exp> then <stm train> else <stm train> fi
20 : <loop stm> ::= while <exp> do <stm train> end
21 : | repeat <stm train> until <exp>
22 : <transput stm> ::= input <id>
23 : | output <exp>
24 : <exp> ::= <factor>
25 : | <exp> + <factor>
26 : | <exp> - <factor>
27 : | - <exp>
28 : <factor> ::= <primary>
29 : | <factor> * <primary>
30 : <primary> ::= <id>
31 : | <constant>
32 : | ( <exp> )
33 : | ( <compare> )
34 : <compare> ::= <exp> = <exp>
35 : | <exp> <= <exp>
36 : | <exp> > <exp>
One Pass Lexical Analyser
Two Pass Lexical Analysis
• In a two-pass lexical analyser:
• First pass groups characters into symbols
Source code:
begin
int A;
A := 100;
A := A+A;
print A
end
Symbols:
“begin” “int” “A” “;” “A” “:=” “100” “;” “A” “:=” “A” “+” “A” “;” “print” “A” “end”
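A minimal sketch of such a first pass is given below. The function name `first_pass` and the decision to group an operator character with a following '=' (which covers ':=') are assumptions made for this example; decimal numbers such as 10.0 are deliberately not handled here.

```python
def first_pass(text):
    """Group the characters of text into symbols (no token types yet)."""
    symbols = []
    i = 0
    while i < len(text):
        c = text[i]
        if c.isspace():              # white space only separates symbols
            i += 1
            continue
        start = i
        if c.isalnum() or c == "_":  # identifier, reserved word or integer
            while i < len(text) and (text[i].isalnum() or text[i] == "_"):
                i += 1
        elif c in ":=<>+-*/" and text[i + 1:i + 2] == "=":
            i += 2                   # two-character symbol such as ':=' or '<='
        else:
            i += 1                   # any other character is a symbol on its own
        symbols.append(text[start:i])
    return symbols

program = """begin
int A;
A := 100;
A := A+A;
print A
end"""
print(first_pass(program))
```

Run on the example program, the sketch returns the same seventeen symbols listed above.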
• Most programming languages are designed such that the code can be
segmented into tokens without any knowledge at all of the meaning of the
token.
• Simple rules are adhered to:
• White-space ends a symbol
• Multiple white-space characters are ignored
• Identifiers contain only alphanumeric chars or _
• Identifiers never start with a digit
• A symbol starting with a digit IS a number: 1, 34, 10.0
• Some chars are always a symbol by themselves: } { ; ( ) ,
• Mathematical chars can be solo or followed by “=”
• =, >, <, +, -, /, *
• ==, >=, <=, +=, -=, /=, *=
• Identifier rules:
• Java: Consists of Unicode letters, _, $, 0-9
Cannot start with 0-9.
• C: Consists of a-z A-Z 0-9 _
Cannot start with 0-9 (a leading _ is allowed).
• Exceptions:
• Lisp:
• Identifier consists of a-z A-Z 0-9 _ + - * / @ $ = < > . etc.
• No restriction on starting char
• If char sequence can be interpreted as a number, it is
• Else it is an ‘identifier’
• E.g., ‘1+’ is an ‘identifier’
‘+1’ is a number
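The Lisp-style rule ("if it can be read as a number it is one, otherwise it is an identifier") can be sketched as follows. The number syntax accepted by `NUMBER` is an assumption chosen for the example; a real Lisp reader accepts more forms (ratios, exponents and so on).

```python
import re

# Assumed number syntax: optional sign, then digits with an optional
# fractional part, or a leading '.' followed by digits.
NUMBER = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")

def classify(symbol):
    """Return 'number' if the whole symbol reads as a number, else 'identifier'."""
    return "number" if NUMBER.fullmatch(symbol) else "identifier"

for s in ["1+", "+1", "+", "10.0", "a-b"]:
    print(s, classify(s))
```

Only '+1' and '10.0' are classified as numbers; '1+', '+' and 'a-b' fall through to 'identifier'.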
Topic 2.1
Two Pass Lexical Analysis with ad-hoc code
• Common approach (1):
• Human writes code to recognise the tokens of the source language:
def tokenise():
    symbolList = []
    while not eof():
        case type(nextc):
            'whitespace': ...
            'alpha': ...
            'digit': ...
            etc.
    return symbolList
def tokenise():
    symbolList = []
    while not eof():
        case type(nextc):
            'whitespace': getc()
            'digit': ...
            ...
            . . .
            'sepchar': // { } ; ,
                symbol = "" + getc()
                symbolList.append(symbol)
Numbers:
• Formats: 1, 34, 34.001, .0
• Procedure
1) Read digits until we reach a nondigit
2) If nextchar is “.” then read digits until we reach a nondigit
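The two-step procedure can be sketched as a small function. The name `read_number` and the `(lexeme, next_pos)` return convention are assumptions for this example. Because step 1 may read zero digits, the format .0 is covered as well.

```python
DIGITS = "0123456789"

def read_number(text, pos=0):
    """Read a number starting at text[pos]; return (lexeme, next_pos)."""
    start = pos
    # 1) Read digits until we reach a non-digit.
    while pos < len(text) and text[pos] in DIGITS:
        pos += 1
    # 2) If the next char is '.', consume it and read the fractional digits.
    if pos < len(text) and text[pos] == ".":
        pos += 1
        while pos < len(text) and text[pos] in DIGITS:
            pos += 1
    return text[start:pos], pos

print(read_number("34.001;"))  # ('34.001', 6)
print(read_number(".0"))       # ('.0', 2)
print(read_number("34+1"))     # ('34', 2)
```

The caller is assumed to invoke the function only when the next character is a digit or '.'; the sketch does not reject a lone '.'.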
Topic 2.2
Two Pass Lexical Analyser
Single Pass Lexical Analyser
def tokenise():
    symbolList = []
    while not eof():
        while nextc in WHITE_SPACE_CHARS: getc()
        symbtype = type(nextc)
        symbol = "" + getc()
        case symbtype:
            'alpha': ...
            'digit': ...
            'sepchar': ...
            'mathchar': ...
        symbolList.append( [token,symbol] )
    return symbolList
def tokenise():
    symbolList = []
    while not eof():
        while nextc in WHITE_SPACE_CHARS: getc()
        symbtype = type(nextc)
        symbol = "" + getc()
        case symbtype:
            'alpha': // alpha includes here '_'
                while type(nextc) in ['alpha', 'digit']:
                    symbol += getc()
                if symbol in RESERVED_WORDS:
                    token = symbol
                else:
                    token = 'id'
            ...
        symbolList.append( [token,symbol])
    return symbolList
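The pseudocode above can be turned into runnable Python along the following lines. This is only a sketch under stated assumptions: the input is passed as a string instead of being read through `getc()`, the contents of `RESERVED_WORDS` and `MATH_CHARS` are sample values, the token name 'number' is invented for the example, and only integer numbers are handled.

```python
RESERVED_WORDS = {"begin", "end", "int", "if", "then", "while", "do"}  # sample set
MATH_CHARS = "=<>+-*/:"   # assumed: chars that may be followed by '='

def tokenise(text):
    """Single pass: return a list of [token, symbol] pairs."""
    symbol_list = []
    pos, n = 0, len(text)
    while pos < n:
        if text[pos].isspace():          # skip white space between symbols
            pos += 1
            continue
        start = pos
        c = text[pos]
        pos += 1
        if c.isalpha() or c == "_":      # identifier or reserved word
            while pos < n and (text[pos].isalnum() or text[pos] == "_"):
                pos += 1
            symbol = text[start:pos]
            token = symbol if symbol in RESERVED_WORDS else "id"
        elif c.isdigit():                # integer number
            while pos < n and text[pos].isdigit():
                pos += 1
            symbol = text[start:pos]
            token = "number"
        elif c in MATH_CHARS and pos < n and text[pos] == "=":
            pos += 1                     # two-character operator, e.g. ':='
            symbol = text[start:pos]
            token = symbol
        else:                            # separator or single operator char
            symbol = c
            token = c
        symbol_list.append([token, symbol])
    return symbol_list

print(tokenise("begin int A; A := 100 end"))
```

Deciding the token while the symbol is being read is what makes this a single pass: the character class of the first character already determines which branch, and hence which token, applies.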
Topic 2.3
Lexical Analyser: Using grammars
Lexical analysis using regular expressions and grammars
Lexical analysis using regular expressions
Standard Compiler Architecture
• LEX and YACC (Yet Another Compiler Compiler) are often used
to build a compiler quickly
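[Figure: the lexical rules are fed to Lex, which generates the lexical analyser; the syntactic rules are fed to Yacc, which generates the syntactic analyser; the source code is the input to the generated lexical analyser.]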
• One does not write the lexical analyser directly, just the patterns
to recognise tokens
Flex input for a simple tokeniser
DIGIT [0-9]
ID [a-z][a-z0-9]*
%%
if|then|begin|end|procedure|function { printf("(%s,: '%s')", yytext, yytext); }
Flex processing
• Flex generates C code that proceeds char by char through the input text.
• When the generated scanner is run, it analyses its input looking for strings
which match any of its patterns.
• If it finds more than one match, it takes the one matching the most text.
Thus “23.56” will be recognised as a float, not an integer.
• Once the match is determined, the text corresponding to the match is made
available in the global character pointer yytext.
• The action corresponding to the matched pattern is then executed.
• After recognising a token, input scanning for all patterns restarts from that
point (any partially matched pattern is discarded).
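The longest-match rule can be illustrated outside Flex with a short Python sketch. The rule set `RULES` and the function `next_token` are hypothetical and only mimic the behaviour described above; they are not generated by Flex. Ties are broken in favour of the rule listed first, which is also how Flex resolves equal-length matches.

```python
import re

# Hypothetical rules, in priority order (earlier rules win ties).
RULES = [
    ("keyword", re.compile(r"if|then|begin|end")),
    ("float",   re.compile(r"[0-9]+\.[0-9]+")),
    ("integer", re.compile(r"[0-9]+")),
    ("id",      re.compile(r"[a-z][a-z0-9]*")),
]

def next_token(text, pos):
    """Return (rule name, matched text, next position) for the longest match."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # Strict '>' keeps the earlier rule when two matches have equal length.
        if m and (best is None or m.end() > best[1]):
            best = (name, m.end())
    if best is None:
        raise ValueError(f"no rule matches at position {pos}")
    name, end = best
    return name, text[pos:end], end

print(next_token("23.56", 0))  # ('float', '23.56', 5)
print(next_token("if", 0))     # ('keyword', 'if', 2)
print(next_token("ifx", 0))    # ('id', 'ifx', 3)
```

Scanning then restarts at the returned position, which corresponds to the last bullet above.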
Topic 2.4
Using a CFG
Deterministic Finite Automata
• A DFA is a finite state machine where, for each pair of state and
input symbol, there is one and only one transition to a next state.
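A DFA is commonly stored as a transition table. The sketch below is an illustration only: the automaton (unsigned integers and simple reals such as 34 or 34.001), the state names and the helper `char_class` are all assumptions. Pairs missing from the table stand for the single transition into an implicit rejecting state.

```python
# (state, input class) -> next state; each pair has at most one entry,
# and a missing pair means "go to the implicit dead state".
TRANSITIONS = {
    ("start", "digit"): "int",
    ("int", "digit"): "int",
    ("int", "dot"): "dot",
    ("dot", "digit"): "frac",
    ("frac", "digit"): "frac",
}
ACCEPTING = {"int", "frac"}

def char_class(c):
    """Map a character to the input symbol used in the table."""
    if c in "0123456789":
        return "digit"
    if c == ".":
        return "dot"
    return "other"

def accepts(text):
    """Run the DFA over text and report whether it ends in an accepting state."""
    state = "start"
    for c in text:
        state = TRANSITIONS.get((state, char_class(c)))
        if state is None:          # dead state: no way back to acceptance
            return False
    return state in ACCEPTING

for s in ["34", "34.001", "34.", "3a"]:
    print(s, accepts(s))
```

Because the next state is fully determined by the current state and input symbol, the loop never needs to backtrack.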
Deriving a right-regular grammar
To derive a right-regular grammar from a full context-free grammar:
1. For each rule whose RHS starts with a nonterminal, replace
the nonterminal with its expansion(s)
E.g.
Number ::= Integer
Integer ::= IntegerSS | - IntegerSS
IntegerSS ::= digit | digit IntegerSS
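Step 1 can be mechanised as in the sketch below, applied to the example grammar. The dictionary representation and the function name `expand_leading` are assumptions made for illustration, and the loop would not terminate on a left-recursive grammar.

```python
# Each nonterminal maps to its alternatives; 'digit' and '-' are terminals.
GRAMMAR = {
    "Number": [["Integer"]],
    "Integer": [["IntegerSS"], ["-", "IntegerSS"]],
    "IntegerSS": [["digit"], ["digit", "IntegerSS"]],
}

def expand_leading(grammar):
    """Repeat step 1 until no right-hand side starts with a nonterminal."""
    result = {head: [list(rhs) for rhs in bodies]
              for head, bodies in grammar.items()}
    changed = True
    while changed:
        changed = False
        for head, bodies in result.items():
            new_bodies = []
            for rhs in bodies:
                if rhs and rhs[0] in result:
                    # Replace the leading nonterminal with each expansion.
                    for expansion in result[rhs[0]]:
                        new_bodies.append(expansion + rhs[1:])
                    changed = True
                else:
                    new_bodies.append(rhs)
            result[head] = new_bodies
    return result

for head, bodies in expand_leading(GRAMMAR).items():
    print(head, "::=", " | ".join(" ".join(rhs) for rhs in bodies))
```

After the expansion, every alternative of Number starts with a terminal: digit, digit IntegerSS, or - IntegerSS.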
A sample CFG for tokens
<Token> ::= <Id> | <ReservedWd> | <Number> | <StringLit> | <CharLit> | <SS> | <MS>
<Id> ::= <Letter> | <Letter> <IdC>
<IdC> ::= <Letter> | <Digit> | <Letter> <IdC> | <Digit> <IdC>
<Number> ::= <Integer> | <Real>
<Integer> ::= <IntegerSS> | - <IntegerSS>
<IntegerSS> ::= <Digit> | <Digit> <IntegerSS>
<Real> ::= <FixedPoint> | <FixedPoint> <Exponent>
<FixedPoint> ::= <Integer> . <IntegerSS> | . <IntegerSS> | <Integer> .
<Exponent> ::= E <Integer>
<StringLit> ::= "" | " <CharSeq> "
<CharSeq> ::= <Character> | <Character> <CharSeq>
<CharLit> ::= ' <Character> '
<SS> ::= + | - | * | / | = | < | > | ( | ) / Simple Symbol/
<MS> ::= =+ | != | <= | >= | ++ | -- / Multiple Symbol /
<Character> ::= <Letter> | <Digit> | <SS> | ! | . | , | b | \' | \" | \n
<Letter> ::= A | B | ... | Z | a | b | ... | z
<Digit> ::= 0 | 1 | ... | 9
Same grammar converted to a RRG (almost)
<Token> ::= <Letter>
| <Letter> <IdC>
| <ReservedWd>
| <Digit>
| <Digit> <IntegerSS>
| - <IntegerSS>
| <Digit> . <IntegerSS>
| <Digit> <IntegerSS> . <IntegerSS>
| - <IntegerSS> . <IntegerSS>
| . <IntegerSS>
| <Digit> . | <Digit> <IntegerSS> .
| - <IntegerSS> .
…
| ""
| " <CharSeq> "
| ' <Character> '
| + | - | * | / | = | < | > | ( | ) | =+ | != | <= | >= | ++ | --
<IdC> ::= <Letter> | <Digit> | <Letter> <IdC> | <Digit> <IdC>
<IntegerSS> ::= <Digit> | <Digit> <IntegerSS>
RRG converted to a DFA
Complex Grammar of ASPLE (lexical and syntactic)
1 : <program> ::= begin <dcl train> ; <stm train> end
2 : <dcl train> ::= <declaration>
3 : | <declaration> ; <dcl train>
4 : <stm train> ::= <statement>
5 : | <statement> ; <stm train>
6 : <declaration>::= <mode> <idlist>
7 : <mode> ::= bool
8 : | int
9 : | ref <mode>
10 : <idlist> ::= <id>
11 : | <id> , <idlist>
12 : <statement> ::= <asgt stm>
13 : | <cond stm>
14 : | <loop stm>
15 : | <transput stm>
15 : | <case stm>
16 : | call <id>
17 : <asgt stm> ::= <id> := <exp>
18 : <cond stm> ::= if <exp> then <stm train> fi
19 : | if <exp> then <stm train> else <stm train> fi
20 : <loop stm> ::= while <exp> do <stm train> end
21 : | repeat <stm train> until <exp>
22 : <transput stm> ::= input <id>
23 : | output <exp>
24 : <exp> ::= <factor>
25 : | <exp> + <factor>
26 : | <exp> - <factor>
27 : | - <exp>
28 : <factor> ::= <primary>
29 : | <factor> * <primary>
30 : <primary> ::= <id>
31 : | <constant>
32 : | ( <exp> )
33 : | ( <compare> )
34 : <compare> ::= <exp> = <exp>
35 : | <exp> <= <exp>
36 : | <exp> > <exp>
37 : <constant> ::= <Boolean constant>
38 : | <int constant>
39 : <Boolean constant> ::= true
40 : | false
41 : <int constant> ::= <number>
42 : <number> ::= <digit>
43 : | <number> <digit>
44 : <id> ::= <letter>
45 : | <letter><rest id>
46 : <rest id> ::= <alphanumeric>
47 : | <alphanumeric><rest id>
46 : <digit> ::= 0 | 1 | ... | 9
47 : <letter> ::= a | b | ... | z | A | B | ... | Z
48 : <case stm> ::= case ( <exp> ) <constant case train> esac
49 : <constant case train> ::= <constant case>
50 : | <constant case> <constant case train>
51 : <constant case> ::= <int constant> : <stm train>
52 : <alphanumeric> ::= <digit>
53 : | <letter>
54 : <procedures> ::= <procedure> <procedures>
55 : |
56 : <procedure> ::= procedure <id> begin <stm train> end
The Right-regular grammar
<Token> ::= begin | ; | end | bool | int | ref | ,
| call | := | if | then | fi | else
| while | do | repeat | until | input
| output | + | - | * | ( | ) | = | <= | >
| case | esac | : | procedure
| true | false
| 0 | ··· | 9 | 0 <int constant> | ···
| 9 <int constant>
| A | ··· | Z | a | ··· | z
| A <rest id> | ··· | Z <rest id>
| a <rest id> | ··· | z <rest id>
Graph associated to the grammar
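[Figure: transition graph of the right-regular grammar. Surviving labels: the states “rest id” and “int constant”; edges labelled A,...,Z, a,...,z, 0,...,9 around “rest id”; edges labelled 0,...,9 around “int constant”; an edge labelled λ; and an edge carrying every complete token: begin, end, bool, int, ref, call, if, then, fi, else, while, do, repeat, until, input, output, case, esac, procedure, true, false, 0,...,9, A,...,Z, a,...,z, ; , + - * ( ) = := <= > :]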
Other patterns
• ASPLE provides only a few data types, but others are also very
common:
• Real numbers, e.g. 3.45, .44, –5., 3.45E2, .44E-2, –5.E123
<real>::=<fixed point>
| <fixed point><exponent>
<integer>::=<int constant>
|-<int constant>
<fixed point>::=<integer>.<int constant>
| .<int constant>
| <integer>.
<exponent>::=E<integer>
• Characters and strings, e.g. “”, “hello world”, ‘a’,...,‘z’
<literal>::=“” | “<string>”
<character>::=‘<symbol>’
<string> ::= <symbol> | <symbol><string>
Topic 3: Semantic Actions in Lexical Analysis
Lexical analyser: Semantic actions
Previous concepts
Semantic actions
• Another example:
<integer>::= actiong0 <int constant>
| actiong0 -<int constant> actiong2
<int constant>::= <digit> actiong1
| <digit> actiong1 <int constant>
• where
• actiong0 can be the initialisation of an integer variable
value←0
• actiong1 performs the calculation of the value of the number which
has been read until now
value←(10*value) + value(digit)
• actiong2 changes the sign of the value calculated:
value← -value
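The effect of the three actions can be traced with a short sketch. The function `integer_value` is hypothetical: it walks over the characters of an already recognised <integer> and applies actiong0, actiong1 and actiong2 at the points where the grammar places them.

```python
DIGITS = "0123456789"

def integer_value(text):
    """Compute the value of an <integer> lexeme using the three actions."""
    value = 0                                   # actiong0: value <- 0
    negative = text.startswith("-")
    digits = text[1:] if negative else text
    if not digits or any(d not in DIGITS for d in digits):
        raise ValueError(f"not an <integer>: {text!r}")
    for d in digits:
        # actiong1: value <- (10 * value) + value(digit)
        value = 10 * value + (ord(d) - ord("0"))
    if negative:
        value = -value                          # actiong2: value <- -value
    return value

print(integer_value("123"))   # 123
print(integer_value("-45"))   # -45
```

For "123" the intermediate values after each actiong1 are 1, 12 and 123, which is the "value read until now" described above.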