Compilers: Topic 2: Lexical Analysis


Compilers

Topic 2: Lexical Analysis

2.1. Introduction

Mick O'Donnell: michael.odonnell@uam.es

Introduction

• The Role of the Lexical Analyser

FRONT END

Source Code → Lexical Analyser → Syntactic Analyser → Semantic Analyser
Lexical Analyser
Introduction

• Also known as tokeniser or scanner.


• In Spanish, called analizador morfológico.
• Purpose: translation of the source code into a sequence of symbols.
• The symbols identified by the lexical analyser will be considered
  terminal symbols in the grammar used by the syntactic analyser.

Source code:
    begin
    int A;
    A := 100;
    A := A+A;
    output A
    end

Token sequence:
    (reserved-word,begin) (type,int) (<id>,A) (<symb>,;)
    (<id>,A) (<mult-symb>,:=) (<cons int>,100) (<symb>,;)
    (<id>,A) (<mult-symb>,:=) (<id>,A) (<symb>,+) (<id>,A) (<symb>,;)
    (reserved-word,output) (<id>,A) (reserved-word,end)

Lexical Analyser
Introduction
• Other tasks:
• Identification of lexical errors,
• e.g., starting an identifier with a digit where the language does not
allow this: 2abc
• Deletion of white-space:
• Usually, the function of white-space is only to separate tokens.
• Exceptions: languages where whitespace indicates code block, e.g.,
  Python:

    if 1 == 2:
        print 1
    print 2

• Deletion of comments: not relevant to execution of program.

Lexical Analyser
Drawing the border between symbols
What are Symbols?

• How do we determine the symbols of a given language?

• Case:
• Assume we have a language with assignment operator :=
• The ‘assignment statement’ has syntax: A := 1 + 2
STATEMENT → ID ASSIGNOP EXPR ‘;’
• The rule for ASSIGNOP could be:
ASSIGNOP → ‘:=’
…meaning ‘:=’ is a symbol, and thus a unit of lexical analysis.

• However, the rule might have been:

  ASSIGNOP → ':' '='
  …meaning ':' and '=' are two symbols for lexical analysis.

Lexical Analyser
Drawing the border between symbols

What are Symbols?

• General Rules:
  • A symbol is a sequence of characters that cannot be separated
    from each other by white space.
  • Symbols can be separated from other symbols by white space.

• With A := 1 + 2
  • ':=' can be separated from 'A' and '1'
  • BUT ':' cannot be separated from '='
  • Thus ':=' should be treated as a symbol

Lexical Analyser
Determining the Token set

What token labels to use?


• To determine which token labels we assign to symbols, we first need to
derive the syntactic grammar of the language.
• THEN, we extract the terminal symbols of this grammar, which become
the token labels in lexical analysis.
• This ensures that the labels assigned in lexical analysis are what we need in
syntactic analysis.
• For example, we might assign the label “reserved_word” to both “begin” and
“end”.
• But it is clear we cannot use such a label in parsing:
Program -> reserved_word Statement* reserved_word

• … would allow “end A=1 begin” as a program.


• Each token label has to reflect the different roles that the token class can
serve in a program.

Lexical Analyser
Identifying the scope of the lexical analysis in the grammar of the language
1 : <program> ::= begin <dcl train> ; <stm train> end
2 : <dcl train> ::= <declaration>
3 : | <declaration> ; <dcl train>
4 : <stm train> ::= <statement>
5 : | <statement> ; <stm train>
6 : <declaration>::= <mode> <idlist>
7 : <mode> ::= bool
8 : | int
9 : | ref <mode>
10 : <idlist> ::= <id>
11 : | <id> , <idlist>
12 : <statement> ::= <asgt stm>
13 : | <cond stm>
14 : | <loop stm>
15 : | <transput stm>
15 : | <case stm>
16 : | call <id>
17 : <asgt stm> ::= <id> := <exp>
18 : <cond stm> ::= if <exp> then <stm train> fi
19 :              | if <exp> then <stm train> else <stm train> fi

Lexical Analyser
Identifying the scope of the lexical analysis in the grammar of the language
20 : <loop stm> ::= while <exp> do <stm train> end
21 : | repeat <stm train> until <exp>
22 : <transput stm> ::= input <id>
23 : | output <exp>
24 : <exp> ::= <factor>
25 : | <exp> + <factor>
26 : | <exp> - <factor>
27 : | - <exp>
28 : <factor> ::= <primary>
29 : | <factor> * <primary>
30 : <primary> ::= <id>
31 : | <constant>
32 : | ( <exp> )
33 : | ( <compare> )
34 : <compare> ::= <exp> = <exp>
35 : | <exp> <= <exp>
36 : | <exp> > <exp>

Topic 2

One and Two Pass Lexical Analysis

Lexical Analyser
One Pass Lexical Analyser

• Identifies symbols and immediately assigns a token label to each
  symbol:

Source code:
    begin
    int A;
    A := 100;
    A := A+A;
    print A
    end

Token sequence:
    (begin,begin) (type,int) (id,A) (semic,;)
    (id,A) (eqsgn,:=) (int,100) (semic,;)
    (id,A) (eqsgn,:=) (id,A) (symb,+) (id,A) (semic,;)
    (print,print) (id,A) (end,end)

Lexical Analyser
Two Pass Lexical Analysis
• In a two-pass lexical analyser:
• First pass groups characters into symbols:

Source code:
    begin
    int A;
    A := 100;
    A := A+A;
    print A
    end

Symbols:
    "begin" "int" "A" ";" "A" ":=" "100" ";" "A" ":=" "A" "+" "A" ";" "print" "A" "end"

• Second pass assigns token labels to symbols:

    (begin,begin) (type,int) (id,A) (semic,;)
    (id,A) (eqsgn,:=) (int,100) (semic,;)
    (id,A) (eqsgn,:=) (id,A) (symb,+) (id,A) (semic,;)
    (print,print) (id,A) (end,end)

Lexical Analyser
Two Pass Lexical Analysis
• Most programming languages are designed such that the code can be
segmented into tokens without any knowledge at all of the meaning of the
token.
• Simple rules are adhered to:
  • White-space ends a symbol
  • Multiple white-space is ignored
  • Identifiers contain only alphanumeric chars or _
  • Identifiers never start with a number
  • A symbol starting with a number IS a number: 1, 34, 10.0
  • Some chars are always a symbol by themselves: } { ; ( ) ,
  • Mathematical chars can be solo or followed by "=":
      =, >, <, +, -, /, *
      ==, >=, <=, +=, -=, /=, *=

• The first char of the symbol tells us which group it is in

• Identifier rules:
  • Java: consist of Unicode letters, _, $, 0-9.
    Cannot start with 0-9.
  • C: consist of a-z A-Z 0-9 _.
    Cannot start with 0-9 (identifiers beginning with _ are legal but
    many such names are reserved for the implementation).

• Exceptions:
  • Lisp:
    • Identifiers consist of a-z A-Z 0-9 _ + - * / @ $ = < > . etc.
    • No restriction on the starting char
    • If the char sequence can be interpreted as a number, it is one
    • Else it is an 'identifier'
    • E.g., '1+' is an 'identifier'; '+1' is a number

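The Lisp rule above can be sketched in Python. This is an illustration only: `classify` is an invented helper, and Python's `float()` stands in for Lisp's number reader.

```python
# Sketch of the Lisp rule: if the character sequence can be read as a
# number it is a number, otherwise it is an identifier.
# classify() is an invented name; float() approximates Lisp's number reader.
def classify(symbol):
    try:
        float(symbol)          # accepts 1, 34, 10.0, +1, -2.5, 5.E3, ...
        return 'number'
    except ValueError:
        return 'identifier'    # e.g. '1+', 'my-var', '*'
```

With this sketch, classify('+1') gives 'number' while classify('1+') gives 'identifier', matching the example above.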

Topic 3

Methods of Lexical Analysis

Lexical Analyser: Using grammars


Approaches to Lexical Analysis

Three main approaches:

1) Ad-hoc coding: code is written to recognise each type of token.

2) Regular expressions, e.g.:
   • float: "[0-9]*\.[0-9]+"
   • Id: "[a-zA-Z_][a-zA-Z_0-9]*"

3) Context-free grammar, e.g.:
   Token :- Id | Int | Literal | …
   Id :- Alfa | Alfa Id2
   Id2 :- Alfa | Digit | Alfa Id2 | Digit Id2

Topic 2.1

Ad Hoc Coding of Lexical Analysis:


Recognising Symbols

Lexical Analyser
Two Pass Lexical Analysis with ad-hoc code
• Common approach (1):
• Human writes code to recognise the tokens of the source language:
def tokenise():
    symbolList = []
    while not eof():
        # process next chars until end of symbol
        # add symbol to symbolList
        . . .
    return symbolList

Lexical Analyser
Two Pass Lexical Analysis with ad-hoc code

def tokenise():
    symbolList = []
    while not eof():
        case type(nextc):
            'whitespace': ...
            'alpha': ...
            'digit': ...
            etc.
    return symbolList

Lexical Analyser
def tokenise():
    symbolList = []
    while not eof():
        case type(nextc):
            'whitespace': ...
            'alpha': ...
            'digit': ...
            etc.
    return symbolList

def type(char):
    if char in "a-zA-Z_": return 'alpha'
    if char in "0-9": return 'digit'
    if char in " \t\n": return 'whitespace'
    if char in "{};,": return 'sepchar'
    if char in "><=+-/*": return 'mathchar'

Lexical Analyser
def tokenise():
    symbolList = []
    while not eof():
        case type(nextc):
            'alpha':   # alpha includes '_' here
                symbol = "" + getc()
                while type(nextc) in ['alpha', 'digit']:
                    symbol += getc()
                symbolList.append(symbol)
            'whitespace': getc()
            'digit': ...
            ...

Lexical Analyser

            . . .
            'mathchar':   # = > < + - * /
                symbol = "" + getc()
                if nextc == '=':
                    symbol += getc()
                symbolList.append(symbol)
            'sepchar':    # { } ; ,
                symbol = "" + getc()
                symbolList.append(symbol)
            default: print("ERROR: Unknown char: " + getc())

Lexical Analyser

Numbers:
• Formats: 1, 34, 34.001, .0

• Procedure:
  1) Read digits until we reach a nondigit
  2) If nextchar is ".", then read digits until we reach a nondigit

            'digit':
                symbol = "" + getc()
                while nextc in "0123456789":
                    symbol += getc()
                if nextc == ".":
                    symbol += getc()
                    while nextc in "0123456789":
                        symbol += getc()
                symbolList.append(symbol)

Topic 2.2

Ad Hoc Coding of Lexical Analysis:


Assigning Token Labels

Lexical Analyser
Two Pass Lexical Analyser

• Second Stage: assigning token labels to symbols

1. Reserved words matched by comparison (or hash lookup):

   If Symbol in RESERVED_WORDS: Token = symbol

2. Use regular expressions for user-supplied symbols:

   • Int   : "[0-9]+"
   • Float : "[0-9]*\.[0-9]+"
   • Id    : "[a-zA-Z_][a-zA-Z0-9_]*"
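This second stage can be sketched in a few lines of Python. The RESERVED_WORDS set is a small assumed sample; the regular expressions are the ones listed above, with the dot in the Float pattern escaped.

```python
import re

RESERVED_WORDS = {'begin', 'end', 'if', 'then', 'output'}   # assumed sample

# The patterns from the slide; fullmatch requires the whole symbol to
# match, so "3.5" cannot be labelled Int by its leading digits alone.
PATTERNS = [
    ('float', re.compile(r'[0-9]*\.[0-9]+')),
    ('int',   re.compile(r'[0-9]+')),
    ('id',    re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*')),
]

def label(symbol):
    if symbol in RESERVED_WORDS:       # reserved words label themselves
        return symbol
    for token, pattern in PATTERNS:
        if pattern.fullmatch(symbol):
            return token
    return 'symb'                      # operators, separators, etc.
```

For example, label('begin') gives 'begin', label('100') gives 'int', label('3.5') gives 'float' and label('A1') gives 'id'.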

Topic 2.2

Ad Hoc Coding of Lexical Analysis:


Single Pass Approach

Lexical Analyser

The code shown earlier was for recognising symbols.

• Symbols of different types were recognised in different parts.
• We can use this to simplify token labelling.
• The specific code used to identify the symbol knows if it is a
  number, alphanumerical, mathematical or separator.
• Thus, we can use this code to assign the token label as well.

Lexical Analyser
Single Pass Lexical Analyser
def tokenise():
    symbolList = []
    while not eof():
        while nextc in WHITE_SPACE_CHARS: getc()
        symbtype = type(nextc)
        symbol = "" + getc()
        case symbtype:
            'alpha': ...
            'digit': ...
            'sepchar': ...
            'mathchar': ...
        symbolList.append( [token, symbol] )
    return symbolList

Lexical Analyser
Single Pass Lexical Analyser
def tokenise():
    symbolList = []
    while not eof():
        while nextc in WHITE_SPACE_CHARS: getc()
        symbtype = type(nextc)
        symbol = "" + getc()
        case symbtype:
            'alpha':   # alpha includes '_' here
                while type(nextc) in ['alpha', 'digit']:
                    symbol += getc()
                if symbol in RESERVED_WORDS:
                    token = symbol
                else:
                    token = 'id'
            ...
        symbolList.append( [token, symbol] )
    return symbolList
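Putting the pieces together, the single-pass tokeniser above can be written as runnable Python. This is a sketch under some assumptions: nextc/getc are modelled by an index into the input string, the reserved-word set is a small sample, and ':' is added to the math chars so that ':=' is scanned as one symbol.

```python
WHITE_SPACE_CHARS = ' \t\n'
SEP_CHARS = '{};,()'
MATH_CHARS = '><=+-/*:'   # ':' added (assumption) so ':=' is one symbol
RESERVED_WORDS = {'begin', 'end', 'int', 'bool', 'if', 'then', 'fi', 'output'}

def chartype(c):
    if c.isalpha() or c == '_': return 'alpha'
    if c.isdigit():             return 'digit'
    if c in SEP_CHARS:          return 'sepchar'
    if c in MATH_CHARS:         return 'mathchar'
    return 'unknown'

def tokenise(text):
    tokens, i = [], 0
    while i < len(text):
        while i < len(text) and text[i] in WHITE_SPACE_CHARS:
            i += 1                                   # skip white space
        if i == len(text):
            break
        kind = chartype(text[i])
        symbol, i = text[i], i + 1
        if kind == 'alpha':                          # identifier or reserved word
            while i < len(text) and chartype(text[i]) in ('alpha', 'digit'):
                symbol += text[i]; i += 1
            token = symbol if symbol in RESERVED_WORDS else 'id'
        elif kind == 'digit':                        # int, or float if '.' follows
            while i < len(text) and text[i].isdigit():
                symbol += text[i]; i += 1
            token = 'int'
            if i < len(text) and text[i] == '.':
                symbol += text[i]; i += 1
                while i < len(text) and text[i].isdigit():
                    symbol += text[i]; i += 1
                token = 'float'
        elif kind == 'mathchar':                     # solo, or followed by '='
            if i < len(text) and text[i] == '=':
                symbol += text[i]; i += 1
            token = 'mathop'
        elif kind == 'sepchar':                      # always a symbol by itself
            token = 'sepchar'
        else:
            token = 'error'
        tokens.append((token, symbol))
    return tokens
```

For instance, tokenise("begin int A; A := 100; end") yields (begin,begin), (int,int), (id,A), (sepchar,;), (id,A), (mathop,:=), (int,100), (sepchar,;), (end,end).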

Topic 2.3

Lexical Analysis using


Regular Expressions

Lexical Analyser: Using grammars
Lexical analysis using regular expressions and grammars

• The previous section looked at lexical analysis informally, just in
  terms of a computer program written by hand to recognise the
  tokens of a language.
• The rules of lexical syntax are only represented implicitly in the
  code. One has to interpret the code to see that an identifier must
  start with an alpha char.
• In earlier days, this was sufficient.
• However, there are problems with this approach:
  • Portability: a change in syntax requires editing the source code (it may
    be better to state the lexical structure in an external data file, so the
    source code need not be edited).
  • It is difficult to prove that the tokenising code actually conforms to
    the specification of the language – does it do as it should?
• This section will explore lexical analysis from a more formal
  perspective.

Lexical Analyser
Lexical analysis using regular expressions

• One approach is to describe the tokens in terms of regular
  expressions.

• A program can read in these regular expressions and generate
  code to perform the tokenisation.

• e.g., FLEX and LEX

Lexical Analyser
Standard Compiler Architecture

• LEX and YACC (Yet Another Compiler Compiler) are often used
  to build a compiler quickly:

    Lexical rules   → Lex  → MyLexAnalyser
    Syntactic rules → Yacc → MySynAnalyser

    Source Code → MyLexAnalyser → MySynAnalyser → ...

• One does not write the lexical analyser directly, just the patterns
  to recognise tokens.

Lexical Analyser
Flex input for a simple tokeniser
DIGIT   [0-9]
ID      [a-z][a-z0-9]*

%%

{DIGIT}+              { printf( "(integer, '%s')", yytext ); }

{DIGIT}+"."{DIGIT}*   { printf( "(float, '%s')", yytext ); }

if|then|begin|end|procedure|function
                      { printf( "(%s, '%s')", yytext, yytext ); }

{ID}                  printf( "(id, '%s')", yytext );

"+"|"-"|"*"|"/"       printf( "(mathop, '%s')", yytext );

[ \t\n]+              /* eat up whitespace */

.                     printf( "Unrecognized character: %s\n", yytext );

Lexical Analyser
Flex processing
• Flex generates C code to proceed char by char through the input text.
• When the generated scanner is run, it analyses its input looking for
  strings which match any of its patterns.
• If it finds more than one match, it takes the one matching the most text.
  • Thus "23.56" will be recognised as a float, not an integer:

    {DIGIT}+              { printf( "(integer, '%s')", yytext ); }
    {DIGIT}+"."{DIGIT}*   { printf( "(float, '%s')", yytext ); }

• Once the match is determined, the text corresponding to the match is
  made available in the global character pointer yytext.
• The action corresponding to the matched pattern is then executed.
• After recognising a token, scanning for all patterns restarts from that
  point (any partially matched pattern is discarded).
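The longest-match rule can be sketched in Python. Flex itself compiles the patterns into a DFA; this sketch simply re-applies each pattern at the current position and keeps the longest match, with ties going to the earlier pattern, as flex does.

```python
import re

# The two patterns from the slide, in file order.
PATTERNS = [
    ('integer', re.compile(r'[0-9]+')),
    ('float',   re.compile(r'[0-9]+\.[0-9]*')),
]

def longest_match(text, pos=0):
    """Return (token, lexeme) for the longest match at pos, or None."""
    best = None
    for token, pattern in PATTERNS:
        m = pattern.match(text, pos)
        # strict '>' keeps the earlier pattern on a tie, as flex does
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (token, m.group())
    return best
```

Here longest_match('23.56') gives ('float', '23.56') because the float pattern matches more text, while longest_match('23') gives ('integer', '23').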

Topic 2.4

Lexical Analysis using CFGs


(Context Free Grammars)

Lexical Analyser
Using a CFG

Using Grammars for token recognition

• More formally, one can describe the possible tokens of a
  language using a context-free grammar:

    <Token> ::= <Id> | <ReservedWd> | <Number> | ...
    <Id> ::= <Letter> | <Letter><IdC>
    <IdC> ::= <Letter> | <Digit> | <Letter><IdC> | <Digit><IdC>

• This has the advantage that both the syntactic description of the
  language and its lexical description are in the same formalism.
• Processing of such grammars is however slower than for regular
  expressions.

Lexical Analyser
Deterministic Finite Automata

Using Grammars for token recognition

• However, some context-free grammars can be automatically
  translated into right-regular grammars, where rules are of two
  forms ('a' is a terminal; 'A' is a nonterminal):
  • A → a
  • A → aB

• A grammar in such a form can then be represented as a
  deterministic finite automaton (DFA), which allows efficient
  processing of the input.

• A DFA is a finite state machine where, for each pair of state and
  input symbol, there is one and only one transition to a next state.
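As an illustration (not from the slides), such a DFA can be stored as a transition table with exactly one next state per (state, input class) pair; here a tiny machine for unsigned integers, with invented state names.

```python
# One transition per (state, input class) pair, as the definition requires.
DELTA = {
    ('start', 'digit'): 'int',   ('start', 'other'): 'dead',
    ('int',   'digit'): 'int',   ('int',   'other'): 'dead',
    ('dead',  'digit'): 'dead',  ('dead',  'other'): 'dead',
}
ACCEPTING = {'int'}

def accepts(s):
    state = 'start'
    for c in s:
        # classify the input symbol, then take the unique transition
        state = DELTA[(state, 'digit' if c.isdigit() else 'other')]
    return state in ACCEPTING
```

So accepts('123') is True, while accepts('12a') and accepts('') are False.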

Lexical Analyser
Deriving a right-regular grammar
To derive a right-regular grammar from a full context-free grammar:
1. For each rule whose RHS starts with a nonterminal, replace
   the nonterminal with its expansion(s).

E.g.:
    Number ::= Integer
    Integer ::= IntegerSS | - IntegerSS
    IntegerSS ::= digit | digit IntegerSS

becomes:
    Number ::= IntegerSS | - IntegerSS
    IntegerSS ::= digit | digit IntegerSS

and then:
    Number ::= digit | digit IntegerSS | - IntegerSS
    IntegerSS ::= digit | digit IntegerSS

Lexical Analyser
Deriving a right-regular grammar
To derive a right-regular grammar from a full context-free grammar:
1. For each rule whose RHS starts with a nonterminal, replace
   the nonterminal with its expansion(s).

2. At the end of replacements, eliminate any rule which cannot
   be reached from the START symbol.

e.g., assume the grammar has start symbol Token:

    <Token> ::= <Id> | <ReservedWd> | <Number> | <StringLit> | …

We only preserve nonterminals referenced in this rule, or referenced in
the nonterminals it contains, etc.

Lexical Analyser
A sample CFG for tokens
<Token> ::= <Id> | <ReservedWd> | <Number> | <StringLit> | <CharLit> | <SS> | <MS>
<Id> ::= <Letter> | <Letter> <IdC>
<IdC> ::= <Letter> | <Digit> | <Letter> <IdC> | <Digit> <IdC>
<Number> ::= <Integer> | <Real>
<Integer> ::= <IntegerSS> | - <IntegerSS>
<IntegerSS> ::= <Digit> | <Digit> <IntegerSS>
<Real> ::= <FixedPoint> | <FixedPoint> <Exponent>
<FixedPoint> ::= <Integer> . <IntegerSS> | . <IntegerSS> | <Integer> .
<Exponent> ::= E <Integer>
<StringLit> ::= "" | " <CharSeq> "
<CharSeq> ::= <Character> | <Character> <CharSeq>
<CharLit> ::= ' <Character> '
<SS> ::= + | - | * | / | = | < | > | ( | )        /* Simple Symbol */
<MS> ::= =+ | != | <= | >= | ++ | --              /* Multiple Symbol */
<Character> ::= <Letter> | <Digit> | <SS> | ! | . | , | b | \' | \" | \n
<Letter> ::= A | B | ... | Z | a | b | ... | z
<Digit> ::= 0 | 1 | ... | 9


Lexical Analyser
Same grammar converted to a RRG (almost)
<Token> ::= <Letter>
| <Letter> <IdC>
| <ReservedWd>
| <Digit>
| <Digit> <IntegerSS>
| - <IntegerSS>
| <Digit>. <IntegerSS>
| <Digit> <IntegerSS> . <IntegerSS>
| - <IntegerSS> . <IntegerSS>
| . <IntegerSS>
| <Digit> . | <Digit> <IntegerSS> .
| - <IntegerSS> .

| ""
| " <CharSeq> "
| ' <Character> '
| + | - | * | / | = | < | > | ( | ) | =+ | != | <= | >= | ++ | --
<IdC> ::= <Letter> | <Digit> | <Letter> <IdC> | <Digit> <IdC>
<IntegerSS> ::= <Digit> | <Digit> <IntegerSS>

Lexical Analyser
RRG converted to a DFA


Lexical Analyser

A more complex example

Lexical Analyser
Complex Grammar of ASPLE (lexical and syntactic)
1 : <program> ::= begin <dcl train> ; <stm train> end
2 : <dcl train> ::= <declaration>
3 : | <declaration> ; <dcl train>
4 : <stm train> ::= <statement>
5 : | <statement> ; <stm train>
6 : <declaration>::= <mode> <idlist>
7 : <mode> ::= bool
8 : | int
9 : | ref <mode>
10 : <idlist> ::= <id>
11 : | <id> , <idlist>
12 : <statement> ::= <asgt stm>
13 : | <cond stm>
14 : | <loop stm>
15 : | <transput stm>
15 : | <case stm>
16 : | call <id>
17 : <asgt stm> ::= <id> := <exp>
18 : <cond stm> ::= if <exp> then <stm train> fi
19 :              | if <exp> then <stm train> else <stm train> fi


Lexical Analyser
Identifying the scope of the lexical analysis in the grammar of the language
20 : <loop stm> ::= while <exp> do <stm train> end
21 : | repeat <stm train> until <exp>
22 : <transput stm> ::= input <id>
23 : | output <exp>
24 : <exp> ::= <factor>
25 : | <exp> + <factor>
26 : | <exp> - <factor>
27 : | - <exp>
28 : <factor> ::= <primary>
29 : | <factor> * <primary>
30 : <primary> ::= <id>
31 : | <constant>
32 : | ( <exp> )
33 : | ( <compare> )
34 : <compare> ::= <exp> = <exp>
35 : | <exp> <= <exp>
36 : | <exp> > <exp>

Lexical Analyser
Identifying the scope of the lexical analysis in the grammar of the language
37 : <constant> ::= <Boolean constant>
38 : | <int constant>
39 : <Boolean constant> ::= true
40 : | false
41 : <int constant> ::= <number>
42 : <number> ::= <digit>
43 : | <number> <digit>
44 : <id> ::= <letter>
45 : | <letter><rest id>
46 : <rest id> ::= <alphanumeric>
47 : | <alphanumeric><rest id>
46 : <digit> ::= 0 | 1 | ... | 9
47 : <letter> ::= a | b | ... | z | A | B | ... | Z
48 : <case stm> ::= case ( <expr> ) <constant case train> esac
49 : <constant case train> ::= <constant case>
50 : | <constant case> <constant case train>
51 : <constant case> ::= <int constant> : <stm train>
52 : <alphanumeric> ::= <digit>
53 : | <letter>
54 : <procedures> ::= <procedure> <procedures>
55 : |
56 : <procedure> ::= procedure <id> begin <stm train> end


Lexical Analyser

• We create a new Nonterminal to represent the tokens we wish to recognise

<Token> ::= <Id> | <ReservedWd> | <Number> | <StringLit> | <CharLit> | <SS> | <MS>

• We then derive from this a RRG

Lexical Analyser
The Right-regular grammar
<Token> ::= begin | ; | end | bool | int | ref | ,
| call | := | if | then | fi | else
| while | do | repeat | until | input
| output | + | - | * | ( | ) | = | <= | >
| case | esac | : | procedure
| true | false
| 0 | ··· | 9 | 0 <int constant> | ···
| 9 <int constant>
| A | ··· | Z | a | ··· | z
| A <rest id> | ··· | Z <rest id>
| a <rest id> | ··· | z <rest id>

<int constant> ::= 0 | ··· | 9 | 0 <int constant> | ··· | 9 <int constant>
<rest id> ::= A | ··· | Z | a | ··· | z | 0 | ··· | 9
| 0 <rest id> | ··· | 9 <rest id>
| A <rest id> | ··· | Z <rest id>
| a <rest id> | ··· | z <rest id>


Lexical Analyser
Graph associated to the grammar
[Transition graph: from the start state, a letter (A,...,Z, a,...,z) moves to
a "rest id" state, which loops on letters and digits (A,...,Z, a,...,z, 0,...,9);
a digit (0,...,9) moves to an "int constant" state, which loops on digits; the
reserved words (begin, end, bool, int, ref, call, if, then, fi, else, while, do,
repeat, until, input, output, case, esac, procedure, true, false) and the symbols
(; , + - * ( ) = := <= > :) move directly (λ) to the final state SU.]

Lexical Analyser
Other patterns
• ASPLE just provides a few data types, but there are others which are also very
common:
• Real numbers, e.g. 3.45, .44, -5., 3.45E2, .44E-2, -5.E123
    <real> ::= <fixed point>
             | <fixed point><exponent>
    <integer> ::= <int constant>
               | - <int constant>
    <fixed point> ::= <integer> . <int constant>
                   | . <int constant>
                   | <integer> .
    <exponent> ::= E <integer>
• Characters and strings, e.g. "", "hello world", 'a', ..., 'z'
    <literal> ::= "" | " <string> "
    <character> ::= ' <symbol> '
    <string> ::= <symbol> | <symbol><string>


Topic 3

Semantic Actions in
Lexical Analysis

Lexical analyser: Semantic actions
Previous concepts

• The compiler can delegate some semantic tasks to the lexical
  analyser:
  • Storing the information about the identifiers in the symbol table.
  • Calculating the numeric values (in binary code) for each
    numeric constant.
  • etc.

• These tasks vary according to:
  • The objectives of the translators / interpreters.
  • The division of tasks between the different components in
    the translator / interpreter.

Lexical analyser: Semantic actions


Semantic actions
• These actions are sometimes expressed by inserting actions between the
  symbols in the rules. For instance:

    <id> ::= actionf0 <letter> actionf1 actionf2 <rest id> actionf3
    <rest id> ::= <letter> actionf1 actionf2 <rest id>
                | <digit> actionf1 actionf2 <rest id>
                | λ

where
  • actionf0 might be: initialise a counter
  • actionf1: add 1 to the counter
  • actionf2: copy the character which has just been recognised into
    a buffer
  • actionf3: add an end-of-string mark to the buffer. Check that the
    number of characters is not higher than the maximum length
    allowed; if it is, report the error. Otherwise, insert the identifier
    in the symbol table and return a pointer to its entry in the table.

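These identifier actions can be sketched in Python. MAX_ID_LENGTH and the symbol-table representation are assumptions for illustration; a real analyser would run the actions as each character is recognised, not over a finished string.

```python
MAX_ID_LENGTH = 32     # assumed limit, for illustration
symbol_table = {}      # name -> index ("pointer" into the table)

def scan_id(chars):
    count = 0                 # actionf0: initialise the counter
    buffer = []               #           ... and an empty buffer
    for c in chars:
        count += 1            # actionf1: add 1 to the counter
        buffer.append(c)      # actionf2: copy the recognised character
    # actionf3: terminate, length-check, insert into the symbol table
    if count > MAX_ID_LENGTH:
        raise ValueError('identifier too long')
    name = ''.join(buffer)
    if name not in symbol_table:
        symbol_table[name] = len(symbol_table)
    return symbol_table[name]
```

Scanning the same identifier twice returns the same table entry, which is what the parser later uses in place of the spelling.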
Lexical analyser: Semantic actions
Semantic actions
• Another example:

    <integer> ::= actiong0 <int constant>
                | actiong0 - <int constant> actiong2
    <int constant> ::= <digit> actiong1
                     | <digit> actiong1 <int constant>

• where
  • actiong0 can be the initialisation of an integer variable:
        value ← 0
  • actiong1 performs the calculation of the value of the number
    read so far:
        value ← (10 × value) + value(digit)
  • actiong2 changes the sign of the calculated value:
        value ← -value

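The integer actions can be sketched the same way; here value(digit) is simply Python's int() applied to a single digit.

```python
def scan_int(text):
    """Compute the value of an integer literal using the actions above."""
    negative = text.startswith('-')
    digits = text[1:] if negative else text
    value = 0                          # actiong0: initialise the value
    for d in digits:
        value = (10 * value) + int(d)  # actiong1: fold in each digit
    if negative:
        value = -value                 # actiong2: change the sign
    return value
```

For example, scan_int('123') gives 123 and scan_int('-45') gives -45.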

Lexical Analyser: Summary


Summary

Three main approaches:

1) Ad-hoc coding

2) Regular expressions, e.g.:
   • float: "[0-9]*\.[0-9]+"
   • Id: "[a-zA-Z_][a-zA-Z_0-9]*"

3) Context-free grammar, converted to RRG and DFA:
   Token :- Id | Int | Literal | …
   Id :- Alfa | Alfa Id2
   Id2 :- Alfa | Digit | Alfa Id2 | Digit Id2

