

Compiler Construction

Lecture 2: Lexer / Scanner


BITS Pilani
Hyderabad Campus
Phases of a compiler



Lexical Analysis



Need and Role of the Lexical Analyzer
1. It generates a token for each lexeme in the source program.
   • Each token is a logically cohesive unit such as an identifier,
     keyword, operator, or punctuation mark.

2. It removes white space and comments.

3. It keeps track of the line numbers of the program.

4. It enters identifier lexemes into the symbol table and also
   reads from the symbol table.

5. It reports lexical errors, such as:
   • the appearance of illegal characters
   • unmatched (unterminated) strings or spelling errors
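As an illustration of duties 2 and 3, a hand-written scanner might skip blanks and comments while counting newlines. The helper below is a hypothetical sketch (the "//" comment syntax and the function name are assumptions, not part of the slides):

    def skip_blanks_and_comments(src, pos, line):
        # Advance past white space and '//' line comments, counting newlines
        # so the scanner can report accurate line numbers later.
        while pos < len(src):
            if src[pos] == "\n":
                line += 1
                pos += 1
            elif src[pos] in " \t\r":
                pos += 1
            elif src.startswith("//", pos):
                while pos < len(src) and src[pos] != "\n":
                    pos += 1
            else:
                break
        return pos, line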
Basic Terminology

– TOKEN: a pair consisting of

  • Token name: an abstract symbol representing a lexical unit.
  • An optional attribute value.

– PATTERN
  • A rule describing the set of strings that can form a particular token.

– LEXEME
  • A sequence of characters in the source program that matches the pattern for some token.
  • E.g. identifiers: x, count, name, etc.
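A token can thus be modelled as a <token name, attribute> pair. The following is a hypothetical sketch, not code from the slides:

    from dataclasses import dataclass
    from typing import Any, Optional

    @dataclass
    class Token:
        name: str                        # token name: an abstract symbol, e.g. "id", "number"
        attribute: Optional[Any] = None  # optional attribute, e.g. a symbol-table reference

    # Pattern for identifiers (informally): a letter followed by letters or digits.
    # The lexeme "count" matches that pattern and could yield Token("id", 7),
    # where 7 refers to the symbol-table entry for "count".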


Lexemes

• Lexemes are the lowest level syntactic units.


Example:
val = (int)(xdot + y*0.3) ;

In the above statement, the lexemes are


val, =, (, int, ), (, xdot, +, y, *, 0.3, ), ;
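A rough way to see this split mechanically (the regular expression below is an illustrative assumption, not the course's token definitions):

    import re

    # Identifier | float | integer | single-character operators and punctuation
    LEXEME_RE = re.compile(r"[A-Za-z_]\w*|\d+\.\d+|\d+|[=+*();]")

    print(re.findall(LEXEME_RE, "val = (int)(xdot + y*0.3) ;"))
    # ['val', '=', '(', 'int', ')', '(', 'xdot', '+', 'y', '*', '0.3', ')', ';']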


Tokens

Tokens are the categories into which lexemes are grouped.

• Identifiers: names chosen by the programmer.
  E.g. val, xdot, y.

• Keywords: names chosen by the language designer to help express syntax
  and structure. E.g. int, return, void. (Keywords that cannot be used as
  identifiers are known as reserved words.)


Tokens (Contd.)

• Operators: identify actions. E.g. +, &&, !

• Literals: denote values directly. E.g. 3.14, -10, ‘a’, true, null

• Punctuation Symbols: support syntactic structure. E.g. (, ), ;, {, }


Examples

Tokens Pattern Simple


Lexeme
while while while
relation_op = | != | < | > <
integer (0-9)* 42
string Characters “hello”
between “ “

BITS Pilani, Hyderabad Campus
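The patterns in the table could be written as regular expressions roughly as follows (an illustrative sketch; the exact regex forms are assumptions):

    import re

    PATTERNS = {
        "while":       r"while",
        "relation_op": r"=|!=|<|>",
        "integer":     r"[0-9]+",
        "string":      r'"[^"]*"',     # characters between double quotes
    }

    SAMPLES = {"while": "while", "relation_op": "<",
               "integer": "42", "string": '"hello"'}

    for token, pattern in PATTERNS.items():
        # Each sample lexeme from the table matches its pattern in full.
        assert re.fullmatch(pattern, SAMPLES[token]), (token, SAMPLES[token])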


A program fragment viewed as a stream of Tokens


Example - 1

int max(int a, int b)
{
    if(a>b)
        return a;
    else
        return b;
}

Lexeme    Token
int       Keyword
max       Identifier
(         Separator
int       Keyword
a         Identifier
,         Separator
int       Keyword
b         Identifier
)         Separator
{         Separator
if        Keyword
..        ..
Example - 2

Input string: size = r * 32 + c

<token, lexeme> pairs:

<id, size>
<assign, =>
<id, r>
<arith_sym, *>
<integer, 32>
<arith_sym, +>
<id, c>
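A small sketch of a scanner that produces exactly these pairs, using the token names from the slide (the regular expressions themselves are assumptions):

    import re

    # Token patterns, tried as named alternatives of one master regex.
    SPEC = [
        ("id",        r"[A-Za-z_]\w*"),
        ("integer",   r"\d+"),
        ("assign",    r"="),
        ("arith_sym", r"[+*]"),
        ("ws",        r"\s+"),          # matched but not emitted
    ]
    MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in SPEC))

    def tokenize(text):
        for m in MASTER.finditer(text):
            if m.lastgroup != "ws":
                yield (m.lastgroup, m.group())

    print(list(tokenize("size = r* 32 + c")))
    # [('id', 'size'), ('assign', '='), ('id', 'r'), ('arith_sym', '*'),
    #  ('integer', '32'), ('arith_sym', '+'), ('id', 'c')]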


In general

(Figure: the lexical analyzer reads the source program text and produces a stream of tokens.)
Attributes for tokens
➢ When more than one lexeme can match a pattern, the lexical analyzer must
  provide the subsequent compiler phases with additional information about the
  particular lexeme that matched, i.e., <token name, attribute value>.
➢ The token name influences parsing decisions.
➢ The attribute value influences the translation of tokens after the parse.

➢ We shall assume that tokens have at most one associated attribute, although
  this attribute may have a structure that combines several pieces of
  information.
➢ Example: token name id.
  ➢ Information about an identifier (its lexeme, its type, and the location at
    which it is first found) is kept in the symbol table.
  ➢ Thus, the appropriate attribute value for an identifier is a pointer to
    the symbol table entry for that identifier.
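A minimal sketch of this arrangement (all names are assumptions): the attribute of an id token is an index into the symbol table, whose entry records the lexeme, its type, and where it was first seen.

    symbol_table = []        # list of entries; the index plays the role of the "pointer"
    index_of = {}            # lexeme -> index, so repeated identifiers reuse one entry

    def install_id(lexeme, line):
        if lexeme not in index_of:
            index_of[lexeme] = len(symbol_table)
            symbol_table.append({"lexeme": lexeme, "type": None, "first_line": line})
        return index_of[lexeme]

    tok = ("id", install_id("count", 3))   # <token name, attribute value>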


Specification of Tokens
➢ We need a formal way to specify patterns: regular expressions

➢ Recap of some basic terminology:

➢ Alphabet: any finite set of symbols

➢ String over an alphabet: a finite sequence of symbols drawn from that alphabet

➢ Language: a countable set of strings over some fixed alphabet

➢ Empty string 𝜖: the string of length zero

➢ String concatenation: xy denotes the string x followed by the string y
Operations on Languages
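As a recap, the standard operations on languages 𝐿 and 𝑀 are:

    Union:            𝐿 ∪ 𝑀 = { s | s is in 𝐿 or s is in 𝑀 }
    Concatenation:    𝐿𝑀 = { st | s is in 𝐿 and t is in 𝑀 }
    Kleene closure:   𝐿∗ = union of 𝐿^i for i ≥ 0  (zero or more concatenations of 𝐿)
    Positive closure: 𝐿+ = union of 𝐿^i for i ≥ 1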


Regular Expressions (R.E.)

➢ 𝜖 is a R.E. (denotes the language {𝜖})

➢ ∅ is a R.E. (denotes the empty language)

➢ For each 𝑎 ∈ Σ, 𝑎 is a R.E. (denotes the language {𝑎})

Let 𝑟1 and 𝑟2 be R.E.s denoting the languages 𝐿1 and 𝐿2.

➢ 𝑟1 | 𝑟2 is also a R.E., denoting 𝐿1 ∪ 𝐿2 (| denotes union)

➢ 𝑟1 𝑟2 is also a R.E., denoting the concatenation 𝐿1 𝐿2

➢ If 𝑟 is a R.E. denoting the language 𝐿, then 𝑟∗ is also a R.E., denoting 𝐿∗
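For example, over Σ = {𝑎, 𝑏}:

    (𝑎 | 𝑏)∗          all strings of 𝑎's and 𝑏's (including 𝜖)
    (𝑎 | 𝑏)∗ 𝑎𝑏𝑏      all strings of 𝑎's and 𝑏's ending in 𝑎𝑏𝑏
    𝑎 | 𝑎∗𝑏           the string 𝑎, or zero or more 𝑎's followed by one 𝑏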


Regular definitions
The regular expression for an identifier: it starts with a lower-case
letter and may be followed by a string of lower-case letters and
digits 0, 1, …, 9.
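Written out as a regular definition, this is:

    letter → a | b | … | z
    digit  → 0 | 1 | … | 9
    id     → letter ( letter | digit )∗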




Extensions of regular expressions

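The commonly used shorthands are (standard notation, listed for reference):

    r+       one or more instances of r        ( r+ = r r∗ )
    r?       zero or one instance of r         ( r? = r | 𝜖 )
    [a-z]    character class, shorthand for a | b | … | z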


Extensions of regular expressions

Letter → [a-z]
Digit  → [0-9]
ID     → Letter ( Letter | Digit )∗
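The same definition written as a Python regular expression (an illustrative assumption, not taken from the slide):

    import re

    ID = re.compile(r"[a-z][a-z0-9]*")   # Letter ( Letter | Digit )*

    print(bool(ID.fullmatch("count2")))  # True
    print(bool(ID.fullmatch("2count")))  # False: must start with a letter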


Token Recognition
➢ In the previous slides, we learned how to express patterns using
  regular expressions.

➢ Now, we must study how to take the patterns for all the needed
  tokens and build a piece of code that examines the input string
  and finds a prefix that is a lexeme matching one of the patterns.


Transition diagrams for relational operators
• Transition diagram for relop (<, <=, >, >=, =)



Implementation of Transition-Diagrams

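For instance, the relop diagram from the previous slide might be coded by hand as follows (a sketch; the function name and return convention are assumptions). The "retract" action corresponds to not consuming the lookahead character:

    def relop(src, pos):
        # Recognize <, <=, >, >=, = starting at src[pos].
        # Returns (token name, lexeme, next position) or None.
        if pos >= len(src):
            return None
        c = src[pos]
        if c == "<":
            if pos + 1 < len(src) and src[pos + 1] == "=":
                return ("relop", "<=", pos + 2)
            return ("relop", "<", pos + 1)     # retract: only '<' is consumed
        if c == ">":
            if pos + 1 < len(src) and src[pos + 1] == "=":
                return ("relop", ">=", pos + 2)
            return ("relop", ">", pos + 1)
        if c == "=":
            return ("relop", "=", pos + 1)
        return None

    print(relop("a >= b", 2))   # ('relop', '>=', 4)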


Example

(Transition-diagram legend: the initial state, final or accept states, and the
marker that means “retract the forward pointer”.)




Architecture of a DFA-based Lexical Analyzer
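A minimal sketch of one common table-driven architecture (state numbers, tables, and function names are assumptions): a transition table maps (state, character class) to the next state, and accepting states are mapped to token names.

    # A tiny table-driven DFA that accepts identifiers: letter (letter | digit)*
    # States: 0 = start, 1 = inside an identifier (accepting), -1 = dead.
    def char_class(c):
        if c.isalpha(): return "letter"
        if c.isdigit(): return "digit"
        return "other"

    TRANSITIONS = {
        (0, "letter"): 1,
        (1, "letter"): 1,
        (1, "digit"):  1,
    }
    ACCEPTING = {1: "id"}

    def run_dfa(text, pos):
        state, last_accept = 0, None
        while pos < len(text):
            state = TRANSITIONS.get((state, char_class(text[pos])), -1)
            if state == -1:
                break
            pos += 1
            if state in ACCEPTING:
                last_accept = (ACCEPTING[state], pos)   # remember the longest accept so far
        return last_accept

    print(run_dfa("count2 = 5", 0))   # ('id', 6)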


Implementation: Reserved words and identifiers

➢ Recognizing keywords and identifiers presents a problem. Usually, keywords
  like if or then are reserved, so they are not identifiers even though they
  look like identifiers.

➢ Two ways to handle the issue:

  1. Install the reserved words in the symbol table initially, and let a lookup
     on each matched lexeme decide whether it is a keyword or an ordinary
     identifier (see the sketch below).

  2. Create a separate transition diagram for each keyword.
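A minimal sketch of approach 1, using a plain set in place of the symbol table for brevity (the names KEYWORDS and name_or_keyword are assumptions):

    # Reserved words are pre-installed; anything the identifier pattern
    # matches is looked up here to pick its token name.
    KEYWORDS = {"if", "then", "else", "while", "int", "return", "void"}

    def name_or_keyword(lexeme):
        if lexeme in KEYWORDS:
            return (lexeme, None)        # the token name is the keyword itself
        return ("id", lexeme)            # ordinary identifier

    print(name_or_keyword("if"))      # ('if', None)
    print(name_or_keyword("count"))   # ('id', 'count')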
Transition diagram for unsigned numbers

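The usual regular definition behind such a diagram (the standard textbook formulation, given here for reference):

    digit   → [0-9]
    digits  → digit digit∗
    number  → digits ( . digits )? ( E ( + | - )? digits )?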


Architecture of a Transition-Diagram based Lexical Analyzer
➢ There are several ways that a collection of transition diagrams
  can be used to build a lexical analyzer.

➢ Choice 1: We could arrange for the transition diagrams for each
  token to be tried sequentially.
  ➢ How do we resolve the clash between keywords and identifiers
    (if separate transition diagrams are used)?


Cont..
➢ Choice 2: We could run the various transition diagrams “in
  parallel,” feeding the next input character to all of them and
  allowing each one to make whatever transitions it requires.

➢ We must be careful to resolve the case where one diagram
  finds a lexeme that matches its pattern, while one or more
  other diagrams are still able to process input.

➢ Solution: take the longest prefix of the input that matches any pattern.


Cont..
➢ Choice 3 (the preferred approach): Combine all the transition
  diagrams into one.

➢ We allow the combined transition diagram to read input until there is
  no possible next state.

➢ Then take the longest lexeme that matched any pattern.


Ambiguous Token Rule Sets

We resolve ambiguities using two rules:

– Longest match: the regular expression that matches the
  longest string takes precedence.

– Rule priority: the regular expressions identifying tokens
  are written down in sequence. If two regular expressions
  match the same (longest) string, the first regular
  expression in the sequence takes precedence.
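Both rules in a small sketch (pattern names and regexes are assumptions): every rule is tried at the current position, the longest match wins, and a tie on length is broken by the order in which the rules are listed.

    import re

    # Rules in priority order: "if" is listed before "id", so when both match
    # the same longest string ("if"), the keyword wins (rule priority).
    RULES = [
        ("if",  re.compile(r"if")),
        ("id",  re.compile(r"[a-z][a-z0-9]*")),
        ("num", re.compile(r"[0-9]+")),
    ]

    def next_token(text, pos):
        best = None                      # (length, rule index, name, lexeme)
        for i, (name, pattern) in enumerate(RULES):
            m = pattern.match(text, pos)
            if m and (best is None or len(m.group()) > best[0]):   # longest match
                best = (len(m.group()), i, name, m.group())
        return None if best is None else (best[2], best[3])

    print(next_token("if", 0))      # ('if', 'if')    rule priority breaks the tie
    print(next_token("ifx", 0))     # ('id', 'ifx')   longest match beats "if"
    print(next_token("42abc", 0))   # ('num', '42')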


Thank you
