

Compiler Construction

Lecture 2: Lexer / Scanner


BITS Pilani
Hyderabad Campus
Phases of a compiler



Lexical Analysis



Need and Role of the Lexical Analyzer
1. It generates a token for each lexeme in the source program.
   • Each token is a logically cohesive unit such as an identifier,
     keyword, operator, or punctuation mark.

2. It removes white space and comments.

3. It keeps track of the line numbers of the program.

4. It enters identifier lexemes into the symbol table and also
   reads from the symbol table.

5. It reports lexical errors, such as:
   • the appearance of illegal characters
   • unmatched (unterminated) strings or spelling errors
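As an illustration of duties 2 and 3, a hand-written scanner might skip blanks and comments while counting newlines. The helper below is a hypothetical sketch (the "//" comment syntax and the function name are assumptions, not part of the slides):

    def skip_blanks_and_comments(src, pos, line):
        # Advance past white space and '//' line comments, counting newlines
        # so the scanner can report accurate line numbers later.
        while pos < len(src):
            if src[pos] == "\n":
                line += 1
                pos += 1
            elif src[pos] in " \t\r":
                pos += 1
            elif src.startswith("//", pos):
                while pos < len(src) and src[pos] != "\n":
                    pos += 1
            else:
                break
        return pos, line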
Basic Terminology

– TOKEN: a pair consisting of

  • Token name: an abstract symbol representing a lexical unit.
  • An optional attribute value.

– PATTERN
  • A rule describing the set of strings that can form a particular token.

– LEXEME
  • A sequence of characters in the source program that matches the pattern for some token.
  • E.g. identifiers: x, count, name, etc.
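A token can thus be modelled as a <token name, attribute> pair. The following is a hypothetical sketch, not code from the slides:

    from dataclasses import dataclass
    from typing import Any, Optional

    @dataclass
    class Token:
        name: str                        # token name: an abstract symbol, e.g. "id", "number"
        attribute: Optional[Any] = None  # optional attribute, e.g. a symbol-table reference

    # Pattern for identifiers (informally): a letter followed by letters or digits.
    # The lexeme "count" matches that pattern and could yield Token("id", 7),
    # where 7 refers to the symbol-table entry for "count".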


Lexemes

• Lexemes are the lowest level syntactic units.


Example:
val = (int)(xdot + y*0.3) ;

In the above statement, the lexemes are


val, =, (, int, ), (, xdot, +, y, *, 0.3, ), ;
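A rough way to see this split mechanically (the regular expression below is an illustrative assumption, not the course's token definitions):

    import re

    # Identifier | float | integer | single-character operators and punctuation
    LEXEME_RE = re.compile(r"[A-Za-z_]\w*|\d+\.\d+|\d+|[=+*();]")

    print(re.findall(LEXEME_RE, "val = (int)(xdot + y*0.3) ;"))
    # ['val', '=', '(', 'int', ')', '(', 'xdot', '+', 'y', '*', '0.3', ')', ';']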


Tokens

Tokens are the categories into which lexemes are grouped.

• Identifiers: names chosen by the programmer.
  E.g. val, xdot, y.

• Keywords: names chosen by the language designer to help express syntax
  and structure. E.g. int, return, void. (Keywords that cannot be used as
  identifiers are known as reserved words.)


Tokens (Contd.)

• Operators: identify actions. E.g. +, &&, !

• Literals: denote values directly. E.g. 3.14, -10, ‘a’, true, null

• Punctuation Symbols: support syntactic structure. E.g. (, ), ;, {, }


Examples

Tokens Pattern Simple


Lexeme
while while while
relation_op = | != | < | > <
integer (0-9)* 42
string Characters “hello”
between “ “

BITS Pilani, Hyderabad Campus
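The patterns in the table could be written as regular expressions roughly as follows (an illustrative sketch; the exact regex forms are assumptions):

    import re

    PATTERNS = {
        "while":       r"while",
        "relation_op": r"=|!=|<|>",
        "integer":     r"[0-9]+",
        "string":      r'"[^"]*"',     # characters between double quotes
    }

    SAMPLES = {"while": "while", "relation_op": "<",
               "integer": "42", "string": '"hello"'}

    for token, pattern in PATTERNS.items():
        # Each sample lexeme from the table matches its pattern in full.
        assert re.fullmatch(pattern, SAMPLES[token]), (token, SAMPLES[token])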


A program fragment viewed as a stream of Tokens


Example - 1

int max(int a, int b)
{
    if(a>b)
        return a;
    else
        return b;
}

Lexeme    Token
int       Keyword
max       Identifier
(         Separator
int       Keyword
a         Identifier
,         Separator
int       Keyword
b         Identifier
)         Separator
{         Separator
if        Keyword
..        ..
Example - 2

Input string: size = r * 32 + c

<token, lexeme> pairs:

<id, size>
<assign, =>
<id, r>
<arith_sym, *>
<integer, 32>
<arith_sym, +>
<id, c>
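A small sketch of a scanner that produces exactly these pairs, using the token names from the slide (the regular expressions themselves are assumptions):

    import re

    # Token patterns, tried as named alternatives of one master regex.
    SPEC = [
        ("id",        r"[A-Za-z_]\w*"),
        ("integer",   r"\d+"),
        ("assign",    r"="),
        ("arith_sym", r"[+*]"),
        ("ws",        r"\s+"),          # matched but not emitted
    ]
    MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in SPEC))

    def tokenize(text):
        for m in MASTER.finditer(text):
            if m.lastgroup != "ws":
                yield (m.lastgroup, m.group())

    print(list(tokenize("size = r* 32 + c")))
    # [('id', 'size'), ('assign', '='), ('id', 'r'), ('arith_sym', '*'),
    #  ('integer', '32'), ('arith_sym', '+'), ('id', 'c')]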


In general

(Figure: the lexical analyzer reads the source program text and produces a stream of tokens.)
Attributes for tokens
➢ When more than one lexeme can match a pattern, the lexical analyzer must
  provide the subsequent compiler phases with additional information about the
  particular lexeme that matched, i.e., <token name, attribute value>.
➢ The token name influences parsing decisions.
➢ The attribute value influences the translation of tokens after the parse.

➢ We shall assume that tokens have at most one associated attribute, although
  this attribute may have a structure that combines several pieces of
  information.
➢ Example: token name id.
  ➢ Information about an identifier (its lexeme, its type, and the location at
    which it is first found) is kept in the symbol table.
  ➢ Thus, the appropriate attribute value for an identifier is a pointer to
    the symbol table entry for that identifier.
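A minimal sketch of this arrangement (all names are assumptions): the attribute of an id token is an index into the symbol table, whose entry records the lexeme, its type, and where it was first seen.

    symbol_table = []        # list of entries; the index plays the role of the "pointer"
    index_of = {}            # lexeme -> index, so repeated identifiers reuse one entry

    def install_id(lexeme, line):
        if lexeme not in index_of:
            index_of[lexeme] = len(symbol_table)
            symbol_table.append({"lexeme": lexeme, "type": None, "first_line": line})
        return index_of[lexeme]

    tok = ("id", install_id("count", 3))   # <token name, attribute value>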


Specification of Tokens
➢ We need a formal way to specify patterns: regular expressions

➢ Recap of some basic terminology:

➢ Alphabet: any finite set of symbols

➢ String over an alphabet: a finite sequence of symbols drawn from that alphabet

➢ Language: a countable set of strings over some fixed alphabet

➢ Empty string 𝜖: the string of length zero

➢ String concatenation: xy denotes the string x followed by the string y
Operations on Languages
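As a recap, the standard operations on languages 𝐿 and 𝑀 are:

    Union:            𝐿 ∪ 𝑀 = { s | s is in 𝐿 or s is in 𝑀 }
    Concatenation:    𝐿𝑀 = { st | s is in 𝐿 and t is in 𝑀 }
    Kleene closure:   𝐿∗ = union of 𝐿^i for i ≥ 0  (zero or more concatenations of 𝐿)
    Positive closure: 𝐿+ = union of 𝐿^i for i ≥ 1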


Regular Expressions (R.E.)

➢ 𝜖 is a R.E. (denotes the language {𝜖})

➢ ∅ is a R.E. (denotes the empty language)

➢ For each 𝑎 ∈ Σ, 𝑎 is a R.E. (denotes the language {𝑎})

Let 𝑟1 and 𝑟2 be R.E.s denoting the languages 𝐿1 and 𝐿2.

➢ 𝑟1 | 𝑟2 is also a R.E., denoting 𝐿1 ∪ 𝐿2 (| denotes union)

➢ 𝑟1 𝑟2 is also a R.E., denoting the concatenation 𝐿1 𝐿2

➢ If 𝑟 is a R.E. denoting the language 𝐿, then 𝑟∗ is also a R.E., denoting 𝐿∗
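For example, over Σ = {𝑎, 𝑏}:

    (𝑎 | 𝑏)∗          all strings of 𝑎's and 𝑏's (including 𝜖)
    (𝑎 | 𝑏)∗ 𝑎𝑏𝑏      all strings of 𝑎's and 𝑏's ending in 𝑎𝑏𝑏
    𝑎 | 𝑎∗𝑏           the string 𝑎, or zero or more 𝑎's followed by one 𝑏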


Regular definitions
The regular expression for an identifier: it starts with a lower-case
letter and may be followed by a string of lower-case letters and
digits 0, 1, …, 9.
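Written out as a regular definition, this is:

    letter → a | b | … | z
    digit  → 0 | 1 | … | 9
    id     → letter ( letter | digit )∗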




Extensions of regular expressions

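The commonly used shorthands are (standard notation, listed for reference):

    r+       one or more instances of r        ( r+ = r r∗ )
    r?       zero or one instance of r         ( r? = r | 𝜖 )
    [a-z]    character class, shorthand for a | b | … | z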


Extensions of regular expressions

Letter → [a-z]
Digit  → [0-9]
ID     → Letter ( Letter | Digit )∗
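The same definition written as a Python regular expression (an illustrative assumption, not taken from the slide):

    import re

    ID = re.compile(r"[a-z][a-z0-9]*")   # Letter ( Letter | Digit )*

    print(bool(ID.fullmatch("count2")))  # True
    print(bool(ID.fullmatch("2count")))  # False: must start with a letter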


Token Recognition
➢ In the previous slides, we learned how to express patterns using
  regular expressions.

➢ Now, we must study how to take the patterns for all the needed
  tokens and build a piece of code that examines the input string
  and finds a prefix that is a lexeme matching one of the patterns.


Transition diagrams for relational operators
• Transition diagram for relop (<, <=, >, >=, =)



Implementation of Transition-Diagrams

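For instance, the relop diagram from the previous slide might be coded by hand as follows (a sketch; the function name and return convention are assumptions). The "retract" action corresponds to not consuming the lookahead character:

    def relop(src, pos):
        # Recognize <, <=, >, >=, = starting at src[pos].
        # Returns (token name, lexeme, next position) or None.
        if pos >= len(src):
            return None
        c = src[pos]
        if c == "<":
            if pos + 1 < len(src) and src[pos + 1] == "=":
                return ("relop", "<=", pos + 2)
            return ("relop", "<", pos + 1)     # retract: only '<' is consumed
        if c == ">":
            if pos + 1 < len(src) and src[pos + 1] == "=":
                return ("relop", ">=", pos + 2)
            return ("relop", ">", pos + 1)
        if c == "=":
            return ("relop", "=", pos + 1)
        return None

    print(relop("a >= b", 2))   # ('relop', '>=', 4)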


Example

(Transition-diagram legend: the initial state, final or accept states, and the
marker that means “retract the forward pointer”.)




Architecture of a DFA-based Lexical Analyzer
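A minimal sketch of one common table-driven architecture (state numbers, tables, and function names are assumptions): a transition table maps (state, character class) to the next state, and accepting states are mapped to token names.

    # A tiny table-driven DFA that accepts identifiers: letter (letter | digit)*
    # States: 0 = start, 1 = inside an identifier (accepting), -1 = dead.
    def char_class(c):
        if c.isalpha(): return "letter"
        if c.isdigit(): return "digit"
        return "other"

    TRANSITIONS = {
        (0, "letter"): 1,
        (1, "letter"): 1,
        (1, "digit"):  1,
    }
    ACCEPTING = {1: "id"}

    def run_dfa(text, pos):
        state, last_accept = 0, None
        while pos < len(text):
            state = TRANSITIONS.get((state, char_class(text[pos])), -1)
            if state == -1:
                break
            pos += 1
            if state in ACCEPTING:
                last_accept = (ACCEPTING[state], pos)   # remember the longest accept so far
        return last_accept

    print(run_dfa("count2 = 5", 0))   # ('id', 6)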


Implementation: Reserved words and identifiers

➢ Recognizing keywords and identifiers presents a problem. Usually, keywords
  like if or then are reserved, so they are not identifiers even though they
  look like identifiers.

➢ Two ways to handle the issue:

  1. Install the reserved words in the symbol table initially, and let a lookup
     on each matched lexeme decide whether it is a keyword or an ordinary
     identifier (see the sketch below).

  2. Create a separate transition diagram for each keyword.
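A minimal sketch of approach 1, using a plain set in place of the symbol table for brevity (the names KEYWORDS and name_or_keyword are assumptions):

    # Reserved words are pre-installed; anything the identifier pattern
    # matches is looked up here to pick its token name.
    KEYWORDS = {"if", "then", "else", "while", "int", "return", "void"}

    def name_or_keyword(lexeme):
        if lexeme in KEYWORDS:
            return (lexeme, None)        # the token name is the keyword itself
        return ("id", lexeme)            # ordinary identifier

    print(name_or_keyword("if"))      # ('if', None)
    print(name_or_keyword("count"))   # ('id', 'count')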
Transition diagram for unsigned numbers

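The usual regular definition behind such a diagram (the standard textbook formulation, given here for reference):

    digit   → [0-9]
    digits  → digit digit∗
    number  → digits ( . digits )? ( E ( + | - )? digits )?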


Architecture of a Transition-Diagram based Lexical Analyzer
➢ There are several ways that a collection of transition diagrams
  can be used to build a lexical analyzer.

➢ Choice 1: We could arrange for the transition diagrams for each
  token to be tried sequentially.
  ➢ How do we resolve the clash between keywords and identifiers
    (if separate transition diagrams are used)?


Cont..
➢ Choice 2: We could run the various transition diagrams “in
  parallel,” feeding the next input character to all of them and
  allowing each one to make whatever transitions it requires.

➢ We must be careful to resolve the case where one diagram
  finds a lexeme that matches its pattern, while one or more
  other diagrams are still able to process input.

➢ Solution: take the longest prefix of the input that matches any pattern.


Cont..
➢ Choice 3 (the preferred approach): Combine all the transition
  diagrams into one.

➢ We allow the combined transition diagram to read input until there is
  no possible next state.

➢ Then take the longest lexeme that matched any pattern.


Ambiguous Token Rule Sets

We resolve ambiguities using two rules:

– Longest match: the regular expression that matches the
  longest string takes precedence.

– Rule priority: the regular expressions identifying tokens
  are written down in sequence. If two regular expressions
  match the same (longest) string, the first regular
  expression in the sequence takes precedence.
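Both rules in a small sketch (pattern names and regexes are assumptions): every rule is tried at the current position, the longest match wins, and a tie on length is broken by the order in which the rules are listed.

    import re

    # Rules in priority order: "if" is listed before "id", so when both match
    # the same longest string ("if"), the keyword wins (rule priority).
    RULES = [
        ("if",  re.compile(r"if")),
        ("id",  re.compile(r"[a-z][a-z0-9]*")),
        ("num", re.compile(r"[0-9]+")),
    ]

    def next_token(text, pos):
        best = None                      # (length, rule index, name, lexeme)
        for i, (name, pattern) in enumerate(RULES):
            m = pattern.match(text, pos)
            if m and (best is None or len(m.group()) > best[0]):   # longest match
                best = (len(m.group()), i, name, m.group())
        return None if best is None else (best[2], best[3])

    print(next_token("if", 0))      # ('if', 'if')    rule priority breaks the tie
    print(next_token("ifx", 0))     # ('id', 'ifx')   longest match beats "if"
    print(next_token("42abc", 0))   # ('num', '42')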


Thank you
