
Chapter Two: Lexical Analysis



o Regular Expressions and Finite Automata

o Conversion RE-NFA-DFA

o Lexical Analyzer Generator

Prepared by Befkadu (MSc) : 2012 E.C\2018-2020 A.C in AASTU

o Lexical analysis is the first phase of a compiler.

o The role of the lexical analyzer is to read a sequence
of characters from the source program and produce
tokens to be used by the parser.

o The lexical analyzer breaks the source code into a
series of tokens, discarding any whitespace and
comments along the way.

o The main task of the lexical analyzer is to read the
input characters of the source program, group them
into lexemes, and produce as output a sequence of
tokens, one for each lexeme in the source program.

o The stream of tokens is sent to the parser for syntax
analysis.
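
The grouping of characters into lexemes can be sketched in a few lines of Python. This is only an illustration: the token names (num, ident, op) and their patterns are assumptions chosen for the example, not part of any particular language definition.

```python
import re

# Hypothetical token names and patterns, chosen only for illustration.
TOKEN_SPEC = [
    ("num",   r"\d+"),
    ("ident", r"[A-Za-z][A-Za-z0-9]*"),
    ("op",    r"[+\-*/=]"),
    ("skip",  r"\s+"),          # whitespace is recognized but not reported
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Group characters into lexemes and yield (token_name, lexeme) pairs."""
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        if m.lastgroup != "skip":
            yield (m.lastgroup, m.group())
        pos = m.end()
```

A parser would then consume the pairs one at a time, for example with `for name, lexeme in tokenize("x = y + 2")`.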


o The lexical analyzer also interacts with the symbol
table. For example, when the lexical analyzer discovers
a lexeme constituting an identifier, it needs to enter
that lexeme into the symbol table.

o The following are additional tasks performed by the
lexical analyzer other than identifying lexemes:
  o Stripping out comments and whitespace (blank, newline,
    and tab)
  o Correlating error messages generated by the compiler
    with the source program by keeping track of line
    numbers (using newline characters)
  o Expanding macros in some lexical analyzers
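
As a rough sketch of the first two tasks, the function below drops comments and whitespace while counting newline characters, so that each lexeme can be reported with its line number. The comment syntax (`//` to end of line) and the rule that any other run of non-space characters is one lexeme are assumptions made only to keep the example short.

```python
import re

# Assumed conventions: '//' starts a comment that runs to the end of the line,
# and any other run of non-whitespace characters counts as one lexeme.
PATTERN = re.compile(
    r"(?P<comment>//[^\n]*)"
    r"|(?P<newline>\n)"
    r"|(?P<space>[ \t]+)"
    r"|(?P<lexeme>(?:(?!//)\S)+)"
)


def lexemes_with_lines(source):
    """Return (line_number, lexeme) pairs, skipping comments and whitespace."""
    line = 1
    result = []
    for m in PATTERN.finditer(source):
        if m.lastgroup == "newline":
            line += 1                      # newline characters drive the line counter
        elif m.lastgroup == "lexeme":
            result.append((line, m.group()))
        # comments and blanks/tabs are simply discarded
    return result
```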

Tokens, Patterns, and Lexemes
o A token is a pair consisting of a token name and an optional attribute
value. The token name is an abstract symbol representing a kind of
lexical unit, e.g., a keyword, an identifier, etc. The token names are the
input symbols that the parser processes. In any programming language,
keywords, operators, identifiers, constants, literals, and punctuation
symbols are tokens.

o A lexeme is a sequence of characters in the source program that forms
an instance of a token. There are predefined rules for every lexeme to be
identified as a valid token. These rules are defined by grammar rules, by
means of a pattern.

• A pattern is a description of the form that the lexemes of a token may take.
For a keyword token, the pattern is the sequence of characters that forms
the keyword. For identifiers and some other tokens, the pattern is a more
complex structure that is matched by many strings.

• In a programming language, keywords, constants, identifiers, strings,
numbers, operators, and punctuation symbols can be considered as tokens.


Token    Some lexemes                     Informal pattern
begin    begin, Begin, BEGIN, beGin, …    "begin" in small or capital letters
if       if, IF, iF, If                   "if" in small or capital letters
ident    Distance, F1, x, Dist1, …        Letter followed by zero or more letters and/or digits

Attributes of tokens
o When more than one lexeme matches a pattern, the
scanner must provide additional information about
the particular lexeme to the subsequent phases of
the compiler.

o For example, both 0 and 1 match the pattern for the
token num, but the code generator needs to know which
number was recognized.

• The lexical analyzer collects information about tokens into
their associated attributes.
• In practice, a token usually has a single attribute: a pointer
to the symbol-table entry in which the information about the
token is kept.
• The symbol-table entry contains information about the token
such as the lexeme, the line number on which it was first
seen, …

• For example, consider x = y + 2
The tokens and their attributes are written as:
    <id, pointer to symbol-table entry for x>
    <assign_op, >
    <id, pointer to symbol-table entry for y>
    <plus_op, >
    <num, integer value 2>
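
The same token stream can be produced by a small scanner. In this sketch the symbol table is a plain Python list and the "pointer" is just an index into it; both choices, as well as the helper name `scan`, are assumptions made for illustration. Only the four token kinds needed for the example are handled.

```python
import re

# Token names follow the example above; the patterns themselves are assumed.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<id>[A-Za-z][A-Za-z0-9]*)"
    r"|(?P<num>\d+)"
    r"|(?P<assign_op>=)"
    r"|(?P<plus_op>\+))"
)


def scan(source):
    """Return (tokens, symbol_table); each token is a (name, attribute) pair."""
    symbol_table = []          # list of lexemes; an index stands in for the pointer
    tokens = []
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"no pattern matches at position {pos}")
        kind = m.lastgroup
        lexeme = m.group(kind)
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)      # enter the identifier once
            tokens.append((kind, symbol_table.index(lexeme)))
        elif kind == "num":
            tokens.append((kind, int(lexeme)))   # attribute is the integer value
        else:
            tokens.append((kind, None))          # operators carry no attribute
        pos = m.end()
    return tokens, symbol_table
```

Calling `scan("x = y + 2")` yields one token per lexeme, with x and y stored at indices 0 and 1 of the symbol table.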

o Very few errors are detected by the lexical analyzer.

o For example, if the programmer mistypes while as wihle,
the lexical analyzer cannot detect the error (why?)

o Nonetheless, if a certain sequence of characters
follows none of the specified patterns, the lexical
analyzer can detect the error.
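
A minimal sketch makes the distinction concrete. The keyword set and the identifier pattern below are assumptions for the example: wihle is not a keyword, but it still matches the identifier pattern, so the scanner accepts it; a lexeme such as @x matches no pattern and is reported.

```python
import re

KEYWORDS = {"while", "if", "begin"}            # assumed keyword set
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")   # letter (letter|digit)*


def classify(lexeme):
    """Return the token kind for a lexeme, or 'error' if no pattern matches."""
    if lexeme in KEYWORDS:
        return "keyword"
    if IDENT.fullmatch(lexeme):
        return "ident"      # a misspelled keyword lands here: it is a valid identifier
    return "error"
```

Whether wihle was meant to be while can only be decided later, when the parser sees an identifier where a statement keyword is expected.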

o When an error occurs, the lexical analyzer recovers by:
  o skipping (deleting) successive characters from the remaining
    input until the lexical analyzer can find a well-formed token
    (panic mode recovery)
  o deleting extraneous (unimportant) characters
  o inserting missing characters
  o replacing an incorrect character by a correct character
  o transposing two adjacent characters
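
Panic mode recovery, the first strategy in the list, could look roughly like this. The token patterns and the function name `tokenize_with_recovery` are assumed for the example; the other strategies (insertion, replacement, transposition) are not shown.

```python
import re

# Assumed token patterns, for illustration only.
TOKEN = re.compile(
    r"(?P<ident>[A-Za-z][A-Za-z0-9]*)"
    r"|(?P<num>\d+)"
    r"|(?P<op>[+\-*/=])"
    r"|(?P<skip>\s+)"
)


def tokenize_with_recovery(source):
    """Return (tokens, errors); errors are (position, skipped_text) pairs."""
    tokens, errors = [], []
    pos = 0
    while pos < len(source):
        m = TOKEN.match(source, pos)
        if m is None:
            # Panic mode: delete characters until some pattern matches again.
            start = pos
            while pos < len(source) and TOKEN.match(source, pos) is None:
                pos += 1
            errors.append((start, source[start:pos]))
            continue
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens, errors
```

The scanner keeps going after the error, so one stray character does not hide later problems in the same file.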

Specifying and recognizing tokens
o Regular expressions are used to specify the patterns of
tokens.

o Each pattern matches a set of strings.

letter → A|B|C|…|Z|a|b|c|…|z

digit → 0|1|…|9

identifier → letter (letter|digit)*
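
The regular definition for identifiers maps almost directly onto Python's `re` syntax. The sketch below assumes that letter means the ASCII letters and digit the ASCII digits; the helper name `is_identifier` is made up for the example.

```python
import re
import string

# letter -> A|B|...|Z|a|b|...|z   and   digit -> 0|1|...|9, as character classes
letter = "[" + string.ascii_letters + "]"
digit = "[" + string.digits + "]"

# identifier -> letter (letter|digit)*
identifier = re.compile(f"{letter}({letter}|{digit})*")


def is_identifier(s):
    """True if the whole string s matches the identifier pattern."""
    return identifier.fullmatch(s) is not None
```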

o Alphabet: a finite set of symbols.

o String: a string over an alphabet is a finite sequence of symbols
from that alphabet, usually written next to one another and
not separated by commas.

o The terms sentence and word are also used as synonyms for string.

o ε is the empty string.

o |s| is the length of string s.

o Substring: z is a substring of w if z appears consecutively within w.

Operations on Languages
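o Concatenation: L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }
o Union: L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }
o Exponentiation: L^0 = {ε}, L^1 = L, L^2 = LL, and in general L^i = L^(i-1) L
o Kleene Closure: L* = L^0 ∪ L^1 ∪ L^2 ∪ … (zero or more concatenations of L; always includes the empty string)
o Positive Closure: L+ = L^1 ∪ L^2 ∪ … (one or more concatenations of L; does not include the empty string unless ε ∈ L)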
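
For finite languages these operations can be tried out with Python sets of strings, writing ε as the empty string `""`. The closures are infinite in general, so the sketch below truncates them at a chosen maximum power; that bound and all function names are assumptions made for the example.

```python
def concat(l1, l2):
    """L1L2 = { s1s2 | s1 in L1 and s2 in L2 }"""
    return {s1 + s2 for s1 in l1 for s2 in l2}


def union(l1, l2):
    """L1 ∪ L2"""
    return l1 | l2


def power(l, n):
    """L^n, with L^0 = {ε} represented as {""}."""
    result = {""}
    for _ in range(n):
        result = concat(result, l)
    return result


def kleene_closure(l, max_power):
    """Finite approximation of L*: L^0 ∪ L^1 ∪ ... ∪ L^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(l, i)
    return result


def positive_closure(l, max_power):
    """Finite approximation of L+: L^1 ∪ ... ∪ L^max_power."""
    result = set()
    for i in range(1, max_power + 1):
        result |= power(l, i)
    return result
```

With L = {"a", "b"} and D = {"0", "1"}, `concat(L, D)` gives the four strings a0, a1, b0, b1, and `kleene_closure(L, 2)` contains the empty string while `positive_closure(L, 2)` does not.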

Automata—What is it?
• An automaton is an abstract model of a digital computer.
• An automaton has a mechanism to read input, which is a
string over a given alphabet. This input is written on an
“input file”, which the automaton can read but cannot
change.

• The automaton has a temporary “storage”
device, which has an unlimited number of cells,
the contents of which can be altered by the
automaton.
• The automaton has a control unit, which is said to
be in one of a finite number of “internal states”.

Types of Automaton
o Deterministic Automata
o Non-deterministic Automata

Deterministic Automata
o A deterministic automaton is one in which each move
(transition from one state to another) is uniquely
determined by the current configuration.
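
As an illustration of "uniquely determined", here is a small table-driven simulation of a deterministic finite automaton. The particular automaton is an assumption chosen for the example: it accepts a letter followed by letters or digits, and a missing table entry is treated as an implicit dead state.

```python
def char_class(c):
    """Map an input character to the symbol class used in the transition table."""
    if c.isascii() and c.isalpha():
        return "letter"
    if c.isascii() and c.isdigit():
        return "digit"
    return "other"


# (state, symbol class) -> next state; at most one entry per pair, which is
# exactly what makes the automaton deterministic.
TRANSITIONS = {
    ("start", "letter"): "in_ident",
    ("in_ident", "letter"): "in_ident",
    ("in_ident", "digit"): "in_ident",
}
ACCEPTING = {"in_ident"}


def accepts(s):
    """Run the DFA on s and report whether it ends in an accepting state."""
    state = "start"
    for c in s:
        state = TRANSITIONS.get((state, char_class(c)))
        if state is None:        # no transition defined: implicit dead state
            return False
    return state in ACCEPTING
```

At every step the pair (current state, next input symbol) selects at most one next state, so the run never has to guess or backtrack.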

Regular Expressions
o Regular expressions were designed to represent regular
languages with a mathematical tool, a tool built from a set of
primitives and operations.

o This representation involves a combination of strings of
symbols from some alphabet Σ, parentheses, and the
operators + (union), · (concatenation), and * (closure).

Languages defined by Regular Expressions
• There is a very simple correspondence between
regular expressions and the languages they denote:
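
In the standard textbook formulation, which may differ in notation from the original slide, the language L(r) denoted by a regular expression r is defined case by case. Here a is any symbol of the alphabet Σ, r_1 and r_2 are arbitrary regular expressions, + is union, and · is concatenation.

```latex
\begin{aligned}
L(\emptyset)      &= \emptyset \\
L(\varepsilon)    &= \{\varepsilon\} \\
L(a)              &= \{a\} \quad \text{for each } a \in \Sigma \\
L(r_1 + r_2)      &= L(r_1) \cup L(r_2) \\
L(r_1 \cdot r_2)  &= L(r_1)\,L(r_2) \\
L((r_1))          &= L(r_1) \\
L(r_1^{*})        &= \bigl(L(r_1)\bigr)^{*}
\end{aligned}
```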

Exercise (revision)

• Determine a deterministic finite state automaton (DFA)
from the given nondeterministic finite state automaton (NFA).
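
One common way to carry out this conversion is the subset construction, sketched below for an NFA without ε-moves. The example NFA is hypothetical (it is not the automaton from the exercise): it accepts strings over {a, b} that end in ab. Each DFA state is a set of NFA states.

```python
from collections import deque

# Hypothetical NFA: (state, symbol) -> set of possible next states.
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}


def nfa_to_dfa(nfa, start, accepting, alphabet):
    """Subset construction for an NFA without epsilon moves."""
    start_set = frozenset([start])
    dfa = {}                                 # frozenset -> {symbol: frozenset}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        for symbol in alphabet:
            # Union of all NFA moves from any state in the current set.
            target = frozenset(t for s in current for t in nfa.get((s, symbol), ()))
            dfa[current][symbol] = target
            if target not in dfa:
                queue.append(target)
    # A DFA state accepts if it contains at least one accepting NFA state.
    dfa_accepting = {state for state in dfa if state & accepting}
    return dfa, start_set, dfa_accepting


def dfa_accepts(dfa, start_set, dfa_accepting, s):
    """Run the constructed DFA on a string over the same alphabet."""
    state = start_set
    for c in s:
        state = dfa[state][c]
    return state in dfa_accepting
```

For the example NFA, `nfa_to_dfa(NFA, 0, {2}, "ab")` reaches three DFA states: {0}, {0, 1}, and {0, 2}, of which only {0, 2} is accepting.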
