Lexical Analysis: Dr. Murali Krishna Enduri Department of CSE
Structure of Compiler
LEXICAL ANALYSIS
• Basic Concepts & Regular Expressions
• What does a Lexical Analyzer do?
• How does it Work?
• Formalizing Token Definition & Recognition
• Reviewing Finite Automata Concepts
• Non-Deterministic and Deterministic FA
• Conversion Process
• Regular Expressions to NFA
• NFA to DFA
• Relating NFAs/DFAs/Conversion to Lexical Analysis
Structure of Compiler
NEED AND ROLE OF LEXICAL ANALYZER
1. Lexical analysis is the first phase of a compiler. It reads the input characters of the source program from left to right, one character at a time.
2. It generates a sequence of tokens, one for each lexeme. Each token is a logically cohesive unit such as an identifier, keyword, operator, or punctuation mark.
3. It enters lexemes into the symbol table and also reads information back from it.
Why Separate Lexical and Syntactic Analysis?
There are several reasons for separating the analysis phase of compiling into lexical analysis and parsing:
» It leads to a simpler design of the parser, as unnecessary tokens can be eliminated by the scanner.
» The efficiency of compilation is improved. Lexical analysis is the most time-consuming phase of compilation, and specialized buffering techniques can be used to speed it up.
» The portability of the compiler is enhanced, as specialized symbols and characters (language- and machine-specific) are isolated in this phase.
Lexical Analyzer in Perspective
Basic Terminology
Token classes:
• One token per keyword
• Tokens for the operators
• One token representing all identifiers
• Tokens representing constants (e.g. numbers)
• Tokens for punctuation symbols
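As a sketch (not the slides' own code), the token classes listed above can be modeled as a small classifier. The keyword, operator, and punctuation sets below are illustrative subsets, chosen only for the example.

```python
from enum import Enum, auto

class TokenClass(Enum):
    KEYWORD = auto()   # one token per keyword (if, while, ...)
    OPERATOR = auto()  # tokens for the operators (+, *, =, ...)
    ID = auto()        # one token representing all identifiers
    NUM = auto()       # tokens representing constants (e.g. numbers)
    PUNCT = auto()     # tokens for punctuation symbols

# Illustrative subsets, not a full language definition
KEYWORDS = {"if", "else", "while"}
OPERATORS = {"+", "-", "*", "/", "="}
PUNCTUATION = {"(", ")", ";", ","}

def classify(lexeme: str) -> TokenClass:
    """Map a single lexeme to its token class."""
    if lexeme in KEYWORDS:
        return TokenClass.KEYWORD
    if lexeme in OPERATORS:
        return TokenClass.OPERATOR
    if lexeme in PUNCTUATION:
        return TokenClass.PUNCT
    if lexeme.isdigit():
        return TokenClass.NUM
    return TokenClass.ID

print(classify("if"))    # TokenClass.KEYWORD
print(classify("count")) # TokenClass.ID
print(classify("123"))   # TokenClass.NUM
```

Note how all identifiers share one token class while each keyword is distinguished, exactly as the bullets describe.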
Example
Example of TOKENS
Example of NON TOKENS
Tasks of the Lexical Analyzer
» Keeping track of line numbers while scanning newline characters. These line numbers are used by the error handler to print error messages.
» Preprocessing of macros
Identify the tokens and lexemes:
1. x = x * (acc + 123)
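One way to work this exercise mechanically is a small regex-driven scanner. This is an illustrative sketch, not the course's lexer; the token names and patterns are assumptions chosen for this one statement.

```python
import re

# Token specification as (name, pattern) pairs; more specific patterns
# (numbers) come before more general ones (identifiers).
TOKEN_SPEC = [
    ("NUM",    r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("ASSIGN", r"="),
    ("MUL",    r"\*"),
    ("PLUS",   r"\+"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(src: str):
    """Yield (token_name, lexeme) pairs for the input string."""
    for m in MASTER.finditer(src):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("x=x*(acc+123)")))
# [('ID', 'x'), ('ASSIGN', '='), ('ID', 'x'), ('MUL', '*'),
#  ('LPAREN', '('), ('ID', 'acc'), ('PLUS', '+'), ('NUM', '123'),
#  ('RPAREN', ')')]
```

So the lexemes are x, =, x, *, (, acc, +, 123, ) and each maps to the token class shown.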
• Alphabet: any finite set of symbols
• String over alphabet: finite sequence of symbols drawn from that alphabet
• Language: countable set of strings over some fixed alphabet
Regular Language:
Formal Definition of a Finite Automaton
Language operations
Concatenation
27
28
29
30
Nondeterministic Finite Automata:
o Generalize FAs by adding nondeterminism, allowing several alternative computations on the same input string.
o Ordinary deterministic FAs follow one path on each input.
By continuing to experiment in this way, you will see that N1 accepts all
strings that contain either 101 or 11 as a substring.
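The claim about N1 can be checked by simulating the NFA on a set of current states. The transition table below is a sketch of the standard four-state NFA for this language (with an epsilon move from q2 to q3); the slides' diagram may label states differently.

```python
# Assumed transitions for N1: accepts strings containing 101 or 11.
# "eps" marks an epsilon move.
NFA = {
    ("q1", "0"): {"q1"},
    ("q1", "1"): {"q1", "q2"},
    ("q2", "0"): {"q3"},
    ("q2", "eps"): {"q3"},
    ("q3", "1"): {"q4"},
    ("q4", "0"): {"q4"},
    ("q4", "1"): {"q4"},
}
START, ACCEPT = "q1", {"q4"}

def eps_closure(states):
    """All states reachable through epsilon moves alone."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, "eps"), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def accepts(w: str) -> bool:
    current = eps_closure({START})
    for ch in w:
        moved = set()
        for s in current:
            moved |= NFA.get((s, ch), set())
        current = eps_closure(moved)
    return bool(current & ACCEPT)

for w in ["0101", "11", "000", "10"]:
    print(w, accepts(w))  # True, True, False, False
```

Strings with 101 or 11 as a substring reach q4 and stay there; all others never do.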
Formal Definition of a Nondeterministic Finite Automaton
NFA Example
DFA VS NFA
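Every NFA has an equivalent DFA, obtained by the subset construction: each DFA state is a set of NFA states. A minimal sketch, using an assumed example NFA (strings over {0,1} ending in "01"; epsilon moves omitted for brevity):

```python
# Illustrative NFA: accepts binary strings ending in "01".
nfa = {
    ("A", "0"): {"A", "B"},
    ("A", "1"): {"A"},
    ("B", "1"): {"C"},
}
nfa_start, nfa_accept = "A", {"C"}

def subset_construction(delta, start, accept):
    """Build a DFA whose states are frozensets of NFA states."""
    alphabet = {sym for (_, sym) in delta}
    dfa_delta = {}
    start_set = frozenset({start})
    worklist, seen = [start_set], {start_set}
    while worklist:
        S = worklist.pop()
        for sym in alphabet:
            T = frozenset(t for s in S for t in delta.get((s, sym), ()))
            dfa_delta[(S, sym)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    dfa_accept = {S for S in seen if S & accept}
    return dfa_delta, start_set, dfa_accept

delta, q0, F = subset_construction(nfa, nfa_start, nfa_accept)

def run(w: str) -> bool:
    S = q0
    for ch in w:
        S = delta[(S, ch)]
    return S in F

print(run("1101"), run("10"))  # True False
```

In the worst case the DFA has exponentially many states, which is why the NFA representation can be much more compact.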
REGULAR EXPRESSIONS
• Regular expressions describe regular languages
• Example: (a + b·c)* describes the language {a, bc}* = {λ, a, bc, aa, abc, bca, ...}

Given regular expressions r1 and r2:
r1 + r2
r1 · r2
r1*
(r1)
are regular expressions.
A regular expression: (a + b·c)*·(c + λ)
Example: L((a + b·c)*) = {λ, a, bc, aa, abc, bca, ...}
L(∅) = ∅
L(λ) = {λ}
L(a) = {a}
For regular expressions r1 and r2:
• L(r1 + r2) = L(r1) ∪ L(r2)
• L(r1 · r2) = L(r1) L(r2)
• L(r1*) = (L(r1))*
• L((r1)) = L(r1)
• Regular expression: (a + b)·a*
L((a + b)·a*) = L(a + b) L(a*)
= (L(a) ∪ L(b)) (L(a))*
= {a, b} {λ, a, aa, aaa, ...}
= {a, aa, aaa, ..., b, ba, baa, ...}
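The worked example above can be cross-checked by brute force: enumerate all short strings over {a, b} and compare regex membership against the set description. This is a sketch; Python's `|` plays the role of the union operator +.

```python
import re
from itertools import product

# (a + b)·a* in Python regex syntax
pattern = re.compile(r"(a|b)a*")

def in_language(w: str) -> bool:
    # Direct reading of {a, aa, ..., b, ba, baa, ...}:
    # one leading a or b, then zero or more a's.
    return len(w) >= 1 and w[0] in "ab" and all(c == "a" for c in w[1:])

for n in range(5):
    for tup in product("ab", repeat=n):
        w = "".join(tup)
        assert bool(pattern.fullmatch(w)) == in_language(w)
print("regex agrees with the set description for all strings up to length 4")
```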
• Regular expression r = (a + b)*(a + bb)
L(r) = {a, bb, aa, abb, ba, bbb, ...}
• Regular expression r = (0 + 1)* 00 (0 + 1)*
L(r) = all binary strings containing 00 as a substring
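The second expression translates directly into a Python regex (union + becomes |), which makes the "contains 00" reading easy to test:

```python
import re

# (0 + 1)* 00 (0 + 1)*: any binary string with 00 somewhere inside
contains_00 = re.compile(r"(0|1)*00(0|1)*")

for w in ["100", "0101", "000", "11"]:
    print(w, bool(contains_00.fullmatch(w)))  # True, False, True, False
```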
Mini QUIZ-9:
https://bit.ly/3hMYX30
REGULAR EXPRESSIONS to NFA
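The conversion from regular expressions to NFAs is usually done with Thompson's construction: build a small NFA fragment for each symbol, then combine fragments for union, concatenation, and star using epsilon moves. A minimal sketch (the fragment representation is an assumption, not the slides' notation):

```python
import itertools

# Each fragment is (start, accept, transitions); a transition is
# (state, symbol, state) and symbol None is an epsilon move.
counter = itertools.count()

def symbol(a):
    s, f = next(counter), next(counter)
    return (s, f, [(s, a, f)])

def concat(n1, n2):
    s1, f1, t1 = n1
    s2, f2, t2 = n2
    return (s1, f2, t1 + t2 + [(f1, None, s2)])

def union(n1, n2):
    s1, f1, t1 = n1
    s2, f2, t2 = n2
    s, f = next(counter), next(counter)
    return (s, f, t1 + t2 + [(s, None, s1), (s, None, s2),
                             (f1, None, f), (f2, None, f)])

def star(n1):
    s1, f1, t1 = n1
    s, f = next(counter), next(counter)
    return (s, f, t1 + [(s, None, s1), (f1, None, s), (s, None, f)])

def accepts(nfa, w):
    """Simulate the NFA using epsilon-closures over sets of states."""
    start, accept, trans = nfa
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for (p, a, r) in trans:
                if p == q and a is None and r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen
    cur = closure({start})
    for ch in w:
        cur = closure({r for (p, a, r) in trans if p in cur and a == ch})
    return accept in cur

# (a + b)·a* built from the fragments above
nfa = concat(union(symbol("a"), symbol("b")), star(symbol("a")))
print(accepts(nfa, "baa"), accepts(nfa, "ab"))  # True False
```

Each operator adds a constant number of states, so the NFA for an expression of length n has O(n) states.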
Regular Expressions
We need a formal way to specify patterns: regular expressions
• Alphabet: any finite set of symbols
• String over alphabet: finite sequence of symbols drawn from that alphabet
• Language: countable set of strings over some fixed alphabet
Star Operation (Kleene Closure)
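Concatenation and star can be computed explicitly for finite languages, which makes the definitions concrete. Since L* is infinite, this sketch enumerates it only up to a length bound:

```python
def concat(L1, L2):
    """Language concatenation: every x in L1 followed by every y in L2."""
    return {x + y for x in L1 for y in L2}

def star(L, max_len):
    """All strings in L* = {λ} ∪ L ∪ LL ∪ ... of length <= max_len."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = {w for w in concat(frontier, L) if len(w) <= max_len}
        frontier -= result
        result |= frontier
    return result

L = {"ab", "c"}
s = star(L, 4)
print("" in s, "abc" in s, "ba" in s)  # True True False
```

The empty string is always in L*, and "ba" is not, since no concatenation of "ab" and "c" produces it.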
Operations on Languages
Formal Definition of Regular Expressions
Rules for specifying Regular Expressions
Regular Expressions Precedence
Regular Expressions Examples
Token Specification
Regular Definition
Overall
Token Recognition
What Else Does the Lexical Analyzer Do?
Implementation: Transition Diagrams
• An intermediate step in constructing the lexical analyzer
• Convert patterns into flowcharts called transition diagrams
– Nodes (circles): called states
– Edges: directed from one state to another, labeled by symbols
Example TDs: unsigned numbers
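The classic transition diagram for unsigned numbers (digits, an optional fraction, an optional exponent) can be hand-coded as a recognizer, with each `if` block playing the role of one section of the diagram. This is a sketch of the usual Dragon-book-style diagram; the exact states on the slide may differ.

```python
def is_unsigned_number(s: str) -> bool:
    """Recognize digit+ ('.' digit+)? (('E'|'e') ('+'|'-')? digit+)?"""
    n = len(s)

    def digits(i):
        # Advance past a (possibly empty) run of digits, return new index.
        j = i
        while j < n and s[j].isdigit():
            j += 1
        return j

    i = digits(0)
    if i == 0:                      # must start with at least one digit
        return False
    if i < n and s[i] == ".":       # optional fraction: '.' digit+
        j = digits(i + 1)
        if j == i + 1:
            return False
        i = j
    if i < n and s[i] in "Ee":      # optional exponent: E (+|-)? digit+
        i += 1
        if i < n and s[i] in "+-":
            i += 1
        j = digits(i)
        if j == i:
            return False
        i = j
    return i == n                   # accept only if all input consumed

print(is_unsigned_number("12.3E+4"))  # True
print(is_unsigned_number("12."))      # False
```

A table-driven version would store the same diagram as a state-transition table instead of nested conditionals; the logic is identical.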
Example TDs (continued)
What would the transition diagram (TD) for strings containing each vowel, in strict lexicographical order, look like?
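One reading of this question (an assumption, since the slide's answer is not in the text): accept lowercase strings in which all five vowels appear and their first occurrences come in the order a, e, i, o, u. The diagram then has six states, where state k means the first k vowels have been matched; a later vowel appearing too early is a dead end. A sketch:

```python
VOWELS = "aeiou"

def vowels_in_order(s: str) -> bool:
    state = 0                       # how many vowels matched so far
    for ch in s:
        if state < 5 and ch == VOWELS[state]:
            state += 1              # next expected vowel seen
        elif ch in VOWELS[state:]:
            return False            # a later vowel appeared too early
    return state == 5               # accept only in the final state

print(vowels_in_order("abstemious"))  # True
print(vowels_in_order("education"))   # False
```

Consonants and already-seen vowels self-loop in every state, which is why "abstemious" and "facetious" are accepted.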
Basics of DFA, RE, NFA
Some Basic Terminology
Handling Lexical Errors
1. It is hard for the lexical analyzer to tell, without the aid of other components, that there is a source-code error.
-- If the string fi is encountered for the first time in a C program, the lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared identifier.
-- The parser will probably be able to handle this case.

In what situations do errors occur?
When the lexical analyzer is unable to proceed because none of the patterns for tokens matches a prefix of the remaining input.

Panic mode recovery: delete successive characters from the remaining input until a valid token is found.
Other possible error-recovery actions:
• Insert a missing character
• Delete a character
• Replace a character by another
• Transpose two adjacent characters

Minimum distance error correction:
• A strategy the lexical analyzer can follow to correct errors in lexemes.
• It is the minimum number of corrections needed to convert an invalid lexeme into a valid one.
• It is not generally used in practice because it is too costly to implement.
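Panic mode recovery can be sketched in a few lines: when no token pattern matches at the current position, report the offending character and delete it, then resume scanning. The token pattern below is an illustrative assumption, not a full lexical specification.

```python
import re

# Illustrative token patterns: identifiers, numbers, a few operators.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[+*=()]")

def scan_with_recovery(src: str):
    """Return (tokens, errors); errors are skipped (position, char) pairs."""
    tokens, errors, i = [], [], 0
    while i < len(src):
        if src[i].isspace():
            i += 1
            continue
        m = TOKEN.match(src, i)
        if m:
            tokens.append(m.group())
            i = m.end()
        else:
            errors.append((i, src[i]))  # panic mode: report and skip
            i += 1
    return tokens, errors

toks, errs = scan_with_recovery("x = 1 @ y")
print(toks, errs)  # ['x', '=', '1', 'y'] [(6, '@')]
```

The scanner never gets stuck: every unmatched character is consumed, so progress is guaranteed.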
Example Programs