2 - Lexical Analyzer Lecture 01
i f ( x 1 * x 2 < 1 . 0 ) {
if ( x1 * x2 < 1.0 ) {
• Error reporting
• Model using regular expressions
• Recognize using finite state automata
• Sentences consist of strings of tokens
• Examples: number, identifier, string, keyword
s^0 = ε
s^i = s^(i-1) s for i > 0
note that εs = sε = s, so s^1 = s (e.g., if s = ab, then s^2 = abab)
• Certain languages do not have any reserved words, e.g., while, do, if,
else, etc., are reserved in ’C’, but not in PL/1
• Lexemes in a fixed position. Fixed format vs. free format languages
• FORTRAN Fixed Format and some keywords are context-dependent
• 80 columns per line
• Column 1-5 for the statement number/label column
• Column 6 for the continuation mark
• Column 7-72 for the program statements
• Column 73-80 Ignored (Used for other purpose)
• A letter C in column 1 means the current line is a comment
• Blanks are not significant in FORTRAN and can appear in the midst of identifiers, but
not so in ’C’
• Handling of blanks
• in C, blanks separate identifiers
• in FORTRAN, blanks are important only in literal strings
• the variable counter is the same as count er
• Another example
DO 10 I = 1.25 DO10I=1.25
DO 10 I = 1,25 DO10I=1,25
• The first line is an assignment to the variable DO10I (of the value 1.25)
• The second line is the beginning of a DO loop
• The lexical analyzer cannot catch significant errors; it detects only simple ones, such as illegal symbols
• In such cases, the lexical analyzer skips characters in the input until a well-formed token is found
• How to describe tokens
2.e0 20.e-01 2.000 (three different lexemes denoting the same value 2.0)
• How to break text into tokens
if (x==0) a = x << 1;
if (x==0) a = x < 1;
• How to break input into tokens efficiently
• Tokens may have similar prefixes
• Each character should be looked at only once
• Programming language tokens can be described by regular languages
• Regular languages
• Are easy to understand
• There is a well understood and useful theory
• They have efficient implementation
• Regular languages have been discussed in great detail in the “Theory
of Computation” course
Specification of Patterns for Tokens:
• Example:
letter → A | B | … | Z | a | b | … | z
digit  → 0 | 1 | … | 9
id     → letter ( letter | digit )*
• The following shorthands are often used:
r+ = r*r = rr*
r? = r | ε
[a-z] = a | b | c | … | z
• Examples:
digit → [0-9]
num   → digit+ (. digit+)? ( E (+|-)? digit+ )?
(a) All strings of lowercase letters and digits that contain the five vowels in order.
(b) All strings of digits that contain a triple of each of the five vowels in order.
(c) All strings of a's and b's that do not contain the substring abbbb.
(d) All strings of a's and b's that do not contain the subsequence abb.
(e) All strings of uppercase letters and digits that contain the first three lowercase letters together, four times, in order.
Grammar:

stmt → if expr then stmt
     | if expr then stmt else stmt
expr → term relop term
     | term
term → id
     | num

Regular definitions:

if    → if
then  → then
else  → else
relop → < | <= | <> | > | >= | =
id    → letter ( letter | digit )*
num   → digit+ (. digit+)? ( E (+|-)? digit+ )?
Identify lexeme patterns, keywords and tokens in the following
piece of code.
Draw a transition diagram for the alphanumeric pattern.
Let P, Q and R be the sets of uppercase letters (A…Z), lowercase letters (a…z) and
digits (0…9). Describe the following languages, formed with the three operations union,
concatenation and closure.
(a) P U Q; (b) P U Q U R;
(c) PR U QR; (d) PQR;
(e) P*; (f) P* U Q*;
(g) P*Q*; (h) P+;
(i) P+ U Q+; (j) R(P* U Q*)
(k) P(P*Q*)R; (m) R⁶;
(n) P+Q*; (o) P*R+
Let the alphabet Σ = {a, b}. Describe the following regular expressions.
(a|b)*
(a*|b)*
(a*b)|a
a+b
(ab)(ab)
(a|b)(a|b)
(ab)(a|b)
10(a|b)*00(a|b)*
1*b(a|b)aaab*0
(a+b*)aab
a*ba*ba*ba*ba*
(a|b)*(a|b)a(a|b)
a*(ba*ba*)*
• Regular expressions are declarative specifications
• A transition diagram is an implementation
• Transition diagrams have a collection of nodes or circles, called states. Each state
represents a condition that could occur during the process of scanning the input
looking for a lexeme that matches one of several patterns.
• Edges are directed from one state of the transition diagram to another. Each edge is
labeled by a symbol or set of symbols.
• Certain states are said to be accepting, or final. These states indicate that a lexeme has
been found, although the actual lexeme may not consist of all positions between the
lexemeBegin and forward pointers.
• In addition, if it is necessary to retract the forward pointer one position, then we shall
additionally place a * near that accepting state.
• One state is designated as start state, or initial state; it is indicated by an edge, labeled
“start," entering from nowhere.
The Lex and Flex Scanner Generators
• LEX and FLEX are scanner generators
• In modern compiler design tools, LEX is most often encountered as FLEX.
• LEX/FLEX is used to specify a lexical analyzer by specifying regular expressions
to describe patterns for tokens.
• LEX/FLEX tool can be used to generate scanner programs which are further
used to recognize lexical patterns in texts.
• FLEX is an open-source tool used as an alternative to LEX.
• Both tools are used with parser generators, such as Yacc and Bison. The
control flow of the scanner programs generated by LEX and FLEX is directed
by the regular expressions matched against the sequence of input characters.
• Systematically translate regular definitions into C source code for efficient
scanning.
The Lex and Flex Scanner Generators
• A lexical analyzer is described in the LEX language in a source file such as ‘lex.l’;
the LEX compiler transforms this input notation (a lex file with extension .l) into a C
file ‘lex.yy.c’, whose code simulates the transition diagram built from the regular
expressions.
• Further, this C code is compiled by a C compiler to produce the executable file ‘a.out’.
• This executable output file then receives a stream of characters and generates a
number of tokens in the output.
• The output file ‘a.out’ works as a subroutine of the parser: it returns an
integer that represents one of the many possible token types.
• The attribute value of the token, which may be a pointer to a symbol-table entry or a
numeric value, is placed in a global variable called ‘yylval’.
• This global variable is shared between the lexical analyzer and the parser, so a
token is returned as a token name together with an attribute value.
• To be specific, the ‘yy’ in both ‘lex.yy.c’ and ‘yylval’ represents the Yacc generator.
lex source program (lex.l) → LEX or FLEX compiler → lex.yy.c

lex.yy.c → C compiler → a.out

input stream of characters → a.out → stream of tokens
Structure of LEX Programs
• To understand the LEX specifications and features of a LEX program, the primary
structure of the LEX program needs to be understood, where the program is divided
into three fragments, viz. definition section, rules section and user subroutines or
user code section.
• Following the layout also used by the Yacc compiler, one section is separated
from the next by a line containing only two percent signs (%%).
• The first two sections are required, although either of them may be empty.
• The third section is optional; when it is omitted, the second %% line may be
omitted as well.
………… Definition Section…………
%%
………… Rules Section…………………
%%
……… User Subroutines / User Code Section………
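A minimal complete LEX specification in this three-section layout might look as follows (a hypothetical identifier/line counter written for illustration, not an example from the slides):

```lex
%{
/* Hypothetical example: count identifier-like words and lines. */
#include <stdio.h>
int words = 0, lines = 0;
%}

letter  [A-Za-z]
digit   [0-9]

%%
{letter}({letter}|{digit})*   { words++; }
\n                            { lines++; }
.                             { /* skip any other character */ }
%%

int main(void) {
    yylex();
    printf("%d identifiers, %d lines\n", words, lines);
    return 0;
}

int yywrap(void) { return 1; }
```

Running the LEX/FLEX compiler on this file produces ‘lex.yy.c’, which a C compiler turns into ‘a.out’, exactly as in the pipeline described for the LEX tool.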
Definition Section
• The ‘Definition Section’ can consist of a literal block, name definitions, start
conditions, translations and table declarations.
• Lines that begin with whitespace are copied verbatim to the generated C file.
• Usually this is used to include comments, enclosed between “/*” and “*/” and
preceded by whitespace.
• Text enclosed between “%{” and “%}” is also copied verbatim to the output C file.
• However, the “%{” and “%}” delimiters must themselves appear unindented, on
lines of their own.
The following is an example of indented text.
• A name definition has the following form:
name   definition
• The name starts with a letter or underscore (‘_’), followed by zero or more letters,
digits, ‘_’ (underscore) or ‘-’ (dash).
• The definition begins at the first non-whitespace character after the name and
continues to the end of the line.
• For example, the definition digits can be a regular expression that matches zero or
more digits, and the definition letters-digits another regular expression that matches a
pair of lowercase and uppercase letters followed by “-” (dash) and zero or more digits.
• A braced reference to a defined name, such as {digits}, in a later pattern is identical
to writing out its definition in place; the example pattern matches “-” (minus sign)
followed by one or more digits followed by “.”.
• The above three examples are three different rules; the first two rules match
‘newline’ characters and ceiling functions defined over two lines.
• The other rule matches the regular expression (‘ ’) and maintains a line-number
counter.
Users Subroutine Section
• This section of a LEX program contains the auxiliary functions used by the code
specified in the ‘Rules Section’.
• For example, the scanner code generated by the LEX tool is placed in the function
yylex(), which a Yacc-generated parser calls.
• These auxiliary functions may be compiled separately and loaded with the lexical
analyzer.
• Code in this section is copied to lex.yy.c verbatim, allowing users to add their
own routines.
• To find the next token, the parser generated by YACC calls the yylex() routine
in ‘lex.yy.c’.
• The ‘Users Subroutine Section’ is optional.
• When it is not used, the preceding line with two percent signs may be omitted
as well.
• Translate regular expressions to an NFA
• Translate the NFA to an efficient DFA (this step is optional)

regular expressions → NFA → DFA
• The mapping δ of an NFA can be represented in a transition table:

δ(0,a) = {0,1}
δ(0,b) = {0}
δ(1,b) = {2}
δ(2,b) = {3}
• An NFA accepts an input string x if and only if there is some path with
edges labeled with symbols from x in sequence from the start state to
some accepting state in the transition graph
• A state transition from one state to another on the path is called a
move
• The language defined by an NFA is the set of input strings it accepts,
such as (a|b)*abb for the example NFA
Lex specification with NFA

From a Lex specification with patterns and actions

p1 { action1 }
p2 { action2 }
…
pn { actionn }

a combined NFA is built: a new start state s0 has ε-transitions to the NFAs
N(p1), …, N(pn), and the accepting state of each N(pi) is tagged with actioni.
Subset construction then converts the combined NFA into a DFA.
Thompson's construction builds, for each form of regular expression, an NFA fragment
with one start state i and one final state f:

• ε: an ε-labeled edge from i to f
• a: an a-labeled edge from i to f
• r1 | r2: ε-edges from i into N(r1) and N(r2), and ε-edges from both back to f
• r1 r2: N(r1) and N(r2) joined in sequence between i and f
• r*: ε-edges from i into N(r) and directly to f, with an ε-edge allowing N(r) to repeat
Example: combined NFA for the Lex specification

a     { action1 }
abb   { action2 }
a*b+  { action3 }

The NFA for a has states 1 →a 2; the NFA for abb has states 3 →a 4 →b 5 →b 6;
the NFA for a*b+ has states 7 and 8, with an a-loop on 7, an edge 7 →b 8, and a
b-loop on 8. A new start state 0 has ε-transitions to states 1, 3 and 7; states
2, 6 and 8 are the accepting states for the three patterns.