2 - Lexical Analyzer Lecture 01


1) The majority of texts, diagrams and tables in the slides are based on the textbook Compilers: Principles, Techniques, and Tools by Aho, Sethi, Ullman and Lam.
2) http://www.cs.fsu.edu/~engelen/courses/COP5621 (Florida State University)
Interactions between the lexical analyzer and the parser
Lexical analyzers are divided into a cascade of two processes:
• Scanning consists of the simple processes that do not require
tokenization of the input, such as deletion of comments and compaction
of consecutive whitespace characters into one.
• Lexical analysis is the more complex portion, which produces tokens
from the output of the scanner.
There are a number of reasons why the analysis portion of a compiler is
normally separated into lexical analysis and parsing (syntax analysis)
phases.
• Simplicity of design is the most important consideration.
• Compiler efficiency is improved.
• Compiler portability is enhanced.
• Recognize tokens and ignore white spaces, comments

input characters:  i f ( x 1 * x 2 < 1 . 0 ) {

after scanning:    if ( x1 * x2 < 1.0 ) {

• Error reporting
• Model using regular expression
• Recognize using finite state automata
• Sentences consist of strings of tokens
• Examples: number, identifier, string, keyword
• A sequence of characters forming a token is a lexeme
• Examples: 20.34, counter, const, "How are you?"
• The rule describing the lexemes of a token is a pattern
• Examples: letter(letter|digit)*

• Task: Identify tokens and corresponding lexemes


• A token is a classification of lexical units
• Lexemes are the specific character strings that make up a token
• Patterns are rules describing the set of lexemes belonging to a token
The following classes cover most of the tokens:
• One token for each keyword. The pattern for a keyword is the same as
the keyword itself.
• Tokens for the operators, either individually or in classes such as the
token comparison.
• One token representing all identifiers.
• One or more tokens representing constants, such as numbers and
literal strings.
• Tokens for each punctuation symbol, such as left and right
parentheses, comma, and semicolon.
• An alphabet Σ is a finite set of symbols (characters)
• A string s is a finite sequence of symbols from Σ
• |s| denotes the length of string s
• ε denotes the empty string, thus |ε| = 0
• A language is a specific set of strings over some fixed alphabet Σ
• A string over an alphabet is a finite sequence of symbols drawn from that alphabet. In language theory, the terms "sentence" and "word" are often used as synonyms for "string."
• A language is any countable set of strings over some fixed alphabet.
• The concatenation of two strings x and y is denoted by xy
• The exponentiation of a string s is defined by

s^0 = ε
s^i = s^(i-1)s for i > 0

note that εs = sε = s
• Certain languages do not have any reserved words: e.g., while, do, if, else, etc. are reserved in 'C' but not in PL/1
• Lexemes in a fixed position. Fixed format vs. free format languages
• FORTRAN Fixed Format and some keywords are context-dependent
• 80 columns per line
• Column 1-5 for the statement number/label column
• Column 6 for the continuation mark
• Column 7-72 for the program statements
• Column 73-80 Ignored (Used for other purpose)
• Letter C in Column 1 meant the current line is a comment
• Blanks are not significant in FORTRAN and can appear in the midst of identifiers, but
not so in ’C’
• Handling of blanks
• in C, blanks separate identifiers
• in FORTRAN, blanks are important only in literal strings
• the variable counter is the same as count er
• Another example:

DO 10 I = 1.25   (with blanks removed: DO10I=1.25)
DO 10 I = 1,25   (with blanks removed: DO10I=1,25)

• The first line is an assignment to the variable DO10I
• The second line is the beginning of a DO loop
• The lexical analyzer (LA) cannot catch any significant errors, except simple ones such as illegal symbols
• In such cases, LA skips characters in the input until a well-formed token is found
• How to describe tokens
2.e0 20.e-01 2.000
• How to break text into tokens
if (x==0) a = x << 1;
if (x==0) a = x < 1;
• How to break input into tokens efficiently
• Tokens may have similar prefixes
• Each character should be looked at only once
• Programming language tokens can be described by regular languages
• Regular languages
• Are easy to understand
• There is a well understood and useful theory
• They have efficient implementation
• Regular languages have been discussed in great detail in the “Theory
of Computation” course
Specification of Patterns for Tokens: Regular Expressions

• Regular expressions are built recursively out of smaller regular expressions, using the rules described below.
• Each regular expression r denotes a language L(r), which is also defined
recursively from the languages denoted by r's subexpressions.
• There are two rules that form the basis:
• ɛ is a regular expression, and L(ɛ) is {ɛ}, that is, the language whose sole member is
the empty string.
• If a is a symbol in ∑, then a is a regular expression, and L(a) = {a}, that is, the
language with one string, of length one, with a in its one position.
• Basis symbols:
• ε is a regular expression denoting the language {ε}
• a ∈ Σ is a regular expression denoting {a}
• If r and s are regular expressions denoting languages L(r) and M(s) respectively, then
• r|s is a regular expression denoting L(r) ∪ M(s)
• rs is a regular expression denoting L(r)M(s)
• r* is a regular expression denoting L(r)*
• (r) is a regular expression denoting L(r)
• A language defined by a regular expression is called a regular set
Specification of Patterns for Tokens: Regular Expressions
• For notational convenience, we may wish to give names to certain regular
expressions and use those names in subsequent expressions, as if the names were
themselves symbols.
• If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form:

d1 → r1
d2 → r2
…
dn → rn

where each ri is a regular expression over the alphabet
Σ ∪ {d1, d2, …, di-1}
• Any dj in ri can be textually substituted in ri to obtain an equivalent set of definitions

• Example:

letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*

• Regular definitions are not recursive:

digits → digit digits | digit    wrong!

• The following shorthands are often used:

r+ = r*r = rr*
r? = r | ε
[a-z] = a | b | c | … | z

• Examples:

digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?

(a) All strings of lowercase letters and digits that contain the five vowels in order.

(b) All strings of digits that contain a triple of each of the five vowels in order.

(c) All strings of a's and b's that do not contain the substring abbbb.

(d) All strings of a's and b's that do not contain the subsequence abb.

(e) All strings of uppercase letters and digits that contain the first three lowercase letters together, four times, in order.
Grammar

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num

Regular definitions

if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digit+ (. digit+)? ( E (+|-)? digit+ )?

Identify lexeme patterns, keywords and tokens in the following
piece of code.
Draw a transition diagram for the alphanumeric pattern.
Let P, Q and R be the sets of uppercase letters (A…Z), lowercase letters (a…z) and
digits (0…9). Define the following languages made of three operations – union,
concatenation and closure.

(a) P ∪ Q;        (b) P ∪ Q ∪ R;
(c) PR ∪ QR;      (d) PQR;
(e) P*;           (f) P* ∪ Q*;
(g) P*Q*;         (h) P+;
(i) P+ ∪ Q+;      (j) R(P* ∪ Q*)
(k) P(P*Q*)R;     (m) R^6;
(n) P+Q*;         (o) P*R+
Let alphabet Σ = {a, b}. Describe the following regular expressions.

(a|b)*
(a*|b)*
(a*b)|a
a+b
(ab)(ab)
(a|b)(a|b)
(ab)(a|b)
10(a|b)*00(a|b)*
1*b(a|b)aaab*0
(a+b*)aab
a*ba*ba*ba*ba*
(a|b)*(a|b)a(a|b)
a*(ba*ba*)*
• Regular expressions are declarative specifications
• A transition diagram is an implementation
• Transition diagrams have a collection of nodes or circles, called states. Each state
represents a condition that could occur during the process of scanning the input
looking for a lexeme that matches one of several patterns.
• Edges are directed from one state of the transition diagram to another. Each edge is
labeled by a symbol or set of symbols.
• Certain states are said to be accepting, or final. These states indicate that a lexeme has
been found, although the actual lexeme may not consist of all positions between the
lexemeBegin and forward pointers.
• In addition, if it is necessary to retract the forward pointer one position, then we shall
additionally place a * near that accepting state.
• One state is designated as start state, or initial state; it is indicated by an edge, labeled
“start," entering from nowhere.
The Lex and Flex Scanner Generators
• LEX and FLEX are scanner generators.
• FLEX is the more advanced, open-source tool commonly used as an alternative to LEX.
• LEX/FLEX is used to specify a lexical analyzer by giving regular expressions that describe the patterns for tokens.
• The LEX/FLEX tool generates scanner programs that recognize lexical patterns in text.
• Both tools are used with parser generators, such as Yacc and Bison. The control flow of the programs generated by LEX and FLEX is directed by the regular expressions matched against the sequence of input characters.
• They systematically translate regular definitions into C source code for efficient scanning.

The Lex and Flex Scanner Generators
• Valid input patterns are described in the LEX language in a source file 'lex.l'; the LEX compiler transforms this input notation (a lex file with extension .l) into a C file 'lex.yy.c', whose code simulates the transition diagram built from the patterns.
• This C code is then compiled by a C compiler to produce an executable 'a.out'.
• The executable receives a stream of characters and produces a stream of tokens as output.
• 'a.out' works as a subroutine of the parser: it returns an integer that represents one of the many possible tokens.
• The token's attribute value, which may be a pointer into the symbol table or a numeric value, is placed in a global variable called 'yylval'.
• This global variable is shared between the lexical analyzer and the parser, so a token name is returned together with an attribute value.
• The 'yy' in both 'lex.yy.c' and 'yylval' refers to the Yacc generator.
lex source program lex.l  →  [ lex or flex compiler ]  →  lex.yy.c
lex.yy.c                  →  [ C compiler ]            →  a.out
input sequence            →  [ a.out ]                 →  stream of tokens
Structure of LEX Programs
• To understand the LEX specifications and features of a LEX program, the primary
structure of the LEX program needs to be understood, where the program is divided
into three fragments, viz. definition section, rules section and user subroutines or
user code section.
• Following the conventions of the Yacc compiler, one fragment is separated from the next by a line containing two percent signs (%%).
• The first two fragments are required, though one of them may be empty.
• The third fragment is optional; if it is omitted, the second pair of percent signs may be omitted as well.

………… Definition Section…………
%%
………… Rules Section…………………
%%
……… User Subroutines / User Code Section………
Definition Section
• The 'Definition Section' can consist of literal blocks, name definitions, start conditions, translations and table declarations.
• Lines that start with whitespace characters are copied verbatim to the generated C file.
• Usually this is used to include comments, enclosed between "/*" and "*/" and preceded by whitespace, in the output.
• Indented text enclosed between "%{" and "%}" is likewise copied verbatim to the output C file.
• However, the "%{" and "%}" lines themselves must appear unindented.
Definition Section
• The element Name definition has the following form:
• name definition
• The name can start with a letter or underscore ('_'), followed by one or more letters, digits, underscores ('_') or dashes ('-').
• The definition begins at the first non-whitespace character after the name and continues to the end of the line. For example:

digits          [0-9]*
letters-digits  [a-z][A-Z]-[0-9]*

• The definition digits is a regular expression that matches zero or more digits, and the definition letters-digits is another regular expression that matches a pair of lowercase and uppercase letters followed by "-" (dash) followed by zero or more digits.
• A reference -[0-9]+"."{digits} is identical to -[0-9]+"."[0-9]*; it matches "-" (minus sign) followed by one or more digits, followed by ".", followed by zero or more digits.
• It represents a negative real number.
Definition Section
• The declaration element contains declarations of global variables and may initialize them.
• Following are some examples of declaration elements:
• float value1 = 2.7;
• float value2 = 3.8;
• int line_number = 0;
• The global variables can be accessed inside both the functions 'yylex()' and 'main()', which are defined after the second line with a couple of percent signs.
Rules Section
• This segment of the LEX structure contains pattern lines with corresponding actions.
• Each pattern is a regular expression and may use the regular definitions of the Definition Section.
• A line starting with whitespace, or enclosed in "%{" and "%}", is treated as C code; a line starting with anything else is treated as a pattern line.
• Each rule in the Rules Section has the form: pattern { action }
Rules Section
• The following are some examples of pattern lines with corresponding actions.
• The examples form three different rules: the first two match 'newline' characters, with ceiling functions defined in two lines.
• Another rule matches the regular expression (' ') and updates the line number counter.
Users Subroutine Section
• This segment of the LEX structure contains the auxiliary functions used by the code specified in the 'Rules Section'.
• For example, the C code generated by the LEX tool is kept in a function (yylex() in YACC).
• These auxiliary functions are compiled separately and loaded with the lexical analyzer.
• The functions in the 'Users Subroutine Section' allow users to add their own code to lex.yy.c verbatim.
• To find the next token, "lex.yy.c" can be used as a companion routine of LEX, and it can be called by the YACC parser generator.
• The 'Users Subroutine Section' segment is optional.
• When it is not used, the line with the double percent signs before it may be omitted.
Users Subroutine Section
• An example is shown below within the format of this section.
LEX Program
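A minimal complete LEX program, assembled as a sketch from the elements described in the preceding sections (the %%-separated layout and the line_number counter); the digit definition and the printed token names are hypothetical, not the slide's exact listing:

```lex
%{
/* Definition section: this %{ ... %} block is copied verbatim into
   lex.yy.c. line_number is the counter declared earlier. */
#include <stdio.h>
int line_number = 0;
%}

digit   [0-9]

%%

\n          { line_number++; }
{digit}+    { printf("NUM(%s)\n", yytext); }
[ \t]+      { /* skip blanks */ }
.           { printf("OTHER(%s)\n", yytext); }

%%

/* User subroutines section: copied verbatim to the end of lex.yy.c. */
int yywrap(void) { return 1; }

int main(void) {
    yylex();
    printf("lines: %d\n", line_number);
    return 0;
}
```

Running `lex prog.l && cc lex.yy.c && ./a.out < input` would then print one token per lexeme followed by the line count.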
• Translate regular expressions to NFA
• Translate NFA to an efficient DFA

regular expressions → NFA → DFA (the second translation is optional)

• Simulate the NFA to recognize tokens, or simulate the DFA to recognize tokens.
Nondeterministic finite automata (NFA) have no restrictions on the labels
of their edges. A symbol can label several edges out of the same state, and
ε, the empty string, is a possible label.

• An NFA is a 5-tuple (S, Σ, δ, s0, F) where

S is a finite set of states
Σ is a finite set of symbols, the alphabet
δ is a mapping from S × (Σ ∪ {ε}) to sets of states
s0 ∈ S is the start state
F ⊆ S is the set of accepting (or final) states
• An NFA can be diagrammatically represented by a labeled directed graph called a
transition graph. This graph is very much like a transition diagram, except:
• The same symbol can label edges from one state to several different states,
and
• An edge may be labeled by ε, the empty string, instead of, or in addition to,
symbols from the input alphabet.
• The transition graph for an NFA recognizing the language of regular expression
(a|b)*abb.
S = {0, 1, 2, 3}   Σ = {a, b}   s0 = 0   F = {3}

• State 0 has edges to itself on a and b, and an edge on a to state 1; state 1 goes to state 2 on b; state 2 goes to state 3 on b.
• The mapping δ of an NFA can be represented in a transition table

δ(0, a) = {0, 1}
δ(0, b) = {0}
δ(1, b) = {2}
δ(2, b) = {3}
• An NFA accepts an input string x if and only if there is some path with
edges labeled with symbols from x in sequence from the start state to
some accepting state in the transition graph
• A state transition from one state to another on the path is called a
move
• The language defined by an NFA is the set of input strings it accepts, such as (a|b)*abb for the example NFA
Lex specification with NFA
• Given a Lex specification with patterns and actions

p1 { action1 }
p2 { action2 }
…
pn { actionn }

construct an NFA N(pi) for each pattern pi.
• Add a new start state s0 with an ε-transition to the start state of each N(pi); the accepting states of N(pi) report actioni.
• Subset construction then converts the combined NFA into a DFA.

• Thompson's construction builds an NFA N(r) for each regular expression r:
• For ε: a start state i with an ε-edge to an accepting state f.
• For a ∈ Σ: a start state i with an a-edge to an accepting state f.
• For r1 | r2: a new start state i with ε-edges into N(r1) and N(r2), and ε-edges from their accepting states to a new accepting state f.
• For r1 r2: N(r1) followed by N(r2), with the accepting state of N(r1) merged into the start state of N(r2).
• For r*: a new start state i and accepting state f, with ε-edges from i into N(r) and from N(r) to f, an ε-edge directly from i to f, and an ε-edge from the accepting state of N(r) back to its start state.
• Example: the combined NFA for the Lex specification

a    { action1 }
abb  { action2 }
a*b+ { action3 }

• N(a):    1 --a--> 2
• N(abb):  3 --a--> 4 --b--> 5 --b--> 6
• N(a*b+): 7 --b--> 8, with an a-loop on state 7 and a b-loop on state 8
• Combined: a new start state 0 with ε-edges to states 1, 3 and 7.
