2-Lexical Analysis
LECTURE 2
1. Analysis Phase
2. Synthesis Phase
Analysis Phase
Reads the source program, divides it into its core parts, and then
checks for lexical, grammatical, and syntactic errors.
The analysis phase generates an intermediate representation of the
source program and a symbol table, which are fed to the synthesis
phase as input.
Synthesis Phase
Constructs the desired target program from the intermediate
representation and the symbol table.
The semantics of a language is much more difficult to describe than the syntax.
For specifying semantics, we shall therefore use informal descriptions and
suggestive examples.
Improved Efficiency
Applying specialized techniques, such as buffering for reading input
characters, can speed up the compiler significantly.
[Figure: the source program enters the Lexical Analyzer; on each
getNextToken call from the Parser, the analyzer returns the next
token; the Parser's output continues to semantic analysis. Both
components consult the Symbol Table.]
Tokens Lexeme Patterns
Tokens
1. A token object consists of:
a token name (a terminal symbol that is used for parsing decisions), and
optional information in the form of an attribute value.
2. The token name is an abstract symbol representing a kind of lexical unit, e.g.,
a particular keyword, or a sequence of input characters denoting an identifier.
3. The token names are the input symbols that the parser processes; the parser
ignores the attribute value.
Example: atoi() in C++ converts a numeric lexeme to the integer it denotes;
e.g., in 31 + 28 + 59, the number tokens carry the attribute values 31, 28, and 59.
For the statement position = initial + rate * 60:
1. position is a lexeme that would be mapped into a token (id, 1), where id
is an abstract symbol standing for identifier and 1 points to the symbol-table
entry for position. The symbol-table entry for an identifier holds
information about the identifier, such as its name and type.
2. The assignment symbol = is mapped into the token (=).
Since this token needs no attribute value, we have omitted the second component. We could have
used any abstract symbol such as assign for the token name, but for notational convenience we have
chosen to use the lexeme itself as the name of the abstract symbol.
3. initial is a lexeme that is mapped into the token (id, 2), where 2 points
to the symbol-table entry for initial.
4. + is a lexeme that is mapped into the token (+).
5. rate is a lexeme that is mapped into the token (id, 3), where 3 points to the
symbol-table entry for rate.
6. * is a lexeme that is mapped into the token (*).
7. 60 is a lexeme that is mapped into the token (60).
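The mapping above can be sketched as a tiny tokenizer. This is an illustrative Python sketch under simplifying assumptions: the token format is a tuple, the symbol table is a plain dictionary, and the token classes are only those needed for this one statement.

```python
import re

# Sketch: token classes as named regular expressions (an assumption
# for this example, not the lecture's grammar).
TOKEN_SPEC = [
    ("id",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("number", r"\d+"),
    ("op",     r"[=+*]"),
    ("ws",     r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text, symbol_table):
    """Map each lexeme to a (token-name, attribute) pair as above."""
    tokens = []
    for m in MASTER.finditer(text):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                         # whitespace is not a token
        if kind == "id":
            if lexeme not in symbol_table:   # enter identifier once
                symbol_table[lexeme] = len(symbol_table) + 1
            tokens.append(("id", symbol_table[lexeme]))
        else:
            tokens.append((lexeme,))         # operators and numbers: no attribute
    return tokens
```

Running it on the statement reproduces the seven tokens listed above: (id, 1), (=), (id, 2), (+), (id, 3), (*), (60).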
Pattern
There are predefined rules for every lexeme to be
identified as a valid token.
These rules are defined by grammar rules, by means of a
pattern.
A pattern describes what can be a token; these patterns
are defined by means of regular expressions.
A pattern is a description of the form that the lexemes of a token
may take.
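For instance, the pattern for identifiers can be written as a regular expression. The exact pattern depends on the language being compiled; the one below (a letter followed by letters or digits) is a common illustrative choice, not the lecture's definition.

```python
import re

# Sketch: a pattern describes the form a token's lexemes may take.
# Here: identifiers start with a letter, then letters or digits.
def is_identifier(lexeme):
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9]*", lexeme) is not None
```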
Longest Match Rule
The lexical analyzer scans the code character by character; when it
encounters a whitespace, an operator symbol, or a special symbol, it
decides that a lexeme is complete. When more than one pattern matches
a prefix of the remaining input, the longest matching prefix is chosen.
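The longest-match idea can be sketched as follows (the operator set is an illustrative assumption): when several operators match at the current position, the longest one wins, so `<=` becomes one token rather than `<` followed by `=`.

```python
# Sketch: candidate operators ordered longest-first, so the first
# match is automatically the longest match.
OPERATORS = ["<=", ">=", "==", "<", ">", "="]

def match_operator(text, pos=0):
    """Return the longest operator matching at pos, or None."""
    for op in OPERATORS:              # longer operators are tried first
        if text.startswith(op, pos):
            return op
    return None
```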
Example
In the case of a keyword as a token, the pattern is just the
sequence of characters that form the keyword. However, keywords also
match the pattern for identifiers, so the two must be distinguished.
Solution:
Define reserved words and their tokens in the table initially.
A character string forms an identifier only if it is not a keyword.
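The solution above can be sketched in a few lines (the keyword set and token names are illustrative assumptions):

```python
# Sketch: reserved words are entered into a table up front; a scanned
# word is classified as an identifier only if it is not found there.
RESERVED = {"if": "IF", "then": "THEN", "else": "ELSE", "while": "WHILE"}

def classify(word):
    """Return the token name for a scanned word."""
    return RESERVED.get(word, "id")   # fall back to identifier
```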
Benefit
Single representation: a string table can insulate the rest
of the compiler from the representation of strings.
References are manipulated more efficiently than the strings
themselves.
Phases of the compiler work with references, or pointers, to
the string in the table.
String Table
The string table can be implemented as a hash table.
The lexical analyzer will continue reading from the input
as long as it encounters letters and digits.
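A minimal sketch of a string table backed by a hash table (a Python dict here; class and method names are illustrative): each distinct string is stored once, and the rest of the compiler manipulates small integer references instead of the strings themselves.

```python
class StringTable:
    # Sketch: map each distinct string to a stable integer reference.
    def __init__(self):
        self._index = {}     # string -> reference (the hash table)
        self._strings = []   # reference -> string

    def intern(self, s):
        """Return the reference for s, inserting it on first sight."""
        if s not in self._index:
            self._index[s] = len(self._strings)
            self._strings.append(s)
        return self._index[s]

    def lookup(self, ref):
        """Recover the string behind a reference."""
        return self._strings[ref]
```

Comparing two references is a single integer comparison, which is why references are manipulated more efficiently than the strings themselves.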
Token Value
The lexical analyzer returns to the parser not only a token name
but also an attribute value that describes the lexeme represented
by the token.
The token name influences parsing decisions, while the attribute
value influences the translation of tokens after the parse.
It is extremely important for the code generator to know
which lexeme was found in the source program. The symbol-table entry
for an identifier records this information, e.g., its lexeme, its type,
and the location at which it is first found.
Lexical Errors
It is hard for a lexical analyzer to tell, without the aid of other components, that there is a
source-code error
fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser
A situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens
matches any prefix of the remaining input.
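One common recovery strategy in this situation is panic-mode recovery: delete successive characters from the remaining input until a token pattern can match again. A Python sketch, assuming a single identifier pattern for illustration:

```python
import re

# Sketch: a single token pattern stands in for the full pattern set.
ID = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def recover(text, pos):
    """Panic mode: skip characters until some token can start at pos."""
    while pos < len(text) and not ID.match(text, pos):
        pos += 1              # delete one offending character and retry
    return pos
```

Other recovery transformations (inserting, replacing, or transposing characters) follow the same pattern: make the smallest change that lets scanning continue.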