Lecture 2: Lexical Analysis


COMPILER CONSTRUCTION

LECTURE 2

Muhammad Fawad Nasim


fawad.nasim@superior.edu.pk
Recap

1. Analysis Phase
2. Synthesis Phase
Analysis Phase
 Reads the source program, divides it into its core parts, and then checks for lexical, grammatical, and syntactic errors.
 The analysis phase generates an intermediate representation of the source program and a symbol table, which are fed to the synthesis phase as input.
Synthesis Phase
 Generates the target program with the help of the intermediate source representation and the symbol table.
Syntax vs Semantics
 The syntax of a programming language describes the proper form of its programs, while the semantics of the language defines what its programs mean; that is, what each program does when it executes.
 Syntax is specified by context-free grammars.
 The semantics of a language is much more difficult to describe than the syntax. For specifying semantics, we shall therefore use informal descriptions and suggestive examples.
 A context-free grammar can also be used to help guide the translation of programs.


Lexical Errors
 A character sequence that cannot be scanned into any valid token is a lexical error.
 Lexical errors are uncommon, but they still must be handled by a scanner.
 Misspellings of identifiers, keywords, or operators are considered lexical errors.
 Usually, a lexical error is caused by the appearance of some illegal character, most often at the beginning of a token.
Lexical Analysis
A lexical analyzer reads characters from the input and groups them into "token objects."
Why Separate Lexical Analysis?
Simplicity of Design
A parser that had to deal with comments and whitespace as syntactic units would be
considerably more complex than one that can assume comments and whitespace have
already been removed by the lexical analyzer.

Improved Efficiency
Applying specialized techniques, such as buffering for reading input characters, can speed up the compiler significantly.

Enhancing Compiler Portability
Some compiler collections have been created around carefully designed intermediate representations that allow the front end for a particular language to interface with the back end for a certain target machine. With these collections, we can produce compilers for different source languages for one target machine by combining different front ends with the back end for that target machine. Similarly, we can produce compilers for different target machines by combining a front end with back ends for different target machines.
Assignment (Python): Lexical Analyzer
Lexical Analysis
1. Reads the input characters of the source program, groups them into lexemes, and produces as output a sequence of tokens, one for each lexeme in the source program.
2. Removes any whitespace or comments in the source code.
3. Generates an error for an invalid token.
4. May keep track of the number of newline characters seen, so it can associate a line number with each error message.
5. Sends the stream of tokens to the parser for syntax analysis.
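A minimal sketch of these duties in Python (the assignment language); the token names, patterns, and the tokenize helper are illustrative assumptions for these notes, not a fixed API:

import re

# Illustrative token patterns; a real lexer has many more.
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*"),        # duty 2: //-comment to end of line
    ("NUM",     r"\d+"),             # integer constants
    ("ID",      r"[A-Za-z_]\w*"),    # identifiers (keywords not separated yet)
    ("OP",      r"[+\-*/=;]"),       # single-character operators/punctuation
    ("NEWLINE", r"\n"),              # tracked only to count lines
    ("SKIP",    r"[ \t]+"),          # blanks and tabs: recognized, not returned
    ("ERROR",   r"."),               # anything else is a lexical error
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    line = 1
    for m in MASTER_RE.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "NEWLINE":
            line += 1                         # duty 4: line numbers for errors
        elif kind in ("SKIP", "COMMENT"):
            continue                          # duty 2: drop whitespace/comments
        elif kind == "ERROR":
            raise SyntaxError(f"line {line}: illegal character {lexeme!r}")  # duty 3
        else:
            yield (kind, lexeme, line)        # duties 1 and 5: the token stream

print(list(tokenize("count = count + increment; // update\n")))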
Logical Implementation
1. Scanning consists of the simple processes that do not
require tokenization of the input, such as deletion of
comments and compaction of consecutive whitespace
characters into one.
2. Lexical analysis proper is the more complex portion,
where the scanner produces the sequence of tokens
as output.
Symbol Table
 The lexical analyzer quite often interacts with the symbol table, entering lexemes into it as they are discovered.
 The interaction is implemented by having the parser call the lexical analyzer.
 The call, suggested by the getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.
The Role of the Lexical Analyzer

[Diagram] source program → Lexical Analyzer → token → Parser → to semantic analysis. The parser requests each token with getNextToken; both the lexical analyzer and the parser consult the symbol table.
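A sketch of this pull-style interaction: the parser repeatedly asks for the next token. The stand-in token list and the end-of-input convention below are assumptions for illustration:

# A stand-in token stream; a real one would come from the lexical analyzer.
tokens = iter([("ID", "count"), ("OP", "="), ("ID", "count"),
               ("OP", "+"), ("ID", "increment"), ("OP", ";")])

def get_next_token():
    # None signals end of input (an assumed convention).
    return next(tokens, None)

tok = get_next_token()
while tok is not None:
    print(tok)              # the parser would consume the token here
    tok = get_next_token()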
Tokens, Lexemes, Patterns

Tokens
1. A token object consists of
 a terminal symbol that is used for parsing decisions
 information in the form of an attribute value

2. Token = a name and an optional attribute value

3. The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier.

4. The token names are the input symbols that the parser processes; the parser ignores the attribute value.

5. A token is a terminal along with additional information.
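As a concrete representation, a token can be modeled as a name plus an optional attribute; the class below is an illustrative assumption for these notes, not a standard API:

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Token:
    name: str                        # abstract symbol used for parsing decisions
    attribute: Optional[Any] = None  # e.g., symbol-table pointer or numeric value

# The parser looks only at .name; later phases use .attribute.
print(Token("num", 31))   # Token(name='num', attribute=31)
print(Token("+"))         # Token(name='+', attribute=None)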


Tokens, Lexemes, Patterns
In a programming language, keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can be considered tokens.
Types of Tokens
Keywords (words that convey a special meaning to the language compiler, e.g. do, float in C++)
Identifiers (variable names, etc.)
Literals (constants)
Punctuators ("", ; [] {})
Operators (+, / etc.)
Tokens, Lexemes, Patterns
Example 1
int value = 100;
 int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
Example 2
count = count + increment;
 (id, "count") (=) (id, "count") (+) (id, "increment") (;)
 The parser works with the terminal stream id = id + id ;
Example 3 (Fortran): E = M * C ** 2
<id, pointer to symbol-table entry for E>
<assign-op>
<id, pointer to symbol-table entry for M>
<mult-op>
<id, pointer to symbol-table entry for C>
<exp-op>            ; ** is the exponentiation operator
<number, integer value 2>
Constants
 Anytime a single digit appears in a grammar for expressions, it seems reasonable to allow an arbitrary integer constant in its place.
 Integer constants can be allowed either by creating a terminal symbol, num, for such constants or by incorporating the syntax of integer constants into the grammar.
 The lexical analyzer passes to the parser a token consisting of the terminal num along with an integer-valued attribute computed from the digits.
Constants
 Converting a character string to a number: atoi() in C++.
 Example: 31 + 28 + 59
 The terminal symbol + has no attributes; each num token carries the integer value computed from its digits.
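A small sketch of this conversion for the expression above; in Python, int() plays the role of C++'s atoi() (the token layout is an assumption):

expression = "31 + 28 + 59"

tokens = []
for lexeme in expression.split():
    if lexeme.isdigit():
        # num tokens carry an integer attribute computed from the digits
        tokens.append(("num", int(lexeme)))
    else:
        # the operator + is a terminal with no attribute
        tokens.append((lexeme, None))

print(tokens)  # [('num', 31), ('+', None), ('num', 28), ('+', None), ('num', 59)]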


Ignored Tokens
 ws → ( blank | tab | newline )+
 When we recognize such a token, we do not return it to the parser.

Removal of White Space and Counting Line Numbers
A code sketch follows.
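A minimal sketch of recognizing-but-discarding whitespace while counting newlines (the function name and printed messages are illustrative):

import re

WS = re.compile(r"[ \t\n]+")   # ws -> ( blank | tab | newline )+

def strip_ws(source):
    line = 1
    pos = 0
    while pos < len(source):
        m = WS.match(source, pos)
        if m:
            line += m.group().count("\n")  # count newlines for error messages
            pos = m.end()                  # recognized, but nothing is returned
        else:
            print(f"line {line}: significant character {source[pos]!r}")
            pos += 1

strip_ws("a \t b\nc")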
Key Points
 One token for each keyword. The pattern for a keyword is the same as the keyword itself.
 Tokens for the operators, either individually or in classes (e.g., comparison operators).
 One token representing all identifiers.
 One or more tokens representing constants, such as numbers and literal strings.
 Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.
Tokens, Lexemes, Patterns
Lexeme
 A lexeme is a sequence of input characters in the source program that matches the pattern for a token.
 It is identified by the lexical analyzer as an instance of that token.
 In linguistics, a lexeme refers to a single word and all of its forms.
 Example: "go" in English has the forms "go", "goes", "went", and "going". All of these words belong to the same lexeme "go".
Tokens, Lexemes, Patterns
 Lexeme
Find, finds, found, and finding are forms of the English lexeme FIND.
Tokens, Lexemes, Patterns
Run, runs, ran, and running are forms of the same lexeme, conventionally written as RUN.
c = a + b * 5;
 Lexemes and tokens? The lexemes are c, =, a, +, b, *, 5, and ;, giving the token stream (id, "c") (=) (id, "a") (+) (id, "b") (*) (number, 5) (;).
Example
position = initial + rate * 60

1. position is a lexeme that is mapped into the token (id, 1), where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position. The symbol-table entry for an identifier holds information about the identifier, such as its name and type.
2. The assignment symbol = is mapped into the token (=). Since this token needs no attribute value, we have omitted the second component. We could have used any abstract symbol such as assign for the token name, but for notational convenience we have chosen to use the lexeme itself as the name of the abstract symbol.
3. initial is a lexeme that is mapped into the token (id, 2), where 2 points to the symbol-table entry for initial.
4. + is a lexeme that is mapped into the token (+).
5. rate is a lexeme that is mapped into the token (id, 3), where 3 points to the symbol-table entry for rate.
6. * is a lexeme that is mapped into the token (*).
7. 60 is a lexeme that is mapped into the token (60).
Tokens, Lexemes, Patterns
Pattern
 There are predefined rules for every lexeme to be identified as a valid token.
 These rules are defined by grammar rules, by means of a pattern.
 A pattern explains what can be a token, and these patterns are defined by means of regular expressions.
 A pattern is a description of the form that the lexemes of a token may take.
Longest Match Rule
 The lexical analyzer scans the code character by character; when it encounters a whitespace, an operator symbol, or a special symbol, it decides that a word is complete.
 When several patterns could match, the analyzer prefers the longest possible lexeme, as sketched below.
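A tiny illustration of the rule, assuming a language with the keyword for: on the input forty, the analyzer takes the longest match and produces one identifier rather than the keyword for followed by ty (names are illustrative):

import re

ID = re.compile(r"[A-Za-z_]\w*")   # greedy: takes the longest identifier
KEYWORDS = {"for", "if", "do"}

def next_word(source):
    lexeme = ID.match(source).group()          # longest match wins
    name = "keyword" if lexeme in KEYWORDS else "id"
    return (name, lexeme)

print(next_word("for (i"))   # ('keyword', 'for')
print(next_word("forty"))    # ('id', 'forty')  -- not keyword 'for' + 'ty'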
Example
 In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword.
 For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings.
Keywords
 Most languages use fixed character strings such as for, do, and if, as punctuation marks or to identify constructs; such strings are called keywords.
 Keywords generally satisfy the rules for forming identifiers.
 Keyword or identifier?
Recognition: Reserved Words vs Identifiers
Are if and case reserved words or identifiers?

Solution:
 Define reserved words and their tokens in the table initially.
 A character string forms an identifier only if it is not a keyword.
(A sketch follows the String Table slide below.)
Benefits
 Single representation: a string table can insulate the rest of the compiler from the representation of strings.
 References are manipulated more efficiently than the strings themselves.
 Phases of the compiler work with references or pointers to the string in the table.
String Table
 A string table can be implemented as a hash table.
 The lexical analyzer will continue reading from the input as long as it encounters letters and digits.
 If the table has an entry for the string s, then the token retrieved by words.get is returned.
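A minimal sketch of this scheme in Python, mirroring the words.get idea (a dict stands in for the hash table; the token names are assumptions):

# Reserved words are entered into the table before scanning starts.
words = {"if": ("IF", "if"), "else": ("ELSE", "else"), "for": ("FOR", "for")}

def scan_word(source, pos):
    # Keep reading while we see letters or digits.
    start = pos
    while pos < len(source) and source[pos].isalnum():
        pos += 1
    s = source[start:pos]
    token = words.get(s)            # reserved word if already in the table
    if token is None:
        token = ("ID", s)           # otherwise it is an identifier...
        words[s] = token            # ...and we record it for next time
    return token, pos

print(scan_word("if", 0)[0])        # ('IF', 'if')    -- reserved word
print(scan_word("ifdef", 0)[0])     # ('ID', 'ifdef') -- identifier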
Exercise
 Divide the following C++ program into appropriate lexemes:

HINT
Token Value
 The lexical analyzer returns to the parser not only a token name, but an attribute value that describes the lexeme represented by the token.
 The token name influences parsing decisions, while the attribute value influences the translation of tokens after the parse.
 It is extremely important for the code generator to know which lexeme was found in the source program.
 Normally, information about an identifier (e.g., its lexeme, its type, and the location at which it is first found) is kept in the symbol table; the attribute value for an identifier is therefore a pointer to its symbol-table entry.
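A small sketch of the attribute value as a symbol-table reference (list indices stand in for pointers; all names here are illustrative):

symbol_table = []   # each entry: {"lexeme": ..., "type": ..., "line": ...}
index_of = {}       # lexeme -> position in symbol_table

def id_token(lexeme, line):
    # The attribute of an id token is a pointer (here, an index) into the table.
    if lexeme not in index_of:
        index_of[lexeme] = len(symbol_table)
        symbol_table.append({"lexeme": lexeme, "type": None, "line": line})
    return ("id", index_of[lexeme])

print(id_token("position", 1))   # ('id', 0)
print(id_token("initial", 1))    # ('id', 1)
print(id_token("position", 2))   # ('id', 0)  -- same entry as before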
Lexical Errors
It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error.

Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser; it cannot tell whether fi is a misspelled if or an undeclared identifier (e.g., the name of a function).

A harder situation arises when the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.

"Panic mode" recovery: delete successive characters from the remaining input until a well-formed token is found.
Error Recovery
In practice, most lexical errors involve a single character, so the following repair actions suffice:
 Delete one character from the remaining input.
 Insert a missing character into the remaining input.
 Replace a character by another character.
 Transpose two adjacent characters.
A sketch of panic-mode recovery follows.
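A minimal sketch of panic-mode recovery, assuming a predicate that tells whether some token pattern can match at the current position (the pattern and helper are hypothetical):

import re

# Hypothetical stand-in: a token can start with a letter, digit, or operator.
TOKEN_START = re.compile(r"[A-Za-z0-9+\-*/=;]")

def panic_mode(source, pos):
    # Delete successive characters until some token pattern can match.
    deleted = []
    while pos < len(source) and not TOKEN_START.match(source, pos):
        deleted.append(source[pos])
        pos += 1
    print(f"deleted {''.join(deleted)!r}, resuming at position {pos}")
    return pos

panic_mode("@#x = 1", 0)   # deleted '@#', resuming at position 2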
Writing a Scanner: Input Buffering
1. Fetching a block of characters is usually more efficient than fetching one character at a time.
2. We may need to look one or more characters beyond the next lexeme before we can be sure we have the right lexeme, e.g. > can be >= or a single >.
Cautions
 Read ahead only when you must.
 Some operators can be identified without reading ahead; a lookahead sketch follows.
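A minimal sketch of one-character lookahead for distinguishing >= from > (the Scanner class and token names are illustrative assumptions):

class Scanner:
    def __init__(self, source):
        self.source = source
        self.pos = 0

    def peek(self):
        # Look at the next character without consuming it.
        return self.source[self.pos] if self.pos < len(self.source) else ""

    def advance(self):
        ch = self.peek()
        self.pos += 1
        return ch

    def relational(self):
        # Called when the current character is '>'.
        self.advance()                 # consume '>'
        if self.peek() == "=":         # read ahead only because we must
            self.advance()
            return ("GE", ">=")
        return ("GT", ">")             # e.g. ';' would need no lookahead

print(Scanner(">= b").relational())   # ('GE', '>=')
print(Scanner("> b").relational())    # ('GT', '>')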
Sample Lexical Analyzer
Assignment #2
Write a program that:
1. Counts the total number of lines in any input text file.
2. Identifies reserved words, identifiers, and numbers.
3. Discards comments:
a) A comment begins with // and includes all characters until the end of that line.
b) A comment begins with /* and includes all characters through the next occurrence of the character sequence */.
Deadline: next class
