L2 Lexical Analysis

Lexical Analysis

Introduction
• This is the first phase of a compiler.
• The compiler spends most of its time (20–30% of compile time) in this phase because reading character by
character is done only in this phase.
• If a lexical analyzer is implemented efficiently, the overall efficiency of the compiler improves.
• Lexical analyzers are used in text processing, query processing, and pattern matching tools.
Introduction
• The scanner or lexical analyzer (LA) performs the task of reading a source text as a file of
characters and dividing them up into tokens.
• Tokens are like words in natural language.
• Will learn about:
• the design of a lexical analyzer
• specifying tokens using regular expressions
• how to design an LA using Finite Automata (FA) or the Lex tool
• Design of a pattern recognizer as a Nondeterministic Finite Automaton (NFA).
• An NFA is slower to simulate than a Deterministic Finite Automaton (DFA).
• Hence we also cover NFAs, NFAs with ε-transitions, interconversion of NFAs and DFAs, and minimal DFAs.
• Discussion of the LEX tool.
Lexical Analyser (Scanner)
• Lexical analysis is the act of breaking down source text into a set of words called tokens.
• Each token is found by matching sequential characters to patterns.
• Programming languages are defined using context-free grammars, which include the regular languages.
• All the tokens are defined with regular grammars, and the lexical analyzer identifies strings as tokens and sends them to the
syntax analyzer for parsing.
• The interaction of a lexical analyzer with a parser is shown in the figure.
• Only on getting a request from the parser does the lexical analyzer read the next token and send it to the parser.
• If the token is an identifier or a procedure, then the lexical analyzer stores that token in the symbol table.

• Lexical Analysis can be implemented with the Deterministic finite Automata.


• The output is a sequence of tokens that is sent to the parser for syntax analysis
Example
• If the source text is A + B, the LA does not read A + B at once and send the tokens id, +, id. On getting the first request
from the parser, the LA reads the first string A, recognizes it as the token id, stores it in the symbol table, and sends
the token id to the parser. On the next request, it reads only + and sends the operator as a token as it is. On the third
request, it reads the string B, recognizes it as the token id, stores it in the symbol table, and sends the token id to the parser
(a sketch of this request-driven interface is given after the token examples below).
• Out of all the phases, a compiler spends much time in lexical analysis. Hence, if this phase is implemented efficiently,
this contributes to the overall efficiency of the compiler.
• Lexical analysis and parsing can be combined into one phase, but there are some advantages to doing them separately.

• Example of tokens:
• Type token (id, number, real, . . . )
• Punctuation tokens (IF, void, return, . . . )
• Alphabetic tokens (keywords)
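The sketch below illustrates, in C, the request-driven interface described in the A + B example above: the parser calls a function once per token it needs. The type and function names (Token, get_next_token) and the fixed input string are illustrative assumptions, not part of any particular compiler.

/* Minimal sketch of an on-demand scanner for "A + B".
 * Each call from the parser returns exactly one token. */
#include <ctype.h>

typedef enum { TOK_ID, TOK_PLUS, TOK_EOF } TokenType;

typedef struct {
    TokenType type;
    char lexeme[64];          /* the matched character string */
} Token;

static const char *input = "A + B";
static int pos = 0;

Token get_next_token(void) {
    Token t = { TOK_EOF, "" };
    while (input[pos] == ' ') pos++;                          /* skip white space        */
    if (input[pos] == '\0') return t;                         /* end of input            */
    if (isalpha((unsigned char)input[pos])) {                 /* identifier: letter ...  */
        int n = 0;
        while (isalnum((unsigned char)input[pos]) && n < 63)  /* ... then letters/digits */
            t.lexeme[n++] = input[pos++];
        t.lexeme[n] = '\0';
        t.type = TOK_ID;      /* a real scanner would also enter it in the symbol table */
    } else if (input[pos] == '+') {
        t.type = TOK_PLUS;
        t.lexeme[0] = '+';
        t.lexeme[1] = '\0';
        pos++;
    }
    return t;
}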
Advantages of Separating Lexical Analysis from
Syntax Analysis
• Simplicity: Techniques of DFA are sufficient for lexical analysis and techniques of PDA can be used for
syntax analysis. Designing them together is a complex process. Separation also simplifies the syntax analyzer
and allows us to use independent tools.
• Efficiency: Makes it easier to perform simplifications and optimizations unique to the different paradigms.
For example, a compiler spends much time in LA. If this module is implemented efficiently, then this
contributes to the overall efficiency of a compiler.
• Portability: Portability is enhanced. Due to input/output and character set variations, lexical analyzers are not
always machine independent. We can take care of input alphabet peculiarities at this level.
Tasks of a Lexical Analyzer

• 1. Tokenization, i.e., dividing the program into valid tokens.
• 2. Eliminating white space characters: white space characters like tab, space, and
newline characters are not required during the translation of a source code.
• 3. Filtering comment lines.
• 4. Associating error messages with the source program: if an error is detected
by any other phase, the lexical analyzer helps the other phases in
giving error diagnostics properly.
Error Recovery in Lexical Analysis
• Errors often detected during lexical analysis are as follows:
• 1. Numeric literals that are too long
• 2. Long identifiers (often a warning is given)
• 3. Ill-formed numeric literals
• 4. Input characters that are not in the source language
How to recover from errors?
• Delete: If there is an unknown character, simply delete it. Deleting characters until a valid token can be formed is also called panic-mode recovery.
• This is used by many compilers. However, it has certain disadvantages:
• The meaning of the program may get changed.
• The whole input may get deleted in the process.
• Example: “charr” can be corrected as “char” by deleting “r”
• Insert: Insert an extra or missing character to group into a meaningful token.
• Example: “cha” can be corrected as “char” by inserting “r”
• Transpose: Based on certain rules, we can transpose two characters.
• Like “whiel” can be corrected to “while” by the transpose method.
• Replace: In some cases, it may require replacing one character by another.
• Example: “chrr” can be corrected as “char” by replacing “r” with “a”.
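As a rough illustration of these four repair actions, the C sketch below checks whether an unrecognized lexeme is one edit (delete, insert, replace, or transpose) away from a keyword and suggests that keyword. The keyword table and function names are hypothetical; real compilers use more elaborate minimum-distance correction.

/* Suggest a keyword reachable from `lexeme` by a single edit. */
#include <string.h>

static const char *keywords[] = { "char", "while", "int", "return", 0 };

/* 1 if a and b differ by exactly one replace/transpose (equal length)
 * or one insert/delete (lengths differ by one). */
static int one_edit_apart(const char *a, const char *b) {
    size_t la = strlen(a), lb = strlen(b);
    if (la == lb) {
        int diff = 0, first = -1;
        for (size_t i = 0; i < la; i++)
            if (a[i] != b[i]) { if (diff++ == 0) first = (int)i; }
        if (diff == 1) return 1;                              /* replace   */
        return diff == 2 && a[first] == b[first + 1]
                         && a[first + 1] == b[first];         /* transpose */
    }
    if (la + 1 == lb || lb + 1 == la) {                       /* insert or delete */
        const char *lng = la > lb ? a : b, *sht = la > lb ? b : a;
        size_t i = 0, j = 0, skips = 0;
        while (lng[i] && sht[j]) {
            if (lng[i] == sht[j]) { i++; j++; }
            else { i++; if (++skips > 1) return 0; }
        }
        return 1;
    }
    return 0;
}

const char *suggest_repair(const char *lexeme) {
    for (int k = 0; keywords[k]; k++)
        if (one_edit_apart(lexeme, keywords[k]))
            return keywords[k];     /* e.g., "charr" -> "char", "whiel" -> "while" */
    return 0;                       /* no single-edit repair found */
}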
Tokens, Patterns, Lexemes
• Token: It is a group of characters with logical meaning. Token is a logical building block of the language.
Example: id, keyword, Num etc.
• Pattern: It is a rule that describes the set of character strings that can be grouped into a token. It is expressed as a regular
expression. The input stream of characters is matched against patterns, and tokens are identified.
• Example: The pattern/rule for id is that it should start with a letter followed by any number of letters or digits.
This is given by the regular expression: [A-Za-z][A-Za-z0-9]*.
• Using this pattern, the given input strings “xyz, abcd, a7b82” are recognized as token id.
• Lexeme: It is the actual text/character stream that matches with the pattern and is recognized as a token.
• For example, “int” is identified as token keyword. Here “int” is lexeme and keyword is token.
• For example, consider the statement: float key = 1.2; Here the lexemes are float, key, =, 1.2, and ;, which are recognized as the tokens keyword, id, assignment operator, number, and punctuation symbol, respectively.
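As a small illustration of matching the id pattern above, the C function below checks whether a string fits [A-Za-z][A-Za-z0-9]*; the function name is illustrative.

/* Return 1 if s matches the identifier pattern [A-Za-z][A-Za-z0-9]*. */
#include <ctype.h>

int matches_id(const char *s) {
    if (!isalpha((unsigned char)s[0]))         /* must start with a letter */
        return 0;
    for (int i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i]))     /* then letters or digits   */
            return 0;
    return 1;   /* e.g., "xyz", "abcd", "a7b82" all match */
}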
Issues in Lexical Analysis
• If keywords are not reserved, then the lexical analyzer must distinguish between a
keyword and an identifier. Example: Keywords are not reserved in PL/I. Thus, the
rules for distinguishing keywords from identifiers are quite complicated.
Attributes for Tokens
• When a token can represent many lexemes, the scanner or lexical analyzer must provide additional information
about the particular lexeme that matched to the subsequent phases of the compiler.
• For example, the token id matches x or y or z, but it is essential for the code generator to know which
string was actually matched.
• The lexical analyzer accumulates information about tokens into their associated attributes. The tokens control
parsing decisions, whereas the attributes are significant in the translation of tokens. Generally, a token will
have only one attribute, that is, a pointer to the symbol table entry in which information about the token is available.
If we want the type of an id, its token, lexeme, and line number, all this information can be stored in the symbol
table entry of that identifier.
• Example 1: Consider the statement: distance = 0.5 * g * t * t;
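A hedged sketch of how a token and its attribute might be represented for Example 1: the attribute of an id is a pointer to its symbol-table entry, and the attribute of a number is its value. The type and field names below are illustrative assumptions, not a fixed standard.

/* Token with an attribute that depends on the token type. */
typedef struct SymtabEntry {
    char name[64];
    char type[16];                 /* e.g., "float"                       */
    int  line;                     /* line number of first occurrence     */
} SymtabEntry;

typedef enum { ID, NUM, MUL, ASSIGN, SEMI } TokenType;

typedef struct {
    TokenType type;
    union {                        /* attribute chosen by the token type  */
        SymtabEntry *entry;        /* for ID: pointer to symbol table     */
        double       value;        /* for NUM: numeric value              */
    } attr;
} Token;

/* For "distance = 0.5 * g * t * t;" the scanner would emit roughly:
 *   <ID, &distance> <ASSIGN> <NUM, 0.5> <MUL> <ID, &g> <MUL> <ID, &t>
 *   <MUL> <ID, &t> <SEMI>                                                */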
GATE Question
Strategies for Implementing a Lexical Analyzer

• Lexical analyzer can be designed in the following ways:


• 1. Use a scanner generator tool like Lex/Flex to produce a lexical analyzer. The specification can be given
using a regular expression, which on compilation generates a scanner capable of identifying tokens. The
generator provides routines for reading and buffering the input. This is easy to implement but least efficient.
• 2. Design a lexical analyzer in a high-level programming language like C and use the input
and buffering techniques provided by the language. This approach is intermediate in
terms of efficiency.
• 3. Write a lexical analyzer in assembly language; this is the most efficient method, since it can explicitly
manage input and buffering. However, it is a very complex approach to implement.
Designing a lexical analyzer either by hand or with automated tools

• 1. Describe rules for tokens using regular expressions.


• 2. Design a recognizer for such rules, that is, for tokens. Designing a recognizer corresponds to converting
regular expressions to Finite Automata. The processing can be sped up if the regular expression is
represented as a Deterministic Finite Automaton. This involves the following steps:
• Convert the regular expression to an NFA with ε-transitions
• Convert the NFA with ε-transitions to an NFA without ε-transitions
• Convert the NFA to a DFA
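The C sketch below shows the kind of table-driven DFA recognizer these steps produce, here for the identifier pattern [A-Za-z][A-Za-z0-9]*. The state names, character classes, and transition table are illustrative assumptions.

/* Table-driven DFA for the identifier pattern. */
#include <ctype.h>

enum { START = 0, IN_ID = 1, DEAD = 2 };            /* DFA states */

static int char_class(char c) {                     /* 0 = letter, 1 = digit, 2 = other */
    if (isalpha((unsigned char)c)) return 0;
    if (isdigit((unsigned char)c)) return 1;
    return 2;
}

static const int transition[3][3] = {
    /* letter  digit  other */
    {  IN_ID,  DEAD,  DEAD },    /* START */
    {  IN_ID,  IN_ID, DEAD },    /* IN_ID */
    {  DEAD,   DEAD,  DEAD },    /* DEAD  */
};

int dfa_accepts_id(const char *s) {
    int state = START;
    for (int i = 0; s[i] != '\0'; i++)
        state = transition[state][char_class(s[i])];
    return state == IN_ID;       /* IN_ID is the only accepting state */
}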
Input buffering
• In order to recognize a token, a scanner has to look ahead several characters from the current character many
times.
• For example, “char” is a keyword in C, while the term “chap” may be a variable name.
• When the character “c” is encountered, the scanner cannot decide whether it is a variable, keyword, or
function name until it reads three more characters.
• It takes a lot of time to read character by character from a file, so specialized buffering techniques have been
developed.
• These buffering techniques make the reading process easier and also reduce the amount of overhead
required to process each character.
• Look ahead with 2N buffering
Look ahead with 2N buffering

• Buffer divided into two N-Character halves as shown in Figure 2.3. Typically, N is the number of
characters on one disk block, for example, 1024 or 4096.
• We read N input characters into each half of the buffer with one system read command, rather than
invoking a read command for each input character.
• If fewer than N characters remain in the input, then a special character, eof, is read into the buffer after the
input characters, as in Figure 2.3. The eof character marks the end of the source file and is different from any input character.
• Two pointers, “lexeme” and “fwd”, into the input buffer are maintained.
• The string of characters enclosed between the two pointers is the current lexeme.
• To find a lexeme, first initialize both pointers to the first character of the input. Keep incrementing
the fwd pointer until a match for the pattern is found. Once a character that does not match is found, stop
incrementing the pointer and extract the string between the “lexeme” and “fwd” pointers. This string is
the required lexeme; process it, and set both pointers to the next character to identify the next lexeme.
With this scheme, comments and white space can be treated as patterns that yield no token.
• While the fwd pointer is being incremented, if it is about to move past the halfway mark, the right half is
filled with N new input characters. Else, if the fwd pointer is about to move past the right end of the buffer, not
only is the left half filled with N new characters, but the fwd pointer also wraps around to the beginning of the
buffer (a sketch of this scheme is given after the DECLARE example below).
• Most of the time this buffering scheme works quite well, but the amount of look ahead is limited with it. This
limited look ahead may make it difficult to recognize tokens in cases where the distance that the fwd pointer
must travel is more than the length of the buffer.
• For example, in PL/I program, consider the statement
DECLARE(ARG1, ARG2, ARG3, …, ARGn)
• Until we see what character follows the right parenthesis, we cannot determine whether DECLARE is a
function name, keyword, or an array name. In all the cases, the lexeme ends at second E, but the amount of
look ahead needed is proportional to the number of arguments, which in principle is unbounded.
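The C sketch below shows how the fwd pointer might be advanced with the two-buffer-halves scheme just described; note that every advance needs two boundary tests. The names (buffer, N, fwd, fill_left_half, fill_right_half) are illustrative assumptions.

/* Advance fwd by one character, reloading a buffer half when a boundary is reached. */
#define N 4096                        /* characters per buffer half              */
extern char buffer[2 * N];            /* left half: [0, N), right half: [N, 2N)  */
extern char *fwd;
extern void fill_right_half(void);    /* read next N characters into [N, 2N)     */
extern void fill_left_half(void);     /* read next N characters into [0, N)      */

char advance(void) {
    fwd++;
    if (fwd == buffer + N) {             /* test 1: crossed into the right half  */
        fill_right_half();
    } else if (fwd == buffer + 2 * N) {  /* test 2: ran off the right end        */
        fill_left_half();
        fwd = buffer;                    /* wrap around to the beginning         */
    }
    return *fwd;
}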
The performance can be improved by using other methods.

1. Buffer Pairs
2. Sentinels
Sentinels
• In order to reduce the number of tests to one for each advance of the fwd pointer, sentinels are used with the buffer.
This is shown in Figure 2.4. The idea is to extend each buffer half to hold a sentinel at the end.
• The sentinel is a special character that cannot occur in a program (e.g., eof).
• Reading a sentinel indicates the need for some special action (terminate processing or fill the other buffer half).
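A sketch of the same advance operation with sentinels, in C: because each buffer half ends in a sentinel, the common case needs only one test per advance. The buffer layout, the sentinel value, and all names are illustrative assumptions.

/* Advance fwd with sentinel-terminated buffer halves. */
#define N 4096
#define SENTINEL '\0'                 /* assumed: cannot occur in source text   */
extern char buf[2 * N + 2];           /* buf[N] and buf[2N+1] hold sentinels    */
extern char *fwd;
extern void fill_right_half(void);    /* refill buf[N+1 .. 2N], set buf[2N+1]   */
extern void fill_left_half(void);     /* refill buf[0 .. N-1], set buf[N]       */

char advance_with_sentinel(void) {
    fwd++;
    if (*fwd == SENTINEL) {                      /* single test in the common case */
        if (fwd == buf + N) {                    /* sentinel at end of left half   */
            fill_right_half();
            fwd++;                               /* step past the sentinel         */
        } else if (fwd == buf + 2 * N + 1) {     /* sentinel at end of right half  */
            fill_left_half();
            fwd = buf;                           /* wrap around                    */
        }
        /* otherwise this sentinel marks the real end of the input */
    }
    return *fwd;
}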
Specification of Tokens

• Patterns are specified using regular expressions. Each pattern matches a set of strings; so regular expressions
will serve as names for sets of strings.
Strings and Languages
• A language is a dynamic set of visual, auditory, or tactile symbols of communication and the elements used
to manipulate them. Language can also refer to the use of such systems as a general phenomenon.
• Symbol and Alphabet
• A symbol is an abstract entity. It cannot be formally defined, just like points in geometry.
• Example: Letters, digits, or special symbols like $, @, # etc.
• Alphabet: A finite collection of symbols, denoted by Σ.
• Example: English alphabet = {a, b,……z}
• Binary alphabet = {0, 1}
• String/word: A finite sequence of symbols from the alphabet.
• Example: 1101, 1111, 01010101 are strings over the binary alphabet.
• 0a1 is not a string over the binary alphabet.
Operations on Language
Regular Expressions
Recognitions of Tokens
Finite State Machine
Finite Automaton
DFA
NFA
Lex Tool: Lexical Analyser Generator
