L2 Lexical Analysis
Introduction
• This is the first phase of a compiler.
• The compiler spends a considerable share of its time (20–30% of compile time) in this phase, because it is the only phase that reads the source character by character.
• If a lexical analyzer is implemented efficiently, the overall efficiency of the compiler improves.
• Lexical analyzers are used in text processing, query processing, and pattern matching tools.
• The scanner or lexical analyzer (LA) reads the source text as a file of characters and divides it up into tokens.
• Tokens are like words in natural language.
• This unit covers:
• the design of a lexical analyzer
• how to specify tokens using regular expressions
• how to design an LA using Finite Automata (FA) or the Lex tool
• how to design a pattern recognizer as a Nondeterministic Finite Automaton (NFA)
• an NFA is slower to run than a Deterministic Finite Automaton (DFA); hence NFA, NFA with ε-transitions, interconversion of NFA and DFA, and minimal DFA
• a discussion of LEX
Lexical Analyser (Scanner)
• Lexical analysis is the act of breaking down source text into a sequence of words called tokens.
• Each token is found by matching sequential characters to patterns.
• Programming languages are defined using context-free grammars, a class that includes the regular languages.
• All the tokens are defined with regular grammar and the lexical analyzer identifies strings as tokens and sends them to a
syntax analyzer for parsing.
• The interaction of a lexical analyzer with a parser is shown in Figure
• The lexical analyzer reads the next token only when the parser requests it, and then sends that token to the parser.
• If the token is an identifier or a procedure name, the lexical analyzer stores it in the symbol table.
• Examples of tokens:
• Type tokens (id, number, real, . . . )
• Punctuation tokens (;, (, ), {, }, . . . )
• Alphabetic tokens, i.e. keywords (if, void, return, . . . )
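The points above (tokens matched against patterns, delivered one at a time on request, with identifiers entered in the symbol table) can be sketched in Python. This is only an illustrative sketch: the token class names, the patterns in `TOKEN_SPEC`, the `KEYWORDS` set, and the dictionary used as a symbol table are all assumptions, not part of any particular compiler.

```python
import re

# Hypothetical token classes and patterns, chosen only for illustration.
KEYWORDS = {"if", "void", "return"}
TOKEN_SPEC = [
    ("real",   r"\d+\.\d+"),          # must come before "number"
    ("number", r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("punct",  r"[(){};,=+\-*/<>]"),
    ("skip",   r"\s+"),               # white space yields no token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokens(source, symbol_table):
    """Yield (token_class, lexeme) pairs lazily, one per request."""
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        pos = m.end()
        kind, lexeme = m.lastgroup, m.group()
        if kind == "skip":
            continue
        if kind == "id":
            if lexeme in KEYWORDS:
                kind = "keyword"
            else:
                # Record each new identifier once, in order of first appearance.
                symbol_table.setdefault(lexeme, len(symbol_table))
        yield kind, lexeme

symtab = {}
stream = tokens("if (x1 > 3.5) return y;", symtab)
first = next(stream)   # the "parser" asks for one token
rest = list(stream)    # then for all remaining tokens
```

Because `tokens` is a generator, no input is scanned until the caller asks for a token, which mirrors the request-driven interaction with the parser.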
Advantages of Separating Lexical Analysis from Syntax Analysis
• Simplicity: Techniques of DFA are sufficient for lexical analysis and techniques of PDA can be used for
syntax analysis. Designing them together is a complex process. Separation also simplifies the syntax analyzer
and allows us to use independent tools.
• Efficiency: Separation makes it easier to apply simplifications and optimizations specific to each phase. For example, a compiler spends much of its time in lexical analysis. If this module is implemented efficiently, it contributes to the overall efficiency of the compiler.
• Portability: Portability is enhanced. Due to input/output and character set variations, lexical analyzers are not
always machine independent. We can take care of input alphabet peculiarities at this level.
Tasks of a Lexical Analyzer
• The input buffer is divided into two N-character halves, as shown in Figure 2.3. Typically, N is the number of characters on one disk block, for example, 1024 or 4096.
• We read N input characters into each half of the buffer with one system read command, rather than
invoking a read command for each input character.
• If fewer than N characters remain in the input, a special character EOF is read into the buffer after the input characters, as in Figure 2.3. EOF marks the end of the source file and is different from any input character.
• Two pointers, “lexeme” and “fwd”, into the input buffer are maintained.
• The string of characters enclosed between the two pointers is the current lexeme.
• To find the lexeme, first initialize both pointers to the first character of the input. Keep incrementing the fwd pointer while the characters read continue to match the pattern. Once a character that does not match is found, stop incrementing the pointer and extract the string between the “lexeme” and “fwd” pointers. This string is the required lexeme; process it, then set both pointers to the next character to identify the next lexeme. With this scheme, comments and white space can be treated as patterns that yield no token.
• While the fwd pointer is being incremented, if it is about to move past the halfway mark, the right half is filled with N new input characters. If instead the fwd pointer is about to move past the right end of the buffer, the left half is filled with N new characters and the fwd pointer wraps around to the beginning of the buffer.
• Most of the time this buffering scheme works quite well, but the amount of look ahead is limited with it. This
limited look ahead may make it difficult to recognize tokens in cases where the distance that the fwd pointer
must travel is more than the length of the buffer.
• For example, in a PL/I program, consider the statement
DECLARE (ARG1, ARG2, ARG3, ..., ARGn)
• Until we see what character follows the right parenthesis, we cannot determine whether DECLARE is a
function name, keyword, or an array name. In all the cases, the lexeme ends at second E, but the amount of
look ahead needed is proportional to the number of arguments, which in principle is unbounded.
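The two-pointer scan described in this section can be sketched as follows. It is a simplified illustration that works on an in-memory string and leaves out the two buffer halves and their reloading; the function name `scan` and the three lexeme classes it recognizes (identifiers, digit runs, single characters) are assumptions made for the example.

```python
def is_id_char(c):
    """Characters allowed after the first character of an identifier."""
    return c.isalnum() or c == "_"

def scan(text):
    """Split text into lexemes using a lexeme pointer and a fwd pointer."""
    lexemes = []
    lexeme = fwd = 0                 # both pointers start at the first character
    while lexeme < len(text):
        ch = text[lexeme]
        if ch.isspace():             # white space: a pattern that yields no token
            lexeme += 1
            fwd = lexeme
            continue
        if ch.isalpha() or ch == "_":
            matches = is_id_char     # identifier: letters, digits, underscore
        elif ch.isdigit():
            matches = str.isdigit    # number: a run of digits
        else:
            matches = None           # anything else: a one-character lexeme
        fwd = lexeme + 1
        # Advance fwd while the next character still matches the pattern.
        while matches is not None and fwd < len(text) and matches(text[fwd]):
            fwd += 1
        lexemes.append(text[lexeme:fwd])   # the string between the two pointers
        lexeme = fwd                       # both pointers move to the next lexeme
    return lexemes

result = scan("count = count1 + 42;")
```

In this sketch `fwd` stops on the first character that does not match, so the lexeme is the slice from `lexeme` up to, but not including, `fwd`.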
The performance can be improved by using the following methods.
1. Buffer Pairs
2. Sentinels
Sentinels
• To reduce the number of tests to one for each advance of the fwd pointer, sentinels are used with the buffer. This is shown in Figure 2.4. The idea is to extend each buffer half to hold a sentinel at the end.
• A sentinel is a special character that cannot occur in a program (e.g., EOF).
• It indicates the need for some special action (terminate processing or fill the other buffer half).
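A buffer pair with sentinels might look like the sketch below. It makes several assumptions: the sentinel is the NUL character `"\0"`, the half size `n` is tiny so the reloads are easy to follow, and the class name `SentinelBuffer` and its methods are invented for the example. The lexeme pointer is left out to keep the sketch short.

```python
import io

EOF = "\0"   # assumed sentinel: a character that never occurs in the source text

class SentinelBuffer:
    """Two n-character halves, each followed by one sentinel slot."""

    def __init__(self, stream, n=4):
        self.stream = stream
        self.n = n
        # Layout: [0..n-1] left half, [n] sentinel,
        #         [n+1..2n] right half, [2n+1] sentinel.
        self.buf = [EOF] * (2 * n + 2)
        self._fill(0)
        self.fwd = 0

    def _fill(self, start):
        """Load one half with a single read; pad a short read with EOF."""
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        """Return the next character, or None at the end of the input."""
        ch = self.buf[self.fwd]
        if ch == EOF:                           # the only test on the common path
            if self.fwd == self.n:              # end of left half: reload right half
                self._fill(self.n + 1)
                self.fwd = self.n + 1
            elif self.fwd == 2 * self.n + 1:    # end of right half: reload left, wrap
                self._fill(0)
                self.fwd = 0
            else:
                return None                     # EOF inside a half: real end of input
            ch = self.buf[self.fwd]
            if ch == EOF:                       # the reloaded half was empty
                return None
        self.fwd += 1
        return ch

buf = SentinelBuffer(io.StringIO("abcdefghij"), n=4)
result = "".join(iter(buf.next_char, None))
```

Only when the character under `fwd` is the sentinel does the code go on to check which case applies, so an ordinary advance costs a single comparison.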
Specification of Tokens
• Patterns are specified using regular expressions. Each pattern matches a set of strings; so regular expressions
will serve as names for sets of strings.
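As a small illustration of a regular expression naming a set of strings, the sketch below uses Python's `re` module. The identifier pattern and the candidate strings are assumptions chosen for the example, not a definition taken from any particular language.

```python
import re

# Assumed pattern for identifiers: a letter or underscore,
# followed by any number of letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

candidates = ["x", "count1", "1count", "_tmp", "a-b", ""]

# fullmatch succeeds only if the whole string belongs to the set
# of strings that the pattern denotes.
identifiers = [s for s in candidates if IDENTIFIER.fullmatch(s)]
```

Here `identifiers` keeps exactly those candidates that are members of the set named by the pattern.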
Strings and Languages
• A language is a dynamic set of visual, auditory, or tactile symbols of communication and the elements used
to manipulate them. Language can also refer to the use of such systems as a general phenomenon.
• Symbol and Alphabet
• A symbol is an abstract entity. It cannot be formally defined, just as a point cannot be in geometry.
• Example: Letters, digits, or special symbols like $, @, # etc.
• Alphabet: Finite collection of symbols, denoted by Σ.
• Example: English alphabet = {a, b, ..., z}
• Binary alphabet = {0, 1}
• String/word: A finite sequence of symbols drawn from an alphabet
• Example: 1101, 1111, 01010101 strings from binary alphabet.
• 0a1 is not a string from binary alphabet.
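The alphabet and string examples above can be checked with a short sketch. The function name `is_string_over` is an assumption introduced for illustration.

```python
BINARY = {"0", "1"}   # the binary alphabet

def is_string_over(s, alphabet):
    """True if every symbol of s is drawn from the given alphabet."""
    # The empty string also qualifies, since it contains no symbols at all.
    return all(symbol in alphabet for symbol in s)

valid = [s for s in ["1101", "1111", "01010101", "0a1"] if is_string_over(s, BINARY)]
```

The string 0a1 is filtered out because the symbol a is not in the binary alphabet.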
Operations on Language
Regular Expressions
Recognition of Tokens
Finite State Machine
Finite Automata
DFA
NFA
Lex Tool: Lexical Analyser Generator