
Compiler:
Issues in Scanning

By:
Maira
Maimoona
Shomaila

To:
Madam Ayesha
Background knowledge;
 A computer consists of software and hardware,
and the hardware understands only machine
code: a series of 1s and 0s.
 It would be a difficult task for computer
programmers to write such codes by hand, which
is why we have compilers to generate them.
Cont…
 We write programs in a high-level
language, which is easier for us to
understand and remember.
 These programs are then fed into a
series of tools and OS components to
get the desired code that can be used
by the machine.
 This is known as the Language
Processing System.
Phases and passes;
 A compiler can have many phases and
passes.
 Pass : A pass refers to one complete
traversal of the entire program by the
compiler.
 Phase : A phase of a compiler is a
distinguishable stage, which takes input
from the previous stage, processes it, and
yields output that can be used as input for
the next stage. A pass can have more than
one phase.
Passes of compiler;
Phases of Compiler
Role of Lexical Analyzer
A program which performs lexical analysis is termed as a
lexical analyzer (lexer), tokenizer or scanner.
Stages of lexical analyzer;
 Lexical analysis consists of two stages of
processing which are as follows:
• Scanning
• Tokenization
Cont…
 If the lexical analyzer detects that a token is
invalid, it generates an error.
 It reads character streams from the source code, checks
for legal tokens, and passes the data to the syntax analyzer
when it demands it.
 Example
 How Pleasant Is The Weather?
 In this example we can easily recognize that
there are five words: How, Pleasant, Is, The, Weather. This
is very natural for us because we can recognize the
separators: blanks and the punctuation symbol.
 Now misplace the separators:
 HowPl easantIs Th ewe ather?
 and the individual words become much harder to recognize.
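The word-recognition step above can be sketched in Python (an illustrative sketch, not part of the original slides), using blanks and punctuation as separators:

```python
import re

def words(sentence):
    # Runs of letters are words; blanks and punctuation act as separators.
    return re.findall(r"[A-Za-z]+", sentence)

print(words("How Pleasant Is The Weather?"))
# → ['How', 'Pleasant', 'Is', 'The', 'Weather']
```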
Cont…
 Lexical Errors
 A lexical error is a sequence of
characters that does not match the
pattern of any token.
 The most common causes of a lexical
error are:
• The addition of an extraneous character;
• The removal of a character that should be
present.
Lexical Analyzer Implementation;
 A lexer usually discards ‘uninteresting’
tokens that don’t contribute to parsing.
 Examples: whitespace, comments.
 What happens if we remove all whitespace
and comments before lexing?
 Recall: if (i = j) then z <- 0 else z <- 1
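A quick Python sketch (hypothetical, assuming `<-` is the assignment symbol used on the slide) shows why stripping all whitespace before lexing loses information:

```python
def strip_whitespace(source):
    # Remove every whitespace character before the lexer runs.
    return "".join(source.split())

src = "if (i = j) then z <- 0 else z <- 1"
print(strip_whitespace(src))
# → if(i=j)thenz<-0elsez<-1
# 'thenz' and 'elsez' can no longer be split back into a keyword and
# an identifier: the token boundaries are gone.
```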
Basic Terminologies;
 What's a lexeme?
 A lexeme is a sequence of characters in the
source program that matches the pattern of a
token. It is an instance of a token.
 What's a token?
 A token is a sequence of characters which
represents a unit of information in the source
program.
 What is a pattern?
 A pattern is a description of the form that the
lexemes of a token may take. In the case of a
keyword used as a token, the pattern is simply the
sequence of characters that forms the keyword.
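These three terms can be illustrated with a short Python sketch (the token names and regular-expression patterns below are assumptions for illustration):

```python
import re

# Each token class is described by a pattern (here, a regular expression).
PATTERNS = {
    "KEYWORD":    r"if|else|while",
    "IDENTIFIER": r"[A-Za-z_][A-Za-z0-9_]*",
    "NUMBER":     r"[0-9]+",
}

# The lexeme 'count' matches the IDENTIFIER pattern, so it is an
# instance of the IDENTIFIER token; the lexeme '42' is a NUMBER.
print(bool(re.fullmatch(PATTERNS["IDENTIFIER"], "count")))  # → True
print(bool(re.fullmatch(PATTERNS["NUMBER"], "42")))         # → True
```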
Lexical Analyzer Architecture:
How tokens are recognized
 The lexical analyzer skips whitespace and comments
while creating these tokens. If any error is present,
the lexical analyzer will correlate that error with
the source file and line number.
Example;
c=a+b*5
Lexemes Tokens
c identifier
= assignment symbol
a identifier
+ + (addition symbol)
b identifier
* * (multiplication symbol)
5 5 (number)
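The table above can be reproduced by a minimal scanner sketch (Python; the token names and the regex-based approach are illustrative assumptions, not the only way to implement a lexer):

```python
import re

# One named group per token class.
TOKEN_SPEC = [
    ("NUMBER",     r"[0-9]+"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("ASSIGN",     r"="),
    ("PLUS",       r"\+"),
    ("STAR",       r"\*"),
    ("SKIP",       r"\s+"),     # whitespace is scanned but discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})"
                             for name, pattern in TOKEN_SPEC))

def tokenize(source):
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":          # skip whitespace tokens
            tokens.append((m.group(), m.lastgroup))
    return tokens

for lexeme, token in tokenize("c = a + b * 5"):
    print(lexeme, token)
```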
Scanner issues;
 Scanners are concerned with issues such
as:
 Case sensitivity (or insensitivity)
 Whether or not blanks are significant
 Whether or not newlines are significant
 Whether comments can be nested
Lexical Errors;
 Errors that might occur during scanning,
called lexical errors, include:
 Encountering characters that are not in the language’s
alphabet
 Too many characters in a word or line (yes, such
languages do exist!)
 An unclosed character or string literal
 An end of file within a comment
 Misspellings of identifiers, operators, or keywords
are also considered lexical errors
 Generally, a lexical error is caused by the appearance
of some illegal character, mostly at the beginning of a
token.
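The first error in the list, characters outside the language's alphabet, can be detected with a short sketch (Python; the alphabet below is an illustrative assumption) that also records the line number, as an earlier slide describes:

```python
def report_illegal_chars(source, alphabet):
    # Report every character outside the language's alphabet,
    # correlated with its line number.
    errors = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for ch in line:
            if ch not in alphabet:
                errors.append((lineno, ch))
    return errors

ALPHABET = set("abcdefghijklmnopqrstuvwxyz0123456789=+*() ")
print(report_illegal_chars("x = 1\ny = x @ 2", ALPHABET))
# → [(2, '@')] : the '@' on line 2 is not in the alphabet.
```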
Issues in Lexical Analysis
 Lexical analysis is the process of
producing tokens from the source
program. It has the following issues:
• Lookahead
• Ambiguities
Lookahead;
 Lookahead is required to decide where one
token ends and the next token begins.
Simple examples with lookahead issues are
i vs. if, and = vs. ==. Therefore a way to
describe the lexemes of each token is
required.
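The = vs. == case needs exactly one character of lookahead; a minimal sketch (Python, with illustrative token names):

```python
def scan_operator(source, pos):
    # Decide between '=' and '==' by looking one character ahead.
    if source[pos] == "=":
        if pos + 1 < len(source) and source[pos + 1] == "=":
            return ("EQ", "==")   # two-character comparison token
        return ("ASSIGN", "=")    # single '=' assignment token
    raise ValueError("not an operator at this position")

print(scan_operator("a == b", 2))  # → ('EQ', '==')
print(scan_operator("a = b", 2))   # → ('ASSIGN', '=')
```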
Cont…
 A way is needed to resolve ambiguities:
• Is if two variables i and f, or the keyword if?
• Is == two equal signs = =, or the single operator ==?
• arr(5, 4) vs. fn(5, 4) in Ada (array
reference syntax and function call syntax are
similar).
 Hence, the number of lookahead characters to
be considered, and a way to describe the
lexemes of each token, are also needed.
 Regular expressions are one of the most
popular ways of representing tokens.
Ambiguities
 Lex can handle ambiguous specifications.
When more than one expression can
match the current input, lex chooses as
follows:
• The longest match is preferred.
• Among rules which matched the same
number of characters, the rule given first
is preferred.
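Both rules can be demonstrated with a small sketch (Python; the two-rule specification is a hypothetical example, keyword before identifier as in a typical lex file):

```python
import re

# Rule order mirrors a lex specification: the keyword rule comes first.
RULES = [
    ("IF",         r"if"),
    ("IDENTIFIER", r"[a-z]+"),
]

def match_one(source, pos):
    # Longest match wins; among equal lengths, the earlier rule wins.
    best = None
    for name, pattern in RULES:
        m = re.compile(pattern).match(source, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(match_one("if", 0))        # → ('IF', 'if'): equal length, first rule wins
print(match_one("ifactive", 0))  # → ('IDENTIFIER', 'ifactive'): longest match wins
```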
Error Recovery Schemes;
• Panic mode recovery
• Local correction
• Global correction
Lexical error handling
approaches;
 Lexical errors can be handled by the
following actions:
• Deleting one character from the
remaining input.
• Inserting a missing character into the
remaining input.
• Replacing a character by another
character.
• Transposing two adjacent characters.
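All four single-edit repair actions can be generated by a short sketch (Python; the keyword set is an illustrative assumption):

```python
def repairs(word):
    # Candidate single-edit repairs: delete, insert, replace, transpose.
    letters = "abcdefghijklmnopqrstuvwxyz"
    deletes    = [word[:i] + word[i+1:] for i in range(len(word))]
    inserts    = [word[:i] + c + word[i:] for i in range(len(word) + 1)
                  for c in letters]
    replaces   = [word[:i] + c + word[i+1:] for i in range(len(word))
                  for c in letters]
    transposes = [word[:i] + word[i+1] + word[i] + word[i+2:]
                  for i in range(len(word) - 1)]
    return set(deletes + inserts + replaces + transposes)

KEYWORDS = {"while", "if", "else"}
print(KEYWORDS & repairs("whlie"))
# transposing the adjacent 'l' and 'i' recovers the keyword 'while'
```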