
Chapter Two: Lexical Analysis

2. Lexical Analysis
 The primary function of a scanner is to transform a
character stream into a token stream.
 A scanner is sometimes called a lexical analyzer, or
lexer.
 Lexical analysis is the first phase of a compiler
 The lexical analyzer breaks the source text into a series
of tokens, removing any whitespace or comments
in the source code.
 If the lexical analyzer finds a token invalid, it generates
an error.
 It reads the character stream from the source code,
checks for legal tokens, and passes them to the
syntax analyzer on demand.
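 As an illustration, the following C sketch (not taken from these slides; all names such as next_token are invented) shows how a scanner might turn the character stream "int value = 100;" into a token stream:
#include <ctype.h>
#include <stdio.h>
#include <string.h>

typedef enum { TOK_KEYWORD, TOK_IDENT, TOK_NUMBER, TOK_OP, TOK_SYMBOL, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    char lexeme[64];                     /* the matched character sequence */
} Token;

static const char *src;                  /* the character stream being scanned */

static Token next_token(void) {
    Token t = { TOK_EOF, "" };
    while (isspace((unsigned char)*src)) /* whitespace is not a token */
        src++;
    if (*src == '\0')
        return t;

    int n = 0;
    if (isalpha((unsigned char)*src) || *src == '_') {   /* identifier or keyword */
        while (isalnum((unsigned char)*src) || *src == '_')
            t.lexeme[n++] = *src++;
        t.lexeme[n] = '\0';
        t.kind = (strcmp(t.lexeme, "int") == 0) ? TOK_KEYWORD : TOK_IDENT;
    } else if (isdigit((unsigned char)*src)) {           /* numeric constant */
        while (isdigit((unsigned char)*src))
            t.lexeme[n++] = *src++;
        t.lexeme[n] = '\0';
        t.kind = TOK_NUMBER;
    } else {                                             /* single-character operator or symbol */
        t.lexeme[0] = *src++;
        t.lexeme[1] = '\0';
        t.kind = (t.lexeme[0] == '=') ? TOK_OP : TOK_SYMBOL;
    }
    return t;
}

int main(void) {
    src = "int value = 100;";
    for (Token t = next_token(); t.kind != TOK_EOF; t = next_token())
        printf("%d  %s\n", (int)t.kind, t.lexeme);
    return 0;
}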
Cont…
Tokens
 A lexeme is the sequence of characters in the source program that matches the pattern for a token.
 In a programming language, keywords, constants, identifiers, strings,
numbers, operators, and punctuation symbols can be considered as
tokens.
 For example, in the C language, the variable declaration line
int value = 100;
contains the tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
 Example of non-tokens:
comment                      /* try again */
preprocessor directive       #include <stdio.h>
preprocessor directive       #define NUMS 5, 6
macro                        NUMS
blanks, tabs, and newlines
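 The non-tokens above are simply discarded by the scanner. A small illustrative C helper (hypothetical, not from the slides) that skips blanks, tabs, newlines and comments might look like this:
#include <stdio.h>

/* Advance p past blanks, tabs, newlines and comments (non-tokens) and return
   the position of the next character that can start a real token. */
const char *skip_non_tokens(const char *p) {
    for (;;) {
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;                                   /* blanks, tabs, newlines */
        if (p[0] == '/' && p[1] == '*') {          /* start of a comment */
            p += 2;
            while (*p != '\0' && !(p[0] == '*' && p[1] == '/'))
                p++;
            if (*p != '\0')
                p += 2;                            /* step over the closing delimiter */
        } else {
            return p;                              /* the next token starts here */
        }
    }
}

int main(void) {
    printf("next token starts at: \"%s\"\n",
           skip_non_tokens("   /* try again */  NUMS"));
    return 0;
}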
2.1. LEXICAL ERRORS:
 Lexical errors are the errors thrown by the lexer when it is unable to continue.
 This means that there is no way to recognize a lexeme as a valid token for the lexer.
 Error-recovery actions are:
Delete one character from the remaining input.
Insert a missing character into the remaining input.
Replace a character by another character.
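 A tiny C sketch of the first recovery action (delete one character) follows; it assumes a toy language in which only letters, digits, blanks, '=' and ';' may appear, and is illustrative only:
#include <ctype.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    const char *p = "value@ = 100;";      /* '@' cannot start or continue any token */
    while (*p != '\0') {
        if (isalnum((unsigned char)*p) || strchr(" =;", *p) != NULL) {
            putchar(*p);                  /* character can be part of some token */
        } else {
            fprintf(stderr, "lexical error: deleting illegal character '%c'\n", *p);
        }
        p++;                              /* either way, scanning continues */
    }
    putchar('\n');
    return 0;
}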
2.2. Specifications of Tokens
 To identify tokens, we first introduce regular expressions, a notation that can be used to describe essentially all the tokens of a programming language.
 Secondly, we build token recognizers, which are designed using transition diagrams and finite automata.
Cont….


2.2.1. Alphabets, Strings and Languages
Alphabets
 An alphabet is any finite set of symbols. For example, ∑ = {0,1} is the binary alphabet, ∑ = {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and ∑ = {a-z, A-Z} is the set of English letters.
Strings
 Any finite sequence of symbols drawn from an alphabet is called a string.
 The length of a string is the total number of symbols in it, e.g., the length of the string tutorialspoint is 14, which is denoted by |tutorialspoint| = 14.
 A string containing no symbols, i.e. a string of length zero, is known as the empty string and is denoted by ε (epsilon).
 Example: 01101, 111, 0001, … are strings over the binary alphabet ∑ = { 0, 1 }.
String Cont….


Power of an alphabet:
– If ∑ is an alphabet, we can express the set of all strings of a certain length from that alphabet by using an exponential
notation. We define ∑k to be the set of strings of length k, each of whose symbols is in ∑.
– ∑0= {ϵ}, regardless of what alphabet ∑ is.
– If ∑ = {0,1}, then ∑1 = {0,1}, ∑2 = {00, 01, 10, 11}, ∑3={000, 001, 010, 011, 100, 101, 110,111}
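– In general, if the alphabet has n symbols (|∑| = n), then ∑k contains n^k strings; for ∑ = {0,1}, ∑3 contains 2^3 = 8 strings, matching the list above.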

The set of all strings over an alphabet ∑ is conventionally denoted ∑*.
Example: {0,1}* = {ϵ, 0, 1, 00, 01, 10, 11, 000, …}
∑* = ∑0 U∑1 U∑2 U∑3U …

The set of non-empty strings over alphabet ∑ is denoted ∑+ (excluding the empty string from the set of strings)
∑+ = ∑1 U ∑2 U ∑3 U …
∑* = ∑+ U {ε}.
String Cont….


Operations on strings
– concatenation (product):
• Let x and y be strings. Then xy denotes the concatenation of x and y, that is, the string formed by making a copy of x followed by a copy of y.
• Example: Let x= 01101 and y= 110, then xy= 01101110 (yx=11001101)
• Note: For any string w, εw = wε = w (i.e. ϵ is the identity for concatenation)
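• Note: the length of a concatenation is the sum of the lengths, |xy| = |x| + |y|; in the example above, |01101110| = 5 + 3 = 8.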

Language (remember DFA and NFA)


A language is a set of strings over some finite alphabet.
 Regular languages (which include all finite languages) can be described by means of regular expressions.
2.3.2. Regular Expressions
 Regular expressions are a convenient way to specify various simple sets of strings.
 A set of strings defined by a regular expression is called a regular set.
 For purposes of scanning, a token class is a regular set whose structure is defined by a regular expression.
 A particular instance of a token class is sometimes called a lexeme.
 Regular expressions can express languages by defining a pattern for finite strings of symbols.
 The grammar defined by regular expressions is known as a regular grammar.
 The language defined by a regular grammar is known as a regular language.
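 As a small illustration (assumed, not part of the slides), the POSIX regex library in C can test whether a lexeme belongs to the token class described by the regular expression [A-Za-z_][A-Za-z0-9_]* (identifiers):
#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    const char *lexemes[] = { "value", "_tmp1", "100", "int" };

    /* Compile the regular expression for the "identifier" token class. */
    if (regcomp(&re, "^[A-Za-z_][A-Za-z0-9_]*$", REG_EXTENDED) != 0)
        return 1;

    for (int i = 0; i < 4; i++)
        printf("%-6s -> %s\n", lexemes[i],
               regexec(&re, lexemes[i], 0, NULL, 0) == 0 ? "identifier"
                                                         : "not an identifier");
    regfree(&re);
    return 0;
}
 (A real scanner would also check matched identifiers against the keyword table, so that int is reported as a keyword rather than an identifier.)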
Reading assignment

Conversion of NFA to DFA
Error Handling in Compiler Design

 The task of the error handling process is to detect each error and report it to the user.
 Types or sources of error – there are two types of error: run-time and compile-time errors.
 A run-time error is an error which takes place during the execution of a program.
 Invalid input data and logical errors (the program does not produce the expected result) are examples of run-time errors.
 Compile-time errors arise at compile time, before execution of the program. A syntax error or a missing file reference are examples of compile-time errors.
Cont..
Classification of Compile-time errors –
 Lexical: this includes misspellings of identifiers, keywords, or operators.
 Syntactic: a missing semicolon or unbalanced parentheses.
 Semantic: incompatible value assignment or type mismatches between operator and operand.
 Logical: unreachable code, infinite loops.
Recovery methods for errors
There are some common recovery methods, as follows.
1. Panic mode recovery
 This is the easiest way of error recovery, and it also prevents the parser from developing infinite loops while recovering from an error.
 Its advantage is that it is easy to implement and guaranteed not to go into an infinite loop.
 Its disadvantage is that a considerable amount of input is skipped without checking it for additional errors.
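 A minimal C sketch of panic-mode recovery (illustrative; the choice of ';' as the synchronizing character is an assumption):
#include <stdio.h>

/* On an error, discard input characters until a synchronizing character
   such as ';' is found, then resume scanning just after it. */
const char *recover_to_semicolon(const char *p) {
    while (*p != '\0' && *p != ';')
        p++;                               /* skipped input is not checked further */
    return (*p == ';') ? p + 1 : p;
}

int main(void) {
    const char *after_error = recover_to_semicolon("@@ garbage ; y = 6 ;");
    printf("resuming at: \"%s\"\n", after_error);
    return 0;
}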
Recovery methods for errors (cont.)
2. Phrase level recovery
 Perform local correction on the input to repair the error.
3. Error productions
 Some common errors that may occur in the code are known to the compiler designers.
 The grammar can be augmented with productions that generate these erroneous constructs, so they are recognized when encountered.
 Example: writing 5x instead of 5*x.

4. Global correction:
 Its aim is to make as few changes as possible while converting an
incorrect input string to a valid string.
Classification of Compile-time errors
1. Lexical phase errors: These errors are detected during the lexical analysis phase. Typical lexical errors are:
•Exceeding the length limit of identifiers or numeric constants.
•Appearance of illegal characters, e.g. printf("Geeksforgeeks");$
•Unterminated or unmatched string literals.
2. Syntactic phase errors: These errors are detected during the syntax analysis phase. Typical syntax errors are:
•Errors in structure
•Missing operators
•Misspelled keywords
•Unbalanced parentheses
Classification of Compile-time errors
3. Semantic errors: These errors are detected during the semantic analysis phase. Typical semantic errors are:
 Incompatible types of operands
 Undeclared variables
 Mismatch between actual arguments and formal parameters
A Typical Lexical Analyzer Generator
Lex/Flex: Lex is a program designed to generate scanners, also
known as tokenizers, which recognize lexical patterns in text.
Lex is an acronym that stands for "lexical analyzer generator."
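As a minimal illustration (an assumed example, not taken from these slides), a Flex specification for the tokens used earlier in this chapter could look like the following; the actions in braces are ordinary C code, and Flex generates the scanner function yylex() from the patterns:
%{
#include <stdio.h>
%}
%%
[ \t\n]+                ;   /* non-tokens: skip blanks, tabs, newlines */
"int"                   { printf("keyword    %s\n", yytext); }
[0-9]+                  { printf("constant   %s\n", yytext); }
[A-Za-z_][A-Za-z0-9_]*  { printf("identifier %s\n", yytext); }
"="                     { printf("operator   %s\n", yytext); }
";"                     { printf("symbol     %s\n", yytext); }
.                       { fprintf(stderr, "lexical error: '%s'\n", yytext); }
%%
int main(void) { return yylex(); }
int yywrap(void) { return 1; }
Assuming the specification is saved as scanner.l (a file name chosen here for illustration), it can be built and run with: flex scanner.l && cc lex.yy.c -o scanner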
The End