Module 3: System Software and Compiler Design Lecture Notes – 18CS61


Regulation – 2018 (CBCS Scheme)  System Software & Compiler Design – 18CS61

Syntax Analysis: Introduction


The Role of the Parser

 The parser accepts a string of tokens from the lexical analyzer and verifies that the string can be
generated by the grammar of the source language.
 An additional job of the parser is to report any syntax errors in an intelligible fashion and to
recover from commonly occurring errors so that it can continue processing the remainder of the program.
 The parser constructs a parse tree and passes it to the rest of the compiler for further
processing.


Methods of Parsing

There are several general types of parsers for grammars:

➢ Universal parsing: the Cocke-Younger-Kasami algorithm and Earley's algorithm can parse any grammar,
but they are too inefficient to use in production compilers.

➢ Top-down parsing: builds the parse tree from the top (root) to the bottom (leaves).

➢ Bottom-up parsing: builds the parse tree from the bottom (leaves) to the top (root).

➢ In either case, the input to the parser is scanned from left to right, one symbol at a time.

➢ The most efficient top-down and bottom-up methods work only for subclasses of grammars, the LL and LR grammars.

Syntax Error Handling

Common programming errors can occur at many different levels.

➢ Lexical errors: misspelling an identifier or keyword, etc.

➢ Syntax errors: unbalanced parentheses, a misplaced semicolon.

➢ Semantic errors: type mismatches between operators and operands.

➢ Logical errors: anything arising from incorrect reasoning.

The functions of the error handler in a parser are:

➢ It should report the presence of errors clearly and accurately.

➢ It should recover from each error as fast as possible, so that subsequent errors can be
detected.

➢ It should not slow down the processing of correct programs.

Error Recovery Strategies

(1) Panic mode recovery

➢ On discovering an error, the parser discards input symbols one at a time until one of a
designated set of synchronizing tokens is found.

➢ Synchronizing tokens are usually delimiters, such as a semicolon or }, whose role in the source program is
clear and unambiguous.


Advantages:

➢ Simplicity.

➢ It is guaranteed not to go into an infinite loop.
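
The following is a minimal sketch (not from these notes) of panic-mode recovery; the token stream, the routine name and the exact choice of synchronizing tokens are illustrative assumptions.

# Minimal sketch of panic-mode error recovery (illustrative; names are assumptions).
SYNCHRONIZING = {";", "}"}   # designated synchronizing tokens (delimiters)

def panic_mode_recover(tokens, pos):
    """Discard input symbols one at a time until a synchronizing token is found.
    Returns the position just past that token, so parsing can resume there."""
    while pos < len(tokens) and tokens[pos] not in SYNCHRONIZING:
        pos += 1                      # skip the offending symbol
    return pos + 1 if pos < len(tokens) else pos

# Example: after an error at position 2, skip ahead to the token after ';'.
tokens = ["id", "=", "@", "id", "+", "id", ";", "id", "=", "id", ";"]
print(panic_mode_recover(tokens, 2))   # -> 7, parsing resumes at the second statement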

(2) Phrase Level recovery

➢ Perform local correction on the remaining input.

➢ Local correction includes replacing a comma by a semicolon, deleting an extraneous
semicolon, or inserting a missing semicolon, etc.

Disadvantages:

➢ Care should be taken with the replacements; otherwise they may lead to an infinite loop.

➢ It is not suitable if the actual error occurred before the point of detection.

➢ It is very difficult to implement.

(3) Error productions

Augment the grammar with production rules for common errors.

Disadvantages:

➢ Introducing error productions will complicate the grammar.

➢ Using this technique many errors can be handled, but not all types of errors.

(4) Global correction

➢ The parser examines the whole program and tries to find the closest error-free match for it.

➢ The closest match is the program that requires the fewest insertions, deletions and changes of
tokens to recover from the erroneous input.

Disadvantages:

➢ Too costly to implement in terms of time and space, so it is currently of theoretical
interest only.

Context Free Grammar


 A context-free grammar consists of terminals, non-terminals, a start symbol and productions;
that is, G = (N, T, P, S).

Terminals (T)

➢ Basic symbols from which strings are formed. The word token is a synonym for terminal.

Non-terminals (N)

➢ Syntactic variables that denote sets of strings.

Start Symbol (S)

➢ One of the non-terminals is distinguished as the start symbol.

Productions (P)

➢ Productions specify the manner in which the terminals and non-terminals can be combined to
form strings.

➢ Each production consists of a non-terminal (the head), followed by an arrow, followed by a string of
non-terminals and terminals (the body).

➢ Example: The grammar for arithmetic expressions is shown below.

exp → exp + term
exp → exp - term
exp → term
term → term * factor
term → term / factor
term → factor
factor → (exp)
factor → id

➢ Here the terminals are +, -, *, /, (, ), id and the non-terminals are exp, term, factor.
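
As a quick illustration (not part of the original notes), the 4-tuple G = (N, T, P, S) can be represented directly in a program; the dictionary-of-lists encoding below is one common, assumed representation.

# A context-free grammar G = (N, T, P, S) encoded as plain Python data
# (an illustrative sketch; the representation is an assumption, not a standard API).
N = {"exp", "term", "factor"}                      # non-terminals
T = {"+", "-", "*", "/", "(", ")", "id"}           # terminals
P = {                                              # productions: head -> list of bodies
    "exp":    [["exp", "+", "term"], ["exp", "-", "term"], ["term"]],
    "term":   [["term", "*", "factor"], ["term", "/", "factor"], ["factor"]],
    "factor": [["(", "exp", ")"], ["id"]],
}
S = "exp"                                          # start symbol

# Every symbol used in a body must be a terminal or a non-terminal.
assert all(sym in N | T for bodies in P.values() for body in bodies for sym in body)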

Notational Conventions

➢ 1) These symbols are terminals: lowercase letters early in the alphabet, such as a, b, c;
operator symbols such as +, -, *, /; punctuation symbols such as parentheses, the comma and the
semicolon; the digits 0, 1, …, 9; and boldface strings such as id or if.

➢ 2) These symbols are non-terminals:


➢ Uppercase letters early in the alphabet, such as A, B, C.

➢ S, when it appears, is usually the start symbol.

➢ 3) Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols
(either non-terminals or terminals).

➢ 4) Lowercase Greek letters α, β, γ represent (possibly empty) strings of grammar symbols.

➢ 5) A set of productions A → α1, A → α2, …, A → αk with a common head A may be
written as A → α1 | α2 | … | αk.

➢ 6) Unless stated otherwise, the head of the first production is the start symbol.

➢ Using these conventions, the grammar for arithmetic expressions can be written as:

E → E + T | E - T | T
T → T * F | T / F | F
F → (E) | id

Derivations

Definition: Replacing a non-terminal by the body of one of its productions is known as a derivation;
productions are applied from head to body.

Reduction

A reduction replaces a string that matches the body of a production with the non-terminal at its head;
productions are applied from body to head.

Types of derivations:

a) Leftmost derivation (LMD)

At each step, the leftmost non-terminal is replaced by one of its production bodies. If the
leftmost non-terminal in α is replaced to obtain β, we write α ═>lm β.

b) Rightmost derivation (RMD)

At each step, the rightmost non-terminal is replaced by one of its production bodies; we write
α ═>rm β.

Example:


Consider the grammar E→E+E|E-E|-E|(E)|id and the input string -(id+id).

LMD is E═>-E═>-(E) ═>-(E+E) ═>-(id+E) ═>-(id+id)

RMD is E═>-E═>-(E) ═>-(E+E) ═>-(E+id) ═>-(id+id)

Reduction (the derivation in reverse) is -(id+id) ═>-(E+id) ═>-(E+E) ═>-(E) ═>-E═> E
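
A small sketch (not from the notes; the symbol lists are illustrative) that mechanically replays the leftmost derivation above by always expanding the leftmost non-terminal:

# Replays the leftmost derivation of -(id+id) for E -> E+E | E-E | -E | (E) | id.
# Sentential forms are lists of symbols; 'id' is treated as a single terminal token.
NONTERMINALS = {"E"}

def expand_leftmost(sentential, body):
    """Replace the leftmost non-terminal in `sentential` with the symbols in `body`."""
    i = next(k for k, sym in enumerate(sentential) if sym in NONTERMINALS)
    return sentential[:i] + body + sentential[i + 1:]

form = ["E"]
for body in (["-", "E"], ["(", "E", ")"], ["E", "+", "E"], ["id"], ["id"]):
    form = expand_leftmost(form, body)
    print("=>", " ".join(form))
# Prints the sentential forms: - E,  - ( E ),  - ( E + E ),  - ( id + E ),  - ( id + id )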

Parse Trees
A parse tree is a graphical representation of a derivation.

Example:

Consider the grammar E→E+E|E-E|-E|(E)|id and construct the parse tree for the input string
-(id+id).

Definition: A grammar that produces more than one parse tree for some sentence is called ambiguous.

Equivalently, an ambiguous grammar is one that produces more than one leftmost or more than one rightmost derivation for the same sentence.

Example: The arithmetic expression grammar permits two distinct leftmost derivations for the
sentence id+id*id; therefore, it is an ambiguous grammar.


Grammars are a more powerful notation than regular expressions.


➢ Every construct that can be described by a regular expression can also be described by a
grammar, but the converse is not true.

➢ Equivalently, every regular language is a context-free language, but not vice versa.

Consider the regular expression (a|b)*abb. The grammar

A0 → a A0 | b A0 | a A1

A1 → b A2

A2 → b A3

A3 → ε

describes the same language: the set of strings of a's and b's ending in abb. On the other hand,
there are context-free languages, such as the set of strings of the form aⁿbⁿ, that cannot be
described by any regular expression.


Ambiguity: A grammar that produces more than one parse tree for some sentence is said to be an
ambiguous grammar.

• Let us derive the parse tree for the string id + id * id.
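
The two leftmost derivations below illustrate the ambiguity, assuming the usual ambiguous expression grammar E → E+E | E*E | (E) | id. The first corresponds to a parse tree in which * is nested inside +, the second to a tree in which + is nested inside *:

E ═> E+E ═> id+E ═> id+E*E ═> id+id*E ═> id+id*id

E ═> E*E ═> E+E*E ═> id+E*E ═> id+id*E ═> id+id*id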

Elimination of Left Recursion


A grammar is left recursive if it has a non-terminal A such that there is a derivation A ═>+ Aα
for some string α. Top-down parsing methods cannot handle left-recursive grammars, so a
transformation is needed to eliminate the left recursion. Immediate left recursion of the form
A → Aα | β can be eliminated by rewriting it as A → βA' and A' → αA' | ε.
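
A minimal sketch (illustrative names; it handles immediate left recursion only) of this transformation:

# Eliminate immediate left recursion: A -> A a1 | ... | A am | b1 | ... | bn
# becomes A -> b1 A' | ... | bn A'   and   A' -> a1 A' | ... | am A' | epsilon.
def eliminate_immediate_left_recursion(head, bodies):
    """`bodies` is a list of lists of symbols. Returns a dict of new productions."""
    recursive = [b[1:] for b in bodies if b and b[0] == head]     # the alpha parts
    others    = [b     for b in bodies if not b or b[0] != head]  # the beta parts
    if not recursive:
        return {head: bodies}            # no left recursion: nothing to do
    new = head + "'"
    return {
        head: [beta + [new] for beta in others],
        new:  [alpha + [new] for alpha in recursive] + [["epsilon"]],
    }

# Example: E -> E + T | T   becomes   E -> T E',  E' -> + T E' | epsilon
print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))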

Top-down parsing is a parsing strategy in which one first looks at the highest level of the parse
tree and works down the parse tree by using the rewriting rules of a formal grammar. LL parsers
are a type of parser that uses a top-down parsing strategy.

In top-down parsing, the parse tree is generated from top to bottom, i.e., from the root to the
leaves, and is expanded until all leaves are generated.

The parse tree is built with the start symbol of the grammar as its root. The parser starts the
derivation from the start symbol of the grammar and performs a leftmost derivation at each step.

Drawback of Top-Down Parsing

 Top-down parsing tries to identify the leftmost derivation for an input string ω, which is
equivalent to generating a parse tree for the input string ω that starts from the root and produces
the nodes in a pre-defined order.
 The reason that top-down parsing follows the leftmost derivation for an input string ω, and not
the rightmost derivation, is that the input string ω is scanned by the parser from left to right,

one symbol/token at a time. The leftmost derivation generates the leaves of the parse tree in
left-to-right order, which matches the input scan order.
 In top-down parsing, each terminal symbol produced by the predicted production of the
grammar is compared with the input symbol pointed to by the string marker. If the match is
successful, the parser can continue. If a mismatch occurs, the prediction has gone wrong.
 At this point it is necessary to reject the previous prediction. The prediction that led to the
mismatching terminal symbol is rejected, and the string marker (pointer) is reset to the position
it had when the rejected production was applied. This is known as backtracking.
 Backtracking is the major drawback of top-down parsing.
Types of Top-Down Parsing

There are two types of top-down parsing which are as follows −

 Top-Down Parsing with Backtracking


In backtracking, the parser can make repeated scans of the input. If the required input string cannot
be derived by applying one production rule, another production rule can be tried at each step to
obtain the required string (a small sketch appears after this list).

 Top-Down Parsing without Backtracking


Once a production rule is applied, it cannot be undone.
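
To contrast the two, here is a minimal sketch (illustrative, not from the notes) of top-down parsing with backtracking for the classic grammar S → cAd, A → ab | a and the input cad; a parser without backtracking would have to commit to its first choice for A.

# Top-down parsing with backtracking for S -> c A d, A -> a b | a (input "cad").
# Symbols are single characters; uppercase letters are non-terminals.
GRAMMAR = {"S": [["c", "A", "d"]], "A": [["a", "b"], ["a"]]}

def parse(symbols, inp, pos):
    """Try to match the list `symbols` against inp[pos:]; return the end position or None."""
    if not symbols:
        return pos
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:                       # non-terminal: try each body in turn
        for body in GRAMMAR[first]:
            result = parse(body + rest, inp, pos)
            if result is not None:
                return result                  # this prediction succeeded
            # otherwise backtrack: the input pointer is implicitly reset to `pos`
        return None
    if pos < len(inp) and inp[pos] == first:   # terminal: must match the input
        return parse(rest, inp, pos + 1)
    return None

inp = "cad"
print(parse(["S"], inp, 0) == len(inp))        # True: A -> a b fails, then A -> a succeeds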

Predictive Parser − A predictive parser is also known as a non-recursive predictive parser. It is an

efficient way of implementing recursive-descent parsing by handling the stack of activation records
explicitly. The predictive parser has an input, a stack, a parsing table, and an output. The input
consists of the string to be parsed, followed by $, the right-end marker.
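
A minimal sketch (illustrative; the tiny grammar and its parsing table are assumptions, and the output of productions used is omitted) of the non-recursive predictive parsing loop:

# Table-driven (non-recursive) predictive parsing loop.
# Tiny LL(1) grammar: S -> a S | b, with end marker $.
TABLE = {                       # TABLE[(nonterminal, lookahead)] = production body
    ("S", "a"): ["a", "S"],
    ("S", "b"): ["b"],
}
NONTERMINALS = {"S"}

def predictive_parse(tokens):
    stack = ["$", "S"]                      # start symbol on top of the end marker
    tokens = tokens + ["$"]
    i = 0
    while stack:
        top = stack.pop()
        if top == tokens[i]:                # terminal (or $) matches the lookahead
            i += 1
        elif top in NONTERMINALS:
            body = TABLE.get((top, tokens[i]))
            if body is None:
                return False                # no table entry: syntax error
            stack.extend(reversed(body))    # push the body so its first symbol is on top
        else:
            return False                    # terminal mismatch: syntax error
    return i == len(tokens)

print(predictive_parse(["a", "a", "b"]))    # True
print(predictive_parse(["a", "a"]))         # False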


Recursive Descent Parser − A top-down parser that implements a set of recursive procedures to
process the input without backtracking is known as a recursive-descent parser, and this kind of
parsing is known as recursive-descent parsing.
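
A minimal sketch (illustrative) of a recursive-descent parser for the left-recursion-free grammar E → T E', E' → + T E' | ε, T → id, obtained from E → E + T | T:

# Recursive-descent parser (one procedure per non-terminal, no backtracking)
# for E -> T E',  E' -> + T E' | epsilon,  T -> id.
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens + ["$"]
        self.i = 0

    def match(self, expected):
        if self.tokens[self.i] == expected:
            self.i += 1
        else:
            raise SyntaxError(f"expected {expected}, got {self.tokens[self.i]}")

    def E(self):            # E -> T E'
        self.T()
        self.E_prime()

    def E_prime(self):      # E' -> + T E' | epsilon
        if self.tokens[self.i] == "+":
            self.match("+")
            self.T()
            self.E_prime()  # otherwise take the epsilon production: simply return

    def T(self):            # T -> id
        self.match("id")

p = Parser(["id", "+", "id"])
p.E()
print(p.tokens[p.i] == "$")   # True: the whole input was consumed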

Bottom-up parsing can be defined as an attempt to reduce the input string w to the start symbol
of the grammar by tracing out the rightmost derivation of w in reverse. For example, LR parsing
is a general form of shift-reduce parsing.

Bottom-up parsing starts from the leaf nodes of the tree and works upward until it reaches the
root node. Here, we start from the sentence and apply production rules in reverse (reductions) in
order to reach the start symbol.

Shift-Reduce Parsing

Shift-reduce parsing uses two unique steps for bottom-up parsing. These steps are known as shift-step
and reduce-step.

 Shift step: The shift step refers to advancing the input pointer to the next input symbol; the
symbol just read, called the shifted symbol, is pushed onto the stack. The shifted symbol is
treated as a single node of the parse tree.


 Reduce step: When the parser finds a complete right-hand side (body) of a grammar rule on top
of the stack and replaces it with the left-hand side non-terminal, it is known as a reduce step. This
occurs when the top of the stack contains a handle. To reduce, the handle is popped off the stack and
replaced with the LHS non-terminal symbol.
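
The following sketch (illustrative; the fixed action sequence was worked out by hand and is an assumption, not a parsing algorithm) shows the stack mechanics of a shift-reduce parse of id + id * id with the grammar E → E + T | T, T → T * F | F, F → id:

# Stack mechanics of a shift-reduce parse of "id + id * id"
# with E -> E + T | T,  T -> T * F | F,  F -> id.
tokens = ["id", "+", "id", "*", "id"]
actions = [
    ("shift", None), ("reduce", ("F", ["id"])), ("reduce", ("T", ["F"])),
    ("reduce", ("E", ["T"])), ("shift", None), ("shift", None),
    ("reduce", ("F", ["id"])), ("reduce", ("T", ["F"])), ("shift", None),
    ("shift", None), ("reduce", ("F", ["id"])),
    ("reduce", ("T", ["T", "*", "F"])), ("reduce", ("E", ["E", "+", "T"])),
]

stack, i = [], 0
for kind, rule in actions:
    if kind == "shift":
        stack.append(tokens[i]); i += 1          # push the next input symbol
    else:
        head, body = rule
        assert stack[-len(body):] == body        # the handle must be on top of the stack
        del stack[-len(body):]                   # pop the handle ...
        stack.append(head)                       # ... and push the head non-terminal
    print(f"{kind:6}  stack: {' '.join(stack):15}  rest: {' '.join(tokens[i:])}")
print("accepted:", stack == ["E"] and i == len(tokens))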

LR Parser
The LR parser is a non-recursive, shift-reduce, bottom-up parser. It handles a wide class of context-free
grammars, which makes it one of the most efficient syntax analysis techniques. LR parsers are also known as
LR(k) parsers, where L stands for left-to-right scanning of the input stream, R stands for the
construction of a rightmost derivation in reverse, and k denotes the number of lookahead symbols used to
make parsing decisions.
There are three widely used algorithms available for constructing an LR parser:

 SLR(1) – Simple LR Parser:

o Works on the smallest class of grammars
o Has few states, hence a very small table
o Simple and fast construction
 LR(1) – Canonical LR Parser:
o Works on the complete set of LR(1) grammars
o Generates a large table and a large number of states
o Slow construction
 LALR(1) – Look-Ahead LR Parser:
o Works on an intermediate-sized class of grammars
o The number of states is the same as in SLR(1)

LL vs. LR


➢ LL does a leftmost derivation; LR does a rightmost derivation in reverse.

➢ LL starts with the root non-terminal on the stack; LR ends with the root non-terminal on the stack.

➢ LL ends when the stack is empty; LR starts with an empty stack.

➢ LL uses the stack for designating what is still to be expected; LR uses the stack for designating what has already been seen.

➢ LL builds the parse tree top-down; LR builds the parse tree bottom-up.

➢ LL continuously pops a non-terminal off the stack and pushes the corresponding right-hand side; LR tries to recognize a right-hand side on the stack, pops it, and pushes the corresponding non-terminal.

➢ LL expands the non-terminals; LR reduces the non-terminals.

➢ LL reads the terminals when it pops one off the stack; LR reads the terminals while it pushes them onto the stack.

➢ LL performs a pre-order traversal of the parse tree; LR performs a post-order traversal.

A parser should be able to detect and report any error in the program. It is expected that when an error
is encountered, the parser should be able to handle it and carry on parsing the rest of the input. The
parser is mainly expected to check for syntax errors, but errors may be encountered at various stages of
the compilation process. A program may have the following kinds of errors at various stages:

 Lexical: the name of some identifier typed incorrectly

 Syntactical: a missing semicolon or unbalanced parentheses
 Semantical: an incompatible value assignment
 Logical: unreachable code, an infinite loop

 There are four common error-recovery strategies that can be implemented in the parser to deal
with errors in the code.

Panic mode


When a parser encounters an error anywhere in the statement, it ignores the rest of the statement by
not processing input from the erroneous token until a delimiter, such as a semicolon, is reached. This is
the easiest way of error recovery, and it also prevents the parser from going into an infinite loop.

Statement mode

When a parser encounters an error, it tries to take corrective measures so that the rest of the inputs of
the statement allow the parser to parse ahead. For example, inserting a missing semicolon or replacing a
comma with a semicolon. Parser designers have to be careful here, because one wrong correction
may lead to an infinite loop.

Error productions

Some common errors that may occur in the code are known to the compiler designers. The designers
can create an augmented grammar containing productions that generate the erroneous constructs, so
that these errors can be detected when they are encountered.

Global correction

The parser considers the program in hand as a whole, tries to figure out what the program is
intended to do, and tries to find the closest match for it that is error-free. When an erroneous
input (statement) X is fed, it creates a parse tree for some closest error-free statement Y. This may
allow the parser to make minimal changes in the source code, but due to the complexity (in time and
space) of this strategy, it has not been implemented in practice yet.

Abstract Syntax Trees

Parse tree representations are not convenient for the later phases of the compiler, as they contain
more detail than is actually needed. Consider the parse tree for a typical arithmetic expression:


If we look closely, most of the leaf nodes are the single child of their parent nodes. This
information can be eliminated before feeding the tree to the next phase. By hiding the extra
information, we obtain a much smaller tree, the abstract syntax tree (AST):


ASTs are important data structures in a compiler, carrying the least unnecessary information. ASTs are more
compact than parse trees and can be easily used by a compiler.
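
A minimal sketch (illustrative node types, not from the notes) of an AST for id + id * id, in which the single-child chains of the parse tree have been collapsed into operator and operand nodes:

# A tiny AST for id + id * id: only operators and operands remain;
# the single-child chains of the parse tree (E -> T -> F -> id, ...) are collapsed.
from dataclasses import dataclass

@dataclass
class Id:               # leaf node for an identifier
    name: str

@dataclass
class BinOp:            # interior node for a binary operator
    op: str
    left: object
    right: object

# id + id * id parses as id + (id * id), so '*' is nested under '+'.
ast = BinOp("+", Id("a"), BinOp("*", Id("b"), Id("c")))

def to_string(node):
    """Post-order walk of the AST (the order in which a bottom-up parser would build it)."""
    if isinstance(node, Id):
        return node.name
    return f"({to_string(node.left)} {node.op} {to_string(node.right)})"

print(to_string(ast))   # (a + (b * c))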

Prepared by SHOBA V, Sri Sairam College of Engineering, Anekal, Bengaluru
