Module 3: System Software and Compiler Design Lecture Notes – 18CS61


Regulation – 2018 (CBCS Scheme)  System Software & Compiler Design – 18CS61

Syntax Analysis: Introduction


The Role of the Parser

 The parser accepts a string of tokens from the lexical analyzer and verifies that the string can be
generated by the grammar of the source language.
 An additional job of the parser is to report any syntax errors in an intelligible fashion and to
recover from commonly occurring errors so that it can continue processing the remainder of the program.
 The parser constructs a parse tree and passes it to the rest of the compiler for further
processing.


Methods of Parsing

There are several general types of parsers for grammars:

➢ Universal parsing: the Cocke-Younger-Kasami algorithm and Earley's algorithm can parse any grammar,
but they are too inefficient to use in production compilers.

➢ Top-down parsing: builds the parse tree from the top (root) to the bottom (leaves).

➢ Bottom-up parsing: builds the parse tree from the bottom (leaves) to the top (root).

➢ In either case, the input to the parser is scanned from left to right, one symbol at a time.

➢ The most efficient top-down and bottom-up methods work only for subclasses of grammars, the LL and LR grammars.

Syntax Error Handling

Common programming errors can occur at many different levels.

➢ Lexical errors: misspelling an identifier or keyword, etc.

➢ Syntax errors: unbalanced parentheses, a misplaced semicolon.

➢ Semantic errors: type mismatches between operators and operands.

➢ Logical errors: anything arising from incorrect reasoning.

The functions of the error handler in a parser are:

➢ It should report the presence of errors clearly and accurately.

➢ It should recover from each error as fast as possible, so that subsequent errors can be
detected.

➢ It should not slow down the processing of correct programs.

Error Recovery Strategies

(1) Panic mode recovery

➢ On discovering an error, the parser discards input symbols one at a time until one of a
designated set of synchronizing tokens is found.

➢ Synchronizing tokens are usually delimiters, such as a semicolon or }, whose role in the source program is
clear and unambiguous.


Advantages:

➢ Simplicity.

➢ It is guaranteed not to go into an infinite loop.
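
The following is a minimal sketch (not from these notes) of panic-mode recovery; the token stream, the routine name and the exact choice of synchronizing tokens are illustrative assumptions.

# Minimal sketch of panic-mode error recovery (illustrative; names are assumptions).
SYNCHRONIZING = {";", "}"}   # designated synchronizing tokens (delimiters)

def panic_mode_recover(tokens, pos):
    """Discard input symbols one at a time until a synchronizing token is found.
    Returns the position just past that token, so parsing can resume there."""
    while pos < len(tokens) and tokens[pos] not in SYNCHRONIZING:
        pos += 1                      # skip the offending symbol
    return pos + 1 if pos < len(tokens) else pos

# Example: after an error at position 2, skip ahead to the token after ';'.
tokens = ["id", "=", "@", "id", "+", "id", ";", "id", "=", "id", ";"]
print(panic_mode_recover(tokens, 2))   # -> 7, parsing resumes at the second statement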

(2) Phrase Level recovery

➢ Perform local correction on the remaining input.

➢ Local correction includes replacing a comma by a semicolon, deleting an extraneous
semicolon, or inserting a missing semicolon, etc.

Disadvantages:

➢ Care should be taken with the replacements; otherwise they may lead to an infinite loop.

➢ It is not suitable if the actual error occurred before the point of detection.

➢ It is very difficult to implement.

(3) Error productions

Augment the grammar with production rules for common errors.

Disadvantages:

➢ Introducing error productions will complicate the grammar.

➢ Using this technique many errors can be handled, but not all types of errors.

(4) Global correction

➢ The parser examines the whole program and tries to find the closest error-free match for it.

➢ The closest match is the program that requires the fewest insertions, deletions and changes of
tokens to recover from the erroneous input.

Disadvantages:

➢ Too costly to implement in terms of time and space, so it is currently of theoretical
interest only.

Context Free Grammar


 A context-free grammar consists of terminals, non-terminals, a start symbol and productions;
that is, G = (N, T, P, S).

Terminals (T)

➢ Basic symbols from which strings are formed. The word token is a synonym for terminal.

Non-terminals (N)

➢ Syntactic variables that denote sets of strings.

Start Symbol (S)

➢ One of the non-terminals is distinguished as the start symbol.

Productions (P)

➢ Productions specify the manner in which the terminals and non-terminals can be combined to
form strings.

➢ Each production consists of a non-terminal (the head), followed by an arrow, followed by a string of
non-terminals and terminals (the body).

➢ Example: The grammar for arithmetic expressions is shown below.

exp → exp + term
exp → exp - term
exp → term
term → term * factor
term → term / factor
term → factor
factor → (exp)
factor → id

➢ Here the terminals are +, -, *, /, (, ), id and the non-terminals are exp, term, factor.
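
As a quick illustration (not part of the original notes), the 4-tuple G = (N, T, P, S) can be represented directly in a program; the dictionary-of-lists encoding below is one common, assumed representation.

# A context-free grammar G = (N, T, P, S) encoded as plain Python data
# (an illustrative sketch; the representation is an assumption, not a standard API).
N = {"exp", "term", "factor"}                      # non-terminals
T = {"+", "-", "*", "/", "(", ")", "id"}           # terminals
P = {                                              # productions: head -> list of bodies
    "exp":    [["exp", "+", "term"], ["exp", "-", "term"], ["term"]],
    "term":   [["term", "*", "factor"], ["term", "/", "factor"], ["factor"]],
    "factor": [["(", "exp", ")"], ["id"]],
}
S = "exp"                                          # start symbol

# Every symbol used in a body must be a terminal or a non-terminal.
assert all(sym in N | T for bodies in P.values() for body in bodies for sym in body)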

Notational Conventions

➢ 1) These symbols are terminals: lowercase letters early in the alphabet, such as a, b, c;
operator symbols such as +, -, *, /; punctuation symbols such as parentheses, the comma and the
semicolon; the digits 0, 1, …, 9; and boldface strings such as id or if.

➢ 2) These symbols are non-terminals:


➢ Uppercase letters early in the alphabet, such as A, B, C.

➢ S, when it appears, is usually the start symbol.

➢ 3) Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols
(either non-terminals or terminals).

➢ 4) Lowercase Greek letters α, β, γ represent (possibly empty) strings of grammar symbols.

➢ 5) A set of productions A → α1, A → α2, …, A → αk with a common head A may be
written as A → α1 | α2 | … | αk.

➢ 6) Unless stated otherwise, the head of the first production is the start symbol.

➢ Using these conventions, the grammar for arithmetic expressions can be written as:

E → E + T | E - T | T
T → T * F | T / F | F
F → (E) | id

Derivations

Definition: Replacing a non-terminal by the body of one of its productions is known as a derivation;
productions are applied from head to body.

Reduction

A reduction replaces a string that matches the body of a production with the non-terminal at its head;
productions are applied from body to head.

Types of derivations:

a) Leftmost derivation (LMD)

At each step, the leftmost non-terminal is replaced by one of its production bodies. If the
leftmost non-terminal in α is replaced to obtain β, we write α ═>lm β.

b) Rightmost derivation (RMD)

At each step, the rightmost non-terminal is replaced by one of its production bodies; we write
α ═>rm β.

Example:


Consider the grammar E→E+E|E-E|-E|(E)|id and the input string -(id+id).

LMD is E═>-E═>-(E) ═>-(E+E) ═>-(id+E) ═>-(id+id)

RMD is E═>-E═>-(E) ═>-(E+E) ═>-(E+id) ═>-(id+id)

Reduction (the derivation in reverse) is -(id+id) ═>-(E+id) ═>-(E+E) ═>-(E) ═>-E═> E
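
A small sketch (not from the notes; the symbol lists are illustrative) that mechanically replays the leftmost derivation above by always expanding the leftmost non-terminal:

# Replays the leftmost derivation of -(id+id) for E -> E+E | E-E | -E | (E) | id.
# Sentential forms are lists of symbols; 'id' is treated as a single terminal token.
NONTERMINALS = {"E"}

def expand_leftmost(sentential, body):
    """Replace the leftmost non-terminal in `sentential` with the symbols in `body`."""
    i = next(k for k, sym in enumerate(sentential) if sym in NONTERMINALS)
    return sentential[:i] + body + sentential[i + 1:]

form = ["E"]
for body in (["-", "E"], ["(", "E", ")"], ["E", "+", "E"], ["id"], ["id"]):
    form = expand_leftmost(form, body)
    print("=>", " ".join(form))
# Prints the sentential forms: - E,  - ( E ),  - ( E + E ),  - ( id + E ),  - ( id + id )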

Parse Trees
A parse tree is a graphical representation of a derivation.

Example:

Consider the grammar E→E+E|E-E|-E|(E)|id and construct the parse tree for the input string
-(id+id).

Definition: A grammar that produces more than one parse tree for some sentence is called ambiguous.

Equivalently, an ambiguous grammar is one that produces more than one leftmost or more than one rightmost derivation for the same sentence.

Example: The arithmetic expression grammar permits two distinct leftmost derivations for the
sentence id+id*id; therefore, it is an ambiguous grammar.


Grammars are a more powerful notation than regular expressions.


➢ Every construct that can be described by a regular expression can also be described by a
grammar, but the converse is not true.

➢ Equivalently, every regular language is a context-free language, but not vice versa.

Consider the regular expression (a|b)*abb. The grammar

A0 → a A0 | b A0 | a A1

A1 → b A2

A2 → b A3

A3 → ε

describes the same language: the set of strings of a's and b's ending in abb. On the other hand,
there are context-free languages, such as the set of strings of the form aⁿbⁿ, that cannot be
described by any regular expression.


Ambiguity: A grammar that produces more than one parse tree for some sentence is said to be an
ambiguous grammar.

• Let us derive the parse tree for the string id + id * id.
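
The two leftmost derivations below illustrate the ambiguity, assuming the usual ambiguous expression grammar E → E+E | E*E | (E) | id. The first corresponds to a parse tree in which * is nested inside +, the second to a tree in which + is nested inside *:

E ═> E+E ═> id+E ═> id+E*E ═> id+id*E ═> id+id*id

E ═> E*E ═> E+E*E ═> id+E*E ═> id+id*E ═> id+id*id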

Elimination of Left Recursion


A grammar is left recursive if it has a non-terminal A such that there is a derivation A ═>+ Aα
for some string α. Top-down parsing methods cannot handle left-recursive grammars, so a
transformation is needed to eliminate the left recursion. Immediate left recursion of the form
A → Aα | β can be eliminated by rewriting it as A → βA' and A' → αA' | ε.
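
A minimal sketch (illustrative names; it handles immediate left recursion only) of this transformation:

# Eliminate immediate left recursion: A -> A a1 | ... | A am | b1 | ... | bn
# becomes A -> b1 A' | ... | bn A'   and   A' -> a1 A' | ... | am A' | epsilon.
def eliminate_immediate_left_recursion(head, bodies):
    """`bodies` is a list of lists of symbols. Returns a dict of new productions."""
    recursive = [b[1:] for b in bodies if b and b[0] == head]     # the alpha parts
    others    = [b     for b in bodies if not b or b[0] != head]  # the beta parts
    if not recursive:
        return {head: bodies}            # no left recursion: nothing to do
    new = head + "'"
    return {
        head: [beta + [new] for beta in others],
        new:  [alpha + [new] for alpha in recursive] + [["epsilon"]],
    }

# Example: E -> E + T | T   becomes   E -> T E',  E' -> + T E' | epsilon
print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))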

Top-down parsing is a parsing strategy in which one first looks at the highest level of the parse
tree and works down the parse tree by using the rewriting rules of a formal grammar. LL parsers
are a type of parser that uses a top-down parsing strategy.

In top-down parsing, the parse tree is generated from top to bottom, i.e., from the root to the
leaves, and is expanded until all leaves are generated.

The parse tree is built with the start symbol of the grammar as its root. The parser starts the
derivation from the start symbol of the grammar and performs a leftmost derivation at each step.

Drawback of Top-Down Parsing

 Top-down parsing tries to identify the leftmost derivation for an input string ω, which is
equivalent to generating a parse tree for the input string ω that starts from the root and produces
the nodes in a pre-defined order.
 The reason that top-down parsing follows the leftmost derivation for an input string ω, and not
the rightmost derivation, is that the input string ω is scanned by the parser from left to right,

one symbol/token at a time. The leftmost derivation generates the leaves of the parse tree in
left-to-right order, which matches the input scan order.
 In top-down parsing, each terminal symbol produced by the predicted production of the
grammar is compared with the input symbol pointed to by the string marker. If the match is
successful, the parser can continue. If a mismatch occurs, the prediction has gone wrong.
 At this point it is necessary to reject the previous prediction. The prediction that led to the
mismatching terminal symbol is rejected, and the string marker (pointer) is reset to the position
it had when the rejected production was applied. This is known as backtracking.
 Backtracking is the major drawback of top-down parsing.
Types of Top-Down Parsing

There are two types of top-down parsing which are as follows −

 Top-Down Parsing with Backtracking


In backtracking, the parser can make repeated scans of the input. If the required input string cannot
be derived by applying one production rule, another production rule can be tried at each step to
obtain the required string (a small sketch appears after this list).

 Top-Down Parsing without Backtracking


Once a production rule is applied, it cannot be undone.
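
To contrast the two, here is a minimal sketch (illustrative, not from the notes) of top-down parsing with backtracking for the classic grammar S → cAd, A → ab | a and the input cad; a parser without backtracking would have to commit to its first choice for A.

# Top-down parsing with backtracking for S -> c A d, A -> a b | a (input "cad").
# Symbols are single characters; uppercase letters are non-terminals.
GRAMMAR = {"S": [["c", "A", "d"]], "A": [["a", "b"], ["a"]]}

def parse(symbols, inp, pos):
    """Try to match the list `symbols` against inp[pos:]; return the end position or None."""
    if not symbols:
        return pos
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:                       # non-terminal: try each body in turn
        for body in GRAMMAR[first]:
            result = parse(body + rest, inp, pos)
            if result is not None:
                return result                  # this prediction succeeded
            # otherwise backtrack: the input pointer is implicitly reset to `pos`
        return None
    if pos < len(inp) and inp[pos] == first:   # terminal: must match the input
        return parse(rest, inp, pos + 1)
    return None

inp = "cad"
print(parse(["S"], inp, 0) == len(inp))        # True: A -> a b fails, then A -> a succeeds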

Predictive Parser − A predictive parser is also known as a non-recursive predictive parser. It is an

efficient way of implementing recursive-descent parsing by handling the stack of activation records
explicitly. The predictive parser has an input, a stack, a parsing table, and an output. The input
consists of the string to be parsed, followed by $, the right-end marker.
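
A minimal sketch (illustrative; the tiny grammar and its parsing table are assumptions, and the output of productions used is omitted) of the non-recursive predictive parsing loop:

# Table-driven (non-recursive) predictive parsing loop.
# Tiny LL(1) grammar: S -> a S | b, with end marker $.
TABLE = {                       # TABLE[(nonterminal, lookahead)] = production body
    ("S", "a"): ["a", "S"],
    ("S", "b"): ["b"],
}
NONTERMINALS = {"S"}

def predictive_parse(tokens):
    stack = ["$", "S"]                      # start symbol on top of the end marker
    tokens = tokens + ["$"]
    i = 0
    while stack:
        top = stack.pop()
        if top == tokens[i]:                # terminal (or $) matches the lookahead
            i += 1
        elif top in NONTERMINALS:
            body = TABLE.get((top, tokens[i]))
            if body is None:
                return False                # no table entry: syntax error
            stack.extend(reversed(body))    # push the body so its first symbol is on top
        else:
            return False                    # terminal mismatch: syntax error
    return i == len(tokens)

print(predictive_parse(["a", "a", "b"]))    # True
print(predictive_parse(["a", "a"]))         # False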


Recursive Descent Parser − A top-down parser that implements a set of recursive procedures to
process the input without backtracking is known as a recursive-descent parser, and this kind of
parsing is known as recursive-descent parsing.
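
A minimal sketch (illustrative) of a recursive-descent parser for the left-recursion-free grammar E → T E', E' → + T E' | ε, T → id, obtained from E → E + T | T:

# Recursive-descent parser (one procedure per non-terminal, no backtracking)
# for E -> T E',  E' -> + T E' | epsilon,  T -> id.
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens + ["$"]
        self.i = 0

    def match(self, expected):
        if self.tokens[self.i] == expected:
            self.i += 1
        else:
            raise SyntaxError(f"expected {expected}, got {self.tokens[self.i]}")

    def E(self):            # E -> T E'
        self.T()
        self.E_prime()

    def E_prime(self):      # E' -> + T E' | epsilon
        if self.tokens[self.i] == "+":
            self.match("+")
            self.T()
            self.E_prime()  # otherwise take the epsilon production: simply return

    def T(self):            # T -> id
        self.match("id")

p = Parser(["id", "+", "id"])
p.E()
print(p.tokens[p.i] == "$")   # True: the whole input was consumed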

Bottom-up parsing can be defined as an attempt to reduce the input string w to the start symbol
of the grammar by tracing out the rightmost derivation of w in reverse. For example, LR parsing
is a general form of shift-reduce parsing.

Bottom-up parsing starts from the leaf nodes of the tree and works upward until it reaches the
root node. Here, we start from the sentence and apply production rules in reverse (reductions) in
order to reach the start symbol.

Shift-Reduce Parsing

Shift-reduce parsing uses two unique steps for bottom-up parsing. These steps are known as shift-step
and reduce-step.

 Shift step: The shift step refers to advancing the input pointer to the next input symbol; the
symbol just read, called the shifted symbol, is pushed onto the stack. The shifted symbol is
treated as a single node of the parse tree.


 Reduce step: When the parser finds a complete right-hand side (body) of a grammar rule on top
of the stack and replaces it with the left-hand side non-terminal, it is known as a reduce step. This
occurs when the top of the stack contains a handle. To reduce, the handle is popped off the stack and
replaced with the LHS non-terminal symbol.
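
The following sketch (illustrative; the fixed action sequence was worked out by hand and is an assumption, not a parsing algorithm) shows the stack mechanics of a shift-reduce parse of id + id * id with the grammar E → E + T | T, T → T * F | F, F → id:

# Stack mechanics of a shift-reduce parse of "id + id * id"
# with E -> E + T | T,  T -> T * F | F,  F -> id.
tokens = ["id", "+", "id", "*", "id"]
actions = [
    ("shift", None), ("reduce", ("F", ["id"])), ("reduce", ("T", ["F"])),
    ("reduce", ("E", ["T"])), ("shift", None), ("shift", None),
    ("reduce", ("F", ["id"])), ("reduce", ("T", ["F"])), ("shift", None),
    ("shift", None), ("reduce", ("F", ["id"])),
    ("reduce", ("T", ["T", "*", "F"])), ("reduce", ("E", ["E", "+", "T"])),
]

stack, i = [], 0
for kind, rule in actions:
    if kind == "shift":
        stack.append(tokens[i]); i += 1          # push the next input symbol
    else:
        head, body = rule
        assert stack[-len(body):] == body        # the handle must be on top of the stack
        del stack[-len(body):]                   # pop the handle ...
        stack.append(head)                       # ... and push the head non-terminal
    print(f"{kind:6}  stack: {' '.join(stack):15}  rest: {' '.join(tokens[i:])}")
print("accepted:", stack == ["E"] and i == len(tokens))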

LR Parser
The LR parser is a non-recursive, shift-reduce, bottom-up parser. It handles a wide class of context-free
grammars, which makes it one of the most efficient syntax analysis techniques. LR parsers are also known as
LR(k) parsers, where L stands for left-to-right scanning of the input stream, R stands for the
construction of a rightmost derivation in reverse, and k denotes the number of lookahead symbols used to
make parsing decisions.
There are three widely used algorithms available for constructing an LR parser:

 SLR(1) – Simple LR Parser:

o Works on the smallest class of grammars
o Has few states, hence a very small table
o Simple and fast construction
 LR(1) – Canonical LR Parser:
o Works on the complete set of LR(1) grammars
o Generates a large table and a large number of states
o Slow construction
 LALR(1) – Look-Ahead LR Parser:
o Works on an intermediate-sized class of grammars
o The number of states is the same as in SLR(1)

LL vs. LR


➢ LL does a leftmost derivation; LR does a rightmost derivation in reverse.

➢ LL starts with the root non-terminal on the stack; LR ends with the root non-terminal on the stack.

➢ LL ends when the stack is empty; LR starts with an empty stack.

➢ LL uses the stack for designating what is still to be expected; LR uses the stack for designating what has already been seen.

➢ LL builds the parse tree top-down; LR builds the parse tree bottom-up.

➢ LL continuously pops a non-terminal off the stack and pushes the corresponding right-hand side; LR tries to recognize a right-hand side on the stack, pops it, and pushes the corresponding non-terminal.

➢ LL expands the non-terminals; LR reduces the non-terminals.

➢ LL reads the terminals when it pops one off the stack; LR reads the terminals while it pushes them onto the stack.

➢ LL performs a pre-order traversal of the parse tree; LR performs a post-order traversal.

A parser should be able to detect and report any error in the program. It is expected that when an error
is encountered, the parser should be able to handle it and carry on parsing the rest of the input. The
parser is mainly expected to check for syntax errors, but errors may be encountered at various stages of
the compilation process. A program may have the following kinds of errors at various stages:

 Lexical: the name of some identifier typed incorrectly

 Syntactical: a missing semicolon or unbalanced parentheses
 Semantical: an incompatible value assignment
 Logical: unreachable code, an infinite loop

 There are four common error-recovery strategies that can be implemented in the parser to deal
with errors in the code.

Panic mode


When a parser encounters an error anywhere in the statement, it ignores the rest of the statement by
not processing input from the erroneous token until a delimiter, such as a semicolon, is reached. This is
the easiest way of error recovery, and it also prevents the parser from going into an infinite loop.

Statement mode

When a parser encounters an error, it tries to take corrective measures so that the rest of the inputs of
the statement allow the parser to parse ahead. For example, inserting a missing semicolon or replacing a
comma with a semicolon. Parser designers have to be careful here, because one wrong correction
may lead to an infinite loop.

Error productions

Some common errors that may occur in the code are known to the compiler designers. The designers
can create an augmented grammar containing productions that generate the erroneous constructs, so
that these errors can be detected when they are encountered.

Global correction

The parser considers the program in hand as a whole, tries to figure out what the program is
intended to do, and tries to find the closest match for it that is error-free. When an erroneous
input (statement) X is fed, it creates a parse tree for some closest error-free statement Y. This may
allow the parser to make minimal changes in the source code, but due to the complexity (in time and
space) of this strategy, it has not been implemented in practice yet.

Abstract Syntax Trees

Parse tree representations are not convenient for the later phases of the compiler, as they contain
more detail than is actually needed. Consider the parse tree for a typical arithmetic expression:


If we look closely, most of the leaf nodes are the single child of their parent nodes. This
information can be eliminated before feeding the tree to the next phase. By hiding the extra
information, we obtain a much smaller tree, the abstract syntax tree (AST):


ASTs are important data structures in a compiler, carrying the least unnecessary information. ASTs are more
compact than parse trees and can be easily used by a compiler.
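
A minimal sketch (illustrative node types, not from the notes) of an AST for id + id * id, in which the single-child chains of the parse tree have been collapsed into operator and operand nodes:

# A tiny AST for id + id * id: only operators and operands remain;
# the single-child chains of the parse tree (E -> T -> F -> id, ...) are collapsed.
from dataclasses import dataclass

@dataclass
class Id:               # leaf node for an identifier
    name: str

@dataclass
class BinOp:            # interior node for a binary operator
    op: str
    left: object
    right: object

# id + id * id parses as id + (id * id), so '*' is nested under '+'.
ast = BinOp("+", Id("a"), BinOp("*", Id("b"), Id("c")))

def to_string(node):
    """Post-order walk of the AST (the order in which a bottom-up parser would build it)."""
    if isinstance(node, Id):
        return node.name
    return f"({to_string(node.left)} {node.op} {to_string(node.right)})"

print(to_string(ast))   # (a + (b * c))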

Prepared by SHOBA V, Sri Sairam College of Engineering, Anekal, Bengaluru
