
Syntax Analysis

Week 03

1
Overview

 Translating from one language to another always requires some kind of parsing.
 Parsing is the process of identifying structure in data.
 It determines whether the input data has some pre-determined structure and responds accordingly.

2
Syntax Analysis Overview
 Goal – determine if the input token stream satisfies the syntax of the program
 What do we need to do this?
 An expressive way to describe the syntax
 A mechanism that determines if the input token stream satisfies the syntax description
 For lexical analysis:
 Regular expressions describe tokens
 Finite automata = mechanisms to generate tokens from the input stream

3
Syntax Analysis Overview
 Methods commonly used in compilers to create parsers for grammars are classified as
 Top down
 Build parse trees from the top (root) to the bottom (leaves)
 Bottom up
 Start from the leaves and work up to the root
 Both methods work only on subclasses of grammars, but several of these subclasses are expressive enough to describe most syntactic constructs in programming languages.
4
How it works..

 The parser obtains a sequence of tokens from the lexical analyzer.
 It recognizes structure through the use of a context-free grammar.
 Finally, it generates the parse tree / syntax tree.

5
Example

 Input code: if (x == y) { a=1; }

 Tokens generated:

IF LPAR ID EQ ID RPAR LBR ID AS INT SEMI RBR

 The parser takes those tokens as input and outputs the parse tree.

6
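The scanning step that produces this token stream can be sketched in a few lines of Python. This is only an illustration, not the course's scanner: the token names (IF, LPAR, ...) follow the slide, but the regex-table approach and the `tokenize` helper are assumptions.

```python
import re

# Token names as on the slide; the patterns themselves are illustrative.
TOKEN_SPEC = [
    ("IF",   r"\bif\b"),
    ("EQ",   r"=="),      # must precede AS, or "==" would match as two "="
    ("AS",   r"="),
    ("LPAR", r"\("),
    ("RPAR", r"\)"),
    ("LBR",  r"\{"),
    ("RBR",  r"\}"),
    ("SEMI", r";"),
    ("INT",  r"\d+"),
    ("ID",   r"[A-Za-z_]\w*"),
    ("SKIP", r"\s+"),     # whitespace is discarded
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(code):
    """Return the list of token names for the input string."""
    return [m.lastgroup for m in MASTER.finditer(code) if m.lastgroup != "SKIP"]

tokens = tokenize("if (x == y) { a=1; }")
```

Running this on the slide's input reproduces exactly the token stream shown above.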
Example (cont)
if (x == y) { a=1; }

7
Syntax Tree

 Different programming constructs can have different syntax-tree structures, for example:
 Syntax tree for operators
 Syntax tree for an if-statement
 Syntax tree for a while loop
 Syntax tree for an assignment statement

8
Example (cont)

 The syntax tree is generated as shown:

9
Syntax Rules
 Just like the lexical analyzer, the syntax analyzer requires rules to check the structure of the program.
 We use a CFG to check the structure of the program, i.e., to apply the syntax rules.
 Since we already know how to write a CFG, understanding how a CFG can be implemented helps us better understand the parsing process.

10
Problems in CFG

 Before a CFG is implemented, we must understand and remove some problems in the CFG. These are:
 Ambiguity
 First-First conflict
 Left recursion
 Unreachable productions, etc.

11
Implementation of CFG

 Just as we used a DFA and a symbol table to implement the rules of the lexical analyzer, we need a machine that can be used to implement the rules of a CFG.
 Such a machine is called a “Pushdown Automaton” (PDA) and is a combination of:
FA + Stack
 We have already seen the architecture of a PDA.

12
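As a minimal sketch of the FA + Stack idea, the following Python function recognizes balanced parentheses — a classic context-free language that no finite automaton alone can handle, because counting nesting depth requires unbounded memory. The function name and the character-stack representation are illustrative choices, not part of the slides.

```python
def accepts_balanced(s):
    """Accept strings of balanced ( and ) — a PDA simulated with a list."""
    stack = []
    for ch in s:
        if ch == "(":
            stack.append(ch)      # push on an opening symbol
        elif ch == ")":
            if not stack:
                return False      # nothing to match: reject
            stack.pop()           # pop the matching opener
        else:
            return False          # input alphabet is only ( and )
    return not stack              # accept iff the stack is empty at the end
```

The finite-control part here is trivial (one state); the stack does all the work, which is exactly what the extra power of a PDA over an FA amounts to.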
Implementation of CFG

 Also, before we can implement a CFG, we have to understand the different types of parsers.

 Parsers can be divided into two classes:
 Top-down parsers
 Bottom-up parsers

13
Top Down Parsers
 Get their name from the way they construct the tree.
 Use leftmost derivation to construct the tree.
 Best suited for hand-written parsers.
 Known as LL parsers.
 Ambiguity, left recursion, and first-first conflicts create problems for these grammars.

14
Bottom Up Parsers

 Construct the tree from bottom to top, i.e., from leaves to root.
 Use rightmost derivation (in reverse).
 Used in parser-generator tools; difficult to write by hand.
 Known as LR parsers.

15
THE ROLE OF THE PARSER

[Diagram: the lexical analyzer reads the source program and supplies tokens to the parser on request (“get next token”); both components consult the symbol table.]

16
Where is Syntax Analysis Performed?

if (b == 0) a = b;

Lexical Analysis or Scanner:

if ( b == 0 ) a = b ;

Syntax Analysis or Parsing:

[Diagram: abstract syntax tree / parse tree with if at the root, a == subtree over b and 0, and an = subtree over a and b.]
17
Parsing Analogy
• Syntax analysis for natural languages
• Recognize whether a sentence is grammatically correct
• Identify the function of each word

[Diagram: parse tree for “I gave him the book” — sentence → subject “I”, verb “gave”, indirect object “him”, and object, a noun phrase of article “the” and noun “book”.]

18
Syntax Error Handling

 Most programming language specifications do not describe how a compiler should respond to errors; the response is left to the compiler designer.
 Planning the error handling right from the start can both simplify the structure of the compiler and improve its response to errors.

19
Syntax Error Handling
Programs can contain errors at many different levels. For example:
 Lexical: such as misspelling an identifier, keyword, or operator
 Syntactic: such as an arithmetic expression with unbalanced parentheses
 Semantic: such as an operator applied to an incompatible operand
 Logical: such as an infinitely recursive call

20
Syntax Error Handling

 Much of the error detection and recovery in a compiler is centered around the syntax analysis phase.
 This is because many errors are syntactic in nature or are exposed when the stream of tokens coming from the lexical analyzer disobeys the grammatical rules defining the programming language.

21
Syntax Error Handling

 The error handler in a parser has simple-to-state goals:
 It should report the presence of errors clearly and accurately.
 It should recover from each error quickly enough to be able to detect subsequent errors.
 It should not significantly slow down the processing of correct programs.
22
Syntax Error Handling

Many of the errors could be classified simply:
 60% were punctuation errors.
 20% were operator and operand errors.
 15% were keyword errors.
 The remaining 5% were of other kinds.

23
How should an error handler report the presence of an error?
 It should report the place in the source program where an error is detected, because there is a good chance that the actual error occurred within the previous few tokens.
 A common strategy employed by many compilers is to print the offending line with a pointer to the position at which the error is detected.
24
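The line-plus-pointer strategy can be sketched in a few lines. The function name, the 0-based column convention, and the message format are illustrative assumptions, not a prescribed interface.

```python
def report_error(source, line_no, col, message):
    """Format an error report: message, the offending line, and a caret."""
    line = source.splitlines()[line_no - 1]   # line numbers are 1-based
    return "\n".join([
        f"line {line_no}: {message}",
        line,
        " " * col + "^",                      # caret under the offending column
    ])

msg = report_error("a = b + ;", 1, 8, "unexpected ';' in expression")
```

Printing `msg` shows the offending line with `^` directly under the stray semicolon, which is where detection happened — the actual mistake (a missing operand) is just before it, as the slide notes.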
How should an error handler report the presence of an error? (cont)
 If there is a reasonable likelihood of what the error actually is, an informative, understandable diagnostic message is also included, e.g., “semicolon missing at this position”.

25
Formal Method for Describing Syntax

 The formal language-generation mechanism commonly used to describe the syntax of programming languages is called a grammar.
 Backus-Naur Form and Context-Free Grammars
 In the middle-to-late 1950s, two men, Noam Chomsky and John Backus, independently developed the same syntax-description formalism, which became the widely used method for describing programming language syntax.
26
Context Free Grammars
 In the mid-1950s, Chomsky, a noted linguist, described two useful grammar classes:
 Context-free
 Whole programming languages, with minor exceptions, can be described by context-free grammars
 Regular
 The tokens of programming languages can be described by regular grammars
27
Context Free Grammar
 A language used to describe another language
 Consists of a collection of rules (or productions)
 Consists of 4 components:
1. Terminal symbols = tokens
 The basic symbols from which strings are formed. For example, the keywords if, then, and else are terminals.
28
Context Free Grammar

2. Non-terminal symbols = syntactic variables
3. Start symbol S = a special non-terminal
4. Productions of the form LHS → RHS
 LHS = a single non-terminal
 RHS = a string of terminals and non-terminals
 A production describes how the non-terminal on the LHS can be replaced by a string of terminals and non-terminals on the RHS.

29
Context Free Grammar

expr  expr op expr


In this grammar the
expr  (expr)
terminal symbols are
expr  - expr
Id + - * / ()
expr  id
op  + Non terminal
op  - symbols are
op  * expr and op
op  /
30
Context Free Grammar(CFG)
 For example, a simple Java assignment statement might be represented by the abstraction <assign>

 Its definition, or rule, may be given by

<assign> → <var> = <expression>

 LHS: the abstraction being defined
 RHS: contains a mixture of tokens, lexemes, and references to other abstractions
31
Notational Conventions

 These symbols are terminals:
1. Lower-case letters early in the alphabet, such as a, b, c
2. Operator symbols, such as +, -, etc.
3. Punctuation symbols, such as parentheses, commas, etc.
4. The digits 0, 1, ..., 9
5. Boldface strings, such as id or if

32
Example of a grammar in CFG
 In a grammar for a complete programming language, the start symbol represents a complete program and is usually named <program>
 <program> → begin <stmt_list> end
 <stmt_list> → <stmt> | <stmt> ; <stmt_list>
 <stmt> → <var> = <expression>
 <var> → A | B | C | D
 <expression> → <var> + <var> | <var> - <var> | <var>
33
Notational Conventions
 These symbols are non-terminals:
1. Upper-case letters early in the alphabet, such as A, B, C
2. The letter S, which, when it appears, is usually the start symbol
3. Lower-case italic names, such as expr or stmt

 Upper-case letters late in the alphabet, such as X, Y, Z, represent grammar symbols, that is, either non-terminals or terminals
34
Notational Conventions

 Using these shorthands, we could write the previous grammar as

 E → E A E | ( E ) | - E | id

 A → + | - | * | / | ^

35
Grammars and Derivation
 A grammar is a generative device for defining a language.
 The sentences of the language are generated through a sequence of applications of the rules, beginning with a special nonterminal of the grammar called the start symbol.
 A sentence generation is called a derivation.

36
Derivation

 For instance, to generate the string -(id * id) we can write:
 E => -E => -(E) => -(E A E) => -(id A E) => -(id * E) => -(id * id)

37
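The derivation above can be replayed mechanically: each step replaces the leftmost occurrence of a non-terminal with the body of one production of the shorthand grammar E → E A E | (E) | -E | id. This is plain string rewriting for illustration only, not a parser.

```python
def apply(sentential, lhs, rhs):
    """Replace the leftmost occurrence of non-terminal lhs with rhs."""
    return sentential.replace(lhs, rhs, 1)   # count=1: leftmost only

steps = ["E"]                                # start symbol
for lhs, rhs in [("E", "-E"), ("E", "(E)"), ("E", "E A E"),
                 ("E", "id"), ("A", "*"), ("E", "id")]:
    steps.append(apply(steps[-1], lhs, rhs))
```

The list `steps` contains exactly the sentential forms of the derivation on the slide, ending in the sentence -(id * id).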
Grammars and Derivation
 The process of generating a sentence
begin A = B – C end
 Derivation: <program> (start symbol)
=> begin <stmt_list> end
=> begin <stmt> end
=> begin <var> = <expression> end
=> begin A = <expression> end
=> begin A = <var> - <var> end
=> begin A = B - <var> end
=> begin A = B - C end

38
Grammars and Derivation

 Leftmost derivation:
 the replaced non-terminal is always the leftmost
non-terminal
 Rightmost derivation:
 the replaced non-terminal is always the rightmost
non-terminal
 Sentential forms
 Each string in the derivation, including
<program>

39
Grammars and Derivation
 When we construct a derivation, there are two choices at each step: which nonterminal to expand, and which production to use for the given nonterminal.
 We will sometimes be concerned with a leftmost derivation, which eliminates that first degree of freedom at each step.
 If we need to be clear or to emphasize this, we write S =lm=>* α; α is then called a left-sentential form of the grammar.

40
Parse Tree and Derivation

 A hierarchical structure that shows the derivation process

 Parse tree for -(id + id)

41
Parse Tree and Derivation

 Parse tree for -(id + id)

 E => -E => -(E) => -(E + E) => -(id + E) => -(id + id)

42
Parse Tree and Derivations
 A = B * (A + C)
<assign>
 <id> = <expr>

 A = <expr>

 A = <id> * <expr>

 A = B * <expr>

 A = B * ( <expr> )

 A = B * ( <id> + <expr> )

 A = B * ( A + <expr> )

 A = B * ( A + <id> )

 A = B * ( A + C )

43
44
Ambiguity in Grammar
A grammar is ambiguous if there are multiple derivations (and therefore multiple parse trees) for a single string.

 The derivation and parse tree usually reflect the semantics of the program.

 Ambiguity in a grammar often reflects ambiguity in the semantics of the language, which is considered undesirable.

45
Ambiguity Example
Two parse trees for 2-1+1
[Diagram: two parse trees. In the tree corresponding to <2-1>+1, the root Expr derives Expr Op Expr with Op = + and the left subtree groups 2 - 1. In the tree corresponding to 2-<1+1>, the root Op is - and the right subtree groups 1 + 1.]

46
Eliminating Ambiguity
Solution: hack the grammar

Original Grammar              Hacked Grammar
Start → Expr                  Start → Expr
Expr → Expr Op Expr           Expr → Expr Op Int
Expr → Int                    Expr → Int
Expr → Open Expr Close        Expr → Open Expr Close

 Conceptually, this makes all operators associate to the left.
47
Parse Trees for Hacked Grammar
Only one parse tree for 2-1+1!

[Diagram: the valid left-leaning tree groups (2-1) under an inner Expr and then applies + 1; the right-leaning tree for 2-(1+1) is no longer valid, because the right operand of Op must now be a bare Int.]
48
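The effect of the hacked grammar can be checked with a tiny evaluator: since the right operand of every Op must be a bare Int, evaluation can only proceed strictly left to right, mirroring the unique left-leaning parse tree. The token-list representation below is an assumption for illustration.

```python
def eval_left_assoc(tokens):
    """Evaluate +/- over ints, grouping to the left as Expr -> Expr Op Int forces."""
    it = iter(tokens)
    value = int(next(it))          # the leftmost Int
    for op in it:
        rhs = int(next(it))        # every Op is followed by a bare Int
        value = value - rhs if op == "-" else value + rhs
    return value
```

For 2-1+1 this yields (2-1)+1 = 2; the other grouping, 2-(1+1) = 0, is no longer derivable from the hacked grammar.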
Eliminating ambiguity
 As we've noted, we may sometimes wish to eliminate ambiguity from a grammar.
 Example of the "dangling else":
 S → if E then S
| if E then S else S
| other // any other type of statement
 The ambiguity arises in statements of the form:
 if E1 then if E2 then S1 else S2

49
 Draw the two parse trees. Which one
conforms to the usual interpretation made by
languages? ("match else with the most recent
unmatched if")

50
Two Parse Trees

[Diagram: in the first tree the else attaches to the outer if, giving if e1 then (if e2 then s1) else s2; in the second it attaches to the inner if, giving if e1 then (if e2 then s1 else s2). Which one is correct?]
51
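The usual rule — match each else with the closest unmatched if — falls out naturally of a recursive-descent parser that greedily consumes an else right after parsing the then-branch. The token shapes and tuple-based trees below are illustrative assumptions, not the grammar's official abstract syntax.

```python
def parse_stat(tokens, pos=0):
    """Parse one statement; return (tree, next_pos)."""
    if tokens[pos] == "if":
        cond = tokens[pos + 1]                     # a single condition token
        assert tokens[pos + 2] == "then"
        then_part, pos = parse_stat(tokens, pos + 3)
        # Greedy: if an 'else' is waiting here, it binds to THIS (innermost) if.
        if pos < len(tokens) and tokens[pos] == "else":
            else_part, pos = parse_stat(tokens, pos + 1)
            return ("if", cond, then_part, else_part), pos
        return ("if", cond, then_part), pos
    return tokens[pos], pos + 1                    # an 'other' statement

tree, _ = parse_stat(["if", "e1", "then", "if", "e2", "then", "s1", "else", "s2"])
```

The resulting tree nests the else inside the inner if — the second of the two parse trees above, which is the conventional interpretation.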
Eliminating Ambiguity

 Often we can eliminate ambiguity by adding nonterminals and allowing recursion only on the right or left
 S → S + T | T
 T → T * num | num

[Diagram: parse tree for 1 + 2 * 3, in which S derives S + T and the T subtree derives 2 * 3, nesting * below +.]

 The T non-terminal enforces precedence

52
A Closer Look at Eliminating Ambiguity
 Precedence is enforced by:
 Introducing distinct non-terminals for each precedence level
 Operators of a given precedence level are specified in the RHS of that level's production
 Higher-precedence operators are reached by referencing the next-higher-precedence non-terminal

53
Operator Precedence

 A = B + C * A
 How do we force “*” to have higher precedence than “+”?
 Add more non-terminal symbols
 Observe that higher-precedence operators reside at “deeper” levels of the tree

54
Operator Precedence

A=B+C*A
 Before:
<assign> → <id> = <expr>
<id> → A | B | C | D
<expr> → <expr> + <expr>
       | <expr> * <expr>
       | ( <expr> )
       | <id>

55
Operator Precedence
 After:
<assign> → <id> = <expr>
<id> → A | B | C | D
<expr> → <expr> + <term>
       | <term>
<term> → <term> * <factor>
       | <factor>
<factor> → ( <expr> )
         | <id>

56
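The layered <expr>/<term>/<factor> grammar can be sketched as a hand-written evaluator, with the left-recursive productions turned into loops. Because <term> sits one level below <expr>, * is applied before + with no extra precedence machinery. The environment dict and variable values are illustrative assumptions.

```python
def parse(tokens, env):
    """Evaluate an expression over identifiers using the layered grammar."""
    pos = [0]

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def eat():
        pos[0] += 1
        return tokens[pos[0] - 1]

    def expr():                    # <expr> -> <expr> + <term> | <term>
        value = term()
        while peek() == "+":
            eat()
            value += term()
        return value

    def term():                    # <term> -> <term> * <factor> | <factor>
        value = factor()
        while peek() == "*":
            eat()
            value *= factor()
        return value

    def factor():                  # <factor> -> ( <expr> ) | <id>
        if peek() == "(":
            eat()
            value = expr()
            eat()                  # consume the closing ')'
            return value
        return env[eat()]          # look the identifier up

    return expr()

# B + C * A with B=2, C=3, A=4: * binds tighter, giving 2 + 12 = 14.
result = parse(["B", "+", "C", "*", "A"], {"A": 4, "B": 2, "C": 3})
```

Parenthesizing the sum instead, (B + C) * A, changes the result to 20, confirming that the nesting of <term> below <expr> is what decides the grouping.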
Operator Precedence

 A=B+C*A

57
Associativity
 An operator is either left-, right-, or non-associative
 Left: a + b + c = (a + b) + c
 Right: a ^ b ^ c = a ^ (b ^ c)
 Non: a < b < c is illegal (and thus undefined)
 The position of the recursion relative to the operator dictates the associativity
 Left (right) recursion → left (right) associativity

58
Associativity of Operators
 A = B + C – D * F / G
 Left-associative:
Operators of the same precedence are evaluated from left to right. C++/Java: +, -, *, /, %
 Right-associative:
Operators of the same precedence are evaluated from right to left. C++/Java: unary -, unary +, ! (logical negation)
 How do we enforce operator associativity using BNF?

59
Associativity of Operators

60
Associativity of Operators

61
Eliminating Left Recursion

 Left recursion can be a problem for some top-down parsers, causing infinite recursion.
 We show how to eliminate it in a simple case by introducing a new nonterminal and rewriting the productions.

62
Eliminating Left Recursion
A -> Aα | β becomes the non-left-recursive productions:

A -> βA'
A' -> αA' | ε

 No matter how many A-productions there are, we can eliminate immediate left recursion from them by the following technique.
63
Eliminating Left Recursion

First we group the A-productions as
 A -> A α1 | A α2 | ... | A αm | β1 | β2 | ... | βn
where no β begins with an A. Then we replace the A-productions by
 A -> β1 A' | β2 A' | ... | βn A'
 A' -> α1 A' | α2 A' | ... | αm A' | ε
So:
A -> βA'
A' -> αA' | ε
64
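The grouping technique above can be sketched directly as code: split the A-productions into left-recursive bodies (the αs) and the rest (the βs), then build the new A and A' rules. Representing productions as tuples of symbols and naming the fresh non-terminal A' are illustrative choices.

```python
def eliminate_left_recursion(nonterm, productions):
    """Remove immediate left recursion from one non-terminal's productions.

    productions: list of tuples of symbols; () stands for epsilon.
    Returns a dict mapping the non-terminal and its fresh primed
    non-terminal to their new production lists.
    """
    prime = nonterm + "'"
    alphas = [body[1:] for body in productions if body and body[0] == nonterm]
    betas = [body for body in productions if not body or body[0] != nonterm]
    new_a = [beta + (prime,) for beta in betas]            # A  -> βi A'
    new_prime = [a + (prime,) for a in alphas] + [()]      # A' -> αi A' | ε
    return {nonterm: new_a, prime: new_prime}

# A -> A a | b   becomes   A -> b A',  A' -> a A' | ε
result = eliminate_left_recursion("A", [("A", "a"), ("b",)])
```

The output matches the slide's schema: every β gets A' appended, and A' cycles through the αs before ending in ε.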
Left Factoring

 Two rules start with the same RHS prefix
 Make a new rule to distinguish them
 S -> if ( exp ) S
 S -> if ( exp ) S else S
becomes
 S -> if ( exp ) S S'
 S' -> else S | ε

65
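For the two-production case above, left factoring amounts to finding the longest common prefix and moving the differing tails into a new rule. This sketch handles exactly that case; production bodies are tuples of symbols, () stands for ε, and the S' naming is an illustrative convention.

```python
def left_factor(nonterm, body_a, body_b):
    """Left-factor two productions of nonterm that share a prefix."""
    prime = nonterm + "'"
    # Find the longest common prefix of the two bodies.
    k = 0
    while k < min(len(body_a), len(body_b)) and body_a[k] == body_b[k]:
        k += 1
    common, tail_a, tail_b = body_a[:k], body_a[k:], body_b[k:]
    # nonterm keeps the shared prefix; the primed rule holds the tails.
    return {nonterm: [common + (prime,)], prime: [tail_a, tail_b]}

s1 = ("if", "(", "exp", ")", "S")
s2 = ("if", "(", "exp", ")", "S", "else", "S")
result = left_factor("S", s1, s2)
```

The result reproduces the slide's rewrite: S -> if ( exp ) S S' and S' -> ε | else S, which removes the first-first conflict that would stall a predictive parser.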
