
Syntax Analysis

Week 03

1
Overview

 Translating from one language to another always requires some kind of parsing.
 Parsing is the process of identifying structure in data.
 It determines whether the input data has some pre-determined structure and responds accordingly.

2
Syntax Analysis Overview
 Goal – determine if the input token stream satisfies the syntax of the program
 What do we need to do this?
 An expressive way to describe the syntax
 A mechanism that determines if the input token stream satisfies the syntax description
 For lexical analysis:
 Regular expressions describe tokens
 Finite automata = mechanisms to generate tokens from the input stream

3
Syntax Analysis Overview
 Methods commonly used in compilers to create parsers for grammars are classified as
 Top down
 Build parse trees from the top (root) to the bottom (leaves)
 Bottom up
 Start from the leaves and work up to the root
 Both methods work only on subclasses of grammars, but several of these subclasses are expressive enough to describe most syntactic constructs in programming languages.
4
How it works..

 The parser obtains a sequence of tokens from the lexical analyzer.
 It recognizes structure through the use of a context-free grammar.
 Finally, it generates the parse tree / syntax tree.

5
Example

 Input code: if (x == y) { a=1; }

 Tokens generated:

IF LPAR ID EQ ID RPAR LBR ID AS INT SEMI RBR

 The parser takes those tokens as input and outputs the parse tree.

6
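The scanning step that produces this token stream can be sketched in a few lines of Python. This is only an illustration, not the course's scanner: the token names (IF, LPAR, ...) follow the slide, but the regex-table approach and the `tokenize` helper are assumptions.

```python
import re

# Token names as on the slide; the patterns themselves are illustrative.
TOKEN_SPEC = [
    ("IF",   r"\bif\b"),
    ("EQ",   r"=="),      # must precede AS, or "==" would match as two "="
    ("AS",   r"="),
    ("LPAR", r"\("),
    ("RPAR", r"\)"),
    ("LBR",  r"\{"),
    ("RBR",  r"\}"),
    ("SEMI", r";"),
    ("INT",  r"\d+"),
    ("ID",   r"[A-Za-z_]\w*"),
    ("SKIP", r"\s+"),     # whitespace is discarded
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(code):
    """Return the list of token names for the input string."""
    return [m.lastgroup for m in MASTER.finditer(code) if m.lastgroup != "SKIP"]

tokens = tokenize("if (x == y) { a=1; }")
```

Running this on the slide's input reproduces exactly the token stream shown above.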
Example (cont)
if (x == y) { a=1; }

7
Syntax Tree

 Different programming constructs can have different syntax-tree structures, for example:
 Syntax tree for operators
 Syntax tree for an if-statement
 Syntax tree for a while loop
 Syntax tree for an assignment statement

8
Example (cont)

 The syntax tree is generated as shown:

9
Syntax Rules
 Just like the lexical analyzer, the syntax analyzer requires rules to check the structure of the program.
 We use a CFG to check the structure of the program, i.e., to apply the syntax rules.
 Since we already know how to write a CFG, understanding how a CFG can be implemented helps us better understand the parsing process.

10
Problems in CFG

 Before a CFG is implemented, we must understand and remove some problems in the CFG. These are:
 Ambiguity
 First-First conflict
 Left recursion
 Unreachable productions, etc.

11
Implementation of CFG

 Just as we used a DFA and a symbol table to implement the rules of the lexical analyzer, we need a machine that can be used to implement the rules of a CFG.
 Such a machine is called a “Pushdown Automaton” (PDA) and is a combination of:
FA + Stack
 We have already seen the architecture of a PDA.

12
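As a minimal sketch of the FA + Stack idea, the following Python function recognizes balanced parentheses — a classic context-free language that no finite automaton alone can handle, because counting nesting depth requires unbounded memory. The function name and the character-stack representation are illustrative choices, not part of the slides.

```python
def accepts_balanced(s):
    """Accept strings of balanced ( and ) — a PDA simulated with a list."""
    stack = []
    for ch in s:
        if ch == "(":
            stack.append(ch)      # push on an opening symbol
        elif ch == ")":
            if not stack:
                return False      # nothing to match: reject
            stack.pop()           # pop the matching opener
        else:
            return False          # input alphabet is only ( and )
    return not stack              # accept iff the stack is empty at the end
```

The finite-control part here is trivial (one state); the stack does all the work, which is exactly what the extra power of a PDA over an FA amounts to.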
Implementation of CFG

 Also, before we can implement a CFG, we have to understand the different types of parsers.

 Parsers can be divided into two classes:
 Top-down parsers
 Bottom-up parsers

13
Top Down Parsers
 Get their name from the way they construct the tree.
 Use leftmost derivation to construct the tree.
 Best suited for hand-written parsers.
 Known as LL parsers.
 Ambiguity, left recursion, and first-first conflicts create problems for these grammars.

14
Bottom Up Parsers

 Construct the tree from bottom to top, i.e., from leaves to root.
 Use rightmost derivation (in reverse).
 Used in parser-generator tools; difficult to write by hand.
 Known as LR parsers.

15
THE ROLE OF THE PARSER

[Diagram: the lexical analyzer reads the source program and supplies tokens to the parser on request (“get next token”); both components consult the symbol table.]

16
Where is Syntax Analysis Performed?

if (b == 0) a = b;

Lexical Analysis or Scanner:

if ( b == 0 ) a = b ;

Syntax Analysis or Parsing:

[Diagram: abstract syntax tree / parse tree with if at the root, a == subtree over b and 0, and an = subtree over a and b.]
17
Parsing Analogy
• Syntax analysis for natural languages
• Recognize whether a sentence is grammatically correct
• Identify the function of each word

[Diagram: parse tree for “I gave him the book” — sentence → subject “I”, verb “gave”, indirect object “him”, and object, a noun phrase of article “the” and noun “book”.]

18
Syntax Error Handling

 Most programming language specifications do not describe how a compiler should respond to errors; the response is left to the compiler designer.
 Planning the error handling right from the start can both simplify the structure of the compiler and improve its response to errors.

19
Syntax Error Handling
Programs can contain errors at many different levels. For example:
 Lexical: such as misspelling an identifier, keyword, or operator
 Syntactic: such as an arithmetic expression with unbalanced parentheses
 Semantic: such as an operator applied to an incompatible operand
 Logical: such as an infinitely recursive call

20
Syntax Error Handling

 Much of the error detection and recovery in a compiler is centered around the syntax analysis phase.
 This is because many errors are syntactic in nature or are exposed when the stream of tokens coming from the lexical analyzer disobeys the grammatical rules defining the programming language.

21
Syntax Error Handling

 The error handler in a parser has simple-to-state goals:
 It should report the presence of errors clearly and accurately.
 It should recover from each error quickly enough to be able to detect subsequent errors.
 It should not significantly slow down the processing of correct programs.
22
Syntax Error Handling

Many of the errors could be classified simply:
 60% were punctuation errors.
 20% were operator and operand errors.
 15% were keyword errors.
 The remaining 5% were of other kinds.

23
How should an error handler report the presence of an error?
 It should report the place in the source program where an error is detected, because there is a good chance that the actual error occurred within the previous few tokens.
 A common strategy employed by many compilers is to print the offending line with a pointer to the position at which the error is detected.
24
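The line-plus-pointer strategy can be sketched in a few lines. The function name, the 0-based column convention, and the message format are illustrative assumptions, not a prescribed interface.

```python
def report_error(source, line_no, col, message):
    """Format an error report: message, the offending line, and a caret."""
    line = source.splitlines()[line_no - 1]   # line numbers are 1-based
    return "\n".join([
        f"line {line_no}: {message}",
        line,
        " " * col + "^",                      # caret under the offending column
    ])

msg = report_error("a = b + ;", 1, 8, "unexpected ';' in expression")
```

Printing `msg` shows the offending line with `^` directly under the stray semicolon, which is where detection happened — the actual mistake (a missing operand) is just before it, as the slide notes.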
How should an error handler report the presence of an error? (cont)
 If there is a reasonable likelihood of what the error actually is, an informative, understandable diagnostic message is also included, e.g., “semicolon missing at this position”.

25
Formal Method for Describing Syntax

 The formal language-generation mechanism commonly used to describe the syntax of programming languages is called a grammar.
 Backus-Naur Form and Context-Free Grammars
 In the middle-to-late 1950s, two men, Noam Chomsky and John Backus, independently developed the same syntax-description formalism, which became the widely used method for describing programming language syntax.
26
Context Free Grammars
 In the mid-1950s, Chomsky, a noted linguist, described two useful grammar classes:
 Context-free
 Whole programming languages, with minor exceptions, can be described by context-free grammars
 Regular
 The tokens of programming languages can be described by regular grammars
27
Context Free Grammar
 A language used to describe another language
 Consists of a collection of rules (or productions)
 Consists of 4 components:
1. Terminal symbols = tokens
 The basic symbols from which strings are formed. For example, the keywords if, then, and else are terminals.
28
Context Free Grammar

2. Non-terminal symbols = syntactic variables
3. Start symbol S = a special non-terminal
4. Productions of the form LHS → RHS
 LHS = a single non-terminal
 RHS = a string of terminals and non-terminals
 A production describes how the non-terminal on the LHS can be replaced by a string of terminals and non-terminals on the RHS.

29
Context Free Grammar

expr  expr op expr


In this grammar the
expr  (expr)
terminal symbols are
expr  - expr
Id + - * / ()
expr  id
op  + Non terminal
op  - symbols are
op  * expr and op
op  /
30
Context Free Grammar(CFG)
 For example, a simple Java assignment statement might be represented by the abstraction <assign>

 Its definition, or rule, may be given by

<assign> → <var> = <expression>

 LHS: the abstraction being defined
 RHS: contains a mixture of tokens, lexemes, and references to other abstractions
31
Notational Conventions

 These symbols are terminals:
1. Lower-case letters early in the alphabet, such as a, b, c
2. Operator symbols, such as +, -, etc.
3. Punctuation symbols, such as parentheses, commas, etc.
4. The digits 0, 1, ..., 9
5. Boldface strings, such as id or if

32
Example of a grammar in CFG
 In a grammar for a complete programming language, the start symbol represents a complete program and is usually named <program>
 <program> → begin <stmt_list> end
 <stmt_list> → <stmt> | <stmt> ; <stmt_list>
 <stmt> → <var> = <expression>
 <var> → A | B | C | D
 <expression> → <var> + <var> | <var> - <var> | <var>
33
Notational Conventions
 These symbols are non-terminals:
1. Upper-case letters early in the alphabet, such as A, B, C
2. The letter S, which, when it appears, is usually the start symbol
3. Lower-case italic names, such as expr or stmt

 Upper-case letters late in the alphabet, such as X, Y, Z, represent grammar symbols, that is, either non-terminals or terminals
34
Notational Conventions

 Using these shorthands, we could write the previous grammar as

 E → E A E | ( E ) | - E | id

 A → + | - | * | / | ^

35
Grammars and Derivation
 A grammar is a generative device for defining a language.
 The sentences of the language are generated through a sequence of applications of the rules, beginning with a special nonterminal of the grammar called the start symbol.
 A sentence generation is called a derivation.

36
Derivation

 For instance, to generate the string -(id * id) we can write:
 E => -E => -(E) => -(E A E) => -(id A E) => -(id * E) => -(id * id)

37
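The derivation above can be replayed mechanically: each step replaces the leftmost occurrence of a non-terminal with the body of one production of the shorthand grammar E → E A E | (E) | -E | id. This is plain string rewriting for illustration only, not a parser.

```python
def apply(sentential, lhs, rhs):
    """Replace the leftmost occurrence of non-terminal lhs with rhs."""
    return sentential.replace(lhs, rhs, 1)   # count=1: leftmost only

steps = ["E"]                                # start symbol
for lhs, rhs in [("E", "-E"), ("E", "(E)"), ("E", "E A E"),
                 ("E", "id"), ("A", "*"), ("E", "id")]:
    steps.append(apply(steps[-1], lhs, rhs))
```

The list `steps` contains exactly the sentential forms of the derivation on the slide, ending in the sentence -(id * id).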
Grammars and Derivation
 The process of generating a sentence
begin A = B – C end
 Derivation: <program> (start symbol)
=> begin <stmt_list> end
=> begin <stmt> end
=> begin <var> = <expression> end
=> begin A = <expression> end
=> begin A = <var> - <var> end
=> begin A = B - <var> end
=> begin A = B - C end

38
Grammars and Derivation

 Leftmost derivation:
 the replaced non-terminal is always the leftmost
non-terminal
 Rightmost derivation:
 the replaced non-terminal is always the rightmost
non-terminal
 Sentential forms
 Each string in the derivation, including
<program>

39
Grammars and Derivation
 When we construct a derivation, there are two choices at each step: which nonterminal to expand, and which production to use for the given nonterminal.
 We will sometimes be concerned with a leftmost derivation, which eliminates that first degree of freedom at each step.
 If we need to be clear or to emphasize this, we write S =lm=>* α; α is then called a left-sentential form of the grammar.

40
Parse Tree and Derivation

 A hierarchical structure that shows the derivation process

 Parse tree for -(id + id)

41
Parse Tree and Derivation

 Parse tree for -(id + id)

 E => -E => -(E) => -(E + E) => -(id + E) => -(id + id)

42
Parse Tree and Derivations
 A = B * (A + C)
<assign>
 <id> = <expr>

 A = <expr>

 A = <id> * <expr>

 A = B * <expr>

 A = B * ( <expr> )

 A = B * ( <id> + <expr> )

 A = B * ( A + <expr> )

 A = B * ( A + <id> )

 A = B * ( A + C )

43
44
Ambiguity in Grammar
A grammar is ambiguous if there are multiple derivations (and therefore multiple parse trees) for a single string.

 The derivation and parse tree usually reflect the semantics of the program.

 Ambiguity in a grammar often reflects ambiguity in the semantics of the language, which is considered undesirable.

45
Ambiguity Example
Two parse trees for 2-1+1
[Diagram: two parse trees. In the tree corresponding to <2-1>+1, the root Expr derives Expr Op Expr with Op = + and the left subtree groups 2 - 1. In the tree corresponding to 2-<1+1>, the root Op is - and the right subtree groups 1 + 1.]

46
Eliminating Ambiguity
Solution: hack the grammar

Original Grammar              Hacked Grammar
Start → Expr                  Start → Expr
Expr → Expr Op Expr           Expr → Expr Op Int
Expr → Int                    Expr → Int
Expr → Open Expr Close        Expr → Open Expr Close

 Conceptually, this makes all operators associate to the left.
47
Parse Trees for Hacked Grammar
Only one parse tree for 2-1+1!

[Diagram: the valid left-leaning tree groups (2-1) under an inner Expr and then applies + 1; the right-leaning tree for 2-(1+1) is no longer valid, because the right operand of Op must now be a bare Int.]
48
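The effect of the hacked grammar can be checked with a tiny evaluator: since the right operand of every Op must be a bare Int, evaluation can only proceed strictly left to right, mirroring the unique left-leaning parse tree. The token-list representation below is an assumption for illustration.

```python
def eval_left_assoc(tokens):
    """Evaluate +/- over ints, grouping to the left as Expr -> Expr Op Int forces."""
    it = iter(tokens)
    value = int(next(it))          # the leftmost Int
    for op in it:
        rhs = int(next(it))        # every Op is followed by a bare Int
        value = value - rhs if op == "-" else value + rhs
    return value
```

For 2-1+1 this yields (2-1)+1 = 2; the other grouping, 2-(1+1) = 0, is no longer derivable from the hacked grammar.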
Eliminating ambiguity
 As we've noted, we may sometimes wish to eliminate ambiguity from a grammar.
 Example of the "dangling else":
 S → if E then S
| if E then S else S
| other // any other type of statement
 The ambiguity arises in statements of the form:
 if E1 then if E2 then S1 else S2

49
 Draw the two parse trees. Which one
conforms to the usual interpretation made by
languages? ("match else with the most recent
unmatched if")

50
Two Parse Trees

[Diagram: in the first tree the else attaches to the outer if, giving if e1 then (if e2 then s1) else s2; in the second it attaches to the inner if, giving if e1 then (if e2 then s1 else s2). Which one is correct?]
51
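The usual rule — match each else with the closest unmatched if — falls out naturally of a recursive-descent parser that greedily consumes an else right after parsing the then-branch. The token shapes and tuple-based trees below are illustrative assumptions, not the grammar's official abstract syntax.

```python
def parse_stat(tokens, pos=0):
    """Parse one statement; return (tree, next_pos)."""
    if tokens[pos] == "if":
        cond = tokens[pos + 1]                     # a single condition token
        assert tokens[pos + 2] == "then"
        then_part, pos = parse_stat(tokens, pos + 3)
        # Greedy: if an 'else' is waiting here, it binds to THIS (innermost) if.
        if pos < len(tokens) and tokens[pos] == "else":
            else_part, pos = parse_stat(tokens, pos + 1)
            return ("if", cond, then_part, else_part), pos
        return ("if", cond, then_part), pos
    return tokens[pos], pos + 1                    # an 'other' statement

tree, _ = parse_stat(["if", "e1", "then", "if", "e2", "then", "s1", "else", "s2"])
```

The resulting tree nests the else inside the inner if — the second of the two parse trees above, which is the conventional interpretation.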
Eliminating Ambiguity

 Often we can eliminate ambiguity by adding nonterminals and allowing recursion only on the right or left
 S → S + T | T
 T → T * num | num

[Diagram: parse tree for 1 + 2 * 3, in which S derives S + T and the T subtree derives 2 * 3, nesting * below +.]

 The T non-terminal enforces precedence

52
A Closer Look at Eliminating Ambiguity
 Precedence is enforced by:
 Introducing distinct non-terminals for each precedence level
 Operators of a given precedence level are specified in the RHS of that level's production
 Higher-precedence operators are reached by referencing the next-higher-precedence non-terminal

53
Operator Precedence

 A = B + C * A
 How do we force “*” to have higher precedence than “+”?
 Add more non-terminal symbols
 Observe that higher-precedence operators reside at “deeper” levels of the tree

54
Operator Precedence

A=B+C*A
 Before:
<assign> → <id> = <expr>
<id> → A | B | C | D
<expr> → <expr> + <expr>
       | <expr> * <expr>
       | ( <expr> )
       | <id>

55
Operator Precedence
 After:
<assign> → <id> = <expr>
<id> → A | B | C | D
<expr> → <expr> + <term>
       | <term>
<term> → <term> * <factor>
       | <factor>
<factor> → ( <expr> )
         | <id>

56
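The layered <expr>/<term>/<factor> grammar can be sketched as a hand-written evaluator, with the left-recursive productions turned into loops. Because <term> sits one level below <expr>, * is applied before + with no extra precedence machinery. The environment dict and variable values are illustrative assumptions.

```python
def parse(tokens, env):
    """Evaluate an expression over identifiers using the layered grammar."""
    pos = [0]

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def eat():
        pos[0] += 1
        return tokens[pos[0] - 1]

    def expr():                    # <expr> -> <expr> + <term> | <term>
        value = term()
        while peek() == "+":
            eat()
            value += term()
        return value

    def term():                    # <term> -> <term> * <factor> | <factor>
        value = factor()
        while peek() == "*":
            eat()
            value *= factor()
        return value

    def factor():                  # <factor> -> ( <expr> ) | <id>
        if peek() == "(":
            eat()
            value = expr()
            eat()                  # consume the closing ')'
            return value
        return env[eat()]          # look the identifier up

    return expr()

# B + C * A with B=2, C=3, A=4: * binds tighter, giving 2 + 12 = 14.
result = parse(["B", "+", "C", "*", "A"], {"A": 4, "B": 2, "C": 3})
```

Parenthesizing the sum instead, (B + C) * A, changes the result to 20, confirming that the nesting of <term> below <expr> is what decides the grouping.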
Operator Precedence

 A=B+C*A

57
Associativity
 An operator is either left-, right-, or non-associative
 Left: a + b + c = (a + b) + c
 Right: a ^ b ^ c = a ^ (b ^ c)
 Non: a < b < c is illegal (and thus undefined)
 The position of the recursion relative to the operator dictates the associativity
 Left (right) recursion → left (right) associativity

58
Associativity of Operators
 A = B + C – D * F / G
 Left-associative:
Operators of the same precedence are evaluated from left to right. C++/Java: +, -, *, /, %
 Right-associative:
Operators of the same precedence are evaluated from right to left. C++/Java: unary -, unary +, ! (logical negation)
 How do we enforce operator associativity using BNF?

59
Associativity of Operators

60
Associativity of Operators

61
Eliminating Left Recursion

 Left recursion can be a problem for some top-down parsers, causing infinite recursion.
 We show how to eliminate it in a simple case by introducing a new nonterminal and rewriting the productions.

62
Eliminating Left Recursion
A -> Aα | β becomes the non-left-recursive productions:

A -> βA'
A' -> αA' | ε

 No matter how many A-productions there are, we can eliminate immediate left recursion from them by the following technique.
63
Eliminating Left Recursion

First we group the A-productions as
 A -> A α1 | A α2 | ... | A αm | β1 | β2 | ... | βn
where no β begins with an A. Then we replace the A-productions by
 A -> β1 A' | β2 A' | ... | βn A'
 A' -> α1 A' | α2 A' | ... | αm A' | ε
So:
A -> βA'
A' -> αA' | ε
64
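The grouping technique above can be sketched directly as code: split the A-productions into left-recursive bodies (the αs) and the rest (the βs), then build the new A and A' rules. Representing productions as tuples of symbols and naming the fresh non-terminal A' are illustrative choices.

```python
def eliminate_left_recursion(nonterm, productions):
    """Remove immediate left recursion from one non-terminal's productions.

    productions: list of tuples of symbols; () stands for epsilon.
    Returns a dict mapping the non-terminal and its fresh primed
    non-terminal to their new production lists.
    """
    prime = nonterm + "'"
    alphas = [body[1:] for body in productions if body and body[0] == nonterm]
    betas = [body for body in productions if not body or body[0] != nonterm]
    new_a = [beta + (prime,) for beta in betas]            # A  -> βi A'
    new_prime = [a + (prime,) for a in alphas] + [()]      # A' -> αi A' | ε
    return {nonterm: new_a, prime: new_prime}

# A -> A a | b   becomes   A -> b A',  A' -> a A' | ε
result = eliminate_left_recursion("A", [("A", "a"), ("b",)])
```

The output matches the slide's schema: every β gets A' appended, and A' cycles through the αs before ending in ε.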
Left Factoring

 Two rules start with the same RHS prefix
 Make a new rule to distinguish them
 S -> if ( exp ) S
 S -> if ( exp ) S else S
becomes
 S -> if ( exp ) S S'
 S' -> else S | ε

65
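For the two-production case above, left factoring amounts to finding the longest common prefix and moving the differing tails into a new rule. This sketch handles exactly that case; production bodies are tuples of symbols, () stands for ε, and the S' naming is an illustrative convention.

```python
def left_factor(nonterm, body_a, body_b):
    """Left-factor two productions of nonterm that share a prefix."""
    prime = nonterm + "'"
    # Find the longest common prefix of the two bodies.
    k = 0
    while k < min(len(body_a), len(body_b)) and body_a[k] == body_b[k]:
        k += 1
    common, tail_a, tail_b = body_a[:k], body_a[k:], body_b[k:]
    # nonterm keeps the shared prefix; the primed rule holds the tails.
    return {nonterm: [common + (prime,)], prime: [tail_a, tail_b]}

s1 = ("if", "(", "exp", ")", "S")
s2 = ("if", "(", "exp", ")", "S", "else", "S")
result = left_factor("S", s1, s2)
```

The result reproduces the slide's rewrite: S -> if ( exp ) S S' and S' -> ε | else S, which removes the first-first conflict that would stall a predictive parser.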
