Professional Documents
Culture Documents
Specifying Languages With Regular Expressions and Context-Free Grammars
Specifying Languages With Regular Expressions and Context-Free Grammars
035
Specifying Languages with Regular
Expressions
p and Context-Free Grammars
Martin Rinard
Laboratory for Computer Science
Massachusetts Institute of Technology
Language Definition Problem
General
Gener al Rules Example
p
(0 | 1)*.(0|1)*
1) r1| r2 → r1 ( | 1)(0
(0 )( | 1)*.(0|1)*
) ( | )
2) r1| r2 → r2 1(0|1)*.(0|1)*
3)) r* →rr* 1.(0|1)*
1.(0|1)(0|1)*
4) r* → ε
1.(0|1)
1.0
Nondeterminism in Generation
• Rewriting is similar to equational reasoning
• But different rule applications may yield different final
results
p 1
Example Example 2
(0|1)*.(0|1)* (0|1)*.(0|1)*
(0|1)(0|1)*.(0|1)* ( | )( | ) ( | )
(0|1)(0|1)*.(0|1)*
1(0|1)*.(0|1)* 0(0|1)*.(0|1)*
1.(0|1)* 0.(0|1)*
1 (0|1)(0|1)*
1.(0|1)(0|1)* 0.(0|1)(0|1)*
1.(0|1) 0.(0|1)
10
1.0 01
0.1
Concept of Language Generated by
Regular Expressions
• Set of all strings generated by a regular
expression is language of regular expression
• In general, language may be (countably) infinite
• String in language is often called a token
Examples of Languages and Regular
Expressions
• ∑ = { 0, 1, . }
• (0|1)*.(0|1)*
(0|1)* (0|1)* - Binary
Bi floating
fl ti pointi t numbers
b
• (00)* - even-length all-zero strings
• 1*(01*01*)* - strings wwith
ith even number of
of
zeros
• ∑ = { a,b,c,
a b c 0,0 1,
1 2}
• (a|b|c)(a|b|c|0|1|2)* - alphanumeric
identifiers
• (0|1|2)* - trinary numbers
Alternate Abstraction
Finite-State
Finite State Automata
• Alphabet ∑
S t off states
• Set t t with
ith iinitial
iti l and
d acceptt states
t t
• Transitions between states, labeled with letters
(0|1)*.(0|1)*
1 . 1 Start state
0 0 Accept state
Automaton Accepting String
Conceptually, run string through automaton
• Have current state and current letter in string
• Start with start state and first letter in string
• At each step, match current letter against a transition
whose
h label
l b l is same as letter
l
• Continue until reach end of string or match fails
• If end
d iin acceptt state,
t t automaton
t t accepts t string
ti
• Language of automaton is set of strings it accepts
Example
p
Current state
1
. 1 Start state
0 0 Accept state
11.0
Current letter
Example
p
Current state
1
. 1 Start state
0 0 Accept state
11.0
Current letter
Example
p
Current state
1
. 1 Start state
0 0 Accept state
11.0
Current letter
Example
p
Current state
1
. 1 Start state
0 0 Accept state
11.0
Current letter
Example
p
Current state
1
. 1 Start state
0 0 Accept state
11.0
Current letter
Example
p
Current state
1
. 1 Start state
0 0 Accept state
11.0
String is accepted!
Current letter
Generative Versus Recognition
• Regular expressions give you a way to generate
all strings in language
• Automata give you a way to recognize if a specific
string is in language
• Philosophically
hl h ll very different
d ff
• Theoretically equivalent (for regular
expressions aand
nd automata)
• Standard approach
• Use regular expressions when define language
• Translated automatically into automata for
implementation
From Regular Expressions to
Automata
• Construction by structural induction
• Given
Gi an arbitrary
bit regular
l expression
i r
• Assume we can convert r to an automaton with
• One start
start state
• One accept state
• Show how too convert
c c
all constructors too deliver
an automaton with
• One start state
• One accept state
Basic Constructs
Start state
Accept sstate
tate
ε
ε
a
a∈Σ
Sequence
Start state
Accept state
r1r2 r1 r2
Sequence
Old start state Start state
Old accept state Accept state
r1r2 r1 r2
Sequence
Old start state Start state
Old accept state Accept state
ε
r1r2 r1 r2
Sequence
Old start state Start state
Old accept state Accept state
ε ε
r1r2 r1 r2
Sequence
Old start state Start state
Old accept state Accept state
ε ε ε
r1r2 r1 r2
Choice
Start state
Accept state
r1
r1|r2
r2
Choice
Old start state Start state
Old accept state Accept state
r1
r1|r2
r2
Choice
Old start state Start state
Old accept state Accept state
ε r1
r1|r2
ε r2
Choice
Old start state Start state
Old accept state Accept state
ε r1 ε
r1|r2
ε r2 ε
Kleene Star
Old start state Start state
Old accept
accept state
state Accept sstate
tate
r* r
Kleene Star
Old start state Start state
Old accept
accept state
state Accept sstate
tate
r* r
Kleene Star
Old start state Start state
Old accept
accept state
state Accept sstate
tate
ε ε
r* r
Kleene Star
Old start state Start state
Old accept
accept state
state Accept sstate
tate
ε ε
r* r
Kleene Star
Old start state Start state
Old accept
accept state
state Accept sstate
tate
ε ε
r* r
ε
NFA vs. DFA
• DFA
• No ε transitions
• At most one transition from each state for
each letter
a a
NOT
OK
OK
b a
• NFA – neither restriction
Conversions
a ε a ε
1
ε
2
ε 3 5 ε
7
ε
8
. 9
ε
10
ε 11 13 ε
15
ε
16
ε 4 6 ε ε 12 14 ε
ε ε
b b
. a a
5,7,2,3,4,8 13,15,10,11,12,16
a
a
. a
a
1,2,3,4,8 9,10,11,12,16
b b
b
b 672 3 48
6,7,2,3,4,8 . 14 15 10 11 12 16
14,15,10,11,12,16
b b
Lexical Structure in Languages
Each language typically has several categories of
words. In a typical programming language:
• Keywords (if, while)
• Arithmetic Operations (+, -, *, /)
• Integer numbers (1, 2, 45, 67)
• Floating point numbers (1.0, .2, 3.337)
• Identifiers
d f ((abc,
b i, j, ab345)
b )
• Typically have a lexical category for each
keyword and/or each category
• Each lexical category defined by regexp
Lexical Categories Example
• IfKeyword = if
• WhileKeyword = while
• Operator = +|-|*|/
• Integer
tege = [0[0-9]
9] [0
[0-9]*
9]
• Float = [0-9]*. [0-9]*
• Identifier = [a-z]([a-z]|[0-9])*
• Note that [0-9] = (0|1|2|3|4|5|6|7|8|9)
[a-z] = (a|b|c|…|y|z)
• Will use lexical
l i l categories
t i ini nextt level
l l
Programming Language Syntax
Expr
Expr
Op Expr
Open Expr Close +
< > Int
1
Expr Expr
Op
-
Int Int
2 1
Ambiguity in Grammar
Grammar is ambiguous if there are multiple derivations
(therefore multiple parse trees) for a single string
Ambig
guityy in ggrammar often reflects ambiguity
g y in
semantics of language
(which is considered undesirable)
Ambiguity Example
Two parse trees for 2-1+1
Tree corresponding
Tree corresponding
to 2-<1+1>
to <2-1>+1
Start Start
Expr Expr
E
Expr Op Expr
Expr Op Expr
-
+
Int
Int
Expr Op Expr 2 Expr Op Expr
1
- +
Int Int Int Int
2 1 1 1
Eliminating Ambiguity
Solution: hack the grammar
Conceptually,
p y, makes all operators
p associate to left
Parse Trees for Hacked Grammar
Only one parse tree for 2-1+1!
Valid parse tree No longer valid parse tree
Start Start
Expr Expr
Expr Op Int
* 4
Expr
p Op
p Int
- 3
Int
2
Hacking Around Precedence
Original Grammar Hacked Grammar
Op = +|-|*|/ AddOp = +|-
Int = [0-9] [0-9]* MulOp = *|/
Open = < Int = [0-9] [0-9]*
Close = > Open = <
Close = >
Start → Expr Start → Expr
Expr → Expr Op Int Expr → Expr AddOp Term
Expr → Int Expr → Term
Expr → Open Expr Close Term → Term MulOp Num
Term → Num
Num → Int
Num → Open Expr Close
Parse Tree Changes
New parse tree
Old parse tree for 2-3*4
f 2
for 2-3*4
3*4 Start
Start
Expr
Expr
Expr AddOp
Term
-
Expr Op IInt
* 4 Term
Term MulOp Num
Expr Op Int Num *
- 3 Num Int
Int Int 4
2 2 Int
3
General Idea
if Stat
Expr
Stat e2
2 s1
1 s2
2
e2
Alternative Readings
• Leftmost derivation
• Always
Al expandsd leftmost
l ft t remaining
i i
nonterminal
• Similarly for rightmost derivation
• Sentential form
• Partially or fully derived string from a step in
valid derivation
• 0 + Exprp Op p Expr
p
• 0 + Expr - 2
Defining a Language
• Grammar
• Generative approach
• All strings that grammar generates (How many are
there for grammar in previous example?)
• Automaton
• Recognition approach
• All strings that automaton accepts
• Different flavors of grammars and automata
• In general, grammars and automata correspond
Regular Languages
• Automaton Characterization
• (S,A,F,s
F 0,sF)
• Finite set of states S
• Finite Alphabet A
• Transition function F : S ×A → S
tate s0
• Start sstate
• Final states sF
• Lanuage is set of strings accepted by Automaton
Regular Languages
• Grammar Characterization
• (T,NT,S,P
T NT S P)
• Finite set of Terminals T
of Nonterminals NT
• Finite set of
• Start Nonterminal S (goal symbol, start
symbol)
• Finite set of Productions P: NT → (T | NT)*
• RHS of production can have any sequence of
terminals or nonterminals
Push-Down Automata
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.