CPSC 388 - Compiler Design and Construction: Scanners - Regular Expressions

CPSC 388 – Compiler Design
and Construction
Scanners – Regular Expressions

Announcements
 Last day to Add/Drop Sept 11
 Wipe Down Computer Keyboards and Mice
 ACM programming contest
 Read Chapter 3
 Homework 2 due this Friday
 PROG1 due this Friday
 FSA for Java string constants (Anybody
figure it out?)
Homework 1 returned
 Summary – In addition to your summary please
answer the following questions in your report:
 Context – What is the context of the research
presented in the paper? What new ideas or concepts
do you think were presented in the paper?
 Evaluation – How was the research evaluated? What
evaluation techniques would you like to see to
compare the research to the state of the art before
the paper?
 Significance – What is the significance of the
research? Do you feel the work is a minor
improvement or a major step in the field?
 Grammar / spelling / clarity / support for statements
made
Scanner Generator
.Jlex file .java file

Containing Scanner Generator Containing
Regular Expressions Scanner code
To understand Regular Expressions

you need to understand Finite-State Automata
FSA Formal Definition (5-tuple)
Q – a finite set of states
Σ – The alphabet of the automata
(finite set of characters to label edges)
δ – state transition function
δ(statei,character)  statej
q – The start state
F – The set of final states
Types of FSA
 Deterministic (DFA)
 No State has more than one outgoing
edge with the same label
 Non-Deterministic (NFA)
 States may have more than one
outgoing edge with same label.
 Edges may be labeled with ε, the empty
string. The FSA can take an epsilon
transition without looking at the current
input character.
Terms to Know
 Alphabet (Σ) – any finite set of
symbols e.g. binary, ASCII, Unicode
 String – finite sequence of symbols
e.g. 010001, banana, bãër
 Language – any countable set of
strings e.g.
Empty set
Well-formed C programs
English words
Regular Expressions
 Easy way to express a language that is
accepted by FSA
 Rules:
 ε is a regular expression
 Any symbol in Σ is a regular expression
If r and s are any regular expressions then so is:
 r|s denotes union e.g. “r or s”
 rs denotes r followed by s (concatination)
 (r)* denotes concatination of r with itself zero or
more times (Kleene closer)
 () used for controlling order of operations
Example Regular Expressions
Regular Expression Corresponding Language
ε {“”}
a {“a”}
abc {“abc”}
a|b|c {“a”,”b”,”c”}
(a|b|c)* {“”,”a”,”b”,”c”,”aa”,”ab”,”ac”,”aaa”,…}
a|b|c|…|z|A|B|…|Z Any letter
0|1|2|…|9 Any digit
Precedence in Regular Expressions
 * has highest precedence, left associative
 Concatenation has second highest

precedence, left associative
 | has lowest associative, left associative

More Regular Expression Examples
Regular Expression Corresponding Language
ε|a|b|ab* {“”, “a”, “b”, “ab”, “abb”, “abbb”,…}
ab*c {“ac”, “abc”, “abbc”,…}
ab*|a* {“”, “a”, “ab”, “aa”, “aaa”, “abb”,…}
a(b*|a*) {“a”, “ab”, “aa”, “abb”, “aaa”, …}
a(b|a)* {“a”, “ab”, “aa”, “aaa”, “aab”, “aba”,…}

You Try
 What is the language described by
each Regular Expression?
a*
(a|b)*
a|a*b
(a|b)(a|b)
aa|ab|ba|bb
(+|-|ε)(0|1|2|3|4|5|6|7|8|9)*
Regular Definitions
If Σ is an alphabet of basic symbols,
then a regular definition is a
sequence of definitions of the form:
D1 → R 1
1. Each di is a new symbol not in Σ and
D2 → R2 not the same as any other of the d’s.
…
2. Each ri is a regular expression over
Dn → R n
Σ U (d1,d2,…,di-1)
Regular Definitions Example
Example C identifiers:
Σ = ASCII
letter_ → a|b|c|…|z|A|B|C|…|Z|_
digit → 0|1|2|…|9
id → letter_(letter_|digit)*
Regular Definitions Example
Example Unsigned Numbers (integer or float):
Σ = ASCII
digit → 0|1|2|…|9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → (E(+|-| ε)digits)| ε
number → digits optionalFraction optionalExponent
Special Characters in Reg. Exp.
What does each of the following mean?
* – Kleene Closure
| – or
() – grouping
[] – creates a character class
+ – Positive Closure
? – zero or one instance
“” – anything in quotes means itself, e.g. “*”
. – matches any single character (except newline)
\ – used for escape characters (newline, tab, etc.)
^ – matches beginning of a line
$ – matches the end of a line
Extensions to Regular Expressions
 + means one or more occurrence
(positive closure)
 ? means zero or one occurrence
 Character classes
 a|r|t can be written [art]
 a|b|…|z can be written [a-z]
As long as there is a clear ordering to
characters
 [^a-z] matches any character except a-z
Example Using Character Classes
^[^aeiou]*$
Matches any complete line that does not
contain a lowercase vowel
How do you tell which meaning of ^ is
intended?
Try It
 Create Character Classes for:
 First ten letters (up to “j”)
 Lowercase consonants
 Digits in hexadecimal
 Create Regular Expressions for:
 Case Insensitive keyword such as
SELECT (or Select or SeLeCt) in SQL
 Java string constants
 Any string of whitespace characters
Creating a Scanner
 Create a set of regular expressions, one for
each token to be recognized
 Convert regular expressions into one
combined DFA
 Run DFA over input character stream
 Longest matching regular expression is selected
 If a tie then use first matching regular
expression
 Attach code to run when a regular
expression matches

CPSC 388 - Compiler Design and Construction: Scanners - Regular Expressions

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

CPSC 388 - Compiler Design and Construction: Scanners - Regular Expressions

Uploaded by

Copyright:

Available Formats

CPSC 388 – Compiler Design

Scanners – Regular Expressions

.Jlex file .java file

To understand Regular Expressions

 Concatenation has second highest

 | has lowest associative, left associative

ab*c {“ac”, “abc”, “abbc”,…}

ab|a {“”, “a”, “ab”, “aa”, “aaa”, “abb”,…}

a(b|a) {“a”, “ab”, “aa”, “abb”, “aaa”, …}

a(b|a)* {“a”, “ab”, “aa”, “aaa”, “aab”, “aba”,…}

You might also like