Regular Expressions

Regular Expressions
• DFA
Find the longest match
Deterministic, fast, state machine must be built
• NFA
Leftmost; the first match found wins (alternatives are tried in order), which is not always the longest.
Ordering of the RE changes the pattern matched.
• POSIX NFA
“longest of the leftmost”
for multiple matches starting at the same (leftmost)
position, return the one matching the most text.
See table of systems & uses
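The difference between "first in order" and "longest of the leftmost" can be seen in a small sketch. It assumes Python's `re` module, which is a backtracking (traditional NFA style) engine; the slides themselves show no Python. The helper `leftmost_longest` is a hypothetical name that only emulates the POSIX rule for a list of top-level alternatives.

```python
import re

text = "abcd"

# Traditional NFA: alternatives are tried in order, first success wins.
first = re.match(r"ab|abcd", text).group(0)
swapped = re.match(r"abcd|ab", text).group(0)

def leftmost_longest(alternatives, s):
    """Emulate 'longest of the leftmost': at the first position where any
    alternative matches, keep the alternative that matched the most text."""
    for pos in range(len(s) + 1):
        best = None
        for alt in alternatives:
            m = re.compile(alt).match(s, pos)
            if m and (best is None or len(m.group(0)) > len(best)):
                best = m.group(0)
        if best is not None:
            return best
    return None

print(first)                                   # ab
print(swapped)                                 # abcd
print(leftmost_longest(["ab", "abcd"], text))  # abcd
```

Swapping the two alternatives changes what the backtracking engine returns, while the emulated POSIX rule gives the same answer for either order.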
Regular Expressions
1 of 20
Lecture 14
EECS 498 Winter 2000
Examples
M = (Q, Σ, δ, q0, F), where Q = {q0, q1, q2, q3}, Σ = {0, 1}, F = {q0}
Start state: q0. Transitions δ, as read from the state diagram:
State   on 0   on 1
q0      q1     q2
q1      q0     q3
q2      q3     q0
q3      q2     q1
This DFSA M accepts all strings with ...
An FSA to accept decimal strings:
[State diagram: states S, A, B and H, with transitions labeled ‘-’, ‘.’ and 0...9]
Trace through “-34.76” and “-17.-56”
What’s the structure for the DFA engine?

state = 0;
while not EOF(input) loop
    state = TBL[ state, next(input) ];
end loop;
accept = FINAL[ state ];
The DFA engine serves as the basis for some language parsing
engines ...
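A minimal sketch of the table-driven engine, in Python (an assumption; the slide only gives pseudocode). The table is the four-state machine M from the earlier example, with transitions as read from its diagram, so treat the table itself as an assumption too. The function name `run_dfa` is hypothetical.

```python
# TBL[state][symbol] -> next state, for the four-state machine M
# (transitions read from the slide's diagram; an assumption of this sketch).
TBL = {
    "q0": {"0": "q1", "1": "q2"},
    "q1": {"0": "q0", "1": "q3"},
    "q2": {"0": "q3", "1": "q0"},
    "q3": {"0": "q2", "1": "q1"},
}
FINAL = {"q0"}

def run_dfa(text, start="q0"):
    """Consume every input symbol, then accept only if the state is final."""
    state = start
    for symbol in text:
        row = TBL[state]
        if symbol not in row:      # symbol outside the alphabet: reject
            return False
        state = row[symbol]
    return state in FINAL

for s in ["", "0110", "0101", "011"]:
    print(repr(s), run_dfa(s))
```

Each input symbol costs one table lookup, which is why the DFA is fast once the table has been built.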
Structuring Elements for REs
Meta Chars        Meaning
. (period)        Match any single character
* (Kleene star)   Match zero or more occurrences of the preceding RE
[]                Match any character within brackets
                  0-9 matches digits
                  a-z matches lower case characters
                  “-” in first position matches “-”
                  “^” in first position inverts set
^                 Matches beginning of line
$                 Matches end-of-line
{a,b}             Match count of preceding pattern (from ‘a’ to ‘b’ times); ‘b’ optional
                  If “a” is a word, this invokes a substitution from the definition section.
\                 Escape for metacharacters
+                 Matches 1 or more occurrences of the preceding RE
?                 Matches 0 or 1 occurrences of the preceding RE
|                 Alternation
“##”              Matches literal between quotes
()                Grouping of RE
• Alphabet Elements
character class is more efficient
• Concatenation, Alternation, grouping
• intervals {low,high}
• anchors
• (F)lex model
set of patterns is Alternation
• Perl Model
scan string for match
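A few of the structuring elements above, tried out in Python's `re` module (an assumption; it follows the Perl "scan string for match" model). The lex-only forms, `{name}` substitution and quoted literals, are left out because `re` does not have them. All variable names are illustrative.

```python
import re

digits      = re.findall(r"[0-9]+", "x12y345")      # character class with a range
non_digits  = re.findall(r"[^0-9]+", "x12y345")     # "^" in first position inverts the set
dash_first  = re.findall(r"[-0-9]+", "a-1b22")      # "-" in first position is a literal "-"
interval    = re.fullmatch(r"[a-z]{2,4}", "abc")    # interval: 2 to 4 lower case letters
too_long    = re.fullmatch(r"[a-z]{2,4}", "abcde")  # 5 letters: no match
anchored    = re.search(r"^ab+c?$", "abbb")         # anchors, "+" and "?"
alternation = re.fullmatch(r"(cat|dog)s", "dogs")   # grouping and alternation
escaped     = re.search(r"\.", "a.b")               # "\" escapes a metacharacter

print(digits, non_digits, dash_first)
print(bool(interval), bool(too_long), bool(anchored))
print(alternation.group(1), escaped.start())
```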
Construction of NFA from RE
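Each machine has a start state S and a final state F; λ marks an empty-string transition.
General machine: S --E--> F
Empty string transition: S --λ--> F
Alphabet symbol transition: S --a--> F
Concatenation (E1 E2): S --E1--> A --E2--> F
Alternation (E1 | E2): S --λ--> E1 --λ--> F and S --λ--> E2 --λ--> F
Closure (E*): [diagram: S, A and F joined by λ-transitions around E]
A DFA can always be constructed from an NFA (state expansion),
with a potential exponential increase in state space.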
Example: Develop an NDFSA from RE
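( + | - | λ ) d+ ≡ ( + | - | λ ) d d*
Step 1: S --(+|-|λ)--> A --d d*--> F
Step 2: S --{+, -, λ}--> A --d d*--> F   (one arc per alternative)
Step 3: S --{+, -, λ}--> A --d--> B --d*--> F
Step 4: S --{+, -, λ}--> A --d--> B --λ--> C --λ--> F, with a d loop on C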
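The final machine can be run directly by tracking the set of states the NFA could be in, which is the subset construction done on the fly. This is a sketch in Python (an assumption), with the transitions taken from one reading of the last diagram: S to A on +, - or λ; A to B on a digit; B to C on λ; a digit loop on C; C to F on λ. The label "digit" and the names `closure` and `accepts` are hypothetical.

```python
LAMBDA = ""        # label for λ (empty-string) transitions
DIGIT = "digit"    # label standing for any decimal digit d

# (state, label) -> set of next states
NFA = {
    ("S", "+"): {"A"},
    ("S", "-"): {"A"},
    ("S", LAMBDA): {"A"},
    ("A", DIGIT): {"B"},
    ("B", LAMBDA): {"C"},
    ("C", DIGIT): {"C"},
    ("C", LAMBDA): {"F"},
}
START, FINAL = "S", {"F"}

def closure(states):
    """All states reachable from `states` using only λ-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for nxt in NFA.get((q, LAMBDA), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def accepts(text):
    """Advance the whole set of live states one input character at a time."""
    current = closure({START})
    for ch in text:
        label = DIGIT if ch in "0123456789" else ch
        moved = set()
        for q in current:
            moved |= NFA.get((q, label), set())
        current = closure(moved)
        if not current:            # no live states left: reject early
            return False
    return bool(current & FINAL)

for s in ["+42", "-7", "123", "", "+", "4-2"]:
    print(repr(s), accepts(s))
```

Each distinct set of live states corresponds to one DFA state, which is where the potential exponential growth in state space comes from.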
Perl Operators & Operands
• $_, $ARG
default input and pattern-searching space
• $`, $PREMATCH
string preceding last successful match
• $&, $MATCH
string matched by last successful pattern match
• $', $POSTMATCH
string following last successful pattern match
• $+, $LAST_PAREN_MATCH
last bracket matched by last search
/Version: (.*)|Revision: (.*)/
$num = $+;
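For comparison, the same pieces of information can be pulled out of a Python match object. This is only an analogy, assuming Python's `re` module; Python has no global match variables, and `match_parts` is a hypothetical helper.

```python
import re

def match_parts(pattern, text):
    """Return rough analogues of Perl's $`, $&, $' and $+ for one search."""
    m = re.search(pattern, text)
    if m is None:
        return None
    # lastindex is the number of the last capturing group that matched.
    last = m.group(m.lastindex) if m.lastindex else None
    return {
        "prematch": text[:m.start()],   # like $`
        "match": m.group(0),            # like $&
        "postmatch": text[m.end():],    # like $'
        "last_paren": last,             # like $+
    }

parts = match_parts(r"Version: (.*)|Revision: (.*)", "id Revision: 1.5")
print(parts)
```

As with `$+`, the caller does not need to know which of the two alternatives matched in order to get the captured number.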
Any program that uses these variables (or calls functions that
do) forces the match engine to make copies of the text string used
for matching. This can be VERY expensive for large strings.
Copying is also required for parenthesized capture.
• “=~” and “!~”: By default, a pattern is matched against the content
of variable “$_”. Use these operators to match against another variable:
$a = "hello world";
$a =~ /^he/;    # true
$a !~ /^help/;  # also true
Regex Operators and their Precedence
Parentheses                (PATTERN)  (?:PATTERN)
Multipliers                ?  +  *  {m,n}  ??  +?  *?  {m,n}?
Sequencing and anchoring   abc  ^  $  \A  \Z  (?=PATTERN)  (?!PATTERN)
Alternation                |
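The precedence order (highest at the top of the table) decides how an ungrouped pattern is read. A short sketch, assuming Python's `re` module, which uses the same precedence; the variable names are illustrative.

```python
import re

# Multipliers bind tighter than sequencing: ab* is a(b*), not (ab)*.
seq_star    = re.fullmatch(r"ab*", "abbb")        # matches
seq_star_no = re.fullmatch(r"ab*", "abab")        # None
group_star  = re.fullmatch(r"(?:ab)*", "abab")    # matches

# Sequencing binds tighter than alternation: ab|cd is (ab)|(cd).
alt_loose    = re.fullmatch(r"ab|cd", "cd")       # matches
alt_loose_no = re.fullmatch(r"ab|cd", "acd")      # None
alt_group    = re.fullmatch(r"a(?:b|c)d", "acd")  # matches

# Anchors sit at the sequencing level, so ^a|b$ means (^a)|(b$).
anchor_alt = re.search(r"^a|b$", "xxb")           # matches via the second branch

print(bool(seq_star), bool(seq_star_no), bool(group_star))
print(bool(alt_loose), bool(alt_loose_no), bool(alt_group))
print(anchor_alt.span())
```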
Pattern Matching
• match: m/PATTERN/mods
• substitution: s/PATTERN/REPLACE/mods
• split: split /PATTERN/
returns the list of substrings between the places where the pattern matched

Modifiers
Modifier   Meaning
g          Global: find all occurrences
i          Match case-insensitively
m          Treat string as multiple lines
o          Optimize: compile pattern once
s          Treat string as single line
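Rough Python counterparts of the three operations and the modifiers, assuming the `re` module. The mapping is approximate: `findall` and `sub` are global by default, and `re.compile` is only similar in spirit to /o. Variable names are illustrative.

```python
import re

text = "Cat\ncat\nDOG"

all_cats    = re.findall(r"cat", text, flags=re.I)       # like m/cat/gi
line_starts = re.findall(r"^\w+", text, flags=re.M)      # /m: ^ matches at each line start
dot_plain   = re.search(r"Cat.cat", text)                # "." stops at "\n" by default
dot_single  = re.search(r"Cat.cat", text, flags=re.S)    # /s: "." also matches "\n"
compiled    = re.compile(r"dog", re.I)                   # compile the pattern once
replaced    = re.sub(r"cat", "bird", text, flags=re.I)   # like s/cat/bird/gi
fields      = re.split(r"\n", text)                      # like split /\n/

print(all_cats, line_starts)
print(dot_plain, bool(dot_single), compiled.search(text).group(0))
print(repr(replaced), fields)
```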
Traditional NFA
Matching Strategy
We’re given a string and a regular expression (RE). The
processing “engine” will establish if the string is a member of the
language defined by the RE.
The engine returns the leftmost match; at that position it takes the first match it finds, trying the longest repetitions first.

a) Match as far left as possible
The match only has to reach the end of RE, doesn’t have
to reach the end of the string.
Consider /x*/ applied to “fox”.
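The /x*/ case can be checked with any backtracking engine; this sketch assumes Python's `re` module. Because `x*` is satisfied by zero characters, the leftmost match is the empty string at position 0, and the engine never reaches the "x" at the end.

```python
import re

m_star = re.search(r"x*", "fox")
m_plus = re.search(r"x+", "fox")

print(m_star.span(), repr(m_star.group(0)))   # (0, 0) ''
print(m_plus.span(), repr(m_plus.group(0)))   # (2, 3) 'x'
```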

b) RE is a set of alternatives (possibly a singleton).
Stop on first match that allows successful completion of
entire regular expression.
Upon failure, backtrack

c) An alternative matches if every item (assertion or atom)
matches.
If ordered matching fails, engine backtracks.
Consider /x*y*/
Match an “x”, try all possible y’s.

Try next “x”, try all possible y’s.
(second pattern varies faster due to backtracking)

d) Assertions
^ beginning of string (line)
$ end of string (line)
\b word boundary (between \w and \W)
\B non-word boundary (inverse of \b)
\A beginning of string; matches only once, even if /m is specified
\Z end of string; matches only once, even if /m is specified
\G end of previous global match, m/PATTERN/g
(?=PATTERN) lookahead assertion; matches if PATTERN follows
(?!PATTERN) negative lookahead assertion; matches if PATTERN does NOT follow
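Several of these assertions can be tried in Python's `re` module (an assumption; `re` has no counterpart to \G, so it is not shown). None of them consume text; they only test a position. Variable names are illustrative.

```python
import re

whole_words = re.findall(r"\bcat\b", "cat concat cat.")   # \b: word boundaries on both sides
inside_word = re.findall(r"\Bcat", "cat concat")          # \B: not at a word boundary

multi = "ab\nab"
caret_m = re.findall(r"^ab", multi, flags=re.M)           # ^ matches at every line start
slash_a = re.findall(r"\Aab", multi, flags=re.M)          # \A matches only once

followed     = re.search(r"foo(?=bar)", "foobar foobaz")  # "foo" only if "bar" follows
not_followed = re.search(r"foo(?!bar)", "foobar foobaz")  # "foo" only if "bar" does not follow

print(whole_words, inside_word)
print(caret_m, slash_a)
print(followed.span(), not_followed.span())
```

The lookahead matches have length three in both cases: the text tested by the assertion is not part of the match.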
e) Quantified Atom
The atom is matched some number of times.
Maximal (greedy)   Minimal (lazy)   Expected Range
{n,m}              {n,m}?           At least n, but no more than m (m, n < 2^16)
{n,}               {n,}?            At least n times
{n}                {n}?             Exactly n times
*                  *?               {0,}
+                  +?               {1,}
?                  ??               {0,1}
x* = xxx...xx | xxx...x | xxx... | ... | x | λ

x*? = λ | x | xx | ... | xxx...x | xxx...xx | ...
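The two orderings above can be observed directly. A sketch assuming Python's `re` module, which uses the same greedy and lazy quantifier syntax; the variable names are illustrative.

```python
import re

# x* tries the longest run first; x*? tries the empty string first.
greedy_star = re.match(r"x*", "xxx").group(0)
lazy_star   = re.match(r"x*?", "xxx").group(0)

# The same holds for an interval.
greedy_range = re.match(r"x{1,2}", "xxx").group(0)
lazy_range   = re.match(r"x{1,2}?", "xxx").group(0)

# With more pattern after the quantifier, the lazy form grows only as far as needed.
text = "<b>bold</b> and <i>italic</i>"
greedy_tag = re.search(r"<.+>", text).group(0)
lazy_tag   = re.search(r"<.+?>", text).group(0)

print(repr(greedy_star), repr(lazy_star))
print(repr(greedy_range), repr(lazy_range))
print(repr(greedy_tag), repr(lazy_tag))
```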
The maximal (or greedy) form can cause LOTS of backtracking.
This is further complicated if one uses parentheses to group
references (since recovery is thereby required). (pg 150).
f) Match items according to their type
(PATTERN) grouped regular expression with backreference ($1, \1, $2, \2, etc.)
(?:PATTERN) grouped regular expression without backreference; simplifies backtracking
“.” (period) matches everything except \n
[...] character classes: [aeiou], [fee|fie|foe] == [feio|]
\a, \n alarm (bell), newline
\d, \D digit, non-digit
\w, \W word ([a-zA-Z_0-9]), non-word
\s, \S whitespace, non-whitespace
\[0-7]{1,3} octal value
\x[0-9a-f]{1,2} hexadecimal value
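A quick check of the character-class items, assuming Python's `re` module, whose shorthand classes and octal and hex escapes follow Perl's. It also shows why [fee|fie|foe] is just the set [feio|]: inside brackets, "|" is an ordinary character and repeats are ignored. Variable names are illustrative.

```python
import re

same_set      = set("fee|fie|foe") == set("feio|")       # a class is a set of characters
pipe_in_class = re.fullmatch(r"[fee|fie|foe]", "|")      # "|" is literal inside []
word_in_class = re.fullmatch(r"[fee|fie|foe]", "fee")    # a class matches one character only

digits  = re.findall(r"\d+", "room 101, floor 3")
words   = re.findall(r"\w+", "snake_case, x9!")
fields  = re.split(r"\s+", "a  b\tc")
octal_a = re.fullmatch(r"\101", "A")    # octal 101 is character code 65, "A"
hex_a   = re.fullmatch(r"\x41", "A")    # hex 41 is also "A"

print(same_set, bool(pipe_in_class), word_in_class)
print(digits, words, fields)
print(bool(octal_a), bool(hex_a))
```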
RE Extensions for Perl
• Comment: (?#text)
• No Backreference: (?:PATTERN)
Nothing saved in $1 ($n)

• Zero Width lookahead assertion: (?=PATTERN)
• Negated lookahead assertion: (?!PATTERN)
• Embedded Pattern Match Modifier: (?imsx)
$pattern = "testString";
if (/$pattern/i)
is the same as
$pattern = "(?i)testString";
if (/$pattern/)
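The same equivalence holds in Python's `re` module (an assumption; the slide shows Perl), where the embedded modifier must appear at the start of the pattern. Variable names are illustrative.

```python
import re

text = "a TESTSTRING here"

with_flag = re.search("testString", text, flags=re.I)   # modifier passed separately
embedded  = re.search("(?i)testString", text)           # modifier carried inside the pattern
plain     = re.search("testString", text)               # no modifier: case must match

print(with_flag.group(0), embedded.group(0), plain)
```

Embedding the modifier is useful when the pattern is stored in a variable or a file and the code that runs the match cannot be changed.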
POSIX NFA
Must examine ALL cases to find the longest left-most match
(not just first match)
Pattern to determine if DFA, NFA or POSIX NFA:
“X(.+)+X”
on string 1
“==XX=====================”
and string 2
“==XX=====================X”
• Instant failure on 1: DFA
• More than a short time on 1, but returns quickly on 2 (a match): NFA
• Doesn’t return on 2 either: POSIX NFA
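The test can be tried on Python's `re` module, which behaves like a traditional NFA. This sketch uses shortened strings (an assumption, to keep the run time small); with the full-length strings the failing case takes far longer, since each extra "=" roughly doubles the work of a backtracking engine. The helper name `timed_search` is hypothetical.

```python
import re
import time

PATTERN = re.compile(r"X(.+)+X")

def timed_search(text):
    """Return (matched?, seconds) for one search with the backtracking engine."""
    start = time.perf_counter()
    found = PATTERN.search(text) is not None
    return found, time.perf_counter() - start

n = 12                                  # number of trailing "=" characters; keep small
no_match  = "==XX" + "=" * n            # shortened string 1: cannot match
has_match = "==XX" + "=" * n + "X"      # shortened string 2: matches

for label, s in [("string 1", no_match), ("string 2", has_match)]:
    found, seconds = timed_search(s)
    print(label, found, f"{seconds:.6f}s")
```

On string 1 the engine must try every way of splitting the text among the repetitions of (.+) before it can report failure; on string 2 the first greedy attempt needs only one step of backtracking to succeed.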