
Regular Expressions

• DFA
Finds the longest match.
Deterministic and fast, but the state machine must be built first.
• NFA
Leftmost match; alternatives are tried in order and the first that succeeds wins,
so the ordering of the RE changes the pattern matched.
• POSIX NFA
“Longest of the leftmost”:
for multiple matches starting at the same (leftmost)
position, return the one matching the most text.

See table of systems & uses

Regular Expressions
Lecture 14
EECS 498 Winter 2000

Examples
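M = (Q, Σ, δ, q0, F) where Q = {q0, q1, q2, q3}, Σ = {0, 1}, F = {q0}

[Figure: state diagram of M with start state q0. Edges labeled 1 connect q0 and q2, and q1 and q3, in both directions; edges labeled 0 connect q0 and q1, and q2 and q3, in both directions.]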
This DFSA M accepts all strings with ...
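
A minimal Python sketch of simulating M, so candidate strings can be tried by hand. The transition table is an assumption read off the diagram (edges labeled 1 between q0 and q2 and between q1 and q3; edges labeled 0 between q0 and q1 and between q2 and q3). `DELTA` and `accepts` are illustrative names, not part of the lecture.

```python
# Assumed transition function of M, keyed by (state, input symbol).
DELTA = {
    ("q0", "0"): "q1", ("q0", "1"): "q2",
    ("q1", "0"): "q0", ("q1", "1"): "q3",
    ("q2", "0"): "q3", ("q2", "1"): "q0",
    ("q3", "0"): "q2", ("q3", "1"): "q1",
}
START, FINAL = "q0", {"q0"}


def accepts(s):
    """Run M over s and report whether it halts in a final state."""
    state = START
    for ch in s:
        state = DELTA[(state, ch)]
    return state in FINAL


for s in ["", "0", "0011", "0101", "011"]:
    print(repr(s), accepts(s))
```

Running it over a handful of strings is a quick way to form a guess about the language M accepts.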

An FSA to accept decimal strings:
[Figure: state diagram with states S, A, B, and H; transitions are labeled ‘-’, ‘.’, and 0...9.]

Trace through “-34.76” and “-17.-56”


What’s the structure for the DFA engine?


state = 0;
while not EOF(input) loop
    state = TBL[ state, next(input) ];
end loop;
accept = FINAL[ state ];

The DFA engine serves as the basis for some language parsing
engines ...
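
A table-driven engine of this shape might be sketched in Python as follows. The language recognized here (digits, optionally followed by ‘.’ and more digits) and the names `TBL`, `FINAL`, `classify`, and `run` are assumptions chosen for illustration; a dead state absorbs every invalid input.

```python
DIGIT, DOT, OTHER = 0, 1, 2   # input classes (columns of the table)
ERR = 3                       # dead state: no way back to acceptance

# TBL[state][input_class] -> next state
TBL = [
    # DIGIT  DOT   OTHER
    [1,      ERR,  ERR],   # 0: start
    [1,      2,    ERR],   # 1: integer part (final)
    [4,      ERR,  ERR],   # 2: just saw '.'
    [ERR,    ERR,  ERR],   # 3: dead state
    [4,      ERR,  ERR],   # 4: fraction part (final)
]
FINAL = [False, True, False, False, True]


def classify(ch):
    """Map a character to its column in TBL."""
    if ch in "0123456789":
        return DIGIT
    if ch == ".":
        return DOT
    return OTHER


def run(text):
    """One table lookup per input character, then check the final state."""
    state = 0
    for ch in text:
        state = TBL[state][classify(ch)]
    return FINAL[state]


print(run("34.76"), run("34."), run("17.-56"))
```

The loop does constant work per character regardless of the pattern, which is why the DFA approach is fast once the table exists.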

Structuring Elements for REs

Meta chars        Meaning
. (period)        Match any single character
* (Kleene star)   Match zero or more of the preceding RE
[]                Match any character within brackets
                    0-9 matches digits
                    a-z matches lower case characters
                    “-” in first position matches “-”
                    “^” in first position inverts set
^                 Match beginning of line
$                 Match end of line
{a,b}             Match count of preceding pattern (from ‘a’ to ‘b’ times); ‘b’ optional.
                  If “a” is a word, this invokes a substitution from the definition section.
\                 Escape for metacharacters
+                 Match one or more of the preceding RE
?                 Match zero or one of the preceding RE
|                 Alternation
“##”              Match the literal text between the quotes
()                Grouping of RE
• Alphabet Elements
character class is more efficient
• Concatenation, Alternation, grouping
• intervals {low,high}
• anchors
• (F)lex model
set of patterns is Alternation
• Perl Model
scan string for match
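
The two models can be contrasted with a short Python sketch. The rule set and the names `RULES`, `LEXER`, `tokens`, and `first_number` are hypothetical. One caveat on the analogy: (f)lex picks the longest match among its rules, whereas Python's alternation picks the first branch that succeeds, so rule order matters more here than it would in flex.

```python
import re

# (F)lex model: every rule becomes one branch of a single big alternation.
RULES = [("NUMBER", r"[0-9]+"), ("IDENT", r"[a-z]+"), ("SPACE", r"[ \t]+")]
LEXER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in RULES))


def tokens(text):
    """Repeatedly match the rule set at the current position."""
    pos, out = 0, []
    while pos < len(text):
        m = LEXER.match(text, pos)
        if m is None:
            raise ValueError(f"no rule matches at position {pos}")
        if m.lastgroup != "SPACE":          # skip whitespace tokens
            out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out


# Perl model: scan the string for the first place the pattern matches.
def first_number(text):
    m = re.search(r"[0-9]+", text)
    return m.group() if m else None


print(tokens("abc 42 x"))
print(first_number("abc 42 x"))
```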

Construction of NFA from RE
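[Figure: NFA building blocks, each with start state S and final state F]
  general machine:             S → E → F
  empty string transition:     S → F on λ
  alphabet symbol transition:  S → F on a
  Concatenation (E1 E2):       S → E1 → A → E2 → F
  Alternation (E1 | E2):       λ transitions lead from S into E1 and into E2, and from each on to F
  Closure (E*):                λ transitions around E allow it to be skipped or repeated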

A DFA can always be constructed from an NFA (state expansion),
with a potential exponential increase in the state space.
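
State expansion can be sketched as the usual subset construction, where each DFA state is a set of NFA states. The small NFA below, for (+|-|λ) d d* with the letter d standing for any digit, and all names in the code are assumptions made for illustration.

```python
from collections import deque

LAMBDA = ""  # label used here for λ (empty-string) transitions

# Hypothetical NFA: state -> {symbol: set of next states}
NFA = {
    "S": {"+": {"A"}, "-": {"A"}, LAMBDA: {"A"}},
    "A": {"d": {"B"}},
    "B": {"d": {"B"}},
}
NFA_START, NFA_FINAL = "S", {"B"}


def closure(states):
    """All states reachable from `states` using only λ transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA[s].get(LAMBDA, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)


def subset_construction(alphabet):
    """Build DFA states (sets of NFA states) reachable from the start."""
    start = closure({NFA_START})
    dfa, todo = {}, deque([start])
    while todo:
        current = todo.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        for sym in alphabet:
            moved = set()
            for s in current:
                moved |= NFA[s].get(sym, set())
            target = closure(moved)       # the empty set acts as dead state
            dfa[current][sym] = target
            todo.append(target)
    finals = {state for state in dfa if state & NFA_FINAL}
    return start, dfa, finals


def dfa_accepts(text, start, dfa, finals):
    state = start
    for ch in text:
        state = dfa[state][ch]
    return state in finals


START, DFA, FINALS = subset_construction("+-d")
print(len(DFA), dfa_accepts("-dd", START, DFA, FINALS))
```

Here the three NFA states expand to four DFA states, including the empty set as a dead state. In the worst case an n-state NFA can need up to 2^n DFA states.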


Example: Develop an NDFSA from RE

( + | - | λ ) d+ ≡ ( + | - | λ ) d d*

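Step 1:  S --(+|-|λ)--> A --d d*--> F

Step 2:  S --+--> A,  S -- - --> A,  S --λ--> A;  A --d d*--> F

Step 3:  S --{+, -, λ}--> A --d--> B --d*--> F

Step 4:  S --{+, -, λ}--> A --d--> B --λ--> C --λ--> F, with a d loop on C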
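
The identity d+ ≡ d d* used in this example can be spot-checked with a few lines of Python. The sample strings are arbitrary, `[0-9]` stands in for d, and an optional `[+-]?` plays the role of ( + | - | λ ).

```python
import re

PLUS_FORM = re.compile(r"[+-]?[0-9]+")        # ( + | - | λ ) d+
STAR_FORM = re.compile(r"[+-]?[0-9][0-9]*")   # ( + | - | λ ) d d*

SAMPLES = ["42", "-7", "+003", "", "+", "4-2", "--1", "12a"]

# For each sample, record whether each form accepts the whole string.
RESULTS = [
    (s, PLUS_FORM.fullmatch(s) is not None, STAR_FORM.fullmatch(s) is not None)
    for s in SAMPLES
]

for s, plus_ok, star_ok in RESULTS:
    print(repr(s), plus_ok, star_ok)
```

Agreement on a sample set is only a sanity check, not a proof of equivalence.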

Perl Operators & Operands

• $_, $ARG
default input and pattern-matching string.
• $`, $PREMATCH
string preceding last successful match
• $&, $MATCH
string matched by last successful pattern match
• $', $POSTMATCH
string following last successful pattern match
• $+, $LAST_PAREN_MATCH
last bracket matched by last search pattern

/Version: (.*)|Revision: (.*)/;
$num = $+;
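
For comparison, rough Python analogues of these variables can be derived from a match object. This is a sketch of the correspondence, not Perl semantics: the input string is made up, `(\d+)` replaces `(.*)` so that some text is left after the match, and `lastindex` only approximates `$+`.

```python
import re

text = "build Revision: 7 ok"
m = re.search(r"Version: (\d+)|Revision: (\d+)", text)

prematch = text[:m.start()]           # like $`  (text before the match)
matched = m.group(0)                  # like $&  (the matched text)
postmatch = text[m.end():]            # like $'  (text after the match)
last_paren = m.group(m.lastindex)     # like $+  (last group that matched)

print(repr(prematch), repr(matched), repr(postmatch), repr(last_paren))
```

Whichever alternative matched, `last_paren` holds its captured number, which is the point of the `$+` idiom.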


Any program that uses these variables (or calls functions that
do), forces the match engine to make copies of text string used
for matching. This can be VERY expensive for large strings.

Copying required for parenthesized capture.

• “=~” and “!~”: By default, a pattern is matched against the content
of variable “$_”. Use these operators to match against another variable:

$a = "hello world";
$a =~ /^he/;     # true
$a !~ /^help/;   # also true

Regex Operators and their Precedence

Parentheses                 (PATTERN)  (?:PATTERN)

Multipliers                 ?  +  *  {m,n}
                            ??  +?  *?  {m,n}?

Sequencing and anchoring    abc  ^  $  \A  \Z
                            (?=PATTERN)  (?!PATTERN)

Alternation                 |
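
The effect of this ordering (tightest binding first) can be checked with a few assertions. The sketch uses Python's `re`, which follows the same precedence for these operators; `full` is a hypothetical helper.

```python
import re


def full(pattern, s):
    """True if pattern matches the whole of s."""
    return re.fullmatch(pattern, s) is not None


# Multipliers bind tighter than sequencing: ab* is a(b*), not (ab)*.
assert full(r"ab*", "abbb") and not full(r"ab*", "abab")
assert full(r"(?:ab)*", "abab")

# Sequencing binds tighter than alternation: ab|cd is (ab)|(cd).
assert full(r"ab|cd", "cd") and not full(r"ab|cd", "abd")
assert full(r"a(?:b|c)d", "abd")

# Anchors are part of sequencing, so ^a|b$ is (^a)|(b$).
assert re.search(r"^a|b$", "axx") and re.search(r"^a|b$", "xxb")
assert not re.search(r"^(?:a|b)$", "axx")

print("precedence checks passed")
```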


Pattern Matching

• match: m/PATTERN/
• substitution: s/PATTERN/REPLACE/mods
• split: split PATTERN
returns the list of substrings between the places where the pattern matched

Modifiers

Modifier   Meaning
g          Global: find all occurrences
i          Match case-insensitively
m          Treat string as multiple lines
o          Optimize: compile pattern once
s          Treat string as single line
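
The same operations have close counterparts in Python's `re` module, sketched below on a made-up input. The mapping in the comments is approximate: Python has no direct equivalent of /o, although `re.compile` serves a similar purpose.

```python
import re

text = "One fish\ntwo Fish\nred fish"

all_fish = re.findall(r"fish", text)                  # like m/fish/g
any_case = re.findall(r"fish", text, re.IGNORECASE)   # like m/fish/gi
line_starts = re.findall(r"^\w+", text, re.MULTILINE) # like m/^\w+/gm
dot_all = re.search(r"One.*red", text, re.DOTALL)     # like m/One.*red/s
no_dot_all = re.search(r"One.*red", text)             # '.' stops at \n
replaced = re.sub(r"fish", "cat", text, count=1)      # like s/fish/cat/
pieces = re.split(r"\s+", text)                       # like split /\s+/

print(all_fish, any_case, line_starts, pieces)
```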

Traditional NFA
Matching Strategy
We’re given a string and a regular expression (RE). The
processing “engine” will establish if the string is a member of the
language defined by the RE.

The engine will match: left-most, longest (first) match.


a) Match as far left as possible
The match only has to reach the end of RE, doesn’t have
to reach the end of the string.

Consider /x*/ applied to “fox”.
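
The /x*/ case is easy to try. The sketch below uses Python's `re`, a backtracking engine that behaves like a traditional NFA on this pattern: the leftmost match wins even though it is empty.

```python
import re

m = re.search(r"x*", "fox")
print(m.span(), repr(m.group()))     # empty match at the far left

m2 = re.search(r"x+", "fox")
print(m2.span(), repr(m2.group()))   # x+ cannot match empty, so it finds the 'x'
```

Because zero x's satisfy x* at position 0, the engine never reaches the real "x" at the end of the string.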


b) The RE is a set of alternatives (possibly a singleton).
Stop on the first alternative that allows successful completion of
the entire regular expression.


Upon failure, backtrack
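
Ordered alternation and backtracking can be observed directly. A small sketch with Python's `re`, which tries alternatives left to right in the same way (the patterns and input are made up):

```python
import re

# The first alternative that leads to an overall match wins,
# even when a later alternative would match more text.
first = re.search(r"a|ab", "abc").group()

# Reordering the alternatives changes the text matched.
swapped = re.search(r"ab|a", "abc").group()

# Here 'a' is tried first, but 'c' cannot follow it, so the engine
# backtracks and tries the second alternative 'ab'.
forced = re.search(r"(?:a|ab)c", "abc").group()

print(first, swapped, forced)
```

A POSIX NFA or a DFA would report the longer "ab" for both of the first two patterns.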


c) An alternative matches if every item (assertion or atom)
matches.

If ordered matching fails, engine backtracks.

Consider /x*y*/

Match an “x”, try all possible y’s.


Try next “x”, try all possible y’s.

(second pattern varies faster due to backtracking)


d) Assertions

^             beginning of string (line)
$             end of string (line)
\b            word boundary (between \w and \W)
\B            non-word boundary (inverse of \b)
\A            beginning of string; matches only once, even if /m is specified
\Z            end of string; matches only once, even if /m is specified
\G            end of previous global pattern match, m/PATTERN/g
(?=PATTERN)   lookahead assertion: matches if PATTERN follows
(?!PATTERN)   negative lookahead assertion: matches if PATTERN does NOT follow
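
Most of these assertions exist in Python's `re` with the same spelling, so their zero-width behavior can be sketched there. Two differences are worth noting: Python has no \G, and its \Z matches only at the very end of the string. The sample text is made up.

```python
import re

text = "cat concat\ncatalog"

whole_word = re.findall(r"\bcat\b", text)         # only the standalone word
inside_word = re.findall(r"\Bcat", text)          # the 'cat' inside 'concat'
each_line = re.findall(r"^cat", text, re.M)       # ^ matches at every line start
string_start = re.findall(r"\Acat", text, re.M)   # \A matches once, even with M
followed = re.findall(r"cat(?=alog)", text)       # 'cat' followed by 'alog'
not_followed = re.findall(r"cat(?!alog)", text)   # 'cat' not followed by 'alog'

print(len(whole_word), len(inside_word), len(each_line),
      len(string_start), len(followed), len(not_followed))
```

In every case the matched text is just "cat": the assertions constrain where a match may occur without consuming any characters.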


e) Quantified Atom
The atom is matched some number of times

Maximal (greedy)   Minimal (lazy)   Expected Range
{n,m}              {n,m}?           At least n, but no more than m times (m, n < 2^16)
{n,}               {n,}?            At least n times
{n}                {n}?             Exactly n times
*                  *?               {0,}
+                  +?               {1,}
?                  ??               {0,1}

x* = xxx...xx | xxx...x | xxx... | ... | x | λ


x*? = λ | x | xx | ... | xxx...x | xxx...xx | ...
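
The two orderings translate into visibly different matches. A brief sketch with Python's `re`, which supports the same greedy and lazy quantifier forms (the sample strings are made up):

```python
import re

# Greedy tries the longest candidate first; lazy tries the shortest first.
greedy_star = re.search(r"x*", "xxx").group()
lazy_star = re.search(r"x*?", "xxx").group()
lazy_range = re.search(r"x{2,3}?", "xxxx").group()

text = "<b>bold</b> and <i>italic</i>"
greedy_tag = re.search(r"<.+>", text).group()   # runs to the last '>'
lazy_tag = re.search(r"<.+?>", text).group()    # stops at the first '>'

print(repr(greedy_star), repr(lazy_star), repr(lazy_range))
print(greedy_tag)
print(lazy_tag)
```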

The maximal (or greedy) form can cause LOTS of backtracking.

This is further complicated if one uses parentheses to group
references (since recovery is thereby required). (pg 150).

f) Match items according to their type

(PATTERN)         grouped regular expression with backreference ($1, \1, $2, \2, etc.)
(?:PATTERN)       grouped regular expression without backreference; simplifies backtracking
. (period)        matches everything except \n
[...]             character classes: [aeiou], [fee|fie|foe] == [feio|]
\a, \n            control characters (alarm, newline)
\d, \D            digit, non-digit
\w, \W            word ([a-zA-Z_0-9]), non-word
\s, \S            whitespace, non-whitespace
\[0-7]{1,3}       octal value
\x[0-9a-f]{1,2}   hexadecimal value
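
A few of these items are sketched below in Python's `re`, whose syntax for them matches Perl's closely. The sample strings are made up, and the octal example uses exactly three digits because Python's rules for shorter octal escapes differ.

```python
import re

# Capturing vs non-capturing groups: only (...) is numbered.
m = re.search(r"(\w+)@(?:\w+\.)+(\w+)", "mail bob@cs.example.edu now")
captured = m.groups()

# Backreference: \1 must repeat whatever group 1 matched.
doubled = re.search(r"\b(\w+) \1\b", "it is is fine").group()

# Inside [...], '|' is an ordinary character, not alternation.
chars = [chr(c) for c in range(128)]
same_class = all(
    bool(re.fullmatch(r"[fee|fie|foe]", ch)) == bool(re.fullmatch(r"[feio|]", ch))
    for ch in chars
)

# Class shorthands, then hex and octal escapes for 'A'.
shorthands = re.fullmatch(r"\d\D\w\W\s\S", "1a_- x") is not None
escapes = re.fullmatch(r"\x41\101", "AA") is not None

print(captured, doubled, same_class, shorthands, escapes)
```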

RE Extensions for Perl

• Comment: (?#text)
• No Backreference: (?:PATTERN)
Nothing saved in $1 ($n)
• Zero Width lookahead assertion: (?=PATTERN)
• Negated lookahead assertion: (?!PATTERN)
• Embedded Pattern Match Modifier: (?imsx)

$pattern = "testString";
if (/$pattern/i)
same as
$pattern = "(?i)testString";
if (/$pattern/)
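
Python's `re` accepts the same embedded-modifier syntax, so the equivalence can be sketched there (the subject string is made up). The embedded form is useful when the pattern arrives as data and the caller cannot pass flags.

```python
import re

subject = "a TESTSTRING here"

pattern = "testString"
with_flag = re.search(pattern, subject, re.IGNORECASE)

pattern_inline = "(?i)testString"     # modifier carried inside the pattern
with_inline = re.search(pattern_inline, subject)

print(with_flag.group(), with_inline.group())
```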


POSIX NFA
Must examine ALL cases to find the longest left-most match
(not just first match)

Pattern to determine if DFA, NFA or POSIX NFA


“X(.+)+X”
on
“==XX=====================”
and
“==XX=====================X”

• Instant failure on 1: DFA
• Takes a long time on 1 but returns quickly on 2: NFA
• Takes a long time on 2 as well: POSIX NFA
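
The test can be tried on Python's `re`, which is a backtracking (traditional NFA style) engine. The sketch below uses shorter strings than the slide so that the failing case finishes quickly; `timed` is a hypothetical helper, and the timings depend on the machine.

```python
import re
import time

PATTERN = re.compile(r"X(.+)+X")


def timed(text):
    """Return (matched?, seconds taken) for one search."""
    start = time.perf_counter()
    found = PATTERN.search(text) is not None
    return found, time.perf_counter() - start


for n in (8, 12, 16):
    no_x = "==XX" + "=" * n      # string 1: no closing X, cannot match
    with_x = no_x + "X"          # string 2: closing X present
    print(n, timed(no_x), timed(with_x))
```

On a backtracking engine the time for string 1 should grow roughly exponentially with n, while string 2 matches almost immediately. That is the NFA signature described above.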
