Regex

University of Oslo : Department of Informatics
Describing Languages
with Regular Expressions
Jonathon Read
25 September 2012 INF4820: Algorithms for AI and NLP

Outlook
How can we write programs that handle sentences?

- Describing languages with regular expressions
- Representing and implementing regular expressions using finite state automata
- Estimating the probability of unobserved strings of words with language models
- Sequence-labelling part-of-speech using Hidden Markov models
Productivity of languages
Even simple formal languages are infinite:
x = 1 + 2
x = 1 + 2 + 3
x = 1 + 2 + 3 + ...
With natural languages there are so many more choices:
The fox.
The hungry fox.
The hungry fox ate.
The hungry fox ate the chicken.
The hungry fox quickly ate the chicken.
The hungry brown fox quickly ate the delicious roast chicken and washed it down with a pint of beer.
Characterising language
Simplifying assumption
A language is a set of utterances
- utterances inside this set are well-formed
- utterances not in this set are ill-formed

How do we represent sets of utterances, if the set is infinite?

Regular expressions
Regular expressions (RE, RegEx, RegExp):

- Algebraic notation for characterising sets of strings
- They consist of constants and operators
Example
/[A-Z][a-z]* \d+[A-Z]?, \d{4} [A-Z][a-z]*/
Note: an implementation is supplied in many programming languages and text editors; for instance, try C-M-s in emacs, or grep on the command line.
Matching
Sequences of character constants specify how to match strings.

Further expressiveness is added by metacharacters, including:
. any single character (except newlines)
^ the start of a line
$ the end of a line
Example
- /^Chapter .$/ ⇒
{ Chapter 1, Chapter 2, . . . , Chapter & }
Note: When an operator or metacharacter, i.e. one of {}[]()^$.|*+?\, should be matched literally, it must be escaped using a backslash, e.g. match a full stop with /\./
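A minimal sketch of the wildcard, the anchors and escaping, assuming Python's built-in re module as the implementation (the slides do not prescribe one). The candidate strings are made up for illustration.

```python
import re

# "." matches exactly one character, so "Chapter 10" and "Chapter " are rejected.
chapter = re.compile(r"^Chapter .$")
candidates = ["Chapter 1", "Chapter &", "Chapter 10", "Chapter "]
one_char = [s for s in candidates if chapter.search(s)]
print(one_char)  # ['Chapter 1', 'Chapter &']

# Escaping the dot makes it match only a literal full stop.
literal_dot = re.compile(r"^Chapter \.$")
print(bool(literal_dot.search("Chapter .")))  # True
print(bool(literal_dot.search("Chapter 1")))  # False

# re.escape builds the escaped form of an arbitrary literal string.
escaped = re.escape("1+2")
print(bool(re.search(escaped, "x = 1+2")))  # True
```

Raw strings (the r prefix) keep Python from interpreting the backslash before the regex engine sees it.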
Disjunction
The | operator expresses a logical or
Example
- /^a (fox|wolf)$/ ⇒
{ a fox, a wolf }
Note: The | operator has low precedence; brackets ensure that it does not specify the set { a fox, wolf }
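The effect of the brackets can be checked directly. This is a small sketch in Python's re module (an assumed implementation); the test strings are illustrative only.

```python
import re

grouped = re.compile(r"^a (fox|wolf)$")
ungrouped = re.compile(r"^a fox|wolf$")  # parsed as (^a fox) | (wolf$)

results = {}
for text in ["a fox", "a wolf", "wolf", "a foxhound"]:
    results[text] = (bool(grouped.search(text)), bool(ungrouped.search(text)))
    print(f"{text!r}: grouped={results[text][0]}, ungrouped={results[text][1]}")
```

Without brackets the anchors bind to one alternative each, so the ungrouped pattern accepts anything that starts with "a fox" or ends with "wolf".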
Character classes
Character classes can also be used to specify disjunction—they

are expressed using square brackets, [ and ]:
Examples
- /^[Ff]ox$/ ⇒
{ Fox, fox }
- /^f[aio]x$/ ⇒
{ fax, fix, fox }
- /^[a-z]$/ ⇒
{ a, b, c, . . . , z }
- /^Chapter [1-9]$/ ⇒
{ Chapter 1, Chapter 2, . . . , Chapter 9 }
Character classes
Used as the first character inside a character class, ^ negates the class:
Example
- /[^A-Za-z]/ matches any non-alphabetic character
- /[^ ]/ matches anything that is not a space
Many implementations provide named character classes:
Examples
- /\d/ ⇒ /[[:digit:]]/ ⇒ /[0-9]/
- /\w/ ⇒ /[[:alnum:]_]/ ⇒ /[a-zA-Z0-9_]/
- /\D/ ⇒ /[^0-9]/
- /[[:punct:]]/ matches punctuation characters
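A short sketch of character classes, negation and the shorthand classes, assuming Python's re module. Two caveats apply to that choice: re does not support the POSIX bracket names such as [[:digit:]], and for text patterns \d and \w are Unicode-aware unless the re.ASCII flag is given. The input strings are made up.

```python
import re

# A plain character class is a disjunction over single characters.
words = [w for w in ["fax", "fix", "fox", "fux"] if re.fullmatch(r"f[aio]x", w)]
print(words)  # ['fax', 'fix', 'fox']

# A leading ^ negates the class.
non_alpha = re.findall(r"[^A-Za-z]", "Fox, 2 wolves!")
print(non_alpha)  # [',', ' ', '2', ' ', '!']

# Shorthand classes; re.ASCII restricts them to the ranges shown on the slide.
digits = re.findall(r"\d", "Chapter 42", flags=re.ASCII)
tokens = re.findall(r"\w+", "snake_case, 2 words", flags=re.ASCII)
non_digits = re.findall(r"\D+", "a1b22c", flags=re.ASCII)
print(digits)      # ['4', '2']
print(tokens)      # ['snake_case', '2', 'words']
print(non_digits)  # ['a', 'b', 'c']
```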
Quantification
Quantification can be specified in a number of ways:

? zero or one of the preceding element
* zero or more of the preceding element
+ one or more of the preceding element
{n} exactly n of the preceding element
{n,m} from n to m of the preceding element
{n,} n or more of the preceding element
{,m} at most m of the preceding element
Example
/^Chapter [1-9]\d*$/ ⇒
{ Chapter 1, Chapter 2, . . . , Chapter 99999, . . . }
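Each quantifier can be tried against a few candidates. The sketch below assumes Python's re module; the helper accepted and all candidate strings are hypothetical, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(accepted(r"colou?r", ["color", "colour", "colouur"]))  # ['color', 'colour']
print(accepted(r"ab*", ["a", "abbb", "b"]))                  # ['a', 'abbb']
print(accepted(r"ab+", ["a", "ab", "abbb"]))                 # ['ab', 'abbb']
print(accepted(r"\d{4}", ["373", "0373", "03730"]))          # ['0373']
print(accepted(r"a{2,3}", ["a", "aa", "aaa", "aaaa"]))       # ['aa', 'aaa']
print(accepted(r"a{2,}", ["a", "aa", "aaaa"]))               # ['aa', 'aaaa']
print(accepted(r"a{,2}", ["", "a", "aa", "aaa"]))            # ['', 'a', 'aa']

chapters = accepted(r"Chapter [1-9]\d*",
                    ["Chapter 1", "Chapter 99999", "Chapter 0"])
print(chapters)  # ['Chapter 1', 'Chapter 99999']
```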
Lazy quantification
How to match quoted items?
“Yes”, he said, “but why?”
Normal quantification operators are greedy: they will match the largest possible sequence in the input:
/“.+”/ ⇒
{ “Yes”, he said, “but why?” }
This can be overridden with ?, which becomes the lazy operator when used directly after a quantification operator:
/“.+?”/ ⇒
{ “Yes”, “but why?” }
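The difference between the greedy and the lazy form can be reproduced on the example sentence. This is a sketch assuming Python's re module and typographic quotation marks in the input.

```python
import re

text = "“Yes”, he said, “but why?”"

# Greedy: .+ runs to the last closing quote in the line.
greedy = re.findall(r"“.+”", text)
print(greedy)  # ['“Yes”, he said, “but why?”']

# Lazy: .+? stops at the first closing quote it can reach.
lazy = re.findall(r"“.+?”", text)
print(lazy)  # ['“Yes”', '“but why?”']
```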
Capturing groups
Brackets are used to specify matching groups, which (a) enforce precedence and (b) indicate groups for later reference, using an escaped number (\1 to \9).
Example
- <[bi]>.+?</[bi]> ⇒
{ “<b>” + . . . + “</b>”, “<b>” + . . . + “</i>”,
“<i>” + . . . + “</b>”, “<i>” + . . . + “</i>” }
- <([bi])>.+?</\1> ⇒
{ “<b>” + . . . + “</b>”, “<i>” + . . . + “</i>” }
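A sketch of the two tag patterns in Python's re module (an assumed implementation), showing that the back-reference \1 forces the closing tag to repeat whatever the group captured. The sample markup is invented for illustration.

```python
import re

loose = re.compile(r"<[bi]>.+?</[bi]>")    # any opening tag with any closing tag
strict = re.compile(r"<([bi])>.+?</\1>")   # closing tag must repeat group 1

samples = ["<b>bold</b>", "<i>italic</i>", "<b>mixed</i>"]
loose_ok = [s for s in samples if loose.fullmatch(s)]
strict_ok = [s for s in samples if strict.fullmatch(s)]
print(loose_ok)   # ['<b>bold</b>', '<i>italic</i>', '<b>mixed</i>']
print(strict_ok)  # ['<b>bold</b>', '<i>italic</i>']

# The captured text is available on the match object.
tag = strict.fullmatch("<i>italic</i>").group(1)
print(tag)  # i
```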
Putting it all together
What does this match?

/[A-Z][a-z]* \d+[A-Z]?, \d{4} [A-Z][a-z]*/

/[A-Z][a-z]* / a word with an initial capital, followed by a space
/\d+[A-Z]?/ one or more digits, optionally followed by a capital letter
/, / a comma and a space
/\d{4} / four digits, followed by a space
/[A-Z][a-z]*/ a word with an initial capital

Gaustadalleén 23B, 0373 Oslo
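The address pattern can be tried out as follows. This is a sketch assuming Python's re module, where the range [a-z] covers only the 26 unaccented letters, so the "é" in the street name falls outside it; tools with locale-aware ranges may behave differently. The street name "Storgata" and the widened class [^\W\d_] (any letter, in Python's Unicode mode) are illustrative additions, not part of the slides.

```python
import re

ADDRESS = re.compile(r"[A-Z][a-z]* \d+[A-Z]?, \d{4} [A-Z][a-z]*")

# An unaccented (hypothetical) street name fits the pattern as written.
plain = ADDRESS.search("Storgata 23B, 0373 Oslo")
print(plain.group(0))  # Storgata 23B, 0373 Oslo

# In Python, "é" is not in [a-z], so the first word cannot reach the space.
accented = ADDRESS.search("Gaustadalleén 23B, 0373 Oslo")
print(accented)  # None

# Widening the letter class to "word character that is not a digit or _".
LETTER = r"[^\W\d_]"
ADDRESS_UNICODE = re.compile(
    rf"[A-Z]{LETTER}* \d+[A-Z]?, \d{{4}} [A-Z]{LETTER}*"
)
widened = ADDRESS_UNICODE.search("Gaustadalleén 23B, 0373 Oslo")
print(widened.group(0))  # Gaustadalleén 23B, 0373 Oslo
```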

Some exercises
Write regular expressions for the following:

1. all alphabetic strings;
2. all lower case alphabetic strings ending in a b;
3. all strings of two repeated words;
4. all strings from the alphabet a,b such that each a is immediately preceded by and immediately followed by a b;
5. capturing the first word of an English sentence (making sure to deal with punctuation).
Some exercises

1. all alphabetic strings;
/[a-zA-Z]+/
2. all lower case alphabetic strings ending in a b;
/[a-z]*b/
3. all strings of two repeated words, separated by a space;
/([a-zA-Z]+) \1/
4. all strings from the alphabet a,b such that each a is immediately preceded by and immediately followed by a b;
/b+(ab+)+/
5. capturing the first word of an English sentence (making sure to deal with punctuation)
/^[^a-zA-Z]*([a-zA-Z]+)/
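The answers can be checked with a few hand-picked strings. The sketch below assumes Python's re module and uses re.fullmatch for answers 1 to 4, which is an assumption about intent since the patterns on the slide are not anchored. The names ANSWERS, accepts and first_word are hypothetical helpers.

```python
import re

ANSWERS = {
    1: r"[a-zA-Z]+",
    2: r"[a-z]*b",
    3: r"([a-zA-Z]+) \1",
    4: r"b+(ab+)+",
}

def accepts(number, text):
    """True if the whole text fits the answer to the given exercise."""
    return re.fullmatch(ANSWERS[number], text) is not None

print(accepts(1, "Fox"), accepts(1, "fox1"))        # True False
print(accepts(2, "crab"), accepts(2, "crabs"))      # True False
print(accepts(3, "the the"), accepts(3, "the fox")) # True False
print(accepts(4, "babbab"), accepts(4, "baab"))     # True False

# Exercise 5: skip leading non-letters, then capture the first run of letters.
FIRST_WORD = re.compile(r"^[^a-zA-Z]*([a-zA-Z]+)")

def first_word(sentence):
    match = FIRST_WORD.search(sentence)
    return match.group(1) if match else None

print(first_word('"Yes", he said.'))  # Yes
print(first_word("...and then?"))     # and
```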
Applications in AI and NLP
Weizenbaum 1966
User: Men are all alike.
Eliza: In what way?
User: They’re always bugging us about something or other.
Eliza: Can you think of a specific example?
User: Well, my boyfriend made me come here.
Eliza: Your boyfriend made you come here?
User: He says I am depressed much of the time.
Eliza: I am sorry to hear you are depressed.
Can be reproduced with a cascade of regular expression substitutions, e.g. using sed:
s/.* all .*/In what way/
s/.* always .*/Can you think of a specific example/
s/.* I am (depressed|sad) .*/I am sorry to hear you are \1/
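The same cascade might look as follows in Python. This is only a sketch of the three substitutions above, not Weizenbaum's program: the names RULES and respond are hypothetical, the first rule that fits the whole utterance wins, and the closing punctuation of each reply is added here to mirror the dialogue.

```python
import re

# (pattern, reply template) pairs, tried in order like the sed script.
RULES = [
    (re.compile(r".* all .*"), "In what way?"),
    (re.compile(r".* always .*"), "Can you think of a specific example?"),
    (re.compile(r".* I am (depressed|sad) .*"), r"I am sorry to hear you are \1."),
]

def respond(utterance):
    """Return the reply of the first rule that matches, or None."""
    for pattern, template in RULES:
        match = pattern.fullmatch(utterance)
        if match:
            # expand() fills \1 in the template from the captured group.
            return match.expand(template)
    return None

print(respond("Men are all alike."))
print(respond("They’re always bugging us about something or other."))
print(respond("He says I am depressed much of the time."))
```

Utterances that no rule covers, such as "Well, my boyfriend made me come here.", return None in this sketch.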
Lexical morphology
s/mouse/mice/
s/(bush|fox|house)/\1es/
s/(.)/\1s/
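A hedged sketch of how such rules could be applied to single nouns in Python. It makes two assumptions about intent that the slide does not state: each pattern is anchored to the end of the word with $, and the rules are ordered alternatives, so only the first applicable one is used. The names RULES and pluralise are hypothetical, and the word list is illustrative.

```python
import re

# Ordered from most specific to most general; the last rule is the default.
RULES = [
    (r"mouse$", "mice"),
    (r"(bush|fox)$", r"\1es"),
    (r"(.)$", r"\1s"),
]

def pluralise(noun):
    """Apply the first rule whose pattern occurs in the noun."""
    for pattern, replacement in RULES:
        plural, count = re.subn(pattern, replacement, noun)
        if count:
            return plural
    return noun

plurals = [pluralise(n) for n in ["mouse", "fox", "bush", "cat"]]
print(plurals)  # ['mice', 'foxes', 'bushes', 'cats']
```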
Concise expressions of genetic sequences:
- finding codons, e.g.
s/CG.|AG[AG]/arginine/
- specifying patterns, e.g.
/(CG.|AG[AG]).{,100}GG./
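A small sketch of both uses in Python's re module, on made-up sequences. Because | has low precedence, the codon alternation needs brackets if the .{,100}GG. tail is meant to follow either codon; the sketch compares both readings. Note that a plain regular expression search does not respect the reading frame, so matches may start at any position.

```python
import re

ARGININE = re.compile(r"CG.|AG[AG]")
sequence = "ATGCGTAGAGGC"  # invented sequence, for illustration only

codons = ARGININE.findall(sequence)
labelled = ARGININE.sub("[arginine]", sequence)
print(codons)    # ['CGT', 'AGA']
print(labelled)  # ATG[arginine][arginine]GGC

# Without brackets the tail attaches only to the second alternative.
ungrouped = re.compile(r"CG.|AG[AG].{,100}GG.")
grouped = re.compile(r"(?:CG.|AG[AG]).{,100}GG.")

short = ungrouped.search("CGTAAAGGC").group(0)
full = grouped.search("CGTAAAGGC").group(0)
print(short)  # CGT
print(full)   # CGTAAAGGC
```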
Summary
Regular expressions
- A finite way of specifying infinite sets
- Character constants, metacharacters and operators
- The fundamental operations are:
  - Matching characters, wildcards (.) and anchors (^ and $)
  - Disjunction (| and [ ])
  - Quantification (?, *, + and {n,m})
- Precedence can be enforced with brackets (( and ))
- More complex operations include capturing groups
Next week:
- Finite state automata
- Searching state spaces
