
University of Oslo : Department of Informatics

Describing Languages
with Regular Expressions

Jonathon Read

25 September 2012 INF4820: Algorithms for AI and NLP


Outlook

How can we write programs that handle sentences?

- Describing languages with regular expressions
- Representing and implementing regular expressions
  using finite state automata
- Estimating the probability of unobserved strings of words
  with language models
- Part-of-speech sequence labelling
  using hidden Markov models

Productivity of languages

Even simple formal languages are infinite:

x = 1 + 2
x = 1 + 2 + 3
x = 1 + 2 + 3 + . . .
With natural languages there are so many more choices:
The fox.
The hungry fox.
The hungry fox ate.
The hungry fox ate the chicken.
The hungry fox quickly ate the chicken.
The hungry brown fox quickly ate the delicious roast
chicken and washed it down with a pint of beer.
Characterising language

Simplifying assumption
A language is a set of utterances
- utterances inside this set are well-formed
- utterances not in this set are ill-formed

How do we represent sets of utterances, if the set is infinite?


Regular expressions

Regular expressions (RE, RegEx, RegExp):

- Algebraic notation for characterising sets of strings
- They consist of constants and operators

Example
/[A-Z][a-z]* \d+[A-Z]?, \d{4} [A-Z][a-z]*/

Note: an implementation is supplied in many programming
languages and text editors; for instance, try C-M-s in emacs,
or grep on the command line.
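
As one such implementation, Python ships a `re` module. The sketch below is an illustration rather than part of the original slides: a single finite pattern recognises the unbounded family of sums used as the formal-language example. The names `SUM` and `is_sum` are made up here, and the pattern assumes single spaces around `=` and `+`.

```python
import re

# One finite pattern for the infinite language
# "x = 1 + 2", "x = 1 + 2 + 3", "x = 1 + 2 + 3 + 4", ...
SUM = re.compile(r"x = \d+( \+ \d+)+")  # hypothetical name


def is_sum(s: str) -> bool:
    """Return True if the whole string belongs to the language."""
    return SUM.fullmatch(s) is not None


print(is_sum("x = 1 + 2"))          # True
print(is_sum("x = 1 + 2 + 3 + 4"))  # True
print(is_sum("x = 1 +"))            # False
```

The + after the bracketed group is what makes the set infinite: the group may repeat any number of times.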
Matching

Sequences of character constants specify how to match strings.

Further expressiveness is added by metacharacters, including:
.   any single character (except newlines)
^   the start of a line
$   the end of a line

Example
- /^Chapter .$/ ⇒
  { Chapter 1, Chapter 2, . . . , Chapter & }

Note: When an operator or metacharacter, i.e. one of
{}[]()^$.|*+?\, should be matched literally, it must be escaped
using a backslash, e.g. match a full stop with /\./
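
A minimal sketch of the wildcard, the anchors and escaping, assuming Python's `re` module; the test strings are made up for illustration.

```python
import re

chapter = re.compile(r"^Chapter .$")

print(bool(chapter.search("Chapter 1")))   # True
print(bool(chapter.search("Chapter &")))   # True: . matches any single character
print(bool(chapter.search("Chapter 10")))  # False: only one character fits before $

# An unescaped . is a wildcard; \. matches only a literal full stop.
print(bool(re.search(r"^3.14$", "3x14")))   # True
print(bool(re.search(r"^3\.14$", "3x14")))  # False
print(bool(re.search(r"^3\.14$", "3.14")))  # True

# re.escape adds the backslashes for us when a whole string is meant literally.
literal = re.escape("1+1=2?")
print(bool(re.search(literal, "so 1+1=2? yes")))  # True
```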
Disjunction

The | operator expresses a logical or

Example
- /^a (fox|wolf)$/ ⇒
  { a fox, a wolf }

Note: The | operator has low precedence; brackets ensure that
it does not specify the set { a fox, wolf }
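
The effect of the brackets can be checked directly. This is a sketch assuming Python's `re` module and made-up test strings; without brackets the pattern reads as "/^a fox/ or /wolf$/", so with a substring search it also accepts anything that merely starts with "a fox" or ends in "wolf".

```python
import re

grouped = re.compile(r"^a (fox|wolf)$")
ungrouped = re.compile(r"^a fox|wolf$")  # parsed as (^a fox) | (wolf$)

tests = ["a fox", "a wolf", "wolf", "a fox terrier"]
grouped_hits = [s for s in tests if grouped.search(s)]
ungrouped_hits = [s for s in tests if ungrouped.search(s)]

print(grouped_hits)    # ['a fox', 'a wolf']
print(ungrouped_hits)  # ['a fox', 'a wolf', 'wolf', 'a fox terrier']
```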
Character classes

Character classes can also be used to specify disjunction; they
are expressed using square brackets, [ and ]:

Examples
- /^[Ff]ox$/ ⇒
  { Fox, fox }
- /^f[aio]x$/ ⇒
  { fax, fix, fox }
- /^[a-z]$/ ⇒
  { a, b, c, . . . , z }
- /^Chapter [1-9]$/ ⇒
  { Chapter 1, Chapter 2, . . . , Chapter 9 }
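
The same examples as a quick check, assuming Python's `re` module. The helper `matches` and the candidate strings are made up for illustration.

```python
import re


def matches(pattern, candidates):
    """Return the candidates in which the pattern is found."""
    return [c for c in candidates if re.search(pattern, c)]


print(matches(r"^[Ff]ox$", ["Fox", "fox", "box"]))          # ['Fox', 'fox']
print(matches(r"^f[aio]x$", ["fax", "fix", "fox", "fux"]))  # ['fax', 'fix', 'fox']
print(matches(r"^[a-z]$", ["a", "z", "A", "ab"]))           # ['a', 'z']
print(matches(r"^Chapter [1-9]$",
              ["Chapter 0", "Chapter 7", "Chapter 10"]))    # ['Chapter 7']
```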
Character classes

Used inside a character class, ^ negates the class:

Examples
- /[^A-Za-z]/ matches any non-alphabetic character
- /[^ ]/ matches anything that is not a space

Many implementations provide named character classes:

Examples
- /\d/ ⇒ /[[:digit:]]/ ⇒ /[0-9]/
- /\w/ ⇒ /[[:alnum:]_]/ ⇒ /[a-zA-Z0-9_]/
- /\D/ ⇒ /[^0-9]/
- /[[:punct:]]/ matches punctuation characters
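
A short sketch of negation and the backslash shorthands, assuming Python's `re` module. Python's `re` does not implement the POSIX bracket names such as [[:digit:]], so only the shorthands are shown; the sample text is made up.

```python
import re

text = "Room 4B, floor_2!"

non_alpha = re.findall(r"[^A-Za-z]", text)
digits = re.findall(r"\d", text)
# With re.ASCII, \w is exactly [a-zA-Z0-9_]; without the flag it also
# covers letters and digits from other scripts.
words = re.findall(r"\w+", text, flags=re.ASCII)

print(non_alpha)  # [' ', '4', ',', ' ', '_', '2', '!']
print(digits)     # ['4', '2']
print(words)      # ['Room', '4B', 'floor_2']
```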
Quantification

Quantification can be specified in a number of ways:

?       zero or one of the preceding element
*       zero or more of the preceding element
+       one or more of the preceding element
{n}     exactly n of the preceding element
{n,m}   from n to m of the preceding element
{n,}    n or more of the preceding element
{,m}    at most m of the preceding element

Example
/^Chapter [1-9]\d*$/ ⇒
{ Chapter 1, Chapter 2, . . . , Chapter 99999, . . . }
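
The quantifiers can be tried out as follows. This is a sketch assuming Python's `re` module, which accepts the {,m} form; some other implementations do not. The test strings are made up.

```python
import re

chapter = re.compile(r"^Chapter [1-9]\d*$")
candidates = ["Chapter 1", "Chapter 99999", "Chapter 0", "Chapter 07"]
numbered = [s for s in candidates if chapter.search(s)]
print(numbered)  # ['Chapter 1', 'Chapter 99999']

print(bool(re.fullmatch(r"colou?r", "color")))  # True: ? allows zero u
print(bool(re.fullmatch(r"a{2,3}", "aaa")))     # True
print(bool(re.fullmatch(r"a{2,3}", "aaaa")))    # False: more than m
print(bool(re.fullmatch(r"a{,2}", "")))         # True: {,m} includes zero
print(bool(re.fullmatch(r"a{,2}", "aa")))       # True: and m itself
```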
Lazy quantification

How to match quoted items?

“Yes”, he said, “but why?”

Normal quantification operators are greedy; they will
match the largest possible sequence in the input:
/“.+”/ ⇒
{ “Yes”, he said, “but why?” }

This can be overridden with ?, which becomes the lazy
operator when used next to a quantification operator:
/“.+?”/ ⇒
{ “Yes”, “but why?” }
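
The difference shows up directly with `re.findall`, assuming Python's `re` module and the quoted sentence above written with curly quotation marks.

```python
import re

text = "“Yes”, he said, “but why?”"

greedy = re.findall(r"“.+”", text)
lazy = re.findall(r"“.+?”", text)

print(greedy)  # ['“Yes”, he said, “but why?”']
print(lazy)    # ['“Yes”', '“but why?”']
```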
Capturing groups

Brackets are used to specify matching groups, which (a) enforce
precedence and (b) indicate groups for later reference, using an
escaped number (\1 to \9).

Example
- <[bi]>.+?</[bi]> ⇒
  { “<b>” + . . . + “</b>”, “<i>” + . . . + “</i>”,
  “<b>” + . . . + “</i>”, “<i>” + . . . + “</b>” }
- <([bi])>.+?</\1> ⇒
  { “<b>” + . . . + “</b>”, “<i>” + . . . + “</i>” }
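
A sketch of the backreference at work, assuming Python's `re` module; the tagged strings are made up for illustration.

```python
import re

loose = re.compile(r"<[bi]>.+?</[bi]>")   # closing tag chosen independently
strict = re.compile(r"<([bi])>.+?</\1>")  # \1 repeats whatever group 1 matched

tests = ["<b>bold</b>", "<i>italic</i>", "<b>mixed</i>"]
loose_hits = [s for s in tests if loose.fullmatch(s)]
strict_hits = [s for s in tests if strict.fullmatch(s)]

print(loose_hits)   # ['<b>bold</b>', '<i>italic</i>', '<b>mixed</i>']
print(strict_hits)  # ['<b>bold</b>', '<i>italic</i>']
```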
Putting it all together

What does this match?

/[A-Z][a-z]* \d+[A-Z]?, \d{4} [A-Z][a-z]*/

/[A-Z][a-z]* /   a word with an initial capital, followed by a space
/\d+[A-Z]?/      one or more digits, optionally followed by a capital letter
/, /             a comma and a space
/\d{4} /         four digits, followed by a space
/[A-Z][a-z]*/    a word with an initial capital

Gaustadalleén 23B, 0373 Oslo
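
The pieces can be assembled and tried out as below, assuming Python's `re` module; the name `ADDRESS` is made up. One caveat worth knowing: [a-z] is an ASCII range, so it does not cover an accented letter such as é. The sketch therefore also tries an unaccented spelling of the street name, which is an assumption made purely for illustration.

```python
import re

ADDRESS = re.compile(
    r"[A-Z][a-z]* "  # a word with an initial capital, then a space
    r"\d+[A-Z]?"     # one or more digits, optionally a capital letter
    r", "            # a comma and a space
    r"\d{4} "        # four digits, then a space
    r"[A-Z][a-z]*"   # a word with an initial capital
)

plain = "Gaustadalleen 23B, 0373 Oslo"     # unaccented spelling (assumption)
accented = "Gaustadalleén 23B, 0373 Oslo"  # spelling as on the slide

print(bool(ADDRESS.search(plain)))     # True
print(bool(ADDRESS.search(accented)))  # False: é is outside [a-z]
```

Widening the class, for example to [a-zé], is one way to admit such names.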


Some exercises

Write regular expressions for the following:


1. all alphabetic strings;
2. all lower case alphabetic strings ending in a b;
3. all strings of two repeated words;
4. all strings from the alphabet a,b such that each a is immediately
   preceded by and immediately followed by a b;
5. capturing the first word of an English sentence (making
   sure to deal with punctuation).
Some exercises

1. all alphabetic strings;
   /[a-zA-Z]+/
2. all lower case alphabetic strings ending in a b;
   /[a-z]*b/
3. all strings of two repeated words, separated by a space;
   /([a-zA-Z]+) \1/
4. all strings from the alphabet a,b such that each a is immediately
   preceded by and immediately followed by a b;
   /b+(ab+)+/
5. capturing the first word of an English sentence (making
   sure to deal with punctuation).
   /^[^a-zA-Z]*([a-zA-Z]+)/
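
The answers can be checked against a few strings, assuming Python's `re` module. The dictionary `answers` and the test strings are made up for this sketch; `fullmatch` is used for answers 1 to 4 so that the whole string has to fit the pattern.

```python
import re

answers = {
    1: r"[a-zA-Z]+",
    2: r"[a-z]*b",
    3: r"([a-zA-Z]+) \1",
    4: r"b+(ab+)+",
    5: r"^[^a-zA-Z]*([a-zA-Z]+)",
}

print(bool(re.fullmatch(answers[1], "Regex")))    # True
print(bool(re.fullmatch(answers[2], "crab")))     # True
print(bool(re.fullmatch(answers[3], "the the")))  # True
print(bool(re.fullmatch(answers[3], "the fox")))  # False
print(bool(re.fullmatch(answers[4], "babbab")))   # True
print(bool(re.fullmatch(answers[4], "baab")))     # False: a followed by a
print(bool(re.fullmatch(answers[4], "bbb")))      # False: needs at least one a

first = re.search(answers[5], "“Yes”, he said.").group(1)
print(first)  # Yes
```

As written, answer 4 requires at least one a, so a string made only of b is rejected.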
Applications in AI and NLP

Weizenbaum 1966
User: Men are all alike.
Eliza: In what way?
User: They’re always bugging us about something or other.
Eliza: Can you think of a specific example?
User: Well, my boyfriend made me come here.
Eliza: Your boyfriend made you come here?
User: He says I am depressed much of the time.
Eliza: I am sorry to hear you are depressed.

Can be reproduced with a cascade of regular expression
substitutions, e.g. using sed -E (extended syntax, needed for | and unescaped brackets):
s/.* all .*/In what way/
s/.* always .*/Can you think of a specific example/
s/.* I am (depressed|sad) .*/I am sorry to hear you are \1/
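
The same cascade can be sketched in Python with `re.sub`, applying each substitution in turn as sed would. The names `RULES` and `respond` are made up for this illustration, and only the three rules above are included, so any other input passes through unchanged.

```python
import re

# (pattern, replacement) pairs, applied in order like a sed script
RULES = [
    (r".* all .*", "In what way"),
    (r".* always .*", "Can you think of a specific example"),
    (r".* I am (depressed|sad) .*", r"I am sorry to hear you are \1"),
]


def respond(utterance: str) -> str:
    """Run the utterance through every substitution in sequence."""
    for pattern, replacement in RULES:
        utterance = re.sub(pattern, replacement, utterance)
    return utterance


print(respond("Men are all alike."))
# In what way
print(respond("They’re always bugging us about something or other."))
# Can you think of a specific example
print(respond("He says I am depressed much of the time."))
# I am sorry to hear you are depressed
```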
Applications in AI and NLP

Lexical morphology (rules tried in order; only the first match applies)
s/mouse/mice/
s/(bush|fox)$/\1es/
s/(.)$/\1s/
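
A sketch of these pluralisation rules in Python. It makes several assumptions beyond the slide: the rules are read as an ordered list in which only the first matching rule fires, the patterns are anchored to the word boundaries, and house is left to the default rule because it takes a plain -s. The names `RULES` and `pluralise` are made up.

```python
import re

# Ordered from most specific to most general; first match wins (assumption).
RULES = [
    (r"^mouse$", "mice"),       # irregular form
    (r"(bush|fox)$", r"\1es"),  # sibilant endings take -es
    (r"(.)$", r"\1s"),          # default: append -s to the last character
]


def pluralise(noun: str) -> str:
    """Apply the first rule whose pattern matches the noun."""
    for pattern, replacement in RULES:
        result, count = re.subn(pattern, replacement, noun)
        if count:
            return result
    return noun


for noun in ["mouse", "fox", "bush", "house", "chicken"]:
    print(noun, "->", pluralise(noun))
# mouse -> mice
# fox -> foxes
# bush -> bushes
# house -> houses
# chicken -> chickens
```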

Concise expressions of
genetic sequences:
- finding codons, e.g.
  s/CG.|AG[AG]/arginine/
- specifying patterns, e.g.
  /(CG.|AG[AG]).{,100}GG./
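
A minimal sketch of both uses in Python. The sequences are made up, the reading frame is assumed to start at the first base, and the brackets around the alternation in `MOTIF` are an assumption about the intended grouping (without them, | would split the whole pattern in two).

```python
import re

ARGININE = re.compile(r"CG.|AG[AG]")

seq = "ATGCGTAAAAGGTGA"  # made-up sequence, read in frame from index 0
codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
hits = [c for c in codons if ARGININE.fullmatch(c)]

print(codons)  # ['ATG', 'CGT', 'AAA', 'AGG', 'TGA']
print(hits)    # ['CGT', 'AGG']

# An arginine codon, then up to 100 arbitrary bases, then GG plus any base.
MOTIF = re.compile(r"(CG.|AG[AG]).{,100}GG.")
print(bool(MOTIF.search("CGTAAAGGC")))  # True
print(bool(MOTIF.search("CGTAAATTT")))  # False
```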
Summary

Regular expressions
- A finite way of specifying infinite sets
- Character constants, metacharacters and operators
- The fundamental operations are:
  - Matching characters, wildcards (.) and anchors (^ and $)
  - Disjunction (| and [ ])
  - Quantification (?, *, + and {n,m})
  - Precedence can be enforced with brackets (( and ))
- More complex operations include capturing groups

Next week:
- Finite state automata
- Searching state spaces
