Regex
Describing Languages
with Regular Expressions
Jonathon Read
Productivity of languages
x = 1 + 2
x = 1 + 2 + 3
x = 1 + 2 + 3 + ...
With natural languages there are so many more choices:
The fox.
The hungry fox.
The hungry fox ate.
The hungry fox ate the chicken.
The hungry fox quickly ate the chicken.
The hungry brown fox quickly ate the delicious roast chicken and washed it down with a pint of beer.
Characterising language
Simplifying assumption
A language is a set of utterances
• Utterances inside this set are well-formed
Example
/[A-Z][a-z]* \d+[A-Z]?, \d{4} [A-Z][a-z]*/
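The pattern above can be tried directly with Python's `re` module. This is a minimal sketch: the sample strings are invented, and reading the pattern as a street-address shape (name, number with optional letter, four-digit code, name) is an assumption, not something the slide states.

```python
import re

# The pattern from the slide, without the surrounding / delimiters.
ADDRESS_LIKE = re.compile(r"[A-Z][a-z]* \d+[A-Z]?, \d{4} [A-Z][a-z]*")

# Hypothetical sample strings, invented for this sketch.
samples = [
    "Storgata 12B, 0155 Oslo",  # capitalised word, number + letter, 4 digits, word
    "Storgata 12, 0155 Oslo",   # the letter after the number is optional
    "storgata 12, 0155 Oslo",   # lowercase first letter
    "Storgata 12, 155 Oslo",    # only three digits where four are required
]

def in_language(text: str) -> bool:
    """True if the whole string is described by the pattern."""
    return ADDRESS_LIKE.fullmatch(text) is not None

for s in samples:
    print(f"{s!r}: {in_language(s)}")
```

Using `fullmatch` treats the pattern as a description of a set of strings: a string is either in the set or not.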
Example
• /^Chapter .$/ ⇒
{ Chapter 1, Chapter 2, ..., Chapter & }
Example
• /^a (fox|wolf)$/ ⇒
{ a fox, a wolf }
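A short sketch of the wildcard and disjunction examples in Python. The helper name `matches` and the negative test strings are invented for illustration.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """True if the pattern matches somewhere in text (the anchors restrict where)."""
    return re.search(pattern, text) is not None

# The wildcard . stands for any single character (except a newline by default).
print(matches(r"^Chapter .$", "Chapter 1"))   # True
print(matches(r"^Chapter .$", "Chapter &"))   # True
print(matches(r"^Chapter .$", "Chapter 10"))  # False: two characters after the space

# Disjunction: the group offers exactly two alternatives.
print(matches(r"^a (fox|wolf)$", "a wolf"))   # True
print(matches(r"^a (fox|wolf)$", "a dog"))    # False
```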
Examples
• /^[Ff]ox$/ ⇒
{ Fox, fox }
• /^f[aio]x$/ ⇒
{ fax, fix, fox }
• /^[a-z]$/ ⇒
{ a, b, c, ..., z }
• /^Chapter [1-9]$/ ⇒
{ Chapter 1, Chapter 2, ..., Chapter 9 }
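For finite cases like these, the set a pattern describes can be enumerated by brute force. The sketch below is only an illustration: the function name `language` and the choice of alphabet are assumptions, and `fullmatch` plays the role of the ^ and $ anchors.

```python
import re
from itertools import product
from string import ascii_lowercase

def language(pattern: str, alphabet: str, length: int) -> list[str]:
    """All strings of the given length over alphabet that the pattern matches in full."""
    compiled = re.compile(pattern)
    candidates = ("".join(chars) for chars in product(alphabet, repeat=length))
    return [s for s in candidates if compiled.fullmatch(s)]

print(language(r"f[aio]x", ascii_lowercase, 3))      # ['fax', 'fix', 'fox']
print(language(r"[Ff]ox", "Ffox", 3))                # ['Fox', 'fox']
print(len(language(r"[a-z]", ascii_lowercase, 1)))   # 26
```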
Character classes
Example
• /[^A-Za-z]/ matches any non-alphabetic character
• /[^ ]/ matches anything that is not a space
Examples
• /\d/ ⇒ /[[:digit:]]/ ⇒ /[0-9]/
• /\w/ ⇒ /[[:alnum:]_]/ ⇒ /[a-zA-Z0-9_]/
• /\D/ ⇒ /[^0-9]/
• /[[:punct:]]/ matches punctuation characters
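The shorthand classes can be compared with their bracket equivalents in Python. Two assumptions in this sketch: the sample text is invented, and since Python's `re` module does not provide POSIX classes such as `[[:punct:]]`, a class built from `string.punctuation` is swapped in for it. The `re.ASCII` flag restricts `\d` and `\w` to the ASCII ranges listed above.

```python
import re
import string

text = "Room_42, floor 3!"  # hypothetical sample text

# \d and [0-9]
digits_short = re.findall(r"\d", text, flags=re.ASCII)
digits_class = re.findall(r"[0-9]", text)
print(digits_short, digits_class)   # ['4', '2', '3'] twice

# \w covers letters, digits and the underscore.
words_short = re.findall(r"\w+", text, flags=re.ASCII)
words_class = re.findall(r"[a-zA-Z0-9_]+", text)
print(words_short, words_class)     # ['Room_42', 'floor', '3'] twice

# \D is the complement of \d; [^ ] is anything that is not a space.
non_digits = re.findall(r"\D+", text, flags=re.ASCII)
non_spaces = re.findall(r"[^ ]+", text)
print(non_digits)                   # ['Room_', ', floor ', '!']
print(non_spaces)                   # ['Room_42,', 'floor', '3!']

# Stand-in for [[:punct:]]: a class built from ASCII punctuation.
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")
punct = PUNCT.findall(text)
print(punct)                        # ['_', ',', '!']
```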
Quantification
Example
/^Chapter [1-9]\d*$/ ⇒
{ Chapter 1, Chapter 2, ..., Chapter 99999, ... }
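A small sketch of how the star turns a finite pattern into a description of an infinite set. The test strings are invented; the point is that any number of digits may follow the first non-zero digit, while a leading zero is excluded.

```python
import re

CHAPTER = re.compile(r"^Chapter [1-9]\d*$")

def is_chapter_heading(text: str) -> bool:
    """True if text is 'Chapter ' followed by a number without a leading zero."""
    return CHAPTER.search(text) is not None

for s in ["Chapter 1", "Chapter 42", "Chapter 99999", "Chapter 0", "Chapter 07", "Chapter"]:
    print(f"{s!r}: {is_chapter_heading(s)}")
```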
Lazy quantification
Example
• <[bi]>.+?</[bi]> ⇒
{ “<b>” + ... + “</b>”, “<i>” + ... + “</i>”,
“<b>” + ... + “</i>”, “<i>” + ... + “</b>” }
• <([bi])>.+?</\1> ⇒
{ “<b>” + ... + “</b>”, “<i>” + ... + “</i>” }
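The difference between greedy and lazy quantifiers, and the effect of the backreference, can be seen on a short invented string. This is a sketch only; the sample markup is hypothetical.

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # hypothetical sample

# Greedy .+ runs on to the last closing tag it can reach.
greedy = re.findall(r"<[bi]>.+</[bi]>", html)
print(greedy)   # ['<b>bold</b> and <i>italic</i>']

# Lazy .+? stops at the first closing tag it can reach.
lazy = re.findall(r"<[bi]>.+?</[bi]>", html)
print(lazy)     # ['<b>bold</b>', '<i>italic</i>']

# Without a backreference, mismatched tags are accepted.
mismatched = "<b>oops</i>"
loose = re.fullmatch(r"<[bi]>.+?</[bi]>", mismatched) is not None
# \1 must repeat whatever group 1 captured, so the tags have to agree.
strict = re.fullmatch(r"<([bi])>.+?</\1>", mismatched) is not None
print(loose, strict)   # True False
```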
Putting it all together
Weizenbaum 1966
User: Men are all alike.
Eliza: In what way?
User: They’re always bugging us about something or other.
Eliza: Can you think of a specific example?
User: Well, my boyfriend made me come here.
Eliza: Your boyfriend made you come here?
User: He says I am depressed much of the time.
Eliza: I am sorry to hear you are depressed.
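Responses like these can be produced with a small cascade of pattern and template pairs. The rules below are hypothetical, written for this sketch to reproduce three lines of the dialogue; they are not Weizenbaum's original script.

```python
import re

# Hypothetical rules, invented for this sketch. Each pairs a pattern with a
# response template; \1 in a template reuses the text captured by group 1.
RULES = [
    (re.compile(r".*\bI am (depressed|sad)\b.*", re.IGNORECASE),
     r"I am sorry to hear you are \1."),
    (re.compile(r".*\ball\b.*", re.IGNORECASE),
     "In what way?"),
    (re.compile(r".*\balways\b.*", re.IGNORECASE),
     "Can you think of a specific example?"),
]

def respond(utterance: str) -> str:
    """Return the response of the first rule whose pattern matches."""
    for pattern, template in RULES:
        match = pattern.fullmatch(utterance)
        if match:
            return match.expand(template)
    return "Please go on."  # fallback, also an assumption of this sketch

print(respond("Men are all alike."))
print(respond("He says I am depressed much of the time."))
```

The order of the rules matters: more specific patterns come first so that a general keyword does not pre-empt them.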
Applications in AI and NLP
Lexical morphology
s/mouse/mice/
s/(bush|fox)$/\1es/
s/(.)$/\1s/
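A sketch of how substitution rules of this kind could be applied as an ordered cascade, where the first matching rule wins. Several choices here are assumptions: every rule is anchored to the end of the word, and house is left to the default rule, since appending es to it would give "housees".

```python
import re

# Ordered rules: irregular form first, then the -es class, then the default -s.
PLURAL_RULES = [
    (r"mouse$", "mice"),
    (r"(bush|fox)$", r"\1es"),
    (r"(.)$", r"\1s"),
]

def pluralise(noun: str) -> str:
    """Apply the first rule whose pattern matches the noun."""
    for pattern, replacement in PLURAL_RULES:
        if re.search(pattern, noun):
            return re.sub(pattern, replacement, noun, count=1)
    return noun  # only reached for the empty string

for noun in ["mouse", "fox", "bush", "house", "chicken"]:
    print(noun, "->", pluralise(noun))
```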
Concise expressions of genetic sequences:
• finding codons, e.g.
s/CG.|AG[AG]/arginine/
• specifying patterns, e.g.
/(CG.|AG[AG]).{0,100}GG./
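A sketch of both uses in Python. The DNA strings are invented, and the grouping parentheses in the second pattern are an assumption about the intended reading: without them, the alternation splits the whole expression into "CG." on one side and everything else on the other.

```python
import re

# Arginine codons: CG followed by any base, or AG followed by A or G.
ARGININE = re.compile(r"CG.|AG[AG]")

dna = "ATGCGTAGATAA"  # hypothetical sequence
codons = ARGININE.findall(dna)
print(codons)                       # ['CGT', 'AGA']
labelled = ARGININE.sub("[arginine]", dna)
print(labelled)                     # ATG[arginine][arginine]TAA

# An arginine codon, then up to 100 characters, then GG and any base.
GROUPED = re.compile(r"(?:CG.|AG[AG]).{0,100}GG.")
UNGROUPED = re.compile(r"CG.|AG[AG].{0,100}GG.")

seq = "CGTAAAGGC"  # hypothetical sequence
grouped_hit = GROUPED.search(seq).group()
ungrouped_hit = UNGROUPED.search(seq).group()
print(grouped_hit)     # CGTAAAGGC
print(ungrouped_hit)   # CGT (the first alternative alone already matches)
```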
Summary
Regular expressions
• A finite way of specifying infinite sets
• Character constants, metacharacters and operators
• The fundamental operations are:
  • Matching characters, wildcards (.) and anchors (^ and $)
  • Disjunction (| and [ ])
  • Quantification (?, *, + and {n,m})
  • Precedence can be enforced with brackets (( and ))
• More complex operations include capturing groups
Next week:
• Finite state automata
• Searching state spaces