Lecture 2 3

Adv.
Natural Language
Processing
Lecture 2 & 3
Instructor: Dr. Muhammad Asfand-e-yar
MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

Previous Lecture
• What is NLP?
• Examples of NLP
• What is Special about Languages?
• Why it is hard for Computer to understand Human Language?
• Where NLP lies?
• Applications of NLP

Today’s Lecture
• Regular Expressions
• Text Normalization
• Edit Distance

Basic Text Processing
Regular Expressions

ELIZA Application
User: You are like my father in some ways.
ELIZA: WHAT RESEMBLANCE DO YOU SEE
User: You are not very aggressive but I think you don’t want me to notice that.
ELIZA: WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE
User: You don’t argue with me.
ELIZA: WHY DO YOU THINK I DON’T ARGUE WITH YOU
User: You are afraid of me.
ELIZA: DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU
The dialogue above is from ELIZA, an early natural language processing

system that could carry on a limited conversation with a user by imitating the
responses of a Rogerian psychotherapist

ELIZA Application
ELIZA is a surprisingly simple program that uses pattern matching to
recognize phrases like
“You are X”
and translate them into suitable outputs like
“What makes you think I am X?”.
This simple technique succeeds in this domain because ELIZA doesn’t
actually need to know anything to mimic a Rogerian psychotherapist.
Eliza’s mimicry of human conversation was remarkably successful: many

people who interacted with ELIZA came to believe that it really understood
them and their problems, many continued to believe in ELIZA’s abilities even
after the program’s operation was explained. Even today such chatbots are
a fun diversion.
Regular Expression
Important tool for describing text patterns: the regular expression (RE)
RE is used to specify strings that might be extracted from a document,

For example; “You are X” in Eliza above, to defining strings like $199 or
$24.99 for extracting tables of prices from a document.
RE is used to turn to a set of tasks collectively called text normalization, in

which plays an important part.
Normalizing text means converting text to a more convenient, standard form.

For example, most of what we are going to do with language relies on
first separating out or tokenizing words from running text, the task of
tokenization.
Regular Expression
Regular Expression (RE) is a standardized text search strings in
computer science language for specifying text.
RE is used in every computer language, word processor, and text

processing tools like the Unix tools grep or Emacs.
RE is an algebraic notation for characterizing a set of strings.
RE is particularly useful for searching in texts, when we have a

pattern to search for and/or a corpus of texts to search through.
Regular Expression
The simplest kind of RE is a sequence of simple characters.
To search for woodchuck, we type /woodchuck/.

OR the expression /Buttercup/ matches any string containing
the substring Buttercup; grep with that expression would
return the line “I’m called little Buttercup”.
The search string can consist of a single character (like /!/) or a

sequence of characters (like /urgl/).

Regular Expression
A formal language for specifying text strings
How can we search for any of these?
• woodchuck
• woodchucks
• Woodchuck
• Woodchucks

Regular Expression
Regular Expressions are Case Sensitive;
RE Example Patterns Matched

/woodchucks/ “interesting links to woodchucks and lemurs”
/a/ “Mary Ann stopped by Mona’s”
/!/ “You’ve left the burglar behind again!” said Nori
Some Simple regular expression search

Regular Expression
RE Match Example Patterns

/[wW]oodchuck/ Woodchuck or woodchuck “Woodchuck”
/[abc]/ ‘a’, ‘b’, or ‘c’ “Hast du gut gelernt?”
/[1234567890]/ any digit “plenty cubes having 1”
The use of the brackets [ ] to specify a disjunction of characters.
Therefore, it can also
/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/
Means “any Capital Letter”
Regular Expression

/[A-Z]/ an upper case letter “we should call it ‘Gut Gemacht”
/[a-z]/ a lower case letter “the Bean was impatient”
/[0-9]/ a single digit “Kapital 1: Text Schrieben”
The use of the brackets [ ] to specify a disjunction of characters.

Regular Expression
The square braces can also be used to specify what a single
character should not be used in a string by use of the caret ^.
If the caret ^ is the first symbol after the open square brace [, the
resulting pattern is negated.
For example, the pattern /[â]/ matches any single character (including
special characters) except a.
This is only true when the caret is the first symbol after the open
square brace.
If it occurs anywhere else, it usually stands for a caret symbol

Regular Expression
Regular Expressions with condition “Not a substring”;

/[Â-Z]/ not an upper case letter “Am Himmel”
/[^Ss]/ neither ‘S’ nor ‘s’ “Hast du gut gelernt?”
/[^\.]/ not a period or dot “our resident”
/[e^]/ either ‘e’ or ‘^’ “look up ^ now”
/a^b/ the patter ‘a^b’ “look up a^b now”
Uses of caret/cap “^”

Regular Expression
How can we talk about optional elements, like an optional “s” in
“woodchuck” and “woodchucks”?
We can’t use the square brackets, because while the “[ ]” allow us to

say “s or S”, they don’t allow us to say “s or nothing”.
Therefore we use the question mark /?/, which means “the preceding
character or nothing”.
The question mark means that “zero or one instances of the previous
character”.
Regular Expression
Regular Expressions with Optional Conditions”;

/woodchucks?/ Woodchuck or woodchucks “woodchuck”
/colou?r/ color or colour “colour”
Uses of question mark “?”

/beg.n/ any character between beg “begin” , “beg’n” ,
and n “begun”
Uses of the period or dot “.” to specify any character
Regular Expression
How can I distinguish between cat and dog?
Since we can’t use the square brackets to search for “cat or dog”
Why can’t we say /[catdog]/?
We need a new operator, the disjunction operator, also called the pipe
symbol |.
The pattern /cat|dog/ matches
• either the string cat
• or the string dog

Regular Expression
Disjunction
Woodchucks is another name for groundhog!
• The pipe | for disjunction
RE Example Pattern
/groundhog|woodchuck/ “groundhog” or
“woodchuck”
/yours|mine/ “yours" or “mine”
/a|b|c/ = [abc]
/[gG]roundhog | [Ww]oodchuck/
Regular Expression
Sometimes we need to use the disjunction operator in the midst
of a larger sequence.
For example, suppose I want to search for information about pet
fish for my cousin David. How can I specify both guppy and
guppies?
Is it possible to express the above as /guppy|ies/?
No, because that would match only the strings guppy and ies.
Regular Expression
Precedence:
This is because sequences like guppy take precedence over the
disjunction operator |.
To make the disjunction operator apply only to a specific pattern,

then use the parenthesis operators ( and ).
Therefore, the pattern /gupp(y|ies)/ would specify that we meant

the disjunction only to apply to the suffixes y and ies.
Regular Expression
Operator Precedence
This idea that one operator may take precedence over another, use parentheses to
specify what we mean, is formalized by the operator precedence hierarchy for
regular expressions.
The following table gives the order of RE operator precedence, from highest
precedence to lowest precedence.
1 Parenthesis ()
2 Counters * + ? {}
3 Sequences and anchors the ^my end$
4 Disjunction |
Thus, because Counters have a higher precedence than Sequences

Regular Expression
For example; match repeated instances of a string.
Column 1 Column 2 Column 3
/Column [0-9]+ */
Will not match any number of columns; instead, it will match a single column
followed by any number of spaces.
The star here applies only to the space that precedes it, not to the whole
sequence.
With the parentheses, the expression is;

/(Column [0-9]+ *)*/
to match the word Column, followed by a number and optional spaces, the
whole pattern repeated any number of times.
Regular Expression
Construct a Finite State Machine for a language of certain
sheep?
The string that look like the following:

baa
baaa
baaaa
baaaaa
…

Regular Expression
The string that look like the following:
baa
baaa
baaaa
baaaaa
…
Stephen C Kleene
On the above string we will apply Kleene Closure property;
baa* or ba+
Regular Expression
Kleene Closure property.
baa* or ba+
Therefore, it can be expressed as /baa*/ or /[baa]*/
1. /baa*/ => means one b, one a followed by zero or more as
2. /ba+/ => means one b followed by one a or more as
3. /[baa]*/ => means zero or more bs or as (incorrect according to the
given language)
Therefore, according to language the correct expression will be /baa*/

Regular Expression
baa* or ba+
What will be the FSA machine? a
b a ε
q0 q1 q2 q3

Regular Expression
baa* or ba+
a
b a a ε
q0 q1 q2 q3 q4

Regular Expression
baa* or ba+ L(M) = {baa, baaa, baaaa ,baaaaa ,baaaaaa, . . .}
a
b a a ε
q0 q1 q2 q3 q4
b b b
a a, b
q5

Regular Expressions
An FSA for the words for English numbers 1–99.

Regular Expressions
An FSA for the words for English numbers 1–99 in Dollars and Cents.

Regular Expressions:
RE=> ? , * , +
Kleene *, Kleene +
RE Matches Example Pattern

/colou?r/ Optional previous char color , colour
/oo*h!/ 0 or more of previous char oh! , ooh! , oooh! , ooooh!
/o+h!/ 1 or more of previous char oh! , ooh! , oooh! , ooooh!
/baa*/ b and 1 or more a baa , baaa , baaaa , baaaaa
/ba+/ b and 1 or more a baa , baaa , baaaa , baaaaa

Regular Expressions
RE - Anchors => ^ , $
RE Matches Example Pattern

/^[A-Z]/ in the beginning capital letter “Palo Alto”
/^[Â-Za-z]/ no alphabet in the beginning “1” or “ ‘Hello’ ”
/\.$/ period “.” in the end “The end.”
/.$/ any thing in the end “The end?” or “The end!”

Example “the”
Find me all instances of the word “the” in a text.
/the/
Misses capitalized examples
/[tT]he/
won’t treat underscores and numbers as word (the or the25)
/\b[tT]he\b/
Incorrectly returns other or theology
/[â-zA-Z][tT]he[â-zA-Z]/
It is not being mentioned that “the” word should be in beginning of a line.

Example “the”
There are also two other anchors: • \b matches a word boundary
• \B matches a non-boundary
For example, /\b99\b/ will match the string 99 in

“There are 99 bottles of beer on the wall”
(because 99 follows a space)
But not 99 in
“There are 299 bottles of beer on the wall”
(since 99 follows a number).
But it will match 99 in $99

(since 99 follows a dollar sign ($), which is not a digit, underscore, or letter).
Example “the”
/[â-zA-Z][tT]he[â-zA-Z]/
“the” word should be in beginning of a line or in end of a line
/(^|[â-zA-Z])[tT]he([â-zA-Z]|$)/

Example “the”
The process we just went through was based on fixing two kinds of errors
Matching strings that we should not have matched (there, then, other)
False positives (Type I)
Not matching things that we should have matched (The, the)

False negatives (Type II)

Example “the”
In NLP we are always dealing with these kinds of errors.
Reducing the error rate for an application often involves two

antagonistic efforts:
• Increasing accuracy or precision (minimizing false positives)
• Increasing coverage or recall (minimizing false negatives).

Complex Example
Let’s try out a more significant example of the RE.
For example; build an application to help a user buy a computer on the Web.
• The user might want
“any machine with more than 6 GHz and 500 GB of disk space for less
than $1000”.
• To do this kind of retrieval, initially analyze expressions like 6 GHz or
500 GB or Mac or $999.99.
In the rest of the section some simple regular expressions will be analyzed
for this task.

Complex Example
First, let’s complete RE for prices.
RE for a dollar sign followed by a string of digits:

/$[0-9]+/
$ is for the end
Decimal point and two digits afterwards

/$[0-9]+\.[0-9][0-9]/
This pattern only allows $199.99 but not $199

Complex Example
Make the cents optional and to make sure a word boundary:
/\b$[0-9]+(\.[0-9][0-9])?\b/
How about specifications for processor speed?

Here’s a pattern for that:
/\b[0-9]+˽*(GHz|[Gg]igahertz)\b/
Allowing optional fractions again “5.5 GB”;

/\b[0-9]+(\.[0-9]+)? *(GB|[Gg]igabytes?)\b/
˽ means White Spaces
Regular Expression
RE /{3}/ means
“exactly 3 occurrences of the previous character or expression”.
Therefore,
/a\.{24}z/
will match a followed by 24 dots followed by z
/a\.{24, 30}z/
will match a followed by 24 dots OR upto 30 dots followed by z
/a\.{24, }z/
will match a followed by at least 24 dots followed by z

Regular Expression
Substitutions
To do substitutions put parentheses ( and ) around the first pattern and use
the number operator \1 in the second pattern to refer back.
Here’s how it looks:
s/([0-9]+)/<\1>/
The \1 will be replaced by whatever string matched the first item in

parentheses.
Therefore, this will match

The bigger they were, the bigger they will be
but not
The bigger they were, the faster they will be
Regular Expression
Capture group
The use of parentheses to store a pattern in memory is called a capture group.
Every time a capture group is used (i.e., parentheses surround a pattern), the
resulting match is stored in a numbered register.
For example; match two different sets of parentheses, \2 means whatever matched
the second capture group.
Thus
/the (.*)er they (.*), the \1er we \2/
The faster they ran, the faster we ran
but not
The faster they ran, the faster we ate.
Similarly, the third capture group is stored in \3, the fourth is \4, and so on.

Regular Expression
Occasionally parentheses are used for grouping, but don’t want to capture the
resulting pattern in a register.
In that case we use a non-capturing group, which is specified by putting the
commands ?: after the open parentheses, in the form (?: pattern ).
For example;
/(?:some|a few) (people|cats) like some \1/
will match
some cats like some people
but not
some people like some a few
Hence the above example means that (?: some|a few) are not selected.

Exercise
The faster they ran, the faster we ran
The faster they walk, the faster we walk
The slower they ran, the slower we ran
The slower they walk, the slower we walk
How can you make a regular expression of above example?
s/The (.*)er they (.*), the \1 we \2/

Exercise
The faster they ran, the slower we ran
The slower they ran, the faster we ran
How can you make a regular expression of above example?
s/The (faster|slower) they (.*), the \1 we (.*)/

OR
s/The (faster|slower) they ran, the \1 we ran/

Regular Expression (ELIZA)
Substitutions and capture groups are very useful in implementing
simple chatbots like ELIZA
User 1: Men are all alike.

ELIZA1: IN WHAT WAY
User 2: They’re always bugging us about something or other.
ELIZA2: CAN YOU THINK OF A SPECIFIC EXAMPLE
User 3: Well, my friend made me come here.
ELIZA3: YOUR FRIEND MADE YOU COME HERE
User 4: He says I’m depressed much of the time.
ELIZA4: I AM SORRY TO HEAR YOU ARE DEPRESSED
ELIZA1: IN WHAT WAY
ELIZA works by having a series or cascade of regular expression substitutions

each of which matches and changes some part of the input lines.
The first substitutions change all instances of my to YOUR, and I’m to YOU ARE,
and so on.
ELIZA1: IN WHAT WAY
s/.* all .*/IN WHAT WAY/

s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
s/.* I’[mM] (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
ELIZA1: IN WHAT WAY
What will be the RE of given sentence in selected dialog?
1) s/(my|your) friend made (me|you) come here/\1 friend made \2 come here/
2) s/(my|your) (.*) (me|you) (.*)/\1 \2 \3 \4/
Now more examples:
User 6: Ahhh, my wrist watch made me crazy.
User 7: Hmm, my car drive me cool.
Regular Expression
Lookahead
Finally, there will be times to predict the future: lookahead in the text
to see
• if some pattern matches,
• but not advance the match cursor,
• Therefore, the pattern should occur
These lookahead assertions make use of the “?” syntax that check in
the previous section for non-capture groups.
(?= pattern) is true if pattern occurs, but is zero-width, i.e. the match
pointer doesn’t advance the cursor.
(?! pattern) only returns true if a pattern does not match, but again is
zero-width and doesn’t advance the cursor.
Regular Expression
Negative Look ahead
Negative lookahead is commonly used when parsing some complex
pattern but want to rule out a special case.
For example;
to match, at the beginning of a line, any single word that doesn’t
start with “Volcano”.
RE: In the beginning
Negative lookahead to do this: Negative Look ahead
RE: Look ahead
/(^?!Volcano)[A-Za-z]+/
Exercise
Look ahead:
How to write a RE, if to match a word “But”, at the beginning of a

line?
Solution: /(^?=But)[A-Za-z]+/

Summary
Regular expressions play a surprisingly large role
• Sophisticated sequences of regular expressions are often the first
model for any text processing text
For many hard tasks, machine learning classifiers are used

• But regular expressions are used as features in the classifiers
• Can be very useful in capturing generalizations

Exercise
Write regular expressions for the following languages.
1. the set of all alphabetic strings;
/[A-Za-z]+/
2. the set of all lower case alphabetic strings ending in a “b”;
/[a-z]+b/
3. the set of all strings from the alphabet “a, b” such that each a is
immediately preceded by bs and immediately followed by a b;
/b+ab/
OR
/bb*ab/
Exercise
Write the regular expression for the following FSA
Solution:
/aba(ab)* | ab(ab)*/

Exercise
Write the set of Strings for the following FSA
one
the one very old sheep

q0 q1 q2 q3 q4 q5
the
Solution:
the one very old sheep
the very old sheep
MS(CS), Bahria University, Islamabad the one old sheep Instructor: Dr. Muhammad Asfand-e-yar
Word tokenization

Words and Corpora
Corpus (plural = Corpora)
A collection of written or spoken material stored on a computer and
used to find out how language is used.
Example:
He stepped out into the hall, was delighted to encounter a water
brother.
This sentence has
• 13 words if we don’t count punctuation marks as words,
• 15 if we count punctuation.
Whether we treat period (“.”), comma (“,”), and so on as words
depends on the task.
Punctuation is critical for finding boundaries of things (commas, periods,
colons) and for identifying some aspects of meaning (question marks,
exclamation
MS(CS), marks, quotation marks).
Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar
Words and Corpora
Spoken Sentences
For example:
An utterance is the spoken correlate of a sentence:
I do uh main- mainly business data processing
The given is the spoken sentence (i.e. utterance) and has two
kinds of disfluencies.
1. The broken-off word main- is called a fragment.
2. Words like uh and um are called fillers or filled pauses.
Should we consider these to be words?
Again, it depends on the application.
If we are building a speech transcription system, we might want to
eventually strip out the disfluencies.
Words and Corpora
What is the reason to use fillers and fragments in spoken
speech?
Disfluencies like uh or um are actually helpful in speech

recognition in predicting the upcoming word,
because they may signal that the speaker is restarting the clause or idea,
Therefore for speech recognition they are treated as regular

words.

Words and Corpora
How about inflected forms like cats versus cat?
These two words have the same lemma cat but are different
wordforms.
Lemma
A lemma is a set of lexical forms having the same stem, the same
major part-of-speech, and the same word sense.
Wordform
The wordform is the full inflected or derived form of the word.
For morphologically complex languages like Arabic, we often need
to deal with lemmatization.
Words and Corpora

Words and Corpora
For many tasks in English, however, wordforms are sufficient
How many words are there in English?

To answer this question we need to word type distinguish two
ways of talking about words.
Types are the number of distinct words in a corpus; if the set of

words in the vocabulary is V, the number of types is the word
token vocabulary size |V|.
Tokens are the total number N of running words.
How many words?
I do uh main- mainly business data processing
Fragments (i.e. main-), filled pauses (i.e. uh)
Seuss’s cat in the hat is different from other cats!

Lemma: same stem, part of speech, rough word sense
cat and cats = same lemma
Wordform: the full inflected surface form
cat and cats = different wordforms

How many words?
they lay back on the San Francisco grass and looked at the stars and their
• Type: an element of the vocabulary.

• Token: an instance of that type in running text.
How many Tokens or types in given example?
• 15 tokens (or 14)
14 => they lay back on the San Francisco grass and looked at the stars and their
• 13 types (or 12) (or 11?)
Text Normalization
Every NLP task needs to do text normalization:
1. Segmenting/tokenizing words in running text
2. Normalizing word formats
3. Segmenting sentences in running text

Heap’s Law: VR(n) = knβ
How many words? Heaps' law (also called Herdan's law) describes the
number of distinct words in a document (or set of
documents) as a function of the document length (so
N = number of tokens called type-token relation).
V = vocabulary = set of types The formula is given:
1) V is the number of distinct words in an instance
|V| is the size of the vocabulary text of size n.
2) k and β are free parameters determined various
experiments.
With English text corpora, typically k is between 10
and 100, and β is between 0.4 and 0.6.
Tokens = N Types = |V|

Switchboard phone conversations 2.4 million 20 thousand
Shakespeare 884,000 31 thousand
Google N-grams
MS(CS), Bahria University, Islamabad 1 trillion 13 million
The Tokenization in UNIX
(Inspired by Ken Church’s UNIX for Poets.)
Given a text file, output the word tokens and their frequencies
tr -sc ’A-Za-z’ ’\n’ < shakes.txt Change all non-alpha to newlines
| sort Sort in alphabetical order
| uniq –c Merge and count each type

tr -sc ’A-Za-z’ ’\n’ < shakes.txt | head
The output of above command is:
THE
SONNETS
by
William
Shakespeare
From
fairest
creatures
We
...
Now that there is one word per line, it will sort in the lines, and pass
them to uniq -c which will collapse and count them:
tr -sc 'A-Za-z' '\n' < shakes.txt | sort | uniq -c
1945 A
72 AARON
19 ABBESS
25 Aaron
6 Abate
1 Abates
5 Abbess
6 Abbey
3 Abbot
.... …
Merging upper and lower case
tr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c
Sorting the counts
tr ‘A-Z’ ‘a-z’ < shakes.txt | tr –sc ‘A-Za-z’ ‘\n’ | sort | uniq –c | sort –n –r
23243 the
22225 i
18618 and
16339 to The -n option to sort means to sort numerically
15687 of rather than alphabetically, and the -r option means
12780 a to sort in reverse order (highest-to-lowest)
12163 you
10839 my
10005 in
MS(CS), Bahria University, Islamabad
8954 d Instructor: Dr. Muhammad Asfand-e-yar
The Unix command sequence removes all the numbers and

punctuation, for most NLP applications it is required to keep
these while tokenization.
Punctuations are break off as a separate token;

• commas are a useful piece of information for parsers.
• periods help indicate sentence boundaries.

Issues in Tokenization
Finland’s capital  Finland Finlands Finland’s ?
what’re, I’m, isn’t  What are, I am, is not
Hewlett-Packard  Hewlett Packard ?
state-of-the-art  state of the art ?
Lowercase  lower-case lowercase lower case ?
San Francisco  one token or two?
m.p.h., PhD.  ??
555,550.50  ??
rock ‘n’ roll  rock and roll (separate tokens), one token?
A tokenizer can also be used to expand clitic contractions that are marked by
apostrophes, for example,
• what're to the two tokens what are,
• we're to the two tokens we are.
A clitic is a part of a word that can’t stand on its own, and can only occur when it is
attached to another word.
Some such contractions occur in other alphabetic languages, including articles and
pronouns in French (j'ai, l'homme).
Depending on the application, tokenization algorithms may also tokenize multiword
expressions for example:
New York or rock 'n' roll as a single token, which requires a multiword expression
dictionary of some sort.
Tokenization is thus intimately tied up with named entity detection, the task of detecting
names,
MS(CS), dates,
Bahria University, and organizations
Islamabad Instructor: Dr. Muhammad Asfand-e-yar
Penn Treebank tokenization
Penn Treebank tokenization standard, used for the parsed corpora
(treebanks) released by the Linguistic Data Consortium (LDC), the source of
many useful datasets.
This standard separates out
• clitics (i.e. doesn’t becomes does plus n’t)
• keeps hyphenated words together
• and separates out all punctuation
Input “ The San Francisco-based restaurant , “ they said , “ does n’t charge $ 10 ” .
Output
. “ The San Francisco-based restaurant , “ they said , “ does n’t charge $ 10 ” .

Tokenization: Language Issues
French
L'ensemble  one token or two? L'ensemble => means “the entire”
• L ? L’ ? Le ?
• Want l’ensemble to match with un ensemble
German noun compounds are not segmented

Lebensversicherungsgesellschaftsangestellter
Lebens versicherungs gesellschafts angestellter
means ‘life insurance company employee’
German information retrieval needs compound splitter
Tokenization: Language Issues
Chinese and Japanese no spaces between words:
莎拉波娃现在居住在美国东南部的佛罗里达。
莎拉波娃现在居住在美国东南部的佛罗里达
Sharapova now lives in US southeastern Florida
Further complicated in Japanese, with multiple alphabets
intermingled
Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
Katakana Hiragana Kanji Romaji

End-user can express query entirely in hiragana!
Word Tokenization in Chinese
The example provided in previous slides are also called Word
Segmentation
Chinese words are composed of characters

Characters are generally 1 syllable and 1 morpheme.
Average word is 2.4 characters long.
Standard baseline segmentation algorithm:

Maximum Matching (also called Greedy or Max-Match)

Maximum Matching
Word Segmentation Algorithm
Given a wordlist of any string as dictionary, and a string.
1) Start a pointer at the beginning of the string
2) Find the longest word in dictionary that matches the string

starting at pointer
1) Move the pointer over the word in string
2) Go to 2

Maximum Matching
Word Segmentation Algorithm
function MAXMATCH(sentence, dictionary D) returns word sequence W
if sentence is empty
return empty list
for i length(sentence) downto 1
firstword = first i chars of sentence
remainder = rest of sentence
if InDictionary(firstword, D)
return list(firstword, MaxMatch(remainder,dictionary) )
# no word was found, so make a one-character word
firstword = first char of sentence
remainder = rest of sentence
return list (firstword, MaxMatch(remainder,dictionary D) )
Max-Match Segmentation
Thecatinthehat The cat in the hat
The table down there
Thetabledownthere
Theta bled own there
Doesn’t generally work in English!
But works astonishingly well in Chinese Language

莎拉波娃现在居住在美国东南部的佛罗里达。
莎拉波娃现在居住在美国东南部的佛罗里达
Sharapova now lives in US southeastern Florida
Modern probabilistic segmentation algorithms even better
Max-Match Segmentation
Also works well in German Language
Lebensversicherungsgesellschaftsangestellter
Lebens versicherungs gesellschafts angestellter
life insurance company employee

Word Normalization and Stemming

Normalization
Tokens can also be Normalized, and choose the Normalized form to
match multiple forms
Information Retrieval: indexed text & query terms must have same form.
For example to match U.S.A. to USA and US
or to match uh-huh to uhhuh
We implicitly define equivalence classes of terms

e.g., deleting periods in a term
Alternative: asymmetric expansion:
• Enter: window Search: window, windows
• Enter: windows Search: Windows, windows, window
• Enter: Windows Search: Windows
Potentially more powerful, but less efficient

Case folding
Case folding is another form of Normalization.
For example: every thing is matched to lower case.
Applications like Information Case folding is used for

Retrieval: reduce all letters to • Sentiment Analysis,
lower case • Machine Translation,
• Normally users tend to use lower • Information extraction
case
• Possible exception: upper case in
mid-sentence? Case folding is helpful (US versus
• e.g., General Motors us is important)
• Fed vs. fed
• SAIL vs. sail

Lemmatization
Lemmatization: determines that two words have the same root,
despite their surface differences
Lemmatization is used to reduce variant forms to base form
• am, are, is  be
• car, cars, car's, cars'  car
the boy's cars are different colors  the boy car be different color
Lemmatization: have to find correct dictionary headword form
Machine translation
Spanish quiero (‘I want’), quieres (‘you want’) same lemma as querer ‘want’
Morphology
How Lemmatization is performed?
Through Morphological Parsing
Morphology is the way to build Morphemes:

Morphemes means “The small meaningful units that make up words”
Two types of Morphemes are:

• Stems: The core meaning-bearing units
• Affixes: Adding additional meaning of various kinds OR Bits and pieces that
adhere to stems
• Often with grammatical functions

Stemming
Reduce terms to their stems in information retrieval
Stemming is crude chopping of affixes
• language dependent
• e.g., automate(s), automatic, automation all reduced to automat.
for example compressed for exampl compress

and compression are both and compress ar both
accepted as equivalent to accept as equival to
compress. compress
For stemming widely used algorithm is Porter’s Algorithm

Porter’s Algorithm
The most common English stemmer.
This was not the map we found in Billy Bones's chest, but an accurate
copy, complete in all things-names and heights and soundings-with
the single exception of the red crosses and the written notes.
The Porter stemmer algorithm produces the following stemmer output.
Thi wa not the map we found in Billi Bone s chest but an accur copi
complet in all thing name and height and sound with the singl except
of the red cross and the written note
Porter’s Algorithm
The most common English stemmer.
Step 1a Step 2 (for long stems)
sses  ss ational ate
caresses  caress relational  relate
ies  i izer ize
ponies  poni digitizer  digitize
ss  ss ator ate
caress  caress operator  operate
s ø …
cats  cat Step 3 (for longer stems)
Step 1b al ø
(*v*)ing  ø revival  reviv
walking  walk able  ø
sing  sing adjustable  adjust
(*v*)ed  ø ate  ø
plastered  plaster activate  activ
…
MS(CS), Bahria University, Islamabad
…
Morphology in a Corpus
Why only strip –ing if there is a vowel?
(*v*)ing  ø
walking  walk
sing  sing
tr -sc 'A-Za-z' '\n' < shakes.txt | grep 'ing$' | sort | uniq -c | sort –nr
tr -sc 'A-Za-z' '\n' < shakes.txt | grep '[aeiou].*ing$' | sort | uniq -c | sort –nr
1312 King 548 being
548 being 541 nothing
541 nothing 152 something
388 king 145 coming
375 bring 130 morning
358 thing 122 having
307 ring 120 living
152 something 117 loving
145 coming 116 Being
130 morning 102 going
Dealing with complex morphology is
sometimes necessary
Some languages requires complex morpheme segmentation
Turkish
Uygarlastiramadiklarimizdanmissinizcasina
`(behaving) as if you are among those whom we could not civilize’
Uygar `civilized’ + las `become’ + tir `cause’ + ama `not able’ + dik `past’ +
lar ‘plural’ + imiz ‘p1pl’ + dan ‘abl’ + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’

Sentence Segmentation and Decision
Trees
Sentence Segmentation
Sentence segmentation is another important step in text
processing.
The most useful cues for segmenting a text into sentences are
punctuation, like periods, question marks, exclamation points.
Question marks, exclamation points and Periods are relatively

unambiguous markers of sentence boundaries.
!, ? are relatively unambiguous
Sentence Segmentation
!, ? are relatively unambiguous
Period “.” is quite ambiguous

• Sentence boundary
• Abbreviations like Inc. or Dr.
• Numbers like .02% or 4.3
Sentence Segmentation/Tokenization works by:

Build a binary classifier
• Looks at a “.”
• Decides EndOfSentence/NotEndOfSentence
• Classifiers: hand-written rules, regular expressions, or machine-learning

Determining if a word is end-of-sentence:
a Decision Tree

More sophisticated decision tree features
Case of word with “.”: Upper, Lower, Cap, Number
Case of word after “.”: Upper, Lower, Cap, Number
For example: Dr. , Ph.D. , 50.5 , …
Numeric features
• Length of word with “.”
• Probability (word with “.” occurs at end-of-s)
• Probability (word after “.” occurs at beginning-of-s)

Implementing Decision Trees
A decision tree is just an if-then-else statement
The interesting research is choosing the features

Setting up the structure is often too hard to do by hand
• Hand-building only possible for very simple features, domains
• For numeric features, it’s too hard to pick each threshold
• Instead, structure usually learned by machine learning from a training

corpus

Exercise
Implement the Max-Match algorithm in any Programming Language.
Algorithm is provided in the slide 82 & 83.

Septs to follow:
1. Construct a dictionary as *.txt, list of words per line
2. Pass a string, as given below, to the Max-Match function and
identify the string.
For example:
Input: wecanonlyseeashortdistanceahead
Output: we can only see a short distance ahead.
Q&A
That’s all for today’s Lecture

Lecture 2 3

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lecture 2 3

Uploaded by

Copyright:

Available Formats

Adv.

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

The dialogue above is from ELIZA, an early natural language processing

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

Eliza’s mimicry of human conversation was remarkably successful: many

RE is used to specify strings that might be extracted from a document,

RE is used to turn to a set of tasks collectively called text normalization, in

Normalizing text means converting text to a more convenient, standard form.

RE is used in every computer language, word processor, and text

RE is an algebraic notation for characterizing a set of strings.

RE is particularly useful for searching in texts, when we have a

To search for woodchuck, we type /woodchuck/.

The search string can consist of a single character (like /!/) or a

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

RE Example Patterns Matched

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

RE Match Example Patterns

RE Match Example Patterns

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

RE Match Example Patterns

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

We can’t use the square brackets, because while the “[ ]” allow us to

RE Match Example Patterns

RE Match Example Patterns

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

Is it possible to express the above as /guppy|ies/?

To make the disjunction operator apply only to a specific pattern,

Therefore, the pattern /gupp(y|ies)/ would specify that we meant

Thus, because Counters have a higher precedence than Sequences

With the parentheses, the expression is;

The string that look like the following:

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

Therefore, according to language the correct expression will be /baa*/

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

RE Matches Example Pattern

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

RE Matches Example Pattern

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

It is not being mentioned that “the” word should be in beginning of a line.

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

For example, /\b99\b/ will match the string 99 in

But it will match 99 in $99

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

Not matching things that we should have matched (The, the)

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

Reducing the error rate for an application often involves two

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

MS(CS), Bahria University, Islamabad Instructor: Dr. Muhammad Asfand-e-yar

RE for a dollar sign followed by a string of digits:

Decimal point and two digits afterwards

This pattern only allows $199.99 but not $199

How about specifications for processor speed?

Allowing optional fractions again “5.5 GB”;

s/The (.)er they (.), the \1 we \2/

s/The (faster|slower) they (.), the \1 we (.)/