KEN2570 2 Morphology
Jerry Spanakis
@gerasimoss
http://dke.maastrichtuniversity.nl/jerry.spanakis
Agenda
• Words
• Most commonly used
  - Words
  - Letters

What is a word?
• The quick brown fox jumps over the lazy dog.
• Tokenization: segment a sequence of characters into words
  - Convention in western languages: word boundaries = spaces
• Many Asian languages: no spaces between words or even sentences
  - Thequickbrownfoxjumpsoverthelazydog
• Ambiguous characters make word segmentation difficult
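The space-based convention above can be sketched in a few lines of Python. The punctuation-splitting pattern is a simplistic assumption for illustration, not a full tokenizer:

```python
import re

sentence = "The quick brown fox jumps over the lazy dog."

# Naive whitespace tokenization: the trailing "." stays glued to "dog."
ws_tokens = sentence.split()

# A slightly better sketch: treat runs of word characters and single
# punctuation marks as separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
```

For languages written without spaces, neither approach applies and a segmentation model is needed instead.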
Morphology
• Word formation through morpheme composition
  - Inflection
    - Function of the morpheme: adds information about tense, count, person, gender and case
    - e.g. kauf-st, kauf-te; car, car-s; Freund, (den) Freund-en; ein-e schön-e Blume; arbol-es verde-s
  - Derivation
    - happi-ness, un-predict-able, Zufrieden-heit, un-brauch-bar, an-kleiden
    - Bound morphemes derive new words:
      e.g. stem morpheme happy (adjective) + bound morpheme -ness → happiness (noun)
• Vowel Harmony
  - Hungarian, Turkish: the choice of suffix depends on the vowels already present in the word
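The inflection/derivation examples above suggest a crude suffix-stripping sketch. The suffix list is a tiny illustrative assumption; real systems use e.g. the Porter stemmer or a lemmatizer:

```python
SUFFIXES = ["ness", "able", "s"]  # toy list for illustration only

def naive_stem(word: str) -> str:
    """Strip the first matching derivational/inflectional suffix,
    keeping a minimum stem length so short words survive intact."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word
```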
Morphological Analysis
• Automatic tools that infer morphological information
  - Gender
  - Tense
  - …

Challenges
• How does morphology influence the difficulty of the task?
Type/Token Ratio
• More data → lower type/token ratio
• Let's think about the following corpora. Which has the higher type/token ratio? Rank them!

[Figure: type/token ratio (0–0.4) vs. corpus size, 10K–100M tokens]
[Figure: English Wikipedia vs. Newswire — type/token ratio vs. # tokens]
[Figure: English Wikipedia vs. Tweets — type/token ratio vs. # tokens]
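The quantity plotted in the figures is easy to compute; a minimal sketch:

```python
def type_token_ratio(tokens):
    """Number of distinct word types divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

small = "the cat sat on the mat".split()
large = small * 1000  # more data, same vocabulary -> much lower ratio
```

Repeating the same text multiplies tokens but not types, which is the intuition behind the falling curves above.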
"really" on Twitter
[Table: over 100 spelling variants of "really" with their Twitter frequencies, from 224571 × really, 1189 × rly, 1119 × realy, 731 × rlly, 590 × reallly, 234 × realllly, … down to 3 occurrences for rare elongations such as reeeeealllly, reeaalllly and reallyl]
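One hedged way to handle such elongated spellings: split a token into runs of identical characters, try shortening each stretched run to one or two characters, and accept the first candidate found in a vocabulary. The `VOCAB` set is a toy assumption; a real system would use a large lexicon:

```python
import itertools

VOCAB = {"really", "real"}  # toy vocabulary; an assumption for this sketch

def candidates(token: str):
    """Yield de-elongated candidates: each run of repeated characters is
    tried at length 1 and (if elongated) length 2."""
    runs = [(ch, len(list(g))) for ch, g in itertools.groupby(token.lower())]
    options = [[ch, ch * 2] if n >= 2 else [ch] for ch, n in runs]
    for combo in itertools.product(*options):
        yield "".join(combo)

def normalize(token: str) -> str:
    for cand in candidates(token):
        if cand in VOCAB:
            return cand
    return token  # abbreviations like "rly" are left untouched
```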
Frequency of words
[Figure: histogram of # words (0–20000) against word frequency 1–10: most word types occur only once]
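The histogram above is a frequency-of-frequencies count, which a `Counter` produces directly:

```python
from collections import Counter

tokens = "the quick brown fox jumps over the lazy dog the fox".split()
counts = Counter(tokens)

# How many word types occur exactly once, twice, three times, ...?
freq_of_freq = Counter(counts.values())
```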
Text Normalization

European Languages
Non-European Languages
• Chinese and Japanese: no spaces between words:
  - 莎拉波娃现在居住在美国东南部的佛罗里达。
  - 莎拉波娃 现在 居住 在 美国 东南部 的 佛罗里达
  - Sharapova now lives in southeastern Florida, US
• Word segmentation model
  - Predict whether a new word starts after each symbol
• Further complicated in Japanese, with multiple alphabets intermingled

Normalization
• Idea: map several words to the same word
  - Difference unimportant (or less important) for the task
  - E.g. case information at the first word of a sentence; digits in machine translation; U.S.A. and USA
  - Task dependent
  - Advantage:
    - More examples of the same token
    - Implicitly defines equivalence classes
• Techniques:
  - Case folding / True-casing
  - Word classes
  - Lemmatization / Stemming
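A classic baseline for segmentation without spaces is greedy maximum matching: at each position, take the longest dictionary word. The slide's "predict whether a new word starts after each symbol" describes a learned model; this dictionary-based sketch is a simpler stand-in, shown with Latin characters and a toy vocabulary:

```python
def max_match(text, vocab, max_len=5):
    """Greedy maximum-matching segmentation with a toy dictionary.
    Unknown single characters fall through as one-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + l] in vocab or l == 1:
                tokens.append(text[i:i + l])
                i += l
                break
    return tokens
```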
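The normalization techniques listed above can be sketched on single tokens. The `<NUM>` word class and the acronym rule are illustrative assumptions, not a standard pipeline:

```python
import re

def normalize_token(token: str) -> str:
    # Case folding: map everything to lower case (aggressive;
    # true-casing would instead restore the "correct" case).
    t = token.lower()
    # Word classes: collapse all digit strings into one equivalence class.
    t = re.sub(r"\d+", "<NUM>", t)
    # Treat U.S.A. and USA as the same token by dropping periods in acronyms.
    if re.fullmatch(r"(?:[a-z]\.)+", t):
        t = t.replace(".", "")
    return t
```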
• Numeric features
  - Length of word ending in "."
  - Probability(word with "." occurs at end-of-sentence)
  - Probability(word after "." occurs at beginning-of-sentence)
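These features could be extracted as follows; `eos_prob` and `bos_prob` are hypothetical probability tables that would be estimated from a training corpus:

```python
def boundary_features(tokens, i, eos_prob, bos_prob):
    """Numeric features for classifying whether the "." ending tokens[i]
    marks a sentence boundary (a sketch, not the lecturer's code)."""
    word = tokens[i]
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return {
        "len_word_with_period": len(word),
        "p_word_at_eos": eos_prob.get(word.lower(), 0.0),
        "p_next_word_at_bos": bos_prob.get(nxt.lower(), 0.0),
    }
```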
Implementing Decision Trees
• Or any other classifier: SVM, Logistic Regression, …
• The interesting research is choosing the features
  - This was a research direction for many, many years
• Setting up the structure is often too hard to do by hand
  - Hand-building is only possible for very simple features and domains
  - For numeric features, it's too hard to pick each threshold
  - Instead, the structure is usually learned by machine learning from a training corpus

Basic Text Processing
• Basic tools are always helpful
  - Regular expressions
  - Minimum edit distance
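The point about thresholds can be made concrete: for a single numeric feature, a tree learner picks the cut-off that minimizes misclassification. A self-contained toy sketch of that search:

```python
def best_threshold(xs, ys):
    """Find the threshold on one numeric feature minimizing
    misclassification error; returns (threshold, errors)."""
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        errors = sum((x <= t) != y for x, y in zip(xs, ys))
        errors = min(errors, len(xs) - errors)  # allow either polarity
        if errors < best[1]:
            best = (t, errors)
    return best
```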
Regular Expressions
• A formal language for specifying text strings
• How can we search for any of these?
  - woodchuck
  - woodchucks
  - Woodchuck
  - Woodchucks
• Letters inside square brackets []
  Pattern         Matches
  [wW]oodchuck    Woodchuck, woodchuck
  [1234567890]    Any digit
• Ranges [A-Z]
  Pattern   Matches
  [A-Z]     An upper case letter    Drenched Blossoms
  [a-z]     A lower case letter     my beans were impatient
  [0-9]     A single digit          Chapter 1: Down the Rabbit Hole
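These character classes work the same way in Python's `re` module; a short demonstration:

```python
import re

line = "Woodchuck and woodchuck; Chapter 1: Down the Rabbit Hole"

# [wW] matches either case of the first letter.
both_cases = re.findall(r"[wW]oodchuck", line)

# [0-9] is a range: any single digit.
digits = re.findall(r"[0-9]", line)

# [A-Z] matches one upper-case letter.
uppercase = re.findall(r"[A-Z]", line)
```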
Regular Expressions: Negation in Disjunction
• Negations [^Ss]
  - Caret means negation only when first in []

Regular Expressions: More Disjunction
• Woodchucks is another name for groundhog!
• The pipe | for disjunction
Regular Expressions: ? * + .
  Pattern   Meaning                            Matches
  colou?r   optional previous char             color colour
  oo*h!     0 or more of previous char         oh! ooh! oooh! ooooh!
  o+h!      1 or more of previous char         oh! ooh! oooh! ooooh!
  baa+                                         baa baaa baaaa baaaaa
  beg.n                                        begin begun begun beg3n

Regular Expressions: Anchors ^ $
  Pattern       Matches
  ^[A-Z]        Palo Alto
  ^[^A-Za-z]    1 "Hello"
  \.$           The end.
  .$            The end? The end!
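The quantifiers and anchors in the tables above, verified in Python:

```python
import re

# "?" makes the previous character optional.
color_hits = [w for w in ["color", "colour", "colouur"]
              if re.fullmatch(r"colou?r", w)]

# "*" vs "+": zero-or-more vs one-or-more of the previous character.
star_hits = [w for w in ["h!", "oh!", "ooooh!"] if re.fullmatch(r"oo*h!", w)]
plus_hits = [w for w in ["h!", "oh!", "ooooh!"] if re.fullmatch(r"o+h!", w)]

# Anchor \.$ requires a literal period at the end of the string.
end_period = [s for s in ["The end.", "The end?", "The end!"]
              if re.search(r"\.$", s)]
```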
Example
• Find me all instances of the word "the"
  the                         Misses capitalized examples
  [tT]he                      Incorrectly returns other or theology
  [^a-zA-Z][tT]he[^a-zA-Z]
  Smarter & useful: \b[tT]he\b

Errors
• The process we just went through was based on fixing two kinds of errors
  - Matching strings that we should not have matched (there, then, other)
    - False positives (Type I)
  - Not matching things that we should have matched (The)
    - False negatives (Type II)
• In NLP we are always dealing with these kinds of errors.
• Reducing the error rate for an application often involves two antagonistic efforts:
  - Increasing accuracy or precision (minimizing false positives)
  - Increasing coverage or recall (minimizing false negatives)
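The final pattern from the example slide, in action:

```python
import re

text = 'The other day the theology student said "the"'

# \b keeps us from matching inside "other" and "theology".
hits = re.findall(r"\b[tT]he\b", text)
```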
• Regular expressions are still used as features in classifiers
  - Can be very useful in capturing generalizations
• Also for Machine Translation, Information Extraction, Speech Recognition
Minimum Edit Distance
• The minimum edit distance between two strings
• If each operation has cost of 1
  - Distance between these is 5
• If substitutions cost 2 (Levenshtein)
  - Distance between them is 8

Alignment
• Given two sequences, align each letter to a letter or gap:
  -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
  TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Other uses of Edit Distance in NLP
• Evaluating Machine Translation and speech recognition
  R  Spokesman confirms     senior government adviser was shot
  H  Spokesman said    the  senior            adviser was shot dead
               S       I                D                       I
• Named Entity Extraction and Entity Coreference
  - IBM Inc. announced today
  - IBM profits
  - Stanford President John Hennessy announced yesterday
  - for Stanford University President John Hennessy

How to find the Min Edit Distance?
• Searching for a path (sequence of edits) from the start string to the final string:
  - Initial state: the word we're transforming
  - Operators: insert, delete, substitute
  - Goal state: the word we're trying to get to
  - Path cost: what we want to minimize: the number of edits
Dynamic Programming for Minimum Edit Distance
• Dynamic programming: a tabular computation of D(n,m)
• Solving problems by combining solutions to subproblems
• Bottom-up
  - We compute D(i,j) for small i,j
  - And compute larger D(i,j) based on previously computed smaller values
  - i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m)

Defining Min Edit Distance (Levenshtein)
• Initialization
  D(i,0) = i
  D(0,j) = j
• Recurrence Relation:
  For each i = 1…M
    For each j = 1…N
      D(i,j) = min of:
        D(i-1,j) + 1                              (deletion)
        D(i,j-1) + 1                              (insertion)
        D(i-1,j-1) + 2 if X(i) ≠ Y(j), else + 0   (substitution)
• Termination:
  D(N,M) is distance
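The recurrence above translates directly into code; a minimal sketch with insert/delete cost 1 and substitution cost 2, as on the slide:

```python
def min_edit_distance(x: str, y: str) -> int:
    """Levenshtein distance with substitution cost 2."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):        # initialization: D(i,0) = i
        D[i][0] = i
    for j in range(1, m + 1):        # initialization: D(0,j) = j
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 2
            D[i][j] = min(D[i - 1][j] + 1,          # deletion
                          D[i][j - 1] + 1,          # insertion
                          D[i - 1][j - 1] + sub)    # substitution / match
    return D[n][m]                   # termination: D(N,M) is the distance
```

On the running example this reproduces the value in the table below: the distance between "intention" and "execution" is 8.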
The Edit Distance Table (initialization)
  N 9
  O 8
  I 7
  T 6
  N 5
  E 4
  T 3
  N 2
  I 1
  # 0  1  2  3  4  5  6  7  8  9
     #  E  X  E  C  U  T  I  O  N
The Edit Distance Table
  N 9  8  9 10 11 12 11 10  9  8
  O 8  7  8  9 10 11 10  9  8  9
  I 7  6  7  8  9 10  9  8  9 10
  T 6  5  6  7  8  9  8  9 10 11
  N 5  4  5  6  7  8  9 10 11 10
  E 4  3  4  5  6  7  8  9 10  9
  T 3  4  5  6  7  8  7  8  9  8
  N 2  3  4  5  6  7  8  7  8  7
  I 1  2  3  4  5  6  7  6  7  8
  # 0  1  2  3  4  5  6  7  8  9
     #  E  X  E  C  U  T  I  O  N
Adding Backtrace to Minimum Edit Distance
• Recurrence Relation:
  For each i = 1…M
    For each j = 1…N
      D(i,j) = min of:
        D(i-1,j) + 1                              (deletion)
        D(i,j-1) + 1                              (insertion)
        D(i-1,j-1) + 2 if X(i) ≠ Y(j), else + 0   (substitution)
      ptr(i,j) = LEFT (insertion) | DOWN (deletion) | DIAG (substitution)

The Distance Matrix
[Figure: distance matrix with axes x0 …… xN and y0 …… yM]
• Every non-decreasing path from (0,0) to (M,N) corresponds to an alignment of the two sequences
• An optimal alignment is composed of optimal subalignments
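Extending the earlier recurrence with pointers, the backtrace can be sketched as follows; it returns both the distance and one optimal alignment, with "-" marking a gap:

```python
def min_edit_alignment(x: str, y: str):
    """Levenshtein DP (sub cost 2) with backtrace pointers."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "DOWN"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "LEFT"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 2
            choices = [(D[i - 1][j - 1] + sub, "DIAG"),   # substitution/match
                       (D[i - 1][j] + 1, "DOWN"),         # deletion
                       (D[i][j - 1] + 1, "LEFT")]         # insertion
            D[i][j], ptr[i][j] = min(choices)
    # Walk the pointers back from (n, m) to (0, 0).
    align, i, j = [], n, m
    while i > 0 or j > 0:
        p = ptr[i][j]
        if p == "DIAG":
            align.append((x[i - 1], y[j - 1])); i, j = i - 1, j - 1
        elif p == "DOWN":
            align.append((x[i - 1], "-")); i -= 1
        else:
            align.append(("-", y[j - 1])); j -= 1
    return D[n][m], align[::-1]
```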
• Space: O(nm)
• Backtrace: O(n+m)
Weighted Edit Distance
• Why would we add weights to the computation?
  - Spell correction: some letters are more likely to be mistyped than others
  - Biology: certain kinds of deletions or insertions are more likely than others
[Figure: confusion matrix for spelling errors]
• Initialization:
  D(0,0) = 0
  D(i,0) = D(i-1,0) + del[x(i)];  1 ≤ i ≤ N
  D(0,j) = D(0,j-1) + ins[y(j)];  1 ≤ j ≤ M
• Recurrence Relation:
  D(i,j) = min of:
    D(i-1,j) + del[x(i)]
    D(i,j-1) + ins[y(j)]
    D(i-1,j-1) + sub[x(i), y(j)]
• Termination:
  D(N,M) is distance
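The weighted recurrence is the same table filled with per-character costs. In this sketch the cost tables are passed in as functions; in practice they would come from something like the spelling-error confusion matrix mentioned above:

```python
def weighted_edit_distance(x, y, del_cost, ins_cost, sub_cost):
    """Weighted edit distance: del_cost/ins_cost map a character to its
    cost, sub_cost maps a character pair (hypothetical cost tables)."""
    n, m = len(x), len(y)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost(x[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost(y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + del_cost(x[i - 1]),
                          D[i][j - 1] + ins_cost(y[j - 1]),
                          D[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]))
    return D[n][m]
```

With unit insert/delete costs and substitution cost 2 (0 on a match) it reduces to the unweighted recurrence from earlier.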
Summary
• Basic tools
• Morphology
• Data preparation