KEN2570 2 Morphology

KEN2570 Words, Words, Words
Natural Language Processing

Morphology and Text processing
Jerry Spanakis
@gerasimoss
http://dke.maastrichtuniversity.nl/jerry.spanakis
Text levels Basic building blocks

• Documents • What are the basic building blocks?
- Cannot split any further
- Too small
- Tasks gets complicated
• Sentences I like NLP. - Too large
- Not flexible enough
- Blocks are only occurring rarely
• Words
• Most commonly used
- Words
• Letters
3 4
Agenda What is a word?
• The quick brown fox jumps over the lazy dog.
• Tokenization: segment sequence of characters into
words
- convention in western languages: word boundaries = spaces
• Many Asian languages: no spaces between words or
even sentences
- Thequickbrownfoxjumpsoverthelazydog
• Ambiguous characters make word segmentation
difficult:
Morphology Basic Text processing
What is a word? Morphology

• internal structure of words, word formation
• Vocabulary: Set of all words in a corpus/system • words are composed of morphemes: smallest
- In general: Open vocabulary meaning-carrying units of language
- No fixed set of possible words - house, house-s, small, small-est, un-predict-able, un-happi-
• Type: an element of the vocabulary. ness
• Token: an instance of that type in running text. • morpheme types
- stem morphemes may appear as separate words: house-s
- functional or bound morphemes need to be connected with
stem morphemes
- afix types: prefix, suffix, infix, circumfix
- un-happy, kauf-st, Gespräch-s-ablauf, ge-dronk-en
7
Morphology Morphology
• Word formation through morpheme composition • Word formation through morpheme composition
- Inflection - Inflection
- function of the morpheme: add information about tenses, - Derivation
count, person, gender and case - happi-ness, un-predict-able, Zufrieden-heit, un-brauch-
- e.g. kauf-st, kauf-te, car, car-s, Freund, (den) Freund-en, …, bar, an-kleiden
ein-e schön-e Blume, arbol-es verde-s - bound morphemes derive new words:
e.g. stem morpheme happy (adjective) + bound
morpheme –ness à happiness (noun)
Morphology Morphology Specialities

• Word formation through morpheme composition • morpho-phonological processes at morpheme
- Inflection boundaries
- Derivation - stick / sticks, go / goes
- Composition: - „Fugen-s“ in German compounds: Universität-s-professor
- rain-bow, water-proof, Haus-tür, Einkauf-s-wagen, eis-kalt
- combination of stem morphemes à create new words
• German „Umlaut“: Mutter / Mütter
• Vowel Harmony
- Hungarian, Turkish: choice of suffix depends on already
existing vowels in the word
Morphological Analysis Challenges
• Automatic tools that infer morphological information • How does morphology influence the difficulty of the
- Gender task?
- Tense …
• Available for well-resourced languages
How many words/tokens? Type/Token Ratio

If they want to go, they should go . • Useful statistic of a corpus
• How will the type/token ratio change when adding

• How many different words?
more data?
• How many different tokens?
15 16
More data à Lower type/token ratio Type/Token Ratio
0.4
• Let’s think about the following corpora.
0.35
Which has the higher type/token ratio? Rank them!
type/token
0.3 English - English Wikipedia

ratio
Wikipedia - Simple English Wikipedia

0.25
- Newswire
0.2 - Twitter
0.15
0.1
0.05
0
10K 100K 1M 10M 100M
# tokens 17 18
Type/Token Ratio Type/Token Ratio

0.4 0.4
0.35 0.35
English English
type/token
type/token
0.3 Wikipedia 0.3 Wikipedia
ratio
0.25 ratio 0.25

Simple English Simple English
0.2 Wikipedia 0.2 Wikipedia
0.15 0.15
Newswire Tweets
0.1 0.1
0.05 0.05
0 0
10K 100K 1M 10M 100M 10K 100K 1M 10M 100M
# tokens 19
# tokens 20
“really” on Twitter “really” on Twitter
224571 really 50 reallllllly 15 reallllyy 8 reallyyyyyyy 6 realllllllllly 4 realllllllyyyy
1189 rly 48 reeeeeally 15 reallllllllly 8 reallyyyyyy 6 reaaaaaallly 4 reaalllyyy
1119 realy 41 reeally 15 reaallly 8 realky 5 rrrreally 4 reaalllly
731 rlly 38 really2 14 reeeeeeally 7 relaly 5 rrly 4 reaaalllyy
590 reallly 37 reaaaaally 14 reallllyyyy 7 reeeeeeeeeally 5 rellly 4 reaaalllly
234 realllly 35 reallyyyyy 13 reeeaaally 7 reeeealy 5 reeeeeeeeally 4 reaaaaly
216 reallyy 31 reely 12 rreally 7 reeeeaaally 5 reeeeaally 3 reeeeealllly
156 relly 30 realllyyy 12 reaaaaaally 7 reallllllyyy 5 reeeeaaallly 3 reeeealllly
146 reallllly 27 realllyy 11 reeeeallly 7 realllllllllllly 5 reeallyyy 3 reeeeaaaaally
132 rily 27 reaaly 11 reeeallly 7 reaaaaaaally 5 reallllllllllly 3 reeeaallly
104 reallyyy 26 realllyyyy 11 realllllyyy 7 raelly 5 reallllllllllllly 3 reeeaaallllyyy
89 reeeally 25 realllllllly 11 reaallyy 7 r3ally 5 reaalllyy 3 reealy
89 realllllly 22 reaaallly 10 reallyreallyreally 6 r-really 5 reaaaalllly 3 reeallly
84 reaaally 21 really- 10 reaaaly 6 reeeaaalllyyy 5 reaaaaallly 3 reeaaly
82 reaally 19 reeaally 9 reeeeeeeally 6 reeeaaallly 4 rllly 3 reeaalllyyy
72 reeeeally 18 reallllyyy 9 reallys 6 reeeaaaally 4 reeeeeeeeeeally 3 reeaalllly
65 reaaaally 16 reaaaallly 9 really-really 6 realyl 4 reeealy 3 reeaaallly
57 reallyyyy 15 realyy 9 r)eally 6 r-e-a-l-l-y 4 reeaaaally 3 reallyyyyyyyyy
53 rilly 15 reallyreally 8 reeeaally 6 realllyyyyy 4 realllllyyyy 3 reallyl
21 22
“really” on Twitter 1 rrrrrrrrrrrrrrrreeeeeeeeeeeaaaaaaalllllllyyyyyy

1 rrrrrrrrrreally
3 really) 2 rlyyyy 2 reeaallyy 1 rrrrrrreeeeeeaaaalllllyyyyyyy
3 r]eally 2 rlyyy 2 reeaalllyy 1 rrrrrrealy
3 realluy 2 reqally 2 reeaallly 1 rrrrrreally
3 reallllyyyyy 2 rellyy 2 reeaaally …
3 reallllllyyyyyyy 2 rellys 2 reaqlly 1 re-he-he-heeeeally
3 reallllllyyyy 2 reeely 2 realyyy 1 re-he-he-he-ealy
3 reallllllyy 2 reeeeeealy 2 reallyyyyyyyyyyyy 1 reheheally
3 realllllllllllllllly 2 reeeeeallly 2 reallyyyyyyyy 1 reelllyy
3 realiy 2 reeeeeaally 2 really* 1 reellly
3 reaallyyyy 2 reeeeeaaally 2 really/ 1 ree-hee-heally
3 reaallllly 2 reeeeeaaallllly 2 realllyyyyyy …
3 reaaallyy 2 reeeeallyyy 2 reallllyyyyyy 1 reeeeeeeeeaally
3 reaaaallyy 2 reeeeallllyyy 2 realllllyyyyyy 1 reeeeeeeeeaaally
3 reaaaallllly 2 reeeeaaallllyyyy 2 realllllyy 1 reeeeeeeeeaaaaaalllyyy
3 reaaaaaly 2 reeeeaaalllly 2 reallllllyyyyy 1 reeeeeeeeeaaaaaaallllllllyyyyyyyy
3 reaaaaaaaally 2 reeeeaaaally 2 realllllllyyyyy 1 reeeeeeeeeaaaaaaallllllllyyyyyyyy
3 r34lly 2 reeeeaaaalllyyy 2 realllllllyy 1 reeeeeeeeeaaaaaaaaalllllllllyyyyyyyy
2 rrreally 2 reeeallyy 2 reallllllllllllllly 1 reeeeeeeeaaaaaaaalllllyyyyyy
2 rreeaallyy 2 reeallyy 2 reallllllllllllllllly
23 24
1 reallyreallyreallyreallyreallyreallyreallyreallyreallyreally
reallyreallyreallyreallyreallyreallyreally
1 reallyreallyreallyreallyreallyr33lly
1 really/really/really
How many words are there?
1 really(really
… • How many English words exist?
1 reallllllllyyyy
1 realllllllllyyyyyy • When we increase the size of our corpus, what
1 realllllllllyyyyy happens to the number of types?
1 realllllllllyyyy
1 realllllllllyyy
1 reallllllllllyyyyy
1 reallllllllllllyyyyyy
- a bit surprising: vocabulary continues to grow in any actual
1 reallllllllllllllllllly dataset
1 reallllllllllllllllllllly - you’ll just never see all the words
1 reallllllllllllllllllllllyyyyy
1 reallllllllllllllllllllllllllly
- in 1 million tweets, 15M tokens, 600k types
1 realllllllllllllllllllllllllllly - in 56 million tweets, 847M tokens, 11M types
1 reallllllllllllllllllllllllllllllllly
1 reallllllllllllllllllllllllllllllllllllllllllllly
1 reallllllllllllllllllllllllllllllllllllllllllllllllllllllly
1 reallllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
lllllllly
25 26
How many words? How many words?
N = number of tokens • Other languages even worse

- TED corpus
V = vocabulary = set of types
|V| is the size of the vocabulary
Tokens = N Types = |V| Tokens = N Types = |V|

Switchboard phone 2.4 million 20 thousand English 4.2 million 56 K
conversations German 3.9 million 126 K
Shakespeare 884,000 31 thousand Dutch 4.0 million 99 K
Google N-grams 1 trillion 13 million Italian 4.4 million 93 K
Romanian 4.2 million
Church and Gale (1990): |V| > O(N½)
27 28
Frequency of words Frequency of words
• Most frequent words: • Word frequencies

- Mainly function words - Nearly half of the words occur
- Punctuation marks ! Rank%(r) WordCount(f)Word only once
- No content words 1 65600 , - Around 10% more than 10 times
2 49889 .
- First content word: - Around 2% occur more than 100 times
3 23760 und
- Welt (Rank 69)
4 21837 die
5 14117 das
6 13646 ich
7 13623 der
8 13045 ist
9 12095 es
10 11401 wir
29 30
Zipf’s Law Challenge

• order list of words by occurrence • Why is this a problem?
• rank: position in the list
• Calculate probabilities for events/words
- Statistical Modeling:
- Advantage to have data with normal distribution
#words
20000
15000
10000
#words
5000
0
1 2 3 4 5 6 7 8 9 10
31 32
Text Normalization European Languages
• Simplifying the task by preparing the data • Major points
• Every NLP task needs to do text normalization: - Splitting of punctuation marks

- I go home. -> I go home .
1. Segmenting/tokenizing words in running text
- Handling joined words
2. Normalizing word formats - I‘m -> I ‘m
3. Segmenting sentences in running text
33 34
Issues in Tokenization Tokenization: language issues

• French
- L'ensemble ® one token or two?
- L ? L’ ? Le ?
• Finland’s capital ® Finland Finlands Finland’s ?
• what’re, I’m, isn’t ® What are, I am, is not - Want l’ensemble to match with un ensemble
• Hewlett-Packard ® Hewlett Packard ?
• state-of-the-art ® state of the art ? • German noun compounds are not segmented
• Lowercase ® lower-case lowercase lower case ? - Lebensversicherungsgesellschaftsangestellter
• San Francisco ® one token or two?
(the first one to pronounce it loud correctly gets a free coffee)
• m.p.h., PhD. ® ??
à ‘life insurance company employee’
- Approach:
- Compound splitting
35 36
Non- European languages Normalization
• Chinese and Japanese no spaces between words:
• Idea: Map several words to the same word
- 莎拉波娃现在居住在美国东南部的佛罗里达。
- Difference unimportant (less important) for the task
- 莎拉波娃现在居住在美国东南部的佛罗里达
- E.g. Case information at first word of sentence; Digits in
- Sharapova now lives in US southeastern Florida
machine translation; U.S.A. and USA
• Word segmentation model - Task dependent
- Predict wether new word starts after each symbol - Advantage:
- More examples of the same token
• Further complicated in Japanese, with multiple alphabets - Implicitly define equivalence classes
intermingled • Techniques:
- Case folding/True-casing
- Word classes
- Lemmatization/Stemming
37 38
Normalization Case folding/True casing

• Need to “normalize” terms • Applications like IR: reduce all letters to lower case
- Information Retrieval: indexed text & query terms - Since users tend to use lower case
must have same form. - Possible exception: upper case in mid-sentence?
- We want to match U.S.A. and USA - e.g., General Motors
- Fed vs. fed
• We implicitly define equivalence classes of - SAIL vs. sail
terms
- e.g., deleting periods in a term • For sentiment analysis, MT, Information extraction
• Alternative: asymmetric expansion: - Case is helpful (US versus us is important)
- Enter: window Search: window, windows - Normalize case information at beginning of the sentence
- Enter: windows Search: Windows, windows, window - The house was expensive.
- Enter: Windows Search: Windows
- the house was expensive.
• Potentially more powerful, but less efficient
39 40
Word classes Lemmatization
• Reduce inflections or variant forms to base form
• Use classes instead of actual words
- am, are, is ® be
- Numbers
- Places - car, cars, car's, cars' ® car
- Mainly learn mapping for these words • the boy's cars are different colors ® the boy car be
• Applications: different color
- Machine translation
- I need 1 book; I need 2 books; I need 5 books; I need 10101 books • Lemmatization: have to find correct dictionary headword
- I need @NUMBER books
form
- Speech recognition
• Machine translation
- Spanish quiero (‘I want’), quieres (‘you want’) same lemma as
querer ‘want’
41 42
Stemming Should you do lemmatization or stemming?

• Reduce terms to their stems • Depends on your task!
- Without having morphological tools • Stemming is considered very aggressive and many
• Stemming is crude chopping of affixes researchers are negative
- No additional recources necessary - Esp. in the deep learning era, you should keep preprocessing
- e.g., automate(s), automatic, automation all reduced to to a minimum
automat. • Lemmatization is preferred, esp. when you don’t have
large training data
- It helps reducing the dimensionality
for example compressed for exampl compress and
and compression are both compress ar both accept
accepted as equivalent to as equival to compress
compress.
43 44
Removing Spaces? Sentence Segmentation
• !, ? are relatively unambiguous
• Tokenization is usually about adding spaces • Period “.” is quite ambiguous
• But might we also want to remove spaces? - Sentence boundary
- Abbreviations like Inc. or Dr.
• What are some English examples? - Numbers like .02% or 4.3
- names? • Build a binary classifier
- New York à NewYork - Looks at a “.”
- non-compositional compounds? - Decides EndOfSentence/NotEndOfSentence
- hot dog à hotdog
- Classifiers: hand-written rules, regular expressions, or
- other artifacts of our spacing conventions?
machine-learning
- New York-Long Island Railway à ?
45 46
Determining if a word is end-of-sentence: More sophisticated decision tree features

a Decision Tree
• Case of word with “.”: Upper, Lower, Cap, Number
• Case of word after “.”: Upper, Lower, Cap, Number
• Numeric features
- Length of word with “.”
- Probability(word with “.” occurs at end-of-s)
- Probability(word after “.” occurs at beginning-of-s)
47 48
Implementing Decision Trees Basic Text Processing
• or any other classifier: SVM, Logistic Regression, …
• Basic tools always helpful
• The interesting research is choosing the features
- This was a research direction for many, many years
- Regular expressions
• Setting up the structure is often too hard - Minimum edit distance
to do by hand
- Hand-building only possible for very simple features, domains
- For numeric features, it’s too hard to pick each threshold
- Instead, structure usually learned by machine learning from a training corpus
49 50
Regular expressions Regular Expressions: Disjunctions
• A formal language for specifying text strings • Letters inside square brackets []
Pattern Matches
• How can we search for any of these?
[wW]oodchuck Woodchuck, woodchuck
- woodchuck
[1234567890] Any digit
- woodchucks
- Woodchuck
- Woodchucks • Ranges [A-Z]
Pattern Matches
[A-Z] An upper case letter Drenched Blossoms
[a-z] A lower case letter my beans were impatient
[0-9] A single digit Chapter 1: Down the Rabbit Hole
51 52
Regular Expressions: Negation in Disjunction
Regular Expressions: More Disjunction
• Woodchucks is another name for groundhog!
• Negations [^Ss] • The pipe | for disjunction
- Carat means negation only when first in []
Pattern Matches Pattern Matches

[Â-Z] Not an upper case letter Oyfn pripetchik groundhog|woodchuck
[^Ss] Neither ‘S’ nor ‘s’ I have no exquisite reason” yours|mine yours
[ê^] Neither e nor ^ Look here mine
a^b The pattern a carat b Look up a^b now a|b|c = [abc]

[gG]roundhog|[Ww]oodchuck
53 54
Regular Expressions: ? * + . Regular Expressions: Anchors ^ $
Pattern Matches
colou?r Optional color colour
previous char Pattern Matches
oo*h! 0 or more of oh! ooh! oooh! ooooh! ^[A-Z] Palo Alto
previous char ^[Â-Za-z] 1 “Hello”
o+h! 1 or more of oh! ooh! oooh! ooooh! \.$ The end.
previous char
.$ The end? The end!
baa+ baa baaa baaaa baaaaa
beg.n begin begun begun beg3n
55 56
Example Errors
• The process we just went through was based on fixing
• Find me all instances of the word “the” two kinds of errors
- Matching strings that we should not have matched (there,
the then, other)
- False positives (Type I)
Misses capitalized examples - Not matching things that we should have matched (The)
[tT]he - False negatives (Type II)
Incorrectly returns other or theology • In NLP we are always dealing with these kinds of
[â-zA-Z][tT]he[â-zA-Z] errors.
• Reducing the error rate for an application often
Smarter & useful: involves two antagonistic efforts:
\b[tT]he\b - Increasing accuracy or precision (minimizing false positives)
- Increasing coverage or recall (minimizing false negatives).
57 58
Summary How similar are two strings?
• Regular expressions play a surprisingly large role • Spell correction

• Computational Biology
- Sophisticated sequences of regular expressions are - The user typed • Align two sequences of nucleotides
often the first model for any text processing text “graffe” AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
• For many hard tasks, we use machine learning Which is closest?
- graf • Resulting alignment:
classifiers - graft -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
- But regular expressions are used as features in the - grail TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
classifiers - giraffe
- Can be very useful in capturing generalizations
• Also for Machine Translation, Information Extraction, Speech Recognition
59 61
Edit Distance Minimum Edit Distance
• The minimum edit distance between two strings • Two strings and their alignment:
• Is the minimum number of editing operations

- Insertion
- Deletion
- Substitution
• Needed to transform one into the other
62 63
Minimum Edit Distance Alignment in Computational Biology

• Given a sequence of bases
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
• An alignment:
• If each operation has cost of 1 -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
- Distance between these is 5 TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• If substitutions cost 2 (Levenshtein)
- Distance between them is 8
• Given two sequences, align each letter to a letter or gap
64 65
Other uses of Edit Distance in NLP How to find the Min Edit Distance?
• Evaluating Machine Translation and speech recognition
• Searching for a path (sequence of edits) from the start
R Spokesman confirms senior government adviser was shot string to the final string:
H Spokesman said the senior adviser was shot dead - Initial state: the word we’re transforming
S I D I
- Operators: insert, delete, substitute
- Goal state: the word we’re trying to get to
• Named Entity Extraction and Entity Coreference - Path cost: what we want to minimize: the number of edits
- IBM Inc. announced today
- IBM profits
- Stanford President John Hennessy announced yesterday
- for Stanford University President John Hennessy
66 67
Minimum Edit as Search Defining Min Edit Distance

• But the space of all edit sequences is huge! • For two strings
- We can’t afford to navigate naïvely - X of length n
- Lots of distinct paths wind up at the same state. - Y of length m
- We don’t have to keep track of all of them
- Just the shortest path to each of those revisted states. • We define D(i,j)
- the edit distance between X[1..i] and Y[1..j]
- i.e., the first i characters of X and the first j characters of Y
- The edit distance between X and Y is thus D(n,m)
68 69
Dynamic Programming for Defining Min Edit Distance (Levenshtein)
Minimum Edit Distance
• Dynamic programming:
• Initialization
A tabular computation of D(n,m) D(i,0) = i
• Solving problems by combining solutions to D(0,j) = j
• Recurrence Relation: deletion
subproblems. For each i = 1…M
For each j = 1…N insertion
• Bottom-up D(i-1,j) + 1
- We compute D(i,j) for small i,j D(i,j)= min D(i,j-1) + 1
D(i-1,j-1) + 2; if X(i) ≠ Y(j)
- And compute larger D(i,j) based on previously computed 0; if X(i) = Y(j)
smaller values • Termination:
D(N,M) is distance
- i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m)
substitution
70 71
The Edit Distance Table

The Edit Distance Table
N 9
N 9
O 8
O 8
I 7 I 7
T 6 T 6
N 5 N 5
E 4 E 4
T 3 T 3
N 2
N 2
I 1
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
72
Edit Distance The Edit Distance Table
N 9
N 9 8 9 10 11 12 11 10 9 8
O 8
O 8 7 8 9 10 11 10 9 8 9
I 7
I 7 6 7 8 9 10 9 8 9 10
T 6 T 6 5 6 7 8 9 8 9 10 11
N 5 N 5 4 5 6 7 8 9 10 11 10
E 4 E 4 3 4 5 6 7 8 9 10 9
T 3 T 3 4 5 6 7 8 7 8 9 8
N 2 N 2 3 4 5 6 7 8 7 8 7
I 1 I 1 2 3 4 5 6 7 6 7 8
# 0 1 2 3 4 5 6 7 8 9 # 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N # E X E C U T I O N
74
Computing alignments MinEdit with Backtrace

• Edit distance isn’t sufficient
- We often need to align each character of the two strings to
each other
• We do this by keeping a “backtrace”
• Every time we enter a cell, remember where we came
from
• When we reach the end,
- Trace back the path from the upper right corner to read off
the alignment
76 78
Adding Backtrace to Minimum Edit Distance The Distance Matrix
• Base conditions: Termination:

D(i,0) = i D(0,j) = j D(N,M) is distance
x0 …………………… xN
Every non-decreasing path
• Recurrence Relation:
For each i = 1…M
For each j = 1…N from (0,0) to (M, N)
D(i-1,j) + 1 deletion
D(i,j)= min D(i,j-1) + 1 insertion
corresponds to
D(i-1,j-1) + 2; if X(i) ≠ Y(j) substitution
0; if X(i) = Y(j) an alignment
LEFT insertion of the two sequences
ptr(i,j)= DOWN deletion
DIAG
substitution An optimal alignment is composed
y0 ……………………………… yM of optimal subalignments
79 80
Result of Backtrace Performance

• Two strings and their alignment: • Time:
O(nm)
• Space:
O(nm)
• Backtrace
O(n+m)
81 82
Weighted Edit Distance
Confusion matrix for spelling errors
• Why would we add weights to the computation?
- Spell Correction: some letters are more likely to be mistyped
than others
- Biology: certain kinds of deletions or insertions are more
likely than others
83 84
Weighted Min Edit Distance
• Initialization:
D(0,0) = 0
D(i,0) = D(i-1,0) + del[x(i)]; 1 < i ≤ N
D(0,j) = D(0,j-1) + ins[y(j)]; 1 < j ≤ M
• Recurrence Relation:
D(i-1,j) + del[x(i)]
D(i,j)= min D(i,j-1) + ins[y(j)]
D(i-1,j-1) + sub[x(i),y(j)]
• Termination:
D(N,M) is distance
85 86
Summary
Basic tools
Morphology
Data preparation
88

KEN2570 2 Morphology

Uploaded by

Copyright:

Available Formats

You might also like

KEN2570 2 Morphology

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

KEN2570 2 Morphology

Uploaded by

Copyright:

Available Formats

KEN2570 Words, Words, Words

Natural Language Processing

Text levels Basic building blocks

What is a word? Morphology

Morphology Morphology Specialities

• Available for well-resourced languages

How many words/tokens? Type/Token Ratio

• How will the type/token ratio change when adding

0.3 English - English Wikipedia

Wikipedia - Simple English Wikipedia

Type/Token Ratio Type/Token Ratio

0.25 ratio 0.25

“really” on Twitter 1 rrrrrrrrrrrrrrrreeeeeeeeeeeaaaaaaalllllllyyyyyy

How many words? How many words?

N = number of tokens • Other languages even worse

Tokens = N Types = |V| Tokens = N Types = |V|

Church and Gale (1990): |V| > O(N½)

• Most frequent words: • Word frequencies

Zipf’s Law Challenge

• Simplifying the task by preparing the data • Major points

• Every NLP task needs to do text normalization: - Splitting of punctuation marks

Issues in Tokenization Tokenization: language issues

Normalization Case folding/True casing

Stemming Should you do lemmatization or stemming?

Determining if a word is end-of-sentence: More sophisticated decision tree features

Regular expressions Regular Expressions: Disjunctions

Pattern Matches Pattern Matches

a^b The pattern a carat b Look up a^b now a|b|c = [abc]

Regular Expressions: ? * + . Regular Expressions: Anchors ^ $

Summary How similar are two strings?

• Regular expressions play a surprisingly large role • Spell correction

• Is the minimum number of editing operations

• Needed to transform one into the other

Minimum Edit Distance Alignment in Computational Biology

Minimum Edit as Search Defining Min Edit Distance

The Edit Distance Table

Computing alignments MinEdit with Backtrace

• Base conditions: Termination:

Result of Backtrace Performance

Weighted Min Edit Distance

You might also like