What are words?

Words are the fundamental building block of language. Every human language, spoken, signed, or written, is composed of words. Every area of speech and language processing, from speech recognition to machine translation to information retrieval on the web, requires extensive knowledge about words. Psycholinguistic models of human language processing and models from generative linguistics are also heavily based on lexical knowledge.

Words are orthographic tokens separated by white space. In some languages the distinction between words and sentences is less clear.

Chinese, Japanese: no white space between words
nowhitespace -> no white space / no whites pace / nowhit esp ace

Turkish: a word can represent a complete "sentence"
E.g.: uygarlastiramadiklarimizdanmissinizcasina
"(behaving) as if you are among those whom we could not civilize"

Consider, for example, unhappiness: the stem is happy and the affixes are un- and -ness. Un- means "not", while -ness means "being in a state or condition". Happy is a free morpheme because it can appear on its own (as a "word" in its own right). Bound morphemes have to be attached to a free morpheme, and so cannot be words in their own right. Thus, you cannot have sentences in English such as "Jason feels very un ness today".

Bound morphemes: These are lexical items incorporated into a word as a dependent part. They cannot stand alone, but must be connected to another morpheme. Bound morphemes take part in word formation by means of derivation, inflection, and compounding. Free morphemes, on the other hand, are autonomous, can occur on their own and are thus also words at the same time. Technically, bound morphemes and free morphemes are said to differ in terms of their distribution or freedom of occurrence. As a rule, lexemes consist of at least one free morpheme.
Morphology deals with the syntax of complex words and parts of words, also called morphemes, as well as with the semantics of their lexical meanings. Understanding how words are formed and what semantic properties they convey through their forms enables human beings to easily recognize individual words and their meanings in discourse. Morphology also looks at parts of speech, intonation and stress, and the ways context can change a word's pronunciation and meaning.

Morphology is the study of the structure and formation of words. Its most important unit is the morpheme, which is defined as the "minimal unit of meaning". Consider a word like "unhappiness". This has three parts:

morphemes: un (prefix) + happy (stem) + ness (suffix)

Morphology handles the formation of words by combining morphemes: a base form (stem, lemma), e.g. believe, and affixes (suffixes, prefixes, infixes), e.g. un-, -able, -ly. Morphological parsing is the task of recognizing the morphemes inside a word (e.g. hands, foxes, children). It is important for many tasks like machine translation and information retrieval, and useful in parsing, text simplification, etc.

Survey of English Morphology

Morphology is the study of the way words are built up from smaller meaning-bearing units, morphemes. A morpheme is often defined as the minimal meaning-bearing unit in a language. So for example the word fox consists of a single morpheme (the morpheme fox) while the word cats consists of two: the morpheme cat and the morpheme -s. As this example suggests, it is often useful to distinguish two broad classes of morphemes: stems and affixes. The exact details of the distinction vary from language to language, but intuitively, the stem is the "main" morpheme of the word, supplying the main meaning, while the affixes add "additional" meanings of various kinds. Affixes are further divided into prefixes, suffixes, infixes, and circumfixes. Prefixes precede the stem, suffixes follow the stem, circumfixes do both, and infixes are inserted inside the stem. For example, the word eats is composed of a stem eat and the suffix -s. The word unbuckle is composed of a stem buckle and the prefix un-. English doesn't have any good examples of circumfixes,
but many other languages do. In German, for example, the past participle of some verbs is formed with a circumfix (ge- ... -t).

Derivational morphology uses affixes to create new words: a word stem is combined with a grammatical morpheme, usually resulting in a word of a different class. Inflectional morphology, by contrast, is the topic of the next section.

INFLECTIONAL MORPHOLOGY
Inflection is a morphological process that adapts existing words so that they function effectively in sentences without changing the category of the base morphemes. "We may define inflectional morphology as the branch of morphology that deals with paradigms. It is therefore concerned with two things: on the one hand, with the semantic oppositions among categories; and on the other, with the formal means, including inflections, that distinguish them." (Matthews, 1991).

Inflection can be seen as the "realization of morphosyntactic features through morphological means". But what exactly does that mean? It means that the inflectional morphemes of a word do not have to be listed in a dictionary, since we can guess their meaning from the root word: inflection changes the word form and determines the grammar, but it does not form a new lexeme, only a variant of a lexeme that does not need its own entry in the dictionary. We know, when we see the word, what it connects to, and most times we can even guess how it differs from its original. For example, consider help-s, help-ed and help-er. According to what has been said about words listed in the dictionary, all of these variants might be inflectional morphemes, but does help-s really need an extra listing, or can we guess from the root help and the suffix -s what it means? Does our natural feeling and instinct for language not tell us that the suffix -s indicates the third person singular, and that help-s therefore is only a variant of help (considering help as a verb and not a noun here)? Yes, it does. As a native speaker one instantly knows that -s, like the past-tense marker -ed, only signals a grammatical variant of the root lexeme help.

So why is help-er, or even help-less-ness, different? The answer is actually very simple. The suffixes in these last two words change the word class and therefore form a new lexeme. Help-er can now be the new root for additional suffixes such as help-er-s, which would then be an inflectional morpheme again, the root being the smallest free morpheme left after you remove all affixes.

After establishing this, we still have a problem if we consider the word help as a noun and as a verb. How do we distinguish these two? The answer is context, and the phenomenon is called a zero morpheme. Only through context can we say whether help is a verb or a noun.

Inflection attaches grammatical morphemes to a word stem (e.g. cat + s), and in English it applies only to nouns, verbs, and some adjectives.

Nouns
plural: regular: +s, +es; irregular: mouse - mice, ox - oxen; many spelling rules, e.g. -y -> -ies as in butterfly - butterflies
possessive: +'s, +'

Verbs
main verbs (sleep, eat, walk)
modal verbs (can, will, should)
primary verbs (be, have, do)

VERB INFLECTIONAL SUFFIXES

1. The suffix -s functions in the present simple as the third-person marking of the verb: to work — he work-s

2. The suffix -ed functions in the past simple as the past-tense marker in regular verbs: to love — lov-ed
3. The suffix -en (for some irregular verbs) functions as the past-participle marker: to eat — eat-en
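To make the pattern concrete, here is a small illustrative sketch (not from the original text; the tiny irregular-verb table and function names are invented) that attaches these inflectional suffixes to a verb stem:

```python
# Minimal sketch: attach the regular verb inflectional suffixes discussed above.
# The irregular past-participle table below is a tiny invented example.
IRREGULAR_PAST_PARTICIPLE = {"eat": "eaten", "take": "taken"}

def third_person_singular(verb):
    return verb + "es" if verb.endswith(("s", "sh", "ch", "x", "z")) else verb + "s"

def past_tense(verb):
    return verb + "d" if verb.endswith("e") else verb + "ed"

def past_participle(verb):
    return IRREGULAR_PAST_PARTICIPLE.get(verb, past_tense(verb))

print(third_person_singular("work"))  # works
print(past_tense("love"))             # loved
print(past_participle("eat"))         # eaten
```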
The result of this mapping of text will be something like:

the boy's cars are different colors -> the boy car be differ color

However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Stemming is definitely the simpler of the two approaches. With stemming, words are reduced to their word stems. A word stem need not be the same root as a dictionary-based morphological root; it is just an equal or smaller form of the word.

Stemming algorithms are typically rule-based. You can view them as heuristic processes that sort-of lop off the ends of words. A word is looked at and run through a series of conditionals that determine how to cut it down. For example, we may have a suffix rule that, based on a list of known suffixes, cuts them off. In the English language, we have suffixes like "-ed" and "-ing" which may be useful to cut off in order to map the words "cook," "cooking," and "cooked" all to the same stem of "cook."

Over stemming comes from when too much of a word is cut off. This can result in nonsensical stems, where all the meaning of the word is lost or muddled. Or it can result in words being resolved to the same stems, even though they probably should not be. Take the four words university, universal, universities, and universe. A stemming algorithm that resolves these four words to the stem "univers" has over stemmed. While it might be nice to have universal and universe stemmed together and university and universities stemmed together, all four do not fit. A better resolution might have the first two resolve to "univers" and the latter two resolve to "universi." But enforcing rules that make that so might result in more issues arising.

Under stemming is the opposite issue. It comes from when we have several words that actually are forms of one another. It would be nice for them all to resolve to the same stem, but unfortunately, they do not. This can be seen if we have a stemming algorithm that stems the words data and datum to "dat" and "datu."

Stemming Algorithms Examples

Porter stemmer: This stemming algorithm is an older one. It's from the 1980s and its main concern is removing the common endings of words so that they can be resolved to a common form. It's not too complex and development on it is frozen. Typically, it's a nice starting basic stemmer, but it's not really advised for any production/complex application. Instead, it has its place in research as a nice, basic stemming algorithm that can guarantee reproducibility. It also is a very gentle stemming algorithm when compared to others.

Snowball stemmer: This algorithm is also known as the Porter2 stemming algorithm. It is almost universally accepted as better than the Porter stemmer, even being acknowledged as such by the individual who created the Porter stemmer. That being said, it is also more aggressive than the Porter stemmer. A lot of the things added to the Snowball stemmer were because of issues noticed with the Porter stemmer. There is about a 5% difference in the way that Snowball stems versus Porter.

Lancaster stemmer: Just for fun, the Lancaster stemming algorithm is another algorithm that you can use. This one is the most aggressive stemming algorithm of the bunch. However, if you use the stemmer in NLTK, you can add your own custom rules to this algorithm very easily. It's a good choice for that. One complaint about this stemming algorithm, though, is that it sometimes is overly aggressive and can really transform words into strange stems. Just make sure it does what you want it to before you go with this option!

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. Lemmatization is a more calculated process. It involves resolving words to their dictionary form. In fact, a lemma of a word is its dictionary or canonical form.

Because lemmatization is more nuanced in this respect, it requires a little more to actually make it work. For lemmatization to resolve a word to its lemma, it needs to know its part of speech. That requires extra computational linguistics power, such as a part-of-speech tagger. This allows it to do better resolutions (like resolving is and are to "be").

Another thing to note about lemmatization is that it is often harder to create a lemmatizer for a new language than a stemming algorithm.
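As an illustrative sketch (assuming NLTK is installed and the WordNet data has been downloaded), the three stemmers described above can be compared side by side with a WordNet lemmatizer:

```python
# Minimal sketch comparing the stemmers discussed above (assumes NLTK is
# installed and the 'wordnet' corpus has been fetched via nltk.download("wordnet")).
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, WordNetLemmatizer

words = ["cooking", "cooked", "universities", "universal", "caresses", "ponies"]

porter = PorterStemmer()
snowball = SnowballStemmer("english")   # also known as Porter2
lancaster = LancasterStemmer()          # the most aggressive of the three
lemmatizer = WordNetLemmatizer()

for w in words:
    print(w,
          porter.stem(w),
          snowball.stem(w),
          lancaster.stem(w),
          lemmatizer.lemmatize(w, pos="n"))  # lemmatization needs a POS tag
```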
Deterministic finite state automaton

A deterministic finite automaton (DFA) consists of:
1. A finite set of states, often denoted Q
2. A finite set of input symbols, often denoted Σ
3. A transition function that takes as arguments a state and an input symbol and returns a state. The transition function is commonly denoted δ. If q is a state and a is an input symbol, then δ(q, a) is a state.

Transition diagrams
• Each state is a node
• For each state q ∈ Q and each symbol a ∈ Σ, let δ(q, a) = p
• Then the transition diagram has an arc from q to p, labeled a
• There is an arrow to the start state q0
• Nodes corresponding to final states are marked with a double circle

Transition tables
• Tabular representation of a function
• The rows correspond to the states and the columns to the input symbols
• The entry for the row corresponding to state q and the column corresponding to input a is the state δ(q, a)

Example: A = ({q0, q1, q2}, {0, 1}, δ, q0, {q1}), where the transition function δ is given by the table

          0     1
-> q0     q2    q0
 * q1     q1    q1
   q2     q2    q1

Extended transition function

Describes what happens when we start in any state and follow any sequence of inputs. If δ is our transition function, then the extended transition function is denoted by δ̂. The extended transition function is a function that takes a state q and a string w and returns a state p (the state that the automaton reaches when starting in state q and processing the sequence of inputs w).

Formal definition of the extended transition function
• Definition by induction on the length of the input string

Basis: δ̂(q, ε) = q
If we are in a state q and read no inputs, then we are still in state q.

Induction: Suppose w is a string of the form xa; that is, a is the last symbol of w, and x is the string consisting of all but the last symbol. Then:

δ̂(q, w) = δ(δ̂(q, x), a)

To compute δ̂(q, w), first compute δ̂(q, x), the state that the automaton is in after processing all but the last symbol of w. Suppose this state is p, i.e., δ̂(q, x) = p. Then δ̂(q, w) is what we get by making a transition from state p on input a, the last symbol of w.
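The extended transition function translates directly into a short simulation loop. The sketch below is a minimal illustration (the dictionary encoding and function names are choices made here, not part of the text) that runs the example DFA A above on an input string:

```python
# Minimal sketch: simulate the example DFA A = ({q0,q1,q2}, {0,1}, delta, q0, {q1}),
# which accepts exactly the strings of 0s and 1s that contain the substring 01.
delta = {
    ("q0", "0"): "q2", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q1",
    ("q2", "0"): "q2", ("q2", "1"): "q1",
}
start, accepting = "q0", {"q1"}

def extended_delta(q, w):
    """delta-hat: process the whole string w starting from state q."""
    for symbol in w:
        q = delta[(q, symbol)]
    return q

def accepts(w):
    return extended_delta(start, w) in accepting

print(accepts("1100101"))  # True  (contains "01")
print(accepts("1110"))     # False (no "01" substring)
```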
Exercise: Design a DFA to accept the language L = {w | w has both an even number of 0s and an even number of 1s}.

The Language of a DFA

The language of a DFA A = (Q, Σ, δ, q0, F), denoted L(A), is defined by

L(A) = {w | δ̂(q0, w) is in F}

The language of A is the set of strings w that take the start state q0 to one of the accepting states. If L is L(A) for some DFA, then L is a regular language.

Nondeterministic Finite Automata (NFA)

An NFA has the power to be in several states at once. This ability is often expressed as an ability to "guess" something about its input. Each NFA accepts a language that is also accepted by some DFA. NFAs are often more succinct and easier to design than DFAs. We can always convert an NFA to a DFA, but the latter may have exponentially more states than the NFA (a rare case). The difference between the DFA and the NFA is the type of transition function δ. For an NFA, δ is a function that takes a state and an input symbol as arguments (like the DFA transition function), but returns a set of zero or more states (rather than returning exactly one state, as the DFA must).

NFA: Formal definition

A nondeterministic finite automaton (NFA) is a tuple A = (Q, Σ, δ, q0, F) where:
1. Q is a finite set of states
2. Σ is a finite set of input symbols
3. q0 ∈ Q is the start state
4. F (F ⊆ Q) is the set of final or accepting states
5. δ, the transition function, is a function that takes a state in Q and an input symbol in Σ as arguments and returns a subset of Q

The only difference between an NFA and a DFA is in the type of value that δ returns.

Example: An NFA accepting strings that end in 01

A nondeterministic automaton that accepts all and only the strings of 0s and 1s that end in 01:

A = ({q0, q1, q2}, {0, 1}, δ, q0, {q2})

where the transition function δ is given by the table

           0          1
-> q0   {q0, q1}    {q0}
   q1   ∅           {q2}
 * q2   ∅           ∅

The Extended Transition Function (for an NFA)

Basis: δ̂(q, ε) = {q}
Without reading any input symbols, we are only in the state we began in.

Induction:
Suppose w is a string of the form xa; that is, a is the last symbol of w, and x is the string consisting of all but the last symbol. If δ̂(q, x) = {p1, p2, ..., pk}, then

δ̂(q, w) = δ(p1, a) ∪ δ(p2, a) ∪ ... ∪ δ(pk, a)

That is, we first compute the set of states reachable on x, and then follow the last symbol a from each of those states.
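A sketch of the same idea for the NFA above: instead of a single current state, we carry the set of states the automaton could be in (the dictionary encoding below is an illustrative choice, not part of the text):

```python
# Minimal sketch: simulate the NFA accepting strings that end in 01 by tracking
# the set of states it could be in (the NFA extended transition function).
delta = {
    ("q0", "0"): {"q0", "q1"}, ("q0", "1"): {"q0"},
    ("q1", "0"): set(),        ("q1", "1"): {"q2"},
    ("q2", "0"): set(),        ("q2", "1"): set(),
}
start, accepting = "q0", {"q2"}

def extended_delta(q, w):
    """delta-hat for an NFA: the set of states reachable from q after reading w."""
    states = {q}
    for symbol in w:
        states = set().union(*[delta[(s, symbol)] for s in states])
    return states

def accepts(w):
    return bool(extended_delta(start, w) & accepting)

print(accepts("1100"))  # False (ends in 00)
print(accepts("0101"))  # True  (ends in 01)
```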
An initial hypothesis might be that adjectives can have an optional suffix (-er, -est, or -ly), and similar networks can be drawn for derivational affixes such as -ize/V and -ation/N, with separate classes for regular and irregular nouns. All of these have applications in speech and language processing. For morphological parsing, the guiding metaphor is a machine that takes as input a string of letters and produces as output a string of morphemes; word recognition with an FSA proceeds by comparing the input letter by letter with the word on each outgoing arc, and so on.
Finite-State Transducers

We've now seen that FSAs can represent the morphotactic structure of a lexicon and can be used for word recognition. A transducer maps between one representation and another: a finite-state transducer or FST is a type of finite automaton which maps between two sets of symbols. We can visualize an FST as a two-tape automaton which recognizes or generates pairs of strings. Intuitively, we can do this by labeling each arc in the finite-state machine with two symbol strings, one from each tape. Fig. 9 shows an example of an FST where each arc is labeled by an input and output string, separated by a colon.

Figure 9: A finite-state transducer

The FST thus has a more general function than an FSA; where an FSA defines a formal language by defining a set of strings, an FST defines a relation between sets of strings. Another way of looking at an FST is as a machine that reads one string and generates another. Here's a summary of this four-fold way of thinking about transducers:

FST as recognizer: a transducer that takes a pair of strings as input and outputs accept if the string-pair is in the string-pair language, and reject if it is not.
FST as generator: a machine that outputs pairs of strings of the language. Thus the output is a yes or no, and a pair of output strings.
FST as translator: a machine that reads a string and outputs another string.
FST as set relater: a machine that computes relations between sets.

An FST can be formally defined in a number of ways; we will rely on the following definition, based on what is called the Mealy machine extension to a simple FSA:

Q: a finite set of states q0, q1, ..., qN−1
Σ: a finite set corresponding to the input alphabet
Δ: a finite set corresponding to the output alphabet
q0 ∈ Q: the start state
F ⊆ Q: the set of final states
δ(q, w): the transition function or transition matrix between states. Given a state q ∈ Q and a string w ∈ Σ*, δ(q, w) returns a set of new states Q' ⊆ Q. δ is thus a function from Q × Σ* to 2^Q (because there are 2^Q possible subsets of Q). δ returns a set of states rather than a single state because a given input may be ambiguous in which state it maps to.
σ(q, w): the output function giving the set of possible output strings for each state and input. Given a state q ∈ Q and a string w ∈ Σ*, σ(q, w) gives a set of output strings, each a string o ∈ Δ*. σ is thus a function from Q × Σ* to 2^(Δ*).

Where FSAs are isomorphic to regular languages, FSTs are isomorphic to regular relations. Regular relations are sets of pairs of strings, a natural extension of the regular languages, which are sets of strings. Like FSAs and regular languages, FSTs and regular relations are closed under union, although in general they are not closed under difference, complementation and intersection (although some useful subclasses of FSTs are closed under these operations; in general, FSTs that are not augmented with ε are more likely to have such closure properties). Besides union, FSTs have two additional closure properties that turn out to be extremely useful:

inversion: The inversion of a transducer T (T−1) simply switches the input and output labels. Thus, if T maps from the input alphabet I to the output alphabet O, T−1 maps from O to I.

composition: If T1 is a transducer from I1 to O1 and T2 a transducer from O1 to O2, then T1 ∘ T2 maps from I1 to O2.
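To make the FST-as-translator view concrete, here is a minimal sketch; the toy lexicon (just cat and its plural) and the state names are invented for illustration and are not the transducer of Figure 9:

```python
# Minimal sketch of the FST-as-translator view: each arc is labeled input:output.
# The tiny "cat" lexicon below is invented for illustration only.
transitions = {
    ("q0", "c"): ("q1", "c"),
    ("q1", "a"): ("q2", "a"),
    ("q2", "t"): ("q3", "t"),
    ("q3", "s"): ("q4", "+N+PL"),   # surface -s maps to the morphemes +N +PL
    ("q3", ""):  ("q5", "+N+SG"),   # epsilon-input arc for the singular reading
}
final_states = {"q4", "q5"}

def transduce(word):
    """Read the word symbol by symbol, emitting the output label of each arc taken."""
    state, output = "q0", []
    for ch in word:
        state, out = transitions[(state, ch)]
        output.append(out)
    # allow one final epsilon-input arc (e.g. the singular analysis of "cat")
    if state not in final_states and (state, "") in transitions:
        state, out = transitions[(state, "")]
        output.append(out)
    return "".join(output) if state in final_states else None

print(transduce("cats"))  # cat+N+PL
print(transduce("cat"))   # cat+N+SG
```

This toy transducer is deterministic; a full morphological FST would, as the formal definition above allows, return sets of states and output strings when an input is ambiguous.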
The same machinery can turn an FST-as-parser into an FST-as-generator. Such spelling rules relate three levels of representation: lexical, intermediate, and surface. So, for example, we could write an E-insertion rule that performs the mapping from the intermediate to the surface level shown in Figure 15; Figure 17 shows the transition table for the rule, which makes the illegal transitions explicit with the '#' symbol.

A simple stemming rule of this kind is: ING -> ε if the stem contains a vowel (e.g. motoring -> motor).

Do stemmers really improve the performance of information retrieval engines?
A word, or part of a word, has one of the forms

CVCV ... C
CVCV ... V
VCVC ... C
VCVC ... V

These may all be represented by the single form

[C]VCVC ... [V]

where the square brackets denote arbitrary presence of their contents. Using (VC)^m to denote VC repeated m times, this may again be written as

[C](VC)^m[V].

m will be called the measure of any word or word part when represented in this form. The case m = 0 covers the null word. Here are some examples:

m=0    TR, EE, TREE, Y, BY.
m=1    TROUBLE, OATS, TREES, IVY.
m=2    TROUBLES, PRIVATE, OATEN, ORRERY.

The rules for removing a suffix will be given in the form

(condition) S1 -> S2

This means that if a word ends with the suffix S1, and the stem before S1 satisfies the given condition, S1 is replaced by S2. The condition is usually given in terms of m, e.g.

(m > 1) EMENT ->

Here S1 is 'EMENT' and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2.

The 'condition' part may also contain the following:

*S   - the stem ends with S (and similarly for the other letters).
*v*  - the stem contains a vowel.
*d   - the stem ends with a double consonant (e.g. -TT, -SS).
*o   - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).

And the condition part may also contain expressions with and, or and not, so that

(m>1 and (*S or *T))

tests for a stem with m>1 ending in S or T, while

(*d and not (*L or *S or *Z))

tests for a stem ending with a double consonant other than L, S or Z. Elaborate conditions like this are required only rarely.

In a set of rules written beneath each other, only one is obeyed, and this will be the one with the longest matching S1 for the given word. For example, with

SSES -> SS
IES  -> I
SS   -> SS
S    ->

(here the conditions are all null) CARESSES maps to CARESS since SSES is the longest match for S1. Equally CARESS maps to CARESS (S1='SS') and CARES to CARE (S1='S').

In the rules below, examples of their application, successful or otherwise, are given on the right in lower case. The algorithm now follows:

Step 1a
SSES -> SS          caresses  -> caress
IES  -> I           ponies    -> poni
                    ties      -> ti
SS   -> SS          caress    -> caress
S    ->             cats      -> cat

Step 1b
(m>0) EED -> EE     feed      -> feed
                    agreed    -> agree
(*v*) ED  ->        plastered -> plaster
                    bled      -> bled
(*v*) ING ->        motoring  -> motor
                    sing      -> sing

If the second or third of the rules in Step 1b is successful, the following is done:
AT -> ATE           conflat(ed) -> conflate
...

The later steps of the algorithm follow the same pattern, e.g. (m>0) ABLI -> ABLE (conformabli -> conformable).
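Two pieces of the algorithm above translate directly into code. The sketch below is an illustration, not Porter's reference implementation (in particular it ignores the special treatment of y): it computes the measure m and applies the longest-match rules of Step 1a.

```python
# Minimal sketch of two pieces of the Porter stemmer described above:
# the measure m of a word, and the longest-match rules of Step 1a.
import re

def measure(stem):
    """m = number of VC sequences when the stem is written as [C](VC)^m[V]."""
    # Treat a/e/i/o/u as vowels; this sketch ignores Porter's handling of 'y'.
    form = re.sub(r"[aeiou]+", "V", stem.lower())
    form = re.sub(r"[^V]+", "C", form)
    return form.count("VC")

def step_1a(word):
    """Apply the first matching rule; rules are tried longest suffix first."""
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for s1, s2 in rules:
        if word.endswith(s1):
            return word[: len(word) - len(s1)] + s2
    return word

print(measure("tree"), measure("trouble"), measure("private"))  # 0 1 2
print(step_1a("caresses"), step_1a("ponies"), step_1a("cats"))  # caress poni cat
```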
eel
And so, when you use a bigram model to predict the conditional probability of the next
—__j
N-gram language model word, you are thus making the following approximation:
ieg
are the type of models that assign Probabilit
Statistical language models, in its essence,
n-1
P(w,, |W { ) ~ P(w,, Wa)
nd simplest model that Assigns
to the sequencesof words. In thisarticle, we'll understa the
am This assumption that the probability of a word depends only on the previous word is also
probabilities to sentences and sequencesof words, the n-gr
known as Markov assumption.
You can think of an N-gram as the sequence of N words, by that notion, a 2-gram
(or bigram) is a two-word sequence of wordslike “please turn’, “turn your’, or "Vour Markov models are the class of probabilistic models that assumethat we canpredict the
homework”, and a 3-gram (or trigram) is a three-word sequence of words like “please probability of some future unit without looking too far in the past. The Markov assumption
turn your”, or “turn your homework” is the presumption that the future behavior of a dynamical system only depends onits
recent history. In particular, in a kth-order Markov model, the next state only depends
Let’s start with equation P(w\|h), the probability of word w, given some history, h. For
on the k most recent states, therefore an N-gram modelis a (N-1 )-order Markov model.
example,
Maximum Likelihood Estimate
P(Thelits water is so transparentthat)
_ count(W;_,,W;)
Here, P(w, lw.) =
(w;|W.-1) count(w;_;)
w = The
h = its water is so transparent that c(w; |W;_1)
And, one wayto estimate the above probability function is through the relative frequency i-1
count approach, where you would take a substantially large corpus, count the number Example
of times you seeits water is so transparentthat, and then count the
number of times itis ’ | <s> lam Sam </s>
followed bythe. In other words, you are answering C(W; |W;_
the question: P(w; Wi =o)
, <s> | am Sam </s>
Outof the times you saw the history h, how manyt i-1
imes did the word w follow it <s> | do not like green eggs and ham </s>
P(the|
. its water is so transparent that) = C(its wateris so
transparent that the)/C(its water 2 1 2
IS SO transparent that) P(l|<s>) == =.67 P(Sam|<s>) =37 33 P(amll) =37 .67
Now, you can imagine it is not feasible to
A perform this over an enti
ntire co ; i
eciall 1 1 1
if itis of a significant size rpus, @5P Y -P(</s >|Sam) = 37 0.5 P(Samlam) = 5 5 P(do| |) = a7 33
This shortcoming and ways to deco
mposethe probability function using
serves as the base intuition of the chain rule You can further generalize the bigram model to the trigram model which looks like two
the N-gram model. Here, you, instead
probability using the entire Corp of computing words into the past and can thus befurther generalized to the N-gram model. Basically,
us, would approximat € it by just
a few historical words an N-gram model predicts the occurrence of a word based on the occurrenceof its N—- 1
The Bigram Model
previous words. An N-gram model uses the previous N-1 words to predict the next one
P(wn | wn-N+1 wn-N+2... wn-1 )
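A minimal sketch of these maximum likelihood estimates, computed over the same three-sentence toy corpus (the helper name p is an arbitrary choice for illustration):

```python
# Minimal sketch: maximum likelihood bigram estimates over the toy corpus above.
from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(word, prev):
    """P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("I", "<s>"))    # ≈ 0.67
print(p("Sam", "<s>"))  # ≈ 0.33
print(p("am", "I"))     # ≈ 0.67
```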
The number of parameters required grows exponentially with the number of words of prior context. Compare, for P(phone | Please turn off your cell):

Unigram: P(phone)
Bigram: P(phone | cell)
Trigram: P(phone | your cell)

Another example, for the sentence the boy is chasing the big dog:

unigrams: P(dog)
bigrams: P(dog | big)
trigrams: P(dog | the big)
quadrigrams: P(dog | chasing the big)

Assume a language has T word types in its lexicon; how likely is word x to follow word y? The simplest model of word probability is 1/T. Alternative 1: estimate the likelihood of x occurring in new text based on its general frequency of occurrence estimated from a corpus (unigram probability); popcorn is more likely to occur than unicorn. Alternative 2: condition the likelihood of x occurring on the context of previous words (bigrams, trigrams, ...), estimating the probability of each word given its prior context; mythical unicorn is more likely than mythical popcorn.

So here we are answering the question: how far back in the history of a sequence of words should we go to predict the next word? For instance, a bigram model (N = 2) predicts the occurrence of a word given only its previous word (as N − 1 = 1 in this case). Similarly, a trigram model (N = 3) predicts the occurrence of a word based on its previous two words (as N − 1 = 2 in this case). In general, this probability is (the number of times the previous word wp occurs before the word wn) / (the total number of times the previous word wp occurs in the corpus), i.e. Count(wp wn) / Count(wp).

Let us see a way to assign a probability to a word occurring next in a sequence of words. First of all, we need a very large sample of English sentences (called a corpus). For the purpose of our example, we'll consider a very small sample of sentences, but in reality a corpus will be extremely large. Say our corpus contains the following sentences:

1. He said thank you.
2. He said bye as he walked through the door.
3. He went to San Diego.
4. San Diego has nice weather.
5. It is raining in San Francisco.

Let's work this out with an example. To find the probability of the word "you" following the word "thank", we can write this as P(you | thank), which is a conditional probability. This becomes equal to:

= (No. of times "Thank You" occurs) / (No. of times "Thank" occurs)
= 1/1
= 1

We can say with certainty that whenever "Thank" occurs, it will be followed by "You" (this is because we have trained on a set of only five sentences and "Thank" occurred only once, in the context of "Thank You"). Let's see an example of a case when the preceding word occurs in different contexts.

Let's calculate the probability of the word "Diego" coming after "San". We want to find P(Diego | San). This means that we are trying to find the probability that the next word will be "Diego" given the word "San". We can do this by:

= (No. of times "San Diego" occurs) / (No. of times "San" occurs)
= 2/3
= 0.67

This is because in our corpus, one of the three preceding "San"s was followed by "Francisco". So, P(Francisco | San) = 1/3. In our corpus, only "Diego" and "Francisco" occur after "San", with probabilities 2/3 and 1/3 respectively. So if we want to create next-word prediction software based on our corpus, and a user types in "San", we will give two options: "Diego" ranked most likely and "Francisco" ranked less likely.

Generally, the bigram model works well and it may not be necessary to use trigram models or higher N-gram models.
Challenges

There are, of course, challenges, as with every modeling approach and estimation method. Let's look at the key ones affecting the N-gram model, as well as the use of Maximum Likelihood Estimation (MLE).

Sparse data: not all N-grams are found in the training data; we need smoothing.
Change of domain: train on WSJ, attempt to identify Shakespeare; it won't work.
Sensitivity to the training corpus: the N-gram model, like many statistical models, is significantly dependent on the training corpus. As a result, the probabilities often encode particular facts about a given training corpus. Besides, the performance of the N-gram model varies with the change in the value of N.

Moreover, you may have a language task in which you know all the words that can occur, and hence you know the vocabulary size V in advance. This closed vocabulary assumption assumes there are no unknown words, which is unlikely in practical scenarios.

Smoothing

A notable problem with the MLE approach is sparse data. Any N-gram that appeared a sufficient number of times might have a reasonable estimate for its probability, but because any corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it. As a result, the N-gram matrix for any training corpus is bound to have a substantial number of cases of putative "zero probability N-grams".

N-gram models for spelling correction

Let's define the job of a spell checker and an auto corrector. A word needs to be checked for spelling correctness and corrected if necessary, many a time in the context of the surrounding words. A spellchecker points to spelling errors and possibly suggests alternatives. An auto corrector usually goes a step further and automatically picks the most likely word; in case the correct word has already been typed, it is retained. So, in practice, an auto correct is a bit more aggressive than a spellchecker, but this is more of an implementation detail: tools allow you to configure the behaviour. There is not much difference between the two in theory, so the discussion below applies to both.

There are different types of spelling errors. Let's try to classify them a bit formally. Note that the classification below is for the purpose of illustration, and is not a rigorous linguistic approach to classifying spelling errors.

Non-word errors: These are the most common type of errors. You either miss a few keystrokes or let your fingers hurtle a bit longer. E.g., typing langage when you meant language, or hurryu when you meant hurry.

Real-word errors: If you have fat fingers, sometimes instead of creating a non-word, you end up creating a real word, but one you didn't intend. E.g., typing buckled when you meant bucked. Or your fingers are a tad wonky, and you type in three when you meant there.

Cognitive errors: The previous two types of errors result not from ignorance of a word or its correct spelling; cognitive errors can occur due to those factors. The words piece and peace are homophones (they sound the same), so you are not sure which one is which. And sometimes you're quite sure about your spellings despite a few grammar nazis claiming you're not.

Short forms/slang/lingo: These are possibly not even spelling errors. Maybe u r just being kewl. Or you are trying hard to fit everything within a text message or a tweet and must commit a spelling sin. We mention them here for the sake of completeness.

Intentional typos: Well, because you are clever. You type in teh and pwned and zomg carefully and frown if they get autocorrected. It could be a marketing trick, one that probably even backfired.

The word N-gram approach to spelling error detection and correction is to generate every possible misspelling of each word in a sentence, either just by typographical modifications (letter insertion, deletion, substitution) or by including homophones as well (and presumably including the correct spelling), and then to choose the spelling that gives the sentence the highest prior probability. That is, given a sentence W = {w1, ..., wk, ..., wn}, where wk has alternative spellings w'k, w''k, etc., we choose the spelling among these possible spellings that maximizes P(W), using the N-gram grammar to compute P(W). A class-based N-gram grammar can be used instead, which can find unlikely part-of-speech combinations, although it may not do as well at finding unlikely word combinations.
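As a minimal sketch of this idea (the tiny corpus, the candidate list, and the add-one smoothing are illustrative choices, not part of the text), we can rank candidate spellings of one word by the bigram probability of the resulting sentence:

```python
# Minimal sketch: among candidate spellings of a word, pick the one whose
# sentence gets the highest (add-one smoothed) bigram probability.
import math
from collections import Counter

corpus = ("he went to san diego . it is raining in san francisco . "
          "he said thank you .").split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = len(unigrams)

def log_p(tokens):
    """Add-one smoothed bigram log-probability of a tokenized sentence."""
    return sum(math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab))
               for prev, word in zip(tokens, tokens[1:]))

candidates = ["san", "sun", "son"]   # hypothetical alternative spellings of one word
best = max(candidates, key=lambda c: log_p(["raining", "in", c, "francisco"]))
print(best)  # san
```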
Expected Questions:

• What are morphemes? What are the different ways to create words from morphemes?
• What is a language model? Explain the uses of language models.
• What is the role of FSA in morphological analysis? Explain FST in detail.

OBJECTIVES

After reading this chapter, the student will be able to understand:
• Part-of-Speech tagging (POS): tag set for English (Penn Treebank)
• Rule-based POS tagging, stochastic POS tagging
• Issues: multiple tags and words, unknown words
• Introduction to CFG
• Sequence labelling: Hidden Markov Model (HMM), Maximum Entropy, and Conditional Random Field (CRF)