Download as pdf or txt
Download as pdf or txt
You are on page 1of 29

Scanned with CamScanner

oii shgiesele ee She


OBJECTIVES
After reading this chapter, the stude
nt will be able to understand:
Morphology analysis
ORORIORIOMOMOMONEO

Stemming and Lemmatization


Regular expression
Finite Automata
Finite-State Morphological Parsing
Combining FST Lexicon and Rules
Lexicon free FST Porter stemmer | |
N —Grams, N-gram language model, N-gram for spelling correction.
1. Morphology analysis
Suffix : ness_

Scanned withCamScanner
_. What are words? |
Affixes : happy
Words are the fundamental building block of language. Every human language,
spoken,signed, or written, is composed of words. Every area of speech and language Stem : happy
| processing, from speech recognition to machine translation to information retrieval on
the web, requires extensive knowledge about words. Psycholinguistic models of human ’not’, whil e nes s means | "bei ng in a state or condition”.
Happyis a free morpheme
language processing and models from generative linguistic are also heavily based on ecauseit can appear onits
own(as a “word” in its own
right). Bou
nd morphemes have
lexicalknowledge. to be attached to a free morpheme,
and so cannot be words in their ownr
cant have sentencesin English such ight. Thys, you
Words are Orthographic tokens separated by white space. In some languages the as “Jasonfeels very un ness today”,
distinction between words and sentencesis lessclear. | Bound Morphemes: These are lexica
l items incorporated into a word as a depe ndent
part. "They cannot stand alone, but must
Chinese, Japanese: no white space between words be connected to another morphemes.
Bound
morphemes operate in the connection proc
nowhitespace — no white space/no whites pace/nowhit esp ace esses by means of derivation, inflection, and
compounding. Free morphemes, on the other hand,
are autonomous, can occuron their
Turkish: words could ‘represent a complete “sentence” Own and arethus also words at the same time. Technicall
y, bound morphemesand free
Eg: uygarlastiramadiklarimizdanmissinizcasina | morphemes aresaid to differ in terms of their distributi or free
on dom of occurrence. As a
“(behaving) as if you are among those whom wecould notcivilize” rule, lexemesconsistofat least one free morpheme 4
Morphology deals with the syntax of complex words Morphology handles the formation of words by using morphemesbase form (stem,lemma),
and parts of words, also called
morphemes,as well as with the semantics of their lexical meanings. Understand e.g., believe affixes (suffixes, prefixes, infixes), e.g., un-, -able, -ly
ing how
words are formed and what semantic properties they conveythrough their forms enables Morphological parsing is the task of recognizing the morphemesinside a word e.g., hands,
human beings to easily recognize individual words and their meanings in discourse. foxes, children and its important for many tasks like machine translation, information
Morphology also looks at parts of speech, intent and Stress, and the ways context retrieval, etc. and useful in parsing, text simplification, etc
can
change a word's pronunciation and meaning. Survey of English Morphology
Morphology is the study of the structure and formation of words. Its most important unit Morphology is the study of the way wordsare built up from smaller meaning- bearing
is the morpheme, which is defined as the “minimalunit of meaning” units, morphemes. A morpheme is often defined as the minimal meaning-bearing unit in
Consider a wordlike: “unhappiness”. This has three parts: a language. So for example the word fox consists of a single morpheme(the morpheme.
fox) while the word cats consists of two: the morpheme cat and the morpheme-s. As
morphemes
“|S
this example suggests, it is often useful to distinguish two broad classes of morphemes:
stems and affixes. The exact details of the distinction vary from language to language, |
but intuitively, the stem is the ‘main’ morphemeof the word, supplying the main meaning,-
me
un happyness,
prefix
Fs wae
i the affixes add ‘additional’
while ‘addi
' meanings
. of various
. .
kinds. Affixesarefurt her divided
tbo
into
suffix
the stem, suffixe
affixes
prefixes, suffixes, infixes, and circum- fixes. Prefixes precede
stem the ©
d inside the stem. Pol exam ple,
itch
the stem, circumfixes do both, and infixes are inserte
. Morphemes > un happy ness unbuckl e is compo
mposed of a stem eatand the suffix -s. The word
word eats Is comp niesof circumixes,
any good exam
Prefix : un a stem buckle and the prefix un-. English doesn't have
Yiord Level An alysis. (25) A

Montane
(2+) Natural Language Processing
ticiple of some verbs
uages do. | n German
, for example, the past par word stem with a grammatical morpheme,usually resulting in a word ofa different class,

Scanned with CamScanner


but many other lang to the end; so the past participle
‘by adding ge- to the beginning of the stem and -t often with a meaning hard to predict exactly. For example, the verb-computerizes can’
pheme Is insertedin the
he ves eigen ‘ say) is gesagt(said). Infixes, in which a mor take the derivational suffix -ation to produce the noun computerization.
mple in the Philipine language Tagalog,
ofthe ver? rae | on very commonly for exa
xed to the Tagalog Inflectional morphology & Derivational morphology
th ste um, 5 which marks
the agent of an action, is infi
or dipiti, ite
Pe | a 7
humingi. Morphemes are defined as smallest meaning-bearing units. Morphemes .can be
stem hingi ‘borrow’ to produce
y since a word is classified in various ways. One commonclassification is to separate those morphemes
Prefixes an d suffixes are ofte
n called concatenative morpholog
. A number of languages that mark the grammatical forms of words (-s, -ed, -ing and others) from those that form
composed of a number . of morphe mes concatenated together
oncatenative morphology, in which mor
phemes are combinedin new lexemes conveying new meanings, e.g. un- and -ment. The former morphemes
have extensive non-c :
e is one exam, ple of non- are inflectional morphemes and form.a key part of grammar,the latter are derivational
more complex ways. The Tagalog in- fixation example abov
i and um) are intermingled. morphemes andplay a role in word-formation, as we have seen. The followingcriteria
coricatenative morphology, since two morphemes (hing
root- help you to distinguish the two types:
Another kind of non-concatenative morphology is called templatic morphology or
e Effect: Inflectional morphemes encode grammatical categories andrelations, thus
and-pattern morphology. This is very common in Arabic, Hebrew, and other Semitic
marking word-forms, while derivational morphemescreate new lexemes.
languages. In Hebrew, for example, a verb is constructed us- ing two components: a
root, consisting usually of three consonants (CCC) and carrying the basic meaning, e Position: Derivational morphemes are closer to the stem than inflectional morphemes,
and a template, which gives the ordering of consonants and vowels and specifies more cf. amendments (amend[stem] — ment[derivational] — s[inflectional]) and legalized
semantic information about the resulting verb, such as the semantic voice (e.g. active, (legal[stem] — ize[derivational] — ed[inflectional]).
passive, middle). For example the Hebrew tri-consonantal root Imd, meaning ‘learn’ or e Productivity: Inflectional morphemes are highly productive, which meansthat they
‘study’, can be combined with the active voice CaCaC template to produce the word can be attached to the vast majority of the members of a given class (say, verbs,
lamad, ‘he studied’, or the intensive CiCeC template to produce the word limed, ‘he nouns or adjectives), whereas derivational morphemestend to be more restricted
taught’, or the intensive passive template CuCaC to produce the word lumad, ‘he was with regard to their scope of application. For example, the past morphemecan in
taught’. principle be attached to all verbs; suffixation by means of the adjective-forming
A word can have more than oneaffix. For example, the word rewrites have the prefix derivational morphemes-able, however, is largely restricted to dynamic transitive
re-, the stem write, and the suffix -s. The word unbelievably has a stem (believe) plus verbs, which excludes formations such as *bleedable or *lieable.
three affixes (un-, -able, and -ly). While English doesn't tend to stack more than 4 or 5 e Class properties: Inflectional morphemes makeup a closed and fairly stable class of
affixes, languageslike Turkish can have words with 9 or 10 affixes, as we saw above. items which can be listed exhaustively, while derivational morphemes tend to be
Languagesthat tend to string affixes together like Turkish does are called agglutina much more numerable and more open to changesin their inventory.
tive.
languages.
Both inflectional and derivational morphemesmustbe attached to other morphemes; they
There are two broad (and partially Overlapping) classes of ways cannot occur by themselves, in isolation, and are therefore known as bound morphemes
to form words from
morphemes: inflection and derivation. Inflection
is the combination of a word stem with Inflected words are variations of already existing lexemes that were only changedin their
a grammatical morpheme, usually resulting
in a word of the same Class as the original grammatical shape. Therefore manyofthe Inflectional Morphemesare notlisted in the
Stem, and usually filling some syntac- tic function like agreemen
t. For example, English dictionary. If you know the word surprise and lookit up you will also find in the same entry
has the inflectional mor- |pheme -§ for marking the
plural on nouns,and the inflectional the word surprise -s which simply expresses the plural. .
morphemes -ed for marking the past
tense on verbs. Derivation is the
combination of a ’
4
(28) Natural Language Processing Word Level Analysis (7)
A.VT
i aac

_
new words outof
Derivational Morphology on the other hand uses affixes to create Toillustrate this consider the following
two sentences:

Scanned with CamScanner


already existing lexemes. Typicalaffixes are -ness, -ish, -ship and so on. ANSBfIiXes
'
1. I help my grandmotherin her garden.
'
'
| a ae
t
‘ do not change the grammatical form of a word such as inflectional affixes do, but instead 2. He is my grandmother's help.
often create a new meaning of the base or change the word class of the base. An Here our general knowledge of words andthe
ir meaning showsu sthat in
Example would be the wordlight. The plural form light-s would be considered Inflectional as a verb and expresses us working with our grandm
4, help is used
otheri n order to support her. In 2.
Morphology, but if we consider de-light the prefix -de has changed the meaning of the help is a noun and stands for the person that regular ly suppor ts my grandmother. This
word completely. We now do notthink oflight in the form of sunshine or lamps anymore variation of a word without actually changi ng its form is called zero morphemes and
but instead about a feeling. Also if we consider en-light the suffix -en has changed the cannotonly distinguish verb and noun (which makes it a derivational morphemes) but
word classoflight from noun to verb. also singular and plural, which makes it an inflectional morphem
ct

e. | will talk aboutthis

eee
INFLECTIONAL MORPHOLOGY later in 2.2.: Inflection in nouns, though.

a.
Inflection is a morphological process that adapts existing words so that they function “We may define inflectional morphology as the branch of morpholo gy that deals with

=
effectively in sentences without changing the category of the base morphemes. paradigms. It is therefore concerned with two things: on the one hand, with the
semantic
oppositions among categories; and on the other, with the formal means, including

—————
Inflection can be seen as the “realization of morphosyntactic features through
inflections, that distinguish them.” (Matthews, 1991).
morphological means” . But what exactly does that mean? It means that inflectional
Inflectional morphology is that it changes the word form, it determines the grammar and
morphemesof a word do not havetobelisted in a dictionary since we can guesstheir
it does not form a new lexemebutrathera variantof a lexeme that does not need its own
meaning from the root word. We know whenweseethe word whatit connects to and entry in the dictionary.
_a_—e—vv—o_—ore —_—
most times can even guessthe difference to its original. For example let us consider
word stem + grammatical morphemes
help-s, help-ed and help-er. According to what | have said about wordslisted in the
cat + s only for nouns, verbs, and some adjectives
dictionary, all of these variants might be inflectional morphemes, but then on the other
Nouns
hand does help-s really need an extra listing or can we guess from the root help
and
plural:
the suffix -s what it means? Does our natural feeling and instinct for language nottell
us, that the suffix —s indicates the third person singular and that help-s therefore regular: +s,+es irregular: mouse - mice; ox - oxen
only is
a variant from help (considering help as a verb and not a noun here)? Yesit manyspelling rules: e.g. -y -> -ies like: butterfly - butterflies
does. As
native speakeroneinstantly knowsthat —s, asalso the past form indicator possessive: t's, +’
—ed only show
a grammatical variant of the root lexemehelp. Verbs
So why is help-er or even help-less-ness different? The answeri main verbs(sleep, eat, walk)
s actually very simple.
The suffixes in these last two words change the word class modal verbs(can, will, should)
and therefore form a new
lexeme. Help-er can now bethe new root for additional suffixes primary verbs (be, have, do)
such help-er-s which
would then be an inflectional morpheme again, the root here
being the smallest free VERB INFLECTIONAL SUFFIXES
morpheme after you removeall affixes.
1. The suffix —s functions in the Present Simple as the third person marking of the verb
After establishing this westill have a problem if we consider the word : to work — he work-s
help as a noun and
as a verb. How do wedistinguish these two? The answer is context
and the phenomenon 2. The suffix -ed functions in the past simple as the past tense marker in regular
is called a zero morphemes. Only threw context can
we Say if help is a verb or a noun. verbs: to love — lov-ed
ff
Word Level Anatysis \ ag
Neca
(28) Natural Language Processing
i ee
2
yy he, Oe
in the
regular verbs) function
z
!

-en(for some

Scanned with CamScanner


(regul ar verbs) and fect aspect: study of derivation has also been important in a number of psycholinguistic. debates
in general, in the mar king of the per
ed
3. The suffixes t partciple and,
marking of the pas | concerning the perception and production of language.
-
i d / To ea tate eaten
l d studie Derivational morphology is defined as morphology that creates new lexemes, either by
study studie in
sent participle, the gerund and
n the marking of the pre changing the syntactic category (part of speech) of a base or by adding substantial, non-
The suffix _ing-ing functionsi
4, : To eat — eating / To study - studying
tinuous aspect: grammatical meaning or both. On the one hand, derivation may be distinguished from
the marking of the con
INFLECTIONAL SUFF
IXES
: inflectional morphology, which typically does not change category but rather modifies
NOUN dog — dogs
marking of the plural of nouns: lexemesto fit into various syntactic contexts; inflection typically expresses distinctions
4. ‘The suffix -s functionsin the — Laura's
ve marker (saxon genitive): Laura’ like number, case, tense, aspect, person, among others. On the otherhand, derivation
>. The suffix -s functions as 4 possessi
may be distinguished from compounding, which also creates new lexemes, but by
book.
AL FFIXES combining two or more bases rather than by affixation, reduplication, subtraction, or
ADJECTIVE INFLECTION SU
internal modification of various sorts. Although the distinctions are generally useful, in
ve ker: quick — quicker
The suffix -er functions as comparati mar practice applying them is not always easy.
ative marker: quick - quickest
The suffix -est functions as superl
Wecan distinguish affixes in two principal types:
DERIVATIONAL MORPHOLOGY 1. Prefixes - attached at the beginning of a lexical item or base-morpheme — ex: ur,
mes are connected to existing lexical forms |
Derivation is concerned with the way morphe pre-, post-, dis, im-, etc. | :
rd formation that creates new lexemes, |
as affixes. Derivational morphology is a type of wo
stantial new meaning (or both) to 2. Suffixes — attached at the end of a lexical item ex: -age, -ing, -ful, -abie, -ness,
either by changing syntactic category or by adding sub
sted with inflection on the one hand or -hood, -ly, etc.
a free or bound base. Derivation may be contra
n ivation and inflection and | EXAMPLES OF MORPHOLOGICAL DERIVATION
with compounding on the other. The distinctions betwee der
cut. New words
between derivation and compounding, however, are not always clear- a. Lexical item (free morpheme): like (verb) b. Lexical item: like
n, ication, internal :
may be derived by a variety of formal meansincluding affixatio redupl + prefix (bound morpheme) dis- + suffix —able = likeable
n ed ss-
modification of various sorts, subtraction, and conversion. Affixatio is best attest cro
n = dislike (verb) + prefix un- =unlikeable
linguistically, especially prefixation and suffixation. Reduplicatio is also widely found, |
+ suffix —ness = unlikeableness
with various internal changes like ablaut and root and pattern derivation less common.
Derived words mayfit into a number of semantic categories. For nouns, event and result, | Derivational affixes can cause semantic change:
personal and participant, collective and abstract noun are frequent. For verbs, causative 1. Prefix pre- means before; post- means after, un- means not, re- means again.
and applicative categories are well-attested, as are relational and qualitative derivations 2. Prefix = fixed before; Unhappy = not happy = sad; Retell = tell again.
for adjectives. Languages frequently also have ways of deriving negatives, relational
3. Prefix de- added to a verb conveys a sense of subtraction; dis- and un- have a
words, and evaluative. Most languages have derivation of some sort, although there
sense of negativity.
are languages that rely more heavily on compounding than on derivation to build their
4. To decompose; to defame; to uncover; to discover.
lexical stock. A number of topics have dominated the theoretical literature on derivation, |
including productivity (the extent to which new words can be created with a given affix |
or morphological process), the principles that determine the ordering of affixes, and the |
place of derivationat morphology with respect to other components of the grammar. The |
i (30) Natural Language Processing |
1s |
Word Level Analysis (1)
SN
————
ee es Le

inflection from derivation; inflection is invariably relevant to syntax, derivation


Derivation Versus Inflection not. But

Scanned with CamScanner


nal one rather
tion is | a functiome | than
, a formal Booij (1996) has argued that even this criterion is problematic unless we are clear what
istinct
The distin i
ction tion and d inflec
ivati
between deriva |
Either derivation or inflection may be we mean by relevance to syntax. Caseinflections, for example, mark grammatical
one, as Booij (2000, p. 360) has pointed out.
of bases, context, and are therefore clearly inflectional. Number-marking on verbs is arguably
affected by formal meanslike affixation, reduplication, internal modification
inflectional when it is triggered by the numberof subject or object, but number on nouns
and other morphological processes. But derivation serves to create new lexemeswhile
or tense and aspect on verbsis a matter of semantic choice, independentof grammatical
inflection prototypically serves to modify lexemestofit different grammatical contexts, In
configuration. Booij therefore distinguishes whathe calls contextualinflection, inflection
the clearest cases, derivation changes category, for example taking a verb like employ
triggered by distinctions elsewhere in a sentence, from inherentinflection, inflection
and makingit a noun (employment, employer, employee) or an adjective (employable),
that does not depend on the syntactic context, the latter being closer to derivation than .
or taking a nounlike union and making it a verb (unionize) or an adjective (unionish,
unionesque). Derivation need not change category, however. For example, the creation the former. Some theorists (Bybee, 1985; Dressler, 1989) postulate a continuum from
of abstract nouns from concrete ones in English (king ~ kingdom; child ~ childhood) derivation to inflection, with no clear dividing line between them. Anotherposition is that
is a matter of derivation, as is the creation of namesfor trees (poirier ‘pear tree’) from of Scalise (1984), who has argued that evaluative morphologyis neitherinflectional nor
the correspondingfruit (poire ‘pear’) in French, even though neither process changes derivational but rather constitutes a third category of morphology.
category. Derivationalprefixation in English tends not to change category, but it does add
2. Stemming and Lemmatization
substantial new meaning, for example creating negatives (unhappy, inconsequential),
various quantitative or relational forms (tricycle, preschool, submarine) or evaluative In natural language processing, there may come a time when you want your program to
(megastore, miniskirt). Inflection typically adds grammatical information about recognize that the words “ask” and “asked” are just different tenses of the same verb. |
number
(singular, dual, plural), person(first, second, third), tense (past, future), aspect(perfective, This is the idea of reducing different forms of a word to a core root. Words that are
imperfective, habitual), and case (nominative, accusative), among derived from one another can be mapped to a central word or symbol, especially if they
other grammatical
categories that languages might mark. have the same core meaning.
Nevertheless, there are instances that are difficult to categor Maybethis is in an information retrieval setting and you wantto boost your algorithm's
ize, or that seem to fall
somewhere between derivation and inflection. Many recall. Or perhapsyou aretrying to analyze word usage in a corpusand wish to condense
of the difficult cases hinge on
determining the necessary and sufficient criteri related words so that you don’t have as much variability. Either way, this technique of text
a in defining derivation. Certainly,
, category change is not necessary, as there are normalization may be useful to you.
many obvious cases of derivation that do
not involve category change. But further, Haspe This is where something like stemming or lemmatization comesin. ,
lmath (1996) has argued that there are
cases ofinflection (participles of various sorts, For grammatical reasons, documents are goingto usedifferent forms of a word, such as
for example) that are category- changing,
So that this criterion cannot even be said organize, organizes, and organizing. Additionally, there are families of derivationally related
to be sufficient. Productivity has also
used as a criterion—inflection typically been
being completely productive, derivation words with similar meanings, such as democracy, democratic, and democratization. \|n
only sporadically productive—but some being
derivation is just as productive as inflec many situations, it seems asif it would be useful for a search for one of these words to
for example, forming nouns from adjec tion,
tives with -nessin English, and some return documentsthat contain another wordin the set. |
displayslexical gaps (Booij, 2000, inflection
p. 363 Says, €.g., that the Dutch infini The goal of both stemming and lemmatization is to reduce inflectional forms and
‘to make an anthology’ does not corre tive bloemlezen
spond to any finite forms),
so Productivity is neither sometimesderivationally related forms of a word to a commonbase form. Forinstance:
necessary nor sufficient for distingu
ishing inflection and derivation. And am, are, is => be
makesthecriterion of relevanc erson (1982)
e one for distinguishing Car, Cars, Car’s, cars’ = Car
(2) Natural Language Proce Word Level Analysis (33)
ssing
7
like:

Scanned withCamScanner_
g of text will be something
The result of this mappin Stemming Algorithms Examples
colors >
the boy's cars are different
Porter stemmer: This stemming algorithm is an older one. It’s from the 1980s andits
the boy car be differ color
in their flavor. main concern is removing the commonendings to words so that they can be resolved
However, the two wordsdiffer
to a common form. It’s not too complex and development on it is frozen. Typically, it's
cess that chops off the ends of wordsin
Stemming usually refers to a crude heuristic pro l a nice Starting basic stemmer,butit's not really advised to use it for any production/
the time, and often includes the remova
the hope of achievingthis goalcorrectly mostof complex application. Instead, it has its place in research as a nice, basic stemming
pler of the two approaches. With
of derivational affixes. Stemming is definitely the sim algorithm that can guarantee reproducibility. It also is a very gentle stemming algorithm
rd stem need not be the same
stemming, words are reduced to their word stems. A wo when compared to others.
al or smaller form of the
root as a dictionary-based morphological root, it just is equ to
Snowball stemmer: This algorithm is also known as the Porter2stemming algorithm. It is
word.
almostuniversally accepted as better than the Porter stemmer, even being acknowledged
Stemming algorithmsare typically rule-based. You can view them as heuristic process as suchby the individual who created the Porter stemmer. That being said, it is also more
that sort-of lops off the ends of words. A word is looked at and run through a series of aggressive than the Porter stemmer.A lot of the things added to the Snowball stemmer
conditionals that determine how to cut it down. were because of issues noticed with the Porter stemmer. There is about a 5% difference
For example, we may havea suffix rule that, based ona list of known suffixes, cuts them in the way that Snowball stems versus Porter.
off. In the English language, we have suffixeslike “-ed” and “-ing” which may be useful Lancaster stemmer: Just for fun, the Lancaster stemming algorithm is ancther algorithm
to cut off in order to map the words “cook,” “cooking,” and “cooked”all to the same stem that you can use. This one is the most aggressive stemming algorithm of the bunch.
of “cook.” However,if you use the stemmerin NLTK, you can add your own custom rules to this
Over stemming comes from when too much of a wordis cut off. This can result in algorithm very easily. It's a good choice for that. One complaint around this stemming
_ nonsensical stems, whereall the meaning of the wordis lost or muddled. Or it can result algorithm though is that it sometimes is overly aggressive andcan really transform words
in words being resolved to the same stems, even though they probably should not be. into strange stems. Just make sure it does what you want it to before you go with this
option! |
Take the four words university, universal, universities, and the universe. A stemming
and
algorithm that resolves these four words to the stem “univers” has over stemmed. While Lemmatization usually refers to doing things properly with the use of a vocabulary
endings only
it might be nice to have universal and universe stemmed together and university and morphological analysis of words, normally aiming to remove inflectional
as the Jemma .
universities stemmed together,all four do notfit. A better resolution might have the first and to return the base or dictionary form of a word, which is known
wordsto their dictionary
two resolve to “univers” and the latter two resolve to “universi.” But enforcing rules that lemmatization is a more calculated process. It involves resolving
make that so might result in more issues arising. form. In fact, a lemmaof a wordis its dictionary or canonical form!
, it requires a little more to actually
Because lemmatization is more nuanced in this respect
Under stemmingis the opposite issue. It comes from when we haveseveral wordsthat
needs to know its part
make work. For lemmatization to. resolve a wordto its lemma,it
actually are forms of one another. It would be nice for them to all resolve to the same
power such as a part of speech _
of speech. That requires extra computational linguistics
stem, but unfortunately, they do not. This can be seen if we have a stemming algorithm
is and are to “be").
tagger. This allowsit to do better resolutions (like resolving
that stems the words data and datum to “dat” and “datu.”
is thatit's often times harder to creale a
Another thing to note about lemmatization
tizers
is a stemming algorithm. Because lemma
lemmatizer in a new language thanit
Word Level Analysis (3s)
(34) Natural Language Processing
“8
2

a aboutth e structure of alanguage, it's a much more intensive

Scanned with CamScanner


require a lot more knowledge ming algorithm. 8 increases recall while harming precision. As an example of what can go wrong, note that
processthanjust trying to set
up a heuristic stem the Porter stemmerstemsall of the following words:
e
just s, ee ke ai
saw, stemming might eum
If confronted with the token ther : eu cen operate operatingoperates operation operative operatives operationalto oper.
see oF saw depending on whe
would attempt to return either ng most commonly However, since operate in its various forms is a common verb, we would expect to lose
may also differ in that stemmi
was as a verb or a noun. The’ two considerable precision on queries such as the following with Porter stemming:
whereas lemmatiza tion commonly only collapses
collapses derivationally re lated words, operational and research
| emming or
a. L inguistic processing | for st
the different: inflectional forms of a lemm
operating and system
tto the indexing process,
lemmatization is often done by an additional plug-in componen
open-source. operative and dentistry
and a number of such components exist, both commercial and
For a caselike this, moving to using a lemmatizer would not completely fix the problem
The most commonalgorithm for stemming English, and one that has repeatedly been
becauseparticular inflectional forms are used in particular collocations: a sentence with
shown to be empirically very effective, is Porter’s algorithm (Porter, 1980). The entire
the words operate and system is not a good match for the query operating and system.
algorithm is too long andintricate to present here, but wewill indicate its general nature.
Getting better value from term normalization depends more on pragmatic issues of word
Porter’s algorithm consists of 5 phases of word reductions, applied sequentially. Within
use than on formal issuesoflinguistic morphology.
each phasethere are various conventions to select rules, such as selecting the rule from
each rule group that applies to the longest suffix. In the first phase, this convention is The situation is different for languages with much more morphology (such as Spanish,
used with the following rule group: German, Hindi and Finnish).
(F) Rule Example 3. Regularexpression __.
SSES —+ SS caresseS -—» caress
IES — | ponies Regular expression (RE), a languagefor specifying text search strings or search pattem.
— poni
Ss — SS caress — caress The regular expression languages used for searching texts in UNIX (vi, Perl, Emacs,
S > cats -—> Cat grep) and Microsoft. Usually search patterns are used bystring searching algorithms for
“find” or “find and replace” operations on strings, or for input validation. It is a technique
Manyof the later rules use a concept of the measure of word,
a which loosely checks the
developedin theoretical computer science and formal language theory.
numberof syllables to see whether a wordis long enough
thatit is reasonable to regard
the matching portion of a rule as a suffix rather than Words are almost identical, and many RE features exist in the various Web search
as part of the stem of a word. For
example, the rule: engines. Besides this practical use, the regular expression is an important theoretical
> EMENT —> tool throughout computer science and linguistics. A regular expression is a formula in
a special language that is used for specifying simple classes of strings. A string is a
would map replacementto replac, but
not cementto c. " sequence of symbols; for the purpose of most text-based search techniques, a string is a
cae ae using a stemmer, you can
use a /emm atizer , a tool from Natural Language sequenceof alphanumeric characters(letters, numbers, spaces, tabs, and punctuation).
rocessing which doesfull morpholo
gical analysis to accurately ident For these purposes a spaceis just a characterlike any other, and we representit with
ify the lemma for
the symbol uw. Formally, a regular expression is an algebraic notation for characterizing
a set of strings. Thus, they can be used to specify search strings as well as to define a
ggregate - at least not by very muc language in a formal way.
h. While
(38) Natural Language
Processing
Sse Ward Level Analysis (37)
a

a[bc] same as previous


Their
ibi ng a certain amount 0 f text.

Scanned with CamScanner


ribi
Basically a regular expression is a3 pattern desc Character classes — \d \w \s and.
But we oe nm
theory on which ney =
name comes from the mathematical ‘ron _
\d matches a single character that is a digit
e abbreviated to “regex oF eo
dig into that. You will usually find the nam er \w matches a word character (alphanumeric character plus
rma®
extremely useful in extracting info underscore)
expressions (regex Or regexp) are
a Specitic matches a whitespace character (includes tabs andline breaks)
ches of a specific search pattern (i.e. \s
text by searching for one or more mat |
. matches any character
sequence of ASCII or unicode characters).
i
for short, is broadly defined
the \d, \w and\s also present their negations with \D, \W and\S respectively. |
Natural Language Processing, or fica
e, like speech and text, y
automatic manipulation of natural languag For example, \D will perform the inverse match with respect to that obtained with \d.
ical inference for, the field of
software.Statistical iia aims to do statist \D matches a single non-digit character
natural Language.
In order to be taken literally, you must escape the characters “.[$()|*+?{with a backslash
aed
J\BO\weNLP(wee) \b ig \ as they have special meaning.
www.regextester.com/ \$\d matchesa string that has a $ before one digit
Figure 1: shows matching of the string NLP in the given text on the site https://
We can match also non-printable characterslike tabs \t, new-lines \n, carriage returns \r
Basic Regular Expression Patterns
| Flags
Anchors — “ and $
| We are learning how to construct a regex but forgetting a fundamental concept: flags.
AThe matches anystringthat starts with The -> Try it!
A regex usually comes within this form /abc/, where the search patter is delimited by
end$ matchesa string that ends with end
two slash characters /. At the end we can specify a flag with these values (we can also
AThe end$ exact string match (starts and ends with The end)
combine them each other):
roar matches anystring that has the text roarin it
g (global) does not return after the first match, restarting the subsequent searches
Quantifiers — * + ? and {} from the end of the previous match
abc* matches a string that has ab followed by zero or more c
m (multi-line) when enabled “ and $ will match the start and end of line, instead of
abc+ matches a string that has ab followed by one or more c the whole string
abc? matchesa string that has ab followed by zero or one c i (insensitive) makes the whole expression case-insensitive (for instance /aBcfi
abc{2} matches a string that has ab followed by 2 c would match AbC)
abc{2,}_ matches a string that has ab followed by 2 or more c Grouping and capturing — ()
abc{2,5} matches a string that has ab followed by 2 up to5c a(bc) parentheses create a capturing group with value bc
a(bc)* matchesa string that has a followed by zero or more copies of the sequence a(?:bc)* using ?: we disable the capturing group
bc a(?<foo>bc) using ?<foo> we put a nameto the group
This operator is very useful when we need to extract informationfrom strings or data
bc using your preferred programming language. Any multiple occurrences captured by
OR operator — | or[]
a(b|c) -matches a string that has a followed by b orc Word Level Analysis
(33)
1
(38) Natural Language Processing
ip
rrayv:: W we Will ACCESS thej
ee al

sical array
in the form rm of f a clas i
l group s will be exposed

Scanned with CamScanner -


evera
several
values specifying using an index on the result of the match.
erento twin’ 4. Finite Automata
If we choose to put a nameto the groups (using (?<foo>...)) we Wi ‘be the ‘§ The regular expression is more than just a convenient metalanguagefor text searching.
the group valuesusing the matchresult like a dictionary where the keys wl name First, a regular expression is one way of describing a finite-state automaton (FSA).
»
of each group. Finite-state automata are the theoretical foundation of a good deal of the computational
| work wewill describe in this book, Any FSA regular expression can be implemented as
Bracket expressions—[]
same as albjc
abc matches a string that has either ana oraborac-7 Is the a finite-state automaton (except regular expressions that use the memory feature; more
ed matchesa string that has either an a ora b or ac -> is the same as alb|c on this later). Symmetrically, any finite-state automaton can be described with a regular
expression. Second, a regular expression is one way of characterizing a particular kind
[a-fA-F0-9] a string that represents a single hexadecimal digit, case insensitively
of formal language called a regular language. Both regular expressions and finite-state
[0-9]% a string that has a character from 0 to 9 before a % sign
automata can be used to describe regular languages. The relation among these three
[*a-zA-Z] a string that has not a letter from a to z or from A to Z.In this case the
theoretical constructions is sketched outin the figure below.
“ is used as negation of the expression
regular
Greedy and Lazy match expressions
.
- ee
The quantifiers (* + {}) are greedy operators, so they expand the match as far as they
can through the provided text.
For example, <.+> matches <div>simple div</div> in This is a <div> simple div</div>
test. In order to catch only the div tag we can use a ? to makeit lazy:
finite “7777 TTT TTT tr rrr terns regular
<.+7> matches any character one or moretimes included inside < and >, automata
expanding as needed
Figure 2: The relationship betweenfinite automata, regular expressions,and regular languages
Boundaries — \b and \B
A formal language is completely determined by the ‘wordsin the dictionary’, rather than
\babc\b performs a “whole words only” search
by any grammatical rules
\b represents an anchorlike caret (it is similar to $ and 4) matchin
g positions where one A (formal) languageL overalphabet2 is just a setof strings in =". ;
Side is a word character(like \w) and the otherside is not a
word character (for instance
it may be the beginning of the String or a space charact Thus any subset L c 2* determines a language over =
er). |
It comeswithits negation, \B. This matches all The languaage determined by a regular expression over2 is
positions where \b doesn’t match and
could beif we wantto find a search patter
n fully surrounded by word characters. L(r) def {v «© *|v matchesr}.
\Babc\B matchesonly if the pattern is fully s urrou Two regular expressions rand s (over the samealphabet) are equivalent iff Lir) and L(s)
nded by wordcharacters
Look-ahead and Look-behind — (?=) are equal sets (i.e. have exactly the same members.)
and (?<=) |
d(?=r) matches a d onlyif is followed by A finite automaton has a finite set of states with which it accepts or rejects strings.
r, but r will not be part of the overa
_ match ll regex
}
Finite State Automata (FSA) can be:
(?<=r)d matchesa only if is preceded by an Deterministic
r, but r will not be part of the overa
i ll
'
i
regex match On eachinputthere is one and only onestate to which the automaton can transition from
i
|
{ its current state
f
}
(«0) Natural Language
Processing Word Level Analysis (a1)
i
a ee

See

Scanned with CamScanner


|
Nondeterministic
ce
in several states at on
An automaton can be
at

te automaton > qo G2 Go
Deterministic finite sta
denoted Q
4. A finite set of states, often * qi M1 qi
s,often denoted 2
2. A finite set of input symbol g2 G2 q1
input symbol and
es as arguments a state and an
3. A transition function that tak te and ais
on commonly denoted 6 If qis a sta
returnsa state. The transition functi is
——Ss

graphthat represents the automaton


a symbol, then 5(q, a) is a state p (and in the
there is an arc from q to p labeled a)
4. Astart state, one of the states in Q
5. Asetoffinal or accepting states F (F < Q)
ADFAis a tuple A = (Q, 2, 5, q,, F) The DFA define a language: the setof all strings that result in a sequence of state
Other notations for DFAs transitions from the start state to an accepting state.

a a
Transition diagrams Extended transition function
e Each state is a node Describes what happens when westart in any state and follow any sequence ofinputs
e For each state q « Q and each symbol ¢ 2,let 5(q, a) =p If 5 is our transition function, then the extended transition function is denoted by 6
e Then the transition diagram has an arc from q to p, labeled a The extended transition function is a function that takes a state q and a string w and
- There is an arrow to the start state q, returns a state p (the state that the automaton reaches whenstarting in state q and
processing the sequence of inputs w)
e Nodes corresponding to final states are marked with doubledcircle
Formal definition of the extended transition function
Transition tables
e Definition by induction on the length of the input string
Tabular representation of a function
Basis: 5 (g, ©) =q
e The rowscorrespondto the states and the columns to the input s
If we are in a state q and read no inputs, then wearestill in state q Induction: Suppose
¢ The entry for the row corresponding to state q and the column corresponding to input
w is a string of the form xa; thatis a is the last symbolof w, and x is the string consisting
a is the state 5(q, a)
of all but the last symbol
A= (£49) 9,}, (0, 1}, 5, q,, {q, })
Then:8 (q, w) = 5(8(q,x), a)
wherethe transition function 6 is given by the table
To compute6 (gq, w), first compute 5(q,x), the state that the automationis in after procesing
all but the last symbol of w
Suppose thisstate is p, /.e., 5(q,X) =p
a - the last
Then 5 (q, w) is what we get by making a transition from state p on input
symbol of w | |
Word Level Analysis (43)
‘ («2) Natural Language Processing
NFA: Formal definition

Scanned with CamScanner


dees eaerer sect aratetnen amegiager nnemer, emer s Th

the language
Design a DFA to accept
n number of 1} A nondeterministic finite automaton (NFA)is a tuple A = (Q, 2, 5, q,. F) where:
n number of 0 and an eve
L = {w| whasboth an eve
1. Qis a finite set of states
The Language of a DFA
5, go, F), denoted L(A) is define
d by E is a finite set of input-symbols
N
The language of a DFAA= (Q, £,
5 q, € Qis the start state
L(A) = {w | 5 (go, w) isin FP}
a

®
k
,j
e the start state go to the one of the F (F <Q)is the setoffinal or accepting states

®
The language of is the set of strings w that tak
t;
t
Q and an input symbol in
5, the transition function is a function that takes a state in

ao
accepting states
language A as arguments and returns a subset of Q
If L is a L(A) from some DFA,then L is a regular
in the type of value that 6 retums
The only difference between a,NFA and a DFAis
Nondeterministic Finite Automata (NFA) in 01
as Example: An NFA accepting strings that end
A NFA hasthe powerto be in several states at once. This ability is often expressed
an ability to “guess” something aboutits input . Each NFA accepts a language thatis A = ({q0, 1, 92}, {0,1}, 5, 90, {92})
table
also accepted by some DFA. NFAare often more succinct and easier than DFAs . We wherethe transition function 5 is given by the
can always convert an NFA to a DFA,but the latter may have exponentially more states | 1
than the NFA (a rare case) .The difference between the DFA and the NFAis the type of > gO |{q0,q1} {99}
transition function 6 ‘ qi |8 {q2}
For a NFA
5 is a function that takes a state and input symbol as arguments
(like the DFAtransition function), but returns a set of zero or more states
(rather than returning exactly one state, as the DFA must)
Example: An NFA accepting strings that end in 01
Nondeterministic automaton that acceptsall and only the strings of Os and 1s that end
The Extended Transition Function
in 01
Basics: : 5°(q, 2) = {q}
only in the state we beganin
Without reading any input symbols, we are
Induction
Word Level Analysis (4s)
(4+) Natural Language Processing
2 _Lg re nr te
———

F '
I w, and x is the string
xa; that is a is the last symbol of
Supposew is a string of the form
aa

Scanned with CamScanner


_

ol Equivalence of Deterministic and Nondeterministic Finite Automata


eee

consisting ofall but the last symb


.., Pk} Every language that can be described by some NFA canalso be described by some
Also suppose that 5, (q, x) = {P1, P2,..
DFA. The DFAin practice has about as many states as the NFA, althoughit has more
Let ,
transitions. In the worst case, the smallest DFA can have 2° (for a smallest NFA with n
a

U 5(Pi, 2) = {Fs Gelb state)


ij=1
5. Finite-State Morphological Parsing
Then: 8(q, W) = {r,t
from Consider a simple example: parsing just the productive nominal plural (-s) and the verbal
We compute 3°(q, w) byfirst computing 5°(q, x) and by then following any transition
any of these states that is labeled a progressive (-ing). Our goal will be to take inputformslike thosein thefirst column below
and produce output forms like those in the second column.
Example: An NFA accepting strings that end in 01
Morphological Parsed Output
D.1
cats cat+N +PL
cat cat+N +SG
cities city +N +PL
geese goose +N +PL
goose (goose +N +SG)or (goose +V)
Processing w = 00101
gooses goose +V 3SG
1. &(q0, €) = {qO}
merging merge +V +PRES — PART
2. 8(q0, 0) = &(qO, O) = {q0, 41}
caught (catch +V +PAST — PART) or(catch +V +PAST)
3. &(q0, 00) = 8(g0,.0) U &(q1, 0) = {q0, q1} U 8 {q0, qi}
morphological
The second column contains the stem of each word as well as assorted
4. &(g0, 007) = 5(q0, 1) U 8(q7, 1) = {gO} U {q2} {q0, q2} For example the
features. These features specify additional information about the stem.
5. 8(g0, 0010) = 8(g0, O) U 5(qg2, 0) = {q0,q1} U 6={q0, qI} that it is plural.
feature +N meansthat the word is a noun; +SG meansit is singular, +PL
6. 8(q0, 00101) = &(g0, 1) U 5(q1, 1) = {q0} U {q2} = {q0, q2} that some of the input
consider +SG to be a primitive unit that means‘singular’. Note
nt morphological parses.
The Language of a NFA forms (like caught or goose) will be ambiguous between differe
least the following:
The language of a NFAA (Q, £,5, q,, F), denoted L(A) is defined by In order to build a morphological parser, we'll need at
L(A) = {w | 8(q0, w) NF # 6} with basic information about them
4. alexicon: The list of stems and affixes, together
etc).
The language of A is the set of strings w € = * such that 6°(q,, W) contains at least one (whether a stem is a Noun stem or a Verb stem,
accepting state. dering that explains which classes of
2. morphotactics: the model of morphemeor
inside a word. For example,
The fact that choosing using the input symbols of w lead to a non-accepting state, or do morphemes can follow other classes of morphemes
follows the noun rather than precedingit.
notleadto any state atall, does not prevent w from being accepted by a NFA as a whole. the rule that the English plural morpheme
ot
&
3
4
A
r
(+s) Natural Language Processing Word Level Analysis (#7)

I
eal

) changesthat occur

Scanned with CamScanner |


ling rules are used to model the
3. orthographic rules: these spel irreg—past-verb-form
| bine (for example the y ! ie spelling
in a word, usually when two morphemes com
+ -s to cities rather than cities). preterite (ed) oo!
rule discussed above that changes city
6. Building a Finite-State Lexicon a
Alexiconis a repository for words. The simplestpossible lexicon would consist of an explicit
irreg—verb—stem
list of every word of the language (every word,i.e. including abbreviations (‘AAA’) and proper
names(‘Jane’ or‘Beijing’) as follows: a, AAA, AA, Aachen, aardvark, aardwolf, aba, abaca,
Figure 4: A finite-state automatonfor English verbal inflection
aback
This lexicon has three stem classes (reg-verb-stem, irreg-verb-stem, and irreg-past-
Since it will often be inconvenient or impossible, for the various rea- sons we discussed
verb-form), plus 4 more affix classes (-ed past, -ed participle, -ing participle, and 3rd
above, to list every word in the language, computational lexicons are usually structured
singular -s):
with a list of each of the stems andaffixes of the language together with a representation
of the morphotactics that tells us how they can fit together. There are many ways to reg-verb-stem irreg-verb-stem rea past past-part pres-part 3sg
model morphotactics; one of the most common is the finite-state automaton. A very
walk cut caught -ed -ed -ing -s
simplefinite-state model for English nominal inflection might look like Figure 3.
fry speak ate
The FSAin Figure 3 assumesthat the lexicon includes regular nouns (reg-noun) that
take the regular -s plural (e.g. cat, dog, fox, aardvark). These are the vast majority of talk sing eaten
English nounssince for now wewill iignore the fact that the plural of words like fox have impeach sang
an inserted e: foxes. The lexicon also includes irregular noun forms that cut
don’t take -s,
both singularirreg-sg-noun (goose, mouse) and plural irreg-pl-noun spoken
(geese, mice).
reg—noun plural (—s) English derivational morphology is significantly more complex than English infiectional
morphology, and so automata for modelling English derivation tend to be quite complex.
irreg—pl-noun
Some models of English derivation, in fact, are based on the more complex context-free
grammars
irregq—sg—noun
As a preliminary example, though, of the kind of analysis it would require, we present
Figure 3: A finite-state automaton for English
nominalinflection. a small part of the morphotactics of English adjectives, taken from Antworth (1990).
reg-noun irreg-pi-noun irreg-sg-noun Antworth offers the following data on English adjectives:
fox geese goose big, bigger, biggest
Ls
cool, cooler, coolest, coolly
cat sheep sheep
red, redder, reddest
dog mice mouse clear, clearer, clearest, clearly, unclear, unclearty
aardvark happy, happier, happiest, happily
unhappy, unhappier, unhappiest, unhappily
A similar modelfor English verbalinf real, unreal, really
lection might looklike Figure 4,

ee
Veord Level Analysis (#9)
(+s) Natural Language Processing
can have an optional Bie a. an
: An initial hypothesis might be that adjectives zelV -ation/N
bullet Aaa te ee

noun
is mig suggest
suffix (-er, -est, Or *y).

Scanned with CamScanner


cool, etc) and an optional
Se

obligatory root (big, In the table


ESA will recognize all the adjectives
the FSA in Figure 5. Alas, while this
like unbig, redly, and realest. We need
nize ungrammatical forms
above,it will also recog So adj-roott
which can occur with which suffixes.
B to set up classes of roots and specify real) while
, and
Br would include adjectives that can occur with un- and -ly (clear, happy nts
(big, cool, and red). Antworth (1 990) prese
adj-root2 will include adjectives that can’t
gives an idea of the complexity to
Figure 6 as a partial solution to these problems. This
be expected from English derivation.
-er -est
un- adj-root -ly
Gi) —ful/A

-ogy.
Figure 7: An FSA for another fragment ofEnglish derivational morphol
:
Figure 5: An FSA for a fragment ofEnglish adjective morphology: Antworth’s Proposal#1. f 0 x
adj-root,
-er —ly
-est
g (HOO
Figure 6: An FSA for a fragmentof English adjective morphology: Antworth’s Proposal #2
This gives an idea of the complexity to be expected from English derivation. For a further \ Q
an: wegive in Figure 7 another fragment of an FSA for English nominal and verbal |
ihiehataanean 993), Bauer (1983), and Porter (1980). This
U
: , such as the well-known generalization that
any verb endingin -ize can be followed by the nominalizing suffix -ation (Bauer,
1983; n will
Sproat, 1993)). Thus since there is a word fossilize, we can predict Figure 8: Compiled FSA fora few English nouns with their inflection. Note that this automato
the word fossilization .
— ly, adjectives ending in -al or -able at q5 (equal
b y following states q0, q1, and q2. Similar meat ———
formal, , realizable) can take the Suffix -ity, or someti
mes th
(naturalness, casualness).
if‘ seu :ss tose
i
oe
}
:
= €3
- 5i
L . | .* Nord Level Analysis
¥, oe
ee | Natural Language Processing
f
hen
Figure 8 shows the noun-recognition FSA produced by expanding the Nominal Inflection FSA of Figure 9 with sample regular and irregular nouns for each class. We can use Figure 8 to recognize strings like aardvarks by simply starting at the initial state and comparing the input letter by letter with each word on each outgoing arc, and so on.

Finite-State Transducers

We've now seen that FSAs can represent the morphotactic structure of a lexicon, and can be used for word recognition. A transducer maps between one representation and another: a finite-state transducer or FST is a type of finite automaton which maps between two sets of symbols. We can visualize an FST as a two-tape automaton which recognizes or generates pairs of strings. Intuitively, we can do this by labeling each arc in the finite-state machine with two symbol strings, one from each tape. Fig. 9 shows an example of an FST where each arc is labeled by an input and output string, separated by a colon.

Figure 9: A finite-state transducer

The FST thus has a more general function than an FSA; where an FSA defines a formal language by defining a set of strings, an FST defines a relation between sets of strings. Another way of looking at an FST is as a machine that reads one string and generates another. Here's a summary of this four-fold way of thinking about transducers:

FST as recognizer: a transducer that takes a pair of strings as input and outputs accept if the string-pair is in the string-pair language, and reject if it is not.
FST as generator: a machine that outputs pairs of strings of the language. Thus the output is a yes or no, and a pair of output strings.
FST as translator: a machine that reads a string and outputs another string.
FST as set relater: a machine that computes relations between sets.

All of these have applications in speech and language processing. For morphological parsing (and for many other NLP applications), we will apply the FST-as-translator metaphor, taking as input a string of letters and producing as output a string of morphemes.

An FST can be formally defined in a number of ways; we will rely on the following definition, based on what is called the Mealy machine extension to a simple FSA:

Q: a finite set of states q0, q1, ..., qN-1
Σ: a finite set corresponding to the input alphabet
Δ: a finite set corresponding to the output alphabet
q0 ∈ Q: the start state
F ⊆ Q: the set of final states
δ(q, w): the transition function or transition matrix between states. Given a state q ∈ Q and a string w ∈ Σ*, δ(q, w) returns a set of new states Q' ⊆ Q. δ is thus a function from Q × Σ* to 2^Q (because there are 2^Q possible subsets of Q). δ returns a set of states rather than a single state because a given input may be ambiguous in which state it maps to.
σ(q, w): the output function giving the set of possible output strings for each state and input. Given a state q ∈ Q and a string w ∈ Σ*, σ(q, w) gives a set of output strings, each a string o ∈ Δ*. σ is thus a function from Q × Σ* to 2^(Δ*).

Where FSAs are isomorphic to regular languages, FSTs are isomorphic to regular relations. Regular relations are sets of pairs of strings, a natural extension of the regular languages, which are sets of strings. Like FSAs and regular languages, FSTs and regular relations are closed under union, although in general they are not closed under difference, complementation and intersection (although some useful subclasses of FSTs are closed under these operations; in general, FSTs that are not augmented with the ε are more likely to have such closure properties). Besides union, FSTs have two additional closure properties that turn out to be extremely useful:

inversion: The inversion of a transducer T (T⁻¹) simply switches the input and output labels. Thus, if T maps from the input alphabet I to the output alphabet O, T⁻¹ maps from O to I.
composition: If T1 is a transducer from I1 to O1 and T2 a transducer from O1 to O2, then T1 ∘ T2 maps from I1 to O2.
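To make the formal definition above concrete, the following is a minimal Python sketch of a Mealy-style transducer. The names (delta, transduce) and the toy transition table are illustrative only; the toy machine realizes the simple relation [a:b]+ mentioned later in the composition example.

# A minimal Mealy-style FST sketch: delta maps (state, input symbol) to a
# set of (next state, output symbol) pairs, so the machine can be
# nondeterministic, as the definition above allows.
delta = {
    (0, "a"): {(1, "b")},   # first a maps to b
    (1, "a"): {(1, "b")},   # each further a maps to b
}
start, finals = 0, {1}

def transduce(word):
    """Return the set of output strings the FST produces for `word`."""
    configs = {(start, "")}                  # (current state, output so far)
    for symbol in word:
        configs = {
            (nxt, out + o)
            for (state, out) in configs
            for (nxt, o) in delta.get((state, symbol), set())
        }
    return {out for (state, out) in configs if state in finals}

print(transduce("aaa"))   # {'bbb'}
print(transduce("ab"))    # set(): the pair is rejected, so no output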
Inversion is useful because it makes it easy to convert an FST-as-parser into an FST-as-generator. Composition is useful because it allows us to take two transducers that run in series and replace them with one more complex transducer. Composition works as in algebra; applying T1 ∘ T2 to an input sequence S is identical to applying T1 to S and then T2 to the result; thus T1 ∘ T2(S) = T2(T1(S)).

Fig. 10, for example, shows the composition of [a:b]+ with [b:c]+ to produce [a:c]+.

Figure 10: The composition of [a:b]+ with [b:c]+ to produce [a:c]+.

The projection of an FST is the FSA that is produced by extracting only one side of the relation. We can refer to the projection to the left or upper side of the relation as the upper or first projection, and the projection to the lower or right side of the relation as the lower or second projection.

Morphological Parsing with Finite-State Transducers

Let's now turn to the task of morphological parsing. Given the input cats, for instance, we'd like to output cat +N +Pl, telling us that cat is a plural noun. Given the Spanish input bebo ('I drink'), we'd like beber +V +PInd +1P +Sg, telling us that bebo is the present indicative first-person singular form of the Spanish verb beber, 'to drink'. In the finite-state morphology paradigm that we will use, we represent a word as a correspondence between a lexical level, which represents a concatenation of morphemes making up a word, and the surface level, which represents the concatenation of letters which make up the actual spelling of the word. Fig. 11 shows these two levels for (English) cats.

Figure 11: Schematic examples of the lexical and surface tapes; the actual transducers will involve intermediate tapes as well.

For finite-state morphology it's convenient to view an FST as having two tapes. The upper or lexical tape is composed of characters from one alphabet Σ. The lower or surface tape is composed of characters from another alphabet Δ. In the two-level morphology of Koskenniemi (1983), we allow each arc only to have a single symbol from each alphabet. We can then combine the two symbol alphabets Σ and Δ to create a new alphabet, Σ', which makes the relationship to FSAs quite clear. Σ' is a finite alphabet of complex symbols. Each complex symbol is composed of an input-output pair i : o; one symbol i from the input alphabet Σ, and one symbol o from an output alphabet Δ, thus Σ' ⊆ Σ × Δ. Σ and Δ may each also include the epsilon symbol ε. Thus, where an FSA accepts a language stated over a finite alphabet of single symbols, such as the alphabet of our sheep language:

(3.2) Σ = {b, a, !}

an FST defined this way accepts a language stated over pairs of symbols, as in:

(3.3) Σ' = {a:a, b:b, !:!, a:!, a:ε, ε:!}

In two-level morphology, the pairs of symbols in Σ' are also called feasible pairs. Thus each symbol a:b in the transducer alphabet Σ' expresses how the symbol a from one tape is mapped to the symbol b on the other tape. For example a:ε means that an a on the upper tape will correspond to nothing on the lower tape. Just as for an FSA, we can write regular expressions in the complex alphabet Σ'. Since it's most common for symbols to map to themselves, in two-level morphology we call pairs like a:a default pairs, and just refer to them by the single letter a.

We are now ready to build an FST morphological parser out of our earlier morphotactic FSAs and lexica by adding an extra "lexical" tape and the appropriate morphological features. Fig. 12 shows an augmentation of Fig. 13 with the nominal morphological features (+Sg and +Pl) that correspond to each morpheme. The symbol ^ indicates a morpheme boundary, while the symbol # indicates a word boundary. The morphological features map to the empty string ε or the boundary symbols since there is no segment corresponding to them on the output tape.

Figure 12: A schematic transducer for English nominal number inflection Tnum. The symbols above each arc represent elements of the morphological parse in the lexical tape; the symbols below each arc represent the surface tape (or intermediate tape, to be described later), using the morpheme-boundary symbol ^ and word-boundary marker #.
The labels on the arcs leaving q0 are schematic, and need to be expanded by individual words in the lexicon. In order to use Fig. 12 as a morphological noun parser, it needs to be expanded with all the individual regular and irregular noun stems, replacing the labels reg-noun etc. In order to do this, we need to update the lexicon for this transducer, so that irregular plurals like geese will parse into the correct stem goose +N +Pl. We do this by allowing the lexicon to also have two levels. Since surface geese maps to lexical goose, the new lexical entry will be "g:g o:e o:e s:s e:e". Regular forms are simpler; the two-level entry for fox will now be "f:f o:o x:x", but by relying on the orthographic convention that f stands for f:f and so on, we can simply refer to it as fox and the form for geese as "g o:e o:e s e". Thus, the lexicon will look only slightly more complex:

reg-noun     irreg-pl-noun      irreg-sg-noun
fox          g o:e o:e s e      goose
cat          sheep              sheep
aardvark     m o:i u:ε s:c e    mouse

Figure 13: A fleshed-out English nominal inflection FST Tlex, expanded from Tnum by replacing the three arcs with individual word stems (only a few sample word stems are shown).

The resulting transducer, shown in Fig. 13, will map plural nouns into the stem plus the morphological marker +Pl, and singular nouns into the stem plus the morphological marker +Sg. Thus a surface cats will map to cat +N +Pl. This can be viewed in feasible-pair format as follows:

c:c a:a t:t +N:ε +Pl:^s#

Since the output symbols include the morpheme and word boundary markers ^ and #, the lower labels do not correspond exactly to the surface level. Hence we refer to tapes with these morpheme boundary markers in Fig. 14 as intermediate tapes; the next section will show how the boundary marker is removed.

Figure 14: An example of the lexical and intermediate tapes.

Orthographic Rules and Finite-State Transducers

The method described in the previous section will successfully recognize words like aardvarks and mice. But just concatenating the morphemes won't work for cases where there is a spelling change; it would incorrectly reject an input like foxes and accept an input like foxs. We need to deal with the fact that English often requires spelling changes at morpheme boundaries by introducing spelling rules (or orthographic rules). Here we will introduce a number of notations for writing such rules and show how to implement the rules as transducers. Some of these spelling rules:

Name                 Description of Rule                                 Example
Consonant doubling   1-letter consonant doubled before -ing/-ed          beg/begging
E deletion           Silent e dropped before -ing and -ed                make/making
E insertion          e added after -s, -z, -x, -ch, -sh before -s        watch/watches
Y replacement        -y changes to -ie before -s, -i before -ed          try/tries
K insertion          verbs ending with vowel + -c add -k                 panic/panicked
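Before seeing how such rules are turned into transducers, it can help to read them as ordinary string rewrites. The sketch below is purely illustrative (the function names add_s and add_ing are not from any library, and real orthography has more exceptions than these regular expressions capture); it applies the E-insertion and E-deletion rows of the table above.

import re

def add_s(stem):
    # E insertion: add an e after -s, -z, -x, -ch, -sh before the -s suffix
    # (fox + -s -> foxes, watch + -s -> watches); otherwise just append s.
    return stem + ("es" if re.search(r"(s|z|x|ch|sh)$", stem) else "s")

def add_ing(stem):
    # E deletion: drop a silent final e before -ing (make + -ing -> making).
    return re.sub(r"e$", "", stem) + "ing"

assert add_s("fox") == "foxes" and add_s("cat") == "cats"
assert add_ing("make") == "making" and add_ing("watch") == "watching"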
We can think of these spelling changes as taking as input a simple concatenation of morphemes and producing as output a slightly-modified, correctly-spelled concatenation of morphemes. Figure 15 shows the three levels we are talking about: lexical, intermediate, and surface. So for example we could write an E-insertion rule that performs the mapping from the intermediate to surface levels shown in Figure 15.

Figure 15: An example of the lexical, intermediate and surface tapes.

Between each pair of tapes is a 2-level transducer: the lexical transducer between the lexical and intermediate levels, and the E-insertion spelling rule between the intermediate and surface levels. The E-insertion spelling rule inserts an e on the surface tape when the intermediate tape has a morpheme boundary ^ followed by the morpheme -s. Such a rule might say something like "insert an e on the surface tape just when the lexical tape has a morpheme ending in x (or z, etc.) and the next morpheme is -s":

ε -> e / {x, s, z} ^ __ s#      (T.1)

This is the rule notation of Chomsky and Halle (1968); a rule of the form a -> b / c __ d means 'rewrite a as b when it occurs between c and d'. Since the symbol ε means an empty transition, replacing it means inserting something. The symbol ^ indicates a morpheme boundary. These boundaries are deleted by including the symbol ^:ε in the default pairs for the transducer; thus morpheme boundary markers are deleted on the surface level by default. (Recall that the colon is used to separate symbols on the intermediate and surface forms.) The # symbol is a special symbol that marks a word boundary. Thus (T.1) means 'insert an e after a morpheme-final x, s, or z, and before the morpheme s'. Figure 16 shows an automaton that corresponds to this rule.

Figure 16: The transducer for the E-insertion rule of (T.1), extended from a similar transducer in Antworth.

The idea in building a transducer for a particular rule is to express only the constraints necessary for that rule, allowing any other string of symbols to pass through unchanged. This rule is used to ensure that we can only see the e:e pair if we are in the proper context. So state q0, which models having seen only default pairs unrelated to the rule, is an accepting state, as is q1, which models having seen a z, s, or x. q2 models having seen the morpheme boundary after the z, s, or x, and again is an accepting state. State q3 models having just seen the E-insertion; it is not an accepting state, since the insertion is only allowed if it is followed by the s morpheme and then the end-of-word symbol #.

The other symbol is used in Figure 16 to safely pass through any parts of words that don't play a role in the E-insertion rule. other means 'any feasible pair that is not in this transducer'; it is thus a version of @:@ which is context-dependent in a transducer-by-transducer way. So for example when leaving state q0, we go to q1 on the z, s, or x symbols, rather than following the other arc and staying in q0. The semantics of other depends on what symbols are on other arcs; since # is mentioned on some arcs, it is (by definition) not included in other, and thus, for example, is explicitly mentioned on arcs such as the one from q2 back to q0.

A transducer needs to correctly reject a string that applies the rule when it shouldn't. One possible bad string would have the correct environment for the E-insertion, but have no insertion. State q5 is used to ensure that the e is always inserted whenever the environment is appropriate; the transducer reaches q5 only when it has seen an s after an appropriate morpheme boundary. If the machine is in state q5 and the next symbol is #, the machine rejects the string (because there is no legal transition on # from q5).
Figure 17 shows the transition table for the rule, which makes the illegal transitions explicit with the '-' symbol.

State   s:s   x:x   z:z   ^:ε   ε:e   #    other
q0:     1     1     1     0     -     0    0
q1:     1     1     1     2     -     0    0
q2:     5     1     1     0     3     0    0
q3      4     -     -     -     -     -    -
q4      -     -     -     -     -     0    -
q5:     1     1     1     2     -     -    0

Figure 17: The state-transition table for the E-insertion rule of Figure 16.
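Assuming the reconstruction of the table above, the same information can be encoded directly as a lookup table and used to check a sequence of feasible pairs. The state numbering, the set of accepting states, and the helper accepts() below are illustrative; the sketch assumes the pairing between intermediate and surface symbols has already been chosen.

# Transition table for the E-insertion rule: rows are states, columns are
# feasible pairs; None marks an illegal move ('-' in the figure).
COLS = ["s:s", "x:x", "z:z", "^:eps", "eps:e", "#", "other"]
TABLE = {
    0: [1, 1, 1, 0, None, 0, 0],
    1: [1, 1, 1, 2, None, 0, 0],
    2: [5, 1, 1, 0, 3, 0, 0],
    3: [4, None, None, None, None, None, None],
    4: [None, None, None, None, None, 0, None],
    5: [1, 1, 1, 2, None, None, 0],
}
ACCEPTING = {0, 1, 2, 5}   # the states described as accepting above

def accepts(pairs):
    """Walk the table over a sequence of column labels; True if accepted."""
    state = 0
    for pair in pairs:
        state = TABLE[state][COLS.index(pair)]
        if state is None:
            return False
    return state in ACCEPTING

# fox^s# with the e inserted is accepted ...
print(accepts(["other", "other", "x:x", "^:eps", "eps:e", "s:s", "#"]))  # True
# ... while the same context without the insertion (surface "foxs") is rejected.
print(accepts(["other", "other", "x:x", "^:eps", "s:s", "#"]))           # False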
7. Lexicon free FST Porter stemmer

While building a transducer from a lexicon plus rules is the standard algorithm for morphological parsing, there are simpler algorithms that don't require the large on-line lexicon demanded by this algorithm. These are used especially in Information Retrieval (IR) tasks in which a user needs some information and is looking for relevant documents (perhaps on the web, perhaps in a digital library database). The user gives the system a query with some important characteristics of documents she desires, and the IR system retrieves what it thinks are the relevant documents. One common type of query is Boolean combinations of relevant keywords or phrases, e.g. (marsupial OR kangaroo OR koala). The system then returns documents that have these words in them. Since a document with the word marsupials might not match the keyword marsupial, some IR systems first run a stemmer on the keywords and on the words in the document. Since morphological parsing in IR is only used to help form equivalence classes, the details of the suffixes are irrelevant; what matters is determining that two words have the same stem.

One of the most widely used such stemming algorithms is the simple and efficient Porter (1980) algorithm, which is based on a series of simple cascaded rewrite rules. Since cascaded rewrite rules are just the sort of thing that could be easily implemented as an FST, we think of the Porter algorithm as a lexicon-free FST stemmer. The algorithm contains rules like:

Rule: ATIONAL -> ATE (e.g. relational -> relate)
Rule: ING -> ε if the stem contains a vowel (e.g. motoring -> motor)

Do stemmers really improve the performance of information retrieval engines? One problem is that stemmers are not perfect. The following kinds of errors of omission and of commission occur in the Porter algorithm:

Errors of Commission
organization - organ
doing - doe
generalization - generic
numerical - numerous
policy - police
university - universe
negligible - negligent

Errors of Omission
European - Europe
analysis - analyzes
matrices - matrix
noise - noisy
sparse - sparsity
explain - explanation
urgency - urgent

Porter Stemmer

A consonant in a word is a letter other than A, E, I, O or U, and other than Y preceded by a consonant. (The fact that the term 'consonant' is defined to some extent in terms of itself does not make it ambiguous.) So in TOY the consonants are T and Y, and in SYZYGY they are S, Z and G. If a letter is not a consonant it is a vowel.

A consonant will be denoted by c, a vowel by v. A list ccc... of length greater than 0 will be denoted by C, and a list vvv... of length greater than 0 will be denoted by V. Any word, or part of a word, therefore has one of the four forms:

CVCV ... C
CVCV ... V
VCVC ... C
VCVC ... V

These may all be represented by the single form

[C]VCVC ... [V]

where the square brackets denote arbitrary presence of their contents. Using (VC)^m to denote VC repeated m times, this may again be written as

[C](VC)^m[V].

m will be called the measure of any word or word part when represented in this form. The case m = 0 covers the null word. Here are some examples:

m=0   TR, EE, TREE, Y, BY.
m=1   TROUBLE, OATS, TREES, IVY.
m=2   TROUBLES, PRIVATE, OATEN, ORRERY.

The rules for removing a suffix will be given in the form

(condition) S1 -> S2

This means that if a word ends with the suffix S1, and the stem before S1 satisfies the given condition, S1 is replaced by S2. The condition is usually given in terms of m, e.g.

(m > 1) EMENT ->

Here S1 is 'EMENT' and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2.

The 'condition' part may also contain the following:

*S  - the stem ends with S (and similarly for the other letters).
*v* - the stem contains a vowel.
*d  - the stem ends with a double consonant (e.g. -TT, -SS).
*o  - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP).

And the condition part may also contain expressions with and, or and not, so that

(m>1 and (*S or *T))

tests for a stem with m>1 ending in S or T, while

(*d and not (*L or *S or *Z))

tests for a stem ending with a double consonant other than L, S or Z. Elaborate conditions like this are required only rarely.

In a set of rules written beneath each other, only one is obeyed, and this will be the one with the longest matching S1 for the given word. For example, with

SSES -> SS
IES  -> I
SS   -> SS
S    ->

(here the conditions are all null) CARESSES maps to CARESS since SSES is the longest match for S1. Equally CARESS maps to CARESS (S1='SS') and CARES to CARE (S1='S').
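Before listing the steps, note that the measure m can be computed mechanically from the consonant/vowel pattern. The following sketch is not part of Porter's original presentation; the helper names cv_pattern and measure are illustrative, and the asserts reproduce the examples given above.

def cv_pattern(word):
    # Map each letter to 'c' (consonant) or 'v' (vowel) following Porter's
    # definition: A, E, I, O, U are vowels, and Y is a vowel when it is
    # preceded by a consonant (TOY -> cvc, SYZYGY -> cvcvcv).
    pattern = ""
    for i, ch in enumerate(word.upper()):
        if ch in "AEIOU" or (ch == "Y" and i > 0 and pattern[i - 1] == "c"):
            pattern += "v"
        else:
            pattern += "c"
    return pattern

def measure(stem):
    # m = number of VC sequences when the stem is written as [C](VC)^m[V].
    pattern = cv_pattern(stem)
    # Collapse runs of identical letters, e.g. "ccvvccv" -> "cvcv".
    collapsed = "".join(
        ch for i, ch in enumerate(pattern) if i == 0 or ch != pattern[i - 1]
    )
    return collapsed.count("vc")

assert measure("TR") == 0 and measure("TREE") == 0
assert measure("TROUBLE") == 1 and measure("OATS") == 1
assert measure("TROUBLES") == 2 and measure("PRIVATE") == 2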

In the rules below, examples of their application, successful or otherwise, are given on the right in lower case. The algorithm now follows:

Step 1a
SSES -> SS                        caresses  ->  caress
IES  -> I                         ponies    ->  poni
                                  ties      ->  ti
SS   -> SS                        caress    ->  caress
S    ->                           cats      ->  cat

Step 1b
(m>0) EED -> EE                   feed      ->  feed
                                  agreed    ->  agree
(*v*) ED  ->                      plastered ->  plaster
                                  bled      ->  bled
(*v*) ING ->                      motoring  ->  motor
                                  sing      ->  sing

If the second or third of the rules in Step 1b is successful, the following is done:

AT -> ATE                         conflat(ed)  ->  conflate
BL -> BLE                         troubl(ed)   ->  trouble
IZ -> IZE                         siz(ed)      ->  size
(*d and not (*L or *S or *Z)) -> single letter
                                  hopp(ing)    ->  hop
                                  tann(ed)     ->  tan
                                  fall(ing)    ->  fall
                                  hiss(ing)    ->  hiss
                                  fizz(ed)     ->  fizz
(m=1 and *o) -> E                 fail(ing)    ->  fail
                                  fil(ing)     ->  file

The rule to map to a single letter causes the removal of one of the double letter pair. The -E is put back on -AT, -BL and -IZ, so that the suffixes -ATE, -BLE and -IZE can be recognised later. This E may be removed in step 4.

Step 1c
(*v*) Y -> I                      happy  ->  happi
                                  sky    ->  sky

Step 1 deals with plurals and past participles. The subsequent steps are much more straightforward.

Step 2
(m>0) ATIONAL -> ATE              relational     ->  relate
(m>0) TIONAL  -> TION             conditional    ->  condition
                                  rational       ->  rational
(m>0) ENCI    -> ENCE             valenci        ->  valence
(m>0) ANCI    -> ANCE             hesitanci      ->  hesitance
(m>0) IZER    -> IZE              digitizer      ->  digitize
(m>0) ABLI    -> ABLE             conformabli    ->  conformable
(m>0) ALLI    -> AL               radicalli      ->  radical
(m>0) ENTLI   -> ENT              differentli    ->  different
(m>0) ELI     -> E                vileli         ->  vile
(m>0) OUSLI   -> OUS              analogousli    ->  analogous
(m>0) IZATION -> IZE              vietnamization ->  vietnamize
(m>0) ATION   -> ATE              predication    ->  predicate
(m>0) ATOR    -> ATE              operator       ->  operate
(m>0) ALISM   -> AL               feudalism      ->  feudal
(m>0) IVENESS -> IVE              decisiveness   ->  decisive
(m>0) FULNESS -> FUL              hopefulness    ->  hopeful
(m>0) OUSNESS -> OUS              callousness    ->  callous
(m>0) ALITI   -> AL               formaliti      ->  formal
(m>0) IVITI   -> IVE              sensitiviti    ->  sensitive
(m>0) BILITI  -> BLE              sensibiliti    ->  sensible

The test for the string S1 can be made fast by doing a program switch on the penultimate letter of the word being tested. This gives a fairly even breakdown of the possible values of the string S1. It will be seen in fact that the S1-strings in step 2 are presented here in alphabetical order of their penultimate letter. Similar techniques may be applied in the other steps.
Step 3
(m>0) ICATE -> IC                 triplicate  ->  triplic
(m>0) ATIVE ->                    formative   ->  form
(m>0) ALIZE -> AL                 formalize   ->  formal
(m>0) ICITI -> IC                 electriciti ->  electric
(m>0) ICAL  -> IC                 electrical  ->  electric
(m>0) FUL   ->                    hopeful     ->  hope
(m>0) NESS  ->                    goodness    ->  good

Step 4
(m>1) AL    ->                    revival     ->  reviv
(m>1) ANCE  ->                    allowance   ->  allow
(m>1) ENCE  ->                    inference   ->  infer
(m>1) ER    ->                    airliner    ->  airlin
(m>1) IC    ->                    gyroscopic  ->  gyroscop
(m>1) ABLE  ->                    adjustable  ->  adjust
(m>1) IBLE  ->                    defensible  ->  defens
(m>1) ANT   ->                    irritant    ->  irrit
(m>1) EMENT ->                    replacement ->  replac
(m>1) MENT  ->                    adjustment  ->  adjust
(m>1) ENT   ->                    dependent   ->  depend
(m>1 and (*S or *T)) ION ->       adoption    ->  adopt
(m>1) OU    ->                    homologou   ->  homolog
(m>1) ISM   ->                    communism   ->  commun
(m>1) ATE   ->                    activate    ->  activ
(m>1) ITI   ->                    angulariti  ->  angular
(m>1) OUS   ->                    homologous  ->  homolog
(m>1) IVE   ->                    effective   ->  effect
(m>1) IZE   ->                    bowdlerize  ->  bowdler

The suffixes are now removed. All that remains is a little tidying up.

Step 5a
(m>1) E ->                        probate  ->  probat
                                  rate     ->  rate
(m=1 and not *o) E ->             cease    ->  ceas

Step 5b
(m>1 and *d and *L) -> single letter
                                  controll ->  control
                                  roll     ->  roll
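To make the rule format and the longest-match convention concrete, here is a minimal sketch that applies only the Step 1a rules (whose conditions are all null). The function apply_rules is illustrative and covers just one step; a full implementation also checks the m-based conditions at every step, and off-the-shelf implementations exist (for example NLTK's PorterStemmer).

def apply_rules(word, rules):
    # Apply the single rule whose suffix S1 is the longest match for the word,
    # as the convention above requires; conditions are ignored here because
    # the Step 1a conditions are all null.
    for suffix, replacement in sorted(rules, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

step_1a = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

assert apply_rules("caresses", step_1a) == "caress"
assert apply_rules("ponies", step_1a) == "poni"
assert apply_rules("caress", step_1a) == "caress"
assert apply_rules("cats", step_1a) == "cat"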
8. N-Grams

Imagine listening to someone as they speak and trying to guess the next word that they are going to say. For example, what word is likely to follow this sentence fragment?

I'd like to make a collect...

Probably the most likely word is call, although it's possible the next word could be telephone, or person-to-person or international. (Think of some others.) Guessing the next word (or word prediction) is an essential subtask of speech recognition, handwriting recognition, augmentative communication for the disabled, and spelling error detection. In such tasks, word identification is difficult because the input is very noisy and ambiguous. Thus, looking at previous words can give us an important clue about what the next ones are going to be.

An N-gram is simply a sequence of N words. For instance, let us take a look at the following examples.

1. San Francisco (is a 2-gram)
2. The Three Musketeers (is a 3-gram)
3. She stood up slowly (is a 4-gram)

Now which of these three N-grams have you seen quite frequently? Probably "San Francisco" and "The Three Musketeers". On the other hand, you might not have seen "She stood up slowly" that frequently. Basically, "She stood up slowly" is an example of an N-gram that does not occur as often in sentences as Examples 1 and 2.

Now if we assign a probability to the occurrence of an N-gram or the probability of a word occurring next in a sequence of words, it can be very useful. Why?

First of all, it can help in deciding which N-grams can be chunked together to form single entities (like "San Francisco" chunked together as one word, "high school" being chunked as one word).

It can also help make next word predictions. Say you have the partial sentence "Please hand over your". Then it is more likely that the next word is going to be "test" or "assignment" or "paper" than the next word being "school".

We formalize this idea of word prediction with probabilistic models called N-gram models, which predict the next word from the previous N-1 words. Such statistical models of word sequences are also called language models or LMs. Computing the probability of the next word will turn out to be closely related to computing the probability of a sequence of words. The following sequence, for example, has a non-zero probability of appearing in a text:

... all of a sudden, I notice three guys standing on the sidewalk...

while this same set of words in a different order has a much lower probability:

on guys all I of notice sidewalk three a sudden standing the

As we will see, estimators like N-grams that assign a conditional probability to possible next words can be used to assign a joint probability to an entire sentence. Whether estimating probabilities of next words or of whole sequences, the N-gram model is one of the most important tools in speech and language processing.

N-grams are essential in any task in which we have to identify words in noisy, ambiguous input. In speech recognition, for example, the input speech sounds are very confusable and many words sound extremely similar. N-gram models are also essential in statistical machine translation.

They can also help to make spelling error corrections. For instance, the sentence "drink cofee" could be corrected to "drink coffee" if you knew that the word "coffee" had a high probability of occurrence after the word "drink" and also that the overlap of letters between "cofee" and "coffee" is high. In spelling correction, we need to find and correct spelling errors like the following that accidentally result in real English words:

They are leaving in about fifteen minuets to go to her house.
The design an construction of the system will take more than a year.

Since these errors are real words, we can't find them by just flagging words that aren't in the dictionary. But note that in about fifteen minuets is a much less probable sequence than in about fifteen minutes. A spellchecker can use a probability estimator both to detect these errors and to suggest higher-probability corrections.

Word prediction is also important for augmentative communication systems that help the disabled. People who are unable to use speech or sign language to communicate, like the physicist Stephen Hawking, can communicate by using simple body movements to select words from a menu that are spoken by the system. Word prediction can be used to suggest likely words for the menu. Besides these sample areas, N-grams are also crucial in NLP tasks like part-of-speech tagging, natural language generation, and word similarity, as well as in applications from authorship identification and sentiment extraction to predictive text input systems for cell phones.

Language model

Problem of Modelling Language

Formal languages, like programming languages, can be fully specified. All the reserved words can be defined and the valid ways that they can be used can be precisely defined. We cannot do this with natural language. Natural languages are not designed; they emerge, and therefore there is no formal specification. There may be formal rules for parts of the language, and heuristics, but natural language that does not conform is often used. Natural languages involve vast numbers of terms that can be used in ways that introduce all kinds of ambiguities, yet can still be understood by other humans. Further, languages change and word usages change: it is a moving target. Nevertheless, linguists try to specify the language with formal grammars and structures. It can be done, but it is very difficult and the results can be fragile. An alternative approach to specifying the model of the language is to learn it from examples.

Statistical Language Modelling, or Language Modelling (LM for short), is the development of probabilistic models that are able to predict the next word in the sequence given the words that precede it. It is a probability distribution over sequences of words.
Given such a sequence, say of length m, it assigns a probability P(w1, ..., wm) to the whole sequence.

The goal of probabilistic language modelling is to calculate the probability of a sentence or sequence of words:

P(W) = P(w1, w2, w3, ..., wN)

and it can be used to find the probability of the next word in the sequence:

P(w5 | w1, w2, w3, w4)

A model that computes either of these is called a Language Model.

Method for Calculating Probabilities:

Conditional Probability: let A and B be two events with P(B) ≠ 0; the conditional probability of A given B is:

P(A | B) = P(A, B) / P(B)

Chain Rule

The chain rule applied to compute the joint probability of words in a sequence is therefore:

P(X1, X2, ..., Xn) = P(X1) P(X2 | X1) ... P(Xn | X1, ..., Xn-1)

P(w1 w2 ... wn) = ∏i P(wi | w1 w2 ... wi-1)

For example,

P("its water is so transparent") =
P(its) ×
P(water | its) ×
P(is | its water) ×
P(so | its water is) ×
P(transparent | its water is so)

This is a lot to calculate; could we not simply estimate this by counting and dividing, as shown in the following formula?

P(transparent | its water is so) = count(its water is so transparent) / count(its water is so)

In general, no! There are far too many possible sentences in this method that would need to be calculated, and we would have very sparse data, making results unreliable.

Markov Property

A stochastic process has the Markov property if the conditional probability distribution of future states of the process (conditional on both past and present states) depends only upon the present state, not on the sequence of events that preceded it. A process with this property is called a Markov process.

In other words, the probability of the next word can be estimated given only the previous k words.

For example, if k=1:
P(transparent | its water is so) ≈ P(transparent | so)

or if k=2:
P(transparent | its water is so) ≈ P(transparent | is so)

General equation for the Markov Assumption:

P(wi | w1 w2 ... wi-1) ≈ P(wi | wi-k ... wi-1)

The language model provides context to distinguish between words and phrases that sound similar. For example, in American English, the phrases "recognize speech" and "wreck a nice beach" sound similar, but mean different things.

Data sparsity is a major problem in building language models. Most possible word sequences are not observed in training. One solution is to make the assumption that the probability of a word only depends on the previous n words. This is known as an n-gram model, or a unigram model when n = 1. The unigram model is also known as the bag of words model.

Estimating the relative likelihood of different phrases is useful in many natural language processing applications, especially those that generate text as an output. Language modeling is used in speech recognition, machine translation, part-of-speech tagging, parsing, Optical Character Recognition, handwriting recognition, information retrieval and other applications.

In speech recognition, sounds are matched with word sequences. Ambiguities are easier to resolve when evidence from the language model is integrated with a pronunciation model and an acoustic model.
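To see side by side what the chain rule computes and what the Markov assumption throws away, here is a small symbolic sketch. It is purely illustrative (the function names are invented for this example) and only prints the conditioning histories; it does not estimate any probabilities.

def chain_rule_factors(sentence):
    # Full chain rule: P(w1 ... wn) = product over i of P(wi | w1 ... wi-1)
    words = sentence.split()
    return [(w, tuple(words[:i])) for i, w in enumerate(words)]

def markov_factors(sentence, k=1):
    # k-th order Markov approximation: condition only on the previous k words.
    words = sentence.split()
    return [(w, tuple(words[max(0, i - k):i])) for i, w in enumerate(words)]

for word, history in chain_rule_factors("its water is so transparent"):
    print(f"P({word} | {' '.join(history)})")
for word, history in markov_factors("its water is so transparent", k=1):
    print(f"P({word} | {' '.join(history)})")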
Language models are used in information retrieval in the query likelihood model. There, a separate language model is associated with each document in a collection. Documents are ranked based on the probability of the query Q in the document's language model P(Q | Md). Commonly, the unigram language model is used for this purpose.

N-gram language model

Statistical language models, in essence, are models that assign probabilities to sequences of words. Here we'll look at the simplest model that assigns probabilities to sentences and sequences of words, the n-gram.

You can think of an N-gram as a sequence of N words; by that notion, a 2-gram (or bigram) is a two-word sequence of words like "please turn", "turn your", or "your homework", and a 3-gram (or trigram) is a three-word sequence of words like "please turn your", or "turn your homework".

Let's start with the equation P(w | h), the probability of word w given some history h. For example,

P(the | its water is so transparent that)

Here,
w = the
h = its water is so transparent that

One way to estimate the above probability function is through the relative frequency count approach, where you would take a substantially large corpus, count the number of times you see its water is so transparent that, and then count the number of times it is followed by the. In other words, you are answering the question: out of the times you saw the history h, how many times did the word w follow it?

P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)

Now, you can imagine it is not feasible to perform this over an entire corpus, especially if it is of a significant size.

This shortcoming, and ways to decompose the probability function using the chain rule, serve as the base intuition of the N-gram model. Here, instead of computing the probability using the entire corpus, you would approximate it by just a few historical words.

The Bigram Model

As the name suggests, the bigram model approximates the probability of a word given all the previous words by using only the conditional probability of one preceding word. In other words, you approximate it with the probability P(the | that).

And so, when you use a bigram model to predict the conditional probability of the next word, you are making the following approximation:

P(wn | w1 ... wn-1) ≈ P(wn | wn-1)

This assumption that the probability of a word depends only on the previous word is also known as the Markov assumption.

Markov models are the class of probabilistic models that assume that we can predict the probability of some future unit without looking too far into the past. The Markov assumption is the presumption that the future behavior of a dynamical system only depends on its recent history. In particular, in a kth-order Markov model, the next state only depends on the k most recent states; therefore an N-gram model is an (N-1)th-order Markov model.

Maximum Likelihood Estimate

P(wi | wi-1) = count(wi-1, wi) / count(wi-1)

Example

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

P(I | <s>) = 2/3 = .67      P(Sam | <s>) = 1/3 = .33     P(am | I) = 2/3 = .67
P(</s> | Sam) = 1/2 = .5    P(Sam | am) = 1/2 = .5       P(do | I) = 1/3 = .33

You can further generalize the bigram model to the trigram model, which looks two words into the past, and this can be further generalized to the N-gram model. Basically, an N-gram model predicts the occurrence of a word based on the occurrence of its N-1 previous words. An N-gram model uses the previous N-1 words to predict the next one:

P(wn | wn-N+1 wn-N+2 ... wn-1)
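The maximum likelihood estimates for the three-sentence example above can be checked directly from counts. The following sketch is illustrative (the helper p is not from any library) and reproduces the numbers shown.

from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(word, prev):
    # Maximum likelihood estimate: count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(round(p("I", "<s>"), 2))    # 0.67
print(round(p("Sam", "<s>"), 2))  # 0.33
print(round(p("am", "I"), 2))     # 0.67
print(round(p("do", "I"), 2))     # 0.33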

The number of parameters required grows exponentially with the number of words of prior context:

P(phone | Please turn off your cell)

Unigram: P(phone)
Bigram: P(phone | cell)
Trigram: P(phone | your cell)

Another example, for the sentence the boy is chasing the big dog:

unigrams: P(dog)
bigrams: P(dog | big)
trigrams: P(dog | the big)
quadrigrams: P(dog | chasing the big)

Assume a language has T word types in its lexicon; how likely is word x to follow word y?

Simplest model of word probability: 1/T

Alternative 1: estimate the likelihood of x occurring in new text based on its general frequency of occurrence estimated from a corpus (unigram probability): popcorn is more likely to occur than unicorn.

Alternative 2: condition the likelihood of x occurring on the context of previous words (bigrams, trigrams, ...), estimating the probability of each word given its prior context: mythical unicorn is more likely than mythical popcorn.

So here we are answering the question: how far back in the history of a sequence of words should we go to predict the next word? For instance, a bigram model (N = 2) predicts the occurrence of a word given only its previous word (as N-1 = 1 in this case). Similarly, a trigram model (N = 3) predicts the occurrence of a word based on its previous two words (as N-1 = 2 in this case).

Let us see a way to assign a probability to a word occurring next in a sequence of words. First of all, we need a very large sample of English sentences (called a corpus). For the purpose of our example, we'll consider a very small sample of sentences, but in reality a corpus will be extremely large. Say our corpus contains the following sentences:

1. He said thank you.
2. He said bye as he walked through the door.
3. He went to San Diego.
4. San Diego has nice weather.
5. It is raining in San Francisco.

Let's assume a bigram model. So we are going to find the probability of a word based only on its previous word. In general, we can say that this probability is (the number of times the previous word 'wp' occurs before the word 'wn') / (the total number of times the previous word 'wp' occurs in the corpus):

Count(wp wn) / Count(wp)

Let's work this out with an example.

To find the probability of the word "you" following the word "thank", we can write this as P(you | thank), which is a conditional probability.

This becomes equal to:
= (No. of times "Thank You" occurs) / (No. of times "Thank" occurs)
= 1/1
= 1

We can say with certainty that whenever "Thank" occurs, it will be followed by "You" (this is because we have trained on a set of only five sentences and "Thank" occurred only once, in the context of "Thank You"). Let's see an example of a case when the preceding word occurs in different contexts.

Let's calculate the probability of the word "Diego" coming after "San". We want to find P(Diego | San). This means that we are trying to find the probability that the next word will be "Diego" given the word "San". We can do this by:
= (No. of times "San Diego" occurs) / (No. of times "San" occurs)
= 2/3
= 0.67

This is because in our corpus, one of the three preceding "San"s was followed by "Francisco". So, P(Francisco | San) = 1/3.

In our corpus, only "Diego" and "Francisco" occur after "San", with probabilities 2/3 and 1/3 respectively. So if we want to create next-word prediction software based on our corpus, and a user types in "San", we will give two options: "Diego" ranked most likely and "Francisco" ranked less likely.

Generally, the bigram model works well and it may not be necessary to use trigram models or higher N-gram models.
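The next-word ranking just described can be reproduced with a few lines of counting. The sketch below is illustrative (sentences are lower-cased and punctuation is dropped for simplicity, and next_word_probs is an invented helper, not a library call).

from collections import Counter, defaultdict

corpus = [
    "he said thank you",
    "he said bye as he walked through the door",
    "he went to san diego",
    "san diego has nice weather",
    "it is raining in san francisco",
]

following = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        following[prev][word] += 1

def next_word_probs(prev):
    counts = following[prev]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs("thank"))  # {'you': 1.0}
print(next_word_probs("san"))    # {'diego': 0.67, 'francisco': 0.33} (approx.)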
Challenges

There are, of course, challenges, as with every modeling approach and estimation method. Let's look at the key ones affecting the N-gram model, as well as the use of Maximum Likelihood Estimation (MLE):

Sparse data: not all N-grams are found in the training data; we need smoothing.
Change of domain: train on WSJ, attempt to identify Shakespeare; it won't work.
Sensitivity to the training corpus.

The N-gram model, like many statistical models, is significantly dependent on the training corpus. As a result, the probabilities often encode particular facts about a given training corpus. Besides, the performance of the N-gram model varies with the change in the value of N.

Moreover, you may have a language task in which you know all the words that can occur, and hence you know the vocabulary size V in advance. The closed vocabulary assumption assumes there are no unknown words, which is unlikely in practical scenarios.

Smoothing

A notable problem with the MLE approach is sparse data. Any N-gram that appeared a sufficient number of times might have a reasonable estimate for its probability, but because any corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it. As a result, the N-gram matrix for any training corpus is bound to have a substantial number of cases of putative "zero probability N-grams".

N-gram for spelling correction

Let's define the job of a spell checker and an auto corrector. A word needs to be checked for spelling correctness and corrected if necessary, often in the context of the surrounding words. A spellchecker points to spelling errors and possibly suggests alternatives. An auto corrector usually goes a step further and automatically picks the most likely word. In case the correct word has already been typed, it is retained. So, in practice, an auto corrector is a bit more aggressive than a spellchecker, but this is more of an implementation detail; tools allow you to configure the behaviour. There is not much difference between the two in theory, so the discussion in the rest of this section applies to both.

There are different types of spelling errors. Let's try to classify them a bit formally. Note that the list below is for the purpose of illustration, and is not a rigorous linguistic approach to classifying spelling errors.

Non-word Errors: These are the most common type of errors. You either miss a few keystrokes or let your fingers hurtle a bit longer. E.g., typing langage when you meant language, or hurryu when you meant hurry.

Real Word Errors: If you have fat fingers, sometimes instead of creating a non-word, you end up creating a real word, but one you didn't intend. E.g., typing buckled when you meant bucked. Or your fingers are a tad wonky, and you type in three when you meant there.

Cognitive Errors: The previous two types of errors result not from ignorance of a word or its correct spelling; cognitive errors can occur due to exactly those factors. The words piece and peace are homophones (they sound the same), so you are not sure which one is which. Sometimes you are quite sure about your spelling even when a few grammar nazis claim you're not.

Short forms/Slang/Lingo: These are possibly not even spelling errors. Maybe u r just being kewl. Or you are trying hard to fit everything within a text message or a tweet and must commit a spelling sin. We mention them here for the sake of completeness.

Intentional Typos: Well, because you are clever. You type in teh and pwned and zomg carefully and frown if they get autocorrected. It could be a marketing trick, one that probably even backfired.

The word N-gram approach to spelling error detection and correction is to generate every possible misspelling of each word in a sentence, either just by typographical modifications (letter insertion, deletion, substitution) or by including homophones as well (and presumably including the correct spelling), and then to choose the spelling that gives the sentence the highest prior probability. That is, given a sentence W = {w1, ..., wk, ..., wn}, where wk has alternative spellings w'k, w''k, etc., we choose the spelling among these possible spellings that maximizes P(W), using the N-gram grammar to compute P(W). A class-based N-gram can be used instead, which can find unlikely part-of-speech combinations, although it may not do as well at finding unlikely word combinations.
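The candidate-ranking idea can be sketched as follows. Everything here is illustrative: the confusion sets and the bigram probabilities are made-up stand-ins for what a channel model and a trained, smoothed N-gram model would actually supply.

from itertools import product

# Toy confusion sets: for each position, the typed word plus the real-word
# alternatives a channel model might propose (illustrative only).
candidates = [["in"], ["about"], ["fifteen"], ["minuets", "minutes"]]

# Stand-in bigram probabilities; in practice these come from an N-gram model
# trained on a large corpus, with smoothing for unseen pairs.
bigram_prob = {
    ("in", "about"): 0.005,
    ("about", "fifteen"): 0.01,
    ("fifteen", "minutes"): 0.02,
    ("fifteen", "minuets"): 0.00001,
}

def score(words):
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob.get((prev, word), 1e-8)
    return p

best = max(product(*candidates), key=score)
print(" ".join(best))   # in about fifteen minutes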
Expected Questions:

1. What is morphology? Why do we need to do Morphological Analysis? Discuss various application domains of Morphological Analysis.
2. Explain derivational and inflectional morphology in detail with suitable examples.
3. What are morphemes? What are different ways to create words from morphemes?
4. What is a language model? Explain the use of a language model.
5. Write a note on the N-Gram language model.
6. What is the role of FSA in Morphological analysis? Explain FST in detail.
OBJECTIVES
After reading this chapter, the student will be able to understand:
Part-Of-Speech tagging (POS): Tag set for English (Penn Treebank)
Rule based POS tagging, Stochastic POS tagging
Issues: Multiple tags and words, Unknown words
Sequence labelling: Hidden Markov Model (HMM), Maximum Entropy, and Conditional Random Field (CRF)
Introduction to CFG