NLP 1
OBJECTIVES
After reading this chapter, the student will be able to understand:
* Introduction to Natural Language Processing.
* History of NLP.
* Generic NLP systems.
* Levels of NLP.
* Knowledge in language processing.
* Ambiguity in NLP.
* Stages in NLP.
* Challenges for NLP.
* Application areas of NLP.
Natural Language Processing (NLP) explores the way computers can be used to understand and manage natural language text or speech to do useful things. The term “natural” in the context of the language is used to distinguish human languages (such as Gujarati, English, Spanish and French) from computer languages (such as C, C++, Java and Prolog). The definition of Natural Language Processing clarifies that it is a theoretically motivated range of computational techniques (multiple methods or techniques for language analysis) for analyzing and representing naturally occurring text (such as English, Gujarati and Punjabi) at one or more levels of linguistic analysis, for the purpose of achieving human-like language processing for a range of tasks or applications.

2. Need for Natural Language Processing (NLP)
Significant growth in the volume and variety of data is due to the amount of unstructured text data; in fact, up to 80% of all your data is unstructured text data. Companies collect huge amounts of documents, emails, social media, and other text-based information to get to know their customers better, offer services or market their products. However, most of this data is unused and untouched.
Text analytics, through the use of natural language processing (NLP), holds the key to unlocking the business value within these vast data resources.
In the era of big data, businesses can fully utilize the data potential and take advantage of the latest parallel text analytics and NLP algorithms packaged in a variety of open source software, such as R and Python.

3. Goals of Natural Language Processing
1. The ultimate goal of natural language processing is for computers to achieve human-like comprehension of texts/languages. When this is achieved, computer systems will be able to understand, draw inferences from, summarize, translate and generate accurate and natural human text and language.
2. The goal of natural language processing is to specify a language comprehension and production theory to such a level of detail that a person is able to write a computer program which can understand and produce natural language.
3. The basic goal of NLP is to accomplish human-like language processing. The choice of the word “processing” is very deliberate and should not be replaced with “understanding”. For although the field of NLP was originally referred to as Natural Language Understanding (NLU), that goal has not yet been accomplished. A full NLU system would be able to:
- Paraphrase an input text.
- Translate the text into another language.
- Answer questions about the contents of the text.
- Draw inferences from the text.
4. Brief overview of NLP
The field of study that focuses on the interactions between human language and computers is called Natural Language Processing, or NLP for short. It sits at the intersection of computer science, artificial intelligence, and computational linguistics.
The essence of Natural Language Processing lies in making computers understand the natural language. That's not an easy task though. Computers can understand the structured form of data like spreadsheets and the tables in the database, but human languages, texts, and voices form an unstructured category of data, and it gets difficult for the computer to understand it, and there arises the need for Natural Language Processing.
There's a lot of natural language data out there in various forms, and it would get very easy if computers could understand and process that data. We can train the models in accordance with expected output in different ways. Humans have been writing for thousands of years, there are a lot of literature pieces available, and it would be great if we made computers understand that. But the task is never going to be easy. There are various challenges floating out there, like understanding the correct meaning of the sentence, correct Named-Entity Recognition (NER), correct prediction of various parts of speech, and coreference resolution (the most challenging thing in my opinion).
Computers can't truly understand the human language. If we feed enough data and train a model properly, it can distinguish and try categorizing various parts of speech (noun, verb, adjective, supporters, etc.) based on previously fed data and experiences. If it encounters a new word, it tries to make the nearest guess, which can be embarrassingly wrong at times.
It's very difficult for a computer to extract the exact meaning from a sentence. For example: The boy radiated fire like vibes. Did the boy have a very motivating personality, or did he actually radiate fire? As you can see over here, parsing English with a computer is going to be complicated.

Figure 2: NLP in the Computer science taxonomy (branches shown include Information Retrieval (using ontology), Machine Translation (E-M & M-E), Text Categorization, Summarization (abstractive and extractive), and Morphological Analysis)

5. History of NLP
NLP began in the 1950s as the intersection of artificial intelligence and linguistics. NLP was originally distinct from text information retrieval (IR), which employs highly scalable statistics-based techniques to index and search large volumes of text efficiently: Manning et al. [1] provide an excellent introduction to IR. With time, however, NLP and IR have converged somewhat. Currently, NLP borrows from several, very diverse fields, requiring today's NLP researchers and developers to broaden their mental knowledge-base significantly.
Early simplistic approaches, for example, word-for-word Russian-to-English machine translation, were defeated by homographs (identically spelled words with multiple meanings) and metaphor, leading to the apocryphal story of the Biblical ‘the spirit is willing, but the flesh is weak’ being translated to ‘the vodka is agreeable, but the meat is spoiled.’
Chomsky's 1956 theoretical analysis of language grammars provided an estimate of the problem's difficulty, influencing the creation (1963) of Backus-Naur Form (BNF) notation. BNF is used to specify a ‘context-free grammar’ and is commonly used to represent programming-language syntax. A language's BNF specification is a set of derivation rules that collectively validate program code syntactically. (‘Rules’ here are absolute constraints, not expert systems' heuristics.) Chomsky also identified still more restrictive ‘regular’ grammars, the basis of the regular expressions used to specify text-search patterns. Regular expression syntax, defined by Kleene (1956), was first supported by Ken Thompson's grep utility on UNIX.
Subsequently (1970s), lexical-analyzer (lexer) generators and parser generators such as the lex/yacc combination utilized grammars. A lexer transforms text into tokens; a parser validates a token sequence. Lexer/parser generators simplify programming-language implementation greatly by taking regular-expression and BNF specifications as input and generating the code that determines lexing/parsing decisions. A yacc-style parser operates bottom-up (that is, it builds compound constructs from simpler ones) and uses a look-ahead of a single token to make parsing decisions.
The Prolog language was originally invented (1970) for NLP applications. Its syntax is especially suited for writing grammars, although, in the easiest implementation mode (top-down parsing), rules must be phrased differently (i.e., right-recursively) from those intended for a yacc-style parser. Top-down parsers are easier to implement than bottom-up parsers (they don't need generators), but are much slower.
Recent research has increasingly focused on unsupervised and semi-supervised learning algorithms. Such algorithms are able to learn from data that has not been hand-annotated with the desired answers, or using a combination of annotated and non-annotated data. Generally, this task is much more difficult than supervised learning, and typically produces less accurate results for a given amount of input data. However, there is an enormous amount of non-annotated data available (including, among other things, the entire content of the World Wide Web), which can often make up for the inferior results.
[...] machine translation. These parts, when combined, would allow artificial intelligence to gain real knowledge of the world, not just play chess or move around an obstacle course. In the near future computers will be able to read all of the information online, learn from it, solve problems and possibly cure diseases. The limit for NLP and AI is humanity; research will not stop until both are at a human level of awareness and understanding.

6. Generic NLP system
Figure 3: Generic NLP System
Any natural language processing system should start with some input and end with effective and accurate output. The inputs for a natural language processor can be text or speech. There are a variety of outputs that can be generated by the system. Output may be in the form of an answer when the input is a question. Similarly, outputs can be a database update, a spoken response, semantics, part of speech, morphology of a word, semantics of a word/sentence, etc.
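The lexer idea from the history above (regular expressions turning raw text into tokens) can be illustrated with a minimal sketch using Python's standard `re` module. The token categories and the `tokenize` name are assumptions made for this example; they are not taken from lex, yacc, or any particular NLP toolkit.

```python
import re

# Hypothetical token categories, chosen only for illustration.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("WORD", r"[A-Za-z]+(?:'[A-Za-z]+)?"),
    ("PUNCT", r"[^\w\s]"),
    ("SPACE", r"\s+"),
]

# One combined regular expression with a named group per category.
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(text):
    """Transform raw text into a list of (category, lexeme) tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup  # name of the category that matched
        if kind != "SPACE":     # whitespace separates tokens but is not one
            tokens.append((kind, match.group()))
    return tokens
```

A parser would then validate the resulting token sequence against grammar rules; this sketch stops at the lexing step.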
[...] areas synergize well with each other. NLP can broadly be divided into various levels, as shown in the figure.

Phonology: It deals with the interpretation of speech sounds within and across words.
Morphology: It is a study of the way words are built up from smaller meaning-bearing units called morphemes. For example, the word ‘fox’ has a single morpheme, while the word ‘cats’ has two morphemes: ‘cat’ and the morpheme ‘-s’, which distinguishes plural from singular. The morphological lexicon is the list of stems and affixes together with basic information, such as whether the stem is a noun stem or a verb stem [21]. The detailed analysis of this level is discussed in chapter 4.
Syntax: It is a study of formal relationships between words. It is a study of how words are clustered in classes in the form of Part-of-Speech (POS), how they are grouped with their neighbours into phrases, and the way words depend on each other in a sentence.
Semantics: It is a study of the meaning of words that are associated with grammatical structure. It consists of two kinds of approaches: syntax-driven semantic analysis and semantic grammar. The detailed explanation of this level is discussed in chapter 4.
Discourse: In discourse context, the level of NLP works with text longer than a sentence. There are two types of discourse: anaphora resolution and discourse/text structure recognition.

Reasoning: To produce an answer to a question which is not explicitly stored in a database, a Natural Language Interface to Database (NLIDB) carries out reasoning based on data stored in the database. For example, consider a database that holds the academic information about students, and a user who poses a query such as: ‘Which student is likely to fail in Maths subject?’ To answer the query, the NLIDB needs a domain expert to narrow down the reasoning process.

Knowledge in Language Processing
A natural language understanding system must have knowledge about what the words mean, how words combine to form sentences, how word meanings combine to form sentence meanings and so on. The different forms of knowledge required for natural language understanding are given below.
[...] systems, as they deal with how words are related to the sounds that realize them.

MORPHOLOGICAL KNOWLEDGE
Morphology concerns word formation. It is a study of the patterns of formation of words by the combination of sounds into minimal distinctive units of meaning called morphemes. Morphological knowledge concerns how words are constructed from morphemes.

SYNTACTIC KNOWLEDGE
Syntax is the level at which we study how words combine to form phrases, phrases combine to form clauses and clauses join to make sentences. Syntactic analysis concerns sentence formation. It deals with how words can be put together to form correct sentences. It also determines what structural role each word plays in the sentence and what phrases are subparts of what other phrases.

SEMANTIC KNOWLEDGE
It concerns the meanings of the words and sentences. This is the study of context-independent meaning, that is, the meaning a sentence has no matter in which context it is used. Defining the meaning of a sentence is very difficult due to the ambiguities involved.
PRAGMATIC KNOWLEDGE
Pragmatics is the extension of the meaning or semantics. Pragmatics deals with the contextual aspects of meaning in particular situations. It concerns how sentences are used in different situations and how use affects the interpretation of the sentence.

DISCOURSE KNOWLEDGE
Discourse concerns connected sentences. It is a study of chunks of language which are bigger than a single sentence. Discourse language concerns inter-sentential links, that is, how the immediately preceding sentences affect the interpretation of the next sentence. Discourse knowledge is important for interpreting pronouns and temporal aspects of the information conveyed.

WORLD KNOWLEDGE
World knowledge is nothing but everyday knowledge that all speakers share about the world. It includes the general knowledge about the structure of the world and what each language user must know about the other user's beliefs and goals. This is essential to make language understanding much better.

Knowledge representation and reasoning systems have incorporated natural language as interfaces to expert systems or knowledge bases that performed tasks separate from natural language processing. As this book shows, however, the computational nature of representation and inference in natural language makes it the ideal model for all tasks in an intelligent computer system. Natural language processing combines the qualitative characteristics of human knowledge processing with a computer's quantitative advantages, allowing for an in-depth, systematic processing of vast amounts of information. The essays in this interdisciplinary book cover a range of implementations and designs, from formal computational models to large-scale natural language processing systems.

The levels of analysis are:
- Lexical Analysis: Analysis of word forms
- Syntactic Analysis: Structure processing
- Semantic Analysis: Meaning representation
- Discourse Analysis: Processing of interrelated sentences
- Pragmatic Analysis: The purposeful use of sentences in situations

Ambiguity can occur at all these levels. It is a property of linguistic expressions. If an expression (word/phrase/sentence) has more than one interpretation, we can refer to it as ambiguous.
For example, consider the sentence “The chicken is ready to eat”. The interpretations can be:
* The chicken (bird) is ready to be fed, or
* The chicken (food) is ready to be eaten.
Consider another sentence: “There was not a single man at the party”. The interpretations in this case can be:
* Lack of bachelors at the party, or
* Lack of men altogether.

There are different types of ambiguities:
1. Lexical Ambiguity: the ambiguity of a single word. A word can be ambiguous with respect to its syntactic class, e.g. book, study. For example, the word “silver” can be used as a noun, an adjective, or a verb.
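Lexical ambiguity with respect to syntactic class can be made concrete with a toy lexicon that lists the classes each word may take; a word with more than one class stays ambiguous until context selects one. The lexicon entries and the `ambiguous_words` function below are assumptions chosen for illustration, not a real linguistic resource.

```python
# Toy lexicon (hypothetical): word -> set of syntactic classes it may belong to.
LEXICON = {
    "silver": {"noun", "adjective", "verb"},
    "book": {"noun", "verb"},
    "study": {"noun", "verb"},
    "chicken": {"noun"},
    "the": {"determiner"},
}


def ambiguous_words(sentence):
    """Return {word: sorted classes} for words with more than one possible class."""
    result = {}
    for raw in sentence.lower().split():
        word = raw.strip(".,!?\"'")          # drop surrounding punctuation
        classes = LEXICON.get(word, set())   # unknown words have no entry
        if len(classes) > 1:
            result[word] = sorted(classes)
    return result
```

For instance, `ambiguous_words("Book the silver chicken.")` flags "book" and "silver" but not "chicken" or "the". Resolving which class applies in a given sentence is the job of a part-of-speech tagger.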
Waiter (running upstairs and coming back panting): Yes sir, they are there.
Clearly, the waiter is falling short of the expectation of the tourist, since he does not understand the pragmatics of the situation.
Pragmatic ambiguity arises when the statement is not specific, and the context does not provide the information needed to clarify the statement. Information is missing, and must be inferred. Consider the example: “I love you too.” This can be interpreted as:
- I love you (just like you love me)
- I love you (just like someone else does)
- I love you (and I love someone else)
- I love you (as well as liking you)
It is a highly complex task to resolve all these kinds of ambiguities, especially in the upper levels of NLP. The meaning of a word, phrase, or sentence cannot be understood in isolation: contextual knowledge is needed to interpret the meaning, and pragmatic and world knowledge is required at higher levels. It is not easy to create a world model for disambiguation tasks. Linguistic tools and lexical resources are needed for the development of disambiguation techniques. Resource-poor languages are lagging behind resource-rich languages in the implementation of these techniques. Rule-based methods are language specific, whereas stochastic or statistical methods are language independent. Automatic resolution of all these ambiguities involves several long-standing problems, but development towards full-fledged disambiguation techniques that take care of all the ambiguities is still required. It is very much necessary for the accurate working of NLP applications such as Machine Translation, Information Retrieval, Question Answering etc.
Statistical approaches to ambiguity resolution in Natural Language Processing are:
1. Probabilistic model
2. Part of Speech Tagging
   - Rule-Based Approaches
   - HMM-Based Taggers
3. Machine Learning Approaches

10. Stages of NLP
There are in general five steps in natural language processing.
Lexical Analysis: It involves identifying and analyzing the structure of words. The lexicon of a language means the collection of words and phrases in a language. Lexical analysis is dividing the whole chunk of text into paragraphs, sentences, and words.
The lexical analysis in NLP deals with the study at the level of words with respect to their lexical meaning and part-of-speech. This level of linguistic processing utilizes a language's lexicon, which is a collection of individual lexemes. A lexeme is a basic unit of lexical meaning: an abstract unit of morphological analysis that represents the set of forms or “senses” taken by a single morpheme. “Duck”, for example, can take the form of a noun or a verb, but its part-of-speech and lexical meaning can only be derived in context with other words used in the phrase/sentence. This, in fact, is an early step towards a more sophisticated Information Retrieval system, where precision is improved through part-of-speech tagging.
Syntactic Analysis (Parsing): It involves analysis of words in the sentence for grammar and arranging words in a manner that shows the relationship among the words. A sentence such as “The school goes to boy” is rejected by an English syntactic analyzer.
[Figure: Lexical Analysis, Syntactic Analysis, Semantic Analysis, Discourse Integration]
Semantic Analysis: It concerns what words mean and how these meanings combine in sentences to form sentence meanings. It draws the exact meaning or the dictionary meaning from the text. The text is checked for meaningfulness. It is done by mapping syntactic structures and objects in the task domain. The semantic analyzer disregards a sentence such as “hot ice-cream”. Another example can be (plant: industrial plant/living organism).
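The lexical analysis step described in this section (dividing a chunk of text into paragraphs, sentences, and words) can be sketched with the standard `re` module. The splitting rules below are simplifying assumptions: a blank line ends a paragraph, and a sentence ends at '.', '!' or '?' followed by whitespace, so abbreviations such as "etc." would be split incorrectly. The `segment` name is hypothetical.

```python
import re


def segment(text):
    """Split text into paragraphs, each a list of sentences, each a list of words."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    result = []
    for paragraph in paragraphs:
        # Assumption: sentence-final punctuation followed by whitespace ends a sentence.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        # Words are runs of letters, digits, or apostrophes; other characters separate them.
        result.append([re.findall(r"[A-Za-z0-9']+", s) for s in sentences])
    return result
```

Real systems replace each of these rules with more careful tokenizers, but the nesting of paragraphs, sentences, and words is the same.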
Pragmatic Analysis: This concerns how sentences are used in different situations and how use affects the interpretation of the sentence. During this, what was said is re-interpreted as what was actually meant. It involves deriving those aspects of language which require real-world knowledge.

Figure 4: Stages of NLP. The figure traces the surface form “I want to print Ali's init file” through morphological analysis, syntactic analysis, semantic analysis (a transformation is made from the input text to an internal representation that reflects the meaning), discourse analysis (resolving references between sentences) and pragmatic analysis (reinterpreting what was said as what was actually meant).

Morphological Analysis:
The morphological level of linguistic processing deals with the study of word structures and word formation, focusing on the analysis of the individual components of words. The most important unit of morphology, defined as the “minimal unit of meaning”, is referred to as the morpheme. For example, the word “unhappiness” can be broken down into three morphemes (prefix, stem, and suffix), with each conveying some form of meaning. [...] Bound morphemes (prefixes and suffixes) require a free morpheme to which they can be attached, and can therefore not appear as a “word” on their own.
In Information Retrieval, document and query terms can be stemmed to match morphological variants of terms between the documents and query, such that the singular form of a noun in a query will match even with its plural form in the document, and vice versa, thereby increasing recall.

Syntactic Analysis
The part-of-speech tagging output of the lexical analysis can be used at the syntactic level of linguistic processing to group words into phrase and clause brackets. Syntactic analysis, also referred to as “parsing”, allows the extraction of phrases which convey more meaning than just the individual words by themselves, such as in a noun phrase.
In Information Retrieval, parsing can be leveraged to improve indexing, since phrases can be used as representations of documents which provide better information than just single-word indices. In the same way, phrases that are syntactically derived from the query offer better search keys to match with documents that are similarly parsed.
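Grouping part-of-speech-tagged words into phrase brackets, as described for syntactic analysis, can be sketched as a small noun-phrase chunker. This is a simplification under stated assumptions: the input is already tagged, the tag names ("DET", "ADJ", "NOUN") are hypothetical labels chosen for this example, and a noun phrase is taken to be an optional determiner, any adjectives, and one or more nouns.

```python
def noun_phrases(tagged_tokens):
    """Group (word, tag) pairs into noun phrases: optional DET, any ADJ, then NOUN(s)."""
    phrases, current, has_noun = [], [], False

    def flush():
        # Emit the candidate phrase only if it actually contains a noun.
        nonlocal current, has_noun
        if has_noun:
            phrases.append(" ".join(current))
        current, has_noun = [], False

    for word, tag in tagged_tokens:
        if tag == "NOUN":
            current.append(word)
            has_noun = True
        elif tag in ("DET", "ADJ") and not has_noun:
            if tag == "DET":
                current = []  # a determiner always starts a fresh candidate
            current.append(word)
        else:
            flush()
            if tag in ("DET", "ADJ"):
                current.append(word)  # this word may open the next phrase
    flush()
    return phrases
```

The extracted phrases could then serve as index terms or search keys in place of single words, which is the Information Retrieval use the text mentions.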
In Information Retrieval, the query and document matching process can be performed on a conceptual level, as opposed to simple terms, thereby further increasing system precision. Moreover, by applying semantic analysis to the query, term expansion would be possible with the use of lexical sources, offering improved retrieval of the relevant documents even if exact terms are not used in the query. Precision may increase with query expansion, with recall probably increasing as well.

(Figure 4, continued: the pragmatic stage yields “Execute the command lpr /ali/stuff.init”.)

Discourse Analysis
The discourse level of linguistic processing deals with the analysis of structure and meaning of text beyond a single sentence, making connections between words and sentences. At this level, anaphora resolution is also achieved by identifying the entity referenced by an anaphor (most commonly in the form of, but not limited to, a pronoun). An example is shown below.

“I voted for Obama because he was most aligned with my values,” she said.
Figure 5: Anaphora Resolution Illustration

With the capability to recognize and resolve anaphora relationships, document and query representations are improved, since, at the lexical level, the implicit presence of concepts is accounted for throughout the document as well as in the query, while at the semantic and discourse levels, an integrated content representation of the documents and queries is generated.

[...] This level of analysis enables major breakthroughs in Information Retrieval as it facilitates the conversation between the IR system and the users, allowing the elicitation of the purpose for which the information being sought is planned to be used, thereby ensuring that the information retrieval system is fit for purpose.

[...] management, multilingual query processing, and natural language interface to database systems. Currently, interactive applications may be classified into the following categories:

Speech Recognition / Speech Understanding and Synthesis / Speech Generation: A speech understanding system attempts to perform semantic and pragmatic processing of a spoken utterance to understand what the user is saying and act on what is being said. The research areas in this category include linguistic analysis and the design and development of efficient and effective algorithms for speech recognition and synthesis.

Language Translator: It is the task of automatically converting one natural language into another, preserving the meaning of the input text and producing an equivalent text in the output language. The research area in this category includes language modelling.

Information Retrieval (IR): It is a scientific discipline that deals with the analysis, design and implementation of a computerized system that addresses representation, organization, and access to large amounts of heterogeneous information encoded in digital format. The search engine is the well-known application of IR, which accepts a query from the user and returns the relevant documents to the user. It returns the documents, not the relevant answers; users are left to extract answers from the returned documents. The research area in IR includes information searching, information extraction, information categorization and information summarization from unstructured information.

Information Extraction: It includes extraction of structured information from unstructured text. It is an activity of filling a predefined template from natural language text. The research area in this category includes identifying named entities, resolving anaphora and identifying relationships between entities.
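Filling a predefined template from natural language text can be sketched with a single hand-written pattern. Everything here is an assumption made for illustration: the "vote" template, its slot names, the `fill_template` function, and the regular expression, which only recognizes capitalized single-word names in one sentence shape. Practical extraction systems rely on named-entity recognition and anaphora resolution rather than one fixed pattern.

```python
import re

# Hypothetical pattern for one sentence shape: "<Voter> voted for <Candidate>".
VOTE_PATTERN = re.compile(
    r"(?P<voter>[A-Z][a-z]+) voted for (?P<candidate>[A-Z][a-z]+)"
)


def fill_template(text):
    """Return one filled 'vote' template (a dict) per match found in the text."""
    return [
        {
            "event": "vote",
            "voter": match.group("voter"),
            "candidate": match.group("candidate"),
        }
        for match in VOTE_PATTERN.finditer(text)
    ]
```

Note that a sentence such as “I voted for Obama” produces no template here, because the pattern expects a named voter; deciding who “I” or “she” refers to is exactly the anaphora resolution problem listed among the research areas.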
Expected Questions
1. What is Natural language processing (NLP)? Discuss various stages involved in NLP process with suitable example.
2. What is Natural Language Understanding? Discuss various levels of analysis under it with example.
3. What do you mean by ambiguity in Natural language? Explain with suitable example. Discuss various ways to resolve ambiguity in NL.
4. What do you mean by lexical ambiguity and syntactic ambiguity in Natural language? What are different ways to resolve these ambiguities?
5. List various applications of NLP and discuss any 2 applications in detail.