
KRISHNA INSTITUTE OF TECHNOLOGY - NATURAL LANGUAGE PROCESSING (UNIT 1)

UNIT 1

• Introduction to Natural Language Understanding
• The study of Language, Applications of NLP
• Evaluating Language Understanding Systems
• Different levels of Language Analysis
• Representations and Understanding
• Organization of Natural Language Understanding Systems
• Linguistic Background: An outline of English syntax

B.Tech (CSE IV Year) - Mr. Anuj Khanna (Asst. Professor)



SHORT ANSWER TYPE QUESTIONS

Ques 1. What is language modeling?

Ans. Language modeling is central to many important natural language processing tasks. The notion of a
language model is inherently probabilistic: a language model is a function that assigns a probability measure to
strings drawn from some vocabulary, that is, a probability distribution over words or word sequences. In
practice, a language model gives the probability of a certain word sequence being "valid". Validity in this
context refers to how plausible or natural the sequence is, not to grammatical correctness.
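
As a hedged illustration (a minimal sketch with a toy corpus; all names and data are invented for the example), a bigram language model can be estimated by counting word pairs and multiplying conditional probabilities:

    # Minimal bigram language model sketch (toy corpus; unseen bigrams get
    # probability 0 here -- real models apply smoothing).
    from collections import Counter, defaultdict

    corpus = ["the cat sat", "the cat ran", "the dog sat"]

    bigram_counts = defaultdict(Counter)
    context_counts = Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigram_counts[prev][word] += 1
            context_counts[prev] += 1

    def sentence_prob(sentence):
        # P(w1..wn) approximated as the product of P(wi | wi-1),
        # each estimated by maximum likelihood from the toy corpus.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        p = 1.0
        for prev, word in zip(tokens, tokens[1:]):
            p *= bigram_counts[prev][word] / context_counts[prev]
        return p

    print(sentence_prob("the cat sat"))   # ~0.333: a plausible ("valid") sequence
    print(sentence_prob("the dog ran"))   # 0.0: the bigram "dog ran" never occurs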

Ques 2. What do you mean by the term 'linguistics'?

Ans. Linguistics is the scientific study of language. Linguists (experts in linguistics) work on specific
languages, but their primary goal is to understand the nature of language in general by asking questions such as:

• What distinguishes human language from other animal communication systems?
• What features are common to all human languages?
• How are the modes of linguistic communication (speech, writing, sign language) related to each other?
• How is language related to other types of human behavior?

Ques 3. Define the terms ‘Lexicon’ and ‘Morphemes’.

Ans: A lexicon is a dictionary or the vocabulary of a language, a people, or a subject. The lexicon is the central
knowledge base of linguistic meanings; any expansion or extension of meaning rides on the construction of larger
structures out of the elements of the lexicon.
The lexicon of a natural language contains all lexical items, that is, words. In a certain sense, the lexicon of any
natural language is the stock of unique and irregular pieces of information. Initially the term "lexicon" was used to
characterize a list of the morphemes of a specific language, as distinct from a word list. A morpheme is the smallest
unit of a word that has grammatical function or meaning (see Ques 19).
As the ideas of transformational generative grammar developed, some researchers started to treat the lexicon as a
component of the generative language model playing an auxiliary role with respect to the grammar. The word was
defined as a meaningful unit that can be identified in a syntactic chain, and the lexicon was seen as a list of
indivisible finite elements regulated by morpho-lexical rules.

Ques 4. What is ‘Stemming’ and ‘Lemmatization’?

Ans: Stemming is used to normalize words to their base or root form. E.g., celebrates, celebrated, and
celebrating all share the single root word 'celebrate'.


Lemmatization is quite similar to stemming. Lemmatization usually refers to doing things properly with
the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings
only and to return the base or dictionary form of a word, which is known as the lemma.
If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to
return either see or saw depending on whether the use of the token was as a verb or a noun. The two may
also differ in that stemming most commonly collapses derivationally related words, whereas lemmatization
commonly only collapses the different inflectional forms of a lemma. Linguistic processing for stemming or
lemmatization is often done by an additional plug-in component to the indexing process, and a number of such
components exist, both commercial and open-source.
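
As a minimal sketch of this contrast, the NLTK library (one such open-source component; resource names may differ across NLTK versions) provides both a Porter stemmer and a WordNet lemmatizer, and reproduces the "saw" example above:

    # Stemming vs. lemmatization with NLTK (assumes nltk is installed and the
    # 'wordnet' resource can be downloaded).
    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet", quiet=True)

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    for word in ["celebrates", "celebrated", "celebrating"]:
        print(word, "->", stemmer.stem(word))    # crude suffix stripping: "celebr"

    # The lemma of "saw" depends on its part of speech:
    print(lemmatizer.lemmatize("saw", pos="v"))  # 'see' (verb reading)
    print(lemmatizer.lemmatize("saw", pos="n"))  # 'saw' (noun reading)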

Ques 5. What is NER (Named Entity Recognition)?


Ans : Named-entity recognition (NER) (also known as (named) entity identification, entity chunking,
and entity extraction) is a subtask of information extraction that seeks to locate and classify named
entities mentioned in unstructured text into pre-defined categories such as person names, organizations,
locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
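
A hedged sketch of NER in practice, using the spaCy library (an assumption; it requires the small English model installed via "python -m spacy download en_core_web_sm"):

    # Named entity recognition with spaCy (assumes en_core_web_sm is installed).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Steve Jobs introduced the iPhone at the Macworld Conference "
              "in San Francisco in January 2007 for $499.")

    for ent in doc.ents:
        print(ent.text, "->", ent.label_)
    # Expected (roughly): Steve Jobs -> PERSON, San Francisco -> GPE,
    # January 2007 -> DATE, $499 -> MONEY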

Ques 6. Mention some application areas of NLP.


Ans:
• Speech recognition
• Machine translation
• Text summarization
• Autocorrection and error detection
• Email filtering
• Sentiment analysis
• Social media analysis

LONG ANSWER TYPE QUESTIONS

Ques 7. What are the major disciplines used in studying language? Briefly explain each of them.

Ans: The major disciplines used in the study of a language are as follows:


(i) Linguists: How do words form phrases and sentences? What constrains the possible meanings of a sentence?
Tools: intuitions about well-formedness and meaning; mathematical models of structure (for example, formal
language theory, model-theoretic semantics).

(ii) Psycholinguists: How do people identify the structure of sentences? How are word meanings identified?
When does understanding take place?


(iii) Philosophers: What is meaning, and how do words and sentences acquire it? How do words identify objects
in the world? Tools: natural language argumentation using intuition about counterexamples; mathematical models
(for example, logic and model theory).

(iv) Computational Linguists: How is the structure of sentences identified? How can knowledge and reasoning be
modeled? How can language be used to accomplish specific tasks? Tools: algorithms, data structures; formal
models of representation and reasoning; AI techniques (search and representation methods).

Ques 8. Briefly explain the history of natural language processing.


Ans. The field of natural language processing has been around for nearly 70 years. Perhaps most
famously, Alan Turing laid the foundation for the field by developing the Turing test in 1950. The Turing test
is a test of a machine's ability to demonstrate intelligence that is indistinguishable from that of a human. For the
machine to pass the Turing test, it must generate human-like responses such that a human evaluator would not
be able to tell whether the responses were generated by a human or a machine (i.e., the machine's responses are
of human quality).
• Like the broader field of artificial intelligence, NLP has had many booms and busts, lurching from hype
cycles to AI winters. In 1954, Georgetown University and IBM successfully built a system that could
automatically translate more than 60 Russian sentences to English. At the time, researchers at Georgetown
University thought machine translation would be a solved problem within three to five years.
• The success in the US also spurred the Soviet Union to launch similar efforts. The Georgetown-IBM
success, coupled with the Cold War mentality, led to increased funding for NLP in these early years.
• However, by 1966, progress had stalled, and the Automatic Language Processing Advisory Committee
(known as ALPAC), a US government committee set up to evaluate the progress in computational
linguistics, issued a highly critical report. The report led to a reduction in funding for machine translation
research.
• Despite these setbacks, the field of NLP reemerged in the 1970s. By the 1980s, computational power
had increased significantly and costs had come down sufficiently, opening up the field to many more
researchers around the world.
• In the late 1980s, NLP rose in prominence again with the release of the first statistical machine
translation systems, led by researchers at IBM's Thomas J. Watson Research Center. Prior to the rise
of statistical machine translation, machine translation relied on human-handcrafted rules for language.
These systems were called rule-based machine translation. The rules would help correct and control
mistakes that the machine translation systems would typically make, but crafting such rules was a
laborious and painstaking process.


• By the mid-1980s, IBM applied a statistical approach to speech recognition and launched a
voice-activated typewriter called Tangora, which could handle a 20,000-word vocabulary.
• DARPA, Bell Labs, and Carnegie Mellon University also had similar successes by the late 1980s.
Speech recognition software systems by then had larger vocabularies than the average human and could
handle continuous speech recognition, a milestone in the history of speech recognition.
• Today's NLP heavyweights, such as Google, hired their first speech recognition employees in 2007.
The US government also got involved then; the National Security Agency began tagging large volumes
of recorded conversations for specific keywords, facilitating the search process for NSA analysts.
• By the early 2010s, NLP researchers, both in academia and industry, began experimenting with deep
neural networks for NLP tasks. Early deep learning-led successes came from a deep learning method
called long short-term memory (LSTM).
• In 2015, Google used such a method to revamp Google Voice.
• NLP made waves from 2014 onward with the release of Amazon Alexa, a revamped Apple Siri,
Google Assistant, and Microsoft Cortana.
• Google also launched a much-improved version of Google Translate in 2016, and now chatbots and
voice bots are much more commonplace.
• That being said, it wasn't until 2018 that NLP had its very own ImageNet moment, with the release of
large pre-trained language models trained using the Transformer architecture; the most notable of these
was Google's BERT, which was launched in November 2018.
• In 2019, generative models such as OpenAI's GPT-2 made a splash, generating new content on the
fly based on previous content, a previously insurmountable feat.
• In 2020, OpenAI released an even larger and more impressive version, GPT-3, building on its
previous successes.
• Heading into 2021 and beyond, NLP, along with computer vision, is no longer an experimental
subfield of AI.

Ques 9. Category-wise, explain in detail the various applications of an NLU system.


Ans. The applications can be divided into two major classes:
(i) Text-based applications
(ii) Dialogue-based applications.

Text-based applications: These involve the processing of written text, such as books, newspapers, reports, manuals,
e-mail messages, and so on. They are all reading-based tasks. Text-based natural language research is ongoing in
applications such as:
• finding appropriate documents on certain topics in a database of texts (for example, finding relevant books in
a library)

• extracting information from messages or articles on certain topics (for example, building a database of all stock
transactions described in the news on a given day)
• translating documents from one language to another (for example, producing automobile repair manuals in many
different languages)
• summarizing texts for certain purposes (for example, producing a 3-page summary of a 1000-page government
report).

Some machine translation systems have been built that are based on pattern matching, that is, a sequence of words in
one language is associated with a sequence of words in another language. Translation is accomplished by finding the best
set of patterns that match the input and producing the associated output in the other language. This technique can produce
reasonable results in some cases but sometimes produces completely wrong translations because of its inability to use an
understanding of content to disambiguate word senses and sentence meanings appropriately.

One very attractive domain for text-based research is story understanding. In this task the system processes a story and
then must answer questions about it. This is similar to the type of reading comprehension tests used in schools and
provides a very rich method for evaluating the depth of understanding the system is able to achieve.

Dialogue-based applications: These involve human-machine communication. Most naturally this involves spoken
language, but it also includes interaction using keyboards. Typical potential applications include:
• question-answering systems, where natural language is used to query a database (for example, a query system to
a personnel database)
• automated customer service over the telephone (for example, to perform banking transactions or order items
from a catalogue)
• tutoring systems, where the machine interacts with a student (for example, an automated mathematics tutoring
system)
• spoken language control of a machine (for example, voice control of a VCR or computer)
• general cooperative problem-solving systems (for example, a system that helps a person plan and schedule
freight shipments).

Text-to-speech and speech-to-text: Software is now able to convert text to high-fidelity audio very easily. For
example, Google Cloud Text-to-Speech is able to convert text into human-like speech in more than 180 voices
across over 30 languages. Likewise, Google Cloud Speech-to-Text is able to convert audio to text for over 120
languages, delivering a truly global offering.

Chat bots: If you have spent some time perusing websites recently, you may have realized that more and
more sites now have a chat bot that automatically chimes in to engage the human user. The chat bot usually
greets the human in a friendly, non-threatening manner and then asks the user questions to gauge the purpose


and intent of the visit to the site. The chat bot then tries to automatically respond to any questions the user has
without human intervention. Such chat bots are now automating digital customer engagement.

Voice bots : Ten years ago, automated voice agents were clunky. Unless humans responded in a fairly
constrained manner (e.g., with yes or no type responses), the voice agents on the phone could not process the
information. Now, AI voice bots like those provided by VOIQ are able to help augment and automate calls for
sales, marketing, and customer success teams.

Sentiment analysis (opinion mining): Opinion mining, or sentiment analysis, is a text analysis technique that
uses computational linguistics and natural language processing to automatically identify and extract sentiment
or opinion from within text (positive, negative, neutral, etc.). With the explosion of social media content, there is
an ever-growing need to automate customer sentiment analysis, dissecting tweets, posts, and comments for
sentiment such as positive versus negative versus neutral, or angry versus sad versus happy. Such software is
also known as emotion AI. It allows you to get inside your customers' heads and find out what they like and
dislike, and why, so you can create products and services that meet their needs.
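
As a small sketch of automated sentiment scoring, NLTK ships a rule-and-lexicon-based analyzer (VADER); this is one possible tool among many, shown here under the assumption that the 'vader_lexicon' resource can be downloaded:

    # Sentiment analysis with NLTK's VADER analyzer.
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    sia = SentimentIntensityAnalyzer()

    for text in ["I love this product!", "This is the worst service ever."]:
        print(text, sia.polarity_scores(text))
    # 'compound' > 0 suggests positive sentiment, < 0 negative, ~0 neutral.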

Information extraction: One major challenge in NLP is creating structured data from unstructured and/or
semi-structured documents. For example, named entity recognition software is able to extract people,
organizations, locations, dates, and currencies from long-form texts such as mainstream news. Information
extraction also involves relationship extraction, identifying the relations between entities.

Ques 10. Define the term NLP. What are the various levels of analysis in an NLP system?
Ans. Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering,
and artificial intelligence concerned with the interactions between computers and human (natural) languages, in
particular how to program computers to process and analyze large amounts of natural language data.
The term ‘Natural language processing’ (NLP) is normally used to describe the function of software or
hardware components in a computer system which analyze or synthesize spoken or written language.

The Different Levels of Language Analysis:

1. Lexical Analysis: The first phase of NLP is lexical analysis. This phase scans the input text as a
stream of characters and converts it into meaningful lexemes (tokens). It divides the whole text into paragraphs,
sentences, and words.
2. Morphological Analysis: This concerns how words are constructed from more basic meaning units called
morphemes. A morpheme is the primitive unit of meaning in a language (for example, the meaning of the
word "friendly" is derivable from the meaning of the noun "friend" and the suffix "-ly", which transforms a
noun into an adjective).


3. Syntactic Analysis (Parsing): Syntactic analysis is used to check grammar and word arrangements, and it
shows the relationships among the words. Example: "Agra goes to the Poonam."

In the real world, "Agra goes to the Poonam" does not make any sense, so this sentence is rejected by the
syntactic analyzer.

4. Semantic Analysis : Semantic analysis is concerned with the meaning representation. It mainly focuses on
the literal meaning of words, phrases, and sentences.

5. Discourse Integration: Discourse integration depends upon the sentences that precede a given sentence and
also invokes the meaning of the sentences that follow it.

6. Pragmatic Analysis: Pragmatic analysis is the last phase of NLP. It helps you to discover the intended
effect by applying a set of rules that characterize cooperative dialogues.

For Example: "Open the door" is interpreted as a request instead of an order.

Ques 11. Differentiate between NLP and NLU. Explain the levels of knowledge representation in NLP.
Ans. Natural language processing (NLP) is actually made up of natural language understanding (NLU)
and natural language generation (NLG). Natural language understanding is how the machine takes in a
query or request from the user and uses sentiment analysis, part-of-speech tagging, topic classification, and
other machine learning techniques to understand the intent of what the user has said.
This also includes turning the unstructured data, the plain-language query, into structured data that can be used
to query the data set.
Natural language generation is how the machine takes the results of the query and puts them together into
easily understandable human language. Applications for these technologies could include product descriptions,
automated insights, and other business intelligence applications in the category of natural language search.


Levels of knowledge representation in NLP are as described below :


Phonetic And Phonological Knowledge : Phonetics is the study of language at the level of sounds while
phonology is the study of combination of sounds into organized units of speech, the formation of syllables and
larger units. Phonetic and phonological knowledge are essential for speech based systems as they deal with how
words are related to the sounds that realize them.

Morphological Knowledge: Morphology concerns word formation. It is a study of the patterns of formation of
words by the combination of sounds into minimal distinctive units of meaning called morphemes.
Morphological knowledge concerns how words are constructed from morphemes.

Syntactic Knowledge: Syntax is the level at which we study how words combine to form phrases, phrases
combine to form clauses and clauses join to make sentences. Syntactic analysis concerns sentence formation. It
deals with how words can be put together to form correct sentences. It also determines what structural role each
word plays in the sentence and what phrases are subparts of what other phrases.

Semantic Knowledge : It concerns meanings of the words and sentences. This is the study of context
independent meaning that is the meaning a sentence has, no matter in which context it is used. Defining the
meaning of a sentence is very difficult due to the ambiguities involved.

Pragmatic Knowledge: Pragmatics is the extension of the meaning or semantics. Pragmatics deals with the
contextual aspects of meaning in particular situations. It concerns how sentences are used in different situations
and how use affects the interpretation of the sentence.

Discourse Knowledge: Discourse concerns connected sentences. It is a study of chunks of language which are
bigger than a single sentence. Discourse language concerns inter-sentential links that is how the immediately
preceding sentences affect the interpretation of the next sentence. Discourse knowledge is important for
interpreting pronouns and temporal aspects of the information conveyed.

World Knowledge: World knowledge is the everyday knowledge that all speakers share about the
world. It includes general knowledge about the structure of the world and what each language user must
know about the other user's beliefs and goals. This is essential for making language understanding much better.

Ques 12. What is information extraction? Explain various sub tasks of information extraction.

Ans: The explosion of information and the need for more sophisticated and efficient information-handling tools gave
rise to Information Extraction (IE) and Information Retrieval (IR) technology. An Information Extraction system takes
natural language text as input and produces structured information, specified by certain criteria, that is relevant to a
particular application.
Various sub-tasks of IE, such as Named Entity Recognition, Co-reference Resolution, Named Entity Linking, Relation
Extraction, and Knowledge Base Reasoning, form the building blocks of various high-end Natural Language Processing (NLP)

tasks such as Machine Translation, Question-Answering System, Natural Language Understanding, Text Summarization
and Digital Assistants like Siri, Cortana and Google Now.

(i) Parts-of-Speech (POS) tagging: In corpus linguistics, part-of-speech tagging (POS tagging or PoS
tagging or POST), also called grammatical tagging, is the process of marking up a word in a text (corpus) as
corresponding to a particular part of speech, based on both its definition and its context. A simplified form of
this is commonly taught to school-age children, in the identification of words
as nouns, verbs, adjectives, adverbs, etc.
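
A quick sketch of POS tagging with NLTK (resource names follow common NLTK conventions and may vary by version; the output tags are from the Penn Treebank tag set):

    # Part-of-speech tagging with NLTK (assumes 'punkt' and
    # 'averaged_perceptron_tagger' resources can be downloaded).
    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    tokens = nltk.word_tokenize("Michael Jordan lives in the United States")
    print(nltk.pos_tag(tokens))
    # e.g. [('Michael', 'NNP'), ('Jordan', 'NNP'), ('lives', 'VBZ'),
    #       ('in', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS')]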

(ii) Parsing: Parsing produces a syntactic analysis in the form of a tree that shows the phrases comprising the
sentence and the hierarchy in which these phrases are associated. Constituency parsers have been used for pronoun
resolution, labeling phrases with semantic roles, and assignment of functional category tags.

(iii) Named Entity Recognition (NER): The task is to find Persons (PER), Organizations (ORG), Locations
(LOC) and Geo-Political Entities (GPE). For instance, in the statement "Michael Jordan lives in United States",
an NER system extracts "Michael Jordan", which refers to the name of a person, and "United States", which
refers to the name of a country.
(iv) Named Entity Linking (NEL): Named Entity Linking (NEL), also known as Named Entity
Disambiguation (NED) or Named Entity Normalization (NEN), is the task of identifying the entity that
corresponds to a particular occurrence of a noun in a text document.


(v) Co-reference Resolution (CR): Coreference resolution is the task of determining which noun phrases
(including pronouns, proper names and common names) refer to the same entities in documents. For instance,
in the sentences "I have seen the annual report. It shows that we have gained 15% profit in this financial year",
"I" refers to the person speaking, "It" refers to the annual report, and "we" refers to the company in which that
person works.
(vi) Temporal Information Extraction (Event Extraction): The task of identifying events (i.e., information
which can be ordered in a temporal order) in free text and deriving detailed and structured information about
them, ideally identifying who did what to whom, where, when, and why.

(vii) Relation Extraction (RE): Relation extraction is the task of detecting and classifying pre-defined
relationships between entities identified in the text.

(viii) Knowledge Base Reasoning and Completion: There are various applications based on link prediction, such
as recommendation systems, knowledge base completion, and finding links between users in social networks. In
recommendation systems, the goal is to predict ratings for items (such as movies) that a user has not already rated
and to recommend items so that users have a better experience.

Ques 13. Explain in detail, the three main approaches of developing NLP system.

Ans. The three dominant approaches to developing NLP systems today are rule-based, traditional machine
learning (statistical), and neural network-based:
(i) Rule-based NLP (ii) Traditional (or classical) machine learning (iii) Neural networks

Rule based NLP


Traditional NLP software relies heavily on human-crafted rules of languages; domain experts, typically
linguists, curate these rules using things like regular expressions and pattern matching. Rule-based NLP
performs well in narrowly scoped-out use cases but typically does not generalize well. More and more rules are
necessary to generalize such a system, and this makes rule-based NLP a labor-intensive and brittle solution
compared to the other NLP approaches.
Here are examples of rules in a rule-based system: words ending in -ing are verbs, words ending in -er or -
est are adjectives, words ending in ’s are possessives, etc. Think of how many rules we would need to create by
hand to make a system that could analyze and process a large volume of natural language data. Not only would
the creation of rules be a mind-bogglingly difficult and tedious process, but we would also have to deal with the
many errors that would occur from using such rules. We would have to create rules for rules to address all the
corner cases for each and every rule.
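
The brittleness described above is easy to see in code. Below is a toy rule-based tagger built directly from the document's example rules (a sketch only; the rules deliberately misfire, e.g. on "ring"):

    # Toy rule-based tagger using the example rules above (illustrative only).
    import re

    RULES = [
        (re.compile(r"\w+ing$"), "VERB?"),      # words ending in -ing
        (re.compile(r"\w+(er|est)$"), "ADJ?"),  # words ending in -er / -est
        (re.compile(r"\w+'s$"), "POSS?"),       # words ending in 's
    ]

    def tag(word):
        for pattern, label in RULES:
            if pattern.fullmatch(word):
                return label
        return "UNKNOWN"

    for w in ["running", "faster", "John's", "ring", "table"]:
        print(w, "->", tag(w))   # note: "ring" is wrongly matched by the -ing rule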


Traditional (or classical) machine learning based NLP


Traditional machine learning relies less on rules and more on data. It uses a statistical approach, drawing
probability distributions of words based on a large annotated corpus. Humans still play a meaningful role;
domain experts need to perform feature engineering to improve the machine learning model’s performance.
Features include capitalization, singular versus plural, surrounding words, etc. After creating these features,
you would have to train a traditional ML model to perform NLP tasks, e.g., text classification. Since
traditional ML uses a statistical approach to determine when to apply certain features or rules to process
language, traditional ML-based NLP is easier to build and maintain than a rule-based system. It also generalizes
better than rule-based NLP.
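
A compact sketch of this statistical approach using scikit-learn (an assumed dependency; the toy texts and labels are invented for illustration):

    # Classical ML text classification: TF-IDF features + logistic regression.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["great movie, loved it", "terrible plot, boring",
             "wonderful acting", "awful film"]
    labels = ["pos", "neg", "pos", "neg"]

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, labels)                    # learns feature weights from data
    print(model.predict(["what a great, wonderful film"]))   # expected: ['pos']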

Neural networks based NLP


Neural networks address the shortcomings of traditional machine learning. Instead of requiring humans to
perform feature engineering, neural networks will “learn” the important features via representation learning. To
perform well, these neural networks just need large amounts of data. The amount of data required for these
neural nets to perform well is substantial, but, in today’s internet age, data is not too hard to acquire. You can
think of neural networks as very powerful function approximators or “rule” creators; these rules and features
are several degrees more nuanced and complex than the rules created by humans, allowing for more automated
learning and more generalization of the system in processing natural language data.
Of these three, the neural network–based branch of NLP, fueled by the rise of very deep neural networks
(i.e., deep learning), is the most powerful and the one that has led to many of the mainstream commercial
applications of NLP in recent years.
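
As a minimal sketch of the neural approach, the Hugging Face Transformers library (an assumed dependency) exposes pretrained Transformer models behind a one-line pipeline; the first call downloads a default model chosen by the library:

    # Neural NLP via a pretrained Transformer (assumes 'transformers' and a
    # backend such as PyTorch are installed).
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")
    print(classifier("NLP has come a long way since the 1950s."))
    # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]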

Ques 14. Draw the architecture of an NLP system. Explain the various components of a Natural Language
Generation system.
Ans. The architecture of an NLP system consists of the following modules:
(a) Text Planning: Selecting relevant content from the knowledge base; this is also known as content planning.
(b) Sentence Planning: Selecting the required words and forming meaningful phrases.
(c) Surface Realization: In the context of Natural Language Generation, surface realization is the task of
generating the linear form of a text following a given grammar. Surface realization models usually consist of
a cascade of complex sub-modules, either rule-based or neural network-based, each responsible for a specific
sub-task.
(d) Discourse Planning: Used for discourse integration, that is, maintaining a sense of the context. The meaning
of a single sentence may depend on the sentences that precede it and may also affect the meaning of the sentence
that follows it. For example, the word "that" in the sentence "He wanted that" depends upon the prior discourse
context.


Components of a Natural Language Generation System

(a) Text Organization: Text organization refers to how a text is organized to help readers follow and
understand the information presented. There are a number of standard forms that help text organization when
writing.
(b) Text Realization: In linguistics, realization is the process by which some kind of surface representation is
derived from its underlying representation; that is, the way in which some abstract object of linguistic analysis
comes to be produced in actual language. Phonemes are often said to be realized by speech sounds. The
different sounds that can realize a particular phoneme are called its allophones.


(c) Content Selection: Content selection is a central component in many natural language generation tasks,
where, given a generation goal, the system must determine which information should be expressed in the output
text. In summarization, content selection is usually accomplished through sentence (and, occasionally, phrase)
extraction.
(d) Linguistic Resources: Linguistic resources are essential for creating grammars in the framework of
symbolic approaches, or for training modules based on machine learning. In Latin, the word
corpus means body, but when used as a source of data in linguistics, it can be interpreted as a collection of
texts.

Ques 15. What is the organization of an NLP system? Explain.


Ans. The organization of a general NLP system is described below:
(i) Interpretation processes: These map from one representation to the other. For instance, the process
that maps a sentence to its syntactic structure and logical form is called the parser. It uses knowledge
about words and word meanings (the lexicon) and a set of rules defining the legal structures (the
grammar) in order to assign a syntactic structure and a logical form to an input sentence.
(ii) An alternative organization could perform syntactic processing first and then perform semantic
interpretation on the resulting structures. Combining the two, however, has considerable
advantages because it leads to a reduction in the number of possible interpretations, since every
proposed interpretation must simultaneously be syntactically and semantically well formed.
For example, consider the following two sentences:
• Visiting relatives can be tiring.
• Visiting museums can be tiring.
These two sentences have identical syntactic structure, and both are syntactically ambiguous. In the first sentence,
the subject might be relatives who are visiting you or the event of you visiting relatives. Both of these alternatives
are semantically valid, and you would need to determine the appropriate sense by using the contextual
mechanism. The second sentence, however, has only one possible semantic interpretation, since museums are not
objects that can visit other people; rather, they must be visited.

(iii) Contextual processing: The process that transforms the syntactic structure and logical form into a final
meaning representation is called contextual processing. This process includes issues such as
identifying the objects referred to by noun phrases such as definite descriptions (for example, "the
man") and pronouns, the analysis of the temporal aspects of the new information conveyed by the
sentence, and the identification of the speaker's intention (for example, whether "Can you lift that rock"
is a yes/no question or a request).


(iv) Inferential processing: This is required to interpret the sentence appropriately within the application domain.
It uses knowledge of the discourse context (determined by the sentences that preceded the current one) and
knowledge of the application to produce a final representation. The system would then perform whatever
reasoning tasks are appropriate for the application.

(v) Generation process: The meaning that must be expressed is passed to the generation component of the system. It
uses knowledge of the discourse context, plus information on the grammar and lexicon, to plan the form of an
utterance, which then is mapped into words by a realization process.

Ques 16. Explain the steps to build the NLP pipeline.


Ans. Following are the steps to build an NLP pipeline:

Step 1: Sentence Segmentation: Sentence segmentation is the first step in building the NLP pipeline. It breaks a
paragraph into separate sentences.

Example: Consider the following paragraph - Independence Day is one of the important festivals for every Indian
citizen. It is celebrated on the 15th of August each year ever since India got independence from the British rule. The
day celebrates independence in the true sense.

Sentence Segment produces the following result:


1. "Independence Day is one of the important festivals for every Indian citizen."
2. "It is celebrated on the 15th of August each year ever since India got independence from the British rule."
3. "This day celebrates independence in the true sense."

Step 2: Word Tokenization: A word tokenizer is used to break a sentence into separate words or tokens.

Example: For a sentence listing JavaTpoint's training offerings, the word tokenizer generates the following result:

"JavaTpoint", "offers", "Corporate", "Training", "Summer", "Training", "Online", "Training", "and", "Winter",
"Training", "."

Step 3: Stemming: Stemming is used to normalize words to their base or root form. For example, celebrates,
celebrated, and celebrating all originate from the single root word "celebrate". The big problem with
stemming is that it sometimes produces a root word which has no meaning.

For example, intelligence, intelligent, and intelligently are all reduced to the single root "intelligen". In English,
the word "intelligen" does not have any meaning.

Step 4: Lemmatization: Lemmatization is quite similar to stemming. It is used to group the different inflected
forms of a word into a single item, called the lemma. The main difference between stemming and lemmatization
is that lemmatization produces a root word which has a meaning.

For example: In lemmatization, the words intelligence, intelligent, and intelligently have the root word intelligent,
which has a meaning.

Step 5: Identifying Stop Words : In English, there are a lot of words that appear very frequently like "is", "and", "the",
and "a". NLP pipelines will flag these words as stop words. Stop words might be filtered out before doing any statistical
analysis. Example: He is a good boy.
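
A small sketch of stop-word filtering (assumes NLTK's 'stopwords' resource can be downloaded):

    # Stop-word removal using NLTK's English stop-word list.
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)
    stop_set = set(stopwords.words("english"))

    tokens = ["He", "is", "a", "good", "boy"]
    print([t for t in tokens if t.lower() not in stop_set])   # ['good', 'boy']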

Step 6: Dependency Parsing: Dependency parsing is used to find how all the words in a sentence are related to
each other.
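
A sketch of dependency parsing with spaCy (the same en_core_web_sm assumption as in the NER example earlier):

    # Dependency parsing with spaCy: each token points to its syntactic head.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    for token in nlp("He is a good boy"):
        print(token.text, token.dep_, "<-", token.head.text)
    # e.g. 'good' is an adjectival modifier (amod) of 'boy'.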

Step 7: POS Tagging: POS stands for parts of speech, which include noun, verb, adverb, and adjective. A POS tag
indicates how a word functions, in meaning as well as grammatically, within the sentence. A word may have one
or more parts of speech depending on the context in which it is used.

Example: "Google" something on the Internet. Google is used as a verb, although it is a proper noun.

Step 8: Named Entity Recognition (NER) : Named Entity Recognition (NER) is the process of detecting the named
entity such as person name, movie name, organization name, or location.

Example: Steve Jobs introduced iPhone at the Macworld Conference in San Francisco, California.

Step 9: Chunking: Chunking is used to collect individual pieces of information and group them into bigger units,
such as phrases.
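
A sketch of chunking with NLTK's regular-expression chunker; the grammar is a common textbook noun-phrase pattern (optional determiner, any adjectives, then a noun):

    # Noun-phrase chunking over POS-tagged tokens with nltk.RegexpParser.
    import nltk

    grammar = "NP: {<DT>?<JJ>*<NN>}"
    chunker = nltk.RegexpParser(grammar)

    tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD")]
    print(chunker.parse(tagged))
    # (S (NP the/DT little/JJ dog/NN) barked/VBD)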


Ques 17. What are the various techniques to evaluate an NLP system?
Ans: Various techniques to evaluate an NLP system are described below:
(i) Run and Test : One obvious way to evaluate a system is to run the program and see how well it
performs the task it was designed to do. If the program is meant to answer questions about a
database of facts, you might ask it questions to see how good it is at producing the correct answers.
(ii) Black Box Evaluation: If the system is designed to participate in simple conversations on a certain
topic, you might try conversing with it. This is called black box evaluation because it evaluates
system performance without looking inside to see how it works. While ultimately this method of
evaluation may be the best test of a system’s capabilities, it is problematic in the early stages of
research because early evaluation results can be misleading.
• Sometimes the techniques that produce the best results in the short term will not lead to the best
results in the long term. For instance, if the overall performance of all known systems in a given
application is uniformly low, few conclusions can be drawn.
• The fact that one system was correct 50 percent of the time while another was correct only 40
percent of the time says nothing about the long-term viability of either approach. Only when the
success rates become high, making a practical application feasible, can much significance be given
to overall system performance measures.
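
A minimal black-box evaluation sketch: treat the system as an opaque predict() function and measure accuracy against gold answers (the test data and the toy predictor below are hypothetical stand-ins, not a real system):

    # Black-box accuracy: compare system outputs against gold answers.
    def accuracy(predict, test_pairs):
        correct = sum(1 for question, gold in test_pairs
                      if predict(question) == gold)
        return correct / len(test_pairs)

    test_pairs = [("What is 2 + 2?", "4"),
                  ("What is the capital of France?", "Paris")]
    toy_system = lambda q: "4" if "2" in q else "Paris"   # stand-in for a real system
    print(f"accuracy = {accuracy(toy_system, test_pairs):.0%}")   # 100% on this toy set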

(iii) Glass Box Evaluation: An alternative method of evaluation is to identify various subcomponents of a
system and then evaluate each one with appropriate tests. This is called glass box evaluation because you look
inside at the structure of the system. The problem with glass box evaluation is that it requires some consensus
on what the various components of a natural language system should be.

Ques 18. What do you mean by understanding and representation of a natural language system?
Ans: A crucial component of understanding involves computing a representation of the meaning of sentences and
texts. Without a precise definition of the notion of representation, however, such a theory has little importance. For
instance, why not simply use the sentence itself as a representation of its meaning?
• One reason is that most words have multiple meanings, which we will call senses. The word "cook", for
example, has a sense as a verb and a sense as a noun; "dish" has multiple senses as a noun as well as a
sense as a verb; and "still" has senses as a noun, verb, adjective, and adverb.
• This ambiguity would inhibit the system from making the appropriate inferences needed to model
understanding. The disambiguation problem appears much easier than it actually is because people do
not generally notice ambiguity: a person does not seem to consciously consider each of the possible
senses of a word when understanding a sentence.
• To represent meaning, we must have a more precise language. The tools to do this come from
mathematics and logic and involve the use of formally specified representation languages.


Formal languages are specified from very simple building blocks. The most fundamental is the notion of
an atomic symbol, which is distinguishable from any other atomic symbol simply based on how it is
written. Useful representation languages have the following two properties:
• The representation must be precise and unambiguous. You should be able to express every distinct
reading of a sentence as a distinct formula in the representation.
• The representation should capture the intuitive structure of the natural language sentences that it represents. For
example, sentences that appear to be structurally similar should have similar structural representations, and the
meanings of two sentences that are paraphrases of each other should be closely related to each other.

Syntax (Representing Sentence Structure): The syntactic structure of a sentence indicates the way that
words in the sentence are related to each other. This structure indicates how the words are grouped
together into phrases, what words modify what other words, and what words are of central importance in
the sentence. In addition, this structure may identify the types of relationships that exist between phrases
and can store other information about the particular sentence structure that may be needed for later
processing. For example, consider the following sentences:
1. John sold the book to Mary.
2. The book was sold to Mary by John.
Most syntactic representations of language are based on the notion of context-free grammars, which
represent sentence structure in terms of what phrases are subparts of other phrases.

Logical Form: The intended meaning of a sentence depends on the situation in which the sentence is
produced. The division is between context-independent meaning and context-dependent meaning; the
representation of the context-independent meaning of a sentence is called its logical form.
The fact that "catch" may refer to a baseball move or the results of a fishing expedition is knowledge about
English and is independent of the situation in which the word is used. On the other hand, the fact that a
particular noun phrase "the catch" refers to what Jack caught when fishing yesterday is contextually
dependent.

Final Meaning Representation: The final representation needed is a general knowledge representation
(KR), which the system uses to represent and reason about its application domain. This is the language in
which all the specific knowledge based on the application is represented. The goal of contextual
interpretation is to take a representation of the structure of a sentence and its logical form, and to map this
into some expression in the KR that allows the system to perform the appropriate task in the domain. In a
question-answering application, a question might map to a database query; in a story-understanding
application, a sentence might map into a set of expressions that represent the situation that the sentence

describes. First-order predicate calculus (FOPC) is often chosen as the final representation language because it is
relatively well known, well studied, and precisely defined.
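
As a hedged illustration of such a logical form (the event variable and the predicate names Sell, Agent, Theme, and Recipient are invented for this example, in a neo-Davidsonian style), sentence 1 above, "John sold the book to Mary", might be represented in FOPC as:

    ∃e [ Sell(e) ∧ Agent(e, John) ∧ Theme(e, Book1) ∧ Recipient(e, Mary) ]

Its passive paraphrase (sentence 2) would map to the same formula, which is exactly the property discussed earlier: paraphrases should have closely related representations.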

Ques 19. Write short notes on the following:


(a) Morphemes (b) Polysemy and Homonymy (c) Phrases in language.
Ans. (a) Morphemes: Words are potentially complex units, composed of even more basic units called
morphemes. A morpheme is the smallest part of a word that has grammatical function or meaning. For
example, sawed, sawn, sawing, and saws can all be analyzed into the morphemes {saw} + {-ed}, {-n}, {-ing},
and {-s}, respectively. At the most elementary level, two types of morphemes exist:
(i) Lexical morphemes: These cannot be divided into smaller meaningful parts; their meaning exists entirely
within the word itself.
(ii) Grammatical morphemes: When a suffix such as {-ed, -ing, -ful, -ly, -est} or a prefix such as {pre-, sub-, un-}
is added to a word, the added element is known as a grammatical morpheme.
Affixes are classified according to whether they are attached before or after the form to which they are added.
Prefixes are attached before and suffixes after. E.g: {re-} of resaw is a prefix.
A root morpheme is the basic form to which other morphemes are attached. It provides the basic meaning of
the word. The morpheme {saw} is the root of sawers.
Derivational morphemes are added to forms to create separate words: {-er} is a derivational suffix whose
addition turns a verb into a noun, usually meaning the person or thing that performs the action denoted by the
verb. For example, {paint}+{-er} creates painter, one of whose meanings is “someone who paints.”
Inflectional morphemes do not create separate words. They merely modify the word in which they occur in
order to indicate grammatical properties such as plurality.

(b) Polysemy and Homonymy: A word is polysemous if it can be used to express different, related meanings. The
difference between the meanings can be obvious or subtle; e.g., "school" may refer to an institution, to the building
that houses it, or to its students as a group.
Two or more words are homonyms if they either sound the same (homophones), have the same spelling
(homographs), or both, but do not have related meanings. E.g.: (right & write), (piece & peace).

(c) Phrases in language: Traditionally “phrase” is defined as “a group of words that does not contain a verb
and its subject and is used as a single part of speech.” This definition has three characteristics:
(1) It specifies that only a group of words can constitute a phrase, implying that a single word cannot.
(2) It distinguishes phrases from clauses.
(3) It requires that the groups of words believed to be a phrase constitute a single grammatical unit.
• A single word may be a phrase when it is the head of that phrase. The head of a phrase is the phrase's
central element; any other words (or phrases) in the phrase orient to it, either by modifying it or
complementing it.


• The head determines the phrase's grammatical category: if the head is a noun, the phrase is a noun
phrase; if the head is a verb, the phrase is a verb phrase, and so on.
• The head can also determine the internal grammar of the phrase: if the head is a noun, then it may be
modified by an article; if the head is a transitive verb, it must be complemented by a direct object.
• Heads also determine such things as the number of their phrases: if the head of an NP is singular, then
the NP is singular; if the head is plural, then the NP is plural.

*****************End of Unit 1*****************

