
Natural Language Processing

Dr. Ankur Priyadarshi


Assistant Professor
Computer Science and Information Technology
Syllabus

Prerequisites:

1. Basic knowledge of English grammar and the Theory of Computation.
2. Basic knowledge of Machine Learning tools.
Course objectives
1. To understand the algorithms available for the processing of linguistic information and the computational properties of natural languages.

2. To acquire basic knowledge of various morphological, syntactic and semantic NLP tasks.

3. To become familiar with various publicly available NLP software libraries and datasets.

4. To develop systems of moderate complexity for various NLP problems.

5. To learn various strategies for NLP system evaluation and error analysis.
Unit I:
INTRODUCTION TO NLP
Natural Language Processing
⊹ Natural language processing (NLP) refers to the branch of computer science, and more specifically the branch of artificial intelligence or AI, concerned with giving computers the ability to understand text and spoken words in much the same way human beings can.

⊹ NLP combines computational linguistics (rule-based modeling of human language) with statistical, machine learning, and deep learning models.

⊹ Together, these technologies enable computers to process human language in the form of text or voice data and to ‘understand’ its full meaning, complete with the speaker or writer’s intent and sentiment.


NLP APPLICATIONS

1. Information Extraction

2. Question Answering

3. Sentiment Analysis

4. Machine Translation, and many more:

Speech recognition, intent classification, urgency detection, auto-correct, market intelligence, email filtering, voice assistants and chatbots, targeted advertising, recruitment
Information Extraction (IE)
1. Working with an enormous amount of text data is always hectic and time-consuming.

2. Hence, many companies and organisations rely on Information Extraction techniques to automate manual work with intelligent algorithms.

3. Information extraction can reduce human effort, reduce expenses, and make the process less error-prone and more efficient.


Example: IE

We can extract the following structured information from a short news report about a cricket match:

● Country – India, Captain – Virat Kohli
● Batsman – Virat Kohli, Runs – 2
● Bowler – Kyle Jamieson
● Match venue – Wellington
● Match series – New Zealand
● Series highlight – single fifty, 8 innings, 3 formats
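As a hedged illustration of how such fields might be pulled out automatically, the sketch below runs spaCy's pretrained named-entity recognizer over an invented sentence in the spirit of this cricket example (it assumes spaCy and its small English model en_core_web_sm are installed).

```python
# Minimal information-extraction sketch using spaCy's pretrained NER model.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

# An invented sentence in the spirit of the slide's cricket example.
text = ("India captain Virat Kohli was dismissed for 2 runs by Kyle Jamieson "
        "in the Test match at Wellington against New Zealand.")

doc = nlp(text)
for ent in doc.ents:
    # Each entity span carries a label such as PERSON, GPE or CARDINAL.
    print(f"{ent.label_:10s} {ent.text}")
```

A full IE system would then map these raw entity labels onto domain-specific fields such as Captain, Bowler or Match venue.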
Question Answering

⊹ Question answering is a critical NLP problem and a long-standing artificial intelligence milestone.
⊹ QA systems allow a user to express a question in natural language and get an immediate and brief response.
⊹ QA systems are now found in search engines and phone conversational interfaces, and they are fairly good at answering simple, factual questions.
⊹ On harder questions, however, they normally only go as far as returning a list of snippets that we, the users, must then browse through to find the answer.
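A minimal sketch of extractive QA, assuming the Hugging Face transformers library is installed (a default pretrained QA model is downloaded on first use); the question and context below are invented for illustration.

```python
# Minimal extractive question-answering sketch with Hugging Face transformers.
# Assumes: pip install transformers (a default pretrained QA model is fetched).
from transformers import pipeline

qa = pipeline("question-answering")

context = ("Virat Kohli captained India in the Test match at Wellington, "
           "where he was dismissed by Kyle Jamieson for 2 runs.")
result = qa(question="Who captained India at Wellington?", context=context)

# The pipeline returns the answer span plus a confidence score.
print(result["answer"], result["score"])
```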
Sentiment Analysis

⊹ Sentiment analysis (or opinion mining) is a natural language processing (NLP) technique used to determine whether a piece of data is positive, negative or neutral.

⊹ Sentiment analysis, as the name suggests, means identifying the view or emotion behind a situation: analyzing a piece of text, speech or any other mode of communication to find the emotion or intent behind it.


Suppose a fast-food chain sells a variety of food items such as burgers, pizza, sandwiches and milkshakes. Customers can order any item from the company's website and can also leave reviews saying whether they liked the food or hated it.

● User Review 1: I love this cheese sandwich, it’s so delicious.
● User Review 2: This chicken burger has a very bad taste.
● User Review 3: I ordered this pizza today.
Out of these three reviews:

The first review is clearly positive and signifies that the customer was really happy with the sandwich. The second review is negative, so the company needs to look into its burger department. The third one does not signify whether the customer is happy or not, so we can consider it a neutral statement.
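A minimal sketch of scoring these reviews automatically with NLTK's rule-based VADER analyzer (an assumption: NLTK is installed and the vader_lexicon resource can be downloaded; the 0.05 cut-offs are the thresholds commonly suggested for VADER, not something fixed by the slide).

```python
# Minimal sentiment-analysis sketch using NLTK's rule-based VADER analyzer.
# Assumes: pip install nltk (the vader_lexicon resource is downloaded below).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

reviews = [
    "I love this cheese sandwich, it's so delicious.",
    "This chicken burger has a very bad taste.",
    "I ordered this pizza today.",
]
for review in reviews:
    # The compound score ranges from -1 (most negative) to +1 (most positive).
    compound = sia.polarity_scores(review)["compound"]
    label = "positive" if compound > 0.05 else "negative" if compound < -0.05 else "neutral"
    print(f"{label:8s} {compound:+.2f}  {review}")
```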
Machine Translation
Machine Translation (MT) is the task of automatically converting one natural language into another, preserving the meaning of the input text, and producing fluent text in the output language.


While machine translation is one of the oldest subfields of artificial intelligence research, the recent shift towards large-scale empirical techniques has led to very significant improvements in translation quality.

The Stanford Machine Translation group's research interests lie in techniques that utilize both statistical methods and deep linguistic analyses.
Machine translation: approaches

● Rule-based Machine Translation (RBMT): 1970s-1990s

● Statistical Machine Translation (SMT): 1990s-2010s

● Neural Machine Translation (NMT): 2014-...


Rule-based MT (RBMT)

A rule-based system requires expert knowledge of the source and the target language to develop the syntactic, semantic and morphological rules needed to achieve the translation.

The Wikipedia article on RBMT includes a basic example of rule-based translation from English to German. The translation needs an English-German dictionary, a rule set for English grammar and a rule set for German grammar.


An RBMT system contains a pipeline of Natural Language Processing (NLP) tasks including tokenization, part-of-speech tagging and so on. Most of these jobs have to be done in both the source and the target language.

SYSTRAN is one of the oldest machine translation companies. It translates from and to around 20 languages. SYSTRAN was used for the Apollo-Soyuz project (1973) and by the European Commission (1975).


Advantages

● No bilingual text required

● Domain-independent

● Total control (a possible new rule for every situation)

● Reusability (existing rules of languages can be transferred when paired with new languages)

Disadvantages

● Requires good dictionaries

● Manually set rules (requires expertise)


Statistical MT
This approach uses statistical models based on the analysis of bilingual text corpora.

It was first introduced in 1955, but it gained interest only after 1988, when the IBM Watson Research Center started using it.


SMT Examples

● Google Translate (between 2006 and 2016, when they announced the switch to NMT)

● Microsoft Translator (in 2016 changed to NMT)

● Moses: Open source toolkit for statistical machine translation


Advantages

● Less manual work from linguistic experts

● One SMT framework is suitable for many language pairs

● Less out-of-dictionary translation: with the right language model, the translation is more fluent

Disadvantages

● Requires bilingual corpus

● Specific errors are hard to fix

● Less suitable for language pairs with big differences in word order


Neural MT
❖ The neural approach uses neural networks to achieve machine translation.

❖ Compared to the previous models, NMTs can be built with one network instead of a pipeline of separate tasks.


NMT examples

● Google Translate (from 2016), from the language team at Google AI

● Microsoft Translator (from 2016), from the MT research group at Microsoft

● Translation on Facebook, from the NLP group at Facebook AI

● OpenNMT: an open-source neural machine translation system
Advantages

● End-to-end models (no pipeline of specific tasks)

Disadvantages

● Requires bilingual corpus

● Rare word problem
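As a hedged sketch of the end-to-end character of NMT, the example below calls a single pretrained encoder-decoder model through the Hugging Face transformers pipeline, using the publicly available Helsinki-NLP/opus-mt-en-de English-to-German checkpoint (an assumption; any MarianMT translation model would be used the same way).

```python
# Minimal neural machine translation sketch: one pretrained encoder-decoder
# model handles the whole task, with no pipeline of separate rule-based steps.
# Assumes: pip install transformers sentencepiece, plus the (assumed)
# Helsinki-NLP/opus-mt-en-de checkpoint downloaded on first use.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

result = translator("Machine translation converts text from one language into another.")
print(result[0]["translation_text"])
```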


NLP PHASES
Lexical Analysis
● It involves identifying and analyzing the structure of words. The lexicon of a language means the collection of words and phrases in that particular language.

● Lexical analysis divides the text into paragraphs, sentences, and words. We then need to perform lexicon normalization.

The most common lexicon normalization techniques are stemming and lemmatization:

● Stemming: the process of reducing derived words to their word stem, base, or root form, generally by stripping written suffixes such as “ing”, “ly”, “es” and “s”.
● Lemmatization: the process of reducing a group of words to their lemma or dictionary form. It takes into account things like POS (part of speech), the meaning of the word in the sentence, the meaning of the word in nearby sentences, etc., before reducing the word to its lemma.
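A minimal sketch contrasting the two normalization techniques with NLTK (an assumption: NLTK is installed and the WordNet resources can be downloaded; the sample words and the verb POS tag are chosen only for illustration).

```python
# Minimal lexicon-normalization sketch: Porter stemming vs WordNet lemmatization.
# Assumes: pip install nltk (WordNet resources are downloaded below).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "running", "leaves", "better"]:
    # Stemming chops suffixes; lemmatization maps to a dictionary form
    # (here treating each word as a verb via pos="v").
    print(f"{word:10s} stem={stemmer.stem(word):10s} lemma={lemmatizer.lemmatize(word, pos='v')}")
```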
Syntactic Analysis
Syntactic analysis is used to check grammar, the arrangement of words, and the interrelationships between the words.

Example: “Mumbai goes to Sara”

Here “Mumbai goes to Sara” does not make any sense, so this sentence is rejected by the syntactic analyzer.

Syntactic parsing involves the analysis of the words in a sentence for grammar.

Dependency grammar and part-of-speech (POS) tags are the important attributes of text syntax.
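A minimal sketch of the kind of syntactic information a parser produces, using spaCy's pretrained English pipeline (assuming the en_core_web_sm model is installed; the sentence is invented for illustration).

```python
# Minimal syntactic-analysis sketch: POS tags and dependency relations via spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sara goes to Mumbai.")

for token in doc:
    # Each token carries a part-of-speech tag, its dependency relation,
    # and the head word it attaches to in the parse tree.
    print(f"{token.text:8s} {token.pos_:6s} {token.dep_:8s} head={token.head.text}")
```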


Semantic analysis

The way we understand what someone has said is an unconscious process relying on our intuition and knowledge about language itself.

In other words, the way we understand language is heavily based on meaning and context. Computers need a different approach, however.

The word “semantic” is a linguistic term and means “related to meaning or logic”.

Semantic analysis is the process of understanding the meaning and interpretation of words, signs and sentence structure.
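As a small illustration of machine-readable word meaning, the sketch below looks up senses of the word “bank” in WordNet through NLTK (an assumption: NLTK is installed and the wordnet resource can be downloaded; each synset groups words that share one sense, which a semantic analyzer must choose between).

```python
# Minimal semantic lookup sketch: word senses ("synsets") from WordNet via NLTK.
# Assumes: pip install nltk (the wordnet resource is downloaded below).
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

# "bank" has several distinct senses; a semantic analyzer must pick the right one.
for synset in wordnet.synsets("bank")[:3]:
    print(synset.name(), "-", synset.definition())
```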
Discourse Integration

Discourse integration is closely related to pragmatics (the context of the sentence).

Discourse integration considers the larger context for any smaller part of NL structure. NL is so complex that, most of the time, sequences of text depend on prior discourse.

This concept often shows up in pragmatic ambiguity. The analysis deals with how the immediately preceding sentence can affect the meaning and interpretation of the next sentence. Here, a sentence can be analyzed in a bigger context, such as at the paragraph level, document level, and so on.


Pragmatic Analysis
Pragmatic analysis is part of the process of extracting information from text. Specifically, it is the portion that focuses on taking a structured set of text and figuring out what the actual meaning was.

It comes from the field of linguistics (as a lot of NLP does), where the context in which something is said is considered together with the text itself.

Why is this important? Because a lot of a text’s meaning has to do with the context in which it was said or written.

Ambiguity, and limiting ambiguity, are at the core of natural language processing, so pragmatic analysis is quite crucial for extracting meaning or information.
Difficulties in NLP

● Contextual words and phrases and homonyms

● Synonyms

● Irony and sarcasm

● Ambiguity

● Errors in text or speech

● Colloquialisms and slang

● Domain-specific language

● Low-resource languages

● Lack of research and development
