NLP Notes

UNIT - I
Finding the Structure of Words: Words and Their Components, Issues and
Challenges, Morphological Models
Finding the Structure of Documents: Introduction, Methods, Complexity of the
Approaches, Performances of the Approaches
NLP
Natural Language Processing
NLP stands for Natural Language Processing, a field at the intersection of Computer
Science, Linguistics (human language), and Artificial Intelligence.
 It is the ability of a computer program to understand human language, also referred to
as natural language.
 It is a component of artificial intelligence.
 It is a technology used by machines to understand, analyse, manipulate,
and interpret human languages.
Applications of NLP
 Question Answering
 Spam Detection
 Sentiment Analysis
 Machine Translation
 Spelling correction
 Speech Recognition
 Chatbot
 Information extraction
Components of NLP
o NLU (Natural Language Understanding)
o NLG (Natural Language Generation)
NLU (Natural Language Understanding)

 Lexical Ambiguity
Lexical ambiguity exists when a single word has two or more possible meanings.
Example:
Manya is looking for a match.
Here "match" could mean a game, a matchstick, or a partner.
 Syntactic Ambiguity
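Syntactic ambiguity exists when a sentence has two or more possible grammatical
structures, and therefore two or more possible meanings.
Example:
I saw the girl with the binoculars.
Either I used the binoculars to see the girl, or the girl had the binoculars.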
 Referential Ambiguity
Referential ambiguity exists when it is unclear what a pronoun refers to.
Example: Kiran went to Suresh. He eats an apple.
In the above sentence, you do not know who "He" is: either Kiran or Suresh.
Phases of NLP
NLP Challenges
 Elongated words
 Shortcuts
 Emojis
 Mixed use of languages
 Ellipsis
LEXICAL ANALYSIS
 It is the fundamental (first) stage.
 It identifies and analyses the structure of words.
 It is word-level processing.
 It divides the whole text into paragraphs, sentences, and words.
 It involves stemming and lemmatization.
SYNTACTIC ANALYSIS
 Requires syntactic knowledge (grammar).
 Finds the roles played by words in a sentence.
 Interprets the relationships between words.
 Interprets the grammatical structure of sentences.
SEMANTIC ANALYSIS
 Draws the exact meaning, or dictionary meaning, from the text.
 Checks the text for meaningfulness.
DISCOURSE ANALYSIS
 Requires discourse knowledge: the meaning of a sentence depends on the sentences around it.
PRAGMATIC ANALYSIS
 Studies how people communicate with each other and in which context they are talking.
 Requires knowledge of the world.
Finding the Structure of Words
Words and Their Components
Words are the basic building blocks of a language. Words have the following components:
 Tokens
 Lexemes
 Morphemes
 Typology
Tokens
 Tokens are the smaller units (characters, words, or sentences) created by dividing the text.
 The process of identifying tokens in a given text is known as tokenization.
 Tokenization segments text into smaller units that are analysed individually.
 The input is text and the output is a list of tokens.
Types of Tokenization
 Character Tokenization
 Word Tokenization
 Sentence Tokenization
 Sub word Tokenization
 Number Tokenization
Character Tokenization
Input: "Today is Monday"
Output: ["T", "o", "d", "a", "y", "i", "s", "M", "o", "n", "d", "a", "y"]
Word Tokenization (whitespace based, punctuation based)
Sentence Tokenization
Input: "Tokenization is an important NLP task. It helps break down text into smaller units."
Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller units."]
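The character, word, and sentence tokenizers above can be sketched with plain Python and regular expressions. This is a minimal illustration, not a production tokenizer: the function names are made up for these notes, and the sentence splitter is naive (it would also split after an abbreviation such as "Dr.").

```python
import re

def char_tokenize(text):
    # Every non-space character becomes one token.
    return [ch for ch in text if not ch.isspace()]

def word_tokenize(text):
    # Runs of word characters and single punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def sentence_tokenize(text):
    # Split at whitespace that follows ".", "!" or "?".
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(char_tokenize("Today is Monday"))
print(word_tokenize("Today is Monday."))
print(sentence_tokenize(
    "Tokenization is an important NLP task. It helps break down text into smaller units."
))
```

Whitespace-based word tokenization alone would be text.split(); the regular expression adds the punctuation-based behaviour.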
Sub word Tokenization (frequently used words are kept whole; infrequently used
words are split into smaller units)
Input: unusual
Output: ["un", "usual"]
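One simple way to get the split shown above is greedy longest-match against a fixed list of known pieces. The sketch below assumes a tiny hand-written vocabulary (VOCAB is hypothetical); real subword tokenizers such as BPE or WordPiece learn their vocabulary from corpus frequencies.

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match-first split of `word` into pieces from `vocab`."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

VOCAB = {"un", "usual", "token", "ization"}  # hypothetical vocabulary

print(subword_tokenize("unusual", VOCAB))
print(subword_tokenize("tokenization", VOCAB))
```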
Morphological Process
Morphemes
Number Tokenization
Input: "She had 100 pencils"
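The notes give no output for this example. Assuming number tokenization means picking out the numeric tokens in a text, a minimal sketch could look like this (the function name is made up for illustration):

```python
import re

def number_tokenize(text):
    # Integers or decimals, for example 100 or 3.5
    return re.findall(r"\d+(?:\.\d+)?", text)

print(number_tokenize("She had 100 pencils"))
```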
LEXEMES
 A lexeme is the base or canonical form of a word (for example, "run" for "runs", "ran", and "running").
 The process of finding the lexeme is known as lemmatization.
MORPHEMES
 A morpheme is the smallest meaningful unit of a word; words are formed by combining one or more morphemes.
 The process of finding morphemes in text is known as morphological processing.
 There are the following types of morphemes:
1. Free morphemes
2. Bound morphemes
TYPOLOGY
 It refers to the categorization or classification of languages based on structural and grammatical features.
 There are the following categories:
1. Isolating or analytic languages
2. Synthetic languages
3. Agglutinative languages
Issues and Challenges
 Irregularity
 Ambiguity
 Productivity
Irregularity
 If words or word forms follow regular patterns, it is regularity (for example, walk → walked).
 If words or word forms do not follow regular patterns, it is
irregularity (for example, go → went).
Ambiguity
 A word or word form that can have more than one meaning
when taken out of context.
 Word forms that look the same but whose meaning is not unique.
 It occurs during morphological processing.
 Ambiguity can be:
o Word sense ambiguity
 The meaning depends on the context
o Part-of-speech ambiguity
 The same word can belong to different parts of speech
o Structural ambiguity
 Multiple valid syntactic structures
o Referential ambiguity
 Unclear which person or noun is being referred to
Productivity
 New words or word forms can be formed using productive rules.
 Examples: person names, location names, organization names.
Morphological Models
Morphological models are used to analyse the structure and formation of words.
There are 5 morphological models:
 Dictionary Lookup
 Finite state morphology
 Unification based morphology
 Functional morphology
 Morphological Induction
Morphemes: root word + affixes (prefix, infix, suffix)
Dictionary Lookup
Includes:
Search for the word's base form or canonical form in the dictionary
Retrieve the information stored with it
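Dictionary lookup can be pictured as a plain table from word forms to stored analyses. The sketch below uses a hypothetical five-entry LEXICON; a real system would load a full morphological dictionary. Words that are not listed simply fail, which is the main weakness of this model.

```python
# Hypothetical mini-dictionary: word form -> (base form, part of speech, features)
LEXICON = {
    "cat": ("cat", "N", "singular"),
    "cats": ("cat", "N", "plural"),
    "mice": ("mouse", "N", "plural"),
    "went": ("go", "V", "past"),
    "running": ("run", "V", "present participle"),
}

def lookup(word):
    """Return the stored analysis for `word`, or None if it is not listed."""
    return LEXICON.get(word.lower())

for w in ["Mice", "went", "blogs"]:
    print(w, "->", lookup(w))
```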
Finite state morphology
Based on formal language theory
Implemented with FSTs (finite state transducers)
success = stem
un + success = prefix + stem
success + ful = stem + suffix
un + success + ful = prefix + stem + suffix
[Figure: finite-state automaton whose arcs are labelled prefix, stem, suffix, and ε]
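The prefix + stem + suffix pattern can be tried out with a few lines of Python. This is only a sketch under stated assumptions: the affix and stem lists are hypothetical and contain just the pieces used in the example, and the search is a brute-force loop rather than a real finite-state machine.

```python
PREFIXES = ["un"]      # hypothetical affix and stem lists
SUFFIXES = ["ful"]
STEMS = {"success"}

def segment(word):
    """Split `word` into (prefix, stem, suffix); absent parts are empty strings."""
    for prefix in [""] + PREFIXES:
        for suffix in [""] + SUFFIXES:
            if not word.startswith(prefix) or not word.endswith(suffix):
                continue
            stem = word[len(prefix):len(word) - len(suffix)]
            if stem in STEMS:
                return prefix, stem, suffix
    return None  # the word cannot be built from the known pieces

for w in ["success", "successful", "unsuccessful", "happy"]:
    print(w, "->", segment(w))
```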
STEM CHANGES
Some irregular words require stem changes
Example: mouse → mice
[Figure: FST that spells out d-o-g and mouse/mice letter by letter, with ε (epsilon) transitions]
An FST has two tapes
 Surface tape
 Lexical tape
Surface tape: c a t s
Lexical tape: c a t +N +Pl (noun, plural)
An FST is formally defined as a 7-tuple
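The two tapes can be illustrated with a tiny hand-built transducer that reads the surface form and writes the lexical form. The states, the end-of-word marker "#", and the "+N+Sg" tag are assumptions made for this sketch; it is deterministic and covers only "cat" and "cats".

```python
# Transitions: (state, surface symbol) -> (next state, lexical output).
# "#" marks the end of the word. States and tags are hypothetical.
TRANSITIONS = {
    (0, "c"): (1, "c"),
    (1, "a"): (2, "a"),
    (2, "t"): (3, "t"),
    (3, "s"): (4, "+N+Pl"),  # surface "s" is rewritten as the plural tag
    (3, "#"): (5, "+N+Sg"),  # word ends right after the stem: singular
    (4, "#"): (5, ""),
}
FINAL_STATES = {5}

def transduce(surface):
    """Map a surface form to its lexical form, or None if it is rejected."""
    state, output = 0, []
    for symbol in surface + "#":
        if (state, symbol) not in TRANSITIONS:
            return None
        state, out = TRANSITIONS[(state, symbol)]
        output.append(out)
    return "".join(output) if state in FINAL_STATES else None

print(transduce("cats"))
print(transduce("cat"))
print(transduce("dog"))
```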
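MORPHEME TYPES
Basically, there are two types of morphemes:
o Free morphemes
 Lexical
 Examples: boy, house, run
 Functional
 Examples: and, the, in
o Bound morphemes
 Inflectional
 Examples: -s (cats), -ed (walked)
 Derivational
 Class changing, e.g. -ness (happy → happiness)
 Class maintaining, e.g. un- (happy → unhappy)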
Finding the Structure of Documents
Segmentation is chunking the input text or speech into blocks.
Types of segmentation
 Sentence boundary detection
o Optical character recognition (OCR) output
o Automatic speech recognition (ASR) output
 Topic boundary detection
Terminology: corpus → documents/sentences → words/tokens → vocabulary
Examples for sentence boundary detection:
I met Dr. Xyz and he suggested some medicines. (The period after "Dr" is not a sentence boundary.)
What is the time now?
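The "Dr." case shows why splitting at every period is not enough. A minimal rule-based sketch is given below; the abbreviation list is hypothetical and far from complete, and the rule would still split decimal numbers such as 3.5.

```python
import re

ABBREVIATIONS = {"dr", "mr", "mrs", "prof"}  # hypothetical, incomplete list

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.?!]", text):
        # Word immediately before the punctuation mark.
        before = re.findall(r"\w+$", text[start:match.start()])
        if match.group() == "." and before and before[0].lower() in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a boundary
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = "I met Dr. Xyz and he suggested some medicines. What is the time now?"
print(split_sentences(text))
```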
Topic boundary detection
 Also called discourse segmentation or text segmentation
 The process of dividing speech or text into topically homogeneous blocks
is called topic segmentation
 Two cues (text segmentation)
1. Headlines
2. Paragraph breaks
 Two cues (speech segmentation)
1. Pause duration
2. Speaker changes
METHODS for sentence boundary and topic boundary detection
1. Generative sequence classification method
2. Discriminative local classification method
Generative sequence classification method
 Observations: words and punctuation
 Labels: sentence boundary, topic boundary
The Hidden Markov Model (HMM) is one method of the generative approach
How it works
 Learn from the data the relationship between the observations and the corresponding
hidden states (for example, POS tags)
 Predict the label sequence for a new input
 Classify the new sequence
Ex:
I Love Coding
She sells apples
the quick brown fox jumps over the lazy dog
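The prediction step of an HMM is usually done with the Viterbi algorithm, which finds the most probable hidden-state sequence for the observed words. The sketch below tags "She sells apples" with POS labels. All probabilities are hand-set assumptions for illustration, not values learned from data, and a word missing from the emission table would raise a KeyError.

```python
STATES = ["PRON", "VERB", "NOUN"]
# Hypothetical start, transition, and emission probabilities.
START = {"PRON": 0.6, "VERB": 0.1, "NOUN": 0.3}
TRANS = {
    "PRON": {"PRON": 0.1, "VERB": 0.8, "NOUN": 0.1},
    "VERB": {"PRON": 0.2, "VERB": 0.1, "NOUN": 0.7},
    "NOUN": {"PRON": 0.1, "VERB": 0.6, "NOUN": 0.3},
}
EMIT = {
    "PRON": {"she": 0.9, "sells": 0.05, "apples": 0.05},
    "VERB": {"she": 0.05, "sells": 0.9, "apples": 0.05},
    "NOUN": {"she": 0.05, "sells": 0.15, "apples": 0.8},
}

def viterbi(words):
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               previous state on that path)
    best = [{s: (START[s] * EMIT[s][words[0]], None) for s in STATES}]
    for word in words[1:]:
        column = {}
        for s in STATES:
            column[s] = max(
                (best[-1][p][0] * TRANS[p][s] * EMIT[s][word], p) for p in STATES
            )
        best.append(column)
    # Trace back from the most probable final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

print(viterbi(["she", "sells", "apples"]))
```

For boundary detection the same algorithm applies; the hidden states would be "boundary" and "no boundary" instead of POS tags.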
Discriminative local classification method
Local features: word, prefixes, suffixes, nearby POS tags
Labels: sentence boundary label, topic boundary label
Ex: maximum entropy Markov model (MEMM), SVM
Applications:
1. POS tagging
2. Speech recognition
3. Named entity recognition
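A discriminative classifier looks at each candidate position separately and decides from local features. The sketch below only builds the feature dictionary for one position; the feature names are made up for illustration, and the classifier itself (for example maximum entropy or an SVM) is not shown.

```python
def boundary_features(tokens, i):
    """Local features for the candidate boundary after tokens[i]."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<START>"
    next_word = tokens[i + 1] if i + 1 < len(tokens) else "<END>"
    return {
        "word": word.lower(),
        "prefix2": word[:2].lower(),
        "suffix2": word[-2:].lower(),
        "is_period": word == ".",
        "prev_word": prev_word.lower(),
        "next_word": next_word.lower(),
        "next_is_capitalized": next_word[:1].isupper(),
    }

tokens = ["I", "met", "Dr", ".", "Xyz", ".", "What", "is", "the", "time", "?"]
print(boundary_features(tokens, 3))   # the period after "Dr"
print(boundary_features(tokens, 5))   # the period after "Xyz"
```

Both periods are followed by a capitalized word, so it is the previous-word feature ("dr" versus "xyz") that lets a trained classifier tell them apart.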
Complexity of approaches
 Quality
 Quantity
 Computational complexity
 Structural complexity
 Space
 Time
 Training
 Prediction
Performance of the approaches

 Precision
 Recall
 Accuracy
 F1 measure/F1 score
Confusion matrix (n = 165)
                Predicted No    Predicted Yes
Actual No       TN = 50         FP = 10
Actual Yes      FN = 5          TP = 100
Precision
When it predicts Yes, how often is it correct?
TP / Predicted Yes = 100/110 = 90.9%
True Negative Rate (Specificity):
When it is actually No, how often does it predict No?
TN / Actual No = 50/60 = 83.3%
True Positive Rate (Recall/Sensitivity):
When it is actually Yes, how often does it predict Yes?
TP / Actual Yes = 100/105 = 95.2%
Accuracy
How often is the classifier correct?
(TN + TP) / (TN + FP + FN + TP)
= (50 + 100) / (50 + 10 + 5 + 100)
= 150/165 = 90.9%
Misclassification Rate:
Overall, how often is it wrong?
(FN + FP) / (TN + FP + FN + TP)
= (5 + 10) / (50 + 10 + 5 + 100)
= 15/165 = 9.1%
PRECISION = Total no. of correct positive predictions / Total no. of
positive predictions = TP / (TP + FP)
RECALL = Total no. of correct positive predictions / Total no. of actual
positive instances = TP / (TP + FN)
ACCURACY = (TN + TP) / (TN + FP + FN + TP)
F1 SCORE = 2 * precision * recall / (precision + recall)
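The formulas can be checked against the counts used in the worked example (TP = 100, TN = 50, FP = 10, FN = 5). The function below is a small sketch written for these notes, not a library API.

```python
TP, TN, FP, FN = 100, 50, 10, 5  # counts from the worked example

def metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "true_negative_rate": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "f1": 2 * precision * recall / (precision + recall),
    }

for name, value in metrics(TP, TN, FP, FN).items():
    print(f"{name}: {value:.3f}")
```

With these counts the F1 score works out to 200/215, about 0.930.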
UNIT - II
Prerequisite: CFG (context-free grammar)
Syntax Analysis: Parsing Natural Language, Treebanks: A Data-Driven
Approach to Syntax, Representation of Syntactic Structure, Parsing Algorithms,
Models for Ambiguity Resolution in Parsing, Multilingual Issues
Types of parsers:
Chart parser
RegEx parser
Shift-reduce parser
Recursive descent parser
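Syntax Analysis / Syntactic Analysis
Syntax vs grammar
A programming language accepts only certain orderings in a function declaration; which one is valid depends on the language's syntax:
Return_type Function_name(parameters);
Function_name(parameters) return_type;
Function_name(parameter);
Return_type Function_name(parameters)
In the same way, English syntax accepts some word orders and rejects others:
Ramu eats apple (accepted word order)
Eats ramu apple (rejected word order)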
Tree
Parsing
CFG
G = (N, T, P, S)
(N: non-terminals, T: terminals, P: production rules, S: start symbol)
A → α
α ∈ (N ∪ T)*
A → B
S → NP VP
NP → {Art (Adj) N, PRO, PN}
VP → V NP (PP) (Adv)
PP → PREP NP
NP → N
NP → D N
Ate ramu apple the (cannot be derived from the grammar)
Ramu ate the apple (can be derived from the grammar)
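Whether a word order can be derived from the grammar can be checked with a small recursive descent recognizer. The sketch below covers only a simplified subset of the rules (S → NP VP, NP → D N | N, VP → V NP) and uses a hypothetical five-word lexicon; the optional Adj, PP, and Adv parts are left out.

```python
LEXICON = {"ramu": "N", "apple": "N", "the": "D", "ate": "V", "eats": "V"}

def parse_np(tags, i):
    """NP -> D N | N. Return the position after the NP, or None."""
    if tags[i:i + 2] == ["D", "N"]:
        return i + 2
    if tags[i:i + 1] == ["N"]:
        return i + 1
    return None

def parse_vp(tags, i):
    """VP -> V NP. Return the position after the VP, or None."""
    if tags[i:i + 1] == ["V"]:
        return parse_np(tags, i + 1)
    return None

def is_sentence(text):
    """S -> NP VP, and the whole input must be consumed."""
    words = text.lower().split()
    if any(w not in LEXICON for w in words):
        return False
    tags = [LEXICON[w] for w in words]
    i = parse_np(tags, 0)
    if i is None:
        return False
    return parse_vp(tags, i) == len(tags)

print(is_sentence("Ramu ate the apple"))
print(is_sentence("Ate ramu apple the"))
```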
Corpora: Brown, Switchboard
Tagged text: At/in the/at same/ap time/nn reaction/nn among/in anti-organization/jj
Plain text: At the same time reaction among anti-organization
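The word/tag notation is easy to read programmatically. A minimal sketch that turns the tagged line into (word, tag) pairs and recovers the plain text (the function name is made up for illustration):

```python
tagged = "At/in the/at same/ap time/nn reaction/nn among/in anti-organization/jj"

def parse_tagged(text):
    """Split 'word/tag' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")  # split on the last slash
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged(tagged)
print(pairs)
print(" ".join(word for word, _ in pairs))
```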
Penn treebank
Ate ramu apple the
Representation of Syntactic Structure
Two types of approaches
 Phrase structure graph

o Example
 Dependency graph
o Example
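Both representations can be written down as simple Python data for the sentence "Ramu ate the apple". The structures and relation names below are illustrative choices, not a standard annotation scheme: the phrase structure is a nested tuple, and the dependency graph is a list of (head, relation, dependent) arcs.

```python
# Phrase structure: nested (label, child, ...) tuples; leaves are words.
phrase_structure = (
    "S",
    ("NP", ("N", "Ramu")),
    ("VP", ("V", "ate"), ("NP", ("D", "the"), ("N", "apple"))),
)

# Dependency graph: (head, relation, dependent) arcs; the verb is the root.
dependencies = [
    ("ate", "subject", "Ramu"),
    ("ate", "object", "apple"),
    ("apple", "determiner", "the"),
]

def leaves(tree):
    """Read the words off a phrase-structure tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words

def bracketed(tree):
    """Render the tree in bracketed notation."""
    if isinstance(tree, str):
        return tree
    return "(" + tree[0] + " " + " ".join(bracketed(c) for c in tree[1:]) + ")"

print(bracketed(phrase_structure))
print(leaves(phrase_structure))
for head, relation, dependent in dependencies:
    print(f"{head} --{relation}--> {dependent}")
```

The phrase structure groups words into nested constituents, while the dependency graph links each word directly to its head and has no phrase-level nodes.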
