NLP Notes


UNIT - I

Finding the Structure of Words: Words and Their Components, Issues and
Challenges, Morphological Models

Finding the Structure of Documents: Introduction, Methods, Complexity of the
Approaches, Performances of the Approaches

NLP
Natural Language Processing
NLP stands for Natural Language Processing, a field at the intersection of
computer science, linguistics (human language), and artificial intelligence.
 It is the ability of a computer program to understand human language, also
referred to as natural language.
 It is a component of artificial intelligence.
 It is a technology used by machines to understand, analyse, manipulate,
and interpret human languages.
Applications of NLP
 Question Answering
 Spam Detection
 Sentiment Analysis
 Machine Translation
 Spelling correction
 Speech Recognition
 Chatbot
 Information extraction

Components of NLP
o NLU (Natural Language Understanding)
o NLG (Natural Language Generation)

NLU (Natural Language Understanding)


 Lexical Ambiguity

Lexical ambiguity exists when a single word in a sentence has two or more
possible meanings.

Example:

Manya is looking for a match.

Here "match" could mean a matchstick, a sports game, or a suitable partner.

 Syntactic Ambiguity
Syntactic ambiguity exists when the structure of a sentence allows two or more
possible meanings.

Example:

I saw the girl with the binoculars.

Either the speaker used the binoculars to see the girl, or the girl was holding the binoculars.

 Referential Ambiguity

Referential ambiguity exists when it is unclear what a pronoun refers to.

Example: Kiran went to Suresh. He eats an apple.

In the above sentence, it is not clear who eats the apple: "He" could refer to either Kiran or Suresh.

Phases of NLP
NLP Challenges

 Elongated words
 Shortcuts
 Emojis
 Mixed use of languages
 Ellipsis
LEXICAL ANALYSIS

 It is the fundamental (first) stage
 It identifies and analyses the structure of words
 It is word-level processing
 It divides the whole text into paragraphs, sentences and words
 It involves stemming and lemmatization

SYNTACTIC ANALYSIS

 Requires syntactic knowledge
 Finds the roles played by words in a sentence
 Interprets the relationships between words
 Interprets the grammatical structure of sentences

SEMANTIC ANALYSIS

 Extracts the exact (dictionary) meaning from the text
 Checks the text for meaningfulness

DISCOURSE ANALYSIS

 Requires discourse knowledge

PRAGMATIC ANALYSIS

 Deals with how people communicate with each other and the context in which they are talking
 Requires knowledge of the world
Finding the Structure of Words

Words and Their Components

Words are the basic building blocks of a language. The components of words are:

 Tokens
 Lexemes
 Morphemes
 Typology

Tokens

 Tokens are the units (usually words) created by dividing the text into smaller pieces
 The process of identifying tokens in a given text is known as tokenization
 Tokenization segments text into smaller units that are analysed individually
 The input is text and the output is a list of tokens

Types of Tokenization

 Character Tokenization
 Word Tokenization
 Sentence Tokenization
 Subword Tokenization
 Number Tokenization

Character Tokenization

Input: "Today is Monday"

Output: ["T", "o", "d", "a", "y", "i", "s", "M", "o", "n", "d", "a", "y"]

Word Tokenization (whitespace based, punctuation based)

Sentence Tokenization

Input: "Tokenization is an important NLP task. It helps break down text into smaller units."

Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller units."]
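
The character, word and sentence tokenization types above can be sketched with the Python standard library. This is a minimal illustration, not a production tokenizer: the function names are made up for this example, and the sentence splitter uses a naive rule (split after ".", "?" or "!" followed by whitespace) that will fail on abbreviations.

```python
import re

def char_tokens(text):
    # Character tokenization: one token per non-whitespace character.
    return [ch for ch in text if not ch.isspace()]

def whitespace_tokens(text):
    # Word tokenization (whitespace based): punctuation stays attached.
    return text.split()

def punctuation_tokens(text):
    # Word tokenization (punctuation based): each punctuation mark
    # becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def sentence_tokens(text):
    # Naive sentence tokenization: split on whitespace that follows
    # a sentence-final punctuation mark.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(char_tokens("Today is Monday"))
print(whitespace_tokens("Hello, world!"))
print(punctuation_tokens("Hello, world!"))
print(sentence_tokens(
    "Tokenization is an important NLP task. "
    "It helps break down text into smaller units."
))
```

The two word tokenizers differ only in how they treat punctuation: the whitespace version returns "Hello," as one token, while the punctuation version separates "Hello" and ",".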

Subword Tokenization (frequently used words are kept whole; infrequently used
words are split into smaller units)

Input: unusual

Output: ["un", "usual"]
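
One simple way to get this behaviour is greedy longest-match segmentation against a subword vocabulary. The sketch below assumes a tiny hand-written vocabulary (`VOCAB` is hypothetical); real subword tokenizers learn their vocabulary from a corpus, which is not shown here.

```python
# Hypothetical vocabulary: frequent words and word pieces.
VOCAB = {"usual", "un", "token", "ization"}

def subword_tokens(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character.
            end = start + 1
        tokens.append(word[start:end])
        start = end
    return tokens

print(subword_tokens("usual", VOCAB))         # frequent word, kept whole
print(subword_tokens("unusual", VOCAB))       # split into known pieces
print(subword_tokens("tokenization", VOCAB))
```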

Morphological Process

Morphemes

Number Tokenization

Input: "She had 100 pencils"
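
A minimal sketch of number tokenization, assuming numbers are plain integers or decimals written with digits. The function names and the "NUM"/"WORD" labels are chosen for this example only.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"  # integer or simple decimal

def number_tokens(text):
    # Return only the numeric tokens found in the text.
    return re.findall(NUMBER, text)

def tokens_with_types(text):
    # Tag each whitespace-separated token as a number or a word.
    return [
        (tok, "NUM" if re.fullmatch(NUMBER, tok) else "WORD")
        for tok in text.split()
    ]

print(number_tokens("She had 100 pencils"))
print(tokens_with_types("She had 100 pencils"))
```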

LEXEMES

 A lexeme is the base or canonical form of a word (for example, "run" is the lexeme of "runs", "ran" and "running")
 The process of finding lexemes is known as lemmatization

MORPHEMES

 Words are formed by combining one or more morphemes
 The process of finding morphemes in text is known as morphological processing
 There are two types of morphemes:
1. Free morphemes
2. Bound morphemes

TYPOLOGY

 Typology refers to the classification of languages based on their structural and grammatical features
 The categories are:
1. Isolating (analytic) languages
2. Synthetic languages
3. Agglutinative languages

Issues and Challenges

 Irregularity
 Ambiguity
 Productivity
Irregularity
 When words or word formation follow regular patterns, this is regularity
 When words or word formation do not follow regular patterns, this is
irregularity

Ambiguity
 A word or word form can have more than one meaning when taken out of
context
 Word forms can look the same without having a unique meaning
 Ambiguity occurs during morphological processing
 Ambiguity can be
o Word sense ambiguity
 The meaning depends on the context
o Part-of-speech ambiguity
 The same word can belong to different parts of speech
o Structural ambiguity
 Multiple valid syntactic structures
o Referential ambiguity
 Unclear which person or noun is being referred to

Productivity
 Forming new words or word forms using productive rules
 Examples: person names, location names, organization names

Morphological Models
Morphological models are used to analyse the structure and formation of words.

There are five morphological models:

 Dictionary Lookup
 Finite state morphology
 Unification based morphology
 Functional morphology
 Morphological Induction

Morphemes: root word + affixes (prefix, infix, suffix)

Dictionary Lookup
Dictionary lookup includes:

 Searching the dictionary for the word's base (canonical) form
 Retrieving the associated information
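
Dictionary lookup can be sketched as a plain mapping from word forms to their canonical form and stored information. The entries in `DICTIONARY` below are hypothetical and exist only for illustration.

```python
# Hypothetical dictionary: word form -> (canonical form, information).
DICTIONARY = {
    "cats": ("cat", "noun, plural"),
    "mice": ("mouse", "noun, plural"),
    "ate": ("eat", "verb, past tense"),
}

def lookup(word):
    """Return (canonical form, information), or None if the form is unknown."""
    return DICTIONARY.get(word.lower())

print(lookup("Mice"))
print(lookup("dogs"))  # not listed, so the lookup fails
```

The failed lookup for "dogs" shows the main limitation of this model: every word form has to be listed in advance, so newly formed words are not covered.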

Finite state morphology


Based on formal language theory

Implemented using finite-state transducers (FSTs)

success                 stem
un + success            prefix + stem
success + ful           stem + suffix
un + success + ful      prefix + stem + suffix
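
The prefix, stem and suffix breakdown can be reproduced with a small affix-stripping sketch. It is not a finite-state implementation: it simply tries every known prefix and suffix and accepts a split when the remainder is a known stem. The `PREFIXES`, `SUFFIXES` and `STEMS` lists are assumptions made for this example.

```python
PREFIXES = ("un",)
SUFFIXES = ("ful",)
STEMS = {"success"}

def segment(word):
    """Return labelled morphemes for a word, or None if no analysis fits."""
    for prefix in ("",) + PREFIXES:
        for suffix in ("",) + SUFFIXES:
            if not (word.startswith(prefix) and word.endswith(suffix)):
                continue
            stem = word[len(prefix):len(word) - len(suffix)]
            if stem in STEMS:
                parts = [("prefix", prefix), ("stem", stem), ("suffix", suffix)]
                # Drop the empty prefix/suffix slots.
                return [(label, piece) for label, piece in parts if piece]
    return None

for w in ("success", "unsuccess", "successful", "unsuccessful"):
    print(w, segment(w))
```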
STEM CHANGES

Some irregular words require stem changes (for example, mouse → mice).

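[Figure: FST for stem changes. The regular noun d-o-g is followed by ε (the empty string), while the irregular pair mouse / mice requires letter-level changes inside the stem.]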

An FST has two tapes:

 Surface tape
 Lexical tape

Surface tape

c a t s

Lexical tape

c a t N Pl

An FST is formally defined as a 7-tuple.
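
The relation between the lexical tape and the surface tape can be illustrated with a very small transducer. This is a simplified sketch, not a full FST toolkit: the state names, the "+N", "+Sg" and "+Pl" tag notation, and the rule that letters are copied unchanged are all assumptions made for this example, and irregular forms such as mouse → mice are not handled.

```python
EPSILON = ""  # empty output

# (state, input symbol) -> (next state, output string)
TRANSITIONS = {
    ("stem", "+N"): ("tag", EPSILON),    # the category tag produces no output
    ("tag", "+Sg"): ("final", EPSILON),  # singular: nothing added
    ("tag", "+Pl"): ("final", "s"),      # plural: add "s" on the surface
}
FINAL_STATES = {"final"}

def generate(lexical):
    """Map lexical-tape symbols to a surface string, or None if rejected."""
    state, surface = "stem", []
    for symbol in lexical:
        if state == "stem" and len(symbol) == 1 and symbol.isalpha():
            surface.append(symbol)  # letters of the stem are copied as-is
            continue
        if (state, symbol) not in TRANSITIONS:
            return None
        state, output = TRANSITIONS[(state, symbol)]
        surface.append(output)
    return "".join(surface) if state in FINAL_STATES else None

print(generate(list("cat") + ["+N", "+Pl"]))  # lexical: c a t +N +Pl
print(generate(list("cat") + ["+N", "+Sg"]))
```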

MORPHEMES TYPES

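There are two basic types of morphemes:

o Free morphemes (can stand alone as words)
 Lexical
 example: book, run, happy
 Functional
 example: and, the, in
o Bound morphemes (must attach to another morpheme)
 Inflectional
 example: -s in cats, -ed in walked
 Derivational
 Class changing, e.g. -ness (happy → happiness, adjective to noun)
 Class maintaining, e.g. un- (happy → unhappy, still an adjective)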

Finding structure of Document

Segmentation is the process of chunking the input text or speech into blocks.

Types of segmentation

 Sentence boundary detection


o Optical character recognition
o Automatic speech recognition
 Topic boundary detection

Corpus → Documents/sentences → Words/tokens → Vocabulary

I met Dr.Xyz and he suggested some medicines.

What is the time now?
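
The first example shows why sentence boundary detection is harder than splitting on every period: the period in "Dr." does not end the sentence. Below is a minimal rule-based sketch, assuming a small hand-written abbreviation list (`ABBREVIATIONS` is hypothetical) and treating ".", "?" or "!" as a boundary only when it is followed by whitespace or the end of the text.

```python
import re

ABBREVIATIONS = {"dr", "mr", "mrs", "prof"}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.?!]", text):
        end = match.end()
        # Word immediately before the punctuation mark, if any.
        prev = re.search(r"(\w+)$", text[start:match.start()])
        is_abbreviation = (
            match.group() == "."
            and prev is not None
            and prev.group(1).lower() in ABBREVIATIONS
        )
        followed_by_gap = end == len(text) or text[end].isspace()
        if followed_by_gap and not is_abbreviation:
            sentences.append(text[start:end].strip())
            start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # text left over without final punctuation
    return sentences

print(split_sentences(
    "I met Dr.Xyz and he suggested some medicines. What is the time now?"
))
print(split_sentences("I met Dr. Xyz today. Fine."))
```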

Topic boundary detection


 Discourse segmentation / text segmentation
 The process of dividing speech or text into topically homogeneous blocks
is called topic segmentation
 Two ways (text segmentation)
1. By following headlines
2. By paragraph breaks
 Two ways (speech segmentation)
1. Pause duration
2. Speaker changes

METHODS for sentence boundary and topic boundary detection


1. Generative sequence classification method
2. Discriminative local classification method

Generative sequence classification method


 Observations: words & punctuations
 Labels: sentence boundary, topic boundary
The Hidden Markov Model (HMM) is a method of the generative approach.
How it works
 Learns from the data the relationship between observations and their
corresponding hidden states (for example, POS tags)
 Predicts the label sequence
 Classifies new sequences
Ex:
I Love Coding
She sells apples
the quick brown fox jumps over the lazy dog
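
The generative approach can be sketched with a tiny HMM decoded by the Viterbi algorithm. In this example the hidden states are boundary labels ("B" = a sentence boundary follows this token, "NB" = no boundary) rather than POS tags, and the observations are reduced to two classes, "word" and "punct". All probabilities below are invented for illustration; in practice they would be learned from labelled data.

```python
import math

STATES = ("NB", "B")
START = {"NB": 0.9, "B": 0.1}
TRANS = {"NB": {"NB": 0.8, "B": 0.2}, "B": {"NB": 0.9, "B": 0.1}}
EMIT = {
    "NB": {"word": 0.95, "punct": 0.05},
    "B": {"word": 0.10, "punct": 0.90},
}

def observe(token):
    # Reduce each token to a coarse observation class.
    return "punct" if token in {".", "?", "!"} else "word"

def viterbi(tokens):
    """Return the most probable label sequence for the tokens."""
    obs = [observe(t) for t in tokens]
    if not obs:
        return []
    # Log-probabilities avoid numerical underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    backpointers = []
    for o in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[best_prev] + math.log(TRANS[best_prev][s]) + math.log(EMIT[s][o])
            )
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final state.
    path = [max(STATES, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

tokens = "I love coding . She sells apples .".split()
print(list(zip(tokens, viterbi(tokens))))
```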
Discriminative local classification method
Local features: the word, its prefixes and suffixes, nearby POS tags
Labels: sentence boundary label, topic boundary label
Ex: maximum entropy Markov model (MEMM), support vector machine (SVM)

Applications:
1. POS tagging
2. Speech recognition
3. Named entity recognition

Complexity of approaches
 Quality
 Quantity
 Computational complexity
 Structural complexity
 Space
 Time
 Training
 Prediction

Performance of the approaches


 Precision
 Recall
 Accuracy
 F1 measure/F1 score

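Confusion matrix (n = 165)

                Predicted No    Predicted Yes    Total
Actual No       TN = 50         FP = 10          60
Actual Yes      FN = 5          TP = 100         105
Total           55              110              165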

Precision
When it predicts Yes, how often is it correct?
TP / Predicted Yes
100/110 = 90.9%

True Negative Rate:

When it is actually No, how often does it predict No?
TN / Actual No = 50/60 = 83.3%

True Positive Rate: (Recall/Sensitivity)

When it is actually Yes, how often does it predict Yes?
TP / Actual Yes
100/105 = 95.2%

Accuracy
How often is the classifier correct?

(TN + TP) / (TN + FP + FN + TP)
(50 + 100) / (50 + 10 + 5 + 100)
150/165 = 90.9%

Misclassification Rate:
Overall, how often is it wrong?

(FN + FP) / (TN + FP + FN + TP)
(5 + 10) / (50 + 10 + 5 + 100)
15/165 = 9.1%
PRECISION = Total number of correct positive predictions / Total number of
positive predictions = TP / (TP + FP)
RECALL = Total number of correct positive predictions / Total number of actual
positive instances = TP / (TP + FN)
ACCURACY = (TN + TP) / (TN + FP + FN + TP), i.e. how often the classifier is correct
F1 SCORE = (2 * precision * recall) / (precision + recall)
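
The formulas above can be checked against the worked numbers (TP = 100, TN = 50, FP = 10, FN = 5) with a short script. The function name is chosen for this example only.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the measures above from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "true_negative_rate": tn / (tn + fp),
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "f1": 2 * precision * recall / (precision + recall),
    }

metrics = classification_metrics(tp=100, tn=50, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy and misclassification rate always sum to 1, which is a quick sanity check on the counts.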
UNIT - II

Prerequisite: CFG (context-free grammar)

Syntax Analysis: Parsing Natural Language, Treebanks: A Data-Driven
Approach to Syntax, Representation of Syntactic Structure, Parsing Algorithms,
Models for Ambiguity Resolution in Parsing, Multilingual Issues

Chart parser
RegEx parser
Shift reduce parser
Recursive parser

Syntax Analysis / Syntactic Analysis

Syntax vs. Grammar

Return_type Function_name(parameters);
Function_name(parameters) return_type;
Function_name(parameter);

Return_type Function_name(parameters)

Ramu eats apple


Eats ramu apple
Tree

Parsing
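CFG (context-free grammar)

G = (N, T, P, S)

where N is the set of non-terminals, T is the set of terminals, P is the set of
production rules and S is the start symbol.

A → α
α ∈ (N ∪ T)*

A → B

S → NP VP
NP → {Article (Adj) N, PRO, PN}
VP → V NP (PP) (ADV)
PP → PREP NP
NP → N
NP → D N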

Ate ramu apple the


RAMU ATE THE APPLE
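
A small recursive-descent recognizer shows why "Ramu ate the apple" is accepted by such a grammar while "Ate ramu apple the" is not. The sketch covers only a subset of the rules (S → NP VP, NP → D N | N, VP → V NP), and the `LEXICON` that assigns a category to each word is an assumption made for this example.

```python
# Hypothetical lexicon: word -> category (N = noun, D = determiner, V = verb).
LEXICON = {"ramu": "N", "apple": "N", "the": "D", "ate": "V"}

def parse_np(tagged, i):
    """NP -> D N | N. Return (tree, next index) or None."""
    if i + 1 < len(tagged) and tagged[i][1] == "D" and tagged[i + 1][1] == "N":
        return ("NP", tagged[i], tagged[i + 1]), i + 2
    if i < len(tagged) and tagged[i][1] == "N":
        return ("NP", tagged[i]), i + 1
    return None

def parse_vp(tagged, i):
    """VP -> V NP. Return (tree, next index) or None."""
    if i < len(tagged) and tagged[i][1] == "V":
        np = parse_np(tagged, i + 1)
        if np:
            return ("VP", tagged[i], np[0]), np[1]
    return None

def parse_sentence(sentence):
    """S -> NP VP. Return a parse tree, or None if the sentence is rejected."""
    words = sentence.lower().split()
    if any(w not in LEXICON for w in words):
        return None
    tagged = [(w, LEXICON[w]) for w in words]
    np = parse_np(tagged, 0)
    if not np:
        return None
    vp = parse_vp(tagged, np[1])
    # Accept only if the VP consumes all remaining words.
    if vp and vp[1] == len(tagged):
        return ("S", np[0], vp[0])
    return None

print(parse_sentence("Ramu ate the apple"))
print(parse_sentence("Ate ramu apple the"))
```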

Brown, Switchboard

At/in the/at same/ap time/nn reaction/nn among/in anti-organization/jj
At the same time reaction among anti-organization
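
The word/tag format shown in the tagged line can be split into (word, tag) pairs with a few lines of Python. This is a minimal sketch for that slash-separated format only; the function name is chosen for this example.

```python
def parse_tagged(text):
    """Split 'word/tag' items into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the last slash, so slashes inside a word are preserved.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

tagged = "At/in the/at same/ap time/nn reaction/nn among/in anti-organization/jj"
pairs = parse_tagged(tagged)
print(pairs)
print(" ".join(word for word, _ in pairs))  # recover the plain text
```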
Penn Treebank
Ate ramu apple the
Representation of Syntactic Structure
Two types of approaches

 Phrase structure graph


o Example
 Dependency graph
o Example
