
Natural Language Processing
Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret, and manipulate human language.
Structured Language
• Human language: lacks a precisely defined structure
• Mathematics uses a well-defined structure
• Formal logic also uses a structured language
• Consider, for example, a mathematical or logical expression: its structure fully determines its meaning
Grammar
• Structured languages are easy for computers to parse and understand because they are defined by strict rules (a grammar)
• Violating these grammatical rules produces a syntax error


Unstructured Text
• Every language has defined grammatical rules
• What can a computer do to make sense of unstructured text?
• By processing words and phrases, computers can parse sentences
Natural Language Processing
• Context is everything
• Meaning / semantics
• Humans implicitly apply knowledge of the physical world
NLP Pipeline
• Text processing: take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction
• Feature extraction: similarly, the next stage needs to extract and produce feature representations that are appropriate for the type of model you plan to use and the NLP task you plan to achieve
• Modeling: design baseline statistical models
NLP Pipeline - 1. Text Processing
• Normalize case (upper and lower case become the same)
• Remove punctuation marks
• Remove common stop words such as a, an, the, of, are, etc.
NLP Pipeline - 2. Feature Extraction
• WordNet
• Statistical model
NLP Pipeline - 2. Feature Extraction
• Document-level tasks, such as spam detection and sentiment analysis, need a per-document representation
• For these you can use bag-of-words or doc2vec, as sketched below
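For illustration, here is a minimal doc2vec sketch using gensim; the toy reviews, tags, and hyperparameter values below are assumptions for the example, not from the slides.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus: each document gets a tag so the model learns one vector per document
    docs = ["this movie is very scary", "this movie is spooky and good"]
    tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

    model = Doc2Vec(tagged, vector_size=20, min_count=1, epochs=40)
    # Infer a fixed-length vector for a new (unseen) document
    print(model.infer_vector("a scary movie".split()))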
NLP Pipeline - 2. Feature Extraction
• Tasks that work with individual words or phrases, such as text generation or machine translation, need a word-level representation
• For these you can use word2vec
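A minimal word2vec sketch with gensim; the two tokenized sentences and the hyperparameters are illustrative assumptions.

    from gensim.models import Word2Vec

    # Toy tokenized corpus; a real model would be trained on far more text
    sentences = [["john", "eats", "food"], ["mary", "eats", "bread"]]
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)

    print(model.wv["eats"])               # dense word-level vector
    print(model.wv.most_similar("eats"))  # nearest words in the embedding space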
NLP Pipeline - 3. Modeling
• Designing a model
• Statistical model
• Machine learning model
Clean
Normalization
Case Normalization
Punctuation Removal
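As a sketch of these two cleaning steps in plain Python (the sample sentence is an assumption):

    import string

    text = "The first time you see it, The Movie may look boring!"
    text = text.lower()  # case normalization: treat upper and lower case the same
    # punctuation removal: delete every punctuation character
    text = text.translate(str.maketrans("", "", string.punctuation))
    print(text)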
Tokenization
Whitespace Tokenization
NLTK
• Natural Language Toolkit
Word Tokenization
Sentence Tokenization
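A short sketch contrasting the three tokenization styles with NLTK; the sample text is an assumption, and the punkt models must be downloaded once:

    import nltk
    nltk.download("punkt")  # one-time download of the tokenizer models
    from nltk.tokenize import word_tokenize, sent_tokenize

    text = "Dr. Smith arrived late. He was sorry."
    print(text.split())          # whitespace tokenization: naive split on spaces
    print(word_tokenize(text))   # word tokenization: separates punctuation properly
    print(sent_tokenize(text))   # sentence tokenization: one string per sentence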
Stop Word Removal
Uninformative words that don’t add a lot of meaning to the sentence.
Stop Words in English
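A minimal stop-word removal sketch using NLTK's built-in English stop word list (the token list is an assumption):

    import nltk
    nltk.download("stopwords")  # one-time download of the stop word lists
    from nltk.corpus import stopwords

    tokens = ["this", "is", "a", "sample", "sentence"]
    stops = set(stopwords.words("english"))
    print([t for t in tokens if t not in stops])  # keeps only informative words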
Part-of-Speech Tagging
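A small NLTK part-of-speech tagging sketch, reusing one of the "back" examples from the POS slides later in this deck:

    import nltk
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")  # NLTK's default POS tagger
    from nltk import pos_tag, word_tokenize

    # 'back' should be tagged as a verb (VB) in this context
    print(pos_tag(word_tokenize("promised to back the bill")))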
Named Entity Recognition
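A minimal named entity recognition sketch with NLTK's chunker; the example sentence is an assumption:

    import nltk
    for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
        nltk.download(pkg)  # resources needed by the tagger and NE chunker
    from nltk import word_tokenize, pos_tag, ne_chunk

    # ne_chunk labels entities such as PERSON, ORGANIZATION, GPE
    tree = ne_chunk(pos_tag(word_tokenize("Mark works at Google in California.")))
    print(tree)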
Stemming and Lemmatization
Stemming
Lemmatization
It uses a dictionary to map words to their root form.
Lemmatization with PoS
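A sketch contrasting stemming and lemmatization in NLTK, reusing the sang/sung/sing example from the Key Concepts slide; the other sample words are assumptions:

    import nltk
    nltk.download("wordnet")  # dictionary used by the lemmatizer
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("branching"))              # crude suffix stripping -> 'branch'
    print(lemmatizer.lemmatize("ones"))           # dictionary lookup -> 'one'
    # With a PoS hint the lemmatizer can map verb forms to their root
    print(lemmatizer.lemmatize("sung", pos="v"))  # -> 'sing'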
NLP Pipeline
• Web Scraping example
• https://github.com/Rafia-Shaikh-eng/Data-Science-Analytics/blob/9a3f8cd9d0db34f9bf279f1877e7456e73630c61/Wikipedia_WebScraping.ipynb
• Text processing example
Key Concepts
• Corpus: a computer-readable collection of text or speech
  e.g., "I do uh main- mainly data processing." How many words? 7? 8?
• Regular expressions: a language for specifying text search strings
• Lemmatization: the task of determining whether two words have the same root
  e.g., sang and sung are forms of the verb sing
• Normalization: converting words to a standard form, e.g., USA or U.S.A
• Tokenization: splitting a string of text into a list of tokens
  e.g., "This is FAST University" becomes ["This", "is", "FAST", "University"]
Key Concepts (continued)
• Edit distance: a measure of how similar two strings are, based on the number of edits (insertions, deletions, substitutions) needed to turn one into the other
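For illustration, NLTK ships a Levenshtein edit distance implementation; "intention" vs. "execution" is the classic textbook pair, five single-character edits apart:

    from nltk.metrics import edit_distance

    # insertions, deletions and substitutions each cost 1 by default
    print(edit_distance("intention", "execution"))  # -> 5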
Language Modeling
Models that assign probabilities to sequences of words are called language models.
N-gram Language Model
An n-gram is a sequence of n words.
Example: predict the next word from the words in the vocabulary.
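A quick sketch of extracting n-grams with NLTK, using the corpus sentence from the Key Concepts slide:

    from nltk.util import ngrams

    tokens = "I do mainly data processing".split()
    print(list(ngrams(tokens, 2)))  # bigrams: ('I', 'do'), ('do', 'mainly'), ...
    print(list(ngrams(tokens, 3)))  # trigrams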
Smoothing
Problem: n-grams made of known words may still be missing from the training corpus, e.g. "John" and "eats" both occur, but the bigram "John eats" does not.

Solution (Laplace smoothing): add one to all bigram counts.

Formula: P(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)

where: c = count of a word or bigram in the corpus
       V = vocabulary size
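A small sketch of the add-one formula above in plain Python; the toy corpus is an assumption, chosen so that "john" and "eats" both occur but the bigram "john eats" does not:

    from collections import Counter

    corpus = "john likes food mary eats food".split()  # toy training corpus
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    V = len(unigrams)  # vocabulary size

    def p_laplace(prev, word):
        # P(word | prev) = (c(prev, word) + 1) / (c(prev) + V)
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

    # Non-zero even though 'john eats' never appears in the corpus
    print(p_laplace("john", "eats"))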
Evaluating Language Models
• Extrinsic evaluation
  - Embed the model in an application and see how much the application's performance improves
• Intrinsic evaluation
  - Measure quality directly on a held-out training/test split
• Perplexity: the inverse probability of the test set, normalized by the number of words. For bigrams:

  PP(W) = (∏ 1 / P(wi | wi-1))^(1/N)

  Minimizing perplexity is the same as maximizing probability.
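A tiny numeric sketch of the perplexity formula; the bigram probabilities are made-up values for illustration:

    import math

    probs = [0.2, 0.5, 0.1, 0.4]  # assumed P(wi | wi-1) for each test-set position
    N = len(probs)
    perplexity = math.prod(1 / p for p in probs) ** (1 / N)
    print(perplexity)  # lower perplexity means higher test-set probability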


Bag of words
A bag-of-words model is a way of extracting features from text for use in modeling. It involves:

1) A vocabulary of known words

2) A measure of the presence of known words

The normal text cleaning applies first: ignoring case, removing stop words, reducing words to their stems.


Bag of words vector
We can represent a string as a bag-of-words vector.

• Review 1: This movie is very scary and long
• Review 2: This movie is not scary and is slow
• Review 3: This movie is spooky and good

Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]
Vector of Review 2: [1 1 2 0 0 1 1 0 1 0 0]
Vector of Review 3: [1 1 1 0 0 0 1 0 0 1 1]
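The same three reviews can be vectorized with scikit-learn's CountVectorizer; note that its vocabulary is ordered alphabetically, so the columns may be permuted relative to the vectors above:

    from sklearn.feature_extraction.text import CountVectorizer

    reviews = [
        "This movie is very scary and long",
        "This movie is not scary and is slow",
        "This movie is spooky and good",
    ]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(reviews)
    print(vectorizer.get_feature_names_out())  # the learned vocabulary
    print(X.toarray())                         # one count vector per review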
Naïve Bayes for text classification
What is the need for text classification?
Answer: spam/not-spam filtering, positive/negative movie reviews, sentiment analysis, etc.

For a document d and class c, choose the class that maximizes:

  ĉ = argmax_c P(d|c) P(c) / P(d)

Example class-conditional word probabilities:

  w      P(w|+)    P(w|-)
  I      0.1       0.2
  love   0.1       0.001
  this   0.01      0.01
  fun    0.05      0.005
  film   0.1       0.1

  Probability: 0.0000005 (positive) vs. 0.000000001 (negative)
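The table appears to score the document "I love this fun film": multiplying each column reproduces the two probabilities shown (priors omitted). A sketch of that computation:

    # Class-conditional word probabilities copied from the table above
    p_pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
    p_neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

    doc = ["I", "love", "this", "fun", "film"]
    score_pos = score_neg = 1.0
    for w in doc:
        score_pos *= p_pos[w]  # naive Bayes: multiply per-word likelihoods
        score_neg *= p_neg[w]

    print(score_pos)  # ~5e-07  -> the 0.0000005 in the table
    print(score_neg)  # ~1e-09  -> the 0.000000001 in the table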
Part of Speech (POS) tagging
• It is defined as the process of assigning one of the parts of speech to a given word. For example, the word back:
  - The back door => JJ (Adjective)
  - On my back => NN (Noun)
  - Win the voters back => RB (Adverb)
  - Promised to back the bill => VB (Verb)
• A POS tagger takes in a phrase or sentence and assigns the most probable part of speech. Common types of POS tagging:
• Rule-based
  e.g., "An apple", "An octopus": an article always precedes a noun
• Statistics-based: a text corpus is used to derive probabilities.
  Common models include n-gram models, Hidden Markov Models, and Maximum Entropy Models
• Transformation-based tagging: a combination of the rule-based and statistics-based approaches
Further Reading
Speech and Language Processing, 3rd Edition, by Daniel Jurafsky & James H. Martin
