
Natural Language Processing
Natural language processing (NLP) is a branch of artificial intelligence that helps computers understand, interpret, and manipulate human language.
Structured Language
• Human language: lacks a precisely defined structure
• Mathematics uses a well-defined structure
• Formal logic also uses a structured language
• Consider, for example, a mathematical or logical expression: its structure fully determines its meaning
Grammar
• Structured languages are easy for computers to parse and understand because they are defined by strict rules (a grammar)
• Violating these grammatical rules produces a syntax error


Unstructured Text
• Every language has defined grammatical rules
• What can a computer do to make sense of unstructured text?
• By processing words and phrases, computers can parse sentences
Natural Language Processing
• Context is everything
• Meaning / semantics
• Humans implicitly apply knowledge of the physical world
NLP Pipeline
• Text processing: take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction
• Feature extraction: similarly, the next stage needs to extract and produce feature representations that are appropriate for the type of model you plan to use and the NLP task you plan to achieve
• Modeling: design baseline statistical models
NLP Pipeline - 1. Text Processing
• Normalize case (upper and lower case become the same)
• Remove punctuation marks
• Remove common stop words such as a, an, the, of, are, etc.
NLP Pipeline - 2. Feature Extraction
• WordNet
• Statistical model
NLP Pipeline - 2. Feature Extraction
• Document-level tasks, such as spam detection and sentiment analysis, need a per-document representation
• For these you can use bag-of-words or doc2vec, as sketched below
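For illustration, here is a minimal doc2vec sketch using gensim; the toy reviews, tags, and hyperparameter values below are assumptions for the example, not from the slides.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus: each document gets a tag so the model learns one vector per document
    docs = ["this movie is very scary", "this movie is spooky and good"]
    tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

    model = Doc2Vec(tagged, vector_size=20, min_count=1, epochs=40)
    # Infer a fixed-length vector for a new (unseen) document
    print(model.infer_vector("a scary movie".split()))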
NLP Pipeline - 2. Feature Extraction
• Tasks that work with individual words or phrases, such as text generation or machine translation, need a word-level representation
• For these you can use word2vec
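A minimal word2vec sketch with gensim; the two tokenized sentences and the hyperparameters are illustrative assumptions.

    from gensim.models import Word2Vec

    # Toy tokenized corpus; a real model would be trained on far more text
    sentences = [["john", "eats", "food"], ["mary", "eats", "bread"]]
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)

    print(model.wv["eats"])               # dense word-level vector
    print(model.wv.most_similar("eats"))  # nearest words in the embedding space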
NLP Pipeline - 3. Modeling
• Designing a model
• Statistical model
• Machine learning model
Clean
Normalization
Case Normalization
Punctuation Removal
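As a sketch of these two cleaning steps in plain Python (the sample sentence is an assumption):

    import string

    text = "The first time you see it, The Movie may look boring!"
    text = text.lower()  # case normalization: treat upper and lower case the same
    # punctuation removal: delete every punctuation character
    text = text.translate(str.maketrans("", "", string.punctuation))
    print(text)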
Tokenization
Whitespace Tokenization
NLTK
• Natural Language Toolkit
Word Tokenization
Sentence Tokenization
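A short sketch contrasting the three tokenization styles with NLTK; the sample text is an assumption, and the punkt models must be downloaded once:

    import nltk
    nltk.download("punkt")  # one-time download of the tokenizer models
    from nltk.tokenize import word_tokenize, sent_tokenize

    text = "Dr. Smith arrived late. He was sorry."
    print(text.split())          # whitespace tokenization: naive split on spaces
    print(word_tokenize(text))   # word tokenization: separates punctuation properly
    print(sent_tokenize(text))   # sentence tokenization: one string per sentence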
Stop Word Removal
Uninformative words that don’t add a lot of meaning to the sentence.
Stop Words in English
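A minimal stop-word removal sketch using NLTK's built-in English stop word list (the token list is an assumption):

    import nltk
    nltk.download("stopwords")  # one-time download of the stop word lists
    from nltk.corpus import stopwords

    tokens = ["this", "is", "a", "sample", "sentence"]
    stops = set(stopwords.words("english"))
    print([t for t in tokens if t not in stops])  # keeps only informative words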
Part-of-Speech Tagging
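A small NLTK part-of-speech tagging sketch, reusing one of the "back" examples from the POS slides later in this deck:

    import nltk
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")  # NLTK's default POS tagger
    from nltk import pos_tag, word_tokenize

    # 'back' should be tagged as a verb (VB) in this context
    print(pos_tag(word_tokenize("promised to back the bill")))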
Named Entity Recognition
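A minimal named entity recognition sketch with NLTK's chunker; the example sentence is an assumption:

    import nltk
    for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
        nltk.download(pkg)  # resources needed by the tagger and NE chunker
    from nltk import word_tokenize, pos_tag, ne_chunk

    # ne_chunk labels entities such as PERSON, ORGANIZATION, GPE
    tree = ne_chunk(pos_tag(word_tokenize("Mark works at Google in California.")))
    print(tree)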
Stemming and Lemmatization
Stemming
Lemmatization
It uses a dictionary to map words to their root form.
Lemmatization with PoS
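A sketch contrasting stemming and lemmatization in NLTK, reusing the sang/sung/sing example from the Key Concepts slide; the other sample words are assumptions:

    import nltk
    nltk.download("wordnet")  # dictionary used by the lemmatizer
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    print(stemmer.stem("branching"))              # crude suffix stripping -> 'branch'
    print(lemmatizer.lemmatize("ones"))           # dictionary lookup -> 'one'
    # With a PoS hint the lemmatizer can map verb forms to their root
    print(lemmatizer.lemmatize("sung", pos="v"))  # -> 'sing'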
NLP Pipeline
• Web Scraping example
• https://github.com/Rafia-Shaikh-eng/Data-Science-Analytics/blob/9a3f8cd9d0db34f9bf279f1877e7456e73630c61/Wikipedia_WebScraping.ipynb
• Text processing example
Key Concepts
• Corpus: a computer-readable collection of text or speech
  e.g., "I do uh main- mainly data processing." How many words? 7? 8?
• Regular expressions: a language for specifying text search strings
• Lemmatization: the task of determining whether two words have the same root
  e.g., sang and sung are forms of the verb sing
• Normalization: converting words to a standard form, e.g., USA or U.S.A
• Tokenization: splitting a string of text into a list of tokens
  e.g., "This is FAST University" becomes ["This", "is", "FAST", "University"]
Key Concepts (continued)
• Edit distance: a measure of how similar two strings are, based on the number of edits (insertions, deletions, substitutions) needed to turn one into the other
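For illustration, NLTK ships a Levenshtein edit distance implementation; "intention" vs. "execution" is the classic textbook pair, five single-character edits apart:

    from nltk.metrics import edit_distance

    # insertions, deletions and substitutions each cost 1 by default
    print(edit_distance("intention", "execution"))  # -> 5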
Language Modeling
Models that assign probabilities to sequences of words are called language models.
N-gram Language Model
An n-gram is a sequence of n words.
Example: predict the next word from the words in the vocabulary.
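A quick sketch of extracting n-grams with NLTK, using the corpus sentence from the Key Concepts slide:

    from nltk.util import ngrams

    tokens = "I do mainly data processing".split()
    print(list(ngrams(tokens, 2)))  # bigrams: ('I', 'do'), ('do', 'mainly'), ...
    print(list(ngrams(tokens, 3)))  # trigrams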
Smoothing
Problem: n-grams made of known words may still be missing from the training corpus, e.g. "John" and "eats" both occur, but the bigram "John eats" does not.

Solution (Laplace smoothing): add one to all bigram counts.

Formula: P(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)

where: c = count of a word or bigram in the corpus
       V = vocabulary size
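A small sketch of the add-one formula above in plain Python; the toy corpus is an assumption, chosen so that "john" and "eats" both occur but the bigram "john eats" does not:

    from collections import Counter

    corpus = "john likes food mary eats food".split()  # toy training corpus
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    V = len(unigrams)  # vocabulary size

    def p_laplace(prev, word):
        # P(word | prev) = (c(prev, word) + 1) / (c(prev) + V)
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

    # Non-zero even though 'john eats' never appears in the corpus
    print(p_laplace("john", "eats"))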
Evaluating Language Models
• Extrinsic evaluation
  - Embed the model in an application and see how much the application's performance improves
• Intrinsic evaluation
  - Measure quality directly on a held-out training/test split
• Perplexity: the inverse probability of the test set, normalized by the number of words. For bigrams:

  PP(W) = (∏ 1 / P(wi | wi-1))^(1/N)

  Minimizing perplexity is the same as maximizing probability.
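A tiny numeric sketch of the perplexity formula; the bigram probabilities are made-up values for illustration:

    import math

    probs = [0.2, 0.5, 0.1, 0.4]  # assumed P(wi | wi-1) for each test-set position
    N = len(probs)
    perplexity = math.prod(1 / p for p in probs) ** (1 / N)
    print(perplexity)  # lower perplexity means higher test-set probability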


Bag of words
A bag-of-words model is a way of extracting features from text for use in modeling. It involves:

1) A vocabulary of known words

2) A measure of the presence of known words

The normal text cleaning applies first: ignoring case, removing stop words, reducing words to their stems.


Bag of words vector
We can represent a string as a bag-of-words vector.

• Review 1: This movie is very scary and long
• Review 2: This movie is not scary and is slow
• Review 3: This movie is spooky and good

Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]
Vector of Review 2: [1 1 2 0 0 1 1 0 1 0 0]
Vector of Review 3: [1 1 1 0 0 0 1 0 0 1 1]
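The same three reviews can be vectorized with scikit-learn's CountVectorizer; note that its vocabulary is ordered alphabetically, so the columns may be permuted relative to the vectors above:

    from sklearn.feature_extraction.text import CountVectorizer

    reviews = [
        "This movie is very scary and long",
        "This movie is not scary and is slow",
        "This movie is spooky and good",
    ]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(reviews)
    print(vectorizer.get_feature_names_out())  # the learned vocabulary
    print(X.toarray())                         # one count vector per review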
Naïve Bayes for text classification
What is the need for text classification?
Answer: spam/not-spam filtering, positive/negative movie reviews, sentiment analysis, etc.

For a document d and class c, choose the class that maximizes:

  ĉ = argmax_c P(d|c) P(c) / P(d)

Example class-conditional word probabilities:

  w      P(w|+)    P(w|-)
  I      0.1       0.2
  love   0.1       0.001
  this   0.01      0.01
  fun    0.05      0.005
  film   0.1       0.1

  Probability: 0.0000005 (positive) vs. 0.000000001 (negative)
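The table appears to score the document "I love this fun film": multiplying each column reproduces the two probabilities shown (priors omitted). A sketch of that computation:

    # Class-conditional word probabilities copied from the table above
    p_pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
    p_neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}

    doc = ["I", "love", "this", "fun", "film"]
    score_pos = score_neg = 1.0
    for w in doc:
        score_pos *= p_pos[w]  # naive Bayes: multiply per-word likelihoods
        score_neg *= p_neg[w]

    print(score_pos)  # ~5e-07  -> the 0.0000005 in the table
    print(score_neg)  # ~1e-09  -> the 0.000000001 in the table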
Part of Speech (POS) tagging
• It is defined as the process of assigning one of the parts of speech to a given word. For example, the word back:
  - The back door => JJ (Adjective)
  - On my back => NN (Noun)
  - Win the voters back => RB (Adverb)
  - Promised to back the bill => VB (Verb)
• A POS tagger takes in a phrase or sentence and assigns the most probable part of speech. Common types of POS tagging:
• Rule-based
  e.g., "An apple", "An octopus": an article always precedes a noun
• Statistics-based: a text corpus is used to derive probabilities.
  Common models include n-gram models, Hidden Markov Models, and Maximum Entropy Models
• Transformation-based tagging: a combination of the rule-based and statistics-based approaches
Further Reading
Speech and Language Processing, 3rd Edition, by Daniel Jurafsky & James H. Martin
