NLP Intro
Natural Language Processing
Natural language processing (NLP) is a branch of artificial intelligence
that helps computers understand, interpret and manipulate human
language.
Structured Language
• Human language: lack of a precisely defined structure
• Mathematics uses a precisely defined structure
• Meaning / semantics
• Humans implicitly apply knowledge of the physical world
NLP Pipeline
• Text processing: take raw input text, clean it, normalize it,
and convert it into a form that is suitable for feature extraction
• Feature extraction: similarly, the next stage needs to extract and
produce feature representations that are appropriate for the type of
model you plan to use and the NLP task you plan to accomplish
• Modeling: design baseline statistical models
NLP Pipeline - 1. Text Processing
• Convert all text to the same case (upper or lower case)
• Remove punctuation marks
• Remove some common words such as a, the, an, of, are, etc.
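The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production cleaner: the function name `preprocess` and the tiny `STOP_WORDS` set (taken from the words listed above) are assumptions made for this example.

```python
import string

# Hypothetical, deliberately tiny stop-word list built from the examples above.
STOP_WORDS = {"a", "an", "the", "of", "are"}

def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = text.lower()
    # Delete every ASCII punctuation character.
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = preprocess("The first steps of an NLP pipeline are simple!")
print(tokens)  # ['first', 'steps', 'nlp', 'pipeline', 'simple']
```

Real stop-word lists are much longer, and the right list depends on the task.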
NLP Pipeline - 2. Feature Extraction
• WordNet
Statistical model
NLP Pipeline - 2. Feature Extraction
• Document-level tasks
• Such as spam detection
• Sentiment analysis
• Per-document representation: bag of words or doc2vec
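A bag-of-words representation can be sketched with the standard library alone. The helper name `bag_of_words` and the three toy documents are made up for this illustration; the vocabulary order (first appearance) is also just one possible convention.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, vectors) with one count vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # Build the vocabulary in order of first appearance across all documents.
    vocab = []
    for tokens in tokenized:
        for tok in tokens:
            if tok not in vocab:
                vocab.append(tok)
    # Each document becomes a vector of word counts over that vocabulary.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

# Hypothetical toy documents, not taken from the slides.
docs = ["free prize click now", "meeting moved to now", "click click free"]
vocab, vectors = bag_of_words(docs)
print(vocab)
for vec in vectors:
    print(vec)
```

Word order is discarded: each document is described only by how often each vocabulary word occurs in it.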
NLP Pipeline - 2. Feature Extraction
• Work with individual words or phrases
• Such as text generation or machine translation
• You need a word-level representation
• You can use word2vec
NLP Pipeline - 3. Modeling
• Designing a model
• Statistical model
• Machine learning model
Clean
Normalization
Case Normalization
Punctuation Removal
Tokenization
Whitespace Tokenization
NLTK
• Natural Language Toolkit
Word Tokenization
Sentence Tokenization
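NLTK provides `word_tokenize` and `sent_tokenize` for these steps. To avoid depending on NLTK and its downloaded tokenizer data, the sketch below uses only the standard `re` module as a rough stand-in. The regular expressions are simplifying assumptions and will mishandle cases such as abbreviations ("Dr.").

```python
import re

text = "NLP is fun. Is it hard? Not really!"

# Whitespace tokenization: punctuation stays attached to the words.
whitespace_tokens = text.split()

# Rough word tokenization: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Rough sentence tokenization: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

print(whitespace_tokens)
print(word_tokens)
print(sentences)
```

Comparing the first two lists shows why whitespace splitting alone is usually not enough: "fun." and "fun" would otherwise count as different tokens.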
Stop Word Removal
Uninformative words that don’t add a lot of meaning to the sentence.
Stop Words in English
Part-of-Speech Tagging
Named Entity Recognition
Stemming And Lemmatization
Stemming
Lemmatization
It uses a dictionary to map words to their base form (lemma).
Lemmatization with PoS
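The difference between stemming and dictionary-based lemmatization can be illustrated with a toy sketch. Everything here is hypothetical: `crude_stem` strips only a few suffixes (it is not the Porter stemmer), and `LEMMA_DICT` is a five-entry stand-in for a real lexical resource such as WordNet.

```python
def crude_stem(word):
    """Strip a few common suffixes; the result may not be a real word."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical miniature dictionary keyed by (word, part of speech).
LEMMA_DICT = {
    ("was", "v"): "be",
    ("mice", "n"): "mouse",
    ("started", "v"): "start",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}

def lemmatize(word, pos="n"):
    """Look the word up for the given part of speech; fall back to the word."""
    return LEMMA_DICT.get((word, pos), word)

print(crude_stem("studies"))      # studi (not a dictionary word)
print(crude_stem("mice"))         # mice (no rule applies)
print(lemmatize("mice"))          # mouse
print(lemmatize("meeting", "v"))  # meet
print(lemmatize("meeting", "n"))  # meeting
```

The last two calls show why the part of speech matters: the same surface form maps to different lemmas depending on whether it is used as a verb or a noun.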
NLP Pipeline
• Web Scraping example
• https://github.com/Rafia-Shaikh-eng/Data-Science-Analytics/blob/9a3f8cd9d0db34f9bf279f1877e7456e73630c61/Wikipedia_WebScraping.ipynb
• Text processing example
Key Concepts
• Corpus: a computer-readable collection of text or speech
E.g.: I do uh main- mainly data processing. How many words? 7? 8?
• Intrinsic evaluation
- Training / test set
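For the corpus example above, the word count depends on whether fillers ("uh") and broken-off fragments ("main-") are counted as words. The sketch below shows both counts; the `FILLERS` set and the rule that a trailing hyphen marks a fragment are assumptions made for this illustration.

```python
utterance = "I do uh main- mainly data processing"
tokens = utterance.split()

# Assumption: a small filler list, and a trailing hyphen marks a fragment.
FILLERS = {"uh", "um"}

def is_disfluency(token):
    return token.lower() in FILLERS or token.endswith("-")

fluent_tokens = [tok for tok in tokens if not is_disfluency(tok)]

print(len(tokens))         # 7 when every token is counted
print(len(fluent_tokens))  # 5 when the filler and the fragment are dropped
```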
Vector of Review 1: [1 1 1 1 1 1 1 0 0 0 0]
Vector of Review 2: [1 1 2 0 0 1 1 0 1 0 0]
Vector of Review 3: [1 1 1 0 0 0 1 0 0 1 1]
Naïve Bayes for text classification
What is the need for text classification?
Answer: spam / not spam, positive / negative movie reviews, sentiment
analysis, etc.
For a document d and class c

Formula: P(c|d) = P(d|c) P(c) / P(d)
Predicted class: argmax over c of P(d|c) P(c), since P(d) is the same
for every class

w            P(w|+)      P(w|-)
I            0.1         0.2
love         0.1         0.001
this         0.01        0.01
fun          0.05        0.005
film         0.1         0.1
Probability  0.0000005   0.000000001
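The "Probability" row of the table can be reproduced by multiplying the per-word likelihoods for each class. The sketch below does that and then picks the class with the larger score. The equal priors are an assumption, since the table gives no class priors, and the variable names are chosen for this example.

```python
import math

# Per-word likelihoods P(w | class), copied from the table above.
likelihoods = {
    "+": {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1},
    "-": {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1},
}
# Assumption: equal class priors, because the table does not list any.
priors = {"+": 0.5, "-": 0.5}

document = ["I", "love", "this", "fun", "film"]

# Naive Bayes treats words as independent given the class,
# so P(d | c) is the product of the individual P(w | c).
likelihood = {
    cls: math.prod(table[word] for word in document)
    for cls, table in likelihoods.items()
}

# P(d) is the same for both classes, so it can be ignored in the argmax.
scores = {cls: priors[cls] * likelihood[cls] for cls in likelihood}
predicted = max(scores, key=scores.get)

print(likelihood)  # approximately {'+': 5e-07, '-': 1e-09}
print(predicted)   # +
```

In practice the products are usually computed as sums of log probabilities to avoid numerical underflow on longer documents.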
Part of Speech (POS) tagging
• Rule-based
An apple, an octopus: an article always precedes a noun
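The rule above can be turned into a toy tagger. This is only a sketch of the single rule: the function name and the tag labels (`DET`, `NOUN`, `UNK`) are chosen for this example, and the rule mis-tags phrases such as "an old man", where an adjective follows the article.

```python
ARTICLES = {"a", "an", "the"}

def tag_with_article_rule(tokens):
    """Tag articles as DET and the word right after an article as NOUN."""
    tags = []
    for i, tok in enumerate(tokens):
        if tok.lower() in ARTICLES:
            tags.append((tok, "DET"))
        elif i > 0 and tokens[i - 1].lower() in ARTICLES:
            tags.append((tok, "NOUN"))
        else:
            tags.append((tok, "UNK"))  # this one rule cannot decide
    return tags

tagged = tag_with_article_rule(["An", "apple", "fell"])
print(tagged)  # [('An', 'DET'), ('apple', 'NOUN'), ('fell', 'UNK')]
```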