
Natural Language Processing

Mahmmoud Mahdi
Introduction
Introduction to NLP

● What is NLP?
● How does it relate to ML and technology?
● What are the typical tasks?
What’s it all about?

speak, read, write, plan, think

All in natural language


What is Natural Language Processing?

In a way, if we unlock the key to understanding how language works, we unlock the key to understanding how the human brain works.
Where NLP Stands
Background
Corpora, Tokens, and Types

● All NLP methods, be they classic or modern, begin with a text dataset, also called a corpus (plural: corpora).
● A corpus usually contains raw text and any metadata associated with the text. The raw text is a sequence of characters (bytes), but most of the time it is useful to group those characters into contiguous units called tokens.
● The process of breaking a text down into tokens is called tokenization.
● Types are the unique tokens present in a corpus. The set of all types in a corpus is its vocabulary.
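The token/type distinction above can be sketched in a few lines of Python. The tokenizer here is deliberately simplistic (split on non-alphanumeric characters); real tokenizers handle contractions, URLs, and much more:

```python
import re

def tokenize(text):
    # Lowercase and split on runs of non-alphanumeric characters.
    return [t for t in re.split(r"\W+", text.lower()) if t]

corpus = "the cat sat on the mat"
tokens = tokenize(corpus)     # 6 tokens
vocabulary = set(tokens)      # 5 types: the duplicate "the" collapses

print(len(tokens), len(vocabulary))  # 6 5
```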
In machine learning parlance, the text along with its metadata is called an instance or data point. The corpus, a collection of instances, is also known as a dataset.
Feature engineering

The process of understanding the linguistics of a language and applying it to solving NLP problems is called feature engineering.
Unigrams, Bigrams, Trigrams, …, N-grams

● N-grams are fixed-length (n) consecutive token sequences occurring in the text.
● A bigram has two tokens, a unigram one. Generating n-grams from a text is straightforward.
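Generating n-grams really is a one-liner over a sliding window; a minimal sketch (the function name is illustrative):

```python
def ngrams(tokens, n):
    # Slide a window of length n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["mary", "slapped", "the", "witch"]
print(ngrams(tokens, 2))
# [('mary', 'slapped'), ('slapped', 'the'), ('the', 'witch')]
```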
Lemmas and Stems

Lemmas are the root forms of words.

Consider the verb fly.

● It can be inflected into many different words: flew, flies, flown, flying, and so on.
● fly is the lemma for all of these seemingly different words.

This reduction is called lemmatization.


Lemmas and Stems

Stemming is the poor man’s lemmatization. It involves the use of handcrafted rules to strip the endings of words, reducing them to a common form called stems.
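A toy illustration of the rule-based idea in Python; the suffix list below is invented for the example, whereas a real stemmer such as the Porter stemmer uses a much larger, carefully ordered rule set:

```python
def crude_stem(word):
    # Strip a few common suffixes, keeping at least 3 characters of stem.
    # This is only a sketch of handcrafted-rule stemming.
    for suffix in ("ing", "ed", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["dancing", "danced", "dancer"]])
# ['danc', 'danc', 'danc']
```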
Categorizing Sentences and Documents

Typical problems (supervised document classification):

● Assigning topic labels,
● Predicting the sentiment of reviews,
● Filtering spam emails,
● Language identification, and
● Email triaging.
Categorizing Words: POS Tagging

A common example of categorizing words is part-of-speech (POS) tagging.
Categorizing Spans: Chunking and Named Entity Recognition

We might want to identify the noun phrases (NP) and verb phrases (VP) in text. This is called chunking or shallow parsing. Shallow parsing aims to derive higher-order units composed of the grammatical atoms, like nouns, verbs, adjectives, and so on.

A named entity is a string mention of a real-world concept like a person, location, organization, drug name, and so on.
Structure of Sentences

Whereas shallow parsing identifies phrasal units, the task of identifying the relationship between them is called parsing.
NLP with Classification and Vector Spaces

1. Sentiment Analysis with Logistic Regression


2. Sentiment Analysis with Naïve Bayes
3. Vector Space Models
4. Machine Translation and Document Search
Sentiment Analysis with Logistic Regression
Types of Machine Learning
Supervised ML (training)
Sentiment Analysis
Vocabulary
Feature Extraction
Problem with sparse representations
Positive and negative counts
Word frequency in classes
Feature extraction

Freqs: dictionary mapping from (word, class) to frequency
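A minimal sketch of building such a frequency dictionary, assuming class 1 means positive, class 0 means negative, and naive whitespace tokenization:

```python
from collections import defaultdict

def build_freqs(tweets, labels):
    # Map (word, class) pairs to how often the word appears
    # in tweets of that class.
    freqs = defaultdict(int)
    for tweet, y in zip(tweets, labels):
        for word in tweet.lower().split():
            freqs[(word, y)] += 1
    return freqs

freqs = build_freqs(["great movie", "great fun", "sad movie"], [1, 1, 0])
print(freqs[("great", 1)], freqs[("movie", 0)])  # 2 1
```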


Feature Extraction
Preprocessing

● Eliminate handles and URLs.
● Tokenize the string into words.
● Remove stop words like "and", "is", "a", "on", etc.
● Stem, i.e. convert every word to its stem: dancer, dancing, and danced all become "danc". You can use the Porter stemmer for this.
● Convert all your words to lowercase.
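The steps above can be sketched as one small pipeline; the stop-word list and the "ing"-stripping stand-in for the Porter stemmer are simplifications for the example:

```python
import re

# Tiny illustrative stop-word list; NLTK ships a much fuller one.
STOP_WORDS = {"and", "is", "a", "on", "the", "at"}

def preprocess(tweet):
    tweet = re.sub(r"https?://\S+", "", tweet)      # remove URLs
    tweet = re.sub(r"@\w+", "", tweet)              # remove handles
    tokens = re.findall(r"[a-z]+", tweet.lower())   # tokenize + lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # crude stemming stand-in for the Porter stemmer:
    return [t[:-3] if t.endswith("ing") else t for t in tokens]

print(preprocess("@bob Dancing at the party is fun! http://t.co/xyz"))
# ['danc', 'party', 'fun']
```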
Preprocessing: stop words and punctuation
Preprocessing: stemming and lowercasing
General Overview

● Your feature matrix X is of dimension (m, 3), as follows.
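A sketch of mapping one tweet to such a 3-dimensional row [bias, positive count sum, negative count sum], assuming a freqs dictionary keyed by (word, class) with class 1 positive and class 0 negative:

```python
def extract_features(tweet_tokens, freqs):
    # One row of X: [bias, sum of positive counts, sum of negative counts].
    pos = sum(freqs.get((w, 1), 0) for w in tweet_tokens)
    neg = sum(freqs.get((w, 0), 0) for w in tweet_tokens)
    return [1.0, float(pos), float(neg)]

freqs = {("happy", 1): 3, ("sad", 0): 4, ("happy", 0): 1}
print(extract_features(["happy", "sad"], freqs))
# [1.0, 3.0, 5.0]
```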


General Implementation
Logistic Regression
Overview of logistic regression
Logistic Regression Overview

Logistic regression makes use of the sigmoid function, which outputs a probability between 0 and 1.

The sigmoid function with weight parameter θ and input x^(i) is defined as follows:

h(x^(i); θ) = 1 / (1 + e^(−θ·x^(i)))
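A minimal plain-Python version of the sigmoid and the resulting prediction h(x; θ) (function names are illustrative):

```python
import math

def sigmoid(z):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    # h(x; theta) = sigmoid(theta . x); theta and x are same-length lists.
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

print(sigmoid(0))  # 0.5
```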
Overview of logistic regression
LR: Training
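The training loop can be sketched as batch gradient descent on the logistic-regression cost; this is an illustrative plain-Python version (redefining sigmoid so the snippet is self-contained), not an optimized implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, alpha=0.1, iters=1000):
    # Batch gradient descent. X: list of feature rows (first column is
    # the bias feature 1.0), y: list of 0/1 labels, alpha: learning rate.
    theta = [0.0] * len(X[0])
    m = len(X)
    for _ in range(iters):
        preds = [sigmoid(sum(t * xi for t, xi in zip(theta, x))) for x in X]
        for j in range(len(theta)):
            grad = sum((preds[i] - y[i]) * X[i][j] for i in range(m)) / m
            theta[j] -= alpha * grad
    return theta
```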
LR: Testing
Cost Function for LR
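The standard cost for logistic regression is the averaged binary cross-entropy, J(θ) = −(1/m) Σ [y log h + (1 − y) log(1 − h)]; a small sketch:

```python
import math

def cost(y, y_hat):
    # Binary cross-entropy averaged over m examples.
    # y: true 0/1 labels, y_hat: predicted probabilities in (0, 1).
    m = len(y)
    return -sum(
        yi * math.log(hi) + (1 - yi) * math.log(1 - hi)
        for yi, hi in zip(y, y_hat)
    ) / m

print(round(cost([1, 0], [0.9, 0.1]), 4))  # 0.1054
```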
Your Turn

● Dataset: https://data.world/data-society/twitter-user-data
● Perform feature engineering
● Convert the text to vectors
● Apply logistic regression
Questions?
