
Natural Language Processing for Spam Detection

The City College of New York


CSc 59929 – Introduction to Machine Learning
Spring 2020 – Erik K. Grimmelmann, Ph.D.
Source for this lecture

Badreesh Shetty, Towards Data Science

https://towardsdatascience.com/natural-language-processing-nlp-for-machine-learning-d44498845d5b

Examples of NLP
• Information Retrieval (Google finds relevant and similar results).
• Information Extraction (Gmail structures events from emails).
• Machine Translation (Google Translate translates text from one language to another).
• Text Simplification (Rewordify simplifies the meaning of sentences).
• Sentiment Analysis (Hater News gives the sentiment of a user).
• Text Summarization (Smmry or Reddit's autotldr gives a summary of sentences).
• Spam Filter (Gmail filters spam emails separately).
• Auto-Predict (Google Search predicts user search queries).
• Auto-Correct (Google Keyboard and Grammarly correct misspelled words).
• Speech Recognition (Google WebSpeech or Vocalware).
• Question Answering (IBM Watson answers a query).
• Natural Language Generation (generation of text from image or video data).
Natural Language Toolkit (NLTK)

NLTK is a popular open-source Python package. Rather than requiring every tool to be built from scratch, it provides implementations of the most common NLP tasks.

See http://pypi.python.org/pypi/nltk

Importing NLTK Library
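A minimal sketch of this step, assuming NLTK has already been installed (e.g., with pip install nltk):

import nltk

# Confirm the package is available.
print(nltk.__version__)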

NLTK Downloader
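A sketch of using the downloader. Calling nltk.download() with no arguments opens an interactive downloader; passing a resource name fetches just that resource. The two resources named here are the ones used in the later steps:

import nltk

# nltk.download()              # interactive downloader (GUI or text menu)
nltk.download('stopwords')     # stopword list used below
nltk.download('wordnet')       # dictionary used by the lemmatizer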

Reading and exploring a dataset
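A sketch of the loading step, assuming the tab-separated SMS spam collection used in the source article; the file name and column names are assumptions:

import pandas as pd

# One message per line: label <TAB> message text.
data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None,
                   names=['label', 'body_text'])

print(data.head())
print(data['label'].value_counts())   # number of spam vs. ham messages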

Removing punctuation
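A sketch of the punctuation-removal step using Python's built-in string.punctuation:

import string

def remove_punct(text):
    # Drop every character that appears in string.punctuation.
    return ''.join(ch for ch in text if ch not in string.punctuation)

print(remove_punct("Free entry!! Text WIN to 80086 now."))
# -> Free entry Text WIN to 80086 now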

Tokenizing
Tokenizing separates text into units (e.g., words, sentences).
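A sketch of word tokenization with a regular expression (NLTK's word_tokenize would also work):

import re

def tokenize(text):
    # Split on runs of non-word characters and lower-case the tokens.
    return [tok for tok in re.split(r'\W+', text.lower()) if tok]

print(tokenize("Free entry Text WIN to 80086 now"))
# -> ['free', 'entry', 'text', 'win', 'to', '80086', 'now']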

Removing stopwords
Stopwords are common words that are likely to appear in any
text (e.g., or, and, is).
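A sketch of the stopword-removal step using NLTK's English stopword list (requires nltk.download('stopwords')):

import nltk

stopwords = set(nltk.corpus.stopwords.words('english'))

def remove_stopwords(tokens):
    return [tok for tok in tokens if tok not in stopwords]

print(remove_stopwords(['free', 'entry', 'text', 'win', 'to', '80086', 'now']))
# -> ['free', 'entry', 'text', 'win', '80086']   ('to' and 'now' are stopwords)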

Preprocessing: Stemming
Stemming reduces a word to its stem form. For example, it removes suffixes (e.g., "ing", "ly", "s") using a simple rule-based approach.
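A sketch using NLTK's Porter stemmer; the example words are illustrative:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
for word in ['grows', 'growing', 'entitling', 'entitled']:
    print(word, '->', ps.stem(word))
# grows -> grow, growing -> grow, entitling -> entitl, entitled -> entitl

Note that a stem such as 'entitl' need not be a dictionary word; the stemmer only applies suffix-stripping rules.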

Preprocessing: Lemmatizing
Lemmatizing derives the canonical form (the 'lemma', or root form) of a word. It is more accurate than stemming because it uses a dictionary-based approach (i.e., a morphological analysis that maps a word to its root). For example: Entitling, Entitled -> Entitle.
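A sketch using NLTK's WordNet lemmatizer (requires nltk.download('wordnet')); the part-of-speech hint matters because words are treated as nouns by default:

from nltk.stem import WordNetLemmatizer

wn = WordNetLemmatizer()
print(wn.lemmatize('geese'))               # -> goose
print(wn.lemmatize('entitled', pos='v'))   # -> entitle (treated as a verb)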

Vectorizing
Vectorizing is the process of encoding text as integers (i.e., in
numeric form) to create feature vectors so that machine
learning algorithms can use the data.

Vectorizing: Bag-of-Words
Bag of Words (BoW), or CountVectorizer, describes the presence of words within the text data. The result is a list of the words in the documents along with a count of how many times each word appears in each document.
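A sketch using scikit-learn's CountVectorizer on a toy corpus; in the full pipeline it would be applied to the cleaned message text:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['free entry win win', 'are you coming home']

vect = CountVectorizer()
X = vect.fit_transform(corpus)        # sparse document-term matrix

print(vect.get_feature_names_out())
print(X.toarray())                    # one row per document, one column per word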

Vectorizing: N-Grams

N-grams are the contiguous sequences of n adjacent words (or letters) in the source text. N-grams with n = 1 are called unigrams; similarly, bigrams (n = 2), trigrams (n = 3), etc., can also be used.
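A sketch of bigram counting with scikit-learn's CountVectorizer; ngram_range=(1, 2) would keep unigrams and bigrams together:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['free entry win win', 'are you coming home']

vect = CountVectorizer(ngram_range=(2, 2))   # bigrams only
X = vect.fit_transform(corpus)

print(vect.get_feature_names_out())
# ['are you' 'coming home' 'entry win' 'free entry' 'win win' 'you coming']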

Vectorizing: TF-IDF

TF-IDF (or TFIDF) is short for term frequency-inverse document frequency. It weights how frequently a word appears in a document relative to how frequently it appears across all documents in a collection. It is most typically used for search-engine scoring, text summarization, and document clustering.
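A sketch using scikit-learn's TfidfVectorizer; words that occur in many documents receive lower weights than words concentrated in a few:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['free entry win win', 'are you coming home', 'win a free prize']

vect = TfidfVectorizer()
X = vect.fit_transform(corpus)

print(vect.get_feature_names_out())
print(X.toarray().round(2))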

Feature engineering

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

Feature engineering is fundamental to the application of machine learning and is both difficult and expensive.

The need for manual feature engineering can be obviated by automated feature learning.

From Wikipedia
Feature engineering: feature creation
Example of two features (a sketch of the feature-creation step follows the list):
• Length of the message body, excluding whitespace
• Percentage of punctuation characters in the message body
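A sketch of creating the two features, assuming the data DataFrame with a body_text column from the loading sketch above:

import string

def count_punct(text):
    # Fraction of non-whitespace characters that are punctuation, as a percentage.
    count = sum(1 for ch in text if ch in string.punctuation)
    return round(count / (len(text) - text.count(' ')), 3) * 100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(' '))
data['punct%'] = data['body_text'].apply(count_punct)

print(data[['body_text', 'body_len', 'punct%']].head())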

Feature engineering: feature checking
Length of the message body (excluding whitespace)

Feature engineering: feature checking
Percentage of punctuation marks
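A sketch of how such a check could be plotted for the body-length feature, split by label, assuming the data DataFrame and the body_len and punct% columns from the previous sketch:

import numpy as np
import matplotlib.pyplot as plt

bins = np.linspace(0, 200, 40)
plt.hist(data[data['label'] == 'spam']['body_len'], bins, alpha=0.5, label='spam')
plt.hist(data[data['label'] == 'ham']['body_len'], bins, alpha=0.5, label='ham')
plt.legend(loc='upper right')
plt.title('Message body length (excluding whitespace)')
plt.show()

The same plot with punct% in place of body_len checks the second feature.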

ML model selection
This example uses a Random Forest classifier and a grid search to find the best combination of two hyperparameters (a sketch of the search follows the list):
• Number of estimators (number of trees in the forest)
• Maximum depth (maximum number of levels in each
decision tree)
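A sketch of the grid search, assuming TF-IDF features built from the data DataFrame used earlier; the grid values are illustrative, not necessarily the ones used on the slides:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

X = TfidfVectorizer().fit_transform(data['body_text'])
y = data['label']

param_grid = {'n_estimators': [10, 150, 300],
              'max_depth': [30, 60, 90, None]}

gs = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, n_jobs=-1)
gs.fit(X, y)

print(gs.best_params_)   # the slides report n_estimators=150, max_depth=90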

Model using bag of words

Best fit is for:
• Number of estimators = 150
• Maximum depth = 90
Model using TF-IDF

Best fit is for:
• Number of estimators = 150
• Maximum depth = 90
Spam-Ham classifier results

n = 1,114
Spam-Ham classifier results

Confusion matrix layout:
TN   FP
FN   TP

n = 1,114
Spam-Ham classifier results

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * Pr * Re / (Pr + Re)
Accuracy = (TP + TN) / n
FPR = FP / (FP + TN) = 0.000
FNR = FN / (FN + TP) = 0.174
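A sketch that computes these metrics from the four confusion-matrix counts; the counts below are made up for illustration and are not the slide's actual results:

def metrics(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / n
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    return precision, recall, f1, accuracy, fpr, fnr

print(metrics(tp=80, fp=5, fn=10, tn=905))   # illustrative counts only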

Confusion matrices

                 True
               P      N
Predicted  P   TP     FP
           N   FN     TN

Confusion matrices
The same confusion matrix can be laid out in several ways, depending on whether rows or columns hold the true labels and on the order of the classes:

               Predicted                       Predicted
               N      P                        P      N
True      N    TN     FP           True   P    TP     FN
          P    FN     TP                  N    FP     TN

               True                            True
               N      P                        P      N
Predicted N    TN     FN        Predicted P    TP     FP
          P    FP     TP                  N    FN     TN
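For reference, scikit-learn's confusion_matrix follows the top-left layout above: rows hold the true classes and columns the predicted classes, in the order given by the labels argument, so with the positive class listed second the result reads [[TN, FP], [FN, TP]]:

from sklearn.metrics import confusion_matrix

y_true = ['spam', 'ham', 'ham', 'spam', 'ham']
y_pred = ['spam', 'ham', 'spam', 'ham', 'ham']

print(confusion_matrix(y_true, y_pred, labels=['ham', 'spam']))
# [[2 1]
#  [1 1]]   i.e. TN=2, FP=1, FN=1, TP=1 with 'spam' as the positive class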

