
Natural Language Processing for Spam Detection

The City College of New York


CSc 59929 – Introduction to Machine Learning
Spring 2020 – Erik K. Grimmelmann, Ph.D.
Source for this lecture

Badreesh Shetty, Towards Data Science

https://towardsdatascience.com/natural-language-processing-nlp-for-machine-learning-d44498845d5b

Examples of NLP
• Information Retrieval (Google finds relevant and similar results).
• Information Extraction (Gmail structures events from emails).
• Machine Translation (Google Translate translates text from one language to another).
• Text Simplification (Rewordify simplifies the meaning of sentences).
• Sentiment Analysis (Hater News gives the sentiment of a user).
• Text Summarization (Smmry or Reddit's autotldr gives a summary of sentences).
• Spam Filter (Gmail filters spam emails separately).
• Auto-Predict (Google Search predicts user search queries).
• Auto-Correct (Google Keyboard and Grammarly correct misspelled words).
• Speech Recognition (Google WebSpeech or Vocalware).
• Question Answering (IBM Watson answers a query).
• Natural Language Generation (generation of text from image or video data).
Natural Language Toolkit (NLTK)

NLTK is a popular open-source Python package. Rather than requiring every tool to be built from scratch, it provides implementations of the most common NLP tasks.

See http://pypi.python.org/pypi/nltk

Importing NLTK Library
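A minimal sketch of this step, assuming NLTK has already been installed (e.g., with pip install nltk):

import nltk

# Confirm the package is available.
print(nltk.__version__)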

NLTK Downloader
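A sketch of using the downloader. Calling nltk.download() with no arguments opens an interactive downloader; passing a resource name fetches just that resource. The two resources named here are the ones used in the later steps:

import nltk

# nltk.download()              # interactive downloader (GUI or text menu)
nltk.download('stopwords')     # stopword list used below
nltk.download('wordnet')       # dictionary used by the lemmatizer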

Reading and exploring a dataset
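A sketch of the loading step, assuming the tab-separated SMS spam collection used in the source article; the file name and column names are assumptions:

import pandas as pd

# One message per line: label <TAB> message text.
data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None,
                   names=['label', 'body_text'])

print(data.head())
print(data['label'].value_counts())   # number of spam vs. ham messages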

Removing punctuation
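A sketch of the punctuation-removal step using Python's built-in string.punctuation:

import string

def remove_punct(text):
    # Drop every character that appears in string.punctuation.
    return ''.join(ch for ch in text if ch not in string.punctuation)

print(remove_punct("Free entry!! Text WIN to 80086 now."))
# -> Free entry Text WIN to 80086 now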

Tokenizing
Tokenizing separates text into units (e.g., words, sentences).
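A sketch of word tokenization with a regular expression (NLTK's word_tokenize would also work):

import re

def tokenize(text):
    # Split on runs of non-word characters and lower-case the tokens.
    return [tok for tok in re.split(r'\W+', text.lower()) if tok]

print(tokenize("Free entry Text WIN to 80086 now"))
# -> ['free', 'entry', 'text', 'win', 'to', '80086', 'now']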

Removing stopwords
Stopwords are common words that are likely to appear in any
text (e.g., or, and, is).
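A sketch of the stopword-removal step using NLTK's English stopword list (requires nltk.download('stopwords')):

import nltk

stopwords = set(nltk.corpus.stopwords.words('english'))

def remove_stopwords(tokens):
    return [tok for tok in tokens if tok not in stopwords]

print(remove_stopwords(['free', 'entry', 'text', 'win', 'to', '80086', 'now']))
# -> ['free', 'entry', 'text', 'win', '80086']   ('to' and 'now' are stopwords)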

Preprocessing: Stemming
Stemming reduces a word to its stem form. For example, it removes suffixes (e.g., "ing", "ly", "s") using a simple rule-based approach.
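A sketch using NLTK's Porter stemmer; the example words are illustrative:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
for word in ['grows', 'growing', 'entitling', 'entitled']:
    print(word, '->', ps.stem(word))
# grows -> grow, growing -> grow, entitling -> entitl, entitled -> entitl

Note that a stem such as 'entitl' need not be a dictionary word; the stemmer only applies suffix-stripping rules.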

Preprocessing: Lemmatizing
Lemmatizing derives the canonical form (the 'lemma', or root form) of a word. It is more accurate than stemming because it uses a dictionary-based approach (i.e., a morphological analysis that maps a word to its root). For example: Entitling, Entitled -> Entitle.
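A sketch using NLTK's WordNet lemmatizer (requires nltk.download('wordnet')); the part-of-speech hint matters because words are treated as nouns by default:

from nltk.stem import WordNetLemmatizer

wn = WordNetLemmatizer()
print(wn.lemmatize('geese'))               # -> goose
print(wn.lemmatize('entitled', pos='v'))   # -> entitle (treated as a verb)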

Vectorizing
Vectorizing is the process of encoding text as integers (i.e., in
numeric form) to create feature vectors so that machine
learning algorithms can use the data.

Vectorizing: Bag-of-Words
Bag of Words (BoW), or CountVectorizer, describes the presence of words within the text data. The result is a list of the words in the documents along with a count of how many times each word appears in each document.
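A sketch using scikit-learn's CountVectorizer on a toy corpus; in the full pipeline it would be applied to the cleaned message text:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['free entry win win', 'are you coming home']

vect = CountVectorizer()
X = vect.fit_transform(corpus)        # sparse document-term matrix

print(vect.get_feature_names_out())
print(X.toarray())                    # one row per document, one column per word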

Vectorizing: N-Grams

N-grams are the contiguous sequences of n adjacent words (or letters) in the source text. N-grams with n = 1 are called unigrams; similarly, bigrams (n = 2), trigrams (n = 3), etc., can also be used.
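A sketch of bigram counting with scikit-learn's CountVectorizer; ngram_range=(1, 2) would keep unigrams and bigrams together:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['free entry win win', 'are you coming home']

vect = CountVectorizer(ngram_range=(2, 2))   # bigrams only
X = vect.fit_transform(corpus)

print(vect.get_feature_names_out())
# ['are you' 'coming home' 'entry win' 'free entry' 'win win' 'you coming']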

Vectorizing: TF-IDF

TF-IDF (or TFIDF) is short for term frequency-inverse document frequency. It weights how frequently a word appears in a document relative to how frequently it appears across all documents in a collection. It is most typically used for search-engine scoring, text summarization, and document clustering.
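A sketch using scikit-learn's TfidfVectorizer; words that occur in many documents receive lower weights than words concentrated in a few:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['free entry win win', 'are you coming home', 'win a free prize']

vect = TfidfVectorizer()
X = vect.fit_transform(corpus)

print(vect.get_feature_names_out())
print(X.toarray().round(2))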

Feature engineering

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

Feature engineering is fundamental to the application of machine learning and is both difficult and expensive.

The need for manual feature engineering can be obviated by automated feature learning.

From Wikipedia
Feature engineering: feature creation
Example of two features (a sketch of the feature-creation step follows the list):
• Length of the message body, excluding whitespace
• Percentage of punctuation characters in the message body
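A sketch of creating the two features, assuming the data DataFrame with a body_text column from the loading sketch above:

import string

def count_punct(text):
    # Fraction of non-whitespace characters that are punctuation, as a percentage.
    count = sum(1 for ch in text if ch in string.punctuation)
    return round(count / (len(text) - text.count(' ')), 3) * 100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(' '))
data['punct%'] = data['body_text'].apply(count_punct)

print(data[['body_text', 'body_len', 'punct%']].head())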

Feature engineering: feature checking
Length of the message body (excluding whitespace)

Feature engineering: feature checking
Percentage of punctuation marks
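A sketch of how such a check could be plotted for the body-length feature, split by label, assuming the data DataFrame and the body_len and punct% columns from the previous sketch:

import numpy as np
import matplotlib.pyplot as plt

bins = np.linspace(0, 200, 40)
plt.hist(data[data['label'] == 'spam']['body_len'], bins, alpha=0.5, label='spam')
plt.hist(data[data['label'] == 'ham']['body_len'], bins, alpha=0.5, label='ham')
plt.legend(loc='upper right')
plt.title('Message body length (excluding whitespace)')
plt.show()

The same plot with punct% in place of body_len checks the second feature.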

ML model selection
This example uses a Random Forest classifier and a grid search to find the best combination of two hyperparameters (a sketch of the search follows the list):
• Number of estimators (number of trees in the forest)
• Maximum depth (maximum number of levels in each
decision tree)
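A sketch of the grid search, assuming TF-IDF features built from the data DataFrame used earlier; the grid values are illustrative, not necessarily the ones used on the slides:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

X = TfidfVectorizer().fit_transform(data['body_text'])
y = data['label']

param_grid = {'n_estimators': [10, 150, 300],
              'max_depth': [30, 60, 90, None]}

gs = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, n_jobs=-1)
gs.fit(X, y)

print(gs.best_params_)   # the slides report n_estimators=150, max_depth=90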

Model using bag of words

Best fit is for:
• Number of estimators = 150
• Maximum depth = 90
Model using TF-IDF

Best fit is for:
• Number of estimators = 150
• Maximum depth = 90
Spam-Ham classifier results

n = 1,114
Spam-Ham classifier results

Confusion matrix layout:
TN   FP
FN   TP

n = 1,114
Spam-Ham classifier results

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * Pr * Re / (Pr + Re)
Accuracy = (TP + TN) / n
FPR = FP / (FP + TN) = 0.000
FNR = FN / (FN + TP) = 0.174
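A sketch that computes these metrics from the four confusion-matrix counts; the counts below are made up for illustration and are not the slide's actual results:

def metrics(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / n
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    return precision, recall, f1, accuracy, fpr, fnr

print(metrics(tp=80, fp=5, fn=10, tn=905))   # illustrative counts only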

Confusion matrices

                 True
               P      N
Predicted  P   TP     FP
           N   FN     TN

Confusion matrices
The same confusion matrix can be laid out in several ways, depending on whether rows or columns hold the true labels and on the order of the classes:

               Predicted                       Predicted
               N      P                        P      N
True      N    TN     FP           True   P    TP     FN
          P    FN     TP                  N    FP     TN

               True                            True
               N      P                        P      N
Predicted N    TN     FN        Predicted P    TP     FP
          P    FP     TP                  N    FN     TN
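For reference, scikit-learn's confusion_matrix follows the top-left layout above: rows hold the true classes and columns the predicted classes, in the order given by the labels argument, so with the positive class listed second the result reads [[TN, FP], [FN, TP]]:

from sklearn.metrics import confusion_matrix

y_true = ['spam', 'ham', 'ham', 'spam', 'ham']
y_pred = ['spam', 'ham', 'spam', 'ham', 'ham']

print(confusion_matrix(y_true, y_pred, labels=['ham', 'spam']))
# [[2 1]
#  [1 1]]   i.e. TN=2, FP=1, FN=1, TP=1 with 'spam' as the positive class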

