Spam Classifier
2017UCO1596
Ishaan Rawat
2017UCO1644
Anish Jangra
2017UCO1654
What, Why, How & Wow of Our Project
“Spam is a waste of the receivers’ time, and, a waste of the sender’s optimism.”
What: Naïve Bayes Spam Classifier
● An NB classifier uses the Naïve Bayes (NB) algorithm, a popular statistical technique for e-mail filtering
● It uses bag-of-words features to identify spam email

Why: NB vs. the Spam Menace
● Spam emails are responsible for >77% of global email traffic
● The Naïve Bayes algorithm is widely used in business-related and open-source spam filters

1 Preprocessing the data
2 Training the Naïve Bayes spam classifier
3 Testing the trained AI model
4 Obtaining & quoting relevant results

What Did We Learn?
● Learned the NB algorithm
● Teamwork: collaborating and collating results together as a team

What Next?
● How to further improve the model?
● Where to use the model?
● How is the problem evolving?
1 Data set from the SpamAssassin public mail corpus, containing around 1900 spam emails and 3900 legit emails
2 Extract the email body from the email
3 Remove the null values and create a data frame of the emails
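The body-extraction and null-removal steps above can be sketched roughly as follows. This is a minimal illustration using only Python's standard library, not the project's actual code: the helper names `extract_body` and `build_records` are hypothetical, a plain list of dicts stands in for the data frame, and the labels (1 = spam, 0 = ham) are an assumed encoding.

```python
from email import message_from_string


def extract_body(raw):
    """Return the first non-empty text/plain body of a raw email, or None."""
    msg = message_from_string(raw)
    # walk() yields the message itself and, for multipart mail, every sub-part.
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # bytes, or None for containers
            if payload:
                text = payload.decode("utf-8", errors="replace").strip()
                if text:
                    return text
    return None


def build_records(raw_emails):
    """raw_emails: iterable of (raw_text, label) pairs; label 1 = spam, 0 = ham (assumed)."""
    records = []
    for raw, label in raw_emails:
        body = extract_body(raw)
        if body is not None:  # drop the null values
            records.append({"body": body, "label": label})
    return records


sample = [
    ("Subject: Win\n\nFree money now\n", 1),
    ("Subject: empty\n\n", 0),  # no body, so it is dropped
    ("Subject: Hi\n\nLunch at noon?\n", 0),
]
records = build_records(sample)
```

Emails with no usable text body are skipped here, which plays the role of removing null rows before the data frame is built.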
6 Generate the features for the emails in the data set
7 Split the data 70:30 into training and test sets
8 Create a sparse matrix for training and test data
Sparse matrix for test data
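A rough sketch of steps 6 to 8, under stated assumptions: features are counts of the most frequent words (the 2500-word vocabulary size is taken from the matrix dimensions quoted on the testing slide), the split is a seeded random shuffle, and the sparse matrix is represented as a dictionary keyed by `(doc_id, word_id)`. The function names and the tokenizer are hypothetical, chosen only for illustration.

```python
import random
import re
from collections import Counter

TOKEN = re.compile(r"[a-z]+")  # assumed tokenizer: lowercase alphabetic runs


def tokenize(text):
    return TOKEN.findall(text.lower())


def build_vocabulary(bodies, size=2500):
    """Map the `size` most frequent words to column indices."""
    counts = Counter(tok for body in bodies for tok in tokenize(body))
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(size))}


def split_70_30(records, seed=42):
    """Shuffle a copy of the records and cut it 70:30 into train and test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def to_sparse(bodies, vocab):
    """Sparse counts: only non-zero (doc_id, word_id) entries are stored."""
    sparse = {}
    for doc_id, body in enumerate(bodies):
        for tok in tokenize(body):
            if tok in vocab:
                key = (doc_id, vocab[tok])
                sparse[key] = sparse.get(key, 0) + 1
    return sparse
```

Storing only the non-zero counts is what makes the matrix sparse: most emails contain a small fraction of the vocabulary, so most entries would otherwise be zero.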
Bayes Theorem
● Convert the sparse matrix of training data into a full feature matrix. It holds the frequency of each vocabulary word in each email of the training data.
● The word probabilities are calculated by summing across the axis of the matrix, separately for the ham and spam emails in the training data.
● Using these probabilities, we classify each email in our test data as ham or spam.
Figures: Full feature matrix; Prob of words in spam emails; Prob of words in ham emails
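The training steps above might look like the sketch below. It expands a sparse `(doc_id, word_id) -> count` dictionary into a full matrix, sums the counts per word over spam and ham emails, and normalizes them into probabilities. The add-one (Laplace) smoothing via `alpha` is an assumption added here so that unseen words do not get probability zero; the slides do not say whether the project used it. All function names are hypothetical.

```python
def densify(sparse, n_docs, n_words):
    """Expand sparse (doc_id, word_id) -> count entries into a full matrix."""
    full = [[0] * n_words for _ in range(n_docs)]
    for (doc_id, word_id), count in sparse.items():
        full[doc_id][word_id] = count
    return full


def train(full, labels, alpha=1.0):
    """Return P(word | spam), P(word | ham) and the prior P(spam).

    labels: 1 = spam, 0 = ham (assumed encoding).
    alpha:  Laplace smoothing constant (an assumption, not from the slides).
    """
    n_words = len(full[0])
    spam_counts = [0] * n_words
    ham_counts = [0] * n_words
    # Sum word frequencies down the email axis, separately per class.
    for row, label in zip(full, labels):
        target = spam_counts if label == 1 else ham_counts
        for j, count in enumerate(row):
            target[j] += count
    spam_total = sum(spam_counts) + alpha * n_words
    ham_total = sum(ham_counts) + alpha * n_words
    p_word_spam = [(c + alpha) / spam_total for c in spam_counts]
    p_word_ham = [(c + alpha) / ham_total for c in ham_counts]
    p_spam = sum(labels) / len(labels)
    return p_word_spam, p_word_ham, p_spam
```

With smoothing, each class's word probabilities still sum to 1, so they remain valid distributions over the vocabulary.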
Exec-Summary | Pre-processing | Training & Testing | Results | Conclusion
Testing the Naïve-Bayes Classifier
● The test data is in the form of a matrix
● The probabilities are converted to log form (to avoid numerical errors during calculations)
● Calculate the probability of an email being spam or ham using the matrix inner product
● The test matrix has 2500 columns and the probability matrix has 2500 rows
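A minimal sketch of the testing step, assuming word probabilities and a spam prior like those produced during training. Taking logs turns the product of many small probabilities into a sum, which is the usual way to avoid floating-point underflow; the score for each class is then the inner product of an email's word-count row with the log-probability vector. Adding the log prior to each score is an assumption, and `classify` is a hypothetical name.

```python
import math


def classify(test_matrix, p_word_spam, p_word_ham, p_spam):
    """Return 1 (spam) or 0 (ham) for each row of word counts."""
    log_spam = [math.log(p) for p in p_word_spam]
    log_ham = [math.log(p) for p in p_word_ham]
    predictions = []
    for row in test_matrix:
        # Inner product: one count per vocabulary word (columns of the test
        # matrix) times one log-probability per word (rows of the
        # probability matrix), so the two lengths must match.
        spam_score = math.log(p_spam) + sum(c * w for c, w in zip(row, log_spam))
        ham_score = math.log(1 - p_spam) + sum(c * w for c, w in zip(row, log_ham))
        predictions.append(1 if spam_score > ham_score else 0)
    return predictions


# Toy two-word vocabulary; a real run would use 2500 columns per row.
preds = classify([[3, 0], [0, 3]], [0.6, 0.4], [1 / 3, 2 / 3], 0.5)
```

The matching dimensions noted on the slide (2500 columns against 2500 rows) are exactly what make this inner product well defined.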
Significant Results We Obtained