Format For PBS
on
Bachelor of Engineering
in
Information Technology
By
Isha Agarwal
2022-23
CERTIFICATE
This is to certify that the seminar report entitled “Email Spam Detection” being submitted by Isha
Agarwal, TI01 is a record of bonafide work carried out by her under the supervision and guidance
of Dr. Rupali M. Chopade in partial fulfillment of the requirement for TE (Information Technology
Engineering) – 2019 course of Savitribai Phule Pune University, Pune for the academic year 2022-23.
Date: 18th November 2022
Place: Pune
This project-based seminar report has been examined by us as per the Savitribai Phule Pune
University, Pune, requirements at Marathwada Mitra Mandal's College of Engineering, Pune on 18th
November 2022.
ACKNOWLEDGEMENT
Abstract
Machine learning has become a buzzword in today's technology, and it is developing very quickly. Google
Maps, Google Assistant, Alexa, and other services we use every day are examples of machine learning at
work, often without our noticing. Machine learning algorithms build a model based on sample
data, also known as training data, in order to make predictions or decisions without being explicitly
programmed to do so.
Machine learning algorithms use historical data as input to predict new output values. Machine learning
has many applications, including email spam filtering, malware filtering, speech recognition, and image
recognition. It is also used in virtual assistants, online fraud detection, stock market trading, and
medical diagnosis.
Unwanted emails sent in large numbers are referred to as email spam (spamming). The primary goal of
spam is to generate revenue, and because it is so inexpensive to send compared to traditional marketing
strategies, spam marketing is nevertheless incredibly effective despite the extremely low response rates to
it. Junk emails are delivered to users' inboxes with messages that are meant to advertise goods and services
in order to make money.
Table of Contents
List of Figures
CHAPTER 1
Introduction
Spam emails are unwanted messages that are sent without the recipient's knowledge or agreement. Spam
is identified by email servers employing spam filter software, which analyses incoming emails based on a
variety of criteria. A program known as a spam filter is used to identify unsolicited, undesired, and virus-
infected emails and stop them from reaching a user's mailbox.
1.3 Motivation
Email spam causes many problems for users. Spam emails may harm the system,
as they may contain malware or links that lead to malicious sites.
They also waste the user's time and consume the memory space and CPU power of the system.
This seminar aims to study email spam detection using different machine learning algorithms,
most prominently the Naïve Bayes algorithm.
It also examines how emails are classified into spam and ham and how spam can cause various problems.
CHAPTER 2
LITERATURE SURVEY
Electronic mail has given many organizations and people more convenient ways to communicate.
Spammers who send unsolicited emails take advantage of this medium to make fraudulent gains.
This topic seeks to introduce a technique for spam email detection using machine learning algorithms that
are enhanced using bio-inspired techniques. A literature survey is carried out to explore efficient methods
applied to various datasets to produce successful outcomes. An intensive study was conducted, along with
feature extraction and pre-processing, to build machine learning models utilizing Naive Bayes, Support
Vector Machine, Random Forest, Decision Tree, and Multi-Layer Perceptron on seven different email datasets.
This study also looks into how spam and ham email clusters can be identified using unsupervised learning.
STUDY OF LITERATURE SURVEY
Table 1.1
Sr No.: 1
Paper Title: An Unsupervised Approach for Content-Based Clustering of Emails Into Spam and Ham Through
Multiangular Feature Formulation
Publication & Year: IEEE, 29 September 2021
Authors: Asif Karim, (Member, IEEE), Sami Azam, (Member, IEEE), Bharanidharan Shanmugam, and
Krishnan Kannoorpatti
Findings: This research described a novel framework based entirely on unsupervised methodologies to
separate ham from spam emails through unsupervised clustering (DBSCAN, OPTICS, K-Means).
CHAPTER 3
Methodology & Algorithms Used
3.1) Methodology:
A machine learning algorithm is used to learn the classification rules from a set of training email
messages. Several studies have been carried out on machine learning techniques, and many of these
algorithms are being applied in the field of email spam filtering.
Examples of such algorithms include Naïve Bayes, Support Vector Machines, Neural Networks,
K-Nearest Neighbour, Decision Tree, and Random Forests.
Emails are classified as spam or not spam (ham) on the basis of the words present in them. Words and
phrases such as "offers", "extra income", "earn more", and "cash prizes" are generally present in spam
emails. For each word in the email, the probability that it indicates spam is calculated, and the word is
assigned a binary value (spam or ham). If the words are consistently marked as spam, the email is
classified as spam.
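The word-based marking described above can be sketched as a binary presence check. This is only an illustrative sketch: the term list reuses the example phrases from the text, and the names SPAM_TERMS and binary_features are hypothetical, not part of any particular spam filter.

```python
import re

# Hypothetical list of spam-indicative terms, taken from the examples in the text.
SPAM_TERMS = ["offers", "extra income", "earn more", "cash prizes"]


def binary_features(email_text, terms=SPAM_TERMS):
    """Return one 0/1 flag per term: 1 if the term occurs in the email, else 0."""
    # Lowercase and collapse whitespace so "Extra   Income" matches "extra income".
    normalized = re.sub(r"\s+", " ", email_text.lower())
    return [1 if term in normalized else 0 for term in terms]


features = binary_features("Earn more today! Exclusive OFFERS and cash prizes inside.")
print(features)  # one flag per entry in SPAM_TERMS
```

A real filter would learn its vocabulary from training data rather than use a fixed list; the fixed list here only keeps the sketch short.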
Bayes Formula for conditional probability-
Fig 1.1
Bayes' theorem is a technique for computing conditional probabilities, that is, the likelihood of one
event happening given that another has already happened. A conditional probability can lead to more
precise conclusions because it takes additional conditions, or additional data, into account.
As a result, conditional probabilities are essential for computing precise probabilities and predictions in
machine learning.
Bayes' theorem thus gives the probability of one event, about which our knowledge is uncertain, given
that another event has already occurred.
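For reference, the standard statement of Bayes' theorem for two events A and B with P(B) > 0 is given below, followed by the form it takes for spam filtering. The event names (Spam, Ham, Word) are chosen here for illustration.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

P(\text{Spam} \mid \text{Word}) =
  \frac{P(\text{Word} \mid \text{Spam})\, P(\text{Spam})}
       {P(\text{Word} \mid \text{Spam})\, P(\text{Spam}) + P(\text{Word} \mid \text{Ham})\, P(\text{Ham})}
```

The denominator of the second form is the total probability of seeing the word in any email, spam or ham.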
Example
Consider 10,000 emails, of which 70% are spam and the remaining 30% are not spam.
Suppose the phrase “review us” is present in 5% of the spam emails and in 1% of the ham emails. Given
an email containing this phrase, we have to determine whether it is spam or not.
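Total emails: 10,000
Spam: 7,000, of which 5% contain the phrase (350 with the phrase, 6,650 without)
Ham: 3,000, of which 1% contain the phrase (30 with the phrase, 2,970 without)
Fig 1.2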
P(Word) = P(Word|Spam)*P(Spam) + P(Word|Ham)*P(Ham) = 0.05*0.7 + 0.01*0.3 = 0.038
P(Spam|Word) = (0.05*0.7)/0.038 = 350/(350+30) ≈ 0.921
Therefore the email can be classified as spam.
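As a cross-check, the posterior can be computed directly from the stated rates (70% spam, phrase in 5% of spam and 1% of ham). The counts 350 and 30 already reflect the 70/30 split, so Bayes' theorem reduces to 350/(350+30), about 0.921. The function name posterior_spam below is hypothetical and used only for this sketch.

```python
def posterior_spam(p_spam, p_word_given_spam, p_word_given_ham):
    """P(spam | word) via Bayes' theorem for a two-class (spam/ham) problem."""
    p_ham = 1.0 - p_spam
    # Total probability of seeing the word in any email.
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word


# Figures from the example: 70% spam, phrase in 5% of spam and 1% of ham.
p = posterior_spam(p_spam=0.7, p_word_given_spam=0.05, p_word_given_ham=0.01)
print(round(p, 3))
```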
Algorithms Used
Naïve Bayes is a supervised machine learning algorithm in which word probabilities play the main role. If
some words occur often in spam but not in ham, then an incoming e-mail containing them is probably spam.
The Naïve Bayes classifier has become a very popular method in mail filtering software. A Bayesian filter
must be trained to work effectively. Every word in its database has a certain probability of occurring in
spam or ham email. If the combined word probabilities exceed a certain limit, the filter assigns the e-mail
to the corresponding category. The Naïve Bayes algorithm is a probabilistic classifier based on Bayes'
theorem of conditional probability.
Fig 1.3
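A minimal sketch of such a classifier is shown below, assuming whitespace tokenization and a tiny made-up training set. It compares the spam and ham scores directly instead of testing against a fixed limit, works in log space to avoid numeric underflow, and uses Laplace (add-one) smoothing; all three choices, and the names train and classify, are assumptions of this sketch rather than details from the report.

```python
import math
from collections import Counter


def train(emails):
    """emails: list of (text, label) pairs, with label 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in emails:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab


def classify(text, model):
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Log prior plus summed log likelihoods of the words in the email.
        score = math.log(class_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing: an assumption of this sketch.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Tiny made-up training set, purely illustrative.
training = [
    ("win cash prizes now", "spam"),
    ("earn extra income fast", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(training)
print(classify("cash prizes for you", model))
print(classify("monday meeting", model))
```

The "naive" part is the assumption that words occur independently given the class, which is what allows the per-word log likelihoods to be simply added.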
CHAPTER 4
Performance Analysis
These scores were obtained both with default parameters and after hyperparameter tuning.
From the above we can conclude that the Naive Bayes classifier achieves the highest score. It is also
easy to build and implement, and it requires neither an iterative training process nor a large amount of
training data. It can perform efficiently on large datasets and handles both discrete and continuous data.
CHAPTER 5
Outline and Future Scope
Outline:
The paper found that email spam detection can be carried out using Naïve Bayes, K-Means,
Decision Tree, SVM, and other techniques. Comparing these, we found that the most
suitable algorithm is the Naïve Bayes algorithm.
Future Scope:
In future, an email spam detector can be built using other machine learning algorithms such as
K-means and artificial neural networks (ANN). Email spam detection can also be improved using hybrid
deep learning algorithms combining CNN and RNN, with the hidden neurons optimized for improved
cyber security.
CHAPTER 6
CONCLUSION
• Thus, we have studied the Naive Bayes algorithm for detection of spam and ham emails and how
it is the most convenient method for email spam filtering.
• We have also learnt about various machine learning algorithms such as Support Vector Machine
(SVM), K-means clustering, Decision Tree, and Random Forest.
REFERENCES