Seminar Report

on

“Email Spam Detection”


Submitted to the

Savitribai Phule Pune University

In partial fulfillment for the award of the Degree of

Bachelor of Engineering

in

Information Technology

By

Isha Agarwal

Under the guidance of

Dr. Rupali M. Chopade

Department Of Information Technology

Marathwada Mitra Mandal's College of Engineering

Karvenagar, Pune-411052, Maharashtra, India

2022-23
CERTIFICATE
This is to certify that the seminar report entitled “Email Spam Detection”, submitted by Isha
Agarwal (TI01), is a record of bona fide work carried out by her under the supervision and guidance
of Dr. Rupali M. Chopade in partial fulfillment of the requirements for TE (Information Technology
Engineering) – 2019 course of Savitribai Phule Pune University, Pune, for the academic year 2022-23.

Date: 18th November 2022
Place: Pune

Dr. R. M. Chopade        Dr. R. M. Chopade          Dr. V. N. Gohokar
Guide                    Head of the Department     Principal

This project-based seminar report has been examined by us as per the Savitribai Phule Pune
University, Pune, requirements at Marathwada Mitra Mandal's College of Engineering, Pune on 18th
November 2022.

ACKNOWLEDGEMENT

I am extremely grateful to Dr. R. M. Chopade, Head of Department, Department of Information
Technology, for providing all the required resources for the successful completion of my seminar and
for her valuable suggestions and guidance in the preparation of the seminar report. I express my thanks
to all staff members and friends for the help and co-ordination extended in bringing out this seminar
successfully and on time. I would be failing in my duty if I did not gratefully acknowledge the authors
of the references and other literature referred to in this seminar.

Isha Agarwal, TI01

Abstract

Machine learning has become a buzzword in today's technology, and the field is developing very quickly.
Google Maps, Google Assistant, Alexa, and other services we use every day are examples of machine
learning at work, often without our noticing. Machine learning algorithms build a model from sample
data, also known as training data, in order to make predictions or decisions without being explicitly
programmed to do so.

Machine learning algorithms use historical data as input to predict new output values. Machine learning
has many applications, including email spam filtering, malware filtering, speech recognition and image
recognition. It is also used in virtual assistants, online fraud detection, stock market trading and
medical diagnosis.

Unwanted emails sent in large numbers are referred to as email spam (spamming). The primary goal of
spam is to generate revenue. Because spam is so inexpensive to send compared with traditional marketing
strategies, spam marketing remains effective despite the extremely low response rates to it. Junk emails
are delivered to users' inboxes with messages that advertise goods and services in order to make money.

Table of Contents

Sr No.  Chapter

1       Introduction
1.1     Introduction to Machine Learning
1.2     Introduction to Email Spam Detection
1.3     Motivation
1.4     Aim & Objectives
2       Literature Survey
2.1     Study of Literature Survey
3       Methodology and Algorithms Used
3.1     Methodology (Formula Used, Example)
3.2     Algorithms
4       Performance Analysis
4.1     Performance Comparison of Classifiers
5       Outline and Future Scope
6       Conclusion
7       References

List of Figures

Figure No.  Figure Name

1.1         Bayes formula for conditional probability
1.2         Flowchart - Naïve Bayes example
1.3         Flowchart - Naïve Bayes

CHAPTER 1
Introduction

1.1 Introduction to Machine Learning

What is Machine Learning?


Machine learning is an application of artificial intelligence in which algorithms analyze data and
make decisions automatically, without human intervention. It describes how computers perform tasks
on their own by learning from previous experience. We can therefore say that, in machine learning,
artificial intelligence is built on the basis of experience.
There are three types of machine learning:
• Supervised learning
• Unsupervised learning
• Reinforcement learning

1.2 Introduction to Email Spam Detection

What is Email Spam Detection?

Spam emails are unwanted messages that are sent without the recipient's knowledge or agreement. Spam
is identified by email servers employing spam filter software, which analyses incoming emails based on a
variety of criteria. A program known as a spam filter is used to identify unsolicited, undesired, and virus-
infected emails and stop them from reaching a user's mailbox.

1.3 Motivation

Email spam causes many problems for users. Spam emails may harm the system, as they may contain
malware or links that lead to malicious sites.
They also waste the user's time and consume the memory space and CPU power of the system.

1.4 Aim and Objective(s) of the work

This seminar aims to study email spam detection using different machine learning algorithms,
most prominently the Naïve Bayes algorithm.
It also aims to examine how emails are classified into spam and ham and how spam can cause various problems.

CHAPTER 2
LITERATURE SURVEY

Electronic mail has given many organizations and people more convenient ways to communicate.
Spammers who send unsolicited emails take advantage of this technology to make fraudulent gains.
This chapter introduces techniques for spam email detection using machine learning algorithms, including
algorithms enhanced with bio-inspired techniques. A literature survey is conducted to investigate the
effective techniques.

The survey explores efficient methods applied to various datasets to produce successful outcomes.
An intensive study, including feature extraction and pre-processing, was conducted to build machine
learning models using Naive Bayes, Support Vector Machine, Random Forest, Decision Tree, and
Multi-Layer Perceptron on seven different email datasets.
Other work looks into how spam and ham email clusters can be identified using unsupervised learning.

STUDY OF LITERATURE SURVEY

Table 1.1
1. Paper title: An Unsupervised Approach for Content-Based Clustering of Emails Into Spam and Ham
   Through Multiangular Feature Formulation
   Publication and year: IEEE, 29 September 2021
   Authors: Asif Karim (Member, IEEE), Sami Azam (Member, IEEE), Bharanidharan Shanmugam, and
   Krishnan Kannoorpatti
   Findings: This research described a novel framework based entirely on unsupervised methodologies
   to separate ham from spam emails through unsupervised clustering (DBSCAN, OPTICS, K-means).

2. Paper title: Email Spam Detection Using Machine Learning Algorithms
   Publication and year: IEEE, 20 September 2020
   Authors: Nikhil Kumar, Sanket Sonowal, Nishant
   Findings: Spam email classification is very significant for categorizing emails and distinguishing
   spam from non-spam; Multinomial Naïve Bayes gives the best results.

3. Paper title: Detecting Spam Email With Machine Learning Optimized With Bio-Inspired Metaheuristic
   Algorithms
   Publication and year: IEEE, 13 October 2020
   Authors: Simran Gibson, Biju Issac (Senior Member, IEEE), Li Zhang (Senior Member, IEEE), and
   Seibu Mary Jacob (Member, IEEE)
   Findings: Multinomial Naïve Bayes (MNB) performed better than all the other algorithms. This
   research conducts experiments involving five different machine learning models with Particle Swarm
   Optimization (PSO) and Genetic Algorithm (GA).

4. Paper title: Efficient Clustering of Emails Into Spam and Ham: The Foundational Study of a
   Comprehensive Unsupervised Framework
   Publication and year: IEEE, 17 August 2020
   Authors: Asif Karim (Member, IEEE), Sami Azam (Member, IEEE), Bharanidharan Shanmugam, and
   Krishnan Kannoorpatti
   Findings: Unsupervised methodology to differentiate between ham and spam emails through clustering
   (OPTICS, Spectral and K-means).

CHAPTER 3
Methodology & Algorithms Used

3.1) Methodology :

A machine learning algorithm is used to learn classification rules from a set of training email
messages. Several studies have been carried out on machine learning techniques, and many of these
algorithms are being applied in the field of email spam filtering.

Examples of such algorithms include Naïve Bayes, Support Vector Machines, Neural Networks,
K-Nearest Neighbour, Decision Tree, and Random Forests.

Emails are classified as spam or not spam (ham) on the basis of the words present in them. Words and
phrases such as "offers", "extra income", "earn more" and "cash prizes" are commonly present in spam
emails. The probability of each word in the email is calculated, and the word is assigned a binary value
(spam or ham). When the words of an email consistently point to spam, the email is classified as spam.
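
The word-based idea above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any surveyed paper: the function name `word_presence_rates` and the toy emails are hypothetical, and the sketch only estimates how often each word appears in spam and in ham.

```python
from collections import Counter

def word_presence_rates(emails):
    """Return {word: fraction of the given emails that contain the word}."""
    counts = Counter()
    for text in emails:
        # Count each word at most once per email (presence, not frequency).
        counts.update(set(text.lower().split()))
    return {word: n / len(emails) for word, n in counts.items()}

# Hypothetical toy corpus, invented for illustration only.
spam_emails = ["earn extra income now", "cash prizes and offers", "earn cash now"]
ham_emails = ["meeting notes attached", "lunch now or later"]

spam_rates = word_presence_rates(spam_emails)
ham_rates = word_presence_rates(ham_emails)

# A word such as "cash" that is frequent in spam and absent from ham is a spam indicator.
for word in ("cash", "now", "meeting"):
    print(word, spam_rates.get(word, 0.0), ham_rates.get(word, 0.0))
```

Words that appear in both classes at similar rates (here "now") carry little evidence either way, which is why the rates are estimated separately for each class.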

Bayes formula for conditional probability:

P(A|B) = P(B|A) × P(A) / P(B)

Fig 1.1

Bayes' theorem is a technique for computing conditional probabilities, that is, the likelihood of one
event happening given that another has already happened. A conditional probability can lead to more
precise conclusions because it takes more conditions, or more data, into account.

As a result, conditional probabilities are essential for computing precise probabilities and predictions in
machine learning.
Bayes' theorem helps to calculate the probability of one event occurring, under uncertain knowledge,
given that another event has already occurred.
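
For spam filtering, the theorem can be specialised by taking one event to be "the email is spam" and the other to be "the email contains the word w". The symbol w and the two-class split into spam and ham are notation assumed here for illustration; the denominator is expanded with the law of total probability.

```latex
P(\text{spam} \mid w)
  = \frac{P(w \mid \text{spam})\, P(\text{spam})}
         {P(w \mid \text{spam})\, P(\text{spam}) + P(w \mid \text{ham})\, P(\text{ham})}
```

Each term on the right can be estimated from a set of emails that have already been sorted into spam and ham.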

Example

Let us consider 10,000 emails, of which 70% are spam and the remaining 30% are not spam (ham).
Suppose that the phrase “review us” is present in 5% of the spam mails and in 1% of the ham mails. Now,
assuming that we receive an email containing this phrase, we have to determine whether it is spam
or not.

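10,000 emails
    Spam: 7,000 (70%)
        contain “review us” (5%): 350
        do not contain “review us”: 6,650
    Ham: 3,000 (30%)
        contain “review us” (1%): 30
        do not contain “review us”: 2,970

Fig 1.2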

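Applying the formula, we get:

P(Word) = P(Word|spam) × P(spam) + P(Word|ham) × P(ham) = 0.05 × 0.7 + 0.01 × 0.3 = 0.038
P(spam|Word) = (0.05 × 0.7) / 0.038 ≈ 0.921

Equivalently, 350 of the 380 emails that contain the phrase are spam, and 350 / 380 ≈ 0.921.
Therefore the email can be classified as spam.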
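
The arithmetic of this example can be checked with a short Python sketch. The function name `posterior_spam` is hypothetical; the inputs are the figures stated in the example (70% spam, phrase present in 5% of spam and 1% of ham).

```python
def posterior_spam(p_spam, p_word_given_spam, p_word_given_ham):
    """Bayes' theorem: probability that an email is spam given that it contains the word."""
    p_ham = 1.0 - p_spam
    # Law of total probability: overall chance that an email contains the word.
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word

# Figures from the example: 70% spam, phrase in 5% of spam and 1% of ham.
p = posterior_spam(0.70, 0.05, 0.01)

# Cross-check with the email counts: 350 spam and 30 ham emails contain the phrase.
by_counts = 350 / (350 + 30)

print(round(p, 3), round(by_counts, 3))
```

Both routes give a posterior of about 0.921, so the probabilities and the counts in the flowchart agree.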

Algorithms Used

3.2) Naïve Bayes Algorithm

Naïve Bayes is a supervised machine learning algorithm in which word probabilities play the main role.
If some words occur often in spam but not in ham, then an incoming e-mail containing them is probably
spam. The Naïve Bayes classifier has become a very popular technique in mail filtering software. A
Bayesian filter must be trained to work effectively. Every word in its database has a certain probability
of occurring in spam or ham email. If the combined word probabilities for an e-mail exceed a certain
limit, the filter marks the e-mail as spam; otherwise it is marked as ham. The Naïve Bayes algorithm is
a probabilistic classifier based on the Bayes theorem of conditional probability.

Fig 1.3
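
A minimal Naïve Bayes text classifier can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the report: the toy training emails, the function names `train` and `classify`, add-one (Laplace) smoothing, and the use of log probabilities. The sketch also compares the spam and ham scores directly instead of testing one score against a fixed limit.

```python
import math
from collections import Counter

def train(emails, labels):
    """Count word occurrences per class and the number of emails per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter(labels)
    for text, label in zip(emails, labels):
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, class_counts, vocab

def classify(text, word_counts, class_counts, vocab):
    """Return the class with the higher log posterior score."""
    total_emails = sum(class_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Start from the log prior P(class).
        score = math.log(class_counts[label] / total_emails)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one smoothing keeps unseen word/class pairs from giving zero probability.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training data, invented for illustration only.
emails = ["earn cash prizes now", "extra income offers",
          "project meeting tomorrow", "lunch meeting notes"]
labels = ["spam", "spam", "ham", "ham"]

model = train(emails, labels)
print(classify("cash offers now", *model))
print(classify("meeting notes", *model))
```

Summing logarithms rather than multiplying raw probabilities avoids numerical underflow on long emails; the "naïve" part is the assumption that words are independent given the class.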

CHAPTER 4
Performance Analysis

4.1) Performance Comparison of Classifiers


Table No. 1.2

Sr.No  Algorithm                Score 1  Score 2  Score 3  Score 4
1.     SVM Classifier           0.81     0.92     0.95     0.92
2.     K-Nearest Neighbor       0.92     0.88     0.87     0.88
3.     Naive Bayes Classifier   0.87     0.98     0.98     0.98
4.     Decision Tree            0.94     0.95     0.93     0.95
5.     Random Forest            0.90     0.92     0.92     0.92

These scores were obtained from examples run with default parameters and with hyperparameter tuning.
From the table, the Naive Bayes classifier has the highest score in three of the four columns (0.98 for
Scores 2 to 4), while Decision Tree is highest on Score 1 (0.94). Naive Bayes is also easy to build and
implement, needs no iterative training process, and does not require much training data. It can
perform efficiently on large datasets and handles both discrete and continuous data.
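
The comparison can be reproduced from the table with a short Python sketch. The scores are copied from Table No. 1.2; averaging the four scores with equal weight is an assumption made here for illustration, since the report does not say what each score measures.

```python
# Scores copied from Table No. 1.2 (columns Score 1 to Score 4).
scores = {
    "SVM Classifier": [0.81, 0.92, 0.95, 0.92],
    "K-Nearest Neighbor": [0.92, 0.88, 0.87, 0.88],
    "Naive Bayes Classifier": [0.87, 0.98, 0.98, 0.98],
    "Decision Tree": [0.94, 0.95, 0.93, 0.95],
    "Random Forest": [0.90, 0.92, 0.92, 0.92],
}

# Equal-weight mean of the four scores (an assumption, see the note above).
mean_scores = {name: sum(vals) / len(vals) for name, vals in scores.items()}
best_overall = max(mean_scores, key=mean_scores.get)

# Best classifier for each individual score column.
best_per_column = [
    max(scores, key=lambda name, i=i: scores[name][i]) for i in range(4)
]

print(best_overall)
print(best_per_column)
```

Under this equal weighting Naive Bayes has the highest mean, with Decision Tree close behind and ahead on the first score column.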

CHAPTER 5
Outline and Future Scope

Outline:
The findings of this report are that email spam detection can be carried out with Naïve Bayes, K-Means,
Decision Tree, SVM and other techniques. Comparing these algorithms, we found that the most
suitable algorithm is Naïve Bayes.

Future Scope:
In future, an email spam detector can be built using various other machine learning algorithms such as
K-means and artificial neural networks (ANN). Email spam detection can also be improved using hybrid
deep learning algorithms consisting of CNN and RNN, by optimizing the hidden neurons, for improved
cyber security.

CHAPTER 6
CONCLUSION

• Thus, we have studied the Naive Bayes algorithm for detection of spam and ham emails and how
it is the most convenient method for email spam filtering.

• We have also learnt about various other machine learning algorithms, such as Support Vector
Machine (SVM), K-means clustering, Decision Tree and Random Forest.

REFERENCES

1. A. Karim, S. Azam, B. Shanmugam, K. Kannoorpatti, and M. Alazab, “A comprehensive survey
for intelligent spam email detection,” IEEE Access, vol. 7, pp. 168261–168295, 2019.
2. E. Bauer. 15 Outrageous Email Spam Statistics That Still Ring True in 2018, RSS. Accessed: Oct.
10, 2020. [Online].
3. R. M. Ravindran and D. A. S. Thanamani, “K-means document clustering using vector space
model,” Bonfring Int. J. Data Mining, vol. 5, no. 2, pp. 10–14, Jul. 2015.
4. S. Halder, R. Tiwari, and A. Sprague, “Information extraction from spam emails using stylistic
and semantic features to identify spammers,” in Proc. IEEE Int. Conf. Inf. Reuse Integr., Aug.
2011.
5. D. Hao, L. Zhang, J. Sumkin, A. Mohamed, and S. Wu, “Inaccurate labels in weakly-supervised
deep learning: Automatic identification and correction and their impact on classification
performance,” IEEE J. Biomed. Health Informat., vol. 24, no. 9, pp. 2701–2710, Sep. 2020.
