
Email Based Spam Detection

Shivani Verma
Dept. of Computer Science and Engineering
Lovely Professional University, India

Abstract— In today's digital landscape, many people rely heavily
on emails and messages sent by unfamiliar individuals.
Unfortunately, this convenience also opens the door for spammers
to flood our inboxes with irrelevant and often absurd messages.
This not only clutters our inboxes but also significantly slows down
our internet speeds. Moreover, these spam messages pose a threat
by potentially extracting valuable information from our contact
lists. Addressing this issue involves a challenging and significant
research endeavor: identifying these spammers and their spam
content. Email spamming involves sending bulk messages, placing
the burden of expense mostly on the recipient, essentially making
it a form of postage-due advertising. The economic viability of
spam email lies in its cost-effectiveness for the sender.

To combat this, the proposed model uses Bayes' theorem and the
Naive Bayes' Classifier to determine whether a specific message
qualifies as spam or not. Additionally, efforts are made to detect
the IP address of the sender.

Keywords— Term Frequency, Inverse Document Frequency, Natural
Language Toolkit.

I. INTRODUCTION
In recent times, the internet has become an essential part
of our daily lives, leading to a surge in email users. However,
this increased reliance on email has brought about a
significant issue known as spam: unsolicited bulk emails
primarily used for advertising purposes. Spam floods our
inboxes with unwanted messages that disrupt our online
experience.

Spam arises when our email addresses are shared on
unauthorized websites or used without our consent. Its effects
are widespread: inundating our inboxes, slowing down
internet speeds, and potentially compromising personal
information stored in our contact lists. Furthermore, it can
manipulate search results and consume valuable time,
causing immense frustration for recipients.

Identifying these spammers and the content they generate
is a challenging task. Despite numerous studies, effective
methods to distinguish spam from genuine messages are still
limited. Additionally, spam not only congests network
communication but also serves as a vehicle for various
attacks. These emails may contain harmful links leading to
phishing or malware sites designed to steal confidential data.

To combat this issue, various spam filtering techniques are
employed. These techniques aim to differentiate between
legitimate and spam emails, mitigating the impact of
unwanted messages and safeguarding users against potential
security threats.

II. LITERATURE SURVEY

Researchers highlighted various email header features
used to efficiently identify and classify spam messages.
These features were chosen based on their effectiveness in
detecting spam across major email providers like Yahoo
Mail, Gmail, and Hotmail. Their aim was to create a
universal mechanism for detecting spam in emails.

Another study introduced a novel approach that focused on
the frequency of word repetition in incoming emails. The
authors tagged key sentences containing keywords, determined
the grammatical roles of words in those sentences, and
assembled them into vectors to measure similarities
between emails. The K-Means algorithm was used for
email classification based on these vectors.

In some studies, authors discussed cyber attacks,
particularly how phishing and malicious tactics through
email can lead to financial loss and damage to one's social
reputation. They utilized Bayesian Classifiers by
considering individual words in emails, adapting
continually to new spam forms to identify and prevent
potential threats.

A system proposed in another paper [4] employed
machine learning techniques to detect repetitive keyword
patterns indicative of spam. This system also considered
various email parameters like Cc/Bcc, domain, and
header as features in the machine learning algorithm. It
utilized a pre-trained model with feedback mechanisms to
distinguish between accurate and ambiguous outputs,
presenting an alternative architecture for spam filtering.
Additionally, this paper incorporated email body analysis,
including commonly used keywords and punctuation.

These papers collectively aim to enhance spam
detection methods by leveraging diverse techniques,
ranging from analyzing email headers and word
repetition frequencies to employing machine learning
models and considering different email parameters and
content for efficient spam classification.

In the paper [6], the authors investigated the use of string
matching algorithms for spam email detection. The work
examines and compares the efficiency of six well-known string
matching algorithms, namely Longest Common Subsequence (LCS),
Levenshtein Distance (LD), Jaro, Jaro-Winkler, Bi-gram, and
TF-IDF, on two datasets: the Enron corpus and the CSDMC2010
spam dataset. They observed that the Bi-gram algorithm
performs best at spam detection on both datasets.

III. PROPOSED SYSTEM

The proposed system addresses the spam issue through a
spam classification system. It is challenging to manually
identify spam messages sent repeatedly by spammers. Hence,
the system employs strategies to detect spam, not only
recognizing spam words but also identifying the IP address
of the system from which the spam originated. By blacklisting
the IP address, the system can flag subsequent spam messages
from the same source.

In the proposed model, the web application is built using
.NET and spam detection is done using machine learning. The
web application consists of the following modules:

1. User Management:
Registration: A user who is using the website for the very
first time must register. Registration maintains a separate
account for each user and is required before logging in.
Login: The user logs in to the main page with the registered
name and password. On a successful login the authorized page
is displayed; otherwise an error message is shown. Login is
compulsory.

2. Compose:
Input: The sender creates a new email, adding the recipient's
address, subject, and message.
Output: The email is sent to the specified recipient's address.

3. Inbox:
Input: All incoming emails sent to an individual are collected
and displayed on this page.
Output: The receiver can open and read the emails received in
their inbox, sorted by date.

4. Sent:
Input: The sender composes and sends an email to the
recipient.
Output: Sent emails can be viewed in this folder.

5. Trash:
Input: Unwanted emails are selected and deleted.
Output: Deleted emails are moved to the trash bin, where they
are stored.

6. Voice Message:
Input: The sender sends an email in text format.
Output: The receiver accesses and reads the email using a voice
note.

7. Offline Notification:
Input: The sender sends an email.
Output: The receiver gets an offline notification in text format
via SMS.

8. Delete for Everyone:
Input: The sender deletes the sent email.
Output: The email is erased for both the sender and the receiver.

9. Read Message:
Input: The receiver reads the email.
Output: The sender receives a notification indicating that the
message has been read.

Additionally, when a message is received in the inbox, it is
exported to a dataset. This message is then processed by a Naïve
Bayes Classifier to determine whether it qualifies as spam or not.
However, before this classification, the model needs to undergo
training, a process explained in the subsequent section.

IV. SPAM DETECTION USING MACHINE LEARNING

1. For training the algorithm, a dataset from Kaggle is used,
as shown in Fig. 1.

Fig.1. Dataset

2. The dataset has many fields, and some of its columns are not
required. The columns that are not required are removed and the
remaining columns are renamed, as shown in Fig. 2.

Fig.2. Classification dataset
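
As a rough illustration of steps 1 and 2, the sketch below reads a small CSV export, keeps only a label column and a message column, and renames them. It uses only the Python standard library for portability, whereas the paper's own pipeline relies on pandas. The raw column names (`v1`, `v2`, `unused_a`, `unused_b`), the third sample row, and the helper `load_messages` are hypothetical and are not taken from the Kaggle dataset used in the paper.

```python
import csv
import io

# Hypothetical raw export: the column names and the last row are made up
# for illustration; only the first two message texts appear in the paper.
RAW_CSV = """v1,v2,unused_a,unused_b
ham,Thanx,,
spam,Urgent! Please call 09062703810,,
ham,See you at lunch,,
"""

# Raw column name -> clearer name. Columns not listed here are dropped.
RENAME = {"v1": "label", "v2": "message"}


def load_messages(csv_text):
    """Return a list of {'label': ..., 'message': ...} dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for raw in reader:
        # Keep only the required columns, under their new names.
        rows.append({new: raw[old] for old, new in RENAME.items()})
    return rows


messages = load_messages(RAW_CSV)
for row in messages:
    print(row["label"], "|", row["message"])
```

With pandas, the same effect would typically be obtained by dropping the unneeded columns and renaming the rest; the dictionary-based version above is only meant to make the two operations explicit.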
The packages used in the proposed model are shown in Fig. 3.
NLTK (Natural Language Toolkit) is used for text processing,
Matplotlib for plotting graphs such as histograms and bar plots,
WordCloud for presenting text data, pandas for data manipulation
and analysis, and NumPy for mathematical and scientific
operations.

Fig.3. Packages

3. Split the data into training and testing sets, as shown in
Fig. 4. A percentage of the dataset is used as the training
set and the rest as the test set.

Fig.4. Train dataset

4. Reset the train and test indices, as shown in Fig. 5.

To find the most repeated words in the spam and ham messages,
the WordCloud library is used, as shown in Fig. 6 and Fig. 7.

Fig.6. Spam word cloud

Fig.7. Ham word cloud
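
Steps 3 and 4 (splitting the data and resetting the indices) can be sketched as follows. This is a standard-library approximation rather than the paper's code: the 25% test fraction, the fixed seed, the toy rows, and the helper names `train_test_split` and `reset_index` are all assumptions, since the paper only states that some percentage of the data is held out.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def reset_index(rows):
    """Renumber rows from 0, analogous to resetting a DataFrame index."""
    return [dict(row, index=i) for i, row in enumerate(rows)]


# Toy rows, made up for illustration.
rows = [{"label": "ham", "message": "message %d" % i} for i in range(8)]

train, test = train_test_split(rows)
train, test = reset_index(train), reset_index(test)

print(len(train), "training rows,", len(test), "test rows")
print([row["index"] for row in train])
```

Resetting the index after the split matters because the shuffled subsets otherwise keep their original row numbers, which makes positional access error-prone.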

5. Whenever a message arrives, the input must first be
preprocessed. All input characters are converted to lowercase.

6. The text is then split into small pieces and the punctuation
is removed. This tokenization process both removes punctuation
and splits the message into tokens.

7. The Porter Stemming Algorithm is used for stemming.
Stemming is the process of reducing words to their root form.
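
The preprocessing in steps 5 to 7 can be sketched as below. The regular expression used for tokenization and the helper names `tokenize` and `preprocess` are assumptions. NLTK's `PorterStemmer` is used when NLTK is installed; if it is not, the sketch falls back to leaving tokens unstemmed, which is a stand-in and not the Porter algorithm.

```python
import re

try:
    # NLTK's implementation of the Porter stemming algorithm.
    from nltk.stem import PorterStemmer
    _stem = PorterStemmer().stem
except ImportError:
    # Fallback so the sketch still runs without NLTK: no stemming at all.
    def _stem(word):
        return word


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens.

    Punctuation is discarded because it never matches the pattern.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def preprocess(text):
    """Lowercase, tokenize, and stem a message."""
    return [_stem(token) for token in tokenize(text)]


print(tokenize("Urgent! Please call 09062703810"))
print(preprocess("Urgent! Please call 09062703810"))
```

Note that this simple pattern splits contractions such as "don't" into two tokens; a production tokenizer would usually handle such cases more carefully.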

8. We need to find the probability of each word in the spam
and ham messages.
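
Step 8 can be illustrated with a small Naive Bayes sketch. The frequency equations referred to in this section (Eqn. 1 and Eqn. 2) are not reproduced in the text, so the formula below is an assumption: it estimates P(word | class) from word counts with add-one (Laplace) smoothing, which is a common choice but not necessarily the one used in the paper. The training messages are toy data made up for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

# Toy, already-preprocessed training data (made up for illustration).
TRAINING = [
    ("spam", ["urgent", "call", "now"]),
    ("spam", ["win", "cash", "call"]),
    ("ham", ["thanx", "for", "lunch"]),
    ("ham", ["see", "you", "at", "lunch"]),
]


def train(labeled_tokens):
    """Count words per class and messages per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for label, tokens in labeled_tokens:
        word_counts[label].update(tokens)
        doc_counts[label] += 1
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def word_probability(word, label, word_counts, vocab):
    """P(word | label) with add-one smoothing (an assumed formula)."""
    total = sum(word_counts[label].values())
    return (word_counts[label][word] + 1) / (total + len(vocab))


def classify(tokens, word_counts, doc_counts, vocab):
    """Return the class with the larger log posterior."""
    n_docs = sum(doc_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        score = math.log(doc_counts[label] / n_docs)  # class prior
        for word in tokens:
            if word in vocab:  # words never seen in training are skipped
                score += math.log(
                    word_probability(word, label, word_counts, vocab))
        scores[label] = score
    return max(scores, key=scores.get)


word_counts, doc_counts, vocab = train(TRAINING)
print(classify(["urgent", "please", "call", "09062703810"],
               word_counts, doc_counts, vocab))
print(classify(["thanx"], word_counts, doc_counts, vocab))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a word that appears in only one class from forcing the other class's probability to zero.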

Fig.5. Reset train and test index


Eqn.1. Frequency of word

Then the spam word frequency is calculated as follows:

Eqn.2. Spam word frequency
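
The TF and IDF definitions used in step 9 of this procedure can be worked through on a toy corpus as follows. The corpus and the function names `tf`, `idf`, and `tf_idf` are made up for illustration; the formulas follow the definitions given in the text, with the natural logarithm for IDF.

```python
import math

# Toy corpus of preprocessed messages (made up for illustration).
CORPUS = [
    ["urgent", "call", "now"],
    ["call", "me", "later"],
    ["thanx"],
]


def tf(term, doc_tokens):
    """TF(t) = occurrences of t in the document / total terms in it."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """IDF(t) = ln(total documents / documents containing t).

    Assumes the term occurs in at least one document; otherwise the
    division below raises ZeroDivisionError.
    """
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)


def tf_idf(term, doc_tokens, corpus):
    """Weight of a term in one document: TF(t) * IDF(t)."""
    return tf(term, doc_tokens) * idf(term, corpus)


print(tf("call", CORPUS[0]))
print(idf("call", CORPUS))
print(tf_idf("thanx", CORPUS[2], CORPUS))
```

A term such as "call", which appears in two of the three messages, receives a lower IDF than "thanx", which appears in only one, so rarer terms contribute more to the final weight.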


9. TF-IDF (term frequency-inverse document frequency) has
to be calculated.
TF: Term Frequency, which measures how many times a
term occurs in a document.
TF(t) = (Number of times t appears in a document) / (Total
terms in the document).
IDF: Inverse Document Frequency, which measures the
significance of the term.
IDF(t) = log_e(Total documents / Documents containing term t).

10. Evaluate how well the model performs by testing the
Naïve Bayes Classifier and reporting the accuracy score.

V. RESULTS AND DISCUSSIONS

When a message is received in the inbox, it is exported to
the dataset, as shown in Fig. 8. The message is then classified
as spam or not.

Fig.8. Exported Dataset

The exported message is classified as spam or not using
Bayes’ theorem and the Naive Bayes’ Classifier, following all
the steps discussed above, including finding the probability of
words in spam and ham messages. Fig. 9 and Fig. 10 show
messages that were detected as spam and ham.

If “Urgent! Please call 09062703810” is a message exported
from the inbox to the dataset, then, based on the trained dataset
and using Bayes’ theorem and the Naive Bayes’ Classifier, the
message is detected as spam, as shown in Fig. 9.

Fig.9. Spam Message

If “Thanx” is a message exported from the inbox to the
dataset, then, using Bayes’ theorem and the Naive Bayes’
Classifier, the message is detected as ham, as shown in Fig. 10.

Fig.10. Ham message

The IP address of the sender can also be detected, as shown in
Fig. 11.

Fig.11. IP address of the sender

VI. CONCLUSION

Email is one of the most important tools and methods of
communication today, and Internet connectivity enables the
widespread delivery of messages across the globe. Daily, over
270 billion emails circulate, with around 57% categorized as
spam. These unwanted emails, often commercial or malicious,
pose threats by targeting personal information such as financial
details, leading to potential harm for individuals,
organizations, or groups. Apart from advertising, they may
contain harmful links to phishing or malware sites aimed at
stealing sensitive data. Spam is not just a nuisance; it is
financially detrimental and poses security risks.

To address this issue, the system is designed to detect and
prevent unsolicited and undesired emails. By reducing spam
messages, it aims to benefit both individuals and organizations.
In the future, this system can evolve by incorporating different
algorithms and additional features to further enhance its
effectiveness in combating spam.

REFERENCES

[1] Mohammed Reza Parsei, Mohammed Salehi “E-Mail Spam
Detection Based on Part of Speech Tagging” 2nd International
Conference on Knowledge Based Engineering and Innovation
(KBEI), 2015.
[2] Shukor Bin Abd Razak, Ahmad Fahrulrazie Bin Mohamad
“Identification of Spam Email Based on Information from Email
Header” 13th International Conference on Intelligent Systems
Design and Applications (ISDA), 2013.
[3] Sunil B. Rathod, Tareek M. Pattewar “Content Based Spam
Detection in Email using Bayesian Classifier”, presented at the
IEEE ICCSP 2015 conference.
[4] Aakash Atul Alurkar, Sourabh Bharat Ranade, Shreeya Vijay
Joshi, Siddhesh Sanjay Ranade, Piyush A. Sonewa, Parikshit N.
Mahalle, Arvind V. Deshpande “A Proposed Data Science
Approach for Email Spam Classification using Machine Learning
Techniques”, 2017.
[5] Kriti Agarwal, Tarun Kumar “Email Spam Detection using
integrated approach of Naïve Bayes and Particle Swarm
Optimization”, Proceedings of the Second International
Conference on Intelligent Computing and Control Systems
(ICICCS), 2018.
[6] Cihan Varol, Hezha M.Tareq Abdulhadi “Comparison of String
Matching Algorithms on Spam Email Detection”, International
Congress on Big Data, Deep Learning and Fighting Cyber
Terrorism Dec, 2018.
