Professional Documents
Culture Documents
Email Spam Detection
Email Spam Detection
Shivani Verma
Dept. of Computer Science and Engineering
Lovely Professional University, India
Abstract— In today's digital landscape, many people heavily rely unwanted messages and safeguarding users against potential
on emails and messages sent by unfamiliar individuals. security threats.
Unfortunately, this convenience also opens the door for spammers
to flood our inboxes with irrelevant and often absurd messages.
This not only clutters our emails but also significantly slows down
our internet speeds. Moreover, these spam messages pose a threat
by potentially extracting valuable information from our contact
lists.Addressing this issue involves a challenging and significant
research endeavor: identifying these spammers and their spam
content. Email spamming involves sending bulk messages, placing
the burden of expense mostly on the recipient, essentially making
it a form of postage-due advertising. The economic viability of
spam email lies in its cost-effectiveness for the sender.
I. INTRODUCTION
In recent times, the internet has become an essential part
of our daily lives, leading to a surge in email users. However,
this increased reliance on email has brought about a
significant issue known as Spam - unsolicited bulk emails
primarily used for advertising purposes. Spam floods our
inboxes with unwanted messages that disrupt our online
experience.
1.TheUser
Additionally, when a message is received in the inbox, it gets
Management : exported to a dataset. This message is then processed by a Naïve
user who is using this for the very first time must Bayes Classifier to determine whether it qualifies as spam or not.
register, by using the website the user or the individual should However, before this classification, the model needs to undergo
get registered into it, by registering this will help to maintain training, a process explained in the subsequent section.
2.
separate account for each user. Registration of the user is must
before they log in. The user will login to the main page with
his registered name and password. Once the user successfully
login the authorized page will be displayed otherwise that
IV. SPAMDETECTION USING MACHINE LEARNING|
shows the error messages. Login is compulsory.
Login: The user will login to the main page with his registered
1. For training the algorithm dataset from Kaggle is used
name and password. Once the user successfully login the
which is shown below
authorized page will be displayed otherwise that shows the
error messages. Login is compulsory.
Registration: First time while using the website the user or the
individual should get registered into it, by registering this will
help to maintain separate account for each user. Registration
of the user is must before they log in.
2. Compose:
3. Inbox:
Fig.1. Dataset
Input: All incoming emails sent to an individual are collected
and displayed on this page. 2. It has many fields, some of these columns of the dataset are
Output: The receiver can open and read the emails received in not required. So remove some columns which are not
their inbox, sorted by date. required. We need to change the names of the columns.
4. Sent:
Input: The sender composes and sends an email to the
recipient.
Output: Sent emails can be viewed in this folder.
5. Trash:
Input: Unwanted emails are selected and deleted.
Fig.2. Classification dataset
5. We need to find out the most repeated words in the spam
and ham messages.So Word Cloud library is used.
With the help of NLTK (Natural Language Tool Kit) for the
text processing, Using Matplotlib you can plot graphs
, histogram and bar plot and all those things ,Word Cloud is
used to present text data and pandas for data manipulation
and analysis, NumPy is to do the mathematical and
scientific operation.
The packages used in the proposed model are shown below.
Fig.3.Packages
Fig.6.Spam word cloud
3. Split the data into training and testing sets as shown below.
Some percentage f the data set is used as train dataset and the
rest as a test dataset.
Fig.4.Train dataset
6. Then split up the text into small pieces and also removing
the punctuations. So the Tokenization process is used to
remove punctuations and splitting messages.
7. The Porter Stemming Algorithm is used for stemming.
Stemming is the process of reducing words to their root word.
9. Tf –idf(term frequency-inverse document frequency) has If “Thanx” is an exported message from the inbox to the
to be calculated. dataset then using Bayes’ theorem and Naive Bayes’
TF: Term Frequency, which measures how many times a Classifier, the above message is detected as Ham as shown
term occurs in a document. below.
TF(t) = (Number of times t appeared in a document) / (Total
terms in the document).
IDF: Inverse Document Frequency, which measures the
significance of the term.
IDF(t) = loge(Total documents / documents with term t in it).
10. See how well the model performed by evaluating
Naïve Bayes Classifier and showing the accuracy score.
V. RESULTS AND DISCUSSIONS
When we receive message in the inbox ,that message will be
exported to dataset as shown below. This message will be
detected as spam or not. Fig.10.Ham message
VI. CONCLUSION
Email are one of the most important tool and method
communication nowadays, Internet connectivity enables the
widespread delivery of messages across the globe. Daily, over
270 billion emails circulate, with around 57% categorized as
Fig.8. Exported Dataset spam. These unwanted emails, often commercial or malicious,
pose threats by targeting personal information such as financial
The exported message will be detected as spam or not using details, leading to potential harm for individuals,
Bayes’ theorem and Naive Bayes’ Classifier following all organizations, or groups. Apart from advertising, they may
the steps discussed above along with finding probability of contain harmful links to phishing or malware sites aimed at
words in spam and ham messages to detect it as spam or not. stealing sensitive data. Spam is not just a nuisance; it's
The below figures shows message which got detected as financially detrimental and poses security risks.
spam and ham.
If “Urgent! Please call 09062703810” is an exported message To address this issue, the system is designed to detect and
from the inbox to the dataset then based on trained dataset and prevent unsolicited and undesired emails. By reducing spam
using Bayes’ theorem and Naive Bayes’ Classifier, the above messages, it aims to benefit both individuals and organizations.
message is detected as Spam as shown below. In the future, this system can evolve by incorporating different
algorithms and additional features to further enhance its
effectiveness in combating spam.
REFERENCES