
FRAUD EMAIL DETECTION USING MACHINE LEARNING

PROJECT REPORT
SUBMITTED IN PARTIAL FULFILLMENT FOR THE REQUIREMENT FOR
THE AWARD OF THE DEGREE OF

BACHELOR OF TECHNOLOGY
(Computer Science & Engineering)

UNDER THE ESTEEMED GUIDANCE OF


ER. PRIYANKA CHANDEL
(Assistant Professor)

SUBMITTED BY
Akash Thakur

(UNIVERSITY ROLL NO. 1914031003)

KC GROUP OF INSTITUTIONS, PANDOGA, UNA, HP

HIMACHAL PRADESH TECHNICAL UNIVERSITY, HAMIRPUR, INDIA

CANDIDATE’S DECLARATION

I hereby certify that the work presented in the thesis entitled “Fraud Email
Detection Using Machine Learning” by “AKASH THAKUR”, in partial fulfillment of the
requirements for the award of the Bachelor of Technology in Computer Science and
Engineering and submitted to the Department of Computer Science & Engineering of
KC Group of Institutions, Pandoga, Una, under Himachal Pradesh Technical
University, Hamirpur, is an authentic record of my own work carried out during the period
from September 2022 to January 2023 under the supervision of Er. PRIYANKA
CHANDEL, Assistant Professor, Computer Science & Engineering Department.
The matter presented in this thesis has not been submitted by me to any other University /
Institute for the award of the B.Tech. degree.

(AKASH THAKUR)
University Roll no.: 1914031003

This is to certify that the above statement made by the candidate is correct to the best
of my/our knowledge.

Signature of the SUPERVISOR (S)

The B.Tech Viva-Voce Examination of AKASH THAKUR has been held on and
accepted

Signature of Supervisor (S) Signature of External Examiner

Signature of H.O.D.

ACKNOWLEDGEMENT

I express my sincere gratitude to the Himachal Pradesh Technical University, Hamirpur,
for giving me the opportunity to work on the project during my final year of B.Tech.
Project work is an important aspect in the field of engineering.
I would like to thank the Director, Dean Academics and members of the Departmental
Research Committee for their valuable suggestions and healthy criticism during my
presentation of the work. I also owe my sincerest gratitude to Er. Priya Mankotia,
HOD (CSE), for his valuable advice throughout my thesis, which helped me immensely
to complete my work successfully.
It gives me immense pleasure and profound privilege to express my gratitude along with
sincere thanks to Er. Priyanka Chandel, Assistant Professor, CSE department for giving
me the opportunity to work in this area. I am grateful for her continual support,
encouragement, and invaluable suggestions. She not only provided help whenever it was
needed, but also the resources required to complete this project on time. I would like to
place on record my deep sense of gratitude for her generous guidance, help and useful
suggestions.
In the end, I wish to express my deep sense of gratitude to my family, for supporting and
encouraging me at every step of my work. It is the power of their blessings, which has
given me the courage, confidence and push for hard work.
I also wish to extend my special thanks to my parents for their insightful comments and
constructive suggestions to improve the quality of this research work.

Akash Thakur

Roll no.:- 1914031003

B.Tech (Computer Science & Engg.)

ABSTRACT
Electronic mail (e-mail) is one of the most widely used services on the Internet. It allows
an Internet user to send a formatted message (mail) to another Internet user in any
part of the world. A mail message can contain not only text but also images, audio and
video. The person who sends the mail is called the sender and the person who receives it is
called the recipient, just as in the postal mail service.

MALICIOUS EMAIL
Malicious emails, phishing emails in particular, are one of the greatest threats in cyber
security. They target not only large enterprises, but also small businesses, individuals and
everyone in between. The reason for this lies in their simplicity, along with the methods
attackers use to trick users into opening attachments, entering personal details and
clicking on malicious links. Malicious email attachments are designed to launch an
attack on a user’s computer. The attachments within these malicious emails can be
disguised as documents, PDFs, e-files, and voicemails. Attackers attach files to
emails that can install malware capable of destroying data and stealing information. Some
of these infections allow the attacker to take control of the user’s computer, giving
the attacker access to the screen, the ability to capture keystrokes, and access to other
network systems.
Since many email systems automatically block obvious malicious programs, attackers
conceal a piece of software called an exploit inside other types of commonly emailed
files – Microsoft Word documents, ZIP or RAR files, Adobe PDF documents, or even
image and video files. The exploit takes advantage of software vulnerabilities and then
downloads the intended malicious software, called a payload, to the computer. Attackers
can also embed a malicious macro in the document and use social engineering to trick
the user into clicking the “Enable Content” button, which allows the macro to run and
infect the victim’s computer. Attackers typically send these email attachments with
email content that is sufficiently convincing to make the user believe it is a
legitimate communication.

LIST OF FIGURES

Figure 1.1: Components of Email
Figure 4.1.1: System architecture
Figure 4.2.1: Flowchart

LIST OF TABLES

Table 4.1: Simulation Tools

TABLE OF CONTENTS

Title page
Candidate’s declaration
Acknowledgement
Abstract
List of Figures

CHAPTER-1 INTRODUCTION
CHAPTER-2 LITERATURE SURVEY
CHAPTER-3 PROBLEM STATEMENT
CHAPTER-4 METHODOLOGY AND PROPOSED WORK
4.1 OBJECTIVES
4.2 SIMULATION TOOLS
4.3 PROPOSED SYSTEM
4.4 PROPOSED FLOWCHART
CHAPTER-5 IMPLEMENTATION AND ANALYSIS
5.1 CODE IN PYTHON
REFERENCES

CHAPTER 1
INTRODUCTION

Electronic mail (e-mail) is one of the most widely used services on the Internet. It allows
an Internet user to send a formatted message (mail) to another Internet user in any
part of the world. A mail message can contain not only text but also images, audio and
video. The person who sends the mail is called the sender and the person who receives it is
called the recipient, just as in the postal mail service.

Components of E-Mail System: The basic
components of an email system are the User Agent (UA), the Message Transfer Agent (MTA), the
mailbox, and the spool file. These are explained below.

1. User Agent (UA): The UA is normally a program used to send and receive
mail. It is sometimes called a mail reader. It accepts a variety of commands for
composing, receiving and replying to messages, as well as for manipulating
mailboxes.

2. Message Transfer Agent (MTA): The MTA is responsible for transferring mail
from one system to another. To send mail, a system must have a client MTA and a
system MTA. The MTA delivers mail to the recipients’ mailboxes if they are on the
same machine, and to a peer MTA if the destination mailbox is on another
machine. Delivery from one MTA to another is done using the Simple Mail
Transfer Protocol (SMTP).

3. Mailbox: A file on the local hard drive that collects mail. Delivered mail is stored in
this file. The user can read or delete it as required. To use the e-mail
system, each user must have a mailbox. Only the owner of a mailbox has access to
it.
4. Spool file: This file contains mail that is waiting to be sent. The user agent appends outgoing
mail to this file using SMTP, and the MTA extracts pending mail from the spool file for
delivery.

E-mail allows one name, an alias, to represent several different e-mail
addresses. This is known as a mailing list. Whenever a user sends a message, the system
checks the recipient’s name against the alias database. If a mailing list exists for the
alias, separate messages, one for each entry in the list, are prepared and handed to the
MTA. If no mailing list exists for the alias, the name itself becomes the
delivery address and a single message is handed to the mail transfer entity.
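
The alias lookup described above can be sketched in a few lines of Python. This is only an illustration: the alias database, the addresses and the function names are hypothetical and not part of any real mail system.

```python
# Hypothetical alias database: alias name -> list of member addresses.
ALIASES = {
    "cse-faculty": ["hod@example.edu", "supervisor@example.edu"],
}

def expand_recipient(name, aliases=ALIASES):
    """Return the list of addresses a message to `name` must be delivered to."""
    # A known alias expands to one address per mailing-list entry;
    # anything else is treated as the delivery address itself.
    return list(aliases.get(name, [name]))

def prepare_messages(recipient, body):
    """Build one (address, body) pair per expanded address, ready for the MTA."""
    return [(address, body) for address in expand_recipient(recipient)]

print(prepare_messages("cse-faculty", "Viva at 10 am"))
print(prepare_messages("student@example.edu", "Viva at 10 am"))
```

A name found in the alias database yields one message per list entry, while any other name yields a single message addressed to that name.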

Services provided by E-mail system:

• Composition – The process of creating messages and replies.
Any kind of text editor can be used for composition.

• Transfer – The procedure of sending mail from the sender to the
recipient.

• Reporting – Confirmation of mail delivery. It helps users
check whether their mail was delivered, lost or rejected.

• Displaying – Presenting mail in a form that the user can understand.

• Disposition – What the recipient does with the mail after
receiving it, i.e., save it, delete it before reading, or delete it after reading.

CHAPTER 2
LITERATURE SURVEY
HISTORY OF EMAIL: -
The history of email entails an evolving set of technologies and standards that culminated in
the email systems in use today.

Computer-based messaging between users of the same system became possible following the
advent of time-sharing in the early 1960s, with a notable implementation by MIT's CTSS
project in 1965. Informal methods of using shared files to pass messages were soon expanded
into the first mail systems. Most developers of early mainframes and minicomputers
developed similar, but generally incompatible, mail applications. Over time, a complex web
of gateways and routing systems linked many of them. Some systems also supported a form
of instant messaging, where sender and receiver needed to be online simultaneously.

In 1971 the first ARPANET network mail was sent, introducing the now-familiar address
syntax with the '@' symbol designating the user's system address. Over a series of RFCs,
conventions were refined for sending mail messages over the File Transfer Protocol. Several
other email networks developed in the 1970s and expanded subsequently.

Proprietary electronic mail systems began to emerge in the 1970s and early 1980s. IBM
developed a primitive in-house solution for office automation over the period 1970–1972,
and replaced it with OFS (Office System), providing mail transfer between individuals, in
1974. This system developed into IBM Profs, which was available on request to customers
before being released commercially in 1981. CompuServe began offering electronic mail
designed for intraoffice memos in 1978. The development team for the Xerox Star began
using electronic mail in the late 1970s. Development work on DEC's ALL-IN-1 system began
in 1977, and the system was released in 1982. Hewlett-Packard launched HPMAIL (later HP Desk
Manager) in 1982, which became the world's largest selling email system.

The Simple Mail Transfer Protocol (SMTP) was implemented on the ARPANET in
1983. LAN email systems emerged in the mid-1980s. For a time in the late 1980s and early
1990s, it seemed likely that either a proprietary commercial system or the X.400 email
system, part of the Government Open Systems Interconnection Profile (GOSIP), would
predominate. However, once the final restrictions on carrying commercial traffic over the
Internet ended in 1995, a combination of factors made the current Internet suite of SMTP,
POP3 and IMAP email protocols the standard.

During the 1980s and 1990s, use of email became common in business, government,
universities, and defence/military industries. Starting with the advent of webmail (the
web-era form of email) and email clients in the mid-1990s, use of email began to extend to the rest
of the public. By the 2000s, email had gained ubiquitous status. The popularity of
smartphones since the 2010s has enabled instant access to emails.

CHAPTER 3
PROBLEM FORMULATION
Most Common Types of Email Fraud: -
1. Phishing: -

Phishing is a malicious technique used by cyber criminals to gather sensitive information
(credit card data, usernames and passwords, etc.) from users. The attackers pretend to be a
trustworthy entity to bait the victims into trusting them and revealing their confidential data.
The data gathered through phishing can be used for financial theft, identity theft, to gain
unauthorized access to the victim’s accounts or to accounts they have access to, to blackmail
the victim and more.

According to Check Point’s Brand Phishing Report for Q3 2020, email phishing was the most
common type of brand phishing attack, accounting for 44% of attacks. The brands most
often impersonated by attackers in phishing messages were Microsoft, DHL, and Apple.

2. Spoofing: -

This is a compromise attempt during which an unauthorized individual tries to gain access to
an information system by impersonating an authorized user. For example, email spoofing is
when cyber attackers send phishing emails using a forged sender address. You might believe
that you’re receiving an email from a trusted entity, which causes you to click on the links in
the email, but the link may end up infecting your PC with malware.

Normally, email spoofing attacks are emails that appear to come from a genuine email
address when they were actually sent by malicious actors whose ultimate purpose is to trick
you into opening the message and downloading a corrupted attachment. What’s more, email
spoofing can turn into elaborate BEC schemes that can take months to unfold and often lead
to huge financial and data losses.

Because the Simple Mail Transfer Protocol (SMTP) does not establish a mechanism for
address authentication, email spoofing is still very common. While protocols and
methods for email address authentication have been developed to combat this type of email
fraud, the implementation of such frameworks seems to be moving slowly.
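
One simple symptom of a forged sender address can be checked with Python's standard `email` package. The sketch below compares the domain of the visible From header with the domain of the Return-Path header. It is an illustrative heuristic under stated assumptions, not an authentication mechanism: the sample message and the function name are made up, and legitimate mail sent through third-party mailing services can also show a mismatch. Real address authentication relies on frameworks such as SPF, DKIM and DMARC.

```python
from email import message_from_string
from email.utils import parseaddr

def sender_domains_differ(raw_message):
    """Return True when the From and Return-Path domains do not match."""
    msg = message_from_string(raw_message)
    from_addr = parseaddr(msg.get("From", ""))[1]
    return_path = parseaddr(msg.get("Return-Path", ""))[1]
    from_domain = from_addr.rpartition("@")[2].lower()
    return_domain = return_path.rpartition("@")[2].lower()
    # Only flag a mismatch when both headers are actually present.
    return bool(from_domain and return_domain and from_domain != return_domain)

# Made-up message: the visible sender claims to be a bank,
# but bounces go to an unrelated domain.
sample = (
    "Return-Path: <bounce@mailer.example.net>\n"
    "From: Support <support@bank.example.com>\n"
    "Subject: Verify your account\n"
    "\n"
    "Please confirm your password.\n"
)
print(sender_domains_differ(sample))
```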

3. Business Email Compromise (BEC): -

Business Email Compromise (BEC) is a type of targeted fraud in which a threat actor
pretends to be a company executive or high-level employee in order to defraud or collect
confidential information from the organization or its partners. The main objective of a BEC
scam is to try and convince the potential victim to transfer money or personal data to the

What are the Risks?
Don’t be fooled! These are fraudulent communications that in most cases have nothing to do
with the institution they claim to be affiliated with. Opening, replying, or clicking the links
provided in these emails poses a serious security risk to you and the campus network.

Some of the risks involved are:

1. Identity theft: Once you provide your personal information in response to a phishing
attempt, this information can be used to access your financial accounts, make
purchases, or secure loans in your name.
2. Virus infections: Some fraudulent emails include links or attachments that, once
clicked, download malicious software to your computer. Others may also install
keystroke loggers that record your computer activity.
3. Loss of personal data: Some phishing attacks will attempt to deploy crypto malware
on your machine, malicious software that encrypts files on a victim’s computer and
denies owners access to their files until they pay a ransom.
4. Compromising institutional information: If your university IT account is
compromised, scammers may be able to access sensitive institutional information and
research data.
5. Putting friends and family at risk: If your personal information is accessed,
attackers will scan your accounts for personal information about your contacts and
will in turn attempt to phish for their sensitive information. Phishers may also send
emails and social media messages from your accounts in an attempt to gain
information from your family, friends, and colleagues.

How to Recognize Fraud Emails: -


To identify a fraudulent email, you need to keep an eye out for a few elements that I’ve listed
below:

1. Email address: -

You should always check the email header and the “from” address to identify the sender and
find out where the message was really sent from.

2. Logo: -

While a phishing email may contain the actual logo of the alleged company, fraudulent
emails may use one that appears stretched or distorted.

3. Email greeting: -

Some emails may not address the member by name. Or, there may be no name mentioned at
all.

4. Spelling: -

When checking an email, you should always look high and low for misspellings, grammatical
mistakes, or punctuation errors that can help identify phishing emails.

5. Legitimacy: -

Another common phishing technique is to include supposedly legitimate links in the email’s
body to look like they redirect to a legit website. If you take a closer look, you’ll realize that
the link in question may actually redirect you to a corrupted website that has nothing to do
with the company the email is pretending to be from. Always check the legitimacy of the
links – you can easily do that by pointing the mouse cursor over it. When it comes to mobile
devices, extra care needs to be taken when clicking on email links. Always check the site by
verifying the website address in the address bar.
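
The link check in point 5 can be partly automated. The sketch below uses only the Python standard library to list links whose visible text looks like a web address but whose actual target is on a different host. The class and function names and the sample HTML are hypothetical, and the rule is a rough heuristic that misses many disguises (for example, links hidden behind the text "click here").

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def _host(url):
    """Host part of a URL; also accepts text such as 'www.example.com/login'."""
    return (urlparse(url if "://" in url else "http://" + url).hostname or "").lower()

def mismatched_links(html):
    """Return links whose visible text names a different host than the href."""
    parser = LinkCollector()
    parser.feed(html)
    suspicious = []
    for href, text in parser.links:
        looks_like_url = text.lower().startswith(("http://", "https://", "www."))
        if looks_like_url and _host(text) != _host(href):
            suspicious.append((href, text))
    return suspicious

body = '<p>Log in at <a href="http://evil.example.net/login">www.mybank.example.com</a></p>'
print(mismatched_links(body))
```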

How to Prevent Email Fraud: -


Taking all the necessary steps to ensure the safety of your email accounts against attacks and
impede all unauthorized access is crucial for you and your users. How do you ensure email
fraud protection for your organization? I have a few suggestions below: -

1. Conducting regular phishing attack tests: -

Your staff are your greatest protection against email threats, especially when it comes to
phishing attacks, be they simple or more sophisticated, such as spear phishing. Staff who
have learned to recognize phishing attempts can avoid this significant risk of endpoint
compromise.

2. Filtering spam: -

Since most email scams begin with unsolicited emails, you should consider taking the
necessary steps to prevent spam from getting into your inbox. Most email apps and services
include spam-filtering features, which you can configure to filter
spam.

3. Using multifactor authentication: -

In case the passwords of an email account are successfully compromised, multifactor
authentication will deter malicious hackers from accessing the account and severely affecting
your business.

4. Blocking email auto-forwarding: -

Blocking email auto-forwarding will make it harder for threat actors to gain access to your
corporate or personal email accounts.

5. Using antivirus software and keeping it updated: -

Installing antivirus software on your computer should be mandatory. If it has an automatic
update and an email scanning feature, even better, as they will make sure your protection
against viruses is always up-to-date.

6. Configuring your email client: -

There are several ways you can configure your email client to make yourself less vulnerable to
email fraud. For example, you can configure your email program to view your emails as
“text-only”. This will protect you from scams that abuse email HTML.

7. Using email security software: -

When it comes to securing the content of your emails and preventing them from being read
by other parties, secure encrypted email is always a good idea. However, this practice alone is
not enough. Therefore, you also need to consider an integrated cybersecurity solution, able to
detect basic and advanced forms of email attacks.

How and Where to Report Email Fraud: -


If you have identified a fraudulent email, there are multiple ways you can report it:

• Forward the suspicious email to your IT admin or cybersecurity team and let them
know your concerns.
• If you’re receiving emails in the name of a certain company, reach
out to them by forwarding the suspicious email and let them know about the scam.
• Notify the Internet Crime Complaint Center (IC3).
• Forward phishing emails to the Anti-Phishing Working Group (APWG) at
reportphishing@apwg.org, or to the U.S. Federal Trade Commission at spam@uce.gov.
• Report scams to your state consumer protection office.
• Report Social Security Administration (SSA) imposters online to SSA’s Inspector
General.
• Report Internal Revenue Service (IRS) imposters to the Treasury Inspector
General for Tax Administration (TIGTA), at 1-800-366-4484.

CHAPTER 4
METHODOLOGY AND PROPOSED WORK
4.1 Objectives: -
Spam messages refer to unsolicited or unwanted messages/emails that are sent in bulk to
users. In most messaging/emailing services, messages are detected as spam automatically so
that these messages do not unnecessarily flood the users’ inboxes. These messages are
usually promotional and peculiar in nature. Thus, it is possible for us to build ML/DL models
that can detect Spam messages.

4.2 SIMULATION TOOLS: -

Table 4.1: Simulation Tools

HARDWARE
  COMPUTER    HP laptop, i5 7th gen
  RAM         4 GB
  OTHER       Keyboard, Mouse

SOFTWARE
  PLATFORM    Python, Jupyter Notebook

4.3 PROPOSED SYSTEM: -

Detecting Spam Emails Using TensorFlow in Python: -

In this project, we build a TensorFlow-based spam detector; in simpler terms, we
classify texts as Spam or Ham. Spam detection is therefore a text
classification problem, so we perform exploratory data analysis (EDA) on our dataset and
build a text classification model.

Importing Libraries and Dataset: -

Python libraries make it very easy for us to handle the data and perform typical and complex
tasks with a single line of code: -

• Pandas – This library loads the data into a 2D data frame and has
multiple functions to perform analysis tasks in one go.
• NumPy – NumPy arrays are very fast and can perform large computations in a very
short time.
• Matplotlib/Seaborn/WordCloud – These libraries are used to draw visualizations.
• NLTK – The Natural Language Toolkit provides various functions to process raw
textual data.

Text Pre-processing: -

Textual data is highly unstructured and needs attention in several respects, such as:

• Stopword removal
• Punctuation removal
• Stemming or lemmatization

Although removing data means losing some information, we need to do this to put the data
into a form suitable for a machine learning model.
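
The three preprocessing steps can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions: the stop-word set and the suffix list are tiny hand-made stand-ins, and the suffix stripping is far cruder than the stemmers and lemmatizers NLTK provides.

```python
import string

# Tiny illustrative stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "you", "your"}
SUFFIXES = ("ing", "ed", "es", "s")  # crude suffix list, not the Porter algorithm

def strip_suffix(word):
    """Remove the first matching suffix if a stem of 3+ letters remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lower-case, drop punctuation and stop words, then stem each token."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in no_punct.split() if t not in STOP_WORDS]
    return [strip_suffix(t) for t in tokens]

print(preprocess("You are selected to receive amazing prizes!"))
# prints ['select', 'receive', 'amaz', 'priz']
```

The output shows the information loss mentioned above: stems such as "amaz" are no longer real words, but different forms of the same word now map to one token.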

Word2Vec Conversion

We cannot feed words to a machine learning model because models work on numbers only. So
we first convert the words to vectors of token ids, one id per word, and then pad the
vectors to a common length, at which point the textual data can be fed to a model.
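
A plain-Python sketch of this conversion is shown below. Note that it performs simple integer encoding and padding as described in the paragraph above, not the Word2Vec algorithm itself. The function names are hypothetical, and the choice to pad on the right with id 0 is an assumption.

```python
def build_vocab(token_lists):
    """Assign an integer id to every distinct token; 0 is reserved for padding."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab) + 1)
    return vocab

def encode_and_pad(token_lists, vocab, max_len):
    """Map tokens to ids, then truncate or right-pad with 0 to max_len."""
    padded = []
    for tokens in token_lists:
        ids = [vocab[t] for t in tokens if t in vocab][:max_len]
        padded.append(ids + [0] * (max_len - len(ids)))
    return padded

messages = [["free", "entry", "win"], ["ok", "lar"], ["win", "cash", "free", "now"]]
vocab = build_vocab(messages)
print(vocab)
print(encode_and_pad(messages, vocab, max_len=4))
# prints [[1, 2, 3, 0], [4, 5, 0, 0], [3, 6, 1, 7]]
```

Every message becomes a list of the same length, which is what a fixed-size model input requires.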

Model Development and Evaluation

We will implement a Sequential model which will contain the following parts:

• Three Embedding Layers to learn a featured vector representation of the input
vectors.
• An LSTM layer to identify useful patterns in the sequence.
• One fully connected layer.
• A final output layer, which outputs probabilities for the two classes.
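
A model of this general shape might be defined with the Keras API as sketched below. The vocabulary size, sequence length and layer widths are assumptions chosen only for illustration. The sketch also uses a single embedding layer and a single sigmoid output unit (the spam probability, with the ham probability as its complement), so it should be read as one possible arrangement rather than the exact architecture listed above.

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 5000   # assumed vocabulary size
MAX_LEN = 100       # assumed padded sequence length

model = tf.keras.Sequential([
    # Learns a dense vector for every token id.
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=32),
    # Reads the sequence of vectors and summarises it.
    tf.keras.layers.LSTM(16),
    # Fully connected layer.
    tf.keras.layers.Dense(32, activation="relu"),
    # Output: P(spam); P(ham) is 1 - P(spam).
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Forward pass on two dummy padded sequences to check the output shape.
dummy = np.zeros((2, MAX_LEN), dtype="int32")
print(model(dummy).shape)
```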

Callback

Callbacks are used to check whether the model is improving with each epoch. If it is not,
they take corrective steps: ReduceLROnPlateau decreases the learning
rate, and if model performance still does not improve, EarlyStopping stops the
training. We can also define custom callbacks that stop training once
the desired results have been obtained.
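
The interplay of the two callbacks can be illustrated without any deep learning library. The function below replays a list of validation losses and applies a reduce-on-plateau rule and an early-stopping rule. It is a simplified plain-Python imitation of the idea, not the Keras implementation; the parameter names and default values are assumptions.

```python
def train_with_callbacks(val_losses, lr=0.01, factor=0.5, lr_patience=2, stop_patience=4):
    """Replay per-epoch validation losses and apply both rules.

    Returns (epochs_run, final_lr).
    """
    best = float("inf")
    since_best = 0       # epochs since the best loss (drives early stopping)
    since_lr_change = 0  # epochs without improvement since the last LR cut
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best, since_lr_change = loss, 0, 0
            continue
        since_best += 1
        since_lr_change += 1
        if since_best >= stop_patience:
            return epoch, lr            # early-stopping rule
        if since_lr_change >= lr_patience:
            lr *= factor                # reduce-on-plateau rule
            since_lr_change = 0
    return len(val_losses), lr

losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45]
print(train_with_callbacks(losses))
# prints (7, 0.005)
```

In this run the loss stops improving after epoch 3, the learning rate is halved at epoch 5, and training stops at epoch 7 after four epochs without improvement.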

Model Evaluation Results

Having trained our model, we can plot a graph showing how the training and
validation accuracies vary with the number of epochs.

4.4 PROPOSED FLOWCHART: -

SYSTEM ARCHITECTURE: -

Step 1. Pick a random mail from the collection for testing purposes.

Step 2. The e-mail in question is in its unprocessed state and must be preprocessed before
the feature extraction and classification procedure can begin. Preprocessing consists of
tokenization, stemming, and stop word elimination:
(1) To begin, break the e-mail down into distinct words. Tokenization
separates each word into its own token.
(2) Eliminate all punctuation marks from the tokens obtained through
tokenization.
(3) Apply stemming to the tokens obtained in the previous stage. Stemming
reduces a word to its base word. For stemming, a predetermined set of
words is examined, along with their respective stem words.
(4) For stemming, a list of suffixes is maintained in an array with their base
words.
(5) Check whether any tokens remain in the input text.
(6) If the token ends with one of the listed suffixes, stem it to the corresponding base
word from the array.
(7) Otherwise, stemming is unnecessary because the word is already in its root
form. Proceed to the next token.

Step 3. For feature extraction, select suitable attribute words from the
validation set. Only the features most closely associated with the category are selected.

Step 4. Use the extracted features and tokens to train ML and DL models. The trained model
can then distinguish between spam and ham emails.

Step 5. Tokens are classified as spam or ham based on their feature similarity, as
determined by the ML model.

Step 6. Finally, the likelihood of spam or ham tokens in a sentence is evaluated for the final
classification:
(1) The mail is regarded as spam if the significance level of spam tokens is higher than
zero.
(2) Otherwise, the e-mail is regarded as ham.

Step 7. Mark the e-mail as spam or ham and proceed with the rest of the emails.
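
Steps 4 to 6 can be made concrete with a very small token-scoring classifier. The report does not define how the "significance level" of a token is computed, so the sketch below assumes one possible choice: a smoothed log-odds score per token, with a mail labelled spam when the summed score is above zero. The training mails and function names are made up for illustration.

```python
import math
from collections import Counter

def train_token_scores(labelled_mails):
    """Learn a smoothed log-odds score per token from (tokens, label) pairs."""
    spam_counts, ham_counts = Counter(), Counter()
    for tokens, label in labelled_mails:
        (spam_counts if label == "spam" else ham_counts).update(tokens)
    vocab = set(spam_counts) | set(ham_counts)
    spam_total = sum(spam_counts.values()) + len(vocab)
    ham_total = sum(ham_counts.values()) + len(vocab)
    # Add-one smoothing keeps the score finite for tokens seen in one class only.
    return {
        t: math.log((spam_counts[t] + 1) / spam_total)
        - math.log((ham_counts[t] + 1) / ham_total)
        for t in vocab
    }

def classify(tokens, scores):
    """Label the mail spam when the summed token scores are above zero."""
    total = sum(scores.get(t, 0.0) for t in tokens)
    return "spam" if total > 0 else "ham"

training = [
    (["win", "free", "prize"], "spam"),
    (["free", "cash", "claim"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["see", "you", "at", "home"], "ham"),
]
scores = train_token_scores(training)
print(classify(["claim", "free", "prize"], scores))
print(classify(["see", "you", "at", "noon"], scores))
```

Tokens that were never seen during training contribute a score of zero, so a mail made up only of unknown words is labelled ham.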

DATA FLOW DIAGRAM: -
CHAPTER 5
IMPLEMENTATION AND ANALYSIS

5.1 SOURCE CODE: -

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:

data = pd.read_csv('spamham.csv')

In [4]:

data.head()

Out[4]:

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...

In [5]:

data.shape

Out[5]:

(5572, 2)

In [14]:

print("We have", data.shape[0], "observations")

We have 5572 observations

In [17]:

dis = data['Category'].value_counts()

In [18]:

# Index by label rather than by position: integer keys on a
# label-indexed Series are deprecated in recent pandas versions.
print("We have", dis['ham'], 'normal mails')
print("We have", dis['spam'], 'spam mails')

We have 4825 normal mails
We have 747 spam mails

This is our distribution in the total data

In [23]:

plt.pie(x = dis.values, explode = (0.1, 0), labels = dis.index, autopct = '%1.1f%%')
plt.show()

Label Spam as 1 and Ham as 0

Using loc

In [29]:

data.loc[data['Category'] == 'spam']

Out[29]:

     Category                                            Message
2        spam  Free entry in 2 a wkly comp to win FA Cup fina...
5        spam  FreeMsg Hey there darling it's been 3 week's n...
8        spam  WINNER!! As a valued network customer you have...
9        spam  Had your mobile 11 months or more? U R entitle...
11       spam  SIX chances to win CASH! From 100 to 20,000 po...
...       ...                                                ...
5537     spam  Want explicit SEX in 30 secs? Ring 02073162414...
5540     spam  ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ...
5547     spam  Had your contract mobile 11 Mnths? Latest Moto...
5566     spam  REMINDER FROM O2: To get 2.50 pounds free call...
5567     spam  This is the 2nd time we have tried 2 contact u...

In [30]:

data.loc[data['Category'] == 'spam']['Category']

Out[30]:

2       spam
5       spam
8       spam
9       spam
11      spam
        ...
5537    spam
5540    spam
5547    spam
5566    spam
5567    spam
Name: Category, Length: 747, dtype: object

In [35]:

data.loc[data['Category'] == 'spam', 'Category'] = 1
data.loc[data['Category'] == 'ham', 'Category'] = 0
data.head()

Out[35]:

  Category                                            Message
0        0  Go until jurong point, crazy.. Available only ...
1        0                      Ok lar... Joking wif u oni...
2        1  Free entry in 2 a wkly comp to win FA Cup fina...
3        0  U dun say so early hor... U c already then say...
4        0  Nah I don't think he goes to usf, he lives aro...

Use get_dummies

In [24]:

pd.get_dummies(data['Category'])

Out[24]:

      ham  spam
0       1     0
1       1     0
2       0     1
3       1     0
4       1     0
...   ...   ...
5567    0     1
5568    1     0
5569    1     0
5570    1     0
5571    1     0

Choose whether ham or spam should be encoded as 1, insert that column into the data, and drop
the original categorical Category column from the dataframe.
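
The labelling step just described might look as follows. The three-row data frame is a hypothetical stand-in for spamham.csv, and the new column names `Label` and `Spam` are assumptions made for this sketch.

```python
import pandas as pd

# Small stand-in for spamham.csv (hypothetical rows, same two columns).
data = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["Ok lar", "WINNER!! Claim now", "See you at home"],
})

# Option 1: map the text labels directly to integers.
data["Label"] = data["Category"].map({"ham": 0, "spam": 1})

# Option 2: take the 'spam' indicator column produced by get_dummies.
data["Spam"] = pd.get_dummies(data["Category"])["spam"].astype(int)

# Drop the original text category once a numeric column is in place.
data = data.drop(columns="Category")
print(data)
```

Both options give the same 0/1 column here; in practice only one of them is needed.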

Build the classic X and y

X -> Feature, y -> Label

In [36]:

X = data['Message']
y = data['Category']

In [40]:

print(X, '\n\n', y)

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object

0       0
1       0
2       1
3       0
4       0
       ..
5567    1
5568    0
5569    0
5570    0
5571    0
Name: Category, Length: 5572, dtype: object

Train - Test Split

In [41]:

from sklearn.model_selection import train_test_split

In [43]:

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)

In [44]:

print("Entire data :", data.shape)
print("X_train shape : ", X_train.shape)
print("y_train shape : ", y_train.shape)
print("X_test shape : ", X_test.shape)
print("y_test shape : ", y_test.shape)

Entire data : (5572, 2)
X_train shape : (4457,)
y_train shape : (4457,)
X_test shape : (1115,)
y_test shape : (1115,)

In [45]:

1115/5572

Out[45]:

0.20010768126346015

The test set is 20% of the data, as specified by train_size = 0.8.
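
Because only 747 of the 5572 messages are spam, a purely random split can leave the two subsets with slightly different spam ratios, and a split without a fixed seed changes on every run. The sketch below shows the `stratify` and `random_state` parameters of `train_test_split` on hypothetical toy data; this is an optional refinement, not what the notebook above does.

```python
from sklearn.model_selection import train_test_split

# Toy labels with the same kind of imbalance: 8 ham (0) and 2 spam (1).
X = [f"message {i}" for i in range(10)]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# stratify=y keeps the ham/spam ratio equal in both subsets;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=42
)
print(sum(y_train), sum(y_test))
# prints 1 1
```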

Feature Extraction

In [47]:

from sklearn.feature_extraction.text import TfidfVectorizer

Why use TF-IDF

The classic approach is to use a bag of words.

Why we chose TF-IDF

For example, suppose there are two messages in the dataset:

‘hello world’ and ‘hello foo bar’.

1. TF(‘hello’): ‘hello’ appears once in each message, so its term frequency is 1/2 in
‘hello world’ and 1/3 in ‘hello foo bar’.

2. IDF(‘hello’) is log(2/2) = 0, since ‘hello’ appears in both of the 2 messages.

If a word occurs in many messages, it carries less information.

With plain word counts, a common but uninformative word such as ‘mail’ would receive a high weight
and could push a message towards spam. With TF-IDF, rarer words such as ‘offer’ receive the higher weight.

min_df = When building the vocabulary, ignore terms that have a document
frequency strictly lower than the given threshold. This value is also
called the cut-off in the literature.

Stop words are common words such as ‘a’, ‘an’ and ‘are’.
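
The ‘hello world’ / ‘hello foo bar’ example can be worked through in plain Python using the textbook definitions of TF and IDF. This is only a sketch of the idea: scikit-learn's TfidfVectorizer by default uses a smoothed IDF and normalises each row, so its numbers differ from the ones computed here.

```python
import math

docs = [["hello", "world"], ["hello", "foo", "bar"]]

def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / number of documents with `term`)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

for term in ("hello", "world"):
    print(term, [round(tf(term, d) * idf(term, docs), 3) for d in docs])
# prints:
# hello [0.0, 0.0]
# world [0.347, 0.0]
```

‘hello’ appears in every message and so gets a weight of zero, while ‘world’ appears in only one message and keeps a positive weight there.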

In [49]:

extractor = TfidfVectorizer(min_df = 1, stop_words = 'english', lowercase = True)

In [75]:

X_train_features = extractor.fit_transform(X_train)
X_test_features = extractor.transform(X_test)

In [56]:

# Convert object type to int
y_train = y_train.astype(int)
y_test = y_test.astype(int)

Build the model

SVM model

SVM builds a classifier by searching for the optimal separating hyperplane, i.e. the one that
maximises the margin between the categories (in our case spam and ham). SVM therefore has the advantage of
general robustness and remains effective when the number of dimensions is greater than the number of samples.

In [58]:

from sklearn.svm import LinearSVC

In [59]:

model = LinearSVC()

In [63]:

model.fit(X_train_features, y_train)

Out[63]:

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

Evaluate the model

In [65]:

from sklearn.metrics import accuracy_score

In [71]:

train_pred = model.predict(X_train_features)
print("Training accuracy : ", accuracy_score(y_train, train_pred))

Training accuracy : 0.9995512676688355

In [76]:

test_pred = model.predict(X_test_features)
print("Test accuracy : ", accuracy_score(y_test, test_pred))

Test accuracy : 0.9838565022421525
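
Accuracy alone can be misleading on this dataset: since 4825 of the 5572 messages are ham, a model that always predicts ham would already reach about 86.6% accuracy. Precision and recall for the spam class give a clearer picture. The sketch below computes them in plain Python on hypothetical labels; scikit-learn's `sklearn.metrics` module provides `confusion_matrix`, `precision_score` and `recall_score` for the same purpose.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class (0.0 when undefined)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: 8 ham (0), 2 spam (1); the model misses one spam mail.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print(confusion_counts(y_true, y_pred))
print(precision_recall(y_true, y_pred))
# prints (1, 0, 1, 8) and then (1.0, 0.5)
```

Here accuracy is 90%, yet recall shows that half of the spam mails were missed.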

Checking

In [87]:

# The message is split into adjacent string literals so that it stays one valid string.
mail = ["WINNER!! As a valued network customer you have been selected to "
        "receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid "
        "12 hours only."]

In [88]:

processed = extractor.transform(mail)

In [89]:

model.predict(processed)

Out[89]:

array([1])

The model may not give the best output because it has not seen a lot of data.
Training it on a larger dataset should make it possible to build a
better model.
Success

REFERENCES

• https://www.google.com/search?q=flow+chart+of+email+detection&source=lnms&tbm=isch&sa=X&ved=2ahUKEwjL8P7e4pL8AhVMcGwGHXJWDr8Q_AUoAXoECAEQAw&biw=1280&bih=620&dpr=1.5
• https://r.search.yahoo.com/_ylt=Awrx.Yc5VqdjmlQfwdO7HAx.;_ylu=Y29sbwNzZzMEcG9zAzEEdnRpZAMEc2VjA3Ny/RV=2/RE=1671939770/RO=10/RU=https%3a%2f%2fen.wikipedia.org%2fwiki%2fEmail/RK=2/RS=5PfvLrwtwXyWVOzkzIqf.MPB_nA-
• https://github.com/Sumit-Rakesh/Email-Spam-Detection-classification-project-in-python/blob/c4c37736a0bed872c7ef3b1109b2292c204294e0/email_spam_classifier.ipynb
• https://github.com/raaaouf/Email-Spam-Classifier/blob/main/Email_Spam_Classifier.ipynb
• https://www.geeksforgeeks.org/detecting-spam-emails-using-tensorflow-in-python/
• https://www.geeksforgeeks.org/introduction-to-electronic-mail/#amp_tf=From%20%251%24s&aoh=16698895249600&referrer=https%3A%2F%2Fwww.google.com&ampshare=https%3A%2F%2Fwww.geeksforgeeks.org%2Fintroduction-to-electronic-mail%2F
• https://www.umass.edu/it/security/phishing-fraudulent-emails-text-messages-phone-calls
