Project
LEARNING
PROJECT REPORT
SUBMITTED IN PARTIAL FULFILLMENT FOR THE REQUIREMENT FOR
THE AWARD OF THE DEGREE OF
BACHELOR OF TECHNOLOGY
(Computer Science & Engineering)
SUBMITTED BY
Akash Thakur
KC GROUP OF
INSTITUTIONS PANDOGA,
UNA, HP
CANDIDATE’S DECLARATION
I hereby certify that the work presented in the thesis entitled “Fraud Email
Detection Using Machine Learning” by “AKASH THAKUR”, in partial fulfillment of the
requirements for the award of the Bachelor of Technology in Computer Science and
Engineering and submitted to the Department of Computer Science & Engineering of
KC Group of Institutions, Pandoga, Una, under Himachal Pradesh Technical
University, Hamirpur, is an authentic record of my own work carried out
from September 2022 to January 2023 under the supervision of Er. PRIYANKA
CHANDEL, Assistant Professor, Computer Science & Engineering Department.
The matter presented in this thesis has not been submitted by me to any other University /
Institute for the award of a B.Tech. degree.
(AKASH THAKUR)
University Roll no.: 1914031003
This is to certify that the above statement made by the candidate is correct to the best
of my/our knowledge.
The B.Tech Viva-Voce Examination of AKASH THAKUR has been held on and
accepted
Signature of H.O.D.
ACKNOWLEDGEMENT
Akash Thakur
ABSTRACT
Electronic mail (e-mail) is one of the most widely used services of the Internet. It allows
an Internet user to send a formatted message (mail) to another Internet user in any
part of the world. A mail message can contain not only text but also images, audio, and
video data. The person who sends the mail is called the sender, and the person who receives
it is called the recipient. It works much like the postal mail service.
MALICIOUS EMAIL
Malicious emails, phishing emails in particular, are one of the greatest threats in cyber
security. They target not only large enterprises but also small businesses, individuals, and
everyone in between. The reason lies in their simplicity and in the methods
attackers use to trick users into opening attachments, entering personal details, and
clicking on malicious links. Malicious email attachments are designed to launch an
attack on a user’s computer. The attachments within these malicious emails can be
disguised as documents, PDFs, e-files, and voicemails. Attackers attach files to
emails that can install malware capable of destroying data and stealing information. Some
of these infections allow the attacker to take control of the user’s computer, giving
the attacker access to the screen, the ability to capture keystrokes, and access to other
network systems.
Since many email systems automatically block obvious malicious programs, attackers
conceal a piece of software called an exploit inside other types of commonly emailed
files: Microsoft Word documents, ZIP or RAR files, Adobe PDF documents, or even
image and video files. The exploit takes advantage of software vulnerabilities and then
downloads the intended malicious software, called a payload, to the computer. Attackers
can also embed a malicious macro in the document and use social engineering to trick
the user into clicking the “Enable Content” button, which allows the macro to run and
infect the victim’s computer. Attackers typically send these email attachments with
email content that is sufficiently convincing to make the user believe it is a
legitimate communication.
LIST OF FIGURES
LIST OF TABLES
TABLE OF CONTENTS
TITLE PAGE NO.
Title page 1
Candidate’s declaration 2
Acknowledgement 3
Abstract 4
List of Figures 5
CHAPTER 1
INTRODUCTION
Electronic mail (e-mail) is one of the most widely used services of the Internet. It allows
an Internet user to send a formatted message (mail) to another Internet user in any
part of the world. A mail message can contain not only text but also images, audio, and
video data. The person who sends the mail is called the sender, and the person who receives
it is called the recipient. It works much like the postal mail service.
Components of E-Mail System: The basic components of an email system are the User Agent
(UA), Message Transfer Agent (MTA), Mailbox, and Spool file. They are explained below.
1. User Agent (UA): The UA is normally a program used to send and receive
mail. It is sometimes called a mail reader. It accepts a variety of commands for
composing, receiving, and replying to messages as well as for manipulating
mailboxes.
2. Message Transfer Agent (MTA): The MTA is responsible for transferring mail
from one system to another. To send mail, a system must have a client MTA and
a system MTA. It transfers mail to the mailboxes of recipients if they are on the
same machine, and delivers mail to a peer MTA if the destination mailbox is on another
machine. Delivery from one MTA to another is done using the Simple Mail
Transfer Protocol (SMTP).
3. Mailbox: It is a file on the local hard drive that collects mail. Delivered mail is stored in
this file. The user can read or delete it as required. To use the e-mail
system, each user must have a mailbox. Only the owner of a mailbox has access
to it.
4. Spool file: This file contains mail that is to be sent. The user agent appends outgoing
mail to this file using SMTP. The MTA extracts pending mail from the spool file for
delivery.
E-mail allows one name, an alias, to represent several different e-mail
addresses. This is known as a mailing list. Whenever a user sends a message, the system
checks the recipient’s name against the alias database. If a mailing list is present for the given
alias, separate messages, one for each entry in the list, are prepared and handed to the
MTA. If no mailing list is present for the given alias, the name itself becomes the
naming address and a single message is delivered to the mail transfer entity.
• Transfer – Transfer is the procedure of sending mail from the sender to the
recipient.
• Disposition – This step concerns what the recipient does after
receiving mail, i.e., save it, delete it before reading, or delete it after reading.
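The alias lookup described for mailing lists can be sketched in a few lines of Python. This is only an illustration under assumptions: the alias database is modelled as a plain dictionary, and the names `ALIASES`, `expand_recipient`, and `prepare_messages`, as well as the addresses, are hypothetical.

```python
# Hypothetical alias database mapping a mailing-list name to its members.
ALIASES = {
    "project-team": ["asha@example.com", "ravi@example.com", "meera@example.com"],
}


def expand_recipient(name):
    """Return every address a message to `name` must be delivered to."""
    # A known alias expands to each entry in its list; otherwise the name
    # itself is used as the address.
    return list(ALIASES.get(name, [name]))


def prepare_messages(sender, recipient, body):
    """Build one (sender, address, body) tuple per expanded address."""
    return [(sender, address, body) for address in expand_recipient(recipient)]


for message in prepare_messages("akash@example.com", "project-team", "Meeting at 5"):
    print(message)
```

A name without a mailing list yields exactly one message, which matches the single-delivery case described above.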
CHAPTER 2
LITERATURE SURVEY
HISTORY OF EMAIL: -
The history of email entails an evolving set of technologies and standards that culminated in
the email systems in use today.
Computer-based messaging between users of the same system became possible following the
advent of time-sharing in the early 1960s, with a notable implementation by MIT's CTSS
project in 1965. Informal methods of using shared files to pass messages were soon expanded
into the first mail systems. Most developers of early mainframes and minicomputers
developed similar, but generally incompatible, mail applications. Over time, a complex web
of gateways and routing systems linked many of them. Some systems also supported a form
of instant messaging, where sender and receiver needed to be online simultaneously.
In 1971 the first ARPANET network mail was sent, introducing the now-familiar address
syntax with the '@' symbol designating the user's system address. Over a series of RFCs,
conventions were refined for sending mail messages over the File Transfer Protocol. Several
other email networks developed in the 1970s and expanded subsequently.
Proprietary electronic mail systems began to emerge in the 1970s and early 1980s. IBM
developed a primitive in-house solution for office automation over the period 1970–1972,
and replaced it with OFS (Office System), providing mail transfer between individuals, in
1974. This system developed into IBM PROFS, which was available on request to customers
before being released commercially in 1981. CompuServe began offering electronic mail
designed for intraoffice memos in 1978. The development team for the Xerox Star began
using electronic mail in the late 1970s. Development work on DEC's ALL-IN-1 system began
in 1977, and the system was released in 1982. Hewlett-Packard launched HPMAIL (later HP Desk
Manager) in 1982, which became the world's largest selling email system.
The Simple Mail Transfer Protocol (SMTP) was implemented on the ARPANET in
1983. LAN email systems emerged in the mid-1980s. For a time in the late 1980s and early
1990s, it seemed likely that either a proprietary commercial system or the X.400 email
system, part of the Government Open Systems Interconnection Profile (GOSIP), would
predominate. However, once the final restrictions on carrying commercial traffic over the
Internet ended in 1995, a combination of factors made the current Internet suite of SMTP,
POP3 and IMAP email protocols the standard.
During the 1980s and 1990s, use of email became common in business, government,
universities, and defence/military industries. Starting with the advent of webmail (the web-
era form of email) and email clients in the mid-1990s, use of email began to extend to the rest
of the public. By the 2000s, email had gained ubiquitous status. The popularity of
smartphones since the 2010s has enabled instant access to emails.
CHAPTER 3
PROBLEM FORMULATION
Most Common Types of Email Fraud: -
1. Phishing: -
According to Check Point’s Brand Phishing Report for Q3 2020, email phishing was the most
common type of brand phishing attack, accounting for 44% of attacks. The brands
most often impersonated in fake phishing messages were Microsoft, DHL, and Apple.
2. Spoofing: -
This is a compromise attempt during which an unauthorized individual tries to gain access to
an information system by impersonating an authorized user. For example, email spoofing is
when cyber attackers send phishing emails using a forged sender address. You might believe
that you’re receiving an email from a trusted entity, which causes you to click on the links in
the email, but the link may end up infecting your PC with malware.
Normally, email spoofing attacks are emails that appear to come from a genuine email
address when they were actually sent by malicious actors whose ultimate purpose is to trick
you into opening the message and downloading a corrupted attachment. What’s more, email
spoofing can turn into elaborate BEC schemes that can take months to unfold and often lead
to huge financial and data losses.
Because the Simple Mail Transfer Protocol (SMTP) does not establish a mechanism for
address authentication, email spoofing is still very common. While protocols and
methods for email address authentication have been developed to combat this type of email
fraud, the adoption of such frameworks has been slow.
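A forged sender address can sometimes be spotted by comparing headers. The sketch below is a rough heuristic and not an implementation of an authentication framework such as SPF, DKIM, or DMARC: it only flags a message whose visible From domain differs from its Return-Path domain. The function names and the sample message are hypothetical, and legitimate bulk mail can also trigger this check, so the result is a hint rather than a verdict.

```python
from email import message_from_string
from email.utils import parseaddr


def domain_of(header_value):
    """Return the lower-cased domain of an address header, or '' if absent."""
    _, address = parseaddr(header_value or "")
    return address.rpartition("@")[2].lower()


def looks_spoofed(raw_message):
    """Flag a message whose From domain differs from its Return-Path domain."""
    msg = message_from_string(raw_message)
    from_domain = domain_of(msg["From"])
    return_domain = domain_of(msg["Return-Path"])
    # Compare only when both headers are present; otherwise make no claim.
    return bool(from_domain and return_domain and from_domain != return_domain)


raw = (
    "Return-Path: <bounce@mailer.example.net>\n"
    "From: Bank Support <support@bank.example.com>\n"
    "Subject: Verify your account\n"
    "\n"
    "Click the link below.\n"
)
print(looks_spoofed(raw))
```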
Business Email Compromise (BEC) is a type of targeted fraud in which a threat actor
pretends to be a company executive or high-level employee in order to defraud or collect
confidential information from the organization or its partners. The main objective of a BEC
scam is to try and convince the potential victim to transfer money or personal data to the
What are the Risks?
Don’t be fooled! These are fraudulent communications that in most cases have nothing to do
with the institution they claim to be affiliated with. Opening, replying, or clicking the links
provided in these emails poses a serious security risk to you and the campus network.
1. Identity theft: Once you provide your personal information in response to a phishing
attempt, this information can be used to access your financial accounts, make
purchases, or secure loans in your name.
2. Virus infections: Some fraudulent emails include links or attachments that, once
clicked, download malicious software to your computer. Others may also install
keystroke loggers that record your computer activity.
3. Loss of personal data: Some phishing attacks will attempt to deploy crypto malware
on your machine, malicious software that encrypts files on a victim’s computer and
denies owners access to their files until they pay a ransom.
4. Compromising institutional information: If your university IT account is
compromised, scammers may be able to access sensitive institutional information and
research data.
5. Putting friends and family at risk: If your personal information is accessed,
attackers will scan your accounts for personal information about your contacts and
will in turn attempt to phish for their sensitive information. Phishers may also send
emails and social media messages from your accounts in an attempt to gain
information from your family, friends, and colleagues.
1. Email address: -
You should always check the email header and the “from” address to identify the sender and
find out where the message was really sent from.
2. Logo: -
While a phishing email may contain the actual logo of the alleged company, fraudulent
emails may use one that appears stretched or distorted.
3. Email greeting: -
Some emails may not address the member by name. Or, there may be no name mentioned at
all.
4. Spelling: -
When checking an email, you should always look high and low for misspellings, grammatical
mistakes, or punctuation errors that can help identify phishing emails.
5. Legitimacy: -
Another common phishing technique is to include supposedly legitimate links in the email’s
body that look as if they lead to a legitimate website. If you take a closer look, you’ll realize that
the link in question may actually redirect you to a corrupted website that has nothing to do
with the company the email is pretending to be from. Always check the legitimacy of the
links; you can easily do that by hovering the mouse cursor over them. On mobile
devices, extra care needs to be taken when clicking on email links. Always check the site by
verifying the website address in the address bar.
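The manual link check described above can be partly automated. The following sketch, using only the Python standard library, lists links whose visible text names one host while the underlying `href` points to another. The class and function names and the sample HTML are hypothetical, and the comparison is deliberately simple (exact host match), so it should be read as an illustration of the idea rather than a complete phishing detector.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(url):
    """Return the lower-cased host of a URL, or '' if it cannot be parsed."""
    try:
        return (urlparse(url).hostname or "").lower()
    except ValueError:
        return ""


def mismatched_links(html_body):
    """Return (text, href) pairs where the text names a different host."""
    collector = LinkCollector()
    collector.feed(html_body)
    suspicious = []
    for href, text in collector.links:
        # Treat the visible text as a URL; add a scheme if it has none.
        shown = text if "://" in text else "http://" + text
        text_host, href_host = host_of(shown), host_of(href)
        # Only text that looks like a host name (contains a dot) is compared.
        if "." in text_host and href_host and text_host != href_host:
            suspicious.append((text, href))
    return suspicious


body = '<p>Log in at <a href="http://login.example.net/reset">www.bank.example.com</a></p>'
print(mismatched_links(body))
```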
Your staff is your greatest protection against email threats, especially when it comes to
phishing attacks, whether simple or more sophisticated, such as spear phishing. The
significant risk of endpoint compromise can be avoided by staff who have learned to
recognize phishing attempts.
2. Filtering spam: -
Since most email scams begin with unsolicited emails, you should consider taking the
necessary steps to prevent spam from getting into your inbox. Most email apps and services
include spam-filtering features, which can help you configure your email applications to filter
spam.
Blocking email auto-forwarding will make it harder for threat actors to gain access to your
corporate or personal email accounts.
5. Using antivirus software and keeping it updated: -
There are several ways to configure your email client so that you are less vulnerable to
email fraud. For example, you can configure your email program to display emails as
“text-only”. This will protect you from scams that abuse email HTML.
When it comes to securing the content of your emails and preventing them from being read
by other parties, secure encrypted email is always a good idea. However, this practice alone is
not enough. Therefore, you also need to consider an integrated cybersecurity solution, able to
detect basic and advanced forms of email attacks.
• Forward the suspicious email to your IT admin or cybersecurity team and let them
know your concerns.
• If you’re receiving emails in the name of a certain company, make sure you reach
out to them by forwarding the suspicious email and let them know about the scam.
• Notify the Internet Crime Complaint Center (IC3).
• Forward the phishing emails to the Anti-Phishing Working Group (APWG) at
reportphishing@apwg.org or to the U.S. Federal Trade Commission at spam@uce.gov.
• Report scams to your state consumer protection office.
• Report Social Security Administration (SSA) imposters online to SSA’s Inspector
General.
• Report Internal Revenue Service (IRS) imposters to the Treasury Inspector
General for Tax Administration (TIGTA), at 1-800-366-4484.
CHAPTER 4
METHODOLOGY AND PROPOSED WORK
4.1 Objectives: -
Spam messages refer to unsolicited or unwanted messages/emails that are sent in bulk to
users. In most messaging/emailing services, messages are detected as spam automatically so
that these messages do not unnecessarily flood the users’ inboxes. These messages are
usually promotional and peculiar in nature. Thus, it is possible for us to build ML/DL models
that can detect Spam messages.
RAM: 4 GB
PLATFORM: Python
SOFTWARE: Jupyter Notebook
4.3 PROPOSED SYSTEM: -
In this project, we build a TensorFlow-based spam detector; in simpler terms, we
classify texts as spam or ham. Spam detection is therefore a case of a text
classification problem. We perform EDA on our dataset and build a text
classification model.
Python libraries make it very easy for us to handle the data and perform typical and complex
tasks with a single line of code: -
• Pandas – This library helps to load the data frame in a 2D array format and has
multiple functions to perform analysis tasks in one go.
• NumPy – NumPy arrays are very fast and can perform large computations in a very
short time.
• Matplotlib/Seaborn/WordCloud – These libraries are used to draw visualizations.
• NLTK – Natural Language Tool Kit provides various functions to process the raw
textual data.
Text Pre-processing: -
Textual data is highly unstructured and needs attention in several respects, such as:
• Stopwords Removal
• Punctuations Removal
• Stemming or Lemmatization
Although removing data means a loss of information, we need to do this to make the data
suitable to feed into a machine learning model.
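The first two steps above can be sketched with the standard library alone. This is a simplified illustration: the stop-word set is a tiny hypothetical stand-in for the much longer list that NLTK provides, the function name `preprocess` is an assumption, and stemming or lemmatization is left out of the sketch.

```python
import string

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is far longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "you", "have", "been", "and", "of"}


def preprocess(text):
    """Lower-case the text, strip punctuation, and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]


print(preprocess("WINNER!! You have been selected to receive a prize."))
# ['winner', 'selected', 'receive', 'prize']
```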
Word2Vec Conversion
We cannot feed words to a machine learning model because such models work on numbers only. So
we first convert our words to vectors of token ids corresponding to the words; after
padding them to a common length, our textual data reaches a stage where we can feed it to a model.
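The conversion from words to padded id vectors can be illustrated without any framework. The sketch below makes its own choices, which are assumptions rather than the project's settings: ids are assigned in order of first appearance, 0 is reserved for padding, and zeros are appended on the right. Library tokenizers may number words differently (for example by frequency) and may pad on the left.

```python
def build_vocab(token_lists):
    """Assign ids from 1 upward in order of first appearance; 0 is padding."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab) + 1)
    return vocab


def encode_and_pad(token_lists, vocab, max_len):
    """Map tokens to ids, truncate to max_len, and right-pad with zeros."""
    encoded = []
    for tokens in token_lists:
        ids = [vocab[t] for t in tokens if t in vocab][:max_len]
        encoded.append(ids + [0] * (max_len - len(ids)))
    return encoded


messages = [["free", "prize", "call", "now"], ["call", "me", "later"]]
vocab = build_vocab(messages)
padded = encode_and_pad(messages, vocab, max_len=5)
print(padded)
# [[1, 2, 3, 4, 0], [3, 5, 6, 0, 0]]
```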
We will implement a Sequential model which will contain the following parts:
Callback
Callbacks are used to check whether the model is improving with each epoch. If it is not,
they determine the steps to take: ReduceLROnPlateau decreases the learning
rate, and if model performance still does not improve, training is stopped
by EarlyStopping. We can also define custom callbacks to stop training once
the desired results have been obtained.
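The interplay of the two rules can be made concrete with a small simulation in plain Python rather than through the Keras API. It replays a list of per-epoch validation losses, halves the learning rate after a number of epochs without improvement, and stops after a longer stall. The function name, parameter names, and default values are assumptions chosen for illustration, not the settings used in this project.

```python
def simulate_callbacks(val_losses, lr=0.001, factor=0.5,
                       lr_patience=2, stop_patience=4):
    """Replay validation losses and apply both rules.

    Returns (epochs_run, final_learning_rate).
    """
    best = float("inf")
    since_best = 0     # epochs since the best loss (drives early stopping)
    since_lr_cut = 0   # stalled epochs since the last learning-rate cut
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            since_best = 0
            since_lr_cut = 0
            continue
        since_best += 1
        since_lr_cut += 1
        if since_lr_cut >= lr_patience:
            lr *= factor           # reduce the learning rate on a plateau
            since_lr_cut = 0
        if since_best >= stop_patience:
            return epoch, lr       # stop training early
    return len(val_losses), lr


losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.64, 0.65]
result = simulate_callbacks(losses)
print(result)
```

With these losses, the learning rate is cut at epochs 5 and 7, and training stops at epoch 7, before the eighth value is ever used.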
Having trained our model, we can plot a graph depicting the variation of training and
validation accuracies with the number of epochs.
4.4 PROPOSED FLOWCHART: -
SYSTEM ARCHITECTURE: -
Step 1. Pick a random mail from the collection for testing purposes.
Step 2. The e-mail is initially in its unprocessed state and must be preprocessed before
feature extraction and classification can begin. Preprocessing consists of tokenization,
stemming, and stop-word elimination:
(1) Split the e-mail into distinct words; tokenization turns each word into its own
token.
(2) Remove all punctuation marks from the tokens obtained through
tokenization.
(3) Stem the tokens obtained in the previous stage. Stemming reduces a word to its
base word. For stemming, a predetermined set of available words is examined, along with
their respective stem words.
(4) For stemming, a list of suffix keywords is maintained in an array together with their
base words.
(5) Check whether any tokens remain in the input text.
(6) If the test token ends with one of the listed suffixes, stem it to the proper base word
from the array list.
(7) Otherwise, stemming is unnecessary because the word is already in its root
form; proceed to the next token.
Step 3. For feature extraction, select suitable attribute words from the
validation set. Only the features most closely connected to the category are selected.
Step 4. Use the extracted features and generated tokens to train the ML and DL models, so
that the model can distinguish between spam and ham emails.
Step 5. Tokens are classified as spam or ham based on their feature similarity, as
determined by the ML models.
Step 6. Finally, the likelihood of spam or ham tokens in a sentence is evaluated for the final
classification:
(1) The mail is regarded as spam if the significance level of spam tokens is higher than
zero.
(2) Otherwise, the e-mail is regarded as ham.
Step 7. Mark the e-mail as spam or ham and proceed with the rest of the emails.
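The steps above can be sketched end to end in a few lines. The sketch relies on loud assumptions: the suffix table and the set of spam stems are hypothetical, the stemmer is a naive suffix stripper rather than a real stemming algorithm, the fixed spam-stem set stands in for the trained model of Steps 4 and 5, and "significance level higher than zero" is interpreted here as "at least one spam token found".

```python
import string

# Hypothetical suffix table and spam vocabulary, for illustration only.
SUFFIXES = ["ing", "ed", "s"]
SPAM_STEMS = {"win", "prize", "claim", "free"}


def tokenize(mail):
    """Step 2 (1)-(2): split into word tokens and strip punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return mail.lower().translate(table).split()


def stem(token):
    """Step 2 (3)-(7): strip the first matching suffix, else keep the token."""
    for suffix in SUFFIXES:
        # Keep at least three characters so very short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def classify(mail):
    """Step 6: spam if at least one stemmed token is a known spam stem."""
    stems = [stem(token) for token in tokenize(mail)]
    spam_hits = sum(1 for s in stems if s in SPAM_STEMS)
    return "spam" if spam_hits > 0 else "ham"


print(classify("Claiming prizes now!"))
print(classify("See you at lunch"))
```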
DATA FLOW DIAGRAM: -
CHAPTER 5
IMPLEMENTATION AND ANALYSIS
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In [3]:
In [4]:
data.head()
Out[4]:
Category Message
In [5]:
data.shape
Out[5]:
(5572, 2)
In [14]:
In [17]:
dis = data['Category'].value_counts()
In [18]:
In [23]:
Out[29]:
Category Message
5566 spam REMINDER FROM O2: To get 2.50 pounds free call...
5567 spam This is the 2nd time we have tried 2 contact u...
In [30]:
Out[30]:
2       spam
5       spam
8       spam
9       spam
11      spam
        ...
5537    spam
5540    spam
5547    spam
5566    spam
5567    spam
Name: Category, Length: 747, dtype: object
In [35]:
Out[35]:
Category Message
Use get_dummies
In [24]:
pd.get_dummies(data['Category'])
Out[24]:
ham spam
0 1 0
1 1 0
2 0 1
3 1 0
4 1 0
5567 0 1
5568 1 0
5569 1 0
5570 1 0
5571 1 0
Choose whether ham or spam should be encoded as 1, insert that column into the data, and drop the
original categorical column from the dataframe.
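One way to carry out this encoding with pandas is sketched below. It assumes spam is the class encoded as 1, works on a hypothetical three-row frame instead of the real dataset, and stores the indicator in a new column named `spam` (an illustrative name, not taken from the notebook).

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real dataset.
data = pd.DataFrame({
    "Category": ["ham", "spam", "ham"],
    "Message": ["Ok lar...", "Free entry in 2 a wkly comp", "Nah I don't think so"],
})

# One-hot encode the labels and keep only the spam indicator (spam -> 1, ham -> 0).
data["spam"] = pd.get_dummies(data["Category"])["spam"].astype(int)

# Drop the original categorical column now that it is encoded.
data = data.drop(columns="Category")

print(data)
```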
X = data['Message']
y = data['Category']
In [40]:
print(X,'\n\n',y)
0 Go until jurong point, crazy.. Available only ...
1 Ok lar... Joking wif u oni...
2 Free entry in 2 a wkly comp to win FA Cup fina...
3 U dun say so early hor... U c already then say...
4 Nah I don't think he goes to usf, he lives aro...
...
5567 This is the 2nd time we have tried 2 contact u...
5568 Will ü b going to esplanade fr home?
5569 Pity, * was in mood for that. So...any other s...
5570 The guy did some bitching but I acted like i'd...
5571 Rofl. Its true to its name
Name: Message, Length: 5572, dtype: object

0 0
1 0
2 1
3 0
4 0
..
5567 1
5568 0
5569 0
5570 0
5571 0
Name: Category, Length: 5572, dtype: object
In [43]:
In [44]:
X_train shape : (4457,)
y_train shape : (4457,)
X_test shape : (1115,)
y_test shape : (1115,)
In [45]:
1115/5572
Out[45]:
0.20010768126346015
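The shapes and the ratio above are consistent with a 20% test split in which the test part is rounded up, as scikit-learn's `train_test_split` does for a fractional test size. The split cell itself is not shown, so the 0.2 fraction below is an assumption; the arithmetic can be checked directly:

```python
import math

n_samples = 5572       # rows reported by data.shape
test_fraction = 0.2    # assumed; the split cell itself is not shown

n_test = math.ceil(test_fraction * n_samples)   # the test part is rounded up
n_train = n_samples - n_test

print(n_train, n_test)       # 4457 1115
print(n_test / n_samples)    # slightly above 0.2 because of the rounding
```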
Feature Extraction
In [47]:
1. TF(‘hello’) is 2.
2. IDF(‘hello’) is log(2/2), since hello appears in both of the 2 messages.
If a word occurs in many messages, it carries less information.
Without this weighting, an uninformative word such as mail would receive a high impact and push a message towards
spam; with TF-IDF, rarer words such as offer receive the higher impact.
min_df = When building the vocabulary ignore terms that have a document
frequency strictly lower than the given threshold. This value is also
called cut-off in the literature.
Stop words are common words such as a, an, are, etc.
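The weighting can be worked through by hand with the textbook formula tf × log(N / df). The two-message corpus below is hypothetical, and so is the function name. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalises each vector by default, so its numbers differ from this sketch even though the ranking idea is the same.

```python
import math
from collections import Counter


def tf_idf(messages):
    """Return one {term: weight} dict per message using tf * log(N / df)."""
    tokenised = [m.lower().split() for m in messages]
    n_docs = len(tokenised)
    # Document frequency: the number of messages each term occurs in.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = ["hello hello offer", "hello friend"]
w = tf_idf(docs)
print(w[0]["hello"])   # tf = 2, idf = log(2/2) = 0, so the weight is 0.0
print(w[0]["offer"])   # tf = 1, idf = log(2/1), so the weight is positive
```

The word that appears in every message ends up with zero weight, while the rarer word keeps a positive one.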
In [49]:
In [75]:
In [56]:
SVM model
SVM builds a classifier by searching for the separating hyperplane that maximises the margin between the
categories (in our case spam and ham). SVM is therefore generally robust and remains effective
when the number of dimensions is greater than the number of samples.
In [58]:
In [59]:
In [63]:
Out[63]:
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)
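Since several of the cells above survive only as prompts, the following sketch shows how TF-IDF features and a `LinearSVC` might be combined with scikit-learn. It is not a reconstruction of the missing cells: the eight-message corpus is hypothetical, the vectorizer settings are assumptions, and a pipeline is used here for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; 1 = spam, 0 = ham.
texts = [
    "win a free prize now",
    "claim your free cash reward",
    "free entry to win cash",
    "urgent prize claim today",
    "are we meeting for lunch tomorrow",
    "see you at home tonight",
    "lunch at home tomorrow",
    "meeting moved to tonight",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer turns text into TF-IDF features; LinearSVC fits the hyperplane.
model = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LinearSVC(C=1.0, random_state=0),
)
model.fit(texts, labels)

predictions = model.predict(["free prize, claim now", "lunch meeting tomorrow"])
print(predictions)
```

Wrapping both steps in one pipeline means new messages are vectorized with the vocabulary learned during training, which avoids fitting the vectorizer on test data by mistake.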
In [71]:
In [76]:
Checking
In [87]:
mail = ["WINNER!! As a valued network customer you have been selected to receive a £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."]
In [88]:
In [89]:
model.predict(processed)
Out[89]:
array([1])
Your model may not give best output because it hasn't seen a lot of data.
You can implement this model using a large dataset and you will be able to build a
better model.
Success