Project Report Format
On
FalseDream
BACHELOR OF TECHNOLOGY
In
B.TECH
Under the guidance of
Mr. Brijesh Kr. Verma
CERTIFICATE
Certified that the project entitled “FALSEDREAM” submitted by Dev Dutt Pandey
[ROLL NO] and Divyansh Pandey [ROLL NO] in partial fulfillment of the
requirements for the award of the degree of Bachelor of Technology (Computer
Science and Engineering) of Dr. APJ Abdul Kalam Technical University is a record
of the students' own work carried out under our supervision and guidance. The project
report embodies results of original work and studies carried out by the students, and its
contents do not form the basis for the award of any other degree to the candidates or to
anybody else.
We take this opportunity to express our profound gratitude and deep regards to our
guide Mr. Brijesh Kr. Verma and our coordinator Dr. Avinash Gupta for their
exemplary guidance, monitoring and constant encouragement. The blessing, help and
guidance given by them from time to time shall carry us a long way in the journey of
life on which we are about to embark.
Email phishing is the most commonly used type of cyberattack. It uses email messages to trick
you into doing something dangerous that benefits the attacker.
Phishing uses impersonation and other kinds of deceptions to make you believe it is from
somebody you trust, and that the action you are taking will somehow benefit you. Phishing can
take many different forms, including simple attempts at deception that most people can spot. It
can also occur in much more complex situations that include a sequence of messages. The
process of deceiving people into taking some action is called social engineering. So phishing is
really a form of social engineering, like traditional scams and fraud schemes, except that it is
launched using email messages. These attacks are typically aimed at employees in businesses, in
the hope that staff have not had sufficient cyber security awareness training to spot and avoid
them.
INTRODUCTION: This chapter gives our problem definition along with the aims
and objectives. It also includes a project analysis, which gives information about
our project.
LITERATURE REVIEW: This chapter summarizes the parts of the sources that are
relevant to the work as a whole.
RESULTS: This section gives the tasks and features that were accomplished, module
by module, in the phishing email detection system that was developed.
DISCUSSION: This section describes the significance of the phishing email
detection system and provides new insights about the overall system.
CSE DEPARTMENT, BBD, Lucknow Page vii
CONCLUSION: This section covers the various inferences that were drawn after the
completion of the entire project.
FUTURE SCOPE: This section gives the future enhancements that can be made in
the project idea and its implementation.
REFERENCES: This section lists all the sources we have used in our project so that
readers can easily find what we have cited.
4. RESULT ANALYSIS AND DISCUSSION
Choosing right ML algorithm
Result Analysis of sentimental analysis
Machine Learning
Data Visualization
5. CONCLUSION
6. FUTURE SCOPE OF THE PROJECT
APPENDIX-A: LIST OF FIGURES
APPENDIX-B: CODE IMPLEMENTATION
REFERENCES
Research background:
We are living through a period of rapid technological growth that affects everyone in our
society and around the world. As technology grows, it also creates serious security problems.
These problems affect not only people's computers but also their personal lives, because
confidential information such as bank details, credit card details, and personal identity
information is leaked again and again. Among these problems, one that deserves particular
attention is phishing. It is one of the ways in which criminals steal data from electronic devices:
it uses social engineering and technology to steal a victim's identity data and account
information. Email is one of the main means of communication between users, and email traffic
over the internet keeps increasing. Because it is one of the fastest ways to communicate, almost
everyone now uses email to share information.
Data from the past year shows a large increase in phishing activity, and many victims have lost
data, money, and other information. Phishing is the practice of luring people to a fraudulent
website and trying to obtain their passwords, email and account details, and other credentials
without arousing their suspicion. In email, it arrives as a fake message disguised to look as if it
was sent by a reputable company, usually one related to finance. According to a report from the
Anti-Phishing Working Group (APWG), the number of phishing detections in the first quarter of
2020 peaked in March at 165,772, and in the second quarter of 2020 the total number of
phishing sites was 146,994, down 11% from Q1 2020. The numbers are generally comparable to
previous quarters: 139,685 in 1Q 2020, 132,553 in 4Q 2019, 122,359 in 3Q 2019, and 112,163
in 2Q 2019. There is a significant increase in phishing attacks compared with previous years,
which makes phishing one of the main concerns of companies and individuals today. It affects
people mentally because of their losses, and during the COVID-19 pandemic many people lost
the money they had saved for their own use to phishing. It is a serious problem that needs a
solution. From this striking data, it is clear that phishing has shown an apparent upward trend in
recent years. Similarly, the harm caused by phishing can be imagined as well.
The system checks every link and the sender information of an email against the data on
blacklisted persons and blacklisted sites, which helps us find phishing emails. After that, the
email is placed in one of two categories: legitimate email or phishing email. The basic
steps are:
• To detect the phishing email by checking the URL.
Scope of study:
With the emergence of email, the convenience of communication has led to the problem of
spam and other types of electronic attacks, especially phishing attacks through email. Various
anti-phishing technologies have been proposed to solve the problem of phishing attacks. Prior
work has studied the effectiveness of phishing blacklists. Blacklists mainly include sender
blacklists and link blacklists. This detection method first extracts the sender's address and then,
as a further precaution, the link addresses in the message, and checks whether the sender or the
URL is blacklisted in order to distinguish a phishing email from a legitimate one. Updates to a
blacklist of mail addresses or links are usually reported by users, and whether a reported site is a
phishing website is identified manually. At present, two well-known phishing blacklist sites are
PhishTank and OpenPhish. To some extent, the completeness of the blacklist determines the
effectiveness of blacklist-based phishing email detection. The current situation is that new
threats may not only cause severe damage to customers' computers but also aim to steal their
money and identity.
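The blacklist check described above can be illustrated with a minimal sketch. It is not the project's actual implementation: the blacklist contents and the names `SENDER_BLACKLIST`, `DOMAIN_BLACKLIST`, `extract_urls`, and `classify_email` are hypothetical, and a real system would load its lists from user reports or from a feed such as PhishTank or OpenPhish.

```python
import re
from email.utils import parseaddr
from urllib.parse import urlparse

# Hypothetical blacklists, used only for illustration.
SENDER_BLACKLIST = {"attacker@bad-domain.example"}
DOMAIN_BLACKLIST = {"phish.example"}

# Simple pattern for http(s) links in a plain-text body (an assumption;
# HTML mail would need a proper parser).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(body):
    """Return every http(s) URL found in the message body."""
    return URL_PATTERN.findall(body)


def classify_email(from_header, body):
    """Label an email 'phishing' if its sender or any linked host is blacklisted."""
    # Step 1: extract the sender's address from the From header.
    sender = parseaddr(from_header)[1].lower()
    if sender in SENDER_BLACKLIST:
        return "phishing"
    # Step 2: as a further precaution, check the host of every link.
    for url in extract_urls(body):
        host = (urlparse(url).hostname or "").lower()
        if host in DOMAIN_BLACKLIST:
            return "phishing"
    return "legitimate"
```

A sketch like this is only as good as the lists behind it, which is the limitation noted above: a sender or link that has not yet been reported passes as legitimate.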
History
According to APWG, the term phishing was coined in 1996 following social engineering
attacks against America Online (AOL) accounts by online scammers. The term phishing comes
from fishing, in the sense that fishers (i.e. attackers) use bait (i.e. socially engineered messages)
to fish (e.g. steal personal information of victims). However, it should be noted that the theft of
personal information is mentioned here as an example, and that attackers are not restricted to
it, as previously defined in Section II. The replacement of the character f in fishing with ph
comes from the fact that one of the earliest forms of hacking was against telephone
networks, which was named Phone Phreaking. As a result, ph became a common hacking
character replacement of f. According to APWG, accounts stolen via phishing attacks were also
used as a currency between hackers by 1997, traded in exchange for hacking software.
Phishing attacks historically started by stealing AOL accounts, and over the
years moved on to more profitable targets, such as online banking and e-commerce
services. Currently, phishing attacks do not only target system endusers, but also technical
Phishing Motives
According to Weider D. et al. [6], the primary motives behind phishing attacks, from an
attacker's perspective, are:
• Financial gain: phishers can use stolen banking credentials for their financial benefit.
• Identity hiding: instead of using stolen identities directly, phishers might sell the identities to
others, who might be criminals seeking ways to hide their identities and activities (e.g. purchase
of goods).
• Fame and notoriety: phishers might attack victims for the sake of peer recognition.
Brandwatch is a sentiment analysis tool developed by a team of PhD-qualified researchers in the
United Kingdom, and it is commercially available. The tool tries to assess whether a sentiment
is positive, negative, or neutral.
Importance.
According to APWG, phishing attacks were on the rise until August 2009, when the all-time high
of 40,621 unique phishing reports was submitted to APWG. The total number of submitted
unique phishing websites that were associated with the 40,621 submitted reports in August
was 26,402, which is 35% lower than that of the peak in the year 2009 [8]. However, according
to APWG, the drop in phishing attacks was due to the switch in the activities of the Avalanche
gang from traditional phishing campaigns to malware-based phishing campaigns. In other
words, the Avalanche gang did not stop its phishing campaigns but rather switched its tactics
toward malware-based phishing attacks (which still require electronic communication channels
and social engineering techniques to deliver malware). Among the various types of malware
that are used in phishing attacks, Trojan horse software seems to be on the rise and is the most
popular type of malware deployed by phishing attacks. According to APWG, Trojans
contributed 72% of the total malware detected in the 1st half of 2011, up from 55% in the 2nd
half of 2010. It is also important to note that although the number of phishing attack reports has
dropped since the peak in 2009, it is still high: it ranged between 22,000 and 26,000 unique
reports each month in the 1st half of 2011, compared with an average of 28,916 unique reports
in the 2nd half of 2008. On the other hand, the 2nd half of 2011 saw a rise in phishing reports
and websites, which seems to be correlated with the holiday season [9], as depicted in Figures 1
and 2. The impact is further amplified by the fact that each phishing campaign can be sent to
thousands or even millions of users via electronic communication channels. The year 2011 saw
a number of notable spear phishing attacks against well-known security firms such as RSA [10]
and HBGary [2], which resulted in further hacks against their clients, such as RSA's client
Lockheed Martin [3]. This shows that the dangers of phishing attacks, or security
vulnerabilities due to the human factor, are not limited to the naivety of end users, since
technical engineers can also be victims. Minimizing the impact of phishing attacks is extremely
important and adds great value to the overall security of an organization.
Challenges
Because the phishing problem takes advantage of human ignorance or naivety with regard to
interaction with electronic communication channels (e.g. email, HTTP, etc.), it is not an easy
problem to solve permanently. All of the proposed solutions attempt to minimize the impact of
phishing attacks. From a high-level perspective, there are generally two commonly suggested
solutions to mitigate phishing attacks:
• User education: the human is educated in an attempt to enhance his/her classification accuracy
in identifying phishing messages, and then to apply proper actions to the correctly classified
phishing messages, such as reporting attacks to system administrators.
• Software enhancement: the software is improved to better classify phishing messages on
behalf of the human, or to provide information in a more obvious way so that the human has
less chance of ignoring it.
The challenges with both of the approaches are:
• Non-technical people resist learning, and if they learn they do not retain their knowledge
permanently, so training should be continuous. Although some researchers agree that user
education is helpful [1], [11], [12], a number of other researchers disagree [13], [14]. Stefan
Gorling [13] says that “this is not only a question of knowledge, but of utilizing this knowledge
to regulate behavior. And that the regulation of behavior is dependent on many more aspects
other than simply the amount of education we have given to the user”.
• Some software solutions, such as authentication and security warnings, are still dependent on
user behavior. If users ignore security warnings, the solution can be rendered useless.
• Phishing is a semantic attack that uses electronic communication channels to deliver content in
natural languages (e.g. Arabic, English, French, etc.) to persuade victims to perform certain
actions. The challenge here is that computers have extreme difficulty in accurately
understanding the semantics of natural languages. A notable attempt is the E-mail-Based
Intrusion Detection System (EBIDS) [15], which uses Natural Language Processing (NLP)
techniques to detect phishing attacks, however its performance evaluation showed a phishing
Due to the broad nature of the phishing problem, we find it important to visualize the life-cycle
of phishing attacks and, on that basis, categorize anti-phishing solutions. Based on our review
of the literature, we depict a flowchart describing the life-cycle of phishing campaigns from the
perspective of anti-phishing techniques, which is intended to be the most comprehensive
phishing solutions flowchart. See Figure 3. When a phishing campaign is started (e.g. by
sending phishing emails to users), the first protection line is detecting the campaign. The
detection techniques are broad and could incorporate techniques used by service providers to
detect the attacks, end-user client software classification, and user awareness programs. More
details are in Section IV-A. The ability to detect phishing campaigns can be enhanced whenever
a phishing campaign is detected by learning from such experience. For example, by learning
from previous phishing campaigns, it is possible to enhance the detection of future phishing
campaigns. Such learning can be performed by a human observer, or software (i.e. via a
machine learning algorithm).
Working:
The block diagram given above represents our approach to the problem. The description of each
block of the diagram is given below:
1. User: In this step, the user logs in and the user ID is taken to identify the user.
2. Compose Mail: After a mail is composed, the RCNN detection algorithm is started
and applied to the composed mail.
3. Detection System: This step aims to detect whether the mail contains a malicious URL,
whether the person who composed the mail is blacklisted, or whether a URL comes from a
suspicious site that can harm your data.
4. Database: After the detection system determines the type of user and URL, the mail is
stored in the database and placed into one of the following two categories:
➢ Phishing email: This category contains all emails that are sent by a
blacklisted person or contain a URL that is harmful to the user.
➢ Legitimate: This category contains all emails with clean data, that is, only
legitimate content and no bad URLs.
5. Admin: The admin can prepare data for analysis and can also identify which emails are
phishing mails with more accuracy.
6. Result/graph: In this step, all the phishing email detections are passed on to be shown as
results and graphs.
Various techniques for detecting phishing emails are mentioned in the literature. Over the
course of the technology's development, there have been three main types of technical methods:
blacklist mechanisms, classification algorithms based on machine learning, and methods based
on deep learning. From previous work, the existing detection methods based on the blacklist
mechanism mainly rely on people's identification and reporting of phishing links, which
requires a large amount of manpower and time. Detection based on a machine learning
classification algorithm, in turn, requires feature engineering to manually find representative
features, which makes it hard to migrate to other application scenarios. Moreover, the current
detection methods based on deep learning are limited to word embedding in the content
representation of the email. These methods directly transfer natural language processing (NLP)
and deep learning technology, ignoring the specificity of phishing email detection, so the
results are not ideal. Given the methods mentioned above and the corresponding problems, we
set out to study phishing email detection systematically based on deep learning. Specifically,
this paper makes the following contributions:
Disadvantages –
1. With respect to the particularity of the email text, we analyze the email structure and
mine the text features from four more detailed parts: the email header, the email body,
the word level, and the char level.
2. The email is then modelled at multiple levels using an improved RCNN model. Noise is
introduced as little as possible, and the context information of the email can be better
captured.
With the emergence of email, the convenience of communication has led to the problem of
massive spam, especially phishing attacks through email. Various anti-phishing technologies
have been proposed to solve the problem of phishing attacks. Prior work has studied the
effectiveness of phishing blacklists. Blacklists mainly include sender blacklists and link
blacklists. This detection method extracts the sender's address and the link addresses in the
message and checks whether they are in the blacklist to decide whether the email is a phishing
email. Updates to a blacklist are usually reported by users, and whether a reported site is a
phishing website is identified manually. At present, two well-known phishing blacklist sites are
PhishTank and OpenPhish. To some extent, the completeness of the blacklist determines the
effectiveness of blacklist-based phishing email detection. The current situation is that new
threats may not only cause severe damage to customers' computers but also aim to steal their
money and identity. Among these threats, phishing is a noteworthy one: it is a criminal activity
that uses social engineering and technology to steal a victim's identity data and account
information. According to reports from the Anti-Phishing Working Group (APWG), phishing
has shown an apparent upward trend in recent years. Similarly, the harm caused by phishing can
be imagined as well.
Advantages –
1. Phishing email refers to an attacker using a fake email to trick the recipient into
returning information such as an account password to a designated recipient.
2. Additionally, it may be used to trick recipients into entering special web pages, which
are usually disguised as real web pages, such as a bank’s web page, to convince users to
enter sensitive information such as a credit card or bank card number and password.
Although the attack of phishing email seems simple, its harm is immense.
R-CNN Algorithms
Let us briefly summarize the different algorithms in the R-CNN family (R-CNN, Fast R-CNN,
and Faster R-CNN). This will help lay the ground for the implementation part later, where
bounding boxes are predicted in previously unseen images (new data). R-CNN extracts a
number of regions from the given image using selective search, and then checks whether any of
these boxes contains an object. The regions are extracted first, and for each region a CNN is
used to extract specific features. Finally, these features are used to detect objects.
Unfortunately, R-CNN becomes rather slow because of the multiple steps involved in the
process. Fast R-CNN, on the other hand, passes the entire image to a ConvNet, which generates
regions of interest (instead of passing the extracted regions from the image). Also, instead of
using three different models (as in R-CNN), it uses a single model that extracts features from
the regions, classifies them into different classes, and returns the bounding boxes. All these
steps are done simultaneously, which makes it faster than R-CNN. Fast R-CNN is, however,
not fast enough when applied to a large dataset, as it also uses selective search for extracting the
regions.
The project involved analyzing the design of a few applications so as to make our application
more user-friendly. To do so, it was important to keep the navigation from one screen to
the next well ordered while reducing the amount of typing the user needs to do.
To make the application more accessible, it was designed to be compatible with most
browsers.
Functional Requirement
Software Requirement
For developing the application, the following are the Software Requirements:
Python
Django
MySQL
MySQL client
WampServer 2.4
Windows 7
Windows XP
Windows 8
Windows 10
Hardware Requirement
For developing the application, the following are the Hardware Requirements:
Processor: Pentium IV or higher
RAM: 256 MB
Space on Hard Disk: minimum 512MB
MODULES OF PROJECT
This is the first step of our project, in which the user can log in or sign up to the system
with his/her credentials. The login credentials are stored in the MySQL database
through a database connection.
The user can register with the system by filling in the details as shown in figure 3.4.
The user can use these credentials to log in to the system later and save his/her
details for further consideration. The Client's personal details are entered here along
with the login details, and they can later be edited from the Client dashboard.
Once the user has registered with the system, he or she can log in directly by filling in the
email and password as shown in figure 3.5. The same form is used by the admin to log in
and get access to the admin dashboard.
After logging in to the system, the Client reaches the homepage, where there are
different options to select from on the Client dashboard, as shown in figure 3.6.
In the Client dashboard, the navigation bar gives the user a number of options: My
Details, Compose Mail, Check Phishing, View Phishing Details, Feedback, and Logout.
When the user clicks on the Compose Mail button, he or she is redirected to the about
page as shown in figure 3.7.
Once the Client has checked all the mail, he or she can see how many phishing mails have been
sent to him or her, and by whom, through the phishing email option, which redirects to a new
page as shown in figure 3.9.
Once the user has checked the links, he or she can see the whole link history at once through the
View Phishing Details option, as shown in figure 3.11.
After this, the Client can click on the Logout button to return to the login page.
Now let us look at the tabs of the Admin dashboard one by one. The first tab lists the active
registered users, that is, those who have registered on the portal to express their views; here the
Admin can view the details of each user, as seen in figure 3.10.
The next tab on the dashboard is the user details tab, where all the user data is shown with
names and mail addresses, as in figure 3.11.
The positive graph analysis, which is a pie chart, shows the types of mail (such as social and
other) that have been sent by the user.
The next option is phishing attack, where the admin can see all the phishing mails that have
been sent to a user by other users. Only from here can the admin take action against those mails,
as shown in figure 3.16.
Here the admin is also able to see the feedback given by the Client users. The last tab
on the dashboard is the logout tab, from which the admin can log out of the dashboard.
In the sentiment analysis, the following are the major steps for identifying whether a mail
is positive, negative, or neutral:
Data Set Description.
Pre-Processing the Dataset.
Feature Extraction.
Data Visualization.
Data Preprocessing is based on word embedding, which encodes the URL string into a two-
dimensional tensor that can be received by the deep learning model. After data preprocessing,
each character is encoded to a fixed length vector consisting of 0 and 1. This is because the
neural network needs to ensure that the input data is a vector of numbers when performing
mathematical operations.
First, we process the length of the URL string. The HTTP/1.1 specification, RFC 2616, notes a
limit on URL length: “Servers ought to be cautious about depending on URI lengths above 255
bytes, because some older client or proxy implementations might not properly support these
lengths.” So, we set the length of the URL to 255 characters: if the URL is longer than 255
characters, only the first 255 characters are kept, and if it is shorter than 255, 0 is appended to
the end of the URL string until it reaches a length of 255 characters.
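The length rule above can be sketched in a few lines. This is only an illustration: the function name `fix_url_length` is ours, and we assume that "0" means the literal character "0" as the padding symbol, which the text does not state explicitly.

```python
MAX_URL_LENGTH = 255  # length limit described above
PAD_CHAR = "0"        # assumed padding symbol (the text says "0" is appended)


def fix_url_length(url, max_len=MAX_URL_LENGTH, pad=PAD_CHAR):
    """Keep the first max_len characters, or right-pad the URL up to max_len."""
    if len(url) >= max_len:
        return url[:max_len]
    return url + pad * (max_len - len(url))
```

After this step every URL has exactly the same length, so a batch of URLs can be stacked into one fixed-size input for the model.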
At the same time, we counted the frequency of occurrence of characters across all URLs in the
dataset and selected the 59 characters with the highest frequency as valid characters. They
comprise 26 English letters, 10 Arabic numerals, and 23 special characters including “@/: = #-.”
All characters that are not in this list are grouped into a single “special character” class, so each
URL is treated as a sequence over only 60 different characters. As shown in Figure 3, each
character is encoded into a 60-bit 0/1 string with a one at the position of that character and zeros
in the rest. Then, we use the word2vec method from natural language processing to encode the
60-bit 0/1 string into a 64-dimensional word vector. Thus, each URL is processed into a
two-dimensional matrix of size 255 × 64, which is then passed to the input of PDRCNN.
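The character-level 0/1 encoding can be sketched as follows. The 23 special characters below are an illustrative guess, since the full list selected from the dataset is not reproduced in this report, and folding upper-case letters to lower case is also our assumption. The subsequent word2vec step is not shown.

```python
import string

# Illustrative alphabet: 26 letters + 10 digits + 23 special characters = 59.
# The special characters are hypothetical stand-ins for the dataset's top 23.
SPECIAL_CHARS = "@/:=#-._?&%+~!$'()*,;[]"
ALPHABET = string.ascii_lowercase + string.digits + SPECIAL_CHARS
OTHER_INDEX = len(ALPHABET)      # one shared slot for every other character
VECTOR_SIZE = len(ALPHABET) + 1  # 60 positions in total

CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot(ch):
    """Return a 0/1 list with a single 1 at the character's position."""
    vec = [0] * VECTOR_SIZE
    vec[CHAR_TO_INDEX.get(ch.lower(), OTHER_INDEX)] = 1
    return vec


def encode_url(url):
    """Encode a URL as a list of one-hot rows, one row per character."""
    return [one_hot(ch) for ch in url]
```

Applied to a URL that has already been fixed to 255 characters, `encode_url` gives a 255 × 60 matrix of zeros and ones.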
Data Pre-Processing:
Again type combine.tail() in the cell and you get the following result.
Columns not in the original data frames are added as new columns and the new cells are
populated with NaN value.
STEP — 2
Removing Garbage Data
In our analysis we can clearly see that the garbage data do not contribute anything significant to
solve our problem. So, it’s better if we remove them in our dataset.
Given below is a user-defined function to remove unwanted text patterns from the mails. It
takes two arguments, one is the original string of text and the other is the pattern of text that we
want to remove from the string. The function returns the same input string but without the given
pattern. We will use this function to remove the pattern from all the mails in our data.
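A minimal sketch of such a function is shown here, assuming the pattern is given as a regular expression. The name `remove_pattern` and the example pattern in the usage note are illustrative, not taken from the project code.

```python
import re


def remove_pattern(text, pattern):
    """Return text with every match of the regular expression removed."""
    return re.sub(pattern, "", text)
```

For example, `remove_pattern(mail, r"@\w+")` would strip tokens such as `@user` from a mail. Removing a match leaves the surrounding whitespace in place, so extra spaces may remain until a later cleaning step.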
STEP — 3
Removing Punctuation, Numbers, and Special Characters
Punctuation, numbers and special characters do not help much. It is better to remove them from
the text just as we removed the unwanted text patterns. Here we will replace everything except
letters and hashtags with spaces.
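This step can be sketched with one regular expression. The function name is ours, and we assume that "characters" means the English letters a to z in either case.

```python
import re


def keep_letters_and_hashtags(text):
    """Replace every character that is not an English letter or '#' with a space."""
    return re.sub(r"[^a-zA-Z#]", " ", text)
```

Each removed character becomes one space, so word boundaries are preserved and the extra spaces disappear when the text is later split into tokens.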
STEP — 4
Removing Short Words
We have to be a little careful here in selecting the length of the words which we want to
remove. We have decided to remove all the words having length 3 or less. Many of these short
words are stop words.
For example, terms like “hmm”, “and”, “oh” are of very little use. It is better to get rid of them.
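A minimal sketch of this filter, with the threshold exposed as a parameter (the name `remove_short_words` is ours):

```python
def remove_short_words(text, min_length=4):
    """Drop every whitespace-separated word shorter than min_length characters."""
    return " ".join(word for word in text.split() if len(word) >= min_length)
```

With the default `min_length=4`, words of length 3 or less are dropped. Note that a length cut-off also removes short words that may carry meaning, which is why the threshold should be chosen with care.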
STEP — 5
Tokenization
Now we will tokenize all the cleaned mails in our dataset. Tokens are individual terms or words,
and tokenization is the process of splitting a string of text into tokens.
Here we tokenize our sentences because we will apply Stemming from the
“NLTK” package in the next step.
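Tokenization of the cleaned mails can be sketched with plain whitespace splitting. This is a simplification: the function name is ours, lower-casing is our assumption, and the stemming step itself is not shown here.

```python
def tokenize(text):
    """Split a cleaned mail into lower-case tokens on whitespace."""
    return text.lower().split()


def tokenize_all(mails):
    """Tokenize every mail in a list, keeping one token list per mail."""
    return [tokenize(mail) for mail in mails]
```

Each resulting token list can then be passed, token by token, to a stemmer such as the Porter stemmer provided by NLTK.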
So finally, these are the basic steps to follow when we have to Pre-Process a dataset containing
textual data.
OK, so now we are done with our Data Pre-Processing stages.
Let’s move on to our next step that is Feature Extraction.
Feature Extraction:
We will discuss how we can extract features from our textual dataset by using Bag-of-Words.
Bag-of-Words Features
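The idea behind Bag-of-Words can be shown with a small sketch in plain Python: build a vocabulary from all mails, then represent each mail by how often each vocabulary word occurs in it. The function names are illustrative, and a library implementation (for example scikit-learn's CountVectorizer) would normally be used on a real dataset.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({token for doc in documents for token in doc.split()})
    return {token: i for i, token in enumerate(vocab)}


def bag_of_words(document, vocabulary):
    """Return the count vector of one document over the vocabulary."""
    counts = Counter(document.split())
    # The vocabulary dict iterates in index order because it was built sorted.
    return [counts.get(token, 0) for token in vocabulary]
```

For the two toy mails "verify account now" and "account account offer", the vocabulary is account, now, offer, verify, and the count vectors are [1, 1, 0, 1] and [2, 0, 1, 0]. Word order is discarded; only the counts remain.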
Besides, some of the decisions that we make when choosing a machine learning
algorithm have less to do with the optimization or the technical aspects of the
algorithm and more to do with business decisions. Below we look at some of the
factors that can help you narrow down the search for your machine learning
algorithm:
The type and kind of data we have plays a key role in deciding which algorithm to
use. Some algorithms can work with smaller sample sets while others require tons and
tons of samples. Certain algorithms work with certain types of data. E.g. Naïve Bayes
works well with categorical input but is not at all sensitive to missing data.
Deal with missing values. Missing data affects some models more than others. Even
models that handle missing data can be sensitive to it (missing data for
certain variables can result in poor predictions).
Choose what to do with outliers
Outliers can be very common in multidimensional data.
Some models are less sensitive to outliers than others. Tree models are usually less
sensitive to the presence of outliers. However, regression models, or any model that
tries to use equations, can be affected by outliers.
Outliers can be the result of bad data collection, or they can be legitimate extreme
values.
Feature engineering is the process of going from raw data to data that is ready for
modeling. It can serve multiple purposes:
Make the models easier to interpret (e.g. binning)
Capture more complex relationships (e.g. NNs)
Reduce data redundancy and dimensionality (e.g. PCA)
Rescale variables (e.g. standardizing or normalizing)
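As a small illustration of the last purpose, rescaling, the sketch below standardizes and normalizes a single numeric feature. The function names are ours, and the population standard deviation is an arbitrary choice for the example.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Standardizing keeps the shape of the distribution but removes its scale, while min-max normalization fixes the range. Which one suits a model better depends on the algorithm and the data.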
Categorize by input:
Categorize by output:
Depending on the storage capacity of your system, you might not be able to store
gigabytes of classification/regression models or gigabytes of data to cluster. This is
the case, for instance, for embedded systems.
Does the prediction have to be fast?
A machine learning model is more complex if:
i. It relies on more features to learn and predict (e.g. using two features vs ten
features to predict a target)
ii. It relies on more complex feature engineering (e.g. using polynomial terms,
interactions, or principal components)
iii. It has more computational overhead (e.g. a single decision tree vs. a random
forest of 100 trees).
Besides this, the same machine learning algorithm can be made more complex through
the number of parameters or the choice of some hyperparameters. For example,
a regression model can have more features, or polynomial and interaction
terms, and a decision tree can have more or less depth. Making the same algorithm more
complex increases the chance of overfitting.
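To make the growth in complexity concrete, the sketch below counts how many input terms a regression model has once polynomial and interaction terms up to a given degree are added. The feature names are placeholders; the point is only that the number of coefficients to fit rises quickly with both the feature count and the degree.

```python
from itertools import combinations_with_replacement


def polynomial_terms(feature_names, degree):
    """List all polynomial and interaction terms up to the given degree."""
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(feature_names, d):
            terms.append("*".join(combo))
    return terms


two_features = ["x1", "x2"]
ten_features = [f"x{i}" for i in range(1, 11)]

terms_2_deg2 = polynomial_terms(two_features, 2)    # x1, x2, x1*x1, x1*x2, x2*x2
terms_10_deg2 = polynomial_terms(ten_features, 2)
terms_10_deg3 = polynomial_terms(ten_features, 3)
```

With two features a degree-2 model has 5 terms, while ten features give 65 terms at degree 2 and 285 at degree 3, so the same algorithm has far more freedom to fit noise.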
We generally try different models to see which best fits our dataset, and then we use that model
for predicting results on the test data.
Here we will use two different models:
Logistic Regression
RCNN
We will then compare their performance and choose the best model, together with the best
feature extraction technique, for predicting results on our test data.
We will use the F1 score throughout to assess our models' performance instead of accuracy. You
will see why at the end of this topic.
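One common reason for preferring the F1 score is class imbalance, where accuracy can look high even if no phishing mail is caught. Whether this is the exact motivation in this project is an assumption; the labels below are made up (1 = phishing, 0 = legitimate) purely to show the effect.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced test set: 5 phishing mails among 100.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100                      # never flags anything
catches_some = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93  # 3 hits, 2 misses, 2 false alarms

acc_naive, f1_naive = accuracy(y_true, always_negative), f1_score(y_true, always_negative)
acc_model, f1_model = accuracy(y_true, catches_some), f1_score(y_true, catches_some)
```

The classifier that never flags a mail reaches 95% accuracy but an F1 score of 0, while the second one has almost the same accuracy (96%) and an F1 score of 0.6.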
CODE :-
Now, let’s move on to applying different models on our dataset from the features extracted by
using Bag-of-Words.
1. Logistic Regression
The first model we are going to use is Logistic Regression.
Bag-of-Words Features
Fitting the Logistic Regression Model.
OUTPUT :-
Figure 3.20.
Predicting the probabilities of a mail belonging to the Positive or the
Negative class.
The output gives, for each mail, the probability of it falling into each of the two
classes, Negative and Positive.
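The fitting and prediction steps can be sketched as follows. This is not the project's code: it is a plain gradient-descent logistic regression written with the standard library, trained on a tiny hypothetical Bag-of-Words matrix, and it assumes the Positive class (label 1) means phishing. The learning rate and epoch count are arbitrary illustrative values.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(X, w, b):
    """Return [P(Negative), P(Positive)] for every row of X."""
    probs = []
    for xi in X:
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        probs.append([1.0 - p, p])
    return probs


# Hypothetical Bag-of-Words counts for the words: "verify", "account", "meeting"
X = [[1, 1, 0], [2, 1, 0], [0, 0, 1], [0, 0, 2]]
y = [1, 1, 0, 0]  # assumed labels: 1 = Positive (phishing), 0 = Negative

w, b = fit_logistic_regression(X, y)
probabilities = predict_proba(X, w, b)
```

A mail is then assigned to the Positive class when its second probability exceeds a chosen threshold, typically 0.5, and the resulting labels are what the F1 score is computed on.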
Calculating the F1 score
Bag-of-Words Features
Algorithm              F1 score
RCNN                   0.572134
Logistic Regression    0.524879
We saw that for many algorithms, setting the right parameters is important for good
performance. Of the two models compared above, RCNN achieved the higher F1 score
(0.572134 versus 0.524879 for Logistic Regression), so we chose it as our final model.
Data Visualization:
Data visualization is one of the most important steps in machine learning projects because
it gives us an approximate idea of the dataset and what it contains before we proceed to
apply different machine learning models.
Graph:
Graph analysis is the part where the admin can view statistics about the process. The
data are taken from the project flow and reflect the most recently updated values. They give
the admin a clear picture of areas for improvement, user satisfaction, and other factors.
Result:
Analysis of email structure. A circle represents a character, and a rectangle represents a word. A
rectangle is filled with an indefinite number of circles, indicating that the word consists of an
indefinite number of characters.
We use a new deep learning model to detect phishing emails. The model employs
an improved RCNN to model the email header and the email body at both the character
level and the word level, so that minimal noise is introduced into the model. In
the model, we use the attention mechanism in the header and the body, making the
model pay more attention to the more valuable information between them. We
evaluate the model, and it obtains a promising result.
For future work, we will focus on how to improve our model for detecting phishing
emails with no email header and only an email body.
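The report does not spell out how the attention weights are computed, so the following is only a generic sketch of the idea: a softmax over dot-product scores decides how much the header and body representations each contribute to the combined vector. The vectors and the query are made-up numbers, and the query stands in for a parameter that a real model would learn.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(vectors, query):
    """Return the attention-weighted sum of the vectors and the weights used."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    combined = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return combined, weights


# Hypothetical representations of the two parts of a mail.
header_vec = [0.2, 0.1, 0.0]
body_vec = [0.9, 0.8, 0.1]
query = [1.0, 1.0, 0.0]  # placeholder for a learned attention parameter

combined, weights = attend([header_vec, body_vec], query)
```

Here the body vector scores higher against the query, so it receives the larger weight in the combined representation.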
APPENDIX-A
LIST OF FIGURES
APPENDIX-B
CODE IMPLEMENTATION
UI DESIGN:
<!DOCTYPE html>
{% load staticfiles %}
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
<link href="https://fonts.googleapis.com/css?family=Russo+One&display=swap"
rel="stylesheet">
<style>
body{
background: url("{% static 'bg3.png' %}");
background-size: cover;
}
.menu table{
width:100%;
text-align:center;
font-family: 'Russo One', sans-serif;
}
h1{
font-family: 'Russo One', sans-serif;
}
.menu table td:hover{
background:;
}
.menu table td{
background:rgb(243, 243, 243);
}
.menu table,.menu table th,.menu table td {
border: ;
border-collapse: collapse;
}
.menu table th,.menu table td {
padding: 15px;
}
.topic h1{
color:black;
padding:2px;
text-align:center;
border-style:none;
height:100px;
width:1330px;
float:left;
}
.mainholder{
position:relative;
top:50px;
left:50px;
z-index:999;
float:left;
}
</style>
</head>
<body>
<div class="background-image">
<div class="topic"><h1 style="color:#ff0e00;margin-top:10px;margin-left:30px;border-style:none;width:1300px;height:56px;border-color:black;background:;">Phishing Email Detection Using Improved RCNN Model with Multilevel Vectors and Attention Mechanism</h1></div>
<div class="menu">
<table>
<tr>
<td><a style="color:#010101;text-decoration: none;" href="{% url 'mydetails' %}">MY DETAILS</a></td>
<td><a style="color:#010101;text-decoration: none;" href="{% url 'userpage' %}">COMPOSE MAIL</a></td>
<td><a style="color:#010101;text-decoration: none;" href="{% url 'checking' %}">CHECK PHISHING DETAILS</a></td>
<td><a style="color:#010101;text-decoration: none;" href="{% url 'checking_attack' %}">VIEW PHISHING DETAILS</a></td>
<td><a style="color:#010101;text-decoration: none;" href="{% url 'feedback' %}">FEEDBACK</a></td>
</tr>
</table>
</div>
</div>
<div class="marqee">
</div>
<div class="mainholder">
{% block userblock %}
{% endblock %}
</div>
</body>
</html>
USER LOGIN:
<!DOCTYPE html>
<html>
<head>
<title>Phishing Email Detection</title>
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.1.3/css/bootstrap.min.css"
integrity="sha384-MCw98/SFnGE8fJT3GXwEOngsV7Zt27NXFoaoApmYm81iuXoPkFOJwJ8ERdknLPMO"
crossorigin="anonymous">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.6.1/css/all.css" integrity="sha384-gfdkjb5BdAXd+lj+gudLWI+BXq4IuLW5IT+brZEZsLFm++aCMlF1V92rMkPaX4PP"
crossorigin="anonymous">
<style>
body,
html {
margin: 0;
padding: 0;
height: 100%;
background: #60a3bc !important;
}
.user_card {
height: 400px;
width: 350px;
margin-top: auto;
margin-bottom: auto;
background: #f39c12;
position: relative;
display: flex;
justify-content: center;
flex-direction: column;
padding: 10px;
box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);
-webkit-box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);
-moz-box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);
border-radius: 5px;
}
.brand_logo_container {
position: absolute;
height: 170px;
width: 170px;
top: -75px;
border-radius: 50%;
background: #60a3bc;
padding: 10px;
text-align: center;
}
.brand_logo {
height: 150px;
width: 150px;
border-radius: 50%;
border: 2px solid white;
}
.form_container {
margin-top: 100px;
}
.login_btn {
width: 100%;
background: #c0392b !important;
color: white !important;
}
.login_btn:focus {
box-shadow: none !important;
outline: 0px !important;
}
.login_container {
padding: 0 2rem;
}
.input-group-text {
background: #c0392b !important;
color: white !important;
border: 0 !important;
border-radius: 0.25rem 0 0 0.25rem !important;
}
.input_user,
.input_pass:focus {
box-shadow: none !important;
outline: 0px !important;
}
.custom-checkbox .custom-control-input:checked~.custom-control-label::before {
background-color: #c0392b !important;
}
</style>
</head>
<body>
<div class="container h-100">
<div class="d-flex justify-content-center h-100">
<div class="user_card">
<div class="d-flex justify-content-center">
<div class="brand_logo_container">
<img src="https://cdn2.iconfinder.com/data/icons/security-safety-volume-2/1000/Phishing_Attack-512.png" class="brand_logo" alt="Logo">
</div>
</div>
<div class="d-flex justify-content-center form_container">
<form method="POST">
{% csrf_token %}
<div class="input-group mb-3">
<div class="input-group-append">
<span class="input-group-text"><i class="fas fa-user"></i></span>
</div>
<input type="text" name="email" class="form-control input_user" value="" placeholder="Email">
</div>
<div class="input-group mb-2">
<div class="input-group-append">
<span class="input-group-text"><i class="fas fa-key"></i></span>
</div>
<input type="password" name="password" class="form-control input_pass" value=""
placeholder="password">
</div>
<div class="form-group">
<div class="custom-control custom-checkbox">
<input type="checkbox" class="custom-control-input" id="customControlInline">
<label class="custom-control-label" for="customControlInline">Remember me</label>
</div>
</div>
<div class="d-flex justify-content-center mt-3 login_container">
<input type="submit" name="button" value="login" class="btn login_btn">
</div>
</form>
</div>
<div class="mt-4">
<div class="d-flex justify-content-center links">
Don't have an account? <a href="{% url 'userregister' %}" class="ml-2">Sign Up</a>
</div>
<div class="d-flex justify-content-center links">
Admin - Site <a href="{% url 'login_page' %}" class="ml-2">Admin Only</a>
</div>
<div class="d-flex justify-content-center links">
</div>
</div>
</div>
</div>
</div>
<h2 style="color:black;margin-top:-100px;text-align:center;">{{a}}</h2>
</body>
</html>
FEEDBACK
{% extends 'users/design.html' %}
{% block userblock %}
{% load staticfiles %}
<link href="https://fonts.googleapis.com/css?family=Russo+One&display=swap" rel="stylesheet">
<style>
.feedback{
position: absolute;
top:120px;
left:130px;
padding:10px;
width:500px;
font-family: 'Russo One', sans-serif;
}
.feedback table{
width:30em;
text-align:center;
border-collapse:collapse;
border-spacing:1px;
background:;
}
.fimage{
border-style:solid;
border-width:1px;
height:370px;
width:390px;
margin-top:40px;
margin-left:740px;
background: url("{% static 'feedback.jpg' %}");
background-size: 100%100%;
}
</style>
<div class="feedback">
<table>
<form method="post">
{% csrf_token %}
<tr>
<td style="color:black">FEEDBACK</td>
<td><textarea name="feedback" rows="4" cols="50"> </textarea></td>
</tr>
<tr>
<td style="text-align:center;" colspan="2"><input type="submit" name="submit" value="SUBMIT"
style="background:white;color:black;padding: 10px;
border-radius: 10px;"></td> </tr>
</form>
</table>
</div>
<div class="fimage"></div>
{% endblock %}
CHECKING ATTACK
{% extends 'users/design.html' %}
{% block userblock %}
{% load staticfiles %}
<link href="https://fonts.googleapis.com/css?family=Russo+One&display=swap"
rel="stylesheet">
<style>
.viewfeedback{
position: absolute;
top: 50px;
left: -27px;
padding: 5px;
height: 300px;
width: 1275px;
float: left;
overflow: scroll;
font-family: 'Russo One', sans-serif;
}
.viewfeedback table{
width:40em;
text-align:center;
border-collapse:collapse;
border:2;
border-spacing:1px;
background:;
}
.viewfeedback table tr th{
padding:5px;
}
.viewfeedback table tr td{
background:#ff0e00;
padding:5px;
}
.viewfeedback table tr:hover td{
background:rgba();
}
.any{
border-color:rgba(199,21,133,0.8);
border-width:1px;
height:380px;
width:360px;
margin-top:50px;
margin-left:830px;
background: url("{% static '20.jpg' %}");
background-size: 100%100%;
float:left;
}
</style>
<div class="viewfeedback">
<form method="post">
{% csrf_token %}
<table border="2">
<tr>
<th style="color:">User Name</th>
<th style="color:">Website</th>
<th style="color:">Attack Details</th>
</tr>
{% for a in obj %}
<tr>
<td style="color:white">{{a.usid.userid}}</td>
<td style="color:white">{{a.website}}</td>
<td style="color:white">{{a.atk}}</td>
</tr>
{% endfor %}
<tr></tr>
</table>
</form>
</div>
<div class="any"></div>
{% endblock %}
VIEW MAIL
{% extends 'users/design.html' %}
{% block userblock %}
{% load staticfiles %}
<link href="https://fonts.googleapis.com/css?family=Russo+One&display=swap"
rel="stylesheet">
<meta charset="UTF-8">
<title>Title</title>
<style>
body{
background-size: cover;
font-family: 'Russo One', sans-serif;
}
.index{
border-style:none;
height:50px;
width:300px;
background:blue;
margin-left:450px;
text-align:center;
margin-top:-30px;
}
.mailtable{
position: absolute;
margin-top:50px;
left:120px;
padding:5px;
height:350px;
width:750px;
overflow:scroll;
float:left;
}
.mailtable table{
width:50em;
text-align:center;
border-spacing:1px;
background:;
}
.viewimage{
border-style:solid;
border-width:1px;
height:350px;
width:400px;
margin-top:-400px;
margin-left:900px;
background: url("{% static 'gif.gif' %}");
background-size: 100%100%;
}
.compose{
border-style:none;
border-width:1px;
height:254px;
width:100px;
margin-top:-300px;
margin-left:0px;
background: rgb(243, 243, 243);
background-size: 100%100%;
}
</style>
</head>
<body>
<div class="mailtable">
<table>
<tr>
<th>MAIL ID</th>
<th>CHAT</th>
<th>DELETE</th>
</tr>
{% for o in form %}
<tr>
<td>{{o.to}}</td>
<td>{{o.chat}}</td>
<td ><a href="{% url 'deleteobj' o.id %}" style="text-decoration:none;color:black">Delete</a></td>
</tr>
{% endfor %}
</table>
</div>
{% endblock %}
ANALYSIS PAGE
{% extends 'admins/admin_design.html' %}
{% block adminblock %}
{% load staticfiles %}
<link
href="https://fonts.googleapis.com/css?family=Russo+One&display=swap"
rel="stylesheet">
<style>
.category{
position: absolute;
margin-top:50px;
left:px;
padding:5px;
height:350px;
width:1150px;
overflow:scroll;
float:left;
font-family: 'Russo One', sans-serif;
}
.category table{
width:70em;
text-align:center;
border-collapse:collapse;
border-spacing:1px;
background:;
}
</style>
<div class="category">
<table>
<tr>
<th>SENDER MAIL</th>
<th>TO MAIL</th>
<th>SUBJECT</th>
<th>CHAT</th>
<th>CATEGORY</th>
<th>DELETE</th>
</tr>
{% for o in obj %}
<tr>
<td>{{o.sendermail}}</td>
<td>{{o.to}}</td>
<td>{{o.subject}}</td>
<td>{{o.chat}}</td>
<td>{{o.category}}</td>
<td ><a href="{% url 'analysisdelete' o.id %}" style="text-decoration:none;color:black">Delete</a></td>
</tr>
{% endfor %}
</table>
</div>
<div class="sideimage"></div>
{%endblock%}
USER REGISTER
<link
href="//netdna.bootstrapcdn.com/bootstrap/3.0.3/css/bootstrap.min.css"
rel="stylesheet" id="bootstrap-css">
<script src="//netdna.bootstrapcdn.com/bootstrap/3.0.3/js/bootstrap.min.js"></script>
<script src="//cdnjs.cloudflare.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
<!-- Include the above in your HEAD tag -->
<!Doctype html>
<html>
<head>
<meta charset="UTF-8">
<title>Registration Form</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<style>
/*-----Background */
body{
background-image:url(https://s3.envato.com/files/243754334/primag.jpg);
background-repeat:no-repeat;
background-size:cover;
width:100%;
height:100vh;
overflow:auto;
}
/*-----for border */
.container{
font-family:Roboto,sans-serif;
background-image:url(https://image.freepik.com/free-vector/dark-blue-blurred-background_1034-589.jpg) ;
}
/*---for heading */
.heading{
font-weight:bold;
text-align : center;
font-size:30px;
color:#F96;
padding-top:10px;
}
/* for email */
/* label */
#email{
margin-top: 5px;
}
/* for Password */
/* label */
.mail{
margin-left: 44px;
font-family: sans-serif;
color: white;
font-size: 14px;
margin-top: 13px;
}
.pass{
color: white;
margin-top: 9px;
font-size: 14px;
font-family: sans-serif;
margin-left: 6px;
margin-top: 9px;
}
#password{
margin-top: 6px;
}
/*------------for phone Number */
/* label */
.pno{
font-size: 18px;
margin-left: -13px;
margin-top: 10px;
color: #ff9;
}
/* for Gender */
/* label */
.gender {
color: white;
font-family: sans-serif;
font-size: 14px;
margin-left: 28px;
margin-top: 8px;
}
.btn.btn-warning:hover {
box-shadow: 2px 1px 2px 3px #99ccff;
background:#5900a6;
color:#fff;
transition: background-color 1.15s ease-in-out,border-color 1.15s ease-in-out,box-shadow 1.15s ease-in-out;
}
</style>
</head>
<body>
<div class="container">
<!-- heading -->
<header class="heading"> Registration-Form</header><hr>
<!-- Form starting -->
<div class="row ">
<!-- For Name -->
<form method="POST">
{% csrf_token %}
<div class="col-sm-12">
<div class="row">
<div class="col-xs-4">
<label class="firstname">First Name :</label></div>
</div>
</div>
<div class="col-sm-12">
<div class="row">
<div class="col-xs-4">
<label class="lastname">Last Name :</label></div>
<div class ="col-xs-8">
<input type="text" name="lname" id="lname">
</div>
</div>
</div>
<div class="col-sm-12">
<div class="row">
<div class="col-xs-4">
<label class="lastname">User Id:</label></div>
<div class ="col-xs-8">
<input type="text" name="userid" id="userid"
placeholder="Enter your UserID" class="form-control last">
</div>
</div>
</div>
<!-- For email -->
<div class="col-sm-12">
<div class="row">
<div class="col-xs-4">
<label class="mail" >Email :</label></div>
<div class="col-xs-8" >
<input type="email" name="email"
id="email" placeholder="Enter your email" class="form-control" >
</div>
</div>
</div>
<!-- For Password and confirm password -->
<div class="col-sm-12">
<div class="row">
<div class="col-xs-4">
<label class="pass">Password :</label></div>
<div class="col-xs-8">
<input type="password" name="password"
id="password" placeholder="Enter your Password" class="form-control">
</div>
</div>
</div>
</div>
<div class="col-sm-12">
<div class="btn btn-warning"><input type="submit"
value="submit"></div>
</div>
</div>
</form>
</div>
</div>
</body>
</html>
ADMIN DESIGN
<!DOCTYPE html>
{% load staticfiles %}
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
<link href="https://fonts.googleapis.com/css?family=Russo+One&display=swap" rel="stylesheet">
<style>
body{
background: url("{% static 'bg3.png' %}");
background-size: cover;
}
.menu table{
width:100%;
text-align:center;
font-family: 'Russo One', sans-serif;
}
h1{
font-family: 'Russo One', sans-serif;
}
.menu table td:hover{
background:;
}
.menu table td{
background:rgb(243, 243, 243);
}
.menu table,.menu table th,.menu table td {
border: ;
border-collapse: collapse;
}
.menu table th,.menu table td {
padding: 15px;
}
.topic h1{
color:black;
padding:2px;
text-align:center;
border-style:none;
height:100px;
width:1330px;
float:left;
}
.mainholder{
position:relative;
top:50px;
left:50px;
z-index:999;
float:left; }
</style>
</head>
<body>
<div class="background-image">
<div class="topic"><h1 style="color:#ff0e00;margin-top:10px;margin-left:30px;border-style:none;width:1300px;height:56px;border-color:black;background:;">
</tr>
</table>
</div>
</div>
</div>
<div class="marqee">
</div>
<div class="mainholder">
{% block adminblock %}
{% endblock %}
</div>
</body>
</html>
ADMIN LOGIN
</head>
<!--Coded with love by Mutiullah Samim-->
<body>
<div class="container h-100">
<div class="d-flex justify-content-center h-100">
<div class="user_card">
<div class="d-flex justify-content-center">
<div class="brand_logo_container">
<img src="https://cdn2.iconfinder.com/data/icons/security-safety-volume-2/1000/Phishing_Attack-512.png" class="brand_logo" alt="Logo">
</div>
</div>
<div class="d-flex justify-content-center form_container">
<form method="POST">
{% csrf_token %}
<div class="input-group mb-3">
<div class="input-group-append">
<span class="input-group-text"><i class="fas fa-user"></i></span>
</div>
<input type="text" name="username" class="form-control input_user" value=""
placeholder="Admin Id">
</div>
<div class="input-group mb-2">
<div class="input-group-append">
<span class="input-group-text"><i class="fas fa-key"></i></span>
</div>
<input type="password" name="password" class="form-control input_pass" value=""
placeholder="password">
</div>
<div class="form-group">
<div class="custom-control custom-checkbox">
<input type="checkbox" class="custom-control-input" id="customControlInline">
<label class="custom-control-label" for="customControlInline">Remember me</label>
</div>
</div>
<div class="mt-4">
</div>
</div>
</div>
</div>
</div>
</body>
</html>
REFERENCES
Vinayakumar, R., Barathi Ganesh, H. B., Anand Kumar, M., Soman, K. P.,
Poornachandran, P., and Verma, A. (2018). DeepAnti-PhishNet: Applying deep
neural networks for phishing email detection. In Proc. 1st AntiPhishing Shared
Pilot, 4th ACM Int. Workshop Secur. Privacy Anal. (IWSPA), pp. 1–11. Tempe,
AZ, USA.
Gavves, E., Fernando, B., Snoek, C. G., Smeulders, A. W., and Tuytelaars, T.
(2015). Local alignments for fine-grained categorization. International Journal of
Computer Vision, 111(2):191–212.