Detection of Phishing On Apps and Websites - Project Report

School of Information Technology and
Engineering
SWE2012 – Software Security

Title: Detection of Phishing on Apps And Websites
Project Report
Fall Semester 2022-23

SLOT: A1
Submitted By
Deepika S – 19MIS0005
Sethumadhavan V – 19MIS0010
Varun S – 19MIS0046
Surya Teja G – 19MIS0247
Yuvaraj S – 19MIS0427
Under the Guidance of

Prof N Asha
Assistant Professor Sr. Grade 1
Detection of Phishing on Apps And Websites
Abstract:
Phishing is a type of social engineering attack often used to steal user
data, including login credentials and credit card numbers. It occurs when an
attacker, masquerading as a trusted entity, dupes a victim into opening an email,
instant message, or text message. A phishing attack is one of the simplest ways to
obtain sensitive information from unsuspecting users. The aim of phishers is to
acquire critical information such as usernames, passwords and bank account details.
Cyber security professionals are now looking for reliable and stable techniques
for detecting phishing websites. This paper applies machine
learning to the detection of phishing URLs by extracting and analyzing
various features of legitimate and phishing URLs. Decision Tree, Random Forest
and Support Vector Machine algorithms are used to detect phishing websites.
The aim of the paper is to detect phishing URLs and to identify the best
machine learning algorithm by comparing the accuracy, false positive rate and
false negative rate of each algorithm.
In recent years, advancements in Internet and cloud technologies have led to a
significant increase in electronic trading, in which consumers make online
purchases and transactions. This growth also leads to unauthorized access to users’
sensitive information and damages the resources of enterprises. Phishing is
one of the familiar attacks that trick users into accessing malicious content and
giving up their information. In terms of website interface and uniform resource
locator (URL), most phishing webpages look identical to the actual webpages. Various
strategies for detecting phishing websites, such as blacklists and heuristics, have
been suggested. However, due to inefficient security technologies, there is an
exponential increase in the number of victims. The anonymous and
uncontrollable framework of the Internet makes it more vulnerable to phishing attacks.
Existing research shows that the performance of phishing detection
systems is limited. There is a demand for an intelligent technique to protect users
from cyber-attacks. In this study, the authors propose a URL detection
technique based on machine learning approaches. A recurrent neural network
method is employed to detect phishing URLs. The proposed
method was evaluated with 7900 malicious and 5800 legitimate sites. The
experimental results show that the proposed method performs better
than recent approaches in malicious URL detection.
Introduction:
Phishing has become a major area of concern for security
researchers because it is not difficult to create a fake website that looks very
similar to a legitimate one. Experts can identify fake websites, but not all
users can, and such users become victims of
phishing attacks. The main aim of the attacker is to steal bank account credentials.
United States businesses lose US $10 billion per year because
their clients become victims of phishing. The 3rd Microsoft Computing Safer
Index Report, released in February 2020, estimated that the annual
worldwide impact of phishing could be as high as $5 billion. Phishing attacks
succeed because of a lack of user awareness. Since phishing attacks
exploit weaknesses found in users, they are very difficult to mitigate, but it
is very important to enhance phishing detection techniques.
Problem Definition:
Phishing is a type of social engineering attack often used to
steal user data, including login credentials and credit card numbers. It occurs
when an attacker, masquerading as a trusted entity, dupes a victim into opening
an email, instant message, or text message. A phishing attack is one of the simplest
ways to obtain sensitive information from unsuspecting users. The aim of phishers is
to acquire critical information such as usernames, passwords and bank account details.
Cyber security professionals are now looking for reliable and stable techniques
for detecting phishing websites. This paper applies machine
learning to the detection of phishing URLs by extracting and analyzing
various features of legitimate and phishing URLs. Decision Tree, Random Forest
and Support Vector Machine algorithms are used to detect phishing websites.
The aim of the paper is to detect phishing URLs and to identify the best
machine learning algorithm by comparing the accuracy, false positive rate and
false negative rate of each algorithm.
ALGORITHMS AND TECHNIQUES USED:

➢ Regexp Tokenizer
➢ Snowball Stemmer
➢ Beautiful Soup
➢ Logistic Regression
➢ MultinomialNB
COMPLETE DESIGN:
Proposed Approach:
MODULE DESCRIPTION:
➢ The first step is to load the data from a CSV file that
contains different types of URLs.
➢ Next, the URLs have to be vectorized. We used Count Vectorizer and
gathered words using a tokenizer, since some words in URLs
are more important than others, e.g. ‘virus’, ‘.exe’,
‘.data’, etc.
MAJOR MODULES/TECHNIQUES INCORPORATED IN THE
PROJECT:
➢ Regexp Tokenizer
➢ Snowball Stemmer
➢ Beautiful Soup
➢ Logistic Regression
➢ MultinomialNB
DETAILED WORK FLOW DESIGN:
S/W TOOLS USED:

Tools used for the development are as follows:
❖ Jupyter Notebook – (Anaconda Navigator)
❖ Python
DATASET:
➢ The dataset is named phishing_site_urls.csv.
➢ It is stored as comma-separated values (a .csv file).
➢ The file contains 549,346 unique entries.
➢ There are two columns.
➢ The Label column is the prediction column and has two categories:
❖ A. Good - the URL does not contain
malicious content and the site is not a phishing site.
❖ B. Bad - the URL contains malicious content
and the site is a phishing site.
➢ There are no missing values in the dataset.
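The dataset properties listed above (two columns, two label values, no missing values) can be checked with a short script. The sketch below is only an illustration: it reads a three-row in-memory stand-in instead of phishing_site_urls.csv, and the column name `URL`, the helper name `label_summary` and the sample rows are assumptions.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row stand-in for phishing_site_urls.csv. The column
# name "URL" is an assumption; "Label" and its values come from the report.
SAMPLE_CSV = """URL,Label
example.com/index.html,Good
example.net/verify/account.exe,Bad
example.org/about,Good
"""

def label_summary(csv_file):
    """Count labels and rows with an empty field in a URL/Label CSV."""
    labels = Counter()
    missing = 0
    for row in csv.DictReader(csv_file):
        url = (row.get("URL") or "").strip()
        label = (row.get("Label") or "").strip().lower()
        if not url or not label:
            missing += 1
            continue
        labels[label] += 1
    return labels, missing

labels, missing = label_summary(io.StringIO(SAMPLE_CSV))
good_share = labels["good"] / sum(labels.values())
```

To run the same check on the real file, pass `open("phishing_site_urls.csv", newline="", encoding="utf-8")` in place of the `io.StringIO` object.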
SEQUENCE DIAGRAM:
[Sequence diagram. Participants: user, platform, website. Steps, in order:]
➢ Train the data.
➢ Data visualization: graphical representation.
➢ Data quality: good data percentage, false data percentage, confusion matrix.
➢ Apply the regular expression tokenizer: splits the data.
➢ Apply the Snowball stemmer: splits the code words.
➢ Apply logistic regression: makes the model in a .pkl file and makes decisions.
➢ Build a model using FastAPI.
➢ The user enters a URL to check whether it is phishing or not.
➢ The website tells whether the URL is phishing or not.
ACTIVITY DIAGRAM:
[Activity diagram. Swimlanes: user, desktop, website. Activities, in order:]
➢ Train the data.
➢ Data visualization.
➢ Apply algorithms and classifications.
➢ Make the .pkl file.
➢ Build a model.
➢ Launch a model.
➢ User input (link).
➢ Tells phishing or not.
IMPLEMENTATION:
Regexp Tokenizer:
➢ A tokenizer that splits a string using a regular expression, which matches
either the tokens or the separators between tokens.
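The two matching modes can be illustrated with Python's built-in `re` module. This is a minimal sketch of the idea rather than the tokenizer class used in the project; the function names, the patterns and the sample URL are assumptions chosen for the example.

```python
import re

def tokenize_by_tokens(text, pattern=r"[A-Za-z]+"):
    """Return every substring that matches the token pattern."""
    return re.findall(pattern, text)

def tokenize_by_gaps(text, pattern=r"[^A-Za-z]+"):
    """Split on the separator pattern and drop empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]

# Hypothetical URL used only for illustration.
url = "example.com/secure-login/update.php?id=42"
tokens = tokenize_by_tokens(url)
```

Both functions give the same list here because the separator pattern is the exact complement of the token pattern. NLTK's `RegexpTokenizer` exposes the same choice through its `gaps` argument.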
Snowball Stemmer:
➢ Snowball is a small string processing language for creating stemming
algorithms; a stemmer reduces words to their root form.
➢ The English Snowball stemmer is also known as the Porter2 stemming
algorithm. It is an improved version of the Porter stemmer that fixes some
of its issues.
➢ The Snowball compiler translates a Snowball program into source code in
another language. Currently Ada, ISO C, C#, Go, Java, JavaScript, Object
Pascal, Python and Rust are supported.
Visualization:
➢ Visualize some important keys using a word cloud.
➢ Create a function to visualize the important keys from a URL.
Chrome web driver:
➢ WebDriver is a tool used for automated testing of web apps across many
browsers. It provides capabilities for navigating to web pages, user input
and more.
Beautiful Soup:
➢ It is used for getting data out of HTML, XML, and other markup languages.
➢ Use the Beautiful Soup library to extract only the hyperlinks relevant for
Google, i.e. only links in <a> tags with href attributes.
➢ Turn the URLs into a DataFrame.
➢ After you get the list of your websites with hyperlinks, turn them into a
Pandas DataFrame with columns “from” (URL where the link resides) and
“to” (link destination URL).
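The link-extraction steps above can be sketched without third-party packages. The example below uses the standard library's `html.parser` as a stand-in for Beautiful Soup, so it shows the idea rather than the project's code; the class name `LinkCollector`, the sample HTML and the decision to resolve relative links are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect (from, to) pairs for every <a> tag that has an href."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page they were found on.
            self.rows.append({"from": self.page_url,
                              "to": urljoin(self.page_url, href)})

# Hypothetical page content: two links with href and one anchor without.
html = ('<p><a href="/login">Log in</a> <a name="top">Top</a> '
        '<a href="https://example.org/">Out</a></p>')
collector = LinkCollector("https://example.com/home")
collector.feed(html)
rows = collector.rows
```

The resulting list of dictionaries can be passed to `pandas.DataFrame(rows)` to obtain the “from” and “to” columns described above.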
Logistic Regression:
➢ Logistic Regression is a machine learning classification algorithm that
is used to predict the probability of a categorical dependent variable. In
logistic regression, the dependent variable is a binary variable that
contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In
other words, the logistic regression model predicts P(Y=1) as a function
of X.
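How a fitted logistic regression model turns a feature vector X into P(Y=1) can be shown in a few lines. This is a minimal sketch of the prediction step only; the weights, bias and feature values are hypothetical and do not come from the trained model in this project.

```python
import math

def predict_probability(features, weights, bias):
    """Return P(Y=1 | X) under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 when P(Y=1 | X) reaches the threshold, else 0."""
    return int(predict_probability(features, weights, bias) >= threshold)

# Hypothetical weights for three token-count features.
weights = [1.5, -2.0, 0.5]
bias = -0.5
p = predict_probability([2, 0, 1], weights, bias)
```

Here the linear score is z = -0.5 + 1.5·2 + 0.5·1 = 3.0, so the sigmoid gives a probability of about 0.95 and the predicted label is 1.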
MultinomialNB:
Multinomial Naive Bayes is commonly applied to NLP problems. The Naive
Bayes classifier is a family of probabilistic algorithms based on
applying Bayes' theorem with the “naive” assumption of conditional
independence between every pair of features, given the class.
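The way the independence assumption turns token counts into a class decision can be sketched from scratch. The code below is an illustrative multinomial Naive Bayes with Laplace smoothing, not the MultinomialNB implementation used in the project; the function names, the smoothing value `alpha=1.0` and the four tiny training documents are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(documents, labels, alpha=1.0):
    """Fit class priors and smoothed token likelihoods from token lists."""
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for tokens, label in zip(documents, labels):
        token_counts[label].update(tokens)
    vocabulary = {tok for counts in token_counts.values() for tok in counts}
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(token_counts[label].values()) + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                tok: math.log((token_counts[label][tok] + alpha) / total)
                for tok in vocabulary
            },
        }
    return model

def predict(model, tokens):
    """Return the class with the highest log posterior; skip unseen tokens."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_likelihood"][tok]
            for tok in tokens if tok in params["log_likelihood"]
        )
    return max(model, key=score)

# Hypothetical tokenized URLs with their labels.
docs = [["login", "verify", "account"], ["verify", "update", "exe"],
        ["docs", "about"], ["about", "contact", "docs"]]
labels = ["bad", "bad", "good", "good"]
model = train_multinomial_nb(docs, labels)
prediction = predict(model, ["verify", "account", "unknown"])
```

Because the features are treated as independent given the class, the log posterior is simply the log prior plus one log-likelihood term per token, which is what `score` adds up.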
Innovation Idea Introduced in Design:

➢ In this project we used a web framework called FastAPI.
➢ The API is served by importing a library called uvicorn, an ASGI server
that runs the FastAPI application.
➢ The API is connected to the phishing detection model.
➢ The connection is made through a PKL file created with pickle, a Python
module that enables objects to be serialized to files on disk and
deserialized back into the program at run time.
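The serialize and deserialize round trip through a PKL file looks roughly like this. It is a minimal sketch: the dictionary stands in for the trained model, and the file name `phishing.pkl` and the temporary directory are assumptions, since the report only says that a .pkl file is created.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable Python object works the same way.
model = {"weights": [1.5, -2.0, 0.5], "bias": -0.5}

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "phishing.pkl")

# Serialize the object to a file on disk (the notebook side).
with open(path, "wb") as f:
    pickle.dump(model, f)

# Deserialize it back into the program at run time (the API side).
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Since unpickling can execute arbitrary code, `pickle.load` should only be called on files that come from a trusted source, such as a model file produced by the project's own notebook.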
Process to execute the model is:

➢ First, run the .ipynb file, which serializes the model to the PKL file.
➢ Next, run the Python file. It starts the FastAPI application, and through
that API we can check a website.
Implementation
Software Details:
Jupyter Notebook - (Anaconda Navigator):
Python:
The Python code is used to deploy FastAPI using the .pkl file that is
generated from the Jupyter notebook.
Sample code:
FastApi:
We use FastAPI to deploy the project as a website
that serves as the platform. It is an interactive and responsive website
used to detect whether a website is legitimate or phishing. The website
is built with web design languages including HTML,
CSS and JavaScript.
Results and discussion:
The approach is simple yet effective: we get an accuracy of 98%, which is a very
high value for automated detection of malicious URLs.
We then tested some links to see whether the model gives good predictions.
GIVING THE SAFE INPUT WEBSITE LINK:
GIVING THE UNSAFE INPUT WEBSITE LINK:
Performance metrics:
From the results obtained with the above models,
logistic regression has the highest performance. We can therefore
conclude that logistic regression achieves higher accuracy than
the other models in detecting phishing websites.
Test cases to verify the security objectives of the project:
ID  TEST STEPS  TEST DATA                     EXPECTED RESULT         ACTUAL RESULT           STATUS
1   URL         www.dghjdgf.com               Phishing website        Phishing website        PASS
2   URL         Nobellitt.com                 Not a phishing website  Phishing website        FAIL
3   URL         Youtube.com                   Phishing website        Not a phishing website  FAIL
4   URL         FaceBook.com                  Not a phishing website  Not a phishing website  PASS
5   URL         Mail.printakid.com            Phishing website        Phishing website        PASS
6   URL         Serviciobys.com               Phishing website        Phishing website        PASS
7   URL         https://www.javatpoint.com/   Phishing website        Not a phishing website  FAIL
8   URL         The council tax scam          Not a phishing website  Phishing website        FAIL
9   URL         https://open.spotify.com/     Not a phishing website  Not a phishing website  PASS
10  URL         Dropbox scam                  Not a phishing website  Phishing website        FAIL
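The accuracy, false positive rate and false negative rate named in the abstract can be computed directly from expected and actual results such as the ten test cases above. The sketch below transcribes those ten rows; the function name `confusion_counts`, the short labels "phishing" and "legitimate", and the choice of "phishing" as the positive class are assumptions.

```python
def confusion_counts(expected, actual, positive="phishing"):
    """Count TP, TN, FP and FN for paired expected/actual labels."""
    tp = tn = fp = fn = 0
    for exp, act in zip(expected, actual):
        if exp == positive and act == positive:
            tp += 1
        elif exp != positive and act != positive:
            tn += 1
        elif exp != positive and act == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

# Expected and actual results of test cases 1-10, in table order.
expected = ["phishing", "legitimate", "phishing", "legitimate", "phishing",
            "phishing", "phishing", "legitimate", "legitimate", "legitimate"]
actual = ["phishing", "phishing", "legitimate", "legitimate", "phishing",
          "phishing", "legitimate", "phishing", "legitimate", "phishing"]

tp, tn, fp, fn = confusion_counts(expected, actual)
accuracy = (tp + tn) / len(expected)
false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (fn + tp)
```

For these ten cases the counts are TP = 3, TN = 2, FP = 3 and FN = 2. The same function can be applied to the predictions of each model on a held-out test set to compare the algorithms on a larger sample.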
CONCLUSION:
Phishing has become a major area of concern for security researchers
because it is not difficult to create a fake website that looks very similar to a
legitimate one. Experts can identify fake websites, but not all users can,
and such users become victims of phishing attacks.
The main aim of the attacker is to steal bank account credentials. United States
businesses lose US $10 billion per year because their clients
become victims of phishing. The 3rd Microsoft Computing Safer Index Report,
released in February 2020, estimated that the annual worldwide impact of
phishing could be as high as $5 billion. Phishing attacks
succeed because of a lack of user awareness. Since phishing attacks exploit
weaknesses found in users, they are very difficult to mitigate, but it is very
important to enhance phishing detection techniques.