
Government Polytechnic Khamgaon

Computer Department

Capstone Project Presentation on


Phishing Website Detection System using ML

Presented By -

2100210097 27 Aasawari Kshirsagar
2100210098 28 Rasika Majgaonkar
2100210112 37 Vaishnavi Sable
2100210125 49 Tanvi Wankhede

Guided By: Prof. V. M. Bande
CONTENTS

➢ Introduction
➢ Project Overview
➢ Problem Statement
➢ Methodology
➢ Project Plan and Timeline
➢ Challenges Encountered
➢ Achievements and Progress
➢ Lessons Learned
➢ Next Steps
➢ Conclusion
➢ References
INTRODUCTION

• A phishing website attack is a type of cyber threat in which attackers create deceptive websites that mimic legitimate ones, aiming to trick users into divulging sensitive information.

• The ultimate goal of a phishing attack is to exploit the victim's trust and obtain sensitive information that can be used for fraudulent activities, unauthorized access, or identity theft.

• Phishing website detection involves the use of machine learning techniques to identify and block phishing websites.
PROJECT OVERVIEW

• The Phishing Website Detection project aims to create a robust system that accurately identifies whether a user-entered website is a phishing website or not.

• The project aims to improve the accuracy of identifying phishing websites compared to existing models, addressing the growing social issue of increased phishing attacks despite strong security measures.

• The ultimate objective is to contribute to overcoming this social problem by implementing a highly effective phishing detection system.
PROBLEM STATEMENT
METHODOLOGY

01 Data Collection
Used the Kaggle dataset as our primary source of data. Preprocessed the dataset by handling missing values, removing duplicates, and normalizing features to ensure data quality and consistency.

02 Feature Extraction
We focus on feature selection and engineering. We carefully choose features that are highly indicative of phishing behavior. By selecting and engineering these features, we aim to provide our models with the necessary information to make accurate predictions.

03 Model Implementation and Training
Implemented machine learning models using the selected features. Trained these models on the preprocessed dataset to learn patterns and relationships between features and phishing behavior.

04 Deployment and Monitoring
Preparing for the deployment of trained models in the upcoming weeks. This involves finalizing model selection, conducting thorough testing, and ensuring compatibility with the deployment environment.
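
The preprocessing steps named in stage 01 (removing duplicates, handling missing values, normalizing features) could look roughly like the sketch below. It is a dependency-free illustration, not the project's actual code: the two column names, the choice of mean imputation, and the choice of min-max scaling are all assumptions, since the slides do not say which methods were used.

```python
from statistics import mean

def preprocess(rows):
    """Deduplicate rows, fill missing values with the column mean,
    then min-max scale every column to [0, 1].

    `rows` is a list of equal-length feature lists; None marks a missing value.
    Assumes every column has at least one observed value.
    """
    # 1. Remove exact duplicates while keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(list(row))

    n_cols = len(unique[0])

    # 2. Fill missing values with the mean of the observed values in that column.
    for col in range(n_cols):
        fill = mean(r[col] for r in unique if r[col] is not None)
        for r in unique:
            if r[col] is None:
                r[col] = fill

    # 3. Min-max normalize each column; a constant column maps to 0.0.
    for col in range(n_cols):
        values = [r[col] for r in unique]
        lo, span = min(values), max(values) - min(values)
        for r in unique:
            r[col] = (r[col] - lo) / span if span else 0.0
    return unique

# Hypothetical columns: [url_length, num_dots]
raw = [[20, 1], [80, None], [20, 1], [50, 5]]
clean = preprocess(raw)
print(clean)
```

Deduplicating before imputing keeps a repeated row from being counted twice in the column mean.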
TOOLS AND TECHNOLOGIES

• Anaconda Environment with Python: Anaconda provides a convenient environment for managing Python packages and dependencies.

• Python Flask, HTML, CSS, JS: For designing the user interface and backend integration.

• Machine Learning with Python Libraries: For training our models with Scikit-learn's algorithms and evaluating their performance. Once trained, the model is integrated into the Flask application to perform real-time detection.
PROJECT PLAN AND TIMELINE
Week 1
We created the user interface for our project.

Week 2
We studied different datasets and decided which dataset to use.

Week 3
We learnt about the algorithms which we decided to implement.

Week 4
We implemented the algorithms and trained our models on the selected dataset.

Week 5
We compared the accuracy of all the implemented models and, to improve it, implemented hyperparameter tuning.

CHALLENGES ENCOUNTERED

1. Finding a Suitable Environment -

We chose Anaconda because it comes with most of the libraries we needed, such as Scikit-learn and pandas, pre-installed.

2. Accuracy -

We used a technique called hyperparameter tuning to increase the accuracy of the algorithms and to find the parameters that contribute most to the accuracy.
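
As an illustration of what hyperparameter tuning involves, here is a minimal sketch of an exhaustive grid search. The slides do not state which search strategy or which hyperparameters were tuned, so everything here is an assumption: the tuned hyperparameter is KNN's `k`, the score is leave-one-out accuracy, and the data is a toy two-cluster set. With Scikit-learn, `GridSearchCV` plays the role of the hand-written `grid_search` helper.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, point, k):
    """Majority label among the k training points closest to `point`."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: dist(pair[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)

def grid_search(X, y, candidates):
    """Score every candidate k; return the best k and all scores."""
    scores = {k: loo_accuracy(X, y, k) for k in candidates}
    best_k = max(scores, key=scores.get)  # first maximum wins on ties
    return best_k, scores

# Toy data: label 0 = legitimate, label 1 = phishing (hypothetical features).
X = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

best_k, scores = grid_search(X, y, candidates=[1, 3, 5, 7])
print(best_k, scores)
```

On this toy set, `k = 7` scores worst because each held-out point is then outvoted by the four points of the other cluster, which shows why the value of a hyperparameter has to be checked against held-out data rather than guessed.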
ACHIEVEMENTS AND PROGRESS

1. Successful completion of the GUI

2. All the selected algorithms are implemented
DEMONSTRATION

• User Interface
MODELS IMPLEMENTED

• Ensemble Technique

1] Bagging
Random Forest Algorithm
RANDOM FOREST

Accuracy :

Classification Report :
Confusion Matrix

Train and Test Accuracy Graph
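
The accuracy, classification report, and confusion matrix shown on the result slides are all derived from the same four counts. The sketch below shows how, using made-up labels rather than the project's results; treating label 1 (phishing) as the positive class is an assumption.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp), with label 1 (phishing) as the positive class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def report(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels for illustration only: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

counts = confusion_counts(y_true, y_pred)
result = report(y_true, y_pred)
print(counts, result)
```

For a phishing detector, recall on the phishing class matters alongside accuracy, because a missed phishing site (a false negative) is usually costlier than a false alarm.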


VARIABLE IMPORTANCE
XGBOOST

• Ensemble Technique
2] Boosting

[Diagram not preserved; surviving label: Not Phishing]
XGBOOST

Accuracy :

Classification Report :
Confusion Matrix

Train and Test Accuracy Graph


LOGISTIC REGRESSION

• Logistic regression is a statistical method used for binary classification by estimating the probability of a binary
outcome based on one or more predictor variables.
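
To make the probability estimate concrete, the sketch below passes a weighted sum of the predictor variables through the logistic (sigmoid) function. The three feature names, the weights, and the bias are hypothetical values chosen for illustration; in practice they would be learned from the training data.

```python
from math import exp

def sigmoid(z):
    """Numerically stable logistic function, mapping any real z into (0, 1)."""
    if z >= 0:
        return 1 / (1 + exp(-z))
    ez = exp(z)
    return ez / (1 + ez)

def predict_proba(features, weights, bias):
    """Probability of the positive class (phishing) for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical features: [has_ip_host, num_dots, uses_https]
weights = [2.0, 0.5, -1.5]  # assumed values, not learned from real data
bias = -1.0

p_suspicious = predict_proba([1, 4, 0], weights, bias)  # z = 3.0
p_ordinary = predict_proba([0, 1, 1], weights, bias)    # z = -2.0
print(round(p_suspicious, 4), round(p_ordinary, 4))
```

A sample is then classified as phishing when its probability exceeds a chosen threshold, commonly 0.5.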
LOGISTIC REGRESSION

Accuracy :

Classification Report :
Confusion Matrix

Train and Test Accuracy Graph


K - NEAREST NEIGHBOUR (KNN)

• The K-NN algorithm works by finding the K nearest neighbors to a given data point based on a distance metric,
such as Euclidean distance.
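
The neighbor search and vote described above can be sketched in a few lines. The two-dimensional feature vectors and their labels are toy values, not data from the project.

```python
from collections import Counter
from math import dist

def k_nearest(train, query, k):
    """Return the k (distance, label) pairs closest to `query` (Euclidean)."""
    ranked = sorted((dist(features, query), label) for features, label in train)
    return ranked[:k]

def classify(train, query, k=3):
    """Predict the majority label among the k nearest neighbors."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

# Toy training set of (features, label) pairs.
train = [
    ((1.0, 1.0), "legitimate"),
    ((1.5, 2.0), "legitimate"),
    ((5.0, 5.0), "phishing"),
    ((6.0, 5.5), "phishing"),
    ((5.5, 6.5), "phishing"),
]

neighbors = k_nearest(train, (5.0, 6.0), k=3)
label_a = classify(train, (5.0, 6.0), k=3)
label_b = classify(train, (1.2, 1.4), k=3)
print(neighbors, label_a, label_b)
```

Because the prediction depends directly on distances, features on very different scales should be normalized first, otherwise the largest-valued feature dominates the neighbor search.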
K - NEAREST NEIGHBOUR (KNN)

Accuracy :

Classification Report :
Confusion Matrix

Train and Test Accuracy Graph


LESSONS LEARNED

• Python proficiency: Learned Python and its libraries, such as NumPy, Pandas, Matplotlib and Scikit-learn, in order to understand machine learning algorithms.
• Environment choice: Selected Anaconda as the primary environment for its efficient package management system and support for data science tools.
• Dataset selection: Identified and acquired a dataset suited to the project's requirements.
• Data preprocessing: Prioritized data preprocessing to ensure high-quality input for model training.
• Hyperparameter tuning: Recognized the importance of hyperparameters and allocated more resources and time to tuning them.
NEXT STEPS

• Integration: We will now integrate the main model with the front end of the website.
• Feature Extraction: Design a mechanism to extract and preprocess features from the URLs that users enter on the website.
• Model Comparison: Run experiments to evaluate the performance of the different models on real URL inputs.
• Deployment and Maintenance: Deploy the finalized web application with the integrated machine learning model.
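
The feature-extraction step listed above could start from something like the following sketch, which turns a user-entered URL into numeric features using only the Python standard library. The feature names are hypothetical; the real mechanism would have to produce exactly the columns the chosen dataset was trained on, which the slides do not list.

```python
import ipaddress
from urllib.parse import urlparse

def extract_features(url):
    """Map a raw URL string to a dict of simple lexical features."""
    # urlparse only fills in the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""

    # A raw IP address in place of a domain name is a common phishing signal.
    try:
        ipaddress.ip_address(host)
        has_ip_host = 1
    except ValueError:
        has_ip_host = 0

    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "has_at_symbol": int("@" in url),
        "has_ip_host": has_ip_host,
        "uses_https": int(parsed.scheme == "https"),
    }

suspicious = extract_features("http://192.168.0.1/login@secure")
ordinary = extract_features("https://www.example.com/")
print(suspicious)
print(ordinary)
```

The resulting values would then go through the same preprocessing as the training data before being passed to the model.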
CONCLUSION

• In the Capstone Project Execution phase, we implemented several classification models to predict phishing websites.
• The Random Forest and XGBoost classifiers performed better than the other models.
• In the next phase of project implementation, we will integrate the classification model that performs best in a real environment.

Algorithm              Accuracy
Random Forest          99.97%
XGBoost Classifier     99.56%
KNN                    90%
Logistic Regression    91%
REFERENCES

• https://ieeexplore.ieee.org/document/9730579
• https://ieeexplore.ieee.org/document/10169697
• https://ieeexplore.ieee.org/document/10249799
• https://ieeexplore.ieee.org/document/9824544
• https://ieeexplore.ieee.org/document/10049452
THANK YOU
