My Mini Project Final
PROJECT REPORT
ON
Phishing Detection System Through Hybrid
Machine Learning Model
SUBMITTED IN PARTIAL FULFILMENT FOR THE AWARD
OF THE DEGREE OF
BACHELOR OF ENGINEERING
in
ARTIFICIAL INTELLIGENCE
by
P Prem Theja (20EG106157)
S Sai Charan Reddy (20EG106136)
B Naresh (21EG506101)
2023-2024
CERTIFICATE
This is to certify that the project work entitled “Phishing Detection System Through
Hybrid Machine Learning Model”, submitted in partial fulfillment of the requirements
for the award of the degree of BACHELOR OF ENGINEERING in ARTIFICIAL
INTELLIGENCE to ANURAG UNIVERSITY, Hyderabad, is a record of bona fide work carried
out by them under the supervision and guidance of Dr. K Basava Raju. The results embodied in this
report have not been submitted to any other University for the award of any other Degree or Diploma.
Date:
CHAPTER
List of Tables
List of Figures
List of Abbreviations
ACKNOWLEDGEMENT
ABSTRACT
1. INTRODUCTION
2. PROBLEM STATEMENT
3. SPECIFICATIONS
3.3 Packages
4. METHODOLOGY
6.1 Conclusion
APPENDIX
REFERENCES
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
DT : Decision Tree
ACKNOWLEDGEMENT
ABSTRACT
In this study, the primary focus is on combating the escalating threat of phishing attacks, a form
of cybercrime that has evolved into a highly dangerous and prevalent menace on the internet. Phishing
leverages email deception and counterfeit websites to trick users into divulging sensitive information.
Despite numerous research efforts in the field, there remains a lack of comprehensive solutions to
effectively thwart these attacks.
To address this gap, machine learning emerges as a crucial line of defense against cybercrimes,
particularly phishing. The project utilizes a phishing URL-based dataset, extracted from a well-known
repository, which aggregates attributes from both phishing and legitimate URLs. This dataset,
encompassing data from over 11,000 websites in vector form, forms the foundation of the study.
The dataset is preprocessed to ensure data quality and consistency before the machine learning
algorithms are applied. These algorithms include Decision Tree (DT), Logistic Regression (LR),
Support Vector Classifier (SVC), and a hybrid model named LSD (Logistic Regression + Support
Vector Machine + Decision Tree). The hybrid model combines its three base classifiers through both
soft and hard voting mechanisms.
1. INTRODUCTION
The advent of the internet has ushered in an era of unprecedented connectivity and convenience,
but it has also paved the way for an alarming surge in cybercrime. Among the myriad forms of online
threats, phishing attacks have risen to the forefront as one of the most insidious and dangerous
cybercrimes.
These attacks threaten individuals, organizations, and even governments. Despite significant efforts in
research and cybersecurity, the battle against phishing is far from over: attackers constantly adapt
their tactics, which makes comprehensive and enduring countermeasures difficult to develop.
In this landscape of evolving cyber threats, machine learning has emerged as a pivotal tool in the
arsenal of cybersecurity. Machine learning, a subset of artificial intelligence, empowers systems to
recognize patterns, adapt, and make intelligent decisions without explicit programming. It offers a
dynamic approach to combating threats like phishing attacks.
This project is grounded in the belief that a fusion of machine learning algorithms can provide an
effective defense against the ever-evolving world of phishing attacks. We take as our starting point a
dataset that captures the characteristics of phishing and legitimate URLs extracted from over 11,000
websites. These URLs, after preprocessing, serve as the foundation for developing and deploying a range
of machine-learning models.
The primary goal of this study is to design and evaluate a robust and efficient system for the
detection of phishing attacks using a hybrid machine-learning approach. The proposed hybrid model,
known as the LSD model (Logistic Regression + Support Vector Machine + Decision Tree), combines
multiple machine-learning techniques to enhance accuracy and efficiency.
In the pursuit of a comprehensive solution, this project evaluates the proposed approach using a
set of diverse evaluation metrics, including precision, accuracy, recall, F1-score, and specificity.
Comparative analyses against other models are used to assess the effectiveness of the hybrid approach
in defending against phishing attacks.
In summary, this project is dedicated to fortifying our digital defenses by harnessing the power of
machine learning. By focusing on phishing attacks and the development of a hybrid model, we aim to
provide a robust safeguard against one of the most potent and pervasive cybercrimes on the internet.
2. PROBLEM STATEMENT
The digital landscape is currently plagued by a myriad of cybercrimes, with phishing attacks
standing out as a pervasive and severe threat. Phishing, first seen in the mid-1990s, has evolved into a
highly dangerous and deceitful cybercrime. It relies on deception, using misleading emails and
counterfeit websites to trick users into revealing their most sensitive information.
While extensive efforts have been dedicated to understanding, preventing, and mitigating
phishing attacks, a comprehensive and enduring solution remains elusive. The attackers continuously
adapt their tactics and strategies, thereby challenging the efficacy of existing countermeasures.
Machine learning, an increasingly potent tool in cybersecurity, holds the promise of providing a
more adaptive and dynamic defense against these evolving phishing threats. However, the effectiveness
of machine learning models in countering phishing is still a topic of ongoing research and development
[1].
The problem at hand is to design a robust and efficient system for the detection of phishing
attacks based on URL characteristics. This system must leverage machine learning techniques, including
a novel hybrid model, to combat phishing with high accuracy and efficiency. Additionally, it must be
evaluated against well-defined metrics to demonstrate its superiority over other existing models [1].
In essence, the problem statement is to bridge the gap in current cybersecurity efforts by
developing a cutting-edge solution capable of defending against the most severe forms of online
deception, thereby safeguarding individuals and organizations against the relentless threat of phishing
attacks.
The current state of phishing detection systems is marked by a reliance on traditional methods and
rudimentary tools. These existing systems often fall short of effectively combatting the evolving and
sophisticated tactics employed by cybercriminals in phishing attacks.
Most conventional approaches rely on rule-based methods, such as blacklists and heuristics, to
identify potential phishing URLs and emails. These systems have limitations, as they struggle to keep
pace with the continuous creation of new phishing sites and techniques. They often generate false positives
and false negatives, which can lead to user frustration and decreased trust in the system [2].
Table 1: Literature Review of Existing Systems
Paper Title | Year | Algorithms used | Drawbacks
Furthermore, existing systems typically lack the adaptability and learning capabilities required to
respond dynamically to emerging threats. They may not effectively adapt to changes in attacker behavior
or the subtle variations in phishing strategies. As a result, users remain vulnerable to increasingly
deceptive and well-crafted phishing attempts [2].
In essence, the current state of phishing detection systems is marred by their inability to provide
robust protection against the ever-evolving threat of phishing attacks. As cybercriminals become more
sophisticated, there is a pressing need for a more advanced and adaptive solution to counteract this form
of cybercrime effectively [3].
2.2 PROPOSED SYSTEM
In response to the inherent limitations of existing phishing detection systems, this project introduces
a highly advanced and adaptive system for phishing detection based on URL characteristics. The proposed
system leverages the power of hybrid machine learning models to significantly enhance the accuracy and
efficiency of detecting phishing attacks, addressing the dynamic nature of these threats.
3. SPECIFICATIONS
Processor : i3 or above
RAM : 4GB or above
Hard Disk : 100GB or above
Software requirements establish the agreement between the development team and the
customer on what the application is supposed to do. Without a description of the features
to be included and details on how they will work, the users of the software cannot
determine whether it will meet their needs. The key software requirements for the project
are:
Python
Google Colab
3.2.1 PYTHON :
3.2.2 GOOGLE COLAB :
Google Colab is a free Jupyter notebook environment that runs entirely in the cloud.
It requires no setup, and the notebooks created in it can be edited simultaneously by
team members in the same way as documents in Google Docs. Colab supports many
popular machine learning libraries, which can be easily loaded in a notebook. With
Google Colab, a programmer can write and execute code in Python, document code with
text that supports mathematical equations, import and save notebooks from and to Google
Drive, and use libraries such as PyTorch, TensorFlow, Keras, and OpenCV.
3.3 PACKAGES :
3.3.1 NumPy :
3.3.2 Pandas:
Pandas is a popular Python library for data manipulation and analysis. It offers
easy-to-use data structures such as DataFrames and Series, which simplify tasks such as
cleaning, exploring, and analyzing data.
3.3.3 Matplotlib:
Matplotlib is a comprehensive library for data visualization, used to plot static and
interactive charts, graphs, and figures. In this project it is used for visualizing model
training results.
3.3.4 Sklearn:
Scikit-learn (sklearn) provides a wide range of machine learning algorithms, utilities,
and metrics for model development and evaluation. In this project it is used for the
train/test split, preprocessing, and model evaluation.
4. METHODOLOGY
Data Sources:
Dataset of phishing and legitimate URLs from the Kaggle website.
Data Cleaning:
The collected data needs to be cleaned to remove inconsistencies, missing values, and outliers.
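The cleaning step can be sketched with pandas as follows. This is a minimal illustration, not the project's actual preprocessing code: the column names (`url_length`, `num_dots`, `label`) and the tiny data frame are made up for the example, and outlier handling is left out because the report does not say which rule was used.

```python
import pandas as pd


def clean_dataset(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Return a cleaned copy: no duplicate rows, numeric features, no missing values."""
    cleaned = df.drop_duplicates().copy()
    # Coerce feature columns to numbers; entries that cannot be parsed become NaN.
    feature_cols = [c for c in cleaned.columns if c != label_col]
    cleaned[feature_cols] = cleaned[feature_cols].apply(pd.to_numeric, errors="coerce")
    # Drop every row that still contains a missing value.
    return cleaned.dropna().reset_index(drop=True)


# Hypothetical rows for illustration only (not taken from the Kaggle dataset).
raw = pd.DataFrame({
    "url_length": [54, 54, None, 120, "n/a"],
    "num_dots": [2, 2, 3, 5, 1],
    "label": [0, 0, 1, 1, 0],
})
cleaned = clean_dataset(raw)
print(cleaned)
```

Dropping incomplete rows is the simplest choice; imputing missing values would be an alternative when too many rows would otherwise be lost.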
Feature Engineering:
Identify and engineer relevant features from the URL data. Extract attributes that may be indicative of
phishing behavior, such as domain length, special characters, and more.
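As an illustration of the kind of attributes mentioned above, the sketch below derives a few features from a raw URL using only the Python standard library. The feature names and the set of characters counted as "special" are assumptions made for this example; the dataset used in the project already stores its attributes in vector form.

```python
from urllib.parse import urlparse

# Characters counted as "special" here; this set is an assumption for the example.
SPECIAL_CHARS = "@-_?=&%"


def extract_url_features(url: str) -> dict:
    """Compute a small, illustrative feature vector for one URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "num_dots": host.count("."),
        "num_special_chars": sum(url.count(ch) for ch in SPECIAL_CHARS),
        "has_at_symbol": int("@" in url),
        "uses_https": int(parsed.scheme == "https"),
    }


features = extract_url_features("http://secure-login.example.com/verify?user=a@b.com")
print(features)
```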
Data Splitting:
Split the dataset into training and testing sets using the train_test_split function from Scikit-Learn.
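A minimal sketch of the split is shown below. Synthetic data from `make_classification` stands in for the phishing dataset, and the 80/20 ratio, stratification, and random seed are assumptions, since the report does not state the values used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix X and labels y (1 = phishing, 0 = legitimate).
X, y = make_classification(n_samples=1000, n_features=30, random_state=42)

# Hold out 20% of the rows for testing; stratify to keep the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```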
Model Selection:
Choose suitable machine learning models for phishing detection, such as Decision Tree, Support Vector
Machine (SVM), and Logistic Regression. Train these models using the training data.
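The three base models can be trained with scikit-learn roughly as follows. This is a sketch under assumptions: the data is synthetic, and the hyperparameters (tree depth, RBF kernel, iteration limit) and the feature scaling for SVM and Logistic Regression are illustrative choices rather than the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed phishing dataset.
X, y = make_classification(n_samples=1000, n_features=30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# SVM and Logistic Regression are scale-sensitive, so they are wrapped with a scaler.
models = {
    "DT": DecisionTreeClassifier(max_depth=8, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                    # train on the training split
    scores[name] = model.score(X_test, y_test)     # accuracy on the held-out split
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```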
Model Training:
Training Data: The prepared dataset is split into training and validation sets.
Model Training: Models are trained on the training data using suitable algorithms.
Evaluation Metrics:
Performance Metrics: Define evaluation metrics such as accuracy, precision, recall, F1-score, and more.
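The metrics named in this report can all be derived from the four confusion-matrix counts. The sketch below computes them in plain Python for a small hand-made example; the label vectors are invented for illustration and are not project results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, F1-score, and specificity."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Guard against division by zero when a class never occurs.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),
    }


# Invented labels: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Here TP = 3, FN = 1, TN = 5, and FP = 1, so accuracy is 0.8, precision, recall, and F1-score are each 0.75, and specificity is 5/6.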
Model Evaluation:
Model Validation: Validate the models using the validation dataset to assess their performance.
Model Interpretability: Examine the trained models to understand how they make predictions and
identify crucial features.
Visualization: Visualize data, model outputs, and feature importance for better insights.
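One simple way to identify crucial features is to read the impurity-based importances of a fitted Decision Tree. The sketch below ranks them on synthetic data; the feature names are placeholders, and this is only one of several possible interpretability techniques.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with placeholder feature names (not the project's attributes).
X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

# Importances are non-negative and sum to 1; sort them from most to least important.
ranking = sorted(
    zip(feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking[:3]:
    print(f"{name}: {importance:.3f}")
```

The same ranking can be passed to a Matplotlib bar chart when a visual summary is preferred.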
Monitoring: Continuously monitor the deployed model's performance, retraining it as needed to adapt to
changing patterns and data distributions.
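A monitoring rule of this kind could be as simple as comparing accuracy on recently labelled URLs with the accuracy measured at deployment time. The function name `needs_retraining`, the tolerance value, and the sample labels below are all hypothetical; the report does not describe a specific monitoring mechanism.

```python
def needs_retraining(y_true, y_pred, baseline_accuracy, tolerance=0.05):
    """Flag retraining when recent accuracy falls more than `tolerance` below baseline."""
    if not y_true:
        raise ValueError("at least one labelled example is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    recent_accuracy = correct / len(y_true)
    return recent_accuracy < baseline_accuracy - tolerance, recent_accuracy


# Hypothetical batch of recently labelled URLs (1 = phishing, 0 = legitimate).
recent_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
recent_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
retrain, recent_accuracy = needs_retraining(recent_true, recent_pred, baseline_accuracy=0.95)
print(retrain, recent_accuracy)
```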
This detailed methodology covers the essential stages of phishing prediction using a hybrid machine
learning model, from data collection and preprocessing to model deployment and monitoring, ensuring
the development of an effective phishing prediction system.
5. IMPLEMENTATION AND RESULTS
• A dataset from Kaggle, which combines data from different sources, is loaded.
6.2 MODEL ARCHITECTURE AND COMPILATION
6.4 PREDICTION
Predictions are produced by the hybrid ensemble, which combines the outputs of the
Logistic Regression, Support Vector Machine, and Decision Tree models through voting.
Each URL is classified as either phishing or legitimate, so that the strengths of one
model can offset the weaknesses of another.
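A hybrid of this kind can be sketched with scikit-learn's `VotingClassifier`, which supports both hard voting (majority of predicted labels) and soft voting (average of predicted probabilities). The data below is synthetic and the hyperparameters are assumptions; this is an illustration of the voting idea, not the project's exact LSD configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed phishing dataset.
X, y = make_classification(n_samples=1000, n_features=30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


def build_lsd(voting: str) -> VotingClassifier:
    """Build an LR + SVC + DT ensemble using 'hard' or 'soft' voting."""
    estimators = [
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        # probability=True lets the SVC supply class probabilities for soft voting.
        ("svc", make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),
        ("dt", DecisionTreeClassifier(max_depth=8, random_state=42)),
    ]
    return VotingClassifier(estimators=estimators, voting=voting)


hard_model = build_lsd("hard").fit(X_train, y_train)
soft_model = build_lsd("soft").fit(X_train, y_train)

hard_pred = hard_model.predict(X_test)          # majority vote over predicted labels
soft_proba = soft_model.predict_proba(X_test)   # averaged class probabilities
soft_accuracy = soft_model.score(X_test, y_test)
print("hard-voting accuracy:", hard_model.score(X_test, y_test))
print("soft-voting accuracy:", soft_accuracy)
```

Hard voting needs only class labels from each model, while soft voting also takes each model's confidence into account.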
6. CONCLUSION AND FUTURE ENHANCEMENTS
6.1 CONCLUSION
In the realm of cybersecurity, where the ever-evolving landscape of cyber threats
poses a constant challenge, this project has taken significant strides in addressing a
pressing concern: phishing attacks. The project, "Phishing Detection System Through
Hybrid Machine Learning Model," has yielded commendable results and is significant
for fortifying online security.
This project underscores the vital role that machine learning plays in modern
cybersecurity and encourages further exploration of hybrid approaches for a safer online
environment.
6.2 FUTURE ENHANCEMENTS
Here are some future enhancement ideas for the phishing detection model with
brief descriptions:
2. Multimodal Analysis: Expand the project by combining URL analysis with content analysis,
image recognition, and other data modalities. This comprehensive approach can create a more
robust and effective phishing detection system, capable of identifying phishing attempts across
various forms of content.
3. Email Service Integration: Integrate the phishing detection system with email services to
automatically scan links within emails for potential phishing URLs. This real-time scanning adds
an extra layer of protection for users, particularly in the context of email communication.
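The email integration idea could start from a helper that pulls links out of a message body and passes each one to the detector. In the sketch below, `extract_links`, `scan_email`, and `toy_classifier` are hypothetical names; the toy rule (flagging raw IP-address hosts) is only a placeholder for the trained model, and the sample message is invented.

```python
import re
from urllib.parse import urlparse

# Matches http/https links up to the next whitespace, quote, or angle bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_links(email_body: str) -> list:
    """Return the unique links in an email body, in order of appearance."""
    links = []
    for match in URL_PATTERN.findall(email_body):
        link = match.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        if link not in links:
            links.append(link)
    return links


def scan_email(email_body: str, classify) -> dict:
    """Apply a URL classifier (any callable taking a URL) to every link found."""
    return {link: classify(link) for link in extract_links(email_body)}


def toy_classifier(url: str) -> str:
    """Placeholder for the trained model: flags hosts that are raw IPv4 addresses."""
    host = urlparse(url).hostname or ""
    return "phishing" if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) else "legitimate"


body = "Dear user, verify at http://198.51.100.7/login. Our site is https://example.com/help, thanks."
report = scan_email(body, toy_classifier)
print(report)
```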
APPENDIX
REFERENCES
[1] Abdul Karim, Mobeen Shahroz, Khabib Mustofa, Samir Brahim, and S. Ramana Kunar Joga,
“Phishing Detection System Through Machine Learning Based on URL”, vol. 11, issue no. 3, DOI
2023.3252366, 03 March 2023.
[2] A. A. Ubing, S. Kamilia, A. Abdullah, N. Jhanjhi, and M. Supramaniam, “Phishing website detection:
An improved accuracy through feature selection and ensemble learning”, vol. 10, issue no. 1, 23 March
2019.
[3] Maria Sameen, Kyunghyun Han, and Seong Oun Hwang, “PhishHaven—An Efficient Real-Time AI
Phishing URLs Detection”, vol. 8, issue no. 2, 20 April 2020.