
A

PROJECT REPORT
ON
Phishing Detection System Through Hybrid
Machine Learning Model
SUBMITTED IN PARTIAL FULFILMENT FOR THE AWARD
OF THE DEGREE OF
BACHELOR OF ENGINEERING
in
ARTIFICIAL INTELLIGENCE
by
P Prem Theja (20EG106157)
S Sai Charan Reddy (20EG106136)
B Naresh (21EG506101)

UNDER THE GUIDANCE OF


Dr. K Basava Raju
ASSOCIATE PROFESSOR
DEPARTMENT OF ARTIFICIAL INTELLIGENCE.

DEPARTMENT OF ARTIFICIAL INTELLIGENCE


ANURAG UNIVERSITY, VENKATAPUR
TELANGANA– 500088

2023-2024
CERTIFICATE

This is to certify that the project work entitled “Phishing Detection System Through
Hybrid Machine Learning Model”, submitted in partial fulfillment of the requirements
for the award of the degree of BACHELOR OF ENGINEERING in ARTIFICIAL
INTELLIGENCE to ANURAG UNIVERSITY, Hyderabad, is a record of bona fide work carried
out by them under the supervision and guidance of Dr. K Basava Raju. The results embodied in this
report have not been submitted to any other University for the award of any other Degree or Diploma.

Date:

Name: P Prem Theja Signature:


Place: Hyderabad Roll No: 20EG106157

Name: S Sai Charan Reddy Signature:


Place: Hyderabad Roll No: 20EG106136

Name: B Naresh Signature:


Place: Hyderabad Roll No: 21EG506101

Project Guide Head of the Department


Dr. K Basava Raju Dr. A. Mallikarjuna Reddy
TABLE OF CONTENTS

CHAPTER PG.NO

List of Tables i
List of Figures ii
List of Abbreviations iii
ACKNOWLEDGEMENT iv
ABSTRACT v

1. INTRODUCTION 1

2. PROBLEM STATEMENT 2

2.1 Existing System

2.2 Proposed System

3. SPECIFICATIONS 5

3.1 Hardware Requirements

3.2 Software Requirements

3.3 Packages

4. METHODOLOGY 7

5. IMPLEMENTATION AND RESULTS 9

5.1 Feature Extraction

5.2 Model Architecture and Compilation

5.3 Evaluation and Analysis


5.4 Predictions

6. CONCLUSION AND FUTURE ENHANCEMENTS 11

6.1 Conclusion

6.2 Future Enhancements

APPENDIX 13

REFERENCES 24
LIST OF TABLES

SNo Table Name Page.No


1 Literature Review of Existing systems. 3

LIST OF FIGURES

Fig.No Figure Name Page.No


1 Architecture Diagram 4

2 Histogram Data count 10

LIST OF ABBREVIATIONS

SVC : Support Vector Classifier

DT : Decision Tree

SVM : Support Vector Machine

ACKNOWLEDGEMENT

We extend our sincere thanks to Prof. S Ramachandram, Vice-Chancellor,
Anurag University, for his encouragement and constant coordination.

We would like to thank Prof. Balaji Utla, Registrar of Anurag University,
for his valuable suggestions and motivation to give our best while
working on this project.

We take pleasure in thanking Dr. A Mallikarjuna Reddy, Head of
Department, Department of Artificial Intelligence, Anurag University, for
his incredible support and inspiration.

It is our privilege to express gratitude and indebtedness to our
coordinator Ms. Shilpa Shesham, Assistant Professor, Department of
Artificial Intelligence, Anurag University, for her incredible support and
constructive criticism, which helped us do better.

We express our sincere gratitude to Dr. K Basava Raju, Associate
Professor, Department of Artificial Intelligence, Anurag University, for his
valuable insights, motivation, and guidance for the successful completion of
the work.

ABSTRACT

In this study, the primary focus is on combating the escalating threat of phishing attacks, a form
of cybercrime that has evolved into a highly dangerous and prevalent menace on the internet. Phishing
leverages email deception and counterfeit websites to trick users into divulging sensitive information.
Despite numerous research efforts in the field, there remains a lack of comprehensive solutions to
effectively thwart these attacks.

To address this gap, machine learning emerges as a crucial line of defense against cybercrimes,
particularly phishing. The project utilizes a phishing URL-based dataset, extracted from a well-known
repository, which aggregates attributes from both phishing and legitimate URLs. This dataset,
encompassing data from over 11,000 websites in vector form, forms the foundation of the study.

The dataset is preprocessed to ensure data quality and consistency before the machine learning
algorithms are applied. These algorithms include Decision Tree (DT), Logistic Regression (LR),
Support Vector Classifier (SVC), and a hybrid model named LSD (Logistic Regression + Support
Vector Machine + Decision Tree). The hybrid model combines its three base classifiers through both
soft and hard voting.

1. INTRODUCTION

The advent of the internet has ushered in an era of unprecedented connectivity and convenience,
but it has also paved the way for an alarming surge in cybercrime. Among the myriad forms of online
threats, phishing attacks have risen to the forefront as one of the most insidious and dangerous
cybercrimes.

Phishing attacks have become a pervasive and dangerous form of cybercrime on the internet,
threatening individuals, organizations, and even governments. Despite significant efforts in research and
cybersecurity, the battle against phishing attacks remains far from over. The attackers constantly adapt
their tactics, making it a formidable challenge to develop comprehensive and enduring countermeasures.

In this landscape of evolving cyber threats, machine learning has emerged as a pivotal tool in the
arsenal of cybersecurity. Machine learning, a subset of artificial intelligence, empowers systems to
recognize patterns, adapt, and make intelligent decisions without explicit programming. It offers a
dynamic approach to combating threats like phishing attacks.

This project is grounded in the belief that a fusion of machine learning algorithms can provide an
effective defense against the ever-evolving world of phishing attacks. We take as our starting point a
dataset that captures the characteristics of phishing and legitimate URLs extracted from over 11,000
websites. These URLs, after preprocessing, serve as the foundation for developing and deploying a range
of machine-learning models.

The primary goal of this study is to design and evaluate a robust and efficient system for the
detection of phishing attacks using a hybrid machine-learning approach. The proposed hybrid model,
known as the LSD model (Logistic Regression + Support Vector Machine + Decision Tree), combines
multiple machine-learning techniques to enhance accuracy and efficiency.

In the pursuit of a comprehensive solution, this project evaluates the proposed approach using a
set of diverse evaluation metrics, including precision, accuracy, recall, F1-score, and specificity.
Comparative analyses against other models will showcase the effectiveness of the hybrid approach in
defending against phishing attacks.

In summary, this project is dedicated to fortifying our digital defenses by harnessing the power of
machine learning. By focusing on phishing attacks and the development of a hybrid model, we aim to
provide a robust safeguard against one of the most potent and pervasive cybercrimes on the internet.

2. PROBLEM STATEMENT

The digital landscape is currently plagued by a myriad of cybercrimes, with phishing attacks
standing out as a pervasive and severe threat. Phishing, which first appeared in the mid-1990s, has
evolved into a highly dangerous and deceitful cybercrime. It relies on deception, using misleading
emails and counterfeit websites to trick users into revealing their most sensitive information.

While extensive efforts have been dedicated to understanding, preventing, and mitigating
phishing attacks, a comprehensive and enduring solution remains elusive. The attackers continuously
adapt their tactics and strategies, thereby challenging the efficacy of existing countermeasures.

Machine learning, an increasingly potent tool in cybersecurity, holds the promise of providing a
more adaptive and dynamic defense against these evolving phishing threats. However, the effectiveness
of machine learning models in countering phishing is still a topic of ongoing research and development
[1].

The problem at hand is to design a robust and efficient system for the detection of phishing
attacks based on URL characteristics. This system must leverage machine learning techniques, including
a novel hybrid model, to combat phishing with high accuracy and efficiency. Additionally, it must be
evaluated against well-defined metrics to demonstrate its superiority over other existing models [1].

In essence, the problem statement is to bridge the gap in current cybersecurity efforts by
developing a cutting-edge solution capable of defending against the most severe forms of online
deception, thereby safeguarding individuals and organizations against the relentless threat of phishing
attacks.

2.1 EXISTING SYSTEM

The current state of phishing detection systems is marked by a reliance on traditional methods and
rudimentary tools. These existing systems often fall short of effectively combatting the evolving and
sophisticated tactics employed by cybercriminals in phishing attacks.

Most conventional approaches rely on rule-based methods, such as blacklists and heuristics, to
identify potential phishing URLs and emails. These systems have limitations, as they struggle to keep
pace with the continuous creation of new phishing sites and techniques. They often generate false positives
and false negatives, which can lead to user frustration and decreased trust in the system [2].

Table 1: Literature Review of Existing Systems

Paper Title | Year | Algorithms Used | Drawbacks
“Phishing or Not Phishing? A Survey on the Detection of Phishing Websites” | 2023 | Logistic Regression with stochastic gradient descent, Perceptron, and Confidence-Weighted Classification | Vulnerable to URL manipulation
“PhishHaven—An Efficient Real-Time AI Phishing URLs Detection System” | 2021 | AdaBoost Classifier, Neural Networks, Support Vector Machines | High computational complexity

Furthermore, existing systems typically lack the adaptability and learning capabilities required to
respond dynamically to emerging threats. They may not effectively adapt to changes in attacker behavior
or the subtle variations in phishing strategies. As a result, users remain vulnerable to increasingly
deceptive and well-crafted phishing attempts [2].

In essence, the current state of phishing detection systems is marred by their inability to provide
robust protection against the ever-evolving threat of phishing attacks. As cybercriminals become more
sophisticated, there is a pressing need for a more advanced and adaptive solution to counteract this form
of cybercrime effectively [3].

2.2 PROPOSED SYSTEM

In response to the inherent limitations of existing phishing detection systems, this project introduces
a highly advanced and adaptive system for phishing detection based on URL characteristics. The proposed
system leverages the power of hybrid machine learning models to significantly enhance the accuracy and
efficiency of detecting phishing attacks, addressing the dynamic nature of these threats.

Proposed System Architecture

Fig 1. Architecture Diagram

3. SPECIFICATIONS

3.1 HARDWARE REQUIREMENTS

Processor : i3 or above
RAM : 4GB or above
Hard Disk : 100GB or above

3.2 SOFTWARE REQUIREMENTS

Software requirements establish the agreement between your team and the customer
on what the application is supposed to do. Without a description of what features will be
included and details on how the features will work, the users of the software cannot
determine if the software will meet their needs. The key software requirements required for
the project are:

Python
Google Colab

3.2.1 PYTHON :

Python is an interpreted, object-oriented, high-level programming language with
dynamic semantics. Its high-level built-in data structures, combined with dynamic typing
and dynamic binding, make it very attractive for rapid application development, as well as
for use as a scripting or glue language to connect existing components together.

3.2.2 GOOGLE COLAB:

Google Colab is a free Jupyter notebook environment that runs entirely in the cloud.
It requires no setup, and the notebooks created in it can be edited simultaneously by
team members, much like documents in Google Docs. Colab supports many popular
machine learning libraries, which can be easily loaded in a notebook. Using Google
Colab, a programmer can:

• Write and execute code in Python.
• Document code with text that supports mathematical equations.
• Import notebooks from, and save notebooks to, Google Drive.
• Integrate PyTorch, TensorFlow, Keras, and OpenCV.

3.3 PACKAGES :

3.3.1 NumPy :

NumPy is a fundamental library in Python for numerical and scientific computing.
It provides support for large, multi-dimensional arrays and matrices, along with a collection
of mathematical functions to operate on these arrays efficiently.

3.3.2 Pandas:

Pandas is a popular Python library for data manipulation and analysis. It offers
easy-to-use data structures such as DataFrames and Series, which simplify tasks such as
cleaning, exploring, and analyzing data.

3.3.3 Matplotlib:

Matplotlib is a comprehensive library for data visualization, used to plot static and
interactive charts, graphs, and figures. In this project it is used for visualizing model
training results.

3.3.4 Sklearn:

Scikit-learn (sklearn) provides a wide range of machine learning algorithms, utilities,
and metrics for model development and evaluation. In this project it is used for the
train/test split, preprocessing, and model evaluation.

4. METHODOLOGY

Data Collection and Preprocessing:

Acquire a suitable dataset of phishing and legitimate URLs. Verify data quality and address missing values
or inconsistencies.

Data Sources:
Dataset of phishing and legitimate URLs from the Kaggle website.

Data Cleaning:
The collected data needs to be cleaned to remove inconsistencies, missing values, and outliers.
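
The cleaning step can be sketched roughly as follows. This is only an illustration under stated assumptions: the column names (`url_length`, `label`) and the helper `clean_dataset` are hypothetical and do not come from the report's dataset, and outlier handling is left out because the report does not say which rule was applied.

```python
import pandas as pd


def clean_dataset(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Remove duplicate rows, drop unlabeled rows, and fill numeric gaps.

    `label_col` is a hypothetical column name used only for this sketch.
    """
    # Exact duplicate rows add no information and can leak between splits.
    cleaned = df.drop_duplicates()
    # A row without a class label cannot be used for supervised training.
    cleaned = cleaned.dropna(subset=[label_col])
    # Fill remaining missing numeric feature values with each column's median.
    feature_cols = [c for c in cleaned.columns if c != label_col]
    medians = cleaned[feature_cols].median(numeric_only=True)
    cleaned = cleaned.fillna(medians)
    return cleaned.reset_index(drop=True)


# Tiny made-up frame, for illustration only.
raw = pd.DataFrame(
    {"url_length": [10, 10, None, 30], "label": [1, 1, 0, None]}
)
tidy = clean_dataset(raw)
print(tidy)
```

Filling with the median is one common choice; dropping incomplete rows entirely would be an equally reasonable alternative for a dataset of this size.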

Feature Engineering:
Identify and engineer relevant features from the URL data. Extract attributes that may be indicative of
phishing behavior, such as domain length, special characters, and more.

Data Splitting:
Split the dataset into training and testing sets using the train_test_split function from Scikit-Learn.
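
A minimal sketch of this split is shown below. The feature matrix is random placeholder data, and the 80/20 ratio, the `random_state` value, and the use of `stratify` are assumptions made for illustration; the report does not state which settings were used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the vectorized URL dataset (not the real one).
rng = np.random.default_rng(0)
X = rng.integers(-1, 2, size=(1000, 30))  # hypothetical features in {-1, 0, 1}
y = rng.integers(0, 2, size=1000)         # hypothetical labels: 1 = phishing

# Hold out 20% of the rows for testing; stratify keeps the class ratio
# roughly equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, which helps when comparing several models on the same held-out rows.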

Model Selection and Architecture:

Model Selection: Choose suitable machine learning models for phishing detection, such as
Decision Tree, Support Vector Machine (SVM), and Logistic Regression. Train these models
using the training data.

Model Training:

Training Data: The prepared dataset is split into training and testing sets.

Model Training: Models are trained on the training data using suitable algorithms.

Evaluation Metrics:

Performance Metrics: Define evaluation metrics such as accuracy, precision, recall, F1-score, and more.

Model Evaluation:

Model Validation: Validate the models on the held-out test set to assess their performance.

Interpretability and Visualization:

Model Interpretability: Examine the trained models to understand how they make predictions and
identify crucial features.

Visualization: Visualize data, model outputs, and feature importance for better insights.

Deployment and Monitoring:


Model Deployment: Deploy the trained model in a production environment for real-time prediction.

Monitoring: Continuously monitor the deployed model's performance, retraining it as needed to adapt to
changing patterns and data distributions.

This methodology covers the essential stages of phishing prediction using a hybrid machine
learning model, from data collection and preprocessing to model deployment and monitoring.

5. IMPLEMENTATION AND RESULTS

• A dataset of phishing and legitimate URLs, aggregated from different sources, is loaded from Kaggle.

5.1 FEATURE EXTRACTION


This process encompasses the identification of structural components, domain
features, and syntax patterns. Employing preprocessing techniques ensures data uniformity
and quality improvement. Additionally, key phishing indicators, such as subdomain
irregularities and suspicious keywords, are extracted. The culmination involves transforming
these features into a vector representation for seamless integration with machine learning
algorithms. This meticulous feature extraction is instrumental in empowering the hybrid
machine-learning model to discern subtle patterns indicative of phishing behavior, thereby
enhancing the overall efficacy of the system.
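
The kind of URL-to-vector transformation described above could look roughly like the sketch below. It is not the feature set used in the project: the keyword list, the choice of features, and the function name `extract_url_features` are assumptions made for illustration, using only Python's standard `urllib.parse` module.

```python
from urllib.parse import urlparse

# Hypothetical keyword list for illustration; not taken from the report's dataset.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "update", "account")


def extract_url_features(url: str) -> list[int]:
    """Turn a URL string into a small numeric feature vector."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = host.split(".") if host else []
    # Count every label except the last two as a subdomain. This is a
    # simplification that ignores multi-part suffixes such as "co.uk".
    subdomain_count = max(len(labels) - 2, 0)
    lowered = url.lower()
    return [
        len(url),                                   # total URL length
        len(host),                                  # host name length
        subdomain_count,                            # number of subdomains
        sum(not ch.isalnum() for ch in url),        # special characters
        int("@" in url),                            # "@" present in the URL
        int(parsed.scheme == "https"),              # uses HTTPS
        sum(kw in lowered for kw in SUSPICIOUS_KEYWORDS),  # keyword hits
    ]


features = extract_url_features(
    "http://secure-login.example.com.verify.io/account?id=1"
)
print(features)
```

Each URL becomes a fixed-length list of integers, so a batch of URLs can be stacked into a matrix and passed directly to the classifiers.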

Fig 2. Histogram Data count

5.2 MODEL ARCHITECTURE AND COMPILATION

The model architecture uses a hybrid approach, combining Logistic Regression,
Support Vector Machine, and Decision Tree models into a single voting ensemble for
phishing detection. Compilation involves configuring the parameters and settings of
each base model so that the three algorithms work together compatibly and
efficiently.
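
One way to express such a hybrid is scikit-learn's `VotingClassifier`, sketched below. The data comes from `make_classification` as a synthetic stand-in for the URL feature vectors, and the hyperparameters are library defaults or arbitrary choices; the report does not list its actual settings, so this should be read as an illustration rather than the project's implementation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the URL feature vectors (not the report's dataset).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)


def build_lsd(voting: str) -> VotingClassifier:
    """Combine Logistic Regression, SVC, and Decision Tree in one ensemble."""
    estimators = [
        ("lr", LogisticRegression(max_iter=1000)),
        # probability=True lets the SVC output class probabilities,
        # which soft voting requires.
        ("svc", SVC(probability=True, random_state=0)),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ]
    return VotingClassifier(estimators=estimators, voting=voting)


# Hard voting takes the majority class label; soft voting averages the
# predicted class probabilities and picks the most probable class.
scores = {}
for voting in ("hard", "soft"):
    model = build_lsd(voting).fit(X_train, y_train)
    scores[voting] = model.score(X_test, y_test)

print(scores)
```

Hard voting only needs class labels from each base model, while soft voting can give more weight to confident predictions, which is why both modes are worth comparing on the same held-out data.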

5.3 EVALUATION AND ANALYSIS

The trained model is tested on previously unseen test data. Metrics such as accuracy,
precision, recall, and F1-score provide a quantitative evaluation of model performance. The
confusion matrix shows which errors the model makes when classifying phishing and
legitimate URLs. The evaluation results are analyzed to identify gaps, misclassifications, and other
issues. This provides insights to further refine the model architecture, training process, and
data preprocessing.
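
The metrics named above, together with specificity, all follow from the four confusion-matrix counts. The sketch below computes them in plain Python; the helper names and the counts in the usage example are made up for illustration and are not results from this project.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn), treating label 1 (phishing) as positive."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive standard metrics from confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        # Guard against division by zero when a class is absent.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Made-up counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In a phishing setting, recall reflects how many phishing URLs are caught, while specificity reflects how many legitimate URLs are left alone, so the two are worth reporting side by side.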

5.4 PREDICTIONS
Model predictions involve leveraging the strengths of Logistic Regression, Support
Vector Machine, and Decision Tree models within the hybrid ensemble. These predictions
accurately classify URLs, distinguishing between phishing and legitimate ones, providing
robust protection against cyber threats with high precision and efficiency.

6. CONCLUSION AND FUTURE ENHANCEMENTS

6.1 CONCLUSION
In the realm of cybersecurity, where the ever-evolving landscape of cyber threats
poses a constant challenge, this project has taken significant strides in addressing a
pressing concern – phishing attacks. The project, "Phishing Detection Through Hybrid
Machine Learning Model," has yielded commendable results and holds immense
significance in fortifying online security.

Through the comprehensive methodology outlined in this project, we achieved the
development of a robust hybrid machine learning model. This model, combining the
strengths of Logistic Regression, Support Vector Machine (SVM), and Decision Tree,
demonstrated superior performance in the detection of phishing attacks. With a focus on
critical attributes of URLs, the model successfully differentiated between genuine and
malicious URLs, bolstering the protection of end-users against phishing attempts.

In conclusion, this project stands as a testament to the power of machine learning
in bolstering online security. Its accomplishments are not only practical but also indicative
of the continuous vigilance and innovation required to stay one step ahead in the relentless
battle against cybercrime. As the digital landscape evolves, so too must our defenses, and
this project is a step in the right direction.

This project underscores the vital role that machine learning plays in modern
cybersecurity and encourages further exploration of hybrid approaches for a safer online
environment.

6.2 FUTURE ENHANCEMENTS

Here are some future enhancement ideas for the phishing detection model, with
brief descriptions:

1. Human-Machine Collaboration: Implement a system that allows human experts to work
collaboratively with machine learning models. Develop mechanisms for experts to provide
feedback and refine the model's performance over time. This collaboration can lead to continuous
improvement and adaptation to emerging phishing tactics.

2. Multimodal Analysis: Expand the project by combining URL analysis with content analysis,
image recognition, and other data modalities. This comprehensive approach can create a more
robust and effective phishing detection system, capable of identifying phishing attempts across
various forms of content.

3. Email Service Integration: Integrate the phishing detection system with email services to
automatically scan links within emails for potential phishing URLs. This real-time scanning adds
an extra layer of protection for users, particularly in the context of email communication.

APPENDIX

REFERENCES

[1] Abdul Karim, Mobeen Shahroz, Khabib Mustofa, Samir Brahim, and S. Ramana Kunar Joga,
“Phishing Detection System Through Machine Learning Based on URL”, vol 11, issue no. 3, DOI
2023.3252366, 03 March 2023.

[2] A. A. Ubing, S. Kamilia, A. Abdullah, N. Jhanjhi and M. Supramaniam, “Phishing website detection:
An improved accuracy through feature selection and ensemble learning”, vol. 10, issue no. 1, 23 March
2019.

[3] Maria Sameen, Kyunghyun Han, Seong Oun Hwang, “PhishHaven—An Efficient Real-Time AI
Phishing URLs Detection”, vol 8, issue no. 2, 20 April 2020.
