Major Project File

A Practical Training Report
submitted
in partial fulfillment
for the award of the Degree
Bachelor of Technology
in Department of Computer Science and Engineering
Department of Computer Science and Engineering
Amity School of Engineering & Technology
Amity University Rajasthan, Jaipur

CANDIDATE’S DECLARATION
We hereby declare that the work presented in this report, entitled “Phishing URL detection using
Machine Learning”, is submitted in partial fulfillment for the award of the Degree of “Bachelor of
Technology” in Department of Computer Science and Engineering to the Department of
Computer Science and Engineering, Amity School of Engineering & Technology, Amity University,
Rajasthan.
Signature of Head of Department
Name:
(Office seal)
ACKNOWLEDGEMENT
A major project is a golden opportunity for learning and self-development. We consider ourselves very
lucky and honored to have had so many wonderful people lead us through the completion of this project.
First and foremost, we would like to thank Dr. Sunil Pathak, HOD, CSE, who gave us the opportunity
to undertake this project.
Our grateful thanks go to Dr. Sunil Pathak for his guidance in our project work; in spite of being
extraordinarily busy with academics, he took time out to hear, guide and keep us on the correct path. We
do not know where we would have been without his help.
The CSE department monitored our progress and arranged all facilities to make life easier. We choose this
moment to acknowledge their contribution gratefully.
Name and signature of Students
Sakshi Methani (A20405219022)
Shipra Kanwar (A20405219089)

ABSTRACT
Phishing is a kind of social engineering attack in which the attacker, posing as a legitimate
source, lures the victim into giving up personal data such as financial information.
Attackers keep coming up with new techniques, and since most internet users are not able to
differentiate between a legitimate website and a malicious one, detection of such links has become
a crucial concern. There are several ways to identify a phishing website, but they are extremely
time consuming. Machine learning can be used to detect phishing web links. To mitigate
phishing threats, researchers are therefore working on improving the accuracy of phishing detection
through a variety of list-based and machine learning-based methods that leverage host
information, handcrafted features of URLs and website contents. This work is a survey of the
latest trends in phishing URL detection, with a special focus on machine learning-based
approaches.
CONTENTS
TITLE
CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
Chapter-1: Project Introduction
1.1 : Motivation
1.2 : Overview
1.3 : Expected outcome
1.4 : Gantt Chart
1.5 : SRS
Chapter-2: Methodology
2.1 : Dataset Selection
2.2 : Feature Extraction
Chapter-3: Design Criteria
3.1 : System Design
3.2 : Design Diagrams
3.3 : System Architecture
3.4 : Information & Communication design
Chapter-4: Development & Implementation
4.1 : Developmental feasibility
4.2 : Implementation Specifications
4.3 : System modules and flow of implementations
4.4 : Critical modules of product/system
Chapter-5: Results & Testing
5.1 : Decision Tree
5.2 : Random Forest
5.3 : KNN
5.4 : Logistic Regression
5.5 : Accuracy
5.6 : Phishing URL detection using Random forest
Chapter-6: Conclusion & Future Improvements
6.1 : Performance Estimation
6.2 : Usability of Product / system
6.3 : Limitations
6.4 : Scope of Improvement
References
LIST OF FIGURES
Figure Number    Figure Name
Fig 1.1          Gantt chart
Fig 2.1          Proposed model
Fig 2.1.1        Dataset information
Fig 2.2.1        Process flowchart for selected features
Fig 3.2.1        System Architecture
Fig 5.1.1        DT testing and training
Fig 5.1.2        DT accuracy
Fig 5.1.3        DT confusion matrix
Fig 5.2.1        RF testing and training
Fig 5.2.2        RF accuracy
Fig 5.2.3        RF confusion matrix
Fig 5.3.1        KNN testing and training
Fig 5.3.2        KNN accuracy
Fig 5.3.3        KNN confusion matrix
Fig 5.4.2        LR accuracy
Fig 5.4.3        LR confusion matrix
Fig 5.6.1        Features Weights
Fig 5.6.2        Exporting and loading the model
Fig 5.6.3        Prediction Matrix
Fig 5.6.4        Input 1
Fig 5.6.5        Prediction 1
Fig 6            Confusion matrix Precision and Recall
LIST OF TABLES
Table Number     Table Name
Table 1          Dataset description
Table 2          Accuracy Comparison
Table 3          Model Performance
Table 4          Comparison of different ML algorithms
CHAPTER 1: PROJECT INTRODUCTION
1.1 MOTIVATION
In the 21st century, people are becoming increasingly dependent on information technology (IT)
and cyberspace. This dependence has given rise to many essential web-based services, such as
money transactions and social media, and has increased the availability and accessibility of
these services on a daily basis [1]. As the number of users of these services has grown,
cyber-attacks and fraud have grown with it. Attackers pose a serious threat to both end users and
the web-based services themselves, and they use various techniques to obtain confidential
information about their targets.
The most common type of attack against these services is the phishing attack: a social
engineering attack in which the attacker gets the target to reveal confidential information such
as credentials or credit card numbers [2].
Phishers steal personal information and financial account details such as usernames and
passwords, leaving users vulnerable in the online space. Existing phishing detection techniques
suffer from low detection accuracy and high false alarm rates, especially when novel phishing
approaches are introduced.
1.2 PROJECT OVERVIEW
Phishing attacks take two forms: social engineering based and malware based. Social
engineering-based attacks manipulate the victim into giving up personal confidential
information, while malware-based attacks use a keylogger or screen logger to record the
keystrokes made by the user. Phishing can be conducted through emails, calls,
messages and websites [3]. In this report, we focus solely on website phishing attacks. In
web phishing, the phisher creates an illegitimate, malicious webpage that mimics a genuine and
legitimate webpage [4]. The URL of this page is then sent to the target in emails or
text messages with the intent of making the user enter confidential information on the
webpage, where it can later be accessed by the attacker. Even though conducting a phishing attack
does not require much technical expertise, the severity of these attacks is very high, because
the attacker gets hold of sensitive data that can be used for malicious actions such as
defamation, data leaks and financial gain [5]. According to many studies, phishing attacks are
growing rapidly and are causing large financial losses to organizations as well as individuals.
The project’s objectives are as follows:

• To study various automatic phishing detection methods
• To identify the appropriate machine learning techniques and define a solution using the
selected method
• To select an appropriate dataset for the problem statement
• To apply appropriate algorithms to achieve the solution to phishing attacks
The main purpose of this project is to detect fake or malicious URLs or websites that try
to gain access to a user’s sensitive data. In this project we use machine learning algorithms
to detect such phishing websites and thereby safeguard that data.
1.3 EXPECTED OUTCOME
● Help users in identifying malicious phishing URLs and avoiding being attacked.
● A comparative analysis of multiple models and algorithms to detect a Phishing Website,
as well as URL attributes used to distinguish between a valid and malicious website, is
presented in a research paper, which also includes a review of prior works.
1.4 GANTT CHART
1.5 SOFTWARE REQUIREMENT SPECIFICATIONS
A. Hardware Requirements
a. Processor CPU - Intel Pentium Dual Core and Higher
b. Hard Disk capacity - 512MB Space required minimum
c. RAM - 4GB minimum
B. Software requirements
a. Programming language - Python
b. Operating system - Windows 8.1 or above
c. IDE - Anaconda, iPython version 3.x
C. Supported Python modules
a. NumPy
b. Pandas
c. Seaborn
d. Matplotlib
e. Scikit-learn
The hardware and software requirements for any project are essential to ensure smooth and
efficient execution. For the phishing URL detection project, the hardware requirements are
not too demanding. The system should have an Intel Pentium Dual Core processor or higher,
at least 512 MB of hard disk space, and 4 GB of RAM. These requirements are relatively
easy to fulfill, and most modern computers should meet these requirements without any
issues.
Moving on to software requirements, the project requires the use of the Python programming
language. The Python language is widely used in the field of machine learning due to its
simplicity, versatility, and the availability of various libraries and modules. The project also
requires an operating system that is compatible with Python, such as Windows 8.1 or above.
Additionally, an Integrated Development Environment (IDE) is required to write, test and
debug the code. For this project, the recommended IDEs are Visual Studio Code and
Anaconda.
Furthermore, the project requires several Python modules to support the machine learning
algorithm's implementation. These modules include NumPy, Pandas, Scikit-learn, Matplotlib,
and Seaborn, among others. These modules are essential for various tasks such as data
manipulation, data visualization, and machine learning algorithm implementation.
In summary, the hardware and software requirements for the phishing URL detection project
are relatively straightforward and easy to fulfill. The project requires a system with an Intel
Pentium Dual Core processor or higher, at least 512 MB of hard disk space, and 4 GB of
RAM. The software requirements include the Python programming language, an operating
system that supports Python, and an IDE such as Visual Studio Code or Anaconda.
Additionally, several Python modules such as NumPy, Pandas, Scikit-learn, Matplotlib, and
Seaborn are necessary to support the machine learning algorithm's implementation.
Other Non-Functional Requirements
A non-functional requirement describes how the system should operate, rather than what it does,
and covers qualities that improve its usefulness. Some of them are as follows:
• Reusability: The same code, with limited changes, can be used to detect phishing attack
variants such as smishing and vishing.
• Maintainability: The implementation is very basic and includes print statements that make
it easy to debug.
• Usability: The software used is user friendly and open source. It also runs on any
operating system.
• Scalability: The implementation can be extended to include detection of vishing, smishing, etc.
CHAPTER 2: METHODOLOGY
We propose a random forest classifier-based model to distinguish between malicious and
legitimate URLs. The model is trained and tested on the UCI phishing dataset, using a total of
24 features from the dataset. Additional details of our approach are given below in this
section.
Fig 2.1: Proposed Method
2.1 Dataset Selection
The training and testing data used here is taken from the UCI phishing dataset. This is a
publicly available dataset consisting of a total of 11,055 entries. Of these 11,055 URL entries,
4,898 are legitimate URLs and the remaining 6,157 are phishing URLs. The rows of the dataset
represent the URLs and the columns represent the features. There are a total of 30 features in
the dataset. For each feature, a URL is assigned one of 3 numeric values: 1 for phishing, 0 for
suspicious and -1 for legitimate.
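As a quick consistency check on the class counts above, the short Python sketch below recomputes
the class shares and the mean of the label column. It assumes, following the encoding stated in
this section, that the Result column holds +1 for phishing and -1 for legitimate URLs; the
variable names are illustrative and are not taken from the project code.

```python
# Class counts as reported in the text above.
n_total = 11_055
n_legitimate = 4_898
n_phishing = 6_157

# The two classes should account for every entry in the dataset.
assert n_legitimate + n_phishing == n_total

share_phishing = n_phishing / n_total
share_legitimate = n_legitimate / n_total

# Assumption: the label column uses +1 for phishing and -1 for legitimate,
# as described in this section.
label_mean = (n_phishing * 1 + n_legitimate * -1) / n_total

print(f"phishing share:   {share_phishing:.3f}")
print(f"legitimate share: {share_legitimate:.3f}")
print(f"label mean:       {label_mean:.6f}")
```

Under that assumption the computed mean agrees with the value 0.113885 reported for Result in
Table I, and the phishing share (about 55.7%) is the accuracy a trivial classifier would reach by
always predicting the majority class.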
Figure 2.1.1: Dataset information [df.info()]
TABLE I: DATASET DESCRIPTION [df.describe()]
Feature                       count   mean        std          min   25%      50%    75%      max
index                         11055   5528        3191.44795   1     2764.5   5528   8291.5   11055
having_IP_Address             11055   0.313795    0.949534     -1    -1       1      1        1
URL_Length                    11055   -0.633198   0.766095     -1    -1       -1     -1       1
Shortining_Service            11055   0.738761    0.673998     -1    1        1      1        1
having_At_Symbol              10959   0.70344     0.710787     -1    1        1      1        1
double_slash_redirecting      11055   0.741474    0.671011     -1    1        1      1        1
having_Sub_Domain             11055   0.063953    0.817518     -1    -1       0      1        1
Domain_registeration_length   11055   -0.336771   0.941629     -1    -1       -1     1        1
popUpWidnow                   11055   0.613388    0.789818     -1    1        1      1        1
Iframe                        11055   0.816915    0.576784     -1    1        1      1        1
…                             …       …           …            …     …        …      …        …
Links_pointing_to_page        11055   0.344007    0.569944     -1    0        0      1        1
Statistical_report            11055   0.719584    0.694437     -1    1        1      1        1
Result                        11055   0.113885    0.993539     -1    -1       1      1        1
2.2 Feature Extraction
Feature selection is a crucial step in machine learning algorithms, especially in classification
schemes. The selected features play a vital role in determining the accuracy and effectiveness of the
prediction provided by the ML algorithm [16]. The better the selection of features, the better the
prediction result. Therefore, it is essential to identify the most relevant and prominent features and
combine them for the best possible outcome.
In our project, we used 24 features out of 30 available in the dataset. We categorized the selected
features into different classes to better understand their relevance and importance in our ML model.
The classes include practical features, lexical features, procedural features, and dropped features.
The practical features category includes the features that are most important according to
various studies and practical experience. These features have a significant impact on the outcome
of our prediction model. Some of the practical features we used in our project are URL_Length,
having_Sub_Domain, having_IP_Address, Prefix_Suffix, Domain_registeration_length, Request_URL,
URL_of_Anchor, SSLfinal_State, Links_in_tags, HTTPS_token, age_of_domain, Page_Rank,
web_traffic, Google_Index, DNSRecord, Links_pointing_to_page, and more.
In the lexical features category, we considered features that contain the textual properties of the
URL. These features help to identify certain characteristics of the URL that could be relevant to the
classification of phishing URLs. Some of the lexical features we used in our project include
having_At_Symbol, having_IP_Address, Domain_registeration_length, and more.
Procedural features are another category we used in our project. These features are used to identify
the process or procedure involved in the creation of the URL. Some of the procedural features
we used in our project include double_slash_redirecting and web_traffic.
Lastly, the dropped features are those that we did not consider for the training and testing of the
ML model. These features did not have a significant impact on our prediction model, and therefore
we excluded them from our analysis. The dropped features were Favicon, Iframe, Redirect,
popUpWidnow, RightClick, and Submitting_to_email.
The feature selection process is essential because it helps to reduce overfitting, improve the accuracy
of the model, and reduce the computational cost of training. The selected features are combined with
each other to provide the best possible result. It is crucial to determine which features are most
relevant and what combination of features will provide us with the best outcome.
To select the best features, we used different techniques, such as correlation analysis, feature
importance measures, and recursive feature elimination. Correlation analysis helped us to identify
highly correlated features, which could be removed to reduce redundancy. Feature importance
measures helped us to rank the features based on their relevance to the target variable. We used
the Gini importance measure to rank the features in our project. Recursive feature elimination
helped us to identify the optimal number of features required for our model.
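The correlation analysis step can be illustrated with a small sketch. The code below is not the
project's implementation; it is a minimal standard-library example that computes the Pearson
correlation between pairs of feature columns and reports the pairs that exceed a cut-off. The
function names, the toy data and the threshold of 0.9 are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.9):
    """Return feature pairs whose absolute correlation meets the threshold."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Toy data (made up for illustration), encoded in {-1, 0, 1} like the dataset.
toy = {
    "feature_a": [1, 1, -1, -1, 1, -1],
    "feature_b": [1, 1, -1, -1, 1, -1],  # identical to feature_a
    "feature_c": [1, -1, 1, -1, 0, 0],
}
print(redundant_pairs(toy))
```

In this toy example only feature_a and feature_b are flagged, so one of the two could be dropped
without losing information. The appropriate cut-off for real data is a judgment call.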
In conclusion, feature selection is a critical step in machine learning algorithms, especially in
classification schemes. It helps to identify the most relevant and prominent features, which are
combined to provide the best possible result. We categorized the selected features in our project into
different classes, such as practical features, lexical features, procedural features, and dropped
features. We used different techniques to select the best features, such as correlation analysis, feature
importance measures, and recursive feature elimination. The selected features helped to improve the
accuracy of our model and reduce the computational cost of training.
Fig 2.2.1: Process flowchart for selected features
CHAPTER 3: DESIGN CRITERIA
3.1 System Design
Data Collection and Preprocessing: In this phase, the dataset containing URLs is
collected from various sources and preprocessed to remove any irrelevant or duplicated
data.
Feature Extraction: In this phase, various features are extracted from the preprocessed
URLs using techniques like lexical analysis, regular expressions, and data mining. The
extracted features are then transformed into numerical form to be used by the ML
algorithms.
Feature Selection: In this phase, the most relevant features are selected from the
extracted features to be used by the ML algorithms. This is done using techniques like
correlation analysis and feature importance analysis.
Model Training: In this phase, the selected ML algorithm (in this case, Random Forest)
is trained using the selected features and the preprocessed dataset. The model is
evaluated using techniques like cross-validation and hyperparameter tuning.
Model Deployment: In this phase, the trained model is deployed in a production
environment to detect phishing URLs in real-time. The model can be integrated with a
web browser or a standalone application.
Performance Monitoring: In this phase, the performance of the deployed model is
monitored continuously to ensure that it is providing accurate results. The model can be
retrained periodically using new data to improve its accuracy.
The architecture of the system is shown in Fig 3.2.1. The URLs to be classified as
legitimate or phishing are fed as input to the appropriate classifier. The classifier, which
has been trained on the training dataset to classify URLs as phishing or legitimate, uses
the patterns it has learned to classify the newly fed input. Features such as IP address,
URL length, domain, presence of a favicon, etc. are extracted from the URL and a list of
their values is generated. The list is fed to classifiers such as KNN, kernel SVM, Decision
Tree and Random Forest. The performance of these models is then evaluated and an
accuracy score is generated. Using the generated list, the trained classifier predicts
whether the URL is legitimate or phishing. The list contains the value 1 if a feature
exists, 0 if it is not applicable and -1 if it does not exist. There are 30 features being
considered in this project.
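The step of turning a URL into a list of 1, 0 and -1 values can be sketched as follows. This is
a minimal illustration using only the Python standard library, not the project's feature
extractor. It covers five of the lexical features named in this report; the function names and
the URL-length cut-offs (54 and 75 characters) are assumptions made for the example, and the
encoding follows the convention described above (1 when the suspicious trait is present).

```python
import ipaddress
from urllib.parse import urlsplit

# Encoding used in this report: 1 = phishing, 0 = suspicious, -1 = legitimate.
PHISHING, SUSPICIOUS, LEGITIMATE = 1, 0, -1


def has_ip_host(host):
    """True if the host part is a literal IP address rather than a domain name."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def length_code(url):
    """Encode URL length; the 54/75 cut-offs are illustrative assumptions."""
    n = len(url)
    if n < 54:
        return LEGITIMATE
    if n <= 75:
        return SUSPICIOUS
    return PHISHING


def extract_features(url):
    """Return a dict of a few lexical features, each encoded as 1, 0 or -1."""
    host = urlsplit(url).hostname or ""
    after_scheme = url.split("://", 1)[-1]  # drop the "//" that follows the scheme
    return {
        "having_IP_Address": PHISHING if has_ip_host(host) else LEGITIMATE,
        "URL_Length": length_code(url),
        "having_At_Symbol": PHISHING if "@" in url else LEGITIMATE,
        "double_slash_redirecting": PHISHING if "//" in after_scheme else LEGITIMATE,
        "Prefix_Suffix": PHISHING if "-" in host else LEGITIMATE,
    }


for example in ("http://192.168.0.1/login", "http://example.com/"):
    print(example, list(extract_features(example).values()))
```

The resulting list of values is the kind of input that would be passed to a trained classifier.
Features that depend on external lookups (for example age_of_domain or Page_Rank) cannot be
derived from the URL string alone and are outside the scope of this sketch.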
3.2 Design Diagrams
Fig 3.2.1: System Architecture
3.3 System Architecture
Data Collection: The first stage in the system architecture for Phishing URL Detection using
ML is data collection. This involves gathering both phishing URLs and legitimate URLs from
various sources, such as public datasets, known phishing URL repositories, or web crawling.
The collected data is then used as the foundation for training and evaluating the machine
learning model.
Data Preprocessing and Feature Engineering: Once the data is collected, it needs to be
preprocessed and transformed into a suitable format for ML model training. This may involve
tasks such as data cleaning, feature extraction, and data normalization. Relevant features are
extracted from the URLs to represent them as input for the ML model. Features could include
URL length, domain name, presence of special characters, use of subdomains, etc.
ML Model Training: The preprocessed data is then used to train a machine learning model,
such as a binary classification model or a deep learning model, using an appropriate
algorithm and framework. This stage involves training the model on the labeled data, where
the phishing URLs are labeled as "phishing" and legitimate URLs are labeled as "legitimate".
The model learns to classify new URLs as either phishing or legitimate based on the patterns
it learns from the training data.
Model Evaluation and Deployment: After the model is trained, it needs to be evaluated to
assess its performance and effectiveness in detecting phishing URLs. This is done using
appropriate evaluation metrics, such as accuracy, precision, recall, F1-score, etc. Once the
model is evaluated and deemed effective, it can be deployed in a production environment,
such as a web application or API, to perform real-time or batch processing of incoming URLs
for phishing detection.
Model Monitoring, Maintenance, and User Interface: Once the model is deployed, it needs to
be monitored for its performance and updated periodically with new data to maintain its
accuracy and effectiveness. This may involve retraining the model, updating features, or
incorporating new data sources. Additionally, a user interface is developed to interact with
the system, allowing users to input URLs for phishing detection, view detection results, and
manage system settings. Proper security measures, such as authentication, authorization, and
encryption, are also implemented to protect the system and data from unauthorized access or
data breaches. Logging and auditing mechanisms are also implemented for monitoring,
analysis, and compliance purposes.
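The evaluation metrics named in the stages above (accuracy, precision, recall and F1-score) all
follow from the four counts of a binary confusion matrix. The sketch below shows the standard
definitions, treating "phishing" as the positive class; the function name and the example counts
are made up for illustration and are not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary metrics from confusion-matrix counts.

    tp: phishing URLs correctly flagged      fp: legitimate URLs wrongly flagged
    fn: phishing URLs that were missed       tn: legitimate URLs correctly passed
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts, for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision reflects how many flagged URLs were really phishing, while recall reflects how many
phishing URLs were caught; reporting both alongside accuracy matters when the two kinds of error
have different costs.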
In conclusion, the system architecture for Phishing URL Detection using Machine Learning
typically involves data collection, data preprocessing and feature engineering, ML model
training, model evaluation and deployment, model monitoring and maintenance, user
interface development, and implementation of security measures. Careful design and
implementation of the system architecture are essential to ensure the robustness, scalability,
and security of the Phishing URL Detection using ML system.
3.4 Information & Communication design
ML-based approaches have shown promising results in accurately detecting phishing URLs,
and Information and Communication Design (ICD) can enhance the usability and effectiveness
of these systems by designing visually appealing and user-friendly interfaces for end-users.
One important aspect of ICD in phishing URL detection using ML is the design of visual
representations of ML model outputs. For example, the output of an ML model may consist
of a probability score indicating the likelihood of a URL being a phishing URL. ICD can be
used to design visual representations, such as color-coded indicators or progress bars, to convey
this information to users in a clear and intuitive manner. This can help users quickly understand
the risks associated with a particular URL and make informed decisions about whether to
interact with it.
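As an illustration of such a visual representation, the sketch below maps a model's phishing
probability to a color-coded indicator and a simple text progress bar. The function names, the
three risk bands and the band boundaries (0.3 and 0.7) are assumptions chosen for the example,
not part of the system described in this report.

```python
def risk_indicator(p_phishing, low=0.3, high=0.7):
    """Map a phishing probability in [0, 1] to a (color, message) pair."""
    if not 0.0 <= p_phishing <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if p_phishing < low:
        return ("green", "Likely safe")
    if p_phishing < high:
        return ("amber", "Suspicious, proceed with caution")
    return ("red", "Likely phishing, do not enter credentials")


def text_bar(p_phishing, width=20):
    """Render the probability as a fixed-width text progress bar."""
    filled = round(p_phishing * width)
    return "[" + "#" * filled + "-" * (width - filled) + "]"


for p in (0.05, 0.50, 0.92):
    color, message = risk_indicator(p)
    print(f"{text_bar(p)} {p:.0%} {color}: {message}")
```

In a real interface the color would drive a badge or icon, and the band boundaries would be
tuned to the error rates the users can tolerate.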
Furthermore, ICD can be utilized to design informative and user-friendly dashboards that
display the results of phishing URL detection in a visually appealing way. These dashboards
can provide users with an overview of the URLs that have been scanned, their classification
results, and any additional actions that users should take. The design should prioritize the
display of relevant information in a visually organized and easily understandable format,
allowing users to quickly grasp the results of the phishing URL detection process.
ICD can also be employed in the design of user interactions with the ML-based phishing URL
detection system. This includes the design of user input forms for submitting URLs for
scanning, the design of progress indicators to keep users informed about the scanning process,
and the design of feedback messages to inform users of the scanning results. These design
elements should be visually appealing, easy to understand, and accessible to users with
different levels of technical expertise.
In conclusion, Information and Communication Design (ICD) plays a critical role in the
usability and effectiveness of phishing URL detection systems that utilize machine learning
(ML) techniques. By designing visually appealing and user-friendly interfaces, ICD can
enhance the overall user experience, empower users with information to make informed
decisions about interacting with URLs, and contribute to the success of ML-based phishing
URL detection systems in mitigating the risks associated with phishing attacks.
CHAPTER 4: DEVELOPMENT & IMPLEMENTATION
4.1 Developmental feasibility
Data Availability: The quality and quantity of data available for training the machine
learning model are crucial factors in the success of the project. There must be sufficient
data on phishing URLs to enable the model to accurately distinguish between legitimate
and phishing URLs. The availability of labeled data, where each URL is tagged as either
phishing or legitimate, is also essential.
Feature Engineering: In order to distinguish between phishing and legitimate URLs,
the model must be able to extract relevant features from the URLs. Features such as
domain name, TLD, and path structure are commonly used in phishing detection
models. Effective feature engineering is a key aspect of building a robust phishing
detection system.
Algorithm Selection: The choice of machine learning algorithm can impact the
performance of the system. Various algorithms, such as decision trees, neural networks,
and support vector machines, can be used for phishing detection. Each algorithm has its
strengths and weaknesses, and the best approach depends on the specific use case.
Model Training and Validation: The model must be trained on a representative set of
data and validated using an independent dataset. Cross-validation techniques can be
used to estimate the performance of the model.
False Positives and False Negatives: A key challenge in phishing detection is
minimizing the number of false positives (legitimate URLs flagged as phishing) and
false negatives (phishing URLs not detected). These errors can impact the user
experience and undermine the credibility of the system.
Adversarial Attacks: Attackers can modify the URLs to evade detection by the
machine learning system. Adversarial attacks can be difficult to detect and can degrade
the performance of the system.
Integration and Deployment: The machine learning model must be integrated with
existing security infrastructure and deployed in a production environment. The system
must be scalable, reliable, and easy to use.
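The cross-validation idea mentioned under Model Training and Validation can be sketched
without any ML library. The generator below splits sample indices into k folds so that every
sample is used for testing exactly once; the function name, the fold count and the fixed seed
are illustrative assumptions, and a real project would typically rely on a library routine for
this step.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Each of the 10 samples lands in exactly one test fold.
for fold_number, (train, test) in enumerate(k_fold_indices(10, k=5), start=1):
    print(f"fold {fold_number}: train={len(train)} samples, test={sorted(test)}")
```

The model would be trained on each train split and scored on the matching test split, and the k
scores averaged to estimate performance on unseen URLs.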
In summary, developing a machine learning-based phishing URL detection system is a
complex and challenging task that requires careful consideration of various factors.
While the potential benefits are significant, it is essential to address the above issues to
ensure a successful project.
4.2 Implementation Specification
• Data Collection: The first step is to gather a sufficient amount of data on both
phishing and legitimate URLs. This data can be obtained through web scraping or by
utilizing existing datasets. The data should be cleaned and preprocessed before being
used for training the machine learning model.
• Feature Engineering: The next step is to engineer relevant features from the URLs
that will be used as input for the machine learning algorithm. Common features include
domain name, TLD, path structure, and length. The feature engineering process should
be carefully designed to ensure that the model can accurately distinguish between
phishing and legitimate URLs.
• Algorithm Selection and Model Training: Once the features have been engineered,
the next step is to select an appropriate machine learning algorithm and train the model
on the data. Various algorithms, such as decision trees, logistic regression, and neural
networks, can be used for phishing detection. The model should be trained on a
representative set of data and validated using an independent dataset. Cross-validation
techniques can be used to estimate the performance of the model.
• False Positive and False Negative Reduction: After the model has been trained,
efforts should be made to minimize false positives (legitimate URLs flagged as
phishing) and false negatives (phishing URLs not detected). This can be achieved by
adjusting the threshold for classification or by using ensemble methods to combine the
predictions of multiple models.
• Adversarial Attack Detection: To detect adversarial attacks, the system should be
designed to recognize when the input data has been tampered with. This can be achieved
by using techniques such as anomaly detection or by incorporating features that are
difficult to manipulate.
• Integration and Deployment: The final step is to integrate the machine learning
model with existing security infrastructure and deploy it in a production environment.
The system should be scalable, reliable, and easy to use. A user interface can be
developed to allow security personnel to monitor the system and manually verify
flagged URLs.
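The threshold adjustment mentioned under false positive and false negative reduction can be sketched as follows. This is a minimal illustration, not the project's implementation: the labels, the scores, and the helper name `confusion_counts` are all hypothetical, and the scores stand in for the phishing probability a trained model might output.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are flagged as phishing (label 1)."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Hypothetical labels (1 = phishing, 0 = legitimate) and model scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.92, 0.71, 0.40, 0.35, 0.10, 0.65, 0.85, 0.05]

# Sweep a few thresholds to see how the two error types trade off.
for threshold in (0.3, 0.5, 0.7):
    tp, fp, tn, fn = confusion_counts(y_true, scores, threshold)
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
```

With these toy values, raising the threshold from 0.3 to 0.7 removes both false positives but causes one phishing URL to be missed, which is the trade-off the threshold controls.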
4.3 System modules and flow of implementation
Data Collection Module: The data collection module acquires a large dataset of URLs,
both legitimate and phishing, which serves as the foundation for training and evaluating
the machine learning (ML) model. The data can be obtained from public datasets
published by reputable security organizations, APIs provided by security-related
services, web scraping of known phishing URLs, or custom collection targeted at
relevant sources.
The collected dataset should be comprehensive, diverse, and representative of the types
of URLs the system aims to detect. It should include a wide range of legitimate URLs
as well as various types of phishing URLs, including different phishing techniques and
attack vectors. Data preprocessing is also an important step, involving cleaning,
transforming, and feature extraction from the collected URLs to ensure that the data is
in a suitable format for ML model training.
A diverse and representative dataset is critical for training an effective ML model for
phishing URL detection. The quality and quantity of data collected directly impact the
accuracy, reliability, and generalizability of the trained model. Careful consideration
should be given to the sources and quality of data collected, as it directly influences
the performance of the ML model. Proper data collection and preprocessing are essential
for developing a robust and accurate phishing URL detection system using ML.
Feature Extraction Module: The feature extraction module in a phishing URL
detection system using machine learning is responsible for extracting relevant features
from the collected dataset of URLs. These features serve as input to the ML model for
training and prediction. Feature extraction involves extracting meaningful information
or characteristics from the URLs that can help distinguish between legitimate and
phishing URLs.
Features could include various attributes of the URLs, such as URL length, domain age,
presence of suspicious keywords, use of HTTPS, domain reputation, presence of
redirections, and other relevant attributes. These features are selected based on their
potential relevance in differentiating between legitimate and phishing URLs. The choice
of features may vary depending on the specific ML algorithm used and the
characteristics of the dataset.
Feature extraction is a critical step in the development of an effective phishing URL
detection system. The quality and relevance of the features extracted directly impact the
accuracy and reliability of the ML model. Careful consideration should be given to the
choice of features and their extraction methods to ensure that they capture the key
characteristics of phishing URLs while minimizing noise or irrelevant information.
Proper feature extraction is essential for building a robust and accurate ML model for
phishing URL detection.
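As an illustration of the kind of features described above, the sketch below computes a few lexical attributes from the URL string alone. It is an assumption-laden example rather than the feature set used in this project: the keyword list and the function name `extract_features` are hypothetical, and attributes such as domain age or domain reputation are omitted because they require external lookups.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list; a real system would curate this from data.
SUSPICIOUS_KEYWORDS = ("login", "verify", "secure", "account", "update")


def extract_features(url):
    """Return a dict of simple lexical features computed from the URL string alone."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "uses_https": int(parsed.scheme == "https"),
        "num_dots_in_host": host.count("."),
        # Dotted-quad hosts (raw IP addresses) are a common phishing indicator.
        "host_is_ip": int(bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host))),
        "has_at_symbol": int("@" in url),
        "num_suspicious_keywords": sum(k in lowered for k in SUSPICIOUS_KEYWORDS),
    }


features = extract_features("http://192.168.0.1/secure-login/verify.php")
print(features)
```

Each URL becomes a fixed-length numeric record, which is the form most classifiers expect as input.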
Labeling and Training Data Preparation Module: The labeling and training data
preparation module is a crucial step in the development of a phishing URL detection
system using machine learning. In this module, the collected dataset of URLs is labeled
to identify whether each URL is legitimate or phishing. The labeling process involves
assigning a binary label (e.g., 0 for legitimate and 1 for phishing) to each URL based on
its ground truth status.
Once the dataset is labeled, it is then split into training and testing data. The majority of
the labeled data is used for training the machine learning model, while a smaller portion
is reserved for model evaluation. This helps in assessing the performance and
generalization of the model.
Preparing the training data is a critical step as the quality and quantity of labeled data
directly impact the accuracy and effectiveness of the ML model. It is important to ensure
that the labeled data is representative of real-world scenarios and includes a diverse
set of legitimate and phishing URLs to train a robust and reliable model.
Properly labeled and prepared training data is essential for building an accurate and
effective ML model for phishing URL detection. It forms the foundation of the model's
learning process, enabling it to understand the patterns and characteristics of legitimate
and phishing URLs, and make accurate predictions during the testing and deployment
phases.
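The train/test split described in this module can be sketched as below. This is one possible approach under stated assumptions: the 80/20 ratio, the fixed seed, the toy URLs, and the helper name `stratified_split` are hypothetical. The split is stratified so that both sets keep the same ratio of phishing to legitimate URLs.

```python
import random


def stratified_split(urls, labels, test_fraction=0.2, seed=42):
    """Split (url, label) pairs into train and test sets, keeping the class ratio in each."""
    rng = random.Random(seed)
    by_label = {}
    for url, label in zip(urls, labels):
        by_label.setdefault(label, []).append(url)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        # Reserve at least one example per class for testing.
        n_test = max(1, round(len(group) * test_fraction))
        test += [(u, label) for u in group[:n_test]]
        train += [(u, label) for u in group[n_test:]]
    return train, test


# Hypothetical toy dataset: 1 = phishing, 0 = legitimate.
urls = [f"http://site{i}.example" for i in range(20)]
labels = [1] * 10 + [0] * 10
train, test = stratified_split(urls, labels)
print(len(train), len(test))
```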
Machine Learning Model Development Module: In this module, a machine learning
model is developed using the training data collected and pre-processed in earlier
modules. Several machine learning algorithms, such as decision trees, random forests,
support vector machines (SVM), or deep learning models like convolutional neural
networks (CNN) or recurrent neural networks (RNN), can be utilized for this task.
The development process involves training the machine learning model on the labeled
dataset, which includes both legitimate and phishing URLs. Hyperparameter tuning and
model evaluation techniques are applied to optimize the model's performance. The
hyperparameters of the model, such as learning rate, regularization, and batch size, are
fine-tuned to achieve the best possible accuracy and minimize false positives or false
negatives.
Model evaluation techniques, such as cross-validation or holdout validation, are used to
assess the model's performance on unseen data. This helps in estimating the model's
accuracy, precision, recall, F1-score, and other performance metrics. The model is
refined iteratively based on the evaluation results until satisfactory performance is
achieved.
The machine learning model developed in this module forms the foundation of the URL
classification module, where it is used to predict the class labels (legitimate or phishing)
of incoming URLs in real-time. The accuracy and reliability of the machine learning
model are critical for the system's overall effectiveness in detecting phishing URLs.
Regular updates and improvements to the model may be required to adapt to evolving
phishing techniques and maintain high accuracy levels.
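The cross-validation mentioned above can be illustrated by how the folds are formed. The sketch below only generates the index sets for k-fold cross-validation. The helper name `k_fold_indices` is hypothetical, and the sketch assumes the data has already been shuffled, since the folds are contiguous.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample appears in exactly one test fold, so averaging a metric over the folds gives an estimate of performance on unseen data without holding out a single fixed test set.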
Model Integration and Deployment Module: Once the machine learning model is
developed and optimized, it can be integrated into the main system as a module for real-
time or batch processing of incoming URLs. The system can be deployed on a cloud
server, a local server, or any other suitable hosting environment, depending on
the requirements and constraints of the project.
URL Classification Module: In this module, incoming URLs are classified as either
legitimate or phishing using the trained machine learning model. The features extracted
from the URLs during the feature extraction module are fed as input to the model, which
then predicts the class label based on the patterns it has learned during the training phase.
The machine learning model, which has been trained on the labeled dataset, applies its
learned knowledge to classify incoming URLs in real-time. The model utilizes the
extracted features to identify whether a given URL exhibits characteristics of legitimate
or phishing URLs based on its learned patterns, such as URL length, domain age,
presence of suspicious keywords, and other relevant attributes.
The URL classification module plays a crucial role in the system's overall accuracy
and effectiveness. It is responsible for making real-time predictions on incoming URLs,
helping to identify potential phishing URLs and prevent users from falling victim to
phishing attacks. The accuracy and reliability of the classification module directly
impact the system's performance and ability to detect phishing URLs accurately and
efficiently.
Once the URL is classified as either legitimate or phishing, the system can take
appropriate actions, such as blocking or flagging suspicious URLs, notifying users, or
triggering further security measures to prevent potential phishing attacks. The URL
classification module is at the heart of the system's decision-making process and is
essential for its overall efficacy in detecting phishing URLs using machine learning.
Result Analysis and Reporting Module: This module involves the analysis of the
results generated by the URL classification module. The system can generate reports,
logs, or alerts based on the classification results, such as notifying users of potential
phishing URLs or generating statistics on the performance of the system.
Model Updates and Maintenance Module: The machine learning model needs to be
regularly updated with new data and retrained to adapt to new types of phishing attacks.
This module involves periodic updates to the model to keep it effective and accurate in
detecting phishing URLs.
Overall, the flow of implementation involves data collection, feature extraction,
labeling and training data preparation, machine learning model development, model
integration and deployment, URL classification, result analysis and reporting, and
model updates and maintenance. Each module plays a crucial role in building an
effective and accurate Phishing URL Detection system using Machine Learning.
4.4 Critical modules of product/system
Data Collection and Preprocessing Module: The data collection and preprocessing
module plays a crucial role in gathering and preparing data for the phishing URL
detection system. It involves collecting data from various sources, such as public
databases, security feeds, and web scraping. The collected data is then preprocessed to
remove irrelevant or redundant information, handle missing values, and normalize the
data. Additionally, feature extraction techniques may be employed to extract relevant
features from URLs, such as domain information, URL structure, and content. This
module ensures that the data used for training the ML model is clean, relevant, and
properly preprocessed to ensure accurate and reliable model performance.
ML Model Training and Evaluation Module: The ML model training and evaluation
module is responsible for training the machine learning model using the preprocessed
data. It involves selecting an appropriate ML algorithm, such as decision trees, random
forests, or deep learning algorithms, and training the model using labeled data. Feature
engineering techniques, such as one-hot encoding or word embedding, may be applied
to further enhance the model's performance. Hyperparameter tuning is also performed
to optimize the model's performance. Once the model is trained, it is evaluated using
techniques such as cross-validation to assess its accuracy, precision, recall, and other
performance metrics. This module ensures that the ML model is trained to its optimal
performance and evaluated rigorously to ensure its effectiveness in detecting phishing
URLs.
Feature Extraction and Selection Module: The feature extraction and selection module
focuses on extracting relevant features from URLs, which are essential for accurate
phishing URL detection. This module may employ techniques such as natural language
processing (NLP) to analyze the content of URLs, regular expressions to extract specific
patterns, and domain reputation analysis to assess the reputation of domains. Feature
selection techniques, such as mutual information, chi-squared test, or recursive feature
elimination, may be applied to select the most relevant features for training the ML
model. This module is critical in identifying the most discriminative features from URLs
to enhance the model's ability to distinguish between legitimate and phishing URLs.
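To make the chi-squared test mentioned above concrete, the sketch below scores a binary feature against binary labels using the standard statistic for a 2×2 contingency table, without continuity correction. The feature names, the toy data, and the helper name `chi_squared_2x2` are hypothetical. A higher score suggests a stronger association between the feature and the label.

```python
def chi_squared_2x2(feature, labels):
    """Chi-squared statistic for a binary feature against binary labels (no correction)."""
    pairs = list(zip(feature, labels))
    a = sum(1 for f, y in pairs if f == 1 and y == 1)
    b = sum(1 for f, y in pairs if f == 1 and y == 0)
    c = sum(1 for f, y in pairs if f == 0 and y == 1)
    d = sum(1 for f, y in pairs if f == 0 and y == 0)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # Feature or label is constant, so it carries no signal.
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical toy data: 1 = phishing, 0 = legitimate.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
host_is_ip = [1, 1, 1, 0, 0, 0, 0, 1]   # Mostly follows the label.
has_www = [1, 0, 1, 0, 1, 0, 1, 0]      # Independent of the label.

scores = {
    "host_is_ip": chi_squared_2x2(host_is_ip, labels),
    "has_www": chi_squared_2x2(has_www, labels),
}
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

Ranking features by this score and keeping the top ones is one simple way to perform selection. Mutual information or recursive feature elimination would rank them differently.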
URL Input and Analysis Module: The URL input and analysis module is responsible
for accepting manual input of URLs for analysis. It may include a user interface that
allows users to enter URLs for analysis or upload a batch of URLs for processing. The
module should have proper error handling to validate and sanitize the input URLs to
prevent any security risks, such as SQL injection or cross-site scripting (XSS) attacks.
Once the URLs are input, they are passed through the trained ML model for prediction.
The results, such as the predicted label (legitimate or phishing) and confidence score,
are displayed to the user. This module should provide a seamless and user-friendly
experience for users to input URLs and obtain the analysis results.
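The input validation described for this module could look like the following sketch. The length limit and the function name `validate_url` are assumptions for illustration. Checks of this kind complement, but do not replace, parameterized database queries and output escaping when guarding against SQL injection and XSS.

```python
from urllib.parse import urlparse

MAX_URL_LENGTH = 2048  # Hypothetical limit; tune to the deployment.


def validate_url(raw):
    """Return a cleaned URL string, or raise ValueError if the input is not acceptable."""
    url = raw.strip()
    if not url or len(url) > MAX_URL_LENGTH:
        raise ValueError("URL is empty or too long")
    if any(ch.isspace() or ord(ch) < 32 for ch in url):
        raise ValueError("URL contains whitespace or control characters")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError("Only http and https URLs are accepted")
    if not parsed.hostname:
        raise ValueError("URL has no host")
    return url


print(validate_url("  https://example.com/login "))
```

Only URLs that pass validation would be handed to the feature extraction step and the trained model.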
Model Deployment and Integration Module: The model deployment and integration
module focuses on deploying the trained ML model into a production environment and
integrating it with other system components. It may involve setting up APIs
(Application Programming Interfaces) to expose the model's functionality, establishing
communication protocols, and handling model versioning and updates. Proper
deployment of the model in a production environment is critical to ensure its scalability,
performance, and availability. Integration with other system components, such as
logging and monitoring modules, is also important for effective system management
and troubleshooting.
Monitoring and Logging Module: The monitoring and logging module is responsible
for monitoring the performance and accuracy of the ML model and logging relevant
events and activities. It includes monitoring the accuracy of the model's predictions,
prediction speed, system resource utilization, and other relevant metrics. This module
may also generate alerts or notifications for any anomalies or issues, such as low model
accuracy or high resource utilization. Proper logging of events and activities is
important for auditing purposes, debugging, and troubleshooting. It allows system
administrators to track the system's performance, identify potential issues, and take
corrective actions in a timely manner.
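One way to monitor prediction accuracy and raise an alert when it drops is sketched below. The window size, the alert threshold, the logger name, and the class name `AccuracyMonitor` are hypothetical. The sketch also assumes that the true label of a URL becomes available later, for example after manual verification.

```python
import logging
from collections import deque

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("phishing_monitor")  # Hypothetical logger name.


class AccuracyMonitor:
    """Track accuracy over the most recent verified predictions and warn when it drops."""

    def __init__(self, window=100, alert_below=0.90):
        self.results = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, predicted, actual):
        """Store one verified prediction and return the current rolling accuracy."""
        self.results.append(predicted == actual)
        accuracy = sum(self.results) / len(self.results)
        # Only alert once the window is full, to avoid noise from the first few samples.
        if len(self.results) == self.results.maxlen and accuracy < self.alert_below:
            logger.warning("Rolling accuracy dropped to %.2f", accuracy)
        return accuracy


monitor = AccuracyMonitor(window=4, alert_below=0.90)
history = [monitor.record(p, a) for p, a in [(1, 1), (1, -1), (-1, -1), (1, 1)]]
print(history)
```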
Security and Privacy Module: The security and privacy module is critical for ensuring
the security and privacy of the system and the data it processes. This module may
involve implementing security measures such as authentication, authorization, and
encryption to protect the system from unauthorized access or data breaches. Secure
transport protocols, such as HTTPS (HTTP over TLS), may be used to secure the
communication between different system components and external APIs. Additionally,
privacy measures should be in place to protect user data and comply with relevant data
protection regulations. This module is crucial in safeguarding the system and the data
it handles from potential security threats and privacy breaches.
CHAPTER 5: RESULT & TESTING
5.1 Decision Tree
Fig 5.1.1: DT testing and training
Fig 5.1.2: DT accuracy
Fig 5.1.3: DT confusion matrix
5.2 Random Forest
Fig 5.2.1: RF testing and training
Fig 5.2.2: RF accuracy
Fig 5.2.3: RF confusion matrix
5.3 KNN
Fig 5.3.1: KNN testing and training
Fig 5.3.2: KNN accuracy
Fig 5.3.3: KNN confusion matrix
5.4 Logistic Regression
Fig 5.4.1: LR testing and training
Fig 5.4.2: LR accuracy
Fig 5.4.3: LR confusion matrix
5.5 Accuracy
TABLE 2: ACCURACY COMPARISON
S. no. Model Accuracy
1. Decision Tree 93-95%
2. Random Forest 96-98%
3. KNN 92-94%
4. Logistic Regression 91-92%
5.6 Phishing URL Detection using Random Forest
Fig 5.6.1: Features Weights
Fig 5.6.2: Exporting and loading the model
Fig 5.6.3: Prediction Matrix
Fig 5.6.4: Input 1
Fig 5.6.5: Prediction 1
Fig 5.6.6: Input 2
Fig 5.6.8: Input 3
CHAPTER 6: CONCLUSION & FUTURE IMPROVEMENT
6.1 Performance Estimation

To evaluate the performance of the proposed model, we created the confusion matrix and
calculated the Precision, Recall and Accuracy for the test data. The confusion matrix is a 2×2
matrix with predicted values on one axis and actual values on the other.
Precision and Recall are calculated from only the True Positive (TP), False Positive (FP) and False
Negative (FN) counts. Precision is the fraction of instances predicted as positive that are actually
positive. The value of Precision lies between 0 and 1.
Additionally, Precision is a critical measure as it indicates the model's ability to avoid false
positives, i.e., correctly identifying genuine URLs as legitimate. A high Precision score implies
that the model is accurately predicting phishing URLs without misclassifying legitimate ones. On
the other hand, a low Precision score could indicate a higher rate of false positives, which can
result in genuine URLs being mistakenly flagged as phishing, leading to inconvenience and
potential loss of trust.
The Recall metric, also known as True Positive Rate, evaluates the model's ability to identify all
actual positive instances: it is the number of true positives (TP) divided by the sum of true
positives and false negatives (FN). It is essential to consider Precision and Recall together, as
there is a trade-off between them. A model with high Precision but low Recall may miss some
genuine phishing URLs, while a model with high Recall but low Precision may have a higher false
positive rate. Thus, finding an optimal balance between Precision and Recall is crucial for an
effective phishing detection model.
Accuracy, the percentage of correctly predicted instances out of all instances, is another important
performance metric. It provides an overall measure of how well the model predicts both
positive and negative instances. However, Accuracy may not always be the most reliable metric,
especially in imbalanced datasets where the number of positive and negative instances varies
significantly. In such cases, Precision, Recall, and F1-score, which is the harmonic mean of
Precision and Recall, are often used in combination to get a comprehensive understanding of the
model's performance.
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{1} \]
We calculated the Recall to find the percentage of instances that were positively predicted by
the classifier out of all positive instances. Recall is the same as the True Positive Rate (TPR).
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{2} \]
Accuracy, as the name suggests, is the percentage of all instances for which the prediction was
correct.
\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{3} \]
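As a worked illustration of Equations (1) to (3) and of the F1-score defined earlier as the harmonic mean of Precision and Recall, the sketch below computes all four metrics from confusion matrix counts. The counts are hypothetical and are not the results of this project. The function name `classification_metrics` is likewise an assumption, and the sketch does not guard against zero denominators.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute Precision, Recall, Accuracy and F1-score from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1


# Hypothetical counts, chosen only to illustrate the arithmetic.
precision, recall, accuracy, f1 = classification_metrics(tp=90, fp=10, tn=85, fn=15)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f} f1={f1:.3f}")
```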
The overall performance of the proposed method is discussed in this section. Of the 30 features
present in the dataset, we selected 24. Each URL in the dataset is labelled with one of three
numeric values: 1 for phishing, 0 for suspicious and -1 for legitimate. The dataset used contained
no URLs labelled as suspicious, so we calculated the performance for the URLs labelled 1 and -1.
Table 3 below gives the final overall performance.
TABLE 3: MODEL PERFORMANCE
Class    Precision    Recall    F1-score
1        0.96         0.97      0.96
-1       0.97         0.97      0.97
Accuracy: 0.97
Figure 6: Confusion matrix Precision and Recall
With these combinations of feature testing and training we were able to achieve 97% accuracy. The accuracy
fluctuated between 96% and 98%. For the URLs labelled 1 (phishing), the precision is 96% and the recall 97%, with a
97% F1-score. The URLs labelled -1 (legitimate) achieved 97% precision, recall and F1-score.
A. Comparison with other ML algorithms
Furthermore, we compared our proposed model extensively with existing models and ML algorithms and found
that the random forest classifier yielded better accuracy than other algorithms such as Logistic
Regression (LR), K-Nearest Neighbour (KNN) and Decision Tree (DT). The comparison is as follows:
TABLE 4: COMPARISON OF DIFFERENT ML ALGORITHMS
ML Model    Class    Precision    Recall    F1-score    Accuracy
RF          1        0.96         0.97      0.96        0.97
RF          -1       0.97         0.97      0.97
LR          1        0.95         0.93      0.94        0.93
LR          -1       0.90         0.93      0.91
KNN         1        0.95         0.94      0.94        0.94
KNN         -1       0.92         0.93      0.93
DT          1        0.96         0.96      0.96        0.95
DT          -1       0.95         0.94      0.95
6.2 Usability of Product/System
Offline detection: ML-based phishing URL detection is valuable even without real-time detection.
For instance, the trained ML model can be used to analyze a batch of URLs collected over a
period of time or taken from a dataset. This offline analysis can help identify potential phishing URLs and
provide insights into the characteristics, patterns, and trends of phishing attacks. Examining historical
data can uncover patterns that may not be evident in real time, giving a better understanding of the
evolving nature of phishing attacks and supporting the development of effective countermeasures. This
offline detection capability enhances the usability of the project by informing proactive measures to
mitigate risks.
Research and analysis: ML-based phishing URL detection can also be used for research and analysis
purposes. By studying the behavior of phishing attacks, such as their techniques, tactics, and motivations,
through the analysis of large datasets of phishing URLs, researchers can gain valuable insights. This can
contribute to the advancement of the field of cybersecurity by helping in the development of new detection
techniques, identifying emerging threats, and understanding the evolving nature of phishing attacks.
ML-based phishing URL detection can serve as a powerful tool for researchers to analyze and understand
the dynamics of phishing attacks, leading to valuable contributions to the field of cybersecurity.
Comparative evaluation: ML-based phishing URL detection can serve as a benchmark for comparative
evaluation. Using established metrics such as precision, recall, and accuracy, the performance of
another proposed detection method can be evaluated against the ML-based approach. This
comparison provides insight into the effectiveness of the proposed method and helps identify its
strengths and weaknesses relative to established techniques, giving a quantitative assessment against
existing state-of-the-art methods. The comparative evaluation can support informed decisions about the
suitability of a method and guide further refinements to improve its performance.
Customization and experimentation: ML-based phishing URL detection can also be used for
customization and experimentation purposes. The ML model can be fine-tuned with a different dataset,
or feature engineering techniques can be applied, to adapt it to a specific use case. This customization
can improve the performance of the model and make it more suitable for a project's requirements and
constraints. Additionally, ML-based detection provides a platform for experimentation, in which different
approaches, algorithms, and parameters can be tested to optimize the detection performance.
Education and awareness: ML-based phishing URL detection can also be used for educational and
awareness purposes. The system can be used to demonstrate the risks and impacts of phishing attacks,
educate users about how to identify and avoid phishing URLs, and raise awareness about cybersecurity
best practices. Showcasing the capabilities of ML-based detection helps users understand the
threats posed by phishing attacks and gives them the knowledge to protect themselves
against such attacks. This can improve the overall cybersecurity awareness
and hygiene of users, making them more vigilant and cautious when dealing with suspicious URLs, and
reducing the risk of falling victim to phishing attacks. Furthermore, by promoting cybersecurity education
and awareness, the project can have a positive impact beyond its immediate scope, creating a safer online
environment for users.
6.3 Limitations
1. Limited dataset: The dataset we used may not be representative of all possible phishing URLs. Our
model might not perform as well on other datasets or in the real world.
2. Lack of context: Our model may not take into account the context in which the URLs are being
used, such as the type of email or website, which could affect its accuracy.
3. Adversarial attacks: Phishing attackers may try to evade detection by modifying the URLs in
various ways, such as using URL shorteners or inserting random characters. Our model may not
be robust against such attacks.
4. Time-consuming feature engineering: The process of feature engineering can be time-consuming
and may require domain expertise. The model's performance may be limited by the features we
were able to engineer.
5. Limited generalization: Our model may perform well on the dataset used for training and
testing, but may not generalize well to other datasets or real-world scenarios.
6. Limited interpretability: Some machine learning models, including Random Forest, can be difficult
to interpret, which may make it hard to understand why the model is making certain predictions or
to identify and correct any biases or errors.
6.4 Scope of Improvements
1. Ensemble models: Although we found Random Forest to perform well in this project, there
are other machine learning algorithms that can be used for phishing URL detection. Ensemble
models, which combine the predictions of multiple models, have been shown to improve the
accuracy of classification tasks. In future work, we could experiment with different ensemble models,
such as stacking or boosting, to see if they can improve the performance of our model.
2. Deep Learning: Another promising approach for phishing URL detection is using deep learning
techniques such as neural networks. Deep learning models can automatically learn relevant
features from raw data, which can be useful when dealing with large and complex datasets. In
future work, we could explore the use of deep learning models for phishing URL detection, such
as Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs).
3. Transfer learning: Transfer learning is a technique where a pre-trained model is used as a starting
point for a new task. This can be useful when limited data is available for the task or when
knowledge learned from a related task can be reused. In future work, we could explore the
use of transfer learning for phishing URL detection, such as using a pre-trained model for natural
language processing to detect similar characteristics in phishing URLs.
4. Adversarial training: As mentioned earlier, phishing attackers may try to evade detection by
modifying URLs in various ways. Adversarial training is a technique that involves training a model
on both clean and adversarial examples, in order to make it more robust to such attacks. In future
work, we could explore the use of adversarial training to make our model more robust against
such attacks.
5. Real-time detection: In many cases, phishing attacks occur in real time, such as when a user
receives a suspicious email or visits a phishing website. In future work, we could explore
real-time detection techniques, such as stream processing or real-time machine learning, to
detect phishing URLs as they are encountered.
6. Deployment and integration: Once a phishing URL detection model is developed, it needs to be
integrated with existing security systems or deployed to end-users in order to be effective. In future
work, we could explore the challenges of deploying and integrating phishing URL detection models
in real-world scenarios, such as dealing with limited resources or integrating with existing security
systems.
7. Explainability: As mentioned earlier, some machine learning models, including Random Forest,
can be difficult to interpret. In order to build trust and to identify and correct any biases or errors,
it is important to have models that are transparent and explainable. In future work, we could
explore the use of explainable AI techniques, such as LIME or SHAP, to understand and explain
the predictions of our model.
References
[1] A. Anand, K. Gorde, J. R. A. Moniz, N. Park, T. Chakraborty and B.-T. Chu,

"Phishing URL detection with oversampling based on text generative adversarial
networks," in 2018 IEEE International Conference on Big Data (Big Data), 2018.
[2] M. N. Feroz and S. Mengel, "Phishing URL detection using URL ranking," in
2015 ieee international congress on big data, 2015.
[3] S. Gupta and A. Singhal, "Dynamic classification mining techniques for

predicting phishing URL," in Soft Computing: Theories and Applications:
Proceedings of SoCTA 2016, Volume 2, 2018.
[4] P. Xu, "A transformer-based model to detect phishing URLs," arXiv preprint
arXiv:2109.02138, 2021.
[5] S. C. Jeeva and E. B. Rajsingh, "Intelligent phishing url detection using

association rule mining," Human-centric Computing and Information Sciences, vol. 6,
p. 1–19, 2016.
[6] V. R. Hawanna, V. Y. Kulkarni and R. A. Rane, "A novel algorithm to detect

phishing URLs," in 2016 international conference on automatic control and dynamic
optimization techniques (ICACDOT), 2016.
[7] W. Wei, Q. Ke, J. Nowak, M. Korytkowski, R. Scherer and M. Woźniak,

"Accurate and fast URL phishing detector: a convolutional neural network approach,"
Computer Networks, vol. 178, p. 107275, 2020.
[8] M. Zouina and B. Outtaj, "A novel lightweight URL phishing detection system
using SVM and similarity index," Human-centric Computing and Information
Sciences, vol. 7, p. 1–13, 2017.
[9] M. Sameen, K. Han and S. O. Hwang, "PhishHaven—an efficient real-time ai

phishing URLs detection system," IEEE Access, vol. 8, p. 83425–83443, 2020.
[10] P. Yang, G. Zhao and P. Zeng, "Phishing website detection based on

multidimensional features driven by deep learning," IEEE access, vol. 7, p.
15196–15209, 2019.
[11] M. Chatterjee and A.-S. Namin, "Detecting phishing websites through deep
reinforcement learning," in 2019 IEEE 43rd Annual Computer Software and
Applications Conference (COMPSAC), 2019.
50
[12] A. J. Obaid, K. K. Ibrahim, A. S. Abdulbaqi and S. M. Nejrs, "An adaptive
approach for internet phishing detection based on log data," Periodicals of Engineering
and Natural Sciences, vol. 9, p. 622–631, 2021.
[13] A. Pandey and J. Chadawar, "Phishing URL Detection using Hybrid Ensemble
Model," International Journal of Engineering Research & Technology, 11 (4), p.
479–482, 2022.
[14] S. Al-Ahmadi, "PDMLP: phishing detection using multilayer perceptron,"

International Journal of Network Security & Its Applications (IJNSA) Vol, vol. 12,
2020.
[15] R. Wazirali, R. Ahmad and A. A.-K. Abu-Ein, "Sustaining accurate detection of

phishing URLs using SDN and feature selection approaches," Computer Networks,
vol. 201, p. 108591, 2021.
[16] C. Rupa, G. Srivastava, S. Bhattacharya, P. Reddy and T. R. Gadekallu, "A machine
learning driven threat intelligence system for malicious URL detection," in
Proceedings of the 16th International Conference on Availability, Reliability and
Security, 2021.
[17] F. Sadique, R. Kaul, S. Badsha and S. Sengupta, "An automated framework for
real-time phishing URL detection," in 2020 10th Annual Computing and
Communication Workshop and Conference (CCWC), 2020.
[18] M. Somesha, A. R. Pais, R. S. Rao and V. S. Rathour, "Efficient deep learning
techniques for the detection of phishing websites," Sādhanā, vol. 45, p. 1–18, 2020.
[19] M. Sánchez-Paniagua, E. F. Fernández, E. Alegre, W. Al-Nabki and V. Gonzalez-Castro,
"Phishing URL detection: A real-case scenario through login URLs," IEEE Access,
vol. 10, p. 42949–42960, 2022.
