NTAL Report PDF
Title of Report: Identifying Phishing URLs Using the Naive Bayes Classifier Algorithm
Expected Outcome
1 Abstract
2 Description of Mini-Project
2.1 Introduction
2.2 Aim and Objective
2.3 Software Interfaces
2.4 Implementation
2.5 Screenshots
3 Conclusion
4 References
Abstract
A phishing attack is a deceptive technique for stealing a user's private information by luring
the user to a spurious website designed to mimic an authentic one. The attacker captures
confidential information such as the username, password, and PIN, and uses it to carry out
fraudulent transactions, so the victim loses both credentials and money. Because phishing and
legitimate websites look very similar, users are easily deceived. Several techniques exist for
detecting phishing attacks, including blacklisting, whitelisting, heuristics, and machine
learning. Machine learning is now widely used and has been found to be more effective. The
proposed system extracts source code features, URL features, and image features from a
website. The extracted features are passed to the random forest algorithm or to the cross
validation algorithm to obtain a reduced feature set. The reduced features are then given to the
Naïve Bayes classifier, which classifies the webpage as genuine or phishing.
Introduction
The internet now plays a vital role in everyday life. Technology is advancing at tremendous
speed, enabling people to use it in ever smarter ways, and its impact is felt in every field.
However, loopholes on the internet act as back doors through which users can be attacked. Many
kinds of attack are carried out over the network. One of them is phishing, in which the attacker
poses as a legitimate entity and steals the user's credentials, tempting the user with a site
that closely resembles the genuine one. According to a phishing report for the first quarter of
2016, India ranked fifth among the top 10 countries affected by phishing attacks. Users who are
unaware of these attacks may fall into the trap. This report considers the source code, URL, and
image features of a website, selects the optimum features using the random forest algorithm, and
finally classifies the website as phishing or non-phishing using the Naïve Bayes classifier.
1. Dataset Formation
The dataset for this project was published in 2015 and is available for download. It is used
here to build a machine learning model that detects phishing URLs. The dataset contains a large
selection of features extracted from websites to determine whether each one is a phishing
website. It contains the following features:
• having_IP_Address
• URL_Length
• Shortining_Service
• having_At_Symbol
• double_slash_redirecting
• Prefix_Suffix
• having_Sub_Domain
• SSLfinal_State
• Domain_registeration_length
• Favicon
• port
• HTTPS_token
• Request_URL
• URL_of_Anchor
• Links_in_tags
• SFH
• Submitting_to_email
• Abnormal_URL
• Redirect
• on_mouseover
• RightClick
• popUpWidnow
• Iframe
• age_of_domain
• DNSRecord
• web_traffic
• Page_Rank
• Google_Index
• Links_pointing_to_page
• Statistical_report
• Result
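As a minimal sketch of how such a file can be split into feature columns and the Result label, the following Python 3 snippet parses a small comma-separated sample with a header row, which is the layout the report's read_data function expects. The three-column excerpt, its values, and the helper name load_rows are hypothetical and only illustrate the shape of the data.

```python
import csv
import io

# Hypothetical three-column excerpt with made-up values; the real file has
# 30 feature columns followed by the Result column.
SAMPLE = """having_IP_Address,URL_Length,Result
-1,1,-1
1,1,1
1,-1,1
"""

def load_rows(text):
    """Return (header, X, y), where y is the last column of every row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)                      # first line holds the column names
    rows = [[int(v) for v in row] for row in reader if row]
    X = [row[:-1] for row in rows]             # feature columns
    y = [row[-1] for row in rows]              # Result column
    return header, X, y

header, X, y = load_rows(SAMPLE)
print(header[:-1], X, y)
```

Keeping the label as the last column matches the X = data[:,:-1] and y = data[:,-1] slicing used in the classification code.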
2. Feature Selection
Often in data science we have hundreds or even millions of features and we want a way to create
a model that only includes the most important features. This has three benefits. First, we make
our model simpler to interpret. Second, we can reduce the variance of the model, and therefore
overfitting. Finally, we can reduce the computational cost (and time) of training a model. The
process of identifying only the most relevant features is called “feature selection.”
Random Forests are often used for feature selection. The reason is that the tree-based
strategies used by random forests naturally rank features by how well they improve the
purity of the node, measured as the mean decrease in impurity over all trees. Nodes with
the greatest decrease in impurity occur at the start of the trees, while nodes with the
least decrease in impurity occur at the end of the trees. Thus, by pruning trees below a
particular node, we can create a subset of the most important features.
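The idea of a decrease in impurity can be illustrated with a small Python 3 sketch. It assumes Gini impurity as the purity measure and, as a simplification, splits on every distinct feature value rather than on the thresholds a real decision tree would choose. The function names and toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(feature, labels):
    """Drop in Gini impurity after splitting on each distinct feature value."""
    n = len(labels)
    after = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        after += len(subset) / n * gini(subset)   # weight by subset size
    return gini(labels) - after

labels = [1, 1, -1, -1]
informative = [1, 1, -1, -1]      # separates the two classes perfectly
uninformative = [1, -1, 1, -1]    # tells us nothing about the class

print(impurity_decrease(informative, labels))
print(impurity_decrease(uninformative, labels))
```

A feature with a larger decrease would be ranked as more important. A random forest averages this quantity over the splits in all of its trees.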
Random forests or random decision forests are an ensemble learning method for
classification, regression and other tasks, that operate by constructing a multitude of
decision trees at training time and outputting the class that is the mode of the classes
(classification) or mean prediction (regression) of the individual trees. Random decision
forests correct for decision trees' habit of overfitting to their training set.
evaluation technique is that it does not give an indication of how well the learner will
generalize to an independent, unseen data set. Estimating how well the model generalizes
in this way is known as cross-validation.
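A minimal Python 3 sketch of how k-fold cross-validation partitions the data is given below. The helper k_fold_indices is hypothetical and only mimics the idea of contiguous folds. The report's own code relies on KFold from sklearn instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # The first n_samples % k folds get one extra sample each.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(7, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample is used for testing exactly once and for training in the remaining folds, so the averaged accuracy reflects performance on data the model has not seen.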
3. Classification
Naive Bayesian:
The Naive Bayesian classifier is based on Bayes' theorem with independence assumptions
between predictors. A Naive Bayesian model is easy to build, with no complicated
iterative parameter estimation, which makes it particularly useful for very large
datasets. Despite its simplicity, the Naive Bayesian classifier often does surprisingly
well and is widely used. In some cases it even outperforms more sophisticated
classification methods.
Algorithm:
Bayes' theorem provides a way of calculating the posterior probability, P(c|x), from P(c),
P(x), and P(x|c). The Naive Bayes classifier assumes that the effect of the value of a
predictor (x) on a given class (c) is independent of the values of the other predictors.
This assumption is called class conditional independence.
• P(c|x) is the posterior probability of class (target) given predictor (attribute).
• P(c) is the prior probability of class.
• P(x|c) is the likelihood which is the probability of predictor given class.
• P(x) is the prior probability of predictor.
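In terms of the quantities listed above, Bayes' theorem reads P(c|x) = P(x|c) P(c) / P(x). The short Python 3 sketch below works through one case with made-up probabilities. The numbers and variable names are hypothetical and are not taken from the dataset.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood * prior / evidence

# Hypothetical numbers: 40% of pages are phishing, the observed feature value
# appears on 90% of phishing pages and on 20% of legitimate pages.
p_phish = 0.4
p_x_given_phish = 0.9
p_x_given_legit = 0.2

# P(x) by the law of total probability over the two classes.
p_x = p_x_given_phish * p_phish + p_x_given_legit * (1 - p_phish)

p_phish_given_x = posterior(p_phish, p_x_given_phish, p_x)
print(p_phish_given_x)
```

With several predictors, class conditional independence lets P(x|c) be written as the product of the per-predictor likelihoods P(x1|c) P(x2|c) ... P(xn|c).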
Aim and Objective
The project aims to create a machine learning model that classifies incoming websites. The
model is trained on a large dataset to increase its accuracy. Once trained, it is tested on a
separate set of data to identify whether an incoming webpage is genuine or phishing. The data
comprises various features of the webpage, which are extracted and grouped as source code
features, URL features, and image features.
Software Interfaces
The project was implemented on the Ubuntu 18 operating system. The programming language used
for implementation is Python 2, with additional machine learning packages including NumPy and
sklearn.
Python:
Python is an interpreted, object-oriented, high-level programming language with dynamic
semantics. Its high-level built in data structures, combined with dynamic typing and dynamic
binding, make it very attractive for Rapid Application Development, as well as for use as a
scripting or glue language to connect existing components together. Python's simple, easy to
learn syntax emphasizes readability and therefore reduces the cost of program maintenance.
Python supports modules and packages, which encourages program modularity and code reuse.
The Python interpreter and the extensive standard library are available in source or binary form
without charge for all major platforms, and can be freely distributed.
NumPy package:
NumPy is a module for Python. The name is short for "Numeric Python" or "Numerical Python".
It is an extension module for Python, mostly written in C, so its precompiled mathematical and
numerical functions execute quickly.
Furthermore, NumPy enriches Python with powerful data structures that implement
multi-dimensional arrays and matrices and allow efficient calculations on them. The
implementation is designed to handle even very large matrices and arrays, of the kind
associated with "big data". In addition, the module supplies a large library of high-level
mathematical functions that operate on these matrices and arrays.
Sklearn package:
Scikit-learn is a free machine learning library for the Python programming language. It features
various classification, regression and clustering algorithms including support vector machines,
random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with
the Python numerical and scientific libraries NumPy and SciPy.
Implementation
1. Dataset file
The dataset file contains a large number of website features and the values of each
feature. The dataset file looks like the following:
Figure 1: Dataset
2. Feature Selection File:
This file contains the code necessary for feature selection on the dataset.
Code:
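import numpy
import math

# Assumption: the "def" lines of select_features and calculate_info_gain are
# missing from the source. They are reconstructed from their call sites, so the
# signatures and the unpacking of data and options are inferred, not original.
def select_features(data, options):
    X, y = data[:, :-1], data[:, -1]
    method = options["method"]
    num_features = options["num_features"]
    if method == "info_gain":
        # Indices of the num_features columns with the highest information gain.
        return list(information_gain(X, y)[0: num_features])

def information_gain(X, y):
    # Rank the columns of X by information gain, best first.
    cols = X.shape[1]
    info_gains = numpy.zeros(cols)
    for i in range(cols):
        values = X[:, i]
        info_gains[i] = calculate_info_gain(values, y)
    return numpy.argsort(info_gains)[::-1]

def calculate_info_gain(values, y):
    # Entropy of the labels minus the weighted entropy after splitting on values.
    ig = entropy(count(y))
    distinct = set(values)
    for val in distinct:
        # Rows where the feature equals val; the subset's entropy is weighted
        # by its share of the rows.
        indices = numpy.argwhere(values == val).flatten()
        ig -= (float(len(indices)) / len(values)) * entropy(count(y[indices]))
    return ig

def count(values):
    # Map each distinct value to the number of times it occurs.
    unique, counts = numpy.unique(values, return_counts=True)
    return dict(zip(unique, counts))

def entropy(category_value_dict):
    # Shannon entropy (natural log) of a {category: count} dictionary.
    total = sum(category_value_dict.values())
    ent = 0.0
    for key in category_value_dict:
        prob = float(category_value_dict[key]) / float(total)
        ent += (-prob * math.log(prob))
    return ent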
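To show how the ranking by information gain selects columns, here is a small Python 3 sketch that uses only the standard library. The functions info_gain and rank_features and the toy matrix are hypothetical stand-ins for the NumPy-based listing above, and the gain is computed with the usual weighted conditional entropy.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def rank_features(X, y):
    """Column indices of X sorted by information gain, best first."""
    gains = [info_gain([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(gains)), key=lambda j: gains[j], reverse=True)

# Toy data: column 0 determines the label, columns 1 and 2 carry no information.
X = [[1, 1, -1], [1, -1, -1], [-1, 1, -1], [-1, -1, -1]]
y = [1, 1, -1, -1]

ranked = rank_features(X, y)
top_features = ranked[:1]          # keep only the single best column
print(ranked, top_features)
```

Slicing the ranked list plays the same role as the [0: num_features] slice in the report's code.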
3. Classification File:
This file contains the code required to create a machine learning model, which is used
to classify websites as phishing or not.
Code:
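import sys
import json
import numpy as np
import math
from collections import defaultdict
from sklearn.model_selection import KFold
import feature_selection as fs

def read_data(filename):
    # Skip the header row and split every remaining line on commas.
    with open(filename) as f:
        rows = [line.strip().split(",") for line in f.readlines()[1:]]
    return np.array(rows)

# Assumption: the "def" lines of the functions below are missing from the
# source. Their signatures are inferred from the calls in main(), and each
# "# ..." comment marks lines that are missing from the source.
def filter_features(data, features_index):
    print(features_index)
    if (data.shape[1] - 1) not in features_index:
        # Adding output feature if not already added
        features_index.append(data.shape[1] - 1)
    print(features_index)
    # validate_features is called here but its definition is not in the source.
    if validate_features(data.shape, features_index):
        return data[:, features_index]

def train(X, y):
    N = X.shape[0]
    class_prob = defaultdict(float)
    class_feature_value_count = defaultdict(
        lambda: defaultdict(lambda: defaultdict(float)))
    for i, row in enumerate(X):
        class_prob[y[i]] += 1.0
        # ... (per-feature value counting is missing from the source)
    for cls in class_prob:
        # Turn the class counts into prior probabilities.
        class_prob[cls] /= float(N)
    # ...
    return class_prob, class_feature_value_count

def predict(X_test, y_test, class_prob, class_feature_value_count):
    correct = 0
    for i, row in enumerate(X_test):
        true_class = y_test[i]
        # ... (the computation of prediction for this row is missing from the source)
        if true_class == prediction:
            correct += 1
    accuracy = 100*float(correct)/X_test.shape[0]
    return accuracy

# ... (the start of the statement that builds the model dictionary is missing
# from the source; only its last entry and the save call survive)
class_feature_value_count}
json.dump(model, open("model.txt","w"))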
def main(argv):
    print("Phishing URL predictor - Naive Bayes approach")
    training_file = argv[1]
    option = "random"
    if len(argv) > 2:
        option = argv[2]
    if option == "random":
        # Random shuffle run
        split = 0.8
        if len(argv) == 4:
            split = float(argv[3])
        data = read_data(training_file)
        # Uncomment the following line to filter features.
        # ... (the rest of the "random" branch is missing from the source)
    elif option == "cv":
        # Assumption: this branch header is inferred from the error message below.
        # ... (the lines that set k, the number of folds, are missing from the source)
        kf = KFold(n_splits=k)
        data = read_data(training_file)
        data = filter_features(data, fs.select_features(
            data, {"method": "info_gain", "num_features": 25}))
        np.random.shuffle(data)
        X = data[:, :-1]
        y = data[:, -1]
        accuracy = 0.0
        for train_idx, test_idx in kf.split(data):
            class_prob, class_feature_value_count = train(
                X[train_idx], y[train_idx])
            accuracy += predict(X[test_idx], y[test_idx], class_prob,
                                class_feature_value_count)
        print("Cross-validated. Folds = " + str(k) + ".")
        print("Average Accuracy over different folds = " + str(accuracy / k))
    else:
        print("Illegal options set. Use either 'random' or 'cv'")

if __name__ == '__main__':
    main(sys.argv)
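Since parts of the training and prediction code are missing from the listing above, the following self-contained Python 3 sketch shows one way a Naive Bayes classifier over categorical features can be written. It is not the report's implementation: the function names train_nb, predict_one, and accuracy_percent, the use of add-one (Laplace) smoothing and log probabilities, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(X, y):
    """Count class frequencies and per-class feature-value frequencies."""
    class_count = Counter(y)
    value_count = defaultdict(Counter)   # (class, feature index) -> value counts
    values_seen = defaultdict(set)       # feature index -> distinct values
    for row, cls in zip(X, y):
        for j, v in enumerate(row):
            value_count[(cls, j)][v] += 1
            values_seen[j].add(v)
    return class_count, value_count, values_seen

def predict_one(row, model):
    """Return the class with the highest log posterior for one sample."""
    class_count, value_count, values_seen = model
    total = sum(class_count.values())
    best_cls, best_score = None, float("-inf")
    for cls, n_cls in class_count.items():
        score = math.log(n_cls / total)              # log prior P(c)
        for j, v in enumerate(row):
            # Add-one (Laplace) smoothing is an assumption of this sketch.
            num = value_count[(cls, j)][v] + 1
            den = n_cls + len(values_seen[j])
            score += math.log(num / den)             # log likelihood P(x_j|c)
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

def accuracy_percent(X_test, y_test, model):
    """Percentage of test samples whose predicted class matches the label."""
    correct = sum(1 for row, t in zip(X_test, y_test)
                  if predict_one(row, model) == t)
    return 100.0 * correct / len(y_test)

# Toy data: the first feature determines the class (1 or -1).
X_train = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y_train = [1, 1, -1, -1]
model = train_nb(X_train, y_train)
print(predict_one([1, 1], model), predict_one([-1, -1], model))
print(accuracy_percent([[1, -1], [-1, 1]], [1, -1], model))
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow when many features are used, and the smoothing keeps a feature value that was never seen with a class from forcing that class's score to zero.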
Screenshots
Figure 4: Cross Validation with 3 Folds
Conclusion
This project demonstrates the effect of feature set reduction on the detection of phishing
webpages. The random forest algorithm and the cross-validation algorithm have been shown to be
suitable for optimization problems. The proposed system takes advantage of this quality,
applying the random forest algorithm and the cross-validation algorithm to identify the optimal
features and then passing them to the Bayesian classifier for the detection of phishing attacks.
References
• https://archive.ics.uci.edu/ml/datasets/phishing+websites
• https://docs.scipy.org/doc/
• http://scikit-learn.org/stable/documentation.html
• https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd
• https://machinelearningmastery.com/k-fold-cross-validation/
• https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/