
Don Bosco Institute of Technology

Kurla, Mumbai, 400070

Academic year : 2018-19

Department of Computer Engineering

 Title of Report : Identifying phishing URLs using Naive Bayes classifier algorithm

 Year and Semester : IVth Year and VIIth Semester

 Name of the Subject : Network Threats and Attacks Laboratory

 Group Members Name and Roll No. :

Sr. No.   Name             Roll No.
1         Yash Mahajan     42
2         Vaibhav Patil    59

Expected Outcome

CO 6: To develop improved communication and collaborative skills in meeting security threats as a team member or a team leader.
Table of Contents

Sr. No.   Title

1         Abstract
2         Description of Mini-Project
          2.1 Introduction
          2.2 Aim and Objective
          2.3 Software Interfaces
          2.4 Implementation
          2.5 Screenshots
3         Conclusion
4         References
Abstract

A phishing attack is a deceptive attempt to steal a user's private information by luring the victim to a spurious website designed to mimic a legitimate one. Confidential information such as the username, password and PIN is captured by the attacker and used to carry out fraudulent transactions, so the victim loses both credentials and money. Because the phishing site bears a strong visual resemblance to the legitimate site, users are easily misled into entering their credentials. Several techniques exist for detecting phishing attacks, including blacklisting, whitelisting, heuristics and machine learning; machine learning is now widely used and has been found to be the most effective. The proposed system extracts source code features, URL features and image features from the suspect website. The extracted features are passed to a random forest algorithm or to a cross validation procedure to obtain a reduced feature set, and the reduced features are then given to a Naïve Bayes classifier to label the webpage as genuine or phished.

Introduction

Nowadays the internet plays a vital role in everyone's day to day life. Technology is growing at a tremendous pace and lets users work in ever smarter ways, but as it grows it leaves its impact on every field, and the loopholes that exist on the internet act as a back door for attacking users. Among the many attacks carried out over the network is phishing, in which the attacker impersonates a legitimate party and grabs the user's credentials, tempting the user with a high visual resemblance to the genuine site. According to a phishing report for the first quarter of 2016, India ranked fifth among the top 10 countries affected by phishing attacks. Users who are unaware of these attacks may easily fall into the trap. This project considers source code, URL and image features of a website, selects the optimum features using a random forest algorithm, and finally classifies the website as phishing or non-phishing using a Naïve Bayes classifier.

The project was divided into 3 main sections:

1. Dataset Formation

The dataset for this project was taken from a publicly available dataset published in 2015, which can be downloaded and used to build a machine learning model for detecting phishing URLs. The dataset contains a large selection of features extracted from each website that help determine whether it is a phishing website or not. It contains the following features (a short loading sketch follows the list):
• having_IP_Address
• URL_Length
• Shortining_Service
• having_At_Symbol
• double_slash_redirecting
• Prefix_Suffix
• having_Sub_Domain
• SSLfinal_State
• Domain_registeration_length
• Favicon
• port
• HTTPS_token
• Request_URL
• URL_of_Anchor
• Links_in_tags
• SFH
• Submitting_to_email
• Abnormal_URL
• Redirect
• on_mouseover
• RightClick

• popUpWidnow
• Iframe
• age_of_domain
• DNSRecord
• web_traffic
• Page_Rank
• Google_Index
• Links_pointing_to_page
• Statistical_report
• Result
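
As a rough illustration of how such a dataset can be brought into Python, the sketch below loads a CSV export of the feature table with NumPy. The file name dataset.csv and the presence of a header row are assumptions, and the last column is taken to be the Result label.

import numpy as np

def load_dataset(path="dataset.csv"):
    # Skip the header row and keep every value as a string, mirroring the
    # read_data() helper used later in the Classification File.
    rows = [line.strip().split(",") for line in open(path).readlines()[1:]]
    data = np.array(rows)
    X = data[:, :-1]   # feature columns
    y = data[:, -1]    # the Result column (class label)
    return X, y

X, y = load_dataset()
print("Loaded %d samples with %d features" % (X.shape[0], X.shape[1]))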

2. Feature Selection

Often in data science we have hundreds or even millions of features and we want a way to create
a model that only includes the most important features. This has three benefits. First, we make
our model simpler to interpret. Second, we can reduce the variance of the model, and therefore
overfitting. Finally, we can reduce the computational cost (and time) of training a model. The
process of identifying only the most relevant features is called “feature selection.”

a) Random Forest Algorithm

Random forests are often used for feature selection. The reason is that the tree-based strategies used by random forests naturally rank features by how well they improve the purity of the nodes, that is, by the mean decrease in impurity averaged over all trees. Splits with the greatest decrease in impurity happen near the top of the trees, while splits with the least decrease in impurity occur near the leaves. Thus, by pruning trees below a particular node, we can create a subset of the most important features.

Random forests, or random decision forests, are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set.
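
As a minimal sketch of this idea with scikit-learn (assuming the feature matrix X and labels y have already been loaded as numeric arrays; the parameter values here are illustrative, not the project's settings):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_features(X, y, num_features=14):
    # Fit a forest and rank columns by mean decrease in impurity.
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X, y)
    order = np.argsort(forest.feature_importances_)[::-1]
    return order[:num_features]

# Hypothetical usage: keep only the top-ranked columns.
# top = rank_features(X, y)
# X_reduced = X[:, top]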

b) Cross Validation Algorithm

The process of deciding whether the numerical results quantifying hypothesized relationships between variables are acceptable as descriptions of the data is known as validation. Generally, an error estimate for the model is made after training, better known as evaluation of residuals: a numerical estimate of the difference between the predicted and original responses, also called the training error. However, this only gives us an idea of how well the model does on the data used to train it, and the model may well be underfitting or overfitting that data. The problem with this evaluation technique is therefore that it gives no indication of how well the learner will generalize to an independent, unseen data set. Getting this idea about our model is known as cross validation.

K-Fold Cross Validation:


As there is never enough data to train a model, removing part of it for validation poses a risk of underfitting: by reducing the training data we may lose important patterns or trends in the data set, which in turn increases the error induced by bias. What we need is a method that provides ample data for training the model while also leaving enough data for validation, and K-fold cross validation does exactly that. In K-fold cross validation the data is divided into k subsets, and the holdout method is repeated k times, such that each time one of the k subsets is used as the test/validation set and the other k-1 subsets are put together to form a training set. The error estimate is averaged over all k trials to give the total effectiveness of the model. Every data point appears in the validation set exactly once and in the training set k-1 times. This significantly reduces bias, since most of the data is used for fitting, and also significantly reduces variance, since most of the data is also used for validation. Interchanging the training and test sets adds further to the effectiveness of the method. As a general rule backed by empirical evidence, k = 5 or 10 is preferred, but nothing is fixed and k can take any value.
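
A minimal sketch of k-fold cross validation with scikit-learn's KFold follows, assuming numeric arrays X and y and any classifier with fit/predict (a random forest is used here purely as a placeholder):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

def cross_validate(X, y, k=5):
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    accuracies = []
    for train_idx, test_idx in kf.split(X):
        # Train on k-1 folds and measure accuracy on the held-out fold.
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        accuracies.append(np.mean(model.predict(X[test_idx]) == y[test_idx]))
    return sum(accuracies) / len(accuracies)   # average accuracy over the k folds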

3. Classification

Naive Bayesian:

The Naive Bayesian classifier is based on Bayes’ theorem with the independence
assumptions between predictors. A Naive Bayesian model is easy to build, with no
complicated iterative parameter estimation which makes it particularly useful for very
large datasets. Despite its simplicity, the Naive Bayesian classifier often does surprisingly
well and is widely used because it often outperforms more sophisticated classification
methods.
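
For comparison only, scikit-learn also ships ready-made Naive Bayes estimators; the toy sketch below uses BernoulliNB on illustrative -1/1 features (the classifier actually used in this project is implemented from scratch in the Classification File):

from sklearn.naive_bayes import BernoulliNB

# Toy -1/1 feature rows and labels, purely illustrative.
X_train = [[1, -1, 1], [-1, -1, -1], [1, 1, 1], [-1, 1, -1]]
y_train = [1, -1, 1, -1]

nb = BernoulliNB()                 # binarises inputs at 0, so -1 -> 0 and 1 -> 1
nb.fit(X_train, y_train)
print(nb.predict([[1, 1, -1]]))    # predicted class for an unseen sample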

Algorithm:

Bayes' theorem provides a way of calculating the posterior probability, P(c|x), from P(c), P(x), and P(x|c). The Naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors. This assumption is called class conditional independence.

• P(c|x) is the posterior probability of class (target) given predictor (attribute).
• P(c) is the prior probability of class.
• P(x|c) is the likelihood which is the probability of predictor given class.
• P(x) is the prior probability of predictor.
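
Combining these quantities, Bayes' theorem in the notation above reads:

P(c|x) = P(x|c) * P(c) / P(x)

and, under the class conditional independence assumption, the likelihood factors as P(x|c) = P(x1|c) * P(x2|c) * ... * P(xn|c), so the classifier simply picks the class c with the largest posterior.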

Aim and Objective

The project aims at creating a machine learning model to classify incoming websites. The model will be trained on a large dataset to increase its accuracy. Once the model is trained it will be tested on a separate set of data to identify whether an incoming webpage is genuine or fraudulent. This data includes various features of the webpage that are extracted and grouped as source code features, URL features and image features.

Software Interfaces

The project was implemented on the Ubuntu 18 operating system. The programming language used for implementation is Python 2, together with additional machine learning packages including NumPy and scikit-learn (sklearn).

Python:
Python is an interpreted, object-oriented, high-level programming language with dynamic
semantics. Its high-level built in data structures, combined with dynamic typing and dynamic
binding, make it very attractive for Rapid Application Development, as well as for use as a
scripting or glue language to connect existing components together. Python's simple, easy to
learn syntax emphasizes readability and therefore reduces the cost of program maintenance.
Python supports modules and packages, which encourages program modularity and code reuse.
The Python interpreter and the extensive standard library are available in source or binary form
without charge for all major platforms, and can be freely distributed.

NumPy package:
NumPy is a module for Python; the name is an acronym for "Numeric Python" or "Numerical Python". It is an extension module for Python, mostly written in C, which means its precompiled mathematical and numerical functions execute at great speed.
Furthermore, NumPy enriches the Python language with powerful data structures implementing multi-dimensional arrays and matrices. These data structures allow efficient calculations with matrices and arrays, and the implementation is aimed even at huge matrices and arrays, better known under the heading of "big data". Besides that, the module supplies a large library of high-level mathematical functions to operate on these matrices and arrays.
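
A small, purely illustrative example of the array operations this project relies on, slicing a 2-D array into features and labels (the values are toy data, not the real dataset):

import numpy as np

# Toy rows in the same -1/0/1 style as the phishing dataset.
data = np.array([[ 1, -1,  1,  1],
                 [-1,  1,  0, -1],
                 [ 1,  1,  1,  1]])

X = data[:, :-1]   # every column except the last (features)
y = data[:, -1]    # the last column (labels)
print("feature matrix shape: %s" % (X.shape,))
print("mean label value: %.2f" % y.mean())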

Sklearn package:
Scikit-learn is a free machine learning library for the Python programming language. It features
various classification, regression and clustering algorithms including support vector machines,
random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with
the Python numerical and scientific libraries NumPy and SciPy.

Implementation

The implementation is spread across 3 main files:

1. Dataset file

The dataset file contains a large number of website features and the values for each
feature. The dataset file looks like the following:

Figure 1: Dataset

2. Feature selection file:

This file contains the code necessary for feature selection implemented on the dataset.

Code:

import numpy
import math

def select_features(data, config):
    # Select the indices of the most informative feature columns.
    method = config["method"]
    num_features = config["num_features"]
    X = data[:, :-1]
    y = data[:, -1]

    if method == "info_gain":
        return list(information_gain(X, y)[0:num_features])

def information_gain(X, y):
    # Compute the information gain of every column and return the column
    # indices sorted from most to least informative.
    cols = X.shape[1]
    info_gains = numpy.zeros(cols)
    for i in range(cols):
        values = X[:, i]
        info_gains[i] = calculate_info_gain(values, y)

    return numpy.argsort(info_gains)[::-1]

def calculate_info_gain(values, y):
    # Information gain = entropy of the labels minus the weighted entropy of
    # the labels within each distinct value of this feature.
    ig = entropy(count(y))
    distinct = set(values)
    for val in distinct:
        indices = numpy.argwhere(values == val).flatten()
        weight = float(len(indices)) / float(len(values))
        ig -= weight * entropy(count(y[indices]))

    return ig

def count(values):
    # Map each distinct value to the number of times it occurs.
    unique, counts = numpy.unique(values, return_counts=True)
    return dict(zip(unique, counts))

def entropy(category_value_dict):
    # Shannon entropy (natural log) of a value -> count dictionary.
    total = sum(category_value_dict.values())

    ent = 0.0
    for key in category_value_dict:
        prob = float(category_value_dict[key]) / float(total)
        ent += (-prob * math.log(prob))

    return ent
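
As a rough standalone check of this module (assuming it is saved as feature_selection.py and that a numeric CSV export of the dataset named dataset.csv exists, with a header row and the label in the last column), it could be exercised as follows:

import numpy
import feature_selection as fs

# Load the CSV as numbers, skipping the header row (file name is assumed).
data = numpy.loadtxt("dataset.csv", delimiter=",", skiprows=1)
top = fs.select_features(data, {"method": "info_gain", "num_features": 14})
print("Most informative feature columns: %s" % list(top))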

3. Classification File:

This file contains the code required to create the machine learning model which is used
to classify the websites as phishing or not.

Code:

import sys
import json
import numpy as np
import math
from collections import defaultdict
from sklearn.model_selection import KFold

import feature_selection as fs

def read_data(filename):
    # Read the CSV dataset, skipping the header row; values are kept as strings.
    data = np.array(map(lambda x: x.strip().split(","),
                        open(filename).readlines()[1:]))
    return data

def validate_features(datashape, features_index):
    # Check that every selected feature index lies within the data columns.
    cols = datashape[1]
    return reduce(lambda check1, check2: check1 and check2,
                  map(lambda index: index < cols, features_index))

def filter_features(data, features_index):
    print features_index
    if (data.shape[1] - 1) not in features_index:
        # Adding output feature if not already added
        features_index.append(data.shape[1] - 1)

    print features_index
    if validate_features(data.shape, features_index):
        return data[:, features_index]

    print "Unable to filter features as the indices are out of bounds"
    return data

def split_train_test(data, ratio_train=0.8):
    # Shuffle and split the data into training and test portions.
    np.random.shuffle(data)
    X = data[:, :-1]
    y = data[:, -1]
    total_size = X.shape[0]
    train_size = int(total_size * ratio_train)
    return X[:train_size, :], y[:train_size], X[train_size:, :], y[train_size:]

def train(X, y):
    # Estimate the class priors and, per class, the relative frequency of each
    # value of each feature (the Naive Bayes likelihood estimates).
    N = X.shape[0]
    class_prob = defaultdict(float)
    class_feature_value_count = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))

    for i, row in enumerate(X):
        class_prob[y[i]] += 1.0
        for j, col in enumerate(row):
            class_feature_value_count[y[i]][j][col] += 1

    for cls in class_prob:
        for j in class_feature_value_count[cls]:
            for val in class_feature_value_count[cls][j]:
                class_feature_value_count[cls][j][val] = \
                    class_feature_value_count[cls][j][val] / class_prob[cls]

        class_prob[cls] /= float(N)

    return class_prob, class_feature_value_count

def predict(X_test, y_test, class_prob, class_feature_value_count):
    # Score each class as log(prior) plus the sum of the estimated conditional
    # probabilities of the observed feature values (unseen values contribute 0),
    # and pick the highest-scoring class.
    correct = 0
    for i, row in enumerate(X_test):
        true_class = y_test[i]
        prediction = None
        max_val = float("-inf")

        for cls in class_prob:
            val = math.log(class_prob[cls])
            for feature in class_feature_value_count[cls]:
                val += class_feature_value_count[cls][feature][row[feature]]

            if val > max_val:
                max_val = val
                prediction = cls

        if true_class == prediction:
            correct += 1

    accuracy = 100 * float(correct) / X_test.shape[0]
    return accuracy

def store(class_prob, class_feature_value_count):
    print "Storing model"
    model = {"class_prob": class_prob,
             "class_feature_value_count": class_feature_value_count}
    json.dump(model, open("model.txt", "w"))

def main(argv):
    print "Phishing URL predictor - Naive Bayes approach"
    training_file = argv[1]
    option = "random"
    if len(argv) > 2:
        option = argv[2]

    if option == "random":
        # Random shuffle run
        split = 0.8
        if len(argv) == 4:
            split = float(argv[3])

        data = read_data(training_file)
        # Keep only the most informative feature columns before training.
        data = filter_features(data, fs.select_features(data,
                               {"method": "info_gain", "num_features": 14}))
        X_train, y_train, X_test, y_test = split_train_test(data, split)
        class_prob, class_feature_value_count = train(X_train, y_train)
        store(class_prob, class_feature_value_count)
        accuracy = predict(X_test, y_test, class_prob, class_feature_value_count)
        print "\n"
        print "Random test-train split. Training ratio = " + str(split) + "."
        print "\n"
        print "Accuracy = " + str(accuracy) + " %."
        print "\n"

    elif option == "cv":
        k = 5
        if len(argv) == 4:
            # The fold count is the fourth argument (argv[2] holds the "cv" option itself).
            k = int(argv[3])

        kf = KFold(n_splits=k)
        data = read_data(training_file)
        data = filter_features(data, fs.select_features(data,
                               {"method": "info_gain", "num_features": 25}))
        np.random.shuffle(data)
        X = data[:, :-1]
        y = data[:, -1]
        accuracy = 0.0

        for train_idx, test_idx in kf.split(data):
            class_prob, class_feature_value_count = train(X[train_idx], y[train_idx])
            accuracy += predict(X[test_idx], y[test_idx], class_prob,
                                class_feature_value_count)
        print "Cross-validated. Folds = " + str(k) + "."
        print "Average Accuracy over different folds = " + str(accuracy / k)

    else:
        print "Illegal options set. Use either 'random' or 'cv'"

if __name__ == '__main__':
    main(sys.argv)
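
Assuming the classification file is saved as classify.py and the dataset export as dataset.csv (both names are placeholders), typical invocations would look like:

python classify.py dataset.csv random 0.8   # random train-test split, 80% training ratio
python classify.py dataset.csv cv 5         # 5-fold cross validation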

Screenshots

Figure 2: Random Forest with Test-Train Data Split at 10%

Figure 3: Random Forest with Test-Train Data Split at 80%

Figure 4: Cross Validation with 3 Folds

Figure 5: Cross Validation with 6 Folds

Conclusion
This project demonstrates the effect of feature set reduction on the detection of phishing webpages. The random forest algorithm and cross validation have proved suitable for such optimization problems. The proposed system takes advantage of this quality by applying the random forest algorithm and cross validation to identify the optimal features, which are then passed to the Naïve Bayes classifier for the detection of phishing attacks.

References
• https://archive.ics.uci.edu/ml/datasets/phishing+websites
• https://docs.scipy.org/doc/
• http://scikit-learn.org/stable/documentation.html
• https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd
• https://machinelearningmastery.com/k-fold-cross-validation/
• https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/

