All content following this page was uploaded by Rana Muhammad Amir Latif on 19 February 2020.
Rana M. Amir Latif, Muhammad Umer, Tayyaba Tariq, Muhammad Farhan, Osama Rizwan, Ghazanfar Ali
Department of Computer Science
COMSATS University Islamabad, Sahiwal Campus
Sahiwal, Pakistan
ranaamir10611@gmail.com, muhammadumer063@gmail.com, tayyaba.tariq.tt@gmail.com, farhansajid@gmail.com,
rizwan.osama.official@gmail.com, ghazanfarali78622@gmail.com
III. METHODOLOGY
In this paper, our main focus is to find the relevant features that differentiate legitimate websites from phishing and suspicious websites. To identify these features, we carry out analysis using different WEKA algorithms and also perform some statistical investigation to find further differences between these classes of websites, as shown in “Fig. 1”. The Knowledge Flow interface is an alternative to the Explorer: the user selects WEKA components from a toolbar, lays them out, and connects them together to form a knowledge flow. For our experiments, we connected a CSV loader, a class assigner, a cross-validation component, and then an algorithm such as J48 or random forest, followed by the classifier performance evaluator; finally, we view the output using a text viewer. The main reason we chose the Knowledge Flow interface for our experiments is that it provides not only the statistical values but also a pictorial view of the data flow. It shows the complete network: data is loaded from the source file by a loader suited to the file format and passed through the class assigner; cross-validation is then performed; the suitable algorithm is selected from the window above for testing; its output is passed through the classifier performance evaluator; and in the end the results are viewed using the text viewer.

[Figure blocks: Knowledge Flow; Machine Learning algorithms used (J48, Decision Stump, Random Forest Tree, Naïve Bayes); Comparison; Objectives (Accurately Predicting Website Legitimacy, Comparing Results, Identifying Best Algorithm)]
Fig. 1. Flow Diagram of Methodology

The Smart Website Analyzer for secure e-banking and e-commerce payment incorporates concepts of Artificial Intelligence and different algorithms from WEKA to predict the legitimacy of a website based on well-defined parameters. Building the prediction model involves more than one algorithm in WEKA. The system is designed so that it takes inputs from the user, matches them against the training data, and yields an output. The fields that the user supplies as inputs are listed in “Table 1”; all of these attributes contain only categorical values.
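The Knowledge Flow pipeline described in this section (loader, class assigner, cross-validation, classifier, evaluator, text viewer) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the WEKA implementation: the data is synthetic and hypothetical (only the column names are borrowed from the attribute list), the classifier is a simple decision stump, the folds are plain unstratified folds, and all function names are invented for this example.

```python
import csv
import io
import random
from collections import Counter, defaultdict


def make_toy_csv(n=200, seed=0):
    """Build a small synthetic CSV (hypothetical data, NOT the UCI dataset)."""
    rng = random.Random(seed)
    lines = ["SFH,SSLfinal_State,Result"]
    for _ in range(n):
        sfh = rng.choice([-1, 0, 1])
        ssl = rng.choice([-1, 0, 1])
        # The label copies SFH 80% of the time, otherwise it is random.
        label = sfh if rng.random() < 0.8 else rng.choice([-1, 0, 1])
        lines.append(f"{sfh},{ssl},{label}")
    return "\n".join(lines)


def load_csv(text):
    """CSV loader step: parse rows into dictionaries of categorical strings."""
    return list(csv.DictReader(io.StringIO(text)))


def train_stump(rows, class_attr):
    """Decision stump: keep the single attribute whose per-value majority
    vote makes the fewest errors on the training rows."""
    default = Counter(r[class_attr] for r in rows).most_common(1)[0][0]
    best = None
    for attr in rows[0]:
        if attr == class_attr:
            continue
        votes = defaultdict(Counter)
        for r in rows:
            votes[r[attr]][r[class_attr]] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in votes.items()}
        errors = sum(rule[r[attr]] != r[class_attr] for r in rows)
        if best is None or errors < best[0]:
            best = (errors, attr, rule)
    _, attr, rule = best
    # Unseen attribute values fall back to the overall majority class.
    return lambda row: rule.get(row[attr], default)


def cross_validate(rows, class_attr, k=10, seed=1):
    """Cross-validation and evaluator steps: overall accuracy over k folds."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    correct = 0
    for i in range(k):
        test = rows[i::k]  # every k-th row, starting at i
        train = [r for j, r in enumerate(rows) if j % k != i]
        predict = train_stump(train, class_attr)
        correct += sum(predict(r) == r[class_attr] for r in test)
    return correct / len(rows)


rows = load_csv(make_toy_csv())              # CSV loader
accuracy = cross_validate(rows, "Result")    # class assigner: "Result" is the class
print(f"10-fold accuracy: {accuracy:.3f}")   # text viewer
```

Swapping `train_stump` for another training function would correspond to replacing the algorithm component in the flow while leaving the loader, cross-validation, and evaluator untouched.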
To collect a dataset of different types of websites (a website can be phishing, suspicious, or legitimate), we downloaded data from the UCI Machine Learning Repository. After identifying the different features of websites, we collected data on 1353 different websites from the repository. In this dataset, the phishing websites were collected from the PhishTank data archive, which is a free community site.

TABLE I. ATTRIBUTES OF WEBSITE DATASET

Sr. No | Attribute | Value
1 | SFH | 1, -1, 0
2 | popUpWidnow | -1, 0, 1
3 | SSLfinal_State | 1, -1, 0
4 | Request_URL | -1, 0, 1
5 | URL_of_Anchor | -1, 0, 1
6 | web_traffic | 1, 0, -1
7 | URL_Length | 1, -1, 0
8 | age_of_domain | 1, -1
9 | having_IP_Address | 0, 1
10 | Result | 0, 1, -1

A. J48 algorithm
J48 is an extension of ID3. It has many additional features, such as derivation of rules, continuous attribute value ranges, and decision tree pruning. J48 is an open-source Java implementation of the C4.5 algorithm in WEKA for data mining. The WEKA tool provides several options associated with tree pruning. Where there is potential overfitting, pruning can be used as a tool for improving precision. Classification is performed recursively for every leaf, and the classification of the data at each leaf should be as accurate as possible. The rules generated by the algorithm are specific to the data. The main objective of generalizing the decision tree is to reach an equilibrium between flexibility and accuracy.

B. Random Forest
In ensemble learning, bagging is a technique in which we build several independent learners (models, predictors) and combine them through some form of model averaging, such as a majority vote, a normal average, or a weighted average. The models differ slightly from one another because each is typically trained on a random bootstrap sample of the data. Every model produces some probability for each observation. To build the final model, this technique combines many uncorrelated learners, which reduces error by minimizing the variance. Random forest is an example of bagging in ensemble learning, as shown in “Fig. 2”.

C. Naïve Bayes
This is a classification technique based on Bayes' theorem with an assumption of independence among predictors. The Naïve Bayes classifier assumes that a particular feature in a class has no relation to any other feature in the class. For example, a fruit may be considered an apple if it is red and 3 inches in diameter. Even if these features depend on each other or on the existence of other features, each of them contributes independently to the probability that the fruit is an apple, which is why the method is called “naïve”. The Naïve Bayes model is very useful for large datasets. Naïve Bayes is known to outperform even highly sophisticated classification algorithms, as shown in “Fig. 2”.

D. Logistic Model Tree
Linear models and tree induction are popular techniques for supervised learning, and both can be used with numeric values and nominal classes. Trees that have linear regression functions at the leaves can be used for predicting numeric quantities. For classification, logistic regression can be used instead of linear regression. A stagewise fitting process selects suitable features from the data, and the logistic regression models at the leaves are created by refining those built at higher levels in the tree. The performance of the algorithm can then be compared with state-of-the-art learning schemes.

IV. RESULTS AND DISCUSSION

We made predictions using the WEKA data mining tool and measured classification accuracy by applying different algorithmic approaches. We compare the results in two ways. First, we find the best algorithm by comparing measures such as Correctly Classified Instances, Incorrectly Classified Instances, mean absolute error, and the kappa statistic. Second, we analyze the accuracy of these algorithms with parameters such as TP Rate, FP Rate, Precision, Recall, F-Measure, MCC, ROC Area, and PRC Area, which are visualized in a bar chart. The selected algorithm automates the website analysis process. Before making a payment on any e-commerce website, this prediction model can be used to determine the legitimacy of that website.

A. Random Forest
We use the random forest algorithm in WEKA to analyze the legitimacy of the websites. From the result, we extract statistical information about the algorithm, with different parameters describing its accuracy, as shown in “Table 2”. The classification accuracy achieved is 89.8744%: out of a total of 1353 instances, 1216 are correctly classified and 137 are incorrectly classified. The mean absolute error is 0.0922 and the kappa statistic is 0.8198. In “Fig. 3”, we visualize the different parameters in a bar chart that shows the accuracy of the algorithm more precisely. The bar chart contains four series: Series1 shows the bars for the fraud websites, Series2 for the legitimate websites, Series3 for the suspicious websites, and Series4 shows the weighted average of these parameters.
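The accuracy percentage reported for random forest follows directly from the instance counts, and the kappa statistic is computed from a confusion matrix. The sketch below recomputes the percentages from the counts given in the text and shows the standard Cohen's kappa formula. The paper does not reproduce the confusion matrix, so the 2x2 matrix used here is a hypothetical toy example and the function names are assumptions made for illustration.

```python
def percent(part, total):
    """Share of `part` in `total`, as a percentage."""
    return 100.0 * part / total


def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix
    (rows: actual class, columns: predicted class)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: product of row and column marginals per class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected)


# Counts reported in the text for random forest.
total, correct, incorrect = 1353, 1216, 137
assert correct + incorrect == total
print(f"correctly classified:   {percent(correct, total):.4f}%")
print(f"incorrectly classified: {percent(incorrect, total):.4f}%")

# Hypothetical toy matrix, NOT taken from the paper's experiments.
toy_matrix = [[40, 10],
              [5, 45]]
print(f"toy kappa: {cohen_kappa(toy_matrix):.4f}")
```

The first printed value reproduces the 89.8744% stated above. Because kappa discounts the agreement expected by chance, it is lower than the raw accuracy whenever the chance agreement is greater than zero.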
Fig. 3. Bar chart representation of the random forest algorithm

B. J48
We use the J48 algorithm in WEKA to analyze the legitimacy of the websites. From the result, we extract statistical information about the algorithm, with different parameters describing its accuracy, as shown in “Table 3”. The classification accuracy achieved is 89.8004%: out of a total of 1353 instances, 1215 are correctly classified and 138 are incorrectly classified. The mean absolute error is 0.0916 and the kappa statistic is 0.8198. In “Fig. 4”, we visualize the different parameters in a bar chart that shows the accuracy of the algorithm more precisely. The bar chart contains four series: Series1 shows the bars for the fraud websites, Series2 for the legitimate websites, Series3 for the suspicious websites, and Series4 shows the weighted average of these parameters.

TABLE III. STATISTICAL INFORMATION OF J48 ALGORITHM

Correctly Classified Instances | 1215 | 89.8004%
Incorrectly Classified Instances | 138 | 10.1996%
Kappa Statistics | 0.1898
Mean Absolute Error | 0.0916
Root Mean Squared Error | 0.2335
Relative Absolute Error | 24.488%
Root Relative Squared Error | 53.9903%
Total Number of Instances | 1353
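The per-class parameters plotted in the bar charts (TP Rate, FP Rate, Precision, Recall, F-Measure) and their weighted average can all be derived from a confusion matrix. The following is a minimal sketch using the usual definitions. The 3x3 matrix is a hypothetical example and not the result of the experiments reported here, the helper names are invented, and weighting the average by the number of instances in each class is an assumption about how the fourth series is computed.

```python
def per_class_metrics(matrix, labels):
    """Per-class metrics from a confusion matrix
    (rows: actual class, columns: predicted class)."""
    total = sum(sum(row) for row in matrix)
    results = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                 # actual i, predicted other
        fp = sum(row[i] for row in matrix) - tp  # predicted i, actual other
        tn = total - tp - fn - fp
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        fp_rate = fp / (fp + tn) if fp + tn else 0.0
        f_measure = (
            2 * precision * recall / (precision + recall)
            if precision + recall else 0.0
        )
        results[label] = {
            "TP Rate": recall,   # TP rate and recall are the same quantity
            "FP Rate": fp_rate,
            "Precision": precision,
            "Recall": recall,
            "F-Measure": f_measure,
            "support": tp + fn,  # number of actual instances of the class
        }
    return results


def weighted_average(results, key):
    """Average of one metric, weighted by class size (assumed weighting)."""
    total = sum(m["support"] for m in results.values())
    return sum(m[key] * m["support"] for m in results.values()) / total


# Hypothetical confusion matrix, NOT taken from the paper's experiments.
labels = ["fraud", "legitimate", "suspicious"]
matrix = [[45, 3, 2],
          [4, 35, 1],
          [3, 2, 5]]

metrics = per_class_metrics(matrix, labels)
for label in labels:
    m = metrics[label]
    print(f"{label:11s} precision={m['Precision']:.3f} "
          f"recall={m['Recall']:.3f} F={m['F-Measure']:.3f}")
print(f"weighted recall: {weighted_average(metrics, 'Recall'):.3f}")
```

With class-size weighting, the weighted average of recall equals the overall fraction of correctly classified instances, which gives a quick consistency check between a per-class chart and a summary table.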