
Improved Intrusion Detection Applying Feature Selection
Using Rank & Score of Attributes in KDD-99 Data Set

A THESIS SUBMITTED BY
JYOTI HARBOLA
M.TECH. COMPUTER SCIENCE

GUIDED BY
DR. K. S. VAISLA
ASSOCIATE PROFESSOR
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Index
1. Introduction of Intrusion Detection System
2. Feature Selection
3. Literature Survey
4. Proposed Work
5. Result and Discussion
6. Conclusion and Future Work
7. References

1: Introduction to IDS
Intrusions are activities that violate the security policy of a system.
Intrusion detection is the process used to identify intrusions.
IDSs are known as powerful tools for detecting, rejecting, and deterring malicious attacks across the network.

Types of Intrusion Detection System

Based on the source of the audit information used by each IDS, IDSs may be classified into:
Host-based IDSs
Distributed IDSs
Network-based IDSs

Contd.
Host-based IDSs
Get audit data from host audit trails.
Detect attacks against a single host.
Distributed IDSs
Gather audit data from multiple hosts and possibly the network that connects the hosts.
Detect attacks involving multiple hosts.
Network-based IDSs
Use network traffic as the audit data source, relieving the burden on the hosts that usually provide normal computing services.

Methods to build lightweight IDSs

There are two representative methods to build lightweight IDSs:
1. Parameter optimization of data mining and machine learning algorithms, and
2. Feature selection of audit data.

2: Introduction to Feature Selection

Intrusion Detection Systems process huge amounts of audit data containing many features.
However, not all features are essential for classifying network audit data.
Some of these features are irrelevant or redundant.
They not only increase computational cost, such as time and overhead, but also decrease the detection rate.

Contd.
Feature selection is used to find only the important features, or a feature set, out of all the features of the audit data.
It reduces the dimensionality of the data by eliminating features that are noisy, redundant, or irrelevant for a classification problem.
It can make IDSs lightweight and improve classification performance.

The main objectives of feature selection are
Reduced computational complexity and dimensionality
Retention of sufficient information
Economy
Improved accuracy
Problem understanding

3: Literature Survey

Intrusion Detection Techniques
Data Mining methods
Pattern Matching
Measure-based methods
Machine Learning methods

Data Mining
Data mining can help improve intrusion detection in the following ways:
Remove normal activity from alarm data to allow analysts to focus on real attacks.
Identify false-alarm generators and bad sensor signatures.
Find anomalous activity that uncovers a real attack.
Identify long, ongoing patterns (different IP address, same activity).

Data mining algorithms for IDS

1. Decision Trees: The decision tree is a predictive modeling technique most often used for classification in data mining.
2. Naïve Bayes: A Bayesian network is a model that encodes probabilistic relationships among variables of interest.
3. Artificial Neural Networks: Neural networks (NN) are systems modeled on the working of the human brain.

Contd.
4. K-Nearest Neighbor: K-Nearest Neighbor (k-NN) is an instance-based learning method for classifying objects based on the closest training examples in the feature space.
5. Support Vector Machine: Support Vector Machines have been proposed as a novel technique for intrusion detection.

Feature Selection Algorithms

Feature scoring: All feature scoring algorithms implement the same method; higher scores are better.
GainRatio
Feature ranking: All feature ranking algorithms provide a method to determine the rank of a feature; lower ranks are better.
RecursiveFeatureEliminationSVM
Feature subset selection: Subset selection algorithms differ from the scoring and ranking methods in that they only provide a set of selected features, without further information on the quality of each feature individually.
GreedyForwardSelection
(A Java-ML usage sketch of the scoring and ranking approaches is shown below.)
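To make the scoring and ranking approaches concrete, here is a minimal Java-ML sketch based on the library's feature-selection tutorial API. The CSV file name, the assumption of a numeric export (nominal KDD attributes pre-encoded), and the 0.2 elimination fraction are illustrative assumptions, and exact package paths may vary by Java-ML version.

import java.io.File;

import net.sf.javaml.core.Dataset;
import net.sf.javaml.featureselection.ranking.RecursiveFeatureEliminationSVM;
import net.sf.javaml.featureselection.scoring.GainRatio;
import net.sf.javaml.tools.data.FileHandler;

public class ScoreAndRankDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical numeric CSV export of the KDD data; class label in the last column (index 41)
        Dataset data = FileHandler.loadDataset(new File("kdd10percent.csv"), 41, ",");

        // Feature scoring: higher score = more informative attribute
        GainRatio scorer = new GainRatio();
        scorer.build(data);
        for (int i = 0; i < scorer.noAttributes(); i++)
            System.out.println("Feature " + i + " Score: " + scorer.score(i));

        // Feature ranking: lower rank = better attribute (0.2 = fraction of attributes removed per iteration)
        RecursiveFeatureEliminationSVM ranker = new RecursiveFeatureEliminationSVM(0.2);
        ranker.build(data);
        for (int i = 0; i < ranker.noAttributes(); i++)
            System.out.println("Rank of feature " + i + ": " + ranker.rank(i));
    }
}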

4: Proposed Work

Problem Statement
With large amounts of data flowing over the network, real-time intrusion detection is not feasible due to computational and resource limitations.
For intrusion detection analysis, the KDD99 dataset needs to be analysed and classified.
The KDD99 dataset has 42 features used when analysing the algorithms. This number of features is large and increases processing time; it can be reduced, and accuracy can be increased.
Here, accuracy means the proportion of instances correctly classified by an algorithm out of the total instances.

Proposed Work
The main objectives of feature selection in reducing the features of the dataset are given below:
To reduce the size and attributes of the KDD99 dataset using a feature selection algorithm.
To minimize the time taken by the intrusion detection process.
To increase the accuracy of the algorithms used for Intrusion Detection Systems.

Experimental Methodology
In our experimental work we analysed the intrusion dataset in the following steps.

A: Selection of the training and testing dataset:
We chose the KDD Cup 99 data set to perform the experiment.
This data set is widely accepted as a benchmark dataset and has been used by many researchers.
The 10% subset of the KDD Cup 99 data set is chosen to analyse the data mining algorithms and to test the detection of intrusions.

KDD Data Set

No.  Feature               No.  Feature
1    duration              22   is_guest_login
2    protocol_type         23   count
3    service               24   srv_count
4    flag                  25   serror_rate
5    src_bytes             26   srv_serror_rate
6    dst_bytes             27   rerror_rate
7    land                  28   srv_rerror_rate
8    wrong_fragment        29   same_srv_rate
9    urgent                30   diff_srv_rate
10   hot                   31   srv_diff_host_rate
11   num_failed_logins     32   dst_host_count
12   logged_in             33   dst_host_srv_count
13   num_compromised       34   dst_host_same_srv_rate
14   root_shell            35   dst_host_diff_srv_rate
15   su_attempted          36   dst_host_same_src_port_rate
16   num_root              37   dst_host_srv_diff_host_rate
17   num_file_creations    38   dst_host_serror_rate
18   num_shells            39   dst_host_srv_serror_rate
19   num_access_files      40   dst_host_rerror_rate
20   num_outbound_cmds     41   dst_host_srv_rerror_rate
21   is_host_login         42   class

B: WEKA tool:
WEKA is a collection of machine learning algorithms for data mining tasks.
WEKA contains algorithms and tools for data preprocessing, regression, classification, clustering, association rules, and visualization.
WEKA consists of the Explorer, Experimenter, KnowledgeFlow, Simple Command Line Interface, and a Java interface (a usage sketch of the Java interface is given below).
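As an illustration of the Java interface, the following is a minimal sketch of how the IBk (k-nearest-neighbour) classifier, used later in the results, could be evaluated on an ARFF export of the data set. The file name and the use of 10-fold cross-validation (rather than the thesis's train/test split) are assumptions for brevity.

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KddWekaDemo {
    public static void main(String[] args) throws Exception {
        // Assumed ARFF export of the 10% KDD Cup 99 data set
        Instances data = new DataSource("kddcup99_10percent.arff").getDataSet();
        // The class label is the last attribute
        data.setClassIndex(data.numAttributes() - 1);

        // IBk with k = 1 corresponds to the KNN (Lazy.IBk) classifier listed in the results
        IBk knn = new IBk(1);

        // 10-fold cross-validation as a simple stand-in for the thesis's evaluation setup
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(knn, data, 10, new java.util.Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());   // confusion matrix
    }
}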

C: Confusion matrix:
The confusion matrix allows us to calculate the performance of an algorithm.
In our experiment each algorithm produced a confusion matrix in its output, which is used for the accuracy calculation.
The basic structure of the confusion matrix is as follows:

      a       b    <-- classified as
   9282     429    |  a = normal
    527   12306    |  b = anomaly

D: Implemented algorithm using Java in Eclipse:
Using Eclipse, we implemented the algorithm to select features according to Attribute Rank, Attribute Score and Attribute Subset in the dataset.

[Workflow diagram: the Attribute Rank Evaluator, Attribute Score Evaluator and Attribute Subset feed into "Select a Feature Subset"; the new feature subset is then analysed in WEKA.]

5: RESULTS AND DISCUSSIONS

We chose five classifier algorithms for classifying the dataset.
The following table shows the classifiers used in WEKA.

Algorithm       Method
SVM             SMO
ANN             MultilayerPerceptron
KNN             Lazy.IBk
Naïve-Bayes     Bayes.NaiveBayes
Decision Tree   Trees.J48

After analysing the KDD Cup dataset in WEKA, we examined the different parameters of the algorithms' results.
The following table shows the different parameters associated with the algorithms.

Parameter                          SVM       ANN       KNN       NB        DT
Total number of instances          22544     22544     22544     22544     22544
Correctly classified instances     21342     21588     22490     18205     22394
Incorrectly classified instances   1202      956       54        4339      150
Kappa statistic                    0.891     0.9136    0.9951    0.6234    0.9864
Mean absolute error                0.0533    0.0545    0.0024    0.1923    0.0105
Root mean square error             0.2309    0.197     0.0346    0.4369    0.0725
Relative absolute error (%)        10.8721   11.107    0.4974    39.213    2.142

Confusion Matrix

Naïve-Bayes:
      a       b    <-- classified as
   9225     486    |  a = normal
   3858    8975    |  b = anomaly

ANN:
      a       b    <-- classified as
   9282     429    |  a = normal
    527   12306    |  b = anomaly

SVM:
      a       b    <-- classified as
   9000     711    |  a = normal
    491   12342    |  b = anomaly

KNN:
      a       b    <-- classified as
   9711       0    |  a = normal
     54   12779    |  b = anomaly

Decision Tree:
      a       b    <-- classified as
   9643      68    |  a = normal
     82   12751    |  b = anomaly

Algorithm Accuracy
The accuracy parameter is calculated for all the algorithms using the confusion matrix.
The table below shows the running time and correct-classification accuracy of the various algorithms.
It is clear from the table that most of the algorithms do not reach 100% accuracy.

Algorithm Accuracy = (True Positives + True Negatives) / Total Population
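As a minimal illustration of this formula (assuming "normal" is treated as the positive class), the accuracies can be recomputed directly from the confusion matrices above; the printed values may differ in the last decimal place from the figures in the table below.

public class AccuracyDemo {
    /** Accuracy = (TP + TN) / total population, read off a 2x2 confusion matrix. */
    static double accuracy(long tp, long fn, long fp, long tn) {
        return (double) (tp + tn) / (tp + fn + fp + tn);
    }

    public static void main(String[] args) {
        // ANN confusion matrix from the slide above: 9282 429 / 527 12306
        System.out.printf("ANN accuracy: %.4f%n", accuracy(9282, 429, 527, 12306));
        // SVM confusion matrix from the slide above: 9000 711 / 491 12342
        System.out.printf("SVM accuracy: %.4f%n", accuracy(9000, 711, 491, 12342));
    }
}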

Algorithm/Model   (TP+TN)/Population     Accuracy   Time Taken (sec)
KNN               (9711+12779)/22544     0.997      1.2
Naïve-Bayes       (9222+8983)/22544      0.807      4.2
Decision Tree     (9643+12751)/22544     0.959      2.39
SVM               (9000+12342)/22544     0.946      2.58
ANN               (9282+12306)/22544     0.957      3.4

Feature Selection
Using Eclipse, we implemented the algorithm to select the features.
After feature selection using Java, we obtained results in the form of different parameters.
The parameters are:
1. Attribute rank: This algorithm finds the rank of each attribute on a scale defined by the procedure; lower integers represent higher ranks.

Rank of feature 0:0


Rank of feature 1:2
Rank of feature 2:3
Rank of feature 3:4
Rank of feature 4:6
Rank of feature 5:8
Rank of feature 6:10
Rank of feature 7:12
Rank of feature 8:11
Rank of feature 9:15
Rank of feature 10:14
Rank of feature 11:17
Rank of feature 12:19
Rank of feature 13:20
Rank of feature 14:18
Rank of feature 15:23
Rank of feature 16:26
Rank of feature 17:27
Rank of feature 18:24
Rank of feature 19:25

Rank of feature 20:32


Rank of feature 21:34
Rank of feature 22:33
Rank of feature 23:30
Rank of feature 24:31
Rank of feature 25:38
Rank of feature 26:40
Rank of feature 27:39
Rank of feature 28:35
Rank of feature 29:37
Rank of feature 30:36
Rank of feature 31:28
Rank of feature 32:29
Rank of feature 33:21
Rank of feature 34:22
Rank of feature 35:16
Rank of feature 36:13
Rank of feature 37:9
Rank of feature 38:7
Rank of feature 39:5
Rank of feature 40:1

2. Attribute Score: The score of each attribute on a scale of 0-1. We kept the attributes whose scores are close to the value 1 (a filtering sketch follows the score list below).
Feature 1 Score :0.252308168019056
Feature 2 Score :0.0
Feature 3 Score :0.0
Feature 4 Score :0.0
Feature 5 Score :0.8635694816581653
Feature 6 Score :0.11528480177913569
Feature 7 Score :0.7922705451305562
Feature 8 Score :0.9846316059892641
Feature 9 Score :0.9789083106047213
Feature 10 Score :0.9115301140903463
Feature 11 Score :0.8290498095737958
Feature 12 Score :0.9666111319850087
Feature 13 Score :0.8944075418750325
Feature 14 Score :0.9061700649053416
Feature 15 Score :0.9725557274441675
Feature 16 Score :1.0
Feature 17 Score :0.9423248173571703
Feature 18 Score :0.866985540181125
Feature 19 Score :0.0
Feature 20 Score :0.7478944089652559
Feature 21 Score :0.9202667510834831

Feature 22 Score :0.27918071822520096


Feature 23 Score :0.520870647268416
Feature 24 Score :0.3033022547442611
Feature 25 Score :0.3040742115973446
Feature 26 Score :0.39849817657149333
Feature 27 Score :0.44446021300018684
Feature 28 Score :0.3652166472274252
Feature 29 Score :0.31709809540361417
Feature 30 Score :0.35523299912347567
Feature 31 Score :0.3364717747461143
Feature 32 Score :0.35192749650209687
Feature 33 Score :0.3819583107257604
Feature 34 Score :0.3023248036411523
Feature 35 Score :0.5598661851509699
Feature 36 Score :0.7026082232317187
Feature 37 Score :0.24266342434395133
Feature 38 Score :0.24544627184394258
Feature 39 Score :0.30724088607267636
Feature 40 Score :0.42643408681495976
Feature 41 Score :0.0
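As an illustration of keeping only the attributes whose scores are near 1, here is a small standalone sketch; the 0.7 threshold is an illustrative assumption, not the thesis's exact selection rule.

import java.util.ArrayList;
import java.util.List;

public class ScoreFilterDemo {
    /** Returns the indices of features whose score is at or above the given threshold. */
    static List<Integer> selectByScore(double[] scores, double threshold) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < scores.length; i++)
            if (scores[i] >= threshold)
                selected.add(i);
        return selected;
    }

    public static void main(String[] args) {
        // A few of the scores listed above (features 5 to 8), just for demonstration
        double[] scores = {0.8635, 0.1152, 0.7922, 0.9846};
        // Prints [0, 2, 3]: indices within this small demo array that score at least 0.7
        System.out.println(selectByScore(scores, 0.7));
    }
}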

3. Attribute Subset:
Java-ML has many subset selection algorithms. We used the GreedyForwardSelection method, which follows a greedy approach to feature selection (a usage sketch is shown after the selected attributes below).

The 20 attributes selected by GreedyForwardSelection are:
[0, 32, 33, 38, 39, 36, 37, 8, 9, 10, 21, 23, 22, 25, 24, 27, 26, 29, 28, 31]
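A minimal Java-ML sketch of this step, based on the library's tutorial API: the number of attributes to select (20) matches the slide, while the Pearson-correlation attribute evaluator and the CSV file name are illustrative assumptions.

import java.io.File;

import net.sf.javaml.core.Dataset;
import net.sf.javaml.distance.PearsonCorrelationCoefficient;
import net.sf.javaml.featureselection.subset.GreedyForwardSelection;
import net.sf.javaml.tools.data.FileHandler;

public class SubsetSelectionDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical numeric export of the 10% KDD data; class label in column 41
        Dataset data = FileHandler.loadDataset(new File("kdd10percent.csv"), 41, ",");

        // Greedily add attributes one at a time, up to 20, using the supplied evaluation measure
        GreedyForwardSelection selector =
                new GreedyForwardSelection(20, new PearsonCorrelationCoefficient());
        selector.build(data);

        // Prints the indices of the selected attributes, e.g. [0, 32, 33, ...]
        System.out.println(selector.selectedAttributes());
    }
}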

KNN (IBK) run information after feature selection


IB1 instance-based classifier using 1 nearest neighbour(s) for classification
Time taken to build model: 0.04 seconds

=== Summary ===
Correctly Classified Instances      99.5507 %
Incorrectly Classified Instances     0.4493 %
Kappa statistic                      0.991
Mean absolute error                  0.0061
Relative absolute error              1.2214 %
Root relative squared error         12.9175 %

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.997    0.007    0.994      0.997   0.996      0.999     normal
               0.993    0.003    0.997      0.993   0.995      0.999     anomaly
Weighted Avg.  0.996    0.005    0.996      0.996   0.996      0.999
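The run above could be reproduced along these lines with the WEKA Java API, keeping only the selected attributes (plus the class attribute) via the Remove filter and then evaluating IBk. The file names, the single train/test split, and the index string are assumptions for illustration; note that WEKA attribute indices are 1-based, while the Java-ML indices above are 0-based.

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ReducedKnnDemo {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("kdd_train.arff").getDataSet();   // assumed file names
        Instances test  = new DataSource("kdd_test.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Keep only the selected attributes plus the class attribute (1-based, illustrative indices)
        Remove keep = new Remove();
        keep.setAttributeIndices("1,9,10,11,22,23,24,25,26,27,28,29,30,32,33,34,37,38,39,40,last");
        keep.setInvertSelection(true);   // invert: keep the listed attributes, remove the rest
        keep.setInputFormat(train);
        Instances trainReduced = Filter.useFilter(train, keep);
        Instances testReduced  = Filter.useFilter(test, keep);

        IBk knn = new IBk(1);            // 1-nearest-neighbour, as in the run above
        knn.buildClassifier(trainReduced);

        Evaluation eval = new Evaluation(trainReduced);
        eval.evaluateModel(knn, testReduced);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}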

Comparison of KNN (IBk) run information before and after feature selection

KNN (IBk) Algorithm                Before Feature Selection   After Feature Selection
Correctly classified instances     22390                      22445
Incorrectly classified instances   154                        99
Kappa statistic                    0.9858                     0.991
Mean absolute error                0.0113                     0.0061
Root mean square error             0.0811                     0.0644
Relative absolute error (%)        2.421                      1.2214
Time taken in analysis (seconds)   1.2                        0.04

[Bar charts: comparison of KNN (IBk) run information before and after feature selection for correctly classified instances, incorrectly classified instances, Kappa statistic, mean absolute error, root mean square error, relative absolute error (%), and time taken in analysis (seconds); values as in the table above.]

6: CONCLUSION AND FUTURE WORK

Conclusion
We focused on feature selection for intrusion detection.
As an initial step, we analysed the KDD Cup dataset in WEKA and calculated the accuracy of the algorithms.
We observed that with the full-feature dataset the accuracy is not high and the classification (running) time is high.
We then selected the algorithm that gave the best accuracy.

Contd.
We used the Java-ML libraries for feature selection on different parameters.
We applied the feature selection methods and tested the selected features with the same algorithms. The results show that for large datasets it is better to run the intrusion detection algorithm with minimal features, so that intrusions can be classified correctly and in a timely manner.
With feature selection or filtration methods, the accuracy is increased and the algorithm running time is reduced.

FUTURE WORK
In future work we will implement rough and soft set theories for feature selection, which are extensions of the data mining feature selection methods.
Soft set theory is a general method for solving problems of uncertainty. Soft sets are a powerful tool for decision making about information systems, data mining, and drawing conclusions from data.

References
1. S. T. Zargar, J. Joshi, D. Tipper, "A Survey of Defense Mechanisms Against Distributed Denial of Service (DDoS) Flooding Attacks", IEEE Communications Surveys & Tutorials, vol. 15, issue 4, 2013.
2. C. Lima, M. Assis, C. Protásio, "An Empirical Investigation of Attribute Selection Techniques Based on Shannon, Rényi and Tsallis Entropies for Network Intrusion Detection", American Journal of Intelligent Systems, pp. 111-117, 2012.
3. K. Kira, L. A. Rendell, "The Feature Selection Problem: Traditional Methods and a New Algorithm", IEEE, pp. 129-134, 1998.
4. H. Almuallim, T. G. Dietterich, "Learning with Many Irrelevant Features", pp. 547-551, MIT Press, 1991.
5. I. H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, 2005.
6. T. S. Chou, K. K. Yen, J. Luo, "Network Intrusion Detection Using Feature Selection of Soft Computing Paradigms", 2008.
7. J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1994.
8. Sanmay Das, "Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection", 2012.
9. Lei Yu, Huan Liu, "Efficient Feature Selection via Analysis of Relevance and Redundancy", Journal of Machine Learning Research, 2004.
10. S. Ganapathy, K. Kulothungan, S. Muthurajkumar, "Intelligent Feature Selection and Classification Techniques for Intrusion Detection in Networks: A Survey", Journal on Wireless Communications and Networking, 2013.
11. R. Agarwal, M. V. Joshi, "PNrule: A New Framework for Learning Classifier Models in Data Mining (A Case Study in Network Intrusion Detection)", Citeseer, 2000.
12. R. Beghdad, "Efficient Deterministic Method for Detecting New U2R Attacks", Computer Communications, vol. 32, pp. 1104-1110, 2009.
13. V. Marinova, "A Short Survey of Intrusion Detection Systems", Problems of Engineering Cybernetics and Robotics, 58:23-30, 2007.
14. M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, Wiley-IEEE Press, 2011.
15. D. J. Hand, H. Mannila, P. Smyth, Principles of Data Mining, The MIT Press, 2001.
16. D. Kim, H. N. Nguyen, S. Y. Ohn, J. Park, "Fusions of GA and SVM for Anomaly Detection in Intrusion Detection System", in Proc. of the 2nd International Symposium on Neural Networks (ISNN '05), Chongqing, China, LNCS vol. 3498, pp. 415-420, Springer-Verlag, May 2005.

THANK YOU
