
Improved Intrusion Detection Applying Feature Selection
Using Rank & Score of Attributes in KDD-99 Data Set

A THESIS SUBMITTED BY
JYOTI HARBOLA
M.TECH. COMPUTER SCIENCE

GUIDED BY
DR. K. S. VAISLA
ASSOCIATE PROFESSOR
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Index
1. Introduction of Intrusion Detection System
2. Feature Selection
3. Literature Survey
4. Proposed Work
5. Result and Discussion
6. Conclusion and Future Work
7. References

1: Introduction to IDS
Intrusions are activities that violate the security policy of a system.
Intrusion detection is the process used to identify intrusions.
IDSs are known as powerful tools for detecting, rejecting, and deterring malicious attacks across the network.

Types of Intrusion Detection System

Based on the source of the audit information used by each IDS, IDSs may be classified into:
Host-based IDSs
Distributed IDSs
Network-based IDSs

Contd.
Host-based IDSs
Get audit data from host audit trails.
Detect attacks against a single host.
Distributed IDSs
Gather audit data from multiple hosts and possibly the network that connects the hosts.
Detect attacks involving multiple hosts.
Network-based IDSs
Use network traffic as the audit data source, relieving the burden on the hosts that usually provide normal computing services.

Methods to build lightweight IDSs

There are two representative methods to build lightweight IDSs:
1. Parameter optimization of data mining and machine learning algorithms, and
2. Feature selection of audit data.

2: Introduction to Feature Selection

Intrusion Detection Systems process huge amounts of audit data containing many features.
However, not all features are essential for classifying network audit data.
Some of these features are irrelevant or redundant.
They not only increase computational cost, such as time and overhead, but also decrease the detection rate.

Contd.
Feature selection is used to find only the important features, or a feature set, out of all the features of the audit data.
It reduces the dimensionality of the data by eliminating features that are noisy, redundant, or irrelevant for a classification problem.
It can make IDSs lightweight and improve classification performance.

The main objectives of feature selection are
Reduced computational complexity and dimensionality
Retention of sufficient information
Economy
Improved accuracy
Problem understanding

3: Literature Survey

Intrusion Detection Techniques
Data Mining methods
Pattern Matching
Measure-based methods
Machine Learning methods

Data Mining
Data mining can help improve intrusion detection in the following ways:
Remove normal activity from alarm data to allow analysts to focus on real attacks.
Identify false-alarm generators and bad sensor signatures.
Find anomalous activity that uncovers a real attack.
Identify long, ongoing patterns (different IP address, same activity).

Data mining algorithms for IDS

1. Decision Trees: The decision tree is a predictive modeling technique most often used for classification in data mining.
2. Naïve Bayes: A Bayesian network is a model that encodes probabilistic relationships among variables of interest.
3. Artificial Neural Networks: Neural networks (NN) are systems modeled on the working of the human brain.

Contd.
4. K-Nearest Neighbor: K-Nearest Neighbor (k-NN) is an instance-based learning method for classifying objects based on the closest training examples in the feature space.
5. Support Vector Machine: Support Vector Machines have been proposed as a novel technique for intrusion detection.

Feature Selection Algorithms

Feature scoring: All feature scoring algorithms implement the same method; higher scores are better.
GainRatio
Feature ranking: All feature ranking algorithms provide a method to determine the rank of a feature; lower ranks are better.
RecursiveFeatureEliminationSVM
Feature subset selection: Subset selection algorithms differ from the scoring and ranking methods in that they only provide a set of selected features, without further information on the quality of each feature individually.
GreedyForwardSelection
(A Java-ML usage sketch of the scoring and ranking approaches is shown below.)
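To make the scoring and ranking approaches concrete, here is a minimal Java-ML sketch based on the library's feature-selection tutorial API. The CSV file name, the assumption of a numeric export (nominal KDD attributes pre-encoded), and the 0.2 elimination fraction are illustrative assumptions, and exact package paths may vary by Java-ML version.

import java.io.File;

import net.sf.javaml.core.Dataset;
import net.sf.javaml.featureselection.ranking.RecursiveFeatureEliminationSVM;
import net.sf.javaml.featureselection.scoring.GainRatio;
import net.sf.javaml.tools.data.FileHandler;

public class ScoreAndRankDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical numeric CSV export of the KDD data; class label in the last column (index 41)
        Dataset data = FileHandler.loadDataset(new File("kdd10percent.csv"), 41, ",");

        // Feature scoring: higher score = more informative attribute
        GainRatio scorer = new GainRatio();
        scorer.build(data);
        for (int i = 0; i < scorer.noAttributes(); i++)
            System.out.println("Feature " + i + " Score: " + scorer.score(i));

        // Feature ranking: lower rank = better attribute (0.2 = fraction of attributes removed per iteration)
        RecursiveFeatureEliminationSVM ranker = new RecursiveFeatureEliminationSVM(0.2);
        ranker.build(data);
        for (int i = 0; i < ranker.noAttributes(); i++)
            System.out.println("Rank of feature " + i + ": " + ranker.rank(i));
    }
}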

4: Proposed Work

Problem Statement
With large amounts of data flowing over the network, real-time intrusion detection is not feasible due to computational and resource limitations.
For intrusion detection analysis, the KDD99 dataset needs to be analysed and classified.
The KDD99 dataset has 42 features used when analysing the algorithms. This number of features is large and increases processing time; it can be reduced, and accuracy can be increased.
Here, accuracy means the proportion of instances correctly classified by an algorithm out of the total instances.

Proposed Work
The main objectives of feature selection in reducing the features of the dataset are given below:
To reduce the size and attributes of the KDD99 dataset using a feature selection algorithm.
To minimize the time taken by the intrusion detection process.
To increase the accuracy of the algorithms used for Intrusion Detection Systems.

Experimental Methodology
In our experimental work we analysed the intrusion dataset in the following steps.

A: Selection of the training and testing dataset:
We chose the KDD Cup 99 data set to perform the experiment.
This data set is widely accepted as a benchmark dataset and has been used by many researchers.
The 10% subset of the KDD Cup 99 data set is chosen to analyse the data mining algorithms and to test the detection of intrusions.

KDD Data Set

No.  Feature               No.  Feature
1    duration              22   is_guest_login
2    protocol_type         23   count
3    service               24   srv_count
4    flag                  25   serror_rate
5    src_bytes             26   srv_serror_rate
6    dst_bytes             27   rerror_rate
7    land                  28   srv_rerror_rate
8    wrong_fragment        29   same_srv_rate
9    urgent                30   diff_srv_rate
10   hot                   31   srv_diff_host_rate
11   num_failed_logins     32   dst_host_count
12   logged_in             33   dst_host_srv_count
13   num_compromised       34   dst_host_same_srv_rate
14   root_shell            35   dst_host_diff_srv_rate
15   su_attempted          36   dst_host_same_src_port_rate
16   num_root              37   dst_host_srv_diff_host_rate
17   num_file_creations    38   dst_host_serror_rate
18   num_shells            39   dst_host_srv_serror_rate
19   num_access_files      40   dst_host_rerror_rate
20   num_outbound_cmds     41   dst_host_srv_rerror_rate
21   is_host_login         42   class

B: WEKA tool:
WEKA is a collection of machine learning algorithms for data mining tasks.
WEKA contains algorithms and tools for data preprocessing, regression, classification, clustering, association rules, and visualization.
WEKA consists of the Explorer, Experimenter, KnowledgeFlow, Simple Command Line Interface, and a Java interface (a usage sketch of the Java interface is given below).
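As an illustration of the Java interface, the following is a minimal sketch of how the IBk (k-nearest-neighbour) classifier, used later in the results, could be evaluated on an ARFF export of the data set. The file name and the use of 10-fold cross-validation (rather than the thesis's train/test split) are assumptions for brevity.

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KddWekaDemo {
    public static void main(String[] args) throws Exception {
        // Assumed ARFF export of the 10% KDD Cup 99 data set
        Instances data = new DataSource("kddcup99_10percent.arff").getDataSet();
        // The class label is the last attribute
        data.setClassIndex(data.numAttributes() - 1);

        // IBk with k = 1 corresponds to the KNN (Lazy.IBk) classifier listed in the results
        IBk knn = new IBk(1);

        // 10-fold cross-validation as a simple stand-in for the thesis's evaluation setup
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(knn, data, 10, new java.util.Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());   // confusion matrix
    }
}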

C: Confusion matrix:
The confusion matrix allows us to calculate the performance of an algorithm.
In our experiment each algorithm produced a confusion matrix in its output, which is used for the accuracy calculation.
The basic structure of the confusion matrix is as follows:

      a       b    <-- classified as
   9282     429    |  a = normal
    527   12306    |  b = anomaly

D: Implemented algorithm using Java in Eclipse:
Using Eclipse, we implemented the algorithm to select features according to Attribute Rank, Attribute Score and Attribute Subset in the dataset.

[Workflow diagram: the Attribute Rank Evaluator, Attribute Score Evaluator and Attribute Subset feed into "Select a Feature Subset"; the new feature subset is then analysed in WEKA.]

5: RESULTS AND DISCUSSIONS

We chose five classifier algorithms for classifying the dataset.
The following table shows the classifiers used in WEKA.

Algorithm       Method
SVM             SMO
ANN             MultilayerPerceptron
KNN             Lazy.IBk
Naïve-Bayes     Bayes.NaiveBayes
Decision Tree   Trees.J48

After analysing the KDD Cup dataset in WEKA, we examined the different parameters of the algorithms' results.
The following table shows the different parameters associated with the algorithms.

Parameter                          SVM       ANN       KNN       NB        DT
Total number of instances          22544     22544     22544     22544     22544
Correctly classified instances     21342     21588     22490     18205     22394
Incorrectly classified instances   1202      956       54        4339      150
Kappa statistic                    0.891     0.9136    0.9951    0.6234    0.9864
Mean absolute error                0.0533    0.0545    0.0024    0.1923    0.0105
Root mean square error             0.2309    0.197     0.0346    0.4369    0.0725
Relative absolute error (%)        10.8721   11.107    0.4974    39.213    2.142

Confusion Matrix

Naïve-Bayes:
      a       b    <-- classified as
   9225     486    |  a = normal
   3858    8975    |  b = anomaly

ANN:
      a       b    <-- classified as
   9282     429    |  a = normal
    527   12306    |  b = anomaly

SVM:
      a       b    <-- classified as
   9000     711    |  a = normal
    491   12342    |  b = anomaly

KNN:
      a       b    <-- classified as
   9711       0    |  a = normal
     54   12779    |  b = anomaly

Decision Tree:
      a       b    <-- classified as
   9643      68    |  a = normal
     82   12751    |  b = anomaly

Algorithm Accuracy
The accuracy parameter is calculated for all the algorithms using the confusion matrix.
The table below shows the running time and correct-classification accuracy of the various algorithms.
It is clear from the table that most of the algorithms do not reach 100% accuracy.

Algorithm Accuracy = (True Positives + True Negatives) / Total Population
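As a minimal illustration of this formula (assuming "normal" is treated as the positive class), the accuracies can be recomputed directly from the confusion matrices above; the printed values may differ in the last decimal place from the figures in the table below.

public class AccuracyDemo {
    /** Accuracy = (TP + TN) / total population, read off a 2x2 confusion matrix. */
    static double accuracy(long tp, long fn, long fp, long tn) {
        return (double) (tp + tn) / (tp + fn + fp + tn);
    }

    public static void main(String[] args) {
        // ANN confusion matrix from the slide above: 9282 429 / 527 12306
        System.out.printf("ANN accuracy: %.4f%n", accuracy(9282, 429, 527, 12306));
        // SVM confusion matrix from the slide above: 9000 711 / 491 12342
        System.out.printf("SVM accuracy: %.4f%n", accuracy(9000, 711, 491, 12342));
    }
}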

Algorithm/Model   (TP+TN)/Population     Accuracy   Time Taken (sec)
KNN               (9711+12779)/22544     0.997      1.2
Naïve-Bayes       (9222+8983)/22544      0.807      4.2
Decision Tree     (9643+12751)/22544     0.959      2.39
SVM               (9000+12342)/22544     0.946      2.58
ANN               (9282+12306)/22544     0.957      3.4

Feature Selection
Using Eclipse, we implemented the algorithm to select the features.
After feature selection using Java, we obtained results in the form of different parameters.
The parameters are:
1. Attribute rank: This algorithm finds the rank of each attribute on a scale defined by the procedure; lower integers represent higher ranks.

Rank of feature 0:0


Rank of feature 1:2
Rank of feature 2:3
Rank of feature 3:4
Rank of feature 4:6
Rank of feature 5:8
Rank of feature 6:10
Rank of feature 7:12
Rank of feature 8:11
Rank of feature 9:15
Rank of feature 10:14
Rank of feature 11:17
Rank of feature 12:19
Rank of feature 13:20
Rank of feature 14:18
Rank of feature 15:23
Rank of feature 16:26
Rank of feature 17:27
Rank of feature 18:24
Rank of feature 19:25

Rank of feature 20:32


Rank of feature 21:34
Rank of feature 22:33
Rank of feature 23:30
Rank of feature 24:31
Rank of feature 25:38
Rank of feature 26:40
Rank of feature 27:39
Rank of feature 28:35
Rank of feature 29:37
Rank of feature 30:36
Rank of feature 31:28
Rank of feature 32:29
Rank of feature 33:21
Rank of feature 34:22
Rank of feature 35:16
Rank of feature 36:13
Rank of feature 37:9
Rank of feature 38:7
Rank of feature 39:5
Rank of feature 40:1

2. Attribute Score: The score of each attribute on a scale of 0-1. We kept the attributes whose scores are close to the value 1 (a filtering sketch follows the score list below).
Feature 1 Score :0.252308168019056
Feature 2 Score :0.0
Feature 3 Score :0.0
Feature 4 Score :0.0
Feature 5 Score :0.8635694816581653
Feature 6 Score :0.11528480177913569
Feature 7 Score :0.7922705451305562
Feature 8 Score :0.9846316059892641
Feature 9 Score :0.9789083106047213
Feature 10 Score :0.9115301140903463
Feature 11 Score :0.8290498095737958
Feature 12 Score :0.9666111319850087
Feature 13 Score :0.8944075418750325
Feature 14 Score :0.9061700649053416
Feature 15 Score :0.9725557274441675
Feature 16 Score :1.0
Feature 17 Score :0.9423248173571703
Feature 18 Score :0.866985540181125
Feature 19 Score :0.0
Feature 20 Score :0.7478944089652559
Feature 21 Score :0.9202667510834831

Feature 22 Score :0.27918071822520096


Feature 23 Score :0.520870647268416
Feature 24 Score :0.3033022547442611
Feature 25 Score :0.3040742115973446
Feature 26 Score :0.39849817657149333
Feature 27 Score :0.44446021300018684
Feature 28 Score :0.3652166472274252
Feature 29 Score :0.31709809540361417
Feature 30 Score :0.35523299912347567
Feature 31 Score :0.3364717747461143
Feature 32 Score :0.35192749650209687
Feature 33 Score :0.3819583107257604
Feature 34 Score :0.3023248036411523
Feature 35 Score :0.5598661851509699
Feature 36 Score :0.7026082232317187
Feature 37 Score :0.24266342434395133
Feature 38 Score :0.24544627184394258
Feature 39 Score :0.30724088607267636
Feature 40 Score :0.42643408681495976
Feature 41 Score :0.0
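As an illustration of keeping only the attributes whose scores are near 1, here is a small standalone sketch; the 0.7 threshold is an illustrative assumption, not the thesis's exact selection rule.

import java.util.ArrayList;
import java.util.List;

public class ScoreFilterDemo {
    /** Returns the indices of features whose score is at or above the given threshold. */
    static List<Integer> selectByScore(double[] scores, double threshold) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < scores.length; i++)
            if (scores[i] >= threshold)
                selected.add(i);
        return selected;
    }

    public static void main(String[] args) {
        // A few of the scores listed above (features 5 to 8), just for demonstration
        double[] scores = {0.8635, 0.1152, 0.7922, 0.9846};
        // Prints [0, 2, 3]: indices within this small demo array that score at least 0.7
        System.out.println(selectByScore(scores, 0.7));
    }
}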

3. Attribute Subset:
Java-ML has many subset selection algorithms. We used the GreedyForwardSelection method, which follows a greedy approach to feature selection (a usage sketch is shown after the selected attributes below).

The 20 attributes selected by GreedyForwardSelection are:
[0, 32, 33, 38, 39, 36, 37, 8, 9, 10, 21, 23, 22, 25, 24, 27, 26, 29, 28, 31]
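A minimal Java-ML sketch of this step, based on the library's tutorial API: the number of attributes to select (20) matches the slide, while the Pearson-correlation attribute evaluator and the CSV file name are illustrative assumptions.

import java.io.File;

import net.sf.javaml.core.Dataset;
import net.sf.javaml.distance.PearsonCorrelationCoefficient;
import net.sf.javaml.featureselection.subset.GreedyForwardSelection;
import net.sf.javaml.tools.data.FileHandler;

public class SubsetSelectionDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical numeric export of the 10% KDD data; class label in column 41
        Dataset data = FileHandler.loadDataset(new File("kdd10percent.csv"), 41, ",");

        // Greedily add attributes one at a time, up to 20, using the supplied evaluation measure
        GreedyForwardSelection selector =
                new GreedyForwardSelection(20, new PearsonCorrelationCoefficient());
        selector.build(data);

        // Prints the indices of the selected attributes, e.g. [0, 32, 33, ...]
        System.out.println(selector.selectedAttributes());
    }
}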

KNN (IBK) run information after feature selection


IB1 instance-based classifier using 1 nearest neighbour(s) for classification
Time taken to build model: 0.04 seconds

=== Summary ===
Correctly Classified Instances      99.5507 %
Incorrectly Classified Instances     0.4493 %
Kappa statistic                      0.991
Mean absolute error                  0.0061
Relative absolute error              1.2214 %
Root relative squared error         12.9175 %

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.997    0.007    0.994      0.997   0.996      0.999     normal
               0.993    0.003    0.997      0.993   0.995      0.999     anomaly
Weighted Avg.  0.996    0.005    0.996      0.996   0.996      0.999
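The run above could be reproduced along these lines with the WEKA Java API, keeping only the selected attributes (plus the class attribute) via the Remove filter and then evaluating IBk. The file names, the single train/test split, and the index string are assumptions for illustration; note that WEKA attribute indices are 1-based, while the Java-ML indices above are 0-based.

import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ReducedKnnDemo {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("kdd_train.arff").getDataSet();   // assumed file names
        Instances test  = new DataSource("kdd_test.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Keep only the selected attributes plus the class attribute (1-based, illustrative indices)
        Remove keep = new Remove();
        keep.setAttributeIndices("1,9,10,11,22,23,24,25,26,27,28,29,30,32,33,34,37,38,39,40,last");
        keep.setInvertSelection(true);   // invert: keep the listed attributes, remove the rest
        keep.setInputFormat(train);
        Instances trainReduced = Filter.useFilter(train, keep);
        Instances testReduced  = Filter.useFilter(test, keep);

        IBk knn = new IBk(1);            // 1-nearest-neighbour, as in the run above
        knn.buildClassifier(trainReduced);

        Evaluation eval = new Evaluation(trainReduced);
        eval.evaluateModel(knn, testReduced);
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}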

Comparison of KNN (IBk) run information before and after feature selection

KNN (IBk) Algorithm                Before Feature Selection   After Feature Selection
Correctly classified instances     22390                      22445
Incorrectly classified instances   154                        99
Kappa statistic                    0.9858                     0.991
Mean absolute error                0.0113                     0.0061
Root mean square error             0.0811                     0.0644
Relative absolute error (%)        2.421                      1.2214
Time taken in analysis (seconds)   1.2                        0.04

[Bar charts: comparison of KNN (IBk) run information before and after feature selection for correctly classified instances, incorrectly classified instances, Kappa statistic, mean absolute error, root mean square error, relative absolute error (%), and time taken in analysis (seconds); values as in the table above.]

6: CONCLUSION AND FUTURE WORK

Conclusion
We focused on feature selection for intrusion detection.
As an initial step, we analysed the KDD Cup dataset in WEKA and calculated the accuracy of the algorithms.
We observed that with the full-feature dataset the accuracy is not high and the classification (running) time is high.
We then selected the algorithm that gave the best accuracy.

Contd.
We used the Java-ML libraries for feature selection on different parameters.
We applied the feature selection methods and tested the selected features with the same algorithms. The results show that for large datasets it is better to run the intrusion detection algorithm with minimal features, so that intrusions can be classified correctly and in a timely manner.
With feature selection or filtration methods, the accuracy is increased and the algorithm running time is reduced.

FUTURE WORK
In future work we will implement rough and soft set theories for feature selection, which are extensions of the data mining feature selection methods.
Soft set theory is a general method for solving problems of uncertainty. Soft sets are a powerful tool for decision making about information systems, data mining, and drawing conclusions from data.

References
1. S. T. Zargar, J. Joshi, D. Tipper, "A Survey of Defense Mechanisms Against Distributed Denial of Service (DDoS) Flooding Attacks", IEEE Communications Surveys & Tutorials, vol. 15, issue 4, 2013.
2. C. Lima, M. Assis, C. Protásio, "An Empirical Investigation of Attribute Selection Techniques Based on Shannon, Rényi and Tsallis Entropies for Network Intrusion Detection", American Journal of Intelligent Systems, pp. 111-117, 2012.
3. K. Kira, L. A. Rendell, "The Feature Selection Problem: Traditional Methods and a New Algorithm", IEEE, pp. 129-134, 1998.
4. H. Almuallim, T. G. Dietterich, "Learning with Many Irrelevant Features", pp. 547-551, MIT Press, 1991.
5. I. H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition, 2005.
6. T. S. Chou, K. K. Yen, J. Luo, "Network Intrusion Detection Using Feature Selection of Soft Computing Paradigms", 2008.
7. J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1994.
8. Sanmay Das, "Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection", 2012.
9. Lei Yu, Huan Liu, "Efficient Feature Selection via Analysis of Relevance and Redundancy", Journal of Machine Learning Research, 2004.
10. S. Ganapathy, K. Kulothungan, S. Muthurajkumar, "Intelligent Feature Selection and Classification Techniques for Intrusion Detection in Networks: A Survey", Journal on Wireless Communications and Networking, 2013.
11. R. Agarwal, M. V. Joshi, "PNrule: A New Framework for Learning Classifier Models in Data Mining (A Case Study in Network Intrusion Detection)", Citeseer, 2000.
12. R. Beghdad, "Efficient Deterministic Method for Detecting New U2R Attacks", Computer Communications, vol. 32, pp. 1104-1110, 2009.
13. V. Marinova, "A Short Survey of Intrusion Detection Systems", Problems of Engineering Cybernetics and Robotics, 58:23-30, 2007.
14. M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, Wiley-IEEE Press, 2011.
15. D. J. Hand, H. Mannila, P. Smyth, Principles of Data Mining, The MIT Press, 2001.
16. D. Kim, H. N. Nguyen, S. Y. Ohn, J. Park, "Fusions of GA and SVM for Anomaly Detection in Intrusion Detection System", in Proc. of the 2nd International Symposium on Neural Networks (ISNN '05), Chongqing, China, LNCS vol. 3498, pp. 415-420, Springer-Verlag, May 2005.

THANK YOU
