Improved Intrusion Detection Applying Feature Selection Using Rank & Score of Attributes in KDD-99 Data Set
Index
1. Introduction to IDS
2. Feature Selection
3. Literature Survey
4. Proposed Work
5. Results and Discussions
6. Conclusion and Future Work
7. References
1: Introduction to IDS
Intrusions are activities that violate the security policy of a
system.
Intrusion detection is the process used to identify intrusions.
IDSs are known as powerful tools for detecting, rejecting, and
deterring malicious attacks across the network.
Host-based IDSs
Distributed IDSs
Network-based IDSs
Host-based IDSs
Get audit data from host audit trails.
Detect attacks against a single host
Distributed IDSs
Gather audit data from multiple hosts and possibly the network
that connects the hosts.
Detect attacks involving multiple hosts
Network-Based IDSs
Use network traffic as the audit data source, relieving the
burden on the hosts that usually provide normal computing
services.
2: Feature Selection
Feature selection is used to find only the important features,
or feature set, out of all the features of the audit data.
It reduces the dimensionality of the data by eliminating
features that are noisy, redundant, or irrelevant for a
classification problem.
It can make IDSs lightweight and improve classification
performance.
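The ranking idea behind feature selection can be sketched in plain Python: score each discrete feature by how much it reduces label entropy (information gain) and keep the top-scoring ones. This is an illustrative toy with made-up records, not the implementation used in this work.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Entropy reduction obtained by splitting on one discrete feature."""
    total = entropy(labels)
    n = len(rows)
    # Partition the labels by the feature's value.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature_index], []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return total - remainder

# Toy audit records: (protocol, flag), labelled normal/anomaly.
rows = [("tcp", "SF"), ("tcp", "S0"), ("udp", "SF"), ("udp", "S0")]
labels = ["normal", "anomaly", "normal", "anomaly"]

# Rank feature indices by information gain, highest first.
ranking = sorted(range(2), key=lambda i: information_gain(rows, labels, i),
                 reverse=True)
```

Here the `flag` feature perfectly predicts the label, so it ranks first; `protocol` carries no information about the label and ranks last.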
3: Literature Survey
Pattern Matching
Measure Based method
Machine Learning Method
Data Mining
Data mining can help improve intrusion detection in the
following ways:
Remove normal activity from alarm data to allow analysts
to focus on real attacks.
Identify false alarm generators and bad sensor
signatures.
Find anomalous activity that uncovers a real attack.
Identify long, ongoing patterns (different IP address,
same activity).
3: Literature Survey (contd.)
4. K-Nearest Neighbor: K-Nearest Neighbor (k-NN) is an
instance-based learning method for classifying objects
based on the closest training examples in the feature
space.
5. Support Vector Machine: Support Vector Machines have
been proposed as a novel technique for intrusion
detection.
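To make the k-NN idea concrete, a minimal classifier that labels a query point by majority vote among its k nearest training points can be sketched as follows. The points and labels are invented for illustration; this is not the WEKA implementation used later in the experiments.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D feature vectors (e.g. two scaled traffic attributes).
train_points = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
train_labels = ["normal", "normal", "anomaly", "anomaly"]

label = knn_classify(train_points, train_labels, (0.15, 0.15), k=3)
```

A query close to the "normal" cluster gets two normal votes out of its three nearest neighbours, so it is classified as normal.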
4: Proposed Work
Problem Statement
As a large amount of data flows over the network, real-time
intrusion detection is not feasible due to computational
and resource limitations.
For intrusion detection analysis, the KDD99 dataset needs to be
analysed and classification needs to be performed.
The KDD99 dataset involves 42 features for analysing the
algorithms. This number of features is large and costs more
time; it can be reduced and the accuracy can be increased.
Here accuracy means the fraction of instances correctly
classified by an algorithm among the total instances.
Proposed Work
The main objectives of Feature Selection in reducing the
features of the dataset are given below.
To reduce the size and attributes of the KDD99 dataset
using feature selection algorithm.
To minimize the time for intrusion detection process.
To increase the accuracy of the algorithm used for Intrusion
Detection Systems.
Experimental Methodology
In our experimental work we analysed the intrusion dataset in
the following steps.
A: Selection of the training and testing dataset:
NO  FEATURE                NO  FEATURE
1   duration               22  is_guest_login
2   protocol_type          23  count
3   service                24  srv_count
4   flag                   25  serror_rate
5   src_bytes              26  srv_serror_rate
6   dst_bytes              27  rerror_rate
7   land                   28  srv_rerror_rate
8   wrong_fragment         29  same_srv_rate
9   urgent                 30  diff_srv_rate
10  hot                    31  srv_diff_host_rate
11  num_failed_logins      32  dst_host_count
12  logged_in              33  dst_host_srv_count
13  num_compromised        34  dst_host_same_srv_rate
14  root_shell             35  dst_host_diff_srv_rate
15  su_attempted           36  dst_host_same_src_port_rate
16  num_root               37  dst_host_srv_diff_host_rate
17  num_file_creations     38  dst_host_serror_rate
18  num_shells             39  dst_host_srv_serror_rate
19  num_access_files       40  dst_host_rerror_rate
20  num_outbound_cmds      41  dst_host_srv_rerror_rate
21  is_host_login          42  class
B: WEKA tool:
WEKA contains algorithms and tools for preprocessing, regression,
classification, clustering, association rules, and visualization.
WEKA consists of the Explorer, Experimenter, Knowledge Flow,
Simple Command Line Interface, and a Java interface.
C: Confusion matrix:
The confusion matrix allows us to measure the performance of an
algorithm.
In our experiment each algorithm produced a confusion matrix in
its output, which is used for the accuracy calculation.
The basic structure of the confusion matrix is as follows:

    classified as
    a       b
 9282     429  |  a = normal
  527   12306  |  b = anomaly
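Accuracy follows directly from such a matrix: the diagonal cells are the correctly classified instances, and accuracy is their share of all instances. A short sketch using the example numbers above:

```python
# Confusion matrix from the slide: rows = actual class, columns = predicted.
#              pred normal   pred anomaly
# normal            9282            429
# anomaly            527          12306
matrix = [[9282, 429],
          [527, 12306]]

correct = matrix[0][0] + matrix[1][1]      # diagonal: correctly classified
total = sum(sum(row) for row in matrix)    # all classified instances
accuracy = correct / total

print(round(accuracy, 4))  # → 0.9576
```

So this example matrix corresponds to 21588 correct out of 22544 instances, i.e. about 95.76% accuracy.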
Attribute Rank Evaluator
Attribute Score Evaluator
Attribute Subset
5: RESULTS AND DISCUSSIONS
Method          WEKA implementation
SVM             SMO
ANN             MultilayerPerceptron
KNN             lazy.IBk
Naive Bayes     bayes.NaiveBayes
Decision Tree   trees.J48
Parameter                          SVM      ANN      KNN      NB       DT
Total number of instances          22544    22544    22544    22544    22544
Correctly classified instances     21342    21588    22490    18205    22394
Incorrectly classified instances   1202     956      54       4339     150
Kappa statistic                    0.891    0.9136   0.9951   0.6234   0.9864
Mean absolute error                0.0533   0.0545   0.0024   0.1923   0.0105
Root mean square error             0.2309   0.197    0.0346   0.4369   0.0725
Relative absolute error %          10.8721  11.107   0.4974   39.213   2.142
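The Kappa statistic in the results can be reproduced from a confusion matrix: it compares the observed agreement with the agreement expected by chance from the row and column marginals. As a check (an illustrative sketch, not the WEKA code), the ANN confusion matrix reported in the results gives roughly the 0.9136 listed for ANN:

```python
def kappa(matrix):
    """Cohen's kappa from a square confusion matrix (rows = actual, cols = predicted)."""
    n = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    # Chance agreement: product of row and column marginals per class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix)
        for i in range(len(matrix))
    ) / (n * n)
    return (observed - expected) / (1 - expected)

# ANN confusion matrix from the experiments.
k = kappa([[9282, 429], [527, 12306]])
print(round(k, 4))  # → 0.9136
```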
Confusion Matrix
(rows = actual class, columns = classified as: a = normal, b = anomaly)

Naive Bayes:
 9225     486  |  a = normal
 3858    8975  |  b = anomaly

ANN:
 9282     429  |  a = normal
  527   12306  |  b = anomaly

SVM:
 9000     711  |  a = normal
  491   12342  |  b = anomaly

KNN:
 9711       0  |  a = normal
   54   12779  |  b = anomaly

Decision Tree:
 9643      82  |  a = normal
   68   12751  |  b = anomaly
Algorithm Accuracy
The accuracy parameter is calculated for all the algorithms using
the confusion matrix.
The table below shows the running time and classification accuracy
of the various algorithms.
It is clear from the table that most of the algorithms fall short
of 100% accuracy.
Algorithm/Model   (TP+TN)/Population     Accuracy   Time (s)
KNN               (9711+12779)/22544     0.997      1.2
Naive Bayes       (9222+8983)/22544      0.807      4.2
Decision Tree     (9643+12751)/22544     0.993      2.39
SVM               (9000+12342)/22544     0.946      2.58
ANN               (9282+12306)/22544     0.957      3.4
Feature Selection
Using Eclipse we implemented the algorithm to select the
features.
Attribute Subset:
[0, 32, 33, 38, 39, 36, 37, 8, 9, 10, 21, 23, 22, 25, 24, 27, 26,
29, 28, 31]
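Applying the selected subset amounts to projecting each 42-field KDD record onto the chosen attribute indices. A minimal sketch (the `project` helper is illustrative, not part of the implementation described here):

```python
# Attribute indices chosen by the feature selection step (from the slide).
SELECTED = [0, 32, 33, 38, 39, 36, 37, 8, 9, 10, 21, 23, 22, 25, 24,
            27, 26, 29, 28, 31]

def project(record, indices=SELECTED):
    """Keep only the selected attributes of one 42-field KDD record."""
    return [record[i] for i in indices]

# A dummy 42-field record stands in for a real KDD-99 row.
record = list(range(42))
reduced = project(record)

print(len(reduced))  # → 20
```

The classifiers are then trained and tested on these 20-attribute records instead of the full 42-attribute ones.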
Results with the selected feature subset (WEKA output):

Correctly classified instances     99.5507 %
Incorrectly classified instances    0.4493 %
Kappa statistic                     0.991
Mean absolute error                 0.0061
Relative absolute error             1.2214 %
Root relative squared error        12.9175 %

Detailed accuracy by class:
Class          TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
normal         0.997    0.007    0.994      0.997   0.996      0.999
anomaly        0.993    0.003    0.997      0.993   0.995      0.999
Weighted Avg.  0.996    0.005    0.996      0.996   0.996      0.999
Parameter                          Before FS   After FS
Correctly classified instances     22390       22445
Incorrectly classified instances   154         99
Kappa statistic                    0.9858      0.991
Mean absolute error                0.0113      0.0061
Root mean square error             0.0811      0.0644
Relative absolute error %          2.421       1.2214
Time (seconds)                     1.2         0.04
[Bar charts comparing correctly and incorrectly classified
instances, Kappa statistic, and error rates before and after
feature selection.]
6: CONCLUSION AND FUTURE WORK
Conclusion
We focused on feature selection for intrusion detection.
As an initial step we analysed the KDD Cup dataset in WEKA and
calculated the accuracy of the algorithms.
We then observed that with the full-feature dataset the accuracy
is not high and the classification (running) time is high.
We selected the algorithm that gave the best accuracy.
Conclusion (contd.)
We used JAVA-ML libraries for feature selection with
different parameters.
We applied the feature selection methods and tested the
selected features with the same algorithms. Results show
that for large datasets it is better to run the intrusion
detection algorithm with minimal features, so that intrusions
can be classified correctly and in time.
With feature selection or filtering methods the accuracy
increases and the algorithmic running time is also reduced.
FUTURE WORK
In future work we will implement rough and soft set
theories for feature selection, which are an extension of
data mining feature selection methods.
Soft set theory is a general method for solving problems of
uncertainty. Soft sets are a powerful tool for decision
making about information systems, for data mining, and for
drawing conclusions from data.
References
1. S. T. Zargar, J. Joshi, D. Tipper, "A Survey of Defense Mechanisms Against Distributed Denial of Service (DDoS) Flooding Attacks," IEEE Communication Surveys & Tutorials, vol. 15, no. 4, 2013.
2. C. Lima, M. Assis, C. Protsio, "An Empirical Investigation of Attribute Selection Techniques Based on Shannon, Renyi and Tsallis Entropies for Network Intrusion Detection," American Journal of Intelligent Systems, pp. 111-117, 2012.
3. K. Kira, L. A. Rendell, "The Feature Selection Problem: Traditional Methods and a New Algorithm," IEEE, pp. 129-134, 1998.
4. H. Almuallim, T. G. Dietterich, "Learning with Many Irrelevant Features," pp. 547-551, MIT Press, 1991.
5. I. H. Witten, E. Frank, Practical Machine Learning Tools and Techniques, 2nd edition, 2005.
6. T. S. Chou, K. K. Yen, J. Luo, "Network Intrusion Detection Using Feature Selection of Soft Computing Paradigms," 2008.
7. J. R. Quinlan, Programs for Machine Learning, Morgan Kaufmann, 1994.
8. S. Das, "Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection," 2012.
9. L. Yu, H. Liu, "Efficient Feature Selection via Analysis of Relevance and Redundancy," Journal of Machine Learning Research, 2004.
10. S. Ganapathy, K. Kulothungan, S. Muthurajkumar, "Intelligent Feature Selection and Classification Techniques for Intrusion Detection in Networks: A Survey," Journal on Wireless Communications and Networking, 2013.
11. R. Agarwal, M. V. Joshi, "PNrule: A New Framework for Learning Classifier Models in Data Mining (A Case-Study in Network Intrusion Detection)," Citeseer, 2000.
12. R. Beghdad, "Efficient Deterministic Method for Detecting New U2R Attacks," Computer Communications, vol. 32, pp. 1104-1110, 2009.
13. V. Marinova, "A Short Survey of Intrusion Detection Systems," Problems of Engineering Cybernetics and Robotics, 58:23-30, 2007.
14. M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, Wiley-IEEE Press, 2011.
15. D. J. Hand, H. Mannila, P. Smyth, Principles of Data Mining, The MIT Press, 2001.
16. D. Kim, H. N. Nguyen, S. Y. Ohn, J. Park, "Fusions of GA and SVM for Anomaly Detection in Intrusion Detection System," in Proc. of the 2nd International Symposium on Neural Networks (ISNN'05), Chongqing, China, LNCS vol. 3498, pp. 415-420, Springer-Verlag, May 2005.
THANK YOU