
C4.5 based Sequential Attack Detection and Identification Model

Radhika Kumar, Anjali Sardana, R. C. Joshi
Information Security Laboratory
Department of Electronics and Computer Engineering
Indian Institute of Technology, Roorkee – 247667
{Anjlsfec, radhsdec, rcjosfec}@iitr.ernet.in
Introduction

CERT Statistics
 The Internet was designed for openness and functionality
 Failures can be accidental or intentional
 Examples:
  Denial of Service (DoS)
  Distributed Denial of Service (DDoS)
  Domain Name System attack
  IP Spoofing
  Sequence Number Hijacking

Figure: The number of total vulnerabilities catalogued from 1995 to 2006 (rising towards roughly 8,000 per year).
Figure: The number of Internet security incidents reported from 1988 to 2003 (labelled values include 82,094 and a peak of 153,140).
Service Denied to Legitimate Users

 Packets are dropped due to queue overflow
 Buffer overflow at the victim

Figure: Packets dropped under a DDoS attack. Legitimate and attack packets from edge routers in the stub domain converge through the transit domain onto the bottleneck link to the victim.
Motivation

Existing approaches to defend against attacks:
 Before the attack
  Prevention
 During the attack
  Detection and Characterization
 After the attack
  Response and Mitigation

All of these approaches suffer from various constraints.
Sequential Multi-Level Classification Model

 The objective is to find the natural hierarchy in the network traffic and to exploit the generic and differentiating characteristics of different attacks to build a more secure environment.
 A differential approach is used to detect one kind of attack at a time from the network traffic.
 A sequential model with different binary classifiers at each level, categorizing attacks in a step-by-step manner, is used.
 Rules are also generated at different levels of abstraction.
 The KDD99 dataset is used for evaluation.

Figure: Sequential binary-tree structure. Node 1 separates Class 1 and passes the remainder to Node 2; Node 2 separates Class 2 and passes the remainder to Node 3; Node 3 separates Class 3 from Class 4.
Mathematical Model

Traffic Feature Distribution

$X = \{n_i,\ i = 1, 2, 3, \ldots, N\}$

where $X$ is a random process in which flow $i$ occurs $n_i$ times.

Flow Id (i) | Number of Packets (n_i)
1           | n_1
2           | n_2
3           | n_3
:           | :
N           | n_N

$i$ is defined by one of the following traffic features in the packet header, or a combination of them:
 Source IP address
 Destination IP address
 Source Port
 Destination Port
 Layer 3 Protocol Type
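A minimal sketch of building this distribution from captured packets is given below. The packet-record field names and the choice of source IP address as the flow key are illustrative assumptions, not details from the paper.

```python
# Build X = {n_i} by counting packets per flow id, keying flows on one
# of the header fields listed above. Field names are illustrative.
from collections import Counter

def feature_distribution(packets, key=lambda p: p["src_ip"]):
    """Count n_i, the number of packets observed for each flow id i."""
    return Counter(key(p) for p in packets)

# Example: distribution over source IP addresses.
packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "proto": "tcp"},
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "proto": "tcp"},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.9", "proto": "udp"},
]
print(feature_distribution(packets))  # Counter({'10.0.0.1': 2, '10.0.0.2': 1})
```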
Mathematical Model: Basis of C4.5

1. Traffic Feature Measurement: Entropy

$H(X) = -\sum_{i=1}^{N} p_i \log_2 p_i$, where $p_i = \frac{n_i}{S}$ and $S = \sum_{i=1}^{N} n_i$

Entropy is a measure of the dispersal or concentration of a distribution. Its range is $(0,\ \log_2 N)$: $H(X) = 0$ if all observations are the same, and $H(X) = \log_2 N$ if $n_1 = n_2 = \ldots = n_N$.

2. Sampling

$\{X(t),\ t = j\Delta,\ j \in n\}$

 Δ is a constant time window
 n is the set of positive integers
 X(t) represents the number of packet arrivals for a flow in $\{t - \Delta,\ t\}$

Window | Flow 1  | Flow 2  | Flow 3  | ... | Flow N  | Entropy
Δ      | X(Δ,1)  | X(Δ,2)  | X(Δ,3)  | ... | X(Δ,N)  | H(Δ)
2Δ     | X(2Δ,1) | X(2Δ,2) | X(2Δ,3) | ... | X(2Δ,N) | H(2Δ)
3Δ     | X(3Δ,1) | X(3Δ,2) | X(3Δ,3) | ... | X(3Δ,N) | H(3Δ)
:      | :       | :       | :       |     | :       | :
nΔ     | X(nΔ,1) | X(nΔ,2) | X(nΔ,3) | ... | X(nΔ,N) | H(nΔ)

3. Traffic Feature Selection
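A minimal sketch of the per-window entropy computation, following the formula above. The function name and input format (a mapping from flow id to packet count, e.g. the Counter built earlier) are assumptions.

```python
# H(X) = -sum(p_i * log2(p_i)) with p_i = n_i / S for one time window.
import math

def window_entropy(counts):
    """Entropy in bits of a dict/Counter mapping flow id -> packet count."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values() if n > 0)

# All packets in one flow -> H(X) = 0; packets spread evenly over N
# flows -> H(X) = log2(N), matching the stated range.
print(window_entropy({"a": 10}))          # 0.0
print(window_entropy({"a": 5, "b": 5}))   # 1.0 = log2(2)
```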
C4.5: The Classification Algorithm

 The sequential nature of the proposed multi-level architecture needs binary classification at each level.
 C4.5 gives the highest overall accuracy as a single-level classifier compared to other single-level classifiers (Tavallaee, 2009; the classifiers were tested on the KDD99 dataset).
 C4.5 uses the concept of entropy to measure the impurity of data items, where $RF(C_j, S)$ is the relative frequency of class $C_j$ in the sample $S$:

$I(S) = -\sum_{j=1}^{x} RF(C_j, S) \log RF(C_j, S)$

 Information gain of a candidate test B that partitions S into subsets $S_1, \ldots, S_t$:

$G(S, B) = I(S) - \sum_{i=1}^{t} \frac{|S_i|}{|S|} I(S_i)$

 Split information, the denominator of the gain ratio:

$P(S, B) = -\sum_{i=1}^{t} \frac{|S_i|}{|S|} \log \frac{|S_i|}{|S|}$

 The test B that maximizes the gain ratio $G(S, B) / P(S, B)$ is then chosen as the current partitioning attribute. A sketch of this computation follows below.
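A minimal sketch of the gain-ratio criterion for a candidate categorical test, under the formulas above. The function and variable names are illustrative, not from the paper or from Quinlan's implementation.

```python
# Gain ratio G(S,B) / P(S,B) for the partition of a labelled sample S
# induced by a candidate categorical feature B.
import math
from collections import Counter

def entropy(labels):
    """I(S): impurity of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(labels, feature_values):
    total = len(labels)
    # Group labels by the value the candidate test assigns to each record.
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    # G(S,B) = I(S) - sum(|Si|/|S| * I(Si))
    gain = entropy(labels) - sum(
        len(part) / total * entropy(part) for part in partitions.values())
    # P(S,B) = -sum(|Si|/|S| * log2(|Si|/|S|))  (split information)
    split_info = -sum((len(part) / total) * math.log2(len(part) / total)
                      for part in partitions.values())
    return gain / split_info if split_info > 0 else 0.0

# Example: the 'flag' feature separates the classes perfectly here,
# so it scores a higher gain ratio than 'proto'.
labels = ["dos", "dos", "normal", "normal", "dos", "normal"]
proto  = ["tcp", "tcp", "udp", "udp", "tcp", "tcp"]
flag   = ["S0", "S0", "SF", "SF", "S0", "SF"]
print(gain_ratio(labels, proto), gain_ratio(labels, flag))
```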
KDD’99 Dataset

 KDD attacks fall into four main categories:
  DoS: Denial-of-Service attack
  Probe: Probing attack
  U2R: User-to-Root attack
  R2L: Remote-to-Local attack
 The KDD’99 dataset has 41 features, classified into three groups:
  Basic features
  Traffic features: time-based and host-based traffic features
  Content features
Sequential Classification

 Some observations:
  The DoS attack instances in the training data outnumber the combined Probe, U2R and R2L attacks, so DoS is the most common type of attack.
  The DoS attack is by nature characterized by time-based traffic features.
  The Probe attack is defined by host-based features.
  U2R and R2L attacks are detected by studying the content features of the data.
  Finally, they are all attacks, so they must have some common characteristics that distinguish them from normal traffic.

 The cascade therefore has four stages (a code sketch follows after this slide):
  First stage: separation of attack data from normal traffic on the basis of characteristics common to all attack traffic.
  Second stage: separation of the most common attack, the DoS attack, from the other three kinds using time-based features.
  Third stage: separation of Probe attacks from the other two kinds using host-based traffic features.
  Fourth stage: separation of U2R and R2L attacks using content features.

 Snapshot of the Level 4 classifier (trained on U2R and R2L attack data)
  Training results:
   Correctly Classified Instances: 98.9813%
   Incorrectly Classified Instances: 1.0187%
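A minimal sketch of the four-stage cascade under stated assumptions: scikit-learn's DecisionTreeClassifier stands in for C4.5 (the slides report Weka results), and the column slices for the time-based, host-based and content feature groups are placeholders for the corresponding KDD99 feature indices.

```python
# Four-level sequential classifier: each level is a binary tree trained
# only on the traffic that reaches it, using one feature group.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

TIME, HOST, CONTENT = slice(22, 31), slice(31, 41), slice(9, 22)  # assumed indices

def train_cascade(X, y):
    """X: numpy feature matrix; y: numpy array of labels
    'normal', 'dos', 'probe', 'u2r', 'r2l'."""
    m1 = DecisionTreeClassifier().fit(X, y != "normal")              # level 1
    atk = y != "normal"
    m2 = DecisionTreeClassifier().fit(X[atk][:, TIME], y[atk] == "dos")       # level 2
    rest = atk & (y != "dos")
    m3 = DecisionTreeClassifier().fit(X[rest][:, HOST], y[rest] == "probe")   # level 3
    last = rest & (y != "probe")
    m4 = DecisionTreeClassifier().fit(X[last][:, CONTENT], y[last] == "u2r")  # level 4
    return m1, m2, m3, m4

def classify(x, models):
    """Route one connection record through the cascade, top to bottom."""
    m1, m2, m3, m4 = models
    x = x.reshape(1, -1)
    if not m1.predict(x)[0]:
        return "normal"
    if m2.predict(x[:, TIME])[0]:
        return "dos"
    if m3.predict(x[:, HOST])[0]:
        return "probe"
    return "u2r" if m4.predict(x[:, CONTENT])[0] else "r2l"
```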
Evaluation Matrix

Actual Class | Classified as Normal | Classified as Attack
Normal       | True Negative (TN)   | False Positive (FP)
Attack       | False Negative (FN)  | True Positive (TP)

 Precision: the proportion of predicted positives/negatives that are actually positive/negative
  True alarm ratio: TP / (TP + FP)
  False alarm ratio: FP / (FP + TP)
 Recall: the proportion of actual positives/negatives that are predicted positive/negative
  Sensitivity, detection rate, alarm rate: TP / (TP + FN)
  False positive rate, false alarm rate: FP / (FP + TN)
  False negative rate: FN / (FN + TP)

These metrics are computed in the sketch below.
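A small helper computing the slide's metrics from the four confusion-matrix cells; the function name is illustrative. It is checked here against the Level 1 testing results reported on the next slide.

```python
# Metrics from the confusion matrix defined above.
def metrics(tn, fp, fn, tp):
    return {
        "accuracy":            (tp + tn) / (tp + tn + fp + fn),
        "true_alarm_ratio":    tp / (tp + fp),   # precision on attacks
        "false_alarm_ratio":   fp / (fp + tp),
        "detection_rate":      tp / (tp + fn),   # recall / sensitivity
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

# Level 1 confusion matrix from the testing results:
print(metrics(tn=60253, fp=340, fn=22684, tp=227752))
# accuracy comes out to about 0.9260, matching the reported 92.5975%.
```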
Testing Results

 Level 1 (all data): attack vs. normal

Actual Class | Classified as Normal | Classified as Attack
Normal       | 60253                | 340
Attack       | 22684                | 227752

Correctly Classified Instances: 92.5975%

 Level 2 (attack traffic): DoS attack vs. other attacks

Actual Class  | Normal | Dos Attack | Other Attacks
Normal        | 0      | 83         | 257
Dos Attack    | 0      | 222524     | 795
Other Attacks | 0      | 435        | 3998

Correctly Classified Instances: 99.3117%

 Level 3 (other attacks): Probe attack vs. others

Actual Class  | Normal | Dos | Probe | Others
Normal        | 0      | 0   | 253   | 7
Dos Attack    | 0      | 0   | 358   | 471
Probe Attack  | 0      | 0   | 3086  | 0
Other Attacks | 0      | 0   | 347   | 527

Correctly Classified Instances: 71.5587%

 Level 4: U2R vs. R2L

Actual Class | Normal | Dos | Probe | U2R | R2L
Normal       | 0      | 0   | 0     | 1   | 6
Dos          | 0      | 0   | 0     | 0   | 471
Probe        | 0      | 0   | 0     | 0   | 0
U2R          | 0      | 0   | 0     | 9   | 8
R2L          | 0      | 0   | 0     | 2   | 508

Correctly Classified Instances: 51.4428%
Improvements in Training Dataset

 KDD99 10% training dataset and testing dataset distribution:

Class  | Training Set | Testing Set
Normal | 19.69%       | 19.48%
Probe  | 0.83%        | 1.34%
Dos    | 79.24%       | 73.90%
U2R    | 0.01%        | 0.07%
R2L    | 0.23%        | 5.20%

 The 10% KDD99 training dataset has a huge number of similar records for the DoS attack and normal traffic compared to the Probe, U2R and R2L attacks.
 The Level 1 classifier therefore gets biased towards the normal class.
 Testing result: high false negative rate of 9.95%.

 Improvements (a duplication sketch follows below):
  New dataset: the U2R, R2L and Probe data was duplicated 5 times.
  The Level 1 classifier was trained using this new dataset.
  Testing results of the Level 1 classifier on the earlier test dataset:
   The attack detection rate increased from 90.942% to 92.2515%.
   The accuracy percentage increased from 92.5975% to 93.5974%.
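A minimal sketch of the minority-class duplication, assuming the KDD99 training records sit in a pandas DataFrame whose 'label' column has already been mapped to the five categories used here. "Duplicated 5 times" is read as each minority record appearing 5 times in total; adjust `copies` if the intent was 5 extra copies.

```python
# Oversample the rare attack classes by plain record duplication.
import pandas as pd

def duplicate_minority(df, classes=("probe", "u2r", "r2l"), copies=5):
    """Repeat every record of the listed classes so each appears `copies` times."""
    minority = df[df["label"].isin(classes)]
    # Original rows plus (copies - 1) extra repetitions of the minority rows.
    return pd.concat([df] + [minority] * (copies - 1), ignore_index=True)
```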


Results after Improvements in Training Data

 Confusion matrix of the Level 1 classifier after data duplication:

Actual Class | Classified as Normal | Classified as Attack
Normal       | 60099                | 494
Attack       | 19405                | 231031

Correctly Classified Instances: 93.5974%

 Misuse and anomaly detection rates of the Level 1 classifier before and after data duplication:

True Positives                                               | Known Attacks      | New Attacks
In test dataset                                              | 220,525            | 29,911
Detected by Level 1 classifier (trained on original dataset) | 219,827 (99.6835%) | 7,905 (26.4618%)
Detected by Level 1 classifier (trained on new dataset)      | 220,525 (99.9832%) | 10,543 (35.2479%)

 The data duplication improved the misuse and anomaly detection rates from 99.6835% and 26.4618% to 99.9832% and 35.2479%, respectively.
Descriptive Modeling

 The advantage of the multi-level sequential approach is that we get small and easily interpretable trees.
 Rules can be derived from these decision trees at different levels of abstraction.
 These rules are in terms of the 41 features of the KDD dataset.
 E.g., a rule derived from the second classifier (expressed as code below):

If (% of connections to different services for the same host over the last 1000 connections < 0.1 and
    % of connections to different hosts for the same service over the past 1000 connections < 0.01 and
    number of connections to the same host in the past two seconds > 2)
=> DoS Attack
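The same rule written as a predicate. The feature names below are the conventional KDD99 names that appear to match the rule's descriptions (note the standard host-based KDD99 features are computed over the last 100 connections, while the slide says 1000), so treat the mapping as an assumption.

```python
# Rule from the level-2 tree as a predicate over one connection record,
# assuming conventional KDD99 feature names (an assumption, see above):
#   dst_host_diff_srv_rate     - % of connections to different services,
#                                same destination host
#   dst_host_srv_diff_host_rate - % of connections to different hosts,
#                                same service
#   count                      - connections to the same host in the
#                                past two seconds
def looks_like_dos(conn):
    return (conn["dst_host_diff_srv_rate"] < 0.1
            and conn["dst_host_srv_diff_host_rate"] < 0.01
            and conn["count"] > 2)
```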
Conclusion

 The model has a low false alarm ratio of 0.15%.
 Individual attack detection rates of 99.644% for DoS and 100% for Probe are achievable.
 The percentage accuracy for classification between U2R and R2L is as high as 98.1024%.
 The new dataset gives better results:
  Misuse detection rate of 99.9832% and anomaly detection rate of 35.2479%
 The generated trees are small, and rules are easy to derive from them at different levels of abstraction.
References
[1] S. Axelsson, "The Base-Rate Fallacy and the Difficulty of Intrusion Detection," ACM Transactions on Information and System Security, 2000.
[2] V. Corey et al., "Network Forensics Analysis," IEEE Internet Computing, vol. 6, no. 6, 2002, pp. 60–66.
[3] R. J. Henery, "Classification," in Machine Learning, Neural and Statistical Classification, D. Michie, D. J. Spiegelhalter, and C. C. Taylor (Eds.), Ellis Horwood, New York, 1994.
[4] E. Bloedorn, L. Talbot, C. Skorupka, A. Christiansen, W. Hill, and J. Tivel, "Data Mining Applied to Intrusion Detection: MITRE Experiences," in Proc. IEEE International Conference on Data Mining, 2001.
[5] Y. Ma, D. Choi, and S. Ata, Eds., Application of Data Mining to Network Intrusion Detection: Classifier Selection Model, ser. Lecture Notes in Computer Science, vol. 5297, Springer-Verlag, Berlin Heidelberg, Germany, 2008.
[6] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, "A Detailed Analysis of the KDD CUP 99 Data Set," in Proc. IEEE Symposium CISDA'09, 2009.
[7] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, California, 1993.
[8] Weka – Data Mining Machine Learning Software. [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/
[9] KDD Cup 1999 Data. [Online]. Available: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
[10] M. Sabhnani and G. Serpen, "Why Machine Learning Algorithms Fail in Misuse Detection on KDD Intrusion Detection Dataset," Intelligent Data Analysis, vol. 6, June 2004.
[11] K. Kendall, "A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems," M.Eng. Thesis, Massachusetts Institute of Technology, Massachusetts, United States.
Thank You
