
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)

Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 2, Issue 5, September October 2013 ISSN 2278-6856

Network Traffic Intrusion Detection System using Decision Tree & K-Means Clustering Algorithm
Mrs. Ghatge Dipali D.
Assistant Professor, Computer Engineering Department, Karmaveer Bhaurao Patil College of Engg. & Poly.,Satara

Abstract- In a world of computer networks and rapidly advancing technologies, network security is a crucial topic, as network attacks have increased over the past few years. The Intrusion Detection System (IDS) has therefore become an important component for securing the network. Because data mining techniques make it possible to search large amounts of data for characteristics, rules and patterns, they can be applied to network monitoring data recorded on a host or in a network to detect intrusions and attacks. This paper introduces different data mining techniques. Furthermore, I present an intrusion detection scheme based on the K-means clustering algorithm. I use the DARPA 98 Lincoln Laboratory evaluation dataset as the training and testing data set. Training data containing unlabeled flow records are separated into clusters of normal and anomalous traffic. The corresponding cluster centroids are used for efficient distance-based detection of anomalies. I provide a detailed description of the data mining and anomaly detection process and present the experimental results.

of decision tree and K-means clustering algorithm for intrusion detection.

A) IDS using Decision tree

The decision tree is a powerful data mining method. In a decision tree, each leaf node represents a class of data. A decision tree helps us to categorize the data from a large dataset.

a) DARPA98 Dataset: The DARPA set was defined by the Information Systems Technology Group of MIT Lincoln Laboratory. It provides data sets for both training and testing. All attacks in the DARPA sets can be categorized into four classes: Denial of Service (DoS), Remote to Local (R2L), User to Root (U2R) and Scan.

b) Process to make a decision tree using the DARPA data set:

Keywords: IDS, K-means, DARPA, Data Mining, KDD.

1. INTRODUCTION
Intrusion detection is the process of examining and evaluating the events occurring in a computer system in order to detect signs of security problems. Data mining techniques are very attractive because they can be applied to any kind of data in order to learn more about hidden structures and correlations [1]. Applying data mining methods to monitoring data recorded from computer networks is a promising approach to intrusion detection. An intrusion detection system (IDS) using data mining can be termed network data mining. Section 2 introduces the different data mining techniques used for intrusion detection. In Section 3 I present the details of the K-means clustering algorithm used for intrusion detection. I use the DARPA 98 Lincoln Laboratory Evaluation data set (DARPA set) as training data as well as testing data. The KDD 99 intrusion detection data set is also based on the DARPA set. In Section 4 I present some initial experimental results of ongoing work, and Section 5 concludes the paper with an outlook on future work.

1. Classification of DARPA set: In the training set, 4 types of attacks are considered. We have to extract the TCP dump data for each attack in the whole DARPA training set. The TCP dump list contains the information that identifies each flow and indicates whether the flow is an attack or not.
2. Preprocessing: Preprocessing summarizes the information from the TCP dump files. It processes the raw packet data to make the information meaningful.
3. ID3 algorithm: The data obtained after preprocessing is given as input to the ID3 algorithm. ID3 takes a greedy approach to placing features in the decision tree; that is, it chooses features from the learning dataset according to the correlation between the features and the class [2].
4. Decision tree generation: The decision tree is generated using the features selected by the ID3 algorithm.

B) K-means clustering algorithm

The K-means clustering algorithm is another powerful data mining algorithm. In the next section, I give the details of the K-means algorithm, the raw data and the extracted features of the traffic. This raw data and features

2. DATA MINING TECHNIQUES


Various data mining techniques are used for intrusion detection. In this section I give the details

are to be given as input to the K-means clustering algorithm. LibSVM has two major components: a training program and a prediction program. The training program examines the input data and attempts to build a support vector model. The second component then uses this model to classify, that is, to predict the class label of, the testing data. Both programs require the data to be in a specific format, shown below.

<label> <index1>:<value1> ... <indexN>:<valueN>
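The input format described above can be illustrated with a small sketch. It is not part of the original implementation: the function name `to_libsvm_line` and the label encoding (+1 for normal, -1 for anomalous) are assumptions made only for illustration. The sketch relies on two conventions of the LibSVM text format: feature indices start at 1 and appear in ascending order, and zero-valued features may be left out.

```python
def to_libsvm_line(label, features):
    """Format one record as '<label> <index>:<value> ...' (1-based indices)."""
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        if value != 0:  # sparse format: zero-valued features may be omitted
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


# Hypothetical record: label +1 (assumed to mean "normal") with three features.
print(to_libsvm_line(1, [0.5, 0.0, 3.0]))  # prints "1 1:0.5 3:3"
```

Each flow record of the training or testing file would be written as one such line.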

3. IDS using K-means Clustering Algorithm.


To implement the K-means clustering algorithm for intrusion detection, training data containing flow records of both normal and anomalous traffic are transformed into feature datasets. The datasets are divided into different clusters for normal and anomalous traffic using the K-means clustering algorithm. The resulting cluster centroids are deployed for the fast detection of anomalies in new monitoring data based on simple distance calculations [1].

a) Raw data & extracted features: Flow records, which are available in many networks, are used as input to the data mining process. Flow records contain IP information and statistical information such as the number of packets and bytes observed in a certain period of time. The K-means algorithm applied to this dataset is explained in the next subsection.

b) K-means clustering: K-means clustering [3] is a clustering analysis algorithm that groups objects based on their feature values into K disjoint clusters. The steps followed in K-means clustering are as follows:
1. Define the number of clusters K.
2. Initialize the K cluster centroids.
3. Iterate over all objects and compute the distances to the centroids of all the clusters. Assign each object to the cluster with the nearest centroid.
4. Recalculate the centroids of all the modified clusters.
5. Repeat steps 3 and 4 until the centroids no longer change.

The distance function used is the Euclidean distance, $d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$.

4. EXPERIMENTAL RESULTS:
For the experiment, the corrected 10% KDD data set was used.

Figure 1: 10% KDD data

Part of the above data is used as training data and the remainder as testing data.

Figure 2: Training & testing data

The file that contains the details of the 42 extracted features is provided as input to the IDS unit.

Here, $x = (x_1, \ldots, x_m)$ and $y = (y_1, \ldots, y_m)$ are the two input vectors of the distance function, each with m quantitative features. I apply the K-means clustering algorithm to a training dataset that contains normal as well as anomalous traffic. I assume that normal and anomalous traffic form different clusters.

I have also used SVMs (Support Vector Machines) [4], which are a useful technique for data classification. A classification task usually involves separating data into training and testing sets. Each instance in the training set contains one "target value" (i.e. the class label) and several "attributes" (i.e. the features or observed variables). The goal of an SVM is to produce a model (based on the training data) which predicts the target values of the test data given only the test data attributes. An SVM requires that each data instance be represented as a vector of real numbers. It tries to find hyperplanes that separate data points into their respective classes. The better the separation achieved, the better the data classification that is ultimately possible.
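The K-means steps of Section 3 and the Euclidean distance between two feature vectors can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the names `euclidean` and `kmeans` are hypothetical, the centroids are initialized with the first K training points (the text does not prescribe an initialization method), and the sample points are invented.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def kmeans(points, k, max_iter=100):
    """Plain K-means; returns (centroids, assignments)."""
    # Step 2: initialize the centroids with the first k points (an assumption).
    centroids = [list(p) for p in points[:k]]
    assignments = [None] * len(points)
    for _ in range(max_iter):
        # Step 3: assign every object to the cluster with the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # Step 5: no assignment changed, so the centroids are stable.
        assignments = new_assignments
        # Step 4: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Invented two-feature records forming two well-separated groups.
points = [[1.0, 1.0], [8.0, 8.0], [1.2, 0.8], [8.2, 7.9], [0.9, 1.1], [7.9, 8.1]]
centroids, assignments = kmeans(points, k=2)
print(assignments)  # prints [0, 1, 0, 1, 0, 1]
```

K-means itself does not say which cluster is normal and which is anomalous; that interpretation has to be made afterwards, for example from the cluster sizes or from known traffic.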

Figure 3: Providing extracted features to the IDS

The extracted features are compared with the training dataset and, with the help of the K-means clustering algorithm and the SVM Java library, are assigned to the normal group or the anomalous group. The resulting categorization is shown as a pie chart to give a pictorial representation of the result.
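The grouping step can be sketched with the centroid-based distance check described in Section 3. This is only an illustration under assumptions: it uses the cluster centroids alone and leaves out the SVM step, the names `nearest_centroid` and `traffic_shares` are hypothetical, and the centroids and records are invented. The returned percentages correspond to the slices a pie chart would show.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest_centroid(record, centroids):
    """Return the label of the centroid closest to the record."""
    return min(centroids, key=lambda label: euclidean(record, centroids[label]))


def traffic_shares(records, centroids):
    """Percentage of records assigned to each label (pie-chart input)."""
    counts = {label: 0 for label in centroids}
    for record in records:
        counts[nearest_centroid(record, centroids)] += 1
    total = len(records)
    if total == 0:
        return {label: 0.0 for label in centroids}
    return {label: 100.0 * n / total for label, n in counts.items()}


# Invented centroids and test records with two features each.
centroids = {"normal": [1.0, 1.0], "anomalous": [8.0, 8.0]}
records = [[1.1, 0.9], [0.8, 1.2], [7.5, 8.4], [1.0, 1.3]]
print(traffic_shares(records, centroids))
# prints {'normal': 75.0, 'anomalous': 25.0}
```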

5. CONCLUSION
In this paper, I have implemented the K-means clustering algorithm using the SVM (Support Vector Machines) Java library to generate the anomaly detection report. The KDD data set, which is based on the DARPA dataset, is used as the training data. The proposed model produces a pictorial view of the percentage of anomalous traffic present in the packet flow. I have described the process of generating a pie chart showing the classification of normal and anomalous traffic in the network. In future, I plan to provide the extracted packet features to the IDS at run time instead of manually supplying the file that contains the features. In addition, a log of the anomalous packets can be generated, which the network administrator could view at any time.

ACKNOWLEDGMENT
This work was done under the valuable guidance of Prof. G.A.Patil, Asst. Prof. and Head of the Computer Department, D.Y.Patil College of Engg. & Technology, Kasba Bawada, Kolhapur. I thank Prof. G.A.Patil for his valuable comments and discussions on this work.

REFERENCES:
[1] Gerhard Munz, Sa Li, Georg Carle, Traffic Anomaly Detection Using K-Means Clustering, Computer Networks and Internet, Wilhelm Schickard Institute for Computer Science, University of Tuebingen, Germany.
[2] Joong-hee Lee, Jong-hyouk Lee, Seon-gyoung Sohn, Jong-ho Ryu, Tai-myoung Chung, Effective Value of Decision Tree with KDD 99 Intrusion Detection Datasets for Intrusion Detection System, in 10th International Conference on Advanced Communication Technology, 2008.
[3] J. MacQueen, Some methods for classification and analysis of multivariate observations, in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press.
[4] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, A Practical Guide to Support Vector C

AUTHOR

Prof. Mrs. Dipali Dayanand Ghatge received the B.E. degree in Information Technology and the M.E. degree in Computer Science and Engineering from Shivaji University, Kolhapur. She has been working at Karmaveer Bhaurao Patil College of Engineering and Polytechnic, Satara, since 2005.
