Network Traffic Intrusion Detection System Using Decision Tree & K-Means Clustering Algorithm
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com Volume 2, Issue 5, September October 2013 ISSN 2278-6856
Mrs. Ghatge Dipali D.
Assistant Professor, Computer Engineering Department, Karmaveer Bhaurao Patil College of Engg. & Poly., Satara
1. INTRODUCTION
Intrusion detection is the process of examining and evaluating the events occurring in a computer system in order to detect signs of security problems. Data mining techniques are attractive because they can be applied to any kind of data to learn more about hidden structures and correlations [1]. Applying data mining methods to monitoring data recorded from computer networks is a promising approach to intrusion detection. An intrusion detection system (IDS) that uses data mining can be termed network data mining. Section II introduces the data mining techniques used for intrusion detection. Section III presents the details of the K-means clustering algorithm used for intrusion detection. I use the DARPA 98 Lincoln Laboratory Evaluation data set (DARPA set) as training data as well as testing data. The KDD 99 intrusion detection data set is also based on the DARPA set. Section IV presents some initial experimental results of ongoing work, and Section V concludes the paper with a stance on future work.
of decision tree and K-means clustering algorithm for intrusion detection.
A) IDS using Decision tree
A decision tree is one of the powerful data mining methods. In a decision tree, the leaf nodes represent the class of the data. A decision tree helps us categorize data from a large dataset.
a) DARPA98 Dataset: The DARPA set was defined by the Information Systems Technology Group of MIT Lincoln Laboratory. It provides data sets for both training and testing. All attacks in the DARPA sets can be categorized into four classes: Denial of Service (DoS), Remote to Local (R2L), User to Root (U2R) and Scan.
b) Process to make decision tree using DARPA data set:
1. Classification of DARPA set: In the training set, 4 types of attacks are considered. The TCP dump data for each attack has to be extracted from the whole DARPA training set. The TCP dump list contains the information that identifies each flow and indicates whether the flow is an attack or not.
2. Preprocessing: Preprocessing summarizes the information from the TCP dump files. It transforms the raw packet data to make the information meaningful.
3. ID3 algorithm: The data obtained after preprocessing is given as input to the ID3 algorithm. ID3 takes a greedy approach to placing features in the decision tree; that is, it chooses features from the learning dataset according to the correlation between the features and the class [2].
4. Decision tree generation: The decision tree is generated using the features located by the ID3 algorithm.
B) K-means clustering algorithm
The K-means clustering algorithm is another powerful data mining algorithm. In the next section, I have included the details of the K-means algorithm, the raw data and the extracted features of traffic. This raw data and features
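The greedy feature choice made by ID3 in the decision tree process can be illustrated with a short sketch. ID3 measures the relationship between a feature and the class with information gain, so the sketch computes that quantity and picks the feature with the largest value. The feature names (`protocol`, `flag`), the four sample flows and the function names are hypothetical and are not taken from the paper's preprocessing output.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def best_feature(rows, labels):
    """Greedy ID3 step: pick the feature with the highest information gain."""
    return max(rows[0].keys(), key=lambda f: information_gain(rows, labels, f))

# Hypothetical preprocessed flows; feature names and values are made up.
rows = [
    {"protocol": "tcp", "flag": "S0"},
    {"protocol": "tcp", "flag": "SF"},
    {"protocol": "udp", "flag": "SF"},
    {"protocol": "tcp", "flag": "S0"},
]
labels = ["attack", "normal", "normal", "attack"]
chosen = best_feature(rows, labels)
```

In this toy data the `flag` feature separates the two classes completely, so it would be placed at the root; a full ID3 implementation would then repeat the same step on each subset of rows.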
4. EXPERIMENTAL RESULTS:
For the experiment, the corrected 10% KDD data has been used.
Figure 1 10% KDD data
From the data considered above, one part has been used as training data and the remainder as testing data.
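A minimal sketch of such a split is given below. The paper does not state the proportion used, so the 75/25 ratio, the fixed seed and the function name `split_records` are assumptions made only for illustration.

```python
import random

def split_records(records, train_fraction=0.75, seed=0):
    """Shuffle a copy of the records and split it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder records standing in for lines of the KDD file.
records = [f"record-{i}" for i in range(8)]
train, test = split_records(records)
```

Shuffling before cutting avoids a split in which one part contains only the traffic types that happen to appear first in the file.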
Figure 2 Training & testing data
The file that contains the details of the 42 extracted features is provided as input to the IDS unit.
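Reading one record of such a file might look like the sketch below. It assumes the comma-separated layout of the KDD Cup 99 files, in which the 42 fields are 41 feature values followed by a class label that ends with a period. The sample record and the function names are made up for illustration.

```python
def parse_record(line):
    """Split one comma-separated record into 41 feature fields and a class label."""
    fields = line.strip().split(",")
    if len(fields) != 42:
        raise ValueError(f"expected 42 fields, got {len(fields)}")
    features = fields[:41]
    label = fields[41].rstrip(".")  # assumed: labels end with a period, as in the KDD files
    return features, label

def is_anomalous(label):
    """Treat every label other than 'normal' as anomalous traffic."""
    return label != "normal"

# Made-up record: 41 placeholder feature values followed by a label.
sample = ",".join(["0", "tcp", "http", "SF"] + ["0"] * 37) + ",smurf."
features, label = parse_record(sample)
```

The symbolic fields (such as the protocol) would still have to be mapped to numbers before the record can be used with a distance-based method.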
where x = (x1, ..., xm) and y = (y1, ..., ym) are the two input vectors with m quantitative features. I apply the K-means clustering algorithm to a training dataset that contains normal as well as anomalous traffic. I assume that normal and anomalous traffic form different clusters. I have used SVMs (Support Vector Machines) [4], which are a useful technique for data classification. A classification task usually involves separating data into training and testing sets. Each instance in the training set contains one "target value" (i.e. the class label) and several "attributes" (i.e. the features or observed variables). The goal of an SVM is to produce a model (based on the training data) that predicts the target values of the test data given only the test data attributes. An SVM requires that each data instance be represented as a vector of real numbers. The goal of a support vector machine is to find hyperplanes that separate data points into their respective classes. The better the separation achieved, the better the data classification that is ultimately possible.
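The clustering step can be sketched as follows. The distance formula did not survive in this copy of the text, so the sketch assumes the Euclidean distance between x and y, a common choice for K-means. The toy points, the hand-picked initial centroids and the function names are hypothetical; in practice the initial centroids are usually chosen at random.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(points, initial_centroids, max_iterations=100):
    """Plain K-means: returns (centroids, cluster index of each point)."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

# Toy two-feature traffic vectors forming two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (9.0, 9.0), (9.2, 8.8), (8.9, 9.1)]
centroids, assignment = kmeans(points, initial_centroids=[points[0], points[3]])
```

K-means can settle in a poor local optimum when the initial centroids are badly placed, which is why implementations often run it several times with different starting points.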
Figure 3 Provide extracted features to IDS
The extracted features are compared with the training dataset and, with the help of the K-means clustering algorithm and the SVM Java library, are assigned to the normal group or the anomalous group. The resulting categorization is shown as a pie chart to give a pictorial representation of the result.
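The paper does not spell out how the clustering result and the SVM library are combined, so the sketch below covers only one part under stated assumptions: each new feature vector is assigned to the group with the nearest centroid, and the share of each group (the slices a pie chart would show) is computed. The centroid values, the test vectors and the function names are hypothetical, and the Euclidean distance is again an assumption.

```python
import math
from collections import Counter

def distance(x, y):
    """Euclidean distance between two equal-length numeric vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest_group(vector, centroids):
    """Return the name of the group whose centroid is closest to the vector."""
    return min(centroids, key=lambda name: distance(vector, centroids[name]))

def group_shares(vectors, centroids):
    """Fraction of vectors assigned to each group (the slices of a pie chart)."""
    counts = Counter(nearest_group(v, centroids) for v in vectors)
    return {name: counts[name] / len(vectors) for name in centroids}

# Hypothetical centroids, as if obtained by clustering the training data.
centroids = {"normal": (1.0, 1.0), "anomalous": (9.0, 9.0)}
new_vectors = [(0.8, 1.3), (1.1, 0.7), (8.5, 9.4), (1.4, 1.2)]
shares = group_shares(new_vectors, centroids)
```

Deciding which cluster represents normal traffic and which represents anomalous traffic is taken as given here; that labelling has to come from the training data.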
5. CONCLUSION
In this paper, I have implemented the K-means clustering algorithm using SVMs (Support Vector Machines) JAVA
ACKNOWLEDGMENT
This work was done under the valuable guidance of Prof. G.A.Patil, Asst. Prof. and Head of the Computer Department, D.Y.Patil College of Engg. & Technology, Kasba Bawada, Kolhapur. I thank Prof. G.A.Patil for his valuable comments and discussions on this work.
REFERENCES:
[1] Gerhard Munz, Sa Li, Georg Carle, "Traffic Anomaly Detection Using K-Means Clustering," Computer Networks and Internet, Wilhelm Schickard Institute for Computer Science, University of Tuebingen, Germany.
[2] Joong-hee Lee, Jong-hyouk Lee, Seon-gyoung Sohn, Jong-ho Ryu, Tai-myoung Chung, "Effective Value of Decision Tree with KDD 99 Intrusion Detection Datasets for Intrusion Detection System," in 10th International Conference on Advanced Communication Technology, 2008.
[3] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press.
[4] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, A Practical Guide to Support Vector C
AUTHOR
Prof. Mrs. Dipali Dayanand Ghatge received the B.E. degree in Information Technology and the M.E. degree in Computer Science and Engineering from Shivaji University, Kolhapur. She has been working at Karmaveer Bhaurao Patil College of Engineering and Polytechnic, Satara, since 2005.