
Proceedings of the International Conference on Electronics and Sustainable Communication Systems (ICESC 2020)

IEEE Xplore Part Number: CFP20V66-ART; ISBN: 978-1-7281-4108-4

Intrusion Detection System Using PCA with Random Forest Approach

Mr Subhash Waskle
Department of Computer Science
Patel College of Science & Technology, Indore, India
Subhash.waskle87@gmail.com

Mr Lokesh Parashar
Assistant Professor
Department of Computer Science
Patel College of Science & Technology, Indore, India
lokesh23324@gmail.com

Mr Upendra Singh
Assistant Professor
Department of Information Technology
Shri Govindram Seksaria Institute of Technology and Science, Indore, India
Upendrasingh49@gmail.com

Abstract: With the evolution of wireless communication, security threats over the internet have multiplied. An intrusion detection system (IDS) helps to find attacks on a system and to detect intruders. Various machine learning (ML) techniques have previously been applied to IDS in an attempt to improve the detection of intruders and to increase the accuracy of the IDS. This paper proposes an approach for developing an efficient IDS using principal component analysis (PCA) and the random forest classification algorithm. PCA organises the dataset by reducing its dimensionality, and the random forest performs the classification. The results show that the proposed approach is more accurate than other techniques such as SVM, Naïve Bayes, and Decision Tree. The proposed method achieves a performance time of 3.24 minutes, an accuracy rate of 96.78 %, and an error rate of 0.21 %.

Keywords: IDS, Knowledge Discovery Dataset, PCA, Random Forest.

I. INTRODUCTION

The involvement of the internet in everyday life has increased rapidly, and the internet has become crucial for everyone. As its use for personal activities grows, it is also necessary to secure systems against malicious activities. Different attacks are seen on systems and networks, such as black hole, grey hole, and wormhole attacks. These attacks aim to steal information from a system or to corrupt the data stored on it [1]. To misuse the data, intruders attack the system in various ways; some of the attacks are DoS, probe, snort, r2l etc. The intrusion detection system was introduced to protect systems against such attacks. An IDS keeps track of attacks on the system and helps to prevent them [2].

Various techniques have been used in earlier work to detect such attacks. Here, an intrusion detection system is built using principal component analysis together with the random forest technique. Each method serves a specific purpose: PCA gives the granularity in the data, and the random forest classifies the nodes for attacks [3].

1.1 Intrusion Detection System:

Intrusion means entering a system without permission and tampering with the information inside it [4]. An intrusion can also harm the hardware of the system, so preventing it has become a significant concern. Intrusions can be controlled, or at least tracked, with the help of an IDS. Various types of intrusion detection systems have been used before, but accuracy concerns are seen with every method. Two measures, the detection rate and the false alarm rate, are analysed

978-1-7281-4108-4/20/$31.00 ©2020 IEEE 803


for the evaluation of the accuracy of the system [5]. The false alarm rate should be minimised and the detection rate should be improved. The random forest is therefore applied together with PCA for the IDS.

An IDS can be of two types, depending on what it monitors:
• Network Intrusion Detection Systems (NIDS): the network traffic is analysed for intrusions.
• Host-based Intrusion Detection Systems (HIDS): the system keeps track of the system files that are accessed over the network.

IDS types are further subdivided. The most common variants are based on signature detection and anomaly detection.
• Signature-based: the system looks for specific patterns used by malware. These detected patterns are called signatures. This works well for known attacks but fails on new attacks.
• Anomaly-based: this is developed specifically for the detection of unknown attacks. The system uses ML to construct the model.

Figure 1. Intrusion Detection System [2]

1.2 Random Forest:

RF is one of the most powerful methods used in machine learning for classification problems. The random forest belongs to the category of supervised classification algorithms [3]. The algorithm runs in two stages: the first creates the forest from the given dataset, and the second makes predictions with the classifier obtained in the first stage.

Pseudocode for the creation of a random forest is as follows:

1. Select k features at random from the total m features, where k << m.
2. Among the k features, compute node d using the best split point.
3. Split the node into daughter nodes using the best split.
4. Repeat the three steps above until a single node is reached.
5. Create the forest by repeating steps 1 to 4.

Figure 2. Random Forest Model.

1.3 PCA:

Principal component analysis is a technique used mainly to reduce the dimension of a given dataset. It is one of the most efficient and accurate methods for reducing the dimensions of data [6]. The method reduces the attributes of the given dataset to a desired number of attributes called principal components.

The method takes as input a dataset with a large number of attributes, that is, a dataset of very high dimension. It reduces the size of the dataset by projecting the data points onto a common axis. The data points are shifted onto a single axis, and the principal components are computed. PCA can be performed using the following steps:

1. Take the dataset with all d dimensions.
2. Calculate the mean vector, that is, the mean of each of the d dimensions.
3. Calculate the covariance matrix for the whole dataset.


4. Calculate the eigenvectors (e1, e2, e3, …, ed) and eigenvalues (v1, v2, v3, …, vd).
5. Sort the eigenvalues in decreasing order and select the n eigenvectors with the highest eigenvalues to form a d × n matrix M.
6. Use M to form a new sample space.
7. The obtained sample spaces are the principal components.

II. LITERATURE REVIEW

The authors presented a mechanism for designing an IDS for the IoT that classifies traffic using a deep learning model. They performed binary and multi-class classification. The accuracy obtained for the presented system is high.[7]

The authors applied the SVM and Naïve Bayes algorithms to the IDS problem and showed that SVM works better than Naïve Bayes. They carried out the experiment on the KDD dataset and reported the results in terms of detection rate and false alarm rate.[8]

In this paper, the authors performed three different experiments and also applied feature selection in the analysis. They evaluated naïve Bayes, adaptive boost and partial decision tree for intrusion detection.[9]

In this paper, the authors found that artificial neural networks with feature selection give better results than the support vector machine technique. They used the NSL-KDD dataset for the experiment, and the approach worked well.[10]

Here the authors presented a review of intrusion detection systems that use machine learning algorithms. They compared various machine learning algorithms on their performance and evaluated the survey in terms of detection rates and false alarm rates.[11]

The authors presented an approach for intrusion detection that uses logistic regression and belief propagation. The proposed method provides a better average detection time than earlier techniques.[12]

The authors used a deep learning approach for feature extraction from the dataset. They extracted features to make the dataset efficient to use and, in this way, to provide better input to the intrusion detection system.[13]

Here the authors surveyed intrusion detection systems based on the machine learning approach. They analysed the machine learning algorithms used to date and highlighted the algorithms proposed by Md Nasimuzzaman Chowdhury and the ANN submitted by Alex Shenfield, Aladdin Ayesh and David Day.[14]

Here the authors studied various machine learning algorithms for intrusion detection. They compared SVM, extreme learning machine and random forest, and reported that the extreme learning machine performs much better than the other algorithms.[15]

The authors worked on improving the quality of the dataset supplied to the intrusion detection system. They used a fuzzy rule-based feature selection technique on the KDD dataset and reported a marked improvement in the IDS results.[16]

III. PROBLEM DOMAIN

Systems that work over the internet suffer from various malicious activities. The major problem in this field is intrusion into a system to violate its information. Such intrusion is detected by an intrusion detection system, which itself needs to be accurate and efficient in detecting intruders. Various machine learning algorithms have been used for intrusion detection, among them SVM and Naïve Bayes, but the results indicate room for improvement in accuracy, detection rate and false alarm rate. Other techniques can replace previously applied techniques such as SVM and Naïve Bayes. The study also indicates that the dataset can be improved by applying some methods to it, so as to improve the quality of the input to the proposed system.

IV. PROPOSED SOLUTION

The intrusion detection system protects a system that is affected by intruders by detecting them. The proposed system tries to eliminate the problems of the previous work. It consists of two methods: principal component analysis and the random forest.

Principal component analysis is used to reduce the dimension of the dataset. This improves the quality of the dataset, as the reduced dataset may contain only the relevant attributes. The random forest algorithm is then applied to detect intruders, which improves both the detection rate and the false alarm rate compared to SVM.
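
The two stages of the proposed solution can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the authors' implementation: the synthetic two-class data stands in for the KDD records, all helper names (`pca_fit`, `pca_transform`, `build_tree`, `fit_forest`, `predict_forest`) are hypothetical, Gini impurity is assumed as the split criterion, and the number of components, tree depth, forest size and value of k are arbitrary choices.

```python
import numpy as np

def pca_fit(X, n_components):
    """PCA: mean vector, covariance matrix, eigen-decomposition, top-n eigenvectors."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)             # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1][:n_components]   # largest eigenvalues first
    return mean, eigvecs[:, order]                     # projection matrix M of shape (d, n)

def pca_transform(X, mean, M):
    return (X - mean) @ M                              # project into the new sample space

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def majority(y):
    values, counts = np.unique(y, return_counts=True)
    return int(values[np.argmax(counts)])

def build_tree(X, y, k, rng, depth=0, max_depth=5):
    if depth == max_depth or len(np.unique(y)) == 1:
        return majority(y)                             # leaf node stores a class label
    best = None
    for f in rng.choice(X.shape[1], size=k, replace=False):  # k randomly chosen features
        for t in np.unique(X[:, f])[:-1]:              # thresholds keeping both sides non-empty
            left = X[:, f] <= t
            score = left.mean() * gini(y[left]) + (1 - left.mean()) * gini(y[~left])
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:                                   # all sampled features are constant
        return majority(y)
    _, f, t = best
    left = X[:, f] <= t
    return (f, t,
            build_tree(X[left], y[left], k, rng, depth + 1, max_depth),
            build_tree(X[~left], y[~left], k, rng, depth + 1, max_depth))

def predict_tree(node, x):
    while isinstance(node, tuple):                     # internal nodes are tuples
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node

def fit_forest(X, y, n_trees=15, k=2, seed=0):
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))     # bootstrap sample with replacement
        forest.append(build_tree(X[idx], y[idx], k, rng))
    return forest

def predict_forest(forest, X):
    votes = np.array([[predict_tree(tree, x) for tree in forest] for x in X])
    return np.array([np.bincount(row).argmax() for row in votes])  # majority vote

# Synthetic stand-in for IDS records: 10 features, label 0 = normal, 1 = attack.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0.0, 1.0, (100, 10)),
               data_rng.normal(3.0, 1.0, (100, 10))])
y = np.array([0] * 100 + [1] * 100)
X_train, y_train, X_test, y_test = X[::2], y[::2], X[1::2], y[1::2]

mean, M = pca_fit(X_train, n_components=3)             # fit PCA on training data only
forest = fit_forest(pca_transform(X_train, mean, M), y_train)
pred = predict_forest(forest, pca_transform(X_test, mean, M))
accuracy = float((pred == y_test).mean())
print(f"held-out accuracy: {accuracy:.3f}")
```

The projection matrix is fitted on the training half only and then reused for the held-out half, so that the test records do not influence the principal components. On real data, the number of components and the forest parameters would need to be tuned.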


4.1 Algorithm for the proposed solution:

The attribute compatibility replaces the coordination degree of the original attribute as the split-node standard.

1. Attribute compatibility
Let | Pr | be the modulus of the main decision set and | Se | the modulus of the secondary set. The attribute compatibility is defined as:

CO(X → D) = | | | | / | |    (1)

Here, X is a non-empty subset of C. Strict compatibility applies when the influence of the secondary set on the main set is seen, that is, when a contradiction is seen between the main and the secondary set. The secondary set is rounded off by the expression:

CO(X → D) = | | / | |    (2)

Here, X is a non-empty subset of C. In this case, the wide compatibility of the secondary set is seen.

Algorithm for the Base Classifier Improvement:

Step 1: Initialise the active attributes of the data set by marking all condition attributes.
Step 2: Calculate the modulus of every condition attribute in both the primary and the secondary set.
Step 3: Calculate the compatibility of every condition attribute using equation (1). Use equation (2) if more than one attribute has the same compatibility.
Step 4: To separate the sample, select the attribute with the largest compatibility as the split node and delete its active tag.
Step 5: Keep selecting active attributes for splitting until a leaf node is reached.
Step 6: Finally, the base classifier is generated.

4.2 Flowchart for the Proposed Algorithm:

Figure 3. Flowchart for the Proposed Approach

V. RESULTS

The experiment for the proposed approach uses the KDD dataset, and the results obtained were satisfactory. The following configuration was used for the analysis:

• Hardware: 4 GB RAM, 140 GB SSD hard disk, Intel Core i3 and Intel motherboard.
• Software: 64-bit Windows 10 and Python 3.8.
• Python packages: NumPy, pandas and the Keras library.
• Data set: KDD dataset.
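
The paper names accuracy, detection rate and false alarm rate as its evaluation measures but does not define them formally. As a minimal sketch, assuming the standard confusion-matrix definitions (detection rate = TP / (TP + FN), false alarm rate = FP / (FP + TN)), they could be computed as follows. The function name `ids_metrics` and the toy labels are hypothetical and are not taken from the KDD experiment.

```python
def ids_metrics(y_true, y_pred, attack_label=1):
    """Accuracy, detection rate and false alarm rate for binary IDS output."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == attack_label and pred == attack_label:
            tp += 1          # attack correctly flagged
        elif truth == attack_label:
            fn += 1          # attack missed
        elif pred == attack_label:
            fp += 1          # normal traffic flagged as attack (false alarm)
        else:
            tn += 1          # normal traffic correctly passed
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "detection_rate": tp / (tp + fn) if tp + fn else 0.0,
        "false_alarm_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

# Toy example: 4 attack records and 6 normal records.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = ids_metrics(y_true, y_pred)
print(metrics)
```

A good IDS pushes the detection rate up while keeping the false alarm rate down, so the two values should be read together rather than relying on accuracy alone.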

The application of PCA together with the Random Forest worked well in comparison with existing techniques such as SVM, Naïve Bayes, and Decision Tree. The table below presents the performance time (min), accuracy rate (%), and error rate (%) for the different approaches:


Table 2. Result Comparison with other Classifiers

Method                    Performance time (min)   Accuracy rate (%)   Error rate (%)
SVM                       4.57                     84.34               2.67
Naïve Bayes               9.12                     80.85               3.49
Decision Tree             12.36                    89.91               0.78
PCA with Random Forest    3.42                     96.78               0.21

The table above gives the numerical values obtained from the experiment. The error rate of the proposed approach is very low, at 0.21 %, and its accuracy is much higher than that of the previous algorithms. The time taken is also less than for the other algorithms.

[Figure 4: bar chart "Result Comparison with other Classifiers"; x-axis: parameters; y-axis: % and minutes; series: SVM, Naïve Bayes, Decision Tree, PCA with Random Forest]

Figure 4. Result Comparison with other Classifiers

The figure above is a graphical representation of the obtained values. In all three aspects, the proposed method worked well.

VI. CONCLUSION

As the involvement of systems over the internet increases rapidly, security concerns have also grown. The proposed approach deals efficiently with the detection of intruders over the internet. The proposed algorithm performed well compared to previously applied algorithms such as SVM, Naïve Bayes, and Decision Tree. The detection rates and the false alarm rates can be improved to a great extent by the proposed approach. The dataset used here is the knowledge discovery dataset. The proposed method achieves a performance time of 3.24 minutes, an accuracy rate of 96.78 %, and an error rate of 0.21 %.

Reference:

1. Jafar Abo Nada, Mohammad Rasmi Al-Mosa, "A Proposed Wireless Intrusion Detection Prevention and Attack System," 2018 International Arab Conference on Information Technology (ACIT).
2. Kinam Park, Youngrok Song, Yun-Gyung Cheong, "Classification of Attack Types for Intrusion Detection Systems Using a Machine Learning Algorithm," 2018 IEEE Fourth International Conference on Big Data Computing Service and Applications (BigDataService).
3. S. Bernard, L. Heutte, S. Adam, "On the Selection of Decision Trees in Random Forests," Proceedings of International Joint Conference on Neural Networks, Atlanta, Georgia, USA, June 14-19, 2009, 978-1-4244-3553-1/09/$25.00 ©2009 IEEE.
4. A. Tesfahun, D. Lalitha Bhaskari, "Intrusion Detection using Random Forests Classifier with SMOTE and Feature Reduction," 2013 International Conference on Cloud & Ubiquitous Computing & Emerging Technologies, 978-0-4799-2235-2/13 $26.00 ©2013 IEEE.
5. T.-T.-H. Le, H. Kang, H. Kim, "The Impact of PCA-Scale Improving GRU Performance for Intrusion Detection," 2019 International Conference on Platform Technology and Service (PlatCon), 2019. doi:10.1109/platcon.2019.8668960
6. Anish Halimaa A, K. Sundarakantham, "Machine Learning Based Intrusion Detection System," Proceedings of the Third International Conference on Trends in Electronics and Informatics (ICOEI 2019), 978-1-5386-9439-8/19/$31.00 ©2019 IEEE.
7. Mengmeng Ge, Xiping Fu, Naeem Syed, Zubair Baig, Gideon Teo, Antonio Robles-Kelly, "Deep Learning-Based Intrusion Detection for IoT Networks," 2019 IEEE 24th Pacific Rim International Symposium on Dependable Computing (PRDC), pp. 256-265, Japan, 2019.
8. R. Patgiri, U. Varshney, T. Akutota, R. Kunde, "An Investigation on Intrusion Detection System Using Machine Learning," 978-1-5386-9276-9/18/$31.00 ©2018 IEEE.
9. Rohit Kumar Singh Gautam, Amit Doegar, "An Ensemble Approach for Intrusion Detection System Using Machine Learning Algorithms," 2018 8th International Conference on Cloud Computing, Data Science & Engineering (Confluence).
10. Kazi Abu Taher, Billal Mohammed Yasin Jisan, Md. Mahbubur Rahma, "Network Intrusion Detection using Supervised Machine Learning Technique with Feature Selection," 2019 International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST).
11. L. Haripriya, M.A. Jabbar, "Role of Machine Learning in Intrusion Detection System: Review," 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA).
12. Nimmy Krishnan, A. Salim, "Machine Learning-Based Intrusion Detection for Virtualized Infrastructures," 2018 International CET Conference on Control, Communication, and Computing (IC4).
13. Mohammed Ishaque, Ladislav Hudec, "Feature extraction using Deep Learning for Intrusion Detection System," 2019 2nd International Conference on Computer Applications & Information Security (ICCAIS).
14. Aditya Phadke, Mohit Kulkarni, Pranav Bhawalkar, Rashmi Bhattad, "A Review of Machine Learning Methodologies for Network Intrusion Detection," 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC).
15. Iftikhar Ahmad, Mohammad Basheri, Muhammad Javed Iqbal, Aneel Rahim, "Performance Comparison of Support Vector Machine, Random Forest, and Extreme Learning Machine for Intrusion Detection," IEEE Access, Volume 6, pp. 33789-33795.
16. B. Riyaz, S. Ganapathy, "An Intelligent Fuzzy Rule-based Feature Selection for Effective Intrusion Detection," 2018 International Conference on Recent Trends in Advanced Computing (ICRTAC).
