A Comparative Analysis of Machine Learning Approaches to Intrusion Detection

Syed Ayaz Imam
School of Computer Science and Engineering (SCOPE)
Vellore Institute of Technology, Vellore, Tamil Nadu, India

Archit Aggarwal
School of Computer Science and Engineering (SCOPE)
Vellore Institute of Technology, Vellore, Tamil Nadu, India

Akshat Bakliwal
School of Computer Science and Engineering (SCOPE)
Vellore Institute of Technology, Vellore, Tamil Nadu, India

Vikas Vijayvargiya
School of Electronics Engineering (SENSE)
Vellore Institute of Technology, Vellore, Tamil Nadu, India

Abstract – Network administrators use a Network Intrusion Detection System (NIDS) to detect network security breaches
in their enterprises. However, designing a convenient and dynamic NIDS for unanticipated and unpredictable attacks
poses numerous obstacles. Signature-based Intrusion Detection Systems (IDS) are currently insufficient to handle the
hazards posed by zero-day attacks to networked systems. On the NSL-KDD dataset, we applied data mining techniques
and compared their performance on metrics such as accuracy, precision, and recall.

Keywords – IDS, Denial of Service, U2R, R2L, Machine Learning, KNN, Accuracy, F1-Score, Decision Trees, Random Forest, Feature Scaling, Encoding, Sampling

I. INTRODUCTION
Intrusion detection appears to have a simple goal: to detect intrusions. However, the process is
challenging, and intrusion detection systems do not actually detect intrusions; instead, they identify
evidence of intrusions, either while they are happening or after they have happened. Such evidence is
referred to as the "manifestation" of an attack. The system cannot identify an intrusion if there is no
manifestation, if the manifestation lacks adequate information, or if the information it contains is
untrustworthy.

Administrators in the late 1970s and 1980s printed audit logs on fan-folded paper, which were routinely
stacked high by the end of a typical week. It took a long time to search through such a stack. Due to the
amount of data and the lack of automated analysis, administrators mostly employed audit logs as a
forensic tool to establish the cause of a security event after it occurred. There was little chance of stopping
an attack in progress. Audit logs migrated online as storage grew more affordable, and researchers built
systems to evaluate the data. Intrusion detection applications were typically executed at night, when user
traffic on the system was low, because the analysis was slow and often computationally intensive. As a
result, a majority of such intrusions were recognized only after they had occurred. In the early 1990s,
researchers developed real-time intrusion detection systems that analyzed audit data as it was produced.
As a result, attacks and attempted attacks could be detected in real time, allowing for real-time response
and, in certain situations, attack preemption. Recent intrusion detection research has focused on creating
products that consumers can deploy efficiently in large networks. Given rising security concerns, a
plethora of new attack strategies, and constant changes in the computing environment, this is no easy feat.

Despite substantial research into Intrusion Detection Systems and a variety of antivirus products, there are
currently no entirely efficient solutions. The question arises as to why, despite investing so much money,
we have yet to develop an intrusion detection system capable of averting such attacks and losses. Antivirus
systems cannot provide adequate security because they are based on misuse intrusion detection
technologies: until an intruder's methods are known, antivirus systems cannot cope with inventive and
sophisticated attacks. Anomaly-based Intrusion Detection Systems were developed to address this issue;
they can detect undesirable changes in network data or deviations from normal data patterns, which means
they can detect previously unseen intrusion types.

This study puts forward a comprehensive analysis of the performance of various algorithms. To study the
NSL-KDD dataset, we first cleaned the dataset and then analyzed the data via graphs and charts. We then
proceeded with data preprocessing and created a machine learning pipeline to train classification models.
We used Support Vector Classifier, Decision Trees, Random Forest, Voting Classifier, Naive Bayes
Classifier, K-Nearest Neighbour Classifier, and Logistic Regression classifier to train on the data.
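
For orientation, the sketch below shows the general shape of such a scaling-plus-classification pipeline in scikit-learn. It is a minimal, illustrative example rather than the authors' actual code; the synthetic data and the choice of classifier inside the pipeline are assumptions made purely for demonstration.

```python
# Minimal sketch of a scaling + classification pipeline (illustrative only;
# synthetic data stands in for the preprocessed NSL-KDD features).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_classes=5,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),        # feature scaling step
    ("clf", RandomForestClassifier()),   # any of the compared classifiers fits here
])
pipeline.fit(X_train, y_train)
print("Test accuracy:", pipeline.score(X_test, y_test))
```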

The document begins with an introduction to the dataset, followed by an introduction to the machine
learning algorithms used. After that, we explain the implementation methodology, including the
pre-processing steps. Towards the end, we compare the outcomes of the models with each other
and conclude with the findings of the work.

II. DATASET
The NSL-KDD data set is made up of selected records from the entire KDD data set. Because the NSL-KDD
train set contains no duplicated records, the classifier will not generate a biased result. Additionally,
there are no duplicate records in the test set.

The training dataset contains 21 distinct attack types, compared to 37 in the test dataset. The known attack
types are those that appear in the training dataset, whereas the novel attacks are those that appear only in
the test dataset and are not present in the training dataset.

There are four different forms of attacks: DoS, Probe, U2R, and R2L.

Table 1. Attack Classes

Attack Class            Attack Types
DoS                     Back, Land, Neptune, Pod, Smurf, Teardrop, Mailbomb, Processtable, Udpstorm, Apache2, Worm
Probe                   Satan, IPsweep, Nmap, Portsweep, Mscan, Saint
User to Root (U2R)      Buffer_overflow, Loadmodule, Rootkit, Perl, Sqlattack, Xterm, Ps
Remote to Local (R2L)   Guess_password, Ftp_write, Imap, Phf, Multihop, Warezmaster, Xlock, Xsnoop, Snmpguess, Snmpgetattack, Httptunnel, Sendmail, Named

In this exploratory research, the NSL-KDD dataset with 42 attributes is employed. The 'class' attribute,
which is the 42nd attribute in the data set, specifies whether a given instance is a normal connection
instance or an attack. The remaining 41 attributes can each be placed into one of four categories, as shown
below:

• Basic (B) features are attributes of individual TCP connections.
• Content (C) features are values within a connection suggested by domain knowledge.
• Traffic (T) features are computed over a two-second time window.
• Host (H) features are designed to assess attacks that last longer than two seconds.

Table 2. Attribute Information

No.  Label  Attribute Name        No.  Label  Attribute Name        No.  Label  Attribute Name
1    B      Duration              15   C      Su_attempted          29   T      Srv_serror_rate
2    B      Protocol_type         16   C      Num_root              30   T      Srv_rerror_rate
3    B      Service               17   C      Num_file_creations    31   T      Srv_diff_host_rate
4    B      Src_bytes             18   C      Num_shells            32   H      Dst_host_count
5    B      Dst_bytes             19   C      Num_access_files      33   H      Dst_host_srv_count
6    B      Flag                  20   C      Num_outbound_cmds     34   H      Dst_host_same_srv_rate
7    B      Land                  21   C      Is_hot_login          35   H      Dst_host_diff_srv_rate
8    B      Wrong_fragment        22   C      Is_guest_login        36   H      Dst_host_same_src_port_rate
9    B      Urgent                23   T      Count                 37   H      Dst_host_srv_diff_host_rate
10   C      Hot                   24   T      Serror_rate           38   H      Dst_host_serror_rate
11   C      Num_failed_logins     25   T      Rerror_rate           39   H      Dst_host_srv_serror_rate
12   C      Logged_in             26   T      Same_srv_rate         40   H      Dst_host_rerror_rate
13   C      Num_compromised       27   T      Diff_srv_rate         41   H      Dst_host_srv_rerror_rate
14   C      Root_shell            28   T      Srv_count             42   -      Class

III. ALGORITHMS USED

The Support Vector Machine (SVM) is a commonly used supervised learning model for classification and
regression problems; in this work we use it for classification. The goal of SVM is to determine the decision
boundary that partitions the n-dimensional feature space into classes so that subsequent data points can be
easily placed in the right category. This optimal decision boundary is known as a hyperplane.

A decision tree is a supervised classification technique in which internal nodes represent dataset attributes,
branches represent decision rules, and leaf nodes represent outcomes. A decision tree therefore has two
kinds of nodes: decision nodes, which test an attribute and have several branches, and leaf nodes, which
hold the output of those decisions and do not contain any further branches.
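
As an aside, a fitted tree's decision rules can be printed directly; the sketch below is illustrative, and the depth cap and synthetic data are assumptions.

```python
# Toy decision tree example; export_text prints the learned decision rules.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # depth capped for readability
tree.fit(X, y)
print(export_text(tree))  # internal nodes = attribute tests, leaves = predicted classes
```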

Random forest is built on ensemble learning, a method for solving a complicated problem and improving
the model's performance by combining multiple classifiers. As the name implies, a random forest is a
classifier that trains a number of decision trees on different subsets of the dataset and aggregates their
outputs to increase predictive accuracy. Instead of relying on a single decision tree, the random forest
collects the forecasts from each tree and predicts the final output based on the majority vote of the
predictions. The greater the number of trees in the forest, the higher the accuracy and the lower the risk of
overfitting.
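
The majority-vote idea can be seen in a small scikit-learn sketch; the number of trees and the toy data are arbitrary assumptions for illustration.

```python
# Toy random forest: an ensemble of trees whose majority vote gives the prediction.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)  # 100 trees, each on a bootstrap sample
forest.fit(X, y)
print(forest.predict(X[:5]))  # class chosen by aggregating the per-tree votes
```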

Naive Bayes is one of the most fundamental algorithms in machine learning. It works on the principle of
probability and assumes that each attribute makes an independent and equal contribution to predicting the
outcome. Because of these assumptions, Naive Bayes usually exhibits lower accuracy than its counterpart
algorithms.
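
For reference, a Gaussian Naive Bayes fit is a near one-liner in scikit-learn; the sketch below is illustrative and assumes continuous features.

```python
# Toy Gaussian Naive Bayes example (assumes roughly Gaussian, independent features).
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
nb = GaussianNB()
nb.fit(X, y)                    # estimates per-class mean/variance for each feature
print(nb.predict_proba(X[:3]))  # class probabilities obtained via Bayes' rule
```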

The KNN method assumes that a new case is similar to existing cases and places it in the category that is
most similar to the existing categories. The KNN algorithm stores all available data and classifies a new
data point based on its similarity to the existing data, which means that new data can be quickly sorted
into a suitable category. KNN is a non-parametric algorithm, meaning it makes no assumptions about the
underlying data. It is also known as a lazy learner algorithm, since it does not learn from the training set
right away; instead, it stores the dataset and acts on it only when a new point needs to be classified.
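
A minimal KNN sketch follows; the value of k and the toy data are arbitrary assumptions.

```python
# Toy k-nearest-neighbours example: fit() only stores the data ("lazy learning"),
# and the neighbour search happens at prediction time.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
print(knn.predict(X[:5]))  # each point gets the majority class of its 5 nearest neighbours
```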

IV. IMPLEMENTATION

Figure 1 depicts the sequential flow of the implementation. We began by loading the training and the test
data using the Pandas library. The training set had a dimension of [125973, 42] whereas the test set had
[22544, 42] dimensions.
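
A hedged sketch of this loading step is shown below; the file names and the header handling are assumptions, since the paper does not specify them.

```python
# Hedged sketch of loading the NSL-KDD train/test splits with pandas.
# File names are assumed; the standard NSL-KDD text files have no header row.
import pandas as pd

train_df = pd.read_csv("KDDTrain+.txt", header=None)  # shape should roughly match the dimensions above
test_df = pd.read_csv("KDDTest+.txt", header=None)
print(train_df.shape, test_df.shape)
```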

The target values (the 'attack' class) could be categorized into five categories, namely Normal, Probe,
DoS (denial-of-service attack), R2L (remote-to-local attack), and U2R (user-to-root attack). We mapped
the 'attack' class to these categories.

Once we mapped the values to the respective categories, we performed a thorough analysis of the data,
which is a precursor to data preprocessing. Data preprocessing is an essential stage in machine learning,
since the quality of the data and the valuable information obtained from it directly influence our model's
capacity to learn; thus, before training our model, we must pre-process our data. The table below depicts
the mapping of the 'attack' class.

Figure 1. Flow Diagram

Table 3. Mapping Table

Mapped Attack Class   Original Attack Class
Probe                 ipsweep, satan, nmap, portsweep, saint, mscan
DoS                   teardrop, pod, land, back, neptune, smurf, mailbomb, udpstorm, apache2, processtable
U2R                   perl, loadmodule, buffer_overflow, xterm, ps, sqlattack, httptunnel
R2L                   ftp_write, phf, guess_passwd, warezmaster, warezclient, imap, spy, multihop, named, snmpguess, worm, snmpgetattack, xsnoop, xlock, sendmail
Normal                normal
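
One way such a mapping could be applied with pandas is sketched below; the dictionary is abridged, and the column name "attack" is an assumption about how the raw label column was named.

```python
# Hedged sketch: collapsing individual attack labels into the five broad classes.
import pandas as pd

attack_map = {                      # abridged version of Table 3
    "normal": "Normal",
    "neptune": "DoS", "smurf": "DoS", "back": "DoS",
    "satan": "Probe", "ipsweep": "Probe", "nmap": "Probe",
    "buffer_overflow": "U2R", "xterm": "U2R",
    "guess_passwd": "R2L", "warezmaster": "R2L",
}

df = pd.DataFrame({"attack": ["normal", "neptune", "nmap", "xterm"]})
df["attack_class"] = df["attack"].map(attack_map)   # unmapped labels become NaN
print(df)
```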

Our preprocessing pipeline includes feature scaling (using a standard scaler), label encoding, one-hot
encoding, and sampling. The procedural flow is depicted in Figure 1. Once we mapped the 'attack'
column, we proceeded to feature scaling. We used scikit-learn's StandardScaler, which standardizes
features by removing the mean and scaling the values to unit variance. Every column of int64 and float64
values was scaled so that each feature contributes proportionately to the model.
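
A hedged sketch of this standardization step is shown below; the tiny data frames and column names are illustrative, and selecting columns by dtype mirrors the description above.

```python
# Standardize numeric columns: fit on the training set, then apply to both train and test.
import pandas as pd
from sklearn.preprocessing import StandardScaler

train_df = pd.DataFrame({"duration": [0, 2, 10, 0], "src_bytes": [181, 239, 0, 5450]})
test_df = pd.DataFrame({"duration": [1, 0], "src_bytes": [300, 12]})

num_cols = train_df.select_dtypes(include=["int64", "float64"]).columns
scaler = StandardScaler()
train_df[num_cols] = scaler.fit_transform(train_df[num_cols])  # zero mean, unit variance
test_df[num_cols] = scaler.transform(test_df[num_cols])        # reuse the training statistics
print(train_df.round(2))
```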

Then we proceeded to label encoding, which involves converting each categorical value in a column to a
corresponding numerical value. For this study, we label encoded the 'attack' column (in both the test and
training sets) into five categories labeled 0, 1, 2, 3, and 4 as follows:

Table 4. Attack Labels

Attack Type   Label
Normal        0
DoS           1
Probe         2
R2L           3
U2R           4
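
The label mapping of Table 4 can be reproduced with a plain dictionary (or with scikit-learn's LabelEncoder); the explicit dictionary below is a hedged sketch that keeps the 0-4 codes aligned with the table.

```python
# Hedged sketch: encode the five attack classes as the integer labels from Table 4.
import pandas as pd

label_map = {"Normal": 0, "DoS": 1, "Probe": 2, "R2L": 3, "U2R": 4}

y_train = pd.Series(["Normal", "DoS", "Probe", "R2L", "U2R", "DoS"])
y_train_encoded = y_train.map(label_map)
print(y_train_encoded.tolist())   # [0, 1, 2, 3, 4, 1]
```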

Figure 2 shows the attack class frequencies in the training set and the test set. The distribution is clearly
imbalanced, and for this reason we performed sampling. Data sampling provides a variety of strategies to
alter the training data set so that the class distribution is balanced more effectively. Once the data is
balanced, the resulting dataset can be used for training without further changes.

Figure 2. Attack Class Distribution
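
The paper does not state which resampling strategy was used; as one hedged example, random oversampling of the minority classes with the imbalanced-learn package could look like the sketch below.

```python
# Illustrative sketch of one possible balancing strategy (random oversampling).
# Requires the imbalanced-learn package; the specific technique is an assumption.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=1000, n_classes=3, weights=[0.8, 0.15, 0.05],
                           n_informative=5, random_state=0)
print("before:", Counter(y))

ros = RandomOverSampler(random_state=0)
X_res, y_res = ros.fit_resample(X, y)     # duplicates minority-class samples
print("after: ", Counter(y_res))
```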

V. RESULTS AND CONCLUSION

The NSL-KDD Cup 99 dataset contains normal and attack network connections and poses a multi-class
classification problem. In this study we applied a variety of classification methods to the pre-processed
NSL-KDD dataset, and the experimental analysis revealed that the random forest technique has the highest
F1 score of all the algorithms, with an accuracy of 99.9%. We intend to use ensemble classifiers in the
future and experiment with different subsets of features, comparing performance by changing the size of
the feature subsets.

Figure 3a. F1-Score Comparison Figure 3b. Accuracy Comparison

Table 5. Accuracy and F1 Scores of Machine Learning Algorithms

Performance Metric   SVM     Naive Bayes   Decision Tree   Random Forest   KNN
Accuracy (%)         98.91   97.37         99.8            99.9            99.77
F1-Score             0.99    0.97          1.00            1.00            1.00

The above table summarizes the findings of our study. It can be inferred that while all the classification
algorithms exhibit high accuracy, the Decision Tree and Random Forest classifiers have the highest
accuracy and F1 scores among the algorithms compared. The decision tree and random forest models are
able to achieve such high accuracies because of their built-in handling of feature importance, which the
other classification algorithms lack. On the other hand, Naive Bayes reported the lowest accuracy of the
algorithms compared; its assumption of independent predictors, which does not hold well in the
NSL-KDD dataset, could have resulted in the lower accuracy. The use of ensemble classifiers with
different feature subsets may result in an accuracy equivalent to that of Random Forest, because of their
similarities.
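
For completeness, accuracy and a multi-class F1 score can be computed with scikit-learn as sketched below; the weighted averaging and the toy labels are assumptions, since the paper does not state how the multi-class F1 was aggregated.

```python
# Hedged sketch of the evaluation metrics used for the comparison.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 1, 2, 3, 4, 0, 1]   # toy ground-truth attack labels
y_pred = [0, 1, 1, 2, 3, 4, 0, 2]   # toy model predictions

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred, average="weighted"))
```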
