Journal of Xi'an University of Architecture & Technology ISSN No : 1006-7930

A Comparative Analysis of Machine Learning Approaches to Intrusion Detection

Syed Ayaz Imam
School of Computer Science and Engineering (SCOPE)
Vellore Institute of Technology, Vellore, Tamil Nadu, India

Archit Aggarwal
School of Computer Science and Engineering (SCOPE)
Vellore Institute of Technology, Vellore, Tamil Nadu, India

Akshat Bakliwal
School of Computer Science and Engineering (SCOPE)
Vellore Institute of Technology, Vellore, Tamil Nadu, India

Vikas Vijayvargiya
School of Electronics Engineering (SENSE)
Vellore Institute of Technology, Vellore, Tamil Nadu, India

Abstract- Network administrators use a Network Intrusion Detection System (NIDS) to detect network security breaches
in their enterprises. However, designing a convenient and dynamic NIDS for unanticipated and unpredictable attacks
poses numerous obstacles. Signature-based Intrusion Detection Systems (IDS) are currently insufficient to handle the
hazards posed by zero-day attacks to networked systems. On the NSL-KDD dataset, we applied data mining techniques
and compared their performance on metrics such as accuracy, precision, and recall.

Keywords – IDS, Denial of Service, U2R, R2L, Machine Learning, KNN, Accuracy, F1-Score, Decision Trees, Random
Forest, Feature Scaling, Encoding, Sampling

Intrusion detection appears to have a simple goal: to detect intrusions. However, the process is
challenging, and intrusion detection systems don't actually detect intrusions; instead, they identify
evidence of intrusions, either while they're happening or after they've happened. An attack's
"manifestation" is a term used to describe such evidence. The system cannot identify an intrusion if there
is no manifestation, if the manifestation lacks adequate information, or if the information it contains is

Administrators in the late 1970s and 1980s printed audit logs on fan-folded paper, which were routinely
stacked high at the end of a typical week. It took a long time to search through such a stack. Due to the
amount of data and the lack of automated analysis, administrators mostly employed audit logs as a
forensic tool to establish the cause of a security event after it occurred. There was little chance of stopping
an attack in the middle. Audit logs migrated online as storage grew more affordable, and researchers built
systems to evaluate the data. Intrusion detection applications were typically executed at night when the
system's user traffic was low because the analysis was slow and often computationally intensive. As a
result, most intrusions were recognized only after they had occurred. In the early 1990s, researchers
developed real-time intrusion detection systems that analyzed audit data as it was produced. As a
result, attacks and attempted attacks could be detected in real time, allowing for real-time response and,
in certain situations, attack preemption. Recent intrusion detection research has focused on creating
products that users can deploy efficiently in large networks. Given rising security concerns, a
plethora of new attack strategies, and constant changes in the computing environment, this is no easy feat.

Despite substantial research into intrusion detection systems and a wide range of antivirus products, no
fully effective solution currently exists. This raises the question of why, despite so much investment, we
have yet to develop an intrusion detection system capable of averting such attacks and the losses they
cause. Antivirus systems cannot provide adequate security because they rely on misuse-based
(signature-based) detection. An attack that has not previously been observed has no signature, so such
systems cannot cope with inventive and sophisticated intrusion methods. Anomaly-based intrusion
detection systems were developed to address this issue: they flag undesirable changes in network data or
deviations from normal behavior, which means they can, in principle, detect novel intrusion types.

This study presents a comparative analysis of the performance of several classification algorithms on the
NSL-KDD dataset. We first cleaned the dataset and explored it using graphs and charts. We then
preprocessed the data and built a machine learning pipeline to train classification models. The classifiers
trained on the data were Support Vector Classifier, Decision Tree, Random Forest, Voting Classifier,
Naive Bayes, K-Nearest Neighbours, and Logistic Regression.

The paper begins with an introduction to the dataset, followed by an introduction to the machine
learning algorithms used. We then explain the implementation methodology, including the
preprocessing steps. Finally, we compare the outcomes of the models with each other
and conclude with the findings of the work.

The NSL-KDD dataset is made up of selected records from the full KDD dataset. Because the NSL-KDD
training set contains no duplicate records, classifiers are not biased toward frequently repeated records.
The test set likewise contains no duplicate records.

The training set contains 21 distinct attack types, compared to 37 in the test set. Known attacks are those
that appear in the training set, whereas novel attacks are those that appear only in the test set.

There are four different forms of attacks: DoS, Probe, U2R, and R2L.

Table 1. Attack Classes

Attack Class           Attack Types
DoS                    Back, Land, Neptune, Pod, Smurf, Teardrop, Mailbomb,
                       Processtable, Udpstorm, Apache2, Worm
Probe                  Satan, IPsweep, Nmap, Portsweep, Mscan, Saint
User to Root (U2R)     Buffer_overflow, Loadmodule, Rootkit, Perl, Sqlattack,
                       Xterm, Ps
Remote to Local (R2L)  Guess_password, Ftp_write, Imap, Phf, Multihop, Warezmaster,
                       Xlock, Xsnoop, Snmpguess, Snmpgetattack, Httptunnel,
                       Sendmail, Named

In this exploratory study, we use the NSL-KDD dataset with 42 attributes. The ‘class’ attribute, the 42nd
attribute in the dataset, specifies whether a given instance is a normal connection or an attack. The
remaining 41 attributes each fall into one of four categories, as shown below:

• Basic (B) features are attributes of individual TCP connections.
• Content (C) features are derived from the content of a connection, as suggested by domain knowledge.
• Traffic (T) features are computed over a two-second time window.
• Host (H) features are designed to capture attacks that last longer than two seconds.

Table 2. Attribute Information

S.No  Label  Attribute Name     S.No  Label  Attribute Name     S.No  Label  Attribute Name
1     B      Duration           15    C      Su_attempted       29    T      Serv_serror_rate
2     B      Protocol_type      16    C      Num_root           30    T      Srv_rerror_rate
3     B      Service            17    C      Num_file_creati    31    T      Srv_diff_host_rate
4     B      Src_bytes          18    C      Num_shells         32    H      Dst_host_count
5     B      Dst_bytes          19    C      Num_access_file    33    H      Dst_host_srv_count
6     B      Flag               20    C      Num_outbound_cmds  34    H      Dst_host_same_srv_rate
7     B      Land               21    C      Is_hot_login       35    H      Dst_host_diff_srv_rate
8     B      Wrong_fragment     22    C      Is_guest_login     36    H      Dst_host_same_src_port_rate
9     B      Urgent             23    T      Count              37    H      Dst_host_srv_diff_hos
10    C      Hot                24    T      Serror_rate        38    H      Dst_host_serror_rate
11    C      Num_failed_logins  25    T      Rerror_rate        39    H      Dst_host_srv_serror_rate
12    C      Logged_in          26    T      Same_srv_rate      40    H      Dst_host_rerror_rate
13    C      Num_compromised    27    T      Diff_srv_rate      41    H      Dst_host_srv_rerror_rate
14    C      Root_shell         28    T      Srv_count          42    -      Class


The Support Vector Machine (SVM) is a widely used supervised learning model for classification and
regression problems. In this study, we use SVM for classification. The goal of SVM is to find a decision
boundary that partitions the n-dimensional feature space into classes so that new data points can be
assigned to the correct category. The optimal decision boundary is known as a hyperplane.
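
As an illustration of the hyperplane idea, the standard textbook formulation for the linearly separable, two-class case can be written as follows. The symbols w (weight vector), b (bias), x_i (training points) and y_i in {-1, +1} (class labels) are notation introduced here, not taken from the study, and the multi-class setting used in this work would require an extension such as one-vs-one or one-vs-rest.

```latex
% Separating hyperplane and the resulting decision rule
\[
  \mathbf{w}^{\top}\mathbf{x} + b = 0,
  \qquad
  \hat{y} = \operatorname{sign}\!\left(\mathbf{w}^{\top}\mathbf{x} + b\right)
\]
% Maximum-margin (hard-margin) training objective
\[
  \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
  \quad\text{subject to}\quad
  y_i\left(\mathbf{w}^{\top}\mathbf{x}_i + b\right) \ge 1
  \ \ \text{for all } i
\]
```

Minimizing the norm of w maximizes the margin between the hyperplane and the closest training points of each class.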

A decision tree is a supervised classification technique in which internal nodes represent tests on dataset
attributes, branches represent decision rules, and leaf nodes represent outcomes. A decision tree
therefore has two kinds of nodes: decision nodes and leaf nodes. Decision nodes test an attribute and
have several branches, whereas leaf nodes hold the output of those decisions and have no further
branches.
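
The node structure described above can be sketched in a few lines of Python. This is a hand-built toy tree, not a trained model: the feature names are borrowed from Table 2, while the thresholds, the leaf labels, and the helper names (`Leaf`, `Decision`, `predict`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # class returned when this leaf is reached


@dataclass
class Decision:
    feature: str      # attribute tested at this node
    threshold: float  # decision rule: go left if value <= threshold
    left: "Node"
    right: "Node"


Node = Union[Leaf, Decision]


def predict(node: Node, record: dict) -> str:
    """Walk from the root to a leaf, following one branch per decision node."""
    while isinstance(node, Decision):
        if record[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label


# Invented thresholds and labels, for illustration only.
tree = Decision(
    feature="serror_rate",
    threshold=0.5,
    left=Decision(
        feature="src_bytes",
        threshold=1000.0,
        left=Leaf("Normal"),
        right=Leaf("R2L"),
    ),
    right=Leaf("DoS"),
)

print(predict(tree, {"serror_rate": 0.9, "src_bytes": 0.0}))
print(predict(tree, {"serror_rate": 0.0, "src_bytes": 200.0}))
```

In a trained tree, the tested attributes and thresholds are chosen by the learning algorithm rather than written by hand.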

Random forest is built on ensemble learning, a method for solving a complex problem and improving
model performance by combining multiple classifiers. As the name implies, a random forest is a classifier
that trains a number of decision trees on different subsets of the dataset and combines their results to
improve predictive accuracy. Instead of relying on a single decision tree, the random forest collects the
prediction from each tree and outputs the class that receives the majority of votes. Increasing the number
of trees generally improves accuracy and reduces the risk of overfitting.
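
The two ingredients mentioned above, training each tree on a different subset and combining predictions by majority vote, can be sketched as follows. This is a minimal illustration that assumes bootstrap sampling (drawing rows with replacement) as the subset strategy; the function names and the example votes are hypothetical, and tree training itself is omitted.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: the subset one tree would be trained on."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(labels):
    """Return the label predicted by the most trees (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))            # stand-in for row indices of the training set
subset = bootstrap_sample(rows, rng)

# Hypothetical predictions from five trees for a single connection record.
votes = ["DoS", "Normal", "DoS", "Probe", "DoS"]
print(majority_vote(votes))
```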

Naive Bayes is one of the most fundamental machine learning algorithms. It is based on Bayes' theorem
and assumes that each attribute contributes independently and equally to predicting the outcome.
Because these assumptions rarely hold in practice, Naive Bayes often exhibits lower accuracy than the
other algorithms considered here.
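
The independence assumption can be made explicit with the usual Naive Bayes factorization. The notation is introduced here for illustration: c denotes a class (for example, Normal or DoS) and x_1, ..., x_n denote the attribute values of one connection record.

```latex
\[
  P(c \mid x_1,\dots,x_n) \;\propto\; P(c)\prod_{j=1}^{n} P(x_j \mid c),
  \qquad
  \hat{c} = \arg\max_{c}\; P(c)\prod_{j=1}^{n} P(x_j \mid c)
\]
```

The product over attributes is valid only if the attributes are conditionally independent given the class, which is the assumption that correlated network features are likely to violate.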

The KNN method assumes that similar cases belong to the same category, and it assigns a new case to the
category that is most common among its most similar existing cases. The algorithm stores all available
data and classifies a new data point based on its similarity to the stored data, so new data can be assigned
to a suitable category as it arrives. KNN is a non-parametric algorithm, which means it makes no
assumptions about the underlying data distribution. It is also known as a lazy learner because it does not
learn from the training set up front; instead, it stores the dataset and defers computation until a new
point has to be classified.
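
A minimal sketch of this procedure is shown below. It assumes Euclidean distance as the similarity measure and k = 3, neither of which is stated in the study, and the two-dimensional points are invented toy data rather than NSL-KDD records.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Lazy" learning: no model is fitted; all distances are computed at query time.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]


# Toy, already-scaled feature vectors (invented for illustration).
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ["Normal", "Normal", "Normal", "DoS", "DoS", "DoS"]

print(knn_predict(train_X, train_y, (0.15, 0.10)))
print(knn_predict(train_X, train_y, (1.00, 0.95)))
```

Because the prediction depends directly on distances, features with large numeric ranges would dominate unless the data is scaled first.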



Figure 1 depicts the sequential flow of the implementation. We began by loading the training and test
data using the Pandas library. The training set had a shape of [125973, 42], whereas the test set had a
shape of [22544, 42].

The target values (the ‘attack’ class) fall into five categories: Normal, Probe, DoS (denial of service),
R2L (remote to local), and U2R (user to root). We mapped the ‘attack’ class to these categories.

Once we had mapped the values to their respective categories, we performed a thorough analysis of the
data as a precursor to preprocessing. Data preprocessing is an essential stage in machine learning because
the quality of the data, and of the information that can be extracted from it, directly influences a model's
capacity to learn; the data must therefore be preprocessed before training. The table below shows the
mapping of the ‘attack’ class.

Figure 1. Flow Diagram

Table 3. Mapping Table

Mapped Attack Class  Original Attack Class
Probe                ipsweep, satan, nmap, portsweep, saint, mscan
DoS                  teardrop, pod, land, back, neptune, smurf, mailbomb, udpstorm, apache2
U2R                  perl, loadmodule, buffer_overflow, xterm, ps, sqlattack, httptunnel
R2L                  ftp_write, phf, guess_passwd, warezmaster, warezclient, imap, spy, multihop,
                     named, snmpguess, worm, snmpgetattack, xsnoop, xlock, sendmail
Normal               normal
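
The mapping in Table 3 amounts to a dictionary lookup. The sketch below uses plain Python and copies the class assignments from Table 3; the names `ATTACK_CLASSES`, `LABEL_TO_CLASS` and `map_attack` are hypothetical, and the sample labels are chosen only to show one record per class.

```python
# Class assignments copied from Table 3.
ATTACK_CLASSES = {
    "Probe": ["ipsweep", "satan", "nmap", "portsweep", "saint", "mscan"],
    "DoS": ["teardrop", "pod", "land", "back", "neptune", "smurf",
            "mailbomb", "udpstorm", "apache2"],
    "U2R": ["perl", "loadmodule", "buffer_overflow", "xterm", "ps",
            "sqlattack", "httptunnel"],
    "R2L": ["ftp_write", "phf", "guess_passwd", "warezmaster", "warezclient",
            "imap", "spy", "multihop", "named", "snmpguess", "worm",
            "snmpgetattack", "xsnoop", "xlock", "sendmail"],
    "Normal": ["normal"],
}

# Invert the table: original attack label -> mapped attack class.
LABEL_TO_CLASS = {
    label: attack_class
    for attack_class, labels in ATTACK_CLASSES.items()
    for label in labels
}


def map_attack(label):
    """Return the mapped class; raises KeyError for a label missing from Table 3."""
    return LABEL_TO_CLASS[label.lower()]


raw = ["neptune", "normal", "nmap", "guess_passwd", "perl"]
mapped = [map_attack(label) for label in raw]
print(mapped)
```

With the data held in a Pandas DataFrame, the same dictionary could be applied to the ‘attack’ column with `Series.map`, for example `df["attack"].map(LABEL_TO_CLASS)`, where `df` is an assumed variable name.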


Our preprocessing pipeline includes feature scaling (using a standard scaler), label encoding, one-hot
encoding, and sampling. The procedural flow is depicted in figure x. After mapping the ‘attack’
column, we proceeded to feature scaling. We used scikit-learn's StandardScaler, which
standardizes features by removing the mean and scaling to unit variance. Every column of int64
and float64 values was scaled so that each feature contributes proportionately to the
We then proceeded to label encoding, which converts each categorical value in a column to a
corresponding numerical value. For this study, we label-encoded the ‘attack’ column (in both the
training and test sets) into five categories labeled 0, 1, 2, 3, and 4, as follows:

Table 4. Attack Labels

Attack Type  Label
Normal       0
DoS          1
Probe        2
R2L          3
U2R          4
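
The scaling and label-encoding steps can be illustrated without any external library. The sketch below reproduces the arithmetic that a standard scaler performs, z = (x - mean) / std with the population standard deviation, and applies the label assignment from Table 4. It is a pure-Python stand-in rather than the scikit-learn code used in the study; the function name `standardize` and the sample values are hypothetical.

```python
import statistics


def standardize(values):
    """Scale a numeric column to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Label assignment copied from Table 4.
ATTACK_LABELS = {"Normal": 0, "DoS": 1, "Probe": 2, "R2L": 3, "U2R": 4}

src_bytes = [0.0, 100.0, 200.0, 300.0, 400.0]  # made-up column values
scaled = standardize(src_bytes)
encoded = [ATTACK_LABELS[name] for name in ["Normal", "DoS", "U2R"]]

print([round(z, 3) for z in scaled])
print(encoded)
```

In practice the mean and standard deviation should be estimated on the training set only and then reused to transform the test set, so that no information from the test data leaks into training.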

Figure 2 shows the attack class frequencies in the training set and the test set. The distribution is
clearly imbalanced, and for this reason we performed sampling. Data sampling provides a variety of
strategies for altering the training set so that the class distribution is balanced, or at least better
balanced. Once the data has been balanced, models can be trained directly on the resampled dataset
without further changes.

Figure 2. Attack Class Distribution
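
The study does not state which sampling strategy was applied, so the following is only one possible approach: random oversampling, which duplicates minority-class rows until every class is as large as the largest one. The function name `random_oversample` and the toy class counts are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes match the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy imbalanced training labels (invented counts).
labels = ["Normal"] * 6 + ["DoS"] * 3 + ["U2R"] * 1
rows = list(range(len(labels)))  # stand-in for feature rows

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

Resampling is normally applied to the training set only, so that the test set keeps the original class distribution.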


The NSL-KDD dataset contains normal and attack network connections and poses a multi-class
classification problem. In this study, we applied a variety of classification methods to the preprocessed
NSL-KDD dataset. The experimental analysis showed that the random forest technique achieved the
highest F1 score of all the algorithms, with an accuracy of 99.9%. In the future, we intend to use ensemble
classifiers and to experiment with different subsets of features, comparing performance as the size of the
feature subsets changes.

Figure 3a. F1-Score Comparison Figure 3b. Accuracy Comparison

Table 5. Accuracy and F1 Scores of Machine Learning Algorithms

Performance   SVM    Naive Bayes  Decision Tree  Random Forest  KNN
Accuracy (%)  98.91  97.37        99.8           99.9           99.77
F1-Score      0.99   0.97         1.00           1.00           1.00

The table above summarizes the findings of our study. While all the classification algorithms exhibit high
accuracy, the Decision Tree and Random Forest classifiers achieve the highest accuracy and F1 scores of
all the algorithms. We attribute these high accuracies to the way tree-based models select informative
features as part of training, which is not the case for the other classification algorithms. Naive Bayes, on
the other hand, reported the lowest accuracy. Its assumption of independent predictors, which may not
hold for the NSL-KDD dataset, could explain the lower accuracy. Ensemble classifiers trained on
different feature subsets may achieve accuracy comparable to Random Forest because of their similarity
to it.
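
For reference, the metrics reported in Table 5 can be computed from true and predicted labels as sketched below. The study does not say how F1 was averaged across the five classes, so macro-averaging is assumed here; the helper names and the toy labels are hypothetical and are not the study's predictions.

```python
def per_class_scores(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(per_class_scores(y_true, y_pred, c)[2] for c in classes) / len(classes)


# Toy labels using the encoding of Table 4 (0 = Normal, 1 = DoS, 4 = U2R).
y_true = [0, 0, 0, 0, 1, 1, 1, 4]
y_pred = [0, 0, 0, 0, 1, 1, 1, 0]  # the single U2R record is missed

print(round(accuracy(y_true, y_pred), 3))
print(round(macro_f1(y_true, y_pred), 3))
```

In this toy case the accuracy stays high while the macro F1 drops sharply, because the only U2R record is misclassified; this is why F1 is worth reporting alongside accuracy on an imbalanced dataset.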


