Evaluating Deep Neural Network and Classical Machine Learning Algorithms
By
J.S. RISHIDEV
(Reg.No: 20204012533112)
DECLARATION
MASTER OF SCIENCE
in
Cyber Security
on
is my original work and that it has not previously formed the basis for the award of any
J.S. RISHIDEV
ACKNOWLEDGEMENT
I thank God the Almighty for showering this blessing which helped me to
J.S. RISHIDEV
List of Figures
List of Tables
List of Abbreviations
o ML - Machine Learning
o DNN - Deep Neural Network
o NIDS - Network-based Intrusion Detection System
o IDS - Intrusion Detection System
o AI - Artificial Intelligence
o HID - Host-based Intrusion Detection
o DL - Deep Learning
o NN - Neural Network
o ANN - Artificial Neural Network
o MLP - Multilayer Perceptron
o LNN - Lightweight Deep Neural Network
o RF - Random Forest
o SNN - Shared Nearest Neighbor
o RNN - Recurrent Neural Network
o FNN - Feed-Forward Neural Network
o NB - Naive Bayes
o SVM - Support Vector Machine
o HMM - Hidden Markov Model
o RBF - Radial Basis Function
o CNN - Convolutional Neural Network
o ReLU - Rectified Linear Unit
o BP - Back Propagation
o AF - Activation Function
o DoS - Denial-of-Service Attack
o U2R - User-to-Root Attack
o R2L - Remote-to-Local Attack
o TPR - True Positive Rate
o FPR - False Positive Rate
o FNR - False Negative Rate
o CR - Classification Rate
Table of contents
1 Introduction
2 Review of Literature
4 Evaluation of DNN
4.1. Deep Neural Network
4.2. Activation Function
4.3. Dataset description
4.4. Identifying network parameters
4.5. Identifying network structures
4.6. Proposed Architecture
6 Conclusion
References
ABSTRACT
To secure a network against intrusion and to protect the confidentiality of data, an
Intrusion Detection System (IDS) plays a vital role, and modern IDSs call for the integration
of Deep Neural Networks (DNNs). Cyber security is an emerging field in the IT sector. As more
devices are connected to the internet, the attack surface available to attackers is steadily
increasing. Network-based Intrusion Detection Systems (NIDS) can be used to detect malicious
traffic in networks, and Machine Learning is a promising approach for improving the detection
rate. We explore the use of Deep Neural Networks to improve the accuracy of a NIDS and
compare the results with classical Machine Learning algorithms.
In this work, we explore how to model an intrusion detection system based on deep
learning, and we propose a deep learning approach for intrusion detection using deep neural
networks (DNN-IDS). Moreover, we study the performance of the model in binary
classification, and how the number of neurons and different learning rates affect the
performance of the proposed model. We compare it with Naïve Bayes, Decision Tree, and other
machine learning methods proposed by previous researchers on the benchmark data set.
Deep Neural Networks are used to predict attacks in a Network Intrusion Detection
System (NIDS). A DNN with a learning rate of 0.1 is trained for 100 epochs, and the
KDDCup-’99’ dataset is used for training and benchmarking the network. For comparison
purposes, training is done on the same dataset with other machine learning algorithms such
as Decision Tree, Logistic Regression and Naive Bayes, and with DNNs of 1 to 5 layers. The
experimental results show that DNN-IDS is very suitable for building a classification model
with high accuracy and that its performance is superior to that of traditional machine
learning classification methods in both binary and multiclass classification. The DNN-IDS
model improves the accuracy of intrusion detection and provides a new research method for
intrusion detection.
CHAPTER 1
INTRODUCTION
Cyber-attacks on digital infrastructure can harm people, their property and the
environment. Attacks on industrial process control systems could lead to
life-threatening malfunctions or emissions of dangerous chemicals into the environment. A
recent example is the “Triton” attack, which targeted the process control systems of
petrochemical plants. Intrusion Detection Systems (IDS) can be used to detect such cyber
security threats. An IDS monitors networks or computers in order to detect malicious activity.
Machine learning (ML) is a type of artificial intelligence (AI) that allows software
applications to become more accurate at predicting outcomes without being explicitly
programmed to do so. Machine learning algorithms use historical data as input to predict new
output values. In this way, machine learning lets computers build models from sample data
in order to automate decision-making based on data inputs. A machine learning model is an
expression of an algorithm that searches through large volumes of data to find patterns or
make predictions. Fueled by data, ML models are the mathematical engines of artificial
intelligence. Supervised learning is the type of machine learning in which machines are
trained using well-labelled training data and, on the basis of that data, predict the
output. Labelled data means that each input is already tagged with the correct output. In
supervised learning, the training data acts as the supervisor that teaches the machine to
predict the output correctly, in the same way that a student learns under the supervision of
a teacher. Supervised learning is therefore a process of providing both input data and the
correct output data to the machine learning model. The aim of a supervised learning
algorithm is to find a mapping function that maps the input variable (x) to the output
variable (y). Supervised learning problems can be further divided into two types:
regression and classification. Regression algorithms are used when there is a relationship
between the input variable and a continuous output variable that is to be predicted.
Classification algorithms are used when the output variable is categorical, which means that
it takes one of two or more discrete classes.
Unsupervised learning is a machine learning technique in which models are not supervised
using a labelled training dataset. Instead, the models themselves find hidden patterns and
insights in the given data. It can be compared to the learning that takes place in the human
brain when learning new things. Unsupervised learning cannot be directly applied to a
regression or classification problem because, unlike supervised learning, we have the input
data but no corresponding output data. The goal of unsupervised learning is to find the
underlying structure of a dataset, group the data according to similarities, and represent
the dataset in a compressed format. Unsupervised learning problems can be further divided
into two types: clustering and association. Clustering is a method of grouping objects into
clusters such that the objects with the most similarities remain in one group and have few
or no similarities with the objects of another group. Cluster analysis finds the
commonalities between data objects and categorizes them according to the presence or absence
of those commonalities. Association rule learning is an unsupervised method used for finding
relationships between variables in a large database. It determines the sets of items that
occur together in the dataset. Association rules can make marketing strategies more
effective: for example, people who buy item X (say, bread) also tend to purchase item Y
(butter or jam).
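The bread and butter example can be made concrete with a small sketch. The measures used here, support and confidence, are the standard ones for association rules but are not named in the text above, and the shopping baskets are hypothetical values chosen only for illustration.

```python
# Hypothetical shopping baskets; the items are made up for illustration.
transactions = [
    {"bread", "butter"},
    {"bread", "jam"},
    {"bread", "butter", "milk"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) = support(A and C) / support(A)."""
    joint = support(antecedent | consequent, transactions)
    return joint / support(antecedent, transactions)

# How often bread and butter occur together, and how often butter follows bread.
bread_butter_support = support({"bread", "butter"}, transactions)
bread_to_butter_conf = confidence({"bread"}, {"butter"}, transactions)
```

A rule such as "bread implies butter" is usually kept only if both values exceed thresholds chosen by the analyst.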
Artificial intelligence and machine learning are closely related parts of computer
science. These two technologies are among the most popular technologies used for creating
intelligent systems. Although they are related and people sometimes use the terms as
synonyms, they are different concepts. AI is the broader concept of creating intelligent
machines that can simulate human thinking capability and behavior, whereas machine learning
is an application or subset of AI that allows machines to learn from data without being
programmed explicitly. Artificial intelligence is now widely employed to help detect
intrusions on computer systems because of its efficient and adaptive nature. Neural networks
are the branch of artificial intelligence that receives the most attention. A neural network
analyzes the information and provides a probability estimate that the data matches the
characteristics it has been trained to recognize, which helps considerably in reducing the
false positive rate.
Current intrusion detection systems offer clear advantages in network protection.
However, in the face of today's complex network environment and constantly evolving attack
methods, most traditional intrusion detection systems still rely on rule bases and
traditional machine learning. These algorithms are computationally complex and have been
unable to adapt to the new network environment. Many problems have arisen: low detection
efficiency, poor adaptability, and high false negative and false positive rates. In recent
years, theoretical and practical results in deep learning have emerged in rapid succession.
Deep learning has achieved impressive results in fields such as speech recognition and image
classification, and it is suitable for processing large-scale data. Aiming at the
shortcomings of traditional intrusion detection methods, namely slow detection speed and
weak adaptive ability, this paper proposes a hybrid intrusion detection method based on deep
confidence neural network. The method realizes effective identification and classification
of massive, high-dimensional and nonlinear intrusion network data, and improves the accuracy
of IDS classification.
In general, intrusion detection can be classified into two main categories:
host-based intrusion detection (HID) and network-based intrusion detection (NID). HID mainly
uses the internal logs of the operating system as its data source for audit and judgment.
This detection method is not sensitive to network traffic, and it can accurately locate and
define the specific operations of an intrusion. However, it occupies a lot of resources on
the host itself and depends on the reliability of the host. NID analyzes network traffic
data, finds suspicious intrusions hidden in the traffic, raises the corresponding alarms and
intercepts the detected intrusions. Network traffic data has high dimensionality and
complex features. NID is in fact a typical classification problem: it must automatically
and promptly identify possible attacks and threats hidden in network traffic and determine
their specific types.
not have powerful computational resources. Therefore, it is essential to design efficient and
high-performance models for IDS.
Deep Neural Networks (DNN) are an important machine learning method and have been
widely used in many fields. The DNN is inspired by the structure of the mammalian visual
system. Biologists have found that the mammalian visual system contains many layers of
neural network. It processes information from the retina to the visual center layer by
layer, sequentially extracting edge features, part features and shape features, and
eventually forming abstract concepts. In general, the depth of a DNN is greater than or
equal to 4. For example, a Multilayer Perceptron (MLP) with more than one hidden layer is a
DNN. A DNN extracts features layer by layer and combines low-level features to form
high-level features, which yields a distributed representation of the data. Compared with
shallow neural networks (NN), a DNN has better feature representation and a greater ability
to model complex mappings. In other words, a DNN is an ANN with multiple hidden layers
between the input and output layers. Like shallow ANNs, DNNs can model complex non-linear
relationships.
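The layered structure described above can be illustrated with a minimal forward pass through an MLP with two hidden layers. This is only a sketch under stated assumptions: the layer sizes, the random weight initialisation, the ReLU hidden activation and the sigmoid output are illustrative choices, not the architecture proposed in this thesis.

```python
import math
import random

def relu(values):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, v) for v in values]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases):
    """One fully connected layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an arbitrary illustrative choice)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, layers):
    """Hidden layers use ReLU; the single output unit uses a sigmoid."""
    for weights, biases in layers[:-1]:
        x = relu(dense(x, weights, biases))
    weights, biases = layers[-1]
    return sigmoid(dense(x, weights, biases)[0])

rng = random.Random(0)
sizes = [4, 8, 8, 1]  # 4 inputs, two hidden layers of 8 units, 1 output
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
score = forward([0.2, 0.0, 1.0, 0.5], layers)  # value in (0, 1), read as P(attack)
```

Each hidden layer re-combines the outputs of the layer before it, which is the sense in which low-level features are combined into high-level ones.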
Traditional static security technologies such as firewalls and encryption have a
certain effect, but their lack of a mechanism for active intrusion detection and their need
for manual implementation and maintenance make it difficult for them to meet current
security requirements. Intrusion detection based on neural networks is an active, dynamic
security technology that provides effective, real-time protection against internal
attacks, external attacks and misuse. At the same time, because neural networks have
advantages in self-organization, self-learning and generalization, such an intrusion
detection method not only recognizes known attack patterns well but can also detect unknown
attacks. It has therefore become a research hotspot in the field of network security.
Self-learning systems are one of the effective methods for dealing with present-day
attacks. They use supervised, semi-supervised and unsupervised machine learning mechanisms
to learn the patterns of various normal and malicious activities from a large corpus of
normal and attack connection records. Although various machine learning based solutions
have been proposed, their applicability in commercial systems is at an early stage. Existing
machine learning based solutions produce a high false positive rate at a high computational
cost. The primary reason is that machine learning classifiers learn the characteristics of
simple TCP/IP features locally. Deep learning is a complex subset of machine learning that
learns hierarchical feature representations and hidden sequential relationships by
passing the TCP/IP information through several layers. Deep learning has achieved
significant results in long-standing artificial intelligence (AI) tasks in the fields of
image processing, speech recognition, natural language processing and many others.
There are multiple systems that can be used for shielding information and
communication technology systems from vulnerabilities, namely anomaly detection systems and
IDSs. A demerit of anomaly-detection systems is the complexity involved in defining rules.
Each protocol being analyzed must be defined, implemented and tested for accuracy. Another
pitfall of anomaly detection is that harmful activity that falls within the usual usage
pattern is not recognized. Therefore, an IDS that can adapt itself to recent novel attacks
and can be trained and deployed using datasets of irregular distribution becomes
indispensable. This work analyzes the effectiveness of various shallow and deep networks for
NIDS using the openly available network-based intrusion data set KDDCup-’99’. The
KDDCup-’99’ data set was built by pre-processing the tcpdump data of the 1998 DARPA
intrusion detection evaluation data set. The feature extraction from the tcpdump data was
facilitated by the MADAM ID data mining framework. The data set was built by capturing
network traffic for ten weeks from thousands of UNIX systems and hundreds of users accessing
those systems at the MIT Lincoln Laboratory. The data captured during the first 7 weeks were
used for training and the data from the last 3 weeks were used for testing.
The rest of the thesis is organized as follows. Chapter II reviews the work related to
IDS, different deep neural networks and published discussions of the KDDCup-’99’ dataset.
Chapter III takes an in-depth look at the existing methods of Deep Neural Networks (DNN)
and the applications of the ReLU activation function. Chapter IV explains the method
proposed in this thesis, Chapter V evaluates and discusses the final results, and Chapter VI
concludes and outlines possible future directions for this research work.
CHAPTER 2
REVIEW OF LITERATURE
predictions for a given instance. The variability across base learners and the validation
residuals are in turn harnessed to compute confidence intervals using the conformal
prediction framework. Using a set of 24 diverse IC50 data sets from ChEMBL 23, they show
that Snapshot Ensembles perform on par with Random Forest (RF) and ensembles of
independently trained deep neural networks. In addition, it was found that the confidence
regions predicted using the Deep Confidence framework span a narrower set of values.
Overall, Deep Confidence represents a highly versatile error prediction framework that can
be applied to any deep learning-based application at no extra computational cost [7].
However, there is no standard theory to guide the selection of the right deep learning
tools, as this requires knowledge of topology, training method and other parameters.
A lightweight deep neural network (LNN) model structure using depthwise
convolution is introduced in [8], which achieves satisfactory performance with lower
computational cost and smaller model size. Lightweight units are the key building blocks of
the model; they can fully extract data features while reducing computational cost by
expanding and compressing feature maps. A lightweight unit first expands the feature maps
through an expansion layer. After expansion, depthwise convolution is used to extract more
features, and then a compression layer compresses the feature maps to shrink the network
again. The authors also adopt an inverse residual structure and a channel shuffle
operation to achieve more effective training. They compared this model with other models
for NID performance within modern networks. The experimental results on the UNSW-NB15
dataset demonstrate that their approach not only reduced the computational cost and the
model size, but also achieved high detection accuracy and a low false alarm rate in binary
classification problems [8].
The Australian Centre for Cyber Security created the UNSW-NB15 dataset using the
IXIA PerfectStorm tool. This dataset provides up to 100 GB of raw traffic data, including
nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic,
Reconnaissance, Shellcode, and Worms. The large number of original data samples in
UNSW-NB15 is used to train the classification of network packets that contain
application-layer data. The dataset contains a massive amount of data with the same MAC
address and IP address, which may affect the accuracy of detection. Therefore, the Ethernet
header and IP header of each packet are removed to eliminate the influence of fixed MAC and
IP addresses on the model’s accuracy and to obtain a simplified packet (Reduced-Data). After
this reduction, the remaining data contains at most 1460 bytes of content. It is formed into
a two-dimensional matrix of size 38 × 38 by removing the trailing bytes if there are more
than 1444 bytes, or by padding with zeros to 1444 bytes if there are fewer. Subsequently,
the Reduced-Data is standardized and normalized, and then sent to the deep learning model
as input data for training or inference [9].
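The packet reduction step described above can be sketched as follows. The sketch assumes an untagged Ethernet II frame (14-byte header) carrying IPv4, and it scales bytes to the range [0, 1] by dividing by 255; the source does not state the exact normalization used in [9], so that choice and the function name `reduce_packet` are assumptions.

```python
SIDE = 38
TARGET_LEN = SIDE * SIDE  # 1444 bytes
ETHERNET_HEADER_LEN = 14  # assumption: untagged Ethernet II frame

def reduce_packet(frame: bytes) -> list:
    """Strip Ethernet and IPv4 headers, then truncate or zero-pad to 38 x 38."""
    ip_packet = frame[ETHERNET_HEADER_LEN:]
    # The low nibble of the first IPv4 byte is the header length in 32-bit words.
    ip_header_len = (ip_packet[0] & 0x0F) * 4
    payload = ip_packet[ip_header_len:]
    # Drop trailing bytes beyond 1444, or pad with zeros up to 1444.
    fixed = payload[:TARGET_LEN].ljust(TARGET_LEN, b"\x00")
    # Assumed normalization: scale each byte value to [0, 1].
    scaled = [b / 255.0 for b in fixed]
    return [scaled[r * SIDE:(r + 1) * SIDE] for r in range(SIDE)]

# Synthetic frame: 14-byte Ethernet header, 20-byte IPv4 header (IHL = 5),
# followed by 100 payload bytes of value 255.
frame = bytes(14) + bytes([0x45]) + bytes(19) + bytes([255]) * 100
matrix = reduce_packet(frame)
```

Removing the headers before reshaping means that fixed MAC and IP addresses never reach the model, which is the purpose stated in the text.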
The main drawback of anomaly detection IDSs is that it is difficult to define what
normal behavior of the system is, and without a proper and effective means of discriminating
normal from abnormal profiles, an anomaly detection IDS will hardly be of any use. The need
to overcome this shortcoming led researchers to consider neural network techniques,
because neural networks have an enhanced ability to classify by learning and are good at
pattern recognition [10]. There are many advantages to using neural networks in IDS. They
provide a more accurate statistical distribution than statistical models, because most
statistical models make assumptions about the underlying distribution of user behavior.
These assumptions may not always be valid and can at times lead to a high false-alarm rate.
A neural network generally makes fewer assumptions and normally modifies them through the
learning process. Neural networks have a low development cost, whereas statistical model
algorithms cost more to build, because it is costly to reconstruct statistical algorithms
after removing assumptions that are found to be invalid. Neural networks are highly scalable
compared to other techniques, and they are good at reducing both the false positive and the
false negative error rates. The false positive rate counts false alarms and the false
negative rate counts missed intrusions. Anomaly detection IDSs tend to have more false
positives than false negatives. IDS designers exploit neural networks either as a pattern
recognition technique or for classification and prediction. Pattern recognition is realized
by using a multilayered feed-forward neural network that has been trained accordingly.
During training, the neural network parameters are optimized to associate outputs with the
corresponding input patterns. In neural network algorithms, an input pattern is identified
by matching its output with the known class. During a training session, the output is
compared with the corresponding class; if it does not match, the weight and threshold values
are adjusted repeatedly until the output matches the desired class. Denial of service and
probing are some of the types of attacks that are known to be easily detected by pattern
recognition. Classification and prediction can be implemented using Kohonen’s
self-organizing maps to classify inputs into clusters, with well-defined boundaries for
normal and abnormal profiles [11]. The neural network will evolve through the learning
process to identify the relevant profile for any type of network traffic.
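The false positive and false negative rates mentioned above, together with the true positive rate, can be computed from the four confusion-matrix counts. The sketch below treats an attack as the positive class; the counts are hypothetical and are not results from any experiment in this thesis.

```python
def detection_rates(tp, fp, tn, fn):
    """Return (TPR, FPR, FNR) from confusion-matrix counts, attack = positive."""
    tpr = tp / (tp + fn)  # attacks correctly flagged
    fpr = fp / (fp + tn)  # normal traffic wrongly flagged (false alarms)
    fnr = fn / (fn + tp)  # attacks that were missed
    return tpr, fpr, fnr

# Hypothetical counts: 100 attack records and 100 normal records.
tpr, fpr, fnr = detection_rates(tp=90, fp=5, tn=95, fn=10)
```

Note that TPR and FNR always sum to one, so reducing missed intrusions is the same as raising the detection rate.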
In the proposed model, the authors assume that incoming packets have been extracted
using any one of the available methods. The model can be used for both pattern
recognition and classification neural network techniques. For the training phase, they have
both attack and normal data sets. Both data sets and the neural network learning modules
are connected to the neural network model module, because it is the main decider of what
type of training scheme is employed. For each training round, the output is compared with
the expected output at the validation module and an appropriate cost is assigned to the
whole operation; this determines the extent of modification to both weights and thresholds.
The exercise continues until the neural network model is fully trained. The training
exercise is then conducted periodically or, in some secured IDSs, continuously, which
improves the performance of the IDS. Note, however, that since the IDS itself is not immune
from attack, continuous training should be avoided, because at least theoretically somebody
can implant a malicious data set as normal profile data [10].
The authors of [12] discussed the applicability of a shared nearest neighbor (SNN)
classifier to intrusion detection. They claimed that their method was the best one,
reporting a higher detection rate. This claim is hard to accept because they used a
custom-built data set.
A highly scalable and hybrid DNN framework called scale-hybrid-IDS-AlertNet
was proposed in the study by Vinayakumar et al. [16]. The framework may be used in real
time to effectively monitor network traffic to alert system administrators to possible cyber-
attacks. It was composed of a distributed deep learning model with DNNs for handling and
analyzing very large-scale data in real time. The authors tested the framework on various
data sets, including NSL-KDD and KDD’99. On NSL-KDD, the best F-Measure for binary
classification was 80.7% and 76.5% for multiclass classification.
Yin et al. [18] proposed a model for intrusion detection using recurrent neural
networks (RNNs). RNNs are especially suited to data that are time dependent. The model
consisted of forward and back propagation stages. Forward propagation calculates the output
values, whereas back propagation passes residuals accumulated to update the weights. The
model consisted of 20 hidden nodes, with Sigmoid as the activation function and Softmax as
the classification function. The learning rate was set to 0.1, and the number of epochs to 50.
Experimental results using the NSL-KDD data set showed the accuracy values were 83.28%
and 81.29% for binary and multiclass classification, respectively.
Recurrent neural networks include input units, output units and hidden units,
and the hidden units do the most important work. The RNN model essentially has a
one-way flow of information from the input units to the hidden units, combined with a
one-way flow of information from the hidden units of the previous time step to the hidden
units of the current time step, as shown in Fig. 1. We can regard the hidden units as the
memory of the whole network, which remembers the end-to-end information. When we unfold the
RNN over time, we find that it embodies deep learning. An RNN approach can be used for
supervised classification learning. Recurrent neural networks introduce a directional loop
that can memorize previous information and apply it to the current output, which is the
essential difference from traditional Feed-forward Neural Networks (FNNs). The preceding
output is also related to the current output of a sequence, and the nodes between the hidden
layers are no longer unconnected; instead, they have connections. Not only the output of the
input layer but also the output of the hidden layer at the previous time step acts on the
input of the hidden layer.
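The recurrence described above can be sketched as a single update rule applied at every time step. The sketch uses a sigmoid hidden activation, as in the model of Yin et al., but the weights, sizes and function names are hypothetical and hand-picked for illustration; it shows only the forward pass, not training.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rnn_step(x_t, h_prev, w_xh, w_hh, b_h):
    """h_t = sigmoid(W_xh x_t + W_hh h_prev + b_h), computed unit by unit."""
    h_t = []
    for j in range(len(h_prev)):
        z = b_h[j]
        z += sum(w * x for w, x in zip(w_xh[j], x_t))     # current input
        z += sum(w * h for w, h in zip(w_hh[j], h_prev))  # previous hidden state
        h_t.append(sigmoid(z))
    return h_t

def run_sequence(xs, hidden_size, w_xh, w_hh, b_h):
    """Feed a sequence through the cell, carrying the hidden state forward."""
    h = [0.0] * hidden_size
    for x_t in xs:
        h = rnn_step(x_t, h, w_xh, w_hh, b_h)
    return h

# Tiny hand-picked weights: 2 input features, 2 hidden units.
w_xh = [[0.5, -0.5], [0.3, 0.8]]
w_hh = [[0.1, 0.0], [0.0, 0.1]]
b_h = [0.0, 0.0]
seq = [[1.0, 0.0], [0.0, 1.0]]
h_forward = run_sequence(seq, 2, w_xh, w_hh, b_h)
h_reversed = run_sequence(seq[::-1], 2, w_xh, w_hh, b_h)
```

Feeding the same records in reverse order gives a different final hidden state, which is the memory effect that a feed-forward network lacks.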
Wu et al. [20] built a CNN and an RNN for the classification of the NSL-KDD
data set. The authors focused on solving the data imbalance problem by using the cost
function-based method. The cost function weight coefficients of each class are set based on
the training sample number. The reported accuracies of the deep learning models
outperformed traditional machine learning algorithms, such as J48, NB, NB tree, RF, and
SVM. However, the accuracy of the CNN model was slightly lower than that of the RNN
model.
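The idea of setting cost function weights from the number of training samples per class can be sketched as follows. The source does not give the exact formula used by Wu et al., so the inverse-frequency weighting below is an assumed, commonly used scheme, and the label counts are hypothetical.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Assumed scheme: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

def weighted_cross_entropy(true_label, predicted_probs, weights):
    """Per-sample cross-entropy scaled by the weight of the true class."""
    return -weights[true_label] * math.log(predicted_probs[true_label])

# Hypothetical, deliberately imbalanced label list.
labels = ["normal"] * 90 + ["dos"] * 9 + ["u2r"] * 1
weights = inverse_frequency_weights(labels)

# The same predicted probability (0.2) for the true class costs far more
# when the true class is rare.
loss_rare = weighted_cross_entropy(
    "u2r", {"normal": 0.5, "dos": 0.3, "u2r": 0.2}, weights)
loss_common = weighted_cross_entropy(
    "normal", {"normal": 0.2, "dos": 0.3, "u2r": 0.5}, weights)
```

With such weights, misclassifying a rare attack class contributes more to the loss than misclassifying a frequent one, which counteracts the imbalance in the training data.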
Altwaijry et al. [19] developed an intrusion detection model using DNN. The proposed
model consisted of four hidden fully connected layers and was trained using NSL-KDD data
set. The DNN model obtained accuracy values of 84.70% and 77.55% for the two-class and
five-class classification problems, respectively. The proposed model outperformed traditional
machine learning algorithms, including NB, J48, RF, Bagging, and Adaboost in terms of
accuracy and recall.
CHAPTER 3
DEEP LEARNING ALGORITHMS
Research in intrusion detection systems (IDS) began in the 1980s, and since then
many algorithms have been used to build anomaly detection-based NIDS (ADNIDS). Traditional machine learning algorithms
such as random forests (RF), self-organized maps, support vector machines (SVM),
and artificial neural networks (ANN) have been widely used in developing ADNIDS.
However, as data sets are evolving in terms of size and type, traditional machine learning
algorithms become increasingly unable to cope with real-world network application
environments.[15]
Despite several decades of research and applications in IDS, there are still many
challenges to be addressed. In particular, better detection accuracy, reduced false-positive
rates, and the ability to detect unknown attacks are all required. Recently, researchers have
effectively employed deep learning-based methods in a range of applications, including
image recognition, emotion detection, and handwritten-character recognition. Deep learning
has the ability to identify better representations from raw data, compared with traditional
machine learning approaches.[15]
There are many classification methods such as decision trees, rule-based systems,
neural networks, support vector machines, naïve Bayes and nearest-neighbor. Each technique
uses a learning method to build a classification model. However, a suitable classification
approach should not only handle the training data but also accurately identify the
class of records it has never seen before. Creating classification models with reliable
generalization ability is an important task of the learning algorithm.
then compared with a predefined threshold, and a score greater than the threshold indicates
malware. Likewise, if the score is less than the threshold, the traffic is identified as
normal.[21]
Support Vector Machines (SVM)
An SVM is a discriminative classifier defined by a separating hyperplane. SVMs use a
kernel function to map the training data into a higher-dimensional space so that intrusions
become linearly separable from normal traffic. SVMs are well known for their generalization
capability and are especially valuable when the number of attributes is large and the number
of data points is small. Different types of separating hyperplanes can be achieved by
applying a kernel, such as linear, polynomial, Gaussian Radial Basis Function (RBF), or
hyperbolic tangent. In IDS datasets, many features are redundant or less influential in
separating data points into the correct classes. Therefore, feature selection should be
considered during SVM training. SVMs can also be used for classification into multiple
classes. In the work by Li et al., an SVM classifier with an RBF kernel was applied to
classify the KDD 1999 dataset into predefined classes. From a total of 41 attributes, a
subset of features was carefully chosen by using a feature selection method.[21]
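To make the role of the kernel concrete, the sketch below evaluates a linear kernel and a Gaussian RBF kernel on small vectors. The value of gamma and the sample points are arbitrary assumptions for illustration; the code is not an SVM trainer and is not the setup used by Li et al.

```python
import math

def linear_kernel(x, y):
    """Plain dot product, corresponding to a linear separating hyperplane."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian RBF kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

a = [1.0, 2.0]
b = [1.0, 2.0]
c = [3.0, 0.0]
same = rbf_kernel(a, b)  # identical points give the maximum similarity
far = rbf_kernel(a, c)   # similarity decays with squared distance
```

An SVM with an RBF kernel only ever sees such pairwise similarities, which is how it can separate classes that are not linearly separable in the original feature space.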
The fuzzy logic technique is based on degrees of uncertainty rather than the
typical true or false Boolean logic on which contemporary computers are built. Therefore, it
presents a straightforward way of arriving at a final conclusion based upon unclear,
ambiguous, noisy, inaccurate or missing input data. In a fuzzy domain, fuzzy logic permits
an instance to belong, possibly partially, to multiple classes at the same time. Fuzzy
logic is therefore a good classifier for IDS problems, as security itself involves
vagueness and the borderline between the normal and abnormal states is not well defined. In
addition, the intrusion detection problem involves various numeric features in the collected
data and several derived statistical metrics. Building IDSs based on numeric data with hard
thresholds produces a high rate of false alarms: an activity that deviates only slightly
from a model may not be recognized, or a minor change in normal activity may produce false
alarms. With fuzzy logic, it is possible to model such minor deviations and thereby keep the
false alarm rate in determining intrusive actions low. They outlined
a group of fuzzy rules to describe the normal and abnormal activities in a computer system,
and a fuzzy inference engine to define intrusions.[21]
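Partial membership in several classes at once can be sketched with triangular membership functions. The feature (failed logins per minute), the set boundaries and the function names below are hypothetical; they are not the fuzzy rules from the work cited above.

```python
def triangular(x, left, peak, right):
    """Degree of membership in [0, 1] for a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def classify_rate(failed_logins_per_min):
    """Return membership degrees in two overlapping fuzzy sets."""
    return {
        "normal": triangular(failed_logins_per_min, -1, 0, 6),
        "suspicious": triangular(failed_logins_per_min, 2, 10, 18),
    }

# At 4 failed logins per minute the activity is partly normal, partly suspicious.
degrees = classify_rate(4)
```

Because the two sets overlap, a small change in the input changes the membership degrees gradually instead of flipping a hard threshold, which is the property that helps limit false alarms.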
arrive at solutions, which saves both time and money. ANNs are considered fairly simple
mathematical models to enhance existing data analysis technologies. They can be used for
many practical applications, such as predictive analysis in business intelligence, spam email
detection, natural language processing in chatbots, and many more.[13]
A CNN approach is proposed in [15] that consists of two models, BCNN and MCNN: the first
model (BCNN) is used for binary classification, and the second model (MCNN) is used for
multiclass classification of network attacks.
The input layer is either an 11×11×1 matrix for the NSL-KDD data set, or
a 14×14×1 matrix for the UNSW-NB15 data set, as defined in the previous section. In the
rest of this article, we use S to represent the input image side, that is, the height or width,
where S=11 or S=14, depending on the data set used.
The model is composed of a total of five convolutional layers, two pooling layers,
and four fully connected layers. Our input image is small, either 11×11 pixels
or 14×14 pixels, and so a smaller filter size, that is, 2×2, is more appropriate for this image.
As we wish to keep the representational power of our model, we increase the number of
feature maps as the network deepens and apply padding in all convolutional layers to
overcome the problems of image shrinkage and information loss around the perimeter of the
image. We also use batch normalization after each convolutional layer.
The first convolutional layer has as input an S×S×1 image. We use k=8 (2×2×1) kernels,
zero-padding of 1, and a stride of 1, for a convolutional layer of size S×S×8. Each activation
map i is calculated as shown in Equation (3.2.1), where l is the current layer, $B_i^{(l)}$ is a bias
matrix, $k^{(l-1)}$ is the number of kernels used in the previous layer, W is the current layer kernel
matrix, and $Y^{(l-1)}$ is the output of the previous layer. Our nonlinearity is the Leaky ReLU
function, defined as shown in Equation (3.2.2), with α=0.12.
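As an illustration of the nonlinearity described above, the following is a minimal Python sketch of Leaky ReLU. It assumes the common definition (the input is passed through unchanged when positive and scaled by α otherwise) and uses α=0.12 as stated in the text. The function names are illustrative and are not taken from [15].

```python
ALPHA = 0.12  # negative-side slope, as stated in the text


def leaky_relu(x: float, alpha: float = ALPHA) -> float:
    """Return x for positive inputs and alpha * x otherwise (assumed definition)."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x: float, alpha: float = ALPHA) -> float:
    """Derivative of Leaky ReLU; it stays non-zero for negative inputs."""
    return 1.0 if x > 0 else alpha


if __name__ == "__main__":
    for value in (-2.0, 0.0, 3.0):
        print(value, leaky_relu(value), leaky_relu_grad(value))
```

Because the slope for negative inputs is α rather than zero, a unit with a negative pre-activation still passes a small gradient backwards.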
The second convolutional layer has as input an S×S×8 tensor, and we use k=16 kernels for
a convolutional layer of size S×S×16. We add the input image at this point to the tensor. This
step is inspired by the idea of skip connections in residual networks. Skip connections speed
up the learning process and overcome the problem of vanishing gradients. The vanishing
gradient problem is encountered when training ANNs with gradient-based learning methods and is
a serious detriment to deep learning using backpropagation. This is because the neural network's
weights are updated in proportion to the partial derivative of the error function with respect to
the current weight during training, and sometimes, if the gradient is very small, the weight value is
inadequately updated. In the worst case, this stops the neural network from further training.
We noticed the vanishing gradient problem in our model and incorporated skip connections
to amplify the input signal and prevent zero gradients. As shown in Fig.3.1, the input image
is added to the tensor at two points in the architecture. This architecture is then repeated,
before the fifth and final convolutional layer, which has 32 feature maps.
Next, there are two max-pooling layers that have a 2×2 window size. The tensor is then
flattened and followed by four fully connected layers, with sizes 500, 300, 100, and 20,
respectively. All fully connected layers use a 30% dropout rate to reduce overfitting,36 set
experimentally.
The output layer is a 5-class Softmax layer (one class for each attack type, plus the normal
class). Softmax outputs a probability-like prediction for each class, see Equation
(3), where N is the number of output classes. The CNN architecture is shown in Fig.3.1.
Optimization
The model was tested with two optimizers, Stochastic Gradient Descent and Adam, and
Adam was selected as it was found to work better. The loss function used is the categorical
cross-entropy loss, which measures the discrepancy between the predicted class probabilities and
the true class labels. It is usually used as the default function for multiclass classification. In our
model, we set the learning rate to lr=0.001, set experimentally.
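To make the relationship between the Softmax output and the categorical cross-entropy loss concrete, here is a minimal sketch in plain Python. The five scores are hypothetical values chosen for illustration, and the function names are assumptions rather than code from [15].

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, true_index):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probs[true_index])


if __name__ == "__main__":
    scores = [2.0, 0.5, 0.1, -1.0, 0.0]  # hypothetical scores for 5 classes
    probs = softmax(scores)
    print(probs)
    print(categorical_cross_entropy(probs, true_index=0))
```

The loss is small when the correct class receives a probability close to 1 and grows without bound as that probability approaches 0.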
3.3. RECURRENT NEURAL NETWORK:
Recurrent neural networks recognize data's sequential characteristics and use patterns
to predict the next likely scenario. RNNs are used in deep learning and in the development of
models that simulate neuron activity in the human brain. They are especially powerful in use
cases where context is critical to predicting an outcome, and are also distinct from other types
of artificial neural networks because they use feedback loops to process a sequence of data
that informs the final output. These feedback loops allow information to persist.
Like feed-forward neural networks, RNNs process data from initial input to final
output. Unlike feed-forward neural networks, RNNs contain feedback loops that carry
information from one time step back into the network at the next. This connects inputs
across time and is what enables RNNs to process sequential and temporal data. RNNs are
trained with backpropagation through time, which unrolls the network over the time steps of
the sequence. In truncated backpropagation through time, the gradient is propagated back
through only a limited number of time steps, which bounds the cost of training on long
sequences. This is useful for recurrent neural networks that are used as sequence-to-sequence
models, where the number of steps in the input sequence is greater than the number of steps
in the output sequence.
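The feedback loop can be sketched with a single recurrent unit. The example below is a simplified illustration, assuming a scalar hidden state, a tanh activation, and hypothetical weights `w_x`, `w_h` and bias `b`. It is not the configuration of any model evaluated in this thesis.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit recurrent cell over a sequence and return every hidden state."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # The new state depends on the current input and on the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


if __name__ == "__main__":
    # The second input is zero, yet the second state is non-zero:
    # information from the first step persists through the feedback loop.
    print(rnn_forward([1.0, 0.0]))
```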
CHAPTER- 4
Evaluation of DNN
Deep neural networks (DNNs) are Artificial Neural Networks (ANNs) with
multiple hidden layers between the input and output layers. They are an extension of
conventional artificial neural networks, and the main difference lies in depth: conventional
neural networks are shallow, with one or two hidden layers, whereas deep neural networks
have many hidden layers. For instance, a neural network of millions of
neurons was used by the Google Brain project. There is a wide range of models for deep
neural networks, including fully connected DNNs, CNNs, RNNs, and LSTMs. Recent studies have
also introduced attention-based networks that focus on specific parts of the input.
The larger the network and the more layers it has, the more complex the network becomes
and the more resources and time it needs to train. Deep neural networks train fastest on
GPU-based architectures, which take less time than classical CPUs, and recent
developments have shortened training times considerably.[22] They can model complex
non-linear relationships and can generate computational models where the object is expressed in
terms of the layered composition of primitives. Whereas many traditional machine learning
algorithms are linear or shallow, deep neural networks are stacked in a hierarchy of increasing
complexity and abstraction. Each layer applies a nonlinear transformation to its input and creates
a statistical model as output from what it learns. In simple terms, the input data is received
by the input layer and passed on to the first hidden layer. The hidden layers perform
mathematical computations on the inputs. One of the challenges in creating neural networks
is deciding the number of hidden layers and the number of neurons in each layer. Each neuron
has an activation function, which is used to standardize the output from the neuron. The
“deep” in deep learning refers to having more than one hidden layer. The output
layer returns the output data. Training epochs are continued until the output has reached an
acceptable level of accuracy.
4.1.1. Backpropagation Algorithm
Neural networks are able to learn the desired function using large amounts of data and
an iterative algorithm called backpropagation. We feed the network with data, it produces an
output, we compare that output with the desired one (using a loss function), and we readjust the
weights based on the difference. This cycle is repeated many times. The adjustment of weights is
performed using a non-linear optimization technique called stochastic gradient descent. After
a while, the network becomes good at producing the desired output, at which point training is
over and the network approximates the target function. If we then pass an input with an
unknown output to the network, it gives an answer based on the approximated
function. As an example, suppose we want to identify images that contain a tree. We feed the
network with all kinds of images and it produces an output. Since we know whether each image
actually contains a tree, we can compare the output with the truth and adjust the network. As
we pass more and more images, the network makes fewer and fewer mistakes. We can then feed
it an unseen image, and it will tell us whether the image contains a tree. Over the years,
researchers have come up with many improvements on the original idea. Each new architecture
targeted a specific problem and achieved better accuracy and speed.
Backpropagation, short for "backward propagation of errors," is an algorithm for
supervised learning of artificial neural networks using gradient descent. Given an artificial
neural network and an error function, the method calculates the gradient of the error function
with respect to the neural network's weights. It is a generalization of the delta rule for
perceptrons to multilayer feedforward neural networks.
The "backwards" part of the name stems from the fact that calculation of the gradient
proceeds backwards through the network, with the gradient of the final layer of weights
being calculated first and the gradient of the first layer of weights being calculated last.
Partial computations of the gradient from one layer are reused in the computation of the
gradient for the previous layer. This backwards flow of the error information allows for
efficient computation of the gradient at each layer versus the naive approach of calculating
the gradient of each layer separately.
Backpropagation's popularity has experienced a recent resurgence given the widespread
adoption of deep neural networks for image recognition and speech recognition. It is
considered an efficient algorithm, and modern implementations take advantage of specialized
GPUs to further improve performance. The backpropagation algorithm consists of two
phases:
1. The forward pass where our inputs are passed through the network and output
predictions obtained (also known as the propagation phase).
2. The backward pass where we compute the gradient of the loss function at the final
layer (i.e., predictions layer) of the network and use this gradient to recursively apply
the chain rule to update the weights in our network (also known as the weight
update phase).
The working of backpropagation can be explained from the figure shown below.
• The input layer receives the inputs X through the preconnected path.
• The input is weighted using the actual weights ‘W’, where the weights are initially
selected randomly.
• The output of every neuron is calculated from the input layer, through the hidden layer,
until the output data arrives at the output layer.
• The errors in the outputs are evaluated.
• To decrease the error, the weights are adjusted by going back from the output layer to the
hidden layer.
• The process is repeated until the desired output is obtained.
The difference between the actual output and the desired output is used to calculate the error
in the result.[28]
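The steps listed above can be sketched in plain Python for a network with one hidden layer. This is an illustrative sketch under stated assumptions, not the training code used in this thesis: the sigmoid activation, the squared-error loss, the logical OR toy data, and the names `train` and `OR_DATA` are all hypothetical choices made to keep the example small.

```python
import math
import random

# Hypothetical toy data: the logical OR function.
OR_DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(data, hidden=3, lr=0.5, epochs=300, seed=0):
    """Train a one-hidden-layer network; return the mean loss of every epoch."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Weights are selected randomly at the start.
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, target in data:
            # Forward pass: input layer -> hidden layer -> output layer.
            h = [sigmoid(sum(w * xi for w, xi in zip(w1[j], x)) + b1[j])
                 for j in range(hidden)]
            y = sigmoid(sum(w2[j] * h[j] for j in range(hidden)) + b2)
            # Evaluate the error at the output.
            total += 0.5 * (y - target) ** 2
            # Backward pass: chain rule from the output back to the hidden layer.
            delta_out = (y - target) * y * (1.0 - y)
            delta_hidden = [delta_out * w2[j] * h[j] * (1.0 - h[j])
                            for j in range(hidden)]
            # Adjust the weights against the gradient.
            for j in range(hidden):
                w2[j] -= lr * delta_out * h[j]
                b1[j] -= lr * delta_hidden[j]
                for i in range(n_in):
                    w1[j][i] -= lr * delta_hidden[j] * x[i]
            b2 -= lr * delta_out
        losses.append(total / len(data))
    return losses


if __name__ == "__main__":
    history = train(OR_DATA)
    print(history[0], history[-1])
```

Note that `delta_hidden` is computed before the output weights are updated, so the backward pass uses the same weights as the forward pass.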
value of network first, and then, a training sample is selected to calculate gradient of error
relative to this sample.
The BP learning process can be described as follows: (1) Forward propagation of the
operating signal: the input signal is propagated from the input layer, via the hidden layer, to the
output layer. During the forward propagation of the operating signal, the weight and offset
values of the network are kept constant and the state of each layer of neurons only
affects that of the next layer. If the expected output cannot be
achieved in the output layer, the process switches to the back propagation of the error signal.
(2) Back propagation of the error signal: the difference between the real output and the expected
output of the network is defined as the error signal; in the back propagation of the error signal,
the error signal is propagated from the output end to the input layer in a layer-by-layer manner.
During the back propagation of the error signal, the weight values of the network are adjusted by
the error feedback. The continuous modification of the weight and offset values is applied to
bring the real output of the network closer to the expected one.
The neurons of a neural network operate through their weights, biases and
respective activation functions. In a neural network, we update the weights and biases
of the neurons on the basis of the error at the output. This process is known as
back-propagation. Activation functions make back-propagation possible since the gradients
are supplied along with the error to update the weights and biases. A neural network
without an activation function is essentially just a linear regression model. The activation
function applies a non-linear transformation to the input, making the network capable of learning
and performing more complex tasks.[23] Sometimes the activation function is called a “transfer
function.” If the output range of the activation function is limited, then it may be called a
“squashing function.” Many activation functions are nonlinear and may be referred to as the
“nonlinearity” in the layer or the network design.[24] Technically, the activation function is
used within or after the internal processing of each node in the network, although networks
are designed to use the same activation function for all nodes in a layer.
A network may have three types of layers: input layers that take raw input from the
domain, hidden layers that take input from another layer and pass output to another layer,
and output layers that make a prediction. All hidden layers typically use the same activation
function. The output layer will typically use a different activation function from the hidden
layers and is dependent upon the type of prediction required by the model. Activation
functions are also typically differentiable, meaning the first-order derivative can be
calculated for a given input value. This is required given that neural networks are typically
trained using the backpropagation of error algorithm that requires the derivative of prediction
error in order to update the weights of the model.[24]
$$f(x) = \frac{1}{1 + e^{-x}} \qquad (4.1)$$
Generally, the derivatives of the sigmoid function are applied to learning algorithms.
The graph of the sigmoid function is ‘S’ shaped.
Some of the major drawbacks of the sigmoid function include gradient saturation, slow
convergence, sharp damp gradients during backpropagation from within deeper hidden layers
to the input layers, and non-zero centered output that causes the gradient updates to
propagate in varying directions.[25]
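The gradient saturation and non-zero-centered output mentioned above can be checked numerically. The sketch below assumes the standard logistic form of the sigmoid. The function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic sigmoid (assumed form)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid, expressed through its own output."""
    s = sigmoid(x)
    return s * (1.0 - s)


if __name__ == "__main__":
    for value in (-10.0, 0.0, 10.0):
        print(value, sigmoid(value), sigmoid_grad(value))
```

The derivative peaks at 0.25 for an input of zero and shrinks rapidly for inputs of large magnitude, which is the saturation effect. The output itself is always positive, so it is not centered on zero.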
4.2.2. Hyperbolic Tangent Function (Tanh)
The hyperbolic tangent function, a.k.a. the tanh function, is another type of AF. It is a
smoother, zero-centered function with a range between -1 and 1. The output of
the tanh function is given by:
$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (4.2)$$
The tanh function is much more extensively used than the sigmoid function since it delivers
better training performance for multilayer neural networks. The biggest advantage of the tanh
function is that it produces a zero-centered output, thereby supporting the backpropagation
process. The tanh function has been mostly used in recurrent neural networks for natural
language processing and speech recognition tasks.
However, the tanh function, too, has a limitation – just like the sigmoid function, it cannot
solve the vanishing gradient problem. Also, the tanh function can only attain a gradient of 1
when the input value is 0 (x is zero). As a result, the function can produce some dead
neurons during the computation process.
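A short numerical check of these two properties (zero-centered output, and a gradient that reaches 1 only at x = 0) is given below. It relies on the standard identity that the derivative of tanh(x) is 1 - tanh²(x). The function names are illustrative.

```python
import math


def tanh(x: float) -> float:
    return math.tanh(x)


def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


if __name__ == "__main__":
    for value in (-3.0, 0.0, 3.0):
        print(value, tanh(value), tanh_grad(value))
```

Away from zero the gradient decays towards 0, so tanh saturates in the same way as the sigmoid.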
(4.3)
The softmax function is mainly used in multi-class models, where it returns the probability of
each class, with the target class having the highest probability. It appears in almost all the
output layers of the DL architectures where it is used. The primary difference between the
sigmoid and softmax AFs is that while the former is used in binary classification, the latter is
used for multiclass classification.
(4.4)
One of the most popular AFs in DL models, the rectified linear unit (ReLU) function,
is a fast-learning AF that has delivered state-of-the-art performance.
Compared to other AFs like the sigmoid and tanh functions, the ReLU function offers much
better performance and generalization in deep learning. It is a nearly linear
function that retains the properties of linear models, which makes it easy to optimize with
gradient-descent methods.
The ReLU function performs a threshold operation on each input element where all values
less than zero are set to zero. Thus, the ReLU is represented as:
$$f(x) = \max(0, x) \qquad (4.5)$$
By rectifying the values of the inputs less than zero and setting them to zero, this function
mitigates the vanishing gradient problem observed in the earlier types of activation
functions (sigmoid and tanh), since its gradient does not saturate for positive inputs.[25]
The most significant advantage of using the ReLU function is faster computation: it does
not compute exponentials or divisions, thereby
boosting the overall computation speed. Another critical aspect of the ReLU function is that
it introduces sparsity in the hidden units, since it squashes all negative values to zero.
ReLU (Rectified Linear Unit) has turned out to be more efficient and able to
accelerate the entire training process. Traditionally, neural networks used sigmoid
or tanh (hyperbolic tangent) activation functions, but these functions are
prone to the vanishing gradient problem. Vanishing gradients occur when the lower layers of a
DNN have gradients close to zero because units in the higher layers are nearly saturated at the
asymptotes of the tanh function. ReLU offers an alternative to sigmoidal non-linearity which
addresses the issues mentioned so far.
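The threshold operation and the sparsity it induces can be illustrated with a few lines of Python. This is a minimal sketch. The helper `sparsity` is a hypothetical name introduced here to measure the share of units whose output is exactly zero.

```python
def relu(x: float) -> float:
    """Set negative inputs to zero and pass positive inputs through unchanged."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Gradient is 1 for positive inputs and 0 otherwise; no exponentials are needed."""
    return 1.0 if x > 0 else 0.0


def sparsity(pre_activations) -> float:
    """Fraction of units that output exactly zero after ReLU."""
    outputs = [relu(v) for v in pre_activations]
    return sum(1 for o in outputs if o == 0.0) / len(outputs)


if __name__ == "__main__":
    values = [-1.0, 2.0, -0.5, 3.0]
    print([relu(v) for v in values], sparsity(values))
```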
The DARPA’s program for ID evaluation of 1998 was managed and prepared by
Lincoln Labs of MIT. The main objective of this is to analyze and conduct research in ID. A
standardized dataset was prepared, which included various types of intrusions which imitated
a military environment and was made publicly available.
A detailed report on the major shortcomings of the provided synthetic datasets, such as
KDDCup-’98’ and KDDCup-’99’, has been published. The main criticism was that the creators
failed to validate the dataset as a simulation of a real-world network traffic profile.
Irrespective of these criticisms, the KDDCup-’99’ dataset has been used as an effective dataset
by many researchers for benchmarking IDS algorithms over the years. In contrast to the critiques
about the creation of the dataset, a further study carried out a detailed analysis of its contents
and identified non-uniformity and simulation artifacts in the simulated network traffic data. The
reasons why machine learning classifiers have limited capability in identifying the attacks
that belong to the content categories R2L and U2R in the KDDCup-’99’ dataset have also been
discussed; the conclusion was that it is not possible to get an acceptable detection rate
using classical ML algorithms. It has also been stated that a high detection rate is possible in
most cases by producing a refined and augmented dataset that combines the train and
test sets. However, a concrete approach has not been discussed. The DARPA / KDDCup-’98’
data was also heavily criticized for failing to evaluate traditional IDSs. To examine this, the
Snort ID system was run on the DARPA / KDDCup-’98’ tcpdump traces. The system performed
poorly, with low accuracy and impermissible false positive rates. It failed in
detecting the DoS and probing categories but, in contrast, performed better in detecting
R2L and U2R. Despite the harsh criticisms [30], the KDDCup-’99’ set is still one of the most
widely used publicly available benchmarking datasets for studies related to IDS
evaluation and other security-related tasks. In an effort to mitigate the underlying
problems of the KDDCup-’99’ set, a refined version of the dataset named NSL-KDD was
proposed. It removed the redundant connection records from the entire train and test data.
In addition, the invalid records were removed from the test data. These measures
prevent the classifier from being biased in the direction of the more frequent records. Even
after the refinement, this failed to solve all the reported issues, and a new dataset named
UNSW-NB15 was proposed.
utilized data for evaluating several IDSs. This dataset is grouped together by almost
4,900,000 individual connections which includes a feature count of. The simulated attacks
were categorized broadly as given below:
• Denial-of-Service-Attack (DoS): An intrusion in which a person aims to make a host unavailable
for its intended purpose by temporarily or sometimes permanently disrupting services, flooding the
target machine with enormous numbers of requests and hence overloading the host.
• User-to-Root-Attack (U2R): A commonly used category of attack in which the perpetrator
starts by gaining access to an existing user account and then exploits vulnerabilities to obtain
root control.
• Remote-to-Local-Attack (R2L): An intrusion in which the attacker can send data packets
to the target but has no user account on that machine, and tries to exploit a vulnerability
to obtain local access, posing as an existing user of the target machine.
• Probing-Attack: An attack in which the perpetrator tries to gather information about the
computers of the network, with the ultimate aim of getting past the firewall and
gaining root access.
The features of the KDDCup-’99’ set are classified into the following three groups:
• Basic features: Attributes obtained from a TCP/IP connection belong to this group. The
majority of these features implicitly delay detection.
• Traffic features: Features computed with respect to a time window are categorized under this
group. They can be further subdivided into two groups. “Same host” features: connections that
have had the same end host as the connection under consideration during the past 2 seconds fall
into this category and serve the purpose of calculating statistics of protocol behavior, etc.
“Same service” features: connections that have had the same service as the present connection
during the last two seconds fall under this category.
• Content features: Probing attacks and DoS attacks generally show frequent sequential
intrusion patterns, unlike R2L and U2R attacks. This is because they involve multiple
connections to a single host or set of hosts within a short span of time, while the other two
intrusion types are embedded in the data portions of packets, in which generally only one
connection is involved. To detect these types of attacks, we need features that allow us to
search for irregular behavior. These are called content features.
4.4. Identifying network parameters:
Parameters are the coefficients of the model, and they are chosen by the model
itself. It means that the algorithm, while learning, optimizes these coefficients (according to a
given optimization strategy) and returns an array of parameters which minimize the error. To
give an example, in a linear regression task the model looks like y = b + ax,
where b and a are the parameters. The only thing you have to do with those parameters is
initialize them.[26] The parameters of a neural network are typically the weights of the
connections. These parameters are learned during the training stage, so the
algorithm itself (and the input data) tunes them. The hyperparameters, by contrast, are set
before training; typical examples are the learning rate, the batch size and the number of epochs.
Minibatch size: when you are facing billions of records, feeding your NN with all of them at
once can be inefficient (as well as counterproductive). A good practice is to feed it with
smaller samples of your data, called batches: every time the algorithm updates itself,
it trains on a sample of the size of the batch. The typical size is 32 or higher;
however, you need to keep in mind that, if the size is too big, the risk is a model that
generalizes poorly and won’t fit new data well.
Epochs: this is how many times you want your algorithm to train on your whole dataset
(note that epochs are different from iterations: the latter are the number of batches needed to
complete one epoch). Again, the number of epochs depends on the kind of data and task you
are facing. One idea is to impose a condition such that training stops when the error is
close to zero. Or, more easily, you can start with a relatively low number of epochs and then
increase it progressively, tracking some evaluation metrics (like accuracy).
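The relationship between batch size, iterations and epochs can be made concrete with a small sketch. The function names and the sample counts are hypothetical and serve only as an illustration.

```python
import math


def iterations_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)


def minibatches(samples, batch_size):
    """Yield consecutive batches of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


if __name__ == "__main__":
    # With 1000 samples and a batch size of 32, one epoch takes 32 iterations.
    print(iterations_per_epoch(1000, 32))
    print(list(minibatches([1, 2, 3, 4, 5], 2)))
```

Training for a given number of epochs therefore performs that number multiplied by `iterations_per_epoch` weight updates in total.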
Dropout: this technique consists of randomly removing some nodes during training so that the
NN does not rely too heavily on any of them. It is applied during the training phase only. The
idea is that we do not want our NN to be overwhelmed by information, especially if we consider
that some nodes might be redundant and useless. So, while building our algorithm, we can decide
to keep, at each training step, each node with probability p (called ‘keep probability’) or drop
it with probability 1-p (called ‘drop probability’).
Strategies are approaches and best practices we might want to adopt towards our algorithm to
make it perform better. Among these there are the following:
Data normalization: while inspecting your data, you might notice that some
features are represented on different scales. This might affect the performance of your NN,
since convergence is slower. Normalizing data means converting all of them to the same
scale, within the range [0–1]. You can also decide to standardize your data, which means
rescaling them to have mean equal to 0 and standard deviation equal to 1.
While data normalization happens before training your NN, another way you can normalize
your data is through so-called Batch Normalization: it happens directly during NN
training, specifically after the weighted sum and before the activation function.
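The two rescaling options applied before training can be sketched as follows for a single feature column. This is a minimal illustration in plain Python. The function names are assumptions, and the population standard deviation is used as one possible convention.

```python
def min_max_normalize(values):
    """Rescale a feature column to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale a feature column to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    column = [2.0, 4.0, 6.0]
    print(min_max_normalize(column))
    print(standardize(column))
```

In practice the minimum, maximum, mean and standard deviation would be computed on the training set only and then reused for the test set.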
Optimization algorithm: gradient descent was mentioned earlier as the
optimization algorithm. However, it has many variants: Stochastic Gradient
Descent (it minimizes the loss according to the gradient descent optimization, and for each
iteration it randomly selects a training sample, which is why it is called stochastic),
RMSProp (which differs from the previous one in that each parameter has an adapted learning rate)
and the Adam optimizer (RMSProp plus momentum). Of course, this is not the full list,
yet it is sufficient to understand why the Adam optimizer is often a good default choice: it
combines per-parameter adaptive learning rates with momentum.
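To show how these update rules differ, the sketch below implements a plain gradient descent step and the Adam update for a single scalar parameter. It assumes the commonly published Adam defaults (learning rate 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-8). The class and function names are illustrative and do not refer to any particular library.

```python
import math


def sgd_step(w: float, grad: float, lr: float = 0.01) -> float:
    """Plain gradient descent update: move against the gradient."""
    return w - lr * grad


class Adam:
    """Adam update for one scalar parameter (momentum plus adaptive step size)."""

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # running mean of gradients (momentum term)
        self.v = 0.0  # running mean of squared gradients (RMSProp-style term)
        self.t = 0    # step counter used for bias correction

    def step(self, w: float, grad: float) -> float:
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return w - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


if __name__ == "__main__":
    # Minimise the hypothetical loss (w - 3)^2, whose gradient is 2 * (w - 3).
    opt, w = Adam(lr=0.1), 0.0
    for _ in range(10):
        w = opt.step(w, 2 * (w - 3))
    print(w)
```

On the first step the Adam update has a magnitude close to the learning rate regardless of the size of the gradient, which is what makes it less sensitive to the scale of the loss.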
Regularization: this strategy is pivotal if you want to keep your model simple and avoid
overfitting. The idea is that regularization adds a penalty to the model if the weights are too
large or too many. Indeed, it adds to our loss function a new term which tends to increase
(hence, the loss increases too) if the re-calibration procedure increases the weights. There are
two kinds of regularization: Lasso regularization (L1) and Ridge regularization (L2).
The L1 regularization tends to shrink weights to zero, with the risk of getting rid of some
inputs (since they will be multiplied with a null value), whereas the L2 might shrink weights
to very low values, but not to zero (hence inputs are preserved).[27]
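The two penalty terms can be written down directly. The sketch below assumes the usual definitions, with the L1 penalty proportional to the sum of absolute weights and the L2 penalty proportional to the sum of squared weights. The regularization strength `lam` and the function names are hypothetical.

```python
def l1_penalty(weights, lam: float) -> float:
    """Lasso term: lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam: float) -> float:
    """Ridge term: lam times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)


def regularized_loss(base_loss: float, weights, lam: float, kind: str = "l2") -> float:
    """Add the chosen penalty to the unregularized loss."""
    penalty = l1_penalty(weights, lam) if kind == "l1" else l2_penalty(weights, lam)
    return base_loss + penalty


if __name__ == "__main__":
    w = [1.0, -2.0, 0.0]
    print(regularized_loss(0.4, w, lam=0.1, kind="l1"))
    print(regularized_loss(0.4, w, lam=0.1, kind="l2"))
```

The gradient of the L1 term has constant magnitude for any non-zero weight, which can push small weights all the way to zero, whereas the gradient of the L2 term shrinks together with the weight.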
Conventionally, increasing the number of layers gives better results than
increasing the number of neurons in a layer. Therefore, the following network topologies were
used in order to determine the optimum network structure for our input data:
• DNN with 1, 2, 3, 4 and 5 layers.
For all the above network topologies, 100 epochs were run and
the results were observed. The best performance was shown by the 3-layer DNN.
To broaden the search for better results, the common classical
machine learning algorithms were also used and their results were compared to the 3-layer DNN,
which still outperformed every classical algorithm. The detailed statistical results for
the different network structures are reported in Table I.
An overview of the proposed DNN architecture for all use cases is shown in Fig.
4.1. It comprises 5 hidden layers and an output layer. The input layer consists
of 41 neurons. The neurons from the input layer to the hidden layers and from the hidden layers
to the output layer are fully connected. The back-propagation mechanism is used to train the DNN.
The proposed network is composed of fully connected layers, bias layers and dropout layers to
make the network more robust. Input and hidden layers: the input layer consists of 41 neurons,
whose outputs are fed into the hidden layers. The hidden layers use ReLU as the non-linear
activation function. Weights are then applied to feed the activations forward to the next hidden
layer. A weight is a parameter within a neural network that transforms input data within the
network's hidden layers.
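A forward pass through a network of this shape can be sketched in plain Python. Only the 41 input neurons, the five hidden layers and the ReLU activation come from the text. The hidden-layer widths, the single sigmoid output unit, the weight initialization and all names below are assumptions made for illustration, and dropout is omitted.

```python
import math
import random

# 41 inputs and five hidden layers follow the text; the widths and the
# single output unit are hypothetical.
LAYER_SIZES = [41, 64, 48, 32, 16, 8, 1]


def init_network(sizes, seed=0):
    """Create one (weights, biases) pair per fully connected layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = (2.0 / n_in) ** 0.5  # assumed initialization scale
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """Propagate one input vector through all layers."""
    activation = x
    last = len(layers) - 1
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activation)) + b
             for row, b in zip(weights, biases)]
        if index < last:
            activation = [max(0.0, v) for v in z]  # ReLU in the hidden layers
        else:
            activation = [1.0 / (1.0 + math.exp(-v)) for v in z]  # assumed sigmoid output
    return activation


if __name__ == "__main__":
    network = init_network(LAYER_SIZES)
    sample = [0.5] * 41  # hypothetical normalized connection record
    print(forward(network, sample))
```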
Chapter-5
The model proposed has been evaluated based on the following standard performance
measures:
True Positive Rate (TPR): It is calculated as the ratio between the number of
correctly predicted attacks and the total number of attacks. If all intrusions are detected then
the TPR is 1 which is extremely rare for an IDS. TPR is also called a Detection Rate (DR) or
the Sensitivity. The TPR can be expressed mathematically as
$$\mathrm{TPR} = \frac{TP}{TP + FN} \qquad (5.1)$$
False Positive Rate (FPR): It is calculated as the ratio between the number of normal
instances incorrectly classified as an attack and the total number of normal instances.
$$\mathrm{FPR} = \frac{FP}{FP + TN} \qquad (5.2)$$
False Negative Rate (FNR): False negative means when a detector fails to identify an
anomaly and classifies it as normal. The FNR can be expressed mathematically as:
$$\mathrm{FNR} = \frac{FN}{FN + TP} \qquad (5.3)$$
Classification rate (CR) or Accuracy: The CR measures how accurate the IDS is in detecting
normal or anomalous traffic behavior. It is the ratio of correctly
predicted instances to all instances:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (5.4)$$
where FP, sometimes referred to as Type I error, is the number of legitimate traffic instances
classified as attacks, and FN, sometimes referred to as Type II error, is the number of attacks
classified as legitimate traffic. Additional metrics we consider in this thesis are the Recall,
the Precision and the F1 score.
Recall: The ability of a model to find all the relevant cases within a data set. Mathematically,
we define recall as the number of true positives divided by the number of true positives plus
the number of false negatives [29].
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (5.5)$$
Precision: The ability of a classification model to identify only the relevant data points.
Mathematically, precision is the number of true positives divided by the number of true
positives plus the number of false positives.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (5.6)$$
F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes
both false positives and false negatives into account. Intuitively it is not as easy to understand
as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven
class distribution.
$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (5.7)$$
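The measures defined in Equations (5.1) to (5.7) can all be computed from the four confusion-matrix counts. The following sketch shows one way to do so. The function name and the example counts are hypothetical and are not results from this thesis.

```python
def ids_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the evaluation measures of Equations (5.1) to (5.7) from raw counts."""
    tpr = tp / (tp + fn)               # (5.1), identical to recall (5.5)
    fpr = fp / (fp + tn)               # (5.2)
    fnr = fn / (fn + tp)               # (5.3)
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # (5.4)
    precision = tp / (tp + fp)         # (5.6)
    f1 = 2 * precision * tpr / (precision + tpr)  # (5.7)
    return {
        "tpr": tpr, "fpr": fpr, "fnr": fnr,
        "accuracy": accuracy, "recall": tpr,
        "precision": precision, "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical counts: 90 attacks detected, 10 missed,
    # 80 normal records passed, 20 flagged by mistake.
    for name, value in ids_metrics(tp=90, tn=80, fp=20, fn=10).items():
        print(name, round(value, 4))
```

Note that TPR and FNR always sum to 1, so reporting one of them determines the other.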
The KDDCup-’99’ dataset was fed into classical ML algorithms. After training was
completed, all models were compared on F1 score, accuracy, recall and precision using the
test dataset. The accuracy scores of the classical ML algorithms are compared in
detail in Fig.5.1, and the Precision, Recall and F1 score of the classical machine learning
algorithms are compared in Fig.5.2. It is clear that the Naïve Bayes algorithm gives the
highest accuracy among them.
[Fig.5.1: Accuracy of the classical ML algorithms. Decision Tree: 0.928; Naive Bayes: 0.929; Logistic Regression: 0.846.]
[Fig.5.2: Precision, Recall and F1 score of the classical ML algorithms. Plotted values: Decision Tree 0.954, 0.912, 0.999; Naive Bayes 0.955, 0.923, 0.988; Logistic Regression 0.896, 0.819, 0.988.]
[Figure: Accuracy of DNN 1 to DNN 5 (x-axis: DNN layers; y-axis: accuracy). Visible data labels: 0.93, 0.927.]
The accuracy of the five Deep Neural Network configurations is compared in Fig.5. It is clear
that the 3-layer DNN gives the highest accuracy, which is higher than that of the classical
machine learning algorithms.
RECALL
0.916
0.913 0.913
0.912
0.911 0.911
0.91
0.909
dnn 1 dnn 2 dnn 3 dnn 4 dnn 5
DNN LAYERS
38
1.01
0.99
0.98
0.97
0.96 0.955
0.954 0.954 0.954 0.953
0.95
0.94
0.93
DNN 1 DNN 2 DNN 3 DNN 4 DNN 5
Precision F1 Score
[Fig.5.6: Overall accuracy of all algorithms (x-axis: accuracy; y-axis: algorithms).
Logistic Regression 0.846, Naïve Bayes 0.929, Decision Tree 0.928, DNN 1 0.929,
DNN 2 0.929, DNN 3 0.93, DNN 4 0.928, DNN 5 0.927.]
Table-5.1 – Overall Performance of the Algorithms
For the scope of this thesis, the KDDCup-’99’ dataset was fed into the classical
Machine Learning algorithms as well as DNNs with varying numbers of hidden layers. After
training was completed, all models were compared on accuracy, precision, recall and F1 score
using the test dataset. The scores for the DNN layers are depicted in detail in Fig.5.3, Fig.5.4
and Fig.5.5.
The scores of the classical Machine Learning algorithms are compared in Fig.5.1 and Fig.5.2.
Table 5.1 compares all the values in detail, and Fig.5.6 depicts the overall accuracy.
The three-layer DNN (DNN 3) outperformed all the classical machine learning algorithms
with an accuracy of 0.930, although its margin over the best classical algorithm, Naïve Bayes
(0.929), is small. This result is attributed to the ability of DNNs to extract features at a
higher level of abstraction, and to the non-linearity of the networks, which gives them an
advantage over the other algorithms.
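The comparison above can be reproduced from the accuracy values shown in Fig.5.6. The following is a minimal sketch that ranks the models and computes the margin between the best model overall and the best classical algorithm. The dictionary keys and variable names are illustrative only.

```python
# Accuracy values as read from the data labels of Fig.5.6.
accuracy = {
    "Logistic Regression": 0.846,
    "Naive Bayes": 0.929,
    "Decision Tree": 0.928,
    "DNN 1": 0.929,
    "DNN 2": 0.929,
    "DNN 3": 0.930,
    "DNN 4": 0.928,
    "DNN 5": 0.927,
}
classical = ["Logistic Regression", "Naive Bayes", "Decision Tree"]

# Rank all models from highest to lowest accuracy.
ranking = sorted(accuracy.items(), key=lambda item: item[1], reverse=True)
best_model, best_accuracy = ranking[0]

# Best classical algorithm and the margin by which the best model exceeds it.
best_classical = max(classical, key=lambda name: accuracy[name])
margin = best_accuracy - accuracy[best_classical]

for name, value in ranking:
    print(f"{name:<20} {value:.3f}")
print(f"best: {best_model}, margin over {best_classical}: {margin:.3f}")
```

The margin of 0.001 shows that the ranking of DNN 3 above Naïve Bayes rests on a difference in the third decimal place, so it should be read together with the precision, recall and F1 scores rather than on its own.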
Chapter-6
CONCLUSION
This thesis has examined the usefulness of Deep Neural Networks (DNNs) in
Intrusion Detection Systems. For reference, the classical Machine Learning algorithms
Logistic Regression, Naïve Bayes and Decision Tree were evaluated and compared against
the results of the DNNs. The publicly available KDDCup-’99’ dataset was used as the
benchmark for the study, and on it the DNNs outperformed the other algorithms compared.
To refine the model further, this work considered five DNNs with differing numbers of
hidden layers and concluded that the three-layer DNN (DNN 3) is the most accurate of all,
with an accuracy of 0.930. From the empirical results of this thesis, we may claim that deep
learning methods are a promising direction for Intrusion Detection tasks. However, even
though the performance on an artificial dataset is exceptional, the same methods still need to
be applied to real-time network traffic, which contains more complex and more recent attack
types. Additionally, studies regarding the flexibility of these DNNs in adversarial
environments are required. The growing number of deep learning variants calls for an overall
evaluation of these algorithms with regard to their effectiveness in IDSs.
FUTURE WORK
The KDDCup ‘99’ and NSL-KDD datasets are the most well-known, but they are
outdated and are not representative of today’s network traffic. It is therefore essential to
apply the proposed methodology to a recent network traffic dataset such as CICIDS2017,
which is labelled based on the timestamp, source and destination IPs, source and destination
ports, protocols and attacks. A complete network topology containing a modem, firewall,
switches, routers and nodes with different operating systems was configured to collect this
dataset. This remains one significant direction for future work.
REFERENCES:
[5] Vinayakumar, R., Soman, K. P., & Poornachandran, P. (2019). A Comparative Analysis
of Deep Learning Approaches for Network Intrusion Detection Systems (N-IDSs): Deep
Learning for N-IDSs. International Journal of Digital Crime and Forensics (IJDCF)
[6] Wang Peng, Xiangwei Kong, Guojin Peng, Xiaoya Li, Zhongjie Wang (2019).
Network Intrusion Detection Based on Deep Learning. International Conference on
Communications, Information System and Computer Engineering (CISCE)
[8] Ruijie Zhao, Zhaojie Li, Zhi Xue, Tomoaki Ohtsuki, and Guan Gui, A Novel Approach
based on Lightweight Deep Neural Network for Network Intrusion Detection (2021)
[9] Yanyan Zhang and Xiangjin Ran, A Step-Based Deep Learning Approach for Network
Intrusion Detection (2021)
[10] “An Overview of Neural Networks Use in Anomaly Intrusion Detection Systems”
(2009) by Yusuf Sani, Ahmed Mohamedou, Khalid Ali, Anahita Farjamfar, Mohamed
Azman, Solahuddin Shamsuddin, IEEE Student Conference on Research and Development
(SCOReD 2009)
[11] T. Kohonen, “The self-organizing map”, Proc. IEEE, vol. 78, no. 9, pp. 1464-1480,
[12] L. Ertöz, M. Steinbach, and V. Kumar, “Finding clusters of different sizes, shapes,
and densities in noisy, high dimensional data,” in Proceedings of the 2003 SIAM
International Conference on Data Mining. SIAM, 2003,
[16] Vinayakumar R, Alazab M, Soman K, et al. Deep learning approach for intelligent
intrusion detection system. IEEE Access. 2019;7:41525–41550
[17] Pasupa K, Sunhem W. A comparison between shallow and deep architecture classifiers
on small dataset. In: 2016 8th International Conference on Information Technology and
Electrical Engineering (ICITEE), Yogyakarta, Indonesia: IEEE, 2016. pp. 1–6
[18] Yin C, Zhu Y, Fei J, et al. A deep learning approach for intrusion detection using
recurrent neural networks. IEEE Access. 2017; 5:21954–21961
[20] Wu K, Chen Z, Li W. A novel intrusion detection model for a massive network using
convolutional neural networks. IEEE Access. 2018; 6:50850–50859
[21] Khraisat, A., Gondal, I., Vamplew, P. et al. Survey of intrusion detection systems:
techniques, datasets and challenges. (2019). https://doi.org/10.1186/s42400-019-0038-7
[22] Abdulhamit Subasi, in Practical Machine Learning for Data Analysis Using Python
(2020)
[28] “Back-propagation” Accessed: 15-12-2021 [online] Available:
https://www.watelectronics.com/back-propagation-neural-network/