
EVALUATING DEEP NEURAL NETWORK AND CLASSICAL
MACHINE LEARNING ALGORITHMS FOR NETWORK INTRUSION
DETECTION SYSTEM

Submitted in partial fulfillment of the
requirements for the mini project (III Semester) of
M.Sc. in Cyber Security
in the Centre for Information Technology and Engineering of the
Manonmaniam Sundaranar University

By
J.S. RISHIDEV
(Reg.No: 20204012533112)

Tirunelveli – 627 012, India


November 2021
EVALUATING DEEP NEURAL NETWORK AND CLASSICAL
MACHINE LEARNING ALGORITHMS FOR NETWORK INTRUSION
DETECTION SYSTEM

Submitted in partial fulfillment of the
requirements for the mini project (III Semester) of
M.Sc. in Cyber Security
in the Centre for Information Technology and Engineering of the
Manonmaniam Sundaranar University

BY
J.S. RISHIDEV
(Reg.No: 20204012533112)

Approved By: _____________________


(GUIDE)

Tirunelveli – 627 012, India


November 2021
CERTIFICATE

Certified that this report “EVALUATING DEEP NEURAL NETWORK AND
CLASSICAL MACHINE LEARNING ALGORITHMS FOR NETWORK
INTRUSION DETECTION SYSTEM” submitted for the mini project (III Semester, Phase
I) of Master of Science in Cyber Security is the bonafide work of Mr. J.S. RISHIDEV
(Reg. No. 20204012533112), Centre for Information Technology and Engineering,
Manonmaniam Sundaranar University, Tirunelveli, who carried out the mini project under
my supervision. Certified further that, to the best of our knowledge, the work reported herein
does not form part of any other thesis on the basis of which a degree or award was conferred
on an earlier occasion on this or any other candidate.

GUIDE/ SUPERVISOR PROFESSOR AND HEAD

Submitted for viva-voce examination held on __________________

INTERNAL EXAMINER EXTERNAL EXAMINER(S)

DECLARATION

I hereby state that the thesis submitted for the degree of

MASTER OF SCIENCE

in

Cyber Security

on

“EVALUATING DEEP NEURAL NETWORK AND CLASSICAL MACHINE
LEARNING ALGORITHMS FOR NETWORK INTRUSION DETECTION
SYSTEM”

is my original work and that it has not previously formed the basis for the award of any
Degree, Diploma, Associateship, Fellowship or any other similar title.

J.S. RISHIDEV

ACKNOWLEDGEMENT

I take this opportunity to express my gratitude to everyone who helped me directly or
indirectly to complete my thesis.

I thank God the Almighty for showering His blessings, which helped me to complete
my thesis successfully.

It is with a deep sense of gratitude that I express my heartfelt thanks to Dr. N. KRISHNAN,
M.Sc., M.Tech., Ph.D., Professor and Head in-Charge of the Centre for Information
Technology and Engineering, Manonmaniam Sundaranar University, Tirunelveli.

I am extremely thankful to my guide, Dr. C. Divya, B.E., M.E., Ph.D., Assistant Professor,
Centre for Information Technology and Engineering, Manonmaniam Sundaranar University,
Tirunelveli, for her timely suggestions and encouragement, which paved the way for the
successful completion of my thesis work.

I thank all the faculty members and administrative staff of the Centre for Information
Technology and Engineering for their timely instructions, encouragement, advice, and expert
guidance in completing this thesis.

Finally, I acknowledge my indebtedness to all my family members and friends, whose
overwhelming support helped me finish this thesis successfully, and I thank them for their
kind hospitality during my research discussion visits.

J.S. RISHIDEV

List of Figures

3.1 CNN model (CNN, convolutional neural network)
4.1 Proposed architecture
5.1 Accuracy of classical machine learning algorithms
5.2 F1 score, recall and precision of classical ML algorithms
5.3 Accuracy of DNN layers
5.4 Recall values of DNN layers
5.5 Precision and F1 score of DNN layers
5.6 Overall accuracy

List of Tables

Table 4.1 Statistical results of network structures
Table 5.1 Overall performance of the algorithms

List of Abbreviations
o ML – Machine Learning
o DNN – Deep Neural Network
o NIDS – Network-based Intrusion Detection System
o IDS – Intrusion Detection System
o AI – Artificial Intelligence
o HID – Host-based Intrusion Detection
o DL – Deep Learning
o NN – Neural Network
o ANN – Artificial Neural Network
o MLP – Multilayer Perceptron
o LNN – Lightweight Deep Neural Network
o RF – Random Forest
o SNN – Shared Nearest Neighbor
o RNN – Recurrent Neural Network
o FNN – Feed-Forward Neural Network
o NB – Naive Bayes
o SVM – Support Vector Machine
o HMM – Hidden Markov Model
o RBF – Radial Basis Function
o CNN – Convolutional Neural Network
o ReLU – Rectified Linear Unit
o BP – Back Propagation
o AF – Activation Function
o DoS – Denial-of-Service Attack
o U2R – User-to-Root Attack
o R2L – Remote-to-Local Attack
o TPR – True Positive Rate
o FPR – False Positive Rate
o FNR – False Negative Rate
o CR – Classification Rate

Table of Contents

Abstract
1 Introduction
2 Review of Literature
3 Deep Learning Algorithms
   3.1. Artificial Neural Network
   3.2. Convolutional Neural Network
   3.3. Recurrent Neural Network
4 Evaluation of DNN
   4.1. Deep Neural Network
   4.2. Activation Function
   4.3. Dataset Description
   4.4. Identifying Network Parameters
   4.5. Identifying Network Structures
   4.6. Proposed Architecture
5 Result and Discussion
6 Conclusion
References

ABSTRACT

To secure a network from intrusion and to preserve the confidentiality of data, an
Intrusion Detection System (IDS) plays a vital role, and modern IDSs call for the integration
of Deep Neural Networks (DNNs). Cyber security is an emerging field in the IT sector: as
more devices are connected to the internet, the attack surface available to hackers steadily
increases. Network-based Intrusion Detection Systems (NIDS) can be used to detect
malicious traffic in networks, and machine learning is an up-and-coming approach for
improving the detection rate. We explore the use of Deep Neural Networks to improve the
accuracy of a network-based IDS and compare the results with classical machine learning
algorithms.
In this work, we explore how to model an intrusion detection system based on deep
learning and propose a deep learning approach for intrusion detection using deep neural
networks (DNN-IDS). Moreover, we study the performance of the model in binary
classification, and the impact of the number of neurons and of different learning rates on the
performance of the proposed model. We compare it with Naïve Bayes, Decision Tree, and
other machine learning methods proposed by previous researchers on the benchmark data
set.
Deep Neural Networks have been utilized to predict attacks on a Network Intrusion
Detection System (NIDS). A DNN with a learning rate of 0.1 is trained for 100 epochs, and
the KDDCup '99 dataset is used for training and benchmarking the network. For comparison,
the same dataset is used to train other machine learning algorithms such as Decision Tree,
Logistic Regression, and Naive Bayes, as well as DNNs with 1 to 5 layers. The experimental
results show that DNN-IDS is well suited for building a classification model with high
accuracy and that its performance is superior to that of traditional machine learning
classification methods in both binary and multiclass classification. The DNN-IDS model
improves the accuracy of intrusion detection and provides a new research method for
intrusion detection.

CHAPTER 1
INTRODUCTION
Cyber-attacks on digital infrastructure can result in damage to humans, their
property, and the environment. Attacks on industrial process control systems could lead to
life-threatening malfunctions or emissions of dangerous chemicals into the environment. A
recent example is the “Triton” attack, which targeted the process control systems of
petrochemical plants. To detect cyber security threats, Intrusion Detection Systems (IDS) can
be used. An IDS monitors networks or computers in order to detect malicious activity.
Machine learning (ML) is a type of artificial intelligence (AI) that allows software
applications to become more accurate at predicting outcomes without being explicitly
programmed to do so. Machine learning algorithms use historical data as input to predict new
output values. Because of this, machine learning enables computers to build models from
sample data in order to automate decision-making processes based on data inputs. A machine
learning model is an expression of an algorithm that combs through mountains of data to find
patterns or make predictions. Fueled by data, machine learning models are the mathematical
engines of artificial intelligence.

Supervised learning is a type of machine learning in which machines are trained using
well-labelled training data and, on the basis of that data, predict the output. Labelled data
means the input data is already tagged with the correct output. In supervised learning, the
training data provided to the machines works as a supervisor that teaches the machines to
predict the output correctly; it applies the same concept as a student learning under the
supervision of a teacher. Supervised learning is thus a process of providing both input data
and correct output data to the machine learning model. The aim of a supervised learning
algorithm is to find a mapping function that maps the input variable (x) to the output
variable (y). Supervised learning can be further categorized into two types of problems, i.e.,
regression and classification. Regression algorithms are used when there is a relationship
between the input variable and the output variable; they are used for the prediction of
continuous variables. Classification algorithms are used when the output variable is
categorical, meaning it falls into discrete classes (e.g., yes/no).

Unsupervised learning is a machine learning technique in which models are not supervised
using a labelled training dataset; instead, the model itself finds hidden patterns and insights
in the given data. It can be compared to the learning that takes place in the human brain while
learning new things. Unsupervised learning cannot be directly applied to a regression or
classification problem because, unlike supervised learning, we have the input data but no
corresponding output data. The goal of unsupervised learning is to find the underlying
structure of a dataset, group the data according to similarities, and represent the dataset in a
compressed format. Unsupervised learning can be further categorized into two types of
problems, i.e., clustering and association. Clustering is a method of grouping objects into
clusters such that objects with the most similarities remain in one group and have little or no
similarity with the objects of another group. Cluster analysis finds the commonalities
between the data objects and categorizes them according to the presence or absence of those
commonalities. An association rule is an unsupervised learning method used for finding
relationships between variables in a large database. It determines the set of items that occur
together in the dataset. Association rules can make marketing strategies more effective; for
example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam).
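
As a concrete illustration, the following minimal Python sketch trains a supervised
classifier with scikit-learn; the synthetic dataset, the split ratio, and the choice of a decision
tree are illustrative assumptions, not the thesis configuration.

# Minimal supervised-learning sketch with scikit-learn.
# The synthetic data below is a placeholder, not the thesis dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labelled data: every input x is already tagged with its correct output y.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

clf = DecisionTreeClassifier().fit(X_train, y_train)  # learn the mapping x -> y
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))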

Artificial intelligence and machine learning are parts of computer science that are
closely related to each other. These two technologies are among the most trending
technologies used for creating intelligent systems. Although they are related, and people
sometimes use them as synonyms, they are distinct terms in many contexts. AI is the broader
concept of creating intelligent machines that can simulate human thinking capability and
behavior, whereas machine learning is an application or subset of AI that allows machines to
learn from data without being explicitly programmed. Artificial intelligence is now widely
employed to help detect intrusions on computer systems because of its efficient and adaptive
nature. Neural networks are the branch of artificial intelligence that receives the most
attention. A neural network analyzes the information and provides a probability estimate that
the data matches the characteristics it has been trained to recognize; this helps in no small
measure in reducing the false positive rate.

Current intrusion detection systems offer many advantages for network protection;
however, in the face of today's complex network environments and constantly evolving
attack methods, most traditional intrusion detection systems rely on rule bases and traditional
machine learning. Their algorithms are computationally complex and have been unable to
adapt to the new network environment, and many problems have arisen: low detection
efficiency, poor adaptability, and high false negative and false positive rates. In recent years,
theoretical and practical results of deep learning have emerged in an endless stream,
achieving remarkable results in fields such as speech recognition and image classification,
and deep learning is well suited to processing large-scale data. Aiming at the shortcomings of
traditional intrusion detection methods, namely slow detection speed and weak adaptive
ability, this thesis proposes a hybrid intrusion detection method based on a deep belief
network. The method realizes effective identification and classification of massive, high-
dimensional, and nonlinear intrusion network data, and improves the accuracy of IDS
classification.

In general, intrusion detection can be classified into two main categories:
host-based intrusion detection (HID) and network-based intrusion detection (NID). The HID
data source mainly uses internal operating-system logs for audit and judgment; the detection
method is not sensitive to network traffic, and the system can accurately locate and define the
specific operations of an intrusion. However, it occupies substantial resources on the host
itself and depends on the reliability of the host. NID analyzes network traffic data, finds
suspicious intrusions hidden in the traffic, and raises corresponding alarms or intercepts the
detected intrusions. Network traffic data has high dimensionality and complex features. In
fact, NID is a typical classification problem: it must automatically identify possible attacks
and threats hidden in network traffic in time and determine their specific types.

In recent years, a large body of literature has investigated the detection performance
of deep learning in IDS. However, implementation challenges are rarely discussed. One of
the most important issues is the complexity of the neural network, including the
computational requirements and the model size. The complexity of neural networks greatly
hinders the deployment of DL-based algorithms in practical environments, especially when
reinforcement learning or federated learning is used. These models require continuous
training after deployment, but practical deployment environments often do not have
powerful computational resources. Therefore, it is essential to design efficient and
high-performance models for IDS.
Deep Neural Networks (DNNs) are an important method of machine learning and have
been widely used in many fields. The DNN is inspired by the structure of the mammalian
visual system. Biologists have found that the mammalian visual system contains many layers
of neural networks: it processes information from the retina to the visual center layer by
layer, sequentially extracting edge features, part features, and shape features, and eventually
forming abstract concepts. In general, the depth of a DNN is greater than or equal to 4; for
example, a Multilayer Perceptron (MLP) with more than one hidden layer is a DNN. A DNN
extracts features layer by layer and combines low-level features to form high-level features,
which can discover distributed representations of the data. Compared with shallow neural
networks (NNs), DNNs have better feature expression and a stronger ability to model
complex mappings. A deep neural network is an ANN with multiple hidden layers between
the input and output layers; like shallow ANNs, DNNs can model complex non-linear
relationships.
Traditional static security technologies such as firewalls and encryption have a
certain effect, but their lack of a mechanism for active intrusion detection, and the manual
implementation and maintenance they require, make it difficult for them to meet current
security requirements. Intrusion detection based on neural networks is an active, dynamic
security technology that provides effective, real-time protection against internal attacks,
external attacks, and misuse. At the same time, by exploiting the advantages of neural
networks in self-organization, self-learning, and generalization, such an intrusion detection
method not only recognizes known attack patterns well but can also detect unknown attacks.
It has therefore become a research hotspot in the field of network security.
Self-learning systems are one of the effective methods of dealing with present-day
attacks. They use supervised, semi-supervised, and unsupervised machine learning
mechanisms to learn the patterns of various normal and malicious activities from a large
corpus of normal and attack connection records. Though various machine learning-based
solutions have been proposed, their applicability in commercial systems is still at an early
stage. Existing machine learning-based solutions produce a high false positive rate along
with a high computational cost, primarily because machine learning classifiers learn the
characteristics of simple TCP/IP features locally. Deep learning is a complex subset of
machine learning that learns hierarchical feature representations and hidden sequential
relationships by passing the TCP/IP information through several layers. Deep learning has
achieved significant results in long-standing artificial intelligence (AI) tasks in the fields of
image processing, speech recognition, natural language processing, and many others.
There are multiple systems that can be used for shielding such information and
communication technology systems from vulnerabilities, namely anomaly detection and
IDSs. A demerit of anomaly-detection systems is the complexity involved in the process of
defining rules: each protocol being analyzed must be defined, implemented, and tested for
accuracy. Another pitfall of anomaly detection is that harmful activity that falls within a
usual usage pattern is not recognized. Therefore, an IDS that can adapt itself to recent novel
attacks, and that can be trained as well as deployed using datasets of irregular distribution,
becomes indispensable. This work analyzes the effectiveness of various shallow and deep
networks for NIDS using the openly available network-based intrusion data set KDDCup '99.
The tcpdump data of the 1998 DARPA intrusion detection evaluation data set was
pre-processed to build the KDDCup '99 data set, with feature extraction from the tcpdump
data facilitated by the MADAM ID data mining framework. This data set was built by
capturing network traffic for ten weeks from thousands of UNIX systems and hundreds of
users accessing those systems in the MIT Lincoln Laboratory. The data captured during the
first 7 weeks were utilized for training, and the last 3 weeks of
data were utilized for testing purposes.
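
A minimal sketch of loading and preparing such KDDCup '99 records in Python is shown
below; the file name, the treatment of the three symbolic columns, and the binary labelling
rule are assumptions for illustration, not a prescription from this thesis.

# Hedged sketch: load KDDCup '99 records and binarize labels (normal vs. attack).
# The file name and the column layout (41 features + 1 label, with symbolic
# features in columns 1-3) follow the common distribution of the dataset.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

df = pd.read_csv("kddcup.data_10_percent", header=None)
X, y = df.iloc[:, :-1].copy(), df.iloc[:, -1]

for col in [1, 2, 3]:                      # protocol_type, service, flag
    X[col] = LabelEncoder().fit_transform(X[col])

X = MinMaxScaler().fit_transform(X)        # scale features to [0, 1]
y = (y != "normal.").astype(int)           # 0 = normal, 1 = attack
print(X.shape, y.mean())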
The rest of the thesis is organized as follows. Chapter II reviews work related to
IDS, different deep neural networks, and published discussions of the KDDCup '99 dataset.
Chapter III takes an in-depth look at the existing methods of Deep Neural Networks (DNN)
and the applications of the ReLU activation function. Chapter IV explains the proposed
method, Chapter V evaluates and discusses the final results, and Chapter VI concludes and
outlines a plausible direction for future work.

CHAPTER 2

REVIEW OF LITERATURE

Intrusion detection monitors network traffic and computer system activity
information and analyzes the data to find out whether there are malicious attacks by hackers
or behavior that damages computers and network resources. A deep belief network has been
used to extract features from network monitoring data, with a BP neural network as the
top-level classifier to classify intrusion types. The method was validated using the
KDD CUP '99 dataset from the Lincoln Laboratory of the Massachusetts Institute of
Technology. The results show that the method achieved a significant improvement in
accuracy over traditional machine learning. The influence of the parameters of the DBN
model on the intrusion detection effect was analyzed experimentally, and the deep feature
learning model based on the deep belief network was compared experimentally with other
traditional feature learning methods. According to the analysis of the experimental results,
the detection rate of this method is significantly improved compared with traditional
machine learning methods [6].
Although traditional machine learning techniques have been widely used to detect
network attacks, they still require significant preprocessing. Such algorithms require feature
engineering and assume the availability of handcrafted features. However, with fast-paced
technological advancements, the size of everyday data sets available to organizations is
growing. Thus, shallow learning with traditional machine learning algorithms may not be
suitable to deal with real-world environments since it relies on high levels of human
involvement in data preparation.[17] In addition, these techniques have the disadvantage of
low detection accuracy. Deep learning has emerged recently and demonstrated success in
many real-world problems. It has the ability to automatically capture features and
correlations in large data sets.
Deep Confidence generates an ensemble of deep neural networks by recording the
network parameters throughout the local minima visited during the optimization phase of a
single neural network. This approach serves to derive a set of base learners (i.e., snapshots)
with comparable predictive power on average that will nevertheless generate slightly
different predictions for a given instance. The variability across base learners and the
validation residuals are in turn harnessed to compute confidence intervals using the
conformal prediction framework. Using a set of 24 diverse IC50 data sets from ChEMBL 23,
the authors show that Snapshot Ensembles perform on par with Random Forest (RF) and
ensembles of independently trained deep neural networks. In addition, they found that the
confidence regions predicted using the Deep Confidence framework span a narrower set of
values. Overall, Deep Confidence represents a highly versatile error prediction framework
that can be applied to any deep learning-based application at no extra computational cost [7].
However, there is no standard theory to guide the selection of the right deep learning tools,
as this requires knowledge of the topology, the training method, and other parameters.
A lightweight deep neural network (LNN) model structure using depthwise
convolution has been introduced, which achieves satisfactory performance with lower
computational cost and smaller model size. Lightweight units are the key building blocks of
the model; they can fully extract data features while reducing computational cost by
expanding and compressing feature maps. Lightweight units expand the feature maps
through an expansion layer; after expansion, depthwise convolution is used to extract more
features, and then a compression layer compresses the feature map to shrink the network
again. At the same time, the authors adopt an inverted residual structure and a channel
shuffle operation to achieve more effective training. They compared this model with other
models for NID performance within modern networks. Experimental results on the
UNSW-NB15 dataset demonstrate that their approach not only reduces the computational
cost and the model size but also achieves high detection accuracy and a low false alarm rate
in binary classification problems [8].
The Australian Centre for Cyber Security created the UNSW-NB15 dataset using the
IXIA PerfectStorm tool. This dataset provides up to 100 GB of raw traffic, including nine
types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance,
Shellcode, and Worms. The UNSW-NB15 dataset is used, on the basis of a large number of
original data samples, to train the classification of network data packets that contain
application-layer data. The dataset contains a massive amount of data with the same MAC
address and IP address, which may affect detection accuracy; therefore, the Ethernet header
and IP header in each packet should be removed to eliminate the influence of fixed MAC
and IP addresses on the model's accuracy, producing a simplified packet (Reduced-Data).
After reduction, the remaining data contains at most 1460 bytes of content and is formed
into a two-dimensional matrix of size 38 × 38 by removing the trailing bytes if it is longer
than 1444 bytes or padding with zeros if it is shorter than 1444 bytes. Subsequently, the
Reduced-Data is standardized and normalized and then sent to the deep learning model
as input data for training or inference [9].
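
The truncate-or-pad step described above can be sketched in Python as follows; the
function name and the dummy payload are illustrative assumptions.

# Hedged sketch of the packet-reduction step: truncate or zero-pad a stripped
# payload to 1444 bytes and reshape it into a 38 x 38 matrix.
import numpy as np

def to_reduced_matrix(payload: bytes) -> np.ndarray:
    """Map one header-stripped packet payload to a 38x38 matrix."""
    data = payload[:1444]                          # drop trailing bytes if > 1444
    buf = np.zeros(1444, dtype=np.uint8)           # zero-pad if < 1444
    buf[:len(data)] = np.frombuffer(data, dtype=np.uint8)
    return buf.reshape(38, 38)

matrix = to_reduced_matrix(b"\x17\x03\x03" * 200)  # dummy 600-byte payload
print(matrix.shape)                                # (38, 38)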
The main drawback of anomaly-detection IDSs is that it is difficult to define what
constitutes normal behavior of the system, and without a proper and effective means of
discriminating a normal profile from an abnormal one, an anomaly-detection IDS will hardly
be of any use. The need to ameliorate this shortcoming led researchers to consider neural
network techniques, because neural networks have an enhanced ability to classify by
learning and are good at pattern recognition [10]. There are many advantages to using neural
networks in IDS. They provide a more accurate statistical distribution than statistical models,
because most statistical models make assumptions about the underlying distribution of user
behavior; these assumptions may not always be valid and can at times lead to high
false-alarm rates. Neural networks generally make fewer assumptions and modify them
through the learning process. Neural networks also have a low development cost: statistical
model algorithms cost more to build, because it is costly to reconstruct statistical algorithms
after removing assumptions that are found to be invalid. They are highly scalable compared
to other techniques and good at reducing both the false positive and false negative error
rates; the false positive rate counts false alarms, and the false negative rate counts missed
intrusions. Anomaly-detection IDSs tend to have more false positives than false negatives.
IDS designers exploit neural networks either as a pattern recognition technique or for
classification and prediction. Pattern recognition is realized using a multilayered
feed-forward neural network that has been trained accordingly. During training, the neural
network parameters are optimized to associate outputs with corresponding input patterns; an
input pattern is then identified by matching its output with a known class. During the
training session, the output is compared with the corresponding class; if it does not match,
adjustments to the weight and threshold values are made repeatedly until it matches the
desired class. Denial of service and probing are among the types of attacks that are known to
be easily detected by pattern recognition. Classification and prediction can be implemented
using Kohonen's self-organizing maps to classify inputs into clusters, with well-defined
boundaries for normal and abnormal profiles [11]. The neural network evolves through the
learning process to identify the relevant profile for any type of network traffic.
In the proposed model, it is assumed that features of incoming packets have been
extracted using any one of the different methods available. The model can be used with both
pattern recognition and classification neural network techniques. For the training phase,
there are both attack and normal data sets; it should be noted that both data sets and the
neural network learning modules are connected to the neural network model module,
because that module is the main decider of what type of training scheme is employed. For
each training round, the output is compared with the expected output at the validation
module, and an appropriate cost is assigned to the whole operation; this determines the
extent of the modification to both the weights and the thresholds. The exercise continues
until the neural network model is fully trained. The training is then conducted periodically,
or in some secured IDSs continuously, which improves the performance of the IDS. Note
that, since the IDS itself is not immune from attack, continuous training should be avoided:
at least theoretically, somebody could implant a malicious data set as normally profiled
data [10].
The authors of [12] discussed the applicability of the shared nearest neighbor (SNN)
classifier to intrusion detection. They claimed that their method was the best by reporting a
higher detection rate, but this claim is difficult to accept because they used a custom-built
data set.
A highly scalable, hybrid DNN framework called scale-hybrid-IDS-AlertNet was
proposed in the study by Vinayakumar et al. [16]. The framework may be used in real time
to effectively monitor network traffic and alert system administrators to possible cyber-
attacks. It was composed of a distributed deep learning model with DNNs for handling and
analyzing very large-scale data in real time. The authors tested the framework on various
data sets, including NSL-KDD and KDD '99. On NSL-KDD, the best F-measure was 80.7%
for binary classification and 76.5% for multiclass classification.

Yin et al. [18] proposed a model for intrusion detection using recurrent neural
networks (RNNs). RNNs are especially suited to data that are time dependent. The model
consisted of forward and back propagation stages: forward propagation calculates the output
values, whereas back propagation passes the accumulated residuals back to update the
weights. The model consisted of 20 hidden nodes, with sigmoid as the activation function
and softmax as the classification function. The learning rate was set to 0.1 and the number of
epochs to 50. Experimental results using the NSL-KDD data set showed accuracy values of
83.28% and 81.29% for binary and multiclass classification, respectively.
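
A hedged Keras sketch of that reported configuration follows; the input shape (one time
step of 122 one-hot-encoded NSL-KDD features) and the use of plain SGD are assumptions
for illustration.

# Sketch of an RNN-based IDS classifier following the reported configuration:
# 20 hidden nodes, sigmoid activation, softmax output, learning rate 0.1.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(1, 122)),                  # (time steps, features)
    layers.SimpleRNN(20, activation="sigmoid"),    # 20 hidden nodes
    layers.Dense(5, activation="softmax"),         # 5-class output
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()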
Recurrent neural networks include input units, output units, and hidden units,
and the hidden units complete the most important work. The RNN model essentially has a
one-way flow of information from the input units to the hidden units, combined with a
one-way flow of information from the hidden unit at the previous time step to the hidden
unit at the current time step. We can regard the hidden units as the storage of the whole
network, which remembers end-to-end information. When we unfold an RNN, we find that
it embodies deep learning. An RNN approach can be used for supervised classification
learning. Recurrent neural networks introduce a directional loop that can memorize previous
information and apply it to the current output, which is the essential difference from
traditional feed-forward neural networks (FNNs). The preceding output is also related to the
current output of a sequence, and the nodes between the hidden layers are no longer
connectionless; instead, they have connections. Not only the output of the input layer but
also the output of the last hidden layer acts on the input of the hidden layer.
Wu et al. [20] built a CNN and an RNN for the classification of the NSL-KDD
data set. The authors focused on solving the data imbalance problem by using a cost
function-based method: the cost function weight coefficients of each class are set based on
the number of training samples. The reported accuracies of the deep learning models
exceeded those of traditional machine learning algorithms, such as J48, NB, NB tree, RF,
and SVM. However, the accuracy of the CNN model was slightly lower than that of the
RNN model.
Altwaijry et al. [19] developed an intrusion detection model using a DNN. The proposed
model consisted of four hidden fully connected layers and was trained using the NSL-KDD
data set. The DNN model obtained accuracy values of 84.70% and 77.55% for the two-class
and five-class classification problems, respectively. The proposed model outperformed
traditional machine learning algorithms, including NB, J48, RF, Bagging, and AdaBoost, in
terms of accuracy and recall.

CHAPTER 3
Deep Learning Algorithms

Research in intrusion detection systems (IDS) began in the 1980s, and ever since,
many algorithms have been used to build anomaly detection-based NIDS (ADNIDS).
Traditional machine learning algorithms such as random forests (RF), self-organizing maps,
support vector machines (SVM), and artificial neural networks (ANN) have been widely
used in developing ADNIDS. However, as data sets evolve in size and type, traditional
machine learning algorithms become increasingly unable to cope with real-world network
application environments.[15]

Despite several decades of research and applications in IDS, there are still many
challenges to be addressed. In particular, better detection accuracy, reduced false-positive
rates, and the ability to detect unknown attacks are all required. Recently, researchers have
effectively employed deep learning-based methods in a range of applications, including
image recognition, emotion detection, and handwritten-character recognition. Deep learning
has the ability to identify better representations from raw data, compared with traditional
machine learning approaches.[15]

There are many classification methods, such as decision trees, rule-based systems,
neural networks, support vector machines, naïve Bayes, and nearest-neighbor. Each
technique uses a learning method to build a classification model. However, a suitable
classification approach should not only handle the training data but should also accurately
identify the class of records it has never seen before. Creating classification models with
reliable generalization ability is an important task of the learning algorithm.

Hidden Markov Model (HMM)


HMM is a statistical Markov model in which the system being modeled is assumed to
be a Markov process with unobserved (hidden) states. Prior research has shown that HMM
analysis can be applied to identify particular kinds of malware. In this technique, a Hidden
Markov Model is trained on known malware features (e.g., operation code sequences), and
once the training stage is completed, the trained model is applied to score the incoming
traffic. The score is then compared with a predefined threshold: a score greater than the
threshold indicates malware, while a score less than the threshold identifies the traffic as
normal.[21]
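
A hedged sketch of this score-and-threshold scheme using the hmmlearn library follows;
the feature extraction, component count, and threshold value are placeholders, not values
from the cited work.

# Hedged HMM-scoring sketch with hmmlearn; features, component count, and
# threshold are illustrative placeholders.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
train_features = rng.normal(size=(500, 4))  # stand-in for known-malware features

model = hmm.GaussianHMM(n_components=3, n_iter=50, random_state=0)
model.fit(train_features)

incoming = rng.normal(size=(20, 4))         # stand-in for incoming traffic
score = model.score(incoming)               # log-likelihood under the model
THRESHOLD = -120.0                          # assumed, tuned on validation data
print("malware" if score > THRESHOLD else "normal")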
Support Vector Machines (SVM)
SVM is a discriminative classifier defined by a separating hyperplane. SVMs use a
kernel function to map the training data into a higher-dimensional space so that intrusions
can be linearly separated. SVMs are well known for their generalization capability and are
particularly valuable when the number of attributes is large and the number of data points is
small. Different types of separating hyperplanes can be achieved by applying a kernel, such
as linear, polynomial, Gaussian Radial Basis Function (RBF), or hyperbolic tangent. In IDS
datasets, many features are redundant or less influential in separating data points into the
correct classes; therefore, feature selection should be considered during SVM training.
SVMs can also be used for classification into multiple classes. In the work by Li et al., an
SVM classifier with an RBF kernel was applied to classify the KDD 1999 dataset into
predefined classes, and from a total of 41 attributes, a subset of features was carefully chosen
using a feature selection method.[21]
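
A minimal scikit-learn sketch of such an RBF-kernel SVM classifier is given below; the
synthetic data stands in for a feature-selected subset of KDD attributes.

# RBF-kernel SVM sketch with scikit-learn; the synthetic data is a placeholder
# for a feature-selected subset of KDD attributes.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))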

The fuzzy logic technique is based on degrees of uncertainty rather than the
typical true-or-false Boolean logic on which contemporary PCs are built. It therefore
presents a straightforward way of arriving at a conclusion from unclear, ambiguous, noisy,
inaccurate, or missing input data. Within a fuzzy domain, fuzzy logic permits an instance to
belong, possibly partially, to multiple classes at the same time. Fuzzy logic is thus a good
classifier for IDS problems, as security itself includes vagueness and the borderline between
normal and abnormal states is not well defined. In addition, the intrusion detection problem
involves various numeric features in the collected data and several derived statistical
metrics. Building IDSs on numeric data with hard thresholds produces high false alarm
rates: an activity that deviates only slightly from a model might not be recognized, while a
minor change in normal activity could produce a false alarm. With fuzzy logic, it is possible
to model such minor abnormalities and keep the false rates low, so the false alarm rate in
determining intrusive actions can be decreased. The authors outlined a group of fuzzy rules
to describe the normal and abnormal activities in a computer system, and a fuzzy inference
engine to identify intrusions.[21]

3.1. Artificial Neural Network (ANN)

An Artificial Neural Network (ANN) is an information processing paradigm
inspired by the brain. ANNs, like people, learn by example. An ANN is configured for a
specific application, such as pattern recognition or data classification, through a learning
process. Learning largely involves adjustments to the synaptic connections that exist
between the neurons.[14] An artificial neural network has three or more interconnected
layers. The first layer consists of input neurons, which send data on to the deeper layers,
which in turn send the final output data to the last output layer. All the inner layers are
hidden and are formed by units that adaptively transform the information received from
layer to layer through a series of transformations. Each layer acts as both an input and an
output layer, which allows the ANN to understand more complex objects. Collectively,
these inner layers are called the neural layers. The units in the neural layers try to learn
about the information gathered by weighing it according to the ANN's internal system.
These guidelines allow the units to generate a transformed result, which is then provided as
output to the next layer. An additional set of learning rules makes use of backpropagation, a
process through which the ANN can adjust its output by taking errors into account. Through
backpropagation, each time the output is labeled as an error during the supervised training
phase, the information is sent backward, and each weight is updated in proportion to how
much it was responsible for the error. Hence, the error is used to recalibrate the weights of
the ANN's unit connections to account for the difference between the desired outcome and
the actual one. In due time, the ANN "learns" how to minimize the chance of errors and
unwanted results.

Training an artificial neural network involves choosing from allowed models, for
which there are several associated algorithms. An ANN has several advantages, but one of
the most recognized is the fact that it can actually learn from observing data sets. In this
way, an ANN can be used as a random function approximation tool. Such tools help
estimate the most cost-effective and ideal methods for arriving at solutions while defining
computing functions or distributions. An ANN takes data samples rather than entire data
sets to arrive at solutions, which saves both time and money. ANNs are considered fairly
simple mathematical models that enhance existing data analysis technologies. They can be
used for many practical applications, such as predictive analysis in business intelligence,
spam email detection, natural language processing in chatbots, and many more.[13]

3.2. Convolutional Neural Network

A CNN model proposed in [15] has two variants, BCNN and MCNN, where the first
model (BCNN) is used for binary classification and the second model (MCNN) is used for
multiclass classification of network attacks.

The input layer is either an 11×11×1 matrix for the NSL-KDD data set or
a 14×14×1 matrix for the UNSW-NB15 data set. In the rest of this section, we use S to
represent the input image side, that is, the height or width, where S=11 or S=14, depending
on the data set used.
The model is composed of a total of five convolutional layers, two pooling layers,
and four fully connected layers. The input image is small, either 11×11 pixels
or 14×14 pixels, so a smaller filter size, 2×2, is more appropriate. To keep the
representational power of the model, the number of feature maps is increased as the network
deepens, and padding is applied in all convolutional layers to overcome the problems of
image shrinkage and information loss around the perimeter of the image. Batch
normalization is also used after each convolutional layer.
The first convolutional layer has as input an S×S×1 image. We use k=8 (2×2×1) kernels,
zero-padding of 1, and a stride of 1, for a convolutional layer of size S×S×8. Each activation
map i is calculated as shown in Equation (3.2.1), where l is the current layer, B_i^{(l)} is a
bias matrix, k^{(l-1)} is the number of kernels used in the previous layer, W^{(l)} is the
current layer's kernel matrix, and Y^{(l-1)} is the output of the previous layer:

    Y_i^{(l)} = B_i^{(l)} + \sum_{j=1}^{k^{(l-1)}} W_{i,j}^{(l)} * Y_j^{(l-1)}    (3.2.1)

The nonlinearity is the Leaky ReLU function, defined as shown in Equation (3.2.2), with
α = 0.12:

    f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases}            (3.2.2)

The second convolutional layer has as input an S×S×8 tensor, and we use k=16 kernels for
a convolutional layer of size S×S×16. The input image is added to the tensor at this point, a
step inspired by the idea of skip connections in residual networks. Skip connections speed
up the learning process and overcome the problem of vanishing gradients. The vanishing
gradient problem is encountered when training ANNs with gradient-based methods and is a
serious detriment to deep learning using backpropagation: the neural network's weights are
updated in proportion to the partial derivative of the error function with respect to the
current weight during training, and if the gradient is very small, the weight value is
inadequately updated. In the worst case, this stops the neural network from further training.
The authors noticed the vanishing gradient problem in the model and incorporated skip
connections to amplify the input signal and prevent zero gradients. As shown in Fig. 3.1, the
input image is added to the tensor at two points in the architecture. This structure is then
repeated before the fifth and final convolutional layer, which has 32 feature maps.

Fig.3.1 CNN model. CNN, convolutional neural network.

Next, there are two max-pooling layers with a 2×2 window size. The tensor is then
flattened and followed by four fully connected layers, with sizes 500, 300, 100, and 20,
respectively. All fully connected layers use a 30% dropout rate, set experimentally, to reduce
overfitting.

The output layer is a 5-class Softmax layer (one class for each attack type, plus the normal
class). Softmax outputs a probability-like prediction for each class (see Equation (4.3) in
Section 4.2.3), where N is the number of output classes. The CNN architecture is shown in
Fig. 3.1.

The model incorporates various methods to reduce overfitting. In particular, it
incorporates a dropout layer, where randomly selected activations are set to 0 during
training so that the model becomes less sensitive to specific weights in the network. In
addition, the model uses weight decay, also called L2 regularization, which reduces
overfitting by penalizing model weights. It also incorporates batch normalization, which
normalizes the input values of each layer, reducing overfitting and improving gradient flow
through the network. Finally, the model incorporates a data augmentation technique,
specifically SMOTE, which is also effective for the reduction of overfitting. The final model
is evaluated on a separate, unseen test set that contains new attack types not present in the
training set.

Optimization
The model was tested with two optimizers, Stochastic Gradient Descent and Adam, and
Adam was selected as it was found to work better. The loss function used is the categorical
cross-entropy loss, which is widely used to calculate the probability that the input belongs to
a particular class and is usually the default function for multiclass classification. The
learning rate is set to lr=0.001, chosen experimentally.
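
A condensed Keras sketch of this architecture is given below. S=11 (NSL-KDD) is used;
the exact placement of the repetitions, the use of concatenation to re-inject the input image,
and the single pooling stage are simplifying assumptions over the description above.

# Condensed, hedged sketch of the described CNN (Keras functional API).
# Concatenation stands in for the "add the input image to the tensor" step.
from tensorflow import keras
from tensorflow.keras import layers

S = 11
inp = keras.Input(shape=(S, S, 1))

def conv_block(x, filters):
    x = layers.Conv2D(filters, 2, padding="same")(x)  # 2x2 kernels, padded
    x = layers.BatchNormalization()(x)                # BN after each conv
    return layers.LeakyReLU(alpha=0.12)(x)            # Leaky ReLU, alpha=0.12

x = conv_block(inp, 8)
x = conv_block(x, 16)
x = layers.Concatenate()([x, inp])     # first input re-injection point
x = conv_block(x, 16)
x = conv_block(x, 32)
x = layers.Concatenate()([x, inp])     # second input re-injection point
x = conv_block(x, 32)                  # fifth conv layer, 32 feature maps
x = layers.MaxPooling2D(2)(x)
x = layers.Flatten()(x)
for units in (500, 300, 100, 20):      # four fully connected layers
    x = layers.Dense(units, activation="relu")(x)
    x = layers.Dropout(0.3)(x)         # 30% dropout
out = layers.Dense(5, activation="softmax")(x)  # 5-class output

model = keras.Model(inp, out)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])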

3.3. Recurrent Neural Network

Recurrent neural networks recognize data's sequential characteristics and use patterns
to predict the next likely scenario. RNNs are used in deep learning and in the development of
models that simulate neuron activity in the human brain. They are especially powerful in use
cases where context is critical to predicting an outcome, and are also distinct from other types
of artificial neural networks because they use feedback loops to process a sequence of data
that informs the final output. These feedback loops allow information to persist.

Artificial neural networks are created with interconnected data processing
components that are loosely designed to function like the human brain. They are composed
of layers of artificial neurons -- network nodes -- that have the ability to process input and
forward output to other nodes in the network. The nodes are connected by edges or weights
that influence a signal's strength and the network's ultimate output.

In some cases, artificial neural networks process information in a single direction
from input to output. These "feed-forward" neural networks include convolutional neural
networks that underpin image recognition systems. RNNs, on the other hand, can be layered
to process information in two directions.

Like feed-forward neural networks, RNNs can process data from initial input to final
output. Unlike feed-forward neural networks, RNNs use feedback loops, such as
backpropagation through time, throughout the computational process to loop information
back into the network. This connects inputs and is what enables RNNs to process sequential
and temporal data. A truncated backpropagation through time network is an RNN in which
the number of time steps in the input sequence is limited by truncating the input sequence.
This is useful for recurrent neural networks used as sequence-to-sequence models, where
the number of time steps in the input sequence is greater than the number of steps in the
output sequence.

CHAPTER 4

Evaluation of DNN

4.1. DEEP NEURAL NETWORK

Deep neural networks (DNNs) are artificial neural networks (ANNs) with a
multi-layered structure between the input and output layers; they are an extension of
conventional artificial neural networks. Compared to conventional neural networks, deep
neural networks differ in two main respects. Conventional neural networks are shallow,
having one or two hidden layers, whereas deep neural networks have many hidden layers;
for instance, the Google Brain project used a neural network of millions of neurons. There
is a wide range of deep models, including DNNs, CNNs, RNNs, and LSTMs, and recent
studies have even brought attention-based networks that focus on specific parts of a deep
neural network. The larger the network and the more layers it has, the more complex it
becomes and the more resources and time it needs to train. Deep neural networks work best
on GPU-based architectures, which take less time to train than classical CPUs, and recent
developments have shortened training times considerably.[22] DNNs can model complex
non-linear relationships and can generate computational models where the object is
expressed as a layered composition of primitives. While traditional machine learning
algorithms are linear, deep neural networks are stacked in an increasing hierarchy of
complexity and abstraction. Each layer applies a nonlinear transformation to its input and
creates a statistical model as output from what it learns. In simple terms, the input data is
received by the input layer and passed on to the first hidden layer; the hidden layers perform
mathematical computations on the inputs. One of the challenges in creating neural networks
is deciding the number of hidden layers and the number of neurons in each layer. Each
neuron has an activation function, which is used to standardize its output. The "deep" in
deep learning refers to having more than one hidden layer. The output layer returns the
output data, and training epochs continue until the output has reached an acceptable level of
accuracy.
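
As an illustration of how such networks can be assembled and varied in depth, the hedged
Keras sketch below builds DNNs with 1 to 5 hidden layers, echoing the experimental setup
described in the Abstract (learning rate 0.1, binary attack-vs-normal output); the layer width
of 1024 units and the use of plain SGD are assumptions.

# Hedged builder for DNNs with 1-5 hidden layers; width 1024 and SGD are
# illustrative assumptions, while lr=0.1 follows the setup in the Abstract.
from tensorflow import keras
from tensorflow.keras import layers

def build_dnn(n_hidden: int, n_features: int = 41) -> keras.Model:
    model = keras.Sequential([layers.Input(shape=(n_features,))])
    for _ in range(n_hidden):
        model.add(layers.Dense(1024, activation="relu"))  # one hidden layer
    model.add(layers.Dense(1, activation="sigmoid"))      # attack vs. normal
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.1),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

models = {n: build_dnn(n) for n in range(1, 6)}  # DNNs with 1..5 hidden layers
# Each model would then be trained, e.g. model.fit(X, y, epochs=100)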

4.1.1. Backpropagation Algorithm
Neural networks are able to learn a desired function using large amounts of data and
an iterative algorithm called backpropagation. We feed the network with data, it produces an
output, we compare that output with the desired one (using a loss function), and we readjust
the weights based on the difference; then we repeat, again and again. The adjustment of
weights is performed using a non-linear optimization technique called stochastic gradient
descent. After a while, the network becomes very good at producing the output, the training
is over, and we have managed to approximate our function. If we then pass an input with an
unknown output to the network, it will give us an answer based on the approximated
function. Let's use an example to make this clearer. Say that for some reason we want to
identify images containing a tree. We feed the network with all kinds of images and it
produces an output; since we know whether each image actually contains a tree, we can
compare the output with the truth and adjust the network. As we pass more and more
images, the network makes fewer and fewer mistakes. Now we can feed it an unknown
image, and it will tell us whether the image contains a tree. Over the years, researchers have
come up with amazing improvements on the original idea, each new architecture targeted at
a specific problem and achieving better accuracy and speed.
Backpropagation, short for "backward propagation of errors," is an algorithm for
supervised learning of artificial neural networks using gradient descent. Given an artificial
neural network and an error function, the method calculates the gradient of the error function
with respect to the neural network's weights. It is a generalization of the delta rule for
perceptrons to multilayer feedforward neural networks.

The "backwards" part of the name stems from the fact that calculation of the gradient
proceeds backwards through the network, with the gradient of the final layer of weights
being calculated first and the gradient of the first layer of weights being calculated last.
Partial computations of the gradient from one layer are reused in the computation of the
gradient for the previous layer. This backwards flow of the error information allows for
efficient computation of the gradient at each layer versus the naive approach of calculating
the gradient of each layer separately.

21
Backpropagation's popularity has experienced a recent resurgence given the widespread
adoption of deep neural networks for image recognition and speech recognition. It is
considered an efficient algorithm, and modern implementations take advantage of specialized
GPUs to further improve performance. The backpropagation algorithm consists of two
phases:
1. The forward pass where our inputs are passed through the network and output
predictions obtained (also known as the propagation phase).

2. The backward pass where we compute the gradient of the loss function at the final
layer (i.e., predictions layer) of the network and use this gradient to recursively apply
the chain rule to update the weights in our network (also known as the weight
update phase).

The working of backpropagation can be explained as follows:

• The input layer receives the inputs X through the preconnected path.
• The input is modified using weights 'W', which are initially selected at random.
• The output is calculated for every neuron, from the input layer through the hidden
layers, until the output data arrives at the output layer.
• The errors in the outputs are evaluated.
• To decrease the error, the weights are adjusted by going back from the output layer to
the hidden layers.
• The process is repeated until the desired output is obtained.

The difference between the actual output and the desired output is used to calculate the
error in the result:[28]

Error = actual output – desired output.
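
A tiny worked sketch of this loop in Python (one linear neuron trained with mean squared
error; all numbers are illustrative):

# One-neuron backpropagation sketch: forward pass, error, weight update.
import numpy as np

X = np.array([[0.5], [1.0], [1.5]])   # inputs
d = np.array([[1.0], [2.0], [3.0]])   # desired outputs (target rule: y = 2x)
w, b, lr = 0.1, 0.0, 0.5              # initial weight, bias, learning rate

for step in range(2000):
    y = w * X + b                     # forward pass
    error = y - d                     # actual output - desired output
    # gradient step on mean squared error E = 0.5 * mean(error**2)
    w -= lr * np.mean(error * X)      # backward pass: adjust weight
    b -= lr * np.mean(error)          # ... and bias
print(round(w, 3), round(b, 3))       # w approaches 2, b approaches 0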

In fact, the BP algorithm is a supervised learning method. It uses mean squared
error and gradient descent to modify the connection weights of the network, with the aim of
minimizing the error sum of squares. In this algorithm, small initial values are first assigned
to the connection weights of the network, and then a training sample is selected to calculate
the gradient of the error with respect to that sample.
The BP learning process can be described as follows. (1) Forward propagation of the
operating signal: the input signal is propagated from the input layer, via the hidden layers, to
the output layer. During forward propagation, the weight and offset values of the network
are held constant, and the status of each layer of neurons only affects the next layer. If the
expected output cannot be achieved at the output layer, the process switches to the back
propagation of the error signal. (2) Back propagation of the error signal: the difference
between the real output and the expected output of the network is defined as the error signal,
and it is propagated from the output end toward the input layer, layer by layer. During this
back propagation, the weight values of the network are regulated by the error feedback; the
continuous modification of the weight and offset values brings the real output of the
network closer and closer to the expected one.
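
In standard notation, with η the learning rate, d_k the expected output, and y_k the real
output, the rule described above can be written as:

    E = \frac{1}{2} \sum_k (d_k - y_k)^2, \qquad
    w_{ij} \leftarrow w_{ij} - \eta \frac{\partial E}{\partial w_{ij}}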

4.2. Activation Function


An activation function in a neural network defines how the weighted sum of the
input is transformed into an output from a node or nodes in a layer of the network.

A neural network has neurons that work according to their weights, biases, and
respective activation functions. In a neural network, we update the weights and biases of the
neurons on the basis of the error at the output; this process is known as back-propagation.
Activation functions make back-propagation possible, since the gradients are supplied along
with the error to update the weights and biases. A neural network without an activation
function is essentially just a linear regression model; the activation function performs the
non-linear transformation of the input that makes the network capable of learning and
performing more complex tasks.[23] Sometimes the activation function is called a "transfer
function"; if its output range is limited, it may be called a "squashing function." Many
activation functions are nonlinear and may be referred to as the "nonlinearity" of the layer
or the network design.[24] Technically, the activation function is used within or after the
internal processing of each node in the network, although networks are designed to use the
same activation function for all nodes in a layer.

A network may have three types of layers: input layers that take raw input from the
domain, hidden layers that take input from another layer and pass output to another layer,
and output layers that make a prediction. All hidden layers typically use the same activation
function. The output layer will typically use a different activation function from the hidden
layers and is dependent upon the type of prediction required by the model. Activation
functions are also typically differentiable, meaning the first-order derivative can be
calculated for a given input value. This is required given that neural networks are typically
trained using the backpropagation of error algorithm that requires the derivative of prediction
error in order to update the weights of the model.[24]

4.2.1. Sigmoid Function


In an ANN, the sigmoid function is a non-linear AF used primarily in feedforward
neural networks. It is a differentiable real function, defined for real input values, and
containing positive derivatives everywhere with a specific degree of smoothness. The
sigmoid function appears in the output layer of the deep learning models and is used for
predicting probability-based outputs.
The sigmoid function is represented as:

f(x) = 1 / (1 + e^(-x))                                                (4.1)

Generally, the derivatives of the sigmoid function are applied to learning algorithms.
The graph of the sigmoid function is ‘S’ shaped.
Some of the major drawbacks of the sigmoid function include gradient saturation, slow
convergence, sharp damp gradients during backpropagation from within deeper hidden layers
to the input layers, and non-zero centered output that causes the gradient updates to
propagate in varying directions.[25]
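These properties are easy to verify numerically. The small NumPy illustration below evaluates the sigmoid and its derivative (the quantity consumed by learning algorithms); the sample inputs are arbitrary.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # S-shaped, output in (0, 1)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                 # derivative used by learning algorithms

x = np.array([-6.0, -1.0, 0.0, 1.0, 6.0])
print(sigmoid(x))        # saturates toward 0 and 1 at the extremes
print(sigmoid_grad(x))   # gradients vanish for large |x| (gradient saturation)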

4.2.2. Hyperbolic Tangent Function (Tanh)
The hyperbolic tangent function, a.k.a. the tanh function, is another type of AF. It is a smoother, zero-centered function with a range between -1 and 1. The tanh function is represented by:

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))                                 (4.2)

The tanh function is much more extensively used than the sigmoid function since it delivers
better training performance for multilayer neural networks. The biggest advantage of the tanh
function is that it produces a zero-centered output, thereby supporting the backpropagation
process. The tanh function has been mostly used in recurrent neural networks for natural
language processing and speech recognition tasks.
However, the tanh function, too, has a limitation – just like the sigmoid function, it cannot
solve the vanishing gradient problem. Also, the tanh function can only attain a gradient of 1
when the input value is 0 (x is zero). As a result, the function can produce some dead
neurons during the computation process.
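The zero-centered output, and the fact that the gradient attains 1 only at x = 0, can be checked in a few lines of NumPy; the sample inputs are arbitrary.

import numpy as np

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(np.tanh(x))                 # zero-centered output in (-1, 1)
print(1.0 - np.tanh(x) ** 2)      # tanh gradient: equals 1 only at x = 0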

4.2.3. Softmax Function


The softmax function is another type of AF, used in neural networks to compute a probability distribution from a vector of real numbers. It generates outputs ranging between 0 and 1, with the sum of the probabilities being equal to 1. The softmax function is represented as follows:

f(x_i) = e^(x_i) / Σ_j e^(x_j)                                         (4.3)

This function is mainly used in multi-class models, where it returns the probability of each class, with the target class having the highest probability. It appears in the output layer of almost all DL architectures used for classification. The primary difference between the sigmoid and softmax AFs is that the former is used for binary classification while the latter is used for multi-class classification.
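A short NumPy sketch illustrating that softmax outputs form a probability distribution follows; the raw class scores are arbitrary.

import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))        # subtract max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # raw outputs for three classes (illustrative)
probs = softmax(scores)
print(probs)                         # approx. [0.659 0.242 0.099]
print(probs.sum())                   # probabilities sum to 1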

4.2.4. Softsign Function:


The softsign function is another AF used in neural network computing. Although it is used primarily in regression computation problems, nowadays it is also used in DL-based text-to-speech applications. It is represented by:

(4.4)    f(x) = x / (1 + |x|)

Here |x| denotes the absolute value of the input.


The main difference between the softsign function and the tanh function is that unlike the
tanh function that converges exponentially, the softsign function converges in a polynomial
form.
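The difference in saturation speed can be seen numerically; the sample inputs in the sketch below are arbitrary.

import numpy as np

def softsign(x):
    return x / (1.0 + np.abs(x))     # flatter tails than tanh (polynomial saturation)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(softsign(x))                   # approaches -1 and 1 only polynomially
print(np.tanh(x))                    # tanh saturates much faster (exponentially)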

4.2.5. Rectified Linear Unit (ReLU) Function:

One of the most popular AFs in DL models, the rectified linear unit (ReLU) function is a fast-learning AF that promises to deliver state-of-the-art performance. Compared to other AFs like the sigmoid and tanh functions, the ReLU function offers much better performance and generalization in deep learning. The function is nearly linear, retaining the properties of linear models that make them easy to optimize with gradient-descent methods.
The ReLU function performs a threshold operation on each input element, where all values less than zero are set to zero. Thus, the ReLU is represented as:

f(x) = max(0, x)                                                       (4.5)

By rectifying the values of the inputs less than zero and setting them to zero, this function eliminates the vanishing gradient problem observed in the earlier types of activation functions (sigmoid and tanh).[25]
The most significant advantage of using the ReLU function is that it guarantees faster computation: it requires no exponentials or divisions, thereby boosting the overall computation speed. Another critical aspect of the ReLU function is that it introduces sparsity in the hidden units by clamping all values below zero to zero. ReLU has turned out to be more efficient and has the capacity to accelerate the entire training process. Traditionally, neural networks used sigmoidal or tanh (hyperbolic tangent) activation functions, but these functions are prone to the vanishing gradient problem: vanishing gradients occur when the lower layers of a DNN have gradients of nearly zero because the units of the higher layers are nearly saturated at the asymptotes of the tanh function. ReLU offers an alternative to sigmoidal non-linearity which addresses the issues mentioned so far.
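The thresholding, the resulting sparsity and the absence of exponentials can all be seen in a few lines of NumPy; the sample inputs are arbitrary.

import numpy as np

def relu(x):
    return np.maximum(0.0, x)        # thresholds every value below zero to zero

x = np.array([-2.0, -0.1, 0.0, 0.5, 3.0])
print(relu(x))                       # [0. 0. 0. 0.5 3.] -> sparse activations
# The gradient is 1 for x > 0, so ReLU does not saturate the way sigmoid/tanh do,
# and no exponentials or divisions are needed, keeping computation cheap.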

4.3. Dataset description:

The 1998 DARPA intrusion detection (ID) evaluation program was managed and prepared by MIT Lincoln Labs. Its main objective was to analyze and conduct research in ID. A standardized dataset, which included various types of intrusions imitating a military environment, was prepared and made publicly available.
A detailed report on the major shortcomings of the provided synthetic datasets, KDDCup-'98' and KDDCup-'99', has been discussed in the literature. The main criticism was the failure to validate the datasets as a simulation of real-world network traffic profiles. Irrespective of these criticisms, the KDDCup-'99' dataset has been used as an effective benchmark by many researchers for evaluating IDS algorithms over the years. In response to the critiques about the creation of the dataset, a detailed analysis of its contents has been published, identifying the non-uniformities and artifacts in the simulated network traffic data. The reasons why machine learning classifiers have limited capability in identifying the attacks belonging to the content categories R2L and U2R in the KDDCup-'99' dataset have also been discussed, with the conclusion that it is not possible to obtain an acceptable detection rate using classical ML algorithms. The possibility of obtaining a high detection rate in most cases by producing a refined and augmented dataset, combining the train and test sets, has also been stated; however, a concrete approach was not discussed. The DARPA / KDDCup-'98' evaluation failed to assess traditional IDSs, attracting major criticism. To address this, the Snort IDS was run on the DARPA / KDDCup-'98' tcpdump traces. The system performed poorly, with low accuracy and impermissible false positive rates: it failed in detecting the DoS and probing categories, although, in contrast, it performed better in detecting R2L and U2R attacks. Despite the harsh criticisms [30], the KDDCup-'99' set is still one of the most widely used publicly available benchmarking datasets for studies related to IDS evaluation and other security-related tasks. In an effort to mitigate the underlying problems of the KDDCup-'99' set, a refined version named NSL-KDD was proposed. It removed the redundant connection records from the entire train and test data; in addition, invalid records were removed from the test data. These measures prevent a classifier from being biased towards the more frequent records. Even after this refinement, the reported issues were not fully solved, and a new dataset named UNSW-NB15 was proposed.

The DARPA ID evaluation group accumulated network-based IDS data by simulating an air force base LAN at Lincoln Labs, with over a thousand UNIX nodes and hundreds of users at a given time, continuously for nine weeks. The data was divided into seven weeks of training and two weeks of testing, from which the raw tcpdump TCP data was extracted. MIT's lab, with extensive financial support from DARPA and AFRL, used Windows and UNIX nodes, with almost all of the inbound intrusions originating from an isolated LAN. For the purpose of the dataset, 7 distinct scenarios and 32 distinct attack types, totaling 300 attacks, were simulated. Since its year of release, the KDD-'99' dataset [34] has been the most widely utilized dataset for evaluating IDSs. It is composed of almost 4,900,000 individual connection records, each described by 41 features. The simulated attacks were categorized broadly as given below:
• Denial-of-Service Attack (DoS): an intrusion in which the attacker aims to make a host inaccessible for its actual purpose by temporarily or permanently disrupting its services, flooding the target machine with enormous numbers of requests and thereby overloading the host.
• User-to-Root Attack (U2R): a category of attack in which the perpetrator starts with access to a normal user account and exploits vulnerabilities to obtain root control.
• Remote-to-Local Attack (R2L): an intrusion in which the attacker, able to send data packets to the target machine but having no user account on it, exploits some vulnerability to obtain local access while masquerading as an existing user of the target machine.
• Probing Attack: an attack in which the perpetrator tries to gather information about the computers of the network, with the ultimate aim of getting past the firewall and gaining root access.
The KDDCup-'99' feature set is classified into the following three groups.
Basic features: attributes obtained from a TCP/IP connection fall into this group. The majority of these features implicitly delay detection.
Traffic features: features computed with respect to a window of time are categorized under this group. They can be further subdivided into two groups. "Same host" features examine only the connections of the past two seconds that have the same destination host as the current connection, and serve to calculate statistics of protocol behavior, service, and so on. "Same service" features examine only the connections of the past two seconds that have the same service as the current connection.
Content features: probing attacks and DoS attacks generally exhibit some kind of frequent sequential intrusion pattern, unlike R2L and U2R attacks. This is because they involve multiple connections to a single set of host(s) in a short span of time, whereas the other two intrusion types are embedded in the data portions of the packets and generally involve only a single connection. Detecting these types of attacks therefore requires unique features that can look for irregular behavior in the data portion. These are called content features.
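As a concrete illustration, the sketch below loads the dataset with pandas and maps the specific attack labels onto the four broad categories. The file name and the abbreviated column names are assumptions made for illustration (the official distribution ships the 41 feature names in a separate kddcup.names file), and the attack-label sets shown are not exhaustive.

import pandas as pd

# Minimal loading sketch: assumes the 10% subset file "kddcup.data_10_percent"
# is available locally. Only the first few of the 41 feature names are spelled
# out; the rest get placeholder names (f6 ... f40) for brevity.
cols = (["duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes"]
        + [f"f{i}" for i in range(6, 41)] + ["label"])
df = pd.read_csv("kddcup.data_10_percent", names=cols, header=None)

# Map specific attack labels onto the broad categories described above
# (illustrative, non-exhaustive label sets).
dos = {"smurf.", "neptune.", "back.", "teardrop.", "pod.", "land."}
probe = {"satan.", "ipsweep.", "portsweep.", "nmap."}

def category(label):
    if label == "normal.":
        return "benign"
    if label in dos:
        return "dos"
    if label in probe:
        return "probe"
    return "r2l/u2r"  # remaining labels fall in the content categories

df["category"] = df["label"].map(category)
print(df["category"].value_counts())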

4.4. Identifying network parameters:

Parameters are the coefficients of the model, and they are chosen by the model itself: the algorithm, while learning, optimizes these coefficients (according to a given optimization strategy) and returns an array of parameters which minimizes the error. To give an example, in a linear regression task the model looks like y = b + ax, where b and a are the parameters. The only thing you have to do with those parameters is initialize them.[26] The parameters of a neural network are typically the weights of the connections; these are learned during the training stage, so the algorithm itself (given the input data) tunes them. Hyperparameters, by contrast, are set beforehand; typical examples are the learning rate, the batch size and the number of epochs.
Minibatch size: when you are facing billions of data points, it can be inefficient (as well as counterproductive) to feed your NN all of them at once. A good practice is feeding it smaller samples of your data, called batches: by doing so, each training step operates on a sample of the batch size. The typical size is 32 or higher; however, keep in mind that if the size is too big, the risk is an over-generalized model which won't fit new data well.
Epochs: this represents how many times you want your algorithm to train on your whole dataset (note that epochs are different from iterations: the latter are the number of batches needed to complete one epoch). Again, the number of epochs depends on the kind of data and task you are facing. One idea is to impose a condition such that training stops when the error is close to zero. Or, more simply, you can start with a relatively low number of epochs and then increase it progressively, tracking some evaluation metric (like accuracy).
Dropout: this technique consists of randomly removing some nodes so that the NN does not become too heavy. It is applied during the training phase. The idea is that we do not want our NN to be overwhelmed by information, especially considering that some nodes might be redundant. So, while building the algorithm, we can decide to keep, at each training stage, each node with probability p (the "keep probability") or drop it with probability 1-p (the "drop probability").
Strategies are approaches and best practices we might adopt to make the algorithm perform better. Among these are the following.
Data normalization: while inspecting your data, you might notice that some features are represented on different scales. This can hurt the performance of your NN, since it slows convergence. Normalizing data means converting all features to the same scale, within the range [0-1]. You can also decide to standardize your data, which means making them normally distributed with mean equal to 0 and standard deviation equal to 1. While data normalization happens before training your NN, another way to normalize your data is through so-called Batch Normalization: it happens directly during NN training, specifically after the weighted sum and before the activation function.
Optimization algorithm: gradient descent was mentioned earlier as the optimization algorithm; however, there are many variants of it: Stochastic Gradient Descent (which minimizes the loss according to gradient descent optimization and, at each iteration, randomly selects a training sample, hence "stochastic"), RMSProp (which differs from the previous one in that each parameter has an adapted learning rate) and the Adam optimizer (RMSProp plus momentum). This is not the full list, yet it is sufficient to understand that Adam is often the best choice, since it allows you to set different hyperparameters and customize your NN.
Regularization: this strategy is pivotal if you want to keep your model simple and avoid overfitting. The idea is that regularization adds a penalty to the model if the weights are too large or too many. Indeed, it adds a new term to the loss function which tends to increase (hence the loss increases too) if the re-calibration procedure increases the weights. There are two kinds of regularization: Lasso regularization (L1) and Ridge regularization (L2). The L1 regularization tends to shrink weights to zero, with the risk of discarding some inputs (since they will be multiplied by a null value), whereas L2 shrinks weights to very low values, but not to zero (hence inputs are preserved).[27]
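The sketch below pulls these knobs together in Keras (assuming TensorFlow 2.x): min-max normalization, dropout, L2 (Ridge) regularization, the Adam optimizer, and an explicit batch size and epoch count. The layer sizes, rates and random placeholder data are illustrative assumptions, not the tuned values of this thesis.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras import Sequential, layers, regularizers

X = np.random.rand(1000, 41)               # placeholder features (41 as in KDD-99)
y = np.random.randint(0, 2, size=(1000,))  # placeholder binary labels

X = MinMaxScaler().fit_transform(X)        # data normalization into [0, 1]

model = Sequential([
    layers.Dense(64, activation="relu", input_shape=(41,),
                 kernel_regularizer=regularizers.l2(0.001)),  # L2 penalty
    layers.Dropout(0.2),                   # drop probability 1 - p = 0.2
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",            # RMSProp + momentum
              loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, batch_size=32, epochs=10, verbose=0)  # minibatches of 32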

Hyper-tuning of parameters, to figure out the optimum set that achieves the desired result, is in itself a separate field with plenty of scope for future research. In this work, the learning rate is kept constant at 0.01 while the other parameters were optimized. The count of neurons in a layer was experimented with over the range 2 to 1024; the count was then further increased to 1280, but this did not yield any appreciable increase in accuracy. Therefore, the neuron count was tuned to 1024.
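One way such a width sweep can be scripted is sketched below; build_model is a hypothetical helper (not taken from the thesis) returning a compiled one-hidden-layer Keras model, and the random placeholder data stands in for the preprocessed KDD features.

import numpy as np
from tensorflow.keras import Sequential, layers

def build_model(n_neurons, n_features=41):
    # Hypothetical helper: one hidden layer of the candidate width.
    m = Sequential([
        layers.Dense(n_neurons, activation="relu", input_shape=(n_features,)),
        layers.Dense(1, activation="sigmoid"),
    ])
    m.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return m

# Placeholder data standing in for the preprocessed KDD features/labels.
X, y = np.random.rand(500, 41), np.random.randint(0, 2, size=(500,))

best = None
for n in (2, 8, 32, 128, 512, 1024, 1280):   # candidate layer widths
    hist = build_model(n).fit(X, y, epochs=5, batch_size=32,
                              validation_split=0.2, verbose=0)
    acc = hist.history["val_accuracy"][-1]
    if best is None or acc > best[1]:
        best = (n, acc)
print("best width, val accuracy:", best)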

4.5. Identifying network structures

Table 4.1 - Statistical results of network structures

Layer (type)                Output Shape     Param #
Dense-1 (Dense)             (None, 1024)     43008
Dropout-1 (Dropout)         (None, 1024)     0
Dense-2 (Dense)             (None, 768)      787200
Dropout-2 (Dropout)         (None, 768)      0
Dense-3 (Dense)             (None, 512)      393728
Dropout-3 (Dropout)         (None, 512)      0
Dense-4 (Dense)             (None, 256)      131328
Dropout-4 (Dropout)         (None, 256)      0
Dense-5 (Dense)             (None, 128)      32896
Dropout-5 (Dropout)         (None, 128)      0
Dense-6 (Dense)             (None, 1)        129
Activation-1 (Activation)   (None, 1)        0

Conventionally, increasing the number of layers yields better results than increasing the neuron count within a layer. Therefore, DNNs with 1, 2, 3, 4 and 5 hidden layers were used in order to scrutinize and determine the optimum network structure for our input data. For each of these topologies, 100 epochs were run and the results were observed. The best performance was shown by the 3-layer DNN. To broaden the search for better results, the common classical machine learning algorithms were also applied and their results compared to the 3-layer DNN, which still outperformed every single classical algorithm. The detailed statistical results for the different network structures are reported in Table 4.1.

4.6. Proposed Architecture:

Fig.4.1. Proposed architecture

An overview of the proposed DNN architecture for all use cases is shown in Fig. 4.1. It comprises five hidden layers and an output layer. The input layer consists of 41 neurons; the neurons from the input layer to the hidden layers, and from the hidden layers to the output layer, are completely connected. The back-propagation mechanism is used to train the DNN. The proposed network is composed of fully connected layers, bias terms and dropout layers to make the network more robust.
Input and hidden layers: the input layer consists of 41 neurons, which feed into the hidden layers. The hidden layers use ReLU as the non-linear activation function, and weights carry each layer's output forward to the next hidden layer. A weight is a parameter within a neural network that transforms input data within the network's hidden layers. A neural network is a series of nodes, or neurons; within each node is a set of inputs, weights and a bias value. Often the weights of a neural network are contained within the hidden layers of the network. A bias is just like the intercept added in a linear equation: it is an additional parameter in the neural network which is used to adjust the output along with the weighted sum of the inputs to the neuron. Moreover, the bias value allows you to shift the activation function to the right or left. The neuron count in each hidden layer is decreased steadily from the first layer towards the output to make the outputs more accurate while at the same time reducing the computational cost.
Regularization: to make the whole process efficient and time-saving, Dropout (0.01) is applied. The function of the dropout is to unplug neurons randomly, making the model more robust and hence preventing it from over-fitting the training set.
Output layer and classification: the output layer distinguishes only two classes, Attack and Benign. Since the activations of the final hidden layer must be reduced to this binary decision, a sigmoid activation function is used; its output can be thresholded into exactly two classes, favoring the binary classification intended in this thesis.
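For concreteness, a hedged Keras reconstruction consistent with Fig. 4.1 and Table 4.1 is sketched below (41 inputs; hidden widths 1024, 768, 512, 256 and 128 with ReLU and Dropout(0.01); a single sigmoid output unit). The thesis code itself is not listed, so treat this as an illustrative sketch.

from tensorflow.keras import Sequential, layers

model = Sequential()
model.add(layers.Dense(1024, activation="relu", input_shape=(41,)))  # 41 KDD features
model.add(layers.Dropout(0.01))
for width in (768, 512, 256, 128):           # hidden widths decrease steadily
    model.add(layers.Dense(width, activation="relu"))
    model.add(layers.Dropout(0.01))
model.add(layers.Dense(1))                   # single output unit (see Table 4.1)
model.add(layers.Activation("sigmoid"))      # thresholded into Attack / Benign

model.summary()   # parameter counts should match Table 4.1 (e.g. 43008 for Dense-1)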

Chapter-5

RESULT AND DISCUSSION

The model proposed has been evaluated based on the following standard performance
measures:

True Positive Rate (TPR): It is calculated as the ratio between the number of
correctly predicted attacks and the total number of attacks. If all intrusions are detected then
the TPR is 1 which is extremely rare for an IDS. TPR is also called a Detection Rate (DR) or
the Sensitivity. The TPR can be expressed mathematically as
TPR = TP / (TP + FN)                                                   (5.1)

False Positive Rate (FPR): It is calculated as the ratio between the number of normal
instances incorrectly classified as an attack and the total number of normal instances.

FPR = FP / (FP + TN)                                                   (5.2)

False Negative Rate (FNR): False negative means when a detector fails to identify an
anomaly and classifies it as normal. The FNR can be expressed mathematically as:

FNR = FN / (FN + TP)                                                   (5.3)

Classification rate (CR) or Accuracy: The CR measures how accurate the IDS is in detecting
normal or anomalous traffic behavior. It is described as the percentage of all those correctly
predicted instances to all instances:

Accuracy = (TP + TN) / (TP + TN + FP + FN)                             (5.4)

whereby FP, sometimes referred to as a Type I error, is the rate of legitimate traffic classified as attacks, and FN, sometimes referred to as a Type II error, is the rate of attack traffic classified as legitimate. Additional metrics considered in this thesis are the Recall, the Precision and the F1 score.

Recall: The ability of a model to find all the relevant cases within a data set. Mathematically,
we define recall as the number of true positives divided by the number of true positives plus
the number of false negatives [29].
Recall = TP / (TP + FN)                                                (5.5)

Precision: The ability of a classification model to identify only the relevant data points. Mathematically, precision is the number of true positives divided by the number of true positives plus the number of false positives.

Precision = TP / (TP + FP)                                             (5.6)

The F1 score is the weighted average of Precision and Recall; therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution.

F1score = 2 · (Precision · Recall) / (Precision + Recall)              (5.7)
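As a quick check of Eqs. (5.1)-(5.7), the snippet below derives all of the above measures from a confusion matrix with scikit-learn; the toy label vectors are illustrative only.

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # toy labels: 1 = attack, 0 = benign
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)                          # Eq. (5.1), detection rate
fpr = fp / (fp + tn)                          # Eq. (5.2)
fnr = fn / (fn + tp)                          # Eq. (5.3)
acc = (tp + tn) / (tp + tn + fp + fn)         # Eq. (5.4)
print(tpr, fpr, fnr, acc)
print(precision_score(y_true, y_pred),        # Eq. (5.6)
      recall_score(y_true, y_pred),           # Eq. (5.5)
      f1_score(y_true, y_pred))               # Eq. (5.7)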

The KDDCup-'99' dataset was fed into the classical ML algorithms. After training was completed, all models were compared on F1 score, accuracy, recall and precision against the test dataset. The accuracy scores of the classical ML algorithms are compared in detail in Fig. 5.1, and their Precision, Recall and F1 score in Fig. 5.2. It is clear that the Naïve Bayes algorithm gave the highest accuracy.

Fig.5.1. Accuracy of classical machine learning algorithms (Logistic Regression 0.846, Decision Tree 0.928, Naïve Bayes 0.929)

Fig.5.2. F1 score, Recall and Precision of classical ML algorithms (Decision Tree 0.954 / 0.912 / 0.999; Naïve Bayes 0.955 / 0.923 / 0.988; Logistic Regression 0.896 / 0.819 / 0.988)

Fig.5.3. Accuracy of DNNs with one to five hidden layers

The accuracy of all five deep neural network depths is compared in Fig. 5.3. It is clear that the 3-layer DNN gave the highest accuracy, which is higher than that of the classical machine learning algorithms.

Fig.5.4. Recall values of DNNs with one to five hidden layers

Fig.5.5. Precision and F1 score of DNNs with one to five hidden layers

Fig.5.6. Overall accuracy of all evaluated algorithms

Table 5.1 - Overall performance of the algorithms

Algorithms            Accuracy   Precision   Recall   F1 Score
Logistic Regression   0.846      0.988       0.819    0.896
Naïve Bayes           0.929      0.988       0.923    0.955
Decision Tree         0.928      0.999       0.912    0.954
DNN 1                 0.929      0.998       0.915    0.954
DNN 2                 0.929      0.998       0.914    0.954
DNN 3                 0.930      0.997       0.915    0.955
DNN 4                 0.929      0.999       0.913    0.954
DNN 5                 0.927      0.998       0.911    0.953

For the scope of this thesis, the KDDCup-'99' dataset was fed into classical machine learning algorithms as well as DNNs with varying numbers of hidden layers. After training was completed, all models were compared on F1 score, accuracy, recall and precision against the test dataset. The scores of the DNNs are depicted in detail in Fig. 5.3, Fig. 5.4 and Fig. 5.5, and the scores of the classical machine learning algorithms are compared in Fig. 5.1 and Fig. 5.2. All the values are compared in detail in Table 5.1, and the overall accuracy is depicted in Fig. 5.6. It is certain that the 3-layer DNN outperformed all the classical machine learning algorithms. This is because of the ability of DNNs to extract features at higher levels of abstraction, and the non-linearity of the networks adds to this advantage when compared with the other algorithms.

Chapter-6

CONCLUSION
This thesis has comprehensively recapitulated the usefulness of Deep Neural Networks for Intrusion Detection Systems. For reference, classical machine learning algorithms such as Logistic Regression, Naïve Bayes and Decision Tree have been evaluated and compared against the results of the Deep Neural Network. The publicly available KDDCup-'99' dataset has been used as the primary benchmark for the study, through which the superiority of Deep Neural Networks (DNNs) over the other compared algorithms has been documented clearly. For further refinement, this work evaluated DNNs with up to five hidden layers and concluded that the 3-layer DNN is the most effective and accurate of all, with an accuracy of 0.930. From the empirical results of this thesis, we may claim that deep learning methods are a promising direction for intrusion detection tasks; but even though the performance on an artificial dataset is exceptional, applying the same methods to real-time network traffic, which contains more complex and recent attack types, is necessary. Additionally, studies regarding the robustness of these DNNs in adversarial environments are required. The rapid growth in variants of deep learning algorithms calls for an overall evaluation of these algorithms with regard to their effectiveness for IDSs.

FUTURE WORK

The KDDCup-'99' and NSL-KDD datasets are the most well known, but they are outdated and no longer representative of today's network traffic. Applying the proposed methodologies to a recent network traffic dataset such as CICIDS2017, which is labelled based on timestamps, source and destination IPs, source and destination ports, protocols and attacks, and which was collected on a complete network topology comprising a modem, firewall, switches, routers and nodes with different operating systems, is essential. This remains a significant direction for future work.

REFERENCES:

[1] R. Vinayakumar, K. P. Soman and P. Poornachandran, "Evaluating effectiveness of shallow and deep networks to intrusion detection system," 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2017, pp. 1282-1289, doi: 10.1109/ICACCI.2017.8126018.

[2] Femi Emmanuel Ayo, Sakinat Oluwabukonla Folorunso, Adebayo A. Abayomi-Alli, Adebola Olayinka Adekunle and Joseph Bamidele Awotunde (2020), "Network intrusion detection based on deep learning model optimized with rule-based hybrid feature selection," Information Security Journal: A Global Perspective, 29:6, 267-283, doi: 10.1080/19393555.2020.1767240.

[3] Isidro Cortés-Ciriano and Andreas Bender, "Deep Confidence: A Computationally Efficient Framework for Calculating Reliable Prediction Errors for Deep Neural Networks," Journal of Chemical Information and Modeling, 2019, 59(3), 1269-1281, doi: 10.1021/acs.jcim.8b00542.

[4] Srinivasan, Sriram; A, Shashank; R, Vinayakumar; KP, Soman (2020), "DCNN-IDS: Deep Convolutional Neural Network based Intrusion Detection System," TechRxiv, Preprint, https://doi.org/10.36227/techrxiv.12

[5] Vinayakumar, R., Soman, K. P., & Poornachandran, P. (2019), "A Comparative Analysis of Deep Learning Approaches for Network Intrusion Detection Systems (N-IDSs): Deep Learning for N-IDSs," International Journal of Digital Crime and Forensics (IJDCF).

[6] Wang Peng, Xiangwei Kong, Guojin Peng, Xiaoya Li and Zhongjie Wang (2019), "Network Intrusion Detection Based on Deep Learning," International Conference on Communications, Information System and Computer Engineering (CISCE).

[7] Isidro Cortés-Ciriano and Andreas Bender, "Deep Confidence: A Computationally Efficient Framework for Calculating Reliable Prediction Errors for Deep Neural Networks" (October 17, 2018).

[8] Ruijie Zhao, Zhaojie Li, Zhi Xue, Tomoaki Ohtsuki and Guan Gui, "A Novel Approach based on Lightweight Deep Neural Network for Network Intrusion Detection" (2021).

[9] Yanyan Zhang and Xiangjin Ran, "A Step-Based Deep Learning Approach for Network Intrusion Detection" (2021).

[10] Yusuf Sani, Ahmed Mohamedou, Khalid Ali, Anahita Farjamfar, Mohamed Azman and Solahuddin Shamsuddin, "An Overview of Neural Networks Use in Anomaly Intrusion Detection Systems" (2009), IEEE Student Conference on Research and Development (SCOReD 2009).

[11] T. Kohonen, "The self-organizing map," Proc. IEEE, vol. 78, no. 9, pp. 1464-1480.

[12] L. Ertoz, M. Steinbach and V. Kumar, "Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data," in Proceedings of the 2003 SIAM International Conference on Data Mining, SIAM, 2003.

[13] "Artificial Neural Network," Accessed: 14-12-2021 [online]. Available: https://www.techopedia.com/definition/5967/artificial-neural-network-ann

[14] "Artificial neural network," Accessed: 14-12-2021 [online]. Available: https://www.geeksforgeeks.org/introduction-to-ann-set-4-network-architectures/

[15] Al-Turaiki I, Altwaijry N., "A Convolutional Neural Network for Improved Anomaly-Based Network Intrusion Detection," Big Data, 2021;9(3):233-252, doi: 10.1089/big.2020.0263.

[16] Vinayakumar R, Alazab M, Soman K, et al., "Deep learning approach for intelligent intrusion detection system," IEEE Access, 2019;7:41525-41550.

[17] Pasupa K, Sunhem W., "A comparison between shallow and deep architecture classifiers on small dataset," in 2016 8th International Conference on Information Technology and Electrical Engineering (ICITEE), Yogyakarta, Indonesia: IEEE, 2016, pp. 1-6.

[18] Yin C, Zhu Y, Fei J, et al., "A deep learning approach for intrusion detection using recurrent neural networks," IEEE Access, 2017;5:21954-21961.

[19] Altwaijry N, Alqahtani A, Al-Turaiki I., "A deep learning approach for anomaly-based network intrusion detection," in Tian Y, Ma T, Khan MK (Eds.), First International Conference on Big Data and Security, Nanjing, China: Springer, 2019.

[20] Wu K, Chen Z, Li W., "A novel intrusion detection model for a massive network using convolutional neural networks," IEEE Access, 2018;6:50850-50859.

[21] Khraisat, A., Gondal, I., Vamplew, P. et al., "Survey of intrusion detection systems: techniques, datasets and challenges" (2019). https://doi.org/10.1186/s42400-019-0038-7

[22] Abdulhamit Subasi, in Practical Machine Learning for Data Analysis Using Python (2020).

[23] "Activation functions-neural network," Accessed: 15-12-2021 [online]. Available: https://www.geeksforgeeks.org/activation-functions-neural-networks/

[24] "Activation functions-neural network," Accessed: 15-12-2021 [online]. Available: https://machinelearningmastery.com/choose-an-activation-function-for-deep-learning/

[25] "Activation functions-neural network," Accessed: 15-12-2021 [online]. Available: https://www.upgrad.com/blog/types-of-activation-function-in-neural-networks/

[26] "Parameters and hyperparameters," Accessed: 15-12-2021 [online]. Available: https://towardsdatascience.com/neural-networks-parameters-hyperparameters-and-optimization-strategies-3f0842fac0a5

[27] "Backpropagation," Accessed: 15-12-2021 [online]. Available: https://brilliant.org/wiki/backpropagation/

[28] "Back-propagation," Accessed: 15-12-2021 [online]. Available: https://www.watelectronics.com/back-propagation-neural-network/

[29] "Precision, recall," Accessed: 16-12-2021 [online]. Available: https://builtin.com/data-science/precision-and-recall
