
Neurocomputing 452 (2021) 705–715


Intrusion detection approach based on optimised artificial neural network

Michał Choraś a,c,*, Marek Pawlicki b,c

a FernUniversität in Hagen (FUH), Germany
b ITTI Sp. z o.o., Poland
c UTP University of Science and Technology, Poland

Article info

Article history: Received 9 April 2020; Revised 8 June 2020; Accepted 9 July 2020; Available online 19 December 2020.

Keywords: Cybersecurity; Artificial Neural Network; Machine learning

Abstract

Context and rationale: Intrusion detection, the ability to detect malware and other attacks, is a crucial aspect of ensuring cybersecurity. So is the ability to identify the myriad of attack types.
Objective: Artificial Neural Networks (as well as other bio-inspired machine learning approaches) are an established and proven method of accurate classification. ANNs are extremely versatile: a wide range of setups can achieve significantly different classification results. The main objective and contribution of this paper is the evaluation of the way the hyperparameters can influence the final classification result.
Method and results: In this paper, a wide range of ANN setups is put to comparison. We have performed our experiments on two benchmark datasets, namely NSL-KDD and CICIDS2017.
Conclusions: The most effective arrangement achieves a multi-class classification accuracy of 99.909% on an established benchmark dataset.

© 2020 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.neucom.2020.07.138
* Corresponding author. E-mail address: chorasm@utp.edu.pl (M. Choraś).

1. Introduction

Every single day societies, businesses and citizens in person are threatened by a wide array of cyberattacks including malware, worms, trojan horses, spyware, SQLI, XSS, ransomware [1] and a significant variety of other hazards. The proliferation of cyberthreats can suggest that these malicious instances have become part of the daily routine of the contemporary citizen. At the beginning of 2018, a banking malware geared towards Android targeted unsuspecting bank app users [2]. The new BankBot strain was altered to such a degree that it was deemed safe by the Google Play Store antivirus protection, even though BankBot is a well-known malware. The trojan operated under the guise of what was a benevolent application at first glance, but once set up on an Android device, it proceeded to appropriate the bank's access credentials.

A well recognised case of cross-site scripting (XSS) took place when eBay turned out to be vulnerable to attack [3]. In 2014, JavaScript code was being included in costly items' listings. The user only had to click a malicious, but benign-looking listing to have the script seize control of their browser, redirect them to a site that looked exactly like eBay, and steal their credit card credentials. A recent report [4] suggests that the vulnerability has not been addressed after all those years.

Early 2018 was marked by a security violation that affected over 150 million users of a popular fitness app, MyFitnessPal. The media coverage of the breach tried to pass the event off as "just another day on the Internet" [5]. With the prevailing risk of both new and known cybersecurity threats, it is very tempting to just nod in agreement with that assertion.

The wide range of cybersecurity violations has propelled research on an array of different detection methods. Two main trends of research and development emerged, namely signature-based and anomaly-based. Signature-based IDS (Intrusion Detection Systems) operate by utilising a storehouse of recognised attacks, while the anomaly-based methods form a model of 'normal' traffic and go into alert with every divergence from the model [6]. The black-hat society utilises numerous obfuscation techniques to deceive the signature-based detectors. According to a recent analysis, known malevolent software can be made absolutely undetectable for contemporary anti-malware applications [7]. Frequently, research concerning the use of neural networks in intrusion detection offers arbitrarily chosen topologies and hyperparameter setups, disregarding the immense influence the hyperparameters can have over the achieved accuracy. The first aim of this research is to highlight the untapped power hidden in setups that often remain untested. Secondly, the best possible setup for the used intrusion detection benchmarks is sought after.

Thirdly, the topology and hyperparameter mixes that come close to the best setup are noted, to find the smallest topologies which require the least computational power. The last two points should be considered the main motivation for this work, as the findings will be used to achieve better intrusion detection in a deployable system. The stated motivation suggests the following research question: what mix of hyperparameters and topology settings can provide the highest accuracy for an ANN classifier used on benchmark IDS datasets?

To achieve those abovementioned objectives, a hyperparameter tuning procedure is proposed and employed. The procedure, called gridsearch, searches through the hyperparameter space, exhaustively generating candidate settings of the neural network, and ranks them according to a chosen scoring function. The results of that procedure are gathered, listing the results the selected hyperparameter mixes have achieved for a given topology. The results highlight the importance of optimising the topology and hyperparameters for data in the cybersecurity domain. As noted in [8], hyperparameter optimisation is necessary to find the best possible fit for a given dataset. A range of hyperparameter optimisation methods exist, including methods employing evolutionary approaches, particle swarm optimisation and even random searches. This research used gridsearch for its transparency, and for the possibility to manually guide the experiments into the most promising areas [8].

This paper is structured as follows: in Section 2, an overview of related work in the domain, with emphasis on ANN for intrusion and malware detection, is provided. There are parts of the ANN setup that cannot be inferred from data in the way parameters like weights and biases are. Those features of the ANN, among them the activation function or the optimizer, have to be set manually. They are collectively referred to as "hyperparameters" [9]. In Section 3, the proposed method based on ANN is described in detail, with a detailed description of the pipeline and the chosen hyperparameters. In Section 4, the experimental setup is given; Section 5 presents the results, while the threats to validity and the conclusions are placed thereafter in Sections 6 and 7, respectively.

2. Related work

2.1. Artificial neural networks in intrusion detection

Cybersecurity is an immensely broad topic, with different measures designed to counter different attack vectors [10]. The application of Artificial Neural Networks (ANN) for intrusion detection systems (IDS) and malware detection is hardly a new concept. There have been evaluations of the notion of using ANN to aid anomaly detection and malware detection as far back as 2009 [11]. In [12], the authors attempt to address the problems of overfitting, high memory consumption and high overhead of standard IDS/malware detection with a feed-forward ANN. Specifically, a 2-layered feed-forward ANN was recommended. The aforementioned problems were handled through a conjugated training function and a validation dataset. The authors claim that their method achieves similar results to classical procedures, but with less computational overhead. The procedure was tested on the benchmark KDD'99 dataset. The conclusion of the paper states that less data is better because of the time the machine needs to crunch it. In [13], pruning of the ANN is evaluated as part of the optimisation of the network. It is basically the deletion of neural nodes of either the input or the hidden layers. This makes the ANN faster, as fewer computations have to be processed. In [14], an Artificial Neural Network also showed promise as an IDS when evaluated. In fact, the results were very encouraging.

In [15], Principal Component Analysis (PCA) is employed as a feature extractor before feeding the data to the ANN, as opposed to providing the inputs directly from the dataset. As the article illustrates, this drops the memory requirements of the method significantly, along with the necessary training time. The two evaluated methods displayed comparable results as far as the accuracy is concerned, which makes applying PCA clearly the better option. Using a Kernel PCA betters the training time of the ANN, but uses significantly more memory than the traditional PCA. Both methods have similar accuracy measures, so the authors of [16] conclude that using a mix of different algorithms is preferable. There has been research on utilising Graphical Processing Units to accelerate ANN-based IDS, since GPUs are a good fit for ANN computations; an increase in performance has been proven [17]. The authors of [18] evaluate an ANN with one hidden layer in comparison with a Support Vector Machine, a Naive Bayes and a C4.5 algorithm. The ANN achieves comparable or better results in malware detection but, thanks to the simpler nature of a 3-layer ANN framework, requires fewer computations than the other tested algorithms. The experiments were performed on the NSL-KDD dataset, which is the current benchmark and the successor of KDD'99. In [19], the authors use a convolutional neural network to extract spatial features and a Bidirectional Long Short-Term Memory to extract temporal features; they name their approach the deep hierarchical network model. The solution is used on the NSL-KDD and UNSW-NB15 datasets, the well known cybersecurity benchmarks. The training dataset is prepared with the use of one-sided selection and the synthetic minority oversampling technique, to first reduce noise in the benign class and then oversample the minority classes, forming balanced datasets. The authors report 83.58% accuracy on NSL-KDD and 77.16% on UNSW-NB15. In [20], the authors aim to resolve the problem of high dimensionality and the amount of noise in cybersecurity data. To this end, they employ a combination of a deep belief network (DBN) with a feature-weighted support vector machine (WSVM). The DBN is trained with the use of an adaptive learning rate and is used as a feature extractor. Then the features are inputted to a particle-swarm-optimized WSVM. The solution is tested on NSL-KDD, achieving an accuracy of 85.73% for binary classification. The authors of [21] propose a model they call BAT, which mixes a Bidirectional Long Short-Term Memory network with an attention mechanism. The attention mechanism is utilised to scan the features obtained by the BLSTM. This is inputted to a CNN with multiple convolutional layers, achieving a model that does not require feature engineering. The approach achieves 85.25% accuracy.

3. The proposed method based on artificial neural network

Artificial Neural Networks (ANN) are an all-purpose utility for modelling. The initial concept [22] found a variety of modifications, like Convolutional Neural Networks (CNN) [23], Radial Basis Function Networks [24], Radial Basis Probabilistic Function Neural Networks [25,26], Recurrent Neural Networks (RNN) [27] and many more. With a myriad of applications, including Natural Language Processing [28], Biometrics [29], finding polynomial roots [30–32], intrusion detection [33] and many more, they are an accepted and renowned tool for data mining, with classification, regression, clustering and time series analysis abilities. The basic assumption of an ANN is that it imitates, to a certain extent, the learning competencies of a biological neural network, stressing by principle the properties of neural networks found in human brains, although strongly streamlined [34].

The surprising modelling capacity of ANN in pattern recognition derives from its strong malleability as it fits to data. This extensive approximation capacity is markedly important when handling

real-world data, when the information is plentiful, but the patterns buried in it remain uncovered. The optimisation of the setup can play an important part in the results the setup achieves [35].

In an ANN, knowledge is gained through updating weights with consecutive batches of data instances. The algorithm can recognise the associations among the variables, as well as generalise in a way that allows for high performance on new, unforeseen data [36]. It is basically like fitting a line, or a plane, or a hyper-plane through a set [37].

An artificial neural network with a sole computational layer is dubbed a perceptron. It consists of an input and an output computational layer. After the data points are fed to the input layer, they are issued to the computational layer. The input layer contains $d$ nodes that stand for $d$ features $X = [x_1 \ldots x_d]$ and edges of weight $W = [w_1 \ldots w_d]$. The output neuron computes $W \cdot X = \sum_{i=1}^{d} w_i x_i$. In the case of the perceptron, the forecast is binary, and is delineated by the sign of the value that is the result of the output layer computation. To help deal with distribution imbalance, a bias $b$ can be added. The prediction $\hat{y}$ is the result of the following equation:

$$\hat{y} = \operatorname{sign}\{W \cdot X + b\} = \operatorname{sign}\left\{\sum_{i=1}^{d} w_i x_i + b\right\}$$

As seen in the equation, the sign is the activation function $\Phi(v)$. Numerous activation functions can be utilised in artificial neural networks with multiple hidden layers. For easier training, it is commonly either the Rectified Linear Unit (ReLU) or Hard Tanh in multilayered networks. The error of the regression can be indicated as the difference between the real-life test value and the predicted value, so $E(X) = y - \hat{y}$. If the error is not equal to 0, the weights should be amended. Thus, the purpose of the perceptron is to minimise the least-squares error between $y$ and $\hat{y}$ for all data points in the dataset $\mathcal{D}$. This objective is named the loss function:

$$\min_{W} \; \sum_{(X,y) \in \mathcal{D}} \left(y - \operatorname{sign}\{W \cdot X\}\right)^2$$

The loss function is defined over the whole dataset $\mathcal{D}$, the weights $W$ are updated with the learning rate $\alpha$, and the algorithm iterates over the entire dataset until it converges. This algorithm was named stochastic gradient descent, also expressed by:

$$W \Leftarrow W + \alpha E(X) X$$

[38]
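The following is a minimal NumPy sketch of this training scheme; the toy dataset, the learning-rate value and the epoch count are illustrative assumptions rather than the configuration used in this paper.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    """Perceptron trained with the update rule W <- W + lr * E(X) * X."""
    W = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = np.sign(W @ xi + b) or 1.0  # treat sign(0) as +1
            error = yi - y_hat                  # E(X) = y - y_hat
            W += lr * error * xi                # W <- W + alpha * E(X) * X
            b += lr * error
    return W, b

# toy, linearly separable data with labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
W, b = train_perceptron(X, y)
print(np.sign(X @ W + b))  # expected: [ 1.  1. -1. -1.]
```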
A multi-layer neural network is created via multiple computational layers, also named the hidden layers. The title itself hints at the black-box character of those layers, as the computations are shrouded from the user's perspective. The data points are carried from the input layer to subsequent layers with computations at every stage, down to the output layer. The aforementioned procedure is referred to as the feed-forward neural network [38]. The exact count of nodes in the foremost computational layer usually does not reach the count of nodes of the input layer. The particular number of neurons and the number of hidden layers is in proportion to the intricacy of the necessary model and to the data [36]. While in some special cases utilising a fully-connected layer is the norm, the use of hidden layers with a neuron count below that of the input grants a loss in representation, which oftentimes increases the network's performance. This is very likely the result of getting rid of the noise in the data [38].
A network built with too many neurons can display unwanted behaviour known as overfitting, also named overtraining. This particular phenomenon happens when the artificial neural network has fitted the exact patterns found in the training dataset so tightly that it experiences significant difficulty in performing on unforeseen data, as the approximation is not sufficiently generalised [38].

In this work, the influence of the number of hidden layers, as well as the number of neurons in those hidden layers, on the ANN performance has been subjected to scrutiny (in addition to other hyperparameters). While these aspects can be considered topology or network structure, they could also be considered hyperparameters.

3.1. The usage of backpropagation

Training a single-layer perceptron is straightforward: the loss function is a function of the weights. With multiple layers, the procedure becomes more demanding, as many layers of weights influence one another. Backpropagation calculates the error gradient as the sum of local gradients over multiple paths to the output node [38]. The algorithm consists of two phases – the forward and the backward phase. In the forward phase, the data points are served to the input nodes, and one after another the results at consecutive layers are computed with the current weights. The result of this prediction is compared to the training instance. The backward phase uncovers the gradient of the loss function for all the weights. The gradients update the weights, starting from the output layer, stepping back all the way to the first layer. This weight updating process iterates over the training data – each iteration is called an epoch – and ANNs can often necessitate thousands of those iterations to attain convergence.
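To make the two phases concrete, below is a minimal NumPy sketch of the forward and backward pass for a single-hidden-layer network with sigmoid activations and a squared-error loss; the array shapes, learning rate and random data are assumptions chosen for readability, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))                        # 8 samples, 4 features
y = rng.integers(0, 2, size=(8, 1)).astype(float)  # binary targets

W1 = rng.normal(scale=0.1, size=(4, 5))  # input -> hidden (5 neurons)
W2 = rng.normal(scale=0.1, size=(5, 1))  # hidden -> output
lr = 0.5

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

for epoch in range(1000):
    # forward phase: compute layer outputs with the current weights
    h = sigmoid(X @ W1)
    y_hat = sigmoid(h @ W2)
    # backward phase: propagate the loss gradient from the output
    # layer back to the first layer (chain rule)
    d_out = (y_hat - y) * y_hat * (1 - y_hat)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    W1 -= lr * X.T @ d_hid
```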

3.2. The used technologies: TensorFlow and Keras

In this article TensorFlow has been used, which is a high performance, open source library, provided by the developers and engineers of the Google Brain team. It serves as a capable support for machine and deep learning, and it is currently implemented in an array of scientific and industry applications [39]. Keras, which operates on top of TensorFlow and a myriad of other machine learning libraries, brings an astounding speed of experimentation along with incomparable user experience. This is attained via a modular, expandable design. Keras was brought into existence more as an interface than an autonomous library. Keras received full support in the TensorFlow library, and enables intuitive coding of both machine and deep learning procedures [40]. A minimal model-construction sketch is given below.
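This sketch (TensorFlow 2.x / Keras) builds the kind of feed-forward ANN evaluated in this work, here with one hidden layer of 25 neurons; the input dimension of 50 matches the PCA-reduced feature vector described in Section 3.4, while the number of output classes is a placeholder that depends on the dataset's label encoding.

```python
import tensorflow as tf

def build_model(activation="tanh", optimizer="adam",
                hidden_layers=1, neurons=25,
                n_features=50, n_classes=5):
    # n_classes is a placeholder; it depends on the chosen dataset
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(neurons, activation=activation,
                                    input_shape=(n_features,)))
    for _ in range(hidden_layers - 1):
        model.add(tf.keras.layers.Dense(neurons, activation=activation))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer=optimizer,
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
# model.fit(X_train, y_train, epochs=300, batch_size=100)
```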
3.3. Improving the selected algorithms with hyperparameter optimization

One of the most important parts of the Artificial Neural Network design comes in the role of the activation function, as the effect it carries over the achievable results is straightforward. The network can accommodate diverse types of activation functions. The decision on the type of activation function plays a crucial role especially in multi-layer networks, as each layer can have its own non-linear activation function [38]. Each distinctive function can have a special influence on the results of the ANN, as well as on how the ANN converges, and on the comprehensive nature of the network. Out of a wide range of activation functions $\Phi(v)$, the four most often appearing in the current literature were selected (their standard definitions are given below):

- Sigmoid
- Hard Sigmoid
- Rectified Linear Unit (ReLU)
- Hyperbolic Tangent (tanh)
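For reference, common textbook definitions of these four functions are (the hard sigmoid is given in the piecewise-linear form used by Keras):

$$\sigma(v) = \frac{1}{1 + e^{-v}}, \qquad \operatorname{hard\_sigmoid}(v) = \max\big(0,\ \min(1,\ 0.2v + 0.5)\big)$$

$$\operatorname{ReLU}(v) = \max(0, v), \qquad \tanh(v) = \frac{e^{v} - e^{-v}}{e^{v} + e^{-v}}$$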
The optimal network setup is found by using a grid search procedure, which completes an all-encompassing search over the hyperparameter space. The grid search parameters included:

- the epoch count
- the batch size
- the activation function
- the optimiser
- in some tests, the number of hidden layers
- in some tests, the number of neuron nodes

The full cycle of learning and adapting the weights of the network is called an epoch. The particular count of samples utilised in one iteration is called the batch size [38]. The grid search can test different activation functions and optimisers. The optimisers evaluated in this paper are:

- Adaptive Moment Estimation (adam)
- Root Mean Square Propagation (rmsprop)
- Stochastic gradient descent (SGD)

Examples of the results the grid search method offers are found in Tables 1 and 2 in the results section; a sketch of the search loop is given below.
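A minimal sketch of such an exhaustive search, assuming the build_model sketch from Section 3.2 and pre-split training and validation arrays (X_train, y_train, X_val, y_val):

```python
from itertools import product

# grid values taken from the experiments reported in this paper
param_grid = {
    "activation": ["sigmoid", "hard_sigmoid", "relu", "tanh"],
    "optimizer": ["adam", "rmsprop", "SGD"],
    "batch_size": [10, 100],
    "epochs": [300],
}

results = []
for activation, optimizer, batch_size, epochs in product(*param_grid.values()):
    model = build_model(activation=activation, optimizer=optimizer)
    model.fit(X_train, y_train, epochs=epochs,
              batch_size=batch_size, verbose=0)
    _, acc = model.evaluate(X_val, y_val, verbose=0)
    results.append((acc, activation, optimizer, batch_size))

# rank the candidates by the chosen scoring function (here: accuracy)
for row in sorted(results, reverse=True):
    print(row)
```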
Table 1. 1 layer, 10 neurons, 300 epochs, NSL-KDD dataset.

Accuracy Activation Optimizer Batch Size
0.998915 sigmoid adam 10
0.998881 relu adam 100
0.998845 hard_sigmoid adam 100
0.998833 hard_sigmoid adam 10
0.998829 tanh adam 10
0.998775 relu rmsprop 100
0.998771 tanh adam 100
0.998765 tanh rmsprop 100
0.998761 relu adam 10
0.998717 sigmoid adam 100
0.998709 sigmoid rmsprop 100
0.998683 hard_sigmoid rmsprop 100
0.998562 tanh rmsprop 10
0.998522 hard_sigmoid rmsprop 10
0.998516 relu rmsprop 10
0.99845 sigmoid rmsprop 10
0.997432 relu SGD 10
0.996776 tanh SGD 10
0.995691 sigmoid SGD 10
0.995639 hard_sigmoid SGD 10
0.995214 relu SGD 100
0.994085 tanh SGD 100
0.991385 hard_sigmoid SGD 100
0.990996 sigmoid SGD 100

Table 2. 1 layer of 25 neurons, 300 epochs, NSL-KDD dataset.

Accuracy Activation Optimizer Batch Size
0.999128 tanh adam 100
0.999074 hard_sigmoid adam 100
0.99906 sigmoid adam 100
0.999044 sigmoid adam 10
0.999022 tanh adam 10
0.999014 hard_sigmoid adam 10
0.998996 tanh rmsprop 100
0.998935 sigmoid rmsprop 100
0.998925 relu adam 10
0.998925 relu adam 100
0.998905 hard_sigmoid rmsprop 100
0.998897 relu rmsprop 100
0.998727 hard_sigmoid rmsprop 10
0.998713 sigmoid rmsprop 10
0.998641 tanh rmsprop 10
0.998607 relu rmsprop 10
0.998324 relu SGD 10
0.997957 tanh SGD 10
0.996134 hard_sigmoid SGD 10
0.996074 sigmoid SGD 10
0.995683 relu SGD 100
0.995365 tanh SGD 100
0.992068 sigmoid SGD 100
0.9919 hard_sigmoid SGD 100

3.4. Dimensionality reduction

The search for a feature vector which communicates the nature of the dataset, but without having to represent every single feature, is named dimensionality reduction. The domain concerns the construction of an n-dimensional projection that explains the data of a k-dimensional space. Computational benefits of this procedure are apparent. Other benefits include preventing the 'curse of dimensionality'. This predicament causes the ML classifiers to fall short of the expected results with the increase in dataset dimensions [34]. An exponential increase of samples is necessary for the algorithms to balance this shortcoming.

A wide range of approaches to dimensionality reduction exist. Since negligible features often constitute noise, frequently feature selection is implemented to acquire the most applicable feature set. However, there are other methods to arrive at a smaller feature set which explains the data thoroughly.

A different approach comes in the form of Linear Discriminant Analysis (LDA). This approach has a twofold focus – the maximisation of the distance between particular class centroids and the minimisation of variation within those classes. In other words, LDA helps determine boundaries between clusters of classes.

Another method of dimensionality reduction, frequently used in subject literature, is Principal Component Analysis (PCA). PCA looks for the view of the data where the variance is maximised. An example introduced in [34] shows that if the data creates a line, performing PCA would immediately indicate that the variance over all but one dimension equals 0. Thus, since the features of those dimensions are useless, they can be omitted. Even though data-gathering could suggest a high signal strength in one direction, the data will most likely contain noise in a substantial number of features. Provided the signal overpowers the noise to a high enough extent, the elicited projection containing maximum variance is highly likely to convey the essence of the data.
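A sketch of this reduction with scikit-learn, assuming a preprocessed feature matrix X; the choice of 50 components mirrors the setup used in Section 4:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# scaling before PCA is an assumption of this sketch, not a step
# stated in the paper; PCA is variance-based, so the scale of
# individual features strongly affects the extracted components
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_.sum())  # variance retained
```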
4. Experimental setup

NSL-KDD is a data set created to address the problems of the KDD'99 malware and intrusion data, which were repeatedly raised in the literature. It is now an established benchmark dataset, even though it still displays some of the unwanted characteristics. Still, the lack of open IDS datasets and the difficulty in collecting the data makes NSL-KDD a reliable solution for intrusion detection/malware detection research.

The set contains almost 5,000,000 records, which makes it both suitable for machine learning, and not overbearingly large so as to force researchers to pick parts of the set randomly. This makes the results more easily comparable. The NSL-KDD is cleared of redundant data, to prevent ML algorithm bias, which is an improvement over the original KDD'99 dataset.

The Intrusion Detection Evaluation Dataset – CICIDS2017 [41] was created as a response to the lack of dependable and adequately recent cybersecurity datasets. The IDS datasets available to researchers usually come with a set of their own problems – be it lack of traffic diversity, lack of attack variety, insufficient features or other issues. The authors of CICIDS2017 offer a dataset with realistic background traffic, created by abstracting the behaviour of 25 users across a range of protocols. The data was captured over a range of 5 days, with 4 days of the setup being assaulted by a range of attacks, including various malware, DoS attacks, web attacks and others.

Not only is CICIDS2017 one of the newest datasets available to researchers, but it also features over 80 network flow features.

PCA is performed to reduce the number of features to 50. The number of extracted features was an arbitrary decision based on initial tests of the setup. This reduced feature vector constitutes the feature-set provided to the input layer of the Artificial Neural Network (Fig. 1 gives an overall schema of such a network). The pipeline of the process is illustrated in Fig. 2.

[Fig. 1. An overall schema of an artificial neural network, with multiple computational layers, multiple neuron nodes and the output layer.]

[Fig. 2. The process pipeline starting with the NSL-KDD dataset. A similar process is used for CICIDS2017.]

Table 3. 2 hidden layers, 25 neurons each, 300 epochs, NSL-KDD dataset.

Accuracy Activation Optimizer Batch Size
0.99909 relu adam 10
0.999084 sigmoid adam 100
0.99907 sigmoid rmsprop 100
0.999036 hard_sigmoid adam 10
0.999028 tanh rmsprop 100
0.999022 sigmoid adam 10
0.999022 tanh adam 100
0.99902 relu adam 100
0.998979 hard_sigmoid rmsprop 100
0.998947 tanh adam 10
0.998857 relu rmsprop 100
0.998827 tanh rmsprop 10
0.998819 hard_sigmoid rmsprop 10
0.998773 relu rmsprop 10
0.998719 sigmoid rmsprop 10
0.998643 relu SGD 10
0.998603 tanh SGD 10
0.995717 relu SGD 100
0.995403 tanh SGD 100
0.995391 sigmoid SGD 10
0.99531 hard_sigmoid SGD 10
0.990065 sigmoid SGD 100
0.989853 hard_sigmoid SGD 100

Table 4. 4 hidden layers, 25 neurons each, 300 epochs, NSL-KDD dataset.

Accuracy Activation Optimizer Batch Size
0.998983 sigmoid adam 10
0.998957 sigmoid adam 100
0.998779 sigmoid rmsprop 100
0.998591 sigmoid rmsprop 10
0.998516 hard_sigmoid rmsprop 100
0.998512 hard_sigmoid adam 100
0.997305 hard_sigmoid adam 10
0.996618 hard_sigmoid rmsprop 10
0.992189 sigmoid SGD 10
0.962501 hard_sigmoid SGD 10
0.957635 sigmoid SGD 100
0.957635 hard_sigmoid SGD 100
0.955097 relu adam 10
0.954515 relu rmsprop 100
0.954267 relu adam 10
0.953717 relu rmsprop 10
0.949774 relu SGD 10
0.859187 relu SGD 100
0.620011 tanh SGD 100
0.42001 tanh SGD 10
0.204583 tanh rmsprop 10
0.122656 tanh adam 100
0.120043 tanh rmsprop 100
0.056442 tanh adam 10
4.1. Cross validation

Machine Learning and numerous facets of Artificial Intelligence implementations are accompanied by a couple of particular issues. Some of those problems come in the form of overfitting. If the algorithms experience those challenges, they might perform exceptionally in the lab, but when subjected to real data their performance will not be satisfactory. In order to mitigate this phenomenon, the models undergo a procedure called k-fold cross validation, also known as rotation estimation. The training dataset is split into k parts, and k-1 parts are used for training, with the remaining one being utilised for testing. Then, the procedure is repeated k times, changing the testing part. The folds are sampled randomly [42,43]. In this work, 10-fold cross-validation was used.
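A sketch of this procedure with scikit-learn utilities, assuming the build_model sketch from Section 3.2, a feature matrix X, integer class labels in labels and their one-hot encoding in y_onehot:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 10 folds, sampled randomly, as described above
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
accuracies = []
for train_idx, test_idx in skf.split(X, labels):
    model = build_model()  # fresh model for every fold
    model.fit(X[train_idx], y_onehot[train_idx],
              epochs=300, batch_size=100, verbose=0)
    _, acc = model.evaluate(X[test_idx], y_onehot[test_idx], verbose=0)
    accuracies.append(acc)
print(np.mean(accuracies), np.std(accuracies))
```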
5. Results

Multiple topology setups were tested, ranging from one hidden layer of ten neurons (Table 1), through one hidden layer of 25 neurons (Table 2), two layers of 25 neurons (Table 3), four layers of 25 neurons each (Table 4), seven hidden layers of 25 neurons (Table 5) and ten hidden layers of 25 neurons – the results of this last test can be seen in Table 6. A summary of the best setups for each of those topologies can be found in Table 7.

The results of the undertaken experiments are presented in tables of the accuracies achieved by particular setups of hyperparameters in topologies of increasing complexity. This way of presentation has been chosen to provide better clarity; however, this is not the order in which the experiments were performed. Of the multitude of possible confusion-matrix based metrics, which have the ability to accurately express the ability of a classifier [44], accuracy was chosen for the clarity and intuitiveness it offers. Additionally, accuracy is one of the metrics most often resorted to in IDS research. While it has its downside for datasets with high class imbalance, this work utilises a data balancing procedure to counteract this effect. To additionally augment the insight provided by accuracy, precision and recall are offered in Tables 8, 9, 10 and 12.

The tables are ordered by accuracy. Early stopping was not employed so as to not impose a smoothness constraint on the models.

Table 5. 7 layers, 25 neurons, 300 epochs, NSL-KDD dataset.

Accuracy Activation Optimizer Batch Size
0.968288 tanh rmsprop 10
0.966866 tanh adam 10
0.964502 relu rmsprop 10
0.962138 relu adam 10
0.958355 tanh adam 100
0.956455 relu adam 100
0.955988 relu rmsprop 100
0.955514 tanh rmsprop 100
0.95125 tanh SGD 10
0.946037 relu SGD 10
0.902984 tanh SGD 100
0.884989 hard_sigmoid adam 10
0.879798 sigmoid rmsprop 10
0.876011 hard_sigmoid rmsprop 10
0.870312 sigmoid adam 10
0.8495 relu SGD 100
0.849028 sigmoid rmsprop 100
0.842881 hard_sigmoid rmsprop 100
0.782832 hard_sigmoid adam 100
0.640515 sigmoid adam 100
0.497387 sigmoid SGD 10
0.497387 hard_sigmoid SGD 10
0.497387 sigmoid SGD 100
0.497387 hard_sigmoid SGD 100

The first table (Table 1) presents how the achieved accuracy fluctuated in the smallest of the evaluated topologies. The best result – which came from combining the sigmoid activation function, the ADAM optimizer and a batch size of 10, with 300 epochs – reached almost 99.9%. That constitutes the fourth best result in the summary table (Table 7).

The second table (Table 2) presents the accuracies and hyperparameter mixes for one hidden layer with 25 neural nodes. The number of epochs is 300. The best accuracy in this table exceeds 99.9%, and is at the same time the best of the accuracies attained in all of the experiments (Table 7). This particular mix of hyperparameters attracted the authors' attention. The hyperbolic tangent activation function is a sigmoidal function and is often omitted in contemporary neural networks (with ReLU being the go-to function). The newest deep learning models usually use micro-batches of 1 or 2. Thus, a mix of tanh, ADAM and a batch size of 100 is a very interesting option, especially for such a small topology.

The third table (Table 3) presents the hyperparameter mixes for a topology of two hidden layers of 25 neurons each, trained for 300 epochs. The best result for this setup was almost exactly 99.9%, with the ReLU activation function, ADAM optimizer and a batch size of 10. It is worth noticing that the eight best results in this table exceed 99.9% accuracy, and all the results in that table and in all the preceding tables (Tables 1 and 2) exceed 98.9%.

In Table 4 the finest accuracy exceeded 99.89%. In this table, for the first time, some of the accuracies fall below 90%, with the worst performer falling below 6%. This trend will continue for more complex topologies.

In Table 5 the best accuracy was 96.83%, with a mix of tanh, rmsprop and a batch size of 10. Half of the setups did not reach 90%. One more time the hyperbolic tangent takes the first spot as one of the contributors to the best accuracy. In the summary table (Table 7) the tanh function appears in six out of ten different setups for the ten best evaluated topologies.

The accuracies collected in Table 6 illustrate that increasing the depth of the neural network does not bring any additional benefit for NSL-KDD. The four worst results did not even reach 50% accuracy. The highest-scoring setup used tanh and ADAM. The ADAM optimizer appears in seven out of the ten best setups in Table 7.

Out of the best performing topologies, the highest scoring hyperparameter setups have been collected in Table 7. As noted in the preceding paragraphs, the tanh activation function and the ADAM optimizer contribute to the best results more frequently than their counterparts. Rather surprisingly, the smaller setups lead to better results than the deeper architectures. The findings of this part of the research will be elaborated upon more fully in the statistical significance section.

In a batch of experiments, the algorithm has been applied to the CICIDS2017 dataset. The initial results found in Tables 8 and 9 were very encouraging, with the accuracy exceeding 99% (0.9936). The recall of 0.54 and f1-score of 0.70 in one of the classes signified that there might be a balancing problem in the dataset. A closer inspection revealed that there are over 43,000 records of benign netflows in the set, but only slightly over 1300 attack records. This is shown in Table 8 in the 'Support' column. To counteract that, the majority class was randomly subsampled, with the number of samples matching the sum of the attack records. The initial results for the balanced dataset are represented in Table 9. The accuracy of the procedure on the balanced dataset exceeded 97% over the test set, and 95% mean accuracy in the 10-fold cross-validation. The results of the setup over the full set are displayed in Table 13.
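A sketch of the random subsampling step, assuming the records are held in a pandas DataFrame df with a 'Label' column (the column name follows the CICIDS2017 CSV convention, but should be verified against the actual files):

```python
import pandas as pd

benign = df[df["Label"] == "BENIGN"]
attacks = df[df["Label"] != "BENIGN"]

# draw as many benign samples as there are attack records in total
benign_sample = benign.sample(n=len(attacks), random_state=42)
balanced = pd.concat([benign_sample, attacks]).sample(frac=1, random_state=42)
```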

Table 6. 10 layers, 25 neurons, NSL-KDD dataset.

Accuracy Activation Optimizer Batch Size Epochs


0.970651 tanh adam 10 100
0.967342 tanh rmsprop 10 100
0.966869 relu rmsprop 10 100
0.966867 relu adam 10 100
0.965918 tanh SGD 10 100
0.963082 tanh adam 10 30
0.962605 relu rmsprop 10 30
0.96213 tanh rmsprop 10 30
0.96072 relu SGD 10 100
0.958352 relu adam 10 30
0.950777 tanh SGD 10 30
0.930435 relu SGD 10 30
0.924753 hard_sigmoid adam 10 100
0.918136 sigmoid rmsprop 10 100
0.917652 hard_sigmoid rmsprop 10 100
0.894478 sigmoid adam 10 100
0.873172 hard_sigmoid rmsprop 10 30
0.860396 sigmoid rmsprop 10 30
0.854231 hard_sigmoid adam 10 30
0.85092 sigmoid adam 10 30
0.497387 sigmoid SGD 10 30
0.497387 sigmoid SGD 10 100
0.497387 hard_sigmoid SGD 10 30
0.497387 hard_sigmoid SGD 10 100


Table 7. Summary of best hyperparameter setups for the ten best topologies on NSL-KDD.

Accuracy Activation Optimizer Batch Size Epochs Layers Neurons


0.999128 tanh adam 100 300 1 25
0.99909 relu adam 10 300 2 25
0.998983 sigmoid adam 10 300 4 25
0.998915 sigmoid adam 10 300 1 10
0.970658 relu SGD 1 30 2 50
0.970651 tanh adam 10 100 10 25
0.968765 tanh SGD 1 30 2 25
0.968288 tanh rmsprop 10 30 7 25
0.964979 tanh adam 1 30 3 25
0.964032 tanh adam 1 30 2 10

Table 8. CICIDS2017 initial results, Tuesday subset.

Precision Recall f1-score Support


0 0.99 1.00 1.00 43171
1 1.00 0.98 0.99 820
2 1.00 0.54 0.70 574
Micro avg 0.99 0.99 0.99 44565
Macro avg 1.00 0.84 0.89 44565
Weighted avg 0.99 0.99 0.99 44565
Samples avg 0.99 0.99 0.99 44565

Table 9. Balanced CICIDS2017 initial results, Tuesday subset.

Precision Recall f1-score Support


0 0.99 0.96 0.98 1387
1 1.00 1.00 1.00 804
2 0.92 0.99 0.95 576
Micro avg 0.98 0.98 0.98 2767
Macro avg 0.97 0.98 0.98 2767
Weighted avg 0.98 0.98 0.98 2767

5.1. Comparison to other state of the art machine learning algorithms

In this section the performance of other ML approaches is presented. To place the performance of the illustrated approach in context, tests were performed using a Support Vector Machine with gaussian kernel (SVM), the Naive Bayes classifier and ADABoost. Table 10 illustrates the differences among the results the classifiers have achieved with the use of CICIDS2017 Tuesday data. In Table 12 the results the Naive Bayes classifier achieved on the full CICIDS2017 dataset are showcased, with the precision, recall and f1-score of each of the classes found in the dataset. The support column signifies the number of instances of the particular classes in the dataset. The table is further extended to show the results of other methods over the whole CICIDS2017

Table 10. The results of other ML methods over the CICIDS2017 dataset, Tuesday subset.

Precision Recall f1-score Support


SVM, Accuracy: 0.9584
0 1.00 0.92 0.96 1387
1 0.97 1.00 0.98 804
2 0.87 0.99 0.93 576
Micro avg 0.96 0.96 0.96 2767
Macro avg 0.95 0.97 0.96 2767
Weighted avg 0.96 0.96 0.96 2767
Naive Bayes, Accuracy: 0.9078
0 1.00 0.82 0.90 1387
1 0.86 1.00 0.93 804
2 0.82 1.00 0.90 576
Micro avg 0.91 0.91 0.91 2767
Macro avg 0.89 0.94 0.91 2767
Weighted avg 0.92 0.91 0.91 2767
ADABoost, Accuracy: 0.9382
0 1.00 0.88 0.93 1387
1 1.00 1.00 1.00 804
2 0.78 0.99 0.87 576
Micro avg 0.94 0.94 0.94 2767
Macro avg 0.92 0.96 0.93 2767
Weighted avg 0.95 0.94 0.94 2767


set. After examining the results of the classifiers over all the classes, it is apparent that the samples contained in the Tuesday dataset constitute an easier task for ML classifiers than the dataset as a whole.

5.2. The influence of hyperparameter optimisation

Hyperparameter optimisation is performed on each of the setups. The gridsearch method evaluates each of the possible permutations of the selected hyperparameters. Namely, the used epochs count, the batch size, the optimiser and the activation function are consecutively permutated in order to achieve the highest accuracy. Tables 1–6 illustrate the way the accuracy fluctuates on various ANN setups. Table 4 displays the results of the gridsearch procedure for an ANN with 4 hidden layers and 25 neurons on each layer. Table 2 shows how gridsearch performed for a 1-hidden-layer, 25-neural-node ANN. The remaining Tables 3 and 1 illustrate gridsearch of an ANN with 2 hidden layers of 25 neurons, and 1 hidden layer of 10 neurons, respectively. The optimal setup for the algorithm in the CICIDS2017 case has been established with the gridsearch procedure as well (Table 13). A sample of the results over the whole dataset can be seen in Tables 11–13. It is apparent that the results acquired with different parameter setups vary greatly, just as in the case of NSL-KDD.

Table 11. 4 hidden layers, 25 neurons each, 30 epochs, CICIDS2017 dataset.

Accuracy Activation Batch Size Optimizer
0.976983 relu 50 rmsprop
0.976742 relu 100 rmsprop
0.975176 relu 50 adam
0.974252 relu 100 adam
0.973408 hard_sigmoid 50 rmsprop
0.972645 hard_sigmoid 100 rmsprop
0.972444 hard_sigmoid 50 adam
0.971922 hard_sigmoid 100 adam
0.970597 relu 50 SGD
0.970516 sigmoid 50 adam
0.969793 sigmoid 50 rmsprop
0.96911 sigmoid 100 adam
0.965575 sigmoid 100 rmsprop
0.958184 relu 100 SGD
0.930227 sigmoid 50 SGD
0.708456 hard_sigmoid 50 SGD
0.585017 sigmoid 100 SGD
0.499819 hard_sigmoid 100 SGD

Table 13. 4 hidden layers, 25 neurons each, CICIDS2017 dataset, top setup.

Class Precision Recall f1-score Support
0 1.00 0.98 0.99 162154
1 0.50 0.63 0.56 196
2 1.00 1.00 1.00 12803
3 0.98 0.98 0.98 1029
4 0.90 0.99 0.95 23012
5 0.90 0.99 0.94 550
6 0.97 0.98 0.97 580
7 0.99 0.98 0.98 794
8 1.00 1.00 1.00 1
9 1.00 1.00 1.00 15880
10 0.97 0.49 0.65 590
11 0.59 0.23 0.33 301
12 0.00 0.00 0.00 4
13 0.80 0.03 0.06 130
Accuracy 0.977
Macro avg 0.83 0.73 0.74 218024
Weighted avg 0.99 0.98 0.98 218024

5.3. Statistical analysis

The standard and widely used Wilcoxon Signed-Rank Test was utilised to evaluate the statistical significance of this research. The best performing setups for NSL-KDD are gathered in Table 7. All those setups were tested against the top-ranking setup and the test did not find statistical significance (with p-values ranging from 0.4764 to 0.1829), leading to the conclusion that the topology of the neural network does not bear a significant impact on the accuracy if the hyperparameters are chosen correctly for this dataset. However, different hyperparameter setups impacted the results of the chosen topologies immensely. For example, the best and worst setups for the 4-layer, 25-neuron topology, when subjected to the Wilcoxon test, reject the null hypothesis with p = 0.0093, thus the top setup is clearly the better option.
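A sketch of this test with SciPy, where the two lists stand for per-fold accuracies of the setups being compared; the values shown are placeholders, not the paper's measurements:

```python
from scipy.stats import wilcoxon

best_setup  = [0.9991, 0.9989, 0.9990, 0.9992, 0.9988,
               0.9991, 0.9990, 0.9989, 0.9993, 0.9990]
other_setup = [0.9989, 0.9990, 0.9988, 0.9991, 0.9987,
               0.9990, 0.9989, 0.9988, 0.9992, 0.9989]

stat, p_value = wilcoxon(best_setup, other_setup)
# fail to reject the null hypothesis when p exceeds the chosen
# significance level (alpha = 0.05 in this paper)
print(stat, p_value)
```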
5.4. Lessons learned

The gathered results, and the investigation of the statistical significance of the results achieved by particular setups, illustrated that neither increasing the depth of neural networks nor inflating the number of neurons in layers contributes to increasing the ANN accuracy for the evaluated benchmark dataset beyond a certain point. In fact, as illustrated by Tables 5 and 6, adding additional layers decreased the accuracy of the neural network.

Table 12. CICIDS2017 – reference classifiers.

NaiveBayes ADABoost SVM


Precision Recall f1-score Precision Recall f1-score Precision Recall f1-score Support
0 0.99 0.10 0.18 0.64 1.00 0.78 0.99 0.96 0.97 55717
1 0.01 0.66 0.03 0.00 0.00 0.00 0.00 0.00 0.00 196
2 0.98 0.95 0.96 0.51 0.63 0.57 0.99 0.95 0.96 12803
3 0.21 0.94 0.34 0.00 0.00 0.00 0.96 0.95 0.96 1029
4 0.89 0.70 0.79 0.92 0.34 0.50 0.80 1.00 0.89 23012
5 0.01 0.65 0.02 0.00 0.00 0.00 0.88 0.97 0.93 550
6 0.15 0.63 0.24 0.00 0.00 0.00 0.97 0.88 0.92 580
7 0.15 1.00 0.26 0.00 0.00 0.00 0.91 0.47 0.62 793
8 1.00 1.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 1
9 1.00 0.99 0.99 0.00 0.00 0.00 0.97 0.99 0.98 15880
10 0.20 0.99 0.33 0.00 0.00 0.00 0.98 0.44 0.61 590
11 0.00 0.08 0.00 0.00 0.00 0.00 1.00 0.10 0.18 151
12 0.01 1.00 0.01 0.00 0.00 0.00 0.00 0.00 0.00 2
13 0.12 0.94 0.22 0.00 0.00 0.00 0.00 0.00 0.00 65
Accuracy 0.47 0.64 0.96 111369
Macro avg 0.41 0.76 0.38 0.16 0.15 0.14 0.79 0.64 0.67 111369
Weighted avg 0.94 0.47 0.51 0.57 0.64 0.56 0.96 0.96 0.96 111369


6. Threats to validity

While creating commercial or research-based products, several risks, threats to validity and human factors should be discussed and taken into account. In this section, we address such aspects as the quality of data, the security of the system itself and the cognitive understanding of the system results/decisions.

6.1. Data quality

The proposed machine learning based IDS needs to be trained on real or realistic and reliable data. Such data has to come from certified devices (probes, software-based parsers etc.) and needs to represent real and non-biased network traffic. Yet another problem lies in dataset characteristics: it should be appropriately balanced in terms of the percentage of malicious data vs. normal, not-attacked (benign) traffic.

In this work (and our previous works) we suggest using some techniques to counter the problem of data imbalance, as frequently in datasets there are much fewer malicious samples than normal samples of the traffic. Another challenge that is consistent with the quality of data category comes with the fact that even the most sophisticated and well optimised algorithms are optimised at the time of deployment, and their performance deteriorates in time, a predicament that is significantly more pronounced with the fast-paced advancement of network technologies. This adverse circumstance could be countered with a life-long learning approach [45].

6.2. Security of the system and detecting adversarial attacks

One of the crucial (albeit frequently forgotten) aspects of each security system is that such a system should be secure itself. Therefore, machine learning based intrusion detection systems have to be well protected against attacks (both physical and cyber). Another emerging aspect is that the internal machine learning algorithms have to be secure as well, which mostly means being resilient to adversarial attacks such as evasion, poisoning, model extraction etc., as was demonstrated in [46]. Therefore, the future line of our research is to test the solution against known and novel adversarial attacks and to provide new countermeasures and adversarial attack detectors [47,48].

6.3. Explainability and cognitive understanding

A multitude of software products (especially coming from R&D advances) did not reach the expected success due to human factors, lack of usability or a badly designed GUI. The results, indication of anomalies and/or cyberattacks and the suggested decisions have to be clearly and understandably presented to end-users (e.g. a network security officer or IDS operator). The user cannot be overwhelmed by the information from the system; on the other hand, however, they should be provided with all the information necessary to clearly see the context and the situation, in order to increase situational awareness [47].

Another threat lies in the fact that frequently machine learning solutions are used as black-box providers of the Single Truth, while in fact operators should understand the algorithms and services to make informed decisions [49]. Therefore, the aspect of so-called explainability (sometimes called interpretability) of ML solutions should be addressed and taken into account to ensure wide adoption of this class of intrusion detection systems [50].

7. Conclusions

The major contribution of this paper is the innovative proposal of hyperparameter optimisation in Artificial Neural Networks (ANN) used in the task of network intrusion detection, a frequent component of cybersecurity applications. In this work, the proposed method has been presented in detail, and its effectiveness on two benchmark datasets was showcased, proving its efficiency by showing comparable results. A wide range of ANN topologies was tested, changing both the number of hidden layers and the number of neurons on those layers. The influence the choice of hyperparameters like the activation function, the optimizer, the batch size and the number of epochs has over the achieved accuracy was evaluated. The problem of data imbalance was also tackled, and the threats to validity of the proposed approach (and of other machine learning based intrusion detection systems as well) were discussed. The tests were performed with the use of two IDS benchmark datasets, namely NSL-KDD and CICIDS2017. The tests revealed that using the same topology over the same dataset can produce drastically different results if an inappropriate mix of hyperparameters is employed. As seen, for example, in Table 4, switching any of the hyperparameters can result in a perceivable drop in accuracy. The change in the batch size from 10 to 100 results, in one case, in a drop of accuracy from 0.9498 to 0.8592. Switching from the rmsprop optimizer to SGD results in a drop of accuracy from 0.9545 to 0.8592. Switching from the ReLU activation function to tanh results in a staggering drop of accuracy from 0.9545 to 0.0564. However, the same setup but using the sigmoid activation function boosted the accuracy to 0.9990. A similar effect is evident in the tests where CICIDS2017 was used. This time, however, changing the activation function from ReLU to sigmoid results in an accuracy drop from 0.9582 to 0.5850.

This work provides the experimental proof that a slight change in hyperparameter setup can have a significant effect on the accuracy of a given neural network topology for both of the implemented intrusion detection datasets, as illustrated in the preceding paragraph. The best combination of the investigated specifications was found. A number of other setups which are on par with the best setup (for p = 0.05) were also extracted; these can be found in Table 7, at the top of Tables 1–6 for NSL-KDD, and in Table 11.

CRediT authorship contribution statement

Michał Choraś: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Software, Validation, Writing - original draft, Writing - review & editing. Marek Pawlicki: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Writing - original draft, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work is funded under the SIMARGL project, which has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 833042.

References

[1] G. McGraw, G. Morrisett, Attacking malicious code: a report to the infosec research council, IEEE Softw. 17 (5) (2000) 33–41, https://doi.org/10.1109/52.877857.
[2] A. Bielec, Analysis of a Polish BankBot, https://www.cert.pl/en/news/single/analysis-of-a-polish-bankbot/.
[3] L. Kelion, eBay redirect attack puts buyers' credentials at risk, http://www.bbc.com/news/technology-29241563.
[4] P. Mutton, Hackers still exploiting eBay's stored XSS vulnerabilities in 2017, https://news.netcraft.com/archives/2017/02/17/hackers-still-exploiting-ebays-stored-xss-vulnerabilities-in-2017.html.
[5] D. Lee, MyFitnessPal breach affects millions of Under Armour users, http://www.bbc.com/news/technology-43592470.
[6] N. Idika, A. Mathur, A Survey of Malware Detection Techniques, Purdue University.
[7] G. Canfora, A. Di Sorbo, F. Mercaldo, C.A. Visaggio, Obfuscation techniques against signature-based detection: a case study, Mobile Systems Technologies Workshop (MST) 2015 (2015) 21–26, https://doi.org/10.1109/MST.2015.8.
[8] M. Feurer, F. Hutter, Hyperparameter optimization, Springer International Publishing, Cham, 2019, pp. 3–33, https://doi.org/10.1007/978-3-030-05318-5_1.
[9] S. Skansi, Introduction to Deep Learning: From Logical Calculus to Artificial Intelligence, Springer, 2018.
[10] M. Choraś, R. Kozik, Machine learning techniques applied to detect cyber attacks on web applications, Logic J. IGPL 23 (1) (2015) 45–56, https://doi.org/10.1093/jigpal/jzu038.
[11] Y. Sani, A. Mohamedou, K. Ali, A. Farjamfar, M. Azman, S. Shamsuddin, An overview of neural networks use in anomaly intrusion detection systems, IEEE Student Conference on Research and Development (SCOReD) 2009 (2009) 89–92, https://doi.org/10.1109/SCORED.2009.5443289.
[12] F. Haddadi, S. Khanchi, M. Shetabi, V. Derhami, Intrusion detection and attack classification using feed-forward neural network, Second International Conference on Computer and Network Technology 2010 (2010) 262–266, https://doi.org/10.1109/ICCNT.2010.28.
[13] W. Gong, W. Fu, L. Cai, A neural network based intrusion detection data fusion model, in: 2010 Third International Joint Conference on Computational Science and Optimization, vol. 2, 2010, pp. 410–414, https://doi.org/10.1109/CSO.2010.62.
[14] I. Mukhopadhyay, M. Chakraborty, S. Chakrabarti, T. Chatterjee, Back propagation neural network approach to intrusion detection system, International Conference on Recent Trends in Information Systems 2011 (2011) 303–308, https://doi.org/10.1109/ReTIS.2011.6146886.
[15] H.A. Sonawane, T.M. Pattewar, A comparative performance evaluation of intrusion detection based on neural network and PCA, International Conference on Communications and Signal Processing (ICCSP) 2015 (2015) 0841–0845, https://doi.org/10.1109/ICCSP.2015.7322612.
[16] T.M. Pattewar, H.A. Sonawane, Neural network based intrusion detection using Bayesian with PCA and KPCA feature extraction, in: 2015 IEEE International Conference on Computer Graphics, Vision and Information Security (CGVIS), 2015, pp. 83–88, https://doi.org/10.1109/CGVIS.2015.7449898.
[17] N.T.T. Van, T.N. Thinh, Accelerating anomaly-based IDS using neural network on GPU, International Conference on Advanced Computing and Applications (ACOMP) 2015 (2015) 67–74, https://doi.org/10.1109/ACOMP.2015.30.
[18] B. Subba, S. Biswas, S. Karmakar, A neural network based system for intrusion detection and attack classification, Twenty Second National Conference on Communication (NCC) 2016 (2016) 1–6, https://doi.org/10.1109/NCC.2016.7561088.
[19] K. Jiang, W. Wang, A. Wang, H. Wu, Network intrusion detection combined hybrid sampling with deep hierarchical network, IEEE Access 8 (2020) 32464–32476.
[20] Y. Wu, W.W. Lee, Z. Xu, M. Ni, Large-scale and robust intrusion detection model combining improved deep belief network with feature-weighted SVM, IEEE Access 8 (2020) 98600–98611.
[21] T. Su, H. Sun, J. Zhu, S. Wang, Y. Li, BAT: deep learning methods on network intrusion detection using NSL-KDD dataset, IEEE Access 8 (2020) 29575–29585.
[22] W.S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys. 5 (4) (1943) 115–133.
[23] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[24] J. Moody, C.J. Darken, Fast learning in networks of locally-tuned processing units, Neural Comput. 1 (2) (1989) 281–294.
[25] D.-S. Huang, Radial basis probabilistic neural networks: model and application, Int. J. Pattern Recogn. Artif. Intell. 13 (7) (1999) 1083–1101.
[26] D.-S. Huang, W.-B. Zhao, Determining the centers of radial basis probabilistic neural networks by recursive orthogonal least square algorithms, Appl. Math. Comput. 162 (1) (2005) 461–473.
[27] C. Goller, A. Kuchler, Learning task-dependent distributed representations by backpropagation through structure, in: Proceedings of International Conference on Neural Networks (ICNN'96), vol. 1, 1996, pp. 347–352.
[28] Y. Goldberg, A primer on neural network models for natural language processing, J. Artif. Intell. Res. 57 (2016) 345–420.
[29] L. Shang, D.-S. Huang, J.-X. Du, C.-H. Zheng, Palmprint recognition using FastICA algorithm and radial basis probabilistic neural network, Neurocomputing 69 (13–15) (2006) 1782–1786.
[30] D.-S. Huang, H.H.-S. Ip, K.C.K. Law, Z. Chi, Zeroing polynomials using modified constrained neural network approach, IEEE Trans. Neural Networks 16 (3) (2005) 721–732.
[31] D.-S. Huang, H.H. Ip, Z. Chi, A neural root finder of polynomials based on root moments, Neural Comput. 16 (8) (2004) 1721–1762.
[32] D.-S. Huang, A constructive approach for finding arbitrary roots of polynomials by neural networks, IEEE Trans. Neural Networks 15 (2) (2004) 477–491.
[33] J. Ryan, M.-J. Lin, R. Miikkulainen, Intrusion detection with neural networks, in: Advances in Neural Information Processing Systems, 1998, pp. 943–949.
[34] O. Maimon, L. Rokach, Data Mining and Knowledge Discovery Handbook, second ed., 2010.
[35] D.-S. Huang, J.-X. Du, A constructive hybrid structure optimization methodology for radial basis probabilistic neural networks, IEEE Trans. Neural Networks 19 (12) (2008) 2099–2115.
[36] I.N. da Silva, D.H. Spatti, R.A. Flauzino, L.H.B. Liboni, S.F. dos Reis Alves, Artificial Neural Networks: A Practical Course, 2017, https://doi.org/10.1007/978-3-319-43162-8.
[37] S. Bassis, A. Esposito, F.C. Morabito, E. Pasero, Advances in Neural Networks, 2016, https://doi.org/10.1007/978-3-319-33747-0.
[38] C.C. Aggarwal, Neural Networks and Deep Learning: A Textbook, 2018, https://doi.org/10.1007/978-3-319-94463-0.
[39] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: large-scale machine learning on heterogeneous systems, 2015, software available from https://www.tensorflow.org/.
[40] F. Chollet et al., Keras, https://github.com/fchollet/keras, 2015.
[41] I. Sharafaldin, A.H. Lashkari, A.A. Ghorbani, Toward generating a new intrusion detection dataset and intrusion traffic characterization, in: Proceedings of the 4th International Conference on Information Systems Security and Privacy – Volume 1: ICISSP, INSTICC, SciTePress, 2018, pp. 108–116, https://doi.org/10.5220/0006639801080116.
[42] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, IJCAI (1995) 1137–1145.
[43] G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning, Springer, 2013.
[44] P. Branco, L. Torgo, R. Ribeiro, Relevance-based evaluation metrics for multi-class imbalanced domains, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2017, pp. 698–710.
[45] R. Kozik, M. Choraś, J. Keller, Balanced efficient lifelong learning (B-ELLA) for cyber attack detection, J. UCS 25 (1) (2019) 2–15, http://www.jucs.org/jucs_25_1/balanced_efficient_lifelong_learning.
[46] M. Choraś, M. Pawlicki, R. Kozik, The feasibility of deep learning use for adversarial model extraction in the cybersecurity domain, in: H. Yin, D. Camacho, P. Tino, A.J. Tallón-Ballesteros, R. Menezes, R. Allmendinger (Eds.), Intelligent Data Engineering and Automated Learning – IDEAL 2019, Springer International Publishing, Cham, 2019, pp. 353–360.
[47] M. Choraś, M. Pawlicki, D. Puchalski, R. Kozik, Machine learning – the results are not the only thing that matters! What about security, explainability and fairness?, in: International Conference on Computer Recognition Systems, Springer, 2020.
[48] M. Pawlicki, M. Choraś, R. Kozik, Defending network intrusion detection systems against adversarial evasion attacks, Future Gener. Comput. Syst. 110 (2020) 148–154, https://doi.org/10.1016/j.future.2020.04.013.
[49] R. Kozik, M. Choraś, A. Flizikowski, M. Theocharidou, V. Rosato, E. Rome, Advanced services for critical infrastructures protection, J. Ambient Intell. Human. Comput. 6 (6) (2015) 783–795.
[50] M. Szczepański, M. Choraś, M. Pawlicki, R. Kozik, Achieving explainability of intrusion detection system by hybrid oracle-explainer approach, in: International Joint Conference on Neural Networks (IJCNN) 2020, IEEE, 2020.

Michał Choraś holds the professor position at the University of Science and Technology in Bydgoszcz (UTP), where he is the Head of the Teleinformatics Systems Division and the PATRAS Research Group. He has been involved in many EU projects (e.g. SocialTruth, CIPRNet, QRapids, INSPIRE). His interests include data science and pattern recognition in several domains, e.g. cyber security, image processing, software engineering, prediction, correlation, biometrics and critical infrastructures protection. Currently he coordinates H2020 SIMARGL and is the Programme Leader (SAFAIR) in H2020 SPARTA. He is an author of over 250 reviewed scientific publications.


Marek Pawlicki Ph.D. Eng. holds an adjunct position at the UTP University of Science and Technology in Bydgoszcz. He has been involved in a number of international projects related to cybersecurity, critical infrastructures protection, software quality etc. (e.g. H2020 SPARTA, H2020 SIMARGL, H2020 PREVISION, H2020 MAGNETO, H2020 Q-Rapids, H2020 SocialTruth). He is an author of over 20 peer-reviewed scientific publications. His interests pertain to the application of machine learning in several domains, including cybersecurity.

