IEEE RELIABILITY SOCIETY SECTION
Received February 25, 2020, accepted March 16, 2020, date of publication March 30, 2020, date of current version April 22, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2983568
An Intrusion Detection Model With Hierarchical Attention Mechanism
CHANG LIU1, (Member, IEEE), YANG LIU2, YU YAN3, (Student Member, IEEE), AND JI WANG4
1 Institute of Electronics and Information Engineering, Guangdong Ocean University, Guangdong 524088, China
2 Beijing Institute of Astronautical Systems Engineering, Beijing 100076, China
3 College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China
4 College of Information and Communication Engineering, Guangdong Ocean University, Guangdong 524088, China
Corresponding author: Ji Wang (zjouwangji@163.com)

This work was supported by the Program for Scientific Research Start-Up Funds of Guangdong Ocean University.
ABSTRACT Network security has always been a hot topic as security and reliability are vital to software
and hardware. Network intrusion detection system (NIDS) is an effective solution to the identification of
attacks in computer and communication systems. A necessary condition for high-quality intrusion detection
is the gathering of useful and precise intrusion information. Machine learning, particularly deep learning,
has achieved considerable success in various fields of industry and academia due to its strong ability in feature
representation and extraction. In this paper, deep learning methods are integrated into the NIDS. The intrusion
activity is regarded as a time-series event and a bidirectional gated recurrent unit (GRU) based network
intrusion detection model with hierarchical attention mechanism is presented. The influence of different
lengths of previous traffic on the performance is then studied. Some experiments are performed on the dataset
UNSW-NB15, in which the proposed hierarchical attention model achieves satisfactory detection accuracy
of more than 98.76% and a false alarm rate (FAR) of lower than 1.2%. An attention probability map to reflect
the importance of features is then visualized using the attention mechanism. This visualization helps
explain why the same features vary in importance across different traffic classes and can guide
feature selection in the future.
INDEX TERMS Intrusion detection system, recurrent neural network, attention mechanism, visualization.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhaojun Li.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
67542 VOLUME 8, 2020
C. Liu et al.: Intrusion Detection Model With Hierarchical Attention Mechanism

I. INTRODUCTION
Vast amounts of data are generated, processed, and exchanged in the use and interaction process of numerous devices. Such data has become the target of illegal activity, which has caused significant damage to network systems [1], [2]. Research into advanced security methods has become increasingly important in both industry and academia in order to consistently improve and update security threat detection [3]. The basic general components of network security mechanisms include firewall, user authentication technology, anti-virus software, and an intrusion detection system (IDS) [4], [5]. As a proactive security technology, IDS monitors a host or network and alerts when an attack is detected. Cybersecurity can be further guaranteed through intrusion detection methods in which network attack behavior can be obtained and learned by data analysis and modeling.

According to the location of the deployment and the scope of monitoring, IDS products can be loosely divided into network intrusion detection system (NIDS) and host-based intrusion detection system (HIDS) [6]. The NIDS works at the network layer to detect network threats by taking all traffic from the target network as its data source to protect the entire network segment [7]–[9]. The HIDS serves as a monitor and analyzer of a computer system that does not act on the external interface, but focuses on the internal system [10], [11]. This framework commonly analyzes system logs, processes, or files to monitor the dynamic behavior of all or part of the system and the state of the entire computer system. In the network system, many devices or components require IDS support such as web server, file server, and workstations [12]. A scenario illustrating how IDS works at different sites in the network system is provided in Fig. 1.

FIGURE 1. Intrusion detection system works at different places of the network system.

The development of network technology and hardware devices creates issues for the application and upgrade of IDS [13]–[15]. Current challenges include the following: 1) Diversity: An increase in the type of network protocols makes it increasingly hard to distinguish between normal and abnormal data. 2) Low-frequency attacks: The imbalanced distribution of different attack types results in weak detection precision of the IDSs, particularly for data-driven methods. 3) Adaptability: The diverse and flexible characteristic of the network causes a significant reduction in the lifespan of detection models because IDS requires updating to adapt to the evolving environment. 4) Placement: Distributed, centralized, and hybrid methods must be adopted according to specific considerations of financial, computational, and time costs. 5) Accuracy: Existing traditional techniques cannot achieve the required high-level accuracy due to the aforementioned challenges. To ensure the performance of IDS, a deeper, more granular and increasingly comprehensive understanding of the nature of the intrusion events is required.

Around the issue of intrusion detection, some scholars have done a lot of work, using methods including expert knowledge, data mining, and machine learning [16]–[18]. Among them, the deep learning method is unique, providing a high level of detection performance. Deep neural network mimics human nerves and uses a large number of non-linear processing units to deal with complex problems [19]–[21]. It can automatically learn features and extract core data information. Due to the improvement of hardware and optimization of the algorithm, recurrent neural network (RNN) has received widespread acclaim. RNN has become a star model in applications including natural language processing (NLP), semantic understanding, and speech recognition [13], [22]. Learning to identify whether the network traffic is normal or anomalous can be understood as learning to perform sentiment analysis or document classification given several sentences. From this perspective, network intrusion detection is partly similar to sentiment analysis tasks, for which RNN-based methods have been suitable.

In this study, network traffic activity is treated as a time-series event, meaning that the assessment of traffic type at the current time depends not only on the current data but also on data at the previous moments. To provide the ability to process such data, an RNN-based method is used as a benchmark approach for intrusion detection. In reality, the traffic information at different moments or features in a sample of traffic contributes differently to the judgment of the current traffic type. To take full advantage of this characteristic, attention mechanism is adopted to enhance the model, with two kinds of attention mechanism respectively applied to the feature and traffic slice. The attention mechanism provides the ability of visualization to discern which feature or traffic slice is important. The proposed attention-based models are then evaluated on the benchmark dataset UNSW-NB15, which has been used frequently in various recent studies. The experiments show that the proposed attention-based model demonstrates superior performance compared with other models. The main contributions of this paper include:

1) Different features or traffic slices contribute uniquely to the classification of the current traffic. To account for sensitivity, the proposed model includes two levels of attention mechanisms, feature-based and slice-based. The attention mechanism guides the model to provide increased attention to some individual features or traffic slices when constructing the representation of traffic information. Based on the above, an attention map is visualized, contributing to an understanding of the importance of features or slices of traffic.
2) Three RNN-based detection models are individually compared with the different attention mechanisms of no-attention, one-layer attention, and hierarchical attention. It is observed that the attention mechanism contributes to improved model performance. The influence of timestep on the performance of IDS is also studied and the concept of cost-performance is applied to determine if the value of timestep must be increased.
3) The entire UNSW-NB15 dataset is utilized in this study, rather than partial data. The results show that when the timestep equals 10, the hierarchical attention model achieves the highest detection accuracy of over 98.76% and the false alarm rate (FAR) is as low as 1.49%.

The rest of this paper is organized as follows. Section II details existing NIDS works, mainly using RNN as the base model. In Section III, a number of basic methods of RNN and attention mechanism are introduced. Section IV details the proposed work, and Section V presents the results and analysis of experiments. Finally, Section VI describes the conclusion of this paper and the direction of future work.

II. RELATED WORKS
The three predominant types of NIDS are misuse-based, anomaly-based, and hybrid. Among them, the misuse-based method works by constructing a pattern matching template to detect intrusion. The constructed template is built on artificial knowledge and the analysis of existing data. The template is fixed, so the benefit of this method exists in detecting known attack types with high accuracy [23]. However, this feature also leads to an inherent disadvantage of this method as in a dynamic network environment, new attack types or variations may appear at any time. It is thus difficult for the misuse-based approach to perform adequately in the static background [24], [25]. Another kind of intrusion detection method is the anomaly-based approach, which operates by only utilizing normal data so that samples with different behaviors may all be judged as anomaly [26]. When an attack occurs in a real device where the misuse-based method is deployed, the NIDS will raise an alarm, but provide no information about the exact attack type. However, the disadvantage of this method is poor accuracy performance as some attacks behave like normal data or it is difficult to separate the attack data and the normal data in the extracted features.

Several machine learning based approaches proposed in previous studies have achieved success in intrusion detection systems. In [27], Hebatallah et al. presented a framework for feature selection considering irrelevant and redundant features. In their model, five different kinds of feature selection strategies are used and the J48 decision tree classifier with gain ratio filter is determined to have the best performance. In [28], Tian et al. proposed a robust and sparse method using one-class support vector machine (OSVM), which aimed to locate samples that are different from the majority of data. However, the anomaly method is limited by the outliers and noise during the training phase. To improve the performance of this model, the Ramp loss function is adopted, making the algorithm more robust and sparse.

Deep learning has become an important branch of machine learning and has become the preferred solution to many problems. This method has been applied in the intrusion detection field, achieving remarkable results. In [29], Khan et al. presented a two-stage intrusion detection model based on the stacked autoencoder network. In the initial phase, the traffic is judged as normal or abnormal by the value of classification probability. In the second stage, the result of the first stage is regarded as an extra feature for the following multi-class classification process. However, the detection accuracy could only reach 89.134% on the UNSW-NB15 dataset. In [30], Tian et al. presented a hybrid method of shallow and deep learning using a stacked autoencoder to reduce the dimension of features. The SVM is then combined with the artificial bee colony algorithm for classification. The experiments were also conducted with accuracy reaching beyond 89.62%.

Many scholars have explored works using RNN to solve network intrusion detection problems. In 2012, Sheikhan et al. presented a three-layer RNN model to solve the misuse-based intrusion detection problems [31]. The input features in their experiment are divided into four categories according to the feature attribute. However, the RNN model is reduced in this method, meaning that the connection between the neural layers is partial and diminishes the performance. In 2016, Kim et al. explored the possibility of applying RNN to intrusion detection using a variant of RNN to build an intrusion detection model [32]. Instances from the KDD Cup99 dataset were extracted in their experiment which focused on locating the hyperparameters and evaluating model performance. In 2017, Yin et al. used standard RNN to build an IDS, and evaluated their approach with benchmark dataset NSL-KDD [33]. In their work, the number of hidden nodes, the number of layers, and the learning rate have become the main variables. Unfortunately, the accuracy of their proposed model is not adequate. In 2018, Xu et al. constructed a new DNN model that applied gated recurrent unit (GRU) and multilayer perceptron (MLP) to extract data information [34]. Their simulation results show that the GRU cell can be more effective than the long short-term memory (LSTM) cell for intrusion detection problem. In [35], Anani et al. used the full KDD Cup99 dataset to compare the model detection performance based on LSTM, bidirectional long short-term memory (BiLSTM), skip-LSTM, and GRU. The results illustrate that the GRU achieves superior performance compared to other models. In [36], Agarap et al.

sought to enhance the ability of classification by building a GRU model and introducing linear SVM to replace the softmax classifier. Similarly, L2-SVM loss function was adopted to replace the cross-entropy function. In [37], Roy et al. selected samples from the UNSW-NB15 dataset and built a BiLSTM network. Five features were selected, reaching an accuracy of over 95%. However, only part of the dataset is utilized in this approach, which may cause some bias in the results. In [38], the authors used the unsupervised version of different variants of RNN cells to construct an autoencoder for intrusion detection. In [39], an end-to-end intrusion detection approach was proposed. Network packets were adopted as the input and processed sequentially. This method involves no feature engineering or domain knowledge; instead, the payloads are divided into characters, and the RNN model is trained to identify specific sequences. However, the drawback of this end-to-end approach is that there are too many parameters which make the model overly complex.

III. BASIC THEORY
A. GRU-BASED METHOD
The RNN is unique as the neural unit is self-connected, meaning that when the cycle unfolds, the data flow over time is preserved in the neurons [40]. The cyclic structure of neurons enables them to preserve historical information and provide sequence modeling capabilities. The RNN calculates a mapping from the input $x = (x_1, x_2, \ldots, x_T)$ to the hidden state $h = (h_1, h_2, \ldots, h_T)$ as follows:
$$h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \tag{1}$$
where $\sigma$ is a non-linear function and $t \in [1, T]$. $W_{xh}$ and $W_{hh}$ are corresponding weight matrices, and $b$ is a bias term.

The gradient descent method is often used to train deep learning models, and back propagation (BP) is a way to obtain the gradient. In particular, backpropagation through time (BPTT) is an algorithm that specifically solves the computation of parameters in RNN models. However, as it is limited by structure, the gradient in RNN can easily explode or vanish due to the product of $W$ [41].

As traditional RNN is limited by gradient vanishing or exploding, variants of RNN are proposed. Gated recurrent unit (GRU) was proposed to address such issues by introducing the gating mechanism [42]. There are two kinds of gates: the reset gate $r_t$ and the update gate $z_t$. They work together to decide the information update process.

FIGURE 2. Structure of gated recurrent unit.

Suppose the current input is $x_t$ and the new state $h_t$ in time $t$ contains two parts: the candidate state $\tilde{h}_t$ and the past state $h_{t-1}$.
$$h_t = (1 - z_t) h_{t-1} + z_t \tilde{h}_t \tag{2}$$
The reset gate $r_t$ works in the process to derive the candidate state. The way to obtain the candidate state is similar to that in traditional RNN except for the gate mechanism.
$$\tilde{h}_t = \tanh(x_t W_{xh} + W_{hh}(r_t \odot h_{t-1}) + b_h) \tag{3}$$
where $\odot$ stands for the Hadamard product, $W$ is weight matrix, and $b$ is the bias.

Here, $r_t$ helps to control how much information from the past state can be added into the candidate state. $r_t$ is updated as follows:
$$r_t = \sigma(x_t W_{xr} + h_{t-1} W_{hr} + b_r) \tag{4}$$
According to the equation to obtain new state $h_t$, update gate $z_t$ plays a role to balance the previous state $h_{t-1}$ and the current candidate state $\tilde{h}_t$. $z_t$ can then be regarded as a valve for distributing the past information and the new information. The update of $z_t$ is similar to that of $r_t$:
$$z_t = \sigma(x_t W_{xz} + h_{t-1} W_{hz} + b_z) \tag{5}$$
Former experiments have shown that the BiGRU cell performs better than other three cells including LSTM, GRU, and BiLSTM. A Bidirectional GRU (BiGRU) is an enhanced version of the GRU that works in two directions. It summarizes the forward information $\overrightarrow{h}$ and the backward information $\overleftarrow{h}$ to enhance feature extraction abilities.
$$\overrightarrow{h}_t = \overrightarrow{\mathrm{GRU}}(x_t), \quad t \in [1, T]$$
$$\overleftarrow{h}_t = \overleftarrow{\mathrm{GRU}}(x_t), \quad t \in [1, T] \tag{6}$$

B. ATTENTION MECHANISM
The generation of attention mechanism is inspired by the behavior of humans. Human attention happens, to some extent, when humans predominantly focus on particular local regions of an image or special words in one sentence. The attention mechanism helps to fully utilize limited resources. The regular process of attention mechanism is illustrated in Fig. 3. The attention value can be obtained by the pair of key and query. The attention mechanism is not a specific method, but a mode of thinking which contains the two important components of addressing and calculating.

Using an attention model, an input can be written as $X = [x_1, x_2, \ldots, x_n]$, where $n$ can be treated as different timestep for a 3-D data or the number of features for a 1-D vector. Addressing is also called alignment score function, and is

used to obtain the attention probability. The attention probability represents how much weight should be given to the hidden state of each input. There are numerous addressing methods available. In this paper, a location-based attention and dot-product attention method is utilized.

FIGURE 3. The regular pipeline of attention mechanism.

Location-based attention was initially proposed in [43]. It computes the alignment from the generator state and the previous alignment only in such a simple way:
$$\alpha_t = \mathrm{softmax}(W_a h_t) \tag{7}$$
where $W_a$ is the weight matrix and $h_t$ is the current hidden state.

FIGURE 4. The illustration of location-based attention.

Dot-product attention consists of three parts: a learned key matrix $K$, a value matrix $V$, and a query vector $q$. The process to obtain the attention vector is illustrated in Fig. 5.

FIGURE 5. The illustration of dot-product attention.

First, the key matrix is obtained:
$$K = \tanh(V W^a) \tag{8}$$
where $W^a$ is a randomly initialized weight matrix. After determining the current key matrix, the similarity between each query value and the current key value is calculated to obtain a normalized probability vector $d$, which is the weight vector.
$$d = \mathrm{softmax}(q K^T) \tag{9}$$
Finally, the attention vector can be obtained by:
$$a = dV \tag{10}$$
After deriving the probability vector, the final attention representation, that is context vector, can be calculated. Depending on the range of hidden states used, the attention mechanism can be divided into global attention and local attention [44]. In this paper, we used the global attention shown in Fig. 6. The global attention model absorbs all the hidden states when deriving the context vector $c_t$. Due to this calculating way, $c_t$ can capture relevant source-side features.

FIGURE 6. Global attention model.

IV. PROPOSED MODEL
The proposed model is introduced in this section. The hierarchical attention mechanism of feature-based attention and slice-based attention is applied respectively into the IDS. The overall architecture of the hierarchical attention intrusion detection model is shown in Fig. 7. This model consists of three main steps. To begin with, data preprocessing is required. The main operations at this stage include the missing value process, feature transformation, and feature normalization. Feature-based attention is then utilized for enhancing the expression ability of the traffic features, then the slice-based attention is applied to several pieces of traffic data.

A. FEATURE-BASED ATTENTION
Not all features have the same importance in the representation of single traffic information. Thus, to fully release the energy of some features and capture the features that are truly

significant to the representation of traffic, the feature-based attention mechanism is adopted to determine which feature should be the focus. Besides, the location-based attention mechanism has no additional objects of interest, and is only relevant to each input in the data source itself. So it is very suitable to deal with the input features.

FIGURE 7. Proposed model for intrusion detection.

Given a sample with $N$ dimensions, $X_i = [x_i^0, x_i^1, \ldots, x_i^{N-1}]$, the softmax function is adopted to get the probability vector, that is the weight for each feature. The normalized weight of the $j$-th feature in time $i$ can be computed by:
$$\alpha_i^j = \mathrm{softmax}(x_i^j) = \frac{e^{x_i^j}}{\sum_{k=0}^{N-1} e^{x_i^k}} \tag{11}$$
The value of $\alpha_i^j$ shows the importance of feature $j$. Based on the above definition, the final output $h_i^j$ with location-attention can be derived as:
$$h_i^j = x_i^j \times \alpha_i^j \tag{12}$$
So in this part, a fully-connected layer with softmax activation is then adopted to determine the weight vector $\alpha$. Input $X_i$ is then multiplied by $\alpha$ to derive the output $h_i$.

B. SLICE-BASED ATTENTION
We believe the traffic data is time-related. Traffic information for multiple adjacent moments helps significantly to judge the type of current traffic. Thus, several pieces of traffic information are grouped together, which is called slice traffic. The dot-product attention is adopted due to the optimized matrix multiplication operation in the program that can reduce the resource consumption during calculation.

For each timestep, the corresponding hidden state $h_i$ is fed through a single-layer perceptron to obtain $u_i$ as a hidden representation of $h_i$.
$$u_i = \tanh(W_w h_i + b_w) \tag{13}$$
The importance of each piece of traffic at different moment $i$ is then evaluated using the similarity of $u_i$ with $u_s$. A normalized importance vector $\alpha$, also called attention weight, can be computed through a softmax function.
$$\alpha_i = \frac{\exp(u_i^T u_s)}{\sum_i \exp(u_i^T u_s)} \tag{14}$$
The output of slice-based attention is then computed as a weighted sum. The context vector $v$ can be regarded as a high level representation of the slice traffic.
$$v = \sum_i \alpha_i h_i \tag{15}$$
A summary of the algorithmic phases of the proposed hierarchical attention intrusion detection model is provided in Algorithm 1.

V. EXPERIMENT
A. DATASET
A modern dataset that can represent actual situations in the real network is required to build and evaluate the performance of NIDS. The KDDCUP'99, NSL-KDD, and UNSW-NB15 datasets are compared in this paper, considering multiple factors such as dataset size, number of types, and data distribution.

TABLE 1. Comparison of several training dataset for intrusion detection.

Referring to Table 1 and Table 2, it can be determined that UNSW-NB15 is an ideal candidate dataset for intrusion detection. UNSW-NB15 was created by Moustafa et al. to

overcome the shortcomings of KDDCUP'99, and has gradually become one of the benchmark datasets in the field of IDS [45]. UNSW-NB15 includes rich traffic types so that it can more accurately reflect the characteristics of modern network traffic data. Ten types of traffic data exist which are Normal, DoS, Fuzzers, Analysis, Exploits, Reconnaissance, Worm, Backdoors, Generic, and Shellcode. Besides, the distribution of normal and anomaly data is balanced both in training and testing datasets.

TABLE 2. Comparison of several testing dataset for intrusion detection.

Algorithm 1 Algorithm for the Hierarchical Attention Intrusion Detection Model
Input: The training dataset $X$ is input with $n$ pieces of samples. Each sample is $x^{(i)}$, where $i \in (1, \ldots, n)$. The weight matrix of GRU cells $W$ is initialized along with attention layer matrix $W_a$, learning rate $l$, number of timesteps $N_t$, epochs $K$;
Output: The classification category $y$ is output with the feature-based attention probability $\alpha_1$ and the slice-based attention probability $\alpha_2$;
1: Data Preprocessing: Missing value filling is conducted by transforming the nominal features into numerical data and then normalizing the numerical data into the range of 0 and 1;
2: The current data $x_t$ is merged with the history data, where the length of history data is determined by $N_t$;
3: for $k = 1 : K$ do
4:   The feature-based attention probability is obtained: $\alpha_1^t = \mathrm{softmax}(x_t)$;
5:   $s_t^1 = \alpha_1^t x_t$;
6:   The BiGRU cells are fed with $s_t^1$ and the output $[h'_t, h_t]$ is obtained;
7:   $u_t = \tanh(W_w h_t + b_w)$;
8:   $\alpha_2 = \mathrm{softmax}(u_t)$;
9:   $v = \sum_i \alpha_i h_i$;
10:  The BPTT algorithm with learning rate $l$ is used to train the model;
11:  The output $o_t$ of the model is obtained;
12:  if $o_t > 0.5$ then
13:    $y_t = 1$
14:  else
15:    $y_t = 0$
16:  end if
17: end for
18: return $y_t$, $\alpha_1$, $\alpha_2$

Each sample in UNSW-NB15 contains 49 features, which can be divided into the five sections of flow features. The detailed descriptions of every feature are listed in Table 3. The official website provides a pair of training and testing datasets and there exist 82,332 records in the testing set, where the normal accounts for 45% and anomaly is 55%. The training set is comprised of 175,341 samples, where the ratio of normal records to the abnormal is 32% to 68%. In this research, the entire UNSW-NB15 dataset is adopted for model evaluation and analysis.

To meet the requirements of the input format in neural network, data preprocessing is required and mainly includes feature transformation and normalization. Feature transformation is used to transform the symbolic features into numerical data such as service, state, and proto. This step is necessary because neural network calculations only allow numerical operations. Several feature transformation techniques exist, among which one-hot encoding is frequently adopted, especially in the case of attributes that are not serializable and cannot be compared in value. After encoding, the dimension of the samples is changed from 42 to 196.

Feature normalization is highly useful in deep learning methods and is utilized in most neural network calculation works. This is related to the activation feature of neurons and updating of the weight [46]. In several partitions, the response of neurons is stronger than other parts which will accelerate the speed of training. In this paper, the min-max technique is adopted as follows:
$$x^* = \frac{x - \min}{\max - \min} \tag{16}$$

B. EVALUATION
To evaluate the performance of a classifier, the confusion matrix is defined in Table 4. True Negative (TN) means the total number of normal examples correctly classified. False Negative (FN), contrary to TN, represents the amount of normal data wrongly judged. True Positive (TP) stands for the number of attacks correctly classified. False Positive (FP) is the amount of attack samples that are wrongly divided into normal parts.

Based on the above definition, other advanced metrics can be obtained. Accuracy is a good measure when the classes are balanced.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{17}$$
The FAR is a traditional metric and reflects the situation in which records are misclassified. The definition is as follows:
$$\mathrm{FAR} = \frac{1}{2}\left(\frac{FP}{FP + TP} + \frac{FN}{FN + TN}\right) \tag{18}$$
The Precision is the ratio of records correctly classified as attacks to the number of attacks and Recall is the fraction of

correctly classified attacks to all records that are detected to be anomaly.
$$\mathrm{Precision} = \frac{TP}{TP + FN} \tag{19}$$
$$\mathrm{Recall} = \frac{TP}{TP + FP} \tag{20}$$

TABLE 3. UNSW-NB15 dataset.

TABLE 4. Confusion matrix for binary classification.

C. MODEL CONFIGURATION AND TRAINING
In this paper, Keras with TensorFlow as the backend is used to build the model. To meet the requirement of the input dimension for BiGRU, the dataset is reorganized into a 3D shape. All 196 features are then arranged in a single piece of data into a vector. Some samples at the end of the dataset are dropped in order to make the total number an integer multiple of the batch size. Thus, the final training dataset has a shape of (175340, timestep, 196) and the shape of the testing dataset is (825340, timestep, 196), where timestep is a hyper-parameter representing the length of historical events. To create such data, the TimeseriesGenerator in Keras is adopted.

In the proposed model, to build the feature-based attention mechanism, a Dense layer with softmax activation is connected to the input layer. The number of hidden units in this dense layer is equal to that of the input layer. Two BiGRU layers with 32 and 12 units respectively are then stacked together for processing time-series data. Each timestep creates an output, and the dot-product attention is applied to all the steps. Finally, Dense layers are connected to the output of the attention layer and the output layer has only one unit. In the training phase, a batch of 1024 is used and the Adam optimizer is adopted. The parameters of Adam are set to be lr = 0.1, beta_1 = 0.9, beta_2 = 0.999. The binary_crossentropy is adopted as the loss function.
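The two attention stages of the model (Eqs. (11)-(15)) can be illustrated with a minimal pure-Python sketch. This is not the authors' Keras implementation: the BiGRU layers are omitted, so the feature-weighted records stand in for the hidden states h_i, and the toy window, the weights W_w and b_w, and the context vector u_s are hypothetical values chosen only for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def feature_attention(record):
    """Eqs. (11)-(12): weight each feature by the softmax of the record itself."""
    alpha = softmax(record)
    weighted = [x * a for x, a in zip(record, alpha)]
    return weighted, alpha


def slice_attention(hidden_states, W_w, b_w, u_s):
    """Eqs. (13)-(15): score each timestep against u_s, then take the weighted sum."""
    scores = []
    for h in hidden_states:
        # u_i = tanh(W_w h_i + b_w), Eq. (13)
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
             for row, bias in zip(W_w, b_w)]
        # similarity u_i^T u_s used inside the softmax of Eq. (14)
        scores.append(sum(ui * us for ui, us in zip(u, u_s)))
    alpha = softmax(scores)
    dim = len(hidden_states[0])
    # v = sum_i alpha_i h_i, Eq. (15)
    v = [sum(a * h[d] for a, h in zip(alpha, hidden_states)) for d in range(dim)]
    return v, alpha


# Hypothetical toy window: 3 timesteps, 2 min-max normalized features each.
window = [[0.1, 0.9], [0.5, 0.5], [0.8, 0.2]]

# Assumption: BiGRU is skipped; the feature-weighted records play the role of h_i.
weighted = [feature_attention(record)[0] for record in window]

# Hypothetical parameters; in a trained model these would be learned.
W_w = [[1.0, 0.0], [0.0, 1.0]]
b_w = [0.0, 0.0]
u_s = [1.0, 1.0]

context, slice_weights = slice_attention(weighted, W_w, b_w, u_s)
```

The returned `slice_weights` play the role of the slice-based attention probability, and the per-record weights from `feature_attention` correspond to the feature-based attention probability that the paper visualizes as an attention map.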

D. RESULT AND ANALYSIS

The three different structures of no attention, single attention,
and hierarchical attention were individually explored in this
research (other components were kept the same except the
attention module). The influence of timestep on the conver-
gence performance was first explored. Corresponding exper-
iments were then conducted on the hierarchical attention
model and several timesteps were randomly selected.
The convergence curves plotted in Fig. 8 illustrate how
the loss function changes with iterations during the training
phase. It can be observed that even though timestep is different,
the model will finally converge. Additionally, the larger
the timestep, the lower the loss value, meaning that model
performance is improved when the timestep is larger. Con-
sidering the speed of convergence, it appears that the value of
timestep has no effect on this condition.
FIGURE 8. Convergence curve of the hierarchical attention model during the training phase with different timestep.
FIGURE 9. The experiment results on training and testing dataset.
The influence of timestep on accuracy and false alarm
rate was also studied, and the results are provided in Fig. 9.
As illustrated in Fig. 9(a), as the value of timestep increases,
the accuracy on the testing dataset also increases gradually.
The improvement due to timestep growth is clearly evident:
detection accuracy reaches 91.69% when timestep = 1 and
98.88% when timestep = 11.
FIGURE 10. The detection process on testing dataset using BiGRU with 10 timestep.
FIGURE 11. Attention map for a case of normal traffic.
TABLE 5. Running time comparison of different models on the testing dataset (/second).

To better characterize the impact of timestep, a gain ratio is introduced, defined as the accuracy improvement obtained at each timestep. Starting from timestep = 2, a vertical line serves as the indicator of the increase of accuracy: the longer the line, the greater the increase of accuracy. Generally, the development of a model experiences a fast-ascension period and a slow-convergence period, which is evident in the proposed model. Although the length of the black line reaches its maximum value when timestep equals 2, the actual detection rate can improve further, so the model is still considered to be in the fast-ascension period. When timestep = 11, the accuracy has reached a relatively high level, the gain ratio is at its smallest, and there is little room for further improvement. At the same time, the black line has almost disappeared. In summary, 10 is determined as an optimized candidate value for the timestep parameter. Fig. 9(b) shows a comparison of false alarm rate under different timestep values, where FAR is retained at a relatively low level when timestep = 10, again indicating that the optimal value of timestep is 10.

It can also be concluded that the hierarchical attention-based model performs the best, and the single-level attention-based model performs better than the model with no attention mechanism. When the value of timestep is small, for example 2 or 3, the attention mechanism has a significant impact on the performance improvement of the model. As the timestep increases, the effect of the attention mechanism is gradually reduced. This is likely because the features extracted by the BiGRU with a high timestep value are sufficient to characterize the data, as illustrated by the performance of the model without attention mechanism.
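The gain ratio introduced earlier in this subsection can be sketched in a few lines. The sketch assumes that the gain at a timestep is simply the accuracy at that timestep minus the accuracy at the previous one; the function name `accuracy_gains` is hypothetical, and the accuracies below are made-up toy values, not results from the paper.

```python
from typing import Dict


def accuracy_gains(accuracy_by_timestep: Dict[int, float]) -> Dict[int, float]:
    """Return accuracy(t) - accuracy(t - 1) for every timestep t that has a predecessor."""
    gains: Dict[int, float] = {}
    for t in sorted(accuracy_by_timestep):
        if t - 1 in accuracy_by_timestep:
            gains[t] = accuracy_by_timestep[t] - accuracy_by_timestep[t - 1]
    return gains


# Made-up accuracies (in percent) for illustration only; not results from the paper.
toy_accuracy = {1: 90.0, 2: 95.0, 3: 97.0, 4: 97.5}

gains = accuracy_gains(toy_accuracy)
largest = max(gains, key=gains.get)
print(gains)
print("largest gain at timestep", largest)
```

In this toy series the gains shrink as the timestep grows, which mirrors the fast-ascension and slow-convergence pattern described in the text: a candidate timestep is one after which the gain becomes negligible.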
We also compare the running time of different models
on the testing dataset, and the result is shown in Table 5.
The time cost of each model gradually increases with the
timestep, and the model with multi-level attention takes
longer than the other two models. Moreover, the larger the
timestep, the larger the time difference between any two models.
The detection phase on the testing dataset using the hierarchical attention model with timestep = 10 is shown in Fig. 10. The bold blue dots represent the real labels of the testing samples, the medium orange dots denote the correct predictions, and the small red dots are incorrectly classified samples. In the beginning, despite the fluctuation in the classified data (orange dots), a satisfying performance can still be achieved. However, with increased time and samples, the performance of the model is reduced, roughly beginning at the 40,000th sample. This may be attributed to the emergence of new types of attacks as time goes on. For some features, certain values exist in the testing dataset that are not available in the training dataset. Another reason may be the fluctuation of Normal data, especially in features like proto and state. Online learning could provide a solution to these problems. In future research, the inclusion of online operations in this model will be considered.

The proposed method was also compared with other works using the UNSW-NB15 dataset, as shown in Table 6. The comparison results further illustrate the effectiveness and improvement of the proposed hierarchical attention model.

FIGURE 12. Attention map for a case of anomaly traffic.
TABLE 6. Comparison between our proposed model with other machine learning algorithms.

E. VISUALIZATION OF ATTENTION
To validate that attention effectively helps to select informative features or pieces of traffic, two pieces of traffic were randomly selected, one normal and one anomalous. The attention probability was then visualized separately, using both slice-based attention and feature-based attention. To classify the current traffic, several previous traffic data points were also considered.

Attention maps for a case of normal traffic and a case of anomaly traffic are illustrated in Fig. 11 and Fig. 12, respectively. The x-axis is the value of timestep and the y-axis is the feature number or feature name. There are two different ways to illustrate attention probability. A bar chart is used to represent the slice-based attention, and the color block is adopted for the illustration of feature-based attention. The darker the color, the greater the probability.

In Fig. 11(a) and Fig. 11(b), the lower section in the subgraph displays the slice-based attention probability for
10 timesteps. The slice-based attention probabilities at the different timesteps are close to each other, which is reasonable because the 10 traffic data points all belong to the normal class and are similar to each other. As can be seen from the dark areas of the attention map, the features extracted are similar at varying timestep. This occurrence is also reasonable, as similar features have a similar effect on the final classification. Fig. 11(a) also illustrates that dload may be the most important feature for this kind of normal traffic.

An example of the attention map for anomaly traffic is provided in Fig. 12. The probability distribution is completely different from that in the normal case in Fig. 11. First, the features with strong responses at each timestep are different from each other, that is, the data at the different timesteps are all of different types. Further evidence that the data is of different types is provided by the assigning of attention probability. Therefore, when classifying the current data, data at other timesteps contributes nothing, which is reasonable. Additionally, features including sttl, dttl, dload, and ct_srv_dst play an important role in judging this kind of attack data. The attention mechanism strengthens the influence of these features, resulting in the improvement of model performance.

The hierarchical attention mechanism proposed in this work not only enhances the detection ability, but also helps to determine which feature plays a substantial role in the detection process. Feature selection can thus be conducted based on attention probability, which will be the focus of future work.

VI. CONCLUSION
This paper presented an intrusion detection model with hierarchical attention mechanism. Several traffic data are merged in order, and the influence of the number of previous traffic records on performance was also investigated. The proposed model was demonstrated to achieve satisfactory performance on the UNSW-NB15 dataset, with accuracy of more than 98.76% and FAR lower than 1.2%. In terms of detection accuracy, our attention-based intrusion detection model is better than some state-of-the-art approaches such as autoencoder, deep feedforward neural networks, and single-class support vector machines. Compared with the bidirectional long short-term memory model, our method has an improvement of 3.05%. With the assistance of the attention mechanism, an attention map was presented. The visualization may provide assistance for feature selection and contribute to the understanding of the differences between varied traffic classes in the future. However, our model lacks fast computation. Future developments will focus on the evolution of the attention mechanism and on attempts at parallel computing. Besides, our method only detects anomaly traffic, and information about the specific attack type cannot be obtained. Future work will therefore also be conducted on classifying specific types of attacks using the attention mechanism. The current version of the proposed model also cannot provide a second-phase detection function. As for the misclassified samples, in the future the model can be made lighter so as to support further detection.

ACKNOWLEDGMENT
The authors declare that there is no conflict of interests regarding the publication of this article. They gratefully thank the reviewers for their very useful discussions.

REFERENCES
[1] C. Alcaraz, R. Roman, P. Najera, and J. Lopez, "Security of industrial sensor network-based remote substations in the context of the Internet of Things," Ad Hoc Netw., vol. 11, no. 3, pp. 1091–1104, May 2013.
[2] H. Huang, J. Yang, H. Huang, Y. Song, and G. Gui, "Deep learning for super-resolution channel estimation and DOA estimation based massive MIMO system," IEEE Trans. Veh. Technol., vol. 67, no. 9, pp. 8549–8560, Sep. 2018.
[3] F. Salo, A. B. Nassif, and A. Essex, "Dimensionality reduction with IG-PCA and ensemble classifier for network intrusion detection," Comput. Netw., vol. 148, pp. 164–175, Jan. 2019.
[4] A.-R. Sadeghi, C. Wachsmann, and M. Waidner, "Security and privacy challenges in industrial Internet of Things," in Proc. 52nd ACM/EDAC/IEEE Design Autom. Conf. (DAC), Jun. 2015, pp. 1–6.
[5] Y. Lin, M. Wang, X. Zhou, G. Ding, and S. Mao, "Dynamic spectrum interaction of UAV flight formation communication with priority: A deep reinforcement learning approach," IEEE Trans. Cognit. Commun. Netw., early access, Feb. 12, 2020, doi: 10.1109/TCCN.2020.2973376.
[6] S. Agrawal and J. Agrawal, "Survey on anomaly detection using data mining techniques," Procedia Comput. Sci., vol. 60, pp. 708–713, Jan. 2015.
[7] P. Mishra, E. S. Pilli, V. Varadharajan, and U. Tupakula, "Intrusion detection techniques in cloud environment: A survey," J. Netw. Comput. Appl., vol. 77, pp. 18–47, Jan. 2017.
[8] T. Liu, Y. Guan, and Y. Lin, "Research on modulation recognition with ensemble learning," EURASIP J. Wireless Commun. Netw., vol. 2017, no. 1, p. 179, Dec. 2017.
[9] Y. Lin, C. Wang, J. Wang, and Z. Dou, "A novel dynamic spectrum access framework based on reinforcement learning for cognitive radio sensor networks," Sensors, vol. 16, no. 10, p. 1675, 2016.
[10] Z. Zhang, X. Guo, and Y. Lin, "Trust management method of D2D communication based on RF fingerprint identification," IEEE Access, vol. 6, pp. 66082–66087, 2018.
[11] H. Wang, L. Guo, Z. Dou, and Y. Lin, "A new method of cognitive signal recognition based on hybrid information entropy and D-S evidence theory," Mobile Netw. Appl., vol. 23, no. 4, pp. 677–685, Aug. 2018.
[12] R. Zuech, T. M. Khoshgoftaar, and R. Wald, "Intrusion detection and big heterogeneous data: A survey," J. Big Data, vol. 2, no. 1, p. 3, Dec. 2015.
[13] Y. Lin, X. Zhu, Z. Zheng, Z. Dou, and R. Zhou, "The individual identification method of wireless device based on dimensionality reduction and machine learning," J. Supercomput., vol. 75, no. 6, pp. 3010–3027, Jun. 2019.
[14] C. Shi, Z. Dou, Y. Lin, and W. Li, "Dynamic threshold-setting for RF-powered cognitive radio networks in non-Gaussian noise," Phys. Commun., vol. 27, pp. 99–105, Apr. 2018.
[15] Y. Xiao, C. Xing, T. Zhang, and Z. Zhao, "An intrusion detection model based on feature reduction and convolutional neural networks," IEEE Access, vol. 7, pp. 42210–42219, 2019.
[16] M. Ahmed, A. Naser Mahmood, and J. Hu, "A survey of network anomaly detection techniques," J. Netw. Comput. Appl., vol. 60, pp. 19–31, Jan. 2016.
[17] Y. Tu, Y. Lin, J. Wang, and J.-U. Kim, "Semi-supervised learning with generative adversarial networks on digital signal modulation classification," Comput. Mater. Continua, vol. 55, no. 2, pp. 243–254, 2018.
[18] Q. Shi, J. Kang, R. Wang, H. Yi, Y. Lin, and J. Wang, "A framework of intrusion detection system based on Bayesian network in IoT," Int. J. Performability Eng., vol. 14, no. 10, pp. 2280–2288, 2018.
[19] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[20] D. Kwon, H. Kim, J. Kim, S. C. Suh, I. Kim, and K. J. Kim, "A survey of deep learning-based network anomaly detection," Cluster Comput., vol. 22, no. S1, pp. 949–961, Jan. 2019.
[21] R. Wu, X. Chen, H. Han, H. Zhao, and Y. Lin, "Abnormal information identification and elimination in cognitive networks," Int. J. Performability Eng., vol. 14, no. 10, pp. 2271–2279, 2018.
[22] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-based models for speech recognition," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 577–585.
[23] S. T. Ikram, "Improving accuracy of intrusion detection model using PCA and optimized SVM," J. Comput. Inf. Technol., vol. 24, no. 2, pp. 133–148, Jun. 2016.
[24] N. Shone, T. N. Ngoc, V. D. Phai, and Q. Shi, "A deep learning approach to network intrusion detection," IEEE Trans. Emerg. Topics Comput. Intell., vol. 2, no. 1, pp. 41–50, Feb. 2018.
[25] F. Farahnakian and J. Heikkonen, "A deep auto-encoder based approach for intrusion detection system," in Proc. 20th Int. Conf. Adv. Commun. Technol. (ICACT), Feb. 2018, pp. 178–183.
[26] M. S. Islam, W. Khreich, and A. Hamou-Lhadj, "Anomaly detection techniques based on kappa-pruned ensembles," IEEE Trans. Rel., vol. 67, no. 1, pp. 212–229, Mar. 2018.
[27] H. M. Anwer, M. Farouk, and A. Abdel-Hamid, "A framework for efficient network anomaly intrusion detection with features selection," in Proc. 9th Int. Conf. Inf. Commun. Syst. (ICICS), Apr. 2018, pp. 157–162.
[28] Y. Tian, M. Mirzabagheri, S. M. H. Bamakan, H. Wang, and Q. Qu, "Ramp loss one-class support vector machine; a robust and effective approach to anomaly detection problems," Neurocomputing, vol. 310, pp. 223–235, Oct. 2018.
[29] F. A. Khan, A. Gumaei, A. Derhab, and A. Hussain, "A novel two-stage deep learning model for efficient network intrusion detection," IEEE Access, vol. 7, pp. 30373–30385, 2019.
[30] Q. Tian, J. Li, and H. Liu, "A method for guaranteeing wireless communication based on a combination of deep and shallow learning," IEEE Access, vol. 7, pp. 38688–38695, 2019.
[31] M. Sheikhan, Z. Jadidi, and A. Farrokhi, "Intrusion detection using reduced-size RNN based on feature grouping," Neural Comput. Appl., vol. 21, no. 6, pp. 1185–1190, Sep. 2012.
[32] J. Kim, J. Kim, H. L. T. Thu, and H. Kim, "Long short term memory recurrent neural network classifier for intrusion detection," in Proc. Int. Conf. Platform Technol. Service (PlatCon), Feb. 2016, pp. 1–5.
[33] C. Yin, Y. Zhu, J. Fei, and X. He, "A deep learning approach for intrusion detection using recurrent neural networks," IEEE Access, vol. 5, pp. 21954–21961, 2017.
[34] C. Xu, J. Shen, X. Du, and F. Zhang, "An intrusion detection system using a deep neural network with gated recurrent units," IEEE Access, vol. 6, pp. 48697–48707, 2018.
[35] W. Anani and J. Samarabandu, "Comparison of recurrent neural network algorithms for intrusion detection based on predicting packet sequences," in Proc. IEEE Can. Conf. Electr. Comput. Eng. (CCECE), May 2018, pp. 1–4.
[36] A. F. M. Agarap, "A neural network architecture combining gated recurrent unit (GRU) and support vector machine (SVM) for intrusion detection in network traffic data," in Proc. 10th Int. Conf. Mach. Learn. Comput. (ICMLC), 2018, pp. 26–30.
[37] B. Roy and H. Cheung, "A deep learning approach for intrusion detection in Internet of Things using bi-directional long short-term memory recurrent neural network," in Proc. 28th Int. Telecommun. Netw. Appl. Conf. (ITNAC), Nov. 2018, pp. 1–6.
[38] A. H. Mirza and S. Cosan, "Computer network intrusion detection using sequential LSTM neural networks autoencoders," in Proc. IEEE 26th Signal Process. Commun. Appl. Conf. (SIU), May 2018, pp. 1–4.
[39] H. Liu, B. Lang, M. Liu, and H. Yan, "CNN and RNN based payload classification methods for attack detection," Knowl.-Based Syst., vol. 163, pp. 332–341, Jan. 2019.
[40] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, Nov. 1997.
[41] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Netw., vol. 61, pp. 85–117, Jan. 2015.
[42] T. A. Tang, L. Mhamdi, D. McLernon, S. A. R. Zaidi, and M. Ghogho, "Deep recurrent neural network for intrusion detection in SDN-based networks," in Proc. 4th IEEE Conf. Netw. Softwarization Workshops (NetSoft), Jun. 2018, pp. 202–206.
[43] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," 2015, arXiv:1508.04025. [Online]. Available: http://arxiv.org/abs/1508.04025
[44] Y. Guo, J. Ji, X. Lu, H. Huo, T. Fang, and D. Li, "Global-local attention network for aerial scene classification," IEEE Access, vol. 7, pp. 67200–67212, 2019.
[45] N. Moustafa and J. Slay, "The evaluation of network anomaly detection systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set," Inf. Secur. J., Global Perspective, vol. 25, nos. 1–3, pp. 18–31, Apr. 2016.
[46] B. Recht, C. Re, S. Wright, and F. Niu, "Hogwild: A lock-free approach to parallelizing stochastic gradient descent," in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 693–701.
[47] M. Al-Hawawreh, N. Moustafa, and E. Sitnikova, "Identification of malicious activities in industrial Internet of Things based on deep learning models," J. Inf. Secur. Appl., vol. 41, pp. 1–11, Aug. 2018.
[48] Y. Zhou, M. Han, L. Liu, J. S. He, and Y. Wang, "Deep learning approach for cyberattack detection," in Proc. IEEE Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), Apr. 2018, pp. 262–267.

CHANG LIU (Member, IEEE) received the B.S. and M.S. degrees in computer science and technology from the Kharkiv National University of Ukraine, in 2008 and 2009, respectively, and the Ph.D. degree in radio technology and television systems from the Kharkiv National University of Radio Electronics, Kharkiv, Ukraine, in 2013. He has been a Teacher with the Heilongjiang Agricultural University of China, since 2011, and became an Associate Professor, in 2018. He is currently an Associate Professor with the Institute of Electronics and Information Engineering, Guangdong Ocean University. His main research areas are signal processing and artificial intelligence.

YANG LIU received the B.S. degree in electronic information engineering from the College of Electronic Information, Northwestern Polytechnical University, Xian, China, in 2015, and the M.S. degree with China Aerospace Science and Technology Corporation, in 2018. He is currently working with the Beijing Institute of Astronautical Systems Engineering. His research interests include command and control, and information security.

YU YAN (Student Member, IEEE) received the B.S. degree from the College of Information and Communication Engineering, Harbin Engineering University, Harbin, China, in 2019, where she is currently pursuing the master's degree with the College of Information and Communication Engineering. Her current research interests include network intrusion detection, machine learning, and data analysis.

JI WANG received the B.S. degree in electronics and communication technology from Liaoning University, China, in 1994, and the M.S. degree in engineering from the Guangdong University of Technology, in 2010. He is currently a Professor with the Institute of Electronics and Information Engineering, Guangdong Ocean University. He is also the Director of the Guangdong Intelligent Ocean Sensor Network and its Equipment Engineering Technology Research Center, a Senior Member of the China Electronics Society, and a member of the Guangdong Electronic Information Education and Reference Committee. His main research areas are wireless sensor network and ocean Internet of Things, information processing, and communication systems.
