yousefi-azar2017

Autoencoder-based Feature Learning for Cyber Security Applications
Mahmood Yousefi-Azar, Vijay Varadharajan, Len Hamey and Uday Tupakula

Department of Computing, Faculty of Science and Engineering, Macquarie University, Sydney, NSW, Australia.
Email: mahmood.yousefiazar@hdr.mq.edu.au, vijay.varadharajan@mq.edu.au
len.hamey@mq.edu.au, udaya.tupakula@mq.edu.au

Abstract—This paper presents a novel feature learning model for cyber security tasks. We propose to use Auto-encoders (AEs), as a generative model, to learn latent representations of different feature sets. We show how well the AE is capable of automatically learning a reasonable notion of semantic similarity among input features. Specifically, the AE accepts a feature vector, obtained from cyber security phenomena, and extracts a code vector that captures the semantic similarity between the feature vectors. This similarity is embedded in an abstract latent representation. Because the AE is trained in an unsupervised fashion, the main part of this success comes from the appropriate original feature set that is used in this paper. It can also provide more discriminative features in contrast to other feature engineering approaches. Furthermore, the scheme can reduce the dimensionality of the features, thereby significantly minimising the memory requirements. We selected two different cyber security tasks: network-based anomaly intrusion detection and malware classification. We have analysed the proposed scheme with various classifiers using publicly available datasets for network anomaly intrusion detection and malware classification. Several appropriate evaluation metrics show improvement compared to prior results.

I. INTRODUCTION

In cyber space, security requires a wide range of technologies and processes to protect the range of devices, from computers to smart phones, networks, the Internet of Things and users, and importantly data, from intrusion, unauthorized access and destruction. To meet these requirements, cyber security defensive technologies include two conventional systems, namely network defence systems and host defence systems. Each of these systems is composed of different technologies and various layers such as intrusion detection, firewall and antivirus [1].

Intrusion Detection Systems (IDSs), in particular network-based IDSs, are special-purpose algorithms and tools to detect anomaly attacks on a networked system, and help determine and identify unauthorized usage, duplication, alteration as well as any destruction of data. Depending on the detection techniques, IDSs can be categorized into different approaches such as signature-based detection, anomaly-based detection and behaviour-based detection. The focus of our study is on network-based anomaly intrusion detection systems.

Machine learning approaches are being widely used for anomaly intrusion detection [2, 3]. The schemes are able to detect patterns of known and unknown attacks in supervised, unsupervised or semi-supervised training schemes. Normally, using labeled data in supervised schemes could result in better performance compared to unsupervised learning models. The utilization of labeled data, alongside the large number of cyber criminals attempting to gain access to the data, has motivated researchers to use semi-supervised algorithms that have the advantages of both supervised and unsupervised schemes. The application of the unsupervised scheme proposed in this paper is relevant for both supervised classifiers and semi-supervised algorithms.

Malicious software is intentionally developed to target computer systems or networks for different purposes such as stealing information, distribution of spam messages or even destruction of messages. Malware classically refers to malicious software such as computer viruses, Internet worms, trojans, adware and ransomware. Identification and removal of malware are a significant part of both network and host defence systems. Detection, clustering and classification of malware are major threads in cyber security and form important applications of malware analysis.

Malware analysis using machine learning has been receiving much attention in recent years, both in academia and in industry [1, 3, 4]. The main reason behind this is the capability to automatically identify malicious software compared to more tedious manual techniques. Another major reason behind the application of machine learning techniques for malware analysis is the emergence of zero-day malware, whose fingerprints or signatures are unknown to the software developers.

In this study, our aim is to consider both malware classification and malware detection. To do both detection and classification, we introduce a technique that achieves a richer feature space using deep auto-encoders (AEs). The AEs, as automatic feature learning models, can provide more discriminative features in contrast to other feature engineering approaches. In the literature, a wide range of feature sets have been used to identify anomaly intrusions and malware [5, 6, 7]. Examples of feature sets in the network-based anomaly intrusion detection application domain include network flow, source IP and port, destination IP and protocols. For malware analysis, the number of bytes, the entropy of the binary file, system calls and operation codes of assembly files have been commonly used. The AEs can learn the concept space from

978-1-5090-6182-2/17/$31.00 ©2017 IEEE 3854


the original feature sets to achieve both these tasks.

Another advantage of our proposed scheme is the dimensionality reduction. In terms of tractability of a model, some classifiers require the observation of uncorrelated features. The two most commonly used statistical techniques to provide such features are Principal Component Analysis (PCA) and Zero Component Analysis (ZCA) [8, 9]. In practice, the greater the dimension of the feature space, the greater the memory required to compute the covariance matrix needed in either the PCA or the ZCA. In addition to a more discriminative feature space, the AEs can reduce the dimensionality of the features, thereby helping to reduce the memory needed to compute the covariance matrix.

More specifically, we have used the AEs to map the original feature space to a latent representation, with two unsupervised training stages. The motivation is that an AE, as a generative model, is capable of learning a reasonable notion of semantic similarity and the relation among input features [10, 11]. To evaluate the proposed scheme, we have carried out security analysis of the proposed scheme using two publicly available datasets.

In summary, the major contributions of this paper are the following:
• We introduce an unsupervised feature learning approach for the two different cyber security problems using AEs. Although AEs have been previously applied to cyber security, the proposed model has a unique training phase and topology compared to the previous works.
• We show how a single model with the same training model and topology can be quite effective for both malware classification and network-based anomaly intrusion detection. This is helpful when it comes to designing only one embedded security analysis tool for different systems.
• Our scheme uses almost the minimum number of features compared to other state-of-the-art algorithms. This makes the model more effective for real-time protection.
• In addition to the limited number of original features, the proposed scheme generates a small set of latent features. The resulting rich and small latent representation makes it practical to implement in small devices such as those in the Internet of Things.

The rest of this paper is organized as follows. Section II briefly describes some previous relevant works on malware and intrusion detection using machine learning models and deep learning approaches. In Section III, we describe the AE model and its pre-training and training stages. Section IV presents the performance of the proposed scheme for both malware detection and classification. Subsection IV-A describes the experimental setup, and Subsection IV-B presents the results of the model. Section IV-B1 presents and discusses malware classification, and Section IV-B2 presents the results for intrusion detection. Section V concludes.

II. LITERATURE REVIEWS

We categorize the previous work on the application of machine learning to cyber security tasks into two types: deep learning models and other non-deep models. Our literature review is mainly related to deep learning models that are related to malware analysis and intrusion detection.

Deep learning models have been recently used to detect and classify malware in Microsoft Windows and Android [12, 13, 14, 15]. The models use a wide range of structures such as Convolutional Neural Networks (CNNs), Auto-encoders (AEs), Recurrent Neural Networks (RNNs) and Deep Belief Networks (DBNs) [13, 14, 15, 16, 17, 18, 19].

The paper [15] showed that a stacked de-noising AE can be a good model to distinguish malicious from non-malicious software (i.e. detect malware). The model is designed to handle malicious scripts (e.g. JavaScript code). Stacked de-noising AEs have also been used in portable executable file classification [16]. This model uses Application Program Interface (API) calls as the dynamic feature set and provides a signature for each malware, consisting of 30 codes. The paper [14] developed an AE-based model that uses RBMs. This model could successfully detect malware using a wide range of dynamic and static feature sets and convert them to 16 feature vector sets and 4 graph feature sets. The paper [17] proposed RNN-based AEs that can automatically learn the representation of a malware from its raw API calls. They manage to handle the difficulties of training a recurrent neural network.

In the literature, another area of interest has been outlier detection, where recently deep learning models have shown some promising results. The proposed models vary in the structure of the model used, the application purpose and the motivation behind the chosen strategy [20, 21, 22, 23]. The paper [24] introduced the application of a Variational AE for intrusion detection and showed that the Variational AE can perform well for both network-based intrusion detection and outlier detection. The paper [25] proposed a hybrid AE and Density Estimation Model for anomaly detection. The model is based on estimating the density in the compressed hidden-layer representation of the applied AE. Another paper [26] uses hybrid stacked de-noising AEs.

Our model uses hex-based features of portable executable files, without disassembling. The model does not require any feature selection phase as a pre-processing step, which helps to improve the performance. The proposed scheme also provides a discriminative concept space to distinguish between a normal flow of network data and an anomalous flow. The model provides a fixed-size vector of 10 elements for both malware detection and classification tasks.

III. FEATURE LEARNING

The main purpose of unsupervised feature (or representation) learning is to provide a setting to map the original feature set into a different (or latent) representation that is suitable for a specific machine learning task. In representation learning, a model automatically learns both a specific task and the features themselves.

To achieve the goals of our study (i.e. learning a representation that is appropriate for detection and classification tasks),
we have used AE. Our motivation is that an AE offers a rich
representation from which it should be possible to re-generate
the original representation. This assumption is not task-specific
and has motivated us to apply AE for different cyber security
tasks.

A. Autoencoders
An AE (Figure 1) is a feed-forward network that learns
to reconstruct its input x [27]. The AE is trained to encode the
input x, using a set of recognition weights, into a feature space
C(x). Then, the features (codes) C(x) are converted into an
approximate reconstruction x̂ of x using a set of generative
weights. The generative weights are first obtained by unrolling
the encoder weights and are then refined in a fine-tuning
phase. Through the encoder (i.e. the mapping to a hidden
representation), the dimensionality of x is reduced to the given
number of codes. The codes are then mapped to x̂ through the
decoder. Because no labeled data is required in the training
process, the algorithm is unsupervised.
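
The encode/decode path described above can be sketched in a few lines. This is only an illustrative toy, not the authors' implementation: it assumes a single hidden layer, sigmoid units throughout, made-up layer sizes, and decoder weights that are simply the transposed ("unrolled") encoder weights before any fine-tuning. The helper names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def encode(x, W, a):
    """Recognition weights: map input x to the code vector C(x)."""
    return [sigmoid(sum(W[i][j] * x[i] for i in range(len(x))) + a[j])
            for j in range(len(a))]


def decode(code, W, b):
    """Generative weights: the same matrix W used in the opposite
    direction (unrolled encoder weights) yields the reconstruction."""
    return [sigmoid(sum(W[i][j] * code[j] for j in range(len(code))) + b[i])
            for i in range(len(b))]


random.seed(0)
n_in, n_code = 8, 3  # toy sizes (assumption), not the paper's topology
W = [[random.gauss(0.0, 0.1) for _ in range(n_code)] for _ in range(n_in)]
a = [0.0] * n_code   # code-layer biases
b = [0.0] * n_in     # reconstruction biases

x = [random.random() for _ in range(n_in)]
code = encode(x, W, a)       # C(x): reduced to n_code values
x_hat = decode(code, W, b)   # approximate reconstruction of x
```

Sharing one matrix between `encode` and `decode` mirrors the unrolling step only; in the scheme described in the text the unrolled decoder weights are subsequently allowed to change during fine-tuning.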
Indeed, a deep AE transforms the original input to a better
representation over hierarchical features or representations,
with each level of hierarchy corresponding to a different level
of abstraction.

Fig. 1. The topology of our AE. The decoder parameters are obtained from the unrolled parameters of the encoder, which are then fine-tuned. $x$ and $\hat{x}$ denote the input and the reconstructed input respectively, $h_i$ are the hidden layers and $w_i$ are the parameters. $\epsilon$ denotes the modification of the parameters over the fine-tuning phase. The concept space $C(x)$, from which the features/codes are extracted, is in this scheme the output of the bottleneck layer.

AEs can have different topological structures. Figure 1 demonstrates our AE topology. The coding layer (a.k.a. bottleneck layer or discriminative layer) provides the latent representation. In general, an AE with a linear activation function in all units projects onto a subspace of the principal components of the input and does not provide a richer representation than PCA. However, it is expected that an AE with non-linear activation functions learns more useful feature detectors [28].

Mainly because each layer of our AE provides a different level of abstraction, we have used a fairly deep network. Additionally, [28], in its supplementary material¹, showed that a deep AE can have better performance than a shallow AE with the same number of parameters. Because our AE consists of many hidden layers, the back-propagated gradients of the error are very small for the lower layers.

To tackle this vanishing gradient problem, the parameters of a deep AE require a good initialization point. This initialization could significantly improve the performance of an AE in a variety of applications compared with random initialization [29, 30].

Our model has two training stages: pre-training and fine-tuning. Pre-training is an approach to find an appropriate starting point for the fine-tuning phase. The parameters obtained in the pre-training phase will be the initial weights of the fine-tuning phase.

Pre-training Stage

The two most common techniques to pre-train an AE and obtain the initialization weights are the stacked Restricted Boltzmann Machine (RBM) and the stacked de-noising AE [28, 30].

There are two reasons behind our choice of the RBM. First, because the training phase is based on a back-propagation algorithm, we believe that having two different search methods in the parameter space of the AE can be more efficient. Also, assuming that the observations are produced from a stochastic generative model, the RBM as a generative model can be an appropriate prototype to discover the feature detectors [27, 31].

The RBM (Figure 2) is an undirected graphical model that can be considered as a two-layer neural network with one layer of observable units and one layer of hidden units (i.e. feature detectors). The weighted connections are restricted to run between the hidden units and the visible units, symmetrically, and there are no connections between the units in the same layer. Depending on the function of the network, the visible and hidden units can follow different distributions such as Bernoulli, Gaussian, multinomial or binomial. We have used Bernoulli-Bernoulli units, Gaussian-Bernoulli units and Bernoulli-Gaussian units (only for the bottleneck layer RBM).

The energy function is bilinear for both binary states of visible and hidden units (i.e. Bernoulli-Bernoulli units) (see [32]):

$$E(x, h; \theta) = -\sum_{i \in \text{visible}} b_i x_i - \sum_{j \in \text{hidden}} a_j h_j - \sum_{i,j} x_i h_j w_{ij} \tag{1}$$

¹ http://www.cs.toronto.edu/~hinton/absps/science_som.pdf

where $w_{ij}$ is the weight between visible unit $x_i$ and
hidden unit $h_j$, and $b_i$ and $a_j$ are their biases. $\theta = \{W, a, b\}$ denotes the set of all network parameters. The joint distribution $p(x, h; \theta)$ is defined through the energy function:

$$p(x, h; \theta) = \frac{\exp(-E(x, h; \theta))}{Z} \tag{2}$$

where $Z = \sum_{x,h} \exp(-E(x, h; \theta))$ is the partition function, acting as a normalization constant. The marginal probability assigned to a visible vector is:

$$p(x; \theta) = \frac{\sum_{h} \exp(-E(x, h; \theta))}{Z} \tag{3}$$

Fig. 2. The schematic representation of the RBM.

Because this network is symmetric, the conditional probabilities for the Bernoulli-Bernoulli RBM are:

$$p(h_j = 1 \mid x; \theta) = \frac{\exp\left(\sum_i w_{ij} x_i + a_j\right)}{1 + \exp\left(\sum_i w_{ij} x_i + a_j\right)} = f\Big(\sum_i w_{ij} x_i + a_j\Big) \tag{4}$$

$$p(x_i = 1 \mid h; \theta) = \frac{\exp\left(\sum_j w_{ij} h_j + b_i\right)}{1 + \exp\left(\sum_j w_{ij} h_j + b_i\right)} = f\Big(\sum_j w_{ij} h_j + b_i\Big) \tag{5}$$

where $f$ is the logistic sigmoid function. Assuming the visible units have a Gaussian distribution, the energy function is represented as follows:

$$E(x, h; \theta) = \sum_{i \in \text{visible}} \frac{(x_i - b_i)^2}{2\sigma_i^2} - \sum_{j \in \text{hidden}} a_j h_j - \sum_{i,j} \frac{x_i}{\sigma_i} h_j w_{ij}, \tag{6}$$

where $\sigma_i$ is the standard deviation of the $i$th visible unit. For Gaussian-Bernoulli units the conditional probabilities become:

$$p(h_j = 1 \mid x; \theta) = \frac{\exp\left(\sum_i w_{ij} x_i + a_j\right)}{1 + \exp\left(\sum_i w_{ij} x_i + a_j\right)} = f\Big(\sum_i w_{ij} x_i + a_j\Big) \tag{7}$$

$$p(x_i \mid h; \theta) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\left(x_i - b_i - \sum_j w_{ij} h_j\right)^2}{2}\right) = \mathcal{N}\Big(\sum_j w_{ij} h_j + b_i,\ 1\Big) \tag{8}$$

To estimate the parameters of the network, maximum likelihood estimation (equivalent to minimizing the negative log-likelihood) can be used. Taking the derivative of the negative log-probability of the inputs with respect to the weights gives:

$$\frac{\partial\, (-\log p(x))}{\partial \theta_{ij}} = \frac{\partial}{\partial \theta_{ij}} \sum_{h} \big(-\log p(x, h)\big) = \langle x_i, h_j \rangle_{\text{data}} - \langle x_i, h_j \rangle_{\text{recon}} \tag{9}$$

where the angle brackets denote the expectation under the distribution given by the subscript that follows. This leads to a simple learning algorithm in which the parameter update rule is given by:

$$\Delta w_{ij} = \epsilon \big( \langle x_i, h_j \rangle_{\text{data}} - \langle x_i, h_j \rangle_{\text{recon}} \big) \tag{10}$$

where $\epsilon$ is a learning rate, $\langle x_i, h_j \rangle_{\text{data}}$ is the so-called positive phase contribution and $\langle x_i, h_j \rangle_{\text{recon}}$ is the so-called negative phase contribution. The positive phase decreases the energy of the observation and the negative phase increases the model energy. However, the computation of the expectation defined by the model is not easily tractable. [33] showed that maximizing the likelihood (or log-likelihood) of the data is equivalent to minimizing the Kullback-Leibler (KL) divergence between the data distribution and the equilibrium distribution over the visible variables. To approximate this expectation, the k-step contrastive divergence (CD-k) approximation provides surprisingly good results [33]. We used CD-1, i.e. running one step of the Gibbs sampler, which is effective enough:

$$x = x_0 \xrightarrow{\;p(h \mid x_0)\;} h_0 \xrightarrow{\;p(x \mid h_0)\;} x_1 \xrightarrow{\;p(h \mid x_1)\;} h_1 \tag{11}$$

The RBM blocks can be stacked to form the topology of the desired AE. More precisely, in the pre-training phase, the AE is trained in a greedy layer-wise fashion using individual RBMs, where the output of one trained RBM is used as input to the next upper RBM block. The individual RBM blocks are then stacked on top of each other, and from the stacked RBM blocks the topology of the AE can be generated.

Global adjusting parameters

Figure 1 shows how the weights obtained from pre-training are tied (i.e. unrolled) and used to initialise the deep AE. Globally adjusting the parameters means fine-tuning them in an iterative way. We use the back-propagation algorithm to do this iterative tuning of parameters.

The whole network is trained in this phase. We use the cross-entropy error as the loss function for this unsupervised fine-tuning phase in order to have an optimal reconstruction, given the encoding $C(x)$:

$$-\log p(x \mid C(x)) = -\sum_{i=1}^{N} \big[ x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i) \big] \tag{12}$$

where $N$ is the total number of items of training data, $x$ is the input of the AE and $\hat{x} = f_\theta(C(x))$ is the reconstructed value.
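
As a rough illustration of the CD-1 procedure in Eqs. (4), (5), (10) and (11) and the cross-entropy of Eq. (12), the following sketch trains a tiny Bernoulli-Bernoulli RBM. It is not the authors' code: the layer sizes, learning rate, toy data and the bias updates are assumptions (the text only gives the weight update), and probabilities rather than sampled states are used for the reconstruction and the second hidden pass, which is a common simplification.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_probs(x, W, a):
    """Eq. (4): p(h_j = 1 | x) for every hidden unit j."""
    return [sigmoid(sum(W[i][j] * x[i] for i in range(len(x))) + a[j])
            for j in range(len(a))]


def visible_probs(h, W, b):
    """Eq. (5): p(x_i = 1 | h) for every visible unit i."""
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))) + b[i])
            for i in range(len(b))]


def cd1_step(x0, W, a, b, lr, rng):
    """One CD-1 update along the chain x0 -> h0 -> x1 -> h1 of Eq. (11)."""
    ph0 = hidden_probs(x0, W, a)
    h0 = [1.0 if rng.random() < p else 0.0 for p in ph0]  # sample h0
    x1 = visible_probs(h0, W, b)   # reconstruction (probabilities)
    ph1 = hidden_probs(x1, W, a)
    for i in range(len(b)):
        for j in range(len(a)):
            # Eq. (10): data term minus reconstruction term
            W[i][j] += lr * (x0[i] * ph0[j] - x1[i] * ph1[j])
    for j in range(len(a)):        # bias updates: an assumption
        a[j] += lr * (ph0[j] - ph1[j])
    for i in range(len(b)):
        b[i] += lr * (x0[i] - x1[i])
    return x1


def reconstruction_cross_entropy(x, x_hat, eps=1e-12):
    """Eq. (12) for a single input vector; eps guards against log(0)."""
    return -sum(xi * math.log(xh + eps) + (1.0 - xi) * math.log(1.0 - xh + eps)
                for xi, xh in zip(x, x_hat))


rng = random.Random(0)
n_vis, n_hid = 6, 3  # toy sizes (assumption)
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hid)] for _ in range(n_vis)]
a = [0.0] * n_hid
b = [0.0] * n_vis
data = [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]]

for epoch in range(200):
    for x0 in data:
        x1 = cd1_step(x0, W, a, b, lr=0.1, rng=rng)

final_error = reconstruction_cross_entropy(data[-1], x1)
```

In a stacked setting, the hidden probabilities of one trained RBM would serve as the input data of the next RBM, as described in the text.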

IV. PERFORMANCE EVALUATION

We have selected malware classification and network-based anomaly intrusion detection to evaluate the capability of the AE for the security application domain. To evaluate the performance of the learned representation, the metrics are the accuracy, the multi-class logarithmic loss (a.k.a. logistic loss, cross-entropy loss or simply Log Loss) and the confusion matrix. The metrics measure the performance of the applied classifiers: Gaussian Naive Bayes (GaussianNB), K-nearest neighbours (K-NN), Support Vector Machines (SVMs) and Gradient Boosting (Xgboost). The classification accuracy measures the proportion of correctly predicted labels. The Log Loss function measures the uncertainty of the probabilities of a classifier compared to the true labels. That is, the function is the cross entropy between the distribution of the true labels and the predictions. For multi-class classification the Log Loss is defined as:

$$\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij}) \tag{13}$$

where $N$ is the total number of samples, $M$ is the number of labels, $\log$ is the natural logarithm, $y_{ij}$ is the true label (i.e. 1 if item $i$ belongs to class $j$ and 0 otherwise) and $p_{ij}$ is the estimated probability that item $i$ belongs to class $j$.

The confusion matrix is a table that visualizes which classes the classifier confuses with one another.

A. Experiment Setup

We used the Microsoft Malware Classification Challenge (BIG 2015) dataset hosted at Kaggle² to analyse the AE-based representation. The dataset includes hexadecimal and assembly representations of 10,868 labeled malware binary files from 9 different malware families: Ramnit³ (Ram), Lollipop⁴ (Lol), Kelihos_ver3⁵ (Kel), Vundo⁶ (Vun), Simda⁷ (Sim), Tracur⁸ (Tra), Kelihos_ver1⁹ (Kel), Obfuscator.ACY¹⁰ (Obf) and Gatak¹¹ (Gat). This dataset is imbalanced. That is, the number of malware samples for each class (family) is not equal.

We only use the hex dump files. The average size of the hex dump files is 4.8 MByte, while the biggest file is 56 MByte and the smallest is 110 KByte. A sample line of a hex file is:

00401370 8B 4C 24 04 8B D1 8D 81 D6 8D 82 F7 81 F2 60 4F

In the above sample, 00401370 indicates the offset and is not used in this paper.

² https://www.kaggle.com/c/malware-classification/data
³ https://www.symantec.com/security_response/writeup.jsp?docid=2010-011922-2056-99
⁴ https://www.microsoft.com/security/portal/threat/encyclopedia/Entry.aspx?Name=Adware:Win32/Lollipop
⁵ https://en.wikipedia.org/wiki/Kelihos_botnet
⁶ https://www.symantec.com/security_response/writeup.jsp?docid=2004-112111-3912-99
⁷ https://www.microsoft.com/security/portal/threat/Encyclopedia/entry.aspx?Name=Trojan%3AWin32%2FSimda
⁸ https://www.symantec.com/security_response/writeup.jsp?docid=2011-071504-5259-99
⁹ https://en.wikipedia.org/wiki/Kelihos_botnet
¹⁰ https://www.microsoft.com/en-us/security/portal/threat/encyclopedia/Entry.aspx?Name=VirTool:Win32/Obfuscator.ACY
¹¹ https://www.symantec.com/security_response/writeup.jsp?docid=2012-012813-0854-99

The most successful feature set for this dataset is n-gram modelling [34]. Inspired by human language modelling, an n-gram is a contiguous sequence of n items (here, bytes) from a given sequence of the binary file. We use the unigram (a.k.a. 1-gram) probabilistic model.

With the unigram feature representation, each item in a byte sequence can have one of 256 different values; thus, the dictionary size is 256 (00 to FF in hex). Instead of using the raw frequency of an item in a hex file (i.e. the number of times $f_t$ that item $t$ occurs in the hex file), we used the logarithmically scaled frequency as follows:

Unigram$(t) = 1 + \log(f_t)$ if $f_t > 0$, and zero otherwise.

In short, each hex file turns into a vector of size 256, which is fed into the AE.

To keep the same setting as our baselines, we used 5-fold cross validation. Although the dataset is imbalanced, we randomly chose an equal proportion of each class for each fold.

For the anomaly intrusion detection task, we used the NSL-KDD dataset¹², which consists of selected records of the KDD-CUP99 dataset. NSL-KDD was collected by analysing incoming network traffic and it has been widely used to develop network-based intrusion detection systems (NIDs).

¹² http://nsl.cs.unb.ca/NSL-KDD/

NSL-KDD includes 125,973 train and 22,544 test records labeled either normal or anomaly (i.e. a network attack). The anomaly class has four categories; however, to distinguish normal from anomalous traffic and obtain a detection system, we merge all four categories into one class label. In short, the two-class classifiers are trained to distinguish a normal class from an anomaly class.

Each sample in NSL-KDD is represented by 41 features. The features are categorized into four feature sets: basic features (from the TCP/IP connection without inspecting the payload), content features (accessing the payload of the TCP packet), time-based traffic features (statistics in a 2-second time window) and host-based traffic features (within a historical window). Each of the 41 features is presented either as a continuous value or as a symbolic sign. We used the continuous-valued features intact, while we replaced symbolic features with a one-hot encoding. In the one-hot encoding each symbol is replaced with the state of the symbol.

To have an efficient topology for both experiments, we set our deep AE with a 150-90-50-10-50-90-150 structure. Here, 150 is the size of the first and last hidden layers and 10 is the size of the AE's bottleneck layer. The ten code units provide a ten-dimensional concept space to represent the feature space for both network-based intrusion attacks and malware families. The 10 units in the bottleneck of the AE are a constraint by which the AE is compelled to learn useful features. The activation function of all units is sigmoid except for the bottleneck layer, which is linear, in both pre-training and fine-tuning phases.

For the layer-wise pre-training phase, that is, training four RBMs with 150, 90, 50 and 10 units, we run one step of the
Gibbs sampler with 200 epochs [35]. In the fine-tuning phase, the whole network is trained using mini-batch Conjugate Gradient with line search, for 1500 epochs [36]. The loss function in the training phase is the cross-entropy error. We have used the same hyper-parameters, obtained on the basis of the practical guides presented in [31, 37], in both experiments.

Extracted codes from the AE are fed into the classifiers. Although there is no general agreement in the literature about the input of classifiers, it is beneficial to apply a linear transformation such as a whitening transformation. Statistical whitening decorrelates the features, giving them an identity covariance matrix. To do so, we used the Principal Component Analysis (PCA) whitening algorithm [8].

We used the sklearn library (for Python 2.7.12) to implement the classifiers. Assuming the features have a Gaussian distribution, we used Gaussian Naive Bayes. The hyper-parameters of K-NN were set using a grid search to achieve maximum accuracy. This grid search ranges from 1 to 20 for n_neighbors and over either 'uniform' or 'distance' for weights. We also used a grid search to tune the C and gamma parameters of the SVM (with a radial basis function kernel) for maximum accuracy. This grid search ranges from $10^{-4}$ to $10^{4}$ for both the C and gamma parameters of the SVM. Xgboost has the same hyper-parameters as the first-place winner of the Kaggle competition [34].

We used the same topology for the AE and the same other conditions of the experiments throughout our study to make the results more comparable.

B. Results and Discussion

Having the pre-processed data, we conducted experiments on two different cyber security threats: malware classification and network-based anomaly intrusion detection.

1) Malware classification:
Table I shows how well the AE can provide a discriminative representation. As expected, among the classifiers, Gaussian Naive Bayes gains the highest benefit from the latent representation. Indeed, the AE can enrich the representation and embed the relations between the original features into the concept space, so that the assumption of independent features does not drastically reduce the accuracy of Naive Bayes. Additionally, because of applying a Bernoulli-Gaussian RBM for the initialization of the bottleneck layer, the Gaussian Naive Bayes classifier can perform well.

TABLE I
THE ACCURACY OF DIFFERENT CLASSIFIERS FOR THE ORIGINAL REPRESENTATION (UNIGRAM) AND LATENT REPRESENTATION GENERATED USING THE AE.

Classifiers    Accuracy with Unigram representation    Accuracy with AE representation
Naive Bayes    66.2% (±4.45e-05)                       80.4% (±5.73e-04)
K-NN           94.0% (±3.90e-05)                       96.0% (±3.26e-05)
SVM            95.6% (±1.22e-05)                       96.3% (±1.13e-04)
Xgboost        98.2% (±6.65e-06)                       95.7% (±2.53e-05)

Both K-NN and SVM can also generate more accurate predictions. For K-NN and SVM, the error rate improves by 33% and 16% respectively. However, Xgboost handles the classification task better with the original unigram than with the latent representation generated by the AE.

Table I also shows the possible variance of the accuracy for all the classifiers. The variance in all situations is small and, more importantly, without any overlap with each other.

In addition to the accuracy, the Log Loss metric has been used to analyse the classification performance for Xgboost. The result is presented in Table III. In contrast with the accuracy, the Log Loss reduces and is lower than that of the other models in which Unigram has been used as the feature.

TABLE III
THE LOG LOSS OF XGBOOST CLASSIFIER AND 2 BASELINE MODELS

Models                           Log Loss
Xgboost (AE)                     0.0748
Deep Learning H20 models [38]    0.1810
1G [5]                           0.0764

One of the key evaluations is shown in Table II. The Ram and Tra families are confused with the Gat family more than with other families. This is not due to the different number of samples in each class. Interestingly, Gat is a trojan that opens a backdoor on the compromised computer, while Ram is a worm which also functions as a backdoor and Tra is a trojan. This shows that the AE provides a clustered representation in which similar families are more likely to be predicted in place of one another than dissimilar families. Figure 3 illustrates the whitened features in two dimensions, visualized by t-SNE [39].

TABLE II
THE CONFUSION MATRIX OF NAIVE BAYES CLASSIFIER WITH AE-BASED FEATURES AS THE INPUT OF THE CLASSIFIER. n IS THE NUMBER OF SAMPLES IN THE DATASET

Family Name          Ram     Lol     Kel ver3  Vun     Sim     Tra     Kel ver1  Obf     Gat
Ram (n = 1541)       45.13%  06.82%  00.19%    01.04%  00.39%  05.33%  00.32%    10.39%  30.39%
Lol (n = 2478)       04.38%  91.12%  00.00%    00.24%  00.04%  00.16%  01.22%    01.87%  00.97%
Kel ver3 (n = 2942)  00.00%  10.82%  89.05%    00.00%  00.00%  00.07%  00.00%    00.00%  00.07%
Vun (n = 0475)       00.00%  00.21%  00.00%    78.11%  06.11%  06.95%  01.47%    06.95%  00.21%
Sim (n = 0042)       00.00%  00.00%  00.00%    00.00%  87.50%  05.00%  00.00%    07.50%  00.00%
Tra (n = 0751)       04.00%  01.60%  01.47%    06.27%  05.73%  62.27%  01.47%    06.67%  10.53%
Kel ver1 (n = 0398)  00.25%  01.52%  00.51%    00.00%  00.00%  03.80%  93.92%    00.00%  00.00%
Obf (n = 1228)       03.10%  03.43%  00.65%    01.31%  00.90%  03.10%  01.39%    84.65%  01.47%
Gat (n = 1013)       00.89%  00.10%  01.68%    00.00%  00.20%  03.47%  00.20%    07.03%  86.44%

Fig. 3. The output of the coding layer that is applied to the classifiers.

Another important and mutual confusion of classes is between the Vun, Sim and Tra classes. All three classes are trojans. This confusion again provides grounds for understanding that the three classes of malware are indeed from one broader malware family, and possibly share the same pattern.

The important thing to notice is that, despite the imbalanced classes, Naive Bayes can provide a fairly good prediction across all the malware families, even for the Sim family with only 42 samples.

2) Network-based intrusion detection:
The accuracy of classifiers for the intrusion detection task also improves using the AE-based generated features compared to the original features (Figure 4). Similar to the
classification task, the performance of the Gaussian Naive
Bayes classifier improves significantly. Although K-NN cannot
predict labels with the latent features as accurately as with the
original features, SVM and Xgboost perform much better.
It is to be expected that not all classifiers gain performance
from a representation derived from a particular dataset.
We think this might be because of the distribution of the original
data. However, the AE can provide a much more discriminative
representation, with which most classifiers achieve higher
accuracy than with the original features as the input of the
classifier.

Fig. 4. The accuracy of different classifiers for the original feature set and
latent features generated using AE.
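For readers unfamiliar with the Log Loss metric reported in Table III, the following sketch shows one common way to compute it, assuming the standard multiclass definition (the mean negative log of the probability assigned to the true class). The function name, the clipping constant and the toy inputs are assumptions for illustration; the paper does not describe its exact implementation.

```python
import math


def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    true_labels: list of integer class indices.
    predicted_probs: list of per-sample probability lists, one entry per class.
    eps: clipping bound that keeps log() away from 0 (an assumed convention).
    """
    if len(true_labels) != len(predicted_probs) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        p = min(max(probs[label], eps), 1.0 - eps)
        total += -math.log(p)
    return total / len(true_labels)


# Toy example: two samples, two classes, maximally uncertain predictions.
uncertain_loss = multiclass_log_loss([0, 1], [[0.5, 0.5], [0.5, 0.5]])
```

Unlike accuracy, this metric penalizes confident wrong predictions heavily, so two classifiers with similar accuracy can differ noticeably in Log Loss.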
Table IV shows the comparison between our model and
the other models. To allow a fair comparison,
single classifier models have been selected. AE-based features
with a relatively simple classifier (i.e. Gaussian Naive
Bayes) can outperform other single classifier models and even
a complex algorithm proposed in [40]. Although the second
algorithm proposed in [40] performs better than our scheme,
that baseline model is computationally more expensive, being built
up from different modules of classifiers. It also
requires a classifier to be re-trained. In contrast, the
proposed scheme of this paper uses only one classifier with
one training stage.

| Models | KDDTest+ |
|---|---|
| Gaussian Naive Bayes Tree [7] | 82.02% |
| Fuzzy classifier [41] | 82.74% |
| Our Approach | 83.34% |
| Decision Tree [42] | 80.14% |
| Proposed algorithm-(Experiment-1) [40] | 82.41% |
| Proposed algorithm-(Experiment-2) [40] | 84.12% |

TABLE IV
THE ACCURACY OF OUR PROPOSED MODEL AND THE OTHER MODELS

V. CONCLUSION
In this paper we proposed an unsupervised feature learning
approach for malware classification and network-based
anomaly detection using an auto-encoder (AE). Compared to
previous work, the proposed scheme uses a unique training
phase and topology. We showed how a single model with
the same training model and topology can be quite effective
for both malware classification and network-based anomaly
intrusion detection. This is helpful when it comes to designing
a shared embedded security analysis tool for different systems.
For malware classification, our scheme uses raw byte features
of portable executable files, without disassembling, and does
not require any preprocessing to select features. For network
intrusion detection, the proposed scheme also provides a
discriminative concept space to distinguish between a normal
flow of network data and an anomalous flow. The model
produces a fixed-size vector of 10 features for both classification and
detection of attacks. Hence our scheme uses the minimum
number of features compared to other state-of-the-art algorithms.
This makes the model more computationally efficient
for real time protection. In addition to the limited number
of original features, the proposed scheme generates a small
set of latent features. The resulting rich and small latent
representation makes it practical for implementation in small
devices such as the Internet of Things.

REFERENCES
[1] A. Patel, Q. Qassim, and C. Wills, “A survey of intrusion detection and prevention systems,” Information Management & Computer Security, vol. 18, no. 4, pp. 277–290, 2010.
[2] P. Garcia-Teodoro, J. Diaz-Verdejo, G. Maciá-Fernández, and E. Vázquez, “Anomaly-based network intrusion detection: Techniques, systems and challenges,” computers & security, vol. 28, no. 1, pp. 18–28, 2009.
[3] P. Mishra, E. S. Pilli, V. Varadharajan, and U. Tupakula, “Intrusion detection techniques in cloud environment: A survey,” Journal of Network and Computer Applications, 2016.
[4] K. Rieck, P. Trinius, C. Willems, and T. Holz, “Automatic analysis of malware behavior using machine learning,” Journal of Computer Security, vol. 19, no. 4, pp. 639–668, 2011.
[5] M. Ahmadi, D. Ulyanov, S. Semenov, M. Trofimov, and G. Giacinto, “Novel feature extraction, selection and fusion for effective malware family classification,” in Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy. ACM, 2016, pp. 183–194.
[6] K. Wang and S. J. Stolfo, “Anomalous payload-based network intrusion detection,” in International Workshop on Recent Advances in Intrusion Detection. Springer, 2004, pp. 203–222.
[7] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A detailed analysis of the kdd cup 99 data set (2009),” in Proceedings of the 2009 IEEE Symposium on Computational Intelligence in Security and Defense Applications (CISDA 2009), 2009.
[8] A. Kessy, A. Lewin, and K. Strimmer, “Optimal whitening and decorrelation,” arXiv preprint arXiv:1512.00809, 2015.
[9] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
[10] R. Salakhutdinov and G. Hinton, “Semantic hashing,” International Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, 2009.
[11] R. Salakhutdinov, “Learning deep generative models,” Annual Review of Statistics and Its Application, vol. 2, pp. 361–385, 2015.
[12] A. Narayanan, M. Chandramohan, L. Chen, Y. Liu, and S. Saminathan, “subgraph2vec: Learning distributed representations of rooted sub-graphs from large graphs,” arXiv preprint arXiv:1606.08928, 2016.
[13] S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, and T. Yagi, “Malware detection with deep neural network using process behavior,” in Computer Software and Applications Conference (COMPSAC), 2016 IEEE 40th Annual, vol. 2. IEEE, 2016, pp. 577–582.
[14] L. Xu, D. Zhang, N. Jayasena, and J. Cavazos, “Hadm: Hybrid analysis for detection of malware.”
[15] Y. Wang, W.-d. Cai, and P.-c. Wei, “A deep learning approach for detecting malicious javascript code,” Security and Communication Networks, 2016.
[16] O. E. David and N. S. Netanyahu, “Deepsign: Deep learning for automatic malware signature generation and classification,” in 2015 International Joint Conference on Neural Networks (IJCNN). IEEE, 2015, pp. 1–8.
[17] X. Wang and S. M. Yiu, “A multi-task learning model for malware classification with useful file access pattern from api call sequence,” arXiv preprint arXiv:1610.05945, 2016.
[18] R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and A. Thomas, “Malware classification with recurrent networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 1916–1920.
[19] W. Huang and J. W. Stokes, “Mtnet: a multi-task neural network for dynamic malware classification,” in Detection of Intrusions and Malware, and Vulnerability Assessment: 13th International Conference, DIMVA 2016, San Sebastián, Spain, July 7-8, 2016, Proceedings. Springer, 2016, pp. 399–418.
[20] A. Javaid, Q. Niyaz, W. Sun, and M. Alam, “A deep learning approach for network intrusion detection system,” in Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS). ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2016, pp. 21–26.
[21] L. Bontemps, J. McDermott, N.-A. Le-Khac et al., “Collective anomaly detection based on long short-term memory recurrent neural networks,” in International Conference on Future Data and Security Engineering. Springer, 2016, pp. 141–152.
[22] T. Nolle, A. Seeliger, and M. Mühlhäuser, “Unsupervised anomaly detection in noisy business process event logs using denoising autoencoders,” in International Conference on Discovery Science. Springer, 2016, pp. 442–456.
[23] S. Potluri and C. Diedrich, “Accelerated deep neural networks for enhanced intrusion detection system,” in Emerging Technologies and Factory Automation (ETFA), 2016 IEEE 21st International Conference on. IEEE, 2016, pp. 1–8.
[24] J. An and S. Cho, “Variational autoencoder based anomaly detection using reconstruction probability,” 2015.
[25] M. Nicolau, J. McDermott et al., “A hybrid autoencoder and density estimation model for anomaly detection,” in International Conference on Parallel Problem Solving from Nature. Springer, 2016, pp. 717–726.
[26] T. Ma, F. Wang, J. Cheng, Y. Yu, and X. Chen, “A hybrid spectral clustering and deep neural network ensemble algorithm for intrusion detection in sensor networks,” Sensors, vol. 16, no. 10, p. 1701, 2016.
[27] G. E. Hinton and R. S. Zemel, “Minimizing description length in an unsupervised neural network,” Preprint, 1997.
[28] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006a.
[29] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006b.
[30] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010.
[31] G. Hinton, “A practical guide to training restricted boltzmann machines,” Momentum, vol. 9, no. 1, p. 926, 2010.
[32] J. J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” Proceedings of the national academy of sciences, vol. 79, no. 8, pp. 2554–2558, 1982.
[33] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural computation, vol. 14, no. 8, pp. 1771–1800, 2002.
[34] X. Wang, J. Liu, and X. Chen, “First place team: Say no to overfitting,” 2015.
[35] M. A. Carreira-Perpinan and G. E. Hinton, “On contrastive divergence learning,” in Proceedings of the tenth international workshop on artificial intelligence and statistics. Citeseer, 2005, pp. 33–40.
[36] J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, Q. V. Le, and A. Y. Ng, “On optimization methods for deep learning,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 265–272.
[37] Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 437–478.
[38] Malware classification: Distributed data mining with spark. http://msan-vs-malware.com/. Accessed: 2016-11-11.
[39] L. Van Der Maaten, “Accelerating t-sne using tree-based algorithms.” Journal of machine learning research, vol. 15, no. 1, pp. 3221–3245, 2014.
[40] R. A. R. Ashfaq, X.-Z. Wang, J. Z. Huang, H. Abbas, and Y.-L. He, “Fuzziness based semi-supervised learning approach for intrusion detection system,” Information Sciences, 2016.
[41] P. Krömer, J. Platoš, V. Snášel, and A. Abraham, “Fuzzy classification by evolutionary algorithms,” in Systems, Man, and Cybernetics (SMC), 2011 IEEE International Conference on. IEEE, 2011, pp. 313–318.
[42] M. Mohammadi, B. Raahemi, A. Akbari, and B. Nassersharif, “Class dependent feature transformation for intrusion detection systems,” in 2011 19th Iranian Conference on Electrical Engineering. IEEE, 2011, pp. 1–6.
