yousefi-azar2017
Security Applications
Mahmood Yousefi-Azar, Vijay Varadharajan, Len Hamey and Uday Tupakula
Department of Computing, Faculty of Science and Engineering, Macquarie University, Sydney, NSW, Australia.
Email: mahmood.yousefiazar@hdr.mq.edu.au, vijay.varadharajan@mq.edu.au
len.hamey@mq.edu.au, udaya.tupakula@mq.edu.au
Abstract: This paper presents a novel feature learning model for cyber security tasks. We propose to use Auto-encoders (AEs), as a generative model, to learn latent representation of different feature sets. We show how well the AE is capable of automatically learning a reasonable notion of semantic similarity among input features. Specifically, the AE accepts a feature vector, obtained from cyber security phenomena, and extracts a code vector that captures the semantic similarity between the feature vectors. This similarity is embedded in an abstract latent representation. Because the AE is trained in an unsupervised fashion, the main part of this success comes from the appropriate original feature set that is used in this paper. It can also provide more discriminative features in contrast to other feature engineering approaches. Furthermore, the scheme can reduce the dimensionality of the features, thereby significantly minimising the memory requirements. We selected two different cyber security tasks: network-based anomaly intrusion detection and malware classification. We have analysed the proposed scheme with various classifiers using publicly available datasets for network anomaly intrusion detection and malware classification. Several appropriate evaluation metrics show improvement compared to prior results.

I. INTRODUCTION

In the cyber space, security requires a wide range of technologies and processes to protect the range of devices from computers, to smart phones, to networks, to Internet of Things, to users and, importantly, data from intrusion, unauthorized access and destruction. To meet these requirements, cyber security defensive technologies include two conventional systems, namely network defence systems and host defence systems. Each of these systems is composed of different technologies and various layers such as intrusion detection, firewall and antivirus [1].

Intrusion Detection Systems (IDSs), in particular network-based IDSs, are special-purpose algorithms and tools to detect anomaly attacks on a networked system, and help determine and identify unauthorized usage, duplication, alteration as well as any destruction of data. Depending on the detection techniques, IDSs can be categorized into different approaches such as signature-based detection, anomaly-based detection and behaviour-based detection. The focus of our study is on network-based anomaly intrusion detection systems.

Machine learning approaches are being widely used for anomaly intrusion detection [2, 3]. The schemes are able to detect patterns of known and unknown attacks in supervised, unsupervised or semi-supervised training schemes. Normally, using labeled data in supervised schemes could result in better performance compared to unsupervised learning models. The utilization of labeled data, alongside the large number of cyber criminals attempting to gain access to the data, has motivated researchers to use semi-supervised algorithms that have the advantages of both supervised and unsupervised schemes. The application of the unsupervised scheme proposed in this paper is relevant for both supervised classifiers and semi-supervised algorithms.

Malicious software is intentionally developed to target computer systems or networks for different purposes such as stealing information, distribution of spam messages or even destruction of messages. Malware classically refers to malicious software such as computer viruses, Internet worms, trojans, adware and ransomware. Identification and removal of malware are a significant part of both network and host defence systems. Detection, clustering and classification of malware are major threads in cyber security and form important applications of malware analysis.

Malware analysis using machine learning has been receiving much attention in recent years, both in academia and in industry [1, 3, 4]. The main reason behind this is the capability to automatically identify malicious software compared to more tedious manual techniques. Another major reason behind the application of machine learning techniques for malware analysis is the emergence of zero-day malware, whose fingerprints or signatures are unknown to the software developers.

In this study, our aim is to consider both malware classification and malware detection. To do both detection and classification, we introduce a technique that achieves a richer feature space using deep auto-encoders (AEs). The AEs, as automatic feature learning models, can provide more discriminative features in contrast to other feature engineering approaches. In the literature, a wide range of feature sets have been used to identify anomaly intrusions and malware [5, 6, 7]. Examples of feature sets in the network-based anomaly intrusion detection domain include network flow, source IP and port, destination IP and protocols. For malware analysis, the number of bytes, the entropy of the binary file, system calls and operation codes of assembly files have been commonly used. The AEs can learn the concept space from
we have used AE. Our motivation is that an AE offers a rich
representation from which it should be possible to re-generate
the original representation. This assumption is not task-specific
and has motivated us to apply AE for different cyber security
tasks.
A. Autoencoders
An AE (Figure 1) is a feed-forward network that learns
to reconstruct its input x [27]. The AE is trained to encode the
input x, using a set of recognition weights, into a feature space
C(x). The features (codes) C(x) are then converted into an
approximate reconstruction x̂ of the input using a set of
generative weights. The generative weights are obtained first
by unrolling the weights of the encoder and then by a
fine-tuning phase. Through the encoder (i.e. the mapping to a
hidden representation), the dimensionality of x is reduced to the
given number of codes. The codes are then mapped to x̂ through
the decoder. Because no labeled data is required in the training
process, the algorithm is unsupervised.
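As an illustration of the encode/decode path described above, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes a 256-dimensional input and the 150-90-50-10 encoder sizes reported in the experiment setup, sigmoid units with a linear bottleneck, and a decoder that reuses the transposed encoder weights (the "unrolled" state before any fine-tuning). The function names, the random initialisation and the zero biases are hypothetical choices made only for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_autoencoder(layer_sizes, rng):
    """Illustrative random recognition weights; in the paper these come from pre-training."""
    weights = [rng.normal(0.0, 0.1, size=(n_in, n_out))
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    enc_biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]
    dec_biases = [np.zeros(n_in) for n_in in layer_sizes[:-1]]
    return weights, enc_biases, dec_biases

def encode(x, weights, enc_biases):
    """Map inputs to codes C(x): sigmoid hidden layers, linear bottleneck (assumed)."""
    h = x
    last = len(weights) - 1
    for k, (W, a) in enumerate(zip(weights, enc_biases)):
        z = h @ W + a
        h = z if k == last else sigmoid(z)
    return h

def decode(code, weights, dec_biases):
    """Map codes back to a reconstruction using the transposed (unrolled) encoder weights."""
    h = code
    for W, b in zip(reversed(weights), reversed(dec_biases)):
        h = sigmoid(h @ W.T + b)
    return h

rng = np.random.default_rng(0)
layer_sizes = [256, 150, 90, 50, 10]      # input size 256 is an assumption (unigram features)
weights, enc_b, dec_b = init_autoencoder(layer_sizes, rng)
x = rng.random((4, 256))                  # four synthetic feature vectors in [0, 1)
code = encode(x, weights, enc_b)
x_hat = decode(code, weights, dec_b)
print(code.shape, x_hat.shape)            # (4, 10) (4, 256)
```

Sharing the transposed weights keeps the sketch short; after fine-tuning, the encoder and decoder weights would generally no longer be exact transposes of each other.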
Indeed, a deep AE transforms the original input to a better
representation over hierarchical features or representations,
with each level of hierarchy corresponding to a different level
of abstraction.

Fig. 1. The topology of our AE. The decoder parameters are obtained from unrolled parameters of the encoder, which are then fine-tuned. x and x̂ denote the input and reconstructed input respectively. h_i are the hidden layers and w_i are the parameters. ε denotes the modification of parameters over the fine-tuning phase. The concept space in which features/codes C(x) are extracted is, in this scheme, the output of the bottleneck layer.

AEs can have different topological structures. Figure 1 demonstrates our AE topology. The coding layer (a.k.a. bottleneck layer or discriminative layer) provides the latent representation. In general, an AE with a linear activation function in all units projects onto a subspace of the principal components of the input and does not give a richer representation than PCA. However, an AE with non-linear activation functions is expected to learn more useful feature detectors [28].

Mainly because each layer of our AE provides a different level of abstraction, we have used a fairly deep network. Additionally, [28] showed in its supplementary material¹ that a deep AE can have better performance than a shallow AE with the same number of parameters. Because our AE consists of many hidden layers, the back-propagated gradients of the error are very small for the lower layers.

¹ http://www.cs.toronto.edu/~hinton/absps/science_som.pdf

To tackle the vanishing gradient problem, the parameters of a deep AE require a good initialization point. This initialization could significantly improve the performance of an AE in a variety of applications compared with random initialization [29, 30].

Our model has two training stages: pre-training and fine-tuning. Pre-training is an approach to find an appropriate starting point for the fine-tuning phase. The parameters obtained in the pre-training phase will be the initial weights of the fine-tuning phase.

Pre-training Stage

The two most common techniques to pre-train an AE and obtain the initialization weights are the stacked Restricted Boltzmann Machine (RBM) and the stacked de-noising AE [28, 30]. There are two reasons behind our choice of the RBM. First, because the training phase is based on a back-propagation algorithm, we believe that having two different search methods in the parameter space of the AE can be more efficient. Also, assuming that the observations are produced from a stochastic generative model, the RBM as a generative model can be an appropriate prototype to discover the feature detectors [27, 31].

The RBM (Figure 2) is an undirected graphical model that can be considered as a two-layer neural network with one layer of observable units and one layer of hidden units (i.e. feature detectors). The weighted connections are restricted to those between the hidden units and the visible units, symmetrically, and there are no connections between units in the same layer. Depending on the function of the network, the visible and hidden units could be considered to follow different distributions such as Bernoulli, Gaussian, multinomial or binomial. We have used Bernoulli-Bernoulli units, Gaussian-Bernoulli units and Bernoulli-Gaussian units (only for the bottleneck layer RBM).

The energy function is bilinear for binary states of both visible and hidden units (i.e. Bernoulli-Bernoulli units) (see [32]):

$$E(x, h; \theta) = -\sum_{i \in \text{visible}} b_i x_i - \sum_{j \in \text{hidden}} a_j h_j - \sum_{i,j} x_i h_j w_{ij} \tag{1}$$

where $w_{ij}$ is the weight between visible unit $x_i$ and
hidden unit $h_j$, and $b_i$ and $a_j$ are their biases. $\theta = \{W, a, b\}$ denotes the set of all network parameters. The joint distribution $p(x, h; \theta)$ is defined through the energy function as:

$$p(x, h; \theta) = \frac{\exp(-E(x, h; \theta))}{Z} \tag{2}$$

where $Z = \sum_{x,h} \exp(-E(x, h; \theta))$ is the partition function, which acts as a normalization constant. The marginal probability assigned to a visible vector is:

$$p(x; \theta) = \frac{\sum_{h} \exp(-E(x, h; \theta))}{Z} \tag{3}$$

Fig. 2. The schematic representation of the RBM.

Because this network is symmetric, the conditional probabilities for the Bernoulli-Bernoulli RBM are:

$$p(h_j = 1 \mid x; \theta) = \frac{\exp\left(\sum_i w_{ij} x_i + a_j\right)}{1 + \exp\left(\sum_i w_{ij} x_i + a_j\right)} = f\left(\sum_i w_{ij} x_i + a_j\right) \tag{4}$$

$$p(x_i = 1 \mid h; \theta) = \frac{\exp\left(\sum_j w_{ij} h_j + b_i\right)}{1 + \exp\left(\sum_j w_{ij} h_j + b_i\right)} = f\left(\sum_j w_{ij} h_j + b_i\right) \tag{5}$$

where $f$ is the logistic sigmoid function. Assuming the visible units have a Gaussian distribution, the energy function is represented as follows:

$$E(x, h; \theta) = \sum_{i \in \text{visible}} \frac{(x_i - b_i)^2}{2\sigma_i^2} - \sum_{j \in \text{hidden}} a_j h_j - \sum_{i,j} \frac{x_i}{\sigma_i} h_j w_{ij} \tag{6}$$

where $\sigma_i$ is the standard deviation of the $i$th visible unit. For Gaussian-Bernoulli units, the conditional probabilities become:

$$p(h_j = 1 \mid x; \theta) = \frac{\exp\left(\sum_i w_{ij} x_i + a_j\right)}{1 + \exp\left(\sum_i w_{ij} x_i + a_j\right)} = f\left(\sum_i w_{ij} x_i + a_j\right) \tag{7}$$

$$p(x_i \mid h; \theta) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\left(x_i - b_i - \sum_j w_{ij} h_j\right)^2}{2}\right) = \mathcal{N}\left(\sum_j w_{ij} h_j + b_i,\ 1\right) \tag{8}$$

To estimate the parameters of the network, maximum likelihood estimation (equivalent to minimizing the negative log-likelihood) can be used. Taking the derivative of the log-probability of the inputs with respect to the weights (i.e. the negative of the gradient of the negative log-likelihood) gives:

$$\frac{\partial \log p(x)}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \log \sum_{h} p(x, h) = \langle x_i, h_j \rangle_{\text{data}} - \langle x_i, h_j \rangle_{\text{recon}} \tag{9}$$

where the angle brackets denote the expectation under the distribution given by the subscript that follows. This leads to a simple learning algorithm in which the parameter update rule is:

$$\Delta w_{ij} = \epsilon \left( \langle x_i, h_j \rangle_{\text{data}} - \langle x_i, h_j \rangle_{\text{recon}} \right) \tag{10}$$

where $\epsilon$ is a learning rate, $\langle x_i, h_j \rangle_{\text{data}}$ is the so-called positive phase contribution and $\langle x_i, h_j \rangle_{\text{recon}}$ is the so-called negative phase contribution. The positive phase decreases the energy of the observation and the negative phase increases the model energy. However, the expectation defined by the model is not easily tractable. [33] showed that maximizing the likelihood or log-likelihood of the data is equivalent to minimizing the Kullback-Leibler (KL) divergence between the data distribution and the equilibrium distribution over the visible variables. To approximate this expectation, the k-step contrastive divergence (CD-k) approximation works surprisingly well [33]. We used CD-1, which runs one step of the Gibbs sampler and is effective enough:

$$x = x^0 \xrightarrow{\ p(h \mid x^0)\ } h^0 \xrightarrow{\ p(x \mid h^0)\ } x^1 \xrightarrow{\ p(h \mid x^1)\ } h^1 \tag{11}$$

The RBM blocks can be stacked to form the topology of the desired AE. More precisely, in the pre-training phase, the AE is trained in a greedy layer-wise fashion using individual RBMs, where the output of one trained RBM is used as the input to the next upper RBM block. The individual RBM blocks are then stacked on top of each other, and from the stacked RBM blocks the topology of the AE can be generated.

Global adjusting parameters

Figure 1 shows how the weights obtained from the pre-training are tied (i.e. unrolled) and used to initialise the deep AE. Globally adjusting the parameters means fine-tuning them in an iterative way. We use the back-propagation algorithm for this iterative tuning of parameters.

The whole network is trained in this phase. We use the cross-entropy error as the loss function for this unsupervised fine-tuning phase, so as to obtain an optimal reconstruction given the encoding C(x):

$$-\log p(x \mid C(x)) = -\frac{1}{N} \sum_{i=1}^{N} \left[ x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i) \right] \tag{12}$$

where $N$ is the total number of items of training data, $x$ is the input of the AE, and $\hat{x} = f_\theta(C(x))$ is the reconstructed value.
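The CD-1 update of equations (10) and (11), together with a reconstruction cross-entropy in the spirit of equation (12), can be sketched for a single Bernoulli-Bernoulli RBM as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' code: it samples binary hidden states in the positive phase but uses probabilities (not samples) for the reconstruction and the negative phase, which is a common practical simplification; the batch size, learning rate, layer sizes, synthetic data and function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(x0, W, a, b, lr, rng):
    """One CD-1 update on a mini-batch x0 of shape (batch, n_visible).

    W has shape (n_visible, n_hidden); a are hidden biases, b are visible biases.
    """
    # Positive phase: p(h | x0) as in equation (4), then sample binary h0.
    ph0 = sigmoid(x0 @ W + a)
    h0 = (rng.random(ph0.shape) < ph0).astype(x0.dtype)
    # One Gibbs step: reconstruct x1 from h0 (equation (5)), then p(h | x1).
    px1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(px1 @ W + a)
    # Update rule (10): data statistics minus reconstruction statistics.
    n = x0.shape[0]
    W = W + lr * (x0.T @ ph0 - px1.T @ ph1) / n
    a = a + lr * (ph0 - ph1).mean(axis=0)
    b = b + lr * (x0 - px1).mean(axis=0)
    return W, a, b, px1

def reconstruction_cross_entropy(x, x_hat, eps=1e-12):
    """Cross-entropy between x and x_hat, summed over units and averaged over the batch."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)
    per_item = np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat), axis=1)
    return float(-np.mean(per_item))

rng = np.random.default_rng(0)
n_visible, n_hidden = 256, 150            # first RBM of the stack (sizes assumed)
W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
a = np.zeros(n_hidden)
b = np.zeros(n_visible)
x0 = (rng.random((32, n_visible)) < 0.3).astype(float)   # synthetic binary batch
for _ in range(10):
    W, a, b, x1 = cd1_step(x0, W, a, b, lr=0.1, rng=rng)
loss = reconstruction_cross_entropy(x0, x1)
```

In a greedy layer-wise stack, the hidden probabilities of one trained RBM would serve as the visible data for the next RBM, as described above.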
IV. PERFORMANCE EVALUATION

We have selected malware classification and network-based anomaly intrusion detection to evaluate the capability of the AE for the security application domain. To evaluate the performance of the learned representation, the metrics are the accuracy, the multi-class logarithmic loss (a.k.a. logistic loss, cross-entropy loss or simply Log Loss) and the confusion matrix. The metrics measure the performance of the applied classifiers: Gaussian Naive Bayes (GaussianNB), K-nearest neighbours (K-NN), Support Vector Machines (SVMs) and Gradient Boosting (Xgboost). The classification accuracy is the proportion of correctly predicted labels. The Log Loss function measures the uncertainty of the probabilities of a classifier compared to the true labels. That is, the function is the cross entropy between the distribution of the true labels and the predictions. For multi-class classification the Log Loss is defined as:

$$\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij}) \tag{13}$$

where $N$ is the total number of samples, $M$ is the number of labels, $\log$ is the natural logarithm, $y_{ij}$ is the true label (i.e. 0 or 1) indicating whether item $i$ belongs to class $j$, and $p_{ij}$ is the estimated probability that item $i$ belongs to class $j$.

The presented confusion matrix is a table that visualizes which classes the classifier confuses.

A. Experiment Setup

We used the Microsoft Malware Classification Challenge (BIG 2015) dataset hosted at Kaggle² to analyse the AE-based representation. The dataset includes hexadecimal and assembly representations of 10868 labeled malware binary files from 9 different malware families: Ramnit³ (Ram), Lollipop⁴ (Lol), Kelihos ver3⁵ (Kel), Vundo⁶ (Vun), Simda⁷ (Sim), Tracur⁸ (Tra), Kelihos ver1⁹ (Kel), Obfuscator.ACY¹⁰ (Obf), Gatak¹¹ (Gat). This dataset is imbalanced. That is, the number of malware samples for each class (family) is not equal.

² https://www.kaggle.com/c/malware-classification/data
³ https://www.symantec.com/security_response/writeup.jsp?docid=2010-011922-2056-99
⁴ https://www.microsoft.com/security/portal/threat/encyclopedia/Entry.aspx?Name=Adware:Win32/Lollipop
⁵ https://en.wikipedia.org/wiki/Kelihos_botnet
⁶ https://www.symantec.com/security_response/writeup.jsp?docid=2004-112111-3912-99
⁷ https://www.microsoft.com/security/portal/threat/Encyclopedia/entry.aspx?Name=Trojan%3AWin32%2FSimda
⁸ https://www.symantec.com/security_response/writeup.jsp?docid=2011-071504-5259-99
⁹ https://en.wikipedia.org/wiki/Kelihos_botnet
¹⁰ https://www.microsoft.com/en-us/security/portal/threat/encyclopedia/Entry.aspx?Name=VirTool:Win32/Obfuscator.ACY
¹¹ https://www.symantec.com/security_response/writeup.jsp?docid=2012-012813-0854-99

We only use hex dump files. The average size of the hex dump files is 4.8 MByte, while the biggest file is 56 MByte and the smallest is 110 KByte. A sample line of a hex file is:

00401370 8B 4C 24 04 8B D1 8D 81 D6 8D 82 F7 81 F2 60 4F

In the above sample, 00401370 indicates the offset and is not used in this paper.

The most successful feature set for this dataset is n-gram modelling [34]. Inspired by human language modelling, an n-gram is a contiguous sequence of n items (here, bytes) from a given sequence of the binary file. We use the unigram (a.k.a. 1-gram) probabilistic model.

With the unigram feature representation, each item in a byte sequence can take one of 256 different values; thus, the dictionary size is 256 (00 to FF in hex). Instead of using the raw frequency of an item in a hex file (i.e. the number of times $f_t$ that item $t$ occurs in the hex file), we used the logarithmically scaled frequency as follows:

$$\text{Unigram}(t) = \begin{cases} 1 + \log(f_t) & \text{if } f_t > 0 \\ 0 & \text{if } f_t = 0 \end{cases}$$

In short, each hex file turns into a vector of size 256 that is fed into the AE.

To keep the same setting as our baselines, we used 5-fold cross validation. Although the dataset is imbalanced, we randomly chose an equal proportion of each class for each fold.

For the anomaly intrusion detection task, we used the NSL-KDD dataset¹², which consists of selected records of the KDD-CUP99 dataset. NSL-KDD was collected by analysing incoming network traffic and has been widely used to develop network-based intrusion detection systems (NIDs).

¹² http://nsl.cs.unb.ca/NSL-KDD/

NSL-KDD includes 125,973 train and 22,544 test records labeled either normal or anomaly (i.e. a network attack). The anomaly class has four categories; however, to distinguish normal from anomaly and obtain a detection system, we merge all four categories into one class label. Briefly, the two-class classifiers are trained to distinguish a normal class from an anomaly class.

Each sample in NSL-KDD is represented with 41 features. The features are categorized into four feature sets: basic features (from the TCP/IP connection without inspecting the payload), content features (accessing the payload of the TCP packet), time-based traffic features (statistics in a 2 second time window) and host-based traffic features (within a historical window). Each of the 41 features is presented either as a continuous value or as a symbolic sign. We used the continuous value features intact, while we replaced symbolic features with one-hot encoding. In the one-hot encoding each symbol is replaced with the state of the symbol.

To have an efficient topology for both experiments, we set our deep AE with a 150-90-50-10-50-90-150 structure. Here, 150 is the size of the first and last hidden layers and 10 is the size of the AE's bottleneck layer. The ten code units provide a ten-dimensional concept space to represent the feature space for both network-based intrusion attacks and malware families. The 10 units in the bottleneck of the AE are a constraint by which the AE is compelled to learn useful features. The activation function of all units is sigmoid except for the bottleneck layer, which is linear, in both the pre-training and fine-tuning phases.

For the layer-wise pre-training phase, that is, training four RBMs with 150, 90, 50 and 10 units, we run one step of the
Gibbs sampler with 200 epochs [35]. In the fine-tuning phase, the whole network is trained using mini-batch Conjugate Gradient with line search, for 1500 epochs [36]. The loss function in the training phase is the cross-entropy error. We have used the same hyper-parameters, obtained on the basis of the practical guides presented in [31, 37], in both experiments.

Extracted codes from the AE are fed into the classifiers. Although there is no general agreement in the literature on the input of classifiers, it is beneficial to apply a linear transformation such as a whitening transformation. Statistical whitening decorrelates the features so that they have an identity covariance matrix. To do so, we used the Principal Component Analysis (PCA) whitening algorithm [8].

We used the sklearn library (for Python 2.7.12) to implement the classifiers. Assuming the features have a Gaussian distribution, we used Gaussian Naive Bayes. The hyper-parameters of K-NN were set using a grid search to achieve maximum accuracy. This grid search ranges from 1 to 20 for n_neighbors and over 'uniform' or 'distance' for weights. We also used a grid search to tune the C and gamma parameters of the SVM (with a radial basis function kernel) to achieve maximum accuracy. This grid search ranges from $10^{-4}$ to $10^{4}$ for both the C and gamma parameters of the SVM. Xgboost has the same hyper-parameters as the first-place winner of the Kaggle competition [34].

We used the same AE topology and the same experimental conditions throughout our study to make the results more comparable.

B. Results and Discussion

Having the pre-processed data, we conducted experiments on two different cyber security threats: malware classification and network-based anomaly intrusion detection.

1) Malware classification:

Table I shows how well the AE can provide a discriminative representation. As expected, among the classifiers, Gaussian Naive Bayes gains the highest benefit from the latent representation. Indeed, the AE can enrich the representation and insert the relation between original features into the concept space, so that assuming independent features does not drastically reduce the accuracy of Naive Bayes. Additionally, because a Bernoulli-Gaussian RBM is applied for the initialization of the bottleneck layer, the Gaussian Naive Bayes classifier can perform well.

| Classifiers | Accuracy with Unigram representation | Accuracy with AE representation |
|---|---|---|
| Naive Bayes | 66.2% (±4.45e−05) | 80.4% (±5.73e−04) |
| K-NN | 94.0% (±3.90e−05) | 96.0% (±3.26e−05) |
| SVM | 95.6% (±1.22e−05) | 96.3% (±1.13e−04) |
| Xgboost | 98.2% (±6.65e−06) | 95.7% (±2.53e−05) |

TABLE I. The accuracy of different classifiers for the original representation (Unigram) and the latent representation generated using the AE.

Both K-NN and SVM also generate more accurate predictions. For K-NN and SVM, the error rate is reduced by 33% and 16% respectively (from 6.0% to 4.0% and from 4.4% to 3.7%). However, Xgboost handles the classification task better with the original unigram than with the latent representation generated by the AE.

Table I also shows the possible variance of the accuracy for all the classifiers. The variance in all situations is small and, more importantly, without any overlap with each other.

In addition to the accuracy, the Log Loss metric has been used to analyse the classification performance for Xgboost. The result is presented in Table III. In contrast with the accuracy, Log Loss reduces and is less than for the other models in which Unigram has been used as the feature.

One of the key evaluations is shown in Table II. The Ram and Tra families are confused with the Gat family more than with other families. This issue is not due to the different number of samples for each class. Interestingly, Gat is a trojan that opens a backdoor on the compromised computer, while Ram is a worm which also functions as a backdoor and Tra is a trojan. This shows that the AE provides a clustered representation in which similar families are more likely to be predicted in place of one another than dissimilar families. Figure 3 illustrates the whitened features in two dimensions, visualized by t-SNE [39].

Fig. 3. The output of the coding layer that is applied to the classifiers.

Another important and mutual confusion of classes is between the Vun, Sim and Tra classes. All three classes are trojans. This confusion again provides grounds for understanding that the three classes of malware are indeed from one broader malware family, possibly with the same pattern.

The important thing to notice is that, despite the imbalanced classes, Naive Bayes can provide a fairly good prediction across all the malware families, even for the Sim family with only 42 samples.

2) Network-based intrusion detection:

The accuracy of classifiers for the intrusion detection task also improves using the AE-based generated features compared to the original features (Figure 4). Similar to the
| Family Name | Ram | Lol | Kel ver3 | Vun | Sim | Tra | Kel ver1 | Obf | Gat |
|---|---|---|---|---|---|---|---|---|---|
| Ram (n = 1541) | 45.13% | 6.82% | 0.19% | 1.04% | 0.39% | 5.33% | 0.32% | 10.39% | 30.39% |
| Lol (n = 2478) | 4.38% | 91.12% | 0.00% | 0.24% | 0.04% | 0.16% | 1.22% | 1.87% | 0.97% |
| Kel ver3 (n = 2942) | 0.00% | 10.82% | 89.05% | 0.00% | 0.00% | 0.07% | 0.00% | 0.00% | 0.07% |
| Vun (n = 475) | 0.00% | 0.21% | 0.00% | 78.11% | 6.11% | 6.95% | 1.47% | 6.95% | 0.21% |
| Sim (n = 42) | 0.00% | 0.00% | 0.00% | 0.00% | 87.50% | 5.00% | 0.00% | 7.50% | 0.00% |
| Tra (n = 751) | 4.00% | 1.60% | 1.47% | 6.27% | 5.73% | 62.27% | 1.47% | 6.67% | 10.53% |
| Kel ver1 (n = 398) | 0.25% | 1.52% | 0.51% | 0.00% | 0.00% | 3.80% | 93.92% | 0.00% | 0.00% |
| Obf (n = 1228) | 3.10% | 3.43% | 0.65% | 1.31% | 0.90% | 3.10% | 1.39% | 84.65% | 1.47% |
| Gat (n = 1013) | 0.89% | 0.10% | 1.68% | 0.00% | 0.20% | 3.47% | 0.20% | 7.03% | 86.44% |

TABLE II. The confusion matrix of the Naive Bayes classifier with AE-based features as the input of the classifier. n is the number of samples in the dataset.
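The dominant confusions in Table II can be read off programmatically. The short sketch below is illustrative only: it copies the percentages from Table II, assumes that rows are true families and columns are predicted families (each row sums to roughly 100%), and uses a hypothetical helper `top_confusions` to report, for each family, the wrong class it is most often assigned to.

```python
import numpy as np

families = ["Ram", "Lol", "Kel ver3", "Vun", "Sim", "Tra", "Kel ver1", "Obf", "Gat"]

# Percentages copied from Table II (assumed layout: rows = true family, columns = predicted).
confusion = np.array([
    [45.13,  6.82,  0.19,  1.04,  0.39,  5.33,  0.32, 10.39, 30.39],
    [ 4.38, 91.12,  0.00,  0.24,  0.04,  0.16,  1.22,  1.87,  0.97],
    [ 0.00, 10.82, 89.05,  0.00,  0.00,  0.07,  0.00,  0.00,  0.07],
    [ 0.00,  0.21,  0.00, 78.11,  6.11,  6.95,  1.47,  6.95,  0.21],
    [ 0.00,  0.00,  0.00,  0.00, 87.50,  5.00,  0.00,  7.50,  0.00],
    [ 4.00,  1.60,  1.47,  6.27,  5.73, 62.27,  1.47,  6.67, 10.53],
    [ 0.25,  1.52,  0.51,  0.00,  0.00,  3.80, 93.92,  0.00,  0.00],
    [ 3.10,  3.43,  0.65,  1.31,  0.90,  3.10,  1.39, 84.65,  1.47],
    [ 0.89,  0.10,  1.68,  0.00,  0.20,  3.47,  0.20,  7.03, 86.44],
])

def top_confusions(matrix, labels):
    """For each true class, return the most frequent wrong prediction and its percentage."""
    result = {}
    for i, name in enumerate(labels):
        row = matrix[i].copy()
        row[i] = -np.inf                  # ignore the correct-prediction entry
        j = int(np.argmax(row))           # ties resolve to the first column
        result[name] = (labels[j], float(matrix[i, j]))
    return result

top = top_confusions(confusion, families)
for name, (other, pct) in top.items():
    print(f"{name:9s} is most often mistaken for {other:9s} ({pct:.2f}%)")
```

Under these assumptions, the Ram and Tra rows both point to Gat as their largest off-diagonal entry, which matches the confusion noted in the discussion of Table II.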
set of latent features. The resulting rich and small latent representation makes it practical for implementation in small devices such as the Internet of Things.

REFERENCES

[1] A. Patel, Q. Qassim, and C. Wills, “A survey of intrusion detection and prevention systems,” Information Management & Computer Security, vol. 18, no. 4, pp. 277–290, 2010.
[2] P. Garcia-Teodoro, J. Diaz-Verdejo, G. Maciá-Fernández, and E. Vázquez, “Anomaly-based network intrusion detection: Techniques, systems and challenges,” Computers & Security, vol. 28, no. 1, pp. 18–28, 2009.
[3] P. Mishra, E. S. Pilli, V. Varadharajan, and U. Tupakula, “Intrusion detection techniques in cloud environment: A survey,” Journal of Network and Computer Applications, 2016.
[4] K. Rieck, P. Trinius, C. Willems, and T. Holz, “Automatic analysis of malware behavior using machine learning,” Journal of Computer Security, vol. 19, no. 4, pp. 639–668, 2011.
[5] M. Ahmadi, D. Ulyanov, S. Semenov, M. Trofimov, and G. Giacinto, “Novel feature extraction, selection and fusion for effective malware family classification,” in Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy. ACM, 2016, pp. 183–194.
[6] K. Wang and S. J. Stolfo, “Anomalous payload-based network intrusion detection,” in International Workshop on Recent Advances in Intrusion Detection. Springer, 2004, pp. 203–222.
[7] M. Tavallaee, E. Bagheri, W. Lu, and A. Ghorbani, “A detailed analysis of the kdd cup 99 data set (2009),” in Proceedings of the 2009 IEEE Symposium on Computational Intelligence in Security and Defense Applications (CISDA 2009), 2009.
[8] A. Kessy, A. Lewin, and K. Strimmer, “Optimal whitening and decorrelation,” arXiv preprint arXiv:1512.00809, 2015.
[9] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
[10] R. Salakhutdinov and G. Hinton, “Semantic hashing,” International Journal of Approximate Reasoning, vol. 50, no. 7, pp. 969–978, 2009.
[11] R. Salakhutdinov, “Learning deep generative models,” Annual Review of Statistics and Its Application, vol. 2, pp. 361–385, 2015.
[12] A. Narayanan, M. Chandramohan, L. Chen, Y. Liu, and S. Saminathan, “subgraph2vec: Learning distributed representations of rooted sub-graphs from large graphs,” arXiv preprint arXiv:1606.08928, 2016.
[13] S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, and T. Yagi, “Malware detection with deep neural network using process behavior,” in Computer Software and Applications Conference (COMPSAC), 2016 IEEE 40th Annual, vol. 2. IEEE, 2016, pp. 577–582.
[14] L. Xu, D. Zhang, N. Jayasena, and J. Cavazos, “Hadm: Hybrid analysis for detection of malware.”
[15] Y. Wang, W.-d. Cai, and P.-c. Wei, “A deep learning approach for detecting malicious javascript code,” Security and Communication Networks, 2016.
[16] O. E. David and N. S. Netanyahu, “Deepsign: Deep learning for automatic malware signature generation and classification,” in 2015 International Joint Conference on Neural Networks (IJCNN). IEEE, 2015, pp. 1–8.
[17] X. Wang and S. M. Yiu, “A multi-task learning model for malware classification with useful file access pattern from api call sequence,” arXiv preprint arXiv:1610.05945, 2016.
[18] R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and A. Thomas, “Malware classification with recurrent networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 1916–1920.
[19] W. Huang and J. W. Stokes, “Mtnet: a multi-task neural network for dynamic malware classification,” in Detection of Intrusions and Malware, and Vulnerability Assessment: 13th International Conference, DIMVA 2016, San Sebastián, Spain, July 7-8, 2016, Proceedings. Springer, 2016, pp. 399–418.
[20] A. Javaid, Q. Niyaz, W. Sun, and M. Alam, “A deep learning approach for network intrusion detection system,” in Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS). ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2016, pp. 21–26.
[21] L. Bontemps, J. McDermott, N.-A. Le-Khac et al., “Collective anomaly detection based on long short-term memory recurrent neural networks,” in International Conference on Future Data and Security Engineering. Springer, 2016, pp. 141–152.
[22] T. Nolle, A. Seeliger, and M. Mühlhäuser, “Unsupervised anomaly detection in noisy business process event logs using denoising autoencoders,” in International Conference on Discovery Science. Springer, 2016, pp. 442–456.
[23] S. Potluri and C. Diedrich, “Accelerated deep neural networks for enhanced intrusion detection system,” in Emerging Technologies and Factory Automation (ETFA), 2016 IEEE 21st International Conference on. IEEE, 2016, pp. 1–8.
[24] J. An and S. Cho, “Variational autoencoder based anomaly detection using reconstruction probability,” 2015.
[25] M. Nicolau, J. McDermott et al., “A hybrid autoencoder and density estimation model for anomaly detection,” in International Conference on Parallel Problem Solving from Nature. Springer, 2016, pp. 717–726.
[26] T. Ma, F. Wang, J. Cheng, Y. Yu, and X. Chen, “A hybrid spectral clustering and deep neural network ensemble algorithm for intrusion detection in sensor networks,” Sensors, vol. 16, no. 10, p. 1701, 2016.
[27] G. E. Hinton and R. S. Zemel, “Minimizing description length in an unsupervised neural network,” Preprint, 1997.
[28] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006a.
[29] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006b.
[30] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010.
[31] G. Hinton, “A practical guide to training restricted boltzmann machines,” Momentum, vol. 9, no. 1, p. 926, 2010.
[32] J. J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, 1982.
[33] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
[34] X. Wang, J. Liu, and X. Chen, “First place team: Say no to overfitting,” 2015.
[35] M. A. Carreira-Perpinan and G. E. Hinton, “On contrastive divergence learning,” in Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics. Citeseer, 2005, pp. 33–40.
[36] J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, Q. V. Le, and A. Y. Ng, “On optimization methods for deep learning,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 265–272.
[37] Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 437–478.
[38] Malware classification: Distributed data mining with spark. http://msan-vs-malware.com/. Accessed: 2016-11-11.
[39] L. Van Der Maaten, “Accelerating t-sne using tree-based algorithms,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 3221–3245, 2014.
[40] R. A. R. Ashfaq, X.-Z. Wang, J. Z. Huang, H. Abbas, and Y.-L. He, “Fuzziness based semi-supervised learning approach for intrusion detection system,” Information Sciences, 2016.
[41] P. Krömer, J. Platoš, V. Snášel, and A. Abraham, “Fuzzy classification by evolutionary algorithms,” in Systems, Man, and Cybernetics (SMC), 2011 IEEE International Conference on. IEEE, 2011, pp. 313–318.
[42] M. Mohammadi, B. Raahemi, A. Akbari, and B. Nassersharif, “Class dependent feature transformation for intrusion detection systems,” in 2011 19th Iranian Conference on Electrical Engineering. IEEE, 2011, pp. 1–6.