
computers & security 112 (2022) 102501

Available online at www.sciencedirect.com

journal homepage: www.elsevier.com/locate/cose

Obfuscated Malware Detection Using Deep Generative Model based on Global/Local Features

Jin-Young Kim, MS, Sung-Bae Cho, Ph.D. ∗


Department of Computer Science, Yonsei University, Seoul 03722, South Korea

a r t i c l e   i n f o

Article history:
Received 25 July 2021
Revised 27 September 2021
Accepted 4 October 2021
Available online 9 October 2021

Keywords:
malicious software
sensor networks
global features
local features
deep learning
generative model
temporal model

a b s t r a c t

As a large amount of malicious software (malware), including DDoS bots and Trojan horses, pervades communication networks, several approaches based on global and local features have been attempted to cope with the modifications added in malware variants, such as null value insertion, code interchange, and reordering of subroutines. Detectors that use only one type of feature have been studied extensively, but detectors that use both features are rarely investigated, although good performance might be expected due to their complementary characteristics. In this paper, we propose a hybrid deep generative model that exploits global and local features together to detect malware variants effectively. While transforming malware into an image to efficiently represent global features with a pre-defined latent space, it extracts local features using the binary code sequences. The two features extracted from the data with their respective characteristics are concatenated and entered into the malware detector. By using both features, the proposed model achieves an accuracy of 97.47%, resulting in state-of-the-art performance. We analyze which parts of the malware code affect the detection results through a class activation map (CAM) and, by analyzing the CAM results of the generated malware, confirm that virtual malware generation improves detection performance.

© 2021 Elsevier Ltd. All rights reserved.

1. Introduction

Malicious software, which adversely affects computers by deleting or modifying important data, is a significant tool used in cyberwar. Hackers often use it to steal many types of data, such as personal, financial, military, or business information, resulting in tremendous damage to various fields. In particular, the wireless sensor network is an easy target of malware due to its weak security mechanism (Ohja et al., 2020). Malware has been steadily growing in speed (rapidity of threats), number (growing threat landscape), and discrepancy (introduction of new methods), resulting in approximately 1.04 billion malware samples in 2020 (Dgammi and Singh, 2015).

Malware usually infects a computer and destroys or steals important information. Once it infects a computer, it intrudes into additional connected devices in the vicinity, increasing the damage. Not only is there a large amount of malware, but there are also many types of malware. They can be classified into four categories according to their behavior: virus, worm, potentially unwanted program (PUP), and Trojan horse. When it is executed, a virus exhibits harmful behavior that impacts other files, such as inserting malicious code in another file, for example, "text segment infection" or "data segment infection." Unlike a virus, a worm is copied by itself without a medium, and the infection spreads to other networks through a vulnerability in the operating system. When the worm is executed, it is continuously replicated on the infected computer, which


Corresponding authors.
E-mail addresses: seago0828@yonsei.ac.kr (J.-Y. Kim), sbcho@yonsei.ac.kr (S.-B. Cho).
https://doi.org/10.1016/j.cose.2021.102501
0167-4048/© 2021 Elsevier Ltd. All rights reserved.

adversely affects the memory and CPU of the computer. A PUP is installed with the user's unintended consent. It may consist of code that can weaken the computer's security or invade privacy. A Trojan horse misleads the original intention of the user by performing a hidden function after the attacker has confused the target about the intended purpose. It is well known that the wireless sensor network (WSN) is vulnerable to malware because malware can spread through the entire sensor network via neighboring sensor nodes when a single node gets compromised (Acrali et al., 2019; Biswal and Swain, 2019; Ohja et al., 2020).

The data used in this paper were collected in the Kaggle Microsoft Malware Classification Challenge. Ronen et al. summarized existing studies that used the same data (Ronen et al., 2018). Ramnit (R), Kelihos version 3 (K3), Simda (S), and Kelihos version 1 (K1) belong to botnets that can be used to steal data, allow the attacker to access the computer and its connection, and perform distributed denial-of-service (DDoS) attacks. Vundo (V), Simda (S), Tracur (T), and Gatak (G) belong to Trojan horse malware that misleads the true intentions of the user. Although they seem to be normal software, they cause damage to the computer by executing malicious behavior when running. Lollipop (L) is adware that produces profits for the developer by automatically showing advertisements on the user's computer.

Malware developers typically produce malware with some modifications, such as null value insertion, code exchange, or reorganization of subroutines, to prevent its detection by vaccine systems (You and Yim, 2010). The top image in Fig. 1 shows an example of null value insertion and the bottom image shows reorganization of subroutines. For the detection of such malware, detectors have been studied focusing on two main features: global and local features. Malware has a global feature that represents the overall characteristics of the malware, such as the developer's signature. However, malware detection methods based on this feature are vulnerable to local changes such as the null value insertion shown in Fig. 1. On the contrary, approaches that detect malware based on local features are vulnerable to global transformations such as the reorganization of subroutines depicted in Fig. 1. Since malware developers typically apply both kinds of modification to prevent detection by vaccine systems, it is necessary to detect malware while considering the global and local characteristics simultaneously. Recently, many studies have been conducted to detect malware using both features. However, they have the disadvantage that they do not consider the characteristics of each feature, because they model both features with one model.

To overcome these limitations, we propose a novel method to generate modified malware and train the detector using generated image-level data and code-level information simultaneously. To train the detector, we pre-define the latent space where the characteristics of the data are embedded, to generate more sophisticated malware with representation learning based on a variational autoencoder (VAE). The encoder learns to project data to a specific latent space according to the features. The decoder reconstructs the latent variable placed in the latent space by the encoder. They interact with each other and learn to define the data representation. We exploit the decoder and encoder of the generative model to produce competitive new data and to confirm that the generator creates appropriate data, respectively. After the generation phase, the encoder can extract the features of malware effectively. The trained encoder allows us to construct a detector that is robust to the modification. In this method, the code-level information of the malware is lost, so we also use a model that extracts the features of the code sequence and delivers them to the detector. Then, the two characteristics of malware are combined and used to classify the malware. We can represent malware data as assembly code, binary code, or a malware image; malware images are described in detail later. The main contributions of this paper are as follows.

• A novel method is proposed for detecting malware by using global and local features to overcome the limitations of the existing approaches; it achieves state-of-the-art performance compared with the conventional methods.
• We focus on malware deformability and construct a deep generative model that can solve the obfuscation problem. It generates virtual malware directly and improves the performance of the detector by making the detector robust to modification.
• The effect of the generated malware is verified by analyzing what parts of the malware code affect the detection results through a class activation map (CAM), and we confirm that virtual malware generation improves detection performance.

The rest of this paper is organized as follows. Section 2 introduces several related works on malware detection and discusses their limitations. In Section 3, we propose and explain our method to detect malware variants as well as solve the limitations. The performance of our method is verified in Section 4 with several experiments. Lastly, we present the conclusions and future works in Section 5.

2. Related Work

Several researchers have studied malware since its harmful impact has increased, resulting in the necessity of detecting various malware effectively. The approaches to detect malware are divided into three types: global feature modeling, local feature modeling, and hybrid feature modeling. Several novel features, based on hex dumps or collected from disassembled files, were proposed by Ahmadi et al. to show distinctive characteristics of different malware groups (Ahmadi et al., 2016). Drew et al. applied a high-throughput method for gene sequence classification to malware detection (Drew et al., 2016). Narayanan et al. analyzed the performance of machine learning and pattern recognition algorithms for malware classification by visualizing the malware as Nataraj did (Nataraj et al., 2011; Narayanan, 2016). A function call graph (FCG) representation was proposed by Hassen and Chan using a clustering method (Hassen and Chan, 2017). Hassen et al. presented a new feature based on control statement shingling (Hassen et al., 2017). Norton and Qi proposed a visualization tool for the performance of adversarial samples (Norton and Qi, 2017). A theoretical framework was proposed by Wang et al. to decrease the gap among many recent studies (Wang et al., 2016).

In a study in global feature modeling, Nataraj et al. proposed a malware classifier using a novel preprocessing method that converts the binary code to a malware image (Nataraj et al., 2011). The malware images shown in Fig. 1 are the results of this preprocessing method. Dahl et al. classified malware with neural networks using random projections (Dahl et al., 2013). This preprocessing method was also used by Garcia et al., who detected malware with a random forest (RF) (Garcia et al., 2016). Wang et al. proposed a detector which is robust to some modifications by randomly nullifying features of malware (Wang et al., 2017). However, there was a risk of canceling significant characteristics, since they eliminated features randomly. Kim et al. proposed a generative model which produces virtual malware for a detector to be robust to some obfuscations (Kim et al., 2018). Kim and Cho pre-defined the latent space and constructed a hybrid generative deep learning model with the global features only (Kim and Cho, 2018).

Global feature modeling makes it difficult to detect malware with local obfuscation, which can be overcome by local feature modeling. Christodorescu et al. made templates for the behavior of malware and compared them with a given malware (Christodorescu et al., 2005). This can capture the behavior of malware directly, but the performance strongly depends on the number of templates. Akritidis et al. defined the general characteristics of malware (Akritidis et al., 2005). Berlin et al. and Ye et al. used the Windows API and Windows Audit Log for detecting malware using log data (Berlin et al., 2015; Ye et al., 2017). To solve the problems that previous works had (such as vulnerability to modifications), Jordaney et al. proposed Transcend, a sustainable malware classifier (Jordaney et al., 2017). Lin et al. proposed a fast detector to overcome the high time complexity (Lin et al., 2018). Those studies tried to detect malware by defining and analyzing its local characteristics.

Methods to construct an advanced detector that considers both characteristics of malware have also been studied. Yuan et al. proposed a method that utilizes more than 200 features for Android malware detection (Yuan et al., 2014). Su et al. extracted five types of global features and built a deep learning model to learn features from Android apps; the learned features are used to detect unknown Android malware (Su et al., 2016). Ye et al. constructed AiDroid, which extracts heterogeneous information of malware (Ye et al., 2018). Wang et al. trained a deep autoencoder based on convolutional neural networks to represent the characteristics of malware in an unsupervised way (Wang et al., 2019). Kim et al. modeled many kinds of features, such as permissions, environmental information, Dalvik opcode, etc., using a multimodal neural network. However, in this approach, each kind of feature was modeled just with one-hot encoding (Kim et al., 2019).

The related studies on malware detection are summarized in Table 1. They include various methods to extract features and detect malware with the extracted characteristics or analyzed logs. In this paper, we propose a novel method that generates virtual malware from the pre-defined latent space to be robust to obfuscations. Besides, since the code-level information of malware is lost in this method, we extend the model to process time series modeling in parallel.

3. The Proposed Method

The entire process, as shown in Fig. 2, is divided into three phases: 1) global feature modeling (image-level feature extraction); 2) local feature modeling (code-level feature extraction); and 3) malware detection.

The first and second parts are described in Fig. 2(a) and 2(b), respectively. The encoder E is used in the third phase. When going from 1 to 3, we use transfer learning (Arnold et al., 2007). This method allows more effective training by reusing a similar model that solves similar tasks on a different domain. It can be regarded as passing down the capacity of one model to another. For the global feature modeling, the malware codes in binary format are pre-processed as malware images to be used in the proposed model. To robustly cope with malware obfuscation, a process of defining the latent space where the characteristics of malware are embedded and generating a malware image is performed. As shown in Fig. 2(a), the characteristics of malware images are extracted into the pre-defined latent space by an autoencoder consisting of the encoder E and the decoder D. The virtual malware is generated from the pre-defined latent space and strengthens the feature extractor f_i. On the other hand, as illustrated in Fig. 2(b), for the local feature modeling, the malware codes are pre-processed as the sequence of binary codes. Finally, the extracted features are concatenated and used as the input of the detector.

3.1. Global Feature Modeling

3.1.1. Defining Latent Space

Autoencoders composed of an encoder and a decoder have usually been used to learn the data representation and are used in various fields (Hosseini-Asl et al., 2016; Han et al., 2016; Zeng et al., 2017; Guo et al., 2018; Nicolau and McDermott, 2018; Zhang et al., 2018c). The VAE is one of the methods for learning complex distributions (Kingma and Welling, 2013). The basic process of an autoencoder is

z ∼ Enc(x) = Q(z | x),   (1)

x ∼ Dec(z) = P(x | z),   (2)

where z is a latent variable and x is a data value. We project data x to latent variable z with an encoder Q, and reconstruct z to x with a decoder P. This technique can learn a data representation, but the compressed data is hidden from us, i.e., the meaning of the latent variable value is unknown to us. To confirm which characteristics of data are compressed and reconstructed, we compress and reconstruct the data value xi with

Table 1 – The summary of the related works.

Category | Input | Pros | Cons
Feature Engineering | Binary malware code | Makes data easy to detect malicious features | Loses some information of the malware
Global Feature Modeling | Malware images, developers' signature | Analyzes the entire characteristics of malware | Vulnerable to local transformation
Local Feature Modeling | Dalvik code, API call sequences | Copes with the obfuscation of malware | Difficult to detect the comprehensive features of the malware
Hybrid Feature Modeling | Hybrid of the above features | Considers both characteristics of malware | Requires a relatively complex model

Fig. 1 – Example of obfuscation of malware. (Top) Modification by null value insertion and (Bottom) reordering of subroutines.
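The malware images in Fig. 1 come from the binary-to-image preprocessing used throughout the paper (hex values laid out row by row into a grayscale matrix). The sketch below is ours, not the authors' code: the power-of-two width rule follows our reading of the partly garbled Eq. (20), and the helper names `image_dims` and `to_image` are hypothetical.

```python
import math

# Reconstruction sketch of the binary-to-malware-image conversion.
# Assumption: the column count C is the power of two just above sqrt(16k),
# with k the length of the binary code (our reading of Eq. (20)).

def image_dims(k):
    c = 2 ** (int(math.log2(math.sqrt(16 * k))) + 1)  # power-of-two width
    r = math.ceil(16 * k / c)                         # rows to hold all values
    return r, c

def to_image(byte_values, width):
    """Lay byte values (0-255) row by row into a width-wide grayscale matrix."""
    rows = [byte_values[i:i + width] for i in range(0, len(byte_values), width)]
    if rows and len(rows[-1]) < width:                # zero-pad the last row
        rows[-1] = rows[-1] + [0] * (width - len(rows[-1]))
    return rows

r, c = image_dims(4096)          # 16k = 65536 values
assert c == 512 and r == 128     # 512 is the next power of two above sqrt(65536)
assert to_image([10, 20, 30], 2) == [[10, 20], [30, 0]]
```

In the experiments the resulting matrices are further resized to 128 × 128 pixels before being fed to the encoder.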

i-feature into a corresponding space Q(zi | xi). The changed autoencoder process is

zi ∼ Enc(xi) = Q(zi | xi),   (3)

xi ∼ Dec(zi) = P(xi | zi),   (4)

where xi and zi represent x and z including the i-feature. When the encoder learns, it is regularized by putting a prior over the distribution P(z). Typically, z ∼ N(0, I) is chosen, but we choose zi ∼ N(μi, I) for dealing with a specific characteristic. The final objective function is the sum of the VAE loss and the prior regularity, as in

L = E_{zi∼Q(zi|xi)}[log P(xi | zi)] + DKL[Q(zi | xi) || P(zi)] = LVAE + Lprior   (5)

where DKL is the Kullback-Leibler divergence that computes the distance between the distribution of the encoded data Q(zi | xi) and the prior distribution P(zi). As a result of this training process, the data of each feature are projected to the corresponding latent space. The trained VAE is also reused in the process of generating malware, which is treated in more detail in the next section. Deconvolution layers are used to increase the dimension of the latent variables to that of the data (Zeiler and Fergus, 2013). We place a dropout layer, leaky ReLU activation layer, and batch normalization layer after every layer except the last (Ioffe and Szegedy, 2015). The details of the proposed VAE used in this study for the generation experiments are as follows:

• Encoder: Input(128,256) - Conv2D 5∗5@16 (128,256,16) - Conv2D 5∗5@32 (64,128,32) - Conv2D 5∗5@32 (64,128,32) - Conv2D 5∗5@64 (32,64,64) - Conv2D 5∗5@64 (32,64,64) - Conv2D 5∗5@128 (16,32,128) - Conv2D 5∗5@128 (16,32,128) – Flatten (65536) – Dense (540) – Dense (270), Dense (270)
• Decoder: Sampling (270) – Dense (65536) – Reshape (16,32,128) – deConv2D 2∗2@128 (32,64,128) – Conv2D 5∗5@64 (32,64,64) – Conv2D 5∗5@64 (32,64,64) – deConv2D 2∗2@64 (64,128,64) – Conv2D 5∗5@32 (64,128,32) – Conv2D 5∗5@32 (64,128,32) – deConv2D 2∗2@32 (128,256,32) – Conv2D 5∗5@16 (128,256,16) – Conv2D 5∗5@1 (128,256,1)
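The class-conditional prior zi ∼ N(μi, I) makes the KL term of Eq. (5) depend on which class a sample belongs to. A minimal sketch (ours, not the paper's code; the helper name `kl_to_class_prior` is hypothetical) of that closed-form penalty for a diagonal-Gaussian encoder output:

```python
import math

# Closed-form D_KL[ N(m, diag(exp(log_var))) || N(mu_i, I) ], summed over
# latent dimensions: the prior-regularity term of Eq. (5) for class i.

def kl_to_class_prior(m, log_var, mu_i):
    kl = 0.0
    for mj, lvj, pj in zip(m, log_var, mu_i):
        sj2 = math.exp(lvj)  # encoder variance for this dimension
        kl += 0.5 * (sj2 + (mj - pj) ** 2 - 1.0 - lvj)
    return kl

# When the encoder already matches the class prior, the penalty vanishes:
assert kl_to_class_prior([0.3, -1.2], [0.0, 0.0], [0.3, -1.2]) == 0.0
# Moving the mean away from mu_i is penalized quadratically:
assert abs(kl_to_class_prior([1.0], [0.0], [0.0]) - 0.5) < 1e-12
```

This is what pulls samples of each malware family toward its own region of the latent space during training.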

Fig. 3 – Results of malware detection as a box plot. We repeat the experiments ten times to validate the performance fairly.

Fig. 4 – Results of malware detection as a box plot. We repeat the experiments ten times to validate the performance fairly. The CNN (Image) and LSTM (Code) models use only a single feature, and the CNN+LSTM (Image+Code) model uses both global and local features, but no pre-trained latent space.
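The box plots in Figs. 3 and 4 summarize the ten repeated runs per model. For reference, the five-number summary a box plot displays can be computed as below; the accuracy values in the example are made up for illustration, not the paper's measurements.

```python
# Five-number summary (min, lower quartile, median, upper quartile, max)
# of repeated accuracy runs, using linear interpolation between ranks.

def five_number_summary(values):
    v = sorted(values)
    n = len(v)

    def quantile(q):
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        frac = pos - int(pos)
        return v[lo] * (1 - frac) + v[hi] * frac

    return v[0], quantile(0.25), quantile(0.5), quantile(0.75), v[-1]

runs = [97.1, 97.3, 97.4, 97.5, 97.5, 97.5, 97.6, 97.6, 97.7, 97.8]  # hypothetical
lo, q1, med, q3, hi = five_number_summary(runs)
assert lo == 97.1 and hi == 97.8
assert q1 <= med <= q3
```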

3.1.2. Generating Data

To cope with the obfuscations of malware, we propose a generative model that produces virtual malware for the detector to be robust to modification. The GAN has been an attractive model in data generation and is used in various fields (Goodfellow et al., 2014; Zhang et al., 2018a; Zhang et al., 2018b; Creswell and Bharath, 2018; Arjovsky et al., 2017). The learning process of a GAN is to train G, which produces data, and D, which distinguishes the generated data from the real data, at the same time. Eq. (6) shows the loss function of a GAN. pdata is the distribution of the actual data. G(z) is created from a distribution pz by G, and it is classified against the actual data by D. D is trained to maximize V(D, G) so that D(x) approaches 1 and D(G(z)) approaches 0, and G is trained to minimize V(D, G) so that D(G(z)) of the second term approaches 1.¹

minG maxD V(D, G) = Ex∼pdata(x)[log D(x)] + Ez∼pz(z)[log(1 − D(G(z)))]   (6)

¹ We detach the index i for general representation.

Because the original GAN has the disadvantage that the produced data are incomplete due to the unstable training process of G, we pre-train G with the proposed VAE decoder. The goal of the learning process of G is (7), taken from (6); it is equivalent to (8), but (8) is still expressed in terms of D. Therefore, we change (8) to (9), which uses only G. In this study, to pre-train G with (9), we reuse the proposed VAE, which includes the encoder and decoder, and then transfer the decoder into G. The result is |pdata − pG^VAE| ≤ |pdata − pG|, so that the goal of the GAN (pdata ≈ pG) can be reached stably.

minG log(1 − D(G(z)))   (7)

⇔ D(G(z)) ≈ 1   (8)

⇔ G(z) ≈ x ∈ χ, where χ is the real dataset   (9)

From a game theory point of view, when the discriminator and the generator reach the Nash equilibrium, the GAN converges to the optimal saddle point. In this discussion, let pG be

Fig. 5 – Confusion matrices of (a) CNN (image), (b) LSTM (code), (c) CNN+LSTM (image+code), and (d) ours for malware detection. Label indices are matched with 'Ramnit', 'Lollipop', 'Kelihos ver. 3', 'Vundo', 'Simda', 'Tracur', 'Kelihos ver. 1', 'Obfuscator.ACY', and 'Gatak' in order.

Fig. 6 – Examples of generated malware images

the distribution of the data produced by G. We show that if G(z) ≈ x, i.e., pdata ≈ pG, the GAN reaches the Nash equilibrium. We define J(D, G) = ∫x pdata(x)D(x)dx + ∫z pz(z)(1 − D(G(z)))dz and K(D, G) = ∫x pdata(x)(1 − D(x))dx + ∫z pz(z)D(G(z))dz. We train G and D to minimize J(D, G) and K(D, G), respectively, and we define the Nash equilibrium of the GAN as a state that satisfies (10) and (11), where the two inequalities hold for any G and D at the same time. We denote the fully trained G and D as G∗ and D∗, respectively.

J(D∗, G∗) ≤ J(D∗, G)  ∀G   (10)

K(D∗, G∗) ≤ K(D, G∗)  ∀D   (11)

Theorem 1. If pdata ≈ pG almost everywhere, then the Nash equilibrium of the GAN is reached.
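Theorem 1 can be illustrated with a discrete toy example (an illustration of ours, not the paper's proof): writing the z-integral as a sum over the generated distribution pG, both functionals J and K evaluate to exactly 1 for every discriminator D once pdata = pG, so neither player can improve and conditions (10) and (11) hold trivially.

```python
import random

# Discrete versions of the functionals J and K defined above.

def J(p_data, p_G, D):
    return sum(p_data[x] * D[x] for x in p_data) + \
           sum(p_G[x] * (1.0 - D[x]) for x in p_G)

def K(p_data, p_G, D):
    return sum(p_data[x] * (1.0 - D[x]) for x in p_data) + \
           sum(p_G[x] * D[x] for x in p_G)

random.seed(0)
xs = range(8)
p = {x: 1.0 / 8 for x in xs}              # p_data; the generator matches it
for _ in range(5):
    D = {x: random.random() for x in xs}  # an arbitrary discriminator
    assert abs(J(p, p, D) - 1.0) < 1e-12  # constant in D: no better response
    assert abs(K(p, p, D) - 1.0) < 1e-12
```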

We need to prove the following two lemmas before proving Theorem 1.

Lemma 1. J(D∗, G) can reach a minimum when pdata(x) ≤ pG(x) for almost every x.

Lemma 2. K(D, G∗) can reach a minimum when pdata(x) ≥ pG∗(x) for almost every x.

The proofs of Lemmas 1 and 2 were discussed by Kim et al. (Kim et al., 2018). We assume that pdata ≈ pG. From Lemmas 1 and 2, if pdata ≈ pG, then J(D, G) and K(D, G) reach minima. This means that the GAN reaches the Nash equilibrium and converges to optimal saddle points. By Theorem 1, the GAN converges when pd ≈ pg, and this is achieved to some extent by the modified VAE, i.e., |pd − pg^VAE| ≤ |pd − pg|, because one goal of the modified VAE is P(x | Q(z | x)) ≈ x. Therefore, the proposed method is useful to initialize the weights of the generative model.

In this study, the GAN framework is based on the Wasserstein GAN, whose objective function is (12). It defines the difference of distributions as the Wasserstein distance, not the Jensen-Shannon divergence used in the vanilla GAN (Arjovsky et al., 2017; Gulrajani et al., 2017). To verify that the data produced from the latent space representing the i-feature actually has the i-feature, the encoder projects the data back to the latent space, and the distance between the distribution of the encoded data Q(zi | G(zi)) and the prior distribution N(μi, I) is computed. The objective function is (13).

maxD V(D, G) = Ex∼pdata(x)[D(x)] − Ez∼pz(z)[D(G(z))]   (12)

maxD V(D, G) = Ex∼pdata(x)[D(x)] − Ez∼pz(z)[D(G(z))] − DKL[Q(zi | G(zi)) || N(μi, I)]   (13)

The reason to use a GAN is to generate modified data so as to be robust to deformations of malware. Because a GAN produces data from a latent variable randomly sampled from a distribution, e.g., a Gaussian distribution, the new data have some modifications compared to the real data, which expands the range of detectable malware. Therefore, the encoder and D are robust to modification (Radford et al., 2015), and we can train D with more data.

The details of the constructed model used in this study for the generation experiments are as follows. As in the VAE, we place a dropout layer, leaky ReLU activation layer, and batch normalization layer after every layer except the last layer.

• Generator: Sampling (270) – Dense (65536) – Reshape (16,32,128) – deConv2D 2∗2@128 (32,64,128) – Conv2D 5∗5@64 (32,64,64) – Conv2D 5∗5@64 (32,64,64) – deConv2D 2∗2@64 (64,128,64) – Conv2D 5∗5@32 (64,128,32) – Conv2D 5∗5@32 (64,128,32) – deConv2D 2∗2@32 (128,256,32) – Conv2D 5∗5@16 (128,256,16) – Conv2D 5∗5@1 (128,256,1)
• Discriminator: Input(128,256) - Conv2D 5∗5@16 (128,256,16) - Conv2D 5∗5@32 (64,128,32) - Conv2D 5∗5@32 (64,128,32) - Conv2D 5∗5@64 (32,64,64) - Conv2D 5∗5@64 (32,64,64) - Conv2D 5∗5@128 (16,32,128) – Flatten (65536) – Dense (270) – Dense (18) – Dense (1)

3.2. Local Feature Modeling

To solve the problem of previous works that miss the local features, we propose a way to build a model that extracts the sequence features of malware code and passes them to the detector. The malware codes are pre-processed as the sequence of binary code to represent the local features. We construct the model with a 1D convolutional neural network (1D CNN) to extract features of adjacent regions in the sequence using Eq. (14), and a long short-term memory (LSTM) to process temporal features using Eqs. (15) through (19) (Kim et al., 2016).

yl_{i,j} = σ( Σ_{f=1..F} wl−1_{f,j} · xl−1_{i+f−1} + bl−1_{j} )   (14)

where σ represents the activation function, yl_{i,j} is the l-th layer's output at the i-th timestep and j-th feature map, F means the number of feature maps, w is the weight of the (l−1)-th layer, and x means the input.

it = σ(wpi pt + whi ht−1 + wci ◦ ct−1 + bi)   (15)

ft = σ(wpf pt + whf ht−1 + wcf ◦ ct−1 + bf)   (16)

ot = σ(wpo pt + who ht−1 + wco ◦ ct−1 + bo)   (17)

ct = ft ◦ ct−1 + it ◦ σ(wpc pt + whc ht−1 + bc)   (18)

ht = ot ◦ σ(ct)   (19)

where it, ft, and ot represent the outputs of the input, forget, and output gates, respectively, b is a bias term, and ◦ denotes the elementwise product.

3.3. Malware Detection with Transfer Learning

In the previous sub-processes, we train the VAE with a mixture of multivariate Gaussian distributions to define the latent space and generate data that expand the range of the knowledge space. Finally, we construct a detector with all the pre-trained models by using transfer learning (Arnold et al., 2007). The encoder is reused for the detector so that its capacity is inherited. We add a new fully-connected layer to the detector and train it.

The proposed detector can effectively extract the features of malware by pre-defining the latent space while projecting

wares for each type is described in Table 2. The number of train


Algorithm 1 – Obfuscated Malware Detector
and test datasets were 9,780 and 1,088, respectively.
Learning algorithm for the detector To convert the code into an image, we changed the num-
Input: Dataset T = {(xI1 , xT1 y1 ), . . . , (xIn , xTn , yn )} ber of hexadecimal values in one row. If k is the length of the
Output: Detector C
binary code, R is the size of the transformed row, and C is the
Initialize decoder P, encoder Q, and discriminator D
size of the transformed column, then the method for calculat-
for epochsVAE × batch do
Sample {(xIi , yi )}m a batch from T ing the transformed row and column size is given by
i=1
Sample {zi }mi=1
from the corresponding yi √
16k/ log 2+1
LVAE = −Ez ∼Q(z xI ) [logP(xIi |zi )]+DKL [Q(zi |xIi )||P(zi )] C = 2log ; R = 16k/C (20)
i i i
1 ∂L
θQ ← θ Q − m im=1 ∂VAEθQ ( x i , y i )
I

1 ∂ L All the models are trained with Adam optimizer, and


θP ← θ P − m im=1 ∂VAEθP ( x i , y i )
I

end for
the weights are initialized with a method proposed by Glo-
Generate G ← Q rot and Bengio (Kingma and Ba, 2014; Glorot and Bengio,
for epochsGAN ∗ batch do 2009).
Sample {(xII , yi )}mi=1
a batch from T
LGAN = 4.2. Detection Results
−ExI ∼p ( xI )
[logD(xIi )]+Ez ∼pz (z ) [log D(G(zi ) )]+DKL [Q(zi G(zi ) )||N (μi , I )]
i dat ai i i i i
1 m ∂ LGAN The performance of the proposed model is compared with
θD ← θ D − m i=1 ∂ θD (xi , yi )
I

    θ_G ← θ_G − (1/m) Σ_{i=1}^{m} ∂L_GAN/∂θ_G (x_i^I, y_i)
    θ_Q ← θ_Q − (1/m) Σ_{i=1}^{m} ∂L_GAN/∂θ_Q (x_i^I, y_i)
end for
Transfer Q to detector C
for epochs_Detector × batch do
    Sample {(x_i^I, x_i^T, y_i)}_{i=1}^{m}, a batch from T
    L_C = crossentropy(C(x_i^I, x_i^T), y_i)
    θ_C ← θ_C − (1/m) Σ_{i=1}^{m} ∂L_C/∂θ_C (x_i^I, x_i^T, y_i)
end for
Return C

the characteristics to the corresponding area of latent space. The extracted features of the malware image and the binary code sequence are concatenated and used to classify malware. The algorithm for the whole process is shown in Algorithm 1.

4. Experiment

4.1. Dataset and Experimental Settings

To verify the performance of producing and detecting malware, we used the dataset from the Kaggle Microsoft Malware Classification Challenge. Ramnit, Kelihos ver. 3, Simda, and Kelihos ver. 1 can be classified as botnets, which can be used to perform DDoS attacks, steal data, and allow the attacker to access the computer and its connection. V, S, T, and G are Trojan horses that conceal their true intentions from the user. Although they appear to be normal software, they damage the computer by executing malicious code when run. Lollipop (L) is adware that produces profit for its developer by showing advertisements on the user's computer. Obfuscator.ACY (O.ACY) is a combination of several methods.

The program data are given in binary and assembly code, and we used only the former. We converted the hexadecimal values to decimal numbers, which were normalized into values of 0 to 1. The malware code is converted to an image as Nataraj did, called a malware image. Since the malware images were too large to use in our model, they were reduced to 128 × 128 pixels. The number of malware samples in each class is summarized in Table 2.

other conventional machine learning algorithms, such as quadratic discriminant analysis (QDA), support vector machine (SVM) with a polynomial kernel (Poly), linear discriminant analysis (LDA), AdaBoost (AB), naïve Bayes (NB), k-nearest neighbors (k-NN), decision trees (DT), RF, and SVM with a radial basis function (RBF) kernel, which are provided in the scikit-learn library (sklearn). We also compare the performance of the proposed model with deep neural networks consisting of a convolutional neural network (CNN) for malware images and a long short-term memory (LSTM) for the sequence of raw code. The 10-fold cross validation is conducted for statistically meaningful results, which are described in Fig. 3. The averaged accuracy is 97.47%, which is better than the other conventional methods. This is also higher than the performance of the previous research (Kim et al., 2017, 2018; Kim and Cho, 2018): 95.74%, 96.97%, and 96.39%, respectively.

We also verify our hybrid method against others that use a single feature. We compare the detection results as shown in Fig. 4. The CNN model uses only the features of the malware image (global feature), whereas the LSTM model uses only the characteristics of the sequence of raw malicious code (local feature). The hybrid model (CNN+LSTM) performs better than the CNN and LSTM models, but worse than ours, which means the pre-training method is useful in detecting malware. We analyze the statistical significance of the differences in performance of each method through a t-test. As shown in Table 3, all the p-values are less than 0.05, which means the differences are statistically significant. Besides, we calculate the F1-scores for each malware type as shown in Table 4. The proposed method has the best performance on all the malware types except 'Gatak', but there is little difference. The detailed results are shown in Fig. 5 as a confusion matrix. The averages shown in Table 4 differ from those in Table 3 because we do not consider the number of data in each class when we calculate the averages in Table 4.

4.3. Effects of Generating Malware

The generated malware images are illustrated in Fig. 6 together with the actual images. They look highly similar to the real images. However, the image cannot be converted back to code because the size of the image was reduced.
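As a rough illustration of the conversion described above (a sketch, not the authors' code), raw bytes can be mapped to grayscale rows and then reduced by nearest-neighbor sampling; the stand-in input bytes and the pure-Python resizing below are assumptions of this example:

```python
def bytes_to_image(data: bytes, width: int = 128) -> list:
    """Map raw bytes (0-255) to grayscale rows, Nataraj-style."""
    rows = [list(data[i:i + width]) for i in range(0, len(data), width)]
    if rows and len(rows[-1]) < width:              # zero-pad the last row
        rows[-1] += [0] * (width - len(rows[-1]))
    return rows

def resize_nearest(img: list, size: int = 128) -> list:
    """Nearest-neighbor reduction to size x size. This step discards
    information, which is why the image cannot be mapped back to code."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]

raw = bytes(range(256)) * 512                       # stand-in for a binary
img = resize_nearest(bytes_to_image(raw), 128)
scaled = [[p / 255.0 for p in row] for row in img]  # normalize to [0, 1]
```

In practice a library resampler (e.g., Pillow's `Image.resize`) would replace the hand-rolled loop; the point is that the reduction is lossy.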
Table 2 – Summary of malware data used in this study. The data are split for training and test at a ratio of 9:1.

Malware type      # training   # test   Description
Ramnit            1387         153      Worm
Lollipop          2249         229      Adware
Kelihos ver. 3    2620         322      Backdoor
Vundo             435          40       Trojan
Simda             35           7        Backdoor
Tracur            688          63       TrojanDownloader
Kelihos ver. 1    358          40       Backdoor
Obfuscator.ACY    1110         118      Obfuscated malware
Gatak             898          115      Backdoor

Table 3 – Numeric results of experiments. Since the p-values are less than 0.05, the differences in performance between the models are statistically significant.

             Image         Code          Image+Code    The Proposed
Accuracy     94.49%        94.20%        95.86%        97.47%
Std. dev.    1.01 × 10−2   1.06 × 10−2   0.69 × 10−2   1.10 × 10−2
p-value      0.43 × 10−2   0.54 × 10−2   0.96 × 10−2   -

Table 4 – F1-scores of CNN (Image), LSTM (Code), CNN+LSTM (Image+Code), and the proposed method for different malware types.

Type              Image    Code     Image+Code   The Proposed
Ramnit            0.9279   0.8693   0.6705       0.9742
Lollipop          0.9494   0.9742   0.9578       0.9806
Kelihos ver. 3    0.9984   0.9969   0.9969       1.0000
Vundo             0.9630   0.9351   0.9000       0.9877
Simda             0.4444   0.4444   0.2500       0.5455
Tracur            0.8870   0.8955   0.9043       0.9344
Kelihos ver. 1    0.9620   0.9367   0.9620       0.9750
Obfuscator.ACY    0.9250   0.9333   0.9191       0.9623
Gatak             0.8485   0.8560   0.9487       0.9464
Average           0.8817   0.8713   0.8677       0.9229
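The averaging behind the last row of Table 4 weights every class equally, regardless of its size, which is why it can differ from support-weighted summaries. A minimal sketch (the per-class F1 values and class counts below are illustrative, not the paper's data; `sklearn.metrics.f1_score` performs the same aggregation via `average='macro'` and `average='weighted'`):

```python
def f1_averages(per_class_f1, class_sizes):
    """Macro F1 ignores class sizes; weighted F1 scales by support."""
    macro = sum(per_class_f1) / len(per_class_f1)
    total = sum(class_sizes)
    weighted = sum(f * n for f, n in zip(per_class_f1, class_sizes)) / total
    return macro, weighted

# Illustrative values: one tiny class with a low F1 drags the macro
# average down far more than it drags the support-weighted average.
f1 = [0.97, 0.99, 0.55]        # hypothetical per-class F1-scores
n  = [153, 322, 7]             # hypothetical test-set class sizes
macro, weighted = f1_averages(f1, n)
```

This mirrors why a rare class such as Simda can pull the macro average well below the overall accuracy.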
We leave the issue of reducing the image size to future work. We find data in the dataset that has a similar form to the produced data using the structural similarity index measure (SSIM). The std. dev. and SSIM of the malware including both real and produced data are 0.3308 and 0.1019, while those of only the actual data are 0.3224 and 0.1068, respectively. This shows that the produced data increase the variety of the malware. We performed an F-test to check the statistical significance of the difference in standard deviation, resulting in a p-value of 0.0342.

The malware image is roughly divided into text and data subsections (Nataraj et al., 2011). Among the methods of creating malware are text-segment infections and data-segment infections, which add malicious code to the text or data subsections. We show that our model detects malware created in these ways using the class activation map (CAM), a method to find the features that affect the results of the model (Zhou et al., 2016). Fig. 7 illustrates the results of applying CAM to the malware images; the brighter a feature, the more important it is. The malware images in the first and second rows are the results of detection by features in the text and the data subsections, respectively. The last row shows the CAM results of the generated malware, which shows that the detector is strengthened by generating key features in the text and the data subsections.

5. Conclusions

In this paper, we address the important problems caused by malware. Newly generated malware contains modifications such as null value insertion, and other malware hides its malicious behavior before execution. However, previous research lacked the ability to detect malware using all of these local and global characteristics. We propose a novel method to generate virtual malware to expand the range of the knowledge space as well as to detect malware with both local and global features: the malware images and the sequence of raw malicious code. The proposed method achieves an accuracy of 97.47%, which is the highest performance compared with the conventional models.

In the future, we will generate the sequence of malware code as well as the malware images. After this process, we will analyze the generated code and compare it with the real data.
Fig. 2 – Architecture of the proposed method. Our method is composed of three modules: (a) extracting features of malware images through the encoder E and generating malware from the pre-defined latent space, (b) handling the temporal features, and (c) detecting malware by exploiting the previous models. The feature extractors for the malware image and the binary code sequence, f_i and f_t, process each kind of data and pass the results to the detector.
A complete system for detecting malware will be constructed and used in the real world. In particular, as many Android apps are installed on smartphones, we want to analyze the various characteristics of Android malware and develop detectors based on them.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Fig. 7 – Visualization of the class activation map. The brighter a feature (pixel), the more important it is. The images in the first and the second rows are for real malware.
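The CAM computation visualized in Fig. 7 can be sketched as follows (a toy reconstruction, not the authors' implementation): the map for a class is the sum of the final convolutional feature maps, each weighted by that class's output weight (Zhou et al., 2016); the random feature maps and weights below are stand-ins.

```python
import random

def class_activation_map(feature_maps, class_weights):
    """CAM(x, y) = sum_k w_k * f_k(x, y): weight each final-layer
    feature map by the class's output weight, then sum over channels."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[sum(wk * fk[r][c] for wk, fk in zip(class_weights, feature_maps))
            for c in range(w)] for r in range(h)]
    lo = min(min(row) for row in cam)        # min-max normalize so that
    hi = max(max(row) for row in cam)        # brighter = more important
    return [[(v - lo) / (hi - lo + 1e-9) for v in row] for row in cam]

random.seed(0)
# Toy stand-ins: 4 feature maps of 8x8 from a final conv layer, and the
# detector's output weights for one class (both randomly generated here).
maps = [[[random.random() for _ in range(8)] for _ in range(8)]
        for _ in range(4)]
weights = [random.random() for _ in range(4)]
cam = class_activation_map(maps, weights)
```

In a real model the weights would come from the classifier's output layer and the map would be upsampled to the input resolution before overlaying it on the malware image.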

Acknowledgement

This work was supported by an IITP grant funded by the Korean government (MSIT) (No. 2020-0-01361, Artificial Intelligence Graduate School Program (Yonsei University)) and the Air Force Defense Research Sciences Program funded by the Air Force Office of Scientific Research.

REFERENCES

Acarali D, Rajarajan M, Komninos N, Zarpelao BB. Modelling the Spread of Botnet Malware in IoT-Based Wireless Sensor Networks. Security and Communication Networks 2019;2019:1–13.
Ahmadi M, Ulyanov D, Semenov S, Trofimov M, Giacinto G. Novel Feature Extraction, Selection and Fusion for Effective Malware Family Classification. In: Proc. of the ACM Conf. on Data and Application Security and Privacy; 2016. p. 183–94.
Akritidis P, Anagnostakis K, Markatos EP. Efficient Content-based Detection of Zero-day Worms. IEEE Int. Conf. on Communications 2005;2:837–43.
Arjovsky M, Chintala S, Bottou L. Wasserstein Generative Adversarial Networks. In: Int. Conf. on Machine Learning; 2017. p. 214–23.
Arnold A, Nallapati R, Cohen W. A Comparative Study of Methods for Transductive Transfer Learning. In: IEEE Int. Conf. on Data Mining; 2007. p. 77–82.
Berlin K, Slater D, Saxe J. Malicious Behavior Detection using Windows Audit Logs. Artificial Intelligence and Security 2015:35–44.
Biswal SR, Swain SK. Model for Study of Malware Propagation Dynamics in Wireless Sensor Network. In: Int. Conf. on Trends in Electronics and Informatics; 2019. p. 647–53.
Christodorescu M, Jha S, Seshia SA, Song D, Bryant RE. Semantics-Aware Malware Detection. Security and Privacy 2005:32–46.
Creswell A, Bharath AA. Inverting the Generator of a Generative Adversarial Network. IEEE Trans. on Neural Networks and Learning Systems 2018, to be published.
Dahl GE, Stokes JW, Deng L, Yu D. Large-scale Malware Classification using Random Projections and Neural Networks. In: IEEE Int. Conf. on Acoustics, Speech and Signal Processing; 2013. p. 3422–6.
Dhammi A, Singh M. Behavior Analysis of Malware Using Machine Learning. In: IEEE Int. Conf. on Contemporary Computing; 2015. p. 481–6.
Drew J, Moore T, Hahsler M. Polymorphic Malware Detection using Sequence Classification Methods. Security and Privacy Workshops 2016:81–7.
Garcia FCC, Muga II FP. Random Forest for Malware Classification. arXiv preprint arXiv:1609.07770, 2016.
Glorot X, Bengio Y. Understanding the Difficulty of Training Deep Feedforward Neural Networks. In: Int. Conf. on Artificial Intelligence and Statistics; 2010. p. 249–56.
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative Adversarial Nets. Advances in Neural Information Processing Systems 2014:2672–80.
Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC. Improved Training of Wasserstein GANs. Advances in Neural Information Processing Systems 2017:5767–77.
Guo Y, Jiao L, Wang S, Wang S, Liu F. Fuzzy Sparse Autoencoder Framework for Single Image Per Person Face Recognition. IEEE Trans. on Cybernetics 2018;48(8):2402–15.
Han J, Zhang D, Wen S, Guo L, Liu T, Li X. Two-stage Learning to Predict Human Eye Fixations via SDAEs. IEEE Trans. on Cybernetics 2016;46(2):487–98.
Hassen M, Carvalho MM, Chan PK. Malware Classification Using Static Analysis based Features. In: IEEE Symposium Series on Computational Intelligence; 2017. p. 1–7.
Hassen M, Chan PK. Scalable Function Call Graph-based Malware Classification. In: Proc. of the ACM Conf. on Data and Application Security and Privacy; 2017. p. 239–48.
Hosseini-Asl E, Zurada JM, Nasraoui O. Deep Learning of Part-Based Representation of Data Using Sparse
Autoencoders with Nonnegativity Constraints. IEEE Trans. on Neural Networks and Learning Systems 2016;27(12):2486–98.
Ioffe S, Szegedy C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167, 2015.
Jordaney R, Sharad K, Dash SK, Wang Z, Papini D, Nouretdinov I, Cavallaro L. Transcend: Detecting Concept Drift in Malware Classification Models. In: USENIX Security Symposium; 2017. p. 625–42.
Kim Y, Jernite Y, Sontag D, Rush AM. Character-Aware Neural Language Models. In: AAAI Conf. on Artificial Intelligence; 2016. p. 2741–9.
Kim JY, Bu SJ, Cho SB. Malware Detection Using Deep Transferred Generative Adversarial Networks. In: Int. Conf. on Neural Information Processing; 2017. p. 556–64.
Kim JY, Bu SJ, Cho SB. Zero-day Malware Detection using Transferred Generative Adversarial Networks based on Deep Autoencoders. Information Sciences 2018;460-461:83–102.
Kim JY, Cho SB. Detecting Intrusive Malware with a Hybrid Generative Deep Learning Model. In: Int. Conf. on Intelligent Data Engineering and Automated Learning; 2018. p. 499–507.
Kim T, Kang B, Rho M, Sezer S, Im EG. A Multimodal Deep Learning Method for Android Malware Detection Using Various Features. IEEE Trans. on Information Forensics and Security 2019;14(3):773–88.
Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
Kingma DP, Welling M. Auto-encoding Variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
Lin CH, Pao HK, Liao JW. Efficient Dynamic Malware Analysis using Virtual Time Control Mechanics. Computers & Security 2018;73:359–73.
Narayanan BN, Djaneye-Boundjou O, Kebede TM. Performance Analysis of Machine Learning and Pattern Recognition Algorithms for Malware Classification. In: Aerospace and Electronics Conf. and Ohio Innovation Summit; 2016. p. 338–42.
Nataraj L, Karthikeyan S, Jacob G, Manjunath BS. Malware Images: Visualization and Automatic Classification. In: Proc. of the Int. Symposium on Visualization for Cyber Security; 2011.
Nicolau M, McDermott J. Learning Neural Representations for Network Anomaly Detection. IEEE Trans. on Cybernetics 2018;99:1–14.
Norton AP, Qi Y. Adversarial-Playground: A Visualization Suite Showing How Adversarial Examples Fool Deep Learning. Visualization for Cyber Security 2017:1–4.
Ojha RP, Srivastava PK, Sanyal G, Gupta N. Improved Model for the Stability Analysis of Wireless Sensor Network Against Malware Attacks. Wireless Personal Communications 2020:1–24.
Radford A, Metz L, Chintala S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv preprint arXiv:1511.06434, 2015.
Ronen R, Radu M, Feuerstein C, Yom-Tov E, Ahmadi M. Microsoft Malware Classification Challenge. arXiv preprint arXiv:1802.10135, 2018.
Su X, Zhang D, Li W, Zhao K. A Deep Learning Approach to Android Malware Feature Learning and Detection. In: IEEE TrustCom/BigDataSE/ISPA; 2016. p. 244–51.
Wang B, Gao J, Qi Y. A Theoretical Framework for Robustness of (Deep) Classifiers under Adversarial Noise. arXiv preprint arXiv:1612.00334, 2016.
Wang Q, Guo W, Zhang K, Ororbia II AG, Xing X, Giles CL, Liu X. Adversary Resistant Deep Neural Networks with an Application to Malware Detection. In: Int. Conf. on Knowledge Discovery and Data Mining; 2017. p. 1145–53.
Wang W, Zhao M, Wang J. Effective Android Malware Detection with a Hybrid Model based on Deep Autoencoder and Convolutional Neural Network. Journal of Ambient Intelligence and Humanized Computing 2019;10(8):3035–43.
Ye Y, Chen L, Hou S, Hardy W, Li X. DeepAM: A Heterogeneous Deep Learning Framework for Intelligent Malware Detection. Knowledge and Information Systems 2017:1–21.
Ye Y, Hou S, Chen L, Lei J, Wan W, Wang J, Xiong Q, Shao F. AiDroid: When Heterogeneous Information Network Marries Deep Neural Network for Real-time Android Malware Detection. arXiv preprint arXiv:1811.01027, 2018.
You I, Yim K. Malware Obfuscation Techniques: A Brief Survey. In: IEEE Int. Conf. on Broadband, Wireless Computing, Communication and Applications; 2010. p. 297–300.
Yuan Z, Lu Y, Wang Z, Xue Y. Droid-Sec: Deep Learning in Android Malware Detection. ACM Conf. on Special Interest Group on Data Communication 2014:371–2.
Zeiler MD, Fergus R. Visualizing and Understanding Convolutional Networks. In: European Conference on Computer Vision; 2014. p. 818–33.
Zeng K, Yu J, Wang R, Li C, Tao D. Coupled Deep Autoencoder for Single Image Super-resolution. IEEE Trans. on Cybernetics 2017;47(1):27–37.
Zhang S, Ji R, Hu J, Lu X, Li X. Face Sketch Synthesis by Multidomain Adversarial Learning. IEEE Trans. on Neural Networks and Learning Systems 2018a, to be published.
Zhang J, Peng Y, Yuan M. SCH-GAN: Semi-Supervised Cross-Modal Hashing by Generative Adversarial Network. IEEE Trans. on Cybernetics 2018b, to be published.
Zhang B, Xiong D, Su J, Qin Y. Alignment-Supervised Bidimensional Attention-Based Recursive Autoencoders for Bilingual Phrase Representation. IEEE Trans. on Cybernetics 2018c, to be published.
Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning Deep Features for Discriminative Localization. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition; 2016. p. 2921–9.

J.-Y. Kim received a BS degree in mathematics from Yonsei University, Seoul, South Korea, in 2018. He is currently pursuing an MS degree at the Department of Computer Science, Yonsei University. His current research interests include neural networks, artificial intelligence, and controllable representation learning.

S.-B. Cho received a BS degree in computer science from Yonsei University, Seoul, Korea and an MS degree and PhD in computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea. He was an invited researcher in the Human Information Processing Research Laboratories at the Advanced Telecommunications Research (ATR) Institute, Kyoto, Japan from 1993 to 1995, and a visiting scholar at the University of New South Wales, Canberra, Australia in 1998. He was also a visiting professor at the University of British Columbia, Vancouver, Canada from 2005 to 2006. Since 1995, he has been a professor in the Department of Computer Science, Yonsei University. His research interests include neural networks, pattern recognition, intelligent man-machine interfaces, evolutionary computation, and artificial life. Dr. Cho was awarded outstanding paper prizes from the IEEE Korea Section in 1989 and 1992, and another from the Korea Information Science Society in 1990. He
was also the recipient of the Richard E. Merwin prize from the IEEE Computer Society in 1993. He was listed in Who's Who in Pattern Recognition from the International Association for Pattern Recognition in 1994, and received best paper awards at the International Conference on Soft Computing in 1996 and 1998. In addition, he received the best paper award at the World Automation Congress in 1998, and was listed in Marquis Who's Who in Science and Engineering in 2000 and Marquis Who's Who in the World in 2001. He is a Senior Member of IEEE and a Member of the Korea Information Science Society, INNS, the IEEE Computa.
