Toward a Deep Learning Approach for Detecting PHP Webshell
SoICT 2019, December 4–6, 2019, Hanoi - Ha Long Bay, Viet Nam N-H. NGUYEN et al.
Table 1: The 15 opcodes most used by malicious files

Opcode       Description
stosq        Store String
syscall      Fast System Call
setno        Set Byte on Condition - not overflow (OF=0)
cvtsd2si     Convert Scalar Double-FP Value to DW Integer
movmskpd     Extract Packed Double-FP Sign Mask
prefetcht1   Prefetch Data Into Caches
fprem        Partial Remainder (for compatibility with i8087 and i287)
cmpsq        Compare String Operands
lodsq        Load String
scasq        Scan String
cvtss2si     Convert Scalar Single-FP Value to DW Integer
fnsave       Store x87 FPU State
orpd         Bitwise Logical OR of Double-FP Values
fxsave       Save x87 FPU, MMX, XMM, and MXCSR State
movmskps     Extract Packed Single-FP Sign Mask

Figure 2: Webshell detection process using Yara.
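The matching step of the Yara-based process in Figure 2 can be sketched in simplified form. The following Python stand-in is not real Yara (Yara rules have their own syntax and are compiled by the Yara engine); it only illustrates the idea of matching a set of named signatures against PHP source, and the rule names and patterns below are hypothetical, not taken from any actual ruleset:

```python
import re

# Illustrative signatures (not a real Yara ruleset): constructs that are
# typical of PHP webshells, such as executing decoded payloads or passing
# request parameters to system commands.
RULES = {
    "eval_base64": re.compile(r"eval\s*\(\s*base64_decode\s*\("),
    "system_request": re.compile(r"system\s*\(\s*\$_(GET|POST|REQUEST)"),
}

def match_rules(php_source):
    """Return the names of all rules whose pattern occurs in the source."""
    return [name for name, pattern in RULES.items() if pattern.search(php_source)]

sample = "<?php eval(base64_decode($_POST['c'])); ?>"
print(match_rules(sample))  # -> ['eval_base64']
```

In practice, real rulesets are compiled and matched with the Yara engine itself rather than with ad-hoc regular expressions; this sketch only mirrors the flag-file-if-any-signature-matches behavior of the pipeline in Figure 2.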
It can be said that, in the problem of detecting malicious code by pattern matching with a Yara rule set, the pattern matching technique itself only determines the resource usage and computation speed; the accuracy is determined entirely by the Yara ruleset. In this study, we use the latest Yara ruleset for detecting PHP webshells from GitHub (https://github.com/Yara-Rules/rules) in conjunction with the one we collected during our research.

Opcode stands for Operation Code: the portion of a machine language instruction that specifies the operation to be performed (https://en.wikipedia.org/wiki/Opcode). When programming in PHP or any other language, we can extract the list of opcodes used [4]. When compiling statistics over the lists of opcodes produced from benign files and malicious files, we can easily see the huge difference between them. This can be explained by the fact that the opcodes used by malicious files tend to perform data theft, act on the system to gain control, or inspect the system environment to hide their behavior, while benign files rarely do these things. Taking functions related to virtualized operation as an example: malicious files often use such functions to check whether they are being executed in a virtualized environment, and if so, they do not execute their malicious behavior, to avoid detection. Because of this, machine learning approaches often use the sequence of opcodes to predict whether a file is malicious or benign. According to the statistical results of Bragen and Simen Rune [5], the 15 opcodes most used by malicious files are those shown in Table 1.

2.2 Webshell Detection by Deep Learning Approaches

Deep learning is the application of deep neural networks to machine learning. Deep learning is capable of simulating complex functions by learning deep nonlinear network structures to solve complex problems. A neural network consists of an input layer, followed by a list of hidden layers, and ends with an output layer; the output of each layer becomes the input of the next. Unlike classical machine learning techniques, deep learning is trained by learning features rather than task-specific algorithms. Different layers of a neural network automatically learn features at different levels, so it can work on raw data without any need for manual feature engineering. One preeminent advantage of deep learning is that larger training data lets it learn more robust features. One of the most famous deep learning techniques is the Convolutional Neural Network (CNN), in which the local receptive field from the previous layer is handled in a sliding window. Because of these advantages, deep learning techniques are increasingly applied in the field of malware detection [8].

2.3 Related Work

In this section, we briefly introduce some related research and solutions regarding malware, including some popular webshell detectors and malware detection based on deep learning.

Web Shell Detector (http://www.shelldetector.com/) is a Python tool that helps detect webshells. This product is quite a good solution, as it is easy to use, develop, and customize. However, the webshell pattern set in the Web Shell Detector database is not up to date and is also very limited.

PHP Malware Finder (https://github.com/nbs-system/php-malware-finder) is also an effective tool to scan for webshells with its Yara-based rules. Because its detection mechanism is quite simple, the false positive rate in the final results is somewhat high. Also, PHP Malware Finder can only flag suspicious files; it does not show whether a file is precisely a webshell or a dangerous file.

VirusTotal (https://virustotal.com) is an online service that supports analyzing suspicious files, including viruses, worms, and Web application malware, through the detectors of tens of anti-virus products. However, it is limited to at most one file at a time, whatever its nature. This restriction can make it time-consuming, and it is hardly suitable for validating a whole Web project.

In the research of Yingying and Wang [9], a malware detection system using deep learning on API calls is proposed. Based on Cuckoo Sandbox (https://cuckoosandbox.org/), a solution for automated analysis of malicious code, they extracted the API call sequences of malicious programs, then used deep learning techniques such as GRU, BGRU, LSTM, SimpleRNN, and BLSTM to train and test on a
dataset including 21,378 samples. The results show that BLSTM has the best performance for malware detection, reaching an accuracy of 97.85%.

Kemal Ozkan [18] uses image processing techniques to detect malicious code. Observing that image-based techniques had been developed together with feature extraction and classifiers in order to discover the relation between malware binaries in grayscale representation, they applied CNN features to the malware detection problem. With a dataset consisting of 12,279 malware samples, the classifier has an 85% accuracy rate, which increased to 99% on a dataset containing 9,339 samples.

Another research using CNN to detect webshells, by Yifan Tian [14], focuses on the HTTP requests of a web service. They use the word2vec technique to segment HTTP requests into HTTP symbol words, so that each HTTP request can be represented as a matrix. With this matrix representation, they applied a CNN to extract features and train a model for detecting malicious webshells.

Using 35 different features extracted from packet flows, M. Yeo [17] proposed an automated malware detection method based on a convolutional neural network (CNN), multi-layer perceptron (MLP), support vector machine (SVM), and random forest (RF). With a netflow capture from Stratosphere IPS, in which nine different public malware packets and normal-state packets were converted to flow data, they show >85% accuracy, precision, and recall for all classes using CNN and RF.

3.1 Approach

Each technique has its own advantages and disadvantages. For the pattern matching method, the True Positive rate when detecting known types of malicious code is extremely high, but this method has difficulty predicting unknown types of malicious code. As for the CNN deep learning method, the prediction model only approaches high accuracy if we build a correct training dataset. In the process of researching and developing the training dataset, we had difficulty finding malicious code samples. A dataset of benign PHP files is not difficult to build from the source code of popular content management systems (CMS) written in PHP, such as Wordpress, Joomla, or Drupal. As for the malicious code dataset, although we tried to use the most reliable data sources, most of the datasets we found also contained clean files, which led to inaccurate training results. With thousands of files in each dataset, it is difficult to remove the clean files manually. Therefore, our idea is to use a Yara-ruleset-based malware detection method to standardize the dataset of malicious code files, which serves as the training input data for the CNN learning model.

From that, our method to detect PHP webshells is based on three stages:

• Convert the PHP source files to numerical sequences of PHP opcodes. These opcodes are used to remove the duplicate PHP files from both the benign and webshell datasets.
• Build the clean datasets of both benign and webshell samples for both the training and testing sets. For that, pattern matching by applying Yara rules is chosen to generate the clean datasets.
• Build the Convolutional Neural Network model by the deep learning approach with the clean datasets. This model will be used to predict whether a PHP file embeds malicious code such as a webshell.

We will detail the last two stages in the next subsections.

3.2 Building Clean Datasets

Our idea for building clean datasets is shown in Figure 3.

Figure 3: Building Clean Datasets using Yara rulesets.

As we can see in Figure 3, at the beginning, to eliminate the fake malicious files in the webshell datasets, we use Yara-based webshell detection by applying Yara rulesets to the raw datasets. After that, the training dataset consisting of benign PHP files and malicious PHP files is translated to opcode sequences via an Opcode Converter. This converter also eliminates duplicated opcode sequences during conversion, to avoid affecting the accuracy when training the model. Duplicated opcode sequences from completely different PHP files can be explained by the fact that an opcode sequence is a sequence of numbers representing the list of called Operation Code functions; if two files happen to call the same list of opcode functions, their opcode sequences will be the same. At the end of the process, we have the clean benign/webshell datasets for both the training and testing phases.
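The Opcode Converter's duplicate-elimination step can be sketched as follows. The file names and numeric opcode values are hypothetical, and the sketch assumes each file has already been translated to its opcode sequence:

```python
def deduplicate_by_opcodes(files):
    """Keep only the first file for each distinct opcode sequence.

    files: list of (filename, opcode_sequence) pairs, where the opcode
    sequence is a list of numeric opcodes.
    """
    seen = set()
    unique = []
    for name, opcodes in files:
        key = tuple(opcodes)  # hashable signature of the whole sequence
        if key not in seen:
            seen.add(key)
            unique.append(name)
    return unique

dataset = [
    ("shell_a.php", [60, 38, 116]),
    ("shell_b.php", [60, 38, 116]),  # same opcodes as shell_a -> dropped
    ("shell_c.php", [60, 50, 62]),
]
print(deduplicate_by_opcodes(dataset))  # -> ['shell_a.php', 'shell_c.php']
```

Comparing whole opcode sequences rather than raw file bytes is what removes trivially repacked copies of the same webshell, since two files that call the same list of opcode functions produce identical sequences.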
To train and validate our proposed method of detecting PHP webshells, we divided the benign and webshell datasets into two parts with a ratio of 7:3, as the rule of thumb [16]. Based on the distribution of files in the dataset sources, the training/testing split is chosen by whole sources. The following table shows our final datasets for training and testing.

Table 2: Raw Benign and Webshell Datasets

                   Training Set   Testing Set
Benign Dataset     5,802          1,598
Webshell Dataset   3,684          487

To convert the PHP files into opcodes, we use the vld extension of the PHP engine (https://github.com/derickr/vld) to implement the opcode converter. Based on this tool, the raw datasets are first cleaned by removing duplicate opcode sequences. The resulting non-duplicate datasets are shown in Table 3:

Table 3: Non-duplicate Benign and Webshell Datasets

                   Training Set   Testing Set
Benign Dataset     4,875          1,182
Webshell Dataset   1,049          275

4.3 Experiment Results

a. Pattern Matching based Detection
From the non-duplicate webshell training dataset, we generated 3,242 Yara rules based on our previous research [15]. We used these rules to detect the PHP webshells in the non-duplicate testing datasets (both benign and webshell). Table 4 shows the resulting confusion matrix.

Table 4: Confusion matrix of PHP webshell detection by using Yara rules

                     Real Benign   Real Webshell
Predicted Benign     1,180         2
Predicted Webshell   25            250

From that, the performance of our Yara-based PHP webshell detector is illustrated by the following table:

Table 5: Accuracy, Precision, Recall, F1-score and FPR of Yara based testing (%)

           Accuracy   Precision   Recall   F1-Score   FPR
Benign     98.15      97.93       99.83    98.87      9.09
Webshell   98.15      99.21       90.91    94.88      0.17

These experiment results are clearly better than the results published in [15], which had a detection F1-Score of 92%.

b. CNN based Detection
As in the previous experiment, we used the non-duplicate training datasets to train the CNN model, using the tensorflow engine. The maximum sequence length of opcodes in our datasets is 44,335. Therefore, we pad all training opcode sequences with the value 0 (meaning no-operation) so that they share the same maximum length.

The configuration of the CNN network is based on a maximum of 100,000 inputs, 128 outputs, and three 1D-convolution layers. Through different training runs, we finally chose filter sizes of 3, 4, and 5 for the three layers respectively; dropout is 0.5; the activation function is softmax; the optimizer is adam; the learning_rate is 0.08; the loss function is categorical_crossentropy; the validation set is 10%; batch_size is 96; and epochs are 32.

By using this CNN model, we evaluated the test datasets and obtained the results illustrated by the confusion matrix in Table 6 and the scores in Table 7.

Table 6: Confusion matrix of PHP webshell detection by using CNN model

                     Real Benign   Real Webshell
Predicted Benign     1,157         10
Predicted Webshell   25            265

Table 7: Accuracy, Precision, Recall, F1-score and FPR of CNN based testing (%)

           Accuracy   Precision   Recall   F1-Score   FPR
Benign     97.60      99.14       97.88    98.51      3.64
Webshell   97.60      91.38       96.36    93.81      2.12

c. Yara and CNN based Detection
From the experiments above, it is clear that the CNN-based detection model has a lower F1-Score and accuracy, and a higher FPR, in comparison with the Yara-based detection model. However, after reviewing the misdetected samples, we found that these samples merely contain very common functions, such as the fread or file_put_contents functions that write contents to a file. We also looked in detail at the raw datasets and found that the webshell datasets contain some wrong samples: files placed in the webshell datasets that are in fact benign, and similarly for the benign datasets.

From that, we decided to combine the Yara-based detector with the CNN-based model. Firstly, we clean all non-duplicate datasets using the Yara-based detector in order to remove the fake webshells. After that, we get the cleaned datasets, and these datasets are used to train and test the CNN-based model of webshell detection. The cleaned datasets we obtained are summarized in Table 8:

Table 8: Cleaned Benign and Webshell Datasets

                   Training Set   Testing Set
Benign Dataset     4,871          1,180
Webshell Dataset   618            250

By using these datasets, we trained the CNN model with the same settings as in the previous experiments. After that, the
cleaned test datasets were used to evaluate this model. The results we obtained are shown in the confusion matrix in Table 9 and the scores in Table 10.

Table 9: Confusion matrix of PHP webshell detection by using Yara+CNN model

                     Real Benign   Real Webshell
Predicted Benign     1,170         4
Predicted Webshell   10            246

Table 12: Comparison of different webshell detection approaches (%)

                        Accuracy   F1-Score   FPR
php-malware-finder[1]   94.23      96.46      4.49
Word2Vec+CNN[14]        98.6       98.6       -
RF-GBDT[6]              99.16      99.09      0.68
GuruWS[15]              85.56      92.00      0.00
Yara                    98.15      98.87      0.17
CNN                     97.60      97.88      2.11
Our Yara+CNN            99.02      99.41      0.85
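The per-class scores reported above follow directly from the corresponding confusion matrices. A minimal sketch of the computation (one-vs-rest, treating the named class as positive), checked here against the Benign row of the CNN results in Tables 6 and 7:

```python
def class_scores(tp, fp, fn, tn):
    """Per-class Accuracy, Precision, Recall, F1 and FPR (in %) computed
    from one-vs-rest confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    fpr = fp / (fp + tn)
    return {name: round(value * 100, 2) for name, value in
            {"accuracy": accuracy, "precision": precision,
             "recall": recall, "f1": f1, "fpr": fpr}.items()}

# Benign class of the CNN model (Table 6): 1,157 benign files predicted
# benign (TP), 10 webshells predicted benign (FP), 25 benign files
# predicted webshell (FN), 265 webshells predicted webshell (TN).
# Matches the Benign row of Table 7: 97.60, 99.14, 97.88, 98.51, 3.64.
print(class_scores(tp=1157, fp=10, fn=25, tn=265))
```

Note that accuracy is shared by both rows of each score table because it is computed over the whole matrix, while precision, recall, F1 and FPR change depending on which class is treated as positive.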