
Toward a Deep Learning Approach for Detecting PHP Webshell

Ngoc-Hoa NGUYEN
VNU University of Engineering and Technology
Hanoi, Vietnam
hoa.nguyen@vnu.edu.vn

Viet-Ha LE
Office of the Government
Hanoi, Vietnam
levietha@chinhphu.vn

Van-On PHUNG
Office of the Government
Hanoi, Vietnam
phungvanon@gmail.com

Phuong-Hanh DU
VNU University of Engineering and Technology
Hanoi, Vietnam
hanhdp@vnu.edu.vn

ABSTRACT
The most efficient way to secure Web applications is to search for and eliminate the threats they contain (both malware and vulnerabilities). When the source code of a Web application is available, Web security can be improved by detecting malicious code, such as Web shells. In this paper, we propose a model that uses a deep learning approach to detect and identify malicious code inside PHP source files. Our method relies on (i) pattern matching with Yara rules to build malicious and benign datasets, (ii) converting the PHP source code to a numerical sequence of PHP opcodes, and (iii) applying a Convolutional Neural Network model to predict whether a PHP file embeds malicious code such as a webshell. We validate our approach with different webshell collections from reliable sources published on GitHub. The experimental results show that the proposed method achieves an accuracy of 99.02% with a 0.85% false positive rate.

CCS CONCEPTS
• Security and privacy → Malware and its mitigation; Web application security;

KEYWORDS
pattern matching, yara rules, deep learning, CNN, opcode sequence, webshell detection

ACM Reference Format:
Ngoc-Hoa NGUYEN, Viet-Ha LE, Van-On PHUNG, and Phuong-Hanh DU. 2019. Toward a Deep Learning Approach for Detecting PHP Webshell. In The Tenth International Symposium on Information and Communication Technology (SoICT 2019), December 4–6, 2019, Hanoi - Ha Long Bay, Viet Nam. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3368926.3369733

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
SoICT 2019, December 4–6, 2019, Hanoi - Ha Long Bay, Viet Nam
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-7245-9/19/12...$15.00
https://doi.org/10.1145/3368926.3369733

1 INTRODUCTION
Nowadays, web applications are everywhere, and Web security has received a lot of attention from both researchers and managers. According to Internet Live Stats, as of September 2019 [13], an enormous number of websites are attacked every day (from 25,000 hacked websites per day in April 2015 to 61,750 hacked websites per day in September 2019), with a direct and significant impact on nearly 4.43 billion Internet users. When the source code of a Web application is available, Web security can be improved by detecting malicious code such as a webshell, which is defined as a script installed in the source code of a web application to enable remote administration of the infected server. A webshell can be injected into the system directly by attackers or through a malicious plugin installed by the webmaster [7]. An essential feature of a webshell is command execution. With this unsophisticated weapon, an attacker can do many things, such as manipulating files and folders or listing active processes, or can use it as a backdoor. These webshells are often extremely small, yet their capabilities are diverse and highly adaptable. In addition, they sometimes encode themselves with methods such as base64 or gzinflate for self-defense. Everything is wrapped in a single file, so this type of webshell can be injected quickly.

A webshell can also be installed as another kind of backdoor. For example, CryptoPHP is a hidden backdoor found by FoxIT (https://fox-it.com/). CryptoPHP is a threat that compromises Web servers on a large scale through the installation of unoriginal WordPress, Joomla, and Drupal themes and plug-ins. Its activities and properties include the following: (i) it integrates with popular content management systems such as Drupal, WordPress and Joomla, for example by injecting hyperlinks into post content for Black Hat SEO purposes (https://en.wikipedia.org/wiki/Search_engine_optimization); (ii) it uses asymmetric cryptography (https://en.wikipedia.org/wiki/Public-key_cryptography) with RSA public keys (https://en.wikipedia.org/wiki/RSA_(cryptosystem)) for communication between the victim's server and the C&C server; (iii) if the C&C servers or domains are taken down repeatedly, it can encrypt its data and send it by email to specific addresses; (iv) it supports manual control via HTTP requests; (v) it automatically updates the list of C&C servers; and (vi) it can receive a new version from the C&C server and update itself.

Several popular approaches for securing web applications [3] have been investigated, for example safe web development [11], intrusion detection and protection systems, code review, and web application firewalls. Masood et al. [10] presented an efficient way of securing web applications by searching for and eliminating the vulnerabilities therein. In fact, an attack campaign


is temporary. However, attackers might upload backdoors to the system for persistence, so that they can come back to interact with it and steal information at any time without exploiting any vulnerability. This situation leads to serious consequences [12], since these backdoors are Web shells that allow remote control of files and databases and remote command execution. They are not only flexible but also countless.

Figure 1: Command execution in webshell b374k.

Indeed, the main root causes are web developers' lack of secure programming awareness and their limited ability to discover both malicious web shells and web vulnerabilities. These issues in web application security raise a demand for a solution that allows web developers and penetration testers to detect security-related problems as easily as possible.

In this research, we propose a model that uses a deep learning approach to detect and identify malicious code inside PHP source files. We focus on web applications written in PHP because of its popularity among server-side programming languages: it is used by about 79.0% of all websites (as of September 2019) [2]. Our method relies on 3 techniques. First, we use pattern matching with Yara rules to build malicious and benign datasets. Second, we convert the PHP source code to a numerical sequence of PHP opcodes. Finally, we apply a Convolutional Neural Network model to predict whether a PHP file embeds malicious code such as a webshell.

The rest of this paper is organized into 4 sections. In Section 2, we review basic principles, the literature and related work on malware detection and deep learning techniques. In Section 3, we describe our proposed solution, which combines the 3 techniques mentioned above to detect malicious code in web application source code. In Section 4, we present our experimental results, evaluate our work and provide benchmarks. The last section is dedicated to conclusions and future work.

2 PRELIMINARIES AND RELATED WORK
2.1 Yara and Pattern Matching
A Yara ruleset is a list of rules. Each rule defines strings, called patterns, and a logical condition over the matches and non-matches of those patterns that determines the final result.

    rule example
    {
        meta:
            description = "An example of YARA rule"
        strings:
            $a = "hello"
            $b = { 01 23 45 67 89 ab cd ef }
            $c = /md5: [0-9a-zA-Z]{32}/
        condition:
            $a or $b and $c
    }

Each Yara rule consists of 3 components:

• Meta: stores metadata such as the description, creation date, references, etc.
• Strings: defines the patterns to be matched by the rule. There are 3 types of strings: text, hexadecimal and regular expression.
• Condition: a Boolean expression that determines how the pattern matching results of the individual strings are combined.

Based on algorithm design, there are 3 types of pattern matching techniques: prefix-based matching, suffix-based matching and factor-based matching [15].

• Prefix-based matching: the matching process starts from the top of the sliding window. All characters in the text are read and checked; on a mismatch, the window moves to the next character. This is the simplest strategy, but the number of comparisons is large, so execution is slow.
• Suffix-based matching: the matching process starts from the bottom of the sliding window. It does not read all the consecutive characters in the text; instead, it skips characters based on the comparison result for the characters at the bottom of the sliding window. This reduces the number of comparisons and the complexity of the algorithm.
• Factor-based matching: the matching process starts from the bottom of the sliding window. It does not read all the consecutive characters in the text, but compares particular characters to predict the set of factors (subsamples) of the original sample.

All algorithms have 2 stages: pre-processing and searching. The pre-processing stage builds the Yara ruleset, while the searching stage applies the pattern matching techniques (including regular expressions) based on the Yara ruleset. Figure 2 illustrates the flowchart of PHP webshell detection using Yara and the pattern matching approach.
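As an illustration of the suffix-based strategy and of the two stages (pre-processing and searching), the following is a minimal sketch of the textbook Boyer-Moore-Horspool algorithm. It is shown only as an example of this family of algorithms; it is not claimed to be the matcher used inside Yara, and the function name `horspool_search` is a hypothetical one chosen for this sketch.

```python
def horspool_search(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1

    # Pre-processing stage: for every character of the pattern except the
    # last one, record how far the window may shift when that character is
    # aligned with the last position of the window.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    # Searching stage: compare from the end of the window backwards.
    pos = 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            return pos  # the whole window matched
        # Skip ahead based on the text character under the window's last cell;
        # characters absent from the pattern allow a full-length shift.
        pos += shift.get(text[pos + m - 1], m)
    return -1


if __name__ == "__main__":
    sample = "<?php eval(base64_decode($_POST['x'])); ?>"
    print(horspool_search(sample, "base64_decode"))
```

Because the shift is decided by the last character of the window, many text characters are never inspected, which is the saving that the suffix-based description above refers to.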


Figure 2: Webshell detection process using Yara.

When malicious code is detected by pattern matching with a Yara ruleset, the pattern matching technique only determines the resource usage and the computation speed. Accuracy is determined entirely by the Yara ruleset. In this study, we use the latest Yara ruleset for detecting PHP webshells from GitHub (https://github.com/Yara-Rules/rules) in conjunction with the one we collected during our research.

An opcode (operation code) is the portion of a machine language instruction that specifies the operation to be performed (https://en.wikipedia.org/wiki/Opcode). For a program written in PHP or any other language, we can extract the list of opcodes used [4]. When we compile statistics on the opcode lists produced from benign files and from malicious files, we can easily see a huge difference between them. This can be explained by the fact that the opcodes used by malicious files tend to perform data theft, tamper with the system to gain control, or check the system environment to hide their behavior, while benign files rarely do these things. For example, malicious files often use virtualization-related functions to check whether they are being executed in a virtualized environment; if so, they do not execute their malicious behaviour in order to avoid detection. Because of this, machine learning approaches often use the opcode sequence to predict whether a file is malicious or benign. According to the statistical results of Bragen [5], the 15 opcodes most used by malicious files are those shown in Table 1.

Table 1: Top 15 opcodes used exclusively by malware

Opcode       Description
stosq        Store String
syscall      Fast System Call
setno        Set Byte on Condition - not overflow (OF=0)
cvtsd2si     Convert Scalar Double-FP Value to DW Integer
movmskpd     Extract Packed Double-FP Sign Mask
prefetcht1   Prefetch Data Into Caches
fprem        Partial Remainder (for compatibility with i8087 and i287)
cmpsq        Compare String Operands
lodsq        Load String
scasq        Scan String
cvtss2si     Convert Scalar Single-FP Value to DW Integer
fnsave       Store x87 FPU State
orpd         Bitwise Logical OR of Double-FP Values
fxsave       Save x87 FPU, MMX, XMM, and MXCSR State
movmskps     Extract Packed Single-FP Sign Mask

2.2 Webshell Detection by Deep Learning Approaches
Deep learning is the application of deep neural networks to machine learning. Deep learning is capable of approximating complex functions by learning deep nonlinear network structures, which allows it to solve complex problems. A neural network consists of an input layer, followed by a list of hidden layers, and ends with an output layer. The output of one layer becomes the input of the next layer. Unlike traditional machine learning techniques, deep learning is trained by learning features rather than task-specific algorithms. Different layers of a neural network automatically learn features at different levels. It can therefore work on raw data without any need for manual feature engineering. One preeminent advantage of deep learning is that more training data lets it learn more robust features. One of the most famous examples of deep learning techniques is the Convolutional Neural Network (CNN), in which the local receptive field from the previous layer is handled in a sliding window. Because of these advantages, more and more research applies deep learning techniques to malware detection [8].

2.3 Related Work
In this section, we briefly introduce related research and solutions regarding malware, including some popular Web Shell detectors and malware detection based on deep learning.

Web Shell Detector (http://www.shelldetector.com/) is a python tool that helps detect Web Shells. This product is a fairly good solution, as it is easy to use, extend and customize. However, the Web Shell pattern set in the Web Shell Detector database is not up to date and is also very limited.

PHP Malware Finder (https://github.com/nbs-system/php-malware-finder) is also an effective tool for scanning Web Shells with its YARA-based rules. Because its detection mechanism is quite simple, the false positive rate of its results is somewhat high. In addition, PHP Malware Finder can only flag suspicious files; it cannot tell whether a file is precisely a Web Shell or a dangerous file.

VirusTotal (https://virustotal.com) is an online service that analyzes suspicious files, including viruses, worms and Web application malware, through the detection engines of tens of anti-virus products. However, it accepts only one file per submission, whatever its nature. This restriction makes the process time-consuming, so the service is hardly suitable for validating a whole Web project.

Liu and Wang [9] proposed a malware detection system using deep learning on API calls. Using Cuckoo Sandbox (https://cuckoosandbox.org/), a solution for automated analysis of malicious code, they extracted the API call sequences of malicious programs and then used deep learning techniques such as GRU, BGRU, LSTM, SimpleRNN, and BLSTM to train and test on a


dataset including 21,378 samples. The results show that BLSTM has the best performance for malware detection, reaching an accuracy of 97.85%.

Kemal Ozkan [18] uses image processing techniques to detect malicious code. Noting that some image-based techniques have been developed together with feature extraction and classifiers in order to discover the relation between malware binaries in grayscale representation, they applied CNN features to the malware detection problem. With a dataset consisting of 12,279 malware samples, the classifier has an 85% accuracy rate, which increases to 99% with a dataset containing 9,339 samples.

Another study, by Yifan Tian [14], uses a CNN to detect webshells and focuses on the HTTP requests of web services. They use the 'word2vec' technique to segment the HTTP requests into HTTP symbol words, so that an HTTP request can be represented as a matrix. Given the matrix representation, they apply a CNN to extract features and train the model for detecting malicious webshells.

Using 35 different features extracted from packet flows, M. Yeo [17] proposed an automated malware detection method based on a convolutional neural network (CNN), multi-layer perceptron (MLP), support vector machine (SVM), and random forest (RF). With a netflow capture from Stratosphere IPS, in which nine different public malware packet captures and normal-state packets were converted to flow data, they show >85% accuracy, precision and recall for all classes using CNN and RF.

3 PHP WEBSHELL DETECTION BY DEEP LEARNING METHOD
In this section, we propose a solution that combines pattern matching for malicious code detection using a Yara ruleset with a CNN-based approach.

3.1 Approach
Each technique has its own advantages and disadvantages. For the pattern matching method, the true positive rate for known types of malicious code is extremely high, but the method has difficulty predicting unknown types of malicious code. For the CNN deep learning method, the prediction model only reaches high accuracy if the training dataset is built correctly. While researching and developing the training dataset, we had difficulty finding malicious code samples. A dataset of benign PHP files is not difficult to obtain from the source code of popular PHP content management systems (CMS) such as Wordpress, Joomla or Drupal. As for the malicious code dataset, although we tried to use the most reliable data sources, most of the datasets we found also contained clean files, which led to inaccurate training results. With thousands of files in each dataset, it is difficult to remove the clean files manually. Therefore, our idea is to use a malware detection method based on the Yara ruleset to standardize the dataset of malicious code files that serves as training input for the CNN learning model.

Accordingly, our method for detecting PHP webshells is based on three stages:

• Convert the PHP source files to a numerical sequence of PHP opcodes. These opcodes are used to remove duplicate PHP files from both the benign and webshell datasets.
• Build clean datasets of both benign and webshell samples for the training and testing sets. For that, pattern matching with Yara rules is used to generate the clean datasets.
• Build the Convolutional Neural Network model by the deep learning approach with the clean datasets. This model is used to predict whether a PHP file embeds malicious code such as a webshell.

We detail the last two stages in the next subsections.

3.2 Building Clean Datasets
Our idea for building clean datasets is shown in Figure 3:

Figure 3: Building Clean Datasets using Yara rulesets.

As shown in Figure 3, we first eliminate the fake malicious files from the webshell datasets by applying the Yara-based webshell detection with Yara rulesets to the raw datasets. After that, a training dataset consisting of benign PHP files and malicious PHP files is translated to opcode sequences via an Opcode Converter. This converter also eliminates duplicated opcode sequences during conversion, so that they do not affect accuracy when training the model. Completely different PHP files can produce duplicate opcode sequences because an opcode sequence is a sequence of numbers representing a list of invoked operation codes;


if two files happen to invoke the same list of opcodes, their opcode sequences will be identical. At the end of the process, we have clean benign and webshell datasets for both the training and testing phases.

3.3 Detecting Webshell by CNN Model
We use the CNN model to implement our deep learning approach for detecting webshells in PHP source files. Figure 4 illustrates our training and testing model.

Figure 4: Webshell Detection Using CNN Model.

The training input data consists of the benign and webshell datasets. As mentioned in the previous section, because the webshell dataset is collected from many different sources, it also contains benign files, so we use the pattern matching method with the Yara ruleset to make the webshell data as accurate as possible. The standardized dataset of PHP files is then converted into opcode sequences, and these opcode sequences become the training data for the CNN. The trained model is used to predict the test datasets, classifying each sample as benign or webshell.

In our research, the Convolutional Neural Network is applied to malware detection using opcodes as its raw input data, as shown in Figure 5. The opcodes go through a sequence of convolution layers at different levels. At the end, an output layer outputs the probabilities of the file being malware or benign. By providing a huge amount of training data, we can expect the neural network to learn specific patterns of the malware family as well as powerful features that remain invariant over time and distinguish malware from benign files.

Figure 5: CNN Architecture for Detecting Webshell.

4 EXPERIMENT AND EVALUATION
Based on the proposed method, we built and implemented our solution, named WSDetector, in the python language. The experiments were performed on a computer with 2 x Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz (45MB Cache, 18 cores per CPU), 128GB of main memory, CentOS Linux release 7.4.1708 and python release 2.7. For the deep learning platform, we use tensorflow v.1.14.0, scikit-learn v.0.20.4, scipy v.1.2.2, numpy v.1.16.5 and yara-python v.3.10.0.

4.1 Evaluation Metrics
To evaluate the ability of PHP webshell detection tools, we use two different test sets: one contains malicious PHP web shells and the other is a collection of clean, benign PHP code. We observe the True Positive (TP), False Positive (FP), False Negative (FN) and True Negative (TN) samples, then compute the Accuracy, Precision, Recall (sensitivity, or true positive rate, TPR), F1-score and False Positive Rate (FPR) with the following formulas [15]:

\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}
\]
\[
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
\mathrm{FPR} = \frac{FP}{FP + TN}
\]
\[
F_1\text{-score} = \frac{2\,TP}{2\,TP + FP + FN}
\]

4.2 Datasets
To build the webshell dataset, we collected a wide range of webshells from reliable, highly starred repositories on GitHub: /tennc/webshell, /bartblaze/PHP-backdoors, /b374k/b374k, /JohnTroony/php-webshells, /xl7dev/WebShell, /BlackArch/webshells, /fuzzdb-project/fuzzdb, /LuciferoO/webshell-collector, /ysrc/webshell-sample, /webshellpub/awsome-webshell, /PHP-WebShell-Bypass-WAF, /linuxsec/indoxploit-shell. In total, there are 4,171 PHP webshell files. For the benign dataset, different PHP frameworks, forums and content management systems were collected from their official sites. They include Laravel, Wordpress, Joomla, phpMyAdmin, phpPgAdmin and phpbb (GitHub: https://github.com/laravel/laravel; https://github.com/WordPress/WordPress; https://github.com/joomla/joomla-cms; https://github.com/phpmyadmin/phpmyadmin; https://github.com/phppgadmin/phppgadmin; https://github.com/phpbb/). After removing non-PHP files, the benign set contains 7,400 files in total.

In order to


train and validate our proposed method of detecting PHP webshells, we divided the benign and webshell datasets into two parts with a ratio of 7:3, following the rule of thumb [16]. Based on the distribution of files in the dataset sources, the training/testing split is made by whole sources. Table 2 shows our final datasets for training and testing.

Table 2: Raw Benign and Webshell Datasets

                   Training Set   Testing Set
Benign Dataset     5,802          1,598
Webshell Dataset   3,684          487

To convert the PHP files into opcodes, we use the vld extension of the PHP engine (see more about VLD at https://github.com/derickr/vld) to implement the opcode converter. With this tool, the raw datasets are first cleaned by removing duplicate opcode sequences. The resulting non-duplicate datasets are shown in Table 3:

Table 3: Non-duplicate Benign and Webshell Datasets

                   Training Set   Testing Set
Benign Dataset     4,875          1,182
Webshell Dataset   1,049          275

4.3 Experiment Results
a. Pattern Matching based Detection
From the non-duplicate webshell training dataset, we generated 3,242 Yara rules based on our previous research [15]. We used these rules to detect PHP webshells in the non-duplicate testing datasets (both benign and webshell). Table 4 shows the resulting confusion matrix.

Table 4: Confusion matrix of PHP webshell detection by using Yara rules

                     Real Benign   Real Webshell
Predicted Benign     1,180         25
Predicted Webshell   2             250

The performance of our Yara-based PHP webshell detector is shown in Table 5.

Table 5: Accuracy, Precision, Recall, F1-score and FPR of Yara based testing (%)

           Accuracy   Precision   Recall   F1-Score   FPR
Benign     98.15      97.93       99.83    98.87      9.09
Webshell   98.15      99.21       90.91    94.88      0.17

These experimental results are clearly better than the results published in [15], which had a detection F1-Score of 92%.

b. CNN based Detection
As in the previous experiment, we used the non-duplicate training datasets to train the CNN model with the tensorflow engine. The maximum length of an opcode sequence in our datasets is 44,335. Therefore, we pad all training opcode sequences with the value 0 (meaning no-operation) so that they all have the same maximum length.

Accordingly, the CNN is configured with a maximum of 100,000 inputs, 128 outputs and 3 1D-convolution layers. After several training runs, we finally chose filter sizes of 3, 4 and 5 for the 3 layers respectively; dropout is 0.5; the activation function is softmax; the optimizer is adam; learning_rate is 0.08; the loss function is categorical_crossentropy; the validation set is 10%; batch_size is 96; and the number of epochs is 32.

Using this CNN model, we ran the test datasets and obtained the results given by the confusion matrix in Table 6 and the scores in Table 7.

Table 6: Confusion matrix of PHP webshell detection by using CNN model

                     Real Benign   Real Webshell
Predicted Benign     1,157         10
Predicted Webshell   25            265

Table 7: Accuracy, Precision, Recall, F1-score and FPR of CNN based testing (%)

           Accuracy   Precision   Recall   F1-Score   FPR
Benign     97.60      99.14       97.88    98.51      3.64
Webshell   97.60      91.38       96.36    93.81      2.12

c. Yara and CNN based Detection
In the above experiments, it is clear that the CNN-based detection model has a lower F1-Score and accuracy, and a higher FPR, than the Yara-based detection model. However, after reviewing the misdetected samples, we found that these samples merely contain very common functions, such as the fread or file_put_contents functions that manipulate the contents of a file. We also looked at the raw datasets in detail and found that they contain some wrongly labeled samples: files in the webshell datasets that are actually benign, and similarly for the benign datasets.

We therefore decided to combine the Yara-based detector with the CNN-based model. First, we clean all non-duplicate datasets with the Yara-based detector in order to remove the fake webshells. The cleaned datasets are then used to train and test the CNN-based webshell detection model. The cleaned datasets we obtained are summarized in Table 8:

Table 8: Cleaned Benign and Webshell Datasets

                   Training Set   Testing Set
Benign Dataset     4,871          1,180
Webshell Dataset   618            250

Using these datasets, we trained the CNN model with the same settings as in the previous experiment. After that, the


cleaned test datasets were used to evaluate this model. The results we obtained are shown in the confusion matrix in Table 9 and the scores in Table 10.

Table 9: Confusion matrix of PHP webshell detection by using Yara+CNN model

                     Real Benign   Real Webshell
Predicted Benign     1,170         4
Predicted Webshell   10            246

Table 10: Accuracy, Precision, Recall, F1-score and FPR of Yara+CNN based Detection (%)

               Accuracy   Precision   Recall   F1-Score   FPR
Benign         99.02      99.66       99.15    99.41      1.60
Webshell       99.02      96.09       98.40    97.23      0.85
Micro Avg      99.02      99.66       99.15    99.41      0.85
Macro Avg      99.02      97.88       98.78    98.32      1.22
Weighted Avg   99.02      99.04       99.02    99.03      1.47

We also performed k-fold cross validation for this model. Table 11 shows the results we obtained with k=5 folds.

Table 11: 5-fold Cross Validation Results (%)

          Accuracy   F1-Score   FPR
Fold 1    99.40      98.20      0.00
Fold 2    99.23      97.67      0.92
Fold 3    98.63      95.98      0.93
Fold 4    98.97      96.88      0.00
Fold 5    98.63      96.02      1.14
Average   98.97      96.95      0.60

These results confirm that the CNN model built from the datasets cleaned by the Yara detector is better overall than the Yara-only and CNN-only approaches.

4.4 Evaluation
To assess the performance of our PHP webshell detection method based on Yara and CNN, we compare our results with other approaches. At present, we have not run the evaluation on the same machine and the same datasets (moreover, the source code and clean datasets of the other approaches are not published). Thus, we show only the results of each approach as published by its authors. Note that we use only the accuracy, F1-score and FPR metrics for this comparison. Table 12 compares our Yara+CNN model with the other approaches:

Table 12: Comparison of different webshell detection approaches (%)

                         Accuracy   F1-Score   FPR
php-malware-finder [1]   94.23      96.46      4.49
Word2Vec+CNN [14]        98.6       98.6       -
RF-GBDT [6]              99.16      99.09      0.68
GuruWS [15]              85.56      92.00      0.00
Yara                     98.15      98.87      0.17
CNN                      97.60      97.88      2.11
Our Yara+CNN             99.02      99.41      0.85

5 CONCLUSION
More and more unknown malicious code is being developed for installation into the source code of the web applications that dominate cyberspace, which is a huge challenge for cybersecurity researchers today. In this paper, we proposed an efficient model that combines a deep learning approach with pattern matching based on Yara rules and with the conversion of PHP source code to a numerical sequence of opcodes, in order to predict whether a PHP file embeds malicious code. Our experimental results show that the proposed method (Yara+CNN) achieves an accuracy of 99.02% with a 0.85% false positive rate.

In future work, we aim to extend our method to other programming languages such as ASP, ASP.NET, Java, Python, etc. In addition, we will study and test other deep learning methods such as LSTM, compare them with the current methods, and then select the most accurate predictive model.

ACKNOWLEDGMENTS
This work is partially supported by the national research project No. KC.01.19/16-20, granted by the Ministry of Science and Technology of Vietnam (MOST).

REFERENCES
[1] 2019. PHP Malware Finder. https://github.com/nbs-system/php-malware-finder.
[2] 2019. Web Technology Surveys. http://w3techs.com/technologies/overview/programming_language/all/.
[3] G. P. Bherde and M. A. Pund. 2016. Recent attack prevention techniques in web service applications. In 2016 International Conference on Automatic Control and Dynamic Optimization Techniques (ICACDOT). 1174–1180. https://doi.org/10.1109/ICACDOT.2016.7877771
[4] Daniel Bilar. 2007. Malware detection through opcode sequence analysis using machine learning. Int. J. Electronic Security and Digital Forensics 1 (2007). https://doi.org/10.1504/IJESDF.2007.016865
[5] Simen Rune Bragen. 2015. Opcodes as predictor for malware. VDP::Mathematics and natural science: 400::Information and communication science: 420::Security and vulnerability: 424 1 (01 2015).
[6] H. Cui, D. Huang, Y. Fang, L. Liu, and C. Huang. 2018. Webshell Detection Based on Random Forest–Gradient Boosting Decision Tree Algorithm. In 2018 IEEE Third International Conference on Data Science in Cyberspace (DSC). 153–160. https://doi.org/10.1109/DSC.2018.00030
[7] Z. Cui, F. Xue, X. Cai, Y. Cao, G. Wang, and J. Chen. 2018. Detection of Malicious Code Variants Based on Deep Learning. IEEE Transactions on Industrial Informatics 14, 7 (July 2018), 3187–3196. https://doi.org/10.1109/TII.2018.2822680
[8] Z. Cui, F. Xue, X. Cai, Y. Cao, G. Wang, and J. Chen. 2018. Detection of Malicious Code Variants Based on Deep Learning. IEEE Transactions on Industrial Informatics 14, 7 (July 2018), 3187–3196. https://doi.org/10.1109/TII.2018.2822680
[9] Y. Liu and Y. Wang. 2019. A Robust Malware Detection System Using Deep Learning on API Calls. In 2019 IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC). 1456–1460. https://doi.org/10.1109/ITNEC.2019.8728992
[10] A. Masood and J. Java. 2015. Static analysis for web service security - Tools & techniques for a secure development life cycle. In 2015 IEEE International Symposium on Technologies for Homeland Security (HST). 1–6. https://doi.org/10.1109/THS.2015.7225337
[11] M. Mazumder and T. Braje. 2016. Safe Client/Server Web Development with Haskell. In 2016 IEEE Cybersecurity Development (SecDev). 150–150. https://doi.org/10.1109/SecDev.2016.040

SoICT 2019, December 4–6, 2019, Hanoi - Ha Long Bay, Viet Nam N-H. NGUYEN et al.

[12] M. A. E. Mohd Efendi, Z. Ibrahim, M. N. Ahmad Zawawi, F. Abdul Rahim, N. A. Muhamad Pahri, and A. Ismail. 2019. A Survey on Deception Techniques for Securing Web Application. In 2019 IEEE 5th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS). 328–331. https://doi.org/10.1109/BigDataSecurity-HPSC-IDS.2019.00066
[13] Internet Live Stats. 2019. Internet Usage and Social Media Statistics. https://www.internetlivestats.com/.
[14] Yifan Tian, Jiabao Wang, Zhenji Zhou, and Shengli Zhou. 2017. CNN-Webshell: Malicious Web Shell Detection with Convolutional Neural Network. In Proceedings of the 2017 VI International Conference on Network, Communication and Computing (ICNCC 2017). ACM, New York, NY, USA, 75–79. https://doi.org/10.1145/3171592.3171593
[15] Le V-G, Nguyen H-T, Pham D-P, Phung V-O, and N-H Nguyen. 2019. GuruWS: A Hybrid Platform for Detecting Malicious Web Shells and Web Application Vulnerabilities. Transactions on Computational Collective Intelligence, Springer, Berlin, Heidelberg 11370, XXXII (01 2019), 184–208.
[16] Le V-G, Nguyen H-T, Lu L-D, and N-H Nguyen. 2016. A solution for automatically malicious Web shell and Web application vulnerability detection. In Computational Collective Intelligence, Volume 9875 of the series Lecture Notes in Computer Science. Springer-Verlag, Berlin, Heidelberg, 367–378.
[17] M. Yeo, Y. Koo, Y. Yoon, T. Hwang, J. Ryu, J. Song, and C. Park. 2018. Flow-based malware detection using convolutional neural network. In 2018 International Conference on Information Networking (ICOIN). 910–913. https://doi.org/10.1109/ICOIN.2018.8343255
[18] K. Özkan, Ş. Işık, and Y. Kartal. 2018. Evaluation of convolutional neural network features for malware detection. In 2018 6th International Symposium on Digital Forensic and Security (ISDFS). 1–5. https://doi.org/10.1109/ISDFS.2018.8355390
