ICT Express xxx (xxxx) xxx
www.elsevier.com/locate/icte
PSI-rooted subgraph: A novel feature for IoT botnet detection using classifier
algorithms
Huy-Trung Nguyen a,b,∗, Quoc-Dung Ngo d,∗, Doan-Hieu Nguyen c, Van-Hoang Le c
a Graduate University of Science and Technology, Vietnam Academy of Science and Technology, Viet Nam
b Institute of Information Technology, Vietnam Academy of Science and Technology, Viet Nam
c People’s Security Academy, Viet Nam
d Posts and Telecommunications Institute of Technology, Viet Nam
Abstract
IoT devices are increasingly used in many areas. However, due to limited resources (e.g., memory, CPU),
the security mechanisms on many IoT devices, such as IP cameras and routers, are weak. As a result, botnets have recently emerged as a
threat that compromises IoT devices. To tackle this, novel methods for IoT botnet detection play a crucial role. In this paper, we make the
following contributions to IoT botnet detection: first, we present a novel high-level PSI-rooted subgraph-based feature for the detection of
IoT botnets; second, we generate a limited number of features that have precise behavioral descriptions, which require less space and
reduce processing time; third, the evaluation results show the effectiveness and robustness of PSI-rooted subgraph-based features: with five
machine classifiers, namely Random Forest, Decision Tree, Bagging, k-Nearest Neighbor, and Support Vector Machine, each classifier
achieves a detection rate of more than 97% with low time consumption. Moreover, compared to other work, our proposed method obtains
better performance. Finally, we publish all our materials on GitHub, which will benefit future research (e.g., IoT botnet detection approaches).
© 2020 The Korean Institute of Communications and Information Sciences (KICS). Publishing services by Elsevier B.V. This is an open access
article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Keywords: IoT botnet; IoT security; Static malware detection; PSI-rooted subgraph; Machine learning; Deep learning
Please cite this article as: H.-T. Nguyen, Q.-D. Ngo, D.-H. Nguyen et al., PSI-rooted subgraph: A novel feature for IoT botnet detection using classifier algorithms, ICT Express (2020),
https://doi.org/10.1016/j.icte.2019.12.001.
is also discovered that nearly 20% of the tested routers expose their Telnet port to the Internet, which has serious security implications.
+ The majority of IoT devices use Linux-based operating systems, so the focus of this paper is on ELF executable files.

IoT malware (short for ‘malicious software’) is any software that is designed to have undesirable or harmful effects on a computer system or on IoT devices. IoT malware is not a completely new concept: malware targeting IoT devices has been spotted since 2008 (the Linux.Hydra malware) [8]. There are many types of IoT malware [9], such as Trojans, Worms, Rootkits, Botnets, etc., among which botnets are now considered to be a major security threat in IoT network environments [10]. One of the most prominent examples of an IoT botnet is Mirai, which is considered the largest botnet in history, consisting of approximately 500,000 compromised IoT devices with a traffic peak of 1.2 terabits per second (Tbps), the biggest DDoS attack ever recorded [8,11]. Therefore, in this work we focus on IoT botnets. Furthermore, the source code of “Mirai” has been released, which allows creators of IoT malware to make many malware variants [12], such as Persirai, BrickerBot (in 2017), HideNSeek (in 2018) and so on. According to Kaspersky Lab’s collection, the number of IoT malware samples increased 37 times from 2016 to 2018. Therefore, botnet and malware are used interchangeably in this paper.

To combat such a huge number of IoT malware samples, many researchers and cybersecurity firms are using automatic malware detection methods. One possible solution is to apply machine learning. It involves building classifier algorithms that are capable of “learning” from input data and later using what was learned to process previously unseen data. Input data come in the form of features characterizing the analyzed objects. Features can be integer or floating-point numbers, Boolean (‘true’ or ‘false’) or categorical values. Studies on the detection of IoT malware using machine learning include [13–15]. However, those studies suggest that the performance of classifier algorithms such as Random Forest, k-Nearest Neighbors, Support Vector Machine and Decision Tree depends not only on the task at hand, but also on the number and characteristics of the features. Recently, the application of deep learning to deal with the limitations of machine learning models has become an important research topic [16]. Deep learning is a branch of machine learning that consists of multiple layers of linear and nonlinear processing neurons. Deep learning models extract features automatically and autonomously by learning from the data and training the neural network, rather than relying on manually designed features. Models that use features extracted by deep learning are very effective and outperform other machine learning classifiers that have been trained on hand-engineered features. However, some of the limitations of common deep learning algorithms are as follows [17,18]:
- It requires an extremely large amount of computational power (with expensive GPUs).
- It is difficult to configure and prone to over-fitting, since selecting the appropriate hidden layers requires many rounds of experimentation.
- The results are not interpretable because of the complexity of the hidden layers of deep learning; it is therefore difficult to comprehend the results and understand the mechanism of the algorithm.
- It is difficult to improve “what is learned” since there is no theory to guide the process.

Therefore, researchers often encounter a trade-off: machine learning algorithms are fast and make it possible to understand ‘what is learned’, but are less accurate, whereas deep learning methods are time-consuming but more accurate on problems such as malware detection. Also, the choice of appropriate input features is a primary task for improving the prediction rate of every learning model [19]. According to [20], graph-based features are better than flow-based features in detecting botnet malware, since they avoid the need to cross-compare flows across the dataset. Besides, a graph can represent the behavior of an application efficiently, because graphs, especially labeled graphs, can be used to model many complex relations between data [21].

In this paper, we propose a novel PSI-rooted subgraph-based feature, obtained in a completely static way (no dynamic analysis), which represents the behaviors in the life-cycle of an IoT botnet. These features are defined according to both the behavioral information of the IoT botnet and the topology of the PSI-graph [22], and they are generated and processed through a combination of deep learning and machine learning models. In our previous study, we preliminarily confirmed the effectiveness of PSI-rooted subgraphs for two-class classification [23] (also called detection). However, that method suffers from the disadvantage that the number of subgraphs may increase exponentially with the size of the training PSI-graphs. Therefore, this paper improves the previous classification method so that it is fast, detects IoT botnet families (e.g., Mirai, Bashlite), and has a low false positive rate when presented with benign files.

The major contributions of this paper can be summarized as follows: First, we present a novel high-level PSI-rooted subgraph-based feature, obtained through a combination of deep learning and machine learning models, for the detection of IoT botnets. Second, we generate a limited number of features that have precise behavioral descriptions, which require less space and reduce processing time. Finally, we run extensive experiments on different datasets. The evaluation results show the effectiveness and robustness of PSI-rooted subgraph-based features for multiple machine classifiers, each of which achieves a detection rate of more than 97%.

The remainder of this paper is organized as follows: Section 2 discusses related work in this area. Section 3 describes IoT botnet malware. Next, in Section 4, we describe our proposed method in detail. Section 5 presents and discusses the results of the evaluation and the drawbacks of our proposed method. Finally, Section 6 presents the conclusion.

2. Related works

As previously mentioned, it is essential in machine learning-based malware detection to find good features for representing malware, such as Operation Codes (Opcodes), Printable Strings Information (PSIs), System calls, Control Flow
Graphs (CFGs), Processor Information, etc. We group the research methods into two groups, namely, those using non-structured graph-based features and those using structured graph-based features. In this section, we give an overview of how these features are used and concentrate on the structured graph-based features (e.g., PSI-rooted subgraph) used for feature extraction in the proposed method.

2.1. Non-structured graph-based

Malware detection using non-structured graph-based features means using a single feature or a combination of features, such as Processor Information, opcodes, strings, system calls (present in the file header) and so on, to detect malicious characteristics. These features have been used in the following studies.

Jinrong Bai et al. [24] proposed a static malware detection method that mines system calls from the symbol table of Linux executables, because these features reflect the behaviors of program code pieces and carry semantic interpretations, which can be used to detect the intent and goal of attackers. The experimental results achieve an accuracy of more than 98% in distinguishing between benign files and malware, depending on the selected machine learning classifier. However, this work has trouble with malware that uses obfuscation and packing techniques.

Saxe et al. [25] proposed a deployable malware detector based on a deep feed-forward neural network with two hidden layers, using static features that include the byte entropy histogram, the 2D histogram of ASCII printable strings, metadata of executable files and imported DLLs. These four types of features were each transformed into a 256-dimensional vector, and the four vectors were aggregated into a 1024-dimensional feature vector. Their method achieved a 95% detection rate at a false positive rate of 0.1% in an experiment on a dataset of 400,000 samples. Furthermore, it should be noted that they achieved their highest true positive rate when using all the static features as opposed to single features. However, the limitation is that the labels assigned to the training set may be inaccurate, and the accuracy of the proposed approach decreases substantially when samples are obfuscated.

Igor Popov [26] proposed to use word2vec, a popular tool for analyzing natural language texts, for Windows malware (PE file) classification. Popov used word2vec to generate word embeddings from malware opcodes, which were obtained by feeding machine instructions into the Capstone disassembler. These word vectors were then used to classify executable code with a convolutional neural network-based classifier. This system was tested with up to 97.0% success.

Hayate Takase et al. [27] proposed a method that classifies malware by machine learning, using hardware resources such as CPU information and memory consumption as features. These features include the program counter, opcode, and register number, which are traced by emulating a CPU using QEMU. Their experiment shows that the proposed method can detect malware belonging to the same malware family after training on one malware sample, with a 100% accuracy rate. However, this work evaluated the experimental results on a very small dataset, and the size of each classifier was restricted to about 2 MB for hardware implementation; therefore, it is not robust for IoT botnet detection.

The above-mentioned studies represent executable files by non-structured graph-based features, which depend on the values of the features (e.g., system call inet toa) and cannot describe the complex semantic information (such as data dependencies in the life-cycle of an IoT botnet) among them. Besides, the results point out that non-structured graph-based features are weak against obfuscated malware, for example malware that uses encryption or garbage insertion. To deal with this, other researchers use control flow analysis (e.g., control flow graph) to represent the behavior of malware as a graph.

2.2. Structured graph-based

Some other researchers focus on graph-based features (e.g., control flow graph, call graph, code graph) to represent the behavior of executable files, since a graph is a strong way to model many complex relationships between data.

Eskandari et al. [28] presented a robust semantic-based method to detect malware based on the combination of a visual model (CFG) and the called APIs. In their study, the authors enriched the control flow graph by adding the behavioral properties of malware to it. Behavioral information is obtained from the called APIs. The simple CFG thus becomes an API on CFG, referred to as API-CFG, in which the nodes and edges represent the necessary instructions and API calls, respectively. Their approach is capable of classifying unseen benign files and malware with high accuracy.

Zhao et al. [29] proposed a graph-based detection method based on features of the CFG. They used specific graph traversal algorithms to obtain the values of previously defined features. The features are information about the nodes and edges of the graph. Then, they performed classification with some popular machine learning algorithms, including J48 Decision Tree, Bagging and Random Forest. Their experiments show a good result, achieving as high as 96.8% accuracy with the Random Forest classifier.

Zhiwu Xu et al. [30] proposed a method to detect Android malware with a Convolutional Neural Network applied to a semantic representation of the graph, combining the Control Flow Graph (CFG) and the Data Flow Graph (DFG) into one graph. In the experimental results, the method achieved an accuracy and F1-score of 99.82% and 99.191%, respectively.

However, graph-based features are not always a single connected graph, so researchers have used subgraphs to examine in more depth the malicious behavior of malware, which may reside in a special part of the graph. Karbalaie et al. [31] proposed a graph-based method in which they constructed a graph representation of malware behavior and performed frequent subgraph mining before using a classification method to get the result. The proposed method achieved 96.6% accuracy. Silvio Cesare and Yang Xiang [32] used each possible subgraph of fixed size in the CFG as features in their approach. They first eliminated cycles in the graph and performed a traversal of all paths to generate the subgraphs. Then, they used
Table 1
A sample to generate PSI-rooted subgraph step by step.
Degree Vertexes
d = 0 11
d = 1 0, 8, 10, 7, 9
d = 2 18, 0, 0, 7, 0, 5, 6, 15, 16
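One way to read Table 1 is as a level-by-level expansion from a root vertex: row d = 0 holds the root, and each later row lists the successors of the vertices in the previous row, keeping repeats. The following minimal sketch illustrates that reading only. The function name `rooted_subgraph_levels` and the adjacency list `EXAMPLE_ADJ` are hypothetical; the edges were chosen solely so that the expansion reproduces the rows of Table 1, and they are not taken from the paper's PSI-graph or from the authors' implementation.

```python
from typing import Dict, List


def rooted_subgraph_levels(
    adj: Dict[int, List[int]], root: int, max_depth: int
) -> List[List[int]]:
    """Return the vertices listed at each depth 0..max_depth, starting from root.

    Repeats are kept: a vertex reached through several parents appears
    once per parent, as in row d = 2 of Table 1.
    """
    levels = [[root]]
    for _ in range(max_depth):
        next_level: List[int] = []
        for vertex in levels[-1]:
            # Vertices without outgoing edges contribute nothing.
            next_level.extend(adj.get(vertex, []))
        levels.append(next_level)
    return levels


# Hypothetical adjacency list (an assumption made for illustration only).
EXAMPLE_ADJ: Dict[int, List[int]] = {
    11: [0, 8, 10, 7, 9],
    0: [18],
    8: [0],
    10: [0, 7],
    7: [0, 5, 6],
    9: [15, 16],
}

levels = rooted_subgraph_levels(EXAMPLE_ADJ, root=11, max_depth=2)
for depth, vertices in enumerate(levels):
    print(f"d = {depth}: {vertices}")
```

Under these assumptions the three printed rows match Table 1. Keeping repeats is a design choice of this sketch; a variant that removes duplicates per level would produce a different d = 2 row.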
Table 2
Machine classifiers description.
Machine classifiers  Description
SVM  Support vector machine (SVM) is a machine learning algorithm that is generally used for classification problems. The main idea is to find a hyperplane that separates the classes in the best way. SVM works well with high-dimensional datasets.
RF  Random forest (RF) is one of the most popular ensemble machine learning algorithms. The idea is to grow multiple decision trees on independent subsets of the dataset to produce better prediction accuracy. The fundamental difference from Bagging is that in Random forest only a random subset of the features is considered for splitting a node.
DT  Decision Tree (DT) learns from the training data by building a tree that is then used for making predictions on the test data. Decision trees are popular due to their simplicity. They can work well with large datasets. Another advantage is that decision trees operate as a “white box”, meaning that we can clearly see how the outcome is obtained and which decisions led to it.
kNN  k-Nearest Neighbor (kNN) classifies a new data point by a majority vote of its neighbors, assigning the class most common among its k nearest neighbors. The kNN algorithm calculates the distance between the training data and the new data point to determine the class. The most common distance metric is the Euclidean distance.
Bagging  Bootstrap Aggregation (or Bagging for short) is an ensemble method whose base estimator here is a decision tree. The fundamental difference from Random forest is that all features are considered for splitting a node.
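To make the kNN row of Table 2 concrete, the sketch below implements the majority vote among the k nearest neighbors under the Euclidean distance, using only the Python standard library. It is an illustration of the general technique, not the classifier configuration used in the paper. The function name `knn_predict`, the two-dimensional feature vectors and the labels in `TRAIN` are all hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]


def knn_predict(train: List[Sample], query: Sequence[float], k: int = 3) -> str:
    """Label `query` by majority vote among its k nearest training samples."""
    # Sort training samples by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda sample: math.dist(sample[0], query))
    # Count the labels of the k closest samples and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data (an assumption for illustration): each vector stands
# for two made-up feature values of an executable file.
TRAIN: List[Sample] = [
    ((1.0, 0.0), "botnet"),
    ((0.9, 0.2), "botnet"),
    ((0.8, 0.1), "botnet"),
    ((0.0, 1.0), "benign"),
    ((0.1, 0.9), "benign"),
    ((0.2, 1.0), "benign"),
]

print(knn_predict(TRAIN, (0.85, 0.1), k=3))
print(knn_predict(TRAIN, (0.1, 0.95), k=3))
```

With two classes, an odd k avoids tied votes. In practice a library implementation would normally be preferred over this hand-written version.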
[7] Aohui Wang, Ruigang Liang, Xiaokang Liu, Yingjun Zhang, Kai Chen, Jin Li, An inside look at IoT malware, in: International Conference on Industrial IoT Technologies and Applications, Wuhu, China, 2017, pp. 176–186.
[8] Michele De Donno, Nicola Dragoni, Alberto Giaretta, DDoS-Capable IoT Malwares: Comparative Analysis and Mirai Investigation, Security and Communication Networks, Wiley, 2018.
[9] Evanson Mwangi Karanja, Shedden Masupe, Jeffrey Mandu, Internet of Things malware: Survey, Int. J. Comput. Sci. Eng. Surv. 8 (3) (2017).
[10] Y.M.P. Pa, S. Suzuki, K. Yoshioka, T. Matsumoto, T. Kasama, C. Rossow, IoTPOT: A novel honeypot for revealing current IoT threats, J. Inf. Process. 24 (2016) 522–533.
[11] Anton O. Prokofiev, Yulia S. Smirnova, Vasiliy A. Surov, A method to detect Internet of Things botnets, in: Young Researchers in Electrical and Electronic Engineering, EIConRus, Moscow, Russia, 2018.
[12] Constantinos Kolias, Georgios Kambourakis, Angelos Stavrou, Jeffrey Voas, DDoS in the IoT: Mirai and Other Botnets, vol. 50, IEEE Computer Society, 2017.
[13] Amin Azmoodeh, Ali Dehghantanha, Mauro Conti, Kim-Kwang Raymond Choo, Detecting crypto-ransomware in IoT networks based on energy consumption footprint, J. Ambient Intell. Humaniz. Comput. (2017) 1–12.
[14] Amin Azmoodeh, et al., Robust malware detection for Internet of (battlefield) Things devices using deep eigenspace learning, IEEE Trans. Sustain. Comput. (2018) 88–95.
[15] Fairuz Amalina Narudin, et al., Evaluation of machine learning classifiers for mobile malware detection, Soft Comput. (2014) 343–357.
[16] Mohammed Ali Al-Garadi, Amr Mohamed, Abdulla Al-Ali, Xiaojiang Du, Mohsen Guizani, A Survey of Machine and Deep Learning Methods for Internet of Things (IoT) Security.
[17] Amit Ray, Compassionate Superintelligence AI 50: AI with Blockchain, Bmi, Drone, IoT, and Biometric Technologies, Inner Light Publishers, 2018.
[18] L. Miralles-Pechuán, D. Rosso, F. Jiménez, J.M. García, A methodology based on deep learning for advert value calculation in CPM, CPC and CPA networks, Soft Comput. 21 (2017) 651–665.
[19] Hoda El Merabet, Abderrahmane Hajraoui, A survey of malware detection techniques based on machine learning, Int. J. Adv. Comput. Sci. Appl. 10 (2019).
[20] Sudipta Chowdhury, Mojtaba Khanzadeh, Botnet detection using graph-based feature clustering, J. Big Data (2017).
[21] T. Sherwood, E. Perelman, G. Hamerly, B. Calder, Automatically characterizing large scale program behavior, SIGARCH Comput. Archit. News 30 (5) (2002) 45–57.
[22] Huy Trung Nguyen, Quoc Dung Ngo, Van Hoang Le, IoT botnet detection approach based on PSI graph and DGCNN classifier, in: International Conference on Information Communication and Signal Processing, 2018, pp. 118–122.
[23] Huy-Trung Nguyen, et al., Towards a rooted subgraph classifier for IoT botnet detection, in: The 7th International Conference on Computer and Communications Management, 2019.
[24] J. Bai, Y. Yang, S. Mu, Y. Ma, Malware detection through mining symbol table of Linux executables, Inform. Technol. J. 12 (2) (2013) 380–383.
[25] J. Saxe, K. Berlin, Deep neural network based malware detection using two dimensional binary program features, in: 10th International Conference on Malicious and Unwanted Software, 2015, pp. 11–20.
[26] Igor Popov, Malware detection using machine learning based on word2vec embedding of machine code instruction, in: Siberian Symposium on Data Science and Engineering, SSDSE, 2017, pp. 1–4.
[27] Hayate Takase, et al., A prototype implementation and evaluation of the malware detection mechanism for IoT devices using the processor information, Int. J. Inf. Secur. (2019) 1–11.
[28] Mojtaba Eskandari, Sattar Hashemi, A graph mining approach for detecting unknown malwares, J. Vis. Lang. Comput. 23 (2012) 154–162.
[29] Zhao Zongqu, Junfeng Wang, Chonggang Wang, An unknown malware detection scheme based on the features of graph, Secur. Commun. Netw. 6 239–246.
[30] Z. Xu, K. Ren, S. Qin, F. Craciun, CDGDroid: Android malware detection based on deep learning using CFG and DFG, in: Proceedings of the 20th International Conference on Formal Engineering Methods, ICFEM, 2018, pp. 177–193.
[31] Fatemeh Karbalaie, et al., Semantic malware detection by deploying graph mining, Int. J. Comput. Sci. 9 (1) (2012).
[32] Cesare Silvio, Yang Xiang, Malware variant detection using similarity search over sets of control flow graphs, in: IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications, 2011.
[33] Aya Hellal, Lotti Ben Romdhane, Maximal frequent sub-graph mining for malware detection, in: 15th International Conference on Intelligent Systems Design and Applications, ISDA, 2015.
[34] Ming Fan, et al., Frequent subgraph based familial classification of Android malware, in: IEEE 27th International Symposium on Software Reliability Engineering, 2016.
[35] F-Secure, Articles: botnets, 2016, https://www.f-secure.com/en/web/labs_global/botnets. (Accessed 21 Feb 2019).
[36] Michele De Donno, Nicola Dragoni, Alberto Giaretta, Angelo Spognardi, Analysis of DDoS-capable IoT malwares, in: The Federated Conference on Computer Science and Information Systems, 11, 2017, pp. 807–816.
[37] Manos Antonakakis, et al., Understanding the Mirai Botnet, in: USENIX Security Symposium, Canada, 2017, pp. 1092–1110.
[38] Jiawei Su, Danilo Vasconcellos Vargas, Sanjiva Prasad, Daniele Sgandurra, Yaokai Feng, Kouichi Sakurai, Lightweight classification of IoT malware based on image recognition, in: 2018 IEEE 42nd Annual Computer Software and Applications Conference, COMPSAC, vol. 2, 2018, pp. 664–669.
[39] Annamalai Narayanan, Mahinthan Chandramohan, Lihui Chen, Yang Liu, Santhoshkumar Saminathan, Subgraph2vec: Learning distributed representations of rooted sub-graphs from large graphs, 2016, arXiv preprint arXiv:1606.08928.
[40] Darko Andročec, Neven Vrček, Machine learning for the Internet of Things security: A systematic, in: Proceedings of the 13th International Conference on Software Technologies, ICSOFT, 2018, pp. 563–570.
[41] VirusShare, Because sharing is caring, 2019, https://virusshare.com/. (Accessed at 10/01/2019).
[42] https://www.virustotal.com. (Accessed at 5/02/2019).
[43] Hamed HaddadPajouh, Ali Dehghantanha, Raouf Khayami, Kim-Kwang Raymond Choo, A deep recurrent neural network based approach for Internet of Things malware threat hunting, Future Gener. Comput. Syst. 85 (2018) 88–96.