
Available online at www.sciencedirect.com

ScienceDirect
ICT Express xxx (xxxx) xxx
www.elsevier.com/locate/icte

PSI-rooted subgraph: A novel feature for IoT botnet detection using classifier
algorithms
Huy-Trung Nguyen a,b,∗, Quoc-Dung Ngo d,∗, Doan-Hieu Nguyen c, Van-Hoang Le c
a Graduate University of Science and Technology, Vietnam Academy of Science and Technology, Viet Nam
b Institute of Information Technology, Vietnam Academy of Science and Technology, Viet Nam
c People’s Security Academy, Viet Nam
d Posts and Telecommunications Institute of Technology, Viet Nam

Available online xxxx

Abstract
IoT devices are increasingly widely used in many areas. However, due to limited resources (e.g., memory, CPU), the security mechanisms on many IoT devices, such as IP cameras and routers, are weak. As a result, botnets have recently emerged as a threat that compromises IoT devices. To tackle this, a novel method for IoT botnet detection plays a crucial role. In this paper, we make the following contributions to IoT botnet detection: first, we present a novel high-level PSI-rooted subgraph-based feature for the detection of IoT botnets; second, we generate a limited number of features that have precise behavioral descriptions, which require smaller space and reduce processing time; third, the evaluation results show the effectiveness and robustness of PSI-rooted subgraph-based features: with five machine learning classifiers, namely Random Forest, Decision Tree, Bagging, k-Nearest Neighbor, and Support Vector Machine, each classifier achieves a detection rate of more than 97% with low time consumption. Moreover, compared to other work, our proposed method obtains better performance. Finally, we publish all our materials on GitHub, which will benefit future research (e.g., IoT botnet detection approaches).
© 2020 The Korean Institute of Communications and Information Sciences (KICS). Publishing services by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Keywords: IoT botnet; IoT security; Static malware detection; PSI-rooted subgraph; Machine learning; Deep learning

∗ Corresponding authors.
E-mail addresses: huytrung.nguyen.hvan@gmail.com (H.-T. Nguyen), dungnq@ptit.edu.vn (Q.-D. Ngo), doanhieunguyen.psa@gmail.com (D.-H. Nguyen), levanhoang.hvan@gmail.com (V.-H. Le).
Peer review under responsibility of The Korean Institute of Communications and Information Sciences (KICS).
https://doi.org/10.1016/j.icte.2019.12.001
2405-9595/© 2020 The Korean Institute of Communications and Information Sciences (KICS). Publishing services by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Please cite this article as: H.-T. Nguyen, Q.-D. Ngo, D.-H. Nguyen et al., PSI-rooted subgraph: A novel feature for IoT botnet detection using classifier algorithms, ICT Express (2020), https://doi.org/10.1016/j.icte.2019.12.001.

1. Introduction

The digital transformation following the Industrial Revolution 4.0, especially the development of the Internet of Things (IoT), is taking place strongly all over the world, affecting various areas of life such as transportation, medical and healthcare, energy management, automated vehicles and so on. A survey conducted by Cisco [1] shows that the number of IoT devices is on the rise each year and will possibly exceed 50 billion by 2020. In the year 2020, it is further estimated that 44 zettabytes of data will be exchanged between things connected on the Internet (1 Zettabyte is equivalent to 1,000,000,000,000 Gigabytes) [2]. This prediction should not come as a surprise, because virtually any physical object that can be managed over the internet can be turned into an IoT device [3].
Along with this, IoT devices have become an attractive target for cybercriminals using malware (short for malicious software), as they are easier to infect than conventional computers (e.g., desktops, laptops) for the following reasons [4,5]:
+ Many IoT devices are running 24/7 without security updates.
+ Security is often overlooked to reduce cost within the development cycle of IoT devices. For example, about 70% of existing IoT devices have a vulnerability [6].
+ Due to restricted resources, it is expensive to implement conventional cryptography in IoT devices.
+ Many IoT devices use the manufacturer’s default login credentials or weak login credentials configured by users. For example, according to a report by ESET [7], about 15% of the tested routers use weak or default usernames and passwords. It was reported that “admin” is the most common username. It

is also discovered that nearly 20% of the tested routers expose their Telnet port to the Internet, which is a serious security implication.
+ The majority of IoT devices use a Linux-based operating system, so the focus of this paper is on ELF executable files.
IoT malware (abbreviation for ‘malicious software’) is any software that is designed to have undesirable or harmful effects on a computer system or on IoT devices. IoT malware is not a completely new concept, though: malware targeting IoT devices has been spotted since 2008 (Linux.Hydra malware) [8]. There are many types of IoT malware [9] such as Trojans, Worms, Rootkits, Botnets, etc., among which botnets are now considered to be a major security threat in IoT device network environments [10]. One of the most prominent examples of an IoT botnet is Mirai, which is considered the largest botnet in history, consisting of approximately 500,000 compromised IoT devices with a traffic peak of 1.2 terabits per second (Tbps), the biggest DDoS attack ever recorded [8,11]. Therefore, it is worth noting that in this work, we focus on IoT botnets. Furthermore, the source code for “Mirai” has been released, which allows creators of IoT malware to make many malware variants [12] such as Persirai, BrickerBot (in 2017), HideNSeek (in 2018) and so on. According to Kaspersky Lab’s collection, the number of IoT malware samples increased 37 times from 2016 to 2018. Therefore, botnet and malware are used interchangeably in this paper.
To combat such a huge number of IoT malware samples, many researchers and cybersecurity firms are using automatic malware detection methods. One possible solution is to apply machine learning. It involves building classifier algorithms that are capable of “learning” from input data and later using them for processing previously unseen data. Input data come in the form of features characterizing the analyzed objects. Features can be integer or floating point numbers, Boolean (‘true’ or ‘false’) values and categorical values. Studies on the detection of IoT malware using machine learning include [13–15]. However, those studies suggest that the performance of classifier algorithms such as Random Forest, k-Nearest Neighbors, Support Vector Machine and Decision Tree, etc. depends not only on the task at hand, but also on the number and characteristics of the features. Recently, the application of deep learning to deal with the limitations of machine learning models has become an imperative research topic [16]. Deep learning is a branch of machine learning which consists of linear and nonlinear multilayer processing neurons. Deep learning models extract features automatically and autonomously by learning from the data and training the neural network, rather than relying on manually designed features. The features extracted by deep learning are very effective and outperform other machine learning classifiers that have been trained using hand-engineered features. However, some of the limitations of common deep learning algorithms are as follows [17,18]:
- It requires an extremely large amount of computational power (with expensive GPUs).
- It is difficult to configure and easily falls into over-fitting, since deep learning relies on repeated experiments to select the appropriate hidden layers.
- The results are not interpretable because of the complexity of the hidden layers of deep learning; therefore, it is difficult to comprehend the results and understand the algorithm mechanism.
- It is difficult to improve “what is learned” since there is no theory to guide.
Therefore, researchers often encounter a trade-off: machine learning algorithms are fast and make it possible to understand ‘what is learned’, but are not so accurate, whereas deep learning methods are time consuming but more accurate on problems such as malware detection. Also, the choice of appropriate input features is a primary task for the improvement of prediction rates in every learning model [19]. According to [20], graph-based features are better than flow-based features in detecting botnet malware since they avoid the need to cross-compare flows across the dataset. Besides, a graph represents the behavior of an application efficiently, because graphs, especially labeled graphs, can be used to model lots of complex relations between data [21].
In this paper, we propose a novel PSI-rooted subgraph-based feature, obtained in a completely static way (no dynamic analysis), which represents the behaviors in the life-cycle of an IoT botnet. These features are defined according to both the behavior of IoT botnet information and the topology of the PSI-graph [22], and they are generated and processed through a combination of deep learning and machine learning models. In our previous study, we preliminarily confirmed the effectiveness of PSI-rooted subgraphs for two-class classification [23] (also called detection). However, that method suffers from the disadvantage that the number of subgraphs may increase exponentially with the size of the training PSI-graphs. Therefore, this paper improves the previous classification method so that it is fast, supports IoT botnet family detection (e.g., Mirai, Bashlite), and has a low false positive rate when presented with benign files.
The major contributions of this paper can be summarized as follows: First, we present a novel high-level PSI-rooted subgraph-based feature through a combination of deep learning and machine learning models for the detection of IoT botnets. Second, we generate a limited number of features that have precise behavioral descriptions, which require smaller space and reduce processing time. Finally, we run extensive experiments on different datasets. The evaluation results show the effectiveness and robustness of PSI-rooted subgraph-based features for multiple machine learning classifiers, which achieve a detection rate of more than 97%.
The remainder of this paper is organized as follows: Section 2 discusses related work in this area. Section 3 describes IoT botnet malware. Next, in Section 4, we describe our proposed method in detail. Section 5 presents and discusses the results of the evaluation and the drawbacks of our proposed method. Finally, Section 6 presents the conclusion.

2. Related works

As previously mentioned, finding good features for representing malware is essential in machine learning-based malware detection, such as Operation Codes (Opcodes), Printable Strings Information (PSIs), System calls, Control Flow

Graphs (CFGs), Processor Information, etc. We group the research methods into two groups, namely, non-structured graph-based and structured graph-based features. In this section, we give an overview of how these features are used and concentrate on the structured graph-based features (e.g., PSI-rooted subgraph) used in the proposed method for feature extraction.

2.1. Non-structured graph-based

Malware detection using non-structured graph-based features means using a single feature or a combination of features to detect malicious characteristics, such as Processor Information, opcodes, strings, system calls (present in the file header) and so on. These features have been used in the following studies.
Jinrong Bai et al. [24] proposed a static malware detection method that mines system calls from the symbol table of Linux executables, because these features reflect behaviors of program code pieces and carry semantic interpretations, which can be used to detect the intent and goal of attackers. The experimental results achieve an accuracy rate of more than 98% for distinguishing between benign files and malware, depending on the selected machine learning classifier. However, this work has trouble with malware using obfuscation and packing techniques.
Saxe et al. [25] proposed a deployable malware detector based on a deep feed-forward neural network with two hidden layers, using static features that include the byte entropy histogram, the 2D histogram of ASCII printable strings, metadata of executable files and imported DLLs. These four types of features were transformed into 256-dimension vectors one by one, which were then aggregated into a 1024-dimension feature vector. Their method achieved a 95% detection rate with a false positive rate of 0.1% when running an experiment on a dataset of 400,000 samples. Furthermore, it should be noted that they achieved their highest true positive rate when using all the static features as opposed to single features. However, the limitation is that the labels assigned to the training set may be inaccurate, and the accuracy of the proposed approach decreases substantially when samples are obfuscated.
Igor Popov [26] proposed to use word2vec, a recently developed popular tool to analyze natural language texts, for Windows malware (PE file) classification. Popov used word2vec to generate word embeddings from malware opcodes, which were obtained by feeding the machine instructions into the Capstone disassembler. Then these word vectors were used to classify executable code using a convolutional neural network-based classifier. This system was tested with up to 97.0% success.
Hayate Takase et al. [27] proposed a method that classifies malware by machine learning using hardware resources such as CPU information and memory consumption as features. These features include the acquired program counter, opcode, and register number, which are traced by emulating a CPU using QEMU. Their experiment shows that the proposed method can detect malware belonging to the same malware family after training on one malware sample, with a 100% accuracy rate. However, this work evaluated the experimental results with a very small dataset, and the size of each classifier was suppressed to about 2 MB for hardware implementation; therefore, it is not robust for IoT botnet detection.
The above-mentioned studies represent executable files by non-structured graph-based features, which depend on the values of individual features (e.g., system call inet toa) and cannot describe the complex semantic information (such as dependency data in the life-cycle of IoT botnet) among them. Besides, the results point out that non-structured graph-based features are weak against obfuscated malware, such as encryption and garbage insertion. To deal with this, other researchers use control flow analysis (e.g., control flow graph) to represent the behavior of malware as a graph.

2.2. Structured graph-based

Some other researchers focus on graph-based features (e.g., control flow graph, call graph, code graph) to represent the behavior of executable files, since a graph is a strong method to model lots of complex relationships between data.
Eskandari et al. [28] presented a robust semantic-based method to detect malware based on a combination of a visual model (CFG) and called APIs. In their study, the authors enriched the current control flow graph by adding the behavioral property of malware to it. Behavioral information is gained from the called APIs. The simple CFG thus becomes an API-annotated CFG, referred to as API-CFG, in which the nodes and edges represent the necessary instructions and API calls, respectively. Their approach is capable of classifying unseen benign files and malware with high accuracy.
Zhao et al. [29] proposed a graph-based detection method based on features of the CFG. They used specific graph traversal algorithms to obtain the values of previously defined features. The features are information about the nodes and edges of the graph. Then, they performed classification with some popular machine learning algorithms including J48 Decision Tree, Bagging and Random Forest. Their experiments show a good result, achieving as high as 96.8% accuracy with the Random Forest classifier.
Zhiwu Xu et al. [30] proposed a method to detect Android malware based on a Convolutional Neural Network applied to a semantic representation of the graph, combining the Control Flow Graph (CFG) and Data Flow Graph (DFG) together into one graph. In the experimental results, the method achieved an accuracy and F1-score of 99.82% and 99.191%, respectively.
However, graph-based features are not always a single connected graph, so researchers have used subgraphs to explore in more depth the malicious behavior of malware, which may be a special part of the graph. Karbalaie et al. [31] proposed a graph-based method in which they constructed a graph representation of malware behavior and performed frequent subgraph mining before using a classification method to get the result. The proposed method obtained 96.6% accuracy. Silvio Cesare and Yang Xiang [32] used each possible subgraph of fixed size in the CFG as features in their approach. They first eliminated cycles in the graph and performed a traversal of all paths to generate the subgraphs. Then, they used

these subgraphs to construct a feature vector. However, these approaches do not apply machine learning techniques.
Aya Hellal et al. [33] proposed a behavior-based mechanism that uses graph mining techniques to detect malware. They extracted maximal frequent subgraphs (as features) from code graphs, which have a precise behavioral specification in executable files. They performed experiments with a dataset consisting of 1520 samples to evaluate the proposed method. Experimental results show that the mechanism generates a limited number of features, and these features are effective in detecting malware using machine learning, with a best accuracy of 91% with the SVM classifier. However, this work suffers from the costly subgraph isomorphism issue, which is NP-complete.
Ming Fan et al. [34] proposed a novel approach to classify Android malware based on frequent subgraph features extracted from the API-based graph. First, it distills an Android malware sample into a function call graph representation. Then, a TF–IDF-like approach is utilized to calculate the weights of each sensitive API in different families. Next, the graph is simplified into a sensitive API graph (sensitive APIs are usually invoked to perform malicious activities in malware). Finally, they divided the sensitive API graph into a set of subgraphs, where the subgraph with a sensitive API that is used by most malware files in one family is determined as the frequent subgraph of that particular family. The experimental result shows that the proposed method can correctly classify 94.5% of malware into their families, based on 6650 malware samples in 30 families. However, this approach is limited against advanced obfuscated malware.
Most of the above-mentioned studies target PC and mobile platforms and have not specifically focused on IoT devices. In our previous study, we preliminarily confirmed the effectiveness of subgraphs for detecting whether an IoT executable file is malware or not [23]. In this paper, we improve the previous classification method so that it is fast and has a low false positive rate when presented with benign files.

3. IoT botnet malware

The outstanding example of a DDoS attack launched by the Mirai botnet has shown that the attacker’s goal in exploiting IoT devices is to infect them with IoT botnet malware. An IoT botnet is a collection of infected IoT devices (called bots) which are remotely controlled by botmasters (sometimes referred to as the ‘botherder’ or ‘controller’ [35]). The primary difference from other malware is that a botnet includes a channel of communication with botmasters to perform various network attacks such as DDoS, leaking users’ confidential data and phishing. According to recent observations and preliminary analysis [36,8,7,37], an IoT botnet is functionally similar to existing botnets on the Personal Computer (PC) platform, except for some specific features that are seldom observed on PCs [38].
+ Try to kill other malware of competing families to get more system resources for themselves, due to the limited computational capability of IoT devices. For example, Mirai will also kill other competitor malware such as Qbot, Zollard, Remaiten [7].
+ Try to infect a wide range of devices such as IP cameras, routers and so on.
+ Try to also be compatible with different processor architectures, ensuring the maximum possible number of successful infections.
Besides, an IoT botnet usually has to go through a cycle of phases to infect IoT devices and make them useful as the botmaster desires. Depending on the study material, the number and names of phases may differ, but the general description can be as shown in Fig. 1.

Fig. 1. The IoT botnet life-cycle.

Behaviors of an IoT botnet take place through the IoT botnet life-cycle, which consists of four main phases: Scanning, Attacking, Infection and Violation. In the Scanning phase, the IoT botnet starts scanning the network by sending SYN packets to random IP addresses (except for IP addresses of some large organizations such as Hewlett-Packard, US Postal Service, etc.) and waiting for them to respond. Once it finds a compromised device with a service port open (such as 23, 2323, 23231), it tries to open a socket connection to that device. Then, in the Attacking phase, it attempts to log in to the IoT device with a list of default credentials, and if successful, it sends this information (including IP address and login credentials) to the attacker’s server.
Once logged in successfully, in the Infection phase, the IoT botnet downloads the executable file compatible with the device’s architecture. After executing this file, the IoT botnet deletes this file and runs only in the device’s temporary memory (e.g., Random Access Memory). From here on, the attacker can completely command and control the compromised IoT devices in the Violation phase to carry out network attacks such as DDoS.

4. Proposed mechanism

We propose an IoT botnet detection mechanism that combines deep learning and machine learning, using PSI-rooted subgraphs as features. In this section, we describe the details of the PSI-rooted subgraph used as a feature and give a summary of our proposed mechanism.

4.1. Overview

In this section, we discuss the main components of our model, how it processes our data and enables detecting IoT

botnet samples. Our proposed method consists of four main steps as presented in Fig. 2. First of all, we extract PSI-graphs from ELF files. Secondly, a traversing algorithm is applied to the PSI-graphs to generate PSI-rooted subgraphs. Before the classification task, our raw data of PSI-rooted subgraphs are preprocessed so that the data can be learnt efficiently by machine learning algorithms. This step consists of three stages: feature extraction for vectorizing our data, normalization, and feature selection for increasing the performance of learning algorithms. Finally, by applying several classifiers to the prepared dataset, we can select the best classifiers for detecting IoT botnet samples.

Fig. 2. The overview of proposed method.

4.2. PSI-graph generation phase

We have inherited the way of representing IoT executable files with PSI-graphs [22].

Definition 1 ([22]). A PSI-graph is a directed graph defined as $G_{PSI} = (V, E)$ where:
− $V$ is a set of vertexes, which are called PSI elements.
− $E$ is a set of edges, where all the edges are directed from one node to another.

However, the authors in [22] focused on the structure of the PSI-graph and did not mine the malicious behaviors in the life-cycle of IoT botnet malware. Therefore, through this graph, we expect to find the malicious behaviors in the life-cycle of IoT botnet malware that appear in the executable files, where each malicious behavior corresponds to a rooted subgraph in the PSI-graph.

4.3. PSI-rooted subgraph generation phase

As previously noted, our approach is based on the PSI-graph, which shows us how ELF files are actually executed. But not all activities of a malware file are harmful; it certainly has very normal activities like any other benign program. Because of that, analyzing paths and cycles in a PSI-graph is significant work, since it can show us the chain or cycle of events which is really harmful. We perform those analyses through rooted subgraphs, each of which is a part of the PSI-graph. We define it as below:

Definition 2. Let $G_{sg} = (V', E', v, d)$ represent an acyclic directed PSI-rooted subgraph that is generated from the PSI-graph $G_{PSI} = (V, E, \lambda)$ and rooted at vertex $v$, where $V' \subset V$ is the set of subgraph vertexes such that the length between $v$ and each vertex $V'_i$ satisfies $0 \le \mathrm{length}(v, V'_i) \le d$, and $E' \subset E$ is the set of directed edges between vertexes in $V'$. $d$ is called the degree of the PSI-rooted subgraph.

Here, GETWLSUBGRAPH (called Algorithm 2) is inherited from the algorithm proposed by Annamalai Narayanan [39]. Algorithm 2 takes as inputs the root vertex v, the graph G_i from which rooted subgraphs have to be extracted and the degree d of the intended subgraph, and returns the rooted subgraph SG_i. The algorithm gets all the (breadth first) neighbors of a vertex to extract subgraphs. Thus, the process of extracting PSI-rooted subgraphs is based on the breadth first search (BFS) algorithm, which is more efficient than depth first search (DFS). The main reason is that BFS starts at the root node and explores all of the neighbor nodes at the present depth prior to moving to the nodes at the next depth level, while DFS explores the highest-depth nodes first before being forced to backtrack. With a fixed depth (degree) of a rooted subgraph, BFS is clearly more suitable for extracting rooted subgraphs.

Table 1
A sample to generate PSI-rooted subgraph step by step.
Degree    Vertexes
d = 0     11
d = 1     0, 8, 10, 7, 9
d = 2     18, 0, 0, 7, 0, 5, 6, 15, 16

Fig. 3. An example of PSI-graph.
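As a rough illustration of the level-by-level (breadth-first) expansion described above, the sketch below collects the vertexes reachable from a root up to a given degree. It is not the authors' Algorithm 1 or Algorithm 2: the function names, the adjacency-list representation and the toy graph are assumptions made for this example. Because Table 1 lists some vertexes more than once, the sketch deliberately does not de-duplicate visited vertexes; that reading of the table is also an assumption.

```python
from typing import Dict, List

# Assumed representation: adjacency lists mapping a vertex to its successors.
Graph = Dict[str, List[str]]


def rooted_subgraph_levels(graph: Graph, root: str, degree: int) -> List[List[str]]:
    """Return the vertexes found at each depth 0..degree, breadth first."""
    levels = [[root]]
    for _ in range(degree):
        frontier = levels[-1]
        # Expand every vertex of the current level before going deeper (BFS).
        next_level = [succ for vertex in frontier for succ in graph.get(vertex, [])]
        if not next_level:
            break
        levels.append(next_level)
    return levels


def rooted_subgraph(graph: Graph, root: str, degree: int) -> List[str]:
    """Flatten the levels into one list, in the style of the rows of Table 1."""
    return [v for level in rooted_subgraph_levels(graph, root, degree) for v in level]


# Hypothetical toy graph (not the PSI-graph of Fig. 3).
toy_graph: Graph = {"a": ["b", "c"], "b": ["d", "a"], "c": ["d"], "d": []}

levels = rooted_subgraph_levels(toy_graph, "a", 2)
flat = rooted_subgraph(toy_graph, "a", 2)
```

Calling `rooted_subgraph` once per vertex of a graph would yield one rooted subgraph per vertex, which is how the text describes the per-graph output.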


After applying Algorithm 1 to our PSI-graph dataset, we get the PSI-rooted subgraph dataset. For each PSI-graph file, we get a PSI-rooted subgraph file in which each sentence is the PSI-rooted subgraph of a vertex in the original PSI-graph. Considering the trade-off between classification accuracy and computational complexity, we set d (see Definition 2) equal to 3.
For ease of illustration, we traverse the PSI-graph in Fig. 3 to find a sample of the PSI-rooted subgraph rooted at the vertex 11 with d equal to 2 (shown in Table 1 and Fig. 4).

Fig. 4. An illustration of PSI-Rooted sub-graph rooted at the vertex 11.

Finally, the PSI-rooted subgraph rooted at the vertex 11 is the list {11, 0, 8, 10, 7, 9, 18, 0, 0, 7, 0, 5, 6, 15, 16}.
We continue by traversing the full PSI-graph with each of the other vertexes as root, and so obtain the list of PSI-rooted subgraphs. Afterward, through the classification task, we identify the set of PSI-rooted subgraphs that contain the behaviors in the life-cycle of the IoT botnet.

4.4. Data preprocessing phase

Before the classification task, we need to convert our complex data of PSI-rooted subgraphs into vectors. What we need is a vector representation of the whole PSI-graph through those PSI-rooted subgraphs. For this purpose, we have considered several methods such as Word2vec, Doc2vec, Graph2vec and Subgraph2vec. Those methods all involve a language model called skipgram to learn the embeddings. However, skipgram takes much time and requires strong hardware to learn. Therefore, we decided to use a simple solution to this problem. The process is called vectorization, in which we turn a collection of graphs into numerical feature vectors. We consider each graph as a document and each of its rooted subgraphs as a word of this document; hence, we can count the occurrences of words in each document. In this scheme, each individual rooted subgraph occurrence frequency is treated as a feature. The vector of all the rooted subgraph frequencies for a given PSI-graph is considered a multivariate sample. Therefore, our data can be represented by a matrix with one row per graph and one column per rooted subgraph occurring in the graph dataset. The experimental results support this decision. This approach is not only much faster (reducing the learning time from 12–15 h to approximately 30 min) than Annamalai Narayanan’s algorithm [39] (using skipgram), but it also does not require much computation power and incurs no considerable loss in detection accuracy.
The matrix we get has features with different ranges, so we need to normalize it. Normalization is the process of scaling individual samples to have unit norm. Scaling inputs to unit norms is a common operation for text classification or clustering. As we consider a PSI-rooted subgraph as a word in a document, normalization is essential to avoid undesirable results when data is fed to our classifiers. The formula for the process is described as below:

$$x_{normalized} = \frac{x}{\|x\|_2} \qquad (1)$$

where x is a sample of the PSI-rooted subgraph dataset.
However, the number of different rooted subgraphs in the whole graph dataset is very large (530,155 different PSI-rooted subgraphs extracted in our dataset), and not every subgraph is essential for the classification task. For example, there are many normal subgraphs which both malware and benign files execute. Therefore, we transform the dataset into a reduced set of features, which is called feature selection. We reduce the dimension of our data not only to reduce the amount of memory and computation power needed for classification but also to make our training process significantly faster. We use a meta-transformer which trains a classifier on the dataset and selects the most relevant features by evaluating feature importance. After this step, we get our prepared dataset ready for classification tasks.
After preprocessing the data, we get a dataset with a reduced number of features, which is ready to be fed to classifiers. To obtain the most efficient classifier for IoT security [40], especially IoT botnet detection, five common machine learning classifiers have been applied and their efficiencies compared: Decision Tree (DT), Random Forest (RF), k-Nearest Neighbors (kNN), Support Vector Machine (SVM), and Bagging. The classifier algorithms used in this research are summarized in Table 2.
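The vectorization and normalization steps described in this section can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the authors' implementation: each graph is assumed to be given as a list of rooted-subgraph strings, and the function names and sample data are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def vectorize(documents: List[List[str]]) -> Tuple[List[str], List[List[int]]]:
    """Build a count matrix: one row per graph, one column per rooted subgraph."""
    vocabulary = sorted({token for doc in documents for token in doc})
    column = {token: i for i, token in enumerate(vocabulary)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for token, count in Counter(doc).items():
            row[column[token]] = count
        matrix.append(row)
    return vocabulary, matrix


def l2_normalize(row: List[int]) -> List[float]:
    """Scale one sample to unit L2 norm, as in Eq. (1); zero rows stay zero."""
    norm = math.sqrt(sum(value * value for value in row))
    if norm == 0:
        return [0.0] * len(row)
    return [value / norm for value in row]


# Hypothetical data: two graphs, each a list of rooted-subgraph "words".
graphs = [
    ["11-0-8", "0-18", "11-0-8"],
    ["0-18", "7-5-6"],
]
vocabulary, counts = vectorize(graphs)
normalized = [l2_normalize(row) for row in counts]
```

The source does not name its tooling. For readers using scikit-learn, `CountVectorizer`, `Normalizer` and `SelectFromModel` provide counting, unit-norm scaling and importance-based feature selection respectively; whether the authors used them is an assumption.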
Please cite this article as: H.-T. Nguyen, Q.-D. Ngo, D.-H. Nguyen et al., PSI-rooted subgraph: A novel feature for IoT botnet detection using classifier algorithms, ICT Express (2020),
https://doi.org/10.1016/j.icte.2019.12.001.

Table 2
Machine classifiers description.
Machine classifier    Description
SVM        Support vector machine (SVM) is a machine learning algorithm that is generally used for classification problems. The main
           idea is to find the hyperplane that separates the classes in the best way. SVM works well with high-dimensional datasets.
RF         Random forest is one of the most popular ensemble machine learning algorithms. The idea is to grow multiple decision
           trees on independent subsets of the dataset to produce a better prediction accuracy. Unlike Bagging, only a randomly
           selected subset of the features is considered in Random forest.
DT         Decision Tree (DT) learns from the training data by building a tree that is then used for making predictions on the test
           data. Decision trees are popular because of their simplicity, and they can work well with large datasets. Another advantage
           is that decision trees operate as a "white box", meaning that we can clearly see how the outcome is obtained and which
           decisions led to it.
kNN        A new data point is labeled by a majority vote of its neighbors: it receives the class most common among its k nearest
           neighbors. The kNN algorithm calculates the distance between the training data and the new data point to find these
           neighbors. The most common distance metric is the Euclidean distance.
Bagging    Bootstrap Aggregation (Bagging for short) with a decision tree as its base estimator. Unlike Random forest, all features
           are considered for splitting a node.
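To make the kNN row of Table 2 concrete, here is a minimal sketch of majority voting among the k nearest neighbors under the Euclidean distance. The function name `knn_predict`, the two-dimensional points, and the labels are hypothetical and chosen only for illustration; they are not the features or the implementation used in the experiments.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by a majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-dimensional samples.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
train_labels = ["benign", "benign", "benign", "malware", "malware"]

print(knn_predict(train_points, train_labels, (0.2, 0.1)))
print(knn_predict(train_points, train_labels, (5.5, 5.0)))
```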

5. Experimental results

In this section, we perform a series of experiments to evaluate our approach. Firstly, we conduct a set of cross-validation experiments to see the effectiveness of the PSI-rooted subgraph in ARM- and MIPS-based IoT botnet detection. Secondly, to test the robustness of our approach and its ability to detect multi-architecture IoT botnets, we run our method on a dataset consisting of 10,010 ELF samples, including 6165 IoT botnet samples and 3845 benign samples, with five machine classifiers: SVM, RF, Bagging, DT, and kNN. Finally, we also perform some experiments to compare our approach with an existing IoT botnet detection method.

5.1. Dataset description

We collected the samples mainly from 2 datasets, namely IoTPOT [10] and VirusShare [41]. The IoTPOT dataset contains 4000 IoT botnet samples, and the VirusShare dataset contains 3779 IoT botnet samples. We also collected 4001 benign samples from an online repository and IoT SOHO website. Files that the disassembler (e.g., IDA Pro) could not disassemble, that were non-binary, or that were duplicated were removed from the dataset. To ensure there is no malware sample in our benign dataset, we checked the benign dataset with VirusTotal [42], which is a system that contains more than 58 anti-malware scanners. Only if a sample passes all anti-malware engines in VirusTotal do we consider it a benign sample. Then, we combine the IoT malware and benign executable samples and use them as our experiment dataset. Thus the dataset contains 6165 malware and 3845 benign ELF executable files. Afterward, the dataset is divided into four different families: Mirai, Bashlite, Other botnet and Benign. However, different anti-malware scanners differ slightly in how they label malware (e.g., Bashlite/Lizkebab/Gafgyt). To deal with this, we label each malware sample with the name returned by more than half of the anti-malware scanners. Table 3 shows the details of the dataset and Table 4 shows statistical information on the PSI-rooted subgraphs of each class.

Table 3
The description of experimental dataset.
Family name     Sample number    ARM      MIPS
Mirai           1,765            331      301
Bashlite        3,720            762      646
Other botnet    680              152      103
Benign          3,845            561      533
Total           10,010           1,806    1,583

Table 4
PSI-rooted subgraph analysis of the executable collection.
Statistic    Bashlite    Mirai     Other botnet    Benign
Maximum      2549        7253      9253            11121
Minimum      5           5         9               5
Total        2570896     277973    786340          1904370

Fig. 5. Scatter plot describing the distribution of data points between different families after 3-dimensional LSA.

In addition, in order to visualize our data in 3-dimensional space, we perform dimensionality reduction by applying 3-dimensional Truncated Singular Value Decomposition (Truncated SVD) to the whole data. This method is also known as Latent Semantic Analysis (LSA). Although dimension reduction loses some information and the visualization in a low-dimensional space cannot thoroughly reveal what is in the high-dimensional space, it is still informative to some extent. Fig. 5 shows that the distributions of IoT malware and benign data are quite close and that there are some overlapping regions, which makes it difficult for any existing machine learning
classifier to detect precisely. Therefore, we need a feature and a model that are strong enough to classify an instance into one of two classes (malware or benign), or even into one of several classes covering several types of malware and benign files.

All our implementation is conducted on a desktop computer with the following specifications: Ubuntu 16.04 64-bit operating system, 32 GB RAM, NVIDIA GeForce GTX 1080i GPU, and the Python language.

5.2. Cross validation

Cross-validation techniques (e.g., K-fold, Repeated Random Sub-sampling) are commonly used to avoid the overfitting issue and to make the reported machine learning accuracy trustworthy. Cross-validation confirms that the hyperparameter tuning applied to a classifier does not lead to overfitting if the accuracies across folds remain within an accepted range of values. In this work, we used 5-fold cross-validation in all our experiments, as the size of the dataset is limited. In this rotation estimation, the data is divided into five equal parts, where four parts are used to train the model and the remaining part is used for testing.

5.3. Evaluative criteria

To evaluate the performance of our proposed method, we define the following terms:
+ True negative (TN) is the number of benign samples correctly classified as benign.
+ False negative (FN) is the number of malware samples incorrectly tagged as benign.
+ True positive (TP) is the number of malware samples correctly classified as malicious.
+ False positive (FP) is the number of benign samples falsely marked as malicious.
These terms are used to define four performance comparison criteria between DT, RF, kNN, Bagging, and SVM. We use the following metrics:
+ True Positive Rate (TPR): the number of malware samples correctly classified as malicious divided by the total number of malware samples.

\[ \mathrm{TPR} = \frac{TP}{TP + FN} \]

+ False Positive Rate (FPR): the number of benign samples falsely marked as malicious divided by the total number of benign files.

\[ \mathrm{FPR} = \frac{FP}{FP + TN} \]

+ Accuracy: the ratio of the number of correctly classified samples to the total number of malware and benign files.

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]

+ F-measure (F-score): combines precision and recall into a single number to estimate the overall system performance.

\[ \mathrm{F\text{-}score} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} \]

where Recall is the TPR and Precision is the fraction of samples predicted as malware that are actually malware, defined as follows:

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]

The fifth criterion is the area under the ROC curve (referred to as AUC). The Receiver Operating Characteristic (ROC) curve is obtained by plotting the true positive rate against the false positive rate as the threshold varies through the range of data values. A data point in the upper left corner corresponds to optimal, highly efficient detection. The closer the curve is to the top and left borders of the ROC area, the more accurate the detection is. A higher AUC value means a better detection rate with a lower FPR.
It is worth noting that, for every measure except FPR (where lower is better), the closer the value is to 1.00, the better the performance of the employed method.

Table 5
Classifiers results for the proposed feature model.
Classifier    TPR (%)    FPR      Accuracy (%)    AUC (%)    F1-score (%)
DT            97         0.043    96.3            96.4       97
RF            98         0.03     97.2            97.1       98
SVM           98         0.041    97              96.8       98
Bagging       98         0.04     97.3            97.1       98
kNN           97         0.044    96.8            96.7       98

Table 6
Malware detection evaluation result with ARM-based dataset.
Classifier    TPR (%)    FPR      Accuracy (%)    AUC (%)    F1-score (%)
DT            99         0.019    98.3            98.3       98
RF            99         0.01     98.8            98.8       99
SVM           100        0.01     99.3            99.3       99
Bagging       99         0.01     98.8            98.8       99
kNN           98         0.019    97.8            97.8       98

Table 7
Malware detection evaluation result with MIPS-based dataset.
Classifier    TPR (%)    FPR      Accuracy (%)    AUC (%)    F1-score (%)
DT            98         0.007    99              98.7       98
RF            99         0.005    99.3            99.1       98
SVM           100        0.007    99.4            99.6       99
Bagging       96         0.011    98.3            97.6       96
kNN           99         0.004    99.4            99.2       99

5.4. Result and discussion

The results obtained from 5-fold cross-validation using the five selected classifiers are presented in Tables 5–7. Classifier performance is measured with five evaluation metrics, namely TPR, FPR, Accuracy, F-score, and AUC.
The proposed method has a high detection rate for each classifier on the dataset comprising files of many different architectures, as shown in Table 5. Random forest demonstrated better results than the other classifiers, with 98% TPR, and its other metrics are also very satisfactory. Moreover, the areas under the curve for the classifiers used in this experiment show that all classifiers provide an AUC value larger than 96%, which signifies excellent prediction; the Random Forest classifier is the best with 97%, as shown in Figs. 6, 7, 8. The AUC values confirm that the experiment provides convincing results for IoT botnet detection.
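The metrics reported in Tables 5–7 follow directly from the four counts defined in Section 5.3. As a minimal sketch, the function below computes TPR, FPR, accuracy, precision, and F-score from TP, FP, TN, and FN. The function name `evaluate` and the example counts are hypothetical and are not taken from the experiments; the sketch also assumes that no denominator is zero.

```python
def evaluate(tp, fp, tn, fn):
    """Compute the Section 5.3 metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one malware sample,
    one benign sample, and one sample predicted as malware).
    """
    tpr = tp / (tp + fn)                      # recall / detection rate
    fpr = fp / (fp + tn)                      # false alarms among benign files
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f_score = 2 * tpr * precision / (tpr + precision)
    return {
        "tpr": tpr,
        "fpr": fpr,
        "accuracy": accuracy,
        "precision": precision,
        "f_score": f_score,
    }


# Hypothetical counts: 100 malware samples and 100 benign samples.
metrics = evaluate(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that FPR is the only one of these measures for which a lower value is better.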
Fig. 6. The ROC curves of Bagging, RF, DT, kNN and SVM test results on dataset.

We present the predictive performance metrics for the ARM-based and MIPS-based samples in Tables 6 and 7, respectively. Because each of these datasets is composed only of benign and malicious files for a single architecture (ARM or MIPS), SVM obtains slightly better results than the other classifiers. SVM achieves a true positive rate of 100% on both datasets. As mentioned previously, precision is the fraction of samples predicted as malware that are actually malware. In other words, precision indicates how reliable a classifier is when it predicts positive instances. Meanwhile, the F-score is calculated from precision and recall. The Random Forest and SVM classifiers achieve F-scores of at least 98%, meaning that they predict positive instances very well.
Fig. 7. The ROC curves of Bagging, RF, DT, kNN and SVM test results on ARM-based dataset.

Another advantage of the proposed method is its time complexity. The improved algorithm (Algorithm 1) has a considerably low time cost. We demonstrate that our proposed method achieves a much better time complexity than the whole subgraph2vec model [39]. The subgraph2vec model consists of two main components: a procedure to generate rooted subgraphs around every node and a procedure to learn embeddings of those subgraphs. Considering the rooted-subgraph extraction function, let k be the maximum degree of the graph (the maximum number of neighbors of a vertex in the graph) and D the degree of the extracted rooted subgraph. The function is recursive: it takes the root node and gets all the degree D-1 subgraphs of its neighbors to construct the rooted subgraph; therefore, in the worst case, the time complexity of this function is $O(k^{D})$. Besides, we need to get the rooted subgraphs of all vertices, so the time complexity becomes $O(|V| k^{D})$. This is a time-consuming function, because the number of vertices and k may be large. In addition, although we can choose a small D (≥ 3), the space required to store all the rooted subgraphs may be a problem. However, the procedure to learn their embeddings by radial skipgram with negative sampling is much more time-consuming, mainly because it is a deep learning-based model.
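The recursive extraction described above can be sketched as follows. This is an illustrative reading of the description, not a reproduction of Algorithm 1: the adjacency-dictionary graph representation, the string encoding of a rooted subgraph, and the names `rooted_subgraph` and `all_rooted_subgraphs` are assumptions made for the example.

```python
def rooted_subgraph(graph, root, depth):
    """Return a canonical string for the rooted subgraph of the given depth.

    `graph` maps each vertex to the list of its neighbors. The subgraph of
    depth D is built from the depth D-1 subgraphs of the root's neighbors,
    so with at most k neighbors per vertex the recursion makes O(k^D) calls.
    """
    if depth == 0:
        return str(root)
    children = sorted(
        rooted_subgraph(graph, neighbor, depth - 1)
        for neighbor in graph.get(root, [])
    )
    return f"{root}({','.join(children)})"


def all_rooted_subgraphs(graph, max_depth):
    """Collect the rooted subgraphs of depth 0..max_depth for every vertex."""
    return [
        rooted_subgraph(graph, vertex, depth)
        for vertex in graph
        for depth in range(max_depth + 1)
    ]


# Hypothetical call graph; the vertex names are made up for illustration.
graph = {
    "main": ["socket", "fork"],
    "socket": ["connect"],
    "fork": [],
    "connect": [],
}

print(rooted_subgraph(graph, "main", 2))
print(all_rooted_subgraphs(graph, 1))
```

Sorting the neighbor subgraphs makes the encoding independent of neighbor order, so identical structures in different graphs map to the same feature. Because the recursion is bounded by the depth, cycles in the graph do not cause infinite recursion.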
Fig. 8. The ROC curves of Bagging, RF, DT, kNN and SVM test results on MIPS-based dataset.

Hence, leaving out skipgram makes our method much more practical while still guaranteeing high results: the overall time complexity of our PSI-rooted subgraph generation is $O(|G| |V| D k^{D})$. The time complexity of the skipgram is rather large. Given the negative sample size n, the embedding dimension h, and the set S of positive (target word, context) pairs, the overall time complexity is $O((n + 1) |S| h)$, where n + 1 denotes n negative samples and 1 positive sample. S is a big set when the vocabulary is large. In our situation, the vocabulary of rooted subgraphs is large (approximately 500000), so |S| can easily reach the million level or even higher. The process has to run through multiple epochs to update the weights, and each epoch has the time complexity given above. Experiments show that the skipgram model can take
up to 15 h to learn embeddings, while our method, which uses frequency-based vectorization, takes only several minutes to complete.
Current methods for detecting malware on IoT devices commonly consume substantial resources (e.g., memory, CPU, and storage) during operation. Therefore, the time spent by the classifiers plays a crucial role in reducing resource consumption on IoT devices. Table 8 compares the processing time with and without feature selection during the experiments. As the number of features decreases, the time to build the model improves. When all 530,155 features took part in the experiment, the processing time was 9305.2 s for the RF classifier, whereas with feature selection it was reduced to 69.18 s. The other classifiers also showed a decrease in processing time when feature selection was applied. Therefore, the processing time of the classifiers grows with the number of features in the dataset.

Table 8
Comparison of processing time.
Classifier    Processing time with feature selection (s)    Processing time without feature selection (s)
DT            1.84                                          18.49
RF            69.18                                         9305.21
Bagging       144.64                                        5225.02
kNN           12.83                                         19.60
SVM           237.78                                        1705.33

Besides, we compare our proposed method with the recently proposed technique by Hamed et al. [43], which extracts Opcode sequences as a feature. We chose it for the experimental comparison for the following reasons. First, it uses a static feature approach for IoT executables, as we do. Second, it evaluates its experiments with both deep learning and machine learning.
This method is evaluated on an IoT executable dataset comprising only ARM-based samples. Therefore, we use our experimental results on the ARM dataset for comparison, as shown in Table 9. The results demonstrate that our approach delivers better accuracy. Thus, PSI-rooted subgraph features are effective in detecting IoT malware using machine learning.

Table 9
Accuracy of conventional machine learning classifiers on IoT malware: a comparative summary.
Classifier       Accuracy (%), proposed method    Accuracy (%), Hamed et al. [43]
Random forest    98.8                             92.37
SVM              99.3                             82.21
kNN              97.8                             94
Decision tree    97.8                             92.36

6. Conclusion and future work

In this paper, we present a rooted subgraph-based method to detect IoT botnets. It extracts novel features from the PSI-graph of ELF files. The features are used to train machine learning classifiers, which can be used as malware detectors. In summary, our main contributions are the following: (1) we present a novel high-level PSI-rooted subgraph-based feature for the detection of IoT botnets; (2) we generate a limited number of features that have precise behavioral descriptions, which requires smaller space and reduces processing time; (3) the evaluation results show the effectiveness and robustness of PSI-rooted subgraph-based features for multiple machine classifiers, each of which achieves a true positive rate of at least 97% on the full dataset, with Random forest demonstrating better results than the other classifiers. Additionally, our proposed method is compared to existing work, and the results demonstrate that our approach performs better. However, the technique used to extract the PSI-rooted subgraph has some limitations inherited from static analysis methods, which cannot disassemble complex and sophisticated obfuscated samples. Moreover, the behaviors of an IoT botnet usually start after a trigger point, such as receiving a remote command or reaching a specific time. Therefore, in future work, we plan to combine our method with dynamic analysis to determine the trigger point, or the vertex in the PSI-graph from which to start, and thus avoid brute-forcing the entire PSI-graph.

Declaration of competing interest

The authors declare that there is no conflict of interest in this paper.

CRediT authorship contribution statement

Huy-Trung Nguyen: Conceptualization, Methodology, Software, Investigation, Data curation, Writing - original draft, Writing - review & editing, Visualization, Supervision. Quoc-Dung Ngo: Methodology, Investigation, Writing - original draft, Writing - review & editing, Supervision, Project administration. Doan-Hieu Nguyen: Software, Writing - original draft, Writing - review & editing, Visualization. Van-Hoang Le: Investigation, Data curation, Writing - review & editing.

References

[1] Cisco Internet of Things, 2015 (Accessed: 10-June 2019).
[2] M. Future, Internet of Things and the mobile future, 2018, [Online], Available: http://mobilefuture.org/resources/internet-of-thingsand-the-mobile-future/.
[3] Overview of the Internet of Things, Recommendation ITU-T Y.2060 International Telecommunication Union, 2013.
[4] Y. Yang, L. Wu, G. Yin, L. Li, H. Zhao, A survey on security and privacy issues in Internet-of-Things, IEEE Internet Things J. 4 (5) (2017).
[5] M. Frustaci, P. Pace, G. Aloi, G. Fortino, Evaluating critical security issues of the IoT world: Present and future challenges, IEEE Internet Things J. (2017).
[6] Gartner, Internet of Things research study: 2014 report, 2015, http://d-russia.ru/wp-content/uploads/2015/10/4AA5-4759ENW.pdf. (Accessed 2 Feb 2019).

[7] Aohui Wang, Ruigang Liang, Xiaokang Liu, Yingjun Zhang, Kai Chen, Jin Li, An inside look at IoT malware, in: International Conference on Industrial IoT Technologies and Applications, Wuhu, China, 2017, pp. 176–186.
[8] Michele De Donno, Nicola Dragoni, Alberto Giaretta, DDoS-Capable IoT Malwares: Comparative Analysis and Mirai Investigation, Security and Communication Networks, Wiley, 2018.
[9] Evanson Mwangi Karanja, Shedden Masupe, Jeffrey Mandu, Internet of Things malware: Survey, Int. J. Comput. Sci. Eng. Surv. 8 (3) (2017).
[10] Y.M.P. Pa, S. Suzuki, K. Yoshioka, T. Matsumoto, T. Kasama, C. Rossow, IoTPOT: A novel honeypot for revealing current IoT threats, J. Inf. Process. 24 (2016) 522–533.
[11] Anton O. Prokofiev, Yulia S. Smirnova, Vasiliy A. Surov, A method to detect Internet of Things botnets, in: Young Researchers in Electrical and Electronic Engineering, EIConRus, Moscow, Russia, 2018.
[12] Constantinos Kolias, Georgios Kambourakis, Angelos Stavrou, Jeffrey Voas, DDoS in the IoT: Mirai and Other Botnets, vol. 50, IEEE Computer Society, 2017.
[13] Amin Azmoodeh, Ali Dehghantanha, Mauro Conti, Kim-Kwang Raymond Choo, Detecting crypto-ransomware in IoT networks based on energy consumption footprint, J. Ambient Intell. Humaniz. Comput. (2017) 1–12.
[14] Amin Azmoodeh, et al., Robust malware detection for Internet of (battlefield) Things devices using deep eigenspace learning, IEEE Trans. Sustain. Comput. (2018) 88–95.
[15] Fairuz Amalina Narudin, et al., Evaluation of machine learning classifiers for mobile malware detection, Soft Comput. (2014) 343–357.
[16] Mohammed Ali Al-Garadi, Amr Mohamed, Abdulla Al-Ali, Xiaojiang Du, Mohsen Guizani, A Survey of Machine and Deep Learning Methods for Internet of Things (IoT) Security.
[17] Amit Ray, Compassionate Superintelligence AI 50: AI with Blockchain, Bmi, Drone, IoT, and Biometric Technologies, Inner Light Publishers, 2018.
[18] L. Miralles-Pechuán, D. Rosso, F. Jiménez, J.M. García, A methodology based on deep learning for advert value calculation in CPM, CPC and CPA networks, Soft Comput. 21 (2017) 651–665.
[19] Hoda El Merabet, Abderrahmane Hajraoui, A survey of malware detection techniques based on machine learning, Int. J. Adv. Comput. Sci. Appl. 10 (2019).
[20] Sudipta Chowdhury, Mojtaba Khanzadeh, Botnet detection using graph-based feature clustering, J. Big Data (2017).
[21] T. Sherwood, E. Perelman, G. Hamerly, B. Calder, Automatically characterizing large scale program behavior, SIGARCH Comput. Archit. News 30 (5) (2002) 45–57.
[22] Huy Trung Nguyen, Quoc Dung Ngo, Van Hoang Le, IoT botnet detection approach based on PSI graph and DGCNN classifier, in: International Conference on Information Communication and Signal Processing, 2018, pp. 118–122.
[23] Huy-Trung Nguyen, et al., Towards a rooted subgraph classifier for IoT botnet detection, in: The 7th International Conference on Computer and Communications Management, 2019.
[24] J. Bai, Y. Yang, S. Mu, Y. Ma, Malware detection through mining symbol table of Linux executables, Inform. Technol. J. 12 (2) (2013) 380–383.
[25] J. Saxe, K. Berlin, Deep neural network based malware detection using two dimensional binary program features, in: 10th International Conference on Malicious and Unwanted Software, 2015, pp. 11–20.
[26] Igor Popov, Malware detection using machine learning based on word2vec embedding of machine code instruction, in: Siberian Symposium on Data Science and Engineering, SSDSE, 2017, pp. 1–4.
[27] Hayate Takase, et al., A prototype implementation and evaluation of the malware detection mechanism for IoT devices using the processor information, Int. J. Inf. Secur. (2019) 1–11.
[28] Mojtaba Eskandari, Sattar Hashemi, A graph mining approach for detecting unknown malwares, J. Vis. Lang. Comput. 23 (2012) 154–162.
[29] Zhao Zongqu, Junfeng Wang, Chonggang Wang, An unknown malware detection scheme based on the features of graph, Secur. Commun. Netw. 6 239–246.
[30] Z. Xu, K. Ren, S. Qin, F. Craciun, CDGDroid: Android malware detection based on deep learning using CFG and DFG, in: Proceedings of the 20th International Conference on Formal Engineering Methods, ICFEM, 2018, pp. 177–193.
[31] Fatemeh Karbalaie, et al., Semantic malware detection by deploying graph mining, Int. J. Comput. Sci. 9 (1) (2012).
[32] Cesare Silvio, Yang Xiang, Malware variant detection using similarity search over sets of control flow graphs, in: IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications, 2011.
[33] Aya Hellal, Lotti Ben Romdhane, Maximal frequent sub-graph mining for malware detection, in: 15th International Conference on Intelligent Systems Design and Applications, ISDA, 2015.
[34] Ming Fan, et al., Frequent subgraph based familial classification of Android malware, in: IEEE 27th International Symposium on Software Reliability Engineering, 2016.
[35] F-Secure, Articles: botnets, 2016, https://www.f-secure.com/en/web/labs_global/botnets. (Accessed 21 Feb 2019).
[36] Michele De Donno, Nicola Dragoni, Alberto Giaretta, Angelo Spognardi, Analysis of DDoS-capable IoT malwares, in: The Federated Conference on Computer Science and Information Systems, 11, 2017, pp. 807–816.
[37] Manos Antonakakis, et al., Understanding the Mirai Botnet, in: USENIX Security Symposium, Canada, 2017, pp. 1092–1110.
[38] Jiawei Su, Danilo Vasconcellos Vargas, Sanjiva Prasad, Daniele Sgandurra, Yaokai Feng, Kouichi Sakurai, Lightweight classification of IoT malware based on image recognition, in: 2018 IEEE 42nd Annual Computer Software and Applications Conference, COMPSAC, vol. 2, 2018, pp. 664–669.
[39] Annamalai Narayanan, Mahinthan Chandramohan, Lihui Chen, Yang Liu, Santhoshkumar Saminathan, Subgraph2vec: Learning distributed representations of rooted sub-graphs from large graphs, 2016, arXiv preprint arXiv:1606.08928.
[40] Darko Andročec, Neven Vrček, Machine learning for the Internet of Things security: A systematic, in: Proceedings of the 13th International Conference on Software Technologies, ICSOFT, 2018, pp. 563–570.
[41] VirusShare, Because sharing is caring, 2019, https://virusshare.com/. (Accessed at 10/01/2019).
[42] https://www.virustotal.com. (Accessed at 5/02/2019).
[43] Hamed HaddadPajouh, Ali Dehghantanha, Raouf Khayami, Kim-Kwang Raymond Choo, A deep recurrent neural network based approach for Internet of Things malware threat hunting, Future Gener. Comput. Syst. 85 (2018) 88–96.
