Download as pdf or txt
Download as pdf or txt
You are on page 1of 20

This article has been accepted for publication in IEEE Access.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.Doi Number

IoT Network Anomaly Detection in Smart


Homes Using Machine Learning
Nadeem Sarwar1 *[0000-0001-8681-6382], Imran Sarwar Bajwa1[0000-0002-5161-6441], Muhammad
Zunnurain Hussain2 [0000-0002-6071-1029], Muhammad Ibrahim1 and Khizra Saleem1
1
Department of Computer Science, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
2
Department of Computer Science, Bahria University Lahore Campus, Lahore, Pakistan
Corresponding author: Nadeem Sarwar (e-mail: Nadeem_srwr@yahoo.com).

ABSTRACT In this modern age of technology, the Internet of things has covered all aspects of life including smart
situations, smart homes, and smart spaces. Smart homes have a large number of IoT objects that are working continuously
without any interruption. Better security and authentication of these smart devices can provide peaceful environments to live
in such spaces. It is important to monitor the activities of smart IoT devices to make them work fault free. Such devices are
small, consume relatively less power and resources, and are easily attackable by attackers. It is crucial to protect the integrity
and characteristics of the smart home environment from external attacks. Machine Learning played a vital role in
recognizing such malicious activities and attempts. Several Machine Learning approaches are available to detect the normal
and abnormal behavior of IoT device traffic. This study proposed a machine learning-based anomaly detection approach for
smart homes using different classifiers. Testing and evaluation are performed using the University of New South Wales
(UNSW) BoT IoT dataset. Machine learning models based on four classifiers are built using an IoT devices dataset. For the
Test dataset, the Weighted Precision, Recall and F1 score of Random forest, decision tree, and AdaBoost is 1 as compared to
ANN which has 0.98, 0.96, and 0.96 respectively Results show that high performance, precision, and robustness can be
achieved using the proposed methodology. In this way, smart homes' security and identity of devices can be monitored and
anomalies can be detected with high accuracy. Attack categories include binary class, multiclass class, and subclasses.
Results show Random Forest algorithm outperforms enough to use this methodology in smart environments.
INDEX TERMS Smart Homes, IoT Environment, Cyber Security, Network Anomaly Detection, Smart
Environments, Machine Learning.

I. INTRODUCTION networks in smart homes. Researchers are focusing on the


The Internet has revolutionized the modern age of security of IoT applications but it is also the demand of the
technology by providing everyday life facilities at the current age to work on the protection of IoT networks from
fingertip. The Internet of Things (IoT) is one of the unauthorized users (intruders) [4].
technologies, which has changed the ideology of modern Smart Homes' technology usage is rapidly growing. This
developments. IoT has been used in hospitals, agri-culture, aids people and individuals to control house appliances and
restaurants, roads, and even in our houses. IoT-based devices through a single platform. In this way, people will
applications are known as smart applications [1. Smart be able to easily control and monitor house devices. The
homes are based on IoT devices that are capable of basic objectives of a smart home are illustrated in Fig 2.
capturing and facilitating every area of the house using These objectives show that comfort, security, and reliability
smart sensors and controllers. These sensors are are the key points. However, this comfort also increases the
communicating through an internet connection. These risk of external threats that should address properly. One of
devices share data for particu-lar tasks and purposes [2]. the main concerns in smart home environment technology
Fig 1 shows the basic elements of an IoT network, which is to permit users to monitor, control device security and
in-clude identification, sensing, communication, services, take precautions accordingly [6] Smart home devices
and semantics. IoT applications are increasing day by day, consist of sensors and actuators. These devices are of
different sizes. Voice controllers may enable intruders to
which enables these devices to become low-cost, ener-gy-
risk the security of the network. A household person will
efficient, and compact. However, increased usage of IoT
not be able to detect and protect against such intrusions and
devices also increases the risk factors and threats for such
threats. In recent years, researchers are focusing on
networks [3]. It is important to make these devices secure
application intrusion detection to address such issues [7].
and threat-free so that people can securely use such

VOLUME XX, 2017 1


This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

FIGURE 1. The IoT Components

FIGURE 2. Smart Homes Objectives


con-trolled system can use an Ethernet, wireless
Generally, there are two types of home automation systems connection, or Bluetooth for such purposes [9].
available. These include a locally controlled system and a The rapid increase in IoT technology for smart homes also
remotely controlled system. First, a local controller is used increases the potential risks of network attacks and security
to control in-house devices. While in second, devices can challenges. It is necessary to bring high-level security to the
be controlled from distant areas. Remotely controlled smart home environment to enable privacy and data
systems are operated using an internet connection. A locally protection. Such advancements increase the benefits and
challenges of smart environments. If these smart

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

environments are attacked by intruders of attackers, serious


privacy issues to users’ data can create damaged by stealing Now, we suggested a Machine Learning based solution for
information or monitoring the activities [10]. smart home environments’ abnormal activity monitoring
Thereafter, a secured IoT framework for anomaly detection and detecting malicious activities. The proposed framework
is needed for the protected communication of sensing is tested with six different ML algorithms. An additional
devices. Smart devices make people more vulnerable to crucial characteristic of this anomaly detection study is to
safe-ty precautions. For some investors and business compare simple machine learning classifiers like Decision
owners, data is everything to the organization [11]. Certain Tree (DT), AdaBoost (ADA), and Random Forest (RF)
information must be kept secret by the government and with complex classifiers like Artificial Neural Network
certain private organizations. Internet of Things (IoT) nodes (ANN).
allow hackers to collect vital information from any The following are primary contributions of the suggested
significant corporation [12]. The abovementioned issues are study:
frequently resolved by employing very simple strategies. 1. The dataset for the proposed methodology is evaluated
The signature-based [13] technique is used to capture for six different categories of anomalies.
anomalies in advance. At times, the system is crosschecked 2. An anomaly detection framework is implemented for
with the database. This strategy, however, creates analyzing anomalous activity by machine learning
processing costs and leaves the system vulnerable to algorithms.
unidentified assaults. Feature selection is critical for 3. The proposed study has high performance and reliable
identifying the most important traits. The da-ta-analysis- predictions for anomaly detection and alert generation.
based strategy can overcome the problem of unknown 4. This study's key contribution is the development of a
threats and can be adopted faster than other alternatives. As high-performing anomaly detection model utilizing a
a result, this study adopts data-driven method-ologies [11]. machine learning classification approach.
Several Machine Learning (ML) approaches are discussed Section 2 of the paper describes the literature work. The
to address such issues. Machine Learning models analyze, proposed approach is en-lightened in section 3 along with
recognize, and monitor traffic patterns to learn the traffic dataset description, and anomalies details. Section 4
flow in a normal routine. For this determination, discusses the implementation specifics and outcomes of the
supervised, and unsupervised machine learning models are proposed methodology. Section 5 lastly provides the
available [14]. To overcome the security problem in smart conclusion and future work.
homes, a supervised learning-based anomaly detection
model is proposed which is effi-cient enough to accurately II. BACKGROUND & LITERATURE REVIEW
predict the anomalies on time and generate alerts and halt Internet of Things (IoT) is becoming popular in every field
the working of the devices to enable persons to take of life including industry and research. IoT is simply
precautionary measures. defined as the objects or things that share recorded
The main goal of this research is to develop a smart, secure, information with other objects or platforms through a
and reliable methodology for smart home environments to communication channel like the Internet. IoT devices
detect anomalies and vulnerabilities. The primary focus of consist of sensors, actuators, and communication links.
our study is on identifying network anomalies, specifically These sensors and devices are used nowadays to connect
in the context of IoT (Internet of Things) environments. homes, offices, schools, universities, hospitals, and even
These anomalies encompass a range of malicious activities human beings. [15]. IoT refers to a smart environment.
that may include but are not limited to: Such smart systems are be-coming helpful in industry,
1. Data Exfiltration: Anomalies related to unauthorized banks, education, healthcare, and home environments.
data transfers or leaks from the IoT devices, indicating Sever-al biomedical IoT devices are available for patients
potential security breaches. to assist them in monitoring and diagnosis of diseases [16].
2. Keylogging: Detection of anomalous keystroke In this way, patients will be able to understand their health
logging activities, which could indicate attempts to
parameters. Such devices are also connected to a central
capture sensitive information.
platform where recorded information is stored and
3. OS Fingerprinting: Anomalies related to attempts to
processed further.
identify the operating system of devices in the
network, often a precursor to targeted attacks. IoT is also assisting in managing and operating
4. Service Scanning: Unusual activities associated with powerhouses that transform these houses into smart
probing or scanning services and ports on IoT devices, powerhouses. The basic objective to offer smart
a common behavior of potential malicious entities environments is to in-crease the availability, usability, and
seeking vulnerabilities. reliability of the services and applications. The Internet of
5. UDP Anomalies: Abnormalities in UDP (User Things (IoT) is designed to facilitate productive
Datagram Protocol) communication, which can be communication between the real world and its digital
indicative of network attacks or unusual device equivalent (also called the digital transformation or cyber-
behavior. physical systems) (CPS)) [17].

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

In smart homes, such devices are playing an significant role Open Challenges, and Prospects [44], discusses machine
in day-to-day activities. One must consider the security and learning frameworks for DDoS attack detection in SD-IoT
privacy of data shared by IoT objects. Gartner predicts that networks.
by 2020, more than 25 percent of assaults on businesses For IoT-based smart home environments, Alippi et al. [26]
would include Internet of Things (IoT) devices [18]. presented a framework for detecting changes at the level of
Using these obstacles and numbers as a jumping-off point, sensors. The approach constructs a dynamical model of the
we discovered that almost all studies with the intent of anticipated sensor-generated signal over time [22]. The
analyzing the difficulties encountered in the IoT area system then continuously compares the predicted signals
include a portion dedicated to security and privacy concerns from the model with the actual signals (data streams)
[19]. The previous few years have seen a number of cyber- arriving from various sensors. Specifically, it uses a method
security problems involving networked, pervasive, and for detecting transitions known as "Intersection-of-
Internet of Things technology. Numerous attacks have been Confidence" to look for differences [27]. Author investigate
made against the Internet of Things (IoT), including notable the deep learning based solutions for IDS and achieved
Distributed Denial of Service (DDoS) attacks on Dyn's 95% precision and 97% Recall after various attack
DNS (domain name system; see solutions [38]. Also was discussed about Imbalanced data
reference).[20]Attacks/vulnerabilities on/off self-driving issues during intrusion detection and applied SMOTE
cars [21], ransomware attacks [21], attacks on smart home technique to balance dataset [39]. The author used IG and
and medical equipment [22,23], and more. GR filter-based approaches for feature ranking and applied
The IoT and critical infrastructures pose new security five ML algorithms, including kNN, ANN, bagging, J48,
threats. The limited resources aboard the devices used to and Ensemble methods, to classification of traffic features.
construct IoT and CPS apps make it difficult to mitigate Both classification approaches are implemented in this
these hazards. In addition, the gadgets are inherently study, Binary and multi [40]. NSL-KDD dataset was used
insecure since they need a constant Inter-net connection. to test the IoT based cyber-attack classification (Binary and
New instances and studies show that greater investigation is Multi) [41].
required [22]. They also reveal, sadly, that present solutions The available studies for these smart homes' security and
are far from being acceptable to halt the exponential threat protection are yet not mature enough to fulfill the
development in the number and complexity of assaults [24]. domain requirements. Such environments need security
Thus, the necessity for further layers of defense to make mechanisms, privacy mechanisms, and data transmission
smart home IoT devices more durable and proof of such mechanisms with machine intelligence for a secure smart
assaults is vital [22]. home. The proposed study consists of Machine Learning
[25] introduced a host-based anomaly discovery method in based layered architecture as shown in Fig 3, which can
which device features and activities are considered using a provide improved and reliable detection of anomalies in
mobile phone application. Their system uses a supervised smart homes.
machine learning approach to evaluate the gathered data At level 1, which is L1, the dataset generation and
and determine whether an app is benign or dangerous. collection are performed. In the proposed study, the
Device metrics including CPU load, active processes, and UNSW−2018−IoT−Botnet dataset is used. This dataset
application programming interface (API) invocations are all contains the data of IoT devices of a smart home. This
included in their feature vector [22]. Based on their claimed dataset contains 0 to 668521 records that are sufficient to be
results, their method successfully distinguishes between used for the study. This information is forwarded to L2,
apps that are games and those that are tools 99.7 percent of which is the data pre-processing level. At this level, data is
the time. However, they fail to reveal or comment on preprocessed for missing values, feature engineering, and
whether the shown or mentioned programs are dangerous. data balancing. After preprocessing, another vital level is
Relatedly, Burguera et al. (2011) devised a method for feature selection. The most relevant features are selected
detecting malware on smart devices at the device level [22]. using correlation among several features. After selecting
This study introduces a system called Crowdroid, which relevant and ideal features for classification, training data is
records and analyses system calls as they occur in real-time. used for classification. This classification is performed
A mobile application receives this data. The collected data, using different machine learning algorithms. These
represented by Linux system calls, is sent to a central server algorithms include AdaBoost, Decision Trees, Random
(off-device research) where it is analyzed for harmful Forest, LSTM, Auto Encoder, and ANN.
software. Towards a Machine Learning-based Framework
for DDoS Attack Detection in Software-Defined IoT (SD-
IoT) Networks[42]. This study discusses machine learning
frameworks for DDoS attack detection in SD-IoT networks
[43]Toward Software-Defined Networking-Based IoT
Frameworks: A Systematic Literature Review, Taxonomy,

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

working of the layered super-vised learning approach. In


this layered process, training data with labels are provided
to the classifier for learning and classification purposes.
After this, testing is accomplished to assess the accuracy of
the model with unlabeled data.
A subfield of Computational systems in which human
intelligence-based algorithms are used to solve real-life
problems is called machine learning. The Machine learning
is reflected as a functional mare in the Big Data field.
Machine Learning approaches can be applied in health,
business, commerce, aerospace, biomedicine, music, media,
and many more real-world applications. However such
advancements also increase security issues and threats to
smart home environments.
Machine Learning has revolutionized the field of smart
home environments by providing IoT devices in hand. In
the proposed anomaly detection system, data received from
the sensing devices is processed using a supervised
machine-learning algorithm. In a supervised learning
approach, the algorithm is previously trained on test data
FIGURE 3. Proposed systematic architecture for Smart home IoT
Network anomaly detection and then performs prediction using this knowledge. In the
III. PROPOSED METHODOLOGY unsupervised approach, previous knowledge is not known
Smart homes are a new technology for people to live [30]. Pattern recognition and feature extraction are per-
comfortably. However, such networks and devices are formed on real-world data. These two approaches can be
prone to intrusion attacks due to a lack of security applied according to the requirement of the problem. Figure
mechanisms for implementation. The speedy evolution of 4 shows the working of layered supervised learning
the Internet of Things opens new dimensions for IoT approach. In this layered process, training data with labels
devices but poses new threats as well. The tiny sensors- are provided to the classifier for learning and classification
based intelligent environments open a new challenge for the purposes. After this, testing is performed to evaluate the
security and privacy of home devices. A subfield of performance of the model with unlabeled data.
Computational systems in which human intelligence-based Fig 5 explains the system structure and flow of different
algorithms are used to solve real-life problems is called parts of the suggested anomaly detection model for IoT-
Machine Learning (ML). it is reflected as a functional mare based smart environments. The primary stages of the
in the Big Data field. These approaches can be applied in suggested approach are Data Sources, Data Gathering,
health, business, commerce, aerospace, biomedicine, music, Processing, and Anomaly detection. The proposed
media, and many more real-world applications. However framework examines the traffic and arranges this activity in
such advancements also increase security issues and threats accordance with the typical motions recorded for various
to smart home environments. Machine Learning has IoT devices in order to detect any unusual movement. The
revolutionized the field of smart home environments by gathering of data is the first stage in Fig 6. The proposed
providing IoT devices in hand. In the suggested anomaly study is based on one of the benchmark datasets from
detection system, data received from the sensing devices is UNSW BoT IoT that is openly available. There are 46
processed using a supervised machine-learning algorithm. attributes in this data set. Eight different types make up this
In a supervised learning approach, the algorithm is dataset. Section 3.1 of the dataset contains a detailed
previously trained on test data and then performs prediction description. Due to its extraction from several Internet of
using this knowledge. In the unsupervised approach, Things devices, data used in this study is regarded as big
previous knowledge is not known [28]. Pattern recognition data [29,31]. Data Pre-processing is done using data
and feature extraction are per-formed on real-world data. cleaning, data visualization, data balancing, and feature
These two approaches can be applied according to the selection.
requirement of the problem. The following Fig 4 shows the

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

FIGURE 4. Supervised Learning Approach

FIGURE 5. System Communication Channel

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

Figure 6. Detail Description of Anomaly Detection Framework

categorical column. 80% of training and 20% of test data is


A. DATASET DESCRIPTION used for training and testing.
Several datasets are available for this work such as the Bot-
IoT and the UNSW BoT IoT datasets. It is found that the
UNSW BoT IoT dataset is more refined and has been
practiced by researchers. The entire dataset is 16 GB in size,
it has some similar attributes to the UNSW NB 15 dataset but
contains supplementary diversity in the type of mischievous
measures as it is shown in Fig 7. The designed methodology
consists of the UNSW BoT IoT dataset because of the
diverseness and authenticity of training models [30,37]. This
contains 46 different features.

In this work, 668522 records were arbitrarily nominated for


selection. These selected records are used for training and
testing which is 5% of the entire dataset. Data cleaning and
preparation are performed for training and testing models
[29]. A dataset contains nominal, numerical, and categorical
features. It is important to encode the dataset for numerical Figure 7. Description of UNSW BoT IoT dataset
values. One hot encoding and label encoding technique are
used for transformation into numerical attributes. In label
encoding, unique numbers are assigned to each class in the

1) DETAILS OF DATASET ATTACK CATEGORIES attack, which is studied in the proposed work. A comparison
UNSW BoT IoT dataset contains six different classes with a of the BoT IoT dataset with other standard network datasets
different number of records as shown in Table 1. There are is given in Table 3. This shows that the BoT IoT dataset is far
six different types of categories considered in the pro-posed more realistic and reliable to test smart network
study. At first, binary classification is done in which there are environments as compared to other datasets, which are used
two classes i.e. attack and normal. In multiclass in previous research. T represents true and F represents False.
classification, the attack category is subcategorized into True shows the presence of that characteristic and False
Keylogging, Data Exfiltration, Service Scan, OS Fingerprint, shows the absence of the characteristic.
and UDP. Table 2 presents the description of each type of

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

B. FEATURE ENGINEERING variations in network traffic, the model can identify abnormal
behaviors associated with potential attacks or malicious
The Machine Learning models are based on exploratory data activities. For instance, sudden spikes in packet sizes or
analysis and observations. A dataset must be in the unexpected protocol usage can be indicative of an anomaly.
appropriate format to be given to the classification model.
For this purpose, categorical attributes are handled by using 2. Communication Patterns: Features related to
different transformation techniques. The dataset used in this communication patterns involve attributes like the number of
study contains both categorical and numerical features. In connections, data transfer rates, and the frequency of
order to convert the input into numeric characteristics, one interactions between devices. Anomalies in these patterns,
hot encoding and label encoding technique is used. In this such as a device initiating an unusually high number of
research, the data was transformed into a feature vector using connections, may suggest a potential security threat.
label encoding techniques [11]. 3. Categorical Features: These features pertain to
After this process, we ensure that data must be balanced so qualitative aspects of network traffic, including service types
that the efficiency of the model can be achieved. 46 features and flags. By one-hot encoding categorical features, the
selected after encoding are presented in Fig 8. In this BoT model can effectively handle categorical data and distinguish
IoT dataset, there are different attack categories. Each patterns of behavior for different services or flags. Anomalies
category contains a different number of records. For efficient may manifest as unexpected or malicious usage of particular
performance, it is important to balance the dataset. For data services.
balancing, the oversampling technique is used. The balanced
dataset after over-sampling is shown in Fig 9.
TABLE 1: ATTACK COUNT DESCRIPTION IN THE DATASET
To aid in the computational efficiency and simplicity of the
machine learning models, only relevant characteristics are
Sr. Class Sub Class No of Samples
chosen in the feature selection process using correlation as 1 576884
shown in Fig 10. In addition, this method can mitigate the Data Exfiltration
effects of overfitting when all characteristics are used [11],
below features are extracted. Network Traffic Features: 2 Attack Keylogging 73168
4 OS Fingerprint 477
5 Service Scan 73
1. These features encompass data related to network 6 UDP 6
protocols, packet sizes, flow durations, and 7 Normal …. 17914
source/destination IP addresses. By analyzing patterns and

TABLE 2: DESCRIPTION OF ATTACKS MODELS

Sr. Type Subcategory Description


1 Data Exfiltration Unauthorized removal or movement of any data from a device.

2 Act of tracking and recording every keystroke entry made on the network without
Keylogging
permission.
4 Attack OS Fingerprint Obtain OS information of target hosts to prepare for future attacks.
5 Service Scan The system is scanned for some information that aims to corrupt data.
6 Forward a huge amount of UDP packets to exhaust the bandwidth of target servers/
UDP
devices.

TABLE 3: COMPARISON OF DATASETS MODELS IN TERMS OF DIFFERENT CHARACTERISTICS

Dataset Darpa98 KDD99 ISCX LBNL DEFCON8 CAIDA UNIBS CICIDS TUIDS UNSWNB15 BoT
2017 IoT
Realistic Traffic No No Yes Yes No Yes Yes Yes Yes Yes Yes
Realistic Yes Yes Yes No No Yes Yes Yes Yes Yes Yes
Testbed
Configuration
Labeled data Yes Yes Yes No No No Yes Yes Yes Yes Yes
New generate No Yes Yes No No No No Yes Yes Yes Yes
features
Full packet Yes Yes Yes No Yes No Yes Yes Yes Yes Yes
capture
Diverse attack Yes Yes Yes Yes Yes No No Yes Yes Yes Yes
scenarios

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

IoT Traces No No No No No No No No No No Yes

Figure 8. 63 Features Selected after Encoding

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

Figure 9. Balanced Dataset after oversampling

Figure 10. Feature engineering process in the proposed methodology

In feature selection, No machine learning method similar to Figure 11. Oversampling process in the proposed methodology

that used by Pahl et al. has been used here for feature
selection. [32] for the proposed study [11]. Because of the Algorithm 1
large dataset, it is important to perform feature selection for Outlier detection and handling algorithm:
improved performance and minimum biasedness. This 1. Sort the dataset D in ascending order
process of oversampling is shown in Fig 11. IQR and 2. estimate the 1st and 3rd quartiles(Q1, Q3) in D features
SMOTE techniques are used for balancing and 3. calculate IQR=Q3-Q1
oversampling. SMOTE is a technique of oversampling in 4. calculate lower bound = (Q1–1.5*IQR), upper bound =
which artificial samples are created for the minority class. (Q3+1.5*IQR)
This technique helps in overcoming the overfitting problem 5. loop through the values of the dataset and check for those who
brought on by random oversampling. IQR is used to fall below the lower bound and above the upper bound and mark
identify outliers in the dataset. IQR (inter Quartile Range) them as outliers
is used to identify outliers. 1.5 IQR outlier is used to 6. Replace outlier values with Median value
identify abnormalities.

Algorithm 2

SMOTE algorithm:
1. Take difference between a sample i and its nearest neighbor ki
2. Multiply the difference by a random number between 0 and 1
Dif =Random(0,1) * (i*ki)
3. Add this difference to the sample to generate a new synthetic
example in feature space
Add Dif to Feature space
4. Continue on with next nearest neighbour up to user-defined
number.

Feature selection is performed using the following equation


of covariance. The algorithm estimates the merit of

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

covariance between x and y. where Sd is the standard emerged. Then it evolved into a tree, or more accurately, a
deviation. flowchart.

P(x,y)= Cov(x,y) / (Sd (x).Sd(y)) (1) 3) RANDOM FOREST


The Random Forest works using forest decision trees with a
Algorithm 3 supervised learning technique. This algorithm is fast
Feature Selection Algorithm:
compared to other supervised learning techniques [34].
Array of features _ Feature Extraction Several individual tree results are evaluated using majority
For i:Array of features.length voting or average calculation. This algorithm provides a good
For k:Array of data[i].length performance measure [28].
X=feature of i
Y=feature of i+1
4) ARTIFICIAL NEURAL NETWORK
Cov(x,y) calculate covariance between x and y
The ANN is a deep learning method of predicting classes or
Calculate standard deviation of x
categories. We can modify many parameters in ANN as
Calculate standard deviation of y
Correlation value calculated from above
compared to other classifiers. However, error optimization
equation
time is increased in this technique [11,35]. The amount of the
End for loop
chosen features of the network traffic for a host is
represented by the number of input nodes of the ANN (22 or
41). The employed ANN also has two hidden layers with 25
and 30 nodes each, as well as an output layer with 29 nodes
C. MACHINE LEARNING MODELS
(representing 29 categories of attacks). The backpropagation
Several machine learning approaches are considered while
(BP) computation method has been used to identify the
experimenting with the proposed approach. These algorithms
numbers of hidden layers and nodes in them.
include AdaBoost, Decision Tree, Random Forest, Ada
Boost, Auto Encoder and Artificial Neural Network. A
5) AUTOENCODER
description of these algorithms is given below.
An example of a self-supervised learning model that can
1) ADA BOOST learn a compressed representation of incoming data is the
Ada Boost (Adaptive Boosting) is a boosting technique that autoencoder. In this, the reconstruction component of the
is used in supervised machine learning algorithms. In this model's design may be dropped after fitting and the model
technique, instances with minimum weights are assigned may be used all the way until the bottleneck. A fixed-length
higher weights. In short, instances with minimum weights are vector that offers a compressed representation of the input
converted into strong instances by assigning more weights data is the output of the model at the bottleneck. Neural
[33]. Equation 1 describes the AdaBoost final classifier networks known as autoencoders are capable of finding
equation. representations with low dimensions of highly dimensional
input. It ought to be able to rebuild the input from the output
based on this. The model can then receive input data from the
domain, and the model's output at the bottleneck can be used
( 2) as a feature vector in a supervised learning model.

In Equation 2, T represents the weak classifiers. ht (x) is the 6) LSTM


output of weak classifier “t”. α_t is the weight applied to The use of LSTM Autoencoders on time series, audio, video,
classifier ‘t’ as obtained by the Ada Boost algorithm? The and text sequence data has shown that they are capable of
final result is thus a linear combination of all the weak learning a compressed representation of the sequence data.
classifiers, and the choice is made by inspecting the sign of The input sequence is sequentially read by an encoder LSTM
this sum of values. model. The hidden state or output of this model represents an
internally learnt representation of the full input sequence as a
fixed-length vector after reading the entire input sequence.
2) DECISION TREE
After that, this vector is sent as input to the decoder model,
Decision Trees work by using a tree-like structure that is
which interprets it as each step of the output sequence is
based on profits, outlays, and likelihoods. A schematic
produced. For a specific dataset of sequences, an encoder-
depicting the many consequences of a set of interconnected
decoder LSTM is set up to read the input sequence, encode it,
decisions. In most cases, a DT will have a single starting
decode it, and reproduce it. Based on the model's capacity to
node from which several branches will emerge. The results
replicate the input sequence, the performance of the model is
of these actions generated new nodes, from which new cases
assessed.

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

D. PERFORMANCE EVALUATION METRICS IV. IMPLEMENTATION AND RESULTS ANALYSIS


There are a variety of methods for gauging performance. The architecture and full design specifications of the
Methods like statistics and mathematics fall under this suggested Network anomaly detection framework are
category. The designed system's efficacy is measured using covered in the previous section. We will go into the specifics
the following indicators. Indicators like this help determine of the suggested system's implementation in this part. Figure
the method that is most appropriate for the job at hand. 3 displays layered construction. Figure 6 displays the system
in operation. Following that, section 4.2 presents the findings
1) ACCURACY analysis.
Accuracy is the measure of correctly classified instances A. EXPERIMENT DESCRIPTION
among all of the instances. The formula is given below in
Equation 3. The proposed methodology consists of the UNSW BoT IoT
dataset because of the diverseness and authenticity of training
models [30,37]. Data Pre-processing is done using data
cleaning, data balancing, and feature selection. The dataset is
(3) preprocessed by removing noisy data and missing
information. After removing these anomalies, one hot
2) PRECISION encoding is performed for categorical features. In this way,
Precision is a mathematical method to estimate the each attack category has a unique column which increases
performance of the classifier. Precision is the percentage of the number of features as well. In our dataset, OS
true predictions of positive classes divided by the total Fingerprint, Data Exfiltration, Service Scan, Keylogging, and
prediction of the positive classes in the model. The formula UDP are categories of attack to which the dataset is
for precision is given below. classified. Figure 8 shows the details of features selected
after one hot encoding. IQR outlier is used to identify
abnormalities. 1.5 IQR technique is used to identify outliers
( 4) in the proposed study. SMOTE (Synthetic Minority
Oversampling Technique) Oversampling is performed to
3) RECALL balance the dataset for each category. Figure 9 shows the
Recall denotes the amount of correctly classified instances details of oversampling for each attack category. After this
of the positive class. In Equation 5, R means Recall, TP process, we ensure that data must be balanced so that the
symbolizes true positive predictions and FP symbolizes efficiency of the model can be achieved. 46 features selected
False positive predictions [11]. after encoding are presented in Fig 8. After selecting relevant
features, machine learning models are applied to measure
precision, recall, accuracy, and F1. Weighted mean,
( 5) precision, and recall are also calculated for detailed analysis.

B. RESULT ANALYSIS
4) F1 SCORE
The data description section explains machine-learning
F1-score (F1) represents the measure of the harmonic mean
approaches used on the dataset UNSW BoT IoT. Validation
of recall (or TPR) and precision [36]. The formula for the
of data is performed using 5-fold cross-validation.
F1 measure is given in equation 6.
Performance parameters used for result evaluation are
confusion matrix; accuracy, precision, recall, and F1 score as
discussed in section 3.4. The accuracy comparison of
Random Forest, Decision Tree, AdaBoost, LSTM, Artificial
( 6) Neural Network (ANN), and Auto Encoder is shown in Table
4 and Fig 12.
5) CONFUSION MATRIX C. Comparison of Accuracy
When evaluating machine learning classification models, the
confusion matrix is a useful tool for getting a feel for how Accuracy is the measurement of correctly classified items
things will shake out. A comparison between the actual among all items. This comparison of accuracy is shown in
classes and the expected ones is shown in the table below. It Table 6. In this table accuracy of ML, models including
facilitates a speedy diagnosis of misidentification between Random Forest, Decision Tree, ANN, and AdaBoost are
actual and anticipated classes. Prediction outcomes from a compared. For training data, AdaBoost Decision Tree and
Random forest achieve 100% accuracy whereas ANN
classification task may be summarised using a confusion
achieves 93.17% accuracy. For the test dataset, these
matrix [37].
algorithms achieve 99.9% and ANN achieves 95.74%
accuracy. Fig 12 shows the graphical representation of

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

accuracy, in which classifiers are shown on the x-axis and the shown in blue color. Accuracy shows that ANN performance
accuracy percentage is shown on the y-axis. Training dataset is low compared to AdaBoost, Decision tree, and RF. ANN
accuracy is shown in yellow color and test dataset accuracy is performed low with training and testing datasets.
TABLE 4. ACCURACY COMPARISON OF ML APPROACHES.
Accuracy of ML Approaches
Classifiers
Performance
AdaBoost Decision ANN Random LSTM Auto
Metrics
Tree Forest Encoder
Training Data 100% 100% 93.17% 100%
91.40% 88.30%
Accuracy
Testing Data 99.99% 99.99% 95.74% 100%
90.20% 84.30%
Accuracy

Figure 12. Accuracy Comparison of Machine Learning Approaches

D. Comparison of Confusion Matrices From the 5-fold cross-validation results, it can be concluded
The confusion Matrix shows the actual values of true that RF performed best as shown in Fig 14. In the first fold,
positive, false positive, true negative, and false negative ANN performed with approximately 94% accuracy, and DT,
instances predicted by the specific ML model. Confusion RF, and ADA performs between 96% and 98%. ADA and
matrices of ML techniques, which are shown in Fig 13, RF, DT are similar in the first fold whereas RF is slightly lower
DT, and ADA performed well in terms of True Positive and than these. In the second fold, ANN accuracy is below 92%,
True Negative predictions and resulted in high precision, whereas ADA, DT, and RF are better than ANN. In the third
recall, and accuracy. In ADA, 10 instances of fold, DT and ADA accuracy is dropped and ANN accuracy is
OS_Fingerprint are misclassified out of 5450 instances, and 2 comparatively higher than these two, Whereas RF performs
instances of keylogging are misclassified. In RF, only 2 highest in the third fold. In the last two folds, ADA and DT
instances of keylogging and 1 instance of Service_scan are accuracy are similar and RF performed better in all of the
misclassified. In DT, 2 instances of keylogging, 7 instances five folds. ANN, LSTM, and Auto Encoder have less
of OS_Fingerprint, and 3 instances of Service_scan are performance.
misclassified. Whereas in ANN, 3 instances of keylogging, 8
instances of Normal, 21 instances of OS_Fingerprint, and
8519 instances of Service_scan are misclassified.

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

Figure 13. Confusion Matrix comparison of different ML approaches

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

Figure 14. Cross-Validation Accuracy of Decision Tree, Random Forest, ANN, and ADA

TABLE 5. MACRO PRECISION, RECALL, AND F1 SCORE OF ML ALGORITHMS

Macro precision, Recall, and F1 score of ML Algorithms


Classifiers
Performance
Data AdaBoost Decision ANN Random LSTM Auto
Metrics
Tree Forest Encoder
Precision 1.00 1.00 0.95 1.00 0.91 0.89
Training
Macro Recall 1.00 1.00 0.93 1.00 0.84 0.88
Data
F1 Score 1.00 1.00 0.93 1.00 0.86 0.94

Precision 0.99 0.99 0.91 0.99 0.89 0.77


Test Data Macro Recall 0.94 0.94 0.84 0.94 0.89 0.88
F1 Score 1.00 1.00 0.95 1.00 0.91 0.89

E. COMPARISON OF MACRO PRECISION, RECALL,


AND F1 SCORE
Average precision, recall, and F1 score per class are ( 8)
calculated using Macro measures. The macro measure is
used in multiclass classification in which precision is Similarly, for recall, individual class recall is calculated and
calculated for each class separately and then the average of then macro recall is calculated using Equations 9 and 10
these is calculated. Similarly, recall and F1 scores are respectively.
calculated as shown in Table 5. ( 9)

If there are six classes (a, b, c, d, e, f) then precision is


calculated using Equation 7, and Macro precision is
calculated using Equation 8. (10)
A graphical comparison of Macro performance measures is
(7) presented in Figure 15. In this, it is seen that Decision Tree,
Random Forest, and AdaBoost have good performance
Whereas; compared to ANN

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

Figure 15. Comparison of Macro Precision, Recall, and F1 score of ML algorithms

F. Comparison of Weighted Precision, Recall, and F1 B, C, D, E, F) then weighted precision can be calculated
score using the following formula.
Weighted performance measures are presented in Table 6.
In weighted measures, weights (number of instances) are
( 11)
attached to the measures. If the number of instances is (A,

( 22)

Similarly, F1 measure and Recall can be produced in this score for RF, ADA, and DT are 1 compared to ANN which
way. Fig 16 shows that weighted precision, recall, and F1 has low measures.
TABLE 1: WEIGHTED PRECISION, RECALL, AND F1 SCORE OF ML ALGORITHMS

Weighted precision, Recall, and F1 score of ML Algorithms


Classifiers
Data Performance Metrics AdaBoost Decision ANN Random LSTM Auto
Tree Forest Encoder
Precision 1.00 1.00 0.95 1.00 0.90 0.83
Training
Weighted Recall 1.00 1.00 0.93 1.00 0.87 0.87
Data
F1 Score 1.00 1.00 0.93 1.00 0.76 0.82

Precision 1.00 1.00 0.98 1.00 0.93 0.89


Test
Weighted Recall 1.00 1.00 0.96 1.00 0.86 0.87
Data
F1 Score 1.00 1.00 0.96 1.00 0.88 0.80

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

Figure 16. Comparison of Weighted Precision, Recall, and F1 score of ML algorithms

G. Summary of Right Classified and Misclassified respectively. With RF, 5450 and 0 are RC and MC
Instances respectively. LSTM has an 8765 MC value and Autoencoder
Table 7 presents the right classified (RC) and misclassified has 9504 misclassified instances. Their identification rate of
(MC) instances with different algorithms. The total count the malicious node is above 80%. Results show that the
shows the total number of instances in that particular class. proposed technique provides a detailed description of
Data Exfiltration has 2 instances with each classifier and the anomaly detection with large datasets. Feature engineering is
right predictions are made by each algorithm. Keylogging performed using one-hot encoding, label encoding,
has 28 counts of which 2 are misclassified by RF, DT, and oversampling, and correlation-based feature selection.
ADA whereas 3 are misclassified by ANN. For the normal Multiclass classification is performed accurately, which is
category, ANN correctly classifies 149, and 8 instances are more difficult to predict than binary classification. The
misclassified. proposed method outperforms Random forest and is best
with DT, and ADA.
For OS Fingerprint, 5450 is the total count out of which
accurately classified are 5440, and 10 instances are
misclassified by ADA. For the Decision Tree, 5453 is RC
and 7 is MC. For ANN, 5429 and 21 are RC and MC
TABLE 2: PROPOSED METHODOLOGY COMPARISON USING DIFFERENT ALGORITHMS

Attacks Total AdaBoost Decision Tree ANN Random Auto Encoder LSTM
Counts Forest
RC MC RC MC RC MC RC MC RC MC RC MC

Data 2 2 0 2 0 2 0 2 0 1 1 2 0
Exfiltration
Keylogging 28 26 2 26 2 25 3 26 2 10 18 20 08

Normal 157 157 0 157 0 149 8 157 0 100 56 130 27

OS 5450 5440 10 5453 7 5429 21 5450 0 5000 429 5300 129


Fingerprint
Service Scan 21857 21856 1 21854 3 13338 8519 21856 1 13240 8596 13248 8537

UDP 173063 173063 0 173063 0 173063 0 173063 0 172659 404 172999 64

Total MC 13 12 8551 3 9504 8765

Figure 17. Comparison of the proposed approach with the existing approach of the IoT anomaly detection system

2 VOLUME XX, 2017

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

Figure 17. Comparison of the proposed approach with the proposed Machine learning algorithms

ANN. The proposed study provides better re-sults compared


The suggested method outperforms the technique in which an
IoT sensors data set is utilized for experimentation and to previous approaches because of the feature selection
assessment, as shown by a comparison with the dataset approach which has improved the results. Evaluation and
supplied by [11,32]. There are a total of 357,952 samples and validation results proved that the proposed methodology
13 characteristics available in the aforementioned dataset gives 100% Training dataset accuracy with Decision Tree,
from Hasan et al. There are a total of eight classes in the Random Forest, and AdaBoost. Whereas 99.9% Testing
dataset, which consists of 347,935 Normal data and 10,017 Dataset accuracy with a decision tree, and Ada-Boost. And
aberrant data. There are gaps in reporting for 148 nodes in 100% test dataset accuracy with Random forest. Weighted
the "Accessed Node Type" feature and 2050 in the "Value" Precision, Recall, and F1 score of Random forest, decision
feature. In terms of precision, recall, and accuracy, our tree, and AdaBoost is 1 as compared to ANN which has 0.95,
suggested methodology outperforms the status quo method. 0.93, and 0.93 respectively for the Training dataset. For the
For the sake of safety and privacy, the suggested technique Test dataset, the Weighted Precision, Recall, and F1 score of
can be implemented in the context of smart homes. Random forest, decision tree, and Ada-Boost is 1 as
compared to ANN which has 0.98, 0.96, and 0.96
VII. CONCLUSION AND FUTURE WORK respectively. It is observed that the Random forest, decision
Smart home environments are becoming popular to provide tree, and ADA works best for larger datasets and can easily
ease and comfort to users. However, increased IoT devices identify malicious activities in the network with high
are more prone to attack due to a lack of security measures. It accuracy. The proposed frame-work is best for the detection
is important to monitor such devices to prevent and identify of Botnet attacks and malicious activities using RF.
malicious activities in the network. In the proposed study, the The proposed work is a significant contribution to securing
UNSW BoT IoT dataset is used for the evaluation of the IoT devices from external attacks by using a Feature
proposed methodology using a feature selection approach. selection-based machine learning approach to identify mali-
This dataset is based on real-time IoT botnet traffic of smart cious patterns in traffic. In this way, user privacy, security,
devices. Machine Learning-based anomaly detection and safety can be obtained. Our solution shows a path
methodology is evaluated using six algorithms i.e. Random
forest, decision trees, AdaBoost, LSTM Autoencoder, and

VOLUME XX, 2017 9

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

towards making security by design retrofittable into existing [13] Diro, A. A., & Chilamkurti, N. Distributed attack detection
scheme using deep learning approach for Internet of Things.
IoT installations. Future generation computer systems, 82, (2018). 761-768.
In the future, the proposed methodology of anomaly [14] Alsharif, M. H., Kelechi, A. H., Yahya, K., & Chaudhry, S. A.
detection can be tested and validated with new ML models Machine learning algorithms for smart data analysis in Internet of
things environment: taxonomies and research trends. Symmetry,
and other datasets like CIDIDS, KDD, Bot−IoT, and CTU-13 12(1), (2020). 88.
and results can be compared. More Machine Learning [15] Porkodi, R., & Bhuvaneswari, V. The Internet of Things (IoT)
classifiers can be tested. Unsupervised techniques can also be applications and communication enabling technology standards:
tested and compared with the algorithms used in this study. An overview. Paper presented at the 2014 International
Conference on intelligent computing applications (2014).
Other feature selection techniques can be tested with this [16] Saleem, K., Bajwa, I. S., Sarwar, N., Anwar, W., & Ashraf, A.
methodology. IoT Healthcare: Design of Smart and Cost-Effective Sleep
The limitation of the proposed study is that it needs more in- Quality Monitoring System. Journal of Sensors, 2020. (2020).
[17] Klötzer, C., Weißenborn, J., & Pflaum, A. The evolution of cyber-
depth focus to interpret the behavior of the device in different physical systems as a driving force behind digital transformation.
environments. Paper presented at the 2017 IEEE 19th Conference on Business
Informatics (CBI). (2017).
[18] Garg, S., Kaur, K., Kaddoum, G., & Choo, K.-K. R. Toward
CONFLICTS OF INTEREST secure and provable authentication for Internet of Things:
The authors declare no conflicts of interest. Realizing industry 4.0. IEEE Internet of Things Journal, 7(5),
(2019). 4598-4606.
[19] Khalfi, B., Hamdaoui, B., & Guizani, M. Extracting and
Data Availability Statement:
exploiting inherent sparsity for efficient IoT support in 5G:
The dataset supporting the conclusions of this article is Challenges and potential solutions. IEEE Wireless
available at the repository: Communications, 24(5), (2017). 68-73.
https://research.unsw.edu.au/projects/bot-iot-dataset. [20] Donno, M. D., Dragoni, N., Giaretta, A., & Mazzara, M.
AntibIoTic: protecting IoT devices against DDoS attacks. Paper
presented at the International Conference in Software Engineering
for Defence Applications. (2016).
REFERENCES [21] Dargahi, T., Dehghantanha, A., Bahrami, P. N., Conti, M.,
[1] Chen, S., Xu, H., Liu, D., Hu, B., & Wang, H. A vision of IoT: Bianchi, G., & Benedetto, L. A cyber-kill-chain-based taxonomy
Applications, challenges, and opportunities with China of crypto-ransomware features. Journal of Computer Virology
perspective. IEEE Internet of Things Journal, 1(4), (2014) 349- and Hacking Techniques, 15(4), (2019). 277-305.
359. [22] Albasir, A. Detection of Anomalous Behavior of IoT/CPS
[2] Lee, I., & Lee, K. The Internet of Things (IoT): Applications, Devices Using Their Power Signals. (2020).
investments, and challenges for enterprises. Business Horizons, [23] Ullah, F. Javeed, A. Salam, M. Ahmad, N. Sarwar, D. Shah, and
58(4), (2015) 431-440. M. Abrar, “Modified decision tree technique for ransomware
[3] Xiao, L., Wan, X., Lu, X., Zhang, Y., & Wu, D. IoT security detection at runtime through API Calls,” Scientific Programming,
techniques based on machine learning: How do IoT devices use 4(2) 2020 1-21.
AI to enhance security? IEEE Signal Processing Magazine, 35(5), [24] Gou, Q., Yan, L., Liu, Y., & Li, Y.. Construction and strategies
(2018). 41-49. in IoT security system. Paper presented at the 2013 IEEE
[4] Chaabouni, N., Mosbah, M., Zemmari, A., Sauvignac, C., & international conference on green computing and
Faruki, P. Network intrusion detection for IoT security based on Communications and IEEE Internet of Things and IEEE Cyber,
learning techniques. IEEE Communications Surveys & Tutorials, physical and social computing. (2013)
21(3), (2019). 2671-2701. [25] Shabtai, A., Kanonov, U., Elovici, Y., Glezer, C., & Weiss, Y.
[5] Al-Fuqaha, A., Guizani, M., Mohammadi, M., Aledhari, M., & “Andromaly”: a behavioral malware detection framework for
Ayyash, M. Internet of things: A survey on enabling technologies, Android devices. Journal of Intelligent Information Systems,
protocols, and applications. IEEE Communications Surveys & 38(1), (2012). 161-190.
Tutorials, 17(4), (2015). 2347-2376. [26] Alippi, C., D'Alto, V., Falchetto, M., Pau, D., & Roveri, M.
[6] Balta-Ozkan, N., Davidson, R., Bicket, M., & Whitmarsh, L.. Detecting changes at the sensor level in cyber-physical systems:
Social barriers to the adoption of smart homes. Energy Policy, 63, Methodology and technological implementation. Paper presented
(2013) 363-374. at the 2017 International Joint Conference on Neural Networks
[7] Sperotto, A., & Pras, A.. Flow-based intrusion detection. Paper (IJCNN). (2017).
presented at the 12th IFIP/IEEE International Symposium on [27] Alippi, C., Boracchi, G., & Roveri, M. A just-in-time adaptive
Integrated Network Management (IM 2011) and Workshops. classification system based on the intersection of confidence
(2011) intervals rule. Neural Networks, 24(8), (2011). 791-800.
[8] Ali, B. Internet of Things Based Smart Homes: Security Risk [28] Shanthamallu, U. S., Spanias, A., Tepedelenlioglu, C., & Stanley,
Assessment and Recommendations. (2016) M. A brief survey of machine learning methods and their sensor
[9] Bin, S., Yuan, L., & Xiaoyi, W. Research on data mining models and IoT applications. Paper presented at the 2017 8th
for the Internet of things. Paper presented at the 2010 International Conference on Information, Intelligence, Systems &
International Conference on Image Analysis and Signal Applications (IISA). (2017).
Processing. (2010) [29] Aysa, M. H., Ibrahim, A. A., & Mohammed, A. H. IoT DDoS
[10] Steinberg, J. These devices may be spying on you (even in your attack detection using machine learning. Paper presented at the
own home). Forbes. Retrieved, 27. (2014). 2020 4th International Symposium on Multidisciplinary Studies
[11] Hasan, M., Islam, M. M., Zarif, M. I. I., & Hashem, M. Attack and Innovative Technologies (ISMSIT). (2020).
and anomaly detection in IoT sensors in IoT sites using machine [30] Koroniotis, N., Moustafa, N., Sitnikova, E., & Turnbull, B.
learning approaches. Internet of Things, 7, 100059. (2019). Towards the development of realistic botnet dataset in the internet
[12] Poyner, I., & Sherratt, R. Privacy and security of consumer IoT of things for network forensic analytics: Bot-iot dataset. Future
devices for the pervasive monitoring of vulnerable people. Paper generation computer systems, 100, (2019). 779-796.
presented at the Living in the Internet of Things: Cybersecurity of [31] Koroniotis, N., Moustafa, N., Sitnikova, E., & Slay, J. Towards
the IoT-2018. (2018). developing network forensic mechanism for botnet activities in

VOLUME XX, 2017 9

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4
This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3325929

Author Name: Preparation of Papers for IEEE Access (February 2017)

the IoT based on machine learning techniques. Paper presented at and Soft Computing, PLOS ONE, IET Software, Security and
the International Conference on Mobile Networks and Communication Networks, International Journal of Telemedicine and
Management. (2017). Applications, Journal of Healthcare Engineering, Milestone transactions
[32] Pahl, M.-O., & Aubet, F.-X. All eyes on you: Distributed Multi- on medical Technometrics, Mathematics and Computer Science, SCIREA
Dimensional IoT microservice anomaly detection. Paper journal of Computer. Worked for more than 25 journals as and reviewer
presented at the 2018 14th International Conference on Network and PC member.
and Service Management (CNSM). (2018).
[33] Hastie, T., Tibshirani, R., & Friedman, J. H.. The elements of
statistical learning: data mining, inference, and prediction (Vol.
2): Springer. (2009) IMRAN SARWAR BAJWA received the
[34] Krogh, A. What are artificial neural networks? Nature Ph.D. degree in computer science from the
Biotechnology, 26(2), (2008). 195-197. University of Birmingham, U.K. He has more
[35] Xu, W., Jang-Jaccard, J., Singh, A., Wei, Y., & Sabrina, F. than 18 years of teaching research experience
Improving performance of autoencoder-based network anomaly in various universities of Pakistan, Portugal,
detection on nsl-kdd dataset. IEEE Access, 9, (2021). 140136- and U.K. He is currently an Associate
140146. Professor with the Department of Computer
[36] Hossin, M., & Sulaiman, M. N. A review on evaluation metrics Science and Information Technology, The
for data classification evaluations. International journal of data Islamia University of Bahawalpur. He has
mining & knowledge management process, 5(2), 1. (2015). published more than 190 articles in well
[37] Peterson, J. M., Leevy, J. L., & Khoshgoftaar, T. M. A Review reputed peer-refereed conferences and
and Analysis of the Bot-IoT Dataset. Paper presented at the 2021 journals. He has more than 80 publications and 1000 citations in Google
IEEE International Conference on Service-Oriented System Scholar. Additionally, he has more than 90 articles in Scopus and 370
Engineering (SOSE). (23-26 Aug. 2021). citations of his work in Scopus. His personal impact factor is more than
[38] Thamilarasu, G.; Chawla, S. Towards Deep-Learning-Driven 40. His current research interests include artificial intelligence, the IoT, big
data, and machine learning. He is an Associate Editor of IEEE ACCESS
Intrusion Detection for the Internet of Things. Sensors 2019, 19,
journal. He has very good programming skills in Java and C#.
1977. https://doi.org/10.3390/s19091977
[39] Alosaimi, S.; Almutairi, S.M. An Intrusion Detection System
Using BoT-IoT. Appl. Sci. 2023, 13, 5427.
https://doi.org/10.3390/app13095427 Muhammad Zunnurain Hussain
[40] Albulayhi, K.; Abu Al-Haija, Q.; Alsuhibany, S.A.; Jillepalli, working as Assistant Professor at
A.A.; Ashrafuzzaman, M.; Sheldon, F.T. IoT Intrusion Detection Department of Computer Science,
Using Machine Learning with a Novel High Performing Feature Bahria University, Lahore, Pakistan. He
Selection Method. Appl. Sci. 2022, 12, 5015. has 10 years proven Network & Security
https://doi.org/10.3390/app12105015 specialist with comprehensive lifecycle
[41] Abu Al-Haija, Q.; Zein-Sabatto, S. An Efficient Deep-Learning- advisory, client development,
Based Detection and Classification System for Cyber-Attacks in management, & technology consulting
IoT Communication Networks. Electronics 2020, 9, 2152. for fortune 100 clients. His research
https://doi.org/10.3390/electronics9122152 interests include the Computer
[42] Bhayo, J., Shah, S. A., Hameed, S., Ahmed, A., Nasir, J., & Networks, Information Security, IoT
Draheim, D. Towards a machine learning-based framework for Security, Cloud Computing, Machine
DDOS attack detection in software-defined IoT (SD-IoT) Learning.
networks. Engineering Applications of Artificial
Intelligence, 123, 2023, 106432.
[43] Siddiqui, S., Hameed, S., Shah, S. A., Ahmad, I., Aneiba, A.,
Draheim, D., & Dustdar, S. (2022). Towards Software-Defined
Muhammad Ibrahim received the Ph.D.
Networking-based IoT Frameworks: A Systematic Literature
degree in computer science from the
Review, Taxonomy, Open Challenges and Prospects. IEEE
Islamia University of Bahawalpur,
Access.
Pakistan. He has more than 10 years of
[44] Khalid, M., Hameed, S., Qadir, A., Shah, S. A., & Draheim, D. teaching research experience in various
(2023). Towards SDN-based smart contract solution for IoT institutes of Pakistan. He is currently an
access control. Computer Communications, 198, 1-31. Lecturer with the Department of Computer
Science, The Islamia University of
Bahawalpur. His current research interests
include artificial intelligence, the IoT, big
data, and machine learning.
NADEEM SARWAR is
perusing his PhD (CS) from
Islamia University of
Bahawalpur, Pakistan and
working as Assistant Professor
Khizra Saleem is working as a
at Department of Computer
Lecturer in Department of Computer
Science, Bahria University,
Science, The Islamia University of
Lahore, Pakistan. He has 10
Bahawalpur. Her research Interest
years of teaching and research
Includes Machine Learning, IoT, and
experience. He has published
Cloud Computing.
more than 60 international and
National Journal/conference
publications in the last five
years. He is also working as
Editorial Board Member for various research journals named are CMC-
Computers, Materials & Continua, Applied Computational Intelligence

VOLUME XX, 2017 9

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4

You might also like