
sensors

Article
Design and Implementation of High-Availability
Architecture for IoT-Cloud Services
Hyunsik Yang and Younghan Kim *
School of Electronic Engineering, Soongsil University, Seoul 06978, Korea
* Correspondence: younghak@ssu.ac.kr; Tel.: +82-02-820-0841

Received: 12 April 2019; Accepted: 23 July 2019; Published: 25 July 2019 

Abstract: For many vertical Internet of Things (IoT) applications, high availability is very
important. In traditional cloud systems, services are usually implemented with the same level of
availability, in which the fault detection and fault recovery mechanisms are not aware of service
characteristics. In an IoT-cloud, various services are provided with different service characteristics and
availability requirements. Therefore, existing cloud systems cannot efficiently optimize the availability
method and resources to meet service requirements. To address this issue, this paper proposes
a high-availability architecture that is capable of dynamically optimizing the availability method
based on service characteristics. The proposed architecture was verified through an implementation
based on OpenStack, and it was demonstrated that the system was able to achieve the target
availability while optimizing resources, in contrast with existing architectures that use predefined
availability methods.

Keywords: IoT service; high availability; IoT-cloud

1. Introduction
The Internet of Things (IoT) is promising for many new applications, enabling the many objects
around us to connect, communicate, and interact over the internet without human intervention. Such
applications involve collecting and processing huge amounts of data from sensors and other IoT
devices. Because IoT devices (sensors, actuators, etc.) are normally resource-constrained (i.e., in
storage and processing capabilities), the sensor-cloud, or in other words IoT-cloud [1–5], was proposed as
a promising approach to address these limitations. An IoT-cloud infrastructure, constituting sensor networks
and a cloud, is an expanded form of cloud computing for sensing services and IoT applications. In
previous research [3–5], we proposed different models to integrate wireless sensor networks with
the cloud for efficient IoT-cloud services. Such models consider the resource constraints of sensor
devices [6–8] for sensor data collection as well as quality of service (QoS) for sensing services [3–5,9–11].
However, the previous works focus only on the communication between sensors and the IoT-cloud. In
this paper, we expand our consideration to the management architecture inside the IoT-cloud for high
availability of IoT-cloud services.
For many vertical Internet of Things (IoT) applications (i.e., mission-critical IoT or industrial
monitoring), high availability is critical. In existing cloud systems, system availability is usually
improved by using various fault detection and recovery methods [12,13]. We observe that detailed
fault detection and recovery operations depend on the requirements of the services provided by the
cloud system, as well as on the available resources. However, how to implement an IoT-cloud system that
considers service characteristics and requirements to deploy the corresponding appropriate fault detection
and fault recovery schemes automatically is a practical issue.
Fault detection can be divided into the physical, virtualization, and application levels, and a service
may require some or all of them. The recovery method may also differ depending on the requirements of a
service. The recovery methods of a cloud system can be classified into fault tolerance and redundancy
methods [14,15]. Fault tolerance methods recover from faults by restarting the system or transferring
the service to another healthy system. Redundancy methods, such as active–active or active–standby,
improve availability by preparing a system identical to the one providing the service and using it as a
replacement in case of failure. Redundancy methods are faster than fault tolerance methods because
they always keep an identical system ready as a backup, but they have the disadvantage of
consuming more resources.
In existing cloud systems for IoT, various studies on efficient service provision architecture and
application placement methods have been conducted. However, existing availability methods are
primarily designed for general cloud environments [9–11]. Alcaraz Calero et al. and Kim et al. [12,16]
proposed a monitoring framework for physical and virtualization resources, ensuring availability for
all services. Kourtis et al. [17] proposed an architecture that can detect faults in the cloud infrastructure
and services.
In traditional cloud systems, services are usually implemented with the same level of availability,
and fault detection and fault recovery mechanisms are not aware of service characteristics. However,
in the IoT-cloud, various services are provided with different service characteristics and availability
requirements. A fault detection architecture that dynamically adapts to various service characteristics
has not been studied. For recovery, a backup-system placement algorithm and a recovery architecture
using a shared backup system are normally implemented. Jung et al. [15] proposed a method for
improving service availability by using the active–standby method, whereas Zhou et al. [18] introduced
an algorithm for the effective placement of the backup system. However, a fault recovery architecture
that dynamically adapts to various service characteristics has not been implemented. Therefore,
existing cloud systems are inefficient at optimizing the availability method and resources to meet
service requirements. To address this issue, this paper proposes a high-availability architecture that is
capable of dynamically optimizing the availability method based on service characteristics.
It should be noted that in an IoT-cloud environment where various services are provided, an
architecture that is capable of dynamically optimizing the availability method based on service
characteristics is required. An IoT-cloud environment that provides several types of services, such
as V2V (vehicle-to-vehicle), Smart Grid, video surveillance, and health care should consider the
environment and the characteristics of network services [1]. For example, a V2V service that transmits
and receives data in real time may require a more precise fault detection method than other services.
Such a service also requires the recovery time to be minimal. Therefore, to provide high availability for
V2V services, multiple monitoring functions are required for the infrastructure and the applications. In
such a case, a redundancy model that can reduce the recovery time may be more appropriate [19].
On the other hand, Botta et al. [1] mentioned that Smart Grid services focus on data size and collection
in a large-scale environment. In other words, the V2V service and the Smart Grid service have different
characteristics. Each type of service may have a different traffic type, size, and data communication
period. Therefore, the Smart Grid service may require a different fault detection or recovery method.
Furthermore, in environments where resources are limited, such as an edge cloud or a fog, an
architecture that is capable of improving availability while optimizing resources by providing an
appropriate failure detection and recovery function according to service characteristics is necessary.
In this study, an architecture that automatically configures fault detection and fault recovery
methods according to the characteristics of IoT services in an IoT-cloud infrastructure is proposed.
Based on the characteristics of an IoT service, the proposed architecture can provide the required
availability in a resource-efficient manner by dynamically providing failure detection and a failover
model through a template [19].
To verify the proposed architecture, it was implemented using OpenStack, and it was confirmed that
it was able to provide the target availability. Furthermore, resources were optimized, as demonstrated
by comparison with other architectures that use predefined fault detection and recovery functions.

The paper is organized as follows. In Section 2, related research on the features and availability of
IoT-cloud is presented. In Section 3, the proposed architecture, designed to dynamically detect and
recover from faults according to the IoT service, is described. Section 4 presents the implementation of
the proposed architecture and experimental results. Finally, Section 5 concludes the paper.

2. Related Research

2.1. Characteristics of IoT-Cloud


An IoT-cloud infrastructure [3–5,9–12], constituting sensor networks [6–8] and a cloud, is an expanded
form of cloud computing for sensing services and IoT applications. Such applications involve collecting
and processing huge amounts of data from sensors and other IoT devices. Because IoT devices
(sensors, actuators, etc.) are normally resource-constrained (i.e., in storage and processing capabilities),
the sensor-cloud, or in other words IoT-cloud [20–22], was proposed as a promising approach to address these
limitations. In our previous research [3–5], we proposed different models to integrate wireless sensor
networks with the cloud for efficient IoT-cloud services. Such models consider the resource constraints
of sensor devices [6–8] for sensor data collection as well as quality of service (QoS) for sensing
services [3–5,9–11]. However, the previous research focuses on the communication between sensors
and the IoT-cloud. In this paper, we expand our consideration to the management architecture inside
the IoT-cloud for high availability of IoT-cloud services.
An IoT-cloud provides the management functions of IoT devices and applications for services.
The IoT-cloud includes various services such as smart home, smart grid, and health care, and each
application service has different requirements and environments depending on service characteristics.
Moreover, it also has different requirements for availability depending on service characteristics [1].
In an IoT-cloud, certain availability mechanisms have been proposed, but they either provide
mechanisms for specific types of data (such as collected data) or do not consider the characteristics
of the cloud [23,24]. Recently, the MEC (Mobile Edge Computing) architecture was introduced for IoT
services. It provides IoT services at the edge, thereby improving performance and reducing latency
between IoT services and devices [20–22,25]. Compared to other types of clouds, an IoT-cloud should
consider several environments owing to the variety of IoT devices (e.g., sensors or mobile devices),
and each application should provide service for each IoT device.
Table 1 shows representative IoT services and their characteristics [1,26]. Service characteristics
are classified according to the criteria defined below, and it can be confirmed that each service has
different characteristics. In the table, a mark ('O') indicates that an IoT service has the corresponding
property described in that column, and an empty cell indicates that it does not.

Table 1. Internet of Things (IoT) service characteristics.

Each service is marked for one option per column group: Latency (Critical / Non-Critical),
Data Processing (High / Low), Service State (Stateful / Stateless), Data Storage (Use / Non-Use),
Network Size (Large / Mid / Small), and Location (Edge / Core). The services classified are
health care (heart beat), smart home and smart metering, video surveillance, automotive and smart
mobility, smart energy and smart grid, smart logistics, and environmental monitoring.

In Table 1, latency means that the service must ensure real-time data processing, and data processing
means that the service requires high-performance computing owing to high data throughput. Service
state means that the service is either stateful or stateless, and data storage means that the service
does or does not require external data storage. Network size means the size of the network
environment in which the service is provided, and location refers to the cloud environment
in which the service is located; in this paper, it is classified as core cloud and edge cloud. According
to these general characteristics, we classified the IoT services as shown in Table 1 [1].
As the characteristics of an IoT service may vary, fault detection and fault recovery functions
must be applied differently. For example, fault detection and recovery for services provided in an
IoT-cloud environment may be adopted according to the service characteristics and the priorities of the
targets to be monitored. One of the key features of smart metering services is to collect and visualize
data received from various IoT devices [2]. Smart metering services normally do not require real-time
response and processing; thus, they may allow more time for detection and recovery than services
requiring real-time response. In addition, application failures are not as critical as data processing
and database failures. Therefore, database monitoring is more important for availability, and fault
recovery methods such as re-instantiation or respawn models may be more suitable than redundancy
models. By contrast, a V2V service, which is used for communication between vehicles, requires
real-time response and fast data processing. Therefore, such a service requires a redundancy recovery
mechanism and multi-monitoring for fast detection and recovery.
In the case of a V2V application, a redundancy model is more suitable than other recovery
mechanisms because it is faster. In addition, V2V services are highly dependent on the application
because they must quickly respond to and process data. Therefore, detection at the application level
can facilitate expeditious fault detection [3].
A recovery mechanism can also be adopted according to the cloud environment. For example, in the
case of healthcare or CCTV services, which require real-time data processing, a redundancy model
is suitable for fast recovery. However, in resource-constrained environments such as an edge cloud,
resources may not be sufficient for configuring redundancy models. In this case, fault tolerance
methods are slower than a redundancy model, but they consume fewer resources, and thus they may
be preferred. Accordingly, a uniform availability method may not be effective in an IoT-cloud
providing various services. Therefore, a cloud architecture that can automatically configure an
optimized availability environment according to various service characteristics is necessary.
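To make this idea concrete, the following minimal Python sketch maps a few of the service characteristics from Table 1 to a monitoring and recovery configuration. The profile fields, thresholds, and recovery labels are illustrative assumptions, not the configuration logic of the architecture proposed below.

from dataclasses import dataclass

@dataclass
class ServiceProfile:
    latency_critical: bool  # e.g., V2V or health care
    stateful: bool          # service keeps state between requests
    edge_located: bool      # runs in a resource-constrained edge cloud

def availability_config(p: ServiceProfile) -> dict:
    # Latency-critical services in the core cloud: monitor every layer
    # and keep a standby instance for fast failover.
    if p.latency_critical and not p.edge_located:
        return {"monitoring": ["infra", "vm", "app"],
                "recovery": "active-standby"}
    # Edge services: resources may not allow a standby instance,
    # so fall back to a fault-tolerance (respawn) model.
    if p.edge_located:
        return {"monitoring": ["infra", "app"], "recovery": "respawn"}
    # Non-critical services: coarse monitoring; stateless services can
    # simply be re-instantiated, stateful ones are restarted in place.
    return {"monitoring": ["infra"],
            "recovery": "restart" if p.stateful else "re-instantiate"}

print(availability_config(ServiceProfile(True, True, False)))   # V2V-like
print(availability_config(ServiceProfile(False, False, False))) # metering-like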

2.2. Open Source Solutions for IoT-Cloud Availability


In Network Functions Virtualization (NFV) environments, ensuring availability is a challenging
task. In the ETSI (European Telecommunications Standards Institute) standards, the Virtualized
Infrastructure Manager (VIM) is responsible for controlling and managing the NFV Infrastructure
(NFVI), e.g., computing, storage, and network resources, whereas the VNF manager (VNFM) is
responsible for the lifecycle management of Virtual Network Function (VNF) instances. In this context,
NFVI and VNFM should support a fault detection and recovery mechanism for life cycle management.
Accordingly, in this study, an IoT-cloud availability architecture and procedure was designed on the
NFV reference architecture. One of the representative cloud VIMs is OpenStack, which can control
and manage resource pools. Several monitoring tools have been proposed for fault detection,
such as Ceilometer, Nagios, and Monasca [27,28]. These are integrated with OpenStack to provide
fault monitoring for the infrastructure. However, these solutions only consider errors caused by the
NFVI and do not check the status of a VNF. Moreover, monitoring tools cannot support the recovery
mechanism themselves and should interwork with the cloud infrastructure to manage the status of the VNF
or the NFVI. Furthermore, OpenStack has the Tacker project, which develops the VNFM and provides
general VNFM functions such as VNF CRUD (Create, Read, Update, Delete) [29]. Thereby, the basic
IoT-cloud architecture can be implemented. However, a monitoring/recovery mechanism is required
to detect failures. In an enterprise environment, several solutions are already used to check the status
of the infrastructure. One of the more widely used monitoring tools is Zabbix [30]. Zabbix is
enterprise open source monitoring software for networks and applications that has mainly been used
to manage faults in physical servers. Moreover, it is designed to monitor and track the status of
various network services, servers, and other network hardware. It also provides a GUI so that users
may easily check the status of the target on the website. However, its limitation is that it relies on an
agent for obtaining monitoring data.
3. Proposed Architecture

In this section, the design of the proposed architecture is presented. An IoT-cloud service is
composed of multiple layers, and it is necessary to configure the availability provisioning environment
according to the service type. For that purpose, the proposed architecture was designed and
implemented with the following elements. First, a multilayer monitoring architecture for the IoT-cloud
was provided, and a common driver for extending the IoT-cloud monitoring function was also designed.
In addition, the proposed architecture was implemented with OpenStack as the cloud infrastructure
and OpenStack Tacker as the management and orchestration (MANO) component.

As shown in Figure 1, the proposed architecture consists of two components. The left component
is a cloud-based IoT service architecture. The right component is a management and monitoring
architecture for the IoT service and infrastructure. The former is the primary cloud environment that
provides IoT services, which run as virtual network functions (VNFs), and has the role of receiving
and processing data from IoT devices.

Figure 1. High-availability architecture for IoT-cloud.

MANO (management and orchestration) is a top-level cloud management system that manages
the entire infrastructure. In this study, an architecture providing availability in conjunction with
MANO was developed. A monitoring driver interacting between monitoring tools and MANO was
also designed. Furthermore, templates for automated monitoring and recovery were defined, and
translators for availability templates were designed. Finally, a service driver supporting interworking
with other monitoring tools was designed.

3.1. Architecture for IoT-Cloud Availability


To provide availability for IoT-cloud services, an integrated architecture, which could interwork
with other monitoring tools, was developed. The proposed architecture was designed using third-party
monitoring tools and OpenStack Tacker (MANO). Figure 2 shows the availability architecture for
IoT-cloud services.

Figure 2. Design of components for IoT-cloud availability.

In the proposed architecture, we designed the service driver, the monitoring driver, the monitoring
server, and the translator, which are placed in the controller as software components. To communicate
with each component, the proposed components should be deployed on the control node. The
monitoring server can be deployed on an independent node, but communication with the control node
is essential. In this architecture, it is assumed that monitoring tools use agents to collect data from
the target.

First, a monitoring driver that communicates with the monitoring server in MANO was designed.
When an IoT service is created as a VNF, the driver sends the VNF information to the monitoring server
for the registration process. Another role of the monitoring driver is to handle the recovery action
when an IoT service has an error.

Subsequently, a translator was designed to analyze the monitoring policy in the template. Its
function is to translate the monitoring policy from the VNF descriptor (VNFD) and forward this
information to the monitoring server. Tacker usually uses the VNFD, which includes VNF information
such as network configuration, images, size, or resources [31,32]; it also includes a monitoring policy.
In the proposed system, the function of the translator was extended to recognize the monitoring function.

Finally, a service driver was designed. It calls a specific monitoring driver based on the VNFD.
When the user requests fault management using specific monitoring, the service driver calls the specific
monitoring function to check the status.

Figure 3 shows the procedure of the proposed architecture for IoT service availability. The
administrator makes a request to the VNFM to create a VNF with a VNFD that includes the monitoring
policy. When the VNFM receives the data, it sends a request message to the OpenStack entity to create
a VNF.

Figure 3. Procedure for service (Virtual Network Function (VNF)) fault management.

When the VNF creation is completed, the service driver reads the descriptor and finds the
monitoring driver based on the information in the descriptor. Subsequently, the monitoring driver
sends the monitoring request message with the VNF information (e.g., VNF ID and monitoring policy)
to the monitoring server.

In addition, the VNF begins to install the monitoring agent according to the script defined in the
VNFD. This agent's role is to collect the VNF information and forward it to the server. After installation,
the agent periodically sends monitoring data to the server. When the data meets the monitoring policy
requirements, which are defined by the administrator, the monitoring server sends a recovery action
request (e.g., rebuild, restart, or reconfigure) to the VNFM. For example, when an IoT service is
launched in the cloud, the monitoring agent gathers monitoring data to check the status of the service
(e.g., the infrastructure or application status). When the monitoring server receives data from an IoT
service VNF and some resources are insufficient or the service fails, it requests a suitable recovery
action from the VNFM based on the corresponding kind of failure.

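As a minimal sketch of this fault-management flow (Figure 3), the following Python fragment condenses the registration, policy check, and recovery request into a few hypothetical classes; the names are illustrative stand-ins and do not mirror Tacker's actual interfaces.

# Hypothetical condensation of the fault-management procedure (Figure 3).
class MonitoringServer:
    def __init__(self):
        self.targets = {}  # vnf_id -> monitoring policy

    def register_vnf(self, vnf_id, policy):
        # The monitoring driver forwards VNF info and policy (registration).
        self.targets[vnf_id] = policy

    def report(self, vnf_id, metric, value, vnfm):
        # Agent data is checked against the registered policy; a
        # violation triggers a recovery action request to the VNFM.
        threshold = self.targets[vnf_id].get(metric)
        if threshold is not None and value > threshold:
            vnfm.request_recovery(vnf_id, action="respawn")

class VNFM:
    def request_recovery(self, vnf_id, action):
        print(f"recovering {vnf_id} via '{action}'")

server, vnfm = MonitoringServer(), VNFM()
server.register_vnf("vnf-1", {"os_cpu_usage": 70})  # policy from the VNFD
server.report("vnf-1", "os_cpu_usage", 85, vnfm)    # exceeds 70% -> recovery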
3.2. Design of Monitoring Driver

Figure 4 shows the proposed design for the monitoring driver and the service driver in MANO.
The monitoring driver is composed of three small functions. The first is the sender function, which
adjusts the format when monitoring-related information is transferred to the monitoring server. All
monitoring-related data is transmitted through the sender function. The create service function is the
function which, after receiving the VNF information from MANO, extracts the information related to
the monitoring policy and forwards it to the sender function.

Figure 4. Components of monitoring driver.

The registration function is used to register the specific monitoring driver as a monitoring
function. This is a common function that communicates with the service driver. When the registration
message comes from the service driver, this function registers the monitoring driver with the service
driver. The service driver provides a monitor call function that calls a specific monitoring driver based on
the template. This function is designed so that other drivers can also be supported. Based on the
monitoring parameter in the descriptor, the service driver calls a specific monitoring driver. For
instance, if a user wants to use another monitoring driver, the service driver can register it to MANO
as a monitoring driver.

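As a minimal sketch of the registration and monitor-call pattern described above, the code below lets monitoring drivers register themselves with the service driver, which then dispatches status checks to the driver named in the template. The interface is an assumption for illustration, not Tacker's plugin API.

class ServiceDriver:
    def __init__(self):
        self._drivers = {}

    def register(self, name, driver):
        # Registration function: a monitoring driver announces itself.
        self._drivers[name] = driver

    def check_status(self, monitor_name, vnf_id):
        # Monitor-call function: dispatch to the driver named in the VNFD.
        return self._drivers[monitor_name].check(vnf_id)

class ZabbixDriver:
    def check(self, vnf_id):
        # The sender function would query the Zabbix server here.
        return {"vnf": vnf_id, "status": "up"}

svc = ServiceDriver()
svc.register("zabbix", ZabbixDriver())      # e.g., performed at driver load
print(svc.check_status("zabbix", "vnf-1"))  # driver selected via the template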
3.3. Design Template for Availability
To provide the appropriate availability mechanism for IoT services, the monitoring and recovery
functions should be provided automatically when a service is created. To support availability
automatically, the fault detection and fault recovery policies should be included in the VNFD. In this
respect, Tacker not only creates VNFs using TOSCA-based templates, but can also apply the availability
policy to the VNFs. To enable this, the template was modified and the translator was extended. In
the proposed architecture, a driver that can support other third-party monitoring tools was designed,
and templates and a translator for Zabbix were designed as a monitoring tool. The translator analyzes the
content defined in the VNFD and converts it into a HOT template used by the cloud. To apply the
monitoring policy defined in the proposed architecture, it is necessary to extend the translator.

Figure 5 shows the VNFD for the monitoring policy. The “app_monitoring_policy” part is composed
of information about the target monitoring tool and account information for linking with the monitoring
server, with which the monitoring driver requests registration from the monitoring server. The parameter
part can be further divided into two parts: the application and the OS. The application part is used for
defining values for monitoring an IoT service, whereas the OS part is used for defining values for
monitoring infrastructure resources. The following procedures are added to provide availability.
Figure 5. Virtual Network Function descriptor (VNFD) template for monitoring.

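Figure 5 itself is not reproduced in this extraction; the fragment below sketches what such a monitoring policy section of the VNFD could look like, using the field names explained in the following paragraphs. The values and exact structure are illustrative assumptions, not the schema of the paper's extended translator.

import yaml  # requires PyYAML

# Illustrative "app_monitoring_policy" fragment of a VNFD (cf. Figure 5).
vnfd_fragment = yaml.safe_load("""
app_monitoring_policy:
  name: zabbix                # target monitoring tool
  zabbix_username: admin      # account info for the monitoring server
  zabbix_password: zabbix
  parameters:
    application:
      app_name: apache2
      app_port: 80            # used to check application status
      app_status: {action: respawn}                 # recovery on app failure
      app_memory: {threshold: 80, action: restart}  # recovery on memory usage
    OS:
      os_proc_value: 30       # expected number of processes
      os_cpu_usage: 70        # CPU load threshold (%)
""")
print(sorted(vnfd_fragment["app_monitoring_policy"]["parameters"]))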
The application part includes the monitoring policy for specific applications running on the
VNF. It also includes account information for VNF connections and application information (e.g., port
information) for checking the status. The “app-status” part includes information about the recovery
policy for the monitored VNF, whereas “app_memory” includes a recovery policy according to the
memory usage of the application currently being monitored.

As shown in Figure 5, the OS part includes the monitoring policy for specific resources of a specific
VNF or machine, such as CPU or memory. Specifically, “os_proc_value” indicates the number of
processes, “os_cpu_usage” indicates the CPU load, and “os_cpu_usage” indicates the I/O throughput.

The Zabbix monitoring driver uses an agent installed in the VNF. The agent collects data according
to the monitoring policy and forwards it to the monitoring server. Currently, when a user uses Zabbix
as a monitoring driver, the agent must be manually installed on each machine. However, as the
number of monitoring nodes increases, installing agents on all nodes becomes difficult.

To resolve this, the agent installation was also defined in the VNFD. As shown in Figure 6, a user
script was created in the user data part of the VNFD. Using the user data, the VNF installs the agent,
sets up the network automatically for connection to the monitoring server, and thereafter the agent
begins to collect the VNF data. Optionally, the user could also install the agent manually without
the descriptor.
Figure 6. VNFD template for monitoring.
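A rough sketch of the user data script described above is shown below: it runs at first boot, installs the Zabbix agent, and points it at the monitoring server. The package names and the server address are placeholders, and attaching the script via "user_data" follows the general Tacker VDU convention rather than the paper's exact template.

# Illustrative user data script of the kind shown in Figure 6.
user_data = """#!/bin/sh
apt-get update
apt-get install -y zabbix-agent
# Point the agent at the monitoring server (placeholder address).
sed -i 's/^Server=.*/Server=192.0.2.10/' /etc/zabbix/zabbix_agentd.conf
service zabbix-agent restart
"""

# Attached to the VDU definition so it executes when the VNF boots.
vdu_properties = {"user_data_format": "RAW", "user_data": user_data}
print(vdu_properties["user_data"].splitlines()[0])  # -> #!/bin/sh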
3.4. Discussion of the Implementation Complexity

The new components proposed in this paper are software components installed into an existing
NFV architecture. They cannot be deployed on an independent physical node, because they require
internal communication with the NFV management function (MANO), which resides in the control
node. In other words, any developer can install this architecture in their existing NFV architecture and
configure it. In addition, the proposed architecture follows the internal procedures of the existing
architecture, so the communication overhead and the control overhead of these components are the
same as in the current architecture.
4. Implementation and Analysis

In this section, we introduce the implementation and verify the designed architecture. The
proposed architecture is implemented using open source software, and both mathematical analysis and
real performance verification are performed. We perform a mathematical analysis of the proposed
architecture and verify each function, such as the template operation, fault detection, and fault
recovery, in the implementation environment.
4.1. Implementation

For evaluation, the proposed architecture was implemented in the environment described in
Table 2. We used four servers consisting of one controller and three compute nodes. Service drivers
and monitoring drivers were deployed in the controller for interoperability with the cloud
management functions. In this paper, the monitoring server was placed in the controller, but it can
also be configured independently.
Table 2. Implementation specifications.

Entity               Condition                                                      Version
Physical Server (4)  Intel® Xeon® processor D-1520, single-socket FCBGA 1667;      -
                     4 cores, 8 threads, 45 W; RAM: 16 GB; disk space: 256 GB;
                     Controller Node (1) / Compute Node (3)
Cloud OS             OpenStack                                                      stable Queens
VNF Manager          OpenStack Tacker                                               Master

In this setting, a cloud environment and MANO functions were created. Figure 7 shows the
scenario and the procedure used to verify the proposed architecture.
Figure 7. Procedure of service recovery for availability.
The procedure for providing availability in this implementation is shown in Figure 7. A VNF
was created using a Zabbix monitoring template, and an Apache server was installed in the VNF.
Subsequently, a simple IoT application was deployed on the VNF. After the VNF creation, the monitoring
driver registers the information of the created VNF with the monitoring server, and this is followed by
agent installation, whereby the agent collects the data and forwards it to the monitoring server. In this
scenario, the status of the IoT application running on the Apache server is checked, and when there
is an error, the agent notifies the monitoring server. After receiving the notification from the agent, the
monitoring server requests the VNFM to perform recovery of the VNF that has a fault. Note that in the
proposed architecture, we designed the service driver, the monitoring driver, the monitoring server,
and the translator as software components, which are placed in the controller node of the existing NFV
architecture. Therefore, the architecture does not require any new physical node.

Figure 8 shows an example of the fault detection and recovery method registered for the service
according to the information defined in the template. The monitoring server information is registered
according to the information defined in the template, which also includes a fault detection method and
a fault recovery method; this information can be changed depending on the template.

Figure 8. Example of VNF registration information.
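A minimal sketch of the agent-side check used in this scenario might look as follows: probe the IoT application served by Apache and notify the monitoring server on failure. The URLs and the plain HTTP notification are placeholders standing in for the Zabbix agent/server protocol.

import urllib.request

APP_URL = "http://localhost:80/"          # IoT application behind Apache
MONITOR_URL = "http://192.0.2.10/notify"  # placeholder server endpoint

def app_is_up(url, timeout=2.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

if not app_is_up(APP_URL):
    # A real Zabbix agent reports via its own protocol; a plain HTTP
    # POST stands in for that notification here.
    req = urllib.request.Request(
        MONITOR_URL,
        data=b'{"vnf": "vnf-1", "event": "app_down"}',
        headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=2.0)
    except OSError:
        pass  # server unreachable; a real agent would retry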
In Figure 9, the monitoring driver registers a VNF with the monitoring server. After creating the
VNF, it runs the user data section defined in the VNFD. Upon installation, the agent collects the data
and forwards it to the monitoring server. In this scenario, the status of the IoT application running on
Apache was checked.

Figure 9. VNF registration on the monitoring server.

Figure 10 shows a snapshot of the monitoring process for the IoT application. If the IoT
application has an error, the agent notifies the monitoring server.

Figure 10. Monitoring result in the monitoring server.

When the monitoring server receives the notification from the agent (Figure 11), it requests the
VNFM to perform recovery of the VNF that has a fault.

Figure 11. Monitoring result on the monitoring server.

4.2. Resource Utilization Cost

This section presents a comparison of the proposed architecture with existing architectures. The
total resource cost can be expressed as follows:

C_{total} = C_{monitoring} + C_{recovery}    (1)

where C_{monitoring} is the resource cost for monitoring and C_{recovery} is the cost of recovery.
C_{monitoring} is the total resource cost (CPU, RAM, and network) required to monitor a single target
and refers to a single-server deployment [33,34]. It is derived as follows:

C_{monitoring} = \frac{1}{1 - u_{cpu}} \times \frac{1}{1 - u_{net}} \times \frac{1}{1 - u_{mem}}    (2)

where u_{cpu}, u_{net}, and u_{mem} are the CPU, network, and memory utilization, respectively, of a
VM on the same machine. C_{recovery} represents the deployment cost of the VMs deployed for recovery,
which depends on the recovery method, and is given by [33,34]:

C_{recovery}(X) = e_V \sum_{i=1}^{p} \sum_{j=1}^{m} x_{ij}, \quad x_{ij} > 0,\ i \in [1, p],\ j \in [1, m]    (3)

where e_V is the fixed cost per deployed VM and x_{ij} denotes the number of vCPUs assigned to VM i
hosted on physical host j. For the performance analysis, it was assumed that several different services
ran on a single host and that 10% of the resources used by one service was allocated for monitoring.
This assumption is based on the average percentage observed in our experiments. In addition, it was
assumed that half of the VMs required two monitoring methods and the remaining required one.
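To make the cost model concrete, the short example below evaluates Equations (2) and (3) for illustrative inputs. The utilization values, the per-VM cost e_V, and the vCPU counts are examples chosen for the sketch, not the measured inputs behind Figures 12 and 13.

def c_monitoring(u_cpu, u_net, u_mem):
    # Equation (2): the cost grows as any utilization approaches 1.
    return 1.0 / ((1.0 - u_cpu) * (1.0 - u_net) * (1.0 - u_mem))

def c_recovery(e_v, vcpus):
    # Equation (3): fixed cost e_V times the vCPUs of the VMs deployed
    # for recovery (the x_ij flattened into a single list).
    return e_v * sum(vcpus)

cm = c_monitoring(0.10, 0.05, 0.10)   # one monitored target
cr_redundancy = c_recovery(1.0, [2])  # standby VM with 2 vCPUs
cr_respawn = c_recovery(1.0, [])      # respawn: no standby VM deployed

print(f"C_monitoring = {cm:.3f}")                          # ~1.300
print(f"C_total (redundancy) = {cm + cr_redundancy:.3f}")  # ~3.300
print(f"C_total (respawn) = {cm + cr_respawn:.3f}")        # ~1.300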
Figure 12 shows that the monitoring resource cost of the proposed architecture is lower than
that of the existing architecture. The gap increases proportionally to the number of VMs because the
monitoring resource consumption increases. Overall, an improvement of up to 15% is achieved. In
the existing architecture, two monitoring functions are allocated to all VMs regardless of the service,
whereas in the proposed architecture, the resource cost is reduced because a monitoring method
specific to the characteristics of each service is provided.

Figure 12. Comparison of monitoring resource cost.

Subsequently, the resource cost for providing the recovery function was analyzed. The analysis
considered typical environments where various recovery methods are necessary, such as an IoT-cloud.
To investigate the change in cost owing to the environment, two recovery methods were used. Five
cases were analyzed, in which 10% (Case 1), 30% (Case 2), 50% (Case 3), 70% (Case 4), and 100%
(Case 5) of the total services requested a redundancy method, and the remaining services requested a
respawning method. It was assumed that the same availability level was ensured for all services
regardless of the recovery method. As shown in Figure 13, the proposed architecture reduced the
resource cost by approximately 2.5 times compared to the existing method. In the fifth case, with 100%,
the deployment cost is equal to that of the existing architecture. By providing the required recovery
method for each service, the proposed architecture achieves better resource efficiency than the existing
architecture. It was also observed that the proposed architecture was more cost-effective the more
varied the services were, compared to the existing method. The reason is that the proposed architecture
can use resources efficiently and provide suitable availability methods for various service characteristics.
Figure 13. Comparison of recovery resource cost.

4.3. Performance Analysis


4.3. Performance Analysis
Herein, the performance of the monitoring and recovery functions in the proposed model is analyzed. To evaluate the performance of the proposed model, a VNF was created and the error detection time was checked in two cases: one for errors caused by an increase in resource usage, and the other for application errors. Furthermore, the number of applications was gradually increased to verify whether the proposed architecture is scalable. The basic testing scenario is as follows. We run simple IoT applications with VNFs. For monitoring, the system monitors the infrastructure and application levels. For recovery actions, we perform application restarting and VNF respawning. We conducted experiments with fifty IoT services, and the number of applications for each IoT service was gradually increased from one to twelve. We used the application monitoring tool to detect application faults and 'service restart' as the recovery action.
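As a minimal sketch of the application-level part of this scenario (our illustration, not the paper's implementation: the service name, check interval, and the use of systemd are assumptions), a monitor can poll the application and trigger the 'service restart' recovery action on failure:

```python
import subprocess
import time

CHECK_INTERVAL = 1.0      # seconds between health checks (assumed)
APP_SERVICE = "iot-app"   # hypothetical systemd service name

def app_alive(service: str) -> bool:
    # 'systemctl is-active --quiet' exits with code 0 only for a running unit.
    return subprocess.run(
        ["systemctl", "is-active", "--quiet", service]).returncode == 0

def restart(service: str) -> None:
    # The 'service restart' recovery action.
    subprocess.run(["systemctl", "restart", service], check=True)

while True:  # monitoring loop; runs until the monitor itself is stopped
    if not app_alive(APP_SERVICE):
        t0 = time.monotonic()
        restart(APP_SERVICE)
        print(f"fault recovered in {time.monotonic() - t0:.3f} s")
    time.sleep(CHECK_INTERVAL)
```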
Figure 14 shows the monitoring detection time and the recovery time according to the number of target applications (IoT services) in the configured environment. The results show that increasing the number of applications does not significantly impact the fault detection time: the difference between the case with one application and the case with seven applications is only 180 ms. The reason is that the proposed system monitors and detects faults quickly at the infrastructure and VNF levels regardless of the number of applications installed in each VNF. However, the fault recovery time increases gradually in proportion to the number of applications, because the more applications are installed within a VNF, the longer it takes to restart the VNF with all of its applications.

Figure 14. Multiple service fault monitoring performance.
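The observed linear growth is consistent with a simple first-order model (our illustration, with assumed parameters, not quantities measured in the paper): if respawning a VNF takes a fixed boot time $t_{boot}$ plus a per-application start time $t_{app}$, then for $n$ applications

\[
t_{recovery}(n) \approx t_{boot} + n \cdot t_{app},
\]

which grows linearly in $n$, while the detection time remains independent of $n$.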


Subsequently, experiments were conducted to check whether the monitoring server and the VNF manager performance are affected when the number of monitoring targets increases. The number of target IoT service nodes was gradually increased from one to 50, while the environment was
maintained. In this experiment, CPU utilization was used as the resource parameter. In addition, the system was configured so that when the CPU usage rate of an IoT service node exceeds 70% (driven up by increasing the number of threads), the system determines that a problem has occurred and sends a recovery action request to the VNFM according to the predefined recovery policy. Figure 15 shows the average detection time under various numbers of threads for different numbers of target IoT service nodes. The number of target nodes was increased from 10 to 50, and the number of threads was increased from 10 to 80 to vary the CPU load [35]. The experiment was run five times, and the average value was computed. The detection time in all cases decreases gradually as the number of threads increases. While the case with 10 VNFs shows the lowest fault detection time, the case with 50 VNFs shows the highest detection time due to the system workload; however, the difference is small. This result indicates that the fault detection time does not increase significantly as the number of VNFs grows. In our experiments, the results of all cases (10, 20, 30, 40, and 50 VNFs) are quite similar when the number of threads is 80 or more. We observed that with over 80 threads, the system processing is quick enough to process all cases similarly, so the detection times are nearly identical.
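A minimal sketch of this resource-level detector is shown below (our illustration: the node identifier, polling interval, and the VNFM endpoint are assumptions, not part of the actual implementation); in the experiment the CPU load itself was driven up with stress-ng [35], e.g., `stress-ng --cpu 80` spawns 80 CPU workers and pushes utilization past the threshold.

```python
import psutil      # third-party; assumed available on the monitored node
import requests    # used here to notify a hypothetical VNFM REST endpoint

CPU_THRESHOLD = 70.0                      # % threshold used in the experiment
VNFM_URL = "http://vnfm.example/recover"  # hypothetical endpoint, for illustration

def watch(node_id: str, interval: float = 1.0) -> None:
    """Poll CPU usage and request recovery once it crosses the threshold."""
    while True:
        usage = psutil.cpu_percent(interval=interval)  # blocks for `interval` s
        if usage > CPU_THRESHOLD:
            # Report the fault; the VNFM then applies the pre-configured
            # recovery policy (e.g., respawn) for this node.
            requests.post(VNFM_URL, json={"node": node_id, "cpu": usage})
            return

# watch("iot-node-01")  # hypothetical node id
```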

Figure 15. Resource fault monitoring performance.

In the same environment using over 80 threads, we conduct experiments with application faults. We increase the number of VNFs from 20 to 160 to test the scalability of the system. Similar to Figure 15, the fault detection time in all cases in Figure 16 is quite similar. As explained in the previous experiment, when the system processing is quick enough, increasing the number of VNFs results in only a small increase in the fault detection time. In particular, when the number of VNFs is increased from 20 to 160, the fault detection time increases by only approximately 400 ms. This result indicates that the system achieves good scalability for fault detection. By comparing Figures 15 and 16, we also find that application faults can be detected faster than faults at the physical and virtualized levels. The reason is that, at the physical and virtualized levels, our system has to monitor the resource usage until it reaches a certain level before concluding that a fault has occurred. The fault recovery time, in contrast, increases proportionally when we increase the number of VNFs: the fault recovery action requires higher resource consumption and depends on the resource allocation to restart services, so the recovery time is impacted significantly by the number of VNFs sharing the same infrastructure.

Figure 16. Application fault monitoring performance.



In ETSI specifications, availability is defined in terms of the mean time between failures (MTBF) and the mean time to repair (MTTR) [14]. MTTR is the average length of the interval between repair points after a failure has occurred, and MTBF is the average length of working intervals without occurrence of any failures. Considering these definitions, availability can be expressed as follows:

\[
\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \tag{4}
\]
That is, if failure detection and recovery are executed quickly, the MTTR value is decreased, which can improve availability. Based on the availability expression, the proposed system was evaluated from 1 to 48 h. Errors were periodically generated every 30 min. Therefore, the value of the MTBF from 1 to 48 h and the experimentally obtained value of the MTTR were used. This experiment was conducted for resource and application monitoring. The results are shown in Figure 17.

Figure 17. Availability of proposed architecture.
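As a worked example of Equation (4) under this setup (the MTTR below is an illustrative value chosen to be consistent with the reported average, not a figure taken from the measurements): errors injected every 30 min give MTBF = 1800 s, and an average MTTR of about 3.6 s then yields

\[
\text{Availability} = \frac{1800}{1800 + 3.6} \approx 0.9980 = 99.80\%.
\]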

The results demonstrate that the average availability of the proposed system is approximately 99.80%. It was confirmed that the proposed system provides satisfactory availability in an NFV environment. It was also inferred that reducing the recovery and failure detection times affected the MTTR and MTBF.

5. Discussion
Through the analysis and experiments, the results show significant improvements of the proposed architecture over the existing architecture in terms of deployment cost, monitoring resource cost, and monitoring performance. The improvements are achieved because the proposed architecture is designed to automatically configure fault detection and fault recovery methods according to the characteristics of IoT services. In other words, depending on the characteristics and requirements of IoT services once they are launched, the proposed architecture selects an appropriate model for fault detection and fault recovery, which not only satisfies the availability requirement but is also cost-efficient. In contrast, the existing architecture supports only predefined monitoring and recovery methods; that is, it provides the same methods for all kinds of services. This is inefficient because each service normally has different requirements, as shown in Table 1. Through the experiments, we also observed that the fault detection time is impacted significantly by the processing capability of the system but depends only slightly on the number of applications. When the system processing capability is sufficient, the fault detection time remains quite similar even when the number of VNFs increases. On the other hand, the fault recovery time depends significantly on the number of applications and the number of VNFs.
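To illustrate this per-service selection, the sketch below (ours; the service names, field names, and method labels are hypothetical, not the actual template schema) picks monitoring levels and a recovery method from a service template instead of applying one predefined method to every service:

```python
# Simplified per-service availability configuration (all names assumed).
SERVICE_TEMPLATES = {
    "fire-alarm":  {"availability": 0.9999, "recovery": "redundancy",
                    "monitoring": ["infra", "vnf", "app"]},
    "air-quality": {"availability": 0.99,   "recovery": "respawn",
                    "monitoring": ["infra", "vnf"]},
}

def configure(service: str) -> dict:
    """Derive monitoring levels and the recovery method from the template,
    rather than applying one predefined method to all services."""
    t = SERVICE_TEMPLATES[service]
    return {"monitors": t["monitoring"], "recovery": t["recovery"]}

print(configure("fire-alarm"))   # demanding service: redundancy, full monitoring
print(configure("air-quality"))  # relaxed service: cheaper respawning
```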
In this paper, experiments were conducted with VM-based VNFs. For future research, we plan to extend the architecture to container environments. We are also
interested in investigating the availability issue in a hybrid scenario where both VM-based VNFs and
container-based VNFs are used. The reason is that VM-based VNFs and container-based VNFs have
different characteristics. If a service chain is composed of both VM-based VNFs and container-based
VNFs, the availability guarantee for the service chain is more complicated. This raises an interesting
research question on how to optimally allocate resources and compose VNFs for a number of service
chains that have different availability requirements.

6. Conclusions
A new architecture for providing availability in an IoT-cloud environment was proposed. The existing architecture does not support fault detection and recovery tailored to individual services. Therefore, a template-based cloud architecture capable of automatically configuring fault detection and fault recovery methods according to various service characteristics was designed and implemented to ensure availability. Templates allow the availability method to be applied according to the characteristics of each service, and the feasibility of the method was demonstrated by comparison with the existing architecture. It was verified that the architecture provides optimal availability for IoT services. The analysis of the results demonstrated that the proposed system provides availability more dynamically and with higher efficiency compared to the existing architecture. Moreover, the proposed architecture was implemented, and its performance was verified. In future studies, we intend to continue researching the provisioning of availability in the IoT-cloud.

Author Contributions: All the authors contributed to the research and wrote the article. H.Y. proposed the idea,
designed, and performed the evaluation. Y.K. suggested directions for the detailed designs and evaluation, as
well as coordinating the research.
Funding: This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC
(Information Technology Research Center) support program (IITP-2019-2017-0-01633) supervised by the IITP
(Institute of Information & Communications Technology Planning & Evaluation).
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Botta, A.; de Donato, W.; Persico, V.; Pescapé, A. Integration of Cloud computing and Internet of Things:
A survey. Future Gener. Comput. Syst. 2016, 56, 684–700. [CrossRef]
2. Madria, S.; Kumar, V.; Dalvi, R. Sensor cloud: A cloud of virtual sensors. IEEE Softw. 2014, 31, 70–77.
[CrossRef]
3. Dinh, N.T.; Kim, Y. An Efficient On-Demand Latency Guaranteed Interactive Model for Sensor-Cloud.
IEEE Access 2018, 6, 68596–68611. [CrossRef]
4. Dinh, N.T.; Kim, Y. An Energy Efficient Integration Model for Sensor Cloud Systems. IEEE Access 2019, 7,
3018–3030. [CrossRef]
5. Dinh, N.T.; Kim, Y.; Lee, H. A Location-Based Interactive Model of Internet of Things and Cloud (IoT-Cloud)
for Mobile Cloud Computing Applications. Sensors 2017, 17, 489. [CrossRef]
6. Dinh, T.; Kim, Y.; Gu, T.; Vasilakos, A.V. L-MAC: A Wake-up Time Self-Learning MAC Protocol for Wireless
Sensor Networks. Comput. Netw. 2016, 105, 33–46. [CrossRef]
7. Dinh, N.T.; Kim, Y.; Gu, T.; Vasilakos, A.V. An Adaptive Low Power Listening Protocol for Wireless Sensor
Networks in Noisy Environments. IEEE Syst. J. 2018, 11, 1–12. [CrossRef]
8. Dinh, T.; Gu, T. A Novel Metric for Opportunistic Routing in Heterogenous Duty-Cycled Wireless Sensor
Networks. In Proceedings of the IEEE International Conference on Network Protocols, San Francisco, CA,
USA, 10–13 November 2015.
9. Mineraud, J.; Mazhelis, O.; Su, X.; Tarkoma, S. A gap analysis of Internet-of-Things platforms.
Comput. Commun. 2016, 89–90, 5–16. [CrossRef]
10. Vögler, M.; Schleicher, J.M.; Inzinger, C.; Dustdar, S. Optimizing Elastic IoT Application Deployments.
IEEE Trans. Serv. Comput. 2018, 11, 879–892. [CrossRef]
11. Xu, Y.; Helal, A. Scalable Cloud–Sensor Architecture for the Internet of Things. IEEE Internet Things J. 2016, 3,
285–298. [CrossRef]
12. Alcaraz, C.J.M.; Aguado, J.G. MonPaaS: An Adaptive Monitoring Platform as a Service for Cloud Computing
Infrastructures and Services. IEEE Trans. Serv. Comput. 2015, 8, 65–78. [CrossRef]
13. Dinh, N.T.; Kim, Y. An Efficient Reliability Guaranteed Deployment Scheme for Service Function Chains.
IEEE Access 2019, 7, 46491–46505. [CrossRef]
14. Jhawar, R.; Piuri, V. Fault tolerance and resilience in Cloud computing environments. In Computer and
Information Security Handbook; Elsevier: Amsterdam, The Netherlands, 2013; pp. 125–141.
15. Jung, G.; Rahimzadeh, P.; Liu, Z.; Ha, S.; Joshi, K.; Hiltunen, M. Virtual Redundancy for Active-Standby Cloud
Applications. In Proceedings of the IEEE INFOCOM 2018—IEEE Conference on Computer Communications,
Honolulu, HI, USA, 16–19 April 2018; pp. 1916–1924.
16. Kim, H.; Yoon, S.; Jeon, H.; Lee, W.; Kang, S. Service platform and monitoring architecture for network
function virtualization (NFV). Clust. Comput. 2016, 19, 1835–1841. [CrossRef]
17. Kourtis, M.; McGrath, M.J.; Gardikis, G.; Xilouris, G.; Riccobene, V.; Papadimitriou, P.; Trouva, E.; Liberati, F.;
Trubian, M.; Batalle, J.; et al. T-NOVA: An Open-Source MANO Stack for NFV Infrastructures. IEEE Trans.
Netw. Serv. Manag. 2017, 14, 586–602. [CrossRef]
18. Zhou, A.; Wang, S.G.; Cheng, B.; Zheng, Z.B.; Yang, F.C.; Chang, R.N.; Lyu, M.R.; Buyya, R. Cloud Service
Reliability Enhancement via Virtual Machine Placement Optimization. IEEE Trans. Serv. Comput. 2017, 10,
902–913. [CrossRef]
19. Nabi, M.; Toeroe, M.; Khendek, F. Availability in the cloud: State of the art. J. Netw. Comput. Appl. 2016, 60, 54–67.
[CrossRef]
20. Li, J.; Jin, J.; Yuan, D.; Zhang, H. Virtual Fog: A Virtualization Enabled Fog Computing Framework for
Internet of Things. IEEE Internet Things J. 2018, 5, 121–131. [CrossRef]
21. Pan, J.; McElhannon, J. Future Edge Cloud and Edge Computing for Internet of Things Applications.
IEEE Internet Things J. 2018, 5, 439–449. [CrossRef]
22. El-Sayed, H.; Sankar, S.; Prasad, M.; Puthal, D.; Gupta, A.; Mohanty, M.; Lin, C.-T. Edge of Things:
The Big Picture on the Integration of Edge, IoT and the Cloud in a Distributed Computing Environment.
IEEE Access 2018, 6, 1706–1717.
23. Zhu, H.; Huang, C. Cost-Efficient VNF Placement Strategy for IoT Networks with Availability Assurance.
In Proceedings of the IEEE 86th Vehicular Technology Conference (VTC-Fall), Toronto, ON, Canada, 24–27
September 2017; pp. 1–5.
24. Kumar, A.; Narendra, N.C.; Bellur, U. Uploading and Replicating Internet of Things (IoT) Data on Distributed
Cloud Storage. In Proceedings of the IEEE 9th International Conference on Cloud Computing (CLOUD), San
Francisco, CA, USA, 27 June–2 July 2016; pp. 670–677.
25. Apostolopoulos, P.A.; Tsiropoulou, E.E.; Papavassiliou, S. Game-theoretic Learning-based QoS Satisfaction in Autonomous
Mobile Edge Computing. In Proceedings of the 2018 Global Information Infrastructure and Networking
Symposium (GIIS), Thessaloniki, Greece, 23–25 October 2018; pp. 1–5.
26. Gubbi, J.; Buyya, R.; Marusic, S.; Palaniswami, M. Internet of Things (IoT): A vision, architectural elements,
and future directions. Future Gener. Comput. Syst. 2013, 29, 1645–1660. [CrossRef]
27. Nagios. Available online: https://www.nagios.org/ (accessed on 5 January 2019).
28. Monasca. Available online: http://monasca.io/ (accessed on 5 January 2019).
29. Openstack Tacker. Available online: https://wiki.openstack.org/wiki/Tacker (accessed on 5 January 2019).
30. Zabbix. Available online: https://www.zabbix.com/ (accessed on 5 January 2019).
31. VNF Descriptor. Available online: https://docs.openstack.org/tacker/latest/contributor/vnfd_template_
description.html (accessed on 5 January 2019).
32. Chappell, C. Deploying Virtual Network Functions: The Complementary Roles of TOSCA & NETCONF/YANG; Heavy Reading White Paper, 2015.
33. Wood, T.; Shenoy, P.; Venkataramani, A.; Yousif, M. Black-box and Gray-box Strategies for Virtual Machine
Migration. In Proceedings of the NSDI ’07: 4th USENIX Symposium on Networked Systems Design &
Implementation, Cambridge, MA, USA, 11–13 April 2007; p. 17.
34. Yala, L.; Frangoudis, P.A.; Lucarelli, G.; Ksentini, A. Cost and Availability Aware Resource Allocation and
Virtual Function Placement for CDNaaS Provision. IEEE Trans. Netw. Serv. Manag. 2018, 15, 1334–1348.
[CrossRef]
35. Stress-ng. Available online: http://kernel.ubuntu.com/~cking/stress-ng/ (accessed on 5 January 2019).

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
