A Reduced Network Traffic Method for IoT Data Clustering

RICARDO DE AZEVEDO BRANDÃO, GABRIEL RESENDE MACHADO, RONALDO RIBEIRO GOLDSCHMIDT, and RICARDO CHOREN, Military Institute of Engineering, Brazil
Internet of Things (IoT) systems usually involve interconnected sensor nodes (devices) with low processing
capacity and low memory that collect data in several sorts of applications interconnecting people and things. In
this scenario, mining tasks, such as clustering, have been commonly deployed to detect behavioral patterns
from the collected data. The centralized clustering of IoT data demands high network traffic to transmit the
data from the devices to a central node, where a clustering algorithm must be applied. This approach does not
scale as the number of devices increases, and the amount of data grows. However, distributing the clustering
process through the devices may not be a feasible approach as well, since the devices are usually simple and
may not have the ability to execute complex procedures. This work proposes a centralized IoT data clustering
method that demands reduced network traffic and low processing power in the devices. The proposed method
uses a data grid to summarize the information at the devices before transmitting it to the central node, reducing
network traffic. After the data transfer, the proposed method applies a clustering algorithm that was developed
to process data in the summarized representation. Tests with seven datasets provided experimental evidence
that the proposed method reduces network traffic and produces results comparable to the ones generated by
DBSCAN and HDBSCAN, two robust centralized clustering algorithms.
CCS Concepts: • Information systems → Data mining; • Computer systems organization → Distributed architectures; Embedded software.
Additional Key Words and Phrases: Data Traffic Reduction, Data Summarization, Internet of Things, Distributed
Data Mining
ACM Reference Format:
Ricardo de Azevedo Brandão, Gabriel Resende Machado, Ronaldo Ribeiro Goldschmidt, and Ricardo Choren.
2020. A Reduced Network Traffic Method for IoT Data Clustering. ACM Trans. Knowl. Discov. Data. 1, 1,
Article 1 (January 2020), 23 pages. https://doi.org/10.1145/3423139

1 INTRODUCTION
The Internet of Things (IoT) is an evolution of the current Internet into a network of interconnected
objects that not only harvests information from the environment (sensing) and interacts with the
physical world (actuation/command/control), but also uses existing Internet standards to provide
services for information transfer, analytics and applications [18]. IoT allows the interconnection of
people and things anytime, anywhere, with anything or anyone using any path and any service
[33]. In fact, IoT applications use a large number of tiny, low-power, low-cost sensor nodes (devices)
with low processing capability and low memory, possibly heterogeneous and multi-functional, which
are randomly and highly distributed in the physical environment [3]. IoT devices are used in all
sorts of applications such as [28]: weather monitoring and forecasting, traffic monitoring and
management, agriculture and food production, travel planning, smart cities and social life.
Authors’ address: Ricardo de Azevedo Brandão, rbrandao@protonmail.com; Gabriel Resende Machado, gabriel.rmachado10@
gmail.com; Ronaldo Ribeiro Goldschmidt, ronaldo.rgold@ime.eb.br; Ricardo Choren, choren@ime.eb.br, Military Institute
of Engineering, Praça General Tibúrcio, 80, Rio de Janeiro, RJ, Brazil.

ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national
government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to
allow others to do so, for Government purposes only.
© 2020 Association for Computing Machinery.
1556-4681/2020/1-ART1 $15.00
https://doi.org/10.1145/3423139

ACM Trans. Knowl. Discov. Data., Vol. 1, No. 1, Article 1. Publication date: January 2020.

This results in the production of large amounts of data which are generated frequently (possibly
periodically) and which have to be stored, processed and presented in a seamless and conveniently
interpretable form. For instance, the Boeing 787 set of sensors is able to generate up to 5 gigabytes
of data in one second [38] and connected cars are expected to send 25 gigabytes of data (each
car) every hour, collecting telematics and driver behavior data to keep the vehicle’s performance,
efficiency, and safety in check [32].
IoT provides fertile data to mining techniques that extract valuable information regarding the
behavior of people and processes [2, 8, 29, 34, 44]. One of the most useful mining tasks for behavior
pattern detection is data clustering, which organizes data in groups where intragroup and intergroup
data similarities are maximized and minimized, respectively. However, clustering data from multiple
sensors is a growing challenge, particularly as the number of IoT devices and the volume of data is
predicted to increase dramatically [11].
Basically, clustering IoT data follows one of two main approaches [34, 40]: centralized or dis-
tributed. The centralized approach usually demands a significant network traffic to transmit the
collected data from the devices to a central node where a clustering algorithm must be applied.
This approach does not scale well as the amount of data grows.
In the distributed approach, the mining techniques perform partial analysis of data at individual
sites and send the outcome as partial results to a central node that aggregates the global result [25].
This approach may reduce data traffic, but it requires devices with enough
computational capacity to process the initial mining step. This processing demand can be met
with edge computing solutions: a paradigm that shifts computing applications, data, and services
from centralized nodes to the edge of the network. They enable data processing closer to the source
of data with higher bandwidth and lower latency [1]. Nevertheless, those solutions may bring
two additional problems to the distributed mining approach: (1) competition for computational
resources, if processing is pushed to the data collector nodes; and/or (2) architectural heterogeneity
and, hence, system complexity, if intermediate layers are created to accommodate edge servers.
Therefore, in this work, our question is: given an IoT system, is it possible to reduce network traffic
to transfer IoT data for centralized clustering, without increasing the computational demand of the
devices and without compromising the clustering results?
In order to answer the above question positively, we propose a data clustering method that
uses a data grid to summarize the original information at the devices before transmitting it to the
central node. Data summarization reduces network traffic significantly, as transmission is restricted
to summarized information. The proposed summarization process has, in its worst case, linear
complexity, avoiding computational overhead in the devices. Then, the proposed method applies
gCluster, a clustering algorithm that was developed to process data in the summarized representation.
Tests with seven datasets provided experimental evidence that the proposed method to reduce
network traffic produces results comparable to the ones generated by DBSCAN [13] and HDBSCAN
[10], two robust centralized algorithms.
The paper is organized as follows. Section 2 presents a conceptual comparison between the
proposed method and the state-of-the-art related research initiatives for IoT data clustering. Section
3 describes the proposed method. Details about the experimental setup and the obtained results are
presented in Section 4. Section 5 concludes the paper, highlighting the work’s main contributions
and indicating some future work.

2 RELATED WORKS
With the progress of IoT systems, huge amounts of data are increasingly originating from multiple,
dispersed and often heterogeneous sources [34]. Mining such massive data volumes is a challenging
task. By and large, two main data mining approaches can be used to analyze IoT data [34, 40]:


distributed or centralized. In the distributed approach, the mining techniques perform partial
analysis of data at individual sites and send the outcome as partial results to a central node that
aggregates the global result [25]. In the centralized approach, the data collected by the IoT devices
are transmitted to a central node where a data mining algorithm must be applied [24].
According to [7], the works that follow the distributed data mining approach for IoT can be divided
into four models: (1) multi-layered, (2) grid-based, (3) multi-technology integration perspective, and
(4) distributed.
The multi-layer data mining model distributes the process across the layers using data filters
to reduce data traffic, transferring to the last layer nothing but relevant data. It encompasses
four layers. The data collection layer is responsible for sensing the environment and capturing
information. The data management layer stores data in the database, after data cleaning. The event
processing layer organizes and aggregates data through event-based queries, according to the type
of event to be processed. The data mining service layer executes mining tasks in order to extract
knowledge from the processed data. Studies like [24, 30] follow this data mining model.
The grid-based data mining model for IoT uses the grid computing concept. According to this
concept, the available resources of the devices in a network can be shared, making global processing
more efficient. This approach consists of five layers [39]. The IoT resource layer includes hardware
and software modules. The IoT service layer encompasses storage and scheduling services. The grid
middleware layer deals with the problems caused by a heterogeneous data network. The grid mining
layer is responsible for data fusion. The grid application layer is responsible for data manipulation,
workflow controlling, execution management, and IoT application. [22, 39] are examples of works
that fit into the grid-based data mining model.
In the multi-technology integration model, there is an intermediate layer called the context-aware
layer. It is responsible for integrating the technologies of different types of devices with the
technologies available in the data mining layer. This model encompasses studies such as the ones
described in [16, 19].
Finally, in the distributed data mining model for IoT, nodes summarize data in local knowledge
components (KC) and send those KC to an aggregation module that transforms them into a global
KC to be processed by the applications in the central node. Most IoT data mining studies follow
this model. In the following paragraphs, we discuss some of those studies, highlighting the ones
concerning data clustering, the data mining task analyzed in this paper.
Distributed clustering may reduce data traffic, but it requires devices with enough computational
capacity to process the initial mining step. For example, in [37], the
authors investigated the use of edge computing as a way to achieve such processing demand.
They implemented a distributed version of K-means at the IoT Edge and tested it in a weather
dataset. Another edge computing-based study with IoT data is presented in [9]. Such study assessed
two ways of mining patterns using edge computing. In the first, local data were transmitted
from IoT devices to local networking services to compute partial models. In the second, the
mining process was completely pushed to local IoT devices so that they could discover locally
frequent patterns by themselves. Both ways appeared to be more efficient and practical than a
cloud computing-based solution in a case study with urban data. Despite the above mentioned
works’ successful experimental results, using edge computing-based distributed mining may bring
two relevant drawbacks: (1) competition for computational resources, if processing is pushed to
the data collector nodes; and/or (2) architectural heterogeneity and, hence, system complexity, if
intermediate layers are created to accommodate edge servers.
In [20], the authors propose CLUBS-P, a parallel algorithm that clusters distributed data without
moving them. Actually, CLUBS-P follows the distributed mining model where information exchange
between its processing nodes is restricted to summarized data only. The processing nodes must


compute marginal distributions, local density of sets of possible outliers and assign points to
their closest centroids. The work compared two versions of CLUBS-P: a message passing-based
implementation and a Spark-based one. Both demand nodes with a processing capacity compatible
with the computations indicated above. Clearly, such a computational requirement may not be met
in most real IoT applications.
The work described in [21] addresses distributed data clustering by generating local models
and aggregating these models into a global model in the central node. The algorithm proposed in
that paper combines the characteristics of two well-known clustering algorithms: K-means and
DBSCAN. It starts the process using DBSCAN to find the clusters. From the clustering result, the
core points are used as the initial centroids provided to an adapted version of K-means. From the
results of the two algorithms, the local models are generated and sent to the central node, which,
in turn, joins all received models. To use the proposed algorithm in IoT systems, the devices must
have enough computational power and storage to support possible processing concurrency among
IoT tasks and the clustering process.
Distributed Dynamic Clustering (DDC) [6] is a clustering algorithm that deals with spatial
datasets and follows the known three-step distributed mining process: (1) generation of local
models in distributed nodes; (2) local model transference to the central node, and; (3) aggregation of
the local models in the central node, generating global models. In that work, local models are found
by running a local clustering algorithm in each node, and then the local clusters are transferred and
merged into global clusters. The idea is to reduce the amount of data to be transferred across the
network. To this end, the local clusters are represented exclusively by their boundaries. Despite the
data reduction proposal, DDC demands high computational power on the local nodes to execute
the data clustering algorithm and the data reduction technique.
GDCluster [26] is a clustering method that can cluster datasets dispersed among the nodes of
distributed environments. To this end, those nodes gradually build a summarized view of the dataset
and execute weighted versions of clustering algorithms to produce approximations of the final
clustering results. The nodes combine their data with the summarized data received from other
nodes and then propagate the results to randomly chosen neighboring nodes using a gossip-based
communication process (i.e. a data dissemination technique which assumes no predefined structure
in the network). This process is continuously repeated until the central node is reached. According
to the authors, GDCluster showed good results when compared with K-Means and DBSCAN,
important representatives of the partition-based and density-based clustering paradigms. Despite
its performance, GDCluster may not be feasible in many IoT environments, since their nodes
may not have enough computational capacity to run the above mentioned processes.
In [42], the authors propose two distributed clustering algorithms for observations collected by
spatially distributed IoT devices: Distributed Fuzzy C-Means and Distributed K-Means. Instead of
transmitting raw data to a center node, the proposed algorithms acquire the global clustering result
through neighboring information exchange applied to all observations collected by all devices. The
authors also propose a distributed initialization method for IoT environments. As reported, this
method experimentally outperformed the random initialization process, leading to algorithmic
stability and clustering quality. Despite those promising experimental results, that work focused
on IoT networks with peer-to-peer topology and assumed that the network topology is fixed and
stable, and all nodes have the same capacities of storage, computation and telecommunications.
These assumptions clearly restrict the universe of real IoT scenarios to which such work can be
applied.
The above mentioned models and algorithms can reduce data traffic and divide the processing
load through distributed mining. However, there is no guarantee that the possibly heterogeneous
and simple IoT devices have enough computational capacity to perform the extra processing


imposed by the distributed mining approach. For this reason, in this paper, we focused on the
centralized approach to IoT data clustering. The following paragraphs describe related works based
on such approach.
The work reported in [12] proposes GDPC, a dynamic Gaussian mixture models-based clustering
algorithm for data streams. According to the authors, different from similar algorithms, GDPC is
able to induce clustering models that provide membership probability of each instance to each
cluster. Moreover, the proposed algorithm can identify concept drifts and decide whether and when
the clustering models must be updated. The paper reported that GDPC achieved good results when
applied to IoT data streams from an industrial test bench. Although the authors mentioned that
GDPC can deal with data streams with a limited amount of historical data and time needs, no
comments concerning data centralization process for the clustering algorithm could be found in
that paper.
Similarly to [12], in [43], the authors also tackle the industrial IoT data streams clustering problem.
They propose an incremental variant of CFS, a clustering algorithm that considers objects’ densities
and minimum distance from other objects to identify the clustering centers and then assigns the
remaining objects to their nearest centers. The proposed incremental version of CFS clustering
can update the current clustering results as new objects arrive, rather than
re-running CFS clustering on the whole dataset. To integrate the clustering of new objects
into the existing one, incremental CFS uses two adjustment operations (cluster creating and cluster
combining) and k-medoids to modify the clustering centers. Despite the promising experimental
results presented by the incremental CFS, the paper did not mention any effort to reduce the data
transfer needed to centralize the algorithm’s input.
Determining how many clusters can be found in a dataset is a recurrent problem in data clustering.
To tackle it, [31] proposed a method that explores data distribution in order to infer the number of
clusters in IoT data streams. According to this method, the shape of the probability distribution
curves of the data features can give good approximations of how many clusters are needed to group
the data properly. After selecting the number of clusters, the method uses an online clustering
mechanism to cluster the incoming data from the streams. This mechanism ensures that the
clustering remains adaptive to drifts by adjusting it as the data changes. The paper reports on a
case study from an intelligent traffic analysis scenario where features like average speed of vehicles
and number of cars were extracted from traffic sensors and used for clustering. As observed in the
above mentioned works, no comments about actions to reduce data transfer
for the centralized clustering could be found in this paper.
In [37], the authors executed a parallel centralized version of K-means in a Cloud-based scenario.
Indeed, parallel solutions have been useful for mining massive volumes of sequential data in several
applications [15]. Although the experiments described in [37] also demanded a centralized dataset,
no comments concerning the IoT data centralization were made. Seemingly, no measures were taken
to reduce data transfer when putting the IoT data together.
In summary, none of the above works mentioned how to deal with one of the main problems
with the centralized approach to IoT data clustering: the high amount of data to be transferred
from the data collector devices to the central node to process the data clustering algorithm. This
approach does not scale well as the amount of data grows. To mitigate this problem, this paper
proposes a method that reduces data transfer and does not compromise the results of the clustering
process.

3 THE PROPOSED METHOD


In this section, we describe a reduced network traffic method for IoT data clustering. Based on the
centralized mining approach, our method summarizes all data collected by the IoT devices so that


it reduces the amount of data to be transferred to a central node where the clustering process is
executed, without hampering the clustering results.
Figure 1 illustrates the seven steps of our method distributed in an arbitrary IoT system with
a central node and 𝑞 possibly heterogeneous and low-capacity IoT devices. Those steps can be
divided into three main stages to be sequentially executed. In the first stage, the data analyst must
set the overall system configuration, defining the parameters to be used in the other stages. After
configuration, the parameters are transferred to all IoT devices which begin the second stage. In this
stage, each IoT device gathers and summarizes data for a user specified time range. When the time
interval ends, all IoT devices send the summarized data to the central node over the network and
the last stage begins. It receives the summarized data from the IoT devices, integrates those data
and runs a centralized data clustering algorithm. The following subsections describe our method’s
stages and their steps.

Fig. 1. Workflow of the proposed method distributed in an arbitrary IoT system

3.1 Stage 1 - System Configuration


This stage allows the data analyst to configure all the parameters to be used in the other stages.
It comprises the steps Variable Configuration, Clustering Parameters Configuration, and Space
Partitioning. They all run in the IoT system’s central node. Table 1 briefly presents those parameters.
They will be opportunely detailed in the subsections ahead.
Table 1. Proposed method’s main parameters

Parameter                | Notation | Description
-------------------------|----------|--------------------------------------------------------
Set of active variables  | 𝑉𝑀       | Variables to be considered in the clustering process
Set of passive variables | 𝑉𝑆       | Variables to be considered in the segmentation process
Epsilon                  | 𝜀        | Creates a grid of cells with edge size 1/𝜀
Normalization criterion  | 𝜂        | E.g.: min-max, maximum, sum, score, etc.
NumDevices               | 𝑞        | Number of IoT devices in the IoT network
MinCells                 | –        | Minimum number of cells that may define a cluster
MinForce                 | –        | Minimum “attraction force” to connect two cells

3.1.1 Variable Configuration. In cluster analysis, variables can be classified as active or passive
[5]. Active variables are the input variables for the clustering algorithm. Passive variables are also
connected to the objects to be clustered, but they are not actively used during modeling. These
variables are used to make data segments identifiable and distinctive.
There is no general rule on how to decide whether a variable should be active or passive; it is
a decision made from a business point of view. Thus, given a set of available variables 𝑉 , this
step allows the data analyst to partition 𝑉 , defining 𝑉𝑆 and 𝑉𝑀 , the sets of variables to be used in
segmentation and in clustering, respectively. Equation 1 formalizes these sets and their relation.

𝑉𝑆 ⊊ 𝑉 ∧ 𝑉𝑀 ⊊ 𝑉 ∧ 𝑉𝑆 ∩ 𝑉𝑀 = ∅ ∧ 𝑉𝑆 ∪ 𝑉𝑀 = 𝑉 (1)
The passive variables are used to filter and segment data, creating grouped datasets with similar
characteristics. So, this step also allows the analyst to define which segments should be considered.
Each segment must contain a description and a filter function, formalized as follows:
seg.desc ↦→ text that describes the contents of seg.
seg.filter ↦→ predicates, defined on 𝑉𝑆 , that characterize the seg dataset.
For illustrative purposes, consider an IoT system that collects car trip data and 𝑉 = {𝑤𝑒𝑒𝑘𝑑𝑎𝑦, 𝑑𝑢𝑟𝑎𝑡𝑖𝑜𝑛, 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒}.
Also consider 𝑉𝑀 = {𝑑𝑢𝑟𝑎𝑡𝑖𝑜𝑛, 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒} and 𝑉𝑆 = {𝑤𝑒𝑒𝑘𝑑𝑎𝑦}. In this example, the following seg-
ments could be defined to separate workday from weekend data.
𝑠𝑒𝑔0 .𝑑𝑒𝑠𝑐 ↦→ workday trips
𝑠𝑒𝑔0 .𝑓 𝑖𝑙𝑡𝑒𝑟 ↦→ weekday ∈ {monday, tuesday, wednesday, thursday, friday}
𝑠𝑒𝑔1 .𝑑𝑒𝑠𝑐 ↦→ weekend trips
𝑠𝑒𝑔1 .𝑓 𝑖𝑙𝑡𝑒𝑟 ↦→ weekday ∈ {saturday, sunday}
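The segment definitions above could be represented as predicate functions over the passive variables, as in this minimal Python sketch (the paper does not prescribe a concrete representation; all names here are illustrative):

```python
# Illustrative sketch: segments as (description, filter-predicate) pairs.
# Each filter is a predicate over the passive variables (V_S).
WORKDAYS = {"monday", "tuesday", "wednesday", "thursday", "friday"}

segments = [
    {"desc": "workday trips", "filter": lambda rec: rec["weekday"] in WORKDAYS},
    {"desc": "weekend trips", "filter": lambda rec: rec["weekday"] in {"saturday", "sunday"}},
]

def assign_segment(record):
    """Return the description of the first segment whose filter the record satisfies."""
    for seg in segments:
        if seg["filter"](record):
            return seg["desc"]
    return None
```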
The active variables can be expressed in different units, ranges or scales. To avoid comparing
numbers with different orders of magnitude, our method adjusts them using data normalization.
Thus, for each variable in 𝑉𝑀 , the analyst must choose the normalization criterion to be used in all
IoT devices. Our method provides some of the main normalization criteria available in the specialized
literature, including, for example, min-max normalization [4].
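As a sketch, min-max normalization rescales each active variable to the [0, 1] interval. For all devices to normalize consistently, the variable bounds would have to be fixed at configuration time; the function below is illustrative, not part of the proposed method:

```python
def min_max_normalize(value, lo, hi):
    """Rescale value from the configured range [lo, hi] to [0.0, 1.0]."""
    if hi == lo:
        return 0.0  # degenerate range: map everything to 0
    return (value - lo) / (hi - lo)
```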
At last, the data analyst must also set the time range of data collection and summarization to be
executed by the IoT devices. This choice is a business decision that is directly related to the IoT
system’s domain of application.

3.1.2 Clustering Parameters Configuration. In this step, the analyst should configure the parameters
for the clustering method. This includes the definition of the minimum number of cells that may
define a cluster (minCells) and a threshold value that determines whether adjacent cells should be
connected to build a cluster (minForce). These parameters are used in the clustering task, further detailed in
Sub-section 3.3.2.
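Taken together, the configuration produced in Stage 1 and shipped to the devices could be sketched as a simple record of the parameters in Table 1 (illustrative Python; the paper does not prescribe a concrete representation, and the field names and values below are assumptions):

```python
from dataclasses import dataclass

@dataclass
class SystemConfig:
    active_vars: tuple    # V_M: variables used for clustering
    passive_vars: tuple   # V_S: variables used for segmentation
    epsilon: int          # grid resolution: cell edge size = 1/epsilon
    normalization: str    # e.g. "min-max"
    num_devices: int      # q: number of IoT devices in the network
    min_cells: int        # minimum number of cells that may define a cluster
    min_force: float      # minimum "attraction force" to connect two cells
    time_range_s: int     # collection/summarization window, in seconds

# Example configuration for the car-trip scenario of Sub-section 3.1.1
config = SystemConfig(
    active_vars=("duration", "distance"),
    passive_vars=("weekday",),
    epsilon=10,
    normalization="min-max",
    num_devices=50,
    min_cells=3,
    min_force=0.5,
    time_range_s=3600,
)
```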


3.1.3 Space Partitioning. Space partitioning is the process that divides a monitoring domain (hy-
perspace) into a grid of cells (hypercubes) [35]. It is the last step in the overall system configuration
stage, comprising the device configuration in the proposed method.
In this step, the analyst should create a grid with 𝜀^𝑧 cells, with cell edge size = 1/𝜀, where
𝑧 = |𝑉𝑀 | (i.e. the number of active variables) and 𝜀 divides each dimension (i.e. each active variable) into
intervals of equal size. Each cell 𝑐 in the grid is referenced in the hyperspace by a z-dimensional
coordinate system as indicated in Equation 2. Figure 2 illustrates an example with four cells in a
two-dimensional system.

𝑐 (𝜅1, 𝜅2, ..., 𝜅𝑧 ), where 𝜅𝑖 ∈ Z (2)

Fig. 2. Cell representation in a bi-dimensional space

The main idea behind this step is to create a hyperspace whose dimensions have a coarse enough
granularity that each cell can accommodate multiple data records collected by the IoT devices.
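Assuming the active variables have been normalized to [0, 1), the calcCell function used later in Algorithm 1 reduces to one floor operation per dimension. A minimal sketch under that assumption (the function name follows the paper; its signature is illustrative):

```python
import math

def calc_cell(point, epsilon):
    """Map a z-dimensional normalized point to the integer cell coordinates
    (kappa_1, ..., kappa_z) of the grid whose cells have edge size 1/epsilon."""
    return tuple(math.floor(x * epsilon) for x in point)
```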
When the data analyst finishes the system configuration stage, the parameter values are sent to
each IoT device in order to start the Data Collection and Representation stage.

3.2 Stage 2 - Data Collection and Representation


This stage encompasses the steps that run in the IoT devices (i.e. Data Gathering and Data Sum-
marization). It begins when the IoT devices receive the parameters configured by the analyst in
the previous stage. It runs as long as the user-defined time span lasts. When this interval ends, the
devices send the summarized version of the collected data to the central node.
3.2.1 Data Gathering. Data collection is the most common operation executed by IoT devices.
Those devices represent the sensor layer in an IoT infrastructure, which is responsible for obtaining
the information of the real world through various types of sensors.
When the IoT application begins, each device 𝑑𝑒𝑣 creates an internal dataset 𝐷𝑆𝑑𝑒𝑣 to store the
data as they are collected. At each read cycle, each 𝑑𝑒𝑣 gathers the values of the variables read by
its sensors, creates a data record represented by a vector 𝜈 and stores it in 𝐷𝑆𝑑𝑒𝑣 . This process is
repeated until the user-defined time span finishes. As can be seen, this step is very simple and has
𝑂 (1) time complexity per record: it treats a single data record at a time and never traverses the
generated dataset.
3.2.2 Data Summarization. In this step, the data compression task is executed. First, each device
𝑑𝑒𝑣 processes its 𝐷𝑆𝑑𝑒𝑣 so that each data record 𝜈 in 𝐷𝑆𝑑𝑒𝑣 belongs to a cell in the grid previously
defined by the analyst. An example of a grid with four two-dimensional cells is illustrated in Figure
3. Each filled dot represents a data record. Additionally, 𝜈 must be assigned to the segment 𝑠𝑒𝑔 (also
defined at the System Configuration Stage) whose filter 𝑠𝑒𝑔.𝑓 𝑖𝑙𝑡𝑒𝑟 is satisfied by 𝜈.


Fig. 3. Example of a grid with four cells. Filled dots represent data records.

After data record allocation, all points contained in each cell are preprocessed, normalized
(according to the criterion previously defined), and represented by a single point located at the
center of mass (mean) of the points in that cell. Figure 4 extends the example shown in Figure 3 in order to
illustrate the center of mass (empty dot) of each cell.

Fig. 4. Example of a grid with four cells. Filled dots represent data records. Empty dots are the center of mass.

To describe the data summarization step more clearly, we formalize it in Algorithm 1. For each
data record 𝜈 in 𝐷𝑆𝑑𝑒𝑣 , line 4 checks to which segment 𝜈 belongs. In line 5, active variables of 𝜈 are
projected in 𝜈 ′. If preprocessing tasks have been defined by the analyst, they must be applied to 𝜈 ′
by the preProcessing function (line 6), producing 𝜈 ′′. This function must have been pre-built from
the system configuration details and it comprises any data preparation task such as cleansing (e.g.
value corrections), or codification (e.g., format changes) [41]. For the Normalization function, the
parameter 𝜂 contains the normalization criterion to be applied to 𝜈 ′′ (line 7), generating 𝜈 ′′′, the
normalized version of 𝜈 ′′. In line 8, function calcCell identifies the coordinates of the cell 𝑐 (𝑑𝑒𝑣,𝑠𝑒𝑔)
to which 𝜈 ′′′ must be assigned. The insertVector function in line 11 includes 𝜈 ′′′ in cell 𝑐 (𝑑𝑒𝑣,𝑠𝑒𝑔) ,
increasing the number of vectors/points (𝑡) in that cell and recalculating its center of mass (𝜇).
When the summarization step ends, each device 𝑑𝑒𝑣 sends its 𝐻 (𝑑𝑒𝑣,𝑠𝑒𝑔) to the central node, for all
𝑠𝑒𝑔 defined by the user.
There are two important considerations that must be emphasized at this point. First, as the number
of cells is supposed to be much lower than the number of collected data records, representing all
those data records by their cells¹ certainly reduces the amount of data to be transferred to the
central node.
The second consideration is that the summarization step has 𝑂 (𝑛) time complexity, since it
traverses the entire dataset without any nested loops. This seems to be a computational demand
1 Remember that each cell 𝑐 is represented by a single pair of values (𝑐’s center of mass and the number of points in 𝑐) and
its coordinates.

ACM Trans. Knowl. Discov. Data., Vol. 1, No. 1, Article 1. Publication date: January 2020.
1:10 Brandao, et al.

Algorithm 1: Data Summarization Step


Input:
data collected by device dev: 𝐷𝑆𝑑𝑒𝑣
cell size: 𝜀
normalization criterion: 𝜂
1 Let 𝐻 represent the space grid (partitioned in cells), each element in it is a cell (𝑐) that is
represented by two features: number of vectors belonging to it (𝑡) and its center of mass
vector (𝜇).
2 𝐻 (𝑑𝑒𝑣,𝑠𝑒𝑔) ← ∅
3 foreach data record 𝜈 ∈ 𝐷𝑆𝑑𝑒𝑣 do
4 𝑠𝑒𝑔 ← the segment whose filter 𝑠𝑒𝑔.𝑓 𝑖𝑙𝑡𝑒𝑟 (𝜈) is true
5 𝜈 ′ ← projection of the variables in 𝑉𝑆 applied to 𝜈
6 𝜈 ′′ ← preProcessing(𝜈 ′)
7 𝜈 ′′′ ← Normalization(𝜈 ′′, 𝜂)
8 𝑐 (𝑑𝑒𝑣,𝑠𝑒𝑔) ← calcCell(𝜈 ′′′, 𝜀)
9 if 𝑐 (𝑑𝑒𝑣,𝑠𝑒𝑔) ∉ 𝐻 (𝑑𝑒𝑣,𝑠𝑒𝑔) then
10 insert 𝑐 (𝑑𝑒𝑣,𝑠𝑒𝑔) in 𝐻 (𝑑𝑒𝑣,𝑠𝑒𝑔)
11 𝑐 (𝑑𝑒𝑣,𝑠𝑒𝑔) .insertVector(𝜈 ′′′)
12 return 𝐻

compatible with most real IoT systems, which contain possibly heterogeneous, low processing-
capacity devices. It is also important to mention that this cost may be reduced to 𝑂 (1) per record, if
necessary. To this end, the data gathering and summarization steps can easily be merged, so that
the data records are summarized as they are collected. That would avoid traversing the dataset as
described in Algorithm 1.
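To make the per-record cost concrete, the summarization step can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' implementation: it assumes the records are already filtered into a single segment, preprocessed, and normalized to [0, 1], so only the calcCell and insertVector logic of Algorithm 1 remains.

```python
def summarize(records, eps):
    """Summarize normalized records (values in [0, 1]) into a grid of
    eps x eps x ... cells, keeping only each cell's point count t and
    center-of-mass vector mu (cf. Algorithm 1, one device, one segment)."""
    grid = {}  # cell coordinates -> [t, mu]
    for v in records:
        # calcCell: the cell coordinates are the integer grid indices
        coords = tuple(min(int(x * eps), eps - 1) for x in v)
        if coords not in grid:
            grid[coords] = [0, [0.0] * len(v)]
        # insertVector: incremental center-of-mass update
        t, mu = grid[coords]
        grid[coords] = [t + 1, [(m * t + x) / (t + 1) for m, x in zip(mu, v)]]
    return grid

grid = summarize([(0.10, 0.12), (0.14, 0.16), (0.90, 0.95)], eps=2)
# the first two records fall in cell (0, 0), the third in cell (1, 1)
```

Because each record touches exactly one cell, the loop is a single pass over the dataset, which is the 𝑂 (𝑛) behavior discussed above; folding the loop body into the data gathering routine yields the 𝑂 (1)-per-record variant.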

3.3 Stage 3 - Data Consolidation and Mining


This stage runs in the central node and comprises the last two steps of our method: Data Reception
and Integration and Data Clustering. It receives the summarized data sent from the IoT devices,
aggregates and clusters them according to the parameters previously specified by the analyst in
the first stage. For the data clustering step, this stage uses gCluster, a clustering algorithm designed
to deal with the data summarized by our method.
3.3.1 Data Reception and Integration. When the central node receives the set 𝐻 (𝑑𝑒𝑣,𝑠𝑒𝑔) from each
device dev, it has to create a consolidated set 𝐻 Θ𝑠𝑒𝑔 composed of the information from all the devices
w.r.t. segment 𝑠𝑒𝑔. Note that, for the sake of simplicity, we will refer to 𝐻 Θ𝑠𝑒𝑔 as 𝐻 Θ , omitting the
reference to segment index in the remainder of this subsection.
As all the grids have the same configuration, a cell 𝑐 Θ in the central node and a cell 𝑐 (𝑑𝑒𝑣,𝑠𝑒𝑔) in
device dev, which is in the same segment seg, will have the same coordinates. Then, as the central
node receives representations of 𝑐 (𝑑𝑒𝑣,𝑠𝑒𝑔) , 𝑐 Θ is updated as indicated in Equations 3 (center of mass)
and 4 (number of points), where 𝑐 Θ .𝜇 and 𝑐 Θ .𝑡 are, respectively, the center of mass and the total
number of points in cell 𝑐 Θ , which is inside the integrated space grid 𝐻 Θ .

𝑐 Θ .𝜇 = [(𝑐 Θ .𝜇 × 𝑐 Θ .𝑡) + (𝑐 (𝑑𝑒𝑣,𝑠𝑒𝑔) .𝜇 × 𝑐 (𝑑𝑒𝑣,𝑠𝑒𝑔) .𝑡)] / (𝑐 Θ .𝑡 + 𝑐 (𝑑𝑒𝑣,𝑠𝑒𝑔) .𝑡) (3)


𝑐 Θ .𝑡 = 𝑐 Θ .𝑡 + 𝑐 (𝑑𝑒𝑣,𝑠𝑒𝑔) .𝑡 (4)

At this point, two aspects of the proposed method must be highlighted. First, the data sent over
the network is summarized, having a lower impact on the network resources. Second, as the central
node receives and integrates the data, the devices do not need to keep the original data. Thus, the
devices can discard the data collected so far, releasing memory space for future data collection.
This is an important feature, since the devices may have low memory capacity.
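Equations 3 and 4 amount to a count-weighted running average, which can be sketched as follows (an illustrative sketch; the dict-based cell representation is an assumption, not the authors' data structure):

```python
def merge_cell(c_central, c_device):
    """Fold a device cell into the consolidated cell that has the same
    grid coordinates (Equations 3 and 4). Each cell is a dict holding
    its point count 't' and center-of-mass vector 'mu'."""
    t_sum = c_central["t"] + c_device["t"]
    # Equation 3: new center of mass is the count-weighted average
    c_central["mu"] = [
        (m1 * c_central["t"] + m2 * c_device["t"]) / t_sum
        for m1, m2 in zip(c_central["mu"], c_device["mu"])
    ]
    # Equation 4: point counts simply add up
    c_central["t"] = t_sum
    return c_central

c = {"t": 2, "mu": [0.25, 0.5]}
merge_cell(c, {"t": 2, "mu": [0.75, 0.5]})
# equal point counts, so the merged center of mass is the midpoint
```

Because the merge only reads the incoming pair (𝑡, 𝜇), the central node never needs the raw records, which is exactly why the devices may discard them after transmission.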
3.3.2 Data Clustering. After receiving, processing and integrating the data sent from all the
devices, the central node is ready to execute the clustering algorithm. Since the data to be clustered
are summarized, the proposed method introduces gCluster, an attraction force-based clustering
algorithm that is applied to adjacent cells in the space grid 𝐻 Θ . Two cells 𝑐𝑖 (𝜅𝑖 1 , 𝜅𝑖 2 , ..., 𝜅𝑖𝑧 ) and
𝑐 𝑗 (𝜅 𝑗 1 , 𝜅 𝑗 2 , ..., 𝜅 𝑗𝑧 ) are adjacent in 𝐻 Θ if and only if the following three conditions are satisfied: (1)
𝑐𝑖 , 𝑐 𝑗 ∈ 𝐻 Θ ; (2) 𝑐𝑖 ≠ 𝑐 𝑗 ; and (3) |𝜅𝑖𝑛 − 𝜅 𝑗𝑛 | ≤ 1, ∀𝑛 (𝑛 = 1 ... 𝑧).
According to gCluster, adjacent cells exert an attraction force on each other. This force is inspired
by the Law of Universal Gravitation [27]: the attraction force between two cells is proportional
to the number of points contained in each cell and inversely proportional to the square of the
distance between their centers of mass. The attraction force between two adjacent cells 𝑐𝑖 and 𝑐 𝑗
can be calculated as stated in Equation 5.
𝐹 (𝑐𝑖 ,𝑐 𝑗 ) = (𝑐𝑖 .𝑡 × 𝑐 𝑗 .𝑡) / (𝑑𝑖𝑠𝑡 (𝑐𝑖 .𝜇, 𝑐 𝑗 .𝜇))² (5)

However, calculating the force using the raw cell information may lead to outlier values, since it
ignores the density of the points in the cells. To solve this problem, the number of points in each
cell is normalized by the maximum count over all cells, according to Equation 6.
𝑐𝑖 .𝑡𝑛𝑜𝑟𝑚 = 𝑐𝑖 .𝑡 / max𝑐 ∈𝐻 Θ (𝑐.𝑡) (6)

The distance calculation should also be normalized. To do so, we use the cell edge size of the space
grid (i.e., 1/𝜀). Therefore, the attraction force between two adjacent cells 𝑐𝑖 and 𝑐 𝑗 can be rewritten
as indicated in Equation 7.
𝐹 (𝑐𝑖 ,𝑐 𝑗 ) = (𝑐𝑖 .𝑡𝑛𝑜𝑟𝑚 × 𝑐 𝑗 .𝑡𝑛𝑜𝑟𝑚 ) / (𝑑𝑖𝑠𝑡 (𝑐𝑖 .𝜇, 𝑐 𝑗 .𝜇) / (1/𝜀))² (7)

This way, two adjacent cells 𝑐𝑖 and 𝑐 𝑗 are considered connected if the attraction force between
them is greater than or equal to the minForce value, previously defined by the analyst. The connected
cells form a graph in which the nodes are the centers of mass of the cells, and clusters correspond
to the connected components of this graph. The minCells parameter then defines the minimum
number of cells a cluster must contain. This value was also determined by the analyst in the first
stage and indicates that connected components with fewer than minCells cells are not considered
clusters. We formalize the details of gCluster in Algorithm 2.
Figure 5 illustrates the rationale of gCluster with a simple example in which minCells is two.
gCluster identified four clusters: 𝑐 01 , 𝑐 02 , 𝑐 03 and 𝑐 04 . If minCells were three, clusters 𝑐 01
and 𝑐 04 would have been discarded.


Algorithm 2: gCluster
Input:
consolidated grid per segment: 𝐻 Θ
cell size: 𝜀
minimum force value: minForce
minimum number of cells in a cluster: minCells
1 Let 𝐺 represent the connected graphs (set of identified clusters)
2 𝐺←∅
3 foreach cell 𝑐 ∈ 𝐻 Θ do
4 G.insertNode(c)
5 foreach cell 𝜌 ∈ 𝐻 Θ not yet iterated by the outer loop do
6 if 𝜌 and 𝑐 are adjacent then
7 𝐹 (𝑐,𝜌) ← calcForce(𝑐, 𝜌, 𝜀)
8 if 𝐹 (𝑐,𝜌) ≥ 𝑚𝑖𝑛𝐹𝑜𝑟𝑐𝑒 then
9 G.insertEdge(𝑐, 𝜌)

10 G.discardClusters(𝐺, 𝑚𝑖𝑛𝐶𝑒𝑙𝑙𝑠)
11 return 𝐺
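Under the definitions above, the core of Algorithm 2 can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the `{coords: (t, mu)}` grid representation and the use of union-find to extract connected components are assumptions.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)

def g_cluster(grid, eps, min_force, min_cells):
    """Sketch of gCluster on a consolidated grid {coords: (t, mu)}.
    Cells are adjacent when every coordinate offset is at most 1;
    clusters are the connected components of the force graph that
    contain at least min_cells cells."""
    t_max = max(t for t, _ in grid.values())
    parent = {c: c for c in grid}  # union-find over cells

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for ci, cj in combinations(grid, 2):
        if max(abs(a - b) for a, b in zip(ci, cj)) > 1:
            continue  # not adjacent (condition 3)
        (ti, mi), (tj, mj) = grid[ci], grid[cj]
        # Equations 6 and 7: normalized counts, distance in cell-edge units
        force = (ti / t_max) * (tj / t_max) / (dist(mi, mj) / (1 / eps)) ** 2
        if force >= min_force:
            parent[find(ci)] = find(cj)  # insertEdge: merge components

    components = {}
    for c in grid:
        components.setdefault(find(c), []).append(c)
    # discardClusters: drop components with fewer than min_cells cells
    return [cells for cells in components.values() if len(cells) >= min_cells]
```

With cells listed at their centers of mass, three cells in a row whose pairwise forces exceed minForce come out as one cluster, while an isolated cell is discarded by the minCells filter.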

Fig. 5. Example of clusters produced by gCluster

Figure 6 presents another example, this time showing the original data. It is possible to notice
that some points were discarded because they were not in adjacent cells or because, although in
adjacent cells, the attraction force between those cells was not strong enough to connect them.

4 EXPERIMENTS AND RESULTS


We conducted two groups of experiments to assess the effectiveness of the proposed method
(summarization to reduce network load for centralized data clustering). It is important to mention
that none of the experiments was executed in a real IoT network. Moreover, for simplicity, we
considered a simulation environment with a single IoT device. We chose this evaluation approach
because our primary goal was to show that the proposed method can reduce the amount of data to
be transferred to a central node without compromising the centralized clustering results.
The first group of experiments comprised labeled (ground-truth) datasets that do not contain IoT
data. They are described in Section 4.1 and were used to evaluate, both qualitatively and quantitatively,
the clustering results. The second group covered an unlabeled dataset with real IoT data. This
experiment is presented in Section 4.2, which provides a qualitative analysis of the proposed method.

Fig. 6. Examples of clusters detected by gCluster. Points indicate the original data.

4.1 Group I
The experiments in this group included six datasets: Aggregation [17], Diamond9 [36] and four
versions of Cluto [23] (t4.8k, t5.8k, t7.10k and t8.8k). These datasets (ground truths) are depicted in
Figure 7. Although these datasets do not contain IoT data, four reasons led us to choose them: (1) they
are labeled benchmark datasets, widely used to evaluate and compare clustering algorithms, (2) they
represent different scenarios with clusters of arbitrary shape, proximity, orientation, noises, and
varying densities, (3) they are two-dimensional datasets and, hence, facilitate the visual (qualitative)
evaluation of the results, and, (4) we could not find any dataset with IoT data that contained
clustering ground truth information. Their statistical overview is shown in Table 2.

Table 2. Group I - Statistical overview of the datasets

Dataset Points Clusters Noise


Aggregation 788 7 No
Diamond9 3,000 9 No
Cluto t4.8k 8,000 6 Yes
Cluto t5.8k 8,000 6 Yes
Cluto t7.10k 10,000 9 Yes
Cluto t8.8k 8,000 8 Yes

Since the datasets were not provided by a real IoT system, for the sake of simplicity we assumed
that each one of them was entirely collected by a single device and allocated in a single analyst-
defined segment. Moreover, we also assumed that, for each dataset, 𝑉𝑀 = 𝑉 (i.e., all attributes were
considered in clustering). No data preprocessing was necessary. Parameters 𝜀, minForce and minCells
were manually chosen through a sensitivity analysis run on a subset randomly selected from each
dataset. The number of samples in each subset corresponded to 50% of the original dataset's
cardinality. Table 3 presents the gCluster parameter configuration used in the experiments.
Table 4 reports the data compression obtained for each dataset via summarization. Analyzing
the results, two points must be highlighted. First, the average reduction exceeded 74%, which may
represent a substantial cut in the network traffic when transferring data to the central node. Second,
the reduction varies according to the density of the dataset: the higher the density, the more
significant the achieved reduction. This can be seen in Table 4, which shows that the smallest and
the largest reductions occurred in the Aggregation (most sparse) and Cluto t5.8k (most dense)
datasets, respectively.


Fig. 7. Group I datasets - ground truths.

Table 3. gCluster’s parameters set.

Dataset 𝜀 minForce minCells


Aggregation 20 0.0950 3
Diamond9 30 0.0750 3
Cluto t4.8k 50 0.0845 3
Cluto t5.8k 25 0.0940 3
Cluto t7.10k 50 0.0500 3
Cluto t8.8k 47 0.0600 6

After the data summarization, we applied gCluster to the summarized data and then, we compared
it with four baseline methods based on two renowned clustering algorithms: DBSCAN and HDB-
SCAN. DBSCAN is a traditional clustering algorithm that has been designed to discover clusters
and noise (i.e. data that do not meet a given density criterion) in an arbitrary spatial database [13].
On the other hand, HDBSCAN is an extension of DBSCAN that introduces a hierarchical
clustering approach, generating a density-based hierarchy which, in contrast to the global
density threshold in traditional DBSCAN, can be used to extract only the most significant clusters
from a given dataset [10].


Table 4. Summarization results

Dataset Raw Data (bytes) Cells (bytes) Reduction
Aggregation 10,435 6,304 39.59%
Diamond9 60,070 15,424 74.32%
Cluto t4.8k 192,946 26,893 86.06%
Cluto t5.8k 188,256 16,538 91.22%
Cluto t7.10k 242,756 53,493 77.96%
Cluto t8.8k 192,295 46,921 75.56%

In Baselines 1 and 2, we respectively compared the gCluster results with DBSCAN and
HDBSCAN applied to the raw data. The idea was to evaluate the clustering results that would be
produced when a centralized clustering algorithm is performed on the entire dataset.
In turn, Baselines 3 and 4 consist of two steps. Initially, we randomly sampled, without replace-
ment, 𝑃𝑅 = 𝑟𝑜𝑢𝑛𝑑 (𝑃 · 𝑥) points from the dataset, where 𝑃 corresponds to the total number of points
in this dataset and 𝑥 = 1 − 𝑟 corresponds to the complement of the reduction rate 𝑟 , according to
the data summarization step of the proposed method. This step has linear complexity and thus does
not demand significant processing capacity, unlike other distributed mining algorithms. Then, we
respectively applied DBSCAN and HDBSCAN to the selected samples. We repeated this process
100 times. Table 5 summarizes the amount of randomly chosen points per iteration, according to
the reduction rate of the proposed method.

Table 5. Amount of randomly sampled points for each dataset in Baselines 3 and 4.

Dataset Points 𝑃 Reduction 𝑟 Sampled Points 𝑃𝑅 (per iteration)
Aggregation 788 39.59% 476
Diamond9 3,000 74.32% 770
Cluto t4.8k 8,000 86.06% 1,115
Cluto t5.8k 8,000 91.22% 702
Cluto t7.10k 10,000 77.96% 2,204
Cluto t8.8k 8,000 75.56% 1,955
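The sampling sizes in Table 5 follow directly from 𝑃𝑅 = 𝑟𝑜𝑢𝑛𝑑 (𝑃 · (1 − 𝑟 )) and can be reproduced in one line:

```python
def sampled_points(P, r):
    """Number of points P_R = round(P * (1 - r)) drawn per iteration in
    Baselines 3 and 4, where r is the summarization reduction rate of
    the proposed method (Table 4)."""
    return round(P * (1 - r))

# reproduces Table 5, e.g. Aggregation and Cluto t5.8k
assert sampled_points(788, 0.3959) == 476
assert sampled_points(8000, 0.9122) == 702
```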

Table 6 reports the parameters used in the DBSCAN and HDBSCAN algorithms along the
four baselines. It is important to mention that those parameters were manually configured
through a sensitivity analysis run on the same subsets used for gCluster's configuration. It is also
worth mentioning that the parameter names presented in Table 6 follow the documentation of
the DBSCAN2 and HDBSCAN3 implementations available online, which were used to
perform the experiments.
Qualitatively speaking, the results produced by gCluster were very similar to the ones produced
by DBSCAN and HDBSCAN in Baselines 1 and 2 (see Figures 8, 9 and 10), even in datasets with
noise and clusters with arbitrary and complex shapes. These results are evidence that backs up the
hypothesis that the proposed data summarization does not impair clustering. However, it is also
2 https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html. Accessed in June 29, 2020.
3 https://hdbscan.readthedocs.io/en/latest/parameter_selection.html. Accessed in June 29, 2020.


Table 6. DBSCAN and HDBSCAN’s parameters set.

Dataset DBSCAN (𝜖, minPts) HDBSCAN (𝜖, minPts, minSamples, algorithm)
Aggregation 0.0420 7 0.0420 7 9 prims_kdtree
Diamond9 0.0300 12 0.0150 12 9 best
Cluto t4.8k 0.0200 25 0.005 23 50 best
Cluto t5.8k 0.0200 25 0.0250 32 35 boruvka_kdtree
Cluto t7.10k 0.0250 28 0.0150 28 33 best
Cluto t8.8k 0.0218 14 0.0200 20 9 best

worth pointing out that gCluster could not differentiate the noise that forms the horizontal line through
the clusters of Cluto t5.8k dataset (see Figure 9).

Fig. 8. gCluster and Baselines 1 and 2 – Aggregation and Diamond9 datasets

To evaluate the results quantitatively and compare the performance of the proposed method with
the four baseline methods, we used two metrics: (1) the Fowlkes–Mallows Index (FM-Index) [14]
and (2) Accuracy. The FM-Index is an external evaluation metric according to which, the higher
the index value, the greater the similarity between clusters. Its values vary from 0 to 1. In our
work, FM-Index allowed a comparison between each labeled dataset (with ground-truth) and the
corresponding results found by the clustering algorithm to be evaluated. The Accuracy metric,
in turn, checks how accurate the clustering algorithm is regarding the ground-truth settings. Its
values also vary from 0 to 1. Table 7 presents the FM-Index and Accuracy for gCluster and for the
four baseline methods.
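For reference, the FM-Index can be computed by pair counting over the two labelings (a minimal sketch based on its definition; scikit-learn also ships a `fowlkes_mallows_score` function):

```python
from itertools import combinations
from math import sqrt

def fm_index(labels_true, labels_pred):
    """Fowlkes-Mallows index via pair counting:
    FM = TP / sqrt((TP + FP) * (TP + FN)), where a pair of points is a
    TP when it shares a cluster in both the ground truth and the result."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        tp += same_true and same_pred
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    return tp / sqrt((tp + fp) * (tp + fn)) if tp else 0.0

# a perfect match scores 1.0; relabeling the clusters does not matter
assert fm_index([0, 0, 1, 1], [1, 1, 0, 0]) == 1.0
```

Because it only compares pair co-memberships, the index is insensitive to cluster label permutations, which is what makes it suitable for comparing a clustering against a ground truth.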
The first point to highlight is that the high FM-Index values obtained by gCluster corroborate
the qualitative impression that data reduction preserved the main characteristics of all datasets and
did not interfere with the clustering results. Even in the face of high reduction rates, gCluster FM
indexes always exceeded 0.88.


Fig. 9. gCluster and Baselines 1 and 2 – Cluto t4.8k and Cluto t5.8k datasets

Fig. 10. gCluster and Baselines 1 and 2 – Cluto t7.10k and Cluto t8.8k datasets

Comparing gCluster with Baselines 1 and 2, it is possible to observe three aspects. First, gCluster,
DBSCAN and HDBSCAN are all density-based methods. Thus, as expected, they presented good
performance (i.e., high FM-Indexes and accuracies) on datasets such as Aggregation and Diamond9,
where the clusters have similar density and the noisy elements are sparse. In contrast, these methods
performed slightly worse on the Cluto datasets, especially Cluto t5.8k, in which noise is so dense that
it may have hindered cluster boundary detection. Nevertheless, gCluster's FM-Index and accuracy
were better than the respective results of Baselines 1 and 2 in this dataset.


Table 7. Comparative summary - labeled datasets

gCluster Baseline 1 (DBSCAN) Baseline 2 (HDBSCAN) Baseline 3 (DBSCAN) Baseline 4 (HDBSCAN)


Dataset FM Index Accuracy FM Index Accuracy FM Index Accuracy FM Index∗ (𝜇 - 𝜎) Accuracy∗∗ (𝜇 - 𝜎) FM Index∗ (𝜇 - 𝜎) Accuracy∗∗ (𝜇 - 𝜎)
Aggregation 0.963 0.962 0.992 0.995 0.906 0.945 0.694 - 0.043 0.811 - 0.038 0.675 - 0.111 0.784 - 0.087
Diamond9 0.992 0.993 0.989 0.995 0.852 0.930 0.329 - 0.003 0.138 - 0.014 0.651 - 0.074 0.772 - 0.088
Cluto t4.8k 0.957 0.973 0.958 0.972 0.943 0.958 0.407 - 0.003 0.221 - 0.009 0.502 - 0.046 0.487 - 0.045
Cluto t5.8k 0.881 0.931 0.857 0.919 0.817 0.896 0.379 - 0.001 0.165 - 0.008 0.322 - 0.025 0.277 - 0.045
Cluto t7.10k 0.974 0.982 0.983 0.986 0.948 0.960 0.404 - 0.004 0.278 - 0.009 0.508 - 0.020 0.428 - 0.031
Cluto t8.8k 0.946 0.917 0.956 0.944 0.956 0.953 0.382 - 0.008 0.231 - 0.014 0.729 - 0.113 0.699 - 0.170

∗ Arithmetic mean 𝜇 and standard deviation 𝜎 of 100 FM indexes.
∗∗ Arithmetic mean 𝜇 and standard deviation 𝜎 of 100 accuracies.

Second, the differences between the corresponding FM-Indexes and accuracies of gCluster and
DBSCAN occur at the second decimal place in all of the six datasets. These close values indicate
that despite reducing data traffic, the proposed method did not compromise the clustering results
that would be produced if all original data were completely transferred to a central node before
mining.
Third, despite its hierarchical clustering approach, HDBSCAN's performance, represented
by Baseline 2, was slightly worse than DBSCAN's Baseline 1 results for all datasets. We believe
that, in contrast to the implementation of DBSCAN (which relies on only two parameters, 𝜖 and
minPts), the larger number of HDBSCAN parameters might have contributed to a non-optimal
manual parameter tuning.
Although gCluster, Baseline 3, and Baseline 4 rely on linear-complexity summarization procedures
and reduced the same amount of data, Table 7 shows that our proposed method presented
significantly better results than the other two. This is certainly due to the random sampling process:
unlike the gCluster summarization step, random sampling cannot ensure that the data distribution
is preserved, and the unstable results reflect this randomness.

4.2 Group II
In this group, we wanted to conduct an experiment with real IoT data. To this end, we selected the
Chicago Taxi Trips dataset (a public dataset created using IoT devices) and executed all the stages
from the proposed method. First, we created two segments: one with data gathered on weekdays,
and the other with data from weekends. The rationale is that, in a metropolis like Chicago, weekend
traffic is expected to be more fluid than on working days. Then, the overall system configuration
stage defined the following parameters:

𝑉𝑆 = {𝑤𝑒𝑒𝑘𝑑𝑎𝑦}
𝑉𝑀 = {𝑡𝑟𝑖𝑝_𝑑𝑢𝑟𝑎𝑡𝑖𝑜𝑛, 𝑡𝑟𝑖𝑝_𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒}
𝜀 = 50
minForce = 0.0
minCells = 3

Segment 0 as:
𝑠𝑒𝑔0 .𝑑𝑒𝑠𝑐 ↦→ workday trips
𝑠𝑒𝑔0 .𝑓 𝑖𝑙𝑡𝑒𝑟 ↦→ weekday ∈ {monday, tuesday, wednesday, thursday, friday}

Segment 1 as:
𝑠𝑒𝑔1 .𝑑𝑒𝑠𝑐 ↦→ weekend trips


𝑠𝑒𝑔1 .𝑓 𝑖𝑙𝑡𝑒𝑟 ↦→ weekday ∈ {saturday, sunday}
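The two segments boil down to filter predicates over the weekday attribute. An illustrative configuration (the dict/lambda representation and the record field name are assumptions, not the paper's implementation):

```python
# Segment configuration for Group II, following the seg.desc / seg.filter
# notation from the text.
WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday", "friday"}

segments = [
    {"desc": "workday trips",
     "filter": lambda rec: rec["weekday"] in WEEKDAYS},
    {"desc": "weekend trips",
     "filter": lambda rec: rec["weekday"] in {"saturday", "sunday"}},
]

def segment_of(record, segments):
    """Return the index of the first segment whose filter accepts the record
    (line 4 of Algorithm 1)."""
    for i, seg in enumerate(segments):
        if seg["filter"](record):
            return i
    return None

assert segment_of({"weekday": "friday"}, segments) == 0
assert segment_of({"weekday": "sunday"}, segments) == 1
```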

Six taxis were randomly selected, and each one had a device to gather data. The characteristics of
these data can be seen in Table 8. Figure 11 shows a plot of the original data gathered by the devices,
split into the two segments.

Table 8. Sample characteristics of the Chicago Taxi Trips dataset

Car ID Number of Records Size (Bytes)


455b6b 14,209 978,090
4c8b67 11,246 764,151
5f1b23 13,688 940,907
7c51c6 10,006 690,567
b50eb9 14,097 971,251
d1b852 12,688 874,718
Total 75,934 5,219,684

Fig. 11. Chicago Taxi Trips dataset

To execute the experiment, we stored the raw data generated by each taxi in a separate database. For
each database, we applied min-max normalization to each attribute in 𝑉𝑀 . Thereafter, the summarization
method was executed on each database, and the results were sent to another processor simulating the
central node. To obtain the final result, a program gathered all the data sent by each device and
integrated them according to the procedure described in Section 3.3.1.
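The min-max normalization applied to each attribute in 𝑉𝑀 is the standard linear rescaling onto [0, 1] (a minimal sketch):

```python
def min_max(values):
    """Min-max normalization of one attribute column: maps the values
    linearly onto [0, 1], as required by the grid's unit hypercube."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

assert min_max([10.0, 15.0, 20.0]) == [0.0, 0.5, 1.0]
```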
In the proposed method, only summarized information (center of mass and number of points)
was sent to the central node. Figure 12 presents the cells resulting from the summarization step of
the proposed method in the two segments. It led to an expressive reduction of 99.14% in the data
traffic that would be necessary if all the raw data (data generated by all the devices) were to be
sent over the network to the central node. It is important to notice that the reduction preserved the
shape of the data distribution in both segments.
As the Chicago Taxi Trips dataset is unlabeled, we had no ground truth to validate our results
quantitatively. Thus, we applied DBSCAN to the normalized data of each segment (Figure 13). Our
purpose was to evaluate the clustering results that would be produced if a traditional algorithm
were applied to each complete segment, simulating raw data centralization. From a qualitative
point of view, the results produced by gCluster were similar to the corresponding ones produced
by DBSCAN in both segments, providing additional evidence that data summarization as proposed
in this work does not hamper clustering.

Fig. 12. gCluster results from summarized data

Fig. 13. DBSCAN results from raw data

5 CONCLUSION AND FUTURE WORKS


IoT is a network of interconnected devices (sensor nodes) that interact with the physical world
collecting data and providing services in several sorts of applications. In a typical IoT application,
those sensor nodes are highly distributed and have low processing power and low memory. The
growing number of IoT applications and their overwhelming production of data have stimulated
academia and industry to deploy mining tasks, such as data clustering, to identify relevant knowledge
(e.g., people and processes behavioral patterns) from those data.
One of the main approaches to IoT data clustering usually demands high network traffic to
transmit the collected data from the sensor nodes to a central node, where a clustering algorithm
must be applied. This centralized approach does not scale as the number of nodes increases and the
amount of collected data grows. Distributing the clustering process through the sensor nodes is
not a feasible alternative, since these nodes usually are simple devices and may not have enough
processing capacity and memory to run complex procedures.
In this paper, we investigated how to reduce network traffic to transfer IoT data for central-
ized clustering, without increasing the computational demand in the sensor nodes and without
compromising the clustering results. The main contribution is a centralized IoT data clustering
method that uses a data-grid-based process to summarize the original information at the sensor
nodes before transmitting it to the central node. This summarization process has linear complexity,


avoiding computational overhead in the sensor nodes. Data summarization reduces network traffic
significantly, as transmission is restricted to summarized information. After the data transfer, the
proposed method applies gCluster, a clustering algorithm that was developed to process data in the
summarized representation. Experiments with seven datasets provided qualitative and quantitative
evidence that the proposed method reduces network traffic and produces results comparable to the
ones generated by four baseline methods that made use of DBSCAN and HDBSCAN, both robust
centralized clustering algorithms.
Our alternatives for future work include: (i) evaluate gCluster in a real IoT system; (ii) investigate
the relation between cluster density and gCluster input parameters; (iii) develop intelligent tools
that support the analyst in gCluster parameter optimization; (iv) search for internal evaluation
metrics that can be used to assess the clustering results using the dataset itself; (v) apply the proposed
method to multidimensional and real IoT datasets; (vi) search for detection models that distinguish
noise from outliers; (vii) develop a monitoring system that uses the clustering results for behavior
deviation detection; and (viii) create an easy-to-process, reduced-size knowledge model to be pushed
back to the sensors so that behavior deviation detection does not depend on the central node.

ACKNOWLEDGMENTS
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível
Superior - Brasil (CAPES) - Finance Code 001.

REFERENCES
[1] M. Abdelshkour. 2015. IoT, from Cloud to Fog Computing. https://tinyurl.com/ydalpr5s. [online: accessed April 12,
2019].
[2] S. Agrawal and J. Agrawal. 2015. Survey on Anomaly Detection using Data Mining Techniques. Procedia Computer
Science 60 (2015), 708–713.
[3] I.F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. 2002. Wireless sensor networks: a survey. Computer
Networks 38, 4 (2002), 393–422.
[4] Luai Al Shalabi, Zyad Shaaban, and Basel Kasasbeh. 2006. Data mining: A preprocessing engine. Journal of Computer
Science 2, 9 (2006), 735–739.
[5] D. Arndt and N. Langbein. 2002. Data Quality in the Context of Customer Segmentation. In 2002 International Conference
on Information Quality. MIT, Massachusetts, 47–60.
[6] M. Bendechache and M. Kechadi. 2015. Distributed clustering algorithm for spatial data mining. In 2015 IEEE
International Conference on Spatial Data Mining and Geographical Knowledge Services. IEEE Computer Society, Los
Alamitos, CA, USA, 60–65.
[7] S. Bin, L. Yuan, and W. Xiaoyiu. 2010. Research on data mining models for the internet of things. In 2010 International
Conference on Image Analysis and Signal Processing. IEEE Computer Society, Los Alamitos, CA, USA, 127–132.
[8] R. Brandao, R. Goldschmidt, and R. Choren. 2019. A Data Traffic Reduction Approach Towards Centralized Mining in
the IoT Context. In Proceedings of the 21st International Conference on Enterprise Information Systems - Volume 1: ICEIS.
INSTICC, SciTePress, 563–570. https://doi.org/10.5220/0007674505630570
[9] P. Braun, A. Cuzzocrea, C. K. Leung, A. M. Pazdor, J. Souza, and S. K. Tanbeer. 2019. Pattern Mining from big
IoT Data with fog Computing: Models, Issues, and Research Perspectives. In 2019 19th IEEE/ACM International
Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE Computer Society, Los Alamitos, CA, USA, 584–591.
https://doi.org/10.1109/CCGRID.2019.00075
[10] Ricardo J. G. B. Campello, Davoud Moulavi, and Joerg Sander. 2013. Density-Based Clustering Based on Hierarchical
Density Estimates. In Advances in Knowledge Discovery and Data Mining, Jian Pei, Vincent S. Tseng, Longbing Cao,
Hiroshi Motoda, and Guandong Xu (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 160–172.
[11] Cisco. 2017. Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2016–2021 White Paper.
https://tinyurl.com/y8kuucvk. [online: accessed January 31, 2019].
[12] J. Diaz-Rozo, C. Bielza, and P. Larrañaga. 2018. Clustering of Data Streams With Dynamic Gaussian Mixture Models:
An IoT Application in Industrial Processes. IEEE Internet of Things Journal 5, 5 (2018), 3533–3547.
[13] M. Ester, H-P Kriegel, J. Sander, and X. Xu. 1996. A Density-based Algorithm for Discovering Clusters in Large Spatial
Databases with Noise. In International Conference on Knowledge Discovery and Data Mining. Association for Computing
Machinery, New York, NY, USA, 226–231.


[14] E.B. Fowlkes and C.L. Mallows. 1983. A Method for Comparing Two Hierarchical Clusterings. J. Amer. Statist. Assoc.
78, 383 (1983), 553–569.
[15] Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, Han-Chieh Chao, and Philip S. Yu. 2019. A Survey of
Parallel Sequential Pattern Mining. ACM Trans. Knowl. Discov. Data 13, 3, Article 25 (June 2019), 34 pages.
https://doi.org/10.1145/3314107
[16] Y. Gao and L. Ran. 2019. Collaborative Filtering Recommendation Algorithm for Heterogeneous Data Mining in the
Internet of Things. IEEE Access 7 (2019), 123583–123591.
[17] A. Gionis, H. Mannila, and P. Tsaparas. 2007. Clustering Aggregation. ACM Transactions on Knowledge Discovery Data
1, 1 (2007), 1556–4681.
[18] Jayavardhana Gubbi, Rajkumar Buyya, Slaven Marusic, and Marimuthu Palaniswami. 2013. Internet of Things (IoT): A
vision, architectural elements, and future directions. Future Generation Computer Systems 29, 7 (2013), 1645–1660.
[19] Yuan Guo, Nan Wang, Ze-Yin Xu, and Kai Wu. 2020. The internet of things-based decision support system for
information processing in intelligent manufacturing using data mining technology. Mechanical Systems and Signal
Processing 142 (2020), 106630. https://doi.org/10.1016/j.ymssp.2020.106630
[20] Michele Ianni, Elio Masciari, Giuseppe M. Mazzeo, Mario Mezzanzanica, and Carlo Zaniolo. 2020. Fast and effective
Big Data exploration by clustering. Future Generation Computer Systems 102 (2020), 84–94. https://doi.org/10.1016/j.future.2019.07.077
[21] E. Januzaj, H-P. Kriegel, and M. Pfeifle. 2004. DBDC: Density Based Distributed Clustering. In Advances in Database
Technology - EDBT 2004. Springer Berlin Heidelberg, Berlin, Heidelberg, 88–105.
[22] Divya Joshi, Chanchal Kumari, and Abhishek Srivastava. 2016. Challenges and data mining model for IoT. International
Journal of Engineering Applied Sciences and Technology 1, 3 (2016), 2455–2143.
[23] G. Karypis. 2015. CLUTO - Software for Clustering High-Dimensional Datasets. https://tinyurl.com/pxkr8yl. [online:
accessed April 12, 2019].
[24] I. Kholod, M. Kuprianov, and I. Petukhov. 2016. Distributed data mining based on actors for Internet of Things. In
2016 5th Mediterranean Conference on Embedded Computing (MECO). IEEE Computer Society, Los Alamitos, CA, USA,
480–484.
[25] M. Klusch, S. Lodi, and G. Moro. 2003. Issues of Agent-based Distributed Data Mining. In 2003 International Joint
Conference on Autonomous Agents and Multiagent Systems (AAMAS). Association for Computing Machinery, New York,
NY, USA, 1034–1035.
[26] H. Mashayekhi, J. Habibi, T. Khalafbeigi, S. Voulgaris, and M. van Steen. 2015. GDCluster: A General Decentralized
Clustering Algorithm. IEEE Transactions on Knowledge and Data Engineering 27, 7 (2015), 1892–1905.
[28] A. C. Onal, O. Berat Sezer, M. Ozbayoglu, and E. Dogdu. 2017. Weather data analysis and sensor fault detection using
an extended IoT framework with semantics, big data, and machine learning. In 2017 IEEE International Conference on
Big Data (Big Data). IEEE Computer Society, Los Alamitos, CA, USA, 2037–2046.
[29] S. Pattar, R. Buyya, K. R. Venugopal, S. S. Iyengar, and L. M. Patnaik. 2018. Searching for the IoT Resources: Fundamentals,
Requirements, Comprehensive Review, and Future Directions. IEEE Communications Surveys & Tutorials 20, 3 (2018),
2101–2132.
[30] Haibo Peng, Qiaoshun Wu, Jie Li, and Rong Zhou. 2020. Design of Multi-layer Industrial Internet of Data Mine Network
Model Based on Edge Computation. In Cyber Security Intelligence and Analytics, Zheng Xu, Kim-Kwang Raymond
Choo, Ali Dehghantanha, Reza Parizi, and Mohammad Hammoudeh (Eds.). Springer International Publishing, Cham,
1034–1040.
[31] D. Puschmann, P. Barnaghi, and R. Tafazolli. 2017. Adaptive Clustering for Dynamic IoT Data Streams. IEEE Internet of
Things Journal 4, 1 (2017), 64–74.
[32] Quartz. 2015. Connected cars will send 25 gigabytes of data to the cloud every hour. https://qz.com/344466/. [online:
accessed April 12, 2019].
[33] H. Rahman, N. Ahmed, and M. I. Hussain. 2016. A hybrid data aggregation scheme for provisioning Quality of Service
(QoS) in Internet of Things (IoT). In 2016 Cloudification of the Internet of Things (CIoT). IEEE Computer Society, Los
Alamitos, CA, USA, 1–5.
[34] M. M. Rashid, J. Kamruzzaman, M. M. Hassan, S. Shahriar Shafin, and M. Z. A. Bhuiyan. 2020. A Survey on Behavioral
Pattern Mining From Sensor Data in Internet of Things. IEEE Access 8 (2020), 33318–33341.
[35] M. Roriz, M. Endler, M.A. Casanova, H. Lopes, F.S. Silva, and T. Hara. 2016. A Heuristic Approach for On-line Discovery
of Unidentified Spatial Clusters from Grid-Based Streaming Algorithms. In International Conference on Big Data
Analytics and Knowledge Discovery (DaWaK). Springer Berlin Heidelberg, Berlin, Heidelberg, 128–142.
[36] S. Salvador and P. Chan. 2004. Determining the number of clusters/segments in hierarchical clustering/segmentation
algorithms. In IEEE International Conference on Tools with Artificial Intelligence. IEEE Computer Society, Los Alamitos,
CA, USA, 576–584.

ACM Trans. Knowl. Discov. Data., Vol. 1, No. 1, Article 1. Publication date: January 2020.
[37] C. Savaglio, P. Gerace, G. Di Fatta, and G. Fortino. 2019. Data Mining at the IoT Edge. In 2019 28th International
Conference on Computer Communication and Networks (ICCCN). IEEE Computer Society, Los Alamitos, CA, USA, 1–6.
[38] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu. 2016. Edge Computing: Vision and Challenges. IEEE Internet of Things Journal
3, 5 (2016), 637–646.
[39] A. Singh and S. Sharma. 2017. Analysis on data mining models for Internet Of Things. In 2017 International Conference
on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC). IEEE Computer Society, Los Alamitos, CA, USA,
94–100.
[40] C. Tsai, C. Lai, M. Chiang, and L. T. Yang. 2014. Data Mining for Internet of Things: a survey. IEEE Communications
Surveys & Tutorials 16, 1 (2014), 77–97.
[41] I. Witten, E. Frank, M. Hall, and C. Pal. 2017. Data Mining: Practical Machine Learning Tools and Techniques. Morgan
Kaufmann, San Francisco - CA.
[42] H. Yu, H. Chen, S. Zhao, and Q. Shi. 2020. Distributed Soft Clustering Algorithm for IoT Based on Finite Time Average
Consensus. IEEE Internet of Things Journal (2020), 1–1.
[43] Q. Zhang, C. Zhu, L. T. Yang, Z. Chen, L. Zhao, and P. Li. 2017. An Incremental CFS Algorithm for Clustering Large
Data in Industrial Internet of Things. IEEE Transactions on Industrial Informatics 13, 3 (2017), 1193–1201.
[44] Y. Zhang, M. Chen, S. Mao, L. Hu, and V. C. M. Leung. 2014. CAP: community activity prediction based on big data
analysis. IEEE Network 28, 4 (2014), 52–57.