
Expert Systems With Applications 178 (2021) 115048


journal homepage: www.elsevier.com/locate/eswa

Spatiotemporal trajectory clustering: A clustering algorithm for spatiotemporal data
Mohd Yousuf Ansari a, Mainuddin b,*,1, Amir Ahmad c, Gopal Bhushan a
a Defence Scientific Information & Documentation Centre (DESIDOC), Defence R&D Organisation (DRDO), Metcalfe House, Delhi 110054, India
b Department of Electronics & Communication, Faculty of Engineering and Technology, Jamia Millia Islamia, New Delhi 110025, India
c College of Information Technology, United Arab Emirates University, Al Ain, United Arab Emirates

ARTICLE INFO

Keywords: Density-based clustering; Trajectory clustering; Spatiotemporal data; Co-location events

ABSTRACT

Spatial technologies generate large datasets quickly and continuously. The purpose of this study is to develop a clustering algorithm to mine spatiotemporal co-location events in trajectory datasets. We present a spatiotemporal algorithm for sub-trajectory clustering that divides a trajectory into line segments and groups these sub-trajectories on the basis of both spatial and temporal aspects by extending the DBSCAN (Density Based Spatial Clustering of Applications with Noise) algorithm. We adopt the concepts of entropy and silhouette index to validate the clusters. Experiments conducted on two different real datasets demonstrate that the proposed clustering algorithm effectively discovers optimal clusters. Furthermore, experimental results reveal hidden and useful clusters and demonstrate that the proposed algorithm outperforms the CorClustST (Correlation-based Clustering of Big Spatiotemporal Datasets) and ST-OPTICS (Spatiotemporal-Ordering Points to Identify Clustering Structure) algorithms.

1. Introduction

Currently, technological infrastructures continuously generate massive amounts of various types of raw data, such as geographic data, movement, date and time data, and in some cases related information, captured by tracking devices. These gigantic volumes of data are difficult to manage and analyze to obtain useful information for decision making (Li, Han, & Yang, 2004). Clustering techniques are helpful to extract meaningful information (Ahmad & Dey, 2007). Spatial clustering takes a static view of geospatial phenomena; however, the evolution of real-world phenomena is usually related to time. Therefore, for clustering to be applicable in real-world contexts, both space and time must be considered simultaneously (Yao, 2003).

Spatiotemporal clustering groups objects based on both spatial and temporal aspects (Ansari, Ahmad, Kahn, Bhushan, & Mainuddin, 2020), and spatiotemporal trajectory clustering is instrumental in the analysis of trajectory data (Zaghlool, ElKaffas, & Saad, 2015). Several applications, such as predicting moving objects relative to space and time and mining spatiotemporal co-location events, require a spatiotemporal trajectory clustering technique.

Several clustering algorithms have been proposed to date, such as K-means (Macqueen, 1967), DBSCAN (Ester, Kriegel, Sander, & Xu, 1996), Ordering points to identify the clustering structure (OPTICS) (Ankerst, Breunig, Kriegel, & Sander, 1999), Clustering using references and density (CURD) (Ma, Wang, Tang, Yang, & Gao, 2003), ST-DBSCAN (Birant & Kut, 2007), and ST-OPTICS (Agrawal, Garg, Sharma, & Patel, 2016). These existing algorithms can cluster point data; however, they cannot cluster trajectories. Several researchers have proposed trajectory clustering algorithms; notable among these are the algorithms proposed by Pelekis et al. (2007), Gaffney and Smyth (1999), Chudova, Gaffney, Mjolsness, and Smyth (2003), Alon, Sclaroff, Kollios, and Pavlovic (2003), Chang and Zhou (2009), Li and Wang (2010), Nanni and Pedreschi (2006), Tsumoto and Hirano (2009), and Yanagisawa and Satoh (2006). These algorithms cluster a trajectory as a whole. Clustering complete trajectories may miss local trajectory features, and similar parts of the trajectories may not be detected. In trajectory analysis, parts of the trajectories may have crucial roles, e.g., to analyze interesting regions. Moreover, matching complex and relatively long trajectories is a tedious task. Some researchers have focused on sub-trajectory clustering, such as Chang and Zhou (2009), Lee, Han, and Whang (2007), Kreveld and Luo (2007), Yuan, Xia, Zhang, Zhou, and Ji (2011) and Zhang, Lee, and Lee (2018). The approaches proposed by Lee

* Corresponding author.
E-mail address: mainuddin@jmi.ac.in (Mainuddin).
1 ORCID: 0000-0003-1811-2062.

https://doi.org/10.1016/j.eswa.2021.115048
Received 8 August 2019; Received in revised form 17 March 2021; Accepted 14 April 2021
Available online 20 April 2021
0957-4174/© 2021 Elsevier Ltd. All rights reserved.
M.Y. Ansari et al. Expert Systems With Applications 178 (2021) 115048

et al. (2007) and Kreveld and Luo (2007) treat the trajectory geometrically; however, these approaches do not consider temporal components such as time and speed. Considering only the geometrical aspects of the trajectory may not reveal several interesting patterns. In real-world applications, time and speed are crucial. For example, in the case of a hurricane, two trajectories may have the same location and shape; however, their speeds may differ. The first trajectory may move fast initially and then move slowly, whereas the second may move slowly and then accelerate. Therefore, speed needs to be considered in the clustering process. Location, direction, time, and speed influence internal and external features of a trajectory (Kreveld & Luo, 2007) and can reveal meaningful patterns of the trajectory. Lee et al. (2007) have pioneered the study of trajectory partitions. In their work, the trajectory is partitioned into basic units based on characteristic points that express the local features of the trajectory. A location at which trajectory behavior changes rapidly, e.g., changes in the speed and direction, is considered a characteristic point (Lee et al., 2007). The stops and moves describe the semantic features of a trajectory, wherein a stop can be considered a characteristic point. Therefore, characteristic points are instrumental in the exploration of spatiotemporal and semantic features of a trajectory, which inspired us to select the TRACLUS algorithm proposed by Lee et al. (2007). In this algorithm, each sub-trajectory represents a direct line between two characteristic points, thereby addressing the problem of matching complex trajectories.

The TRACLUS algorithm clusters trajectories by dividing each trajectory into sub-trajectories based on the minimum description length principle to eliminate noise and to decrease the clustering time. Furthermore, the grouping phase clusters sub-trajectories using a modified DBSCAN algorithm (Ester et al., 1996), which groups the objects into meaningful subclasses. It visits all objects and identifies the core objects, border objects, and noise objects. Then, core objects that are close to each other become a part of the same cluster. The TRACLUS algorithm (Lee et al., 2007) can predict future movements of moving objects; however, it cannot identify temporal information because it clusters line segments solely on the basis of spatial information. To predict the movement of objects relative to space and time, we extend the TRACLUS algorithm and cluster the line segments based on both spatial and temporal information.

Density-based clustering algorithms can discover arbitrarily shaped clusters and can filter noise (Birant & Kut, 2007). We observe that trajectory datasets have large numbers of outliers (i.e., noise) and line segment clusters generally exhibit an arbitrary shape.

This study aims to develop an innovative approach to mining spatiotemporal co-location events in trajectory datasets. We expect that the proposed approach will help in predicting future patterns of moving objects, identify and analyze the concentration of vehicles in terms of space and time, and improve traffic management. Our notion of the co-location event is a temporal extension of the spatial co-location defined by Shekhar and Huang (2001), which models the coexistence set based on the attribute values in a spatial neighborhood.

Our primary contribution is the development of a density-based spatiotemporal clustering algorithm, ST-TRACLUS, which is an extension of the TRACLUS algorithm. This algorithm considers both temporal and spatial information during clustering. To generate optimal clusters, we adopt the concepts of entropy and a silhouette index. We demonstrate that the proposed clustering algorithm effectively discovers optimal clusters through experiments on two real-world datasets comprising truck-position data and hurricane-track data, respectively. Experimental results indicate that the proposed algorithm can identify hidden and useful clusters and demonstrate that our algorithm outperforms the CorClustST and ST-OPTICS algorithms.

The remainder of this paper is organized as follows. Section 2 briefly reviews existing clustering algorithms, specifically density-based clustering approaches. Section 3 describes the proposed spatiotemporal trajectory clustering algorithm and related basic concepts. Section 4 provides heuristics to select the parameter values of Eps and timethreshold and describes a concept regarding the measurement of the quality of clusters. Experimental evaluation results are reported in Section 5. These results are then discussed in Section 6, and the conclusions and suggestions for future work are presented in Section 7.

2. Related work

In clustering, a set of objects is partitioned into subgroups based on a certain similarity measure without a priori knowledge about the dataset (Birant & Kut, 2007). Clustering algorithms are categorized into five main types based on the technique used to define the clusters. The primary function of a partitional algorithm is to determine a division of k clusters that optimizes the selected dividing criteria, wherein the number of clusters (k) is the input parameter. The K-means algorithm (Macqueen, 1967) is an important partitional algorithm. Hierarchical clustering algorithms produce a collection of nested clusters that form a hierarchical tree, e.g., BIRCH (Zhang, Ramakrishnan, & Livny, 1996). Grid-based methods quantize the object space into a grid structure. Examples of multiple-level grid-based clustering algorithms include STING (Wang, Yang, & Muntz, 1997) and WaveCluster (Sheikholeslami, Chatterjee, & Zhang, 2000). Model-based approaches use a model to determine the best fit of data. For example, in COBWEB (Fisher, 1987), a model is hypothesized for each cluster to determine the best fit.

The DBSCAN algorithm (Ester et al., 1996) groups objects into meaningful subclasses. In this approach, the density threshold is expressed through the maximum radius of the neighborhood (Eps) and the minimum number of objects in an Eps neighborhood of a given object (MinPts). For each core object, the DBSCAN algorithm checks for clusters around it; if the core object has no cluster label, the algorithm creates a cluster with this core object and assigns a label to this cluster. Whenever a cluster is created with a core object, all objects within the Eps neighborhood of this core object are assigned to the newly created cluster. This process continues until there are no core objects without a cluster label. DBSCAN employs spatial access methods; thus, it can process large spatial datasets (Ester et al., 1996) and can identify arbitrarily shaped clusters.

OPTICS (Ankerst et al., 1999) is based on the DBSCAN algorithm. The OPTICS method stores the processing order of the objects, and an extended DBSCAN algorithm uses this information to assign cluster membership (Ankerst et al., 1999). The OPTICS method can identify nested clusters and the structure of clusters. The difference between OPTICS and DBSCAN is related to the order in which objects are visited in the dataset.

CURD (Ma et al., 2003) captures the shape and extent of a cluster with references; it then analyzes the data based on these references. CURD can discover arbitrarily shaped clusters and is insensitive to noise data. The high efficiency of CURD makes it suitable for mining considerably large datasets. Moreover, CURD can process high-dimensional data.

Agrawal et al. (2016) developed and validated an enhanced spatiotemporal clustering algorithm, ST-OPTICS, by modifying the OPTICS algorithm. The scalable technique can identify nested and adjacent clusters, and can handle multi-dimensional data. The approach first sorts the observations and handles spatiotemporal data in which spatial and non-spatial attributes are handled using distance parameters ε1 and ε2, respectively, whereas the temporal dimension is handled by concatenating related spatial and non-spatial values while retaining the temporal neighbors. To improve visualization and analysis of the generated micro clusters, the results of the spatiotemporal clustering algorithm act as input to an agglomerative method.

Husch, Schyska, and Bremen (2020) developed CorClustST, a clustering algorithm that uses the concept of correlation for big spatiotemporal data. The algorithm employs the concept of empirical correlations of spatial neighbors over time. The algorithm can be extended for large-scale parallelization.

Pelekis et al. (2007) used trajectory characteristics such as space,


time, velocity, and direction. Gaffney and Smyth (1999) presented a clustering technique that combines models of continuous trajectories. Chudova et al. (2003) presented a group of algorithms to simultaneously align the spatial and temporal shifts of trajectories within each cluster. The Expectation-maximization (EM) algorithm is used to recover the characteristics of each curve. Alon et al. (2003) modeled trajectories in the form of successive positions using hidden Markov models. Chang and Zhou (2009) proposed a technique where each trajectory is segmented into sub-trajectories based on corner point detection. A density-based clustering technique is then used to cluster sub-trajectories by employing the Fréchet distance. Nanni and Pedreschi (2006) proposed a density-based clustering method for trajectories by adopting the OPTICS algorithm. The authors proposed two versions, i.e., Trajectory-OPTICS and its time-focused version, which they refer to as TF-OPTICS. Kreveld and Luo (2007) introduced a time-dependent relation to the shape-based trajectory analysis and proposed algorithms to compute the most similar parts of trajectories. Yuan et al. (2011) proposed a trajectory clustering algorithm that employs an index tree. This approach works in two phases: preprocessing and clustering. In the preprocessing phase, the trajectory is partitioned into segments with respect to the corner threshold value. In the clustering phase, trajectory data with a distance matrix are stored in the index tree. Then, segments are clustered based on the index tree.

Zhang et al. (2018) proposed a TRACLUS-based algorithm for spatiotemporal periodic pattern mining. The TRACLUS algorithm (Lee et al., 2007) extends DBSCAN, whereas the algorithm of Zhang et al. is based on HDBSCAN (Campello, Moulavi, & Sander, 2013). Ying, Lee, and Tseng (2013) proposed a Geographic-temporal-semantic-based location prediction approach that considers geographic locations, temporal information and the geographic consequence of locations visited by users to estimate the likelihood of a user visiting a given location. Zaghlool et al. (2015) suggested a TRACLUS-based approach to cluster sub-trajectories considering the time dimension and to analyze spatiotemporal data. Bermingham and Lee (2015) proposed a methodology for n-dimensional trajectory clustering. The authors applied the proposed approach to the TRACLUS algorithm.

The TRACLUS algorithm (Lee et al., 2007) can predict future movements of moving objects; however, it cannot determine temporal information because it clusters line segments solely on the basis of spatial information. To predict object movements relative to space and time, to identify and analyze the concentration of vehicles in space and time, and to help improve traffic management, we extend the TRACLUS algorithm and cluster the line segments based on both spatial and temporal information.

3. ST-TRACLUS algorithm and basic concepts

We developed a spatiotemporal trajectory clustering (ST-TRACLUS) algorithm using the concept of the DBSCAN algorithm (Ester et al., 1996), the TRACLUS algorithm (Lee et al., 2007), and the temporal distance. The basic concepts pertaining to the ST-TRACLUS algorithm are described in Section 3.1. We used the spatial distance from the TRACLUS algorithm, which is described in Section 3.2. The proposed temporal distance is described in Section 3.3, and the proposed ST-TRACLUS algorithm is described in Section 3.4.

3.1. Basic concepts

In this section, we provide definitions and notations for spatiotemporal density-based clustering. We extend the definitions of the line segments proposed in the TRACLUS algorithm (Lee et al., 2007). Here, assume that $DS$ represents the set of line segments.

Definition 1. The spatiotemporal neighborhood $N_{\varepsilon,\mathit{timethreshold}}(L_i)$ of a line segment $L_i \in DS$ is defined as

$$N_{\varepsilon,\mathit{timethreshold}}(L_i) = \{\, L_j \in DS \mid \mathit{spatialdist}(L_i, L_j) \le \varepsilon \ \text{and}\ \mathit{temporaldist}(L_i, L_j) \le \mathit{timethreshold} \,\}.$$

Definition 2. A line segment $L_i \in DS$ is a core line segment with respect to $\varepsilon$, timethreshold and MinLns if $|N_{\varepsilon,\mathit{timethreshold}}(L_i)| \ge \mathit{MinLns}$.

Definition 3. A line segment $L_i \in DS$ is directly density reachable from a line segment $L_j \in DS$ with respect to $\varepsilon$, timethreshold and MinLns if the following conditions are satisfied:

1) $L_i \in N_{\varepsilon,\mathit{timethreshold}}(L_j)$ and
2) $L_j$ is a core line segment.

Definition 4. A line segment $L_i \in DS$ is density reachable from a line segment $L_j \in DS$ with respect to $\varepsilon$, timethreshold and MinLns if there is a sequence of line segments $L_j, L_{j-1}, \dots, L_{i+1}, L_i \in DS$ such that $L_k$ is directly density reachable from $L_{k+1}$ with respect to $\varepsilon$, timethreshold and MinLns.

Definition 5. A line segment $L_i \in DS$ is density connected to a line segment $L_j \in DS$ with respect to $\varepsilon$, timethreshold and MinLns if there is a line segment $L_k \in DS$ such that both $L_i$ and $L_j$ are density reachable from $L_k$ with respect to $\varepsilon$, timethreshold and MinLns.

Definition 6. A cluster $C$ with respect to $\varepsilon$, timethreshold and MinLns is a non-empty subset of $DS$ satisfying the following "maximality" and "connectivity" conditions:

1) $\forall L_i, L_j$: if $L_i \in C$ and $L_j$ is density reachable from $L_i$ with respect to $\varepsilon$, timethreshold and MinLns, then $L_j \in C$. (Maximality)
2) $\forall L_i, L_j \in C$: $L_i$ is density connected to $L_j$ with respect to $\varepsilon$, timethreshold and MinLns. (Connectivity)

3.2. Spatial distance

Here, spatial distance is the geographic distance between two line segments. In this context, spatial distance comprises perpendicular ($l_{prp}$), parallel ($l_{prl}$), and angle ($d_\theta$) distances, as shown in Fig. 1. The line segments $L_i = (s_i, e_i)$ and $L_j = (s_j, e_j)$ are also shown in Fig. 1.

Fig. 1. Components of the spatial distance.

The perpendicular distance between line segments $L_i$ and $L_j$ can be computed as follows:

$$l_{prp}(L_i, L_j) = \frac{l_{prp1}^2 + l_{prp2}^2}{l_{prp1} + l_{prp2}}$$

where $l_{prp1}$ and $l_{prp2}$ are the Euclidean distances between $s_j$ and $p_s$, and $e_j$ and $p_e$, respectively.

The parallel distance between line segments $L_i$ and $L_j$ can be computed as follows:

$$l_{prl}(L_i, L_j) = \min(l_{prl1}, l_{prl2})$$

where $l_{prl1}$ and $l_{prl2}$ are the Euclidean distances between $s_i$ and $p_s$, and $e_i$ and $p_e$, respectively.

The angle distance is the difference in directional movements between two sub-trajectories. The angle distance between line segments $L_i$ and $L_j$ can be computed as follows:


Fig. 2. Different cases for maximum duration overlap.

Fig. 3. Algorithm to compute the temporal distance.

$$d_\theta(L_i, L_j) = \begin{cases} \min(\lVert L_i \rVert, \lVert L_j \rVert) \times \sin(\theta), & \text{if } 0^\circ \le \theta \le 90^\circ \\ \min(\lVert L_i \rVert, \lVert L_j \rVert) \times \sin(\pi - \theta), & \text{if } 90^\circ < \theta \le 180^\circ \end{cases}$$

The spatial distance between two line segments can be defined as follows:

$$\mathit{spatialdist}(L_i, L_j) = w_{prp} \cdot l_{prp}(L_i, L_j) + w_{prl} \cdot l_{prl}(L_i, L_j) + w_\theta \cdot d_\theta(L_i, L_j)$$

The weights $w_{prp}$, $w_{prl}$, and $w_\theta$ are determined by the application. Here, the default weights of $w_{prp}$, $w_{prl}$, and $w_\theta$ are 1.

3.3. Temporal distance

Two moving objects are said to be spatially similar when they move close to each other at a given place irrespective of time. They are


Fig. 4. ST-TRACLUS algorithm.

considered similar in both space and time when they move close to each other concurrently.

The temporal distance $TD_i$ is a measure of time similarity between two sub-trajectories. To support an almost concurrent movement, a time window $\delta$ for the tolerance of past and future was introduced by Pelekis et al. (2012). We define the temporal distance based on the idea of the Jaccard distance (Tan, Steinbach, and Kumar, 2006). To define the temporal distance, we require time points (i.e., date and time) of both the start and end of sub-trajectories. The pair of time points defines the maximum duration overlap and life spans of both sub-trajectories. Different cases that compute the maximum duration overlap are depicted in Fig. 2.

The temporal distance $TD_i(\delta) \in [0, 1]$ is defined as follows:

$$TD_i(\delta) = \left| 1 - \frac{MDO_i(\delta)}{\mathit{lifeSpan}_1 + \mathit{lifeSpan}_2 - MDO_i(\delta)} \right|$$

where $MDO_i(\delta)$ is the maximum duration overlap of the temporal period between two sub-trajectories, and $\mathit{lifeSpan}_1$ and $\mathit{lifeSpan}_2$ are the time spans of both sub-trajectories.

The maximum duration overlap is defined as follows:

$$MDO_i(\delta) = \max\{\, \mathrm{dur}([trs_1, tre_1] \cap [trs_2, tre_2]),\ \mathrm{dur}([trs_1, tre_1] \cap [trs_2, tre_2 + \delta]),\ \mathrm{dur}([trs_1, tre_1] \cap [trs_2 - \delta, tre_2]) \,\}$$

where $trs_1$, $tre_1$, $trs_2$, $tre_2$, and $\delta$ are the start time point of sub-trajectory 1, end time point of sub-trajectory 1, start time point of sub-trajectory 2, end time point of sub-trajectory 2, and time window, respectively.

We develop an algorithm, shown in Fig. 3, to compute the temporal distance between two line segments. Initially, we calculate the life spans of both sub-trajectories and verify whether sub-trajectory 2 partially or fully overlaps sub-trajectory 1 in the temporal dimension. If the overlap is either partial or full, we compute the maximum duration overlap (MDO) (lines 1–10). Then, we verify whether sub-trajectory 1 partially or fully overlaps with sub-trajectory 2 in the temporal dimension. If the overlap is either partial or full, we compute the MDO (lines 11–19). Finally, using the MDO and life spans of both sub-trajectories, we compute the temporal distance. The temporal distance value can lie between 0 and 1.

If both sub-trajectories are identical in terms of temporal dimension, the value of the temporal distance is assumed to be 0. A temporal distance of 1 indicates that there is no temporal similarity between the sub-trajectories. Our computation of the spatial distance is based on the TRACLUS algorithm.

3.4. ST-TRACLUS algorithm

The proposed ST-TRACLUS algorithm is based on the concept of the partition-and-group framework proposed by Lee et al. (2007). The algorithm is shown in Fig. 4. Two algorithms are executed to perform the required subtasks (lines 1–4 and line 5).

We adopt the partitioning phase of the original TRACLUS algorithm without any modifications. To include the spatiotemporal element, we employ the MDO-based temporal distance concept (Pelekis et al., 2012) along with the parallel, perpendicular, and angular distances of the TRACLUS algorithm (Lee et al., 2007).

In the grouping phase, the proposed algorithm uses a modified DBSCAN algorithm (Fig. 5) to cluster sub-trajectories based on spatial and temporal dimensions. Here, the input parameters are the maximum radius of neighborhood (Eps), minimum number of lines (MinLns), timethreshold, and time window (Delta).

The proposed line-segment clustering algorithm is listed in Fig. 5. Here, we incorporate the temporal aspect into the grouping phase of the TRACLUS algorithm, which is an extension of the DBSCAN algorithm. We initialize a cluster list and mark all line segments as unvisited (lines 1 and 2). As the algorithm progresses, the line segments are marked as visited and become members of a cluster. This is a two-step algorithm. In the first step (lines 3–13), the algorithm computes the spatiotemporal neighborhood of an unvisited line segment $L$. If $L$ is designated a core line segment (lines 7–11), then the algorithm executes the second step to build the cluster (line 9). At this point, the cluster contains only $N_{\varepsilon,\mathit{timethreshold}}(L)$. The buildCluster() function (lines 24–37) directly computes density-reachable line segments (line 29). If these are core line segments, then they are added to the current cluster (line 31). The spatial distance function (line 18) has been implemented by adopting the concept of the parallel, perpendicular, and angular distances proposed in the TRACLUS algorithm (Lee et al., 2007).

The time complexity of the DBSCAN algorithm is $O(n^2)$, where $n$ is the number of objects in the database. If we use an indexing mechanism, then the time complexity is expressed as $O(n \log n)$. Our algorithm is an extension of the DBSCAN algorithm, and our modifications do not change the time complexity of the algorithm. Therefore, if we scan all line segments in the database and do not use an indexing mechanism, then the complexity of our algorithm is $O(n^2)$. An indexing mechanism, such as the R-tree (Guttman, 1984), enables the line segments in the Eps neighborhood to be located quickly. The use of an appropriate indexing mechanism will reduce the time complexity of our algorithm to $O(n \log n)$.
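
As a rough illustration of the spatial distance of Section 3.2 and the temporal distance of Section 3.3, the two computations can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' Java implementation: the function names, the representation of a line segment as a pair of (x, y) points, the use of plain numeric timestamps (e.g., seconds), and the TRACLUS convention that the longer segment plays the role of L_i with p_s and p_e taken as the projections of s_j and e_j onto it are all choices made for this example. Segments are assumed to have non-zero length.

```python
import math


def _project(p, s, e):
    """Project point p onto the infinite line through s and e."""
    dx, dy = e[0] - s[0], e[1] - s[1]
    t = ((p[0] - s[0]) * dx + (p[1] - s[1]) * dy) / (dx * dx + dy * dy)
    return (s[0] + t * dx, s[1] + t * dy)


def spatial_dist(seg_a, seg_b, w_prp=1.0, w_prl=1.0, w_theta=1.0):
    """Weighted sum of perpendicular, parallel, and angle distances."""
    (si, ei), (sj, ej) = seg_a, seg_b
    # Assumption (TRACLUS convention): the longer segment is treated as L_i.
    if math.dist(si, ei) < math.dist(sj, ej):
        (si, ei), (sj, ej) = (sj, ej), (si, ei)
    ps, pe = _project(sj, si, ei), _project(ej, si, ei)

    # Perpendicular distance: (l1^2 + l2^2) / (l1 + l2).
    l1, l2 = math.dist(sj, ps), math.dist(ej, pe)
    d_prp = 0.0 if l1 + l2 == 0 else (l1 ** 2 + l2 ** 2) / (l1 + l2)

    # Parallel distance: min of the distances s_i..p_s and e_i..p_e.
    d_prl = min(math.dist(si, ps), math.dist(ei, pe))

    # Angle distance: sin(pi - theta) equals sin(theta), so both branches
    # of the piecewise definition give min(|L_i|, |L_j|) * sin(theta).
    vi = (ei[0] - si[0], ei[1] - si[1])
    vj = (ej[0] - sj[0], ej[1] - sj[1])
    len_i, len_j = math.hypot(*vi), math.hypot(*vj)
    sin_theta = abs(vi[0] * vj[1] - vi[1] * vj[0]) / (len_i * len_j)
    d_theta = min(len_i, len_j) * sin_theta

    return w_prp * d_prp + w_prl * d_prl + w_theta * d_theta


def _overlap(a_start, a_end, b_start, b_end):
    """Duration of the intersection of two closed time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def max_duration_overlap(trs1, tre1, trs2, tre2, delta=0.0):
    """MDO: best overlap when sub-trajectory 2 is stretched by delta."""
    return max(
        _overlap(trs1, tre1, trs2, tre2),
        _overlap(trs1, tre1, trs2, tre2 + delta),
        _overlap(trs1, tre1, trs2 - delta, tre2),
    )


def temporal_dist(trs1, tre1, trs2, tre2, delta=0.0):
    """Jaccard-style temporal distance |1 - MDO / (ls1 + ls2 - MDO)|."""
    mdo = max_duration_overlap(trs1, tre1, trs2, tre2, delta)
    denom = (tre1 - trs1) + (tre2 - trs2) - mdo
    if denom == 0:
        # Assumption: zero-length life spans are not covered by the text.
        raise ValueError("degenerate life spans")
    return abs(1.0 - mdo / denom)
```

For two parallel segments `((0, 0), (10, 0))` and `((2, 3), (6, 3))`, the sketch gives a perpendicular distance of 3, a parallel distance of 2, and an angle distance of 0. Two life spans that coincide exactly give a temporal distance of 0, and two disjoint life spans (with no time window) give 1, matching the interpretation stated in Section 3.3.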


Fig. 5. Line Segment Spatiotemporal Clustering Algorithm.
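
The grouping phase described in Section 3.4 can be approximated by the following Python sketch. It is a simplified, brute-force illustration under stated assumptions, not a transcription of Fig. 5: the names `st_neighborhood` and `st_cluster`, the `NOISE` label, and the two distance callables passed in as arguments are hypothetical, and any step of Fig. 5 that is not described in the text is omitted. Without an index, each neighborhood query scans all segments, which corresponds to the quadratic case discussed in Section 3.4.

```python
from collections import deque

NOISE = -1  # hypothetical label for segments that join no cluster


def st_neighborhood(i, segments, eps, time_threshold,
                    spatial_dist, temporal_dist):
    """Indices of segments within eps AND time_threshold of segment i."""
    return [
        j for j, other in enumerate(segments)
        if spatial_dist(segments[i], other) <= eps
        and temporal_dist(segments[i], other) <= time_threshold
    ]


def st_cluster(segments, eps, time_threshold, min_lns,
               spatial_dist, temporal_dist):
    """DBSCAN-style grouping on the spatiotemporal neighborhood."""
    labels = [None] * len(segments)  # None means "unvisited"
    cluster_id = 0
    for i in range(len(segments)):
        if labels[i] is not None:
            continue
        neighbors = st_neighborhood(i, segments, eps, time_threshold,
                                    spatial_dist, temporal_dist)
        if len(neighbors) < min_lns:
            labels[i] = NOISE  # may later be absorbed as a border segment
            continue
        # Segment i is a core line segment: start a new cluster and expand it.
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border segment, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            reachable = st_neighborhood(j, segments, eps, time_threshold,
                                        spatial_dist, temporal_dist)
            if len(reachable) >= min_lns:
                queue.extend(reachable)  # j is also core: keep growing
        cluster_id += 1
    return labels
```

In a toy setting where each "segment" is reduced to a pair (position, time), segments that are close in position but far apart in time end up in different clusters, which is the behavior the temporal threshold is meant to produce.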

4. Selection of parameter values and quality measure

To select the values of parameters Eps and timethreshold, we adopt the entropy theory and a silhouette index (Rousseeuw, 1987). Entropy is crucial in information theory as a measure of information, choice, and uncertainty.

When Eps and timethreshold are extremely small, $|N_{\varepsilon,\mathit{timethreshold}}(L)|$ approaches 1 for nearly all line segments. If Eps and timethreshold are extremely large, then $|N_{\varepsilon,\mathit{timethreshold}}(L)|$ becomes the total number of line segments for almost all line segments. As per the entropy theory, the entropy becomes maximum for equally likely outcomes.


Fig. 6. Entropy for truck-position data for ST-TRACLUS.

As a result, in the above cases, entropy will be maximum. The tendency of $|N_{\varepsilon,\mathit{timethreshold}}(L)|$ is skewed in nature for good clustering, which makes the entropy minimum. Equation (1) is used to compute the values of Eps and timethreshold that minimize $H(X)$.

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) \qquad (1)$$

where

$$p(x_i) = \frac{|N_{\varepsilon,\mathit{timethreshold}}(x_i)|}{\sum_{j=1}^{n} |N_{\varepsilon,\mathit{timethreshold}}(x_j)|}$$

To select the value of parameter MinLns, the average of $|N_{\varepsilon,\mathit{timethreshold}}(L)|$ at Eps and timethreshold is computed. The optimal value of MinLns is determined between $\mathrm{avg}(|N_{\varepsilon,\mathit{timethreshold}}(L)|) + 1$ and $\mathrm{avg}(|N_{\varepsilon,\mathit{timethreshold}}(L)|) + 3$.

A cluster quality measure, i.e., the Sum of Squared Error and a noise penalty to penalize incorrectly classified noise (Lee et al., 2007), is adopted.

$$\text{Quality Measure} = \text{Total SSE} + \text{Noise Penalty} = \sum_{i=1}^{\mathit{number\ of\ clusters}} \left( \frac{1}{2|C_i|} \sum_{X \in C_i} \sum_{Y \in C_i} \mathit{Spatiotemporaldist}(X, Y)^2 \right) + \frac{1}{2|N|} \sum_{P \in N} \sum_{Q \in N} \mathit{Spatiotemporaldist}(P, Q)^2 \qquad (2)$$

Here, $N$ represents the set of all noise line segments.

The silhouette index (Rousseeuw, 1987) is based on a measure of cohesion and a measure of separation. Cohesion is how similar a line segment is to its own cluster, whereas separation is how dissimilar a line segment is to all other line segments belonging to other clusters. Silhouette index values range from –1 to +1. A larger value indicates quality clusters. If most of the line segments have higher values, then the quality of the cluster is high. If many line segments have lower or negative values, then the cluster quality is low.

5. Experimental evaluation

The proposed spatiotemporal clustering algorithm is evaluated on a truck-position dataset¹ and a hurricane-track dataset². The experimental setting and environment are described in Section 5.1. The results of real-world datasets of truck positions and hurricane tracks are presented in Sections 5.2 and 5.3, respectively.

5.1. Experimental setting

Herein, we use two real-world datasets pertaining to truck positions and hurricane tracks.

The truck-position dataset pertains to the positions of 50 trucks transporting concrete in Athens, Greece during the period from August to September 2002. The dataset contains 112,300 position records. Each record comprises the object identifier, trajectory identifier, date (dd/mm/yyyy), time (hh:mm:ss), and geographical coordinates in the WGS84 and GGRS87 datums. The temporal interval is 30 s, which is the sampling rate of the GPS devices on the trucks.

When the temporal gap between two records is greater than or equal to 900 s, we split the trajectory. By adopting this approach, we have identified 2314 un-partitioned trajectories from the raw truck-position dataset. The trajectories were partitioned by adopting the first phase of the TRACLUS algorithm; 98,285 sub-trajectories were found.

The hurricane-track dataset contains hurricane data for the Tropical Atlantic region of North America. The dataset is called Best Track. We use a dataset that contains data for hurricanes that occurred from 1950 to 2013. Each record comprises the date, name, latitude, longitude, maximum sustained surface wind, minimum sea level pressure, and STAT at an interval of 6 h. We preprocess the raw dataset and extract latitude, longitude, and time to form trajectories. The dataset contains 913 un-partitioned trajectories and has 24,488 points. We partitioned the trajectories by adopting the first phase of the TRACLUS algorithm and found 23,238 sub-trajectories.

To apply the parallel, perpendicular, and angular distances of the TRACLUS algorithm (Lee et al., 2007), we converted geodetic WGS84 coordinates (latitude, longitude) to earth-centered-earth-fixed Cartesian coordinates (x, y) (Snyder, 1987).

We attempted to measure the quality of clusters by varying the spatial and temporal thresholds, i.e., Eps, MinLns, and timethreshold. The adopted quality measures are the sum of the total sum of squared errors and the noise penalty, represented by Eq. (2), and the silhouette index.

The experiments were conducted on a PC with an i3 processor with a 3.2 GHz CPU and 4 GB RAM running Windows 7 OS. The algorithm is implemented in Java using NetBeans IDE 8.0.2. To visualize clusters, the GeoTools 14.2 libraries ("GeoTools: GeoTools 14.2 Released.") along with Styled Layer Descriptor ("Styled Layer Descriptor | OGC") are used. We use PostgreSQL 9.3.4 with PostGIS spatial Extension 2.2.3 ("PostgreSQL 9.6.2, 9.5.6 Released") to store the geographical coordinates.
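
The entropy heuristic of Section 4 (Eq. (1)) and the MinLns range derived from the average neighborhood size can be sketched as follows. This is an illustrative Python sketch rather than the authors' code: the function names and the `sizes_for` callback, which is assumed to return the list of neighborhood sizes |N(L)| for a candidate (Eps, timethreshold) pair, are hypothetical. Each neighborhood is assumed to contain at least the segment itself, so all sizes are positive.

```python
import math


def neighborhood_entropy(sizes):
    """H(X) = -sum p(x_i) log2 p(x_i), with p(x_i) = |N(x_i)| / sum_j |N(x_j)|."""
    total = sum(sizes)
    probs = [s / total for s in sizes if s > 0]
    return -sum(p * math.log2(p) for p in probs)


def min_lns_range(sizes):
    """Suggested MinLns interval: avg(|N(L)|) + 1 up to avg(|N(L)|) + 3."""
    avg = sum(sizes) / len(sizes)
    return avg + 1, avg + 3


def pick_parameters(candidates, sizes_for):
    """Return the (eps, timethreshold) candidate with minimum entropy.

    sizes_for(eps, timethreshold) is a hypothetical callback that returns
    the neighborhood sizes of all line segments for that candidate.
    """
    return min(candidates,
               key=lambda c: neighborhood_entropy(sizes_for(*c)))
```

With uniform neighborhood sizes such as `[1, 1, 1, 1]` the entropy takes its maximum value log2(4) = 2, whereas a skewed distribution such as `[4, 2, 1, 1]` gives 1.75, which is why the minimum-entropy setting is preferred.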


Fig. 7. Quality measure for truck-position data for ST-TRACLUS.
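
For reference, the quality measure of Eq. (2) that is plotted in this figure can be sketched in Python as follows. The sketch assumes that clusters and noise are given as lists of line segments and that `dist` is a caller-supplied spatiotemporal distance; the text does not spell out how the spatial and temporal components are combined into that distance, so the callable is a placeholder and the function name is hypothetical.

```python
def q_measure(clusters, noise, dist):
    """Total SSE over clusters plus the noise penalty, following Eq. (2).

    clusters: list of clusters, each a list of line segments
    noise:    list of line segments that belong to no cluster
    dist:     placeholder for the spatiotemporal distance between segments
    """
    def half_mean_squared(group):
        # (1 / (2|G|)) * sum over all ordered pairs of dist(x, y)^2
        if not group:
            return 0.0
        total = sum(dist(x, y) ** 2 for x in group for y in group)
        return total / (2 * len(group))

    total_sse = sum(half_mean_squared(c) for c in clusters)
    noise_penalty = half_mean_squared(noise)
    return total_sse + noise_penalty
```

Smaller values indicate tighter clusters and fewer widely scattered noise segments.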

Fig. 8. Truck-position data clustering results for ST-TRACLUS.

Fig. 9. Entropy for truck-position data for TRACLUS.


Fig. 10. Quality measure for truck-position data for TRACLUS.

5.2. Truck-position dataset results

In this section, the proposed algorithm is evaluated on the truck-position dataset. To obtain optimal values of the parameters, we use the concept of entropy and the silhouette index. The application of the TRACLUS, CorClustST, and ST-OPTICS algorithms to the truck-position dataset is covered in the following sections.

5.2.1. Selection of parameter values using entropy theory

To obtain appropriate parameter values for Eps and timethreshold, we employ various values and record the corresponding entropy, which is depicted in Fig. 6. The minimum value of entropy (16.3295) is obtained at Eps = 95 and timethreshold = 0.9. At this point, $\mathrm{Avg}(|N_{\varepsilon,\mathit{timethreshold}}(L)|)$ is 1.91. Smaller entropy produces higher-quality clusters; therefore, we try to keep our parameter values close to Eps = 95, timethreshold = 0.9, and MinLns 2–4.

We compute the quality by varying Eps and MinLns and maintain

Fig. 11. Clustering results of truck-position data for TRACLUS.


Fig. 12. Silhouette index for truck-position data for ST-TRACLUS.

timethreshold at 0.9, as depicted in Fig. 7. Smaller QMeasure values indicate better clustering quality. The minimum QMeasure is obtained at Eps = 320 and MinLns = 2, as shown in Fig. 7. The clusters formed using parameter values Eps = 320, timethreshold = 0.9, and MinLns = 2 are shown in Fig. 8.
In Fig. 8, the black lines represent the sub-trajectories that are not part of any cluster. We obtained 7944 clusters of sub-trajectories, which are represented by different colored lines. Of the 7944 clusters, 71 had more than 100 sub-trajectories. The highest density cluster (cluster1) had 1383 line segments, and the average segment time was 61.83 s. The next highest density cluster (cluster2) had 1193 line segments, and the average segment time was 45.86. Clusters 3 and 4 had 612 and 527 line segments, respectively, and average segment times of 48.04 and 47.53, respectively. The remaining clusters had fewer than 500 line segments. The different clusters indicate the concentration of trucks in various parts of the city and also reveal the average time required to cross these cluster areas.
The TRACLUS algorithm was also applied to the truck-position dataset. We varied Eps and recorded the corresponding entropy, as shown in Fig. 9. The minimum entropy (14.0569) was obtained at Eps = 10. Lower entropy values produce higher-quality clusters. Consequently, we kept our parameter values close to Eps = 10 and MinLns = 33–35. We compute the quality metric by varying Eps and MinLns. The minimum QMeasure is obtained at Eps = 70 and MinLns = 33, as shown in Fig. 10. The resultant clusters for Eps = 70 and MinLns = 33 are visualized in Fig. 11. Here, the black lines represent the sub-trajectories that are not part of any cluster. We obtained 70 clusters of sub-trajectories, which are represented by different colored lines. Of the 70 clusters, 32 had more than 100 sub-trajectories. The highest density cluster (cluster1) had 47,327 line segments (red lines), and the next highest density cluster (cluster2) had 19,828 line segments (green lines). Clusters 3, 4, 5, and 6 contain 2861, 1753, 547, and 542 line segments, respectively. The remaining clusters had fewer than 500 line segments.

5.2.2. Selection of parameter values using silhouette index

A large silhouette value indicates a high-quality cluster. If most line segments have larger values, then the cluster quality is high. If many line segments have smaller or negative values, then the cluster quality is low. If we plot the number of negative silhouette values on the Y-axis against Eps on the X-axis, a lower value on the Y-axis indicates better silhouette quality. To obtain optimal parameter values (Eps and MinLns) while keeping timethreshold = 0.9, we varied Eps and MinLns and recorded the corresponding number of negative values, as shown in Fig. 12.
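The selection criterion described above, counting the line segments whose silhouette value is negative for each parameter setting, can be sketched as follows. This is a minimal illustration under stated assumptions: a precomputed distance matrix is used, noise segments are assumed to have been removed beforehand, and the 1-D toy points, the labels, and the function names are hypothetical rather than taken from the paper.

```python
def silhouette_values(dist_matrix, labels):
    """Silhouette value s(i) = (b - a) / max(a, b) for every item."""
    n = len(labels)
    values = []
    for i in range(n):
        per_cluster = {}
        for j in range(n):
            if j != i:
                per_cluster.setdefault(labels[j], []).append(dist_matrix[i][j])
        own = per_cluster.pop(labels[i], [])
        if not own or not per_cluster:
            values.append(0.0)  # singleton cluster, or only one cluster overall
            continue
        a = sum(own) / len(own)  # mean distance to the item's own cluster
        b = min(sum(d) / len(d) for d in per_cluster.values())  # nearest other cluster
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


def count_negative(values):
    """Number of items that sit closer to another cluster than to their own."""
    return sum(1 for s in values if s < 0)


# Toy example: the point at 6.0 is assigned to cluster 0 but lies nearer cluster 1.
points = [0.0, 1.0, 10.0, 11.0, 6.0]
labels = [0, 0, 1, 1, 0]
dist_matrix = [[abs(p - q) for q in points] for p in points]
values = silhouette_values(dist_matrix, labels)
print(count_negative(values))  # 1
```

Repeating this count over a grid of Eps and MinLns values and keeping the setting with the smallest count mirrors the procedure behind Fig. 12.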

Fig. 13. Clustering results of truck-position data for ST-TRACLUS using silhouette index.


Fig. 14. Silhouette index for truck-position data for TRACLUS.

Fig. 15. Clustering results of truck-position data for TRACLUS using silhouette index.

The minimum number of negative values (227) is obtained at Eps = 200 and MinLns = 3 (Fig. 12). The resultant clusters formed using parameter values Eps = 200, timethreshold = 0.9, and MinLns = 3 are shown in Fig. 13. We obtained 8451 clusters of sub-trajectories, which are represented by different colored lines in Fig. 13. Of the 8451 clusters, 19 had more than 100 sub-trajectories. The highest density cluster (cluster1) had 324 line segments, and the average segment time was 102.7. The

Fig. 16. Number of clusters of truck-position data for different Eps.


Fig. 17. Clustering results of truck-position data for CorClustST.

next highest density cluster (cluster2) had 229 line segments, and the average time was 92.75. Clusters 3, 4, and 5 contain 210, 175, and 157 line segments, respectively, and the average segment times were 89.42, 92.57, and 127.83, respectively. The remaining clusters had fewer than 150 line segments.
The TRACLUS algorithm was also applied to the truck-position dataset. To obtain optimal parameter values, we varied Eps and MinLns and recorded the corresponding number of negative values. The minimum number of negative values (310) was obtained at Eps = 80 and MinLns = 34, as shown in Fig. 14. The resultant clusters formed using parameter values Eps = 80 and MinLns = 34 are shown in Fig. 15. We obtained 73 clusters of sub-trajectories, which are represented by different colored lines. Of the 73 clusters, 25 had more than 100 sub-trajectories. The highest density cluster (cluster1) had 72,739 line segments (red lines), and the next highest density cluster (cluster2) had 1754 line segments (green lines). Clusters 3, 4, 5, and 6 contain 570, 448, 432, and 410 line segments, respectively. The remaining clusters had fewer than 400 line segments.

5.2.3. Application of CorClustST algorithm

The CorClustST algorithm (Hüsch et al., 2020) was developed for points. To apply the algorithm to the truck-position dataset, we modified it to work on line segments. Pearson’s sample correlation, rho (Pearson, 1895), between the time series of two spatial objects is used to compute spatiotemporal neighbors. Initially, we selected an appropriate Eps to compute spatial neighbors such that a sufficient number of neighbors was considered and the computation time was reasonable. Spatial neighbors having correlations greater than or equal to a predefined value of rho become spatiotemporal neighbors. A rho value of 0.7 can be used for moderate correlation, and a value of 0.9 can be used for high correlation. To select appropriate parameter values, we varied the Eps value and recorded the number of clusters for rho = 0.7 and rho = 0.9, as shown in Fig. 16. The number of clusters becomes stable for larger values of Eps; therefore, we selected Eps as 35000. The number of

Fig. 18. Clustering results of truck-position data for ST-OPTICS.


Fig. 19. Entropy for hurricane-track data for ST-TRACLUS.

Fig. 20. Quality measure for hurricane-track data for ST-TRACLUS.

Fig. 21. Clustering results of hurricane-track data for ST-TRACLUS.


Fig. 22. Entropy for hurricane-track data for TRACLUS.

clusters for rho = 0.9 was large, and the clusters were very small. Very small clusters may be noisy regions. The number of clusters for rho = 0.7 was not as large as for rho = 0.9, and the cluster sizes were reasonable; therefore, we selected 0.7 as the rho value. We executed the algorithm with the aforementioned parameter values and obtained 8221 clusters of various sizes. The clusters formed using parameter values Eps = 35000 and rho = 0.7 are shown in Fig. 17. The first three highest density clusters had 26 line segments, and the next four highest density clusters had 25 line segments. Fifty-five clusters had more than 20 line segments.

5.2.4. Application of ST-OPTICS algorithm

The ST-OPTICS algorithm (Agrawal et al., 2016) was developed for points. To apply the algorithm to the truck-position dataset, we modified it to work on line segments. We computed MinLns by taking the natural log of the number of line segments. The spatial radius Eps1 and temporal radius Eps2 are computed depending on the MinLns value by using a sorted k-dist graph (Ester et al., 1996). To obtain optimal clusters, we selected MinLns = 11, Eps1 = 822, and Eps2 = 3271 based on the aforementioned heuristics. In addition, the selected values for the minimum reachability distances (Min_RD1 and Min_RD2) and maximum core distances (Max_CD1 and Max_CD2) were 6.63466, 54, 3708.96, and 3271, respectively. Based on these parameter values, we obtained 6554 clusters of various sizes. The clusters formed using the aforementioned parameter values are shown in Fig. 18. The two highest density clusters had 20 line segments, one cluster had 19 line segments, and eighteen clusters had 18 line segments. The ST-OPTICS algorithm produces a large number of small clusters.

5.3. Hurricane-track dataset results

In this section, we evaluated the proposed algorithm on the hurricane-track dataset. To obtain optimal parameter values, we used the concept of entropy and the silhouette index. The application of the TRACLUS, CorClustST, and ST-OPTICS algorithms to the hurricane-track dataset is covered in the following sections.

5.3.1. Selection of parameter values using entropy theory

To obtain the appropriate parameter values (Eps and timethreshold), we varied Eps and timethreshold and recorded the corresponding entropy, as shown in Fig. 19. The minimum entropy (13.9026) was obtained at Eps = 50000 and timethreshold = 0.1. At this point, $\operatorname{avg}(|N_{\varepsilon,\,\mathit{timethreshold}}(L)|)$ was 7.53. Smaller entropy values produce higher-quality clusters. Consequently, we attempted to keep our parameter values close to Eps = 50000, timethreshold = 0.1, and MinLns = 8–10.
We computed the quality measure by varying the Eps and MinLns values while maintaining timethreshold at 0.1, as shown in Fig. 20. The minimum QMeasure was obtained at Eps = 45000 and MinLns = 8. The resultant clusters formed using parameter values Eps = 45000,

Fig. 23. Quality measure for hurricane-track data for TRACLUS.


Fig. 24. Clustering results of hurricane-track data for TRACLUS.

Fig. 25. Silhouette index for hurricane-track data for ST-TRACLUS.

timethreshold = 0.1, and MinLns = 8 are shown in Fig. 21.
In Fig. 21, the black lines represent the sub-trajectories that are not part of any cluster. We obtained 87 clusters of sub-trajectories, which are represented by different colored lines. The highest density cluster (cluster1) had 7879 line segments with an average segment time of 3.05 days (green lines). The next highest density cluster (cluster2) had 528 line segments with an average segment time of 1.1 days (red lines). Cluster 3 had 517 line segments with an average segment time of 1.14 days (blue lines). Clusters 4, 5, 6, and 7 contain 229, 133, 123, and 84 line segments, respectively. For these clusters, the average segment times were 1.08, 0.76, 0.93, and 0.7 days, respectively. The remaining clusters had fewer than 73 line segments.
As shown in Fig. 21, cluster 1 had the maximum line segment density with an average segment time of 3.05 days. This indicates that the most likely passage of a hurricane is through cluster 1 (green lines). It can also be predicted that the hurricane will most probably arrive in approximately 3.05 days from its start time. Furthermore, it also appears that cluster 1 is spatially distant from all other clusters.
We also applied the TRACLUS algorithm to the hurricane-track dataset. To obtain the optimal Eps parameter value, we varied it and recorded the corresponding entropy, as shown in Fig. 22. The minimum entropy (13.951) was obtained at Eps = 45000. Smaller entropy values produce higher-quality clusters. Thus, we tried to keep our parameter values close to Eps = 45000 and MinLns = 7–9.
We computed the quality measure by varying Eps and MinLns. As shown in Fig. 23, the minimum QMeasure was obtained at Eps = 45000 and MinLns = 7. The resultant clusters formed using parameter values Eps = 45000 and MinLns = 7 are shown in Fig. 24. Here, the black lines represent the sub-trajectories that are not part of any cluster. We obtained 56 clusters of sub-trajectories, which are represented by different colored lines. The highest density cluster (cluster1) had 11,102 line segments (green lines), and the next highest density cluster (cluster2) had 328 line segments. Clusters 3, 4, and 5 had 99, 60, and 49 line segments, respectively. The remaining clusters had fewer than 45 line segments.

5.3.2. Selection of parameter values using silhouette index

We also applied the silhouette index to the hurricane-track data. To obtain optimal parameter values (Eps and MinLns) while keeping timethreshold = 0.1, we varied Eps and MinLns and recorded the


Fig. 26. Clustering results of hurricane-track data for ST-TRACLUS using silhouette index.

Fig. 27. Silhouette index for hurricane-track data for TRACLUS.

corresponding number of negative values. The minimum number of negative values (25) was obtained at Eps = 35000 and MinLns = 10, as shown in Fig. 25. The resultant clusters formed using parameter values Eps = 35000, timethreshold = 0.1, and MinLns = 10 are shown in Fig. 26. We obtained 46 clusters of sub-trajectories (represented by different colored lines). The highest density cluster (cluster1) had 3946 line segments with an average segment time of 2.99 days (red lines), and the next highest density cluster (cluster2) had 557 line segments with an average segment time of 1.39 days (orange lines). Cluster 3 had 347 line segments with an average segment time of 1.33 days (green lines). The remaining clusters had fewer than 40 line segments.
We also applied the TRACLUS algorithm to the hurricane-track data. To obtain optimal parameter values, we varied Eps and MinLns and recorded the corresponding number of negative values. The minimum number of negative values (335) was obtained at Eps = 35000 and MinLns = 8, as shown in Fig. 27. The resultant clusters formed using parameter values Eps = 35000 and MinLns = 8 are shown in Fig. 28. We obtained 81 clusters of sub-trajectories, which are represented by different colored lines in Fig. 28. The highest density cluster (cluster1) had 4417 line segments (green lines), and the next highest density cluster (cluster2) had 1460 line segments (red lines). The remaining clusters had fewer than 100 line segments.

5.3.3. Application of CorClustST algorithm

The CorClustST algorithm (Hüsch et al., 2020) was developed for points. To apply the algorithm to the hurricane-track dataset, we modified it to work on line segments. To select appropriate parameter values, we varied the Eps value and recorded the number of clusters for rho = 0.7 and rho = 0.9, as shown in Fig. 29. The number of clusters becomes stable for larger values of Eps; therefore, we selected Eps as 1155000. The number of clusters for both values of rho (0.7 and 0.9) was almost the same; therefore, we selected 0.7 as the rho value. We executed the algorithm with the aforementioned parameter values and obtained 19 clusters of various sizes. The clusters formed using parameter values Eps = 1155000 and rho = 0.7 are shown in Fig. 30. The highest density cluster (cluster1) had 7419 line segments, and the next highest density cluster (cluster2) had 3878 line segments. Ten clusters had more than 100 line segments.

5.3.4. Application of ST-OPTICS algorithm

The ST-OPTICS algorithm (Agrawal et al., 2016) was developed for points. To apply the algorithm to the hurricane-track dataset, we modified


Fig. 28. Clustering results of hurricane-track data for TRACLUS using the silhouette index.

Fig. 29. Number of clusters of hurricane-track data for different Eps.

the algorithm to work on the line segments. To obtain quality clusters, the parameters MinLns = 10, Eps1 = 378000, and Eps2 = 27 were selected based on the natural log and a sorted k-dist graph (Ester et al., 1996). The minimum reachability distances (Min_RD1 and Min_RD2) and maximum core distances (Max_CD1 and Max_CD2) were 166557, 0, 1174650, and 24, respectively. Using these parameter values, we obtained 65 clusters of various sizes. The clusters formed using the aforementioned parameter values are shown in Fig. 31. The highest density cluster (cluster1) had 3208 line segments, and the next highest density cluster (cluster2) had 3122 line segments. Nineteen clusters had more than 100 line segments.

6. Discussion

Based on an analysis of the quality results obtained from the experiments conducted using the truck-position dataset, we make the following observations. For both quality measures, cluster 1 had the highest number of occurrences of spatiotemporal co-location events of trucks, which represents the highest level of congestion and indicates the likelihood of the arrival of the next truck. When we use QMeasure, the number of sub-trajectories in clusters 3 and 4 is almost half that of cluster 2; however, the average segment time of clusters 3 and 4 is greater than that of cluster 2, which indicates that the road conditions in the cluster 2 area are probably better than those in the areas of clusters 3 and 4. When using the silhouette index as the cluster quality measure, the number of sub-trajectories in cluster 5 is less than in clusters 2, 3, and 4; however, the average segment time of cluster 5 is greater than that of clusters 2, 3, and 4, which indicates that road conditions in the cluster 5 area are not good. Similar conclusions can be drawn by analyzing the rest of the clusters.
Trucks produce air and noise pollution and potentially hinder the movement of public transportation. The above analysis could also be helpful for the evolution of policy for transportation companies and for traffic management. This approach can also be applied to logistic support services.
If we consider the experimental results for the hurricane-track


Fig. 30. Clustering results of hurricane-track data for CorClustST.

Fig. 31. Clustering results of hurricane-track data for ST-OPTICS.

dataset by using both quality measures, the following observations can be made. When QMeasure is used, the density of cluster 1 is 7879 line segments and the average segment time is 3.05 days. Cluster 1 spreads from latitude and longitude 23.25 and –99.5 to 10.8 and –17.3, respectively, which covers a lengthy distance (approximately 8750 km). This indicates that the hurricane passes through this cluster at extremely high speed because it covers a lengthy distance in 3.05 days. The density of cluster 2 is 528 line segments with an average segment time of 1.1 days. Cluster 2 spreads from latitude and longitude 27.93 and –84.47 to 34.80 and –72.45, respectively, which covers approximately 1370 km. When we use the silhouette index as the cluster quality measure, cluster 1 had 3946 line segments, an average segment time of 2.99 days, and covered a lengthy distance. Cluster 2 had 557 line segments, an average segment time of 1.39 days, and covered a short distance. After analyzing the results using both quality measures, we conclude that the hurricane passes through cluster 2 at a lower speed than it passes through cluster 1. A possible cause could be that the cluster 1 area may not have obstacles that hinder the motion of the hurricane, whereas cluster 2 might have obstacles.

6.1. Comparison of clustering methods

To demonstrate the merits of the proposed algorithm, we performed a comparative study with three existing popular clustering algorithms, i.e., TRACLUS, CorClustST, and ST-OPTICS, using the truck-position and hurricane-track datasets. To evaluate the results of the proposed algorithm and the three existing algorithms quantitatively, we adopted the Dunn index (Dunn, 1974). The Dunn index is a method to evaluate the compactness and separation of clusters. A larger value of the Dunn index indicates better cluster quality. The Dunn index is based on the concept of distance between objects. Because the aforementioned algorithms use spatial and temporal distances, we computed the Dunn index using spatial distance and temporal distance separately.
The proposed ST-TRACLUS algorithm was compared with the TRACLUS, CorClustST, and ST-OPTICS algorithms using the truck-position dataset. The proposed algorithm was compared with the TRACLUS algorithm on the truck-position dataset using the QMeasure and silhouette index cluster quality measures. Using QMeasure, the TRACLUS algorithm produced 70 clusters whereas the proposed ST-TRACLUS


Table 1
Overview of quality measure, parameters and the number of clusters on truck-position dataset.

| Algorithm | Quality measure | Parameters | Number of clusters |
|---|---|---|---|
| TRACLUS | QMeasure | Eps = 70, MinLns = 33 | 70 |
| ST-TRACLUS | QMeasure | Eps = 320, timethreshold = 0.9, MinLns = 2 | 7944 |
| TRACLUS | Silhouette index | Eps = 80, MinLns = 34 | 73 |
| ST-TRACLUS | Silhouette index | Eps = 200, timethreshold = 0.9, MinLns = 3 | 8451 |
| CorClustST | – | Eps = 35000, Rho = 0.7 | 8221 |
| ST-OPTICS | Sorted k-dist graph | MinLns = 11, Eps1 = 822, Eps2 = 3271, Min_RD1 = 6.63466, Min_RD2 = 54, Max_CD1 = 3708.96, Max_CD2 = 3271 | 6554 |
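The search ranges behind the Eps and MinLns entries in Table 1 were narrowed with the entropy heuristic of Section 5.2.1. A minimal sketch of that heuristic follows, assuming the TRACLUS formulation: the entropy is computed over normalised neighbourhood sizes, and MinLns is taken as the average neighbourhood size plus 1 to 3. Truncating to integers is an assumption that happens to reproduce the ranges quoted in the text (2–4 for an average of 1.91, 8–10 for 7.53). The function names and the candidate data are hypothetical.

```python
import math


def neighbourhood_entropy(neigh_sizes):
    """H = -sum p_i * log2(p_i), with p_i = |N(L_i)| / sum_j |N(L_j)|."""
    total = sum(neigh_sizes)
    return -sum((s / total) * math.log2(s / total) for s in neigh_sizes if s > 0)


def pick_eps(neigh_sizes_by_eps):
    """Return the candidate Eps whose neighbourhood-size distribution has minimum entropy.

    neigh_sizes_by_eps maps each candidate Eps to the list of neighbourhood
    sizes |N(L)|, one entry per line segment.
    """
    return min(
        neigh_sizes_by_eps,
        key=lambda eps: neighbourhood_entropy(neigh_sizes_by_eps[eps]),
    )


def minlns_range(avg_neigh):
    """MinLns range avg + 1 .. avg + 3, truncated to integers (assumption)."""
    return math.floor(avg_neigh + 1), math.floor(avg_neigh + 3)


# Hypothetical neighbourhood sizes for two candidate Eps values.
candidates = {10: [1, 1, 1, 1], 20: [1, 1, 1, 9]}
print(pick_eps(candidates))  # 20: the skewed distribution has lower entropy
print(minlns_range(1.91))    # (2, 4)
print(minlns_range(7.53))    # (8, 10)
```

A uniform distribution of neighbourhood sizes gives the maximum entropy, so minimising it favours an Eps at which dense and sparse regions are most clearly distinguished.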

Table 2
Dunn indices on results of algorithms on truck-position dataset.

| Algorithm | Quality measure | Parameters | Dunn index (spatial distance) | Dunn index (temporal distance) |
|---|---|---|---|---|
| TRACLUS | QMeasure | Eps = 70, MinLns = 33 | 0.000270209 | – |
| ST-TRACLUS | QMeasure | Eps = 320, timethreshold = 0.9, MinLns = 2 | 0.000000927 | 0.94462 |
| TRACLUS | Silhouette index | Eps = 80, MinLns = 34 | 0.000180958 | – |
| ST-TRACLUS | Silhouette index | Eps = 200, timethreshold = 0.9, MinLns = 3 | 0.00000246 | 0.94561 |
| CorClustST | – | Eps = 35000, Rho = 0.7 | 0.000000244 | 0.93978 |
| ST-OPTICS | Sorted k-dist graph | MinLns = 11, Eps1 = 822, Eps2 = 3271, Min_RD1 = 6.63466, Min_RD2 = 54, Max_CD1 = 3708.96, Max_CD2 = 3271 | 0.000000301 | 0.84579 |
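The Dunn index values in Table 2 are reported separately for spatial and temporal distance. The sketch below shows one common form of the index, the smallest distance between items of different clusters divided by the largest distance between items of the same cluster, with the distance function passed in so that the same routine serves both columns. Which variant of the index the authors used is not stated, so this form is an assumption; the `(position, time)` tuples and the function names are illustrative only.

```python
from itertools import combinations, product


def dunn_index(clusters, dist):
    """Minimum inter-cluster distance divided by maximum intra-cluster diameter.

    Requires at least two clusters.
    """
    min_separation = min(
        dist(x, y)
        for a, b in combinations(clusters, 2)
        for x, y in product(a, b)
    )
    max_diameter = max(
        (dist(x, y) for c in clusters for x, y in combinations(c, 2)),
        default=0.0,
    )
    return min_separation / max_diameter if max_diameter > 0 else float("inf")


def spatial(p, q):
    """Distance along the position coordinate only."""
    return abs(p[0] - q[0])


def temporal(p, q):
    """Distance along the time coordinate only."""
    return abs(p[1] - q[1])


# Toy items as (position, time) pairs, grouped into two clusters.
clusters = [[(0, 0), (2, 1)], [(10, 5), (11, 7)]]
print(dunn_index(clusters, spatial))   # 8 / 2 = 4.0
print(dunn_index(clusters, temporal))  # 4 / 2 = 2.0
```

Since every pair of items is visited, the cost grows quadratically with the number of segments, which is consistent with the long computation times reported for results containing thousands of clusters.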

algorithm produced 7944 clusters. With the silhouette index quality measure, the TRACLUS algorithm produced 73 clusters and the ST-TRACLUS algorithm produced 8451 clusters. The CorClustST algorithm produced 8221 clusters, whereas ST-OPTICS produced 11,557 clusters. The quality measure, parameters, and number of clusters for the four algorithms are listed in Table 1.
The ST-TRACLUS, CorClustST, and ST-OPTICS algorithms produced a large number of clusters on the truck-position dataset. Computing the Dunn index on the results produced by these algorithms took a very long time; therefore, we removed very small clusters from the results of these three algorithms before computing the Dunn index values. Table 2 provides the Dunn index values, using both spatial and temporal distance measures, for the results of the algorithms on the truck-position dataset. With the QMeasure and silhouette index quality measures, the Dunn index values of the TRACLUS algorithm (spatial: 0.000270209 and 0.000180958, respectively) were larger than those of the proposed algorithm (spatial: 0.000000927 and 0.00000246, respectively); however, the TRACLUS algorithm produced only spatial clusters, whereas the proposed algorithm produced spatiotemporal clusters. If we compare the proposed algorithm with CorClustST (Dunn index value spatial: 0.000000244) and ST-OPTICS (Dunn index value spatial: 0.000000301) on the truck-position dataset, the Dunn index values of the proposed algorithm (spatial: 0.000000927 and 0.00000246 using QMeasure and the silhouette index, respectively) were larger than the Dunn index values of both algorithms. The computed Dunn index values of the proposed algorithm (temporal: 0.94462 and 0.94561 using QMeasure and the silhouette index, respectively) were larger than the Dunn index values (temporal) of both the CorClustST (Dunn index value temporal: 0.93978) and ST-OPTICS (Dunn index value temporal: 0.84579) algorithms. A larger value of the Dunn index indicates better cluster quality; therefore, we can say that the proposed algorithm performs better than the CorClustST and ST-OPTICS algorithms on the truck-position dataset.
The proposed ST-TRACLUS algorithm was compared with the TRACLUS, CorClustST, and ST-OPTICS algorithms using the hurricane-track dataset. The TRACLUS algorithm and the proposed algorithm were compared by adopting the QMeasure and silhouette index cluster quality measures. With QMeasure, the TRACLUS algorithm produced 56 clusters, whereas the ST-TRACLUS algorithm produced 87 clusters. With the silhouette index quality measure, the TRACLUS algorithm produced 81 clusters, whereas the ST-TRACLUS algorithm produced 46 clusters. The CorClustST algorithm produced 19 clusters, whereas the ST-OPTICS algorithm produced 65 clusters. The quality measure, parameters, and number of clusters produced by the four algorithms are listed in Table 3.
Table 4 provides the Dunn index values, using both spatial and temporal distance measures, for the results of the algorithms on the hurricane-track dataset. With the QMeasure quality measure, the Dunn index value of the proposed algorithm (spatial: 0.001406) was larger than that of the TRACLUS algorithm (spatial: 0.000985), but with the silhouette index quality measure the Dunn index value of TRACLUS (spatial: 0.001778) was larger than that of the proposed algorithm (spatial: 0.001166). If we compare the proposed algorithm with CorClustST (Dunn index value spatial: 0.000230) and ST-OPTICS (Dunn index value spatial: 0.000010) using the hurricane-track dataset,

Table 3
Overview of quality measure, parameters and the number of clusters on hurricane-track dataset.

| Algorithm | Quality measure | Parameters | Number of clusters |
|---|---|---|---|
| TRACLUS | QMeasure | Eps = 45000, MinLns = 7 | 56 |
| ST-TRACLUS | QMeasure | Eps = 45000, timethreshold = 0.1, MinLns = 8 | 87 |
| TRACLUS | Silhouette index | Eps = 35000, MinLns = 8 | 81 |
| ST-TRACLUS | Silhouette index | Eps = 35000, timethreshold = 0.1, MinLns = 10 | 46 |
| CorClustST | – | Eps = 1155000, Rho = 0.7 | 19 |
| ST-OPTICS | Sorted k-dist graph | MinLns = 10, Eps1 = 378000, Eps2 = 27, Min_RD1 = 166557, Min_RD2 = 0, Max_CD1 = 1174650, Max_CD2 = 24 | 65 |
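The ST-OPTICS row in Table 3 relies on two heuristics: MinLns from the natural log of the number of line segments, and the radii from a sorted k-dist graph. A minimal sketch of both follows. Rounding the logarithm to the nearest integer is an assumption, as are the segment count of 60000, the 1-D toy points, and the function names; the knee of the sorted curve is normally read off a plot, which is not automated here.

```python
import math


def minlns_from_count(n_segments):
    """MinLns taken as ln(n), rounded to the nearest integer (assumption)."""
    return round(math.log(n_segments))


def sorted_k_dist(dist_matrix, k):
    """Distance from each item to its k-th nearest neighbour, sorted descending."""
    k_dists = []
    for i, row in enumerate(dist_matrix):
        others = sorted(d for j, d in enumerate(row) if j != i)
        k_dists.append(others[k - 1])
    return sorted(k_dists, reverse=True)


# Toy 1-D points; the outlier at 10.0 produces the large first value.
points = [0.0, 1.0, 2.0, 10.0]
dist_matrix = [[abs(p - q) for q in points] for p in points]
print(sorted_k_dist(dist_matrix, k=2))  # [9.0, 2.0, 2.0, 1.0]
print(minlns_from_count(60000))         # 11
```

In the sorted curve, the sharp drop after the first value separates the outlier from the dense points; a radius chosen just below that drop keeps the dense points as core items. The same procedure would be run once with the spatial distance for Eps1 and once with the temporal distance for Eps2.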


Table 4
Dunn indices on results of algorithms on hurricane-track dataset.

| Algorithm | Quality measure | Parameters | Dunn index (spatial distance) | Dunn index (temporal distance) |
|---|---|---|---|---|
| TRACLUS | QMeasure | Eps = 45000, MinLns = 7 | 0.000985 | – |
| ST-TRACLUS | QMeasure | Eps = 45000, timethreshold = 0.1, MinLns = 8 | 0.001406 | 0.155555 |
| TRACLUS | Silhouette index | Eps = 35000, MinLns = 8 | 0.001778 | – |
| ST-TRACLUS | Silhouette index | Eps = 35000, timethreshold = 0.1, MinLns = 10 | 0.001166 | 0.428571 |
| CorClustST | – | Eps = 1155000, Rho = 0.7 | 0.000230 | 0.000228 |
| ST-OPTICS | Sorted k-dist graph | MinLns = 10, Eps1 = 378000, Eps2 = 27, Min_RD1 = 166557, Min_RD2 = 0, Max_CD1 = 1174650, Max_CD2 = 24 | 0.000010 | 0.333333 |

the Dunn index values of the proposed algorithm (spatial: 0.001406 and 0.001166 using QMeasure and the silhouette index, respectively) were larger than those of both algorithms. When we employed the silhouette index quality measure, the computed Dunn index value of the proposed algorithm (temporal: 0.428571) was larger than the Dunn index values (temporal) of both the CorClustST (Dunn index value temporal: 0.000228) and the ST-OPTICS (Dunn index value temporal: 0.333333) algorithms. A larger value of the Dunn index indicates better cluster quality; therefore, we can say that the proposed algorithm performs better than the CorClustST and ST-OPTICS algorithms on the hurricane-track dataset.

7. Conclusion and future work

The technological infrastructure of the current era continuously generates spatiotemporal data. The data need to be analyzed to obtain useful information. We proposed ST-TRACLUS, an innovative spatiotemporal clustering algorithm, to mine spatiotemporal co-location events. The algorithm partitions a trajectory into sub-trajectories and groups them based on spatiotemporal dimensions. We implemented visualization tools to validate clusters. We performed experiments using truck-position data and hurricane-track data. To select appropriate parameter values, we adopted the concept of entropy and the silhouette index, which gave accurate values and helped in the generation of quality clusters. To demonstrate the merits of the proposed algorithm, we performed a comparative study, using the Dunn index, with three existing popular clustering algorithms, i.e., TRACLUS, CorClustST, and ST-OPTICS, on the truck-position data and hurricane-track data. Experimental results demonstrate that the proposed ST-TRACLUS algorithm performs better than the CorClustST and ST-OPTICS algorithms.
Clustering algorithms usually require the user to set a number of parameters, which affects the clustering results; therefore, in the future, we will attempt to develop algorithms that are more efficient and less sensitive to parameter values. The integration of domain knowledge with clustering helps to improve clustering accuracy and cluster interpretation. We will attempt to develop clustering algorithms using domain knowledge. In addition, we will perform additional experiments with other datasets. We will also explore ways to combine the clustering results of the present algorithm with the results of other algorithms to improve performance.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work is supported by the Ministry of Electronics and Information Technology, Government of India, under the Visvesvaraya Ph.D. scheme.

References

Agrawal, K. P., Garg, S., Sharma, S., & Patel, P. (2016). Development and validation of OPTICS based spatio-temporal clustering technique. Information Sciences, 369, 388–401. https://doi.org/10.1016/j.ins.2016.06.048
Ahmad, A., & Dey, L. (2007). A k-mean clustering algorithm for mixed numeric and categorical data. Data and Knowledge Engineering, 63(2), 503–527. https://doi.org/10.1016/j.datak.2007.03.016
Alon, J., Sclaroff, S., Kollios, G., & Pavlovic, V. (2003). Discovering clusters in motion time-series data. 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings., 1, 375–381. https://doi.org/10.1109/CVPR.2003.1211378
Ankerst, M., Breunig, M. M., Kriegel, H., & Sander, J. (1999). OPTICS: Ordering Points To Identify the Clustering Structure. SIGMOD ’99 Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 28(2), 49–60. https://doi.org/10.1145/304182.304187
Ansari, M. Y., Ahmad, A., Khan, S. S., Bhushan, G., & Mainuddin. (2020). Spatiotemporal clustering: A review. Artificial Intelligence Review, 53, 2381–2423. https://doi.org/10.1007/s10462-019-09736-1
Bermingham, L., & Lee, I. (2015). A general methodology for n-dimensional trajectory clustering. Expert Systems with Applications, 42(21), 7573–7581.
Birant, D., & Kut, A. (2007). ST-DBSCAN: An algorithm for clustering spatial-temporal data. Data and Knowledge Engineering, 60(1), 208–221. https://doi.org/10.1016/j.datak.2006.01.013
Campello, R. J. G. B., Moulavi, D., & Sander, J. 160–172. https://doi.org/10.1007/978-3-642-37456-2_14
Chang, C., & Zhou, B. (2009). Multi-granularity visualization of trajectory clusters using sub-trajectory clustering. Miami, Florida, USA, 577–582. https://doi.org/10.1109/ICDMW.2009.24
Chudova, D., Gaffney, S., Mjolsness, E., & Smyth, P. (2003). Translation-invariant mixture models for curve clustering. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’03, (3), 79–88. https://doi.org/10.1145/956750.956763
Dunn, J. C. (1974). Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4(1), 95–104.
Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of Second International Conference on Knowledge Discovery and Data Mining, 226–231.
Fisher, D. (1987). Knowledge acquisition via incremental clustering. Machine Learning, 2, 139–182. https://doi.org/10.1023/a:1022852608280
Gaffney, S., & Smyth, P. (1999). Trajectory clustering with mixtures of regression models. Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining KDD 99, 10(99), 63–72. https://doi.org/10.1145/312129.312198
GeoTools: GeoTools 14.2 Released. http://geotoolsnews.blogspot.in/2016/01/geotools-142-released.html/ Accessed 15.09.2016.
Guttman, A. (1984). R-trees. Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data - SIGMOD ’84, 47–57. https://doi.org/10.1145/602259.602266
Hüsch, M., Schyska, B. U., & Bremen, L. V. (2020). CorClustST—Correlation-based clustering of big spatio-temporal datasets. Future Generation Computer Systems, 110, 610–619. https://doi.org/10.1016/j.future.2018.04.002
Kreveld, M. V., & Luo, J. (2007). The definition and computation of trajectory and sub-trajectory similarity. In Proceedings of the 15th Annual ACM International Symposium on Advances in Geographic Information Systems (pp. 324–327).
The authors declare that they have no known competing financial


Lee, J., Han, J., & Whang, K.-Y. (2007). Trajectory clustering: A partition-and-group framework. Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data - SIGMOD ’07, 593–604. https://doi.org/10.1145/1247480.1247546
Li, Y., Han, J., & Yang, J. (2004). Clustering moving objects. Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’04, 617–622. https://doi.org/10.1145/1014052.1014129
Li, Z., & Wang, X. (2010). Spatial clustering algorithm based on hierarchical-partition tree. International Journal of Digital Content Technology and its Applications, 4(6), 26–35.
Ma, S., Wang, T., Tang, S., Yang, D., & Gao, J. (2003). A new fast clustering algorithm based on reference and density. Advances in Web-Age Information Management, 2002, 214–225. https://doi.org/10.1007/978-3-540-45160-0_21
Macqueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1(233), 281–297. https://projecteuclid.org/euclid.bsmsp/1200512992
Nanni, M., & Pedreschi, D. (2006). Time-focused density-based clustering of trajectories of moving objects. Journal of Intelligent Information Systems, 27, 267–289.
Pearson, K. (1895). Note on regression and inheritance in the case of two parents. Proc R Soc Lond Ser I, 58, 240–242.
Pelekis, N., Andrienko, G., Andrienko, N., Kopanakis, I., Marketos, G., & Theodoridis, Y. (2012). Visually exploring movement data via similarity-based analysis. Journal of Intelligent Information Systems, 38(2), 343–391. https://doi.org/10.1007/s10844-011-0159-2
Pelekis, N., Kopanakis, I., Marketos, G., Ntoutsi, I., Andrienko, G., & Theodoridis, Y. (2007). Similarity search in trajectory databases. In Proceedings of the International Workshop on Temporal Representation and Reasoning (pp. 129–140). https://doi.org/10.1109/TIME.2007.59
PostgreSQL 9.6.2, 9.5.6, 9.4.11, 9.3.16 and 9.2.20 Released. https://www.postgresql.org/ Accessed 10.09.2017.
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(1), 53–65.
Sheikholeslami, G., Chatterjee, S., & Zhang, A. (2000). WaveCluster: A wavelet-based clustering approach for spatial data in very large databases. The VLDB Journal, 8(3), 289–304. https://doi.org/10.1007/s007780050009
Shekhar, S., & Huang, Y. (2001). Discovering spatial co-location patterns: A summary of results. SSTD, 236–237.
Snyder, J. P. (1987). Map projections–A working manual, 1395. US Government Printing Office.
Tan, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Boston, MA, USA: Addison Wesley.
Tsumoto, S., & Hirano, S. (2009). Behavior grouping based on trajectory mining. In Social Computing and Behavioral Modeling (pp. 219–226). New York, USA: Springer.
Wang, W., Yang, J., & Muntz, R. (1997). STING: A statistical information grid approach to spatial data mining. Proc. 23rd Int. Conf. Very Large Data Bases, 186–195.
Yanagisawa, Y., & Satoh, T. (2006). Clustering multidimensional trajectories based on shape and velocity. In Proceedings of the 22nd International Conference on Data Engineering Workshops (pp. 12–22).
Yao, X. (2003). Research issues in spatio-temporal data mining. A white paper submitted to the University Consortium for Geographic Information Science (UCGIS) workshop on Geospatial Visualization and Knowledge Discovery (pp. 1–6). Virginia: Lansdowne.
Ying, J. J.-C., Lee, W.-C., & Tseng, V. S. (2013). Mining geographic-temporal-semantic patterns in trajectories for location prediction. ACM Transactions on Intelligent Systems and Technology, 5(1), 1–33. https://doi.org/10.1145/2542182.2542184
Yuan, G., Xia, S., Zhang, L., Zhou, Y., & Ji, C. (2011). An efficient trajectory-clustering algorithm based on an index tree. Transactions of the Institute of Measurement and Control, 34(7), 850–861.
Zaghlool, E., ElKaffas, S., & Saad, A. (2015). A density-based clustering of spatio-temporal data. New Contributions in Information Systems and Technologies. Springer International Publishing, 41–50.
Zhang, D., Lee, K., & Lee, I. (2018). Hierarchical trajectory clustering for spatio-temporal periodic pattern mining. Expert Systems with Applications, 92, 1–11.
Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD International Conference on Management of Data, 1, 103–114. https://doi.org/10.1145/233269.233324
