Clustering multiple data streams

Antonio Balzanella, Yves Lechevallier and Rosanna Verde
Abstract In recent years, data stream analysis has gained a lot of attention due to the growth of application fields generating huge amounts of temporal data. In this paper we focus on the clustering of multiple streams. We propose a new strategy which aims at grouping similar streams and, at the same time, at computing summaries of the incoming data. This is performed by means of a divide-and-conquer approach in which a continuously updated graph collects information on the incoming data and an off-line partitioning algorithm provides the final clustering structure. An application on real data sets corroborates the effectiveness of the proposal.
1 Introduction
With the fast growth of data acquisition and processing capabilities, a wide range of domains is generating huge amounts of temporal data. Some examples are financial and retail transactions, web data, network traffic, electricity consumption and remote sensor data.
Traditional data mining methods fail at dealing with these data since they use computationally intensive algorithms which require multiple scans of the data. Thus, whenever users need the answers to their mining and knowledge discovery queries in a short time, such algorithms become ineffective.
A further issue of traditional algorithms is that data can only be processed if they are stored on some available media.
To deal with this new challenging task, proper approaches, usually referred to as techniques for data stream analysis, are needed.
Among the knowledge extraction tools for data streams, clustering is widely used in exploratory analyses.
Clustering in the data stream framework is used to deal with two different challenges. The first one is related to analyzing a single data stream to discover a partitioning of the observations it is composed of. A second one is to process data streams generated by a set of sources (think of sensor networks) to discover a partitioning of the sources themselves.
In this paper we focus on the second one, which is usually referred to as streams clustering.
Interesting proposals on this topic have been introduced in [2][3]. The first one is an extension of the k-means algorithm for time series to the data stream framework. Basically, the idea is to split parallel arriving streams into non-overlapping windows and to process the data of each window by performing, at first, a Discrete Fourier Transform to reduce the dimensionality of the data, and then the k-means algorithm on the coefficients of the transformation. On each window, the k-means is initialized using the centroids of the clusters of the partition obtained on the latest processed window.
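As a minimal sketch of the per-window dimensionality reduction step described above (the function and parameter names are ours, not taken from [2]), the first few DFT coefficients of a window can be computed as:

```python
import cmath

def dft_coefficients(window, n_coef):
    """Return the first n_coef Discrete Fourier Transform coefficients
    of a window, flattened into a real-valued feature vector.
    Illustrative sketch of the reduction step; names are ours."""
    N = len(window)
    feats = []
    for k in range(n_coef):
        c = sum(window[t] * cmath.exp(-2j * cmath.pi * k * t / N)
                for t in range(N)) / N
        feats.extend([c.real, c.imag])
    return feats
```

The k-means would then be run on these feature vectors, warm-started with the centroids obtained on the previous window.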
The main drawback of this strategy is its inability to deal with evolving data streams, because the final data partition only depends on the data of the most recent window.
The second proposal is performed in two steps:
- an on-line procedure stores the coefficients of a suitable transformation (wavelet or linear regression) computed on chunks of the streams;
- an off-line procedure is run on the collected coefficients to get the final clustering structure.
Although this method is able to deal with evolving data streams, its main drawback is that the approach used for summarization is only based on storing compressed streams.
In this paper, we introduce a new strategy for clustering highly evolving data streams which provides the clustering structure over a specific temporal interval together with a set of time-located summaries of the data. Our proposal consists of two steps. The first one, which is run on-line, performs the clustering of incoming chunks of data to get local representative profiles and to update the adjacency matrix of an undirected graph which collects the similarities among the streams. The second one performs the final data partitioning by means of an off-line clustering algorithm run on the adjacency matrix.
Let S = {Y1, . . . , Yi, . . . , Yn} denote the n streams, where each Yi = [(y1, t1), . . . , (yj, tj), . . .] is a sequence of real-valued observations ordered on a discrete time grid T = {t1, . . . , tj, . . .}. The aim is to obtain a partition P of S, allocating the streams to each cluster Ck so as to minimize the dissimilarity within each cluster and to maximize the dissimilarity between clusters.
In order to get the partition P, the incoming parallel streams are split into non-overlapping windows of fixed size (Fig. 1).
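This windowing step can be sketched as follows (the helper name is ours):

```python
def split_into_windows(stream, window_size):
    """Split one stream (a list of observations) into consecutive
    non-overlapping windows of fixed size; a partial tail is dropped."""
    return [stream[i:i + window_size]
            for i in range(0, len(stream) - window_size + 1, window_size)]
```

Applied to each of the n parallel streams, this yields the subsequences Yiw, i.e. the data of the i-th stream falling in the w-th window.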
For each local partition Pw we update the adjacency matrix A of the graph G, processing the outputs provided by the clustering algorithm at each window (Fig. 2). This is performed by collecting the similarity values among the streams without computing all the pairwise proximities, as required in the data stream framework.
The main idea underlying this approach is to store in each cell ai,l the number of times each couple of streams is allocated to the same cluster of a local partition Pw; a simple updating procedure is therefore run on each window.
For instance, let us assume we have five streams (Y1, Y2, . . . , Y5) and a local partition P1 = (Y1w, Y2w)(Y3w, Y4w, Y5w); the graph updating consists of:
1. Adding 1 to the cells a1,2 and a2,1
2. Adding 1 to the cells a3,4 and a4,3
3. Adding 1 to the cells a3,5 and a5,3
4. Adding 1 to the cells a4,5 and a5,4
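The four updates above can be sketched as follows (a minimal illustration; the function and variable names are ours):

```python
from itertools import combinations

def update_adjacency(A, local_partition, index):
    """Add 1 to A[i][l] and A[l][i] for every couple of streams
    allocated to the same cluster of the local partition."""
    for cluster in local_partition:
        for a, b in combinations(cluster, 2):
            i, l = index[a], index[b]
            A[i][l] += 1
            A[l][i] += 1

# The example from the text: P1 = (Y1, Y2)(Y3, Y4, Y5)
streams = ["Y1", "Y2", "Y3", "Y4", "Y5"]
index = {s: i for i, s in enumerate(streams)}
A = [[0.0] * len(streams) for _ in streams]
update_adjacency(A, [["Y1", "Y2"], ["Y3", "Y4", "Y5"]], index)
# A[0][1], A[2][3], A[2][4], A[3][4] (and their symmetric cells) are now 1
```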
We can look at A as a proximity matrix whose cells record the pairwise similarity among streams. For each couple of subsequences Yiw, Ylw in the same cluster, updating the cells ai,l and al,i of A with the value 1 corresponds to assigning the maximum similarity value to that couple of subsequences.
When this procedure is performed over a wide number of windows, A becomes a graduation of the consensus between couples of streams, computed incrementally: the higher the consensus, the more similar the streams.
However, according to this procedure, if two subsequences belong to two different clusters of a local partition, their similarity is always considered minimal, with the value 0 set in the corresponding cells of A.
In order to deal with this issue, we improve the previous graph updating approach by graduating the similarities between couples of streams belonging to different clusters, instead of considering them maximally dissimilar.
To this aim, we define the cluster radius and the membership function.
Definition 1. Let d be a distance function between two subsequences. The radius radkw of the cluster Ckw is max(d(Yiw, bkw)) over the subsequences Yiw ∈ Ckw, where bkw is the prototype of Ckw.
Starting from the cluster radius radkw, the membership function of a subsequence Ylw to a cluster Ckw is given by:

mf l,k = radkw / d(Ylw, bkw),   for Ylw ∉ Ckw   (1)
mf l,k ranges from 0 to 1, since it is the ratio between the radius of the cluster Ckw and the distance from the prototype bkw of a subsequence Ylw not belonging to the cluster.
We update the cells of A using the value of the membership function computed for each couple of subsequences belonging to different clusters, while couples in the same cluster still contribute the value 1. According to such adjustments, the whole updating process combines both kinds of contribution at each window.
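The exact combination rule for cross-cluster couples is not detailed in this excerpt, so the sketch below makes one plausible symmetric choice, averaging the two membership values of Eq. (1); that averaging choice and all names are our assumptions:

```python
from itertools import combinations

def euclid(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def update_adjacency_soft(A, partition, prototypes, data, index, d=euclid):
    """One window's update of A: couples in the same cluster add 1,
    couples in different clusters add a membership-based similarity.
    partition: list of clusters (lists of labels); prototypes: one per
    cluster; data: label -> subsequence; index: label -> row/column of A."""
    # rad_k^w: largest distance between a member and the cluster prototype
    radii = [max(d(data[l], b) for l in cluster)
             for cluster, b in zip(partition, prototypes)]
    for cluster in partition:
        for a, c in combinations(cluster, 2):
            A[index[a]][index[c]] += 1.0
            A[index[c]][index[a]] += 1.0
    for k, kp in combinations(range(len(partition)), 2):
        for a in partition[k]:
            for c in partition[kp]:
                # mf of each subsequence to the other cluster (Eq. 1),
                # averaged for symmetry (our assumption), clipped to 1
                mf = min(1.0, 0.5 * (radii[kp] / d(data[a], prototypes[kp])
                                     + radii[k] / d(data[c], prototypes[k])))
                A[index[a]][index[c]] += mf
                A[index[c]][index[a]] += mf
```

With this rule, a window contributes 1 to same-cluster couples and a value in (0, 1) to cross-cluster couples, so A remains a graded consensus matrix.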
The proposed procedure, repeated on each window, allows us to obtain the pairwise similarities among streams in an incremental way. This is achieved without computing the similarity between each couple of streams, because only the distances computed in the allocation step of the last iteration of the Dynamic Clustering Algorithm (DCA) are used.
In order to get a global partition P of the set of streams S into C clusters, over the sequence of windows wf (f = 1, . . . , F), we run a proper clustering algorithm on the proximity matrix A.
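The excerpt does not fix a particular off-line algorithm, so as an illustrative assumption the sketch below uses average-linkage agglomerative merging driven directly by the similarities stored in A:

```python
def cluster_similarity_matrix(A, C):
    """Agglomerative grouping on a similarity matrix: start from singletons
    and repeatedly merge the two clusters with the highest average pairwise
    similarity until C clusters remain. One simple choice of off-line
    algorithm; names and the linkage rule are illustrative assumptions."""
    clusters = [[i] for i in range(len(A))]
    while len(clusters) > C:
        best, pair = -1.0, None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                sim = sum(A[i][j] for i in clusters[p] for j in clusters[q])
                sim /= len(clusters[p]) * len(clusters[q])
                if sim > best:
                    best, pair = sim, (p, q)
        p, q = pair
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters
```

Since A grows monotonically with the consensus, a high average similarity between two groups of streams means they were repeatedly co-clustered across windows.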
4 Main results
To evaluate the performance of the proposed strategy we have compared the clustering performance of the on-line clustering strategy with the k-means algorithm on stored data, using highly evolving datasets.
Two datasets have been used in the evaluation process.
The first one is made of 76 highly evolving time series, downloaded from Yahoo finance, which represent the daily closing prices of randomly chosen stocks. Each time series consists of 4,000 observations.
The second one is made of 179 highly evolving time series which collect the daily electricity supply at several locations in Australia. Each time series consists of 3,288 recordings.
We have considered some of the most common indexes to assess the effectiveness of the proposal (see [7]). The Rand index (RI) and the Adjusted Rand index (ARI) are used as external validity indexes to evaluate the degree of consensus between the partition obtained by our proposal and the partition obtained using the k-means. Moreover, the Calinski-Harabasz index (CH), the Davies-Bouldin index (DB) and the Silhouette Width criterion (SW) are used as internal validity indexes to evaluate the compactness of the clusters and their separation.
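As a reminder of how the external comparison works, the Rand index admits a direct pair-counting implementation (a plain sketch, not the code used in the experiments):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of couples of items on which two partitions agree:
    either both place the couple in the same cluster, or both separate it."""
    n = len(labels_a)
    agree = 0
    for i, j in combinations(range(n), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
    return agree / (n * (n - 1) / 2)
```

A value of 1 means the two partitions group the streams identically, regardless of how the cluster labels are numbered.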
In order to perform the testing, we need to set the following input parameters for the proposed procedure:
• the number of clusters K of each local partition Pw
• the final number of clusters C of the partition of S
• the size ws of each temporal window
For the k-means we only need to set the number of clusters C.
The Euclidean distance is used as the dissimilarity function in both procedures. According to this choice, the DCA algorithm on the window data is a classical k-means where the prototypes are the averages of the data in each cluster.
The parameter C has been set, for both datasets, by running the k-means algorithm with C = 2, . . . , 8 and computing, for each value of C, the total within-cluster deviance.
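The quantity used for this elbow-style choice can be sketched as follows (function and parameter names are ours):

```python
def total_within_deviance(data, labels):
    """Sum over clusters of the squared Euclidean distances of the
    cluster members to their cluster mean (total within-cluster deviance)."""
    clusters = {}
    for x, l in zip(data, labels):
        clusters.setdefault(l, []).append(x)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        mean = [sum(x[d] for x in members) / len(members) for d in range(dim)]
        total += sum(sum((x[d] - mean[d]) ** 2 for d in range(dim))
                     for x in members)
    return total
```

Computing this for each candidate C and looking for the value after which the decrease flattens gives the number of clusters that best improves homogeneity.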
We have chosen C = 4 for the first dataset and C = 3 for the second one, since these are the values which provide the highest improvement in cluster homogeneity.
By evaluating, through the mentioned indexes, the partitioning quality for several values of ws, we can state that the choice of the window size does not impact the cluster homogeneity. As a consequence, the value of this parameter can be chosen according to the kind of summarization required. For example, if we need to detect a set of prototypes for each week of data, we choose a window size which frames the observations of a week.
In our tests, we have used windows made of 30 observations for both datasets.
The remaining input parameter, K, does not strongly impact the clustering quality. We have verified this by evaluating the behavior of the Calinski-Harabasz index and of the Davies-Bouldin index for K = 2, . . . , 10.
In the following table we show the main results of the evaluated indexes:
Looking at the values of the internal validity indexes, computed for our proposal and for the k-means on stored data, it emerges that the homogeneity of the clusters and their separation are quite similar. Moreover, the values of the Rand index and of the Adjusted Rand index highlight the strength of the consensus between the obtained partitions.
5 Conclusions
In this paper we have introduced a new strategy for clustering highly evolving data streams, based on processing the incoming data in an incremental way without requiring their storage. This strategy compares favorably to the standard k-means performed on stored data, as shown on the test datasets using several standard validity indexes. A further development will be to introduce a strategy for monitoring the evolution of the clustering structure over time.
Acknowledgments. This paper has been supported by COST Action IC0702 and by "Magda: una piattaforma ad agenti mobili per il Grid Computing" (Chair: Prof. Beniamino Di Martino).

References