
Clustering multiple data streams ∗

Antonio Balzanella, Yves Lechevallier and Rosanna Verde

Abstract In recent years, data stream analysis has gained a lot of attention due
to the growth of application fields generating huge amounts of temporal data. In this
paper we focus on the clustering of multiple streams. We propose a new strategy
which aims at grouping similar streams and, at the same time, at computing summaries of
the incoming data. This is performed by means of a divide-and-conquer approach
where a continuously updated graph collects information on the incoming data and an
off-line partitioning algorithm provides the final clustering structure. An application
to real data sets corroborates the effectiveness of the proposal.

Key words: Data streams, Clustering, Temporal data mining

1 Introduction

With the fast growth of data acquisition and processing capabilities, a wide
range of domains is generating huge amounts of temporal data.
Some examples are financial and retail transactions, web data, network traffic,
electricity consumption and remote sensor data.
Traditional data mining methods fail at dealing with these data since they use
computationally intensive algorithms which require multiple scans of the data. Thus,
whenever users need answers to their mining and knowledge discovery queries
within a short time, such algorithms become ineffective.
A further issue of traditional algorithms is that data can only be processed if they
are stored on some available medium.

Antonio Balzanella (antonio.balzanella2@gmail.com) and Rosanna Verde
(rosanna.verde@unina2.it)
Seconda Università degli Studi di Napoli, Via del Setificio, 81100 Caserta - Italy
Yves Lechevallier (Yves.Lechevallier@inria.fr)
INRIA, 78153 Le Chesnay cedex - France


To deal with this new challenging task, proper approaches, usually referred to as
techniques for data stream analysis, are needed.
Among the knowledge extraction tools for data streams, clustering is widely used
in exploratory analyses.
Clustering in the data stream framework addresses two different challenges.
The first is to analyze a single data stream to discover a partitioning of
the observations it is composed of. The second is to process data streams generated
by a set of sources (think of sensor networks) to discover a partitioning of
the sources themselves.
In this paper we focus on the second one, which is usually referred to as streams
clustering.
Interesting proposals on this topic have been introduced in [2, 3]. The first is
an extension to the data stream framework of the k-means algorithm performed on
time series. Basically, the idea is to split the parallel arriving streams into non-overlapping
windows and to process the data of each window by performing, at first, a Discrete
Fourier Transform to reduce the dimensionality of the data, and then the k-means
algorithm on the coefficients of the transformation. On each window, the k-means is
initialized using the centroids of the clusters of the partition obtained on the most recently
processed window.
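For concreteness, here is a minimal Python sketch of this per-window scheme; the magnitude-based DFT reduction, the number of retained coefficients, and the use of scikit-learn's k-means are assumptions of the sketch, not details taken from [2]:

```python
import numpy as np
from sklearn.cluster import KMeans

def window_kmeans(window, K, n_coeffs=8, prev_centroids=None):
    """One window of the scheme in [2]: reduce each stream's subsequence
    to a few DFT coefficients, then run k-means on them, warm-started
    from the previous window's centroids when available."""
    # Magnitudes of the first n_coeffs Fourier coefficients (an assumption)
    coeffs = np.abs(np.fft.rfft(window, axis=1))[:, :n_coeffs]
    if prev_centroids is None:
        km = KMeans(n_clusters=K, n_init=10)
    else:
        km = KMeans(n_clusters=K, init=prev_centroids, n_init=1)
    km.fit(coeffs)
    return km.labels_, km.cluster_centers_
```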
The main drawback of this strategy is its inability to deal with evolving data
streams, since the final data partition depends only on the data of the most recent
window.
The second proposal is performed in two steps:
- an on-line procedure stores the coefficients of a suitable transformation (wavelet
or linear regression) computed on chunks of the streams;
- an off-line procedure is run on the coefficients collected on-line to get the final
clustering structure.
Although this method is able to deal with evolving data streams, its main drawback
is that its approach to summarization is only based on storing compressed streams.
In this paper, we introduce a new strategy for clustering highly evolving data
streams which provides the clustering structure over a specific temporal interval
together with a set of time-located summaries of the data. Our proposal consists of two
steps. The first, which is run on-line, performs the clustering of incoming chunks of
data to get local representative profiles and to update the adjacency matrix of an
undirected graph which collects the similarities among the streams.
The second performs the final data partitioning by means of an off-line clustering
algorithm run on that adjacency matrix.

2 A graph-based approach for clustering streaming time series

Let us denote by $S = \{Y_1, \ldots, Y_i, \ldots, Y_n\}$ the set of $n$ streams, where each
$Y_i = [(y_1, t_1), \ldots, (y_j, t_j), \ldots, (y_\infty, t_\infty)]$ is made of real-valued observations
ordered on a discrete time grid $T = \{t_1, \ldots, t_j, \ldots, t_\infty\} \subset \Re$. A time window $w_f$,
with $f = 1, \ldots, \infty$, is an ordered subset of $T$ having size $w_s$. Each time window $w_f$
frames a subset $Y_i^w$ of $Y_i$, called a subsequence, where $Y_i^w = [y_j, \ldots, y_{j+w_s}]$.
The objective is to find a partition $P$ of $S$ into $C$ clusters such that each stream $Y_i$
belongs to a cluster $C_k$, with $k = 1, \ldots, C$ and $\bigcap_{k=1}^{C} C_k = \emptyset$. Streams are allocated to
each cluster $C_k$ so as to minimize the dissimilarity within each cluster and to
maximize the dissimilarity between clusters.
In order to get the partition $P$, the incoming parallel streams are split into non-overlapping
windows of fixed size (Fig. 1).

Fig. 1 Splitting of the incoming data into non-overlapping windows
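As an illustration, here is a minimal sketch of this splitting step, assuming the parallel streams are buffered as the rows of a matrix and the window size $w_s$ is fixed in advance:

```python
import numpy as np

def split_windows(buffer, ws):
    """Yield the non-overlapping windows of ws observations framed on the
    n parallel streams stored as rows of `buffer` (shape: n x length)."""
    n, length = buffer.shape
    for start in range(0, length - ws + 1, ws):
        yield buffer[:, start:start + ws]
```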

On the subsequences $Y_i^w$ of $Y_i$ framed by each window $w_f$ we run a Dynamic
Clustering Algorithm (DCA) extended to complex data [4, 5].
The DCA looks both for a representation of the clusters (by means of a set of
prototypes) and for the best partition into $K$ clusters, according to a criterion function
based on a suitable dissimilarity measure.
The algorithm alternates a step of representation of the clusters and a step of
allocation of the subsequences to the clusters according to the minimum dissimilarity
to the prototypes.
In our case, the DCA provides a local partition $P^w = C_1^w \cup \ldots \cup C_\kappa^w \cup \ldots \cup C_K^w$ into
$K$ clusters of the subsequences framed by $w_f$, and the associated set of prototypes
$B^w = (b_1^w, \ldots, b_\kappa^w, \ldots, b_K^w)$, which summarize the behaviors of the streams in
time-localized windows.
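Since with the Euclidean distance the DCA reduces to a classical k-means whose prototypes are the cluster averages (the setting adopted in Section 4), one window can be processed with a sketch like the following, where scikit-learn's k-means stands in for the DCA:

```python
from sklearn.cluster import KMeans

def local_partition(window, K):
    """Cluster the (n, ws) subsequences of one window into K local clusters;
    return each stream's cluster label and the K prototypes b_1..b_K."""
    km = KMeans(n_clusters=K, n_init=10).fit(window)
    return km.labels_, km.cluster_centers_
```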
Let $G = (V, E)$ be an undirected similarity graph where the vertex set $V =
(v_1, \ldots, v_i, \ldots, v_n)$ corresponds to the indices $i$ of the streams of $S$, and the edge
set $E$ carries non-negative values which stand for the weights $a_{i,l}$ of the link
between the streams $Y_i$ and $Y_l$.
The graph $G$ can be represented by means of a symmetric adjacency matrix
$A = (a_{i,l})_{i,l=1,\ldots,n}$, where $a_{i,l} = a_{l,i}$ and $a_{i,i} = 0$.

Fig. 2 Data processing schema

For each local partition $P^w$ we update the adjacency matrix $A$ of the graph $G$,
processing the outputs provided by the clustering algorithm at each window (Fig.
2). This collects the similarity values among the streams without
computing all the pairwise proximities among the streams, as required in the data
stream framework.
The main idea underlying this approach is to store in each cell $a_{i,l}$ the number of
times each couple of streams is allocated to the same cluster of a local partition $P^w$.
This means that the following procedure has to be run on each window:

for each local cluster $C_\kappa^w \in P^w$ do
  detect all the couples of subsequences $(Y_i^w, Y_l^w)$ which are allocated to the cluster $C_\kappa^w$
  for each couple $(i, l)$ do
    add 1 to the cells $a_{i,l}$ and $a_{l,i}$ of $A$
  end for
end for

For instance, let us assume we have five streams $(Y_1, Y_2, \ldots, Y_5)$ and a local
partition $P^1 = \{(Y_1^w, Y_2^w), (Y_3^w, Y_4^w, Y_5^w)\}$; the graph updating consists in:
1. adding 1 to the cells $a_{1,2}$ and $a_{2,1}$
2. adding 1 to the cells $a_{3,4}$ and $a_{4,3}$
3. adding 1 to the cells $a_{3,5}$ and $a_{5,3}$
4. adding 1 to the cells $a_{4,5}$ and $a_{5,4}$
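A minimal Python sketch of this co-occurrence update, assuming `labels` holds the local cluster index assigned to each stream in the current window:

```python
import numpy as np

def update_cooccurrence(A, labels):
    """Add 1 to a_il and a_li for every couple of streams (i, l)
    allocated to the same cluster of the local partition."""
    labels = np.asarray(labels)
    n = len(labels)
    for i in range(n):
        for l in range(i + 1, n):
            if labels[i] == labels[l]:
                A[i, l] += 1
                A[l, i] += 1
    return A

# The five-stream example above: P1 = {(Y1, Y2), (Y3, Y4, Y5)}
A = np.zeros((5, 5))
update_cooccurrence(A, [0, 0, 1, 1, 1])
```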
We can look at $A$ as a proximity matrix whose cells record the pairwise similarity
among streams.
For each couple of subsequences $Y_i^w, Y_l^w$ in the same cluster, updating the
cells $a_{i,l}$ and $a_{l,i}$ of $A$ with the value 1 corresponds to assigning the maximum
similarity value to that couple of subsequences.
When this procedure is performed over a wide number of windows, it incrementally
measures the consensus between each couple of streams: the higher the consensus,
the more similar the streams.
According to this procedure, however, if two subsequences belong to two different
clusters of a local partition, their similarity is always considered minimal, with
the value 0 set in the corresponding cells of $A$.

In order to deal with this issue, we improve the previous graph updating approach
by graduating the similarities between the couples of streams belonging to
different clusters, instead of considering them maximally dissimilar.
To reach this aim, we define the cluster radius and the membership function.
Definition 1. Let $d(\cdot,\cdot)$ be a distance function between two subsequences. The
radius $rad_\kappa^w$ of the cluster $C_\kappa^w$ is $\max_{Y_i^w \in C_\kappa^w} d(Y_i^w, b_\kappa^w)$.

Starting from the cluster radius $rad_\kappa^w$, the membership function of a subsequence
$Y_l^w \notin C_\kappa^w$ to a cluster $C_\kappa^w$ is given by:

$$mf_{l,\kappa} = \frac{rad_\kappa^w}{d(Y_l^w, b_\kappa^w)} \qquad (1)$$

$mf_{l,\kappa}$ ranges from 0 to 1, since it is the ratio between the radius of the cluster $C_\kappa^w$
and the distance from the prototype $b_\kappa^w$ of a subsequence $Y_l^w$ not belonging to the
cluster.
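In code, the membership function of Eq. (1) is just a ratio; a small illustrative sketch with made-up values:

```python
def membership(radius, dist_to_prototype):
    """Eq. (1): graduated similarity of a subsequence to a cluster
    it does not belong to."""
    return radius / dist_to_prototype

# e.g. a cluster of radius 2.0 and a non-member at distance 5.0
# from its prototype gets similarity 0.4
membership(2.0, 5.0)
```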
We update the cells of $A$ using the value of the membership function computed
for each couple of subsequences belonging to different clusters.
With this adjustment, the whole updating process becomes the following:

Algorithm 1 Graph updating strategy

for all $Y_i^w$, $i = 1, \ldots, n$ do
  detect the index $\kappa$ of the cluster to which $Y_i^w$ belongs
  for all $Y_l^w \in C_\kappa^w$, $l \neq i$ do
    add the value 1 to the graph edges $a_{i,l}$ and $a_{l,i}$
  end for
  for all $\gamma \neq \kappa$, $\gamma = 1, \ldots, K$ do
    compute $mf_{i,\gamma}$
    detect all the subsequences $Y_l^w \in C_\gamma^w$
    add the value $mf_{i,\gamma}$ to the cells $a_{i,l}$ and $a_{l,i}$ of $A$
  end for
end for
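Below is a Python sketch of one window of this updating strategy. The Euclidean distance, and the choice of averaging the two membership values of a couple so that $A$ stays symmetric, are assumptions of the sketch:

```python
import numpy as np

def update_graph(A, window, labels, prototypes):
    """One window of the graph updating strategy: full similarity for
    couples in the same local cluster, graduated similarity (Eq. 1)
    otherwise."""
    labels = np.asarray(labels)
    K = len(prototypes)
    # Distance of every subsequence to every prototype (Euclidean here)
    dist = np.linalg.norm(window[:, None, :] - prototypes[None, :, :], axis=2)
    # Cluster radii: largest member-to-prototype distance (Definition 1)
    radius = np.array([dist[labels == k, k].max() for k in range(K)])
    n = len(labels)
    for i in range(n):
        for l in range(i + 1, n):
            if labels[i] == labels[l]:
                w = 1.0  # same local cluster: maximum similarity
            else:
                # mf of each subsequence w.r.t. the other's cluster,
                # averaged to keep A symmetric (an assumption)
                w = 0.5 * (radius[labels[l]] / dist[i, labels[l]] +
                           radius[labels[i]] / dist[l, labels[i]])
            A[i, l] += w
            A[l, i] += w
    return A
```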

The proposed procedure, repeated on each window, allows us to obtain the pairwise
similarities among streams in an incremental way. This is achieved without
computing the similarity between each couple of streams, because only the distances
computed in the allocation step of the last iteration of the DCA are used.

3 Off-line graph partitioning by spectral clustering

In order to get a global partition $P$ into $C$ clusters of the set of streams $S$ analyzed
over the sequence of windows $w_f$ ($f = 1, \ldots, F$), we have to run a proper clustering
algorithm on the proximity matrix $A$.

The value $F$ of processed windows is chosen according to a user clustering
demand, or according to a time schedule which stops the updating of $A$ at
prefixed time points, so as to get clustering results over several time horizons.
Usual techniques for clustering proximity graphs are based on spectral clustering
procedures [8].
The problem of finding a partition of $S$ where the elements of each group are
similar to one another, while elements in different groups are dissimilar, can be formulated
in terms of similarity graphs: we want to find a partition of the graph such that the
edges between different groups have a very low weight (similarity) and the edges
within a group have a high weight (similarity).
Note that, according to [1], similar results can be obtained using a non-metric
multidimensional scaling on the adjacency matrix $A$.
Definition 2. Let $v_i \in V$ be a vertex of the graph. The degree of $v_i$ is $d_i = \sum_{l=1}^{n} a_{i,l}$.
The degree matrix $D$ is defined as the diagonal matrix with the degrees $d_1, \ldots, d_n$ on the
diagonal.

Starting from the degree matrix $D$ it is possible to define the Laplacian matrix
and the normalized Laplacian matrix:

Definition 3. Let $L = D - A$ be the unnormalized Laplacian matrix. The normalized
Laplacian matrix $L_{norm}$ is defined as:

$$L_{norm} = D^{-1/2} L D^{-1/2} \qquad (2)$$

The algorithm schema is the following:

Algorithm 2 Spectral clustering algorithm

Compute $L$ and $L_{norm}$
Compute the first $C$ (smallest) eigenvalues of $L_{norm}$
Let $\tilde{A} \in \Re^{n \times C}$ be the matrix containing the $C$ eigenvectors associated with these eigenvalues
Get the partition $P$ of $S$ into $C$ clusters using the k-means algorithm on the multidimensional
points defined by the rows of $\tilde{A}$
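A compact sketch of this off-line step, assuming NumPy/SciPy/scikit-learn and strictly positive vertex degrees:

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_partition(A, C):
    """Partition the n streams from the accumulated proximity matrix A."""
    d = A.sum(axis=1)                       # vertex degrees (Definition 2)
    L = np.diag(d) - A                      # unnormalized Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_norm = D_inv_sqrt @ L @ D_inv_sqrt    # normalized Laplacian, Eq. (2)
    vals, vecs = eigh(L_norm)               # eigenvalues in ascending order
    A_tilde = vecs[:, :C]                   # eigenvectors of the C smallest
    return KMeans(n_clusters=C, n_init=10).fit_predict(A_tilde)
```

The returned label vector defines the global partition $P$ of the streams over the $F$ processed windows.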

4 Main results

To evaluate the performance of the proposed strategy we have compared the clustering
performance of the on-line clustering strategy with that of the k-means algorithm run on
the stored data, using highly evolving datasets.
Two datasets have been used in the evaluation process.
The first is made up of 76 highly evolving time series, downloaded from Yahoo
finance, which represent the daily closing prices of randomly chosen stocks. Each time
series is made up of 4000 observations.

The second is made up of 179 highly evolving time series which collect the daily
electricity supply at several locations in Australia. Each time series is made up of 3288
recordings.
We have considered some of the most common indexes to assess the effectiveness
of the proposal (see [7]). The Rand index (RI) and the Adjusted Rand index (ARI)
are used as external validity indexes, to evaluate the degree of consensus between
the partition obtained by our proposal and the partition obtained using the k-means.
Moreover, the Calinski-Harabasz index (CH), the Davies-Bouldin index (DB) and
the Silhouette Width criterion (SW) are used as internal validity indexes, to evaluate
the compactness of the clusters and their separation.
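All five indexes are available in recent versions of scikit-learn, so the comparison can be sketched as follows, assuming `X` holds the stored series as row vectors and the two label vectors come from the two procedures:

```python
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, rand_score,
                             silhouette_score)

def compare_partitions(X, labels_online, labels_kmeans):
    """X: stored series as rows; labels_*: the two partitions to compare."""
    return {
        "RI":  rand_score(labels_kmeans, labels_online),          # external
        "ARI": adjusted_rand_score(labels_kmeans, labels_online), # external
        "CH":  calinski_harabasz_score(X, labels_online),         # internal
        "DB":  davies_bouldin_score(X, labels_online),            # internal
        "SW":  silhouette_score(X, labels_online),                # internal
    }
```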
In order to perform the testing, we need to set the following input parameters for
the proposed procedure:
• the number of clusters $K$ of each local partition $P^w$
• the final number of clusters $C$ of the partition of $S$
• the size $w_s$ of each temporal window
For the k-means we only need to set the number of clusters $C$.
The Euclidean distance is used as the dissimilarity function in both procedures.
With this choice, the DCA on the window data is a classical k-means where the
prototypes are the averages of the data in each cluster.
The parameter $C$ has been set, for the first and second datasets, by running the k-means
algorithm with $C = 2, \ldots, 8$. For each value of $C$ we have computed the total
within-cluster deviance.
We have chosen $C = 4$ for the first dataset and $C = 3$ for the second dataset,
since these are the values which provide the highest improvement in cluster
homogeneity.
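A sketch of this selection step, again assuming `X` holds the stored series as row vectors:

```python
from sklearn.cluster import KMeans

def choose_C(X, candidates=range(2, 9)):
    """Total within-cluster deviance (k-means inertia) for each candidate C;
    C is then chosen where the improvement in homogeneity flattens out."""
    return {C: KMeans(n_clusters=C, n_init=10).fit(X).inertia_
            for C in candidates}
```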
By evaluating, through the mentioned indexes, the partitioning quality for several
values of $w_s$, we can state that the choice of the window size does not impact the
cluster homogeneity. As a consequence, this parameter can be chosen according to
the kind of summarization required. For example, if we need to detect a set of
prototypes for each week of data, we choose a window size which frames the
observations of one week.
In our tests, we have used windows made up of 30 observations for both datasets.
The third required input parameter, $K$, does not strongly impact the clustering
quality either. We have tested this by evaluating the behavior of the Calinski-Harabasz
index and of the Davies-Bouldin index for $K = 2, \ldots, 10$.
In the following table we show the main results of the evaluated indexes:
Looking at the values of the internal validity indexes, computed for our proposal
and for the k-means on the stored data, it emerges that the homogeneity of the clusters
and their separation are quite similar.
Moreover, the values of the Rand index and of the Adjusted Rand index highlight
the strength of the consensus between the obtained partitions.

                 On-line clustering                  K-means clustering
Dataset          DB     CH      SW     RI    ARI     DB     CH      SW
Power supply     2.049  26.013  0.223  0.92  0.83    2.172  26.504  0.229
Financial data   1.793  14.39   0.270  0.88  0.80    1.754  15.594  0.321

Table 1 External and internal validity indices

5 Conclusions and perspectives

In this paper we have introduced a new strategy for clustering highly evolving data
streams, based on processing the incoming data in an incremental way without requiring
their storage. This strategy compares favorably to the standard k-means performed
on the stored data, as shown on the test datasets using several standard validity
indexes. A further development will be to introduce a strategy for monitoring the
evolution of the clustering structure over time.

References

1. Bavaud, F.: Spectral clustering and multidimensional scaling: a unified view. In: Batagelj, V.,
Bock, H.-H., Ferligoj, A., Ziberna, A. (eds) Data Science and Classification, pp. 131-139.
New York (2006)
2. Beringer, J., Hüllermeier, E.: Online clustering of parallel data streams. Data and Knowledge
Engineering, 58(2), 180-204 (2006)
3. Dai, B.-R., Huang, J.-W., Yeh, M.-Y., Chen, M.-S.: Adaptive clustering for multiple evolving
streams. IEEE Transactions on Knowledge and Data Engineering, 18(9) (2006)
4. De Carvalho, F., Lechevallier, Y., Verde, R.: Clustering methods in symbolic data analysis. In:
Banks, D., House, L., McMorris, F.R., Arabie, P., Gaul, W. (eds) Classification, Clustering, and
Data Mining Applications. Studies in Classification, Data Analysis, and Knowledge Organization,
pp. 299-317. Springer, Berlin (2004)
5. Diday, E.: La méthode des nuées dynamiques. Revue de Statistique Appliquée, 19(2), 19-34
(1971)
6. Keogh, E., Lin, J., Truppel, W.: Clustering of time series subsequences is meaningless: implications
for past and future research. In: Proceedings of the 3rd IEEE International Conference
on Data Mining, Melbourne, FL, Nov 19-22, pp. 115-122 (2003)
7. Maulik, U., Bandyopadhyay, S.: Performance evaluation of some clustering algorithms and
validity indices. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(12),
1650-1654 (2002)
8. von Luxburg, U.: A tutorial on spectral clustering. Statistics and Computing, 17(4). Springer
(2007)

∗ This paper has been supported by COST Action IC0702 and by the project "Magda: una
piattaforma ad agenti mobili per il Grid Computing" (Chair: Prof. Beniamino Di Martino).
