Pattern Recognition Letters: Krista Rizman Žalik
Article info
Article history:
Received 29 March 2007
Received in revised form 24 December 2007
Available online 4 March 2008
Communicated by L. Heutte
Keywords:
Clustering analysis
k-Means
Cluster number
Cost-function
Rival penalized
Abstract
This paper introduces the k′-means algorithm, which performs correct clustering without pre-assigning the exact number of clusters. This is achieved by minimizing a suggested cost-function that extends the mean-square-error cost-function of k-means. The algorithm consists of two separate steps. The first is a pre-processing procedure that performs initial clustering and assigns at least one seed-point to each cluster. During the second step, the seed-points are adjusted to minimize the cost-function. The algorithm automatically penalizes any possible winning chances for all rival seed-points in subsequent iterations. When the cost-function reaches a global minimum, the correct number of clusters is determined and the remaining seed-points are located near the centres of the actual clusters. The simulated experiments described in this paper confirm the good performance of the proposed algorithm.
© 2008 Elsevier B.V. All rights reserved.
1. Introduction
Clustering is a search for hidden patterns that may exist in datasets. It is a process of grouping data objects into disjoint clusters so that the data in each cluster are similar to one another, yet dissimilar to the data in other clusters. Clustering techniques are applied in many areas, such as data analysis, pattern recognition, image processing, and information retrieval.
k-Means is a typical clustering algorithm (MacQueen, 1967). It is attractive in practice because it is simple and generally very fast. It partitions the input dataset into k clusters. Each cluster is represented by an adaptively-changing centroid (also called a cluster centre), starting from some initial values named seed-points. k-Means computes the squared distances between the inputs (also called input data points) and the centroids, and assigns each input to the nearest centroid. An algorithm for clustering N input data points x_1, x_2, ..., x_N into k disjoint subsets C_i, i = 1, ..., k, each containing n_i data points, 0 < n_i < N, minimizes the following mean-square-error (MSE) cost-function:
J_{MSE} = \sum_{i=1}^{k} \sum_{x_t \in C_i} \| x_t - c_i \|^2 \qquad (1)
I(x_t, i) = \begin{cases} 1 & \text{if } i = \arg\min_{j} \| x_t - c_j \|^2, \; j = 1, \ldots, k \\ 0 & \text{otherwise} \end{cases} \qquad (2)
Here c_1, c_2, ..., c_k are called cluster centres; they are learned by the following steps:
Step 1: Initialize the k cluster centres c_1, c_2, ..., c_k with some initial values called seed-points, chosen by random sampling.
For each input data point x_t and all k clusters, repeat Steps 2 and 3 until all centres converge.
Step 2: Calculate the cluster membership function I(x_t, i) by Eq. (2), assigning each input data point to the cluster whose centre is closest to that point.
Step 3: For each of the k cluster centres, set c_i to be the centre of mass of all points in cluster C_i.
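For concreteness, these steps can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the function name, iteration cap, and tolerance are arbitrary choices, and the inputs are assumed to be the rows of a NumPy array.

```python
# Minimal k-means sketch (illustration only): X is an (N, d) NumPy array.
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, rng=None):
    rng = np.random.default_rng(rng)
    # Step 1: seed-points chosen by random sampling from the inputs.
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Step 2: membership I(x_t, i) of Eq. (2) -- the nearest centre wins.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: each centre moves to the centre of mass of its cluster.
        new_centres = np.array([X[labels == i].mean(axis=0)
                                if np.any(labels == i) else centres[i]
                                for i in range(k)])
        if np.abs(new_centres - centres).max() < tol:  # converged
            centres = new_centres
            break
        centres = new_centres
    return centres, labels
```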
Although k-means has been widely used in data analysis, pattern recognition and image processing, it has three major limitations:
(1) The number of clusters must be known in advance and fixed.
(2) The results of the k-means algorithm depend on the initial cluster centres (initial seed-points).
(3) The algorithm suffers from the dead-unit problem.
The major limitation of the k-means algorithm is that the number of clusters must be pre-determined and fixed. Selecting the appropriate number of clusters is critical, and requires a priori knowledge about the data or, in the worst case, guessing the number of clusters. When the input number of clusters (k) equals the real number of clusters (k′), the k-means algorithm correctly discovers all clusters, as shown in Fig. 1, where cluster centres are marked by squares. Otherwise, it gives incorrect clustering results, as illustrated in Fig. 2a–c. When clustering real data, the number of clusters is unknown in advance and has to be estimated. Finding the correct number of clusters is usually performed over many clustering runs with different numbers of clusters.
The performance of the k-means algorithm depends on the initial cluster centres (initial seed-points), and the final partition depends on this initial configuration. Some research has addressed this problem by proposing algorithms for computing initial cluster centres for k-means clustering (Khan and Ahmad, 2004; Redmond and Heneghan, 2007). Genetic algorithms have been developed for selecting centres to seed the popular k-means method (Laszlo and Mukherjee, 2007). Steinley and Brusco (2007) evaluated twelve procedures proposed in the literature for initializing k-means clustering and introduced recommendations for best practice; they recommended the method of multiple random starting-points for general use. In general, initial cluster centres are selected randomly. Their studies assume that the number of clusters is known in advance. They conclude that even the best strategy for initializing cluster centres and minimizing the mean-square-error cost-function does not lead to the best dataset partition.
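The multiple-random-starts recommendation can be illustrated with a short sketch, again not from the paper, that reuses the kmeans() helper above and keeps the restart with the lowest J_MSE; the restart count is an arbitrary choice.

```python
# Multiple random starting-points: keep the partition with the lowest J_MSE.
import numpy as np

def kmeans_restarts(X, k, n_restarts=10):
    best = None
    for seed in range(n_restarts):
        centres, labels = kmeans(X, k, rng=seed)
        j_mse = ((X - centres[labels]) ** 2).sum()   # Eq. (1)
        if best is None or j_mse < best[0]:
            best = (j_mse, centres, labels)
    return best[1], best[2]
```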
Fig. 2. k-Means produces wrong clusters for k = 1 (a), k = 2 (b) and k = 4 (c) on the same dataset as in Fig. 1, which consists of three clusters; the black squares denote the locations of the converged cluster centres.
The information uncertainty of a partition into k clusters is defined as

J_I = -E \sum_{i=1}^{k} p(C_i) \log_2 p(C_i) \qquad (3)

subject to

\sum_{i=1}^{k} p(C_i) = 1, \quad 0 \le p(C_i) \le 1, \quad i = 1, \ldots, k \qquad (4)
Here p(C_i) is the probability that an input data point belongs to cluster C_i. E is a constant that merely reflects the choice of measurement units; it should be on the order of the range of the point coordinates. The magnitude of the coordinates does not matter, because only point distances are relevant. The setting of the parameter E is discussed and experimentally verified in Sections 4 and 5.
In view of the above considerations, we were motivated to construct a cost-function composed of the mean-square-error J_MSE and the information uncertainty J_I:

J = J_I + J_{MSE} \qquad (5)
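Under the hard assignments of Eq. (2), this combined cost can be evaluated as in the following sketch. It assumes the entropy form of Eq. (3) and estimates the cluster probabilities as p(C_i) = n_i / N; these estimates and the helper names are ours, not the paper's.

```python
# Combined cost J = J_I + J_MSE (Eqs. (1), (3) and (5)) for a hard partition.
import numpy as np

def cost(X, centres, labels, E):
    n = np.bincount(labels, minlength=len(centres)).astype(float)
    p = n / n.sum()                     # p(C_i), summing to 1 as in Eq. (4)
    nz = p > 0                          # empty clusters contribute no entropy
    J_I = -E * (p[nz] * np.log2(p[nz])).sum()
    J_MSE = ((X - centres[labels]) ** 2).sum()
    return J_I + J_MSE
```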
Fig. 3. Dataset with 800 data objects clustered into four clusters, and the values of the functions J_I, J_MSE and J_I + J_MSE for cluster numbers k = 1, ..., 9.
The corresponding data metric between an input data point x_t and a cluster C_i, subject to the constraint

\sum_{i=1}^{k} p(C_i) = 1, \quad 0 \le p(C_i) \le 1, \quad i = 1, \ldots, k

is

d_m(x_t, i) = \| x_t - c_i \|^2 - E \log_2 p(C_i) \qquad (6)
We assign an input data point x_t to cluster C_i if the cluster membership function I(x_t, i) of Eq. (7) equals 1:

I(x_t, i) = \begin{cases} 1 & \text{if } i = \arg\min_{j} d_m(x_t, j), \; j = 1, \ldots, k \\ 0 & \text{otherwise} \end{cases} \qquad (7)
The input data point x_t affects the cluster centre of the winning cluster C_i. The winner's centre is modified so as to also account for the input data point x_t, and the term E log_2 p(C_i) in the data metric (Eq. (6)) automatically decreases for each rival centre, because its p(C_i) decreases while the sum of all probabilities (p(C_i), i = 1, ..., k) remains 1. Since this term is subtracted in Eq. (6), the metric of a rival centre grows, so the rival cluster centres are automatically penalized in the sense of their winning chance. Such penalization can reduce the winning chance of a rival cluster centre to zero. This rival-penalized mechanism is briefly described in the next section.
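A minimal sketch of one such competitive assignment is given below. The online update of the per-cluster counts, and hence of p(C_i), is our interpretation of how the winner is rewarded while the rivals lose winning chance.

```python
# One competitive assignment using the data metric of Eq. (6):
# dm(x_t, i) = ||x_t - c_i||^2 - E * log2 p(C_i).
import numpy as np

def assign_one(x_t, centres, counts, E):
    # counts[i] holds n_i from the current partition (all assumed > 0 at start).
    p = counts / counts.sum()
    with np.errstate(divide="ignore"):            # log2(0) -> -inf, dm -> +inf
        dm = ((centres - x_t) ** 2).sum(axis=1) - E * np.log2(p)
    winner = int(dm.argmin())                     # Eq. (7): I(x_t, winner) = 1
    counts[winner] += 1                           # the winner's p(C_i) grows;
    return winner                                 # every rival's p(C_i) shrinks
```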
The minimization of the information uncertainty J_I allocates the proper number of clusters to the data points, while the minimization of J_MSE makes the clustering of the input data possible. The values of both functions J_MSE and J_I over nine cluster numbers (k), for a dataset of cardinality 800 drawn from four Gaussian distributions, are shown in Fig. 3. The nodes on the curves in Fig. 3 denote the global minimum values of the cost-functions J_I and J_MSE and of their sum J for the various cluster numbers (k). The global minimum of the sum of both functions corresponds to the number of actual clusters (k = k′).
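For comparison with Fig. 3, one could scan several values of k with plain k-means and evaluate J for each, as in this brief sketch built on the kmeans() and cost() helpers above; the k′-means algorithm of Section 4 avoids such a scan.

```python
# Scan k = 1..9 as in Fig. 3 and pick the k with the minimal J = J_I + J_MSE.
def best_k(X, E, k_range=range(1, 10), rng=0):
    scores = {k: cost(X, *kmeans(X, k, rng=rng), E) for k in k_range}
    return min(scores, key=scores.get), scores
```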
A cluster with more elements has a higher probability p(C_i), which makes the second part of the proposed metric (Eq. (6)) smaller. Suppose that the first cluster has fewer elements than the second one, n_0 < n_1. During data scanning, if the centre c_1 of the second cluster with more elements wins when adapting to the input data point x_t, it moves towards the first cluster centre c_0 and, consequently, the separating line moves towards the left, as shown in Fig. 4b. Region 1 of the first cluster shrinks, while region 2 of the second cluster expands towards the left. The same repeats throughout the subsequent iterations for points near or on the separating line, until c_1 gradually converges to the actual cluster centre through minimization of the data metric d_m (Eq. (6)), while the centre c_0 moves towards the cluster's boundary. The first (rival) cluster has fewer and fewer elements, until the number of its elements decreases to 0 and its competition chance reaches zero. From Eq. (6) we see that the data metric d_m then becomes infinite, and the cluster centre c_0 becomes dead, without any chance of winning again. When a cluster centre c_i is far away from the input data, it lies on one side of the input data and cannot be the winner for any new sample; the change of the cluster centre, Δc_i, points away from the sample data. If every cluster centre moved away from the sample dataset, the J_MSE cost-function would grow larger and larger. This contradicts the fact that the algorithm decreases the function J_MSE, and proves that some centres remain within the sample data.
The analysis of multiple clusters is more complicated, because of the interactive effects among clusters. In Section 5, various datasets are tested to demonstrate the convergence behaviour of the data metric, which automatically penalizes the winning chances of all rival cluster centres in subsequent iterations, while the winning cluster centres move towards the actual cluster centres.
It is clear from Section 3 that the proposed metric automatically penalizes all rival cluster centres in the competition to acquire a new point into the cluster. We propose a k′-means algorithm that minimizes the proposed cost-function and data metric. It has two phases. In the first phase, we allocate k cluster centres in such a way that each actual cluster contains one or more cluster centres; we suppose that the input number of cluster centres k is greater than the real number of clusters k′. In the second phase, all rival cluster centres within the same cluster are pushed out of the cluster, so that each ends up representing a cluster with no elements. The detailed k′-means algorithm, consisting of two completely separated phases, is given as follows.
For the first phase, we use the k-means algorithm as an initial clustering to allocate the k cluster centres so that each actual cluster has at least one centre. We suppose that the input parameter, the number of clusters, is greater than the actual number of clusters present in the data: k > k′.
Step 1: Randomly initialize the k cluster centres in the input dataset.
Fig. 4. The clustering process for one Gaussian distribution with the input parameter (number of clusters) k = 2 after: (a) 10 iterations, (b) 15 iterations and (c) 20 iterations.

In the second phase, two steps are iterated over all input data points: each point x_t is first assigned to a cluster by the membership function I(x_t, i) of Eq. (7), and each cluster centre c_i is then set to the centre of mass of the points assigned to it:

c_i = \frac{1}{|C_i|} \sum_{x_t \in C_i} x_t \qquad (8)
Table 1
Parameters of dataset 1 (number of samples N = 470)

Cluster number i    N_i    c_i           r_i          a_i
1                   100    (0.5, 0.5)    (0.1, 0.1)   0.213
2                    50    (1, 1)        (0.1, 0.1)   0.106
3                   160    (1.5, 1.5)    (0.2, 0.1)   0.25
4                   160    (1.4, 2.3)    (0.4, 0.2)   0.34
These two steps are repeatedly implemented until all cluster centres remain unchanged for all input data points, or change by less than some threshold value. At the end, k′ clusters are discovered, where k′ is the number of actual clusters. The initial seed-point cluster centres converge towards the centroids of the input data clusters, and all extra seed-points (the difference between k and k′) are driven away from the dataset.
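Putting the pieces together, a compact batch-style sketch of the two-phase procedure follows; it reuses the kmeans() helper above. The batch update schedule is a simplification of the point-by-point scanning described here, and in this simplification dead centres simply stay in place and never win again (p(C_i) = 0 makes their metric infinite), rather than being driven away.

```python
# Batch-style sketch of the two-phase k'-means (illustration only).
import numpy as np

def k_prime_means(X, k, E, n_iter=100, tol=1e-6, rng=None):
    # Phase 1: ordinary k-means with k > k' seed-points.
    centres, labels = kmeans(X, k, rng=rng)
    for _ in range(n_iter):
        p = np.bincount(labels, minlength=k) / len(X)
        with np.errstate(divide="ignore"):        # empty cluster: dm = +inf
            dm = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2) \
                 - E * np.log2(p)[None, :]        # Eq. (6) for every (x_t, i)
        labels = dm.argmin(axis=1)                # Eq. (7)
        new_centres = centres.copy()              # dead centres stay in place
        for i in range(k):                        # Eq. (8): centre of mass
            if np.any(labels == i):
                new_centres[i] = X[labels == i].mean(axis=0)
        done = np.abs(new_centres - centres).max() < tol
        centres = new_centres
        if done:
            break
    kept = np.unique(labels)                      # the surviving k' clusters
    relabel = {c: i for i, c in enumerate(kept)}
    return centres[kept], np.array([relabel[c] for c in labels])
```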
The number of recognized clusters k′ is implicitly defined by the parameter E (Eq. (6)). E merely reflects the choice of measurement units and should be on the order of the range of the point coordinates; the magnitude of the coordinates does not matter, because only point distances are relevant. However, experiments have shown that a wide interval exists for E within which a consistent number of actual clusters is discovered in the sample dataset. The heuristic for the parameter E is given in Eq. (9):
E \in [a, 3a], \quad a = (r + d)/2 \qquad (9)
where r is the average radius of the clusters after the first phase of the algorithm, and d is the smallest distance between two cluster centres that is greater than 3r. For stronger clustering, one can double the parameter E. If E is smaller than suggested, the algorithm cannot push the redundant cluster centres away from the input regions. On the other hand, if E is too large, the algorithm pushes almost all cluster centres away from the input data.
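The heuristic of Eq. (9) might be computed as in the following sketch. Taking a cluster's radius as the mean distance of its points to the centre, and the fallback used when no pair of centres is farther apart than 3r, are our own assumptions; the paper does not prescribe them.

```python
# Heuristic of Eq. (9): E in [a, 3a] with a = (r + d) / 2 after phase 1.
import numpy as np

def choose_E(X, centres, labels):
    k = len(centres)                              # assumes k >= 2 after phase 1
    # Cluster radius taken as the mean point-to-centre distance (our choice).
    radii = [np.linalg.norm(X[labels == i] - centres[i], axis=1).mean()
             for i in range(k) if np.any(labels == i)]
    r = float(np.mean(radii))
    dists = [np.linalg.norm(centres[i] - centres[j])
             for i in range(k) for j in range(i + 1, k)]
    far = [t for t in dists if t > 3 * r]         # centres of distinct clusters
    d = min(far) if far else max(dists)           # smallest distance > 3r
    a = (r + d) / 2
    return 2 * a                                  # any value in [a, 3a] will do
```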
5. Experimental results
Three simulated experiments were carried out to demonstrate the performance of the k′-means algorithm. The algorithm was also applied to the clustering of a real dataset. The stopping threshold value was set to 10^{-6}.
5.1. Experiment 1
Experiment 1 used 470 points from a mixture of four Gaussian distributions. The detailed parameters of the input dataset are given in Table 1, where N_i, c_i, r_i and a_i denote the number of samples, the mean vector, the standard deviation, and the mixing proportion, respectively. The input number of clusters k was set to 10. Fig. 5a shows all 10 clusters and centres after the first phase of the algorithm; each cluster has at least one seed-point. After the second phase, only four seed-points remained, denoting the four cluster centres. As shown in Fig. 5b, the data form four well-separated clusters. The parameters of the four recognized clusters are given in Table 2.
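For reference, an input resembling dataset 1 can be regenerated from Table 1 as in this sketch, assuming independent Gaussian coordinates with the listed per-axis standard deviations; the function name and seeding are ours.

```python
# Regenerate an input resembling dataset 1 from the parameters of Table 1.
import numpy as np

def make_dataset1(rng=None):
    rng = np.random.default_rng(rng)
    params = [  # (N_i, centre c_i, per-axis deviation r_i) from Table 1
        (100, (0.5, 0.5), (0.1, 0.1)),
        (50,  (1.0, 1.0), (0.1, 0.1)),
        (160, (1.5, 1.5), (0.2, 0.1)),
        (160, (1.4, 2.3), (0.4, 0.2)),
    ]
    return np.vstack([rng.normal(c, s, size=(n, 2)) for n, c, s in params])
```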
Fig. 5. Clusters discovered for k = 10: (a) by the k-means algorithm and (b) by the suggested k′-means algorithm.
Table 2
The four discovered clusters in experiment 1

Cluster number i    N_i    c_i
1                   100    (0.496, 0.501)
2                    50    (0.993, 0.985)
3                   167    (1.483, 1.51)
4                   155    (1.356, 2.303)

Table 4
Predicted number of components for different standard deviations

Number of components    r_x = r_y = 0.67    r_x = r_y = 1    r_x = r_y = 1.2    r_x = r_y = 1.33
1                        0                   3               14                 45
2                        0                   0                0                  0
3 (true)                99                  97               86                 55
4                        1                   0                0                  0
5                        0                   0                0                  0
5.2. Experiment 2
In Experiment 2, 800 data points were used, also from a mixture of four Gaussians. Three sets of data, S1, S2, and S3, were generated with different degrees of overlap among the clusters: the sets had different variances of the Gaussian distributions, and the number of points per cluster was controlled by the mixing proportions a_i. The detailed parameters for these datasets are given in Table 3. In sets S1 and S2, the data have a symmetric structure and each cluster has the same number of elements. For such datasets, when the clusters are separated to a certain degree, the algorithm usually converges correctly.
It can be observed from Fig. 6 that all three datasets resulted in correct convergence. The input number of cluster centres was set to 7. Four cluster centres were located around the centres of the four actual clusters, while the three extra cluster centres were driven far away from the data. The results show that this algorithm can also discover clusters that are not well-separated, as in dataset S3.
5.3. Experiment 3
The k′-means method was compared to previous model selection criteria and Gaussian mixture estimation methods, including MDL and AIC. The numbers of components predicted for different standard deviations are given in Table 4.
Table 3
Parameters of the three datasets for experiment 2

Dataset    Cluster i    N_i    c_i       r_i          a_i
S1         1            200    (1, 2)    (0.2, 0.2)   0.25
           2            200    (2, 1)    (0.2, 0.2)   0.25
           3            200    (3, 2)    (0.2, 0.2)   0.25
           4            200    (2, 3)    (0.2, 0.2)   0.25
S2         1            200    (1, 2)    (0.4, 0.4)   0.250
           2            200    (2, 1)    (0.4, 0.4)   0.250
           3            200    (3, 2)    (0.4, 0.4)   0.250
           4            200    (2, 3)    (0.4, 0.4)   0.250
S3         1            400    (1, 2)    (0.4, 0.4)   0.364
           2            400    (2, 1)    (0.4, 0.4)   0.364
           3            150    (3, 2)    (0.4, 0.4)   0.136
           4            150    (2, 3)    (0.4, 0.4)   0.136
Fig. 6. The three sets of input data used in Experiment 2 and the clusters discovered by the proposed k′-means algorithm.