
Pattern Recognition Letters 29 (2008) 1385–1391


An efficient k′-means clustering algorithm


Krista Rizman Žalik *
University of Maribor, Faculty of Natural Sciences and Mathematics, Department of Mathematics and Computer Science, Koroška cesta 160, 2000 Maribor, Slovenia
* Tel.: +386 02 229 38 21; fax: +386 02 251 81 80. E-mail address: krista.zalik@uni-mb.si

Article info

Article history:
Received 29 March 2007
Received in revised form 24 December 2007
Available online 4 March 2008
Communicated by L. Heutte
Keywords:
Clustering analysis
k-Means
Cluster number
Cost-function
Rival penalized

Abstract
This paper introduces a k′-means algorithm that performs correct clustering without pre-assigning the exact number of clusters. This is achieved by minimizing a suggested cost-function. The cost-function extends the mean-square-error cost-function of k-means. The algorithm consists of two separate steps. The first is a pre-processing procedure that performs initial clustering and assigns at least one seed point to each cluster. During the second step, the seed-points are adjusted to minimize the cost-function. The algorithm automatically penalizes any possible winning chances for all rival seed-points in subsequent iterations. When the cost-function reaches a global minimum, the correct number of clusters is determined and the remaining seed-points are located near the centres of the actual clusters. The simulated experiments described in this paper confirm the good performance of the proposed algorithm.
© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Clustering is a search for hidden patterns that may exist in datasets. It is a process of grouping data objects into disjoint clusters, so that the data in each cluster are similar to one another, yet different from the data in other clusters. Clustering techniques are applied in many application areas, such as data analysis, pattern recognition, image processing, and information retrieval.
k-Means is a typical clustering algorithm (MacQueen, 1967). It
is attractive in practice, because it is simple and it is generally very
fast. It partitions the input dataset into k clusters. Each cluster is
represented by an adaptively-changing centroid (also called cluster
centre), starting from some initial values named seed-points.
k-Means computes the squared distances between the inputs (also
called input data points) and centroids, and assigns inputs to the
nearest centroid. An algorithm for clustering N input data points
x1, x2, . . . , xN into k disjoint subsets Ci, i = 1, . . . , k, each containing
ni data points, 0 < ni < N, minimizes the following mean-square-error (MSE) cost-function:
J_{MSE} = \sum_{i=1}^{k} \sum_{x_t \in C_i} \| x_t - c_i \|^2 \qquad (1)
where x_t is a vector representing the t-th data point in cluster C_i and c_i is the geometric centroid of cluster C_i. The algorithm thus aims at minimizing a squared-error objective function, in which ||x_t - c_i||² is the chosen distance measure between the data point x_t and the cluster centre c_i. The k-means algorithm assigns an input data point x_t to the ith cluster if the cluster membership function I(x_t, i), defined in Eq. (2), is 1:

I(x_t, i) = \begin{cases} 1, & \text{if } i = \arg\min_{j = 1, \dots, k} \| x_t - c_j \|^2 \\ 0, & \text{otherwise} \end{cases} \qquad (2)
Here c_1, c_2, ..., c_k are the cluster centres, which are learned by the following steps:
Step 1: Initialize the k cluster centres c_1, c_2, ..., c_k with some initial values, called seed-points, using random sampling.
For each input data point x_t and all k clusters, repeat Steps 2 and 3 until all centres converge.
Step 2: Calculate the cluster membership function I(x_t, i) by Eq. (2) and assign each input data point to the cluster whose centre is closest to that point.
Step 3: For all k cluster centres, set c_i to be the centre of mass of all points in cluster C_i.



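For reference, the procedure above can be sketched in a few lines of Python. This is a minimal illustration of the three steps, not the implementation used in the paper; the convergence tolerance and the iteration cap are arbitrary choices:

```python
import numpy as np

def k_means(X, k, max_iter=100, tol=1e-6, seed=None):
    """Minimal k-means sketch: X is an (N, d) array, k the pre-assigned cluster number."""
    rng = np.random.default_rng(seed)
    # Step 1: choose k seed-points by random sampling from the dataset.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: membership I(x_t, i) of Eq. (2) -- each point goes to its nearest centre.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: move every centre to the centre of mass of its cluster.
        new_centres = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centres[i] for i in range(k)])
        if np.linalg.norm(new_centres - centres) < tol:   # centres have converged
            break
        centres = new_centres
    return centres, labels
```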

Although k-means has been widely used in data analysis, pattern recognition and image processing, it has three major limitations:
(1) The number of clusters must be known in advance and fixed.
(2) The results of the k-means algorithm depend on the initial cluster centres (initial seed-points).
(3) The algorithm suffers from the dead-unit problem.


The major limitation of the k-means algorithm is that the number of clusters must be pre-determined and fixed. Selecting the appropriate number of clusters is critical: it requires a priori knowledge about the data or, in the worst case, guessing the number of clusters. When the input number of clusters k is equal to the real number of clusters k′, the k-means algorithm correctly discovers all clusters, as shown in Fig. 1, where the cluster centres are marked by squares. Otherwise, it gives incorrect clustering results, as illustrated in Fig. 2a–c. When clustering real data, the number of clusters is unknown in advance and has to be estimated. Finding the correct number of clusters is usually performed over many clustering runs with different numbers of clusters.
The performance of the k-means algorithm depends on the initial cluster centres (initial seed-points), and the final partition depends on the initial configuration. Some researchers have addressed this problem by proposing algorithms for computing initial cluster centres for k-means clustering (Khan and Ahmad, 2004; Redmond and Heneghan, 2007). Genetic algorithms have been developed for selecting centres to seed the popular k-means method (Laszlo and Mukherjee, 2007). Steinley and Brusco (2007) evaluated twelve procedures proposed in the literature for initializing k-means clustering and introduced recommendations for best practice; they recommended the method of multiple random starting-points for general use. In general, initial cluster centres are selected randomly. An assumption of these studies is that the number of clusters is known in advance. They conclude that even the best initialization strategy for cluster centres, combined with minimization of the mean-square-error cost-function, does not lead to the best dataset partition.

Fig. 1. A dataset with three clusters recognized by the k-means algorithm for k = 3.

Fig. 2. k-Means produces wrong clusters for k = 1 (a), k = 2 (b) and k = 4 (c) for the same dataset as in Fig. 1, which consists of three clusters; the black square denotes the location of the converged cluster centre.

In the late 1980s, it was pointed out that the classical k-means algorithm has the so-called dead-unit, or underutilization, problem (Xu, 1993). A centre initialized far away from the input data points may never win in the process of assigning a data point to the nearest centre; it then stays far away from the input data objects and becomes a dead unit.
Over the last fifteen years, new advanced k-means algorithms have been developed that eliminate the dead-unit problem, for example the frequency-sensitive competitive learning (FSCL) algorithm (Ahalt et al., 1990). A typical strategy is to reduce the learning rates of frequent winners: each cluster centre counts the number of times it wins the competition and reduces its learning rate accordingly, so that a centre which wins too often gradually loses its advantage in the competition (see the sketch below). FSCL solves the dead-unit problem and successfully identifies clusters, but only when the number of clusters is known in advance and appropriately preselected; otherwise, the algorithm performs badly.
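As a rough illustration of the frequency-sensitive idea, the win count of each centre can be used to handicap it in the competition. This is a schematic sketch, not the exact formulation of Ahalt et al. (1990); the learning rate and the win-count weighting are assumptions:

```python
import numpy as np

def fscl_step(x, centres, wins, lr=0.05):
    """One frequency-sensitive competitive update for a single input x.

    centres: (k, d) array; wins: (k,) array of win counts, initialized to 1.
    """
    # Distances are weighted by the win counts, so frequent winners are handicapped.
    d2 = ((centres - x) ** 2).sum(axis=1) * wins
    j = int(np.argmin(d2))               # winner of the competition
    centres[j] += lr * (x - centres[j])  # move the winner towards the input
    wins[j] += 1                         # winning again becomes a little harder
    return j
```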
The selection of a correct cluster number has been approached in two ways. The first invokes heuristic approaches: the clustering algorithm is run many times, with the number of clusters gradually increasing from a certain initial value up to some threshold value that is difficult to set. The second formulates cluster number selection as choosing the component number in a finite mixture model. The earliest method for solving this model selection problem may be to choose the optimal number of clusters by Akaike's information criterion (AIC) or its extensions (Akaike, 1973; Bozdogan, 1987). Other criteria include Schwarz's Bayesian information criterion (BIC) (Schwarz, 1978), the minimum message length (MML) criterion (Wallace and Dowe, 1999) and Bezdek's partition coefficient (PC) (Bezdek, 1981). As reported by Oliver et al. (1996), BIC and MML perform comparably and outperform the AIC and PC criteria. These existing criteria may overestimate or underestimate the cluster number because of the difficulty of choosing an appropriate penalty function. Better results are obtained by a cluster number selection criterion developed from the Ying-Yang machine (Xu, 1997), which, unfortunately, requires laborious computation.
To tackle the problem of appropriate selection of the number of clusters, the rival penalized competitive learning (RPCL) algorithm was proposed (Xu, 1993), which adds a new mechanism to FSCL. The basic idea is that, for each input data point, not only is the cluster centre of the winning cluster modified to adapt to the input data point, but the cluster centre of its rival cluster (the second winner) is also de-learned by a smaller learning rate. Many experiments have shown that RPCL can select the correct cluster number by driving extra cluster centres far away from the input dataset. Although the RPCL algorithm has had success in some applications, such as
colour-image segmentation and image feature extraction, it is rather sensitive to the selection of the de-learning rate (Law and Cheung, 2003; Cheung, 2005; Ma and Cao, 2006). The RPCL algorithm was proposed heuristically. It has been shown that RPCL can be regarded as a fast approximate implementation of a special case of Bayesian Ying-Yang (BYY) harmony learning on a Gaussian mixture (Xu, 1997); its ability to select the number of clusters is provided by the model selection ability of Bayesian Ying-Yang learning. There is still a lack of mathematical theory directly describing the correct convergence behaviour of RPCL, i.e. selecting the correct number of clusters while driving all other unnecessary cluster centres far away from the sample data.
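The rival penalized update can be sketched schematically as follows; the learning and de-learning rates below are arbitrary values chosen for illustration, not those recommended in the RPCL literature:

```python
import numpy as np

def rpcl_step(x, centres, alpha_win=0.05, alpha_rival=0.005):
    """One RPCL update: adapt the winner to x and de-learn the second winner (rival)."""
    d2 = ((centres - x) ** 2).sum(axis=1)
    order = np.argsort(d2)
    winner, rival = int(order[0]), int(order[1])
    centres[winner] += alpha_win * (x - centres[winner])   # learn
    centres[rival] -= alpha_rival * (x - centres[rival])   # de-learn (push away)
    return winner, rival
```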
This paper presents a new k′-means algorithm, which is an extension of k-means without the three major drawbacks stated at the beginning of this section. The algorithm has a mechanism similar to RPCL, in that it performs clustering without predetermining the correct cluster number. The correct convergence of the suggested k′-means algorithm is investigated via a cost-function approach. A special cost-function is suggested, since the k-means cost-function (Eq. (1)) cannot be used for determining the number of clusters, because it decreases monotonically with any increase in the cluster number. It is shown that, when the cost-function is reduced to a global minimum, the correct number of cluster centres converges onto the actual cluster centres, while all other initial centres are driven far away from the input dataset; the corresponding clusters can be neglected, because they are empty.
Section 2 constructs the new cost-function. An analysis of the rival penalized mechanism of the proposed cost-function is presented in Section 3. Section 4 describes the k′-means algorithm for minimizing the proposed cost-function. Section 5 presents the experimental evaluation, and the paper is summarized in Section 6.
2. The cost-function
The k-means algorithm minimizes the mean-square-error cost-function J_MSE (Eq. (1)), which decreases monotonically with any increase in the cluster number. Such a function cannot be used for identifying the correct number of clusters and cannot be used for the RPCL algorithm. This section introduces a new cost-function based on the following two characteristics:


(1) Areas with dense samples strongly attract centres, and
(2) each cluster centre pushes all other cluster centres away, in order to give maximal information about the patterns formed by the input data points. This makes it possible to move extra cluster centres away from the sample data. When a cluster centre is driven away from the sample data, the corresponding cluster can be neglected, because it is empty.
We want to obtain maximal information about the patterns formed by the input data points, and the amount of information each cluster gives us about the dataset can be quantified. Discovering the ith cluster C_i, having n_i elements in a dataset with N elements, gives us the amount of information I(C_i):

I(C_i) = \left| \log(n_i / N) \right| \qquad (3)

This information is a measure of the decrease in uncertainty about the dataset. The logarithm is selected for measuring information since it is additive when independent, unrelated amounts of information are combined for the whole system, e.g. when a cluster is discovered. For a dataset with N elements forming k distinguishable clusters, the amount of information is I(C_1) + I(C_2) + ... + I(C_k).
We have to maximize the amount of information and minimize the uncertainty about the system, J_I (Eq. (4)):
J_I(n_i) = -E \sum_{i=1}^{k} \log_2 p(C_i), \qquad \sum_{i=1}^{k} p(C_i) = 1, \quad 0 \le p(C_i) \le 1, \quad i = 1, \dots, k \qquad (4)
where p(C_i) is the probability that an input data point belongs to cluster C_i. E is a constant and merely fixes the measurement units; it should be of the order of the range of the point coordinates. The magnitude of the coordinates does not matter, because only point distances are relevant. The setting of the parameter E is discussed and experimentally verified in Sections 4 and 5.
In view of the above considerations, we were motivated to construct a cost-function composed of the mean-square-error J_MSE and the information uncertainty:

J = J_I + J_MSE \qquad (5)

The data metric dm used for clustering, which minimizes the above cost-function (Eq. (5)), where C_i is a cluster with centre c_i and x_t is an input data point, is



dm(x_t, C_i) = \| x_t - c_i \|^2 - E \log_2 p(C_i), \qquad \sum_{i=1}^{k} p(C_i) = 1, \quad 0 \le p(C_i) \le 1, \quad i = 1, \dots, k \qquad (6)

We assign an input data point x_t to cluster C_i if the cluster membership function I(x_t, i) of Eq. (7) is 1:

I(x_t, i) = \begin{cases} 1, & \text{if } i = \arg\min_{j = 1, \dots, k} dm(x_t, C_j) \\ 0, & \text{otherwise} \end{cases} \qquad (7)

The input data point x_t affects the cluster centre of cluster C_i. The winner's centre is modified so as to also contain the input data point x_t, and the term E log_2 p(C_i) in the data metric (Eq. (6)) is automatically decreased for the rival centres, because their p(C_i) decreases while the sum of all probabilities p(C_i), i = 1, ..., k, remains 1. The rival cluster centres are thus automatically penalized in terms of their winning chance; such penalization can reduce the winning chance of a rival cluster centre to zero. This rival penalized mechanism is briefly described in the next section.
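A direct transcription of the data metric (Eq. (6)) and the membership rule (Eq. (7)) might look as follows, where p holds the cluster probabilities p(C_i) and E is the constant discussed above; giving empty clusters an infinite cost reflects the fact that they can never win again. This is only a sketch of the two formulas, not the author's code:

```python
import numpy as np

def dm(x, centre, p_ci, E):
    """Data metric of Eq. (6): squared distance penalized by the cluster probability."""
    if p_ci <= 0.0:
        return np.inf                  # an empty rival cluster can never win again
    return float(((x - centre) ** 2).sum()) - E * np.log2(p_ci)

def membership(x, centres, p, E):
    """Membership of Eq. (7): index of the cluster minimizing the data metric."""
    return int(np.argmin([dm(x, c, p_i, E) for c, p_i in zip(centres, p)]))
```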
The minimization of the information uncertainty J_I allocates the proper number of clusters to the data points, while the minimization of J_MSE makes the clustering of the input data possible. The values of both functions, J_MSE and J_I, over nine values of the cluster number k for a dataset with a cardinality of 800, drawn from four Gaussian distributions, are shown in Fig. 3. The nodes on the curves in Fig. 3 denote the values of the cost-functions J_I and J_MSE and of their sum J at the various cluster numbers k. The global minimum of the sum of both functions corresponds to the number of actual clusters (k = k′).

Fig. 3. Dataset with 800 data objects clustered into four clusters, and the values of the functions J_I, J_MSE and J_I + J_MSE for cluster numbers k = 1–9.
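The behaviour shown in Fig. 3 can be reproduced conceptually by evaluating the two terms of the cost-function for a given partition. The helper below is only a sketch under the formulas above; it assumes that p(C_i) is estimated by the relative cluster size n_i/N:

```python
import numpy as np

def cost(X, labels, centres, E):
    """Return J_MSE, J_I and their sum J (Eqs. (1), (4), (5)) for a labelled dataset."""
    N = len(X)
    j_mse, j_i = 0.0, 0.0
    for i, c in enumerate(centres):
        members = X[labels == i]
        if len(members) == 0:
            continue                              # empty clusters are neglected
        j_mse += float(((members - c) ** 2).sum())
        j_i += -E * np.log2(len(members) / N)     # information-uncertainty term of Eq. (4)
    return j_mse, j_i, j_mse + j_i
```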


3. The rival penalized mechanism


This section analyzes the rival penalized mechanism of the proposed metric in Eq. (6). Data assignment based on the data metric to the winner's cluster centre reduces J_MSE and drives a group of cluster centres to converge onto the centres of the actual clusters. The winner's centre is modified so as to also contain the input data point x_t, and the second term in the data metric is automatically decreased for the rival centres. We show that such a penalization of rival cluster centres can reduce their winning chance to zero.
We consider a simple example of one Gaussian distribution forming one cluster, with the input number of clusters set to 2. The number of input data points is 200, the mean vector is (190, 90) and the standard deviation is (0.5, 0.2). At the beginning (t = 0), the data are divided into two clusters with two cluster centres, as shown in Fig. 4a, where each cluster centre is indicated by a rectangle. We denote the centres by c_0 and c_1, and t represents the number of iterations for which the data has been repeatedly scanned. The data metric (Eq. (2)) divides the cluster into two regions by a virtual separating line, as shown in Fig. 4a. Data points on the line are at the same distance from both cluster centres. In the next iteration, they are assigned to the cluster with more elements, which makes the second part of the proposed metric (Eq. (6)) smaller. We suppose that the first cluster has fewer elements than the second, n_0 < n_1. During data scanning, if the centre c_1 of the second cluster, with more elements, wins when adapting to the input data point x_t, it moves towards the first cluster centre c_0 and, consequently, the separating line moves to the left, as shown in Fig. 4b. Region 1 of the first cluster shrinks, while region 2 of the second cluster expands towards the left. The same is repeated throughout the next iterations for points that are near or on the separating line, until c_1 gradually converges to the actual cluster centre by minimizing the data metric dm (Eq. (6)), while the centre c_0 moves towards the cluster's boundary. The first (rival) cluster has fewer and fewer elements, until the number of elements decreases to 0 and its competition chance reaches zero; from Eq. (6) the data metric dm then becomes infinite. Cluster centre c_0 becomes dead, without a chance to win again. When a cluster centre c_i is far away from the input data, it lies on one side of the input data and cannot be the winner for any new sample; the change of the cluster centre, Δc_i, points towards the outside of the sample data. If every cluster centre moved away from the sample dataset, the J_MSE cost-function would grow larger and larger. This contradicts the fact that the algorithm decreases J_MSE, and proves that some centres remain within the sample data.
The analysis of multiple clusters is more complicated because of interactive effects among the clusters. In Section 5, various datasets are tested to demonstrate the convergence behaviour of the data metric, which automatically penalizes the winning chance of all rival cluster centres in subsequent iterations, while the winning cluster centres move towards the actual cluster centres.

Fig. 4. The clustering process for one Gaussian distribution with the input parameter (number of clusters) k = 2 after (a) 10 iterations, (b) 15 iterations and (c) 20 iterations.
4. k′-Means algorithm

It is clear from Section 3 that the proposed metric automatically penalizes all rival cluster centres in the competition to acquire a new point. We propose a k′-means algorithm that minimizes the proposed cost-function and data metric. It has two phases. In the first phase, k cluster centres are allocated in such a way that each actual cluster contains one or more of them; we suppose that the input number of cluster centres k is greater than the real number of clusters k′. In the second phase, all rival cluster centres within the same cluster are pushed out of the cluster, so that they represent clusters with no elements. The detailed k′-means algorithm, consisting of two completely separated phases, is suggested as follows.
For the first phase, the k-means algorithm is used as initial clustering to allocate the k cluster centres so that each actual cluster has at least one centre. We suppose that the input parameter, the number of clusters, is greater than the actual number of clusters present in the data: k > k′.
Step 1: Randomly initialize the k cluster centres in the input dataset.



Step 2: Randomly pick a data point x_t from the input dataset and, for j = 1, 2, ..., k, calculate the cluster membership function I(x_t, j) by Eq. (2). Every point is assigned to the cluster whose centroid is closest to that point.
Step 3: For all k cluster centres, set c_i to be the centre of mass of all points in cluster C_i:

c_i = \frac{1}{|C_i|} \sum_{x_t \in C_i} x_t \qquad (8)

Steps 2 and 3 are repeated until all cluster centres remain unchanged, or until they change by less than some threshold value. The stopping threshold is usually selected to be very small. Another way to stop the algorithm is to limit the number of iterations by a certain threshold. At the end of the first phase of the algorithm, each cluster has at least one centre.
In the first phase we do not use the extended cluster membership function described by Eq. (7), because the first phase aims to allocate the initial seed-points into the desired regions rather than to estimate the cluster number precisely. This is achieved by the second phase, which repeats the following two steps until all cluster centres converge.
Step 1: Randomly pick a data point x_t from the input dataset and, for j = 1, 2, ..., k, calculate the cluster membership function I(x_t, j) by Eq. (7). Every point is assigned to the cluster whose centroid is closest to it, as defined by the cluster membership function I(x_t, j).
Step 2: For all k cluster centres, set c_i to be the centre of mass of all points in cluster C_i (Eq. (8)).



Steps 1 and 2 are repeated until all cluster centres remain unchanged for all input data points, or until they change by less than some threshold value. At the end, k′ clusters are discovered, where k′ is the number of actual clusters. The initial seed-point cluster centres converge towards the centroids of the input data clusters, and all extra seed-points, the difference between k and k′, are driven away from the dataset.
The number of recognized clusters k′ is implicitly defined by the parameter E (Eq. (6)). E is merely a choice of measurement units and should be of the order of the range of the point coordinates; the magnitude of the coordinates does not matter, because only point distances are relevant. However, experiments have shown that a wide interval exists for E within which a consistent number of actual clusters is discovered in the sample dataset. The heuristic for the parameter E is given in Eq. (9):

E \in [a, 3a], \qquad a = (\mathrm{average}(r) + \mathrm{average}(d)) / 2 \qquad (9)

where r is the average radius of the clusters after the first phase of the algorithm and d is the smallest distance between two cluster centres greater than 3r. For stronger clustering, the parameter E can be doubled. If E is smaller than suggested, the algorithm cannot push the redundant cluster centres away from the input regions; on the other hand, if E is too large, the algorithm pushes almost all cluster centres away from the input data.
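Putting the two phases together, a compact sketch of the whole procedure could look as follows. It reuses the k_means helper sketched in Section 1, estimates p(C_i) as n_i/N, and picks E from the middle of the interval suggested by Eq. (9); these choices, and the stopping rule, are assumptions rather than the author's reference implementation:

```python
import numpy as np

def k_prime_means(X, k, E=None, max_iter=100, tol=1e-6, seed=None):
    """Two-phase k'-means sketch: phase 1 is plain k-means, phase 2 uses the data metric."""
    # Phase 1: ordinary k-means puts at least one seed-point into every actual cluster,
    # assuming the input k is larger than the true number of clusters k'.
    centres, labels = k_means(X, k, max_iter, tol, seed)
    if E is None:
        # Heuristic of Eq. (9): a = (average radius + separation distance) / 2, E in [a, 3a].
        radii = [np.sqrt(((X[labels == i] - centres[i]) ** 2).sum(axis=1)).mean()
                 for i in range(k) if np.any(labels == i)]
        r = float(np.mean(radii))
        dists = [np.linalg.norm(ci - cj)
                 for a_, ci in enumerate(centres) for cj in centres[a_ + 1:]]
        d = min([d_ for d_ in dists if d_ > 3 * r], default=3 * r)
        E = 2.0 * (r + d) / 2.0                    # middle of the suggested interval [a, 3a]
    # Phase 2: reassign points with the penalized metric until the centres settle.
    for _ in range(max_iter):
        p = np.array([(labels == i).mean() for i in range(k)])       # p(C_i) = n_i / N
        penalty = np.where(p > 0, -E * np.log2(np.where(p > 0, p, 1.0)), np.inf)
        cost = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2) + penalty
        new_labels = cost.argmin(axis=1)                             # Eqs. (6) and (7)
        new_centres = np.array([X[new_labels == i].mean(axis=0) if np.any(new_labels == i)
                                else centres[i] for i in range(k)])  # Eq. (8)
        if np.array_equal(new_labels, labels) and np.linalg.norm(new_centres - centres) < tol:
            break
        centres, labels = new_centres, new_labels
    kept = [i for i in range(k) if np.any(labels == i)]              # non-empty clusters = k'
    return centres[kept], labels
```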
5. Experimental results
Three simulated experiments were carried out to demonstrate the performance of the k′-means algorithm. The algorithm was also applied to the clustering of a real dataset. The stopping threshold value was set to 10⁻⁶.
5.1. Experiment 1
Experiment 1 used 470 points from a mixture of four Gaussian distributions. The detailed parameters of the input dataset are given in Table 1, where Ni, ci, σi and αi denote the number of samples, the mean vector, the standard deviation and the mixing proportion, respectively. The input number of clusters k was set to 10. Fig. 5a shows all 10 clusters and centres after the first phase of the algorithm; each cluster has at least one seed-point. After the second phase, only four seed-points remained, denoting four cluster centres. As shown in Fig. 5b, the data form four well-separated clusters. The parameters of the four recognized clusters are given in Table 2.

Table 1
Parameters of dataset 1, where the number of samples N = 470

Cluster number i    Ni     ci            σi            αi
1                   100    (0.5, 0.5)    (0.1, 0.1)    0.213
2                    50    (1, 1)        (0.1, 0.1)    0.106
3                   160    (1.5, 1.5)    (0.2, 0.1)    0.25
4                   160    (1.4, 2.3)    (0.4, 0.2)    0.34

Fig. 5. Clusters discovered for k = 10 (a) by the k-means algorithm and (b) by the suggested k′-means algorithm.

Table 2
The four discovered clusters in experiment 1

Cluster number i    Ni     ci
1                   100    (0.496, 0.501)
2                    50    (0.993, 0.985)
3                   167    (1.483, 1.51)
4                   155    (1.356, 2.303)

5.2. Experiment 2
In Experiment 2, 800 data points were used, also from a mixture of four Gaussians. Three sets of data, S1, S2 and S3, were generated with different degrees of overlap among the clusters: the sets had different variances of the Gaussian distributions, and the numbers of points in the clusters were controlled by the mixing proportions αi. The detailed parameters of these datasets are given in Table 3.
In sets S1 and S2 the data have a symmetric structure and each cluster has the same number of elements. For such datasets, when the clusters are separated to a certain degree, the algorithm usually converges correctly.
It can be observed from Fig. 6 that all three datasets resulted in correct convergence. The input number of cluster centres was set to 7. Four cluster centres were located around the centres of the four actual clusters, while the remaining three cluster centres were driven far away from the data. The results show that the algorithm can also discover clusters that are not well separated, such as those in dataset S3.

Table 3
Parameters of the three datasets for experiment 2

Dataset    Cluster number i    Ni     ci        σi            αi
S1         1                   200    (1, 2)    (0.2, 0.2)    0.25
S1         2                   200    (2, 1)    (0.2, 0.2)    0.25
S1         3                   200    (3, 2)    (0.2, 0.2)    0.25
S1         4                   200    (2, 3)    (0.2, 0.2)    0.25
S2         1                   200    (1, 2)    (0.4, 0.4)    0.250
S2         2                   200    (2, 1)    (0.4, 0.4)    0.250
S2         3                   200    (3, 2)    (0.4, 0.4)    0.250
S2         4                   200    (2, 3)    (0.4, 0.4)    0.250
S3         1                   400    (1, 2)    (0.4, 0.4)    0.364
S3         2                   400    (2, 1)    (0.4, 0.4)    0.364
S3         3                   150    (3, 2)    (0.4, 0.4)    0.136
S3         4                   150    (2, 3)    (0.4, 0.4)    0.136

Fig. 6. The three sets of input data used in Experiment 2 and the clusters discovered by the proposed k′-means algorithm.
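For reproducibility, the mixtures of Table 3 can be sampled directly from the listed parameters. The sketch below assumes independent Gaussian coordinates with the given means and standard deviations, since the original data generator is not described in more detail:

```python
import numpy as np

def make_dataset(sizes, means, sigmas, seed=0):
    """Sample a two-dimensional Gaussian-mixture dataset such as S1-S3 of Table 3."""
    rng = np.random.default_rng(seed)
    parts = [rng.normal(loc=m, scale=s, size=(n, 2))
             for n, m, s in zip(sizes, means, sigmas)]
    return np.vstack(parts)

# Dataset S3 of Table 3: unequal cluster sizes, sigma = (0.4, 0.4) for every component.
S3 = make_dataset([400, 400, 150, 150],
                  [(1, 2), (2, 1), (3, 2), (2, 3)],
                  [(0.4, 0.4)] * 4)
```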
5.3. Experiment 3
The k′-means method was compared with previous model selection criteria and Gaussian mixture estimation methods: MDL, AIC, BIC and MML. The comparison presented by Oliver et al. (1996) was used.


The same mixture of three Gaussian components was used, with the mean of the first component at (0, 0), the second at (2, √12) and the third at (4, 0). As the dataset in this experiment, 100 data points were generated from this distribution. The results of our method are given in Table 4 for four values of the standard deviation.
The counts (e.g., 99 in the first column of Table 4) indicate the number of times that the actual number of clusters (k = 3) was confirmed in 100 experiments repeated with different cluster centre initializations. The initial number of clusters was set to 5.

Table 4
Predicted number of components for different standard deviations

Predicted number of clusters    σx = σy = 0.67    σx = σy = 1    σx = σy = 1.2    σx = σy = 1.33
1                               0                 3              14               45
2                               0                 0              0                0
3 (true)                        99                97             86               55
4                               1                 0              0                0
5                               0                 0              0                0
If we compare the obtained results with the MML, AIC, PC, MDL and ICOMP criteria, as presented by Oliver et al. (1996) for the three-component distribution, the k′-means algorithm gives considerably better results. The k′-means method confirms the actual (true) number of clusters more frequently than the other criteria over 100 experiments repeated with different initializations. When an incorrect number of clusters was obtained, k′-means predicted fewer clusters, whereas the AIC, PC, MDL and ICOMP criteria often predicted more.
5.4. Experiment 4 with a real dataset

The k′-means algorithm was also applied to a real dataset: the wine dataset (Blake and Merz, 1998), a typical real dataset for testing clustering (http://mlearn.ics.uci.edu/databases/wine/). The dataset consists of 178 samples of three types of wine. These data are the results of a chemical analysis of wines grown in the same region but derived from three different cultivars; the analysis determined the quantities of 13 constituents. The correct numbers of elements in the three clusters are 48, 71 and 59.
The wine data were first regularized into the interval [0, 300], and the k′-means algorithm was then applied to the unsupervised clustering of the wine data with k = 6.
The k′-means algorithm detected three classes in the wine dataset with a clustering accuracy of 97.75% (there were four errors), which is a rather good result for an unsupervised learning method. This is the same result as that obtained by the method of linear mixing kernels with an information minimization criterion (Roberts et al., 2000).
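As a usage illustration for this experiment (not the author's original code), the UCI wine data can be loaded with scikit-learn, rescaled to [0, 300] and clustered with k = 6 using the k_prime_means sketch from Section 4; the loader and the min-max scaling are assumptions:

```python
import numpy as np
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)                 # 178 samples, 13 constituents, 3 cultivars
# Regularize every attribute into the interval [0, 300], as described above.
X = 300.0 * (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
centres, labels = k_prime_means(X, k=6, seed=0)   # extra seed-points should end up empty
print(len(centres), "clusters discovered")        # ideally 3, matching the three cultivars
```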



5.5. Discussion and experimental results
As shown by the experiments, k′-means can allocate the correct number of clusters at, or near, the actual cluster centres. Experiment 3 showed that the k′-means algorithm is insensitive to the initial values of the cluster centres and leads to good results. We also found, from Experiment 4 on a real dataset, that the algorithm works well in high-dimensional space when the clusters are separated to a degree similar to that in Experiment 2. The simulation experiments also showed that, when the initial cluster centres are randomly selected from the input dataset, the dead-unit problem does not occur. The experiments further showed that, if two or more clusters overlap seriously, the algorithm regards them as one cluster, which leads to an incorrect result. When the clusters are elliptical, or of some other form, the algorithm can still detect the number of clusters, but the clustering is not as good. For the classification of elliptical clusters, the Mahalanobis distance gives better clustering than the Euclidean distance in the cost-function and data metric (Ma and Cao, 2006).
According to the analysis of the data metric and the simulation experiments, we claim that, when the input parameter for the number of clusters k is not much larger than the actual number of clusters k′, the algorithm converges correctly. However, when k is much larger than k′, the number of discovered clusters is usually greater than k′.
The following simulation results demonstrate that a large valid range of k exists for each dataset. On each of the three datasets from Experiment 2, we ran the algorithm 100 times for values k > k′, increasing k from k′ and computing the percentage of valid results. The upper boundary of the valid range for k is the largest integer k at which the valid percentage is larger than or equal to a certain threshold value; we chose 98%. The valid range for the first dataset S1 is 4–24, for the second 4–16 and for the third 4–9. The parameter E in the data metric has to be doubled for larger k.
6. Conclusions
A new clustering algorithm named k′-means is presented, which performs correct clustering without predetermining the exact number of clusters k. It minimizes a cost-function defined as the sum of the mean-square-error and the information uncertainty, and its rival penalized mechanism has been shown. As the cost-function is reduced to a global minimum, the algorithm separates k′ cluster centres (where k′ is the actual number of clusters) that converge towards the actual cluster centres. The other (k - k′) centres are moved far away from the dataset and never win the competition for any data sample. It has been demonstrated by experiments that this algorithm can efficiently determine the actual number of clusters in artificial and real datasets.
References
Ahalt, S.C., Krishnamurthy, A.K., Chen, P., Melton, D.E., 1990. Competitive learning algorithms for vector quantization. Neural Networks 3, 277–291.
Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle. In: Proc. 2nd Internat. Symp. on Information Theory, pp. 267–281.
Bezdek, J., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York.
Blake, C.L., Merz, C.J., 1998. UCI Repository of machine learning databases. Dept. Inf. Comput. Sci., Univ. California, Irvine. <http://mlearn.ics.uci.edu/MLRepository.html>.
Bozdogan, H., 1987. Model selection and Akaike's information criterion (AIC): the general theory and its analytical extensions. Psychometrika 52, 345–370.
Cheung, Y.M., 2005. On rival penalization controlled competitive learning for clustering with automatic cluster number selection. IEEE Trans. Knowledge Data Eng. 17, 1583–1588.
Khan, S., Ahmad, A., 2004. Cluster centre initialization algorithm for k-means clustering. Pattern Recognition Lett. 25, 1293–1302.
Laszlo, M., Mukherjee, S., 2007. A genetic algorithm that exchanges neighbouring centres for k-means clustering. Pattern Recognition Lett. 28, 2359–2366.
Law, L.T., Cheung, Y.M., 2003. Colour image segmentation using rival penalized controlled competitive learning. In: Proc. 2003 Internat. Joint Conf. on Neural Networks (IJCNN 2003), Portland, Oregon, USA, pp. 20–24.
Ma, J., Cao, B., 2006. The Mahalanobis distance based rival penalized competitive learning algorithm. Lect. Notes Comput. Sci. 3971, 442–447.
MacQueen, J.B., 1967. Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symp. on Math. Statist. Prob., vol. 1. University of California Press, Berkeley, pp. 281–297.
Oliver, J., Baxter, R., Wallace, C., 1996. Unsupervised learning using MML. In: Proc. 13th Internat. Conf. on Machine Learning, pp. 364–372.
Redmond, S.J., Heneghan, C., 2007. A method for initializing the k-means clustering algorithm using kd-trees. Pattern Recognition Lett. 28, 965–973.
Roberts, S.J., Everson, R., Rezek, I., 2000. Maximum certainty data partitioning. Pattern Recognition 33, 833–839.
Schwarz, G., 1978. Estimating the dimension of a model. Ann. Statist. 6, 461–464.
Steinley, D., Brusco, M.J., 2007. Initializing k-means batch clustering: a critical evaluation of several techniques. J. Classif. 24, 99–121.
Wallace, C., Dowe, D., 1999. Minimum message length and Kolmogorov complexity. Comput. J. 42, 270–283.
Xu, L., 1993. Rival penalized competitive learning for cluster analysis, RBF net, and curve detection. IEEE Trans. Neural Networks 4, 636–648.
Xu, L., 1997. Bayesian Ying-Yang machine, clustering and number of clusters. Pattern Recognition Lett. 18, 1167–1178.
