Cluster Analysis using Triangulation
C. Eldershaw & M. Hegland
World Scientific
1. Introduction
Cluster analysis has been a topic of research for some time now. However, many
clustering algorithms suffer from a significant limitation in that they assume that the
underlying clusters are spherical. Unfortunately, this is not appropriate for all data
sets. This paper looks at clustering using tools from graph theory. It first triangulates
the data, then partitions the edges of the resulting graph into inter- and intra-cluster
edges. The technique is unaffected by the actual shape of the clusters, thus allowing
a far more general version of the clustering problem to be solved.
Section 2 of the paper is a general introduction to clustering, which includes a brief
description of the commonly used k-means technique. Following this is a discussion
of the problems which arise in the k-means (and related) methods and why there is a
need for graph-based methods. Sections 4 and 6 explain the proposed new method,
and give examples of its success. Section 5 discusses a few existing graph-based
methods and why they can be improved upon. The test programs, which provide the
results discussed in this paper, are currently written for two-dimensional data sets,
but Section 7 explains how the same principles can be extended to higher-dimensional
problems.
2. Spherical Clustering Techniques
Clustering techniques are one of the core Data Mining tools. At its most general,
clustering breaks a set of N data points in M dimensions into K sets or "clusters".
The aim is that in-class members of these subsets are "similar" and cross-class mem-
bers are "dissimilar", however these terms are defined. The data in question may
be continuous, discrete or a combination of both.
Of course many different partitionings of the data are possible. In creating the
clustering algorithm, it is essential to be able to compare two clusterings so as to
distinguish between good and bad ones. A good clustering is generally obtained when
the points within the same cluster are strongly related, and those in different clusters
are measurably less so. If we have some means of measuring the dissimilarity between
two points, a metric $\delta(x, y)$, then a standard clustering objective is to minimise:

$$\sum_{i=1}^{K} \left( \sum_{j=1}^{N_i} \sum_{k=1}^{N_i} \delta^2(x_{ij}, x_{ik}) / N_i^2 \right) \qquad (2.1)$$
where $x_{ij}$ is the $j$th element of the $N_i$ elements that are within the $i$th cluster.
By minimising this expression, the sum of the average dissimilarities within a cluster
is minimised. Unfortunately, with no restriction upon K, this gives the trivial answer
of K = N, where each point is its own cluster (a clear case of over-fitting). The
solution to this is to restrict K, either by adding a penalty term involving K to the
expression, or else by simply fixing K to a predetermined value.
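As a concrete illustration, expression (2.1) can be evaluated directly. The sketch below is ours, not the authors' code: the function name is illustrative, the squared Euclidean distance stands in for $\delta^2$, and the per-cluster average is taken over all ordered pairs (one reading of the normalisation). It also exhibits the over-fitting effect: singleton clusters drive the objective to zero.

```python
import numpy as np

def clustering_objective(points, labels):
    """Evaluate the clustering objective of expression (2.1): for each
    cluster, average the pairwise squared dissimilarities over all pairs
    and sum the per-cluster averages.  The squared Euclidean distance
    plays the role of delta^2 here."""
    total = 0.0
    for c in np.unique(labels):
        members = points[labels == c]                  # the N_i cluster points
        diffs = members[:, None, :] - members[None, :, :]
        pairwise_sq = (diffs ** 2).sum(axis=-1)        # delta^2 for every pair
        total += pairwise_sq.sum() / len(members) ** 2
    return total
```

With K = N (every point its own cluster) every inner sum is empty, so the objective is trivially zero, which is why K must be restricted.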
Since all the information about each point is stored within its M dimensions of data,
the dissimilarities (denoted by $\delta$) can be dealt with in a geometric manner. That
is, the distance between any two points gives a measure of the dissimilarity between
them. Such distances can be calculated in several ways (e.g. 1-norm, 2-norm, etc.).
In the case of the L2-norm, expression 2.1 can (ignoring a multiplicative factor of
2) be re-written in a form better designed for efficient evaluation:

$$\sum_{i=1}^{K} \left( \sum_{j=1}^{N_i} \delta^2(x_{ij}, \bar{x}_i) / N_i \right) \qquad (2.2)$$
where $\bar{x}_i$ is the centroid or mean of the $i$th cluster. Expression 2.2 can be interpreted
as finding the sum of differences of each point from a point representative of its
cluster.
This modified form also forms the basis for the popular k-means [9] algorithm. This
is a very efficient iterative algorithm designed for continuous data; however, variations
exist for categorical data (k-modes) [6] or mixed data (k-prototypes) [5].
The k-means algorithm is iterative. It starts by randomly choosing K locations in
M-space to be initial centroids of the K clusters. All N points are allocated to the
centroid closest to them, then the centroids (the $\bar{x}_i$) are re-calculated from the points
now allocated to them. Using these new centroids, the whole procedure starts over,
with the N points again being allocated to their nearest centroids.
This iterative process continues, reducing the expression in 2.2, until the sequence
converges. The k-means algorithm will not necessarily find the global minimum;
however, if the initial choices of the K centroids are sufficiently well separated, then
the algorithm is usually effective.
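The iteration just described can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation; the function name, the random initialisation scheme and the convergence test are ours.

```python
import numpy as np

def k_means(points, k, iters=100, seed=0):
    """Minimal k-means as described above: pick K random initial
    centroids, assign each point to its nearest centroid, recompute
    the centroids as cluster means, and repeat until convergence."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # allocate every point to the closest centroid
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # recompute each centroid from the points now allocated to it
        new = np.array([points[labels == c].mean(axis=0)
                        if (labels == c).any() else centroids[c]
                        for c in range(k)])
        if np.allclose(new, centroids):        # the sequence has converged
            break
        centroids = new
    return labels, centroids
```

On two compact, well separated groups this recovers the expected partition; the pathologies discussed below arise only for non-spherical clusters.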
3. Limitations of Spherical Clustering
The above-mentioned k-means algorithm is very efficient. Indeed, even efficient
parallel versions are possible [work in progress by researchers, including the authors].
Unfortunately any algorithm based upon minimising expression 2.1 or 2.2 has a sig-
nificant inherent limitation: since all the distances are calculated relative to the
centroids, these centroids are representative of all points allocated to a particular
cluster. This representative point is reasonable if all the points in the cluster should
be as close as possible; however, in many situations, a slightly broader description of
a cluster is useful.
This broader definition still maintains that neighbouring points in the M-space,
within the same cluster, should be similar. However it does not require that all
possible pairs of points within a cluster be so. This allows an aspect of transitivity
into the description: if one point is similar to two other points, neither of which are
particularly similar to each other, then all three are clustered together, as the first
and third are indirectly related.
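This transitive notion of membership amounts to taking connected components of a "similarity" graph: two points share a cluster whenever a chain of pairwise-similar points links them. A small sketch (the function name and graph representation are ours):

```python
def connected_clusters(n_points, similar_pairs):
    """Transitive clustering: points i and j end up in the same cluster
    whenever a chain of pairwise-similar points links them, even if i
    and j are not directly similar to each other."""
    # build an adjacency list from the pairs judged "similar"
    adj = {i: [] for i in range(n_points)}
    for i, j in similar_pairs:
        adj[i].append(j)
        adj[j].append(i)
    label = [None] * n_points
    cluster = 0
    for start in range(n_points):
        if label[start] is not None:
            continue
        stack = [start]                 # depth-first search from this point
        while stack:
            v = stack.pop()
            if label[v] is None:
                label[v] = cluster
                stack.extend(adj[v])
        cluster += 1
    return label
```

For example, with pairs (0, 1) and (1, 2), points 0 and 2 are clustered together through point 1 even though they are not directly similar.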
This would allow non-spherical clusters to be correctly identified. Consider the two-
dimensional example in Figure 1. This clearly has three distinct groupings of points:
the outer parabolic shape; the lower circular arc; and the central "spot". Each of
these three is quite well separated, and certainly should not be merged, yet within
each of the three, all the points could reasonably be said to be related. 20% of the
3000 data points are uniformly randomly distributed, for the sake of realism. While the
central spot would normally respond well to the k-means algorithm, the other two
most certainly would not, as any chosen centroid would be quite distant from at least
some of the points in that group.
Figure 1. This figure shows the results of applying the k-means al-
gorithm to a generated set of data. The crosses, triangles and circles
indicate into which cluster the algorithm placed each point. The bound-
ing lines have been added to make the output clusters more distinct.
As is apparent, the results are far from ideal.
The complete failure of the k-means algorithm for this example is demonstrated
in Figure 1. The triangles, circles and crosses indicate the final results of running
the k-means algorithm with K set to 3. The bounding boxes have been added to
more clearly show what the algorithm has done: namely, that it has located one
centroid in each of the three thirds of the region, and allocated all the closest points
in each third to it. In each of the three clusters it identified, the k-means algorithm
selected significant sections of two of the underlying clusters. Note that the particular
results of this "run" of the program are typical; i.e. they are not simply the result of
unfortunate choices of the initial centroids.
spanning tree (or MST) method [4, 14, 10]. This involves extracting the minimum
spanning tree from the complete graph, and then removing some edges. Unfortu-
nately, forming an MST in the first place is an $O(N^2)$ operation [10, 11]. Ahuja [1]
and Sibson [12] indicate that Voronoi tessellations (the dual problem of Delaunay
triangulation) could be used, but do not provide any details.
Zahn [14], in his discussion of the MST approach, suggests comparing the length of
each arc against the average length of nearby arcs and removing those with lengths
more than double the average. However, with a Delaunay triangulation, experimen-
tation shows that this method cannot be directly employed, as there tend to be
several of the longer (inter-cluster) arcs coming from the same vertex. This raises the
neighbourhood average to the point where all arcs are preserved. An alternative, for
the case where all the clusters are of approximately the same density, is to use the
average length of all the arcs in the graph.
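The global-average variant just described can be stated compactly. This sketch is ours (names and the edge representation are illustrative): it removes every arc whose length exceeds double the mean arc length of the whole graph.

```python
def cut_long_arcs(edges, factor=2.0):
    """Modified Zahn rule: remove every arc whose length exceeds
    `factor` times the average arc length of the whole graph.
    `edges` is a list of (u, v, length) tuples; returns the kept arcs."""
    avg = sum(length for _, _, length in edges) / len(edges)
    return [(u, v, length) for u, v, length in edges
            if length <= factor * avg]
```

As the next paragraph shows, a fixed factor of two breaks down once noise shortens the inter-cluster arcs.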
However, fixing the cut-off point to twice the average edge length proves not to
be effective when any significant noise is present. For the example used in Figures 1
and 2, this algorithm only worked with a uniform noise level of 6% or less. Even for
a simpler problem, with three well separated spherical clusters, it only worked with
noise levels up to 27%. The problem is that additional points between the clusters
(which is where noise falls) shorten the inter-cluster arc lengths, defeating the algorithm.
What is needed is some dynamic means of choosing this multiplicative value.
6. Choosing a Value for p
The new algorithm put forward by the authors differs somewhat from Zahn's.
Rather than using some constant times the average, it simply determines a value for
p. Any arc with length greater than p is removed from the graph. The value for
p will be determined automatically for each input data set, and will, in general, be
different for each.
Consider, for a moment, what the aim of choosing p is: we are trying to separate
the inter-cluster arcs from the intra-cluster ones. So this is really another clustering
exercise, only this time in one dimension, and with exactly two clusters.
This special form of the clustering problem can be taken advantage of. Let X be
the set of arcs from our graph which we believe span the gap between clusters. Let Y
be those arcs which we think have both endpoints contained within the same cluster.
We can now employ the standard clustering expression, expression 2.2. What we
wish to do is split all the arcs into two groupings, each group containing arcs of
approximately the same size (either the smaller intra-cluster arcs or the larger inter-
cluster arcs). The function T(p) gives a measure of how "neatly split" the two sections
are for any given value of p. T(p) is defined as follows:
$$T(p) = \sum_{i=1}^{n_x} (x_i - \bar{x})^2 / n_x + \sum_{i=1}^{n_y} (y_i - \bar{y})^2 / n_y \qquad (6.1)$$

where $x_i \in X$, $y_i \in Y$,

$$\bar{x} = \sum_{i=1}^{n_x} x_i / n_x, \qquad \bar{y} = \sum_{i=1}^{n_y} y_i / n_y,$$

and $n_x$ and $n_y$ are the numbers of arcs in X and Y respectively.
In practice, the process was found to be far more effective when the logarithm of
the length was used instead of the raw lengths. This compresses the distribution of
lengths at the long-arc end of the spectrum. The algorithm can then more easily
identify the cluster containing these long spanning arcs due to its compacted nature.
Figure 3 shows a plot of T(p) versus p. This plot is quite typical of T(p) for different
input data sets: there is a clear global minimum, in this case at around p = 0.6. At
this minimum, the optimal separation of the inter- and intra-cluster arcs is achieved.
So finding an appropriate value for p is nothing more than minimising the function
T(p).
A single evaluation of the function T(p) can be performed in O(N) time. Experi-
ence shows that the minimum need not be calculated with any great level of accuracy.
In fact the implementation used here simply evaluated T(p) at twenty evenly spaced
values of p, and chose the least. So a value for p can reasonably be determined in
O(N) time.
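This procedure can be sketched as follows. The sketch is ours and makes one assumption the paper leaves unstated: the twenty candidate values of p are spread evenly between the smallest and largest log arc lengths.

```python
import math

def t_of_p(log_lengths, p):
    """Expression (6.1) on log arc lengths: split the arcs at p into the
    long set X (candidate inter-cluster arcs) and short set Y (candidate
    intra-cluster arcs), and sum each side's mean squared deviation from
    its own mean."""
    x = [v for v in log_lengths if v > p]    # candidate inter-cluster arcs
    y = [v for v in log_lengths if v <= p]   # candidate intra-cluster arcs
    total = 0.0
    for group in (x, y):
        if group:
            m = sum(group) / len(group)
            total += sum((v - m) ** 2 for v in group) / len(group)
    return total

def choose_p(lengths, n_candidates=20):
    """Evaluate T(p) at evenly spaced candidate values of p over the
    range of log arc lengths, and keep the value giving the least T(p)."""
    logs = [math.log(v) for v in lengths]
    lo, hi = min(logs), max(logs)
    candidates = [lo + (hi - lo) * i / (n_candidates - 1)
                  for i in range(n_candidates)]
    return min(candidates, key=lambda p: t_of_p(logs, p))
```

On a bimodal set of arc lengths, the chosen p lands in the gap between the short and long arcs, giving the clean split the minimum of T(p) represents.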
This dynamic method of choosing p is quite effective in the presence of noise,
since it can compensate for the shortened inter-cluster arc lengths. In fact, for some
Figure 3. A plot of T(p) against the cutting point p, showing the
optimal cut point in the one-dimensional clustering. This graph is
highly typical of T(p) for many different input data sets. A single
global minimum is clearly present.
problems, it works better in the presence of small amounts of noise. This is because,
while the noise decreases the inter-cluster arc lengths, the noise points between the
clusters increase the ratio of the number of inter-cluster to intra-cluster arcs. When
this ratio is large, the clustering algorithm generally yields better results.
Evidence of the algorithm's greatly increased flexibility can be seen by comparing
its results with those of the modified Zahn algorithm. For the simple problem of
three well separated spherical clusters, this new algorithm works correctly with any
amount of noise up to 50% (contrasting with a limit of 27% with the modified Zahn
approach). For the more complicated problem illustrated in Figures 1 and 2, the new
algorithm works with any noise in the range of 16–45% (as opposed to 0–6% for the
modified Zahn algorithm).
7. Conclusions
To this point, all the discussion and examples have related to two-dimensional prob-
lems. In fact the same techniques can equally be applied to general M-dimensional
clustering problems. Delaunay triangulations can be performed in any dimensional
space [2, 13].
Effectively, what this algorithm does is map a K-cluster, M-dimension problem
into the relatively trivial two-cluster, one-dimension problem.
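The overall scheme can be sketched with standard tools (here SciPy's Delaunay triangulation; this is our illustration, not the authors' code). For brevity the cut-off used below is the modified Zahn rule of Section 5, twice the mean arc length, rather than the T(p) minimisation of Section 6.

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def triangulation_clusters(points):
    """Sketch of the overall scheme: Delaunay-triangulate the points,
    drop the long (inter-cluster) arcs, and read the clusters off as the
    connected components of what remains."""
    tri = Delaunay(points)
    # collect the unique edges of the triangulation
    edges = set()
    for simplex in tri.simplices:          # each simplex is a triangle
        for a in range(3):
            for b in range(a + 1, 3):
                edges.add((min(simplex[a], simplex[b]),
                           max(simplex[a], simplex[b])))
    edges = np.array(sorted(edges))
    lengths = np.linalg.norm(points[edges[:, 0]] - points[edges[:, 1]],
                             axis=1)
    # prune arcs longer than twice the mean arc length
    keep = lengths <= 2.0 * lengths.mean()
    n = len(points)
    graph = coo_matrix((np.ones(int(keep.sum())),
                        (edges[keep, 0], edges[keep, 1])), shape=(n, n))
    _, labels = connected_components(graph, directed=False)
    return labels
```

Note that nothing here depends on cluster shape: any group of points linked by short arcs forms one component, however elongated or curved it is.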
For the case of two dimensions, all of the above steps are computationally either
O(N) or O(N log N), making the entire algorithm also O(N log N). Even in higher-
dimensional problems, the time is bounded by the triangulation (with complexity
$O(N^{(2M-1)/M})$ [13]) and so the total time for the algorithm is always less than $O(N^2)$.