Computational Techniques and Applications: CTAC97
World Scientific

Cluster Analysis using Triangulation


C. Eldershaw, M. Hegland
Computer Sciences Laboratory, RSISE, Australian National University,
Canberra, ACT 0200, Australia.

1. Introduction
Cluster analysis has been researched for some time now. However, many clustering algorithms suffer from a significant limitation: they assume that the underlying clusters are spherical. Unfortunately, this is not appropriate for all data sets. This paper looks at clustering using tools from graph theory. It first triangulates the data, then partitions the edges of the resulting graph into inter- and intra-cluster edges. The technique is unaffected by the actual shape of the clusters, thus allowing a far more general version of the clustering problem to be solved.
Section 2 of the paper is a general introduction to clustering, which includes a brief description of the commonly used k-means technique. Following this is a discussion of the problems which arise in the k-means (and related) methods and why there is a need for graph-based methods. Sections 4 and 6 explain the proposed new method and give examples of its success. Section 5 discusses a few existing graph-based methods and why they can be improved upon. The test programs, which provide the results discussed in this paper, are currently written for two-dimensional data sets, but Section 7 explains how the same principles can be extended to higher-dimensional problems.
2. Spherical Clustering Techniques
Clustering techniques are one of the core Data Mining tools. At its most general, clustering breaks a set of N data points in M dimensions into K sets or "clusters". The aim is that members of the same subset are "similar" and members of different subsets are "dissimilar", however these terms are defined. The data in question may be continuous, discrete or a combination of both.
Of course, many different partitionings of the data are possible. In creating the clustering algorithm, it is essential to be able to compare two clusterings so as to distinguish between good and bad ones. A good clustering is generally obtained when the points within the same cluster are strongly related, and those in different clusters are measurably less so. If we have some means of measuring the dissimilarity between two points, a metric rho(x, y), then a standard clustering objective is to minimise:
\sum_{i=1}^{K} \left( \sum_{j=1}^{N_i} \sum_{k=1}^{N_i} \rho(x_{ij}, x_{ik}) / N_i^2 \right)        (2.1)

where x_ij is the jth element of the N_i elements within the ith cluster. By minimising this expression, the sum of the average dissimilarities within each cluster is minimised. Unfortunately, with no restriction upon K, this gives the trivial answer of K = N, where each point is its own cluster (a clear case of over-fitting). The solution to this is to restrict K, either by adding a penalty term involving K to the expression, or else by simply fixing K to a predetermined value.
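As a concrete illustration of expression 2.1, the short Python sketch below evaluates the sum of average pairwise dissimilarities for a given labelling. It is not part of the original implementation; the function name and the choice of the 2-norm as the dissimilarity rho are illustrative assumptions.

```python
import numpy as np

def expression_2_1(points, labels):
    """Sum over clusters of the average pairwise dissimilarity (expression 2.1).
    `points` is an (N, M) array; `labels` assigns each point to one of K clusters.
    The 2-norm is used as the dissimilarity rho purely for illustration."""
    labels = np.asarray(labels)
    total = 0.0
    for label in np.unique(labels):
        cluster = points[labels == label]                  # the N_i points of cluster i
        diffs = cluster[:, None, :] - cluster[None, :, :]  # all N_i x N_i pairs
        rho = np.sqrt((diffs ** 2).sum(axis=-1))           # pairwise 2-norm distances
        total += rho.sum() / len(cluster) ** 2             # average over the N_i^2 pairs
    return total
```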
Since all the information about each point is stored within its M dimensions of data, the dissimilarities (denoted by rho) can be dealt with in a geometric manner. That is, the distance between any two points gives a measure of the dissimilarity between them. Such distances can be calculated in several ways (e.g. 1-norm, 2-norm, etc.).
In the case of the L2-norm, expression 2.1 can (ignoring a multiplicative factor of 2) be re-written in a form better suited to efficient evaluation:
\sum_{i=1}^{K} \left( \sum_{j=1}^{N_i} \rho(x_{ij}, \bar{x}_i) / N_i \right)        (2.2)

where \bar{x}_i is the centroid or mean of the ith cluster. Expression 2.2 can be interpreted as finding the sum of differences of each point from a point representative of its cluster.
This modified form also forms the basis for the popular k-means[9] algorithm. This is a very efficient iterative algorithm designed for continuous data; however, variations exist for categorical data (k-modes)[6] and mixed data (k-prototypes)[5].
The k-means algorithm is iterative. It starts by randomly choosing K locations in M-space to be the initial centroids of the K clusters. All N points are allocated to the centroid closest to them, then the centroids (the \bar{x}_i) are re-calculated from the points now allocated to them. Using these new centroids, the whole procedure starts over, with the N points again being allocated to their nearest centroids.
This iterative process continues, reducing expression 2.2, until the sequence converges. The k-means algorithm will not necessarily find the global minimum; however, if the initial choices of the K centroids are sufficiently well separated, then the algorithm is usually effective.
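As a reference point for the discussion that follows, here is a minimal Python sketch of the k-means iteration just described. It is the textbook algorithm, not the authors' code; the convergence test and the sampling of initial centroids from the data's bounding box are simplifying assumptions.

```python
import numpy as np

def k_means(points, K, max_iter=100, rng=None):
    """Textbook k-means: alternate nearest-centroid assignment and centroid
    recomputation until the centroids stop moving.  `points` is (N, M)."""
    rng = np.random.default_rng(rng)
    lo, hi = points.min(axis=0), points.max(axis=0)
    centroids = rng.uniform(lo, hi, size=(K, points.shape[1]))  # random initial centroids
    for _ in range(max_iter):
        # Allocate every point to its nearest centroid (2-norm distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Re-calculate each centroid as the mean of the points allocated to it.
        new_centroids = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```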
3. Limitations of Spherical Clustering
The above-mentioned k-means algorithm is very efficient. Indeed, even efficient parallel versions are possible [work in progress by researchers, including the authors]. Unfortunately, any algorithm based upon minimising expression 2.1 or 2.2 has a significant inherent limitation: since all the distances are calculated relative to the centroids, these centroids are representative of all points allocated to a particular cluster. This representative point is reasonable if all the points in the cluster should be as close as possible; however, in many situations a slightly broader description of a cluster is useful.
This broader definition still maintains that neighbouring points in the M-space, within the same cluster, should be similar. However, it does not require that all possible pairs of points within a cluster be so. This allows an aspect of transitivity into the description: if one point is similar to two other points, neither of which is particularly similar to the other, then all three are clustered together, as the first and third are indirectly related.
This would allow non-spherical clusters to be correctly identified. Consider the two-dimensional example in Figure 1. This clearly has three distinct groupings of points: the outer parabolic shape; the lower circular arc; and the central "spot". Each of the three is quite well separated, and certainly should not be merged, yet within each of the three, all the points could reasonably be said to be related. For the sake of realism, 20% of the 3000 data points are uniformly randomly distributed noise. While the central spot would normally respond well to the k-means algorithm, the other two most certainly would not, as any chosen centroid would be quite distant from at least some of the points in that group.

Figure 1. This figure shows the results of applying the k-means algorithm to a generated set of data. The crosses, triangles and circles indicate into which cluster the algorithm placed each point. The bounding lines have been added to make the output clusters more distinct. As is apparent, the results are far from ideal.

The complete failure of the k-means algorithm on this example is demonstrated in Figure 1. The triangles, circles and crosses indicate the final results of running the k-means algorithm with K set to 3. The bounding lines have been added to show more clearly what the algorithm has done: namely, it has located one centroid in each of the three thirds of the region and allocated all the closest points in each third to it. In each of the three clusters it identified, the k-means algorithm selected significant sections of two of the underlying clusters. Note that the particular results of this "run" of the program are typical, i.e. they are not simply the result of unfortunate choices of the initial centroids.
4. Using Triangulation in Clustering


With this alternative definition of what constitutes a good cluster, we introduce the concept of neighbours. Each point has a set of at least two neighbours, but more typically (in two dimensions) four or five. Now if two neighbouring points are sufficiently close together, then they belong to the same cluster.
In the technique proposed, the neighbours are not necessarily the closest n points, but are rather selected using a triangulation technique. In the implementation by the authors, a Delaunay triangulation was chosen. The resulting triangulation is close to optimal, and efficient O(N log N) algorithms for it exist[8]. Having triangulated the set of data points, the neighbours of any given point are simply those points in its adjacency set.
Once the neighbours of a point have been found, the next step is to determine which of them are close enough to be in the same cluster. This can be accomplished by simply choosing a cut-off point p. All edges between neighbours which are longer than p are deemed to be too long, and are removed from the graph. With the correct choice of p, ideally all the edges between clusters will be removed, and all the arcs within a single cluster will be preserved.
All that remains is to employ a graph partitioning algorithm to find the isolated connected components of this graph, and label each as a cluster. A depth-first search is easily implemented with time complexity O(E + V)[3], where E is the number of edges and V the number of vertices. Since in this case E = O(V) and V = N, the entire complexity of the partitioning is O(N).
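The steps just described can be sketched in a few lines of Python. SciPy's Delaunay routine stands in for the triangulation (the original test programs used Shewchuk's Triangle), and the small-cluster threshold min_size anticipates the noise-removal step discussed below; both are implementation assumptions, with the default of 10 taken from the 3000-point example in the text.

```python
import numpy as np
from scipy.spatial import Delaunay

def triangulation_clusters(points, p, min_size=10):
    """Cluster 2-D points by Delaunay triangulation: keep only arcs no longer
    than the cut-off p, then label the connected components.  Components with
    fewer than `min_size` points are treated as noise (label -1)."""
    tri = Delaunay(points)
    n = len(points)
    # Collect the unique edges of the triangulation.
    edges = set()
    for simplex in tri.simplices:
        for a, b in ((simplex[0], simplex[1]),
                     (simplex[1], simplex[2]),
                     (simplex[2], simplex[0])):
            edges.add((min(a, b), max(a, b)))
    # Build the adjacency lists, discarding arcs longer than p.
    adj = [[] for _ in range(n)]
    for a, b in edges:
        if np.linalg.norm(points[a] - points[b]) <= p:
            adj[a].append(b)
            adj[b].append(a)
    # Depth-first search for connected components, O(E + V).
    visited = np.zeros(n, dtype=bool)
    labels = np.full(n, -1)
    next_label = 0
    for start in range(n):
        if visited[start]:
            continue
        stack, component = [start], [start]
        visited[start] = True
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if not visited[w]:
                    visited[w] = True
                    stack.append(w)
                    component.append(w)
        # Keep the component only if it is not trivially small (noise).
        if len(component) >= min_size:
            labels[component] = next_label
            next_label += 1
    return labels
```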
Figure 2 shows the result of applying this algorithm to the same data set as used in Figure 1. As can be seen, the three underlying clusters have this time been identified, despite being far from spherical.
This algorithm has the further significant advantage that it does not need to be given K, the number of clusters, in advance. After removing the trivially small clusters which are an artifact of the noise (for the example in Figure 2, which had a total of 3000 points, clusters with fewer than 10 points were discarded), the number of remaining clusters is the most appropriate number for that data set as determined by the algorithm, be it 2, 20 or 200. With the k-means algorithm, it would be necessary to re-run the program with many different K values, and then try to determine which figure was the most appropriate one.
This algorithm has the nice bonus that it also removes almost all the noise from the input set. The vast majority of the noise points are turned into trivially sized clusters (e.g. one, two or three points). When these clusters are removed, what remains is much "cleaner". This is readily apparent when comparing Figure 2 against Figure 1. All the uniformly distributed noise from Figure 1 has been removed in Figure 2.
Figure 2. This figure shows the results of applying the triangulation algorithm to a generated set of data. The crosses, triangles and circles indicate into which cluster the algorithm placed each point. The points have clearly been placed into appropriate clusters. Furthermore, all of the uniformly scattered points have been removed by the algorithm.

5. Existing Graph-based Techniques
This is not the first time that graph-based techniques have been suggested for clustering. One which is referred to in the literature is the so-called minimum spanning tree (or MST) method[4, 14, 10]. This involves extracting the minimum spanning tree from the complete graph, and then removing some edges. Unfortunately, forming an MST in the first place is an O(N^2) operation[10, 11]. Ahuja[1] and Sibson[12] indicate that Voronoi tessellations (the dual of the Delaunay triangulation) could be used, but do not provide any details.
Zahn[14], in his discussion of the MST approach, suggests comparing the length of each arc against the average length of nearby arcs and removing those with lengths more than double the average. However, with a Delaunay triangulation, experimentation shows that this method cannot be directly employed, as there tend to be several of the longer (inter-cluster) arcs coming from the same vertex. This raises the neighbourhood average to the point where all arcs are preserved. An alternative, for the case where all the clusters are of approximately the same density, is to use the average length of all the arcs in the graph.
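For reference, the global-average variant just described amounts to the following Python fragment; edge_lengths is assumed to be an array of all the Delaunay arc lengths, and the factor of two is Zahn's multiplier.

```python
import numpy as np

def zahn_global_cutoff(edge_lengths, factor=2.0):
    """Modified Zahn rule: keep only arcs no longer than `factor` times the
    average length of all arcs in the graph; returns a boolean keep-mask."""
    edge_lengths = np.asarray(edge_lengths)
    return edge_lengths <= factor * edge_lengths.mean()
```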
However, fixing the cut-off to twice the average edge length proves not to be effective when any significant noise is present. For the example used in Figures 1 and 2, this algorithm only worked with a uniform noise level of 6% or less. Even for a simpler problem, with three well separated spherical clusters, it only worked with noise levels up to 27%. The problem is that additional points between the clusters (as noise is) cause a shortening of the inter-cluster arc lengths, defeating the algorithm. What is needed is some dynamic means of choosing this multiplicative value.
6. Choosing a Value for p
The new algorithm put forward by the authors differs somewhat from Zahn's. Rather than using some constant times the average, it simply determines a value for p. Any arc with length greater than p is removed from the graph. The value for p is determined automatically for each input data set, and will, in general, be different for each.
Consider, for a moment, what the aim of choosing p is: we are trying to separate the inter-cluster arcs from the intra-cluster ones. So this is really another clustering exercise, only this time in one dimension, and with exactly two clusters.
This special form of the clustering problem can be taken advantage of. Let X be the set of arcs from our graph which we believe span the gap between clusters. Let Y be those arcs which we think have both endpoints contained within the same cluster. We can now employ the standard clustering expression, expression 2.2. What we wish to do is split all the arcs into two groups, each group containing arcs of approximately the one size (either the smaller intra-cluster arcs or the larger inter-cluster arcs). The function T(p) gives a measure of how "neatly split" the two sections are for any given value of p. T(p) is defined as follows:
T(p) = \sum_{i=1}^{n_x} (x_i - \bar{x})^2 / n_x + \sum_{i=1}^{n_y} (y_i - \bar{y})^2 / n_y        (6.1)

where x_i \in X, y_i \in Y,

\bar{x} = \sum_{i=1}^{n_x} x_i / n_x ,   \bar{y} = \sum_{i=1}^{n_y} y_i / n_y

and n_x and n_y are the numbers of arcs in X and Y respectively.
In practice, the process was found to be far more effective when the logarithm of the length was used instead of the raw length. This compresses the distribution of lengths at the long-arc end of the spectrum. The algorithm can then more easily identify the cluster containing these long spanning arcs, due to its compacted nature.
Figure 3 shows a plot of T(p) versus p. This plot is quite typical of T(p) for different input data sets: there is a clear global minimum, in this case at around p = 0.6. At this minimum, the optimal separation of the inter- and intra-cluster arcs is achieved. So finding an appropriate value for p is nothing more than minimising the function T(p).
A single evaluation of the function T(p) can be performed in O(N) time. Experience shows that the minimum need not be calculated with any great level of accuracy. In fact, the implementation used here simply evaluated T(p) at twenty evenly spaced values of p, and chose the least. So a value for p can reasonably be determined in O(N) time.
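A sketch of this selection step, under the assumption that the arc lengths are supplied as a one-dimensional array: T(p) is expression 6.1 applied to the log lengths, and p is chosen from twenty evenly spaced candidates spanning the observed range (the candidate range itself is an assumption; the paper does not state it).

```python
import numpy as np

def T(log_lengths, p):
    """Expression 6.1 on log arc lengths: split at p and sum the two
    within-group variances (intra-cluster arcs vs. spanning arcs)."""
    total = 0.0
    for group in (log_lengths[log_lengths <= p], log_lengths[log_lengths > p]):
        if group.size:
            total += np.sum((group - group.mean()) ** 2) / group.size
    return total

def choose_cutoff(lengths, candidates=20):
    """Evaluate T(p) at `candidates` evenly spaced values of p and return the
    arc-length cut-off corresponding to the minimiser."""
    log_lengths = np.log(np.asarray(lengths))
    ps = np.linspace(log_lengths.min(), log_lengths.max(), candidates + 2)[1:-1]
    best_p = min(ps, key=lambda p: T(log_lengths, p))
    return float(np.exp(best_p))    # convert back from the logarithmic scale
```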
Figure 3. A plot of T(p), showing the optimal cut point in the one-dimensional clustering. This graph is highly typical of T(p) for many different input data sets. A single global minimum is clearly present. [Axes: T(p), on a scale of 10^-3, against the cutting point p from 0 to 1.]

This dynamic method of choosing p is quite effective in the presence of noise, since it can compensate for the shortened inter-cluster arc lengths. In fact, for some problems, it works better in the presence of small amounts of noise. This is because, while the noise decreases the inter-cluster arc lengths, the noise points between the clusters increase the ratio of the number of inter-cluster to intra-cluster arcs. When this ratio is large, the clustering algorithm generally yields better results.
Evidence of the algorithm's greatly increased flexibility can be seen by comparing its results with those of the modified Zahn algorithm. For the simple problem of three well separated spherical clusters, this new algorithm works correctly with any amount of noise up to 50% (contrasting with a limit of 27% for the modified Zahn approach). For the more complicated problem illustrated in Figures 1 and 2, the new algorithm works with noise anywhere in the range of 16-45% (as opposed to 0-6% for the modified Zahn algorithm).
7. Conclusions
To this point, all the discussion and examples have related to two-dimensional problems. In fact, the same techniques can equally be applied to general M-dimensional clustering problems: Delaunay triangulations can be performed in a space of any dimension[2, 13].
Effectively, what this algorithm does is map a K-cluster, M-dimensional problem into the relatively trivial two-cluster, one-dimensional problem.
For the case of two dimensions, all of the above steps are computationally either O(N) or O(N log N), making the entire algorithm also O(N log N). Even in higher-dimensional problems, the time is bounded by the triangulation (with complexity O(N^{(2M-1)/M})[13]), and so the total time for the algorithm is always less than O(N^2) (the complexity of the MST method).


The triangulation and subsequent partitioning of edges, in the manner suggested in this paper, offers an efficient and flexible method for cluster analysis. Most importantly, it is one which can be readily applied to data sets with non-spherical clusters.
8. Acknowledgements
The authors would like to thank the Advanced Computational Systems CRC (ACSys) at ANU for funding to attend CTAC'97. They also thank Jonathan Richard Shewchuk of Carnegie Mellon University for his Triangle program, part of which was used in the implementations described in this paper.
References
[1] Ahuja, Narendra, Dot Pattern Processing Using Voronoi Neighbourhoods, IEEE Transactions on Pattern Analysis and Machine Intelligence, 4(3), 1982, 336-343.
[2] Bowyer, A., Computing Dirichlet Tessellations, The Computer Journal, 24(2), 1981, 162-166.
[3] Cormen, Thomas H., Leiserson, Charles E. and Rivest, Ronald L., Introduction to Algorithms, MIT Press, USA, 1985.
[4] Gower, J.C. and Ross, G.J.S., Minimum Spanning Trees and Single Linkage Cluster Analysis, Applied Statistics, 18(1), 1969, 54-64.
[5] Huang, Zhexue, Clustering Large Data Sets with Mixed Numeric and Categorical Values, First Asia Pacific Conference on Knowledge Discovery and Data Mining, Singapore, World Scientific, February 1997.
[6] Huang, Zhexue, A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining, SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, Arizona, May 1997.
[7] Jain, Anil K. and Dubes, Richard C., Algorithms for Clustering Data, Prentice-Hall, USA, 1988.
[8] Lee, D.T. and Schachter, B.J., Two Algorithms for Constructing a Delaunay Triangulation, International Journal of Computer and Information Sciences, 9, 1980, 219-241.
[9] MacQueen, J.B., Some Methods for Classification and Analysis of Multivariate Observations, in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 281-297, 1967.
[10] Page, R.L., A Minimal Spanning Tree Clustering Method, Communications of the ACM, 17(6), 1974.
[11] Papadimitriou, Christos H. and Steiglitz, Kenneth, Combinatorial Optimization: Algorithms and Complexity, Prentice-Hall, USA, 1982.
[12] Sibson, Robin, The Dirichlet Tessellation as an Aid in Data Analysis, Scandinavian Journal of Statistics, 7, 1980.
[13] Watson, D.F., Computing the n-dimensional Delaunay Tessellation with Application to Voronoi Polytopes, The Computer Journal, 24(2), 1981, 167-172.
[14] Zahn, Charles T., Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters, IEEE Transactions on Computers, C-20(1), 1971, 68-86.
