A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
Zhexue Huang*
Cooperative Research Centre for Advanced Computational Systems
CSIRO Mathematical and Information Sciences
GPO Box 664, Canberra 2601, AUSTRALIA
email: Zhexue.Huang@cmis.csiro.au

* The author wishes to acknowledge that this work was carried out within the Cooperative Research Centre for Advanced Computational Systems (ACSys) established under the Australian Government's Cooperative Research Centres Program.
prototypes. A problem in using that algorithm is to choose a proper weight. We have suggested the use of the average standard deviation of numeric attributes as a guide in choosing the weight.

The k-modes algorithm presented in this paper is a simplification of the k-prototypes algorithm that takes only categorical attributes into account. The weight γ is therefore no longer necessary in the algorithm because of the disappearance of s_n. If numeric attributes are involved in a data set, we categorise them using a method as described in (Anderberg 1973). The biggest advantage of this algorithm is that it is scalable to very large data sets. Tested with a health insurance data set consisting of half a million records and 34 categorical attributes, the algorithm has shown a capability of clustering the data set into 100 clusters in about an hour using a single processor of a Sun Enterprise 4000 computer.

Ralambondrainy (1995) presented another approach to using the k-means algorithm to cluster categorical data. Ralambondrainy's approach converts multiple-category attributes into binary attributes (using 0 and 1 to represent a category being absent or present) and treats the binary attributes as numeric in the k-means algorithm. Used in data mining, this approach requires handling a large number of binary attributes, because data sets in data mining often have categorical attributes with hundreds or thousands of categories. This inevitably increases both the computational and space costs of the k-means algorithm. The other drawback is that the cluster means, given by real values between 0 and 1, do not indicate the characteristics of the clusters. In contrast, the k-modes algorithm works directly on categorical attributes and produces the cluster modes, which describe the clusters and are thus very useful to the user in interpreting the clustering results.

Using Gower's similarity coefficient (Gower 1971) and other dissimilarity measures (Gowda and Diday 1991), one can use a hierarchical clustering method to cluster categorical or mixed data. However, hierarchical clustering methods are not efficient in processing large data sets. Their use is limited to small data sets.

The rest of the paper is organised as follows. Categorical data and its representation are described in Section 2. In Section 3 we briefly review the k-means algorithm and its important properties. In Section 4 we discuss the k-modes algorithm. In Section 5 we present some experimental results on two real data sets to show the classification performance and computational efficiency of the k-modes algorithm. We summarise our discussion and describe our future work plan in Section 6.

2 Categorical Data

Categorical data as referred to in this paper is data describing objects which have only categorical attributes. The objects, called categorical objects, are a simplified version of the symbolic objects defined in (Gowda and Diday 1991). We assume all numeric (quantitative) attributes have been categorised, and we do not consider categorical attributes that have combinational values, e.g., Languages-spoken (Chinese, English). The following two subsections define the categorical attributes and objects accepted by the algorithm.

2.1 Categorical Domains and Attributes

Let A_1, A_2, …, A_m be m attributes describing a space Ω, and DOM(A_1), DOM(A_2), …, DOM(A_m) the domains of the attributes. A domain DOM(A_j) is defined as categorical if it is finite and unordered, i.e., for any a, b ∈ DOM(A_j), either a = b or a ≠ b. A_j is then called a categorical attribute. Ω is a categorical space if all of A_1, A_2, …, A_m are categorical.

A categorical domain defined here contains only singletons. Combinational values like those in (Gowda and Diday 1991) are not allowed. A special value, denoted by ε, is defined on all categorical domains and used to represent missing values. To simplify the dissimilarity measure we do not consider conceptual inclusion relationships among values in a categorical domain as in (Kodratoff and Tecuci 1988), e.g., car and vehicle being two categorical values in a domain where, conceptually, a car is also a vehicle. However, such relationships may exist in real-world databases.

2.2 Categorical Objects

As in (Gowda and Diday 1991), a categorical object X ∈ Ω is logically represented as a conjunction of attribute-value pairs [A_1 = x_1] ∧ [A_2 = x_2] ∧ … ∧ [A_m = x_m], where x_j ∈ DOM(A_j) for 1 ≤ j ≤ m. An attribute-value pair [A_j = x_j] is called a selector in (Michalski and Stepp 1983). Without ambiguity we represent X as a vector [x_1, x_2, …, x_m]. We consider that every object in Ω has exactly m attribute values. If the value of attribute A_j is not available for an object X, then A_j = ε.

Let X = {X_1, X_2, …, X_n} be a set of n categorical objects with X ⊆ Ω. Object X_i is represented as [x_{i,1}, x_{i,2}, …, x_{i,m}]. We write X_i = X_k if x_{i,j} = x_{k,j} for 1 ≤ j ≤ m. The relation X_i = X_k does not mean that X_i and X_k are the same object in the real-world database; it means the two objects have equal categorical values in attributes A_1, A_2, …, A_m. For example, two patients in a data set may have equal values in attributes Sex, Disease and Treatment. However, they are distinguished in the hospital database by other attributes, such as ID and Address, which were not selected for clustering.
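As a concrete illustration of this representation, the sketch below encodes categorical objects as fixed-length tuples over categorical domains, with a marker standing in for ε. The attribute names and values are invented for the example; this is not code from the paper.

```python
# Illustrative sketch: categorical objects as fixed-length tuples.
# EPSILON stands for the special value ε used for missing attribute values.
EPSILON = None

# Three records over invented attributes (Sex, Disease, Treatment):
x1 = ("M", "Diaporthe Stem Canker", "A")   # [A1 = M] ∧ [A2 = ...] ∧ [A3 = A]
x2 = ("M", "Diaporthe Stem Canker", "A")
x3 = ("F", "Charcoal Rot", EPSILON)        # Treatment not available, so A3 = ε

def equal_objects(xi, xk):
    """Xi = Xk iff x_{i,j} = x_{k,j} for 1 <= j <= m: equality of
    categorical values, not identity of real-world records."""
    return len(xi) == len(xk) and all(a == b for a, b in zip(xi, xk))
```

Here `equal_objects(x1, x2)` holds even though x1 and x2 may be different real-world records distinguished by attributes not selected for clustering.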
Assume X consists of n objects of which p objects are distinct. Let N be the cardinality of the Cartesian product DOM(A_1) × DOM(A_2) × … × DOM(A_m). We have p ≤ N. However, n may be larger than N, which means there are duplicates in X.

3 The K-means Algorithm

The k-means algorithm (MacQueen 1967, Anderberg 1973) is built upon four basic operations: (1) selection of the initial k means for k clusters, (2) calculation of the dissimilarity between an object and the mean of a cluster, (3) allocation of an object to the cluster whose mean is nearest to the object, and (4) re-calculation of the mean of a cluster from the objects allocated to it so that the intra-cluster dissimilarity is minimised. Except for the first operation, the other three operations are repeatedly performed until the algorithm converges. The essence of the algorithm is to minimise the cost function

E = Σ_{l=1}^{k} Σ_{i=1}^{n} y_{i,l} d(X_i, Q_l)    (1)

where n is the number of objects in a data set X, X_i ∈ X, Q_l is the mean of cluster l, and y_{i,l} is an element of an n × k partition matrix Y as in (Hand 1981). d is a dissimilarity measure usually defined by the squared Euclidean distance.

There exist a few variants of the k-means algorithm which differ in the selection of the initial k means, dissimilarity calculations, and strategies to calculate cluster means (Anderberg 1973, Bobrowski and Bezdek 1991). Sophisticated variants include the well-known ISODATA algorithm (Ball and Hall 1967) and the fuzzy k-means algorithms (Ruspini 1969, 1973).

Most k-means type algorithms have been proved convergent (MacQueen 1967, Bezdek 1980, Selim and Ismail 1984). The k-means algorithm has the following important properties.
1. It is efficient in processing large data sets. The computational complexity of the algorithm is O(tkmn), where m is the number of attributes, n is the number of objects, k is the number of clusters, and t is the number of iterations over the whole data set. Usually, k, m, t ≪ n. In clustering large data sets the k-means algorithm is much faster than the hierarchical clustering algorithms, whose general computational complexity is O(n²) (Murtagh 1992).
2. It often terminates at a local optimum (MacQueen 1967, Selim and Ismail 1984). To find the global optimum, techniques such as deterministic annealing (Kirkpatrick et al. 1983, Rose et al. 1990) and genetic algorithms (Goldberg 1989, Murthy and Chowdhury 1996) can be incorporated with the k-means algorithm.
3. It works only on numeric values because it minimises a cost function by calculating the means of clusters.
4. The clusters have convex shapes (Anderberg 1973). Therefore, it is difficult to use the k-means algorithm to discover clusters with non-convex shapes.

One difficulty in using the k-means algorithm is to specify the number of clusters. Some variants like ISODATA include a procedure to search for the best k at the cost of some performance.

The k-means algorithm is well suited to data mining because of its efficiency in processing large data sets. However, working only on numeric values limits its use, because data sets in data mining often have categorical values. Development of the k-modes algorithm, to be discussed in the next section, was motivated by the desire to remove this limitation and extend the k-means paradigm to categorical domains.

4 The K-modes Algorithm

The k-modes algorithm is a simplified version of the k-prototypes algorithm described in (Huang 1997). In this algorithm we have made three major modifications to the k-means algorithm: using different dissimilarity measures, replacing the k means with k modes, and using a frequency-based method to update modes. These modifications are discussed below.

4.1 Dissimilarity Measures

Let X, Y be two categorical objects described by m categorical attributes. The dissimilarity between X and Y can be defined as the total number of mismatches of the corresponding attribute categories of the two objects. The smaller the number of mismatches, the more similar the two objects. Formally,

d(X, Y) = Σ_{j=1}^{m} δ(x_j, y_j)    (2)

where

δ(x_j, y_j) = 0 if x_j = y_j, and 1 if x_j ≠ y_j.    (3)

d(X, Y) gives equal importance to each category of an attribute. If we take into account the frequencies of categories in a data set, we can define the dissimilarity measure as

d_χ²(X, Y) = Σ_{j=1}^{m} ((n_{x_j} + n_{y_j}) / (n_{x_j} n_{y_j})) δ(x_j, y_j)    (4)

where n_{x_j} and n_{y_j} are the numbers of objects in the data set that have categories x_j and y_j for attribute j. Because
d_χ²(X, Y) is similar to the chi-square distance in (Greenacre 1984), we call it the chi-square distance. This dissimilarity measure gives more importance to rare categories than to frequent ones. Eq. (4) is useful in discovering under-represented object clusters, such as fraudulent claims in insurance databases.

4.2 Mode of a Set

Let X be a set of categorical objects described by categorical attributes A_1, A_2, …, A_m.

Definition: A mode of X is a vector Q = [q_1, q_2, …, q_m] ∈ Ω that minimises

D(Q, X) = Σ_{i=1}^{n} d(X_i, Q)    (5)

where X = {X_1, X_2, …, X_n} and d can be defined as in either Eq. (2) or Eq. (4). Here, Q is not necessarily an element of X.

4.3 Find a Mode for a Set

Let n_{c_{k,j}} be the number of objects having category c_{k,j} in attribute A_j, and let f_r(A_j = c_{k,j} | X) = n_{c_{k,j}} / n be the relative frequency of category c_{k,j} in X.

Theorem: The function D(Q, X) is minimised iff f_r(A_j = q_j | X) ≥ f_r(A_j = c_{k,j} | X) for q_j ≠ c_{k,j}, for all j = 1, …, m.

The proof of the theorem is given in the Appendix. The theorem defines a way to find Q from a given X, and is therefore important because it allows the k-means paradigm to be used to cluster categorical data without losing its efficiency. The theorem also implies that the mode of a data set X is not unique. For example, the mode of the set {[a, b], [a, c], [c, b], [b, c]} can be either [a, b] or [a, c].

4.4 The k-modes Algorithm

Let {S_1, S_2, …, S_k} be a partition of X, where S_l ≠ ∅ for 1 ≤ l ≤ k, and {Q_1, Q_2, …, Q_k} the modes of {S_1, S_2, …, S_k}. The total cost of the partition is defined by

E = Σ_{l=1}^{k} Σ_{i=1}^{n} y_{i,l} d(X_i, Q_l)    (6)

where y_{i,l} is an element of an n × k partition matrix Y as in (Hand 1981) and d can be defined as in either Eq. (2) or Eq. (4).

Similar to the k-means algorithm, the objective of clustering X is to find a set {Q_1, Q_2, …, Q_k} that minimises E. Although the form of this cost function is the same as Eq. (1), d is different. Eq. (6) can be minimised by the k-modes algorithm below.

The k-modes algorithm consists of the following steps (refer to (Huang 1997) for a detailed description of the algorithm):
1. Select k initial modes, one for each cluster.
2. Allocate an object to the cluster whose mode is nearest to it according to d. Update the mode of the cluster after each allocation according to the Theorem.
3. After all objects have been allocated to clusters, retest the dissimilarity of objects against the current modes. If an object is found whose nearest mode belongs to a cluster other than its current one, reallocate the object to that cluster and update the modes of both clusters.
4. Repeat step 3 until no object has changed clusters after a full-cycle test of the whole data set.

Like the k-means algorithm, the k-modes algorithm produces locally optimal solutions that are dependent on the initial modes and the order of objects in the data set. In Section 5 we use a real example to show how appropriate initial mode selection methods can improve the clustering results.

In our current implementation of the k-modes algorithm we include two initial mode selection methods. The first method selects the first k distinct records from the data set as the initial k modes. The second method is implemented in the following steps.
1. Calculate the frequencies of all categories for all attributes and store them in a category array in descending order of frequency, as shown in Figure 1. Here, c_{i,j} denotes category i of attribute j and f(c_{i,j}) ≥ f(c_{i+1,j}), where f(c_{i,j}) is the frequency of category c_{i,j}.

   c_{1,1}  c_{1,2}  c_{1,3}  c_{1,4}
   c_{2,1}  c_{2,2}  c_{2,3}  c_{2,4}
   c_{3,1}           c_{3,3}  c_{3,4}
   c_{4,1}           c_{4,3}
                     c_{5,3}

   Figure 1. The category array of a data set with 4 attributes having 4, 2, 5 and 3 categories respectively.

2. Assign the most frequent categories equally to the initial k modes. For example in Figure 1, assume k = 3. We assign Q_1 = [q_{1,1} = c_{1,1}, q_{1,2} = c_{2,2}, q_{1,3} = c_{3,3}, q_{1,4} = c_{1,4}], Q_2 = [q_{2,1} = c_{2,1}, q_{2,2} = c_{1,2}, q_{2,3} = c_{4,3}, q_{2,4} = c_{2,4}] and Q_3 = [q_{3,1} = c_{3,1}, q_{3,2} = c_{2,2}, q_{3,3} = c_{1,3}, q_{3,4} = c_{3,4}].
3. Start with Q_1. Select the record most similar to Q_1 and substitute Q_1 with that record as the first initial mode. Then select the record most similar to Q_2 and substitute Q_2 with that record as the second initial mode. Continue this process until Q_k is substituted. In these selections, Q_l ≠ Q_t for l ≠ t.

Step 3 is taken to avoid the occurrence of empty clusters. The purpose of this selection method is to make the initial modes diverse, which can result in better clustering results (see Section 5.1.3).

5 Experimental Results

We used the well-known soybean disease data to test the classification performance of the algorithm, and a large data set selected from a health insurance database to test its computational efficiency. The second data set consists of half a million records, each described by 34 categorical attributes.

5.1 Tests on Soybean Disease Data

5.1.1 Test Data Sets

The soybean disease data is one of the standard test data sets used in the machine learning community. It has often been used to test conceptual clustering algorithms (Michalski and Stepp 1983, Fisher 1987). We chose this data set to test our algorithm because of its publicity and because all its attributes can be treated as categorical without categorisation.

The soybean data set has 47 observations, each described by 35 attributes. Each observation is identified by one of 4 diseases: Diaporthe Stem Canker, Charcoal Rot, Rhizoctonia Root Rot, and Phytophthora Rot. Except for Phytophthora Rot, which has 17 observations, all other diseases have 10 observations each. Eq. (2) was used in the tests because all disease classes are almost equally distributed. Of the 35 attributes we only selected 21, because the other 14 have only one category.

To study the effect of record order, we created 100 test data sets by randomly reordering the 47 observations. By doing this we were also selecting different records for the initial modes using the first selection method. All disease identifications were removed from the test data sets.

[…] to one correspondence between clusters and disease classes, which means the observations in the same disease classes were clustered into the same clusters. This represents a complete recovery of the 4 disease classes from the test data set.

In Figure 2(b) two observations of the disease class P were misclassified into cluster 1, which was dominated by the observations of the disease class R. However, the observations in the other two disease classes were correctly clustered into clusters 3 and 4. This clustering result can also be considered good.

   (a)
        Cluster 1   Cluster 2   Cluster 3   Cluster 4
   D       10
   C                   10
   R                               10
   P                                           17

   (b)
        Cluster 1   Cluster 2   Cluster 3   Cluster 4
   D                               10
   C                                           10
   R       10
   P        2          15

   Figure 2. Two misclassification matrices. (a) Correspondence between clusters of test data set 1 and disease classes. (b) Correspondence between clusters of test data set 9 and disease classes.

If we use the number of misclassified observations as a measure of a clustering result, we can summarise the 200 clustering results as in Table 1. The first column in the table gives the number of misclassified observations. The second and third columns show the numbers of clustering results.

   Table 1.
   Misclassified observations   First selection method   Second selection method
   0                            13                       14
   1                             7                        8
   2                            12                       26
   3                             4                        9
   4                             7                        6
   5                             2                        1
   >5                           55                       36
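The k-modes procedure of Section 4 — the matching dissimilarity of Eq. (2), the frequency-based mode update of the Theorem, and allocation steps 1-4 — can be sketched as below. This is an illustrative reimplementation under the first initial-mode selection method (first k distinct records), not the author's original code.

```python
from collections import Counter

def d(x, y):
    """Matching dissimilarity, Eq. (2): number of mismatched categories."""
    return sum(1 for a, b in zip(x, y) if a != b)

def mode_of(objects, m):
    """Frequency-based mode update (Theorem, Section 4.3): for each
    attribute, the most frequent category among the cluster's objects."""
    return tuple(Counter(x[j] for x in objects).most_common(1)[0][0]
                 for j in range(m))

def k_modes(X, k, max_cycles=100):
    """Sketch of steps 1-4; initial modes are the first k distinct records."""
    m = len(X[0])
    modes = []
    for x in X:                                   # step 1: k initial modes
        if x not in modes:
            modes.append(x)
        if len(modes) == k:
            break
    member = [None] * len(X)
    clusters = [[] for _ in range(k)]
    for i, x in enumerate(X):                     # step 2: allocate, then update mode
        l = min(range(k), key=lambda c: d(x, modes[c]))
        member[i] = l
        clusters[l].append(i)
        modes[l] = mode_of([X[j] for j in clusters[l]], m)
    for _ in range(max_cycles):                   # steps 3-4: retest until stable
        changed = False
        for i, x in enumerate(X):
            l = min(range(k), key=lambda c: d(x, modes[c]))
            if l != member[i]:
                old = member[i]
                clusters[old].remove(i)
                clusters[l].append(i)
                member[i] = l
                for c in (old, l):                # update modes of both clusters
                    if clusters[c]:
                        modes[c] = mode_of([X[j] for j in clusters[c]], m)
                changed = True
        if not changed:
            break
    return modes, member
```

As in the paper, the result depends on the initial modes and on record order, so in practice one would keep the best of several runs by cost value.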
[…] selected according to its cost value.

   Table 2.
   Misclassified observations   Total mismatches for method 1   Total mismatches for method 2
   0                            194(13)                         194(14)
   1                            194(7)                          194(7), 197(1)
   2                            194(12)                         194(25), 195(1)
   3                            195(2), 197(1), 201(1)          195(6), 196(2), 197(1)
   4                            195(2), 196(3), 197(2)          195(4), 196(1), 197(1)
   5                            197(2)                          197(1)
   >5                           203-261                         209-254

   Table 3.
   No. of classes   No. of runs   Mean cost   Std Dev
   1                1             247         -

[…] disease classes the initial modes have, and the second is the corresponding number of runs with that number of disease classes in the initial modes. This table indicates that the more diverse the disease classes in the initial modes, the better the clustering results. The initial modes selected by the second method have 3 disease types, therefore more good clustering results were produced than by the first method.

From the modes and category distributions of different attributes in different clusters the algorithm can also […]

   Figure 3. Scalability to the number of clusters in clustering 500000 records (real run time in seconds against 10 to 100 clusters).

   Figure 4. Scalability to the number of records clustered into 100 clusters (real run time in seconds).

These results are very encouraging because they clearly show a linear increase in time as both the number of clusters and the number of records increase. Clustering half a million objects into 100 clusters took about an hour, which is quite acceptable. Compared with the results of clustering data with mixed values (Huang 1997), this algorithm is much faster than its previous version because it needs far fewer iterations to converge.
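The second initial-mode selection method of Section 4, whose diverse initial modes drive the better results discussed above, can be sketched as follows. The exact rule for spreading the most frequent categories over the k provisional modes is an assumption (the paper only requires that they be assigned "equally"), and the helper name `diverse_initial_modes` is invented.

```python
from collections import Counter

def diverse_initial_modes(X, k, d):
    """Sketch of the second selection method: rank categories by frequency,
    spread frequent categories over k provisional modes, then replace each
    provisional mode by the most similar record (avoiding duplicates, so no
    cluster starts empty). d is a dissimilarity such as Eq. (2)."""
    m = len(X[0])
    # Step 1: categories of each attribute in descending order of frequency.
    ranked = [[c for c, _ in Counter(x[j] for x in X).most_common()]
              for j in range(m)]
    # Step 2: spread the most frequent categories over the k provisional
    # modes; the rotation scheme below is one possible "equal" assignment.
    provisional = [tuple(ranked[j][(l + j) % len(ranked[j])] for j in range(m))
                   for l in range(k)]
    # Step 3: substitute each provisional mode with its most similar record.
    modes = []
    for q in provisional:
        best = min((x for x in X if x not in modes), key=lambda x: d(x, q))
        modes.append(best)
    return modes
```

Because step 3 returns actual records, every initial mode has at least one nearest object, which is what avoids empty clusters.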
The above soybean disease data tests indicate that a good clustering result should be selected from multiple runs of the algorithm over the same data set with different record orders and/or different initial modes. In practice this can be done by running the algorithm in parallel on a parallel computing system. Other parts of the algorithm, such as the operation to allocate an object to a cluster, can also be parallelised to improve performance.

6 Summary and Future Work

The biggest advantage of the k-means algorithm in data mining applications is its efficiency in clustering large data sets. However, its use is limited to numeric values. The k-modes algorithm presented in this paper has removed this limitation whilst preserving that efficiency. The k-modes algorithm makes the following extensions to the k-means algorithm:
1. replacing the means of clusters with modes,
2. using new dissimilarity measures to deal with categorical objects, and
3. using a frequency-based method to update the modes of clusters.
These extensions allow us to use the k-means paradigm directly to cluster categorical data, without the need for data conversion.

Another advantage of the k-modes algorithm is that the modes give characteristic descriptions of the clusters. These descriptions are very important to the user in interpreting the clustering results.

Because data mining deals with very large data sets, scalability is a basic requirement of data mining algorithms. Our experimental results have demonstrated that the k-modes algorithm is indeed scalable to very large and complex data sets in terms of both the number of records and the number of clusters. In fact the k-modes algorithm is faster than the k-means algorithm, because our experiments have shown that the former often needs fewer iterations to converge than the latter.

Our future work plan is to develop and implement a parallel k-modes algorithm to cluster data sets with millions of objects. Such an algorithm is required in a number of data mining applications, such as partitioning very large heterogeneous sets of objects into a number of smaller and more manageable homogeneous subsets that can be more easily modelled and analysed, and detecting under-represented concepts, e.g., fraud in a very large number of insurance claims.

Acknowledgments

The author is grateful to Dr Markus Hegland at The Australian National University, and to Mr Peter Milne and Dr Graham Williams at CSIRO, for their comments on the paper.

References

Anderberg, M. R. (1973) Cluster Analysis for Applications, Academic Press.
Ball, G. H. and Hall, D. J. (1967) A Clustering Technique for Summarizing Multivariate Data, Behavioral Science, 12, pp. 153-155.
Bezdek, J. C. (1980) A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(8), pp. 1-8.
Bobrowski, L. and Bezdek, J. C. (1991) c-Means Clustering with the l1 and l∞ Norms, IEEE Transactions on Systems, Man and Cybernetics, 21(3), pp. 545-554.
Fisher, D. H. (1987) Knowledge Acquisition Via Incremental Conceptual Clustering, Machine Learning, 2(2), pp. 139-172.
Goldberg, D. E. (1989) Genetic Algorithms in Search, Optimisation, and Machine Learning, Addison-Wesley.
Gowda, K. C. and Diday, E. (1991) Symbolic Clustering Using a New Dissimilarity Measure, Pattern Recognition, 24(6), pp. 567-578.
Gower, J. C. (1971) A General Coefficient of Similarity and Some of its Properties, Biometrics, 27, pp. 857-874.
Greenacre, M. J. (1984) Theory and Applications of Correspondence Analysis, Academic Press.
Hand, D. J. (1981) Discrimination and Classification, John Wiley & Sons.
Huang, Z. (1997) Clustering Large Data Sets with Mixed Numeric and Categorical Values, In Proceedings of The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore, World Scientific.
Jain, A. K. and Dubes, R. C. (1988) Algorithms for Clustering Data, Prentice Hall.
Kirkpatrick, S., Gelatt, C. D. and Vecchi, M. P. (1983) Optimisation by Simulated Annealing, Science, 220(4598), pp. 671-680.
Kodratoff, Y. and Tecuci, G. (1988) Learning Based on Conceptual Distance, IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(6), pp. 897-909.
MacQueen, J. B. (1967) Some Methods for Classification and Analysis of Multivariate Observations, In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297.
Michalski, R. S. and Stepp, R. E. (1983) Automated Construction of Classifications: Conceptual Clustering Versus Numerical Taxonomy, IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(4), pp. 396-410.
Murtagh, F. (1992) Comments on "Parallel Algorithms for Hierarchical Clustering and Cluster Validity", IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(10), pp. 1056-1057.
Murthy, C. A. and Chowdhury, N. (1996) In Search of Optimal Clusters Using Genetic Algorithms, Pattern Recognition Letters, 17, pp. 825-832.
Ralambondrainy, H. (1995) A Conceptual Version of the k-Means Algorithm, Pattern Recognition Letters, 16, pp. 1147-1157.
Rose, K., Gurewitz, E. and Fox, G. (1990) A Deterministic Annealing Approach to Clustering, Pattern Recognition Letters, 11, pp. 589-594.
Ruspini, E. R. (1969) A New Approach to Clustering, Information Control, 19, pp. 22-32.
Ruspini, E. R. (1973) New Experimental Results in Fuzzy Clustering, Information Sciences, 6, pp. 273-284.
Selim, S. Z. and Ismail, M. A. (1984) K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality, IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(1), pp. 81-87.

Appendix

For the dissimilarity measure defined in Eq. (2) we write

Σ_{i=1}^{n} d(X_i, Q) = Σ_{j=1}^{m} Σ_{i=1}^{n} δ(x_{i,j}, q_j) = Σ_{j=1}^{m} n(1 − n_{q_j}/n) = Σ_{j=1}^{m} n(1 − f_r(A_j = q_j | X))

For the dissimilarity measure d_χ²(X, Y) = Σ_{j=1}^{m} ((n_{x_j} + n_{y_j}) / (n_{x_j} n_{y_j})) δ(x_j, y_j), we write

Σ_{i=1}^{n} d_χ²(X_i, Q)
  = Σ_{i=1}^{n} Σ_{j=1}^{m} ((n_{x_{i,j}} + n_{q_j}) / (n_{x_{i,j}} n_{q_j})) δ(x_{i,j}, q_j)
  = Σ_{j=1}^{m} Σ_{i=1}^{n} (1/n_{q_j} + 1/n_{x_{i,j}}) δ(x_{i,j}, q_j)
  = Σ_{j=1}^{m} Σ_{i=1}^{n} (1/n_{q_j}) δ(x_{i,j}, q_j) + Σ_{j=1}^{m} Σ_{i=1}^{n} (1/n_{x_{i,j}}) δ(x_{i,j}, q_j)

Now we have

Σ_{i=1}^{n} (1/n_{x_{i,j}}) δ(x_{i,j}, q_j) = n_{c_j} − 1

where n_{c_j} is the number of categories in A_j and n_{c_{k,j}} the number of objects having category c_{k,j} in attribute A_j, as before. […]
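The Theorem can be checked numerically on the paper's own example set {[a, b], [a, c], [c, b], [b, c]}. The sketch below enumerates every candidate Q ∈ Ω under the matching measure of Eq. (2) and confirms that a frequency-based mode attains the minimum of D(Q, X); it is a verification aid, not part of the paper.

```python
from collections import Counter
from itertools import product

def d(x, y):
    """Matching dissimilarity, Eq. (2)."""
    return sum(1 for a, b in zip(x, y) if a != b)

# The example set from Section 4.3: its mode can be [a, b] or [a, c].
X = [("a", "b"), ("a", "c"), ("c", "b"), ("b", "c")]
domains = [sorted({x[j] for x in X}) for j in range(2)]

def D(Q):
    """D(Q, X) = sum_i d(X_i, Q), Eq. (5)."""
    return sum(d(x, Q) for x in X)

# Frequency-based mode: per attribute, a most frequent category.
mode = tuple(Counter(x[j] for x in X).most_common(1)[0][0] for j in range(2))
# Brute-force minimum over all of Ω (all category combinations).
best = min(D(Q) for Q in product(*domains))
```

Here `D(mode) == best`, and `D(("a", "c"))` attains the same minimum, illustrating that the mode is not unique.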