Article info

Article history: Received 19 January 2018; Available online 13 September 2018
Keywords: Data clustering; Inter-center distance; Center-based clustering; Cluster number estimation

Abstract

Center-based clustering methods like k-Means intend to identify closely packed clusters of data points by finding the centers of the clusters. However, k-Means requires the user to guess the number of clusters, instead of estimating it on the run. Hence, the incorporation of accurate automatic estimation of the natural number of clusters present in a data set is important to make a clustering method truly unsupervised. For k-Means, the minimum of the pairwise distances between cluster centers decreases as the user-defined number of clusters increases. In this paper, we observe that the last significant reduction occurs just as the user-defined number surpasses the natural number of clusters. Based on this insight, we propose two techniques, the Last Leap (LL) and the Last Major Leap (LML), to estimate the number of clusters for k-Means. Over a number of challenging situations, we show that LL accurately identifies the number of well-separated clusters, whereas LML identifies the number of equal-sized clusters. Any disparity between the values of LL and LML can thus inform a user about the underlying cluster structures present in the data set. The proposed techniques are independent of the size of the data set, making them especially suitable for large data sets. Experiments show that LL and LML perform competitively with the best cluster number estimation techniques while imposing a drastically lower computational burden.

© 2018 Elsevier B.V. All rights reserved.
https://doi.org/10.1016/j.patrec.2018.09.003
Fig. 1. The data has 3 clusters, which can be identified from the knee-point in the plot of Wk over increasing values of the number of clusters k.
Fig. 2. The knee-point often fails for higher numbers of clusters. The data set has 25 clusters; however, the knee-point identifies only 9 clusters.
In this paper, we propose two cluster number estimation methods: the Last Leap (LL) and the Last Major Leap (LML). Both methods observe only the minimum distance between cluster centers to estimate the cluster number, and consequently have a computational complexity far lower than existing methods (making them much faster on large data sets). Through a number of experiments, we compare our methods with other existing methods for k-Means clustering to determine which accurately estimates the natural number of clusters.
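To make the quantity that both methods monitor concrete, the following minimal sketch (assuming scikit-learn's KMeans) computes the minimum pairwise inter-center distance d_k over a range of candidate k and reports the last k after which d_k drops sharply. The drop threshold of 2.0 and the "last large relative drop" selection rule are illustrative assumptions on our part; the paper's exact LL and LML criteria are those specified in Table 1.

```python
# Sketch: estimate the number of clusters from the last significant
# drop in the minimum pairwise inter-center distance d_k.
# NOTE: an illustration of the idea only; the paper's exact LL/LML
# selection criteria (Table 1) are not reproduced here, and the
# threshold `drop_ratio` is an assumption chosen for this example.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

def min_center_distance(X, k, seed=0):
    """Fit k-Means with k clusters and return the minimum pairwise
    distance d_k between the fitted cluster centers."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return pdist(km.cluster_centers_).min()

def last_leap_estimate(X, k_max=20, drop_ratio=2.0):
    """Return the last k (2 <= k < k_max) after which d_k shrinks
    sharply, i.e. d_k / d_{k+1} > drop_ratio."""
    ks = np.arange(2, k_max + 1)
    d = np.array([min_center_distance(X, k) for k in ks])
    leaps = [ks[i] for i in range(len(ks) - 1) if d[i] / d[i + 1] > drop_ratio]
    return leaps[-1] if leaps else ks[0]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three well-separated 2-D Gaussian clusters (cf. Fig. 1).
    centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])
    print(last_leap_estimate(X))  # expected: 3
```

Note that once a cluster is split by an over-specified k, two centers land inside the same dense group, so d_{k+1} collapses relative to d_k; that collapse is the "leap" the sketch looks for.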
2. Motivation

Fig. 3. The last significant reduction in dk occurs after k∗ = 3 for the data set 3clusters (Fig. 1(a)) and after k∗ = 25 for the data set 25clusters (Fig. 2(a)).
Table 1
Specifications of the cluster number estimation methods.
Sl. no. | Name | Complexity | Selection criterion for k̂ | Min. no. of clusters
Fig. 6. Data sets used to validate the selection criteria for the cluster number estimation methods.
Fig. 7. The clusters identified by LL for the validation data sets.
Table 2
The accuracy (%) of the cluster number estimation methods, following the selection criteria in Table 1, on the validation data sets in Fig. 6.

Method  Accuracy (%)    Method  Accuracy (%)    Method  Accuracy (%)
AIC     75.00           Gap     83.33           PBMF    91.67
BIC     83.33           HV      83.33           PCAES   83.33
CH      91.67           Hart    41.67           PS      83.33
CE      83.33           I       91.67           RLWY    66.67
CWB     83.33           Jump    91.67           Rez     75.00
DB      75.00           LL      91.67           SIL     83.33
DUNN    75.00           LML     100.00          Slope   83.33
Knee    83.33           MPC     83.33           XB      83.33
FS      66.67           PC      83.33           Xu      83.33
FH      91.67           PI      75.00           ZXF     75.00

Fig. 9. Well-separated clusters are generated with standard deviation σ set to 1, and the minimum possible distance between their centers set to 10.
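The generation protocol described in the Fig. 9 caption can be mimicked by rejection-sampling cluster centers until they satisfy the minimum inter-center distance. The sketch below is a hypothetical generator under that reading (equal-sized spherical Gaussian clusters), not the authors' code; the sampling box and function names are our assumptions.

```python
# Sketch: generate well-separated spherical Gaussian clusters in the
# spirit of Fig. 9 (sigma = 1, minimum inter-center distance = 10).
# The function and its parameters are illustrative, not the authors'
# exact generator.
import numpy as np

def make_separated_clusters(n_clusters, dim, n_points=100, sigma=1.0,
                            min_dist=10.0, seed=0):
    rng = np.random.default_rng(seed)
    centers = []
    while len(centers) < n_clusters:
        c = rng.uniform(0, min_dist * n_clusters, size=dim)
        # Rejection sampling: keep a candidate center only if it is at
        # least `min_dist` away from every accepted center. (The box
        # grows with n_clusters so the sampling terminates quickly.)
        if all(np.linalg.norm(c - p) >= min_dist for p in centers):
            centers.append(c)
    X = np.vstack([c + sigma * rng.normal(size=(n_points, dim))
                   for c in centers])
    y = np.repeat(np.arange(n_clusters), n_points)
    return X, y

X, y = make_separated_clusters(n_clusters=5, dim=2)
```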
Fig. 10. Accuracy of estimating the number of clusters on data sets satisfying all constraints.

1 The source codes for LL, LML, and all the experiments are available at https://github.com/Avisek20/cluster_number_estimation.
Table 3
The execution times for the top-performing methods from Section 5.2. The execution times are averaged over 30 runs of each method on data sets generated with increasing sizes, containing 5 equal-sized well-separated spherical clusters.
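As a hedged illustration of the timing protocol in the Table 3 caption, the sketch below averages the wall-clock time of an estimation routine over 30 runs. The stand-in routine (`scan_kmeans`) and the data sizes are placeholders of our choosing, not the paper's exact settings.

```python
# Sketch: average the execution time of a cluster number estimation
# routine over 30 runs, as in Table 3. The estimator and data sizes
# below are placeholders, not the paper's exact settings.
import time
import numpy as np
from sklearn.cluster import KMeans

def mean_runtime(estimator, X, n_runs=30):
    """Average wall-clock time of `estimator(X)` over `n_runs` runs."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        estimator(X)
        times.append(time.perf_counter() - start)
    return float(np.mean(times))

def scan_kmeans(X):
    # Stand-in estimator: scan k = 2..10 with k-Means (a placeholder
    # for whichever cluster number estimation method is being timed).
    for k in range(2, 11):
        KMeans(n_clusters=k, n_init=2, random_state=0).fit(X)

rng = np.random.default_rng(0)
for n in (500, 5000):  # placeholder sizes
    X = rng.normal(size=(n, 2))
    print(n, mean_runtime(scan_kmeans, X))
```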
5.4. Violation of constraint 2

In the next experiment we violate Constraint 2 by randomly generating clusters with standard deviations set to 2, 3, or 4 times the minimum standard deviation (which is set to 1). We generate 50 groups of data in a manner similar to Section 5.2, covering increasing numbers of dimensions (2, 10, 20, 35, and 50) and increasing numbers of clusters (2, 10, 20, 35, and 50). The results in Fig. 13 show that LL, CE, FHV, I, and MPC performed the best. The performance of the previously top-performing CH dropped slightly (-7%) in the presence of high variation in the dispersion of the clusters. We observe a rather high drop in performance for BIC. Here we observe an interesting contrast in the behaviour of LL and LML. Since LML identifies equal-sized clusters, it tends to choose a higher number of clusters when neighbouring clusters have different spreads. This behaviour is beneficial in accurately selecting the number of clusters for a data set such as Iris, but on synthetic data sets with a large disparity in the spread of nearby clusters, LML tends to break the larger clusters into smaller equal-sized clusters. Thus, when we do not know much about the cluster structures present in a data set, the difference between the numbers of clusters estimated by LL and LML can help inform us about the nature of the clusters identified by k-Means.

5.5. Violation of constraint 3

We first investigate the performances under soft violation of Constraint 3, by generating data with clusters having low overlap. This is ensured by generating clusters with a 50% probability of their centers lying within a minimum distance of between 3 and 5 units of the center of another cluster (the variance of all clusters is 1 unit). We generate 100 groups of data, where each group contains five two-dimensional data sets with 5, 10, 20, 35, and 50 clusters, respectively. Generating two-dimensional data makes it easier to manually verify how many of the clusters are overlapped and how many are well-separated. From the results shown in Fig. 15, we observe that BIC, CH, and ZXF achieve the highest accuracy in identifying the number of overlapped clusters. Although ZXF and BIC perform well here, in previous experiments we observed lower performances for ZXF when the number of dimensions was greater than two, and for BIC when there were large disparities in cluster variances. CH is perhaps the most reliable in identifying the clusters when they are slightly overlapped. When looking at the number of well-separated clusters, we observe that the previously top-performing LL, CE, FHV, MPC, and PC perform the best here as well. Even in the presence of overlap, these methods can reliably identify the number of well-separated clusters. We observe that by dividing clusters into equal sizes, LML can provide a decent estimate of the number of overlapped clusters.
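A hedged sketch of how such low-overlap data could be generated: each new center is placed, with 50% probability, 3 to 5 units from an existing center, and otherwise well away from it. The placement geometry (a random angle around a randomly chosen anchor, with no global separation check in the "far" case) is our assumed reading of the protocol, not the authors' generator.

```python
# Sketch: clusters with controlled low overlap, mimicking the soft
# violation of Constraint 3 (unit variance; each new center has a 50%
# chance of lying 3-5 units from an existing center). An assumed
# reading of the protocol, not the authors' exact generator.
import numpy as np

def make_overlapping_clusters(n_clusters, n_points=100, p_overlap=0.5,
                              near=(3.0, 5.0), far=10.0, seed=0):
    rng = np.random.default_rng(seed)
    centers = [rng.uniform(0, 10.0 * n_clusters, size=2)]
    for _ in range(n_clusters - 1):
        anchor = centers[rng.integers(len(centers))]
        if rng.random() < p_overlap:
            dist = rng.uniform(*near)        # slightly overlapping neighbour
        else:
            dist = far + rng.uniform(0, 5)   # well-separated neighbour
        angle = rng.uniform(0, 2 * np.pi)
        centers.append(anchor + dist * np.array([np.cos(angle), np.sin(angle)]))
    X = np.vstack([c + rng.normal(size=(n_points, 2)) for c in centers])
    return X, np.repeat(np.arange(n_clusters), n_points)

X, y = make_overlapping_clusters(n_clusters=10)
```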
Fig. 11. Accuracy of estimating the number of clusters on data sets with clusters having a slight difference in the number of points.
Fig. 14. For neighbouring clusters with a high disparity in spread, LL tends to identify well-separated clusters, whereas LML identifies clusters of similar size.
Fig. 15. Accuracy of estimating the number of clusters on data sets with clusters having slight overlap.
Fig. 16. Accuracy of estimating the number of clusters on data sets with clusters having high overlap.
Next, we investigate the performance under strong violation of Constraint 3. Clusters are generated with a 50% probability of their centers having a minimum distance of between 2 and 3 units to the center of another cluster (the variance of all clusters is fixed at 1). Similar to the previous experiment, we generate 100 groups of two-dimensional data sets containing 5, 10, 20, 35, and 50 clusters. The results in Fig. 16 suggest that only CH (with a 75% accuracy) and ZXF (with an accuracy of 86%) can occasionally provide correct estimates of the number of overlapped clusters. We observe that LL, FHV, and HV perform the best at identifying the number of well-separated clusters, closely followed by LML, Jump, MPC, and PC. Since highly overlapped clusters form dense well-separated clusters, both LL and LML provide more accurate estimates.

5.6. Real-world data

The identification of the number of clusters from real-world data sets is indeed a challenge. Clustering methods should identify groups that are sufficiently dissimilar. Identifying clusters that are well-separated ensures that the identified clusters are sufficiently dissimilar. Therefore, in this section we investigate the performance of LL and LML in identifying the number of clusters in real-world data sets. The real data sets considered are shown in Table 4. The first six data sets are from the UCI Machine Learning Repository [41]. The last two are high-dimension cancer data sets available at https://www.stat.cmu.edu/∼jiashun/Research/software/GenomicsData. Due to the higher difficulty of estimating the number of clusters on real data sets, performing within [k∗ − 1, k∗ + 1] is considered a success. Each method is assigned a rank based on its performance, where the most successful methods are assigned rank 1, the next best methods are assigned rank 2, and so on. The average rank for each method is reported in Table 5. From the results we observe that LL, LML, CE, and PC perform the best, closely followed by XB, MPC, CH, I, and RLWY. Thus, we observe that LL and LML achieved competent performance on real data sets as well. The full results for each method on all the real data sets are present in the supplementary document. A sketch of this ranking scheme is given after Table 5.

Table 4
Specifications of the real data sets.

Name                            Size (instances, features)
Data Banknote Authentication    (1372, 4)
Echocardiogram                  (106, 9)
Iris                            (150, 4)
Seeds                           (210, 7)
Sonar Mines vs. Rocks           (208, 60)
Wine                            (178, 13)
Colon Cancer                    (62, 2000)
Prostate Cancer                 (102, 6033)

Table 5
The average rank of estimating k̂ from real data sets.

Name   Average rank    Name    Average rank
AIC    18.000          MPC      2.750
BIC    18.000          PC       1.000
CH      2.875          PI      13.625
CE      1.000          PBMF     6.125
CWB     8.375          PCAES    8.375
DB     18.000          PS       4.125
Dunn   15.500          RLWY     2.875
Knee    4.250          SIL     18.000
FS      4.125          Slope    4.375
FHV     5.375          XB       2.625
Gap     5.500          Xu       8.375
HV     18.000          ZXF     18.000
Hart    6.625          SC      12.875
I       2.875          LL       1.000
Jump    4.375          LML      1.000
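One plausible implementation of the ranking scheme is sketched below: per data set, methods whose estimate falls within [k∗ − 1, k∗ + 1] share rank 1, the rest follow, and the ranks are averaged across data sets. The toy numbers and the competition-style ("min") tie handling are our assumptions, not a procedure stated by the paper.

```python
# Sketch: rank methods per data set by success (|k_hat - k_star| <= 1),
# then average the ranks, as described for Table 5. The 'min' (shared
# rank for ties) convention and the toy numbers are assumptions.
import numpy as np
from scipy.stats import rankdata

# estimates[m][d]: estimate of method m on data set d (toy numbers).
estimates = np.array([[3, 5, 10],    # method A
                      [3, 4, 2],     # method B
                      [8, 5, 9]])    # method C
k_star = np.array([3, 5, 10])        # true cluster numbers

errors = np.abs(estimates - k_star)      # distance from k* per data set
success = (errors <= 1).astype(float)    # within [k*-1, k*+1]
# Rank per data set: the most successful methods get rank 1.
ranks = np.vstack([rankdata(-success[:, d], method='min')
                   for d in range(success.shape[1])]).T
print(ranks.mean(axis=1))                # average rank per method
```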
6. Conclusion

In this paper, we propose two methods, LL and LML, to estimate the number of clusters from a data set. Both methods use only the minimum distance between cluster centers to estimate the number of clusters. We compared different cluster number estimation methods that are applicable to the k-Means clustering algorithm to investigate which method performs the best. From extensive experiments, we observe that LL and LML can estimate the number of clusters from different perspectives, giving a more informed view of the cluster structures present in a given data set. LL performs on par with the best cluster number estimation methods in accurately identifying the number of well-separated clusters. LML performs equally well in ideal situations, and in more ambiguous situations it can identify equal-sized clusters. The difference in results between LL and LML can inform a user of the underlying cluster structures present in a data set. The performance of LL and LML is consistent over a number of challenging situations (such as slight or high differences in the number of points in different clusters, different spreads of clusters, and slight or high overlap between clusters). We also observe that LL and LML are robust to the increase in the number of clusters (up to 50) as well as the number of dimensions (up to 50), while drastically outperforming all other methods in terms of the computation time required, due to their O(k^2) time complexity, where generally k << n. We conclude that to identify clusters that are well-separated, the minimum distance between clusters provides enough information to estimate the number of clusters. This lets us make the general recommendation of using LL or LML in applications to identify clusters from large data sets. In such situations LL and LML can efficiently estimate the number of clusters by considering only the minimum distance between centers, without needing to consult the data set.

Supplementary material

Supplementary material associated with this article can be found in the online version at doi:10.1016/j.patrec.2018.09.003.

References

[1] R. Xu, D. Wunsch, Survey of clustering algorithms, IEEE Trans. Neural Netw. 16 (3) (2005) 645–678.
[2] M. Teboulle, A unified continuous optimization framework for center-based clustering methods, J. Mach. Learn. Res. 8 (2007) 65–102.
[3] A.K. Jain, Data clustering: 50 years beyond k-means, Pattern Recognit. Lett. 31 (8) (2010) 651–666.
[4] E. Hancer, D. Karaboga, A comprehensive survey of traditional, merge-split and evolutionary approaches proposed for determination of cluster number, Swarm Evol. Comput. 32 (2017) 49–67.
[5] G.W. Milligan, M.C. Cooper, An examination of procedures for determining the number of clusters in a data set, Psychometrika 50 (2) (1985) 159–179.
[6] E. Dimitriadou, S. Dolničar, A. Weingessel, An examination of indexes for determining the number of clusters in binary data sets, Psychometrika 67 (1) (2002) 137–159.
[7] O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J.M. Pérez, I. Perona, An extensive comparative study of cluster validity indices, Pattern Recognit. 46 (1) (2013) 243–256.
[8] R. Tibshirani, G. Walther, T. Hastie, Estimating the number of clusters in a data set via the gap statistic, J. Royal Stat. Soc. Series B (Stat. Methodol.) 63 (2) (2001) 411–423.
[9] R. Tibshirani, G. Walther, Cluster validation by prediction strength, J. Comput. Graph. Stat. 14 (3) (2005) 511–528.
[10] B.J. Frey, D. Dueck, Clustering by passing messages between data points, Science 315 (5814) (2007) 972–976.
[11] J. Handl, J. Knowles, An evolutionary approach to multiobjective clustering, IEEE Trans. Evol. Comput. 11 (1) (2007) 56–76.
[12] S. Louhichi, M. Gzara, H. Ben-Abdallah, Unsupervised varied density based clustering algorithm using spline, Pattern Recognit. Lett. 93 (2017) 48–57.
[13] M. Du, S. Ding, Y. Xue, A novel density peaks clustering algorithm for mixed data, Pattern Recognit. Lett. 97 (2017) 46–53.
[14] Z. Liang, P. Chen, Delta-density based clustering with a divide-and-conquer strategy: 3DC clustering, Pattern Recognit. Lett. 73 (2016) 52–59.
[15] Q. Tong, X. Li, B. Yuan, A highly scalable clustering scheme using boundary information, Pattern Recognit. Lett. 89 (2017) 1–7.
[16] S.M. Savaresi, D.L. Boley, On the performance of bisecting k-Means and PDDP, in: Proceedings of the 2001 SIAM International Conference on Data Mining, 2001, pp. 1–14.
[17] G. Hamerly, C. Elkan, Learning the k in k-means, in: Proceedings of the 16th International Conference on Neural Information Processing Systems (NIPS'03), MIT Press, Cambridge, MA, USA, 2003, pp. 281–288.
[18] X.L. Xie, G. Beni, A validity measure for fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 13 (8) (1991) 841–847.
[19] A. Fujita, D.Y. Takahashi, A.G. Patriota, A non-parametric method to estimate the number of clusters, Comput. Stat. Data Anal. 73 (2014) 27–39.
[20] M.R. Rezaee, B.P. Lelieveldt, J.H. Reiber, A new cluster validity index for the fuzzy c-mean, Pattern Recognit. Lett. 19 (3) (1998) 237–246.
[21] Q. Zhao, M. Xu, P. Fränti, Sum-of-squares based cluster validity index and significance analysis, in: ICANNGA, 5495, Springer, 2009, pp. 313–322.
[22] T. Caliński, J. Harabasz, A dendrite method for cluster analysis, Communications in Statistics 3 (1) (1974) 1–27.
[23] J.C. Bezdek, Mathematical models for systematics and taxonomy, in: Eighth International Conference on Numerical Taxonomy, 3, 1975, pp. 143–166.
[24] D.L. Davies, D.W. Bouldin, A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intell. 1 (2) (1979) 224–227.
[25] J.C. Dunn, A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters, J. Cybernet. 3 (3) (1973) 32–57.
[26] Y. Fukuyama, M. Sugeno, A new method of choosing the number of clusters for the fuzzy c-means method, in: Proceedings of the Fifth Fuzzy Systems Symposium, 1989, pp. 247–250.
[27] R.N. Dave, Validating fuzzy partitions obtained through c-shells clustering, Pattern Recognit. Lett. 17 (6) (1996) 613–623.
[28] M. Halkidi, M. Vazirgiannis, Clustering validity assessment: finding the optimal partitioning of a data set, in: Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM 2001), IEEE, 2001, pp. 187–194.
[29] J.A. Hartigan, Statistical theory in clustering, J. Classif. 2 (1) (1985) 63–76.
[30] U. Maulik, S. Bandyopadhyay, Performance evaluation of some clustering algorithms and validity indices, IEEE Trans. Pattern Anal. Mach. Intell. 24 (12) (2002) 1650–1654.
[31] C.A. Sugar, G.M. James, Finding the number of clusters in a dataset: an information-theoretic approach, J. Am. Stat. Assoc. 98 (463) (2003) 750–763.
[32] J.C. Bezdek, Cluster validity with fuzzy sets, J. Cybernet. 3 (3) (1973) 58–73.
[33] A.M. Bensaid, L.O. Hall, J.C. Bezdek, L.P. Clarke, M.L. Silbiger, J.A. Arrington, R.F. Murtagh, Validity-guided (re)clustering with applications to image segmentation, IEEE Trans. Fuzzy Syst. 4 (2) (1996) 112–123.
[34] M.K. Pakhira, S. Bandyopadhyay, U. Maulik, Validity index for crisp and fuzzy clusters, Pattern Recognit. 37 (3) (2004) 487–501.
[35] K.-L. Wu, M.-S. Yang, A cluster validity index for fuzzy clustering, Pattern Recognit. Lett. 26 (9) (2005) 1275–1291.
[36] M. Ren, P. Liu, Z. Wang, J. Yi, A self-adaptive fuzzy c-means algorithm for determining the optimal number of clusters, Comput. Intell. Neurosci. (2016) 3–15.
[37] P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J. Comput. Appl. Math. 20 (1987) 53–65.
[38] L. Xu, Bayesian ying-yang machine, clustering and number of clusters, Pattern Recognit. Lett. 18 (11) (1997) 1167–1178.
[39] B. Rezaee, A cluster validity index for fuzzy clustering, Fuzzy Sets Syst. 161 (23) (2010) 3014–3025.
[40] D. Arthur, S. Vassilvitskii, k-means++: the advantages of careful seeding, in: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035.
[41] M. Lichman, UCI Machine Learning Repository, 2013.