Presented By: Rohit Paul

 Process of partitioning a set of data objects into subsets
(called clusters)
 Objects in a cluster are similar to one another and
dissimilar to objects in other clusters.
 To evaluate the “goodness” of the resulting clusters.
 Different aspects of cluster validation
 To compare clustering algorithms
 To compare two different sets of clusters
 Comparing the results of a cluster analysis to externally known results
 Determining the ‘correct’ number of clusters
 Scikit-learn (sklearn) – a library for machine learning in Python
 from sklearn.metrics import ..
Types of Validity Indices
 Internal Quality Indices
 Used to measure the goodness of a clustering structure without
respect to external information.
 How well the clusters are separated and how compact the
clusters are.
 External Quality Indices
 Measure the extent to which cluster labels match the externally
supplied class labels.
Internal Quality Indices
 Based on the following two criteria:
 Compactness/Cohesion: how closely related the objects in a
cluster are
 Separation: how distinct or well-separated a cluster is from
other clusters
 Application
 To compare clustering algorithms
 Determining the ‘correct’ number of clusters
Disadvantages of k-means
Choosing the number of clusters k
 In most exploratory applications, the number of clusters k is not known in advance
 Correct choice of k is often ambiguous
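One common way to narrow down k is to run the clustering for several candidate values and compare an internal validity index for each. The sketch below is only illustrative: the synthetic data (make_blobs with four hand-picked centres), the candidate range 2 to 7, and the choice of the silhouette score as the index are all assumptions, not part of the original slides.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumed toy data: four tight, well-separated blobs.
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=0)

# Score each candidate k with an internal index (higher silhouette is better).
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("k with the highest silhouette score:", best_k)
```

On real data the curve is often flatter than on this toy set, so the index is better read as a guide than as a final answer.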
Davies Bouldin Index

For each cluster, the maximum over the other clusters of the ratio of
intra-cluster distance to inter-cluster distance, averaged over all clusters
 The lower the DB index value, the better the clustering

>> from sklearn.metrics import davies_bouldin_score

>> davies_bouldin_score(X, labels)
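To make the ratio concrete, the following sketch computes the index by hand and compares it with davies_bouldin_score. It assumes Euclidean distance, cluster scatter measured as the mean distance of points to their centroid, and separation measured between centroids; the helper name davies_bouldin_manual and the toy data are hypothetical.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

def davies_bouldin_manual(X, labels):
    """Hand-rolled DB index (hypothetical helper, Euclidean distance assumed)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # Intra-cluster scatter: mean distance of each point to its own centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    # Inter-cluster separation: distance between centroids.
    sep = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    np.fill_diagonal(sep, np.inf)  # a cluster is never compared with itself
    ratios = (scatter[:, None] + scatter[None, :]) / sep
    # Worst (largest) ratio per cluster, averaged over clusters.
    return ratios.max(axis=1).mean()

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

manual = davies_bouldin_manual(X, labels)
reference = davies_bouldin_score(X, labels)
print(manual, reference)
```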
Dunn Index

Defined as the minimum inter-cluster separation divided by the
maximum cluster diameter
 The higher the Dunn index value, the better the clustering.
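The slides show no sklearn.metrics call for the Dunn index, so the sketch below computes one variant directly with NumPy. It assumes Euclidean distance, separation taken as the smallest distance between two points in different clusters, and diameter taken as the largest distance between two points in the same cluster; other definitions of separation and diameter exist. The function name dunn_index is hypothetical.

```python
import numpy as np

def dunn_index(X, labels):
    """Minimum separation / maximum diameter (one common variant, assumed here).

    Needs at least two clusters and at least one cluster with two distinct points.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise Euclidean distance matrix.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    same = labels[:, None] == labels[None, :]
    max_diameter = dist[same].max()      # farthest pair inside any one cluster
    min_separation = dist[~same].min()   # closest pair across two clusters
    return min_separation / max_diameter

X = np.array([[0, 0], [0, 1], [10, 0], [10, 1]], dtype=float)
good = dunn_index(X, [0, 0, 1, 1])  # compact, well-separated clusters
bad = dunn_index(X, [0, 1, 0, 1])   # each cluster spans both groups
print(good, bad)
```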
Silhouette Index
 The Silhouette Coefficient combines the ideas of cohesion and
separation, but for individual points
S(i) = ( b(i) – a(i) ) / max { a(i), b(i) }
 a(i) is the average dissimilarity of the ith object to all other objects
in the same cluster
 b(i) is the average dissimilarity of the ith object to all objects in
the closest other cluster.
 S(i) ranges from -1 to 1; values near 1 indicate a well-placed object
>> from sklearn.metrics import silhouette_score
>> silhouette_score(X, labels)
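As a worked illustration of a(i) and b(i), the sketch below evaluates S(i) for each point by hand and checks the result against silhouette_samples from sklearn.metrics. It assumes Euclidean distance and assigns 0 to a point that is alone in its cluster; the helper name silhouette_point and the toy data are hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

def silhouette_point(X, labels, i):
    """S(i) for a single point (hypothetical helper, Euclidean distance assumed)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    d = np.linalg.norm(X - X[i], axis=1)   # distances from point i to every point
    own = labels == labels[i]
    own[i] = False                          # a(i) excludes the point itself
    if not own.any():
        return 0.0                          # singleton cluster
    a = d[own].mean()
    # b(i): smallest mean distance to the points of any other cluster.
    b = min(d[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])

manual = np.array([silhouette_point(X, labels, i) for i in range(len(X))])
print(manual)
print(manual.mean(), silhouette_score(X, labels))
```

The overall silhouette score is simply the mean of the per-point values.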
Other Internal Cluster Validity Indices
 Root-mean-square std dev
 R-squared
 Modified Hubert statistics
 Calinski-Harabasz index
 I index
 SD validity index
 S_Dbw validity index and so on….
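Of the indices listed above, the Calinski-Harabasz index is available in sklearn.metrics as calinski_harabasz_score (higher is better). The sketch below compares a sensible labelling with a deliberately poor one; the synthetic data and both label vectors are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
# Assumed toy data: two tight groups centred at (0, 0) and (5, 5).
X = np.vstack([
    rng.normal(0.0, 0.3, size=(20, 2)),
    rng.normal(5.0, 0.3, size=(20, 2)),
])

good_labels = np.repeat([0, 1], 20)  # one label per true group
bad_labels = np.tile([0, 1], 20)     # labels alternate, mixing both groups

good = calinski_harabasz_score(X, good_labels)
bad = calinski_harabasz_score(X, bad_labels)
print(good, bad)
```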
External Quality Indices
Comparing the results of a cluster analysis to an
externally known result, such as externally provided
class labels
 Validate against ground truth
 Compare two clusterings
Jaccard Score
Rand Index
 Measure the number of pairs that are in:
 A = Same class both in P and G
 B = Same class in P but different in G
 C = Different class in P but same in G
 D = Different class both in P and G
 Agreement: A, D
 Disagreement: B, C
 Rand Index: RI = (A + D) / (A + B + C + D)

>> from sklearn.metrics import adjusted_rand_score

>> adjusted_rand_score(labels_true, labels_pred)
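The pair counts can be tallied directly. The sketch below counts the four kinds of pairs for a predicted labelling P and a ground truth G, then takes the share of agreeing pairs as the plain Rand index. Note that adjusted_rand_score returns the chance-corrected variant, so its value generally differs from this one. The function names and example labels are hypothetical.

```python
from itertools import combinations

def pair_counts(labels_true, labels_pred):
    """Return (A, B, C, D) over all pairs of objects."""
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_p = labels_pred[i] == labels_pred[j]
        same_g = labels_true[i] == labels_true[j]
        if same_p and same_g:
            a += 1   # together in P and in G
        elif same_p:
            b += 1   # together in P, apart in G
        elif same_g:
            c += 1   # apart in P, together in G
        else:
            d += 1   # apart in both
    return a, b, c, d

def rand_index(labels_true, labels_pred):
    a, b, c, d = pair_counts(labels_true, labels_pred)
    return (a + d) / (a + b + c + d)   # agreeing pairs / all pairs

G = [0, 0, 0, 1, 1, 1]   # ground truth
P = [0, 0, 1, 1, 2, 2]   # predicted clusters

counts = pair_counts(G, P)
ri = rand_index(G, P)
print(counts)  # (2, 1, 4, 8)
print(ri)      # 10 of the 15 pairs agree
```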
 Precision: What % of tuples that the classifier labeled
positive are actually positive
 Recall: What % of positive tuples did
the classifier label as positive
 F-Measure: The harmonic mean of precision and recall
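For clusterings, one way to apply these measures is over pairs of objects: a pair placed in the same cluster counts as a "positive" prediction, and it is correct when the two objects also share a ground-truth class. This pairwise reading is an assumption here, since the slides state the definitions only in classifier terms; the function name and example labels are hypothetical.

```python
from itertools import combinations

def pairwise_prf(labels_true, labels_pred):
    """Pairwise precision, recall and F-measure (assumed pair-counting reading)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_p = labels_pred[i] == labels_pred[j]
        same_g = labels_true[i] == labels_true[j]
        if same_p and same_g:
            tp += 1   # pair correctly put together
        elif same_p:
            fp += 1   # pair put together but belongs apart
        elif same_g:
            fn += 1   # pair split but belongs together
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

G = [0, 0, 0, 1, 1, 1]   # ground truth
P = [0, 0, 1, 1, 2, 2]   # predicted clusters

precision, recall, f_measure = pairwise_prf(G, P)
print(precision, recall, f_measure)  # 2/3, 1/3, 4/9
```

The sketch does not guard against empty denominators, which occur when no pair shares a predicted cluster or a true class.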

Other External Cluster Validity Indices
 Normalized Mutual Information(NMI)
 Purity
 Sorensen-Dice
 Braun-Blanquet
 Normalized Van Dongen
 Pair-Set Index
 Centroid Index and many more….
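Purity is among the simplest of these to compute: each predicted cluster is credited with its most frequent true class, and the credited objects are divided by the total number of objects. The sketch below assumes that definition; the function name and example labels are hypothetical. For NMI, sklearn.metrics provides normalized_mutual_info_score.

```python
from collections import Counter, defaultdict

def purity(labels_true, labels_pred):
    """Fraction of objects that belong to the majority class of their cluster."""
    members = defaultdict(list)
    for true_class, cluster in zip(labels_true, labels_pred):
        members[cluster].append(true_class)
    # Size of the majority class inside each predicted cluster.
    majority_total = sum(
        Counter(classes).most_common(1)[0][1] for classes in members.values()
    )
    return majority_total / len(labels_true)

G = [0, 0, 0, 1, 1, 1]   # ground truth
P = [0, 0, 1, 1, 2, 2]   # predicted clusters

score = purity(G, P)
print(score)  # 5 of 6 objects sit with their cluster's majority class
```

Purity rises as the number of clusters grows (one cluster per object gives 1.0), so it is usually read alongside another index.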
 Understanding of Internal Clustering Validation Measures
Yanchi Liu1,2, Zhongmou Li2, Hui Xiong2, Xuedong Gao1, Junjie Wu3
1 School of Economics and Management, University of Science and Technology
Beijing, China

Thank You !!

