Cluster Validation
Presented by: Rohit Paul

CLUSTERING
• The process of partitioning a set of data objects into subsets (called clusters).
• Objects in a cluster are similar to one another and dissimilar to objects in other clusters.
CLUSTER VALIDITY INDICES
• Used to evaluate the “goodness” of the resulting clusters.
• Different aspects of cluster validation:
    • Comparing clustering algorithms
    • Comparing two different sets of clusters
    • Comparing the results of a cluster analysis to externally known results
    • Determining the ‘correct’ number of clusters
• Scikit-learn (sklearn) is a library for machine learning in Python; its validity indices are imported from sklearn.metrics:
    from sklearn.metrics import ..
Types of Validity Indices
• Internal Quality Indices
    • Used to measure the goodness of a clustering structure without reference to external information.
    • They assess how well separated and how compact the clusters are.
• External Quality Indices
    • Measure the extent to which cluster labels match externally supplied class labels.
Internal Quality Indices
• Based on the following two criteria:
    • Compactness/Cohesion: how closely related the objects in a cluster are
    • Separation: how distinct or well separated a cluster is from other clusters
• Applications
    • Comparing clustering algorithms
    • Determining the ‘correct’ number of clusters
Disadvantages of k-means
Choosing the number of clusters k
• In most exploratory applications, the number of clusters k is unknown.
• The correct choice of k is often ambiguous.
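One common way to approach the choice of k is to run k-means for several candidate values and compare an internal validity index across them. The sketch below is only an illustration under stated assumptions: the synthetic three-blob data, the candidate range 2 to 6, and the use of the silhouette score as the selection criterion are choices made for this example, not something prescribed by the slides.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data (assumption for illustration): three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# Run k-means for each candidate k and record the mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette is better, so keep the k with the largest score.
best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On real data the curve is often flatter than in this toy case, which is one reason the choice of k stays ambiguous.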
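Davies-Bouldin Index
• For each cluster, take the maximum over all other clusters of the ratio of intra-cluster distance to inter-cluster distance; the index is the average of these maxima:
DB = (1/k) Σ_i max_{j ≠ i} ( (s_i + s_j) / d_ij )
where s_i is the average distance of the points in cluster i to its centroid and d_ij is the distance between the centroids of clusters i and j.
• The lower the DB index value, the better the clustering.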
>> from sklearn.metrics import davies_bouldin_score

………....
>> davies_bouldin_score(X, labels)
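To make the call above concrete, here is a small sketch that computes the index by hand and compares it with `davies_bouldin_score`. The toy points, the labels, and the helper name `davies_bouldin_manual` are assumptions for this example; the manual version uses centroid-based scatter and centroid distances, which is how scikit-learn defines the score.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

def davies_bouldin_manual(X, labels):
    """Hypothetical helper: average over clusters of the worst (s_i + s_j) / d_ij ratio."""
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # s_i: mean distance of the points in cluster i to its own centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(clusters)
    ])
    k = len(clusters)
    worst = np.zeros(k)
    for i in range(k):
        worst[i] = max(
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        )
    return worst.mean()

# Toy data (assumption): three small groups of 2-D points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.0],
              [0.0, 9.0], [1.0, 9.0]])
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2])

db_manual = davies_bouldin_manual(X, labels)
db_sklearn = davies_bouldin_score(X, labels)
print(db_manual, db_sklearn)
```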
Dunn Index
• Defined as the ratio of the minimum inter-cluster separation to the maximum cluster diameter:
D = (minimum separation) / (maximum diameter)
• The higher the Dunn index value, the better the clustering.
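scikit-learn's `sklearn.metrics` module does not, as far as I know, ship a Dunn index, so the following is a minimal NumPy sketch. It assumes one common variant: separation is the smallest distance between two points in different clusters, and diameter is the largest distance between two points in the same cluster. Other definitions of separation and diameter exist; the function name `dunn_index` and the toy data are hypothetical.

```python
import numpy as np

def dunn_index(X, labels):
    """Minimum inter-cluster point distance divided by maximum intra-cluster point distance."""
    # Full pairwise Euclidean distance matrix.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # same[i, j] is True when points i and j share a cluster label.
    same = labels[:, None] == labels[None, :]
    min_separation = dist[~same].min()
    max_diameter = dist[same].max()  # would be 0 if every cluster were a single point
    return min_separation / max_diameter

# Toy data (assumption): two clusters of two points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
labels = np.array([0, 0, 1, 1])

dunn = dunn_index(X, labels)
print(dunn)  # minimum separation is 5, maximum diameter is 1
```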
Silhouette Index
• The Silhouette Coefficient combines the ideas of cohesion and separation, but for individual points:
S(i) = ( b(i) – a(i) ) / max{ a(i), b(i) }
where
• a(i) is the average dissimilarity of the i-th object to all other objects in the same cluster
• b(i) is the average dissimilarity of the i-th object to all objects in the closest other cluster.
>> from sklearn.metrics import silhouette_score
………....
>> silhouette_score(X, labels)
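The per-point formula can be checked against scikit-learn directly. The sketch below computes S(i) by hand with Euclidean distance and compares it with `silhouette_samples`; `silhouette_score` is then the mean of the per-point values. The toy data and the helper name `silhouette_manual` are assumptions for this example.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

def silhouette_manual(X, labels):
    """Hypothetical helper: S(i) = (b(i) - a(i)) / max(a(i), b(i)) for every point."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # singleton cluster: S(i) is taken as 0
        # a(i): mean distance to the other members of the same cluster.
        a = dist[i, own].sum() / (own.sum() - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Toy data (assumption): two compact groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [6.0, 6.0], [6.0, 7.0], [7.0, 6.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

s_manual = silhouette_manual(X, labels)
s_sklearn = silhouette_samples(X, labels)
print(s_manual.mean(), silhouette_score(X, labels))
```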
Other Internal Cluster Validity Indices
• Root-mean-square standard deviation
• R-squared
• Modified Hubert statistic
• Calinski-Harabasz index
• I index
• SD validity index
• S_Dbw validity index, and so on
External Quality Indices
Comparing the results of a cluster analysis to an externally known result, such as externally provided class labels:
• Validate against ground truth
• Compare two clusterings
Jaccard Score
Rand Index
• Counts the pairs of objects that fall into each of four cases, given two partitions P and G:
    • a = same class in both P and G
    • b = same class in P but different in G
    • c = different class in P but same in G
    • d = different class in both P and G
• Agreement: a, d
• Disagreement: b, c
• Rand Index: RI = (a + d) / (a + b + c + d)
>> from sklearn.metrics import adjusted_rand_score

………....
>> adjusted_rand_score(labels_true, labels_pred)
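Note that `adjusted_rand_score` returns the chance-corrected variant of the index, not the plain Rand Index; recent scikit-learn versions also provide `rand_score` for the unadjusted value. The sketch below counts the four pair types by hand and compares the result with `rand_score`. The label lists and the helper name `pair_counts` are assumptions made for this example.

```python
from itertools import combinations
from sklearn.metrics import rand_score, adjusted_rand_score

def pair_counts(P, G):
    """Hypothetical helper: count object pairs by agreement between labelings P and G."""
    a = b = c = d = 0
    for i, j in combinations(range(len(P)), 2):
        same_p = P[i] == P[j]
        same_g = G[i] == G[j]
        if same_p and same_g:
            a += 1          # same class in both
        elif same_p:
            b += 1          # same in P, different in G
        elif same_g:
            c += 1          # different in P, same in G
        else:
            d += 1          # different in both
    return a, b, c, d

# Toy labelings (assumption): six objects.
P = [0, 0, 1, 1, 2, 2]
G = [0, 0, 0, 1, 1, 1]

a, b, c, d = pair_counts(P, G)
ri = (a + d) / (a + b + c + d)
print(a, b, c, d)                 # 2 1 4 8
print(ri, rand_score(G, P))       # plain Rand Index, both ways
print(adjusted_rand_score(G, P))  # chance-corrected variant
```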
F-measure
• Precision: what percentage of the tuples that the classifier labeled positive are actually positive
• Recall: what percentage of the positive tuples the classifier labeled as positive
• F-measure: the harmonic mean of precision and recall, F = 2 · precision · recall / (precision + recall)
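As a small worked example of the three definitions, the sketch below computes precision, recall and the F-measure from counts of true positives, false positives and false negatives. The function name `f_measure` and the counts are hypothetical values chosen for illustration.

```python
def f_measure(tp, fp, fn):
    """Return (precision, recall, F) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f

# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
precision, recall, f = f_measure(tp=8, fp=2, fn=4)
print(precision, recall, f)
```

The harmonic mean stays close to the smaller of the two values, so a high F-measure requires both precision and recall to be high.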
Other External Cluster Validity Indices
• Normalized Mutual Information (NMI)
• Purity
• Sorensen-Dice
• Braun-Blanquet
• Normalized Van Dongen
• Pair-Set Index
• Centroid Index, and many more
Reference
• https://medium.com/swlh/how-to-choose-the-right-number-of-clusters-in-the-k-means-algorithm-9160c57ec760
• https://present5.com/clustering-methods-part-3-cluster-validation-pasi-franti/
• https://www.datanovia.com/en/lessons/cluster-validation-statistics-must-know-methods/
• https://www.geeksforgeeks.org/dunn-index-and-db-index-cluster-validity-indices-set-1/
• Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, Junjie Wu. Understanding of Internal Clustering Validation Measures.

Thank You !!
