Cluster Validation



Cluster Validity
For cluster analysis, the question is how to evaluate the “goodness” of
the resulting clusters.
But “clusters are in the eye of the beholder”!
Why, then, do we want to evaluate them?
▪ To avoid finding patterns in noise
▪ To compare clustering algorithms
▪ To compare two sets of clusters
▪ To compare two clusters

-BY KUNAL DEY 2


Clusters found in Random Data
[Figure: four scatter plots of the same random points in the unit square (x, y ∈ [0, 1]): the original random points, and the apparent “clusters” found in them by DBSCAN, K-means, and complete link.]
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether
non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to
externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference
to external information, i.e., using only the data itself.

4. Comparing the results of two different sets of cluster analyses to determine which
is better.
5. Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire
clustering or just individual clusters.



Measures of Cluster Validity
Numerical measures applied to judge various aspects of cluster validity are
classified into the following three types.
▪ External Index: Used to measure the extent to which cluster labels match
externally supplied class labels.
• Matching Based Measures: Purity, F-measure
• Entropy-based Measures: Entropy, Normalized Mutual Information.
• Pairwise Measures: Rand index, Jaccard Coefficients
▪ Internal Index: Used to measure the goodness of a clustering structure
without respect to external information.
• Sum of Squared Error (SSE), Silhouette score
▪ Relative Index: Used to compare two different clusterings or clusters.
• Often an external or internal index is used for this function
Sometimes these are referred to as criteria instead of indices
◦ However, sometimes criterion is the general strategy and index is the
numerical measure that implements the criterion.



Measures of Cluster Validity
Let T = {T1, T2, …, Tk} denote the ground-truth partitioning, where each Ti is a partition.
Let C = {C1, …, Cr} denote a clustering, with each Ci referred to as a cluster.
▪ External evaluation measures try to capture the extent to which points from the same
partition appear in the same cluster, and the extent to which points from different
partitions are grouped in different clusters.
▪ All of the external measures rely on the r × k contingency table N induced by a
clustering C and the ground-truth partitioning T, defined as follows:

N[i, j] = n_ij = |Ci ∩ Tj|

The count n_ij denotes the number of points that are common to cluster
Ci and ground-truth partition Tj. Further, for clarity, let n_i = |Ci| denote the number
of points in cluster Ci, and let m_j = |Tj| denote the number of points in partition Tj.



External Index: Matching Based Measures
▪ Purity quantifies the extent to which a cluster Ci contains entities from only
one partition. In other words, it measures how “pure” each cluster is. The
purity of cluster Ci is defined as:

purity_i = (1/n_i) max{1≤j≤k} n_ij

▪ The purity of clustering C is defined as the weighted sum of the cluster-wise
purity values:

purity = Σ{i=1..r} (n_i/n) · purity_i

where n_i/n denotes the fraction of points in cluster Ci.

▪ A clustering is perfect if purity = 1 and r = k (the number of clusters obtained is
the same as that in the ground truth).
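The purity computation can be sketched directly from the contingency table. The table values below are reconstructed to be consistent with the worked example on the next slide (purity = 0.75) and are illustrative only:

```python
# Purity from an r x k contingency table N, where N[i][j] = |C_i ∩ T_j|.
# Table reconstructed from the worked example; illustrative only.
N = [
    [30, 20, 0],  # cluster C1, n1 = 50
    [5, 20, 0],   # cluster C2, n2 = 25
    [0, 0, 25],   # cluster C3, n3 = 25
]
n = sum(sum(row) for row in N)

# Weighted purity: sum_i (n_i/n) * (max_j n_ij / n_i) = sum_i max_j n_ij / n
purity = sum(max(row) / n for row in N)
print(round(purity, 2))  # 0.75
```

Note that the n_i factors cancel, so the weighted sum reduces to summing each cluster's majority count and dividing by n.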



External Index: Matching Based Measures
Ex. 1 (green or orange): purity1 = 30/50; purity2 = 20/25;
purity3 = 25/25; purity = (30 + 20 + 25)/100 = 0.75
Problem: Two clusters may share the same majority partition
(orange table)

Maximum matching: Only one cluster can match one partition


▪Match: Pairwise matching,
Weight:

▪Maximum weight matching:

▪Ex2. (green) match = purity = 0.75; (orange) match = 0.65 > 0.6
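For a handful of clusters, the maximum weight matching can be found by brute force over all cluster-to-partition permutations. This sketch uses the same reconstructed green table as above:

```python
from itertools import permutations

# Brute-force maximum weight matching between clusters and partitions.
# N is the reconstructed (illustrative) green contingency table.
N = [[30, 20, 0], [5, 20, 0], [0, 0, 25]]
n = sum(map(sum, N))

# Each permutation p assigns cluster i to partition p[i]; keep the best total.
best = max(sum(N[i][p[i]] for i in range(len(N)))
           for p in permutations(range(len(N[0]))))
print(best / n)  # 0.75 — here the matching agrees with purity
```

For larger tables a polynomial algorithm (e.g., the Hungarian method) would replace the factorial search.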



External Index: Matching Based Measures
The following table summarizes the result of a clustering algorithm. Under maximum
weight matching, which partition should each cluster match?
A:

B:

C:

D:





External Index: Matching Based Measures
F-measure is the harmonic mean of the precision and
recall values for each cluster.

Precision: The fraction of points in Ci from the majority
partition (i.e., the same as purity):

prec_i = n_ij_i / n_i

where j_i = arg max{j} n_ij is the partition that contains the
maximum number of points from Ci.
Ex. (green table)
◦ prec1 = 30/50;
◦ prec2 = 20/25;
◦ prec3 = 25/25



External Index: Matching Based Measures
Recall: The fraction of points in partition Tj_i shared in common with
cluster Ci:

recall_i = n_ij_i / m_j_i, where m_j_i = |Tj_i|

Ex. (green table):
▪ recall1 = 30/35; recall2 = 20/40; recall3 = 25/25
F-measure for cluster Ci: the harmonic mean of
prec_i and recall_i:

F_i = 2 · prec_i · recall_i / (prec_i + recall_i) = 2 n_ij_i / (n_i + m_j_i)

F-measure for clustering C: the average over all clusters:

F = (1/r) Σ{i=1..r} F_i

Ex. (green table)
▪ F1 = 60/85; F2 = 40/65; F3 = 1; F = 0.774
For a perfect clustering, when r = k, the maximum value of the F-measure is 1.
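The per-cluster precision, recall, and F values above can be verified with a few lines of Python, again using the reconstructed green table:

```python
# Per-cluster precision, recall, and F-measure from the contingency table.
# N is the reconstructed (illustrative) green table; m_j are partition sizes.
N = [[30, 20, 0], [5, 20, 0], [0, 0, 25]]
r, k = len(N), len(N[0])
m = [sum(N[i][j] for i in range(r)) for j in range(k)]

F = []
for row in N:
    ji = max(range(k), key=lambda j: row[j])   # majority partition j_i
    prec = row[ji] / sum(row)                  # = purity_i
    rec = row[ji] / m[ji]
    F.append(2 * prec * rec / (prec + rec))    # harmonic mean

print([round(f, 3) for f in F])  # [0.706, 0.615, 1.0]
print(round(sum(F) / r, 3))      # 0.774
```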



External Index: Entropy-based Measures
Entropy is a measure of the uncertainty around a source of information.
The entropy of a clustering C:

H(C) = −Σ{i=1..r} p_Ci log p_Ci, where p_Ci = n_i/n

The entropy of the partitioning T:

H(T) = −Σ{j=1..k} p_Tj log p_Tj, where p_Tj = m_j/n

The conditional entropy of T with respect to cluster Ci:

H(T | Ci) = −Σ{j=1..k} (n_ij/n_i) log(n_ij/n_i)



External Index: Entropy-based Measures
The conditional entropy of T with respect to the clustering C:

H(T | C) = Σ{i=1..r} (n_i/n) H(T | Ci) = −Σ{i=1..r} Σ{j=1..k} p_ij log(p_ij / p_Ci)

where p_ij = n_ij/n is the probability that a point in cluster i also belongs to partition j.



External Index: Entropy-based Measures
The more a cluster’s members are split into different partitions, the higher the
conditional entropy.
For a perfect clustering, the conditional entropy value is 0, whereas the worst
possible conditional entropy value is log k.



External Index: Entropy-based Measures
Mutual information:
Quantifies the amount of shared information between the
clustering C and partitioning T:

I(C, T) = Σ{i=1..r} Σ{j=1..k} p_ij log( p_ij / (p_Ci · p_Tj) )

Measures the dependency between the observed joint probability p_ij
of C and T, and the expected joint probability p_Ci · p_Tj under the
independence assumption.

When C and T are independent, p_ij = p_Ci · p_Tj, and I(C, T) = 0. However,
there is no upper bound on the mutual information.



External Index: Entropy-based Measures
Normalized mutual information (NMI):

NMI(C, T) = I(C, T) / √(H(C) · H(T))

Value range of NMI: [0, 1]. A value close to 1 indicates a good clustering.

What is the Normalized Mutual Information when the clustering is identical
to the partition?
❑ Entropy of the clustering H(C)
❑ Entropy of the partitioning H(T)
❑ 1.0



External Index: Entropy-based Measures
Normalized mutual information (NMI):

NMI(C, T) = I(C, T) / √(H(C) · H(T))

Value range of NMI: [0, 1]. A value close to 1 indicates a good clustering.

What is the Normalized Mutual Information when the clustering is identical
to the partition?
❑ Entropy of the clustering H(C)
❑ Entropy of the partitioning H(T)
✔ 1.0 — when C = T, I(C, T) = H(C) = H(T), so NMI = 1


External Index: Entropy-based Measures
      T1   T2   T3 |  ni
C1     0   47   14 |  61
C2    50    0    0 |  50
C3     0    3   36 |  39
mj    50   50   50 |  n = 150
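The entropy-based measures for this table can be checked numerically. This sketch assumes the geometric-mean normalization NMI = I/√(H(C)·H(T)); other normalizations (min, max, arithmetic mean) give slightly different values:

```python
import math

# Entropy-based measures for the contingency table above (n = 150).
N = [[0, 47, 14], [50, 0, 0], [0, 3, 36]]
n = sum(map(sum, N))
pC = [sum(row) / n for row in N]                             # cluster probs
pT = [sum(N[i][j] for i in range(3)) / n for j in range(3)]  # partition probs

H_C = -sum(p * math.log(p) for p in pC)
H_T = -sum(p * math.log(p) for p in pT)
# Mutual information: sum over non-empty cells of p_ij * log(p_ij / (pC_i * pT_j))
I = sum(N[i][j] / n * math.log((N[i][j] / n) / (pC[i] * pT[j]))
        for i in range(3) for j in range(3) if N[i][j] > 0)

nmi = I / math.sqrt(H_C * H_T)  # geometric-mean normalization (an assumption)
print(round(nmi, 3))
```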





External Index: Pairwise Measures
Four possibilities based on the agreement between cluster label and partition label,
over all pairs of points x_i, x_j, where y_i is the true partition label and ŷ_i is the
cluster label for point x_i:
❑ TP (true positive): x_i and x_j belong to the same partition and also to the
same cluster, i.e., y_i = y_j and ŷ_i = ŷ_j
❑ FN (false negative): y_i = y_j but ŷ_i ≠ ŷ_j
❑ FP (false positive): y_i ≠ y_j but ŷ_i = ŷ_j
❑ TN (true negative): y_i ≠ y_j and ŷ_i ≠ ŷ_j



External Index: Pairwise Measures
Because there are N = n(n − 1)/2 pairs of points, we have the following
identity:

N = TP + FN + FP + TN

A naive computation of the preceding four cases requires O(n²) time. However,
they can be computed more efficiently using the contingency table, with
1 ≤ i ≤ r and 1 ≤ j ≤ k:

TP = Σ{i=1..r} Σ{j=1..k} n_ij(n_ij − 1)/2



External Index: Pairwise Measures
To compute the total number of false negatives, we remove the number of true
positives from the number of pairs that belong to the same partition:

FN = Σ{j=1..k} m_j(m_j − 1)/2 − TP

The last step follows from the fact that m_j = Σ{i=1..r} n_ij.



External Index: Pairwise Measures
The number of false positives can be obtained in a similar manner by subtracting
the number of true positives from the number of point pairs that are in the same
cluster:

FP = Σ{i=1..r} n_i(n_i − 1)/2 − TP

Finally, the number of true negatives can be obtained as follows:

TN = N − (TP + FN + FP)


External Index: Pairwise Measures
Jaccard coefficient: Fraction of true positive point pairs, after ignoring the
true negatives.
▪ Jaccard = TP/(TP + FN + FP) [i.e., the denominator ignores TN]
▪ Perfect clustering: Jaccard = 1

Rand Statistic:
▪ Rand = (TP + TN)/N
▪ Symmetric; perfect clustering: Rand = 1

Fowlkes-Mallows Measure:
▪ Geometric mean of the pairwise precision and recall:
FM = TP / √((TP + FN)(TP + FP))


External Index: Pairwise Measures

      T1   T2   T3 |  ni
C1     0   47   14 |  61
C2    50    0    0 |  50
C3     0    3   36 |  39
mj    50   50   50 |  n = 150



Internal Index
▪ A trade-off between maximizing intra-cluster compactness and inter-cluster separation.
▪ Given a clustering C = {C1, . . ., Ck} with k clusters, cluster Ci containing
n_i = |Ci| points.
▪ Let W(S, R) be the sum of the weights on all edges with one vertex in S and the
other in R:

W(S, R) = Σ{x_i ∈ S} Σ{x_j ∈ R} w_ij

where w_ij is the edge weight, which can be defined as the distance between x_i and x_j.
The sum of all the intra-cluster weights over all clusters:

W_in = (1/2) Σ{i=1..k} W(Ci, Ci)

The sum of all the inter-cluster weights:

W_out = (1/2) Σ{i=1..k} W(Ci, D ∖ Ci) = Σ{i<j} W(Ci, Cj)



Internal Index
The number of distinct intra-cluster edges:

N_in = Σ{i=1..k} n_i(n_i − 1)/2

The number of distinct inter-cluster edges:

N_out = Σ{i<j} n_i · n_j

Beta-CV measure: The ratio of the mean intra-cluster distance to the mean inter-
cluster distance. The smaller the value, the better the clustering:

BetaCV = (W_in/N_in) / (W_out/N_out)
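Beta-CV reduces to a single pass over all point pairs. A sketch on hypothetical 2D points with Euclidean edge weights:

```python
import math
from itertools import combinations

# Beta-CV on toy 2D data (hypothetical points, Euclidean edge weights).
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]

W_in = W_out = N_in = N_out = 0
for i, j in combinations(range(len(points)), 2):
    d = math.dist(points[i], points[j])
    if labels[i] == labels[j]:
        W_in += d; N_in += 1     # intra-cluster edge
    else:
        W_out += d; N_out += 1   # inter-cluster edge

beta_cv = (W_in / N_in) / (W_out / N_out)  # mean intra / mean inter
print(round(beta_cv, 3))  # small value: compact, well-separated clusters
```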





Internal Index
Modularity (for graph clustering):

Q = Σ{i=1..k} [ W(Ci, Ci)/W(V, V) − ( W(Ci, V)/W(V, V) )² ]

Modularity measures the difference between the observed and expected fractions
of edge weight within the clusters.

When the edge weights are distances, the smaller the value, the better the
clustering—the intra-cluster distances are then lower than expected.



Internal Index
Silhouette Measure: Validation measure which quantifies degree to which each item
belongs in its assigned cluster, relative to the other clusters.

-BY KUNAL DEY 31


Internal Index
Silhouette Measure: Validation measure which quantifies the degree to which each item
belongs in its assigned cluster, relative to the other clusters.

The silhouette coefficient for an item x_i is given by:

s_i = (b_i − a_i) / max(a_i, b_i)

where a_i is the mean distance from x_i to the other points in its own cluster, and
b_i is the mean distance from x_i to the points of the closest other cluster.

Values are in the range [−1, 1]; a larger value is better.

Silhouette coefficient for clustering C: Calculate the overall score for a clustering by
averaging the silhouette coefficients of all n items.

SC close to +1 implies a good clustering
▪ Points are close to their own clusters but
far from other clusters
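A from-scratch silhouette computation on hypothetical 2D points (sklearn.metrics.silhouette_score computes the same quantity):

```python
import math

# Silhouette coefficient, computed from scratch on toy 2D data.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]

def silhouette(points, labels):
    scores = []
    for i, p in enumerate(points):
        # a_i: mean distance to the other points of x_i's own cluster
        same = [math.dist(p, q) for j, q in enumerate(points)
                if j != i and labels[j] == labels[i]]
        a = sum(same) / len(same)
        # b_i: mean distance to the points of the closest other cluster
        others = {}
        for j, q in enumerate(points):
            if labels[j] != labels[i]:
                others.setdefault(labels[j], []).append(math.dist(p, q))
        b = min(sum(d) / len(d) for d in others.values())
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

sc = silhouette(points, labels)
print(round(sc, 3))  # well-separated clusters score close to +1
```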



Internal Index: Silhouette Score example

Using the Euclidean metric to calculate the distances.
[Figure: points A1–A8 plotted in clusters C1, C2, C3.]

Silhouette coefficient for item A8 (here grouped with A3, A4, A5, A6):
a_A8 = (d(A8, A4) + d(A8, A5) + d(A8, A6) + d(A8, A3))/4
     = (1.4 + 5 + 5.4 + 6.4)/4 = 4.55

Distance from point A8 to C1:
b_A8 = 2.23

Distance from point A8 to C2:
b_A8 = (4.5 + 7.6)/2 = 6.05

The closest cluster to A8 is C1, so b_A8 = 2.23

s_A8 = (2.23 − 4.55)/max(2.23, 4.55) = −2.32/4.55 ≈ −0.51
Internal Index: Silhouette Score example

Using the Euclidean metric to calculate the distances.
[Figure: the same points, with A8 now assigned to C1.]

Silhouette coefficient for item A8:
a_A8 = (2.23 + 1.4)/2 = 1.82

Distance from point A8 to C2:
b_A8 = (4.5 + 7.6)/2 = 6.05

Distance from point A8 to C3:
b_A8 = (5 + 5.4 + 6.4)/3 = 5.6

The closest other cluster to A8 is C3, so b_A8 = 5.6

s_A8 = (5.6 − 1.82)/max(5.6, 1.82) = 3.78/5.6 = 0.68




Internal Index: Silhouette Score example
To calculate the silhouette coefficient for a given
clustering, we repeat the same steps as above for
all points, then average the scores.

SC close to +1 implies a good clustering
▪ Points are close to their own clusters but
far from other clusters



Relative Index
Relative measure: Directly compare different clusterings, usually those obtained via
different parameter settings for the same algorithm.
Silhouette coefficient as a relative measure: Estimate the number of clusters in the data.
Pick the k value that yields the best clustering, i.e., the one yielding high
values for SC and SCi (1 ≤ i ≤ k).


Cluster Stability
The main idea behind cluster stability is that the clusterings obtained from several
datasets sampled from the same underlying distribution as D should be similar or
“stable”. The cluster stability approach can be used to find good parameter values for
a given clustering.

A bootstrapping approach: to find the best value of k (judged on stability)


▪ Generate t samples of size n by sampling from D with replacement
▪ For each sample 𝐷𝑖 , run the same clustering algorithm with k values from 2 to
𝑘𝑚𝑎𝑥
▪ Compare the distance between all pairs of clusterings 𝐶𝑘 (𝐷𝑖 ) and 𝐶𝑘 (𝐷𝑗 ) via some
distance function
▪ Compute the expected pairwise distance for each value of k
▪ The value k* that exhibits the least deviation between the clusterings obtained
from the resampled datasets is the best choice for k since it exhibits the most
stability



Other Methods for Finding K, the Number of
Clusters
▪ Empirical method:
Number of clusters: k ≈ √(n/2) for a dataset of n points (e.g., n = 200 gives k = 10)
This may work for small data sets
▪ Elbow method: Calculate the Within-Cluster Sum of Squared Errors
(WSS) for different values of k, and choose the k at which the decrease in
WSS first starts to diminish. In the plot of WSS versus k, this is
visible as an elbow.

WSS = Σ{i=1..k} Σ{x ∈ Ci} ‖x − m_i‖²
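The elbow method can be sketched with a minimal Lloyd's k-means on synthetic data. This is a sketch, not a production implementation; the deterministic farthest-first initialization is an assumption added here to keep the run stable:

```python
import random

# Minimal Lloyd's k-means plus the elbow curve (WSS vs. k) on two blobs.
def d2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans_wss(points, k, iters=20):
    # farthest-first initialization: spreads the initial centers apart
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(d2(p, c) for c in centers)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: d2(p, centers[i]))].append(p)
        centers = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    # WSS: total squared distance of each point to its nearest center
    return sum(min(d2(p, c) for c in centers) for p in points)

rng = random.Random(1)
data = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)] + \
       [(rng.gauss(8, 0.5), rng.gauss(8, 0.5)) for _ in range(50)]
wss = {k: kmeans_wss(data, k) for k in (1, 2, 3, 4)}
print({k: round(v, 1) for k, v in wss.items()})  # sharp drop at k = 2: the elbow
```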



Other Methods for Finding K, the Number of
Clusters
▪ Cross validation method:
o Divide a given data set into m parts
o Use m – 1 parts to obtain a clustering model
o Use the remaining part to test the quality of the clustering
• For example, for each point in the test set, find the closest centroid,
and use the sum of squared distance between all points in the test set
and the closest centroids to measure how well the model fits the test
set
o For any k > 0, repeat it m times, compare the overall quality measure
w.r.t. different k’s, and find # of clusters that fits the data the best



Clustering Tendency
▪ Clustering Tendency: assessing the suitability of clustering (whether the
data contains any inherent grouping structure).
▪ Determining clustering tendency or clusterability is a hard task because
there are so many different definitions of clusters
▪ E.g., partitioning, hierarchical, density-based, graph-based, etc.
▪ Even when fixing the cluster type, it is still hard to define an appropriate
null model for a data set
▪ Still, there are some clusterability assessment methods, such as:
▪ Spatial histogram: Contrast the histogram of the data with that
generated from random samples
▪ Distance distribution: Compare the pairwise point distances from the
data with the pairwise point distances from randomly generated
samples
▪ Hopkins Statistic: A sparse sampling test for spatial randomness
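A sketch of the Hopkins statistic (conventions vary between sources; with the ratio used here, values near 0.5 suggest spatial randomness and values near 1 suggest clustering structure). The data sets are synthetic and illustrative:

```python
import math
import random

def hopkins(data, m, rng):
    # u_i: nearest-data distance of m uniform points in the bounding box
    # w_i: nearest-neighbor distance of m points sampled from the data
    xs = [p[0] for p in data]; ys = [p[1] for p in data]
    def nn_dist(p, pts):
        return min(math.dist(p, q) for q in pts if q is not p)
    u = [nn_dist((rng.uniform(min(xs), max(xs)),
                  rng.uniform(min(ys), max(ys))), data) for _ in range(m)]
    w = [nn_dist(p, data) for p in rng.sample(data, m)]
    return sum(u) / (sum(u) + sum(w))

rng = random.Random(0)
clustered = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(100)] + \
            [(rng.gauss(10, 0.3), rng.gauss(10, 0.3)) for _ in range(100)]
uniform = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(200)]

h_clustered = hopkins(clustered, 30, rng)
h_uniform = hopkins(uniform, 30, rng)
print(round(h_clustered, 2), round(h_uniform, 2))  # high vs. near 0.5
```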



Clustering Tendency
▪ Spatial Histogram Approach: Contrast the d-dimensional histogram of
the input dataset D with the histogram generated from random samples
▪ Dataset D is clusterable if the distributions of the two histograms are
rather different
▪ Method outline:
▪ Divide each dimension into equal-width bins,
count how many points lie in each cell, and
obtain the empirical joint probability mass
function (EPMF)
▪ Do the same for the randomly sampled data
▪ Compute how much they differ using the
Kullback-Leibler (KL) divergence (relative
entropy), which measures the dissimilarity of two
probability distributions, p and q:

KL(p ‖ q) = Σ p(x) log( p(x)/q(x) )
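The outline above can be sketched in 2D: bin both the data and a uniform random sample into an EPMF, then compare the two with a smoothed KL divergence (the smoothing term is an assumption added here so that empty cells do not produce infinities). The data sets are synthetic and illustrative:

```python
import math
import random

# Spatial-histogram clusterability sketch: binned EPMF of the data vs. that of
# a uniform random sample, compared via (smoothed) KL divergence.
def epmf(points, bins, lo, hi):
    counts = [[0] * bins for _ in range(bins)]
    w = (hi - lo) / bins
    for x, y in points:
        ix = min(bins - 1, max(0, int((x - lo) / w)))
        iy = min(bins - 1, max(0, int((y - lo) / w)))
        counts[ix][iy] += 1
    return [[c / len(points) for c in row] for row in counts]

def kl(p, q, eps=1e-9):
    # KL(p || q) with smoothing so empty cells do not blow up
    return sum((a + eps) * math.log((a + eps) / (b + eps))
               for rp, rq in zip(p, q) for a, b in zip(rp, rq))

rng = random.Random(0)
clustered = [(rng.gauss(2, 0.4), rng.gauss(2, 0.4)) for _ in range(200)] + \
            [(rng.gauss(8, 0.4), rng.gauss(8, 0.4)) for _ in range(200)]
uniform1 = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(400)]
uniform2 = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(400)]

kl_c = kl(epmf(clustered, 5, 0, 10), epmf(uniform1, 5, 0, 10))  # large
kl_u = kl(epmf(uniform2, 5, 0, 10), epmf(uniform1, 5, 0, 10))   # small
print(round(kl_c, 2), round(kl_u, 2))
```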



Using Similarity Matrix for Cluster
Validation
▪ Order the similarity matrix with respect to cluster labels and inspect
visually.



Using Similarity Matrix for Cluster
Validation
▪ Clusters in random data are not so crisp


