Cluster Validation
Cluster Validity
For cluster analysis, the question is: how do we evaluate the “goodness” of
the resulting clusters?
But “clusters are in the eye of the beholder”!
Then why do we want to evaluate them?
▪ To avoid finding patterns in noise
▪ To compare clustering algorithms
▪ To compare two sets of clusters
▪ To compare two clusters
[Figure: three scatter plots of the same data set (x and y in [0, 1]), clustered by a Random assignment, by K-means, and by Complete Link.]
- BY KUNAL DEY
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether
non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to
externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference
to external information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to determine which
is better.
5. Determining the ‘correct’ number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire
clustering or just individual clusters.
The count 𝑛𝑖𝑗 denotes the number of points that are common to cluster 𝐶𝑖 and
ground-truth partition 𝑇𝑗. Further, for clarity, let 𝑛𝑖 = |𝐶𝑖| denote the number
of points in cluster 𝐶𝑖, and let 𝑚𝑗 = |𝑇𝑗| denote the number of points in partition 𝑇𝑗.
The ratio 𝑛𝑖/𝑛 then denotes the fraction of points in cluster 𝐶𝑖.
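These counts can be assembled directly from two label vectors; a minimal sketch (the function name `contingency_table` is ours, not from the slides):

```python
import numpy as np

def contingency_table(cluster_labels, true_labels):
    # N[i, j] = n_ij, the number of points in cluster C_i
    # that also lie in ground-truth partition T_j
    clusters = sorted(set(cluster_labels))
    partitions = sorted(set(true_labels))
    N = np.zeros((len(clusters), len(partitions)), dtype=int)
    for c, t in zip(cluster_labels, true_labels):
        N[clusters.index(c), partitions.index(t)] += 1
    return N
```

Row sums of N give the cluster sizes 𝑛𝑖 = |𝐶𝑖|, and column sums give the partition sizes 𝑚𝑗 = |𝑇𝑗|.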
▪Ex2. (green) match = purity = 0.75; (orange) match = 0.65 > 0.6
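Purity itself is just the majority-partition fraction of each cluster, summed over clusters. A small sketch (the 2×2 toy table here is ours, chosen only to reproduce a 0.75 purity; it is not the example from the slide):

```python
import numpy as np

def purity(N):
    # N is the contingency table; each cluster contributes its largest
    # overlap with any single ground-truth partition
    N = np.asarray(N)
    return N.max(axis=1).sum() / N.sum()

purity([[3, 1], [1, 3]])   # (3 + 3) / 8 = 0.75
```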
where 𝑦𝑖 is the true partition label and 𝑦̂𝑖 is the cluster label for point 𝑥𝑖.
A naive computation of the preceding four cases requires O(n²) time. However,
they can be computed more efficiently using the contingency table, with entries
𝑛𝑖𝑗 for 1 ≤ 𝑖 ≤ 𝑟 and 1 ≤ 𝑗 ≤ 𝑘.
Rand Statistic:
▪ Rand = (TP + TN)/N, where N = TP + FN + FP + TN is the total number of point pairs
▪ Symmetric; a perfect clustering has Rand = 1
Fowlkes-Mallows Measure:
▪ Geometric mean of pairwise precision and recall: FM = TP/√((TP + FP)(TP + FN))
     T1   T2   T3 | ni
C1    0   47   14 | 61
C2   50    0    0 | 50
C3    0    3   36 | 39
mj   50   50   50 | n = 150
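For the contingency table above, the four pair counts and both measures can be computed without ever touching the O(n²) point pairs; a sketch (the function name `pair_counts` is ours):

```python
from math import comb

import numpy as np

def pair_counts(N):
    # TP/FN/FP/TN over point pairs, computed from the contingency table
    N = np.asarray(N)
    n = int(N.sum())
    tp = sum(comb(int(v), 2) for v in N.ravel())
    fp = sum(comb(int(v), 2) for v in N.sum(axis=1)) - tp  # same cluster, different partition
    fn = sum(comb(int(v), 2) for v in N.sum(axis=0)) - tp  # same partition, different cluster
    tn = comb(n, 2) - tp - fp - fn
    return tp, fn, fp, tn

N = [[0, 47, 14], [50, 0, 0], [0, 3, 36]]
tp, fn, fp, tn = pair_counts(N)
rand = (tp + tn) / (tp + fn + fp + tn)
fm = tp / ((tp + fp) * (tp + fn)) ** 0.5  # geometric mean of precision and recall
```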
where 𝑤𝑖𝑗 is the edge weight, which can be defined as the distance between 𝑥𝑖 and 𝑥𝑗.
The sum of all the intra-cluster weights over all clusters:
Beta-CV measure: the ratio of the mean intra-cluster distance to the mean inter-
cluster distance. The smaller the ratio, the better the clustering.
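A direct O(n²) Beta-CV computation over a labeled point set, using Euclidean distances as the edge weights 𝑤𝑖𝑗 (the function name is ours):

```python
import numpy as np

def beta_cv(X, labels):
    # Beta-CV = mean intra-cluster distance / mean inter-cluster distance;
    # smaller values indicate a better clustering
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    w_in = w_out = 0.0
    n_in = n_out = 0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            d = np.linalg.norm(X[i] - X[j])
            if labels[i] == labels[j]:
                w_in += d
                n_in += 1
            else:
                w_out += d
                n_out += 1
    return (w_in / n_in) / (w_out / n_out)
```

For two tight, well-separated groups the intra-cluster mean is far below the inter-cluster mean, so the ratio is well under 1.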
Modularity measures the difference between the observed and the expected fraction
of edge weights within the clusters.
With distance-based weights, the smaller the value, the better the clustering:
the intra-cluster distances are lower than expected.
WSS = Σᵢ₌₁ᵏ Σ_{𝑥∈𝐶𝑖} ‖𝑥 − 𝑚𝑖‖², where 𝑚𝑖 is the centroid (mean) of cluster 𝐶𝑖.
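The within-cluster sum of squares is straightforward to evaluate for a labeled point set; a minimal sketch (function name is ours):

```python
import numpy as np

def wss(X, labels):
    # Sum over clusters of squared distances from each point
    # to its cluster centroid m_i
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        m = pts.mean(axis=0)       # centroid m_i of cluster C_i
        total += ((pts - m) ** 2).sum()
    return total
```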