Professional Documents
Culture Documents
11.cluster Validation PDF
11.cluster Validation PDF
0.9 0.9
0.8 0.8
0.7 0.7
y
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
x x
1 1
0.9 0.9
0.8 0.8
K-means Complete
0.7 0.7
Link
0.6 0.6
0.5 0.5
y
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
x x
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e.,
distinguishing whether non-random structure actually
exists in the data.
2. Comparing the results of a cluster analysis to externally
known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit
the data without reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster
analyses to determine which is better.
5. Determining the ‘correct’ number of clusters.
For 2, 3, and 4, we can further distinguish whether we
want to evaluate the entire clustering or just individual
clusters.
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects
of cluster validity, are classified into the following three types.
External Index: Used to measure the extent to which cluster
labels match externally supplied class labels.
Entropy
Internal Index: Used to measure the goodness of a
clustering structure without respect to external information.
Sum of Squared Error (SSE)
Relative Index: Used to compare two different clusterings or
clusters.
Often an external or internal index is used for this function, e.g., SSE or
entropy
Sometimes these are referred to as criteria instead of indices
However, sometimes criterion is the general strategy and index is the
numerical measure that implements the criterion.
Supervised/External measures of
Cluster Validation
Classification Oriented
Entropy, Purity, Precision, Recall and F-measure
Similarity Oriented
Involves comparison of two matrices
Ideal Cluster Similarity Matrix
Ideal Class Similarity Matrix
External Measures of Cluster Validity: Entropy
and Purity
Unsupervised measures of Cluster
Validation
0.9 0.9
0.8 0.8
0.7 0.7
0.6 0.6
0.5 0.5
y
0.4 y 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
x x
1 1
0.9 10 0.9
0.8 20 0.8
0.7 30 0.7
0.6 40 0.6
Points
0.5 50 0.5
y
0.4 60 0.4
0.3 70 0.3
0.2 80 0.2
0.1 90 0.1
0 100 0
0 0.2 0.4 0.6 0.8 1 20 40 60 80 100 Similarity
x Points
Using Similarity Matrix for Cluster
Validation
Clusters in random data are not so crisp
1 1
10 0.9 0.9
20 0.8 0.8
30 0.7 0.7
40 0.6 0.6
Points
50 0.5 0.5
y
60 0.4 0.4
70 0.3 0.3
80 0.2 0.2
90 0.1 0.1
100 0 0
20 40 60 80 100 Similarity 0 0.2 0.4 0.6 0.8 1
Points x
DBSCAN
Using Similarity Matrix for Cluster
Validation
Clusters in random data are not so crisp
1 1
10 0.9 0.9
20 0.8 0.8
30 0.7 0.7
40 0.6 0.6
Points
50 0.5 0.5
y
60 0.4 0.4
70 0.3 0.3
80 0.2 0.2
90 0.1 0.1
100 0 0
20 40 60 80 100 Similarity 0 0.2 0.4 0.6 0.8 1
Points x
K-means
Using Similarity Matrix for Cluster
Validation
Clusters in random data are not so crisp
1 1
10 0.9 0.9
20 0.8 0.8
30 0.7 0.7
40 0.6 0.6
Points
50 0.5 0.5
y
60 0.4 0.4
70 0.3 0.3
80 0.2 0.2
90 0.1 0.1
100 0 0
20 40 60 80 100 Similarity 0 0.2 0.4 0.6 0.8 1
Points x
Complete Link
Using Similarity Matrix for Cluster
Validation
0.9
1 500
0.8
2 6
0.7
1000
3 0.6
4
1500 0.5
0.4
2000
0.3
5
0.2
2500
0.1
7
3000 0
500 1000 1500 2000 2500 3000
DBSCAN
Internal Measures: SSE
Clusters in more complicated figures aren’t well separated
Internal Index: Used to measure the goodness of a clustering
structure without respect to external information
SSE
SSE is good for comparing two clusterings or two clusters
(average SSE).
Can also be used to estimate the number of clusters
10
6 9
8
4
7
2 6
SSE
5
0
4
-2
3
-4 2
1
-6
0
5 10 15 2 5 10 15 20 25 30
K
Internal Measures: SSE
SSE curve for a more complicated data set
1
2 6
3
4
i
Where |Ci| is the size of cluster i
Internal Measures: Cohesion and Separation
Example: SSE
BSS + WSS = constant
m
1 m1 2 3 4 m2 5
cohesion separation
Internal Measures: Silhouette Coefficient
Silhouette Coefficient combine ideas of both cohesion and
separation, but for individual points, as well as clusters
and clusterings
For an individual point, i
Calculate a = average distance of i to the points in its cluster
Calculate b = min (average distance of i to points in another
cluster)
The silhouette coefficient for a point is then given by
s = 1 – a/b if a < b, (or s = b/a - 1 if a b, not the usual case)
b
a
Working assumption:
There are considerably more “normal”
observations than “abnormal” observations
(outliers/anomalies) in the data
Anomaly Detection Schemes
General Steps
Build a profile of the “normal” behavior
Profile can be patterns or summary statistics for the overall population
Use the “normal” profile to detect anomalies
Anomalies are observations whose characteristics
differ significantly from the normal profile
The top n data points whose distance to the kth nearest neighbor
is greatest