Lect 10 DM
• Cluster analysis
– Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
What is Cluster Analysis?
• Cluster analysis is an important human activity
• Early in childhood, we learn how to distinguish between cats
and dogs
• Typical applications
– As a stand-alone tool to get insight into data distribution
– As a preprocessing step for other algorithms
Clustering
• Hard vs. Soft
– Hard: same object can only belong to single
cluster
– Soft: same object can belong to different
clusters
Clustering
• Flat vs. Hierarchical
– Flat: a single partition of the data into clusters, with no structure relating the clusters
– Hierarchical: clusters form a tree
• Agglomerative (bottom-up: start with each object in its own cluster and merge)
• Divisive (top-down: start with one cluster and split)
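As a minimal sketch of the agglomerative (bottom-up) approach, the following single-linkage clustering of 1-D points repeatedly merges the two closest clusters until the desired number remains. The function name and the data are illustrative, not from the slides:

```python
def single_linkage(points, k):
    """Agglomerative clustering: start with one cluster per point and
    repeatedly merge the two closest clusters until k clusters remain.
    Single linkage: cluster distance = distance of the closest pair."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the two closest clusters
    return clusters

result = single_linkage([1, 2, 10, 11, 25], 2)
```

Running the divisive variant instead would start from one cluster containing all five points and split it; the merge history here forms the tree mentioned above.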
Clustering: Rich Applications and
Multidisciplinary Efforts
• Pattern Recognition
• Spatial Data Analysis
– Create thematic maps in GIS by clustering feature spaces
– Detect spatial clusters or for other spatial mining tasks
• Image Processing
• Economic Science (especially market research)
• WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns
Quality: What Is Good Clustering?
• A good clustering method will produce high quality clusters with
– high intra-class similarity
(Similar to one another within the same cluster)
– low inter-class similarity
(Dissimilar to the objects in other clusters)
What is Cluster Analysis?
• Finding groups of objects such that the objects in a group
will be similar (or related) to one another and different
from (or unrelated to) the objects in other groups
(Figure: intra-cluster distances are minimized; inter-cluster distances are maximized)
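These two criteria can be checked numerically. A minimal sketch, using Manhattan distance and two hypothetical clusters (the helper names are illustrative):

```python
from itertools import combinations, product

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def intra_avg(cluster):
    """Average pairwise distance within one cluster (lower = more cohesive)."""
    pairs = list(combinations(cluster, 2))
    return sum(manhattan(a, b) for a, b in pairs) / len(pairs)

def inter_avg(c1, c2):
    """Average distance between points of two clusters (higher = better separated)."""
    return sum(manhattan(a, b) for a, b in product(c1, c2)) / (len(c1) * len(c2))

# Hypothetical example: two tight, well-separated clusters
c1 = [(0, 0), (1, 0), (0, 1)]
c2 = [(10, 10), (11, 10), (10, 11)]
```

A good clustering of this data should give `intra_avg` values much smaller than `inter_avg`.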
Similarity and Dissimilarity Between
Objects
d(i, j) = (|x_i1 − x_j1|^q + |x_i2 − x_j2|^q + … + |x_ip − x_jp|^q)^(1/q)
where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and q is a positive integer
• If q = 1, d is Manhattan distance
• If q = 2, d is Euclidean distance:
d(i, j) = (|x_i1 − x_j1|^2 + |x_i2 − x_j2|^2 + … + |x_ip − x_jp|^2)^(1/2)
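The Minkowski distance formula translates directly into code; a small sketch with the two special cases named on the slide:

```python
def minkowski(i, j, q):
    """d(i, j) = (sum_k |x_ik - x_jk|^q)^(1/q) for p-dimensional objects."""
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1 / q)

def manhattan(i, j):
    """Special case q = 1."""
    return minkowski(i, j, 1)

def euclidean(i, j):
    """Special case q = 2."""
    return minkowski(i, j, 2)
```

For example, for the points (0, 0) and (3, 4) the Manhattan distance is 3 + 4 = 7, while the Euclidean distance is the familiar 5 of the 3-4-5 triangle.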
(Figure: K-means example with K = 2 — arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign objects; repeat until the assignments no longer change.)
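The iterative procedure in the figure can be sketched as a short implementation of Lloyd's K-means algorithm (the data in the usage example is hypothetical, not taken from the figure):

```python
def kmeans(points, centers, iters=10):
    """Lloyd's algorithm: repeatedly (1) assign each point to the nearest
    center by squared Euclidean distance, (2) recompute each center as the
    mean of its cluster."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centers[c])))
            clusters[nearest].append(p)
        # keep the old center if a cluster ends up empty
        centers = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[k]
                   for k, cl in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)],
                           [(0, 0), (10, 10)])
```

Because the initial centers are chosen arbitrarily, different starting points can converge to different clusterings.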
Example
• Run K-means clustering with 3 clusters (initial
centroids: 3, 16, 25) for at least 2 iterations
• Centroids:
– Centroid 3: assigned points {2, 3, 4, 7, 9} → new centroid: mean = 5
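The centroid update is simply the mean of the assigned points. A one-line check of the slide's arithmetic (only the first centroid's cluster is listed on the slide, so only that update is shown):

```python
cluster = [2, 3, 4, 7, 9]                    # points assigned to centroid 3
new_centroid = sum(cluster) / len(cluster)   # mean of the cluster
```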
Categorical Values
K-medoids example
X1 = (2, 6)
X2 = (3, 4)
X3 = (3, 8)
X4 = (4, 7)
X5 = (6, 2)
X6 = (6, 4)
X7 = (7, 3)
X8 = (7, 4)
X9 = (8, 5)
X10 = (7, 6)
• Initialize k medoids
• Let us assume c1 = (3, 4) and c2 = (7, 4)
• Calculate distance so as to associate each data
object to its nearest medoid.
Cost (Manhattan distance) from each medoid; the medoid rows themselves (i = 2, 8) are omitted:

i    Xi        cost from c1 = (3, 4)   cost from c2 = (7, 4)
1    (2, 6)    3                       7
3    (3, 8)    4                       8
4    (4, 7)    4                       6
5    (6, 2)    5                       3
6    (6, 4)    3                       1
7    (7, 3)    5                       1
9    (8, 5)    6                       2
10   (7, 6)    6                       2
Cluster1 = {(3, 4), (2, 6), (3, 8), (4, 7)}
Cluster2 = {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}
• Total cost = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20
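The assignment step above can be reproduced with a short sketch (the function name `assign` is illustrative; the points and medoids are those of the slide):

```python
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]   # X1..X10

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def assign(points, medoids):
    """Assign every point to its nearest medoid; return the clusters and
    the total cost (sum of distances to the assigned medoid)."""
    clusters = {m: [] for m in medoids}
    total = 0
    for p in points:
        m = min(medoids, key=lambda med: manhattan(p, med))
        clusters[m].append(p)
        total += manhattan(p, m)
    return clusters, total

clusters, cost = assign(points, [(3, 4), (7, 4)])
```

The medoids themselves contribute cost 0, so the total matches the sum of the table's cost column.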
• Select one of the non-medoid points O′. Let us assume O′ = (7, 3)
• Now the medoids are c1 = (3, 4) and O′ = (7, 3)
Cost (Manhattan distance) from each medoid; the medoid rows themselves (i = 2, 7) are omitted:

i    Xi        cost from c1 = (3, 4)   cost from O′ = (7, 3)
1    (2, 6)    3                       8
3    (3, 8)    4                       9
4    (4, 7)    4                       7
5    (6, 2)    5                       2
6    (6, 4)    3                       2
8    (7, 4)    4                       1
9    (8, 5)    6                       3
10   (7, 6)    6                       3

• New total cost = (3 + 4 + 4) + (2 + 2 + 1 + 3 + 3) = 22 > 20, so swapping (7, 4) for (7, 3) would increase the cost and the swap is rejected.
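The swap test of this K-medoids (PAM-style) step can be sketched as follows; the helper `total_cost` is illustrative, and the points and medoids are those of the slide:

```python
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]   # X1..X10

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def total_cost(medoids):
    """Sum, over all points, of the distance to the nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

cost_before = total_cost([(3, 4), (7, 4)])   # original medoids
cost_after  = total_cost([(3, 4), (7, 3)])   # after swapping (7, 4) -> (7, 3)
swap_accepted = cost_after < cost_before     # accept only if cost decreases
```

A full K-medoids run would try every (medoid, non-medoid) swap and keep the best cost-decreasing one, stopping when no swap improves the cost.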
Biomisa.org/resources