Lect13B-Cluster Analysis-II
• Can be visualized as a dendrogram
• A diagram that shows the hierarchical relationship between objects.
• A tree-like diagram that records the sequences of merges or splits
[Figure: example dendrogram over six objects, with merge height (0 to 0.2) on the vertical axis and the objects in leaf order 1, 3, 2, 5, 4, 6 on the horizontal axis]
Typical Alternatives to Calculate the Distance between Clusters
• Single link: smallest distance between an element in one cluster and an element in the
other, i.e., dis(Ki, Kj) = min(dis(tip, tjq)) over all tip in Ki, tjq in Kj
• Complete link: largest distance between an element in one cluster and an element in
the other, i.e., dis(Ki, Kj) = max(dis(tip, tjq)) over all tip in Ki, tjq in Kj
• Average: average distance between an element in one cluster and an element in the
other, i.e., dis(Ki, Kj) = avg(dis(tip, tjq)) over all tip in Ki, tjq in Kj
• Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
• Medoid: one chosen, centrally located object in the cluster
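Assuming Euclidean distance between 2-D points (the two clusters below are illustrative, not taken from the slides), the five measures can be sketched as:

```python
# Sketch of the five cluster-distance measures above, using Euclidean
# distance between 2-D points. Cluster contents are illustrative.
from math import dist  # Euclidean distance (Python 3.8+)

Ki = [(0.0, 0.0), (1.0, 0.0)]
Kj = [(3.0, 0.0), (4.0, 1.0)]

pairs = [dist(p, q) for p in Ki for q in Kj]

single   = min(pairs)               # smallest pairwise distance
complete = max(pairs)               # largest pairwise distance
average  = sum(pairs) / len(pairs)  # mean over all pairs

def centroid(K):
    # Component-wise mean of the cluster's points.
    return tuple(sum(c) / len(K) for c in zip(*K))

centroid_d = dist(centroid(Ki), centroid(Kj))

def medoid(K):
    # The cluster member with the smallest total distance to the others.
    return min(K, key=lambda p: sum(dist(p, q) for q in K))

medoid_d = dist(medoid(Ki), medoid(Kj))

print(single, complete, average, centroid_d, medoid_d)
```

Note that the centroid need not be a member of the cluster, while the medoid always is; this is why the two measures can disagree.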
Hierarchical Algorithms
• Single-link
• Distance between two clusters set equal to the minimum of the
distances between all pairs of instances in the two clusters
• Single link (nearest neighbour). The distance between two
clusters is determined by the distance of the two closest
objects (nearest neighbours) in the different clusters.
• Complete-link
• Distance between two clusters set equal to the maximum of all
distances between pairs of instances in the two clusters
• Complete link (furthest neighbour). The distance between two
clusters is determined by the greatest distance between
any two objects in the different clusters (i.e., by the
"furthest neighbours").
• Produces tightly bound, compact clusters
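The contrast between the two linkages can be seen on a tiny illustrative 1-D example: single link treats a chain of closely spaced points as "near", which is what produces chaining, while complete link looks at the furthest pair and so penalises stretched-out clusters:

```python
# Illustrative 1-D example: a chain of points 0..4 spaced 1 apart.
# Single link merges happily along the chain; complete link sees
# the growing cluster's full diameter.
def single_link(A, B):
    # Minimum distance over all cross-cluster pairs.
    return min(abs(a - b) for a in A for b in B)

def complete_link(A, B):
    # Maximum distance over all cross-cluster pairs.
    return max(abs(a - b) for a in A for b in B)

chain = [0, 1, 2, 3, 4]
A, B = chain[:1], chain[1:]        # {0} vs {1, 2, 3, 4}

print(single_link(A, B))           # 1 -- the chain looks "close"
print(complete_link(A, B))         # 4 -- the same pair looks far apart
```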
Hierarchical Algorithms (cont.)
Pair-group average.
The distance between two clusters is calculated as the average distance between all
pairs of objects in the two different clusters. This method is very efficient when the
objects form natural, distinct "clumps"; however, it also performs well with elongated,
"chain"-type clusters.
Pair-group centroid.
The distance between two clusters is determined as the distance between their centroids.
Hierarchical Clustering
• Use distance matrix as clustering criteria. This method does not require
the number of clusters k as an input, but needs a termination condition
Agglomerative (bottom-up): AGNES
Agglomerative clustering is a common type of hierarchical clustering, also called
Agglomerative Nesting (AGNES). Start with each document being a single cluster;
eventually all documents belong to the same cluster.
Divisive (top-down): DIANA
A top-down clustering approach. It works like agglomerative clustering but in the
opposite direction, and is also known as DIANA (Divisive Analysis). Start with all
documents belonging to the same cluster; eventually each document forms a cluster
of its own.
[Figure: five objects a-e; AGNES merges step by step (ab, de, cde, abcde) from Step 0 to Step 4, while DIANA splits in the reverse order from Step 4 to Step 0]
Dendrogram: Shows How the Clusters are Merged
1. Start by assigning each observation to its own single-point cluster, so that if we have N
observations, we have N clusters, each containing just one observation.
2. Find the closest (most similar) pair of clusters and merge them into one cluster; we now
have N-1 clusters. Similarity and dissimilarity can be measured in various ways.
3. Find the next two closest clusters and merge them into one cluster; we now have N-2
clusters. This is done using the agglomerative linkage techniques above.
4. Repeat steps 2 and 3 until all observations are merged into one single cluster of size N.
This process continues until all the objects have been clustered. These successive merge
operations produce a binary clustering tree (dendrogram), whose root is the cluster that
contains all the observations. This dendrogram represents a hierarchy of partitions.
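The steps above can be sketched as a naive O(N^3) loop. Single link is assumed here as the closeness measure, and the points are illustrative:

```python
# Minimal sketch of the agglomerative steps: start with N singleton
# clusters, repeatedly merge the closest pair (single link), and record
# the merge order, which is exactly what a dendrogram draws.
from math import dist

def agglomerate(points):
    # Step 1: every observation starts as its own cluster.
    clusters = [[p] for p in points]
    merges = []  # (cluster_a, cluster_b, merge_distance) in merge order
    while len(clusters) > 1:
        # Steps 2-3: find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        # Step 4: repeat until one cluster of size N remains.
    return merges

merges = agglomerate([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)])
for a, b, d in merges:
    print(a, b, round(d, 3))
```

With N points there are always exactly N-1 merges, one per internal node of the dendrogram.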
Single Link Agglomerative Clustering
• Use maximum similarity of pairs:
    sim(ci, cj) = max sim(x, y) over x in ci, y in cj
• Can result in "straggly" (long and thin) clusters due to the chaining effect.
• Appropriate in some domains, such as clustering islands: "Hawai'i clusters"
• After merging ci and cj, the similarity of the resulting cluster to another
cluster, ck, is:
    sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))

Worked example: six points

Point  X     Y
1      0.4   0.53
2      0.22  0.38
3      0.35  0.32
4      0.26  0.19
5      0.08  0.41
6      0.45  0.3

Identify the two nearest clusters and merge them; repeat the process until all
objects are in the same cluster.
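The merge step needs no recomputation over raw points: for single link, the distance from the merged cluster to any third cluster is just the minimum of the two old distances. A 1-D sketch with illustrative clusters:

```python
# Single-link merge update: after merging ci and cj,
#   dis(ci U cj, ck) = min(dis(ci, ck), dis(cj, ck)),
# so the new row of the distance matrix comes from the old entries.
def d(A, B):
    # Single-link distance between two point sets.
    return min(abs(a - b) for a in A for b in B)

ci, cj, ck = [1.0, 2.0], [3.0, 4.0], [7.0, 9.0]

updated = min(d(ci, ck), d(cj, ck))  # cheap update from old matrix entries
assert updated == d(ci + cj, ck)     # matches brute-force recomputation
print(updated)
```

The analogous update for complete link takes the maximum of the two old distances instead.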
Average link
• The average distance matrix is built over the same six (X, Y) points as above.
Construct a distance matrix

     1     2     3     4     5     6
1    0
2    0.24  0
3    0.22  0.15  0
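A matrix like this can be built directly from the six (X, Y) points. Entries are rounded to two decimals here, so some values may differ from the slide's in the last digit:

```python
# Sketch: full Euclidean distance matrix for the six points above.
from math import dist

points = [(0.4, 0.53), (0.22, 0.38), (0.35, 0.32),
          (0.26, 0.19), (0.08, 0.41), (0.45, 0.3)]

# D[i][j] = Euclidean distance between point i+1 and point j+1.
D = [[round(dist(p, q), 2) for q in points] for p in points]

for row in D:
    print(row)
```

The matrix is symmetric with a zero diagonal, which is why the slides show only the lower triangle.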