MDA Session 4
Cluster Analysis
Objective: To separate individual observations/items/objects into groups, or clusters,
on the basis of the values of the 𝑞 variables measured on each object.
The objects within each cluster should be similar
and objects in different clusters should be dissimilar.
The dissimilarity between any two objects is typically
quantified using a distance measure (like Euclidean
distance).
A type of unsupervised classification, since the nature of the groups (and typically
even the number of groups) is unknown before the objects are classified into clusters.
Applications of Cluster Analysis
In marketing: researchers attempt to find
distinct clusters of the consumer population
so that several distinct marketing strategies
can be used for the clusters.
In ecology: scientists classify plants or
animals into various groups based on some
measurable characteristics.
In genetics: researchers may separate genes into several classes based on their
expression ratios measured at different time points.
The Need for Clustering Algorithms
Suppose 𝑛 objects are to be segregated into a small number (say, 𝑘) of clusters,
where 𝑘 ≪ 𝑛.
If the number of clusters 𝑘 is decided ahead of time, then the number of ways to
partition the 𝑛 objects into 𝑘 clusters is a “Stirling number of the second kind.”
Example: There are 46,771,289,738,810 ways to form 𝑘 = 4 (non-empty) clusters from 𝑛 = 25 objects.
> library(copula)
> options(scipen = 999)
> Stirling2(25, 4)
[1] 46771289738810
The Need for Clustering Algorithms (Continued)
If the number of clusters (𝑘) is flexible, the possible number of partitions is much higher.
Example: There are over 4,638,590 trillion possible partitions of 𝑛 = 25 objects if we let
the number of clusters 𝑘 vary. (This is called the Bell number for 𝑛.)
Challenging even for a modern‐day computer to explore ALL the possible clustering
partitions to see which is best.
Hence, we need intelligently designed algorithms that will search among the best possible
partitions relatively quickly.
> library(numbers)
> bell(25)
[1] 4638590332229998592
> x <- seq(10, 50, 10)   # Bell numbers for n = 10, 20, ..., 50
> sapply(x, bell)
[Figure: thirteen numbered objects and several candidate partitions, e.g. {1,3,6,7,9}, {11,13}, {2,5,8}, {12}, {4}, {10}; {1,3,6,7,9}, {11,13}, {2,5,8,12}, {4,10}; and {1,3,6,7,9,11,13}, {2,5,8,12}, {4,10}.]
Which attribute to use for clustering?
• Color combination & design
• Size
• Location
• All of the above
What is their relative importance?
Lessons to be learnt from the clustering game
Differences (distinctions) are almost always unavoidable, both within clusters and
between clusters; but between-cluster variation should be dominant.
The notion of difference/similarity/distance may vary from one analyst to another,
possibly depending on which factors are perceived to be more important.
The choice of the NUMBER of clusters can be tricky/controversial.
Clustering based on a single attribute is relatively easy (quiz!)
[Figure: objects plotted on a standardized cost scale (Zcost), roughly from -2 to 3.]
Distance and (dis‐)similarity measures between cases
Basic principle of clustering: Step I
Use Euclidean distance (other choices are possible)
Standardize observations to give equal weight to all variables (not always necessary)
Similarity and distance for presence/absence type attributes:
define similarity s = (# of attributes that match)/(# of attributes), and relate similarity to distance via s = 1/(1 + d) (see the sketch below)
Differential weights for 1-1 and 0-0 matches (presence/absence) can be accommodated
Use correlation to cluster variables
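A minimal R sketch of the matching-based similarity for presence/absence attributes; the two vectors below are made-up toy data, and d = (1 - s)/s is simply the distance implied by s = 1/(1 + d).

x <- c(1, 0, 1, 1, 0)     # toy presence/absence profile of object 1
y <- c(1, 1, 1, 0, 0)     # toy presence/absence profile of object 2
s <- mean(x == y)         # simple matching similarity: 3 matches out of 5 attributes
d <- (1 - s) / s          # distance implied by s = 1/(1 + d)
# To cluster variables instead of objects, a correlation-based
# dissimilarity such as 1 - cor(data) is a common choice.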
Popular choices of distance / dissimilarity between
multivariate observations
Euclidean distance:
$d_E(\mathbf{x}, \mathbf{y}) = \left[ \sum_{i=1}^{p} (x_i - y_i)^2 \right]^{1/2} = \left[ (\mathbf{x} - \mathbf{y})^t (\mathbf{x} - \mathbf{y}) \right]^{1/2}$

(Squared) Mahalanobis distance:
$d_M(\mathbf{x}, \mathbf{y}) = (\mathbf{x} - \mathbf{y})^t S^{-1} (\mathbf{x} - \mathbf{y})$

City-block (Manhattan) distance:
$d_{CB}(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{p} |x_i - y_i|$
Other choices: maximum norm, Minkowski distance etc.
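As a rough illustration (not taken from the slides), these distances can be computed in R as follows; the matrix X is simulated toy data, and mahalanobis() returns the squared Mahalanobis distance.

set.seed(1)
X <- matrix(rnorm(50), nrow = 10)          # 10 toy objects on p = 5 variables
x <- X[1, ]; y <- X[2, ]
sqrt(sum((x - y)^2))                       # Euclidean distance
sum(abs(x - y))                            # city-block (Manhattan) distance
mahalanobis(x, center = y, cov = cov(X))   # (squared) Mahalanobis distance
dist(X, method = "euclidean")              # all pairwise Euclidean distances
dist(X, method = "manhattan")              # all pairwise city-block distances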
Types of Clustering Algorithms
There are three major classes of clustering methods – from oldest to newest:
Hierarchical methods
Partitioning methods
Model‐based methods
Hierarchical methods cluster the data in a series of 𝑛 steps, typically joining
observations together step by step to form clusters.
Partitioning methods first determine 𝑘, and then typically attempt to find the
partition into 𝑘 clusters that optimizes some objective function of the data.
Model‐based clustering takes a statistical approach, formulating a model that
categorizes the data into subpopulations and using maximum likelihood to
estimate the model parameters.
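As a small illustration of the partitioning approach (a sketch on simulated data, not from the slides), k-means in R fixes k in advance and searches for the partition minimizing the total within-cluster sum of squares:

set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # two simulated groups
           matrix(rnorm(40, mean = 4), ncol = 2))
km <- kmeans(X, centers = 2, nstart = 25)   # k = 2 chosen ahead of time
km$cluster                                  # cluster label of each object
km$tot.withinss                             # objective: total within-cluster SS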
Clustering methods:
  Hierarchical
    Linkage based (e.g., average linkage)
    Variance based: Ward's method (within-cluster SS is minimized)
  Non-hierarchical
    K-means
Hierarchical Clustering
Defining Closeness of Clusters
The key in a hierarchical clustering algorithm is specifying how to determine the
two “closest” clusters at any given step.
For the first step, join the two objects whose (Euclidean?) distance is smallest.
In subsequent stages: should we join two individual objects together, or merge
an object into a cluster that already has multiple objects?
Join on the basis of inter-cluster dissimilarity (distance):
one of three linkage methods
distance between cluster/group centroids
within-group sum of squares (Ward's method)
• Cluster the provinces in terms of density of health-care workers.
• How is it different from 2001?
Canadian health‐workers density in 2001
Step II:
Distance/linkage between two groups
• Complete linkage
• Single linkage
• Average linkage
• Centroid method
Linkage Methods in Hierarchical Clustering
The single linkage algorithm (“nearest neighbor” clustering), at each step, joins
the clusters whose minimum distance between objects is smallest, i.e., joins the
clusters 𝐴 and 𝐵 with the smallest
$d_{AB} = \min_{i \in A,\, j \in B} d_{ij}$
The complete linkage algorithm (“farthest neighbor” clustering), at each step,
joins the clusters whose maximum distance between objects is smallest, i.e., joins
the clusters 𝐴 and 𝐵 with the smallest
$d_{AB} = \max_{i \in A,\, j \in B} d_{ij}$
The average linkage algorithm, at each step, joins the clusters whose average
distance between objects is smallest, i.e., joins the clusters A and B with the
smallest
$d_{AB} = \frac{1}{n_A n_B} \sum_{i \in A} \sum_{j \in B} d_{ij}$
where $n_A$ and $n_B$ are the numbers of objects in clusters 𝐴 and 𝐵, respectively.
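A minimal sketch of the three linkage options via R's hclust(), on simulated data; only the method argument differs between the calls.

set.seed(1)
X <- matrix(rnorm(30), ncol = 2)               # 15 toy objects
d <- dist(X)                                   # Euclidean distances
hc_single   <- hclust(d, method = "single")    # nearest-neighbor linkage
hc_complete <- hclust(d, method = "complete")  # farthest-neighbor linkage
hc_average  <- hclust(d, method = "average")   # average linkage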
Ward’s Method (1963)
At each step, the method searches over all possible ways to join a pair of clusters
so that the K‐means criterion 𝑊𝑆𝑆 is minimized for that step.
It begins with each object as its own cluster (so that 𝑊𝑆𝑆 = 0) and concludes
with all objects in one cluster.
The R function hclust performs Ward's method if the option method = "ward.D2" is
specified (older versions of R used method = "ward").
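A minimal sketch of Ward's method, using the built-in USArrests data purely for illustration:

d <- dist(scale(USArrests))                # standardized data, Euclidean distances
hc_ward <- hclust(d, method = "ward.D2")   # Ward's criterion ("ward" in older R)
plot(hc_ward)                              # dendrogram of the merge sequence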
Dendrograms
A hierarchical algorithm actually produces not one partition of the data, but lots of
partitions.
There is a clustering partition for each step 1, 2, . . . , 𝑛.
The series of merges can be represented at a glance by a tree-like structure called a
dendrogram.
To get a single 𝑘‐cluster solution, we would cut the dendrogram horizontally at a point
that would produce 𝑘 groups (The R function cutree can do this).
It is strongly recommended to examine the full dendrogram first before determining
where to cut it.
A natural set of clusters may be apparent from a glance at the dendrogram.
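A small sketch of cutting a dendrogram with cutree(), again on the built-in USArrests data (chosen only for illustration):

hc  <- hclust(dist(scale(USArrests)), method = "average")
plot(hc)                   # inspect the full dendrogram first
grp <- cutree(hc, k = 4)   # then cut to obtain a k = 4 cluster solution
table(grp)                 # resulting cluster sizes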
Standardization of Observations
If the variables in our data set are of different types or are measured on very
different scales, then some variables may play an inappropriately dominant role
in the clustering process.
In this case, it is recommended to standardize the variables in some way before
clustering the objects.
Possible standardization approaches:
1. Divide each column by its sample standard deviation, so that all variables have standard
deviation 1.
2. Divide each variable by its sample range (max − min); Milligan and Cooper (1988) found
that this approach best preserved the clustering structure.
3. Convert data to 𝑧‐scores by (for each variable) subtracting the sample mean and then
dividing by the sample standard deviation – a common option in clustering software
packages.
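A rough sketch of the three standardization options in R, using the built-in USArrests data as an example:

X <- USArrests
X_sd    <- sweep(X, 2, apply(X, 2, sd), "/")                            # 1. divide by SD
X_range <- sweep(X, 2, apply(X, 2, function(v) max(v) - min(v)), "/")   # 2. divide by range
X_z     <- scale(X)                                                     # 3. z-scores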
Examining cluster constituents