Cluster Analysis
DSCI 5240 Data Mining and Machine Learning for Business
Javier Rubio-Herrero
Introduction to Clustering
• Cluster: A collection of data objects
• High similarity among objects in the same cluster
• Dissimilarity among objects in different clusters
• Clustering is an unsupervised classification technique: there are no pre-determined classes
• Typical applications of clustering
• As a stand-alone analysis, to gain insight into the data
• As a pre-processing step for other predictive models
• Cluster analysis is also known as segmentation
Applications
• Marketing – Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use – Identification of areas of similar land use in an earth observation database
• Insurance – Identifying groups of motor insurance policy holders with a high average claim cost
• City planning – Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies – Observed earthquake epicenters should be clustered along continental faults
Cluster Analysis Involves Subjective Judgements
• Good clustering will produce high-quality clusters with
• High intra-class similarity
• Low inter-class similarity
• The quality of the clustering depends on minimizing intra-cluster distances and maximizing inter-cluster distances

[Figure: a good clustering, with intra-cluster distances minimized and inter-cluster distances maximized]
What is Similarity?
We know what it is when we see it… but it’s hard to define precisely
A Mathematical Approach to Determining Similarity
• It is hard to define “similar enough,” but smaller values of the distance d(x, y) indicate a higher degree of similarity
Minkowski Distance
• The Minkowski distance is a means of calculating the distance between points in n-dimensional space

$$d(x, y) = \sqrt[q]{\sum_{i=1}^{n} |x_i - y_i|^q} = \sqrt[q]{|x_1 - y_1|^q + |x_2 - y_2|^q + \dots + |x_n - y_n|^q}$$

[Figure: points A, B, and C illustrating distances in the plane]
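The formula above can be sketched in a few lines of Python (the function name is illustrative, not from the slides). Setting q = 2 gives the Euclidean distance, and q = 1 gives the Manhattan distance:

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between points x and y."""
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1 / q)

# q = 2 recovers the familiar Euclidean distance: minkowski((0, 0), (3, 4), 2) is 5.0
```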
Euclidean Distance
Jim (Age: 38) and Mike (Age: 97)

$$d(\mathrm{Jim}, \mathrm{Mike}) = \sqrt{\sum_{i=1}^{n} (x_{\mathrm{Jim},i} - y_{\mathrm{Mike},i})^2} = \sqrt{(38 - 97)^2 + (50{,}000 - 1{,}000{,}000)^2 + (5 - 0)^2} \approx 950{,}000$$
Plotting Similarity
[Figure: Jim and Mike plotted as points, separated by a distance of ≈ 950,000]
Euclidean Distance
Kate (Age: 22) and Mike (Age: 97)

$$d(\mathrm{Kate}, \mathrm{Mike}) = \sqrt{\sum_{i=1}^{n} (x_{\mathrm{Kate},i} - y_{\mathrm{Mike},i})^2} = \sqrt{(22 - 97)^2 + (1{,}000{,}000 - 1{,}000{,}000)^2 + (10 - 0)^2} \approx 76$$
Plotting Similarity
[Figure: Kate, Jim, and Mike plotted as points; the Kate-Mike distance is ≈ 76 and the Jim-Mike distance ≈ 950,000]
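The two calculations above can be reproduced in a few lines (variable names are illustrative; the three attribute values are taken from the formulas on the previous slides). Note how Income's scale swamps every other attribute when the data are not standardized:

```python
import math

def euclidean(x, y):
    """Plain Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# (Age, Income, third attribute) -- values from the formulas above
jim  = (38, 50_000, 5)
mike = (97, 1_000_000, 0)
kate = (22, 1_000_000, 10)

d_jim_mike  = euclidean(jim, mike)    # ~950,000: dominated entirely by Income
d_kate_mike = euclidean(kate, mike)   # ~76: Incomes are equal, so Age dominates
```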
Standardization
• Standardization is an important consideration when performing cluster analysis: without it, variables measured on large scales (such as income) dominate the distance calculation
Standardization

Z-Score: $z = \dfrac{x - \bar{x}}{s}$
• Common approach to standardization
• Subtract the mean from each observation and divide by the sample standard deviation
• Resulting data will have a mean of zero and a standard deviation of one (because you divide by s)

Scaling to [0, 1]: $x_{0,1} = \dfrac{x - x_{min}}{x_{max} - x_{min}}$
• Less frequently used
• Subtract the minimum value from each observation and divide by the range
• Resulting data will lie between zero and one
• When outliers are present, this approach may be overly harsh
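Both approaches can be sketched directly from the formulas above (function names are illustrative):

```python
import statistics

def z_score(values):
    """Subtract the mean, divide by the sample standard deviation."""
    mean, s = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / s for v in values]

def min_max(values):
    """Subtract the minimum, divide by the range: results lie in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [38, 97, 22]
z = z_score(ages)    # mean 0, standard deviation 1
m = min_max(ages)    # smallest value maps to 0, largest to 1
```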
[Figure: the standardized data replotted; the Kate-Mike distance is now ≈ 2.758 and the Jim-Mike distance ≈ 2.496]
Clustering Approaches
• Partitional Clustering
• Goal is to partition a dataset containing n objects into k clusters
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
• Global optimum: exhaustively enumerate all partitions
• Heuristic methods:
• k-means (MacQueen 1967) – Each cluster is represented by a calculated centroid
• k-medoids (Kaufman and Rousseeuw 1987) – Each cluster is represented by one of the objects in the cluster. Also known as partitioning around medoids (PAM)
• Hierarchical Clustering
• Goal is to identify the hierarchy among the n objects in the dataset so that they can be represented in a nested tree structure
k-Means Clustering
k-Means Algorithm
1. Choose k initial centroids
2. Assign each observation to its nearest centroid
3. Recompute each centroid as the mean of the observations assigned to it
4. Repeat steps 2 and 3 until the assignments no longer change
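A minimal pure-Python sketch of the algorithm (the function name, random initialization, and iteration cap are illustrative assumptions, not from the slides):

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Basic k-means: assign to nearest centroid, recompute means, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: pick k initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # step 2: assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # step 3: recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:         # step 4: stop when nothing moves
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated blobs of points, the returned centroids settle on the blob means after a handful of iterations.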
k-Means Visualized
Obs   Age     Income
4     0.930   0.850
5     0.390   0.200
6     0.580   0.250

[Figure: the observations plotted with Age on the x-axis and Income on the y-axis]
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Centroid   Age     Income
C1.1       0.450   0.150
C1.2       0.600   0.300

[Figure: the two initial centroids, C1.1 and C1.2, plotted among the observations]
Obs   Age     Income   d(C1.1)   d(C1.2)
4     0.930   0.850    0.849     0.641
5     0.390   0.200    0.078     0.233
6     0.580   0.250    0.164     0.054

Centroid   Age     Income
C1.1       0.450   0.150
C1.2       0.600   0.300
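The distance columns can be checked directly with the Euclidean formula (a quick sketch using Python's math.dist):

```python
import math

# Observations and the two initial centroids from the slides
obs = {4: (0.930, 0.850), 5: (0.390, 0.200), 6: (0.580, 0.250)}
c11, c12 = (0.450, 0.150), (0.600, 0.300)

for i, p in obs.items():
    d1, d2 = math.dist(p, c11), math.dist(p, c12)
    nearest = "C1.1" if d1 < d2 else "C1.2"
    print(f"Obs {i}: d(C1.1) = {d1:.3f}, d(C1.2) = {d2:.3f} -> {nearest}")
```

Observation 5 is nearest to C1.1, while observations 4 and 6 are nearest to C1.2.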
Each observation is assigned to its nearest centroid: observation 5 joins cluster C1.1, while observations 4 and 6 join cluster C1.2.
The observations assigned to C1.1 average to Age 0.427 and Income 0.208, which becomes the updated centroid:

Centroid   Age     Income
C2.1       0.427   0.208
The observations assigned to C1.2 average to Age 0.837 and Income 0.700, which becomes the updated centroid:

Centroid   Age     Income
C2.2       0.837   0.700
Centroid   Age     Income
C2.1       0.427   0.208
C2.2       0.837   0.700

[Figure: the updated centroids C2.1 and C2.2 plotted among the observations]
Obs   Age     Income   d(C2.1)   d(C2.2)
4     0.930   0.850    0.816     0.177
5     0.390   0.200    0.038     0.670
6     0.580   0.250    0.159     0.518

Centroid   Age     Income
C2.1       0.427   0.208
C2.2       0.837   0.700
Again, each observation is assigned to its nearest centroid: observations 5 and 6 join cluster C2.1, while observation 4 joins cluster C2.2.
The observations assigned to C2.1 (including observation 6) average to Age 0.465 and Income 0.219, which becomes the updated centroid:

Centroid   Age     Income
C3.1       0.465   0.219
The observations assigned to C2.2 average to Age 0.965 and Income 0.925, which becomes the updated centroid:

Centroid   Age     Income
C3.2       0.965   0.925
Centroid   Age     Income
C3.1       0.465   0.219
C3.2       0.965   0.925

[Figure: the updated centroids C3.1 and C3.2 plotted among the observations]
Obs   Age     Income   d(C3.1)   d(C3.2)
4     0.930   0.850    0.784     0.083
5     0.390   0.200    0.077     0.925
6     0.580   0.250    0.119     0.777

Centroid   Age     Income
C3.1       0.465   0.219
C3.2       0.965   0.925
The assignments are unchanged from the previous iteration (observation 4 with the second cluster, observations 5 and 6 with the first), so the algorithm has converged.
• We generally interpret clusters based on their centroids
• You can think of a centroid as representative of the observations within the cluster
Choosing k
• Visualization
• Data-Driven Approaches
Natural Groupings
[Figure: the same dataset grouped in different ways, illustrating that the natural number of clusters is ambiguous]
[Figure: elbow plot of total clustering error against K for K = 1 to 7; the error drops steeply at first, then flattens, and the "elbow" where the curve bends marks a good choice of k]
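The elbow heuristic can be sketched with a basic k-means: run the algorithm for several values of k and watch where the total within-cluster squared error stops dropping sharply. This is a pure-Python illustration (the function name and random initialization are mine, not from the slides):

```python
import math
import random

def kmeans_sse(points, k, seed=0):
    """Run a basic k-means and return the total within-cluster squared error."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(100):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: math.dist(p, centroids[c]))].append(p)
        # recompute centroids as cluster means; stop once they no longer move
        new = [tuple(sum(d) / len(cl) for d in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return sum(math.dist(p, centroids[j]) ** 2
               for j, cl in enumerate(clusters) for p in cl)

# Two natural blobs: the error drops sharply up to k = 2, then flattens
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
sse = {k: kmeans_sse(data, k) for k in (1, 2, 3)}
```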
Comments on k-Means

Strengths
• K-means is a very flexible algorithm that can be used in a wide variety of contexts
• Efficient: O(tkn), where n is the number of observations, k is the number of clusters, and t is the number of iterations. Normally k, t << n
• Widely available in data mining tools
• Straightforward and easy to understand

Weaknesses
• Applicable only when the mean is defined
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable for discovering clusters with non-convex shapes
Clustering is Difficult!
• What if there are many dimensions?
• What if the variables are not of the same type?
• What if the number of objects is large?
• What if the data has noise or outliers?
• User-specified constraints (e.g., managers might think income should weigh twice as much as age)
• Interpretability and usability of clusters