Cluster Analysis
DSCI 5240 Data Mining and Machine Learning for Business
Javier Rubio-Herrero
Introduction to Clustering
• Cluster: A collection of data objects
• High similarity among objects in the same cluster
• Dissimilarity among objects in different clusters
• Clustering is an unsupervised classification technique: there are no pre-determined classes
• Typical applications of clustering
• As a stand-alone analysis, to gain insight into the data
• As a pre-processing step for other predictive models
• Cluster analysis is also known as segmentation
Applications
• Marketing – Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use – Identification of areas of similar land use in an earth observation database
• Insurance – Identifying groups of motor insurance policy holders with a high average claim cost
• City planning – Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies – Observed earthquake epicenters should be clustered along continental faults
Cluster Analysis Involves Subjective Judgements
• Good clustering will produce high-quality clusters with
• High intra-class similarity
• Low inter-class similarity
• The quality of the clustering depends on minimizing intra-cluster distances and maximizing inter-cluster distances

[Figure: a good clustering, with intra-cluster distances minimized and inter-cluster distances maximized]
What is Similarity?
We know what it is when we see it… but it’s hard to define precisely
A Mathematical Approach to Determining Similarity
• It is hard to define “similar enough,” but smaller values of the distance d(x, y) indicate a higher degree of similarity
Minkowski Distance
• The Minkowski distance is a means of calculating the distance between points in n-dimensional space

$$d(x, y) = \sqrt[q]{\sum_{i=1}^{n} |x_i - y_i|^q} = \sqrt[q]{|x_1 - y_1|^q + |x_2 - y_2|^q + \dots + |x_n - y_n|^q}$$

[Figure: points A, B, and C illustrating distances in the plane]
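The formula above can be sketched in a few lines of Python (the function name is illustrative, not from the slides). Setting q = 2 gives the Euclidean distance, and q = 1 gives the Manhattan distance:

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between points x and y."""
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1 / q)

# q = 2 recovers the familiar Euclidean distance: minkowski((0, 0), (3, 4), 2) is 5.0
```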
Euclidean Distance
Jim (Age: 38) and Mike (Age: 97)

$$d(\mathrm{Jim}, \mathrm{Mike}) = \sqrt{\sum_{i=1}^{n} (x_{\mathrm{Jim},i} - y_{\mathrm{Mike},i})^2} = \sqrt{(38 - 97)^2 + (50{,}000 - 1{,}000{,}000)^2 + (5 - 0)^2} \approx 950{,}000$$
Plotting Similarity
[Figure: Jim and Mike plotted as points, separated by a distance of ≈ 950,000]
Euclidean Distance
Kate (Age: 22) and Mike (Age: 97)

$$d(\mathrm{Kate}, \mathrm{Mike}) = \sqrt{\sum_{i=1}^{n} (x_{\mathrm{Kate},i} - y_{\mathrm{Mike},i})^2} = \sqrt{(22 - 97)^2 + (1{,}000{,}000 - 1{,}000{,}000)^2 + (10 - 0)^2} \approx 76$$
Plotting Similarity
[Figure: Kate, Jim, and Mike plotted as points; the Kate-Mike distance is ≈ 76 and the Jim-Mike distance ≈ 950,000]
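The two calculations above can be reproduced in a few lines (variable names are illustrative; the three attribute values are taken from the formulas on the previous slides). Note how Income's scale swamps every other attribute when the data are not standardized:

```python
import math

def euclidean(x, y):
    """Plain Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# (Age, Income, third attribute) -- values from the formulas above
jim  = (38, 50_000, 5)
mike = (97, 1_000_000, 0)
kate = (22, 1_000_000, 10)

d_jim_mike  = euclidean(jim, mike)    # ~950,000: dominated entirely by Income
d_kate_mike = euclidean(kate, mike)   # ~76: Incomes are equal, so Age dominates
```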
Standardization
• Standardization is an important consideration when performing cluster analysis: without it, variables measured on large scales (such as income) dominate the distance calculation
Standardization

Z-Score: $z = \dfrac{x - \bar{x}}{s}$
• Common approach to standardization
• Subtract the mean from each observation and divide by the sample standard deviation
• Resulting data will have a mean of zero and a standard deviation of one (because you divide by s)

Scaling to [0, 1]: $x_{0,1} = \dfrac{x - x_{min}}{x_{max} - x_{min}}$
• Less frequently used
• Subtract the minimum value from each observation and divide by the range
• Resulting data will lie between zero and one
• When outliers are present, this approach may be overly harsh
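Both approaches can be sketched directly from the formulas above (function names are illustrative):

```python
import statistics

def z_score(values):
    """Subtract the mean, divide by the sample standard deviation."""
    mean, s = statistics.mean(values), statistics.stdev(values)
    return [(v - mean) / s for v in values]

def min_max(values):
    """Subtract the minimum, divide by the range: results lie in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [38, 97, 22]
z = z_score(ages)    # mean 0, standard deviation 1
m = min_max(ages)    # smallest value maps to 0, largest to 1
```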
[Figure: the standardized data replotted; the Kate-Mike distance is now ≈ 2.758 and the Jim-Mike distance ≈ 2.496]
Clustering Approaches
• Partitional Clustering
• Goal is to partition a dataset containing n objects into k clusters
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
• Global optimum: exhaustively enumerate all partitions
• Heuristic methods:
• k-means (MacQueen 1967) – Each cluster is represented by a calculated centroid
• k-medoids (Kaufman and Rousseeuw 1987) – Each cluster is represented by one of the objects in the cluster. Also known as partitioning around medoids (PAM)
• Hierarchical Clustering
• Goal is to identify the hierarchy among the n objects in the dataset so that they can be represented in a nested tree structure
k-Means Clustering
k-Means Algorithm
1. Choose k initial centroids
2. Assign each observation to its nearest centroid
3. Recompute each centroid as the mean of the observations assigned to it
4. Repeat steps 2 and 3 until the assignments no longer change
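A minimal pure-Python sketch of the algorithm (the function name, random initialization, and iteration cap are illustrative assumptions, not from the slides):

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Basic k-means: assign to nearest centroid, recompute means, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: pick k initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # step 2: assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # step 3: recompute each centroid as the mean of its cluster
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:         # step 4: stop when nothing moves
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated blobs of points, the returned centroids settle on the blob means after a handful of iterations.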
k-Means Visualized
Obs   Age     Income
4     0.930   0.850
5     0.390   0.200
6     0.580   0.250

[Figure: the observations plotted with Age on the x-axis and Income on the y-axis]
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Centroid   Age     Income
C1.1       0.450   0.150
C1.2       0.600   0.300

[Figure: the two initial centroids, C1.1 and C1.2, plotted among the observations]
Obs   Age     Income   d(C1.1)   d(C1.2)
4     0.930   0.850    0.849     0.641
5     0.390   0.200    0.078     0.233
6     0.580   0.250    0.164     0.054

Centroid   Age     Income
C1.1       0.450   0.150
C1.2       0.600   0.300
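The distance columns can be checked directly with the Euclidean formula (a quick sketch using Python's math.dist):

```python
import math

# Observations and the two initial centroids from the slides
obs = {4: (0.930, 0.850), 5: (0.390, 0.200), 6: (0.580, 0.250)}
c11, c12 = (0.450, 0.150), (0.600, 0.300)

for i, p in obs.items():
    d1, d2 = math.dist(p, c11), math.dist(p, c12)
    nearest = "C1.1" if d1 < d2 else "C1.2"
    print(f"Obs {i}: d(C1.1) = {d1:.3f}, d(C1.2) = {d2:.3f} -> {nearest}")
```

Observation 5 is nearest to C1.1, while observations 4 and 6 are nearest to C1.2.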
Each observation is assigned to its nearest centroid: observation 5 joins cluster C1.1, while observations 4 and 6 join cluster C1.2.
The observations assigned to C1.1 average to Age 0.427 and Income 0.208, which becomes the updated centroid:

Centroid   Age     Income
C2.1       0.427   0.208
The observations assigned to C1.2 average to Age 0.837 and Income 0.700, which becomes the updated centroid:

Centroid   Age     Income
C2.2       0.837   0.700
Centroid   Age     Income
C2.1       0.427   0.208
C2.2       0.837   0.700

[Figure: the updated centroids C2.1 and C2.2 plotted among the observations]
Obs   Age     Income   d(C2.1)   d(C2.2)
4     0.930   0.850    0.816     0.177
5     0.390   0.200    0.038     0.670
6     0.580   0.250    0.159     0.518

Centroid   Age     Income
C2.1       0.427   0.208
C2.2       0.837   0.700
Again, each observation is assigned to its nearest centroid: observations 5 and 6 join cluster C2.1, while observation 4 joins cluster C2.2.
The observations assigned to C2.1 (including observation 6) average to Age 0.465 and Income 0.219, which becomes the updated centroid:

Centroid   Age     Income
C3.1       0.465   0.219
The observations assigned to C2.2 average to Age 0.965 and Income 0.925, which becomes the updated centroid:

Centroid   Age     Income
C3.2       0.965   0.925
Centroid   Age     Income
C3.1       0.465   0.219
C3.2       0.965   0.925

[Figure: the updated centroids C3.1 and C3.2 plotted among the observations]
Obs   Age     Income   d(C3.1)   d(C3.2)
4     0.930   0.850    0.784     0.083
5     0.390   0.200    0.077     0.925
6     0.580   0.250    0.119     0.777

Centroid   Age     Income
C3.1       0.465   0.219
C3.2       0.965   0.925
The assignments are unchanged from the previous iteration (observation 4 with the second cluster, observations 5 and 6 with the first), so the algorithm has converged.
• We generally interpret clusters based on their centroids
• You can think of a centroid as representative of the observations within the cluster
Choosing k
• Visualization
• Data-Driven Approaches
Natural Groupings
[Figure: the same dataset grouped in different ways, illustrating that the natural number of clusters is ambiguous]
[Figure: elbow plot of total clustering error against K for K = 1 to 7; the error drops steeply at first, then flattens, and the "elbow" where the curve bends marks a good choice of k]
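The elbow heuristic can be sketched with a basic k-means: run the algorithm for several values of k and watch where the total within-cluster squared error stops dropping sharply. This is a pure-Python illustration (the function name and random initialization are mine, not from the slides):

```python
import math
import random

def kmeans_sse(points, k, seed=0):
    """Run a basic k-means and return the total within-cluster squared error."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(100):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: math.dist(p, centroids[c]))].append(p)
        # recompute centroids as cluster means; stop once they no longer move
        new = [tuple(sum(d) / len(cl) for d in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return sum(math.dist(p, centroids[j]) ** 2
               for j, cl in enumerate(clusters) for p in cl)

# Two natural blobs: the error drops sharply up to k = 2, then flattens
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
sse = {k: kmeans_sse(data, k) for k in (1, 2, 3)}
```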
Comments on k-Means

Strengths
• K-means is a very flexible algorithm that can be used in a wide variety of contexts
• Efficient: O(tkn), where n is the number of observations, k is the number of clusters, and t is the number of iterations. Normally k, t << n
• Widely available in data mining tools
• Straightforward and easy to understand

Weaknesses
• Applicable only when the mean is defined
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable for discovering clusters with non-convex shapes
Clustering is Difficult!
• What if there are many dimensions?
• What if the variables are not of the same type?
• What if the number of objects is large?
• What if the data has noise or outliers?
• User-specified constraints (e.g., managers might think income should weigh twice as much as age)
• Interpretability and usability of clusters