
Cluster Analysis
DSCI 5240 Data Mining and Machine Learning for Business
Javier Rubio-Herrero

The Importance of Similarity

"One can state, without exaggeration, that the observation of and the search for similarities and differences are the basis of all human knowledge."

Alfred Nobel

Introduction to Clustering
• Cluster: A collection of data objects
  • Large similarity among objects in the same cluster
  • Dissimilarity among objects in different clusters
• Clustering is an unsupervised classification technique: no pre-determined classes
• Typical applications of clustering
  • A stand-alone analysis, to gain insight into the data
  • A pre-processing step for other predictive models
• Cluster analysis is also known as segmentation

Applications
• Marketing – Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use – Identification of areas of similar land use in an earth observation database
• Insurance – Identifying groups of motor insurance policy holders with a high average claim cost
• City planning – Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies – Observed earthquake epicenters should be clustered along continent faults
Cluster Analysis Involves Subjective Judgements

(Figure: the same scatter of points, grouped several plausible ways.)

How many clusters? Two clusters? Four clusters? Six clusters?

Good Clustering
• Good clustering will produce high quality clusters with
  • High intra-class similarity (intra-cluster distances minimized)
  • Low inter-class similarity (inter-cluster distances maximized)
• Quality of the clustering depends on
  • The similarity measure used
  • The implementation
• Quality is also measured by the ability to discover hidden patterns

What is Similarity?

We know what it is when we see it… but it’s hard to define precisely
A Mathematical Approach to Determining Similarity

• Similarity can be expressed in terms of a difference (distance) function d(x, y)
• Definitions of difference functions vary depending on the type of variables involved (interval, binary, nominal, ordinal)
• It is hard to define "similar enough," but smaller values of d(x, y) indicate a higher degree of similarity

Minkowski Distance
• Minkowski distance is a means of calculating the distance between points in n-dimensional space

  d(x, y) = [ Σᵢ₌₁ⁿ |xᵢ − yᵢ|^q ]^(1/q)
          = [ |x₁ − y₁|^q + |x₂ − y₂|^q + … + |xₙ − yₙ|^q ]^(1/q)

• When q = 2, we get Euclidean distance
• When q = 1, we get Manhattan distance
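As a quick sketch of the formula above (assuming plain Python lists as points; the function name is mine, not from the slides):

```python
# Minkowski distance: d(x, y) = (sum_i |x_i - y_i|^q)^(1/q)
def minkowski(x, y, q=2):
    return sum(abs(xi - yi) ** q for xi, yi in zip(x, y)) ** (1 / q)

# q = 2 gives Euclidean distance, q = 1 gives Manhattan distance
print(minkowski([0, 0], [3, 4], q=2))  # 5.0 (the 3-4-5 triangle)
print(minkowski([0, 0], [3, 4], q=1))  # 7.0
```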

Euclidean Distance

Jim: Age 38, Income $50,000, # of credit cards 5
Mike: Age 97, Income $1,000,000, # of credit cards 0

  d(Jim, Mike) = √( Σᵢ₌₁ⁿ (xᵢ,Jim − xᵢ,Mike)² )
               = √( (38 − 97)² + (50,000 − 1,000,000)² + (5 − 0)² )
               ≈ 950,000

Plotting Similarity

(Plot: Jim and Mike as points, separated by a distance of ≈ 950,000.)

Euclidean Distance

Kate: Age 22, Income $1,000,000, # of credit cards 10
Mike: Age 97, Income $1,000,000, # of credit cards 0

  d(Kate, Mike) = √( Σᵢ₌₁ⁿ (xᵢ,Kate − xᵢ,Mike)² )
                = √( (22 − 97)² + (1,000,000 − 1,000,000)² + (10 − 0)² )
                ≈ 76
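The two raw-scale distances above can be reproduced directly (a small sketch, assuming the three profiles are (age, income, credit cards) tuples):

```python
# math.dist computes the Euclidean distance between two points (Python 3.8+)
from math import dist

jim  = (38, 50_000, 5)
mike = (97, 1_000_000, 0)
kate = (22, 1_000_000, 10)

print(round(dist(jim, mike)))   # 950000 -- the income gap dominates
print(round(dist(kate, mike)))  # 76 -- incomes are equal, so age and cards decide
```

Note how income, measured on the largest scale, completely dominates the Jim–Mike distance; this is what motivates standardization next.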

Plotting Similarity

(Plot: Kate sits ≈ 76 from Mike, while Jim sits ≈ 950,000 from Mike.)

Standardization
• Standardization is an important consideration when performing cluster analysis
• Because similarity is measured in terms of distance, dimensions measured on large scales have a much larger effect
• There are multiple approaches to standardization; we will discuss:
  • z-score
  • Scaling to [0, 1]

Standardization

  z-score: z = (x − x̄) / s
  Scaling to [0, 1]: x₀₁ = (x − x_min) / (x_max − x_min)

Z-Score
• Common approach to standardization
• Subtract the mean from each observation and divide by the sample standard deviation
• Resulting data will have a mean of zero and a standard deviation of one (because you divide by s)

Scaling to [0, 1]
• Less frequently used
• Subtract the minimum value from each observation and divide by the range
• Resulting data will fall between zero and one
• When outliers are present, this approach may be overly harsh
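Both transformations are one-liners; a minimal sketch, assuming a plain list of observations for a single variable (function names are mine):

```python
# statistics.stdev is the *sample* standard deviation, matching the slide
from statistics import mean, stdev

def z_score(xs):
    m, s = mean(xs), stdev(xs)
    return [(x - m) / s for x in xs]

def scale_01(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

ages = [38, 97, 22]
print([round(z, 3) for z in z_score(ages)])   # mean 0, std dev 1 afterwards
print([round(x, 3) for x in scale_01(ages)])  # values now lie in [0, 1]
```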

The Impact of Standardization

(Plot after standardization: Kate–Mike distance ≈ 2.758; Jim–Mike distance ≈ 2.496. The dimensions now contribute comparably, and Jim is the closer match to Mike.)

How Standardization Works

  GRE   Age   z(GRE)   z(Age)
  596   23     0.423    0.332
  473   22    -1.022   -0.142
  482   22    -0.916   -0.142
  527   23    -0.388    0.332
  505   23    -0.646    0.332
  693   24     1.562    0.805
  626   24     0.775    0.805
  663   17     1.209   -2.511
  447   21    -1.327   -0.616
  588   24     0.329    0.805

             GRE        Age     z(GRE)  z(Age)
  Mean       560.000    22.300  0.000   0.000
  Range      246.000     7.000  2.889   3.316
  Variance   7252.222    4.456  1.000   1.000
  Std Dev     85.160     2.111  1.000   1.000
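One row of the table can be checked by hand; a short sketch using the GRE column:

```python
from statistics import mean, stdev

gre = [596, 473, 482, 527, 505, 693, 626, 663, 447, 588]
m, s = mean(gre), stdev(gre)          # sample mean and std dev

print(round(m, 1), round(s, 3))       # 560 85.16, matching the summary rows
print(round((596 - m) / s, 3))        # 0.423, matching z(GRE) for the first row
```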

Clustering Approaches
• Partitional Clustering
  • Goal is to partition a dataset containing n objects into k clusters
  • Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  • Global optimum: exhaustively enumerate all partitions
  • Heuristic methods:
    • k-means (MacQueen 1967) – Each cluster is represented by a calculated centroid
    • k-medoids (Kaufman and Rousseeuw 1987) – Each cluster is represented by one of the objects in the cluster. Also known as partitioning around medoids (PAM)
• Hierarchical Clustering
  • Goal is to identify the hierarchies among the n objects in the dataset such that they can be represented in a nested tree structure

k-Means Clustering


k-Means Algorithm

The k-means algorithm is implemented in the following steps:


1. Select the desired number of clusters k
2. Select k initial seeds (often chosen at random)
3. Calculate average cluster values (cluster centroids) over each variable (for the
initial iteration, this will simply be the initial seeds)
4. Assign each of the other observations to the cluster with the nearest centroid
5. Recalculate cluster centroids (averages) based on the assignments from step 4
6. Iterate between steps 4 and 5, stop when there are no more new assignments

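The six steps can be sketched in a few lines of plain Python (a minimal sketch, assuming 2-D points as tuples; the function and variable names are mine, not from the slides):

```python
from math import dist  # Euclidean distance between two points

def k_means(points, seeds, max_iter=100):
    centroids = [tuple(s) for s in seeds]   # steps 1-3: k seeds as centroids
    assignment = None
    for _ in range(max_iter):
        # step 4: assign each observation to the nearest centroid
        new_assignment = [min(range(len(centroids)),
                              key=lambda j: dist(p, centroids[j]))
                          for p in points]
        if new_assignment == assignment:    # step 6: stop when nothing moves
            break
        assignment = new_assignment
        # step 5: recalculate centroids as per-cluster averages
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, assignment
```

For example, `k_means([(0, 0), (0, 1), (10, 10), (10, 11)], seeds=[(0, 0), (10, 10)])` converges in one pass to centroids (0, 0.5) and (10, 10.5).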

k-Means Visualized
• Select two seeds as the initial centroids
• Assign each observation to the closest centroid
• Recalculate cluster centroids
• Rinse and repeat

k-Means Clustering Example

Plot the data

  Obs  Age    Income
  1    0.550  0.175
  2    0.340  0.250
  3    1.000  1.000
  4    0.930  0.850
  5    0.390  0.200
  6    0.580  0.250

(Scatter plot of Income vs. Age for the six observations.)

k-Means Clustering Example

Pick two initial seeds at random

  Centroid  Age    Income
  C1.1      0.450  0.150
  C1.2      0.600  0.300

k-Means Clustering Example

Calculate distance from centroids

  d(x, y) = √( Σᵢ₌₁ⁿ (xᵢ − yᵢ)² )

  Centroid  Age    Income
  C1.1      0.450  0.150
  C1.2      0.600  0.300

k-Means Clustering Example

Calculate distance from centroids

  d(obs1, C1.1) = √( (Age_obs1 − Age_C1.1)² + (Income_obs1 − Income_C1.1)² )
                = √( (0.550 − 0.450)² + (0.175 − 0.150)² ) = 0.103

  Obs  Age    Income
  1    0.550  0.175

  Centroid  Age    Income
  C1.1      0.450  0.150
  C1.2      0.600  0.300

k-Means Clustering Example

Calculate distance from centroids

  Obs  Age    Income  DistC1.1  DistC1.2
  1    0.550  0.175   0.103     0.135
  2    0.340  0.250   0.149     0.265
  3    1.000  1.000   1.012     0.806
  4    0.930  0.850   0.849     0.641
  5    0.390  0.200   0.078     0.233
  6    0.580  0.250   0.164     0.054

  Centroid  Age    Income
  C1.1      0.450  0.150
  C1.2      0.600  0.300
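The full distance table can be recomputed in a few lines (a sketch, assuming the same seeds as the slide):

```python
from math import dist  # Euclidean distance

obs = {1: (0.550, 0.175), 2: (0.340, 0.250), 3: (1.000, 1.000),
       4: (0.930, 0.850), 5: (0.390, 0.200), 6: (0.580, 0.250)}
c11, c12 = (0.450, 0.150), (0.600, 0.300)  # the initial seeds

# one row per observation: distance to each centroid, then the winner
for i, p in obs.items():
    d1, d2 = round(dist(p, c11), 3), round(dist(p, c12), 3)
    print(i, d1, d2, "-> C1.1" if d1 < d2 else "-> C1.2")
```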

k-Means Clustering Example

Assign observations to clusters based on distance

  Obs  Age    Income  DistC1.1  DistC1.2
  1    0.550  0.175   0.103     0.135
  2    0.340  0.250   0.149     0.265
  3    1.000  1.000   1.012     0.806
  4    0.930  0.850   0.849     0.641
  5    0.390  0.200   0.078     0.233
  6    0.580  0.250   0.164     0.054

Observations 1, 2, and 5 are closest to C1.1; observations 3, 4, and 6 are closest to C1.2.


k-Means Clustering Example

Calculate new cluster centroids

  Obs  Age    Income
  1    0.550  0.175
  2    0.340  0.250
  5    0.390  0.200
  Avg  0.427  0.208

  Centroid  Age    Income
  C2.1      0.427  0.208

k-Means Clustering Example

Calculate new cluster centroids

  Obs  Age    Income
  3    1.000  1.000
  4    0.930  0.850
  6    0.580  0.250
  Avg  0.837  0.700

  Centroid  Age    Income
  C2.2      0.837  0.700

k-Means Clustering Example

Assign observations to clusters based on distance

  Centroid  Age    Income
  C1.1      0.450  0.150
  C1.2      0.600  0.300

  Centroid  Age    Income
  C2.1      0.427  0.208
  C2.2      0.837  0.700

(Plot: the centroids shift from their C1 to their C2 positions.)

k-Means Clustering Example

Assign observations to clusters based on distance

  Obs  Age    Income  DistC2.1  DistC2.2
  1    0.550  0.175   0.128     0.598
  2    0.340  0.250   0.096     0.670
  3    1.000  1.000   0.977     0.342
  4    0.930  0.850   0.816     0.177
  5    0.390  0.200   0.038     0.670
  6    0.580  0.250   0.159     0.518

  Centroid  Age    Income
  C2.1      0.427  0.208
  C2.2      0.837  0.700


k-Means Clustering Example

Calculate new cluster centroids

  Obs  Age    Income
  1    0.550  0.175
  2    0.340  0.250
  5    0.390  0.200
  6    0.580  0.250
  Avg  0.465  0.219

  Centroid  Age    Income
  C3.1      0.465  0.219

k-Means Clustering Example

Calculate new cluster centroids

  Obs  Age    Income
  3    1.000  1.000
  4    0.930  0.850
  Avg  0.965  0.925

  Centroid  Age    Income
  C3.2      0.965  0.925

k-Means Clustering Example

Assign observations to clusters based on distance

  Centroid  Age    Income
  C2.1      0.427  0.208
  C2.2      0.837  0.700

  Centroid  Age    Income
  C3.1      0.465  0.219
  C3.2      0.965  0.925

(Plot: the centroids shift from their C2 to their C3 positions.)

k-Means Clustering Example

Assign observations to clusters based on distance

  Obs  Age    Income  DistC3.1  DistC3.2
  1    0.550  0.175   0.096     0.857
  2    0.340  0.250   0.129     0.920
  3    1.000  1.000   0.947     0.083
  4    0.930  0.850   0.784     0.083
  5    0.390  0.200   0.077     0.925
  6    0.580  0.250   0.119     0.777

  Centroid  Age    Income
  C3.1      0.465  0.219
  C3.2      0.965  0.925

k-Means Clustering Example

Assign observations to clusters based on distance

  Obs  Age    Income  DistC3.1  DistC3.2
  1    0.550  0.175   0.096     0.857
  2    0.340  0.250   0.129     0.920
  3    1.000  1.000   0.947     0.083
  4    0.930  0.850   0.784     0.083
  5    0.390  0.200   0.077     0.925
  6    0.580  0.250   0.119     0.777

No changes in the assignments → we are done!
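The whole worked example can be re-run end to end; a from-scratch sketch (not a library call), assuming the same two seeds the slides picked:

```python
from math import dist

points = [(0.550, 0.175), (0.340, 0.250), (1.000, 1.000),
          (0.930, 0.850), (0.390, 0.200), (0.580, 0.250)]
centroids = [(0.450, 0.150), (0.600, 0.300)]  # the slide's seeds C1.1, C1.2

assignment = None
while True:
    # assign each observation to the nearest of the two centroids
    new = [0 if dist(p, centroids[0]) < dist(p, centroids[1]) else 1
           for p in points]
    if new == assignment:          # no reassignments -> converged
        break
    assignment = new
    # recalculate centroids as per-cluster averages
    centroids = [tuple(sum(c) / len(m) for c in zip(*m))
                 for m in ([p for p, a in zip(points, assignment) if a == j]
                           for j in (0, 1))]

print(centroids)  # ≈ [(0.465, 0.219), (0.965, 0.925)], the slides' C3.1 and C3.2
```

The loop converges after the same two centroid updates the slides show, with observations 1, 2, 5, 6 in one cluster and 3, 4 in the other.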

Interpreting our Clusters

  Centroid  Age    Income
  C3.1      0.465  0.219
  C3.2      0.965  0.925

• So what does all of this mean?
• We generally interpret clusters based on their centroids
• You can think of a centroid as representative of the observations within the cluster

Choosing k

• Visualization

• Natural Groupings – Application Specific

• Data-Driven Approaches


Natural Groupings

(Figures: example natural groupings, e.g., The Hills School employees, and males vs. females.)

Data-Driven Approaches
• Run k-means multiple times with different values of k
• Calculate the within-cluster sum of squared error (SSE)
• Create a scree plot to identify the "optimal" k

(Scree plot: within-cluster SSE on the y-axis vs. k = 1 … 7 on the x-axis. The error drops steeply and then flattens; the bend, marked X, suggests the k to choose.)
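A minimal sketch of this procedure on the toy dataset from the worked example, assuming a deterministic seeding rule (the first k points) so the run is reproducible; the function name is mine:

```python
from math import dist

points = [(0.550, 0.175), (0.340, 0.250), (1.000, 1.000),
          (0.930, 0.850), (0.390, 0.200), (0.580, 0.250)]

def k_means_sse(points, k, iters=20):
    centroids = points[:k]                  # simplistic deterministic seeding
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                    # assign to the nearest centroid
            j = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[j].append(p)
        centroids = [tuple(sum(c) / len(cl) for c in zip(*cl))
                     if cl else centroids[j]  # keep old centroid if empty
                     for j, cl in enumerate(clusters)]
    # within-cluster sum of squared error
    return sum(dist(p, centroids[j]) ** 2
               for j, cl in enumerate(clusters) for p in cl)

for k in range(1, 5):
    print(k, round(k_means_sse(points, k), 3))
```

Plotting SSE against k gives the scree plot: the error always shrinks as k grows, so the point where the decrease levels off (the elbow) is what suggests k.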

Comments on k-Means

Strengths
• k-means is a very flexible algorithm that can be used in a wide variety of contexts
• Efficient: O(tkn), where n is the number of observations, k is the number of clusters, and t is the number of iterations. Normally k, t << n
• Widely available in data mining tools
• Straightforward and easy to understand

Weaknesses
• Applicable only when the mean is defined
• Need to specify k, the number of clusters, in advance
• Unable to handle noisy data and outliers
• Not suitable for discovering clusters with non-convex shapes

Clustering is Difficult!
• What if there are many dimensions?
• What if the variables are not of the same type?
• What if the number of objects is large?
• What if the data has noise or outliers?
• User-specified constraints (e.g., managers might think income should weigh twice as much as age)
• Interpretability and usability of clusters