
Clustering and its Applications

Introduction
◼ Cluster: a collection of data objects.
 Similar to one another within the same cluster.
 Dissimilar to the objects in other clusters.

◼ Cluster Analysis
 Grouping a set of data objects into clusters.

Introduction
◼ Clustering is an unsupervised learning technique:
 No pre-defined classes

Supervised Learning vs. Unsupervised Learning

[Diagram] Unlabelled Data → ML Algorithm → Clusters
Clustering and its Applications
How to do clustering?

K-means Clustering
1) Choose k initial center points (centroids) randomly
2) Assign each data point to its nearest centroid using Euclidean distance
3) Calculate a new centroid for each cluster using only the points within that cluster
4) Re-cluster all data points using the new centroids
a) This step may cause data points to be placed in a different cluster
5) Repeat steps 3 and 4 until no data point moves from one cluster to another in step 4, or
until some other convergence criterion is met (a code sketch of these steps follows below)
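The following is a minimal NumPy sketch of these five steps, written around this deck's four-supplier example; the function name kmeans, the random seed, and the printed outputs are illustrative assumptions, not part of the original slides.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centroids randomly from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 2/4: assign each point to its nearest centroid (Euclidean distance)
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster's points
        # (assumes no cluster becomes empty, which holds for this small example)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: the supplier data (quality index, delivery index) from the worked example below
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)     # e.g. [0 0 1 1]
print(centroids)  # e.g. [[1.5 1. ] [4.5 3.5]]

On this data the algorithm converges to the clusters {A, B} and {C, D} found by hand later in the deck, though the 0/1 labels may be swapped depending on the random draw.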

K-means Clustering
[Flowchart] Start → Select value of k (number of clusters) → Find the centroid of each
cluster → Calculate the distance of each object from each centroid → Group objects based
on minimum distance → No object moved group? If true: End; if false: return to the
centroid step.
Example
Suppose we have several objects (in this case, 4 different suppliers), each with two
attributes or features, as shown in the table below.
Our goal is to group these objects into two groups of suppliers based on the two
features (quality index and delivery index).

Object        Attribute 1 (X)    Attribute 2 (Y)
              Quality Index      Delivery Index
Supplier A    1                  1
Supplier B    2                  1
Supplier C    4                  3
Supplier D    5                  4
Example
Each supplier represents one point with two attributes (X, Y).

[Scatter plot: Attribute 1: Quality Index (X) vs. Attribute 2: Delivery Index (Y), showing points A, B, C, D]

      A  B  C  D
X     1  2  4  5
Y     1  1  3  4
Iteration 0

Step 1: Choose k initial center points randomly
Suppose supplier A and supplier B are chosen as the two initial center points, i.e., centroids.
Let C1 and C2 denote the centroids; then C1 = A = (1, 1) and C2 = B = (2, 1).

[Scatter plot: Quality Index (X) vs. Delivery Index (Y), with centroids C1 and C2 at points A and B]
Step 2: Cluster data using Euclidean distance
Calculate the distance from each cluster centroid to each object.
Distance matrix at iteration 0:

        A     B     C      D
D⁰ = [  0     1     3.61   5    ]   C1 = (1, 1)  →  Cluster 1
     [  1     0     2.83   4.24 ]   C2 = (2, 1)  →  Cluster 2

Euclidean distances:
d(A, C1) = √((1 − 1)² + (1 − 1)²) = 0
d(A, C2) = √((1 − 2)² + (1 − 1)²) = 1
d(B, C1) = √((2 − 1)² + (1 − 1)²) = 1
d(B, C2) = √((2 − 2)² + (1 − 1)²) = 0
d(C, C1) = √((4 − 1)² + (3 − 1)²) = 3.61
d(C, C2) = √((4 − 2)² + (3 − 1)²) = 2.83
d(D, C1) = √((5 − 1)² + (4 − 1)²) = 5
d(D, C2) = √((5 − 2)² + (4 − 1)²) = 4.24
Step 2: Cluster data using Euclidean distance
Assign each object to the cluster with the minimum distance.

        A     B     C      D
D⁰ = [  0     1     3.61   5    ]   C1 = (1, 1)  →  Cluster 1
     [  1     0     2.83   4.24 ]   C2 = (2, 1)  →  Cluster 2

        A  B  C  D
G⁰ = [  1  0  0  0 ]   Cluster 1
     [  0  1  1  1 ]   Cluster 2

An element of the group matrix G⁰ is 1 if and only if the object is assigned to that cluster:
Supplier A is assigned to Cluster 1; Suppliers B, C, and D are assigned to Cluster 2.
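For readers following along in Python, the distance matrix D⁰ and the assignment step can be reproduced with NumPy; this short sketch (variable names are illustrative) is not part of the original slides.

import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)  # objects A, B, C, D
C = np.array([[1, 1], [2, 1]], dtype=float)                  # centroids C1, C2

# Pairwise Euclidean distances: rows = centroids, columns = objects
D0 = np.linalg.norm(C[:, None, :] - X[None, :, :], axis=2)
print(D0.round(2))          # [[0.   1.   3.61 5.  ]
                            #  [1.   0.   2.83 4.24]]

# Assignment by minimum distance (0 = Cluster 1, 1 = Cluster 2)
print(D0.argmin(axis=0))    # [0 1 1 1], i.e. A -> Cluster 1; B, C, D -> Cluster 2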
[Scatter plot: clusters after iteration 0 — Cluster 1: {A}; Cluster 2: {B, C, D}]
Iteration 1

Step 3: Calculate new centroids for each cluster using only points within the cluster
Cluster 1 has only one member, so its centroid remains C1 = (1, 1).
Cluster 2 now has three members, so its centroid is the average coordinate of the three
members:

C1 = (1, 1)
C2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3)
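Continuing the illustrative NumPy sketch (not from the slides), the centroid update is simply a per-cluster mean:

import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)
labels = np.array([0, 1, 1, 1])          # assignments from iteration 0

C1 = X[labels == 0].mean(axis=0)         # [1. 1.]
C2 = X[labels == 1].mean(axis=0)         # [3.6667 2.6667], i.e. (11/3, 8/3)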
[Scatter plot: new centroid C2 = (11/3, 8/3) plotted among points A, B, C, D; C1 remains at A]
Step 4: Re-Cluster all data using the new centroids.
Compute the distance of all objects to the new centroids.

        A     B     C     D
D¹ = [  0     1     3.61  5    ]   C1 = (1, 1)       →  Cluster 1
     [  3.14  2.36  0.47  1.89 ]   C2 = (11/3, 8/3)  →  Cluster 2
Step 4: Re-Cluster all data using the new centroids.
Assign each object to the cluster with the minimum distance.

        A     B     C     D
D¹ = [  0     1     3.61  5    ]   C1 = (1, 1)       →  Cluster 1
     [  3.14  2.36  0.47  1.89 ]   C2 = (11/3, 8/3)  →  Cluster 2

        A  B  C  D
G¹ = [  1  1  0  0 ]   Cluster 1
     [  0  0  1  1 ]   Cluster 2

Based on the new distance matrix, supplier B moves to Cluster 1, while all other objects
remain in the same cluster as before.
[Scatter plot: clusters after iteration 1 — Cluster 1: {A, B}; Cluster 2: {C, D}]
Iteration 2

Step 3: Calculate new centroids for each cluster using only points within the cluster.
Calculate the new centroids based on the clustering of the previous iteration. Cluster 1
and Cluster 2 both have two members; thus the new centroids are:

C1 = ((1 + 2)/2, (1 + 1)/2) = (3/2, 1)
C2 = ((4 + 5)/2, (3 + 4)/2) = (9/2, 7/2)
[Scatter plot: new centroids C1 = (3/2, 1) and C2 = (9/2, 7/2) among points A, B, C, D]
Compute the distance of all objects to the new centroids.

        A     B     C     D
D² = [  0.5   0.5   3.20  4.61 ]   C1 = (3/2, 1)    →  Cluster 1
     [  4.30  3.54  0.71  0.71 ]   C2 = (9/2, 7/2)  →  Cluster 2
Assign each object to the cluster with the minimum distance.

        A     B     C     D
D² = [  0.5   0.5   3.20  4.61 ]   C1 = (3/2, 1)    →  Cluster 1
     [  4.30  3.54  0.71  0.71 ]   C2 = (9/2, 7/2)  →  Cluster 2

        A  B  C  D
G² = [  1  1  0  0 ]   Cluster 1
     [  0  0  1  1 ]   Cluster 2
 Comparing the clustering of the last iteration with this iteration reveals that no object
moves to a different cluster anymore (G² = G¹).

 Thus, the k-means clustering has converged and no more iterations are needed. We get the
final clustering as follows:

Object        Attribute 1 (X)    Attribute 2 (Y)    Cluster
              Quality Index      Delivery Index
Supplier A    1                  1                  1
Supplier B    2                  1                  1
Supplier C    4                  3                  2
Supplier D    5                  4                  2
Final Clusters

[Scatter plot: Cluster 1 = {A, B} with centroid (3/2, 1); Cluster 2 = {C, D} with centroid (9/2, 7/2)]
Clustering using Python

Library      Use
Pandas       Data manipulation and analysis library
NumPy        Numerical Python: mathematical and logical operations library
matplotlib   Plotting library for the Python programming language
Seaborn      Data visualization library based on matplotlib. It provides a high-level
             interface for drawing attractive and informative statistical graphics.
Clustering using Python

#Store and read the data
import pandas as pd
sup_df = pd.read_csv("supplier.csv")
sup_df

#Plot the data
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
sn.lmplot(x="quality_index", y="delivery_index", data=sup_df, fit_reg=False, height=4);
plt.title("Quality and delivery index data of suppliers");
Clustering using Python

Library/Parameter   Use
sklearn             Scikit-learn is a free machine learning library for Python. It
                    features various algorithms like k-means clustering, decision trees,
                    random forests, logistic regression, etc.
markers             List of recognised markers available here:
                    https://matplotlib.org/3.3.1/api/markers_api.html
hue                 Determines which column in the data frame should be used for colour
                    encoding
List of Recognised Markers

List of recognised markers available here:
https://matplotlib.org/3.3.1/api/markers_api.html
Clustering using Python
#Selecting the features
new_sup_df = sup_df[["quality_index", "delivery_index"]].copy()  # .copy() avoids pandas' SettingWithCopyWarning below
new_sup_df

#K-means Clustering
from sklearn.cluster import KMeans
clusters_new = KMeans(n_clusters=2)
clusters_new.fit(new_sup_df)
new_sup_df["clusterid"] = clusters_new.labels_
new_sup_df

#Plot the clusters
import seaborn as sn
markers = ['+', '^']
sn.lmplot(x="quality_index", y="delivery_index", data=new_sup_df, hue="clusterid", fit_reg=False, markers=markers, height=4);
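As a sanity check against the hand computation earlier in this deck, the fitted centroids can be inspected through scikit-learn's cluster_centers_ attribute; note that the row order of the centroids (and hence the 0/1 cluster ids) can vary between runs.

print(clusters_new.cluster_centers_)
# Expected, up to row order: [[1.5 1. ]
#                             [4.5 3.5]]   i.e. (3/2, 1) and (9/2, 7/2)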
Case Study

Case Study
As a supply chain manager, you know the exact locations (i.e., latitude and longitude) of
your 947 demand points. You need to decide how many distribution centers to open, and where
to locate them, so that you maximize responsiveness while minimizing cost (fixed + variable).

Clustering using Python
Slno Location latitude longitude
1 Demand Point 1 12.656788 77.51487
2 Demand Point 2 12.665295 77.50966
3 Demand Point 3 12.672858 77.50613
4 Demand Point 4 12.672949 77.4665
5 Demand Point 5 12.678273 77.50663
6 Demand Point 6 12.678928 77.47567
7 Demand Point 7 12.679721 77.4709
---- ---- ----
---- ---- ----
947 Demand Point 947 13.544134 77.51294

Clustering using Python

import pandas as pd
location_df = pd.read_csv("LogLatData.csv")
location_df

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
sn.lmplot(x="latitude", y="longitude", data=location_df, fit_reg=False, height=4);
plt.title("Longitude and latitude data of locations");
Clustering using Python
#Selecting the features
new_location_df = location_df[["latitude", "longitude"]].copy()
new_location_df

#K-means Clustering with k = 3
from sklearn.cluster import KMeans
clusters_new = KMeans(n_clusters=3)
clusters_new.fit(new_location_df)
new_location_df["clusterid"] = clusters_new.labels_
new_location_df

#Plot the clusters
import seaborn as sn
markers = ['+', '^', '.']
sn.lmplot(x="latitude", y="longitude", data=new_location_df, hue="clusterid", fit_reg=False, markers=markers, height=4);
Clustering using Python
#Selecting the features
new_location_df = location_df[["latitude", "longitude"]].copy()
new_location_df

#K-means Clustering with k = 4
from sklearn.cluster import KMeans
clusters_new = KMeans(n_clusters=4)
clusters_new.fit(new_location_df)
new_location_df["clusterid"] = clusters_new.labels_

#Plot the clusters
import seaborn as sn
markers = ['+', '^', '.', '*']
sn.lmplot(x="latitude", y="longitude", data=new_location_df, hue="clusterid", fit_reg=False, markers=markers, height=4);
Clustering using Python
#Selecting the features
new_location_df = location_df[["latitude", "longitude"]].copy()
new_location_df

#K-means Clustering with k = 5
from sklearn.cluster import KMeans
clusters_new = KMeans(n_clusters=5)
clusters_new.fit(new_location_df)
new_location_df["clusterid"] = clusters_new.labels_

#Plot the clusters
import seaborn as sn
markers = ['+', '^', '.', '*', '>']
sn.lmplot(x="latitude", y="longitude", data=new_location_df, hue="clusterid", fit_reg=False, markers=markers, height=4);
Finding Optimal Number of Clusters

Finding Optimal Number of Clusters

Within-cluster sum of squared errors (WCSS) for a single cluster S1:

WCSS(S1) = Σ_{x ∈ S1} ‖x − μ1‖²

K-means seeks the assignment of objects to k clusters that minimizes the total WCSS:

min_S Σ_{i=1}^{k} Σ_{x ∈ Si} ‖x − μi‖²

❖ If we increase the value of k, then the within-cluster sum of squared errors decreases.
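As an illustration (not from the slides), the WCSS for a given assignment can be computed directly; scikit-learn exposes the same quantity as KMeans.inertia_, which the elbow code later in this deck relies on.

import numpy as np

def wcss(X, labels, centroids):
    # Sum of squared Euclidean distances of each point to its cluster centroid
    return sum(((X[labels == i] - c) ** 2).sum() for i, c in enumerate(centroids))

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.5, 1.0], [4.5, 3.5]])
print(wcss(X, labels, centroids))   # 1.5 for the final supplier clustering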
Finding Optimal Number of Clusters using Elbow Method
◼ If we assume all the objects belong to only one cluster, then the variance of that cluster
will be the highest.
◼ As we increase the number of clusters, the total variance of all clusters will start reducing.
◼ The total variance will be zero if we assume each object is a cluster by itself.
◼ The elbow curve method considers the percentage of variance explained as a function of the
number of clusters.
◼ The optimum number of clusters is chosen in such a way that adding another cluster does not
change the variance explained significantly.
◼ For a set of records (X1, X2, …, Xn), where each observation is a d-dimensional real vector,
the k-means clustering algorithm segments the observations into k (≤ n) sets {S1, S2, …, Sk}
to minimize the within-cluster sum of squares:

min_S Σ_{i=1}^{k} Σ_{x ∈ Si} ‖x − μi‖²
Finding Optimal Number of Clusters using Elbow Method

min_S Σ_{i=1}^{k} Σ_{x ∈ Si} ‖x − μi‖²

◼ The initial increase in the number of clusters adds much information, but at some point the
marginal gain drops, giving an angle in the graph (similar to an elbow).
◼ The number of clusters indicated at this angle can be chosen as the most appropriate number
of clusters.
◼ Choosing the number of clusters in this way is called the "elbow criterion".
Clustering using Python
#Determining number of clusters
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

cluster_range = range(1, 10)
cluster_errors = []
for num_clusters in cluster_range:
    clusters = KMeans(n_clusters=num_clusters)
    clusters.fit(new_location_df)
    cluster_errors.append(clusters.inertia_)  # within-cluster sum of squares (WCSS)

plt.figure(figsize=(6, 4))
plt.plot(cluster_range, cluster_errors, marker="o")
plt.title('Elbow Diagram')
plt.xlabel('Number of Clusters')
plt.ylabel('Sum of Squares Error');

#Exporting the output to Excel
location_df.to_excel(r'C:\Users\Admin\Desktop\aftercluster.xlsx')
Hierarchical clustering

Hierarchical clustering
◼ Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of
cluster analysis which seeks to build a hierarchy of clusters.
◼ Strategies for hierarchical clustering generally fall into two types:
 Agglomerative: This is a "bottom-up" approach: each observation starts in its own
cluster, and pairs of clusters are merged as one moves up the hierarchy.
 Divisive: This is a "top-down" approach: all observations start in one cluster, and
splits are performed recursively as one moves down the hierarchy.

Hierarchical clustering: Agglomerative approach

Initialization: Each object is a cluster.
Iteration: Merge the two clusters that are most similar to each other,
until all objects are merged into a single cluster.

(Bottom-Up)
Hierarchical clustering: Divisive approach

Initialization: All objects stay in one cluster.
Iteration: Select a cluster and split it into two sub-clusters,
until each leaf cluster contains only one object.

(Top-Down)
Hierarchical clustering: Dendrogram
◼ A tree that shows how clusters are merged/split hierarchically
◼ Each node on the tree is a cluster; each leaf node is a singleton cluster

◼ A clustering of the data objects is obtained by cutting the dendrogram at the desired
level; each connected component then forms a cluster.
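To make the dendrogram discussion concrete, here is a minimal sketch (not from the slides) that builds and cuts a dendrogram for the four-supplier data with SciPy; the choice of Ward linkage is an assumption made for illustration.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)   # suppliers A-D

Z = linkage(X, method='ward')            # bottom-up (agglomerative) merge tree
dendrogram(Z, labels=['A', 'B', 'C', 'D'])
plt.title('Dendrogram of the supplier data')
plt.show()

# "Cutting" the dendrogram into 2 clusters
print(fcluster(Z, t=2, criterion='maxclust'))   # e.g. [1 1 2 2]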
Hierarchical Clustering
Example: Suppliers Data

Object        Attribute 1 (X)    Attribute 2 (Y)    Cluster
              Quality Index      Delivery Index
Supplier A    1                  1                  1
Supplier B    2                  1                  1
Supplier C    4                  3                  2
Supplier D    5                  4                  2
Hierarchical Clustering
Example: Demand Data

K-Modes Clustering

CustomerID CreditHistory Balance_Savings Employment Maritalstatus Job
1 critical unknown over-seven Single Unskilled
2 all-paid-duly less100DM four-years female-divorced skilled
3 critical less100DM seven-years Single Unskilled
4 all-paid-duly less100DM seven-years Single skilled
5 delay less100DM four-years Single skilled
6 all-paid-duly unknown four-years Single Unskilled
7 all-paid-duly Between 500 and 1000 DM over-seven Single skilled
8 all-paid-duly less100DM four-years Single management
9 all-paid-duly over1000DM seven-years male-divorced Unskilled
10 critical less100DM unemployed Married management
11 all-paid-duly less100DM one-year female-divorced skilled
12 all-paid-duly less100DM one-year female-divorced skilled
13 all-paid-duly less100DM four-years female-divorced skilled
14 critical less100DM over-seven Single Unskilled
15 all-paid-duly less100DM four-years female-divorced skilled
16 all-paid-duly Between 100 and 500 DM four-years female-divorced Unskilled
17 critical unknown over-seven Single skilled
18 all-paid-duly unknown one-year Single skilled
19 all-paid-duly less100DM over-seven female-divorced management
20 all-paid-duly Between 500 and 1000 DM over-seven Single skilled
21 critical less100DM four-years Single skilled
22 all-paid-duly Between 500 and 1000 DM four-years Single skilled
23 critical less100DM one-year Single Unskilled
24 critical Between 100 and 500 DM one-year Single skilled
25 critical unknown four-years Married skilled
26 all-paid-duly less100DM four-years Single Unskilled
27 all-paid-duly less100DM over-seven Married Unskilled
28 bank-paid-duly over1000DM four-years female-divorced skilled
29 all-paid-duly less100DM four-years Single skilled
30 delay less100DM over-seven Single skilled
ID CreditHistory Balance_Savings Employment Maritalstatus Job Cluster
3 all-paid-duly less100DM seven-years Single skilled First
4 delay less100DM four-years Single skilled First
5 all-paid-duly unknown four-years Single Unskilled First
6 all-paid-duly Between 500 and 1000 DM over-seven Single skilled First
7 all-paid-duly less100DM four-years Single management First
8 all-paid-duly over1000DM seven-years male-divorced Unskilled First
17 all-paid-duly unknown one-year Single skilled First
19 all-paid-duly Between 500 and 1000 DM over-seven Single skilled First
20 critical less100DM four-years Single skilled First
21 all-paid-duly Between 500 and 1000 DM four-years Single skilled First
23 critical Between 100 and 500 DM one-year Single skilled First
24 critical unknown four-years Married skilled First
25 all-paid-duly less100DM four-years Single Unskilled First
28 all-paid-duly less100DM four-years Single skilled First
29 delay less100DM over-seven Single skilled First
0 critical unknown over-seven Single Unskilled Second
2 critical less100DM seven-years Single Unskilled Second
9 critical less100DM unemployed Married management Second
13 critical less100DM over-seven Single Unskilled Second
16 critical unknown over-seven Single skilled Second
22 critical less100DM one-year Single Unskilled Second
26 all-paid-duly less100DM over-seven Married Unskilled Second
1 all-paid-duly less100DM four-years female-divorced skilled Third
10 all-paid-duly less100DM one-year female-divorced skilled Third
11 all-paid-duly less100DM one-year female-divorced skilled Third
12 all-paid-duly less100DM four-years female-divorced skilled Third
14 all-paid-duly less100DM four-years female-divorced skilled Third
15 all-paid-duly Between 100 and 500 DM four-years female-divorced Unskilled Third
18 all-paid-duly less100DM over-seven female-divorced management Third
27 bank-paid-duly over1000DM four-years female-divorced skilled Third
K-Modes Clustering
◼ K-Modes clustering was first introduced by Huang (1998).
◼ K-Modes is used for categorical data.
◼ The distance between two data points X and Y is the number of attributes on which X and Y
take different values (a simple dissimilarity measure), formally defined as follows:

d1(X, Y) = Σ_{i=1}^{n} δ(xi, yi)

δ(xi, yi) = 0 if xi = yi
δ(xi, yi) = 1 if xi ≠ yi

where xi is the value of the ith attribute of X, yi is the value of the ith attribute of Y,
and n is the number of attributes.
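A minimal sketch of this dissimilarity measure in plain Python (the function name and the two sample records, taken from the customer table above, are illustrative):

def simple_dissimilarity(x, y):
    # Number of attributes on which the two records disagree
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

a = ('critical', 'unknown', 'over-seven', 'Single', 'Unskilled')                 # customer 1
b = ('all-paid-duly', 'less100DM', 'four-years', 'female-divorced', 'skilled')   # customer 2
print(simple_dissimilarity(a, b))   # 5 -- the records differ on all five attributes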

Steps for K-Modes
1) Select the k initial modes.
2) Allocate each observation to the closest cluster based on the simple dissimilarity measure.
Update the cluster's mode after each allocation.
3) After all the observations have been allocated to a cluster, check the dissimilarity of each
observation against the current modes. If the closest mode for an observation belongs to another
cluster, move the observation to that cluster and update the modes of both clusters.
4) Repeat step 3 until no observation changes clusters. (A sketch using a third-party library
follows below.)
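In practice these steps are implemented by the third-party kmodes package (not used in the original slides; install with pip install kmodes). The sketch below assumes the customer table shown earlier has been saved to a hypothetical customers.csv:

import pandas as pd
from kmodes.kmodes import KModes   # third-party package: pip install kmodes

cust_df = pd.read_csv("customers.csv")   # assumed file mirroring the customer table
features = cust_df[["CreditHistory", "Balance_Savings", "Employment",
                    "Maritalstatus", "Job"]]

km = KModes(n_clusters=3, init='Huang', n_init=5)
cust_df["cluster"] = km.fit_predict(features)
print(km.cluster_centroids_)   # the mode (most frequent value) of each attribute per cluster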

How to choose optimal number of clusters?
◼ To determine the optimal number of clusters, the elbow method is used again, but modified to
use the within-cluster difference (WCD). Plotting the WCD for various values of k, the elbow
principle takes the value of k at the point beyond which the WCD does not decrease
significantly as k increases.

WCD = Σ_{i=1}^{k} Σ_{x ∈ Si} d1(x, ci)

where k is the number of clusters, {S1, S2, …, Sk} is the set of clusters, ci is the centroid
(mode) of cluster Si, and d1 is the simple dissimilarity measure:

d1(X, Y) = Σ_{i=1}^{n} δ(xi, yi)
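Assuming the kmodes package from the previous sketch, its cost_ attribute is the sum of dissimilarities of all points to their cluster modes, i.e. the WCD above, so an elbow plot can be built the same way as for k-means:

import matplotlib.pyplot as plt
from kmodes.kmodes import KModes

cluster_range = range(1, 8)
costs = []
for k in cluster_range:
    km = KModes(n_clusters=k, init='Huang', n_init=5)
    km.fit(features)             # 'features' DataFrame from the previous sketch
    costs.append(km.cost_)       # within-cluster difference (WCD)

plt.plot(cluster_range, costs, marker="o")
plt.xlabel('Number of Clusters')
plt.ylabel('Within-Cluster Difference (WCD)');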
Data Set (ML Repository)
https://archive.ics.uci.edu/ml/index.php

Thank You
