
Clustering and its Applications

Introduction
◼ Cluster: a collection of data objects.
 Similar to one another within the same cluster.
 Dissimilar to the objects in other clusters.

◼ Cluster Analysis
 Grouping a set of data objects into clusters.

Introduction
◼ Clustering is an unsupervised learning technique:
 No pre-defined classes

Supervised Learning vs. Unsupervised Learning

[Diagram] Unlabelled Data → ML Algorithm → Clusters
Clustering and its Applications
How to do clustering?

K-means Clustering
1) Choose k initial center points (centroids) randomly
2) Assign each data point to its nearest centroid using Euclidean distance
3) Calculate a new centroid for each cluster using only the points within that cluster
4) Re-cluster all data points using the new centroids
a) This step may cause data points to be placed in a different cluster
5) Repeat steps 3 and 4 until no data point moves from one cluster to another in step 4, or
until some other convergence criterion is met (a code sketch of these steps follows below)
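The following is a minimal NumPy sketch of these five steps, written around this deck's four-supplier example; the function name kmeans, the random seed, and the printed outputs are illustrative assumptions, not part of the original slides.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centroids randomly from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 2/4: assign each point to its nearest centroid (Euclidean distance)
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster's points
        # (assumes no cluster becomes empty, which holds for this small example)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: the supplier data (quality index, delivery index) from the worked example below
X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)     # e.g. [0 0 1 1]
print(centroids)  # e.g. [[1.5 1. ] [4.5 3.5]]

On this data the algorithm converges to the clusters {A, B} and {C, D} found by hand later in the deck, though the 0/1 labels may be swapped depending on the random draw.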

K-means Clustering
[Flowchart] Start → Select value of k (number of clusters) → Find the centroid of each
cluster → Calculate the distance of each object from each centroid → Group objects based
on minimum distance → No object moved group? If true: End; if false: return to the
centroid step.
Example
Suppose we have several objects (in this case, 4 different suppliers), each with two
attributes or features, as shown in the table below.
Our goal is to group these objects into two groups of suppliers based on the two
features (quality index and delivery index).

Object        Attribute 1 (X)    Attribute 2 (Y)
              Quality Index      Delivery Index
Supplier A    1                  1
Supplier B    2                  1
Supplier C    4                  3
Supplier D    5                  4
Example
Each supplier represents one point with two attributes (X, Y).

[Scatter plot: Attribute 1: Quality Index (X) vs. Attribute 2: Delivery Index (Y), showing points A, B, C, D]

      A  B  C  D
X     1  2  4  5
Y     1  1  3  4
Iteration 0

Step 1: Choose k initial center points randomly
Suppose supplier A and supplier B are chosen as the two initial center points, i.e., centroids.
Let C1 and C2 denote the centroids; then C1 = A = (1, 1) and C2 = B = (2, 1).

[Scatter plot: Quality Index (X) vs. Delivery Index (Y), with centroids C1 and C2 at points A and B]
Step 2: Cluster data using Euclidean distance
Calculate the distance from each cluster centroid to each object.
Distance matrix at iteration 0:

        A     B     C      D
D⁰ = [  0     1     3.61   5    ]   C1 = (1, 1)  →  Cluster 1
     [  1     0     2.83   4.24 ]   C2 = (2, 1)  →  Cluster 2

Euclidean distances:
d(A, C1) = √((1 − 1)² + (1 − 1)²) = 0
d(A, C2) = √((1 − 2)² + (1 − 1)²) = 1
d(B, C1) = √((2 − 1)² + (1 − 1)²) = 1
d(B, C2) = √((2 − 2)² + (1 − 1)²) = 0
d(C, C1) = √((4 − 1)² + (3 − 1)²) = 3.61
d(C, C2) = √((4 − 2)² + (3 − 1)²) = 2.83
d(D, C1) = √((5 − 1)² + (4 − 1)²) = 5
d(D, C2) = √((5 − 2)² + (4 − 1)²) = 4.24
Step 2: Cluster data using Euclidean distance
Assign each object to the cluster with the minimum distance.

        A     B     C      D
D⁰ = [  0     1     3.61   5    ]   C1 = (1, 1)  →  Cluster 1
     [  1     0     2.83   4.24 ]   C2 = (2, 1)  →  Cluster 2

        A  B  C  D
G⁰ = [  1  0  0  0 ]   Cluster 1
     [  0  1  1  1 ]   Cluster 2

An element of the group matrix G⁰ is 1 if and only if the object is assigned to that cluster:
Supplier A is assigned to Cluster 1; Suppliers B, C, and D are assigned to Cluster 2.
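For readers following along in Python, the distance matrix D⁰ and the assignment step can be reproduced with NumPy; this short sketch (variable names are illustrative) is not part of the original slides.

import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)  # objects A, B, C, D
C = np.array([[1, 1], [2, 1]], dtype=float)                  # centroids C1, C2

# Pairwise Euclidean distances: rows = centroids, columns = objects
D0 = np.linalg.norm(C[:, None, :] - X[None, :, :], axis=2)
print(D0.round(2))          # [[0.   1.   3.61 5.  ]
                            #  [1.   0.   2.83 4.24]]

# Assignment by minimum distance (0 = Cluster 1, 1 = Cluster 2)
print(D0.argmin(axis=0))    # [0 1 1 1], i.e. A -> Cluster 1; B, C, D -> Cluster 2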
[Scatter plot: clusters after iteration 0 — Cluster 1: {A}; Cluster 2: {B, C, D}]
Iteration 1

Step 3: Calculate new centroids for each cluster using only points within the cluster
Cluster 1 has only one member, so its centroid remains C1 = (1, 1).
Cluster 2 now has three members, so its centroid is the average coordinate of the three
members:

C1 = (1, 1)
C2 = ((2 + 4 + 5)/3, (1 + 3 + 4)/3) = (11/3, 8/3)
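Continuing the illustrative NumPy sketch (not from the slides), the centroid update is simply a per-cluster mean:

import numpy as np

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)
labels = np.array([0, 1, 1, 1])          # assignments from iteration 0

C1 = X[labels == 0].mean(axis=0)         # [1. 1.]
C2 = X[labels == 1].mean(axis=0)         # [3.6667 2.6667], i.e. (11/3, 8/3)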
[Scatter plot: new centroid C2 = (11/3, 8/3) plotted among points A, B, C, D; C1 remains at A]
Step 4: Re-Cluster all data using the new centroids.
Compute the distance of all objects to the new centroids.

        A     B     C     D
D¹ = [  0     1     3.61  5    ]   C1 = (1, 1)       →  Cluster 1
     [  3.14  2.36  0.47  1.89 ]   C2 = (11/3, 8/3)  →  Cluster 2
Step 4: Re-Cluster all data using the new centroids.
Assign each object to the cluster with the minimum distance.

        A     B     C     D
D¹ = [  0     1     3.61  5    ]   C1 = (1, 1)       →  Cluster 1
     [  3.14  2.36  0.47  1.89 ]   C2 = (11/3, 8/3)  →  Cluster 2

        A  B  C  D
G¹ = [  1  1  0  0 ]   Cluster 1
     [  0  0  1  1 ]   Cluster 2

Based on the new distance matrix, supplier B moves to Cluster 1, while all other objects
remain in the same cluster as before.
[Scatter plot: clusters after iteration 1 — Cluster 1: {A, B}; Cluster 2: {C, D}]
Iteration 2

Step 3: Calculate new centroids for each cluster using only points within the cluster.
Calculate the new centroids based on the clustering of the previous iteration. Cluster 1
and Cluster 2 both have two members; thus the new centroids are:

C1 = ((1 + 2)/2, (1 + 1)/2) = (3/2, 1)
C2 = ((4 + 5)/2, (3 + 4)/2) = (9/2, 7/2)
[Scatter plot: new centroids C1 = (3/2, 1) and C2 = (9/2, 7/2) among points A, B, C, D]
Compute the distance of all objects to the new centroids.

        A     B     C     D
D² = [  0.5   0.5   3.20  4.61 ]   C1 = (3/2, 1)    →  Cluster 1
     [  4.30  3.54  0.71  0.71 ]   C2 = (9/2, 7/2)  →  Cluster 2
Assign each object to the cluster with the minimum distance.

        A     B     C     D
D² = [  0.5   0.5   3.20  4.61 ]   C1 = (3/2, 1)    →  Cluster 1
     [  4.30  3.54  0.71  0.71 ]   C2 = (9/2, 7/2)  →  Cluster 2

        A  B  C  D
G² = [  1  1  0  0 ]   Cluster 1
     [  0  0  1  1 ]   Cluster 2
 Comparing the clustering of the last iteration with this iteration reveals that no object
moves to a different cluster anymore (G² = G¹).

 Thus, the k-means clustering has converged and no more iterations are needed. We get the
final clustering as follows:

Object        Attribute 1 (X)    Attribute 2 (Y)    Cluster
              Quality Index      Delivery Index
Supplier A    1                  1                  1
Supplier B    2                  1                  1
Supplier C    4                  3                  2
Supplier D    5                  4                  2
Final Clusters

[Scatter plot: Cluster 1 = {A, B} with centroid (3/2, 1); Cluster 2 = {C, D} with centroid (9/2, 7/2)]
Clustering using Python

Library      Use
Pandas       Data manipulation and analysis library
NumPy        Numerical Python: mathematical and logical operations library
matplotlib   Plotting library for the Python programming language
Seaborn      Data visualization library based on matplotlib. It provides a high-level
             interface for drawing attractive and informative statistical graphics.
Clustering using Python

#Store and read the data
import pandas as pd
sup_df = pd.read_csv("supplier.csv")
sup_df

#Plot the data
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
sn.lmplot(x="quality_index", y="delivery_index", data=sup_df, fit_reg=False, height=4);
plt.title("Quality and delivery index data of suppliers");
Clustering using Python

Library/Parameter   Use
sklearn             Scikit-learn is a free machine learning library for Python. It
                    features various algorithms like k-means clustering, decision trees,
                    random forests, logistic regression, etc.
markers             List of recognised markers available here:
                    https://matplotlib.org/3.3.1/api/markers_api.html
hue                 Determines which column in the data frame should be used for colour
                    encoding
List of Recognised Markers

List of recognised markers available here:
https://matplotlib.org/3.3.1/api/markers_api.html
Clustering using Python
#Selecting the features
new_sup_df = sup_df[["quality_index", "delivery_index"]].copy()  # .copy() avoids pandas' SettingWithCopyWarning below
new_sup_df

#K-means Clustering
from sklearn.cluster import KMeans
clusters_new = KMeans(n_clusters=2)
clusters_new.fit(new_sup_df)
new_sup_df["clusterid"] = clusters_new.labels_
new_sup_df

#Plot the clusters
import seaborn as sn
markers = ['+', '^']
sn.lmplot(x="quality_index", y="delivery_index", data=new_sup_df, hue="clusterid", fit_reg=False, markers=markers, height=4);
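As a sanity check against the hand computation earlier in this deck, the fitted centroids can be inspected through scikit-learn's cluster_centers_ attribute; note that the row order of the centroids (and hence the 0/1 cluster ids) can vary between runs.

print(clusters_new.cluster_centers_)
# Expected, up to row order: [[1.5 1. ]
#                             [4.5 3.5]]   i.e. (3/2, 1) and (9/2, 7/2)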
Case Study

Case Study
As a supply chain manager, you know the exact locations (i.e., latitude and longitude) of
your 947 demand points. You need to decide how many distribution centers to open, and where
to locate them, so that you maximize responsiveness while minimizing cost (fixed + variable).

Clustering using Python
Slno Location latitude longitude
1 Demand Point 1 12.656788 77.51487
2 Demand Point 2 12.665295 77.50966
3 Demand Point 3 12.672858 77.50613
4 Demand Point 4 12.672949 77.4665
5 Demand Point 5 12.678273 77.50663
6 Demand Point 6 12.678928 77.47567
7 Demand Point 7 12.679721 77.4709
---- ---- ----
---- ---- ----
947 Demand Point 947 13.544134 77.51294

Clustering using Python

import pandas as pd
location_df = pd.read_csv("LogLatData.csv")
location_df

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
sn.lmplot(x="latitude", y="longitude", data=location_df, fit_reg=False, height=4);
plt.title("Longitude and latitude data of locations");
Clustering using Python
#Selecting the features
new_location_df = location_df[["latitude", "longitude"]].copy()
new_location_df

#K-means Clustering with k = 3
from sklearn.cluster import KMeans
clusters_new = KMeans(n_clusters=3)
clusters_new.fit(new_location_df)
new_location_df["clusterid"] = clusters_new.labels_
new_location_df

#Plot the clusters
import seaborn as sn
markers = ['+', '^', '.']
sn.lmplot(x="latitude", y="longitude", data=new_location_df, hue="clusterid", fit_reg=False, markers=markers, height=4);
Clustering using Python
#Selecting the features
new_location_df = location_df[["latitude", "longitude"]].copy()
new_location_df

#K-means Clustering with k = 4
from sklearn.cluster import KMeans
clusters_new = KMeans(n_clusters=4)
clusters_new.fit(new_location_df)
new_location_df["clusterid"] = clusters_new.labels_

#Plot the clusters
import seaborn as sn
markers = ['+', '^', '.', '*']
sn.lmplot(x="latitude", y="longitude", data=new_location_df, hue="clusterid", fit_reg=False, markers=markers, height=4);
Clustering using Python
#Selecting the features
new_location_df = location_df[["latitude", "longitude"]].copy()
new_location_df

#K-means Clustering with k = 5
from sklearn.cluster import KMeans
clusters_new = KMeans(n_clusters=5)
clusters_new.fit(new_location_df)
new_location_df["clusterid"] = clusters_new.labels_

#Plot the clusters
import seaborn as sn
markers = ['+', '^', '.', '*', '>']
sn.lmplot(x="latitude", y="longitude", data=new_location_df, hue="clusterid", fit_reg=False, markers=markers, height=4);
Finding Optimal Number of Clusters

Finding Optimal Number of Clusters

Within-cluster sum of squared errors (WCSS) for a single cluster S1:

WCSS(S1) = Σ_{x ∈ S1} ‖x − μ1‖²

K-means seeks the assignment of objects to k clusters that minimizes the total WCSS:

min_S Σ_{i=1}^{k} Σ_{x ∈ Si} ‖x − μi‖²

❖ If we increase the value of k, then the within-cluster sum of squared errors decreases.
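As an illustration (not from the slides), the WCSS for a given assignment can be computed directly; scikit-learn exposes the same quantity as KMeans.inertia_, which the elbow code later in this deck relies on.

import numpy as np

def wcss(X, labels, centroids):
    # Sum of squared Euclidean distances of each point to its cluster centroid
    return sum(((X[labels == i] - c) ** 2).sum() for i, c in enumerate(centroids))

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.5, 1.0], [4.5, 3.5]])
print(wcss(X, labels, centroids))   # 1.5 for the final supplier clustering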
Finding Optimal Number of Clusters using Elbow Method
◼ If we assume all the objects belong to only one cluster, then the variance of that cluster
will be the highest.
◼ As we increase the number of clusters, the total variance of all clusters will start reducing.
◼ The total variance will be zero if we assume each object is a cluster by itself.
◼ The elbow curve method considers the percentage of variance explained as a function of the
number of clusters.
◼ The optimum number of clusters is chosen in such a way that adding another cluster does not
change the variance explained significantly.
◼ For a set of records (X1, X2, …, Xn), where each observation is a d-dimensional real vector,
the k-means clustering algorithm segments the observations into k (≤ n) sets {S1, S2, …, Sk}
to minimize the within-cluster sum of squares:

min_S Σ_{i=1}^{k} Σ_{x ∈ Si} ‖x − μi‖²
Finding Optimal Number of Clusters using Elbow Method

min_S Σ_{i=1}^{k} Σ_{x ∈ Si} ‖x − μi‖²

◼ The initial increase in the number of clusters adds much information, but at some point the
marginal gain drops, giving an angle in the graph (similar to an elbow).
◼ The number of clusters indicated at this angle can be chosen as the most appropriate number
of clusters.
◼ Choosing the number of clusters in this way is called the "elbow criterion".
Clustering using Python
#Determining number of clusters
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

cluster_range = range(1, 10)
cluster_errors = []
for num_clusters in cluster_range:
    clusters = KMeans(n_clusters=num_clusters)
    clusters.fit(new_location_df)
    cluster_errors.append(clusters.inertia_)  # within-cluster sum of squares (WCSS)

plt.figure(figsize=(6, 4))
plt.plot(cluster_range, cluster_errors, marker="o")
plt.title('Elbow Diagram')
plt.xlabel('Number of Clusters')
plt.ylabel('Sum of Squares Error');

#Exporting the output to Excel
location_df.to_excel(r'C:\Users\Admin\Desktop\aftercluster.xlsx')
Hierarchical clustering

Hierarchical clustering
◼ Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of
cluster analysis which seeks to build a hierarchy of clusters.
◼ Strategies for hierarchical clustering generally fall into two types:
 Agglomerative: This is a "bottom-up" approach: each observation starts in its own
cluster, and pairs of clusters are merged as one moves up the hierarchy.
 Divisive: This is a "top-down" approach: all observations start in one cluster, and
splits are performed recursively as one moves down the hierarchy.

Hierarchical clustering: Agglomerative approach

Initialization: Each object is a cluster.
Iteration: Merge the two clusters that are most similar to each other,
until all objects are merged into a single cluster.

(Bottom-Up)
Hierarchical clustering: Divisive approach

Initialization: All objects stay in one cluster.
Iteration: Select a cluster and split it into two sub-clusters,
until each leaf cluster contains only one object.

(Top-Down)
Hierarchical clustering: Dendrogram
◼ A tree that shows how clusters are merged/split hierarchically
◼ Each node on the tree is a cluster; each leaf node is a singleton cluster

◼ A clustering of the data objects is obtained by cutting the dendrogram at the desired
level; each connected component then forms a cluster.
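To make the dendrogram discussion concrete, here is a minimal sketch (not from the slides) that builds and cuts a dendrogram for the four-supplier data with SciPy; the choice of Ward linkage is an assumption made for illustration.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)   # suppliers A-D

Z = linkage(X, method='ward')            # bottom-up (agglomerative) merge tree
dendrogram(Z, labels=['A', 'B', 'C', 'D'])
plt.title('Dendrogram of the supplier data')
plt.show()

# "Cutting" the dendrogram into 2 clusters
print(fcluster(Z, t=2, criterion='maxclust'))   # e.g. [1 1 2 2]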
Hierarchical Clustering
Example: Suppliers Data

Object        Attribute 1 (X)    Attribute 2 (Y)    Cluster
              Quality Index      Delivery Index
Supplier A    1                  1                  1
Supplier B    2                  1                  1
Supplier C    4                  3                  2
Supplier D    5                  4                  2
Hierarchical Clustering
Example: Demand Data

K-Modes Clustering

CustomerID CreditHistory Balance_Savings Employment Maritalstatus Job
1 critical unknown over-seven Single Unskilled
2 all-paid-duly less100DM four-years female-divorced skilled
3 critical less100DM seven-years Single Unskilled
4 all-paid-duly less100DM seven-years Single skilled
5 delay less100DM four-years Single skilled
6 all-paid-duly unknown four-years Single Unskilled
7 all-paid-duly Between 500 and 1000 DM over-seven Single skilled
8 all-paid-duly less100DM four-years Single management
9 all-paid-duly over1000DM seven-years male-divorced Unskilled
10 critical less100DM unemployed Married management
11 all-paid-duly less100DM one-year female-divorced skilled
12 all-paid-duly less100DM one-year female-divorced skilled
13 all-paid-duly less100DM four-years female-divorced skilled
14 critical less100DM over-seven Single Unskilled
15 all-paid-duly less100DM four-years female-divorced skilled
16 all-paid-duly Between 100 and 500 DM four-years female-divorced Unskilled
17 critical unknown over-seven Single skilled
18 all-paid-duly unknown one-year Single skilled
19 all-paid-duly less100DM over-seven female-divorced management
20 all-paid-duly Between 500 and 1000 DM over-seven Single skilled
21 critical less100DM four-years Single skilled
22 all-paid-duly Between 500 and 1000 DM four-years Single skilled
23 critical less100DM one-year Single Unskilled
24 critical Between 100 and 500 DM one-year Single skilled
25 critical unknown four-years Married skilled
26 all-paid-duly less100DM four-years Single Unskilled
27 all-paid-duly less100DM over-seven Married Unskilled
28 bank-paid-duly over1000DM four-years female-divorced skilled
29 all-paid-duly less100DM four-years Single skilled
30 delay less100DM over-seven Single skilled
ID CreditHistory Balance_Savings Employment Maritalstatus Job Cluster
3 all-paid-duly less100DM seven-years Single skilled First
4 delay less100DM four-years Single skilled First
5 all-paid-duly unknown four-years Single Unskilled First
6 all-paid-duly Between 500 and 1000 DM over-seven Single skilled First
7 all-paid-duly less100DM four-years Single management First
8 all-paid-duly over1000DM seven-years male-divorced Unskilled First
17 all-paid-duly unknown one-year Single skilled First
19 all-paid-duly Between 500 and 1000 DM over-seven Single skilled First
20 critical less100DM four-years Single skilled First
21 all-paid-duly Between 500 and 1000 DM four-years Single skilled First
23 critical Between 100 and 500 DM one-year Single skilled First
24 critical unknown four-years Married skilled First
25 all-paid-duly less100DM four-years Single Unskilled First
28 all-paid-duly less100DM four-years Single skilled First
29 delay less100DM over-seven Single skilled First
0 critical unknown over-seven Single Unskilled Second
2 critical less100DM seven-years Single Unskilled Second
9 critical less100DM unemployed Married management Second
13 critical less100DM over-seven Single Unskilled Second
16 critical unknown over-seven Single skilled Second
22 critical less100DM one-year Single Unskilled Second
26 all-paid-duly less100DM over-seven Married Unskilled Second
1 all-paid-duly less100DM four-years female-divorced skilled Third
10 all-paid-duly less100DM one-year female-divorced skilled Third
11 all-paid-duly less100DM one-year female-divorced skilled Third
12 all-paid-duly less100DM four-years female-divorced skilled Third
14 all-paid-duly less100DM four-years female-divorced skilled Third
15 all-paid-duly Between 100 and 500 DM four-years female-divorced Unskilled Third
18 all-paid-duly less100DM over-seven female-divorced management Third
27 bank-paid-duly over1000DM four-years female-divorced skilled Third
K-Modes Clustering
◼ K-Modes clustering was first introduced by Huang (1998).
◼ K-Modes is used for categorical data.
◼ The distance between two data points X and Y is the number of attributes on which X and Y
take different values (a simple dissimilarity measure), formally defined as follows:

d1(X, Y) = Σ_{i=1}^{n} δ(xi, yi)

δ(xi, yi) = 0 if xi = yi
δ(xi, yi) = 1 if xi ≠ yi

where xi is the value of the ith attribute of X, yi is the value of the ith attribute of Y,
and n is the number of attributes.
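A minimal sketch of this dissimilarity measure in plain Python (the function name and the two sample records, taken from the customer table above, are illustrative):

def simple_dissimilarity(x, y):
    # Number of attributes on which the two records disagree
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

a = ('critical', 'unknown', 'over-seven', 'Single', 'Unskilled')                 # customer 1
b = ('all-paid-duly', 'less100DM', 'four-years', 'female-divorced', 'skilled')   # customer 2
print(simple_dissimilarity(a, b))   # 5 -- the records differ on all five attributes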

Steps for K-Modes
1) Select the k initial modes.
2) Allocate each observation to the closest cluster based on the simple dissimilarity measure.
Update the cluster's mode after each allocation.
3) After all the observations have been allocated to a cluster, check the dissimilarity of each
observation against the current modes. If the closest mode for an observation belongs to another
cluster, move the observation to that cluster and update the modes of both clusters.
4) Repeat step 3 until no observation changes clusters. (A sketch using a third-party library
follows below.)
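In practice these steps are implemented by the third-party kmodes package (not used in the original slides; install with pip install kmodes). The sketch below assumes the customer table shown earlier has been saved to a hypothetical customers.csv:

import pandas as pd
from kmodes.kmodes import KModes   # third-party package: pip install kmodes

cust_df = pd.read_csv("customers.csv")   # assumed file mirroring the customer table
features = cust_df[["CreditHistory", "Balance_Savings", "Employment",
                    "Maritalstatus", "Job"]]

km = KModes(n_clusters=3, init='Huang', n_init=5)
cust_df["cluster"] = km.fit_predict(features)
print(km.cluster_centroids_)   # the mode (most frequent value) of each attribute per cluster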

How to choose optimal number of clusters?
◼ To determine the optimal number of clusters, the elbow method is used again, but modified to
use the within-cluster difference (WCD). Plotting the WCD for various values of k, the elbow
principle takes the value of k at the point beyond which the WCD does not decrease
significantly as k increases.

WCD = Σ_{i=1}^{k} Σ_{x ∈ Si} d1(x, ci)

where k is the number of clusters, {S1, S2, …, Sk} is the set of clusters, ci is the centroid
(mode) of cluster Si, and d1 is the simple dissimilarity measure:

d1(X, Y) = Σ_{i=1}^{n} δ(xi, yi)
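Assuming the kmodes package from the previous sketch, its cost_ attribute is the sum of dissimilarities of all points to their cluster modes, i.e. the WCD above, so an elbow plot can be built the same way as for k-means:

import matplotlib.pyplot as plt
from kmodes.kmodes import KModes

cluster_range = range(1, 8)
costs = []
for k in cluster_range:
    km = KModes(n_clusters=k, init='Huang', n_init=5)
    km.fit(features)             # 'features' DataFrame from the previous sketch
    costs.append(km.cost_)       # within-cluster difference (WCD)

plt.plot(cluster_range, costs, marker="o")
plt.xlabel('Number of Clusters')
plt.ylabel('Within-Cluster Difference (WCD)');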
Data Set (ML Repository)
https://archive.ics.uci.edu/ml/index.php

Thank You
