K-MEANS CLUSTERING

• What it is
• Python implementation
• Use cases


SCIKIT-LEARN – CHEAT SHEET


WHAT IT IS

• K-means clustering is a simple unsupervised learning algorithm used to group/cluster n objects into k partitions based on certain attributes, where k < n.

• Follows a simple procedure: classify a given data set into a number of clusters, denoted "k," which is fixed beforehand.

• Groups data using a "top-down" approach, since it starts with a predefined number of clusters and assigns every observation to one of them.

• There are no overlaps between groups; each observation is assigned to exactly one cluster.

• The approach is computationally faster and can handle greater numbers of observations than agglomerative hierarchical clustering.

• K-means clustering has uses in search engines, market segmentation, statistics, and even astronomy.


EXAMPLE

Different movie genres appeal to people of different personalities.

To confirm this, construct a plot of movie titles along personality dimensions. There appear to be three clusters:

• Red: conscientious extraverts who like the action and romance genres

• Blue: anxious and open people who like the avant-garde and fantasy genres

• Yellow: introverts with social anxieties who like Japanese animation (otaku culture)

• Movies in the center seem to be general household favorites.


K-MEANS ALGORITHM PROPERTIES

• There are always K clusters.

• There is always at least one item in each cluster.

• The clusters are non-hierarchical and they do not overlap.

• Every member of a cluster is closer to its own cluster's center than to any other cluster's center.


THE K-MEANS ALGORITHM PROCESS

1. The dataset is partitioned into K clusters and the data points are randomly assigned to the clusters.
2. For each data point:
   • Calculate the distance from the data point to each cluster's centroid.
   • If the data point is closest to its own cluster, leave it where it is; otherwise, move it into the closest cluster.
3. Repeat step 2 until a complete pass through all the data points results in no data point moving from one cluster to another. At that point the clusters are stable and the clustering process ends.

Note: the choice of initial partition can greatly affect the final clusters, in terms of inter-cluster and intra-cluster distances. A minimal code sketch of this loop follows.
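The loop above maps directly to code. Below is a minimal from-scratch sketch (NumPy assumed, Euclidean distance, random initial centers; a production version would also guard against empty clusters):

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        """Plain k-means: pick k random points as initial centers, then
        alternate assignment and centroid-update steps until stable."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Assignment step: each point goes to its nearest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each center moves to the mean of its points.
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):
                break  # no centroid moved -> clusters are stable
            centers = new_centers
        return labels, centers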


PROXIMITY BETWEEN CLUSTERS

1. Single linkage – distance between the two nearest patterns in the clusters

2. Complete linkage – distance between the two farthest data patterns

3. Average linkage – average distance over all pairs of patterns, one from each cluster

4. Average group (centroid) linkage – distance between the centroids of the two clusters

All four measures are illustrated in the sketch below.
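A quick illustration of the four measures on two tiny made-up clusters (a sketch with assumed toy points, using Euclidean distance):

    import numpy as np

    a = np.array([[0.0, 0.0], [1.0, 0.0]])   # cluster A
    b = np.array([[3.0, 0.0], [5.0, 0.0]])   # cluster B

    # All pairwise distances between points of the two clusters.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    print(d.min())   # single linkage: nearest pair          -> 2.0
    print(d.max())   # complete linkage: farthest pair       -> 5.0
    print(d.mean())  # average linkage: mean over all pairs  -> 3.5
    print(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))  # centroid linkage -> 3.5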


DISTANCE CALCULATION

• The Euclidean distance function measures the 'as-the-crow-flies' distance. The formula for this distance between a point X = (X1, X2, …, Xn) and a point Y = (Y1, Y2, …, Yn) is:

  d(X, Y) = √((X1 − Y1)² + (X2 − Y2)² + … + (Xn − Yn)²)
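As a one-function sketch in Python (the two sample points are borrowed from the worked example that follows):

    import math

    def euclidean(x, y):
        """'As-the-crow-flies' distance between two equal-length points."""
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

    print(euclidean((2, 10), (5, 8)))  # sqrt(9 + 4) ≈ 3.61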


HOW IT WORKS

Group the data points into 3 clusters. Randomly decide on the starting center points of the 3 clusters.


HOW IT WORKS
Iteration 1

Initial centroids (assigned randomly): c1 = (2, 10), c2 = (5, 8), c3 = (1, 2)

Distances are calculated with the rectilinear (Manhattan) method, and each point is assigned to the cluster with the minimum distance:

Data point    Distance from c1   Distance from c2   Distance from c3   Cluster assignment
A1 (2, 10)            0                  5                  9                  c1
A2 (2, 5)             5                  6                  4                  c3
A3 (8, 4)            12                  7                  9                  c2
A4 (5, 8)             5                  0                 10                  c2
A5 (7, 5)            10                  5                  9                  c2
A6 (6, 4)            10                  5                  7                  c2
A7 (1, 2)             9                 10                  0                  c3
A8 (4, 9)             3                  2                 10                  c2

New centroids, calculated by averaging each cluster's data points:
c1 = mean{A1} = (2, 10)
c2 = mean{A3, A4, A5, A6, A8} = (6, 6)
c3 = mean{A2, A7} = (1.5, 3.5)
HOW IT WORKS
Iteration 2

Centroids adjusted from the previous iteration: c1 = (2, 10), c2 = (6, 6), c3 = (1.5, 3.5)

Distance calculation again uses the rectilinear method, with assignment to the cluster at minimum distance:

Data point    Distance from c1   Distance from c2   Distance from c3   Cluster assignment
A1 (2, 10)            0                  8                  7                  c1
A2 (2, 5)             5                  5                  2                  c3
A3 (8, 4)            12                  4                  7                  c2
A4 (5, 8)             5                  3                  8                  c2
A5 (7, 5)            10                  2                  7                  c2
A6 (6, 4)            10                  2                  5                  c2
A7 (1, 2)             9                  9                  2                  c3
A8 (4, 9)             3                  5                  8                  c1

New centroids, calculated by averaging each cluster's data points:
c1 = mean{A1, A8} = (3, 9.5)
c2 = mean{A3, A4, A5, A6} = (6.5, 5.25)
c3 = mean{A2, A7} = (1.5, 3.5)


HOW IT WORKS
Iteration 3

Centroids adjusted from the previous iteration: c1 = (3, 9.5), c2 = (6.5, 5.25), c3 = (1.5, 3.5)

Data point    Distance from c1   Distance from c2   Distance from c3   Previous cluster   New cluster
A1 (2, 10)            0                  8                  7                 c1                 c1
A2 (2, 5)             5                  5                  2                 c3                 c3
A3 (8, 4)            12                  4                  7                 c2                 c2
A4 (5, 8)             5                  3                  8                 c2                 c2
A5 (7, 5)            10                  2                  7                 c2                 c2
A6 (6, 4)            10                  2                  5                 c2                 c2
A7 (1, 2)             9                  9                  2                 c3                 c3
A8 (4, 9)             3                  5                  8                 c1                 c1

No change in the cluster assignment, hence STOP. Final centroids:
c1 = mean{A1, A8} = (3, 9.5)
c2 = mean{A3, A4, A5, A6} = (6.5, 5.25)
c3 = mean{A2, A7} = (1.5, 3.5)
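The first assignment pass above can be replayed in a few lines of NumPy (a sketch using the Manhattan distance, as in the tables):

    import numpy as np

    # The eight data points and the randomly chosen initial centroids
    # from iteration 1 of the worked example.
    points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
              "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
    centroids = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)

    # One assignment pass with the rectilinear (Manhattan) distance.
    for name, p in points.items():
        d = np.abs(np.array(p, dtype=float) - centroids).sum(axis=1)
        print(f"{name}: distances {d} -> c{d.argmin() + 1}")

Running this reproduces the distance columns and cluster assignments of iteration 1 (e.g. A1: distances [0, 5, 9] -> c1).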


K-MEANS VISUALIZATION

• http://shabal.in/visuals/kmeans/4.html


PRACTICAL APPLICATIONS

• Customer segmentation, along dimensions such as:
  • Pricing segmentation
  • Loyalty
  • Spend behaviour
  • Branch geography
  • Customer needs: needs, channel preferences, service expectations
  • Category: who are these customers? Why are they behaving the way they do?
  • Customer service
  • Customer value in the last 6/12/18/24 months
  • Customer type (individuals and small businesses)
  • Product type (e.g. gas, electricity)
  • Length of relationship
  • Overall consumption
  • Number of complaints

• News article clustering


DISADVANTAGES

• The fixed number of clusters can make it difficult to predict what K should be.

• Does not work well with non-globular clusters. (Globular clusters are very tightly bound, which gives them their spherical shapes and relatively high stellar densities toward their centers.)

• Different initial partitions can result in different final clusters.

• Even outliers become part of some cluster!


WHEN K-MEANS CLUSTERING FAILS

• (-) Real-life data is almost always messy.

• (-) The examples and illustrations we see in our statistics courses are designed to reflect ideal situations that
sadly almost never occur in the real world.

• (+) K-means works best when your data is roughly "spherical," as in a toy data set; see the demonstration below.

• (+) Still one of the most widely used unsupervised machine learning algorithms.
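A short demonstration of the "spherical" caveat, sketched with scikit-learn's make_moons toy data: the two interleaved crescents are decidedly non-globular, and k-means cuts them roughly in half instead of recovering them.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_moons
    from sklearn.metrics import adjusted_rand_score

    # Two interleaved crescents: non-spherical clusters by construction.
    X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

    y_pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # A score near 1.0 would mean perfect recovery of the two moons;
    # k-means lands well below that here.
    print(adjusted_rand_score(y_true, y_pred))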


USE CASES

• Behavioral segmentation
  • Segment by purchase history
  • Segment by activities on application, website, or platform
  • Define personas based on interests
  • Create profiles based on activity monitoring

• Inventory categorization
  • Group inventory by sales activity
  • Group inventory by manufacturing metrics

• Sorting sensor measurements
  • Detect activity types in motion sensors
  • Group images
  • Separate audio
  • Identify groups in health monitoring

• Detecting bots or anomalies
  • Separate valid activity groups from bots
  • Group valid activity to clean up outlier detection


K-MEANS – SCIKIT-LEARN

• KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=None, algorithm='auto')

• Attributes:
  • cluster_centers_ : array, [n_clusters, n_features]. Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.
  • labels_ : Labels of each point.
  • inertia_ : float. Sum of squared distances of samples to their closest cluster center.
  • n_iter_ : int. Number of iterations run.
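Minimal usage on the worked example's points. Note that scikit-learn's KMeans uses Euclidean distance, so its results need not match the Manhattan walkthrough above exactly:

    import numpy as np
    from sklearn.cluster import KMeans

    # Same eight points as in the worked example.
    X = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                  [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)

    km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42).fit(X)

    print(km.cluster_centers_)  # coordinates of the 3 cluster centers
    print(km.labels_)           # cluster index assigned to each point
    print(km.inertia_)          # sum of squared distances to nearest center
    print(km.n_iter_)           # number of iterations run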


K-MEANS++

• How does k-means++ work? (Research it yourself!)

• Essentially there are three methods to assign initial centers:
  • k-means++
  • Random choice of points from the samples
  • User-provided points (vectors)

• k-means++:
  • Say we need to select 2 cluster centers. Instead of selecting both randomly (as in plain k-means), we select the first one randomly, then find the points that are farthest from the first center.
  • These points most probably do not belong to the first cluster, since they are far from it, so the second cluster center is placed near those far points. (More precisely, candidates are chosen with probability proportional to their squared distance from the nearest existing center; see the sketch below.)
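A minimal sketch of the k-means++ seeding idea: weight each candidate by its squared distance to the nearest center chosen so far. (The real scikit-learn implementation adds further refinements; this is only an illustration.)

    import numpy as np

    def kmeans_pp_init(X, k, seed=0):
        """Pick k initial centers: the first uniformly at random, each
        later one with probability proportional to its squared distance
        from the nearest center chosen so far."""
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            diffs = X[:, None, :] - np.array(centers)[None, :, :]
            d2 = (diffs ** 2).sum(axis=2).min(axis=1)  # sq. dist to nearest center
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centers)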
SPECIFYING A DIFFERENT DISTANCE CALCULATION METHOD

• The scikit-learn implementation only supports the Euclidean distance measure. For another metric, such as the Manhattan distance used in the worked example, a from-scratch loop like the sketch earlier in this deck is one option.
