K-Means Clustering
• What is clustering?
• Why would we want to cluster?
• How would you determine clusters?
• How can you do this efficiently?
Clustering
• Unsupervised learning
– Requires data, but no labels
• Detect patterns, e.g. in
– Group emails or search results
– Customer shopping patterns
– Regions of images
• Useful when you don’t know what you’re looking for
• Basic idea: group together similar instances
• What could “similar” mean?
– One option: Euclidean distance
• Clustering results are crucially dependent on the
measure of similarity (or distance) between “points”
to be clustered
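As a concrete illustration of the Euclidean option above, here is a minimal sketch (using Python's standard library) of measuring similarity between two feature vectors; the specific points are made up for illustration:

```python
import math

# Euclidean distance between two feature vectors.
# Under this choice of similarity, a smaller distance means "more similar".
a = (1.0, 2.0)
b = (4.0, 6.0)
print(math.dist(a, b))  # sqrt(3^2 + 4^2) = 5.0
```

Any other distance function (e.g. Manhattan distance) could be substituted here, and, as noted above, the clustering results would change accordingly.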
Clustering algorithms
K-means Clustering
Basic Algorithm:
• Step 0: Select K, the number of clusters.
• Step 1: Randomly select any K data points as initial cluster
centers.
• Step 2: Calculate the distance from each object to each
cluster center.
• What type of distance should we use?
– Squared Euclidean distance
– or any other given distance function
• Step 3: Assign each object to the closest cluster center.
• Step 4: Compute the new centroid for each cluster.
– The center of a cluster is computed by taking the mean of all
the data points contained in that cluster.
• Repeat steps 2 to 4:
– Calculate distances from objects to cluster centroids.
– Assign objects to the closest cluster.
– Recalculate the new centroids.
• Stop based on convergence criteria:
– Centers of the newly formed clusters do not change
– Data points remain in the same cluster
– Maximum number of iterations is reached
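The steps above can be sketched end-to-end in pure Python. This is a minimal illustrative implementation, not production code; the helper names (`sq_euclidean`, `kmeans`) and the 2-D restriction are assumptions made for brevity:

```python
import random

def sq_euclidean(a, b):
    # Squared Euclidean distance between two 2-D points (step 2's distance).
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly select K data points as the initial centers.
    centers = rng.sample(points, k)
    for _ in range(max_iter):  # stop after a maximum number of iterations
        # Steps 2-3: assign each point to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: sq_euclidean(p, centers[c]))
            clusters[i].append(p)
        # Step 4: recompute each center as the mean of its assigned points.
        new_centers = []
        for i, cl in enumerate(clusters):
            if cl:
                new_centers.append((sum(p[0] for p in cl) / len(cl),
                                    sum(p[1] for p in cl) / len(cl)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        # Convergence criterion: centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Usage: two well-separated blobs; K-means recovers one center per blob.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, k=2)
print(sorted(centers))
```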
Example 3
• Cluster the following eight points (with (x, y) representing
locations) into three clusters:
– A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2),
A8(4, 9)
• Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
• The distance function between two points a = (x1, y1) and b =
(x2, y2) is the Manhattan distance:
– ρ(a, b) = |x2 – x1| + |y2 – y1|
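The first assignment step of this example can be checked with a short script: compute the Manhattan distance from each point to each initial center and assign each point to the nearest one. (The cluster labels `C1`–`C3` below correspond to Cluster-01 through Cluster-03.)

```python
# Worked example, first assignment: eight points, three initial centers.
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centers = {"C1": (2, 10), "C2": (5, 8), "C3": (1, 2)}  # A1, A4, A7

def manhattan(a, b):
    # rho(a, b) = |x2 - x1| + |y2 - y1|
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# Assign each point to its nearest center.
assignment = {name: min(centers, key=lambda c: manhattan(p, centers[c]))
              for name, p in points.items()}
print(assignment)
# A1 -> C1; A3, A4, A5, A6, A8 -> C2; A2, A7 -> C3
```

These memberships are exactly the clusters whose centroids are recomputed next in the slides.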
• Assigning each point to its nearest initial center gives Cluster-01 =
{A1}, Cluster-02 = {A3, A4, A5, A6, A8}, Cluster-03 = {A2, A7}.
• For Cluster-01:
– We have only one point A1(2, 10) in Cluster-01.
– So, the cluster center remains the same.
• For Cluster-02:
– Center of Cluster-02 = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
• For Cluster-03:
– Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
• Re-assigning each point to the nearest of the updated centers
and recomputing the means gives the next centers:
• For Cluster-01:
– Center of Cluster-01 = ((2 + 4)/2, (10 + 9)/2) = (3, 9.5)
• For Cluster-02:
– Center of Cluster-02 = ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4) = (6.5, 5.25)
• For Cluster-03:
– Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
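The second-iteration centroids above can likewise be verified by averaging each cluster's points (the memberships follow from re-assigning the eight points to the updated centers):

```python
# Second iteration of the worked example: recompute each centroid
# as the mean of the points assigned to that cluster.
clusters = {
    "C1": [(2, 10), (4, 9)],                 # A1, A8
    "C2": [(8, 4), (5, 8), (7, 5), (6, 4)],  # A3, A4, A5, A6
    "C3": [(2, 5), (1, 2)],                  # A2, A7
}
centers = {c: (sum(x for x, _ in pts) / len(pts),
               sum(y for _, y in pts) / len(pts))
           for c, pts in clusters.items()}
print(centers)  # {'C1': (3.0, 9.5), 'C2': (6.5, 5.25), 'C3': (1.5, 3.5)}
```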
K-means Clustering
• Strengths
– Simple iterative method
– Guaranteed to converge in a finite number of iterations
– User provides “K”
– Running time per iteration:
• Assign data points to the closest cluster center: O(KN) time
• Update each cluster center to the average of its assigned points: O(N) time
• Weaknesses
– Often too simple, giving bad results
– Cannot handle noisy data and outliers
– Difficult to guess the correct “K”
– Not suitable for identifying clusters that have varying sizes,
different densities, or non-convex shapes
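The sensitivity to outliers comes from using the mean as the cluster center: a single extreme point drags the centroid far from the bulk of the cluster. A toy 1-D illustration (values are made up):

```python
# The mean is not robust: one outlier shifts a cluster center drastically.
cluster = [1.0, 2.0, 3.0]
with_outlier = cluster + [100.0]

mean = sum(cluster) / len(cluster)                 # 2.0
mean_out = sum(with_outlier) / len(with_outlier)   # 26.5
print(mean, mean_out)
```

Variants such as K-medoids, which restrict centers to actual data points, are one common way to mitigate this.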
K-means Issues
• Distance measure is squared Euclidean
– Scale should be similar in all dimensions
• Rescale data?
– Not good for nominal data. Why?
• The approach tries to minimize the within-cluster sum of
squares error (WCSS), also called inertia
– Implicit assumption that the SSE is similar for each group
WCSS
• The overall WCSS is given by:
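The WCSS objective sums, over all clusters, the squared distance from each point to its cluster centroid; this is the quantity each K-means iteration decreases. A small sketch of computing it (the function name `wcss` is an assumption for illustration):

```python
def wcss(clusters, centers):
    # Within-cluster sum of squares: for every cluster, add up the
    # squared Euclidean distance of each point to that cluster's center.
    total = 0.0
    for pts, (cx, cy) in zip(clusters, centers):
        for x, y in pts:
            total += (x - cx) ** 2 + (y - cy) ** 2
    return total

# One cluster {(0, 0), (2, 0)} with center (1, 0): WCSS = 1 + 1 = 2
print(wcss([[(0, 0), (2, 0)]], [(1, 0)]))  # 2.0
```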