
K-means Clustering

K-means Clustering
• What is clustering?
• Why would we want to cluster?
• How would you determine clusters?
• How can you do this efficiently?
Clustering
• Unsupervised learning
– Requires data, but no labels
• Detect patterns, e.g. in:
– Group emails or search results
– Customer shopping patterns
– Regions of images
• Useful when you don’t know what you’re looking for
• Basic idea: group together similar instances
• What could “similar” mean?
– One option: Euclidean distance
• Clustering results are crucially dependent on the
measure of similarity (or distance) between “points”
to be clustered
Clustering algorithms
K-means Clustering
Basic Algorithm:
• Step 0: Select K.
• Step 1: Randomly select any K data points as cluster
centers.
• Step 2: Calculate the distance from each object to each
cluster center.
• What type of distance should we use?
– Squared Euclidean distance
– or a given distance function
K-means Clustering
• Step 3: Assign each object to the closest cluster
• Step 4: Compute the new centroid for each cluster
– The center of a cluster is computed by taking the mean of all
the data points contained in that cluster.
• Iterate steps 2 to 4:
– Calculate distances from objects to cluster centroids.
– Assign objects to the closest cluster.
– Recalculate the centroids.
K-means Clustering
• Stop based on convergence criteria:
– Centers of the newly formed clusters do not change
– Data points remain in the same cluster
– The maximum number of iterations is reached
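
Putting steps 1–4 and the stopping rule together: below is a minimal NumPy sketch of the algorithm (the function name and defaults are illustrative, not from the slides). It uses squared Euclidean distance and stops when the assignments no longer change or the iteration limit is hit.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means sketch. X: (n, d) array of points, k: number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct data points at random as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iters):
        # Step 2: squared Euclidean distance from every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        # Step 3: assign each point to its closest center.
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing -> converged
        labels = new_labels
        # Step 4: recompute each center as the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):  # guard against an empty cluster
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```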
Worked Example
• Cluster the following eight points (with (x, y) representing
locations) into three clusters:
– A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4,
9)
• The initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2).
• The distance function between two points a = (x1, y1) and b =
(x2, y2) is defined as:
– Ρ(a, b) = |x2 – x1| + |y2 – y1| (Manhattan distance)

• Use the K-Means algorithm to find the three cluster centers after
the second iteration.
Iteration - 1
• Calculating distance between A1(2, 10) and C1(2, 10):
– Ρ(A1, C1) = |x2 – x1| + |y2 – y1| = |2 – 2| + |10 – 10| = 0
• Calculating distance between A1(2, 10) and C2(5, 8):
– Ρ(A1, C2) = |x2 – x1| + |y2 – y1| = |5 – 2| + |8 – 10| = 3 + 2 = 5
• Calculating distance between A1(2, 10) and C3(1, 2):
– Ρ(A1, C3) = |x2 – x1| + |y2 – y1| = |1 – 2| + |2 – 10| = 1 + 8 = 9
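
These distance computations are mechanical and easy to check in code. A quick sketch (names are illustrative):

```python
# Manhattan distance used in this example.
def manhattan(a, b):
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

centers = {"C1": (2, 10), "C2": (5, 8), "C3": (1, 2)}
A1 = (2, 10)
for name, c in centers.items():
    print(name, manhattan(A1, c))  # prints 0, 5, 9
```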
Repeating this calculation for the remaining points and assigning each point to its nearest center gives the new clusters:

• Cluster-01 (first cluster) contains:
– A1(2, 10)
• Cluster-02 (second cluster) contains:
– A3(8, 4)
– A4(5, 8)
– A5(7, 5)
– A6(6, 4)
– A8(4, 9)
• Cluster-03 (third cluster) contains:
– A2(2, 5)
– A7(1, 2)
• Re-compute the new cluster centers.
• The new cluster center is computed by taking the mean of all the
points contained in that cluster.
• For Cluster-01:
– We have only one point, A1(2, 10), in Cluster-01, so the cluster
center remains the same.
• For Cluster-02:
– Center of Cluster-02 = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
• For Cluster-03:
– Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
• This completes Iteration-01.
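
The centroid update is just a coordinate-wise mean. For instance, checking Cluster-02 with NumPy:

```python
import numpy as np

cluster_02 = np.array([(8, 4), (5, 8), (7, 5), (6, 4), (4, 9)])
print(cluster_02.mean(axis=0))  # -> [6. 6.]
```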


Iteration - 2
• Calculate the distance of each point from each of the centers of
the three clusters.
• The distance is calculated using the given distance function.
• Calculating distance between A1(2, 10) and C1(2, 10):
– Ρ(A1, C1) = |x2 – x1| + |y2 – y1| = |2 – 2| + |10 – 10| = 0
• Calculating distance between A1(2, 10) and C2(6, 6):
– Ρ(A1, C2) = |x2 – x1| + |y2 – y1| = |6 – 2| + |6 – 10| = 4 + 4 = 8
• Calculating distance between A1(2, 10) and C3(1.5, 3.5):
– Ρ(A1, C3) = |x2 – x1| + |y2 – y1| = |1.5 – 2| + |3.5 – 10| = 0.5 + 6.5 = 7
New clusters

• Cluster-01 (first cluster) contains:
– A1(2, 10)
– A8(4, 9)
• Cluster-02 (second cluster) contains:
– A3(8, 4)
– A4(5, 8)
– A5(7, 5)
– A6(6, 4)
• Cluster-03 (third cluster) contains:
– A2(2, 5)
– A7(1, 2)
New cluster centers

• For Cluster-01:
– Center of Cluster-01 = ((2 + 4)/2, (10 + 9)/2) = (3, 9.5)
• For Cluster-02:
– Center of Cluster-02 = ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4) = (6.5, 5.25)
• For Cluster-03:
– Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
• These are the three cluster centers after the second iteration.
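
The entire worked example can be reproduced in a few lines. Note this is a hand-rolled sketch: scikit-learn’s KMeans only supports Euclidean distance, so the Manhattan-distance variant here is written out with NumPy.

```python
import numpy as np

points = np.array([(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)])
centers = np.array([(2, 10), (5, 8), (1, 2)], dtype=float)  # A1, A4, A7

for iteration in range(2):
    # Manhattan distance from every point to every center.
    dists = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each center as the mean of its assigned points.
    centers = np.array([points[labels == j].mean(axis=0) for j in range(3)])
    print(f"after iteration {iteration + 1}:\n{centers}")
# after iteration 2: (3, 9.5), (6.5, 5.25), (1.5, 3.5)
```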
K-means Clustering
• Strengths
– Simple iterative method
– Guaranteed to converge in a finite number of iterations
– User provides “K”
– Running time per iteration:
• Assigning data points to the closest cluster center takes O(KN) time
• Updating each cluster center to the mean of its assigned points takes O(N) time
• Weaknesses
– Often too simple → bad results
– Cannot handle noisy data and outliers
– Difficult to guess the correct “K”
– Not suitable for identifying clusters with varying sizes, different
densities, or non-convex shapes
K-means Issues
• Distance measure is squared Euclidean
– Scale should be similar in all dimensions
• Rescale the data? (see the sketch after this list)
– Not good for nominal data. Why?
• The approach tries to minimize the within-cluster sum of
squares (WCSS), also called inertia
– Implicit assumption that the SSE is similar for each group
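
Because squared Euclidean distance weights every dimension equally, a feature on a larger scale dominates the clustering. A common remedy, sketched here with scikit-learn (the toy data is illustrative), is to standardize each feature first:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data: the second feature is on a much larger scale than the first.
X = np.array([[1.0, 1000.0], [2.0, 1100.0], [10.0, 900.0], [11.0, 1050.0]])

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)
```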
WCSS
• The overall WCSS is given by:
– WCSS = Σₖ Σ_{xᵢ ∈ Cₖ} ‖xᵢ − μₖ‖², where μₖ is the centroid of cluster Cₖ
• The goal is to find the clustering with the smallest WCSS.
• Does this depend on the initial seed values?
• Possibly: the figure shows two suboptimal solutions that the
algorithm can converge to if you are not lucky with the random
initialization step.
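
The seed dependence is easy to demonstrate: a single K-means run (n_init=1) started from different random seeds can converge to different WCSS values. A scikit-learn sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# One initialization per run: the final inertia varies with the seed.
for seed in range(5):
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  inertia={km.inertia_:.1f}")
```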
Finding the optimal number of clusters

• The inertia is not a good performance metric when trying to
choose k, because it keeps getting lower as we increase k.
• Indeed, the more clusters there are, the closer each instance
will be to its closest centroid, and therefore the lower the
inertia will be.
• When plotting the inertia as a function of the number of
clusters k, the curve often contains an inflexion point called
the “elbow”: the curve has roughly the shape of an arm, and
the elbow is the bend.
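
A sketch of such an elbow plot, using scikit-learn and matplotlib on synthetic blobs:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, "o-")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (WCSS)")
plt.show()  # look for the bend ("elbow") in the curve
```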
• K-Means can also fail to cluster ellipsoidal (non-spherical) blobs properly.
Image segmentation using K-Means with various
numbers of color clusters
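
A sketch of how such color-based segmentation can be done (the file name and cluster count are illustrative): cluster the pixels’ RGB values and replace each pixel with its cluster’s centroid color.

```python
import numpy as np
from sklearn.cluster import KMeans
from PIL import Image

image = np.asarray(Image.open("photo.jpg"))      # (h, w, 3) RGB array
pixels = image.reshape(-1, 3).astype(float)      # one row per pixel

k = 8                                            # number of color clusters
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)

# Replace every pixel with the centroid color of its cluster.
segmented = km.cluster_centers_[km.labels_].reshape(image.shape).astype(np.uint8)
Image.fromarray(segmented).save("segmented.jpg")
```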
Bottom Line
• K-means
– Easy to use
– Need to know K
– May need to scale data
– Good initial method
• Local optima
– No guarantee of an optimal solution
– Repeat with different starting values (as in the sketch below)
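
In practice, libraries automate the restarts. In scikit-learn, for example, KMeans runs n_init initializations and keeps the lowest-inertia solution, and its default “k-means++” seeding spreads the initial centers out:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# 10 restarts with k-means++ seeding; the best (lowest-inertia) run is kept.
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)
```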
