K-Means Clustering
• What is clustering?
• Why would we want to cluster?
• How would you determine clusters?
• How can you do this efficiently?
Clustering
• Unsupervised learning
– Requires data, but no labels
• Detect patterns, e.g. in
– Group emails or search results
– Customer shopping patterns
– Regions of images
• Useful when you don’t know what you’re looking for
• Basic idea: group together similar instances
• What could “similar” mean?
– One option: Euclidean distance
• Clustering results are crucially dependent on the
measure of similarity (or distance) between “points”
to be clustered
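As a concrete illustration of the Euclidean option above, here is a minimal sketch (using Python's standard library) of measuring similarity between two feature vectors; the specific points are made up for illustration:

```python
import math

# Euclidean distance between two feature vectors.
# Under this choice of similarity, a smaller distance means "more similar".
a = (1.0, 2.0)
b = (4.0, 6.0)
print(math.dist(a, b))  # sqrt(3^2 + 4^2) = 5.0
```

Any other distance function (e.g. Manhattan distance) could be substituted here, and, as noted above, the clustering results would change accordingly.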
Clustering algorithms
K-means Clustering
Basic Algorithm:
• Step 0: Select K, the number of clusters.
• Step 1: Randomly select any K data points as initial cluster
centers.
• Step 2: Calculate the distance from each object to each
cluster center.
• What type of distance should we use?
– Squared Euclidean distance
– or any other given distance function
• Step 3: Assign each object to the closest cluster center.
• Step 4: Compute the new centroid for each cluster.
– The center of a cluster is computed by taking the mean of all
the data points contained in that cluster.
• Repeat steps 2 to 4:
– Calculate distances from objects to cluster centroids.
– Assign objects to the closest cluster.
– Recalculate the new centroids.
• Stop based on convergence criteria:
– Centers of the newly formed clusters do not change
– Data points remain in the same cluster
– Maximum number of iterations is reached
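The steps above can be sketched end-to-end in pure Python. This is a minimal illustrative implementation, not production code; the helper names (`sq_euclidean`, `kmeans`) and the 2-D restriction are assumptions made for brevity:

```python
import random

def sq_euclidean(a, b):
    # Squared Euclidean distance between two 2-D points (step 2's distance).
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: randomly select K data points as the initial centers.
    centers = rng.sample(points, k)
    for _ in range(max_iter):  # stop after a maximum number of iterations
        # Steps 2-3: assign each point to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: sq_euclidean(p, centers[c]))
            clusters[i].append(p)
        # Step 4: recompute each center as the mean of its assigned points.
        new_centers = []
        for i, cl in enumerate(clusters):
            if cl:
                new_centers.append((sum(p[0] for p in cl) / len(cl),
                                    sum(p[1] for p in cl) / len(cl)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        # Convergence criterion: centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Usage: two well-separated blobs; K-means recovers one center per blob.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, k=2)
print(sorted(centers))
```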
Example 3
• Cluster the following eight points (with (x, y) representing
locations) into three clusters:
– A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2),
A8(4, 9)
• Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
• The distance function between two points a = (x1, y1) and b =
(x2, y2) is the Manhattan distance:
– ρ(a, b) = |x2 – x1| + |y2 – y1|
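The first assignment step of this example can be checked with a short script: compute the Manhattan distance from each point to each initial center and assign each point to the nearest one. (The cluster labels `C1`–`C3` below correspond to Cluster-01 through Cluster-03.)

```python
# Worked example, first assignment: eight points, three initial centers.
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
centers = {"C1": (2, 10), "C2": (5, 8), "C3": (1, 2)}  # A1, A4, A7

def manhattan(a, b):
    # rho(a, b) = |x2 - x1| + |y2 - y1|
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# Assign each point to its nearest center.
assignment = {name: min(centers, key=lambda c: manhattan(p, centers[c]))
              for name, p in points.items()}
print(assignment)
# A1 -> C1; A3, A4, A5, A6, A8 -> C2; A2, A7 -> C3
```

These memberships are exactly the clusters whose centroids are recomputed next in the slides.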
• Assigning each point to its nearest initial center gives Cluster-01 =
{A1}, Cluster-02 = {A3, A4, A5, A6, A8}, Cluster-03 = {A2, A7}.
• For Cluster-01:
– We have only one point A1(2, 10) in Cluster-01.
– So, the cluster center remains the same.
• For Cluster-02:
– Center of Cluster-02 = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6)
• For Cluster-03:
– Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
• Re-assigning each point to the nearest of the updated centers
and recomputing the means gives the next centers:
• For Cluster-01:
– Center of Cluster-01 = ((2 + 4)/2, (10 + 9)/2) = (3, 9.5)
• For Cluster-02:
– Center of Cluster-02 = ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4) = (6.5, 5.25)
• For Cluster-03:
– Center of Cluster-03 = ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5)
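The second-iteration centroids above can likewise be verified by averaging each cluster's points (the memberships follow from re-assigning the eight points to the updated centers):

```python
# Second iteration of the worked example: recompute each centroid
# as the mean of the points assigned to that cluster.
clusters = {
    "C1": [(2, 10), (4, 9)],                 # A1, A8
    "C2": [(8, 4), (5, 8), (7, 5), (6, 4)],  # A3, A4, A5, A6
    "C3": [(2, 5), (1, 2)],                  # A2, A7
}
centers = {c: (sum(x for x, _ in pts) / len(pts),
               sum(y for _, y in pts) / len(pts))
           for c, pts in clusters.items()}
print(centers)  # {'C1': (3.0, 9.5), 'C2': (6.5, 5.25), 'C3': (1.5, 3.5)}
```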
K-means Clustering
• Strengths
– Simple iterative method
– Guaranteed to converge in a finite number of iterations
– User provides “K”
– Running time per iteration:
• Assign data points to the closest cluster center: O(KN) time
• Update each cluster center to the average of its assigned points: O(N) time
• Weaknesses
– Often too simple, giving bad results
– Cannot handle noisy data and outliers
– Difficult to guess the correct “K”
– Not suitable for identifying clusters that have varying sizes,
different densities, or non-convex shapes
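The sensitivity to outliers comes from using the mean as the cluster center: a single extreme point drags the centroid far from the bulk of the cluster. A toy 1-D illustration (values are made up):

```python
# The mean is not robust: one outlier shifts a cluster center drastically.
cluster = [1.0, 2.0, 3.0]
with_outlier = cluster + [100.0]

mean = sum(cluster) / len(cluster)                 # 2.0
mean_out = sum(with_outlier) / len(with_outlier)   # 26.5
print(mean, mean_out)
```

Variants such as K-medoids, which restrict centers to actual data points, are one common way to mitigate this.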
K-means Issues
• Distance measure is squared Euclidean
– Scale should be similar in all dimensions
• Rescale data?
– Not good for nominal data. Why?
• The approach tries to minimize the within-cluster sum of
squares error (WCSS), also called inertia
– Implicit assumption that the SSE is similar for each group
WCSS
• The overall WCSS is given by:
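The WCSS objective sums, over all clusters, the squared distance from each point to its cluster centroid; this is the quantity each K-means iteration decreases. A small sketch of computing it (the function name `wcss` is an assumption for illustration):

```python
def wcss(clusters, centers):
    # Within-cluster sum of squares: for every cluster, add up the
    # squared Euclidean distance of each point to that cluster's center.
    total = 0.0
    for pts, (cx, cy) in zip(clusters, centers):
        for x, y in pts:
            total += (x - cx) ** 2 + (y - cy) ** 2
    return total

# One cluster {(0, 0), (2, 0)} with center (1, 0): WCSS = 1 + 1 = 2
print(wcss([[(0, 0), (2, 0)]], [(1, 0)]))  # 2.0
```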