Data Science 2
K-means Algorithm
Cluster Analysis in Data Mining
Algorithm Description
3/22/2012
Types of Clustering
Partitioning and Hierarchical Clustering
Hierarchical Clustering
- A set of nested clusters organized as a hierarchical tree
Partitioning Clustering
- A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one subset
What is K-means?
1. Partitional clustering approach
4. Number of clusters, K, must be specified
Algorithm Statement
Details of K-means
1. Initial centroids are often chosen randomly.
- Clusters produced vary from one run to another
2. The centroid is (typically) the mean of the points in the cluster.
3. ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation,
etc.
4. K-means will converge for the common similarity measures mentioned above.
5. Most of the convergence happens in the first few iterations.
- Often the stopping condition is changed to ‘Until relatively few points
change clusters’
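The details above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, and the sample points are all assumptions made for the example. It uses Euclidean distance, random initial centroids, and stops when no point changes cluster.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: choose k distinct input points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop when no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Made-up sample data: two visually separate groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, labels = kmeans(points, k=2)
```

Changing `seed` changes the initial centroids, which is why the clusters produced can vary from one run to another. The stopping test could be relaxed to "relatively few points change clusters" by comparing the number of changed labels against a threshold.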
Euclidean Distance
A simple example: find the distance between two points, the origin (0, 0)
and the point (3, 4). The distance is sqrt(3^2 + 4^2) = sqrt(25) = 5.
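The same calculation can be written for points of any dimension. The helper name `euclidean_distance` below is chosen for this example only.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


origin = (0, 0)
point = (3, 4)
d = euclidean_distance(origin, point)  # sqrt(3**2 + 4**2) = sqrt(25) = 5.0
```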
Update Centroid
We use the following equation to calculate the n-dimensional
centroid c of k n-dimensional points x_1, ..., x_k (the coordinate-wise mean):
$$c = \frac{1}{k} \sum_{i=1}^{k} x_i$$
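As a small illustration of the centroid update, assuming the centroid is the coordinate-wise mean of the points (as stated under Details of K-means), the calculation might look like this. The function name `centroid` and the sample points are made up for the example.

```python
def centroid(points):
    """Coordinate-wise mean of k points that all have n dimensions."""
    k = len(points)
    # zip(*points) groups the j-th coordinate of every point together.
    return tuple(sum(coords) / k for coords in zip(*points))


# Three 2-dimensional points (k = 3, n = 2).
c = centroid([(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)])  # (3.0, 5.0)
```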
Example of K-means
Assign each point to the nearest of the K centroids, then recompute the
centroids.
[Figure: scatter plot of the clusters at Iteration 3]
K-means terminates since the centroids converge to certain points
and do not change.
[Figure: scatter plot of the clusters at Iteration 6]
[Figure: panels plotting y against x for Iteration 1, Iteration 2, and Iteration 3, followed by a second row of three panels on the same axes]
Demo of K-means
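To get the SSE (sum of squared errors), we square these errors, i.e. the
distance from each point to the representative point of its cluster, and sum them:
$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)$$
Here x is a data point in cluster C_i and m_i is the representative point for
cluster C_i.
One can show that m_i corresponds to the center (mean) of the cluster.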
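The SSE can be computed directly from its definition. The sketch below assumes Euclidean distance for dist; the function name `sse` and the sample clusters are invented for the illustration. It also compares the cluster means against a shifted representative point to show that moving away from the mean increases the SSE.

```python
import math


def sse(clusters, representatives):
    """Sum of squared distances from each point to its cluster's representative.

    clusters[i] is the list of points in cluster C_i and
    representatives[i] is the corresponding point m_i.
    """
    return sum(
        math.dist(m, x) ** 2
        for points, m in zip(clusters, representatives)
        for x in points
    )


clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0), (5.0, 7.0)]]
means = [(1.0, 0.0), (5.0, 6.0)]

# Every point is exactly 1 unit from its cluster mean.
total = sse(clusters, means)
# Replacing the first mean with (0, 0) gives a larger SSE.
shifted = sse(clusters, [(0.0, 0.0), (5.0, 6.0)])
```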
Problem about K
How to choose K?
1. Use another clustering method, like EM.
Cluster Quality
Limitation of K-means
K-means has problems when clusters are of differing
- Sizes
- Densities
- Non-globular shapes
K-means has problems when the data contains outliers.
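The sensitivity to outliers follows from the centroid being a mean. A small one-dimensional illustration, with made-up values:

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


cluster = [1.0, 2.0, 3.0]
with_outlier = cluster + [94.0]  # one extreme value joins the cluster

center = mean(cluster)               # 2.0
shifted_center = mean(with_outlier)  # 25.0, far from every original point
```

A single extreme value drags the centroid away from the bulk of the cluster, which can in turn change which points are assigned to it.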
Application of K-means
Image Segmentation
The k-means clustering algorithm is commonly used in
computer vision as a form of image segmentation. The
results of the segmentation are used to aid border detection
and object recognition.
$$d(k, x, c) = \frac{1}{\sum_{i=1}^{k} m_i} \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - c_i \rVert^2$$
[Figure: cost of clustering (y-axis, 0 to 0.09) against number of clusters (x-axis, 2 to 13)]
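A cost-against-number-of-clusters curve of this kind could be produced along the following lines. This is a sketch under assumptions: the cost is taken to be the mean squared distance from each value to its nearest centroid, the data are made-up one-dimensional values, and `kmeans_1d` and `cost` are hypothetical helper names.

```python
import random


def kmeans_1d(values, k, iterations=50, seed=0):
    """Tiny 1-D K-means; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            groups[nearest].append(v)
        # Each centroid moves to its group's mean; empty groups stay put.
        centroids = [
            sum(g) / len(g) if g else centroids[i] for i, g in enumerate(groups)
        ]
    return centroids


def cost(values, centroids):
    """Mean squared distance from each value to its nearest centroid (assumed cost)."""
    return sum(min((v - c) ** 2 for c in centroids) for v in values) / len(values)


values = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
costs = {k: cost(values, kmeans_1d(values, k)) for k in range(1, 7)}
for k, value in costs.items():
    print(k, round(value, 4))
```

Such a curve is commonly read by looking for the number of clusters beyond which the cost stops dropping sharply.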
[Figure: wind speed (m/s) on the y-axis against drive train acceleration on the x-axis (0 to 140)]
Reference
1. Introduction to Data Mining, P.N. Tan, M. Steinbach, V. Kumar, Addison Wesley
3. http://www.cs.cmu.edu/~cga/ai-course/kmeans.pdf
4. http://www.cse.msstate.edu/~url/teaching/CSE6633Fall08/lec16%20k-means.pdf
Appendix One
Appendix Two