Cluster Analysis
• Challenges
– Scalability
– Ability to deal with different types of attributes
– Discovery of clusters with arbitrary shape
– Requirements for domain knowledge to determine input parameters
– Ability to deal with noisy data
Considerations for Cluster Analysis
• Partitioning criteria
– Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
• Separation of clusters
– Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
• Similarity measure
– Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
• Clustering space
– Full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)
Major Clustering Approaches
• Partitioning approach
– Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
– Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach
– Create a hierarchical decomposition of the set of data (or objects) using some criterion
– Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
• Density-based approach
– Based on connectivity and density functions; can find arbitrarily shaped clusters
– Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach
– Based on a multiple-level granularity structure; fast processing time
– Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches
• Density-based approach
– Clusters are dense regions in the data space, separated by regions of lower point density
– Goal is to identify dense regions, measured by the number of objects close to a given point
– Any point x whose ϵ-neighborhood contains at least MinPts points is marked as a core point
– x is a border point if its ϵ-neighborhood contains fewer than MinPts points, but x belongs to the ϵ-neighborhood of some core point z
– If a point is neither a core point nor a border point, it is called a noise point or an outlier
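As a concrete illustration, here is a minimal Python sketch of this core/border/noise labeling, assuming Euclidean distance; the eps and min_pts defaults are illustrative values, not values from the text.

```python
import numpy as np

def label_points(X, eps=0.5, min_pts=4):
    """Label each point as 'core', 'border', or 'noise'."""
    n = len(X)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Membership of each eps-neighborhood (a point counts as its own neighbor)
    neighbors = dists <= eps
    counts = neighbors.sum(axis=1)
    core = counts >= min_pts
    labels = []
    for i in range(n):
        if core[i]:
            labels.append("core")
        elif neighbors[i][core].any():  # within the eps-neighborhood of some core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels
```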
Major Clustering Approaches
• Grid-based approach
1. Partition the data space into a finite number of cells
2. Calculate the density of each cell
3. Sort the cells according to their densities
4. Identify cluster centres (cells with the highest density)
5. Traverse neighboring cells
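A minimal Python sketch of steps 1–3, assuming axis-aligned cells of uniform width; cell_width is an illustrative parameter:

```python
import numpy as np
from collections import Counter

def grid_densities(X, cell_width=1.0):
    # Step 1: partition the data space by mapping each point to a cell index
    cells = [tuple((p // cell_width).astype(int)) for p in X]
    # Step 2: cell density = number of points falling into each cell
    density = Counter(cells)
    # Step 3: sort cells by density; the densest cells are cluster-centre candidates
    return density.most_common()
```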
Partitioning Algorithms
• Given a data set, D, of n objects, and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster
• The sum of squared distances is minimized, where $c_i$ is the centroid or medoid of cluster $C_i$:
$E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, c_i)^2$
• Sensitive to outliers
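A minimal sketch of one such partitioning algorithm, Lloyd's k-means, which alternates assignment and centroid-update steps to reduce E; the random initialization and the iteration cap are illustrative choices:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen objects as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each object joins the cluster of its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its cluster
        # (assumes every cluster keeps at least one member)
        new = np.array([X[assign == i].mean(axis=0) for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, assign
```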
k-Medoids Clustering Method
• Instead of taking the mean value of the objects in a cluster as a reference point, an actual object (the medoid) is used
• The absolute-error criterion is used, where $o_i$ is the medoid of cluster $C_i$:
$E = \sum_{i=1}^{k} \sum_{p \in C_i} \mathrm{dist}(p, o_i)$
[Figure: side-by-side scatter plots of the same data clustered by k-means (left) and k-medoids (right)]
k-Medoids Clustering Method
• Find representative objects (medoids) in clusters
– PAM works effectively for small data sets, but does not scale well to large data sets, due to its computational complexity
– O(k(n−k)²) per iteration, where n = number of data points and k = number of clusters
k-Medoids Algorithm: Illustration
[Figure: PAM illustration on a small 2-D data set, k = 2; total cost = 20 after the initial assignment]
1. Arbitrarily choose k objects as the initial medoids
2. Assign each remaining object to the nearest medoid
3. Do loop, until no change:
– Compute the total cost of swapping a medoid, O, with a randomly selected non-medoid object, O_random
– Swap O and O_random if the quality is improved
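A minimal Python sketch of the swap loop illustrated above; the random initialization and the brute-force re-evaluation of the total cost over all medoid/non-medoid pairs are simplifications for clarity:

```python
import numpy as np
from itertools import product

def total_cost(X, medoid_idx):
    # Cost = sum of distances from each object to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial medoids
    medoids = list(rng.choice(len(X), size=k, replace=False))
    improved = True
    while improved:  # "until no change"
        improved = False
        non_medoids = [i for i in range(len(X)) if i not in medoids]
        for m, o in product(range(k), non_medoids):
            candidate = medoids.copy()
            candidate[m] = o  # swap medoid m with non-medoid o
            if total_cost(X, candidate) < total_cost(X, medoids):
                medoids, improved = candidate, True
    return medoids
```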
Hierarchical Clustering
• Group data objects into a hierarchy
• Produces a set of nested clusters organized as a hierarchical tree
• Dendrogram - A tree structure representing the sequence of merging decisions
• Useful for data summarization and visualization
• Does not require the number of clusters k as an input
• Needs a termination condition
• Methods – agglomerative, divisive, BIRCH, Chameleon
[Figure: dendrogram; merge error on the vertical axis, data objects on the horizontal axis]
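For instance, a hierarchy like the one in the figure can be built and cut with SciPy; the tiny one-dimensional data set and the choice of average linkage are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[2.0], [5.0], [9.0], [10.0], [15.0]])
Z = linkage(X, method="average")  # agglomerative merge sequence (the dendrogram)
# Cut the dendrogram to obtain k = 2 flat clusters; k is not needed up front
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```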
Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative: bottom-up (merging) fashion
1. Start with the points as individual clusters
2. At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
– Divisive: top-down (splitting) fashion, starting from one all-inclusive cluster
• Defining the distance between clusters (see the sketch below):
– MIN (single link)
– MAX (complete link)
– Group average
– Distance between centroids
– Other methods driven by an objective function, e.g., Ward's Method uses the squared error (ESS)
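A minimal sketch of these four inter-cluster distance definitions, written as plain functions over two clusters represented as NumPy arrays:

```python
import numpy as np

def pairwise(A, B):
    # All pairwise Euclidean distances between clusters A and B
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def d_min(A, B):       # MIN (single link): distance of the closest pair
    return pairwise(A, B).min()

def d_max(A, B):       # MAX (complete link): distance of the farthest pair
    return pairwise(A, B).max()

def d_avg(A, B):       # Group average: mean of all pairwise distances
    return pairwise(A, B).mean()

def d_centroid(A, B):  # Distance between the cluster centroids
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```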
Ward’s Method - Example
• Five customers – A, B, C, D, E
• Ratings provided – 2, 5, 9, 10, 15, respectively, on a 20-point scale
• Cluster customers based on ratings
• Stage 1:
– Five clusters of one object each, ESS = 0
– No loss of information since there is no clustering
• Stage 2:
– Combining C and D as they are closest; centroid = 9.5
– Four-cluster solution: {A, B, {C,D}, E}
– ESS = 0 + 0 + [(9 − 9.5)² + (10 − 9.5)²] + 0 = 0.5
• Stage 3:
– ESS for the solution {{A,B}, {C,D}, E} = 5.0
– ESS for the solution {A, B, {C,D,E}} = 20.7
– ESS for the solution {{A,E}, {C,D}, B} = 85.0
– ESS for the solution {A, {B,E}, {C,D}} = 50.5
– ESS for the solution {A, {B,C,D}, E} = 14.0
– ESS for the solution {{A,C,D}, B, E} = 38.0
Ward’s Method - Example
• Stage 4:
– ESS for the solution {{A,B,C,D}, E} = 41.0
– ESS for the solution {{A,B,E}, {C,D}} = 93.2
– ESS for the solution {{A,B}, {C,D,E}} = 25.2
• Stage 5:
– ESS for the solution {{A,B, C,D,E}} = 98.8
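These ESS values can be checked with a few lines of Python; ess computes the within-cluster squared error of a candidate partition:

```python
ratings = {"A": 2, "B": 5, "C": 9, "D": 10, "E": 15}

def ess(partition):
    # Sum over clusters of squared deviations from each cluster's mean
    total = 0.0
    for cluster in partition:
        vals = [ratings[c] for c in cluster]
        mean = sum(vals) / len(vals)
        total += sum((v - mean) ** 2 for v in vals)
    return total

print(ess([{"A"}, {"B"}, {"C", "D"}, {"E"}]))  # 0.5  (Stage 2)
print(ess([{"A", "B"}, {"C", "D"}, {"E"}]))    # 5.0  (Stage 3)
print(ess([{"A", "B", "C", "D", "E"}]))        # 98.8 (Stage 5)
```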
Ward’s Method - Comments
• Clusters from a previous stage are never taken apart
• Less sensitive to outliers
• Produces spherical, tightly bound clusters; biased towards globular clusters
• Can be used to decide the value of k (e.g., by looking for a large jump in ESS between successive stages)