Chapter 6.3 Cluster Analysis
Vanishree M
Outline
• Introduction
• Types of clusters
• Conducting cluster analysis
• Selecting a clustering procedure
• Hierarchical Cluster Analysis
• Hierarchical Cluster Analysis: Example
• Hierarchical clustering – a case
• K-Means Clustering
• K-Means Clustering: Example
What is a cluster?
• Clustering refers to the grouping of records, observations, or cases into
classes of similar objects.
• A cluster is a collection of records that are similar to one another and
dissimilar to records in other clusters.
• Clustering differs from classification in that there is no target variable for
clustering.
• The clustering task does not try to classify, estimate, or predict the value of a
target variable. Instead, clustering algorithms seek to segment the entire
data set into relatively homogeneous subgroups or clusters, where the
similarity of the records within the cluster is maximized, and the similarity to
records outside this cluster is minimized.
An ideal cluster
[Figure: scatter plots of Variable 2 against Variable 1 illustrating an ideal, well-separated cluster; a second panel marks a point X]
Cluster Analysis…
• For optimal performance, clustering algorithms, just like algorithms for classification, require the data to be normalized so that no particular variable or subset of variables dominates the analysis. Analysts may use either min–max normalization or Z-score standardization:
▫ Min–max normalization: X* = (X − min(X)) / range(X)
▫ Z-score standardization: X* = (X − mean(X)) / SD(X)
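The two rescalings above can be sketched in a few lines of NumPy (the sample values here are made up for illustration):

```python
import numpy as np

# Hypothetical sample of one numeric variable.
x = np.array([2.0, 4.0, 6.0, 10.0])

# Min–max normalization: X* = (X - min(X)) / range(X), rescales to [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: X* = (X - mean(X)) / SD(X),
# gives mean 0 and standard deviation 1.
x_z = (x - x.mean()) / x.std()

print(x_minmax)  # values now lie in [0, 1]
print(x_z)       # values now have mean 0, SD 1
```

After either transformation, variables measured on very different scales contribute comparably to the distance computations.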
[Figure: linkage methods illustrated between Cluster 1 and Cluster 2 — Ward's method; complete linkage (maximum distance); average linkage (average distance)]
Centroid Method
Dendrogram
• Read the dendrogram from
left to right.
• The vertical lines indicate the
distances at which objects
have been combined.
• For example, according to our
calculations above, objects B,
C, and E are merged at a
distance of 1.414.
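As an illustrative sketch of how such a hierarchy is built and read, using SciPy's hierarchical-clustering routines on made-up points (the data behind objects B, C, and E above are not reproduced here):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D observations.
X = np.array([[1.0, 1.0], [1.5, 1.0], [5.0, 5.0], [5.5, 5.0], [9.0, 1.0]])

# Build the agglomerative hierarchy with average linkage
# ('complete' and 'ward' are the other methods discussed in this chapter).
Z = linkage(X, method='average')

# Column 2 of Z holds the merge distances — the values read off the
# vertical lines of a dendrogram.
print(Z[:, 2])

# Cut the tree into two flat clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```

Each row of `Z` records one merge: which two clusters were joined and at what distance, which is exactly the information the dendrogram displays.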
Case   4 Clusters   3 Clusters   2 Clusters
1      1            1            1
2      2            2            2
3      1            1            1
4      3            3            2
5      2            2            2
6      1            1            1
7      1            1            1
8      1            1            1
9      2            2            2
10     3            3            2
11     2            2            2
12     1            1            1
13     2            2            2
14     3            3            2
15     1            1            1
16     3            3            2
17     1            1            1
18     4            3            2
19     3            3            2
20     2            2            2
Scree Plot
Dendrogram
Cluster Membership
• When we view the results, a
three-segment solution appears
promising.
• The first segment comprises compact cars, the second sports cars, and the third limousines.
• Increasing the solution by one
segment would further split up
the sports cars segment into two
sub-segments. This does not
appear to be very helpful, as now
two of the four segments
comprise only one object.
K-Means Clustering
• k-means clustering is a straightforward and effective algorithm for finding clusters in data. The algorithm is as follows:
▫ Step 1: Ask the user how many clusters k the data set should be partitioned into.
▫ Step 2: Randomly assign k records to be the initial cluster center locations.
▫ Step 3: For each record, find the nearest cluster center. Thus, each cluster center “owns” a subset of the records, thereby representing a partition of the data set. We therefore have k clusters, C1, C2, …, Ck.
▫ Step 4: For each of the k clusters, find the cluster centroid, and update the location of each cluster center to the new value of the centroid.
▫ Step 5: Repeat steps 3 and 4 until convergence or termination.
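The steps above can be sketched in Python — a minimal illustration, not an optimized implementation (empty-cluster handling is omitted for brevity):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal sketch of the k-means steps above (no empty-cluster handling)."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k records as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 3: assign each record to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each center to the centroid of the records it owns.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: repeat until the centers stop moving (convergence).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated blobs; k = 2 recovers them.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = k_means(X, k=2)
print(labels)
```

The cluster numbering depends on the random initialization, but the partition itself — the two blobs — is stable for well-separated data like this.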
K-Means Clustering
• Suppose that we have n data points (a1, b1, c1), (a2, b2, c2), …, (an, bn, cn). The centroid of these points is their center of gravity and is located at the point (Σai/n, Σbi/n, Σci/n).
• For example, the points (1,1,1), (1,2,1), (1,3,1), and (2,1,1) have centroid ((1+1+1+2)/4, (1+2+3+1)/4, (1+1+1+1)/4) = (1.25, 1.75, 1.00)
• The algorithm terminates when the centroids no longer change. In other words, the algorithm terminates when, for all clusters C1, C2, …, Ck, all the records “owned” by each cluster center remain in that cluster. Alternatively, the algorithm may terminate when some convergence criterion is met, such as no significant shrinkage in the mean squared error (MSE).
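As a quick check of the centroid arithmetic above, the centroid is simply the component-wise mean:

```python
import numpy as np

# The four example points from the text.
points = np.array([[1, 1, 1], [1, 2, 1], [1, 3, 1], [2, 1, 1]], dtype=float)

# Center of gravity: the mean of each coordinate.
centroid = points.mean(axis=0)
print(centroid)  # centroid is (1.25, 1.75, 1.00), matching the example
```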
Point   x   y
a       1   3
b       3   3
c       4   3
d       5   3
e       1   2
f       4   2
g       1   1
h       2   1
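As a sketch of how k-means would proceed on these eight points, assuming k = 2 and, purely for illustration, points a and c as the initial centers (the slide does not specify the starting seeds):

```python
import numpy as np

# The eight points from the table above.
names = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
X = np.array([[1, 3], [3, 3], [4, 3], [5, 3],
              [1, 2], [4, 2], [1, 1], [2, 1]], dtype=float)

# Assumed initial centers: points a and c (an illustrative choice).
centers = X[[0, 2]].copy()

for _ in range(100):
    # Step 3: assign each point to its nearest center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: recompute each centroid; stop once nothing moves.
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print({n: int(l) for n, l in zip(names, labels)})
```

With this starting point the algorithm converges to the clusters {a, e, g, h} and {b, c, d, f}, with centroids (1.25, 1.75) and (4, 2.75); a different initialization could pass through different intermediate partitions.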