Lecture9 Clustering For Students
Agglomerative Hierarchical Clustering
CDS504 Module 10
Copyright @ Dr. Aihua Yan
Friday, November 11, 2022
Examples of Clustering Applications
Clustering Applications - Marketing (1)
o Marketing (market segmentation): customers are segmented based on
transaction history, demographic data, behavioral data, and psychological
data, and a marketing strategy is tailored for each segment.
[Figure: Market segments for fitness centers: losing weight, health requirements, making friends, sports focus]
Clustering Applications - Finance (2)
o Finance (balanced portfolios): given data on a variety of investment opportunities
(e.g., stocks), one may find clusters based on financial performance variables such as
return (daily, weekly, or monthly), volatility, and beta, and on other characteristics such as
industry and market capitalization. Selecting securities from different clusters can
help create a balanced portfolio.
Clustering Applications (3)
o Spatial Data Analysis
– Detect spatial clusters and explain
them in spatial data mining. e.g.
sales distribution by zip code.
o Image Processing
– Disease diagnosis in healthcare
o Web analysis
– Document classification
– Cluster web-log data to discover
groups of similar access patterns.
Introduction:
What is Cluster Analysis?
What is Cluster analysis?
Purpose: Identify groups of individuals or objects that are similar to each
other but different from individuals/objects in other groups.
Objectives in Cluster Analysis
o Within-cluster variation = minimum
o Between-cluster variation = maximum
[Figure: objects grouped around cluster centroids]
Within-groups vs. Between-groups
o Within-groups property: Each group is homogeneous with respect to
certain characteristics, i.e., observations in each group are similar to each other.
o Between-groups property: Each group should be different from other
groups with respect to the same characteristics, i.e., observations of one group
should be different from the observations of other groups.
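The two properties can be checked numerically. The sketch below is only an illustration: the data points, both label assignments, and the helper names (`centroid`, `sq_dist`, `variation`) are hypothetical and not taken from the lecture. It computes the within-cluster and between-cluster sums of squares for a sensible grouping and for a poor one.

```python
# Toy 2-D observations (hypothetical values, for illustration only).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

def centroid(rows):
    """Mean of each coordinate over a list of points."""
    n = len(rows)
    return tuple(sum(r[d] for r in rows) / n for d in range(len(rows[0])))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def variation(points, labels):
    """Return (within-cluster SS, between-cluster SS) for a labelling."""
    grand = centroid(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    within = between = 0.0
    for members in clusters.values():
        c = centroid(members)
        within += sum(sq_dist(p, c) for p in members)
        between += len(members) * sq_dist(c, grand)
    return within, between

good_within, good_between = variation(points, [0, 0, 0, 1, 1, 1])
bad_within, bad_between = variation(points, [0, 1, 0, 1, 0, 1])
print("good grouping:", good_within, good_between)
print("poor grouping:", bad_within, bad_between)
```

The good grouping has the smaller within-cluster value and the larger between-cluster value; the two values always add up to the same total variation, so lowering one raises the other.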
How many clusters?
Types of Clustering
o A clustering is a set of clusters
o Important distinction between hierarchical and partitional sets of clusters
– Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
• Algorithm: e.g., agglomerative hierarchical clustering
[Figure: nested clusters of points p1, p2, p3, p4 and the corresponding hierarchical tree]
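A minimal sketch of the agglomerative idea: start with every object in its own cluster and repeatedly merge the two closest clusters. The slides do not fix a linkage at this point, so the sketch assumes single linkage (the distance between two clusters is the smallest distance between their members). The coordinates and the function name `agglomerate` are hypothetical.

```python
import math

def agglomerate(points):
    """Merge the two closest clusters until one remains.

    Returns the merge history as (sorted member indices, merge distance).
    Assumption: single linkage, i.e. cluster distance = closest pair of members.
    """
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        history.append((sorted(merged), round(d, 3)))
        # Drop the two merged clusters and append the new one.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history

# Four hypothetical points (indices 0..3 stand in for p1..p4).
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.5)]
history = agglomerate(points)
for members, distance in history:
    print(members, distance)
```

Each entry of the history corresponds to one level of the tree; reading it from last to first gives the nested structure from the top down.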
Ties are not a critical issue in clustering.
Stop rule: We are seeking a solution where an additional combination of clusters or objects
would occur at a greatly increased distance (maximum information loss).
[Figure: example distances 1.414, 2.236, 2, 3.162]
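One way to read the stop rule is to look for the largest jump between consecutive merge distances and stop just before it. The sketch below assumes the merge distances are already known and listed in merge order; the values and the function name `clusters_before_biggest_jump` are hypothetical.

```python
def clusters_before_biggest_jump(merge_distances):
    """Number of clusters left if merging stops before the largest jump.

    merge_distances[i] is the distance at which the (i+1)-th merge happened,
    so there were len(merge_distances) + 1 objects at the start.
    """
    n_objects = len(merge_distances) + 1
    jumps = [later - earlier
             for earlier, later in zip(merge_distances, merge_distances[1:])]
    biggest = max(range(len(jumps)), key=jumps.__getitem__)
    merges_kept = biggest + 1  # keep every merge up to the jump
    return n_objects - merges_kept

# Hypothetical merge distances for six objects.
distances = [1.0, 1.2, 1.5, 4.8, 5.1]
n_clusters = clusters_before_biggest_jump(distances)
print(n_clusters)
```

Here the fourth merge would happen at a much larger distance than the third, so the rule keeps the first three merges and leaves three clusters.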
Steps of Hierarchical Clustering
Step 1: Objectives of cluster analysis
Step 2: Choice of Variables
Step 3: Similarity Measures (Distance Measures)
[Figure: data matrix of observations by measures]
Step 1: Objective of CA
o Taxonomy description
– Taxonomy: an empirically based classification of objects.
– Although cluster analysis is viewed principally as an exploratory technique, cluster analysis can be
used for confirmatory purposes. In such cases, a proposed typology (theoretically based
classification) can be compared to that derived from the cluster analysis.
o Data Simplification
– Instead of viewing all of the observations as unique, they can be viewed as members of clusters
and profiled by their general characteristics.
Step 2: Choice of Variables
o The selection of clustering variables should be based on an explicit theory, past research, or supposition.
o The researcher must also realize the importance of including only those variables that:
– characterize the objects being clustered, and
– relate specifically to the objectives of the cluster analysis. See the next slide for the variables for need-based market
segmentation.
o Practical considerations:
– Cluster analysis can be affected dramatically by the inclusion of only one or two inappropriate or undifferentiated
variables.
– The analyst is always encouraged to examine the results and to eliminate the variables that are not distinctive (i.e.
that do not differ significantly) across the derived clusters.
o Warning:
– Avoid including variables “just because you have them”
– Results are dramatically affected by inclusion of even one or two inappropriate or undifferentiated variables
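To see which variables actually differ across derived clusters, one option is a one-way ANOVA F ratio per variable (between-cluster variance over within-cluster variance). This is only a sketch of one possible check, not a method named in the slides; the variable names, values, and cluster labels are hypothetical, and no significance test is performed.

```python
def f_ratio(values, labels):
    """One-way ANOVA F ratio of a single variable across clusters."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    grand = sum(values) / n
    between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical cluster labels and two candidate variables.
labels = [0, 0, 0, 1, 1, 1]
spending = [10, 12, 11, 30, 31, 29]   # differs strongly between clusters
shoe_size = [40, 42, 41, 41, 40, 42]  # same mean in both clusters

f_spending = f_ratio(spending, labels)
f_shoe_size = f_ratio(shoe_size, labels)
print(f_spending, f_shoe_size)
```

A variable with an F ratio near zero, like `shoe_size` here, does not separate the clusters and is a candidate for removal.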
[Figure: clustering variables]
Step 3: Choice of similarity measure: Distance Measures
Measuring distance between two observations/objects
o Distance (or dissimilarity) Measures
– Euclidean Distance
– Minkowski Metric
– Euclidean Distance for Standardized Data
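Distance Measures (2)
Minkowski metric between cases i and j over p variables:
$d_{ij} = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{r} \right)^{1/r}$
Setting r = 2 gives the Euclidean distance:
$d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^{2}}$
Example: Euclidean Distance = $\sqrt{0.04 + 0.09 + 0.25} = \sqrt{0.38} \approx 0.616$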
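The distance calculation can be sketched in a few lines. The exponent is called `r` here, which is a naming assumption, and the two cases are hypothetical values chosen so that their squared differences are 0.04, 0.09, and 0.25, matching the worked example.

```python
def minkowski(x, y, r):
    """Minkowski distance between two cases; r = 2 is Euclidean, r = 1 is city-block."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

# Hypothetical cases with coordinate differences 0.2, 0.3, and 0.5.
case_i = (1.0, 2.0, 3.0)
case_j = (1.2, 2.3, 3.5)

euclidean = minkowski(case_i, case_j, 2)
city_block = minkowski(case_i, case_j, 1)
print(round(euclidean, 3))   # 0.616
print(round(city_block, 3))  # 1.0
```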
Standardization of variables
o Note: Euclidean distance depends on the scale of the variables! Variables with large
values will contribute more to the distance measure than variables with small values.
o Standardization of variables is commonly preferred to avoid problems due to
different scales.
o Clustering variables should be standardized whenever possible.
– Most commonly done using Z-scores
$Z = \frac{X - \mu}{\sigma}$
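A short sketch of Z-score standardization, assuming the population standard deviation is used for σ (the slides do not say whether the population or sample version is intended). The income and age values are hypothetical.

```python
import statistics

def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical variables on very different scales.
income = [30000, 45000, 60000, 75000, 90000]
age = [25, 35, 45, 55, 65]

z_income = z_scores(income)
z_age = z_scores(age)
print([round(z, 3) for z in z_income])
print([round(z, 3) for z in z_age])
```

Before standardization, income differences would dominate any Euclidean distance; afterwards both variables are on the same scale and contribute comparably.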
Step 4: Deciding on clustering algorithm
Measuring distance between two clusters
o Centroid method: the distance between the two
cluster centroids. The centroid of a merged cluster is
a weighted combination of the centroids of the two
individual clusters, where the weights are
proportional to the sizes of the clusters.
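The centroid method can be sketched as follows: the distance between two clusters is the distance between their centroids, and the merged centroid is the size-weighted average. The centroids, cluster sizes, and function names below are hypothetical.

```python
import math

def centroid_distance(c1, c2):
    """Distance between two clusters under the centroid method."""
    return math.dist(c1, c2)

def merged_centroid(c1, n1, c2, n2):
    """Centroid of the merged cluster, weighted by the two cluster sizes."""
    total = n1 + n2
    return tuple((n1 * a + n2 * b) / total for a, b in zip(c1, c2))

# Hypothetical clusters: three objects centred at (0, 0), one object at (4, 0).
distance = centroid_distance((0.0, 0.0), (4.0, 0.0))
new_centroid = merged_centroid((0.0, 0.0), 3, (4.0, 0.0), 1)
print(distance)      # 4.0
print(new_centroid)  # (1.0, 0.0)
```

The merged centroid sits closer to the larger cluster, which is what weighting by size means in practice.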
Ward’s Method
o Ward’s Method considers the “loss of
information” that occurs when observations
are clustered together.
o Ward’s method chooses the
configuration that results in the smallest
incremental loss of information.
o Ward’s method tends to produce clusters of
similar shape and size.
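A minimal sketch of the merge choice, assuming "loss of information" is measured as the increase in the error sum of squares (ESS) caused by a merge, which is the usual reading of Ward's criterion. The clusters and function names are hypothetical.

```python
from itertools import combinations

def ess(cluster):
    """Error sum of squares: squared distances of members to their centroid."""
    n = len(cluster)
    dims = len(cluster[0])
    center = [sum(p[d] for p in cluster) / n for d in range(dims)]
    return sum((p[d] - center[d]) ** 2 for p in cluster for d in range(dims))

def ward_cost(a, b):
    """Increase in total ESS if clusters a and b are merged."""
    return ess(a + b) - ess(a) - ess(b)

def best_merge(clusters):
    """Indices of the pair whose merge loses the least information."""
    return min(combinations(range(len(clusters)), 2),
               key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]))

# Hypothetical current clusters.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(1.0, 0.0)], [(6.0, 6.0)]]
pair = best_merge(clusters)
cost = ward_cost(clusters[pair[0]], clusters[pair[1]])
print(pair, round(cost, 4))
```

The distant point at (6, 6) stays separate because merging it with either of the other clusters would add far more to the ESS.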
Step 5: Choose the number of clusters
Stopping rules of hierarchical clustering
o Distance-based rules:
– Rule 1: one potential stopping rule is the elbow rule.
– Rule 2: an alternative rule is to look at the dendrogram.
– Distance-based rules do not work well in all cases. It is often difficult to identify where the
break actually occurs.
Based on the above figure, what is your final number of clusters?
Elbow Rule
o One should choose a number of clusters so that adding another
cluster doesn't give much better modelling of the data.
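One simple way to make the elbow rule operational is to stop at the first number of clusters where adding one more reduces the within-cluster variation by less than a chosen fraction of the total. The 10% threshold, the variation values, and the function name `elbow` are all assumptions for illustration; other formalizations of the elbow are equally reasonable.

```python
def elbow(wss, min_gain=0.10):
    """Pick k using a relative-improvement threshold.

    wss[k - 1] is the within-cluster variation with k clusters.
    Returns the smallest k for which moving to k + 1 clusters reduces the
    variation by less than min_gain of the one-cluster total.
    """
    total = wss[0]
    for k in range(1, len(wss)):
        gain = (wss[k - 1] - wss[k]) / total
        if gain < min_gain:
            return k
    return len(wss)

# Hypothetical within-cluster variation for k = 1..6.
wss = [100.0, 45.0, 20.0, 17.0, 15.5, 14.8]
chosen_k = elbow(wss)
print(chosen_k)
```

With these values, going from three to four clusters improves the fit by only 3% of the total, so the sketch settles on three clusters.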
Step 6: Interpretation of the clusters
o The interpretation stage involves examining each cluster in terms of the
cluster variate to name or assign a label accurately describing the nature
of the clusters.
o When starting the interpretation process, one measure frequently used is
the cluster’s centroid.
o The profiling and interpretation of the clusters, however, achieve more
than just description and are essential elements in selecting between
cluster solutions when the stopping rules indicate more than one
appropriate cluster solution.
Profiling variables
Example: Income, Age, Education
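Profiling by centroid can be sketched as a per-cluster mean of each variable. The observations, cluster labels, and function name `cluster_centroids` below are hypothetical; the variable names follow the example above.

```python
def cluster_centroids(rows, labels, names):
    """Mean of every variable within each cluster."""
    groups = {}
    for row, lab in zip(rows, labels):
        groups.setdefault(lab, []).append(row)
    return {
        lab: {name: sum(r[i] for r in members) / len(members)
              for i, name in enumerate(names)}
        for lab, members in groups.items()
    }

# Hypothetical observations: (income, age, years of education).
names = ("Income", "Age", "Education")
rows = [(30000, 25, 12), (32000, 27, 12), (90000, 50, 18), (94000, 54, 16)]
labels = [0, 0, 1, 1]

profile = cluster_centroids(rows, labels, names)
for lab, means in profile.items():
    print(lab, means)
```

Comparing the centroids side by side suggests labels such as "younger, lower income" and "older, higher income, more education" for the two hypothetical clusters.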
Hierarchical Clustering: Problems and Limitations
o Once a decision is made to combine two clusters, it cannot be undone.
o Determining the number of clusters is hard. It requires both mathematical and practical
considerations.