BA2 7 Cluster
Cluster Analysis
• Segmenting markets: Cities or regions with similar or common traits can be grouped on
the basis of climatic or socio-economic conditions.
• Insurance: Identifying groups of motor insurance policy holders with a high average
claim cost.
• City-planning: Identifying groups of houses according to their house type, value, and
geographical location.
• Career planning and training analysis: For human resource planning, people can be
grouped into clusters on the basis of their education, experience, aptitude, and
aspirations.
• The similarity or dissimilarity of clusters is judged by the distance between the clusters.
• The data are assumed to be standardized.
• Collinearity among the variables is minimal.
• There are no significant outliers.
• The sample needs to be representative of the population.
• Responses are taken at face value, ignoring the mood of the data provider and similar situational factors.
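Because distance-based clustering is scale-sensitive, the standardization assumption above matters in practice. A minimal sketch of z-score standardization, using hypothetical respondent data (income on a large scale, age on a small one):

```python
# Sketch: z-score standardization before clustering (hypothetical data).
# Each variable is rescaled to mean 0 and standard deviation 1 so that
# no single variable dominates the distance calculation.
import numpy as np

def standardize(X):
    """Return the z-scores of each column of X."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Hypothetical respondents: income (large scale) and age (small scale)
X = np.array([[52000, 25],
              [61000, 40],
              [48000, 33],
              [75000, 51]])
Z = standardize(X)
print(Z.mean(axis=0))  # approximately [0, 0]
print(Z.std(axis=0))   # approximately [1, 1]
```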
Cluster Analysis vs Factor Analysis
• Interpret and profile clusters
• Assess the validity of clustering
Formulation of the Problem
• City- block (Manhattan) distance. Uses the sum of the variables’ absolute
differences
• Hierarchical algorithms:
• Tree-like structure for understanding the levels of observations
• Typical methods: Diana, Agnes
• Non-hierarchical algorithms:
• A centroid is chosen and the distance from the centroid is measured.
• Typical methods: K-means
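The city-block (Manhattan) distance described above is simply the sum of the variables' absolute differences, which takes only a few lines to sketch (the example values are hypothetical):

```python
# Sketch: city-block (Manhattan) distance between two observations --
# the sum of the absolute differences across their variables.
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

print(manhattan([1, 2, 3], [4, 0, 3]))  # 3 + 2 + 0 = 5
```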
Hierarchical Clustering
• The sample size is moderate (generally 300-400, and not exceeding 1000).
• Types:
• Agglomerative Algorithm
• Divisive Algorithm
• The procedure starts with the number of clusters equal to the number of
respondents. It then calculates the distance between each observation and all other
observations.
• This process continues until all observations are grouped together into one cluster.
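The agglomerative loop described above can be sketched in pure Python. This is a toy single-linkage version (chosen for brevity, not the only possible linkage) on hypothetical one-dimensional data: every respondent starts as its own cluster, and the two closest clusters are merged until one remains.

```python
# Sketch of the agglomerative idea: start with one cluster per observation,
# then repeatedly merge the two closest clusters (single linkage here)
# until a single cluster remains. Data are hypothetical 1-D values.
import itertools

def single_linkage_merge(points):
    clusters = [[p] for p in points]          # one cluster per respondent
    history = []                              # record each merge
    while len(clusters) > 1:
        # find the pair of clusters with the smallest inter-point distance
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: min(abs(a - b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        history.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

print(single_linkage_merge([1.0, 1.2, 5.0, 5.1]))
# merges the closest pair first, and ends with all points in one cluster
```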
• Ward’s method
The most commonly used method. It uses a variance measure for clustering observations,
relying on the idea that the sum of squares within a cluster should be minimal. Hence, at
every step the within-cluster variance is calculated for all candidate merges, and the
clusters whose merge yields the minimum within-cluster sum of squares are grouped
together. This agglomerative process continues at every step until all observations are
sequentially grouped to form one single cluster.
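Ward's method is available off the shelf; a minimal sketch using SciPy (assuming SciPy is installed; the data are hypothetical):

```python
# Sketch: Ward's method via SciPy (assumption: scipy is available).
# At each step, the pair of clusters whose merge produces the smallest
# increase in the within-cluster sum of squares is joined.
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1.0], [1.1], [5.0], [5.2], [9.0]])  # hypothetical 1-D data
Z = linkage(X, method="ward")
print(Z)  # each row: the two clusters merged, merge distance, new cluster size
```

For N observations the linkage matrix has N - 1 rows, one per merge, and the final row joins all N observations.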
How many clusters to select?
- No specific rule
- Depending on the business problem / context
Dendrogram:
Graphical representation (tree graph) of the results of a hierarchical procedure.
Starting with each object as a separate cluster, the dendrogram shows graphically
how the clusters are combined at each step of the procedure until all are
contained in a single cluster
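The dendrogram can be computed directly from a linkage matrix; a sketch with SciPy (assumed installed; `no_plot=True` returns the tree's coordinates instead of drawing, which is enough to inspect the merge order):

```python
# Sketch: building a dendrogram from a Ward linkage (assumes scipy).
# no_plot=True skips drawing and returns the tree structure instead.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1.0], [1.1], [5.0], [5.2]])  # hypothetical data
Z = linkage(X, method="ward")
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])  # the leaf order: similar observations end up adjacent
```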
Exercise: Cluster Formation and Dendrogram
From the given distance matrix, form clusters, create a dendrogram, and measure overall similarity
     A     B     C     D     E     F
A    0
B    0.23  0
C    0.22  0.15  0
In order to select a new cluster at each step, every possible combination of clusters must be
considered. This cumbersome procedure makes the analysis practically impossible to perform
by hand, so a computer is a necessity for most data sets containing more than a handful of
data points.
Output - Dendrogram
• The horizontal axis indicates the cases and the vertical axis indicates the distance (semi-partial R2)
• As one moves up the vertical axis, the items joined into a cluster become more dissimilar
• Drawing a line parallel to the horizontal axis gives the clusters formed at that distance: each
vertical line it intersects corresponds to one cluster.
• For example, if a line is drawn at distance = 0.6 and there are two intersection points with the
vertical lines, two clusters can be formed at that distance.
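Cutting the tree at a chosen distance, as described above, is what SciPy's `fcluster` does; a sketch (SciPy assumed installed, data hypothetical):

```python
# Sketch: cutting the dendrogram at a distance threshold (assumes scipy).
# Observations whose merges happen below the threshold share a cluster.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0], [1.2], [5.0], [5.1]])   # hypothetical data
Z = linkage(X, method="single")
labels = fcluster(Z, t=0.6, criterion="distance")
print(labels)  # two clusters: {1.0, 1.2} and {5.0, 5.1}
```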
Validating Cluster Solutions
1. Separate samples can be collected for the analysis and for validation,
and two separate clustering solutions can be compared. However, this
requires a large sample to accommodate both the analysis and the
validation of sample adequacy.
• Cluster seeds: Initial cluster centres in the non-hierarchical clustering that are the initial points
from which one starts. Then the clusters are created around these seeds.
• Cluster membership: This indicates the cluster to which a particular person/object
belongs.
• Dendrogram: This is a tree-like diagram that is used to graphically present the cluster results. The
vertical axis represents the objects and the horizontal axis the inter-respondent distance.
The figure is read from left to right.
• Distances between final cluster centres: These are the distances between the individual pairs of
clusters. A robust solution that is able to demarcate the groups distinctly is the one where the
inter cluster distance is large; the larger the distance the more distinct are the clusters.
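The seed/membership/centre vocabulary above maps directly onto k-means. A pure-NumPy sketch (the data and seed positions are hypothetical): clusters grow around initial seeds, each observation's membership is the index of its nearest centroid, and the final centroids are the cluster centres.

```python
# Sketch of cluster seeds, membership, and final cluster centres via a
# toy k-means loop (pure NumPy; data and seeds are hypothetical).
import numpy as np

def kmeans(X, seeds, n_iter=10):
    centroids = np.asarray(seeds, dtype=float)   # the cluster seeds
    for _ in range(n_iter):
        # membership: index of the nearest centroid for every observation
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its members
        centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
labels, centres = kmeans(X, seeds=[[0.0, 0.0], [6.0, 6.0]])
print(labels)   # [0, 0, 1, 1]: membership of each observation
print(centres)  # final cluster centres (mean of each cluster's members)
```

Note this toy loop assumes no seed ends up with an empty cluster; production implementations handle that case explicitly.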
Key concepts in cluster analysis
• Entropy group: The individuals or small groups that do not seem to fit into any
cluster.
• Final cluster centres: The mean value of the cluster on each of the variables that
is a part of the cluster variate.
• Hierarchical methods: A step-wise process that starts with the most similar pair
and formulates a tree-like structure composed of separate clusters.
• Non-hierarchical methods: Cluster seeds or centres are the starting points and
one builds individual clusters around it based on some pre-specified distance of
the seeds.
• Proximity matrix: A data matrix that consists of the pair-wise distances/
similarities between the objects. It is an N x N matrix, where N is the
number of objects being clustered.
• Summary: In the non-hierarchical clustering method, the summary reports
the number of cases in each cluster.
• Vertical icicle diagram: Quite similar to the dendrogram, it is a graphical
method to demonstrate the composition of the clusters. The objects are
individually displayed at the top. At any given stage the columns
correspond to the objects being clustered, and the rows correspond to
the number of clusters. An icicle diagram is read from bottom to top.
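The N x N proximity matrix defined above can be built directly from the raw data; a sketch with SciPy's pairwise-distance helpers (SciPy assumed installed, data hypothetical):

```python
# Sketch: building an N x N proximity matrix of pairwise Euclidean
# distances (assumes scipy; the three 2-D points are hypothetical).
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = squareform(pdist(X, metric="euclidean"))
print(D)
# [[ 0.  5. 10.]
#  [ 5.  0.  5.]
#  [10.  5.  0.]]
```

The matrix is symmetric with a zero diagonal, since every object is at distance zero from itself.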