
Cluster Analysis

• A multivariate approach for grouping observations based on similarity among measured variables.
• Cluster analysis classifies individuals or objects into a small number of mutually exclusive and exhaustive groups.
• Individuals or objects are assigned to groups so that there is great similarity within groups and much less similarity between groups.
• Clusters should have high internal (within-cluster) homogeneity and high external (between-cluster) heterogeneity.
What is Cluster Analysis?

• Cluster: a collection of data objects
  • Similar to one another within the same cluster
  • Dissimilar to the objects in other clusters
• Cluster analysis
  • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
  • Unsupervised learning: no predefined classes
• Typical applications
  • As a stand-alone tool to get insight into data distribution
  • As a preprocessing step for other algorithms
Applications of Cluster Analysis

• Market segmentation: Customers and potential customers can be split into smaller, more homogeneous groups using the method.
• Segmenting industries: The same grouping principle can be applied to industrial consumers.
• Segmenting markets: Cities or regions with similar or common traits can be grouped on the basis of climatic or socio-economic conditions.
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost.
• City planning: Identifying groups of houses according to their house type, value, and geographical location.
• Career planning and training analysis: For human resource planning, people can be grouped into clusters on the basis of their education/experience or aptitude and aspirations.
• Segmenting sectors/instruments: Factors such as raw material cost, financial allocations, and seasonality are used to group sectors together to understand the growth and performance of a group of industries.
Quality: What Is Good Clustering?

• A good clustering method will produce high-quality clusters with
  • high intra-class similarity
  • low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
Cluster Analysis - Assumptions

• The similarity and dissimilarity of clusters can be differentiated by the distance between the clusters.
• The data are assumed to be standardized.
• Collinearity among the variables is minimal.
• There are no significant outliers.
• The sample needs to be representative of the population.
• Responses are assumed to be unaffected by the mood of the data provider and other situational factors.
Cluster Analysis vs Factor Analysis

Cluster Analysis: grouping is based on distance (proximity).
Factor Analysis: grouping is based on patterns of variation (correlation).

In factor analysis, we form groups of variables based on several people's responses to those variables. In cluster analysis, by contrast, we group people based on their responses to several variables.
Developing a research plan for Cluster Analysis

1. Formulate the problem
2. Select the distance measure
3. Select the clustering procedure
4. Decide on the number of clusters
5. Interpret and profile the clusters
6. Assess the validity of the clustering
Formulation of the Problem

• In cluster analysis, we need to select the variables on which clustering should be based.
• The variables selected must be relevant to the business problem.
• The variables can be chosen by referring to the literature.


Distance measures
Several distance measures are available, each with specific characteristics.
• Euclidean distance. The most commonly used measure, often referred to as straight-line distance.
• City-block (Manhattan) distance. Uses the sum of the variables' absolute differences.
• Chebychev distance. The maximum of the absolute differences in the clustering variables' values. Frequently used when working with metric (or ordinal) data. (A sketch comparing the three measures follows this list.)
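A minimal sketch of the three measures on two observations, using numpy (the values are illustrative):

```python
import numpy as np

# Two observations described by the same three clustering variables.
x = np.array([2.0, 5.0, 1.0])
y = np.array([4.0, 1.0, 3.0])

# Euclidean (straight-line) distance: square root of the summed squared differences.
euclidean = np.sqrt(np.sum((x - y) ** 2))   # sqrt(4 + 16 + 4) = 4.899

# City-block (Manhattan) distance: sum of the absolute differences.
manhattan = np.sum(np.abs(x - y))           # 2 + 4 + 2 = 8

# Chebychev distance: maximum absolute difference across the variables.
chebychev = np.max(np.abs(x - y))           # max(2, 4, 2) = 4

print(euclidean, manhattan, chebychev)
```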
Formation of Clusters

Clustering algorithms are broadly classified into two types:

• Hierarchical algorithms:
  • Tree-like structure for understanding the levels of observations
  • Typical methods: DIANA, AGNES

• Non-hierarchical algorithms:
  • A centroid is chosen and the distance from the centroid is measured.
  • Typical methods: K-means (see the sketch below)
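A minimal sketch of non-hierarchical (K-means) clustering using scikit-learn, on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: six observations measured on two variables.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# K-means: k centroids are chosen, each observation is assigned to its
# nearest centroid, and the centroids are then updated iteratively.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster membership for each observation
print(km.cluster_centers_)  # final centroids
```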
Hierarchical Clustering

• Uses the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.

[Diagram: agglomerative clustering (AGNES) merges objects a, b, c, d, e into {a,b} and {d,e}, then {c,d,e}, and finally {a,b,c,d,e} across steps 0-4; divisive clustering (DIANA) performs the same splits in the reverse direction, steps 4-0.]
Hierarchical Clustering
• HCA is preferred when the sample size is moderate (generally 300-400, not exceeding 1,000).

• Types:
  • Agglomerative algorithm
  • Divisive algorithm

• The agglomerative algorithm groups the closest objects one at a time: it identifies the most similar objects and then groups them.

• In the beginning, the procedure starts with the number of clusters equal to the number of respondents. It then calculates the distance between each observation and all other observations.

• This process continues until all observations are grouped together into one cluster.

• This procedure is called agglomerative hierarchical clustering (a sketch follows).
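A minimal sketch of this agglomerative process using scipy's linkage function, on hypothetical data; each row of the output records one merge step:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five respondents measured on two variables (hypothetical values).
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# Agglomerative clustering: start with 5 singleton clusters and repeatedly
# merge the two closest clusters until one cluster remains.
Z = linkage(X, method="single", metric="euclidean")

# Each row of Z records one merge: (cluster i, cluster j, distance, new size).
print(Z)
```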


Hierarchical Agglomerative Methods
• Single linkage method
In this method the distance between two clusters is defined to be the distance between the two closest members, or neighbours (minimum distance).

• Furthest neighbour method (complete linkage method)
In this case the distance between two clusters is defined to be the maximum distance between members, i.e. the distance between the two subjects that are furthest apart.
Hierarchical Agglomerative Methods
• Average (between-groups) linkage method (sometimes referred to as UPGMA)
The distance between two clusters is calculated as the average distance between all pairs of subjects in the two clusters. This is considered to be a fairly robust method.

• Ward's method
The most commonly used method. It uses a variance measure for clustering observations, relying on the concept that the sum of squares within clusters should be minimal. Hence, at every step, the within-cluster variance is calculated for each possible merge, and the clusters with the minimum within-cluster sum-of-squares value are grouped together. This agglomerative process continues at every step until all observations are sequentially grouped into one single cluster. (A sketch comparing these linkage methods follows.)
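To see how the linkage choice changes the merge behaviour, a small comparison sketch with scipy, using the same hypothetical respondents as in the earlier sketch:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Same hypothetical respondents as before.
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

# single   : nearest-neighbour distance between clusters
# complete : furthest-neighbour distance
# average  : mean pairwise distance (UPGMA)
# ward     : merge the pair giving the smallest increase in the
#            within-cluster sum of squares
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)
    print(method, Z[-1, 2])  # criterion value at the final merge
```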
How many clusters to select?

- No specific rule
- Depending on the business problem / context
Example

Suppose a marketing researcher wishes to determine market segments in a community based on patterns of loyalty to brands and stores. A small sample of seven respondents is selected as a pilot test of how cluster analysis is applied. Two measures of loyalty, V1 (store loyalty) and V2 (brand loyalty), were measured for each respondent on a 0-10 scale.
Simple Example
How do we measure similarity?
Proximity Matrix of Euclidean Distance Between Observations
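The slide's proximity matrix itself is not reproduced in this text. A minimal sketch of how it would be computed with scipy; the V1/V2 scores below are illustrative (they are consistent with the merge distances reported in the agglomerative process table later):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# V1 (store loyalty) and V2 (brand loyalty) for respondents A-G
# (illustrative values, not reproduced from the slide itself).
scores = np.array([[3, 2],   # A
                   [4, 5],   # B
                   [4, 7],   # C
                   [2, 7],   # D
                   [6, 6],   # E
                   [7, 7],   # F
                   [6, 4]])  # G

# Proximity matrix: pairwise Euclidean distances between all respondents.
proximity = squareform(pdist(scores, metric="euclidean"))
print(np.round(proximity, 3))
```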
How do we form clusters?
SIMPLE RULE:

Identify the two most similar (closest) observations not already in the same cluster and combine them.

We apply this rule repeatedly to generate a number of cluster solutions, starting with each observation as its own "cluster" and then combining two clusters at a time until all observations are in a single cluster. This process is termed a hierarchical procedure because it moves in a stepwise fashion to form an entire range of cluster solutions. It is also an agglomerative method because clusters are formed by combining existing clusters.
How many groups do we form?

Therefore, the three-cluster solution of Step 4 seems the most appropriate for a final cluster solution, with two equally sized clusters, (B-C-D) and (E-F-G), and a single outlying observation (A).

This approach is particularly useful in identifying outliers, such as Observation A. It also depicts the relative size of varying clusters, although it becomes unwieldy when the number of observations increases.
How do we form Clusters?

AGGLOMERATIVE PROCESS AND CLUSTER SOLUTION

Step  Minimum Distance Between    Observation  Cluster Membership       Number of  Overall Similarity Measure
      Unclustered Observations    Pair                                  Clusters   (Average Within-Cluster Distance)
-     Initial solution            -            (A)(B)(C)(D)(E)(F)(G)    7          0
1     1.414                       E-F          (A)(B)(C)(D)(E-F)(G)     6          1.414
2     2.000                       E-G          (A)(B)(C)(D)(E-F-G)      5          2.192
3     2.000                       C-D          (A)(B)(C-D)(E-F-G)       4          2.144
4     2.000                       B-C          (A)(B-C-D)(E-F-G)        3          2.234
5     2.236                       B-E          (A)(B-C-D-E-F-G)         2          2.896
6     3.162                       A-B          (A-B-C-D-E-F-G)          1          3.420
7     3.162                       F-G          -                        -          -
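The overall similarity measure in the last column is the average pairwise distance within clusters. A sketch of that calculation, using the same illustrative scores as in the earlier proximity-matrix sketch; for the three-cluster solution (A)(B-C-D)(E-F-G) it returns 2.234, matching the table:

```python
import numpy as np
from itertools import combinations

def avg_within_cluster_distance(scores, clusters):
    """Overall similarity measure: mean pairwise Euclidean distance
    over all pairs of observations that share a cluster."""
    dists = []
    for members in clusters:
        for i, j in combinations(members, 2):
            dists.append(np.linalg.norm(scores[i] - scores[j]))
    return np.mean(dists) if dists else 0.0

# Illustrative scores for A-G (indices 0..6), as in the earlier sketch.
scores = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]])

# Cluster membership after Step 4: (A)(B-C-D)(E-F-G).
print(avg_within_cluster_distance(scores, [[0], [1, 2, 3], [4, 5, 6]]))  # 2.234
```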
Graphical Portrayals

Dendrogram:
Graphical representation (tree graph) of the results of a hierarchical procedure.
Starting with each object as a separate cluster, the dendrogram shows graphically
how the clusters are combined at each step of the procedure until all are
contained in a single cluster.
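A minimal sketch of producing such a dendrogram with scipy and matplotlib, on hypothetical data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical observations; any small data set works for the sketch.
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])

Z = linkage(X, method="average")
dendrogram(Z, labels=["a", "b", "c", "d", "e"])
plt.ylabel("Distance at which clusters are merged")
plt.show()
```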
Exercise: Cluster Formation and Dendrogram

From the given distance matrix, form clusters, create the dendrogram, and measure overall similarity (a solution sketch follows the matrix).

      A     B     C     D     E     F
A   0
B   0.23  0
C   0.22  0.15  0
D   0.37  0.20  0.15  0
E   0.34  0.14  0.28  0.29  0
F   0.23  0.25  0.11  0.22  0.39  0
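A sketch of one way to tackle the exercise with scipy; single linkage is an assumption here, since the exercise does not fix the method:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# The exercise's distance matrix, mirrored into full square form.
labels = ["A", "B", "C", "D", "E", "F"]
D = np.array([[0.00, 0.23, 0.22, 0.37, 0.34, 0.23],
              [0.23, 0.00, 0.15, 0.20, 0.14, 0.25],
              [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],
              [0.37, 0.20, 0.15, 0.00, 0.29, 0.22],
              [0.34, 0.14, 0.28, 0.29, 0.00, 0.39],
              [0.23, 0.25, 0.11, 0.22, 0.39, 0.00]])

# linkage() expects the condensed (vector) form of a distance matrix.
Z = linkage(squareform(D), method="single")
print(Z)                      # merge order and merge distances
dendrogram(Z, labels=labels)  # tree graph of the merge sequence
plt.show()
```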


Hierarchical Clustering Method – Ward’s Method
• Ward (1963) proposed a clustering procedure seeking to form the partitions Pn, Pn-1, ..., P1 in a manner that minimizes the loss associated with each grouping, and to quantify that loss in a form that is readily interpretable.
• At each step in the analysis, the union of every possible cluster pair is considered, and the two clusters whose fusion results in the minimum increase in 'information loss' are combined.
• Information loss is defined by Ward in terms of an error sum-of-squares criterion, ESS.
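The ESS criterion is conventionally written as follows (a standard formulation, added here for reference, since the slide names ESS without giving the formula):

```latex
\mathrm{ESS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \bar{x}_k \rVert^2
```

where C_1, ..., C_K are the clusters and x̄_k is the centroid (mean vector) of cluster C_k. At each step, the two clusters whose merger produces the smallest increase in ESS are combined.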
Hierarchical Clustering Method – Ward’s Method
Like other clustering methods, Ward's method starts with n clusters, each containing a single object. These n clusters are combined to make one cluster containing all objects. At each step, the process creates a new cluster that minimizes variance, measured by an index called E (also called the sum-of-squares index).

At each step, the following calculations are made to find E:

1. Find the mean of each cluster.
2. Calculate the distance between each object in a particular cluster and that cluster's mean.
3. Square the distances from Step 2.
4. Sum (add up) the squared values from Step 3.
5. Add up all the sums of squares from Step 4.

To select a new cluster at each step, every possible combination of clusters must be considered. This cumbersome procedure makes it practically impossible to perform by hand, making a computer a necessity for most data sets containing more than a handful of data points. (A sketch of the E calculation follows.)
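A minimal sketch of the E calculation, following the five steps above; the data and the candidate partition are hypothetical:

```python
import numpy as np

def ward_index_E(X, clusters):
    """Sum-of-squares index E: per cluster, find the mean, square each
    object's distance to that mean, sum within the cluster, then sum
    across clusters (steps 1-5 above)."""
    total = 0.0
    for members in clusters:
        pts = X[members]
        mean = pts.mean(axis=0)                 # step 1: cluster mean
        sq = np.sum((pts - mean) ** 2, axis=1)  # steps 2-3: squared distances
        total += sq.sum()                       # steps 4-5: sum within, then across
    return total

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])
print(ward_index_E(X, [[0, 1], [2, 3], [4]]))  # E for one candidate partition
```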
Output - Dendrogram

• The horizontal axis indicates the cases and the vertical axis indicates the distance (semi-partial R2).
• As one moves up the vertical axis, the items within a cluster become more dissimilar.
• Drawing a horizontal line parallel to the horizontal axis gives the clusters formed where that line intersects the vertical lines.
• For example, in this dendrogram, if a line is drawn at distance = 0.6, there are two intersection points with the vertical lines, indicating that two clusters can be formed at that distance. (A sketch of such a cut follows.)
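Programmatically, "drawing the horizontal line" corresponds to cutting the tree at a distance threshold. A sketch using scipy's fcluster; the data and the 3.0 threshold are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8], [9.0, 1.0]])
Z = linkage(X, method="average")

# fcluster with criterion="distance" returns, for each observation, the
# label of the cluster it belongs to below the chosen cut height.
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)
```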
Validating Cluster Solutions
1. Separate samples can be collected for analysis and validation purposes, and two separate clustering solutions can be compared. However, this requires a large sample to accommodate both the analysis and the validation.

2. Internal validation of the cluster analysis is conducted:
   a) Compactness:
      • Measures intra-cluster homogeneity.
      • A cluster is homogeneous when the sum of squares within the cluster is small compared to the sum of squares between clusters (see the sketch after this list).
   b) Connectedness:
      • Represents how well the data group together.
      • Groups the nearest observations together.
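A minimal sketch of a compactness check, comparing within-cluster and between-cluster sums of squares on toy data:

```python
import numpy as np

def compactness(X, labels):
    """Within-cluster (WSS) vs between-cluster (BSS) sum of squares:
    homogeneous clusters have small WSS relative to BSS."""
    grand_mean = X.mean(axis=0)
    wss = bss = 0.0
    for k in np.unique(labels):
        pts = X[labels == k]
        centroid = pts.mean(axis=0)
        wss += np.sum((pts - centroid) ** 2)                   # spread inside the cluster
        bss += len(pts) * np.sum((centroid - grand_mean) ** 2)  # separation of the cluster
    return wss, bss

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.2, 4.8]])
labels = np.array([0, 0, 1, 1])
print(compactness(X, labels))  # small WSS relative to BSS -> compact clusters
```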
Key concepts in cluster analysis

• Agglomeration schedule: In a hierarchical method, provides information on the objects being combined, starting with the most similar pair and then, at each stage, the object joining that pair at a later stage.
• ANOVA table: The one-way ANOVA statistics for each clustering variable. The higher the ANOVA (F) value, the greater the difference between the clusters on that variable (a sketch follows this list).
• Cluster variate: The variables or parameters representing the objects to be clustered and used to calculate the similarity between objects.
• Cluster centroid: The average values of the objects on all the variables in the cluster variate.
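A minimal sketch of the one-way ANOVA behind that table, using scipy with illustrative values for one clustering variable across three clusters:

```python
import numpy as np
from scipy.stats import f_oneway

# Values of one clustering variable within each of three clusters
# (illustrative data): a larger F statistic means the clusters
# differ more on this variable.
cluster_1 = np.array([2.1, 2.4, 1.9])
cluster_2 = np.array([5.0, 5.3, 4.8])
cluster_3 = np.array([8.9, 9.2, 9.1])

F, p = f_oneway(cluster_1, cluster_2, cluster_3)
print(F, p)
```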
Key concepts in cluster analysis

• Cluster seeds: Initial cluster centres in non-hierarchical clustering; the initial points from which one starts. The clusters are then created around these seeds.

• Cluster membership: Indicates the cluster to which a particular person/object belongs.

• Dendrogram: A tree-like diagram used to graphically present the cluster results. The vertical axis represents the objects and the horizontal axis represents the inter-respondent distance. The figure is read from left to right.

• Distances between final cluster centres: The distances between individual pairs of clusters. A robust solution that demarcates the groups distinctly is one where the inter-cluster distance is large; the larger the distance, the more distinct the clusters.
Key concepts in cluster analysis

• Entropy group: The individuals or small groups that do not seem to fit into any cluster.

• Final cluster centres: The mean value of the cluster on each of the variables that form the cluster variate.

• Hierarchical methods: A step-wise process that starts with the most similar pair and formulates a tree-like structure composed of separate clusters.

• Non-hierarchical methods: Cluster seeds or centres are the starting points, and one builds individual clusters around them based on some pre-specified distance from the seeds.
Key concepts in cluster analysis
• Proximity matrix: A data matrix consisting of pair-wise distances/similarities between the objects. It is an N x N matrix, where N is the number of objects being clustered.

• Summary: The number of cases in each cluster, reported in the non-hierarchical clustering method.

• Vertical icicle diagram: Quite similar to the dendrogram, a graphical method to demonstrate the composition of the clusters. The objects are individually displayed at the top. At any given stage the columns correspond to the objects being clustered, and the rows correspond to the number of clusters. An icicle diagram is read from bottom to top.
