Week 09
ANALYTICS
Cluster analysis
Saji K Mathew, PhD
Professor, Department of Management Studies
INDIAN INSTITUTE OF TECHNOLOGY MADRAS
Market customization
} Segmentation involves identifying groups of consumers
who behave differently in response to a given marketing
strategy
} It leads to the formation of distinct subsets such that
members are similar within a segment but differ across
segments
Clustering
} Unsupervised
} To discover natural groupings
} Which are the sub-segments among the current subscribers?
} Clustering is not statistically sound but practically
insightful
} Issue of generalizability (local optima)
} Guided by human intelligence; depends on the clustering bases (variables) chosen
} Applications:
} Biology, medicine, psychology, market structure, geography
} Data availability
} The clustering solution can be strongly affected by
} Irrelevant variables
} Undifferentiated variables
Clustering problem
} A marketer wants to segment a small community based
on store loyalty (V1) and brand loyalty (V2).
} A small sample of 7 respondents was chosen
} A 0-10 scale was used to measure both V1 and V2
Data
Respondent            A  B  C  D  E  F  G
V1 (store loyalty)    3  4  4  2  6  7  6
V2 (brand loyalty)    2  5  7  7  6  7  4
Euclidean distances between respondents (lower triangular):

Observation      A      B      C      D      E      F
B            3.162
C            5.099  2.000
D            5.099  2.828  2.000
E            5.000  2.236  2.236  4.123
F            6.403  3.606  3.000  5.000  1.414
G            3.606  2.236  3.606  5.000  2.000  3.162
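The distance matrix above can be reproduced directly from the data table. A minimal sketch using NumPy (labels and scores taken from the example; broadcasting computes all pairwise Euclidean distances at once):

```python
import numpy as np

# Respondent scores on V1 (store loyalty) and V2 (brand loyalty)
labels = ["A", "B", "C", "D", "E", "F", "G"]
X = np.array([[3, 2], [4, 5], [4, 7], [2, 7],
              [6, 6], [7, 7], [6, 4]], dtype=float)

# All pairwise Euclidean distances via broadcasting:
# d(i, j) = sqrt(sum_f (x_if - x_jf)^2)
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

print(round(float(D[1, 0]), 3))  # d(B, A) → 3.162
print(round(float(D[4, 5]), 3))  # d(E, F) → 1.414
```

The smallest off-diagonal entry, d(E, F) = 1.414, is the pair an agglomerative algorithm would merge first.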
Agglomerative clustering
Clustering algorithms
} Hierarchical clustering
} Agglomerative (bottom up)
} Single linkage
} Complete linkage
} Composite measures
¨ Average linkage: average distance between all pairs of objects across the two clusters (example discussed)
¨ Centroid: distance between cluster centroids
¨ Ward's: increase in the within-cluster sum of squares when two clusters are merged
} Divisive (top down)
} Partitioning
} K-means, K-medoids, K-modes
} Density based: grow a cluster as long as the density (number of data points within a neighborhood) exceeds a minimum threshold
} Grid based: quantize the object space into a finite number of cells that form a grid structure
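As a concrete illustration of agglomerative clustering, here is a naive single-linkage implementation run on the 7-respondent example above (a sketch for exposition only; in practice one would use an optimized library such as SciPy's `scipy.cluster.hierarchy`):

```python
import math

# V1/V2 scores from the clustering-problem example
points = {"A": (3, 2), "B": (4, 5), "C": (4, 7), "D": (2, 7),
          "E": (6, 6), "F": (7, 7), "G": (6, 4)}

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def single_linkage(points, k):
    # Start with every object as its own cluster, then repeatedly merge
    # the two clusters whose *closest* members are nearest (single
    # linkage), until only k clusters remain.
    clusters = [{name} for name in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

clusters3 = single_linkage(points, 3)
print(sorted(sorted(c) for c in clusters3))
# → [['A'], ['B', 'C', 'D'], ['E', 'F', 'G']]
```

Complete linkage differs only in the merge criterion: replace `min` with `max` in the inner distance so clusters merge by their *farthest* pair of members.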
Measures of distance
} Clustering can work with different data types
} Distance is measured differently for various data types
} Proximity can be expressed as similarity or dissimilarity:
sim(i, j) = 1 - dissim(i, j)
Data structures and measures of distance (similarity)
} Data matrix (n objects × p variables)

$$\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}$$

} Dissimilarity matrix (n × n, lower triangular)

$$\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}$$
Measures of distance
} Metric data
} Euclidean (q = 2), Manhattan (q = 1), and Minkowski distances

$$d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$$

} Ordinal data
} Use standardization: map the rank r_{if} in {1, ..., M_f} to the unit interval

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

} Binary data
} Jaccard coefficient, based on the 2×2 contingency table of objects i and j
(a: both 1; b: i = 1, j = 0; c: i = 0, j = 1; d: both 0):

            j = 1   j = 0   sum
  i = 1       a       b     a + b
  i = 0       c       d     c + d
  sum       a + c   b + d     p

$$sim_{Jaccard}(i,j) = \frac{a}{a + b + c}$$

} Categorical data
} Match ratio (m: # of matches, p: total # of variables)

$$d(i,j) = \frac{p - m}{p}$$

} Within-cluster sum of squared errors (the objective minimized by k-means):

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - c_i \rVert^2$$
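The distance measures above can be sketched as a few small functions (illustrative names; pure Python, no libraries assumed):

```python
# Minkowski distance of order q between numeric vectors
# (q = 1: Manhattan, q = 2: Euclidean)
def minkowski(x, y, q=2):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

# Jaccard similarity for binary vectors: a / (a + b + c)
def jaccard_sim(x, y):
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    return a / (a + b + c)

# Match-ratio distance for categorical vectors: (p - m) / p
def match_distance(x, y):
    p = len(x)
    m = sum(1 for xi, yi in zip(x, y) if xi == yi)
    return (p - m) / p

# Ordinal standardization: rank r in {1..M} mapped to [0, 1]
def ordinal_z(r, M):
    return (r - 1) / (M - 1)

print(round(minkowski((3, 2), (4, 5)), 3))   # Euclidean → 3.162
print(minkowski((3, 2), (4, 5), q=1))        # Manhattan → 4.0
```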
Determining number of clusters
} Involves both practical and theoretical considerations
} Practical: How many clusters are useful/actionable?
} E.g.: number of market sub-segments
} Heuristic methods
} Rule of thumb: about √(n/2) clusters, where n is the number of objects (data points), i.e., roughly √(2n) data points (objects) per cluster
} Elbow method
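The elbow method can be sketched as follows: run k-means for increasing k, record the within-cluster sum of squared errors E, and look for the k at which the decrease levels off. A minimal NumPy implementation of Lloyd's algorithm on the 7-respondent example (the initialization scheme and iteration count are illustrative assumptions):

```python
import numpy as np

def kmeans_sse(X, k, iters=50, seed=0):
    # Basic Lloyd's algorithm; returns the within-cluster
    # sum of squared errors E for the converged solution.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each point to its nearest center
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # recompute each center as the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    return float(((X - centers[labels]) ** 2).sum())

X = np.array([[3, 2], [4, 5], [4, 7], [2, 7],
              [6, 6], [7, 7], [6, 4]], dtype=float)
sse = {k: kmeans_sse(X, k) for k in range(1, 6)}

# Plot k vs. SSE and look for the "elbow" where the decrease levels off
print({k: round(v, 2) for k, v in sse.items()})
```

SSE always shrinks as k grows (it is zero when every point is its own cluster), so the useful signal is the bend in the curve, not the minimum.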
Comparing partitioning methods
} K-Means
} Solution is sensitive to outliers (as mean is used for centroid)
The Hopkins statistic H lies in the range (0, 1). Uniformly distributed data will
have a Hopkins statistic of about 0.5, because the values of α_i and β_i will be
similar. For clustered data, the values of α_i will typically be much lower than
β_i, resulting in a Hopkins statistic closer to 1. Therefore, a high value of the
Hopkins statistic H is indicative of highly clustered data points (Agarwal, 2015).
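A sketch of the Hopkins statistic in NumPy. Since this excerpt does not define α_i and β_i, the following assumes the common formulation: α_i is the nearest-neighbor distance of a sampled data point within the data, β_i is the nearest-neighbor distance of a uniformly generated point (drawn from the data's bounding box) to the data, and H = Σβ_i / (Σα_i + Σβ_i). The sample size and bounding-box choice are illustrative assumptions:

```python
import numpy as np

def hopkins(X, m=None, seed=0):
    # H = sum(beta) / (sum(alpha) + sum(beta))
    rng = np.random.default_rng(seed)
    n = len(X)
    m = m or max(1, n // 10)

    def nn_dist(p, data):
        # Euclidean distance from point p to its nearest neighbor in data
        return np.sqrt(((data - p) ** 2).sum(axis=1)).min()

    # alpha_i: sampled data points vs. the rest of the data
    sample_idx = rng.choice(n, size=m, replace=False)
    alpha = [nn_dist(X[i], np.delete(X, i, axis=0)) for i in sample_idx]

    # beta_i: uniform random points (in the bounding box) vs. the data
    lo, hi = X.min(axis=0), X.max(axis=0)
    uniform = rng.uniform(lo, hi, size=(m, X.shape[1]))
    beta = [nn_dist(u, X) for u in uniform]

    return sum(beta) / (sum(alpha) + sum(beta))

# Two tight, well-separated clusters: alpha << beta, so H approaches 1
rng = np.random.default_rng(1)
clustered = np.vstack([rng.normal(0, 0.05, (100, 2)),
                       rng.normal(5, 0.05, (100, 2))])
H = hopkins(clustered, m=20)
print(round(H, 2))
```

Per the passage, uniform data should yield H ≈ 0.5 and strongly clustered data H → 1.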