Week 09


BUSINESS INTELLIGENCE &

ANALYTICS

Cluster analysis
Saji K Mathew, PhD
Professor, Department of Management Studies
INDIAN INSTITUTE OF TECHNOLOGY MADRAS
Market customization
} Segmentation involves identifying groups of consumers
who behave differently in response to a given marketing
strategy
} It leads to the formation of distinct subsets (segments) such that
members are similar within a segment but differ across segments
Clustering
} Unsupervised
} To discover natural groupings
} What are the sub-segments among the current subscribers?
} Clustering is not statistically sound but practically
insightful
} Issue of generalizability (local optima)
} Guided by human intelligence, depends on bases
} Applications:
} Biology, medicine, psychology, market structure, geography

} How is clustering different from factor analysis?


Steps in cluster analysis
} Decide which variables to use as base variables
} Descriptor vs behavior
} Select measures of similarity
} How to measure similarity?
} Choose an algorithm to group similar objects
} How to assign objects to clusters?
} Create clusters
} How many clusters?
} Describe the clusters (Profiling)
Choosing variables as bases
} Basis for bases
} Reason for clustering
} New product design
¨ Benefits sought
} Positioning
¨ Perception about existing brands
} Customer loyalty/retention
¨ Recency, Frequency, Monetary value (RFM)

} Data availability
} Clustering solution could be strongly affected by
} Irrelevant variables
} Undifferentiated variables
Clustering problem
} A marketer wants to segment a small community based
on store loyalty (V1) and brand loyalty (V2).
} A small sample of 7 respondents was chosen
} A 0-10 scale was used to measure both V1 and V2
Data

Respondents

Clustering variable A B C D E F G
V1 3 4 4 2 6 7 6
V2 2 5 7 7 6 7 4

Euclidean distance between respondents i and j:

$d(i, j) = \sqrt{\sum_{k} (x_{ik} - x_{jk})^2}$

(k: clustering variable; i, j: respondents)
Proximity matrix

Observation      A      B      C      D      E      F      G
A                -
B            3.162      -
C            5.099  2.000      -
D            5.099  2.828  2.000      -
E            5.000  2.236  2.236  4.123      -
F            6.403  3.606  3.000  5.000  1.414      -
G            3.606  2.236  3.606  5.000  2.000  3.162      -
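A minimal sketch (assuming numpy and scipy are available) that reproduces the proximity matrix above from the seven respondents' V1/V2 scores:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# V1 (store loyalty) and V2 (brand loyalty) for respondents A..G
labels = list("ABCDEFG")
X = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]])

# Pairwise Euclidean distances arranged as a symmetric proximity matrix
D = squareform(pdist(X, metric="euclidean"))

for name, row in zip(labels, np.round(D, 3)):
    print(name, row)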
Agglomerative clustering
Clustering algorithms
} Hierarchical clustering
} Agglomerative
} Single linkage
} Complete linkage
} Composite measures
¨ Average linkage
¨ Average similarity of all objects within clusters (example discussed)
¨ Centroid
¨ Distance between cluster centroids
¨ Ward's
¨ Merges the pair of clusters giving the smallest increase in the
within-cluster sum of squares
} Divisive (top down)
} Partitioning
} K-means, K-Medoids, K-Modes
} Density based: Grow a cluster as long as the density (number of data points within a
neighborhood) exceeds a minimum threshold
} Grid based: Quantize the object space into a finite number of cells that
form a grid structure
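A minimal sketch (assuming scipy is available) of agglomerative clustering on the seven-respondent example, comparing the linkage methods listed above; cutting the tree into three clusters is an illustrative choice:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Respondents A..G measured on V1 (store loyalty) and V2 (brand loyalty)
X = np.array([[3, 2], [4, 5], [4, 7], [2, 7], [6, 6], [7, 7], [6, 4]])

for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method)                    # merge history (dendrogram data)
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
    print(f"{method:>8}:", labels)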
Measures of distance
} Clustering could work with different data types
} Distance is measured differently for various data types
} Distance is measured as similarity or dissimilarity

Sim(i,j) = 1 - dissim(i,j)
Data structures and measures of distance (similarity)
} Data matrix (n objects x p variables)

$X = \begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$

} Dissimilarity matrix (n x n, lower triangular since d(i,j) = d(j,i) and d(i,i) = 0)

$D = \begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$
Measures of distance
} Metric data
} Euclidean, Manhattan, Minkowski distances

$d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q \right)^{1/q}$

(q = 1: Manhattan; q = 2: Euclidean)

} Ordinal data
} Use standardization of the rank r_if of object i on variable f (M_f: number of ordered states of f):

$z_{if} = \frac{r_{if} - 1}{M_f - 1}$

} Binary data
} Jaccard coefficient, based on the 2x2 contingency table of objects i and j:

               object j
                1    0    sum
object i   1    a    b    a+b
           0    c    d    c+d
         sum   a+c  b+d    p

$sim_{Jaccard}(i, j) = \frac{a}{a + b + c}$

} Categorical data
} Match ratio (m: number of matching variables, p: total number of variables):

$d(i, j) = \frac{p - m}{p}$

} Within-cluster sum of squared errors (used later in the elbow method):

$E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2$
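A minimal sketch (assuming numpy and scipy are available) of the distance measures above; the example vectors are hypothetical:

import numpy as np
from scipy.spatial.distance import minkowski, jaccard

# Metric data: Minkowski distance with q = 1 (Manhattan) and q = 2 (Euclidean)
x, y = np.array([3, 2, 5]), np.array([4, 5, 1])
print(minkowski(x, y, p=1), minkowski(x, y, p=2))

# Binary data: scipy's jaccard returns the dissimilarity 1 - a/(a+b+c)
b1, b2 = np.array([1, 0, 1, 1, 0]), np.array([1, 1, 1, 0, 0])
print(1 - jaccard(b1, b2))        # Jaccard similarity a/(a+b+c)

# Categorical data: match ratio d(i,j) = (p - m) / p
c1, c2 = np.array(["red", "M", "yes"]), np.array(["red", "L", "yes"])
print(np.mean(c1 != c2))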
Determining number of clusters
} Involves both practical and theoretical considerations
} Practical: How many clusters are useful/actionable?
} E.g.: number of market sub-segments
} Heuristic methods
} Rule of thumb: about √(n/2) clusters, where n is the number of objects (data points),
i.e., roughly √(2n) data points (objects) per cluster
} Elbow method: plot the within-cluster sum of squared errors E against the number of
clusters k and pick the k at the "elbow" of the curve (see the sketch below)
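A minimal sketch (assuming scikit-learn) of the elbow method on hypothetical data: fit K-Means for increasing k and look for the point where the within-cluster sum of squared errors (inertia) stops dropping sharply:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: three Gaussian blobs in two dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # inertia_ = within-cluster SSE (E); the "elbow" suggests k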
Comparing partitioning methods
} K-Means
} Solution is sensitive to outliers (as mean is used for centroid)

} Time complexity: O(nkt)


} n=sample size, k=number of clusters, and t=number of iterations
} K-Medoids
} Instead of the mean as the cluster center, K-Medoids uses the most centrally located
(representative) object in a cluster as the reference point

} Less sensitive to outliers (robust); more computationally expensive


} K-Modes
} Used for categorical data (mode instead of mean for centroid)
} Mixed data: standardize the data / combine K-Means and K-Modes
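A minimal sketch contrasting the two approaches on hypothetical data with one extreme outlier (assuming scikit-learn, plus the separate scikit-learn-extra package for KMedoids):

import numpy as np
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids  # from scikit-learn-extra, not core scikit-learn

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 2)), [[50.0, 50.0]]])  # one extreme outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
kmed = KMedoids(n_clusters=2, random_state=0).fit(X)

print("K-Means centers:\n", km.cluster_centers_)      # means can be pulled toward the outlier
print("K-Medoids centers:\n", kmed.cluster_centers_)  # medoids are actual data points (robust)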
Cluster quality
} Extrinsic methods
} Used when prior labels/categories are known (supervised)
} Measures: Homogeneity, small cluster preservation etc.
} Intrinsic methods
} Silhouette coefficient:
} a(o) is the average distance between o and all other objects in the cluster
to which o belongs
} b(o) is the minimum average distance from o to all clusters to which o
does not belong

$s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}$

} The value of the silhouette coefficient is between -1 and 1
} Pseudo F = (between-cluster sum of squares / (c - 1)) / (within-cluster sum of squares / (n - c))
} Pseudo F describes the ratio of between-cluster variance to within-cluster variance
} c: number of clusters; n: total number of objects
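A minimal sketch (assuming scikit-learn) of both intrinsic measures on hypothetical data; scikit-learn's calinski_harabasz_score computes the pseudo-F ratio described above:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Hypothetical data: three well-separated blobs
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.6, size=(60, 2))
               for c in ([0, 0], [6, 0], [3, 5])])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Silhouette:", round(silhouette_score(X, labels), 3))       # in [-1, 1], higher is better
print("Pseudo F:", round(calinski_harabasz_score(X, labels), 1))  # between/within variance ratio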
Segmentation using clustering

PRIZM: Potential Rating Index for Zip Markets

} Matches 36,000 zip codes to 40 lifestyle (VALS) clusters
} Example segments: Movers & Shakers, Kids & Cul-de-Sacs
Evaluation using clustering tendency
} Determines whether a dataset has a meaningful (non-random)
structure for clustering; uniformly (randomly) distributed data
is not useful for clustering
} Hopkins statistic (H) is widely used
Let D be the data set. A sample S of r synthetic data points is randomly generated in
the domain of the data space. Another sample R of r data points is selected from D.
Let α1 ... αr be the distances of the data points in the sample R ⊆ D to their nearest
neighbors within the original D.
Similarly, let β1 ... βr be the distances of the data points in the synthetic sample S to
their nearest neighbors within D.

$H = \frac{\sum_{i=1}^{r} \beta_i}{\sum_{i=1}^{r} (\alpha_i + \beta_i)}$

The Hopkins statistic lies in the range (0, 1). Uniformly distributed data will have a
Hopkins statistic of about 0.5 because the values of αi and βi will be similar. For
clustered data, the values of αi will typically be much lower than βi, which results in a
Hopkins statistic closer to 1. Therefore, a high value of the Hopkins statistic H is
indicative of highly clustered data points (Aggarwal, 2015).
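A minimal sketch (assuming numpy and scikit-learn) of the Hopkins statistic as described above; the sample size r and the uniform sampling over the bounding box of D are illustrative choices:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(D, r=50, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=2).fit(D)

    # R: r real points from D; take the 2nd neighbor to skip the point itself (alpha_i)
    R = D[rng.choice(D.shape[0], size=r, replace=False)]
    alpha = nn.kneighbors(R, n_neighbors=2)[0][:, 1]

    # S: r synthetic points drawn uniformly from the bounding box of D (beta_i)
    S = rng.uniform(D.min(axis=0), D.max(axis=0), size=(r, D.shape[1]))
    beta = nn.kneighbors(S, n_neighbors=1)[0][:, 0]

    return beta.sum() / (alpha.sum() + beta.sum())  # ~0.5: uniform; close to 1: clustered

# Two tight hypothetical clusters should give H well above 0.5
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in ([0, 0], [5, 5])])
print(round(hopkins(X), 3))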
