
CLUSTERING

SEGMENTATION
What is Clustering?

● Clustering is the process of grouping observations of similar kinds into smaller groups within the larger population.
● Clustering is based on the concepts of similarity and distance; proximity is determined by a distance function.
● It allows the generation of clusters where each group consists of individuals who have common features with each other.
● Clustering is used for knowledge discovery rather than prediction.

Clustering vs Classification

Cluster analysis is similar to classification models, with the difference that the groups are not preset (there are no labels). The goal is to partition the data into clusters, which may or may not be disjoint.
Types of Clustering

● Non-hierarchical methods (e.g. K-means) divide a dataset of N objects into M clusters. Widely used in business analytics.
● Hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains.
Good Clusters

● High intra-class similarity
● Low inter-class similarity
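These two criteria can be checked numerically: `kmeans()` in R decomposes the total sum of squares into within-cluster and between-cluster parts, so a good solution has low `tot.withinss` and a high `betweenss / totss` ratio. A minimal sketch on the iris measurements used later in this deck (the seed and `nstart` are assumptions added here to stabilise the random starts):

```r
# Quality check for a k-means solution: low within-cluster SS,
# high between-cluster SS, relative to the total SS.
set.seed(1234)                    # kmeans starting centers are random
km = kmeans(iris[-5], centers = 3, nstart = 25)
km$totss                          # total sum of squares
km$tot.withinss                   # intra-cluster dispersion (want low)
km$betweenss                      # inter-cluster separation (want high)
km$betweenss / km$totss           # proportion of variance explained, ~0.88 here
```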
Why do Clustering?
Primarily used to perform segmentation, be it customer, product or store.

● Customers
● Products – can be clustered together into hierarchical groups based on their attributes like use, size, brand, flavor, etc.
● Stores – stores with similar characteristics (similar sales, size, customer base, etc.) can be clustered together.
Why do Clustering?
● Anomaly detection – for example, identifying fraudulent transactions. Cluster detection methods can be used on a sample containing only good transactions to determine the shape and size of the “normal” cluster. When a transaction comes along that falls outside the cluster for any reason, it is suspect. This approach has been used in medicine to detect the presence of abnormal cells in tissue samples, and in telecommunications to detect calling patterns indicative of fraud.
● Preprocessing – clustering is often used to break a large set of data into smaller groups that are more amenable to other techniques. For example, logistic regression results can be improved by performing it separately on smaller clusters that behave differently and may follow slightly different distributions.
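The anomaly-detection idea can be sketched in R: fit a single “normal” cluster on known-good data, then flag any new point whose distance from the centre exceeds a cutoff. The data, cutoff, and variable names below are made-up illustrations, not from the deck:

```r
# Sketch: flag transactions that fall far outside the "normal" cluster.
# 'normal' is simulated good-transaction data (amount, item count).
set.seed(42)
normal = cbind(amount = rnorm(200, 100, 10), items = rnorm(200, 5, 1))
km     = kmeans(normal, centers = 1)            # one "normal" cluster
center = km$centers[1, ]
# distance of every good transaction from the cluster centre
d      = sqrt(rowSums((normal - matrix(center, 200, 2, byrow = TRUE))^2))
cutoff = quantile(d, 0.99)                      # tolerate 1% of normal points
# a new transaction is suspect if it lands outside the cutoff
newTx  = c(amount = 500, items = 2)
sqrt(sum((newTx - center)^2)) > cutoff          # TRUE -> flag for review
```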
Business Application of Clustering
A grocery retailer used clustering to segment its 1.3MM loyalty
card customers into 5 different groups based on their buying
behavior. It then adopted customized marketing strategies for
each of these segments in order to target them more
effectively.
Features of Clustering
● Undirected data mining technique. Can be used to identify
hidden patterns and structures in the data without formulating a
specific hypothesis. There is no target variable in clustering.
(eg in previous case, the grocery retailer was not actively trying
to identify fresh food lovers at the start of the analysis. It was
just attempting to understand the different buying behaviors of
its customer base.)
● Identify similarities with respect to specific behaviors or
dimensions. (eg, the objective was to identify customer
segments with similar buying behavior. Hence, clustering was
performed using variables that represent the customer buying
behavior.)
Why do Clustering?
● Discover structures in data without providing an explanation
or interpretation. In other words, cluster analysis simply
discovers patterns in data without explaining why they exist.
The resulting clusters are meaningless by themselves. They
need to be profiled extensively to build their identity, i.e. to
understand what they represent and how they are different from
the parent population. In the retailer’s case, each cluster was
profiled on its buying behavior. Customers in cluster 1 spent a
quarter of their total spend on fresh, organic produce. This was
significantly higher than other customers, who spent less than
5% on this category. This segment of customers was called
‘Fresh food lovers’.
K-means Clustering
Clustering on Built in Data Set : iris

Iris flowers are categorised by the length and width of their petals and sepals
Clustering - iris
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
irisFeatures = iris[-5]  # remove the Species (flower name) column
iriskm1 = kmeans(irisFeatures, centers=3)
iriskm1$size  # number of rows in each cluster
# 50 62 38
iriskm1$cluster  # cluster assigned to each row
# [1] 1 1 1 1
plot(irisFeatures$Sepal.Length, col=iriskm1$cluster)
iriskm1$centers  # characteristics (mean feature values) of each cluster
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.0 3.4 1.5 0.25
2 5.9 2.7 4.4 1.43
3 6.8 3.1 5.7 2.07
Selecting No of Clusters
# Reduce the total within-cluster sum of squares
data = iris[-5]
km1 = kmeans(data, centers=1); km1$tot.withinss
km2 = kmeans(data, centers=2); km2$tot.withinss
km3 = kmeans(data, centers=3); km3$tot.withinss
km4 = kmeans(data, centers=4); km4$tot.withinss
km5 = kmeans(data, centers=5); km5$tot.withinss
library(NbClust)  # NbClust recommends a number of clusters via many indices
nc = NbClust(data, distance="euclidean", min.nc=2, max.nc=15,
             method="average")
# we select the number of clusters at the elbow point: adding more
# clusters does not significantly reduce the total within-ss
cbind(km1$tot.withinss, km2$tot.withinss, km3$tot.withinss,
      km4$tot.withinss, km5$tot.withinss)
[,1] [,2] [,3] [,4] [,5]
[1,] 681 152 79 72 70
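The repeated `kmeans()` calls above can be compressed into a loop; plotting total within-SS against k makes the elbow visible. The seed and `nstart` are assumptions added here to stabilise the random starts:

```r
# Elbow method: total within-cluster SS for k = 1..6 clusters on iris
data = iris[-5]
set.seed(1234)
wss = sapply(1:6, function(k) kmeans(data, centers = k, nstart = 25)$tot.withinss)
round(wss)  # drops sharply up to k = 3, then flattens
plot(1:6, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```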
Scaling of Data : Equal Importance to All Variables
set.seed(1234); marks10 = ceiling(runif(100, 5, 10))
set.seed(1234); marks500 = ceiling(runif(100, 250, 500))
students1 = data.frame(marks10, marks500); head(students1)
km1 = kmeans(students1, centers=3)
km1$centers  # clusters driven almost entirely by marks500 (larger scale)
students2 = scale(students1)  # standardise each column: mean 0, sd 1
km2 = kmeans(students2, centers=3)
km2$centers
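scale() centres each column to mean 0 and standard deviation 1, so marks500 (range 250–500) no longer dominates the Euclidean distance over marks10 (range 5–10). A quick check of that property:

```r
# After scale(), every column has mean ~0 and sd exactly 1,
# so each variable gets equal weight in the distance calculation.
set.seed(1234); marks10  = ceiling(runif(100, 5, 10))
set.seed(1234); marks500 = ceiling(runif(100, 250, 500))
students2 = scale(data.frame(marks10, marks500))
round(colMeans(students2), 10)  # both ~0 after centring
apply(students2, 2, sd)         # both 1 after scaling
```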
Customer Segmentation
set.seed(1234); (age = ceiling(rnorm(50, 45, 10)))
set.seed(1234);(income = ceiling(rnorm(50, 100000, 10000)))
set.seed(1234);(children = sample(c(1,2,3), size=50, replace=T,
prob=c(.4,.3,.2)))
customers = data.frame(age, income, children)
head(customers)
# No of Clusters
library(NbClust)
nc = NbClust(customers, distance="euclidean", min.nc=2, max.nc=15,
method="average")
km1 = kmeans(customers, centers=3)
km1$centers
age income children
1 42 96149 1.7
2 56 109979 1.8
3 33 87810 1.6
Plots in Clustering
library(cluster)
cluster::clusplot(customers, km1$cluster, color=TRUE, shade=TRUE,
labels=2, lines=0)
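Besides clusplot(), the cluster package offers silhouette plots, which score how well each point sits in its cluster (values near 1 = well placed, near 0 = on a boundary). A sketch reusing the customers data from above (a single seed is assumed here rather than the deck's per-variable seeds):

```r
library(cluster)
set.seed(1234)
age      = ceiling(rnorm(50, 45, 10))
income   = ceiling(rnorm(50, 100000, 10000))
children = sample(c(1, 2, 3), size = 50, replace = TRUE)
customers = data.frame(age, income, children)
km1 = kmeans(customers, centers = 3)
sil = silhouette(km1$cluster, dist(customers))  # one score per customer
summary(sil)$avg.width                          # average silhouette width
plot(sil)                                       # silhouette plot per cluster
```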
Applications of Clustering in Business
● Customer Segmentation
○ Needs
○ Purchase Behaviour/ Sales
○ Loyalty
● Product Segmentation
● Store Segmentation
● Grouping Web Pages
● Grouping Patient Types
● Image Processing
● Preprocessing of Data
Summary

● Clustering is a powerful technique to explore patterns and structures within data, and has wide applications in business analytics.
● There are various methods for clustering.
● An analyst should be familiar with multiple clustering algorithms and should be able to apply the most relevant technique as per the business needs.
Further Reading
https://www.slideshare.net/bridgetut/basics-of-clustering
https://www.mapbusinessonline.com/Whitepaper.aspx/Effective-Customer-Segmentation
http://flevy.com/browse/flevypro/market-and-customer-segmentation-2359
http://www.simulace.info/index.php/Customer_segmentation_techniques
Cluster Analysis
Kmeans Clustering
Hierarchical Clustering
Clustering
Clustering in R
https://engineering.eckovation.com/hierarchical-clustering/
Clustering : Similarity Distance Measure
Euclidean Distance
x = c(1, 4, 8, 6, 3, 10, 7, 11, 13, 2)
y = c(5, 9, 15.2, 12, 7, 7, 4, 10, 13
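The Euclidean distance between two vectors is sqrt(sum((x - y)^2)), and base R's dist() gives the same value. Since the y vector on this slide is cut off in the source, the short vectors below are stand-ins for illustration:

```r
# Euclidean distance, computed by hand and via base R's dist()
x = c(1, 4, 8)
y = c(5, 9, 15.2)
sqrt(sum((x - y)^2))  # manual formula
dist(rbind(x, y))     # same value from dist()
```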
