Prrethy-Dr. Huma Lone - AL
2022, Vol. 6, No. 5, 7798–7804
1 Freelance trainer, Pune, India
2 Associate Professor, Rajgad Institute of Management Research and Development, Dhankawadi, Pune-43, India
1 humalone@gmail.com, 2 prajaktawarale@gmail.com
Abstract
Customer segmentation is the division of a business's customers into categories called customer segments, such that the customers in each segment exhibit similar characteristics. This division is built on factors that can directly or indirectly affect the market or business, such as product preferences or expectations, locations, behaviours etc. Customer segmentation can be implemented through clustering, which is one of the most widely recognized machine learning techniques. Cluster analysis is applied in many business applications, from customized marketing to industry analysis. It is an unsupervised learning technique that divides a dataset into a set of meaningful sub-classes, called clusters. It helps to comprehend the natural grouping in a dataset and to create clusters of similar records, based on several measurements made in the form of attributes.
This research paper focuses on creating customer clusters by applying K-means and Agglomerative clustering algorithms to a dataset consisting of 200 customers. Various machine learning libraries in the Python programming language were used to implement the algorithms and visualize the results.
Keywords— Agglomerative clustering, Elbow method, Dendrogram, Clustering, Customer
Segmentation, K-means Clustering
• Start with n clusters (each data point = cluster)
• Two closest data points are merged into one cluster
• At every step, the two clusters with the smallest distance are merged
Figure 2: Agglomerative Clustering Algorithm

Many people are of the opinion that clustering and customer segmentation can be used interchangeably. Even though it is true that clustering is one of the most appropriate and widely used methods for segmentation, it is not the only method. It has been observed that clustering-based segmentation tends to perform better than other segmentation methods. This method involves collecting data about the customers in the form of attributes or features and then discovering different clusters from that data. Finally, the clusters are labelled by analysing their characteristics [11].

DATA COLLECTION AND RESEARCH METHODOLOGY
The data set named “Mall_customers.csv” was taken from Kaggle.com. Kaggle is an online platform for data scientists and machine learning experts [12]. It permits users to collect and publish datasets, explore, and construct models in a web-based data-science environment. The dataset contained basic information about customers of a supermarket mall, such as customer ID, age, gender, annual income, and spending score. There were five features (CustomerId, Gender, Age, Annual Income and Spending Score) for 200 customers. Two different clustering algorithms, K-means and Agglomerative clustering, were used for customer segmentation [7]. Since data with more than two dimensions is difficult to visualize, only the last two features (Annual Income and Spending Score) were considered as input to the two clustering algorithms. Python, a preferred and widely used programming language for machine learning applications, was used as the data analysis tool [11]. The code for both clustering algorithms was executed in Jupyter Lab, which is an IDE (Interactive Development Environment) for Python.

PYTHON AS DATA ANALYSIS TOOL
Widely used open-source programming languages such as Python and R, and tools such as Tableau, XLMiner, RapidMiner, MATLAB, SAS, Stata etc., are extensively used for data analysis, interactive computing, and data visualization. Python is a general-purpose programming language which is designed to be simple and easy to understand. Artificial intelligence, machine learning and deep learning applications are developed easily with Python’s libraries such as numpy, pandas, matplotlib and scikit-learn. These support libraries have made Python an incredible option as a primary language for data science applications [11]. Figure 3 shows the essential Python libraries for machine learning applications [14].

NumPy
• Numerical Python, used for numerical computing
• Supports the data structures and algorithms for most scientific applications containing numerical data
pandas
• Library gives high-level data structures (such as DataFrame)
• Advanced indexing functionality for reshaping, slicing and dicing
• Primary library used in data science applications
matplotlib
• Python library for generating plots and other 2D data visualizations
• Many data scientists consider it the default visualization tool
sklearn
• Tremendous capability of ML algorithms such as Regression, Classification (Support vector machine, K-nearest neighbours, Random forest, Logistic regression etc.), Clustering (K-means, Agglomerative) etc.
• More than 1,500 contributors from around the world
Figure 3: Essential Python Libraries for Machine Learning Applications

DATA ANALYSIS AND INTERPRETATION
Figure 4 demonstrates the process for implementing K-means and Agglomerative clustering on the selected dataset in Python. Importing the libraries, importing the dataset, finding the optimal number of clusters, training the model on the dataset and visualising the clusters are the five essential steps for applying clustering algorithms for customer segmentation.

Figure 5: Python libraries for implementing K-means clustering

Import Libraries
• numpy, matplotlib.pyplot, pandas, sklearn
Import Dataset
• Library & module: pandas
• Function: read_csv, iloc
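The import step can be sketched as follows. This is a minimal illustration, not the authors' code: the rows in SAMPLE_CSV are invented stand-ins for Mall_Customers.csv, the column names are an assumption about the file layout, and load_features is a hypothetical helper name. It shows pandas read_csv reading the table and iloc keeping only the last two features.

```python
import io

import pandas as pd

# Illustrative stand-in for Mall_Customers.csv: these rows are invented, and the
# column names are an assumption about the layout of the five features.
SAMPLE_CSV = """CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
1,Male,19,15,39
2,Female,21,15,81
3,Female,20,16,6
4,Male,35,78,90
5,Female,47,120,16
"""


def load_features(source):
    """Read the customer table and keep only the last two columns as features."""
    dataset = pd.read_csv(source)
    # iloc selects by position: all rows, last two columns
    # (Annual Income and Spending Score).
    X = dataset.iloc[:, -2:].values
    return dataset, X


dataset, X = load_features(io.StringIO(SAMPLE_CSV))
print(dataset.shape)  # (5, 5)
print(X.shape)        # (5, 2)
```

With the real file in the working directory, the same helper could be called with a path, for example load_features("Mall_Customers.csv").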
Finding the optimal number of clusters
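The elbow method behind this step can be sketched as follows. The data here are synthetic blobs generated with make_blobs, standing in for the two selected features. The range of 1 to 10 clusters, the k-means++ initialisation and the helper name within_cluster_sums are assumptions made for illustration. The sketch relies on the fact that scikit-learn exposes the total within-cluster sum of squares (WSS) of a fitted KMeans model as inertia_.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data standing in for (Annual Income, Spending Score);
# the five blob centres are an assumption made for this illustration only.
X, _ = make_blobs(n_samples=200, centers=5, cluster_std=1.0, random_state=42)


def within_cluster_sums(X, k_max=10):
    """Return the total WSS (KMeans inertia_) for k = 1..k_max."""
    wss = []
    for k in range(1, k_max + 1):
        model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
        model.fit(X)
        wss.append(model.inertia_)
    return wss


wss = within_cluster_sums(X)
for k, value in enumerate(wss, start=1):
    print(k, round(value, 1))
```

Plotting these values against k, for instance with matplotlib's plt.plot(range(1, 11), wss), gives the curve whose bend (the "elbow") suggests the number of clusters.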
where adding another cluster does not further enhance the total within-cluster sum of squares (WSS) much. From the curve, the optimal number of clusters was chosen as five.

Considering the best number of clusters as 5, the model was trained on the selected dataset using the sklearn library. The Matplotlib library was used for visualization. Figures 9 and 10 clearly show five clusters as the output of the K-means and Agglomerative clustering algorithms based on two features (Annual Income and Spending Score). Each small circle in Figures 9 and 10 represents a customer in the dataset. The five bigger circles in Figure 9 are the centroids of the clusters. Although each algorithm has its own advantages and disadvantages, the outputs of both algorithms were similar for the selected dataset. Hence both methods can be appropriately applied to the selected dataset.
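The training step for both algorithms can be sketched as below. The data are synthetic points around five invented centres, not the paper's dataset. Ward linkage is an assumption, since the text does not state which linkage was used. The adjusted Rand index is added here only as one possible way to quantify how similar the two outputs are, and is not something the paper reports.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the two selected features; the five group centres
# below are invented for illustration and are not the paper's data.
rng = np.random.default_rng(0)
centres = np.array([[25, 20], [25, 80], [55, 50], [88, 17], [87, 82]])
X = np.vstack([c + rng.normal(scale=4.0, size=(40, 2)) for c in centres])

# K-means with five clusters; cluster_centers_ holds the five centroids.
kmeans = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

# Ward linkage is an assumption; the text does not state the linkage used.
agglo = AgglomerativeClustering(n_clusters=5, linkage="ward")
agglo_labels = agglo.fit_predict(X)

# Adjusted Rand index: 1.0 means the two partitions agree exactly,
# regardless of how the cluster ids are numbered.
agreement = adjusted_rand_score(kmeans_labels, agglo_labels)
print(kmeans.cluster_centers_.shape)  # (5, 2)
print(agreement)
```

A scatter plot per cluster, for example plt.scatter(X[kmeans_labels == i, 0], X[kmeans_labels == i, 1]) for each cluster id i, would give a visualization comparable in spirit to the cluster figures described above.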
Target, Extravagant and Impecunious based on their spending score and annual income. Figure 11 shows the characteristics of the five clusters.

Cluster 1: Saver
• Low spending score and high annual income
Cluster 2: Vigilant
• Average spending score and average annual income
Cluster 3: Target
• High spending score and high annual income
Cluster 4: Extravagant
• High spending score and low annual income
Cluster 5: Impecunious
• Low spending score and low annual income
Figure 11: Five Clusters with their characteristics

Saver (Cluster 1): The customers in this category have low spending score and high annual income. Despite having high income, this cluster of customers spends less. One of the reasons could be that these customers are unsatisfied or disappointed with the mall’s services. The mall’s marketing team should focus on this cluster, as customers in this cluster have the potential to spend.
Vigilant (Cluster 2): The customers in this category have average spending score and average annual income. This cluster of customers may not be the prime target of the mall, but the marketing team should direct its efforts towards retaining them and increasing their spending score.
Target (Cluster 3): The customers in this category have high spending score and high annual income. These customers can be considered a prime source of the mall’s profit. They are satisfied with the mall’s service and are regular customers. The marketing team should target them with attractive offers to gain more profits.
Extravagant (Cluster 4): The customers in this category have high spending score and low annual income. Despite having low income, they tend to buy products. This could be because these customers are satisfied with the mall’s service. The marketing team may not want to focus on these customers as they have low income, so beyond a certain limit they cannot be attracted to offers from the mall. The focus could be limited to ensuring that the mall does not lose them.
Impecunious (Cluster 5): The customers in this category have low income and hence they prefer to spend less. The mall’s marketing team will not be interested in focusing on this cluster of customers.

DISCUSSION AND CONCLUSION
Understanding a business’s customer base is extremely important for any business organisation. Customer segmentation is one of the ways to gain a deeper understanding of customer behaviour. It is one of the important applications of cluster analysis amongst many applications spread across different domains. Sales and marketing efforts can be well designed for these clusters of customers to achieve a high return on investment. Unsupervised machine learning algorithms such as K-means and Agglomerative clustering can be easily applied using Python support libraries to summarize and visualize the clusters. The current research applied K-means and Agglomerative clustering algorithms to the Mall_Customers dataset and discovered different clusters from that data. Finally, these clusters were labelled as Saver, Vigilant, Target, Extravagant and Impecunious by analysing the characteristics of the clusters. These clusters can help the marketing team of the mall to focus on these segments of customers differently and achieve maximum profit.

REFERENCES
1. H. H. Ali and L. E. Kadhum, ‘K-Means Clustering Algorithm Applications in Data Mining and Pattern Recognition’, Int. J. Sci. Res., vol. 6, no. 8, pp. 1577–1584, 2017.
2. T. A. Nguyen, ‘Customer Segmentation: A Step By Step Guide For B2B’, 2018. https://openviewpartners.com/blog/cust
© 2022 JPPW. All rights reserved
Dr. Huma Lone, et. al. 7804