
2023 International Conference on Artificial Intelligence and Knowledge Discovery in Concurrent Engineering (ICECONF) | 979-8-3503-3436-4/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICECONF57129.2023.10084339

A Machine Learning approach to segment the customers of online sales data for better and efficient marketing purposes

Mathesh T1, Sumathy G2 and Maheshwari A2
1 Department of Information Technology, Jeppiaar Engineering College, Chennai, India
E-mail: mati02official@gmail.com, sumathyg@srmist.edu.in, maheshwa1@srmist.edu.in

Abstract— The Internet is becoming huge and is used by a more diverse audience every day. The amount of data gathered from the platform through different online lead companies is gargantuan, so it needs to be maintained and segregated in order to extract meaningful information from it. A lot of companies have started to gather customer data through their own platforms or through various vendors who sell it to sales companies, organizations or individuals for a profit. Sometimes these data are large and scattered enough to confuse even big sales organizations. For better and more effective marketing on these sales data, we propose to use four machine learning clustering algorithms (K-Means, Agglomerative, Mean-Shift and DBSCAN) to find customer segments in the data provided. Based on these segmented customer groups, we can find patterns and decide which customer group is better for which business.

Keywords— Machine Learning, Customer Segmentation, K-Means, Agglomerative, DBSCAN, Marketing

I. INTRODUCTION

A major objective of the research and study of these online lead data is to find meaningful customer segments and use them for better and more efficient marketing. There are various problems with the online leads, which are not structured in a proper way. Some of the problems include wastage of time and resources in sending promotional emails or telecalling customers who do not fit the target audience of the sales company, organization or individual; also, from a huge amount of lead data, less than one per cent of sales conversions are achieved due to an improper and generic approach to marketing. There is no point in trying to sell a youth product to a customer lead who is above 50 years of age.

Only a small percentage of the companies who buy sales leads have a team set up to segregate and segment the customer leads; the others fall into the trap of using a common approach for all the leads, which leads to a poor sales conversion rate. Individual salespersons are the most affected because they only have domain knowledge of their respective sales areas but fail to understand the clusters or segments present in the sales data. The basic or common filters provided by various software packages do not work very well for multi-dimensional columns. So, it is important for each sales executive to understand the customer segments in the data and customise their marketing technique for each group.

In this paper, for research purposes, we bought a set of sales leads from a company named Shrine. There are three important sources of this data: internal customers of the company, data gathered from various online networking tools, and job profiles from a hiring company. The data gathered is diverse and contains columns related to the leads' personal and work data. A lot of analysis and cleaning was required for the data to be usable by the models, which is covered in the Methodology section.

Clustering is a machine-learning approach for finding patterns in data and grouping them as clusters. Unsupervised clustering is a method to find clusters without a target or a previously clustered column. We are going to use four unsupervised clustering algorithms in this paper, namely K-Means, Agglomerative, Mean Shift and DBSCAN. The same data, in a different format depending on the algorithm, is used to train these models, and their outputs are studied and analysed.

Authorized licensed use limited to: SRM Institute of Science and Technology. Downloaded on September 04,2023 at 08:04:00 UTC from IEEE Xplore. Restrictions apply.
II. RELATED WORK

Every market is unique and heterogeneous, and finding the right consumers is a big and difficult task for the sales and marketing team. Gillian has stressed the importance of market segmentation by showing how segmentation allows an organization to achieve a high ROI (Return On Investment) [1]. Online leads have opened a completely new path for sales organizations to gather customers; a lot of data is available to them, unlike in the old days when companies had to use alternative methods like campaigns and forms to gather data. There are different ways to gather leads, such as social media, online ads etc. Simon, Kilian, Christian and Alexander have revealed how a typical online lead generation funnel works [2].

Clustering algorithms use machine learning techniques and help us in identifying the clusters in the data. The distance between points is one of the key factors for deciding the clusters, and there are several ways to calculate the distance, such as Euclidean, Minkowski etc. [3]. Danuta and Jan showed how using different clustering algorithms like DBSCAN and K-Means helps in the segmentation of bank customers [4]. A comparative study using local grocery store data helped in identifying clusters and predicting potential and non-potential customers [5].

Segmentation of customers for various domains is discussed in the literature (see [6], [7], [8], [9]). These techniques stress the importance of the quality of the data which is fed to the model for prediction. The original data needs to undergo several steps of tuning and pre-processing in order to be used by the model. Not every data point is in a cluster; real data has some noise. Nika and Ben emphasize how we can proceed with clustering while keeping the noise present in the data as one of the deciding parameters [10]. Labelling each cluster after the model prediction requires some domain knowledge about the data. Sulekha suggests various factors, like demographic, psychographic and behavioural ones, which can be used to determine the naming of the clusters once the model prediction is over [11].

III. RESEARCH METHODOLOGY

The following section covers the entire research methodology, from data collection to model training.

A. Data Collection
The data used in the paper was bought from the company named Shrine, and the three important sources for this data collection are internal customers of the company, data gathered from various online networking tools, and job profiles from a hiring company. The data received was diverse and had a lot of different features like Name, Date Of Birth, Profession, Designation, Current Company etc. Figure 1 shows a partial view of the initial data which was received. There were five different datasets given to us, and EDA (Exploratory Data Analysis) had to be done before proceeding, as there were a lot of anomalies present in the data. The initial dataset consists of 9108 rows, and not all the datasets have all the features present. We had to study and select the features which could result in meaningful analysis. From the 15 columns presented, we narrowed it down to four columns (DOB, Education, Experience, Salary), as these were the common and meaningful data present in all the datasets.

Fig 1. Partial view of online sales data before pre-processing

Fig 2. Final view of the online sales data after pre-processing

B. Data Pre-processing

The final dataset which was used in the models had a dimension of 3107 rows x 4 columns. The elimination of the rest of the rows is due to duplicates, a large number of missing values, and unrealistic or improper values present in the data, which had to be dealt with separately in the data pre-processing step.

Since we require the data to be of the same magnitude, the data had to undergo another round of pre-processing for the four columns. The DOB is not in the right format to be used by the model, so we set a common reference date for all the DOB data and computed age from it. The education column had ordinal categorical values like Bachelor's and Master's degrees, which needed to be converted into a numerical format to be used by the model. Since education follows an order, a particular weight has been assigned to each degree; for example, a PhD, being the highest degree present in the data, is mapped to the highest value, while 10th Grade, being the lowest educational level, is mapped to a lower value. The salary column was in different scales (thousands, lakhs, crores) and different currency formats (INR, USD, SGD). So, all the data in the column was converted into INR and the values were brought to a per-ten-thousand scale by dividing all the values. The experience column was converted from years to months in order to match the magnitude of the data present in the other columns. We can see the final data after pre-processing in figure 2.

C. Models
Instead of using a single model for analysing the data, we chose to study the significance and impact of different models on the data by comparatively studying their respective outputs. The four models discussed in this paper are K-Means, Agglomerative, Mean-Shift and DBSCAN, as these models have completely different approaches to identifying clusters.

K-Means Clustering:
K-Means is one of the most used unsupervised clustering algorithms, where clusters are identified through a partitioning principle using centroids. The K-Means algorithm works as follows:

1. Initialize the K clusters which need to be found from the data. The K value can be found through the elbow mechanism, which is discussed in the experiment section [12].
2. Once the K value is decided, the algorithm picks k random centroids, and around each centroid, nearby data points are picked as clusters.
3. These nearby data points are determined by a distance metric. The one which we are using in the model is the Euclidean distance, where the distance is calculated as the square root of the sum of squared differences between the two points. Let the two points be P(x1, x2) and Q(y1, y2); the Euclidean distance is given by:

d(P, Q) = √((x1 − y1)² + (x2 − y2)²)

4. The clusters' centroids keep shifting for a given number of iterations or until the centroid allocation no longer changes.

Agglomerative Clustering:
Agglomerative clustering is one of the well-known hierarchical clustering algorithms, where clustering starts
by taking each data point as a single cluster and then merging these clusters based on the similarity between them. The procedure is repeated until all the data points are merged under a single cluster. Linkage determines how two clusters are merged into one in agglomerative clustering. There are various types of linkage, and we are using Ward's method in this paper as it helps in creating compact and even-sized clusters.

In Ward's method, the distance between two clusters C1 and C2 is the increase in the total within-cluster sum of squared distances caused by merging them. In terms of the cluster sizes |C1|, |C2| and their centroids c1 and c2, this can be written as:

d(C1, C2) = (|C1| · |C2| / (|C1| + |C2|)) · ||c1 − c2||²

Mean-Shift Clustering:
Mean-Shift clustering identifies clusters by iteratively shifting their centroids towards regions of higher probability density. Once a centroid cannot move further, the algorithm stops. The advantage of the Mean-Shift algorithm is that, unlike K-Means, we do not need to specify the number of clusters to be found: Mean Shift automatically finds the number of clusters based on the density. The kernel K is a weighting function used in the calculation of the density function. The bandwidth h is one of the important parameters needed for Mean-Shift; it can be considered the radius of the circle which determines how many data points fall within a cluster. Below is the formula for calculating the estimated PDF (Probability Density Function), where n, d and x correspond to the number of points, the number of dimensions and a data point respectively:

f(x) = (1 / (n · h^d)) · Σ(i=1..n) K((x − xi) / h)

DBSCAN Clustering:
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. As mentioned in the literature review, not all data points belong to clusters; real data has some noise. DBSCAN helps us in identifying those clusters, and the algorithm is not affected by outliers, unlike K-Means, which is very sensitive to them. There are two parameters, epsilon (eps) and minimum points (minpoints), which are needed for DBSCAN to perform. The eps describes the radius around a point, and minpoints establishes a condition for the minimum number of points needed to form a cluster. How these values are derived is explained in the Features and Parameter Selection section.

D. Features and Parameter Selection
Selecting the right parameters for each clustering algorithm, with the right data, has a huge significance and impact on successful model creation. For training the models and selecting the respective features, the famous Python library scikit-learn is used [13]. This section describes how the features are scaled and how the parameters are selected for each algorithm.
Fig 3: Distributions of Age, Salary and Experience, shown as histograms.
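The outlier capping and Z-score scaling applied to columns like these can be sketched as follows. The salary values below are made up, and the conventional 1.5 × IQR whisker limits are assumed for the "upper and lower limit of IQR":

```python
import numpy as np

def iqr_cap(col):
    # Replace values outside the IQR-based limits with the limit itself;
    # the conventional 1.5 * IQR whiskers are assumed here.
    q1, q3 = np.percentile(col, [25, 75])
    lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return np.clip(col, lower, upper)

def z_score(col):
    # How many standard deviations each value lies from the column mean.
    return (col - col.mean()) / col.std()

# Made-up salary column (10k INR scale) with one extreme outlier.
salary = np.array([3.0, 4.5, 5.0, 6.0, 5.5, 4.0, 300.0])
capped = iqr_cap(salary)
scaled = z_score(capped)

print(capped.max())  # 8.0 -- the outlier is pulled down to the upper IQR limit
```

After capping, the standardized column has mean ~0 and unit standard deviation, which brings the four features to comparable magnitudes before clustering.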

Scaling and Outlier Removal:
The data we use in this research was gathered in such a way that the data points are scattered with very high and very low magnitudes of values, to match the real-world scenario of data gathering. Some of the algorithms we use are sensitive to outliers [14], which need to be taken care of before using them in the model in order to achieve quality results. Figure 3 shows the distributions of the data points in our data through histograms.

We can clearly see some of the data lying in the outlier zone. To remove the outliers, we replace the high and low values with the upper and lower limits of the IQR (Interquartile Range). The IQR describes the middle 50% of values when ordered from lowest to highest. For scaling the data, we use the Z-score mechanism, since the variance is more or less comparable between the columns. The Z-score helps in understanding how far a value differs from the mean in units of the standard deviation. Following is the formula of the Z-score, where x, μ and σ correspond to the observed value, the mean and the standard deviation respectively:

z = (x − μ) / σ

Determination of K value:
For selecting the right K value, the elbow method is used. It works by finding the Sum of Squared Errors (SSE) of the data points with respect to their centroids for different values of K. The formula for SSE is given below, where x, μi and k correspond to a data point, a centroid and the number of clusters respectively:

SSE = Σ(i=1..k) Σ(x ∈ Ci) ||x − μi||²

Fig 4: Plot for determination of the elbow with respect to the WSS (within-cluster SSE score).

Since K-Means is highly sensitive to outliers, the scaled data is used in finding the elbow value. When the K value increases, the SSE decreases. The point where the elbow is found is chosen as the K value. From figure 4, using the elbow method, the K value for the data is chosen as 2, and the model is trained accordingly.

Training the Agglomerative model:
Agglomerative clustering considers all the data points present in the plane as separate clusters at first and then proceeds to group them together into one final cluster. Since agglomerative clustering is also sensitive to outliers, the same scaled data is used for the model-building process. As explained in the Models section, we use Ward's linkage for merging clusters.

In order to visualize and determine the output of the clustering process, a dendrogram is used. A dendrogram shows the hierarchical relationship between the clusters. Figure 5 shows the dendrogram for the scaled data. From this dendrogram, the number of clusters is chosen as 3, based on the longest vertical line which is not cut by any horizontal line. The Agglomerative model is then trained with 3 clusters.

Fig 5: Dendrogram for the scaled data.

Bandwidth for Mean Shift Algorithm:
Mean Shift works by treating the data points in the feature space as samples from an empirical probability density function. The determination of the correct bandwidth is crucial for the clustering process in the Mean Shift algorithm. To determine the right bandwidth, we calculate it based on the K-nearest-neighbours mechanism defined by Dorin and Peter [15].

The bandwidth was found to be 197.59 using the above approach. With the same bandwidth value, the
Mean Shift algorithm was trained, and 14 different clusters were found in the output. These outputs are analysed in the Result and Analysis section.

Determining the epsilon and min points:
There are two parameters that need to be estimated for the DBSCAN algorithm. One is the epsilon value, which describes the radius of the circle, and the other is the minimum points, which specifies the condition on how many points, at minimum, should be present in a cluster. A rule of thumb says to determine the minimum number of points based on the (k+1) rule, where k is the number of features present. But we also took some domain knowledge into account by looking at the data. Keeping both conditions in mind, the min points value was chosen as 5.

The determination of the epsilon value is more challenging and needs statistical support for the model to be accurate enough. The following approach was used to determine the epsilon value:
1. Calculate the nearest neighbours for all data points.
2. Find the k-neighbour distances, sort them, and plot them in a graph.
3. The knee point (bend point) in the plot is taken as the epsilon value.

Following the above approach, the plot shown in figure 6 was created. By zooming in on the plot, the epsilon value was found to be 22. These parameters were used to train the DBSCAN algorithm, and 17 unique clusters were formed. Some points were identified as noise by the algorithm and marked as -1, which could be useful during analysis.

Fig 6: Plot to determine the eps value.

IV. EXPERIMENT RESULT AND ANALYSIS

The clusters formed by the K-Means, Agglomerative, Mean-Shift and DBSCAN models are listed in figures 7, 8, 9 and 10 respectively. In the figures, we have listed the mean values of each feature, from which we can create labels based on domain knowledge of the product to be sold. The scale for the degree level is as follows: 5 - PhD, 4 - Masters, 3 - Bachelors, 2.5 - Diploma, 2 - 12th Grade, 1 - 10th Grade. The number of clusters found for the K-Means, Agglomerative, Mean-Shift and DBSCAN algorithms is 2, 3, 14 and 17 respectively, with DBSCAN having a special column for outliers. The salary is given on a 10k-per-annum scale in INR.

Fig 7: K-Means cluster results (mean values).

Fig 8: Agglomerative cluster results (mean values).
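A minimal sketch of training the four models with scikit-learn, under stated assumptions: the blob data is synthetic, the cluster counts (K = 2 for K-Means, 3 for Agglomerative) and the min_samples = 5 choice follow the paper, but the bandwidth and eps values here are picked for the toy data rather than the paper's 197.59 and 22, which depend on its own feature magnitudes.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, MeanShift, DBSCAN

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled 3107 x 4 matrix of
# (age, education, experience, salary) features: two Gaussian blobs.
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(6, 1, (100, 4))])

models = {
    "K-Means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "Agglomerative": AgglomerativeClustering(n_clusters=3, linkage="ward"),
    "Mean-Shift": MeanShift(),                 # bandwidth estimated internally
    "DBSCAN": DBSCAN(eps=1.5, min_samples=5),  # min_samples follows the (k+1) rule
}

results = {}
for name, model in models.items():
    results[name] = model.fit_predict(X)
    # DBSCAN marks noise as -1; exclude it from the cluster count.
    print(name, len(set(results[name]) - {-1}))
```

On the paper's real data the same four calls would simply receive the scaled lead matrix and the parameter values derived in this section.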

Fig 9: Mean-Shift cluster results (mean values).

From the above experiment, we found that both the K-Means and Agglomerative models struggled to identify more clusters in the data. The reason for this is generally the working procedure and algorithm implemented by each model, which was affected by the differences in magnitude present in the data, even after scaling was done. The identified clusters for the K-Means and Agglomerative models can be named based on their salary, as listed in figures 7 and 8. For K-Means, we can name the labels "above higher-middle-class income" and "below higher-middle-class income", and for Agglomerative the three labels can be high class, middle class and low class.

Both the K-Means and Agglomerative models were able to find some meaningful clusters, but they were not meaningful enough for a marketing company to maximize its efficiency in categorizing potential customers for each product. This is where Mean Shift and DBSCAN were able to succeed, mainly due to their density-based approach. The total numbers of clusters formed by Mean Shift and DBSCAN were 14 and 17 respectively. The labels for both Mean Shift and DBSCAN can be determined on a different basis, using domain knowledge and the mean values present in each feature of the clusters. For example, if we look at cluster 4 in figure 9 (the Mean-Shift cluster results), we can see that the people in that customer segment earn above 3 crores INR per annum on average but are under 40 years of age. This customer segment can be named "customers with high potential for luxury products".

Unlike the K-Means and Agglomerative models, where the labels are only based on salary, much more meaningful information was obtained from the Mean Shift and DBSCAN models. In the DBSCAN model we were able to identify the noise present in the data and segregate it as a separate cluster. These noise points are extraordinary cases which represent only a minority portion of a customer segmentation problem and can be used only in very few cases.

When analyzing all the models, bringing all the column scales to an equal margin was one of the key factors that helped in identifying the right clusters, even though the experience and salary columns had a slightly larger magnitude than the other columns. To measure and analyze how good the clustering done by each algorithm is, the Silhouette Score is used. The silhouette score measures how well each data point has been assigned to the correct cluster, using the formula below, where 'a' corresponds to the average distance between a data point and the other points of its own cluster, and 'b' corresponds to the average distance between the data point and the points of the nearest neighbouring cluster:

s = (b − a) / max(a, b)
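The silhouette computation with the noise label removed, as done for DBSCAN here, can be sketched as follows; the blobs are synthetic and the eps value is chosen for this toy data, not taken from the paper:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Two tight, well-separated synthetic blobs standing in for the scaled leads.
X = np.vstack([rng.normal(0, 0.3, (80, 2)), rng.normal(4, 0.3, (80, 2))])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Drop the noise points (label -1) before scoring, so the silhouette is
# computed only over points assigned to an actual cluster.
mask = labels != -1
score = silhouette_score(X[mask], labels[mask])
print(round(score, 2))  # close to 1 for well-separated blobs
```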

Fig 10: DBSCAN cluster results (mean values).

Fig 11: Silhouette scores of the 4 models used.

Figure 11 shows the silhouette score of each model used. For DBSCAN, the silhouette score is calculated after removing the -1 (noise) labels, to make sure the score is computed meaningfully. We can see a high score for both the Agglomerative and Mean Shift models, denoting how well and efficiently the clustering was done for each data point. But the analysis cannot be determined based on the silhouette score alone: the silhouette score is one of the metrics for analyzing unsupervised clustering model results, and the real-world scenario may be different. From a usage perspective, we found DBSCAN to be much more effective than Agglomerative despite its lower score, mainly because of how well it was able to differentiate and segment customers based on different principles and dimensions.

People with domain knowledge will prefer DBSCAN or Mean Shift, as having more customer segments helps them to market and reach the target audience for each product more effectively. Thus, algorithms which use a density-based approach, like DBSCAN or Mean Shift, tend to be more useful for any marketing effort that aims to identify different customer segments from a large pile of scattered data.

V. CONCLUSION

We were able to identify some meaningful customer segments, which could be used for effective marketing, from an online lead source which had a huge number of different features present in different scales. A huge amount of time had to be allocated to data pre-processing and tuning in order to arrive at the correct result. Having many customer segments, segregated based on various factors from online lead data, can easily help any marketing team or individual in targeting, customizing or personalizing their marketing campaign for a specific target audience, which could in turn maximize their revenue and sales conversion rate.

REFERENCES
[1] G. Martin, "The importance of marketing segmentation," American Journal of Business Education (AJBE), vol. 4, no. 6, pp. 15-18, 2011.
[2] S. Stolz, K. Wisskirchen, C. Schlereth and A. Hoffmann, "Online Lead Generation: An Emerging Industry," Marketing Review St. Gallen, June 2021.
[3] K. Bindra and A. Mishra, "A detailed study of clustering algorithms," 6th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), IEEE, pp. 371-376, 2017.
[4] D. Zakrzewska and J. Murlewski, "Clustering algorithms for bank customer segmentation," International Conference on Intelligent Systems Design and Applications (ISDA'05), IEEE, pp. 197-202, 2005.
[5] A. Choudhury and K. Nur, "A machine learning approach to identify potential customer based on purchase behavior," International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), IEEE, pp. 242-247, 2019.
[6] T. Kansal, S. Bahuguna, V. Singh and T. Choudhury, "Customer segmentation using K-means clustering," International Conference on Computational Techniques, Electronics and Mechanical Systems (CTEMS), IEEE, pp. 135-139, 2018.
[7] P. Balakrishnan, M. Cooper, V. Jacob and P. Lewis, "Comparative performance of the FSCL neural net and K-means algorithm for market segmentation," European Journal of Operational Research, vol. 93, no. 2, pp. 346-357, 1996.
[8] M. Zait and H. Messatfa, "A comparative study of clustering methods," Future Generation Computer Systems, vol. 12, no. 2-3, pp. 149-159, 1997.
[9] Y. Rani and H. Rohil, "A study of hierarchical clustering algorithm," vol. 2, no. 113, 2013.
[10] S. Ben-David and N. Haghtalab, "Clustering in the presence of background noise," International Conference on Machine Learning (PMLR), pp. 280-288, 2014.
[11] S. Goyat, "The basis of market segmentation: a critical review of literature," European Journal of Business and Management, vol. 3, no. 9, pp. 45-54, 2011.
[12] T. Kodinariya and P. Makwana, "Review on determining number of cluster in K-means clustering," International Journal, vol. 1, no. 6, pp. 90-95, 2013.
[13] "Scikit-Learn," [Online]. Available: https://scikit-learn.org/stable/.
[14] V. Peter and R. Mehta, "Impact of outlier removal and normalization approach in modified k-means clustering algorithm," International Journal of Computer Science Issues (IJCSI), vol. 8, no. 5, p. 331, 2011.
[15] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603-619, 2002.
