A Machine Learning
2023 International Conference on Artificial Intelligence and Knowledge Discovery in Concurrent Engineering (ICECONF) | 979-8-3503-3436-4/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICECONF57129.2023.10084339
Abstract- The Internet is becoming huge and is used by a more diverse audience every day. The amount of data gathered through different online lead companies is gargantuan, so it needs to be maintained and segregated in order to extract meaningful information from it. Many companies have started to gather customer data through their own platforms or through various vendors who sell it to sales companies, organizations and individuals for profit. Sometimes these data are large and scattered enough to confuse even big sales organizations. For better and more effective marketing of these sales data, we propose to use four machine learning clustering algorithms (K-Means, Agglomerative, Mean-Shift and DBSCAN) to find customer segments in the data provided. Based on these segmented customer groups, we can find patterns and decide which customer group is better suited for which business.

Keywords: Machine Learning, Customer Segmentation, K-Means, Agglomerative, DBSCAN, Marketing

I. INTRODUCTION

A major objective of the research and study of these online lead data is to find meaningful customer segments and use them for better and more efficient marketing. There are various problems with online leads, which are not structured in a proper way. These problems include the waste of time and resources in sending promotional emails or telecalling customers who do not fit the target audience of the sales company, organization or individual; also, from a huge amount of lead data, a sales conversion rate of less than one per cent is achieved due to an improper and generic approach to marketing. There is no point in trying to sell a youth product to a customer lead who is above 50 years of age.

Only a small percentage of companies that buy sales leads have a team set up to segregate and segment the customer leads; the others fall into the trap of using a common approach for all leads, which leads to a poor sales conversion rate. Individual salespersons are the most affected, because they only have domain knowledge of their respective sales areas but fail to understand the clusters or segments present in the sales data. The basic or common filters provided by various software packages do not work very well for multi-dimensional columns. So, it is important for each sales executive to understand the customer segments in the data and customise their marketing technique for each group.

In this paper, for research purposes, we bought a set of sales leads from a company named Shrine. There are three important sources of this data: internal customers of the company, data gathered from various online networking tools, and job profiles from a hiring company. The data gathered is diverse and contains columns related to the customers' personal and work data. A lot of analysis and cleaning was required before the data could be used by the models, which is covered in the Methodology section.

Clustering is a machine-learning approach for finding patterns in data and grouping them as clusters. Unsupervised clustering is a method to find clusters without a target or a previously clustered column. We are going to use four unsupervised clustering algorithms in this paper, namely K-Means, Agglomerative, Mean Shift and DBSCAN. The same data, formatted differently for each algorithm, is used to train these models, and their output is studied and analysed.
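All four algorithms used in this paper are available in scikit-learn [13]; the sketch below simply instantiates them on synthetic data so their shared fit/predict interface is visible. The dataset and every parameter value here are illustrative placeholders, not the paper's actual configuration.

```python
# Illustrative only: the four clustering algorithms on synthetic blob data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, MeanShift, DBSCAN

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

models = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=42),
    "Agglomerative": AgglomerativeClustering(n_clusters=3, linkage="ward"),
    "Mean-Shift": MeanShift(),                 # infers the cluster count itself
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5),  # may mark outliers with label -1
}

labels = {name: m.fit_predict(X) for name, m in models.items()}
for name, lab in labels.items():
    n = len(set(lab.tolist()) - {-1})  # exclude DBSCAN's noise label
    print(f"{name}: {n} clusters")
```

Note that only K-Means and Agglomerative take the number of clusters as input; Mean-Shift and DBSCAN derive it from the density of the data, which is exactly the distinction the later analysis turns on.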
Authorized licensed use limited to: SRM Institute of Science and Technology. Downloaded on September 04,2023 at 08:04:00 UTC from IEEE Xplore. Restrictions apply.
II. RELATED WORK

Every market is unique and heterogeneous, and finding the right consumers is a big and difficult task for the sales and marketing team. Gillian has stressed the importance of market segmentation by suggesting how segmentation allows an organization to achieve a high ROI (Return On Investment) [1]. Online leads have opened a completely new path for sales organizations to gather customers: a lot of data is available to them, unlike in the old days, when companies had to use alternative methods such as campaigns and forms to gather data. There are different ways to gather leads, such as social media, online ads, etc. Simon, Kilian, Christian and Alexander have revealed how a typical online lead generation funnel works [2].

Clustering algorithms use machine learning techniques and help us identify the clusters in the data. The distance between points is one of the key factors for deciding the clusters, and there are several ways to calculate it, such as the Euclidean and Minkowski distances [3]. Danuta and Jan suggested how using different clustering algorithms like DBSCAN and K-Means helps in the segmentation of bank customers [4]. A comparative study using local grocery store data helped in identifying clusters and predicting potential and non-potential customers [5].

Segmentation of customers for various domains is discussed in the literature (see [6], [7], [8], [9]). These techniques stress the importance of the quality of the data fed to the model for prediction. The original data needs to undergo several steps of tuning and pre-processing before it can be used by the model. Not every data point belongs to a cluster; real data has some noise. Nika and Ben emphasize how we can proceed with clustering while keeping the noise present in the data as one of the deciding parameters [10]. Labelling each cluster after model prediction requires some domain knowledge about the data. Sulekha suggests that various factors, such as demographic, psychographic and behavioural ones, can be used to determine the naming of the clusters once model prediction is over [11].

III. RESEARCH METHODOLOGY

The following section covers the entire research methodology, from data collection to model training.

A. Data Collection
The data used in this paper was bought from a company named Shrine, and the three important sources of this data are internal customers of the company, data gathered from various online networking tools, and job profiles from a hiring company. The data received was diverse and had many different features, such as Name, Date Of Birth, Profession, Designation, Current Company, etc. Figure 1 shows a partial view of the initial data received. Five different datasets were given to us, and EDA (Exploratory Data Analysis) had to be done before proceeding, as there were many anomalies present in the data. The initial dataset consists of 9108 rows, and not all the datasets have all the features present. We had to study and select the features that could result in meaningful analysis. From the 15 columns presented, we narrowed it down to four (DOB, Education, Experience, Salary), as these were the common and meaningful features present in all the datasets.
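The narrowing-down step above, keeping only the columns shared by every dataset, can be sketched as follows. The toy frames stand in for the five vendor files, which are not reproduced here; only the four column names (DOB, Education, Experience, Salary) come from the paper.

```python
# Illustrative only: select the columns common to all datasets, then stack rows.
import pandas as pd

datasets = [
    pd.DataFrame({"Name": ["A"], "DOB": ["1990-01-01"], "Education": ["BSc"],
                  "Experience": [5], "Salary": [40000], "Designation": ["Dev"]}),
    pd.DataFrame({"DOB": ["1985-06-10"], "Education": ["MBA"],
                  "Experience": [12], "Salary": [90000], "Current Company": ["X"]}),
]

common = set.intersection(*(set(df.columns) for df in datasets))
keep = [c for c in ["DOB", "Education", "Experience", "Salary"] if c in common]
merged = pd.concat([df[keep] for df in datasets], ignore_index=True)
print(merged.shape)
```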
Fig 2. Final view of the online sales data after pre-processing
Agglomerative clustering works by taking each data point as a single cluster and merging these clusters based on the similarity between them. The procedure is repeated until all the data points are merged into a single cluster. The linkage criterion decides how two clusters are merged in agglomerative clustering. There are various types of linkage; we use Ward's method in this paper, as it helps in creating compact and even-sized clusters.

In this merging scheme, the distance between two clusters C1 and C2 is based on the sum of the squared distances between their points x and y, divided by the product of the cluster sizes |C1| and |C2|:

d(C1, C2) = ( Σ_{x∈C1} Σ_{y∈C2} ‖x − y‖² ) / ( |C1| · |C2| )

Mean-Shift Clustering:-
Mean-Shift clustering identifies clusters by iteratively shifting centroids towards regions of higher probability density; once a centroid can no longer move, the algorithm stops. The advantage of Mean-Shift is that, unlike K-Means, we do not need to specify the number of clusters to be found: Mean-Shift finds the number of clusters automatically, based on density. The kernel K is a weighting function used in the calculation of the density function, and the bandwidth h is one of the important parameters of Mean-Shift; it can be considered the radius of the circle that decides which data points belong to a cluster. Below is the formula for calculating the PDF (Probability Density Function), where n, d and x correspond to the number of data points, the number of dimensions and the data point value, respectively:

f(x) = (1 / (n·h^d)) · Σ_{i=1}^{n} K((x − x_i) / h)

DBSCAN Clustering:-
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. As mentioned in the literature review, not all data points belong to clusters; real data has some noise. DBSCAN helps us identify those clusters while remaining unaffected by outliers, unlike K-Means, which is very sensitive to them. Two parameters, Epsilon (eps) and Minimum Points (minpoints), are needed for DBSCAN to work: eps describes the radius of the neighbourhood around a point, and minpoints establishes the minimum number of points needed to form a cluster. How these values are derived is explained in the Features and Parameter Selection section.

D. Features and Parameter Selection:-
Selecting the right parameters for each clustering algorithm, together with the right data, is hugely significant for successful model creation. For training the models and selecting the respective features, the popular Python library scikit-learn is used [13]. This section describes how the features are scaled and how parameters are selected for each algorithm.
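The paper does not spell out exactly how the bandwidth and eps values were derived; the sketch below shows the usual scikit-learn approaches, the estimate_bandwidth helper for Mean-Shift and the k-distance heuristic for DBSCAN's eps, both on placeholder data. Treat the quantile, percentile and minpoints values as assumptions.

```python
# Illustrative only: common ways to derive Mean-Shift's h and DBSCAN's eps.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import estimate_bandwidth
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Mean-Shift: estimate bandwidth h from a quantile of pairwise distances.
h = estimate_bandwidth(X, quantile=0.2)

# DBSCAN: sort each point's distance to its k-th neighbour (k = minpoints);
# the "knee" of this sorted curve is a common choice for eps.
minpoints = 5
nn = NearestNeighbors(n_neighbors=minpoints).fit(X)
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, -1])
eps_guess = float(np.percentile(k_dist, 90))  # crude stand-in for the knee

print(h, eps_guess)
```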
Scaling and Outlier Removal:-
The data used in this research was gathered in such a way that the data points are scattered across very high and very low magnitudes, matching the real-world scenario of data gathering. Some of the algorithms we use are sensitive to outliers [14], which need to be taken care of before model training in order to achieve quality results. Figure 3 shows the distribution of the data points through histograms; we can clearly see some of the data lying in the outlier zone. To remove the outliers, we replace the high and low values with the upper and lower limits of the IQR (Interquartile Range). The IQR describes the middle 50% of values when ordered from lowest to highest. For scaling the data, we use the Z-Score mechanism, since the variance is more or less comparable between the columns. The Z-Score expresses how far a value lies from the mean in units of standard deviation. Following is the formula for the Z-Score, where x, μ and σ correspond to the observed value, the mean and the standard deviation, respectively:

z = (x − μ) / σ

Determination of K value:
For selecting the right K value, the elbow method is used. It works by finding the Sum of Squared Errors (SSE) between each data point and its respective centroid for different values of K. The formula for SSE is given below, where x, μi and k correspond to a data point, the centroid of cluster i and the number of clusters, respectively:

SSE = Σ_{i=1}^{k} Σ_{x∈Ci} ‖x − μi‖²

The scaled data is used in finding the elbow value. As K increases, the SSE decreases; the point where the elbow is found is chosen as the K value. From figure 4, using the elbow method, the K value for the data is chosen as 2 and the model is trained accordingly.

Training the Agglomerative model:-
Agglomerative clustering first considers every data point in the plane as a separate cluster and then proceeds to group them together into one final cluster. Since agglomerative clustering is also sensitive to outliers, the same scaled data is used for the model-building process. As explained in the Models section, we use Ward's linkage for merging clusters.

To visualize and determine the output of the clustering process, a dendrogram is used. A dendrogram shows the hierarchical relationship between clusters. Figure 5 shows the dendrogram for the scaled data. From this dendrogram, the number of clusters is chosen as 3, based on the longest vertical line that is not cut by any horizontal line. The Agglomerative model is then trained with 3 clusters.
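The IQR capping, z-score scaling and elbow steps described above can be sketched together as follows. The synthetic salary column, the 1.5×IQR multiplier and the K range are illustrative choices, not values taken from the paper.

```python
# Illustrative only: cap outliers at the IQR limits, z-score the column,
# then compute the SSE (KMeans inertia) curve used by the elbow method.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
salary = np.append(rng.normal(50_000, 10_000, 500), [5_000_000, 10.0])

# Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(salary, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = np.clip(salary, lo, hi)

# Z-score scaling: z = (x - mean) / std.
scaled = StandardScaler().fit_transform(capped.reshape(-1, 1))

# Elbow method: SSE for each candidate K; SSE falls as K grows.
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled).inertia_
       for k in range(1, 8)]
print(sse)
```

Plotting sse against K and reading off the bend reproduces the figure-4 style elbow plot.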
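The dendrogram-based choice of cluster count can be sketched with SciPy's hierarchy module; the paper only states that Ward's linkage and a dendrogram were used, so the exact tooling and the toy data here are assumptions.

```python
# Illustrative only: build the ward-linkage merge history behind a dendrogram
# and cut the tree into 3 clusters, mirroring the selection described above.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(3, 0.3, (20, 2)),
               rng.normal(6, 0.3, (20, 2))])

Z = linkage(X, method="ward")  # (n-1) x 4 merge history; dendrogram(Z) plots it
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(len(set(labels.tolist())))
```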
The Mean Shift algorithm was trained and 14 different clusters were found in the output; these outputs are analysed in the Results and Analysis section. When DBSCAN was trained, 17 unique clusters were formed. Some points were identified as noise by the algorithm and marked as -1, which could be useful during analysis.
Fig 9:- Mean-Shift Cluster Results (Mean Values)
From the above experiment, we found that both the K-Means and Agglomerative models struggled to identify more clusters in the data. The reason is generally the working procedure of each algorithm, which was affected by the differences in magnitude present in the data, even after scaling was done. The clusters identified by the K-Means and Agglomerative models can be named based on salary, as listed in figures 7 and 8: for K-Means, the two labels can be named above higher-middle-class income and below higher-middle-class income, and for Agglomerative the three labels can be high class, middle class and low class.

Both the K-Means and Agglomerative models were able to find some meaningful clusters, but not meaningful enough for a marketing company to maximize its efficiency in categorizing potential customers for each product. This is where Mean Shift and DBSCAN were able to succeed, mainly due to their density-based approach. The total numbers of clusters formed by Mean Shift and DBSCAN were 14 and 17, respectively. The labels for Mean Shift and DBSCAN can be determined on a different basis, using domain knowledge and the mean values of each feature of the clusters. For example, if we look at cluster 4 in the Mean-Shift results of figure 9, we can see that people in that customer segment earn above 3 crores INR per annum on average but are under age 40; this customer segment can be named "customers with high potential for luxury products".

Unlike the K-Means and Agglomerative models, where the labels are based only on salary, much more meaningful information was obtained from the Mean Shift and DBSCAN models. With DBSCAN, we were also able to identify the noise present in the data and segregate it as a separate cluster. These noise points are extraordinary cases that represent only a minority portion of a customer segmentation problem and can be used only in very few cases.

When analyzing all the models, bringing all the column scales to an equal margin was one of the key factors that helped in identifying the right clusters, even though the experience and salary columns had a somewhat larger magnitude than the other columns. To measure and analyze how good the clustering done by each algorithm is, the Silhouette Score is used. The silhouette score measures how well each data point has been assigned to the correct cluster, using the formula below, where a corresponds to the average distance between a data point and the other points in its own cluster, and b corresponds to the average distance between the data point and the points in the nearest other cluster:

s = (b − a) / max(a, b)
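Computing the silhouette score is a one-liner in scikit-learn; for DBSCAN the -1 noise labels are removed first, as the paper notes, since the silhouette is undefined for noise points. The data and parameters below are illustrative, not the paper's.

```python
# Illustrative only: silhouette score for a DBSCAN result, noise masked out.
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=750, centers=[[1, 1], [-1, -1], [1, -1]],
                  cluster_std=0.4, random_state=0)
labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)

mask = labels != -1  # drop noise points before scoring
score = silhouette_score(X[mask], labels[mask])
print(round(score, 3))
```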
Fig 11:- Silhouette score of the 4 models used.

Figure 11 shows the silhouette score of each model used. For DBSCAN, the silhouette score is calculated after removing the -1 labels, to make sure the score is meaningful. We can see a high score for both the Agglomerative and Mean Shift models, denoting how well and efficiently clustering was done for each data point. But the analysis cannot be determined by the silhouette score alone: it is only one of the metrics for analyzing unsupervised clustering results, and the real-world scenario may be different. From a usage perspective, we found DBSCAN to be much more effective than Agglomerative despite its lower score, mainly because of how well it was able to differentiate and segment customers based on different principles or dimensions.

V. CONCLUSION

REFERENCES
[1] G. Martin, "The importance of marketing segmentation," American Journal of Business Education (AJBE), vol. 4, no. 6, pp. 15-18, 2011.
[2] S. Stolz, K. Wisskirchen, C. Schlereth and A. Hoffmann, "Online Lead Generation: An Emerging Industry," Marketing Review St. Gallen, June 2021.
[3] K. Bindra and A. Mishra, "A detailed study of clustering algorithms," 6th International Conference on Reliability, Infocom Technologies and Optimization (ICRITO), IEEE, pp. 371-376, 2017.
[4] D. Zakrzewska and J. Murlewski, "Clustering algorithms for bank customer segmentation," International Conference on Intelligent Systems Design and Applications (ISDA'05), IEEE, pp. 197-202, 2005.
[5] A. Choudhury and K. Nur, "A machine learning approach to identify potential customer based on purchase behavior," International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), IEEE, pp. 242-247, 2019.
[6] T. Kansal, S. Bahuguna, V. Singh and T. Choudhury, "Customer segmentation using K-means clustering," International Conference on Computational Techniques, Electronics and Mechanical Systems (CTEMS), IEEE, pp. 135-139, 2018.
[7] P. Balakrishnan, M. Cooper, V. Jacob and P. Lewis, "Comparative performance of the FSCL neural net and K-means algorithm for market segmentation," European Journal of Operational Research, vol. 93, no. 2, pp. 346-357, 1996.
[8] M. Zait and H. Messatfa, "A comparative study of clustering methods," Future Generation Computer Systems, vol. 12, no. 2-3, pp. 149-159, 1997.
[9] Y. Rani and H. Rohil, "A study of hierarchical clustering algorithm," vol. 2, no. 113, 2013.
[10] S. Ben-David and N. Haghtalab, "Clustering in the presence of background noise," International Conference on Machine Learning (ICML), PMLR, pp. 280-288, 2014.
[11] S. Goyat, "The basis of market segmentation: a critical review of literature," European Journal of Business and Management, vol. 3, no. 9, pp. 45-54, 2011.
[12] T. Kodinariya and P. Makwana, "Review on determining number of cluster in K-Means clustering," International Journal, vol. 1, no. 6, pp. 90-95, 2013.
[13] "Scikit-Learn," scikit-learn, [Online]. Available: https://scikit-learn.org/stable/.
[14] V. Peter and R. Mehta, "Impact of outlier removal and normalization approach in modified k-means clustering algorithm," International Journal of Computer Science Issues (IJCSI), vol. 8, no. 5, p. 331, 2011.
[15] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603-619, 2002.