
Bisecting K-Means - An Efficient Approach to Customer Segmentation

Divanshu Nayan, Pranshu Mishra, Shahoor Ahmed, Dr. Bichitrananda Behra
Computer Science and Engineering
C.V. Raman Global University
Bhubaneswar, India

Abstract: Organizations can efficiently segment their consumer base by leveraging RFM (Recency, Frequency, and Monetary) values derived from analyzing transactional data over a specified period. This segmentation approach facilitates the identification of groups with similar behaviors, enabling a deeper understanding of customer needs and the exploration of potential clients for the business. Additionally, segmenting the client base has a positive impact on revenue generation. Retaining current consumers is widely acknowledged as a higher priority than acquiring new ones. For instance, businesses can employ marketing strategies tailored to specific market niches to cultivate client loyalty and enhance retention efforts. The study uses classic K-Means and hierarchical clustering algorithms to cluster transactional data after conducting an RFM analysis. It also introduces a novel approach based on the Bisecting K-Means method. The efficacy of these techniques is evaluated based on cluster compactness, execution time, and the average similarity between each cluster and its most similar counterpart.

Index Terms: Customer segmentation, RFM analysis, K-Means, Hierarchical Clustering, Bisecting K-Means

I. INTRODUCTION

The current business environment has become more competitive, requiring new approaches to maintain a competitive edge. Implementing a customer segmentation model can greatly increase business earnings. The Pareto principle, which states that roughly 2 out of every 10 customers contribute disproportionately to revenue, emphasizes the need to prioritize client retention over gaining new clients. By using customer segmentation, which draws on a variety of distinctive client attributes, business experts can customize marketing strategies, identify trends, plan product development, coordinate advertising campaigns, and offer relevant products. Customer segmentation supports effective communication with specific groups by tailoring messages to them. It often makes use of variables including location, age, gender, income, lifestyle, and past purchasing patterns.

In this context, behavioral data is used for segmentation because it is widely available, dynamic, and grounded in past purchase behavior. Recency, Frequency, and Monetary (RFM) analysis is a prominent method for evaluating clients based on their purchasing patterns. To quantify the Recency, Frequency, and Monetary aspects, a scoring system is developed. These scores are then combined to generate an RFM score ranging from 111 to 555 (Haiying and Yu, 2010). This composite score serves as a tool for examining customers' historical and current behaviors to predict future patterns. Notably, within this framework, the Recency, Frequency, and Monetary scores correlate directly with customers' lifetime value and retention rates.

Following the computation of recency, frequency, and monetary values, the K-Means technique is applied to group the customer base into clusters. Examining the behavior of each cluster identifies which consumer group contributes most significantly to the company's profitability. Additionally, two other clustering algorithms, Bisecting K-Means and Hierarchical clustering, are employed. The objective of this study is to introduce a method that enhances the interpretability of clusters, improves compactness, and reduces cluster spread and processing time. Once customer clusters are established, understanding the distinctions among these groupings becomes imperative. A thorough examination of the clusters is conducted to identify target clients and tailor offers and promotions to their needs and preferences.

Marketing professionals will find the proposed consumer segmentation methodology valuable for its potential to enhance targeted marketing efforts. The remainder of the paper compares the three clustering techniques, evaluating them on similarity, cluster compactness, execution time, and other pertinent variables. This comparative analysis provides insight into the strengths and weaknesses of each technique, enabling marketing professionals to make informed decisions about which clustering approach best suits their specific needs and objectives.
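The RFM scoring described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring procedure used in the paper: the function names `quintile_score` and `rfm_scores` are hypothetical, recency is assumed to be the number of days since a customer's last purchase, and each dimension is assumed to be scored from 1 to 5 by rank-based quintiles before the three digits are combined into a value between 111 and 555.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import date


def quintile_score(value, sorted_values, reverse=False):
    """Map a value to a 1-5 score from its rank among sorted_values (assumed scheme)."""
    rank = bisect_left(sorted_values, value)      # count of strictly smaller values
    score = 1 + (5 * rank) // len(sorted_values)  # 1 = lowest fifth, 5 = highest fifth
    return 6 - score if reverse else score


def rfm_scores(transactions, today):
    """Return {customer_id: RFM score} for (customer_id, purchase_date, amount) rows."""
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer_id, purchase_date, amount in transactions:
        previous = last_purchase.get(customer_id, purchase_date)
        last_purchase[customer_id] = max(previous, purchase_date)
        frequency[customer_id] += 1
        monetary[customer_id] += amount

    # Assumption: recency is the number of days since the last purchase.
    recency = {cid: (today - day).days for cid, day in last_purchase.items()}

    r_sorted = sorted(recency.values())
    f_sorted = sorted(frequency.values())
    m_sorted = sorted(monetary.values())

    scores = {}
    for cid in recency:
        # A small recency is good, so the recency scale is reversed.
        r = quintile_score(recency[cid], r_sorted, reverse=True)
        f = quintile_score(frequency[cid], f_sorted)
        m = quintile_score(monetary[cid], m_sorted)
        scores[cid] = 100 * r + 10 * f + m
    return scores
```

Under this scheme, the customer who bought most recently, most often, and for the largest total receives 555, while the least active customer receives 111. Ties share a score because the rank counts only strictly smaller values.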
II. ALGORITHM DESCRIPTION

The segmentation process uses the transactional dataset of a company's clients and employs three distinct algorithms to group clients based on RFM analysis. Initially, the data undergoes pre-processing to remove outliers and filter significant instances. The z-score method is employed to identify outliers and assess how the data align with the mean and standard deviation. Through this method, the data are standardized to a mean of 0 and a standard deviation of 1. Outliers are identified as data points that deviate significantly from the mean (zero). Subsequently, the recency, frequency, and monetary values are computed by feeding the preprocessed data into the RFM model.

The three clustering algorithms, namely K-Means, Hierarchical Clustering, and Bisecting K-Means, are then applied to the three attributes (recency, frequency, and monetary values). These algorithms partition the clients into distinct groups based on their RFM characteristics. Following this, the cluster compactness, similarity index, and execution time of each clustering method are examined to assess their effectiveness. For quick reference, a summary of the proposed client segmentation strategy is presented in Figure 1.

A. RFM analysis

In database marketing, Recency, Frequency, and Monetary (RFM) analysis is a powerful and widely recognized method. Ranking clients based on their historical purchasing behavior is a prevalent practice in this field. RFM analysis finds applications across various domains, including online shopping and e-commerce, particularly in scenarios involving a large number of clients. This strategy uses three dimensions to segment customers: Recency (R), Frequency (F), and Monetary (M).

1) Recency: How recently did the client make a purchase?: The time that has elapsed since a customer's most recent purchase is known as their recency value. A lower recency value suggests that the client visits the business often. Conversely, a higher value suggests a lower likelihood of the customer visiting the business soon.

2) Frequency: How many times did the customer make a purchase?: The number of purchases a consumer makes in a certain time frame is known as their frequency. The greater the frequency value, the more loyal the client is to the company.

3) Monetary: What was the customer's expenditure?: The amount of money spent by the consumer during a specific time period is referred to as the monetary value. The more money spent, the more revenue the company receives from that customer.

B. K-Means clustering

K-Means is a common algorithm that divides the data into a specified number of clusters so that the intra-cluster similarity is high. It takes the data attributes and the number of clusters as inputs. The iterative K-Means method recomputes the centroid values at each iteration, and data points are moved between clusters according to the centroids determined in that iteration. The procedure is repeated until the sum of distances between the data points and their centroids can no longer be reduced. Algorithm 1 displays the K-Means algorithm.

Min-max normalization is used to normalize the recency, frequency, and monetary values, because skewed values could be troublesome. The clustering method is then applied to the scaled data. To determine which customer category generates the most revenue for the business, the amount of money earned from each category is calculated. K-Means has complexity $O(n \cdot k \cdot i)$, where $n$ denotes the number of instances, $k$ the number of clusters, and $i$ the number of iterations.

K-Means Algorithm
Input:
  - Dataset $D = \{x_1, x_2, \dots, x_n\}$ with $n$ data points in $d$ dimensions.
  - Number of clusters $k$.
Output:
  - Cluster centroids $\{c_1, c_2, \dots, c_k\}$.
Initialization:
  1. Randomly select $k$ data points as the initial cluster centroids $\{c_1^{(0)}, c_2^{(0)}, \dots, c_k^{(0)}\}$.
for $t = 1$ to $T$ (maximum iterations) do
  2. Assignment Step: Assign each data point $x_i$ to the nearest cluster centroid $c_j$ based on a distance metric (e.g., Euclidean distance):
     $$ \text{assign } x_i \text{ to cluster } j = \arg\min_{l \in \{1, 2, \dots, k\}} \lVert x_i - c_l^{(t-1)} \rVert^2 $$
  3. Update Step: Recompute the centroid of each cluster $c_j$ as the mean of the data points assigned to it:
     $$ c_j^{(t)} = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i $$
  4. Termination Criterion: If the centroids have not changed significantly between iterations (or the maximum number of iterations is reached), terminate. Otherwise, go back to step 2.
end for

C. Hierarchical clustering

Hierarchical clustering is an unsupervised learning technique that organizes data points into a nested structure, similar to a family tree. It begins by treating each data point as its own individual cluster. Then, it iteratively merges the most similar clusters based on a chosen distance metric (such as Euclidean distance) until a single cluster encompassing all data points is formed. This process creates a visual representation called a dendrogram, which depicts the merging hierarchy and helps determine the optimal number of clusters for the analysis. However, the computational cost of hierarchical clustering can be significant. In the worst case, its time complexity scales with the cube of the number of data points, $O(n^3)$, making it less suitable for massive datasets than other clustering algorithms.

As in the preceding procedure, min-max normalization is used to scale the variables. The clients are then grouped according to their recency, frequency, and monetary values using hierarchical clustering.

Agglomerative Hierarchical Clustering Algorithm
Input:
  - Dataset $D = \{x_1, x_2, \dots, x_n\}$ with $n$ data points in $d$ dimensions.
  - Distance metric (e.g., Euclidean distance).
  - Linkage function (e.g., Single Linkage, Complete Linkage).
Output:
  - Dendrogram representing the hierarchical cluster structure.
Initialization:
  1. Consider each data point as an individual cluster.
  2. Compute a proximity matrix storing the distance between all data points.
for $t = 1$ to $n - 1$ (merging iterations) do
  3. Find closest clusters: Identify the two most similar clusters based on the chosen linkage function and the proximity matrix.
  4. Merge clusters: Combine the identified clusters into a new cluster.
  5. Update proximity matrix: Recalculate distances between the new cluster and all remaining clusters.
end for
  6. The final set of clusters and their hierarchy is represented by the dendrogram.

D. Bisecting K-Means

Bisecting K-Means offers a distinct clustering perspective, combining the top-down logic of divisive hierarchical clustering with the iterative splitting of K-Means. It starts with all data points in a single cluster and repeatedly divides the cluster with the most significant internal differences using K-Means (K = 2). This selective splitting continues until the desired number of clusters is reached. This approach can be advantageous for large cluster counts because it focuses on the most informative splits and tends to produce clusters with more balanced sizes than standard K-Means. The time complexity of the Bisecting K-Means algorithm is $O((K-1)IN)$, where $K$ is the number of clusters, $N$ is the number of data points, and $I$ is the number of iterations to converge. Bisecting K-Means is therefore linear in the number of data points.

Bisecting K-Means Clustering Algorithm
Input:
  - Dataset $D = \{x_1, x_2, \dots, x_n\}$ with $n$ data points in $d$ dimensions.
  - Number of clusters $k$.
Output:
  - Cluster centroids $\{c_1, c_2, \dots, c_k\}$.
Initialization:
  1. Start with all data points in a single cluster.
for $t = 1$ to $k - 1$ (bisection steps) do
  2. Splitting Step: Apply the K-Means algorithm with K = 2 (often with a single iteration) to the current cluster to split it into two sub-clusters.
  3. Choose one of the sub-clusters for further splitting in the next iteration. Common strategies include:
     (a) Selecting the sub-cluster with the higher centroid distance (larger diameter).
     (b) Selecting the sub-cluster with the higher within-cluster variance (more spread).
end for
  4. The final set of clusters consists of the $k$ clusters remaining after bisection.

III. EXPERIMENTATION AND RESULT DISCUSSION

The effectiveness of the proposed methodology is assessed using one year of transactional data from the customers of an online retailer, sourced from the University of California, Irvine (UCI) repository. This section outlines the consumer segmentation process step by step. The dataset has eight attributes, such as the customer ID, product code, name, price, date, and time of purchase, among others. There are 541910 instances with eight attributes in the original data set. The dataset includes consumer purchases made between
December 1, 2010, and December 9, 2011. During data pre-processing, any cases with missing values in significant attributes, unit prices and quantities less than 0, and dates later than the current date are eliminated. As an extra pre-processing step, Z-Score analysis is also carried out to detect outliers. Only the attributes that pass the filtering process, such as invoice date and time, product quantity per transaction, and product price per unit in currency terms, are passed to the benchmark algorithms. The amended dataset contains 4067 instances with the three additional attributes (recency, frequency, and monetary) produced by the RFM computation. Table 2 displays a description of the original dataset.

Fig. 2 displays the result plots produced by Bisecting K-Means, hierarchical clustering, and K-Means clustering. The execution time of every algorithm is computed using the system time. Because of its lower computational cost, the proposed Bisecting K-Means method is found to run faster than the other two. The average distance between the generated clusters is studied using the silhouette width, and the average similarity of each cluster with its most similar cluster is measured by the Davies-Bouldin score. The silhouette plot is a visual analysis of the clustering result that shows the number of customers in each cluster as well as the shortest distance between a point in one cluster and a point in another cluster. A larger average silhouette width indicates that the data points inside a cluster are closer to one another than to those in other clusters, and vice versa. Likewise, a smaller Davies-Bouldin score indicates that the data points inside a cluster are less similar to those in other clusters, and vice versa. The average silhouette width is computed for the final clusters produced by K-Means, Hierarchical Clustering, and Bisecting K-Means. The average silhouette width of Bisecting K-Means is found to be larger than that of both K-Means and Hierarchical clustering.

IV. CONCLUSION

Customer segmentation strengthens customer relationships. While acquiring new clients is significant for the business, keeping the current clientele is even more crucial (Tong et al., 2017). This work uses RFM analysis for segmentation and then extends it with K-Means clustering, Hierarchical Clustering, and a new technique, Bisecting K-Means, obtained by slightly altering standard K-Means clustering. The operation of these methods is examined. After analyzing how long each algorithm takes to run, it is found that the proposed Bisecting K-Means strategy takes less time. Because of its simplicity and lower computation cost, the proposed algorithm is more efficient. Because segmentation is carried out according to monetary value, frequency, and recency, the business is able to tailor its marketing campaigns to the clients' purchasing habits. Future research will examine consumer behavior in each segment, including the products that members of that segment purchase on a regular basis. This would make it easier to give particular products greater promotional incentives.

REFERENCES

[1] Phan Duy Hung, Nguyen Thi Thuy Lien, and Nguyen Duc Ngoc. 2019. Customer Segmentation Using Hierarchical Agglomerative Clustering.
In Proceedings of the 2nd International Conference on Information Science and Systems (ICISS '19). Association for Computing Machinery, New York, NY, USA, 33–37.
[2] Chihli Hung, Chih-Fong Tsai, Market segmentation based on hierarchical self-organizing map for markets of multimedia on demand, Expert Systems with Applications, Volume 34, Issue 1, 2008, Pages 780-787, ISSN 0957-4174.
[3] I. Maryani, D. Riana, R. D. Astuti, A. Ishaq, Sutrisno and E. A. Pratama, "Customer Segmentation based on RFM model and Clustering Techniques With K-Means Algorithm," 2018 Third International Conference on Informatics and Computing (ICIC), Palembang, Indonesia, 2018, pp. 1-6, doi: 10.1109/IAC.2018.8780570.
[4] A. Joy Christy, A. Umamakeswari, L. Priyatharsini, A. Neyaa, RFM ranking – An effective approach to customer segmentation, Journal of King Saud University - Computer and Information Sciences, Volume 33, Issue 10, 2021, Pages 1251-1257, ISSN 1319-1578.
[5] Chongkolnee Rungruang, Pakwan Riyapan, Arthit Intarasit, Khanchit Chuarkham, Jirapond Muangprathub, RFM model customer segmentation based on hierarchical approach using FCA, Expert Systems with Applications, Volume 237, Part B, 2024, 121449, ISSN 0957-4174.
[6] M. Aryuni, E. Didik Madyatmadja and E. Miranda, "Customer Segmentation in XYZ Bank Using K-Means and K-Medoids Clustering," 2018 International Conference on Information Management and Technology (ICIMTech), Jakarta, Indonesia, 2018, pp. 412-416, doi: 10.1109/ICIMTech.2018.8528086.
[7] R. Kashef, M.S. Kamel, Enhanced bisecting k-means clustering using intermediate cooperation, Pattern Recognition, Volume 42, Issue 11, 2009, Pages 2557-2569, ISSN 0031-3203.
[8] V. Rohilla, M. S. S. kumar, S. Chakraborty and M. S. Singh, "Data Clustering using Bisecting K-Means," 2019 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), Greater Noida, India, 2019, pp. 80-83, doi: 10.1109/ICCCIS48478.2019.8974537.
[9] S. Banerjee, A. Choudhary and S. Pal, "Empirical evaluation of K-Means, Bisecting K-Means, Fuzzy C-Means and Genetic K-Means clustering algorithms," 2015 IEEE International WIE Conference on Electrical and Computer Engineering (WIECON-ECE), Dhaka, Bangladesh, 2015, pp. 168-172, doi: 10.1109/WIECON-ECE.2015.7443889.
[10] He X., Li, C., 2016. The research and application of customer segmentation on e-commerce websites. In: 2016 6th International Conference on Digital Home (ICDH), Guangzhou, pp. 203–208. doi: 10.1109/ICDH.2016.050.
[11] Haiying, M., Yu, G., 2010. Customer Segmentation Study of College Students Based on the RFM. In: 2010 International Conference on E-Business and E-Government, Guangzhou, pp. 3860-3863. doi: 10.1109/ICEE.2010.968.
[12] Sheshasaayee, A., Logeshwari, L., 2017. An efficiency analysis on the TPA clustering methods for intelligent customer segmentation. In: 2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), Bangalore, pp. 784–788.
[13] Liu, C.C., Chu, S.W., Chan, Y.K., Yu, S.S., 2014. A Modified K-Means Algorithm – Two-Layer K-Means Algorithm. In: 2014 Tenth International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Kitakyushu, pp. 447–450. doi: 10.1109/IIH-MSP.2014.118.
[14] Cho, Young, Moon, S.C., 2013. Weighted mining frequent pattern-based customer's RFM score for personalized u-commerce recommendation system. J. Converg. 4, 36–40.
[15] Jiang, T., Tuzhilin, A., March 2009. Improving personalization solutions through optimal segmentation of customer bases. IEEE Trans. Knowledge Data Eng. 21(3), 305–320. https://doi.org/10.1109/TKDE.2008.163N.
[16] Lu, H., Lin, J.Lu., Zhang, G., May 2014. A customer churn prediction model in telecom industry using boosting. IEEE
[17] Memon, K.H., Lee, D.H., 2017. Generalised fuzzy c-means clustering algorithm with local information. In: IET Image Processing, vol. 11, no. 1, pp. 1-12, 1.
[18] Zahrotun, L., 2017. Implementation of data mining technique for customer relationship management (CRM) on online shop tokodiapers.com with fuzzy c-means clustering. In: 2017 2nd International conferences on Information Technology, Information Systems and Electrical Engineering (ICITISEE), Yogyakarta, pp. 299–303.
[19] Tong, L., Wang, Y., Wen, F., Li, X., Nov. 2017. The research of customer loyalty improvement in telecom industry based on NPS data mining. China Commun. 14 (11), 260–268. https://doi.org/10.1109/CC.2017.8233665.
[20] Shah, S., Singh, M., 2012. Comparison of a Time Efficient Modified K-mean Algorithm with K-Mean and K-Medoid Algorithm. In: 2012 International
