Professional Documents
Culture Documents
1 s2.0 S1319157818304178 Main
1 s2.0 S1319157818304178 Main
1 s2.0 S1319157818304178 Main
a r t i c l e i n f o a b s t r a c t
Article history: The efficient segmentation of customers of an enterprise is categorized into groups of similar behavior
Received 1 June 2018 based on the RFM (Recency, Frequency and Monetary) values of the customers. The transactional data
Revised 26 August 2018 of a company over is analyzed over a specific period. Segmentation gives a good understanding of the
Accepted 4 September 2018
need of the customers and helps in identifying the potential customers of the company. Dividing the cus-
Available online 5 September 2018
tomers into segments also increases the revenue of the company. It is believed that retaining the cus-
tomers is more important than finding new customers. For instance, the company can deploy
Keywords:
marketing strategies that are specific to an individual segment to retain the customers. This study ini-
Customer segmentation
RFM analysis
tially performs an RFM analysis on the transactional data and then extends to cluster the same using tra-
K-Means ditional K-means and Fuzzy C- Means algorithms. In this paper, a novel idea for choosing the initial
Fuzzy C-Means centroids in K- Means is proposed. The results obtained from the methodologies are compared with
Initial centroids one another by their iterations, cluster compactness and execution time.
Ó 2018 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University. This is an
open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
https://doi.org/10.1016/j.jksuci.2018.09.004
1319-1578/Ó 2018 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1252 A.J. Christy et al. / Journal of King Saud University – Computer and Information Sciences 33 (2021) 1251–1257
proposed with an intension to reduce the number of iterations by combining transactional data of several customers. The authors
than the traditional clustering algorithms. The outcome of the pro- also showed that it is NP-hard to find an optimal segmentation
posed work is a meaningful customer segmentation which will be solution. So, Tuzhilin came up with different sub-optimal cluster-
useful for marketing people. The rest of the study focuses on ana- ing methods. The authors then experimentally examined the cus-
lyzing all the three clustering approaches regarding iterations, tomer segments obtained by direct grouping, and it is observed
cluster compactness, execution time and various other factors. to be better than the statistical approach.
3. Algorithm description
2. Literature review
and similar dataset is better than K-means algorithm because in K- This paper proposes a new way for choosing the initial centroids
means a data point must entirely be present in only one cluster. In for the K- Means algorithm. The three variables Recency (R), Fre-
this study, a customer may belong to more than one cluster which quency(F) and Monetary(M) that are to be clustered are sorted
increases the chance of retaining the customers by treating them and stored in ascending order in three vectors as R’, F’ and M’.
with different offers for each segment. The time complexity of The median value of each vector is found and assigned as the initial
Fuzzy C-Means is O (n + k + d2 + i), where d is the number of centroids for the K-Means algorithm. Iteratively the median values
iterations. are calculated from the R’, F’ and M’ values k number of times
Similar to the previous algorithm, the variables are scaled using depending the value of k (number of segments). Choosing initial
min-max normalization. Now the customers are clustered based centroids with its mean distribution reduces the number of itera-
on Fuzzy C-Means clustering (Zahrotun, 2017) based on the tions and computational time of traditional K –Means algorithm.
recency, frequency, and monetary values. It is observed the clusters that are obtained through the modified
approach is more meaningful and appropriate compared to the tra-
Algorithm 2 (Fuzzy C-Means). ditional method by choosing centroids randomly. The complexity
of RM K-Means is as same as K-Means, which is O (n + k + i). Since
the initial random centroids are computed using a median based
Input => Customer Dataset containing ‘n’ instances method, the proposed RM K-means algorithm reduces the number
=> k: the number of clusters of iterations with K-means.
Output:
=> Customer data Partitioned to k clusters Algorithm 2 (RM K-Means).
Algorithm:
1. Randomly selects k initial centers.
2. Calculate thefuzzy matrix mij. Input:
P membership
lij ¼ 1= kc¼1 ddij m2 1 => Customer dataset containing n instances
ic
=> K: the number of clusters
3. Compute
P the
cluster
m centers vj.
P Output:
v j ¼ ni¼1 lij xi = ni¼1 ðlij Þm Customer data partitioned to k clusters
4. Repeat steps 2 and 3 until lowest value of j is achieved, Algorithm steps:
where j is the objective function. 1. Upload the Customer transactional dataset
2. Pre-process the dataset by removing outliers and null
valued instances
3. Compute R, F and M Score for each instance
3.4. Repetitive median K-Means 4. Order RFM score in sequence as R’, F’ and M’ and store it
in a vector
Although K- Means algorithm is traditionally used for grouping, 5. Let S = total number of instances/ k
it has few disadvantages. K-Means chooses the initial centroids in a 6. Split R’F’M’ vector with k segments where each seg-
random fashion. Then the distance of each data point from the cen- ment consists of S instances
troid is calculated by Euclidian distance, and each point is allocated 7. For i = 1 to k do
to the closest centroid which forms a cluster. The problem with 6.1. Compute median for each segment i
choosing initial centroids randomly is that the centroid may bun- 6.2. Store the median in the vector m[i].
dle closer to each other causing the clusters to be less meaningful. 8. Let the values of m vector be the initial centroids for K-
Initial centroids determine the goodness of cluster such as reduc- means
ing number of iterations, global optimum solutions, and cluster 9. Compute the distance of each object’s RFM with
compactness. The performance of K-Means is degraded by random centroids
initial centroids (Liu et al., 2014). 10. Group the object based on the minimum distance
11. Recompute cluster centroids
12. Repeat steps 8 to 10 until there is no change in the clus-
Table 2
Online Retail Dataset Description.
ter members or centroids
Table 3
RFM Calculator.
Table 4
Comparative Analysis of RM K-Means.
name, the price of the product, date and time of purchase, etc. The average silhouette width of RM K-Means is greater than that of
original data set consists of 18,267 instances with eight attributes. Fuzzy C-Means clustering and the K – Means clustering. The results
The dataset contains the purchase of information of customers given in Table 4 are plotted in Fig. 3.
from 1-12-2010 to 09-12-2011. The instances with missing values
in important attributes, unit price and quantity less than 0 and the
5. Conclusion
date exceeding the current date are all removed during data
pre-processing. To identify the outliers, the Z-Score analysis is also
Segmenting the customers will deepen the relationships with
performed as an additional step in data pre-processing. The mean-
customers. Finding new customers for the enterprise is vital,
ingful instances such as invoice data and time, the quantity of pro-
meanwhile retaining the existing clients (Tong et al., 2017) is even
duct per transaction, product price per unit concerning recency,
more important. In this paper, segmentation is done using RFM
monetary and frequency are filtered, and only those records have
analysis and then is extended to other algorithms like K –Means
been inputted into the benchmark algorithms. The modified data-
clustering, Fuzzy C – Means and a new algorithm RM K-Means
set contains 772 instances with three additional attributes recency,
by making a minor modification in the existing K – Means cluster-
frequency and monetary derived from RFM calculation. The
ing. The working of these approaches is analyzed. The time taken
description of the original dataset is shown in Table 2.
by each algorithm to execute is analyzed, and it is observed that
the proposed K –Means approach consumes lesser time and also
4.1. RFM calculator reduces the number of iterations. The proposed algorithm is more
effective because the centroids are more meaningful and are calcu-
Table 3 denotes the precise calculation for computing the RFM lated at the beginning based on the effective medians of data dis-
score for each instance, where the score 5 in each parameter is tribution. Since segmentation is done based on the values of
the highest. recency, frequency, and monetary values, the company can cus-
The output plots obtained from K- Means, Fuzzy C-Means and tomize their marketing strategies to the customers based on their
RM K- Means are shown in Fig. 2. buying behavior. Future work includes studying the performance
The execution time for each algorithm is calculated from the of the customers in each segment such as the products which are
system time. It is observed that the proposed RM K- Means con- bought frequently by the members of each segment. This would
sumes lesser time than the other two techniques because of the help better in providing better promotional offers to specific
lesser number of iterations. The number of iterations is reduced products.
in the RM K-Means because the initial centroids are calculated
based on median values. The silhouette width is used for studying
References
the average distance between the resulting clusters. Silhouette plot
visually analyses the clustering outcome and displays the number He X., Li, C., 2016. The research and application of customer segmentation on
of customers in each cluster and also the minimum distance from e-commerce websites. In: 2016 6th International Conference on Digital Home
(ICDH), Guangzhou, pp. 203–208. doi: 10.1109/ICDH.2016.050.
the point in the cluster to that of another cluster. A higher value of
Haiying, M., Yu, G., 2010. Customer Segmentation Study of College Students Based
average silhouette width indicates that the data points within a on the RFM. In: 2010 International Conference on E-Business and E-
cluster are closer to each other but not to the points in other clus- Government, Guangzhou, pp. 3860-3863. doi: 10.1109/ICEE.2010.968.
ters. The average silhouette width is calculated for the resulting Sheshasaayee, A., Logeshwari, L., 2017. An efficiency analysis on the TPA clustering
methods for intelligent customer segmentation. In: 2017 International
clusters obtained by both K-means clustering technique and by Conference on Innovative Mechanisms for Industry Applications (ICIMIA),
the RM K-Means and K-Means technique. It is observed that the Bangalore, pp. 784–788.
A.J. Christy et al. / Journal of King Saud University – Computer and Information Sciences 33 (2021) 1251–1257 1257
Srivastava, R., 2016. Identification of customer clusters using RFM model: a case of Conference on Communication Systems and Network Technologies, Rajkot,
diverse purchaser classification. Int. J. Bus. Anal. Intell. 4 (2), 45–50. pp. 435–437.
Memon, K.H., Lee, D.H., 2017. Generalised fuzzy c-means clustering algorithm with Liu, C.C., Chu, S.W., Chan, Y.K., Yu, S.S., 2014. A Modified K-Means Algorithm – Two-
local information. In: IET Image Processing, vol. 11, no. 1, pp. 1-12, 1. Layer K-Means Algorithm. In: 2014 Tenth International Conference on
Zahrotun, L., 2017. Implementation of data mining technique for customer Intelligent Information Hiding and Multimedia Signal Processing, Kitakyushu,
relationship management (CRM) on online shop tokodiapers.com with fuzzy pp. 447–450. doi: 10.1109/IIH-MSP.2014.118.
c-means clustering. In: 2017 2nd International conferences on Information Cho, Young, Moon, S.C., 2013. Weighted mining frequent pattern-based customer’s RFM
Technology, Information Systems and Electrical Engineering (ICITISEE), score for personalized u-commerce recommendation system. J. Converg. 4, 36–40.
Yogyakarta, pp. 299–303. Jiang, T., Tuzhilin, A., March 2009. Improving personalization solutions through
Tong, L., Wang, Y., Wen, F., Li, X., Nov. 2017. The research of customer loyalty optimal segmentation of customer bases. IEEE Trans. Knowledge Data Eng. 21
improvement in telecom industry based on NPS data mining. China Commun. (3), 305–320. https://doi.org/10.1109/TKDE.2008.163N.
14 (11), 260–268. https://doi.org/10.1109/CC.2017.8233665. Lu, H., Lin, J.Lu., Zhang, G., May 2014. A customer churn prediction model in telecom
Shah, S., Singh, M., 2012. Comparison of a Time Efficient Modified K-mean industry using boosting. IEEE Trans. Ind. Inf. 10 (2), 1659–1665. https://doi.org/
Algorithm with K-Mean and K-Medoid Algorithm. In: 2012 International 10.1109/TII.2012.2224355.