A Review On Churn Prediction and Customer Segmentation Using Machine Learning

2022 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COM-IT-CON), 26-27 May 2022 | 978-1-6654-9602-5/22/$31.00 ©2022 IEEE | DOI: 10.1109/COM-IT-CON54601.2022.9850924


Ankita Zadoo, Tanmay Jagtap, Nikhil Khule, Ashutosh Kedari, Shilpa Khedkar
Department of Computer Engineering,
Modern Education Society’s College of Engineering,
Pune, Maharashtra, India
ankitazadoo2@gmail.com, tanmayjagtap27@gmail.com, nikhil.akhule1903@gmail.com, kedariashutosh89@gmail.com, shilpa.khedkar@mescoepune.org

Abstract: Telecom companies generate a huge amount of customer data every day. Since acquiring new customers is a difficult task, this data can be used to explain customers' behaviour with the company and to help companies plan retention strategies. Customer segmentation, a technique of dividing the customer base into multiple groups of customers with similar behaviour, helps in getting an idea about the company's customers. First, the data is cleaned, analyzed, and prepared for model training. Data preparation is important in this problem since the machine learning algorithms used for these tasks perform poorly when given unprepared data. For this, feature selection techniques such as Principal Component Analysis (PCA), information gain and correlation attribute ranking filters, and Linear Discriminant Analysis (LDA) can be used. Churn prediction is a supervised binary classification task and customer segmentation is an unsupervised clustering task. This study focuses on some of the churn prediction and customer segmentation techniques. With the help of these techniques, a churn prediction model to predict churn and a model to generate customer segments can be built.

Keywords: Churn Prediction, Customer Segmentation, Clustering, Feature Selection, Machine Learning, Telecom.

I. INTRODUCTION

The increasing demand for internet services has made the telecom market more competitive than ever. The cost of acquiring new customers is much higher than the cost of retaining existing ones. Thus, Customer Relationship Management (CRM) analysts and business analysts need to know the behaviour of the customers and the reasons for a particular behaviour. Attracting new customers alone is no longer a sufficient strategy for businesses, and this makes a big difference because of the volatile nature of the telecom customer base. Companies have huge volumes of data generated as customers use their services. This data can be used to make informed decisions to retain existing customers. Churn management is an important aspect of customer relationship management. Churn prediction and customer segmentation can help companies understand the effects of current business strategies on the customers' business with the company. Predicting the churn of customers can help plan retention strategies for those customers. The business of customers who leave one telecom service provider gives an edge to its competitors. This is why customer churn is an important factor in Customer Relationship Management.

Dividing the customer base into groups or clusters with similar behaviour is called customer segmentation. Customers in one group or cluster have similar behaviour.

Telecom companies need not only customer churn prediction but also factor analysis to understand the reasons for churn. It is beneficial for telecom companies to use their resources on high-value customers. Churn prediction and customer segmentation techniques help in identifying these potentially high-value customers.

The number of people using internet services is increasing day by day, and innovations in the telecom sector are making the internet more accessible than ever. This ever-increasing need for internet service is intensifying competition in the telecommunications sector. Companies are finding ways to retain existing customers, as the cost of retaining is much less than that of acquiring new ones. The data generated by the customers can be used to the company's benefit using machine learning techniques. Identifying churning customers and customer segments can help companies plan their strategies for high-value customers. This way, companies can spend their time and resources on their high-value customers, which will keep them from churning. Identifying the needs of newly joined customers can also help retain them, since newly joined customers are the most likely to churn. The problem is to predict the probability of churn of a customer and to identify which group of customers the customer belongs to.

II. LITERATURE SURVEY

Pre-existing work uses many machine learning techniques as well as some deep learning techniques. This review focuses on some of the techniques used in the domain.

The literature survey is divided into two parts: churn prediction and customer segmentation.

A. Churn Prediction

Ullah et al. [1] use a real-world dataset of Call Detail Records (CDR) from a South Asian telecom company. For picking an algorithm, the study suggests applying multiple algorithms to the dataset and choosing the one that performs best. Tree-based algorithms such as decision tree, random forest, and random tree performed better than other classification algorithms such as Multilayer Perceptron (MLP) and logistic regression. These classification algorithms perform better when they are given high-value features. For this, features are ranked using information gain and correlation attribute ranking filters, and the high-ranking features are selected.

De Caigny et al. [2] proposed the Logit Leaf Model (LLM), an approach to churn prediction that combines decision trees and logistic regression. Decision trees struggle to handle linear relationships between variables, and logistic regression has difficulties with interaction effects between variables [2]. In the logit leaf model, a decision tree is built on the dataset to generate homogeneous segments of customers, and a logistic regression is then fitted to each of these segments. The logistic regressions in the LLM use a forward selection mechanism, which gives the algorithm a built-in feature selection mechanism. The proposed logit leaf model performs on par with ensemble methods such as random forest.

Ahn et al. [3] compile many techniques used for churn prediction in various domains such as insurance, games, internet services, and management. The study also describes terms related to churn such as Customer Acquisition Cost (CAC) and Customer Lifetime Value (CLV). It starts with a brief definition of churn and explains contractual and non-contractual churn. It defines the churn period as a period in which a customer's trust can be restored. The RFM (Recency, Frequency, Monetary) method for evaluating loyal, high-value customers is also discussed. The study also addresses data preparation techniques that improve the performance of some machine learning algorithms, such as one-hot encoding, bucketing, normalization, and feature embedding. Churn prediction algorithms perform better on balanced datasets than on imbalanced ones. In practice, the available data may not be sufficient for a machine learning algorithm. To solve this problem, oversampling and undersampling methods can be used to balance the dataset; oversampling techniques include Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), and Majority Weighted Minority Oversampling Technique (MWMOTE).

Adhikary and Gupta [4] apply over 100 classifiers of different families to churn prediction. These algorithms are grouped as discriminant analysis, Bayesian approach, neural network classifiers, decision trees, boosting, bagging, random forest, Support Vector Machine (SVM), logistic regression, and others. The performance of these models is compared using metrics such as precision, recall, F-measure, Matthews Correlation Coefficient (MCC), and Area Under the Receiver Operating Characteristic Curve (AUC-ROC). The best results were achieved with classifiers of the random forest family.

Jain, Yadav, and Manoov [5] studied various machine learning techniques for churn prediction in the banking, telecom, and IT sectors. Churn in each sector and the features responsible for churn in those domains are discussed. The performance of four algorithms was compared across these three sectors. Random forest performed best on the banking dataset, whereas logistic regression performed best in the IT sector and XGBoost performed best in telecom. This showed that the most suitable churn prediction technique can differ from sector to sector.

Ahmad et al. [6] mainly focus on the real-world application of the system. The study uses the Hortonworks Data Platform (HDP), an open-source big data platform. First, the data is gathered from all resources and stored in HDFS, and Spark is used to process the stored data. Spark is also used for feature engineering. The volume of data under study was around 70 terabytes. The data was also found to be imbalanced, since the majority class occupied around 95% of the data. The imbalance was addressed by undersampling the majority class. Tree-based algorithms such as decision tree and gradient boosting machine were used for classification.

Jain et al. [7] use logit boost and logistic regression models for churn prediction. Within CRM, churn management has received much attention since customer acquisition is more expensive than customer retention. The logit boost model is an additive logistic regression model. The results of these algorithms are compared using metrics such as AUC-ROC, F-measure, precision, and recall. Both models gave an accuracy of around 85%.

B. Customer Segmentation

Christy et al. [8] discuss the RFM ranking technique, which is used to evaluate a customer's value to the company. RFM stands for Recency, Frequency, and Monetary [8]. Recency is defined as the number of days a customer has taken between two purchases. Frequency is the number of purchases or interactions made by the customer in a specific time period. Monetary is the amount of money spent by the customer during a specific period. RFM values are calculated after the data preprocessing phase. This data is then clustered by three algorithms, namely K-means, Fuzzy C-means, and Repetitive Median (RM) K-means [8]. K-means clustering assumes that a data point belongs to only one cluster. Fuzzy C-means permits a data point to be in more than one cluster. Random selection of the initial centroids makes the K-means algorithm less effective. Hence, repetitive median K-means is used, which reduces the time complexity of the algorithm by choosing the medians of the variables as initial centroids. RM K-means performed significantly better than K-means and Fuzzy C-means.

In another study, Alkhayrat et al. [9] used deep learning techniques and PCA for dimensionality reduction of the dataset and removal of irrelevant features. Autoencoders are a special type of neural network used for unsupervised tasks. An autoencoder consists of three main components: encoder, code, and decoder. The encoder compresses the input into a lower-dimensional code and the

decoder reconstructs the input from the code [9]. Here, the autoencoder learns to compress the input for dimensionality reduction. Three datasets were prepared: the original dataset, a PCA-transformed dataset, and an autoencoder-transformed dataset. The K-means clustering algorithm was then applied to every dataset to compare the effect of dimensionality reduction. Silhouette coefficients of the generated clusters were used for comparison. The PCA-transformed dataset gave poorer results than both the original dataset and the autoencoder-transformed dataset. The autoencoder was implemented using the Keras and TensorFlow libraries in Python. The study shows that the autoencoder's ability to learn non-linear relations makes it useful for feature selection.

Pantula et al. [10] proposed a neural network approach for implementing the fuzzy C-means clustering technique. The fuzzy C-means algorithm uses a membership function to allow a data point to be a member of more than one cluster. The authors implement this algorithm with deep learning because of the flexibility of neural networks. The neural network performed significantly better than the traditional fuzzy C-means algorithm.

Tavakoli et al. [11] used a modified version of the RFM model for customer valuation in customer segmentation. The authors separated R (Recency) from F (Frequency) and M (Monetary), since frequency and monetary together represent the loyalty of the customer. The proposed model was evaluated on real data from Digikala, a large-scale e-commerce company. The authors also applied different definitions for RFM, since every domain may define these variables differently. For clustering, the K-means algorithm was used. Retention strategies for each cluster were also discussed.

Customer acquisition cost is much larger than the cost of customer retention. Losing customers to competing organizations affects the enterprise value of the company. Customer segmentation can divide the customer base into groups with similar behaviour. The analysis of the generated clusters and the churn prediction results can be used for planning retention strategies for customers. Many complex problems, such as image processing and natural language processing tasks, are being solved by machine learning methods on relevant data. The methods discussed in this research are new and can be used to gain insights into the customer base. The system proposed in this research combines many machine learning and statistical methods with different characteristics. Feature selection methods can be used to identify the attributes which lead to a certain behaviour of customers. The analysis of churn and of the customer segments will take telecom companies ahead of their competition. The companies can work on their products and services to suit the needs of customers.

III. PROPOSED METHOD

A. Data Collection

Data collection is the process of collecting data from various resources. This data can be structured, semi-structured, or unstructured. In our case, the data is structured but may be spread out over multiple resources. In this phase, the data is collected from the company's data stores or customer data platforms (CDP), where the data of all customers is stored. The stored data may cover multiple products of a company. The emphasis of the system is on getting useful insights into the existing customers of the company. In this case, the datasets in use will be the Cell2Cell dataset [12], the dataset of the churn prediction competition from Kaggle [13], and a sample dataset from IBM [14].

Fig. 1 System Architecture

B. Data Cleaning

This is the process of identifying and correcting the errors in the dataset that may affect the machine learning model. Cleaned data is found to improve a model's performance. Unclean data may contain incomplete, irrelevant, or sometimes duplicate items. These can be removed from the dataset altogether or, if the dataset is small, the missing values can be computed using statistical techniques. Python supports many libraries which can be used to clean the data.

C. Data Analysis

In this phase, the cleaned data is analyzed to get insights into the behaviour of the features. The analysis is done to make sense of the data and to understand each feature, and the relationships between features, before training the model.

D. Data Preparation

This is the last step before the training phase of the system. In this phase, the dataset is made ready for model training. A machine learning algorithm may be sensitive to some irregularity in the data. For example, the decision tree algorithm struggles to handle linear relationships between the features, and logistic regression has difficulties with interactions between the variables. Thus, preparing the dataset to suit the algorithm helps the model's performance. Many algorithms tend to favor the dominant class in an imbalanced dataset. To overcome this, undersampling and oversampling techniques are used. In undersampling, some of the data points from the dominant class are omitted. In oversampling, the data points of less frequently occurring classes are repeated to overcome the imbalance. There are many variants of undersampling and oversampling techniques, but the core principle remains the same. In clustering, algorithms like k-means do not perform well with high-dimensional data. A dataset with n features has an n-dimensional problem space but, in reality, not all of the problem space is used by the dataset. Thus, feature selection techniques are used to transform the n-dimensional space into a lower-dimensional space. Some techniques used in feature selection are PCA and information gain ranking filters. The autoencoder is a deep learning approach to dimensionality reduction. These neural networks can be used to reduce the effect of redundant features in the dataset.
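The random undersampling and oversampling described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the resampling procedure of the proposed system: the function names `random_undersample` and `random_oversample`, the toy rows, and the 95/5 class split are all hypothetical, and only the Python standard library is used.

```python
import random
from collections import defaultdict


def _rows_by_class(labels):
    """Map each class label to the list of row indices carrying it."""
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)
    return groups


def random_undersample(rows, labels, seed=0):
    """Omit rows of larger classes until every class matches the smallest one."""
    rng = random.Random(seed)
    groups = _rows_by_class(labels)
    target = min(len(idx) for idx in groups.values())
    keep = sorted(i for idx in groups.values() for i in rng.sample(idx, target))
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_oversample(rows, labels, seed=0):
    """Repeat rows of smaller classes until every class matches the largest one."""
    rng = random.Random(seed)
    groups = _rows_by_class(labels)
    target = max(len(idx) for idx in groups.values())
    keep = []
    for idx in groups.values():
        keep.extend(idx)  # every original row is kept
        keep.extend(rng.choices(idx, k=target - len(idx)))  # repeats, with replacement
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy data: 95 "not churn" rows (label 0) and 5 "churn" rows (label 1).
rows = [[float(i), float(i % 7)] for i in range(100)]
labels = [0] * 95 + [1] * 5

under_rows, under_labels = random_undersample(rows, labels)
over_rows, over_labels = random_oversample(rows, labels)
print(len(under_rows), len(over_rows))
```

Undersampling discards information from the dominant class, while oversampling by repetition adds no new information and may encourage overfitting to the repeated rows; which trade-off is acceptable depends on the size of the dataset.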

IV. MACHINE LEARNING MODELS


Churn prediction is a binary classification task, whereas
customer segmentation is a clustering task.
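To make the distinction concrete, the following sketch contrasts the two task types on the same toy feature. It is an illustration under assumptions rather than the system's implementation: the feature (support calls per month), its values, and the helper names `fit_logistic` and `kmeans_1d` are hypothetical. The classifier needs churn labels to learn from, whereas the clustering routine sees only the feature values.

```python
import math


def sigmoid(z):
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Supervised: learn a weight and bias from labelled examples by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


def kmeans_1d(xs, centroids, iterations=20):
    """Unsupervised: group unlabelled values around the nearest centroid."""
    centroids = list(centroids)

    def nearest(x):
        return min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))

    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for x in xs:
            clusters[nearest(x)].append(x)
        # Move each centroid to the mean of its cluster (unchanged if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[k]
                     for k, c in enumerate(clusters)]
    return centroids, [nearest(x) for x in xs]


# Hypothetical feature: support calls per month for eight customers.
calls = [0, 1, 1, 2, 6, 7, 8, 9]
churned = [0, 0, 0, 0, 1, 1, 1, 1]  # labels are available only to the classifier

w, b = fit_logistic(calls, churned)
p_loyal = sigmoid(w * 0 + b)   # predicted churn probability at 0 calls
p_risky = sigmoid(w * 9 + b)   # predicted churn probability at 9 calls

centroids, assignments = kmeans_1d(calls, centroids=[0.0, 9.0])
print(p_loyal < 0.5 < p_risky, centroids, assignments)
```

Note that in this sketch the sigmoid output is read as the probability of the positive class (churn, label 1); the probability of 'not churn' is one minus that value.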

A. Churn Prediction

Popular classification algorithms such as logistic regression, decision tree, random forest, and multi-layer perceptron can be used to predict customer churn. In our proposed system, the prepared data will be fed to a series of churn prediction models. The results of the classifiers will then be compared to see which approach performs best.

Logistic regression is a binary classification algorithm. It is based on the sigmoid function, whose range is from 0 to 1. It is given as follows [5]:

σ(z) = 1 / (1 + e^(-z))    (1)

The model takes the inputs and predicts the probability of the input belonging to class '0', in this case 'not churn'.

Fig. 2 Sigmoid Function

The multi-layer perceptron (MLP) is a feed-forward neural network which can be used to predict churn. Neural networks in general have very good fault tolerance. They can model complex relationships between independent variables.

An MLP consists of at least three layers of nodes called neurons: an input layer, a hidden layer, and an output layer. These neurons are modelled on the neurons in the human brain. The nodes in consecutive layers are connected to each other. Each neuron takes its input from the previous layer and also has a bias associated with it. The output of a neuron is the weighted sum of the input vector plus the bias, given as follows:

z = Σ_i w_i x_i + b    (2)

where x_i are the inputs, w_i the corresponding weights, and b the bias. This output is then given to an activation function. The value of the activation function is passed to the neurons in the next layer.

Fig. 3 Multilayer Perceptron Neural Network

Activation functions are used to introduce non-linearity in the output. Non-linear functions such as sigmoid (1), tanh, and ReLU are used as activation functions. For classification tasks, the sigmoid function is generally used as the activation function.

Random forest is a tree-based ensemble algorithm. The outputs of multiple decision trees are combined to give the final output. The Random Forest (RF) algorithm can be used for binary or multi-class classification tasks as well as regression tasks. Each decision tree is formed by choosing a subset of the initial features. This algorithm is used because of its stable nature and accurate results. Random forest is less prone to overfitting than a single decision tree. Its drawback is poor explainability.

B. Customer Segmentation

Customer segmentation is an unsupervised clustering task. Popular algorithms used for clustering are K-means, DBSCAN, and fuzzy C-means. Here, feature selection techniques will be used to reduce the dimensions of the dataset. The reduced dataset will have high-value features, which will help the performance of the clustering algorithms. K-means is a simple clustering algorithm which assigns each data point to the nearest of k centroids. This algorithm assumes that every data point belongs to a single cluster. The algorithm measures the Euclidean distance between the centroid and the data point. The process is repeated until the sum of intra-cluster distances cannot be minimized further. Fuzzy C-means applies a different approach to clustering. It allows a data point to be a member of more than one cluster. Instead of only measuring the distance from the centroid, it measures the likelihood of the data point belonging to each cluster. This gives much more information about the data points.

V. EVALUATION AND VISUALIZATION OF THE RESULTS

For churn prediction, metrics such as precision, recall, F-measure, and AUC-ROC can be used. These metrics are calculated from the confusion matrix of the models and are explained in [1] and [7]. For the evaluation of clustering algorithms, the elbow method for the optimal number of clusters and the silhouette coefficient can be used. After studying

the results, retention strategies for each cluster can be designed. Factor analysis of churn data can be used to find the causes of churn.

VI. CONCLUSIONS

Considering the volatile nature of customers in the telco market, companies will benefit from proposing retention strategies for churning customers. Companies can update their services according to the retention strategies. In this paper, a system for churn prediction and customer segmentation has been proposed. The system will help companies find potentially churning customers and strategies to retain them.

VII. FUTURE SCOPE

The proposed model can be used by customer relationship managers (CRM) to predict potentially churning customers and to identify clusters in the customer base. The results of these models can be used by marketers to implement retention strategies for churning customers. Knowing which cluster a customer belongs to will help in choosing appropriate retention strategies. In the future, researchers can consider using other machine learning models which can explain the factors of churn along with the churn probability. Researchers can also work on the imbalance in the datasets and improve the performance of existing models. Other businesses can implement this system to make crucial business decisions that help them grow. With some domain-specific modifications, this system can also help identify churning customers in other industries.

ACKNOWLEDGMENT

We would like to thank the Department of Computer Engineering, Modern Education Society's College of Engineering (Wadia Campus), Pune for all the support provided for this research. We would like to show our gratitude to Mrs. S. A. Sapkal for sharing their pearls of wisdom during the research.

REFERENCES

[1] I. Ullah, B. Raza, A. K. Malik, M. Imran, S. U. Islam and S. W. Kim, "A Churn Prediction Model Using Random Forest: Analysis of Machine Learning Techniques for Churn Prediction and Factor Identification in Telecom Sector," in IEEE Access, vol. 7, pp. 60134-60149, 2019, doi: 10.1109/ACCESS.2019.2914999.
[2] Arno De Caigny, Kristof Coussement, Koen W. De Bock, "A new hybrid classification algorithm for customer churn prediction based on logistic regression and decision trees", European Journal of Operational Research, Volume 269, Issue 2, 2018, Pages 760-772, ISSN 0377-2217, https://doi.org/10.1016/j.ejor.2018.02.009.
[3] J. Ahn, J. Hwang, D. Kim, H. Choi and S. Kang, "A Survey on Churn Analysis in Various Business Domains," in IEEE Access, vol. 8, pp. 220816-220839, 2020, doi: 10.1109/ACCESS.2020.3042657.
[4] Adhikary, D.D., Gupta, D., "Applying over 100 classifiers for churn prediction in telecom companies". Multimed Tools Appl 80, 35123–35144 (2021). https://doi.org/10.1007/s11042-020-09658-z
[5] Jain, H., Yadav, G., Manoov, R. (2021), "Churn Prediction and Retention in Banking, Telecom and IT Sectors Using Machine Learning Techniques". In: Patnaik, S., Yang, XS., Sethi, I. (eds) Advances in Machine Learning and Computational Intelligence. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-15-5243-4_12
[6] Ahmad, A.K., Jafar, A. & Aljoumaa, K. "Customer churn prediction in telecom using machine learning in big data platform". J Big Data 6, 28 (2019). https://doi.org/10.1186/s40537-019-0191-6
[7] Hemlata Jain, Ajay Khunteta, Sumit Srivastava, "Churn Prediction in Telecommunication using Logistic Regression and Logit Boost", Procedia Computer Science, Volume 167, 2020, Pages 101-112, ISSN 1877-0509, https://doi.org/10.1016/j.procs.2020.03.187.
[8] A. Joy Christy, A. Umamakeswari, L. Priyatharsini, A. Neyaa, "RFM ranking – An effective approach to customer segmentation", Journal of King Saud University - Computer and Information Sciences, Volume 33, Issue 10, 2021, Pages 1251-1257, ISSN 1319-1578, https://doi.org/10.1016/j.jksuci.2018.09.004
[9] Alkhayrat, M., Aljnidi, M. & Aljoumaa, K. "A comparative dimensionality reduction study in telecom customer segmentation using deep learning and PCA". J Big Data 7, 9 (2020). https://doi.org/10.1186/s40537-020-0286-0
[10] Priyanka D. Pantula, Srinivas S. Miriyala, Kishalay Mitra, "An Evolutionary Neuro-Fuzzy C-means Clustering Technique", Engineering Applications of Artificial Intelligence, Volume 89, 2020, 103435, ISSN 0952-1976, https://doi.org/10.1016/j.engappai.2019.103435
[11] M. Tavakoli, M. Molavi, V. Masoumi, M. Mobini, S. Etemad and R. Rahmani, "Customer Segmentation and Strategy Development Based on User Behavior Analysis, RFM Model and Data Mining Techniques: A Case Study," 2018 IEEE 15th International Conference on e-Business Engineering (ICEBE), 2018, pp. 119-126, doi: 10.1109/ICEBE.2018.00027.
[12] "Cell2cell," 2018. [Online]. Available: https://www.kaggle.com/jpacse/datasets-for-churn-telecom
[13] "Customer churn prediction 2020," 2020. [Online]. Available: https://www.kaggle.com/c/customer-churn-prediction-2020
[14] "Telco customer churn," 2018. [Online]. Available: https://www.kaggle.com/blastchar/telco-customer-churn
