
International Telecommunication Networks and Applications Conference (ITNAC)

Comparison of learning techniques for prediction of customer churn in telecommunication
Manpreet Singh, Sarbjeet Singh, Nadesh Seen, Sakshi Kaushal and Harish Kumar
CSE, UIET, Panjab University
Chandigarh, India
manpreetkkh@gmail.com, sarbjeet@pu.ac.in, nadeshseen@gmail.com, sakshi@pu.ac.in, harishk@pu.ac.in

Abstract—Customer churn is one of the most challenging and demanding issues in the telecom sector. The primary motivation of businesses at present is not only to acquire new customers, but also to retain existing ones. In fact, customer retention is more important because of the high costs associated with acquisition. The present work has been carried out in a churn prediction modeling context and benchmarks four machine learning techniques against a publicly available telecommunication dataset. The results provide two important conclusions: i) the Random Forest technique outperforms the other basic classification models, and ii) feature engineering plays a critical role in the performance of the model.

Keywords—Machine learning, Customer Churn Prediction

I. INTRODUCTION

Customer churning refers to the migration of a customer from one organization to another. Customer churns are those targeted customers who have already decided to leave the company or service provider and plan to shift to a competitor in the market. Customer churn is one of the most rapidly growing issues in the telecom sector. The high cost involved in acquiring a new customer has shifted the focus of the telecom sector from acquiring new customers to retaining existing ones. Literature reveals the following types of customers [1]:

Active Churner: customers who want to leave their service provider.

Passive Churner: customers to whom the companies have discontinued the services.

Silent Churner: customers who may suddenly discontinue the services without any prior notice.

Customer churning is a key concern in telecommunication [2]. The needs and behavior of the customers need to be understood clearly in order to develop a stronger relationship with them, and all such issues are addressed under Customer Relationship Management (CRM). Companies are focusing more on long-term relationships with their customers and observe customer behaviour from time to time. Companies use various feature engineering techniques to uncover hidden relationships between different entities of the database and to predict churn efficiently. The present scenario has led many companies to invest liberally in Customer Relationship Management for customer churn prediction. The customer churn prediction problem is widely studied in various domains such as telecommunication, banking, online shopping, and social network services.

Most churn prediction techniques employ machine learning, and the learning happens in two ways: supervised and unsupervised. Supervised learning trains a model using known input and output data. The data is labeled, and these labels set the ground to exploit the data to predict future outputs on new data. Unsupervised learning is employed if the data is unlabeled. It finds hidden patterns or structures in the input data using statistical means to predict churn.

The present study applies four classical machine learning algorithms (SVM, Logistic Regression, k-NN and Random Forest) for the prediction of churn on a publicly available dataset [3] in the telecommunication domain.

II. LITERATURE REVIEW

A number of initiatives have been taken by different researchers for the prediction of churn in different domains. Table I summarizes the notable work carried out in this area, highlighting the major goals achieved along with the dataset and the techniques used in the work.

III. CHURN PREDICTION AND EVALUATION

This section describes the steps followed for the prediction of churn in telecommunication.

A. Churn Prediction Steps

1) Data Preparation

The data used in this work comes from the Earino company dataset [3]. The dataset consists of 21 columns including the label, and the dataset available at [3] is not normalized. The features available in the dataset are: State, Account length, Area code, International plan, Phone number, Voice mail plan, Number of voice mail messages, Total day minutes, Total day calls, Total day charge, Total evening minutes, Total evening calls, Total evening charge, Total night minutes, Total night calls, Total night charge, Total international minutes, Total international calls, Customer Service Calls, International Charge, and Churn.

The dataset is imbalanced, with the churn class being only 14.5% of the total 3,333 samples. The continuous variables mostly follow a Gaussian distribution. For continuous variables, binning and feature scaling are performed. The dataset has no missing values.

a) Data Pre-Processing

Categorical variables are encoded using One Hot Encoding, an alternative to the label encoding scheme. This method has the benefit that no category value is weighted improperly.

TABLE I. SUMMARY OF CHURN PREDICTION INITIATIVES, DATASETS AND TECHNIQUES USED AND THEIR OUTCOMES

Authors | Techniques | Year | Dataset | Outcomes
W. Au, K.C.C. Chan, X. Yao [5] | Data mining by evolutionary learning, decision tree (C4.5), neural network | 2003 | Malaysian subscriber database of wireless telecom industry | They were able to discover rules very effectively and predicted churn in the telecom data accurately.
S.-Y. Hung, D.C. Yen, H.-Y. Wang [6] | Decision tree and neural network | 2006 | Taiwan telecom company's dataset | Both DT and NN techniques deliver accurate predictions, while BPN performance is better than DT without segmentation.
R.J. Jadhav, U.T. Pawar [7] | Back-propagation neural network algorithm | 2011 | In-house customer database, proprietary call records from the company, and a research survey | Customers who are at risk of churning are predicted.
A. Sharma, P. Prabin Kumar [8] | Artificial neural network | 2011 | Telecom dataset, UCI Repository, University of California, Irvine | Accuracy obtained by the artificial-neural-network-based model is 92%.
H. Abbasimehr [9] | Adaptive neuro-fuzzy inference system (ANFIS) | 2011 | Telecom dataset, UCI Repository, University of California, Irvine | Neuro-fuzzy performed better than C4.5 and RIPPER in terms of accuracy, sensitivity and specificity.
E. Shaaban, Y. Helmy, A. Khedr, M. Nasr [10] | Decision tree, neural network, and SVM | 2012 | Dataset obtained from an anonymous mobile service provider | Accuracy of the neural network is 83.7%, SVM is 83.7% and decision tree is 77.9%.
I. Brandusoiu, G. Toderean [11] | Support vector machine (SVM) algorithm | 2013 | Telecom dataset, UCI Repository, University of California, Irvine | Accuracy of the SVM-based model is 88.56%.
K. Kim, C.-H. Jun, J. Lee [12] | Logistic regression and multilayer perceptron neural networks | 2014 | Customers' personal information and CDR data | An efficient approach is developed using SPA (as propagation process).
G. Olle [13] | Logistic regression, voted perceptron | 2014 | Asian mobile telecom operator dataset | A hybrid learning model is developed to predict churn.
T. Vafeiadis, K.I. Diamantaras, G. Sarigiannidis, K.C. Chatzisavvas [14] | SVM, decision tree, artificial neural network, Naive Bayes, regression analysis, boosting | 2015 | Telecom dataset, UCI Repository, University of California, Irvine | The SVM-POLY classifier with AdaBoost performs best.
b) Data Transformation

Feature scaling may or may not have a significant effect on the results, depending heavily on the algorithm being used. Features with larger magnitudes weigh more in calculations than features with smaller magnitudes. To suppress this effect, we brought all the features to approximately the same level of magnitude. Feature scaling is performed on features which are continuous in nature. The Voice Mail Plan, International Plan and Customer Service Calls features are exempted from the standard scaling procedure because Voice Mail Plan and International Plan are categorical with only 0 and 1 as categories, and Customer Service Calls has values spread over a range of 0 to 7 with the values 6 and 7 occurring very rarely.
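A minimal scikit-learn sketch of this selective scaling, reusing the df from the encoding sketch above (column names are assumptions based on the feature list in Section III):

from sklearn.preprocessing import StandardScaler

# Standardize only the continuous usage features; the binary plan
# columns and Customer Service Calls are exempted, as described above.
continuous_cols = [
    "Account length", "Number of voice mail messages",
    "Total day minutes", "Total day calls", "Total day charge",
    "Total evening minutes", "Total evening calls", "Total evening charge",
    "Total night minutes", "Total night calls", "Total night charge",
    "Total international minutes", "Total international calls",
    "International Charge",
]
df[continuous_cols] = StandardScaler().fit_transform(df[continuous_cols])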

2) Feature Engineering

a) Feature Selection

In this phase, a subset of features is selected which are logically more relevant and have more predictive power. As there is no general method to gauge the predictive power and logical relevance of a feature, this step is subjective and requires domain knowledge. For example, Phone number is not considered for predictive analysis because churning of the customer intuitively and logically does not depend on this feature. The features which have the strongest relationship with the output variable are selected using statistical tests. The scikit-learn library provides the SelectKBest class, which is used to select 'k' features according to the 'k' highest ANOVA F-value scores.

b) Dimensionality Reduction

PCA has been used for noise filtering and feature extraction. scikit-learn's built-in implementation of PCA has been used to reduce the dimensionality of the data.

c) Oversampling (SMOTE)

SMOTE is used for oversampling of the churner class (here, the positive class). The SMOTE module from the 'imblearn' library is used because the positive class is under-represented.
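A sketch of these three steps with scikit-learn and imblearn; the k, n_components and random_state values are illustrative only (the tuned per-model values appear in Section IV), and in practice SMOTE is fitted on the training split only:

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE

# Phone number is dropped as logically irrelevant (see above).
X = df.drop(columns=["Churn", "Phone number"]).values
y = df["Churn"].astype(int).values

# a) keep the k features with the highest ANOVA F-value scores
X_sel = SelectKBest(score_func=f_classif, k=16).fit_transform(X, y)

# b) PCA for noise filtering and feature extraction
X_red = PCA(n_components=12).fit_transform(X_sel)

# c) SMOTE synthesizes new minority-class (churner) samples
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_red, y)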
example, Phone number is not considered for predictive optimization using grid search. A pipeline with the following
analysis because churning of the customer intuitively and sequence is created:
logically does not depend on this feature.

Feature Selection → PCA → Model Training


The parameters to be searched are specified for each step of the sequence, along with the range over which the search is to be performed (e.g., k: range(1, 30)). The search is carried out on the training dataset and the results are sorted by mean test score. The set of hyper-parameters with the highest mean test score is considered the best model found.
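A sketch of the 7:3 split and a grid search over such a pipeline, shown for the SVM case; apart from the k example above, the search ranges here are assumptions:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# 3) 7:3 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Feature Selection -> PCA -> Model Training; probability=True so
# that predict_proba is available for threshold adjustment later.
pipe = Pipeline([
    ("feature_selection", SelectKBest(score_func=f_classif)),
    ("pca", PCA()),
    ("svm", SVC(probability=True)),
])

# n_components is kept below the smallest k so all combinations are valid.
param_grid = {
    "feature_selection__k": range(10, 30),
    "pca__n_components": range(2, 10),
    "svm__C": [0.1, 1, 10],
    "svm__class_weight": [None, "balanced"],
}
search = GridSearchCV(pipe, param_grid, cv=5)  # ranked by mean test score
search.fit(X_train, y_train)
print(search.best_params_)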
5) Model Training

The four machine learning models used on the dataset are: Logistic Regression, Support Vector Machine, K-Nearest Neighbours and Random Forest. The default classification threshold is 0.5 for all the models. Results are obtained first without adjusting the threshold and then after adjusting it.

IV. RESULTS

The dataset used has 3,333 customers, of which 483 (14.5%) are churners and 2,850 (85.5%) are non-churners. The objective is to select the model that yields good classification accuracy with the maximum possible recall (sensitivity towards correctly predicting the churner class).
We trained the models with the following optimal parameter values, obtained from the grid search (a sketch of assembling these tuned pipelines follows the list):

• Logistic Regression: none (defaults)

• Support Vector Machine: feature_selection__k: 17, pca__n_components: 12, svm__C: 10, svm__class_weight: balanced

• K-Nearest Neighbours: feature_selection__k: 15, knn__n_neighbors: 26, pca__n_components: 11

• Random Forest Classifier: feature_selection__k: 16, rf__n_estimators: 33
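A sketch assembling the four tuned pipelines from the values above; the make_model helper is our own naming, and PCA is simply skipped where no n_components was reported:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

def make_model(clf, k="all", n_components=None):
    # Feature Selection -> PCA -> classifier, as in the grid search.
    steps = [("feature_selection", SelectKBest(f_classif, k=k))]
    if n_components is not None:
        steps.append(("pca", PCA(n_components=n_components)))
    steps.append(("clf", clf))
    return Pipeline(steps)

models = {
    "logreg": make_model(LogisticRegression(max_iter=1000)),
    "svm": make_model(SVC(C=10, class_weight="balanced", probability=True),
                      k=17, n_components=12),
    "knn": make_model(KNeighborsClassifier(n_neighbors=26),
                      k=15, n_components=11),
    "rf": make_model(RandomForestClassifier(n_estimators=33), k=16),
}
for model in models.values():
    model.fit(X_train, y_train)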
The sensitivity of a model can be increased by adjusting the classification threshold, and a plot of thresholds versus the major error metrics facilitates this decision making. The confusion matrices for the two cases, i.e. before and after threshold adjustment, are presented for each of the four machine learning models. The threshold is chosen intuitively, guided by the objective of maximizing recall without significantly affecting the model's classification accuracy.
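A minimal sketch of this adjustment, reusing the fitted pipelines from above; the helper name is ours, and the 0.2 value anticipates the logistic regression case below:

from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

def evaluate_at(model, X_test, y_test, threshold=0.5):
    # Score the positive (churner) class, then apply a custom cutoff
    # instead of the default 0.5 used internally by predict().
    proba = model.predict_proba(X_test)[:, 1]
    y_pred = (proba >= threshold).astype(int)
    return (confusion_matrix(y_test, y_pred),
            accuracy_score(y_test, y_pred),
            recall_score(y_test, y_pred))

for t in (0.5, 0.2):  # before / after threshold adjustment
    cm, acc, rec = evaluate_at(models["logreg"], X_test, y_test, t)
    print(f"threshold={t}: accuracy={acc:.2f}, recall={rec:.2f}\n{cm}")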

1) Logistic Regression

Fig. 1. ROC curve for Logistic Regression (ROC-AUC Score: 0.84)

The ROC curve of Fig. 1 shows that the logistic regression model does not give satisfactory results: to achieve 100% sensitivity, we must accept a false positive rate of 80%, which is not desirable as it misclassifies a large share of non-churners as churners. It can be clearly seen from Fig. 2 that the metric scores vary non-linearly as the threshold changes. By adjusting the threshold to 0.2, the sensitivity rises from 79% to 96%; as a result, the classification accuracy decreases to 66%, making the model unsuitable for classification.

Fig. 2. Thresholds vs. Metrics Scores (Logistic Regression)

TABLE II. CONFUSION MATRIX, THRESHOLD = 0.5

 | Predicted Negative | Predicted Positive
Actual Negative | 657 | 198
Actual Positive | 177 | 678

TABLE III. CONFUSION MATRIX, THRESHOLD = 0.2

 | Predicted Negative | Predicted Positive
Actual Negative | 310 | 545
Actual Positive | 29 | 826

Table II and Table III show the effect of the threshold on the confusion matrix entries. The False Negatives reduce significantly from 177 to 29 and the True Positives increase from 678 to 826. However, the False Positives grow by 347, which implies that we are predicting a large share of non-churners as churners. This may incur significant expenses for the company.

TABLE IV. PERFORMANCE INDICATORS (LOGISTIC REGRESSION)

Metric | Threshold = 0.5 | Threshold = 0.2
Accuracy | 0.78 | 0.66
Recall | 0.79 | 0.96
Precision | 0.77 | 0.60
F1-Score | 0.78 | 0.74
Specificity | 0.76 | 0.38
ROC-AUC | 0.84 | -

Table IV shows that after adjusting the threshold to 0.2, the F1-Score has not changed appreciably. Although the sensitivity is significantly improved (the main objective), the specificity has worsened. Thus, the model is not practically useful.

2) K Nearest Neighbors

KNN has better classifying power than Logistic Regression, as witnessed by their ROC-AUC scores (0.92 vs 0.84). The ROC curve for KNN is shown in Fig. 3.

Fig. 3. ROC curve for K Nearest Neighbors (ROC-AUC Score: 0.92)

Fig. 4. Thresholds vs. Metrics Scores (KNN)

A threshold of 0.25 increases the sensitivity from 88% to 99%, but this is accompanied by a dip in overall accuracy from 82% to 65%.

TABLE V. CONFUSION MATRIX, THRESHOLD = 0.5

 | Predicted Negative | Predicted Positive
Actual Negative | 659 | 196
Actual Positive | 97 | 758

TABLE VI. CONFUSION MATRIX, THRESHOLD = 0.2

 | Predicted Negative | Predicted Positive
Actual Negative | 264 | 591
Actual Positive | 6 | 849

Although KNN predicts True Positives better than Logistic Regression, the increase in False Positives by 395 is again a strong vote against it. This reduces to a tradeoff between classifying 91 additional churners correctly and misclassifying 395 non-churners as churners. This is a poor model considering the company's objective of cost-cutting.

3) Support Vector Machine

The ROC-AUC score shows SVM to be a classifier with excellent predictive capability; the area under the ROC curve is close to unity. To achieve an almost 100% True Positive Rate (our objective), we need to accept a False Positive Rate of just 20%.

Fig. 5. ROC curve for SVM (ROC-AUC Score: 0.98)

Fig. 6. Thresholds vs. Metrics Scores (SVM)

At a threshold of 0.25, sensitivity increases from 97% to 98% without changing the other evaluation metrics appreciably. Moreover, the overall classification accuracy stays almost the same, i.e. 94%.

TABLE VII. CONFUSION MATRIX, THRESHOLD = 0.5

 | Predicted Negative | Predicted Positive
Actual Negative | 783 | 72
Actual Positive | 24 | 831

TABLE VIII. CONFUSION MATRIX, THRESHOLD = 0.25

 | Predicted Negative | Predicted Positive
Actual Negative | 769 | 86
Actual Positive | 17 | 838

The True Positives predicted are higher and the difference between the total churners and the predicted number is very small (just 17). The False Positives generated are very few and the False Negatives are fewer than in the previous models (both in favor of our objective). Threshold adjustment has not shown any significant improvement in predicting more True Positives.

4) Random Forest

The ROC-AUC score shows Random Forest to be a classifier with outstanding predictive capability; the area under the ROC curve is close to unity. To achieve an almost 100% True Positive Rate (our objective), we need to accept a False Positive Rate of just 20% (similar to the SVM model).
Fig. 7. ROC curve for Random Forest (ROC-AUC Score: 0.99)

Fig. 8. Thresholds vs. Metrics Scores (Random Forest)

At a threshold of 0.3, the sensitivity (recall) improves from 94% to 98% without much affecting the accuracy and the specificity (the capability of not predicting non-churners as churners).

TABLE IX. CONFUSION MATRIX, THRESHOLD = 0.5

 | Predicted Negative | Predicted Positive
Actual Negative | 826 | 29
Actual Positive | 47 | 808

TABLE X. CONFUSION MATRIX, THRESHOLD = 0.3

 | Predicted Negative | Predicted Positive
Actual Negative | 768 | 87
Actual Positive | 17 | 838

TABLE XI. PERFORMANCE INDICATORS (RANDOM FOREST)

Metric | Threshold = 0.5 | Threshold = 0.3
Accuracy | 0.95 | 0.94
Recall | 0.94 | 0.98
Precision | 0.96 | 0.90
F1-Score | 0.95 | 0.94
Specificity | 0.96 | 0.90
ROC-AUC | 0.99 | -

Although the Random Forest accuracy before threshold adjustment is greater than that of SVM, the True Positives predicted are fewer in the case of Random Forest. However, its 29 False Positives (rather than 72 for SVM) are a good argument in its favor. After threshold adjustment, the SVM and Random Forest models perform almost identically.
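A short sketch of how the ROC-AUC comparison across the four fitted pipelines (cf. Figs. 1, 3, 5 and 7) could be reproduced:

from sklearn.metrics import roc_auc_score

# Scores reported in the paper: LR 0.84, KNN 0.92, SVM 0.98, RF 0.99.
for name, model in models.items():
    proba = model.predict_proba(X_test)[:, 1]
    print(f"{name}: ROC-AUC = {roc_auc_score(y_test, proba):.2f}")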
V. SUMMARY AND CONCLUSION

Churn prediction can be modelled as a binary classification problem. This work aims to solve this problem using four classical machine learning methods, and the prediction capabilities of the different classification models have been examined. The dataset taken is imbalanced and not normalized. A subset of irrelevant features (with low qualitative predictive power) is removed, and the features having the strongest relationship with the output variable are selected. We tuned the models for maximum predictive performance using grid search, and the models are trained on the best set of parameters obtained from the grid search procedure. The classification efficiency is gauged using standard evaluation metrics (confusion matrices and ROC curves). The classification threshold is adjusted so as to optimize the sensitivity of the model.

It is concluded from the results presented in Section IV that Random Forest and SVM are comparably the best models for the given dataset. The False Positives predicted by Random Forest and SVM are much lower than those of the other two models. Also, the True Positives are predicted with an accuracy of 94% and a sensitivity of 98%.

REFERENCES

[1] V. Lazarov, M. Capota, Churn prediction, Bus. Anal. Course, TUM Comput. Sci. (2007).
[2] R.H. Wolniewicz, R. Dodier, Predicting customer behavior in telecommunications, IEEE Intell. Syst. 19 (2) (2004) 50–58.
[3] https://www.kaggle.com/becksddf/churn-in-telecoms-dataset
[4] Leif E. Peterson, K-nearest neighbor, Scholarpedia 4 (2) (2009) 1883.
[5] W. Au, K.C.C. Chan, X. Yao, A novel evolutionary data mining algorithm with applications to churn prediction, IEEE Trans. Evol. Comput. 7 (6) (2003) 532–545.
[6] S.-Y. Hung, D.C. Yen, H.-Y. Wang, Applying data mining to telecom churn management, Expert Syst. Appl. 31 (3) (2006) 515–524.
[7] R.J. Jadhav, U.T. Pawar, Churn prediction in telecommunication using data mining technology, Int. J. Adv. Comput. Sci. Appl. 2 (2) (2011) 17–19.
[8] A. Sharma, P. Prabin Kumar, A neural network based approach for predicting customer churn in cellular network services, Int. J. Comput. Appl. 27 (11) (2011) 26–31.
[9] H. Abbasimehr, A neuro-fuzzy classifier for customer churn prediction, Int. J. Comput. Appl. 19 (8) (2011) 35–41.
[10] E. Shaaban, Y. Helmy, A. Khedr, M. Nasr, A proposed churn prediction model, Int. J. Eng. Res. Appl. 2 (4) (2012) 693–697.
[11] I. Brandusoiu, G. Toderean, Churn prediction in the telecommunications sector using support vector machines, Ann. ORADEA Univ. Fascicle Manag. Technol. Eng. (1) (2013).
[12] K. Kim, C.-H. Jun, J. Lee, Improved churn prediction in telecommunication industry by analyzing a large network, Expert Syst. Appl. 41 (15) (2014) 6575–6584.
[13] G. Olle, A hybrid churn prediction model in mobile telecommunication industry, Int. J. e-Educ. e-Bus. e-Manag. e-Learn. 4 (1) (2014) 55–62.
[14] T. Vafeiadis, K.I. Diamantaras, G. Sarigiannidis, K.C. Chatzisavvas, A comparison of machine learning techniques for customer churn prediction, Simul. Model. Pract. Theory 55 (2015) 1–9.
