
JOURNAL OF APPLIED MICROECONOMETRICS (JAME)
E-ISSN: 2791-7401 | Volume 1, Issue 2, 2021
Received: 30.11.2021 | Accepted: 13.12.2021
URL: https://journals.gen.tr/jame

RESEARCH ARTICLE

Customer churn analysis
in banking sector: Evidence from
explainable machine learning models

Hasraddin Guliyev1 Ferda Yerdelen Tatoğlu2

1 The Economic Research Center of Turkish World, Azerbaijan State Economic University, Azerbaijan.
Ph.D. Student at Department of Econometrics, Istanbul University, Istanbul, e-mail: hasradding@unec.edu.az
2 Istanbul University, Faculty of Economics, Department of Econometrics, Istanbul/Turkey, e-mail: yerdelen@istanbul.edu.tr

Abstract

Although large companies try to gain new customers, they also want to retain their existing customers. Customer churn analysis is therefore important for identifying customers at risk of leaving, for developing new products, and for making strategic decisions to retain customers. This study focuses on customer churn analysis, a significant topic in banks' customer relationship management. Identifying customer churn in banks will help management classify customers who are likely to churn early and target them with promotions, as well as provide insight into which factors should be considered when retaining customers. Although different models are used for customer churn analysis in the literature, this study focuses especially on explainable machine learning models and uses SHapley Additive exPlanations (SHAP) values to support model evaluation and interpretability for customer churn analysis. The goal of the research is to estimate explainable machine learning models using real banking data and to evaluate several machine learning models on test data. According to the results, the XgBoost model outperformed the other machine learning methods in classifying churned customers.

Keywords: Customer Loyalty, Customer Retention, Customer Churn Analysis, Machine Learning Models, Tree-Based
Predictive Models

JEL Codes: C53, C55, M31

Citation: GULIYEV, H. & YERDELEN TATOGLU, F., (2021). Customer churn analysis in banking sector: Evidence from explainable machine learning
models. Journal of Applied Microeconometrics (JAME). 1(2), 85-99, DOI: 10.53753/jame.1.2.03

Corresponding Author:
Hasraddin Guliyev, e-mail: hasradding@unec.edu.az

Content of this journal is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


1. INTRODUCTION

Customer churn is a serious issue in an era of increasingly crowded markets and intensified competition among businesses (Colgate et al. 1996). According to much research, the cost of retaining existing customers is about one-fifth the cost of acquiring new ones (Athanassopoulos 2000). For this reason, firms prefer to retain existing customers rather than add new ones, and they apply policies in this direction. One of the most valuable assets in strategies designed to decrease or prevent customer churn is behavioral data on the current customer base (Ganesh et al. 2000). As a result, as part of a customer strategy aimed at decreasing churn, discovering and exploring customers with a strong desire to leave the organization, that is, customer churn prediction, is essential (Blattberg et al. 2008).

Customer churn is a well-known issue in most sectors (Saradhi and Palshikar 2011), hence it is critical to develop a sound predictive model of customer churn that can be used to formulate a customer retention strategy. This topic is even more important in markets where competition is high and acquiring new customers is more difficult than retaining existing ones. Banking is one of the sectors where analyzing customer behavior and estimating customer churn based on that behavior is an essential topic of research. Customer churn analysis results have a large impact on a bank's policy, because the results of churn analysis allow banks to develop new customer strategies or improve existing ones. In addition, banks are critical to a country's financial growth and development, so the banking sector is an essential factor in the financial stability of a country and its people. Because it is not always possible to win new customers in the competitive banking market, banks' primary goal is to ensure that existing customers are retained. Since banks, like all companies in the service sector, are customer-oriented, customer relationships are a priority for their long-term business success. Studies conducted on the banking sectors of various countries have revealed that, due to the competitive and dynamic nature of the banking sector, ensuring customer satisfaction is an important policy for preventing customer churn. Developing strong relationships with customers leads to high customer satisfaction and thus customer loyalty, which in turn becomes the most important factor for the stability, growth, and profitability of businesses.

In customer churn prediction in banking and other sectors, a scoring approach supports the calculation of a potential churn probability per customer based on their past data. The need to retain existing customers to maintain market share has led to an increased need for the development of various machine learning techniques for churn analysis. In customer churn prediction, logistic regression (LR) is a very widespread paradigm for predicting a churn probability because of its great comprehensibility (Verbeke et al. 2012). The LR model, however, has weak classification performance. On the other hand, despite the high prediction performance of machine learning models, their explanation is difficult. In this study, we suggest an explainable machine learning approach that combines the comprehensibility of logistic regression with the high classification performance of machine learning models.

2. LITERATURE REVIEW

With the advancement of techniques in the last 10-15 years, customer churn analysis studies using machine learning have grown in popularity. In the literature, it is clear that customer churn analysis has been used in a variety of industries (Kawale et al. 2009). While most studies have been done for the telecommunications and communication sector, customer churn analysis has also found applications in a wide range of fields, including e-commerce, banking, insurance, retail trade, energy, games and entertainment, and medicine (see Ahn et al. 2006; Bose and Chen 2009; Khan et al. 2010; Soeini and Rodpysh 2012; Buettgens et al. 2012; Long et al. 2012; Abbasimehr et al. 2013). Because this study is a customer churn analysis application in banking, it focuses on this sector. Although the factors affecting customer satisfaction and loyalty differ across countries, there are common points in the studies carried out.

According to Chakiso (2015), to obtain valuable customers, satisfy them, and ensure customer loyalty, relational marketing (such as trust, bonding, communication, and reciprocity) must be used in the banking sector. According to many studies, strong customer relationships encourage customers to be more satisfied with and loyal to their bank. According to Ozatac et al. (2016), the most significant determinants of customer satisfaction in banking are the accuracy of information, the responsiveness of employees, access to all services, the ability of employees, reliability, security of financial transactions, personalization, and consistency. Singh et al. (2013) determined that punctuality, effective communication, direct and acceptable information, and efficient employee services lead to customer satisfaction in banking. Pasha and Waleed (2016) conducted a study in Pakistan to determine the factors that drive customer loyalty. Customer satisfaction, brand loyalty, pricing policy, and service quality were found to be the most important determinants of customer loyalty in their study. According to Chatterjee and Kamesh (2020), developing relational marketing dimensions such as quality service, personalized products, reliability, personalized communication, problem management, customer education, customer engagement, and the use of new technology improves customer empowerment.

Besides relationship marketing, studies show that more objective factors play a significant role in customer loyalty in banking. Banks serve many customers through a variety of channels, including ATMs, mobile applications, and internet banking. Customers who have become more aware of service quality may move their financial services from one bank to another for a variety of reasons, including technological advancements, customer-friendly service, low interest rates, geographic closeness, and the variety of services offered. When customers' options for service expand, a competitive market emerges. As a result, competition among banks improves bank reliability and service quality significantly; however, it also increases the risk of customer churn. In banking, as in many other sectors, developing a model that predicts customer churn based on demographic, psychological, and transaction data is critical, and machine learning models make it possible to predict which customers will churn and why. These predictive models can guide the design of personalized services and products and encourage customer loyalty, resulting in increased customer satisfaction.

Mutanen (2006) offered a logistic regression-based customer churn study of the retail banking sector. Naveen et al. (2009) conducted detailed research with data mining techniques on churn among credit card customers. Bilal (2016) used gender, age, average monthly income, consumer status (retired, student, employed, unemployed), and whether the customer uses two or more bank products as control variables in a neural network model. According to Bilal, customers that use multiple banking products are less likely to churn. Keramati et al. (2016) used the decision tree (DT) model to investigate churn among customers in electronic banking (internet banking, telephone banking, mobile banking, ATM). They discovered that customers' dissatisfaction (duration of customer engagement, number of customer complaints), service usage (total number of uses and transactional amounts), and demographic variables (age, gender, employment status, education level) affect customer churn. Brânduşoiu et al. (2016) used a big dataset that includes 21 control variables for an advanced data mining model that predicts prepaid customer churn. He et al. (2009) utilized a prediction model based on the Artificial Neural Network (ANN) algorithm for the customer churn problem in a big Chinese telecom corporation with roughly 5.2 million consumers. The overall accuracy rate for prediction was 91.1 percent in that study. Nie et al. (2011) applied LR and DT models to predict churn among credit card customers of a Chinese bank. They discovered that the LR model outperformed the DT model in predicting churn in a large dataset containing financial data on 135 variables for 60 million customers. Rajamohamed and Manokaran (2018) compared different classification models such as k-nearest neighbor, Support Vector Machine, Random Forest, Decision Tree, and Naive Bayes to predict customer churn in banking and discovered that the Support Vector Machine model was the most accurate, followed by the Random Forest model. Lopez-Diaz et al. (2017) compared 7 classification models for predicting customer churn in a Spanish bank with 823,985 customers and observed that logistic regression had the best performance for customer churn prediction.

In this study, in parallel with the literature, the effects of various factors such as age, income, gender, credit card status, and discount opportunities offered by banks on customer churn were examined with LR, DT, RF, and XgBoost classification models.

3. METHODOLOGY

3.1. Logistic Regression Model (LR)

Logistic regression models describe the relationship between a qualitative dependent variable and other variables. In most models established with logistic regression, the dependent variable has only two outcomes. Usually, the event of interest being realized is indicated by 1 and the one which is not realized by 0. The scientific community in the domains of economics, finance, and other social and environmental sciences has now incorporated these models (Jabeur 2017; Zheng et al. 2020). The LR model is used to estimate the likelihood of an event occurring based on a set of predictors. The following is the predicted output of the logistic regression:

L_i = \ln\!\left(\frac{P_i}{1 - P_i}\right) = Z_i = \beta X_i + u_i \qquad (1)
In the above expression, Z_i is a linear representation of the input variables and takes a value between -∞ and +∞, while the probability P_i takes a value between 0 and 1. LR has several statistical flaws; multicollinearity and decreased performance accuracy are two of them.
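For illustration, the logit model of equation (1) can be fitted as in the following sketch with scikit-learn. The data here are synthetic stand-ins (the paper does not publish its implementation or data schema):

```python
# Sketch: fitting the logistic churn model of equation (1) on toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                    # stand-ins for inputs such as age, salary
z = 0.8 * X[:, 0] - 1.2 * X[:, 1]                # linear index Z_i = beta * X_i
y = (rng.random(500) < 1 / (1 + np.exp(-z))).astype(int)   # churn_status: 1 = churn

lr = LogisticRegression(max_iter=1000).fit(X, y)
print(lr.coef_, lr.intercept_)                   # estimates of beta
p_churn = lr.predict_proba(X)[:, 1]              # P_i, between 0 and 1
```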
3.2. Decision Tree

Ross Quinlan developed the C4.5 Decision Tree (DT) classification method as an expansion of the ID3 algorithm, which he had previously created. These classifiers use the data samples to build a decision tree as a machine learning technique. The edge-based segmentation strategy is used to build decision tree models, with an information gain metric used to select an appropriate input variable from among all the tree's input variables. The method selects a test with k outcomes that splits the data set N, as well as the training data set, into subsets (N_1, N_2, N_3, ..., N_k). freq(C_i, P) is the number of samples in P that belong to class C_i, and |P| is the total number of samples in P. The entropy of the set P is given by:

\text{info}(P) = -\sum_{i=1}^{k} \frac{\text{freq}(C_i, P)}{|P|} \log_2\!\left(\frac{\text{freq}(C_i, P)}{|P|}\right) \qquad (2)

The overall information content of N may be calculated after N is split with regard to the outcomes of a given characteristic z, using info_z(N). The entire information content of N is equal to the weighted sum of each subset's entropies:

\text{info}_z(N) = \sum_{i=1}^{k} \frac{|N_i|}{|N|}\, \text{info}(N_i) \qquad (3)

The gain is given by:

\text{Gain}(z) = \text{info}(N) - \text{info}_z(N) \qquad (4)

The algorithm divides N according to the test on z, choosing the attribute z with the greatest information gain. (1) If all the samples in a dataset belong to the same class, the decision tree generates a leaf node to select that class. (2) Otherwise, if no input variable provides any information gain, a decision node leading the tree with the class's expected value is produced. (3) If an instance with an unknown class is confronted, a decision node is constructed leading the tree together with the class's expected value.
together with the class's expected value. 3.3. Random Forest
3.3. Random Forest
Random Forest (RF) is a famous ML model used for data classification (Çağlayan et al. 2020).
This algorithm is frequently utilized in sectors such as investing (Jabeur 2017), customer
Random Forest (RF) is a famous ML model used for data classification (Çağlayan et al. 2020). This algorithm is
management
frequently utilizedand in sectors marketing such as(Salminen investing (Jabeur et al.2017), 2019). customer A group management of trees and underpins
marketing (Salminen the RF.etIt is
complemented with an aggregate
al. 2019). A group of trees underpins the RF. It is complemented with an aggregate of the prediction’s mean value, of the prediction's mean value, which is produced at the
conclusion
which is produced of each at theofconclusion the trees,ofreducing each of thethe trees, lack of robustness
reducing the lack ofofrobustness
a single of tree. Eachtree.
a single of Each
the trees of is
thecreated trees isusing createda using subset a subset of input of input variables variables that thatare arepicked
picked at atrandom.
random. The The following
following is an expression
is an expression for
theforestimated the estimated model: model:
1
𝑦𝑦𝑦𝑦� = ∑𝑛𝑛𝑛𝑛𝑖𝑖𝑖𝑖=1 𝑙𝑙𝑙𝑙𝑘𝑘𝑘𝑘 (𝑥𝑥𝑥𝑥)  (5) (5)
𝑛𝑛𝑛𝑛
th
Thevector
The vector of input
of input features
features is x, andis g(x)
x, and g(x) is a of
is a collection collection of the
the kth learner k learner
random trees. Therandom
RF finaltrees. The
estimate is RF
final
the meanestimate is the mean
of the outcome of thetrees.
of the whole outcome of the
As a result, whole
with trees. As
such weights, a individual
each result, withtreesuch
has anweights,
impact oneach
the RF estimation.
individual tree Corresponding
has an impact to Yeşilkanat
on the RF (2020), the Random
estimation. Forest model is to
Corresponding superior to other (2020),
Yeşilkanat machine the
learning
Random Forest model is superior to other machine learning methods. This is due to theauto-
methods. This is due to the former’ stability in the direction of acquiring training data from subsets former'
matically and shaping trees using random techniques. Furthermore, because the Random Forest model achieves
stability in the direction of acquiring training data from subsets automatically and shaping trees
training by applying bootstrapping on a randomly chosen independent subset of datasets, the overfitting quantity
isusing random techniques. Furthermore, because the Random Forest model achieves training by
preserved.
applying bootstrapping on a randomly chosen independent subset of datasets, the overfitting
quantity is preserved.
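The averaging in equation (5), together with bootstrapping and random feature subsets, corresponds to the following scikit-learn sketch (the data and hyper-parameter values are illustrative assumptions, not the paper's settings):

```python
# Random Forest sketch mirroring equation (5): the estimate averages the
# outputs of n trees grown on bootstrap samples with random feature subsets.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data with roughly the paper's class imbalance (~11.5% churners)
X, y = make_classification(n_samples=1000, weights=[0.885], random_state=42)

rf = RandomForestClassifier(
    n_estimators=500,      # n trees g_k(x)
    max_features="sqrt",   # random subset of input variables per split
    bootstrap=True,        # bootstrapped training subsets
    random_state=42,
).fit(X, y)

# predict_proba averages the per-tree class probabilities
churn_prob = rf.predict_proba(X)[:, 1]
```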
3.4. eXtreme Gradient Boosting

The eXtreme Gradient Boosting (XgBoost) model implements Chen and Guestrin's (2016) gradient boosting technique. It is a widely utilized, flexible tool for tree boosting, and the algorithm achieves cutting-edge classification effectiveness (Mai et al. 2020). The result generated by XgBoost is a collection of regression trees. The following equation is used to arrive at the final score:

\hat{y} = \sum_{h=1}^{H} g_h(x) \qquad (6)

The number of trees in this equation is H, and the score for each tree's leaf is K. Multicollinearity has no effect on XgBoost, which is an additional benefit. In order to maximize model performance, XgBoost involves the selection of certain parameters. Parameter tuning is essential for XgBoost to avoid overfitting and excessive model complexity; but because XgBoost utilizes multiple settings, this can be difficult. To optimize the hyper-parameter values, we applied the grid search method with cross-validation.
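A sketch of this tuning step, assuming the xgboost and scikit-learn packages; the parameter grid shown is illustrative, as the paper does not report the exact grid searched:

```python
# Grid search with cross-validation over a small, illustrative XgBoost grid.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, random_state=42)  # toy data

param_grid = {
    "n_estimators": [100, 300],    # H, the number of trees
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid,
    scoring="roc_auc",   # AUC, one of the metrics used below
    cv=5,                # 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```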
3.5. The Performance Metrics of Classification Models
In order to determine which of the applied machine learning classification models are more successful, both individually and among themselves, some performance metrics must be examined (Çağlayan 2020). These metrics are used to assess the effectiveness of the classification method in use and to compare classification models. Multiple metrics should be considered, because evaluating these values as a single success criterion would be incorrect. All observations in the test data set are passed through the model created with the training data set, and classification prediction scores are obtained. The results of comparing the predicted values with the actual values are used to determine how well the model predicts, as well as its success and performance. The confusion matrix summarizes the results of the model's accuracy in making a prediction, as well as the conclusions of the performance evaluation of the machine learning classification model.

Figure 1. Confusion Matrix

                     Actual 0                Actual 1
Predicted 0          True Positive (TP)      False Positive (FP)
Predicted 1          False Negative (FN)     True Negative (TN)
Figure 1 shows the confusion matrix, which is explained as follows for a two-category classification model:

True Positive (TP) indicates that observations with a true class value of 1 are correctly predicted as 1.

True Negative (TN) indicates the situation where observations with a true class value of 0 are correctly predicted as 0.

False Negative (FN) shows that observations with a true class value of 1 are incorrectly predicted as 0.

False Positive (FP) shows that observations with a true class value of 0 are incorrectly predicted as 1 (Deng et al. 2020).

The accuracy rate (ACC) is calculated by taking the ratio of the number of correctly classified observations (TP + TN) to the total number of samples (FN + FP + TP + TN). It evaluates how often the classification model's estimate is 1 when the true value of the class is 1, and 0 when the true value of the class is 0. ACC can be calculated using the following formula:

\text{ACC} = \frac{TP + TN}{FN + FP + TP + TN} \qquad (7)

With a confusion matrix, we can also calculate sensitivity and specificity rates. The sensitivity is the ratio of correctly classified positive values (TP) to the total number of true positive values (TP + FN):

\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (8)
The specificity is the ratio of correctly classified negative values (TN) to the total number of true negative values (TN + FP):

\text{Specificity} = \frac{TN}{TN + FP} \qquad (9)
The sensitivity, specificity, and ACC metrics have values ranging from 0 to 1, and values close to 1 indicate exceptionally good model performance. Furthermore, sensitivity and specificity are inversely related, which means that as sensitivity grows, specificity decreases, and conversely (Lambert and Lipkovich 2018).
The other metric, AUC, is a measurement of the entire area under the Receiver Operating Characteristic (ROC) curve, and it is one of the metrics used to evaluate model performance together with the ROC curve (Jabeur et al. 2021). The AUC value ranges from 0 to 1, with a value near 1 indicating a more accurate model. The distributions of TN and TP do not intersect when the area under the ROC curve is large, indicating that the classes have been successfully separated (Mai et al. 2019).
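Equations (7)-(9) and the AUC can be computed from test-set predictions as in the following sketch (the labels and probabilities below are random stand-ins; scikit-learn treats class 1, churn, as the positive class):

```python
# Accuracy, sensitivity, specificity (equations (7)-(9)) and AUC.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, size=1000)      # stand-in true labels
churn_prob = rng.random(1000)               # stand-in predicted P(churn)

y_pred = (churn_prob >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

acc         = (tp + tn) / (tp + tn + fp + fn)    # equation (7)
sensitivity = tp / (tp + fn)                     # equation (8)
specificity = tn / (tn + fp)                     # equation (9)
auc         = roc_auc_score(y_test, churn_prob)  # area under the ROC curve
print(acc, sensitivity, specificity, auc)
```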

3.6. Imbalanced Classification Problems

For customer churn analysis, studies have identified an imbalanced class distribution on customer data sets. Because the sample size of churn customers is substantially smaller than that of non-churn customers, the following scenario might occur: the overall accuracy of the classification is high, while churn customer prediction accuracy is low. The problem with unbalanced datasets is that typical classification learning techniques are biased towards the majority class (referred to as the "negative" class), resulting in a greater misclassification rate for minority class occurrences (referred to as the "positive" class) (Chawla 2009). The most common approach to this problem is to use a resampling technique to balance the class distribution of the training set before training a classification model. Random oversampling (ROS) and random undersampling (RUS) are two approaches for resampling. RUS consists of decreasing the data by deleting instances belonging to the majority class with the goal of equalizing the number of examples of each class, while ROS intends to reproduce or generate new positive examples in order to give the minority class more weight (Batista et al. 2004). The main disadvantage of random undersampling is that it might lose potentially relevant data that could be significant in the induction process. The elimination of data is an important decision to make, hence many undersampling proposals include heuristics to overcome the limits of non-heuristic decisions. Random oversampling, on the other hand, may increase the likelihood of overfitting since it duplicates the minority class instances exactly. In this manner, a symbolic classifier, for example, may generate rules that appear to be accurate but only cover one reproduced case.
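A sketch of RUS and ROS on a training table, using sklearn.utils.resample for illustration (the toy DataFrame is an assumption; dedicated samplers also exist in the imbalanced-learn package):

```python
# Illustrative random undersampling (RUS) and oversampling (ROS).
import pandas as pd
from sklearn.utils import resample

# Toy training table with the imbalance described above (~11.5% churn).
train_df = pd.DataFrame({"churn_status": [0] * 885 + [1] * 115,
                         "salary": range(1000)})

majority = train_df[train_df["churn_status"] == 0]   # "negative" class
minority = train_df[train_df["churn_status"] == 1]   # "positive" class

# RUS: delete majority-class instances until the classes are balanced.
train_rus = pd.concat([resample(majority, replace=False,
                                n_samples=len(minority), random_state=42),
                       minority])

# ROS: duplicate minority-class instances; risks the overfitting noted above.
train_ros = pd.concat([majority,
                       resample(minority, replace=True,
                                n_samples=len(majority), random_state=42)])
```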

Ensemble learning (tree-based) models are another option for improving the performance of a single classifier, training multiple separate classifiers and integrating their outputs to produce the final decision (Kuncheva 2004). Cost-sensitive ensembles, on the other hand, use the ensemble learning algorithm to guide cost reduction rather than altering the underlying classifier in order to incorporate costs into the learning process. Ensemble learning models include Random Forest, AdaBoost, and XgBoost. Ensemble learning models are well known in data mining and machine learning for their good performance in a wide range of applications, and they may be a better alternative for the class imbalance problem (Wozniak 2014). For example, Ahmad et al. (2019) discovered that tree-based models performed better than undersampling for unbalanced classification in their customer churn analysis in the telecommunications sector.


4. DATA AND ANALYSIS

4.1. Data and Variables

This paper focuses on the application of machine learning models for predicting churn customers. The research is based on real data from a bank. Before the customer churn analysis, we need to determine the churn status of customers. Customers who close individual loans and do not apply for a new loan within 9 months after the close date of the loan are included in the churn category, and the dependent variable of the classification model takes the value 1; it takes the value 0 when the customer applies for a loan within 9 months of the close date. According to the calculations, 91% of customers generally applied for a second loan within 9 months, so we used the 9-month criterion to determine customer churn status.
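The 9-month rule can be expressed as in the following pandas sketch; the loans table and its column names are hypothetical, since the paper does not publish its data schema:

```python
# Sketch of the 9-month churn rule: a customer whose closed loan is not
# followed by a new application within ~9 months is labeled churn (1).
import pandas as pd

# Toy loan records; next_application is NaT when no new loan was requested.
loans = pd.DataFrame({
    "close_date":       pd.to_datetime(["2020-01-10", "2020-02-01"]),
    "next_application": pd.to_datetime(["2020-04-15", pd.NaT]),
})

gap = loans["next_application"] - loans["close_date"]
loans["churn_status"] = (
    gap.isna() | (gap > pd.Timedelta(days=9 * 30))   # no application in ~9 months
).astype(int)
print(loans)
```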

The database of a bank was used in the data collection process, and 274,542 observations were analyzed once all the pre-elimination processes were completed. The input variables most appropriate to the local market conditions were selected for predicting the customer's churn status; they are summarized in Table 1.

Table 1. Definition of Variables

churn status (churn_status): shows the customer's churn category; takes a value of 1 if the customer is churn and 0 if non-churn.

Input variables:
customer's age (age): shows the customer's age.
average income (salary): shows the average income of the customer for the last 12 months.
gender (gender): takes a value of 1 if the customer is male and 0 if female.
loan amount (amount): amount of the customer's last used credit.
interest rate (interest_rate): nominal interest rate calculated for the customer's last used loan.
credit term (duration): duration of the customer's last used credit, in months.
credit closing and early payment, days (closed): takes negative values if the customer paid the closed loan before the expiry date and positive values if the loan was delayed.
interest rate discount (rate_discount): takes a value of 1 if the customer was offered an interest rate discount on the last loan compared to the previous loan, and 0 otherwise.
amount increase (amountup): takes a value of 1 if the customer was offered an increase in the loan amount on the last loan compared to the previous loan, and 0 otherwise.
competition region (competition): takes a value of 1 if the customer lives in a competition region where more branches of other banks exist, and 0 otherwise.
credit card status (creditcard): takes a value of 1 if the customer has a credit card, and 0 otherwise.
salary card (card_status): takes a value of 0 if the customer's salary card belongs to the bank and 1 if it belongs to another bank.
credit count (creditcount): shows how many individual loans the customer has drawn, to capture the relationship with the bank; since it is a categorical variable, customers with 1 individual loan were taken as the base and 3 dummy variables were created.

4.2. Description of Variables and Correlation Analysis

Table 2 summarizes the descriptive statistics for all variables examined in this study. Looking at the proportions of the outcome variable, churn status, we see that 88.5 percent of customers are non-churners and 11.5 percent are churners. The average age of the customers is 49.5, with a standard deviation of 12.8 and a range of 19 to 70. The mean salary is 309, with a low of 50 and a high of 11850. For a categorical input variable like gender, we notice that males account for 52.3 percent of the customers while females account for 47.7 percent. The other input variables can be interpreted in the same way.
Table 2. Descriptive Statistics of Variables

churn status: No-churn (88.5%), Churn (11.5%)
age: mean (sd) 49.5 (12.8); min ≤ med ≤ max: 19 ≤ 52 ≤ 70
salary: mean (sd) 309 (250.9); min ≤ med ≤ max: 50 ≤ 199.6 ≤ 11850
gender: Male (52.3%), Female (47.7%)
amount: mean (sd) 2558.4 (2132.8); min ≤ med ≤ max: 300 ≤ 2000 ≤ 20000
interest rate: mean (sd) 28.4 (2.2); min ≤ med ≤ max: 14 ≤ 28.2 ≤ 39
duration: mean (sd) 29.2 (10.5); min ≤ med ≤ max: 3 ≤ 36 ≤ 156
closed: mean (sd) -370.6 (378.9); min ≤ med ≤ max: -1820 ≤ -264 ≤ 120
rate discount: No (39.7%), Yes (60.3%)
amount up: No (31.1%), Yes (68.9%)
competition: No (73.0%), Yes (27.0%)
credit card: No (31.5%), Yes (68.5%)
card status: No (92.7%), Yes (7.3%)
credit count: 1 (41.2%), 2 (46.0%), 3 (10.3%), 4+ (2.5%)

We calculated the correlation coefficients before using machine-learning techniques to ensure that the input variable selection is accurate. Figure 2 shows the pairwise Spearman's rank correlations among the variables in our study.
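A heatmap like Figure 2 can be produced along these lines (a sketch: the toy DataFrame standing in for the numerically coded Table 1 variables and the plotting choices are assumptions, not the authors' code):

```python
# Pairwise Spearman rank correlations and a heatmap, as in Figure 2.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 4)),     # toy numeric stand-in data
                  columns=["age", "salary", "amount", "duration"])

corr = df.corr(method="spearman")                # pairwise Spearman's rho
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Spearman's Correlation Heatmap")
plt.tight_layout()
plt.show()
```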

Figure 2. Spearman’s Correlation Heatmap


FigureFigure 2. Spearman's
2. Heatmap
Figure 2. Spearman's Correlation Spearman's Correlation
Correlation HeatmapHeatmap
Figure 2. Spearman's Correlation Heatmap
Figure 2. Spearman's Correlation Heatmap
ure 2. Spearman's Correlation Heatmap

Source: Source:
Authors Authors ‘own calculation
‘own calculation thefrom
fromAuthors the dataset.
dataset.
calculation from the
dataset. Source: ‘own calculation from the dataset.
Source: Authors ‘own calculation from the dataset.
Spearman's
Spearman's correlation correlation
evaluates evaluates it ismonotonic
Source:
monotonic Authors
relationships, relationships,
‘own calculation
and and
from
itdatasets it isdataset.
the a reliable tool
bigfor big datasets
tion evaluates monotonic
Spearman’s relationships,
correlation and
evaluates a reliable
monotonic relationships,tool forand big
it is aisreliable
a reliabletool fortoolbigfor datasets datasets
with outliers.
from the dataset.with outliers. Calculated Spearman's correlation coefficients indicate customer churn status
with outliers.
ulated Spearman's Spearman's
Calculated
correlation
Calculated Spearman’s correlation
Spearman's
coefficients
correlation evaluates
correlation
indicate customer
coefficients monotonic
coefficients
indicate churn
customer relationships,
indicate
status
churnhas a and
has aitmoderate
customer
status ischurn
a reliable
statustool
negative has a has
for big adatasets
correlation
moderate
correlation moderate
with rate
with with
negative
rate negative
discountoutliers.
correlation
discount correlation
Calculated
(𝜌𝜌𝜌𝜌=-0.38)
( =-0.38) with and
and Spearman's
with
rate
the the rate up
Spearman's
discount
amount
amount correlation
discountcorrelation
((𝜌𝜌𝜌𝜌=-0.38)
up =-0.49) evaluates
(𝜌𝜌𝜌𝜌 (𝜌𝜌𝜌𝜌=-0.38)
inand
=-0.49) the inand
coefficients
the
next monotonic
theapproval.
amount
the
credit amount
indicate relationships,
up (𝜌𝜌𝜌𝜌 =-0.49)
customer
up (𝜌𝜌𝜌𝜌 Furthermore,
=-0.49) the init
and
inchurn
customer is a reliable
the
status has a too
ates monotonic relationships,
next
churncredit aapproval.
moderate
has moderate and itFurthermore,
negative is a correlation
reliable tool
withnegative
customer
withforcredit
outliers. big datasets
Calculated
churn
rate has
discount aSpearman's
moderate
(𝜌𝜌𝜌𝜌=-0.38) correlation
negative
and coefficients
correlation
the counts
amount with
upcredit indicate
credit incustomer
(𝜌𝜌𝜌𝜌(credit-
=-0.49) the c
next credit
l. Furthermore, approval.
customer churn hasnegative
Furthermore, correlation
a moderatecustomer with
churn has card
correlation status
a moderate with ( credit
=-0.38)
negative and credit
correlation variables
with
arman's correlation
card coefficients
status
next (𝜌𝜌𝜌𝜌=-0.38)
credit indicate
and
approval. customer
creditmoderate
counts
Furthermore, churn negative status
variables has a
correlation
(creditcount2, with rate discount
creditcount3 (𝜌𝜌𝜌𝜌=-0.38)
and and
creditcount4+), the amount up (
8) and card count2,
creditstatus
counts creditcount3
(𝜌𝜌𝜌𝜌=-0.38)
variables and and creditcount4+),
credit counts
(creditcount2, variablescustomer
however
creditcount3 a positive
(creditcount2, churn has
correlation
and creditcount4+), a moderate
with
creditcount3 credit and negative
closed ( correlation with credit
=0.33)
creditcount4+), and card
n with rate discount (𝜌𝜌𝜌𝜌=-0.38) and the amount next up (𝜌𝜌𝜌𝜌
credit =-0.49)
approval. in the
Furthermore, customer churn thathas aand
moderate negative corre
status
however (
card =0.19)
a positive
status that indicating
correlation
(𝜌𝜌𝜌𝜌=-0.38) the
and customer’s
with salary
creditcounts
credit closed card belongs
(𝜌𝜌𝜌𝜌that
variables=0.33) to another
andstatus
card
(creditcount2, bank.
status Many
(𝜌𝜌𝜌𝜌 variables’
=0.19)
creditcount3 thatcorrelation
indicating
creditcount4+),
however
correlation with a positive
credit closed correlation
(𝜌𝜌𝜌𝜌 =0.33) with
atand credit
card closed
status (𝜌𝜌𝜌𝜌=0.19)
a(𝜌𝜌𝜌𝜌 =0.33) and card
indicating (𝜌𝜌𝜌𝜌 =0.19) indicating
more, customerthe churn has
coefficients
customer's
however a
were moderate
calculated
salary
a positive cardnegative
a lowcard
belongs
correlationcorrelation
level.
to As
status
another
with with
result,
(𝜌𝜌𝜌𝜌=-0.38)
credit bank.credit
using
closed
nonlinear
and
Many credit machine
counts
variables'
(𝜌𝜌𝜌𝜌 =0.33) and
learning
variables
correlation models
card status
to better explain
(creditcount2,
(𝜌𝜌𝜌𝜌 =0.19) that indicating and
coefficients creditcount3
were
ry card the customer's
belongs salarybank.
to relationship
the another card
with belongs
Many
churn tomay
another
variables'
status bank.
correlation
beand
advantageous Many variables'
coefficients
because were
of low-level correlation
correlationcoefficients
coefficients. were
edit counts variables
calculated (creditcount2,
at a lowAs level. creditcount3
As acardhowever
result, using creditcount4+),
a positive
nonlinear correlation
machine with credit
learning closed
models (𝜌𝜌𝜌𝜌explain
to =0.33)explain
better and card thestatus (𝜌𝜌𝜌𝜌 =0.1
level. Ascalculated
a result, at athe
using lowcustomer's
level.
nonlinear asalary
machineresult, using
learning belongs
nonlinear
models to another
tomachine
better bank. Many
learning
explain variables'
themodels correlation
to better coefficients
the were
n with credit closed (𝜌𝜌𝜌𝜌calculated
relationship =0.33)
withand card
churn
atbecause
a low status
status
level. (𝜌𝜌𝜌𝜌
the
may =0.19)
customer's
be that
advantageous
As a result, indicating
salary
using card
because
nonlinear belongsof to another
low-level
machinecorrelation bank.
correlation
learning models Many variables'
coefficients. correlation
to better explain the c
relationship
hurn status with churn
may be advantageous status may below-level
of advantageous because
correlation of low-level
coefficients. coefficients.
longs to another4.3. Machine
bank. Many Learning
variables' Models’
relationship with churncalculatedcorrelationOverall Performance
coefficients
status mayatbe a low were and Variable
level. As abecause
advantageous
Importance
result, using nonlinear
of low-level machine learning
correlation models to b
coefficients.
result, using nonlinear 4.3 machine
Machine learning
Learning modelsModels' to better
relationship explain
with
Overall churn thestatus
Performance may be
and advantageous
Variable because
Importance of low-level correlat
e Learning Models'4.3 MachineOverall Learning
Performance Models' andOverall
Variable Performance
Importance and Variable Importance
may be advantageous
The performancebecause ofofthelow-level
4.3 Machine correlation
indicated machine-learning
Learning coefficients.
models is compared in this section. The data from 274,542
Models' Overall Performance andinVariable Importance
The performance of the indicated machine-learning models is compared in this section. The data from 274,542 customers was split into two parts: 80% of the dataset (219,634 customers) was used to train the machine learning models, while 20% of the dataset (54,908 customers) was used to test them. The effectiveness of each individual model was determined using several performance metrics. We explored the results of the proposed machine learning models and evaluated these results through the model performance metrics (i.e., sensitivity, specificity, accuracy, and the Receiver Operating Characteristic (ROC) curve).
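This split and tuning setup can be sketched with the caret package in R as follows (a minimal illustration, not the authors' code; churn_data and the outcome column churn are placeholder names):

    library(caret)

    set.seed(123)  # illustrative seed; the paper does not report one
    # Stratified 80/20 split: about 219,634 training and 54,908 test customers
    idx   <- createDataPartition(churn_data$churn, p = 0.80, list = FALSE)
    train <- churn_data[idx, ]
    test  <- churn_data[-idx, ]

    # 5-fold cross-validation with grid search, optimizing the ROC metric
    ctrl <- trainControl(method = "cv", number = 5, search = "grid",
                         classProbs = TRUE, summaryFunction = twoClassSummary)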

Table 3. The Performance of Machine Learning Models in Testing Data

Performance Metrics      LR        DT        RF        XgBoost
Sensitivity              0.9896    0.9819    0.9843    0.9854
Specificity              0.5693    0.8206    0.8387    0.8504
ACC                      0.9410    0.9632    0.9675    0.9697
AUC                      0.9464    0.9510    0.9797    0.9850
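Given a fitted model, the figures in Table 3 can be recomputed along these lines (a sketch only: fit_xgb denotes the hypothetical fitted XgBoost object from the tuning step discussed below, and the positive class is assumed to be labeled "yes"):

    library(pROC)

    pred  <- predict(fit_xgb, newdata = test)                  # class predictions
    probs <- predict(fit_xgb, newdata = test, type = "prob")   # class probabilities

    cm <- confusionMatrix(pred, test$churn, positive = "yes")
    cm$byClass[c("Sensitivity", "Specificity")]                # first two rows of Table 3
    cm$overall["Accuracy"]                                     # ACC row
    auc(roc(test$churn, probs[, "yes"]))                       # AUC row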
Table 3 expresses the results computed with the advanced machine-learning models such as Logistic regression (LR), Decision Tree (DT), Random Forest (RF), and XgBoost. We used the caret package in R for model estimation in the training dataset and parameter tuning to deal with the overfitting problem and to boost model performance. To calculate hyperparameters, we used 5-fold cross-validation and the grid search method. First, we used the Logistic regression model; while its sensitivity (0.9896) is high, its specificity (0.5693) is very low, so we cannot conclude that the LR model is a very accurate classification model. For the Decision Tree model, the complexity parameter is estimated to be 0.0012. Although the DT model's sensitivity (0.9819), specificity (0.8206), ACC (0.9632) and AUC (0.9510) are acceptable, it cannot be considered the best model. The mtry parameter, which is the number of variables randomly sampled as candidates at each split, is calculated as 4 for the Random Forest model. Although the performance of the RF model is very good, it still cannot be said to be the best model. We used the XgBoost model and determined that the optimal number of trees was 130. Because of the values of sensitivity (0.9854), specificity (0.8504), accuracy (0.9697), and AUC (0.9850), as well as the closeness of these metrics to 1, the XgBoost model had the highest predictive performance in the test data set when all models were compared. To emphasize, we would like to point out that a more accurate estimation of the churned customer (positive class) is more important for customer churn analysis, so having a higher specificity rate is more advantageous for us. Consequently, the area under the Receiver Operating Characteristic (ROC) curve plotted for XgBoost is higher compared to the other machine learning models.
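Under the control object sketched earlier, the four fits could look roughly as follows (caret's standard method identifiers; the tuned values cp = 0.0012, mtry = 4, and nrounds = 130 are taken from the text, while the remaining xgbTree grid entries are illustrative defaults the paper does not report):

    fit_lr <- train(churn ~ ., data = train, method = "glm",
                    family = "binomial", trControl = ctrl, metric = "ROC")

    fit_dt <- train(churn ~ ., data = train, method = "rpart",
                    tuneGrid = data.frame(cp = 0.0012),
                    trControl = ctrl, metric = "ROC")

    fit_rf <- train(churn ~ ., data = train, method = "rf",
                    tuneGrid = data.frame(mtry = 4),
                    trControl = ctrl, metric = "ROC")

    fit_xgb <- train(churn ~ ., data = train, method = "xgbTree",
                     tuneGrid = expand.grid(nrounds = 130, max_depth = 6, eta = 0.3,
                                            gamma = 0, colsample_bytree = 1,
                                            min_child_weight = 1, subsample = 1),
                     trControl = ctrl, metric = "ROC")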

Figure 3. Comparison of Receiver Operating Characteristic (ROC) Curves

Source: Authors' own calculation from the dataset.

It is useful to know the proportional contributions of all factors to the final forecast outcome when predicting the churner. Lundberg et al. (2018) recently suggested SHAP to assess the significance of specific characteristics. This can help to balance the accuracy and interpretability of black-box machine-learning models. The importance of variables is shown in Figure 4.
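For tree ensembles such as XgBoost, these SHAP values can be obtained directly from the xgboost package in R (a minimal sketch; bst and X_test stand for a hypothetical fitted booster and a numeric feature matrix, not objects reported in the paper):

    library(xgboost)

    # predcontrib = TRUE returns per-feature Tree SHAP values plus a bias column
    shap <- predict(bst, X_test, predcontrib = TRUE)

    # Global importance: mean absolute SHAP value per feature, as ranked in Figure 4
    imp <- sort(colMeans(abs(shap[, -ncol(shap)])), decreasing = TRUE)
    head(imp, 5)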
Figure 4. The Shapley Values of XgBoost Model

Source: Authors' own calculation from XgBoost Model.



Figure 3Figure
shows the importance
3 shows the of the variablesofwith
importance thethe effects of with
variables the variables. A Shapley
the effects of thevalue for a feature
variables. and
A Shapley value
an instance is represented by each point on the summary plot. The feature is demonstrated
for a feature and an instance is represented by each point on the summary plot. The feature is on the y-axis position,
while the Shapley value is demonstrated on the x-axis position. The value of the feature is represented by the color,
demonstrated on the y-axis position, while the Shapley value is demonstrated on the x-axis position.
which ranges from low to high. To get a meaning of the Shapley value per input variable, the overlapping points
Theinvalue
are jittered of the
the y-axis feature
direction. Theisfeatures
represented by from
are ranked the color,
the mostwhich ranges
important to thefrom low to high.
least important one. To get a
meaning
The 5 most importantof variables
the Shapley value
to explain per input
customer churnvariable,
status are the overlapping
the customer points
is offered are loan
a higher jittered
amountin the y-axis
compared direction. The features
to the previous loan (amountare ranked
up), earlyfrom the of
payment most important
credit or delaying tocredit
the least important
(closed), one. isThe 5 most
the customer
offered aimportant
lower interestvariables to explain
rate compared customer
to previously loan churn status are
(rate discount), the customer
the number of creditsis(credit
offeredcounta 2)
higher loan
and the duration of customer’s last used credit (duration). Shapley values show that when
amount compared to the previous loan (amount up), early payment of credit or delaying credit an interest rate discount
(rate discount)
(closed),and the
morecustomer
loan amount is (amount
offered up) is determined
a lower interestis rate
applied to a new loan,
compared the probability
to previously loanof (rate
churn discount),
decreases. In addition, if the number of credits (credit number 2, credit number 3, credit number 4 +) increases,
the number of credits (credit count 2) and the duration of customer's last used credit (duration).
the probability of churn decreases. Interestingly, a decrease in the duration of the previous loan and having a credit
Shapley
card (card values the
status) reduces show that when
probability an interest
of customer churn. rate discount
However, (rate discount)
the probability and more
of churn increases as loan
the amount
(amount
customer’s age risesup)
andisanother
determined is applied
bank provides to a new
a monthly salaryloan, the probability
to customers of churn
with their own decreases.
bank card (this is dueIn addition,
if theinnumber
to the change of credits
the customer’s (credit
workplace). number the
Furthermore, 2, interest
credit rate
number 3, and
of credit credit numberto4live
the customer +)inincreases,
the the
probability of churn decreases. Interestingly, a decrease in the duration of the previous loan and
competitive region increases the probability of churn.
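A summary plot of this kind can be produced, for example, with the SHAPforxgboost package (an assumed plotting route; the paper does not state which tool generated Figure 4):

    library(SHAPforxgboost)

    # Long-format SHAP data from the hypothetical fitted booster and feature matrix
    shap_long <- shap.prep(xgb_model = bst, X_train = X_test)

    # Beeswarm-style summary: features ranked by mean |SHAP|, points jittered in the
    # y direction, color encoding the feature value from low to high
    shap.plot.summary(shap_long)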
5. CONCLUSION
In this paper, we proposed ML methods for predicting customer churn. The machine learning predictive models need to achieve high AUC values. Firstly, to train and test the models, the sample dataset was divided into 80% for training and 20% for testing. For validation and hyperparameter tuning, we chose to use 5-fold cross-validation. In addition, we contended with another problem: the data was not balanced, and only about 11.5% of the dataset consisted of churn customers. To solve this problem, undersampling methods or tree-based algorithms are suggested, so we applied tree-based models such as Decision Tree, Random Forest, and XgBoost. The XgBoost model outperformed the others on each metric, with an AUC of 98.50 percent, and the Random Forest model came in second.
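For completeness, caret can also apply such an undersampling correction inside each cross-validation fold; a hypothetical down-sampled variant of the earlier control object would look like this (not a step the paper reports, since the tree-based route was preferred):

    # Down-sample the majority (non-churn) class within each CV fold
    ctrl_down <- trainControl(method = "cv", number = 5, search = "grid",
                              classProbs = TRUE, summaryFunction = twoClassSummary,
                              sampling = "down")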

Shapley values present the most important variables that explain the churn status of the customers, and they indicate the positive or negative effect of the input variables for the XgBoost model. In general, the important reasons that increase the risk of churn of the customer are the fact that the salary card belongs to another bank in the next period and that the
customer lives in the competitive region where there are other alternatives for taking credit. The most important reason that increases the probability of churn is that the applied credit interest rate is high, which in turn reduces the customer's ability to repay the loan and shifts the customer towards another bank that offers more favorable interest rates for subsequent loans. The customer's relationship with the bank is a critical component in minimizing the risk of customer churn. If the customer has a longer-term relationship with the bank, then the customer will benefit from the advantages offered by the bank's loyalty program and will maintain the relationship with the bank for a long time. In addition, applying an interest rate discount and raising the amount of credit offered to the customer decrease the risk of churn.

Our results suggest that building a model that can accurately anticipate customer retention might have management and financial consequences for banks seeking to reduce the probability of churn. Firstly, correctly classifying a customer as a churner or non-churner helps decrease the expenses associated with misclassification. Second, our findings show that academics and practitioners do not have to rely exclusively on conventional methodologies such as logistic regression for predicting customer churn. Finally, our findings suggest management recommendations for improving the decision-making process in the context of customer churn prediction. Banks and financial institutions may use XgBoost models to correctly identify clients who are at risk of churn, focus their efforts on them, and potentially increase profit. Companies should focus more on customer retention policies rather than concentrating on new target markets, which are generally difficult to gain. So, the findings of the machine-learning techniques of this research could have a variety of policy implications for customer relationship management and the marketing strategy of the company. In the future, more explainable machine learning methods should be used, and models with higher performance should be suggested for predicting customer churn.

Funding

The authors declare that this study has no financial support.

Conflict of Interest

The authors declare that they have no conflicts of interest.


REFERENCES

• ABBASIMEHR, H., SETAK, M., & SOROOR, J. (2013). A framework for identification of high-value customers by
including social network-based variables for churn prediction using neuro-fuzzy techniques. International Journal of
Production Research, 51(4), 1279-1294.

• AHMAD, A. K., JAFAR, A., & ALJOUMAA, K. (2019). Customer churn prediction in telecom using machine learning
in big data platform. Journal of Big Data, 6(1), 1-24.

• AHN, J. H., HAN, S. P., & LEE, Y.S. (2006). Customer churn analysis: Churn determinants and mediation effects of
partial defection in the Korean mobile telecommunications service industry. Telecommunications policy, 30(10-11),
552-568.

• AKAY, E.C., SOYDAN, N.T.Y. & GACAR, B.K. (2020). Makine öğrenmesi ve ekonomi: Bibliyometrik analiz [Machine learning and economics: A bibliometric analysis]. PressAcademia Procedia, 12(1), 104-105.

• ATHANASSOPOULOS, A.D. (2000). Customer satisfaction cues to support market segmentation and explain switch-
ing behavior. Journal of business research, 47(3), 191-207.

• BATISTA, G.E., PRATI, R.C., & MONARD, M.C. (2004). A study of the behavior of several methods for balancing
machine learning training data. ACM SIGKDD explorations newsletter, 6(1), 20-29.

• BILAL ZORIĆ, A. (2016). Predicting customer churn in banking industry using neural networks.  Interdisciplinary
Description of Complex Systems: INDECS, 14(2), 116-124.

• BLATTBERG, R. C., KIM, B.D., & NESLIN, S.A. (2008). Churn Management. In Database Marketing (pp. 607-633).
Springer, New York, NY.

• BOSE, I., & CHEN, X. (2009). Quantitative models for direct marketing: A review from systems perspective. Europe-
an Journal of Operational Research, 195(1), 1-16.

• BRÂNDUŞOIU, I., TODEREAN, G., & BELEIU, H. (2016). Methods for churn prediction in the pre-paid mobile
telecommunications industry. In 2016 International conference on communications (COMM) (pp. 97-100). IEEE.

• BUETTGENS, M., NICHOLS, A., & DORN, S. (2012). Churning under the ACA and state policy options for mitigation. Prepared for Robert Wood Johnson Foundation, Timely Analysis of Immediate Health Policy Issues, http://www.urban.org/UploadedPDF/412587-Churning-Under-the-ACA-and-State-Policy-Options-for-Mitigation.pdf.

• ÇAĞLAYAN AKAY, E. (2018). Ekonometride yeni bir ufuk: Büyük veri ve makine öğrenmesi [A new horizon in econometrics: Big data and machine learning]. Social Sciences Research Journal, 7(2), 41-53.

• ÇAĞLAYAN AKAY, E. (2020). Ekonometride Büyük Veri ve Makine Öğrenmesi: Temel Kavramlar [Big Data and Machine Learning in Econometrics: Fundamental Concepts], Der Yayınları, İstanbul.

• CHAKISO, C.B. (2015). The effect of relationship marketing on customers’ loyalty (Evidence from Zemen
Bank). EMAJ: Emerging Markets Journal, 5(2), 58-70.

• CHATTERJEE, D., & KAMESH, A.V.S. (2020). Significance of Relationship marketing in banks in terms of Customer
Empowerment and satisfaction. European Journal of Molecular & Clinical Medicine, 7(4), 999-1009.

• CHATTERJEE, D., SEKHAR, S.C., & BABU, M.K. (2021). Customer Empowerment-A Way to Administer Customer
Satisfaction in Indian Banking Sector.  NVEO-NATURAL VOLATILES & ESSENTIAL OILS Journal| NVEO, 1621-
1629.

• CHAWLA, N.V. (2009). Data mining for imbalanced datasets: An overview. Data mining and knowledge discovery
handbook, 875-886.


• CHEN, T., & GUESTRIN, C. (2016, August). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd
acm sigkdd international conference on knowledge discovery and data mining (pp. 785-794).

• COLGATE, M., STEWART, K., & KINSELLA, R. (1996). Customer defection: a study of the student market in Ire-
land. International journal of bank marketing.

• DENG, X., LIU, Q., DENG, Y., & MAHADEVAN, S. (2016). An improved method to construct basic probability as-
signment based on the confusion matrix for classification problem. Information Sciences, 340, 250-261.

• GANESH, J., ARNOLD, M.J., & REYNOLDS, K.E. (2000). Understanding the customer base of service providers: an
examination of the differences between switchers and stayers. Journal of marketing, 64(3), 65-87.

• HE, Y., HE, Z., & ZHANG, D. (2009). A study on prediction of customer churn in fixed communication network based
on data mining. In 2009 sixth international conference on fuzzy systems and knowledge discovery (Vol. 1, pp. 92-94).
IEEE.

• JABEUR, S.B. (2017). Bankruptcy prediction using partial least squares logistic regression. Journal of Retailing and
Consumer Services, 36, 197-202.

• JABEUR, S. B., GHARIB, C., MEFTEH-WALI, S., & ARFI, W.B. (2021). CatBoost model and artificial intelligence
techniques for corporate failure prediction. Technological Forecasting and Social Change, 166, 120658.

• KAWALE, J., PAL, A., & SRIVASTAVA, J. (2009). Churn prediction in MMORPGs: A social influence based ap-
proach. In 2009 international conference on computational science and engineering (Vol. 4, pp. 423-428). IEEE.

• KERAMATI, A., GHANEEI, H., & MIRMOHAMMADI, S. M. (2016). Developing a prediction model for customer
churn from electronic banking services using data mining. Financial Innovation, 2(1), 1-13.

• KHAN, A.A., JAMWAL, S., & SEPEHRI, M.M. (2010). Applying data mining to customer churn prediction in an
internet service provider. International Journal of Computer Applications, 9(7), 8-14.

• KUNCHEVA, L. (2004). Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons, Hoboken.

• LAMBERT, J., & LIPKOVICH, I. (2008). A macro for getting more out of your ROC curve. In SAS Global forum,
paper (Vol. 231).

• LONG, X., YIN, W., AN, L., NI, H., HUANG, L., LUO, Q., & CHEN, Y. (2012, March). Churn analysis of online
social network users using data mining techniques. In Proceedings of the international MultiConference of Engineers
and Conputer Scientists (Vol. 1).

• LÓPEZ-DÍAZ, M.C., LÓPEZ-DÍAZ, M., & MARTÍNEZ-FERNÁNDEZ, S. (2017). A stochastic comparison of customer classifiers with an application to customer attrition in commercial banking. Scandinavian Actuarial Journal, 2017(7), 606-627.

• LUNDBERG, S. M., ERION, G. G., & LEE, S. I. (2018). Consistent individualized feature attribution for tree ensem-
bles. arXiv preprint arXiv:1802.03888.

• MAI, F., TIAN, S., LEE, C., & MA, L. (2019). Deep learning models for bankruptcy prediction using textual disclo-
sures. European journal of operational research, 274(2), 743-758.

• MUTANEN, T. (2006). Customer churn analysis–a case study.  Journal of Product and Brand Management,  14(1),
4-13.

• NAVEEN, N., RAVI, V., & KUMAR, D.A. (2009). Application of fuzzyARTMAP for churn prediction in bank credit cards. International Journal of Information and Decision Sciences, 1(4), 428-444.

• NIE, G., ROWE, W., ZHANG, L., TIAN, Y., & SHI, Y. (2011). Credit card churn forecasting by logistic regression and decision tree. Expert Systems with Applications, 38(12), 15273-15285.

• OZATAC, N., SANER, T., & SEN, Z.S. (2016). Customer satisfaction in the banking sector: the case of North Cy-
prus. Procedia Economics and Finance, 39, 870-878.

• RAJAMOHAMED, R., & MANOKARAN, J. (2018). Improved credit card churn prediction based on rough clustering
and supervised learning techniques. Cluster Computing, 21(1), 65-77.

• RISSELADA, H., VERHOEF, P.C., & BIJMOLT, T.H. (2010). Staying power of churn prediction models. Journal of
Interactive Marketing, 24(3), 198-208.

• SALMINEN, J., YOGANATHAN, V., CORPORAN, J., JANSEN, B.J., & JUNG, S.G. (2019). Machine learning ap-
proach to auto-tagging online content for content marketing efficiency: A comparative analysis between methods and
content type. Journal of Business Research, 101, 203-217.

• SARADHI, V.V., & PALSHIKAR, G.K. (2011). Employee churn prediction. Expert Systems with Applications, 38(3),
1999-2006.

• SINGH, S., ANUSHA, B., & RAGHUVARDHAN, M. (2013). Impact of Banking Services on Customer Empower-
ment, Overall Performance and Customer Satisfaction: Empirical Evidence.  Journal of Business and Management
(IOSR-JBM), 16(1), 17-24.

• SOEINI, R. A., & RODPYSH, K. V. (2012). Applying data mining to insurance customer churn management. Interna-
tional Proceedings of Computer Science and Information Technology, 30, 82-92.

• VERBEKE, W., DEJAEGER, K., MARTENS, D., HUR, J., & BAESENS, B. (2012). New insights into churn pre-
diction in the telecommunication sector: A profit driven data mining approach. European journal of operational re-
search, 218(1), 211-229.

• WALEED, A., PASHA, A., & AKHTAR, A. (2016). Exploring the impact of liquidity on profitability: Evidence from
banking sector of Pakistan. Journal of Internet Banking and Commerce, 21(3).

• WOŹNIAK, M., GRANA, M., & CORCHADO, E. (2014). A survey of multiple classifier systems as hybrid sys-
tems. Information Fusion, 16, 3-17.

• YEŞILKANAT, C. M. (2020). Spatio-temporal estimation of the daily cases of COVID-19 in worldwide using random
forest machine learning algorithm. Chaos, Solitons & Fractals, 140, 110210.

• ZHENG, K., ZHANG, Z., & SONG, B. (2020). E-commerce logistics distribution mode in big-data context: A case
analysis of JD. COM. Industrial Marketing Management, 86, 154-162.

