
VIETNAM NATIONAL UNIVERSITY, HANOI


INTERNATIONAL SCHOOL

INS2061 – DATA MINING IN BUSINESS ANALYTICS

FINAL PROJECT
TELCO CUSTOMER CHURN DATA MINING

Student names: Đặng Anh Tài – MIS2018A – 18071494
               Nguyễn Thị Thanh Thư – MIS2016A – 16071310
Lecturer: Trần Thị Oanh

Hanoi – 2021


I) Introduction
- Many companies in the market now focus on analysing customer or employee data to deal with business problems, such as the turnover rate or current market trends. One of the biggest problems, and one that causes a lot of damage to any business, is the churn rate: the probability that a customer stops using the company's products. Because competition in the market is very high, every company should reduce the churn rate as much as possible.

- In this project, we use the "Telco Customer Churn" dataset from the Kaggle.com website. The company behind it mainly provides Internet and telecommunication services. The dataset consists of more than 7000 rows and 21 columns and provides information such as:

+) Which customers left within the last month – the column is called Churn.

+) Services that each customer has signed up for – phone, multiple lines, internet, online security,
online backup, device protection, tech support, and streaming TV and movies.

+) Customer account information – how long they’ve been a customer, contract, payment method,
paperless billing, monthly charges, and total charges.

+) Demographic info about customers – gender, age range, and if they have partners and
dependents.

II) Problem statement


- In this business, the churn rate is a critical indicator for tracking the health of a subscription-based company. More precisely, by predicting the customer churn rate the company can take measures in advance to retain customers consistently. Therefore, the goal of this project is to build a churn prediction model so that Telco can optimize its products and services proactively.

- Various data are available to help solve this problem. Customer account information (how long someone has been a customer, which type of contract and services they use) and customer demographics (gender, relationship status, dependents...) can all affect churn, so these attributes of the Telco customer dataset are used to model the churn rate.

- To look at the problem more closely, we can apply several machine learning methods such as decision trees, logistic regression, and random forest classification.

- These approaches can easily be applied to the business model. The business wants to see which factors may affect the churn rate and who Telco's target customers are. Once these questions are answered, the results can be used to provide better service to the targeted customers so that they stay with the company, while keeping the flexibility to serve new ones.

III) Data collection and data preprocessing


1) Cleaning the data and data preprocessing.
In this part, we handle the missing values and the meaningless data, eliminating or transforming them into data types that are useful for building and training the models.

- First of all, we use Google Colab as the main notebook environment to execute the code. After that, we load the dataset into Google Colab.

- To import the data and prepare for running the machine learning models, we import libraries such as pandas, numpy, sklearn, and matplotlib, as sketched below.
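Below is a minimal sketch of the imports a notebook like this relies on; the exact aliases, the seaborn import, and the metric helpers are assumptions on our part.

```python
# Core libraries for data handling, visualisation, and modelling
import pandas as pd                     # data loading and manipulation
import numpy as np                      # numerical helpers
import matplotlib.pyplot as plt         # plotting
import seaborn as sns                   # heatmap visualisation (assumed)

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             f1_score, recall_score)
```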


- When we have finished importing the libraries, we import the dataset and check its values. It can easily be seen that this dataset has 7043 rows and 21 columns and reports no null values. In reality, however, the dataset has 11 missing values:

- To handle this problem, we use the dropna() function to remove the missing values from the data. After the removal, the dataset no longer contains any missing values, and the number of rows decreases from 7043 to 7032.
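A sketch of the loading and cleaning steps described above. The CSV file name is an assumption, and the handling of TotalCharges reflects the public Kaggle version of this dataset, where the 11 missing values appear as blank strings in that column.

```python
# Load the Telco churn dataset (file name is an assumption)
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

print(df.shape)                           # (7043, 21)
print(df.isnull().sum().sum())            # 0 explicit nulls reported

# The missing values hide as blank strings in TotalCharges, so we
# convert the column to numeric first (blanks become NaN) and drop them
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
print(df['TotalCharges'].isnull().sum())  # 11 missing values

df = df.dropna()
print(df.shape)                           # (7032, 21)
```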


- After dealing with the missing values, we check the dataset with the head() function. The dataset contains a lot of mixed numeric and non-numeric values as well as floating-point numbers.

Only three columns have integer data types. The problem is that, to apply the machine learning methods, we have to transform the other columns into numeric/integer data types. To solve this, we use label encoding to transform all the mixed values into numeric data types, rename the columns, and split off some additional columns to prepare for the modelling section, as sketched below.
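A sketch of the label-encoding step; the decision to leave the customerID column untouched and the loop over object columns are our assumptions, and the renaming and additional columns mentioned above are not reproduced here.

```python
# Inspect the first rows of the cleaned data
print(df.head())

# Label-encode every remaining non-numeric (object) column so that
# all features become integers or floats
le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    if col != 'customerID':           # keep the identifier column as-is
        df[col] = le.fit_transform(df[col])

print(df.dtypes)                      # only numeric columns remain (plus customerID)
```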


- After this transformation, all the columns contain integer or floating-point numbers.


2) Splitting the dataset into a training set and a testing set


- After cleaning the dataset, we split it into a training set and a testing set with a ratio of 70% - 30%, which prepares the data for the modelling steps, as sketched below.
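A sketch of the 70/30 split; the choice of feature columns, the target column name, and the random_state are assumptions.

```python
# Separate the independent variables (X) from the dependent variable (y = Churn)
X = df.drop(columns=['customerID', 'Churn'])
y = df['Churn']

# 70 % training / 30 % testing split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)    # roughly (4922, 19) and (2110, 19)
```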

IV) Data visualization


- In this step, we use a correlation heatmap to identify which factors may affect the churn rate in the Telco dataset, as sketched below.
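A sketch of how such a heatmap can be produced with seaborn; the figure size, colour map, and the decision to drop customerID are assumptions.

```python
# Correlation matrix of the encoded dataset, visualised as a heatmap
plt.figure(figsize=(16, 12))
corr = df.drop(columns=['customerID']).corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation heatmap of the Telco churn dataset')
plt.show()
```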


From the heatmap, the Fiber Optic internet service (0.39), paperless billing with the electronic check payment method (0.3), and partner status (0.32) show strong correlations with churn.


The contract type is encoded so that the value "0" represents the month-to-month contract, "1" the one-year contract, and "2" the two-year contract. It seems that customers who signed a month-to-month contract have a higher churn rate.

Those four components affect the churn rate the most. Surprisingly, the gender of the customer (0.0085) does not affect the churn rate; in other words, gender tells the company nothing in the churn situation.


V) Build and train the model


1) DV and IV
Variable (data type)              IV/DV
Customer ID (categorical)         IV
Gender (categorical)              IV
Senior Citizen (numeric)          IV
Partner (categorical)             IV
Tenure (numeric)                  IV
Phone service (categorical)       IV
Multiple lines (categorical)      IV
Contract (categorical)            IV
Paperless billing (categorical)   IV
Payment method (categorical)      IV
Internet service (categorical)    IV
Online security (categorical)     IV
Online backup (categorical)       IV
Device protection (categorical)   IV
Tech support (categorical)        IV
Streaming TV (categorical)        IV
Streaming movies (categorical)    IV
Monthly charges (float)           IV
Total charges (categorical)       IV
Churn (categorical)               DV

2) Running machine learning methods.


- In this situation, we choose three different machine learning methods: Logistic Regression, Decision Tree, and Random Forest classification.

a) Logistic Regression classifier

- We first fit the Logistic Regression classifier on the training data.


- After building the Logistic Regression model, we print the classification report as well as the confusion matrix, as sketched below.
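A minimal sketch of fitting the classifier and printing the evaluation outputs; the solver defaults and the max_iter value are assumptions.

```python
# Fit a Logistic Regression classifier on the training set
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Evaluate it on the held-out test set
y_pred = log_reg.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Exact scores used later in the comparison table
print('F1:', f1_score(y_test, y_pred))
print('Recall:', recall_score(y_test, y_pred))
```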

- From the result, the confusion matrix contains 1423 TP values, 157 FP, 245 TN, and 285 FN.

- The report gives an F1 score of 0.51. To obtain the exact numbers for the comparison, we use the print() function.


b) Random Forest classifier

- We build and evaluate the Random Forest classifier in the same way, as sketched below.
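A sketch of the Random Forest run, assuming default hyper-parameters apart from a fixed random_state.

```python
# Fit a Random Forest classifier and evaluate it the same way
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)
print(confusion_matrix(y_test, y_pred_rf))
print('F1:', f1_score(y_test, y_pred_rf))
print('Recall:', recall_score(y_test, y_pred_rf))
```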

- The confusion matrix has 1434 TP values, 146 FP values, 261 TN values, and 269 FN values. The exact F1 score is 0.569, which is smaller than the F1 score of the Logistic Regression model.


c) Decision tree model

- To run the decision tree model, we first import all the required libraries, such as sklearn.tree and graphviz, so that the decision tree runs without errors.

- We also visualize the decision tree, but the full tree is quite big and hard to examine, so we decided to optimize it using pydotplus, as sketched below.
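A sketch of training the tree and rendering a smaller, easier-to-read version with graphviz and pydotplus; the max_depth used to keep the drawing readable is our assumption, not necessarily the setting used in the original notebook.

```python
from sklearn.tree import export_graphviz
import pydotplus
from IPython.display import Image

# Train the decision tree; limiting the depth keeps the drawing readable
dt = DecisionTreeClassifier(max_depth=4, random_state=42)
dt.fit(X_train, y_train)

# Export the tree to DOT format and render it with pydotplus
dot_data = export_graphviz(dt, out_file=None,
                           feature_names=X.columns,
                           class_names=['No Churn', 'Churn'],
                           filled=True, rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
```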


- After that, we obtain the optimized tree.

- After finishing the optimization, we compute the confusion matrix, the accuracy score, and the F1 score. The F1 score is about 0.47.


- When choosing among the methods, we do not rely on the accuracy score as usual, because this dataset is imbalanced with respect to churn. If we chose the accuracy score, the result could be ambiguous, since the data split is random, so accuracy is not our main metric. We recommend the F1 score instead, because it conveys the balance between precision and recall: F1 = 2 * (precision * recall) / (precision + recall), i.e. the harmonic mean of the two. The F1 score therefore lets us select a model based on a balance between precision and recall.

- After running all three models, we can give an overview of the results:

Model                          F1 score   Recall
Logistic Regression            0.5864     0.54
Random Forest Classification   0.5693     0.51
Decision Tree                  0.4644     0.35

- So, with the highest F1 score as well as the highest accuracy, Logistic Regression seems to be the best model in this situation. However, because the dataset is imbalanced – the ratio between No Churn and Churn is very large – the random split may produce a misleading result and lead to the wrong model. We therefore handle the imbalance with SMOTE (Synthetic Minority Over-sampling Technique). To illustrate how this technique works, consider some training data with s samples and F features in the feature space; for simplicity, assume these features are continuous. SMOTE then creates synthetic minority-class samples by interpolating between existing minority samples and their nearest neighbours.

3) Fine-tuning
- As mentioned above, we use SMOTE to deal with the imbalance. Because No Churn dominates Churn, this over-sampling technique adds more synthetic Churn samples to create a balance between Churn and No Churn, as sketched below.
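A sketch of the over-sampling step using the imbalanced-learn library; applying SMOTE only to the training split and the random_state are our assumptions.

```python
from imblearn.over_sampling import SMOTE

# Over-sample the minority class (Churn) in the training data only
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

print(y_train.value_counts())       # imbalanced: No Churn dominates
print(y_train_bal.value_counts())   # balanced: equal Churn / No Churn counts
```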

- Having created the balance, we now run all three models again.

a) Logistic Regression

- After rebuilding the Logistic Regression model on the balanced data, the F1 score is now nearly 0.63, while the recall is much better (0.80). A sketch of the refit is shown below.
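A sketch of refitting the Logistic Regression classifier on the SMOTE-balanced training data and recomputing the scores; the same pattern applies to the other two models.

```python
# Retrain the Logistic Regression classifier on the balanced training data
log_reg_bal = LogisticRegression(max_iter=1000)
log_reg_bal.fit(X_train_bal, y_train_bal)

y_pred_bal = log_reg_bal.predict(X_test)
print('F1:', f1_score(y_test, y_pred_bal))          # reported above: ~0.63
print('Recall:', recall_score(y_test, y_pred_bal))  # reported above: ~0.80
```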


b) Random Forest classifier

- After running the same steps for the Random Forest classifier on the balanced data, the F1 score is now 0.6054, while the recall is 0.63.


c) Decision Tree

- For this method we also visualize the tree and then optimize it. Surprisingly, the F1 and recall scores differ slightly before and after the optimization: the F1 score increases from 0.5812 to 0.5915 and the recall changes from 0.618 to 0.7611. In conclusion, the table below compares the models before and after eliminating the imbalance.

                               Before balancing        After balancing
Model                          F1 score   Recall       F1 score   Recall
Logistic Regression            0.5864     0.54         0.63       0.80
Random Forest                  0.5693     0.51         0.609      0.63
Decision Tree                  0.4644     0.35         0.5951     0.761

- After balancing, the Logistic Regression model has the highest scores, so we choose the Logistic Regression model as the recommendation for the company.

VI) Project expansion


- This is not an ideal dataset, due to the imbalance and the missing values, as well as some attributes that may not be relevant for prediction. It should also include data such as age or monthly salary to help identify the target customers. With that information it would be much easier to predict the churn trend and to find out which factors affect the churn rate.
- The limitation of this project is that, although we tried our best to find the best model for this dataset and the factors that affect the churn rate, the results come from a random data split, so in a real situation some predictions might no longer hold true. We should investigate further to find the best result for the churn rate.

VII) Result communication and Recommendation


- From this situation, we can see that the imbalance in the Telco dataset creates some barriers to identifying the best prediction model. We can also see the factors that affect the churn rate, such as the payment method or the internet service; these are the issues Telco should care about. Based on the factors mentioned above, Telco should provide benefits such as discounts on its services, as well as better customer service, to reduce the churn rate.

Member                    Contribution   Work done
Đặng Anh Tài              60%            running code, debugging, writing report
Nguyễn Thị Thanh Thư      40%            debugging, writing report
