
CUSTOMER CHURN

PREDICTION PROJECT

BY SHWETA GUPTA

Table of Contents
List of Tables
List of Figures
1. Customer Churn Prediction
1.1 Problem Understanding
1.2 Data Report
1.3 Exploratory data analysis
1.4 Business insights from EDA

List of Tables
Table 1: Data description
Table 2: Data description post removing anomalies
Table 3: Statistical description of data
Table 4: Skewness measure of data
Table 5: Categorical variables levels % share
Table 6: Data description post imputation of missing values
Table 7: Average per numeric variable of different levels of account_segment
Table 8: Average per numeric variable of different levels of Payment
Table 9: Average per numeric variable of different levels of Service_Score
Table 10: Average per numeric variable of different levels of CC_Agent_Score
Table 11: Sample encoded dataset
Table 12: VIF values
Table 13: Feature variance
Table 14: Sample scaled data
Table 15: Principal Components variance and cumulative variance
Table 16: No of clusters and WSS
Table 17: No of clusters and silhouette score
Table 18: Sum of squares for clusters
Table 19: K-Mean cluster proportion
Table 20: Numeric features average per cluster profile

List of Figures
Figure 1: Countplot for categorical features
Figure 2: Tenure distribution and boxplot
Figure 3: CC_Contacted_LY distribution and boxplot
Figure 4: rev_per_month distribution and boxplot
Figure 5: rev_growth_yoy distribution and boxplot
Figure 6: coupon_used_for_payment distribution and boxplot
Figure 7: Days_Since_CC_connect distribution and boxplot
Figure 8: cashback distribution and boxplot
Figure 9: Numeric features pairplot
Figure 10: Pearson’s r Correlation plot
Figure 11: Phik (φk) correlation plot
Figure 12: Churn vs numeric features plot
Figure 13: Churn vs categorical features countplot
Figure 14: account_segment vs numeric features plot
Figure 15: account_segment vs categorical features plot
Figure 16: Complain_ly vs numeric features plot
Figure 17: Complain_ly vs categorical features plot
Figure 18: rev_growth_yoy vs categorical features plot
Figure 19: Chi Square test p values plot for categorical features
Figure 20: Principal Components vs Variable patch plot
Figure 21: WSS vs Numbers of cluster plot
Figure 22: No of clusters vs Average silhouette score
Figure 23: Scatter plot for 2 clusters
Figure 24: Scatter plot for 3 clusters
Figure 25: Scatter plot for 4 clusters
Figure 26: Scatter plot for 5 clusters

1. Customer Churn Prediction
1.1 Problem Understanding

Problem Statement:

A DTH provider is facing a lot of competition in the current market, and retaining existing customers has become a
challenge. Hence, the company wants to develop a binary classification model to predict account churn and provide
segmented offers to potential churners. In this company, account churn is a major concern because one account can
have multiple customers. Hence, by losing one account the company might be losing more than one customer.

Need of the study/project: Thorough analysis of the available data will help in finding out essential attributes that drive the
churn rate of customers. Identifying potential churners and their behavior will help in providing segmented offers or services
to them. This will help in decreasing churn rate and in turn increase the revenue.

Business/Social Opportunity: Providing segmented offers and services will create healthy competition in the market.
This will motivate competitors to come up with creative plans that are both attractive and profitable. As a result,
customers will have varied options to choose from, which reduces the chance of a monopolistic market.

1.2 Data Report

Data Collection Methodology: A random sample was drawn from a primary data source within the organization for
analysis. The data covers approximately 2013-14 to 2022. The following points were considered while choosing a subset
of features from those available:

• Demographics of customer
• Age as a customer with the company
• Payment pattern
• Payment mode
• Customer care interaction and satisfaction measures
• Revenue contribution

The raw data must be at daily frequency, as it records the number of days since customer care was last contacted.

Data Inspection and understanding of attributes:


Data collected is as below:
No of customer account ID: 11260
No of features: 19

#   Column                   Classification  Dtype    Non-Null Count  Description
0   AccountID                Discrete        int64    11260 non-null  account unique identifier
1   Churn                    Discrete        int64    11260 non-null  account churn flag (Target)
2   Tenure                   Continuous      object   11158 non-null  Tenure of account
3   City_Tier                Discrete        float64  11148 non-null  Tier of primary customer's city
4   CC_Contacted_LY          Continuous      float64  11158 non-null  How many times all the customers of the account has contacted customer care in last 12 months
5   Payment                  Discrete        object   11151 non-null  Preferred Payment mode of the customers in the account
6   Gender                   Discrete        object   11152 non-null  Gender of the primary customer of the account
7   Service_Score            Discrete        float64  11162 non-null  Satisfaction score given by customers of the account on service provided by company
8   Account_user_count       Discrete        object   11148 non-null  Number of customers tagged with this account
9   account_segment          Discrete        object   11163 non-null  Account segmentation on the basis of spend
10  CC_Agent_Score           Discrete        float64  11144 non-null  Satisfaction score given by customers of the account on customer care service provided by company
11  Marital_Status           Discrete        object   11048 non-null  Marital status of the primary customer of the account
12  rev_per_month            Continuous      object   11158 non-null  Monthly average revenue generated by account in last 12 months
13  Complain_ly              Discrete        float64  10903 non-null  Any complaints has been raised by account in last 12 months
14  rev_growth_yoy           Continuous      object   11260 non-null  revenue growth percentage of the account (last 12 months vs last 24 to 13 month)
15  coupon_used_for_payment  Continuous      object   11260 non-null  How many times customers have used coupons to do the payment in last 12 months
16  Day_Since_CC_connect     Continuous      object   10903 non-null  Number of days since no customers in the account has contacted the customer care
17  cashback                 Continuous      object   10789 non-null  Monthly average cashback generated by account in last 12 months
18  Login_device             Discrete        object   11039 non-null  Preferred login device of the customers in the account

Table 1: Data description

dtypes: float64(5), int64(2), object(12)


memory usage: 1.6+ MB

Inferences from Table:


• There are 11260 rows and 19 columns.
• Datatypes are int, float and object.
• There are 2676 null values.
• No duplicate rows found.
• Churn is dependent/target variable.

• Tenure should be numeric however the datatype shows as object which hints towards some anomaly in data.
Similar is the case for Account_user_count, rev_per_month, rev_growth_yoy, coupon_used_for_payment,
Day_Since_CC_connect, cashback.

Using the value_counts function, the following anomalies were identified in the data:


• Tenure has special character #
• Account_user_count has special character @
• rev_per_month has special character +
• rev_growth_yoy has special character $
• coupon_used_for_payment has special characters #, $ and *
• Day_Since_CC_connect has special character $
• Cashback has special character $
• Gender has both Male and M, and both Female and F
• account_segment has both Regular Plus and Regular +, and both Super Plus and Super +
• Login_device has the value &&&&
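One possible way to surface such tokens is sketched below on a made-up column (the sample values are illustrative and not taken from the dataset): parse the column as numbers and count the values that fail to parse.

```python
import pandas as pd

# Made-up sample of an object-typed column that should be numeric.
tenure = pd.Series(["4", "0", "#", "12", "#", "7"], name="Tenure")

# With errors="coerce", anything that is not a number becomes NaN.
parsed = pd.to_numeric(tenure, errors="coerce")

# Count the original tokens that failed to parse (and were not already missing).
bad_tokens = tenure[parsed.isna() & tenure.notna()].value_counts()
print(bad_tokens)
```

Running the same check column by column would list the special characters hiding in each object-typed numeric column.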

We will treat these anomalies as below so that they do not form invalid categories and distort our analysis:
• Replace special characters with null.
• Replace &&&& with null.
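A minimal sketch of this treatment, on a toy frame with made-up values. The mapping of M/F to Male/Female is an assumption added for illustration; the report lists the duplicate labels but does not state how they were merged.

```python
import pandas as pd

# Toy frame mimicking the anomalies listed above (values are made up).
raw = pd.DataFrame({
    "Tenure": ["4", "#", "12"],
    "cashback": ["159.93", "$", "120.9"],
    "Login_device": ["Mobile", "&&&&", "Computer"],
    "Gender": ["M", "Female", "Male"],
})

cleaned = raw.copy()

# Special characters cannot be parsed as numbers, so they become NaN.
for col in ["Tenure", "cashback"]:
    cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")

# Blank out the invalid "&&&&" category.
cleaned["Login_device"] = cleaned["Login_device"].mask(cleaned["Login_device"] == "&&&&")

# Assumption: duplicate labels are merged into the long form.
cleaned["Gender"] = cleaned["Gender"].replace({"M": "Male", "F": "Female"})

print(cleaned.dtypes)
```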

Now the datatypes are in line with the description, and the non-null counts have decreased accordingly.

#   Column                   Dtype    Non-Null Count
0   AccountID                int64    11260 non-null
1   Churn                    int64    11260 non-null
2   Tenure                   float64  11042 non-null
3   City_Tier                float64  11148 non-null
4   CC_Contacted_LY          float64  11158 non-null
5   Payment                  object   11151 non-null
6   Gender                   object   11152 non-null
7   Service_Score            float64  11162 non-null
8   Account_user_count       float64  10816 non-null
9   account_segment          object   11163 non-null
10  CC_Agent_Score           float64  11144 non-null
11  Marital_Status           object   11048 non-null
12  rev_per_month            float64  10469 non-null
13  Complain_ly              float64  10903 non-null
14  rev_growth_yoy           float64  11257 non-null
15  coupon_used_for_payment  float64  11257 non-null
16  Day_Since_CC_connect     float64  10902 non-null
17  cashback                 float64  10787 non-null
18  Login_device             object   10500 non-null

Table 2: Data description post removing anomalies

dtypes: float64(12), int64(2), object(5)
memory usage: 1.6+ MB

Table 3: Statistical description of data

#  Column                   Skewness
1  Tenure                   3.896
2  CC_Contacted_LY          1.423
3  rev_per_month            9.094
4  rev_growth_yoy           0.752
5  coupon_used_for_payment  2.575
6  Day_Since_CC_connect     1.273
7  cashback                 8.771

Table 4: Skewness measure of data
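A skewness measure of this kind can be obtained with pandas, as sketched below on made-up numbers. That the report used `DataFrame.skew` is an assumption.

```python
import pandas as pd

# Made-up values: Tenure has one large value, rev_growth_yoy is symmetric.
sample = pd.DataFrame({
    "Tenure": [1, 2, 3, 4, 50],
    "rev_growth_yoy": [10, 12, 14, 16, 18],
})

# Positive values indicate a right-skewed (long right tail) distribution.
skewness = sample.skew()
print(skewness)
```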

Inferences drawn from data inspection:
• Even though City_Tier, Service_Score, Account_user_count, CC_Agent_Score, Complain_ly are numeric by values,
they are categorical in nature. City_Tier, Service_Score, CC_Agent_Score are ordinal, Complain_ly is binary and
Account_user_count is nominal.
• 75% of accounts have a tenure of less than 16, while the maximum tenure is 99. This hints at the presence of outliers.
Similarly, there are hints of outliers in CC_Contacted_LY, rev_per_month, rev_growth_yoy, coupon_used_for_payment,
Day_Since_CC_connect and cashback.
• More than 50% of accounts are from Tier 1 cities.
• More than 75% of customers have not contacted customer care within the last 10 days.
• There can be a maximum of 6 customers in one account.
• All the numeric features are right skewed, though rev_growth_yoy is only mildly so (skewness 0.752).

1.3 Exploratory data analysis

Univariate Analysis:

Categorical:

#   Column              Values            % share
1   Churn               0                 83.16
                        1                 16.84
2   City_Tier           1                 65.15
                        2                 4.31
                        3                 30.54
3   Payment             Debit Card        41.14
                        Credit Card       31.49
                        E wallet          10.91
                        Cash on Delivery  9.09
                        UPI               7.37
4   Gender              Male              60.11
                        Female            39.88
5   Service_Score       0                 0.07
                        1                 0.69
                        2                 29.13
                        3                 49.18
                        4                 20.88
                        5                 0.04
6   Account_user_count  1                 4.12
                        2                 4.86
                        3                 30.15
                        4                 42.24
                        5                 15.71
                        6                 2.91
7   account_segment     Regular           4.66
                        Regular Plus      36.94
                        Super             36.39
                        Super Plus        7.33
                        HNI               14.68
8   CC_Agent_Score      1                 20.66
                        2                 10.44
                        3                 30.15
                        4                 19.09
                        5                 19.66
9   Marital_Status      Single            31.86
                        Married           53.04
                        Divorced          15.09
10  Complain_ly         0                 71.47
                        1                 28.53
11  Login_device        Mobile            71.26
                        Computer          28.74

Table 5: Categorical variables levels % share

Figure 1: Countplot for categorical features

Continuous:

Figure 2: Tenure distribution and boxplot

Figure 3: CC_Contacted_LY distribution and boxplot

Figure 4: rev_per_month distribution and boxplot

Figure 5: rev_growth_yoy distribution and boxplot

Figure 6: coupon_used_for_payment distribution and boxplot

Figure 7: Days_Since_CC_connect distribution and boxplot

Figure 8: cashback distribution and boxplot

Inferences drawn from Univariate Analysis:

• 83% of the observations are for accounts that do not churn.
• 95% of the accounts are from Tier 1 and Tier 3 cities.
• Payments are mostly made by credit or debit card.
• Primary account holders are predominantly male.
• The majority of accounts rated the company's service 3, followed by 2 and 4. This indicates average satisfaction
with company services.
• The majority of accounts are held by married people.
• The majority of accounts have 3-5 customers. This might be because most accounts are held by married people,
whose families tend to use the same account.
• Most accounts belong to the Regular Plus or Super segment, followed by HNI.
• Customer care interaction was most often rated 3, followed by 1, 5 and 4. This shows customer care agents are
performing reasonably well.
• More than 70% of accounts have not logged any complaint in the last 12 months.
• Most customers use a mobile device to access their account.
• All the numeric variables except rev_growth_yoy have outliers and non-normal, right-skewed distributions.

Bivariate Analysis:

Figure 9: Numeric features pairplot

Figure 10: Pearson’s r Correlation plot

Figure 11: Phik (φk) correlation plot

Figure 12: Churn vs numeric features plot

Figure 13: Churn vs categorical features countplot

Figure 14: account_segment vs numeric features plot

Figure 15: account_segment vs categorical features plot

Figure 16: Complain_ly vs numeric features plot

Figure 17: Complain_ly vs categorical features plot

Figure 18: rev_growth_yoy vs categorical features plot

Figure 19: Chi Square test p values plot for categorical features

Inferences drawn from Bivariate Analysis:

• Numeric features do not appear to be strongly linearly correlated.
• The pandas profiling report suggests the presence of some Phik (φk) correlation. This measure captures non-linear
dependence and works well with categorical variables.
• There is a clear difference in average Tenure and Day_Since_CC_connect between accounts that churn and those that
do not. CC_Contacted_LY, cashback, rev_per_month and Account_user_count show minor differences.
rev_growth_yoy and coupon_used_for_payment show negligible differences.
• Payment, Gender, Login_device, Marital_Status, City_Tier, Service_Score, CC_Agent_Score and Account_user_count
are distributed similarly across accounts that churn and those that do not. Only account_segment and Complain_ly
seem to show different patterns between the two groups.
• Very few customers from the Regular and Super Plus segments churn. These customers also have high
average tenure, cashback, rev_per_month, coupon_used_for_payment and Day_Since_CC_connect.
• Customers who churn raise more complaints, as expected. They also have low tenure and contact customer service
more frequently and at shorter intervals. The majority of them belong to the Regular Plus and Super segments.
• P values from the chi-square test suggest that Churn is associated with all categorical features.
• The following feature pairs seem to have no correlation: Payment-Complain_ly, Payment-Service_Score,
Gender-Login_device, account_segment-Complain_ly, Marital_Status-Complain_ly, Login_device-Complain_ly,
Login_device-Service_Score, Login_device-City_Tier, City_Tier-Complain_ly, City_Tier-Service_Score,
Service_Score-Complain_ly, Service_Score-CC_Agent_Score.
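The chi-square test of independence mentioned in these inferences can be sketched as follows. The toy data is made up, and the use of `scipy.stats.chi2_contingency` is an assumption about tooling; the report does not name the function it used.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up toy data: churners complain more often than non-churners.
toy = pd.DataFrame({
    "Churn":       [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "Complain_ly": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
})

# Contingency table of observed counts for the two categorical features.
table = pd.crosstab(toy["Churn"], toy["Complain_ly"])

# A small p-value suggests the two features are not independent.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

With only 12 rows the expected counts are too small for the test to be reliable; the sketch only shows the mechanics.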

Removal of unwanted variables:

• AccountID: It has a unique value for every observation, so it adds no useful information and can be dropped
for further analysis.
• All the categorical columns show dependency with Churn, so we will not drop them for now.
• Among numeric columns, rev_growth_yoy shows negligible variation between customers who churn and those who do
not. It also has no strong correlation with any other numeric variable, and its average does not vary across the
levels of the categorical variables. Since it correlates with neither the independent variables nor the dependent
variable, we will drop it for further analysis.

Missing Value Treatment:

• Numeric: Most of the numeric variables are not normally distributed and contain outliers, hence we have used the
median to replace missing values.
• Categorical: Missing values are replaced with the mode.
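The imputation rule above could look like this in pandas. The frame below is made up, and selecting columns by dtype is a simplifying assumption; numerically coded categorical features such as City_Tier would need to be listed with the categorical columns explicitly.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value per column.
df = pd.DataFrame({
    "Tenure": [1.0, 2.0, np.nan, 3.0, 99.0],
    "Payment": ["Debit Card", None, "Debit Card", "UPI", "Credit Card"],
})

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)

imputed = df.copy()

# The median is robust to outliers such as the tenure of 99.
imputed[num_cols] = imputed[num_cols].fillna(imputed[num_cols].median())

# The mode is the most frequent category in each column.
for col in cat_cols:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```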

#   Column                   Dtype    Non-Null Count
0   AccountID                int64    11260 non-null
1   Churn                    int64    11260 non-null
2   Tenure                   float64  11260 non-null
3   City_Tier                float64  11260 non-null
4   CC_Contacted_LY          float64  11260 non-null
5   Payment                  object   11260 non-null
6   Gender                   object   11260 non-null
7   Service_Score            float64  11260 non-null
8   Account_user_count       float64  11260 non-null
9   account_segment          object   11260 non-null
10  CC_Agent_Score           float64  11260 non-null
11  Marital_Status           object   11260 non-null
12  rev_per_month            float64  11260 non-null
13  Complain_ly              float64  11260 non-null
14  rev_growth_yoy           float64  11260 non-null
15  coupon_used_for_payment  float64  11260 non-null
16  Day_Since_CC_connect     float64  11260 non-null
17  cashback                 float64  11260 non-null
18  Login_device             object   11260 non-null

Table 6: Data description post imputation of missing values

Outlier Treatment:

Outlier values seem to be genuine rather than data entry errors, so we will not treat them. Below is the justification
for each feature:

Tenure: Outlier accounts could be old customers.
CC_Contacted_LY: People who contacted customer care more than 100 times have rated the customer care service very
low. This might be the reason for the multiple contacts.
rev_per_month: 70% of the accounts with high revenue per month are from the Super and Regular Plus segments, and
these segments churn the most. So, we will keep their data in original form, as we need our model to learn their
pattern.
rev_growth_yoy: Does not have any outliers.
coupon_used_for_payment: More than 85% of outlier accounts have 3-5 users, so multiple payments could have been made
and hence coupons used multiple times.
cashback: More than 75% of outlier accounts have paid by credit/debit card, which might carry promotional cashback
offers from banks.

Variable Transformation:

Table 7: Average per numeric variable of different levels of account_segment

Table 8: Average per numeric variable of different levels of Payment

Table 9: Average per numeric variable of different levels of Service_Score

Table 10: Average per numeric variable of different levels of CC_Agent_Score

Comparing the means of numeric variables across the levels of the categorical features, and considering the
significance of these features, we will not merge any levels.

Categorical variables cannot be fed to machine learning algorithms in their raw form; they need to be numeric. We
will convert string values into numeric ones using dummy encoding.
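A minimal sketch of dummy encoding with pandas, on made-up rows. Dropping the first level of each feature (`drop_first=True`) is an assumption, though it is consistent with the encoded column names listed in Table 12 (for example Gender_Male without a Gender_Female column).

```python
import pandas as pd

# Made-up rows with two string-valued categorical features.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Login_device": ["Mobile", "Computer", "Mobile"],
    "Tenure": [4.0, 0.0, 12.0],
})

# One 0/1 column per level; the first level (alphabetically) is dropped
# and acts as the reference category.
encoded = pd.get_dummies(
    df, columns=["Gender", "Login_device"], drop_first=True, dtype=int
)
print(encoded)
```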

Post conversion the dataset is as below (Transposed image)

Table 11: Sample encoded dataset

After transformation we have 24 features.

Note: account_segment was not encoded as ordinal data because there is no clear increase or decrease in the average
value of the numeric features across its levels. The same is the case when it is compared with the other categorical
features.

Addition of new variables: Based on the analysis so far and domain knowledge, we have decided not to create any new
variables from existing ones as of now.

Since all the features are now in numeric format, we can check for multicollinearity using the Variance Inflation
Factor (VIF) and see whether any variable can be dropped further.

#   Column                        VIF
1   Churn                         1.29
2   Tenure                        1.17
3   City_Tier                     1.45
4   CC_Contacted_LY               1.02
5   Service_Score                 1.16
6   Account_user_count            1.14
7   CC_Agent_Score                1.03
8   rev_per_month                 1.01
9   Complain_ly                   1.08
10  coupon_used_for_payment       1.23
11  Day_Since_CC_connect          1.29
12  Cashback                      1.08
13  Payment_Credit Card           3.11
14  Payment_Debit Card            3.33
15  Payment_E wallet              2.32
16  Payment_UPI                   1.7
17  Gender_Male                   1.02
18  account_segment_Regular       1.31
19  account_segment_Regular Plus  2.64
20  account_segment_Super         2.31
21  account_segment_Super Plus    1.46
22  Marital_Status_Married        2.16
23  Marital_Status_Single         2.19
24  Login_device_Mobile           1.01

Table 12: VIF values

All VIF values are less than 5, which suggests that multicollinearity will probably not be an issue while modelling.
Consequently, there is no need to drop any further variables.
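For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below computes it on synthetic data through the inverse of the correlation matrix, whose diagonal holds exactly these values. This is an illustrative route, not necessarily the one used in the report.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features: x3 is nearly a copy of x1, x2 is independent.
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

# The diagonal of the inverse correlation matrix equals 1 / (1 - R_j^2).
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(["x1", "x2", "x3"], vif):
    print(f"{name}: VIF = {value:.2f}")
```

Here x1 and x3 get very large VIF values because each is almost fully explained by the other, while x2 stays close to 1.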

1.4 Business insights from EDA.

Data Imbalance: We have a classification problem at hand. 83% of the data belongs to class 0 and 17% to class 1,
indicating imbalance. We will address this at a later stage by applying one or more of the approaches below:
• Use SMOTE to oversample the minority class.
• Use F1-score as the evaluation metric, since it is suitable for imbalanced data.
• Use algorithms that are least affected by data imbalance.
• Rather than predicting class labels, predict class probabilities so that the threshold can be adjusted and
the classes separated efficiently.
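The last two ideas (F1-score and an adjustable probability threshold) can be sketched together. Everything below is hypothetical: the data is synthetic with roughly the same 83/17 class split, and logistic regression is only a stand-in model. SMOTE is not shown because it requires the separate imbalanced-learn package.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 17% positives), not the churn dataset.
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.83, 0.17], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Predicted probability of the positive (churn) class.
proba = model.predict_proba(X_te)[:, 1]

# Evaluate F1 at several cut-offs instead of the default 0.5.
thresholds = np.arange(0.1, 0.91, 0.1)
scores = {
    round(float(t), 1): f1_score(y_te, (proba >= t).astype(int))
    for t in thresholds
}
best_threshold = max(scores, key=scores.get)
print(best_threshold, round(scores[best_threshold], 3))
```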

We have 24 variables at hand, which is quite a large number, so we will try reducing the count using PCA.

PCA finds the directions of maximum variance, so features measured on larger scales would dominate the components.
Before proceeding with PCA, the data has to be scaled so that no variable dominates incorrectly.

Table 13: Feature variance

Variance across the features is not similar, hence we will use the Min-Max method for scaling.
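Min-Max scaling maps each feature to the range [0, 1] via (x - min) / (max - min). A small sketch with scikit-learn on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two made-up features on very different scales.
X = np.array([
    [1.0, 100.0],
    [5.0, 300.0],
    [9.0, 200.0],
])

# Each column is rescaled independently to [0, 1].
scaled = MinMaxScaler().fit_transform(X)
print(scaled)
```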

Table 14: Sample scaled data

PCA is an unsupervised learning technique, so we will also drop the dependent variable Churn.

PCA usually does not perform well on categorical data. We have a good number of categorical features, hence PCA
might not give a good result.

Table 15: Principal Components variance and cumulative variance

As per the above table, 11 PCs explain almost 90% of the variance in the data. So, we could go ahead with creating
profiles for these 11 PCs and drop the remaining ones.
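Choosing the number of components from the cumulative explained variance can be sketched as below. The data is random and stands in for the scaled feature matrix, and the 90% cut-off mirrors the figure quoted above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((200, 6))  # stand-in for the scaled features

pca = PCA().fit(X)

# Running total of the variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90%.
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_components, np.round(cumulative, 3))
```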

Figure 20: Principal Components vs Variable patch plot

Looking at the patch plot (which identifies prominent features for a PC), we could not create clear, meaningful
profiles for the new dimensions. Hence, we will not proceed further with PCA.

Clustering: This is also an unsupervised technique. We have a fairly large dataset, so K-Means clustering could be a
good option. It is a non-hierarchical clustering process in which the number of clusters must be specified before
clustering starts. The optimal number of clusters can be identified using the methods below:

1. Elbow curve: It is a plot of the number of clusters against the corresponding within sum of squares (WSS), also
called inertia.

Figure 21: WSS vs Numbers of cluster plot

Number of Clusters  Within Sum of Squares (Inertia)
1                   30862.38
2                   26750.03
3                   24387.65
4                   22655.85
5                   21270.75
6                   20260.68
7                   19334.7
8                   18753.02
9                   17905.99

Table 16: No of clusters and WSS

WSS drops significantly from 2 to 3 clusters, by about 2362. From 3 to 4 clusters the drop is about 1732, and from
4 to 5 clusters about 1385. Beyond 5 clusters the decline in WSS becomes smaller still as clusters are added. The
WSS plot suggests the data could be clustered into 3, 4 or 5 groups.
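The successive drops can be reproduced directly from the WSS values in Table 16:

```python
# WSS (inertia) per number of clusters, copied from Table 16.
wss = {
    1: 30862.38, 2: 26750.03, 3: 24387.65, 4: 22655.85, 5: 21270.75,
    6: 20260.68, 7: 19334.70, 8: 18753.02, 9: 17905.99,
}

# Drop in WSS when moving from k-1 to k clusters.
drops = {k: round(wss[k - 1] - wss[k], 2) for k in range(2, 10)}

for k, drop in drops.items():
    print(f"{k - 1} -> {k} clusters: WSS drops by {drop}")
```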

2. Silhouette Score: It measures how similar data points are to their own cluster compared with other clusters.

Number of Clusters Silhouette Score

2 0.135
3 0.123
4 0.126
5 0.148
6 0.142
7 0.146
8 0.145
9 0.145

Table 17: No of clusters and silhouette score

Figure 22: No of clusters vs Average silhouette score

The score ranges from -1 to +1. The closer the score is to +1, the farther the data points are from neighboring
clusters and the closer they are to their own cluster. So, we usually select the number of clusters for which the
silhouette score is highest. Here the score is highest for 5 clusters (0.148), followed by 7 clusters (0.146).
Increasing the number of clusters further does not improve the score. However, none of the options has a decent
silhouette score.
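Both diagnostics (inertia for the elbow curve and the average silhouette score) can be collected in one loop. The sketch below uses synthetic, well-separated blobs rather than the churn data, so the winning number of clusters is unambiguous here, unlike in the report.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.6,
    random_state=0,
)

inertia, sil = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia[k] = km.inertia_                  # WSS for the elbow curve
    sil[k] = silhouette_score(X, km.labels_)  # average silhouette width

# Pick the k with the highest average silhouette score.
best_k = max(sil, key=sil.get)
print(best_k, {k: round(v, 3) for k, v in sil.items()})
```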

3. Cluster Scatter Plot:

Figure 23: Scatter plot for 2 clusters

Figure 24: Scatter plot for 3 clusters

Figure 25: Scatter plot for 4 clusters

Figure 26: Scatter plot for 5 clusters

The plots suggest that the data points are clearly separated only for 2 clusters. All the others have overlapping data points.

4. Comparison of Between Sum of squares, Within Sum of squares, Total Within Sum of squares

Number of  Between Sum  Total Sum    Within Sum  Total Within    Size of
Clusters   of Squares   of Squares   of Squares  Sum of Squares  cluster
2          4112.34      30862.38     12860.04    26750.03        5188
                                     13890                       6072
3          6474.73      30862.38     7953.58     24387.65        3520
                                     10724.91                    4579
                                     5709.16                     3161
4          8206.52      30862.38     7380.78     22655.85        3342
                                     4625.26                     2174
                                     4798.04                     2765
                                     5851.78                     2979
5          9591.62      30862.38     4067.25     21270.75        2477
                                     4734.72                     2279
                                     4095.98                     2219
                                     5477.78                     2819
                                     2895.03                     1466

Table 18: Sum of squares for clusters

Between Sum of squares: Sum, over all clusters, of the cluster size times the squared Euclidean distance between the
cluster centroid and the overall mean. It equals the Total Sum of squares minus the Total Within Sum of squares.
Total Sum of squares: Measure of total variability in the data (this is the same irrespective of the number of clusters).
Within Sum of squares: Sum of squared Euclidean distances of all the points within a cluster from the cluster centroid.
Total Within Sum of squares: Sum of the within sum of squares of all the clusters.
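These quantities are tied together by the usual decomposition Total SS = Between SS + Total Within SS, where the between sum of squares weights each centroid's squared distance from the overall mean by the cluster size. The values in Table 18 agree with this up to rounding (for 2 clusters, 30862.38 - 26750.03 = 4112.35). A toy check with made-up points and labels:

```python
import numpy as np

# Four made-up points in two obvious clusters.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

grand_mean = X.mean(axis=0)
total_ss = ((X - grand_mean) ** 2).sum()

within_ss = {}
between_ss = 0.0
for k in np.unique(labels):
    pts = X[labels == k]
    centroid = pts.mean(axis=0)
    # Squared distances of the cluster's points from its own centroid.
    within_ss[int(k)] = ((pts - centroid) ** 2).sum()
    # Size-weighted squared distance of the centroid from the overall mean.
    between_ss += len(pts) * ((centroid - grand_mean) ** 2).sum()

total_within_ss = sum(within_ss.values())
print(total_ss, between_ss, total_within_ss)
```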

For 4 and 5 clusters, the WSS of every individual cluster is less than the between sum of squares. 5 clusters has
the highest between sum of squares, but WSS for individual clusters decreases significantly. So 4 clusters could be
an optimal option.

Based on all the methods, there is no clarity on a specific number of clusters. Still, 4 seems a viable option. We
will create 4 clusters and analyse further.

Steps to create clusters

Step 1:
Create an object using the KMeans class from the sklearn.cluster module for 4 clusters.

Step 2:
Fit the object created in step 1 with the scaled data.
Our model is ready now. Using the labels_ attribute we can extract the cluster assigned to each observation.
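The two steps can be sketched as follows. The input is random data standing in for the scaled features, and `n_init` and `random_state` are illustrative choices that the report does not specify.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
scaled_data = rng.random((200, 5))  # stand-in for the min-max scaled features

# Step 1: create the KMeans object for 4 clusters.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)

# Step 2: fit it on the scaled data.
kmeans.fit(scaled_data)

# labels_ holds the cluster index (0 to 3) assigned to each observation.
cluster_labels = kmeans.labels_

# Percentage share of each cluster.
share = np.bincount(cluster_labels, minlength=4) / len(cluster_labels) * 100
print(np.round(share, 2))
```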

The table below shows the proportion share of all four clusters.

K Cluster Classification  % share in data
0                         29.68
1                         26.46
2                         24.56
3                         19.30

Table 19: K-Mean cluster proportion

Validation of model: This can be done by checking the silhouette width (sil-width) of each observation, obtained
using the silhouette_samples function. A value greater than 0 implies that the record has been mapped to the correct
cluster.

The minimum sil-width is 0.068, so the sil-width of every observation is greater than 0. This indicates that data
points are correctly mapped to their centroids.

The silhouette score is 0.235, which is quite low, indicating that the overall clustering performance is not good.
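A sketch of this validation on synthetic, well-separated data (not the churn data): `silhouette_samples` returns one width per observation, and `silhouette_score` is their mean.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data with four clearly separated groups.
X, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [8, 8], [-8, 8], [8, -8]],
    cluster_std=0.5,
    random_state=0,
)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# One silhouette width per observation; a negative value would flag a point
# that sits closer to a neighboring cluster than to its own.
sil_width = silhouette_samples(X, labels)

# The overall silhouette score is the mean of the per-observation widths.
overall = silhouette_score(X, labels)
print(round(float(sil_width.min()), 3), round(float(overall), 3))
```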

                         Cluster 0    Cluster 1    Cluster 2    Cluster 3
                         Average      Average      Average      Average
Tenure                   9.753142     11.191352    11.759855    11.500504
CC_Contacted_LY          17.450628    18.455842    17.516817    18.165827
rev_per_month            6.336625     5.657314     6.281013     6.620342
coupon_used_for_payment  1.685518     1.767709     1.901266     1.821752
Day_Since_CC_connect     4.232496     4.992180     4.862568     4.411547
cashback                 187.264545   195.518201   196.290043   201.855764

Table 20: Numeric features average per cluster profile

Averages across the numeric variables differ only slightly between the cluster profiles, indicating no meaningful
difference in the characteristics of the clusters.
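A cluster profile of this shape can be built with a groupby on the cluster label. The rows below are made up, and the column name `cluster` is a hypothetical name for the label column.

```python
import pandas as pd

# Made-up observations with their assigned cluster label.
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1],
    "Tenure": [10.0, 12.0, 9.0, 11.0],
    "cashback": [180.0, 200.0, 190.0, 210.0],
})

# Mean of each numeric feature per cluster, transposed so that
# features are rows and clusters are columns.
profile = df.groupby("cluster").mean().T
print(profile)
```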

Based on the analysis so far we can conclude that the clusters created do not differ from each other and hence will not
add value if we create separate models for these clusters.

Business Insights:

• Data is imbalanced. This will need to be handled using suitable algorithms and evaluation metrics. If
required, SMOTE will be used.
• New and moderate-tenure customers churn more.
• Tier 1 and Tier 3 city customers churn more. Tier 1: They have ample options available in the market. Tier 3: They
do not have enough exposure to DTH usage.
• The majority of customers are from the Regular Plus, Super and HNI segments. Their average revenue is lowest, and
they churn the most. These customers should be specifically addressed.
• The majority of customers use debit and credit cards for payment. Tie up with banks to provide offers on debit and
credit cards for payment of DTH subscriptions.
• Have separate plans based on gender and number of users per account.
• Accounts with high user count churn more, maybe due to the absence of plans that cater to the needs of all age groups.
• Mobile usage is higher than computer/laptop usage. Mobile-specific plans and higher-priced plans for usage on
computers could be considered.
• Have a customer care team dedicated to contacting owners of low-tenure accounts. Get their feedback and ask
whether they have any issues. This could help establish a good relationship.
• Customers who have not contacted customer care in a long time churn more, so they should be called and checked upon.
• Customers are not very satisfied with company services, but they are satisfied with customer care agent services.
So, focus on coming up with competitive plans, customer-friendly policies and offers that can compete with other
providers.

More insights and recommendations will be provided based on further analysis and the model we build.
