
Problem Statement

Customer churn is a burning problem for telecom companies. Almost every telecom company
pays a premium to get a customer on board, so customer churn directly impacts a company's
revenue.

In this case-study, we simulate one such case of customer churn where we work on a data of
post-paid customers with a contract. The data has information about customer usage behaviour,
contract details, and payment details. The data also indicates which were the customers who
cancelled their service.

Based on this past data, perform an EDA and build a model that can predict whether a
customer will cancel their service in the future.

Data Dictionary

Churn - 1 if customer cancelled service, 0 if not


AccountWeeks - number of weeks customer has had active account
ContractRenewal - 1 if customer recently renewed contract, 0 if not
DataPlan - 1 if customer has data plan, 0 if not
DataUsage - gigabytes of monthly data usage
CustServCalls - number of calls into customer service
DayMins - average daytime minutes per month
DayCalls - average number of daytime calls
MonthlyCharge - average monthly bill
OverageFee - largest overage fee in last 12 months
RoamMins - average number of roaming minutes

#Import all necessary modules
import pandas as pd  
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import metrics,model_selection
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_matrix,ConfusionMatrixDisplay
from sklearn.preprocessing import scale

cell_df = pd.read_excel("Cellphone.xlsx")
EDA

cell_df.head()

cell_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Churn            3333 non-null   int64
 1   AccountWeeks     3303 non-null   float64
 2   ContractRenewal  3315 non-null   float64
 3   DataPlan         3324 non-null   float64
 4   DataUsage        3317 non-null   float64
 5   CustServCalls    3281 non-null   float64
 6   DayMins          3298 non-null   float64
 7   DayCalls         3322 non-null   float64
 8   MonthlyCharge    3320 non-null   float64
 9   OverageFee       3309 non-null   float64
 10  RoamMins         3326 non-null   float64
dtypes: float64(10), int64(1)
memory usage: 286.6 KB

There are missing values in some columns.


All variables are of numeric type and contain no data inconsistencies (such as special
characters that would force numeric variables to the object dtype).
Churn is the target variable.
Churn, ContractRenewal and DataPlan are binary variables.

cell_df[['AccountWeeks','DataUsage','CustServCalls','DayMins','DayCalls','MonthlyCharge','OverageFee','RoamMins']].describe()


Check for Missing values

cell_df.isnull().sum()

Churn 0
AccountWeeks 30
ContractRenewal 18
DataPlan 9
DataUsage 16
CustServCalls 52
DayMins 35
DayCalls 11
MonthlyCharge 13
OverageFee 24
RoamMins 7
dtype: int64

Imputing missing values

Since ContractRenewal and DataPlan are binary, we cannot substitute mean values for
these two variables. We will impute them with their respective modal values.

cols = ['ContractRenewal','DataPlan']
for column in cols:
    print(column)
    mode_1 = cell_df[column].mode()[0]
    print(mode_1)
    cell_df[column].fillna(value=mode_1,inplace=True)
    
cell_df.isnull().sum()

ContractRenewal
1.0
DataPlan
0.0
Churn 0
AccountWeeks 30
ContractRenewal 0
DataPlan 0
DataUsage 16
CustServCalls 52
DayMins 35
DayCalls 11
MonthlyCharge 13
OverageFee 24
RoamMins 7
dtype: int64

Now let us impute the remaining continuous variables with the median, using the
SimpleImputer submodule from sklearn.

from sklearn.impute import SimpleImputer

SI = SimpleImputer(strategy='median')

cell_df = pd.DataFrame(SI.fit_transform(cell_df),columns=cell_df.columns)

cell_df.head()
cell_df.isnull().sum()

Churn 0
AccountWeeks 0
ContractRenewal 0
DataPlan 0
DataUsage 0
CustServCalls 0
DayMins 0
DayCalls 0
MonthlyCharge 0
OverageFee 0
RoamMins 0
dtype: int64

Checking for Duplicates

# Are there any duplicates ?
dups = cell_df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
#df[dups]

Number of duplicate rows = 0

Proportion in the Target classes

cell_df.Churn.value_counts(normalize=True)

0.0 0.855086
1.0 0.144914
Name: Churn, dtype: float64

Distribution of the variables Check

from pylab import rcParams

rcParams['figure.figsize'] = 15,8

cell_df[['AccountWeeks','DataUsage','CustServCalls','DayMins','DayCalls','MonthlyCharge','OverageFee','RoamMins']].hist();

Outlier Checks

cols = ['AccountWeeks','DataUsage','CustServCalls','DayMins','DayCalls','MonthlyCharge','OverageFee','RoamMins']
for i in cols:
    sns.boxplot(cell_df[i])
    plt.show()

Although the boxplots flag outliers, the describe() output shows that these values are not far
from the rest of the distribution. Capping them at the whisker min/max values would leave most
variables with many identical values, so outliers are not treated in this case. For reference, a
sketch of the capping approach we decided against follows.
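
A minimal sketch, assuming the cols list defined above, of IQR-based capping: values beyond
the 1.5 × IQR whiskers are pulled back to the whisker limits.

# Sketch only: IQR-based capping, NOT applied in this analysis
def cap_outliers(df, columns):
    capped = df.copy()
    for col in columns:
        q1, q3 = capped[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Clip values beyond the boxplot whiskers back to the whisker limits
        capped[col] = capped[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
    return capped

# capped_df = cap_outliers(cell_df, cols)  # skipped, for the reasons above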

Bi-Variate Analysis with Target variable

Account Weeks and Churn

sns.boxplot(x=cell_df['Churn'],y=cell_df['AccountWeeks']);

AccountWeeks shows a similar distribution for churn and no churn, and is approximately
normally distributed.

Data Usage against Churn

sns.boxplot(x=cell_df['Churn'],y=cell_df['DataUsage']);

DataUsage shows a clear distinction between churn and no churn. Customers who have not
churned show a wider distribution, indicating higher data usage, whereas churned customers
have a narrower distribution (mostly near zero data usage) with many outliers, indicating that a
few heavy data users still churned.

DayMins against Churn

sns.boxplot(x=cell_df['Churn'],y=cell_df['DayMins']);

DayMins shows a distinction between churn and no churn; both groups are roughly normally
distributed with little skewness, and the distribution is wider for churn than for no churn.

DayCalls against Churn

sns.boxplot(x=cell_df['Churn'],y=cell_df['DayCalls']);

DayCalls shows a similar distribution for churn and no churn, and is approximately normally distributed.

MonthlyCharge against Churn

sns.boxplot(x=cell_df['Churn'],y=cell_df['MonthlyCharge']);

MonthlyCharge shows some skewness in both the churn and no-churn groups. The distribution
is wider for churn, suggesting that higher monthly charges are associated with more churn, and
the median for churn is higher than for no churn.

OverageFee against Churn

sns.boxplot(x=cell_df['Churn'],y=cell_df['OverageFee']);

The distributions are very similar between churn and no churn.

RoamMins against Churn

sns.boxplot(x=cell_df['Churn'],y=cell_df['RoamMins']);

The distributions are very similar between churn and no churn, and the medians are almost the same.

CustServCalls against Churn

sns.boxplot(x=cell_df['Churn'],y=cell_df['CustServCalls']);

The distribution is much wider for churn than for no churn; more customer-service calls are
associated with more churn.

Contract Renewal against Churn

sns.countplot(x=cell_df['ContractRenewal'],hue=cell_df['Churn']);

Note that ContractRenewal is coded in the opposite direction to Churn: Churn = 0 means the
customer did not cancel the service, whereas ContractRenewal = 0 means the customer did not
renew the contract.
Among customers who did not renew their contract, the counts of churn and no churn are almost
the same. Most customers who renewed the contract did not churn.

Data Plan against Churn

sns.countplot(x=cell_df['DataPlan'],hue=cell_df['Churn']);
# pd.crosstab(cell_df['DataPlan'],cell_df['Churn']).plot(kind='bar');

Relatively few customers have opted for a data plan. A broadly similar share of customers
churned whether or not they had a data plan, so there is no significant difference between
churn and no churn here.

Train (70%) - Test (30%) Split

# Creating a copy of the original data frame
df = cell_df.copy()

df.head()


X = df.drop('Churn',axis=1)
Y = df.pop('Churn')

Y.value_counts()

0.0 2850
1.0 483
Name: Churn, dtype: int64

X_train,X_test,Y_train,Y_test = model_selection.train_test_split(X,Y,test_size=0.30,random_state=1)
print('Number of rows and columns of the training set for the independent variables:',X_train.shape)
print('Number of rows and columns of the training set for the dependent variable:',Y_train.shape)
print('Number of rows and columns of the test set for the independent variables:',X_test.shape)
print('Number of rows and columns of the test set for the dependent variable:',Y_test.shape)

Number of rows and columns of the training set for the independent variables: (2333, 10)
Number of rows and columns of the training set for the dependent variable: (2333,)
Number of rows and columns of the test set for the independent variables: (1000, 10)
Number of rows and columns of the test set for the dependent variable: (1000,)

Applying Standard Scaler to scale the data

from sklearn.preprocessing import StandardScaler
stand_scal = StandardScaler()
X_train = stand_scal.fit_transform(X_train)
X_test = stand_scal.transform(X_test)

LDA Model

#Build LDA Model
clf = LinearDiscriminantAnalysis()
model=clf.fit(X_train,Y_train)

Prediction

# Training Data Class Prediction with a cut-off value of 0.5
pred_class_train = model.predict(X_train)

# Test Data Class Prediction with a cut-off value of 0.5
pred_class_test = model.predict(X_test)
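
Both predictions use predict, whose implicit cut-off is 0.5. As a sketch (not part of the
original analysis), a different cut-off could be applied via predict_proba; the 0.3 threshold
below is purely illustrative, chosen to favour recall on the churn class.

# Class prediction with a custom probability cut-off (0.3 is illustrative)
custom_cutoff = 0.3
prob_churn_test = model.predict_proba(X_test)[:, 1]  # P(Churn = 1) for each test row
pred_custom_test = (prob_churn_test >= custom_cutoff).astype(int)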

Training Data and Test Data Confusion Matrix Comparison

## Confusion matrix on the training data
cm = confusion_matrix(Y_train, pred_class_train)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()

## Confusion matrix on the test data
cm = confusion_matrix(Y_test, pred_class_test)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()

Training Data and Test Data Classification Report Comparison

print('Classification Report of the training data:\n\n',metrics.classification_report(Y_train,pred_class_train))
print('Classification Report of the test data:\n\n',metrics.classification_report(Y_test,pred_class_test))

Classification Report of the training data:

              precision    recall  f1-score   support

         0.0       0.88      0.95      0.92      1995
         1.0       0.47      0.24      0.32       338

    accuracy                           0.85      2333
   macro avg       0.67      0.60      0.62      2333
weighted avg       0.82      0.85      0.83      2333

Classification Report of the test data:

              precision    recall  f1-score   support

         0.0       0.89      0.95      0.92       855
         1.0       0.48      0.28      0.35       145

    accuracy                           0.85      1000
   macro avg       0.68      0.61      0.63      1000
weighted avg       0.83      0.85      0.83      1000

Inferences

Note:

Precision: of all observations predicted as positive, the fraction that are actually positive.

Recall: of all actually positive observations, the fraction that were predicted as positive.

Inferences for the test data, using the default cut-off of 0.5

For customers who did not churn (label 0):

Precision (89%): of all customers predicted as non-churners, 89% actually did not churn.

Recall (95%): of all customers who actually did not churn, 95% were predicted correctly.

For customers who did churn (label 1):

Precision (48%): of all customers predicted as churners, 48% actually did churn.

Recall (28%): of all customers who actually churned, 28% were predicted correctly.

Overall accuracy of the model: 85% of total predictions are correct.
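
As a cross-check of these figures, the same metrics can be derived by hand from the test
confusion matrix computed earlier (a sketch using sklearn's tn/fp/fn/tp ordering for binary
problems):

# Derive churn-class precision, recall and overall accuracy from the confusion matrix
tn, fp, fn, tp = confusion_matrix(Y_test, pred_class_test).ravel()
precision_churn = tp / (tp + fp)   # of predicted churners, fraction that actually churned
recall_churn = tp / (tp + fn)      # of actual churners, fraction we caught
accuracy = (tp + tn) / (tn + fp + fn + tp)
print('Precision: %.2f, Recall: %.2f, Accuracy: %.2f' % (precision_churn, recall_churn, accuracy))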

Accuracy, AUC, precision and recall for the test data are almost in line with the training data.
This suggests that neither overfitting nor underfitting has occurred and that the model
generalises well; note, however, that recall on the churn class (28%) is low.

Probability prediction for the training and test data

# Training Data Probability Prediction
pred_prob_train = model.predict_proba(X_train)

# Test Data Probability Prediction
pred_prob_test = model.predict_proba(X_test)

# AUC and ROC for the training data

# calculate AUC
auc = metrics.roc_auc_score(Y_train,pred_prob_train[:,1])
print('AUC for the Training Data: %.3f' % auc)

#  calculate roc curve
fpr, tpr, thresholds = metrics.roc_curve(Y_train,pred_prob_train[:,1])
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.',label = 'Training Data')
# AUC and ROC for the test data

# calculate AUC
auc = metrics.roc_auc_score(Y_test,pred_prob_test[:,1])
print('AUC for the Test Data: %.3f' % auc)

#  calculate roc curve
fpr, tpr, thresholds = metrics.roc_curve(Y_test,pred_prob_test[:,1])
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.',label='Test Data')
# show the plot
plt.legend(loc='best')
plt.show()

pred_prob_train[:,1]

array([0.02069932, 0.20179416, 0.15422548, ..., 0.10058096, 0.16733976,
       0.0528347 ])

Generate Coefficients and intercept for the Linear Discriminant Function

#intercept value
clf.intercept_ 

array([-2.37504846])

#coefficients for the Linear Discriminant Function
clf.coef_

array([[-0.0016027 , -0.87558073, -0.29525062, -0.31084048,  0.7657377 ,
         0.47806798,  0.04872382,  0.29382077,  0.23943236,  0.18057541]])

X.columns

Index(['AccountWeeks', 'ContractRenewal', 'DataPlan', 'DataUsage',
       'CustServCalls', 'DayMins', 'DayCalls', 'MonthlyCharge', 'OverageFee',
       'RoamMins'],
      dtype='object')

a=clf.coef_
np.round(a,2) # coefficients rounded to 2 decimal places

array([[-0.  , -0.88, -0.3 , -0.31,  0.77,  0.48,  0.05,  0.29,  0.24,
         0.18]])
Pairing the rounded coefficients with X.columns (same order) shows that:
ContractRenewal (−0.88) and CustServCalls (0.77) have the largest magnitudes, so they
contribute most to the classification.
AccountWeeks (≈0.00) has the smallest magnitude, so it contributes least.
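
For readability, a small sketch (not part of the original analysis) that pairs each coefficient
with its feature name and sorts by absolute magnitude:

# Pair LDA coefficients with feature names, sorted by absolute magnitude
coef_series = pd.Series(clf.coef_[0], index=X.columns)
print(coef_series.reindex(coef_series.abs().sort_values(ascending=False).index))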

Using LDA for Dimensionality Reduction

lda_model = LinearDiscriminantAnalysis(n_components=1)  # with only two classes, LDA yields at most one discriminant component
X_train_lda = lda_model.fit_transform(X_train, Y_train)
X_test_lda = lda_model.transform(X_test)

print(X_train_lda.shape)
print(X_test_lda.shape)

(2333, 1)
(1000, 1)

The output above confirms we only have 1 feature for all the records in our training and test
sets.
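
Because the reduced data has a single dimension, the class separation it achieves can be
inspected directly, for example with overlaid histograms of the discriminant scores (a quick
visual sketch using the variables defined above):

# Overlaid histograms of the single LDA component, split by class
plt.hist(X_train_lda[(Y_train == 0).values].ravel(), bins=30, alpha=0.5, label='No Churn')
plt.hist(X_train_lda[(Y_train == 1).values].ravel(), bins=30, alpha=0.5, label='Churn')
plt.xlabel('LDA component 1')
plt.legend()
plt.show()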

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_lda, Y_train)
y_pred = model.predict(X_test_lda)

The output below shows that with only this single feature, the model achieves an accuracy of
85%, the same as the accuracy achieved using all the features.

print('Classification Report of the test data:\n\n',metrics.classification_report(Y_test,y_pred))

Classification Report of the test data:

              precision    recall  f1-score   support

         0.0       0.88      0.97      0.92       855
         1.0       0.50      0.19      0.27       145

    accuracy                           0.85      1000
   macro avg       0.69      0.58      0.60      1000
weighted avg       0.82      0.85      0.83      1000
END
