Predictive Modelling - Linear Discriminant Analysis - Mentor version

Problem Statement
Customer churn is a burning problem for telecom companies. Almost every telecom company
pays a premium to get a customer on board, so customer churn directly impacts the company's
revenue.
In this case study, we simulate one such case of customer churn, where we work on data for
post-paid customers with a contract. The data has information about customer usage behaviour,
contract details, and payment details. The data also indicates which customers
cancelled their service.
Based on this past data, perform an EDA and build a model which can predict whether a
customer will cancel their service in the future or not.
Data Dictionary
Churn - 1 if customer cancelled service, 0 if not

AccountWeeks - number of weeks customer has had active account
ContractRenewal - 1 if customer recently renewed contract, 0 if not
DataPlan - 1 if customer has data plan, 0 if not
DataUsage - gigabytes of monthly data usage
CustServCalls - number of calls into customer service
DayMins - average daytime minutes per month
DayCalls - average number of daytime calls
MonthlyCharge - average monthly bill
OverageFee - largest overage fee in last 12 months
RoamMins - average number of roaming minutes
#Import all necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import metrics,model_selection
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_matrix
from sklearn.preprocessing import scale
cell_df = pd.read_excel("Cellphone.xlsx")
EDA
cell_df.head()
cell_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Churn 3333 non-null int64
1 AccountWeeks 3303 non-null float64
2 ContractRenewal 3315 non-null float64
3 DataPlan 3324 non-null float64
4 DataUsage 3317 non-null float64
5 CustServCalls 3281 non-null float64
6 DayMins 3298 non-null float64
7 DayCalls 3322 non-null float64
8 MonthlyCharge 3320 non-null float64
9 OverageFee 3309 non-null float64
10 RoamMins 3326 non-null float64
dtypes: float64(10), int64(1)
memory usage: 286.6 KB
There are missing values in some columns.

All variables are numeric, and there are no data inconsistencies (such as special characters
in the data that would cause numeric variables to be read as object).
Churn is the target variable.
Churn, ContractRenewal and DataPlan are binary variables.
cell_df[['AccountWeeks','DataUsage','CustServCalls','DayMins','DayCalls','MonthlyCharge','
Check for Missing values
cell_df.isnull().sum()
Churn 0
AccountWeeks 30
ContractRenewal 18
DataPlan 9
DataUsage 16
CustServCalls 52
DayMins 35
DayCalls 11
MonthlyCharge 13
OverageFee 24
RoamMins 7
dtype: int64
Imputing missing values
Since ContractRenewal and DataPlan are binary, we cannot substitute missing values with the
mean for these two variables. We will impute them with their respective modal values.
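cols = ['ContractRenewal','DataPlan']
for column in cols:
    print(column)
    # Most frequent value of the binary column
    mode_1 = cell_df[column].mode()[0]
    print(mode_1)
    # Assign back rather than using inplace=True on a column selection
    cell_df[column] = cell_df[column].fillna(mode_1)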

ContractRenewal
1.0
DataPlan
0.0
Churn 0
AccountWeeks 30
ContractRenewal 0
DataPlan 0
DataUsage 16
CustServCalls 52
DayMins 35
DayCalls 11
MonthlyCharge 13
OverageFee 24
RoamMins 7
dtype: int64
Now let us impute the remaining continuous variables with the median. For that we are going to
use the SimpleImputer class from the sklearn.impute module.
from sklearn.impute import SimpleImputer
SI = SimpleImputer(strategy='median')
cell_df = pd.DataFrame(SI.fit_transform(cell_df),columns=cell_df.columns)
cell_df.head()
Churn 0
AccountWeeks 0
ContractRenewal 0
DataPlan 0
DataUsage 0
CustServCalls 0
DayMins 0
DayCalls 0
MonthlyCharge 0
OverageFee 0
RoamMins 0
dtype: int64
Checking for Duplicates
# Are there any duplicates ?
dups = cell_df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
#df[dups]
Number of duplicate rows = 0
Proportion in the Target classes
cell_df.Churn.value_counts(normalize=True)
0.0 0.855086
1.0 0.144914
Name: Churn, dtype: float64
Distribution of the variables Check
from pylab import rcParams
rcParams['figure.figsize'] = 15,8
cell_df[['AccountWeeks','DataUsage','CustServCalls','DayMins','DayCalls','MonthlyCharge','
Outlier Checks
cols=['AccountWeeks','DataUsage','CustServCalls','DayMins','DayCalls','MonthlyCharge','Ove
for i in cols:
    sns.boxplot(x=cell_df[i])
    plt.show()
Although outliers exist as per the boxplots, the data distribution in describe() shows that the
values are not too far away. Treating the outliers by capping them at the min/max whisker values
would cause many observations to take the same value. So, outliers are not treated in this case.
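As a rough illustration of the capping that is being avoided here, the sketch below computes the usual 1.5 * IQR boxplot whiskers and clips values to them. The toy series and the helper name iqr_bounds are assumptions made for illustration and are not taken from Cellphone.xlsx.

```python
import pandas as pd

def iqr_bounds(series: pd.Series) -> tuple:
    """Return the lower and upper 1.5 * IQR whisker limits used by a boxplot."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical toy column (not from the churn data)
toy = pd.Series([0, 1, 1, 1, 2, 2, 2, 3, 3, 9])

lower, upper = iqr_bounds(toy)

# Count the points a boxplot would flag as outliers
n_outliers = int(((toy < lower) | (toy > upper)).sum())

# Capping replaces every flagged point with the nearest whisker value
capped = toy.clip(lower=lower, upper=upper)

print("whiskers:", lower, upper)
print("outliers flagged:", n_outliers)
print("max before / after capping:", toy.max(), capped.max())
```

With many flagged points in a column, capping would pile them all onto the same whisker value, which is the concern raised above.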
Bi-Variate Analysis with Target variable
Account Weeks and Churn
sns.boxplot(x=cell_df['Churn'],y=cell_df['AccountWeeks']);
AccountWeeks shows a similar distribution for churn and no churn, and is approximately normally
distributed.
Data Usage against Churn
sns.boxplot(x=cell_df['Churn'],y=cell_df['DataUsage']);
DataUsage shows a clear distinction between churn and no churn. Customers who have not
churned show a wider distribution, indicating more data usage, whereas customers who have
churned show a narrower distribution (mostly near a data usage of 0) with many outliers, indicating
that a few customers with higher data usage have still churned.
DayMins against Churn
sns.boxplot(x=cell_df['Churn'],y=cell_df['DayMins']);
DayMins shows a distinction between churn and no churn, and both are approximately normally
distributed with little skewness. The distribution is much wider for churn than for no churn.
DayCalls against Churn
sns.boxplot(x=cell_df['Churn'],y=cell_df['DayCalls']);
DayCalls shows a similar distribution for churn and no churn, and is approximately normally distributed.
MonthlyCharge against Churn
sns.boxplot(x=cell_df['Churn'],y=cell_df['MonthlyCharge']);
MonthlyCharge shows some skewness in the distribution for both churn and no churn.
The distribution is much wider for churn, and the median monthly charge is higher for churn
than for no churn, indicating that a higher monthly charge goes with more churn.
OverageFee against Churn
sns.boxplot(x=cell_df['Churn'],y=cell_df['OverageFee']);
The distribution is almost the same for churn and no churn.
RoamMins against Churn
sns.boxplot(x=cell_df['Churn'],y=cell_df['RoamMins']);
The distribution is almost the same for churn and no churn. The medians are almost the same.
CustServCalls against Churn
sns.boxplot(x=cell_df['Churn'],y=cell_df['CustServCalls']);
The distribution is much wider for churn and narrower for no churn. More CustServCalls indicates
more churn.
Contract Renewal against Churn
sns.countplot(x=cell_df['ContractRenewal'],hue=cell_df['Churn']);
The coding of ContractRenewal runs opposite to that of Churn: a Churn value of 0 means that the
user has not cancelled the service, whereas a ContractRenewal value of 0 means that the user has
not renewed the contract.
When customers have not renewed the contract, the counts of churn and no churn are almost the same.
Most customers who have renewed the contract have not churned.
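The count plot can be backed by row-wise proportions. The sketch below uses pd.crosstab with normalize="index" on a small hypothetical frame; the toy values are assumptions for illustration only, and in the notebook the same call could presumably be applied to cell_df['ContractRenewal'] and cell_df['Churn'].

```python
import pandas as pd

# Hypothetical toy data, not taken from Cellphone.xlsx
toy = pd.DataFrame({
    "ContractRenewal": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "Churn":           [1, 1, 0, 0, 0, 0, 0, 0, 0, 1],
})

# normalize="index" makes every row sum to 1, so each cell is the share
# of churn / no churn within one ContractRenewal group
churn_rate = pd.crosstab(toy["ContractRenewal"], toy["Churn"], normalize="index")
print(churn_rate)
```

Reading across a row gives the churn share within that group, which is easier to compare across groups than raw bar heights when the groups differ in size.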
Data Plan against Churn
sns.countplot(x=cell_df['DataPlan'],hue=cell_df['Churn']);
# pd.crosstab(cell_df['DataPlan'],cell_df['Churn']).plot(kind='bar');
Very few people have opted for a data plan. Almost one-fifth of the customers have
churned irrespective of whether they have a data plan or not. There isn't any significant
difference between churn and no churn here.
Train (70%) - Test (30%) Split
# Creating a copy of the original data frame
df = cell_df.copy()
df.head()
X = df.drop('Churn',axis=1)
Y = df.pop('Churn')
Y.value_counts()
0.0 2850
1.0 483
Name: Churn, dtype: int64
X_train,X_test,Y_train,Y_test = model_selection.train_test_split(X,Y,test_size=0.30,random_
print('Number of rows and columns of the training set for the independent variables:',X_tr
print('Number of rows and columns of the training set for the dependent variable:',Y_train
print('Number of rows and columns of the test set for the independent variables:',X_test.s
print('Number of rows and columns of the test set for the dependent variable:',Y_test.shap
Number of rows and columns of the training set for the independent variables: (2333,
Number of rows and columns of the training set for the dependent variable: (2333,)
Number of rows and columns of the test set for the independent variables: (1000, 10)
Number of rows and columns of the test set for the dependent variable: (1000,)
Applying Standard Scaler to scale the data
from sklearn.preprocessing import StandardScaler
stand_scal = StandardScaler()
X_train = stand_scal.fit_transform(X_train)
X_test = stand_scal.transform (X_test)
LDA Model
#Build LDA Model
clf = LinearDiscriminantAnalysis()
model=clf.fit(X_train,Y_train)
Prediction
# Training Data Class Prediction with a cut-off value of 0.5
pred_class_train = model.predict(X_train)
# Test Data Class Prediction with a cut-off value of 0.5
pred_class_test = model.predict(X_test)
Training Data and Test Data Confusion Matrix Comparison
from sklearn.metrics import ConfusionMatrixDisplay
## Confusion matrix on the training data
cm = confusion_matrix(Y_train, pred_class_train)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
## Confusion matrix on the test data
cm = confusion_matrix(Y_test, pred_class_test)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
Training Data and Test Data Classification Report Comparison
print('Classification Report of the training data:\n\n',metrics.classification_report(Y_tr
print('Classification Report of the test data:\n\n',metrics.classification_report(Y_test,p
Classification Report of the training data:

              precision    recall  f1-score   support

         0.0       0.88      0.95      0.92      1995
         1.0       0.47      0.24      0.32       338

    accuracy                           0.85      2333
   macro avg       0.67      0.60      0.62      2333
weighted avg       0.82      0.85      0.83      2333
Classification Report of the test data:

              precision    recall  f1-score   support

         0.0       0.89      0.95      0.92       855
         1.0       0.48      0.28      0.35       145

    accuracy                           0.85      1000
   macro avg       0.68      0.61      0.63      1000
weighted avg       0.83      0.85      0.83      1000
Inferences
Note:
Precision: tells us how many predictions are actually positive out of all the positive
predictions.
Recall: how many observations of the positive class are actually predicted as positive.
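To make the two definitions concrete, here is a small sketch that derives precision and recall from the cells of a confusion matrix. The label arrays are hypothetical toy values, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels: 1 = positive class, 0 = negative class
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: of everything predicted positive, how much is truly positive
precision = tp / (tp + fp)
# Recall: of everything truly positive, how much was predicted positive
recall = tp / (tp + fn)

print("precision:", precision)
print("recall:", recall)

# The library helpers compute the same quantities
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```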
Inferences using the default cut-off value of 0.5 on the test data
For customers who did not churn (label 0):
Precision (89%) – of all customers predicted as not churning, 89% actually did not churn.
Recall (95%) – of all customers who actually did not churn, 95% were predicted correctly.
For customers who churned (label 1):
Precision (48%) – of all customers predicted as churning, 48% actually churned.
Recall (28%) – of all customers who actually churned, 28% were predicted correctly.
Overall accuracy of the model – 85% of all predictions are correct.
Accuracy, AUC, precision and recall for the test data are almost in line with the training data,
which suggests that the model is not overfitting. Overall accuracy is good, although recall for
the churn class is low (28% on the test data) at the default cut-off.
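For context on the 85% accuracy figure, a majority-class baseline is a useful reference point. The minimal sketch below rebuilds the test labels from the supports in the test classification report (855 non-churn, 145 churn). The all-zero feature matrix and the variable names are assumptions for illustration, and the dummy model is fitted on the same labels only to keep the example short.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels rebuilt from the test-set supports: 855 non-churn (0), 145 churn (1)
y_test = np.array([0] * 855 + [1] * 145)

# Placeholder features: DummyClassifier ignores them (assumption for illustration)
X_dummy = np.zeros((len(y_test), 1))

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_test)
baseline_pred = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_test, baseline_pred)
baseline_recall = recall_score(y_test, baseline_pred)  # recall for churn (label 1)

print("baseline accuracy:", baseline_accuracy)
print("baseline churn recall:", baseline_recall)
```

On an imbalanced target like this one, accuracy alone can look strong even for a model that never predicts churn, so the per-class precision and recall are the more informative comparison.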
Probability prediction for the training and test data
# Training Data Probability Prediction
pred_prob_train = model.predict_proba(X_train)
# Test Data Probability Prediction
pred_prob_test = model.predict_proba(X_test)
# AUC and ROC for the training data
# calculate AUC
auc = metrics.roc_auc_score(Y_train,pred_prob_train[:,1])
print('AUC for the Training Data: %.3f' % auc)
# calculate roc curve
fpr, tpr, thresholds = metrics.roc_curve(Y_train,pred_prob_train[:,1])
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.',label = 'Training Data')
# AUC and ROC for the test data
# calculate AUC
auc = metrics.roc_auc_score(Y_test,pred_prob_test[:,1])
print('AUC for the Test Data: %.3f' % auc)
# calculate roc curve
fpr, tpr, thresholds = metrics.roc_curve(Y_test,pred_prob_test[:,1])
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.',label='Test Data')
# show the plot
plt.legend(loc='best')
plt.show()
pred_prob_train[:,1]
array([0.02069932, 0.20179416, 0.15422548, ..., 0.10058096, 0.16733976,
       0.0528347 ])
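The class predictions earlier use the default cut-off of 0.5, but the predicted probabilities allow other cut-offs to be tried. The sketch below is an illustration on synthetic data from make_classification, not the churn data. The helper classify and the alternative cut-off of 0.3 are assumptions chosen only to show the mechanics.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import recall_score

# Synthetic, imbalanced stand-in data (assumption; not Cellphone.xlsx)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=0,
    weights=[0.85, 0.15], random_state=1,
)

lda_demo = LinearDiscriminantAnalysis().fit(X_demo, y_demo)
prob_pos = lda_demo.predict_proba(X_demo)[:, 1]  # probability of class 1

def classify(prob, cutoff):
    """Label an observation 1 when its class-1 probability reaches the cut-off."""
    return (prob >= cutoff).astype(int)

pred_default = classify(prob_pos, 0.5)  # default cut-off
pred_low = classify(prob_pos, 0.3)      # hypothetical lower cut-off

print("recall at 0.5:", recall_score(y_demo, pred_default))
print("recall at 0.3:", recall_score(y_demo, pred_low))
```

Lowering the cut-off can only add predicted positives, so recall for the positive class cannot go down, usually at the cost of precision.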
Generate Coefficients and intercept for the Linear Discriminant Function
#intercept value
clf.intercept_
array([-2.37504846])
#coefficients for the Linear Discriminant Function
clf.coef_
array([[-0.0016027 , -0.87558073, -0.29525062, -0.31084048,  0.7657377 ,
         0.47806798,  0.04872382,  0.29382077,  0.23943236,  0.18057541]])
X.columns
Index(['AccountWeeks', 'ContractRenewal', 'DataPlan', 'DataUsage',
       'CustServCalls', 'DayMins', 'DayCalls', 'MonthlyCharge', 'OverageFee',
       'RoamMins'],
      dtype='object')
a=clf.coef_
np.round(a,2) # coefficients rounded to 2 decimal places
array([[-0.  , -0.88, -0.3 , -0.31,  0.77,  0.48,  0.05,  0.29,  0.24,
         0.18]])
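The intercept and coefficients define a linear discriminant score of the form intercept + sum(coefficient * scaled predictor). The sketch below is fitted on synthetic data (an assumption, not the churn data). It rebuilds that score by hand and checks it against scikit-learn's decision_function. For a two-class LDA, the class-1 probability is the logistic function of this score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in data (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_redundant=0, random_state=0
)
lda_demo = LinearDiscriminantAnalysis().fit(X_demo, y_demo)

# Linear discriminant score: intercept + sum_j coef_j * x_j
score = X_demo @ lda_demo.coef_.ravel() + lda_demo.intercept_[0]

# Logistic transform of the score gives the class-1 probability
manual_prob = 1.0 / (1.0 + np.exp(-score))

print(np.allclose(score, lda_demo.decision_function(X_demo)))
print(np.allclose(manual_prob, lda_demo.predict_proba(X_demo)[:, 1]))
```

A positive score corresponds to a class-1 probability above 0.5, which is why the default prediction uses a cut-off of 0.5.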
From the coefficients above (fitted on standardized predictors), it is clear that
predictor 'ContractRenewal' has the largest magnitude (-0.88), closely followed by
'CustServCalls' (0.77), so these help the most in classifying
predictor 'AccountWeeks' has the smallest magnitude (about 0.00), so it helps the least in
classifying
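One way to read the coefficients against their column names is to pair them in a pandas Series and sort by absolute value. The sketch below copies the coefficient values and column order from the outputs printed above; the variable names are chosen for illustration.

```python
import pandas as pd

# Column order as printed by X.columns
columns = ['AccountWeeks', 'ContractRenewal', 'DataPlan', 'DataUsage',
           'CustServCalls', 'DayMins', 'DayCalls', 'MonthlyCharge',
           'OverageFee', 'RoamMins']

# Coefficients as printed by clf.coef_
coefs = [-0.0016027, -0.87558073, -0.29525062, -0.31084048, 0.7657377,
         0.47806798, 0.04872382, 0.29382077, 0.23943236, 0.18057541]

coef_series = pd.Series(coefs, index=columns)

# Rank predictors by the magnitude of their coefficient, keeping the sign
order = coef_series.abs().sort_values(ascending=False).index
ranked = coef_series.reindex(order)
print(ranked.round(2))
```

Because the predictors were standardized before fitting, the magnitudes are on a comparable scale, and the sign shows the direction in which each predictor moves the discriminant score.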
Using LDA for Dimensionality Reduction
lda_model = LinearDiscriminantAnalysis(n_components = 1)# as only two classes are there fo
X_train_lda = lda_model.fit_transform(X_train, Y_train)
X_test_lda = lda_model.transform(X_test)
print(X_train_lda.shape)
print(X_test_lda.shape)
(2333, 1)
(1000, 1)
The output above confirms we only have 1 feature for all the records in our training and test
sets.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_lda, Y_train)
y_pred = model.predict(X_test_lda)
The output below shows that with only a single feature, our machine learning model achieves an
accuracy of 85% which is same as the accuracy achieved using all the features.
print('Classification Report of the test data:\n\n',metrics.classification_report(Y_test,y_
Classification Report of the test data:

              precision    recall  f1-score   support

         0.0       0.88      0.97      0.92       855
         1.0       0.50      0.19      0.27       145

    accuracy                           0.85      1000
   macro avg       0.69      0.58      0.60      1000
weighted avg       0.82      0.85      0.83      1000
END