Predictive Modelling - Linear Discriminant Analysis - Mentor version (Colaboratory notebook)
Customer churn is a burning problem for telecom companies. Almost every telecom company
pays a premium to get a customer on board, so customer churn directly impacts the company's
revenue.
In this case study, we simulate one such case of customer churn, working with data on
post-paid customers with a contract. The data has information about customer usage behaviour,
contract details, and payment details. It also indicates which customers
cancelled their service.
Based on this past data, perform an EDA and build a model that can predict whether a
customer will cancel their service in the future.
Data Dictionary
#Import all necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import metrics,model_selection
from sklearn.metrics import roc_auc_score,roc_curve,classification_report,confusion_matrix
from sklearn.preprocessing import scale
cell_df = pd.read_excel("Cellphone.xlsx")
EDA
cell_df.head()
cell_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Churn 3333 non-null int64
1 AccountWeeks 3303 non-null float64
2 ContractRenewal 3315 non-null float64
3 DataPlan 3324 non-null float64
4 DataUsage 3317 non-null float64
5 CustServCalls 3281 non-null float64
6 DayMins 3298 non-null float64
7 DayCalls 3322 non-null float64
8 MonthlyCharge 3320 non-null float64
9 OverageFee 3309 non-null float64
10 RoamMins 3326 non-null float64
dtypes: float64(10), int64(1)
memory usage: 286.6 KB
cell_df[['AccountWeeks','DataUsage','CustServCalls','DayMins','DayCalls','MonthlyCharge','
cell_df.isnull().sum()
Churn 0
AccountWeeks 30
ContractRenewal 18
DataPlan 9
DataUsage 16
CustServCalls 52
DayMins 35
DayCalls 11
MonthlyCharge 13
OverageFee 24
RoamMins 7
dtype: int64
Since ContractRenewal and DataPlan are binary, we cannot substitute mean values for
these two variables. We will impute them with their respective modal values instead.
cols = ['ContractRenewal','DataPlan']
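for column in cols:
    print(column)
    mode_1 = cell_df[column].mode()[0]
    print(mode_1)
    # Assign the result back; fillna(..., inplace=True) on a selected column may not update cell_df in newer pandas
    cell_df[column] = cell_df[column].fillna(mode_1)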
cell_df.isnull().sum()
ContractRenewal
1.0
DataPlan
0.0
Churn 0
AccountWeeks 30
ContractRenewal 0
DataPlan 0
DataUsage 16
CustServCalls 52
DayMins 35
DayCalls 11
MonthlyCharge 13
OverageFee 24
RoamMins 7
dtype: int64
Now let us impute the remaining continuous variables with the median. For that we are going to
use the SimpleImputer class from the sklearn.impute module.
from sklearn.impute import SimpleImputer
SI = SimpleImputer(strategy='median')
cell_df = pd.DataFrame(SI.fit_transform(cell_df),columns=cell_df.columns)
cell_df.head()
cell_df.isnull().sum()
Churn 0
AccountWeeks 0
ContractRenewal 0
DataPlan 0
DataUsage 0
CustServCalls 0
DayMins 0
DayCalls 0
MonthlyCharge 0
OverageFee 0
RoamMins 0
dtype: int64
# Are there any duplicates ?
dups = cell_df.duplicated()
print('Number of duplicate rows = %d' % (dups.sum()))
#df[dups]
cell_df.Churn.value_counts(normalize=True)
0.0 0.855086
1.0 0.144914
Name: Churn, dtype: float64
from pylab import rcParams
rcParams['figure.figsize'] = 15,8
cell_df[['AccountWeeks','DataUsage','CustServCalls','DayMins','DayCalls','MonthlyCharge','
Outlier Checks
cols=['AccountWeeks','DataUsage','CustServCalls','DayMins','DayCalls','MonthlyCharge','Ove
for i in cols:
    # Pass the column explicitly as x so the plot orientation does not depend on the seaborn version
    sns.boxplot(x=cell_df[i])
    plt.show()
Although outliers exist as per the boxplots, the data distribution in describe() shows that the
values are not too far away. Treating the outliers by capping them at the whisker minimum/maximum
would cause many observations of a variable to take the same value. So, outliers are not treated in this case.
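As a rough illustration of what capping at the boxplot limits would do, the sketch below computes the usual 1.5 x IQR whisker bounds, counts the points outside them, and clips them. The helper name iqr_bounds and the toy values are hypothetical and are not taken from Cellphone.xlsx.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) limits that a standard boxplot uses for its whiskers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy values, only for illustration
toy = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 12.0])

lower, upper = iqr_bounds(toy)
# Points outside the whisker limits are the ones a boxplot draws as outliers
n_outliers = int(((toy < lower) | (toy > upper)).sum())
# Capping replaces every outlier with the nearest limit, so capped points share one value
capped = np.clip(toy, lower, upper)

print(lower, upper, n_outliers)
print(capped)
```

On the notebook's data, the same helper could be applied column by column, for example iqr_bounds(cell_df['DayMins']), to see how many values capping would overwrite before deciding whether to treat them.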
sns.boxplot(x=cell_df['Churn'],y=cell_df['AccountWeeks']);
AccountWeeks shows a similar distribution for churn and no churn, and looks roughly
symmetric in both groups.
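A boxplot shows symmetry and spread but not the full shape of a distribution, so a numerical skewness check can back up the visual reading. The sketch below uses two hypothetical toy series, not the notebook's data, to show how pandas reports skewness for a symmetric and a right-skewed sample.

```python
import pandas as pd

# Hypothetical toy samples, only for illustration
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_skewed = pd.Series([1.0, 1.0, 1.0, 2.0, 10.0])

# Skewness near 0 suggests a symmetric distribution; a positive value means a long right tail
skew_symmetric = symmetric.skew()
skew_right = right_skewed.skew()

print(skew_symmetric, skew_right)
```

For the notebook's data, a per-group check could look like cell_df.groupby('Churn')['AccountWeeks'].skew(). Skewness close to 0 is consistent with a normal distribution but does not establish it.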
sns.boxplot(x=cell_df['Churn'],y=cell_df['DataUsage']);
DataUsage shows a clear distinction between churn and no churn. Customers who have not
churned show a wider distribution, indicating more data usage, whereas customers who have
churned show a narrower distribution (mostly near a data usage of 0) with many outliers, indicating
that a few customers with higher data usage have still churned.
sns.boxplot(x=cell_df['Churn'],y=cell_df['DayMins']);
DayMins shows a distinction between churn and no churn, and both groups look roughly symmetric with
little skewness. The distribution is much wider for churn than for no churn.
DayCalls against Churn
sns.boxplot(x=cell_df['Churn'],y=cell_df['DayCalls']);
DayCalls shows a similar distribution for churn and no churn, and looks roughly symmetric in both groups.
sns.boxplot(x=cell_df['Churn'],y=cell_df['MonthlyCharge']);
MonthlyCharge shows some skewness in the distribution for both churn and no churn.
The distribution is much wider for churn, and the median for churn is higher than for no churn,
suggesting that a higher monthly charge goes with more churn.
sns.boxplot(x=cell_df['Churn'],y=cell_df['OverageFee']);
sns.boxplot(x=cell_df['Churn'],y=cell_df['RoamMins']);
The distribution is almost the same for churn and no churn. The medians are almost the same.
sns.boxplot(x=cell_df['Churn'],y=cell_df['CustServCalls']);
The distribution is much wider for churn and narrower for no churn. More CustServCalls indicates more
churn.
Contract Renewal against Churn
sns.countplot(x=cell_df['ContractRenewal'],hue=cell_df['Churn']);
ContractRenewal is coded in the opposite direction to Churn: a Churn value of 0 means that the user
has not cancelled the service, whereas a ContractRenewal value of 0 means that the user has not
renewed the contract.
When customers have not renewed the contract, the counts of churn and no churn are almost the same.
Most customers who have renewed the contract have not churned.
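The comparison of counts above can also be expressed as a churn rate per group, which is easier to read when the groups differ in size. The sketch below uses a small hypothetical data frame, not the notebook's data, and normalises a cross-tabulation by row.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "ContractRenewal": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "Churn":           [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
})

# normalize="index" makes each row sum to 1, so column 1 is the churn rate within each group
churn_rate = pd.crosstab(toy["ContractRenewal"], toy["Churn"], normalize="index")

print(churn_rate)
```

On the notebook's data, the equivalent call would be pd.crosstab(cell_df['ContractRenewal'], cell_df['Churn'], normalize='index').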
sns.countplot(x=cell_df['DataPlan'],hue=cell_df['Churn']);
# pd.crosstab(cell_df['DataPlan'],cell_df['Churn']).plot(kind='bar');
Very few people have opted for a data plan. Almost one-fifth of the customers have
churned irrespective of whether they have a data plan or not. There isn't any significant difference between
churn and no churn here.
# Creating a copy of the original data frame
df = cell_df.copy()
df.head()
X = df.drop('Churn',axis=1)
Y = df.pop('Churn')
Y.value_counts()
0.0 2850
1.0 483
Name: Churn, dtype: int64
X_train,X_test,Y_train,Y_test = model_selection.train_test_split(X,Y,test_size=0.30,random_
print('Number of rows and columns of the training set for the independent variables:',X_tr
print('Number of rows and columns of the training set for the dependent variable:',Y_train
print('Number of rows and columns of the test set for the independent variables:',X_test.s
print('Number of rows and columns of the test set for the dependent variable:',Y_test.shap
Number of rows and columns of the training set for the independent variables: (2333,
Number of rows and columns of the training set for the dependent variable: (2333,)
Number of rows and columns of the test set for the independent variables: (1000, 10)
Number of rows and columns of the test set for the dependent variable: (1000,)
from sklearn.preprocessing import StandardScaler
stand_scal = StandardScaler()
X_train = stand_scal.fit_transform(X_train)
X_test = stand_scal.transform(X_test)
LDA Model
#Build LDA Model
clf = LinearDiscriminantAnalysis()
model=clf.fit(X_train,Y_train)
Prediction
# Training Data Class Prediction with a cut-off value of 0.5
pred_class_train = model.predict(X_train)
# Test Data Class Prediction with a cut-off value of 0.5
pred_class_test = model.predict(X_test)
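The class predictions above use the default cut-off of 0.5 on the predicted probability of the positive class. A minimal sketch of applying a different cut-off is given below. The helper name classify_with_cutoff, the toy probabilities and the alternative cut-off of 0.3 are all hypothetical and chosen only for illustration.

```python
import numpy as np

def classify_with_cutoff(prob_positive, cutoff=0.5):
    """Label an observation 1 when its positive-class probability exceeds the cut-off."""
    return (np.asarray(prob_positive) > cutoff).astype(int)

# Hypothetical probabilities of class 1, only for illustration
toy_probs = np.array([0.05, 0.30, 0.45, 0.55, 0.90])

default_labels = classify_with_cutoff(toy_probs)            # cut-off 0.5
lower_cutoff_labels = classify_with_cutoff(toy_probs, 0.3)  # flags more observations as class 1

print(default_labels)
print(lower_cutoff_labels)
```

With the fitted model, the input would be model.predict_proba(X_test)[:, 1]. Lowering the cut-off generally raises recall for the churn class at the cost of precision.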
Training Data and Test Data Confusion Matrix Comparison
from sklearn.metrics import ConfusionMatrixDisplay
## Confusion matrix on the training data
cm = confusion_matrix(Y_train, pred_class_train)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
## Confusion matrix on the test data
cm = confusion_matrix(Y_test, pred_class_test)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
print('Classification Report of the training data:\n\n',metrics.classification_report(Y_tr
print('Classification Report of the test data:\n\n',metrics.classification_report(Y_test,p
Inferences
Note:
Precision: tells us how many predictions are actually positive
out of all the observations predicted as positive.
Recall: tells us how many observations of the positive class are
predicted as positive, out of all the actual positives.
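The two definitions in the note can be worked through by hand on a small example. The sketch below counts true positives, false positives and false negatives for hypothetical toy labels (not the notebook's predictions) and derives precision and recall from those counts. The helper name precision_recall is introduced here for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the given positive class from raw labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # predicted positive, truly positive
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # predicted positive, truly negative
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # missed positives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical toy labels: 4 actual positives, 3 predicted positives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here 2 of the 3 positive predictions are correct, and 2 of the 4 actual positives are found. The per-class rows printed by classification_report follow the same definitions.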
Inferences using the default cut-off value of 0.5 for the test data
Precision (89%) – Of all the customers predicted not to churn, 89% actually did not churn.
Recall (95%) – Of all the customers who actually did not churn, 95% have been correctly
predicted as not churning.
Precision (48%) – Of all the customers predicted to churn, 48% actually did churn.
Recall (28%) – Of all the customers who actually did churn, 28% have been correctly
predicted as churning.
Accuracy, AUC, precision and recall for the test data are almost in line with the training data. This
suggests that the model is not overfitting and generalises consistently, although the low recall
for churners (28%) at the 0.5 cut-off limits how well it identifies customers who will churn.
# Training Data Probability Prediction
pred_prob_train = model.predict_proba(X_train)
# Test Data Probability Prediction
pred_prob_test = model.predict_proba(X_test)
# AUC and ROC for the training data
# calculate AUC
auc = metrics.roc_auc_score(Y_train,pred_prob_train[:,1])
print('AUC for the Training Data: %.3f' % auc)
# calculate roc curve
fpr, tpr, thresholds = metrics.roc_curve(Y_train,pred_prob_train[:,1])
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.',label = 'Training Data')
# AUC and ROC for the test data
# calculate AUC
auc = metrics.roc_auc_score(Y_test,pred_prob_test[:,1])
print('AUC for the Test Data: %.3f' % auc)
# calculate roc curve
fpr, tpr, thresholds = metrics.roc_curve(Y_test,pred_prob_test[:,1])
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.',label='Test Data')
# show the plot
plt.legend(loc='best')
plt.show()
pred_prob_train[:,1]
#intercept value
clf.intercept_
array([-2.37504846])
#coefficients for the Linear Discriminant Function
clf.coef_
X.columns
a=clf.coef_
np.round(a,2) # coefficients rounded to 2 decimal places
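The intercept and coefficients define the linear discriminant score for the two-class case: score = intercept + sum of (coefficient x scaled feature value). The sketch below, on synthetic data that stands in for the scaled training set, checks that this manual score matches scikit-learn's decision_function and that passing it through the logistic function reproduces predict_proba for class 1. All variable names and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data, only for illustration (not the Cellphone data)
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=0.0, size=(100, 3)),  # class 0
    rng.normal(loc=1.0, size=(100, 3)),  # class 1
])
y_toy = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X_toy, y_toy)

# Discriminant score computed by hand from the fitted intercept and coefficients
manual_score = X_toy @ lda.coef_.ravel() + lda.intercept_[0]
# In the two-class case the probability of class 1 is the logistic function of the score
manual_prob = 1.0 / (1.0 + np.exp(-manual_score))

print(np.allclose(manual_score, lda.decision_function(X_toy)))
print(np.allclose(manual_prob, lda.predict_proba(X_toy)[:, 1]))
```

A positive score corresponds to a class 1 probability above 0.5, which is why the sign of each coefficient indicates whether a larger value of that scaled feature pushes a customer towards churn or away from it.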
lda_model = LinearDiscriminantAnalysis(n_components = 1)# as only two classes are there fo
X_train_lda = lda_model.fit_transform(X_train, Y_train)
X_test_lda = lda_model.transform(X_test)
print(X_train_lda.shape)
print(X_test_lda.shape)
(2333, 1)
(1000, 1)
The output above confirms we only have 1 feature for all the records in our training and test
sets.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_lda, Y_train)
y_pred = model.predict(X_test_lda)
The output below shows that with only a single feature, our machine learning model achieves an
accuracy of 85%, which is the same as the accuracy achieved using all the features.
print('Classification Report of the test data:\n\n',metrics.classification_report(Y_test,y_