QWE Case Study

Retail & Supply Chain Analytics
QWE Inc. helped small and medium-sized businesses manage their online presence through a
subscription service. Like many successful dot-com start-ups, QWE grew quickly at first, but
as the company and its business model matured, management realized it needed deeper
analytical insight into some key business processes. One of these was customer retention, so
a churn analysis was performed on the dataset to estimate the likelihood of customers
leaving before Feb. 2012.
Details of the dataset are given below:

> str(churn)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 6347 obs. of 12 variables:
$ Customer Age (in months) : num 67 67 55 63 57 58 57 46 56 56 ...
$ Churn (1 = Yes, 0 = No) : num 0 0 0 0 0 0 0 0 0 0 ...
$ CHI Score Month 0 : num 0 62 0 231 43 138 180 116 78 78 ...
$ CHI Score 0-1 : num 0 4 0 1 -1 -10 -5 -11 -7 -37 ...
$ Support Cases Month 0 : num 0 0 0 1 0 0 1 0 1 0 ...
$ Support Cases 0-1 : num 0 0 0 -1 0 0 1 0 -2 0 ...
$ SP Month 0 : num 0 0 0 3 0 0 3 0 3 0 ...
$ SP 0-1 : num 0 0 0 0 0 0 3 0 0 0 ...
$ Logins 0-1 : num 0 0 0 167 0 43 13 0 -9 -7 ...
$ Blog Articles 0-1 : num 0 0 0 -8 0 0 -1 0 1 0 ...
$ Views 0-1 : num 0 -16 0 21996 9 ...
$ Days Since Last Login 0-1: num 31 31 31 0 31 0 0 6 7 14 ...
After cleaning, the data contain no missing values, as depicted by the missingness map.
count_percent
        0         1
94.910982  5.089018
Percentage of customers by churn status (blue = 1, red = 0)
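The class percentages above can be reproduced with base R along the following lines. This is a minimal sketch on a small synthetic vector, not the code used for the report; `churn_flag` is a hypothetical stand-in for the `Churn (1 = Yes, 0 = No)` column.

```r
# Hypothetical stand-in for the churn column: 19 non-churners and 1 churner.
churn_flag <- c(rep(0, 19), 1)

# Counts per class, converted to percentages of all customers.
count_percent <- prop.table(table(churn_flag)) * 100
print(count_percent)
```

With only about 5% of customers churning, the classes are heavily imbalanced, which matters for the sampling step later in the analysis.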
1. Without any statistical test, and going by the boxplot alone, customers aged 5-20 months
appear unlikely to churn, while customers older than 10 months may churn. Because these two
ranges overlap, no reliable conclusion can be drawn from the boxplot.
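The overlap argument can be checked numerically rather than by eye. The sketch below uses made-up ages (the vectors `age` and `churn` are hypothetical, not the QWE data) and compares the interquartile boxes of the two groups.

```r
# Synthetic customer ages in months (hypothetical, for illustration only).
age   <- c(3, 8, 12, 15, 20, 25, 40, 9, 11, 14, 16, 18)
churn <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# Box statistics per group; plot = FALSE returns the numbers without drawing.
bp <- boxplot(age ~ churn, plot = FALSE)
iqr <- bp$stats[c(2, 4), ]   # rows: lower and upper hinge; columns: churn = 0, 1
colnames(iqr) <- bp$names

# The boxes overlap when each box starts below the point where the other ends.
boxes_overlap <- iqr[1, "0"] <= iqr[2, "1"] && iqr[1, "1"] <= iqr[2, "0"]
print(iqr)
print(boxes_overlap)
```

When the boxes overlap like this, customer age alone cannot separate churners from non-churners, which is why a regression model is fit next.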
2. A binomial logistic regression was fit with churn status as the dependent variable and all
other variables as independent variables. The results are shown below.
> logit_model <- glm(trainData$`Churn (1 = Yes, 0 = No)` ~ ., data = trainData, family = "binomial")
> summary(logit_model)
Call:
glm(formula = trainData$`Churn (1 = Yes, 0 = No)` ~ ., family = "binomial",
data = trainData)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0289 -0.3630 -0.2921 -0.2272 2.9535
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.694e+00 1.265e-01 -21.304 < 2e-16 ***
`Customer Age (in months)` 6.355e-03 6.706e-03 0.948 0.343293
`CHI Score Month 0` -4.562e-03 1.475e-03 -3.093 0.001979 **
`CHI Score 0-1` -1.268e-02 3.039e-03 -4.173 3.01e-05 ***
`Support Cases Month 0` -1.416e-01 1.211e-01 -1.169 0.242529
`Support Cases 0-1` 1.427e-01 1.000e-01 1.427 0.153638
`SP Month 0` 2.621e-02 1.202e-01 0.218 0.827399
`SP 0-1` -1.386e-02 9.093e-02 -0.152 0.878878
`Logins 0-1` 1.351e-04 2.452e-03 0.055 0.956050
`Blog Articles 0-1` -1.993e-02 3.098e-02 -0.643 0.520090
`Views 0-1` -1.124e-04 4.376e-05 -2.569 0.010211 *
`Days Since Last Login 0-1` 1.874e-02 5.328e-03 3.518 0.000434 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1786.6 on 4442 degrees of freedom
Residual deviance: 1701.4 on 4431 degrees of freedom
AIC: 1725.4
Number of Fisher Scoring iterations: 6
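To make the coefficients easier to read, they can be turned into odds ratios, and the reported AIC can be checked against the residual deviance. The sketch below only reuses numbers printed in the summary above; the variable names `beta`, `odds_ratio` and `aic_check` are illustrative.

```r
# Coefficients of the three most significant predictors, copied from the summary.
beta <- c(
  "CHI Score Month 0"         = -4.562e-03,
  "CHI Score 0-1"             = -1.268e-02,
  "Days Since Last Login 0-1" =  1.874e-02
)

# exp(beta) is the multiplicative change in the odds of churn per one-unit increase.
odds_ratio <- exp(beta)
print(round(odds_ratio, 4))

# For a 0/1 response, AIC = residual deviance + 2 * number of coefficients
# (here the intercept plus 11 predictors).
aic_check <- 1701.4 + 2 * 12
print(aic_check)
```

Each additional day since the last login raises the odds of churn by roughly 1.9%, while higher CHI scores lower them. The AIC check reproduces the reported 1725.4.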
Predicted churn probabilities (in percent) for customer IDs 672, 354 and 5203 are given below:

> pred_672*100
1
4.986717
> pred_354*100
1
6.793391
> pred_5203*100
1
3.297767
All three customers have a low predicted probability of leaving, and none of them actually churned.
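The report does not show how objects such as `pred_672` were created. One plausible way is `predict()` with `type = "response"`, which returns a probability rather than log-odds. The sketch below illustrates this on synthetic data; `train`, `fit` and `new_customer` are hypothetical names and the data are simulated.

```r
set.seed(1)
# Synthetic training data (hypothetical): one predictor and a 0/1 outcome.
train <- data.frame(days_since_login = rpois(200, 10))
train$churn <- rbinom(200, 1, plogis(-3 + 0.1 * train$days_since_login))

fit <- glm(churn ~ days_since_login, data = train, family = "binomial")

# Predicted churn probability for one hypothetical customer, in percent.
new_customer <- data.frame(days_since_login = 31)
pred <- predict(fit, newdata = new_customer, type = "response")
print(pred * 100)

# The same number by hand: inverse logit of the linear predictor.
by_hand <- plogis(sum(coef(fit) * c(1, 31)))
print(by_hand * 100)
```

Without `type = "response"`, `predict()` on a `glm` object returns values on the log-odds scale, which would not be directly readable as percentages.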
3. A single regression model with all variables does not seem robust, because several of the
predictors are not statistically significant. After dropping them, a second model was fit on
the four significant predictors, with the results below:
> logit_model_revised <- glm(`Churn (1 = Yes, 0 = No)` ~ `CHI Score Month 0` +
+ `CHI Score 0-1` + `Views 0-1` + `Days Since Last Login 0-1`,
+ data = trainData, family = "binomial")
> summary(logit_model_revised)
Call:
glm(formula = `Churn (1 = Yes, 0 = No)` ~ `CHI Score Month 0` +
`CHI Score 0-1` + `Views 0-1` + `Days Since Last Login 0-1`,
family = "binomial", data = trainData)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9716 -0.3686 -0.2895 -0.2327 3.0019
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.655e+00 1.144e-01 -23.217 < 2e-16 ***
`CHI Score Month 0` -4.694e-03 1.170e-03 -4.013 5.98e-05 ***
`CHI Score 0-1` -1.285e-02 2.665e-03 -4.823 1.41e-06 ***
`Views 0-1` -1.190e-04 4.266e-05 -2.790 0.00527 **
`Days Since Last Login 0-1` 2.091e-02 5.224e-03 4.002 6.28e-05 ***
---
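Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1786.6 on 4442 degrees of freedom
AIC: 1718.1
Number of Fisher Scoring iterations: 6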
The AIC of this model (1718.1) is lower than that of the earlier model (1725.4), but the
difference is small, and the residual deviance still differs considerably from the residual
degrees of freedom, so the model is not a great fit. Based on this model, the predicted churn
probabilities (in percent) for customers 672, 354 and 5203 are obtained as below.
> pred_672_revised*100
1
3.452503
1
4.811735
1
4.166317
Under this model too, the three customers in question have a very low probability of leaving.
All predictors in this model are significant, which implies that CHI Score Month 0, CHI
Score 0-1, Views 0-1 and Days Since Last Login 0-1 contribute the most to the predicted
probability that a customer will churn.
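The comparison between the full and the reduced model can be made more formal with a likelihood-ratio test. The sketch below works only from the two reported AIC values and assumes both models were fit to the same training rows, which the shared `trainData` suggests. Because the AIC values are rounded, the statistic is approximate.

```r
# Values taken from the two model summaries.
aic_full    <- 1725.4; k_full    <- 12  # intercept + 11 predictors
aic_reduced <- 1718.1; k_reduced <- 5   # intercept + 4 predictors

# For a 0/1 response, residual deviance = AIC - 2 * number of coefficients.
dev_full    <- aic_full    - 2 * k_full
dev_reduced <- aic_reduced - 2 * k_reduced

# Likelihood-ratio statistic and its chi-squared p-value.
lr_stat <- dev_reduced - dev_full
lr_df   <- k_full - k_reduced
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p_value = p_value))
```

The recovered full-model deviance matches the reported 1701.4, and the p-value is well above 0.05, so dropping the seven insignificant predictors does not significantly worsen the fit. With the fitted objects at hand, `anova(logit_model_revised, logit_model, test = "Chisq")` would run the same test directly.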
In addition, because only a small percentage of the customers in the data churned, a
stratified sample was drawn with the splitstackshape package to perform balanced sampling.
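The sampling code itself is not shown in the report. As a rough illustration of stratified sampling, the sketch below draws the same fraction of rows from each churn class using base R only, so it runs without splitstackshape. The data, the names `customers` and `strat_sample`, and the sampling fraction are all assumptions. This variant keeps the class proportions; it does not up-sample or down-sample either class.

```r
set.seed(42)
# Synthetic data (hypothetical): 1000 customers, roughly 5% churners.
customers <- data.frame(id = 1:1000, churn = rbinom(1000, 1, 0.05))

# Draw the same fraction of rows from each churn class.
sample_fraction <- 0.8  # assumed value; the source does not state the fraction
rows_by_class <- split(seq_len(nrow(customers)), customers$churn)
picked <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(sample_fraction * length(rows)))]
}))
strat_sample <- customers[picked, ]

# Class shares in the full data and in the sample agree up to rounding.
print(prop.table(table(customers$churn)))
print(prop.table(table(strat_sample$churn)))
```

The `stratified()` function in splitstackshape offers a comparable operation on a grouping column, with the size given either as a count or as a proportion per group.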
The model was refit on this sample, and the summary is shown below:
> logit_model_balanced_sign <- glm(`Churn (1 = Yes, 0 = No)` ~ `CHI Score Month 0` +
+ `CHI Score 0-1` + `Views 0-1` + `Days Since Last Login 0-1`,
+ data = trainData, family = "binomial")
> summary(logit_model_balanced_sign)
Call:
glm(formula = `Churn (1 = Yes, 0 = No)` ~ `CHI Score Month 0` +
`CHI Score 0-1` + `Views 0-1` + `Days Since Last Login 0-1`,
family = "binomial", data = trainData)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8308 -0.3857 -0.2930 -0.2401 3.0034
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.522e+00 1.313e-01 -19.213 < 2e-16 ***
`CHI Score Month 0` -5.641e-03 1.316e-03 -4.286 1.82e-05 ***
`CHI Score 0-1` -7.975e-03 2.841e-03 -2.807 0.0050 **
`Views 0-1` -1.070e-04 5.975e-05 -1.791 0.0732 .
`Days Since Last Login 0-1` 2.148e-02 5.336e-03 4.025 5.70e-05 ***
---
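Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1519.1 on 3624 degrees of freedom
AIC: 1462.4
Number of Fisher Scoring iterations: 6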
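The AIC has dropped by a good margin (from 1718.1 to 1462.4), so this model appears to be the
best fit and balanced sampling appears to be the better option. Note, however, that this model
was fit on 3625 observations rather than 4443, and AIC values are only directly comparable
between models fit to the same data.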