QWE Case Study
QWE Inc. helped small- and medium-size businesses manage their online presence through a
subscription service. As with many successful dot-com start-ups, QWE experienced fast growth
initially but, as the company and its business model matured, management realized the need for
deeper analytical insight into some key business processes, one of which was customer retention.
A churn analysis was therefore performed on the dataset to estimate the likelihood of customers
leaving before Feb. 2012.
After cleaning, the data contains no empty values, as depicted by the Missingness Map below:
Retail & Supply Chain Analytics
Share of customers by churn status, in percent (0 = No, 1 = Yes):
count_percent
        0         1
94.910982  5.089018
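The class shares above can be computed along the following lines. This is a minimal sketch, not the report's actual code: the `churn` vector is a small hypothetical stand-in for the `Churn (1 = Yes, 0 = No)` column, and only base R functions are used.

```r
# Hypothetical churn flags standing in for the `Churn (1 = Yes, 0 = No)` column.
churn <- c(0, 0, 0, 1, 0, 0, 0, 0, 1, 0)

# Count each class, convert the counts to proportions, then scale to percent.
count_percent <- prop.table(table(churn)) * 100
print(count_percent)
```

A heavily skewed split such as the one reported here is worth checking before modelling, because it affects how predicted probabilities and fit statistics should be read.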
1. Without any statistical test, and based primarily on the boxplot, customers aged between
5 and 20 months appear unlikely to churn, while customers older than 10 months may churn.
Because these two ranges overlap in the graphs, no firm conclusion can be drawn from the
boxplot alone.
2. A binomial logistic regression was fit to the data, with churn status as the dependent
variable and all other variables as independent variables. The results obtained are shown
below.
Call:
glm(formula = trainData$`Churn (1 = Yes, 0 = No)` ~ ., family = "binomial",
data = trainData)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0289  -0.3630  -0.2921  -0.2272   2.9535
Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
(Intercept)                 -2.694e+00  1.265e-01 -21.304  < 2e-16 ***
`Customer Age (in months)`   6.355e-03  6.706e-03   0.948 0.343293
`CHI Score Month 0`         -4.562e-03  1.475e-03  -3.093 0.001979 **
`CHI Score 0-1`             -1.268e-02  3.039e-03  -4.173 3.01e-05 ***
`Support Cases Month 0`     -1.416e-01  1.211e-01  -1.169 0.242529
`Support Cases 0-1`          1.427e-01  1.000e-01   1.427 0.153638
`SP Month 0`                 2.621e-02  1.202e-01   0.218 0.827399
`SP 0-1`                    -1.386e-02  9.093e-02  -0.152 0.878878
`Logins 0-1`                 1.351e-04  2.452e-03   0.055 0.956050
`Blog Articles 0-1`         -1.993e-02  3.098e-02  -0.643 0.520090
`Views 0-1`                 -1.124e-04  4.376e-05  -2.569 0.010211 *
`Days Since Last Login 0-1`  1.874e-02  5.328e-03   3.518 0.000434 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
All three customer IDs in question (672, 354 and 5203) have a low predicted probability of
leaving, and in the actual data none of them left.
3. The single full regression model does not seem robust, because several of its variables are
statistically insignificant. After dropping these variables, another model was fit; the results
are shown below:
> logit_model_revised <- glm(`Churn (1 = Yes, 0 = No)` ~ `CHI Score Month 0` +
+ `CHI Score 0-1` + `Views 0-1` + `Days Since Last Login 0-1`,
+ data = trainData, family = "binomial")
> summary(logit_model_revised)
Call:
glm(formula = `Churn (1 = Yes, 0 = No)` ~ `CHI Score Month 0` +
`CHI Score 0-1` + `Views 0-1` + `Days Since Last Login 0-1`,
family = "binomial", data = trainData)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9716  -0.3686  -0.2895  -0.2327   3.0019
Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
(Intercept)                 -2.655e+00  1.144e-01 -23.217  < 2e-16 ***
`CHI Score Month 0`         -4.694e-03  1.170e-03  -4.013 5.98e-05 ***
`CHI Score 0-1`             -1.285e-02  2.665e-03  -4.823 1.41e-06 ***
`Views 0-1`                 -1.190e-04  4.266e-05  -2.790  0.00527 **
`Days Since Last Login 0-1`  2.091e-02  5.224e-03   4.002 6.28e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The AIC of this model is lower than that of the earlier model, but only by a small margin, and
the residual deviance and the degrees of freedom still differ substantially, so the model is not
a great fit. Based on this model, the following predicted churn probabilities (in percent) are
obtained for customer IDs 672, 354 and 5203.
> pred_672_revised*100
1
3.452503
> pred_354_revised*100
1
4.811735
> pred_5203_revised*100
1
4.166317
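The comparison of the full and revised models, and the per-customer predictions, can be sketched as follows. The report's dataset is not reproduced here, so the code simulates a small stand-in data set; the column names (`chi_month0`, `days_since_login`, `noise`), the simulated coefficients, and the three rows used for prediction are all hypothetical. Only base R (`glm`, `AIC`, `predict`) is assumed.

```r
set.seed(1)
n <- 500

# Synthetic stand-in data; column names are simplified, hypothetical versions
# of the case-study variables, plus one pure-noise predictor.
sim <- data.frame(
  chi_month0       = runif(n, 0, 200),
  days_since_login = rnorm(n, 0, 10),
  noise            = rnorm(n)
)

# Simulate churn so that only chi_month0 and days_since_login matter.
eta <- -2 - 0.005 * sim$chi_month0 + 0.02 * sim$days_since_login
sim$churn <- rbinom(n, 1, plogis(eta))

# Full model (all predictors) versus a reduced model without the noise column.
full_model    <- glm(churn ~ ., data = sim, family = "binomial")
reduced_model <- glm(churn ~ chi_month0 + days_since_login,
                     data = sim, family = "binomial")

# Lower AIC is better; both models are fit to the same rows, so the values are comparable.
aic_table <- AIC(full_model, reduced_model)
print(aic_table)

# Predicted churn probability, in percent, for three hypothetical customers (rows 1 to 3).
pred_percent <- predict(reduced_model, newdata = sim[c(1, 2, 3), ],
                        type = "response") * 100
print(pred_percent)
```

Using `type = "response"` returns probabilities rather than log-odds, which matches the percent-scale values printed in the report.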
Under this model too, the three customer IDs in question have a very low probability of leaving.
All variables in this model are significant, which suggests that CHI Score Month 0, CHI Score 0-1,
Views 0-1 and Days Since Last Login 0-1 contribute the most to the predicted probability that a
customer will churn.
In addition, because only a small percentage of customers in the data churned, balanced sampling
was performed with the splitstackshape package and a stratified sample was created.
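One way to read the size of each effect is to convert the coefficients from the revised model summary above into odds ratios. The sketch below copies the four reported estimates by hand; the variable name `coefs` is illustrative only.

```r
# Coefficients copied from the revised model summary (log-odds scale).
coefs <- c(
  "CHI Score Month 0"         = -4.694e-03,
  "CHI Score 0-1"             = -1.285e-02,
  "Views 0-1"                 = -1.190e-04,
  "Days Since Last Login 0-1" =  2.091e-02
)

# Exponentiating a logit coefficient gives the multiplicative change in the
# odds of churn for a one-unit increase in that variable.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 4))
```

Values below 1 indicate lower odds of churn as the variable increases, and values above 1 indicate higher odds. Because the variables are measured on different scales, the per-unit odds ratios are not directly comparable with one another.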
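The idea of balanced sampling can be sketched as below. The report names the splitstackshape package, but its exact call and sample sizes are not shown, so this sketch swaps in a base R approach that downsamples the majority class; the `customers` data frame and its 95/5 split are hypothetical.

```r
set.seed(42)

# Hypothetical data: about 5% churners, mirroring the imbalance reported earlier.
customers <- data.frame(id = 1:1000, churn = rep(c(0, 1), times = c(950, 50)))

# Split row indices by class, then draw the same number of rows from each class.
rows_by_class <- split(seq_len(nrow(customers)), customers$churn)
n_per_class   <- min(lengths(rows_by_class))
picked <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), n_per_class)]
}))
balanced <- customers[picked, ]

# Each class now has the same number of rows.
print(table(balanced$churn))
```

Downsampling discards majority-class rows, so a model fit to `balanced` sees fewer observations than one fit to the full data.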
Call:
glm(formula = `Churn (1 = Yes, 0 = No)` ~ `CHI Score Month 0` +
`CHI Score 0-1` + `Views 0-1` + `Days Since Last Login 0-1`,
family = "binomial", data = trainData)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.8308  -0.3857  -0.2930  -0.2401   3.0034
Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
(Intercept)                 -2.522e+00  1.313e-01 -19.213  < 2e-16 ***
`CHI Score Month 0`         -5.641e-03  1.316e-03  -4.286 1.82e-05 ***
`CHI Score 0-1`             -7.975e-03  2.841e-03  -2.807   0.0050 **
`Views 0-1`                 -1.070e-04  5.975e-05  -1.791   0.0732 .
`Days Since Last Login 0-1`  2.148e-02  5.336e-03   4.025 5.70e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Since the AIC has decreased by a good margin, this model appears to be the best fit, and
balanced sampling appears to be the better option.
Code File :