QWE Case Study

Retail & Supply Chain Analytics

QWE Inc. helped small- and medium-size businesses manage their online presence through a
subscription service. As with many successful dot-com start-ups, QWE grew quickly at first,
but as the company and its business model matured, management realized it needed deeper
analytical insight into some key business processes, one of which was customer retention.
A churn analysis was therefore performed on the dataset to estimate the likelihood of
customers leaving before Feb. 2012.

Details of the dataset are given below:


> str(churn)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 6347 obs. of 12 variables:
$ Customer Age (in months) : num 67 67 55 63 57 58 57 46 56 56 ...
$ Churn (1 = Yes, 0 = No) : num 0 0 0 0 0 0 0 0 0 0 ...
$ CHI Score Month 0 : num 0 62 0 231 43 138 180 116 78 78 ...
$ CHI Score 0-1 : num 0 4 0 1 -1 -10 -5 -11 -7 -37 ...
$ Support Cases Month 0 : num 0 0 0 1 0 0 1 0 1 0 ...
$ Support Cases 0-1 : num 0 0 0 -1 0 0 1 0 -2 0 ...
$ SP Month 0 : num 0 0 0 3 0 0 3 0 3 0 ...
$ SP 0-1 : num 0 0 0 0 0 0 3 0 0 0 ...
$ Logins 0-1 : num 0 0 0 167 0 43 13 0 -9 -7 ...
$ Blog Articles 0-1 : num 0 0 0 -8 0 0 -1 0 1 0 ...
$ Views 0-1 : num 0 -16 0 21996 9 ...
$ Days Since Last Login 0-1: num 31 31 31 0 31 0 0 6 7 14 ...

After cleaning, the data contains no missing values, as depicted by the Missingness Map below:
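The cleaning step itself is not shown in the source. As a minimal sketch of how the "no missing values" claim could be checked numerically, the snippet below counts `NA`s per column on a made-up toy data frame. The column names and the use of `na.omit` are assumptions for illustration, not the author's procedure.

```r
# Toy data frame standing in for the churn data (values are made up).
toy <- data.frame(
  age   = c(67, 55, NA, 46),
  churn = c(0, 0, 1, 0),
  chi   = c(0, 62, 231, NA)
)

# Count missing values per column before cleaning.
na_before <- colSums(is.na(toy))

# One simple (assumed) cleaning choice: drop every row with a missing value.
toy_clean <- na.omit(toy)
na_after  <- colSums(is.na(toy_clean))

print(na_before)
print(na_after)
```

A column-wise count of zero everywhere is the numeric counterpart of an empty missingness map.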

count_percent
        0         1
94.910982  5.089018

Percentage of customers by churn status (Blue = 1, Red = 0)
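How `count_percent` was computed is not shown. One common way to get class percentages like these, sketched here on a made-up vector (`churn_flag` is a hypothetical name), is `prop.table` over `table`:

```r
# Made-up churn flags: 19 customers stayed (0), 1 churned (1).
churn_flag <- c(rep(0, 19), 1)

# Share of each class, expressed in percent.
count_percent <- prop.table(table(churn_flag)) * 100
print(count_percent)
```

With roughly 5% churners, as in the output above, the classes are strongly imbalanced, which is what motivates the sampling step later in the report.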

1. Without any statistical test, and judging primarily from the boxplot, customers aged
between 5 and 20 months appear not to churn, while customers older than 10 months may churn.
Because these two ranges overlap, no firm conclusion can be drawn from the boxplot alone.

2. A binomial logistic regression was fit with churn status as the dependent variable and
all other variables as independent variables. The results are shown below.

> logit_model <- glm(trainData$`Churn (1 = Yes, 0 = No)` ~.,data=trainData, family = "binomial")
> summary(logit_model)

Call:
glm(formula = trainData$`Churn (1 = Yes, 0 = No)` ~ ., family = "binomial",
data = trainData)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0289  -0.3630  -0.2921  -0.2272   2.9535

Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
(Intercept)                 -2.694e+00  1.265e-01 -21.304  < 2e-16 ***
`Customer Age (in months)`   6.355e-03  6.706e-03   0.948 0.343293
`CHI Score Month 0`         -4.562e-03  1.475e-03  -3.093 0.001979 **
`CHI Score 0-1`             -1.268e-02  3.039e-03  -4.173 3.01e-05 ***
`Support Cases Month 0`     -1.416e-01  1.211e-01  -1.169 0.242529
`Support Cases 0-1`          1.427e-01  1.000e-01   1.427 0.153638
`SP Month 0`                 2.621e-02  1.202e-01   0.218 0.827399
`SP 0-1`                    -1.386e-02  9.093e-02  -0.152 0.878878
`Logins 0-1`                 1.351e-04  2.452e-03   0.055 0.956050
`Blog Articles 0-1`         -1.993e-02  3.098e-02  -0.643 0.520090
`Views 0-1`                 -1.124e-04  4.376e-05  -2.569 0.010211 *
`Days Since Last Login 0-1`  1.874e-02  5.328e-03   3.518 0.000434 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1786.6  on 4442  degrees of freedom
Residual deviance: 1701.4  on 4431  degrees of freedom
AIC: 1725.4

Number of Fisher Scoring iterations: 6

Predicted churn probabilities (in percent) for customer IDs 672, 354 and 5203 are shown below:


> pred_672*100
1
4.986717
> pred_354*100
1
6.793391
> pred_5203*100
1
3.297767

All three customers have a low predicted probability of leaving, and none of them actually left.
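The code that produced `pred_672` and the other predictions is not included. The sketch below shows, on simulated data, how a per-customer churn probability can be obtained from a fitted `glm` with `predict(..., type = "response")`, and that it equals the inverse logit of the linear predictor. The data frame `toy`, its column names and the simulation parameters are all assumptions for illustration.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the training data (not the QWE dataset).
toy <- data.frame(
  chi_score        = runif(n, 0, 200),
  days_since_login = sample(0:31, n, replace = TRUE)
)
lin <- -2 - 0.01 * toy$chi_score + 0.05 * toy$days_since_login
toy$churn <- rbinom(n, 1, plogis(lin))

# Binomial logistic regression on all remaining columns.
fit <- glm(churn ~ ., data = toy, family = "binomial")

# Predicted churn probability for one customer (row 42 is an arbitrary pick).
one_customer <- toy[42, ]
p_predict <- predict(fit, newdata = one_customer, type = "response")

# The same value by hand: inverse logit of intercept + sum(coefficient * value).
eta <- sum(coef(fit) * c(1, one_customer$chi_score, one_customer$days_since_login))
p_manual <- plogis(eta)

print(unname(p_predict) * 100)  # probability in percent
```

Multiplying by 100, as in the transcript above, simply expresses the probability as a percentage.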

3. The single regression model above does not seem robust because it includes several
statistically insignificant variables. After dropping these variables, a second model was
fit; the results are shown below:
> logit_model_revised <- glm(`Churn (1 = Yes, 0 = No)` ~ `CHI Score Month 0` +
+ `CHI Score 0-1`+`Views 0-1` + `Days Since Last Login 0-1`,data = trainData,family = "binomial")
> summary(logit_model_revised)
Call:
glm(formula = `Churn (1 = Yes, 0 = No)` ~ `CHI Score Month 0` +
`CHI Score 0-1` + `Views 0-1` + `Days Since Last Login 0-1`,
family = "binomial", data = trainData)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9716  -0.3686  -0.2895  -0.2327   3.0019

Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
(Intercept)                 -2.655e+00  1.144e-01 -23.217  < 2e-16 ***
`CHI Score Month 0`         -4.694e-03  1.170e-03  -4.013 5.98e-05 ***
`CHI Score 0-1`             -1.285e-02  2.665e-03  -4.823 1.41e-06 ***
`Views 0-1`                 -1.190e-04  4.266e-05  -2.790  0.00527 **
`Days Since Last Login 0-1`  2.091e-02  5.224e-03   4.002 6.28e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1786.6  on 4442  degrees of freedom
Residual deviance: 1708.1  on 4438  degrees of freedom
AIC: 1718.1

Number of Fisher Scoring iterations: 6

The AIC of this model (1718.1) is lower than that of the earlier model (1725.4), but the
difference is small. The residual deviance (1708.1) and its degrees of freedom (4438) are also
very different, so the model is not a great fit. Based on this model, the predicted values for
IDs 672, 354 and 5203 are shown below.
> pred_672_revised*100
1
3.452503
> pred_354_revised*100
1
4.811735
> pred_5203_revised*100
1
4.166317
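The AIC figures in the two summaries can be reproduced from the reported residual deviances: for a 0/1 response, AIC equals the residual deviance plus twice the number of estimated coefficients. The sketch below does that arithmetic. It also runs a likelihood-ratio comparison of the two nested models, which is an addition for illustration and not part of the original analysis.

```r
# Residual deviances and coefficient counts read off the two summaries.
dev_full    <- 1701.4; k_full    <- 12  # intercept + 11 predictors
dev_reduced <- 1708.1; k_reduced <- 5   # intercept + 4 predictors

# For ungrouped binary data: AIC = residual deviance + 2 * (number of coefficients).
aic_full    <- dev_full + 2 * k_full
aic_reduced <- dev_reduced + 2 * k_reduced

# Likelihood-ratio statistic for dropping the 7 predictors (not in the source).
lr_stat <- dev_reduced - dev_full
lr_df   <- k_full - k_reduced
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

print(c(aic_full = aic_full, aic_reduced = aic_reduced))
print(c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value))
```

A large p-value here would indicate that the seven dropped predictors add little explanatory power, which is consistent with the small change in AIC.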

With this model too, the three IDs in question have a very low probability of leaving. All
factors in this model are significant, which implies that CHI Score Month 0, CHI Score 0-1,
Views 0-1 and Days Since Last Login 0-1 contribute the most to the predicted probability that
a customer will churn.
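To read the direction and per-unit size of each effect, the coefficients of the revised model can be exponentiated into odds ratios. This is a minimal sketch using the estimates copied from the summary above. The odds-ratio reading is a standard interpretation of logistic regression coefficients, not a step shown in the source.

```r
# Estimates copied from the revised model summary.
coefs <- c(
  "CHI Score Month 0"         = -4.694e-03,
  "CHI Score 0-1"             = -1.285e-02,
  "Views 0-1"                 = -1.190e-04,
  "Days Since Last Login 0-1" =  2.091e-02
)

# exp(coefficient) is the multiplicative change in the odds of churn per one-unit increase.
odds_ratios <- exp(coefs)

# The same effect expressed as a percentage change in the odds.
pct_change <- (odds_ratios - 1) * 100
print(round(pct_change, 3))
```

Negative coefficients (the CHI scores and views) lower the odds of churn, while each extra day since the last login raises them. Note that these are per-unit effects, so their magnitudes depend on the scale of each variable.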

In addition, because only a small percentage of customers in the data churned, the
splitstackshape package was used to perform balanced sampling and create a stratified sample.
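The exact splitstackshape call is not shown in the source. As a base-R stand-in, the sketch below draws the same fraction of rows from each churn class, which is one reading of "stratified sample". The helper `stratified_sample`, the toy data and the 80% fraction are all hypothetical.

```r
set.seed(42)

# Made-up data: 950 customers who stayed (0) and 50 who churned (1).
toy <- data.frame(
  id    = 1:1000,
  churn = c(rep(0, 950), rep(1, 50))
)

# Hypothetical helper: sample the same fraction of rows within each group.
stratified_sample <- function(df, group_col, frac) {
  groups <- split(df, df[[group_col]])
  sampled <- lapply(groups, function(g) {
    n_take <- round(nrow(g) * frac)
    g[sample(nrow(g), n_take), , drop = FALSE]
  })
  do.call(rbind, sampled)
}

strat <- stratified_sample(toy, "churn", 0.8)

# Class percentages are preserved in the sample.
print(prop.table(table(strat$churn)) * 100)
```

Sampling a fixed number of rows per class instead of a fixed fraction would give a class-balanced sample. The source does not say which of the two was used.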

The model was refit on this sample; the summary is shown below:

> logit_model_balanced_sign <- glm(`Churn (1 = Yes, 0 = No)` ~ `CHI Score Month 0` +
+ `CHI Score 0-1`+`Views 0-1` + `Days Since Last Login 0-1`,data = trainData,family = "binomial")
> summary(logit_model_balanced_sign)

Call:
glm(formula = `Churn (1 = Yes, 0 = No)` ~ `CHI Score Month 0` +
`CHI Score 0-1` + `Views 0-1` + `Days Since Last Login 0-1`,
family = "binomial", data = trainData)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.8308  -0.3857  -0.2930  -0.2401   3.0034

Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
(Intercept)                 -2.522e+00  1.313e-01 -19.213  < 2e-16 ***
`CHI Score Month 0`         -5.641e-03  1.316e-03  -4.286 1.82e-05 ***
`CHI Score 0-1`             -7.975e-03  2.841e-03  -2.807   0.0050 **
`Views 0-1`                 -1.070e-04  5.975e-05  -1.791   0.0732 .
`Days Since Last Login 0-1`  2.148e-02  5.336e-03   4.025 5.70e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1519.1  on 3624  degrees of freedom
Residual deviance: 1452.4  on 3620  degrees of freedom
AIC: 1462.4

Number of Fisher Scoring iterations: 6

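Since the AIC has decreased by a good margin (1462.4 versus 1718.1), this model seems to be the
best fit, and balanced sampling seems to be the better option. Note, however, that this model
was fit on 3,625 observations rather than 4,443, and AIC values are strictly comparable only
between models fit to the same data.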
