
Introduction:

The Bookbinders Book Club (BBBC) is a distributor that sells specialty books through
direct marketing. Since it began operating, the company has built a database of its
customers that includes a variety of information about them. The database currently
contains 500,000 readers.

BBBC recently ran a direct mailing program to 20,000 of its customers in which, along with its
usual mailing, it included a specially produced brochure for the book The Art History of Florence.
The mailing program resulted in 1806 orders for the book. The company
wants to improve the efficacy of its direct mail program. To do that, BBBC will use
different predictive modeling approaches to understand which factors influenced
customers to buy the book.

This case study considers three different models: an RFM (Recency,
Frequency and Monetary Value) model, an Ordinary Linear Regression model, and a binary
logit model. The aim of the case study is to analyze the models individually and
understand which factors most influence the purchase of the book The Art History
of Florence.

To estimate our models, we first used a sample from the BBBC database composed
of 400 customers who purchased the book and 1200 customers who did not. Afterwards, to
assess the performance of the estimated models, we tested them on an
additional dataset of 2300 customers (the holdout sample). In the holdout sample, the
proportion of customers who ordered the book approximates that of the customer population as a
whole.

The variables used for the following analysis were:

Choice: Whether the customer purchased The Art History of Florence. 1 corresponds to a
purchase and 0 corresponds to a nonpurchase.

Gender: 0 = Female and 1 = Male.

Amount purchased: Total money spent on BBBC books.

Frequency: Total number of purchases in the chosen period (used as a proxy for frequency).

Last purchase (recency of purchase): Months since last purchase.

First purchase: Months since first purchase.

P_Child: Number of children’s books purchased.

P_Youth: Number of youth books purchased.

P_Cook: Number of cookbooks purchased.

P_DIY: Number of do-it-yourself books purchased.

P_Art: Number of art books purchased.

Q1/Q2 (Note: We decided that it made sense to answer both questions at the same time):

Ordinary Linear Regression Model

Ordinary Linear Regression can be used as a predictive model. The Linear Regression
model represents the relationship between a dependent variable (in this case the variable
Choice) and one or more independent variables (in this case Gender; Amount
purchased; Frequency; Last purchase; First purchase; P_Child; P_Youth; P_Cook; P_DIY;
and P_Art). The model can be written as: Choice = β0 + β1Gender + β2AmountPurchased +
β3Frequency + β4LastPurchase + β5FirstPurchase + β6P_Child + β7P_Youth + β8P_Cook +
β9P_DIY + β10P_Art + u. The term u is the so-called error term, which captures all the
unobserved factors that are not included in our independent variables but affect the
dependent variable. Ordinary Least Squares (OLS) is the estimator used to estimate the
unknown parameters (the betas, or coefficients) of the Linear Regression Model. Using
Enginius, we obtained the Linear Regression Model for the Bookbinders data and its
corresponding estimated parameters.

Figure 1 - Calibration Data of the Linear Regression Model

After analysing the table in Figure 1, we can conclude that, on average, men purchased more
BBBC books than women. We can also draw other relevant conclusions, such as that the maximum
amount of money spent on BBBC books was $474. Moreover, the maximum number of months
that a customer went without purchasing a book from BBBC was 12 and the minimum was 1
month. In addition, the average period between the first purchase of a BBBC book and the moment
the present mailing program was sent was 22.576 months. Furthermore, cookbooks were the
category that customers bought most on average (0.76).

Figure 2- Linear Regression Model Statistics

The R-squared measures the goodness of fit of a regression. Basically, it indicates “how good”
a model is by telling how closely the fitted values match the observed data. It is the ratio between
the Sum of Squares Explained (SSE), which measures the sample variation in the fitted values
of y(i), and the Sum of Squares Total (SST), which measures the total sample variation in the
y(i). In other words, it is the proportion of the variation in y that can be explained by the
variation in the independent variables. Moreover, this measure always lies between 0 and 1,
and an R2 of 1 indicates a perfect fit. As we can see in Figure 2, our model has
an R-squared of 0.24, which indicates that only 24% of the variation in the data can be explained
by the model.
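
As a minimal sketch of the SSE/SST ratio described above, the following Python snippet fits a one-variable least-squares line and computes its R-squared. The five data points are made up for illustration and are not the BBBC sample, and the helper names `fit_simple_ols` and `r_squared` are hypothetical.

```python
# Illustrative data only: these values are made up, not the BBBC sample.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 0.0, 1.0, 0.0, 1.0]


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(y, fitted):
    """R-squared as explained sum of squares over total sum of squares."""
    mean_y = sum(y) / len(y)
    sse = sum((fi - mean_y) ** 2 for fi in fitted)  # explained variation
    sst = sum((yi - mean_y) ** 2 for yi in y)       # total variation
    return sse / sst


b0, b1 = fit_simple_ols(x, y)
fitted = [b0 + b1 * xi for xi in x]
r2 = r_squared(y, fitted)
print(f"R-squared = {r2:.3f}")
```

For an OLS fit with an intercept, the same value is obtained as 1 minus the residual sum of squares divided by SST.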

Figure 3 - Linear Regression Model Parameters

In Figure 3 we can see that all the variables, apart from the First purchase variable, are
statistically significant. The variable Amount purchased is statistically significant at the 95%
confidence level (p-value = 0.0125, which is below 0.05) and the rest of the variables are
statistically significant at any conventional confidence level. This means that for these variables
we can reject the null hypothesis H0: β = 0, which indicates that they are related to the variable
Choice.

Moreover, if we analyse the coefficient of, for example, the variable Amount purchased, we would
conclude that a one-unit increase in Amount purchased would lead to an increase in Choice of
0.0003 units, holding the other independent variables constant. Also, by looking at the coefficient
of Gender (a binary variable), we would conclude that -0.1310 is the average difference in Choice
between male and female customers, holding the other variables constant. However, this
interpretation is problematic because our dependent variable Choice is binary, meaning that it
only takes the value 0 or 1, whereas a linear regression can produce fitted values outside that
range. In this case, it is not appropriate to use a Linear Regression Model to identify the
factors that influenced the purchase of the book.

Below we can see the deciles for this model:

Figure 4 – Deciles based on the Linear Regression Model

RFM model:
For the RFM method, use a regression model in which choice (0/1) is the dependent variable and
recency of purchase (last purchase), frequency, and amount purchased (monetary value) are the
independent variables. For the regression model with all predictors, use a regression model in
which choice (0/1) is the dependent variable and all the other variables in the data set are used as
independent variables. Use the regression module in the Excel AnalysisToolPak for the RFM
method and the regression model with all predictors (for the latter, you can also use Enginius with
a continuous dependent variable). The following three variables were used:

-Recency: which indicates months since last purchase.

-Frequency: which indicates the total number of purchases in the given period.

-Monetary value: which indicates the amount spent on BBBC books.

4
The RFM model can be implemented by simply assigning scores for each of the three variables:
recency, frequency, and amount purchased. The following scores were assigned to each of the
three variables:
Recency:
Last purchased in the last 2 months: 25 points;
Last purchased in the past 3 or 4 months: 20 points;
Last purchased in the past 5 or 6 months: 15 points;
Last purchased in the past 7 or 8 months: 10 points;
Last purchased in the past 9 or 10 months: 5 points;
Did not purchase in the last 10 months: 0 points.
Frequency (based on total purchases as recorded in the database):
Has purchased a total of less than 15 times: 25 points;
Has purchased a total of 15 to 20 times: 15 points;
Has purchased a total of 21 to 25 times: 10 points;
Has purchased a total of 26 to 30 times: 5 points;
Has purchased a total of more than 30 times: 0 points.
Monetary value:
Has purchased a total of less than $50: 10 points;
Has purchased between $51 and $150: 20 points;
Has purchased between $151 and $250: 30 points;
Has purchased between $251 and $350: 40 points;
Has purchased more than $350: 50 points.
• The combined RFM score is calculated as follows:
Total RFM score = score for Recency + score for Frequency + score for Monetary
value.
• The data was sorted on RFM scores from highest to lowest.
• The RFM scores varied from 95 to 30.
• The data was segmented into deciles depending on the RFM scores starting from the
highest 10%.
• The response rate (Choice to purchase or not to purchase) was analysed for each decile.
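
The scoring rules above can be sketched in Python as follows. This is an illustrative sketch rather than the spreadsheet actually used: the function names and the three sample customers are hypothetical, and because the rules leave a spend of exactly $50 unassigned, the code assumes it falls in the lowest monetary band.

```python
def recency_score(months_since_last_purchase):
    """Points for recency, following the bands listed above."""
    if months_since_last_purchase <= 2:
        return 25
    if months_since_last_purchase <= 4:
        return 20
    if months_since_last_purchase <= 6:
        return 15
    if months_since_last_purchase <= 8:
        return 10
    if months_since_last_purchase <= 10:
        return 5
    return 0


def frequency_score(total_purchases):
    """Points for frequency, following the bands listed above."""
    if total_purchases < 15:
        return 25
    if total_purchases <= 20:
        return 15
    if total_purchases <= 25:
        return 10
    if total_purchases <= 30:
        return 5
    return 0


def monetary_score(amount_purchased):
    """Points for monetary value; exactly $50 is ASSUMED to be in the lowest band."""
    if amount_purchased <= 50:
        return 10
    if amount_purchased <= 150:
        return 20
    if amount_purchased <= 250:
        return 30
    if amount_purchased <= 350:
        return 40
    return 50


def rfm_score(months_since_last_purchase, total_purchases, amount_purchased):
    """Total RFM score = recency points + frequency points + monetary points."""
    return (recency_score(months_since_last_purchase)
            + frequency_score(total_purchases)
            + monetary_score(amount_purchased))


# Hypothetical customers: (id, months since last purchase, purchases, amount spent).
customers = [("A", 1, 10, 400), ("B", 12, 35, 40), ("C", 4, 18, 200)]
ranked = sorted(customers, key=lambda c: rfm_score(c[1], c[2], c[3]), reverse=True)
for cid, last, freq, amount in ranked:
    print(cid, rfm_score(last, freq, amount))
```

Sorting in descending order of score reproduces the step that precedes the decile split.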

Deciles   Number of Hits   Response Rate   Amount purchased
10%       34               17%             65244
20%       26               13%             51906
30%       23               11%             52888
40%       17               8%              69300
50%       23               11%             23963
60%       17               8%              36452
70%       23               11%             49636
80%       18               9%              27187
90%       16               8%              38800
100%      7                3%              33761

Figure 5 – Deciles based on RFM scores

Based on the RFM score we were able to divide the data set into deciles from the highest scores to
the lowest. However, the number of hits and the response rate do not differ much between
deciles. For example, the 30% decile has the same number of hits (23) as the 50% and 70%
deciles, and the main difference lies in the amount purchased. The problem with the RFM
model is that it relies on only three variables and does not consider other factors.

In order to apply the next two models to the holdout data we eliminated the column that referred
to the Choice variable included in the data.

Binary Logit model:

The objective of the model is to predict the probabilities that an individual will choose each of
several choice alternatives (e.g., buy versus not buy). The model has the property that the
predicted probabilities lie between 0 and 1 and sum to 1 across the alternatives. For the binary
logit model, we used Enginius with a 0/1 dependent variable.
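
The two properties mentioned above follow from the logistic form of the binary logit. The sketch below is illustrative only: the function name `purchase_probability` and the example scores are hypothetical, and a real score would be the sum of each variable multiplied by its estimated coefficient.

```python
import math


def purchase_probability(score):
    """Binary logit: probability of buying given a linear score (log-odds)."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical scores, chosen only to show the shape of the function.
for score in (-3.0, 0.0, 2.5):
    p_buy = purchase_probability(score)
    p_not_buy = 1.0 - p_buy
    print(f"score={score:+.1f}  P(buy)={p_buy:.3f}  P(not buy)={p_not_buy:.3f}")
```

Whatever the score, P(buy) stays strictly between 0 and 1 and the two probabilities add up to 1.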

Figure 6 – Calibration data of the Binary Logit Model

After analysing the table shown above, we can draw the same conclusions as we did for the Linear
Regression model.

Figure 7 – Confusion matrix of the Binary Logit Model

The confusion matrix shown above tells us that the model correctly classified 1282 of the
1600 observations. In this table, the off-diagonal elements are classification errors. Below we
can see that the global hit rate of the model is 80%, which is the share of observations that
the model classified correctly (1282/1600). In this case, the
diagonal elements represent alternative-specific hit rates.

Figure 8 – Hit rate of the Binary Logit Model

Figure 9 – Binary Logit Model Parameters

By analysing the model’s parameters shown above, we can affirm at the 95% confidence level
that all variables except “First purchase” influence customer response, since their p-values
are smaller than 0.05 (the p-value of “First purchase” is 0.2486, which is larger than 0.05).
Moreover, all the variables except “Amount purchased” and “First purchase” are statistically
significant at any conventional confidence level. For the significant variables we can
therefore reject the null hypothesis H0: β = 0, which indicates that they are related to the
variable “Choice”.

Moreover, if we analyse the coefficient of, for example, the variable Amount purchased, we
would conclude that a one-unit increase in Amount purchased increases the log-odds of purchase
(Choice = 1) by 0.0019, holding the other independent variables constant.

Therefore, the greater the amount previously purchased, the greater the likelihood of responding
to the current mailing (as the coefficient for Amount Purchased is positive and significant).

Now, looking at the variable Frequency, we can see that the more frequently a customer buys, the
lower the likelihood that the customer responds to the current mailing. The Art History of
Florence therefore has the potential to increase loyalty by attracting customers who typically
have fewer interactions with the company. Similarly, the book appears to attract less loyal
customers, since the longer the time since the last purchase, the greater the likelihood that
the customer responds to the current mailing. Hence, it may improve the retention of these
customers.

On the one hand, the higher the number of art books bought (P_Art), the greater the
likelihood of responding to the current mailing. On the other hand, the
greater the number of books of any other type, the lower the likelihood of responding to the
current mailing.

Also, by looking at the coefficient of Gender (a binary variable), we would conclude that the
log-odds of purchase are 0.8643 lower for male than for female customers, holding the other
variables constant, which means that men are less likely to buy. Later in this case study we also
sort the response probabilities provided by the model from highest to lowest and analyze the
response rate of customers for each decile.
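
Because logit coefficients act on the log-odds rather than on Choice directly, one common way to read them is to exponentiate them into odds ratios. The sketch below uses the two coefficients reported in the text; the helper name `odds_ratio` and the choice of a $100 change are illustrative assumptions.

```python
import math

# Coefficients as reported in the text for the binary logit model.
COEF_AMOUNT_PURCHASED = 0.0019
COEF_GENDER = -0.8643


def odds_ratio(coefficient, change=1.0):
    """Multiplicative change in the odds of purchase for a given change in a variable."""
    return math.exp(coefficient * change)


# Gender: odds of purchase for men (1) relative to women (0).
or_gender = odds_ratio(COEF_GENDER)
# Amount purchased: effect of an additional $100 of past spending (assumed change).
or_amount_100 = odds_ratio(COEF_AMOUNT_PURCHASED, change=100.0)

print(f"odds ratio, male vs female: {or_gender:.3f}")
print(f"odds ratio, +$100 spent:    {or_amount_100:.3f}")
```

An odds ratio below 1 (Gender) means lower odds of purchase, and a value above 1 (Amount purchased) means higher odds, other variables held constant.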

We will confirm this later, but it seems to suggest that a simple RFM model may not do as
well (because it ignores several significant influences, such as previous purchases of art books,
purchases of children’s books and others).

The response rates obtained with the logit model are shown below. For each customer we calculated
the logit score (the sum of each variable multiplied by its estimated parameter), sorted the
customers in descending order of score, then segmented the total sample into ten deciles and
counted how many hits fall in each decile.
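
The ranking and decile split described above can be sketched as follows. This is a minimal illustration, not the procedure run in Enginius: the function name `decile_hits` is hypothetical and the 20 customers are synthetic.

```python
def decile_hits(scores, choices, n_groups=10):
    """Sort customers by score (highest first) and count purchases per group."""
    ranked = sorted(zip(scores, choices), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    hits = []
    for g in range(n_groups):
        start = g * n // n_groups
        end = (g + 1) * n // n_groups
        hits.append(sum(choice for _, choice in ranked[start:end]))
    return hits


# Synthetic example: 20 customers, the four highest-scoring ones bought the book.
scores = [i / 20 for i in range(20)]
choices = [1 if i >= 16 else 0 for i in range(20)]

hits = decile_hits(scores, choices)
group_size = len(scores) // len(hits)
for decile, h in enumerate(hits, start=1):
    print(f"decile {decile * 10:3d}%: hits={h}  response rate={h / group_size:.0%}")
```

A model that ranks well concentrates the hits in the first deciles, which is the pattern a decile table is meant to reveal.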

Figure 10 – Deciles based on Binary Logit model scores

From the table we can see that the response rate varies across the deciles and shows a general
downward trend. The first decile has a relatively high response rate, which drops sharply in the
second decile and then declines slowly over the remaining deciles.

Question 3

Based on the RFM and logit models described above, the units sold were estimated from
the response rates. Details are shown in the table below:

Figure 11– Details based on the response rate of deciles generated by the two models

If we assume a mailing is sent to everyone on the list, we can expect a response rate of 8.9%
(= 204/2300, the response rate in the holdout sample) and a net revenue of $12,737. For the
RFM model, although the profit generally keeps increasing as more deciles are mailed, it is still
lower than that of the logit model at the same depth. Meanwhile, the table above shows
that the profit under the logit model first rises and then falls,
peaking at the 40% decile ($23,145). Therefore, the most profitable approach
is to use the logit model and target the top 40% of the list.
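
The profit-by-depth comparison can be sketched as a cumulative calculation over deciles. The hits per decile below are the RFM figures from Figure 5 and the holdout size is 2300, as in the text. The margin per order and the cost per mailing are placeholder assumptions that do not appear in the text, so the output is not meant to reproduce the amounts in Figure 11.

```python
# Hits per decile for the RFM model on the holdout sample (Figure 5).
RFM_HITS_PER_DECILE = [34, 26, 23, 17, 23, 17, 23, 18, 16, 7]
HOLDOUT_SIZE = 2300

# ASSUMED economics, not taken from the text: placeholders for illustration.
MARGIN_PER_ORDER = 10.20
COST_PER_MAILING = 0.65


def cumulative_profit(hits_per_decile, list_size, margin, mail_cost):
    """Profit from mailing the top 10%, 20%, ..., 100% of a ranked list."""
    per_decile = list_size / len(hits_per_decile)
    profits = []
    total_hits = 0
    for depth, hits in enumerate(hits_per_decile, start=1):
        total_hits += hits
        profits.append(total_hits * margin - depth * per_decile * mail_cost)
    return profits


profits = cumulative_profit(
    RFM_HITS_PER_DECILE, HOLDOUT_SIZE, MARGIN_PER_ORDER, COST_PER_MAILING
)
best_depth = max(range(len(profits)), key=profits.__getitem__) + 1
for depth, profit in enumerate(profits, start=1):
    print(f"top {depth * 10:3d}%: profit = {profit:8.2f}")
print(f"most profitable depth: top {best_depth * 10}%")
```

The same function applied to the logit hits per decile would give the logit profit curve, and multiplying by the ratio of the full list size to the holdout size would scale the result to a full mailing.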

Question 4

As we could see above, the Logit and Linear Regression models provide similar response rates
for each decile, while the response rate from the RFM model was very low in the initial deciles.
The final output at a 100% target is the same for all models, as it is equivalent to not using any
model. The limitations of the RFM model are that it uses only three variables (Last
purchase, Frequency and Monetary value), ignores the remaining variables, and does not
use the training dataset.

On the other hand, the Regression and Logit models use all the provided variables and
use the training dataset to derive coefficients, which are then applied to the holdout dataset.
This makes these models more robust. Although the constants and coefficient values of the Logit
and Regression models are different, both models show similar positive and negative
relationships between the independent variables and the output variable. We can also add that
the same principles as in linear regression govern p-values for Logit (p < 0.05 indicates
significance at the 95% confidence level). Logit is a non-linear model, so interpreting its
parameters is not as straightforward as in linear models (i.e. they do not resemble marginal
effects).

However, the Linear Regression model can be affected by causality issues and by overfitting,
regardless of its R2 and Adjusted R2 values. Also, as said above, this model is not suited to
a binary dependent variable (Choice). So, in this case, it is not appropriate to use a Linear
Regression Model to identify the factors that influenced the purchase of the book.

Also, in the case of the Logit model, R-squared cannot be used. As an alternative we can use
either McFadden’s Pseudo R-squared, R2McF = 1 - ln(LM)/ln(L0), where LM is the likelihood of
the estimated model and L0 is the likelihood of a model without covariates, or hit rates,
which simply count the number of hits and misses (this was the measure we used). The value
obtained was 80%. Although it is not that high, we
can say that, with all this in mind, it is the best-fitting model.
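
Both fit measures mentioned above can be computed from a model's predicted probabilities. The sketch below is illustrative only: the eight observations and probabilities are made up, the function names are hypothetical, and the 0.5 cutoff for the hit rate is an assumption.

```python
import math


def log_likelihood(choices, probabilities):
    """Log-likelihood of observed 0/1 choices given predicted P(choice = 1)."""
    return sum(
        math.log(p) if c == 1 else math.log(1.0 - p)
        for c, p in zip(choices, probabilities)
    )


def mcfadden_r2(choices, probabilities):
    """McFadden's pseudo R-squared: 1 - ln(LM) / ln(L0)."""
    base_rate = sum(choices) / len(choices)  # model without covariates
    ll_model = log_likelihood(choices, probabilities)
    ll_null = log_likelihood(choices, [base_rate] * len(choices))
    return 1.0 - ll_model / ll_null


def hit_rate(choices, probabilities, cutoff=0.5):
    """Share of observations classified correctly at the given cutoff (assumed 0.5)."""
    correct = sum(
        1 for c, p in zip(choices, probabilities) if (p >= cutoff) == (c == 1)
    )
    return correct / len(choices)


# Made-up observations and predicted probabilities, for illustration only.
choices = [1, 0, 0, 1, 0, 0, 0, 1]
probabilities = [0.8, 0.2, 0.3, 0.6, 0.1, 0.4, 0.2, 0.7]

pseudo_r2 = mcfadden_r2(choices, probabilities)
hits = hit_rate(choices, probabilities)
print(f"McFadden pseudo R-squared: {pseudo_r2:.3f}")
print(f"hit rate: {hits:.0%}")
```

A model that predicts the base rate for everyone gets a pseudo R-squared of 0, which is the reference point for reading the statistic.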

Question 5:

If we consider the profit calculations made in question 3, we can conclude that it is
advantageous for BBBC to use predictive models to increase the efficiency of its direct mail
program. As mentioned before, although in the RFM model the profits increased as more
deciles were mailed, the values are still lower than in the Logit model. Thus, if we were to choose
between the two models, the Logit model would be the most reasonable choice.

However, the decision on whether the company should invest in developing expertise in one of these
methods to build an in-house capability depends on how frequently direct mail will
be sent and on the number of mailings. In this specific case, if we were to use the Logit model,
the difference in profit from mailing only to the top 40% of the list instead of the entire list
would be $10,408.00 ($23,145.00 - $12,737.00). Therefore, if BBBC sends more than
15,000 to 20,000 mailings per year, this additional profit would more than cover the cost of
developing an in-house capability.

Question 6:

As deduced from the previous analysis, there is a lack of certainty in understanding
consumer behaviour and the factors influencing it.

The aim is to use a variety of models to gain a deeper understanding of customer purchasing
behaviour at BBBC. We want the models to give an approximate prediction of how well its direct
marketing campaigns will perform, in order to move away from the traditional mass marketing
previously used.

For the sake of simplicity and further automation, we think the company should give priority to
its data collection systems. It needs to establish a relevant and reliable customer base in order
to segment customers clearly by lifetime value. A multivariate approach might
be more informative, but the most important step is to invest in making the company
customer-centric and in implementing a CRM system.

CRM systems are usually well developed and have an automated flow. BBBC
should invest in software such as Marketing Engineering for Excel, Salesforce, Oracle,
etc.

Once the company has gathered the data on its customers, it can run the Binary Logit Choice
Model using, for example, Marketing Engineering for Excel and target the 40% of its customers
with the highest Response Probability.
