
Predicting Customer Interest

The Caravan dataset contains 85 predictors that measure demographic characteristics for 5,822
individuals, plus a response variable, “Purchase,” which indicates whether a given individual purchased a
caravan insurance policy. We begin by converting the Purchase column from character to integer: every
‘Yes’ response becomes 1 and every ‘No’ response becomes 0. A preview of the data set and the
distribution of purchases (1) and non-purchases (0) in the Purchase column are shown in Figures 1 and 2,
respectively. Note that non-purchases are much more frequent than purchases.
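The recoding step can be illustrated with a minimal Python sketch. The report does not show its own code, so the function name `encode_purchase` and the toy column below are assumptions made for illustration, not the real Caravan data.

```python
from collections import Counter

def encode_purchase(labels):
    """Map 'Yes' -> 1 and 'No' -> 0; any other value raises KeyError."""
    mapping = {"Yes": 1, "No": 0}
    return [mapping[label] for label in labels]

# Hypothetical toy column standing in for the Purchase variable.
purchase_raw = ["No", "No", "Yes", "No", "No", "No", "Yes", "No"]
purchase = encode_purchase(purchase_raw)

# Class frequencies: the quantity a bar chart like Figure 2 would display.
counts = Counter(purchase)
print("non-purchases:", counts[0], "purchases:", counts[1])
```

Raising an error on unexpected labels, rather than silently mapping them to 0, makes data-entry problems visible before any model is fitted.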

Figure 1. Preview of the Caravan dataset: 5822 records and 85 predictors + 1 Purchase variable.

Figure 2. Frequency of purchases (1) and non-purchases (0).


The data were then randomly split into training and test sets, with 1,000 samples held out for the test
set. A logistic regression model was fitted to the training set. The preliminary model included all 85
predictors and yielded an AIC of 1949.1. After reviewing the preliminary model, a function “corr_var”
was used to identify the variables most strongly correlated with Purchase. Once these variables were
identified from the graph in Figure 4, the logistic regression model was refined to contain fewer
variables, reducing both the model complexity and the AIC value. Results of the refined model are shown
in Figure 5.
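The hold-out split and the correlation screening can be sketched as follows. This is not the “corr_var” function used in the report; it is a plain Pearson-correlation ranking written under the assumption that the screening ranks predictors by absolute correlation with Purchase. The helper names and the toy columns are hypothetical.

```python
import math
import random

def holdout_split(rows, test_size, seed=0):
    """Shuffle row indices and hold out `test_size` rows as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    test_idx, train_idx = indices[:test_size], indices[test_size:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def rank_by_correlation(columns, target, top=10):
    """Return the `top` predictors ordered by absolute correlation with target."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)[:top]

# Toy data with hypothetical column names (not Caravan variables).
target = [0, 0, 1, 0, 1, 1]
columns = {
    "weak": [1, 0, 1, 0, 1, 0],
    "strong": [0, 0, 1, 0, 1, 1],
    "constant": [2, 2, 2, 2, 2, 2],
}
ranking = rank_by_correlation(columns, target)
train, test = holdout_split(list(range(20)), test_size=5)
print([name for name, _ in ranking], len(train), len(test))
```

Fixing the seed keeps the split reproducible, so that AIC values from successive model refinements are compared on the same training rows.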
Figure 4. Bar chart of Purchase and Top 10 variables with the highest correlation.

Figure 5. Model summary for the refined logistic regression model with formula: Purchase ~
PPERSAUT+APLEZIER+PWAPART+MKOOPKLA+PBRAND+MOPLLAAG+MINKGEM
(APERSAUT and AWAPART were removed from the model to reduce AIC value further).
We then used the refined logistic regression model to predict the probability of purchase in both the test
and training sets. These predictions are compared with the actual purchase values in Figure 6. The model
predicts slightly more purchases in the training set than in the test set. The correlation between the
predictions and the actual values was calculated to complement Figure 6: it was 0.307 for the training
set and somewhat lower, 0.245, for the test set, which may be considered acceptable. This suggests that
purchases are an infrequent event, which is expected given the bar graph in Figure 2.
Figure 6. Comparison of predictions to actual purchases.
A threshold of 0.5 was set, i.e., when the predicted probability is above 0.5, the event is classified as a
purchase (1). The confusion matrix was then calculated, giving 5,466 true negatives, 7 true positives,
341 false negatives, and 8 false positives, as shown in Figure 7.

Figure 7. Confusion matrix containing both test and train sets.
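As a minimal sketch of the thresholding rule and the four confusion-matrix cells, the snippet below classifies a prediction as a purchase when its probability exceeds the threshold. The function name `confusion_counts` and the toy probabilities are assumptions for illustration; they are not the model output behind Figure 7.

```python
def confusion_counts(probabilities, actual, threshold=0.5):
    """Count TN, FP, FN, TP after thresholding predicted probabilities."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for p, y in zip(probabilities, actual):
        predicted = 1 if p > threshold else 0
        if predicted == 1 and y == 1:
            counts["TP"] += 1
        elif predicted == 1 and y == 0:
            counts["FP"] += 1
        elif predicted == 0 and y == 1:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

# Hypothetical predicted probabilities and true labels.
probs = [0.05, 0.10, 0.62, 0.30, 0.55, 0.08]
actual = [0, 0, 1, 1, 0, 0]
cells = confusion_counts(probs, actual)
print(cells)
```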


From the confusion matrix, it is evident that the logistic regression model produces a large number of
false negatives (341) while being quite accurate at predicting true negatives (non-purchases).
Correspondingly, the sensitivity and specificity were found to be 0.020 and 0.998, respectively. The high
specificity indicates that the model is accurate in predicting non-purchases. However, because of its low
sensitivity, its predictions of purchases are not reliable: the model underestimates the occurrence of
purchases. The number of false negatives may be reduced by lowering the threshold below 0.5, at the cost
of more false positives.
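The two metrics follow directly from the confusion-matrix counts reported above: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below recomputes them from those counts, and then uses a small hypothetical set of probabilities (not the report's predictions) to show how a lower threshold raises sensitivity.

```python
def sensitivity(tp, fn):
    """Share of actual purchases that the model flags as purchases."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual non-purchases that the model flags as non-purchases."""
    return tn / (tn + fp)

# Counts as reported in the confusion matrix of the report.
tn, fp, fn, tp = 5466, 8, 341, 7
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.4f}")  # 0.0201
print(f"specificity = {spec:.4f}")  # 0.9985

def sensitivity_at(probabilities, actual, threshold):
    """Sensitivity obtained when classifying p > threshold as a purchase."""
    hits = sum(1 for p, y in zip(probabilities, actual) if y == 1 and p > threshold)
    misses = sum(1 for p, y in zip(probabilities, actual) if y == 1 and p <= threshold)
    return sensitivity(hits, misses)

# Hypothetical toy predictions: lowering the threshold recovers a missed purchase.
probs = [0.05, 0.10, 0.62, 0.30, 0.55, 0.08]
actual = [0, 0, 1, 1, 0, 0]
sens_default = sensitivity_at(probs, actual, 0.5)
sens_lowered = sensitivity_at(probs, actual, 0.25)
print(sens_default, sens_lowered)
```

The recomputed values agree with the reported 0.020 and 0.998 to within rounding. In the toy example the purchase predicted at 0.30 is missed at a threshold of 0.5 and caught at 0.25.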
