Assign2 GLM
The Caravan dataset contains 85 predictors that measure demographic characteristics for 5,822
individuals, plus a response variable, “Purchase,” which indicates whether or not a given individual
purchases a caravan insurance policy. We begin by converting the Purchase column from character to
integer: every ‘Yes’ response is changed to 1 and every ‘No’ response is changed to 0. A preview of
the data set and the distribution of purchases (1) and non-purchases (0) in the Purchase column can be
seen in Figures 1 and 2, respectively. It should be noted that non-purchases are much more frequent than
purchases.
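The recoding step described above can be sketched in Python as follows. This is only an illustration: the short purchase list is made-up toy data, not the Caravan records, and the helper name encode_purchase is hypothetical.

```python
from collections import Counter


def encode_purchase(labels):
    """Map 'Yes' -> 1 and 'No' -> 0; any other label raises a KeyError."""
    mapping = {"Yes": 1, "No": 0}
    return [mapping[label] for label in labels]


# Toy stand-in for the Purchase column (not the real data).
purchase = ["No", "No", "Yes", "No", "No", "No", "Yes", "No"]
encoded = encode_purchase(purchase)

# Class counts, analogous to a bar chart of non-purchases (0) vs. purchases (1).
counts = Counter(encoded)
print(encoded)
print("non-purchases:", counts[0], "purchases:", counts[1])
```

Counting the two classes right after recoding is a quick way to confirm the imbalance before any model is fitted.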
Figure 1. Preview of the Caravan dataset: 5822 records and 85 predictors + 1 Purchase variable.
Figure 5. Model summary for the refined logistic regression model with formula: Purchase ~
PPERSAUT+APLEZIER+PWAPART+MKOOPKLA+PBRAND+MOPLLAAG+MINKGEM
(APERSAUT and AWAPART were removed from the model to reduce the AIC value further).
We then use the refined logistic regression model to predict the probability of purchase in both the test
and train sets. These predictions are compared to the actual purchase values in Figure 6. We can see
that the model predicts slightly more purchases on the train set than on the test set. The correlation
between the predictions and the actual values was calculated to complement Figure 6. The correlation
between train predictions and actual purchases was found to be 0.307, and the correlation between test
predictions and actual purchases was found to be a little lower at 0.245, which may be considered
acceptable. This suggests that purchases may be an infrequent event, which is expected given the bar
graph in Figure 2.
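The correlation between predicted probabilities and the 0/1 purchase values can be sketched as below. The report does not say which correlation coefficient was used, so this assumes the Pearson coefficient; the function name pearson_correlation and the toy numbers are hypothetical and are not taken from the Caravan results.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy predicted probabilities and actual outcomes (illustrative only).
predicted = [0.02, 0.10, 0.35, 0.05, 0.60, 0.08]
actual = [0, 0, 1, 0, 1, 0]

r = pearson_correlation(predicted, actual)
print(round(r, 3))
```

Computing the same quantity separately on the train and test predictions gives the two values being compared in the text.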
Figure 6. Comparison of predictions to actual purchases.
A threshold of 0.5 was set, i.e., when the predicted probability is above 0.5, the event is classified as a
purchase (1). The confusion matrix was then calculated: true negatives (5466), true positives (7),
false negatives (341), and false positives (8), as shown in Figure 7.
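The thresholding and counting steps can be sketched as follows. The helper names classify and confusion_counts and the toy inputs are hypothetical; only the four counts in the reported dictionary come from the text above. The summary metrics (accuracy, sensitivity, precision) are standard definitions applied to those counts and are not figures stated in the report.

```python
def classify(probabilities, threshold=0.5):
    """Label a case 1 (purchase) when its probability is above the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


def confusion_counts(actual, predicted):
    """Count true/false negatives and positives for 0/1 labels."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 1:
            counts["fn"] += 1
        elif p == 1:
            counts["fp"] += 1
        else:
            counts["tn"] += 1
    return counts


# Toy example (illustrative only): 0.5 itself is not "above" the threshold.
toy_actual = [0, 0, 1, 1, 0]
toy_probs = [0.1, 0.7, 0.8, 0.3, 0.5]
toy_counts = confusion_counts(toy_actual, classify(toy_probs))
print(toy_counts)

# Counts reported in the text.
reported = {"tn": 5466, "fp": 8, "fn": 341, "tp": 7}
total = sum(reported.values())
accuracy = (reported["tp"] + reported["tn"]) / total
sensitivity = reported["tp"] / (reported["tp"] + reported["fn"])
precision = reported["tp"] / (reported["tp"] + reported["fp"])
print(total, round(accuracy, 4), round(sensitivity, 4), round(precision, 4))
```

With the reported counts, accuracy is about 94% while sensitivity is about 2%: at a 0.5 threshold only 7 of the 348 actual purchases are identified, which is consistent with the class imbalance noted earlier.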