
Building a Predictive Model for Auto Loan Defaults


Auto Loan CreditRisk Competition by Peak2Tails

- Vaibhav Pandey
• Recap of Past Presentation
• Exploratory Data Analysis
• Deriving New Features
• WoE Binning
• Training and Test Data Split
• Variable Selection using IV
• Multicollinearity Check
• Present Objective
• Logistic Regression Analysis
• Final Model Development
• Model Validation

Training and Test Data Split

• The whole dataset contains 5837 records of past default behaviour of auto loan borrowers.
• The data is randomly sampled into two parts:
  • Training Set (70%): 4064 records
  • Testing Set (30%): 1773 records
• The model is built entirely on the Training Set, while the Testing Set is kept separate to validate the results later.

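A random 70/30 split like the one described above can be sketched as follows. This is a minimal illustration using only the standard library; the function name `train_test_split_indices` and the seed are assumptions, and since the slides do not say how the split was drawn, this sketch does not reproduce the exact 4064/1773 counts.

```python
import random

def train_test_split_indices(n_records, train_fraction=0.7, seed=42):
    """Randomly partition record indices into training and testing sets."""
    indices = list(range(n_records))
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split reproducible
    rng.shuffle(indices)
    n_train = int(n_records * train_fraction)
    return indices[:n_train], indices[n_train:]

# 5837 is the dataset size reported in the slides.
train_idx, test_idx = train_test_split_indices(5837)
```

The two index lists are disjoint and together cover every record, so nothing from the testing set leaks into model building.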
Variable Selection using IV

We start with 26 variables and perform Weight of Evidence (WoE) binning on each to compute its Information Value (IV).

S No  Variable               IV after Coarse Binning  Keep
1     Bureau Score           0.810204086              Yes
2     RevoLineByTotRevTrade  0.563723168              Yes
3     TotRevoLine            0.489594168              Yes
4     TotDerByOldAge         0.384528104              Yes
5     TotDerByTrade          0.364632265              Yes
6     TotDer                 0.286172303              Yes
7     OldAge                 0.280564824              Yes
8     LTV                    0.195281766              Yes
9     RevoUtil               0.15679478               Yes
10    TotTrade               0.130682924              Yes
11    RevoDebtByRevTrade     0.101454192              Yes
12    TotRevoDebt            0.086444386              Yes
13    DownPayByPurchPrice    0.076311002              Yes
14    Total Income           0.08169131               Yes
15    LTI                    0.074021329              Yes
16    DTI_Mod                0.058749312              Yes
17    Purch_Price            0.047454197              Yes
18    RevoDebtByTotIncome    0.047476619              Yes
19    LoanAmtByLoanTerm      0.032088245              Yes
20    TotRevoTrade           0.012372456              No
21    Loan Term              0.041069725              No
22    TotOpenByTotTrade      0.014088287              No
23    TotOpenTrade           0.01486741               No
24    Used                   0.02604294               No
25    Loan Amt               0.012924509              No
26    Lease_Loan             9.32116E-06              No

IV interpretation:
• IV < 0.02: Not Predictive
• 0.02 <= IV < 0.1: Weak Predictive Power
• 0.1 <= IV < 0.3: Medium Predictive Power
• 0.3 <= IV < 0.5: Strong Predictive Power
• IV >= 0.5: Suspicious Predictive Power

We dropped the variables ‘TotRevoTrade’, ‘TotOpenByTotTrade’, ‘TotOpenTrade’, ‘Loan Amt’ and ‘Lease_Loan’ because their IV was less than 0.02.

The variables ‘Loan Term’ and ‘Used’ were dropped because they had fewer than 3 WoE bins.

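To make the IV column concrete, here is a minimal sketch of how WoE and IV could be computed for one binned variable, using the WoE = ln(%bads/%goods) convention stated later in the deck. The function names and the bin counts are hypothetical, and the sketch assumes every bin contains at least one good and one bad record.

```python
import math

def woe_iv(bins):
    """bins: list of (n_goods, n_bads) per bin. Returns (WoE per bin, IV)."""
    total_goods = sum(goods for goods, _ in bins)
    total_bads = sum(bads for _, bads in bins)
    woes, iv = [], 0.0
    for goods, bads in bins:
        pct_goods = goods / total_goods
        pct_bads = bads / total_bads
        woe = math.log(pct_bads / pct_goods)  # WoE = ln(%bads / %goods)
        woes.append(woe)
        iv += (pct_bads - pct_goods) * woe    # each bin's contribution is >= 0
    return woes, iv

def iv_band(iv):
    """Map an IV value to the rule-of-thumb bands listed in the slides."""
    if iv < 0.02:
        return "Not Predictive"
    if iv < 0.1:
        return "Weak Predictive Power"
    if iv < 0.3:
        return "Medium Predictive Power"
    if iv < 0.5:
        return "Strong Predictive Power"
    return "Suspicious Predictive Power"

# Hypothetical bin counts (goods, bads), not taken from the competition data.
example_bins = [(400, 20), (300, 30), (300, 50)]
woes, iv = woe_iv(example_bins)
```

Applied to the table, `iv_band(0.810204086)` falls in the "Suspicious" band for Bureau Score, while `iv_band(9.32116e-06)` marks Lease_Loan as not predictive.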
Correlations among Variables

The following sets of variables were found to be highly correlated:

• ‘tot_derog’, ‘Tot Der/Total Trade’ and ‘Tot Der/Age Old’
• ‘tot_revo_line’ and ‘tot_revo_line/tot_revo_trade’
• ‘Tot_revo_debt’, ‘tot_revo debt/tot_income’ and ‘tot_revo_debt/tot_revo_trade’
• ‘LTI’ and ‘DTIMod’

Variance Inflation Factor

Correlation maps only investigate pairwise correlations between two variables, while the Variance Inflation Factor (VIF) checks whether a variable can be explained by a linear combination of all the other variables. Each variable is regressed on all the other variables, and VIF = 1/(1 - R^2) is computed from the R^2 of that regression.

Generally, a VIF value of less than 5 is considered acceptable.
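The regress-one-on-the-rest procedure can be sketched in plain Python as below. This is an illustrative implementation under stated assumptions, not the tooling used for the slides: the helper names `solve` and `vif` and the three toy columns are hypothetical, and the least-squares fit uses the normal equations, which is adequate only for small, well-conditioned examples.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def vif(columns, j):
    """VIF of column j: regress it on all other columns plus an intercept."""
    y = columns[j]
    others = [col for k, col in enumerate(columns) if k != j]
    n = len(y)
    rows = [[1.0] + [col[i] for col in others] for i in range(n)]
    p = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(bk * rk for bk, rk in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot
    return 1.0 / (1.0 - r_squared)

# Toy data: x2 is roughly 2 * x1, x3 is nearly unrelated to both.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8]
x3 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
toy_columns = [x1, x2, x3]
toy_vifs = [vif(toy_columns, j) for j in range(3)]
```

In the toy data the near-duplicate columns `x1` and `x2` get VIFs far above the threshold of 5, while `x3` stays close to 1, which is the pattern the multicollinearity check looks for.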
From each of the four sets of highly correlated variables, we keep the variable with the highest IV and drop the rest.

We recompute the VIFs of the remaining variables and now find the values to be acceptable.

Logistic Regression

We run the Logistic Regression on the remaining variables and analyse the results.

• As we had replaced the values of all variables with WoE = ln(%bads/%goods) during the data preprocessing stage, the coefficients of all variables in the model are positive, i.e. higher values mean a higher probability of default (PD), as expected.

• The log-likelihood of the full model is greater than the log-likelihood of the null model, as expected.

• R-square values between 0.2 and 0.4 indicate good explained variance for a Logistic Regression. However, the R-square of our model is slightly less than 0.2, which indicates lower explained variance.

• The p-values of the coefficients are below 5% for many variables, showing that they are statistically significant.

• While some variables in our model have p-values above 5%, reflecting their statistical insignificance, the model as a whole has a p-value below 0.05, and hence the variables are jointly statistically significant.

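The comparison between the full model and the null model can be illustrated with a short sketch. It assumes the R-square quoted above is McFadden's pseudo R-square, 1 - LL(model)/LL(null), which the slides do not state explicitly. The labels and predicted probabilities are hypothetical and are not fitted by maximum likelihood; they only show how the quantities relate.

```python
import math

def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood of observed labels under predicted probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(y_true, p_pred))

def mcfadden_r2(ll_model, ll_null):
    """McFadden pseudo R-square (assumed here to be the R-square in the slides)."""
    return 1.0 - ll_model / ll_null

# Hypothetical labels (1 = default) and model PDs, for illustration only.
y = [0, 0, 0, 1, 0, 1, 1, 0]
p_model = [0.1, 0.2, 0.3, 0.6, 0.2, 0.7, 0.8, 0.4]

# The null (intercept-only) model predicts the overall default rate for everyone.
base_rate = sum(y) / len(y)
ll_null = log_likelihood(y, [base_rate] * len(y))
ll_model = log_likelihood(y, p_model)

pseudo_r2 = mcfadden_r2(ll_model, ll_null)
llr_statistic = 2.0 * (ll_model - ll_null)  # basis of the whole-model p-value
```

A model that separates defaulters from non-defaulters better than the base rate has a log-likelihood closer to zero than the null model, which gives a positive pseudo R-square and a positive likelihood-ratio statistic.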
Final Model

We re-run the model after dropping the variables with p-values greater than 0.05.

• After dropping these variables, AIC and BIC have reduced, showing that the dropped variables contributed more to model complexity than to explanatory power.

• However, the p-values of ‘rev_util’ and ‘down_pay/purch_price’ are still greater than 0.05.

• We also tried dropping these variables, but this led to an increase in AIC and BIC.

• Further, these variables are important from a business perspective, and hence we have kept them in our final model.

• Also, the p-value of the model as a whole is less than 0.05, which shows the joint statistical significance of all the variables.

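The AIC/BIC argument can be made concrete with the standard definitions AIC = 2k - 2LL and BIC = k ln(n) - 2LL, where k is the number of estimated parameters and n the number of training records. The log-likelihoods and parameter counts below are hypothetical; only n = 4064 comes from the slides.

```python
import math

def aic(log_lik, n_params):
    """Akaike Information Criterion: 2k - 2*LL."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion: k*ln(n) - 2*LL."""
    return n_params * math.log(n_obs) - 2 * log_lik

N_TRAIN = 4064  # training set size from the slides

# Hypothetical fits: dropping 3 weak variables costs only 0.5 in log-likelihood.
full_aic = aic(-1800.0, 12)
full_bic = bic(-1800.0, 12, N_TRAIN)
reduced_aic = aic(-1800.5, 9)
reduced_bic = bic(-1800.5, 9, N_TRAIN)
```

Both criteria fall for the reduced model here because the penalty saved by removing three parameters outweighs the small loss in log-likelihood. Dropping a variable that carries real signal would cost more log-likelihood than the penalty saves, and the criteria would rise instead.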
Model Validation

We used the Test Dataset inputs to predict PD values and compared them with the actual default status to validate the results.

Area under the curve (AUC) measures the discriminatory power of the model, i.e. whether the model is able to distinguish good borrowers from bad ones.

A random model has AUC = 0.5, and a value above 0.7 is considered acceptable. The higher the AUC, the better.

AUC (Our Model) = 0.76

Gini Coefficient = 2*AUC - 1 = 0.52 (Our Model)

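As a sketch of what AUC measures, the snippet below computes it as the fraction of (defaulter, non-defaulter) pairs in which the defaulter receives the higher PD, counting ties as half. The function names and the four-record toy data are illustrative assumptions; the pairwise loop is quadratic and meant for explanation rather than large datasets.

```python
def auc_pairwise(y_true, scores):
    """AUC as the probability that a random bad outranks a random good."""
    bads = [s for y, s in zip(y_true, scores) if y == 1]
    goods = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for b in bads:
        for g in goods:
            if b > g:
                wins += 1.0
            elif b == g:
                wins += 0.5  # ties count as half a win
    return wins / (len(bads) * len(goods))

def gini_from_auc(auc):
    """Gini coefficient as defined in the slides: 2*AUC - 1."""
    return 2.0 * auc - 1.0

# Toy example: 1 = default, scores are predicted PDs.
toy_auc = auc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
model_gini = gini_from_auc(0.76)  # AUC reported for the model
```

In the toy data three of the four bad/good pairs are ranked correctly, so the AUC is 0.75. Plugging the reported AUC of 0.76 into the Gini formula reproduces the 0.52 shown above.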
KS Statistic

Our model gives low PD values to non-defaulters and relatively high PD values to defaulters.
Cumulatively, bad borrowers are given high PDs and are bunched on the right.
We have used the KS cutoff value of 0.22 as the PD cutoff: a PD above 0.22 is classified as a default.
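One common way to obtain such a cutoff is to take the PD at which the cumulative distributions of non-defaulters and defaulters are furthest apart. The sketch below follows that definition; the function name and the toy scores are hypothetical, and the slides do not show how the 0.22 value was actually derived.

```python
def ks_statistic(y_true, pd_scores):
    """Return (KS statistic, PD cutoff at the point of maximum separation)."""
    goods = [s for y, s in zip(y_true, pd_scores) if y == 0]
    bads = [s for y, s in zip(y_true, pd_scores) if y == 1]
    best_gap, best_cutoff = 0.0, None
    for cutoff in sorted(set(pd_scores)):
        # Share of each group with a PD at or below the candidate cutoff.
        cdf_good = sum(s <= cutoff for s in goods) / len(goods)
        cdf_bad = sum(s <= cutoff for s in bads) / len(bads)
        gap = abs(cdf_good - cdf_bad)
        if gap > best_gap:
            best_gap, best_cutoff = gap, cutoff
    return best_gap, best_cutoff

# Toy data: first four records are non-defaulters, last four are defaulters.
toy_y = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pd = [0.05, 0.10, 0.15, 0.30, 0.20, 0.40, 0.60, 0.25]
toy_ks, toy_cutoff = ks_statistic(toy_y, toy_pd)
```

Records with a PD above the returned cutoff would then be classified as defaults, mirroring the rule applied with the 0.22 cutoff.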

Model Performance Metrics

Metric          Value
ROC_AUC         0.76
Accuracy Score  0.69
Precision       0.36
Recall Score    0.72
F1 Score        0.48

Confusion Matrix

                    Predicted Non Defaults  Predicted Defaults
Actual Non Default  962                     448
Actual Default      96                      257
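
The accuracy, precision, recall and F1 figures follow directly from the four confusion matrix counts. The sketch below recomputes them using the standard definitions; the function name is an assumption. The results agree with the reported values to within 0.01, with small differences in the last digit depending on whether values are rounded or truncated.

```python
def classification_metrics(tn, fp, fn, tp):
    """Standard metrics from confusion matrix counts (default = positive class)."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)  # share of predicted defaults that really defaulted
    recall = tp / (tp + fn)     # share of actual defaults that were caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Counts from the confusion matrix in the slides.
metrics = classification_metrics(tn=962, fp=448, fn=96, tp=257)
```

The low precision next to the high recall reflects the choice of cutoff: most actual defaulters are flagged, at the cost of many non-defaulters being flagged as well.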
