Auto Loan Model Development and Validation

Building a Predictive Model for Auto Loan Defaults

Auto Loan CreditRisk Competition by Peak2Tails
- Vaibhav Pandey

Recap of Past Presentation
• Exploratory Data Analysis
• Deriving New Features
• WoE Binning

Present Objective
• Training and Test Data Split
• Variable Selection using IV
• Multicollinearity Check
• Logistic Regression Analysis
• Final Model Development
• Model Validation

Training and Test Data Split
• The whole dataset contains 5837 records of past default behaviour of auto loan borrowers.
• The data is sampled randomly into two parts:
  • Training Set (70%): 4064 records
  • Testing Set (30%): 1773 records
• The model is built entirely on the Training Set, while the Testing Set is kept separate to validate the results later on.
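
A random split of this kind can be sketched in a few lines of Python. This is only an illustration: the function name, the seed and the integer stand-ins for records are assumptions, and an exact 70/30 cut gives slightly different counts from the 4064/1773 reported above, so the sampling actually used presumably differed in detail.

```python
import random

def split_train_test(records, train_fraction=0.7, seed=42):
    """Shuffle the records reproducibly and cut them into train/test parts."""
    rng = random.Random(seed)          # fixed seed so the split can be repeated
    shuffled = list(records)           # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# Stand-in for the 5837 loan records (hypothetical: just row indices).
records = list(range(5837))
train, test = split_train_test(records)
print(len(train), len(test))
```

Keeping the seed fixed means the Testing Set stays the same across runs, which matters when the model is revised and re-validated.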

Variable Selection using IV
We start with 26 variables and perform Weight of Evidence (WoE) binning on each to compute its Information Value (IV).

S No  Variable              IV after Coarse Binning  Keep
1     Bureau Score          0.810204086              Yes
2     RevoLineByTotRevTrade 0.563723168              Yes
3     TotRevoLine           0.489594168              Yes
4     TotDerByOldAge        0.384528104              Yes
5     TotDerByTrade         0.364632265              Yes
6     TotDer                0.286172303              Yes
7     OldAge                0.280564824              Yes
8     LTV                   0.195281766              Yes
9     RevoUtil              0.15679478               Yes
10    TotTrade              0.130682924              Yes
11    RevoDebtByRevTrade    0.101454192              Yes
12    TotRevoDebt           0.086444386              Yes
13    DownPayByPurchPrice   0.076311002              Yes
14    Total Income          0.08169131               Yes
15    LTI                   0.074021329              Yes
16    DTI_Mod               0.058749312              Yes
17    Purch_Price           0.047454197              Yes
18    RevoDebtByTotIncome   0.047476619              Yes
19    LoanAmtByLoanTerm     0.032088245              Yes
20    TotRevoTrade          0.012372456              No
21    Loan Term             0.041069725              No
22    TotOpenByTotTrade     0.014088287              No
23    TotOpenTrade          0.01486741               No
24    Used                  0.02604294               No
25    Loan Amt              0.012924509              No
26    Lease_Loan            9.32116E-06              No

IV < 0.02: Not Predictive
0.02 <= IV < 0.1: Weak Predictive Power
0.1 <= IV < 0.3: Medium Predictive Power
0.3 <= IV < 0.5: Strong Predictive Power
IV >= 0.5: Suspicious Predictive Power

We dropped the variables ‘TotRevoTrade’, ‘TotOpenByTotTrade’, ‘TotOpenTrade’, ‘Loan Amt’ and ‘Lease_Loan’ because their IV was less than 0.02.
The variables ‘Loan Term’ and ‘Used’ were dropped because their number of WoE bins was less than 3.
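
To make the IV column concrete, here is a minimal sketch of how WoE and IV can be computed for one binned variable. It assumes the convention WoE = ln(%bads/%goods) used in this deck and the usual definition IV = sum over bins of (%bads - %goods) * WoE. The function names and the bin counts are made up for illustration, and every bin is assumed to contain at least one good and one bad.

```python
import math

def woe_iv(bins):
    """bins: list of (n_goods, n_bads) per coarse bin. Returns (woe_per_bin, iv)."""
    total_goods = sum(goods for goods, _ in bins)
    total_bads = sum(bads for _, bads in bins)
    woes, iv = [], 0.0
    for goods, bads in bins:
        pct_goods = goods / total_goods      # share of all goods in this bin
        pct_bads = bads / total_bads         # share of all bads in this bin
        woe = math.log(pct_bads / pct_goods)
        woes.append(woe)
        iv += (pct_bads - pct_goods) * woe   # every term is non-negative
    return woes, iv

def iv_band(iv):
    """Map an IV to the predictive-power bands listed above."""
    if iv < 0.02:
        return "Not Predictive"
    if iv < 0.1:
        return "Weak"
    if iv < 0.3:
        return "Medium"
    if iv < 0.5:
        return "Strong"
    return "Suspicious"

# Hypothetical counts for a three-bin variable: (goods, bads) per bin.
woes, iv = woe_iv([(400, 200), (600, 100), (1000, 50)])
print([round(w, 3) for w in woes], round(iv, 3), iv_band(iv))
```

With this sign convention a bin with a positive WoE holds a larger share of bads than of goods, so higher WoE means higher risk.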

Correlations among Variables
The following sets of variables were found to be highly correlated:

• ‘tot_derog’, ‘Tot Der/Total Trade’ and ‘Tot Der/Age Old’
• ‘tot_revo_line’ and ‘tot_revo_line/tot_revo_trade’
• ‘Tot_revo_debt’, ‘tot_revo debt/tot_income’ and ‘tot_revo_debt/tot_revo_trade’
• ‘LTI’ and ‘DTIMod’
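
A pairwise screen like this can be sketched as follows. The 0.7 threshold is an assumption (no cutoff is stated here), the helper names are hypothetical, and the toy values attached to the variable names are invented purely so the snippet runs.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def highly_correlated_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs

# Invented values; only the variable names come from the deck.
columns = {
    "LTI": [1.0, 2.0, 3.0, 4.0, 5.0],
    "DTIMod": [1.1, 2.0, 3.2, 3.9, 5.1],
    "LTV": [3.0, 1.0, 4.0, 1.0, 5.0],
}
pairs = highly_correlated_pairs(columns)
print(pairs)
```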

Variance Inflation Factor
Correlation maps only investigate correlations between pairs of variables, while the Variance Inflation Factor (VIF) looks at whether a variable can be explained by a linear combination of all the other variables. It regresses each variable on all the other variables and calculates VIF = 1/(1 - R^2), where R^2 is the R-squared of that regression.
Generally, a VIF below 5 is considered acceptable.
From the four sets of highly correlated variables, we keep the one with the highest IV in each set and drop the rest.
We recompute the VIFs of the remaining variables and now find the values to be acceptable.
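
The VIF calculation can be sketched with NumPy as below. This is a minimal illustration on synthetic data, not the code used for the deck: the function name and the simulated columns are assumptions. Each column is regressed on the others (with an intercept) by least squares, and VIF = 1/(1 - R^2) is computed from that fit.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    result = []
    for j in range(p):
        y = X[:, j]                                  # column being explained
        others = np.delete(X, j, axis=1)             # all remaining columns
        A = np.column_stack([np.ones(n), others])    # add an intercept term
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()             # R^2 of this regression
        result.append(1.0 / (1.0 - r2))
    return result

# Synthetic example: c is almost a copy of a, b is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vifs = vif(np.column_stack([a, b, c]))
print([round(v, 2) for v in vifs])
```

In this example the near-duplicate columns get a VIF far above 5 while the independent column stays close to 1, which is the pattern that motivates dropping one variable from each correlated set.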

Logistic Regression
We run the logistic regression on the remaining variables and analyse the results.
• Because the values of all variables were replaced with WoE = ln(%bads/%goods) during the data preprocessing stage, the coefficients of all variables in the model are positive, i.e. higher values mean higher PD, as expected.
• The log-likelihood of the full model is greater than the log-likelihood of the null model, as expected.
• R-squared values between 0.2 and 0.4 indicate good explained variance for logistic regression. However, the R-squared of our model is slightly below 0.2, which indicates lower explained variance.
• P-values below 5% for many coefficients show that those variables are statistically significant.
• While some variables in our model have p-values above 5%, reflecting their statistical insignificance, the model as a whole has a p-value below 0.05, and hence the variables are jointly statistically significant.
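
The log-likelihood comparison and the R-squared figure can be illustrated as follows. The sketch assumes the R-squared in question is McFadden's pseudo R-squared, 1 - LL_full/LL_null, which is an assumption since the deck does not name the measure. The outcomes and predicted PDs are hand-picked toy values, not output of the fitted model.

```python
import math

def log_likelihood(y_true, p_pred):
    """Bernoulli log-likelihood of the outcomes under the predicted PDs."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(y_true, p_pred))

def mcfadden_r2(y_true, p_pred):
    """Return (pseudo_r2, ll_full, ll_null); the null model predicts the base rate."""
    base_rate = sum(y_true) / len(y_true)
    ll_full = log_likelihood(y_true, p_pred)
    ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
    return 1.0 - ll_full / ll_null, ll_full, ll_null

# Toy data (hypothetical): 1 = default, 0 = non-default.
y = [0, 0, 0, 1, 0, 1, 1, 0]
p = [0.10, 0.20, 0.15, 0.70, 0.30, 0.60, 0.80, 0.25]
r2, ll_full, ll_null = mcfadden_r2(y, p)
lr_statistic = 2.0 * (ll_full - ll_null)   # likelihood-ratio test statistic
print(round(r2, 3), round(ll_full, 3), round(ll_null, 3), round(lr_statistic, 3))
```

The likelihood-ratio statistic is what the whole-model p-value is based on: it is compared against a chi-square distribution with as many degrees of freedom as there are predictors.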

Final Model
We re-ran the model after dropping variables with p-values above 0.05.
• After dropping these variables, AIC and BIC both decreased, showing that the dropped variables contributed more to model complexity than to explanatory power.
• However, the p-values of ‘rev_util’ and ‘down_pay/purch_price’ are still above 0.05.
• We also tried dropping these two variables, but that led to an increase in AIC and BIC.
• Furthermore, these variables are important from a business perspective, and hence we have kept them in our final model.
• The p-value of the model as a whole is below 0.05, which shows that the variables are jointly statistically significant.
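
The AIC/BIC comparison works as sketched below, using the standard definitions AIC = 2k - 2LL and BIC = k*ln(n) - 2LL, where k is the number of estimated parameters, n the number of observations and LL the log-likelihood. The log-likelihoods and parameter counts are hypothetical; only n = 4064 (the training set size) comes from the deck.

```python
import math

def aic(log_lik, n_params):
    """Akaike Information Criterion: lower is better."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion: penalises parameters more as n grows."""
    return n_params * math.log(n_obs) - 2 * log_lik

n_obs = 4064                       # training records
full = (-1800.0, 15)               # hypothetical (log-likelihood, parameters)
reduced = (-1803.0, 9)             # hypothetical: slightly worse fit, fewer terms

aic_full, aic_reduced = aic(*full), aic(*reduced)
bic_full, bic_reduced = bic(*full, n_obs), bic(*reduced, n_obs)
print(aic_full, aic_reduced, round(bic_full, 1), round(bic_reduced, 1))
```

Dropping variables always lowers the log-likelihood a little, so AIC and BIC fall only when that loss is smaller than the penalty saved, which is the argument made in the first bullet above.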

Model Validation
We used the Test Dataset inputs to predict PD values and compared them with the actual default status to validate the results.
Area under the curve (AUC) measures the discriminatory power of the model, i.e. whether the model is able to distinguish good borrowers from bad ones.
While a random model has AUC = 0.5, a value above 0.7 is considered acceptable. The higher the AUC, the better.
AUC (Our Model) = 0.76
Gini Coefficient = 2*AUC - 1 = 0.52 (Our Model)
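
One way to see what AUC measures is to compute it directly as the probability that a randomly chosen defaulter receives a higher PD than a randomly chosen non-defaulter, counting ties as one half. The sketch below does this on invented scores; the function names are illustrative and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (defaulter, non-defaulter) pairs ranked correctly."""
    bads = [s for y, s in zip(y_true, scores) if y == 1]
    goods = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for bad_score in bads:
        for good_score in goods:
            if bad_score > good_score:
                wins += 1.0            # defaulter ranked as riskier: correct
            elif bad_score == good_score:
                wins += 0.5            # tie counts as half
    return wins / (len(bads) * len(goods))

def gini(auc):
    """Gini coefficient derived from AUC."""
    return 2.0 * auc - 1.0

# Invented labels (1 = default) and predicted PDs.
y = [0, 0, 1, 0, 1, 1, 0, 1]
pd_scores = [0.10, 0.30, 0.35, 0.40, 0.60, 0.20, 0.15, 0.80]
auc = roc_auc(y, pd_scores)
print(auc, gini(auc), round(gini(0.76), 2))
```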

KS Statistic
Our model gives low PD values to non-defaulters and relatively high PD values to defaulters.
Cumulatively, bad borrowers are given high PDs and are bunched on the right.
We take the KS cutoff value of 0.22 as the PD cutoff: a PD above 0.22 is classified as a default.
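
The KS cutoff can be found as sketched here: for each candidate PD threshold, compare the cumulative share of non-defaulters and of defaulters at or below it, and keep the threshold where the gap is widest. The data are invented and the function name is hypothetical; on the real test set the same procedure would produce the cutoff reported above.

```python
def ks_statistic(y_true, pd_scores):
    """Return (ks, cutoff): the largest gap between the cumulative
    distributions of goods and bads, and the PD at which it occurs."""
    n_bad = sum(y_true)
    n_good = len(y_true) - n_bad
    best_gap, best_cutoff = 0.0, None
    for t in sorted(set(pd_scores)):
        cum_good = sum(1 for y, s in zip(y_true, pd_scores)
                       if y == 0 and s <= t) / n_good
        cum_bad = sum(1 for y, s in zip(y_true, pd_scores)
                      if y == 1 and s <= t) / n_bad
        gap = abs(cum_good - cum_bad)
        if gap > best_gap:
            best_gap, best_cutoff = gap, t
    return best_gap, best_cutoff

# Invented labels (1 = default) and predicted PDs.
y = [0, 0, 0, 0, 0, 1, 1, 1, 1]
pd_scores = [0.05, 0.10, 0.15, 0.20, 0.45, 0.12, 0.30, 0.50, 0.70]
ks, cutoff = ks_statistic(y, pd_scores)
predicted_default = [1 if s > cutoff else 0 for s in pd_scores]
print(round(ks, 2), cutoff, predicted_default)
```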

Model Performance Metrics

Metric          Value
ROC_AUC         0.76
Accuracy Score  0.69
Precision       0.36
Recall Score    0.72
F1 Score        0.48

Confusion Matrix

                    Predicted Non Defaults  Predicted Defaults
Actual Non Default  962                     448
Actual Default      96                      257
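
The reported metrics follow from the confusion matrix. The sketch below recomputes them, reading the matrix with actual classes in rows and predicted classes in columns and treating "Default" as the positive class. The variable names are illustrative.

```python
import math

# Cells of the confusion matrix, with "Default" as the positive class.
tn, fp = 962, 448    # actual non-defaults: predicted non-default / default
fn, tp = 96, 257     # actual defaults:     predicted non-default / default

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)        # share of predicted defaults that defaulted
recall = tp / (tp + fn)           # share of actual defaults that were caught
f1 = 2 * precision * recall / (precision + recall)

def truncate2(x):
    """Truncate (not round) to two decimals."""
    return math.floor(x * 100) / 100

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F1", f1)]:
    print(f"{name}: {value:.4f} (truncated: {truncate2(value)})")
```

Truncated to two decimals, these match the reported 0.69, 0.36, 0.72 and 0.48. The low precision alongside high recall reflects the choice of a fairly low PD cutoff: most defaulters are caught at the cost of flagging many non-defaulters.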
