
Bank Loan Default Prediction

FINAL REPORT

BY:
Pranav v
PGP-BABI June ‘20

CONTENTS

1. Introduction
   1.1. Problem
   1.2. Solution
   1.3. Need for the Study/Project
   1.4. Business Understanding/Scope
   1.5. Business Hypothesis
2. Data Report
   2.1. Data collection
   2.2. Data dictionary
   2.3. Data summary
3. EDA (Exploratory Data Analysis)
   3.1. Missing value treatment
   3.2. Univariate analysis
   3.3. Bi-variate analysis
   3.4. Time series plot
   3.5. Data transformation
   3.6. Outliers
   3.7. Multicollinearity and significant variables
   3.8. Addition of new variables
   3.9. Insights from EDA
4. Model Building
   4.1. Check for Balance of data
   4.2. CART
        4.2.1 Interpretation
        4.2.2 Evaluation (Train)
        4.2.3 Evaluation (Test)
   4.3. Logistic Regression
        4.3.1 Interpretation
        4.3.2 Evaluation (Train)
        4.3.3 Evaluation (Test)
   4.4. Naive Bayes
        4.4.1 Interpretation
        4.4.2 Evaluation (Train)
        4.4.3 Evaluation (Test)
5. Model Comparison
6. Model Tuning
   6.1. Random Forest
        6.1.1 Interpretation
        6.1.2 Evaluation (Train)
        6.1.3 Evaluation (Test)
   6.2. Boosting
        6.2.1 Interpretation
        6.2.2 Evaluation (Train)
        6.2.3 Evaluation (Test)
7. Best Model
8. Conclusion
9. Actions and recommendations
APPENDIX
1.Introduction
Across the globe, the banking sector acts as a catalyst for a country's economy.
Banks play a vital role in providing financial resources to both corporates and
individuals, and lending is one of a bank's key functions. All banks try to devise
effective business strategies to persuade customers to apply for their loans.
However, some customers do not honour their obligations after their applications
are approved: the loan is not repaid as per the defined schedule, leading to
financial loss for the bank.

The two most critical questions in the lending industry are:

1) How risky is the borrower?

2) Given the borrower's risk, should we lend to him/her?


The answer to the first question determines the interest rate offered to the borrower.
The interest rate measures, among other things (such as the time value of money), the
riskiness of the borrower: the riskier the borrower, the higher the interest rate. With
the interest rate in mind, we can then determine whether the borrower is eligible for the loan.

Prior assessment of likely defaulters is one of the most important concerns of banks,
both for survival in a highly competitive market and for profitability. Most banks use
their own credit scoring models and risk assessment techniques to analyse a loan
application and decide whether to approve it. Despite this, financial institutions
face huge capital losses when borrowers default on their loans. When a borrower is
unable to pay the interest on time and is also unable to return the principal amount,
the bank declares that amount as non-performing.

1.1.Problem
Loans are the most important product offered by banking financial institutions, and
banks apply different marketing strategies, more often than not, to sell them.
Nowadays, banks scrutinize each loan application to identify potential loan
default cases, so that they can predict which client is going to default on the loan
repayment and at which stage. Based on these predictions, banks try to identify
customers who have a high probability of defaulting and take concerted
action before they default.

1.2.Solution
Hence, there is a need to build a model to predict loan default, which will help the
bank take the required actions, including:
 Avoiding the exposure
 Intensifying collection efforts
 Initiating the sale of collateral
 Avoiding certain customer or product segments

1.3.Need for the Study/Project


 Retail lending is an important division of any large bank; it helps the
bank grow at a rapid pace by earning significant fee income and interest
income while also bringing in savings and current account balances.
 In some banks, it contributes more than 70% of the bank's assets and
revenues.
 In some cases, higher delinquency has even led to a bank run and the closure of a
bank. A recent example is YES Bank, where lending to risky
customers led to unprecedented intervention by the RBI and impacted the brand
image of the bank.
1.4.Business Understanding/Scope
Studies on loan default prediction have been undertaken by many enthusiastic
researchers and data analysts using various tools and techniques. This study does not
restrict its objective to the mere prediction of a defaulter; it extends to the
further objectives listed below.
The primary objectives of the study are:
 Extract patterns from a dataset of approved loans and predict the likely loan
defaulters using classification techniques.
 Identify the important or high-impact business attributes that lead to loan
defaults, thus providing actionable insights to the bank.
 Segment customers based on their credit history and other vital variables in
the available data to check for specific behavioural patterns.

1.5.Business Hypothesis
The assignment aims to predict the customers who will default on payment based
on their personal and professional details.
Null hypothesis (H0): No predictor is available to predict default.
Alternate hypothesis (HA): There is at least one independent variable that can
predict default.
2.Data Report
Exploring the dataset, we find 41 variables, broadly divided into two
categories: (1) demographic variables, containing details about customers, and (2)
loan variables, containing details about the loan sanctioned.
The data collected covers loans sanctioned between Dec 2011 and Jan 2015 by the
bank to various customers. Demographic variables include customer details such as
annual income, desc and home ownership, whereas loan variables include variables
such as loan amount, funded amount, payment term, principal received till date etc.
2.1.Data collection
The dataset contains about 226786 observations spread across 41 variables. 25
variables are continuous, whereas 16 variables are categorical, with various
categories defined for each.
The basic summary and the type of each variable under study are provided below to
build domain knowledge of the variables.
2.2.Data dictionary
1. member_id (Continuous): A unique ID for the borrower member. Summary: unique IDs.
2. loan_amnt (Continuous): The listed amount of the loan applied for by the borrower. If at some point in time the credit department reduces the loan amount, it is reflected in this value. Summary: minimum loan applied $500; maximum loan applied $35000.
3. funded_amnt (Continuous): The total amount committed to that loan at that point in time. Summary: minimum amount sanctioned 500; maximum amount sanctioned 35000. Only 0.24% of the loan amount requested by applicants has not been fully sanctioned by the bank.
4. funded_amnt_inv (Continuous): The total amount committed by investors for that loan at that point in time. Summary: minimum 0; maximum 35000.
5. term (Categorical): The number of payments on the loan. Values are in months and can be either 36 or 60. Summary: 2 levels; 36 months (70%), 60 months (30%).
6. int_rate (Categorical): Interest rate on the loan. Summary: minimum 5.32%; maximum 28.99%.
7. installment (Continuous): The monthly payment owed by the borrower if the loan originates. Summary: minimum installment paid $15.69; maximum installment paid $1409.99.
8. grade (Categorical): Assigned loan grade. Summary: 7 levels (A, B, C, D, E, F, G).
9. emp_length (Categorical): Employment length in years. Possible values are between 0 and 10, where 0 means less than one year and 10 means ten or more years. Summary: ranges from < 1 year to 10+ years, mostly 10+ years; 11 levels plus an NA category.
11. annual_inc (Continuous): The self-reported annual income provided by the borrower during registration. Summary: minimum $3000; maximum $8900060.
12. verification_status (Categorical): Income verification status. Summary: 3 levels (Not Verified, Source Verified, Verified).
13. issue_d (Categorical): The month in which the loan was funded. Summary: unique dates when the loan was issued.
14. pymnt_plan (Categorical): Indicates if a payment plan has been put in place for the loan. Summary: n = 226780; y = 6.
15. desc (Categorical): Loan description provided by the borrower. Summary: only 14% of observations have comments from applicants explaining the reason for taking the loan.
16. purpose (Categorical): A category provided by the borrower for the loan request. Summary: 14 levels (car, credit_card, debt_consolidation, educational, home_improvement, house, major_purchase, medical, moving, other, renewable_energy, small_business, vacation, wedding).
17. addr_state (Categorical): The state provided by the borrower in the loan application. Summary: different states.
18. dti (Continuous): A ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income. Summary: minimum 0; maximum 59.26.
19. delinq_2yrs (Continuous): The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years. Summary: minimum 0; maximum 29.
20. earliest_cr_line (Categorical): The month the borrower's earliest reported credit line was opened. Summary: unique values.
21. inq_last_6mths (Continuous): The number of inquiries in the past 6 months (excluding auto and mortgage inquiries). Summary: minimum 0; maximum 8.
22. mths_since_last_delinq (Continuous): The number of months since the borrower's last delinquency. Summary: minimum 0; maximum 151; NAs 124638.
23. open_acc (Continuous): The number of open credit lines in the borrower's credit file. Summary: minimum 0; maximum 76.
24. revol_bal (Continuous): Total credit revolving balance. Summary: minimum 0; maximum 1743266.
25. revol_util (Continuous): Revolving line utilization rate, i.e. the amount of credit the borrower is using relative to all available revolving credit. Summary: minimum 0; maximum 892.
26. total_acc (Continuous): The total number of credit lines currently in the borrower's credit file. Summary: minimum 2; maximum 150.
27. out_prncp (Continuous): Remaining outstanding principal for the total amount funded. Summary: minimum 0; maximum 35000.
28. out_prncp_inv (Continuous): Remaining outstanding principal for the portion of the total amount funded by investors. Summary: minimum 0; maximum 35000.
29. total_pymnt (Continuous): Payments received to date for the total amount funded. Summary: minimum 0; maximum 57778.
30. total_pymnt_inv (Continuous): Payments received to date for the portion of the total amount funded by investors. Summary: minimum 0; maximum 57778.
31. total_rec_prncp (Continuous): Principal received to date. Summary: minimum 0; maximum 35000.
32. total_rec_int (Continuous): Interest received to date. Summary: minimum 0; maximum 22777.6.
33. total_rec_late_fee (Continuous): Late fees received to date. Summary: minimum 0; maximum 286.74.
34. recoveries (Continuous): Post charge-off gross recovery. Summary: all values are 0.
35. collection_recovery_fee (Continuous): Post charge-off collection fee. Summary: all values are 0.
36. last_pymnt_d (Categorical): Last month in which a payment was received. Summary: unique values.
37. last_pymnt_amnt (Continuous): Last total payment amount received. Summary: minimum 0; maximum 36475.6.
38. next_pymnt_d (Categorical): Next scheduled payment date. Summary: unique values.
39. last_credit_pull_d (Categorical): The most recent month in which the bank pulled credit for this loan. Summary: unique values.
40. application_type (Categorical): Indicates whether the loan is an individual application or a joint application with two co-borrowers. Summary: JOINT 0.002%; INDIVIDUAL 99.94%.
41. loan_status (Categorical): Current status of the loan. Summary: Default 8.4%; Paid 91.59%.
Table 1

2.3.Data summary

Customer Information:

 The dataset contains a variety of customer information such as employment
length, annual income and type of home ownership.
 Employment length tells us how long an applicant has been employed; it
ranges from < 1 year to more than 10 years. The largest group of applicants,
175105 in total (33% of all applicants), has been employed for more than
10 years.
 The annual income of the applicants varies from 1200 to 9500000, which helps
us gauge their financial strength. The average annual income of the applicants
is $75,030. 50% of the applicants have their home ownership as Mortgage and
40% of them are renters.
 The majority of applicants are from the states of California and New York.

Loan Details:

The dataset also conveys the purpose for which each loan is taken; the statistics are
as below:

 58.86% of the total applicants have stated the purpose of the loan as debt
consolidation. Nearly 132971 applicants have taken a loan for debt
consolidation.
 20.16% of the applicants have marked the purpose of the loan as credit card;
123670 applicants have taken a loan to pay off their credit card dues.
 Debt consolidation and credit card dues have the highest loan amounts
sanctioned, followed by home improvement.
 There are fewer takers of loans for educational purposes, weddings and
renewable energy, but these have higher defaults.
 The data shows that 70% of the applicants opted for a loan tenure of 36 months,
whereas 30% opted for a 60-month loan tenure.
 More than a third of the loans have not been verified, which is an alarming sign
for the bank; the verification process should be followed diligently to
reduce defaults.
 About 80% of the loans are for 3 years and the remaining 20% are for five years, so
the loans are not very long-term.
 Missing values (blank observations) are present in 2 variables: months since
last delinquency and revolving line utilization. Treatment is done in the
EDA section.
 There are n/a values marked in some variables, and it is assumed that data is not
available for them. We keep these values in the model and proceed with the
analysis.

3.EDA(Exploratory data analysis)
3.1.Missing value treatment
Our dataset has 226786 observations of 41 variables.

With 226786 observations and 41 columns in the dataset, it is impractical to inspect
each column one by one to find NA or missing values.
So we first find all columns where missing values exceed a certain percentage,
say 15%, as well as columns that will not add any value or insight to our study.
We then remove those columns, since it is not feasible to impute missing values
for them.
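A quick way to run this check (a minimal sketch, assuming the data frame is loaded as Data, the name used in the Appendix):

# Percentage of missing values per column; flag columns above the 15% threshold.
missing_pct <- colSums(is.na(Data)) / nrow(Data) * 100
sort(missing_pct[missing_pct > 15], decreasing = TRUE)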

Figure 1

VARIABLE, MISSING COUNT and TREATMENT:

 recoveries: all values are 0; removed.
 collection_recovery_fee: all values are 0; removed.
 pymnt_plan: only 6 rows (4 Default and 2 Fully Paid) have the value 'y'; removed.
 mths_since_last_delinq: nearly 54% missing values; removed (explained in the missing value section).
 desc: 152077 missing values; from our study of the data we understand that it is only a free-text elaboration of the 'purpose' variable, so it is dropped as redundant information.

Table 2

In summary, the columns next_pymnt_d, mths_since_last_delinq, recoveries
(all 0's), collection_recovery_fee (all 0's), pymnt_plan and desc are irrelevant
for the analysis and can be removed.

3.2.Univariate analysis
We will perform univariate analysis both for continuous and categorical variables

CONTINUOUS:

Figure 2

 The general observation is that the variables are skewed due to the presence
of outliers. These outliers might be realistic data with no error.
 We need to deliberate further on the skewness and kurtosis values of each
variable and check whether the picture presented by the histogram is correct;
a quick check is sketched below.
 Skewness reflects the symmetry of the data, i.e. whether the data
distribution is the same on the right and left-hand side of the centre point.
Kurtosis depicts how heavy the tails of the distribution are compared with a
normal distribution; the heavy tails here are due to the presence of outliers.
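The check referred to above can be run as follows (a minimal sketch using the e1071 package, which is already used later for Naive Bayes; the column selection is illustrative):

# Skewness and kurtosis of every numeric column, to compare against the histograms.
library(e1071)
num_cols <- sapply(Data, is.numeric)
round(sapply(Data[, num_cols], skewness, na.rm = TRUE), 2)
round(sapply(Data[, num_cols], kurtosis, na.rm = TRUE), 2)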

Figure 3

 Loans are mostly taken for the short term, with 80% of loans taken for a 36-month
tenure.
 Employment tenure peaks at 10+ years, showing that the majority of
customers who have taken loans are experienced and well placed, so the chance
of default should be lower than for freshers or less experienced applicants.

CATEGORICAL:

Figure 4

INFERENCES:

 Loan applications are almost entirely individual; there are only 6 joint cases.
The bank should check whether a joint application can help reduce default rates
and accordingly emphasise joint accountability based on the loan amount.
 The main purpose of the loans taken is debt consolidation (60%),
followed by credit card (20%).
 Many loans have not been verified by the bank, which could be a possible
reason for default.
 Defaulted loans are far fewer than non-defaulted loans, so we need to
balance the dataset before proceeding further with model building.

3.3.Bi-variate analysis
Loan Amount v/s Purpose:

Figure 5

It can be inferred that most people have obtained a loan for debt consolidation
and credit card, and the fewest for education.
Employment length v/s Loan status:

Figure 6

**Tableau generated graphs

It can be inferred from figure 6 that defaulting on the loan is seen more among
people with 10+ years of work experience.

Grade v/s Loan status:

Figure 7

It can be inferred that defaulting on the loan is seen to be higher among borrowers of grade C.
**Tableau generated graph

Grade v/s Loan status v/s Interest Rate:

Figure 8

We can derive a few points from the plot (figure 8):

 the interest rate increases as the grade worsens
 a few loans seem to have an equally low interest rate independent of grade
 the spread of rates seems to increase as the grade worsens
 there tend to be more outliers on the lower end of the rate
 the 3-year term has a much higher number of highly rated borrowers, while
the 5-year term has a larger number in the low-rating grades

Grade v/s Loan Amount:

Figure 9

We can derive a few points from the plot:

 there is not a lot of difference between default and non-default


 lower quality loans tend to have a higher loan amount
 there are virtually no outliers except for grade B
 the loan amount spread (IQR) seems to be slightly higher for lower quality
loans.

Grade v/s Interest Rate v/s Term:

Figure 10

We can derive a few points from the plot:

 the interest rate increases as the grade worsens
 a few loans seem to have an equally low interest rate independent of grade
 the spread of rates seems to increase as the grade worsens
 there tend to be more outliers on the lower end of the rate
 the 3-year term has a much higher number of highly rated borrowers, while the 5-year term
has a larger number in the low-rating grades

Loan Amount v/s Funded Amount v/s Amount Funded by Investors:

Figure 11

**Tableau generated graph

We can derive a few points from the plot (Figure 11):

 there are instances where the funded amount is smaller than the loan amount
 there seem to be a number of loans where the investor-funded amount is smaller
than the funded amount, i.e. not the full loan is invested in

3.4.Time series plot


Let's take a look at interest rates over time, splitting the time series by grade to see
whether interest rate development differs depending on the borrower grade.

Figure 12

For a better understanding, let us create separate plots.
**Tableau generated graph

Figure 13

We can derive a few points from the plot:

 the mean interest rate is falling or relatively constant for high-rated clients
 the mean interest rate is increasing significantly for low-rated clients

Issue Date vs Grade vs Loan Amount:

Figure 14

We can derive a few points from the plot:

 the mean loan amount is increasing for all grades


 while high-rated clients have some mean loan amount volatility, it is
much higher for low-rated clients.

**Tableau generated graph

3.5.Data transformation

Figure 15

INFERENCES:
 Changed NA values of emp_length to the < 1 year category.
 Clubbed NONE observations with OTHER for the home_ownership variable.
 Clubbed Source Verified with Verified for the verification_status variable.
 Recoded loan_status into a Default flag, where 0 = Fully Paid and
1 = Default (see the sketch after this list).
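A minimal sketch of these recodes (column names follow the data dictionary; the exact level labels such as "Fully Paid" and "< 1 year" are assumptions, since readxl loads these columns as character strings):

# Recode NA employment length into the "< 1 year" category.
Data$emp_length <- as.character(Data$emp_length)
Data$emp_length[is.na(Data$emp_length)] <- "< 1 year"
# Club NONE with OTHER for home ownership.
Data$home_ownership[Data$home_ownership == "NONE"] <- "OTHER"
# Club Source Verified with Verified.
Data$verification_status[Data$verification_status == "Source Verified"] <- "Verified"
# Encode the target: 0 = Fully Paid, 1 = Default (level label assumed).
Data$loan_status <- ifelse(Data$loan_status == "Fully Paid", 0, 1)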
3.6.Outliers

Figure 16

 Outliers are observations not typical of the dataset and might be
captured in error. We used the standard deviation as the cut-off for judging
outliers in the dataset.
 Outstanding principal, outstanding principal (investors), total payment,
total payment (investors), total principal received, total interest received,
total late fees, last payment date, last payment amount and next payment date
all have outliers.
 The boxplots confirm the same, but these values are specific to the loans
sanctioned and can legitimately vary.
 Annual income is also specific to individuals and can vary.

So no treatment is done for outliers, and all values are assumed to be genuine
parts of the captured data.

3.7.Multicollinearity and significant variables

Figure 16

Figure 17
The above figure shows the variables with high correlation; these can be removed
after further analysis of VIF. A quick correlation screen is sketched below.
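A minimal sketch of such a correlation screen (the 0.8 cut-off is an assumption; the VIF check itself is done with car::vif in the Appendix):

# List pairs of numeric predictors with absolute correlation above 0.8.
num_cols <- sapply(Data, is.numeric)
cor_mat <- cor(Data[, num_cols], use = "pairwise.complete.obs")
high_cor <- which(abs(cor_mat) > 0.8 & upper.tri(cor_mat), arr.ind = TRUE)
data.frame(var1 = rownames(cor_mat)[high_cor[, 1]],
           var2 = colnames(cor_mat)[high_cor[, 2]],
           corr = round(cor_mat[high_cor], 2))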

IMPORTANT VARIABLES FOR THE STUDY AND FOR MODEL CREATION:

 term
 int_rate
 installment
 grade
 emp_length
 dti
 issue_d
 revol_bal, revol_util
 total_rec_int
 last_pymnt_d
 last_pymnt_amnt
 last_credit_pull_d

3.8.Addition of new variables


All the key variables required for building the model, like the debt-to-income ratio or
fixed obligations to income ratio, revolving balance utilization rate, outstanding
principal, past delinquency etc., are available in the given dataset.
An important variable not available in the dataset is the collateral value, which the
bank might have taken for the loan. If the collateral covers more than 100% of the loan
amount, it gives the bank comfort in sanctioning specific loans.
Hence, no new variable is required to be created.

3.9.Insights from EDA(Exploratory data analysis)
 The main purpose of the loans taken is debt consolidation (60%),
followed by credit card (20%).
 Employment tenure peaks at 10+ years, showing that the majority of
customers who have taken loans are experienced and well placed, so the chance
of default should be lower than for freshers or less experienced applicants.
 Interest rates are higher for defaulters, reflecting that when
creditworthiness is not good a loan might still be available, but at a higher
rate of interest.
 The default per cent among loans repaid over 60 months is higher than
among 36-month loans, so short-term loans are less prone to default.
 Grade A has the fewest default cases: the better the loan grade, the lower
the chance of default.
 Default cases are more frequent among verified loan applications, which
reflects that more stringent verification is required by the bank.
 We need to remove the multicollinearity either by
 dropping variables from the analysis,
 using PCA to combine variables, or
 using ratios to combine variables.

 Bartlett's test of sphericity does not give a p-value that supports
performing PCA, so the use of ratios and the dropping of variables will be tried
first before proceeding with model building. We can also use clustering to group
customers with similar characteristics and predict default accordingly.
 New variable creation and PCA feasibility will be assessed during the next
phase of model building.

Statistical significance of variables - insights

To find the statistical significance of the variables we adopted a 4-pronged approach:
• Build a logistic model on a sample of the raw data provided and
capture the p-values. However, no variables came out significant due to the
presence of correlation between the variables.
• Multicollinearity check (VIF < 10): the VIF score depicts whether
multicollinearity exists between the independent variables.
Higher values indicate the presence of correlation between
independent variables, in which case the model's predictions are not accurate.
• Boruta importance matrix: Boruta helps to finalize the feature selection.
It runs an importance analysis and eliminates the least important variables
(a minimal sketch follows after this list).
• Chi-square test / ANOVA: identifies which variables are statistically
important based on the ANOVA test.
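A minimal sketch of the Boruta run (assumes the Boruta package; on a dataset of this size the run is slow, so it would typically be done on a sample of the balanced training data):

library(Boruta)
set.seed(123)
boruta_out <- Boruta(loan_status ~ ., data = traindf, doTrace = 0)
print(boruta_out)
getSelectedAttributes(boruta_out, withTentative = FALSE)  # confirmed important variables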

4.Model Building
4.1. Check for Balance of data
Is the data unbalanced?
Yes, the data is unbalanced, with only 8% of observations in the default class and the
remaining 92% non-default. We use SMOTE during model building to oversample the default
cases and balance the data before proceeding with model building (see the sketch below).
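A minimal sketch mirroring the Appendix code (DMwR::SMOTE with the same oversampling settings):

library(DMwR)
set.seed(101)
traindf <- SMOTE(loan_status ~ ., TrainData, perc.over = 200, perc.under = 150)
prop.table(table(traindf$loan_status))  # class mix after oversampling the default cases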

Figure 18

The above figure shows that both the train and test datasets are unbalanced, so we
apply SMOTE to balance the training data.

Figure 19

This gives us a balanced training dataset.

4.2.CART
 We use CART because it works like an if-else clause, in which every variable
is considered when studying the outcome of the dependent variable, here
loan_status.
 CART helps in assigning a particular value of the dependent variable
loan_status (Default or Fully Paid) to a class based on the various predictors.
 Classification trees are used when the dataset needs to be split into the
classes of the response variable.
 Non-zero variance: we have verified that there is no variable with zero
variance, hence we use all the variables for CART.
 Minsplit: 900, minbucket: 300, xval: 10 (a minimal build sketch follows below).
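The build sketch below mirrors the Appendix code with the control parameters stated above:

# Grow the full classification tree on the balanced training data.
library(rpart)
library(rpart.plot)
tree_control <- rpart.control(minsplit = 900, minbucket = 300, cp = 0, xval = 10)
cart_model <- rpart(loan_status ~ ., data = traindf, method = "class", control = tree_control)
rpart.plot(cart_model)
cart_model$variable.importance  # variables driving the splits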
The output of the CART model:

Figure 20

On applying the CART model to our dataset we see that the important variables used
for tree construction are:
 issue_d
 last_credit_pull_d
 last_pymnt_amnt
 term
 total_rec_late_fee

4.2.1 Interpretation
Further, the tree is visualized as :

Figure 21
INSIGHTS DRAWN:
 48% of the customers have defaulted on the loan payment after the last credit
pull date.
 38% of the customers have not paid any late fee and have not defaulted; that
is, they pay their instalments before the due date, never default and pay no
late fees.
 1% of the customers with a term of 36 months have defaulted on the loan.
AFTER PRUNING:
To prune the tree, we find the best complexity parameter (cp) of the tree.
Cost-complexity pruning works by successively collapsing the node that produces
the smallest per-node increase in the error/cost, while at the same time weighing the
overall complexity (e.g. size) of the tree, to arrive at the pruned tree that
minimizes the cost-complexity function.
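One common rule for choosing the pruning point is sketched below (the report prunes at the cp value read off Figure 22; picking the cp with the lowest cross-validated error is an equivalent, automated way to do it):

printcp(cart_model)
best_cp <- cart_model$cptable[which.min(cart_model$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(cart_model, cp = best_cp)
rpart.plot(pruned_tree)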

Figure 22

We prune the tree at the cp value chosen from the cp plot (0.001985339 in the Appendix code) to avoid overfitting.

Figure 23

The purpose of pruning is to decrease the complexity of the tree based on the value
of cp. Finally, the number of splits is 8.

Figure 24

From the above CP table, it is understood that at the 0th split the relative error is 100%.
The default is 10-fold cross-validation; the cross-validation error at 0 splits is 100%
and the standard deviation is 0.3 per cent. At the 2nd split, the relative error has
decreased to 7.9%, the xerror has decreased to 8% and the standard deviation to 0.1%.
The reduction in rel error, xerror and xstd is observed at every step up to the 7th
split, with no increase in the values at any point. Therefore there is no need to prune
the tree further.
4.2.2 Evaluation(Train)

Figure 25

Please refer Appendix for Source Code of CART for AUC curve etc

4.2.3 Evaluation(Test)

Figure 26

Interpretation of CART Model:

The main variables used to split the nodes are last_pymnt_amnt, term, last_pymnt_d,
issue_d, last_credit_pull_d and total_rec_late_fee.
The specificity is high, which means there are few false positives.
The model is stable, as evident from the confusion matrices for the training and
testing datasets. Based on the test metrics we can interpret that:
 The model will catch 98% of the customers who will default.
 The model will catch 97% of the customers who will not default.
 Overall accuracy is 97%.
 Of the customers predicted to default, 98% will actually default.
 Of the customers predicted not to default, 97% will actually not default.

Please refer Appendix for Source Code of CART for AUC curve etc

4.3.Logistic Regression
Logistic regression is used to describe data and to explain the relationship between
one dependent binary variable and one or more nominal, ordinal, interval or ratio-
level independent variables.

Figure 27

4.3.1 Interpretation
From figure 27 we can see that:
Significant variables:
installment
issue_d
dti
revol_bal,revol_util
total_rec_late_fee,last_pymnt_d
last_pymnt_amnt,last_credit_pull_d.
Also, the AIC score is 12992. This will be tracked in subsequent stages when we
refine the model; the model with the lowest AIC score is the most preferred and
optimized one.
Evaluating model performance:
Model significance is checked using the likelihood ratio (log-likelihood) test.
In statistics, a likelihood ratio test is a statistical test used for comparing the
goodness of fit of two models, one of which (the null model) is a special
case of the other (the alternative model). The test is based on the likelihood
ratio, which expresses how many times more likely the data are under one
model than the other. This likelihood ratio, or equivalently its logarithm, can
then be used to compute a p-value, or compared to a critical value to decide whether
to reject the null model and hence accept the alternative model.

Figure 28

H0: All betas are zero
H1: At least one beta is non-zero
From the log-likelihoods we can see that with the intercept-only model -54465 of the
uncertainty (log-likelihood) was unexplained, while with the full model -6466 remains
unexplained.
So we can say that 1 - (-6466 / -54465) = 88.12% of the uncertainty inherent in the
intercept-only model is explained by the full model. The chi-square likelihood ratio
is significant.
Also, the p-value suggests that we can accept the alternate hypothesis that at
least one of the betas is not zero, so the model is significant.
Model robustness check
Now that we have concluded the model is significant, let's find out how robust it is
with the help of McFadden's pseudo-R-squared test.
McFadden's pseudo-R-squared: logistic regression models are fitted using the method
of maximum likelihood, i.e. the parameter estimates are the values which maximize
the likelihood of the observed data.
McFadden's R-squared measure is defined as 1 - log(Lc)/log(Lnull), where Lc
denotes the (maximized) likelihood value from the current fitted model and Lnull
denotes the corresponding value for the null model (the model with only an
intercept and no covariates).
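A minimal sketch of the same calculation from the fitted model (equivalent to the pscl::pR2 call in the Appendix):

# McFadden pseudo R-squared = 1 - logLik(full model) / logLik(intercept-only model).
ll_full <- as.numeric(logLik(lgmodel))
ll_null <- as.numeric(logLik(update(lgmodel, . ~ 1)))
1 - ll_full / ll_null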

Figure 29

McFadden's pseudo-R-squared suggests that about 88.12% of the uncertainty in the data
is explained by our model, which indicates a robust model.

Odds explanatory power
Let's look at the odds ratios and probabilities of the variables impacting
loan_status.

Figure 30

Figure 31

Multicollinearity check:
The solution of a regression problem becomes unstable in the presence of 2 or more
correlated predictors. Multicollinearity can be measured by computing the variance
inflation factor (VIF), which gauges how much the variance of a regression
coefficient is inflated due to multicollinearity.

Figure 32

From figures 31 and 32 we can see that the variables impacting loan status are
term, installment, emp_length and issue_d.

4.3.2 Evaluation(Train)

Figure 33

4.3.3 Evaluation(Test)

Figure 34

Interpretation of Logistic Regression Model:
The performance metrics for the train and test data are comparable, which tells us
that it is a good model.
The significant variables are last_pymnt_amnt, term, last_pymnt_d, issue_d,
last_credit_pull_d and total_rec_late_fee, which are used to create one final model.

The model is stable, as evident from the confusion matrices for the training and
testing datasets. Based on the test metrics we can interpret that:
 The model will catch 96.6% of the customers who will default.
 The model will catch 97.45% of the customers who will not default.
 Overall accuracy is 96%.
 Of the customers predicted to default, 97% will actually default.
 Of the customers predicted not to default, 97.4% will actually not default.

4.4.Naive Bayes
The e1071 package provides the naiveBayes function. It allows continuous and
categorical features to be used in the Naive Bayes model. It is a count-based
classifier, i.e. it essentially counts how often each variable's distinct values
occur for each class.
 The given dataset consists of both categorical and numerical variables.
 Naive Bayes works best with categorical values but can be made to work on
mixed datasets having continuous as well as categorical variables.
 Since this algorithm runs on conditional probabilities, it becomes hard to bin
the continuous variables, as they have no frequencies but lie on a continuous
scale.
 The model can be created with a mixture of categorical and numerical values,
but the accuracy is lower than when it is created with only categorical values.

Figure 35

4.4.1 Interpretation
 Naive Bayes works best with categorical values but can be made to work on
mixed datasets having continuous as well as categorical predictors.
 Since this algorithm runs on conditional probabilities, it becomes hard to bin
continuous variables, as they have no frequencies but lie on a continuous scale.
 For continuous variables, naiveBayes assumes a normal distribution within each
class and uses the class-wise mean and standard deviation of each predictor to
compute the likelihood of an observed value.
 For a binary classifier this is straightforward; with multinomial response
categories the continuous predictors can instead be discretized into quantiles
or deciles and probabilities assigned accordingly.
 Because of this treatment of mixed data, NB's accuracy here is always
questionable, and its findings and predictions need to be supported by other
classifiers before any actionable operations.
 The output of the NB model displays, in matrix format for each predictor, its
mean [,1] and standard deviation [,2] for class 1 and class 0.
 The independence of predictors (no multicollinearity) has been assumed for the
sake of simplicity.

4.4.2 Evaluation(Train)

Figure 36

4.4.3 Evaluation(Test)

Figure 37

INTERPRETATION:
 Naive Bayes works best with categorical values but can be made to work on
mixed datasets having continuous as well as categorical predictors.
 Since this algorithm runs on conditional probabilities, it becomes hard to bin
continuous variables, as they have no frequencies but lie on a continuous scale.
 Because of this, NB's performance on a mixed dataset is always questionable.
 Its findings and predictions need to be supported by other classifiers before
any actionable operations.
 The accuracy is lower when compared to the other models.

5.Model Comparison

Performance Measure    Logistic Regression    CART           Naïve Bayes
                       (Test Data)            (Test Data)    (Test Data)
Accuracy               96.7                   97.31          92.84
Sensitivity            96.69                  98.14          97.89
Specificity            97.45                  97.23          92.3
AUC                    99.15                  98.99          95.14
Gini                   88.01                  82.21          90.28
KS                     94.17                  95.38          91
 For Naïve Bayes, the base assumption is that the predictor variables are
independent and equally important. For our data, we have seen that the
predictors are correlated; hence, we can conclude that Naïve Bayes is not
giving reliable predictions.
 Because of how NB handles a mixed dataset, its accuracy is always
questionable, so we cannot rely on Naïve Bayes.
 For logistic regression, all variables need to be independent of each other.
 Further, the models can be tuned using the performance metrics for better
predictions.
 Out of logistic regression, CART and Naïve Bayes, the CART model has the
highest accuracy and sensitivity. Hence, we conclude that the CART model is
the best among the three.

6.Model Tuning
6.1.Random forest
We build the forest with 4 variables as candidates at each split and 1000 as the
minimum size of a terminal node.

We analyse the out-of-bag (OOB) error to find ntree; in our case it is around 190
(a minimal sketch follows below).
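A minimal sketch using the parameters stated above (the Appendix code uses slightly different mtry and nodesize values, so treat these numbers as illustrative):

# Fit the forest and read ntree off the OOB error curve, which flattens once extra trees stop helping.
library(randomForest)
set.seed(1000)
rf_model <- randomForest(loan_status ~ ., data = traindf,
                         ntree = 500, mtry = 4, nodesize = 1000, importance = TRUE)
plot(rf_model, main = "OOB error vs. number of trees")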

Figure 38

6.1.1 Interpretation
Tune Rf and Important Variables
We use the tuneRF function to get the mtry value and build the tuned random forest. As
per the result below, mtry = 4 has the minimum out-of-bag error.

Figure 39
Important Variables:
Based on the Mean Decrease Gini output, the top 4 variables for predicting whether a
customer will default on the loan are last_pymnt_amnt, issue_d, last_pymnt_d and
last_credit_pull_d.

Interpretation:

Figure 40

Figure 41

From the above figures we can see that last_pymnt_amnt, issue_d, last_pymnt_d
and last_credit_pull_d are the significant variables for prediction.

6.1.2 Evaluation(Train)

Figure 42

6.1.3 Evaluation(Test)

Figure 43

INTERPRETATION:
The model is stable, as evident from the confusion matrices for the training and
testing datasets.
Based on the test metrics we can interpret that:
1. The model will catch 98% of the customers who will default.
2. The model will catch 97% of the customers who will not default on the loan
payment.
3. Overall accuracy is 97%.
4. Of the customers predicted to default, 95% will actually default.
5. Of the customers predicted not to default, 99% will actually not default.

6.2.Boosting
Boosting is another ensemble algorithm, used to reduce both bias and variance in
supervised learning. In an ensemble algorithm, a set of weak learners is combined
to form a strong learner.
Boosting is geared more towards reducing bias, i.e. it works with simple or weak
learners. A weak learner is a learner which always learns something, i.e. it does
better than chance and has an error rate of less than 50%. The best example of a
weak learner is a decision tree, which is why we generally use ensemble techniques
on decision trees to improve their accuracy and performance.
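The report does not state which boosting implementation was used, so the sketch below assumes the gbm package purely for illustration (it also assumes the 0/1 coding of loan_status created in the data transformation step):

library(gbm)
train_gbm <- traindf
train_gbm$loan_status <- as.numeric(as.character(train_gbm$loan_status))  # gbm wants a 0/1 target for "bernoulli"
set.seed(1)
boost_model <- gbm(loan_status ~ ., data = train_gbm, distribution = "bernoulli",
                   n.trees = 500, interaction.depth = 3, shrinkage = 0.1, cv.folds = 5)
summary(boost_model)  # relative influence of each predictor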

Figure 44

6.2.1 Interpretation

Figure 45

From the above figure we can see that issue_d, last_pymnt_amnt, last_credit_pull_d
and last_pymnt_d are the significant variables for prediction.

6.2.2 Evaluation(Train)

Figure 46

Accuracy=99.51

Sensitivity=99.47

Specificity=99.54

Auc=99.97

Gini=93, Ks=99.04

6.2.3 Evaluation(Test)

Figure 47

Accuracy=99.10

Sensitivity=98.61

Specificity=99.15

Auc=99.89

Gini=98.2, Ks=97.99

INTERPRETATION:
The model is stable, as evident from the confusion matrices for the training and
testing datasets.
Based on the test metrics we can interpret that:
1. The model will catch 98.61% of the customers who will default.
2. The model will catch 99.15% of the customers who will not default on the loan
payment.
3. Overall accuracy is 99%.
4. Of the customers predicted to default, 98% will actually default.
5. Of the customers predicted not to default, 99.15% will actually not default.

7.Best Model

METRICS          RANDOM FOREST    BOOSTING

ACCURACY         97.31            99.10
SENSITIVITY      98.14            98.61
SPECIFICITY      97.23            99.15

On comparing Random Forest and Boosting, we see that Boosting outperforms
Random Forest.

METRICS          CART     LR       NB       RF       BOOSTING

ACCURACY         97.31    96.7     92.84    97.31    99.10
SENSITIVITY      98.14    96.69    97.89    98.14    98.61
SPECIFICITY      97.23    97.45    92.3     97.23    99.15
AUC              98.99    99.15    95.14    99.75    99.89
GINI             82.21    88.01    90.28    99.51    98.2
KS               95.38    94.17    91       96.7     97.99

 The KS value for most of the models is more than 90%, hence they will
perform well in separating the default and fully paid cases.
 AUC is more than 95% for all the models, hence we can consider them
good classifiers.
 Out of logistic regression, CART and Naïve Bayes, the CART model has the
highest accuracy and sensitivity. Hence, we conclude that the CART model is
the best among the three.
 Comparing CART with Random Forest and Boosting, we can conclude that
the model developed using BOOSTING is the best.

8.Conclusion
 We have built various models to understand the factors which influence a
loan being defaulted.
 As per the model comparison, CART is the best of the three base models, with
an accuracy of about 97.31% and a sensitivity of about 98%.
 The models built using ensemble techniques (Random Forest and Boosting) are
also good, with accuracy of about 98% and a good balance between sensitivity
and specificity.
 Overall, we can see that Boosting is the most stable model.
 issue_d, last_pymnt_amnt and last_credit_pull_d play an important role in
predicting whether a customer will default or not.
 There is no definitive guide to which algorithm to use in a given situation.
What works on some datasets may not necessarily work on others. Therefore,
always evaluate methods using cross-validation to get reliable estimates.
 Sometimes we may be willing to give up some improvement in the model if it
would increase the complexity much more than the percentage improvement in the
evaluation metrics.
 In some classification problems, false negatives are a lot more expensive
than false positives. In such cases, we can reduce the cut-off point to reduce
the false negatives.

9.Actions and recommendations


 The verification process needs to be re-engineered and made more stringent so
that it genuinely adds value to the loan approval process.
 The bank should focus on short-term loans (<= 36 months), as the default rate
is lower.
 Customers with higher annual income are observed to have a higher default
rate. The consistency and veracity of declared income needs to be cross-checked.
 The existing risk-reward payoff is not working: while higher interest rates
are charged to risky customers, the default rate is still significantly higher.
Hence, a review of the loan pricing and approval matrix is required.
 Since employment length does not have an impact on the default rate, the
business should target 'new to credit' millennial customers, where competition
might be lower and higher margins would be possible.
 Education loans have a lower default rate compared to other products. Hence,
the business should focus on growing this segment.
 Past credit history and delinquency should be cross-verified.
 Special rules should be assigned to customers who fall in the low-income
category.
 Background checks should be done for customers who are likely to default.
 The higher the interest rate, the higher the chance of default, so there is a
need to create a pricing matrix considering income and assets as valid
characteristics.
 Home ownership (Own/Rent/Mortgage) also mattered; in particular, people who
rented a house were prone to default.
 Factors that did not contribute much to loan default were:
 Annual salary
 Employment length

52
APPENDIX

AUC CURVE, GINI, KS, AUC


RANDOM FOREST(TRAIN)

53
RANDOM FOREST(TEST)

54
CART(TRAIN)

55
CART(TEST)

56
LR(TEST)

57
TABLEAU GENERATED GRAPHS:

58
Reading Data:
library(readxl)

Data<- read_xlsx('Data.xlsx')

names(Data)

Data Transformation:
Data$issue_d=as.Date(Data$issue_d)

59
Data$last_pymnt_d=as.Date(Data$last_pymnt_d)

Data$last_credit_pull_d=as.Date(Data$last_credit_pull_d)

Data$issue_d=as.numeric(Data$issue_d)

Data$last_pymnt_d=as.numeric(Data$last_pymnt_d)

Data$last_credit_pull_d=as.numeric(Data$last_credit_pull_d)

Data$term=as.factor(Data$term)

Data$grade=as.factor(Data$grade)

Data$emp_length=as.factor(Data$emp_length)

Data$loan_status=as.factor(Data$loan_status)

str(Data)

CHECKING MISSING VALUE :


colSums(is.na(Data))

Data <- na.omit(Data)

DATA SPLIT:
library(caTools)

set.seed(1203)

index=sample.split(Data$loan_status,SplitRatio = 0.7)

TrainData=subset(Data,index==TRUE)

TestData=subset(Data,index==FALSE)

prop.table(table(TrainData$loan_status))

prop.table(table(TestData$loan_status))

TrainData=as.data.frame(TrainData)

#Generate Synthetic Data using SMOTE

library(DMwR)

set.seed(101)

traindf <- SMOTE(loan_status ~ ., TrainData,perc.over = 200,perc.under = 150)

60
prop.table(table(traindf$loan_status))

##Logistic regression

library(car)

set.seed(1)

lgmodel <- glm(formula= loan_status ~.,traindf, family = binomial(link = "logit"))

summary(lgmodel)

# Variation Inflation Factor (Multicollinearity)

car::vif(lgmodel)

# Likelihood ratio test


library(lmtest)

lrtest(lgmodel)# Pseudo R-square

library(pscl)

pR2(lgmodel)

# Odds Ratio
exp(coef(lgmodel))

# Probability
exp(coef(lgmodel))/(1+exp(coef(lgmodel)))

names(traindf)

## Interpretation(In sample-Train)
Model1_pred = predict(lgmodel, newdata = traindf, type = 'response')

Model1_predicted = ifelse(Model1_pred > 0.5, 1, 0)

#Factor conversion

Model1_predicted_factor=factor(Model1_predicted,levels = c(0,1))

head(Model1_predicted_factor)

## Confusion matrix

# Compare training predictions against the training labels (caret is needed for confusionMatrix).
library(caret)

Model1.CM = confusionMatrix(Model1_predicted_factor, traindf$loan_status)

Model1.CM

prop.table(table(Model1_predicted_factor,traindf$loan_status))

## Interpretation (Out of sample - Test)
Model1_pred = predict(lgmodel, newdata = TestData, type = 'response')

Model1_predicted=ifelse(Model1_pred>0.5,1,0)

#Factor conversion

Model1_predicted_factor=factor(Model1_predicted,levels = c(0,1))

head(Model1_predicted_factor)

## Confusion matrix
Model1.CM = confusionMatrix(Model1_predicted_factor, TestData$loan_status)

Model1.CM

prop.table(table(TestData$loan_status,Model1_predicted_factor))

# Naive Bayes
library(e1071)

set.seed(1)

nbmodel <- naiveBayes(loan_status ~., data=traindf)

nbmodel

library(caret)

nb_predictions <- predict(nbmodel,traindf)

62
confusionMatrix(nb_predictions,traindf$loan_status,positive="1")

nb_predictions=as.numeric(nb_predictions)

traindf1=as.numeric(traindf$loan_status)

# ROC plot
library(ROCR)

train.roc <- prediction(nb_predictions, traindf$loan_status)

plot(performance(train.roc, "tpr", "fpr"), col = "red", main = "ROC Curve for train data")

abline(0, 1, lty = 8, col = "blue")

# AUC
train.auc = performance(train.roc, "auc")

train.area = as.numeric(slot(train.auc, "y.values"))

train.area

# KS
ks.train <- performance(train.roc, "tpr", "fpr")

train.ks <- max(attr(ks.train, "y.values")[[1]] - (attr(ks.train, "x.values")[[1]]))

train.ks

# Gini
train.gini = (2 * train.area) - 1

train.gini

##Test
library(caret)

nb_predictions <- predict(nbmodel,TestData)

confusionMatrix(nb_predictions,TestData$loan_status,positive="1")

nb_predictions=as.numeric(nb_predictions)

traindf1=as.numeric(traindf$loan_status)

# ROC plot

63
library(ROCR)

test.roc <- prediction(nb_predictions, TestData$loan_status)

plot(performance(test.roc, "tpr", "fpr"), col = "red", main = "ROC Curve for test data")

abline(0, 1, lty = 8, col = "blue")

# AUC
test.auc = performance(test.roc, "auc")

test.area = as.numeric(slot(test.auc, "y.values"))

test.area

# KS
ks.test <- performance(test.roc, "tpr", "fpr")

test.ks <- max(attr(ks.test, "y.values")[[1]] - (attr(ks.test, "x.values")[[1]]))

test.ks

# Gini

test.gini = (2 * test.area) - 1

test.gini

## CART
library(rpart)

library(rpart.plot)

#Get Control variable which would allow the tree to grow to maximum

tree_control = rpart.control(minsplit=900, minbucket = 300, cp = 0, xval = 10)

tree_iter1 <- rpart(formula = loan_status ~ .,data = traindf, method = "class", control = tree_control)

# Plot the tree


library(RColorBrewer)

library(rattle)

64
fancyRpartPlot(tree_iter1)

rpart.plot(tree_iter1)

prp(tree_iter1)

printcp(tree_iter1)

tree_iter1$cptable

# Pruned tree

ptree<- prune(tree_iter1, cp= 0.001985339 ,"CP")

printcp(ptree)

fancyRpartPlot(ptree, uniform=TRUE, main="Pruned Classification Tree")

rpart.plot(ptree)

prp(ptree)

# Evaluate the model


# Get KS, AUC and Gini

# (Assumed step: score the training data with the pruned tree before computing the metrics.)
traindf$predict.class <- predict(ptree, traindf, type = "class")
traindf$predict.score <- predict(ptree, traindf, type = "prob")

pred_train_cart <- prediction(traindf$predict.score[,2], traindf$loan_status)

perf_train_cart <- performance(pred_train_cart, "tpr", "fpr")

plot(perf_train_cart)

KS_train_cart <- max(attr(perf_train_cart, 'y.values')[[1]]-attr(perf_train_cart, 'x.values')[[1]])

library(caret)

train.auc = performance(pred_train_cart, "auc")

train.area = as.numeric(slot(train.auc, "y.values"))

train.area

library(ineq)

gini_train_cart = ineq(traindf$predict.score[,2], type="Gini")

gini_train_cart

65
confusionMatrix((table(traindf$predict.class,

traindf$loan_status)),

mode = "everything",positive = '1')

### TEST

# Get KS, AUC and Gini

# (Assumed step: score the test data with the pruned tree before computing the metrics.)
TestData$predict.class <- predict(ptree, TestData, type = "class")
TestData$predict.score <- predict(ptree, TestData, type = "prob")

pred_test <- prediction(TestData$predict.score[,2], TestData$loan_status)

perf_test <- performance(pred_test, "tpr", "fpr")

plot(perf_test)

KS_test <- max(attr(perf_test, 'y.values')[[1]]-attr(perf_test, 'x.values')[[1]])

KS_test

auc_test <- performance(pred_test,"auc");

auc_test <- as.numeric(auc_test@y.values)

auc_test

gini_test = ineq(TestData$predict.score[,2], type="Gini")

gini_test

confusionMatrix((table(TestData$predict.class,

TestData$loan_status)),

mode = "everything",positive = '1')

##Random forest
library(randomForest)

random_forest <- randomForest(loan_status~.,

data = traindf,

ntree=500, mtry = 8 , nodesize = 200,importance=TRUE)

print(random_forest )

# Listing the importance of the variables.


impVar <- round(randomForest::importance(random_forest), 2)

impVar[order(impVar[,3], decreasing=TRUE),]

66
#ntree : random number

#mtry : square root of the number of predictor variable

#nodesize : 10% of the number of records in training dataset

names(traindf)

str(traindf)

set.seed(2306)

tuned_rf <- tuneRF(x = traindf[,-15],

y=traindf$loan_status,

mtryStart = 8,

ntreeTry=500,

stepFactor = 1.5,

improve = 0.001,

trace=T,

plot = T,

doBest = TRUE,

nodesize = 280 ,

importance = T)

tuned_rf

# Plotting for arriving at the optimum number of trees

plot(tuned_rf, main="")

legend("topright", c("OOB", "0", "1"), text.col=1:6, lty=1:3, col=1:3)

title(main="Error Rates Random Forest RFDF.dev")

##Train
predrf <- predict(random_forest, newdata = traindf[,-15], type = "prob")

67
RFpredROC <- ROCR::prediction(predrf[,2], traindf$loan_status)

perf <- performance(RFpredROC, "tpr", "fpr")

plot(perf)

as.numeric(performance(RFpredROC, "auc")@y.values)

# KS

ks.train <- performance(RFpredROC, "tpr", "fpr")

train.ks <- max(attr(ks.train, "y.values")[[1]] - (attr(ks.train, "x.values")[[1]]))

train.ks

# Gini

train.gini = (2 * 0.9995019) - 1

train.gini

#Confusion matrix
traindf$predict.class <- predict(tuned_rf, traindf, type = "response")

traindf$predict.score <- predict(tuned_rf, traindf, type="prob")

confusionMatrix(traindf$predict.class,

traindf$loan_status)

#Test
predrf <- predict(random_forest, newdata = TestData[,-15], type = "prob")

RFpredROC <- ROCR::prediction(predrf[,2], TestData$loan_status)

perf <- performance(RFpredROC, "tpr", "fpr")

plot(perf)

as.numeric(performance(RFpredROC, "auc")@y.values)

# KS
ks.test <- performance(RFpredROC, "tpr", "fpr")

test.ks <- max(attr(ks.test, "y.values")[[1]] - (attr(ks.test, "x.values")[[1]]))

test.ks

68
# Gini
test.gini = (2 * 0.9975996) - 1

test.gini

lgmodel <- glm(formula= loan_status ~.,traindf, family = binomial(link = "logit"))

## Final logistic model evaluation (Out of sample - Test)
Model1_pred = predict(lgmodel,newdata = TestData,type='response')

Model1_predicted=ifelse(Model1_pred>0.5,1,0)

#Factor conversion

Model1_predicted_factor=factor(Model1_predicted,levels = c(0,1))

head(Model1_predicted_factor)

## Confusion matrix

library(caret)

Model1.CM=confusionMatrix(Model1_predicted_factor,TestData$loan_status)

Model1.CM
