FINAL REPORT
BY:
Pranav v
PGP-BABI June ‘20
CONTENTS
1. Introduction
   1.1. Problem
   1.2. Solution
   1.3. Need for the Study/Project
   1.4. Business Understanding/Scope
   1.5. Business Hypothesis
2. Data Report
   2.1. Data collection
   2.2. Data dictionary
   2.3. Data summary
3. EDA (Exploratory data analysis)
   3.1. Missing value treatment
   3.2. Univariate analysis
   3.3. Bi-variate analysis
   3.4. Time series plot
   3.5. Data transformation
   3.6. Outliers
   3.7. Multicollinearity and significant variables
   3.8. Addition of new variables
   3.9. Insights from EDA (Exploratory data analysis)
4. Model Building
   4.1. Check for balance of data
   4.2. CART
        4.2.1. Interpretation
        4.2.2. Evaluation (Train)
        4.2.3. Evaluation (Test)
   4.3. Logistic Regression
        4.3.1. Interpretation
        4.3.2. Evaluation (Train)
        4.3.3. Evaluation (Test)
   4.4. Naive Bayes
        4.4.1. Interpretation
        4.4.2. Evaluation (Train)
        4.4.3. Evaluation (Test)
5. Model Comparison
6. Model Tuning
   6.1. Random forest
        6.1.1. Interpretation
        6.1.2. Evaluation (Train)
        6.1.3. Evaluation (Test)
   6.2. Boosting
        6.2.1. Interpretation
        6.2.2. Evaluation (Train)
        6.2.3. Evaluation (Test)
7. Best Model
8. Conclusion
9. Actions and recommendations
APPENDIX
1. Introduction
Across the globe, the banking sector acts as a catalyst for a country's economy. Banks play a vital role in providing financial resources to both corporates and individuals, and lending is one of their key functions. Banks devise business strategies to persuade customers to apply for loans; however, some customers do not adhere to the proceedings after their applications are approved. When a loan is not repaid as per the defined schedule, the bank suffers a financial loss.
Prior assessment of likely defaulters is therefore one of the most important concerns for banks' survival and profitability in a highly competitive market. Most banks use credit scoring models and risk assessment techniques to analyse loan applications and decide whether to approve them. Despite this, financial institutions face huge capital losses when borrowers default. When a borrower is unable to pay the interest on time and also unable to return the principal, the bank declares that amount non-performing.
1.1. Problem
Loans are the most important product offered by banking financial institutions, which more often than not apply different marketing strategies to sell them. Nowadays, banks scrutinize each loan application to identify potential default cases, so that they can predict which client is going to default on loan repayment and at which step. Based on these predictions, banks try to identify customers who have a high probability of defaulting and take concerted actions before they default.
1.2. Solution
Hence, there is a need to build a model to predict loan default, which will help the bank take required actions, including:
• Avoiding the exposure
• Intensifying the collection efforts
• Initiating the collateral sale
• Avoiding certain customer or product segments
1.5. Business Hypothesis
The assignment aims to predict the customers who will default on payment based on their personal and professional details.
Null hypothesis (H0): no predictor is available to predict default.
Alternate hypothesis (HA): there is at least one independent variable that predicts default.
2. Data Report
Exploring the dataset, we find 41 variables broadly divided into two categories: (1) demographic variables, containing details about the customers, and (2) loan variables, containing details about the loans sanctioned. The data is for loans sanctioned between Dec 2011 and Jan 2015 by the bank to various customers. Demographic variables include details such as annual income, desc and home ownership, whereas loan variables include loan amount, funded amount, payment term, principal received to date, etc.
2.1. Data collection
The dataset contains 226,786 observations spread across 41 variables; 25 variables are continuous, whereas 16 are categorical, with various categories defined for each. A basic summary and the type of each variable under study are provided below to build domain knowledge of the variables.
2.2. Data dictionary
1. member_id (Continuous): A unique ID for the borrower member.
2. loan_amnt (Continuous): The listed amount of the loan applied for by the borrower. If at some point in time the credit department reduces the loan amount, it is reflected in this value. Min loan applied: $500; max: $35,000.
3. funded_amnt (Continuous): The total amount committed to that loan at that point in time. Min sanctioned: $500; max: $35,000. Only 0.24% of the loan amounts requested by applicants were not fully sanctioned by the bank.
4. funded_amnt_inv (Continuous): The total amount committed by investors for that loan at that point in time. Min: $0; max: $35,000.
5. term (Categorical): The number of payments on the loan; values are in months and can be either 36 or 60. Two levels: 36 months (70%), 60 months (30%).
6. int_rate (Continuous): Interest rate on the loan. Min: 5.32%; max: 28.99%.
7. installment (Continuous): The monthly payment owed by the borrower if the loan originates. Min: $15.69; max: $1,409.99.
8. grade (Categorical): Assigned loan grade. Seven levels: A, B, C, D, E, F, G.
9. emp_length (Categorical): Employment length in years; possible values range from < 1 year to 10+ years, where 0 means less than one year and 10 means ten or more years. Eleven levels plus an NA category; mostly 10+ years.
11. annual_inc (Continuous): The self-reported annual income provided by the borrower during registration. Min: $3,000; max: $8,900,060.
12. verification_status (Categorical): Status of the income verification done. Three levels: Not Verified, Source Verified, Verified.
13. issue_d (Categorical): The month in which the loan was funded. Unique dates.
14. pymnt_plan (Categorical): Indicates whether a payment plan has been put in place for the loan. n: 226,780; y: 6.
15. desc (Categorical): Loan description provided by the borrower. Only 14% of observations have comments from applicants explaining the reason for taking the loan.
16. purpose (Categorical): A category provided by the borrower for the loan request. Fourteen levels: car, credit_card, debt_consolidation, educational, home_improvement, house, major_purchase, medical, moving, other, renewable_energy, small_business, vacation, wedding.
17. addr_state (Categorical): The state provided by the borrower in the loan application. Different states.
18. dti (Continuous): A ratio of the borrower's total monthly debt payments on total debt obligations (excluding mortgage and the requested LC loan) to the borrower's self-reported monthly income. Min: 0; max: 59.26.
19. delinq_2yrs (Continuous): The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years. Min: 0; max: 29.
20. earliest_cr_line (Categorical): The month the borrower's earliest reported credit line was opened. Unique values.
21. inq_last_6mths (Continuous): The number of inquiries in the past 6 months (excluding auto and mortgage inquiries). Min: 0; max: 8.
22. mths_since_last_delinq (Continuous): The number of months since the borrower's last delinquency. Min: 0; max: 151; NAs: 124,638.
25. revol_util (Continuous): The amount of credit the borrower is using relative to all available revolving credit. Min: 0; max: 892.
26. total_acc (Continuous): The total number of credit lines currently in the borrower's credit file. Min: 2; max: 150.
27. out_prncp (Continuous): Remaining outstanding principal for total amount funded. Min: 0; max: 35,000.
28. out_prncp_inv (Continuous): Remaining outstanding principal for the portion of total amount funded by investors. Min: 0; max: 35,000.
29. total_pymnt (Continuous): Payments received to date for total amount funded. Min: 0; max: 57,778.
30. total_pymnt_inv (Continuous): Payments received to date for the portion of total amount funded by investors. Min: 0; max: 57,778.
31. total_rec_prncp (Continuous): Principal received to date. Min: 0; max: 35,000.
32. total_rec_int (Continuous): Interest received to date. Min: 0; max: 22,777.6.
33. total_rec_late_fee (Continuous): Late fees received to date. Min: 0; max: 286.74.
34. recoveries (Continuous): Post charge-off gross recovery. All values are 0.
35. collection_recovery_fee (Continuous): Post charge-off collection fee. All values are 0.
36. last_pymnt_d (Categorical): Last month a payment was received. Unique values.
37. last_pymnt_amnt (Continuous): Last total payment amount received. Min: 0; max: 36,475.6.
38. next_pymnt_d (Categorical): Next scheduled payment date. Unique values.
39. last_credit_pull_d (Categorical): The most recent month credit was pulled for this loan. Unique values.
40. application_type (Categorical): Indicates whether the loan is an individual application or a joint application with two co-borrowers. INDIVIDUAL: 99.94%; JOINT: 0.002%.
41. loan_status (Categorical): Current status of the loan. Paid: 91.59%; Default: 8.4%.
Table 1
2.3. Data summary
Customer Information:
Loan Details:
The dataset conveys the purposes for which loans are taken; the statistics are as below:
• 58.86% of the applicants (about 132,971) mentioned the purpose of the loan as debt consolidation.
• 20.16% of the applicants marked the purpose of the loan as credit card, i.e., payment of their credit card dues.
• Debt consolidation and credit card dues have the highest loan amounts sanctioned, followed by home improvement.
• There are fewer takers of loans for educational purposes, weddings and renewable energy, but these purposes have higher default rates.
• 70% of the applicants opted for a loan tenure of 36 months, whereas 30% opted for a 60-month tenure, so the loans are not for the very long term.
• More than a third of the loans have not been verified, which is an alarming sign; banks should follow the verification process diligently to reduce defaults.
• Missing values (blank observations) are present in two variables: months since last delinquency and revolving line utilization. Treatment is done in the EDA section.
• There are "n/a" values marked in some variables, and it is assumed that data is not available for these; we keep these values in the model and proceed with the analysis.
3. EDA (Exploratory data analysis)
3.1. Missing value treatment
Our dataset has 226,786 observations of 41 variables. With 41 columns it is not practical to inspect each one individually for NA or missing values, so we find all columns where missing values exceed a certain percentage, say 15%, along with columns that will not add any value or insight to our study. We remove those columns, as it is not feasible to impute their missing values.
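This screening step can be sketched in base R (the toy data frame and column names here are illustrative; the real work runs on the 226,786 x 41 loan dataset):

```r
# Illustrative data frame standing in for the loan dataset
df <- data.frame(
  loan_amnt = c(5000, 10000, 15000, 20000),
  mths_since_last_delinq = c(NA, NA, NA, 12),  # heavily missing
  desc = c(NA, "debt", NA, NA)                 # mostly missing free text
)

# Percentage of missing values per column
na_pct <- sapply(df, function(col) mean(is.na(col)) * 100)
na_pct

# Drop columns where more than 15% of values are missing
drop_cols <- names(na_pct[na_pct > 15])
df_clean  <- df[, setdiff(names(df), drop_cols), drop = FALSE]
names(df_clean)
```

On the toy frame only loan_amnt survives the 15% cut, which mirrors how desc and mths_since_last_delinq are handled in the report.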
Figure 1
desc (152,077 missing): from our study of the data we understand that it is the free-text detail of the 'purpose' variable, hence we drop desc as redundant information.
Table 2
3.2. Univariate analysis
We perform univariate analysis for both continuous and categorical variables.
CONTINUOUS:
Figure 2
• The general observation is that the variables are skewed due to the presence of outliers. These outliers might be realistic data with no error.
• We need to further examine the skewness and kurtosis values for each variable and check whether the picture presented by the histogram is correct.
• Skewness reflects the symmetry of the data, i.e., whether the distribution is the same on the right- and left-hand sides of the centre point.
• Kurtosis depicts how heavy the tails of the distribution are compared to a normal distribution; here this is due to the presence of outliers.
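These moment checks can be written out directly in base R. The report does not show which package it used, so the functions below are hand-rolled implementations of the standard third and fourth standardized moments, run on toy data:

```r
# Moment-based skewness and (excess) kurtosis in base R
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x); s <- sqrt(mean((x - m)^2))
  mean(((x - m) / s)^3)
}
kurtosis <- function(x) {          # excess kurtosis: 0 for a normal
  x <- x[!is.na(x)]
  m <- mean(x); s <- sqrt(mean((x - m)^2))
  mean(((x - m) / s)^4) - 3
}

# A right-skewed, heavy-tailed toy sample (like annual_inc with one outlier)
inc <- c(rep(30000, 50), rep(60000, 30), 8900060)
round(c(skew = skewness(inc), kurt = kurtosis(inc)), 2)
```

Large positive values on both measures confirm what the histograms suggest: outlier-driven right skew and heavy tails.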
Figure 3
• Loans are mostly taken for the short term, with the majority (70%) on a 36-month tenure.
• Employment tenure peaks at 10+ years, showing that the majority of customers who have taken loans are experienced and well placed, so chances of default would be lower than for freshers or less experienced customers.
CATEGORICAL:
Figure 4
INFERENCES:
• Loan applications are almost all individual; there are only 6 joint cases. Banks should check whether joint applications can help reduce default rates and accordingly emphasize joint accountability based on the loan amount.
• The main purpose of the loans taken is debt consolidation (60%), followed by credit card (20%).
• Many loans have not been verified by the banks, which could be a possible reason for default.
• Defaulted loans are far fewer than non-defaulted loans, so we need to balance the dataset before proceeding further with model prediction.
3.3.Bi-variate analysis
Loan Amount v/s Purpose:
Figure 5
Figure 6
It can be inferred from figure 6 that defaulting on the loan is higher among people with 10+ years of work experience.
Figure 7
It can be inferred from figure 7 that defaulting on the loan is higher among people of grade C.
**Tableau generated graph
Figure 8
We can derive a few points from the plot (figure 8):
Figure 9
Grade v/s Interest rate:
Figure 10
Figure 11
We can derive a few points from the plot (Figure 11):
• There are instances where the funded amount is smaller than the loan amount.
• There are a number of loans where the amount invested is smaller than the funded amount, i.e., the full loan is not invested in.
Figure 12
Figure 13
• The mean interest rate is falling or relatively constant for high-rated clients.
• The mean interest rate is increasing significantly for low-rated clients.
Figure 14
3.5. Data transformation
Figure 15
INFERENCES:
• Changed NA values of emp_length to the "< 1 year" category.
• Clubbed NONE observations into OTHER for the home_ownership variable.
• Clubbed Source Verified with Verified for the verification_status variable.
• Recoded loan_status into a binary Default flag: 0 = Fully Paid, 1 = Default.
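A base-R sketch of these recodes, on toy columns standing in for the real dataset:

```r
# Toy columns illustrating the transformation steps listed above
emp_length          <- c("10+ years", "n/a", "2 years", "n/a")
home_ownership      <- c("RENT", "NONE", "OWN", "MORTGAGE")
verification_status <- c("Verified", "Source Verified", "Not Verified", "Verified")
loan_status         <- c("Fully Paid", "Default", "Fully Paid", "Default")

# NA / "n/a" employment length -> "< 1 year" category
emp_length[emp_length == "n/a" | is.na(emp_length)] <- "< 1 year"

# Club NONE into OTHER for home ownership
home_ownership[home_ownership == "NONE"] <- "OTHER"

# Club Source Verified with Verified
verification_status[verification_status == "Source Verified"] <- "Verified"

# Binary target: 0 = Fully Paid, 1 = Default
loan_default <- ifelse(loan_status == "Default", 1, 0)
loan_default
```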
3.6.Outliers
Figure 16
• Outliers are observations not typical of the dataset and might be captured in error. We used standard deviation as the cut-off for judging the outliers in the dataset.
• Outstanding principal, outstanding principal (investors), total payment, total payment (investors), total principal received, total interest received, total late fees, last payment date, last payment amount and next payment date all have outliers.
• The boxplots confirm the same, but these values are specific to the loans sanctioned and can vary.
• Annual income is also specific to individuals and can vary.
Figure 16
3.7. Multicollinearity and significant variables
Figure 17
The above figure shows the variables with high correlation, which can be removed after further analysis of VIF:
• term
• int_rate
• installment
• grade
• emp_length
• dti
• issue_d
• revol_bal, revol_util
• total_rec_int
• last_pymnt_d
• last_pymnt_amnt
• last_credit_pull_d
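A sketch of how such high-correlation pairs can be flagged in base R; the 0.8 threshold and the toy variables are our assumptions, not the report's:

```r
# Toy numeric frame with one engineered near-duplicate pair
set.seed(1)
n <- 200
loan_amnt   <- runif(n, 500, 35000)
funded_amnt <- loan_amnt * runif(n, 0.95, 1.0)   # almost identical to loan_amnt
dti         <- runif(n, 0, 60)                   # unrelated variable
num <- data.frame(loan_amnt, funded_amnt, dti)

cmat <- cor(num)
# Flag pairs with |r| above 0.8 (upper triangle only, to list each pair once)
high <- which(abs(cmat) > 0.8 & upper.tri(cmat), arr.ind = TRUE)
data.frame(var1 = rownames(cmat)[high[, 1]],
           var2 = colnames(cmat)[high[, 2]])
```

Only the engineered loan_amnt/funded_amnt pair is flagged; dti stays below the threshold.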
3.9. Insights from EDA (Exploratory data analysis)
• The main purpose of the loans taken is debt consolidation (60%), followed by credit card (20%).
• Employment tenure peaks at 10+ years, showing that the majority of customers who have taken loans are experienced and well placed, so chances of default would be lower than for freshers or less experienced customers.
• Interest rates are higher for defaulters: if creditworthiness is not good, a loan might still be available, but at a higher rate of interest.
• The default percentage among loans repaid over 60 months is higher than among those over 36 months, so short-term loans are less prone to default.
• Grade A has the fewest default cases: the better the loan grade, the lower the chance of default.
• Default cases are more frequent among verified loan applications, which reflects that more stringent verification needs to be done by the banks.
• We need to remove multicollinearity either by dropping variables from the analysis, using PCA to combine variables, or using ratios to combine variables.
• Boruta importance matrix: Boruta helps finalize feature selection; it executes importance runs and eliminates the least important variables.
• Chi-square test with ANOVA: identifies which variables are statistically important based on the ANOVA test.
4. Model Building
4.1. Check for balance of data
Is the data unbalanced? Yes: only 8% of observations are defaults and the remaining 92% are non-defaults. We will use SMOTE during model building to oversample the default cases and balance the data before proceeding with model building.
Figure 18
The above figure shows that the train and test datasets are unbalanced, so we apply SMOTE to balance the data.
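The report uses SMOTE for balancing (typically via a package such as DMwR or smotefamily). As an illustration of the core idea only, the base-R sketch below synthesizes minority-class rows by interpolating between pairs of minority observations; real SMOTE interpolates towards one of the k nearest minority neighbours rather than a random minority row:

```r
set.seed(42)
# Imbalanced toy data: ~8% defaults, like the loan dataset
majority <- data.frame(x1 = rnorm(92, 0), x2 = rnorm(92, 0), default = 0)
minority <- data.frame(x1 = rnorm(8, 3),  x2 = rnorm(8, 3),  default = 1)

# SMOTE core idea: new point = a + gap * (b - a) for two minority rows a, b
synthesize <- function(min_df, n_new) {
  idx <- sample(nrow(min_df), n_new, replace = TRUE)
  nbr <- sample(nrow(min_df), n_new, replace = TRUE)
  gap <- runif(n_new)
  data.frame(
    x1 = min_df$x1[idx] + gap * (min_df$x1[nbr] - min_df$x1[idx]),
    x2 = min_df$x2[idx] + gap * (min_df$x2[nbr] - min_df$x2[idx]),
    default = 1
  )
}

balanced <- rbind(majority, minority,
                  synthesize(minority, nrow(majority) - nrow(minority)))
table(balanced$default)   # classes now balanced
```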
Figure 19
4.2. CART
The purpose of using CART is that it works like an if-else clause: every variable is considered in studying the outcome of the dependent variable, here loan_status. CART helps define the class to which a particular value of the dependent variable loan_status (Default or Fully Paid) belongs, based on the various predictors. Classification trees are used when the dataset needs to be split into the classes of the response variable.
Non-zero variance: we have verified that there is no variable with zero variance, hence we use all the variables for CART.
Control parameters: minsplit = 900, minbucket = 300, xval = 10.
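With those control values, the fit can be sketched with the rpart package (shipped with R). The synthetic training frame below merely stands in for the balanced loan data:

```r
library(rpart)

set.seed(7)
# Synthetic stand-in for the balanced training data (not the real loan file)
n  <- 5000
tr <- data.frame(
  last_pymnt_amnt    = rexp(n, 1 / 2000),
  total_rec_late_fee = rbinom(n, 1, 0.1) * runif(n, 0, 280),
  term               = factor(sample(c("36 months", "60 months"), n, TRUE, c(0.7, 0.3)))
)
p <- plogis(-2 + 0.005 * tr$total_rec_late_fee - 4e-04 * tr$last_pymnt_amnt +
            1.5 * (tr$term == "60 months"))
tr$loan_status <- factor(ifelse(runif(n) < p, "Default", "Fully Paid"))

# Control values quoted in the report
ctrl <- rpart.control(minsplit = 900, minbucket = 300, xval = 10)
fit  <- rpart(loan_status ~ ., data = tr, method = "class", control = ctrl)
printcp(fit)   # CP table: splits, relative error, cross-validated error
```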
The output of the CART model:
Figure 20
On applying the CART model to our dataset we see that the important variables used
for tree construction are:
issue_d
last_credit_pull_d
last_pymnt_amnt
term
total_rec_late_fee
4.2.1 Interpretation
Further, the tree is visualized as :
Figure 21
INSIGHTS DRAWN:
• 48% of the customers defaulted on loan payment after the last credit pull date.
• 38% of the customers have paid no late fee and have not defaulted, i.e., they pay their instalments before the due date and never pay a late fee.
• 1% of the customers with term = 36 months have defaulted on the loan.
AFTER PRUNING:
To prune the tree, we find the best ‘Complexity Parameter’ of the tree.
Cost-complexity pruning works by successively collapsing the node that produces
the smallest per-node increase in our error/cost, while at the same time weighing the
overall complexity (e.g. size) of our tree to decide upon the best-pruned tree that
minimizes our cost-complexity function.
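This pruning procedure can be sketched with rpart. The bundled kyphosis example data is used here only to keep the sketch self-contained and runnable; in the report the same steps are applied to the loan_status tree:

```r
library(rpart)

# Grow a deliberately deep tree, then prune at the CP that minimises
# the cross-validated error (xerror) in the CP table
set.seed(1)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             control = rpart.control(cp = 0.0001, minsplit = 5, xval = 10))

cp_table <- fit$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned   <- prune(fit, cp = best_cp)

# The pruned tree is never larger than the full tree
c(full = nrow(fit$frame), pruned = nrow(pruned$frame))
```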
Figure 22
Figure 23
The purpose of pruning is to decrease the complexity of the tree based on the value
of cp. Finally, the number of splits is 8.
Figure 24
From the above CP table it is understood that at the 0th split the relative error is 100%. Ten-fold cross-validation is the default; the cross-validation error at 0 splits is 100%, with a standard deviation of 0.3%. At the 2nd split the relative error has decreased to 7.9%, the xerror to 8% and the standard deviation to 0.1%. The reduction in error, xerror and xstd continues at every step up to the 7th split, with no increase observed, so there is no need to prune the tree further.
4.2.2 Evaluation(Train)
Figure 25
Please refer to the Appendix for the CART source code, AUC curve, etc.
4.2.3 Evaluation(Test)
Figure 26
Please refer to the Appendix for the CART source code, AUC curve, etc.
4.3. Logistic Regression
Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
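In R this is a glm with a binomial family. The sketch below uses synthetic stand-in data and a few of the report's predictor names; the coefficients and data are illustrative, not the report's:

```r
set.seed(3)
# Synthetic stand-in for the modelling data
n <- 2000
d <- data.frame(
  installment        = runif(n, 15, 1400),
  dti                = runif(n, 0, 60),
  total_rec_late_fee = rbinom(n, 1, 0.1) * runif(n, 0, 280)
)
p <- plogis(-3 + 0.01 * d$total_rec_late_fee + 0.03 * d$dti)
d$default <- rbinom(n, 1, p)

# Binary logistic regression: logit(P(default)) is linear in the predictors
m <- glm(default ~ installment + dti + total_rec_late_fee,
         data = d, family = binomial)
summary(m)$coefficients   # estimates, std errors, z and p values
AIC(m)                    # the score tracked across model refinements
```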
Figure 27
4.3.1 Interpretation
From figure 27 we can see the significant variables:
• installment
• issue_d
• dti
• revol_bal, revol_util
• total_rec_late_fee, last_pymnt_d
• last_pymnt_amnt, last_credit_pull_d
Also, the AIC score is 12992. We will track this in subsequent stages as we refine the model; the model with the lowest AIC score is the most preferred and optimized one.
Evaluating model performance:
Model significance is checked using the likelihood ratio (log-likelihood) test. In statistics, a likelihood ratio test is used for comparing the goodness of fit of two models, one of which (the null model) is a special case of the other (the alternative model). The test is based on the likelihood ratio, which expresses how many times more likely the data are under one model than the other. This likelihood ratio, or equivalently its logarithm, can then be used to compute a p-value, or compared to a critical value, to decide whether to reject the null model and accept the alternative model.
Figure 28
H0: all betas are zero.
H1: at least one beta is non-zero.
From the log-likelihoods we can see that the intercept-only model has log-likelihood -54465, while the full model has -6466. So 1 - (-6466/-54465) = 88.12% of the uncertainty inherent in the intercept-only model is explained by the full model, and the chi-square likelihood ratio is significant. The p-value also suggests that we can accept the alternate hypothesis that at least one beta is non-zero, so the model is significant.
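A worked check of the arithmetic quoted above (the degrees of freedom for the chi-square comparison would be the number of predictors, which is not stated here):

```r
# The log-likelihoods quoted in the text
ll_null <- -54465   # intercept-only model
ll_full <- -6466    # full model

# Share of the null model's uncertainty removed by the full model
uncertainty_explained <- 1 - ll_full / ll_null
round(uncertainty_explained * 100, 2)   # 88.13 (the report rounds this to 88.12)

# Likelihood-ratio statistic, compared against a chi-square distribution
lr_stat <- 2 * (ll_full - ll_null)
lr_stat   # 95998
```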
Model robustness check
Since we concluded that the model is significant, let's find out how robust it is with the help of McFadden's pseudo-R-squared test.
McFadden's pseudo-R-squared: logistic regression models are fitted by the method of maximum likelihood, i.e., the parameter estimates are the values that maximize the likelihood of the observed data. McFadden's R-squared measure is defined as 1 - log(Lc)/log(Lnull), where Lc denotes the (maximized) likelihood value of the current fitted model and Lnull denotes the corresponding value for the null model, the model with only an intercept and no covariates.
Figure 29
McFadden's pseudo-R-squared test gives 88.12%, i.e., the model removes 88.12% of the null model's log-likelihood shortfall, which suggests it is a robust model.
Odds explanatory power
Let's find out the odds and probabilities with which the variables impact loan_status.
Figure 30
Figure 31
Multicollinearity check:
The solution of a regression problem becomes unstable in the presence of two or more correlated predictors. Multicollinearity can be measured by computing the variance inflation factor (VIF), which gauges how much the variance of a regression coefficient is inflated due to multicollinearity.
Figure 32
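VIF can be computed from first principles in base R (packages such as car provide a ready-made vif function; the hand-rolled version below makes the definition explicit, on toy data):

```r
# VIF computed from its definition: VIF_j = 1 / (1 - R^2_j),
# where R^2_j comes from regressing predictor j on all other predictors
vif_manual <- function(X) {
  sapply(seq_along(X), function(j) {
    r2 <- summary(lm(X[[j]] ~ ., data = X[-j]))$r.squared
    1 / (1 - r2)
  })
}

set.seed(5)
n <- 300
loan_amnt   <- runif(n, 500, 35000)
funded_amnt <- loan_amnt + rnorm(n, 0, 500)   # nearly collinear with loan_amnt
dti         <- runif(n, 0, 60)                # unrelated
X <- data.frame(loan_amnt, funded_amnt, dti)

round(setNames(vif_manual(X), names(X)), 1)
# loan_amnt and funded_amnt show inflated VIFs; dti stays near 1
```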
From figures 31 and 32 we can see that the variables impacting loan status are: term, installment, emp_length and issue_d.
4.3.2 Evaluation(Train)
Figure 33
4.3.3 Evaluation(Test)
Figure 34
Interpretation of the Logistic Regression Model:
The performance metrics for the train and test data are comparable, which tells us it is a good model. The significant variables are last_pymnt_amnt, term, last_pymnt_d, issue_d, last_credit_pull_d and total_rec_late_fee, which are used to create one final model. The model is stable, as evident from the confusion matrices for the training and testing datasets. Based on the test metrics we can interpret that:
• The model catches 96.6% of the customers who will default.
• The model catches 97.45% of the customers who will not default.
• Overall accuracy is 96%.
• Of the customers predicted to default, 97% actually default.
• Of the customers predicted not to default, 97.4% actually do not default.
4.4. Naive Bayes
The e1071 package provides the naiveBayes function, which allows continuous and categorical features to be used in the naive Bayes model. It is a count-based classifier: essentially all it does is count how often each variable's distinct values occur for each class.
• The given dataset consists of both categorical and numerical variables.
• Naive Bayes works best with categorical values but can be made to work on mixed datasets having continuous as well as categorical predictors.
• Since this algorithm runs on conditional probabilities, it is hard to bin the continuous variables: they have no frequencies, only a continuum scale.
• Moreover, a model can be created with a mixture of categorical and numerical values, but its accuracy is lower than one created with only categorical values.
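The mean/sd-per-class treatment of continuous features can be sketched by hand. The priors below use the 91.59%/8.4% class split quoted earlier; the toy interest-rate distributions are our assumption, and this Gaussian scoring is what naiveBayes (e1071) does internally for numeric features:

```r
# Hand-rolled Gaussian naive Bayes for one continuous predictor,
# mirroring the mean [,1] / sd [,2] per-class tables that naiveBayes prints
set.seed(9)
x_paid    <- rnorm(500, mean = 12, sd = 3)   # toy int_rate, fully paid class
x_default <- rnorm(500, mean = 18, sd = 4)   # toy int_rate, default class

tab <- rbind("Fully Paid" = c(mean(x_paid),    sd(x_paid)),
             "Default"    = c(mean(x_default), sd(x_default)))
colnames(tab) <- c("mean", "sd")
round(tab, 2)

# Classification: pick the class with the higher prior * normal density
classify <- function(x, prior_paid = 0.916, prior_def = 0.084) {
  s_paid <- prior_paid * dnorm(x, tab["Fully Paid", "mean"], tab["Fully Paid", "sd"])
  s_def  <- prior_def  * dnorm(x, tab["Default", "mean"],    tab["Default", "sd"])
  ifelse(s_def > s_paid, "Default", "Fully Paid")
}
classify(c(10, 25))
```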
Figure 35
4.4.1 Interpretation
• For continuous predictors, naive Bayes does not count frequencies; it models each one with a normal density parameterised by the class-wise mean and standard deviation (or variability) and scores each class accordingly.
• This works naturally for a binary classifier; for a multinomial response the data would effectively have to be partitioned via quantiles or deciles, with probabilities assigned accordingly.
• Based on the above, NB's accuracy on a mixed dataset is always questionable; its findings and predictions need to be supported by other classifiers before any actionable operations.
• The output of the NB model is displayed in matrix format: for each predictor, its mean [,1] and standard deviation [,2] for class 1 and class 0.
• The independence of predictors (no multicollinearity) has been assumed for the sake of simplicity.
4.4.2 Evaluation(Train)
Figure 36
4.4.3 Evaluation(Test)
Figure 37
INTERPRETATION:
• Naive Bayes works best with categorical values; on a mixed dataset of continuous and categorical predictors its accuracy is always questionable, and its predictions need to be supported by other classifiers before any actionable operations.
• The accuracy is lower when compared to the other models.
5. Model Comparison
6. Model Tuning
6.1. Random forest
We build the forest with 4 variables as candidates at each split and 1,000 as the minimum size of a terminal node. We analyse the out-of-bag (OOB) error to find ntree; in our case it is around 190.
Figure 38
6.1.1 Interpretation
Tune RF and important variables:
We use the tuneRF function to get the mtry value and build the tuned random forest. As per the result below, mtry = 4 has the minimum out-of-bag error.
Figure 39
Important variables:
Based on the Mean Decrease Gini output, the top 4 variables for predicting whether a customer will default on the loan are last_pymnt_amnt, issue_d, last_pymnt_d and last_credit_pull_d.
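A sketch of this tuning step on synthetic stand-in data; it is guarded so it degrades gracefully when the randomForest package (a contributed package, not part of base R) is unavailable, and the node size is scaled down from the report's 1,000 to suit the toy sample:

```r
set.seed(11)
n  <- 1000
df <- data.frame(
  last_pymnt_amnt = rexp(n, 1 / 2000),
  issue_d         = runif(n, 15000, 16500),   # dates coerced to numeric
  dti             = runif(n, 0, 60)
)
df$loan_status <- factor(ifelse(
  runif(n) < plogis(-2 - 5e-04 * df$last_pymnt_amnt + 0.03 * df$dti),
  "Default", "Fully Paid"))

if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)
  # tuneRF searches over mtry by comparing OOB error at each candidate value
  tuned <- tuneRF(df[, 1:3], df$loan_status,
                  ntreeTry = 190, stepFactor = 1.5, improve = 0.01,
                  trace = FALSE, plot = FALSE)
  best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]
  rf <- randomForest(loan_status ~ ., data = df, mtry = best_mtry,
                     ntree = 190, nodesize = 50, importance = TRUE)
  print(importance(rf, type = 2))   # Mean Decrease Gini per predictor
}
```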
Interpretation:
Figure 40
Figure 41
From the above figures we can see that last_pymnt_amnt, issue_d, last_pymnt_d and last_credit_pull_d are the significant variables for prediction.
6.1.2 Evaluation(Train)
Figure 42
6.1.3 Evaluation(Test)
Figure 43
INTERPRETATION:
The model is stable, as evident from the confusion matrices for the training and testing datasets. Based on the test metrics we can interpret that:
1. The model catches 98% of the customers who will default.
2. The model catches 97% of the customers who will not default on loan payment.
3. Overall accuracy is 97%.
4. Of the customers predicted to default, 95% actually default.
5. Of the customers predicted not to default, 99% actually do not default.
6.2. Boosting
Boosting is another ensemble algorithm, used in supervised learning to reduce bias and also variance. In an ensemble algorithm, a set of weak learners is combined to form a strong learner. Boosting is oriented towards bias, i.e., simple or weak learners. A weak learner is a learner that always learns something: it does better than chance and has an error rate of less than 50%. The best example of a weak learner is a decision tree, which is why we generally use ensemble techniques on decision trees to improve their accuracy and performance.
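The report does not show which boosting package was used. As a self-contained illustration of how boosting combines weak learners, here is a minimal AdaBoost with one-variable threshold stumps in base R (a sketch of the idea, not the report's actual model):

```r
# Minimal AdaBoost with decision stumps as the weak learners
set.seed(13)
n <- 400
x <- runif(n, 0, 10)
y <- ifelse(x > 6, 1, -1)       # the true rule the stumps must discover
idx <- sample(n, 20)
y[idx] <- -y[idx]               # label noise keeps a single stump "weak"

adaboost_stumps <- function(x, y, rounds = 10) {
  w <- rep(1 / length(y), length(y))   # observation weights
  stumps <- list()
  for (r in seq_len(rounds)) {
    # pick the threshold/direction with the lowest weighted error
    best <- NULL
    for (thr in sort(unique(x))) for (dir in c(1, -1)) {
      pred <- ifelse(x > thr, dir, -dir)
      err  <- sum(w * (pred != y))
      if (is.null(best) || err < best$err) best <- list(thr = thr, dir = dir, err = err)
    }
    alpha <- 0.5 * log((1 - best$err) / max(best$err, 1e-10))
    pred  <- ifelse(x > best$thr, best$dir, -best$dir)
    w     <- w * exp(-alpha * y * pred); w <- w / sum(w)  # up-weight mistakes
    stumps[[r]] <- c(best, alpha = alpha)
  }
  stumps
}

predict_boost <- function(stumps, x) {
  score <- Reduce(`+`, lapply(stumps, function(s)
    s$alpha * ifelse(x > s$thr, s$dir, -s$dir)))
  ifelse(score > 0, 1, -1)
}

model <- adaboost_stumps(x, y, rounds = 10)
mean(predict_boost(model, x) == y)   # training accuracy of the ensemble
```

Each round re-weights the observations the current ensemble gets wrong, so later stumps concentrate on the hard cases; this is the bias-reduction mechanism described above.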
Figure 44
6.2.1 Interpretation
Figure 45
6.2.2 Evaluation(Train)
Figure 46
Accuracy = 99.51, Sensitivity = 99.47, Specificity = 99.54, AUC = 99.97, Gini = 93, KS = 99.04
6.2.3 Evaluation(Test)
Figure 47
Accuracy = 99.10, Sensitivity = 98.61, Specificity = 99.15, AUC = 99.89, Gini = 98.2, KS = 97.99
INTERPRETATION:
The model is stable, as evident from the confusion matrices for the training and testing datasets. Based on the test metrics we can interpret that:
1. The model catches 98% of the customers who will default.
2. The model catches 99.15% of the customers who will not default on loan payment.
3. Overall accuracy is 99%.
4. Of the customers predicted to default, 98% actually default.
5. Of the customers predicted not to default, 99.15% actually do not default.
7. Best Model
On comparing random forest and boosting, we see that boosting outplays random forest.
• The KS value for most of the models is more than 90%, hence they will perform well in separating the default and fully paid cases.
• AUC is more than 95% for all models, hence we can consider them good classifiers.
• Among logistic regression, CART and Naive Bayes, the CART model has the highest accuracy and sensitivity; hence we conclude that CART is the best of the three.
• Comparing CART with random forest and boosting, we can conclude that the model developed using boosting is the best.
8.Conclusion
We have built various models to understand the factors which influence a loan being defaulted.
As per the model comparison, CART is the best of the three base models, with accuracy of about 97.31% and sensitivity of about 98%.
The models built using ensemble techniques (Random Forest and Boosting) are also good, with accuracy of about 98% and a balance between sensitivity and specificity.
Overall, Boosting is the most stable model.
Issue_d, last_pymnt_amt and last_credit_pull_d play an important role in predicting whether a customer will default or not.
There is no definitive guide to which algorithm to use in a given situation. What works on some data sets may not work on others. Therefore, always evaluate methods using cross-validation to get reliable estimates.
Sometimes we may be willing to give up some improvement to the model if it would increase the complexity far more than the corresponding improvement in the evaluation metrics.
In some classification problems, false negatives are a lot more expensive than false positives. Therefore, we can lower the cut-off point to reduce the false negatives.
Education loans have a lower default rate compared to other products; hence, the business should focus on growing this segment.
Past credit history and delinquency should be cross-verified.
Special rules should be assigned to customers who fall under the low-income category.
Background checks should be conducted for customers who are likely to default.
The higher the interest rate, the higher the chance of default, so there is a need to create a matrix considering income and assets as valid characteristics.
Home ownership (Own/Rent/Mortgage) also mattered; in particular, people who rented a house were prone to default.
Factors that did not contribute much to loan default were:
Annual salary
Employment length
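The cut-off adjustment mentioned in the conclusion can be sketched in base R with toy predicted probabilities (1 = default):

```r
# Lowering the classification cut-off reduces false negatives
# (at the cost of more false positives) -- toy probabilities only
prob   <- c(0.90, 0.70, 0.55, 0.45, 0.30, 0.10)
actual <- c(1,    1,    1,    1,    0,    0)

pred_50 <- ifelse(prob > 0.5, 1, 0)   # default 0.5 cut-off
pred_40 <- ifelse(prob > 0.4, 1, 0)   # lowered cut-off

fn_50 <- sum(actual == 1 & pred_50 == 0)  # defaulters missed at 0.5
fn_40 <- sum(actual == 1 & pred_40 == 0)  # defaulters missed at 0.4
```

Here the borrower scored 0.45 is missed at the 0.5 cut-off but caught at 0.4; the right cut-off depends on the relative cost of the two error types.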
APPENDIX
RANDOM FOREST(TEST)
CART(TRAIN)
CART(TEST)
LR(TEST)
TABLEAU GENERATED GRAPHS:
Reading Data:
library(readxl)
Data<- read_xlsx('Data.xlsx')
names(Data)
Data Transformation:
Data$issue_d=as.Date(Data$issue_d)
Data$last_pymnt_d=as.Date(Data$last_pymnt_d)
Data$last_credit_pull_d=as.Date(Data$last_credit_pull_d)
Data$issue_d=as.numeric(Data$issue_d)
Data$last_pymnt_d=as.numeric(Data$last_pymnt_d)
Data$last_credit_pull_d=as.numeric(Data$last_credit_pull_d)
Data$term=as.factor(Data$term)
Data$grade=as.factor(Data$grade)
Data$emp_length=as.factor(Data$emp_length)
Data$loan_status=as.factor(Data$loan_status)
str(Data)
DATA SPLIT:
library(caTools)
set.seed(1203)
index=sample.split(Data$loan_status,SplitRatio = 0.7)
TrainData=subset(Data,index==TRUE)
TestData=subset(Data,index==FALSE)
prop.table(table(TrainData$loan_status))
prop.table(table(TestData$loan_status))
TrainData=as.data.frame(TrainData)
library(DMwR)
set.seed(101)
# traindf: SMOTE-balanced training data built from TrainData
prop.table(table(traindf$loan_status))
##Logistic regression
library(car)
set.seed(1)
# lgmodel: full logistic regression model (formula assumed)
lgmodel = glm(loan_status ~ ., data = traindf, family = binomial)
summary(lgmodel)
car::vif(lgmodel)
library(pscl)
pR2(lgmodel)
# Odds Ratio
exp(coef(lgmodel))
# Probability
exp(coef(lgmodel))/(1+exp(coef(lgmodel)))
names(traindf)
## Interpretation(In sample-Train)
# lgmodel2: refined logistic model (reduced set of significant variables)
Model1_pred = predict(lgmodel2,newdata = traindf,type='response')
Model1_predicted=ifelse(Model1_pred>0.5,1,0)
#Factor conversion
Model1_predicted_factor=factor(Model1_predicted,levels = c(0,1))
head(Model1_predicted_factor)
## Confusion matrix
library(caret)
Model1.CM=confusionMatrix(Model1_predicted_factor,traindf$loan_status)
Model1.CM
prop.table(table(Model1_predicted_factor,traindf$loan_status))
## Interpretation(Out of sample-Test)
Model1_pred = predict(lgmodel2,newdata = TestData,type='response')
Model1_predicted=ifelse(Model1_pred>0.5,1,0)
#Factor conversion
Model1_predicted_factor=factor(Model1_predicted,levels = c(0,1))
head(Model1_predicted_factor)
## Confusion matrix
Model1.CM=confusionMatrix(Model1_predicted_factor,TestData$loan_status)
Model1.CM
prop.table(table(TestData$loan_status,Model1_predicted_factor))
# Naive Bayes
library(e1071)
set.seed(1)
# nbmodel: Naive Bayes classifier (formula assumed)
nbmodel = naiveBayes(loan_status ~ ., data = traindf)
nbmodel
library(caret)
nb_predictions = predict(nbmodel, newdata = traindf)
confusionMatrix(nb_predictions,traindf$loan_status,positive="1")
nb_predictions=as.numeric(nb_predictions)
traindf1=as.numeric(traindf$loan_status)
# ROC plot
library(ROCR)
train.roc = prediction(nb_predictions, traindf1)
plot(performance(train.roc, "tpr", "fpr"), col = "red", main = "ROC Curve for train data")
# AUC
train.auc = performance(train.roc, "auc")
train.area = as.numeric(train.auc@y.values)
train.area
# KS
ks.train <- performance(train.roc, "tpr", "fpr")
train.ks = max(ks.train@y.values[[1]] - ks.train@x.values[[1]])
train.ks
# Gini
train.gini = (2 * train.area) - 1
train.gini
##Test
library(caret)
nb_predictions = predict(nbmodel, newdata = TestData)
confusionMatrix(nb_predictions,TestData$loan_status,positive="1")
nb_predictions=as.numeric(nb_predictions)
testdf1=as.numeric(TestData$loan_status)
# ROC plot
library(ROCR)
test.roc = prediction(nb_predictions, testdf1)
plot(performance(test.roc, "tpr", "fpr"), col = "red", main = "ROC Curve for test data")
# AUC
test.auc = performance(test.roc, "auc")
test.area = as.numeric(test.auc@y.values)
test.area
# KS
ks.test <- performance(test.roc, "tpr", "fpr")
test.ks = max(ks.test@y.values[[1]] - ks.test@x.values[[1]])
test.ks
# Gini
test.gini = (2 * test.area) - 1
test.gini
## CART
library(rpart)
library(rpart.plot)
#Control parameters allowing the tree to grow to its maximum (settings assumed)
tree_control <- rpart.control(minsplit = 2, cp = 0)
tree_iter1 <- rpart(formula = loan_status ~ .,data = traindf, method = "class", control = tree_control)
library(rattle)
fancyRpartPlot(tree_iter1)
rpart.plot(tree_iter1)
prp(tree_iter1)
printcp(tree_iter1)
tree_iter1$cptable
# Pruned tree: cut back at the cp with the lowest cross-validated error
ptree <- prune(tree_iter1, cp = tree_iter1$cptable[which.min(tree_iter1$cptable[, "xerror"]), "CP"])
printcp(ptree)
rpart.plot(ptree)
prp(ptree)
# perf_train_cart and the AUC/KS/Gini objects are built with ROCR as in the Naive Bayes section
plot(perf_train_cart)
library(caret)
train.area
gini_train_cart
confusionMatrix(table(traindf$predict.class, traindf$loan_status))
### TEST
plot(perf_test)
KS_test
auc_test
gini_test
confusionMatrix(table(TestData$predict.class, TestData$loan_status))
##Random forest
library(randomForest)
# Fit on the balanced training data (other arguments omitted in the source)
random_forest = randomForest(loan_status ~ ., data = traindf, importance = TRUE)
print(random_forest)
impVar = importance(random_forest)
impVar[order(impVar[,3], decreasing=TRUE),]
# tuneRF searches for the best mtry; column 15 of traindf is loan_status
names(traindf)
str(traindf)
set.seed(2306)
tuned_rf = tuneRF(x = traindf[, -15],
                  y = traindf$loan_status,
                  mtryStart = 8,
                  ntreeTry = 500,
                  stepFactor = 1.5,
                  improve = 0.001,
                  trace = TRUE,
                  plot = TRUE,
                  doBest = TRUE,
                  nodesize = 280,
                  importance = TRUE)
tuned_rf
plot(tuned_rf, main="")
##Train
predrf <- predict(random_forest, newdata = traindf[,-15], type = "prob")
RFpredROC <- ROCR::prediction(predrf[,2], traindf$loan_status)
perf <- ROCR::performance(RFpredROC, "tpr", "fpr")
plot(perf)
as.numeric(performance(RFpredROC, "auc")@y.values)
# KS
train.ks = max(perf@y.values[[1]] - perf@x.values[[1]])
train.ks
# Gini (AUC value hard-coded from the output above)
train.gini = (2 * 0.9995019) - 1
train.gini
#Confusion matrix
traindf$predict.class <- predict(tuned_rf, traindf, type="response")
confusionMatrix(traindf$predict.class,
traindf$loan_status)
#Test
predrf <- predict(random_forest, newdata = TestData[,-15], type = "prob")
RFpredROC <- ROCR::prediction(predrf[,2], TestData$loan_status)
perf <- ROCR::performance(RFpredROC, "tpr", "fpr")
plot(perf)
as.numeric(performance(RFpredROC, "auc")@y.values)
# KS
ks.test <- performance(RFpredROC, "tpr", "fpr")
test.ks = max(ks.test@y.values[[1]] - ks.test@x.values[[1]])
test.ks
# Gini (AUC value hard-coded from the output above)
test.gini = (2 * 0.9975996) - 1
test.gini
## Interpretation(Out of sample-Test, base model lgmodel)
Model1_pred = predict(lgmodel,newdata = TestData,type='response')
Model1_predicted=ifelse(Model1_pred>0.5,1,0)
#Factor conversion
Model1_predicted_factor=factor(Model1_predicted,levels = c(0,1))
head(Model1_predicted_factor)
## Confusion matrix
library(caret)
Model1.CM=confusionMatrix(Model1_predicted_factor,TestData$loan_status)
Model1.CM