CREDIT RISK MODELLING
By Durgesh Kinerkar & Abhishek Singh
TABLE OF CONTENTS

01 OBJECTIVE
02 DATA EXPLORATION
03 DATA PREPROCESSING
04 MODEL DEVELOPMENT
05 MODEL EVALUATION
06 CONCLUSIONS AND RECOMMENDATIONS

PROBLEM 1
OBJECTIVE
The objective is to provide a solution that assists the company in arriving at suitable criteria for sanctioning loans:

Identify the most important attributes the company should use in the future when considering loan approvals.
Interpret the various measures of accuracy provided by the chosen model.

DATA EXPLORATION
Basic data exploration provides fundamental information such as the structure of the data, the shape of the data frame, the data types of the columns and the count of missing values. It also shows that the data is very messy: there are many missing values, and all the categorical variables need to be encoded. The value count of the target variable shows that the classes are imbalanced, which is helpful in selecting preliminary evaluation methods. For example, if 99% of the loans are good loans and only 1% are bad loans, a model that classifies every loan as a good loan achieves a 99% accuracy score but serves no purpose. Hence, the accuracy score is not a suitable metric.

The data contains 135 columns and 25,000 rows.


DATA PREPROCESSING
As the data is extremely messy, it is essential to preprocess it before giving it as input to a machine learning model. In this analysis, data preprocessing focuses on three aspects: dealing with missing values, feature elimination and label encoding.

There are 135 features or attributes in the dataset. The proportion of missing values is calculated for each feature, and any feature with more than 25% missing values is dropped from the dataset; this also serves the purpose of eliminating features which are not significant. Subsequently, any row with a missing value is deleted, as all the attributes in the dataset are sensitive and any imputation method may distort the distribution of the data or introduce bias.
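A minimal sketch of this step, assuming a Python workflow with pandas; the file name is an assumption used only for illustration:

```python
import pandas as pd

# Load the raw data (the file name is an assumption for illustration).
loans = pd.read_csv("credit_risk_data.csv")

# Proportion of missing values in each feature.
missing_share = loans.isna().mean()

# Drop any feature with more than 25% missing values.
loans = loans.loc[:, missing_share <= 0.25]

# Delete the remaining rows that contain missing values rather than imputing
# them, to avoid distorting the distributions or introducing bias.
loans = loans.dropna()

print(loans.shape)
```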

Further, the data description of the dataset is used to understand the attributes, and, keeping the objective of the study in mind, some more insignificant features are eliminated from the data. Finally, 34 attributes are retained, and the categorical variables are encoded using a label encoder.
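A sketch of the encoding step with scikit-learn's LabelEncoder, assuming the `loans` DataFrame from the previous sketch; the target column name `loan_status` is an assumption:

```python
from sklearn.preprocessing import LabelEncoder

# Encode every remaining categorical (object-typed) column as integers.
for col in loans.select_dtypes(include="object").columns:
    loans[col] = LabelEncoder().fit_transform(loans[col])

# Separate features and target (the target column name is an assumption).
X = loans.drop(columns=["loan_status"])
y = loans["loan_status"]  # Bad Loan = 0, Good Loan = 1
```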

MODEL DEVELOPMENT
A set of classification models are developed - Logistic Regression model, Support Vector Machine Model, Decision Tree Model
and KNN model. The accuracy of each model is evaluated on the validation set using cross-validation. A boxplot is generated to
visualize the results. It is observed that the Support Vector Machine model performs the best with a median accuracy between
86% to 88%. The median is chosen as it is less affected by the outliers. The Logistic Regression classifier is a close second.
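A sketch of this comparison with scikit-learn, continuing from the `X` and `y` above; the split ratio and default hyperparameters are illustrative assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Hold out a test set for the final evaluation; stratify to keep class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
}

# 10-fold cross-validated accuracy on the training data for each candidate model.
results = {name: cross_val_score(m, X_train, y_train, cv=10, scoring="accuracy")
           for name, m in models.items()}

# Boxplot of the cross-validation scores for visual comparison.
plt.boxplot(list(results.values()), labels=list(results.keys()))
plt.ylabel("Accuracy")
plt.show()
```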

RANDOM FOREST CLASSIFIER
An ensemble learning model, which combines the individual classification models and uses majority voting to arrive at a prediction, is first developed to check whether it gives better performance by overcoming the variance of the individual models. However, there is no significant improvement in model performance. A Random Forest model is then developed and evaluated. The Random Forest is an extension of bagging that randomizes the features used in each training subset and uses decision trees as base learners; it reduces the tendency of decision trees to overfit the training data by combining bagging with feature randomization. The accuracy of the Random Forest classification model is 87.22%.
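A sketch of both ensembles, reusing the `models` dictionary and the train/test split from the previous sketch; the hyperparameters are illustrative assumptions:

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hard-voting ensemble: majority vote over the individual classifiers above.
voting = VotingClassifier(estimators=list(models.items()), voting="hard")
voting.fit(X_train, y_train)

# Random Forest: bagged decision trees with randomized feature subsets.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print("Voting ensemble accuracy:", accuracy_score(y_test, voting.predict(X_test)))
print("Random Forest accuracy:  ", accuracy_score(y_test, rf.predict(X_test)))
```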

FEATURE IMPORTANCE
The Random Forest model provides a metric that measures which features are most important for prediction. This can be used to identify the most significant attributes.

From the graph, we can see that the most important features are Debt-to-Income Ratio, Average Current Balance, Interest Rate, Annual Income, Employment Length and the other attributes displayed in the bar graph.
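A sketch of how these importances can be extracted from the fitted Random Forest (`rf` from the previous sketch) and plotted:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Feature importances from the fitted Random Forest, plotted as a bar graph.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values().tail(10).plot(kind="barh")
plt.xlabel("Feature importance")
plt.tight_layout()
plt.show()
```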

MODEL EVALUATION
A confusion matrix is plotted, and a classification report is generated showing model evaluation metrics such as precision, recall and F1-score. The confusion matrix gives a detailed breakdown of model performance in terms of true positives, true negatives, false positives and false negatives, providing a holistic view of the model's performance.

NOTE: Bad Loan = 0, Good Loan = 1
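A sketch of this evaluation step, assuming scikit-learn 1.0 or later for the confusion-matrix display helper:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

y_pred = rf.predict(X_test)

# Confusion matrix: true/false positives and negatives for both classes.
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=["Bad Loan", "Good Loan"])
plt.show()

# Classification report: precision, recall and F1-score for each class.
print(classification_report(y_test, y_pred, target_names=["Bad Loan", "Good Loan"]))
```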

MODEL EVALUATION

The recall, or true positive rate, for Class 0 ("Bad Loans") measures the percentage of actually bad loans that were correctly classified. The recall for Class 0 is 0.00, which essentially means that the model cannot predict bad loans and hence serves no purpose. The precision measures the percentage of loans classified as bad loans that really are bad loans; this model has a precision of 40%.
MODEL EVALUATION
It can be observed that although the model has an accuracy of 87%, it does not serve our purpose because it cannot correctly predict bad loans. In fact, the model performs very poorly in predicting bad loans, as shown by evaluation metrics such as the confusion matrix and the classification report. This is caused by class imbalance: for example, if 99% of the loans are good loans and only 1% are bad loans, a model that classifies every loan as a good loan achieves a 99% accuracy score but serves no purpose, so the accuracy score is not a suitable metric. Class imbalance occurs when there are many more instances of one class than of the others, which is exactly the problem in this credit risk data (see graph below). Model performance drops significantly when it is trained on such biased data.
DEALING WITH IMBALANCED CLASSES
The class imbalance problem can be addressed by resampling the training data so that the model receives an equal number of observations from both classes. This can be done in two ways:

Random under-sampling: removing observations of the majority class until both classes are balanced
Random over-sampling: adding more observations of the minority class until both classes are balanced

A major drawback of these techniques is the loss of information in the case of undersampling, and overfitting or poor generalization in the case of oversampling.

As the dataset has a sufficiently large number of observations, random undersampling is the more appropriate choice. Once the model has been trained on the resampled data, it can be seen that, although it classifies many "good loans" as "bad loans", it correctly predicts bad loans.
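A sketch of the undersampling and retraining step, assuming the imbalanced-learn package is available; the test set is never resampled:

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Randomly drop "good loan" (majority class) rows from the training data until
# both classes are balanced; the test set is left untouched.
X_res, y_res = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

rf_balanced = RandomForestClassifier(n_estimators=100, random_state=42)
rf_balanced.fit(X_res, y_res)

print(classification_report(y_test, rf_balanced.predict(X_test),
                            target_names=["Bad Loan", "Good Loan"]))
```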
A NEW MODEL
A new model is developed, trained on the resampled dataset and validated on the same test data as before. It can be observed that the model performance has improved significantly. The recall for bad loans is 1.00, which means that all bad loans are classified correctly, though the precision is 0.30; that is, a significant number of good loans are classified as bad loans. Nevertheless, the model is sufficiently useful for identifying bad loans and preventing losses.
PROBLEM 2
OBJECTIVE

The objective is to develop a credit risk assessment model to classify good loans and bad loans using credit risk data and other attributes.
MODEL DEVELOPMENT
XGBoost is used to develop a gradient-boosted model. Gradient boosting aims to improve model performance by training multiple machine learning models sequentially, where each model learns from the mistakes of its predecessors; decision trees are generally used as the base learners. However, it is observed that the model's performance is lower than that of the Random Forest classifier.
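A sketch of the gradient-boosted model using the xgboost package's scikit-learn API, trained on the resampled data from the earlier sketch; the hyperparameters are illustrative assumptions:

```python
from xgboost import XGBClassifier

# Gradient-boosted decision trees: each tree is fitted to correct the errors
# of the ensemble built so far. Trained on the resampled (balanced) data.
xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=4,
                    random_state=42)
xgb.fit(X_res, y_res)

print("Test accuracy:", xgb.score(X_test, y_test))
```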
MODEL EVALUATION
It can be observed that the gradient-boosted model, which is also trained on the resampled dataset after dealing with the class imbalance, shows a performance similar to that of the Random Forest model. For Class 0, the precision is 0.29 and the recall is 1.00, which means it correctly classifies bad loans, but the error rate for good loans is higher. The overall accuracy is 69%.
ANOTHER MODEL EVALUATION METRIC
The ROC curve shows model performance at every classification threshold by plotting the True Positive Rate against the False Positive Rate. The Area Under the ROC Curve (AUC) measures the ability of the classifier to distinguish between the classes: the higher the AUC, the better the model is at separating the positive and negative classes, with 1 being the ideal value. The AUC of the Random Forest model trained on the imbalanced data is 0.6773, which leaves considerable room for improvement.

AUC-ROC SCORE for Random Forest Classifier (with imbalanced classes): 0.67

AUC-ROC SCORE for Random Forest Classifier (with balanced classes): 0.88

AUC-ROC SCORE for Gradient Boosted Classifier (with balanced classes): 0.96
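A sketch of how these scores can be computed, assuming the fitted models from the earlier sketches (`rf`, `rf_balanced` and `xgb` are illustrative names):

```python
from sklearn.metrics import roc_auc_score, RocCurveDisplay

# AUC-ROC uses the predicted probability of the positive class (Good Loan = 1).
for name, model in [("Random Forest (imbalanced classes)", rf),
                    ("Random Forest (balanced classes)", rf_balanced),
                    ("Gradient Boosted (balanced classes)", xgb)]:
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUC-ROC = {auc:.2f}")

# ROC curve: True Positive Rate vs. False Positive Rate at every threshold.
RocCurveDisplay.from_estimator(rf_balanced, X_test, y_test)
```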
CONCLUSION
In conclusion, it is recommended that both the Random Forest classifier and the gradient-boosted classifier (XGBoost) be used to predict bad loans, as both perform significantly well at this task. However, performance drops when it comes to classifying good loans; this means that the credit company will reject some applications that may in fact be good loans, but it will accurately identify bad loans and avoid them in its credit portfolio. The key points of this analytical study are treating the input data before feeding it into the model, and selecting features based on business logic and statistical data analysis.
