
Bachelor Thesis

Using Deep Learning and Explainable AI to


Predict and Explain Loan Defaults

At
ZHAW School of Management and Law
Business IT

Submitted by
Sakip Sulejmani

On
9th June 2021

Supervised by
Dr. Bledar Fazlija
Institute of Wealth & Asset Management
Management Summary
The use of machine learning in finance is increasing, and while deep learning models
are becoming the state of the art to make predictions, the difficulty of interpreting them
is a drawback. This is especially so in finance, where each result that a model outputs
must be explainable and justifiable. In recent years, novel explainable AI methods have
been researched and developed to explain deep learning models and their decisions.
The aim of this bachelor thesis was to analyze a use case in credit scoring, spe-
cifically in loan defaulting, with deep learning and explainable AI. It also aimed to
show that deep learning can be used to predict loan defaults in finance, that explainable
AI methods offer insights for interpreting the black box’s internal decisions, and fur-
thermore, that it is possible to improve models with insights from explainable AI.
A peer-to-peer loan dataset from Bondora with 164,547 instances and 112 fea-
tures was analyzed, pre-processed, and prepared for deep learning. Multiple neural net-
works with different parameters were fitted and evaluated to find the best hyperparame-
ters for loan default predicting with the dataset. A post hoc analysis with SHAP was
applied to the best model to retrieve insights from it. These insights were then used to
explain the model’s decisions and to adjust it.
The results show that the model has an AUC of 0.72, meaning it ranks a randomly chosen defaulted loan above a randomly chosen non-defaulted loan with a probability of 72%. In addition, a recall of 0.88 was reached, meaning the model predicts 88% of defaulted loans
correctly. Furthermore, the insights gained from explainable AI enabled the creation of
a second, adjusted model that reached equally good performance with only half of the
features. Moreover, the explainable AI insights were used to determine and analyze the
fifteen features which influence the model the most. The three most influential were
debt-to-income, applied amount and loan duration. Additionally, two loan instances
from the dataset were analyzed in detail with SHAP.
In conclusion, using deep learning and explainable AI we were able to predict
loan defaults, and interpret as well as explain the model’s decisions. Moreover, the ex-
plainable AI insights could be used to adjust and improve the model. A complete use
case in credit scoring is shown in this thesis, highlighting that deep learning and ex-
plainable AI can be used in finance. However, the gained insights from the explainable
AI methods were very specific to the used dataset and therefore further research with
different datasets would be interesting.

Table of Contents
Management Summary
List of Figures
List of Tables
1 Introduction
2 Theoretical background
2.1 Artificial intelligence
2.1.1 Machine learning
2.1.2 Deep learning
2.2 Explainable AI
2.2.1 Shapley values
2.3 Credit scoring with deep learning and explainable AI
3 Methodology
3.1 Data
3.1.1 Data preparation and pre-processing
3.1.2 Data analysis
3.1.3 Data correlations
3.2 Predicting loan default with a deep learning model
3.2.1 Preparing the data for machine learning
3.2.2 Creating and training the model
3.3 Interpreting and explaining the models’ decisions
3.3.1 Global interpretability
3.3.2 Local interpretability
3.4 Adjusting the model with interpretability insights
4 Results
5 Discussion
6 References
7 Appendix

List of Figures
Figure 1: Data science lifecycle (Murdoch et al., 2019, p. 2)
Figure 2: Artificial intelligence, machine learning and deep learning
Figure 3: Classical programming vs Machine learning (Chollet, 2018, p. 5)
Figure 4: NN model (IBM, 2021b)
Figure 5: Missing values per feature
Figure 6: Variable “Age” histogram
Figure 7: Variable “Interest” histogram
Figure 8: Variable “LoanDuration” histogram
Figure 9: Pearson’s correlation matrix
Figure 10: Phik correlation matrix
Figure 11: First instance of the dataset after scaling
Figure 12: Basic model plot
Figure 13: Basic model with dropout plot
Figure 14: NNs without and with dropout
Figure 15: Final model plot
Figure 16: Global model feature importance mean
Figure 17: Global model feature importance top 15
Figure 18: SHAP values “DebtToIncome”
Figure 19: SHAP values “AppliedAmount”
Figure 20: SHAP values “LoanDuration”
Figure 21: SHAP values “Age”
Figure 22: SHAP values “Interest”
Figure 23: SHAP values “ExistingLiabilities”
Figure 24: SHAP values “AmountOfPreviousLoansBeforeLoan”
Figure 25: SHAP global model feature importance 16 - 48
Figure 26: SHAP values for instance 2201 in test dataset
Figure 27: SHAP values for instance 58 in test dataset
Figure 28: Initial model learning process
Figure 29: Adjusted model learning process
Figure 30: Model confusion matrix
Figure 31: AUC - ROC curves

List of Tables
Table 1: Data preparation removed features and criteria
Table 2: Categorical features and their categories
Table 3: Value counts for the feature “Status”
Table 4: Dataset overview
Table 5: Snippet of dataset
Table 6: Variable “AppliedAmount” statistics
Table 7: Variable “DebtToIncome”, “AmountOfPreviousLoans” and “ExistingLiabilities” statistics
Table 8: Variable “IncomeTotal” statistics
Table 9: Variable “MonthlyPayment” statistics
Table 10: Variable “Gender” categories
Table 11: Variable “Country” categories
Table 12: Variable “Education” categories
Table 13: Variable “VerificationType” categories
Table 14: Variable “HomeOwnershipType” categories
Table 15: Variable “EmploymentDurationCurrentEmployer” categories
Table 16: Variable “NewCreditCustomer” categories
Table 17: Target variable “Default” categories
Table 18: Snippet of the dataset before preparing for machine learning
Table 19: Snippet of dataset after one-hot encoding
Table 20: Grid search possible values for each parameter
Table 21: Grid search results top 50
Table 22: Features possible to remove (SHAP)
Table 23: Performance measurements
Table 24: Initial model fitting output
Table 25: Adjusted model fitting output
Table 26: Instances 2201 (0) and 58 (1) in test dataset

1 Introduction
This Bachelor thesis showcases the use of deep learning and explainable AI in the con-
text of credit scoring. A statistical analysis was performed to determine creditworthiness
using a dataset from Bondora, a peer-to-peer (P2P) lending platform. In P2P lending,
one borrower receives money from one-to-many entities, which can be either individu-
als or companies. The borrowers and lenders are usually connected over an online plat-
form (e.g., Bondora). After a borrower has applied for the loan, lenders can bid to lend
the money. There are also auto-bid options in which the platform automatically lends a
predefined sum from a lender to multiple diversified borrowers. However, in P2P, de-
faults are quite common. Therefore, identifying loans with a high possibility of default
is important for the platform as well as the lenders.
Machine learning (ML) models are becoming the state of the art for prediction
making. However, the challenge of interpretability has always been a drawback
(Murdoch et al., 2019, p. 2). The more complex a model becomes, the less interpretable
it is. In fact, model complexity used to be limited to enforce intrinsic interpretability of
the model (Molnar et al., 2020, p. 2). ML, on the other hand, usually follows a non-
linear approach, which results in less interpretable models but good predictive perfor-
mance (Molnar et al., 2020, p. 2). To make ML models more interpretable, a lot of in-
terpretable machine learning (IML) and explainable AI research has been conducted in
the last few years. Explainable AI aims to extract relationships learned by the model,
and, as Molnar et al. (2020, p. 2) add, to justify or improve a prediction. Justifying a
prediction can be especially important in environments with a lot of regulation, such as
healthcare and finance.
In finance, datasets often have complex data relations, which deep learning can-
not only identify but also exploit (Heaton et al., 2018, p. 1). Therefore, a deep learning
model was applied to the Bondora data set to gain advantageous results compared to
other ML methods.

Since deep learning models are hard to interpret by standard means, a post-hoc
interpretation method was used to explain the decisions of the model. As depicted in
Figure 1, the predictions are explained in a separate step afterwards. Finally, the insights
from the post hoc analysis were used to remove some unneeded features and thereby
improve the model.

Figure 1: Data science lifecycle (Murdoch et al., 2019, p. 2)

2 Theoretical background
To be able to apply the deep learning model as described in the Introduction, it is necessary to understand the theoretical concepts of ML and explainable AI and to acquire knowledge about research that has been conducted in these areas. In the following sec-
tion, the terms machine learning, deep learning, and neural networks are explained, fol-
lowed by explainable AI. Moreover, the current state of research in credit scoring with
machine learning and explainable AI will be presented.

2.1 Artificial intelligence


In computer science, artificial intelligence (AI) is defined as a system exhibiting human-like intelligence (IBM, 2020a). Chollet (2018, p. 4) defines it as “the effort to automate intellectual tasks normally performed by humans.” There are many different fields that fall under the term AI, but for this thesis, ML and deep learning are relevant. How AI, ML and deep learning relate to each other is depicted in Figure 2.

Figure 2: Artificial intelligence, machine learning and deep learning

2.1.1 Machine learning
ML is a form of AI that focuses on learning
from experience and improving over time (IBM, 2020b). Instead of programming data
rules by hand, an application with machine learning automatically creates and learns the
rules by looking at the data (Chollet, 2018, p. 5). As shown in Figure 3, in classical pro-
gramming, the rules and data are given to the program and the results are the answers.
In ML, on the other hand, the program takes the data and the answers, and the results
are rules, which can then be applied to other data (Chollet, 2018, p. 5).

Figure 3: Classical programming vs Machine learning (Chollet, 2018, p. 5)

2.1.2 Deep learning


Deep learning, as a form of ML, also tries to learn rules from the given data and answers.
Unlike other ML methods, however, deep learning can have multiple successive layers that try to learn different
rules or interpretations of the data (Heaton et al., 2018, p. 3). As Heaton et al. described
(2018, p.3), every layer applies a given non-linear activation function and extracts fea-
tures into factors. They further explained that the output of the first layer (factors) be-
comes the new input (features) for the next layer. A layer can be described as a filter; it
takes some data as an input and outputs the data in a more useful way (Chollet, 2018, p.
28). Layers try to extract more meaningful representations of the imputed data (Chollet,
2018, p. 28). Depending on the use case, a deep learning model can have tens, hundreds
or even thousands of layers, while other machine learning models only have one or two
layers to represent the data (Chollet, 2018, p. 8). That is where the “deep” in deep learn-
ing comes from. A deep learning model does not have a deeper understanding of the
data, it provides a deeper (greater) representation of the data.

Deep learning models are built from neural networks (NNs). NNs were constructed to work similarly to a human brain, mirroring how neurons communicate with each other (IBM, 2021b). Nowadays, NNs are not designed to closely mimic the human brain, but to solve complex problems (Marini, 2009, p. 477). Indeed, these NNs are far from being able to mimic the human brain, yet they deliver very accurate predictions in various fields (Marini, 2009, p. 477).

NNs consist of multiple node layers, as shown in Figure 4: an input layer, one or more hidden layers and an output layer. All neurons (hidden units) connect to the neurons in the next layer and pass their output on to them. Each neuron has weights, which specify what the layer does to the input data. Therefore,
the weights define the transformation that will be applied to the input (Chollet, 2018, p.
10).

Figure 4: NN model (IBM, 2021b)
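As a brief illustration of this transformation (standard textbook notation, not taken from the cited sources), a dense layer with weight matrix $W$, bias vector $b$ and activation function $f$ maps its input $x$ to

$$h = f(Wx + b),$$

and a deep model chains several such layers, each feeding its output into the next.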

2.2 Explainable AI
Deep learning and other complex AI algorithms can lack transparency and become very
difficult to interpret (Hagras, 2018, p. 29). These models usually take an input and give
an output without being able to reveal why they decided to do so. Explainable AI (or
interpretable AI) stands for an AI that can be easily understood and have its actions ana-
lyzed by humans (Hagras, 2018, p. 29). Depending on the use case and audience, these
insights can come in different formats such as visualizations, mathematical equations, or
natural language (Murdoch et al., 2019, p. 2).
There are two different methods for explainable AI, one that focuses on the in-
terpretability in the modeling and one in the post hoc analysis (as shown in the Intro-
duction). The first method of interpretability is dedicated to building models that pro-
vide a view on the relationships that the model has learned (Murdoch et al., 2019, p. 3).

However, this insight usually comes with less complex models and lower accuracy
(Murdoch et al., 2019, p. 3). On the other hand, in the post hoc analysis, the model is
built for the best accuracy and afterwards altered to interpret the model (Murdoch et al.,
2019, p. 4). Since this would be very hard to do manually by humans, various post hoc
interpretability methods have been developed to get an insight into the trained model
without needing to change the model (Murdoch et al., 2019, p. 6). Interpretability can
either be described by the input features used by the model or the low-level network
parameters (Chakraborty et al., 2017, p. 1). In post-hoc interpretation, which is used in
this thesis, the model functionality can be explained through text or spoken language,
image and visualization, local features (explaining in the context of the local feature
space around the input) as well as examples of similar inputs (Chakraborty et al., 2017,
p. 3).

2.2.1 Shapley values


Different frameworks can be used to perform explainable post hoc analysis. This thesis
uses SHAP (SHapley Additive exPlanations). SHAP is beneficial in comparison to oth-
er explainable AI models because it allows the contribution of each variable for each
instance in the dataset to be shown, without being restricted to one specific ML
model (Bussmann et al., 2021, p. 205). SHAP enables numerous model-agnostic local
and global techniques to be obtained and visualized (Kłosok & Chlebus, 2020, p. 14).
The underlying idea of Shapley values is a game in which payouts are assigned to
the players depending on their contribution to the total payout (Molnar, 2020, p. 222;
Shapley, 1953, p. 316). In ML, the prediction task for a single data instance is the game
and the players are the feature values of the data instance (Molnar, 2020, p. 222). There-
fore, these features together contribute to the prediction (total payout) and the feature
importance (gain) is the prediction for this single data instance minus the average pre-
diction of the dataset (Molnar, 2020, p. 222). This means that for a set of features and
the prediction outputted by the model, SHAP calculates the contribution of each feature
to the final prediction (Kłosok & Chlebus, 2020, p. 15).
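For reference, the classic Shapley value can be stated formally (this is the standard formula from the literature, added here for completeness): given the set of all features $N$ and a value function $v$ that returns the model output for a feature coalition $S$, the contribution of feature $i$ is

$$\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \bigl( v(S \cup \{i\}) - v(S) \bigr),$$

i.e., the average marginal contribution of feature $i$ over all possible coalitions of the remaining features.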

2.3 Credit scoring with deep learning and explainable AI


In credit scoring, and in finance in general, ML models with the most accurate risk predictions possible are needed (Bücker et al., 2020, p. 2). According to regulators, these models must also be transparent and auditable; therefore, rather simple models such as logistic regression or decision trees are still common, despite the superior predictive ability of advanced machine learning models such as deep learning (Bücker et al., 2020, p. 2).
A lot of research has also been done on rule-based models. For instance, Soui et
al. (pp. 145, 156) generated classification rules that minimize the risk and maximize
accuracy. However, because of various issues in the rules generation process, the rule-
based approach is very difficult to apply to big data (Moscato et al., 2021, p. 2). Moreo-
ver, in the last few years there has been a lot of research into using complex machine
learning models, although the majority of these models remain complex black boxes
(Carvalho et al., 2019, p. 1). So, there is still a gap in the research area of credit scoring
with deep learning and explainable AI.
Many papers and studies have compared different machine learning techniques in
credit scoring. In their comparison, Imtiaz & Brimicombe (2017, p. 3) show that in all
comparisons, NNs performed better when the dataset did not have missing values. Fur-
thermore, while the NN was a bit worse during the training of the model compared to
Decision Trees, the AUC in the out of sample data was a lot better than that from Deci-
sion Trees (87.90% vs 79.09%) (Imtiaz & Brimicombe, 2017, p. 4). Moreover,
Lessmann et al., (2015) compared over 40 classifiers with eight different credit scoring
data sets and evaluated them with six performance measures. They show that the credit
scoring standard, Logistic Regression (LR), can be outperformed by multiple other
models including NNs (Lessmann et al., 2015, p. 133). While some (Finlay, 2010, p.
531) do not see much advantage of using NNs over LR, both Lessmann et al. (2015, p.
130) and Baesens et al. (2003, p. 634) present findings that suggest otherwise.
Moscato et al. (2021) performed one of the newest studies on machine learning
approaches for credit scoring. They aimed to compare several classification engines’
performances and measure the results according to different evaluation metrics
(Moscato et al., 2021, p. 4). In credit scoring, multiple evaluation metrics have been
defined to evaluate the models. According to Abellán & Castellano (2017, p. 9), accura-
cy (ACC) does not seem to be the most suitable metric as it does not consider that in
credit scoring false positives are more important than false negatives (Moscato et al.,
2021, p. 5). Moscato et al. (2021, p. 5) suggest using more suitable metrics such as
Sensitivity (TPR) and Specificity (TNR), which evaluate the accuracy of positive and
negative results. Furthermore, Precision and F-Rate can be used to evaluate how accu-
rately a model predicts positive and negative classes, and Area Under Curve (AUC) can
be used to determine the trade-off of the true positive rate and false positive rate
(Moscato et al., 2021, p. 5).
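For reference, the standard definitions of these metrics in terms of true/false positives ($TP$, $FP$) and true/false negatives ($TN$, $FN$) are textbook formulas, added here for completeness:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{TNR} = \frac{TN}{TN + FP},$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{TPR}}{\mathrm{Precision} + \mathrm{TPR}},$$

while AUC is the area under the ROC curve, which plots the TPR against the false positive rate.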
As ML gains popularity for its predictive abilities, it is becoming clear that mod-
els' interpretations must also be explainable (Murdoch et al., 2019, p. 1). A recent re-
search paper that uses explainable AI (Shapley values) to interpret its model, shows that
it is possible to obtain the feature importance from an XGBoost model (Ariza-Garzón et
al., 2020, p. 64883). The authors also show that depending on the model, the feature
importance can vary, though only slightly. Furthermore, they show that quantitative
values influence the model more than qualitative ones (70% vs 30%) (Ariza-Garzón et
al., 2020, p. 64884). In addition, they observed that the same feature value does not al-
ways have the same impact on the output value (Ariza-Garzón et al., 2020, p. 64884).
For example, a large loan amount could in one instance decrease and in another increase
the default probability. Demajo et al. (2020) also used an XGBoost model and applied
post-hoc analysis. In addition to gaining insights, they conducted a questionnaire to find
out if the insights gained with SHAP and other explainable AI methods were helpful for
humans (Demajo et al., 2020, p. 199). About 75% of participants were satisfied with the
explanations (Demajo et al., 2020, p. 199). However, some participants suggested add-
ing an overall risk rating or visualization charts (Demajo et al., 2020, p. 199). Further-
more, Provenzano et al. (2020, p. 16) also show that explainable AI, more precisely
SHAP, can be used in credit scoring for model explainability.

3 Methodology
This section explains how the dataset was processed with a deep learning model and how the model’s decisions were explained. First, the dataset is described and pre-processed. Second, it is outlined how the model was configured and evaluated. Last, the black box deep learning model’s decisions are shown with the use of SHAP, and an adjusted model is fitted and evaluated.

3.1 Data
As mentioned in section 1, the official loan dataset from Bondora (2021), which is updated daily, is used. The dataset is available in the comma-separated values (CSV) format, which is widely supported and simple to process. Bondora (2021) offers the following da-
tasets for download:
1) Loan dataset: Advance loan information that is not covered by data protection
laws
2) Portfolio CashFlow, PnL Statement and Balance Sheet: Data as of the end of
each month
3) Historic payments: All received payments categorized by type and date
4) Loan schedules: All past and future scheduled payments categorized by type and
date
5) Debt events: All loan lifecycle process events

Although all this data is available, only the loan dataset is used here because the other
sets all contain data from when the loan had already started, which therefore cannot be
used to predict default before the loan has started. The loan dataset was downloaded on
the 28th of March 2021 from Bondora (2021).
There has already been some research with the same dataset in recent years
(Byanjankar, Heikkilä, and Mezei, 2015; Disbergen, 2019; Zaytsev, 2020). Byanjankar
et al. (2015, p. 724) compared a logistic regression model with a NN and concluded that
logistic regression (LR) is more accurate (65.34%) when predicting non default loans
compared to NN (62.70%), while the NN was more accurate (74.38%) when predicting
default loans compared to LR (61.03%). Disbergen (2019, p. 18) obtained similar re-
sults with LR and a random forest model. However, the Bondora dataset is updated dai-
ly and therefore the data as well as the classification group sizes of defaulted and not
defaulted loans vary from study to study.

3.1.1 Data preparation and pre-processing


The downloaded loan dataset has 164,547 instances and 112 features. However, the
dataset is not complete and has a lot of missing values. Therefore, it must be prepared
first. As a first step, some features were removed. The criteria for removing the features
were:
a) Not loan default relevant
b) Feature not available before loan starts

c) Duplicate (already in another feature)
d) Rating by Bondora or external company

This required reading through the description of each feature on Bondora’s website.
Table 1 shows which features were consequently removed.
Table 1: Data preparation removed features and criteria

Next, the categorical features must be transformed. They are transformed in line with the information on Bondora’s website (Bondora, 2021), which maps a category to each number per feature. Additionally, for some features a value of “0” or “-1” also indicates a missing value. This must be considered when transforming the values, even though it is not stated on Bondora’s website. Also, some categorical features are already transformed in the dataset, which means that those should not be changed. Table 2 shows the identified categorical features, and which number was transformed to which category.
Table 2: Categorical features and their categories

As printed in Table 3, 67,179 loans are “Late”, which means that they defaulted. 49,201
loans are “Current”, meaning that they are still ongoing and cannot be used for this
analysis because it is not known yet if the loan defaulted or not. The other 48,167 are
“Repaid”, meaning they did not default and were paid back. After removing the loans
in “Current”, 115,346 were left.

Table 3: Value counts for the feature “Status“

With the variable “Status”, the target variable “Default” can be created. All instances in
the state “Late” were set to “1” and all instances with the state “Repaid” were set to “0”.
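A minimal pandas sketch of this step (assuming the dataset has been loaded into a DataFrame df; the file path is illustrative):

import pandas as pd

df = pd.read_csv("LoanData.csv")                      # Bondora loan dataset
df = df[df["Status"] != "Current"]                    # drop ongoing loans
df["Default"] = (df["Status"] == "Late").astype(int)  # 1 = defaulted, 0 = repaid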
Next, the missing values were analyzed. As plotted in Figure 5, there are six
features with 70% missing values, one with ~53%, one with ~31% and some more with
less than 10%.

Figure 5: Missing values per feature

These are some options for dealing with missing values:
- Leave in the dataset
- Fill with mean/median
- Remove feature from the dataset
- Remove instances from the dataset

Leaving the missing values in the dataset was not an option, since deep learning models cannot handle them, and filling them with the mean or median could distort the model. So, the chosen options are removing features and removing instances.
For features where > 30% of values are missing, the feature is removed from the da-
taset. For features with ≤ 30% of missing values, the instances are removed from the
dataset. This ensures that there are no missing values left and allows the creation of a
robust and unbiased model. After this step, the dataset consisted of 105,112 instances
and 20 features.
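A sketch of the chosen rule, continuing the snippet above:

missing_share = df.isna().mean()   # fraction of missing values per feature
df = df.drop(columns=missing_share[missing_share > 0.30].index)  # drop sparse features
df = df.dropna()                   # drop the remaining incomplete instances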

3.1.2 Data analysis


In this section, the data will be explained and analyzed. First, the numeric features, i.e., variables that have a measure and a numeric meaning, will be explored; then, the categorical variables that were transformed in the previous section will be analyzed using the pandas-profiling package. While pandas offers basic data analysis with the “describe”-method, the “profile_report”-method allows for a more thorough exploratory data analysis.
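A sketch of both calls (assuming the pandas-profiling package, nowadays published as ydata-profiling, is installed; the output file name is illustrative):

import pandas_profiling  # registers the profile_report method on DataFrames

df.describe()                        # basic descriptive statistics
report = df.profile_report()         # full exploratory report
report.to_file("loan_profile.html")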
An overview of the dataset is given in Table 4. As depicted, the dataset consists
of 20 variables and 105,112 observations. Since the missing values were removed in the
previous section, there are no missing cells. Pandas profiling identified 25 rows as pos-
sible duplicates; however, further investigation revealed that there were no duplicates,
only that some variables had duplicate values while the rows were distinct. From the 20
variables, 12 are numeric, 7 are categorical and one, the target variable, is a Boolean.

Table 4: Dataset overview

Table 5 shows a snippet (first five rows) of the dataset, giving a good overview of how
the data is structured. The variables were reordered to start with the numeric ones, followed by the
categorical and finally, the target variable. In the following pages, each variable is ana-
lyzed, showing quantile statistics, descriptive statistics, and the histogram.
Table 5: Snippet of dataset

The variable “Age” stands for the age of the borrower when signing the loan application
(Bondora, 2021). The histogram in Figure 6 shows that the age is right skewed, with
more younger than older people applying for a loan. For some unknown reason, there
are a lot more loan applicants aged 43 than any other age. Moreover, there is an outlier
at age 70, which could be because all applicants aged 70 or older are grouped into this value.

Figure 6: Variable “Age“ histogram


“AppliedAmount” refers to the amount that the borrower applied for in Euros. Table 6
shows that the amounts range from 500€ to 10,132€. For a P2P platform, it makes sense
that the applied amounts are not large because for larger amounts people usually go to a
traditional bank. Furthermore, the median is 2125€, the mean is 2769€ and the 3rd quartile is 4150€. It has to be considered, though, that Bondora started mostly in the Baltic states, where the average monthly wage is around 1140€ (Baltic FEZ, n.d.), and only recently expanded to Finland, Spain and Slovakia. The sum of all the amounts applied for is 291,091,055€.

Table 6: Variable “AppliedAmount” statistics

The ratio of the borrowers’ monthly debt payments to monthly income can be seen in
the feature “DebtToIncome”, the numbers and amounts of the previous loans are stored
in “NumberOfPreviousLoansBeforeLoan” and “AmountOfPreviousLoansBeforeLoan”.
“ExistingLiabilities” holds the number of the borrower’s liabilities (Bondora, 2021).
Table 7 shows that over 75% of users have no debts, and around 5% have debts that
cost them 50% or more of their income. Moreover, Table 7 shows that 75% of the peo-
ple either did not have previous loans or that the summed amount of them was below
4240€. However, there are also people with over 25 previous loans. The median of ex-
isting liabilities is two. Only 5% of the borrowers have 9 or more liabilities, while at
least one person had 40 existing liabilities. Additionally, the 95th percentile of the total
liabilities is 1466€, so most of the borrowers do not have a high amount of liabilities.

Table 7: Variable “DebtToIncome”, “AmountOfPreviousLoans” and “ExistingLiabilities” statistics

The feature “Interest” shows the maximum interest rate accepted in the loan application
(Bondora, 2021). Figure 7 depicts the histogram of the feature. It makes sense that in-
terest rates are higher than those from a traditional bank but having interest rates of over
250% was unexpected. However, interest rates over 80% are only outliers. The median
interest rate is 33.83%, which is rather reasonable for a peer-to-peer platform.

Figure 7: Variable “Interest” histogram


Table 8 shows the statistics of the feature “IncomeTotal”, which stands for the borrow-
er’s total income. The 5th percentile is at 520€ which is around the minimum wage in
the Baltic states (Baltic FEZ, n.d.). The median is at 1300€, so half of the people earn
less than 1300€ and the other half earns more. Furthermore, Table 8 shows that 5% of
the people earn more than 3100€ which is almost 6 times the minimum wage. The high-
est monthly income is 1,012,019 €.

Table 8: Variable “IncomeTotal” statistics

Figure 8 depicts “LoanDuration”, the duration of the loan in months. The most frequent loan duration, which is also the median, is 60 months. The second most frequent loan duration
is 36 months.

Figure 8: Variable “LoanDuration” histogram

“MonthlyPayment” is the calculated monthly payment that the borrower will have to
pay. Table 9 shows that most of the monthly payments are quite low. 25% of the bor-
rowers pay around 42€ or less per month. The median is around 104.50€. Only 5% of the bor-
rowers pay more than 345€, while there are some outliers, with one even reaching al-
most 2370€ per month. Considering the highest applied amount was 10,632 €, this was
probably one of the shorter loans.

Table 9: Variable “MonthlyPayment” statistics

Next, the categorical variables will be looked at. Table 10 shows that 63.4% are men,
26.6% women and 10% did not specify their gender. Estonia is the home country of
over 50% of the borrowers, followed by Finland (27.4%), Spain (21.7%) and Slovakia (less than 0.1%)
(see Table 11).

Table 10: Variable “Gender” categories Table 11: Variable “Country“ categories

As shown in Table 12, over 35% have a secondary education, 26.6% a higher education
qualification and about 23% a vocational certificate. 9.8% have only a primary level of
education and 4.4% a basic education. Moreover, Table 13 depicts that almost 60% ver-
ified their income and expenses and 7.2% verified only their income. Over 34% of the
incomes are not verified, although it is not clear whether the borrower decided to not
verify the income or whether s/he failed to verify it.

Table 12: Variable “Education” Table 13: Variable “VerificationType”


categories categories

As shown in Table 14, over a third of the borrowers are homeowners. Almost one fourth
are tenants in pre-furnished properties, 16.2% live with their parents and 11.3% are
homeowners with a mortgage. Regarding employment duration, almost 40% have been

at the same employer for more than five years, 22.1% have been at their employer for
almost 5 years, and 18.5% have spent up to one year at their current employer. People
who are in the trial period, retired or have been with their employer for 1 – 4 years are
only in the single-digit percentage range (see Table 15).

Table 14: Variable “HomeOwner- Table 15: Variable “EmploymentDuration-


shipType” categories CurrentEmployer” categories

Table 16 shows that around 60% are new credit customers on Bondora, while the other
40% had or have at least one loan ongoing on Bondora. Last, the target variables cate-
gories are shown in Table 17. It shows that 61.6% of loans defaulted. This is because as
soon as 3 monthly payments in a row are missed, meaning the loan is behind for over 90
days, Bondora marks the status of the loan as “Late”, which in this thesis is used to
identify defaulted loans.

Table 16: Variable “NewCreditCustomer” Table 17: Target variable “Default”


categories categories

3.1.3 Data correlations
In this section, correlations between features are examined. First, the numeric features
are checked with Pearson’s correlation matrix depicted in Figure 9. Pearson’s correla-
tion coefficient r is a value between -1 and 1 that shows the linear correlation between
two variables. 0 indicates no correlation, 1 indicates total positive linear correlation and
-1 indicates total negative linear correlation.
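A sketch of how such a correlation matrix can be produced and visualized (the plotting details are illustrative):

import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr(method="pearson")  # only numeric columns enter the matrix
sns.heatmap(corr, vmin=-1, vmax=1, cmap="coolwarm")
plt.show()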

Figure 9: Pearson’s correlation matrix

Figure 9 reveals several correlations in the numeric variables of the dataset (correlations
mentioned once are not repeated):
- “AppliedAmount” correlates strongly positively with “MonthlyPayment” and
slightly positively with “LoanDuration”
- “DebtToIncome” correlates slightly positively with “ExistingLiabilities” and
slightly positively with “MonthlyPayment”

- “AmountOfPreviousLoansBeforeLoan” correlates slightly positively with “Ex-
istingLiabilities”, strongly positively with “NoOfPreviousLoansBeforeLoan”
and slightly negatively with “Interest”
- “ExistingLiabilities” correlates slightly negatively with “Interest” and slightly
positively with “NoOfPreviousLoansBeforeLoan”
- “Interest” correlates slightly positively with “MonthlyPayment” and slightly
negatively with “NoOfPreviousLoansBeforeLoan”
- “MonthlyPayment” correlates slightly negatively with “NoOfPreviousLoansBe-
foreLoan”
- The target variable “Default” correlates slightly positively with “Interest” and
“LoanDuration”

Since only the numeric variables were looked at before, a new correlation coefficient,
Phik (φK), will be used to understand the correlations between all variables. Phik is
based on several refinements to Pearson’s hypothesis test of independence of two variables; it works with categorical, ordinal and interval variables, captures non-linear dependencies, and reverts to Pearson’s correlation coefficient in the case of a bivariate normal input distribution (Baak et al., 2019, p. 3).
Phik shows positive correlations from 0 to 1, 0 being no correlation and 1 being
total correlation. Figure 10 shows that there are many nonlinear correlations between
the variables in the dataset. Moreover, correlations emerge that were not revealed in the
linear correlation coefficient before.
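The phik package adds the corresponding method directly to pandas DataFrames; a minimal sketch (the interval columns must be named explicitly, and the list here is only an illustrative subset):

import phik  # registers the phik_matrix method on DataFrames

interval_cols = ["Age", "AppliedAmount", "Interest"]
phik_corr = df.phik_matrix(interval_cols=interval_cols)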

Figure 10: Phik correlation matrix

These are some correlations that were not discovered before:


- “FreeCash” correlates with “IncomeTotal”
- “AmountOfPreviousLoansBeforeLoan” correlates with “NewCreditCustomer”
- “Age” correlates slightly with “EmploymentDurationCurrentEmployer”
- “NewCreditCustomer” correlates with “NoOfPreviousLoansBeforeLoan” and
slightly with “Interest”
- “Country” correlates slightly with “Interest” and “Gender”
- “HomeOwnershipType” correlates slightly with “Age” and “Country”
- “AppliedAmount” correlates slightly with “NewCreditCustomer” and “Country”
- “DebtToIncome” correlates slightly with “VerificationType”

Thanks to the nonlinear correlation analysis, one can see more connections and relationships between the variables. However, these are correlations found by Phik, which does not imply that the deep learning model will apply the same rules or find the same correlations; therefore, these relations cannot be used to explain the deep learning model used in this thesis.

3.2 Predicting loan default with a deep learning model


Next, the model to predict loan defaults was created. For this, the data was prepared
for deep learning. After that, the model was trained and evaluated.

3.2.1 Preparing the data for machine learning


Since some data preparation had already been done, as outlined in the section before,
only some changes needed to be made. We needed to one-hot encode the categorical
variables, split the dataset into train and test subsets and then scale all features to be in the
range of 0 and 1.
Table 18 shows the imported dataset. The categorical variables store their values
as text. This is not beneficial for machine learning and therefore, one hot encoding must
be applied.

Table 18: Snippet of the dataset before preparing for machine learning

To do this transformation, the “get_dummies”-method from pandas is used. The result
can be seen in Table 19. For each category in each categorical variable a new feature
was added to the dataset. Where the instance had that category as a text in the categori-
cal variable before, a 1 was set to the new feature, and in all other new features for that
categorical variable, a 0 was set. The result is a dataset with 28 new added features and
a total of 48 features.

Table 19: Snippet of dataset after one-hot encoding

Next, the dataset is split with the “train_test_split”-method from sklearn. A test-size of
20% was applied. We end up with two data sets:
- Train dataset: 84,089 instances
- Test dataset: 21,023 instances

Furthermore, scaling is applied to the features. For this, “MinMaxScaler” from sklearn
is used. All features are scaled between 0 and 1. The result is not interpretable for hu-
mans but is very beneficial for the model. Figure 11 depicts the first instance of the train
dataset with scaled values.

Figure 11: First instance of the dataset after scaling
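A compact sketch of these three steps (a reconstruction under the assumptions above, not the thesis’ exact code):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = pd.get_dummies(df.drop(columns=["Default"]))  # one-hot encode the categoricals
y = df["Default"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

scaler = MinMaxScaler()                  # scales each feature into [0, 1]
X_train = scaler.fit_transform(X_train)  # fit on the train split only
X_test = scaler.transform(X_test)        # reuse the train statistics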

3.2.2 Creating and training the model


For the model Keras and TensorFlow 2 are used. While TensorFlow is the end-to-end
machine learning platform, Keras is the high-level API focused on modern deep learn-
ing (Keras, n.d.).
First, a basic model was prepared with Keras using the Sequential interface.
Then three dense (fully connected) layers were added. So, there were the input- and
output-layers and one hidden layer in between. The decision on how many layers and hidden units to use was informed in a later step by a grid search. For now, as many hidden units as features in the dataset are used, 48. The last hidden layer before the output layer only has half of the hidden units, to help the model before outputting the one output variable, so that it only has to scale from 24 units down to one. The input and intermediate
layers use relu, the most widely used activation function in deep learning,
while the output layer uses a sigmoid activation, also one of the most widely used func-
tions, with an output range from 0 to 1. Since this is a classification task, it is important
to use a suited activation function in the output layer. The model is plotted in Figure 12.

Figure 12: Basic model plot
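A sketch of this basic architecture (layer sizes follow the description above; this is a reconstruction, not the thesis’ exact code):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(48, activation="relu", input_shape=(48,)),  # one unit per input feature
    layers.Dense(24, activation="relu"),                     # half the units before the output
    layers.Dense(1, activation="sigmoid"),                   # default probability in [0, 1]
])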

Before being able to train the model, three more things need to be configured as part of
the compilation step in Keras (Chollet, 2018, p. 28):
- Loss function: How the NN should assess the performance of the training data
and how it should steer itself in the correct direction
- Optimizer: The NNs technique for updating itself based on its loss function and
the data it observes
- Metrics: Which metrics to monitor during training and testing

For the loss function binary_crossentropy is used as it is the standard for binary classifi-
cation use cases. The chosen optimizer is adam, which is commonly used in deep learn-
ing with NNs. Accuracy was chosen as the metric to be monitored.
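A sketch of the corresponding compilation step:

model.compile(
    loss="binary_crossentropy",  # standard loss for binary classification
    optimizer="adam",
    metrics=["accuracy"],
)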
When fitting and evaluating the configured model, overfitting could
be observed, especially when using a high number of epochs and lower batch sizes. The
most common ways to prevent overfitting in neural networks according to Chollet
(2018, p. 110) are:
(1) Get more data
(2) Reduce layers and units of the network
(3) Add weight regularization
(4) Add dropout

It was decided to adopt option (4) and add dropout to the model. Dropout is one of the
most widely used regularization techniques for NNs and consists of setting some of the
output features to zero at random during training (Chollet, 2018, p. 109). Therefore,
dropout was applied to the last hidden layer of the model. The updated model is plotted
in Figure 13.

Figure 13: Basic model with dropout plot

A dropout rate between 0 and 1 must also be specified. Setting a dropout rate of 0.2 tells
the model to drop out (set to zero) 20% of the output features. Importantly, no output
features are dropped during testing. So, by dropping out a percentage of the features
during the training, we introduce some noise to the output values to break up patterns
(Chollet, 2018, p. 109). This means that the model is forced to lose some learned con-
nections and therefore, prevents it from overfitting. A pictorial example is shown in
Figure 14.

Figure 14: NNs without and with dropout
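A sketch of the model with the added dropout layer (again a reconstruction under the assumptions above):

model = keras.Sequential([
    layers.Dense(48, activation="relu", input_shape=(48,)),
    layers.Dense(24, activation="relu"),
    layers.Dropout(0.2),                   # zero out 20% of these outputs, during training only
    layers.Dense(1, activation="sigmoid"),
])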

The next step was to evaluate the model and to do hyperparameter tuning. A grid search
was conducted to find the best fitting parameters. The parameters were predefined and
then the model was re-fitted with all possible combinations. The defined possible pa-
rameters for the grid search are shown in Table 20.

Table 20: Grid search possible values for each parameter
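A sketch of such a hand-rolled grid search over the hyperparameters named above (the candidate values are illustrative, Table 20 lists the ones actually used; build_model is a hypothetical helper wrapping the Sequential definition above):

import itertools
from sklearn.metrics import roc_auc_score

grid = {
    "layers": [1, 2, 3],
    "units": [24, 48, 96],
    "epochs": [10, 20, 50],
    "batch_size": [32, 256],
}

results = []
for n_layers, units, epochs, batch_size in itertools.product(*grid.values()):
    model = build_model(n_layers, units)  # hypothetical helper, see lead-in
    model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=0)
    auc = roc_auc_score(y_test, model.predict(X_test).ravel())
    results.append((auc, n_layers, units, epochs, batch_size))

results.sort(reverse=True)  # highest test AUC first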

The top 50 results of the grid search are depicted in Table 21 and sorted by the AUC on
the test dataset. The table shows that the five best performing combinations for the test
AUC all have only 2 hidden layers. So, more layers do not equal better performance in
this example. However, Table 21 also shows that the best performance on the train da-
taset (highest train AUC) in the sub table was measured in instance 46, which was the

model with the most possible layers, hidden units, epochs, and the smallest possible
batch size. So, the model fitted itself closer and closer to the train dataset. The overfit-
ting with the train dataset in instance 46 did not help the out-of-sample AUC; at 0.7169, it is one of the lowest in the table.

Table 21: Grid search results top 50

The best possible performance on the test dataset was the goal, so the parameters of
instance 1 are chosen as the best fitting ones. The final model is plotted in Figure 15.

Figure 15: Final model plot

3.3 Interpreting and explaining the models’ decisions
In this section, the model and its decisions are interpreted and explained. First, the variables’ importance to the model will be discussed, followed by partial dependence plots per variable and, last, individual instances with their SHAP values.
As described in section 2, SHAP is used for that task. Since a deep learning
model is used, we used SHAP’s “DeepExplainer”. The SHAP explainer is created with
the model and a random subset of 5000 instances of the train dataset, as recommended
by the SHAP package in python. Using a larger background dataset would slow down
the performance drastically. Moreover, the “shap_values”-method is used on the ex-
plainer object with a subset of 5000 instances of the test dataset to obtain the SHAP
values for those predictions. Having the test dataset and its SHAP values, some of
SHAP’s methods can be used to interpret the model’s decisions and behaviors.
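A sketch of this setup (the subset sizes follow the description above; the index selection is illustrative):

import numpy as np
import shap

idx = np.random.choice(X_train.shape[0], 5000, replace=False)
explainer = shap.DeepExplainer(model, X_train[idx])  # 5000 background instances

sample = X_test[:5000]                       # 5000 instances of the test dataset
shap_values = explainer.shap_values(sample)  # one SHAP value per feature per instance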

3.3.1 Global interpretability


To gain a global interpretability understanding of the model, it is important to know
how much each feature contributes to the models’ output on average. With the “sum-
mary_plot”-method, the bar plot, as depicted in Figure 16, can be obtained.
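A sketch of the call producing this bar plot (depending on the SHAP version, DeepExplainer returns a list with one array per model output, hence the [0]; the feature names come from the pre-scaling column index):

shap.summary_plot(shap_values[0], sample, feature_names=X.columns, plot_type="bar")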
We see each feature and its mean SHAP value. The base SHAP value for each
feature is at 0, meaning that every feature starts with an equal influence on the output variable. In addition, each feature’s SHAP value bar becomes larger the more it contributed to the result. In this plot, the mean absolute value for each feature over all the given samples is plotted, meaning that the summary contribution of each feature is shown, but not in which direction the features pushed the model output. So, we do not
see whether the model was influenced to predict the instance as defaulted or as not de-
faulted.

Figure 16: Global model feature importance mean

After determining the features’ impacts on the model, the top 15 features of Figure 16 are looked at in more detail. Figure 17 shows a non-averaged summary plot of the 15 most important features. Each dot shows one feature of an instance, while its color shows the feature value. Numeric variables are red for high values, violet for medium values and blue for low values. For categorical variables, which were transformed and are now either 0 or 1, blue represents 0, and red 1. Additionally, the X-axis again shows the SHAP value, but this time it is not averaged. Therefore, one can see how much
and in which direction that feature pushed the output variable. Having a negative SHAP

value means that the values were subtracted from the output variable, and therefore,
impacted it negatively (moved closer to 0, not defaulted), while a positive SHAP value
is added to the output variable, and therefore, impacted it positively (moved closer to 1,
defaulted). Additionally, for numeric features, a closer look per variable is shown with
the scaled values.

Figure 17: Global model feature importance top 15

Figure 17 depicts that a low “DebtToIncome” value impacts the model mostly positive-
ly, and a high value impacts it negatively. We also can observe some medium values,
which also impact the model negatively. The SHAP values of the feature are depicted in
Figure 18, which clearly shows that for most of the instances (borrowers) that have a
debt-to-income ratio of zero, meaning that there is no debt, the SHAP value is impacted
positively (0 to + 0.1). For almost all other values where there is a debt-to-income ratio
larger than zero, meaning that the people have other debts, the SHAP value is impacted
negatively (0 to -0.35) and therefore moving the output variable closer to zero, a not
defaulted loan. We would have expected a person with other debts to be more likely to
not repay the loan on time. One explanation may be that people with other debts are
more likely to have more experience with loans and therefore know how important it is
to pay them back on time.

Figure 18: SHAP values “DebtToIncome”

Figure 17 shows that a low applied amount (smaller loan) moves the output value closer
to zero and therefore to not defaulted, while a higher value impacts the output variable
positively and therefore means it is more likely to default. Figure 19 provides a closer
look at the SHAP values of the variable “AppliedAmount”. The lower 20% (scaled ap-
plied amount between 0 and 0.2) almost all impact the output variable negatively (0 to -
0.15). For the amounts in the range of 20% - 95%, the SHAP value goes in both
directions (-0.04 to 0.17), but it still influences the output variable more often positive-
ly, and therefore closer to default. For the biggest loans there is a similar trend, but with
some outliers that impact the output variable more negatively (0 to -0.08).

Figure 19: SHAP values “AppliedAmount“

Loan duration is the next feature in Figure 17. Shorter loan durations, represented by the
blue dots, impact the output variable mostly negatively and the longer loan durations,
red dots, mostly positively. Figure 20 shows that very short loan durations move the
output variable much (-0.1 to -0.5) closer to zero. However, even the longest loan’s im-
pact is only in the range of -0.05 to 0.1. So, we can say that the model considers a shorter loan
likely to be paid back on time, while a longer loan is only weakly associated with default.

Figure 20: SHAP values “LoanDuration”

The fourth most relevant feature for the model is the “Country_ES” variable, as shown
in Figure 17. The blue dots, meaning not a resident of Spain, impact the output mostly
negatively (0 to -0.1). There are also some blue dots on the positive side of the axis. On
the other hand, most of the red dots, representing residents of Spain, are on the right
side of the axis and therefore impact the output positively (0 to 0.2). So, the model con-
siders a loan from residents of Spain more likely to default.
In “Education_Higher”, the model expects that people with a higher education,
red dots, will be more likely to pay back their loans. Figure 17 shows that the blue dots,
people without a higher education, impact the output positively (0 to 0.1) and the red
dots impact it negatively (0 to -0.2).
For the variable age, depicted in Figure 17, the younger people (blue dots) im-
pact the model negatively. The middle-aged (violet dots) are close to zero, meaning they
impact in both directions (some closer to default, some closer to not default), and the
older (red dots) impact the model positively, meaning they are more likely to default

on their loan. The described linear correlation can be observed in Figure 21. It shows that
the youngest 30% of borrowers’ output variable is impacted mostly negatively (0 to -
0.15), while for the middle aged it is mostly around zero and for the oldest 40% it is
impacted positively (0 to 0.15). So, for the model, a younger person is more likely to
pay back the loan on time than an older person.

Figure 21: SHAP values “Age”

Next in Figure 17 is the “Education_Secondary” feature. Its distribution is similar to the


“Education_Higher” feature, with the model expecting borrowers with a secondary edu-
cation (red dots) to pay back their loans on time more frequently (0 to -0.15) than peo-
ple without a secondary education (blue dots) (0 to 0.1).
The 8th most influential feature is the interest of the loan. Figure 17 shows that a
lower interest rate (blue dots) impacts the output negatively, while the impact of medium interest rates (violet dots) on the output variable is mostly around zero and a high interest rate (red dots)
impacts the output variable positively. Figure 22 shows that the 15% of loans with the
lowest interest rates are more likely to be paid back on time, while the higher the inter-
est rate, the more positively impacted the output variable is (0 to 0.25).

Figure 22: SHAP values “Interest“

Figure 17 shows similar distributions for the features “HomeOwnershipType_Owner” and “HomeOwnershipType_Mortgage”. It shows that the model perceives people who either own or are still paying off their home (red dots) as less likely to default on their loans, while the other borrowers (blue dots) are perceived as more likely to default.
The next feature is “NewCreditCustomer_No”. Figure 17 shows that existing customers (red dots) impact the model’s output negatively (0 to -0.1), while the others (new customers or unknown) impact the output positively (0 to 0.08). However, for both categories, most of the instances lie close to zero.
A similar distribution to that of “NewCreditCustomer_No” is observed for “Country_EE” in Figure 17. People with residency in Estonia (red dots) influence the output negatively, while the others (blue dots) influence it positively, although just as many red and blue dots lie close to zero. Overall, the SHAP values for residents of Estonia range from -0.075 to 0.05 and for non-residents from -0.05 to 0.075.
Figure 17 shows “ExistingLiabilities” as the next most important feature. The model perceives a low number of existing liabilities (blue dots) to have a negative impact on the output and a high number of existing liabilities (red dots) to have a positive impact on the output variable, meaning such loans are more likely to default. Moreover, Figure 23 reveals a roughly linear relation between the existing liabilities and their SHAP values: the higher the existing liabilities, the higher the SHAP value.

Figure 23: SHAP values “ExistingLiabilities“

The next feature in Figure 17, the amount of previous loans, shows that instances with a low amount of previous loans (blue dots) lie close to zero and influence the output variable in both directions (-0.5 to 0.2). Instances with a high amount of previous loans (red dots), however, appear both far left on the X-axis and on the right side, meaning that here, too, the output variable is impacted negatively as well as positively, depending on the instance. Figure 24 shows that for the lower 30% of loans, the SHAP value lies between -0.1 and 0.1.

Figure 24: SHAP values “AmountOfPreviousLoansBeforeLoan”

The 15th most important feature, as shown in Figure 17, is “VerificationType_Income_expenses_verified”. As plotted, the loans for which the income and expenses are verified (red dots) impact the output variable mostly negatively (0 to -0.15), while the others (blue dots) impact it mostly positively (0 to 0.15).

The other 33 features and their impacts on the model’s output are depicted in Figure 25.

Figure 25: SHAP global model feature importance 16 - 48

3.3.2 Local interpretability


In this section, two random instances from the test dataset are examined with SHAP to understand and explain why the model decided to mark each instance as either default or not default. We want to see which features of an instance influenced the model, by how much, and in which direction. SHAP provides different methods to plot local interpretations; in these examples, the waterfall plot is used. As in the plots above, blue indicates an impact to the left (moving closer to 0, not defaulted) and red indicates an impact to the right (moving closer to 1, defaulted). For local interpretations, the accumulated SHAP value starts at 0.615. This is because the mean of the target value “Default” in the test dataset is 0.615, and therefore each prediction starts from there. With a perfectly balanced dataset, the accumulated SHAP value would start at 0.5.
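The waterfall plots in Figures 26 and 27 can be produced roughly as follows. This is a sketch under the same hypothetical names as in the earlier snippet, where explainer.expected_value corresponds to the base value of 0.615 described here.

    import shap

    # Base value: the model's mean output over the background data, ~0.615
    # here (may be a scalar or a one-element array, depending on the version).
    base_value = explainer.expected_value[0]

    # Bundle the SHAP values, base value, and feature values of one instance,
    # then draw the waterfall plot (index 2201 is the defaulted example).
    i = 2201
    explanation = shap.Explanation(
        values=shap_values[i],
        base_values=base_value,
        data=X_test[i],
        feature_names=feature_names,
    )
    shap.plots.waterfall(explanation, max_display=10)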
Figure 26 depicts the SHAP values and local interpretation for a defaulted instance with index 2201 (shown in Table 26 in the appendix) in the test dataset. The 39 least important features taken together subtracted 0.05 from the output variable. “NewCreditCustomer_Yes” and “NoOfPreviousLoansBeforeLoan” had a similar negative impact, subtracting 0.03 each and moving the output closer to a not defaulted loan; this borrower is not a new credit customer, and the number of previous loans is 4. Because the person is not a homeowner, 0.06 was added to the output variable, and because he was 32 years old, the model added another 0.09, moving it close to 0.7 and therefore towards a defaulted loan. For being a resident of Estonia, another 0.09 was added. While 0.1 was added because “Education_Basic” was zero, 0.1 was subtracted because the person is not a resident of Finland. Finally, 0.11 was subtracted because the gender is not female, but added again because the gender is male. The output was 0.744 and therefore a default was predicted.

Figure 26: SHAP values for instance 2201 in test dataset

Next, Figure 27 shows the SHAP values and local interpretation for a not defaulted instance with index 58 (shown in Table 26 in the appendix) in the test dataset. Here, the 39 least important features taken together added 0.02 to the output variable. Being male, not having a vocational education and being 49 years old each caused the model to subtract 0.05. Because the employment duration at the current employer was less than one year, 0.07 was added to the output variable. Not being a new credit customer and being a homeowner resulted in 0.07 being subtracted twice. Not having a higher education and living in Estonia both caused 0.08 to be subtracted. The loan amount, quite high at 1701€, was perceived as a reason for default by the model, and therefore 0.12 was added to the output variable. The output was 0.375 and therefore no default was predicted.

Figure 27: SHAP values for instance 58 in test dataset

As with the findings in the previous section on global interpretability, we again observe that SHAP interpretations can differ for each instance. For example, while residency in Estonia was perceived as a reason for default in the first instance, moving the output variable closer to 1, the same feature with the same value in the second instance was perceived as a reason to avoid default, moving the output variable closer to 0.

3.4 Adjusting the model with interpretability insights


As Figure 16 reveals, some features rarely impact the model, meaning that the model does not benefit much from them and that the dataset as well as the model can be adjusted. The goal is to increase efficiency by achieving equally good performance with fewer features. The features identified in Figure 16 as candidates for removal (impact on the model output ≤ 0.01) are listed in Table 22. In total, half of the features, 24, were removed.
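One way to arrive at such a list is to rank the features by their mean absolute SHAP value, as in Figure 16, and drop everything below the threshold. A minimal sketch under the same hypothetical names as before; X_train is likewise a placeholder for the scaled training matrix.

    import numpy as np
    import pandas as pd

    # Mean absolute SHAP value per feature, i.e. the bar lengths in Figure 16.
    importance = pd.Series(np.abs(shap_values).mean(axis=0), index=feature_names)

    # Candidates for removal: average impact on the model output of at most 0.01.
    to_remove = importance[importance <= 0.01].index.tolist()

    # Reduced feature matrices for retraining the adjusted model.
    X_train_adj = pd.DataFrame(X_train, columns=feature_names).drop(columns=to_remove)
    X_test_adj = pd.DataFrame(X_test, columns=feature_names).drop(columns=to_remove)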

Table 22: Features possible to remove (SHAP)

After the features were removed, the same grid search as before was run. The results are very similar to those in Table 21. The best AUC on the test dataset is even slightly higher than that achieved previously, while the best-fitting hyperparameters for the adjusted model are the same as for the initial model.
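As an illustration of what such a grid search over Keras hyperparameters might look like, here is a minimal sketch; the grid values, layer sizes and epoch count are placeholders, not the hyperparameters actually searched in this thesis.

    from itertools import product
    from tensorflow import keras
    from sklearn.metrics import roc_auc_score

    best_auc, best_params = 0.0, None
    # Illustrative grid; the thesis's actual search space may differ.
    for units, lr in product([32, 64, 128], [1e-2, 1e-3]):
        model = keras.Sequential([
            keras.layers.Dense(units, activation="relu",
                               input_shape=(X_train_adj.shape[1],)),
            keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                      loss="binary_crossentropy", metrics=["accuracy"])
        model.fit(X_train_adj, y_train, epochs=10, batch_size=256, verbose=0)
        auc = roc_auc_score(y_test, model.predict(X_test_adj).ravel())
        if auc > best_auc:
            best_auc, best_params = auc, (units, lr)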

4 Results
In this section, both models’ results are analyzed and the measurements presented. We compare the initial model with the adjusted model and the insights from SHAP.
The learning process of the initial model, depicted in Figure 28, reveals how the model improved over the epochs. On the train set, the accuracy improved from 0.6430 to 0.7069, while on the test set it improved from 0.6769 to 0.6948. The fitting process with more details for each epoch is shown in Table 24 in the appendix.

Figure 28: Initial model learning process
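Curves like those in Figures 28 and 29 can be drawn from the History object that Keras returns from fit; a sketch assuming history = model.fit(..., validation_data=(X_test, y_test)) was stored.

    import matplotlib.pyplot as plt

    # history.history maps each metric name to its per-epoch values.
    plt.plot(history.history["accuracy"], label="train accuracy")
    plt.plot(history.history["val_accuracy"], label="test accuracy")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()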

Next, the learning process of the adjusted model is plotted (Figure 29). The accuracy on the train set improved from 0.6424 to 0.7016, while on the test set it improved from 0.6748 to 0.6968. Again, the detailed fitting process is shown in the appendix (Table 25).

Figure 29: Adjusted model learning process

For further evaluation, a confusion matrix was produced for the initial and the adjusted model (Figure 30). The test dataset the models were evaluated against had 21,023 instances, of which 8,105 were not defaulted (N) loans and 12,918 were defaulted (P) loans.
The important numbers from the matrix are:
- True Negative (TN): Loans predicted as not defaulted that did not default
- False Positive (FP): Loans predicted as defaulted that did not default
- False Negative (FN): Loans predicted as not defaulted that defaulted
- True Positive (TP): Loans predicted as defaulted that defaulted

Figure 30: Model confusion matrix (left: initial model, right: adjusted model)

With the values from the matrix, further measures could be calculated (Table 23). The measures are described as follows; a sketch of how they can be computed is shown after the list:
- True Positive Rate (TPR): The ratio of TP to all actual P
- True Negative Rate (TNR): The ratio of TN to all actual N
- False Positive Rate (FPR): The ratio of FP to all actual N
- False Negative Rate (FNR): The ratio of FN to all actual P
- Precision: The proportion of loans predicted to default that defaulted
- Recall: The proportion of defaulted loans that were predicted to default
- F1 Score: The harmonic mean of precision and recall
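A minimal sketch of these measures, assuming y_test holds the true labels and y_pred the thresholded model predictions (both hypothetical names):

    from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

    # confusion_matrix returns [[TN, FP], [FN, TP]] for binary labels.
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

    tpr = tp / (tp + fn)  # True Positive Rate (= recall)
    tnr = tn / (tn + fp)  # True Negative Rate
    fpr = fp / (fp + tn)  # False Positive Rate
    fnr = fn / (fn + tp)  # False Negative Rate
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)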

Table 23: Performance measurements

The AUC – ROC curves for both models (Figure 31) indicate how well the models can
differentiate between a (randomly chosen) defaulted and a (randomly chosen) not de-
faulted loan.
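Curves like those in Figure 31 can be produced from the predicted default probabilities; y_score is a hypothetical name for the model’s raw sigmoid outputs.

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_auc_score, roc_curve

    fpr, tpr, _ = roc_curve(y_test, y_score)
    auc = roc_auc_score(y_test, y_score)

    plt.plot(fpr, tpr, label=f"AUC = {auc:.4f}")
    plt.plot([0, 1], [0, 1], linestyle="--")  # random-classifier baseline
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()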

Figure 31: AUC - ROC curves (left: initial model, right: adjusted model)

5 Discussion
This thesis researched and investigated the use of deep learning with NN and explainable AI to predict and explain loan defaults. The main objective was to predict loan defaults and use explainable AI to understand why the model decided one way or the other. Additionally, it aimed to determine whether the insights from explainable AI help to improve the model’s performance.
As shown in the previous chapter, two models were trained, and the insights from the explainable AI method SHAP helped to improve the model. The measurements show the validation accuracy of the adjusted model at 0.6968, which is slightly higher than the validation accuracy of the initial model at 0.6962. We observe the same pattern in the validation loss, where the adjusted model has a lower score of 0.5868 compared to the initial model with 0.5891. However, looking at the confusion matrix measures and acknowledging that there are more suitable measures than ACC in credit scoring, as proposed by Moscato et al. (2021, p. 5), we can observe that the initial model scores a higher TPR at 0.8777 while the adjusted model reaches 0.8708, meaning that the initial model is better at identifying defaulted loans. Out of 1,000 defaulted loans, the initial model predicted 878 correctly, while the adjusted model predicted only 871. Another measure proposed by Moscato et al. (2021, p. 5) is the FPR, where the adjusted model with a value of 0.5806 performs somewhat better than the initial model with 0.5930, because here a lower value is better. Out of 1,000 not defaulted loans, the adjusted model predicted 581 would default, while the initial one predicted 593 would default. Neither of these FPR results is particularly strong, and as such, the model would need further training and adjustment to do better here.
With regard to the proportion of predicted default loans that did default, the adjusted model does somewhat better than the initial model, with a precision of 0.7050 compared to 0.7023. This means that of 1,000 loans marked as default, the adjusted model would be correct on 705, the initial one on 702. With regard to the proportion of defaulted loans that were predicted to default, the initial model performed somewhat better, with a recall of 0.8777 compared to 0.8708 for the adjusted model; of 1,000 defaulted loans, the initial model predicted 878 correctly, the adjusted model 871. Finally, the initial model has a higher F1 score, the harmonic mean of precision and recall, than the adjusted model, with scores of 0.7803 and 0.7792 respectively. In finance, we want to maximize the recall, because all loan defaults should be predicted, and we can accept the cost of having more false positives.
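One lever for trading false positives against missed defaults, beyond retraining, is the classification threshold; the following sketch is purely illustrative and was not part of this thesis’s experiments.

    # Lowering the decision threshold below 0.5 marks more loans as default,
    # which raises recall at the cost of a higher false positive rate.
    threshold = 0.35  # illustrative value, not tuned here
    y_pred_high_recall = (y_score >= threshold).astype(int)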
Overall, we see that the AUC of the initial model is 0.7174 and the AUC of the adjusted model (with the half of the features having the lowest SHAP values removed) is 0.7175. While this is only a very small gain, explainable AI enabled the development of a model that needs only half the features to predict just as accurately. There are, however, fine differences in the models’ predictions, and depending on the use case one or the other should be preferred. Furthermore, we observe that both models are better at predicting defaulted loans than at predicting loans that did not default. We also see that they are more likely to predict false positives than false negatives, which is preferable in finance: a loan falsely marked as default could be checked by a specialist who can then decide the next steps, while a loan falsely marked as non-default could slip through and lead to problems when it defaults.
Additionally, explainable AI and SHAP allowed us to understand which features are important to the model, insights that are not usually available from a black box model. SHAP also allows us to explain what leads the model to a decision and because of which features and values an instance is marked as default or not default. Moreover, it is possible to show which feature influenced the output, to what extent and in which direction.
To sum up, we can predict loan defaults, interpret and explain the models’ decisions and results, and adjust the models with insights from explainable AI. A complete use case in credit scoring was presented in this thesis, showing that deep learning with explainable AI can be used in the complex area of finance. Further research could examine and evaluate other datasets for loan default prediction or other credit scoring use cases in finance. The insights from SHAP are specific to the dataset used, and it would therefore be interesting to see other areas apply deep learning and explainable AI to obtain the best possible performance without losing the model’s insights and learned relations.

6 References
Abellán, Joaquín, and Javier G. Castellano. 2017. “A Comparative Study on Base Classifiers in Ensemble Methods for Credit Scoring.” Expert Systems with Applications 73:1–10. doi: 10.1016/j.eswa.2016.12.020.
Ariza-Garzón, Miller Janny, Javier Arroyo, Antonio Caparrini, and Maria-Jesus Segovia-Vargas. 2020. “Explainability of a Machine Learning Granting Scoring Model in Peer-to-Peer Lending.” IEEE Access 8:64873–90. doi: 10.1109/ACCESS.2020.2984412.
Baak, M., R. Koopman, H. Snoek, and S. Klous. 2019. “A New Correlation Coefficient between Categorical, Ordinal and Interval Variables with Pearson Characteristics.” ArXiv:1811.11440 [Stat].
Baesens, B., T. Van Gestel, S. Viaene, M. Stepanova, J. Suykens, and J. Vanthienen. 2003. “Benchmarking State-of-the-Art Classification Algorithms for Credit Scoring.” Journal of the Operational Research Society 54(6):627–35. doi: 10.1057/palgrave.jors.2601545.
Balticfez. n.d. “The Three Baltic Countries – Lithuania, Latvia and Estonia.” Baltic FEZ. Retrieved May 18, 2021 (https://www.balticfez.com/baltic-area/).
Bondora. 2021. “Public Reports.” Bondora.com. Retrieved April 5, 2021 (https://www.bondora.com/).
Bücker, Michael, Gero Szepannek, Alicja Gosiewska, and Przemyslaw Biecek. 2020. “Transparency, Auditability and Explainability of Machine Learning Models in Credit Scoring.” ArXiv:2009.13384 [Cs, Econ, q-Fin, Stat].
Bussmann, Niklas, Paolo Giudici, Dimitri Marinelli, and Jochen Papenbrock. 2021. “Explainable Machine Learning in Credit Risk Management.” Computational Economics 57(1):203–16. doi: 10.1007/s10614-020-10042-0.
Byanjankar, Ajay, Markku Heikkilä, and Jozsef Mezei. 2015. “Predicting Credit Risk in Peer-to-Peer Lending: A Neural Network Approach.” Pp. 719–25 in 2015 IEEE Symposium Series on Computational Intelligence.
Carvalho, Diogo V., Eduardo M. Pereira, and Jaime S. Cardoso. 2019. “Machine Learning Interpretability: A Survey on Methods and Metrics.” Electronics 8(8):832. doi: 10.3390/electronics8080832.
Chakraborty, Supriyo, Richard Tomsett, Ramya Raghavendra, Daniel Harborne, Moustafa Alzantot, Federico Cerutti, Mani Srivastava, Alun Preece, Simon Julier, Raghuveer M. Rao, Troy D. Kelley, Dave Braines, Murat Sensoy, Christopher J. Willis, and Prudhvi Gurram. 2017. “Interpretability of Deep Learning Models: A Survey of Results.” Pp. 1–6 in 2017 IEEE SmartWorld, Ubiquitous Intelligence Computing, Advanced Trusted Computed, Scalable Computing Communications, Cloud Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI).
Chollet, François. 2018. Deep Learning with Python. Shelter Island, New York: Manning Publications Co.
Demajo, Lara Marie, Vince Vella, and Alexiei Dingli. 2020. “Explainable AI for Interpretable Credit Scoring.” Computer Science & Information Technology (CS & IT) 185–203. doi: 10.5121/csit.2020.101516.
Disbergen, Douwe. 2019. “A Machine Learning Approach to Quantify the Default Risk of Peer-to-Peer Loans on Bondora.Com.” Faculty of Economics and Management (TiSEM).
Finlay, Steven. 2010. “Credit Scoring for Profitability Objectives.” European Journal of Operational Research 202(2):528–37. doi: 10.1016/j.ejor.2009.05.025.
Hagras, Hani. 2018. “Toward Human-Understandable, Explainable AI.” Computer 51(9):28–36. doi: 10.1109/MC.2018.3620965.
Heaton, J. B., N. G. Polson, and J. H. Witte. 2018. “Deep Learning in Finance.” ArXiv:1602.06561 [Cs].
IBM. 2020a. “What Is Artificial Intelligence (AI)?” IBM Cloud Education. Retrieved April 8, 2021 (https://www.ibm.com/cloud/learn/what-is-artificial-intelligence).
IBM. 2020b. “What Is Machine Learning?” IBM Cloud Education. Retrieved March 28, 2021 (https://www.ibm.com/cloud/learn/machine-learning).
IBM. 2021b. “What Are Neural Networks?” IBM Cloud Education. Retrieved April 8, 2021 (https://www.ibm.com/cloud/learn/neural-networks).
Imtiaz, Sharjeel, and Allan Brimicombe. 2017. “A Better Comparison Summary of Credit Scoring Classification.” International Journal of Advanced Computer Science and Applications 8. doi: 10.14569/IJACSA.2017.080701.
Keras. n.d. “Keras Documentation: About Keras.” Keras. Retrieved May 20, 2021 (https://keras.io/about/).
Kłosok, Marta, and Marcin Chlebus. 2020. Towards Better Understanding of Complex Machine Learning Models Using Explainable Artificial Intelligence (XAI): Case of Credit Scoring Modelling.
Lessmann, Stefan, Bart Baesens, Hsin-Vonn Seow, and Lyn C. Thomas. 2015. “Benchmarking State-of-the-Art Classification Algorithms for Credit Scoring: An Update of Research.” European Journal of Operational Research 247(1):124–36. doi: 10.1016/j.ejor.2015.05.030.
Marini, F. 2009. “3.14 - Neural Networks.” Pp. 477–505 in Comprehensive Chemometrics, edited by S. D. Brown, R. Tauler, and B. Walczak. Oxford: Elsevier.
Molnar, Christoph. 2020. Interpretable Machine Learning. Lulu.com.
Molnar, Christoph, Giuseppe Casalicchio, and Bernd Bischl. 2020. “Interpretable Machine Learning – A Brief History, State-of-the-Art and Challenges.” ArXiv:2010.09337 [Cs, Stat].
Moscato, Vincenzo, Antonio Picariello, and Giancarlo Sperlí. 2021. “A Benchmark of Machine Learning Approaches for Credit Score Prediction.” Expert Systems with Applications 165:113986. doi: 10.1016/j.eswa.2020.113986.
Murdoch, W. James, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, and Bin Yu. 2019. “Interpretable Machine Learning: Definitions, Methods, and Applications.” Proceedings of the National Academy of Sciences 116(44):22071–80. doi: 10.1073/pnas.1900654116.
Provenzano, A. R., D. Trifirò, A. Datteo, L. Giada, N. Jean, A. Riciputi, G. Le Pera, M. Spadaccino, L. Massaron, and C. Nordio. 2020. “Machine Learning Approach for Credit Scoring.” ArXiv:2008.01687 [q-Fin, Stat].
Shapley, Lloyd S. 1953. “A Value for N-Person Games.” In Contributions to the Theory of Games (AM-28), Volume II. Princeton University Press.
Soui, Makram, Ines Gasmi, Salima Smiti, and Khaled Ghédira. 2019. “Rule-Based Credit Risk Assessment Model Using Multi-Objective Evolutionary Algorithms.” Expert Systems with Applications 126:144–57. doi: 10.1016/j.eswa.2019.01.078.
Zaytsev, Vitaly. 2020. “Selection and Evaluation of Relevant Predictors for Credit Scoring in Peer-to-Peer Lending with Random Forest Based Methods.”

7 Appendix

Table 24: Initial model fitting output

Table 25: Adjusted model fitting output

Table 26: Instances 2201 (0) and 58 (1) in test dataset

