Machine Learning Approach for Small Business Loan Default Prediction
Article · April 2023
Sowmya Chintalapudi
Carleton University
All content following this page was uploaded by Sowmya Chintalapudi on 09 January 2024.

Machine Learning Approach for Small Business Loan Default Prediction
Sowmya Chintalapudi, School of Information Technology, Carleton University, 101132004
sowmyachintalapudi@cmail.carleton.ca

Abstract— With recent growth in the competitive financial market, estimating the risk involved in a loan application is one of the challenges for banks' profit or loss. By predicting the loan defaulters, the bank can reduce its non-performing assets. Not every loan application received can be accepted. Most banks use their credit scoring and risk assessment procedures to examine loan applications and make credit approval decisions. In this work, some key factors are considered apart from the credit score, which will be helpful for making a better decision when predicting a loan approval. This paper aims at developing a model using the Logistic Regression (LR) algorithm to predict loan default. The dataset contains a sample size of 899,164 observations and 27 features. Significant evaluation measures such as Confusion Matrix, Accuracy, Recall, Precision, F1-Score, ROC analysis and Feature Importance have been calculated, and the factors that are considered have a significant positive impact on the algorithm's effectiveness and showed promising metrics. The result shows that this algorithm has achieved an accuracy of 98.5% and has the highest recall of 99%. This model can accurately identify high-risk loans and provide risk mitigation strategies, and this tool can help reduce the SBA's default rates and charge-off costs.

Keywords— Logistic Regression, loan approval, Machine learning

I. INTRODUCTION

Small business lending is a critical aspect of the economy. Small businesses require capital to grow and expand, and traditional banks are often hesitant to lend to them due to the higher perceived risk. In recent years, alternative lenders have emerged to fill this gap, offering small business loans with more flexible terms and higher approval rates. However, these lenders also face the risk of loan defaults, which can have a significant impact on their bottom line. Many small businesses and innovative start-ups are coming up with disruptive ideas. However, not all of these businesses are able to fund themselves. On the other hand, financial institutions still follow conventional approaches to determine loan eligibility. There have been many success stories of start-ups receiving Small Business Administration (SBA) loan guarantees [1]. However, there have also been stories of small businesses and/or start-ups that have defaulted on their loans. Loan evaluation needs to advance to keep pace with the fast-moving, innovative small business world.

The machine learning algorithm developed to predict loan defaults can be a valuable tool for small business lenders. By using this algorithm, lenders can more accurately assess the risk of default associated with each loan application. This can help lenders make better-informed decisions about which loans to approve and which to reject, ultimately reducing their exposure to risk. To implement the algorithm in a real-world setting, the lender would need to integrate it into their loan evaluation process. This would involve gathering the necessary data from loan applicants and inputting it into the algorithm for analysis. The data used to train the algorithm would include information such as credit scores, financial statements, and business plans. Once the algorithm has analyzed the data, it would generate a prediction of the likelihood of default for each loan application. Lenders can then use this information to make more informed decisions about whether to approve or reject each loan. Loans with a high likelihood of default could be rejected, while loans with a lower risk of default could be approved.

Using a machine learning algorithm to predict loan defaults offers several benefits for small business lenders. First, it allows lenders to make more informed decisions about which loans to approve, reducing their exposure to risk and potential losses. Second, it streamlines the loan evaluation process, making it more efficient and cost-effective. Finally, it can improve the lender's reputation by reducing the number of defaults associated with their loans.

Determining loan eligibility in a conventional approach typically includes the credit score, annual income, and the income statements of the immediate past few years. However, the risk of granting a loan may be affected by several other factors. In this proposed system, several other factors are considered to help determine loan approval or denial. The development of a machine learning algorithm that can predict loan defaults with the highest possible precision/accuracy has the potential to revolutionize small business lending. By integrating this algorithm into the SBA/lender loan evaluation process, lenders can reduce their exposure to risk and potential losses while also streamlining the process and improving their reputation.

The assessment is determined by estimating the loan's default probability by analyzing the historical dataset and then classifying the loan into one of two categories: (a) higher risk, likely to default on the loan (i.e., be charged off/fail to pay in full) or (b) lower risk, likely to pay off the loan in full. After comparing the best possible algorithms, we can propose a relatively best-fit algorithm to financial institutions. A loan issuer or decision maker can leverage this to decide. The dataset is taken from Kaggle, and it is a real dataset from the U.S. Small Business Administration (SBA). During the loan application process, various data about the client, the client's business, and the year the business was established are collected. Some of the data is relevant to risk prediction.

II. LITERATURE REVIEW

In this section, some of the contributions that have been made on creating machine learning models using various algorithms to improve loan approval prediction for small businesses are discussed briefly. The loan is designed for small businesses to pay for daily expenses or cover short-term financing issues [1]. Usually, banks require the borrowers to have a minimum credit score of 500 to 600 and to have run their business for at least one or two years [1]. Due to the current high demand for loans, demand for further improvements in the models for credit scoring and loan prediction is increasing significantly. Most of the ML models assess credibility based on the applicant's financial situation to grant credit. For loan approval, focusing on solutions and developing new models can increase efficiency in business. Many banks use traditional models to check credit scores and use the same for loan approvals. With the use of ML models, more justified, algorithmic and easily integrable solutions can be possible [2].

The logistic regression model is the most widely used method for loan approval prediction as it has a few advantages. Turiel, J. D. et al. [3] proposed a two-phase model using the Logistic Regression and Support Vector Machine (SVM) algorithms together with Linear and Non-Linear Deep Neural Networks, wherein the first phase predicts loan rejection, while the second one predicts default risk for approved loans. The two-phase model showed better performance overall for all loan purposes described [3]. The authors say that there was a discrepancy observed between how credit analysts treat these loans and how they might be treated more efficiently, in terms of their default risk and characteristics. However, for their future work they have considered combining Neural Networks with Logistic Regression to moderate their complex and not well-predictable nature.

The Support Vector Machine (SVM) algorithm is another commonly used machine learning technique that is utilized to predict binary outcomes. Li et al. [4] proposed a heterogeneous ensemble learning model to predict the default of borrowers. To verify their proposed method, a comparison was made with the traditional ML models SVM, Decision Tree and KNN (k-nearest neighbor). They selected the XGBoost, DNN and LR algorithms as the individual classifiers for training and then used the traditional models for comparison. Using the SVM algorithm and other machine learning methods, the results suggest that the performance of SVM is not as effective as the logistic regression model.

Abedin M. Z. et al. [5] discussed that in small business credit risk assessment, the default and non-default classes are very imbalanced. To solve this problem, they proposed a technique called WSMOTE (weighted synthetic minority oversampling technique). The results show that the proposed Random Forest classifier used in the WSMOTE-ensemble model provides a good balance between the performance on the default class and that on the non-default class. Madaan et al. [6] explain in their paper that loan prediction is mainly based on the credit score. Credit score is a key factor in determining eligibility in the competitive financial world. Their paper does a comprehensive and comparative analysis between two algorithms for loan default prediction: (i) Random Forest and (ii) Decision Trees. The results suggest that the random forest model has an overall higher performance than the decision tree model in terms of F1 score, precision, recall, and accuracy rate.

The work of Addo P.M. et al. [7] shows that the choice of variables to respond to business objectives and the choice of the algorithms used to decide are two key aspects of the management process when issuing a loan.

The author in [8] compared the performance of the deep neural network model with other algorithms for a default prediction problem. The results suggest that the deep learning models can get the highest specification but the lowest AUC on the test dataset. The disadvantage of a neural network is that the model requires more computation power since it needs to optimize many parameters. Also, the neural network model tends to overfit the training data. Furthermore, the model is hard to interpret due to its complex structure.
Coşer, A. et al. [9] use a methodology to determine a customer's probability of entering default. Three sampling scenarios were elaborated, in which four classifiers were applied using the following machine learning algorithms: LightGBM, XGBoost, Logistic Regression and Random Forests. The purpose of this research is to understand the patterns that can lead to a significant risk of a customer entering default and to build an accurate predictive model that is able to effectively classify observations in the two classes. [9] have shown that the random forest model performs better at predicting peer-to-peer loan default than logistic regression.

The author of [10] uses different boosting and bagging models to predict peer-to-peer loan default. The results indicate that the boosting method has a much higher accuracy than the bagging method. The author showed that the AdaBoost model has a higher AUC than the XGBoost model. [8] also compares different machine learning methods and finds that the XGBoost model has the highest AUC among the algorithms, including random forest models and deep learning models. [4] also gets a similar result in predicting peer-to-peer loan default in China.

A.A. Taha et al. [11] use a boosting method to predict credit card fraud. They compared the LightGBM model with several other basic machine learning algorithms, including logistic regression, decision tree, SVM, and Naive Bayes, and propose an approach for detecting fraud in credit card transactions using an optimized light gradient boosting machine (OLightGBM). Their experimental results indicate that the proposed approach outperformed the other machine learning algorithms and achieved the highest performance in terms of Accuracy, AUC, Precision and F1-score.

Xiaojun, M. et al. [12] use two novel machine learning algorithms called LightGBM and XGBoost to predict the default of customers based on real-life peer-to-peer (P2P) transactions. The authors point out that they chose these algorithms because they have a strong background and a practical applicability proven by numerous studies that reveal the remarkable performance of their application. The results of the research show that LightGBM recorded the best performance in comparison with XGBoost, having an error rate of 19.9% and an accuracy of 80.1% [12].

Yu Li in paper [13] did a comprehensive study comparing the XGBoost algorithm's performance with the performance of logistic regression. The paper concluded that the model discrimination and model stability of the XGBoost model were substantially higher than those of the logistic regression model. Wang, W. et al. [14] compared the traditional scorecard credit risk model against various machine learning models and found that XGBoost with monotonic constraints outperformed the scorecard model by 7% in K-S statistic. The proposed model was compared with Random Forest, Neural Networks, GBDT and XGBoost.

Harris, T. [15] studied the prediction of credit risk using a support vector machine (SVM) algorithm and applied it to two definitions of default: (i) a broader rule considering payments up to 90 days overdue; (ii) only customers with more than 90 days of late payment. The author claims that the model used for the broader definition has a higher accuracy than the other one and, at the same time, that it is a reliable and accurate method to predict un-creditworthiness compared to the traditional judgment approach.

Zhang, T. et al. [16] propose a new model that uses Multiple Instance Learning (MIL) in the development of a credit scoring model by including not only socio-demographic and loan application data, but also the transaction history data of the applicant. This method makes it possible to extract dynamic features from transactional information, and the results showed that all the classifiers applied using the newly added data had a significant increase in accuracy in comparison with not considering transactional data.

Papouskova, M. et al. [17] introduce a novel two-stage credit risk model: the first stage consists of a model used to predict the probability of default through ensemble classifiers that discriminate between good and bad payers; the second stage makes an in-depth analysis of customers with a predicted probability of default, and a regression ensemble is applied to determine the loan default. Later, the two models are combined to predict the expected loss. The researchers claim that this method outperforms other state-of-the-art models used to predict the loss and exposure at default.

Considerable research has been performed to determine efficient algorithms, to predict the potential default of a loan, and to determine whether a loan could be approved or not. However, most of the research was done by considering limited parameters, mostly the credit score of the applicant. In this study, additional parameters are considered to predict the probability of default and thus help to decide approval of a small business loan application: default percentage by industry, how loans backed by real estate performed, default percentage of businesses by urban and rural areas, businesses having a revolving line of credit, default percentage by state, and the time difference between loan approval date and disbursement date.
III. PROBLEM STATEMENT

The Small Business Administration (SBA) is a US federal government agency that was founded to assist and protect small businesses and that mainly focuses on its loan guarantee program. The Loan Guaranty Program is a government-subsidized loan program in which the SBA [18] offers government guarantees on loans made by commercial lenders to small businesses. The SBA works with banking agencies to issue loans to small companies. The SBA provides guidance for partnering banks instead of lending money directly to small business owners (https://www.sba.gov/).

3.1 Problem of Current Loan Approval Process for SBA Loans (or) Gaps in current work:

When a small business applies for a loan, the following parameters are considered in decision making: personal history, personal financial statement, business financial statements that include profit and loss statements, loan application history records, income tax returns, collateral, etc. Any agency would follow these basic checks to finalize the loan application [18]. Most of the existing loan approval algorithms consider the following factors to determine approval or denial of a loan application. Personal history includes the following details:
• Credit score
• Income
• Employment history
• Debt-to-income ratio
• Loan amount
• Payment history
[19] mentioned in their work that many other factors have constrained most SMEs from effectively receiving financing through traditional methods. If the parameters are limited, these businesses are most likely not to be eligible for a loan. On the other hand, financial institutions would be at risk if they provided loans to ineligible businesses by not considering all possible effective factors. The importance of developing machine learning techniques has been recognized by the industry, and different models have been built to support different sectors.

3.2 Research Questions
This work will focus on the following questions.

1. What model can be used for loan default prediction?
2. What are the types of data that are highly correlated with a low probability of default?
3. What are the benefits of using machine learning in terms of time consumption and accuracy?

IV. DATA SET
The dataset used in this work contains actual historical small business loan data from the U.S. Small Business Administration (SBA). This dataset includes historical data from 1987 through 2014 with a total of 899,164 observations. It contains 27 columns, some of the important/valuable fields are:

On top of this rich data, the Logistic Regression algorithm is used to predict whether a small business is likely to default or not. This prediction can be used by an institution to approve or decline a loan. In this proposed work we are developing a model using industry-specific loan performance data along with the regular parameters that financial institutions use to predict potential loan default.

The SBA loan default rate ranged from 11.5% to 24.7% between 2000 and 2015 [20], and the SBA needed to cover the charge-offs it guaranteed. The SBA is using a credit score model to streamline the loan approval process. However, the credit score model cannot estimate the risk of default [21], which led to the annual charge-off amount exceeding $1.6 billion from 2011 to 2020. For the proposed work, we use a combination of an effective algorithm and additional factors to predict the likelihood of a loan defaulting in the future. This paper aims at developing models using a predictive machine learning algorithm, namely Logistic Regression (LR), to evaluate loan data. Certain key variables are identified in the data set before training. The gross amount of disbursement might be an important indicator of loan default. The NAICS (North American Industry Classification System) code [23], the code for industrial classification, is expected to give the information that we are considering for our project. The first two digits of the code indicate the industry the borrower belongs to. While the accuracy of the algorithm can vary based on the factors used, using the right parameters makes a big difference in the actual prediction. We will be using a rich dataset which has valuable information collected from 1984 till date. This dataset has small business loan related information including age of the business, loan amount, whether defaulted or not, industrial sector and region.

V. BLOCK DIAGRAM
Start -> Load Pre-processed SBA dataset -> Train ML model -> Predict loan default to approve or deny
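The flow described in this section (load the pre-processed data, train a model, predict default to approve or deny) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the two features, the toy records, the function names and the 0.5 threshold are hypothetical, and a small hand-written gradient-descent logistic regression stands in for a library implementation so that the example is self-contained.

```python
import math

# Hypothetical toy records: (loan-to-revenue ratio, backed by real estate).
# These values are invented for illustration and are not SBA data.
TRAIN_ROWS = [(0.90, 0), (0.80, 0), (0.70, 1), (0.75, 0),
              (0.20, 1), (0.10, 1), (0.30, 0), (0.25, 1)]
TRAIN_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = default, 0 = paid in full


def load_preprocessed_dataset():
    """Block 1: stand-in for loading the pre-processed SBA dataset."""
    return TRAIN_ROWS, TRAIN_LABELS


def default_probability(weights, bias, row):
    """Logistic model: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, row))
    return 1.0 / (1.0 + math.exp(-z))


def train_model(rows, labels, learning_rate=0.5, epochs=5000):
    """Block 2: fit weights by batch gradient descent on the log loss."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for row, label in zip(rows, labels):
            error = default_probability(weights, bias, row) - label
            grad_w = [g + error * x for g, x in zip(grad_w, row)]
            grad_b += error
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


def decide(weights, bias, row, threshold=0.5):
    """Block 3: deny when the predicted default probability reaches the threshold."""
    risky = default_probability(weights, bias, row) >= threshold
    return "deny" if risky else "approve"


rows, labels = load_preprocessed_dataset()
weights, bias = train_model(rows, labels)
print(decide(weights, bias, (0.95, 0)))  # a heavily leveraged applicant
print(decide(weights, bias, (0.05, 1)))  # a lightly leveraged applicant
```

In practice the threshold would be chosen by the lender according to its risk appetite rather than fixed at 0.5.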
Figure 2: Industry Default rate

VI. PROPOSED SOLUTION
Traditionally, loan underwriting has relied on a combination of credit scores, financial statements, and other documents to assess creditworthiness. However, machine learning algorithms can analyze large amounts of data and identify patterns that may not be immediately apparent to human analysts. This can lead to more accurate predictions of loan defaults and better risk management.

The tool is customizable and can be adapted to the needs of loan officers and small businesses. The loan default prediction tool allows users to set their own criteria for evaluating loan applications and predicting defaults. This flexibility allows users to tailor the tool to their specific needs, rather than relying on a one-size-fits-all approach.

The loan default prediction tool has the potential to significantly improve the SBA's lending practices. By accurately identifying high-risk loans and providing risk mitigation strategies, the tool can help reduce the SBA's default rates and charge-off costs. Additionally, the tool can streamline the loan approval process, improve consistency in decision-making, and provide valuable insights through historical data analysis. Overall, the loan default prediction tool can be a valuable addition to the SBA's toolkit. By leveraging machine learning and data analysis, the SBA can better support small businesses while minimizing their financial risk.

For the proposed work, a combination of an effective algorithm and additional factors is used to predict the likelihood of a loan defaulting in the future. This paper aims at developing models using a predictive machine learning algorithm, namely Logistic Regression (LR), to evaluate loan data. [22] mentioned that logistic regression is a powerful tool, especially when addressing classification problems. This algorithm is extensively used in bankruptcy prediction [21][22][23][24]. The algorithm could be fed with the latest data as and when available and will analyze the consolidated dataset to give the prediction. This solution could also be used by financial institutions, with some customizations as per the needs of the institution.

The tool is designed to be user-friendly and accessible to loan officers and small business owners who may not have expertise in data analysis or machine learning. The interface is intuitive and easy to use, and the tool provides clear and concise recommendations for risk mitigation.
6.1 How is the problem addressed.
Certain key variables are identified in the data set before training. In this work, some of the key factors are cleansed, transformed, and enhanced to get effective output from the algorithm. The NAICS (North American Industry Classification System) code, the code for industrial classification [23], is used to visualize how various industries performed in terms of loan repayment. The first two digits of the code indicate the industry the borrower belongs to. While the accuracy of the algorithm can vary based on the factors used, using the right parameters makes a big difference in the actual prediction. A rich dataset is used which has valuable information collected from 1984. This dataset has small business loan related information including age of the business, loan amount, whether defaulted or not, industrial sector and region.

Table 1: Industry default rates- first two-digit NAICS codes

There are other major explanatory factors that we consider in this study:
- Default percentage by industry
- How loans backed by real estate performed
- Default percentage of businesses by urban and rural areas
- Businesses having a revolving line of credit
- Default percentage by state
- Time difference between loan approval date and disbursement date
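A few of these explanatory factors can be derived with very little code. The sketch below is illustrative only and makes several assumptions: the record layout, the field names (`naics`, `approval`, `disbursed`, `defaulted`) and the four sample loans are hypothetical and do not come from the SBA dataset.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample loans; field names and values are invented for illustration.
loans = [
    {"naics": "451120", "approval": date(1997, 2, 28),
     "disbursed": date(1997, 3, 31), "defaulted": False},
    {"naics": "722410", "approval": date(2006, 2, 7),
     "disbursed": date(2006, 4, 30), "defaulted": True},
    {"naics": "722211", "approval": date(2006, 3, 1),
     "disbursed": date(2006, 3, 31), "defaulted": False},
    {"naics": "451110", "approval": date(2007, 5, 1),
     "disbursed": date(2007, 5, 31), "defaulted": False},
]


def sector(naics_code):
    """The first two digits of a NAICS code identify the industry sector."""
    return naics_code[:2]


def default_rate_by_sector(records):
    """Share of defaulted loans within each two-digit NAICS sector."""
    totals, defaults = defaultdict(int), defaultdict(int)
    for record in records:
        key = sector(record["naics"])
        totals[key] += 1
        defaults[key] += int(record["defaulted"])
    return {key: defaults[key] / totals[key] for key in totals}


def days_to_disbursement(record):
    """Time difference between loan approval date and disbursement date."""
    return (record["disbursed"] - record["approval"]).days


rates = default_rate_by_sector(loans)
delays = [days_to_disbursement(record) for record in loans]
print(rates)
print(delays)
```

The same grouping idea extends to the other factors in the list, for example grouping by state or by urban/rural flag instead of by sector.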
Table 2: Description of 27 variables in both datasets

VII. ARCHITECTURE DIAGRAM

The architecture of the loan prediction model is shown in Figure 2. It consists of the following main blocks.

Fig 2: Architecture Diagram

1. Input dataset: The SBAnational.csv dataset is the initial input for this algorithm. It contains information about small business loans that were approved or denied by the Small Business Administration (SBA) in the United States. The dataset includes various features such as loan amount, loan type, industry, borrower demographics, gross amount of loan approved, SBA's guaranteed amount, charge-off date, whether a business has a revolving line of credit, and whether the business is in an urban or rural location. The dataset has been collected from Kaggle. This data is cleansed, transformed, and enhanced to derive the most effective factors. The data set is further split into training and testing sets in a 75:25 ratio. The training dataset is used to train the model. The data set has a total of 899,164 observations and contains 27 columns. Table 1 and Table 2 explain the variables in the dataset: industry default rates with NAICS codes and a description of the variables in the data set, respectively.

2. Data Preprocessing: This section involves preparing the input dataset for modeling by performing various data preprocessing techniques such as data cleaning, data transformation, and data normalization. This is done to remove any inconsistencies and ensure that the data is in a format that can be effectively used for modeling. The data needs to be converted into a useful format because it may contain irrelevant or missing information and noisy data. To deal with this problem, data cleaning techniques have been used. For instance, there are several fields with null values, which would result in inaccuracies and affect the performance of the algorithm; these values are ignored. Also, dollar amount values have special characters ($) and unwanted spaces, which are cleansed. All the dollar amount values are converted to the float data type for effective calculations. Data types of various columns are changed so that data calculation and comparison can be effective. Reduction techniques are used to deal with the huge volume of data. As a result, data analysis becomes easier and is intended to give accurate results, while storage efficiency improves and the cost of analyzing the data is reduced.

3. Test Set: The test set is used to evaluate the performance of the logistic regression model. This set contains loan applications whose outcomes are withheld from the model, and the trained model is used to predict whether each loan will default or not. The accuracy of the model's predictions is evaluated by comparing the predicted outcomes with the actual outcomes of the test set. If the accuracy is high, it means that the logistic regression model has learned how to predict loan default with a high degree of accuracy.

4. Training Set: The preprocessed dataset is divided into two sets: training and testing sets. The training set is used to train the logistic regression algorithm to predict loan default. This set contains a known output (loan default status) for each loan application, which is used to train the algorithm to learn how to predict default based on the input features.

5. Logistic Regression Algorithm: Logistic regression is a statistical method used for binary classification problems, where the goal is to predict one of two possible outcomes (such as yes/no or true/false) based on one or more predictor variables. Logistic regression is much like linear regression except in how the two are used. Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems [25]. The sigmoid function is a mathematical function used to map predicted values to probabilities. It maps any real
value to another value that is between 0 and 1. The value must lie between 0 and 1, which means it cannot exceed this limit, so it forms a curve like the "S" form.
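The sigmoid mapping described here can be written as sigma(z) = 1 / (1 + e^(-z)). The short sketch below is illustrative; the function name and the branch on the sign of `z` (used only to avoid floating-point overflow for very negative inputs) are choices made for this example and are not taken from the paper.

```python
import math


def sigmoid(z):
    """Map any real value to a probability-like value in the range 0 to 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z) to avoid overflow.
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


# Sampling a few points traces the "S" shape: low on the left, 0.5 at zero,
# high on the right.
for z in (-6, -2, 0, 2, 6):
    print(z, round(sigmoid(z), 4))
```

A predicted value of 0 maps to exactly 0.5, which is why 0.5 is the usual default cut-off between the two classes.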
6. Build Model: The logistic regression model is built by fitting the algorithm to the training data and adjusting the parameters of the model to minimize the difference between predicted and actual outcomes. The resulting model is used to make predictions on new loan applications.

Fig 3: Logistic regression

7. Feature Engineering
In feature engineering, a proper input dataset that is compatible with the machine learning algorithm's requirements is prepared. In our model, the Pandas and NumPy libraries are imported to run these steps, which helps improve the performance of the machine learning model.

import pandas as pd
import numpy as np

8. Class Diagram

Fig 4: Class Diagram

VIII. ALGORITHM AND PSEUDOCODE
8.1 Pseudocode:
1. Load the dataset 'SBAnational.csv' and import the necessary libraries.
2. Preprocess the dataset:
• Check for missing values and impute them with appropriate methods.
• Convert categorical variables to numerical using one-hot encoding.
• Split the dataset into training and validation sets.
3. Initialize a logistic regression model with the 'random_state' parameter set to 2.
4. Train the model on the training set using the 'fit' method.
5. Make predictions on the validation set using the 'predict' method.
6. Evaluate the model performance using metrics such as accuracy, precision, recall, F1-score, and confusion matrix.
7. To improve model performance, try hyperparameter tuning techniques such as grid search or random search.
8. Finally, deploy the model on new data to make predictions.
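Step 2 of the pseudocode (preprocessing) can be illustrated with a minimal, library-free sketch. Everything named here is an assumption made for the example: the helper functions, the sample values and the choice to drop incomplete rows rather than impute them are hypothetical, and the 75:25 ratio and seed of 2 simply mirror the figures quoted in the text.

```python
import random


def drop_incomplete(records):
    """Discard records that contain a missing (None) value."""
    return [r for r in records if all(v is not None for v in r.values())]


def parse_dollars(text):
    """Turn a string such as '$60,000.00 ' into the float 60000.0."""
    return float(text.replace("$", "").replace(",", "").strip())


def one_hot(values):
    """One-hot encode a list of category labels; returns (rows, categories)."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return rows, categories


def split_rows(rows, train_fraction=0.75, seed=2):
    """Shuffle reproducibly, then split into training and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical usage on invented values.
clean = drop_incomplete([{"amount": "$60,000.00 "}, {"amount": None}])
amounts = [parse_dollars(r["amount"]) for r in clean]
encoded, categories = one_hot(["Urban", "Rural", "Urban"])
train, validation = split_rows(list(range(8)))
print(amounts, categories, encoded, len(train), len(validation))
```

With a data-frame library the same steps would usually be expressed as column operations, but the logic stays the same.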
# Load necessary libraries and the dataset # Train the model and make predictions
import pandas as pd %mprun log_reg.fit(X_train, y_train)
from sklearn.linear_model import LogisticRegression %memit y_logpred = log_reg.predict(X_val)
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Load the dataset
dataset = pd.read_csv('SBAnational.csv')

# Preprocessing the dataset
## Check for missing values and impute them
## Convert categorical variables to numerical
## Split the dataset into training and validation sets

# Initialize a logistic regression model with 'random_state' parameter set to 2
log_reg = LogisticRegression(random_state=2)

# Train the model on the training set using the 'fit' method
log_reg.fit(X_train, y_train)

# Make predictions on the validation set using the 'predict' method
y_pred = log_reg.predict(X_val)

# Evaluate the model performance
accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)
cm = confusion_matrix(y_val, y_pred)

# Hyperparameter tuning techniques
## Grid search
## Random search

# Deploy the model on new data to make predictions
## Load the new data
## Preprocess the new data
## Make predictions using the trained model

8.2 Algorithm details:
Step 1: Data preprocessing

Step 2: Fitting Logistic Regression to our training set
from sklearn.linear_model import LogisticRegression
lr ← LogisticRegression(random_state=0)
lr.fit(X_train, y_train)

# Initialize model
%memit log_reg = LogisticRegression(random_state=2)

Step 3: Testing the model
pred ← lr.predict(X_test)
y_pred

# predict probabilities on validation set
y_pred_proba = log_reg.predict_proba(X_val)[:, 1]

Step 4: Measuring Accuracy and ROC
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

# Following commands are to generate ROC graph
log_reg
# predict probabilities on validation set
y_pred_proba = log_reg.predict_proba(X_val)[:, 1]
# generate ROC curve
fpr, tpr, thresholds = roc_curve(y_val, y_pred_proba)
# calculate AUC score
roc_auc = auc(fpr, tpr)
# plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='grey', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()

Step 5:
import sklearn.metrics as metrics
fpr, tpr, threshold ← metrics.roc_curve(y_test, pred)
roc_auc ← metrics.auc(fpr, tpr)
plt.title("Logistic Regression")
plt.plot(fpr, tpr, 'b', label = 'auc = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')

IX. EVALUATION AND RESULTS

9.1 Complexity Analysis:
In this work, complexity analysis is performed using several measures: accuracy, F1 score, precision, and recall, along with the time taken to train the model. Existing Python libraries and their functions are used for this analysis.
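As a library-free illustration of the fit, predict, and timing steps described above, the following minimal sketch trains a logistic regression by batch gradient descent on a small synthetic data set. It is not the paper's implementation (which uses scikit-learn on SBAnational.csv): the helper names `fit_logistic` and `predict`, the learning rate, the epoch count, and the synthetic data are all assumptions made for illustration.

```python
import math
import random
import time


def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    m = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / m for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / m
    return w, b


def predict(X, w, b, threshold=0.5):
    """Return 1 when the predicted probability reaches the threshold, else 0."""
    return [
        1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= threshold else 0
        for xi in X
    ]


# Synthetic, linearly separable data (assumption): label is 1 when x0 + x1 > 0
rng = random.Random(2)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(400)]
y = [1 if x0 + x1 > 0 else 0 for x0, x1 in X]

# 75% training / 25% test, mirroring the split size used in the paper
split = int(0.75 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

start = time.perf_counter()
w, b = fit_logistic(X_train, y_train)
train_seconds = time.perf_counter() - start

y_pred = predict(X_test, w, b)
accuracy = sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test)
print(f"accuracy={accuracy:.3f}, training time={train_seconds:.4f}s")
```

The wall-clock time printed here depends on the machine and on the pure-Python loop, so it only illustrates how a training time can be recorded, not what to expect from an optimized solver.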
9.1.1 Accuracy:
Accuracy measures the percentage of predicted class labels that are correct. It is defined as
A = (TP + TN) / (TP + TN + FP + FN)
where TP is True Positive, TN is True Negative, FN is False Negative, and FP is False Positive.
Accuracy is a measure of how well the algorithm performs in predicting the outcomes of interest. In this case of logistic regression for SBA loan default prediction, accuracy is measured on a test set comprising 25% of the total data.

9.1.2 Time complexity:
In terms of time complexity, the logistic regression algorithm has a time complexity of O(k * n^2), where k is the number of iterations required to converge and n is the number of features in the input data.

In this case, the expression evaluates to 2,433,600, so training is estimated to take approximately 2,433,600 units of time. However, this is a theoretical estimate, and the actual time taken by the algorithm can vary depending on the specific implementation, the size of the input data, and other factors.

To expand on the theoretical evaluation of time and space complexity: time complexity refers to the amount of time required by an algorithm to complete its task, while space complexity refers to the amount of memory or storage space required by an algorithm to perform its operations. These two factors often trade off against each other: algorithms that require more time to complete their tasks often require less space, and vice versa.

In the case of logistic regression, the time complexity is determined by the number of iterations required to converge, which in turn depends on the number of features in the input data. The space complexity of the algorithm is also determined by the size of the input data, as the algorithm needs to store the feature vectors and associated labels during the training process.

Time complexity of training logistic regression model: O(2433600)

9.1.3 Space Complexity:
In this work, the space complexity of the logistic regression model was calculated to be O(54015536). The amount of memory required to store the model scales linearly with the number of features in the dataset, with a constant factor of 8 bytes per parameter.

Therefore, the space complexity of the logistic regression model grows linearly with the number of features, and the amount of memory required to store the model can become quite large for datasets with many features. It is important to keep this in mind when working with large datasets, as it can affect the performance and scalability of the algorithm.

Space complexity of logistic regression model: O(54015536)
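To make the time and space estimates more concrete, the sketch below evaluates the k * n^2 expression and the 8-bytes-per-parameter estimate, and measures the peak memory of a single statement with Python's standard `tracemalloc` module. This is only an approximation of what the `%memit` magic reports: `tracemalloc` counts memory allocated by Python objects, not the whole process. The values of `k` and `n` and the helper names are placeholders chosen for illustration, not figures from the paper.

```python
import tracemalloc

BYTES_PER_PARAMETER = 8  # one 64-bit float per coefficient, as stated in the text


def estimated_training_operations(iterations, n_features):
    """Evaluate the k * n^2 expression quoted for the training cost."""
    return iterations * n_features ** 2


def estimated_model_bytes(n_features, with_intercept=True):
    """Memory for the coefficient vector, plus one intercept if requested."""
    n_parameters = n_features + (1 if with_intercept else 0)
    return n_parameters * BYTES_PER_PARAMETER


def peak_python_memory(func, *args):
    """Run func(*args) and return (result, peak bytes allocated by Python)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


# Hypothetical values, for illustration only
k, n = 100, 20
operations = estimated_training_operations(k, n)
model_bytes = estimated_model_bytes(n)
print(f"estimated operations: {operations}")
print(f"estimated model size: {model_bytes} bytes")

# Measure a statement that allocates a list of 100,000 floats
weights, peak = peak_python_memory(lambda size: [0.0] * size, 100_000)
print(f"peak traced memory: {peak / 1024:.1f} KiB")
```

Separating the analytic estimate from the measured peak makes it easier to see when the stored training data, rather than the model parameters, dominates memory use.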
9.1.4 Precision and recall:
True Positive Rate, also known as Recall, is the percentage of positive values that were correctly predicted out of all actual positive values.
R = TP / (TP + FN)
Precision represents the proportion of predicted positive values that are actually positive.
P = TP / (TP + FP)
Precision and recall are measures of the quality of the predictions made by the algorithm. Precision is the proportion of correct predictions among the total predicted positives, while recall is the proportion of correct predictions among the actual positives. In this work, the precision and recall of the algorithm are measured on the test dataset, which is 25% of the total data.

9.1.4 F1 score:
Precision and recall are combined to produce an F1 score.
F1 score = 2 * P * R / (P + R)
F1 score is a measure of the overall quality of the predictions made by the algorithm, considering both precision and recall. In this work on SBA loan default prediction, the F1 score of the algorithm is calculated on the test dataset, which is 25% of the total data.

9.1.5 Results of Analysis
Below is the snapshot of Complexity Analysis results.

              precision   recall   f1-score   support
0             0.988       0.993    0.991      89964
1             0.975       0.956    0.966      24719
accuracy                           0.985      114683
macro avg     0.982       0.975    0.978      114683
wtd. avg      0.985       0.985    0.985      114683

9.2 Efficiency Analysis:
In addition to the above-mentioned metrics (accuracy, precision, recall, F1 score, and time taken to train the model), the following are also determined: the confusion matrix and the Receiver Operating Characteristic (ROC) curve.

9.2.1 Confusion matrix:
A Confusion Matrix (CM) is a two-dimensional rectangular array in which one dimension contains the predicted values and the other dimension reflects the actual values. Since this study predicts loan default, the target is binary: 0 indicates no default and 1 indicates a default. A confusion matrix is a table that summarizes the performance of a classification algorithm by comparing the predicted class labels to the actual class labels.

Following is the generated confusion matrix:
[89210   592]
[ 1099 23782]
Which can be interpreted as:
• 89,210 loans were correctly predicted to not default (true negatives).
• 592 loans were incorrectly predicted to default (false positives).
• 1,099 loans were incorrectly predicted to not default (false negatives).
• 23,782 loans were correctly predicted to default (true positives).

9.2.2 Receiver Operating Characteristic (ROC) curve:
The ROC curve is a graphical representation of the performance of the model that shows the trade-off between true positive rate and false positive rate at different classification thresholds.
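As a cross-check, the definitions of accuracy, precision, recall, and F1 score given in Section 9.1 can be applied directly to the four counts of the confusion matrix in Section 9.2.1. The short sketch below does this arithmetic; the variable names are illustrative. The false positive rate and recall computed here also give the single operating point on the ROC curve that corresponds to the default classification threshold.

```python
# Counts taken from the confusion matrix reported in Section 9.2.1
tn, fp = 89210, 592    # actual class 0: predicted 0, predicted 1
fn, tp = 1099, 23782   # actual class 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)  # also the true positive rate
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)

print(f"total loans in test set: {total}")
print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"F1 score  = {f1:.4f}")
print(f"ROC operating point: FPR = {false_positive_rate:.4f}, TPR = {recall:.4f}")
```

The total of 114,683 matches the support reported in Section 9.1.5, and the accuracy, recall, and F1 values for class 1 round to the reported 0.985, 0.956, and 0.966.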
9.2.4 Feature Importance
The importance values are calculated using the coefficients of the logistic regression model. These values indicate how much each feature contributes to the prediction of the target variable (in this case, whether a loan would default or not). This plot is useful for identifying which features have the most influence on the target variable and can help guide feature selection or engineering efforts. In this case, it can provide insights into the factors that contribute to loan defaults and may inform risk assessment or underwriting decisions. The features considered here are: gross approval (gross amount of loan approved by the bank), backed by real estate, term (loan term in months), disbursement gross (amount disbursed), and charged-off amount.
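A minimal sketch of coefficient-based importance follows. The coefficient values are placeholders invented for illustration and are NOT the fitted values from this work; in practice they would be read from the trained model (for scikit-learn, the `coef_` attribute of the fitted `LogisticRegression`). The feature keys and the helper `rank_features` are likewise hypothetical. Coefficient magnitudes are comparable across features only when the inputs are on a common scale, for example after standardization.

```python
# Placeholder coefficients for illustration only; these are NOT the fitted
# values from the paper. They stand in for the model's coefficient vector
# obtained on standardized inputs.
coefficients = {
    "gross_approval": 0.35,
    "backed_by_real_estate": -0.80,
    "term_months": -1.90,
    "disbursement_gross": 0.20,
    "charged_off_amount": 2.40,
}


def rank_features(coefs):
    """Sort features by absolute coefficient, largest first.

    Returns (name, coefficient, share) tuples, where share is the
    feature's fraction of the total absolute coefficient mass.
    """
    total = sum(abs(value) for value in coefs.values())
    ranked = sorted(coefs.items(), key=lambda item: abs(item[1]), reverse=True)
    return [(name, value, abs(value) / total) for name, value in ranked]


ranking = rank_features(coefficients)
for name, value, share in ranking:
    # A positive coefficient raises the log-odds of class 1 (default)
    direction = "raises" if value > 0 else "lowers"
    print(f"{name:24s} coef={value:+.2f} share={share:.2f} ({direction} predicted default odds)")
```

Ranking by absolute value keeps features with strong negative effects visible, while the sign still shows the direction of the effect.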
9.2.3 Operational Analysis
The output of %memit shows the peak memory usage and the memory increment of the statement. The peak memory usage is the maximum amount of memory used during the execution of the statement, and the memory increment is the change in memory usage between the beginning and end of the statement [26]. In the output, the peak memory usage for the model initialization statement is 1442.16 MiB, and the increment is 0.02 MiB. This means that the statement used a maximum of 1442.16 MiB of memory and caused a negligible increase in memory usage compared to the beginning of the statement.
For the prediction statement, the peak memory usage was 1127.48 MiB, and the increment was 5.97 MiB. This means that the statement used a maximum of 1127.48 MiB of memory and caused an increase of 5.97 MiB in memory usage compared to the beginning of the statement.

peak memory: 1442.16 MiB, increment: 0.02 MiB
peak memory: 1127.48 MiB, increment: 5.97 MiB

9.3 Comparison of the solution with other related approaches
With the default rate increasing considerably, better solutions for predicting loan default continue to emerge. In [29], the authors use logistic regression considering factors such as age, objective, credit score, credit amount, and credit period, compare the results with various other algorithms, and achieve an accuracy of 0.785. [30] compares loan prediction using various algorithms, considering parameters like credit score, income, age, marital status, gender, etc. The predictive models based on Logistic Regression, Decision Tree, and Random Forest give accuracies of 80.945%, 93.648%, and 83.388%, respectively. [31] identified the important risk factors that contribute to the loan status classification problem and proposed a technique which shows an accuracy of 95.84%. In [32], customers' past data, such as age, income, loan amount, and tenure of work, are considered to determine the factors that have the most impact on the prediction, and the work was done using SVM, K-NN, and random forest algorithms. SVM achieved 73.2% accuracy, K-NN 68% accuracy, Random Forest 81% accuracy, and Logistic Regression 77% accuracy. With the proposed model, the tool would generate a default risk score based on the borrower's loan and other financial information. The default risk score would be calculated using the machine learning algorithm that is developed as part of the project, which has a 98.5% precision/accuracy rate. This proposed model would reduce the SBA's charge-off costs by identifying high-risk loans before they default. This would enable the SBA to take proactive measures to mitigate risk, rather than covering the cost of defaulted loans.

X. CONCLUSION
In this model, loan default prediction has been done using logistic regression by considering various parameters. This model leverages machine learning and data analysis to predict loan defaults, which is a relatively new and innovative approach to evaluating credit risk. This model gives an accuracy rate of 98.5%, which is significantly higher than that of traditional credit scoring models, which typically have accuracy rates between 70% and 80%.

The development of a machine learning algorithm that can predict loan defaults with the highest possible precision/accuracy has the potential to revolutionize small business lending. By integrating this algorithm into the SBA/lender loan evaluation process, lenders can reduce their exposure to risk and potential losses while also streamlining the process and improving their reputation. While there may be some challenges associated with implementing this technology in the real world, the benefits it offers are significant and could ultimately result in a more efficient and effective small business lending industry.

ACKNOWLEDGMENT
I would like to acknowledge the Small Business Administration (SBA) for providing a valuable and rich data source that contributed significantly to the development of the high accuracy algorithm presented in this research paper. The data provided by the SBA enabled me to conduct extensive analyses and develop an effective model for predicting key performance indicators in small businesses.

REFERENCES
1. Wamala, Y. (2021, March 25). Types of loans: What are the differences? ValuePenguin. https://www.valuepenguin.com/loans/types-of-loans. (Accessed 03 July 2021)
2. H. Ince and B. Aktan, "A comparison of data mining techniques for credit scoring in banking: A managerial perspective", Journal of Business Economics and Management, vol. 10, no. 3, pp. 233-240, 2009.
3. Turiel, J. D., & Aste, T. (2019). P2P Loan acceptance and default prediction with Artificial Intelligence. arXiv preprint arXiv:1907.01800.
4. Li, W., Ding, S., Wang, H. et al. Heterogeneous ensemble learning with feature engineering for default prediction in peer-to-peer lending in China. World Wide Web 23, 23–45 (2020).
5. Abedin, M. Z., Guotai, C., Hajek, P., & Zhang, T. (2022). Combining weighted SMOTE with ensemble learning for the class-imbalanced prediction of small business credit risk. Complex & Intelligent Systems, 1-21.
6. Madaan, M., Kumar, A., Keshri, C., Jain, R., & Nagrath, P. (2021). Loan default prediction using decision trees and random forest: A comparative study. In IOP Conference Series: Materials Science and Engineering (Vol. 1022, No. 1, p. 012042). IOP Publishing.
7. Addo, P. M., Guegan, D., & Hassani, B. (2018). Credit risk analysis using machine and deep learning models. Risks, 6(2), 38.
8. Aleksandrova, Y. (2021). Comparing performance of machine learning algorithms for default risk prediction in peer to peer lending. TEM Journal, 10(1), 133-143.
9. Coşer, A., Maer-matei, M. M., & Albu, C. (2019). Predictive models for loan default risk assessment. Economic Computation & Economic Cybernetics Studies & Research, 53(2).
10. L. Lai, "Loan Default Prediction with Machine Learning Techniques," 2020 International Conference on Computer Communication and Network Security (CCNS), Xi'an, China, 2020, pp. 5-9, doi: 10.1109/CCNS50731.2020.00009.
11. A. A. Taha and S. J. Malebary, "An Intelligent Approach to Credit Card Fraud Detection Using an Optimized Light Gradient Boosting Machine," in IEEE Access, vol. 8, pp. 25579-25587, 2020, doi: 10.1109/ACCESS.2020.2971354.
12. Ma, X., Sha, J., Wang, D., Yu, Y., Yang, Q., & Niu, X. (2018). Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGboost algorithms according to different high dimensional data cleaning. Electronic Commerce Research and Applications, 31, 24-39.
13. Y. Li, "Credit Risk Prediction Based on Machine Learning Methods," 2019 14th International Conference on Computer Science & Education (ICCSE), Toronto, ON, Canada, 2019, pp. 1011-1013, doi: 10.1109/ICCSE.2019.8845444.
14. Wang, W., Lesner, C., Ran, A., Rukonic, M., Xue, J., & Shiu, E. (2020, April). Using small business banking data for explainable credit risk scoring. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 08, pp. 13396-13401).
15. Harris, T. (2013). Quantitative credit risk assessment using support vector machines: Broad versus Narrow default definitions. Expert Systems with Applications, 40(11), 4404-4413.
16. Zhang, T., Zhang, W., Wei, X. U., & Haijing, H. A. O. (2018). Multiple instance learning for credit risk assessment with transaction data. Knowledge-Based Systems, 161, 65-77.
17. Papouskova, M., & Hajek, P. (2019). Two-stage consumer credit risk modelling using heterogeneous ensemble learning. Decision Support Systems, 118, 33-45.
18. U.S. Small Business Administration. (2021b). Small Business Administration loan program performance. U.S. Small Business Administration. Retrieved from: https://www.sba.gov/document/report-small-business-administration-loanprogram-performance
19. Chen, M. (2011). Bankruptcy prediction in firms with statistical and intelligent techniques and a comparison of evolutionary computation approaches. Computers & Mathematics with Applications, 62(12), 4514-4524.
20. Zhu, Y., Zhou, L., Xie, C., Wang, G. J., & Nguyen, T. V. (2019). Forecasting SMEs' credit risk in supply chain finance with an enhanced hybrid ensemble machine learning approach. International Journal of Production Economics, 211, 22-33.
21. Voigt, K., & Campbell, C. W. (2017, October 3). 1 in 6 small business administration loans fail, study finds. NerdWallet. https://www.nerdwallet.com/article/smallbusiness/study-1-in-6-sba-small-business-administration-loans-fail. (Accessed: 20 April 2021)
22. Glennon, D. C., & Nigro, P. (2005b). Measuring the default risk of small business loans: A survival analysis approach. Journal of Money, Credit & Banking, 37(5), 923-947. doi:10.1353/mcb.2005.0051
23. Min Li, Amy Mickel, and Stanley Taylor. "Should this loan be approved or denied?": A large dataset with class assignment guidelines. Journal of Statistics Education, 26(1):55–66, 2018b. doi:10.1080/10691898.2018.1434342. URL https://doi.org/10.1080/10691898.2018.1434342.
24. M. Zieba, S. K. Tomczak, and J. M. Tomczak. Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction. Expert Systems with Applications, 58(Oct.):93–101, 2016b.
25. Stewart Jones, David Johnstone, and Roy Wilson. Predicting corporate bankruptcy: An evaluation of alternative statistical frameworks. Journal of Business Finance & Accounting, 44(1-2):3–34, 2017.
26. H. Son, C. Hyun, D. Phan, and H.J. Hwang. Data analytic approach for bankruptcy prediction. Expert Systems with Applications, 138:112816, 2019. ISSN 0957-4174. doi: https://doi.org/10.1016/j.eswa.2019.07.033. URL https://www.sciencedirect.com/science/article/pii/S0957417419305123.
27. Fabio Sigrist and Christoph Hirnschall. Grabit: Gradient tree-boosted tobit models for default prediction. Journal of Banking & Finance, 102:177–192, 2019.
28. Sheikh, M. A., Goel, A. K., & Kumar, T. (2020, July). An approach for prediction of loan approval using machine learning algorithm. In 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC) (pp. 490-494). IEEE.
29. Dosalwar, S., Kinkar, K., Sannat, R., & Pise, N. (2021). Analysis of loan availability using machine learning techniques. International Journal of Advanced Research in Science, Communication and Technology (IJARSCT), 9(1), 15-20.
30. Khan, A., Bhadola, E., Kumar, A., & Singh, N. (2021). Loan approval prediction model a comparative analysis. Advances and Applications in Mathematical Sciences, 20(3).
31. Wang, H., & Cheng, L. (2021). CatBoost model with synthetic features in application to loan risk assessment of small businesses. arXiv preprint arXiv:2106.07954.
32. P. Tumuluru, L. R. Burra, M. Loukya, S. Bhavana, H. M. H. CSaiBaba and N. Sunanda, "Comparative Analysis of Customer Loan Approval Prediction using Machine Learning Algorithms," 2022 Second International Conference on Artificial Intelligence and Smart Energy (ICAIS), Coimbatore, India, 2022, pp. 349-353, doi: 10.1109/ICAIS53314.2022.9742800.
APPENDIX
- LOGISTIC REGRESSION RESULTS

- DEFAULT PERCENTAGE BY INDUSTRY
