Machine Learning Approach for Small Business Loan Default Prediction
net/publication/377238327
1 author:
Sowmya Chintalapudi
Carleton University
All content following this page was uploaded by Sowmya Chintalapudi on 09 January 2024.
streamlining the process and improving their reputation. The assessment is determined by estimating the loan's default probability from the historical dataset and then classifying the loan into one of two categories: (a) higher risk, likely to default on the loan (i.e., be charged off or fail to pay in full), or (b) lower risk, likely to pay off the loan in full. After comparing candidate algorithms, we can propose the best-fitting one to financial institutions, and a loan issuer or decision maker can use it to support the decision. The dataset, obtained from Kaggle, is a real dataset from the U.S. Small Business Administration (SBA). During the loan application process, various data about the client, the client's business, and the year the business was established are collected. Some of this data is relevant to risk prediction.

II. LITERATURE REVIEW

In this section, some of the contributions made toward creating machine learning models that use various algorithms to improve loan approval prediction for small businesses are discussed briefly. The loan is designed for small businesses to pay for daily expenses or cover short-term financing issues [1]. Usually, banks require borrowers to have a minimum credit score of 500 to 600 and to have been in business for at least one or two years [1]. Because demand for loans is currently high, demand for further improvements in credit scoring and loan prediction models is increasing significantly. Most ML models assess an applicant's creditworthiness based on their financial situation in order to grant credit. For loan approval, focusing on solutions and developing new models can increase business efficiency. Many banks still use traditional models to check credit scores and rely on them for loan approvals. With ML models, better justified, algorithmic, and easily integrable solutions become possible [2].

The logistic regression model is the most widely used method for loan approval prediction, as it has a few advantages. Turiel, J. D. et al. [3] proposed a two-phase model using Logistic Regression and Support Vector Machine (SVM) algorithms together with Linear and Non-Linear Deep Neural Networks, wherein the first phase predicts loan rejection, while the second predicts default risk for approved loans. The two-phase model showed better overall performance for all the loan purposes described [3]. The authors report a discrepancy between how credit analysts treat these loans and how they might be treated more efficiently in terms of their default risk and characteristics. For future work, they consider combining Neural Networks with Logistic Regression to moderate their complex and not well-predictable nature.

The Support Vector Machine (SVM) algorithm is another commonly used machine learning technique for predicting binary outcomes. Li et al. [4] proposed a heterogeneous ensemble learning model to predict the default of borrowers. To verify the proposed method, they compared it with the traditional ML models SVM, Decision Tree, and KNN (k-nearest neighbor). They selected XGBoost, DNN, and LR algorithms as the individual classifiers for training and then used the traditional. Using the SVM algorithm and other machine learning methods, the results suggest that SVM is not as effective as the logistic regression model.

Abedin M. Z. et al. [5] discussed that in small business credit risk assessment, the default and non-default classes are highly imbalanced. To solve this problem, they proposed a technique called WSMOTE (weighted synthetic minority oversampling technique). The results show that the Random Forest classifier used in the proposed WSMOTE-ensemble model provides a good balance between performance on the default class and on the non-default class. Madaan et al. [6] explain that loan prediction is mainly based on the credit score, a key factor in determining eligibility in the competitive financial world. Their paper presents a comprehensive comparative analysis of two algorithms for loan default prediction: (i) Random Forest and (ii) Decision Trees. The results suggest that the random forest model has higher overall performance than the decision tree model in terms of F1 score, precision, recall, and accuracy.

Addo P. M. et al. [7] show that the choice of variables that respond to business objectives and the choice of algorithms used to decide are two key aspects of the management process when issuing a loan.

The author in [8] compared the performance of a deep neural network model with other algorithms for a default prediction problem. The results suggest that the deep learning models achieve the highest specificity but the lowest AUC on the test dataset. A disadvantage of a neural network is that the model requires more computation power since it needs to optimize many parameters. Also, the neural network model tends to overfit the training data. Furthermore, the model is hard to interpret due to its complex structure.
Sowmya Chintalapudi, School of Information Technology, Carleton University, 101132004 3
Coşer, A. et al. [9] use a methodology to determine a customer's probability of entering default. Three sampling scenarios were elaborated, in which four classifiers were applied using the following machine learning algorithms: LightGBM, XGBoost, Logistic Regression, and Random Forests. The purpose of that research is to understand the patterns that can lead to a significant risk of a customer entering default and to build an accurate predictive model that can effectively classify observations into the two classes. The authors of [9] have shown that the random forest model performs better than logistic regression at predicting peer-to-peer loan default.

The author of [10] uses different boosting and bagging models to predict peer-to-peer loan default. The results indicate that the boosting method has a much higher accuracy than the bagging method. The author showed that the AdaBoost model has a higher AUC than the XGBoost model. The study in [8] also compares different machine learning methods and finds that the XGBoost model has the highest AUC among the algorithms tested, including random forest models and deep learning models. A similar result is obtained in [4] for predicting peer-to-peer loan default in China.

A. A. Taha et al. [11] use a boosting method to predict credit card fraud. They compared the LightGBM model with several other basic machine learning algorithms, including logistic regression, decision tree, SVM, and Naive Bayes, and propose an approach for detecting fraud in credit card transactions using an optimized light gradient boosting machine (OLightGBM). Their experimental results indicate that the proposed approach outperformed the other machine learning algorithms and achieved the highest performance in terms of Accuracy, AUC, Precision, and F1-score.

Xiaojun, M. et al. [12] use two novel machine learning algorithms, LightGBM and XGBoost, to predict the default of customers based on real-life peer-to-peer (P2P) transactions. The authors point out that they chose these algorithms because they have a strong background and a practical applicability proven by numerous studies that reveal the remarkable performance of their application. The results of the research showed that LightGBM recorded the best performance in comparison with XGBoost, with an error rate of 19.9% and an accuracy of 80.1% [12].

Yu Li [13] carried out a comprehensive study comparing the performance of the XGBoost algorithm with that of logistic regression. The paper concluded that the model discrimination and model stability of the XGBoost model were substantially higher than those of the logistic regression model. Wang, W. et al. [14] compared a traditional scorecard credit risk model against various machine learning models and found that XGBoost with monotonic constraints outperformed the scorecard model by 7% in the K-S statistic. The proposed model was compared with Random Forest, Neural Networks, GBDT, and XGBoost.

Harris, T. [15] studied the prediction of credit risk using a support vector machine (SVM) algorithm applied to two definitions of default: (i) a broader rule covering payments up to 90 days overdue; (ii) only customers with payments more than 90 days late. The author claims that the model built on the broader definition has a higher accuracy than the other one and that, at the same time, it is a reliable and accurate method for predicting un-creditworthiness compared to the traditional judgment approach.

Zhang, T. et al. [16] propose a new model that uses Multiple Instance Learning (MIL) in the development of a credit scoring model by including not only socio-demographic and loan application data, but also the transaction history data of the applicant. This method makes it possible to extract dynamic features from transactional information, and the results showed that all the classifiers applied to the newly added data had a significant increase in accuracy compared with not considering transactional data.

Papouskova, M. et al. [17] introduce a novel two-stage credit risk model: the first stage consists of a model that predicts the probability of default through ensemble classifiers that discriminate between good and bad payers; the second stage makes an in-depth analysis of customers with a predicted probability of default, and a regression ensemble is applied to determine the loan default. The two models are then combined to predict the expected loss. The researchers claim that this method outperforms other state-of-the-art models used to predict the loss and exposure at default.

Considerable research has been performed to identify efficient algorithms for predicting the potential default of a loan and for deciding whether a loan can be approved. However, most of this research considered a limited set of parameters, mostly the credit score of the applicant. In this study, additional parameters are considered to predict the probability of default and thus help decide whether to approve a small business loan application: default percentage by industry, the performance of loans backed by real estate, default percentage of businesses in urban and rural areas, whether the business has a revolving line of credit, default percentage by state, and the time difference between the loan approval date and the disbursement date.
Certain key variables in the dataset need to be prepared before training. In this work, some of the key factors are cleansed, transformed, and enhanced to get effective output from the algorithm. The NAICS (North American Industry Classification System) code, the code for industrial classification [23], is used to visualize how various industries performed in terms of loan repayment. The first two digits of the code indicate the industry the borrower belongs to. While the accuracy of the algorithm can vary based on the factors used, using the right parameters makes a big difference in the actual prediction. A rich dataset is used, which has valuable information collected since 1984. This dataset has small business loan information including the age of the business, the loan amount, whether the loan defaulted or not, the industrial sector, and the region. There are other major explanatory factors that we consider in this study:
• Credit score
• Income
• Employment history
• Debt-to-income ratio
• Loan amount
• Payment history

Table 1: Industry default rates by first two digits of the NAICS code
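The per-industry default rates summarized in Table 1 can be computed by grouping loans on the first two digits of the NAICS code. The sketch below illustrates the idea with pandas; the column names (`NAICS`, `Defaulted`) and the sample rows are hypothetical and are not taken from the SBA data.

```python
import pandas as pd

# Hypothetical toy records; the NAICS values and default flags are made up.
loans = pd.DataFrame({
    "NAICS": ["722110", "722211", "541110", "541511", "236115", "0"],
    "Defaulted": [1, 0, 0, 0, 1, 0],
})

# The first two digits of a NAICS code identify the sector.
# Codes shorter than two characters (here "0") are treated as missing.
has_sector = loans["NAICS"].str.len() >= 2
loans["Sector"] = loans["NAICS"].str[:2].where(has_sector)

# Default rate per sector as a percentage, highest first.
# Rows with a missing sector are excluded by groupby's default behaviour.
sector_default_rate = (
    loans.groupby("Sector")["Defaulted"]
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
```

Treating malformed codes as missing rather than assigning them to a sector is a design choice of this sketch; how the original study handled such records is not stated.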
VII. ARCHITECTURE DIAGRAM

The architecture of the loan prediction model is shown in Figure 2. It consists of the following main blocks.

that can be effectively used for modeling. The data needs to be converted into a useful format because it may contain irrelevant or missing information and noisy data. To deal with this problem, a data cleaning technique has been used.
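The text does not specify which cleaning steps were applied, so the following is only a sketch of common ones (removing duplicates, dropping unlabeled rows, normalizing noisy labels, and imputing missing numbers) using pandas. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical raw records with typical problems: a duplicate row,
# a missing label, a missing numeric value, and inconsistent categories.
raw = pd.DataFrame({
    "LoanAmount": [50000.0, 50000.0, None, 120000.0, 75000.0],
    "UrbanRural": ["urban", "urban", "Rural ", "URBAN", None],
    "Defaulted": [0, 0, 1, None, 1],
})

clean = raw.drop_duplicates()                      # remove exact duplicate rows
clean = clean.dropna(subset=["Defaulted"]).copy()  # unlabeled rows cannot be used for training
clean["Defaulted"] = clean["Defaulted"].astype(int)

# Normalize noisy category labels and mark missing ones explicitly.
clean["UrbanRural"] = (
    clean["UrbanRural"].str.strip().str.lower().fillna("unknown")
)

# Impute missing numeric values with the column median.
clean["LoanAmount"] = clean["LoanAmount"].fillna(clean["LoanAmount"].median())
clean = clean.reset_index(drop=True)
```

In a real workflow the median would normally be computed on the training split only, so that no information from the validation data leaks into the model.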
# Load necessary libraries and the dataset
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Load the dataset
dataset = pd.read_csv('SBAnational.csv')

# Preprocessing the dataset
## Check for missing values and impute them
## Convert categorical variables to numerical
## Split the dataset into training and validation sets

# Initialize a logistic regression model with 'random_state' parameter set to 2
log_reg = LogisticRegression(random_state=2)

# Train the model on the training set using the 'fit' method
log_reg.fit(X_train, y_train)

# Make predictions on the validation set using the 'predict' method
y_pred = log_reg.predict(X_val)

# Evaluate the model performance
accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)
cm = confusion_matrix(y_val, y_pred)

# Hyperparameter tuning techniques
## Grid search
## Random search

# Deploy the model on new data to make predictions
## Load the new data
## Preprocess the new data
## Make predictions using the trained model

8.2 Algorithm details:

# Train the model and make predictions
%mprun log_reg.fit(X_train, y_train)
%memit y_logpred = log_reg.predict(X_val)

# predict probabilities on validation set
y_pred_proba = log_reg.predict_proba(X_val)[:,1]

Step 4: Measuring Accuracy and ROC
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

# Following commands are to generate ROC graph
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc
log_reg
# predict probabilities on validation set
y_pred_proba = log_reg.predict_proba(X_val)[:,1]

# generate ROC curve
fpr, tpr, thresholds = roc_curve(y_val, y_pred_proba)

# calculate AUC score
roc_auc = auc(fpr, tpr)

# plot ROC curve
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='grey', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()

Step 5:
import sklearn.metrics as metrics
fpr, tpr, threshold = metrics.roc_curve(y_test, pred)
roc_auc = metrics.auc(fpr, tpr)
plt.title("Logistic Regression")
plt.plot(fpr, tpr, 'b', label='auc = %0.2f' % roc_auc)
plt.legend(loc='lower right')
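The listing leaves the preprocessing and hyperparameter tuning steps as comments. One possible way to fill them in with scikit-learn is sketched below. This is not the author's implementation: the data is synthetic, the feature names (`amount`, `term`, `sector`) are invented, and the grid over `C` is an arbitrary choice for illustration. Only the logistic regression with `random_state=2` and the 75/25 split mirror the text.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; real features would come from the SBA dataset.
rng = np.random.default_rng(2)
n = 400
X = pd.DataFrame({
    "amount": rng.normal(100.0, 30.0, n),
    "term": rng.integers(12, 240, n).astype(float),
    "sector": rng.choice(["23", "54", "72"], n),
})
# Make the label depend on the features so there is something to learn.
y = ((X["amount"] > 100) & (X["term"] < 120)).astype(int)
X.loc[::25, "amount"] = np.nan  # inject some missing values

# Impute and scale numeric columns; one-hot encode the categorical column.
numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
preprocess = ColumnTransformer([
    ("num", numeric, ["amount", "term"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sector"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(random_state=2, max_iter=1000)),
])

# 75/25 split, then a small grid search over the regularization strength C.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=2)
search = GridSearchCV(model, {"clf__C": [0.1, 1.0, 10.0]}, cv=3, scoring="f1")
search.fit(X_train, y_train)

val_f1 = f1_score(y_val, search.predict(X_val))
```

Wrapping the preprocessing in a `Pipeline` means the imputer and scaler are refit inside each cross-validation fold, which avoids leaking validation statistics into training. A random search could be substituted by swapping `GridSearchCV` for `RandomizedSearchCV`.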
9.1.1 Accuracy:

Which can be interpreted as:
• 89,210 loans were correctly predicted to not default (true negatives).
• 592 loans were incorrectly predicted to default (false positives).
• 1,099 loans were incorrectly predicted to not default (false negatives).
• 23,782 loans were correctly predicted to default (true positives).

9.1.4 F1 score:

Precision and recall are combined to produce an F1-score.

F1 score = 2 * P * R / (P + R)

where P is the precision and R is the recall. The F1 score is a measure of the overall quality of the predictions made by the algorithm, considering both precision and recall. In this work on SBA loan default prediction, the F1 score of the algorithm is calculated on the test dataset, which is 25% of the total data.
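The four confusion-matrix counts reported in this section are enough to recompute the headline metrics. The short sketch below does so in plain Python using the standard definitions of accuracy, precision, and recall, together with the F1 formula from the text; the variable names are my own.

```python
# Confusion-matrix counts as reported in the text.
tn, fp, fn, tp = 89_210, 592, 1_099, 23_782

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # P: share of predicted defaults that did default
recall = tp / (tp + fn)      # R: share of actual defaults that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

With these counts the accuracy comes out at roughly 98.5%, which agrees with the accuracy figure quoted elsewhere in the paper.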
The output of %memit shows the peak memory usage and the memory increment of the statement. The peak memory usage is the maximum amount of memory used during the execution of the statement, and the memory increment is the change in memory usage between the beginning and end of the statement [26]. In the output, the peak memory usage for the model initialization statement is 1442.16 MiB, and the increment is 0.02 MiB. This means that memory usage reached a maximum of 1442.16 MiB while the statement ran and that the statement itself caused a negligible increase in memory usage compared to the beginning of the statement.
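To illustrate the difference between "peak" and "increment" as defined above, the sketch below uses the standard-library `tracemalloc` module. This is a different tool from `%memit`: it only tracks allocations made through Python's allocator rather than the whole process, so its numbers are not comparable to the MiB values reported above. The helper names `measure` and `build_and_discard` are hypothetical.

```python
import tracemalloc


def measure(func):
    """Return (increment_bytes, peak_bytes) of Python allocations while func runs."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    result = func()  # keep the result alive until after the measurement
    after, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del result
    return after - before, peak - before


def build_and_discard():
    # Allocates a large temporary list, then returns only a small summary.
    temp = list(range(200_000))
    return sum(temp)


increment, peak = measure(build_and_discard)
```

Here the peak is large because of the temporary list, while the increment stays small because the list is released before the function returns. A statement with a near-zero increment and a high peak therefore did not necessarily use little memory while running.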
using customers' past data, such as age, income, loan amount, and tenure of work, to determine the factors that have the most impact on the prediction; the work was done using SVM, K-NN, and random forest algorithms. SVM achieved 73.2% accuracy, K-NN 68% accuracy, Random Forest 81% accuracy, and Logistic Regression 77% accuracy. With the proposed model, the tool would generate a default risk score based on the borrower's loan and other financial information. The default risk score would be calculated using the machine learning algorithm developed as part of the project, which has a 98.5% precision/accuracy rate. The proposed model would reduce the SBA's charge-off costs by identifying high-risk loans before they default. This would enable the SBA to take proactive measures to mitigate risk, rather than covering the cost of defaulted loans.

X. CONCLUSION

In this work, loan default prediction has been performed using logistic regression, considering various parameters. The model leverages machine learning and data analysis to predict loan defaults, which is a relatively new and innovative approach to evaluating credit risk. It gives an accuracy rate of 98.5%, which is significantly higher than that of traditional credit scoring models, which typically have accuracy rates between 70% and 80%.

The development of a machine learning algorithm that can predict loan defaults with the highest possible precision and accuracy has the potential to revolutionize small business lending. By integrating this algorithm into the SBA or lender loan evaluation process, lenders can reduce their exposure to risk and potential losses while also streamlining the process and improving their reputation. While there may be some challenges associated with implementing this technology in the real world, the benefits it offers are significant and could ultimately result in a more efficient and effective small business lending industry.

ACKNOWLEDGMENT

I would like to acknowledge the Small Business Administration (SBA) for providing a valuable and rich data source that contributed significantly to the development of the high accuracy algorithm presented in this research paper. The data provided by the SBA enabled me to conduct extensive analyses and develop an effective model for predicting key performance indicators in small businesses.

REFERENCES

1. Wamala, Y. (2021, March 25). Types of loans: What are the differences? ValuePenguin. https://www.valuepenguin.com/loans/types-of-loans. (Accessed 03 July 2021)
2. H. Ince and B. Aktan, "A comparison of data mining techniques for credit scoring in banking: A managerial perspective", Journal of Business Economics and Management, vol. 10, no. 3, pp. 233-240, 2009.
3. Turiel, J. D., & Aste, T. (2019). P2P Loan acceptance and default prediction with Artificial Intelligence. arXiv preprint arXiv:1907.01800.
4. Li, W., Ding, S., Wang, H. et al. Heterogeneous ensemble learning with feature engineering for default prediction in peer-to-peer lending in China. World Wide Web 23, 23–45 (2020).
5. Abedin, M. Z., Guotai, C., Hajek, P., & Zhang, T. (2022). Combining weighted SMOTE with ensemble learning for the class-imbalanced prediction of small business credit risk. Complex & Intelligent Systems, 1-21.
6. Madaan, M., Kumar, A., Keshri, C., Jain, R., & Nagrath, P. (2021). Loan default prediction using decision trees and random forest: A comparative study. In IOP Conference Series: Materials Science and Engineering (Vol. 1022, No. 1, p. 012042). IOP Publishing.
7. Addo, P. M., Guegan, D., & Hassani, B. (2018). Credit risk analysis using machine and deep learning models. Risks, 6(2), 38.
8. Aleksandrova, Y. (2021). Comparing performance of machine learning algorithms for default risk prediction in peer to peer lending. TEM Journal, 10(1), 133-143.
9. Coşer, A., Maer-matei, M. M., & Albu, C. (2019). Predictive models for loan default risk assessment. Economic Computation & Economic Cybernetics Studies & Research, 53(2).
10. L. Lai, "Loan Default Prediction with Machine Learning Techniques," 2020 International Conference on Computer Communication and Network Security (CCNS), Xi'an, China, 2020, pp. 5-9, doi: 10.1109/CCNS50731.2020.00009.
11. A. A. Taha and S. J. Malebary, "An Intelligent Approach to Credit Card Fraud Detection Using an Optimized Light Gradient Boosting Machine," in IEEE Access, vol. 8, pp. 25579-25587, 2020, doi: 10.1109/ACCESS.2020.2971354.
12. Ma, X., Sha, J., Wang, D., Yu, Y., Yang, Q., & Niu, X. (2018). Study on a prediction of P2P network loan default based on the machine learning LightGBM and XGboost algorithms according to different high dimensional data cleaning. Electronic Commerce Research and Applications, 31, 24-39.
14. Wang, W., Lesner, C., Ran, A., Rukonic, M., Xue, J., & Shiu, E. (2020, April). Using small business banking data for explainable credit risk scoring. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 08, pp. 13396-13401).
15. Harris, T. (2013). Quantitative credit risk assessment using support vector machines: Broad versus Narrow default definitions. Expert Systems with Applications, 40(11), 4404-4413.
16. Zhang, T., Zhang, W., Wei, X. U., & Haijing, H. A. O. (2018). Multiple instance learning for credit risk assessment with transaction data. Knowledge-Based Systems, 161, 65-77.
17. Papouskova, M., & Hajek, P. (2019). Two-stage consumer credit risk modelling using heterogeneous ensemble learning. Decision Support Systems, 118, 33-45.
18. U.S. Small Business Administration. (2021b). Small Business Administration loan program performance. U.S. Small Business Administration. Retrieved from: https://www.sba.gov/document/report-small-business-administration-loanprogram-performance
19. Chen, M. (2011). Bankruptcy prediction in firms with statistical and intelligent techniques and a comparison of evolutionary computation approaches. Computers & Mathematics with Applications, 62(12), 4514-4524.
20. Zhu, Y., Zhou, L., Xie, C., Wang, G. J., & Nguyen, T. V. (2019). Forecasting SMEs' credit risk in supply chain finance with an enhanced hybrid ensemble machine learning approach. International Journal of Production Economics, 211, 22-33.
22. Glennon, D. C., & Nigro, P. (2005b). Measuring the default risk of small business loans: A survival analysis approach. Journal of Money, Credit & Banking, 37(5), 923-947. doi:10.1353/mcb.2005.0051
23. Min Li, Amy Mickel, and Stanley Taylor. "Should this loan be approved or denied?": A large dataset with class assignment guidelines. Journal of Statistics Education, 26(1):55–66, 2018b. doi:10.1080/10691898.2018.1434342. URL https://doi.org/10.1080/10691898.2018.1434342.
27. Fabio Sigrist and Christoph Hirnschall. Grabit: Gradient tree-boosted tobit models for default prediction. Journal of Banking & Finance, 102:177–192, 2019.
28. Sheikh, M. A., Goel, A. K., & Kumar, T. (2020, July). An approach for prediction of loan approval using machine learning algorithm. In 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC) (pp. 490-494). IEEE.
29. Dosalwar, S., Kinkar, K., Sannat, R., & Pise, N. (2021). Analysis of loan availability using machine learning techniques. International Journal of Advanced Research in Science, Communication and Technology (IJARSCT), 9(1), 15-20.
30. Khan, A., Bhadola, E., Kumar, A., & Singh, N. (2021). Loan approval prediction model a comparative analysis. Advances and Applications in Mathematical Sciences, 20(3).
31. Wang, H., & Cheng, L. (2021). CatBoost model with synthetic features in application to loan risk assessment of small businesses. arXiv preprint arXiv:2106.07954.
32. P. Tumuluru, L. R. Burra, M. Loukya, S. Bhavana, H. M. H. CSaiBaba and N. Sunanda, "Comparative Analysis of Customer Loan Approval Prediction using Machine Learning Algorithms," 2022 Second International Conference on Artificial Intelligence and Smart Energy (ICAIS), Coimbatore, India, 2022, pp. 349-353, doi: 10.1109/ICAIS53314.2022.9742800.
APPENDIX