
2024 11th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO)

Amity University, Noida, India. Mar 14-15, 2024

Loan Default Prediction Using Machine Learning


2024 11th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO) | 979-8-3503-5035-7/24/$31.00 ©2024 IEEE | DOI: 10.1109/ICRITO61523.2024.10522232

Natasha Robinson
AIIT, Amity University
Noida
rnatasha093@gmail.com

Nidhi Sindhwani
AIIT, Amity University
Noida
nsindhwani@amity.edu

Abstract - The project titled "Loan Default Prediction Using Machine Learning" has been developed with the aim of enhancing the evaluation of credit risk in financial institutions. Traditional credit scoring models often have difficulty capturing intricate financial behaviours, which motivates the use of more advanced machine learning techniques. Using a comprehensive dataset that combines borrower information with historical financial data, this project employs algorithms such as Random Forest and Gradient Boosting. The dataset is prepared through preprocessing and feature engineering, and the performance of the models is evaluated using metrics such as accuracy and precision. The selected models are further refined through hyperparameter tuning to improve their predictive capabilities. The project aims to provide financial institutions with a reliable tool for predicting the risks associated with loan defaults. Once validated, the final model can be integrated into existing loan processing systems, helping lenders make better-informed decisions. By contributing to the advancement of credit risk assessment methodologies through machine learning, this project is intended to change the way financial institutions manage their loan portfolios, offering improved accuracy and efficiency in the identification of potential loan defaults.

Keywords - Loan Defaults, financial data, credit scores, machine learning, credit risk.

I. INTRODUCTION

In the ever-changing realm of financial institutions, accurately predicting the risks associated with loan defaults plays a crucial role in ensuring the stability and longevity of lending practices. Conventional models for evaluating credit scores often struggle with the intricacies of borrowers' financial behaviours, necessitating a shift towards more advanced and adaptable approaches.

This project emerges as a strategic response to this challenge, with the aim of harnessing the capabilities of state-of-the-art machine learning techniques. It centres on leveraging a diverse range of characteristics, including borrower information, historical financial data, and credit history, to construct a robust predictive model. By incorporating algorithms like Random Forest, Gradient Boosting, and Neural Networks, we aspire to overcome the limitations of traditional models and enhance the accuracy of credit risk assessment. The significance of this endeavour lies in its potential to reshape credit risk management within financial institutions.

Through meticulous data preprocessing, feature engineering, and model optimization, our project endeavours to provide lenders with a dependable and efficient tool for identifying and managing loan default risks.

The subsequent sections describe the methodology, the evaluation metrics, and the broader implications of integrating machine learning into the crucial domain of credit risk assessment, which could bring positive changes to the current system for judging loan credibility. Such a model would not only be a crucial financial tool for lenders but would also make the system more transparent to the public.

II. OBJECTIVES

A. Make a Predictive Model for Loan Defaults:
Create a machine learning model that can identify likely loan defaults efficiently and with high accuracy.

B. Providing a Complete Analysis of the Outcome:
Analyse how the result was calculated and which factors were primarily involved in the final outcome.

C. Providing a Report to the Borrower:
If the result is negative, provide the borrower with a detailed report on the possible reasons the loan was predicted to default, along with possible ways to improve the chances of acquiring the loan.

D. Advise on the Amount of Loan That Could Be Acquired (if result is negative):
If the outcome is negative, give an estimate of the amount the borrower is eligible for and could acquire without defaulting.

E. Reducing the Time Constraint:
Minimize the time taken to approve a loan by evaluating the applicant's profile and eligibility with high accuracy using a predictive machine learning algorithm.

F. Transparency of the Work:
Transparency of the activities performed in financial institutions is a major problem for both the institution and the loan applicants; providing a proper report would solve this problem to some extent.

III. LITERATURE REVIEW

The scholarly discourse on Loan Default Prediction utilizing machine learning underscores a noteworthy shift towards refining methodologies for assessing credit risk. Scholars emphasize the integration of diverse datasets, including unconventional attributes such as social and online behaviour, to augment predictive models, recognizing the dynamic nature of borrowers' financial activities.

The selection of algorithms is a central focus, with studies highlighting the efficacy of ensemble methodologies such as Random Forest and Gradient Boosting.



Authorized licensed use limited to: DELHI TECHNICAL UNIV. Downloaded on July 09,2024 at 14:37:16 UTC from IEEE Xplore. Restrictions apply.
These algorithms excel in capturing intricate relationships within data, which is pivotal for credit risk assessment. Moreover, there is an emphasis on hyperparameter tuning to optimize model parameters, thereby contributing to enhanced predictive accuracy and generalizability.

By integrating diverse attributes, leveraging advanced algorithms, and employing rigorous validation processes, scholars aim to equip financial institutions with more precise and adaptable tools for predicting and managing the risks associated with loan defaults.
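
The combination that these studies emphasize (ensemble models, hyperparameter tuning, and validation on held-out folds) can be illustrated with a short sketch. It assumes scikit-learn is available and uses a synthetic dataset from `make_classification` as a stand-in for real loan data; the parameter grids and the names `candidates` and `tune_all` are hypothetical choices made for this illustration and are not taken from any of the surveyed papers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for a loan dataset (assumption: not real data).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

# Hypothetical, deliberately small search grids for the two ensemble models.
candidates = {
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [25, 50], "max_depth": [3, None]},
    ),
    "gradient_boosting": (
        GradientBoostingClassifier(random_state=0),
        {"n_estimators": [25, 50], "learning_rate": [0.05, 0.1]},
    ),
}


def tune_all(X, y, cv=3):
    """Grid-search every candidate with k-fold cross-validation.

    Returns {model name: (best parameters, mean cross-validated F1)}.
    """
    results = {}
    for name, (model, grid) in candidates.items():
        search = GridSearchCV(model, grid, cv=cv, scoring="f1")
        search.fit(X, y)
        results[name] = (search.best_params_, search.best_score_)
    return results


results = tune_all(X, y)
for name, (params, score) in results.items():
    print(f"{name}: best params {params}, mean CV F1 {score:.3f}")
```

F1 is used as the selection score here because the synthetic classes are imbalanced; with real loan data the appropriate score would depend on the relative cost of missed defaults and rejected good applicants.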
TABLE I. A BRIEF SUMMARY OF PRIOR RESEARCH, INCLUDING THE METHODS USED, THE RESULTS OBTAINED, THE INSIGHTS GAINED AND THE LIMITATIONS ENCOUNTERED.

| S. no. | Publications | Insights | Method Used | Results | Limitations |
| --- | --- | --- | --- | --- | --- |
| [1] | Loan Default Prediction Using Machine Learning Techniques | The paper discusses the prediction of loan defaulters using machine learning techniques such as KNN and Decision tree models. | Data Collection and Data Cleaning | KNN model performs better than Decision tree model. Loan default prediction techniques are effective. | - |
| [2] | Loan Risk Prediction based on Random Forest Model | The paper uses a random forest classification model to predict the possibility of loan default with an accuracy of 85.62%. | Random forest classification model; Linear regression (mentioned but not used). | Test accuracy: 85.62%; F1-score: 85.48% | - |
| [3] | Improving Credit Risk Assessment through Deep Learning-based Consumer Loan Default Prediction Model | The paper discusses the use of deep learning and machine learning techniques to develop a consumer loan default prediction model, improving credit risk assessment. | Data mining and machine learning techniques; Deep learning using Keras and TensorFlow | Test set accuracy: 95.2%; Correctly predicted default state: 238 out of 250 respondents. | - |
| [4] | Loan approval prediction using machine learning | The paper discusses the use of logistic regression model for predicting loan defaulters by considering variables such as age, purpose, credit history, credit amount, etc. | Logistic regression; Decision trees, random forests, and gradient boosting. | Logistic regression model produces different results. Model includes personal attributes for predicting loan defaulters. | Model includes variables other than checking account information. Different models produce different results. |
| [5] | Loan Risk Prediction Model based on Random Forest | The paper discusses the use of a random forest classification model to predict the possibility of loan default, with an accuracy of 85.62%. | Random forest classification model; Machine learning algorithms | Test accuracy: 85.62%; F1-score: 85.48% | - |
| [6] | Loan Default Prediction: A Complete Revision | The paper discusses using the Random Forest algorithm to predict loan default based on influential variables. It confirms the algorithm's capacity for binary classification problems. | Data preprocessing including elimination of variables with missing values and encoding categorical variables as dummies. Random Forest algorithm used for binary classification to predict default probabilities. | The random forest algorithm yields robust results for class prediction. The F1-Macro Score achieves 90% accuracy for the evaluation sample. | The resulting model is data-driven and may not be applicable to all P2P platforms. The study focuses on LendingClub and may not be generalizable to other platforms. |
| [7] | Loan Prediction Using Machine Learning Methods | Loan prediction using machine learning algorithms such as decision tree, random forest, and logistic regression. It evaluates the accuracy of these methods and concludes that logistic regression is the most accurate with 86.4% accuracy. | Decision tree; Random forest; Logistic regression | Decision tree; Random forest; Logistic regression | - |
| [8] | An active learning application on loan default prediction: based on forest classifier model | Active learning application for loan default prediction using a random forest classifier model with high accuracy and improved efficiency. | Random forest classifier model; Active learning strategy | Random forest classifier achieves 93.2% accuracy in loan default prediction. Active learning strategy improves efficiency by 7% with 50 labelled data. | - |
| [9] | Multi-view GCN for Loan Default Risk Prediction | The paper proposes a multi-view graph convolution network (MGCN) for loan default risk prediction. | Multi-view loan application graphs (MLAGs) construction; Multi-view graph convolution network (MGCN) for loan default risk prediction | MGCN outperforms conventional and deep learning models. Results are shown on three public datasets. | - |
| [10] | A Bayesian deep learning method based on loan default rate detection | The paper discusses the use of a Bayesian deep learning model for predicting loan default probability, achieving over 96% accuracy. | Bayesian deep learning model; Data pre-processing | The model has more than 96% accuracy. The model has higher performance compared to popular classification models. | Missing values, incomplete categorical data, and irrelevant features require data pre-processing. |
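
Most results in Table I are reported as accuracy and F1-score. As a reminder of how such figures relate to raw prediction counts, the following is a minimal sketch using the standard definitions. The first call only re-derives the 95.2% accuracy that row [3] reports for 238 correct predictions out of 250. The confusion-matrix counts in the second call are hypothetical and are not taken from any cited study, and the function names are illustrative.

```python
def accuracy(correct, total):
    """Fraction of all predictions that were correct."""
    return correct / total


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the positive (default) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Row [3] of Table I: 238 of 250 respondents classified correctly.
acc_row3 = accuracy(238, 250)
print(f"accuracy = {acc_row3:.1%}")

# Hypothetical counts (assumption, not from any cited paper):
# 40 true positives, 10 false positives, 20 false negatives.
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(f"precision = {p:.3f}, recall = {r:.3f}, F1 = {f1:.3f}")
```

Accuracy alone can look high on imbalanced loan data even when few defaults are caught, which is why several of the studies above report F1 alongside it.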

IV. PROPOSED METHOD

The methodology employed for predicting loan defaults with machine learning is a comprehensive and systematic process aimed at developing a precise and robust predictive model.

Fig. 1. The initial steps involving data collection, exploratory data analysis and feature engineering.

This process can be divided into several key stages:

A. Data Preprocessing:
This initial stage involves the application of various techniques to ensure the quality and integrity of the data. These techniques include cleaning the data to address missing values, outliers, and inconsistencies. Furthermore, normalization and scaling are carried out so that all variables contribute on a comparable scale and no attribute exerts undue influence on the model.

B. Feature Engineering:
The next stage is the careful selection and creation of features that will be used in the predictive model. Relevant features are selected based on domain knowledge and data analysis, with the goal of eliminating noise and focusing on the most informative variables. Additionally, new features may be created to capture complex relationships or patterns within the data.

Fig. 2. Last few important steps of the project involving predictions of new data, result interpretation and documenting the outcome.

C. Model Selection:
In this stage, a suitable machine learning algorithm is chosen to develop the predictive model. Various algorithms are considered, each with its own strengths and weaknesses. Ensemble methods such as Random Forests and Gradient Boosting are often preferred due to their ability to handle non-linearity and capture intricate relationships within the data.

D. Hyperparameter Tuning:
Hyperparameter tuning is typically done through techniques such as grid search or randomized search, which involve systematically exploring different combinations of hyperparameters to find the optimal configuration. Additionally, cross-validation is employed to assess the model's performance across multiple subsets of the dataset.

E. Model Training:
Once the optimal hyperparameters have been determined, the selected model undergoes iterative training on the historical dataset. During training, the model's parameters are adjusted to minimize the difference between predicted and actual outcomes, thereby improving its predictive accuracy.

F. Evaluation Metrics:
Accuracy is used to measure the overall correctness of the model's predictions. Additionally, precision, recall, and the F1 score provide insights into the model's ability to correctly identify positive cases. Furthermore, the area under the ROC curve (AUC-ROC) is utilized to assess the model's ability to discriminate between default and non-default cases.

Fig. 3. The next crucial steps are selecting the most promising model and evaluating the outcome.

G. Validation:
In order to ensure the reliability and generalizability of the predictive model, it is subjected to validation using separate datasets that were not used during the training process. This out-of-sample testing checks that the model can effectively generalize to new and unseen data.

H. Deployment and Integration:
Once the final model has been developed and validated, it can be deployed into operational systems. This involves integrating the model with existing loan processing systems to ensure a smooth and efficient workflow. Additionally, continuous monitoring and periodic model updates may be implemented to adapt to evolving patterns in borrower behaviour and maintain the predictive accuracy of the model over time.

By diligently following these methodologies, the Loan Default Prediction model is designed to be highly accurate, reliable, and adaptable to the dynamic nature of financial data and borrower behaviour.

V. RESULTS AND DISCUSSIONS

In this loan repayment prediction project, the primary objective is to leverage a Random Forest Classifier to assess the likelihood of loan repayment based on various features. The dataset encompasses crucial information such as loan terms, interest rates, annual income, loan durations, verification statuses, and loan purposes.

The dataset is loaded from a CSV file with a focus on addressing missing values, specifically targeting the 'total_acc' column, where missing entries are imputed using
the median value. Additionally, categorical variables like 'term', 'verification_status', and 'purpose' undergo numerical encoding through Label Encoding, while numerical values are extracted from pertinent columns such as 'term', 'total_acc', and 'duration_of_loan'.

Following data preprocessing, the dataset is split into training and testing sets, with 80% of the data reserved for training the model and 20% for evaluating its performance.

The Random Forest Classifier, configured with 100 estimators and a fixed random state of 42, is then trained on the pre-processed data. Model evaluation involves making predictions on the test set and analysing performance metrics such as accuracy, a confusion matrix, and a classification report. To gain deeper insights into the model's behaviour, visualizations are incorporated.

Fig. 4. This graph represents the actual, predicted and the F1-score of the model.

Fig. 5. This graph represents the important features that were selected during the Random Forest model and filtered according to their importance.

A heatmap of the confusion matrix illustrates the model's ability to correctly classify instances of loan repayment, offering valuable information on true positives, true negatives, false positives, and false negatives.

Additionally, a bar chart portraying feature importance aids in understanding which variables significantly influence the model's predictions. In conclusion, this project successfully explores the application of a Random Forest Classifier for loan repayment prediction, providing valuable insights into the model's performance and the relative importance of various features.

The success of the model hinges on meticulous data preprocessing and thoughtful consideration of the dataset's unique characteristics.

Future iterations of the project may involve further fine-tuning of hyperparameters, exploration of alternative models for comparison, and more in-depth feature engineering to enhance predictive capabilities.

TABLE II. A BRIEF COMPARISON OF THIS WORK WITH OTHER RESEARCH THAT HAS BEEN PERFORMED ON THIS TOPIC

| S. No. | Parameters | Comparison with Other Research |
| --- | --- | --- |
| 1 | Data Processing | Handling missing values and outliers for cleaner data. |
| 2 | Model Selection | Use of Random Forest Algorithm rather than a traditional model like logistic regression. |
| 3 | Evaluation Metrics | Use of confusion matrix and classification report. |
| 4 | Feature Importance | Extensive feature engineering and evaluation for better understanding of the prominent features that contribute. |
| 5 | Explainability | Proper explanation of the outcome for better understandability to both the financial institute and the loan applicants. |

VI. CONCLUSION

The Random Forest Classifier implemented for the purpose of predicting loan repayment demonstrates promising performance and offers a strong framework for evaluating the likelihood of borrowers repaying their loans. The model utilizes a combination of important features, such as loan terms, interest rates, annual income, and verification status, to generate accurate predictions. By carefully preprocessing the data, managing missing values, and encoding categorical variables, the model proves its ability to effectively learn from the given dataset.

The evaluation metrics, which include accuracy, a confusion matrix, and a classification report, collectively provide evidence of the model's informed predictions. The visualizations, notably the confusion matrix heatmap and the feature importance bar chart, contribute to a comprehensive understanding of the model's behaviour and highlight key factors that influence its decision-making.

As a tool for financial institutions and lenders, this model has the potential to improve the loan approval decision-making process by identifying applicants who pose a high risk. The success of this project emphasizes the importance of careful feature engineering and the interpretability of the model, which opens up possibilities for further refinements and potential applications in real-world lending scenarios. Future iterations could explore additional models, hyperparameter tuning, and expanded feature engineering to continuously enhance the accuracy and robustness of the predictions.

VII. FUTURE SCOPE

The expansive future potential of the loan repayment prediction model lies in its ability to enhance predictive capabilities and contribute to the evolving financial decision-making landscape. By further refining the model through advanced feature engineering, incorporating alternative machine learning algorithms, and hyperparameter tuning, its accuracy and versatility can be elevated. To adapt to changing economic conditions and borrower behaviours,
integration of real-time data streams and continuous model training would enable adaptive responses. Furthermore, exploring interpretability techniques and model explainability tools can foster greater trust and transparency in the decision-making process for stakeholders. The model's applicability extends beyond the binary prediction framework, with potential adaptations for risk stratification, dynamic interest rate determination, and personalized financial counselling. By embracing innovations in fintech and artificial intelligence, this model has the potential to play a pivotal role in fostering more informed and equitable lending practices in the ever-evolving financial landscape.

REFERENCES

[1] Loan Default Prediction Using Machine Learning Techniques. Indian Scientific Journal of Research in Engineering and Management, (2023). doi: 10.55041/ijsrem24519
[2] Hongyun, Jin. Loan Risk Prediction based on Random Forest Model. (2023). doi: 10.21203/rs.3.rs-3094217/v1
[3] Muhamad, Abdul, Aziz, Muhamad, Saleh, Jumaa., Mohammed, Saqib. Improving Credit Risk Assessment through Deep Learning-based Consumer Loan Default Prediction Model. International Journal of Finance & Banking Studies, (2023). doi: 10.20525/ijfbs.v12i1.2579
[4] Yash, Diwate., Prashant, Singh, Rana., Pratik, A., Chavan. Loan approval prediction using machine learning. International Research Journal of Modernization in Engineering Technology and Science, (2023). doi: 10.56726/irjmets39658
[5] Xiang, Zhang. Loan Risk Prediction Model based on Random Forest. Advances in Economics, Management and Political Sciences, (2023). doi: 10.54254/2754-1169/5/20220082
[6] José, Antonio, Núñez, Mora., Pilar, Madrazo-Lemarroy. Loan Default Prediction: A Complete Revision of LendingClub. Revista mexicana de economía y finanzas, (2023). doi: 10.21919/remef.v18i3.886
[7] Simiao, Wang. Loan Prediction Using Machine Learning Methods. Advances in Economics, Management and Political Sciences, (2023). doi: 10.54254/2754-1169/5/20220081
[8] Shasha, Liu., Ming, Shan, Guan., Yang, Li., Menglu, Wang., HuiMin, Zhu. A Bayesian deep learning method based on loan default rate detection. (2023). doi: 10.1117/12.2678879
[9] Ebenezer, Owusu., Richard, Quainoo., Solomon, Kuuku, Mensah., Justice, Kwame, Appati. A Deep Learning Approach for Loan Default Prediction Using Imbalanced Dataset. International Journal of Intelligent Information Technologies, (2023). doi: 10.4018/ijiit.318672
[10] Jovanne, C., Alejandrino., Jovito, Jr., P., Bolacoy., John, Vianne, Murcia. Supervised and unsupervised data mining approaches in loan default prediction. International Journal of Electrical and Computer Engineering, (2023). doi: 10.11591/ijece.v13i2.pp1837-1847
[11] Zixuan, Zhang. Credit Card Default Prediction based on Machine Learning Techniques. BCP business & management, (2023). doi: 10.54691/bcpbm.v44i.4954
[12] Jiaqi, Fan. Predicting of Credit Default by SVM and Decision Tree Model Based on Credit Card Data. BCP business & management, (2023). doi: 10.54691/bcpbm.v38i.3666
[13] Loan Default Prediction Based on Machine Learning Methods. (2023). doi: 10.4108/eai.2-12-2022.2328740
