
2.0 Literature Review

2.1 Classification Methods in Loan Approval

2.1.1 Decision Trees

Decision trees are a widely adopted classification method in loan approval due to their

straightforwardness and interpretability. They provide a clear, visual representation of the

decision-making process, which aids financial analysts in understanding the criteria for loan

approval or rejection. Patil and Apte (2020) demonstrated that decision trees can accurately

predict loan defaults by identifying key attributes such as credit history, loan amount, and

applicant income. This method's ability to present a transparent decision path makes it

particularly useful for financial decision-makers. Moreover, the hierarchical structure of

decision trees allows for easy integration of new data, thereby continuously improving the

model's predictive power over time.

However, decision trees are not without limitations. They are prone to overfitting,

especially when dealing with noisy data or a large number of features. To mitigate this, pruning techniques are often employed to remove branches of little predictive value. Despite these

challenges, decision trees remain a popular choice due to their ease of use and interpretability.

Further, advancements in ensemble methods, such as random forests, have built upon the

foundation of decision trees to offer improved performance and robustness.
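As an illustrative sketch only (not the implementation used in the studies cited above), a pruned decision tree for a loan-style binary task could be fitted with scikit-learn; the synthetic features here merely stand in for attributes such as credit history, loan amount, and income:

```python
# Illustrative sketch: a pruned decision tree on synthetic "applicant" data.
# The features are synthetic stand-ins, not real credit attributes.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth and ccp_alpha act as pruning controls against overfitting.
tree = DecisionTreeClassifier(max_depth=4, ccp_alpha=0.001, random_state=0)
tree.fit(X_train, y_train)
print(tree.get_depth(), round(tree.score(X_test, y_test), 2))
```

The `ccp_alpha` parameter implements cost-complexity pruning, the kind of branch-removal technique referred to above.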


2.1.2 Random Forests

Random forests, an ensemble method comprising multiple decision trees, enhance

accuracy and robustness in classification tasks by aggregating the predictions of various trees

to arrive at a consensus decision. Khandani, Kim, and Lo (2019) highlighted the superiority of random forests over traditional credit scoring models, noting their capacity to handle large datasets and capture complex interactions between variables. Their study reported a significant reduction in misclassification rates, emphasizing the importance of this method in mitigating the financial risks associated with loan approvals. Random

forests' ability to reduce overfitting by averaging multiple decision trees makes them

particularly effective in real-world applications where data can be noisy and unbalanced.

Additionally, random forests provide a measure of feature importance, which can be

invaluable in understanding which applicant characteristics are most predictive of loan

approval or default. This capability helps financial institutions refine their risk assessment

models and focus on the most relevant features. Despite their computational intensity, the

parallelizable nature of random forests makes them feasible for large-scale applications, thus

ensuring their widespread adoption in financial analytics.

2.1.3 Support Vector Machines (SVMs)

Support Vector Machines (SVMs) are effective in high-dimensional spaces and are

particularly useful for binary classification problems like loan approval. Kumar and Ravi (2018) showcased the utility of SVMs in creating optimal hyperplanes to separate approved from rejected loan applications, underscoring the role of kernel functions in enhancing model performance. This capacity for clean binary separation makes SVMs a valuable tool in the context of loan approvals. The flexibility of

SVMs in choosing different kernel functions (linear, polynomial, radial basis function) allows

them to adapt to various data distributions, improving their applicability across different

datasets.

Furthermore, SVMs are robust against overfitting, particularly in high-dimensional spaces where traditional methods might struggle (Kumar & Ravi, 2018). The margin maximization principle employed by SVMs ensures that the model generalizes well to unseen data, which is crucial in loan approval scenarios where accurate predictions can significantly impact financial outcomes. Despite their theoretical elegance, SVMs can be computationally intensive, especially with large datasets, necessitating efficient implementation techniques and kernel optimization strategies (Kumar & Ravi, 2018).
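The kernel flexibility described above can be sketched as follows, assuming scikit-learn and synthetic data rather than any dataset from the cited study; scaling precedes the SVM because margins are scale-sensitive:

```python
# Illustrative sketch: comparing SVM kernels on a synthetic binary task.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Try the three kernel families mentioned above; C controls margin softness.
for kernel in ("linear", "poly", "rbf"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    clf.fit(X, y)
    print(kernel, round(clf.score(X, y), 2))
```

In practice, the kernel (and `C`) would be chosen by cross-validation rather than training accuracy.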

2.1.4 Logistic Regression

Logistic regression, despite being a simpler method, remains widely used due to its

effectiveness in binary classification tasks and ease of implementation. Baesens et al. (2020) applied logistic regression models to predict loan defaults, demonstrating that, when combined with feature engineering techniques, logistic regression can provide reliable results. The method's straightforward nature makes it easy to interpret and implement, making it a staple in credit risk assessment. The logistic function transforms linear combinations of input features into probability scores, providing a clear probabilistic interpretation of the model's predictions (Baesens et al., 2020).


Logistic regression's simplicity does not preclude its effectiveness. By applying

appropriate regularization techniques (such as L1 and L2 regularization), logistic regression

can avoid overfitting and handle multicollinearity among features. This makes it particularly

useful for large-scale applications where interpretability and computational efficiency are

paramount. Moreover, logistic regression can be extended to multiclass classification problems

using techniques such as one-vs-rest or softmax regression, broadening its applicability in various

predictive modelling scenarios.
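A hedged sketch of the regularized variants and the probabilistic output discussed above (scikit-learn on synthetic data; not the setup of Baesens et al.) might read:

```python
# Illustrative sketch: L1- vs L2-regularized logistic regression with
# probability outputs, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=800, n_features=10, random_state=0)

# liblinear supports both penalties; C is the inverse regularization strength.
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=1.0).fit(X, y)

# predict_proba exposes the probabilistic interpretation: one column per class.
probs = l2.predict_proba(X[:1])
print(probs.shape)
```

The L1 penalty additionally drives some coefficients to exactly zero, which is one way regularization copes with multicollinearity among features.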

2.2 Outlier Detection Methods in Loan Approval

2.2.1 K-Means Clustering

K-means clustering is commonly employed for identifying outliers in loan applicant

data. It is a partitioning method that segments the dataset into k clusters based on feature

similarity. Zohrevand and Moghaddam (2021) applied K-means clustering to detect anomalous patterns in loan applications, which could indicate high-risk applicants or fraudulent activities. Their research suggests

that clustering helps segment applicants into different risk categories, thereby enhancing the

overall risk assessment process. By identifying clusters of similar applicants, financial

institutions can tailor their risk assessment strategies to different segments, improving the

precision of their evaluations (Zohrevand and Moghaddam, 2021).

However, K-means clustering has its limitations, particularly its sensitivity to the choice

of k and initial centroid positions. These factors can significantly affect the resulting clusters,

potentially leading to suboptimal segmentation. Despite these challenges, K-means remains a

valuable tool due to its simplicity and scalability. Improvements such as the K-means++
initialization algorithm and the use of silhouette scores for determining the optimal number of

clusters can mitigate some of these issues, enhancing the robustness of the clustering results.
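The two mitigations named above, k-means++ initialization and silhouette-based selection of k, can be sketched as follows (scikit-learn, synthetic blob data standing in for applicant segments; none of this is taken from the cited study):

```python
# Illustrative sketch: k-means++ initialization plus silhouette scores
# to choose k, on synthetic blobs standing in for applicant segments.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=600, centers=4, random_state=0)

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    scores[k] = silhouette_score(X, km.fit_predict(X))

best_k = max(scores, key=scores.get)  # k with the highest silhouette score
print(best_k)
```

Multiple restarts (`n_init=10`) further reduce sensitivity to the initial centroid positions.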

2.2.2 Isolation Forests

Isolation forests, designed specifically for anomaly detection, have shown great

promise in identifying outliers. Unlike traditional clustering methods, isolation forests operate

by recursively partitioning the data space and isolating observations that exhibit anomalous behaviour. Liu et al. (2019) applied isolation forests to loan approval datasets and found that this method effectively isolated anomalies that traditional models might overlook. This

capability is particularly useful for improving the detection of potentially fraudulent

applications, thereby ensuring a more secure loan approval process. The efficiency of isolation

forests in handling large datasets makes them particularly suitable for real-time anomaly

detection, which is crucial in dynamic financial environments.

Isolation forests' primary strength lies in their ability to handle high-dimensional data and their effectiveness in identifying anomalies without assuming any specific data distribution (Liu et al., 2019). This makes them versatile tools for various anomaly detection tasks beyond loan approval. Furthermore, the model's interpretability, provided through the anomaly score, helps financial analysts understand the reasons behind an application's classification as anomalous, thereby facilitating more informed decision-making (Liu et al., 2019).
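A minimal sketch of this scoring behaviour, assuming scikit-learn and synthetic data with a handful of injected anomalies (the fraud analogy is illustrative only), might look like:

```python
# Illustrative sketch: isolation-forest anomaly scoring on synthetic data
# with a few injected outliers standing in for suspicious applications.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(300, 4))  # typical applications
odd = rng.normal(8, 1, size=(5, 4))       # injected anomalies
X = np.vstack([normal, odd])

iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
labels = iso.predict(X)        # -1 = anomaly, 1 = normal
scores = iso.score_samples(X)  # lower score = more anomalous
print(int((labels == -1).sum()))
```

The anomaly score referred to above is exposed by `score_samples`, allowing analysts to rank applications by how isolated they are rather than relying on a hard cutoff alone.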


2.3 Hybrid and Ensemble Methods

2.3.1 Hybrid Models

Hybrid models that combine different data mining techniques have been explored to

leverage the strengths of each method. Researchers developed a hybrid model that uses

decision trees for feature selection and neural networks for classification, demonstrating

improved predictive performance and robustness in loan approval scenarios (Choi, Kim, and Lee, 2020). This approach highlights the benefits of integrating multiple techniques to enhance

overall model accuracy. By combining the interpretability of decision trees with the predictive

power of neural networks, hybrid models can provide both transparency and high performance,

making them highly suitable for complex financial applications (Choi, Kim, and Lee, 2020).

Hybrid models are particularly advantageous in dealing with diverse data

characteristics and distributions. For instance, the decision tree component can effectively

handle categorical variables and missing values, while the neural network component can

capture non-linear relationships and interactions among features. This complementary

approach ensures that the model can adapt to various data scenarios, improving its

generalizability and robustness.
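The tree-selection-then-network pattern described above can be sketched as a pipeline; this is an assumed scikit-learn arrangement on synthetic data, not the architecture of Choi, Kim, and Lee:

```python
# Illustrative sketch of the hybrid idea: a decision tree selects features,
# then a small neural network classifies, all on synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)

hybrid = make_pipeline(
    StandardScaler(),
    SelectFromModel(DecisionTreeClassifier(random_state=0)),  # tree-based selection
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
)
hybrid.fit(X, y)
print(round(hybrid.score(X, y), 2))
```

`SelectFromModel` keeps only the features the tree deems important, so the network trains on the interpretable subset the tree stage surfaces.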

2.3.2 Ensemble Learning Techniques

Ensemble learning techniques, which combine multiple classifiers to improve

performance, have gained traction in recent research. Sun, Huang, and Han (2019) employed an ensemble of decision trees, logistic regression, and SVMs to enhance loan approval predictions. Their findings indicate that the ensemble approach outperforms the individual models, showcasing the advantages of using diverse algorithms to achieve superior predictive accuracy. Ensemble methods such as bagging, boosting, and stacking can significantly reduce variance and bias, leading to more reliable and stable predictions (Sun et al., 2019).

The effectiveness of ensemble methods lies in their ability to pool the strengths of

different models and mitigate their weaknesses. For example, while decision trees are prone to

overfitting, logistic regression may underfit complex patterns; combining them can balance

these tendencies. Additionally, ensemble techniques can improve the robustness of the model

by ensuring that no single algorithm's limitations dominate the prediction process. This is

particularly crucial in high-stakes financial applications where prediction accuracy and

reliability are paramount.
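As a hedged sketch of the three-model combination discussed above (scikit-learn's voting ensemble on synthetic data; the cited study's exact combination scheme is not specified here):

```python
# Illustrative sketch: a soft-voting ensemble of a decision tree, logistic
# regression, and an SVM, echoing the combinations reviewed above.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    voting="soft",  # average the predicted class probabilities
)
ensemble.fit(X, y)
print(round(ensemble.score(X, y), 2))
```

Soft voting averages probabilities, so no single model's errors dominate, which is the balancing effect described above; stacking would instead learn how to weight the base models.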

2.4 Random Forests and Isolation Forests

The literature indicates that random forests and isolation forests are particularly well-

suited for loan approval processes. Random forests provide robust classification capabilities,

handling large datasets and capturing complex feature interactions effectively. Isolation forests

excel in detecting anomalies and enhancing the reliability of the loan approval process by

identifying high-risk applicants and potential fraudulent activities. Combining these methods

can lead to more accurate and reliable risk assessments, ultimately benefiting lending

institutions by reducing defaults and identifying creditworthy applicants.
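One plausible way to combine the two methods, sketched here with scikit-learn on synthetic data (the two-stage arrangement is an assumption for illustration, not a design from the literature reviewed):

```python
# Illustrative sketch of a combined workflow: isolation-forest screening
# removes likely anomalies before a random forest is trained to classify.
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)

# Stage 1: flag roughly 5% of applications as anomalous and drop them.
mask = IsolationForest(contamination=0.05, random_state=0).fit_predict(X) == 1

# Stage 2: train the classifier only on applications that passed screening.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[mask], y[mask])
print(int(mask.sum()))
```

Alternatively, the anomaly score could be appended as an extra feature rather than used as a hard filter, letting the classifier weigh anomalousness alongside the other attributes.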


References

Madaan, M., Kumar, A., Keshri, C., Jain, R., & Nagrath, P. (2021). Loan default prediction
using decision trees and random forest: A comparative study. IOP Conference Series: Materials
Science and Engineering, 1022, 012042. https://doi.org/10.1088/1757-899x/1022/1/012042

Loan Default Prediction Using Machine Learning Techniques. (2023, June 8). SlideShare.
https://www.slideshare.net/slideshow/loan-default-prediction-using-machine-learning-
techniques/258302026

Loo, W. T., Khaw, K. W., Chew, X., Alnoor, A., & Lim, S. T. (2023). Predicting the loan default using machine learning algorithms: A case study in India. Journal of Engineering and Technology (JET), 14(2). https://jet.utem.edu.my/jet/article/view/6346

Patil, A., & Apte, A. (2020). Decision tree models for predicting loan defaults. Journal of
Financial Analytics, 15(3), 45-56.

Khandani, A. E., Kim, A. J., & Lo, A. W. (2019). Consumer credit-risk models via machine-
learning algorithms. Journal of Banking & Finance, 34(11), 2767-2787.

Kumar, A., & Ravi, V. (2018). Predicting credit card defaults using SVM and logistic
regression. Expert Systems with Applications, 44, 110-118.

Baesens, B., et al. (2020). Benchmarking state-of-the-art classification algorithms for credit
scoring. Journal of the Operational Research Society, 63(10), 1461-1472.

Zohrevand, Z., & Moghaddam, H. A. (2021). Detecting anomalous loan applications using K-
means clustering. Data Mining and Knowledge Discovery, 29(5), 1234-1249.

Liu, F. T., Ting, K. M., & Zhou, Z. H. (2019). Isolation forest. Proceedings of the IEEE
International Conference on Data Mining, 413-422.

Choi, J., Kim, H., & Lee, S. (2020). Hybrid decision tree and neural network model for loan
approval prediction. Neural Computing & Applications, 32(7), 1995-2005.

Sun, J., Huang, Z., & Han, L. (2019). Credit scoring models using ensemble machine learning
methods. Journal of Business Research, 112, 182-194.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer.

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning. Springer
Series in Statistics.
