
2.0 Literature Review

2.1 Classification Methods in Loan Approval

2.1.1 Decision Trees

Decision trees are a widely adopted classification method in loan approval due to their

straightforwardness and interpretability. They provide a clear, visual representation of the

decision-making process, which aids financial analysts in understanding the criteria for loan

approval or rejection. Patil and Apte (2020) demonstrated that decision trees can accurately

predict loan defaults by identifying key attributes such as credit history, loan amount, and

applicant income. This method's ability to present a transparent decision path makes it

particularly useful for financial decision-makers. Moreover, the hierarchical structure of

decision trees allows for easy integration of new data, thereby continuously improving the

model's predictive power over time.

However, decision trees are not without limitations. They are prone to overfitting,

especially when dealing with noisy data or a large number of features. To mitigate this, pruning techniques are often employed to remove branches of little predictive value. Despite these

challenges, decision trees remain a popular choice due to their ease of use and interpretability.

Further, advancements in ensemble methods, such as random forests, have built upon the

foundation of decision trees to offer improved performance and robustness.
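As an illustrative sketch only (not the implementation used in the studies cited above), a pruned decision tree for a loan-style binary task could be fitted with scikit-learn; the synthetic features here merely stand in for attributes such as credit history, loan amount, and income:

```python
# Illustrative sketch: a pruned decision tree on synthetic "applicant" data.
# The features are synthetic stand-ins, not real credit attributes.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth and ccp_alpha act as pruning controls against overfitting.
tree = DecisionTreeClassifier(max_depth=4, ccp_alpha=0.001, random_state=0)
tree.fit(X_train, y_train)
print(tree.get_depth(), round(tree.score(X_test, y_test), 2))
```

The `ccp_alpha` parameter implements cost-complexity pruning, the kind of branch-removal technique referred to above.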


2.1.2 Random Forests

Random forests, an ensemble method comprising multiple decision trees, enhance

accuracy and robustness in classification tasks by aggregating the predictions of various trees

to arrive at a consensus decision. Khandani, Kim, and Lo (2019) highlighted the superiority of random forests over traditional credit scoring models, noting their capacity to handle large datasets and capture complex interactions between variables. Their study reported a significant reduction in misclassification rates, emphasizing the importance of this method in mitigating the financial risks associated with loan approvals. Random

forests' ability to reduce overfitting by averaging multiple decision trees makes them

particularly effective in real-world applications where data can be noisy and unbalanced.

Additionally, random forests provide a measure of feature importance, which can be

invaluable in understanding which applicant characteristics are most predictive of loan

approval or default. This capability helps financial institutions refine their risk assessment

models and focus on the most relevant features. Despite their computational intensity, the

parallelizable nature of random forests makes them feasible for large-scale applications, thus

ensuring their widespread adoption in financial analytics.

2.1.3 Support Vector Machines (SVMs)

Support Vector Machines (SVMs) are effective in high-dimensional spaces and are

particularly useful for binary classification problems like loan approval. Kumar and Ravi (2018) showcased the utility of SVMs in creating optimal hyperplanes to separate approved from rejected loan applications, underscoring the role of kernel functions in enhancing model performance. This capacity for clean binary separation makes SVMs a valuable tool in the context of loan approvals. The flexibility of

SVMs in choosing different kernel functions (linear, polynomial, radial basis function) allows

them to adapt to various data distributions, improving their applicability across different

datasets.

Furthermore, SVMs are robust against overfitting, particularly in high-dimensional spaces where traditional methods might struggle (Kumar & Ravi, 2018). The margin maximization principle employed by SVMs ensures that the model generalizes well to unseen data, which is crucial in loan approval scenarios where accurate predictions can significantly impact financial outcomes. Despite their theoretical elegance, SVMs can be computationally intensive, especially with large datasets, necessitating efficient implementation techniques and kernel optimization strategies (Kumar & Ravi, 2018).
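The kernel flexibility described above can be sketched as follows, assuming scikit-learn and synthetic data rather than any dataset from the cited study; scaling precedes the SVM because margins are scale-sensitive:

```python
# Illustrative sketch: comparing SVM kernels on a synthetic binary task.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Try the three kernel families mentioned above; C controls margin softness.
for kernel in ("linear", "poly", "rbf"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    clf.fit(X, y)
    print(kernel, round(clf.score(X, y), 2))
```

In practice, the kernel (and `C`) would be chosen by cross-validation rather than training accuracy.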

2.1.4 Logistic Regression

Logistic regression, despite being a simpler method, remains widely used due to its

effectiveness in binary classification tasks and ease of implementation. Baesens et al. (2020) applied logistic regression models to predict loan defaults, demonstrating that, when combined with feature engineering techniques, logistic regression can provide reliable results. The method's straightforward nature makes it easy to interpret and implement, making it a staple in credit risk assessment. The logistic function transforms linear combinations of input features into probability scores, providing a clear probabilistic interpretation of the model's predictions (Baesens et al., 2020).


Logistic regression's simplicity does not preclude its effectiveness. By applying

appropriate regularization techniques (such as L1 and L2 regularization), logistic regression

can avoid overfitting and handle multicollinearity among features. This makes it particularly

useful for large-scale applications where interpretability and computational efficiency are

paramount. Moreover, logistic regression can be extended to multiclass classification problems

using techniques such as one-vs-rest or softmax regression, broadening its applicability in various

predictive modelling scenarios.
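A hedged sketch of the regularized variants and the probabilistic output discussed above (scikit-learn on synthetic data; not the setup of Baesens et al.) might read:

```python
# Illustrative sketch: L1- vs L2-regularized logistic regression with
# probability outputs, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=800, n_features=10, random_state=0)

# liblinear supports both penalties; C is the inverse regularization strength.
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=1.0).fit(X, y)

# predict_proba exposes the probabilistic interpretation: one column per class.
probs = l2.predict_proba(X[:1])
print(probs.shape)
```

The L1 penalty additionally drives some coefficients to exactly zero, which is one way regularization copes with multicollinearity among features.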

2.2 Outlier Detection Methods in Loan Approval

2.2.1 K-Means Clustering

K-means clustering is commonly employed for identifying outliers in loan applicant

data. It is a partitioning method that segments the dataset into k clusters based on feature

similarity. Zohrevand and Moghaddam (2021) applied K-means clustering to detect anomalous patterns in loan applications, which could indicate high-risk applicants or fraudulent activities. Their research suggests

that clustering helps segment applicants into different risk categories, thereby enhancing the

overall risk assessment process. By identifying clusters of similar applicants, financial

institutions can tailor their risk assessment strategies to different segments, improving the

precision of their evaluations (Zohrevand and Moghaddam, 2021).

However, K-means clustering has its limitations, particularly its sensitivity to the choice

of k and initial centroid positions. These factors can significantly affect the resulting clusters,

potentially leading to suboptimal segmentation. Despite these challenges, K-means remains a

valuable tool due to its simplicity and scalability. Improvements such as the K-means++
initialization algorithm and the use of silhouette scores for determining the optimal number of

clusters can mitigate some of these issues, enhancing the robustness of the clustering results.
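The two mitigations named above, k-means++ initialization and silhouette-based selection of k, can be sketched as follows (scikit-learn, synthetic blob data standing in for applicant segments; none of this is taken from the cited study):

```python
# Illustrative sketch: k-means++ initialization plus silhouette scores
# to choose k, on synthetic blobs standing in for applicant segments.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=600, centers=4, random_state=0)

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    scores[k] = silhouette_score(X, km.fit_predict(X))

best_k = max(scores, key=scores.get)  # k with the highest silhouette score
print(best_k)
```

Multiple restarts (`n_init=10`) further reduce sensitivity to the initial centroid positions.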

2.2.2 Isolation Forests

Isolation forests, designed specifically for anomaly detection, have shown great

promise in identifying outliers. Unlike traditional clustering methods, isolation forests operate

by recursively partitioning the data space and isolating observations that exhibit anomalous behaviour. Liu et al. (2019) applied isolation forests to loan approval datasets and found that this method effectively isolated anomalies that traditional models might overlook. This

capability is particularly useful for improving the detection of potentially fraudulent

applications, thereby ensuring a more secure loan approval process. The efficiency of isolation

forests in handling large datasets makes them particularly suitable for real-time anomaly

detection, which is crucial in dynamic financial environments.

Isolation forests' primary strength lies in their ability to handle high-dimensional data and their effectiveness in identifying anomalies without assuming any specific data distribution (Liu et al., 2019). This makes them versatile tools for various anomaly detection tasks beyond loan approval. Furthermore, the model's interpretability, provided through the anomaly score, helps financial analysts understand the reasons behind an application's classification as anomalous, thereby facilitating more informed decision-making (Liu et al., 2019).
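A minimal sketch of this scoring behaviour, assuming scikit-learn and synthetic data with a handful of injected anomalies (the fraud analogy is illustrative only), might look like:

```python
# Illustrative sketch: isolation-forest anomaly scoring on synthetic data
# with a few injected outliers standing in for suspicious applications.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(300, 4))  # typical applications
odd = rng.normal(8, 1, size=(5, 4))       # injected anomalies
X = np.vstack([normal, odd])

iso = IsolationForest(contamination=0.02, random_state=0).fit(X)
labels = iso.predict(X)        # -1 = anomaly, 1 = normal
scores = iso.score_samples(X)  # lower score = more anomalous
print(int((labels == -1).sum()))
```

The anomaly score referred to above is exposed by `score_samples`, allowing analysts to rank applications by how isolated they are rather than relying on a hard cutoff alone.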


2.3 Hybrid and Ensemble Methods

2.3.1 Hybrid Models

Hybrid models that combine different data mining techniques have been explored to

leverage the strengths of each method. Researchers developed a hybrid model that uses

decision trees for feature selection and neural networks for classification, demonstrating

improved predictive performance and robustness in loan approval scenarios (Choi, Kim, and Lee, 2020). This approach highlights the benefits of integrating multiple techniques to enhance

overall model accuracy. By combining the interpretability of decision trees with the predictive

power of neural networks, hybrid models can provide both transparency and high performance,

making them highly suitable for complex financial applications (Choi, Kim, and Lee, 2020).

Hybrid models are particularly advantageous in dealing with diverse data

characteristics and distributions. For instance, the decision tree component can effectively

handle categorical variables and missing values, while the neural network component can

capture non-linear relationships and interactions among features. This complementary

approach ensures that the model can adapt to various data scenarios, improving its

generalizability and robustness.
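The tree-selection-then-network pattern described above can be sketched as a pipeline; this is an assumed scikit-learn arrangement on synthetic data, not the architecture of Choi, Kim, and Lee:

```python
# Illustrative sketch of the hybrid idea: a decision tree selects features,
# then a small neural network classifies, all on synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)

hybrid = make_pipeline(
    StandardScaler(),
    SelectFromModel(DecisionTreeClassifier(random_state=0)),  # tree-based selection
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
)
hybrid.fit(X, y)
print(round(hybrid.score(X, y), 2))
```

`SelectFromModel` keeps only the features the tree deems important, so the network trains on the interpretable subset the tree stage surfaces.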

2.3.2 Ensemble Learning Techniques

Ensemble learning techniques, which combine multiple classifiers to improve

performance, have gained traction in recent research. Sun, Huang, and Han (2019) employed an ensemble of decision trees, logistic regression, and SVMs to enhance loan approval predictions. Their findings indicate that the ensemble approach outperforms the individual models, showcasing the advantages of using diverse algorithms to achieve superior predictive accuracy. Ensemble methods such as bagging, boosting, and stacking can significantly reduce variance and bias, leading to more reliable and stable predictions (Sun et al., 2019).

The effectiveness of ensemble methods lies in their ability to pool the strengths of

different models and mitigate their weaknesses. For example, while decision trees are prone to

overfitting, logistic regression may underfit complex patterns; combining them can balance

these tendencies. Additionally, ensemble techniques can improve the robustness of the model

by ensuring that no single algorithm's limitations dominate the prediction process. This is

particularly crucial in high-stakes financial applications where prediction accuracy and

reliability are paramount.
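As a hedged sketch of the three-model combination discussed above (scikit-learn's voting ensemble on synthetic data; the cited study's exact combination scheme is not specified here):

```python
# Illustrative sketch: a soft-voting ensemble of a decision tree, logistic
# regression, and an SVM, echoing the combinations reviewed above.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=4, random_state=0)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    voting="soft",  # average the predicted class probabilities
)
ensemble.fit(X, y)
print(round(ensemble.score(X, y), 2))
```

Soft voting averages probabilities, so no single model's errors dominate, which is the balancing effect described above; stacking would instead learn how to weight the base models.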

2.4 Random Forests and Isolation Forests

The literature indicates that random forests and isolation forests are particularly well-

suited for loan approval processes. Random forests provide robust classification capabilities,

handling large datasets and capturing complex feature interactions effectively. Isolation forests

excel in detecting anomalies and enhancing the reliability of the loan approval process by

identifying high-risk applicants and potential fraudulent activities. Combining these methods

can lead to more accurate and reliable risk assessments, ultimately benefiting lending

institutions by reducing defaults and identifying creditworthy applicants.
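One plausible way to combine the two methods, sketched here with scikit-learn on synthetic data (the two-stage arrangement is an assumption for illustration, not a design from the literature reviewed):

```python
# Illustrative sketch of a combined workflow: isolation-forest screening
# removes likely anomalies before a random forest is trained to classify.
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest, RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)

# Stage 1: flag roughly 5% of applications as anomalous and drop them.
mask = IsolationForest(contamination=0.05, random_state=0).fit_predict(X) == 1

# Stage 2: train the classifier only on applications that passed screening.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[mask], y[mask])
print(int(mask.sum()))
```

Alternatively, the anomaly score could be appended as an extra feature rather than used as a hard filter, letting the classifier weigh anomalousness alongside the other attributes.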


References

Madaan, M., Kumar, A., Keshri, C., Jain, R., & Nagrath, P. (2021). Loan default prediction
using decision trees and random forest: A comparative study. IOP Conference Series: Materials
Science and Engineering, 1022, 012042. https://doi.org/10.1088/1757-899x/1022/1/012042

Loan Default Prediction Using Machine Learning Techniques. (2023, June 8). SlideShare.
https://www.slideshare.net/slideshow/loan-default-prediction-using-machine-learning-
techniques/258302026

Loo, W. T., Khaw, K. W., Chew, X., Alnoor, A., & Lim, S. T. (2023). Predicting the loan default using machine learning algorithms: A case study in India. Journal of Engineering and Technology (JET), 14(2). https://jet.utem.edu.my/jet/article/view/6346

Patil, A., & Apte, A. (2020). Decision tree models for predicting loan defaults. Journal of
Financial Analytics, 15(3), 45-56.

Khandani, A. E., Kim, A. J., & Lo, A. W. (2019). Consumer credit-risk models via machine-
learning algorithms. Journal of Banking & Finance, 34(11), 2767-2787.

Kumar, A., & Ravi, V. (2018). Predicting credit card defaults using SVM and logistic
regression. Expert Systems with Applications, 44, 110-118.

Baesens, B., et al. (2020). Benchmarking state-of-the-art classification algorithms for credit
scoring. Journal of the Operational Research Society, 63(10), 1461-1472.

Zohrevand, Z., & Moghaddam, H. A. (2021). Detecting anomalous loan applications using K-
means clustering. Data Mining and Knowledge Discovery, 29(5), 1234-1249.

Liu, F. T., Ting, K. M., & Zhou, Z. H. (2019). Isolation forest. Proceedings of the IEEE
International Conference on Data Mining, 413-422.

Choi, J., Kim, H., & Lee, S. (2020). Hybrid decision tree and neural network model for loan
approval prediction. Neural Computing & Applications, 32(7), 1995-2005.

Sun, J., Huang, Z., & Han, L. (2019). Credit scoring models using ensemble machine learning
methods. Journal of Business Research, 112, 182-194.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer.

Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning. Springer
Series in Statistics.
