


WHITEPAPER

Loan Eligibility Prediction Using Logistic Regression Algorithm
Cornelius Mellino Sarungu
Information System Department
Binus Online Learning
Bina Nusantara University
Jakarta, Indonesia, 11480
cornelius.sarungu@binus.ac.id

Abstract—Loan eligibility is the core of the lending business. Every loan application must go through this process. Done the conventional way, without a machine learning algorithm, the process can take considerable time and makes customers wait for the result. With a machine learning algorithm, it can be sped up to a matter of minutes or even seconds. This improvement can become a key advantage for some lenders, even though other things such as background checking and fraud prevention must also be considered. In this experiment, we implement the Logistic Regression algorithm on a loan eligibility case. Its ability to produce binary classified output makes this algorithm fit a case that needs only two states: approved or rejected. The experiment is carried out through a series of steps and ends with an evaluation, from which we draw conclusions. From the evaluation result, it is concluded that the model performance is not good enough, although most of the evaluation metrics show satisfactory values, except for the AUC.

Keywords—loan eligibility, predictive machine learning, logistic regression

I. INTRODUCTION

In this research we created a model that can predict the feasibility of granting mortgage (KPR) approval for the applications being processed. Several variables, or features, become the elements of the model's input.

The credit process is the process of lending a certain amount of funds to a person or institution so that they can meet their needs with these funds. Loans that target individuals are referred to as retail loans, while those that target companies or institutions are referred to as corporate loans. In submitting an application, there are many things to consider. Some types of credit even require collateral in the form of assets, which also vary. For housing loans (KPR), for example, the house we are paying off automatically becomes collateral and the certificate is held by the lender until the installments are paid off (Suyatno, 2007).

However, both retail and corporate credit applications must go through a series of processes before reaching the approval stage. These processes include background checks, checks of eligibility scores based on the variables filled in by customers on the application forms, checks of eligibility scores from third parties such as Pefindo in Indonesia, checks of the completeness of documents, and biometric validation. After all of the above checks produce positive scores, the credit application can be determined as feasible, or eligible to be fulfilled (Hasibuan, 2008).

II. BUSINESS UNDERSTANDING

Credit applications, whether manual (paper based) or online, must all go through a series of data checks, analysis and scoring to determine eligibility for approval. Eligibility is usually determined by a series of variables that must be filled in when the customer completes the application form.

In making a credit application, the customer must fill out an application form which usually has many fields. The contents of this credit application form include bio data, residence information, employment information, workplace information, financial information (debt, assets), closest contact information, banking information (account numbers, credit cards) and supporting documents. For corporate loans, the applicant usually must submit a credit proposal that contains the following information: company executive summary, company identity and structure, general description of the company, company financial condition, industry analysis, company financial structure, analysis of financial projections, credit guarantees, and attachments (Jusuf, 2003).

Eligibility in the context of credit approval is very important for financial institutions such as banks and cooperatives that run savings and loan businesses. The manual process takes a long time, in a matter of days, which of course makes the customer wait. Competition among online credit service providers makes players compete to maximize their business services, especially in increasing the speed of the approval process. Competition in this aspect forces these players to explore machine learning and AI technologies to shorten processing time and, at the same time, obtain high accuracy in the approval decisions taken. High accuracy is needed because errors in making credit decisions will result in losses for the credit service providers concerned (Amrin & Pahlevi, 2022).

Either wrongly approving a customer who is not worthy of being given credit (a false positive) or rejecting a customer who deserves credit (a false negative) opens up the potential for loss (Amrin & Pahlevi, 2022). With the implementation of machine learning models, it is hoped that the determination of eligibility can be accelerated
to a matter of minutes, of course while still considering the essential features. In the modeling carried out this time, the Logistic Regression algorithm was implemented, which can provide outputs that represent approved (1) or rejected (0) decisions.

A. Problem Statements

Problem points in the credit application process include:

• The processing of the credit application form takes quite a long time. The length of processing time, from the moment the form is submitted until the notification of approval or rejection is received by the customer, is one of the important points of competition among lending services. From the customer's point of view, the faster it is, the higher the level of satisfaction with the service. Meanwhile, from a financial institution's standpoint, increased speed will increase the competitiveness of its services in the related business arena.

• The accuracy of credit approval decisions is very important for financial institutions that provide these services. Errors in giving approval will bring financial losses to the institution. If the credit is approved but the customer turns out to be unable to repay it, the institution suffers a real financial loss; if the credit is rejected even though the customer is eligible and able to repay the loan, the institution suffers a loss of opportunity.

B. Goals

The goals of implementing this machine learning solution include:

• Speed up credit applications by implementing machine learning modeling based on Logistic Regression. The performance of the algorithm will be measured and its advantages and disadvantages will be reviewed.

• Look for modeling algorithms that have the highest level of accuracy and optimal values for the other aspects of assessment, to assist credit service providers in granting approval to applications submitted by customers. This will certainly help reduce the risk of losses that may occur in the future.

C. Solution Specifications

The implementation of machine learning modeling to predict the eligibility value for credit applications has the following specifications:

• Implement modeling using three algorithms: Logistic Regression, Random Forest and Boosting.

• The performance assessment of the three models will be made using several metrics/measurement methods, including the following:

a. Mean Squared Error (MSE)
This metric squares the difference between the predicted and actual values, then takes the average over all samples (Bickel, 2015). The MSE formula is as follows:

MSE = (1/n) Σ (y_i − ŷ_i)²  (1)

where y_i is the actual value, ŷ_i is the predicted value, and n is the number of samples.

b. Confusion Matrix
This matrix maps the prediction results into several categories, as follows:

TABLE I. CONFUSION MATRIX TABLE
                 Model Prediction   Actual
True Positive    1                  1
False Positive   1                  0
False Negative   0                  1
True Negative    0                  0

c. Accuracy
Accuracy is measured by the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)  (2)

d. Precision
Precision is measured by the following formula:

Precision = TP / (TP + FP)  (3)

e. Sensitivity
Sensitivity is measured by the following formula:

Sensitivity = TP / (TP + FN)  (4)

f. Area Under The Curve
The area under the curve, also known as the AUC, is used as a measure to judge whether a model is good or bad. An AUC close to 1 means that the model has good performance, while an AUC close to 0.5 indicates that the model has poor performance. The curve here is the ROC curve. From Fig. 1 it can be seen that the more convex the ROC curve, the better the model performance, meaning more accurate prediction results; meanwhile, the more linear the ROC curve, the worse the model performance. However, it should also be noted that an AUC value too close to 1 indicates the possibility of overfitting in the modeling that we make [2].

Fig. 1. Area Under the Curve (AUC).
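To make formulas (1)–(4) concrete, here is a small illustrative computation in Python; the confusion-matrix counts and label vectors below are arbitrary demonstration values, not results from this experiment.

# Arbitrary confusion-matrix counts, for illustration only
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + tn + fp + fn)    # formula (2)
precision = tp / (tp + fp)                    # formula (3)
sensitivity = tp / (tp + fn)                  # formula (4)

# MSE, formula (1), on example actual vs. predicted labels
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
mse = sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)

print(accuracy, precision, sensitivity, mse)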
III. DATA UNDERSTANDING

The data used in this project is the Loan Eligible Dataset from Kaggle, published by Vikas Ukani [11]. The data comes from the Dream Housing Finance company, which handles all kinds of mortgage loans and is present in urban, semi-urban and rural areas. The customer first applies for a mortgage loan, after which the company validates the customer's eligibility for the loan.

A. The Features

• Loan_ID : loan ID, unique code/number
• Gender : sex, Male/Female
• Married : marital status of the applicant, (Y/N)
• Dependents : number of dependents, number
• Education : the last education of the applicant, (Graduate/Under Graduate)
• Self_Employed : whether the applicant is self-employed, (Y/N)
• ApplicantIncome : applicant income, number
• CoapplicantIncome : co-applicant's (spouse's) income, number
• LoanAmount : loan amount, number in thousands
• Loan_Amount_Term : loan term, number in months
• Credit_History : loan history, yes or no (1/0)
• Property_Area : property area, Urban/Semi-Urban/Rural
• Loan_Status : loan approval status, (Y/N)

B. Data Exploration

After the raw data is loaded, we perform a series of exploratory activities as follows:

• View the beginning of the data table with the head() function.

df_train = pd.read_csv("loan-train.csv")
df_train.head()

• View summary statistics with the describe() function.

df_train.describe()

• View the data structure with the info() function.

df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 367 entries, 0 to 366
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Loan_ID            367 non-null    object
 1   Gender             356 non-null    object
 2   Married            367 non-null    object
 3   Dependents         357 non-null    object
 4   Education          367 non-null    object
 5   Self_Employed      344 non-null    object
 6   ApplicantIncome    367 non-null    int64
 7   CoapplicantIncome  367 non-null    int64
 8   LoanAmount         362 non-null    float64
 9   Loan_Amount_Term   361 non-null    float64
 10  Credit_History     338 non-null    float64
 11  Property_Area      367 non-null    object
dtypes: float64(3), int64(2), object(7)
memory usage: 34.5+ KB

IV. DATA PREPARATION

The data preparation techniques used to prepare the data before it is fed into the machine learning models include:

A. Data splitting
Data can be split after the data preparation activities, but in this case the dataset has already been split at the original source, so we carry out the data preparation process directly on both parts. We name the part used for training df_train and the part used for testing or validation df_val. At this stage the data is divided into a training portion of around 70% and a testing portion of around 30%.

B. Identify null values
At this stage we identify null values in both the training and the testing data. The following is a visualization of the null values in both datasets. Fig. 2 is a visualization of the null values in the training data.

Fig. 3 is a visualization of the null values in the testing data. The white lines in each column represent the existence of null values in that column. If these null values are later removed, the re-visualization will display plain columns without any transverse white lines.

These null values must be removed so as not to affect the performance of the model. Especially if the model involves mathematical calculations, the presence of a null value might cause the computational process to stop.
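The null-check code itself is not reproduced in the paper; a minimal sketch of how the nulls could be counted and visualized, assuming pandas and seaborn and assuming loan-test.csv as the file name of the validation part, is:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df_train = pd.read_csv("loan-train.csv")
df_val = pd.read_csv("loan-test.csv")   # assumed file name for the validation part

# Count the missing values per column
print(df_train.isnull().sum())

# Visualize missing values: light cells mark nulls in each column (cf. Fig. 2 and Fig. 3)
sns.heatmap(df_train.isnull(), cbar=False)
plt.show()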

Fig. 2. Visualization of null values in training data.

Fig. 3. Visualization of null values in validation data.

C. Replace null values with the mean or mode values.
There are several ways of dealing with null or NA (Not Available) values. In this study we chose two methods: using the most frequent value (mode) to fill in the NA values in the categorical data, and using the mean value to fill in the NA values in the numerical data.

The following are the features or variables affected by the NA replacement process. Replaced with the mode: Gender, Married, Dependents, Self_Employed, Credit_History. Replaced with the mean: LoanAmount, Loan_Amount_Term.

Fig. 4 is a visualization of the training data after the null/NA value replacement process, while Fig. 5 is a visualization of the testing data after the same process.

Fig. 4. Visualization of null values in training data after NA removal.

Fig. 5. Visualization of null values in validation data after NA removal.
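The imputation code is not shown in the paper; a minimal sketch of the strategy described above (mode for the categorical features, mean for the numerical ones), with fill_missing as an assumed helper name, could be:

# Assumed helper applying the described replacement strategy to one dataframe
def fill_missing(df):
    for col in ["Gender", "Married", "Dependents", "Self_Employed", "Credit_History"]:
        df[col] = df[col].fillna(df[col].mode()[0])      # most frequent value
    for col in ["LoanAmount", "Loan_Amount_Term"]:
        df[col] = df[col].fillna(df[col].mean())         # mean value
    return df

df_train = fill_missing(df_train)
df_val = fill_missing(df_val)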
D. Changing categorical values to numerical values.
After the null or NA values have been removed from all the data, we move on to the next stage, namely changing the categorical data values to numerical values. In this step we use the fit_transform() function from the LabelEncoder class in the sklearn.preprocessing library. Converting to numerical values is useful because several modeling algorithms can only process data in numerical form; after this step the data can be used by all three algorithms that we will try. The following is a snippet of the data after some features have been changed from categorical to numerical.
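A sketch of this encoding step, assuming LabelEncoder is applied column by column to the remaining object-typed features (the column list here is an assumption based on the feature descriptions above):

from sklearn.preprocessing import LabelEncoder

categorical_cols = ["Gender", "Married", "Dependents", "Education",
                    "Self_Employed", "Property_Area"]

encoder = LabelEncoder()
for col in categorical_cols:
    # fit_transform() maps each category to an integer code, e.g. Female/Male -> 0/1
    df_train[col] = encoder.fit_transform(df_train[col])
    # the same step would be applied to df_val before modeling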

E. Create a heatmap visualization to check the relationship between features.
After changing the categorical values, we take a brief look at the correlation between the features in our data structure. This can be done easily using the heatmap() function in the seaborn library together with the corr() function of the dataframe in the pandas library. Fig. 6 shows the heatmap of the correlation mapping of the training data structure. It can be seen in the figure that there is a significant correlation between LoanAmount and ApplicantIncome and, more importantly, between Loan_Status and Credit_History. A positive score indicates a correlation in a positive or straight direction, while a negative score indicates an inverse relationship.

Fig. 6. The heatmap correlation mapping of the training data structure.
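A minimal sketch of this correlation heatmap, using the seaborn and pandas functions named above (the styling arguments are assumptions):

import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations of the now-numeric features, drawn as a heatmap (cf. Fig. 6)
corr = df_train.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()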

F. Bypass the Loan_ID feature.
We drop this feature because it will not be used in the model training process.
G. Check each feature with a histogram or scatter plot visualization.
At this stage we examine the characteristics of each
feature using histogram diagrams and scatter plots. The
histogram diagram helps us see the distribution of data,
while the scatter plot helps us see the data patterns and
outliers that appear.
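The plotting code is not included in the paper; a minimal sketch of this per-feature inspection, assuming pandas and matplotlib, might look like this:

import matplotlib.pyplot as plt

# Frequency of each category, e.g. for Gender (cf. Table II)
print(df_train["Gender"].value_counts())

# Distribution of a numeric feature (cf. the income histograms)
df_train["ApplicantIncome"].hist(bins=50)
plt.show()

# Scatter plot against the row index to make outliers visible
plt.scatter(df_train.index, df_train["ApplicantIncome"])
plt.show()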
H. Check Gender data (Gender).
Following are the values of the Gender variable or feature.

TABLE II. GENDER
Gender   value   count
Female   0       112
Male     1       502

From Fig. 7 it can be observed that the customers who apply for loans are mostly men.

Fig. 7. Gender histogram.

I. Check Marital Status data (Married).
Following are the values of the Married variable or feature.

TABLE III. MARITAL STATUS
Married   value   count
No        0       213
Yes       1       401

From Fig. 8 it can be observed that most of the customers who applied for loans were married.

Fig. 8. Marital status histogram.

J. Check the number of dependents data (Dependents).
Following are the values of the Dependents variable or feature.

TABLE IV. DEPENDENTS
Dependents   value   count
0            0       360
1            1       102
2            2       101
>=3          3       51

From Fig. 9 it can be observed that most of the customers who apply for loans do not have children, while those who do mostly have one or two.

Fig. 9. Dependents histogram.

K. Check Last Education data (Education).
Following are the values of the Education variable or feature.

TABLE V. EDUCATION
Education      value   count
Graduate       0       480
Not Graduate   1       134

From Fig. 10 it can be observed that most of the customers who applied for loans were graduates of higher education.

Fig. 10. Education histogram.

L. Check the Employment Status data (Self_Employed).
Following are the values of the Self_Employed variable or feature.

TABLE VI. EMPLOYMENT STATUS
Employment Status   value   count
No                  0       532
Yes                 1       82

From Fig. 11 it can be observed that most of the customers who applied for loans worked as employees, while only about 15% were self-employed.

Fig. 11. Employment status histogram.

M. Check Applicant Income data (ApplicantIncome).
From Fig. 12 it can be observed that most applicants have an income of less than USD 10,000 per year, while a small number of applicants have a very high income of up to USD 80,000 per year. This is also shown in Fig. 13, where the data points are densely packed at the bottom with a few points loosely scattered above. A small amount of data that lies far apart from most of the other data is known as outliers. These outliers should be cut from our data first because they risk biasing the results of our modeling predictions.

Fig. 12. Applicant income histogram.

Fig. 13. Applicant income scatter plot.

After we cut the outliers at the USD 30,000 threshold, the histogram and scatter plot look like Fig. 14 and Fig. 15 below.
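The cut itself can be expressed as a simple boolean filter; a sketch, assuming the income columns are in the same units as the plots:

# Keep only rows below the income thresholds discussed in the text
df_train = df_train[df_train["ApplicantIncome"] < 30000]
df_train = df_train[df_train["CoapplicantIncome"] < 15000]   # threshold described in the next subsection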

Fig. 14. Applicant income histogram after outliers cut out.

Fig. 15. Applicant income scatter plot after outliers cut out.

N. Check Co-applicant Income data (CoapplicantIncome).
From Fig. 16 it can be observed that most co-applicants have an income of less than USD 10,000 per year, while a small number of co-applicants have a very high income of up to USD 40,000 per year. This is also shown in Fig. 17, where the data points are densely packed at the bottom with a few points loosely scattered above. The outliers that appear here will also be cut from the data.

Fig. 16. Co-applicant income histogram.

Fig. 17. Co-applicant income scatter plot.

After we cut the outliers at the USD 15,000 threshold, the scatter plot looks like Fig. 18 below.

Fig. 18. Co-applicant income scatter plot after outliers cut out.

O. Check the Loan Amount data (LoanAmount).
From Fig. 19 and Fig. 20 it can be observed that the amounts of the loans taken are mostly in the range of USD 100-200K, while the maximum value taken is around USD 600K.

Fig. 19. Loan amount histogram.

Fig. 20. Loan amount scatter plot.

P. Check the loan tenure data (Loan_Amount_Term).
From Fig. 21 and Fig. 22 it can be observed that the loan tenure taken is mostly 360 months, or 30 years of mortgage installments.

Fig. 21. Loan tenure histogram.

Fig. 22. Loan tenure scatter plot.

Q. Check credit history data (Credit_History).
From Fig. 23 it can be observed that most customers who submitted a loan application have taken out a loan before.

Fig. 23. Credit history histogram.

R. Check the Property Area data (Property_Area).
From Fig. 24 it can be observed that most of the mortgaged properties are in semi-urban areas, while the numbers of properties in rural and urban areas are balanced.

Fig. 24. Property area histogram.

V. MODELLING

The model chosen for this solution uses the Logistic Regression algorithm, because this algorithm is suitable for problems with many independent variables and produces binary output (0/1, Yes/No, Approve/Reject, etc.).

Pros
• Easy to implement.
• Can accommodate multiple variables.
• Provides not only a measure of how relevant a predictor is (the size of its coefficient) but also the direction of the association (positive or negative).
• Very fast at classifying unknown records.
• Has good accuracy for simple data sets and performs well when the data set is linearly separable.

Cons
• If the number of observations is smaller than the number of features, Logistic Regression should not be used because it can lead to overfitting.
• The main limitation of Logistic Regression is the assumption of linearity between the dependent and independent variables.
• It can only be used to predict discrete functions, so the dependent variable of Logistic Regression is bound to a discrete set of values.
• Non-linear problems cannot be solved with Logistic Regression because it has a linear decision surface, and linearly separable data is rare in real-world scenarios.
• Logistic Regression requires little or no multicollinearity between the independent variables.
• It is difficult to capture complex relationships using Logistic Regression; more powerful and compact algorithms such as Neural Networks can easily outperform it.

The following is the code for the Logistic Regression model in Python.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Scale the features, then fit a Logistic Regression model on the training data
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="lbfgs"))
pipe.fit(X_train, y_train)
# Score the fitted pipeline on the validation data
pipe.score(X_val, y_val)

In the code above, we use a pipelining technique: before the data is fed into the LogisticRegression() model, it is scaled with the standard method. This puts the feature values on comparable scales. Scaling is needed when the values of the existing features have very different ranges from one another; sometimes, if we do not do scaling, the model will fail computationally.

First of all, we create a pipe object with the make_pipeline() function and pass in the StandardScaler() and LogisticRegression() objects, i.e., the scaler and the model. For the model object we use the default solver, "lbfgs". The LogisticRegression class has several solvers, including newton-cg, lbfgs, liblinear, sag and saga; we choose lbfgs considering that the size of the data is not too large.

In the next step we train this model. Unlike direct training, training with this pipeline technique uses the pipe object, while the function called to run the training remains the same, namely the fit() function. The parameters passed are X_train, which contains the independent features/variables, and y_train, which contains the dependent variable holding the actual reference values.

The final step is to print the score of the training results by calling the pipe.score() function and passing in the testing data: X_val, which contains the independent variables, and y_val, which contains the dependent variable.

VI. EVALUATION

The performance of the machine learning modeling is evaluated in several ways. At this evaluation stage, the Logistic Regression model is measured with several metrics. Here is the explanation:

A. Mean squared error (MSE)
One of the advantages of MSE here is that it identifies whether there are outliers affecting our model, which would cause the error value to become very large. The relatively large MSE value (> 0.1) here is probably due to the wide range between the average and maximum values in several features, such as the applicant income and co-applicant income.

The following are the results of the MSE evaluation of the model:

TABLE VII. MSE RESULT
            Logistic Regression
train_mse   0.188797
test_mse    0.165289

The MSE result is quite low, but the gap between train_mse and test_mse would ideally be smaller.

B. Confusion matrix
The results of creating a confusion matrix from the comparison between the real validation output (y_val) and the predictive output of the model (y_pred) are as follows:

TABLE VIII. CONFUSION MATRIX
                 Logistic Regression
True Positive    16
False Positive   19
False Negative   1
True Negative    85

The matrix shows that the false positive count is still quite high, indicating that the model's performance is not very good.

C. Accuracy
The accuracy value of the model is 0.834711. This is quite high in terms of accuracy, but a high accuracy value does not always reflect good performance; we must check the other parameters.

D. Precision
The precision value of the model is 0.817308. Precision reflects the percentage of truly approved loans among all the loans that the model predicts as approved.

E. Sensitivity
The sensitivity value of the model is 0.988372. Sensitivity reflects the percentage of actually approved loans that the model also predicts as approved.

F. Area under the curve (AUC)
The AUC value of the model is 0.722757. It shows that the model performance lies between 0.5 and 1, which means that from the AUC perspective it is fair but could still be improved. The AUC value is generated by calling the roc_auc_score() function from the sklearn.metrics library.

The following is the ROC curve visualization from the Logistic Regression modeling. The AUC score shown here is not very satisfactory, since it only reaches 0.6820.
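The evaluation code itself is not listed in the paper; a minimal sketch of how these metrics could be produced with sklearn.metrics, assuming the fitted pipe object and the X_val/y_val split from the previous section, is:

from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, roc_auc_score, mean_squared_error, roc_curve)
import matplotlib.pyplot as plt

# Predicted labels and approval probabilities on the validation split
y_pred = pipe.predict(X_val)
y_prob = pipe.predict_proba(X_val)[:, 1]

print(confusion_matrix(y_val, y_pred))      # counts behind Table VIII
print(accuracy_score(y_val, y_pred))        # accuracy
print(precision_score(y_val, y_pred))       # precision
print(recall_score(y_val, y_pred))          # sensitivity / recall
print(mean_squared_error(y_val, y_pred))    # test MSE (cf. Table VII)
print(roc_auc_score(y_val, y_prob))         # AUC

# ROC curve, as visualized in Fig. 25
fpr, tpr, _ = roc_curve(y_val, y_prob)
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()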

Fig. 25. ROC curve visualization.

VII. CONCLUSION

There are some conclusions drawn from this experiment:

• The Logistic Regression algorithm could certainly be used to predict loan eligibility. While the MSE, accuracy, precision and recall (sensitivity) indicate quite satisfactory results, we must acquire more information from other metrics, such as the AUC, to be sure about the model performance.

• The model performance is also influenced by the data: size, integrity, outliers and label correctness can all impact it.

• Further research can focus on several aspects, such as improving the Logistic Regression model performance, comparing it with other algorithms, and testing the performance on much larger data.

REFERENCES

[1] A. Amrin and O. Pahlevi, "Implementation of logistic regression classification algorithm and support vector machine for credit eligibility prediction," Journal of Informatics and Telecommunication Engineering, vol. 5, no. 2, pp. 433–441, 2022.
[2] D. Kurniawan, Pengenalan Machine Learning dengan Python. Jakarta, Indonesia: Elex Media, 2021.
[3] Kasmir, Bank dan Lembaga Keuangan Lainnya, Edisi Revisi. Jakarta, Indonesia: PT Raja Grafindo Persada, 2014.
[4] L. Zhao, S. Lee, and S.-P. Jeong, "Decision tree application to classification problems with boosting algorithm," Electronics, vol. 10, no. 16, p. 1903, Aug. 2021, doi: 10.3390/electronics10161903.
[5] M. Gopinath, K. Srinivas Shankar Maheep, and R. Sethuraman, "Customer loan approval prediction using logistic regression," Advances in Parallel Computing, 2021.
[6] M. Yarmolenko and B. Howlin, "Extreme gradient boosting algorithm classification for predicting lifespan-extending chemical compounds," 2022.
[7] M. S. P. Hasibuan, Dasar-dasar Perbankan. Jakarta, Indonesia: PT. Grafindo, 2008.
[8] J. Jusuf, Kiat Jitu Memperoleh Kredit Bank. Jakarta, Indonesia: Elex Media, 2003.
[9] P. J. Bickel and K. A. Doksum, Mathematical Statistics: Basic Ideas and Selected Topics, vol. 1, 2 vols. CRC Press, 2015.
[10] T. Suyatno, Kelembagaan Perbankan. Jakarta, Indonesia: PT. Gramedia Pustaka Utama, 2007.
[11] V. Ukani, Loan eligibility dataset, 2020. [Online]. Available: https://www.kaggle.com/datasets/vikasukani/loan-eligible-dataset. [Accessed: 04-Nov-2022].
[12] V. Ukani, Loan eligibility machine learning, 2020. [Online]. Available: https://www.kaggle.com/code/vikasukani/loan-eligibility-prediction-machine-learning. [Accessed: 16-Oct-2022].

