Loan Eligibility Prediction Using Logistic Regression Algorithm
Cornelius Sarungu
Binus University
Abstract—Loan eligibility is the core of the lending business. Every loan application must go through this process. Done the conventional way, without a machine learning algorithm, the process can take considerable time and makes customers wait for the result. With a machine learning algorithm, it can be sped up to a matter of minutes or even seconds. This improvement could become a key advantage for some lenders, even though other things, such as background checking and fraud prevention, must also be considered. In this experiment, we implement the Logistic Regression algorithm on the loan eligibility case. Its ability to produce binary classified output makes this algorithm fit the case, which needs only two states: approved or rejected. The experiment is carried out through a series of steps and ends with an evaluation, after which we draw conclusions. From the evaluation result, it is concluded that the model performance is not good enough, although most of the evaluation metrics show satisfactory values, except the AUC.

Keywords—loan eligibility, predictive machine learning, logistic regression

I. INTRODUCTION

In this research we created a model that can predict the feasibility of granting mortgage (KPR) approval for processed applications. Several variables or features become elements of the built model's input.

The credit process is the process of lending a certain amount of funds to a person or institution so that they can meet their needs with these funds. Loans that target individuals are referred to as retail loans, while those that target companies or institutions are referred to as corporate loans. In applying, there are many things to consider. Some types of credit even require collateral in the form of assets, which also vary. For housing loans (KPR), for example, the house being paid off automatically becomes collateral, and the certificate is held by the lender until the installments are paid off (Suyatno, 2007).

However, both retail and corporate credit applications must go through a series of processes before reaching the approval stage. These include background checks, checks of eligibility scores based on variables taken from the forms filled out by customers, checks of eligibility scores from third parties such as Pefindo in Indonesia, checks of document completeness, and biometric validation. After all of the above checks produce positive scores, the credit application can be determined as feasible, or eligible to be fulfilled (Hasibuan, 2008).

II. BUSINESS UNDERSTANDING

Credit applications, both manual (paper based) and online, must all go through a series of data checks, analysis, and scoring to determine eligibility for approval. What determines eligibility is usually a series of variables that must be filled in when the customer fills out the application form.

In making a credit application, the customer must fill out an application form that usually has many fields. The contents of this credit application form include bio data, residence information, employment information, workplace information, financial information (debt, assets), closest contact information, banking information (account numbers, credit cards), and supporting documents. For corporate loans, the applicant usually must submit a credit proposal that contains the following information: company executive summary, company identity and structure, general description of the company, company financial condition, industry analysis, company financial structure, analysis of financial projections, credit guarantees, and attachments (Jusuf, 2003).

Eligibility in the context of credit approval is very important for financial institutions such as banks and cooperatives that run savings and loan businesses. The manual process takes a long time, a matter of days, which of course makes the customer wait. Competition among online credit service providers makes players compete to maximize their business services, especially in increasing the speed of the approval process. Competition in this aspect forces these players to explore machine learning and AI technologies to shorten time and, at the same time, obtain high accuracy in the approval decisions taken. High accuracy is needed because errors in making credit decisions will result in losses for the credit service providers involved (Amrin & Pahlevi, 2022).

Either wrongly approving a customer who is not worthy of being given credit (a false positive) or rejecting a customer who deserves credit (a false negative) opens up the potential for loss (Amrin & Pahlevi, 2022). With the implementation of machine learning models, it is hoped that the determination of eligibility can be accelerated to a matter of minutes, while of course still considering the essential features. In the modeling carried out this time, the Logistic Regression algorithm was implemented, which can provide outputs that represent approved (1) or rejected (0) decisions.

B. Goals

The goals of implementing this machine learning solution include:

• Speed up credit applications by implementing machine learning modeling based on Logistic Regression. The performance of the algorithm will be measured, and its advantages and disadvantages will be reviewed.

• Look for modeling algorithms that have the highest level of accuracy and optimal values for the other assessment aspects, which can assist credit service providers in granting approval to applications submitted by customers. This will certainly help reduce the risk of losses that may occur in the future.

C. Solution Specifications

The implementation of machine learning modeling to predict the eligibility value of credit applications has the following specifications:

• Implement modeling using three algorithms: Logistic Regression, Random Forest and Boosting.

• The performance assessment of the three models will be made using several metrics/measurement methods, including the following (a worked sketch follows after Fig. 1):

a. Mean Squared Error (MSE)
This metric squares the difference between the predicted and actual values, then takes the final average value (Bickel, 2015). The MSE formula is as follows:

$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$  (1)

b. Confusion Matrix
A table that breaks the predictions down into true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The counting-based metrics below are derived from these four cells.

c. Accuracy
Accuracy is measured by the following formula:

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$  (2)

d. Precision
Precision is measured by the following formula:

$\mathrm{Precision} = \frac{TP}{TP + FP}$  (3)

e. Sensitivity
Sensitivity is measured by the following formula:

$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$  (4)

f. Area Under The Curve
The area under the curve, also known as AUC, is used as a measure to judge whether a model is good or bad. An AUC close to 1 means that the model has good performance, while an AUC close to 0.5 indicates that the model has poor performance. The curve here is the ROC curve. From Fig. 1 it can be seen that the more convex the ROC curve, the better the model performance, meaning more accurate prediction results; the more linear the ROC curve, the worse the model performance. However, it should also be noted that an AUC value too close to 1 indicates the possibility of overfitting in the modeling that we make [2].

Fig. 1. Area Under the Curve (AUC).
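To make formulas (1)-(4) concrete, the short sketch below computes them in plain Python. The label vectors here are toy values for illustration only; they are not taken from the experiment.

# Toy labels for illustration only (1 = approved, 0 = rejected).
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]

n = len(y_true)
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n  # formula (1)

# Confusion matrix cells.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / n      # formula (2)
precision = tp / (tp + fp)    # formula (3)
sensitivity = tp / (tp + fn)  # formula (4)
print(mse, accuracy, precision, sensitivity)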
III. DATA UNDERSTANDING

The data used in this project is the Loan Eligible Dataset from Kaggle by Vikas Ukani [11]. The data comes from the Dream Housing Finance company, which handles all mortgage loans and is present in all urban, semi-urban, and rural areas. The customer first applies for a mortgage loan, after which the company validates the customer's eligibility for the loan.

A. The Features

• Loan_ID: loan ID, a unique code/number
• Gender: sex, Male/Female
• Married: marital status of the applicant (Y/N)
• Dependents: number of dependents
• Education: the applicant's latest education (Graduate/Under Graduate)
• Self_Employed: whether the applicant is self-employed (Y/N)
• ApplicantIncome: applicant income, a number
• CoapplicantIncome: the co-applicant's (e.g., spouse's) income, a number
• LoanAmount: loan amount, a number in thousands
• Loan_Amount_Term: loan term, a number in months
• Credit_History: prior loan history, yes or no (1/0)
• Property_Area: property area, Urban/Semi-Urban/Rural
• Loan_Status: loan approval status (Y/N)

B. Data Exploration

After the raw data is loaded, we perform a series of exploratory activities as follows:

• View the beginning of the data table with the head() function, and check its structure with the info() function, whose summary is shown below.

df_train = pd.read_csv("loan-train.csv")
df_train.head()
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 367 entries, 0 to 366
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Loan_ID            367 non-null    object
 1   Gender             356 non-null    object
 2   Married            367 non-null    object
 3   Dependents         357 non-null    object
 4   Education          367 non-null    object
 5   Self_Employed      344 non-null    object
 6   ApplicantIncome    367 non-null    int64
 7   CoapplicantIncome  367 non-null    int64
 8   LoanAmount         362 non-null    float64
 9   Loan_Amount_Term   361 non-null    float64
 10  Credit_History     338 non-null    float64
 11  Property_Area      367 non-null    object
dtypes: float64(3), int64(2), object(7)
memory usage: 34.5+ KB

IV. DATA PREPARATION

Data preparation techniques used to prepare the data before it is processed into machine learning models include:

A. Data splitting

Data can be split after the data preparation activities, but in this case the dataset has already been split at the original source, so we carry out the data preparation process directly on both parts. We name the part for training df_train and the part for testing or validation df_val. At this stage the data is divided into a training portion of around 70% and a testing portion of around 30%.

B. Identify null values

At this stage we identify null values in both the training and testing data; one possible implementation is sketched below. Fig. 2 is a visualization of the null values in the training data, and Fig. 3 is a visualization of the null values in the testing data. The white lines in each column represent the existence of null values in that column. If these null values are removed, the re-visualization will display plain columns without any transverse white lines at all.

These null values must be removed so as not to affect the performance of the model. In particular, if the model involves mathematical calculations, the presence of a null value might cause the computational process to stop.
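The paper does not show the code for this step. A minimal sketch with pandas follows; the validation file name loan-test.csv and the use of seaborn for the null-value heatmaps are assumptions, since the text only describes the result.

import pandas as pd
import seaborn as sns

df_train = pd.read_csv("loan-train.csv")
df_val = pd.read_csv("loan-test.csv")  # assumed file name for the validation split

# Count null values per column in both parts.
print(df_train.isnull().sum())
print(df_val.isnull().sum())

# A heatmap of the boolean null mask gives white-line plots like Figs. 2-5.
sns.heatmap(df_train.isnull(), cbar=False)

# Drop the rows containing nulls, as described above.
df_train = df_train.dropna()
df_val = df_val.dropna()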
Fig. 2. Visualization of null values in training data.

Fig. 5. Visualization of null values in validation data after NA removal.
M. Check Applicant Income data (ApplicantIncome).

From Fig. 12 it can be observed that most applicants have an income of less than USD 10,000 per year, while a small number of applicants have a very high income of up to USD 80,000 per year. This is also shown in Fig. 13, where the data points are densely packed at the bottom with a few points loosely scattered above. A small amount of data that lies far apart from most of the other data is also known as outliers. These outliers should first be cut from our data, because they risk biasing the results of our modeling predictions; one possible cut is sketched below.

Fig. 14. Applicant income histogram after outliers cut out.
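The paper does not state the exact threshold used to cut the outliers. One common approach, sketched here with an assumed 99th-percentile cutoff, keeps only the rows below that income level:

# Assumed cutoff: keep applicants below the 99th percentile of income.
threshold = df_train["ApplicantIncome"].quantile(0.99)
df_train = df_train[df_train["ApplicantIncome"] < threshold]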
Fig. 15. Applicant income scatter plot after outliers cut out.

N. Check Co-applicant Income data (CoapplicantIncome).

From Fig. 16 it can be observed that most co-applicants have an income of less than USD 10,000 per year, while a small number have a very high income of up to USD 40,000 per year. This is also shown in Fig. 17, where the data points are densely packed at the bottom with a few points loosely scattered above. The outliers that appear here will also be cut from the data.

Fig. 18. Co-applicant income scatter plot after outliers cut out.

O. Check the Loan Amount data (LoanAmount).

From Fig. 19 and Fig. 20 it can be observed that the loan amounts taken are mostly in the range of USD 100-200K, while the maximum value taken is around USD 600K.

Fig. 21. Loan tenure histogram.
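Histograms and scatter plots such as those in Figs. 12-22 can be reproduced with matplotlib; the paper does not show this code. A minimal sketch, with LoanAmount chosen as an example column:

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(df_train["LoanAmount"].dropna(), bins=30)  # distribution shape
ax1.set_xlabel("LoanAmount (thousands)")
ax2.scatter(range(len(df_train)), df_train["LoanAmount"])  # spot outliers
ax2.set_xlabel("row index")
plt.show()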
Fig. 22. Loan tenure scatter plot.

Q. Check credit history data (Credit_History).

From Fig. 23 it can be observed that most customers who submitted a loan application have taken out loans before.

Fig. 23. Credit history histogram.

R. Check the Property Area data (Property_Area).

From Fig. 24 it can be observed that most of the mortgaged properties are in Semi-Urban areas, while the numbers of properties in Rural and Urban areas are balanced.

Fig. 24. Property area histogram.

V. MODELLING

The model chosen for this solution uses the Logistic Regression algorithm, because this algorithm is suitable for problems with many independent variables and produces binary output (0/1, Yes/No, Approve/Reject, etc.).

Pros:
• Easy to implement.
• Can accommodate multiple variables.
• Provides not only a measure of how relevant a predictor is (the coefficient) but also the direction of the association (positive or negative).
• Very fast at classifying unknown records.
• Has good accuracy for simple data sets and performs well when the data set is linearly separable.

Cons:
• If the number of observations is smaller than the number of features, Logistic Regression should not be used, because it can cause overfitting.
• The main limitation of Logistic Regression is the assumption of linearity between the dependent and independent variables.
• It can only be used to predict discrete functions; the Logistic Regression dependent variable is bound to a discrete set of values.
• Non-linear problems cannot be solved with Logistic Regression because it has a linear decision surface, and linearly separable data is rare in real-world scenarios.
• Logistic Regression requires little or no multicollinearity between the independent variables.
• It is difficult to capture complex relationships using Logistic Regression; more powerful and compact algorithms such as Neural Networks can easily outperform it.

The following is the code for the Logistic Regression model in Python.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(solver="lbfgs"))
pipe.fit(X_train, y_train)   # the scaler is fitted and applied to the training data
pipe.score(X_val, y_val)     # mean accuracy on the validation data

In the code above, we use a pipelining technique: before the data enters the LogisticRegression() model, it is first scaled with the standard method, so that the feature values lie in comparable ranges. Scaling is needed when the value ranges of the existing features are very far apart; without scaling, the model's computation can sometimes fail.

First of all, we create a pipe object with the make_pipeline() function and pass in the StandardScaler() and LogisticRegression() objects, that is, the scaler and the model. For the model we use the default solver, "lbfgs". The LogisticRegression class offers several solvers, including newton-cg, lbfgs, liblinear, sag, and saga; we choose lbfgs considering that the data set is not too large.

Next we train the model. Unlike direct training, with this pipeline technique the object we train is the pipe object, while the function called to run the training remains the same, namely fit(). The parameters passed are X_train, which contains the data with the various independent features/variables, and y_train, which contains the dependent variable, the actual reference value.

The final step is to print the score of the training results by calling the pipe.score() function with the testing data: X_val, which contains the independent variables, and y_val, which contains the dependent variable.
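The excerpt never shows how X_train, y_train, X_val, y_val, or the y_pred used in the evaluation below are constructed. A minimal sketch under stated assumptions: the categorical columns are one-hot encoded with pd.get_dummies, Loan_Status is mapped to 1/0, and the 70/30 split mentioned in Section IV is done with train_test_split.

import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed preparation; the paper does not show these steps.
X = pd.get_dummies(df_train.drop(columns=["Loan_ID", "Loan_Status"]))
y = df_train["Loan_Status"].map({"Y": 1, "N": 0})
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=42)

# After pipe.fit(X_train, y_train), the predictions used in the evaluation:
y_pred = pipe.predict(X_val)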
VI. EVALUATION

Evaluation of the performance of the machine learning modeling is done in several ways. At this evaluation stage, the performance of the Logistic Regression modeling is measured with several metrics, explained below; a sketch that reproduces these numbers follows at the end of this section.

A. Mean squared error (MSE)

One of the advantages of MSE here is that it identifies whether there are outliers affecting our model, which would cause the error value to be very large. The relatively large MSE value (> 0.1) here is probably due to the wide range between the average and maximum values in several features, such as applicant income and co-applicant income. The following are the results of the MSE evaluation of the model:

Logistic Regression
train_mse  0.188797
test_mse   0.165289

The MSE result is quite low, but the gap between test_mse and train_mse is expected to be lower.

B. Confusion matrix

The results of creating a confusion matrix from a comparison between the real validation output (y_val) and the predictive output of the model (y_pred) are as follows:

TABLE VIII. CONFUSION MATRIX

Logistic Regression
True Positive   16
False Positive  19
False Negative   1
True Negative   85

The matrix shows that the false positive count is still quite high, indicating that the model's performance is not too good.

C. Accuracy

The accuracy value of the model is 0.834711. This is quite high in terms of accuracy, but a high accuracy value does not always reflect good performance; we must check the other metrics.

D. Precision

The precision value of the model is 0.817308. Precision reflects the percentage of truly approved loans among all the loans the model predicted as approved.

E. Sensitivity

The sensitivity value of the model is 0.988372. Sensitivity reflects the percentage of actually approved loans that the model also predicts as approved.

F. Area under the curve (AUC)

The AUC value of the model is 0.722757. It shows that the model's performance lies between 0.5 and 1, which means that from the AUC perspective it is reasonable but could be improved. The AUC value is generated by calling the roc_auc_score() function from the sklearn.metrics library.

The following is the ROC curve visualization from the Logistic Regression algorithm modeling. The AUC score shown there is not very satisfactory, because it only amounts to 0.6820.
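The metric values reported above can be reproduced with sklearn.metrics. The text confirms roc_auc_score; the other calls below are the standard counterparts and are an assumption, as is the use of predicted probabilities for the AUC.

from sklearn.metrics import (mean_squared_error, confusion_matrix,
                             accuracy_score, precision_score,
                             recall_score, roc_auc_score, RocCurveDisplay)

print(mean_squared_error(y_val, y_pred))   # MSE
print(confusion_matrix(y_val, y_pred))     # the counts of TABLE VIII
print(accuracy_score(y_val, y_pred))       # accuracy
print(precision_score(y_val, y_pred))      # precision
print(recall_score(y_val, y_pred))         # sensitivity (recall)
print(roc_auc_score(y_val, pipe.predict_proba(X_val)[:, 1]))  # AUC

# ROC curve as visualized in Fig. 25.
RocCurveDisplay.from_estimator(pipe, X_val, y_val)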
Fig. 25. ROC curve visualization.

VII. CONCLUSION

There are some conclusions drawn from this experiment:

• The Logistic Regression algorithm can certainly be used to predict loan eligibility. However, while the MSE, accuracy, precision, and recall (sensitivity) indicate quite satisfactory results, we must gather more information from other metrics, such as the AUC, to be sure about the model performance.

• The model performance is also influenced by the data: size, integrity, outliers, and label correctness can all impact it.

• Further research can focus on aspects such as improving the Logistic Regression model's performance, comparing it with other algorithms, and testing performance on much larger data.

REFERENCES

[1] A. Amrin and O. Pahlevi, "Implementation of logistic regression classification algorithm and support vector machine for credit eligibility prediction," Journal of Informatics and Telecommunication Engineering, vol. 5, no. 2, pp. 433-441, 2022.
[2] D. Kurniawan, Pengenalan Machine Learning dengan Python. Jakarta, Indonesia: Elex Media, 2021.
[3] Kasmir, Bank dan Lembaga Keuangan Lainnya, Edisi Revisi. Jakarta, Indonesia: PT Raja Grafindo Persada, 2014.
[4] L. Zhao, S. Lee, and S.-P. Jeong, "Decision tree application to classification problems with boosting algorithm," Electronics, vol. 10, no. 16, p. 1903, Aug. 2021, doi: 10.3390/electronics10161903.
[5] M. Gopinath, K. Srinivas Shankar Maheep, and R. Sethuraman, "Customer loan approval prediction using logistic regression," Advances in Parallel Computing, 2021.
[6] M. Yarmolenko and B. Howlin, "Extreme gradient boosting algorithm classification for predicting lifespan-extending chemical compounds," 2022.
[7] M. S. P. Hasibuan, Dasar-dasar Perbankan. Jakarta, Indonesia: PT Grafindo, 2008.
[8] J. Jusuf, Kiat Jitu Memperoleh Kredit Bank. Jakarta, Indonesia: Elex Media, 2003.
[9] P. J. Bickel and K. A. Doksum, Mathematical Statistics: Basic Ideas and Selected Topics, vol. 1, 2 vols. CRC Press, 2015.
[10] T. Suyatno, Kelembagaan Perbankan. Jakarta, Indonesia: PT Gramedia Pustaka Utama, 2007.
[11] V. Ukani, "Loan eligibility dataset," 2020. [Online]. Available: https://www.kaggle.com/datasets/vikasukani/loan-eligible-dataset. [Accessed: 04-Nov-2022].
[12] V. Ukani, "Loan eligibility machine learning," 2020. [Online]. Available: https://www.kaggle.com/code/vikasukani/loan-eligibility-prediction-machine-learning. [Accessed: 16-Oct-2022].