NTCC Seminar Sem6 Prachi Kumari A35400719009

1. INTRODUCTION
Many people today rely on banks to finance their goals, so the number of loan applications has grown
rapidly in recent years. As applications increase, so does risk, because lending money always carries the
chance of default. Banks are therefore careful about whom they lend to. Even after taking precautions,
banks often make incorrect decisions, so there is a clear need to automate this process in order to
reduce the chance of loan failure.
Artificial Intelligence (AI) is an emerging technology that solves many real-life problems. Machine learning
is one AI technique that is widely used for building predictive models. Loan prediction is helpful for both
bank employees and customers. The data set is typically divided into training and testing sets in a 70:30 or
80:20 ratio, and various algorithms are then applied to find the model with the highest accuracy. This paper
applies machine learning algorithms to the problems of loan default and loan approval in the banking sector.
It comprises a literature review of 11 papers, followed by a model implemented on the basis of those papers,
with an accuracy of 73% for logistic regression and 70% for decision tree under 5-fold cross-validation.
The main aim of this research paper is to provide a fast and smooth way of identifying deserving customers.
Banks grant loans based on details such as Gender, Marital Status, Education, Number of Dependents, Income,
Loan Amount, Credit History and others. This project uses data on past customers of different banks whose
loans were approved on the basis of a set of parameters. The machine learning model is trained on those
records to obtain accurate results. The main objective of this project is to predict whether a loan is safe.
1.1 MOTIVATION
Loan approval is a vital process for banking organizations. The system approves or rejects loan applications.
Recovery of loans is a major contributing factor in the financial statements of a bank, and it is very
difficult to predict whether a customer will repay a loan. In this project, machine learning is used to make
that prediction.
1.2 Title & Objective of the study

The goal of this project is to predict whether a loan will default, using real financial data only, and hence
whether investors should lend to a customer. Data from 2007-2015 will be used because most of the loans from
that period have already been repaid or defaulted on.
1.3 Need of the Study

Nowadays, obtaining loans from financial institutions has become very common. Every day many people apply for
loans for a variety of purposes. However, not all applicants are reliable, and not everyone can be granted a
loan. Every year there are cases where people do not repay most of the loan amount to the bank, which results
in huge financial losses. The risk associated with a loan approval decision is therefore large. This project
gathers loan data from the Lending Club website and applies machine learning techniques to it to extract
important information and predict whether a customer will be able to repay the loan. In other words, the aim
is to predict whether the customer will be a defaulter or not.
2. LITERATURE REVIEW
1. A study on predicting loan default based on the random forest algorithm. Lin Zhu, Dafeng Qiu, Daji Ergu,
Cai Ying, Kuiyi Liu. (ITQM 2019)
The data was collected from Lending Club for the first quarter of 2019. Random forest, a supervised learning
algorithm, was used for prediction, and performance was measured by accuracy, AUC, F1-score and recall.
Random forest was selected because the data set had a large amount of missing data, and random forest can
maintain accuracy even when a large proportion of a large data set is missing. The outcome was also predicted
with other machine learning algorithms, namely decision tree, logistic regression and SVM. Random forest and
decision tree performed comparably, and both outperformed the support vector machine (75%) and logistic
regression (73%). Random forest worked best with 98% accuracy, followed by decision tree at 95%. Precision
and recall were both above 0.95, which indicates a strong ability to generalize.
The reviewed research was supported by grants from the National Natural Science Foundation of China
(#U1811462, #71774134 and #71373216) and in part by the Innovation Scientific Research Program for Graduates
in Southwest Minzu University (No. CX2018SZ158).
2. Recognizing and Predicting the Non-Performing Loans of Commercial Banks. Zhang Yu (1,2), Guan
Yongsheng (1,3), Yu Gang (1) and Lu Haixia (1). (1) School of Management, Harbin Institute of Technology,
P.R. China; (2) School of Economics, Harbin University of Science and Technology, P.R. China;
(3) LongJiang Bank, Heilongjiang Province, P.R. China. Hester0524@163.com
In this paper the data was first cleaned by deleting some features with missing values from the raw database.
Missing values of some attributes, such as customers' ages, were filled with the mean value. Global constants
were also used to replace missing values; for example, null was used where values were missing. Lastly, the
remaining dirty data was cleaned.
After data cleaning and preprocessing, the training data set had 1490 samples: 742 normal, 213 attention,
385 substandard and 150 loss samples. The testing data set had 624 samples from the raw data set: 247 normal,
110 attention, 192 substandard and 75 loss samples.
The classifiers adopted were decision tree, Naïve Bayes and SVM. For the decision tree, the average
classification accuracy rate was 94.3%; the classification accuracy rate was 98.7% for Class 0, 76.8% for
Class 1, 62.6% for Class 2 and 67% for Class 4.
The result of Naive Bayes is:
Class          TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
{0}            0.858    0.405    0.943      0.858   0.898      0.794
{2}            0.307    0.096    0.165      0.307   0.214      0.742
{1}            0.591    0.043    0.316      0.591   0.412      0.896
{4}            0.578    0.011    0.556      0.578   0.566      0.913
Weighted Avg.  0.811    0.367    0.868      0.811   0.835      0.797
The initial step in building an SVM model is to construct the sample set through analysis and then to choose
the type of kernel function. The reviewed paper adopts the RBF function as the inner-product kernel function
to build the model. On a model with 10341 loan samples, the identification accuracy rate is 99.2895%.
3. Loan Prediction Using Logistic Regression in Machine Learning. S. Sreesouthry, A. Ayubkhan, M. Mohamed
Rizwan, D. Lokesh, K. Prithivi Raj (2021).
In this paper logistic regression was used to build the prediction model. The prediction process starts with
data collection and cleaning, imputation of missing values and exploratory analysis of the data set, followed
by model building and evaluation on test data. The best-case accuracy obtained on the original data set is
0.77. The analysis leads to the following conclusions. Applicants with the lowest credit scores are unlikely
to get loan approval, owing to a higher risk of not repaying the loan. Applicants who have a high income and
request a lower loan amount are more likely to be approved, which makes sense, as they are more likely to
repay their loans. Other characteristics, such as gender and marital status, do not appear to be considered.
4. DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING. Aboobyda Jafar Hamid (1) and Tarig
Mohammed Ahmed (2). (1, 2) Department of Computer Sciences, University of Khartoum, Sudan (March 2016)
The algorithms used in this research paper were J48, BayesNet and NaiveBayes. Each of these is a
classification algorithm that produces a model for predicting the class of unknown records.
The usual steps for each of these classifiers are the following:
• Prepare the training set.
• Build the model.
• Apply the model to a test set whose class is unknown.
• Evaluate the accuracy of the model.
Of all the algorithms used, J48 showed the best result.
5. MCDM APPROACH TO EVALUATING BANK LOAN DEFAULT MODELS. Gang Kou (a), Yi Peng (b), Chen Lu (b).
(a) School of Business Administration, Southwestern University of Finance and Economics, Chengdu, China;
(b) School of Management and Economics, University of Electronic Science and Technology of China, Chengdu,
610054, P. R. China. Received 25 March 2013; accepted 10 November 2013
In this research paper, data preprocessing and data cleaning were done first, followed by feature selection,
sampling (SMOTE), classification, and evaluation with measures such as AUC and ROC.
The paper focused on feature selection, imbalanced data, and model evaluation in bank loan default
prediction, and developed a process to address these three issues. First, ICA and PCA were used to select
important features. Second, the SMOTE approach was used to deal with the imbalanced data by creating
synthetic default examples. Third, TOPSIS, a multiple criteria decision making (MCDM) method, was used to
rank a selection of default prediction models.
An experimental study using a large real-world bank loan default dataset provided by a Chinese commercial
bank was conducted to validate the proposed process. PCA and ICA produced two separate datasets, which were
used in the sampling, classification, and model evaluation steps. After feature selection, PCA and ICA
reduced the dimensionality from 104 to 31 and 32, respectively. The reduced datasets were then balanced using
the SMOTE approach and classified with five selected classification algorithms. The performance of the
classifiers was measured using accuracy, Type-I error rate, Type-II error rate, and AUC. Finally, TOPSIS was
applied to the PCA and ICA datasets to evaluate the classifiers based on the values of the performance
measures. The experimental results showed that K-NN has good potential in default prediction. The results
also indicated that there is no significant difference between PCA and ICA for default prediction on this
particular dataset.
6. Analysis of Loan Availability using Machine Learning Techniques. Article, September 2021. Sharayu
Dosalwar, Rahul Sannat, Nitin Pise and Ketki Kinkar, Dr. Vishwanath Karad MIT World Peace University, Pune.
The methodology adopted in this research paper was: loan ID dataset, preprocessing, data framing, various
regression algorithms, prediction results and performance analysis. Seven different algorithms were used:
logistic regression (78% accuracy), decision tree classification (66%), K-nearest neighbours (61%), naïve
Bayes (77%), random forest classification (77%), support vector machine (65%) and XGBoost classifier (77%).
The Logistic Regression model could have been replaced with the Linear Regression model if the output
variable had been continuous. A different model must be used when the outcome variable is not continuous but
dichotomous, and various models have been developed to handle the dichotomous nature of such an outcome
variable. Because of its mathematical clarity and flexibility, the Logistic Regression model was chosen over
the other models.
7. Customer Loan Prediction Using Supervised Learning Technique. L. Udaya Bhanu (1), Dr. S. Narayana (2).
International Journal of Scientific and Research Publications, Volume 11, Issue 6, June 2021, p. 403,
ISSN 2250-3153. http://dx.doi.org/10.29322/IJSRP.11.06.2021.p11453, www.ijsrp.org. (1) M.Tech Student,
(2) Professor & Mentor, Dept. of Computer Science & Engineering, Gudlavalleru Engineering College,
Gudlavalleru, Andhra Pradesh, India
In this paper, several classification algorithms were implemented to predict customer loans: Logistic
Regression (73% accuracy), Random Forest (83%), KNN (59%), SVM (78%) and Decision Tree Classifier (72%). Of
these five classifiers, random forest had the highest accuracy.
The work first collects the data and summarizes it with the help of .describe(), then analyzes the data and
checks for any missing, invalid or noisy values in the dataset, evaluates the confusion matrix metrics
(accuracy, precision, recall, f1-score) and finally builds the model. Validation procedures are designed to
detect errors in the data at a lower level of detail. Data validations have been included in the system in
almost every area where there is a possibility for the user to make errors. The system does not accept
invalid data. Whenever invalid data is entered, the system immediately prompts the user, who must enter the
data again; the system accepts the data only if it is correct. Validations are included wherever necessary.
The system is designed to be user-friendly; in other words, it has been designed to communicate effectively
with the user, and it uses popup menus.
8. Loan Approval Prediction Using Machine Learning. (1-3) Student, Dept. of Computer Engineering, ISB&M
College of Engineering, Pune, India. International Research Journal of Engineering and Technology (IRJET),
Volume: 08, Issue: 05, May 2021, p. 1741, e-ISSN: 2395-0056, p-ISSN: 2395-0072, www.irjet.net
In this paper the authors studied 10 research papers and concluded that the prediction process starts from
cleaning and processing of data and imputation of missing values, followed by exploratory analysis of the
data set, then model building, model evaluation and testing on test data. The best-case accuracy obtained on
the original data set is 0.811. The following conclusions were reached after analysis: applicants whose
credit score was worst will fail to get loan approval, because of a higher probability of not repaying the
loan amount. In most cases, applicants who have a high income and request a lower loan amount are more likely
to get approved, which makes sense, as they are more likely to repay their loans. Other characteristics, such
as gender and marital status, do not seem to be taken into consideration by the company.
9. Loan Approval Prediction based on Machine Learning Approach. Kumar Arun, Garg Ishan, Kaur Sanmeet
(sh.arun.rana@gmail.com), CSED, Thapar University, India. IOSR Journal of Computer Engineering (IOSR-JCE),
e-ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 18, Issue 3, Ver. I (May-Jun. 2016), PP 79-81,
www.iosrjournals.org, DOI: 10.9790/0661-1803017981
In this research paper, parameters were set for several algorithms: decision tree, random forest, support
vector machine, linear model, AdaBoost and neural network.
From a proper analysis of the positive points and constraints of the component, it can be safely concluded
that the product is a highly efficient component. The application works properly and meets all banker
requirements, and the component can easily be plugged into many other systems.
There have been a number of cases of computer glitches and errors in content, and the most important weight
of features is fixed in the automated prediction system. In the near future the software could therefore be
made more secure, reliable and capable of dynamic weight adjustment. This prediction module could also be
integrated with the module of an automated processing system. The system is trained on an old training
dataset; in future, the software can be made such that new testing data also takes part in the training data
after some fixed time.
10. THE LOAN PREDICTION USING MACHINE LEARNING. Dr. C K Gomathy, Ms. Charulatha, Mr. AAkash, Ms. Sowjanya.
IRJET, Issue: 10, Oct 2021, p. 1322, p-ISSN: 2395-0072, www.irjet.net
In this paper only a decision tree was implemented to obtain the result. The authors describe different
ensemble methods for binary classification and also for multi-class classification. The new ensemble
technique described by the authors is COB, which gives effective classification performance but is
compromised by noise and outlier data. Finally, they concluded that the ensemble-based algorithm improves
the results on the training data set.
11. A Model to Predict Loan Defaulters using Machine Learning. R. B. Saroo Raj (1), Gurpartap Singh (2),
Balaji S (3), K. H. Ajit Baskar (4). International Journal of Emerging Technologies in Engineering Research
(IJETER), Volume 6, Issue 10, October (2018), p. 64, ISSN: 2454-6410, www.ijeter.everscience.org.
(1) Assistant Professor, (2, 3, 4) Computer Science Department, SRM Institute of Science and Technology,
Chennai, India
In this research paper, the loan assessment process means the set of steps taken to decide whether or not to
grant a loan to a customer. When a customer submits a loan application, the bank officer should investigate
the so-called 5 C's: Character (or Credit History), Cash Flow (or Capacity), Collateral, Capitalization and
Conditions. This is useful for assessing a loan application and is regarded as a sound framework for checking
the credit risk that a bank may face.
Two algorithms, J48 and Naïve Bayes, were compared. J48 was judged the better algorithm, with a reported
accuracy of 78.37% against 78.87% for Naïve Bayes. According to the paper, J48 is best because it has high
accuracy and low mean absolute error, as shown in its results, and it classifies the instances more correctly
than the other technique. The confusion matrices of the two algorithms also showed that J48 is the better one.
3. Methodology
Machine learning with Python was chosen because Python is simple and easy to read. It is also widely regarded
as a good language for AI.
The model is trained on labelled data and then tested, i.e. supervised learning has been used.
The modules for creating the model are as follows.
1. Importing libraries and dataset
2. Univariate Analysis
3. Bivariate Analysis
4. Missing Values and Outliers Treatment
5. Feature Engineering
6. Splitting Training and Testing data set
7. Model Building
3.1 System requirements
Hardware used
• Intel Core i5-8250U 1.80 GHz processor
• 1 TB hard disk drive
• 8 GB RAM
• O.S.: Windows 10 Enterprise
• 64-bit processor
Software Used
Tools:
Python 3.7.2, Jupyter Notebook, Numpy, Pandas, Matplotlib, Seaborn, Scikit-learn, Scipy
Techniques:
Logistic regression, Decision Tree
4. IMPLEMENTATION
4.1 Importing the libraries and reading the data set:
Figure 4.1 Importing and reading the dataset
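The figure itself is not reproduced in the text, so the following is only a minimal sketch of what an import-and-read step of this kind might look like with pandas. The in-memory CSV text, the `load_dataset` helper and the sample rows (including the Loan_ID values) are made up for illustration; the real project would pass the path of the downloaded file instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded training file. In the real project
# a file path would be passed to load_dataset instead of this in-memory text.
SAMPLE_CSV = """Loan_ID,Gender,Married,ApplicantIncome,LoanAmount,Credit_History,Loan_Status
ID001,Male,No,5849,,1,Y
ID002,Male,Yes,4583,128,1,N
ID003,Female,Yes,3000,66,0,Y
"""


def load_dataset(source):
    """Read a CSV file (path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


train = load_dataset(io.StringIO(SAMPLE_CSV))

print(train.shape)   # (rows, columns)
print(train.head())  # first few records
```

An empty field in the CSV, such as the first LoanAmount above, is read as a missing value (NaN), which is what the later missing-value treatment step has to deal with.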
4.2 Understanding the data:
The procedure can be divided into several stages.
The first stage is data gathering, in which the dataset is downloaded. This data is used to train and test
the machine learning model. The downloaded dataset is initially raw and unstructured. The train dataset has
12 independent variables and 1 target variable, Loan_Status. The test dataset has the same features as the
train dataset except for Loan_Status.
Figure 4.2 Understanding the Data set
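The relationship between the train and test sets described in this section can be checked programmatically. The sketch below uses two tiny made-up frames (only four of the independent variables, with hypothetical values) to show how the target column can be separated and how to confirm that Loan_Status is the only column missing from the test set.

```python
import pandas as pd

TARGET = "Loan_Status"

# Tiny made-up frames standing in for the real train and test files.
train = pd.DataFrame({
    "Loan_ID": ["ID1", "ID2", "ID3"],
    "Gender": ["Male", "Female", "Male"],
    "ApplicantIncome": [5849, 4583, 3000],
    "Credit_History": [1.0, 1.0, 0.0],
    TARGET: ["Y", "N", "Y"],
})
test = train.drop(columns=[TARGET])

# Independent variables are every column except the target.
feature_columns = [c for c in train.columns if c != TARGET]

# Columns present in train but absent from test.
only_in_train = set(train.columns) - set(test.columns)

print(train.dtypes)
print("features:", feature_columns)
print("only in train:", only_in_train)
```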
4.3 Univariate Analysis:
The loans of 422 applicants out of 614 were approved; the remaining 192 were not.
The approval rate is around 69%.
Figure 4.3 Univariate Analysis
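The approval rate quoted in this section can be recomputed from the class counts. The sketch below assumes, as a reading of the two figures given in the text, that 422 applications were approved and 192 were not; the Series is rebuilt from those counts rather than read from the real file.

```python
import pandas as pd

# Rebuild the target column from the two counts quoted in the text
# (assumed to be 422 approved and 192 not approved).
loan_status = pd.Series(["Y"] * 422 + ["N"] * 192, name="Loan_Status")

counts = loan_status.value_counts()                # absolute frequencies
shares = loan_status.value_counts(normalize=True)  # relative frequencies

print(counts)
print(shares.round(2))
```

The same two calls, applied to any categorical column of the real data set, give the numbers behind a univariate bar plot.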
4.3.1. Categorical Variable:
It can be inferred from the above bar plots that:
• 80% of the applicants in the dataset are male.
• Around 65% of the applicants in the dataset are married.
• Around 15% of the applicants in the dataset are self-employed.
• Around 85% of the applicants have repaid their debts.
Figure 4.3.1 Categorical Variable
4.3.2 Ordinal Variable
The following inferences can be made from the above bar plots:
• Most of the applicants don’t have any dependents.
• Around 80% of the applicants are graduates.
• Most of the applicants are from semiurban areas.
Figure 4.3.2 Ordinal variable
4.3.3. Numerical Variables
• Most of the applicant income values lie towards the left of the distribution, i.e. the distribution is
right-skewed and not normal.
• The boxplot confirms the presence of a lot of outliers/extreme values. This can be attributed to
the income disparity in society.
There are a higher number of graduates with very high incomes, which appear as the outliers.
Figure 4.3.3.a Right-skewed distribution and outliers
• The majority of coapplicant incomes range from 0 to 5000.
• There are also a lot of outliers in the coapplicant income, and it is not normally distributed.
• Outliers are present, while the distribution is fairly normal.
Figure 4.3.3.b Normal distribution and Outliers
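The two observations made from the distribution plots and boxplots, skewness and the presence of outliers, can also be checked numerically. The sketch below uses made-up income values and the common 1.5 x IQR boxplot rule; the `iqr_outliers` helper is a hypothetical name, and the report does not state which outlier rule its plots rely on.

```python
import pandas as pd


def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return values[mask]


# Made-up incomes: most are modest, two are extreme.
income = pd.Series(
    [2500, 3000, 3200, 3500, 4000, 4200, 4500, 5000, 40000, 81000],
    name="ApplicantIncome",
)

# A positive skewness means a long right tail (data concentrated on the left).
print("skewness:", income.skew())
print(iqr_outliers(income))
```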
4.4 Bivariate Analysis
a. Categorical vs Target variable
• The proportion of male and female applicants is more or less the same for both approved and
unapproved loans.
• The proportion of married applicants is higher for the approved loans.
• The distribution of applicants with 1 or 3+ dependents is similar across both categories of Loan_Status.
• Nothing significant can be inferred from the Self_Employed vs Loan_Status plot.
• Applicants with a credit history of 1 appear more likely to get their loans approved.
• The proportion of loans approved in semiurban areas is higher than in rural or urban areas.
Figure 4.4 a Categorical vs Target variable
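A comparison of a categorical variable against the target can be tabulated with a row-normalised cross-tabulation, which gives the proportions a stacked bar plot would show. The sketch below does this for Credit_History on a small made-up sample; the values are illustrative only.

```python
import pandas as pd

# Made-up sample: credit history flag and loan outcome for eight applicants.
df = pd.DataFrame({
    "Credit_History": [1, 1, 1, 1, 0, 0, 1, 0],
    "Loan_Status": ["Y", "Y", "Y", "N", "N", "N", "Y", "Y"],
})

# Row-normalised table: share of approved and rejected loans within each
# credit-history group (each row sums to 1).
table = pd.crosstab(df["Credit_History"], df["Loan_Status"], normalize="index")

print(table)
```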
b. Numerical vs Target variable
There is no visible difference between the mean income of people whose loans were approved and the mean
income of people whose loans were not approved.
The applicant income variable is divided into bins based on its values, and the loan status is analysed for
each bin. It can be inferred that applicant income does not affect the chances of loan approval.
The plot also suggests that when the coapplicant’s income is lower, the chances of loan approval are higher.
Figure 4.4 b Numerical vs Target variable
• The proportion of loans approved for applicants with Low Total_Income is much smaller than
that for applicants with Average, High and Very High income.
• The proportion of approved loans is higher for Low and Average loan amounts than for High loan
amounts.
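The binning used for the numerical variables can be sketched with `pandas.cut`. The bin labels below follow the ones named in the text (Low, Average, High, Very High); the bin edges and the sample rows are assumptions made for illustration, since the report does not list the edges it used.

```python
import pandas as pd

# Made-up sample of total incomes and loan outcomes.
df = pd.DataFrame({
    "Total_Income": [1800, 2200, 3000, 3900, 5000, 5500, 7000, 20000],
    "Loan_Status": ["N", "N", "Y", "N", "Y", "Y", "Y", "Y"],
})

bins = [0, 2500, 4000, 6000, 81000]  # assumed bin edges
labels = ["Low", "Average", "High", "Very High"]
df["Total_Income_bin"] = pd.cut(df["Total_Income"], bins=bins, labels=labels)

# Share of approved loans within each income bin.
df["approved"] = df["Loan_Status"] == "Y"
approval_rate = df.groupby("Total_Income_bin", observed=False)["approved"].mean()

print(approval_rate)
```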
4.5 Correlation and heatmap
Correlation is a statistical measure of the linear relationship between two variables. It can
also be described as a measure of dependence between two variables. When there are multiple variables
and the goal is to find the correlation between every pair of them, the results are stored in a matrix,
called the correlation matrix.
• A correlation heatmap is a graphical representation of the correlation matrix, showing the correlation
between different variables.
• The correlation value ranges from -1 to 1.
Inferences from the graph below:
• The most highly correlated variables are ApplicantIncome and LoanAmount.
• Credit_History and Loan_Status are also highly correlated.
• LoanAmount is also correlated with CoapplicantIncome.
Figure 4.5 Correlation
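A correlation matrix of this kind can be computed with `DataFrame.corr`. The sketch below uses six made-up rows in which, by construction, Loan_Status follows Credit_History exactly and LoanAmount grows with ApplicantIncome, so the resulting values only illustrate the mechanics and are not the report's results. The target is mapped to 0/1 first so that it can take part in the correlation.

```python
import pandas as pd

# Made-up sample; the relationships are built in for illustration.
df = pd.DataFrame({
    "ApplicantIncome": [2000, 3000, 4000, 6000, 9000, 12000],
    "CoapplicantIncome": [0, 1500, 0, 2000, 0, 3000],
    "LoanAmount": [60, 100, 110, 160, 240, 330],
    "Credit_History": [1, 0, 1, 1, 0, 1],
    "Loan_Status": ["Y", "N", "Y", "Y", "N", "Y"],
})

# Encode the target numerically so it can be correlated with the features.
df["Loan_Status"] = df["Loan_Status"].map({"N": 0, "Y": 1})

# Pearson correlation between every pair of columns; values lie in [-1, 1].
corr = df.corr()

print(corr.round(2))
```

The matrix can then be drawn as a heatmap, for example with `seaborn.heatmap(corr)`.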
4.6 Missing Values and Outliers Treatment
An outlier is a value far from the main group of values. A missing value is a blank value. Both are often
encountered when analysing large data sets.
Outliers and missing values are also called "abnormal values", "noise", "trash", "bad data" and "incomplete
data". They are often unwelcome because, when they are present in the data set, a clean statistical model
cannot be built or the software outputs an error.
Outliers and missing values are often removed as unnecessary data. However, removing them may discard
important information, because outliers and missing values express some facts. There are also cases in which
the reason or mechanism behind such data needs to be understood.
Figure 4.6 a Missing Value
Figure 4.6. b Missing value and outliers treated
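The text does not say how the missing values and outliers were treated, so the following is only a sketch of one common approach, stated here as an assumption: fill categorical gaps with the mode, fill numerical gaps with the median, and reduce the effect of extreme values with a log transformation instead of deleting them. The sample values and the `LoanAmount_log` column name are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing category and one missing number.
df = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [100.0, np.nan, 120.0, 700.0],
})

# Categorical column: fill with the most frequent value (mode).
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

# Numerical column: fill with the median, which is robust to outliers.
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())

# Outlier treatment: the log transformation compresses large values
# without removing the corresponding rows.
df["LoanAmount_log"] = np.log(df["LoanAmount"])

print(df)
```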
4.7 Feature Engineering
Feature engineering refers to the process of using domain knowledge to select and transform the most
relevant variables from raw data when creating a predictive model using machine learning or statistical
modelling.
• Combine the applicant income and coapplicant income into a total income.
• The distribution of the total income is concentrated towards the left, i.e. it is right-skewed.
• Take the log transformation to bring the distribution closer to normal.
• After the transformation, the distribution looks much closer to normal.
• Calculate the EMI (equated monthly instalment) as the ratio of the loan amount to the loan amount term.
• Calculate the income left after the EMI has been paid.
• Drop the variables that were used to create these new features, because the correlation between
the old features and the new features will be very high.
• Removing correlated features also helps to reduce noise.
Figure 4.7 a Normal Distribution
Figure 4.7 b removing correlated feature
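The feature engineering steps listed in this section can be sketched as follows. The column names `Loan_Amount_Term`, `Total_Income_log`, `EMI` and `Balance_Income` are assumptions, as is the convention that the loan amount is recorded in thousands (hence the factor of 1000 when subtracting the EMI from the income); the two sample rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample rows.
df = pd.DataFrame({
    "ApplicantIncome": [5000, 3000],
    "CoapplicantIncome": [0, 2000],
    "LoanAmount": [120, 360],        # assumed to be in thousands
    "Loan_Amount_Term": [360, 180],  # assumed column name, in months
})

# 1. Combine the two incomes and take the log to reduce right skew.
df["Total_Income"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
df["Total_Income_log"] = np.log(df["Total_Income"])

# 2. EMI as the ratio of loan amount to loan term.
df["EMI"] = df["LoanAmount"] / df["Loan_Amount_Term"]

# 3. Income left after the EMI has been paid (EMI converted from thousands).
df["Balance_Income"] = df["Total_Income"] - df["EMI"] * 1000

# 4. Drop the source columns, which are highly correlated with the new ones.
df = df.drop(columns=["ApplicantIncome", "CoapplicantIncome",
                      "LoanAmount", "Loan_Amount_Term"])

print(df)
```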
4.8 Model Building
• Loan ID is not a significant variable and is not required as a feature for building the model.
• Loan Status is the target variable, so it is separated from the features.
• Dummy variables are created for the categorical variables so that each category can be given to the
model as a separate feature.
The data set is scaled using StandardScaler from the sklearn.preprocessing library.
The data set is split into train and test data sets.
Logistic regression is implemented using sklearn.linear_model.
Figure 4.8 a Splitting of testing and training data set
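The model building steps described in this section can be sketched end to end on synthetic data. Everything about the data below is made up (random features and a target that mostly follows Credit_History), and the 70:30 split, the random seeds and the default LogisticRegression settings are assumptions, so the printed accuracy says nothing about the report's 73% result.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared data set.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Loan_ID": [f"ID{i}" for i in range(n)],
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Credit_History": rng.integers(0, 2, size=n),
    "Total_Income_log": rng.normal(8.5, 0.5, size=n),
})
# Target mostly follows credit history, with about 10% of labels flipped.
credit_ok = df["Credit_History"].to_numpy() == 1
noise = rng.random(n) < 0.1
df["Loan_Status"] = np.where(credit_ok ^ noise, "Y", "N")

# Drop the identifier, separate the target, and one-hot encode categories.
X = pd.get_dummies(df.drop(columns=["Loan_ID", "Loan_Status"]), dtype=float)
y = df["Loan_Status"].map({"N": 0, "Y": 1})

# Split first, then fit the scaler on the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train_scaled, y_train)
accuracy = model.score(X_test_scaled, y_test)

print("test accuracy:", round(accuracy, 3))
```

Fitting the scaler on the training split only is a deliberate choice in this sketch: it keeps information from the test split out of the preprocessing step.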
b. Predicting the values: precision, recall, F1 score
F1 is an overall measure of a model’s accuracy that combines precision and recall as their harmonic mean.
A good F1 score means that the model has few false positives and few false negatives. The F1 score is 1 for
a perfect model and 0 for a model that fails completely.
- Precision is 57% for class 0 and 80% for class 1.
- Recall for class 0 is 53%: of all the actual class 0 observations, only 53% were predicted as
class 0.
Figure 4.8.b precision, recall, f1 score, support
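To make the three metrics concrete, the sketch below computes precision, recall and F1 for class 1 by hand from a small made-up set of labels and compares the result with scikit-learn. The labels are illustrative only and do not reproduce the figures reported above.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up true and predicted labels for ten loans (1 = approved).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

# Counts for the positive class.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```

For per-class values in one table, `sklearn.metrics.classification_report(y_true, y_pred)` prints precision, recall, F1 and support for every class.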
4.9 Decision Tree Model
• A decision tree is a graphical representation of all the possible solutions to a problem/decision based
on given conditions.
The mean accuracy of the model is 70%.
Figure 4.9 Decision Tree Model
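A mean accuracy of this kind is typically obtained by averaging the scores of a k-fold cross-validation; the introduction mentions 5 folds. The sketch below shows that procedure on synthetic data. The data, the `max_depth=3` setting and the random seeds are assumptions, so the printed value is not the report's 70%.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: four numeric features, target driven mainly by the first.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=150) > 0).astype(int)

# Five stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
tree = DecisionTreeClassifier(max_depth=3, random_state=1)

# One accuracy value per fold; their mean is the reported figure.
scores = cross_val_score(tree, X, y, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()

print("fold accuracies:", scores.round(3))
print("mean accuracy:", round(mean_accuracy, 3))
```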
5. Interpretation
Arranging the data
Figure: 5.1 Data arrangement
Coefficient Plot
Figure 5.2 Coefficient Plot
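The data arrangement and coefficient plot are shown only as figures, so the following is a guess at the underlying step rather than the report's own code: the coefficients of a fitted logistic regression are paired with the feature names and sorted, which is the form a coefficient plot needs. The data is synthetic, with the target built to follow Credit_History, and the feature names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic features; only Credit_History actually drives the target.
rng = np.random.default_rng(2)
n = 300
X = pd.DataFrame({
    "Credit_History": rng.integers(0, 2, size=n).astype(float),
    "Total_Income_log": rng.normal(size=n),
    "EMI": rng.normal(size=n),
})
base = X["Credit_History"].to_numpy().astype(int)
flip = rng.random(n) < 0.1          # flip about 10% of the labels
y = np.where(flip, 1 - base, base)

model = LogisticRegression()
model.fit(X, y)

# Arrange the coefficients by feature name, smallest to largest.
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()

print(coefs)
```

With matplotlib installed, `coefs.plot(kind="barh")` draws the sorted coefficients as a horizontal bar chart.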
6. Conclusion
After completing this project I concluded that:
• Logistic regression can be used to build this type of prediction model.
• Throughout this project I used machine learning to train the model. The model I created predicts
loan status with a mean accuracy on test data of almost 70% for decision tree and 73% for
logistic regression.
• To make such models accurate, a large amount of data is needed to train the model, as well as good
systems and engineers.
7. Future Scope
• A successful model was built using machine learning algorithms, but better results might be
obtained using other machine learning algorithms. Various defects such as missing values, incorrect
data and outliers were seen in the dataset. Adding new data points and correct values for those
features and then predicting again could give better results, as the decision tree results were not
satisfactory compared with the other research papers.
• The data set needs to be updated on a regular basis. This will make the predictions more
accurate.
• In future, more models of this type will be available, but we need to ensure that their quality does
not degrade, which requires more research in this field.
8. REFERENCES and BIBLIOGRAPHY
1. International Advanced Research Journal in Science, Engineering and Technology
2. International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 08 Issue: 04 | Apr 2021 www.irjet.net p-ISSN: 2395-0072
3. https://www.academia.edu/39186582/Loan_Default_Prediction_using_Machine_Learning_
Techniques
4. http://sersc.org/journals/index.php/IJAST/article/view/460/423
5. A study on predicting loan default based on the random forest algorithm. Lin Zhua , Dafeng Qiua , Daji
Ergua,* , Cai Yinga , Kuiyi Liu. (ITQM 2019)
6. Recognizing and Predicting the Non-Performing Loans of Commercial Banks Zhang Yu1,2 , Guan
Yongsheng1,3 , Yu Gang1 and Lu Haixia1 1 School of Management, Harbin Institute of Technology,
P.R. China 2 School of Economics, Harbin University of Science and Technology, P.R. China
3LongJiang Bank, Province Heilongjiang, P.R. China Hester0524@163.coml
7. Loan Prediction Using Logistic Regression in Machine Learning S. Sreesouthry1 , A. Ayubkhan2 , M.

Mohamed Rizwan3 , D. Lokesh4 , K. Prithivi Raj5 (2021).
8. DEVELOPING PREDICTION MODEL OF LOAN RISK IN BANKS USING DATA MINING

Aboobyda Jafar Hamid1 and Tarig Mohammed Ahmed2 1Department of Computer
Sciences ,University Khartoum, Sudan 2Department of Computer Sciences ,University Khartoum,
Sudan (March 2016)
9. MCDM APPROACH TO EVALUATING BANK LOAN DEFAULT MODELS Gang Koua , Yi Pengb
, Chen Lub a School of Business Administration, Southwestern University of Finance and Economics,
Chengdu, China b School of Management and Economics, University of Electronic Science and
Technology of China, Chengdu, 610054, P. R. China Received 25 March 2013; accepted 10 November
2013
10. Analysis of Loan Availability using Machine Learning Techniques Article · September 2021. Sharayu
Dosalwar Dr. Vishwanath Karad MIT World Peace University, Rahul Sannat Dr. Vishwanath Karad
MIT World Peace University, Pune, Nitin Pise Dr. Vishwanath Karad MIT World Peace University,
Ketki Kinkar MIT World Peace University.
11. Customer Loan Prediction Using Supervised Learning Technique. L. Udaya Bhanu, Dr. S. Narayana.
Dept. of Computer Science & Engineering, Gudlavalleru Engineering College, Gudlavalleru, Andhra
Pradesh, India. International Journal of Scientific and Research Publications, Volume 11, Issue 6,
June 2021, p. 403. ISSN 2250-3153. http://dx.doi.org/10.29322/IJSRP.11.06.2021.p11453 www.ijsrp.org
12. Loan Approval Prediction Using Machine Learning. Students, Dept. of Computer Engineering, ISB&M
College of Engineering, Pune, India. International Research Journal of Engineering and Technology
(IRJET), Issue 05, May 2021, p. 1741. p-ISSN: 2395-0072. www.irjet.net
13. Loan Approval Prediction based on Machine Learning Approach. Kumar Arun, Garg Ishan, Kaur
Sanmeet. CSED, Thapar University, India. IOSR Journal of Computer Engineering (IOSR-JCE), Volume 18,
Issue 3, Ver. I (May-Jun. 2016), pp. 79-81. e-ISSN: 2278-0661, p-ISSN: 2278-8727.
DOI: 10.9790/0661-1803017981. www.iosrjournals.org
14. The Loan Prediction Using Machine Learning. Dr. C K Gomathy, Ms. Charulatha, Mr. AAkash,
Ms. Sowjanya. International Research Journal of Engineering and Technology (IRJET), Issue 10,
Oct 2021, p. 1322. p-ISSN: 2395-0072. www.irjet.net
15. A Model to Predict Loan Defaulters using Machine Learning. R. B. Saroo Raj, Gurpartap Singh,
Balaji S, K. H. Ajit Baskar. Computer Science Department, SRM Institute of Science and Technology,
Chennai, India. International Journal of Emerging Technologies in Engineering Research (IJETER),
Volume 6, Issue 10, October 2018, p. 64. ISSN: 2454-6410. www.ijeter.everscience.org
16. The Elements of Statistical Learning. Hastie, Tibshirani, and Friedman. Springer.
17. Pattern Recognition and Machine Learning. Christopher Bishop.
18. Introduction to Machine Learning. Ethem Alpaydin. MIT Press, 2004.