Download as pdf or txt
Download as pdf or txt
You are on page 1of 11

Credit / Home Loans

Standard Bank is embracing the digital transformation wave and intends to use new and exciting technologies to
give their customers a complete set of services from the convenience of their mobile devices. As Africa’s
biggest lender by assets, the bank aims to improve the current process in which potential borrowers apply for a
home loan. The current process involves loan officers having to manually process home loan applications. This
process takes 2 to 3 days to process upon which the applicant will receive communication on whether or not
they have been granted the loan for the requested amount. To improve the process Standard Bank wants to
make use of machine learning to assess the credit worthiness of an applicant by implementing a model that will
predict if the potential borrower will default on his/her loan or not, and do this such that the applicant receives a
response immediately after completing their application.

You will be required to follow the data science lifecycle to fulfill the objective. The data science lifecycle
(https://www.datascience-pm.com/crisp-dm-2/) includes:

Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment

Part One

The Home Loans Department manager wants to know the following:

1. An overview of the data. (HINT: Provide the number of records, fields and their data types. Do for both).
2. What data quality issues exist in both train and test? (HINT: Comment any missing values and duplicates)
3. How do the the loan statuses compare? i.e. what is the distrubition of each?
4. How do women and men compare when it comes to defaulting on loans in the historical dataset?
5. How many of the loan applicants have dependents based on the historical dataset?
6. How do the incomes of those who are employed compare to those who are self employed based on the
historical dataset?
7. Are applicants with a credit history more likely to default than those who do not have one?
8. Is there a correlation between the applicant's income and the loan amount they applied for?

Part Two

Develop a machine learning model - Logistic Regression and decision tree to check the accuracy of both

Imporing Libraries
In [262]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
# import warnings
# warnings.simplefilter("ignore")

Import Dataset
In [263]:
train = pd.read_csv(r"/kaggle/input/loan-prediction-problem-dataset/train_u6lujuX_CVtuZ9i
.csv")
test = pd.read_csv(r"/kaggle/input/loan-prediction-problem-dataset/test_Y3wMUE5_7gLdaTN.c
sv")

Part One

EDA
In [264]:
train.head(10)

Out[264]:

Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Am

0 LP001002 Male No 0 Graduate No 5849 0.0 NaN

1 LP001003 Male Yes 1 Graduate No 4583 1508.0 128.0

2 LP001005 Male Yes 0 Graduate Yes 3000 0.0 66.0

Not
3 LP001006 Male Yes 0 No 2583 2358.0 120.0
Graduate

4 LP001008 Male No 0 Graduate No 6000 0.0 141.0

5 LP001011 Male Yes 2 Graduate Yes 5417 4196.0 267.0

Not
6 LP001013 Male Yes 0 No 2333 1516.0 95.0
Graduate

7 LP001014 Male Yes 3+ Graduate No 3036 2504.0 158.0

8 LP001018 Male Yes 2 Graduate No 4006 1526.0 168.0

9 LP001020 Male Yes 1 Graduate No 12841 10968.0 349.0

In [265]:
test.head(10)
Out[265]:

Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Am

0 LP001015 Male Yes 0 Graduate No 5720 0 110.0

1 LP001022 Male Yes 1 Graduate No 3076 1500 126.0

2 LP001031 Male Yes 2 Graduate No 5000 1800 208.0

3 LP001035 Male Yes 2 Graduate No 2340 2546 100.0

Not
4 LP001051 Male No 0 No 3276 0 78.0
Graduate

Not
5 LP001054 Male Yes 0 Yes 2165 3422 152.0
Graduate

Not
6 LP001055 Female No 1 No 2226 0 59.0
Graduate

Not
7 LP001056 Male Yes 2 No 3881 0 147.0
Graduate

8 LP001059 Male Yes 2 Graduate NaN 13633 0 280.0

Not
9 LP001067 Male No 0 No 2400 2400 123.0
Graduate

In [266]:
In [266]:
train.shape
Out[266]:

(614, 13)

In [267]:
# we concat for easy analysis
n = train.shape[0] # we set this to be able to separate the
df = pd.concat([train, test], axis=0)
df.head()
Out[267]:

Loan_ID Gender Married Dependents Education Self_Employed ApplicantIncome CoapplicantIncome LoanAmount Loan_Am

0 LP001002 Male No 0 Graduate No 5849 0.0 NaN

1 LP001003 Male Yes 1 Graduate No 4583 1508.0 128.0

2 LP001005 Male Yes 0 Graduate Yes 3000 0.0 66.0

Not
3 LP001006 Male Yes 0 No 2583 2358.0 120.0
Graduate

4 LP001008 Male No 0 Graduate No 6000 0.0 141.0

Own EDA

1. An overview of the data. (HINT: Provide the number of records, fields and their data types. Do for both).

In [268]:
train.info()
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Loan_ID 614 non-null object
1 Gender 601 non-null object
2 Married 611 non-null object
3 Dependents 599 non-null object
4 Education 614 non-null object
5 Self_Employed 582 non-null object
6 ApplicantIncome 614 non-null int64
7 CoapplicantIncome 614 non-null float64
8 LoanAmount 592 non-null float64
9 Loan_Amount_Term 600 non-null float64
10 Credit_History 564 non-null float64
11 Property_Area 614 non-null object
12 Loan_Status 614 non-null object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 367 entries, 0 to 366
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Loan_ID 367 non-null object
1 Gender 356 non-null object
2 Married 367 non-null object
3 Dependents 357 non-null object
4 Education 367 non-null object
5 Self_Employed 344 non-null object
6 ApplicantIncome 367 non-null int64
7 CoapplicantIncome 367 non-null int64
7 CoapplicantIncome 367 non-null int64
8 LoanAmount 362 non-null float64
9 Loan_Amount_Term 361 non-null float64
10 Credit_History 338 non-null float64
11 Property_Area 367 non-null object
dtypes: float64(3), int64(2), object(7)
memory usage: 34.5+ KB

The data consists of 981 observations and 13 variables, with 8 of them being categorical and 5 numerical. 614 of
the observations are designated for training, while the remaining are for testing.

2. What data quality issues exist in both train and test? (HINT: Comment any missing values and
duplicates)

In [269]:
print(
f'''Duplicate Value in train dataset
{train.duplicated().value_counts()}
Duplicate value in test dataset
{test.duplicated().value_counts()}'''
)

Duplicate Value in train dataset


False 614
dtype: int64
Duplicate value in test dataset
False 367
dtype: int64

In [270]:
print(train.isna().sum())
print(f'Empty value in the entire train dataset: {train.isna().sum().sum()}')

Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64
Empty value in the entire train dataset: 149

In [271]:
print(test.isna().sum())
print(f'Empty value in the entire test dataset: {test.isna().sum().sum()}')

Loan_ID 0
Gender 11
Married 0
Dependents 10
Education 0
Self_Employed 23
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 5
Loan_Amount_Term 6
Credit_History 29
Property_Area 0
dtype: int64
Empty value in the entire test dataset: 84
There are missing values in both training and testing datasets, which will be filled before using in machine
learning models. There are no duplicate values in the data.

3. How do the the loan statuses compare? i.e. what is the distrubition of each?

In [272]:
df['Loan_Status'].value_counts()
Out[272]:
Y 422
N 192
Name: Loan_Status, dtype: int64

In [273]:
sns.countplot(data=df, x='Loan_Status');

When we look at the credit status, we see that the weight of the people with a yes credit status in the data is
higher. If this ratio is too high, it can be a problem for machine learning algorithms. Fortunately, there is an
acceptable situation.

4. How do women and men compare when it comes to defaulting on loans in the historical dataset?

In [274]:
train.groupby('Gender')['Loan_Status'].value_counts(normalize=True )
Out[274]:
Gender Loan_Status
Female Y 0.669643
N 0.330357
Male Y 0.693252
N 0.306748
Name: Loan_Status, dtype: float64

In [275]:
ax = sns.countplot(data=train, x='Loan_Status', hue='Gender')
ax.set_title('Count of Male and Female');

In the data set, it is seen that men are more common among the observations with defaulting on loans.

5. How many of the loan applicants have dependents based on the historical dataset?

In [276]:
train['Dependents'].value_counts()
Out[276]:
0 345
1 102
2 101
3+ 51
Name: Dependents, dtype: int64

In [277]:
ax = sns.countplot(data = train, x="Dependents")
ax.set_title('Count of loan applicant by dependent');
245 of the loan applicants are obliged to look after someone. Out of 254, 102 has one dependent, 101 has 2
dependents and 51 has 3 or more than 3 dependents.

6. How do the incomes of those who are employed compare to those who are self employed based on the
historical dataset?

In [278]:
train.groupby('Self_Employed')['ApplicantIncome'].describe()
Out[278]:

count mean std min 25% 50% 75% max

Self_Employed

No 500.0 5049.748000 5682.895810 150.0 2824.50 3705.5 5292.75 81000.0

Yes 82.0 7380.817073 5883.564795 674.0 3452.25 5809.0 9348.50 39147.0

In [279]:
ax = sns.violinplot(
x='Self_Employed',
y='ApplicantIncome',
data=train,
)
ax.set_title('Comparision of applicants incone');

From the graph we can interpret that the incomes of those who are employed are higher on average.

7. Are applicants with a credit history more likely to default than those who do not have one?
In [280]:
train.groupby('Credit_History')['Loan_Status'].value_counts(normalize=True)
Out[280]:
Credit_History Loan_Status
0.0 N 0.921348
Y 0.078652
1.0 Y 0.795789
N 0.204211
Name: Loan_Status, dtype: float64

In [281]:
sns.catplot(x='Loan_Status',
data=train,
kind='count',
hue='Credit_History');

It has been observed that the number of defaults is higher for those with a credit history.

8. Is there a correlation between the applicant's income and the loan amount they applied for?

In [282]:

corr = train[["LoanAmount", "ApplicantIncome"]].corr()


ax = sns.heatmap(data=corr, annot=True, linewidth=0.5, linecolor='black')
ax.set_title("Correlation between applicant's income and loan amount");
This 0.57 correlation shows that there is not a strong relationship between the income of the person applying for
credit and the amount of credit applied for.

Part Two

Model preparation

Data Preparation
As we have seen in EDA, there are a lot of issue in the data. So, first of all we need to get red to useless columns
we will be going to drop 'Loan_ID'. We can see there is a lot of missing values in the dataset, we need to deal
with the missing values and we also need to convert dtype opject into numeric for training the model.

In [283]:
# Matrix of features
df = train.copy()
df.drop(columns = "Loan_ID", inplace = True)

In [284]:
#Filling the missing values

df.Loan_Status = np.where(df.Loan_Status.isna(), 'Test', df.Loan_Status)


df.Gender = df.Gender.fillna(df.Gender.mode()[0])
df.Married = df.Married.fillna(df.Married.mode()[0])
df.Dependents = df.Dependents.fillna(df.Dependents.mode()[0])
df.Self_Employed = df.Self_Employed.fillna(df.Self_Employed.mode()[0])
df.LoanAmount = df.LoanAmount.fillna(df.groupby('Education')['LoanAmount'].transform('me
dian'))
df.Loan_Amount_Term = df.Loan_Amount_Term.fillna(df.groupby('Education')['Loan_Amount_Te
rm'].transform('median'))
df.Credit_History = df.Credit_History.fillna(df['Credit_History'].median())

In [285]:
# Setting up of variables
X = df.iloc[:,:-1]
y = df.iloc[:,-1:]

### Handle Missing Values Here ###


df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mean())
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mean())

df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Married'] = df['Married'].fillna(df['Married'].mode()[0])
df['Dependents'] = df['Dependents'].fillna(df['Dependents'].mode()[0])
df['Self_Employed'] = df['Self_Employed'].fillna(df['Self_Employed'].mode()[0])

In [286]:
le = LabelEncoder()

# convert string(text) to number for Machine Learning


X['Gender'] = le.fit_transform(X['Gender'])
X['Married'] = le.fit_transform(X['Married'])
X['Education'] = le.fit_transform(X['Education'])
X['Dependents'] = le.fit_transform(X['Dependents'])
X['Self_Employed'] = le.fit_transform(X['Self_Employed'])
X['Property_Area'] = le.fit_transform(X['Property_Area'])

In [287]:

# label encode target


y['Loan_Status'] = le.fit_transform(y['Loan_Status'])

# # train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42
)

Building decision tree model

In [288]:
#Building the decision Tree model
from sklearn.tree import DecisionTreeClassifier

cf = DecisionTreeClassifier()

#Training the model


cf.fit( X_train, y_train )

#Predicting the outcomes


predictions_clf = cf.predict(X_test)

In [289]:
print('Model Accuracy:', round(accuracy_score(predictions_clf, y_test) * 100,2))

Model Accuracy: 71.54

In [290]:
print(confusion_matrix(predictions_clf, y_test))

[[23 15]
[20 65]]

Building Logistic Regression model

In [291]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=5000)
model.fit(X_train,y_train)
p_predict = model.predict(X_test)

print("The accuracy is", round(accuracy_score(p_predict, y_test) * 100,2))

The accuracy is 78.86

In [292]:

print(confusion_matrix(p_predict, y_test))

[[18 1]
[25 79]]

We are able to conclude that Logistic regression works best in this dataset

You might also like