
randomforest-mllab

April 13, 2024

1 21BCE5695

2 M. Ashwin

3 Random Forest
[ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
from sklearn.model_selection import train_test_split
import sklearn

4 Importing and visualizing data


[ ]: dataf = pd.read_csv('loan_approval_dataset.csv')

[ ]: print(dataf.shape)

(4269, 8)

[ ]: dataf.head()

[ ]:    no_of_dependents     education self_employed  income_annum  loan_amount  \
     0                 2      Graduate            No       9600000     29900000
     1                 0  Not Graduate           Yes       4100000     12200000
     2                 3      Graduate            No       9100000     29700000
     3                 3      Graduate            No       8200000     30700000
     4                 5  Not Graduate           Yes       9800000     24200000

        loan_term  cibil_score loan_status
     0         12          778    Approved
     1          8          417    Rejected
     2         20          506    Rejected
     3          8          467    Rejected
     4         20          382    Rejected

[ ]: print(dataf['loan_status'].value_counts())

loan_status
Approved 2656
Rejected 1613
Name: count, dtype: int64

[ ]: dataf['loan_status'].value_counts().plot.bar()

[ ]: <Axes: xlabel='loan_status'>

5 Independent variables (Categorical)


A categorical variable (also called a qualitative variable) is a variable that can take on one of a
limited, and usually fixed, number of possible values. Here, education and self_employed are the
categorical independent variables (the target loan_status is stored as text categories as well).
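As a quick check of which columns are categorical (a minimal sketch that only assumes the dataf
loaded above), the object-typed columns can be listed directly; these are the same columns the
label-encoding step later in the notebook operates on.

# Columns stored as strings (object dtype) are the categorical ones.
categorical_cols = dataf.select_dtypes(include='object').columns.tolist()
print(categorical_cols)  # for this dataset: education, self_employed and loan_status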

[ ]: dataf['education'].value_counts(normalize=True).plot.bar(title='Education')
plt.show()
dataf['no_of_dependents'].value_counts(normalize=True).plot.bar(title='No_of_Dependents')
plt.show()
dataf['self_employed'].value_counts(normalize=True).plot.bar(title='Self_Employed')
plt.show()

6 Independent variables (Numerical)
Visualizing the distribution of annual income.
[ ]: sns.displot(dataf['income_annum'])

[ ]: <seaborn.axisgrid.FacetGrid at 0x7e9285181150>

We can see that the annual income is fairly evenly distributed.
CIBIL score distribution plot.
[ ]: sns.displot(dataf['cibil_score'])

[ ]: <seaborn.axisgrid.FacetGrid at 0x7e92864193f0>

We can see that the CIBIL score is also fairly evenly distributed; hence, no normalization is required.
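As a rough numeric check of the "evenly distributed" observations above (a sketch using the same
dataf; skewness near zero is consistent with the flat shapes seen in the plots):

# Skewness close to 0 indicates an approximately symmetric distribution.
for col in ['income_annum', 'cibil_score', 'loan_amount']:
    print(col, round(dataf[col].skew(), 3))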
Loan Amount distribution plot.
[ ]: sns.displot(dataf['loan_amount'])

[ ]: <seaborn.axisgrid.FacetGrid at 0x7e92851817b0>

Encoding data
[ ]: from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
obj = (dataf.dtypes == 'object')
print(type(obj))
for col in list(obj[obj].index):
    dataf[col] = label_encoder.fit_transform(dataf[col])

<class 'pandas.core.series.Series'>
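LabelEncoder assigns integer codes in alphabetical order of the labels (e.g. Approved = 0,
Rejected = 1), which is why the next two cells flip the codes for education and loan_status. A small
sketch to inspect the mapping of the last column that was encoded (the exact output depends on the
column order in dataf):

# classes_ holds the string labels of the most recently fitted column,
# sorted alphabetically; their codes are simply 0, 1, 2, ... in that order.
print(dict(zip(label_encoder.classes_,
               label_encoder.transform(label_encoder.classes_))))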

[ ]: edu = []
for i in dataf['education']:
    if i == 0:
        edu.append(1)
    else:
        edu.append(0)
dataf['education'] = edu

[ ]: l = []
for i in dataf['loan_status']:
    if i == 0:
        l.append(1)
    else:
        l.append(0)
dataf['loan_status'] = l
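The two loops above only swap the 0/1 codes so that 1 means Graduate and Approved, respectively. An
equivalent vectorized alternative (a sketch; use it instead of the loops, not in addition to them)
would be:

# Flip the binary codes in one step: 0 -> 1 and 1 -> 0.
dataf['education'] = 1 - dataf['education']
dataf['loan_status'] = 1 - dataf['loan_status']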

Correlation matrix
[ ]: matrix = dataf.corr()
f, ax = plt.subplots(figsize=(10,10))
sns.heatmap(matrix,vmax=.8,square=True,cmap="BuPu", annot = True)

[ ]: <Axes: >
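To read the heatmap more directly, the column of the matrix corresponding to the target can be
sorted; a short sketch using the matrix computed above:

# Absolute correlation of each feature with loan_status, strongest first.
print(matrix['loan_status'].drop('loan_status').abs().sort_values(ascending=False))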

7 Model Building
Splitting data into training and testing
[ ]: Xval = dataf.drop(['loan_status'], axis=1)
Yval = dataf['loan_status']

X_train, X_test, Y_train, Y_test = train_test_split(Xval, Yval, train_size=0.8, random_state=5)

[ ]: print(np.array(X_train).shape)
print(np.array(Y_train).shape)
print(np.array(X_test).shape)
print(np.array(Y_test).shape)

(3415, 7)
(3415,)
(854, 7)
(854,)
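Since the classes are imbalanced (2656 Approved vs. 1613 Rejected), passing stratify preserves that
ratio in both splits. A minimal variant of the call above (an optional refinement, not what the
notebook ran):

# Same 80/20 split, but keeping the Approved/Rejected proportions in train and test.
X_train, X_test, Y_train, Y_test = train_test_split(
    Xval, Yval, train_size=0.8, random_state=5, stratify=Yval)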
Random Forest Model
[ ]: from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=7, criterion='entropy', random_state=7)
model.fit(X_train, Y_train)

yPred = model.predict(X_test)
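After fitting, the forest exposes impurity-based feature importances; a brief sketch (column names
taken from Xval defined above):

# One importance score per input column; higher means the feature mattered more.
importances = pd.Series(model.feature_importances_, index=Xval.columns)
print(importances.sort_values(ascending=False))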

[ ]: from sklearn import metrics

val = metrics.accuracy_score(Y_test, yPred)

[ ]: print("Accuracy:", val)

Accuracy: 0.9765807962529274
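Accuracy alone does not show how the errors split across the two classes, so a follow-up sketch with
the same metrics module (output omitted; with the remapping above, 0 = Rejected and 1 = Approved)
would be:

# Per-class breakdown of the predictions on the test set.
print(metrics.confusion_matrix(Y_test, yPred))
print(metrics.classification_report(Y_test, yPred,
                                    target_names=['Rejected', 'Approved']))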
