
Moving Ahead
This sheet was prepared by Prakshaal Jain, based on Sessions 6 and 7 by Siby Sir

Logistic Regression

If you are very interested --> Read this


https://www.ritchieng.com/logistic-regression/

Let $X_1, X_2, \dots, X_n$ be the features.

$Z = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$

$f(Z) = \frac{e^{Z}} {1+e^{Z}} $

f(Z) lies between 0 and 1, so it can be interpreted as a probability.

Such a function f(Z) is called the Logistic / Sigmoid function.

Managers!!!
If f(Z) lies between 0 and 0.5, then it is Class 0.

Else it will be Class 1.

Consider 0.5 as the threshold!
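
A minimal sketch of this rule (the coefficients b0, b1 and the feature value below are made up for illustration, not from any fitted model):

In [ ]:
import numpy as np

# Sigmoid: maps any real Z into (0, 1)
def sigmoid(z):
    return np.exp(z) / (1 + np.exp(z))

b0, b1 = -1.0, 0.5                       # hypothetical coefficients
x1 = 3.0                                 # hypothetical feature value
Z = b0 + b1 * x1                         # Z = 0.5
p = sigmoid(Z)                           # ~0.62, lies between 0 and 1
predicted_class = 1 if p > 0.5 else 0    # above the 0.5 threshold -> Class 1
print(p, predicted_class)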

Building LogReg Model


Remember, just because it says "Regression" does not mean this is regression!! This is
classification!!

In [9]:
# Import Libraries

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

In [10]:
credit = pd.read_csv('https://online.stat.psu.edu/onlinecourses/sites/stat508/files/german_credit.csv')


Data Exploration
In [11]:
credit.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 1000 entries, 0 to 999


Data columns (total 21 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Creditability 1000 non-null int64

1 Account Balance 1000 non-null int64

2 Duration of Credit (month) 1000 non-null int64

3 Payment Status of Previous Credit 1000 non-null int64

4 Purpose 1000 non-null int64

5 Credit Amount 1000 non-null int64

6 Value Savings/Stocks 1000 non-null int64

7 Length of current employment 1000 non-null int64

8 Instalment per cent 1000 non-null int64

9 Sex & Marital Status 1000 non-null int64

10 Guarantors 1000 non-null int64

11 Duration in Current address 1000 non-null int64

12 Most valuable available asset 1000 non-null int64

13 Age (years) 1000 non-null int64

14 Concurrent Credits 1000 non-null int64

15 Type of apartment 1000 non-null int64

16 No of Credits at this Bank 1000 non-null int64

17 Occupation 1000 non-null int64

18 No of dependents 1000 non-null int64

19 Telephone 1000 non-null int64

20 Foreign Worker 1000 non-null int64

dtypes: int64(21)

memory usage: 164.2 KB

Data Preprocessing
Do you remember what y and X are?
Which one is the target and which are the features?

In [12]:
# Target

y = credit['Creditability']

# Features

X = credit.drop(['Creditability'],axis=1)

Are the features numeric or categorical?


In [13]:
categorical_features=['Account Balance',

'Payment Status of Previous Credit', 'Purpose',

'Value Savings/Stocks', 'Length of current employment',

'Instalment per cent', 'Sex & Marital Status', 'Guarantors',

'Duration in Current address', 'Most valuable available asset',

'Concurrent Credits', 'Type of apartment',

'No of Credits at this Bank', 'Occupation', 'No of dependents',

'Telephone', 'Foreign Worker']


We make dummies of categorical features!


In [14]:
X = pd.get_dummies(X,columns=categorical_features)

Building the model


In [16]:
import statsmodels.api as sm

X = sm.add_constant(X)

Same old story -> Train test split


In [17]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.2,random_state=10)

In [18]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

Out[18]: ((800, 72), (200, 72), (800,), (200,))

Building the model


In [19]:
logReg_1=sm.Logit(y_train,X_train)

logReg_1=logReg_1.fit()

Warning: Maximum number of iterations has been exceeded.

Current function value: 0.434640

Iterations: 35

c:\users\admin\appdata\local\programs\python\python39\lib\site-packages\statsmodels\base
\model.py:604: ConvergenceWarning: Maximum Likelihood optimization failed to converge. C
heck mle_retvals

warnings.warn("Maximum Likelihood optimization failed to "

The model is built; let's see how it turned out!


In [20]:
logReg_1.summary2()

Out[20]: Model: Logit Pseudo R-squared: 0.280

Dependent Variable: Creditability AIC: 805.4233

Date: 2021-10-19 23:01 BIC: 1063.0770

No. Observations: 800 Log-Likelihood: -347.71

Df Model: 54 LL-Null: -482.61

Df Residuals: 745 LLR p-value: 1.9042e-30


Converged: 0.0000 Scale: 1.0000

No. Iterations: 35.0000

Coef. Std.Err. z P>|z| [0.025 0.975]

const 0.4063 1404911.6361 0.0000 1.0000 -2753575.8020 2753576.6145

Duration of Credit (month) -0.0245 0.0108 -2.2707 0.0232 -0.0456 -0.0033

Credit Amount -0.0001 0.0001 -2.7644 0.0057 -0.0002 -0.0000

Age (years) 0.0255 0.0111 2.2989 0.0215 0.0038 0.0473

Account Balance_1 -0.7485 8373976.1154 -0.0000 1.0000 -16412692.3421 16412690.8452

Account Balance_2 -0.1653 7979569.8243 -0.0000 1.0000 -15639669.6331 15639669.3024

Account Balance_3 0.1688 8509025.3500 0.0000 1.0000 -16677383.0607 16677383.3983

Account Balance_4 1.1513 8713706.8835 0.0000 1.0000 -17078550.5122 17078552.8147

Payment Status of Previous Credit_0 -0.3046 nan nan nan nan nan

Payment Status of Previous Credit_1 -0.6436 nan nan nan nan nan

Payment Status of Previous Credit_2 0.1630 nan nan nan nan nan

Payment Status of Previous Credit_3 0.1771 nan nan nan nan nan

Payment Status of Previous Credit_4 1.0145 nan nan nan nan nan

Purpose_0 -0.7104 nan nan nan nan nan

Purpose_1 0.8887 nan nan nan nan nan

Purpose_2 0.0006 nan nan nan nan nan

Purpose_3 0.2011 nan nan nan nan nan

Purpose_4 0.1495 nan nan nan nan nan

Purpose_5 -0.5100 nan nan nan nan nan

Purpose_6 -1.0768 nan nan nan nan nan

Purpose_8 0.2384 nan nan nan nan nan

Purpose_9 0.0760 nan nan nan nan nan

Purpose_10 1.1492 nan nan nan nan nan

Value Savings/Stocks_1 -0.4313 nan nan nan nan nan

Value Savings/Stocks_2 -0.2359 nan nan nan nan nan

Value Savings/Stocks_3 0.1928 nan nan nan nan nan

Value Savings/Stocks_4 0.4554 nan nan nan nan nan

Value Savings/Stocks_5 0.4253 nan nan nan nan nan

Length of current employment_1 -0.2629 nan nan nan nan nan

Length of current employment_2 -0.1827 nan nan nan nan nan

Length of current employment_3 0.1122 nan nan nan nan nan

Length of current employment_4 0.5661 nan nan nan nan nan

Length of current employment_5 0.1735 nan nan nan nan nan

Instalment per cent_1 0.5298 nan nan nan nan nan

Instalment per cent_2 0.3641 nan nan nan nan nan

Instalment per cent_3 -0.0776 nan nan nan nan nan

Instalment per cent_4 -0.4100 nan nan nan nan nan

Sex & Marital Status_1 -0.2274 3355443.2000 -0.0000 1.0000 -6576548.0516 6576547.5967

Sex & Marital Status_2 0.1296 3138729.7088 0.0000 1.0000 -6151797.0568 6151797.3161

Sex & Marital Status_3 0.5023 3720105.6914 0.0000 1.0000 -7291272.6715 7291273.6760

Sex & Marital Status_4 0.0018 3424634.8754 0.0000 1.0000 -6712161.0141 6712161.0178

Guarantors_1 -0.1139 nan nan nan nan nan

Guarantors_2 -0.5065 nan nan nan nan nan

Guarantors_3 1.0267 nan nan nan nan nan

Duration in Current address_1 0.4545 nan nan nan nan nan

Duration in Current address_2 -0.0824 nan nan nan nan nan

Duration in Current address_3 0.0167 nan nan nan nan nan

Duration in Current address_4 0.0174 nan nan nan nan nan

Most valuable available asset_1 0.3084 nan nan nan nan nan

Most valuable available asset_2 0.2523 nan nan nan nan nan

Most valuable available asset_3 0.3069 nan nan nan nan nan

Most valuable available asset_4 -0.4614 nan nan nan nan nan

Concurrent Credits_1 -0.1116 nan nan nan nan nan

Concurrent Credits_2 0.1374 nan nan nan nan nan

Concurrent Credits_3 0.3805 nan nan nan nan nan

Type of apartment_1 -0.1943 2464197.4678 -0.0000 1.0000 -4829738.4820 4829738.0935

Type of apartment_2 0.0140 1006004.4036 0.0000 1.0000 -1971732.3854 1971732.4133

Type of apartment_3 0.5866 1742450.7397 0.0000 1.0000 -3415140.1081 3415141.2812


No of Credits at this Bank_1 0.3896 4137845.4729 0.0000 1.0000 -8110027.7110 8110028.4901

No of Credits at this Bank_2 -0.0429 4147230.4827 -0.0000 1.0000 -8128422.4245 8128422.3388

No of Credits at this Bank_3 0.1520 4141602.0288 0.0000 1.0000 -8117390.6629 8117390.9668

No of Credits at this Bank_4 -0.0924 4107669.3512 -0.0000 1.0000 -8050884.0812 8050883.8964

Occupation_1 0.4355 7357889.2236 0.0000 1.0000 -14421197.4450 14421198.3160

Occupation_2 0.1472 7355832.2235 0.0000 1.0000 -14417166.0872 14417166.3816

Occupation_3 -0.0489 7316122.2492 -0.0000 1.0000 -14339336.1638 14339336.0659

Occupation_4 -0.1275 7321291.5583 -0.0000 1.0000 -14349467.9020 14349467.6470

No of dependents_1 0.3512 nan nan nan nan nan

No of dependents_2 0.0551 nan nan nan nan nan

Telephone_1 0.0772 5627249.3174 0.0000 1.0000 -11029205.9169 11029206.0713

Telephone_2 0.3290 5622781.4728 0.0000 1.0000 -11020448.8507 11020449.5087

Foreign Worker_1 -1.0247 1757479.2690 -0.0000 1.0000 -3444597.0956 3444595.0461

Foreign Worker_2 1.4310 1757479.2690 0.0000 1.0000 -3444594.6399 3444597.5019

Let's take a closer look at the p-values


In [24]:
# Seeing which ones are less than 0.05

pd.DataFrame(logReg_1.pvalues)[0]<0.05

Out[24]: const False

Duration of Credit (month) True

Credit Amount True

Age (years) True

Account Balance_1 False

...

No of dependents_2 False

Telephone_1 False

Telephone_2 False

Foreign Worker_1 False

Foreign Worker_2 False

Name: 0, Length: 72, dtype: bool

P-values less than alpha ----> Significant
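
As an aside, a sketch of how these features could be pulled out programmatically rather than read off the summary by eye (assuming the fitted logReg_1 from above and alpha = 0.05; nan p-values simply fail the comparison):

In [ ]:
alpha = 0.05

# Keep only the coefficients whose p-value is below alpha
significant = logReg_1.pvalues[logReg_1.pvalues < alpha].index.tolist()
print(significant)   # Duration of Credit (month), Credit Amount, Age (years)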


In [25]:
significant_features=['Duration of Credit (month)','Credit Amount','Age (years)']

In [27]:
X_new_1 = X[significant_features]

Remaking the model


In [28]:
X_train,X_test,y_train,y_test = train_test_split(X_new_1,y,test_size=0.2,random_state=10)

In [29]:
logReg_2=sm.Logit(y_train,X_train)

logReg_2=logReg_2.fit()

Optimization terminated successfully.

Current function value: 0.574603

Iterations 5

In [30]:
logReg_2.summary2()

Out[30]: Model: Logit Pseudo R-squared: 0.048

Dependent Variable: Creditability AIC: 925.3648

Date: 2021-10-19 23:15 BIC: 939.4186

No. Observations: 800 Log-Likelihood: -459.68

Df Model: 2 LL-Null: -482.61

Df Residuals: 797 LLR p-value: 1.0992e-10

Converged: 1.0000 Scale: 1.0000

No. Iterations: 5.0000

Coef. Std.Err. z P>|z| [0.025 0.975]

Duration of Credit (month) -0.0226 0.0078 -2.8827 0.0039 -0.0379 -0.0072

Credit Amount -0.0001 0.0000 -1.8939 0.0582 -0.0001 0.0000


Age (years) 0.0452 0.0044 10.2041 0.0000 0.0365 0.0539

Damn, that p-value!

Dropping Credit Amount

In [35]:
X_new_2 = X_new_1.drop(['Credit Amount'],axis=1)

X_train,X_test,y_train,y_test=train_test_split(X_new_2,y,test_size=0.2,random_state=10)

Remaking the model


In [36]:
log_reg_3 = sm.Logit(y_train,X_train)

log_reg_3 = log_reg_3.fit()

Optimization terminated successfully.

Current function value: 0.576819

Iterations 5

In [37]:
log_reg_3.summary2()

Out[37]: Model: Logit Pseudo R-squared: 0.044

Dependent Variable: Creditability AIC: 926.9100

Date: 2021-10-19 23:18 BIC: 936.2792

No. Observations: 800 Log-Likelihood: -461.45

Df Model: 1 LL-Null: -482.61

Df Residuals: 798 LLR p-value: 7.7598e-11

Converged: 1.0000 Scale: 1.0000

No. Iterations: 5.0000

Coef. Std.Err. z P>|z| [0.025 0.975]

Duration of Credit (month) -0.0319 0.0061 -5.2237 0.0000 -0.0439 -0.0200

Age (years) 0.0444 0.0044 10.1043 0.0000 0.0358 0.0530

Looks good now!


Findings:
$Z = -0.0319 \cdot \text{Duration of Credit (month)} + 0.0444 \cdot \text{Age (years)}$

$f(Z) = \frac{e^{Z}}{1+e^{Z}}$, which is a probability lying between 0 and 1.
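
A quick worked example of plugging numbers into this equation (the applicant's values here are hypothetical):

In [ ]:
import numpy as np

# Hypothetical applicant: 24-month credit, aged 35
duration, age = 24, 35

Z = -0.0319 * duration + 0.0444 * age    # Z = 0.7884
prob = np.exp(Z) / (1 + np.exp(Z))       # ~0.69
print(prob)                              # > 0.5, so predicted Class 1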

Prediction

In [38]:
y_pred = log_reg_3.predict(X_test)

Let's visualise the output


In [41]:
pred_df= pd.DataFrame({'Actual_Class':y_test, 'Predicted_Prob':y_pred})

pred_df['Predicted_Class']=pred_df['Predicted_Prob'].map(lambda x: 1 if x>0.5 else 0)

In [44]:
pred_df.iloc[0:10]

Out[44]: Actual_Class Predicted_Prob Predicted_Class

841 0 0.690284 1

956 0 0.884680 1

544 1 0.850290 1

173 1 0.657965 1

759 0 0.512259 1

955 0 0.866104 1

121 1 0.599699 1

230 1 0.696740 1

11 1 0.526841 1

120 1 0.545433 1

Performance measures!!

This is very important for all ML interviews!!


Sensitivity / Recall
TPR = TP / (TP + FN)

Conditional probability of the predicted class being positive, given that the actual class is positive.

Specificity:
TNR = TN / (FP + TN)

Conditional probability of the predicted class being negative, given that the actual class is negative.

Precision:
= TP / (TP + FP)

Conditional probability of the actual class being positive, given that the predicted class is positive.

F1-score
$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

Take the harmonic mean of Precision and Recall.

Accuracy:
= (TP + TN) / N
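
A small sketch computing all five measures from raw counts (the TP/FP/FN/TN values below anticipate the confusion matrix computed next, treating class 1 as the positive class):

In [ ]:
# Counts from the confusion matrix below, with class 1 as positive
TP, FP, FN, TN = 130, 56, 3, 11

recall      = TP / (TP + FN)                                   # sensitivity / TPR, ~0.98
specificity = TN / (FP + TN)                                   # TNR, ~0.16
precision   = TP / (TP + FP)                                   # ~0.70
f1          = 2 * precision * recall / (precision + recall)    # ~0.82
accuracy    = (TP + TN) / (TP + FP + FN + TN)                  # ~0.70
print(recall, specificity, precision, f1, accuracy)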


In [45]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(pred_df['Actual_Class'],pred_df['Predicted_Class'])

print(' Confusion matrix is given by: \n',cm)

Confusion matrix is given by:

[[ 11 56]

[ 3 130]]

In [46]:
from sklearn.metrics import classification_report

report=classification_report(pred_df['Actual_Class'],pred_df['Predicted_Class'])

print('The Classification Report of the model:\n',report)

The Classification Report of the model:

precision recall f1-score support

0 0.79 0.16 0.27 67

1 0.70 0.98 0.82 133

accuracy 0.70 200

macro avg 0.74 0.57 0.54 200

weighted avg 0.73 0.70 0.63 200

Let's plot and see the probabilities


In [47]:
# Combining

## Plotting corresponding to class 0 and 1

sns.distplot(pred_df[pred_df['Actual_Class']==0]['Predicted_Prob'],color='b');

sns.distplot(pred_df[pred_df['Actual_Class']==1]['Predicted_Prob'],color='g');

c:\users\admin\appdata\local\programs\python\python39\lib\site-packages\seaborn\distribu
tions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in
a future version. Please adapt your code to use either `displot` (a figure-level functio
n with similar flexibility) or `histplot` (an axes-level function for histograms).

warnings.warn(msg, FutureWarning)

c:\users\admin\appdata\local\programs\python\python39\lib\site-packages\seaborn\distribu
tions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in
a future version. Please adapt your code to use either `displot` (a figure-level functio
n with similar flexibility) or `histplot` (an axes-level function for histograms).

warnings.warn(msg, FutureWarning)


Model Optimization
ROC Curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a
classification model at all classification thresholds. This curve plots two parameters: the True
Positive Rate and the False Positive Rate.

https://en.wikipedia.org/wiki/Receiver_operating_characteristic

A curve obtained by plotting FPR (= 1 - TNR) along the X-axis and TPR along the Y-axis.
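
To connect the definition to the curve: each threshold t gives one (FPR, TPR) point, and sweeping t from 1 down to 0 traces the whole curve. A rough sketch of one such point, assuming the pred_df built above:

In [ ]:
t = 0.5   # one candidate threshold

pred   = (pred_df['Predicted_Prob'] > t).astype(int)
actual = pred_df['Actual_Class']

TP = ((pred == 1) & (actual == 1)).sum()
FP = ((pred == 1) & (actual == 0)).sum()
FN = ((pred == 0) & (actual == 1)).sum()
TN = ((pred == 0) & (actual == 0)).sum()

tpr = TP / (TP + FN)   # y-coordinate of this point
fpr = FP / (FP + TN)   # x-coordinate of this point
print(fpr, tpr)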


Managers!! The farther the curve is from the red line, the better the classifier!!
In [49]:
from sklearn.metrics import roc_curve,roc_auc_score

fpr,tpr,threshold = roc_curve(pred_df['Actual_Class'],pred_df['Predicted_Prob'])

In [50]:
# Drawing curve

plt.plot(fpr, tpr);

plt.plot([0,1],[0,1],color='r');

plt.xlabel('FPR');

plt.ylabel('TPR');

plt.title(' ROC curve');

score=roc_auc_score(pred_df['Actual_Class'],pred_df['Predicted_Prob'])

print( 'ROC AUC SCORE:' , score)

ROC AUC SCORE: 0.5653686454943327

Youden Index!
$Y = \max(TPR + TNR - 1) = \max(TPR - (1 - TNR)) = \max(TPR - FPR)$

localhost:8888/lab/tree/OneDrive/college/Trimester 2/Machine Learning/Intro to ML - Prk/IntroToML-Part2.ipynb 13/15


10/20/21, 12:00 AM IntroToML-Part2

In [53]:
# Making FPR TPR Data frame

fpr_tpr = pd.DataFrame({'fpr':fpr,'tpr':tpr,'threshold':threshold})

# Finding Difference

fpr_tpr['diff'] = fpr_tpr['tpr']-fpr_tpr['fpr']

# Sorting Values

fpr_tpr.sort_values('diff',ascending= False).head()

Out[53]: fpr tpr threshold diff

94 0.791045 0.947368 0.526841 0.156324

96 0.805970 0.954887 0.521594 0.148917

93 0.791045 0.939850 0.527421 0.148805

100 0.835821 0.977444 0.504669 0.141623

98 0.820896 0.962406 0.513231 0.141510

Managers!!

The best threshold is 0.5268


In [54]:
pred_df['Predicted_New_Class'] = pred_df['Predicted_Prob'].map(lambda x: 1 if x>0.526841 else 0)

In [55]:
new_report = classification_report(pred_df['Actual_Class'],pred_df['Predicted_New_Class'])


In [57]:
print('The new Classification Report:\n', new_report)

The new Classification Report:

precision recall f1-score support

0 0.64 0.21 0.31 67

1 0.70 0.94 0.80 133

accuracy 0.69 200

macro avg 0.67 0.57 0.56 200

weighted avg 0.68 0.69 0.64 200

Compare with previous report!!


In [59]:
print('The old Classification Report of the model:\n',report)

The old Classification Report of the model:

precision recall f1-score support

0 0.79 0.16 0.27 67

1 0.70 0.98 0.82 133

accuracy 0.70 200

macro avg 0.74 0.57 0.54 200

weighted avg 0.73 0.70 0.63 200

Inference! --> There is not much change, as the new threshold is very close to the earlier one,
but see that the recall for class 0 has improved!