
Moving Ahead
This sheet was prepared by Prakshaal Jain, based on Sessions 6 and 7 by Siby Sir

Logistic Regression

If you are very interested --> Read this


https://www.ritchieng.com/logistic-regression/

Let $X_1, X_2, \dots, X_n$ be the features.

$Z = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$

$f(Z) = \frac{e^{Z}} {1+e^{Z}} $

f(Z) lies between 0 and 1, so it can be interpreted as a probability.

Such a function f(Z) is called the Logistic / Sigmoid function.

Managers!!!
If f(Z) lies between 0 and 0.5, then it is Class 0.

Else it will be Class 1.

Consider 0.5 as the threshold!
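
A minimal sketch of this rule (the coefficients b0, b1 and the feature value below are made up for illustration, not from any fitted model):

In [ ]:
import numpy as np

# Sigmoid: maps any real Z into (0, 1)
def sigmoid(z):
    return np.exp(z) / (1 + np.exp(z))

b0, b1 = -1.0, 0.5                       # hypothetical coefficients
x1 = 3.0                                 # hypothetical feature value
Z = b0 + b1 * x1                         # Z = 0.5
p = sigmoid(Z)                           # ~0.62, lies between 0 and 1
predicted_class = 1 if p > 0.5 else 0    # above the 0.5 threshold -> Class 1
print(p, predicted_class)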

Building LogReg Model


Remember, just because it says "Regression" does not mean this is regression!! This is
classification!!

In [9]:
# Import Libraries

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

In [10]:
credit = pd.read_csv('https://online.stat.psu.edu/onlinecourses/sites/stat508/files/german_credit.csv')


Data Exploration
In [11]:
credit.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 1000 entries, 0 to 999


Data columns (total 21 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Creditability 1000 non-null int64

1 Account Balance 1000 non-null int64

2 Duration of Credit (month) 1000 non-null int64

3 Payment Status of Previous Credit 1000 non-null int64

4 Purpose 1000 non-null int64

5 Credit Amount 1000 non-null int64

6 Value Savings/Stocks 1000 non-null int64

7 Length of current employment 1000 non-null int64

8 Instalment per cent 1000 non-null int64

9 Sex & Marital Status 1000 non-null int64

10 Guarantors 1000 non-null int64

11 Duration in Current address 1000 non-null int64

12 Most valuable available asset 1000 non-null int64

13 Age (years) 1000 non-null int64

14 Concurrent Credits 1000 non-null int64

15 Type of apartment 1000 non-null int64

16 No of Credits at this Bank 1000 non-null int64

17 Occupation 1000 non-null int64

18 No of dependents 1000 non-null int64

19 Telephone 1000 non-null int64

20 Foreign Worker 1000 non-null int64

dtypes: int64(21)

memory usage: 164.2 KB

Data Preprocessing
Do you remember what y and X are?
Which one is the target and which are the features?

In [12]:
# Target

y = credit['Creditability']

# Features

X = credit.drop(['Creditability'],axis=1)

Are the features numeric or categorical?


In [13]:
categorical_features=['Account Balance',

'Payment Status of Previous Credit', 'Purpose',

'Value Savings/Stocks', 'Length of current employment',

'Instalment per cent', 'Sex & Marital Status', 'Guarantors',

'Duration in Current address', 'Most valuable available asset',

'Concurrent Credits', 'Type of apartment',

'No of Credits at this Bank', 'Occupation', 'No of dependents',

'Telephone', 'Foreign Worker']


We make dummies of categorical features!


In [14]:
X = pd.get_dummies(X,columns=categorical_features)

Building the model


In [16]:
import statsmodels.api as sm

X = sm.add_constant(X)

Same old story -> Train test split


In [17]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.2,random_state=10)

In [18]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

Out[18]: ((800, 72), (200, 72), (800,), (200,))

Building the model


In [19]:
logReg_1=sm.Logit(y_train,X_train)

logReg_1=logReg_1.fit()

Warning: Maximum number of iterations has been exceeded.

Current function value: 0.434640

Iterations: 35

c:\users\admin\appdata\local\programs\python\python39\lib\site-packages\statsmodels\base
\model.py:604: ConvergenceWarning: Maximum Likelihood optimization failed to converge. C
heck mle_retvals

warnings.warn("Maximum Likelihood optimization failed to "

The model is built; let's see how it turned out!


In [20]:
logReg_1.summary2()

Out[20]: Model: Logit Pseudo R-squared: 0.280

Dependent Variable: Creditability AIC: 805.4233

Date: 2021-10-19 23:01 BIC: 1063.0770

No. Observations: 800 Log-Likelihood: -347.71

Df Model: 54 LL-Null: -482.61

Df Residuals: 745 LLR p-value: 1.9042e-30


Converged: 0.0000 Scale: 1.0000

No. Iterations: 35.0000

Coef. Std.Err. z P>|z| [0.025 0.975]

const 0.4063 1404911.6361 0.0000 1.0000 -2753575.8020 2753576.6145

Duration of Credit (month) -0.0245 0.0108 -2.2707 0.0232 -0.0456 -0.0033

Credit Amount -0.0001 0.0001 -2.7644 0.0057 -0.0002 -0.0000

Age (years) 0.0255 0.0111 2.2989 0.0215 0.0038 0.0473

Account Balance_1 -0.7485 8373976.1154 -0.0000 1.0000 -16412692.3421 16412690.8452

Account Balance_2 -0.1653 7979569.8243 -0.0000 1.0000 -15639669.6331 15639669.3024

Account Balance_3 0.1688 8509025.3500 0.0000 1.0000 -16677383.0607 16677383.3983

Account Balance_4 1.1513 8713706.8835 0.0000 1.0000 -17078550.5122 17078552.8147

Payment Status of Previous Credit_0 -0.3046 nan nan nan nan nan

Payment Status of Previous Credit_1 -0.6436 nan nan nan nan nan

Payment Status of Previous Credit_2 0.1630 nan nan nan nan nan

Payment Status of Previous Credit_3 0.1771 nan nan nan nan nan

Payment Status of Previous Credit_4 1.0145 nan nan nan nan nan

Purpose_0 -0.7104 nan nan nan nan nan

Purpose_1 0.8887 nan nan nan nan nan

Purpose_2 0.0006 nan nan nan nan nan

Purpose_3 0.2011 nan nan nan nan nan

Purpose_4 0.1495 nan nan nan nan nan

Purpose_5 -0.5100 nan nan nan nan nan

Purpose_6 -1.0768 nan nan nan nan nan

Purpose_8 0.2384 nan nan nan nan nan

Purpose_9 0.0760 nan nan nan nan nan

Purpose_10 1.1492 nan nan nan nan nan

Value Savings/Stocks_1 -0.4313 nan nan nan nan nan

Value Savings/Stocks_2 -0.2359 nan nan nan nan nan

Value Savings/Stocks_3 0.1928 nan nan nan nan nan

Value Savings/Stocks_4 0.4554 nan nan nan nan nan

Value Savings/Stocks_5 0.4253 nan nan nan nan nan

Length of current employment_1 -0.2629 nan nan nan nan nan

Length of current employment_2 -0.1827 nan nan nan nan nan

Length of current employment_3 0.1122 nan nan nan nan nan

Length of current employment_4 0.5661 nan nan nan nan nan

Length of current employment_5 0.1735 nan nan nan nan nan

Instalment per cent_1 0.5298 nan nan nan nan nan

Instalment per cent_2 0.3641 nan nan nan nan nan

Instalment per cent_3 -0.0776 nan nan nan nan nan

Instalment per cent_4 -0.4100 nan nan nan nan nan

Sex & Marital Status_1 -0.2274 3355443.2000 -0.0000 1.0000 -6576548.0516 6576547.5967

Sex & Marital Status_2 0.1296 3138729.7088 0.0000 1.0000 -6151797.0568 6151797.3161

Sex & Marital Status_3 0.5023 3720105.6914 0.0000 1.0000 -7291272.6715 7291273.6760

Sex & Marital Status_4 0.0018 3424634.8754 0.0000 1.0000 -6712161.0141 6712161.0178

Guarantors_1 -0.1139 nan nan nan nan nan

Guarantors_2 -0.5065 nan nan nan nan nan

Guarantors_3 1.0267 nan nan nan nan nan

Duration in Current address_1 0.4545 nan nan nan nan nan

Duration in Current address_2 -0.0824 nan nan nan nan nan

Duration in Current address_3 0.0167 nan nan nan nan nan

Duration in Current address_4 0.0174 nan nan nan nan nan

Most valuable available asset_1 0.3084 nan nan nan nan nan

Most valuable available asset_2 0.2523 nan nan nan nan nan

Most valuable available asset_3 0.3069 nan nan nan nan nan

Most valuable available asset_4 -0.4614 nan nan nan nan nan

Concurrent Credits_1 -0.1116 nan nan nan nan nan

Concurrent Credits_2 0.1374 nan nan nan nan nan

Concurrent Credits_3 0.3805 nan nan nan nan nan

Type of apartment_1 -0.1943 2464197.4678 -0.0000 1.0000 -4829738.4820 4829738.0935

Type of apartment_2 0.0140 1006004.4036 0.0000 1.0000 -1971732.3854 1971732.4133

Type of apartment_3 0.5866 1742450.7397 0.0000 1.0000 -3415140.1081 3415141.2812


No of Credits at this Bank_1 0.3896 4137845.4729 0.0000 1.0000 -8110027.7110 8110028.4901

No of Credits at this Bank_2 -0.0429 4147230.4827 -0.0000 1.0000 -8128422.4245 8128422.3388

No of Credits at this Bank_3 0.1520 4141602.0288 0.0000 1.0000 -8117390.6629 8117390.9668

No of Credits at this Bank_4 -0.0924 4107669.3512 -0.0000 1.0000 -8050884.0812 8050883.8964

Occupation_1 0.4355 7357889.2236 0.0000 1.0000 -14421197.4450 14421198.3160

Occupation_2 0.1472 7355832.2235 0.0000 1.0000 -14417166.0872 14417166.3816

Occupation_3 -0.0489 7316122.2492 -0.0000 1.0000 -14339336.1638 14339336.0659

Occupation_4 -0.1275 7321291.5583 -0.0000 1.0000 -14349467.9020 14349467.6470

No of dependents_1 0.3512 nan nan nan nan nan

No of dependents_2 0.0551 nan nan nan nan nan

Telephone_1 0.0772 5627249.3174 0.0000 1.0000 -11029205.9169 11029206.0713

Telephone_2 0.3290 5622781.4728 0.0000 1.0000 -11020448.8507 11020449.5087

Foreign Worker_1 -1.0247 1757479.2690 -0.0000 1.0000 -3444597.0956 3444595.0461

Foreign Worker_2 1.4310 1757479.2690 0.0000 1.0000 -3444594.6399 3444597.5019

Let's take a closer look at the p-values


In [24]:
# Seeing which ones are less than 0.05

pd.DataFrame(logReg_1.pvalues)[0]<0.05

Out[24]: const False

Duration of Credit (month) True

Credit Amount True

Age (years) True

Account Balance_1 False

...

No of dependents_2 False

Telephone_1 False

Telephone_2 False

Foreign Worker_1 False

Foreign Worker_2 False

Name: 0, Length: 72, dtype: bool

P-values less than alpha ----> Significant
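
As an aside, a sketch of how these features could be pulled out programmatically rather than read off the summary by eye (assuming the fitted logReg_1 from above and alpha = 0.05; nan p-values simply fail the comparison):

In [ ]:
alpha = 0.05

# Keep only the coefficients whose p-value is below alpha
significant = logReg_1.pvalues[logReg_1.pvalues < alpha].index.tolist()
print(significant)   # Duration of Credit (month), Credit Amount, Age (years)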


In [25]:
significant_features=['Duration of Credit (month)','Credit Amount','Age (years)']

In [27]:
X_new_1 = X[significant_features]

Remaking the model


In [28]:
X_train,X_test,y_train,y_test = train_test_split(X_new_1,y,test_size=0.2,random_state=10)

In [29]:
logReg_2=sm.Logit(y_train,X_train)

logReg_2=logReg_2.fit()

Optimization terminated successfully.

Current function value: 0.574603

Iterations 5

In [30]:
logReg_2.summary2()

Out[30]: Model: Logit Pseudo R-squared: 0.048

Dependent Variable: Creditability AIC: 925.3648

Date: 2021-10-19 23:15 BIC: 939.4186

No. Observations: 800 Log-Likelihood: -459.68

Df Model: 2 LL-Null: -482.61

Df Residuals: 797 LLR p-value: 1.0992e-10

Converged: 1.0000 Scale: 1.0000

No. Iterations: 5.0000

Coef. Std.Err. z P>|z| [0.025 0.975]

Duration of Credit (month) -0.0226 0.0078 -2.8827 0.0039 -0.0379 -0.0072

Credit Amount -0.0001 0.0000 -1.8939 0.0582 -0.0001 0.0000


Age (years) 0.0452 0.0044 10.2041 0.0000 0.0365 0.0539

Damn, that p-value!

Dropping Credit Amount

In [35]:
X_new_2 = X_new_1.drop(['Credit Amount'],axis=1)

X_train,X_test,y_train,y_test=train_test_split(X_new_2,y,test_size=0.2,random_state=10)

Remaking the model


In [36]:
log_reg_3 = sm.Logit(y_train,X_train)

log_reg_3 = log_reg_3.fit()

Optimization terminated successfully.

Current function value: 0.576819

Iterations 5

In [37]:
log_reg_3.summary2()

Out[37]: Model: Logit Pseudo R-squared: 0.044

Dependent Variable: Creditability AIC: 926.9100

Date: 2021-10-19 23:18 BIC: 936.2792

No. Observations: 800 Log-Likelihood: -461.45

Df Model: 1 LL-Null: -482.61

Df Residuals: 798 LLR p-value: 7.7598e-11

Converged: 1.0000 Scale: 1.0000

No. Iterations: 5.0000

Coef. Std.Err. z P>|z| [0.025 0.975]

Duration of Credit (month) -0.0319 0.0061 -5.2237 0.0000 -0.0439 -0.0200

Age (years) 0.0444 0.0044 10.1043 0.0000 0.0358 0.0530

Looks good now!


Findings:
$Z = -0.0319 \cdot \text{Duration of Credit (month)} + 0.0444 \cdot \text{Age (years)}$

$f(Z) = \frac{e^{Z}}{1+e^{Z}}$, which is a probability lying between 0 and 1.
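
A quick worked example of plugging numbers into this equation (the applicant's values here are hypothetical):

In [ ]:
import numpy as np

# Hypothetical applicant: 24-month credit, aged 35
duration, age = 24, 35

Z = -0.0319 * duration + 0.0444 * age    # Z = 0.7884
prob = np.exp(Z) / (1 + np.exp(Z))       # ~0.69
print(prob)                              # > 0.5, so predicted Class 1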

Prediction

In [38]:
y_pred = log_reg_3.predict(X_test)

Let's visualise the output


In [41]:
pred_df= pd.DataFrame({'Actual_Class':y_test, 'Predicted_Prob':y_pred})

pred_df['Predicted_Class']=pred_df['Predicted_Prob'].map(lambda x: 1 if x>0.5 else 0)

In [44]:
pred_df.iloc[0:10]

Out[44]: Actual_Class Predicted_Prob Predicted_Class

841 0 0.690284 1

956 0 0.884680 1

544 1 0.850290 1

173 1 0.657965 1

759 0 0.512259 1

955 0 0.866104 1

121 1 0.599699 1

230 1 0.696740 1

11 1 0.526841 1

120 1 0.545433 1

Performance measures!!

This is very important for all ML interviews!!


Sensitivity / Recall
TPR = TP / (TP + FN)

Conditional probability of the predicted class being positive, given that the actual class is positive.

Specificity:
TNR = TN / (FP + TN)

Conditional probability of the predicted class being negative, given that the actual class is negative.

Precision:
= TP / (TP + FP)

Conditional probability of the actual class being positive, given that the predicted class is positive.

F1-score
$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$

Take the harmonic mean of Precision and Recall.

Accuracy:
= (TP + TN) / N
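
A small sketch computing all five measures from raw counts (the TP/FP/FN/TN values below anticipate the confusion matrix computed next, treating class 1 as the positive class):

In [ ]:
# Counts from the confusion matrix below, with class 1 as positive
TP, FP, FN, TN = 130, 56, 3, 11

recall      = TP / (TP + FN)                                   # sensitivity / TPR, ~0.98
specificity = TN / (FP + TN)                                   # TNR, ~0.16
precision   = TP / (TP + FP)                                   # ~0.70
f1          = 2 * precision * recall / (precision + recall)    # ~0.82
accuracy    = (TP + TN) / (TP + FP + FN + TN)                  # ~0.70
print(recall, specificity, precision, f1, accuracy)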


In [45]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(pred_df['Actual_Class'],pred_df['Predicted_Class'])

print(' Confusion matrix is given by: \n',cm)

Confusion matrix is given by:

[[ 11 56]

[ 3 130]]

In [46]:
from sklearn.metrics import classification_report

report=classification_report(pred_df['Actual_Class'],pred_df['Predicted_Class'])

print('The Classification Report of the model:\n',report)

The Classification Report of the model:

precision recall f1-score support

0 0.79 0.16 0.27 67

1 0.70 0.98 0.82 133

accuracy 0.70 200

macro avg 0.74 0.57 0.54 200

weighted avg 0.73 0.70 0.63 200

Let's plot and see the probabilities


In [47]:
# Combining

## Plotting corresponding to class 0 and 1

sns.distplot(pred_df[pred_df['Actual_Class']==0]['Predicted_Prob'],color='b');

sns.distplot(pred_df[pred_df['Actual_Class']==1]['Predicted_Prob'],color='g');

c:\users\admin\appdata\local\programs\python\python39\lib\site-packages\seaborn\distribu
tions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in
a future version. Please adapt your code to use either `displot` (a figure-level functio
n with similar flexibility) or `histplot` (an axes-level function for histograms).

warnings.warn(msg, FutureWarning)

c:\users\admin\appdata\local\programs\python\python39\lib\site-packages\seaborn\distribu
tions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in
a future version. Please adapt your code to use either `displot` (a figure-level functio
n with similar flexibility) or `histplot` (an axes-level function for histograms).

warnings.warn(msg, FutureWarning)


Model Optimization
ROC Curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a
classification model at all classification thresholds. This curve plots two parameters: the True
Positive Rate and the False Positive Rate.

https://en.wikipedia.org/wiki/Receiver_operating_characteristic

A curve obtained by plotting FPR (= 1 - TNR) along the X-axis and TPR along the Y-axis.
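
To connect the definition to the curve: each threshold t gives one (FPR, TPR) point, and sweeping t from 1 down to 0 traces the whole curve. A rough sketch of one such point, assuming the pred_df built above:

In [ ]:
t = 0.5   # one candidate threshold

pred   = (pred_df['Predicted_Prob'] > t).astype(int)
actual = pred_df['Actual_Class']

TP = ((pred == 1) & (actual == 1)).sum()
FP = ((pred == 1) & (actual == 0)).sum()
FN = ((pred == 0) & (actual == 1)).sum()
TN = ((pred == 0) & (actual == 0)).sum()

tpr = TP / (TP + FN)   # y-coordinate of this point
fpr = FP / (FP + TN)   # x-coordinate of this point
print(fpr, tpr)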


Managers!! The farther the curve is from the red line, the better the classifier!!
In [49]:
from sklearn.metrics import roc_curve,roc_auc_score

fpr,tpr,threshold = roc_curve(pred_df['Actual_Class'],pred_df['Predicted_Prob'])

In [50]:
# Drawing curve

plt.plot(fpr, tpr);

plt.plot([0,1],[0,1],color='r');

plt.xlabel('FPR');

plt.ylabel('TPR');

plt.title(' ROC curve');

score=roc_auc_score(pred_df['Actual_Class'],pred_df['Predicted_Prob'])

print( 'ROC AUC SCORE:' , score)

ROC AUC SCORE: 0.5653686454943327

Youden Index!
$Y = \max(TPR + TNR - 1) = \max(TPR - (1 - TNR)) = \max(TPR - FPR)$

localhost:8888/lab/tree/OneDrive/college/Trimester 2/Machine Learning/Intro to ML - Prk/IntroToML-Part2.ipynb 13/15


10/20/21, 12:00 AM IntroToML-Part2

In [53]:
# Making FPR TPR Data frame

fpr_tpr = pd.DataFrame({'fpr':fpr,'tpr':tpr,'threshold':threshold})

# Finding Difference

fpr_tpr['diff'] = fpr_tpr['tpr']-fpr_tpr['fpr']

# Sorting Values

fpr_tpr.sort_values('diff',ascending= False).head()

Out[53]: fpr tpr threshold diff

94 0.791045 0.947368 0.526841 0.156324

96 0.805970 0.954887 0.521594 0.148917

93 0.791045 0.939850 0.527421 0.148805

100 0.835821 0.977444 0.504669 0.141623

98 0.820896 0.962406 0.513231 0.141510

Managers!!

The best threshold is 0.5268


In [54]:
pred_df['Predicted_New_Class'] = pred_df['Predicted_Prob'].map(lambda x: 1 if x>0.526841 else 0)

In [55]:
new_report = classification_report(pred_df['Actual_Class'],pred_df['Predicted_New_Class'])


In [57]:
print('The new Classification Report:\n', new_report)

The new Classification Report:

precision recall f1-score support

0 0.64 0.21 0.31 67

1 0.70 0.94 0.80 133

accuracy 0.69 200

macro avg 0.67 0.57 0.56 200

weighted avg 0.68 0.69 0.64 200

Compare with previous report!!


In [59]:
print('The old Classification Report of the model:\n',report)

The old Classification Report of the model:

precision recall f1-score support

0 0.79 0.16 0.27 67

1 0.70 0.98 0.82 133

accuracy 0.70 200

macro avg 0.74 0.57 0.54 200

weighted avg 0.73 0.70 0.63 200

Inference! --> There is not much change, as the new threshold is very close to the earlier one,
but see that the recall for class 0 has improved!