Logistic Regression - Session 6 & 7
Moving Ahead
This sheet was prepared by Prakshaal Jain, based on Sessions 6 and 7 by Siby Sir
Logistic Regression
Managers!!!
If f(Z) lies between 0 and 0.5, then it is Class 0; otherwise it is Class 1.
Consider 0.5 as the threshold!
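The thresholding rule above can be sketched as a small function (a minimal sketch; in the fitted model, Z is the linear combination of the features and coefficients):

```python
import numpy as np

def sigmoid(z):
    """Logistic function f(Z) = 1 / (1 + e^(-Z)); maps any real Z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(z, threshold=0.5):
    """Class 0 if f(Z) < threshold, else Class 1."""
    return 0 if sigmoid(z) < threshold else 1
```

Note that sigmoid(0) = 0.5 exactly, so Z = 0 sits right on the decision boundary.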
In [9]:
# Import libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
In [10]:
credit=pd.read_csv('https://online.stat.psu.edu/onlinecourses/sites/stat508/files/germa
Data Exploration
In [11]:
credit.info()
<class 'pandas.core.frame.DataFrame'>
dtypes: int64(21)
Data Preprocessing
Do you remember what y and X are?
Tell me: which one is the Target and which are the Features?
In [12]:
# Target
y = credit['Creditability']
# Features
X = credit.drop(['Creditability'],axis=1)
X = sm.add_constant(X)
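For reference, `sm.add_constant(X)` simply prepends a column of ones (named `const`) so that statsmodels fits an intercept term. The pandas equivalent of what it produces (a sketch, not statsmodels itself):

```python
import pandas as pd

toy = pd.DataFrame({'x1': [1, 2, 3]})
toy_const = toy.copy()
toy_const.insert(0, 'const', 1.0)  # same shape of output as sm.add_constant(toy)
```

Unlike sklearn's `LogisticRegression`, `sm.Logit` does not add an intercept for you, which is why this step matters.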
In [18]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape
logReg_1=logReg_1.fit()
Iterations: 35
ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
Excerpt of logReg_1.summary2() — the nan p-values and enormous standard errors are a symptom of the failed convergence above:

                                  Coef.       Std.Err.        z   P>|z|         [0.025         0.975]
Length of current employment_1  -0.1827            nan      nan     nan            nan            nan
Length of current employment_2   0.1122            nan      nan     nan            nan            nan
Length of current employment_3   0.5661            nan      nan     nan            nan            nan
Length of current employment_4   0.1735            nan      nan     nan            nan            nan
Length of current employment_5   ...
Sex & Marital Status_1          -0.2274   3355443.2000  -0.0000  1.0000  -6576548.0516   6576547.5967
Sex & Marital Status_2           0.1296   3138729.7088   0.0000  1.0000  -6151797.0568   6151797.3161
Sex & Marital Status_3           0.5023   3720105.6914   0.0000  1.0000  -7291272.6715   7291273.6760
Sex & Marital Status_4           0.0018   3424634.8754   0.0000  1.0000  -6712161.0141   6712161.0178
pd.DataFrame(logReg_1.pvalues)[0]<0.05
...
No of dependents_2 False
Telephone_1 False
Telephone_2 False
In [25]:
significant_features=['Duration of Credit (month)','Credit Amount','Age (years)']
In [27]:
X_new_1 = X[significant_features]
In [29]:
X_train, X_test, y_train, y_test = train_test_split(X_new_1, y, test_size=0.2, random_state=10)
logReg_2 = sm.Logit(y_train, X_train)
logReg_2 = logReg_2.fit()
Iterations 5
In [30]:
logReg_2.summary2()
In [35]:
X_new_2 = X_new_1.drop(['Credit Amount'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X_new_2, y, test_size=0.2, random_state=10)
log_reg_3 = sm.Logit(y_train, X_train)
log_reg_3 = log_reg_3.fit()
Iterations 5
In [37]:
log_reg_3.summary2()
Prediction
In [38]:
y_pred = log_reg_3.predict(X_test)
In [44]:
# Collect actuals, predicted probabilities, and the 0.5-threshold class labels
pred_df = pd.DataFrame({'Actual_Class': y_test, 'Predicted_Prob': y_pred})
pred_df['Predicted_Class'] = pred_df['Predicted_Prob'].map(lambda x: 1 if x > 0.5 else 0)
pred_df.iloc[0:10]
     Actual_Class  Predicted_Prob  Predicted_Class
841             0        0.690284                1
956             0        0.884680                1
544             1        0.850290                1
173             1        0.657965                1
759             0        0.512259                1
955             0        0.866104                1
121             1        0.599699                1
230             1        0.696740                1
11              1        0.526841                1
120             1        0.545433                1
Performance measures!!
Sensitivity/ Recall
TPR = TP / (TP+FN)
Conditional prob of getting the predicted class as positive given that the actual class is positive.
Specificity:
TNR = TN /(FP+TN)
Conditional prob of getting the predicted class as Negative given that the actual class is Negative.
Precision:
= TP / (TP+FP)
Conditional prob of getting the actual class as positive given that the predicted class is positive.
F1-score:
= 2 * (Precision * Recall) / (Precision + Recall)
Accuracy:
= (TP+TN)/N
In [45]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(pred_df['Actual_Class'],pred_df['Predicted_Class'])
[[ 11 56]
[ 3 130]]
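Plugging the counts from the matrix above into the formulas (sklearn's convention is rows = actual, columns = predicted, so the layout is [[TN, FP], [FN, TP]]):

```python
# Counts read off the confusion matrix [[11, 56], [3, 130]]
TN, FP, FN, TP = 11, 56, 3, 130

recall      = TP / (TP + FN)            # sensitivity / TPR
specificity = TN / (FP + TN)            # TNR
precision   = TP / (TP + FP)
f1          = 2 * precision * recall / (precision + recall)
accuracy    = (TP + TN) / (TP + TN + FP + FN)

print(round(recall, 3), round(specificity, 3), round(precision, 3),
      round(f1, 3), round(accuracy, 3))
# → 0.977 0.164 0.699 0.815 0.705
```

Recall is high but specificity is terrible: at the default 0.5 threshold the model calls almost everything Class 1, which motivates the threshold tuning below.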
In [46]:
from sklearn.metrics import classification_report
report=classification_report(pred_df['Actual_Class'],pred_df['Predicted_Class'])
# distplot is deprecated in recent seaborn; histplot(..., kde=True) is its replacement
import seaborn as sns
import matplotlib.pyplot as plt
sns.histplot(pred_df[pred_df['Actual_Class']==0]['Predicted_Prob'], kde=True, color='b');
sns.histplot(pred_df[pred_df['Actual_Class']==1]['Predicted_Prob'], kde=True, color='g');
Model Optimization
ROC Curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. It plots two parameters: the True Positive Rate (Y-axis) against the False Positive Rate (X-axis).
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
The curve is obtained by plotting FPR = 1 - TNR along the X-axis and TPR along the Y-axis.
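On a tiny hand-made example, `roc_curve` returns one (FPR, TPR) point per candidate threshold, sweeping from the strictest threshold to the loosest:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr.tolist())  # [0.0, 0.0, 0.5, 0.5, 1.0]
print(tpr.tolist())  # [0.0, 0.5, 0.5, 1.0, 1.0]
print(roc_auc_score(y_true, y_score))  # 0.75
```

The AUC of 0.75 is the area under this staircase; a random classifier sits on the diagonal with AUC 0.5.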
from sklearn.metrics import roc_curve, roc_auc_score
fpr, tpr, threshold = roc_curve(pred_df['Actual_Class'], pred_df['Predicted_Prob'])
In [50]:
# Drawing curve
plt.plot(fpr, tpr);
plt.plot([0,1],[0,1],color='r');
plt.xlabel('FPR');
plt.ylabel('TPR');
score=roc_auc_score(pred_df['Actual_Class'],pred_df['Predicted_Prob'])
Youden Index!
Y =Max(TPR +TNR -1)
= Max(TPR- (1-TNR))
= Max(TPR-FPR)
In [53]:
# Making FPR TPR Data frame
fpr_tpr = pd.DataFrame({'fpr':fpr,'tpr':tpr,'threshold':threshold})
# Finding Difference
fpr_tpr['diff'] = fpr_tpr['tpr']-fpr_tpr['fpr']
# Sorting Values
fpr_tpr.sort_values('diff',ascending= False).head()
Managers!!
In [55]:
# Re-classify using the best (Youden) threshold found above
best_threshold = fpr_tpr.sort_values('diff', ascending=False).iloc[0]['threshold']
pred_df['Predicted_New_Class'] = pred_df['Predicted_Prob'].map(lambda x: 1 if x > best_threshold else 0)
new_report = classification_report(pred_df['Actual_Class'], pred_df['Predicted_New_Class'])
In [57]:
print('The new Classification Report:\n', new_report)