

Problem Statement
WHO is a specialized agency of the UN concerned with world population health. Based on
various parameters, WHO allocates budgets to different areas for campaigns/initiatives to
improve healthcare. Annual salary is an important variable considered when deciding the
budget to be allocated to an area.

We have data extracted from the 1994 Census database, containing 32,561 samples and a mix
of continuous and categorical variables (the 10 retained for this exercise are listed in
the data dictionary below).

The goal is to build a binary classification model to predict whether the salary is >50K or <=50K.

Data Dictionary

1. age: age in years
2. workclass: type of employment
3. education: highest education level attained
4. marrital status: marital status (column name misspelt in the source data)
5. occupation: occupation
6. sex: sex
7. capital gain: income from investment sources other than salary/wages
8. capital loss: losses from investment sources other than salary/wages
9. working hours: number of working hours per week
10. salary: salary band (<=50K or >50K)

In [1]: import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt
        import seaborn as sns
        from sklearn.model_selection import train_test_split, GridSearchCV
        from sklearn.linear_model import LogisticRegression
        from sklearn import metrics
        from sklearn.metrics import roc_auc_score, roc_curve, classification_report, confusion_matrix, plot_confusion_matrix

In [2]: adult_data = pd.read_csv("adult.data.csv")

EDA


In [3]: adult_data.head()

Out[3]:
   age         workclass  education     marrital status         occupation     sex  capital gain  capital loss  working hours per week  salary
0   39         State-gov  Bachelors       Never-married       Adm-clerical    Male          2174             0                      40   <=50K
1   50  Self-emp-not-inc  Bachelors  Married-civ-spouse    Exec-managerial    Male             0             0                      13   <=50K
2   38           Private    HS-grad            Divorced  Handlers-cleaners    Male             0             0                      40   <=50K
3   53           Private       11th  Married-civ-spouse  Handlers-cleaners    Male             0             0                      40   <=50K
4   28           Private  Bachelors  Married-civ-spouse     Prof-specialty  Female             0             0                      40   <=50K

In [4]: adult_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 32561 non-null object
2 education 32561 non-null object
3 marrital status 32561 non-null object
4 occupation 32561 non-null object
5 sex 32561 non-null object
6 capital gain 32561 non-null int64
7 capital loss 32561 non-null int64
8 working hours per week 32561 non-null int64
9 salary 32561 non-null object
dtypes: int64(4), object(6)
memory usage: 2.5+ MB

There are no missing values. 4 variables are numeric and the remaining 6 are categorical
(object dtype). The categorical variables are not yet in an encoded format.

Check for duplicate data


In [5]: dups = adult_data.duplicated()
        print('Number of duplicate rows = %d' % (dups.sum()))
        print(adult_data.shape)

Number of duplicate rows = 5864
(32561, 10)

Do we need to remove the duplicate data here? We remove it below, but as a rule duplicates
should be dropped only when they represent repeated records of the same observation rather
than genuinely identical but distinct cases.
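
Before dropping, it can be worth looking at the duplicated rows themselves to judge whether they are repeat records; a minimal sketch using the dataframe as loaded above:

# Show all rows that have at least one duplicate, grouped together
dup_rows = adult_data[adult_data.duplicated(keep=False)]
print(dup_rows.sort_values(list(adult_data.columns)).head(10))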

In [6]: adult_data.drop_duplicates(inplace=True)

In [7]: dups = adult_data.duplicated()
        print('Number of duplicate rows = %d' % (dups.sum()))
        print(adult_data.shape)

Number of duplicate rows = 0
(26697, 10)

Getting unique counts of all object columns

In [9]: for feature in adult_data.columns:
            if adult_data[feature].dtype == 'object':
                print(feature)
                print(adult_data[feature].value_counts())
                print('\n')

workclass
Private 17474
Self-emp-not-inc 2447
Local-gov 1980
? 1519
State-gov 1246
Self-emp-inc 1089
Federal-gov 921
Without-pay 14
Never-worked 7
Name: workclass, dtype: int64

education
HS-grad 7815
Some-college 5692
Bachelors 4461
Masters 1606
Assoc-voc 1281
Assoc-acdm 1036
...
'workclass' and 'occupation' have '?' values.
Since a high number of cases have '?', we will convert them into a new level.


In [10]: # Replace '?' with a new 'Unk' category
         # regex=False treats '?' as a literal string and avoids the pandas FutureWarning
         adult_data.workclass = adult_data.workclass.str.replace('?', 'Unk', regex=False)
         adult_data.occupation = adult_data.occupation.str.replace('?', 'Unk', regex=False)

In [11]: adult_data.describe()

Out[11]: age capital gain capital loss working hours per week

count 26697.000000 26697.000000 26697.000000 26697.000000

mean 39.987489 1304.600929 105.699330 40.852530

std 13.691269 8111.031099 441.214823 13.114255

min 17.000000 0.000000 0.000000 1.000000

25% 29.000000 0.000000 0.000000 38.000000

50% 39.000000 0.000000 0.000000 40.000000

75% 49.000000 0.000000 0.000000 46.000000

max 90.000000 99999.000000 4356.000000 99.000000

Checking the spread of the data using boxplot for the continuous
variables.


In [12]: cols = ['age', 'capital gain', 'capital loss', 'working hours per week']
         for i in cols:
             sns.boxplot(x=adult_data[i], whis=1.5)   # passed as keyword arg to avoid the seaborn FutureWarning
             plt.grid()
             plt.show();

Treating the outliers.

We can treat outliers with the following code. We will treat the outliers for the
'working hours per week' variable only.

In [13]: def remove_outlier(col):
             Q1, Q3 = np.percentile(col, [25, 75])
             IQR = Q3 - Q1
             lower_range = Q1 - (1.5 * IQR)
             upper_range = Q3 + (1.5 * IQR)
             return lower_range, upper_range

In [14]: lr, ur = remove_outlier(adult_data['working hours per week'])
         print('Lower Range :', lr, '\nUpper Range :', ur)
         adult_data['working hours per week'] = np.where(adult_data['working hours per week'] > ur, ur,
                                                         adult_data['working hours per week'])
         adult_data['working hours per week'] = np.where(adult_data['working hours per week'] < lr, lr,
                                                         adult_data['working hours per week'])

Lower Range : 26.0
Upper Range : 58.0
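
For reference, the same capping can be written more compactly with pandas' clip; a sketch equivalent to the two np.where calls above:

# Cap all values outside [lr, ur] at the range boundaries
adult_data['working hours per week'] = adult_data['working hours per week'].clip(lower=lr, upper=ur)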


In [15]: ## This is a loop to treat outliers for all the non-'object' type variables
         # for column in adult_data.columns:
         #     if adult_data[column].dtype != 'object':
         #         lr, ur = remove_outlier(adult_data[column])
         #         adult_data[column] = np.where(adult_data[column] > ur, ur, adult_data[column])
         #         adult_data[column] = np.where(adult_data[column] < lr, lr, adult_data[column])

In [16]: cols = ['age', 'capital gain', 'capital loss', 'working hours per week']
         for i in cols:
             sns.boxplot(x=adult_data[i], whis=1.5)   # keyword arg avoids the seaborn FutureWarning
             plt.grid()
             plt.show();

Checking for Correlations.

In [17]: adult_data.corr()

Out[17]: age capital gain capital loss working hours per week

age 1.000000 0.068974 0.039005 0.037001

capital gain 0.068974 1.000000 -0.038534 0.085196

capital loss 0.039005 -0.038534 1.000000 0.055047

working hours per week 0.037001 0.085196 0.055047 1.000000


In [18]: plt.figure(figsize=(12,7))
         sns.heatmap(adult_data.corr(), annot=True, mask=np.triu(adult_data.corr(), +1));

In [19]: adult_data.describe()

Out[19]: age capital gain capital loss working hours per week

count 26697.000000 26697.000000 26697.000000 26697.000000

mean 39.987489 1304.600929 105.699330 41.169682

std 13.691269 8111.031099 441.214823 9.029725

min 17.000000 0.000000 0.000000 26.000000

25% 29.000000 0.000000 0.000000 38.000000

50% 39.000000 0.000000 0.000000 40.000000

75% 49.000000 0.000000 0.000000 46.000000

max 90.000000 99999.000000 4356.000000 58.000000

There is hardly any correlation between the numeric variables.



In [20]: # Pairplot using sns
         sns.pairplot(adult_data, diag_kind='hist', hue='salary');

Converting all objects to categorical codes


In [21]: ## We are coding up the 'education' variable in an ordinal manner
         adult_data['education']=np.where(adult_data['education'] =='Preschool', '1', adult_data['education'])
         adult_data['education']=np.where(adult_data['education'] =='1st-4th', '2', adult_data['education'])
         adult_data['education']=np.where(adult_data['education'] =='5th-6th', '3', adult_data['education'])
         adult_data['education']=np.where(adult_data['education'] =='7th-8th', '4', adult_data['education'])
         adult_data['education']=np.where(adult_data['education'] =='9th', '5', adult_data['education'])
         adult_data['education']=np.where(adult_data['education'] =='10th', '6', adult_data['education'])
         adult_data['education']=np.where(adult_data['education'] =='11th', '7', adult_data['education'])
         adult_data['education']=np.where(adult_data['education'] =='12th', '8', adult_data['education'])
         adult_data['education']=np.where(adult_data['education'] =='HS-grad', '9', adult_data['education'])
         adult_data['education']=np.where(adult_data['education'] =='Prof-school', '9', adult_data['education'])  # code truncated in the source; '9' assumed
         adult_data['education']=np.where(adult_data['education'] =='Assoc-acdm', '10', adult_data['education'])
         adult_data['education']=np.where(adult_data['education'] =='Assoc-voc', '11', adult_data['education'])
         adult_data['education']=np.where(adult_data['education'] =='Some-college', '12', adult_data['education'])
         adult_data['education']=np.where(adult_data['education'] =='Bachelors', '13', adult_data['education'])
         adult_data['education']=np.where(adult_data['education'] =='Masters', '14', adult_data['education'])
         adult_data['education']=np.where(adult_data['education'] =='Doctorate', '15', adult_data['education'])
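
The chain of np.where calls above can also be expressed as a single replace with a mapping dict; a sketch mirroring the codes assigned above (including the assumed 'Prof-school' code):

# One-shot ordinal encoding; equivalent to the np.where chain
education_order = {'Preschool': '1', '1st-4th': '2', '5th-6th': '3', '7th-8th': '4',
                   '9th': '5', '10th': '6', '11th': '7', '12th': '8',
                   'HS-grad': '9', 'Prof-school': '9',
                   'Assoc-acdm': '10', 'Assoc-voc': '11', 'Some-college': '12',
                   'Bachelors': '13', 'Masters': '14', 'Doctorate': '15'}
adult_data['education'] = adult_data['education'].replace(education_order)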

In [22]: ## We are grouping certain types of 'workclass' under different categories
         adult_data['workclass']=np.where(adult_data['workclass'] =='Federal-gov', 'Government', adult_data['workclass'])
         adult_data['workclass']=np.where(adult_data['workclass'] =='Local-gov', 'Government', adult_data['workclass'])
         adult_data['workclass']=np.where(adult_data['workclass'] =='State-gov', 'Government', adult_data['workclass'])

         adult_data['workclass']=np.where(adult_data['workclass'] =='Self-emp-inc', 'Others', adult_data['workclass'])
         adult_data['workclass']=np.where(adult_data['workclass'] =='Self-emp-not-inc', 'Others', adult_data['workclass'])
         adult_data['workclass']=np.where(adult_data['workclass'] =='Without-pay', 'Others', adult_data['workclass'])
         adult_data['workclass']=np.where(adult_data['workclass'] =='Never-worked', 'Others', adult_data['workclass'])
         # '?' was already recoded to 'Unk' above, which remains its own level

In [23]: ## We are grouping certain types of 'marrital status' under different categories
         adult_data['marrital status']=np.where(adult_data['marrital status'] =='Divorced', 'CurrentlySingle', adult_data['marrital status'])
         adult_data['marrital status']=np.where(adult_data['marrital status'] =='Separated', 'CurrentlySingle', adult_data['marrital status'])
         adult_data['marrital status']=np.where(adult_data['marrital status'] =='Never-married', 'CurrentlySingle', adult_data['marrital status'])
         adult_data['marrital status']=np.where(adult_data['marrital status'] =='Widowed', 'CurrentlySingle', adult_data['marrital status'])

         adult_data['marrital status']=np.where(adult_data['marrital status'] =='Married-civ-spouse', 'Married', adult_data['marrital status'])
         adult_data['marrital status']=np.where(adult_data['marrital status'] =='Married-spouse-absent', 'Married', adult_data['marrital status'])
         adult_data['marrital status']=np.where(adult_data['marrital status'] =='Married-AF-spouse', 'Married', adult_data['marrital status'])


In [24]: ## We are grouping certain types of 'occupation' under different categories
         adult_data['occupation']=np.where(adult_data['occupation'] =='Adm-clerical', 'WhiteCollar', adult_data['occupation'])
         adult_data['occupation']=np.where(adult_data['occupation'] =='Exec-managerial', 'WhiteCollar', adult_data['occupation'])

         adult_data['occupation']=np.where(adult_data['occupation'] =='Craft-repair', 'BlueCollar', adult_data['occupation'])
         adult_data['occupation']=np.where(adult_data['occupation'] =='Handlers-cleaners', 'BlueCollar', adult_data['occupation'])
         adult_data['occupation']=np.where(adult_data['occupation'] =='Transport-moving', 'BlueCollar', adult_data['occupation'])
         adult_data['occupation']=np.where(adult_data['occupation'] =='Farming-fishing', 'BlueCollar', adult_data['occupation'])
         adult_data['occupation']=np.where(adult_data['occupation'] =='Machine-op-inspct', 'BlueCollar', adult_data['occupation'])

         adult_data['occupation']=np.where(adult_data['occupation'] =='Tech-support', 'Service', adult_data['occupation'])
         adult_data['occupation']=np.where(adult_data['occupation'] =='Other-service', 'Service', adult_data['occupation'])
         adult_data['occupation']=np.where(adult_data['occupation'] =='Protective-serv', 'Service', adult_data['occupation'])
         adult_data['occupation']=np.where(adult_data['occupation'] =='Priv-house-serv', 'Service', adult_data['occupation'])
         adult_data['occupation']=np.where(adult_data['occupation'] =='Prof-specialty', 'Service', adult_data['occupation'])

         # '?' was already recoded to 'Unk' above, which remains its own level
         adult_data['occupation']=np.where(adult_data['occupation'] =='Armed-Forces', 'Others', adult_data['occupation'])  # target truncated in the source; 'Others' assumed

In [25]: adult_data.head()

Out[25]:
   age   workclass  education  marrital status   occupation     sex  capital gain  capital loss  working hours per week  salary
0   39  Government         13  CurrentlySingle  WhiteCollar    Male          2174             0                    40.0   <=50K
1   50      Others         13          Married  WhiteCollar    Male             0             0                    26.0   <=50K
2   38     Private          9  CurrentlySingle   BlueCollar    Male             0             0                    40.0   <=50K
3   53     Private          7          Married   BlueCollar    Male             0             0                    40.0   <=50K
4   28     Private         13          Married      Service  Female             0             0                    40.0   <=50K

In [26]: adult_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26697 entries, 0 to 32560
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 26697 non-null int64
1 workclass 26697 non-null object
2 education 26697 non-null object
3 marrital status 26697 non-null object
4 occupation 26697 non-null object
5 sex 26697 non-null object
6 capital gain 26697 non-null int64
7 capital loss 26697 non-null int64
8 working hours per week 26697 non-null float64
9 salary 26697 non-null object
dtypes: float64(1), int64(3), object(6)
memory usage: 2.2+ MB


In [27]: ## Converting the education variable to numeric
         adult_data['education'] = adult_data['education'].astype('int64')
         adult_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26697 entries, 0 to 32560
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 26697 non-null int64
1 workclass 26697 non-null object
2 education 26697 non-null int64
3 marrital status 26697 non-null object
4 occupation 26697 non-null object
5 sex 26697 non-null object
6 capital gain 26697 non-null int64
7 capital loss 26697 non-null int64
8 working hours per week 26697 non-null float64
9 salary 26697 non-null object
dtypes: float64(1), int64(4), object(5)
memory usage: 2.2+ MB

In [28]: ## Converting the 'salary' variable into numeric by using the LabelEncoder from sklearn
         from sklearn.preprocessing import LabelEncoder

         ## Defining a Label Encoder object instance
         LE = LabelEncoder()

In [29]: ## Applying the created Label Encoder object to the target class
         ## Assigning 0 to <=50K and 1 to >50K
         adult_data['salary'] = LE.fit_transform(adult_data['salary'])
         adult_data.head()

Out[29]:
   age   workclass  education  marrital status   occupation     sex  capital gain  capital loss  working hours per week  salary
0   39  Government         13  CurrentlySingle  WhiteCollar    Male          2174             0                    40.0       0
1   50      Others         13          Married  WhiteCollar    Male             0             0                    26.0       0
2   38     Private          9  CurrentlySingle   BlueCollar    Male             0             0                    40.0       0
3   53     Private          7          Married   BlueCollar    Male             0             0                    40.0       0
4   28     Private         13          Married      Service  Female             0             0                    40.0       0
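
To double-check which class received which code, a fitted LabelEncoder keeps the original levels, in encoded order, in its classes_ attribute; a quick check:

# LE.classes_[i] is the original label that was encoded as i
print(LE.classes_)                  # expected: ['<=50K' '>50K']
print(LE.transform(LE.classes_))    # expected: [0 1]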


In [30]: ## Converting the other 'object' type variables to dummy variables
         adult_data_dummy = pd.get_dummies(adult_data, drop_first=True)
         adult_data_dummy.head()

Out[30]:
   age  education  capital gain  capital loss  working hours per week  salary  workclass_Others  workclass_Private  ...
0   39         13          2174             0                    40.0       0                 0                  0  ...
1   50         13             0             0                    26.0       0                 1                  0  ...
2   38          9             0             0                    40.0       0                 0                  1  ...
3   53          7             0             0                    40.0       0                 0                  1  ...
4   28         13             0             0                    40.0       0                 0                  1  ...
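
The display above truncates the wider dummy frame, so it can help to list the generated columns explicitly; a small sketch (the exact dummy names depend on the grouped levels):

# Full set of columns produced by get_dummies, plus the frame's shape
print(adult_data_dummy.columns.tolist())
print(adult_data_dummy.shape)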

Train Test Split

In [31]: # Copy all the predictor variables into X dataframe
         X = adult_data_dummy.drop('salary', axis=1)

         # Copy target into the y dataframe.
         y = adult_data_dummy['salary']

In [32]: # Split X and y into training and test set in 70:30 ratio
         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)  # random_state truncated in the source; 1 assumed

In [33]: y_train.value_counts(1)

Out[33]: 0 0.736876
1 0.263124
Name: salary, dtype: float64

In [34]: y_test.value_counts(1)

Out[34]: 0 0.736954
1 0.263046
Name: salary, dtype: float64
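
The class proportions happen to match closely across train and test here; to guarantee this, train_test_split accepts a stratify argument. A sketch (variable names are illustrative, and the random_state is the one assumed above):

# Stratified split preserves the 0/1 salary ratio exactly in both partitions
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
print(ys_train.value_counts(1), ys_test.value_counts(1))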

Logistic Regression Model

We are making some adjustments to the parameters of the LogisticRegression class to get
better accuracy. Details can be found in the scikit-learn documentation linked below.

scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

Argument: solver {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’


Algorithm to use in the optimization problem.

For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster
for large ones.

For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle
multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.

‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty

‘liblinear’ and ‘saga’ also handle L1 penalty

‘saga’ also supports ‘elasticnet’ penalty

‘liblinear’ does not support setting penalty='none'

Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with
approximately the same scale. You can preprocess the data with a scaler from
sklearn.preprocessing.

New in version 0.17: Stochastic Average Gradient descent solver.

New in version 0.19: SAGA solver.

Changed in version 0.22: The default solver changed from ‘liblinear’ to ‘lbfgs’ in
0.22.

Article on Solvers (https://towardsdatascience.com/dont-sweat-the-solver-stuff-aea7cddc3451)
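
As the documentation quoted above notes, 'sag' and 'saga' converge reliably only when features are on roughly the same scale. A minimal sketch of scaling inside a pipeline, so the scaler is fit on the training data only; the 'saga' choice here is illustrative, not the model fitted below:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features, then fit logistic regression with a gradient-based solver
scaled_lr = Pipeline([('scale', StandardScaler()),
                      ('lr', LogisticRegression(solver='saga', max_iter=10000))])
scaled_lr.fit(X_train, y_train)
print(scaled_lr.score(X_test, y_test))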

In [35]: # Fit the Logistic Regression model
         model = LogisticRegression(solver='newton-cg', max_iter=10000, penalty='none', n_jobs=2, verbose=True)
         model.fit(X_train, y_train)

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 1 out of 1 | elapsed: 6.1s finished

Out[35]: LogisticRegression(max_iter=10000, n_jobs=2, penalty='none', solver='newton-cg',
                            verbose=True)

Predicting on Training and Test dataset



In [36]: ytrain_predict = model.predict(X_train)
         ytest_predict = model.predict(X_test)

Getting the Predicted Classes and Probabilities

In [37]: ytest_predict_prob = model.predict_proba(X_test)
         pd.DataFrame(ytest_predict_prob).head()

Out[37]: 0 1

0 0.563210 0.436790

1 0.005707 0.994293

2 0.933151 0.066849

3 0.761364 0.238636

4 0.716087 0.283913
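
predict() applies a default 0.5 cut-off to these probabilities; with roughly 26% positives it can be instructive to derive labels at a different threshold. A sketch with an illustrative 0.4 cut-off:

# Classify as 1 when P(salary > 50K) exceeds a chosen cut-off
threshold = 0.4                     # illustrative value, not tuned
ytest_predict_04 = (ytest_predict_prob[:, 1] > threshold).astype(int)
print(pd.Series(ytest_predict_04).value_counts())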

Model Evaluation
In [38]: # Accuracy - Training Data
         model.score(X_train, y_train)

Out[38]: 0.8265104083052389

AUC and ROC for the training data


In [39]: # predict probabilities
         probs = model.predict_proba(X_train)
         # keep probabilities for the positive outcome only
         probs = probs[:, 1]
         # calculate AUC
         auc = roc_auc_score(y_train, probs)
         print('AUC: %.3f' % auc)
         # calculate roc curve
         train_fpr, train_tpr, train_thresholds = roc_curve(y_train, probs)
         plt.plot([0, 1], [0, 1], linestyle='--')
         # plot the roc curve for the model
         plt.plot(train_fpr, train_tpr);

AUC: 0.881

In [40]: # Accuracy - Test Data
         model.score(X_test, y_test)

Out[40]: 0.8213483146067416

AUC and ROC for the test data


In [41]: # predict probabilities
         probs = model.predict_proba(X_test)
         # keep probabilities for the positive outcome only
         probs = probs[:, 1]
         # calculate AUC
         test_auc = roc_auc_score(y_test, probs)
         print('AUC: %.3f' % test_auc)   # the original printed the training 'auc' here by mistake
         # calculate roc curve
         test_fpr, test_tpr, test_thresholds = roc_curve(y_test, probs)
         plt.plot([0, 1], [0, 1], linestyle='--')
         # plot the roc curve for the model
         plt.plot(test_fpr, test_tpr);

Confusion Matrix for the training data

In [42]: confusion_matrix(y_train, ytrain_predict)

Out[42]: array([[12674,  1096],
                [ 2146,  2771]], dtype=int64)


In [43]: plot_confusion_matrix(model, X_train, y_train);

In [44]: print(classification_report(y_train, ytrain_predict))

              precision    recall  f1-score   support

           0       0.86      0.92      0.89     13770
           1       0.72      0.56      0.63      4917

    accuracy                           0.83     18687
   macro avg       0.79      0.74      0.76     18687
weighted avg       0.82      0.83      0.82     18687
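
These figures tie directly back to the confusion matrix above; a quick verification for class 1:

# From Out[42]: TN = 12674, FP = 1096, FN = 2146, TP = 2771
tp, fp, fn = 2771, 1096, 2146
precision_1 = tp / (tp + fp)                                   # 2771 / 3867 ≈ 0.72
recall_1 = tp / (tp + fn)                                      # 2771 / 4917 ≈ 0.56
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)   # ≈ 0.63
print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))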

Confusion Matrix for test data

In [45]: confusion_matrix(y_test, ytest_predict)

Out[45]: array([[5412,  491],
                [ 940, 1167]], dtype=int64)


In [46]: plot_confusion_matrix(model, X_test, y_test);

In [47]: print(classification_report(y_test, ytest_predict))

              precision    recall  f1-score   support

           0       0.85      0.92      0.88      5903
           1       0.70      0.55      0.62      2107

    accuracy                           0.82      8010
   macro avg       0.78      0.74      0.75      8010
weighted avg       0.81      0.82      0.81      8010

Applying GridSearchCV for Logistic Regression


In [48]: grid = {'penalty': ['l2', 'none'],
                 'solver': ['sag', 'lbfgs'],
                 'tol': [0.0001, 0.00001]}

In [49]: model = LogisticRegression(max_iter=10000, n_jobs=2)

In [50]: grid_search = GridSearchCV(estimator=model, param_grid=grid, cv=3, n_jobs=-1, scoring='f1')

In [51]: grid_search.fit(X_train, y_train)

Out[51]: GridSearchCV(cv=3, estimator=LogisticRegression(max_iter=10000, n_jobs=2),
                      n_jobs=-1,
                      param_grid={'penalty': ['l2', 'none'], 'solver': ['sag', 'lbfgs'],
                                  'tol': [0.0001, 1e-05]},
                      scoring='f1')


In [52]: print(grid_search.best_params_, '\n')
         print(grid_search.best_estimator_)

{'penalty': 'l2', 'solver': 'lbfgs', 'tol': 0.0001}

LogisticRegression(max_iter=10000, n_jobs=2)
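
Beyond the best estimator, the fitted GridSearchCV object also exposes the best cross-validated f1 score and the full grid of results; a short sketch:

# Mean cross-validated f1 of the winning parameter combination
print(grid_search.best_score_)

# Mean test score for every combination tried
cv_results = pd.DataFrame(grid_search.cv_results_)
print(cv_results[['params', 'mean_test_score', 'rank_test_score']])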

In [53]: best_model = grid_search.best_estimator_

In [54]: # Predictions on the training and test sets
         ytrain_predict = best_model.predict(X_train)
         ytest_predict = best_model.predict(X_test)

In [55]: ## Getting the probabilities on the test set
         ytest_predict_prob = best_model.predict_proba(X_test)
         pd.DataFrame(ytest_predict_prob).head()

Out[55]: 0 1

0 0.569658 0.430342

1 0.005321 0.994679

2 0.929277 0.070723

3 0.769657 0.230343

4 0.697778 0.302222


In [56]: ## Confusion matrix on the training data
         plot_confusion_matrix(best_model, X_train, y_train)
         print(classification_report(y_train, ytrain_predict), '\n');

              precision    recall  f1-score   support

           0       0.85      0.92      0.89     13770
           1       0.71      0.56      0.63      4917

    accuracy                           0.83     18687
   macro avg       0.78      0.74      0.76     18687
weighted avg       0.82      0.83      0.82     18687


In [57]: ## Confusion matrix on the test data
         plot_confusion_matrix(best_model, X_test, y_test)
         print(classification_report(y_test, ytest_predict), '\n');

              precision    recall  f1-score   support

           0       0.85      0.92      0.88      5903
           1       0.70      0.55      0.62      2107

    accuracy                           0.82      8010
   macro avg       0.78      0.73      0.75      8010
weighted avg       0.81      0.82      0.81      8010

You can select other parameters to perform GridSearchCV and try to optimize the desired metric.

Note: Alternatively, one-hot encoding can be applied to the categorical variables instead of
label encoding before building the logistic regression model. Do experiment with one-hot
encoding as well.
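
A minimal sketch of that alternative, assuming adult_data still holds the grouped object columns and the label-encoded salary as above ('education' was already made ordinal, so re-read the raw CSV if you want to one-hot it too):

# One-hot encode the remaining object-type predictors; drop_first avoids the dummy trap
X_ohe = pd.get_dummies(adult_data.drop('salary', axis=1), drop_first=True)
y_ohe = adult_data['salary']
Xo_train, Xo_test, yo_train, yo_test = train_test_split(
    X_ohe, y_ohe, test_size=0.30, random_state=1)
ohe_model = LogisticRegression(max_iter=10000).fit(Xo_train, yo_train)
print(ohe_model.score(Xo_test, yo_test))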

Running in Google Colab


Importing jupyter notebook

1. Login to Google
2. Go to drive.google.com
3. Upload the jupyter notebook file into the drive
4. Double-click it, or right click -> open with -> Google Colaboratory

Alternatively:

1. Login to Google
2. Go to https://colab.research.google.com/notebooks/intro.ipynb#recent=true
3. Upload the jupyter notebook

Loading dataset into colab

Use the below code to load the dataset:

from google.colab import files
uploaded = files.upload()  # upload file here from local

import io
import pandas as pd
df2 = pd.read_csv(io.BytesIO(uploaded['Filename.csv']))  # give the filename in quotes

Go to Runtime > change Runtime type > check if it points to Python

Happy Learning

