Predictive Modelling - Logistic Regression - Mentor Version-1 - Jupyter Notebook
Problem Statement
The WHO is a specialized agency of the UN concerned with international public health. Based on various parameters, the WHO allocates budgets to different areas to conduct campaigns and initiatives to improve healthcare. Annual salary is an important variable considered when deciding the budget to be allocated to an area.
We have data containing 32561 samples and 15 continuous and categorical variables, extracted from the 1994 Census dataset.
The goal is to build a binary classification model to predict whether salary is >50K or <=50K.
Data Dictionary
1. age: age
2. workclass: work class
3. education: highest education attained
4. marrital status: marital status
5. occupation: occupation
6. sex: sex
7. capital gain: gains from investment sources other than salary/wages
8. capital loss: losses from investment sources other than salary/wages
9. working hours: number of working hours per week
10. salary: salary (target: >50K or <=50K)
In [2]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

adult_data = pd.read_csv("adult.data.csv")
EDA
In [3]: adult_data.head()
Out[3]:
   age  workclass         education  marrital status     occupation         sex     capital gain  capital loss  working hours per week  salary
0   39  State-gov         Bachelors  Never-married       Adm-clerical       Male            2174             0                      40   <=50K
1   50  Self-emp-not-inc  Bachelors  Married-civ-spouse  Exec-managerial    Male               0             0                      13   <=50K
2   38  Private           HS-grad    Divorced            Handlers-cleaners  Male               0             0                      40   <=50K
3   53  Private           11th       Married-civ-spouse  Handlers-cleaners  Male               0             0                      40   <=50K
4   28  Private           Bachelors  Married-civ-spouse  Prof-specialty     Female             0             0                      40   <=50K
In [4]: adult_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 32561 non-null object
2 education 32561 non-null object
3 marrital status 32561 non-null object
4 occupation 32561 non-null object
5 sex 32561 non-null object
6 capital gain 32561 non-null int64
7 capital loss 32561 non-null int64
8 working hours per week 32561 non-null int64
9 salary 32561 non-null object
dtypes: int64(4), object(6)
memory usage: 2.5+ MB
There are no missing values. Four variables are numeric (int64) and the remaining six are categorical (object). The categorical variables are not yet in encoded form.
Do we need to remove duplicate data here? We have removed the duplicates below, but in which cases should duplicate data be removed?
In [6]: adult_data.drop_duplicates(inplace=True)
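As a rule of thumb, exact duplicate rows are dropped when they are accidental repeats (e.g. a record loaded twice) rather than genuinely distinct observations; with no unique ID column, identical rows here are treated as repeats. A minimal sketch on a toy frame (data values are illustrative only):

```python
import pandas as pd

# Toy frame with one exact duplicate row (row 2 repeats row 0)
df = pd.DataFrame({"age": [39, 50, 39], "salary": ["<=50K", ">50K", "<=50K"]})

n_dupes = df.duplicated().sum()  # count rows identical to an earlier row
df = df.drop_duplicates()        # keep only the first occurrence
```

Checking `duplicated().sum()` before dropping tells you how many rows will be lost.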
workclass
Private 17474
Self-emp-not-inc 2447
Local-gov 1980
? 1519
State-gov 1246
Self-emp-inc 1089
Federal-gov 921
Without-pay 14
Never-worked 7
Name: workclass, dtype: int64
education
HS-grad 7815
Some-college 5692
Bachelors 4461
Masters 1606
Assoc-voc 1281
Assoc-acdm 1036
...
'workclass' and 'occupation' have '?' values.
Since a high number of cases contain '?', we will convert them into a new level.
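The conversion can be sketched as below. The new level's name is an assumption ('Others', chosen to match the workclass_Others dummy column that appears later in the notebook); the data is a toy column:

```python
import pandas as pd

# Toy column with '?' placeholders, as in 'workclass' and 'occupation'
df = pd.DataFrame({"workclass": ["Private", "?", "State-gov", "?"]})

# Keep the rows but give '?' its own category instead of dropping the data
df["workclass"] = df["workclass"].replace("?", "Others")
```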
In [11]: adult_data.describe()
Out[11]: (summary statistics for age, capital gain, capital loss, working hours per week)
Checking the spread of the data using boxplots for the continuous variables.
We can treat outliers with the following code. Here we will treat the outliers for the 'age' variable only.
In [15]: ## This is a loop to treat outliers for all the non-'object' type variables

# for column in adult_data.columns:
#     if adult_data[column].dtype != 'object':
#         lr, ur = remove_outlier(adult_data[column])
#         adult_data[column] = np.where(adult_data[column] > ur, ur, adult_data[column])
#         adult_data[column] = np.where(adult_data[column] < lr, lr, adult_data[column])
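The loop above calls a `remove_outlier` helper whose definition is not shown in this extract. A common IQR-based version (an assumption about what the notebook used, on toy data) looks like this:

```python
import numpy as np
import pandas as pd

def remove_outlier(col):
    # IQR rule: cap values beyond 1.5 * IQR from the quartiles
    q1, q3 = col.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

s = pd.Series([10, 12, 11, 13, 100])  # 100 is an obvious outlier
lr, ur = remove_outlier(s)            # -> 8.0, 16.0
capped = np.where(s > ur, ur, np.where(s < lr, lr, s))
```

Capping (rather than dropping) keeps the sample size intact while limiting the influence of extreme values.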
In [17]: adult_data.corr()
Out[17]: (correlation matrix over age, capital gain, capital loss, working hours per week)
In [18]: plt.figure(figsize=(12,7))
sns.heatmap(adult_data.corr(), annot=True, mask=np.triu(adult_data.corr(), +1))
In [19]: adult_data.describe()
Out[19]: (summary statistics for age, capital gain, capital loss, working hours per week)
In [25]: adult_data.head()
Out[25]: (first five rows: age, workclass, education, marrital status, occupation, sex, capital gain, capital loss, working hours per week, salary)
In [26]: adult_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 26697 entries, 0 to 32560
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 26697 non-null int64
1 workclass 26697 non-null object
2 education 26697 non-null object
3 marrital status 26697 non-null object
4 occupation 26697 non-null object
5 sex 26697 non-null object
6 capital gain 26697 non-null int64
7 capital loss 26697 non-null int64
8 working hours per week 26697 non-null float64
9 salary 26697 non-null object
dtypes: float64(1), int64(3), object(6)
memory usage: 2.2+ MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 26697 entries, 0 to 32560
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 26697 non-null int64
1 workclass 26697 non-null object
2 education 26697 non-null int64
3 marrital status 26697 non-null object
4 occupation 26697 non-null object
5 sex 26697 non-null object
6 capital gain 26697 non-null int64
7 capital loss 26697 non-null int64
8 working hours per week 26697 non-null float64
9 salary 26697 non-null object
dtypes: float64(1), int64(4), object(5)
memory usage: 2.2+ MB
In [28]: ## Converting the 'salary' variable into numeric using the LabelEncoder from sklearn
from sklearn.preprocessing import LabelEncoder

## Defining a LabelEncoder object instance
LE = LabelEncoder()
In [29]: ## Applying the created LabelEncoder object to the target class
## Assigning 0 to <=50K and 1 to >50K

adult_data['salary'] = LE.fit_transform(adult_data['salary'])
adult_data.head()
Out[29]: (first five rows of adult_data with 'salary' now encoded as 0/1)
Out[30]:
   age  education  capital gain  capital loss  working hours per week  salary  workclass_Others  workclass_Private  ...
0   39         13          2174             0                    40.0       0                 0                  0  ...
1   50         13             0             0                    26.0       0                 1                  0  ...
2   38          9             0             0                    40.0       0                 0                  1  ...
3   53          7             0             0                    40.0       0                 0                  1  ...
4   28         13             0             0                    40.0       0                 0                  1  ...
In [32]: # Split X and y into training and test sets in a 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)  # random_state fixed for reproducibility; the exact value is an assumption
In [33]: y_train.value_counts(1)
Out[33]: 0 0.736876
1 0.263124
Name: salary, dtype: float64
In [34]: y_test.value_counts(1)
Out[34]: 0 0.736954
1 0.263046
Name: salary, dtype: float64
scikit-learn LogisticRegression documentation (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
localhost:8888/notebooks/Downloads/Predictive Modelling - Logistic Regression - Mentor Version-1.ipynb 12/22
4/27/2021 Predictive Modelling - Logistic Regression - Mentor Version-1 - Jupyter Notebook
For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster
for large ones.
For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle
multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.
Note that ‘sag’ and ‘saga’ fast convergence is only guaranteed on features with
approximately the same scale. You can preprocess the data with a scaler from
sklearn.preprocessing.
Changed in version 0.22: The default solver changed from ‘liblinear’ to ‘lbfgs’ in
0.22.
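Following the scaling advice above, a scaler and the classifier can be chained in a pipeline so the model always sees standardized features. A minimal sketch on synthetic toy data (not the adult dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data; scaling first helps 'sag'/'saga' (and 'lbfgs') converge quickly
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(solver="lbfgs", max_iter=1000))
clf.fit(X, y)
train_acc = clf.score(X, y)
```

The pipeline ensures the scaler is fit on training data only, avoiding leakage into any held-out set.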
In [37]: ytest_predict_prob = model.predict_proba(X_test)
pd.DataFrame(ytest_predict_prob).head()
Out[37]: 0 1
0 0.563210 0.436790
1 0.005707 0.994293
2 0.933151 0.066849
3 0.761364 0.238636
4 0.716087 0.283913
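These probabilities can be turned into class labels with a custom cut-off instead of the 0.5 that .predict() uses by default. A sketch using the five class-1 probabilities shown above (the 0.3 cut-off is an illustrative choice, not the notebook's):

```python
import numpy as np

# Class-1 probabilities, as in model.predict_proba(X_test)[:, 1] above
p1 = np.array([0.436790, 0.994293, 0.066849, 0.238636, 0.283913])

# Default behaviour of .predict(): label 1 when P(class 1) >= 0.5
default_labels = (p1 >= 0.5).astype(int)

# A lower cut-off flags more records as >50K (useful when recall on >50K matters)
lenient_labels = (p1 >= 0.3).astype(int)
```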
Model Evaluation
In [38]: # Accuracy - Training Data
model.score(X_train, y_train)
Out[38]: 0.8265104083052389
AUC - Training Data: 0.881
Out[40]: 0.8213483146067416 (Accuracy - Test Data)
AUC - Test Data: 0.881
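The cell that computes the AUC values is not shown in this extract. One standard way (an assumption about the notebook's method) is roc_auc_score on the class-1 probabilities, sketched here on toy data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted class-1 probabilities
y_true = np.array([0, 1, 0, 0, 1])
p1 = np.array([0.44, 0.99, 0.07, 0.64, 0.61])

# AUC = fraction of (positive, negative) pairs ranked in the correct order
auc = roc_auc_score(y_true, p1)
```

Here 5 of the 6 positive/negative pairs are ranked correctly, so the AUC is 5/6.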
In [43]: plot_confusion_matrix(model, X_train, y_train);
In [46]: plot_confusion_matrix(model, X_test, y_test);
In [52]: print(grid_search.best_params_, '\n')
print(grid_search.best_estimator_)
LogisticRegression(max_iter=10000, n_jobs=2)
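The cell defining the grid search is not shown in this extract. A sketch of how such a search is typically set up, with a hypothetical parameter grid and synthetic toy data in place of the encoded adult_data features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the encoded adult_data features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical grid; the notebook's actual grid is not shown
param_grid = {"C": [0.01, 0.1, 1, 10]}
grid_search = GridSearchCV(LogisticRegression(max_iter=10000),
                           param_grid, cv=3, scoring="f1")
grid_search.fit(X, y)
best_C = grid_search.best_params_["C"]
```

After fitting, best_params_ and best_estimator_ hold the winning configuration, as printed in the cell above.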
Out[55]: 0 1
0 0.569658 0.430342
1 0.005321 0.994679
2 0.929277 0.070723
3 0.769657 0.230343
4 0.697778 0.302222
You can select other parameters for GridSearchCV and try to optimize the desired metric.
Note: Alternatively, one-hot encoding can be used instead of label encoding on the categorical variables before building the logistic regression model. Do experiment with one-hot encoding as well.
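One-hot encoding, as suggested above, can be sketched with pd.get_dummies on a toy frame (the column and values are illustrative):

```python
import pandas as pd

# Toy frame with one categorical column
df = pd.DataFrame({"age": [39, 50, 38],
                   "workclass": ["Private", "State-gov", "Private"]})

# One dummy column per level; drop_first avoids the dummy-variable trap
encoded = pd.get_dummies(df, columns=["workclass"], drop_first=True)
```

Unlike label encoding, one-hot encoding imposes no artificial ordering on the categories, which usually suits linear models better.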
Happy Learning