MlYtLec8.ipynb - Colaboratory

Question in the video


import pandas as pd

df = pd.read_csv("insurance_data.csv")
df.head()

   age  bought_insurance
0   22                 0
1   25                 0
2   47                 1
3   52                 0
4   46                 1


import matplotlib.pyplot as plt


%matplotlib inline

As we can see below, the data points lie only on y = 0 and y = 1 (the target is binary).

Therefore we can use logistic regression.

plt.scatter(df.age, df.bought_insurance, marker='+', c='red')

<matplotlib.collections.PathCollection at 0x7f0b78a97f40>

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[['age']], df.bought_insurance, test_size=0.2)

from sklearn.linear_model import LogisticRegression


lr = LogisticRegression()

lr.fit(X_train, y_train)

LogisticRegression()

lr.predict(X_test)

array([1, 0, 1, 0, 0, 0])

lr.score(X_test, y_test)

0.8333333333333334
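Logistic regression models P(bought_insurance = 1 | age) with a sigmoid, so besides the hard 0/1 predictions we can also inspect the predicted probabilities. A minimal sketch (the 40-year-old customer is a made-up example, and the exact numbers depend on the random split):

# Probability of each class for the test set: column 0 = P(class 0), column 1 = P(class 1)
lr.predict_proba(X_test)

# Probability for a single hypothetical 40-year-old customer
lr.predict_proba(pd.DataFrame({'age': [40]}))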

Exercise
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

df1 = pd.read_csv("HR_comma_sep.csv") # target variable is 'left'


df1.head()

  satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years Department salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low


df1.describe() # summary statistics for every numeric column

      satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean  0.612834 0.716102 3.803054 201.050337 3.498233 0.144610 0.238083 0.021268
std   0.248631 0.171169 1.232592 49.943099 1.460136 0.351719 0.425924 0.144281
min   0.090000 0.360000 2.000000 96.000000 2.000000 0.000000 0.000000 0.000000
25%   0.440000 0.560000 3.000000 156.000000 3.000000 0.000000 0.000000 0.000000
50%   0.640000 0.720000 4.000000 200.000000 3.000000 0.000000 0.000000 0.000000
75%   0.820000 0.870000 5.000000 245.000000 4.000000 0.000000 0.000000 0.000000
max   1.000000 1.000000 7.000000 310.000000 10.000000 1.000000 1.000000 1.000000

df1['left'].value_counts()

0 11428
1 3571
Name: left, dtype: int64
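The counts above show a class imbalance: roughly 76% of employees stayed and 24% left. A quick way to see the proportions directly (normalize=True is a standard value_counts option, not part of the original notebook):

df1['left'].value_counts(normalize=True)  # fraction of stayers (0) and leavers (1)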

df_no_strings = df1.drop(['Department', 'salary'], axis=1) # create a new dataframe without the non-numeric columns
df_no_strings.head()

  satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
0 0.38 0.53 2 157 3 0 1 0
1 0.80 0.86 5 262 6 0 1 0
2 0.11 0.88 7 272 4 0 1 0
3 0.72 0.87 5 223 5 0 1 0
4 0.37 0.52 2 159 3 0 1 0


df_no_strings.corr().style.background_gradient(cmap='coolwarm', axis=None) # correlation of every numeric attribute with every other attribute

                      satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
satisfaction_level     1.000000  0.105021 -0.142970 -0.020048 -0.100866  0.058697 -0.388375  0.025605
last_evaluation        0.105021  1.000000  0.349333  0.339742  0.131591 -0.007104  0.006567 -0.008684
number_project        -0.142970  0.349333  1.000000  0.417211  0.196786 -0.004741  0.023787 -0.006064
average_montly_hours  -0.020048  0.339742  0.417211  1.000000  0.127755 -0.010143  0.071287 -0.003544
time_spend_company    -0.100866  0.131591  0.196786  0.127755  1.000000  0.002120  0.144822  0.067433
Work_accident          0.058697 -0.007104 -0.004741 -0.010143  0.002120  1.000000 -0.154622  0.039245
left                  -0.388375  0.006567  0.023787  0.071287  0.144822 -0.154622  1.000000 -0.061788
promotion_last_5years  0.025605 -0.008684 -0.006064 -0.003544  0.067433  0.039245 -0.061788  1.000000

From the table above, we can see that the correlation between 'last_evaluation', 'number_project', 'average_montly_hours', 'promotion_last_5years' and 'left' is very weak (close to 0).

This means these attributes do not affect the result much, so we can drop them from the model (a small sketch of this screening step follows).
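A minimal sketch of that screening, assuming an illustrative cut-off of 0.1 on the absolute correlation with 'left' (the threshold is our choice, not part of the original analysis):

# Absolute correlation of every numeric column with the target 'left'
corr_with_left = df_no_strings.corr()['left'].abs().sort_values(ascending=False)
print(corr_with_left)

# Keep only the features whose |correlation| with 'left' exceeds the threshold
selected = corr_with_left[corr_with_left > 0.1].index.drop('left')
print(list(selected))  # satisfaction_level, Work_accident, time_spend_company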

We check the relation of 'salary' and 'Department' to 'left' separately with bar charts, since we excluded them from the correlation table.

pd.crosstab(df1.salary, df1.left).plot(kind='bar')

<Axes: xlabel='salary'>

The chart above shows that employees with a high salary tend not to leave the company.

pd.crosstab(df1.Department, df1.left).plot(kind='bar')

<Axes: xlabel='Department'>

The chart above shows that many employees left from the Sales department, but many stayed as well, and the pattern is similar across departments. So we conclude that there is no strong direct relationship between 'Department' and 'left' (a proportion-based sketch follows).
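One way to make that comparison easier is to plot proportions instead of raw counts; a minimal sketch (normalize='index' is our addition, not part of the original notebook):

# Share of leavers (1) vs. stayers (0) within each department
pd.crosstab(df1.Department, df1.left, normalize='index').plot(kind='bar', stacked=True)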

From the data analysis, we can conclude that:

We will use the following variables as independent variables in our model:
1) satisfaction_level
2) time_spend_company
3) Work_accident
4) salary

newdf1 = df1[['satisfaction_level', 'time_spend_company', 'Work_accident', 'salary', 'left']]


newdf1

      satisfaction_level time_spend_company Work_accident salary left
0     0.38 3 0 low 1
1     0.80 6 0 medium 1
2     0.11 4 0 medium 1
3     0.72 5 0 low 1
4     0.37 3 0 low 1
...   ... ... ... ... ...
14994 0.40 3 0 low 1
14995 0.37 3 0 low 1
14996 0.37 3 0 low 1
14997 0.11 4 0 low 1
14998 0.37 3 0 low 1

14999 rows × 5 columns


from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()

# Encode the 'salary' strings as integers; LabelEncoder assigns codes alphabetically
# (high -> 0, low -> 1, medium -> 2), so they do not follow the low < medium < high order
newdf1['salary'] = le.fit_transform(newdf1['salary'])
newdf1

<ipython-input-127-3d81752a9699>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


newdf1['salary'] = le.fit_transform(newdf1['salary'])
      satisfaction_level time_spend_company Work_accident salary left
0     0.38 3 0 1 1
1     0.80 6 0 2 1
2     0.11 4 0 2 1
3     0.72 5 0 1 1
4     0.37 3 0 1 1
...   ... ... ... ... ...
14994 0.40 3 0 1 1
14995 0.37 3 0 1 1
14996 0.37 3 0 1 1
14997 0.11 4 0 1 1
14998 0.37 3 0 1 1

14999 rows × 5 columns

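The SettingWithCopyWarning above appears because newdf1 is a slice of df1, so pandas cannot tell whether the assignment writes to a view or to an independent copy. A minimal way to avoid it, assuming the same columns as above, is to take an explicit copy before encoding:

# Select the columns into an independent DataFrame, then encode in place
newdf1 = df1[['satisfaction_level', 'time_spend_company', 'Work_accident', 'salary', 'left']].copy()
newdf1['salary'] = le.fit_transform(newdf1['salary'])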

X = newdf1.drop(['left'], axis='columns')
y = newdf1.left

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

from sklearn.linear_model import LogisticRegression


lr = LogisticRegression()

lr.fit(X_train, y_train)

LogisticRegression()

lr.predict(X_test)

array([0, 1, 0, ..., 0, 0, 0])

lr.score(X_test, y_test)

0.768
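Since only about 24% of the employees left, accuracy alone can be misleading; a quick sketch of a fuller evaluation with standard scikit-learn metrics (not part of the original notebook):

from sklearn.metrics import confusion_matrix, classification_report

y_pred = lr.predict(X_test)
print(confusion_matrix(y_test, y_pred))       # rows: actual 0/1, columns: predicted 0/1
print(classification_report(y_test, y_pred))  # precision, recall and F1 per class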
