
7/7/24, 12:53 AM day_no_46_date_no_07_07_2024_Logistic_Regression_Practical

In [ ]: from google.colab import files


uploaded = files.upload()

Saving creditcard.csv to creditcard.csv

Import the required packages


In [ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split


from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score
from sklearn.ensemble import RandomForestClassifier

In [ ]: data = pd.read_csv('creditcard.csv')

In [ ]: data.head() #data is masked

Out[ ]:         V1        V2        V3        V4        V5        V6        V7        V8  ...

0         0.114697  0.796303 -0.149553 -0.823011  0.878763 -0.553152  0.939259 -0.108502  ...
1        -0.039318  0.495784 -0.810884  0.546693  1.986257  4.386342 -1.344891 -1.743736  ...
2         2.275706 -1.531508 -1.021969 -1.602152 -1.220329 -0.462376 -1.196485 -0.147058  ...
3         1.940137 -0.357671 -1.210551  0.382523  0.050823 -0.171322 -0.109124 -0.002115  ...
4         1.081395 -0.502615  1.075887 -0.543359 -1.472946 -1.065484 -0.443231 -0.143374  ...

5 rows × 30 columns

In [ ]: data.dtypes


Out[ ]: V1 float64
V2 float64
V3 float64
V4 float64
V5 float64
V6 float64
V7 float64
V8 float64
V9 float64
V10 float64
V11 float64
V12 float64
V13 float64
V14 float64
V15 float64
V16 float64
V17 float64
V18 float64
V19 float64
V20 float64
V21 float64
V22 float64
V23 float64
V24 float64
V25 float64
V26 float64
V27 float64
V28 float64
V29 float64
Target int64
dtype: object

In [ ]: data.isnull().sum()


Out[ ]: V1 0
V2 0
V3 0
V4 0
V5 0
V6 0
V7 0
V8 0
V9 0
V10 0
V11 0
V12 0
V13 0
V14 0
V15 0
V16 0
V17 0
V18 0
V19 0
V20 0
V21 0
V22 0
V23 0
V24 0
V25 0
V26 0
V27 0
V28 0
V29 0
Target 0
dtype: int64

In [ ]: data.duplicated().sum()

Out[ ]: 675

In [ ]: data[data.duplicated()] #print the duplicate rows


Out[ ]:          V1        V2        V3        V4        V5        V6        V7       V8  ...

1181       2.010213  0.063667 -1.620606  0.341472  0.368741 -0.586677  0.034489 -0.04375  ...
1936       1.302378 -0.606529 -0.681986 -1.904603  1.326623  3.436312 -1.145127  0.95914  ...
2530       2.055797 -0.326668 -2.752041 -0.842316  2.463072  3.173856 -0.432126  0.72770  ...
2878       1.076018 -0.126284  1.320255  1.154681 -0.892714  0.356662 -0.792107  0.39630  ...
3301       1.109985  0.368032 -0.061407  1.376844  0.070437 -1.100573  0.610397 -0.48720  ...
...             ...       ...       ...       ...       ...       ...       ...      ...  ...
56809      1.886717 -0.517305 -1.351317 -0.141112  0.586967  1.052636 -0.330743  0.35318  ...
56830      1.284143  0.462738 -0.371277  0.825644  0.464456 -0.466731  0.459673 -0.18623  ...
56865      1.018412  1.036663 -1.689814  1.315476  1.698436  0.528807  0.331715  0.36453  ...
56893      2.060160  0.018599 -1.072853  0.381576  0.018414 -1.063353  0.240911 -0.36561  ...
56939     -1.807896  1.155051  0.272890 -0.957869  0.093949  1.457128 -0.746298  1.44253  ...

675 rows × 30 columns

In [ ]: data.drop_duplicates(inplace = True)

In [ ]: data.duplicated().sum()

Out[ ]: 0
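By default, duplicated() flags every occurrence of a row after its first appearance, and drop_duplicates() keeps exactly one copy of each repeated row. A minimal sketch on a toy frame (the values are illustrative, not from this dataset):

```python
import pandas as pd

# toy frame with two pairs of repeated rows
df = pd.DataFrame({'a': [1, 1, 2, 2, 3], 'b': [9, 9, 8, 8, 7]})

print(df.duplicated().sum())  # 2 — only the second copy of each pair is flagged
deduped = df.drop_duplicates()
print(len(deduped))           # 3 — one copy of every distinct row survives
```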

In [ ]: data.select_dtypes('object') #select all the columns of object datatype


Out[ ]: 56287 rows × 0 columns — an empty selection, since no column in this dataset has object dtype

In [ ]: data.select_dtypes('number') #select all the columns of int and float datatype

Out[ ]:          V1        V2        V3        V4        V5        V6        V7  ...

0          0.114697  0.796303 -0.149553 -0.823011  0.878763 -0.553152  0.939259  ...
1         -0.039318  0.495784 -0.810884  0.546693  1.986257  4.386342 -1.344891  ...
2          2.275706 -1.531508 -1.021969 -1.602152 -1.220329 -0.462376 -1.196485  ...
3          1.940137 -0.357671 -1.210551  0.382523  0.050823 -0.171322 -0.109124  ...
4          1.081395 -0.502615  1.075887 -0.543359 -1.472946 -1.065484 -0.443231  ...
...             ...       ...       ...       ...       ...       ...       ...  ...
56957      2.030797 -0.825073 -0.729555 -0.519187 -0.639893 -0.169482 -0.619049  ...
56958     -0.263947  1.119700 -0.639394 -0.880567  1.194120 -0.310693  0.962087  ...
56959      2.206867 -0.748559 -1.443015 -1.101542 -0.332197 -0.646931 -0.536272  ...
56960      1.430579 -0.842354  0.415998 -1.328439 -1.284654 -0.888110 -0.653237  ...
56961     -7.792712  5.599937  0.258943  0.061360 -2.586555  4.770837 -8.221863  ...

56287 rows × 30 columns

In [ ]: #check whether the data is balanced or imbalanced


data['Target'].value_counts()


Out[ ]: Target
0 56189
1 98
Name: count, dtype: int64

In [ ]: data.shape

Out[ ]: (56287, 30)

In [ ]: 98 / 56287

Out[ ]: 0.0017410769804750653

In [ ]: 56189 / 56287

Out[ ]: 0.9982589230195249

Things to do when we have imbalanced data


1. Oversampling (SMOTE technique)
2. Undersampling
3. Use Tree Based Algorithms with the imbalanced dataset
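Option 2 is the easiest to sketch without extra libraries: randomly downsample the majority class to the minority class size. The function and the toy frame below are illustrative assumptions, not part of the notebook:

```python
import numpy as np
import pandas as pd

def undersample(df, target_col, random_state=0):
    """Randomly downsample the majority class to the minority class size."""
    rng = np.random.default_rng(random_state)
    counts = df[target_col].value_counts()
    minority_label = counts.idxmin()
    n_minority = counts.min()
    parts = []
    for label, group in df.groupby(target_col):
        if label == minority_label:
            parts.append(group)                      # keep every minority row
        else:
            idx = rng.choice(group.index, size=n_minority, replace=False)
            parts.append(group.loc[idx])             # sample the majority down
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# toy frame: 95 negatives, 5 positives
toy = pd.DataFrame({'V1': np.arange(100.0), 'Target': [0] * 95 + [1] * 5})
balanced = undersample(toy, 'Target')
print(balanced['Target'].value_counts())  # 5 of each class
```

Undersampling discards most of the majority data, which is why SMOTE (oversampling) or tree-based models on the raw data are often preferred when the dataset is small.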

Machine Learning Process


In [ ]: X = data.drop('Target', axis = 1) #data.drop(columns = 'Target')
y = data['Target']

In [ ]: #splitting the data into training and test


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y)
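With only 98 positives, stratified splitting matters: stratify = y keeps the class ratio the same in the train and test partitions. A quick check on synthetic labels (the 2% positive rate is illustrative, not the credit-card data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 980 + [1] * 20)      # 2% positives, like a rare-event problem
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(y_tr.mean(), y_te.mean())  # both 0.02 — the ratio is preserved in each split
```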

Apply Random Forest Classifier on the Data



In [ ]: rf_classifier = RandomForestClassifier()
rf_classifier.fit(X_train, y_train)

Out[ ]: ▾ RandomForestClassifier

RandomForestClassifier()

In [ ]: y_pred = rf_classifier.predict(X_test)

In [ ]: accuracy_score(y_test, y_pred)

Out[ ]: 0.9991117427607035


We should never rely on the accuracy score to judge a classification model when the dataset is imbalanced: with 99.8% of samples in the majority class, a model that always predicts 0 already looks near-perfect. Metrics such as roc_auc_score (together with precision and recall) are far more informative here.
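To make that concrete, here is a sketch of a dummy model that always predicts the majority class, scored on the class counts from the value_counts() output above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# class counts taken from the value_counts() output above
y_true = np.array([0] * 56189 + [1] * 98)
y_always_zero = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_zero)
auc = roc_auc_score(y_true, np.zeros(len(y_true)))  # a constant score for every sample

print(round(acc, 4))  # ~0.9983 — looks excellent while detecting no fraud at all
print(auc)            # 0.5 — no better than random ranking
```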

In [ ]: roc_auc_score(y_test, y_pred)

Out[ ]: 0.7999110161950526

In [ ]: precision_score(y_test, y_pred)

Out[ ]: 0.8571428571428571

In [ ]: recall_score(y_test, y_pred)

Out[ ]: 0.6
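Precision and recall come straight from the confusion counts: precision asks how many flagged cases were real, recall asks how many real cases were caught. A hand-made sketch with illustrative labels (not taken from this dataset) that happens to reproduce a recall of 0.6:

```python
import numpy as np

# tiny example: 5 positives, 5 negatives
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives: 3
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives: 1
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives: 2

precision = tp / (tp + fp)  # of the flagged cases, 3/4 were real
recall = tp / (tp + fn)     # of the real cases, 3/5 were caught
print(precision, recall)    # 0.75 0.6
```

For fraud detection, low recall (missed fraud) is usually the more expensive error, which is why it is reported alongside precision here.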

Check the Variable/Feature importance in Random Forest


In [ ]: features = X_train.columns
importances = rf_classifier.feature_importances_
indices = np.argsort(importances)

In [ ]: importances

Out[ ]: array([0.01570774, 0.01632325, 0.01893537, 0.02842234, 0.00774267,
       0.02100563, 0.02043602, 0.01341524, 0.03712189, 0.07700222,
       0.07225627, 0.13569955, 0.01524527, 0.10012457, 0.01295145,
       0.06670605, 0.16725197, 0.03747319, 0.01770945, 0.01369339,
       0.0139307 , 0.01083112, 0.00909953, 0.00969823, 0.01082515,
       0.0104288 , 0.01644287, 0.01123376, 0.01228631])

In [ ]: plt.barh(range(len(indices)), importances[indices], color = 'red')


plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.show()


In [ ]: arr1 = np.random.randint(45, 98, size = 30)

In [ ]: arr1

Out[ ]: array([54, 76, 79, 63, 85, 48, 60, 94, 52, 56, 87, 93, 54, 77, 46, 72, 47,
49, 62, 89, 62, 69, 56, 60, 56, 60, 85, 85, 53, 73])

In [ ]: print(np.sort(arr1))
print(np.argsort(arr1))

[46 47 48 49 52 53 54 54 56 56 56 60 60 60 62 62 63 69 72 73 76 77 79 85
85 85 87 89 93 94]
[14 16 5 17 8 28 12 0 9 24 22 6 25 23 20 18 3 21 15 29 1 13 2 4
26 27 10 19 11 7]
