
7/7/24, 12:53 AM day_no_46_date_no_07_07_2024_Logistic_Regression_Practical

In [ ]: from google.colab import files


uploaded = files.upload()

Saving creditcard.csv to creditcard.csv

Import the required packages


In [ ]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split


from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score
from sklearn.ensemble import RandomForestClassifier

In [ ]: data = pd.read_csv('creditcard.csv')

In [ ]: data.head() #data is masked

Out[ ]:         V1        V2        V3        V4        V5        V6        V7        V8  ...

0         0.114697  0.796303 -0.149553 -0.823011  0.878763 -0.553152  0.939259 -0.108502  ...
1        -0.039318  0.495784 -0.810884  0.546693  1.986257  4.386342 -1.344891 -1.743736  ...
2         2.275706 -1.531508 -1.021969 -1.602152 -1.220329 -0.462376 -1.196485 -0.147058  ...
3         1.940137 -0.357671 -1.210551  0.382523  0.050823 -0.171322 -0.109124 -0.002115  ...
4         1.081395 -0.502615  1.075887 -0.543359 -1.472946 -1.065484 -0.443231 -0.143374  ...

5 rows × 30 columns

In [ ]: data.dtypes


Out[ ]: V1 float64
V2 float64
V3 float64
V4 float64
V5 float64
V6 float64
V7 float64
V8 float64
V9 float64
V10 float64
V11 float64
V12 float64
V13 float64
V14 float64
V15 float64
V16 float64
V17 float64
V18 float64
V19 float64
V20 float64
V21 float64
V22 float64
V23 float64
V24 float64
V25 float64
V26 float64
V27 float64
V28 float64
V29 float64
Target int64
dtype: object

In [ ]: data.isnull().sum()


Out[ ]: V1 0
V2 0
V3 0
V4 0
V5 0
V6 0
V7 0
V8 0
V9 0
V10 0
V11 0
V12 0
V13 0
V14 0
V15 0
V16 0
V17 0
V18 0
V19 0
V20 0
V21 0
V22 0
V23 0
V24 0
V25 0
V26 0
V27 0
V28 0
V29 0
Target 0
dtype: int64

In [ ]: data.duplicated().sum()

Out[ ]: 675

In [ ]: data[data.duplicated()] #print the duplicate rows


Out[ ]:          V1        V2        V3        V4        V5        V6        V7       V8  ...

1181       2.010213  0.063667 -1.620606  0.341472  0.368741 -0.586677  0.034489 -0.04375  ...
1936       1.302378 -0.606529 -0.681986 -1.904603  1.326623  3.436312 -1.145127  0.95914  ...
2530       2.055797 -0.326668 -2.752041 -0.842316  2.463072  3.173856 -0.432126  0.72770  ...
2878       1.076018 -0.126284  1.320255  1.154681 -0.892714  0.356662 -0.792107  0.39630  ...
3301       1.109985  0.368032 -0.061407  1.376844  0.070437 -1.100573  0.610397 -0.48720  ...
...             ...       ...       ...       ...       ...       ...       ...      ...  ...
56809      1.886717 -0.517305 -1.351317 -0.141112  0.586967  1.052636 -0.330743  0.35318  ...
56830      1.284143  0.462738 -0.371277  0.825644  0.464456 -0.466731  0.459673 -0.18623  ...
56865      1.018412  1.036663 -1.689814  1.315476  1.698436  0.528807  0.331715  0.36453  ...
56893      2.060160  0.018599 -1.072853  0.381576  0.018414 -1.063353  0.240911 -0.36561  ...
56939     -1.807896  1.155051  0.272890 -0.957869  0.093949  1.457128 -0.746298  1.44253  ...

675 rows × 30 columns

In [ ]: data.drop_duplicates(inplace = True)

In [ ]: data.duplicated().sum()

Out[ ]: 0
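By default, duplicated() flags every occurrence of a row after its first appearance, and drop_duplicates() keeps exactly one copy of each repeated row. A minimal sketch on a toy frame (the values are illustrative, not from this dataset):

```python
import pandas as pd

# toy frame with two pairs of repeated rows
df = pd.DataFrame({'a': [1, 1, 2, 2, 3], 'b': [9, 9, 8, 8, 7]})

print(df.duplicated().sum())  # 2 — only the second copy of each pair is flagged
deduped = df.drop_duplicates()
print(len(deduped))           # 3 — one copy of every distinct row survives
```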

In [ ]: data.select_dtypes('object') #select all the columns of object datatype


Out[ ]: 56287 rows × 0 columns — an empty selection, since no column in this dataset has object dtype

In [ ]: data.select_dtypes('number') #select all the columns of int and float datatype

Out[ ]:          V1        V2        V3        V4        V5        V6        V7  ...

0          0.114697  0.796303 -0.149553 -0.823011  0.878763 -0.553152  0.939259  ...
1         -0.039318  0.495784 -0.810884  0.546693  1.986257  4.386342 -1.344891  ...
2          2.275706 -1.531508 -1.021969 -1.602152 -1.220329 -0.462376 -1.196485  ...
3          1.940137 -0.357671 -1.210551  0.382523  0.050823 -0.171322 -0.109124  ...
4          1.081395 -0.502615  1.075887 -0.543359 -1.472946 -1.065484 -0.443231  ...
...             ...       ...       ...       ...       ...       ...       ...  ...
56957      2.030797 -0.825073 -0.729555 -0.519187 -0.639893 -0.169482 -0.619049  ...
56958     -0.263947  1.119700 -0.639394 -0.880567  1.194120 -0.310693  0.962087  ...
56959      2.206867 -0.748559 -1.443015 -1.101542 -0.332197 -0.646931 -0.536272  ...
56960      1.430579 -0.842354  0.415998 -1.328439 -1.284654 -0.888110 -0.653237  ...
56961     -7.792712  5.599937  0.258943  0.061360 -2.586555  4.770837 -8.221863  ...

56287 rows × 30 columns

In [ ]: #check whether the data is balanced or imbalanced


data['Target'].value_counts()


Out[ ]: Target
0 56189
1 98
Name: count, dtype: int64

In [ ]: data.shape

Out[ ]: (56287, 30)

In [ ]: 98 / 56287

Out[ ]: 0.0017410769804750653

In [ ]: 56189 / 56287

Out[ ]: 0.9982589230195249

Things to do when we have imbalanced data


1. Oversampling (SMOTE technique)
2. Undersampling
3. Use Tree Based Algorithms with the imbalanced dataset
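Option 2 is the easiest to sketch without extra libraries: randomly downsample the majority class to the minority class size. The function and the toy frame below are illustrative assumptions, not part of the notebook:

```python
import numpy as np
import pandas as pd

def undersample(df, target_col, random_state=0):
    """Randomly downsample the majority class to the minority class size."""
    rng = np.random.default_rng(random_state)
    counts = df[target_col].value_counts()
    minority_label = counts.idxmin()
    n_minority = counts.min()
    parts = []
    for label, group in df.groupby(target_col):
        if label == minority_label:
            parts.append(group)                      # keep every minority row
        else:
            idx = rng.choice(group.index, size=n_minority, replace=False)
            parts.append(group.loc[idx])             # sample the majority down
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# toy frame: 95 negatives, 5 positives
toy = pd.DataFrame({'V1': np.arange(100.0), 'Target': [0] * 95 + [1] * 5})
balanced = undersample(toy, 'Target')
print(balanced['Target'].value_counts())  # 5 of each class
```

Undersampling discards most of the majority data, which is why SMOTE (oversampling) or tree-based models on the raw data are often preferred when the dataset is small.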

Machine Learning Process


In [ ]: X = data.drop('Target', axis = 1) #data.drop(columns = 'Target')
y = data['Target']

In [ ]: #splitting the data into training and test


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y)
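With only 98 positives, stratified splitting matters: stratify = y keeps the class ratio the same in the train and test partitions. A quick check on synthetic labels (the 2% positive rate is illustrative, not the credit-card data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 980 + [1] * 20)      # 2% positives, like a rare-event problem
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(y_tr.mean(), y_te.mean())  # both 0.02 — the ratio is preserved in each split
```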

Apply Random Forest Classifier on the Data



In [ ]: rf_classifier = RandomForestClassifier()
rf_classifier.fit(X_train, y_train)

Out[ ]: ▾ RandomForestClassifier

RandomForestClassifier()

In [ ]: y_pred = rf_classifier.predict(X_test)

In [ ]: accuracy_score(y_test, y_pred)

Out[ ]: 0.9991117427607035


We should never rely on the accuracy score to judge a classification model when the dataset is imbalanced: with 99.8% of samples in the majority class, a model that always predicts 0 already looks near-perfect. Metrics such as roc_auc_score (together with precision and recall) are far more informative here.
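To make that concrete, here is a sketch of a dummy model that always predicts the majority class, scored on the class counts from the value_counts() output above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# class counts taken from the value_counts() output above
y_true = np.array([0] * 56189 + [1] * 98)
y_always_zero = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_zero)
auc = roc_auc_score(y_true, np.zeros(len(y_true)))  # a constant score for every sample

print(round(acc, 4))  # ~0.9983 — looks excellent while detecting no fraud at all
print(auc)            # 0.5 — no better than random ranking
```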

In [ ]: roc_auc_score(y_test, y_pred)

Out[ ]: 0.7999110161950526

In [ ]: precision_score(y_test, y_pred)

Out[ ]: 0.8571428571428571

In [ ]: recall_score(y_test, y_pred)

Out[ ]: 0.6
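Precision and recall come straight from the confusion counts: precision asks how many flagged cases were real, recall asks how many real cases were caught. A hand-made sketch with illustrative labels (not taken from this dataset) that happens to reproduce a recall of 0.6:

```python
import numpy as np

# tiny example: 5 positives, 5 negatives
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives: 3
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives: 1
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives: 2

precision = tp / (tp + fp)  # of the flagged cases, 3/4 were real
recall = tp / (tp + fn)     # of the real cases, 3/5 were caught
print(precision, recall)    # 0.75 0.6
```

For fraud detection, low recall (missed fraud) is usually the more expensive error, which is why it is reported alongside precision here.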

Check the Variable/Feature importance in Random Forest


In [ ]: features = X_train.columns
importances = rf_classifier.feature_importances_
indices = np.argsort(importances)

In [ ]: importances

Out[ ]: array([0.01570774, 0.01632325, 0.01893537, 0.02842234, 0.00774267,
       0.02100563, 0.02043602, 0.01341524, 0.03712189, 0.07700222,
       0.07225627, 0.13569955, 0.01524527, 0.10012457, 0.01295145,
       0.06670605, 0.16725197, 0.03747319, 0.01770945, 0.01369339,
       0.0139307 , 0.01083112, 0.00909953, 0.00969823, 0.01082515,
       0.0104288 , 0.01644287, 0.01123376, 0.01228631])

In [ ]: plt.barh(range(len(indices)), importances[indices], color = 'red')


plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.show()


In [ ]: arr1 = np.random.randint(45, 98, size = 30)

In [ ]: arr1

Out[ ]: array([54, 76, 79, 63, 85, 48, 60, 94, 52, 56, 87, 93, 54, 77, 46, 72, 47,
49, 62, 89, 62, 69, 56, 60, 56, 60, 85, 85, 53, 73])

In [ ]: print(np.sort(arr1))
print(np.argsort(arr1))

[46 47 48 49 52 53 54 54 56 56 56 60 60 60 62 62 63 69 72 73 76 77 79 85
85 85 87 89 93 94]
[14 16 5 17 8 28 12 0 9 24 22 6 25 23 20 18 3 21 15 29 1 13 2 4
26 27 10 19 11 7]
