
guide-to-6-classification-algorithms

January 31, 2024

1 Classification Algorithms with Python


Classification is a fundamental task in machine learning that involves assigning a category or label
to each observation based on its features. It is widely used in various applications such as image and
speech recognition, sentiment analysis, fraud detection, and many others. In this article, we will
dive into the world of classification algorithms with Python and explore some of the most popular
and powerful models.
1. Logistic Regression
2. Artificial Neural Networks (Coming soon)
3. K-nearest Neighbors
4. Support Vector Machine
5. Decision Trees Classifier
6. Random Forest Classifier
7. XGBoost Classifier

2 Predicting heart disease using machine learning

3 Problem Definition
In a statement:
> Given clinical parameters about a patient, can we predict whether or not they have heart disease?

4 Features
This is where you’ll get different information about each of the features in your data. You can do
this by doing your own research (such as looking at the links above) or by talking to a subject
matter expert (someone who knows the dataset).

5 Create data dictionary


1. age - age in years
2. sex - (1 = male; 0 = female)
3. cp - chest pain type
• 0: Typical angina: chest pain related to decreased blood supply to the heart
• 1: Atypical angina: chest pain not related to the heart
• 2: Non-anginal pain: typically esophageal spasms (non-heart related)
• 3: Asymptomatic: chest pain not showing signs of disease
4. trestbps - resting blood pressure (in mm Hg on admission to the hospital); anything above
130-140 is typically cause for concern
5. chol - serum cholesterol in mg/dl
• serum = LDL + HDL + .2 * triglycerides
• above 200 is cause for concern
6. fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
• ‘>126’ mg/dL signals diabetes
7. restecg - resting electrocardiographic results
• 0: Nothing to note
• 1: ST-T wave abnormality
– can range from mild symptoms to severe problems
– signals a non-normal heart beat
• 2: Possible or definite left ventricular hypertrophy
– enlarged heart’s main pumping chamber
8. thalach - maximum heart rate achieved
9. exang - exercise induced angina (1 = yes; 0 = no)
10. oldpeak - ST depression induced by exercise relative to rest; looks at the stress of the heart
during exercise; an unhealthy heart will stress more
11. slope - the slope of the peak exercise ST segment
• 0: Upsloping: better heart rate with exercise (uncommon)
• 1: Flatsloping: minimal change (typical healthy heart)
• 2: Downsloping: signs of an unhealthy heart
12. ca - number of major vessels (0-3) colored by fluoroscopy
• a colored vessel means the doctor can see the blood passing through
• the more blood movement the better (no clots)
13. thal - thallium stress test result
• 1, 3: normal
• 6: fixed defect: used to be a defect but OK now
• 7: reversible defect: no proper blood movement when exercising
14. target - have disease or not (1 = yes, 0 = no) (= the predicted attribute)
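
One convenient way to keep this dictionary next to the analysis is as a plain Python dict. The sketch below only restates the descriptions above; the key names match the CSV columns used later, and the wording is just a summary.

[ ]: # Data dictionary as code: short descriptions of each column, summarised from the list above.
data_dictionary = {
    "age": "age in years",
    "sex": "1 = male; 0 = female",
    "cp": "chest pain type (0-3)",
    "trestbps": "resting blood pressure in mm Hg on admission",
    "chol": "serum cholesterol in mg/dl",
    "fbs": "fasting blood sugar > 120 mg/dl (1 = true; 0 = false)",
    "restecg": "resting electrocardiographic results (0-2)",
    "thalach": "maximum heart rate achieved",
    "exang": "exercise induced angina (1 = yes; 0 = no)",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "slope of the peak exercise ST segment (0-2)",
    "ca": "number of major vessels (0-3) colored by fluoroscopy",
    "thal": "thallium stress test result",
    "target": "has heart disease (1 = yes, 0 = no)",
}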

[1]: !pip install -q hvplot

/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
WARNING: Running pip as the 'root' user can result in broken permissions
and conflicting behaviour with the system package manager. It is recommended to
use a virtual environment instead: https://pip.pypa.io/warnings/venv

[2]: import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import hvplot.pandas
from scipy import stats

%matplotlib inline
sns.set_style("whitegrid")
plt.style.use("fivethirtyeight")

[3]: data = pd.read_csv("/kaggle/input/heart-disease-uci/heart.csv")


data.head()

[3]: age sex cp trestbps chol fbs restecg thalach exang oldpeak slope \
0 63 1 3 145 233 1 0 150 0 2.3 0
1 37 1 2 130 250 0 1 187 0 3.5 0
2 41 0 1 130 204 0 0 172 0 1.4 2
3 56 1 1 120 236 0 1 178 0 0.8 2
4 57 0 0 120 354 0 1 163 1 0.6 2

ca thal target
0 0 1 1
1 0 2 1
2 0 2 1
3 0 2 1
4 0 2 1

6 Exploratory Data Analysis (EDA)

The goal here is to find out more about the data and become a subject matter expert
on the dataset you’re working with.
1. What question(s) are you trying to solve?
2. What kind of data do we have and how do we treat different types?
3. What’s missing from the data and how do you deal with it?
4. Where are the outliers and why should you care about them?
5. How can you add, change or remove features to get more out of your data?
[4]: data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 303 non-null int64
1 sex 303 non-null int64
2 cp 303 non-null int64
3 trestbps 303 non-null int64
4 chol 303 non-null int64
5 fbs 303 non-null int64

6 restecg 303 non-null int64
7 thalach 303 non-null int64
8 exang 303 non-null int64
9 oldpeak 303 non-null float64
10 slope 303 non-null int64
11 ca 303 non-null int64
12 thal 303 non-null int64
13 target 303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB

[5]: data.shape

[5]: (303, 14)

[6]: pd.set_option("display.float", "{:.2f}".format)


data.describe()

[6]: age sex cp trestbps chol fbs restecg thalach exang \


count 303.00 303.00 303.00 303.00 303.00 303.00 303.00 303.00 303.00
mean 54.37 0.68 0.97 131.62 246.26 0.15 0.53 149.65 0.33
std 9.08 0.47 1.03 17.54 51.83 0.36 0.53 22.91 0.47
min 29.00 0.00 0.00 94.00 126.00 0.00 0.00 71.00 0.00
25% 47.50 0.00 0.00 120.00 211.00 0.00 0.00 133.50 0.00
50% 55.00 1.00 1.00 130.00 240.00 0.00 1.00 153.00 0.00
75% 61.00 1.00 2.00 140.00 274.50 0.00 1.00 166.00 1.00
max 77.00 1.00 3.00 200.00 564.00 1.00 2.00 202.00 1.00

oldpeak slope ca thal target


count 303.00 303.00 303.00 303.00 303.00
mean 1.04 1.40 0.73 2.31 0.54
std 1.16 0.62 1.02 0.61 0.50
min 0.00 0.00 0.00 0.00 0.00
25% 0.00 1.00 0.00 2.00 0.00
50% 0.80 1.00 0.00 2.00 1.00
75% 1.60 2.00 1.00 3.00 1.00
max 6.20 2.00 4.00 3.00 1.00

[7]: data.target.value_counts()

[7]: 1 165
0 138
Name: target, dtype: int64

[8]: data.target.value_counts().hvplot.bar(
title="Heart Disease Count", xlabel='Heart Disease', ylabel='Count',
width=500, height=350

)

[8]: :Bars [index] (target)

[9]: # Checking for missing values


data.isna().sum()

[9]: age 0
sex 0
cp 0
trestbps 0
chol 0
fbs 0
restecg 0
thalach 0
exang 0
oldpeak 0
slope 0
ca 0
thal 0
target 0
dtype: int64

6.0.1 Notes:
• We have 165 people with heart disease and 138 people without heart disease, so
our problem is fairly balanced.
• Looks like the perfect dataset!!! No null values :-)
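
As a quick numeric check of that balance claim, here is a small sketch reusing the data frame loaded above:

[ ]: # Proportion of each target class; values close to 0.5 mean the classes are roughly balanced.
class_share = data.target.value_counts(normalize=True)
print(class_share)
print(f"Minority class share: {class_share.min():.2%}")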

[10]: categorical_val = []
continous_val = []
for column in data.columns:
    if len(data[column].unique()) <= 10:
        categorical_val.append(column)
    else:
        continous_val.append(column)

[11]: categorical_val

[11]: ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'target']

[12]: have_disease = data.loc[data['target']==1, 'sex'].value_counts().hvplot.bar(alpha=0.4)
no_disease = data.loc[data['target']==0, 'sex'].value_counts().hvplot.bar(alpha=0.4)

(no_disease * have_disease).opts(
    title="Heart Disease by Sex", xlabel='Sex', ylabel='Count',
    width=500, height=450, legend_cols=2, legend_position='top_right'
)

[12]: :Overlay
.Bars.Sex.I :Bars [index] (sex)
.Bars.Sex.II :Bars [index] (sex)

[13]: have_disease = data.loc[data['target']==1, 'cp'].value_counts().hvplot.bar(alpha=0.4)
no_disease = data.loc[data['target']==0, 'cp'].value_counts().hvplot.bar(alpha=0.4)

(no_disease * have_disease).opts(
    title="Heart Disease by Chest Pain Type", xlabel='Chest Pain Type', ylabel='Count',
    width=500, height=450, legend_cols=2, legend_position='top_right'
)

[13]: :Overlay
.Bars.Cp.I :Bars [index] (cp)
.Bars.Cp.II :Bars [index] (cp)

[14]: have_disease = data.loc[data['target']==1, 'fbs'].value_counts().hvplot.bar(alpha=0.4)
no_disease = data.loc[data['target']==0, 'fbs'].value_counts().hvplot.bar(alpha=0.4)

(no_disease * have_disease).opts(
    title="Heart Disease by fasting blood sugar",
    xlabel='fasting blood sugar > 120 mg/dl (1 = true; 0 = false)',
    ylabel='Count', width=500, height=450, legend_cols=2, legend_position='top_right'
)

[14]: :Overlay
.Bars.Fbs.I :Bars [index] (fbs)
.Bars.Fbs.II :Bars [index] (fbs)

[15]: have_disease = data.loc[data['target']==1, 'restecg'].value_counts().hvplot.bar(alpha=0.4)
no_disease = data.loc[data['target']==0, 'restecg'].value_counts().hvplot.bar(alpha=0.4)

(no_disease * have_disease).opts(
    title="Heart Disease by resting electrocardiographic results",
    xlabel='resting electrocardiographic results',
    ylabel='Count', width=500, height=450, legend_cols=2, legend_position='top_right'
)

[15]: :Overlay
.Bars.Restecg.I :Bars [index] (restecg)
.Bars.Restecg.II :Bars [index] (restecg)

[16]: plt.figure(figsize=(15, 15))

for i, column in enumerate(categorical_val, 1):
    plt.subplot(3, 3, i)
    data[data["target"] == 0][column].hist(bins=35, color='blue', label='Have Heart Disease = NO', alpha=0.6)
    data[data["target"] == 1][column].hist(bins=35, color='red', label='Have Heart Disease = YES', alpha=0.6)
    plt.legend()
    plt.xlabel(column)

6.0.2 Notes:
• cp {Chest Pain}: People with cp equal to 1, 2, or 3 are more likely to have heart disease than people with cp equal to 0.
• restecg {resting electrocardiographic results}: People with value 1 (signals a non-normal heart beat, which can range from mild symptoms to severe problems) are more likely to have heart disease.
• exang {exercise induced angina}: People with value 0 (no exercise-induced angina) have heart disease more often than people with value 1 (exercise-induced angina).
• slope {the slope of the peak exercise ST segment}: People with a slope value of 2 (downsloping: signs of an unhealthy heart) are more likely to have heart disease than people with a slope value of 0 (upsloping: better heart rate with exercise) or 1 (flatsloping: minimal change, typical of a healthy heart).
• ca {number of major vessels (0-3) colored by fluoroscopy}: the more blood movement the better, so people with ca equal to 0 are more likely to have heart disease.
• thal {thallium stress result}: People with a thal value of 2 (fixed defect: used to be a defect but OK now) are more likely to have heart disease.
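
These per-category observations can be checked numerically. Here is a minimal sketch, assuming the data frame and column names from above, that prints the share of patients with heart disease within each level of the categorical features mentioned in the notes:

[ ]: # Disease rate within each level of selected categorical features.
for col in ['cp', 'restecg', 'exang', 'slope', 'ca', 'thal']:
    disease_rate = pd.crosstab(data[col], data['target'], normalize='index')[1]
    print(f"\n{col} - share with heart disease per level:")
    print(disease_rate.round(2))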

[17]: plt.figure(figsize=(15, 15))

for i, column in enumerate(continous_val, 1):
    plt.subplot(3, 2, i)
    data[data["target"] == 0][column].hist(bins=35, color='blue', label='Have Heart Disease = NO', alpha=0.6)
    data[data["target"] == 1][column].hist(bins=35, color='red', label='Have Heart Disease = YES', alpha=0.6)
    plt.legend()
    plt.xlabel(column)

6.0.3 Notes:
• trestbps {resting blood pressure in mm Hg on admission to the hospital}: anything above 130-140 is typically cause for concern.
• chol {serum cholesterol in mg/dl}: above 200 is cause for concern.
• thalach {maximum heart rate achieved}: People who achieved a maximum heart rate above 140 are more likely to have heart disease.
• oldpeak {ST depression induced by exercise relative to rest}: looks at the stress of the heart during exercise; an unhealthy heart will stress more.
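
A quick way to sanity-check these notes on the continuous features is to compare their group means; a small sketch using the data frame and the continous_val list defined above:

[ ]: # Mean of each continuous feature, split by target (0 = no disease, 1 = disease).
print(data.groupby('target')[continous_val].mean().round(2))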

6.0.4 Age vs. Max Heart Rate for Heart Disease

[18]: # Create another figure


plt.figure(figsize=(9, 7))

# Scatter with positive examples


plt.scatter(data.age[data.target==1],
data.thalach[data.target==1],
c="salmon")

# Scatter with negative examples


plt.scatter(data.age[data.target==0],
data.thalach[data.target==0],
c="lightblue")

# Add some helpful info


plt.title("Heart Disease in function of Age and Max Heart Rate")
plt.xlabel("Age")
plt.ylabel("Max Heart Rate")
plt.legend(["Disease", "No Disease"]);

7 Correlation Matrix
[19]: # Let's make our correlation matrix a little prettier
corr_matrix = data.corr()
fig, ax = plt.subplots(figsize=(15, 15))
ax = sns.heatmap(corr_matrix,
annot=True,
linewidths=0.5,
fmt=".2f",
cmap="YlGnBu");
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)

[19]: (14.5, -0.5)

[20]: data.drop('target', axis=1).corrwith(data.target).hvplot.barh(
width=600, height=400,
title="Correlation between Heart Disease and Numeric Features",
ylabel='Correlation', xlabel='Numerical Features',
)

[20]: :Bars [index] (0)

• fbs and chol show the weakest correlation with the target variable.
• All other variables show a noticeable correlation with the target variable.
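
To make the "weakest correlation" claim concrete, here is a short sketch that ranks the features by the absolute value of their correlation with the target (same data frame as above):

[ ]: # Rank features by absolute correlation with the target.
corr_with_target = data.drop('target', axis=1).corrwith(data.target)
print(corr_with_target.abs().sort_values(ascending=False).round(2))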

8 Data Processing
After exploring the dataset, I observed that I need to convert some categorical variables into dummy
variables and scale all the values before training the Machine Learning models. First, I’ll use the
get_dummies method to create dummy columns for categorical variables.

[21]: categorical_val.remove('target')
dataset = pd.get_dummies(data, columns = categorical_val)

[22]: dataset.head()

[22]: age trestbps chol thalach oldpeak target sex_0 sex_1 cp_0 cp_1 \
0 63 145 233 150 2.30 1 0 1 0 0
1 37 130 250 187 3.50 1 0 1 0 0
2 41 130 204 172 1.40 1 1 0 0 1
3 56 120 236 178 0.80 1 0 1 0 1
4 57 120 354 163 0.60 1 1 0 1 0

… slope_2 ca_0 ca_1 ca_2 ca_3 ca_4 thal_0 thal_1 thal_2 thal_3
0 … 0 1 0 0 0 0 0 1 0 0
1 … 0 1 0 0 0 0 0 0 1 0
2 … 1 1 0 0 0 0 0 0 1 0
3 … 1 1 0 0 0 0 0 0 1 0
4 … 1 1 0 0 0 0 0 0 1 0

[5 rows x 31 columns]

[23]: print(data.columns)
print(dataset.columns)

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',


'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
dtype='object')
Index(['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'target', 'sex_0',
'sex_1', 'cp_0', 'cp_1', 'cp_2', 'cp_3', 'fbs_0', 'fbs_1', 'restecg_0',
'restecg_1', 'restecg_2', 'exang_0', 'exang_1', 'slope_0', 'slope_1',
'slope_2', 'ca_0', 'ca_1', 'ca_2', 'ca_3', 'ca_4', 'thal_0', 'thal_1',
'thal_2', 'thal_3'],
dtype='object')

[24]: from sklearn.preprocessing import StandardScaler

s_sc = StandardScaler()
col_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dataset[col_to_scale] = s_sc.fit_transform(dataset[col_to_scale])

[25]: dataset.head()

[25]: age trestbps chol thalach oldpeak target sex_0 sex_1 cp_0 cp_1 \
0 0.95 0.76 -0.26 0.02 1.09 1 0 1 0 0
1 -1.92 -0.09 0.07 1.63 2.12 1 0 1 0 0
2 -1.47 -0.09 -0.82 0.98 0.31 1 1 0 0 1
3 0.18 -0.66 -0.20 1.24 -0.21 1 0 1 0 1
4 0.29 -0.66 2.08 0.58 -0.38 1 1 0 1 0

… slope_2 ca_0 ca_1 ca_2 ca_3 ca_4 thal_0 thal_1 thal_2 thal_3
0 … 0 1 0 0 0 0 0 1 0 0
1 … 0 1 0 0 0 0 0 0 1 0
2 … 1 1 0 0 0 0 0 0 1 0
3 … 1 1 0 0 0 0 0 0 1 0
4 … 1 1 0 0 0 0 0 0 1 0

[5 rows x 31 columns]

9 Model Building
[26]: from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    if train:
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
    else:
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")

[27]: from sklearn.model_selection import train_test_split

X = dataset.drop('target', axis=1)
y = dataset.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Now that we’ve got our data split into training and test sets, it’s time to build machine learning models.
We’ll train them (find the patterns) on the training set.
And we’ll test them (use the patterns) on the test set.
We’re going to try 6 different machine learning models:
1. Logistic Regression
2. K-Nearest Neighbours Classifier
3. Support Vector Machine
4. Decision Tree Classifier
5. Random Forest Classifier
6. XGBoost Classifier

9.1 1. Logistic Regression


[28]: from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(solver='liblinear')
lr_clf.fit(X_train, y_train)

print_score(lr_clf, X_train, y_train, X_test, y_test, train=True)


print_score(lr_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
================================================
Accuracy Score: 86.79%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.88 0.86 0.87 0.87 0.87
recall 0.82 0.90 0.87 0.86 0.87
f1-score 0.85 0.88 0.87 0.87 0.87
support 97.00 115.00 0.87 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 80 17]
[ 11 104]]

Test Result:
================================================
Accuracy Score: 86.81%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg

precision 0.87 0.87 0.87 0.87 0.87
recall 0.83 0.90 0.87 0.86 0.87
f1-score 0.85 0.88 0.87 0.87 0.87
support 41.00 50.00 0.87 91.00 91.00
_______________________________________________
Confusion Matrix:
[[34 7]
[ 5 45]]

[29]: test_score = accuracy_score(y_test, lr_clf.predict(X_test)) * 100
train_score = accuracy_score(y_train, lr_clf.predict(X_train)) * 100

results_df = pd.DataFrame(data=[["Logistic Regression", train_score, test_score]],
                          columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])

results_df

[29]: Model Training Accuracy % Testing Accuracy %


0 Logistic Regression 86.79 86.81

9.2 2. K-nearest neighbors


[30]: from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)

print_score(knn_clf, X_train, y_train, X_test, y_test, train=True)


print_score(knn_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
================================================
Accuracy Score: 86.79%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.86 0.87 0.87 0.87 0.87
recall 0.85 0.89 0.87 0.87 0.87
f1-score 0.85 0.88 0.87 0.87 0.87
support 97.00 115.00 0.87 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 82 15]
[ 13 102]]

Test Result:
================================================
Accuracy Score: 86.81%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.85 0.88 0.87 0.87 0.87
recall 0.85 0.88 0.87 0.87 0.87
f1-score 0.85 0.88 0.87 0.87 0.87
support 41.00 50.00 0.87 91.00 91.00
_______________________________________________
Confusion Matrix:
[[35 6]
[ 6 44]]

[31]: test_score = accuracy_score(y_test, knn_clf.predict(X_test)) * 100
train_score = accuracy_score(y_train, knn_clf.predict(X_train)) * 100

results_df_2 = pd.DataFrame(data=[["K-nearest neighbors", train_score, test_score]],
                            columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])

results_df = results_df.append(results_df_2, ignore_index=True)

results_df

[31]: Model Training Accuracy % Testing Accuracy %


0 Logistic Regression 86.79 86.81
1 K-nearest neighbors 86.79 86.81

9.3 3. Support Vector Machine


[32]: from sklearn.svm import SVC

svm_clf = SVC(kernel='rbf', gamma=0.1, C=1.0)


svm_clf.fit(X_train, y_train)

print_score(svm_clf, X_train, y_train, X_test, y_test, train=True)


print_score(svm_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
================================================
Accuracy Score: 93.40%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg

precision 0.94 0.93 0.93 0.93 0.93
recall 0.92 0.95 0.93 0.93 0.93
f1-score 0.93 0.94 0.93 0.93 0.93
support 97.00 115.00 0.93 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 89 8]
[ 6 109]]

Test Result:
================================================
Accuracy Score: 87.91%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.86 0.90 0.88 0.88 0.88
recall 0.88 0.88 0.88 0.88 0.88
f1-score 0.87 0.89 0.88 0.88 0.88
support 41.00 50.00 0.88 91.00 91.00
_______________________________________________
Confusion Matrix:
[[36 5]
[ 6 44]]

[33]: test_score = accuracy_score(y_test, svm_clf.predict(X_test)) * 100
train_score = accuracy_score(y_train, svm_clf.predict(X_train)) * 100

results_df_2 = pd.DataFrame(data=[["Support Vector Machine", train_score, test_score]],
                            columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])

results_df = results_df.append(results_df_2, ignore_index=True)

results_df

[33]: Model Training Accuracy % Testing Accuracy %


0 Logistic Regression 86.79 86.81
1 K-nearest neighbors 86.79 86.81
2 Support Vector Machine 93.40 87.91

9.4 4. Decision Tree Classifier


[34]: from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

print_score(tree_clf, X_train, y_train, X_test, y_test, train=True)
print_score(tree_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
================================================
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 1.00 1.00 1.00 1.00 1.00
recall 1.00 1.00 1.00 1.00 1.00
f1-score 1.00 1.00 1.00 1.00 1.00
support 97.00 115.00 1.00 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 97 0]
[ 0 115]]

Test Result:
================================================
Accuracy Score: 78.02%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.72 0.84 0.78 0.78 0.79
recall 0.83 0.74 0.78 0.78 0.78
f1-score 0.77 0.79 0.78 0.78 0.78
support 41.00 50.00 0.78 91.00 91.00
_______________________________________________
Confusion Matrix:
[[34 7]
[13 37]]

[35]: test_score = accuracy_score(y_test, tree_clf.predict(X_test)) * 100
train_score = accuracy_score(y_train, tree_clf.predict(X_train)) * 100

results_df_2 = pd.DataFrame(data=[["Decision Tree Classifier", train_score, test_score]],
                            columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])

results_df = results_df.append(results_df_2, ignore_index=True)

results_df

[35]: Model Training Accuracy % Testing Accuracy %


0 Logistic Regression 86.79 86.81

1 K-nearest neighbors 86.79 86.81
2 Support Vector Machine 93.40 87.91
3 Decision Tree Classifier 100.00 78.02

9.5 5. Random Forest


[36]: from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rf_clf = RandomForestClassifier(n_estimators=1000, random_state=42)


rf_clf.fit(X_train, y_train)

print_score(rf_clf, X_train, y_train, X_test, y_test, train=True)


print_score(rf_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
================================================
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 1.00 1.00 1.00 1.00 1.00
recall 1.00 1.00 1.00 1.00 1.00
f1-score 1.00 1.00 1.00 1.00 1.00
support 97.00 115.00 1.00 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 97 0]
[ 0 115]]

Test Result:
================================================
Accuracy Score: 82.42%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.80 0.84 0.82 0.82 0.82
recall 0.80 0.84 0.82 0.82 0.82
f1-score 0.80 0.84 0.82 0.82 0.82
support 41.00 50.00 0.82 91.00 91.00
_______________________________________________
Confusion Matrix:
[[33 8]
[ 8 42]]

[37]: test_score = accuracy_score(y_test, rf_clf.predict(X_test)) * 100
train_score = accuracy_score(y_train, rf_clf.predict(X_train)) * 100

results_df_2 = pd.DataFrame(data=[["Random Forest Classifier", train_score, test_score]],
                            columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])

results_df = results_df.append(results_df_2, ignore_index=True)

results_df

[37]: Model Training Accuracy % Testing Accuracy %


0 Logistic Regression 86.79 86.81
1 K-nearest neighbors 86.79 86.81
2 Support Vector Machine 93.40 87.91
3 Decision Tree Classifier 100.00 78.02
4 Random Forest Classifier 100.00 82.42

9.6 6. XGBoost Classifier


[38]: from xgboost import XGBClassifier

xgb_clf = XGBClassifier(use_label_encoder=False)
xgb_clf.fit(X_train, y_train)

print_score(xgb_clf, X_train, y_train, X_test, y_test, train=True)


print_score(xgb_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
================================================
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 1.00 1.00 1.00 1.00 1.00
recall 1.00 1.00 1.00 1.00 1.00
f1-score 1.00 1.00 1.00 1.00 1.00
support 97.00 115.00 1.00 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 97 0]
[ 0 115]]

Test Result:
================================================
Accuracy Score: 82.42%
_______________________________________________
CLASSIFICATION REPORT:

0 1 accuracy macro avg weighted avg
precision 0.80 0.84 0.82 0.82 0.82
recall 0.80 0.84 0.82 0.82 0.82
f1-score 0.80 0.84 0.82 0.82 0.82
support 41.00 50.00 0.82 91.00 91.00
_______________________________________________
Confusion Matrix:
[[33 8]
[ 8 42]]

[39]: test_score = accuracy_score(y_test, xgb_clf.predict(X_test)) * 100
train_score = accuracy_score(y_train, xgb_clf.predict(X_train)) * 100

results_df_2 = pd.DataFrame(data=[["XGBoost Classifier", train_score, test_score]],
                            columns=['Model', 'Training Accuracy %', 'Testing Accuracy %'])

results_df = results_df.append(results_df_2, ignore_index=True)

results_df

[39]: Model Training Accuracy % Testing Accuracy %


0 Logistic Regression 86.79 86.81
1 K-nearest neighbors 86.79 86.81
2 Support Vector Machine 93.40 87.91
3 Decision Tree Classifier 100.00 78.02
4 Random Forest Classifier 100.00 82.42
5 XGBoost Classifier 100.00 82.42

10 Model Hyperparameter Tuning


10.1 1. Logistic Regression Hyperparameter Tuning
[40]: from sklearn.model_selection import GridSearchCV

params = {"C": np.logspace(-4, 4, 20),
          "solver": ["liblinear"]}

lr_clf = LogisticRegression()

lr_cv = GridSearchCV(lr_clf, params, scoring="accuracy", n_jobs=-1, verbose=1, cv=5)
lr_cv.fit(X_train, y_train)
best_params = lr_cv.best_params_
print(f"Best parameters: {best_params}")

lr_clf = LogisticRegression(**best_params)
lr_clf.fit(X_train, y_train)

print_score(lr_clf, X_train, y_train, X_test, y_test, train=True)
print_score(lr_clf, X_train, y_train, X_test, y_test, train=False)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


Best parameters: {'C': 0.23357214690901212, 'solver': 'liblinear'}
Train Result:
================================================
Accuracy Score: 85.85%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.86 0.86 0.86 0.86 0.86
recall 0.82 0.89 0.86 0.86 0.86
f1-score 0.84 0.87 0.86 0.86 0.86
support 97.00 115.00 0.86 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 80 17]
[ 13 102]]

Test Result:
================================================
Accuracy Score: 85.71%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.85 0.86 0.86 0.86 0.86
recall 0.83 0.88 0.86 0.85 0.86
f1-score 0.84 0.87 0.86 0.86 0.86
support 41.00 50.00 0.86 91.00 91.00
_______________________________________________
Confusion Matrix:
[[34 7]
[ 6 44]]

[41]: test_score = accuracy_score(y_test, lr_clf.predict(X_test)) * 100


train_score = accuracy_score(y_train, lr_clf.predict(X_train)) * 100

tuning_results_df = pd.DataFrame(
data=[["Tuned Logistic Regression", train_score, test_score]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %']
)
tuning_results_df

[41]: Model Training Accuracy % Testing Accuracy %
0 Tuned Logistic Regression 85.85 85.71

10.2 2. K-nearest neighbors Hyperparameter Tuning


[42]: train_score = []
test_score = []
neighbors = range(1, 30)

for k in neighbors:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    train_score.append(accuracy_score(y_train, model.predict(X_train)))
    # test_score.append(accuracy_score(y_test, model.predict(X_test)))

[43]: plt.figure(figsize=(10, 7))

plt.plot(neighbors, train_score, label="Train score")


# plt.plot(neighbors, test_score, label="Test score")
plt.xticks(np.arange(1, 21, 1))
plt.xlabel("Number of neighbors")
plt.ylabel("Model score")
plt.legend()

print(f"Maximum KNN score on the test data: {max(train_score)*100:.2f}%")

Maximum KNN score on the test data: 100.00%

[44]: knn_clf = KNeighborsClassifier(n_neighbors=27)
knn_clf.fit(X_train, y_train)

print_score(knn_clf, X_train, y_train, X_test, y_test, train=True)


print_score(knn_clf, X_train, y_train, X_test, y_test, train=False)

Train Result:
================================================
Accuracy Score: 81.13%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.84 0.80 0.81 0.82 0.81
recall 0.73 0.88 0.81 0.81 0.81
f1-score 0.78 0.83 0.81 0.81 0.81
support 97.00 115.00 0.81 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 71 26]
[ 14 101]]

Test Result:
================================================
Accuracy Score: 87.91%

_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.89 0.87 0.88 0.88 0.88
recall 0.83 0.92 0.88 0.87 0.88
f1-score 0.86 0.89 0.88 0.88 0.88
support 41.00 50.00 0.88 91.00 91.00
_______________________________________________
Confusion Matrix:
[[34 7]
[ 4 46]]

[45]: test_score = accuracy_score(y_test, knn_clf.predict(X_test)) * 100


train_score = accuracy_score(y_train, knn_clf.predict(X_train)) * 100

results_df_2 = pd.DataFrame(
data=[["Tuned K-nearest neighbors", train_score, test_score]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %']
)
tuning_results_df = tuning_results_df.append(results_df_2, ignore_index=True)
tuning_results_df

[45]: Model Training Accuracy % Testing Accuracy %


0 Tuned Logistic Regression 85.85 85.71
1 Tuned K-nearest neighbors 81.13 87.91

10.3 3. Support Vector Machine Hyperparameter Tuning


[46]: svm_clf = SVC(kernel='rbf', gamma=0.1, C=1.0)

params = {"C":(0.1, 0.5, 1, 2, 5, 10, 20),


"gamma":(0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1),
"kernel":('linear', 'poly', 'rbf')}

svm_cv = GridSearchCV(svm_clf, params, n_jobs=-1, cv=5, verbose=1,␣


↪scoring="accuracy")

svm_cv.fit(X_train, y_train)
best_params = svm_cv.best_params_
print(f"Best params: {best_params}")

svm_clf = SVC(**best_params)
svm_clf.fit(X_train, y_train)

print_score(svm_clf, X_train, y_train, X_test, y_test, train=True)


print_score(svm_clf, X_train, y_train, X_test, y_test, train=False)

Fitting 5 folds for each of 147 candidates, totalling 735 fits
Best params: {'C': 5, 'gamma': 0.01, 'kernel': 'rbf'}
Train Result:
================================================
Accuracy Score: 87.74%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.88 0.87 0.88 0.88 0.88
recall 0.85 0.90 0.88 0.87 0.88
f1-score 0.86 0.89 0.88 0.88 0.88
support 97.00 115.00 0.88 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 82 15]
[ 11 104]]

Test Result:
================================================
Accuracy Score: 84.62%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.85 0.85 0.85 0.85 0.85
recall 0.80 0.88 0.85 0.84 0.85
f1-score 0.83 0.86 0.85 0.84 0.85
support 41.00 50.00 0.85 91.00 91.00
_______________________________________________
Confusion Matrix:
[[33 8]
[ 6 44]]

[47]: test_score = accuracy_score(y_test, svm_clf.predict(X_test)) * 100


train_score = accuracy_score(y_train, svm_clf.predict(X_train)) * 100

results_df_2 = pd.DataFrame(
data=[["Tuned Support Vector Machine", train_score, test_score]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %']
)
tuning_results_df = tuning_results_df.append(results_df_2, ignore_index=True)
tuning_results_df

[47]: Model Training Accuracy % Testing Accuracy %


0 Tuned Logistic Regression 85.85 85.71
1 Tuned K-nearest neighbors 81.13 87.91
2 Tuned Support Vector Machine 87.74 84.62

10.4 4. Decision Tree Classifier Hyperparameter Tuning
[48]: params = {"criterion":("gini", "entropy"),
"splitter":("best", "random"),
"max_depth":(list(range(1, 20))),
"min_samples_split":[2, 3, 4],
"min_samples_leaf":list(range(1, 20))
}

tree_clf = DecisionTreeClassifier(random_state=42)
tree_cv = GridSearchCV(tree_clf, params, scoring="accuracy", n_jobs=-1, verbose=1, cv=5)

tree_cv.fit(X_train, y_train)
best_params = tree_cv.best_params_
print(f'Best_params: {best_params}')

tree_clf = DecisionTreeClassifier(**best_params)
tree_clf.fit(X_train, y_train)

print_score(tree_clf, X_train, y_train, X_test, y_test, train=True)


print_score(tree_clf, X_train, y_train, X_test, y_test, train=False)

Fitting 5 folds for each of 4332 candidates, totalling 21660 fits


Best_params: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 7,
'min_samples_split': 2, 'splitter': 'best'}
Train Result:
================================================
Accuracy Score: 88.68%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.88 0.90 0.89 0.89 0.89
recall 0.88 0.90 0.89 0.89 0.89
f1-score 0.88 0.90 0.89 0.89 0.89
support 97.00 115.00 0.89 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 85 12]
[ 12 103]]

Test Result:
================================================
Accuracy Score: 64.84%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.60 0.70 0.65 0.65 0.66
recall 0.68 0.62 0.65 0.65 0.65

f1-score 0.64 0.66 0.65 0.65 0.65
support 41.00 50.00 0.65 91.00 91.00
_______________________________________________
Confusion Matrix:
[[28 13]
[19 31]]

[49]: test_score = accuracy_score(y_test, tree_clf.predict(X_test)) * 100


train_score = accuracy_score(y_train, tree_clf.predict(X_train)) * 100

results_df_2 = pd.DataFrame(
data=[["Tuned Decision Tree Classifier", train_score, test_score]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %']
)
tuning_results_df = tuning_results_df.append(results_df_2, ignore_index=True)
tuning_results_df

[49]: Model Training Accuracy % Testing Accuracy %


0 Tuned Logistic Regression 85.85 85.71
1 Tuned K-nearest neighbors 81.13 87.91
2 Tuned Support Vector Machine 87.74 84.62
3 Tuned Decision Tree Classifier 88.68 64.84

10.5 5. Random Forest Classifier Hyperparameter Tuning


[50]: n_estimators = [500, 900, 1100, 1500]
max_features = ['auto', 'sqrt']
max_depth = [2, 3, 5, 10, 15, None]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]

params_grid = {
'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf
}

rf_clf = RandomForestClassifier(random_state=42)
rf_cv = GridSearchCV(rf_clf, params_grid, scoring="accuracy", cv=5, verbose=1, n_jobs=-1)

rf_cv.fit(X_train, y_train)
best_params = rf_cv.best_params_
print(f"Best parameters: {best_params}")

rf_clf = RandomForestClassifier(**best_params)
rf_clf.fit(X_train, y_train)

print_score(rf_clf, X_train, y_train, X_test, y_test, train=True)


print_score(rf_clf, X_train, y_train, X_test, y_test, train=False)

Fitting 5 folds for each of 432 candidates, totalling 2160 fits


Best parameters: {'max_depth': 5, 'max_features': 'auto', 'min_samples_leaf': 2,
'min_samples_split': 10, 'n_estimators': 500}
Train Result:
================================================
Accuracy Score: 92.45%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.94 0.92 0.92 0.93 0.92
recall 0.90 0.95 0.92 0.92 0.92
f1-score 0.92 0.93 0.92 0.92 0.92
support 97.00 115.00 0.92 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 87 10]
[ 6 109]]

Test Result:
================================================
Accuracy Score: 83.52%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.82 0.84 0.84 0.83 0.83
recall 0.80 0.86 0.84 0.83 0.84
f1-score 0.81 0.85 0.84 0.83 0.83
support 41.00 50.00 0.84 91.00 91.00
_______________________________________________
Confusion Matrix:
[[33 8]
[ 7 43]]

[51]: test_score = accuracy_score(y_test, rf_clf.predict(X_test)) * 100


train_score = accuracy_score(y_train, rf_clf.predict(X_train)) * 100

results_df_2 = pd.DataFrame(
data=[["Tuned Random Forest Classifier", train_score, test_score]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %']
)

tuning_results_df = tuning_results_df.append(results_df_2, ignore_index=True)
tuning_results_df

[51]: Model Training Accuracy % Testing Accuracy %


0 Tuned Logistic Regression 85.85 85.71
1 Tuned K-nearest neighbors 81.13 87.91
2 Tuned Support Vector Machine 87.74 84.62
3 Tuned Decision Tree Classifier 88.68 64.84
4 Tuned Random Forest Classifier 92.45 83.52

10.6 6. XGBoost Classifier Hyperparameter Tuning


[52]: param_grid = dict(
n_estimators=stats.randint(10, 1000),
max_depth=stats.randint(1, 10),
learning_rate=stats.uniform(0, 1)
)

xgb_clf = XGBClassifier(use_label_encoder=False)
xgb_cv = RandomizedSearchCV(
xgb_clf, param_grid, cv=5, n_iter=150,
scoring='accuracy', n_jobs=-1, verbose=1
)
xgb_cv.fit(X_train, y_train)
best_params = xgb_cv.best_params_
print(f"Best paramters: {best_params}")

xgb_clf = XGBClassifier(**best_params)
xgb_clf.fit(X_train, y_train)

print_score(xgb_clf, X_train, y_train, X_test, y_test, train=True)


print_score(xgb_clf, X_train, y_train, X_test, y_test, train=False)

Fitting 5 folds for each of 150 candidates, totalling 750 fits


Best parameters: {'learning_rate': 0.5081267640215167, 'max_depth': 4, 'n_estimators': 574}
Train Result:
================================================
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 1.00 1.00 1.00 1.00 1.00
recall 1.00 1.00 1.00 1.00 1.00
f1-score 1.00 1.00 1.00 1.00 1.00
support 97.00 115.00 1.00 212.00 212.00
_______________________________________________

Confusion Matrix:
[[ 97 0]
[ 0 115]]

Test Result:
================================================
Accuracy Score: 78.02%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.73 0.83 0.78 0.78 0.78
recall 0.80 0.76 0.78 0.78 0.78
f1-score 0.77 0.79 0.78 0.78 0.78
support 41.00 50.00 0.78 91.00 91.00
_______________________________________________
Confusion Matrix:
[[33 8]
[12 38]]

[53]: test_score = accuracy_score(y_test, xgb_clf.predict(X_test)) * 100


train_score = accuracy_score(y_train, xgb_clf.predict(X_train)) * 100

results_df_2 = pd.DataFrame(
data=[["Tuned XGBoost Classifier", train_score, test_score]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %']
)
tuning_results_df = tuning_results_df.append(results_df_2, ignore_index=True)
tuning_results_df

[53]: Model Training Accuracy % Testing Accuracy %


0 Tuned Logistic Regression 85.85 85.71
1 Tuned K-nearest neighbors 81.13 87.91
2 Tuned Support Vector Machine 87.74 84.62
3 Tuned Decision Tree Classifier 88.68 64.84
4 Tuned Random Forest Classifier 92.45 83.52
5 Tuned XGBoost Classifier 100.00 78.02

[54]: results_df

[54]: Model Training Accuracy % Testing Accuracy %


0 Logistic Regression 86.79 86.81
1 K-nearest neighbors 86.79 86.81
2 Support Vector Machine 93.40 87.91
3 Decision Tree Classifier 100.00 78.02
4 Random Forest Classifier 100.00 82.42
5 XGBoost Classifier 100.00 82.42

It seems that the results didn’t improve much after hyperparameter tuning, perhaps because the
dataset is small.
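
On a dataset this small, a single train/test split can be noisy, so cross-validated scores give a more stable picture (a mean and a spread rather than one number). A hedged sketch, assuming X, y and the tuned lr_clf, rf_clf and xgb_clf from the cells above are still in scope:

[ ]: from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy: mean plus/minus standard deviation per model.
for name, clf in [("Tuned Logistic Regression", lr_clf),
                  ("Tuned Random Forest", rf_clf),
                  ("Tuned XGBoost", xgb_clf)]:
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean()*100:.2f}% (+/- {scores.std()*100:.2f}%)")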

11 Feature Importance According to Random Forest and XGBoost
[55]: def feature_imp(df, model):
    fi = pd.DataFrame()
    fi["feature"] = df.columns
    fi["importance"] = model.feature_importances_
    return fi.sort_values(by="importance", ascending=False)

[56]: feature_imp(X, rf_clf).plot(kind='barh', figsize=(12,7), legend=False)

[56]: <AxesSubplot:>

[57]: feature_imp(X, xgb_clf).plot(kind='barh', figsize=(12,7), legend=False)

[57]: <AxesSubplot:>
