Ide To 6 Classification Algorithms
3 Problem Definition
In a statement: given clinical parameters about a patient, can we predict whether or not they
have heart disease?
4 Features
This is where you'll get different information about each of the features in your data. You can do
this by doing your own research (such as looking at the links above) or by talking to a subject
matter expert (someone who knows about the dataset).
• 3: Asymptomatic: chest pain not showing signs of disease
4. trestbps - resting blood pressure (in mm Hg on admission to the hospital)
• anything above 130-140 is typically cause for concern
5. chol - serum cholesterol in mg/dl
• serum = LDL + HDL + 0.2 * triglycerides
• above 200 is cause for concern
6. fbs - fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
• a value above 126 mg/dl signals diabetes
7. restecg - resting electrocardiographic results
• 0: Nothing to note
• 1: ST-T wave abnormality
– can range from mild symptoms to severe problems
– signals a non-normal heart beat
• 2: Possible or definite left ventricular hypertrophy
– enlarged main pumping chamber of the heart
8. thalach - maximum heart rate achieved
9. exang - exercise-induced angina (1 = yes; 0 = no)
10. oldpeak - ST depression induced by exercise relative to rest
• looks at the stress of the heart during exercise; an unhealthy heart will stress more
11. slope - the slope of the peak exercise ST segment
• 0: Upsloping: better heart rate with exercise (uncommon)
• 1: Flatsloping: minimal change (typical healthy heart)
• 2: Downsloping: signs of an unhealthy heart
12. ca - number of major vessels (0-3) colored by fluoroscopy
• a colored vessel means the doctor can see the blood passing through
• the more blood movement the better (no clots)
13. thal - thallium stress result
• 1,3: normal
• 6: fixed defect: used to be a defect but ok now
• 7: reversible defect: no proper blood movement when exercising
14. target - have disease or not (1 = yes, 0 = no) (the predicted attribute)
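The serum cholesterol formula in item 5 can be checked with a few lines of arithmetic. This is only a sketch: the function name and the lab values are hypothetical and chosen for illustration, not taken from the dataset.

```python
def total_serum_cholesterol(ldl, hdl, triglycerides):
    """Total serum cholesterol in mg/dl, using serum = LDL + HDL + 0.2 * triglycerides."""
    return ldl + hdl + 0.2 * triglycerides


# Hypothetical lab values in mg/dl (not from the dataset).
example = total_serum_cholesterol(ldl=130, hdl=50, triglycerides=150)

# The feature notes flag anything above 200 mg/dl as a cause for concern.
above_threshold = example > 200
print(f"{example:.0f} mg/dl, above 200: {above_threshold}")
```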
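import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import hvplot.pandas  # registers the .hvplot accessor used in the plots below
from scipy import stats
%matplotlib inline
sns.set_style("whitegrid")
plt.style.use("fivethirtyeight")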
[3]: age sex cp trestbps chol fbs restecg thalach exang oldpeak slope \
0 63 1 3 145 233 1 0 150 0 2.3 0
1 37 1 2 130 250 0 1 187 0 3.5 0
2 41 0 1 130 204 0 0 172 0 1.4 2
3 56 1 1 120 236 0 1 178 0 0.8 2
4 57 0 0 120 354 0 1 163 1 0.6 2
ca thal target
0 0 1 1
1 0 2 1
2 0 2 1
3 0 2 1
4 0 2 1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 303 non-null int64
1 sex 303 non-null int64
2 cp 303 non-null int64
3 trestbps 303 non-null int64
4 chol 303 non-null int64
5 fbs 303 non-null int64
6 restecg 303 non-null int64
7 thalach 303 non-null int64
8 exang 303 non-null int64
9 oldpeak 303 non-null float64
10 slope 303 non-null int64
11 ca 303 non-null int64
12 thal 303 non-null int64
13 target 303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
[5]: data.shape
[7]: data.target.value_counts()
[7]: 1 165
0 138
Name: target, dtype: int64
[8]: data.target.value_counts().hvplot.bar(
    title="Heart Disease Count", xlabel='Heart Disease', ylabel='Count',
    width=500, height=350
)
[9]: age 0
sex 0
cp 0
trestbps 0
chol 0
fbs 0
restecg 0
thalach 0
exang 0
oldpeak 0
slope 0
ca 0
thal 0
target 0
dtype: int64
6.0.1 Notes:
• We have 165 people with heart disease and 138 people without heart disease, so
our problem is roughly balanced.
• Looks like the perfect dataset!!! No null values :-)
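To put a number on the class balance, here is a small sketch that turns the counts printed above into proportions. The variable names are mine; the counts are the ones reported by `data.target.value_counts()`.

```python
import pandas as pd

# Class counts reported by data.target.value_counts() above.
counts = pd.Series({1: 165, 0: 138}, name="target")

# Share of each class in the 303 rows.
proportions = counts / counts.sum()
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = proportions.max()
print(f"Majority-class baseline: {majority_baseline:.1%}")
```

Any model trained later should beat this baseline to be worth keeping.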
[10]: categorical_val = []
continous_val = []
for column in data.columns:
    if len(data[column].unique()) <= 10:
        categorical_val.append(column)
    else:
        continous_val.append(column)
[11]: categorical_val
[11]: ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal', 'target']
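The loop above treats any column with 10 or fewer distinct values as categorical. The same rule can be sketched in vectorised form with `nunique`. The toy frame below is hypothetical and only mimics one low-cardinality and one high-cardinality column. Note that `nunique` ignores missing values by default, which makes no difference here because the dataset has none.

```python
import pandas as pd

# Toy frame (hypothetical values): 'sex' has 2 distinct values, 'age' has 20.
toy = pd.DataFrame({
    "sex": [0, 1] * 10,
    "age": list(range(40, 60)),
})

# Number of distinct values per column.
counts = toy.nunique()

# Same threshold as the loop above: <= 10 distinct values means categorical.
categorical = counts[counts <= 10].index.tolist()
continuous = counts[counts > 10].index.tolist()
print(categorical, continuous)
```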
(no_disease * have_disease).opts(
    title="Heart Disease by Sex", xlabel='Sex', ylabel='Count',
    width=500, height=450, legend_cols=2, legend_position='top_right'
)
[12]: :Overlay
.Bars.Sex.I :Bars [index] (sex)
.Bars.Sex.II :Bars [index] (sex)
(no_disease * have_disease).opts(
    title="Heart Disease by Chest Pain Type", xlabel='Chest Pain Type',
    ylabel='Count',
[13]: :Overlay
.Bars.Cp.I :Bars [index] (cp)
.Bars.Cp.II :Bars [index] (cp)
(no_disease * have_disease).opts(
    title="Heart Disease by fasting blood sugar",
    xlabel='fasting blood sugar > 120 mg/dl (1 = true; 0 = false)',
[14]: :Overlay
.Bars.Fbs.I :Bars [index] (fbs)
.Bars.Fbs.II :Bars [index] (fbs)
(no_disease * have_disease).opts(
    title="Heart Disease by resting electrocardiographic results",
    xlabel='resting electrocardiographic results',
    ylabel='Count', width=500, height=450, legend_cols=2,
    legend_position='top_right'
[15]: :Overlay
.Bars.Restecg.I :Bars [index] (restecg)
.Bars.Restecg.II :Bars [index] (restecg)
plt.legend()
plt.xlabel(column)
6.0.2 Notes:
• cp {chest pain}: People with cp equal to 1, 2 or 3 are more likely to have heart
disease than people with cp equal to 0.
• restecg {resting electrocardiographic results}: People with value 1 (signals a
non-normal heart beat, which can range from mild symptoms to severe problems) are
more likely to have heart disease.
• exang {exercise-induced angina}: People with value 0 (no exercise-induced
angina) have heart disease more often than people with value 1 (exercise-induced
angina).
• slope {the slope of the peak exercise ST segment}: People with slope value
equal to 2 (Downsloping: signs of an unhealthy heart) are more likely to have heart
disease than people with slope value equal to 0 (Upsloping: better heart rate with
exercise) or 1 (Flatsloping: minimal change, typical healthy heart).
• ca {number of major vessels (0-3) colored by fluoroscopy}: the more blood
movement the better, so people with ca equal to 0 are more likely to have heart disease.
• thal {thallium stress result}: People with thal value equal to 2 (fixed defect: used
to be a defect but ok now) are more likely to have heart disease.
plt.legend()
plt.xlabel(column)
6.0.3 Notes:
• trestbps {resting blood pressure in mm Hg on admission to the hospital}: anything
above 130-140 is typically cause for concern.
• chol {serum cholesterol in mg/dl}: above 200 is cause for concern.
• thalach {maximum heart rate achieved}: People who achieved a maximum of more
than 140 are more likely to have heart disease.
• oldpeak {ST depression induced by exercise relative to rest}: looks at the stress of the
heart during exercise; an unhealthy heart will stress more.
7 Correlation Matrix
[19]: # Let's make our correlation matrix a little prettier
corr_matrix = data.corr()
fig, ax = plt.subplots(figsize=(15, 15))
ax = sns.heatmap(corr_matrix,
                 annot=True,
                 linewidths=0.5,
                 fmt=".2f",
                 cmap="YlGnBu")
# Widen the y-limits by half a cell at each end
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
[20]: data.drop('target', axis=1).corrwith(data.target).hvplot.barh(
width=600, height=400,
title="Correlation between Heart Disease and Numeric Features",
ylabel='Correlation', xlabel='Numerical Features',
)
• fbs and chol are the least correlated with the target variable.
• All other variables show a noticeably stronger correlation with the target variable.
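When reading a `corrwith` result like the one above, it helps to rank features by the absolute value of the correlation, since a strong negative correlation is as informative as a strong positive one. The sketch below uses synthetic data; the column names `informative` and `noise` are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic binary target plus one related and one unrelated feature.
target = rng.integers(0, 2, size=200)
toy = pd.DataFrame({
    "informative": target + rng.normal(scale=0.5, size=200),
    "noise": rng.normal(size=200),
    "target": target,
})

# Pearson correlation of every feature with the target, as in cell [20].
corr = toy.drop("target", axis=1).corrwith(toy["target"])

# Rank by magnitude so that negative correlations are not pushed to the bottom.
ranked = corr.abs().sort_values(ascending=False)
print(ranked)
```

Keep in mind that Pearson correlation on integer-coded categories such as cp or thal is only a rough guide, because the codes are labels rather than quantities.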
8 Data Processing
After exploring the dataset, I observed that I need to convert some categorical variables into dummy
variables and scale all the values before training the Machine Learning models. First, I’ll use the
get_dummies method to create dummy columns for categorical variables.
[21]: categorical_val.remove('target')
dataset = pd.get_dummies(data, columns = categorical_val)
[22]: dataset.head()
[22]: age trestbps chol thalach oldpeak target sex_0 sex_1 cp_0 cp_1 \
0 63 145 233 150 2.30 1 0 1 0 0
1 37 130 250 187 3.50 1 0 1 0 0
2 41 130 204 172 1.40 1 1 0 0 1
3 56 120 236 178 0.80 1 0 1 0 1
4 57 120 354 163 0.60 1 1 0 1 0
… slope_2 ca_0 ca_1 ca_2 ca_3 ca_4 thal_0 thal_1 thal_2 thal_3
0 … 0 1 0 0 0 0 0 1 0 0
1 … 0 1 0 0 0 0 0 0 1 0
2 … 1 1 0 0 0 0 0 0 1 0
3 … 1 1 0 0 0 0 0 0 1 0
4 … 1 1 0 0 0 0 0 0 1 0
[5 rows x 31 columns]
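To see how `get_dummies` expands a categorical column into one indicator column per value, here is a minimal sketch on three rows. The values are copied from `data.head()` above, but the frame itself is a toy.

```python
import pandas as pd

# Three rows with one numeric and two categorical columns.
toy = pd.DataFrame({
    "age": [63, 37, 41],
    "cp": [3, 2, 1],
    "sex": [1, 1, 0],
})

# One indicator column per distinct category; untouched columns come first.
encoded = pd.get_dummies(toy, columns=["cp", "sex"])
print(encoded.columns.tolist())

# drop_first=True keeps k-1 indicators per feature, which avoids perfectly
# collinear columns for linear models such as logistic regression.
encoded_drop = pd.get_dummies(toy, columns=["cp", "sex"], drop_first=True)
print(encoded_drop.columns.tolist())
```

The notebook keeps all indicator columns. Dropping the first level is an optional variation, not something the original does. Depending on the pandas version, the indicator columns may be shown as True/False instead of 1/0.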
[23]: print(data.columns)
print(dataset.columns)
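from sklearn.preprocessing import StandardScaler

# Standardize the continuous columns to zero mean and unit variance
s_sc = StandardScaler()
col_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dataset[col_to_scale] = s_sc.fit_transform(dataset[col_to_scale])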
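What `StandardScaler` does is subtract the column mean and divide by the column standard deviation. The sketch below checks this by hand on the first five `age` and `chol` values from `data.head()`. Because it fits on five rows only, the numbers will not match the scaled table that follows in the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five 'age' and 'chol' values from data.head() above.
X = np.array([[63, 233], [37, 250], [41, 204], [56, 236], [57, 354]], dtype=float)

scaled = StandardScaler().fit_transform(X)

# Manual z-score; StandardScaler uses the population std (ddof=0), like np.std.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
```

One common refinement, which the notebook does not apply, is to fit the scaler on the training split only and reuse it to transform the test split, so that no test-set statistics leak into training.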
[25]: dataset.head()
[25]: age trestbps chol thalach oldpeak target sex_0 sex_1 cp_0 cp_1 \
0 0.95 0.76 -0.26 0.02 1.09 1 0 1 0 0
1 -1.92 -0.09 0.07 1.63 2.12 1 0 1 0 0
2 -1.47 -0.09 -0.82 0.98 0.31 1 1 0 0 1
3 0.18 -0.66 -0.20 1.24 -0.21 1 0 1 0 1
4 0.29 -0.66 2.08 0.58 -0.38 1 1 0 1 0
… slope_2 ca_0 ca_1 ca_2 ca_3 ca_4 thal_0 thal_1 thal_2 thal_3
0 … 0 1 0 0 0 0 0 1 0 0
1 … 0 1 0 0 0 0 0 0 1 0
2 … 1 1 0 0 0 0 0 0 1 0
3 … 1 1 0 0 0 0 0 0 1 0
4 … 1 1 0 0 0 0 0 0 1 0
[5 rows x 31 columns]
9 Models Building
[26]: from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def print_score(clf, X_train, y_train, X_test, y_test, train=True):
    # Print accuracy, the classification report and the confusion matrix for one split
    if train:
        pred = clf.predict(X_train)
        clf_report = pd.DataFrame(classification_report(y_train, pred, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_train, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_train, pred)}\n")
    else:
        pred = clf.predict(X_test)
        clf_report = pd.DataFrame(classification_report(y_test, pred, output_dict=True))
        print("Test Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(y_test, pred) * 100:.2f}%")
        print("_______________________________________________")
        print(f"CLASSIFICATION REPORT:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(y_test, pred)}\n")
[27]: from sklearn.model_selection import train_test_split
X = dataset.drop('target', axis=1)
y = dataset.target
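The call that actually splits the data is not shown in this export. The reports further down have 212 training rows and 91 test rows, which is what a 70/30 split of 303 rows produces. The sketch below reproduces those sizes on stand-in arrays. `test_size=0.3` is inferred from the reported sizes and `random_state=42` is my assumption, as are all the `_demo` names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows and class counts as the dataset.
X_demo = np.arange(303 * 2).reshape(303, 2)
y_demo = np.array([1] * 165 + [0] * 138)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo,
    test_size=0.3,     # inferred from the 212 / 91 supports reported below
    random_state=42,   # assumption: the original seed is not shown
)
print(len(X_train_demo), len(X_test_demo))
```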
Now that we've got our data split into training and test sets, it's time to build a machine learning model.
We'll train it (find the patterns) on the training set.
And we'll test it (use the patterns) on the test set.
We're going to try 6 different machine learning models:
1. Logistic Regression
2. K-Nearest Neighbours Classifier
3. Support Vector Machine
4. Decision Tree Classifier
5. Random Forest Classifier
6. XGBoost Classifier
from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression(solver='liblinear')
lr_clf.fit(X_train, y_train)
Train Result:
================================================
Accuracy Score: 86.79%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.88 0.86 0.87 0.87 0.87
recall 0.82 0.90 0.87 0.86 0.87
f1-score 0.85 0.88 0.87 0.87 0.87
support 97.00 115.00 0.87 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 80 17]
[ 11 104]]
Test Result:
================================================
Accuracy Score: 86.81%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.87 0.87 0.87 0.87 0.87
recall 0.83 0.90 0.87 0.86 0.87
f1-score 0.85 0.88 0.87 0.87 0.87
support 41.00 50.00 0.87 91.00 91.00
_______________________________________________
Confusion Matrix:
[[34 7]
[ 5 45]]
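The figures in the report can be recomputed from the confusion matrix alone. Here is a short sketch using the test-set matrix printed above. In scikit-learn's layout rows are true classes and columns are predicted classes.

```python
import numpy as np

# Test-set confusion matrix for logistic regression, copied from the output above.
cm = np.array([[34, 7],
               [5, 45]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions / all predictions
precision_1 = tp / (tp + fp)      # of those predicted as disease, how many have it
recall_1 = tp / (tp + fn)         # of those with disease, how many were found

print(f"accuracy={accuracy:.4f} precision={precision_1:.2f} recall={recall_1:.2f}")
```

These values agree with the 86.81% accuracy and the class 1 precision and recall in the report.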
results_df
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)
Train Result:
================================================
Accuracy Score: 86.79%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.86 0.87 0.87 0.87 0.87
recall 0.85 0.89 0.87 0.87 0.87
f1-score 0.85 0.88 0.87 0.87 0.87
support 97.00 115.00 0.87 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 82 15]
[ 13 102]]
Test Result:
================================================
Accuracy Score: 86.81%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.85 0.88 0.87 0.87 0.87
recall 0.85 0.88 0.87 0.87 0.87
f1-score 0.85 0.88 0.87 0.87 0.87
support 41.00 50.00 0.87 91.00 91.00
_______________________________________________
Confusion Matrix:
[[35 6]
[ 6 44]]
Train Result:
================================================
Accuracy Score: 93.40%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.94 0.93 0.93 0.93 0.93
recall 0.92 0.95 0.93 0.93 0.93
f1-score 0.93 0.94 0.93 0.93 0.93
support 97.00 115.00 0.93 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 89 8]
[ 6 109]]
Test Result:
================================================
Accuracy Score: 87.91%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.86 0.90 0.88 0.88 0.88
recall 0.88 0.88 0.88 0.88 0.88
f1-score 0.87 0.89 0.88 0.88 0.88
support 41.00 50.00 0.88 91.00 91.00
_______________________________________________
Confusion Matrix:
[[36 5]
[ 6 44]]
from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
print_score(tree_clf, X_train, y_train, X_test, y_test, train=True)
print_score(tree_clf, X_train, y_train, X_test, y_test, train=False)
Train Result:
================================================
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 1.00 1.00 1.00 1.00 1.00
recall 1.00 1.00 1.00 1.00 1.00
f1-score 1.00 1.00 1.00 1.00 1.00
support 97.00 115.00 1.00 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 97 0]
[ 0 115]]
Test Result:
================================================
Accuracy Score: 78.02%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.72 0.84 0.78 0.78 0.79
recall 0.83 0.74 0.78 0.78 0.78
f1-score 0.77 0.79 0.78 0.78 0.78
support 41.00 50.00 0.78 91.00 91.00
_______________________________________________
Confusion Matrix:
[[34 7]
[13 37]]
1 K-nearest neighbors 86.79 86.81
2 Support Vector Machine 93.40 87.91
3 Decision Tree Classifier 100.00 78.02
Train Result:
================================================
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 1.00 1.00 1.00 1.00 1.00
recall 1.00 1.00 1.00 1.00 1.00
f1-score 1.00 1.00 1.00 1.00 1.00
support 97.00 115.00 1.00 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 97 0]
[ 0 115]]
Test Result:
================================================
Accuracy Score: 82.42%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.80 0.84 0.82 0.82 0.82
recall 0.80 0.84 0.82 0.82 0.82
f1-score 0.80 0.84 0.82 0.82 0.82
support 41.00 50.00 0.82 91.00 91.00
_______________________________________________
Confusion Matrix:
[[33 8]
[ 8 42]]
[37]: test_score = accuracy_score(y_test, rf_clf.predict(X_test)) * 100
train_score = accuracy_score(y_train, rf_clf.predict(X_train)) * 100
from xgboost import XGBClassifier
xgb_clf = XGBClassifier(use_label_encoder=False)
xgb_clf.fit(X_train, y_train)
Train Result:
================================================
Accuracy Score: 100.00%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 1.00 1.00 1.00 1.00 1.00
recall 1.00 1.00 1.00 1.00 1.00
f1-score 1.00 1.00 1.00 1.00 1.00
support 97.00 115.00 1.00 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 97 0]
[ 0 115]]
Test Result:
================================================
Accuracy Score: 82.42%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.80 0.84 0.82 0.82 0.82
recall 0.80 0.84 0.82 0.82 0.82
f1-score 0.80 0.84 0.82 0.82 0.82
support 41.00 50.00 0.82 91.00 91.00
_______________________________________________
Confusion Matrix:
[[33 8]
[ 8 42]]
lr_clf = LogisticRegression()
lr_cv.fit(X_train, y_train)
best_params = lr_cv.best_params_
print(f"Best parameters: {best_params}")
lr_clf = LogisticRegression(**best_params)
lr_clf.fit(X_train, y_train)
Test Result:
================================================
Accuracy Score: 85.71%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.85 0.86 0.86 0.86 0.86
recall 0.83 0.88 0.86 0.85 0.86
f1-score 0.84 0.87 0.86 0.86 0.86
support 41.00 50.00 0.86 91.00 91.00
_______________________________________________
Confusion Matrix:
[[34 7]
[ 6 44]]
tuning_results_df = pd.DataFrame(
data=[["Tuned Logistic Regression", train_score, test_score]],
columns=['Model', 'Training Accuracy %', 'Testing Accuracy %']
)
tuning_results_df
[41]: Model Training Accuracy % Testing Accuracy %
0 Tuned Logistic Regression 85.85 85.71
for k in neighbors:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    train_score.append(accuracy_score(y_train, model.predict(X_train)))
    # test_score.append(accuracy_score(y_test, model.predict(X_test)))
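Training accuracy alone tends to favour very small k, and picking k on the test set would leak information. One alternative, sketched below, is to score each k with cross-validation on the training data. This is not what the notebook does: the synthetic data, the range of k and the names `X_demo`, `y_demo` and `cv_scores` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for X_train / y_train (212 rows, like the training split).
X_demo, y_demo = make_classification(n_samples=212, n_features=10, random_state=0)

neighbors = range(1, 30)  # assumption: the original range is not shown
cv_scores = []
for k in neighbors:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds of the training data only.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores.append(scores.mean())

# k with the highest cross-validated accuracy (first one in case of ties).
best_k = neighbors[cv_scores.index(max(cv_scores))]
print(best_k, round(max(cv_scores), 3))
```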
[44]: knn_clf = KNeighborsClassifier(n_neighbors=27)
knn_clf.fit(X_train, y_train)
Train Result:
================================================
Accuracy Score: 81.13%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.84 0.80 0.81 0.82 0.81
recall 0.73 0.88 0.81 0.81 0.81
f1-score 0.78 0.83 0.81 0.81 0.81
support 97.00 115.00 0.81 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 71 26]
[ 14 101]]
Test Result:
================================================
Accuracy Score: 87.91%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.89 0.87 0.88 0.88 0.88
recall 0.83 0.92 0.88 0.87 0.88
f1-score 0.86 0.89 0.88 0.88 0.88
support 41.00 50.00 0.88 91.00 91.00
_______________________________________________
Confusion Matrix:
[[34 7]
[ 4 46]]
results_df_2 = pd.DataFrame(
    data=[["Tuned K-nearest neighbors", train_score, test_score]],
    columns=['Model', 'Training Accuracy %', 'Testing Accuracy %']
)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
tuning_results_df = pd.concat([tuning_results_df, results_df_2], ignore_index=True)
tuning_results_df
svm_cv.fit(X_train, y_train)
best_params = svm_cv.best_params_
print(f"Best params: {best_params}")
svm_clf = SVC(**best_params)
svm_clf.fit(X_train, y_train)
Fitting 5 folds for each of 147 candidates, totalling 735 fits
Best params: {'C': 5, 'gamma': 0.01, 'kernel': 'rbf'}
Train Result:
================================================
Accuracy Score: 87.74%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.88 0.87 0.88 0.88 0.88
recall 0.85 0.90 0.88 0.87 0.88
f1-score 0.86 0.89 0.88 0.88 0.88
support 97.00 115.00 0.88 212.00 212.00
_______________________________________________
Confusion Matrix:
[[ 82 15]
[ 11 104]]
Test Result:
================================================
Accuracy Score: 84.62%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.85 0.85 0.85 0.85 0.85
recall 0.80 0.88 0.85 0.84 0.85
f1-score 0.83 0.86 0.85 0.84 0.85
support 41.00 50.00 0.85 91.00 91.00
_______________________________________________
Confusion Matrix:
[[33 8]
[ 6 44]]
results_df_2 = pd.DataFrame(
    data=[["Tuned Support Vector Machine", train_score, test_score]],
    columns=['Model', 'Training Accuracy %', 'Testing Accuracy %']
)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
tuning_results_df = pd.concat([tuning_results_df, results_df_2], ignore_index=True)
tuning_results_df
10.4 4. Decision Tree Classifier Hyperparameter Tuning
[48]: params = {
    "criterion": ("gini", "entropy"),
    "splitter": ("best", "random"),
    "max_depth": list(range(1, 20)),
    "min_samples_split": [2, 3, 4],
    "min_samples_leaf": list(range(1, 20)),
}
tree_clf = DecisionTreeClassifier(random_state=42)
tree_cv = GridSearchCV(tree_clf, params, scoring="accuracy", n_jobs=-1, verbose=1, cv=5)
tree_cv.fit(X_train, y_train)
best_params = tree_cv.best_params_
print(f'Best_params: {best_params}')
tree_clf = DecisionTreeClassifier(**best_params)
tree_clf.fit(X_train, y_train)
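It is worth knowing how large a grid is before launching the search. The sketch below counts the candidates for the decision tree grid above with `ParameterGrid`, and multiplies by the 5 folds to get the number of fits. The values are written as lists here; the grid itself is the one from cell [48].

```python
from sklearn.model_selection import ParameterGrid

# Same grid as cell [48].
params = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": list(range(1, 20)),
    "min_samples_split": [2, 3, 4],
    "min_samples_leaf": list(range(1, 20)),
}

# Every combination is one candidate: 2 * 2 * 19 * 3 * 19.
n_candidates = len(ParameterGrid(params))

# With cv=5, each candidate is fitted once per fold.
n_fits = n_candidates * 5
print(n_candidates, n_fits)
```

For comparison, the SVM search log above reports 147 candidates and 735 fits, so the tree grid is roughly thirty times larger.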
Test Result:
================================================
Accuracy Score: 64.84%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.60 0.70 0.65 0.65 0.66
recall 0.68 0.62 0.65 0.65 0.65
f1-score 0.64 0.66 0.65 0.65 0.65
support 41.00 50.00 0.65 91.00 91.00
_______________________________________________
Confusion Matrix:
[[28 13]
[19 31]]
results_df_2 = pd.DataFrame(
    data=[["Tuned Decision Tree Classifier", train_score, test_score]],
    columns=['Model', 'Training Accuracy %', 'Testing Accuracy %']
)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
tuning_results_df = pd.concat([tuning_results_df, results_df_2], ignore_index=True)
tuning_results_df
params_grid = {
    'n_estimators': n_estimators,
    'max_features': max_features,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf
}
rf_clf = RandomForestClassifier(random_state=42)
rf_cv = GridSearchCV(rf_clf, params_grid, scoring="accuracy", cv=5, verbose=1, n_jobs=-1)
rf_cv.fit(X_train, y_train)
best_params = rf_cv.best_params_
print(f"Best parameters: {best_params}")
rf_clf = RandomForestClassifier(**best_params)
rf_clf.fit(X_train, y_train)
Test Result:
================================================
Accuracy Score: 83.52%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.82 0.84 0.84 0.83 0.83
recall 0.80 0.86 0.84 0.83 0.84
f1-score 0.81 0.85 0.84 0.83 0.83
support 41.00 50.00 0.84 91.00 91.00
_______________________________________________
Confusion Matrix:
[[33 8]
[ 7 43]]
results_df_2 = pd.DataFrame(
    data=[["Tuned Random Forest Classifier", train_score, test_score]],
    columns=['Model', 'Training Accuracy %', 'Testing Accuracy %']
)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
tuning_results_df = pd.concat([tuning_results_df, results_df_2], ignore_index=True)
tuning_results_df
xgb_clf = XGBClassifier(use_label_encoder=False)
xgb_cv = RandomizedSearchCV(
xgb_clf, param_grid, cv=5, n_iter=150,
scoring='accuracy', n_jobs=-1, verbose=1
)
xgb_cv.fit(X_train, y_train)
best_params = xgb_cv.best_params_
print(f"Best parameters: {best_params}")
xgb_clf = XGBClassifier(**best_params)
xgb_clf.fit(X_train, y_train)
Confusion Matrix:
[[ 97 0]
[ 0 115]]
Test Result:
================================================
Accuracy Score: 78.02%
_______________________________________________
CLASSIFICATION REPORT:
0 1 accuracy macro avg weighted avg
precision 0.73 0.83 0.78 0.78 0.78
recall 0.80 0.76 0.78 0.78 0.78
f1-score 0.77 0.79 0.78 0.78 0.78
support 41.00 50.00 0.78 91.00 91.00
_______________________________________________
Confusion Matrix:
[[33 8]
[12 38]]
results_df_2 = pd.DataFrame(
    data=[["Tuned XGBoost Classifier", train_score, test_score]],
    columns=['Model', 'Training Accuracy %', 'Testing Accuracy %']
)
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
tuning_results_df = pd.concat([tuning_results_df, results_df_2], ignore_index=True)
tuning_results_df
[54]: results_df
It seems that the results did not improve much after hyperparameter tuning, maybe because the
dataset is small.
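A rough way to see how much a 91-row test set limits these comparisons is to estimate the uncertainty of a single accuracy figure. The sketch below uses the simple binomial approximation, which is my assumption here and only a back-of-the-envelope estimate. It takes the untuned logistic regression test result (79 of 91 correct) as the example.

```python
import math

n_test = 91    # size of the test split
correct = 79   # 34 + 45 correct predictions in the logistic regression test matrix

p = correct / n_test

# Binomial standard error of an accuracy estimated from n_test samples.
standard_error = math.sqrt(p * (1 - p) / n_test)

# Change in accuracy caused by a single test example flipping.
one_sample = 1 / n_test

print(f"accuracy={p:.2%}, standard error={standard_error:.2%}, one sample={one_sample:.2%}")
```

With a standard error of about 3.5 percentage points, differences of one or two points between the tuned and untuned models are within the noise of this test set.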