Case Study: Breast Cancer Classification Using a Support Vector Machine | by Mahsa Mir | Towards Data Science
https://towardsdatascience.com/case-study-breast-cancer-classification-svm-2b67d668bbb7
In this tutorial, we’re going to create a model to predict whether a patient has a positive
breast cancer diagnosis based on several tumor features.
Problem Statement
The breast cancer database is a publicly available dataset from the UCI Machine Learning Repository. It provides information on tumor features such as tumor size, density, and texture.
Goal: to create a classification model that predicts whether a cancer diagnosis is benign or malignant based on several features.
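If you don't have the Breast_cancer_data.csv file at hand, a roughly equivalent frame can be built from scikit-learn's bundled copy of the same UCI dataset. This is a sketch, not the article's own loading code; note that scikit-learn encodes 0 = malignant and 1 = benign, the reverse of the mapping used later in this tutorial:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
# keep only the five "mean" features this tutorial uses
cols = ['mean radius', 'mean texture', 'mean perimeter',
        'mean area', 'mean smoothness']
idx = [list(data.feature_names).index(c) for c in cols]

df_cancer = pd.DataFrame(data.data[:, idx],
                         columns=[c.replace(' ', '_') for c in cols])
# scikit-learn encodes 0 = malignant, 1 = benign; flip so that 1 = malignant
df_cancer['diagnosis'] = 1 - data.target
print(df_cancer.shape)  # (569, 6)
```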
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# import data
df_cancer = pd.read_csv('Breast_cancer_data.csv')
df_cancer.head()
df_cancer.info()
df_cancer.describe()
# visualizing data
sns.pairplot(df_cancer, hue='diagnosis')
plt.figure(figsize=(7,7))

Pairplot of the Breast_cancer_data features, colored by diagnosis
Output of df_cancer.describe()
edgecolor = 'grey'
fig = plt.figure(figsize=(12,12))

plt.subplot(221)
plt.scatter(df_cancer['mean_radius'], df_cancer['mean_texture'], edgecolor=edgecolor)
plt.title('mean_radius vs mean_texture')

plt.subplot(222)
plt.scatter(df_cancer['mean_radius'], df_cancer['mean_perimeter'], edgecolor=edgecolor)
plt.title('mean_radius vs mean_perimeter')

plt.subplot(223)
plt.scatter(df_cancer['mean_radius'], df_cancer['mean_area'], edgecolor=edgecolor)
plt.title('mean_radius vs mean_area')

plt.subplot(224)
plt.scatter(df_cancer['mean_radius'], df_cancer['mean_smoothness'], edgecolor=edgecolor)
plt.title('mean_radius vs mean_smoothness')

plt.savefig('2')
plt.show()
df_cancer.isnull().sum()
df_cancer['diagnosis'].unique()

df_cancer['diagnosis'] = df_cancer['diagnosis'].map({'benign': 0, 'malignant': 1})
df_cancer.head()
df_cancer.corr()['diagnosis'][:-1].sort_values().plot(kind ='bar')
Bar plot of each feature's correlation with the diagnosis
Step 3: Splitting the Dataset into a Training Set and a Test Set
The data is divided into a train set and a test set. We use the train set to let the algorithm learn the data's behavior, and then check the accuracy of the model on the test set.
Features (X): the columns that are fed into our model; they are used to make predictions.
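As a minimal illustration of how train_test_split divides the data (toy arrays here, not the cancer data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(20).reshape(10, 2)  # 10 samples, 2 feature columns
y_demo = np.arange(10) % 2             # toy labels

# hold out 20% of the rows for testing
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo,
                                      test_size=0.2, random_state=42)
print(Xtr.shape, Xte.shape)  # (8, 2) (2, 2)
```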
from sklearn.model_selection import train_test_split

X = df_cancer.drop(['diagnosis'], axis=1).values
y = df_cancer['diagnosis'].values

# split into train and test sets (the exact split used in the article is not
# shown; test_size=0.25 matches the 143 test samples reported below)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
Illustration of an SVM separating hyperplane (source: ResearchGate)
What is the SVM's job? The SVM chooses the hyperplane that maximizes the separation (margin) between classes.
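To see the maximum-margin idea concretely, here is a small sketch on synthetic 2D clusters (not the cancer data): a linear SVM with a very large C behaves like a hard-margin classifier, and the margin width it maximizes equals 2/||w||:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two well-separated 2D clusters
Xs = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
ys = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel='linear', C=1e6).fit(Xs, ys)  # huge C ~ hard margin
w = clf.coef_[0]
print(2 / np.linalg.norm(w))      # the margin width the SVM maximizes
print(len(clf.support_vectors_))  # only a few boundary points define the plane
```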
What are hard and soft margins? If the data is linearly separable, the SVM can enforce a strict boundary that misclassifies nothing (hard margin). When the data is not linearly separable, all we need to do is relax the margin to allow some misclassifications (soft margin).
What is the kernel trick? If our data is not linearly separable, we can apply the "kernel trick," which maps the nonlinear data into a higher-dimensional space where it becomes separable.
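A quick sketch of why the kernel trick matters, using scikit-learn's make_circles (synthetic data, not the cancer dataset): a linear kernel cannot separate concentric rings, while the RBF kernel handles them easily:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# two concentric rings: no straight line can separate them
Xc, yc = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel='linear').fit(Xc, yc).score(Xc, yc)
rbf_acc = SVC(kernel='rbf').fit(Xc, yc).score(Xc, yc)
print(linear_acc)  # poor: no separating line exists
print(rbf_acc)     # near-perfect: the kernel lifts the data
```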
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report

svc_model = SVC()
svc_model.fit(X_train, y_train)

y_predict = svc_model.predict(X_test)
cm = confusion_matrix(y_test, y_predict)
sns.heatmap(cm, annot=True)
Of the 55 women predicted not to have breast cancer, two actually had it (false negatives, a type II error).
Of the 88 women predicted to have breast cancer, 14 did not actually have it (false positives, a type I error).
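For reference, scikit-learn's confusion_matrix lays out binary results as [[TN, FP], [FN, TP]], so the four counts can be unpacked like this (toy labels for illustration):

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# layout is [[TN, FP], [FN, TP]], so ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```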
What does this classification report mean? It means the SVM model was able to classify tumors as malignant or benign with 89% accuracy.
Note:
Recall is the fraction of actual positives that were correctly identified.
F1-score is the harmonic mean of precision and recall; it ranges from 0 (worst) to 1 (best).
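These definitions can be checked by hand on a tiny example (toy labels; here TP = 3, FP = 1, FN = 1):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision = 3 / (3 + 1)   # TP / (TP + FP) = 0.75
recall = 3 / (3 + 1)      # TP / (TP + FN) = 0.75
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean = 0.75

print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```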
# normalize: fit_transform on the train set, transform only on the test set
from sklearn.preprocessing import MinMaxScaler

n_scaler = MinMaxScaler()
X_train_scaled = n_scaler.fit_transform(X_train.astype(float))
X_test_scaled = n_scaler.transform(X_test.astype(float))
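A small sketch of why we fit the scaler on the train set only: the test set is transformed with the train set's min and max, so no information leaks from the test set into the scaling (and scaled test values may legitimately fall outside [0, 1]):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_tr = np.array([[1.0], [5.0], [9.0]])   # toy "train" column
X_te = np.array([[9.0], [13.0]])         # toy "test" column

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(X_tr)  # learns min=1, max=9 from train only
test_scaled = scaler.transform(X_te)       # reuses the train min/max

print(train_scaled.ravel())  # [0.  0.5 1. ]
print(test_scaled.ravel())   # [1.  1.5] -- outside [0, 1], and that's fine
```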
svc_model = SVC()
svc_model.fit(X_train_scaled, y_train)
y_predict_scaled = svc_model.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_predict_scaled)
sns.heatmap(cm, annot=True)
print(classification_report(y_test, y_predict_scaled))
Of the 55 women predicted not to have breast cancer, 4 actually had it (false negatives, a type II error).
Of the 88 women predicted to have breast cancer, 7 did not actually have it (false positives, a type I error).
What does this classification report mean? It means the SVM model was able to classify tumors as malignant or benign with 92% accuracy.
Smaller C: higher bias and lower variance (a softer margin); misclassification is penalized less.
Larger C: lower bias and higher variance (closer to a hard margin); misclassification is penalized more strictly.
Gamma:
Smaller gamma: each training point's influence reaches far, giving a smoother, more generalized solution (higher bias, lower variance).
Larger gamma: each point's influence is short-range, so closer data points carry more weight (lower bias, higher variance).
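One way to see the effect of C is to count support vectors on synthetic data (a sketch; make_classification stands in for the cancer data here): a small C softens the margin, pulling many points into it, while a large C tightens it:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

Xg, yg = make_classification(n_samples=200, n_features=2, n_informative=2,
                             n_redundant=0, random_state=0)

counts = {}
for C in (0.01, 1, 100):
    model = SVC(kernel='rbf', C=C, gamma='scale').fit(Xg, yg)
    counts[C] = int(model.n_support_.sum())
    print(C, counts[C])  # smaller C -> softer margin -> more support vectors
```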
So, let’s find the optimal parameters for our model using grid search:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

grid = GridSearchCV(SVC(), param_grid, verbose=4)
grid.fit(X_train_scaled, y_train)
grid.best_params_
grid.best_estimator_

grid_predictions = grid.predict(X_test_scaled)
cmG = confusion_matrix(y_test, grid_predictions)
sns.heatmap(cmG, annot=True)
print(classification_report(y_test, grid_predictions))
As you can see, in this case the last round of tuning did not improve the accuracy percentage. However, we did succeed in reducing the type II (false negative) error.
I hope this has helped you to understand the topic better. Any feedback is welcome as it
allows me to get new insights and correct any mistakes!