
Case Study: Breast Cancer Classification Using a Support Vector Machine
Create a model to predict whether a patient has breast cancer based on tumor features

Mahsa Mir Jul 8, 2020 · 7 min read

Photo by Peter Boccia on Unsplash

In this tutorial, we’re going to create a model to predict whether a patient has a positive
breast cancer diagnosis based on several tumor features.

Problem Statement
The breast cancer database is a publicly available dataset from the UCI Machine Learning Repository. It gives information on tumor features such as tumor size, density, and texture.

Goal: To create a classification model that predicts whether a cancer diagnosis is benign or malignant based on several features.

Data used: Kaggle-Breast Cancer Prediction Dataset

Step 1: Exploring the Dataset


First, let’s understand our dataset:

#import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#import models from the scikit-learn module
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.svm import SVC

#import data
df_cancer = pd.read_csv('Breast_cancer_data.csv')
df_cancer.head()

#get some information about our dataset
df_cancer.info()
df_cancer.describe()


#visualizing data
sns.pairplot(df_cancer, hue='diagnosis')

plt.figure(figsize=(7, 7))
sns.heatmap(df_cancer['mean_radius mean_texture mean_perimeter mean_area mean_smoothness diagnosis'.split()].corr(), annot=True)

sns.scatterplot(x='mean_texture', y='mean_perimeter', hue='diagnosis', data=df_cancer)

[Output: first rows of Breast_cancer_data]

[Output: df_cancer.info() and df_cancer.describe()]

[Figure: pairplot of features from Breast_cancer_data]

[Figure: heatmap of the correlation between features]

#visualizing feature correlations
palette = {0: 'orange', 1: 'blue'}
edgecolor = 'grey'

fig = plt.figure(figsize=(12, 12))

plt.subplot(221)
ax1 = sns.scatterplot(x='mean_radius', y='mean_texture', hue='diagnosis',
                      data=df_cancer, palette=palette, edgecolor=edgecolor)
plt.title('mean_radius vs mean_texture')

plt.subplot(222)
ax2 = sns.scatterplot(x='mean_radius', y='mean_perimeter', hue='diagnosis',
                      data=df_cancer, palette=palette, edgecolor=edgecolor)
plt.title('mean_radius vs mean_perimeter')

plt.subplot(223)
ax3 = sns.scatterplot(x='mean_radius', y='mean_area', hue='diagnosis',
                      data=df_cancer, palette=palette, edgecolor=edgecolor)
plt.title('mean_radius vs mean_area')

plt.subplot(224)
ax4 = sns.scatterplot(x='mean_radius', y='mean_smoothness', hue='diagnosis',
                      data=df_cancer, palette=palette, edgecolor=edgecolor)
plt.title('mean_radius vs mean_smoothness')

fig.suptitle('Features Correlation', fontsize=20)
plt.savefig('2')
plt.show()


Step 2: Handling of Missing/Categorical Data


Before applying any method, we need to check if any values are missing and then
deal with them if so. In this dataset, there are no missing values — but always keep
the habit of checking for null values in a dataset!

Since machine learning models are based on mathematical equations, we need to encode the categorical variables. Here I used label encoding, since we have two distinct values in the “diagnosis” column:

#check how many values are missing (NaN) - here we do not have any missing values
df_cancer.isnull().sum()

#handling categorical data
df_cancer['diagnosis'].unique()
df_cancer['diagnosis'] = df_cancer['diagnosis'].map({'benign': 0, 'malignant': 1})
df_cancer.head()
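As a side note, scikit-learn’s LabelEncoder is a common alternative to the map call above; a minimal sketch, assuming the “diagnosis” column holds the strings 'benign'/'malignant' (run one approach or the other, not both):

#equivalent label encoding via scikit-learn; classes are sorted
#alphabetically, so 'benign' -> 0 and 'malignant' -> 1, as above
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df_cancer['diagnosis'] = le.fit_transform(df_cancer['diagnosis'])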

Let’s keep exploring our dataset:


#visualizing the diagnosis column >>> 'benign': 0, 'malignant': 1
sns.countplot(x='diagnosis', data=df_cancer)
plt.title('number of Benign_0 vs Malignant_1')

#correlation between features and target
df_cancer.corr()['diagnosis'][:-1].sort_values().plot(kind='bar')
plt.title('Corr. between features and target')

[Figure: count plot of diagnosis classes and feature-target correlations]

Step 3: Splitting the Dataset into a Training Set and a Test Set
The data is divided into a train set and a test set. We use the train set to let the algorithm learn the data’s behavior and then check the accuracy of our model on the test set.

Features (X): the columns that are fed into our model and used to make the predictions.

Target (y): the variable that will be predicted from the features.


#define the X variables and our target (y)
X = df_cancer.drop(['diagnosis'], axis=1).values
y = df_cancer['diagnosis'].values

#split into Train and Test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)
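One optional tweak, not used in the original split: passing stratify=y keeps the benign/malignant proportions identical in the train and test sets, which helps when the classes are imbalanced. Note that re-splitting this way would change the exact counts reported below; a sketch:

#stratified variant of the same split (alternative, not the split used here)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=101, stratify=y)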

Step 4: Data Modeling - Support Vector Machine

Support Vector Machine (SVM) is one of the most useful supervised ML algorithms. It can be used for both classification and regression tasks.

There are a couple of concepts we first need to understand:

What is the SVM’s job? The SVM chooses the hyperplane that maximizes the separation between classes.


What are hard and soft margins? If the data is linearly separable, the SVM can separate the classes perfectly (hard margin). When the data is not linearly separable, all we need to do is relax the margin to allow some misclassifications (soft margin).

What is the hyper-parameter C? The number of misclassification errors can be controlled using the C parameter, which has a direct effect on the hyperplane.

What is the hyper-parameter gamma? Gamma gives weight to the points close to the support vectors. In other words, changing the value of gamma changes the shape of the decision boundary.

What is the kernel trick? If our data is not linearly separable, we can apply the “kernel trick” method, which maps the nonlinear data into a higher-dimensional space where it becomes separable.
Now let’s get back to our code!

#Support Vector Classification model
from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(X_train, y_train)

Step 5: Model Evaluation

from sklearn.metrics import classification_report, confusion_matrix

y_predict = svc_model.predict(X_test)
cm = confusion_matrix(y_test, y_predict)
sns.heatmap(cm, annot=True)
print(classification_report(y_test, y_predict))

What does the confusion_matrix result mean?

We had 143 women in our test set.

Of the 55 women predicted not to have breast cancer, 2 actually had it (false negatives, i.e. Type II errors).

Of the 88 women predicted to have breast cancer, 14 did not actually have it (false positives, i.e. Type I errors).

What does this classification report result mean? Basically, it means that the SVM model was able to classify tumors into malignant and benign with 89% accuracy.
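For a binary problem, scikit-learn lays the confusion matrix out as [[TN, FP], [FN, TP]], so these counts can be read off programmatically; a quick sketch:

#unpack the binary confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()
print('false positives (Type I): ', fp)
print('false negatives (Type II):', fn)
print('accuracy:', (tp + tn) / (tp + tn + fp + fn))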

Note:

Precision is the fraction of predicted positive results that are actually positive.

Recall is the fraction of all actual positives that were correctly classified.

F1-score is the harmonic mean of precision and recall, ranging from 0 (terrible) to 1 (perfection).
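Using the counts unpacked above, all three metrics are one line of arithmetic each:

#precision, recall and F1 for the positive (malignant) class
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)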

Step 6: What Can We Do to Improve Our Model?

1. Data normalization
Feature scaling helps us see all the variables through the same lens (on the same scale); here we bring all values into the range [0, 1]:


#normalize the data: fit & transform on train, transform only on test
from sklearn.preprocessing import MinMaxScaler

n_scaler = MinMaxScaler()
X_train_scaled = n_scaler.fit_transform(X_train.astype(float))
X_test_scaled = n_scaler.transform(X_test.astype(float))

#Support Vector Classification model - applied to the scaled data
from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(X_train_scaled, y_train)

from sklearn.metrics import classification_report, confusion_matrix

y_predict_scaled = svc_model.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_predict_scaled)
sns.heatmap(cm, annot=True)
print(classification_report(y_test, y_predict_scaled))
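As a sanity check (a sketch, not in the original), the scaler’s output can be reproduced by hand, since min-max scaling simply maps each feature x to (x - x_min) / (x_max - x_min), computed column-wise on the training data:

#manual min-max scaling should match MinMaxScaler on the train set
#(assumes no constant columns, which would give a zero denominator)
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual = (X_train - col_min) / (col_max - col_min)
print(np.allclose(manual, X_train_scaled))  # expected: True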

What does the confusion matrix result mean?

We had 143 women in our test set.

Of the 55 women predicted not to have breast cancer, 4 actually had it (false negatives, i.e. Type II errors).

Of the 88 women predicted to have breast cancer, 7 did not actually have it (false positives, i.e. Type I errors).

[Figure: result for the SVC model on the scaled dataset]


What does this classification report result mean? Basically, it means that the SVM model was able to classify tumors into malignant/benign with 92% accuracy.

2. SVM parameter optimization

C parameter: as we said, it controls the cost of misclassification on the train data. (A short sketch illustrating this trade-off follows the gamma notes below.)

Smaller C: lower variance but higher bias (soft margin); misclassification is penalized less.

Larger C: lower bias but higher variance (hard margin); misclassification is penalized more heavily.

Gamma:

Smaller gamma: each training point has a far reach, giving a smoother, more generalized decision boundary (higher bias, lower variance).

Larger gamma: each point has a close reach, so closer data points carry a higher weight, giving a more flexible boundary (lower bias, higher variance).
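Here is the promised sketch of the C trade-off (an illustration I added, not from the original article): as C grows, the margin hardens, which typically shows up as fewer support vectors on the scaled training data:

#larger C -> harder margin -> typically fewer support vectors
for C in [0.01, 1, 100]:
    model = SVC(C=C, kernel='rbf').fit(X_train_scaled, y_train)
    print(f'C={C}: {model.n_support_.sum()} support vectors')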

So, let’s find the optimal parameters for our model using grid search:

#find the best hyperparameters
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

grid = GridSearchCV(SVC(), param_grid, verbose=4)
grid.fit(X_train_scaled, y_train)

grid.best_params_
grid.best_estimator_

grid_predictions = grid.predict(X_test_scaled)
cmG = confusion_matrix(y_test, grid_predictions)
sns.heatmap(cmG, annot=True)
print(classification_report(y_test, grid_predictions))
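One refinement worth considering (a sketch, not part of the original tutorial): wrapping the scaler and the classifier in a Pipeline lets GridSearchCV refit the scaler inside every cross-validation fold, so no information leaks from the validation folds into the scaling:

#scaler + SVC as one estimator; grid keys are prefixed by the step name
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline([('scaler', MinMaxScaler()), ('svc', SVC())])
pipe_grid = {'svc__C': [0.1, 1, 10, 100, 1000],
             'svc__gamma': [1, 0.1, 0.01, 0.001, 0.0001],
             'svc__kernel': ['rbf']}
grid_pipe = GridSearchCV(pipe, pipe_grid, cv=5)
grid_pipe.fit(X_train, y_train)  # raw X_train; the pipeline scales internally
print(grid_pipe.best_params_, grid_pipe.score(X_test, y_test))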


[Figure: result for the SVC model (scaled data + optimal parameters)]

As you can see, in this case the last model improvement did not increase the overall accuracy; however, we did succeed in reducing the number of false positives (Type I errors).

I hope this has helped you to understand the topic better. Any feedback is welcome as it
allows me to get new insights and correct any mistakes!
