
Case Study: Breast Cancer Classification Using a Support Vector Machine
Create a model to predict whether a patient has breast cancer based on tumor features

Mahsa Mir Jul 8, 2020 · 7 min read

Photo by Peter Boccia on Unsplash

In this tutorial, we’re going to create a model to predict whether a patient has a positive
breast cancer diagnosis based on several tumor features.

Problem Statement
The breast cancer database is a publicly available dataset from the UCI Machine Learning Repository. It gives information on tumor features such as tumor size, density, and texture.

Goal: To create a classification model that predicts whether a cancer diagnosis is benign or malignant based on several features.

Data used: Kaggle-Breast Cancer Prediction Dataset

Step 1: Exploring the Dataset


First, let’s understand our dataset:

#import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#import models from the scikit-learn module
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.svm import SVC

#import data
df_cancer = pd.read_csv('Breast_cancer_data.csv')
df_cancer.head()

#get some information about our dataset
df_cancer.info()
df_cancer.describe()


#visualizing data
sns.pairplot(df_cancer, hue='diagnosis')

plt.figure(figsize=(7, 7))
sns.heatmap(df_cancer['mean_radius mean_texture mean_perimeter mean_area mean_smoothness diagnosis'.split()].corr(), annot=True)

sns.scatterplot(x='mean_texture', y='mean_perimeter', hue='diagnosis', data=df_cancer)

[Output: first rows of Breast_cancer_data]

[Output: df_cancer.info() and df_cancer.describe()]

[Figure: pairplot of features from Breast_cancer_data]

[Figure: heatmap of the correlation between features]

#visualizing feature correlations
palette = {0: 'orange', 1: 'blue'}
edgecolor = 'grey'

fig = plt.figure(figsize=(12, 12))

plt.subplot(221)
ax1 = sns.scatterplot(x='mean_radius', y='mean_texture', hue='diagnosis',
                      data=df_cancer, palette=palette, edgecolor=edgecolor)
plt.title('mean_radius vs mean_texture')

plt.subplot(222)
ax2 = sns.scatterplot(x='mean_radius', y='mean_perimeter', hue='diagnosis',
                      data=df_cancer, palette=palette, edgecolor=edgecolor)
plt.title('mean_radius vs mean_perimeter')

plt.subplot(223)
ax3 = sns.scatterplot(x='mean_radius', y='mean_area', hue='diagnosis',
                      data=df_cancer, palette=palette, edgecolor=edgecolor)
plt.title('mean_radius vs mean_area')

plt.subplot(224)
ax4 = sns.scatterplot(x='mean_radius', y='mean_smoothness', hue='diagnosis',
                      data=df_cancer, palette=palette, edgecolor=edgecolor)
plt.title('mean_radius vs mean_smoothness')

fig.suptitle('Features Correlation', fontsize=20)
plt.savefig('2')
plt.show()


Step 2: Handling of Missing/Categorical Data


Before applying any method, we need to check if any values are missing and then
deal with them if so. In this dataset, there are no missing values — but always keep
the habit of checking for null values in a dataset!

Since machine learning models are based on mathematical equations, we need to encode the categorical variables. Here I used label encoding, since we have two distinct values in the “diagnosis” column:

#check how many values are missing (NaN) - here we do not have any missing values
df_cancer.isnull().sum()

#handling categorical data
df_cancer['diagnosis'].unique()
df_cancer['diagnosis'] = df_cancer['diagnosis'].map({'benign': 0, 'malignant': 1})
df_cancer.head()
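As a side note, scikit-learn’s LabelEncoder is a common alternative to the map call above; a minimal sketch, assuming the “diagnosis” column holds the strings 'benign'/'malignant' (run one approach or the other, not both):

#equivalent label encoding via scikit-learn; classes are sorted
#alphabetically, so 'benign' -> 0 and 'malignant' -> 1, as above
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df_cancer['diagnosis'] = le.fit_transform(df_cancer['diagnosis'])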

Let’s keep exploring our dataset:


#visualizing the diagnosis column >>> 'benign': 0, 'malignant': 1
sns.countplot(x='diagnosis', data=df_cancer)
plt.title('number of Benign_0 vs Malignant_1')

#correlation between features and target
df_cancer.corr()['diagnosis'][:-1].sort_values().plot(kind='bar')
plt.title('Corr. between features and target')

[Figure: count plot of diagnosis classes and feature-target correlations]

Step 3: Splitting the Dataset into a Training Set and a Test Set
The data is divided into a train set and a test set. We use the train set to let the algorithm learn the data’s behavior and then check the accuracy of our model on the test set.

Features (X): the columns that are fed into our model and used to make the predictions.

Target (y): the variable that will be predicted from the features.


#define the X variables and our target (y)
X = df_cancer.drop(['diagnosis'], axis=1).values
y = df_cancer['diagnosis'].values

#split into Train and Test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)
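One optional tweak, not used in the original split: passing stratify=y keeps the benign/malignant proportions identical in the train and test sets, which helps when the classes are imbalanced. Note that re-splitting this way would change the exact counts reported below; a sketch:

#stratified variant of the same split (alternative, not the split used here)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=101, stratify=y)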

Step 4: Data Modeling - Support Vector Machine

Support Vector Machine (SVM) is one of the most useful supervised ML algorithms. It can be used for both classification and regression tasks.

There are a couple of concepts we first need to understand:

What is the SVM’s job? The SVM chooses the hyperplane that maximizes the separation between classes.


What are hard and soft margins? If the data is linearly separable, the SVM can separate the classes perfectly (hard margin). When the data is not linearly separable, all we need to do is relax the margin to allow some misclassifications (soft margin).

What is the hyper-parameter C? The number of misclassification errors can be controlled using the C parameter, which has a direct effect on the hyperplane.

What is the hyper-parameter gamma? Gamma gives weight to the points close to the support vectors. In other words, changing the value of gamma changes the shape of the decision boundary.

What is the kernel trick? If our data is not linearly separable, we can apply the “kernel trick” method, which maps the nonlinear data into a higher-dimensional space where it becomes separable.
Now let’s get back to our code!

#Support Vector Classification model
from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(X_train, y_train)

Step 5: Model Evaluation

from sklearn.metrics import classification_report, confusion_matrix

y_predict = svc_model.predict(X_test)
cm = confusion_matrix(y_test, y_predict)
sns.heatmap(cm, annot=True)
print(classification_report(y_test, y_predict))

What does the confusion_matrix result mean?

We had 143 women in our test set.

Of the 55 women predicted not to have breast cancer, 2 actually had it (false negatives, i.e. Type II errors).

Of the 88 women predicted to have breast cancer, 14 did not actually have it (false positives, i.e. Type I errors).

What does this classification report result mean? Basically, it means that the SVM model was able to classify tumors into malignant and benign with 89% accuracy.
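For a binary problem, scikit-learn lays the confusion matrix out as [[TN, FP], [FN, TP]], so these counts can be read off programmatically; a quick sketch:

#unpack the binary confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()
print('false positives (Type I): ', fp)
print('false negatives (Type II):', fn)
print('accuracy:', (tp + tn) / (tp + tn + fp + fn))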

Note:

Precision is the fraction of predicted positive results that are actually positive.

Recall is the fraction of all actual positives that were correctly classified.

F1-score is the harmonic mean of precision and recall, ranging from 0 (terrible) to 1 (perfection).
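Using the counts unpacked above, all three metrics are one line of arithmetic each:

#precision, recall and F1 for the positive (malignant) class
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)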

Step 6: What Can We Do to Improve Our Model?

1. Data normalization
Feature scaling helps us see all the variables through the same lens (on the same scale); here we bring all values into the range [0, 1]:


#normalize the data: fit & transform on train, transform only on test
from sklearn.preprocessing import MinMaxScaler

n_scaler = MinMaxScaler()
X_train_scaled = n_scaler.fit_transform(X_train.astype(float))
X_test_scaled = n_scaler.transform(X_test.astype(float))

#Support Vector Classification model - applied to the scaled data
from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(X_train_scaled, y_train)

from sklearn.metrics import classification_report, confusion_matrix

y_predict_scaled = svc_model.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_predict_scaled)
sns.heatmap(cm, annot=True)
print(classification_report(y_test, y_predict_scaled))
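As a sanity check (a sketch, not in the original), the scaler’s output can be reproduced by hand, since min-max scaling simply maps each feature x to (x - x_min) / (x_max - x_min), computed column-wise on the training data:

#manual min-max scaling should match MinMaxScaler on the train set
#(assumes no constant columns, which would give a zero denominator)
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual = (X_train - col_min) / (col_max - col_min)
print(np.allclose(manual, X_train_scaled))  # expected: True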

What does the confusion matrix result mean?

We had 143 women in our test set.

Of the 55 women predicted not to have breast cancer, 4 actually had it (false negatives, i.e. Type II errors).

Of the 88 women predicted to have breast cancer, 7 did not actually have it (false positives, i.e. Type I errors).

[Figure: result for the SVC model on the scaled dataset]


What does this classification report result mean? Basically, it means that the SVM model was able to classify tumors into malignant/benign with 92% accuracy.

2. SVM parameter optimization

C parameter: as we said, it controls the cost of misclassification on the train data. (A short sketch illustrating this trade-off follows the gamma notes below.)

Smaller C: lower variance but higher bias (soft margin); misclassification is penalized less.

Larger C: lower bias but higher variance (hard margin); misclassification is penalized more heavily.

Gamma:

Smaller gamma: each training point has a far reach, giving a smoother, more generalized decision boundary (higher bias, lower variance).

Larger gamma: each point has a close reach, so closer data points carry a higher weight, giving a more flexible boundary (lower bias, higher variance).
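Here is the promised sketch of the C trade-off (an illustration I added, not from the original article): as C grows, the margin hardens, which typically shows up as fewer support vectors on the scaled training data:

#larger C -> harder margin -> typically fewer support vectors
for C in [0.01, 1, 100]:
    model = SVC(C=C, kernel='rbf').fit(X_train_scaled, y_train)
    print(f'C={C}: {model.n_support_.sum()} support vectors')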

So, let’s find the optimal parameters for our model using grid search:

#find the best hyperparameters
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}

grid = GridSearchCV(SVC(), param_grid, verbose=4)
grid.fit(X_train_scaled, y_train)

grid.best_params_
grid.best_estimator_

grid_predictions = grid.predict(X_test_scaled)
cmG = confusion_matrix(y_test, grid_predictions)
sns.heatmap(cmG, annot=True)
print(classification_report(y_test, grid_predictions))
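One refinement worth considering (a sketch, not part of the original tutorial): wrapping the scaler and the classifier in a Pipeline lets GridSearchCV refit the scaler inside every cross-validation fold, so no information leaks from the validation folds into the scaling:

#scaler + SVC as one estimator; grid keys are prefixed by the step name
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline([('scaler', MinMaxScaler()), ('svc', SVC())])
pipe_grid = {'svc__C': [0.1, 1, 10, 100, 1000],
             'svc__gamma': [1, 0.1, 0.01, 0.001, 0.0001],
             'svc__kernel': ['rbf']}
grid_pipe = GridSearchCV(pipe, pipe_grid, cv=5)
grid_pipe.fit(X_train, y_train)  # raw X_train; the pipeline scales internally
print(grid_pipe.best_params_, grid_pipe.score(X_test, y_test))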


[Figure: result for the SVC model (scaled data + optimal parameters)]

As you can see, in this case the last model improvement did not increase the overall accuracy; however, we did succeed in reducing the number of false positives (Type I errors).

I hope this has helped you to understand the topic better. Any feedback is welcome as it
allows me to get new insights and correct any mistakes!
