
Machine Learning for Breast Cancer Diagnosis
A Proof of Concept
P. K. SHARMA
Email: from_pramod@yahoo.com
Introduction

 Machine learning is a branch of data science that incorporates a large set of statistical techniques.
 These techniques enable data scientists to build models that learn from past data and detect patterns in massive, noisy and complex data sets.
 Researchers use machine learning for cancer prediction and prognosis.
 Machine learning allows inferences or decisions that cannot be made using conventional statistical methodologies.
 With a robustly validated machine learning model, the chances of a correct diagnosis improve.
 It is especially helpful in interpreting results for borderline cases.
Breast Cancer: An overview

 The most common cancer in women worldwide.
 The principal cause of death from cancer among women globally.
 Early detection is the most effective way to reduce breast cancer deaths.
 Early diagnosis requires an accurate and reliable procedure to distinguish benign breast tumors from malignant ones.
 Breast cancer types - there are three types of breast tumors: benign breast tumors, in-situ cancers, and invasive cancers.
 The majority of breast tumors detected by mammography are benign.
 They are non-cancerous growths and cannot spread outside of the breast to other organs.
 In some cases, it is difficult to distinguish certain benign masses from malignant lesions with mammography.
 If the malignant cells have not gone through the basal membrane but are completely contained in the lobule or the ducts, the cancer is called in-situ or noninvasive.
 If the cancer has broken through the basal membrane and spread into the surrounding tissue, it is called invasive.
 This analysis assists in differentiating between benign and malignant tumors.
Data Source
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
 The data used for this POC is from the University of Wisconsin.
 Citation: This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg.
 Reference:
o O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp. 1 & 18.
o William H. Wolberg and O. L. Mangasarian: "Multisurface method of pattern separation for medical diagnosis applied to breast cytology", Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp. 9193-9196.
o O. L. Mangasarian, R. Setiono, and W. H. Wolberg: "Pattern recognition via linear programming: Theory and application to medical diagnosis", in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1990, pp. 22-30.
o K. P. Bennett and O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, pp. 23-34 (Gordon & Breach Science Publishers).
Data Files

Data File Name                 Description File Name                                            # of records   # of attributes
breast-cancer-wisconsin.data   breast-cancer-wisconsin.names                                    699            11
unformatted-data               Data file with comments based on breast-cancer-wisconsin.data   699            11
wdbc.data                      wdbc.names                                                       569            32
wpbc.data                      wpbc.names                                                       198            34

In this case study, let's analyze breast-cancer-wisconsin.data and wdbc.data.


Data Sets

The data is in CSV format without any column headers. Columns are interpreted from the associated “names”
files.
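As an illustration (not part of the original slides), here is a minimal sketch of how a headerless data file could be loaded with pandas, supplying column names taken from the associated "names" file; the snake_case column names are an assumption chosen for readability.

```python
import pandas as pd

# Column names for breast-cancer-wisconsin.data, based on its ".names" file
BCW_COLUMNS = [
    "sample_code_number", "clump_thickness", "uniformity_of_cell_size",
    "uniformity_of_cell_shape", "marginal_adhesion", "single_epithelial_cell_size",
    "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitoses", "class",
]

# The file has no header row, so the names are supplied explicitly
bcw = pd.read_csv("breast-cancer-wisconsin.data", header=None, names=BCW_COLUMNS)

print(bcw.shape)                     # expected: (699, 11)
print(bcw["class"].value_counts())   # 2 = benign, 4 = malignant
```

wdbc.data is loaded the same way, with the 32 column names described later in this deck.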
Flow of Data

Biopsy Procedure → Measurements → Reports → Evaluation → Diagnosis
Analysis of measurements → Preparation of ML Models → Predictions and validation

Analysis

 Input Files: wdbc.data, breast-cancer-wisconsin.data
 Data Preparation: address missing data; training, testing and validation data
 Classifier Parameters: min_samples_leaf, n_estimators, min_samples_split, max_features
 Lab Setup: RandomForestClassifier, StratifiedKFold, GridSearchCV, train_test_split, learning_curve, pyplot, interp
 Libraries / Components: SciPy, Pandas, scikit-learn, NumPy, IPython, Matplotlib, seaborn
 Outputs: trained classifier, predictions
 Environment: Python on Linux
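As a rough illustration, the lab setup above could be wired together as in the sketch below. This is a minimal sketch only: X and y are assumed to be the prepared feature matrix and labels, and the parameter grid values are illustrative rather than the ones actually used in this POC.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# X, y are assumed to be the prepared features and labels (see data preparation)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold stratified cross-validation, as described in the results slides
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Grid over the classifier parameters listed above (values are illustrative)
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", "log2"],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=cv, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)

print(search.best_params_)
print("Test accuracy:", search.best_estimator_.score(X_test, y_test))
```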
Data Visualization
Data Description : wdbc.data

 Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
 They describe characteristics of the cell nuclei present in the image.
 The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features.
 For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
 All feature values are recoded with four significant digits.

Attribute information:
1. ID number
2. Diagnosis (M = malignant, B = benign)
3-32. Ten real-valued features are computed for each cell nucleus:
   a) radius (mean of distances from center to points on the perimeter)
   b) texture (standard deviation of gray-scale values)
   c) perimeter
   d) area
   e) smoothness (local variation in radius lengths)
   f) compactness (perimeter^2 / area - 1.0)
   g) concavity (severity of concave portions of the contour)
   h) concave points (number of concave portions of the contour)
   i) symmetry
   j) fractal dimension ("coastline approximation" - 1)
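Given the mean / SE / worst ordering above, the 32 wdbc.data columns can be named programmatically. A minimal sketch, using snake_case names of my own choosing (not from the original slides):

```python
import pandas as pd

BASE_FEATURES = ["radius", "texture", "perimeter", "area", "smoothness",
                 "compactness", "concavity", "concave_points", "symmetry",
                 "fractal_dimension"]

# Fields 3-12 hold the means, 13-22 the standard errors, 23-32 the "worst" values
columns = ["id", "diagnosis"] + [
    f"{stat}_{feat}" for stat in ("mean", "se", "worst") for feat in BASE_FEATURES
]

wdbc = pd.read_csv("wdbc.data", header=None, names=columns)

X = wdbc.drop(columns=["id", "diagnosis"])
y = (wdbc["diagnosis"] == "M").astype(int)   # 1 = malignant, 0 = benign

print(wdbc.shape)   # expected: (569, 32)
```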
wdbc.data

 Mean Radius, Mean Perimeter and Mean Area appear to be helpful in classification.
 The higher the value of each parameter, the higher the chance of the case being malignant.
wdbc.data

 Mean Concavity, Mean Concave Points, and Mean Compactness appear to be helpful in classification.
 The higher the value of each parameter, the higher the chance of the case being malignant.
wdbc.data

 Mean Smoothness, Mean Texture, Mean Fractal Dimension, Mean Symmetry and Mean Compactness do not appear to influence classification.
 Both types of cases are spread across the range of these features.
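A sketch of the kind of per-feature visualization behind these observations, assuming the wdbc DataFrame from the earlier loading sketch; seaborn's histplot is used here, though the original plots may have been produced differently.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Overlay class-wise distributions for a few mean features
features = ["mean_radius", "mean_perimeter", "mean_area",
            "mean_smoothness", "mean_texture"]

fig, axes = plt.subplots(1, len(features), figsize=(20, 3))
for ax, feat in zip(axes, features):
    sns.histplot(data=wdbc, x=feat, hue="diagnosis", element="step", ax=ax)
    ax.set_title(feat)
plt.tight_layout()
plt.show()
```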
Data Description : breast-cancer-wisconsin.data

 Missing attribute values: 16
 There are 16 instances in Groups 1 to 6 that contain a single missing (i.e., unavailable) attribute value, now denoted by "?".

#    Attribute                      Domain
1.   Sample code number             id number
2.   Clump Thickness                1 - 10
3.   Uniformity of Cell Size        1 - 10
4.   Uniformity of Cell Shape       1 - 10
5.   Marginal Adhesion              1 - 10
6.   Single Epithelial Cell Size    1 - 10
7.   Bare Nuclei                    1 - 10
8.   Bland Chromatin                1 - 10
9.   Normal Nucleoli                1 - 10
10.  Mitoses                        1 - 10
11.  Class                          2 for benign, 4 for malignant
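The "Address missing data" preparation step could, for example, be handled as sketched below. The slides do not say which strategy was actually used, so both dropping and median imputation are shown as options; bcw is assumed to be the DataFrame from the earlier loading sketch.

```python
import numpy as np
import pandas as pd

# bcw is assumed from the earlier loading sketch;
# the 16 missing Bare Nuclei values are recorded as "?"
bcw = bcw.replace("?", np.nan)
bcw["bare_nuclei"] = pd.to_numeric(bcw["bare_nuclei"])

print(bcw["bare_nuclei"].isna().sum())   # expected: 16

# Option 1: drop the 16 incomplete rows
bcw_complete = bcw.dropna()

# Option 2: impute missing values with the column median
bcw_imputed = bcw.fillna({"bare_nuclei": bcw["bare_nuclei"].median()})
```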
breast-cancer-wisconsin.data

 The features distinguish between benign and malignant cases fairly well.
breast-cancer-wisconsin.data

 The feature seems to distinguish between benign and malignant cases fairly well.
breast-cancer-wisconsin.data

 The feature seems to distinguish between benign and malignant cases fairly well.
Results
WDBC.DATA
Analysis: wdbc.data
 Training data is divided into 5 folds.
 Test data has 114 records.
 Accuracy Score: 0.9561
 A two-dimensional plot shows excellent separation of benign and malignant cases.
 The high accuracy supports the diagnosis.

Classification Report:
             Precision   Recall   f1-score   Support
0            0.96        0.97     0.97       71
1            0.95        0.93     0.94       43
avg / total  0.96        0.96     0.96       114

Confusion Matrix:
                 Predicted Benign   Predicted Malignant
True Benign      69                 2
True Malignant   3                  40

 The model performs equally well on both the training and test sets.
 Three cases, although malignant, are predicted as benign.
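A sketch of how these metrics could be produced with scikit-learn, assuming search, X_test and y_test from the earlier GridSearchCV sketch; the numbers above come from the original analysis, not from running this snippet.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# search, X_test, y_test are assumed from the earlier training sketch
best_model = search.best_estimator_
y_pred = best_model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["benign", "malignant"]))
print(confusion_matrix(y_test, y_pred))  # rows = true class, columns = predicted class
```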
Plotting three cases…
Factors influencing predictions.
Plotting three cases:
Factors having no influence on predictions…
Plotting two features at a time

 Also analyzed how well cases could be classified if only two of the features were available.
 The classifier was trained on two features at a time and the decision boundary was plotted (see the sketch after this list).
 The model could predict the cases with reasonable accuracy.
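A minimal sketch of a two-feature decision boundary plot, assuming the wdbc X and y from the earlier loading sketch; the feature pair and grid resolution are illustrative choices, not necessarily those used in the POC.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

# Two illustrative features; X, y assumed from the earlier wdbc loading sketch
pair = ["mean_radius", "mean_concave_points"]
X2 = X[pair].to_numpy()

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X2, y)

# Evaluate the classifier on a grid covering the two feature ranges
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min(), X2[:, 0].max(), 200),
    np.linspace(X2[:, 1].min(), X2[:, 1].max(), 200))
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3, cmap="coolwarm")         # decision regions
plt.scatter(X2[:, 0], X2[:, 1], c=y, cmap="coolwarm", s=10)  # actual cases
plt.xlabel(pair[0])
plt.ylabel(pair[1])
plt.show()
```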
Results
BREAST-CANCER-WISCONSIN.DATA
Analysis: breast-cancer-wisconsin.data

 Training data is divided into 5 folds.
 Test data has 140 records.
 Accuracy Score: 0.9643
 A two-dimensional plot shows excellent separation of benign and malignant cases.
 The high accuracy supports the diagnosis.

Classification Report:
             Precision   Recall   f1-score   Support
0            0.98        0.97     0.97       95
1            0.93        0.96     0.95       45
avg / total  0.96        0.96     0.96       140

Confusion Matrix:
                 Predicted Benign   Predicted Malignant
True Benign      92                 3
True Malignant   2                  43

 The model performs equally well on both the training and test data.
 Two cases, although malignant, are predicted as benign.
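One way to check the training vs. test behaviour noted above is a learning curve, using the learning_curve utility listed in the lab setup. A minimal sketch, with the features and labels assumed from the earlier data preparation sketches:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, learning_curve

# X, y are assumed to be the prepared features and labels
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    train_sizes=np.linspace(0.1, 1.0, 5), scoring="accuracy")

plt.plot(sizes, train_scores.mean(axis=1), "o-", label="training score")
plt.plot(sizes, val_scores.mean(axis=1), "o-", label="cross-validation score")
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```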
Plotting two cases…
Factors influencing predictions.
Plotting two features at a time

 The classifier was trained on two features at a time and the decision boundary was plotted.
 As expected, the classifier needs more than just two parameters to give accurate predictions.
