Download as pdf or txt
Download as pdf or txt
You are on page 1of 5

2021 International Conference on

Artificial Intelligence (ICAI)


Islamabad, Pakistan, April 05-07, 2021

Malignant and Benign Breast Cancer Classification


using Machine Learning Algorithms
Sharmin Ara Annesha Das Ashim Dey
Department o f CSE, CUET Department o f CSE, CUET Department o f CSE, CUET
Chittagong-4349, Bangladesh Chittagong-4349, Bangladesh Chittagong-4349, Bangladesh
2021 International Conference on Artificial Intelligence (ICAI) | 978-1-6654-3293-1/20/$31.00 ©2021 IEEE | DOI: 10.1109/ICAI52203.2021.9445249

u1604044@student.cuet.ac.bd annesha@cuet.ac.bd ashim@cuet.ac.bd

Abstract—At the moment, the most prevalent form of cancer To determine breast cancer, patients are frequently subjected
diagnosed in women across the globe is breast cancer. It develops to a barrage of examinations which include ultrasound, biopsy
in the breast tissue and is one of the most frequent causes of and mammography according to the varying nature of breast
women s death. This cancer can be cured if it is diagnosed
at preliminary stage. Malignant and benign are two types of cancer symptoms. Of these methods, the most indicative is
tumor found in case of breast cancer. Malignant tumors are biopsy that involves the extraction of sample tissues or cells
deadly as their rate of growth is much higher than benign for investigation. The sample of cells is collected from a
tumors. So, early identification of tumor type is pivotal for the breast through Fine Needle Aspiration (FNA) procedure and
appropriate treatment of a patient having breast cancer. In this then sent for analysis under a microscope to a pathology
work, Wisconsin Breast Cancer Dataset has been used which
was collected from UCI repository. Our goal is to analyze the laboratory [2]. Numerical features such as radius, texture,
dataset and evaluate the performance of various machine learning perimeter and area can be calculated from microscopic images
algorithms for predicting breast cancer. Here, Support Vector of cells and tissues. Data subsequently obtained from FNA
Machine, Logistic Regression, K-Nearest Neighbors, Decision are analyzed to predict the probability of the patient having
Tree, Naive Bayes and Random Forest classifiers have been malignant tumor in combination with various imaging data.
implemented for classifying tumors into benign and malignant.
The accuracy of each algorithm is calculated and compared to The prophesy and probability of survival can be remarkably
find the most suitable one. Based on the analysis, Random Forest enhanced by early diagnosis of BC, as it allows patients to re­
and Support Vector Machine outperform other classifiers with ceive timely clinical treatment. The proper identification of BC
accuracy of 96.5%. These classifiers can be used to build an and patient categorization into malignant or benign classes is
automatic diagnostic system for preliminary diagnosis of breast an extremely significant avenue of research. Various methods
cancer.
Keywords—Analysis of WBCD Dataset, Malignant, Benign, for predicting breast cancer have been established in recent
Classification, Breast Cancer Diagnosis, Machine Learning Al­ years. Classification techniques for instance, Random Forest
gorithmsI. (RF), Support Vector Machine (SVM), Adaboost Classifier, K-
Nearest Neighbors (KNN) and XGboost classifier have been
I. I NTRODUCTION used in the recent literature [3].
In this work, Wisconsin Breast Cancer Dataset (WBCD) of
A form of cancer which originates in breast is known as the FNA biopsy system has been used and different machine
breast cancer (BC). Uncontrolled division or expansion of cells learning (ML) classifiers have been implemented to determine
initiates cancer. Usually, breast cancer cells create a tumor and the form of breast cancer in a suspected patient. Six classifi­
can be detected on an X-ray. Among women, breast cancer has cation models were used including Random Forest, Logistic
become one of the most familiar illnesses resulting in death. Regression, Decision Tree, Naive Bayes, SVM and KNN. In
As one of the most regular cancers in women, breast cancer order to discover the most suited model for predicting breast
always has shown extremely high occurrence and death rate cancer, the results acquired are then evaluated to compare the
affecting about 10% women at some stages of their lives. After algorithms. The main objective of our paper are:
lung cancer, it is the 2nd biggest reason behind female’s death.
• To analyze the WBCD dataset for finding relation be­
Among all cancers, breast cancer accounts for 25% together
tween the features.
with 12% of all new cases in women [1]. It is possible to detect
• To apply different established classifiers on the WBCD
breast cancer by classifying tumors. Malignant and benign
dataset for comparing them.
are two separate types of tumor as found in breast cancer
• To find most satisfactory approach supporting the dataset
cases. Malignant tumors spread at a higher rate compared to
with good prediction accuracy.
benign. To differentiate between these tumors, doctors need a
reliable diagnostic technique. But it is usually very difficult The remainder of this paper is outlined as follows: literature
for the tumors to be distinguished even by the specialists. So, review is presented in Section II. The overall methodology is
a reliable automatic diagnostic system is direly required for illustrated in Section III. Section IV represents the obtained
the diagnosis of tumor type. results. At last, Section V concludes the entire work.

978-1-6654-3293-1/21/$31.00 ©2021 IEEE

97

Authorized licensed use limited to: Univ of Calif Santa Barbara. Downloaded on July 01,2021 at 00:46:24 UTC from IEEE Xplore. Restrictions apply.
II. L i t er a t u r e Re v ie w Our overall methodology can be presented in following
In the previous works, many models have been proposed subsections-
which use different feature sets and methods of machine A. Dataset Description
learning to diagnose breast cancer. The scarcity of large B. Dataset Analysis
datasets and inequality between negative and positive classes C. Training and Testing
are the main challenges in the research area of breast cancer A. Dataset Description
prediction.
Dr. William H. Wolberg of the University of Wisconsin Hos-
In [1], the prime goal of the analysis is to detect the
pital in Madison, Wisconsin, USA has developed the WBCD
algorithm that operates quicker, more reliably, and more effec-
dataset used for this paper which is publicly accessible. This
tively in breast cancer prediction. With a precision of 99.76%,
dataset includes 357 and 212 cases of benign and malignant
Random Forest surpasses all other algorithms. In [2], [4], [5],
breast cancer respectively as shown in Fig. 1.
authors have performed comparative analysis of the precise
breast cancer prediction offered by existing ML algorithms.
The dataset used in these papers is WBCD. In [6], authors
have used Random Forest classifier for the identification and
prediction of breast cancer to determine whether or not the
person has breast cancer. This offers the highest identification
accuracy since both classification and regression approaches
are used for Random Forest algorithm. In [7], a study on
breast cancer was provided by the authors to develop pre-
dictive models for breast cancer survival. In this paper, three
breast cancer survivability prediction models were applied
to two classes: benign and malignant cancer. In [8], authors
highlighted all previous research on ML algorithms used for
breast cancer determination. They suggested that the problem Fig. 1. Class distribution.
of limited available dataset can be solved by data augmentation
techniques. In [9], authors presented a technique that can be The dataset comprises 32 columns, with the ID number
used to detect and identify cell morphology in automated being the first column and the diagnosis outcome (0-benign
systems that carry out the classification using computer-aided and 1-malignant) being the second column. The rest of the
mammogram image features. In [10], authors have compared columns (3-32) contain three measurements (mean, standard
various classification and clustering algorithms in the survey. deviation, and mean of worst) of ten features. These features
The result shows that the algorithms for classification are represent the shape and size of the target cancer cell nucleus.
better predictors than the clustering algorithms. The sample of cells is collected from a breast through Fine
In [11], the method of automatic detection of anomalies Needle Aspiration (FNA) procedure in biopsy test. For each
in mammograms is discussed. Applying the fuzzy-C-means cell nucleus, these features are determined by analyzing under
and thresholding strategy, suspect region-of-interest (ROI) was a microscope in a pathology laboratory. All values of the
segmented. The proposed algorithm for the Mini-MIAS dataset features are stored up to four significant digits. There were
was validated. They concluded that the performance of suspi- no null entries in the dataset. The 10 real-value features are
cious region detection in mammograms can be improved by described in Table I.
subtracting preprocessed enhanced and preprocessed enhanced
B. Dataset Analysis
inverted images. In [12], for the identification of the malignant
and benign state, an algorithm was proposed by the authors For the dataset analysis, the whole dataset has been con-
depending on a fuzzy inference system. Comparison of con- sidered. In Fig. 2, the mean radius feature of the dataset
ventional performance criteria such as sensitivity, accuracy and is counter-plotted. From the figure, it can be observed that
specificity suggests that their introduced solution outperforms suspected patients not bearing cancer have a mean radius of
Artificial Neural Network (ANN) and SVM classification. around 1, whereas suspected patients bearing cancer have a
In [13], different deep learning concepts related to breast mean radius of more than 1.
mammogram analysis are reviewed and contributions to this Now, in Fig. 3, the correlation among the features of the
area are summarized. This work summarizes the past of WBCD dataset is shown in a heatmap. Correlation heatmap
mammogram research, recent advances, and the current state shows a 2D correlation matrix between two discrete dimen-
of the art. sions where the first dimension value is considered as a row
and the second dimension value as a column of the heatmap.
III. M e t h o d o l o g y
In this heatmap, the colored cells in a monochromatic scale are
We have configured a series of steps to come up with the used to show the resultant correlation between the features of
most reliable results in order to determine whether stage of the dataset. Increasing intensity of color represents increasing
the tumor is malignant (cancerous) or benign (non-cancerous). correlation. The value of the color of the cells is proportional

98

Authorized licensed use limited to: Univ of Calif Santa Barbara. Downloaded on July 01,2021 at 00:46:24 UTC from IEEE Xplore. Restrictions apply.
Fig. 2. Mean radius feature of WBCD dataset.

TABLE I
Fe a t u r e De s c r i p t io n

Feature Name Feature Description


Radius Average of distances from center to circumference
points.
Texture Standard deviation (SD) of gray-scale value.
Perimeter Gross distance between the snake points.
Area Total number of pixels on the inside of the snake
along with one half of the pixels in the circumfer-
ence.
Smoothness Local variance in length of radius, quantified by
calculating the length difference.
Compactness Perimeter2/A rea .
Concavity Intensity of the contour’s concave parts.
Concave points The number of contour concavities.
Symmetry The difference in length between lines perpendic-
ular to the major axis in both directions to the cell
boundary.
Fractal Coastline estimation. A higher value leads to a
Dimension less normal contour representing a higher risk of
malignancy.

to the number of measurements that match the dimensional


values. The dimensional value (-1 to +1) is calculated from
the linearity between the pair of features. If both variables
vary and move in the same direction, positive correlation Fig. 3. Correlation between the different features.
is acquired. In case of negative correlation, increase in one
variable is associated with a decrease in the other and vice
The less correlated features can be removed as they have too
versa. From the figure, we can see how often one feature
less impact on the target. We have omitted these features for
affects all the other features in this heat map (e.g radius mean
enhancing the accuracy of the implemented classifiers.
has 32% influence on texture mean).
In Fig. 4, the correlation barplot represents the correlation C. Training and Testing
between the diagnosis outcome and every other feature of Initially, the dataset is read from the CSV file. The data
the dataset individually. Here, the ’smoothness_se’ feature is entries from the dataset are analyzed on the basis of their
highly negatively correlated with the diagnosis in the cor- features before they were used for further step. Then, we split
relation barplot. The ’fractal_dimension_mean’, ’texture_se’, the dataset into two portions randomly: training set (75%)
and ’symmetry_se’ features are associated very less negatively and testing set (25%). Not every feature within the dataset is
and other remaining features are highly positively correlated. useful and capable of giving same contribution to the result.

99

Authorized licensed use limited to: Univ of Calif Santa Barbara. Downloaded on July 01,2021 at 00:46:24 UTC from IEEE Xplore. Restrictions apply.
set test dataset. 25% of the entire dataset was included in the
test dataset. A confusion matrix is generated for the actual
and predicted result consisting of TP, FP, TN, and FN for
calculating accuracy for each algorithms used. Below, the
meaning of the terms is mentioned.
• TP = True Positive (Accurately Identified)
• TN = True Negative (Inaccurately Identified)
• FP = False Positive (Accurately Rejected)
• FN = False Negative (Inaccurately Rejected)
For example, confusion matrix for SVM is shown in Fig 6.
The assessment of used ML algorithms is carried out using
one of the metrics called accuracy. Accuracy is the prediction
fraction. It represents the ratio of the number of accurate
predictions over the gross number of predictions done by the
model as shown in (1).

Fig. 4. Correlation of the features with target.

According to the data analysis, we have done feature selection


to eliminate less correlated features which will increase the
accuracy. Then, the dataset is ready for the application of
the ML algorithms to examine their performance. After this
step, we have accomplished the performance analysis by
the comparative study of the resultant testing and training
accuracy. Fig. 5 depicts the overall workflow of the study.

D a ta s e t Lo a d D a t a A n a ly s is T r a in - T e s t S p lit

Fig. 6. Confusion matrix.

1 The formula of accuracy is:


P e r fo r m a n c e T ra in in g U sin g F e a tu re
A n a ly s is M L a lg o rith m s S e le c t io n

AcCUraCV = (T P + t N + FP + FN) (1)


The result presented in Table II demonstrates the obtained
Fig. 5. Overall workflow. accuracy of training and testing for SVM, Random Forest,
Naive Bayes, Decision Tree, KNN and Logistic Regression.
Machine learning is an automated approach to learn where
algorithms are programmed to gain experience from past TABLE II
datasets for predicting future. In this project, we have used Co m p a r Va r i o u s A l
is o n a m o n g g o r it h m s

the following ML algorithms: Algorithms Training Accuracy Testing Accuracy


• Logistic Regression. Logistic Regression 99.1% 94.4%
KNN 97.6% 95.8%
• Support Vector Machine (SVM). Decision Tree 100% 95.1%
• Random Forest. Naive Bayes 95.1% 92.3%
Random Forest 99.5% 96.5%
• Naive Bayes. SVM 98.8% 96.5%
• Decision Tree.
• K-Nearest Neighbors (KNN). We can see that the maximum accuracy is achieved using
Random Forest and SVM on the test set which is 96.5%.
IV. R ESULTS Their training accuracy is also higher than other algorithms.
In this section, after implementing ML algorithms, we have Our study shows that SVM is the preferable algorithm for
analyzed the performance of the algorithms on the dataset. prediction. It is known that SVM performs relatively good if
This is performed by executing the algorithms on the formerly clear separation between classes is present in the dataset and

100

Authorized licensed use limited to: Univ of Calif Santa Barbara. Downloaded on July 01,2021 at 00:46:24 UTC from IEEE Xplore. Restrictions apply.
the dataset is high dimensional. Random forest also gave better [12] F.-T. Johra and M. M. H. Shuvo, “Detection of breast cancer from
result alongside SVM. Generally, random forest obtains higher histopathology image and classifying benign and malignant state us-
ing fuzzy logic,” in 2016 3rd International Conference on Electrical
accuracy without any normalization of dataset values. Engineering and Information Communication Technology (ICEEICT).
IEEE, 2016, pp. 1-5.
[13] O. V. Singh and P. Choudhary, “A study on convolution neural network
V. C o n c l u s io n for breast cancer detection,” in 2019 Second International Conference
on Advanced Computational and Communication Paradigms (ICACCP).
Now-a-days, one of the deadly diseases affecting women IEEE, 2019, pp. 1-7.
is breast cancer. In our work, the Wisconsin Breast Cancer
Dataset was utilized and several ML algorithms were applied
to assimilate the efficacy and usefulness of these algorithms to
find the highest accuracy of classifying malignant and benign
breast cancer. The correlation between different features of the
dataset has been analyzed for feature selection. The results
will assist to pick the best ML algorithm for the construction
of an automatic breast cancer diagnostic system. From our
study, we can conclude that SVM and Random Forest give the
maximum accuracy with an accuracy of 96.5%. We will try
to strengthen our work in future by handling a comparatively
large dataset and incorporating some more functions such as
breast cancer phase detection and so on. We hope that this
study will contribute in the clinical application of breast cancer
treatment.

Re f e r en c es

[1] J. Sivapriya, A. Kumar, S. Siddarth Sai, and S. Sriram, “Breast cancer


prediction using machine learning,” International Journal of Recent
Technology and Engineering (IJRTE), vol. 8, 2019.
[2] Y. Khourdifi and M. Bahaj, “Applying best machine learning algorithms
for breast cancer prediction and classification,” in 2018 International
Conference on Electronics, Control, Optimization and Computer Science
(ICECOCS). IEEE, 2018, pp. 1-5.
[3] N. K. Sinha, M. Khulal, M. Gurung, and A. Lal, “Developing a web
based system for breast cancer prediction using xgboost classifier,”
International Journal o f Engineering Research Technology (IJERT),
vol. 9, 2020.
[4] R. Dhanya, I. R. Paul, S. S. Akula, M. Sivakumar, and J. J. Nair, “A
comparative study for breast cancer prediction using machine learning
and feature selection,” in 2019 International Conference on Intelligent
Computing and Control Systems (ICCS). IEEE, 2019, pp. 1049-1055.
[5] M. M. Islam, H. Iqbal, M. R. Haque, and M. K. Hasan, “Prediction of
breast cancer using support vector machine and k-nearest neighbors,”
in 2017 IEEE Region 10 Humanitarian Technology Conference (R10-
HTC). IEEE, 2017, pp. 226-229.
[6] M. S. Yarabarla, L. K. Ravi, and A. Sivasangari, “Breast cancer
prediction via machine learning,” in 2019 3rd International Conference
on Trends in Electronics and Informatics (ICOEI). IEEE, 2019, pp.
121-124.
[7] V. Chaurasia, S. Pal, and B. Tiwari, “Prediction of benign and malignant
breast cancer using data mining techniques,” Journal of Algorithms &
Computational Technology, vol. 12, no. 2, pp. 119-126, 2018.
[8] N. Fatima, L. Liu, S. Hong, and H. Ahmed, “Prediction of breast cancer,
comparative review of machine learning techniques, and their analysis,”
IEEE Access, vol. 8, pp. 150360-150376, 2020.
[9] A. Toprak, “Extreme learning machine (elm)-based classification of
benign and malignant cells in breast cancer,” Medical science monitor:
international medical journal of experimental and clinical research,
vol. 24, p. 6537, 2018.
[10] D. S. Jacob, R. Viswan, V. Manju, L. PadmaSuresh, and S. Raj, “A
survey on breast cancer prediction using data miningtechniques,” in 2018
Conference on Emerging Devices and Smart Systems (ICEDSS). IEEE,
2018, pp. 256-258.
[11] K. L. Kashyap, M. K. Bajpai, and P. Khanna, “Breast cancer detection
in digital mammograms,” in 2015 IEEE international conference on
imaging systems and techniques (IST). IEEE, 2015, pp. 1-6.

101

Authorized licensed use limited to: Univ of Calif Santa Barbara. Downloaded on July 01,2021 at 00:46:24 UTC from IEEE Xplore. Restrictions apply.

You might also like