
Breast Cancer Detection using Machine Learning

Mayank Singh
Department of Computer Science
Galgotias University
Uttar Pradesh, India
mt029322@gmail.com

Mayank Singh
Department of Computer Science
Galgotias University
Uttar Pradesh, India
mayank77s77@gmail.com

Mr. Hradesh Kumar
Department of Computer Science
Galgotias University
Uttar Pradesh, India
hradesh.kumar@galgotiasuniversity.edu.in

Abstract—Breast cancer is one of the deadliest diseases that may affect women. There are numerous subcategories of breast cancer; the type of cancer is determined by which breast cells become malignant. It is very important to detect any disease in its early stages so that it can be treated more easily. Machine learning is one of the most effective tools for detecting breast cancer: it makes it possible to train models that predict the likelihood of breast cancer in females. Some of the most common machine learning classifiers for breast cancer detection are SVM, Naive Bayes, Logistic Regression, KNN, and Decision Tree. The purpose of this research is to find the machine learning technique that provides the highest accuracy for the detection of breast cancer among females. The accuracy of machine learning models can differ across datasets. The experiments in this research were run on datasets of large and small sizes, and a conclusion is drawn after analysing the accuracies of these techniques.

Index Terms—Machine Learning, Breast Cancer, Naive Bayes, SVM, KNN, Logistic Regression.

I. INTRODUCTION

Cancer is a condition in which the body's cells grow uncontrollably and form a mass of tissue generally known as a tumour. These cells divide much faster than healthy cells in the body. According to statistics, breast cancer is the most common cancer in women; one in every three females diagnosed with cancer is a breast cancer patient. A survey conducted by the WHO shows that in 2020, over 2.3 million women were diagnosed with breast cancer, and 685,000 deaths were reported globally. Despite advances in medical technology, it is still the second-leading cause of cancer deaths in women. Awareness of breast cancer is very low, which is why many people ignore it, and it costs them their lives. Breast cancer usually starts in the cells of the milk-producing ducts (invasive ductal carcinoma) and can spread throughout the breast. It can also begin in glandular tissue known as lobules (invasive lobular carcinoma) or in other cells or tissues in the breast.

Many factors can increase the chances of breast cancer in females; some of them are lifestyle, heredity, and ionizing radiation. Although it is still debated, some females with little or no exposure to the risk factors are affected, while others with direct exposure remain unaffected. It can affect any female, even if there is no family history of cancer gene inheritance. Only 5 to 10% of cases of breast cancer involve inheritance from one generation to another. Two of the most commonly inherited genes that cause breast cancer are breast cancer gene 1 (BRCA1) and breast cancer gene 2 (BRCA2). The percentage of patients diagnosed with breast cancer is increasing by 0.5% every year. It is important to be aware of the disease so that it can be detected easily and the necessary action taken to treat it.

A disease can be dangerous and even life-threatening if it is not diagnosed at an early stage. Most cancer patients lose their lives because the disease is not detected before it becomes incurable. An effective diagnostic technique is therefore one of the most important requirements for dealing with breast cancer among females, so that the doctor can plan the treatment accordingly. The traditional way of diagnosing breast cancer is time-consuming and requires a lot of human effort. Machine learning algorithms provide an efficient way to detect breast cancer with higher accuracy and reduce the chance of human error. With the help of machine learning, different models can be trained and results obtained in less time.

II. LITERATURE SURVEY

Machine learning is an efficient method for detecting breast cancer, as it uses various classifiers such as SVM, Decision Trees, KNN, and Naive Bayes to aid diagnosis. However, the question arises as to which classifier provides the most accurate results, given that the accuracy of the classifiers may vary. Numerous studies have been conducted in the past to identify the most precise and efficient machine learning algorithm. The purpose of all this research is to discover the most effective methods for early cancer detection. This section describes past machine learning-based breast cancer detection studies.

A research project named "Breast cancer detection using machine learning techniques" was carried out by Sweta Bhise, Simran Bepari, Shrutika Gadekar, and Deepmala Kale with the objective of determining which of several machine learning algorithms performs best. CNN was used as the classifier, and RFE was used for feature selection. The study also compared the method they employed with other machine learning algorithms, including SVM, Naive Bayes, and KNN. The primary purpose of their research was to differentiate between malignant and benign
tumours by using a convolutional neural network with a Keras backend, and the secondary objective was to study the data to identify how the model might be implemented in practice once the primary objective had been accomplished. According to their findings, CNN is superior to the other methodologies in terms of accuracy, precision, and the size of the data set.[1]

Habib Dhahri, Eslam Al Maghayreh, Awais Mahmood, Wail Elkilani, and Mohammed Faisal Nagi carried out a study titled "Automated Breast Cancer Diagnosis Based on Machine Learning Algorithms" to investigate automated breast cancer diagnosis using machine learning. Three separate experiments were conducted. In the first experiment, they showed that the three most commonly used evolutionary algorithms are capable of producing similar results given the appropriate conditions. In the second experiment, they examined the idea that combining different feature selection methods leads to improved accuracy. In the final experiment, they showed how a supervised machine learning classifier can be designed automatically. They used the GP method in an effort to address the hyperparameter issue, which is a challenge for machine learning algorithms.[2]

Kalyani Wadkar, Prashant Pathak, and Nikhil Wagh conducted an in-depth comparison of ANN and SVM and incorporated multiple classifiers such as CNN, KNN, and Inception V3 to enhance dataset processing. Their results and performance analysis showed that ANN was a superior classifier to SVM because it demonstrated a higher rate of effectiveness.[3]

In a study titled "Machine Learning Algorithms for Breast Cancer Prediction and Diagnosis," Mohammed Amine Naji, Sanaa El Filali, Kawtar Aarika, EL Habib Benlahmar, Rachida Ait Abdelouhahid, and Olivier Debauche found that the Support Vector Machine achieved an accuracy of 97.2%, a precision of 97.5%, and an AUC of 96.6%, outperforming the rest of the algorithms. The primary goal of the paper was to predict and diagnose breast cancer using machine learning algorithms and to determine the most successful one with regard to confusion matrix, accuracy, and precision.[4]

A research project titled "PSOWNNs-CNN" was carried out by Ashkan Nomani, Yasaman Ansari, Mohammad Hossein Nasirpour, Armin Masoumian, Ehsan Sadeghi Pour, and Amin Valizadeh. The purpose of the study was to investigate several methods for diagnosing breast cancer using artificial intelligence and computational imaging. Using machine learning methods, the authors propose an innovative approach to the detection of breast cancer, called the particle swarm optimised wavelet neural network (PSOWNN), which appears to be superior to existing methods such as CNN. In a comparison over 905 images, 98.6% of cases were correctly identified. They concluded that the specificity of PSOWNNs is 98.8%. In addition, PSOWNNs have a precision of 98.6%, which implies that, although a huge number of women are affected by breast cancer, only 830 (95.2%) are verified with a high level of certainty as having the illness.[5]

Sivapriya J., Aravind Kumar V., Siddarth Sai S., and Sriram S. compared SVM, logistic regression, naive Bayes, and random forest. The Wisconsin breast cancer dataset was used in the comparative analysis. According to their findings, the Random Forest algorithm has the highest accuracy (99.76%) and the lowest proportion of errors. The Anaconda Data Science Platform was used for all of the experiments.[6]

According to the study "Using Machine Learning Algorithms for Breast Cancer Risk Prediction and Diagnosis" by Hiba Asri, Hajar Mousannif, Hassan Al Moatassime, and Thomas Noel, the SVM algorithm delivers the highest accuracy, 97.13%, with fewer errors than KNN, NB, and C4.5 (decision tree). Finding an effective machine learning algorithm was the sole objective of that study.[7]

"Accuracy Assessment of Machine Learning Algorithms Used to Predict Breast Cancer," by Mohamed Ebrahim, Ahmed Ahmed Hesham, and Saleh Mesbah, presents a comprehensive and objective evaluation of several machine learning-based breast cancer prediction methods. For this research, the National Cancer Institute (NCI) in the United States provided access to its database of 1.7 million data records. Both standard and deep learning approaches were assessed for accuracy. The standard methods were decision tree (DT), linear discriminant (LD), logistic regression (LR), support vector machine (SVM), and ensemble techniques (ET). Deep learning techniques such as probabilistic neural networks (PNN), deep neural networks (DNN), and recurrent neural networks (RNN) were also examined and compared with one another. Additionally, the impact of feature selection on accuracy was investigated. The findings showed that decision trees and ensemble approaches performed better than the other methods, as both achieved an accuracy of 98.7%.[8]

The dataset that was generated by Dr. William H. Wolberg at the University of Wisconsin Hospital was used by Muhammet
Fatih Ak. Data visualisation and machine learning techniques such as logistic regression, k-nearest neighbours, support vector machines, naive Bayes, decision trees, random forests, and rotation forests were used in the analysis of this dataset. R, Minitab, and Python were chosen for use with these machine learning methods and visualisations. Each method's strengths and weaknesses were analysed. When all of the features were included in the logistic regression model, the classification accuracy was at its highest level (98.1%), and the recommended strategy demonstrated an improvement in the accuracy of its results.[9]

According to the study "Breast cancer detection using machine learning techniques," by Sarthak Vyas, Abhinav Chauhan, Deepak Rana, and Noman Ansari, the support vector machine (SVM) provides the highest accuracy among the machine learning algorithms considered. The study proposes that decision trees, artificial neural networks, K-nearest neighbours, and support vector machines be used to achieve a diagnosis of breast cancer that is both effective and accurate. It also observed that the error rate of SVM was very low compared with the other machine learning algorithms.[10]

K. Anastraj, Dr. T. Chakravarthy, and K. Sriram examined four machine learning methods using the original Wisconsin breast cancer datasets: the back propagation network, artificial neural network (ANN), convolutional neural network (CNN), and support vector machine (SVM). A deep convolutional neural network trained using ALEXNET was used both for extracting features and for assessing benign and malignant tumours. According to the simulation results, the support vector machine is the most successful method, since it achieved the best outcome (94%).[11]

The study titled "Automated breast cancer detection by reconstruction independent component analysis (RICA)-based hybrid features using machine learning paradigms" proposes an integrated approach to feature extraction. This approach is based on texture, morphology, scale invariant feature transform (SIFT), grey level co-occurrence matrix (GLCM), entropy, elliptic Fourier descriptors (EFDs), RICA, and sparse filtering techniques. Breast cancer detection was carried out with a number of machine learning strategies, such as support vector machines (SVM), decision trees (DT), k-nearest neighbour, and Naive Bayes classifiers. The RICA-based feature set with the SVM RBF kernel resulted in an overall accuracy of 94.88% and a ROC AUC value of 0.9914. All such studies aim to provide the best possible results so that the disease can be detected as early as possible.[12]

III. PROPOSED SYSTEM

The data set used throughout this research was obtained from the Kaggle website. To begin implementing the various machine learning algorithms, we first import several libraries such as NumPy, Pandas, and Matplotlib.

The next step is to import the data set and visualise, using bar graphs, the number of patients with malignant and benign tumours.

Feature selection
A crucial stage in constructing a machine learning model for breast cancer detection is the selection of features. It entails selecting the most pertinent and informative features (or variables) from the available data, which can enhance the model's precision and performance.
The feature selection method used in our research is RFE.

RFE, an abbreviation for recursive feature elimination, is a feature selection process that improves a model by removing the feature (or features) with the weakest predictive power until the target number of features is reached. RFE attempts to eliminate dependencies and collinearity by removing only a small number of features in each iteration. Features are ranked according to the coef_ or feature_importances_ attributes of the underlying model.

The next step is to use the confusion matrix to obtain the accuracies. A confusion matrix is a table that compares the predictions of a classification model with the actual outcomes and helps in visualising them. Choosing a model is a crucial aspect of machine learning. Machine learning algorithms are classified into supervised learning and unsupervised learning; our research requires only supervised learning. We applied each of the selected classifiers, recorded their accuracies, and visualised them with bar graphs to identify the most efficient one.

In the confusion matrix, TP = true positive, TN = true negative, FP = false positive, and FN = false negative.
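The four counts can be tallied directly from the true and predicted labels. The following is a minimal sketch in plain Python; the function name `confusion_counts`, the convention that label 1 means malignant, and the eight example labels are illustrative assumptions, not data from this study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally TP, TN, FP, FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


# Made-up labels for illustration only (1 = malignant, 0 = benign)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"TP={tp} TN={tn} FP={fp} FN={fn} accuracy={accuracy:.2f}")
```

In a medical setting the FN count deserves separate attention, since a false negative corresponds to a missed malignant case that overall accuracy alone does not reveal.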
Accuracy = (TP + TN)/(TP + TN + FP + FN).

Given below is the proposed architecture diagram of the system:

[Figure: proposed system architecture]

CLASSIFIERS

Support Vector Machine
A basic linear SVM classifier draws a straight line between two classes. Each data point on one side of the line belongs to one category, and the data points on the other side fall into the other category. There are infinitely many such lines to choose from. What distinguishes the linear SVM algorithm from other algorithms such as k-nearest neighbour is that it selects the best line for classifying the data points: the line that separates the classes while lying as far as possible from the closest data points. Another reason to use SVMs is that they can find complex relationships in the data without requiring many manual transformations.

Logistic Regression
Linear regression is not suited to predicting a categorical dependent variable from the independent variables, so logistic regression is used when the outcome is categorical. Logistic regression predicts whether an outcome is true or false rather than a value on a continuum, and is therefore used for classification. Using the sigmoid function, a linear combination of the independent variables is mapped to a probability between 0 and 1 for the dependent variable. It is a popular machine learning algorithm because it provides probabilities and can classify new samples based on both continuous and discrete measures. Logistic regression assumes a linear relationship between the independent variables and the log-odds of the dependent variable, which is a limitation.

K-Nearest Neighbour
K-Nearest Neighbour is one of the simplest supervised machine learning algorithms. The K-NN algorithm assumes similarity between a new case and the existing cases and assigns the new case to the category it most closely resembles. The K-NN algorithm stores the available data and categorises new data points according to similarity. KNN operates by locating the stored points that lie nearest to the new point: the algorithm ranks the stored points by their distance to the new point and considers the closest ones. This distance can be measured in a variety of ways, but the Euclidean distance is the most commonly used.

Naive Bayes
The Naive Bayes classifier is a supervised learning algorithm. It is based on Bayes' theorem, which gives the probability of an event given that another event has occurred. It is one of the simplest yet most powerful ML algorithms presently in use and is applied in many industries. In practice, the naive assumption that all predictors (or features) are independent rarely holds, which constrains the practical applicability of the algorithm. The algorithm also faces the "zero frequency problem" of assigning zero probability to categorical variables whose categories were not present in the training dataset. This issue can be resolved using a smoothing strategy (for example, Laplace smoothing). To attain its highest level of accuracy, the Naive Bayes method requires large data sets.

Decision Tree
Decision trees are a supervised learning technique that can be used to solve both classification and regression problems, but they are primarily suited to classification. A decision tree is a tree-structured classifier with internal nodes representing data set features, branches representing decision rules, and leaf nodes representing outcomes. It has two types of node: decision nodes and leaf nodes. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the results of those decisions and have no further branches. Each decision or test is made on the basis of the features of the given data set.

IV. OUTPUT

Among the evaluated machine learning algorithms, logistic regression demonstrated the highest accuracy in our research, at 95.6%. Decision Tree reached 94.74%, Naive Bayes 93.10%, Support Vector Machine 93.8%, and K-Nearest Neighbour 93.7%.
The following are the results of this study, namely the accuracy of the various machine learning algorithms:

Support Vector Machine: 0.9385964912280702
Logistic Regression: 0.956140350877193
K-Nearest Neighbour: 0.9385964912280702
Naive Bayes: 0.9473684210526315
Decision Tree: 0.9473684210526315
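A comparison of this kind can be approximated with a short scikit-learn script. This is a minimal sketch under stated assumptions, not the code used in this study: it assumes scikit-learn is installed, uses the library's bundled copy of the Wisconsin diagnostic dataset in place of the Kaggle file, and chooses the 80/20 split, the random seed, the number of features kept by RFE (10), the linear SVM kernel, k = 5 for KNN, and the Gaussian variant of Naive Bayes purely for illustration. Its printed accuracies will therefore not necessarily match the figures reported here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled Wisconsin diagnostic data stands in for the Kaggle CSV.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Standardise the features using statistics from the training split only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# RFE: repeatedly drop the weakest feature (ranked by coef_) until 10 remain.
# The target of 10 features is an arbitrary illustrative choice.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
selector.fit(X_train_s, y_train)
X_train_sel = selector.transform(X_train_s)
X_test_sel = selector.transform(X_test_s)

# The five classifiers discussed in the paper, with illustrative settings.
models = {
    "Support Vector Machine": SVC(kernel="linear"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbour": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train_sel, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test_sel))
    print(f"{name}: {accuracies[name]:.4f}")
```

Because a single train/test split on a few hundred samples is sensitive to the random seed, differences of one or two percentage points between classifiers in a run like this should be read with caution; repeating the split or using cross-validation gives a steadier ranking.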
The accuracies reported in this section are calculated using the formula:

Accuracy = (TP + TN)/(TP + TN + FP + FN)

The graph below illustrates the total number of patients with malignant and benign tumours. Benign indicates the number of patients who do not have cancer, and malignant the number of patients who are affected by cancer (the blue bar indicates the number of malignant patients and the orange bar the number of benign patients):

[Figure: bar graph of the number of malignant and benign patients]

V. CONCLUSION

This research was conducted to identify the machine learning algorithm that produces the most accurate results. We analysed five machine learning algorithms (SVM, KNN, decision tree, naive Bayes, and logistic regression), of which logistic regression provided the most accurate results (95.6%). The method used for feature selection is RFE, i.e., recursive feature elimination, which helped us select only the relevant features from the dataset. This research led us to conclude that logistic regression is the most effective of these five machine learning algorithms. The only prerequisite for using the system is that the data first be preprocessed and an appropriate feature selection method applied. Although logistic regression gave the best results among the methods tested, there is always room for improvement, and the system can be optimised further in the future using new technologies for larger datasets.

VI. REFERENCES

[1] Sweta Bhise, Simran Bepari, Shrutika Gadekar, and Deepmala Kale, "Breast cancer detection using machine learning techniques" (2021).

[2] Habib Dhahri, Eslam Al Maghayreh, Awais Mahmood, Wail Elkilani, and Mohammed Faisal Nagi, "Automated Breast Cancer Diagnosis Based on Machine Learning Algorithms" (2019).

[3] Kalyani Wadkar, Prashant Pathak, and Nikhil Wagh, "Breast Cancer Detection Using ANN Network and Performance Analysis with SVM" (2019).

[4] Mohammed Amine Naji, Sanaa El Filali, Kawtar Aarika, EL Habib Benlahmar, Rachida Ait Abdelouhahid, and Olivier Debauche, "Machine Learning Algorithms For Breast Cancer Prediction And Diagnosis" (2021).

[5] Ashkan Nomani, Yasaman Ansari, Mohammad Hossein Nasirpour, Armin Masoumian, Ehsan Sadeghi Pour, and Amin Valizadeh, "PSOWNNs-CNN: A Computational Radiology for Breast Cancer Diagnosis Improvement Based on Image Processing Using Machine Learning Methods" (2022).

[6] Sivapriya J, Aravind Kumar V, Siddarth Sai S, and Sriram S, "Breast Cancer Prediction using Machine Learning" (2019).

[7] Hiba Asri, Hajar Mousannif, Hassan Al Moatassime, and Thomas Noel, "Using Machine Learning Algorithms for Breast Cancer Risk Prediction and Diagnosis" (2016).

[8] Mohamed Ebrahim, Ahmed Ahmed Hesham, and Saleh Mesbah, "Accuracy Assessment of Machine Learning Algorithms Used to Predict Breast Cancer" (2023).

[9] Muhammet Fatih Ak, "A Comparative Analysis of Breast Cancer Detection and Diagnosis Using Data Visualization and Machine Learning Applications" (2020).

[10] Sarthak Vyas, Abhinav Chauhan, Deepak Rana, and Noman Ansari, "Breast Cancer Detection Using Machine Learning Techniques" (2022).

[11] K. Anastraj, Dr. T. Chakravarthy, and K. Sriram, "Breast Cancer detection either Benign Or Malignant Tumor using Deep Convolutional Neural Network With Machine Learning Techniques" (2019).

[12] Lal Hussain, Shahzad Ahmad Qureshi, Amjad Aldweesh, Jawad ur Rehman Pirzada, Faisal Mehmood Butt, Elsayed Tag eldin, Mushtaq Ali, Abdulmohsen Algarni, and Muhammad Amin Nadim, "Automated breast cancer detection by reconstruction independent component analysis (RICA) based hybrid features using machine learning paradigms".

[13] P. Boix-Montesinos, M.J. Vicent, A. Armiñán, M. Orzáez, and P.M. Soriano-Teruel, "The past, the present, and the future of breast cancer models for nanomedicine development," Adv. Drug Deliv. Rev., 173 (2021), pp. 306-330.

[14] Parthh Dikshit, Bhawna Dey, Ayush Shukla, Akhilesh Singh, Tarankit Chadha, and Vivek Kumar Sehgal, "Prediction of Breast Cancer using Machine Learning Techniques" (2022).

[15] "Breast Cancer Detection Using Infrared Thermal Imaging and a Deep Learning Model" (2018).

[16] Hannah Le, "Using Machine learning models for breast cancer detection" (2018).

[17] Saleem Z. Ramadan, "Methods Used in Computer-Aided Diagnosis for Breast Cancer Detection Using Mammograms: A Review" (2020).
