Breast Cancer Classifier Using Machine Learning


BREAST CANCER CLASSIFIER USING MACHINE LEARNING

Abstract-Breast cancer is a fatal disease that causes high mortality in women. Constant efforts are being made to create more efficient techniques for early and accurate diagnosis. The correct diagnosis of breast cancer (BC) and the classification of patients into malignant or benign groups is the subject of much research. For BC datasets, machine learning (ML) is widely recognized as the methodology of choice in BC pattern classification and forecast modelling. Classification and data mining methods are an effective way to classify data.

Keywords-Breast Cancer, Machine Learning, Logistic Regression, K Neighbors Classifier, SVC Linear.

I. INTRODUCTION

Breast cancer is one of the most lethal and heterogeneous diseases of the present era, causing the deaths of an enormous number of women all over the world [1].

It is the second leading cause of death among women worldwide [2]. Early detection is the best way to increase the chance of treatment and survivability [2], as it allows timely clinical treatment of patients. Further, accurate classification of benign tumors can prevent patients from undergoing unnecessary treatments. Various machine learning and data mining algorithms are used for the prediction of breast cancer. Finding the most suitable algorithm for the prediction of breast cancer is one of the important tasks [1]. Breast cancer is cancer that develops in breast cells.

Cancer occurs when changes called mutations take place in genes that regulate cell growth. The mutations let the cells divide and multiply in an uncontrolled way. Typically, the cancer forms in either the lobules or the ducts of the breast. Lobules are the glands that produce milk, and ducts are the pathways that bring the milk from the glands to the nipple. Cancer can also occur in the fatty tissue or the fibrous connective tissue within the breast.

The uncontrolled cancer cells often invade other healthy breast tissue and can travel to the lymph nodes under the arms. The lymph nodes are a primary pathway that helps the cancer cells move to other parts of the body.

Data mining is the process of discovering useful information in a large dataset. Data mining techniques and functions help to discover any kind of disease; data mining techniques such as machine learning, statistics, databases, fuzzy sets, data warehouses, and neural networks help in the diagnosis and prognosis of different cancers [1]. The machine learning process is based on three main stages: preprocessing, feature selection or extraction, and classification. Feature extraction is the main part of the machine learning process and helps in the diagnosis and prognosis of cancer; this process can help divide the cancer dataset into benign and malignant tumors [1].

II. MACHINE LEARNING ALGORITHMS FOR BREAST CANCER PREDICTION

Machine learning is an automatic learning method and a branch of artificial intelligence (AI) focused on building applications that learn from data and improve their accuracy over time without being explicitly programmed to do so. For breast cancer prediction, the major machine learning algorithms are as follows [1]:

A. LOGISTIC REGRESSION (LR)
Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. The nature of the target variable is dichotomous, which means there are only two possible classes. The dependent variable is binary, with data coded as either 1 (success/yes) or 0 (failure/no).

B. RANDOM FOREST
Random forest is a supervised learning algorithm used for both classification and regression, although it is mainly used for classification problems. The random forest algorithm builds decision trees on data samples, gets a prediction from each of them, and finally selects the best solution by voting. It is an ensemble method that is better than a single decision tree because it reduces overfitting by averaging the results.

C. DECISION TREE
The decision tree algorithm belongs to the family of supervised learning algorithms. The decision tree



algorithm can be used for solving regression and classification problems. The goal of using a decision tree is to create a training model that can predict the class or value of the target variable by learning simple decision rules inferred from prior data (training data).

D. K NEAREST NEIGHBOR
K-Nearest Neighbor (K-NN) is one of the simplest machine learning algorithms based on the supervised learning technique. It can be used for regression as well as classification, but it is mostly used for classification problems. The K-NN algorithm assumes similarity between the new data and the available data and puts the new data into the category that is most similar to the available categories. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm. It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and acts on it at the time of classification.

E. SUPPORT VECTOR MACHINE
Support vector machines (SVMs) are powerful yet flexible supervised machine learning algorithms used for both classification and regression problems, although they are primarily used for classification. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that a new data point can easily be put in the correct category in the future. This best decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a support vector machine.

F. Proposed Methodology

Fig. 1. Typical machine learning process.

The data used to build the final model usually comes from multiple datasets. In particular, three datasets are commonly used in different stages of the creation of the model.

Training: The model is trained on the training dataset using a supervised learning method. In practice, the training dataset often consists of pairs of an input vector (or scalar) and the corresponding output vector (or scalar), where the answer key is commonly denoted as the target (or label). The current model is run with the training dataset and produces a result, which is then compared with the target, for each input vector in the training dataset. Based on the result of the comparison and the specific learning algorithm being used, the parameters of the model are adjusted. The model fitting can include both variable selection and parameter estimation.

Validation: Next, the fitted model is used to predict the responses for the observations in a second dataset called the validation dataset. The validation dataset provides an unbiased evaluation of a model fit on the training dataset while tuning the model's hyperparameters. This simple procedure is complicated in practice by the fact that the validation dataset's error may fluctuate during training, producing multiple local minima. This complication has led to the creation of many ad hoc rules for deciding when overfitting has truly begun.

Testing: Finally, the test dataset is used to provide an unbiased evaluation of a final model fit on the training dataset. If the data in the test dataset has never been used in training (for example in cross-validation), the test dataset is also called a holdout dataset. The term "validation set" is sometimes used instead of "test set" in the literature (e.g., if the original dataset was partitioned into only two subsets, the test set might be referred to as the validation set).

G. Pre-Processing Data
Data preprocessing is the process of preparing the data and making it suitable for a machine learning model. Real-world data generally contains noise and missing values and may be in an unusable format that cannot be used directly for machine learning models. Data preprocessing is required for cleaning the data and making it suitable for a machine learning model, which also increases the accuracy and efficiency of the model.
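The three-dataset workflow described under Training, Validation, and Testing can be sketched as follows. This is a minimal illustration and not the procedure used in the study: it assumes scikit-learn's bundled copy of the Wisconsin Diagnostic data in place of the CSV file, and the split ratios, the random seed, the candidate values of C, and the choice of logistic regression are all assumptions made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Wisconsin Diagnostic data stands in for the CSV file.
X, y = load_breast_cancer(return_X_y=True)

# Hold out the test set first, then carve a validation set out of the remainder.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

# Tune one hyperparameter (the regularization strength C) on the validation set only.
best_C, best_val_acc = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# The test set is touched once, after the hyperparameter has been fixed.
final_model = make_pipeline(StandardScaler(), LogisticRegression(C=best_C, max_iter=1000))
final_model.fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)
print(f"best C={best_C}, validation accuracy={best_val_acc:.3f}, test accuracy={test_acc:.3f}")
```

Splitting off the test set before any tuning keeps the final accuracy estimate independent of the hyperparameter search, which is the role the text assigns to the holdout dataset.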
The data preprocessing involves the following steps:

Phase-1 Get the Dataset

To create a machine learning model, the first thing we require is a dataset, as a machine learning model works entirely on data. The data collected for a particular problem in a proper format is known as the dataset. To use the dataset in our code, we usually put it into a CSV file. CSV stands for "Comma-Separated Values"; it is a file format that allows us to save tabular data such as spreadsheets. It is useful for huge datasets, and these datasets can be used in programs.

Phase-2 Importing Libraries

In order to perform preprocessing using Python, we need to import some predefined Python libraries. These libraries are used to perform specific tasks. There are three specific libraries that we will use for data preprocessing:

⮚ 1. Numpy
The Numpy Python library is used for including any type of mathematical operation in code. It is the fundamental package for scientific calculation in Python. It also supports large, multidimensional arrays and matrices.

⮚ 2. Matplotlib
The second library is matplotlib, a Python 2D plotting library; with it, we need to import the sub-library pyplot. This library is used to plot any type of chart in Python. Seaborn is a Python data visualization library based on matplotlib.

⮚ 3. Pandas
The last library is Pandas, one of the most famous Python libraries, used for importing and managing datasets. It is an open-source data manipulation and analysis library.

Phase-3 Data Preparation

In data preparation, we load our data into a suitable place and prepare it for use in machine learning training. We first put all our data together and then randomize the ordering.

Phase-4 Features Selection

In machine learning and statistics, feature selection, also known as variable selection, is the process of selecting a subset of relevant features for use in model construction.

Data File and Feature Selection: Breast Cancer Wisconsin (Diagnostic) Data Set from the Kaggle repository. Our target parameter is the breast cancer diagnosis, malignant or benign. The important features found by the study are: Concave points worst, Area worst, Area se, Texture worst, Texture mean, Smoothness worst, Smoothness mean, Radius mean, Symmetry mean.

Attribute Information:
ID number
Diagnosis (M = Malignant, B = Benign)

Phase-5 Handling Missing Data

The next step of data preprocessing is to handle missing data in the dataset. If our dataset contains missing data, it may create a serious problem for our machine learning model. Hence it is necessary to handle the missing values present in the dataset.

Phase-6 Encoding Categorical Data

Categorical data are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set. We will use LabelEncoder to encode the categorical data. LabelEncoder is part of the scikit-learn library in Python and is used to convert categorical or text data into numbers, which our predictive models can better understand.
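Phases 3 to 6 can be sketched together on a tiny in-memory table. The rows below are invented for illustration, and the column names (id, diagnosis, radius_mean, texture_mean) are assumed to follow the Kaggle CSV; filling gaps with the column median is one common choice, since the text does not say which strategy was used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows in the style of the Kaggle CSV (values invented for illustration).
df = pd.DataFrame({
    "id": [101, 102, 103, 104, 105, 106],
    "diagnosis": ["M", "B", "B", "M", "B", "M"],
    "radius_mean": [17.9, 12.4, np.nan, 20.1, 11.3, 18.6],
    "texture_mean": [10.4, 15.7, 14.2, np.nan, 18.9, 21.3],
})

# Phase-3: randomize the row order.
df = df.sample(frac=1, random_state=0).reset_index(drop=True)

# Phase-4: keep a subset of features; the ID number carries no diagnostic information.
features = ["radius_mean", "texture_mean"]

# Phase-5: fill each missing value with the median of its column (an assumed strategy).
df[features] = df[features].fillna(df[features].median())

# Phase-6: encode the labels; LabelEncoder sorts the classes, so B -> 0 and M -> 1.
encoder = LabelEncoder()
df["diagnosis"] = encoder.fit_transform(df["diagnosis"])

X = df[features].to_numpy()
y = df["diagnosis"].to_numpy()
print(list(encoder.classes_), y)
```

With the real file, the DataFrame would instead come from pandas.read_csv, and the feature list would be the nine features named under Phase-4.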
Phase-7 Splitting the Dataset into the Training Set and Test Set

The data we use is usually split into training data and test data. The training set contains a known output, and the model learns on this data so that it can generalize to other data later on.

We keep the test dataset (or subset) to test our model's predictions. We will do this using the scikit-learn library in Python, with the train_test_split method.

Phase-8 Feature Scaling

Most of the time, a dataset contains features that vary widely in magnitude, units, and range. Since many machine learning algorithms use the Euclidean distance between two data points in their computations, we need to bring all features to the same level of magnitude. This can be achieved by scaling.

Phase-9 Model Selection

Supervised learning is the method in which the machine is trained on data whose inputs and outputs are well labelled. The model learns from the training data and can then process future data to predict the outcome.

Supervised learning problems can be further grouped into regression and classification problems. A regression problem is one where the output variable is a real or continuous value. A classification problem is one where the output variable is a category.

Unsupervised learning gives the machine data that is neither classified nor labelled and lets the algorithm analyze it without any guidance. In our dataset, the outcome (dependent) variable Y takes only two values, M (Malignant) or B (Benign), so a classification algorithm of supervised learning is applied to it.

Performance Measure Parameters:

The performance of a machine learning technique is measured with respect to a few performance measure parameters. A confusion matrix comprising TP, FP, TN, and FN for the actual and predicted data is formed to evaluate the parameters [SPRING].

TP  True Positive
TN  True Negative
FP  False Positive
FN  False Negative

The comparative study's performance is evaluated by the following formulas:

1. Accuracy (Acc): the number of correctly classified data instances over the total number of data instances.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

2. Precision (Prec): the number of correct positive results divided by the number of positive results predicted by the classifier.

$$\text{Precision} = \frac{TP}{TP + FP}$$

3. F1 score: the harmonic mean of precision and sensitivity (recall), where sensitivity is TP / (TP + FN).

$$\text{F1 score} = \frac{2TP}{2TP + FP + FN}$$

RESULTS AND DISCUSSION OF PROPOSED METHODOLOGY

[Result figures for 1. Logistic Regression and 2. K Nearest Neighbors]
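The accuracy, precision, and F1 formulas given under Performance Measure Parameters can be checked by hand with a few lines of arithmetic. The sketch below uses hypothetical confusion-matrix counts, not figures from the study, and the function name classification_metrics is introduced here only for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the measures defined in the text from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # sensitivity
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a 114-sample test set (invented for illustration).
metrics = classification_metrics(tp=40, fp=2, tn=70, fn=2)

# The closed form 2TP / (2TP + FP + FN) agrees with the harmonic mean of precision and recall.
p, r = metrics["precision"], metrics["recall"]
harmonic_mean = 2 * p * r / (p + r)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Writing F1 directly in terms of the counts avoids computing precision and recall separately, and the comparison with the harmonic mean shows the two definitions are the same quantity.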
[Result figures for 3. Decision Tree, 4. SVC Linear, and 5. Random Forest Classifier]

COMPARISON BETWEEN ALL THE FIVE MACHINE LEARNING TECHNIQUES

TABLE 1.

METHODOLOGY                 ACCURACY    PRECISION    RECALL
LOGISTIC REGRESSION         96.14%      0.95         0.95
K NEAREST NEIGHBORS         95.44%      0.96         0.96
DECISION TREE               92.63%      0.94         0.94
SVC LINEAR                  96.84%      0.96         0.96
RANDOM FOREST CLASSIFIER    95.09%      0.97         0.97

RELATED WORK
In this section, some of the related work previously done on breast cancer diagnosis by researchers using different machine learning approaches is discussed [LT2].

1. Xin Yao [5] (1999) attempted to implement a neural network for breast cancer diagnosis. A negative correlation training algorithm was used to decompose a problem automatically and solve it. The author discussed two approaches, an evolutionary approach and an ensemble approach; the evolutionary approach can be used to design compact neural networks automatically. The ensemble approach was aimed at tackling large problems, but it was still in progress.

2. Chih-Lin Chi [5] (2007) presented an article on survival analysis of breast cancer on two breast cancer datasets. The article applies artificial neural networks (ANNs) to the survival analysis problem. Because ANNs can easily consider variable interactions and create a non-linear prediction model, they offer more flexible prediction of survival time than traditional methods. The study compares ANN results on two different breast cancer datasets, both of which use nuclear morphometric features. The results show that ANNs can successfully predict recurrence
probability and separate patients with good and bad prognosis.

3. Ahmad [3] compared the performance of decision tree (C4.5), SVM, and ANN. The dataset used was obtained from the Iranian Center for Breast Cancer. Simulation results showed that SVM was the best classifier, followed by ANN and decision tree.

4. Nematzadeh [3] conducted a comparative study of decision tree, NB, NN, and SVM with three different kernel functions as classifiers for the WPBC and Wisconsin Breast Cancer (WBC) datasets. The experimental results showed that NN (10-fold) had the highest accuracy, 98.09%, on the WBC dataset, while SVM-RBF (10-fold) had the highest accuracy, 98.32%, on the WPBC dataset.

5. Yue [4] mainly presented comprehensive reviews of SVM, K-NN, ANN, and decision tree techniques applied to predicting breast cancer on the benchmark Wisconsin Breast Cancer Diagnosis (WBCD) dataset. According to the authors, the deep belief network (DBN) approach with an ANN architecture (DBNs-ANNs) gave the most accurate result. This architecture obtained 99.68% accuracy, whereas for the SVM method, the two-step clustering algorithm combined with the SVM technique achieved 99.10% classification accuracy. They also reviewed an ensemble technique in which SVM, Naive Bayes, and J48 were combined by voting. The ensemble method achieved 97.13% accuracy.

6. Sakri [4] focused on improving accuracy using a feature selection algorithm, particle swarm optimization (PSO), together with the machine learning algorithms K-NN, Naive Bayes (NB), and reduced error pruning (REP) tree. Their work addresses breast cancer among Saudi Arabian women, which, according to their report, is one of the major health problems in Saudi Arabia.

Conclusion:
Breast cancer is considered to be one of the significant causes of death in women. Early detection of breast cancer plays an essential role in saving women's lives. Breast cancer detection can be done with the help of modern machine learning algorithms. This paper presented a comparative study of five machine learning techniques for the prediction of breast cancer, namely Logistic Regression, K Nearest Neighbors, Decision Tree, Support Vector Machine, and Random Forest Classifier. The basic features and working principle of each of the five machine learning techniques were illustrated. The highest accuracy, 96.84%, was obtained by SVC Linear, whereas the lowest accuracy, 92.63%, came from the Decision Tree. The diagnosis procedure in the medical field is very expensive as well as time-consuming. The proposed system suggests that machine learning techniques can act as a clinical assistant for the diagnosis of breast cancer and will be very helpful for new doctors or physicians in case of misdiagnosis. From the study, we can conclude that machine learning techniques are able to detect the disease automatically with high accuracy.

REFERENCES

1. Noreen Fatima (1), Li Liu (1), Sha Hong (1), and Haroon Ahmed (2) (Student Member, IEEE), "Prediction of Breast Cancer, Comparative Review of Machine Learning Techniques, and Their Analysis." (1) School of Big Data and Software Engineering, Chongqing University, Chongqing 400044, China. (2) School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China.

2. Siham A. Mohammed, Sadeq Darrab, Salah A. Noaman, and Gunter Saake, "Analysis of Breast Cancer Detection Using Different Machine Learning Techniques."
3. David A. Omondiagbe (1), Shanmugam Veeramani (1), and Amandeep S. Sidhu (2), "Machine Learning Classification Techniques for Breast Cancer Diagnosis." (1) Curtin University, Malaysia, CDT 250, Miri 98009, Sarawak, Malaysia. (2) Curtin University, Kent St, Bentley WA 6102, Australia.

4. Md. Milon Islam, Md. Rezwanul Haque, Hasib Iqbal, Md. Munirul Hasan, Mahmudul Hasan, and Muhammad Nomani Kabir, "Breast Cancer Prediction: A Comparative Study Using Machine Learning Techniques."

5. B. M. Gayathri (1), C. P. Sumathi (2), and T. Santhanam (3), "Breast Cancer Diagnosis Using Machine Learning Algorithms: A Survey." (1) Department of Computer Science, SDNB Vaishnav College for Women, Chennai, India, gayathri_bm2003@yahoo.co.in. (2) Department of Computer Science, SDNB Vaishnav College for Women, Chennai, India, drcpsumathi@gmail.com. (3) Department of Computer Application, D.G. Vaishnav College for Men, Arumbakkam, Chennai, India
