Article · April 2018


DOI: 10.26483/ijarcs.v9i0.6134



Volume 9, Special Issue No. 2, April 2018 ISBN: 978-93-5311-643-9
International Journal of Advanced Research in Computer Science
(ISSN: 0976-5697)
CONFERENCE PAPER
Available Online at www.ijarcs.info
A STUDY ON EARLY PREVENTION AND DETECTION OF BREAST CANCER
USING THREE-MACHINE LEARNING TECHNIQUES
Nafees Akhter Farooqui
Deptt. of Computer Applications, DIT University, Dehradun, India
nafees.farooqui@dituniversity.edu.in

Ritika
Deptt. of Computer Applications, DIT University, Dehradun, India
hod.mca@dituniversity.edu.in

Abstract: Medical data repositories are growing rapidly, which makes it difficult to analyze these data and extract valuable, hidden knowledge. Several machine learning techniques are used for medical analysis. Breast cancer is the most common cancer diagnosed in women and one of the leading causes of death worldwide. It forms in the cells of the breast, and it has become a major disease not only in India but also in other countries. Only early detection can reduce breast cancer mortality: earlier detection saves more lives and lowers the death rate, and the cure rate and life expectancy depend on early identification of the disease. The main objective of this paper is the early diagnosis of breast cancer patients. For early prevention and detection, three machine learning techniques (Decision Tree, Support Vector Machine, Random Forest) are used, which also eliminate waiting time and reduce human and technical errors in diagnosis. Selecting a suitable machine learning technique is a challenge for the diagnosis of breast cancer. We have therefore created a model for a breast cancer prediction system that analyzes risk levels to help in prognosis. This work can help doctors diagnose breast cancer and help patients receive early treatment.

Keywords: Decision tree, Support Vector Machine, Random Forest, Prognosis, Prediction System, Risk Levels

I. INTRODUCTION

According to the World Health Organization, 8.8 million deaths globally each year are caused by cancer. Cancer represents 13% of all global deaths. The American Cancer Society states that breast cancer death rates among women declined 39% from 1989 to 2015 [1]. This progress is attributed to improvements in early detection and prevention, and for this reason breast cancer research leads to increased survival rates. This paper reviews the identification of breast cancer through its various characteristics. Cancer starts when cells in the body grow out of control, and it can start in any place in the body; when cells in the breast begin to grow out of control, breast cancer starts. The symptoms of breast cancer include a lump in the breast, bloody discharge from the nipple, and changes in the shape or texture of the nipple or breast. Breast cancer occurs almost entirely in women, but men can get breast cancer too. Breast cancer is of two basic types: non-invasive and invasive. Non-invasive breast cancer starts in the milk duct and does not spread to other organs even if it grows. Invasive breast cancer is very aggressive: it spreads to other organs and destroys them. It is therefore necessary to detect the affected cells before they spread to nearby organs. Early detection will reduce the death rate of breast cancer patients. We propose a new adaptive technique that eliminates waiting time and reduces human effort and technical errors in diagnosing breast cancer. The symptoms of breast cancer remain unclear, and no main cause has been identified [2]. Using data mining techniques reduces false positive and false negative decisions, which is an advantage for the early prediction of breast cancer [3, 4].

Different data analysis tools are used to predict patterns from large data sets. These tools can include different machine learning methods for the early detection of cancer. Classification learning is a way of learning to categorize unseen examples into predefined classes, based on a set of training examples, in order to predict the class of an instance. Association learning predicts not only the classes but also the attributes of the classes. Clustering is the process of separating a dataset into subgroups according to their distinguishing features; this type of learning task identifies the classes from the knowledge base. In this study, the data are classified using the decision tree algorithm. The random forest algorithm is also used to classify the data; it builds a large number of decision trees from sub-datasets drawn from the original training set by bagging, which improves the stability and accuracy of classification and regression models. Because of the various challenges in classification and regression models, Support Vector Machines, a supervised machine learning algorithm, are used for the classification problem. In this paper, we propose a predictive model to detect and prevent breast cancer at an early stage using three machine learning techniques. We also validate the models and measure their accuracy, sensitivity, and specificity.

II. LITERATURE REVIEW

A review of the literature shows that there have been several studies on the early detection and prevention of breast cancer using data mining techniques such as decision trees [5, 6]. Sahar A. Mokhtar et al. [7] studied decision tree, artificial neural network and support

Conference Paper: II International Conference on “Advancement in Computer Engineering and Information Technology” 
Organized by: Department of Computer Science and Engineering, Integral University, Lucknow, U.P. India            37 
Nafees Akhter Farooqui et al, International Journal of Advanced Research in Computer Science, 9 (Special Issue II), April 2018,37-42

vector machine classification models for predicting the severity of breast cancer. The performance of the three models was evaluated using statistical measures (accuracy, sensitivity, specificity), and the support vector machine model was found to perform better than the other two models in predicting the severity of breast cancer. In the study by Pendharkar et al. of patterns in breast cancer, data mining was found to be a valuable tool for identifying patterns in breast cancer cases that can be used for diagnosis, prognosis, and treatment purposes [8]. Rajashree Dash et al. [9] proposed a hybridized K-means clustering algorithm that improves the efficiency of the original K-means clustering by applying PCA to the original data sets. It is a new approach to identifying cluster centers and assigning data points to the appropriate clusters, in which a given data set is partitioned into k clusters. The experimental results show that the proposed algorithm provides better efficiency and accuracy than the original k-means algorithm, with reduced running time. Zakaria Suliman Zubi et al. [10] used different data mining techniques, such as neural networks, for the detection and classification of lung cancers.

Modern medical science has developed a new blood test technique, known as CancerSEEK, that can detect eight common types of cancer in a single test. The test detects tiny amounts of DNA and proteins released into the bloodstream from cancer cells, and indicates the presence of ovarian, liver, stomach, pancreatic, oesophageal, bowel, lung or breast cancers. This blood test is known as a liquid biopsy, which is different from a standard biopsy, where a needle is put into a solid tumor to confirm a cancer diagnosis. CancerSEEK is also far less invasive. It will be helpful for the early diagnosis of cancer, giving a better chance of cure with modern medicines and surgery.

III. BACKGROUND

The death rate of breast cancer has fallen in the last few years due to modern screening techniques and treatments [11, 12]. These improvements are the direct result of scientific research [11]. Scientific research is of different types, which provide knowledge about the disease from different but complementary perspectives [13]. Treatment of breast cancer has been improved by enhanced methods, and computational methods enable less-invasive predictive medicine. Thus, the death rate for this cancer has decreased in recent years.

A. Common Breast Cancer Investigation Methods

The most common investigation methods for breast cancer are laboratory studies, observational studies and clinical diagnosis. In laboratory studies, hypothesis testing is performed under controlled conditions. This gives complete results, but they are limited by the controlled environment. Observational studies examine the characteristics of a population and establish an association between the variables and the outcome; they do not always establish a cause-and-effect relationship. Clinical diagnosis performs the medical study by involving humans, and shows whether there is a cause-and-effect relationship between the variables and the outcomes. Clinical diagnosis has been used on a large scale in the development of drugs and the treatment of breast cancer [13]. In this paper, we study three machine learning techniques as the research methods for the detection of breast cancer. The purpose of this research is to develop prediction models that detect and prevent breast cancer at an early stage.

B. Medical Prognosis Analysis

Medical prognosis is a scientific field concerned with evaluating the recurrence of disease and predicting the survival of a patient or group of patients [14]. Estimates of patient health are produced by medical prognosis analysis, and these estimates can help to design treatment according to the expected outcomes. Knowledge discovery and data mining techniques advance the medical sciences. These methods are more powerful than traditional statistical methods [14]. Recent improvements in early detection and prevention increase the expected survival of breast cancer patients after diagnosis [15].

IV. PROPOSED MODEL

The proposed model begins with the preprocessing of data. The model builds a knowledge base that stores the collected data. Most of the data is taken as the training set to build the classification and clustering model, and the remainder is used for testing. Different machine learning techniques are used to build the classification and clustering model. The model is then tested for accuracy, sensitivity and specificity using the test data, which is also merged into the knowledge base. Finally, the model is evaluated using Support Vector Machine. Figure 1 shows the model used for the detection of people affected by breast cancer.

Figure 1. Proposed Model for the Early Prediction of Breast Cancer

V. MATERIALS AND METHODS

Discussions with medical experts, case studies and extensive literature reviews show that many factors influence cancer. These factors are identified and taken as attributes for this study. For the early-stage prediction of


breast cancer, the SEER public dataset was used. It consists of different cancer files that help to identify breast cancer at an early stage. Each file has different attributes, and each record relates to a specific incidence of cancer. The data are collected from different areas, each containing a population that is representative of the different cultural groups residing in that area. SEER estimates cancer mortality rates in these areas by different factors [17]. SEER is an integral part of the Surveillance Research Program (SRP) at the National Cancer Institute (NCI) and is responsible for collecting incidence and survival data from the participating registries (along with descriptive information about the data itself) for conducting analytical research projects [16]. The data were preprocessed to remove unsuitable cases. After applying data cleansing and data preparation strategies, the final dataset consisted of 16 predictor variables and one outcome variable. The variables were chosen on the basis of the cancer type and their biological characteristics. Table I shows the predictor variables. The dataset was cleaned by handling missing values and noise and by identifying and correcting inconsistencies. Some fields, such as Number of positive nodes and Race, were affected, and most of these cases are benign. Missing values were substituted using the EM method [18]. Three machine learning techniques (DT, RF, SVM) are used to perform classification and early detection of breast cancer. There are 162,500 data points: 128,469 positive cases and 34,031 negative cases. The data are sampled randomly and divided into ten groups, each with balanced classes; five-fold cross-validation is used and repeated five times. Figure 2 shows the proposed experimental model.

VI. EXPECTATION MAXIMIZATION (EM)

The EM algorithm is an appropriate tool for statistical estimation problems that involve incomplete or missing data, including mixture estimation [28, 29]. The EM algorithm has also been used in various motion estimation frameworks [30]. It is an iterative technique for solving inferential problems with missing data using only standard methods: if the data were not missing, we could use standard statistical methods directly. For any incomplete dataset, some assumptions are made, including a predictive probability distribution for the missing values, over which the statistical analysis is averaged. The EM algorithm is an important technique for predicting missing data because it exploits the relationship between the missing data and the unknown parameters of a model. When the parameters are known, it is possible to obtain unbiased predictions for the missing values [18, 19]. In machine learning, several probabilistic learning models, including Expectation Maximization-based inference algorithms, associate the predictor variables with Maximum Likelihood (ML) solutions, which is especially valuable for missing or incomplete data. The EM algorithm has two basic steps, the E-step and the M-step. The E-step computes the expected values of the missing data given the current parameter estimates, and the M-step re-estimates the parameters given those expectations.

VII. MACHINE LEARNING TECHNIQUES

In this paper, we used three machine learning techniques, DT, RF, and SVM, for the early prevention and detection of breast cancer and to find which method performs better.

Decision trees are classification algorithms that are becoming more powerful and popular with the advancement of data mining. The C4.5 algorithm, which is based on the ID3 algorithm, is used. To improve prediction accuracy, the technique is applied recursively to separate the observations into branches that form a tree. Each tree node is either a leaf node or a decision node. To construct the DT, mathematical criteria (e.g., information gain, Gini index, and Chi-squared test) are used to identify a variable and a corresponding threshold that splits the input observations into two or more subgroups [20].

Random Forest is a bagging algorithm that has been successfully applied to highly quantified models [23]. Because a single decision tree is prone to overfitting, Random Forest builds hundreds or even thousands of trees. The trees differ from each other because each is built from a random sample drawn with replacement [24]. On average, 33% of the rows are left out of each sample [25]. Each tree classifies its observations, and the final decision is chosen by majority vote [26]. Random Forest can also be used in unsupervised mode for assessing proximities among data points [27].

Support Vector Machine (SVM) is based on the idea of a hyperplane that divides a dataset into two classes in the best possible way. SVM has been used for different types of problems and has already been successful in pattern recognition, bioinformatics and cancer diagnosis [21]. The separating hyperplane can be linear or non-linear. For linear classification, SVM places the linear decision function in the central gap between the two classes, classifying all the training data points and placing the decision function as far from the data points as possible. SVM also uses a non-linear mapping technique to transform the original training data into a higher dimension, where the classes are separated so as to minimize classification errors [22].
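The split criteria named for decision trees can be made concrete with a small sketch. The Python below is illustrative only: the function names (`gini`, `entropy`, `best_threshold`) and the toy tumor-size values are assumptions, not taken from the study, and it evaluates a single binary split rather than the full C4.5 procedure.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini index of a list of class labels (0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits; the basis of information gain."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels, impurity=gini):
    """Return (threshold, impurity decrease) of the best split 'value <= t'."""
    n = len(labels)
    parent = impurity(labels)
    best = (None, 0.0)
    # Every distinct value except the largest is a candidate threshold.
    for t in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        child = len(left) / n * impurity(left) + len(right) / n * impurity(right)
        if parent - child > best[1]:
            best = (t, parent - child)
    return best

# Hypothetical toy data: tumor size in cm and a binary outcome.
sizes = [1.0, 1.5, 2.0, 3.5, 4.0, 5.5]
outcome = [0, 0, 0, 1, 1, 1]
gini_split = best_threshold(sizes, outcome, impurity=gini)
info_split = best_threshold(sizes, outcome, impurity=entropy)
print(gini_split, info_split)
```

On this toy data both criteria pick the same threshold, because splitting at 2.0 cm yields two pure subgroups. On real data the criteria can disagree, which is one reason tree implementations expose the criterion as a parameter.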
Table I. Predictor variables used for detection of breast cancer.

Sr. No.  Variable Name                       Description
1        Age                                 Actual age of patient in years
2        Stage of Cancer                     Defined by size of cancer tumor and its spread
3        Grade                               Appearance of tumor and its similarity to more- or less-aggressive tumors
4        Race                                Ethnicity: White, Black, Chinese, etc.
5        Lymph Node involvement              None, (1-3) minimal, (4-9) significant, etc.
6        Marital Status                      Married, single, divorced, widowed, separated
7        Primary Site                        Presence of tumor at particular location in body. Topographical classification of cancer
8        Tumor Size                          2-5 cm; at 5 cm prognosis worsens
9        Radiation                           None, beam radiation, radioisotopes, refused, recommended, etc.
10       Histological Type                   Form and structure of tumor
11       Site specific surgery code          Information on surgery during first course of therapy, whether cancer-directed or not
12       Behavior Code                       Normal or aggressive tumor behavior is defined using codes
13       Number of positive nodes examined   When lymph nodes are involved in cancer, they are called positive
14       Number of nodes examined            Total nodes (positive/negative) examined
15       Number of primaries                 Number of primary tumors (1-6)
16       Clinical extension of tumor         Defines spread of tumor relative to breast
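To illustrate how EM can fill in a missing predictor value, here is a minimal sketch under strong simplifying assumptions: two numeric variables that follow a bivariate normal distribution, with the first fully observed and the second partly missing. The function name `em_impute_bivariate` and the toy numbers are hypothetical; the study's actual imputation over the SEER variables is not described in enough detail to reproduce.

```python
def em_impute_bivariate(x, y, n_iter=200):
    """Impute missing y values (given as None) with EM for a bivariate normal.

    x must be fully observed and must not be constant.
    """
    n = len(x)
    observed = [v for v in y if v is not None]
    mu_x = sum(x) / n
    s_xx = sum((v - mu_x) ** 2 for v in x) / n
    # Starting values: moments of the observed y, no correlation yet.
    mu_y = sum(observed) / len(observed)
    s_yy = sum((v - mu_y) ** 2 for v in observed) / len(observed)
    s_xy = 0.0
    for _ in range(n_iter):
        # E-step: expected y and y^2 for missing rows, given x and current parameters.
        beta = s_xy / s_xx
        cond_var = max(s_yy - beta * s_xy, 0.0)
        ey, ey2 = [], []
        for xi, yi in zip(x, y):
            if yi is None:
                m = mu_y + beta * (xi - mu_x)
                ey.append(m)
                ey2.append(m * m + cond_var)
            else:
                ey.append(yi)
                ey2.append(yi * yi)
        # M-step: re-estimate the parameters from the expected sufficient statistics.
        mu_y = sum(ey) / n
        s_xy = sum(xi * e for xi, e in zip(x, ey)) / n - mu_x * mu_y
        s_yy = sum(ey2) / n - mu_y ** 2
    beta = s_xy / s_xx
    return [yi if yi is not None else mu_y + beta * (xi - mu_x)
            for xi, yi in zip(x, y)]

# Hypothetical toy data in which the observed pairs follow y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [3.0, 5.0, None, 9.0, None, 13.0]
completed = em_impute_bivariate(x, y)
print(completed)
```

Because the observed pairs lie exactly on a line, the imputed values converge to that line. Categorical fields such as Race would need a different model (for example a multinomial one), which this sketch does not cover.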

Figure 2. Proposed Experimental Model

VIII. RESULTS AND DISCUSSION

The results were obtained on a new database drawn from the SEER datasets for breast cancer by comparing models built with three different machine learning techniques. The results of the analysis on the SEER public datasets are shown in Table II, which reports sensitivity, specificity and accuracy for the different classification techniques. This paper has explored risk factors for the detection of breast cancer using machine learning techniques. Each method has its own limitations and strengths specific to the type of application. Our results show that Random Forest is better than the other two. There are some limitations in the current study, as we have used the C1 algorithm for analyzing the disease. Also, in many cases records were omitted and variables were noisy, so the accuracy of the results is affected.
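The three measures reported in this section follow directly from confusion-matrix counts. A minimal Python sketch, using a hypothetical helper `classification_metrics` and made-up labels, and assuming the positive class is coded as 1:

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of true positives that are detected
        "specificity": tn / (tn + fp),  # share of true negatives that are detected
    }

# Made-up labels: 4 positive and 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can be misleading when the classes are imbalanced, which is why sensitivity and specificity are reported alongside it.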

Table II. Performance comparison of DT, RF and SVM (Acc. = accuracy, Sens. = sensitivity, Spec. = specificity).

          Decision Tree          Random Forest          Support Vector Machine
Dataset   Acc.   Sens.  Spec.    Acc.   Sens.  Spec.    Acc.   Sens.  Spec.
1         0.66   0.80   0.51     0.72   0.73   0.70     0.52   0.77   0.51
2         0.67   0.71   0.64     0.72   0.78   0.66     0.52   0.47   0.51
3         0.62   0.53   0.71     0.70   0.81   0.59     0.50   0.50   0.58
4         0.67   0.70   0.64     0.68   0.73   0.63     0.51   0.71   0.51
5         0.64   0.72   0.56     0.71   0.74   0.68     0.52   0.47   0.51
6         0.62   0.65   0.60     0.71   0.74   0.69     0.52   0.71   0.51
7         0.63   0.83   0.43     0.69   0.76   0.62     0.51   0.72   0.51
8         0.69   0.85   0.53     0.73   0.78   0.67     0.51   0.76   0.51
9         0.66   0.60   0.71     0.70   0.74   0.66     0.52   0.79   0.51
10        0.64   0.86   0.42     0.72   0.78   0.66     0.51   0.63   0.50
Mean      0.65   0.73   0.58     0.71   0.76   0.65     0.51   0.65   0.52
S.D.      0.02   0.11   0.10     0.02   0.03   0.03     0.01   0.13   0.02
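The Mean and S.D. rows can be cross-checked from the ten per-dataset values. The sketch below recomputes them for the accuracy columns with Python's standard `statistics` module; it uses the sample standard deviation, which is an assumption, since the paper does not say which form was used.

```python
from statistics import mean, stdev

# Per-dataset accuracies copied from Table II (datasets 1 to 10).
accuracy = {
    "DT":  [0.66, 0.67, 0.62, 0.67, 0.64, 0.62, 0.63, 0.69, 0.66, 0.64],
    "RF":  [0.72, 0.72, 0.70, 0.68, 0.71, 0.71, 0.69, 0.73, 0.70, 0.72],
    "SVM": [0.52, 0.52, 0.50, 0.51, 0.52, 0.52, 0.51, 0.51, 0.52, 0.51],
}

# Round to two decimals to match the precision of the table.
summary = {name: (round(mean(v), 2), round(stdev(v), 2))
           for name, v in accuracy.items()}

for name, (m, s) in summary.items():
    print(f"{name}: mean={m:.2f}, s.d.={s:.2f}")
```

For the values listed, this gives mean accuracies of 0.65 (DT), 0.71 (RF) and 0.51 (SVM), with standard deviations of 0.02, 0.02 and 0.01 respectively.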

(a) Accuracy of the Random Forest is better than that of DT and SVM
(b) Sensitivity of the Random Forest is average relative to the other two


(c) Specificity of the Random Forest is like a Normal distribution curve

Figure 3. Performance Comparison of DT, RF, SVM in terms of Accuracy, Sensitivity and Specificity (i.e. a, b, c)

IX. CONCLUSION

Early detection makes breast cancer easier to prevent with the help of different medical therapies. In this paper, we analyzed breast cancer at an early stage with the help of machine learning techniques and compared the performance of DT, RF and SVM. Figure 3 shows the graphical comparative analysis of these techniques in terms of accuracy, sensitivity and specificity. We find that Random Forest performs better than the other techniques in predicting cancer at an early stage. In future, more suitable variables will be included to give better results than the current ones.

X. ACKNOWLEDGMENTS

The authors thank DIT University, Dehradun for providing the research grant to support this research work. The corresponding author wishes to thank Prof K.K. Raina and Prof S.K. Gupta for their great cooperation and motivation for this research.

XI. REFERENCES

[1] Breast Cancer Facts and Figures 2015-2016. American Cancer Society (2015).
[2] Zribi M, Boujelbene Y (2016) The Neural Network with an Incremental Learning Algorithm Approach for Mass Classification in Breast Cancer. 5: 2090-4924.
[3] Karabatak M, Cevdet M (2009) An expert system for detection of breast cancer based on association rules and neural network. Expert Systems with Applications 36: 3465-3469.
[4] Kovalerchuk B, Triantaphyllou E, Ruiz JF, Clayton J (1997) Fuzzy logic in computer-aided breast cancer diagnosis: Analysis of lobulation. Artif Intell Med 11: 75-85.
[5] Zhou ZH, Jiang Y (2003) Medical diagnosis with C4.5 Rule preceded by artificial neural network ensemble. IEEE Trans Inf Technol Biomed 7: 37-42.
[6] Delen D, Walker G, Kadam A (2005) Predicting breast cancer survivability: a comparison of three data mining methods. Artificial Intelligence in Medicine 34: 113-127.
[7] A. Sahar "Predicting the Severity of Breast Masses with Data Mining Methods" International Journal of Computer Science Issues, Vol. 10, Issue 2, No 2, March 2013. ISSN (Print): 1694-0814, ISSN (Online): 1694-0784. www.IJCSI.org
[8] Pendharkar PC, Rodger JA, Yaverbaum GJ, Herman N, Benner M (1999) Association, statistical, mathematical and neural approaches for mining breast cancer patterns. Expert Systems with Applications 17: 223-232.
[9] Rajashree Dash "A hybridized K-means clustering approach for high dimensional dataset" International Journal of Engineering, Science and Technology Vol. 2, No. 2, 2010, pp. 59-66.
[10] Zakaria Suliman Zubi "Improves Treatment Programs of Lung Cancer using Data Mining Techniques" Journal of Software Engineering and Applications, February 2014, 7, 69-77.
[11] Warren J. Cancer death rates falling, but slowly. WebMD medical news; 2003 (http://aolsvc.health.webmd.aol.com/content/Artcile/73/82013.htm).
[12] Progress shown in death rates from four leading cancers (http://cancer.gov/newscenter/pressreleases/2003ReportRelease).
[13] The ABCs of breast cancer: types of research studies (http://www.komen.org/bci/abs/chap_01.asp).
[14] Ohno-Machado L. Modeling medical prognosis: survival analysis techniques. J Biomed Inform 2001; 34:428-39.
[15] Brenner H, Gefeller O, Hakulinen T. A computer program for period analysis of cancer patient survival. Eur J Cancer 2002; 38(5):690-5.
[16] SEER Cancer Statistics Review. Surveillance, Epidemiology, and End Results (SEER) program (www.seer.cancer.gov) public-use data (1973-2000). National Cancer Institute, Surveillance Research Program, Cancer Statistics Branch, released April 2003. Based on the November 2002 submission. Diagnosis period 1973-2000, Registries 1-9.
[17] Hankey BF. The surveillance, epidemiology, and end results program: a national resource. Cancer Epidemiol Biomarkers Prev 1999; 8:1117-21.
[18] Dempster AP, Laird NM, Rubin DB (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm. J R Stat Soc Series B 39: 1-38.
[19] Rubin DB, Schenker N (1991) Multiple Imputation in Health-Care Databases - an overview and some applications. Stat Med 10: 585-598.
[20] Quinlan J. C4.5: programs for machine learning. San Mateo, CA: Morgan Kaufmann; 1993.
[21] Cristianini N, Shawe-Taylor J (2000) An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, London: Cambridge University Press.
[22] Joachims T (1998) Making large-scale support vector machine learning practical. Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, 169-184.
[23] Ziegel, E. R. (2012). The Elements of Statistical Learning. Technometrics.
[24] Kotsiantis, S. B. (2013). Decision trees: a recent overview. Artificial Intelligence Review, 39(4), 261-283.
[25] Montano-Gutierrez, L. F., Ohta, S., Kustatscher, G., Earnshaw, W. C., & Rappsilber, J. (2016). Nano Random Forests to mine protein complexes and their relationships in quantitative proteomics data, 050302.
[26] Pudlo, P., Marin, J. M., Estoup, A., Cornuet, J. M., Gautier, M., & Robert, C. P. (2016). Reliable ABC model choice via random forests. Bioinformatics, 32(6), 859-866.
[27] Afanador, N. L., Smolinska, A., Tran, T. N., & Blanchet, L. (2016). Unsupervised random forest: a tutorial with case studies. Journal of Chemometrics, 30(5), 232-241.


[28] Geoffrey McLachlan and Thriyambakam Krishnan. The EM Algorithm and Extensions. John Wiley & Sons, New York, 1996.
[29] Geoffrey McLachlan and David Peel. Finite Mixture Models. John Wiley & Sons, New York, 2000.
[30] Yair Weiss. Bayesian motion estimation and segmentation. PhD thesis, Massachusetts Institute of Technology, May 1998.

