
A Hybrid Classification Algorithm Approach for Breast Cancer Diagnosis


Baraa M. Abed¹, Khalid Shaker¹, Hamid A. Jalab², Hothefa Shaker³, Ali Mohammed Mansoor², Ahmad F. Alwan⁴, Ihsan Salman Al-Gburi⁵

¹ College of Computer Science and Information Technology, University of Anbar, Ramadi, Iraq
² Faculty of Computer Science & Information Technology, University of Malaya, Malaysia
³ Modern College of Business and Sciences, Sultanate of Oman
⁴ College of Arts and Sciences, Department of Mathematical & Physical Sciences, University of Nizwa, Sultanate of Oman
⁵ Graduate School of Science and Engineering, Istanbul Kemerburgaz University, Turkey

{burasoft, khalidalhity}@gmail.com, {amidjalab, ali.mansoor}@um.edu.my, hothefa.shaker@mcbs.edu.om, ahmadfouad@unizwa.edu.om, Ihsan.algburi@ogr.kemerburgaz.edu.tr

Abstract—Early diagnosis of breast cancer is significantly important for treating the disease successfully; it is therefore necessary to develop techniques that can help physicians reach an accurate diagnosis. This study proposes a hybrid classification algorithm based on the Genetic Algorithm (GA) and the k-Nearest Neighbor algorithm (kNN). The GA is used for its primary purpose as an optimization technique, tuning kNN by selecting the best features and optimizing the value of k, while kNN performs the classification. The proposed algorithm is tested on two breast cancer datasets from the UCI Repository of Machine Learning Databases: the Wisconsin Breast Cancer Database (WBCD) and the Wisconsin Diagnosis Breast Cancer (WDBC) dataset, which differ in the number of attributes and the number of instances. The proposed algorithm was measured against different classifier algorithms on the same databases, and the evaluation results show that it achieves 99% accuracy.

Keywords—Breast Cancer Diagnosis; Classification algorithm; Genetic algorithm; k-Nearest Neighbor algorithm.

I. INTRODUCTION

Breast cancer (BC) is one of the major health concerns nowadays and one of the leading causes of death among women, being the most prevalent cancer type after lung cancer [1]. According to the report of the International Agency for Research on Cancer (IARC), the global burden of cancer increased to 14.1 million new cases and 8.2 million deaths in 2012 [2]. Early detection of breast cancer helps to decrease mortality and improve the quality of life of patients. Hence, early detection requires an accurate and reliable diagnosis that can differentiate benign from malignant tumours. Fortunately, rapidly developing diagnostic techniques that make diagnosis more accurate and treatment more effective have contributed to a significant reduction in the burden of the disease. Data mining is a tool for extracting information from huge amounts of data, and it is widely used nowadays in the health care industry [3]. Data mining techniques that have been developed for this purpose include the artificial neural network (ANN), the most common, along with the support vector machine (SVM), adaptive neuro-fuzzy inference system (ANFIS), k-nearest neighbor, decision tree (DT), case-based reasoning (CBR), and rough set theory (RST).

The most common method for breast cancer detection is mammography. The varying interpretation by radiologists of the images obtained from mammography has led to the use of other methods; Fine Needle Aspiration Cytology (FNAC) is one such method.

The aim of this study is to help physicians diagnose breast cancer early, since BC is mainly diagnosed only after most of the symptoms appear, even though the majority of women with BC at early stages are asymptomatic.

(This research is supported by research grant RG312-14AFR from the University of Malaya, Malaysia.)

II. RELATED WORK

A lot of research has been done on breast cancer diagnosis using the WBCD and WDBC datasets, and many techniques and methods are constantly being developed to achieve accurate and efficient diagnosis results. In [4], a new system was proposed for breast cancer classification using a hybrid of K-means and the Support Vector Machine (SVM). BC diagnosis based on a k-NN algorithm with different distances (Euclidean distance and Manhattan distance) was proposed in [5]. Breast cancer detection with a reduced feature set was discussed in [6] using various data mining techniques, such as the artificial neural network (ANN), k-nearest neighbor (k-NN), radial basis function neural network (RBFNN), and SVM, combined with feature reduction using Independent Component Analysis (ICA) to reduce the feature vector to a one-dimensional independent component (IC). In [7], a new system was proposed using the k-nearest neighbor algorithm (kNN) and Naïve Bayes with imputation techniques, which are used instead of removing the missing values from the Mammographic Mass data; the system is evaluated using different performance criteria such as accuracy, sensitivity, specificity, and ROC analysis.



A hybrid combination of SVM, particle swarm optimization (PSO), and cuckoo search (CS) in a novel machine-learning method was proposed in [8]. The method comprises two phases: the first uses CS to optimize the SVM parameters, in particular to identify the best initial parameters of the kernel function, and the second applies PSO to continue the SVM training and find the best SVM parameters. Least Squares Support Vector Machine (LS-SVM) and Differential Evolution (DE) were applied to BC diagnosis in [9]. In [10], two feature ranking algorithms with Case-Based Reasoning (CBR), an Adaptive Neuro-Fuzzy Inference System (ANFIS), and an ANN based on PSO were proposed for breast cancer classification. SVM combined with feature selection is utilized in the method proposed in [11] for the classification of breast cancer. The system in [12] uses the ReliefF algorithm to reduce the data dimension and a Bayesian network for breast cancer classification.

III. WISCONSIN BREAST CANCER DATASET

The datasets used in this study for breast cancer classification were downloaded from the UCI machine learning repository [13]. These datasets are the most commonly used by researchers who apply machine learning methods to breast cancer classification. Dr. William H. Wolberg collected them (1989–1991) at the University of Wisconsin–Madison Hospitals. Table I presents a brief description of these datasets. Each dataset is composed of classification instances described by a set of numerical features or attributes; the features are assessed from fine needle aspirates (FNA) of human breast tissue.

TABLE I. BREAST CANCER DATASETS

Dataset   No. of Attributes   No. of Instances   Class 1 (Benign)   Class 2 (Malignant)
WBCD      10                  699                458                241
WDBC      31                  569                357                212

A. Wisconsin Breast Cancer Database (WBCD)

The dataset has 699 instances and 10 attributes, including the class attribute. Each instance belongs to one of two possible classes: malignant or benign. Every attribute is represented as an integer between 1 and 10. The features are: uniformity of cell size, clump thickness, uniformity of cell shape, single epithelial cell size, marginal adhesion, bare nuclei, normal nucleoli, bland chromatin, and mitosis.

B. Wisconsin Diagnosis Breast Cancer (WDBC)

This dataset has 569 instances and 31 attributes, including the class attribute. Each instance belongs to one of two possible classes: malignant or benign. The attributes are derived from ten real-valued features computed for each cell nucleus: texture, radius, perimeter, smoothness, area, concavity, compactness, symmetry, concave points, and fractal dimension. These features describe the characteristics of the cell nuclei in the image [14]. The mean, standard error (SE), and the mean of the three largest values were calculated for every feature in each image, resulting in a total of 30 features.

IV. FEATURE SELECTION

Feature selection is one of the most important methods used in optimization approaches. Its main advantage is limiting the number of inputs to the classification model, which in turn increases performance and accuracy. Feature selection methods are divided into two classes: wrapper methods and filter methods [15]. In filter methods, subsets of the applied features are selected as a pre-processing step, completely independently of the classifier, while in wrapper methods the selected features are chosen based on classifier performance. In the medical diagnosis of BC, a small subset of features also means lower costs for diagnosis and tests.

V. K-NEAREST NEIGHBORS METHOD

The k-nearest neighbors algorithm is one of the machine learning algorithms. It rests entirely on the idea that "objects that are near each other will also have the same characteristics. Thus, if you know the characteristic features of one of the objects, you can also predict them for its nearest neighbor." This means that any new instance can be categorized by a majority vote of its k neighbors, where k is a positive odd integer.

kNN is among the most straightforward and simplest techniques of data mining. It is an example of memory-based classification, as the training examples must be kept in memory at run time [16]. The most common way to compute the distance is the Euclidean distance, especially when dealing with continuous values. Nearest neighbor classifiers are a class of non-parametric methods used in statistical classification. Fig. 1 shows the steps of the k-nearest neighbors algorithm.

• Assign a value to the parameter K, which represents the number of nearest neighbors.
• Given a new sample (y).
• Compute the distance between the instance (y) and the instances in the training set.
• Choose the k nearest neighbors of (y).
• Assign (y) to the majority class of its k nearest neighbors.

Fig. 1. The K-Nearest Neighbors Algorithm.
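To make the steps of Fig. 1 concrete, the following is a minimal Python sketch; the function names (euclidean, knn_classify) and the data layout (a list of (feature_vector, label) pairs) are our illustrative assumptions, not part of the original system.

```python
import math
from collections import Counter

def euclidean(p, q):
    # Distance between two feature vectors of the same length.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(y, training_set, k):
    # training_set: list of (feature_vector, class_label) pairs (assumed layout).
    # Compute the distance from y to every training instance and sort by it.
    by_distance = sorted(training_set, key=lambda item: euclidean(y, item[0]))
    # Keep the k nearest neighbors (k is a positive odd integer).
    k_nearest = [label for _, label in by_distance[:k]]
    # A majority vote among the k nearest neighbors decides the class of y.
    return Counter(k_nearest).most_common(1)[0][0]
```

For example, knn_classify(new_sample, training_set, k=3) would return the majority label among the three training instances closest to the new sample.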
VI. GENETIC ALGORITHM

Genetic algorithms are a type of meta-heuristic search that simulates the process of natural selection [17]. Genetic algorithms belong to the larger class of Evolutionary Algorithms (EA), which is primarily used to generate solutions to optimization problems.



The Genetic Algorithm (GA) is motivated by population genetics (including gene and heredity frequencies) and population-level evolution, as well as by an understanding of Mendelian structures (such as alleles, genes, and chromosomes) and their related mechanisms (including mutation and recombination) [18]. The genetic algorithm, developed by John Holland and his collaborators during the 1960s and 1970s, is an abstraction or model of biological evolution based on Charles Darwin's theory of natural selection. Genetic algorithms have numerous advantages over traditional optimization algorithms, the most noteworthy being their ability to handle complex problems and their parallelism. Genetic algorithms can deal with diverse types of optimization, whether the objective (fitness) function is stationary or non-stationary (changing with time), continuous or discontinuous, linear or nonlinear, or subject to random noise. As the individuals in a population (or any subgroup) behave like independent agents, the population can explore the search space in many directions concurrently, a feature that makes the technique ideal for parallel implementation. Different parameters, and even different groups of encoded strings, can be manipulated simultaneously [19].

VII. PROPOSED STUDY

The steps of the proposed study are as follows; a compressed code sketch of the evolutionary loop is given after the list.

Step 1: Assign N, the number of iterations.
Step 2: Randomly generate a population of popsize chromosomes. Each chromosome has length lchrom, which is essentially equal to noOfAttributes plus bitsForK, the number of bits allocated to the value of K for k-NN.
Step 3: Create folds by dividing the labelled data into subsets, one for training and the others for testing.
Step 4: Select the feature subset according to the chromosome generated in Step 2.
Step 5: Compute the distance between training samples and test samples according to each chromosome.
Step 6: For each test instance, find the nearest neighbor in the training data set based on the distances calculated in the previous step.
Step 7: For each test instance, predict its class from the class of the nearest training instance found in Step 6.
Step 8: A prediction is correct when the predicted class of a test instance is the same as its actual labelled class.
Step 9: Calculate the fitness (accuracy) by k-NN for each chromosome using the objective function, and save the chromosome with the best fitness value as bestIndividual. The classification accuracy is calculated as (number of correct predictions / total number of test instances).
Step 10: Select a pair of chromosomes from the old (current) generation to act as the parent chromosomes.
Step 11: Apply the crossover operation to the parent pair to generate two offspring chromosomes of the new (next) generation.
Step 12: Perform a mutation on each of the offspring chromosomes generated in Step 11.
Step 13: Repeat Steps 10 to 12 popsize/2 times to obtain a new generation of size popsize.
Step 14: Calculate the fitness value of each chromosome in the new generation. If the best chromosome in the new generation has a fitness value greater than that of bestIndividual, update bestIndividual with the best chromosome of the new generation.
Step 15: Check whether the convergence criterion is met; if yes, stop, otherwise go back to Step 10.
Step 16: Repeat from Step 3 until the number of iterations N reaches 0.
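The sketch below compresses Steps 2 and 9-16 into a single generational loop. The names (evolve, fitness) and the specific one-point crossover and single-allele mutation operators are our assumptions for illustration, not necessarily the exact operators used by the authors; the convergence test of Step 15 is folded into a fixed generation count for brevity.

```python
import random

def evolve(popsize, lchrom, n_generations, fitness):
    # Step 2: random Boolean population of popsize chromosomes.
    population = [[random.random() < 0.5 for _ in range(lchrom)]
                  for _ in range(popsize)]
    best = max(population, key=fitness)            # Step 9: bestIndividual
    for _ in range(n_generations):
        offspring = []
        for _ in range(popsize // 2):              # Step 13: popsize/2 pairs
            p1, p2 = random.sample(population, 2)  # Step 10: parent selection
            cut = random.randrange(1, lchrom)      # Step 11: one-point crossover
            c1 = p1[:cut] + p2[cut:]
            c2 = p2[:cut] + p1[cut:]
            for child in (c1, c2):                 # Step 12: flip one allele
                i = random.randrange(lchrom)
                child[i] = not child[i]
            offspring += [c1, c2]
        population = offspring
        candidate = max(population, key=fitness)   # Step 14: update best
        if fitness(candidate) > fitness(best):
            best = candidate
    return best
```

Here fitness stands for the NNforGA accuracy described in Section VII-D below.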
Fig. 2. Block Diagram of the Proposed System's Framework.

A. Fold Creation

To apply the cross-validation technique, the entire labelled dataset is divided into mutually exclusive folds. In this study, a 70-30 cross-validation scheme was used, with a new fold created for each iteration. 30 percent of the labelled data is randomly picked and used for testing and validation; this set is called the testSet. The rest of the data, called the trainingSet, is used for training.
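A minimal sketch of this 70-30 split, assuming the labelled data is held in a Python list, might look as follows; the names (make_folds, test_fraction) mirror the text but are our illustrative choices.

```python
import random

def make_folds(labeled_data, test_fraction=0.3):
    # Randomly pick 30 percent of the labelled data for testing/validation
    # (the testSet); the remaining 70 percent form the trainingSet.
    shuffled = list(labeled_data)   # copy so the original order is untouched
    random.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test_set = shuffled[:n_test]
    training_set = shuffled[n_test:]
    return training_set, test_set
```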
B. Chromosome Representation

GA uses a population of candidate solutions encoded by chromosomes. Each chromosome is an array of Boolean values representing the alleles, or genes. The i-th chromosome of the G-th generation can be represented as C_i^G = (a_1, a_2, ..., a_lchrom), where lchrom is the length of the chromosome.



Fig. 3. Representation of a chromosome: an array of lchrom Boolean alleles.

The length of the chromosome is essentially equal to noOfAttributes (the number of features or attributes) plus bitsForK (the number of bits allocated to the value of K for k-NN). Thus, lchrom = noOfAttributes + bitsForK. True and False values for an allele in a chromosome signify that the corresponding feature is on or off, respectively.
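In code, such a chromosome can be modelled as a Boolean list of length noOfAttributes + bitsForK; the following sketch (our names) shows how the two parts would be separated.

```python
def split_chromosome(chromosome, no_of_attributes):
    # Leading alleles are the feature on/off flags (True = feature selected);
    # the trailing bitsForK alleles encode the value of K.
    feature_flags = chromosome[:no_of_attributes]
    k_bits = chromosome[no_of_attributes:]
    return feature_flags, k_bits
```

With nine input features and bitsForK = 2, for example, lchrom would be 11.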
C. Initialization Phase

The GA aims to evolve a population of popsize lchrom-dimensional chromosomes, called individuals. These individuals encode the candidate solutions, i.e. C_i^G, i = 1, ..., popsize, which evolve toward the global optimum, where the index i denotes the i-th individual of the generation-G population. The initial population should cover the entire search space as far as possible, which is achieved by uniformly randomizing the alleles of the respective individuals with True or False values.

D. Objective Function

Fitness evaluation for each individual is performed using the objective function, which returns the individual's fitness value.

In the current study, a variation of the NN algorithm called NNforGA takes a chromosome (representing the selected features) as a parameter. It decodes the value of K and uses only the selected features to calculate the distance between instances. The classification accuracy (obtained using the selected features only) returned by NNforGA is set as the fitness value of the chromosome. The value of K used in NNforGA is decoded from the last bitsForK alleles of the chromosome, which are allocated for this purpose.

Fig. 4. Decoding the value of K: the last bitsForK alleles are read as a binary number with bit weights 2^(bitsForK-1), ..., 2^1, 2^0.

If the decoded value is an even integer, it is increased by 1 to make it an odd integer.
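A sketch of this decoding and of an NNforGA-style fitness evaluation follows. decode_k implements the binary reading of Fig. 4 plus the odd-integer adjustment; the classifier is passed in as a callable, since the paper does not fix an implementation, and all names here are ours.

```python
def decode_k(k_bits):
    # Read the trailing bitsForK alleles as a binary number, most
    # significant bit first (Fig. 4): weights 2^(bitsForK-1) ... 2^1 2^0.
    k = 0
    for bit in k_bits:
        k = (k << 1) | int(bit)
    # If the decoded value is even, increase it by 1 to make it odd.
    return k + 1 if k % 2 == 0 else k

def nn_for_ga_fitness(chromosome, no_of_attributes, classify, test_set):
    # classify(instance, k, feature_flags) -> predicted label.
    # The accuracy on the test set becomes the chromosome's fitness.
    feature_flags = chromosome[:no_of_attributes]
    k = decode_k(chromosome[no_of_attributes:])
    correct = sum(classify(x, k, feature_flags) == label
                  for x, label in test_set)
    return correct / len(test_set)
```

The evolve sketch of Section VII would receive this wrapped into a one-argument callable, e.g. fitness=lambda c: nn_for_ga_fitness(c, n, classify, test_set).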
E. Distance Computation

In this approach, the Euclidean distance was used; it is the metric normally used to find the k nearest neighbors, especially for continuous values. Considering two input vectors p and q, the Euclidean distance is the magnitude of the difference of the vectors, i.e. |p - q|, where both vectors have d dimensions, i.e. p = (p_1, p_2, ..., p_d) and q = (q_1, q_2, ..., q_d). The Euclidean distance between p and q is then

dist(p, q) = sqrt( Σ_{i=1..d} (p_i - q_i)^2 ),    (1)

where d is the number of features.

In general, NNforGA calculates the distance between two instances using only the selected feature set:

dist_C(p, q) = sqrt( Σ_{j=1..noOfAttributes} a_j (p_j - q_j)^2 ),    (2)

where, for the chromosome C,

a_j = 1 if feature j is selected, and a_j = 0 otherwise.    (3)

F. Classification

To classify a test instance, NN relies on its nearest neighbor in the trainingSet: the training instance with the least distance from the test instance determines the class of the test instance. Mathematically, the i-th test instance t_i is classified as

class(t_i) = class(x*), where x* = argmin over x in trainingSet of dist_C(t_i, x).
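Equations (2)-(3) and the classification rule above can be sketched together as follows, with the feature flags taken from the chromosome; the names are ours, and the 1-NN case is shown, matching Steps 6-7 of the proposed study.

```python
import math

def masked_distance(p, q, feature_flags):
    # Eqs. (2)-(3): only features whose allele is True contribute
    # to the Euclidean distance.
    return math.sqrt(sum((pi - qi) ** 2
                         for pi, qi, on in zip(p, q, feature_flags) if on))

def nn_classify(test_instance, training_set, feature_flags):
    # The training instance with the least (masked) distance from the
    # test instance determines its class.
    _, label = min(training_set,
                   key=lambda item: masked_distance(test_instance,
                                                    item[0], feature_flags))
    return label
```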


G. Performance Evaluation Criteria

The confusion matrix is a visualization tool commonly used to check a classifier's accuracy [20]. It illustrates the relationship between the predicted classes and the actual ones.



The effectiveness of the classification model is computed from the numbers of correct and incorrect classifications for every possible class, as categorized in the confusion matrix [21].

The entries of the confusion matrix that have a specific meaning in the context of our study are:

• True Negative (TN): the number of correct predictions in the benign class.
• False Negative (FN): the number of incorrect predictions for instances whose actual class is malignant.
• False Positive (FP): the number of incorrect predictions for instances whose actual class is benign.
• True Positive (TP): the number of correct predictions in the malignant class.

TABLE II. CONFUSION MATRIX

                        Predicted
                        Malignant             Benign
Actual   Malignant      True Positive (TP)    False Negative (FN)
         Benign         False Positive (FP)   True Negative (TN)

H. Accuracy Computation

Classification accuracy is computed using the following equation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)

where TP and TN are the True Positives and True Negatives, i.e. the positive and negative cases that were classified correctly. Positive cases are the records with the Malignant label and negative cases are the records with the Benign label. FP and FN are the False Positives and False Negatives, i.e. the negative cases incorrectly classified as positive and the positive cases incorrectly classified as negative, respectively.
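As an illustration, Eq. (4) and the four confusion-matrix entries of Table II can be computed as follows; this is a sketch with our names, treating Malignant as the positive class as in Table II.

```python
def confusion_counts(predicted, actual, positive="malignant"):
    # Tally the four confusion-matrix entries of Table II.
    tp = sum(p == a == positive for p, a in zip(predicted, actual))
    tn = sum(p == a != positive for p, a in zip(predicted, actual))
    fp = sum(p == positive != a for p, a in zip(predicted, actual))
    fn = sum(a == positive != p for p, a in zip(predicted, actual))
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    # Eq. (4): the proportion of all cases classified correctly.
    return (tp + tn) / (tp + tn + fp + fn)
```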
I. Results and Discussion

To assess the effectiveness of our method, we performed experiments on WBCD and WDBC. The important features are selected by the GA. Table III shows the optimal parameter values and the feature subsets selected by the NNforGA approach; among the models run, it achieved the highest classification accuracy on the testing dataset. The experiments use 70-30 cross-validation for training and testing. The optimal parameters, constant across the two datasets, are as follows: mutation probability = 0.7, crossover probability = 1.0, population size = 10. Table IV compares the proposed algorithm against the other methods.

TABLE III. OPTIMAL PARAMETERS AND FEATURE SUBSET SELECTION

Data    No. of Generations   No. of bits for K   K   Features Selected
WBCD    10                   2                   1   F1 F2 F3 F4 F5 F6
WDBC    30                   3                   2   F1 F2 F5 F6 F7 F9 F10 F11 F12 F16 F21 F22 F24 F25 F26 F27 F28 F30

TABLE IV. COMPARISON WITH OTHER METHODS

Reference   Method                           Accuracy (%)
[4]         K-means and SVM                  97.3
[5]         kNN                              98
[6]         kNN + ICA                        92.5
[7]         Naïve Bayes + kNN                81.6
[8]         PSO-based SVM                    91.3
[9]         Differential Evolution + SVM     99.7
[10]        NN + Adaptive Neuro-Fuzzy        83.6
[11]        SVM + feature selection          98.1
Proposed    kNN for GA                       99

VIII. CONCLUSION

In this study, we combined the genetic algorithm and the k-nearest neighbor algorithm to design an efficient classifier model for breast cancer classification. The model was applied to the Wisconsin breast cancer data. The proposed system achieves high classification performance: feature selection and optimization of the k value for the classifier are performed by a genetic algorithm, and the classification task is accomplished by the k-NN algorithm. A small feature subset containing only the important features was selected. The proposed algorithm was compared with different classifier algorithms, and the experimental results showed its effectiveness and its ability to obtain better results.

REFERENCES

[1] U.S. Cancer Statistics Working Group, United States Cancer Statistics: 1999–2008 Incidence and Mortality Web-based Report, Atlanta (GA): Department of Health and Human Services, Centers for Disease Control.

[2] D. M. Parkin, P. Pisani, and J. Ferlay, "Global cancer statistics," CA: A Cancer Journal for Clinicians, 49(1):33-64, 1999.



[3] X. Xiong, Y. Kim, Y. Baek, D. W. Rhee, and S. H. Kim, "Analysis of breast cancer using data mining & statistical techniques," Sixth International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing and First ACIS International Workshop on Self-Assembling Wireless Networks (SNPD/SAWN), pp. 82–87, 2005.

[4] B. Zheng, S. W. Yoon, and S. S. Lam, "Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms," Expert Systems with Applications, 41 (2014) 1476–1482.

[5] S. Medjahed, T. Ait Saadi, and A. Benyettou, "Breast cancer diagnosis by using k-nearest neighbor with different distances and classification rules," International Journal of Computer Applications (0975-8887), Volume 62, No. 1, January 2013.

[6] N. Kılıç, E. Bilgili, and A. Akan, "Breast cancer detection with reduced feature set," Computational and Mathematical Methods in Medicine, Volume 2015, Article ID 265138, 11 pages, http://dx.doi.org/10.1155/2015/265138.

[7] C. Güzel, M. Kaya, O. Yıldız, and H. Ş. Bilge, "Breast cancer diagnosis based on Naïve Bayes machine learning classifier with KNN missing data imputation."

[8] X. Liu and H. Fu, "PSO-based support vector machine with cuckoo search technique for clinical disease diagnoses," The Scientific World Journal, Volume 2014, Article ID 548483, 1-7.

[9] S. Soliman and E. ElHamd, "Classification of breast cancer using differential evolution and least squares support vector machine," International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), Volume 3, Issue 2, March–April 2014.

[10] M. Huang, Y. Hung, W.-M. Lee, K. Li, and T.-H. Wang, "Usage of case-based reasoning, neural network and adaptive neuro-fuzzy inference system classification techniques in breast cancer dataset classification diagnosis."

[11] M. Akay, "Support vector machines combined with feature selection for breast cancer diagnosis," Expert Systems with Applications, 36 (2009) 3240–3247.

[12] A. Fallahi and S. Jafari, "An expert system for detection of breast cancer using data preprocessing and Bayesian network," International Journal of Advanced Science and Technology, Vol. 34, September 2011.

[13] UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets.html

[14] G. Salama, M. B. Abdelhalim, and M. A. Zeid, "Breast cancer diagnosis on three different datasets using multi-classifiers," International Journal of Computer and Information Technology (2277-0764), Volume 01, Issue 01, September 2012, pp. 36-43.

[15] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, 3 (2003) 1157-1182.

[16] E. Alpaydin, "Voting over multiple condensed nearest neighbors," Artificial Intelligence Review, pp. 115–132, 1997.

[17] M. Mitchell, An Introduction to Genetic Algorithms, Cambridge, MA: MIT Press, 1996.

[18] J. Brownlee, Clever Algorithms: Nature-Inspired Programming Recipes, lulu.com, 2012.

[19] K. Shaker and S. Abdullah, "Controlling multi algorithms using round robin for university course timetabling problem," Database Theory and Application, Bio-Science and Bio-Technology (DTA 2010), Lecture Notes in Computer Science, Springer, pp. 47-55, 2010.

[20] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2000.

[21] P. Cabena, P. Hadjinian, R. Stadler, J. Verhees, and A. Zanasi, Discovering Data Mining: From Concept to Implementation, Upper Saddle River, NJ: Prentice Hall, 1998.

