
Journal of Critical Reviews

ISSN- 2394-5125 Vol 7, Issue 6, 2020

Review Article

AN ANALYSIS ON FEATURE SELECTION METHODS, CLUSTERING AND CLASSIFICATION USED IN HEART DISEASE PREDICTION – A MACHINE LEARNING APPROACH
A. Ann Romalt¹, R. Mathusoothana S. Kumar²
¹Department of CSE, Stella Mary’s College of Engineering, Nagercoil, India. romaltnba@gmail.com
²Department of IT, Noorul Islam Center for Higher Education, Nagercoil, India. rmsskdhujaa@gmail.com

Received: 15.02.2020 Revised: 08.03.2020 Accepted: 02.04.2020

Abstract
Recent research shows that heart disease is an increasingly common cause of death. Millions of people lose their lives to heart failure each year.
Heart disease, also known as cardiovascular disease, can be prevented when it is diagnosed at an early stage.
Prediction with machine learning can support this diagnosis. A machine learning algorithm relies on a set of important
features for prediction. Feature selection plays an important role in prediction, since not every attribute in a dataset can be used; only the
relevant features are used. In this paper we discuss some of the feature selection methods used in data mining for the prediction of heart
disease.
Keywords: Classification, Regression, Prediction, Optimization, Probabilistic, Intelligent.

© 2019 by Advance Scientific Research. This is an open-access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
DOI: http://dx.doi.org/10.31838/jcr.07.06.27

INTRODUCTION
Cardiovascular disease is one of the major causes of death today. According to one survey, 1 in 4 deaths is caused by heart failure [1]. In 2016 the ten leading causes accounted for more than half (54%) of all deaths worldwide, and ischaemic heart disease and stroke together caused around 15.2 million of them [2]. The risk factors of cardiovascular disease include smoking, high LDL, high blood pressure, physical inactivity, obesity, stress, etc. [3]. By considering these features we can predict the chance of heart disease. For the prediction of heart disease, we use machine learning algorithms. Various types of machine learning algorithm are widely used on biomedical data, such as i) Supervised ii) Unsupervised iii) Semi-supervised iv) Reinforcement v) Evolutionary algorithms vi) Deep learning [4]. Machine learning can be classified as shown in Figure 1.1.

[Diagram: Machine Learning branches into Supervised (Classification, Regression), Unsupervised (Clustering) and Reinforcement learning]
Figure 1.1: Types of Machine Learning

Supervised Learning
In supervised learning, a known quantity of data, called training tuples, is used for prediction. The two forms of supervised learning are classification and regression. Classification categorizes the data based on the training data set. Regression is a statistical technique used to predict the target quantity when that quantity is continuous. Some machine learning algorithms that use supervised learning are SVM, discriminant analysis, Naïve Bayes and nearest neighbor [5].

Unsupervised Learning
In unsupervised learning the training data are neither classified nor labeled. An example of unsupervised learning is clustering [5].

Reinforcement Learning
In reinforcement learning a software agent senses and acts in its environment and learns to take the optimal action to achieve its goal. This type of machine learning is mainly used in game theory, control theory, information theory, simulation-based optimization, etc. [6].

Because of its accuracy in detecting and predicting disease, machine learning offers an automated approach to diagnosis. For the prediction of heart disease, data mining algorithms such as Naïve Bayes, neural networks, decision trees and SVM are used [7].

For these machine learning methods the UCI data repository is used. The heart disease dataset has 76 attributes in total [8]. Processing all of these attributes does not produce the desired output, and accurate prediction is not possible when every attribute is used; around 13 attributes are relevant. A feature selection method is applied to the dataset to select these attributes. The selected attributes are then used for clustering: algorithms such as K-Means and hierarchical methods group the data into clusters based on the data points. After clustering, the next step is classification, a prediction process in which data are assigned to classes by a classifier [9]. The training tuples are applied at this stage to form a predicted output. The trend and growth in attributes affect the performance of the output.
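As a rough illustration of how feature selection narrows a dataset to its relevant attributes, the sketch below scores each attribute by its absolute Pearson correlation with the class label and keeps the top k. The choice of score is an assumption made for illustration (the paper does not prescribe one), and the column names and values are hypothetical toy records, not UCI data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / math.sqrt(var_x * var_y)

def select_top_k(columns, labels, k):
    """Rank named feature columns by |correlation with the label| and keep k."""
    scores = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Hypothetical toy records; 1 = heart disease present, 0 = absent.
columns = {
    "age":        [63, 37, 41, 56, 57, 44],
    "chol":       [233, 250, 204, 236, 354, 263],
    "patient_id": [1, 2, 3, 4, 5, 6],
}
labels = [1, 0, 0, 1, 1, 0]

selected, scores = select_top_k(columns, labels, k=2)
print(selected)
```

On this toy data the identifier column scores lowest and is dropped, which is the behaviour one would hope for from an irrelevant attribute.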

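The grouping of data points by K-Means can be sketched as follows. This is a minimal version under stated assumptions: the initial centroids are supplied by the caller, distance is squared Euclidean, and the points are hypothetical values rather than patient records.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on tuples of floats, starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(len(centroids)),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return assignment, centroids

# Hypothetical two-dimensional points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centers = kmeans(points, centroids=[points[0], points[3]])
print(clusters, centers)
```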

The paper is organized as follows. Chapter II discusses feature selection in machine learning. Chapter III reviews related work on feature selection. Chapter IV depicts clustering and classification in machine learning. Chapter V describes the proposed model. Chapter VI describes the performance of the different methods used in prediction, and Chapter VII concludes the work.

FEATURE SELECTION IN MACHINE LEARNING
For accurate prediction, feature selection is important. Data mining algorithms use feature selection methods to select the desired features from the dataset. These features or attributes should be loaded directly into memory for preprocessing. Feature selection is a process in which only a subset of the appropriate features is selected. It identifies the few most important attributes and helps to predict the outcome. It is a form of dimensionality reduction used for preprocessing. The difference between feature selection and dimensionality reduction in general is that feature selection reduces the number of attributes without changing the data itself. Since feature selection leaves fewer parameters to deal with, it reduces complexity.

There are various feature selection methods applied in classification: i) filter methods, ii) wrapper methods and iii) embedded methods. Filter methods select features based on their scores in various statistical correlation tests. Wrapper methods use a greedy approach: they evaluate candidate combinations of features against the learning algorithm and keep the combination that gives the best result. The embedded method combines the advantages of the other two.
Different statistical methods are used for feature selection in machine learning. Filter-based feature selection is normally used for machine learning on large datasets [10]. The wrapper method uses a classification algorithm to measure performance, whereas the filter method does not use a classification algorithm and relies on statistical measures instead.

RELATED WORK
In this paper we conducted a literature survey focused on feature selection methods used in heart disease prediction. Sreevani and Murthy proposed a feature selection method that produces a reduced subset without transforming the data [11]. A compound feature selection framework is also mentioned in that work.

Feature selection provides a subset of the data. Too many features are not good for machine learning since they add complexity; for efficient prediction the feature selection algorithm should be simple and the classifier used should be accurate.

B. Kolukisa et al. proposed a hybrid feature selection method [12]. It uses a UCI dataset for processing.

In paper [13] Shafenoor Amin et al. explained different combinations of features for accurate prediction of heart disease. Since there is no exact algorithm for the prediction of heart disease, performance is judged by the accuracy of prediction. Seven classification techniques (k-NN, Decision Tree, Naïve Bayes, Logistic Regression, Vote, Support Vector Machine and Neural Network) were used for prediction. From the available UCI dataset attributes, a set of 13 attributes was selected. A brute-force technique was applied to the selected features, with each chosen subset containing a minimum of 3 attributes. The number of feature combinations is given by $2^n - \left(\frac{n^2 + n}{2} + 1\right)$, where $n$ is the total number of features. The Vote, Naïve Bayes and SVM algorithms were combined to form a better prediction model. For $n = 13$ the formula gives 8100 attribute combinations, which were used in the feature selection method for accuracy.

In paper [14] Nilashi et al. proposed a machine learning method for prediction that combines fuzzy rules with Expectation Maximization (EM), Principal Component Analysis (PCA) and Classification and Regression Trees (CART). In this paper [14] the desired features are selected using PCA. EM is used for clustering, and PCA is used to address multicollinearity in the datasets.
In [15] ECG signals are preprocessed and 168 features are obtained for each recording. The features are then selected by recursive feature elimination (RFE). The feature selection method proposed in [15] is mainly used for large datasets.

Feature selection using a wrapper method was proposed by Mohammed et al. [16]. In RFE, the features are ranked by repeatedly training a model and removing the features with the smallest score. The score $c_k = w_k^2$ is used, where $w_k$ is the weight of feature $k$ in a Support Vector Machine (SVM). The features with high ranking are retained and selected for classification.
In paper [17] a hybrid method for feature selection is introduced. It proposes a method that uses a genetic algorithm to solve the feature selection problem. The features are classified into strong, weak and unstable features, and priority is assigned based on the feature type.

CLUSTERING AND CLASSIFICATION IN MACHINE LEARNING
Clustering and classification play a vital role in data mining for prediction. The features are selected based on different criteria so that classification is efficient. Classification predicts the resultant output based on the clusters. Different methods are used to form clusters in machine learning. Normally Naïve Bayes is used for classification. It is a probabilistic classifier based on Bayes' theorem: it calculates the probability of each class given the input data and helps to predict the class of an unknown data sample [16]. In that paper several classification algorithms are used to predict diabetes; SVM, Random Forest, Naïve Bayes, Decision Tree and KNN are compared and their prediction performance is listed. A heart disease prediction system based on Sequential Minimal Optimization (SMO) was proposed by Alizadehsani et al. [17]. The SVM-based classification algorithm was improved in this method. Several algorithms, including Naïve Bayes and SMO, are used. A heart disease prediction method based on CFS and Bayes' theorem was proposed by T. J. Peter et al. [18]. In this paper various classification algorithms such as Naïve Bayes, Decision Tree and KNN are compared with the proposed method, which uses CFS for attribute selection and a Naïve Bayes classifier for prediction.

Jesmin Nahar et al. proposed medical knowledge driven heart disease prediction [19]. In this method classification is based on medical knowledge driven feature selection (MFS) and computerized feature selection (CFS). The combination MFS+CFS is used for prediction, and its accuracy is compared across various classification algorithms such as NB, DT and AdaBoost. In paper [20] Shao et al. proposed a hybrid intelligent method for heart disease prediction. In this method logistic regression (LR), multivariate adaptive regression splines (MARS), artificial neural network (ANN) and rough set (RS) techniques are used for prediction. Among the proposed models, the hybrid MARS-ANN model was the best alternative. Since fewer variables are used in this method, better classification accuracy can be obtained. A weighted fuzzy rule-based heart disease prediction system was proposed by Anooj [21]. Two steps are followed for risk prediction: the first generates weighted fuzzy rules and the second develops a fuzzy rule-based system. The fuzzy rules are generated automatically in this method.

A feature selection technique for heart disease prediction was proposed by Swati Shilaskar et al. [22]: prediction of disease with reduced attributes using hybrid forward feature selection. A hill-climbing-based approach was followed for selecting the features.

Nilashi et al. [23] proposed an analytical method for disease prediction. In it, machine learning algorithms for clustering and noise removal are used for prediction. Classification and Regression Trees (CART) are used to generate fuzzy rules for the knowledge-based system.
A firefly-based algorithm for prediction of heart disease was proposed by Long et al. [24]. A type-2 fuzzy logic system was used in this method. A hybrid fuzzy clustering algorithm and genetic approach was used for prediction. For feature selection a chaos firefly-based optimal approach is followed. The proposed system is compared with SVM and Naïve Bayes to show the accuracy.
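The RFE procedure described in the related work (train a model, score each feature by its squared weight, drop the lowest-scoring feature, repeat) can be sketched as below. This is an illustrative version under stated assumptions, not the implementation used in the cited papers: a small linear SVM trained by full-batch subgradient descent stands in for a library SVM, the hyperparameters `lr`, `lam` and `epochs` are arbitrary choices, and the data are hypothetical (feature 1 is a scaled copy of feature 0, feature 2 is constant).

```python
def train_linear_svm(rows, labels, lr=0.1, lam=0.01, epochs=200):
    """Full-batch subgradient descent on the L2-regularised hinge loss."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wk for wk in w], 0.0
        for x, y in zip(rows, labels):
            margin = y * (sum(wk * xk for wk, xk in zip(w, x)) + b)
            if margin < 1:  # only margin violators contribute to the subgradient
                for k in range(d):
                    grad_w[k] -= y * x[k] / n
                grad_b -= y / n
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w

def rfe_ranking(rows, labels):
    """Return feature indices ordered from most to least important."""
    remaining = list(range(len(rows[0])))
    eliminated = []
    while remaining:
        reduced = [[row[k] for k in remaining] for row in rows]
        weights = train_linear_svm(reduced, labels)
        scores = [wk ** 2 for wk in weights]  # ranking score c_k = w_k^2
        worst = min(range(len(remaining)), key=scores.__getitem__)
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]  # last eliminated = most important

# Hypothetical toy data with labels in {+1, -1}.
rows = [[2.0, 0.2, 0.0], [1.5, 0.15, 0.0], [-1.0, -0.1, 0.0], [-2.5, -0.25, 0.0]]
labels = [1, 1, -1, -1]
ranking = rfe_ranking(rows, labels)
print(ranking)
```

The constant feature receives zero weight and is removed first; the weakly scaled copy goes next, leaving the informative feature ranked highest.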

Table 1: Methods used in Prediction

| Author | Year of publication | Title of paper | Methodology used | Data set used | Advantage | Disadvantage |
|---|---|---|---|---|---|---|
| B. Kolukisa et al. | 2018 | Evaluation of Classification Algorithms, Linear Discriminant Analysis and a New Hybrid Feature Selection Methodology for the Diagnosis of Coronary Artery Disease | Hybrid feature selection method and several classification algorithms, including ensemble classifiers | Z-Alizadehsani and Cleveland heart datasets | Attention to sensitivity, specificity, F-measure, AUC and running time | Ensemble methods multiply the complexity of the original model |
| Shafenoor Amin et al. | 2018 | Identification of significant features and data mining techniques in predicting heart disease | A combination of features was selected to be used with 7 classification techniques: k-NN, Decision Tree, Naïve Bayes, Logistic Regression, Vote, Support Vector Machine and Neural Network | UCI dataset | Performance is measured based on accuracy and precision for ease of prediction | 7 data mining techniques on 8100 combinations of features require a very long computation time |
| Haotian Shi et al. | 2019 | A hierarchical method based on weighted extreme gradient boosting in ECG heartbeat classification | Recursive feature elimination is employed for feature selection from a large number of features; a hierarchical classifier based on weighted extreme gradient boosting (XGBoost) and threshold classifiers is constructed | ECG heartbeat data set | Can be used on large datasets | An ensemble classifier is used, which multiplies the complexity |
| Mohammed, T.A. et al. | 2019 | Hybrid Efficient Genetic Algorithm for Big Data Feature Selection Problems | Three new genetic algorithms (LWGGA, HWGGA and WGGA) are proposed and incorporated in the ANN algorithm | UCI dataset | No domain knowledge is required; reduces time | Not suitable for large datasets |
| Roohallah Alizadehsani et al. | 2013 | A data mining approach for diagnosis of coronary artery disease | Naïve Bayes, SMO, Bagging and Neural Network are used | Z-Alizadeh Sani dataset | | |
| T. J. Peter et al. | 2012 | An empirical study on prediction of heart disease using classification data mining techniques | Hybrid CFS + Naïve Bayes is used for prediction | UCI dataset | Easy to predict using Naïve Bayes | Accuracy is lower than that of recent methods |
| Jesmin Nahar et al. | 2013 | Computational intelligence for heart disease diagnosis: A medical knowledge driven approach | This work highlights the potential of an expert-judgment-based (i.e., medical knowledge driven) feature selection process (termed MFS) | UCI dataset | Both MFS and CFS improved prediction rates in terms of accuracy | Needs medical knowledge for prediction |
| Yuehjen E. Shao et al. | 2014 | Hybrid intelligent modelling schemes for heart disease classification | A hybrid multivariate adaptive regression splines (MARS) and artificial neural network (ANN) algorithm | UCI dataset | Least number of explanatory variables | Less accurate (82% accuracy) |
| P.K. Anooj | 2012 | Clinical decision support system: Risk level prediction of heart disease using weighted fuzzy rules | Fuzzy-based input, single-output Mamdani fuzzy method is used | UCI dataset | Automatic procedure for generation of fuzzy rules | Performs better than neural-network-based systems |
| Swati Shilaskar et al. | 2013 | Feature selection for medical diagnosis: evaluation of cardiovascular disease | Hybrid forward feature selection | UCI dataset | Reduces feature dimension; improves classification accuracy | Computational complexity |
| Nilashi et al. | 2017 | An Analytical Method for Diseases Prediction Using Machine Learning Techniques | EM, PCA, CART and fuzzy rule-based techniques | UCI dataset | A separate algorithm is used for each module of the proposed method for accurate prediction | An efficient feature selection method is needed for reducing the attributes |
| Long et al. | 2015 | A highly accurate firefly based algorithm for heart disease prediction | A hybrid fuzzy clustering algorithm and genetic approach | UCI dataset | Produces better accuracy than Naïve Bayes and SVM | Computational complexity |
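The figure of 8100 feature combinations listed for Amin et al. in Table 1 can be checked by counting the subsets of 13 attributes that contain at least 3 attributes. The sketch below compares the closed form 2^n - ((n^2 + n)/2 + 1) with an explicit enumeration; the function names are illustrative only.

```python
from itertools import combinations

def subsets_with_at_least_three(n):
    """Closed form: all 2^n subsets minus those of size 0, 1 and 2."""
    return 2 ** n - ((n * n + n) // 2 + 1)

def brute_force_count(n):
    """Enumerate subsets of size >= 3 explicitly (only sensible for small n)."""
    return sum(1 for size in range(3, n + 1) for _ in combinations(range(n), size))

count_13 = subsets_with_at_least_three(13)
print(count_13)
```

The subtraction works because there is 1 empty subset, n singletons and n(n-1)/2 pairs, and n + n(n-1)/2 = (n^2 + n)/2.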

PROPOSED MODEL
The proposed model for heart disease prediction is shown in Figure 1.2.

Figure 1.2: Proposed Model

We propose a novel approach for heart disease prediction. A real-world data set (UCI) is used. The data set is preprocessed and a feature selection method is then applied to the resulting data. Only the selected attributes are used, for accurate prediction and to reduce complexity. Clustering is used to group the data; a hybrid clustering method is followed for better results. A classification algorithm with a suitable classifier is selected based on performance. The output obtained is the prediction of whether the patient is at risk of cardiovascular disease.

PERFORMANCE EVALUATION
The performance is evaluated on the basis of accuracy and the frequently used machine learning methods. For evaluation purposes we use Table 1. It summarizes the papers used in the survey and shows that the UCI dataset is used more frequently in machine learning than other datasets. Figure 1.3 shows the percentage of UCI dataset use relative to other datasets.

[Pie chart "Data set used": two slices, UCI and other datasets, labelled 31% and 69%]
Figure 1.3: Usage of UCI dataset

When analyzing the performance and the classification used, a hybrid classification method is well suited for accurate prediction. From Table 1, a combined approach in machine learning is well suited for better performance. Further, we can conclude that a hybrid approach is followed in recent prediction methods. A hybrid probabilistic approach with a Naïve Bayes classifier is well suited to heart disease prediction. This is summarized in Figure 1.4.

[Bar chart "Machine Learning Algorithm used": vertical axis from 0 to 60; category labels not recovered]
Figure 1.4: Comparison of various approaches
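Since the evaluation favours approaches built around a Naïve Bayes classifier, a minimal sketch of such a classifier may help. It handles categorical attributes, applies Bayes' theorem with the usual independence assumption, and uses Laplace smoothing; the attribute values and class names are hypothetical, and the smoothing choice is an assumption rather than something specified in the surveyed papers.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-feature value frequencies per class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    vocab = defaultdict(set)             # feature index -> values seen in training
    for row, label in zip(rows, labels):
        for k, value in enumerate(row):
            value_counts[(k, label)][value] += 1
            vocab[k].add(value)
    return class_counts, value_counts, vocab

def predict(model, row):
    """Return the class with the highest posterior (log space, Laplace smoothing)."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for k, value in enumerate(row):
            numerator = value_counts[(k, label)][value] + 1
            denominator = count + len(vocab[k])
            score += math.log(numerator / denominator)  # log likelihood
        if score > best_score:
            best_class, best_score = label, score
    return best_class

# Hypothetical records: (chest pain type, exercise-induced angina).
rows = [("typical", "yes"), ("typical", "yes"), ("atypical", "no"),
        ("none", "no"), ("none", "no"), ("typical", "no")]
labels = ["risk", "risk", "no-risk", "no-risk", "no-risk", "risk"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("typical", "yes")), predict(model, ("none", "no")))
```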


CONCLUSION
We conclude this paper having discussed various feature selection methods, the attributes used, the datasets used, the performance of classifiers, the limitations of each method, and the methods used in prediction. Comparing the various measures shows that feature selection produces the best output for prediction. Feature selection provides a subset of the data. Too many features are not good for machine learning since they add complexity, so the best feature selection method is chosen based on the analysis. The classification algorithms give an insight into machine learning methods and their predictive performance. In the above analysis, the feature selection methods and the accuracy of the classification techniques are discussed and their performance is analyzed. The predictive task becomes faster when features are chosen correctly. The machine learning algorithm with optimal accuracy will yield a good prediction.

REFERENCES
1. https://www.medicalnewstoday.com/articles/282929.php
2. https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death
3. https://www.webmd.com/heart-disease/risk-factors-for-heart-disease#1
4. B. D. Kanchan and M. M. Kishor, "Study of machine learning algorithms for special disease prediction using principal of component analysis," 2016 International Conference on Global Trends in Signal Processing, Information Computing and Communication (ICGTSPICC), Jalgaon, 2016, pp. 5-10.
5. Gangadhar Shobha, Shanta Rangaswamy, "Chapter 8 - Machine Learning," Editor(s): Venkat N. Gudivada, C.R. Rao, Handbook of Statistics, Elsevier, Volume 38, 2018, Pages 197-228.
6. https://www.geeksforgeeks.org/what-is-reinforcement-learning/
7. Animesh Hazra et al., "Heart Disease Diagnosis and Prediction Using Machine Learning and Data Mining Techniques: A Review," Advances in Computational Sciences and Technology, 10, 2017, Pages 2137-2159.
8. https://archive.ics.uci.edu/ml/support/heart+Disease
9. https://medium.com/@Mandysidana/machine-learning-types-of-classification-9497bd4f2e14
10. https://www.cs.waikato.ac.nz/~mhall/thesis.pdf
11. Sreevani and C. A. Murthy, "Bridging Feature Selection and Extraction: Compound Feature Generation," IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 4, pp. 757-770, 1 April 2017.
12. B. Kolukisa et al., "Evaluation of Classification Algorithms, Linear Discriminant Analysis and a New Hybrid Feature Selection Methodology for the Diagnosis of Coronary Artery Disease," 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 2018, pp. 2232-2238.
13. Shafenoor Amin et al., "Identification of significant features and data mining techniques in predicting heart disease," Telematics and Informatics (2018).
14. Mehrbakhsh Nilashi et al., "An analytical method for diseases prediction using machine learning techniques," Computers & Chemical Engineering, Volume 106, 2017, Pages 212-223.
15. Haotian Shi et al., "A hierarchical method based on weighted extreme gradient boosting in ECG heartbeat classification," Computer Methods and Programs in Biomedicine, Volume 171, 2019, Pages 1-10.
16. Mohammed, T.A., Bayat, O., Uçan, O.N. et al., "Hybrid Efficient Genetic Algorithm for Big Data Feature Selection Problems," Found Sci (2019).
17. Roohallah Alizadehsani et al., "A data mining approach for diagnosis of coronary artery disease," Computer Methods and Programs in Biomedicine, Volume 111, Issue 1, 2013, Pages 52-61.
18. T. J. Peter and K. Somasundaram, "An empirical study on prediction of heart disease using classification data mining techniques," IEEE-International Conference On Advances In Engineering, Science And Management (ICAESM-2012), pp. 514-518.
19. Jesmin Nahar et al., "Computational intelligence for heart disease diagnosis: A medical knowledge driven approach," Expert Systems with Applications, Volume 40, Issue 1, 2013, Pages 96-104.
20. Yuehjen E. Shao et al., "Hybrid intelligent modeling schemes for heart disease classification," Applied Soft Computing, Volume 14, Part A, 2014, Pages 47-52.
21. P.K. Anooj, "Clinical decision support system: Risk level prediction of heart disease using weighted fuzzy rules," Journal of King Saud University - Computer and Information Sciences, Volume 24, Issue 1, 2012, Pages 27-40.
22. Swati Shilaskar, Ashok Ghatol, "Feature selection for medical diagnosis: Evaluation for cardiovascular diseases," Expert Systems with Applications, Volume 40, Issue 10, 2013, Pages 4146-4153.
23. Mehrbakhsh Nilashi et al., "An analytical method for diseases prediction using machine learning techniques," Computers & Chemical Engineering, Volume 106, 2017, Pages 212-223.
24. Nguyen Cong Long, Phayung Meesad, Herwig Unger, "A highly accurate firefly-based algorithm for heart disease prediction," Expert Systems with Applications, Volume 42, Issue 21, 2015, Pages 8221-8231.
