An Analysis On Feature Selection Methods, Clustering and Classification Used in Heart Disease Prediction - A Machine Learning Approach
Review Article
Abstract
Recent research shows that heart disease is an increasingly common cause of death. Around billion people will lose their lives
because of heart failure in a year. Heart disease, also known as cardiovascular disease, can be prevented through diagnosis at an early stage.
Prediction with machine learning can help diagnose this disease. A machine learning algorithm relies on a set of important
features for prediction. Feature selection plays an important role in prediction because not every attribute in a dataset can be used; only the
relevant features are used. In this paper we discuss some of the feature selection methods used in data mining for the prediction of heart
disease.
Keywords: Classification, Regression, Prediction, Optimization, Probabilistic, Intelligent.
© 2019 by Advance Scientific Research. This is an open-access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/)
DOI: http://dx.doi.org/10.31838/jcr.07.06.27
The paper is organized as follows. Chapter II discusses feature selection in machine learning. Chapter III reviews related work on feature selection. Chapter IV covers clustering and classification in machine learning. Chapter V describes the proposed model. Chapter VI describes the performance of the different methods used in prediction, and Chapter VII concludes the work.

FEATURE SELECTION IN MACHINE LEARNING
Feature selection is important for accurate prediction. Data mining algorithms use feature selection methods to select the desired features from the dataset. These features or attributes should be loaded directly into memory for preprocessing. Feature selection is a process in which only a subset of the appropriate features is selected. It identifies the few most important attributes and helps to predict the outcome. It is a form of dimensionality reduction used for preprocessing. The difference between feature selection and dimensionality reduction in general is that feature selection reduces the number of attributes without changing the data itself. Since feature selection leaves fewer parameters to deal with, it reduces complexity.

Various feature selection methods are applied in classification: i) the filter method, ii) the wrapper method and iii) the embedded method. Filter methods select features based on their scores in various statistical correlation tests. The wrapper method uses a greedy approach to feature selection. It evaluates all possible combinations of features and produces the result for the machine learning model. The embedded method combines the advantages of the other two methods.

Different statistical methods are used for feature selection in machine learning. Filter-based feature selection is normally used for machine learning on large datasets [10]. The wrapper method uses a classification algorithm to measure performance, whereas the filter method does not use a classification algorithm and relies on a statistical approach instead.

RELATED WORK
In this paper we conducted a literature survey focused on feature selection methods used in heart disease prediction. Sreevani and Murthy proposed a feature selection method which produces a reduced subset without transforming the data [11]. A compound feature selection framework is mentioned.

Feature selection provides a subset of the data. Too many features are not good for machine learning because they add complexity. For efficient prediction, the feature selection algorithm should be simple and the classifier should be accurate.

B. Kolukisa et al. proposed a hybrid feature selection method [12]. It uses a UCI dataset for processing.

In paper [13] Shafenoor Amin et al. explained different combinations of features for accurate prediction of heart disease. Since there is no exact algorithm for prediction of heart disease, performance is judged by prediction accuracy. Seven classification techniques (k-NN, Decision Tree, Naïve Bayes, Logistic Regression, Vote, Support Vector Machine and Neural Network) were used for prediction. From the available UCI dataset attributes, a set of 13 attributes is selected. A brute-force technique was applied to the selected features, with each chosen subset containing a minimum of 3 attributes. The number of feature combinations is given by $2^n - \left(\frac{n^2+n}{2} + 1\right)$, where $n$ is the total number of features. The Vote, Naïve Bayes and SVM algorithms are combined to form a better prediction model. For $n = 13$ this gives 8100 attribute combinations, which are used in the feature selection method for accuracy.

In paper [14] Nilashi et al. proposed a machine learning method that combines fuzzy rules with Expectation Maximization (EM), Principal Component Analysis (PCA), and Classification and Regression Trees (CART) for prediction. The desired features are selected using PCA. EM is used for clustering, and PCA is used to address multicollinearity in the datasets.

In [15] ECG signals are preprocessed and 168 features are selected for each recording. The features are selected by recursive feature elimination (RFE). The feature selection method proposed in [15] is mainly used for large datasets.

Feature selection using a wrapper method was proposed by Mohammed et al. [16]. In RFE, the features are ranked by repeatedly training a model and removing the feature with the smallest score. The score $c_k = w_k^2$ is used, where $w_k$ is the weight of feature $k$ in a Support Vector Machine (SVM). The features with high ranking are obtained and selected for classification.

In paper [17] a hybrid method for feature selection is introduced. It proposes a method that uses a genetic algorithm to solve the feature selection problem. The features are classified as strong, weak or unstable, and priority is assigned on that basis.

CLUSTERING AND CLASSIFICATION IN MACHINE LEARNING
Clustering and classification play a vital role in data mining for prediction. The features are selected based on different criteria so that efficient classification is achieved. Classification predicts the resultant output based on the clusters. Different methods are used to form clusters in machine learning. Naïve Bayes is normally used for classification. It is a probabilistic classifier based on Bayes' theorem: it calculates the probability of each class for the input data and helps to predict the class of an unknown data sample [16]. In that paper, several classification algorithms are used to predict diabetes. Algorithms such as SVM, Random Forest, Naïve Bayes, Decision Tree and KNN are compared and their prediction performance is listed. A heart disease prediction system based on Sequential Minimal Optimization (SMO) is proposed by Alizadehsani et al. [17]. The SVM-based classification algorithm was improved in this method. Several algorithms, including Naïve Bayes and SMO, are used in this method. A heart disease prediction method based on CFS and Bayes' theorem was proposed by T.J. Peter et al. [18]. In this paper, various classification algorithms such as Naïve Bayes, Decision Tree and KNN are compared with the proposed method, which uses CFS for attribute selection and a Naïve Bayes classifier for prediction.

Jesmin Nahar et al. proposed a medical-knowledge-driven heart disease prediction method [19]. In this method classification is based on Medical knowledge driven feature selection (MFS) and Computerized feature selection (CFS). The combination MFS+CFS is used for prediction, and the accuracy is compared across various classification algorithms such as NB, DT and AdaBoost. In paper [20] Shao et al. proposed a hybrid intelligent method for heart disease prediction. In this method logistic regression (LR), multivariate adaptive regression splines (MARS), artificial neural network (ANN), and rough set (RS) techniques are used for prediction. Among the proposed models, the hybrid MARS–ANN model was the best alternative. Since fewer variables are used in this method, better classification accuracy can be obtained. A weighted fuzzy rule-based heart disease prediction system was proposed by P.K. Anooj [21]. Two steps are followed for risk prediction. The first step generates weighted fuzzy rules and the second step develops a fuzzy rule-based system. The fuzzy rules are generated automatically in this method.

A feature selection technique for heart disease prediction was proposed by Swathi Shilaskar et al. [22]. The paper predicts disease with a reduced set of attributes using hybrid forward feature selection. A hill-climbing-based approach was followed for selecting the features.

Nilashi et al. [23] proposed an analytical method for disease prediction. Machine learning algorithms for clustering and noise removal are used for prediction. Classification and Regression Trees (CART) is used to generate fuzzy rules for the knowledge-based system.

A firefly-based algorithm for prediction of heart disease was proposed by Long et al. [24]. A type-2 fuzzy logic system was used in this method. A hybrid of a fuzzy clustering algorithm and a genetic approach was used for prediction. For feature selection, a chaos firefly-based optimal approach is followed. The proposed system is compared with SVM and Naïve Bayes to show the accuracy.
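
The recursive feature elimination loop described in the related work, with the ranking score $c_k = w_k^2$, can be sketched as follows. This is a minimal illustration and not the implementation from any cited paper: the small linear SVM trained by subgradient descent, its hyperparameters, the function names `train_linear_svm` and `rfe`, and the toy data are all assumptions introduced here.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Fit weights (no bias term) by full-batch subgradient descent
    on the L2-regularised hinge loss. Labels must be -1 or +1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]  # gradient of the regulariser
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1:  # sample violates the margin
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def rfe(X, y, n_keep):
    """Repeatedly retrain and drop the feature with the smallest
    score c_k = w_k^2 until n_keep features remain."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        X_sub = [[row[j] for j in active] for row in X]
        w = train_linear_svm(X_sub, y)
        scores = [wk ** 2 for wk in w]
        del active[scores.index(min(scores))]
    return active


# Toy data (assumed): feature 0 tracks the label exactly,
# feature 1 is a weak, partly wrong signal, feature 2 is noise.
y = [1, 1, 1, 1, -1, -1, -1, -1]
X = [
    [2.0, 0.5, 0.1],
    [2.0, 0.5, -0.1],
    [2.0, -0.5, 0.1],
    [2.0, 0.5, -0.1],
    [-2.0, -0.5, 0.1],
    [-2.0, -0.5, -0.1],
    [-2.0, 0.5, 0.1],
    [-2.0, -0.5, -0.1],
]

print(rfe(X, y, n_keep=1))  # → [0]
```

Retraining after each removal is what distinguishes RFE from a single ranking pass: the weights of the remaining features are re-estimated once a feature has been dropped.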
| Author | Year of Publication | Title of paper | Methodology used | Data set used | Advantage | Disadvantage |
|---|---|---|---|---|---|---|
| B. Kolukisa et al. | 2018 | Evaluation of Classification Algorithms, Linear Discriminant Analysis and a New Hybrid Feature Selection Methodology for the Diagnosis of Coronary Artery Disease | Hybrid feature selection method and several classification algorithms including ensemble classifiers | Z-Alizadeh Sani and Cleveland heart datasets | Attention to sensitivity, specificity, F-measure, AUC and running time | Ensemble methods multiply the complexity of the original model |
| Shafenoor Amin et al. | 2018 | Identification of significant features and data mining techniques in predicting heart disease | A combination of features was selected to be used with 7 classification techniques: k-NN, Decision Tree, Naïve Bayes, Logistic Regression, Vote, Support Vector Machine and Neural Network | UCI dataset | Performance is measured based on accuracy and precision for ease of prediction | 7 data mining techniques on 8100 combinations of features produce a vast computational time |
| Haotian Shi et al. | 2019 | A hierarchical method based on weighted extreme gradient boosting in ECG heartbeat classification | Recursive feature elimination is employed for feature selection from a large number of features. A hierarchical classifier based on weighted extreme gradient boosting (XGBoost) and threshold classifiers is constructed | ECG heartbeat dataset | Can be used on large datasets | An ensemble classifier is used, which multiplies the complexity |
| Mohammed, T.A. et al. | 2019 | Hybrid Efficient Genetic Algorithm for Big Data Feature Selection Problems | Three new genetic algorithms (LWGGA, HWGGA and WGGA) are proposed and incorporated in the ANN algorithm | UCI dataset | No domain knowledge is required; reduces time | Not suitable for large datasets |
| Roohallah Alizadehsani et al. | 2013 | A data mining approach for diagnosis of coronary artery disease | Naïve Bayes, SMO, Bagging and Neural Network are used | Z-Alizadeh Sani dataset | | |
| T. J. Peter et al. | 2012 | An empirical study on prediction of heart disease using classification data mining techniques | Hybrid CFS + Naïve Bayes is used for prediction | UCI dataset | Easy to predict using Naïve Bayes | Accuracy is lower than that of recent methods |
| Jesmin Nahar et al. | 2013 | Computational intelligence for heart disease diagnosis: A medical knowledge driven approach | This work highlights the potential of an expert judgment based (i.e., medical knowledge driven) feature selection process (termed MFS) | UCI dataset | Both MFS and CFS had improved prediction rates in terms of accuracy | Needs medical knowledge for prediction |
| Yuehjen E. Shao et al. | 2014 | Hybrid intelligent modelling schemes for heart disease classification | A hybrid multivariate adaptive regression splines (MARS) and artificial neural network (ANN) algorithm | UCI dataset | Least number of explanatory variables | Less accurate (82% accuracy) |
| P.K. Anooj | 2012 | Clinical decision support system: Risk level prediction of heart disease using weighted fuzzy rules | Fuzzy based-input single output Mamdani-based fuzzy method is used | UCI dataset | Automatic procedure for generation of fuzzy rules | Perform better than neural network based |
| Swathi Shilaskar et al. | 2013 | Feature selection for medical diagnosis: evaluation of cardio vascular disease | Hybrid forward feature selection | UCI dataset | Reduces feature dimension; improves classification accuracy | Computational complexity |
| Nilashi et al. | 2013 | An Analytical Method for | EM, PCA, CART and fuzzy rule-based | UCI dataset | for each module | Efficient |
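
The figure of 8100 feature combinations quoted for Amin et al. can be checked by counting every subset of the 13 attributes that contains at least 3 attributes. The sketch below compares a direct count with the closed form $2^n - \left(\frac{n^2+n}{2} + 1\right)$; the function names are illustrative and are not taken from the paper.

```python
from math import comb


def count_subsets(n, min_size=3):
    """Count subsets of n features that hold at least min_size features."""
    return sum(comb(n, k) for k in range(min_size, n + 1))


def closed_form(n):
    """2^n minus the subsets of size 0, 1 and 2: 1 + n + n(n-1)/2 = (n^2+n)/2 + 1."""
    return 2 ** n - ((n * n + n) // 2 + 1)


n = 13  # number of UCI attributes used in the study
print(count_subsets(n), closed_form(n))  # → 8100 8100
```

The subtraction removes the empty set, the $n$ single-attribute subsets and the $\frac{n(n-1)}{2}$ attribute pairs, which is why the brute-force search still has to evaluate thousands of candidates for only 13 attributes.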
PROPOSED MODEL
The proposed model for heart disease prediction is shown in Figure 1.2.

and shows that the UCI dataset is used more frequently in machine learning than other datasets. Figure 1.3 shows the percentage of UCI dataset use relative to other datasets.
[Figure 1.3: pie chart of dataset usage, UCI versus other datasets, with segments of 31% and 69%]