
Medical Engineering and Physics 105 (2022) 103825

Contents lists available at ScienceDirect

Medical Engineering and Physics


journal homepage: www.elsevier.com/locate/medengphy

A systematic review on machine learning approaches for cardiovascular disease prediction using medical big data
Javed Azmi a, Muhammad Arif b, *, Md Tabrez Nafis a, M. Afshar Alam a, Safdar Tanweer a,
Guojun Wang b
a Department of Computer Science & Engineering, Jamia Hamdard, New Delhi, India
b School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou 510006, China

ARTICLE INFO

Keywords: Cardiovascular disease; CVD; Classification; Machine learning; Disease prediction

ABSTRACT

Cardiovascular diseases are rising considerably around the world, so it is essential to make cardiovascular prediction as accurate as possible. A forecast based on machine learning techniques can be beneficial in detecting cardiovascular disease (CVD) with maximum precision and accuracy. Effective prediction of the disease helps in early diagnosis, which cuts down the mortality rate. Efficient detection and prediction of CVD draw on a patient's health history and the causes of heart disease. Data analytics is beneficial for making predictions based on a massive amount of data, and it aids health clinics in disease prognosis. A large volume of patient-related data is maintained regularly, and the information gathered can be used to forecast the emergence of upcoming diseases. Our study presents a detailed comparative study of cardiovascular disease prediction by comparing various machine learning techniques, mainly comprising classification and predictive algorithms. The study provides an in-depth analysis of around forty-one papers on cardiovascular disease that use machine learning techniques. It evaluates the selected publications rigorously and identifies gaps in the available literature, making it useful for researchers to develop and apply in clinical fields, primarily on datasets related to heart disease. The current study will aid medical practitioners in predicting heart threats ahead of time, allowing them to take preventative measures.

1. Introduction

People follow a routine and a hectic schedule in their daily lives, which causes tension and worry. In addition, the percentage of cigarette-addicted and obese individuals has increased dramatically, causing diseases such as cancer, heart problems, and other conditions [1]. The most challenging aspect of these diseases is predicting them. Each individual's blood pressure (BP) and heart rate vary, but the pulse rate of the average individual ranges from 60 to 100 beats/minute, and a normal BP level is around 120/80 mmHg. Any obstruction or anomaly in the heart's normal blood circulation or flow might result in a variety of serious heart disease consequences. These disorders, often known as cardiovascular diseases, are among the worst. Heart disorders, blood vessel diseases, and vascular diseases of the brain are all examples of cardiovascular diseases [2].

Big data enables healthcare providers and administrators to delve deeper into their patients' histories and the care they give. Collecting high-quality data necessitates the optimization of data gathering tools in health care, as well as their proper utilisation by both patients and doctors. Electronic health records (EHRs), medical imaging, genetic sequencing, payor records, pharmaceutical research, wearables, and medical gadgets, to name a few, are all examples of big data in healthcare. It differs from standard computerised medical and human health data used for decision-making in three ways: it is widely available; it moves quickly and covers the vast digital universe of the health sector; and, because it is derived from a variety of sources, it is very varied in structure and character.

Cardiovascular disease (CVD) is one of the most severe disorders in the world. Medical evidence has indicated that some health conditions increase a person's probability of developing cardiovascular disease, notably heart disease. A few of these risk factors are a family history of CVDs, hypertension, low HDL (good) cholesterol, high LDL (bad) cholesterol, a high-fat diet, and lack of regular exercise [3]. As a person grows older, the probability of developing heart disease increases. It is more common

* Corresponding author.
E-mail address: arifmuhammad36@hotmail.com (M. Arif).

https://doi.org/10.1016/j.medengphy.2022.103825
Received 7 February 2022; Received in revised form 4 May 2022; Accepted 26 May 2022
Available online 27 May 2022
1350-4533/© 2022 IPEM. Published by Elsevier Ltd. All rights reserved.

in men, whereas women are at risk after menopause. A stressful lifestyle can degrade arteries and raise the risk of developing heart disease. Physicians diagnose based on these and other characteristics and on past diagnoses made on patients in similar situations. The number of cases has risen by millions in the last several years, and many people have died from heart problems. As per a WHO report, CVD is responsible for 17.9 million deaths around the world each year [4]. The heart is a vital organ and plays an essential role in the human body, so any problem with it has a significant impact on one's health. CVD symptoms such as hypertension, high BP, chest pain, cardiac arrest, etc., are used for diagnosis. There are various forms of heart problems, each with its own set of symptoms. The rising risk of CVD has become a worldwide issue. As a result, the health sector must define and improve ways to lessen the socioeconomic effect of chronic disorders. In the healthcare industry, a vast amount of data on heart disease is available, which may be analyzed to make informed decisions. Machine learning and data mining techniques are heavily used in medical data analysis and knowledge extraction. Researchers have been urged to perform numerous studies in order to reduce sickness and mortality rates associated with cardiovascular diseases around the world [5].

Machine learning (ML) is widely suggested for heart disease prediction since it extracts more efficient and accurate information from massive datasets, making prediction simple. Its main strengths are that it helps manage vast amounts of data, has high processing speed, and generates predictions in the early stages of development. ML applications help prevent hospital errors and avoidable hospital fatalities, and improve health policy, disease prevention, and early detection. Numerous studies have done similar work, namely determining which ML approaches could diagnose cardiac disease [6,7].

Although machine learning can enhance heart attack prediction outcomes, more studies can improve the prognosis and simplification of ML algorithms. The research articles reviewed here were published from 2018 to 2021: twelve in 2021, ten each in 2020 and 2019, and nine in 2018, all using ML algorithms on cardiac disease datasets. Different machine learning approaches, like classification, prediction, and pattern recognition, can predict cardiovascular problems. In this survey, machine learning-based classification algorithms such as Decision Tree, Support Vector Machine, Random Forest, Naive Bayes, Artificial Neural Network, K-Nearest Neighbor, and Linear Regression are used to identify heart disorders. We have tried to examine many of the problems and issues that influence the heart and lead to CVD and to discuss several methods for predicting heart disease. The remainder of the article is organized as follows: Section 2 discusses the use of machine learning methods for ailment prediction; Section 3 illustrates the work of different researchers for evaluation of the methods used in ML-based CVD prediction; Section 4 highlights the gaps of the ML models discussed in the literature, since a model, whether traditional or hybrid, has some drawbacks; Section 5 gives the concept of a general framework for CVD classification; Section 6 provides discussion and analysis of machine learning algorithm performance in CVD prediction; and Section 7 presents the conclusion of the study of CVD prediction.

2. Machine learning (ML)

Arthur Samuel coined the term "Machine Learning" in 1959, defining it as a field that allows computers to learn without having to be explicitly programmed [8]. The process of creating algorithms based on previous data and experience is known as machine learning. Tom Mitchell in 1997 defined ML as follows: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" [9].

Machine Learning is a branch of artificial intelligence that allows a system to learn from data rather than explicit programming. ML is a complex process that uses a range of algorithms to enhance, explain, and predict outcomes by continuous learning from data [10]. It is feasible to generate increasingly accurate predictions based on training data as the algorithms absorb it. Computational approaches that use experience to improve performance or make accurate predictions can be described as machine learning. Here, experience refers to the past data available about the patient, collected as electronic health data and accessible for diagnosis [11]. The data can be computerized, user-labelled training sets or other data records collected through interactions with the environment. In all circumstances, the success of a learner's prediction relies on the quality and size of the data. Designing accurate and precise prediction techniques is an essential function of machine learning. The space and time complexity of these algorithms, as in other fields of computer science, are important indicators of their quality. In machine learning, however, sample complexity is also essential: it describes the amount of collected data the technique needs to learn a family of concepts. The complexity of the concept classes considered and the size of the training dataset determine the learning guarantees for an algorithm.

Fig. 1 shows the ML workflow: a model results from training an algorithm with data, and it is then provided with test data to produce a prediction outcome. A predictive model is one that provides a prediction based on the data it was trained on [12]. As a result, ML is critical for developing analytics models. Machine learning can be applied to accurate diagnosis, prescribing medicines, different sorts of pathological tests, detecting the early symptoms of diseases and, at a later stage, supporting the right kinds of decision-making. Medical diagnosis is significant because it determines the line of treatment, which is why machine learning techniques are widely used for it [13]. Thus, machine learning is crucial for choosing a fast and accurate line of treatment. A computer-based decision support system can play a significant role in correct diagnosis and cost-effective treatment. Machine learning aims to produce progressively better results for accurate forecasts, and it appears to be very effective in classifying diseases. Machine learning applications have been seen in numerous fields, such as healthcare, social data, social media, retail, traffic control, self-driving cars, speech recognition, image recognition, and medical diagnosis.

Fig. 1. Concept of machine learning.

3. Classification of ML techniques

Machine learning is an area of AI concerned with developing algorithms that learn automatically, so that their performance improves with experience. The algorithm recognizes patterns in the input data and builds models from that data to make predictions for new records [14]. A machine learning technique depends heavily on processing power. Building algorithms capable of this relies on the binary 'yes' and 'no' logic of computers, which is the foundation of machine learning. Fig. 2 presents the various kinds of ML classifications and their techniques:

Fig. 2. Classification of Machine Learning techniques.

Machine learning techniques depend on identifying structures and patterns in vast data collections, which supports prediction and assessment processes for diagnosis and treatment plans. Machine learning predicts the output of healthcare data precisely by applying various algorithms such as supervised, unsupervised and
reinforcement, etc., for analysis. The classifications of machine learning techniques are as follows:

3.1. Supervised learning techniques

Supervised learning uses a training set of labelled data to make predictions about new data. The data is split into training and testing sets: the model is first trained on the training set, and its performance is then evaluated on the testing set using evaluation metrics. Supervised learning problems are divided into two groups, classification and regression. In regression, on the other hand, the model is used to predict an outcome that depends on numerical data [15].

When classifying raw data, the data is first selected and then pre-processed, and all missing values (NA) are handled. The data is then normalized using min-max normalization or z-score standardization. Next, a feature selection strategy is applied to choose the best features. Once the features are selected, supervised machine learning classifiers such as DT, RF, LR, NB, SVM, KNN, and ANN are applied to classify the data, as described below:

3.1.1. Decision tree (DT)
A DT is a tree-like predictive model with applications across various fields. It is built using an algorithmic approach that identifies ways to split the dataset based on different conditions. The decision tree is one of the most widely used and practical methods for supervised learning. It is a non-parametric supervised technique used for classification and regression [16]. Fig. 3 depicts the algorithm, which uses the most significant predictor to divide the data into two or more related subsets. The Decision Tree method begins by calculating the entropy of all attributes. The dataset is then split on the predictor or variable with the highest score, that is, the lowest entropy, and these operations repeat with the remaining features [17].

Fig. 3. Decision tree.

3.1.2. Support vector machines (SVM)
SVM is a supervised ML technique introduced by Vapnik in 1995 [18]. It minimizes the empirical classification error and maximizes the geometric margin between the classes. The SVM algorithm is a widely used supervised machine learning technique that works on data with predefined attribute values and can be used both as a classifier and as a predictor. In the feature space, SVM locates the hyperplane that separates the classes. The SVM algorithm plots the coordinates of the training data in the feature space so that a wide margin separates the values belonging to different categories. Test data points are then mapped into the same space and categorized according to the side of the margin on which they fall [19].

Fig. 4 illustrates how the SVM is built: the data points nearest to the hyperplane are called support vectors, and they affect its orientation and position. We maximize the classifier's margin by using these support vectors, and if the support vectors are removed, the position of the hyperplane changes.

Fig. 4. Support vector machine.

3.1.3. Random forest (RF)
Random Forest is a well-known and effective ML technique. This algorithm works efficiently for both classification and regression, but it excels at classification [20]. Bagging, or Bootstrap Aggregation, is a type of ML algorithm. The bootstrap method is a powerful statistical technique for estimating a quantity, such as the mean, from sample data. A large number of samples are drawn from the data, and the mean of each is calculated [21]. Finally, the average of all the mean values is calculated to estimate the actual mean. Bagging employs the same procedure, but rather than calculating the mean of each data sample, it typically fits a model, most often a DT, to each one. Several data samples are drawn from the training data, and a model is trained on every sample. When a prediction is needed, each model gives a prediction, and these predictions are averaged to better estimate the actual output value [22].

Fig. 5. Random forest.

Fig. 5 is based on ensemble learning, which essentially means that many
classifiers are combined to improve performance. It is a collection of decision trees that uses the majority of the trees' outputs to enhance prediction and outcome metrics.

3.1.4. Naive Bayes (NB)
The term "Naive Bayes Classifier" refers to a supervised learning classification model based on Bayes' theorem and strong independence assumptions [23]. It assumes that the presence or absence of a specific feature of a class has no bearing on the presence or absence of any other feature. Conditional probabilities underpin the NB algorithm. It employs Bayes' theorem, which computes a probability by counting the frequency of values and combinations of values in historical data. This theorem calculates the likelihood of an event occurring given the probability of another event that has already occurred. Based on Bayes' theorem, NB is a basic but powerful classification algorithm [24]. Fig. 6 implies predictor independence, which means that the traits or features should not be associated with one another or related in any manner. It is called naive because, even if the features depend on one another, each attribute is assumed to contribute to the probability independently.

\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \]

where
$P(A)$: the probability that hypothesis A is true, termed the prior probability.
$P(B)$: the probability of the evidence.
$P(A \mid B)$: the probability that the hypothesis is true given the evidence.
$P(B \mid A)$: the probability of the evidence given that the hypothesis is true.

Fig. 6. Naïve Bayes.

3.1.5. Logistic regression (LR)
Logistic regression is an ML algorithm derived from the discipline of statistics. This technique is helpful for binary classification, in which values are classified into two groups [25]. Logistic regression and linear regression are similar in that both calculate coefficient values for their attributes. Here, however, the output prediction is constructed using a non-linear function, the logistic function, which is what distinguishes it from linear regression. Its output values lie in the range from 0 to 1. Logistic regression predictions are used to evaluate the probability of class 0 or class 1 for data instances, which can be helpful in cases where an additional explanation for a prediction is required. Logistic regression performs better when features that are correlated with one another and features unrelated to the output variable are removed [26].

Fig. 7 illustrates the linear relationship between the independent (X) and dependent (Y) variables, in which the independent variable X is directly proportional to the dependent variable Y. Based on the data points provided, we attempt to construct the straight line that best fits the points.

Fig. 7. Logistic regression.

3.1.6. Artificial neural networks (ANN)
Artificial neural networks, also termed Multilayer Perceptron (MLP), are biologically inspired and capable of simulating exceedingly complicated non-linear functions. ANN is an essential tool of machine learning. The term neural in artificial neural networks implies that they are cognition systems designed to replicate how learning occurs [27].


There are three layers in a neural network: input, hidden, and output layers. The hidden layer is made up of units that turn the input into a pattern that the output layer can use. ANNs are great tools for extracting patterns and for instructing the computer to recognize patterns that are too complex or unclear for users to identify [28]. Neural networks have been in use since the 1940s. In recent decades they have become a significant aspect of artificial intelligence due to the introduction of an approach known as backpropagation. It enables networks to adjust their hidden layers of neurons in circumstances where the outputs do not match the user's expectations. The link between the layers is depicted in Fig. 8.

Fig. 9. K-Nearest neighbor.
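The layered structure and the backpropagation idea described in Section 3.1.6 can be illustrated with a minimal sketch. It assumes one hidden layer with two units, sigmoid activations, a squared-error loss, no bias terms, and made-up input values; the function names `forward` and `backprop_step` are illustrative and do not come from any of the reviewed papers.

```python
import math
import random


def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Input layer -> hidden layer -> single output unit (biases omitted)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output


def backprop_step(x, target, w_hidden, w_out, lr=0.5):
    """One gradient-descent update of both weight layers for squared error."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output unit for loss = 0.5 * (output - target) ** 2
    delta_out = (output - target) * output * (1.0 - output)
    # Propagate the error back to each hidden unit through its outgoing weight
    delta_hidden = [delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    new_w_out = [w_out[j] - lr * delta_out * hidden[j]
                 for j in range(len(hidden))]
    new_w_hidden = [[w_hidden[j][i] - lr * delta_hidden[j] * x[i]
                     for i in range(len(x))]
                    for j in range(len(hidden))]
    return new_w_hidden, new_w_out


random.seed(0)
x, target = [0.6, 0.1, 0.9], 1.0  # hypothetical scaled inputs and class label
w_hidden = [[random.uniform(-1, 1) for _ in x] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(2)]

_, before = forward(x, w_hidden, w_out)
for _ in range(200):
    w_hidden, w_out = backprop_step(x, target, w_hidden, w_out)
_, after = forward(x, w_hidden, w_out)
print(f"output before training: {before:.3f}, after: {after:.3f}")
```

After repeated updates the output should move toward the target label, which is the sense in which backpropagation adjusts the hidden layer when outputs do not match expectations. A practical model would add bias terms, train on many records, and use a tested library rather than hand-written updates.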
Fig. 8 shows that the input layer is the ANN's first layer and is made up of neurons that transmit data to the hidden layer and then to the output layer, which comprises the weight, cost, and activation functions, for further processing.

3.1.7. K-Nearest neighbor (KNN)
Hodges et al. established the KNN rule in 1951 as a non-parametric pattern classification tool [29]. The KNN technique is one of the most basic but highly efficient classification strategies. It makes no assumptions about the data and is typically used for classification problems in which the data distribution is not known. This approach consists of locating the k nearest data points in the training dataset to the point for which a target value is missing and assigning it an approximate value derived from the identified data points [30].

Fig. 9 illustrates that a data point is categorized by its nearest neighbors through a majority vote: a query point is assigned to the class with the most representatives among the point's nearest neighbors.

4. Literature review

Only state-of-the-art studies and approaches are presented to ensure the survey's quality, and research articles from 2018 to the present have been chosen for the study. A total of 41 research articles were considered at the end of the selection process. After analyzing the selected publications, the key findings are discussed in this section, which gives a synopsis of the approaches presented in the selected papers. The research papers chosen for this survey were divided into categories according to the machine learning algorithms they use and the methodologies they apply to CVD datasets. The approaches listed below are the most commonly used to classify cardiac failure and cardiac disease.

4.1. Random forest

Nissa et al. [31] compare different machine learning systems for CVD prediction in its initial stages. The dataset's quality was increased by adopting pre-processing techniques, with the primary focus being on inspecting the data and dealing with incorrect and missing values. In addition, the researchers used three supervised ML algorithms to predict the condition and compared their results using various evaluation metrics. The data was collected from the UCI repository. Two datasets were obtained, one with 303 occurrences and 14 features and one with 1026 occurrences and 14 features. When the datasets are combined, the final dataset has 1329 instances and 14 features. In the experimental results, an accuracy of 100% on the training set and 97.29% on the testing set is the highest among all classifiers. The reliability of SVM, DT, and RF was assessed using a 10-fold cross-validation technique. With an accuracy of 99.39%, RF leads the other algorithms, followed by a DT with 99.59% accuracy.

Annu Dhankhar [32] seeks to identify the best algorithm to predict HD utilizing several ML techniques on heart disease datasets from the UCI Repository. For HD prediction, the whole dataset is divided into two sets, 80% training data and 20% testing data, to analyze the accuracy of the classification model. The data were analyzed using KNN, RF, DT, and SVM algorithms, and RF was found to be the best classification model, with 90% accuracy.

Ghosh et al. (2021) [33] develop an efficient model for Coronary Heart Disease prediction, using ML classifiers such as DT, RF, KNN, AdaBoost, and Gradient Boosting, as well as hybrid classifiers. Relief and LASSO are the evaluation approaches used to determine the most critical attributes from medical references based on rank values. Five datasets from the UCI repository (Cleveland, VA Long Beach, Switzerland, Hungary, and Statlog) were collected and combined to create a larger and more trustworthy dataset. The researchers also deal with machine learning's over-fitting and under-fitting issues. In addition, several hybrid approaches, such as Boosting and Bagging, are used to increase the testing rate while decreasing the execution time. The Random Forest with Bagging model (RFBM) is 99.05% accurate.

Maini et al. [34] proposed an efficient and cost-effective prediction approach for CVD early detection. A tertiary hospital in South India provided a total of 1670 categorized health records, 70% of which were used to train the predictive model. The prediction system uses Python and five ML algorithms: NB, KNN, RF, AdaBoost, and LR. The best prediction method is RF, which correctly identified 470 out of 501 medical data points, yielding an accuracy rate of 93.8%.

Mishra et al. [35] proposed a system that analyses the symptoms reported in the user's input data and predicts the occurrence of the disease as an output. Disease prediction is accomplished using six algorithms: SVM, KNN, RF, LR, DT, and ANN with one hidden layer. This project also provides insight into EDA. Furthermore, the web platform predicts the user's health risk and offers relevant recommendations based on their health status. The UCI repository dataset was used, and 13 attributes were considered, which give the core basis for tests and provide accurate findings. Even without restructuring, this system functions admirably. This research could respond to concerns about how easily the model can be explained. Moreover, with random forest reaching an accuracy rate of 93.60%, the forecast is accurate and feasible.
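The accuracy figures quoted in this subsection follow from simple counts. As a minimal sketch, the numbers reported for Maini et al. [34] and Nissa et al. [31] can be checked as follows. The helper names `accuracy` and `split_sizes` are illustrative and not taken from any of the cited papers, and the sketch assumes that the 501 evaluated records are the 30% of the 1670 records not used for training.

```python
def accuracy(correct, total):
    """Fraction of test records that were classified correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return correct / total


def split_sizes(n_records, train_fraction):
    """Return (train, test) sizes for a simple hold-out split."""
    n_train = round(n_records * train_fraction)
    return n_train, n_records - n_train


# Figures quoted for Maini et al. [34]: 1670 records, 70% used for training.
n_train, n_test = split_sizes(1670, 0.70)
acc = accuracy(470, n_test)
print(n_train, n_test, f"{acc:.1%}")

# Combined dataset size quoted for Nissa et al. [31].
combined = 303 + 1026
print(combined)
```

Under these assumptions the script prints `1169 501 93.8%` and `1329`, which agrees with the test-set size, accuracy, and combined instance count stated in the text.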
Fig. 8. Artificial neural networks.

Padmaja et al. [36] employ ML techniques to develop a model for early prediction of cardiac illness, which will benefit doctors in recognizing the condition. Classification techniques use input data to predict and categorize it into the appropriate category or class. DT, LR, KNN, RF, SVM, NB, and Gradient Boosting are some classification algorithms. These ML algorithms are trained to predict the incidence of CVD. Their
performance is compared using evaluation metrics such as accuracy, sensitivity, and precision to determine the best classification model for predicting HD instances. The Cleveland dataset from the UCI repository is used for training and testing; the model considers age, gender, chest discomfort, resting blood pressure, resting ECG, and maximum heart rate and produces an output indicating whether or not the person has the cardiac condition. The use of feature selection methods contributes to the model's accuracy. The random forest classifier achieved an accuracy of 93.44%.

Kavitha et al. [37] developed a hybrid model that uses ML techniques to produce effective outcomes in applications. In a hybrid model, the probabilities obtained from one machine learning technique are fed as input to the other. The Cleveland dataset from the UCI repository is used for analysis, and there are 303 cases with 14 different features. CVD is predicted using ML algorithms such as RF and DT, and a hybrid model combining DT and RF is used to enhance the task. The findings suggest that heart disease detection is effective with the RF technique and the hybrid model. DT has a 79% accuracy rate and RF has an 81% accuracy rate, whereas the hybrid model reaches 88%.

Motarwar et al. [38] propose an ML system that uses several methods to predict the chances of developing a cardiac disease. Five algorithms are used to run the framework: NB, RF, Logistic Model Tree, DT, and SVM. The algorithms are trained and tested on the Cleveland dataset. The dataset is collected and processed; then, the most significant features are considered for feature selection. For this purpose, 80% of the data (242 instances) is used for training, and the remaining 20% (61 cases) is used for testing. RF provides the highest accuracy, 95.08%.

Rajdhan et al. [39] conduct a comparative assessment of the performance of various ML classification algorithms such as DT, NB, RF, and LR. The study's objective was to determine the most effective ML system for detecting heart conditions using the UCI Cleveland dataset, which was made accessible for analysis on the Kaggle website. A detailed description of the 14 characteristics employed in the proposed study is provided. The suggested work predicts heart disease by investigating the four classification algorithms listed above and, based on performance, effectively predicting whether or not the patient has HD. The doctor enters the input values from the person's medical report. The data is then passed to the system to determine the likelihood of developing heart problems. The results show that RF is the most effective method for predicting CVD, with an accuracy of 90.16%.

Mohan et al. [40] proposed a hybrid random forest with a linear model (HRFLM). It is a hybrid of the random forest and the linear model that integrates the essential characteristics of both methods, and it uses the Hungarian, Cleveland, and Switzerland datasets from the UCI repository. KNN, DT, NB, LR, SVM, ANN, Generalized Linear Model, Gradient Boosted Trees, and Genetic Algorithm are ML techniques that classify disease severity. HRFLM predicted heart disease quite well, achieving an accuracy of 88.7%.

Annepu and Gowtham [41] suggested a Python-based heart disease prediction framework based on the RF method, with training and testing of the ML techniques performed on the Cleveland datasets from the UCI repository of HD. These datasets contain 303 instances and 76 features, but they used nine after pre-processing and manual attribute selection. Of the dataset, 75% was used for algorithm training and 25% for testing.

performs best with 81.0% accuracy.

Nandhini et al. [43] developed a model for real-time prediction of cardiovascular disease. A technology was developed that can determine the presence or absence of heart disease in a user and is capable of prediction and diagnosis. The common Cleveland HD dataset from the UCI repository, with 303 cases and 76 features, was used to train and test the algorithm. The small sample was split into 50% for training and 50% for testing to build the ML models; NB, SVM, RF, LR, KNN, and DT were the machine learning methods employed. The researchers report an accuracy rate of 89.0% for the RF classifier in predicting CVD. Their concept was built on the idea that mobile phones would always be linked to the internet, which isn't always the case.

4.2. Support vector machine

Chowdhury et al. [44] proposed a methodology for heart disease prediction that aims to uncover essential traits using ML algorithms, resulting in improved prediction accuracy. Instead of gathering data from an internet repository, the researchers collected information from hospitals and healthcare enterprises in the Sylhet district of Bangladesh to create a good questionnaire and a valuable dataset linked to heart disease prediction. The dataset included 564 occurrences and 18 attributes, and the model employed machine learning algorithms such as LR, DT, KNN, SVM, and NB, among others. The accuracy of the various methods may vary depending on the instances in the dataset. SVM produced the best results in the proposed model, with an accuracy of 91.0% for the threshold instances of the dataset.

Sangya Ware [45] examines heart disease diagnosis using the Cleveland dataset from the UCI ML repository, containing 303 cases and 14 attributes, for research and evaluation. Before being used for analysis, the dataset is pre-processed to remove all missing and noisy data. The study compared six distinct ML approaches based on several performance metrics: LR, NB, KNN, SVM, RF, and DT. The analysis reveals that SVM provides the best results, 89.34%.

Li et al. [46] develop a model for cardiac disease prediction that is both efficient and accurate and based on ML algorithms. The system builds on classification techniques such as KNN, DT, ANN, NB, LR, and SVM, and on conventional feature selection strategies such as minimal-redundancy maximal-relevance and Relief. The proposed feature selection method is a novel fast conditional mutual information approach used to overcome the feature selection problem. These feature selection techniques have been used to boost classification accuracy and lower the execution time. When compared to other algorithms, the accuracy of SVM is 92.37%. The proposed technique is simple to adopt in healthcare for detecting heart problems.

Rishabh Magar [47] proposed a machine learning-based web application to predict cardiac disease. It displays the probability of heart disease computed by the algorithm and the outcome on the webpage, reducing the time and cost of predicting the disease. Several ML techniques are used; the algorithms are trained using the Cleveland dataset from the UCI repository. For training, 75% of the entries in the data set were utilized, with the remaining 25% used to test the algorithm's accuracy. The four ML algorithms used were DT, LR, NB, and SVM. Individual datasets were trained and tested for
The results were visualized in visual studio code, created using a each of the four algorithms. Different aspects were to be used to identify
graphical user interface. For the classification, the RF classifier has used, the most effective algorithm. With an accuracy of 82.89%, the LR
and the accuracy obtained was 97.56%. technique is the most efficient of the four. SVM was 81.57% accurate,
David and Belcy [42] conducted a comparison study on heart disease while DT and NB were 80.43% and 80.43% accurate, respectively.
using RF, DT, and NB to create a prediction method to examine and Khan et al. [48] compared some widely used ML techniques for
predict the potential of heart disease. The Statlog dataset from the UCI predict CVD. The ML techniques ANN, SVM, DT, and RIPPER classifiers
repository, which included 270 cases and 13 attributes, was utilized for use WEKA 3.6 and the model train and tested using the Cleveland
the model training and testing. The dataset divides into a training model datasets from the UCI repository, containing 303 cases and 14 features.
with 80% and for testing with 20%. The testing results indicate that the In the data pre-processing, the classifier sample size with 296 occur­
RF technique outperformed the DT and NB in predicting heart disease. rences. The selected algorithm’s performance compares to other classi­
When compared to other algorithms for HD prediction, the RF method fiers such as ANN, NB, and KNN. The model result revealed that the


selected techniques worked better, with SVM achieving a 90.0% accuracy rate.
Lakshmanarao et al. [49] proposed a model for CVD prediction using machine learning techniques. Various neural systems and data mining methods are used to determine HD severity in patients. Failure to identify the problem early can otherwise damage the heart or result in sudden death. The data comes from the Framingham database in the Kaggle repository, and the ML algorithms utilized are LR, RF, SVM, DT, J48, KNN, NB, and AdaBoost. SVM proved to be the best, with an accuracy of 90.3%.
Hariharan et al. [50] conducted a comparative study of HD prediction using ML classification techniques such as KNN, DT, and SVM. They used the VA Long Beach dataset from the UCI repository of heart disease for algorithm training and testing, containing 270 instances and 12 features. The evaluation was based on a confusion matrix and covered accuracy, specificity, and sensitivity. Their testing results showed that SVM exceeded DT and KNN in predicting CVD, with a specificity of 83%, a sensitivity of 100%, and an accuracy of 92.0%.
Nashif et al. [51] offered a preliminary design for a cloud-based cardiac disease prediction system that would employ ML algorithms. Two datasets from the UCI repository of HD, the Cleveland dataset (303 instances with 14 attributes) and the VA Long Beach dataset (270 cases with 14 characteristics), were combined to form a larger dataset. Five classification and prediction algorithms were run in WEKA: SVM, RF, NB, Multi-Layer Perceptron, and LR. SVM was the best classifier among them, with a classification accuracy of 97.53%.

4.3. Artificial neural network

Ayatollahi et al. [52] compared SVM and ANN classification for Positive Predictive Value based CVD prediction. Their sample data came from three hospitals associated with the Medical Science University, Iran, containing 1324 cases and 25 attributes. The samples were taken from patients hospitalized with CVD during 2016-2017. The data collection was based on the variables mentioned in the Cleveland data policy guideline of the UCI repository. Data pre-processing, merging, filtering, normalization, and compression were all applied to the obtained data, which were loaded into Microsoft Excel 2013 and SPSS (v23.0); statistical computing was performed using R 3.3.2. The dataset was divided into 70% for training and 30% for testing. Their study revealed that the ANN algorithm performed better in sensitivity, power, and accuracy, with an accuracy rate of 91.75%.
Subhadra and Vikas [53] developed an MLP-based diagnostic system with backpropagation as the training technique. The proposed system's performance was examined using precision, accuracy, specificity, and sensitivity on the HD Cleveland dataset from the UCI repository, with 76 features and 303 cases. Dataset pre-processing excluded six instances with missing values and retained only 14 of the 76 features as the most critical for HD. The experimental result revealed that the MLP-NN model had a high accuracy of 93.39%.
Meshref [54] created several machine learning classifiers and applied them to build the best prediction model for heart disease. The dataset used in this study comes from the Cleveland dataset of the UCI repository, which initially consisted of 303 instances and 76 characteristics. Several well-known machine learning approaches, including DT, ANN, SVM, and NB, were investigated to develop, comprehend, and interpret various cardiovascular disease prediction models. Compared with the other models, the ANN model had the highest accuracy, at 84.25%.
Chaithra and Madhu [55] used comparative analysis based on data mining techniques to create a CVD prediction model. The data used for testing came from the Transthoracic Echocardiography database, with 336 cases and 24 features. DT-J48, NN, and NB are the three prominent ML models used to analyze the data and perform classification. Their testing results revealed that NN classification performed significantly better in CVD prediction, with an accuracy rate of 97.91%.
Sabay et al. [56] proposed a methodology for predicting heart disease that demonstrates how synthetic data may be used to address data leakage and avoid the restrictions of limited medical research datasets. The surrogate datasets use synthetic observations to model the system and compare the prediction accuracy of DT, RF, and LR. The dataset contains 303 cases with 76 attributes taken from the Cleveland dataset of the UCI library, pre-processed into 279 instances with 14 features. They found that 10-fold cross-validation using the RF, DT, and LR algorithms with surrogate data could enhance prediction stability by 81%. They improved the accuracy of CVD prediction using ANN with surrogate data by approximately 16%, to 96.7%.

4.4. K-Nearest neighbor

Bharti et al. [57] applied deep learning and ML techniques to analyse the UCI repository HD dataset, which comprises the Cleveland, Long Beach VA, Switzerland, and Hungary datasets. The dataset consists of 14 features and 303 cases, and the performance of the algorithms is analysed with a confusion matrix. An isolation forest is used to handle irrelevant characteristics in the dataset and keep the relevant features for better outcomes. The machine learning algorithms considered include DT, SVM, and LR; applying the deep learning approach in the model gave an accuracy of 94.2%.
Shah et al. [58] present many heart disease features and a model based on ML methods such as the RF, DT, KNN, and NB algorithms. The study uses the Cleveland dataset from the UCI repository of HD patients, with 303 cases and 14 features selected to test the accuracy of the various algorithms. The results show that KNN achieves the highest accuracy score, 90.78%. The data is pre-processed before being used in the model. The algorithms with the most significant outcomes in this model include KNN, NB, and RF.
Arghandabi and Shams [59] focused on assessing the performance of machine learning models used to predict heart issues. Several ML algorithms are used to diagnose heart disease in patients at an early stage. The DT, KNN, GB, SVM, and LR algorithms are employed for this task, and the dataset is taken from the UCI repository of heart diseases. The training and testing environments are created in the Python programming language using the Jupyter notebook. Training and testing the machine learning algorithms revealed that KNN had a high accuracy of 85.7%.
Singh and Kumar [60] used the UCI repository dataset to train and evaluate ML algorithms to predict cardiac illness. These algorithms include LR, DT, KNN, and SVM. The Anaconda (Jupyter) notebook is described as a well-suited tool for implementing Python programs, since it provides various libraries and header files that make the task more exact. The accuracy of each algorithm was examined using the confusion matrix, and KNN was found to be the best and most efficient algorithm, with an accuracy of 87.0%.
Almustafa [61] compared alternative HD dataset models to predict cardiac illness instances effectively with limited characteristics. The dataset covers 14 attributes for 1025 patients from Cleveland, Switzerland, Hungary, and Long Beach V. A feature extraction method is applied to the HD dataset to remove missing values. To demonstrate how well the selected classification algorithms classify or predict heart disease cases, the study employed NB, KNN, DT, SVM, JRip, AdaBoost, Stochastic Gradient Descent, and DT-J48. After applying several strategies for classifying the HD dataset, the KNN classifier achieved a classification accuracy of 99.70%.
Rabbi, Uddin et al. [62] carried out extensive research in MATLAB on supervised learning models such as KNN, SVM, ANN, and multilayered feed-forward backpropagation. The Cleveland heart disease dataset from the UCI machine learning repository was used, which included 303 instances with 76 attributes. The dataset was pre-processed to eliminate the entries with missing values. The resulting data size was 270 cases with 13 attributes. 50% of the data was used to train the models, while the


other 50% to test them. According to the experimental results, SVM performed best in classification accuracy, with an 85.0% success rate.

4.5. Decision tree

Geetha [63] provides a model using ML neural classification for CVD infection. The neural system framework takes 13 features and forecasts the presence or absence of cardiac disorder in the patient, along with numerous execution measures. The dataset for the model was taken from the Cleveland datasets of the UCI machine learning repository, with 76 attributes and 303 cases. The study employed NB, Stochastic Gradient Descent, KNN, DT, SVM, and AdaBoost to compare how well the selected classifiers classify and forecast HD instances. DT obtained an accuracy of 99.70%.
Gao et al. [64] proposed a system that applies ensemble methods to enhance the accuracy of HD prediction. ML techniques are used to predict the initial phase of heart disease. The dataset is obtained from the Kaggle repository. The ensemble approaches, i.e., bagging and boosting, use two feature extraction techniques (linear discriminant analysis and principal component analysis) to identify essential features from the dataset. The ensemble methods (bagging and boosting) are compared with five classifiers (SVM, KNN, DT, NB, and RF) on the selected features. Based on experimental observations, the bagging method using DT with principal component analysis feature extraction achieved the highest accuracy, 98.6%.
Agrahara [65] presents comparative results on heart disease prediction and draws analytical inferences. According to the study's findings, the DT, KNN, ANN, NB, RF, and LR algorithms improve the accuracy of the heart disease prediction system in various scenarios. Cleveland, Ohio is the location chosen for this database. A correlation matrix is used to prepare the data for more advanced analysis and diagnostics. Based on the results, DT is the best because of its high accuracy of 98.02%, high precision, and low MSE. In comparison, LR and KNN have a lower accuracy of 89.0%, lower precision, and higher MSE.
Alotaibi [66] conducted a comparative analysis applying ML techniques to forecast cardiac disease. The algorithms evaluated included DT, NB, LR, SVM, and RF. The standard Cleveland dataset from the UCI repository of HD provided 303 cases and 14 features. The 10-fold cross-validation technique was employed during the learning and development of the model. According to the findings of the study, the DT algorithm had the best accuracy of 93.19% in predicting heart disease, followed by SVM at 92.30%.

4.6. Naive Bayes

Poorani and Hemalatha [67] proposed a machine learning approach that creates prediction models to assess the Cleveland datasets from UCI, containing 76 features and 303 subjects relevant to heart disease (HD). SVM, DT, RF, Multilayer Perceptron (MLP), and J48 classifiers accept data from the data frame for testing and training. The random forest and naive Bayes classifiers are more accurate than the other classifiers, at 90.33%. This measurement can support the accuracy of HD prediction. The machine learning approaches used in this work, based on naive Bayes and J48, can improve the efficacy of possible heart disease health studies.
Tarawneh and Embarak [68] developed a hybrid model for predicting cardiac disease that merges ML techniques into a single framework. The dataset was taken from the Cleveland UCI repository of heart disease, with 303 cases and 14 attributes used to train and test the algorithms. Data pre-processing reduces the number of characteristics from 14 to 12. The algorithms include KNN, NB, RF, GA, J48, and SVM classification, with their respective accuracies, specificities, and sensitivities in CVD prediction taken into account. The experiments revealed that SVM and NB performed better in predicting HD, with a similar accuracy of 89.2%.
Dhar et al. [69] used an ML algorithm to create a predictive model for detecting cardiovascular heart disease based on various heart-related indicators. Three distinct supervised algorithms, namely RF, NB, and the J48 classifier in the Weka ML software tool, were used to pre-process the dataset. Knowledge Discovery in Databases (KDD) is used to extract the heart disease dataset from the ERIC laboratory, which consisted of 209 test cases. Accuracy, precision, recall, and F-measure, among other benchmark metrics, were used to assess the algorithms' performance. With a classification accuracy of 100%, the most effective model for predicting patients with heart disease was RF implemented on specified attributes.

4.7. Logistic regression

Prasad et al. [70] created a machine learning technique for CVD prediction based on logistic regression (LR). They used the SK-Learn (scikit-learn) module in Python to compare the LR approach with other techniques such as NB, KNN, SVM, DT, and J48. Three hospitals collaborating with the AJA University of Medical Sciences in Iran provided data on heart problems. There are 1324 instances and 25 features in the sample. The experimental findings showed that the LR algorithm worked better, with an accuracy of 86.89%.
Haq et al. [71] created a hybrid model based on ML architecture for diagnosing CVD patients using seven prominent classification algorithms in Python. NB, KNN, LR, ANN, DT, SVM, and Multilayer Perceptron are among them. The Cleveland dataset (303 instances and 76 attributes) was used with a 10-fold cross-validation strategy for model training and testing. The best heart disease symptom features were chosen using feature selection techniques, and cases with missing values were eliminated. The data was refined and evaluated by all the classifiers with each feature selection algorithm to find the best-performing model. The testing findings revealed that LR with 10-fold cross-validation had the best accuracy of 89.0% when features were selected by the Relief FS algorithm. It is a better prediction method in terms of accuracy because of the excellent performance of LR with Relief.
It illustrates the thoroughly analyzed research studies used to predict CVD as it is a severe life-threatening disease. The parameters selected for assessment are as below:

5. Gaps in literature

This section will be helpful in assisting scholars in discovering gaps in the recent literature. Identifying CVD is critical since it is a life-threatening condition. Table 1 provides information regarding the potential limitations of each research paper. According to the literature review, the current machine learning frameworks for CVD suffer from one of the following limitations.

• Dataset sample: To achieve high accuracy, many of the systems recommend small sample sizes. The investigation reveals that ML algorithms show lower performance with a large sample size. In contrast, the same classifier shows the best accuracy with exclusive attributes when used on compact and efficient techniques. The sample size has a significant impact on the estimated levels of accuracy.
• Data distribution: There are no standard rules for data usage; the data splitting is random. The majority of the research studies split the data into 80%:20%, 70%:30%, or even 50%:50% to train and test the model. Furthermore, because the data size for each kind of CVD is limited, the combined results may be affected.
• Data-clinical correlation: Some findings did not match current healthcare practice, making interpretation more challenging. Meta-analysis' usefulness reflects the fact that it improves the study's effectiveness by applying the same techniques. As the clinical data is often insufficient and inconsistent, the majority of machine learning approaches fail to deliver consistent performance. Therefore, data-clinical correlation is needed.
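To make the data distribution limitation concrete, the following minimal Python sketch (standard library only) shows how the 80%:20%, 70%:30%, and 50%:50% hold-out splits and a 10-fold cross-validation partition divide a 303-case dataset such as Cleveland. The helper names holdout_split and kfold_splits and the fixed seed are illustrative assumptions and are not taken from any of the reviewed studies.

```python
import random

N_CASES = 303  # size of the Cleveland dataset as reported in the reviewed studies


def holdout_split(n, train_fraction, seed=0):
    """Shuffle indices 0..n-1 and split them once into (train, test)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * train_fraction)
    return indices[:n_train], indices[n_train:]


def kfold_splits(n, k=10, seed=0):
    """Yield (train, test) index lists so that every case is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


for fraction in (0.8, 0.7, 0.5):
    train, test = holdout_split(N_CASES, fraction)
    # One misclassified test case moves the accuracy by 100 / len(test) points.
    print(f"{fraction:.0%} train: {len(train)} train, {len(test)} test, "
          f"accuracy step = {100 / len(test):.2f} points")

fold_sizes = [len(test) for _, test in kfold_splits(N_CASES, k=10)]
print("10-fold test sizes:", fold_sizes)
```

With 303 cases, the 80%:20% split leaves 242 training and 61 test cases, so a single misclassified test case shifts the reported accuracy by roughly 1.6 percentage points. Cross-validation tests every case once, which is one reason accuracies obtained under different splitting schemes are hard to compare directly.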

Table 1

Comparison of machine learning techniques for cardiovascular disease prediction.
S.No | Author/Year | Proposed Work | Dataset Used | Comparative Algorithm | Proposed Algorithm | Accuracy (in %) | Limitation/Gaps
1 | Nissa, N. et al. (2021) | By analyzing algorithms comparatively for prediction of heart disease. | UCI machine learning repository. | SVM, DT, RF | RF | 99.35 | The time complexity is not determined, other ML algorithms are not trained, and framework also not specified.
2 | Dhankhar, A. and S. Jain (2021) | To find the best algorithm for heart attack prediction using ML techniques. | UCI repository for heart disease. | DT, RF, ANN, SVM | RF | 90.00 | Small dataset size used, no feature selection, framework is not specified.
3 | Geetha. S. et al. (2021) | Framework for cardiac infection using Neural system estimation of ML. | Cleveland dataset from UCI. | KNN, NB, DT, SVM, AdaBoost | DT | 99.70 | Complicated architecture, model doesn’t worked on huge data, early detection not possible, manually collecting data is time taken, and prediction is not appropriate.
4 | Ghosh, P. et al. (2021) | To predict HD in specific Coronary Artery Disease. | UCI repository from Satlog, Cleveland, Switzerland, VA Long Beach and Hungary. | DT, GB, KNN, RF AdaBoost. | RF | 99.05 | The dependency on a certain feature selection system.
5 | Maini, E. et al. (2021) | Practical and economic prediction system for early CVDs diagnosis. | Sagar Hospitals, Jayanagar, Bengaluru. | KNN, RF, AdaBoost, ANN, LR. | RF | 93.60 | 1670 data from higher income group, robustness of a prediction system.
6 | Chowdhury, M. N. R. et al. (2021). | Predicting heart disease using ML algorithms. | Osmani Medical College, Sylhet MAG, Bangladesh. | DT, LR, KNN, NB, SVM. | SVM | 91.00 | Data preprocessing is not specified.
7 | Mishra, S. et al. (2021) | Prediction of HD accurately using the pickled model in the Flask framework of machine learning technique. | UCI ML repository dataset. | KNN, DT, LR, RF, ANN | RF | 100.00 | Unable to identify the specific heart disease, and small dataset used.
8 | Gao, X.-Y. et al. (2021) | To increase the accuracy of predicting cardiac disease using ensemble approaches. | Kaggle heart disease repository. | KNN, NB Boosting, DT XGBoost, LR, ANFSI, ANN | DT | 98.60 | A single and small data has chosen, hybrid model with other algorithms, time duration is not specified for prediction.
9 | Padmaja, B. et al. (2021) | To predict CVDs using machine learning approaches. | UCI repository dataset. | LR, RF, DT, SVM, NB, KNN, GB | RF | 93.44 | A single and small dataset is used in all classification, techniques is not specified for feature selection, and the processing time is not mentioned.
10 | Poorani, S. and D. Hemalatha (2021) | Predictive models to analyze the datasets relevant to heart disease based on ML techniques. | Cleveland dataset from UCI. | DT, RF, NB, MLP, SVM | NB | 90.33 | A single dataset is used, no comparisons with other datasets, no optimization for classifier performance, and feature selection are also not specified.
11 | Bharti, R. et al. (2021) | Prediction of CVD using ML and DL | UCI repository-Cleveland, Long Beach V, Switzerland, and Hungary. | SVM, LR, RF, DT, KNN | KNN | 84.00 | Data used without preprocessing, optimization techniques is absent.
12 | Kavitha, M. et al. (2021) | Hybrid model for prediction of CVD. | UCI repository. | DT, RF | RF | 88.00 | Hybrid model other ML algorithm not used, small size dataset is used, no technique applies for increasing the performance of a classifier, and average accuracy is obtained.
13 | Rajdhan, A. et al. (2020) | Predicts the probability of CVD and classifies a patient’s risk level by applying various ML techniques. | UCI Cleveland dataset available in the Kaggle. | DT, RF, LR, NB | RF | 90.16 | The small size of data set, feature extraction is not included, the time required for processing is not mentioned.
14 | Shah, D. et al. (2020) | HD prediction model based on supervised learning algorithms. | Cleveland database of UCI repository. | DT, NB, RF, KNN, SVM | KNN | 90.78 | For early prediction of heart disease require more implementation of the algorithm, a small size dataset is used.
15 | Agrahara, A. (2020) | A study of HD prediction Using ML algorithms. | Data from the Cleveland, Ohio area. | NB, DT, LR, SVM, KNN, RF | DT | 98.02 | Datasets is not specific, Real experiment unavailable, and improper feature selection.
16 | Sangya Ware et al. (2020) | To perform a comparative analysis. | Cleveland database of UCI repository. | SVM, DT, RF, NB, KNN | SVM | 89.34 | Marginal success is achieved, feature selection is not consider for removing missing and noisy data.
17 | Li, J. P. et al. (2020) | ML approach for the feature selection and identification of CVD | UCI repository. | LR, DT, ANN, KNN, NB, SVM | SVM | 92.37 | A dataset with small sample size, the processing time is not mentioned.
18 | Arghandabi, H. and P. Shams (2020) | A study of ML techniques for the HD prediction. | UCI HD dataset. | DT, KNN, LR, GB, SVM | KNN | 85.70 | Not suitable for early detection of heart problem, dataset size is small.
19 | Magar, R. et al. (2020) | Cloud-based ML algorithms to get the prediction of HD | Cleveland database of UCI repository. | SVM, DT, NB, LR | SVM | 81.57 | A small sample size is used for the model, and the system is not guaranteed as no performance metrics are evaluated.
20 | Motarwar, P. et al. (2020) | A ML system to predict the chances of developing a cardiac disease. | UCI dataset of the Cleveland. | NB, DT, RF, SVM and Logistic Model Tree (LMT) | RF | 95.08 | The time required for the prediction is not considered, and the Data size is small.
21 | Almustafa, K. M. (2020) | To compare alternative HD dataset models to predict HD instances. | UCI repository-Cleveland, Long Beach V, Switzerland, and Hungary. | NB, KNN, SVM, JRip, SGD, DT J48, Adaboost | KNN | 99.70 | It’s required to run the model on different classifiers for analysis.
22 | Singh, A. and R. Kumar (2020) | Compute the accuracy of ML algorithms for heart disease prediction | UCI machine learning repository. | DT, LR, KNN, SVM. | KNN | 87.0 | A single dataset is used, feature extraction not included, not predicting the specific heart problem.
23 | Tarawneh, M. and O. Embarak (2019) | Hybridization model used for predicting HD | UCI dataset of the Cleveland. | KNN, NB, SVM, ANN, RF, GA, J48 | NB | 89.2 | Small datasets are considered inefficient in predicting remote heart disease patients due to limited hospital use and being unsuitable for open source.
24 | Prasad, R. et al. (2019) | LR based approach for HD prediction. | UCI repository. | LR, NB, SVM, DT J48, KNN | LR | 86.89 | Datasets used is not specified.
25 | Ayatollahi, H. et al. (2019) | SVM and ANN-based model to predict a value for CDVs. | AJA University of Medical Sciences, Iran. | SVM, ANN | ANN | 91.75 | Lack of performance comparison with other machine learning algorithm.
26 | Subhadra, K. and B. Vikas (2019) | An intelligent automated system incorporating with ML the techniques for predicting CVD. | Cleveland data of the UCI repository. | NB, DT, KNN, SVM, MNL-NN | ANN | 93.39 | A single dataset issued to the proposed model, no optimization of classifiers, checking the performance of given classifiers, no feature selection of missing data, noisy data.
27 | Mohan, S. et al. (2019) | Effective HD prediction using a hybrid model | Cleveland data of the UCI repository. | DT, RF, GLM, LR, SVM, DL, NB, GBT. | RF | 88.70 | Absence of other ML classification combination to form the hybrid model.
28 | Annepu, D. and G. Gowtham (2019) | A framework on RF algorithm to predict HD. | Cleveland datasets from UCI repository. | RF, DT | RF | 97.56 | Early detection of CVD is not possible, it predict the patient with 50% blockage as normal or absence of heart problem.
29 | Meshref, H. (2019) | CVD diagnosis using ML interpretation approach | Cleveland HD datasets from UCI repository. | ANN, SVM, DT, NB, RF | ANN | 84.25 | 43 instances with extreme data making redundancy, no interpretation of post-hoc model.
30 | Alotaibi, F. S. (2019) | To improve the Heart Failure prediction accuracy | Cleveland datasets of UCI repository. | NB, DT, RF, SVM, LR | DT | 93.10 | The dataset is moderate in size, and there are only a few patient records in the collection.
31 | Khan, S. N. et al. (2019) | ML techniques have been used for HD prediction through WEKA. | Cleveland datasets of UCI repository. | DT, ANN, SVM, RIPPER | SVM | 90.00 | Comparison of the algorithms are not appropriate.
32 | Lakshmanarao, A. et al. (2019) | ML methods for detection of heart disease. | Framingham datasets from the Kaggle Website. | SVM, LR, KNN, AdaBoost, DT-J48, NB, RF | SVM | 90.30 | The model is not based on a heuristic technique. Data is also redundant.
33 | Rabbi, M. F. et al. (2018) | Prediction heart disease using different classification algorithms. | Cleveland datasets from UCI repository. | KNN, SVM, RF, NB, MLP | KNN | 99.70 | For the entire process, one dataset was used, and no contrasts with other datasets were undertaken.
34 | Hariharan, K., et al. (2018). | Heart Disease Prediction using ML Techniques | Cleveland datasets from the UCI repository. | SVM, DT, BT, ANN, RF, KNN, LR | SVM | 92.00 | The technique uses a single and small dataset and does not use a feature selection process.
35 | Chaithra, N. and B. Madhu (2018) | To design a CVD prediction model. | Transthoracic Echocardiography database. | DT-J48, ANN, NB | ANN | 97.19 | The frequency of incorrect prediction of the cardiovascular disease displayed on the classification matrix.
36 | Nashif, S. et al. (2018). | An intelligent and user-friendly HD prediction system | Cleveland HD datasets from UCI repository. | MLP, LR, NB, RF, SVM | SVM | 97.53 | The small dataset is used, Data present in the cloud is not secured.
37 | David, H. and S. A. Belcy (2018) | A comparative study and analysis using classification algorithms | Statlog dataset in UCI repository. | NB, DT, RF | RF | 81.00 | The average accuracy was calculated using different datasets.
38 | Haq, A. U. et al. (2018) | A hybrid smart ML based predictive system. | Cleveland HD datasets from the California Irvine repository. | NB, DT, KNN, SVM, ANN, LR, MLP | LR | 89.00 | Using feature selection methods and optimization approaches with a limited number of features.
39 | Dhar, S. et al. (2018) | A hybrid technique for HD prediction | The ERIC laboratory. | J48, NB, RF | NB | 81.00 | Data inconsistencies, missing values, noisy data, and outliers are all examples of data errors.
40 | Sabay, A. et al. (2018) | The framework uses synthetic data for HD prediction. | UCI dataset of the Cleveland. | LR, DT, RF, ANN | ANN | 96.70 | Using of surrogate sets of synthetic data.
41 | Nandhini, S. et al. (2018) | Real-time heart diseases prediction. | UCI repository-Cleveland. | KNN, NB, RF, LR, DT, SVM | RF | 89.00 | Mobile phones always linked to the internet.
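As a cross-check on Table 1, the short Python sketch below tallies how often each algorithm appears in the "Proposed Algorithm" column. The list best_algorithm is transcribed by hand from the 41 rows of the table, in order; the variable names are illustrative assumptions.

```python
from collections import Counter

# "Proposed Algorithm" column of Table 1, rows 1-41, transcribed in order.
best_algorithm = (
    "RF RF DT RF RF SVM RF DT RF NB "        # rows 1-10
    "KNN RF RF KNN DT SVM SVM KNN SVM RF "   # rows 11-20
    "KNN KNN NB LR ANN ANN RF RF ANN DT "    # rows 21-30
    "SVM SVM KNN SVM ANN SVM RF LR NB ANN "  # rows 31-40
    "RF"                                     # row 41
).split()

counts = Counter(best_algorithm)
total = len(best_algorithm)

# Print each algorithm with its share of the 41 reviewed studies.
for name, n in counts.most_common():
    print(f"{name:>3}: {n:2d} of {total} ({100 * n / total:.1f}%)")
```

The tally gives RF 13 and SVM 8 out of 41, which agrees with the frequencies quoted in the Discussion. A frequency count of this kind only shows which method was reported as best most often; it does not compare the methods under a common dataset or evaluation protocol.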

• Feature selection: Most studies have ignored the impact of feature selection approaches on the split of data into testing and training. The literature shows that a good feature selection approach can help machine learning models perform better. It also reduces issues such as higher processing costs caused by inappropriate input data in the learning process.
• Hyper-parameter optimization: Another vital gap in the existing research is parameter tuning. The performance of a system can be improved further by fine-tuning its parameters. Algorithm hyper-parameter customization typically goes unreported, leading to significant statistical heterogeneity.
• Meta-heuristic technique: Meta-heuristics comprise intelligent techniques for improving the accuracy of heuristic operations. In our study we have shown that the majority of researchers have presented a heuristic machine learning framework for predicting cardiovascular disease. The main application of heuristic techniques is locating appropriate or near-optimum solutions at a low computational cost, without guaranteeing that an optimal solution is reached.
• Ensemble technique: Current studies ignore the promising role of ensemble techniques in improving the efficiency of existing ML models for predicting cardiovascular diseases. Ensemble modelling combines the findings of two or more related but distinct analytical models into a single score.
• Clinical aspect: Some research reported the technical components without the clinical features due to a lack of clinical monitoring. Clinical models are used to provide patient data to assist physicians in making better decisions for their treatment.

6. An overview of CVD classification techniques

Fig. 10 depicts the general framework used for the classification of cardiovascular diseases. The dataset for CVD is made up of patient data and is available on the internet. Researchers have used a variety of dataset sources, including the UCI machine learning repository, Statlog, Cleveland, Long Beach dataset, and others [72].
The dataset consists of different attributes such as the patients' age, gender, blood pressure, smoking habit, diabetes, hypertension, chest discomfort type, family history, etc. [3]. The dataset is initially pre-processed and cleaned to eliminate any missing values and to normalize the attributes, as indicated in Fig. 10. All of the individuals' characteristics are extracted at the feature extraction stage. The test dataset, obtained from other sources, is then classified on these features using different machine learning classifiers such as DT, NB, RF, KNN, ANN, LR, and SVM to create the model. The predictive model categorizes each record as an affected or a healthy person depending on the attributes.

Fig. 10. General framework for cardiovascular diseases classification.

7. Discussion

All selected papers are assessed thoroughly against defined parameters such as the research problem, the proposed approach, the machine learning classifiers used, and the highest accuracy achieved by their respective models. The selected articles apply their techniques to heart disease datasets, but the sources of these datasets are unrestricted: the data are obtained from UCI, a hospital, or any other repository of machine learning datasets on cardiac conditions. Researchers who develop machine learning algorithms often apply reduction strategies or feature selection before providing the data to the ML classifier. The most critical characteristics in these datasets are chest discomfort, blood pressure, fasting blood glucose level, cholesterol, ECG readings, maximal pulse rate, exercise, age, gender, etc. Because the aim of such studies is to save lives, the accuracy of the classifier is one of the most critical aspects of the analysis. According to several studies, machine learning researchers did not pay much attention to using other CVD datasets to construct clinical decision support systems. Across the 41 studies, which looked at 12 different datasets, more than 75% of such systems employ UCI heart disease data. In this survey, the best approach was taken to be the one with the highest frequency of use, and we found that three of these methods were the most often employed for CVD prediction using the dataset. Fig. 11 illustrates the different ML models used in current research.

Fig. 11. Techniques used in current researches.

From Fig. 11, it was not possible to generalize the performance of a single algorithm as the best, because each algorithm is distinct: RF appeared most frequently as the most accurate technique (13 occurrences out of 41), whereas SVM was second in accuracy (8 instances out of 41). For example, RF frequently outperforms other methods on a dataset with a large dimensionality, while others underperform on the same data. Though the UCI data are used in most clinical decision support systems for CVD prediction, other researchers have


used alternate datasets and machine learning techniques to achieve similar results. Seven classifiers were considered for CVD prediction. There are various forms of machine learning algorithms that have previously been the subject of several studies. DT, RF, NB, LR, ANN, KNN, and SVM are the techniques considered in our study. According to the results of this study, the accuracy of the RF, SVM, and KNN algorithms is reasonably high when compared to the other techniques.

8. Conclusion

Due to the expansion of knowledge, machine learning is becoming more popular. It enables the extraction of knowledge from massive amounts of data, a task that is time-consuming and, in some situations, beyond the reach of humans. In the healthcare industry, machine learning-based solutions are routinely used to evaluate patient data, diagnose sickness, and suggest treatment alternatives. In this study, existing approaches were analysed and compared in order to determine the most accurate and efficient strategies for predicting cardiovascular disease. Today, there is a plethora of machine learning techniques to choose from. In this work, however, seven machine learning algorithms are presented and compared in order to determine the best classifier for heart disease prediction. The techniques are KNN, LR, ANN, NB, DT, SVM, and RF. Based on the reviewed research and studies, Table 1 displays a performance analysis of many machine learning algorithms and their accuracies for cardiovascular disease prediction. Random forest produces the most significant outcomes in terms of accuracy, according to many research studies on predicting cardiovascular disease. The findings show that machine learning-based technologies have the potential to transform the healthcare industry by improving the entire process of disease prediction and therapy recommendation in the vast majority of cases.
There is a lot of work to be done in CVD to improve the prediction model's efficiency, reliability, and accuracy. In today's society, the vast volume of information is automated, dispersed, and neglected. The paper's future scope is to anticipate cardiac illnesses utilising new approaches and algorithms in a time-efficient manner. Loading the model with more medical information should give a better result, which helps the cardiologist in concluding a patient's likelihood of having a heart attack.

Declaration of Competing Interest

None.

Ethical approval

Not required.

Funding

None.

References

[1] Pouriyeh S, Vahid S, Sannino G, De Pietro G, Arabnia H, Gutierrez J. A comprehensive investigation and comparison of machine learning techniques in the domain of heart disease. Computers and Communications (ISCC) 2017.
[2] Nagendra K, Ussenaiah M. A study on various data mining techniques used for heart diseases. International Journal of Recent Scientific Research 2018:24350–4.
[3] Sen SK. Predicting and diagnosing of heart disease using machine learning algorithms. Int J Eng Comput Sci 2017;6(6).
[4] WHO. Global action plan for the prevention and control of noncommunicable diseases 2013-2020. World Health Organization; 2013.
[5] Ozaydin Bunyamin, Berner Eta S, Cimino James J. Appropriate use of machine learning in healthcare. Intelligence-Based Medicine 2021;5:100041. ISSN 2666-5212.
[6] Patel J, TejalUpadhyay D, Patel S. Heart disease prediction using machine learning and data mining technique. Heart Disease 2015;7(1):129–37.
[7] Solanki A, Barot M. Study of heart disease diagnosis by comparing various classification algorithms. International Journal of Engineering and Advanced Technology 2019;8(2S2):40–2.
[8] Samuel AL. Some studies in machine learning using the game of checkers. IBM Journal of research and development 1959;3(3):210–29.
[9] Mitchell TM. Does machine learning really work? AI magazine 1997;18(3). 11-11.
[10] Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, and prospects. Science 2015;349(6245):255–60.
[11] Ali AA, Hassan HS, Anwar EM. Heart diseases diagnosis based on a novel convolution neural network and gate recurrent unit technique. Electrical Engineering 2020.
[12] Aljanabi M, Qutqut H, Hijjawi M. Machine learning classification techniques for heart disease prediction: A review. International Journal of Engineering & Technology 2018;7(4):5373–9.
[13] Obasi T, Shafiq MO. Towards comparing and using Machine Learning techniques for detecting and predicting Heart Attack and Diseases. big data 2019.
[14] Kotsiantis SB, Zaharakis I, Pintelas P. Supervised machine learning: A review of classification techniques. Emerging artificial intelligence applications in computer engineering 2007;160(1):3–24.
[15] Osisanwo F, Akinsola J, Awodele O, Hinmikaiye J, Olakanmi O, Akinjobi J. Supervised machine learning algorithms: classification and comparison. International Journal of Computer Trends and Technology (IJCTT) 2017;48(3):128–38.
[16] Maji S, Arora S. Decision tree algorithms for prediction of heart disease. Information and communication technology for competitive strategies. Springer; 2019. p. 447–54.
[17] Belson WA. Matching and prediction on the principle of biological classification. Journal of the Royal Statistical Society: Series C (Applied Statistics) 1959;8(2):65–75.
[18] Vapnik V, Guyon I, Hastie T. Support vector machines. Mach. Learn 1995;20(3):273–97.
[19] Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST) 2011;2(3):1–27.
[20] Breiman L. Random Forests. Mach Learn 2001;45:5–32.
[21] Chen M, Liu Q, Chen S, Liu Y, Zhang C-H, Liu R. XGBoost-based algorithm interpretation and application on post-fault transient stability status prediction of power system. IEEE Access 2019;7:13149–58.
[22] Bingzhen Z, Xiaoming Q, Hemeng Y, Zhubo Z. A Random Forest Classification Model for Transmission Line Image Processing. Computer Science & Education IEEE 2020.
[23] Rish I. An empirical study of the naive Bayes classifier. empirical methods in artificial intelligence, 2001.
[24] Lindley DV. Fiducial distributions and Bayes' theorem. Journal of the Royal Statistical Society. Series B (Methodological) 1958:102–7.
[25] Hosmer Jr DW, Lemeshow S, Sturdivant RX. Applied logistic regression. John Wiley & Sons; 2013.
[26] Wu J, Liu C, Cui W, Zhang Y. Personalized Collaborative Filtering Recommendation Algorithm based on Linear Regression. Power Data Science IEEE; 2019.
[27] McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics 1943;5(4):115–33.
[28] Riedmiller M, Braun H. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. neural networks. IEEE; 1993.
[29] Fix E, Hodges JL. Discriminatory analysis. Nonparametric discrimination: Consistency properties. International Statistical Review/Revue Internationale de Statistique 1989;57(3):238–47.
[30] Cover T, Hart P. Nearest neighbor pattern classification. IEEE transactions on information theory 1967;13(1):21–7.
[31] Nissa N, Jamwal S, Mohammad S. Heart Disease Prediction using Machine Learning Techniques. Wesleyan Journal of Research 2021;13(67).
[32] Annu Dhankhar, S. J. Prediction of Disease Using Machine Learning Algorithms. Smart and Sustainable Intelligent Systems. P. C. a. T. C. Namita Gupta, Wiley-Scrivener Publishing LLC. 2021: 1: 115–126.
[33] Ghosh P, Azam S, Jonkman M, Karim A, Shamrat FJM, Ignatious E, Shultana S, Beeravolu AR, De Boer F. Efficient Prediction of Cardiovascular Disease Using Machine Learning Algorithms With Relief and LASSO Feature Selection Techniques. IEEE Access 2021;9:19304–26.
[34] Maini E, Venkateswarlu B, Maini B, Marwaha D. Machine learning–based heart disease prediction system for Indian population: An exploratory study done in South India. Medical Journal Armed Forces India 2021;77(3):302–11.
[35] Mishra S, Neurkar SV, Patil R, Petkar S. Heart Disease Prediction System. International Journal of Engineering and Applied Physics 2021;1(2):179–85.
[36] Padmaja B, Srinidhi C, Sindhu K, Vanaja K, Deepika N, Patro EKR. Early and Accurate Prediction of Heart Disease Using Machine Learning Model. Turkish Journal of Computer and Mathematics Education (TURCOMAT) 2021;12(6):4516–28.
[37] Kavitha M, Gnaneswar G, Dinesh R, Sai YR, Suraj RS. Heart Disease Prediction using Hybrid machine Learning Model. Inventive Computation Technologies. IEEE; 2021.
[38] Motarwar P, Duraphe A, Suganya G, Premalatha M. Cognitive Approach for Heart Disease Prediction using Machine Learning. Emerging Trends in Information Technology and Engineering. IEEE; 2020.
[39] Rajdhan A, Agarwal A, Sai M, Ravi D, Ghuli P. Heart disease prediction using machine learning. International Journal of Research and Technology 2020;9(04):659–62.


[40] Mohan S, Thirumalai C, Srivastava G. Effective heart disease prediction using hybrid machine learning techniques. IEEE access 2019;7:81542–54.
[41] Annepu D, Gowtham G. Cardiovascular disease prediction using machine learning techniques. International Research Journal of Engineering and Technology 2019;6(4):3963–71.
[42] David H, Belcy SA. Heart Disease Prediction Using Data Mining Techniques. ICTACT Journal on Soft Computing 2018;9(1).
[43] Nandhini S, Debnath M, Sharma A. Heart disease prediction using machine learning. International Journal of Recent Engineering Research and Development (IJRERD) 2018;3(10):39–46.
[44] Chowdhury MNR, Ahmed E, Siddik MAD, Zaman AU. Heart Disease Prognosis Using Machine Learning Classification Techniques. Convergence in Technology (I2CT). IEEE; 2021.
[45] Sangya Ware SKR, Choudhary Bharat. Heart Attack Prediction by using Machine Learning Techniques. International Journal of Recent Technology and Engineering 2020;8(5):1577–80.
[46] Li JP, Haq AU, Din SU, Khan J, Khan A, Saboor A. Heart disease identification method using machine learning classification in e-healthcare. IEEE Access 2020;8:107562–82.
[47] Rishabh Magar RM, Raut Suraj. Heart Disease Prediction Using Machine Learning. Journal of Emerging Technologies and Innovative Research 2020;7(6):2081–5.
[48] Khan SN, Nawi NM, Shahzad A, Ullah A, Mushtaq MF, Mir J, Aamir M. Comparative analysis for heart disease prediction. JOIV: International Journal on Informatics Visualization 2017;1(4-2):227–31.
[49] Lakshmanarao A, Swathi Y, Sundareswar PSS. Machine learning techniques for heart disease prediction. Forest 2019;95(99):97.
[50] Hariharan K, Vigneshwar W, Sivaramakrishnan N, Subramaniyaswamy V. A comparative study on heart disease analysis using classification techniques. International Journal of Pure and Applied Mathematics 2018;119(12e):13357–66.
[51] Nashif S, Raihan MR, Islam MR, Imam MH. Heart disease detection by using machine learning algorithms and a real-time cardiovascular health monitoring system. World Journal of Engineering and Technology 2018;6(4):854–73.
[52] Ayatollahi H, Gholamhosseini L, Salehi M. Predicting coronary artery disease: a comparison between two data mining algorithms. BMC public health 2019;19(1):1–9.
[53] Subhadra K, Vikas B. Neural network based intelligent system for predicting heart disease. International Journal of Innovative Technology and Exploring Engineering 2019;8(5):484–7.
[54] Meshref H. Cardiovascular Disease Diagnosis: A Machine Learning Interpretation Approach. International Journal of Advanced Computer Science and Applications 2019;10(12).
[55] Chaithra N, Madhu B. Classification models on cardiovascular disease prediction using data mining techniques. Cardiovascular Diseases & Diagnosis 2018;(6):1–4.
[56] Sabay A, Harris L, Bejugama V, Jaceldo-Siegl K. Overcoming small data limitations in heart disease prediction by using surrogate data. SMU Data Science Review 2018;1(3):12.
[57] Bharti R, Khamparia A, Shabaz M, Dhiman G, Pande S, Singh P. Prediction of heart disease using a combination of machine learning and deep learning. Computational Intelligence and Neuroscience 2021.
[58] Shah D, Patel S, Bharti SK. Heart disease prediction using machine learning techniques. SN Computer Science 2020;1(6):1–6.
[59] Arghandabi H, Shams P. A Comparative Study of Machine Learning Algorithms for the Prediction of Heart Disease. International Journal for Research in Applied Science and Engineering Technology (IJRASET) 2020;8(XII):677–83.
[60] Singh A, Kumar R. Heart disease prediction using machine learning algorithms. electrical and electronics engineering (ICE3). IEEE; 2020.
[61] Almustafa KM. Prediction of heart disease and classifiers' sensitivity analysis. BMC bioinformatics 2020;21(1):1–18.
[62] Rabbi MF, Uddin MP, Ali MA, Kibria MF, Afjal MI, Islam MS, Nitu AM. Performance evaluation of data mining classification techniques for heart disease prediction. American Journal of Engineering Research 2018;7(2):278–83.
[63] P. D. C. Geetha S, Kalaivani V, Haritha CJ, Preetha G. Prediction Techniques of Heart Disease and Diabetes Disease using Machine Learning. Turkish Journal of Computer and Mathematics Education 2021;12(10):3316–25.
[64] Gao X-Y, Amin Ali A, Shaban Hassan H, Anwar EM. Improving the Accuracy for Analyzing Heart Diseases Prediction Based on the Ensemble Method. Complexity 2021.
[65] Agrahara A. Heart Disease Prediction Using Machine Learning Algorithms. International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2020;6(4):137–49.
[66] Alotaibi FS. Implementation of machine learning model to predict heart failure disease. International Journal of Advanced Computer Science and Applications 2019;10(6):261–8.
[67] Poorani S, Hemalatha D. Machine Learning Techniques for Heart Disease Prediction. Journal of Cardiovascular Disease Research 2021;12(1):93–6.
[68] Tarawneh M, Embarak O. Hybrid approach for heart disease prediction using data mining techniques. In: International Conference on Emerging Internetworking, Data & Web Technologies. Springer; 2019.
[69] Dhar, S., K. Roy, T. Dey, P. Datta and A. Biswas. A hybrid machine learning approach for prediction of heart diseases. Computing Communication and Automation, IEEE 2018.
[70] Prasad R, Anjali P, Adil S, Deepa N. Heart disease prediction using logistic regression algorithm using machine learning. IJEAT 2019:2249–8958.
[71] Haq AU, Li JP, Memon MH, Nazir S, Sun R. A hybrid intelligent system framework for the prediction of heart disease using machine learning algorithms. Mobile Information Systems 2018.
[72] Katarya R, Meena SK. Machine learning techniques for heart disease prediction: a comparative study and analysis. Health and Technology 2021;11(1):87–97.
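
Illustrative sketch. The general framework described in Section 6 (clean missing values, normalize the attributes, classify, report accuracy) can be illustrated with a minimal, standard-library-only Python sketch. It is not an implementation from any of the reviewed studies. The records, the three attributes (age, resting blood pressure, cholesterol), the mean-imputation and min-max choices, and all function names are assumptions introduced here for illustration. KNN stands in for any of the seven classifiers discussed.

```python
from collections import Counter
from math import dist
from statistics import mean

# Hypothetical toy records: [age, resting blood pressure, cholesterol]; None = missing.
# These values are invented for illustration and are NOT taken from any CVD dataset.
TRAIN_X = [
    [45, 120, 190], [50, 130, None], [39, 118, 180], [41, 115, 200],
    [63, 150, 280], [67, 160, 300], [58, None, 270], [70, 155, 310],
]
TRAIN_Y = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = healthy, 1 = affected
TEST_X = [[43, 119, 185], [65, 158, 290]]
TEST_Y = [0, 1]


def column_means(rows):
    """Mean of each attribute, ignoring missing values."""
    return [mean(v for v in col if v is not None) for col in zip(*rows)]


def impute(rows, means):
    """Replace each missing value with the training mean of its attribute."""
    return [[m if v is None else v for v, m in zip(row, means)] for row in rows]


def min_max_params(rows):
    """Per-attribute (min, max), computed on the training data only."""
    return [(min(col), max(col)) for col in zip(*rows)]


def normalize(rows, params):
    """Scale each attribute with the training ranges (training rows land in [0, 1])."""
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, (lo, hi) in zip(row, params)] for row in rows]


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training records (Euclidean distance)."""
    nearest = sorted(zip(train_x, train_y), key=lambda p: dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Pre-processing statistics come from the training split and are reused on the test split.
means = column_means(TRAIN_X)
train = impute(TRAIN_X, means)
params = min_max_params(train)
train_n = normalize(train, params)
test_n = normalize(impute(TEST_X, means), params)

predictions = [knn_predict(train_n, TRAIN_Y, q) for q in test_n]
print("predictions:", predictions, "accuracy:", accuracy(TEST_Y, predictions))
```

The imputation means and normalization ranges are estimated on the training records only and then reused for the test records, so no information from the test split enters pre-processing. On real data, a maintained machine learning library would normally replace these hand-written helpers.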
