Implementation of A Heart Disease Risk Prediction Model Using Machine Learning

Hindawi
Computational and Mathematical Methods in Medicine

Volume 2022, Article ID 6517716, 14 pages
https://doi.org/10.1155/2022/6517716
Research Article
Implementation of a Heart Disease Risk Prediction Model Using
Machine Learning
K. Karthick ,1 S. K. Aruna ,2 Ravi Samikannu ,3 Ramya Kuppusamy ,4

Yuvaraja Teekaraman ,5 and Amruth Ramesh Thelkar 6
1
Department of Electrical and Electronics Engineering, GMR Institute of Technology, Rajam, Andhra Pradesh, India
2
Department of Computer Science and Engineering, School of Engineering and Technology, CHRIST (Deemed to be University),
Bangalore, Karnataka, India
3
Department of Electrical Computer and Telecommunications Engineering, Botswana International University of Science
and Technology, Palapye, Botswana
4
Department of Electrical and Electronics Engineering, Sri Sairam College of Engineering, Bangalore City, India
5
Department of Electronic and Electrical Engineering, The University of Sheffield, Sheffield S1 3JD, UK
6
Faculty of Electrical & Computer Engineering, Jimma Institute of Technology, Jimma University, Ethiopia
Correspondence should be addressed to Yuvaraja Teekaraman; yuvarajastr@ieee.org

and Amruth Ramesh Thelkar; amruth.rt@gmail.com
Received 3 February 2022; Accepted 23 March 2022; Published 2 May 2022
Academic Editor: Deepika Koundal
Copyright © 2022 K. Karthick et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Cardiovascular disease prediction aids practitioners in making more accurate health decisions for their patients. Early detection
can aid people in making lifestyle changes and, if necessary, ensuring effective medical care. Machine learning (ML) is a
plausible option for reducing and understanding heart symptoms of disease. The chi-square statistical test is performed to
select specific attributes from the Cleveland heart disease (HD) dataset. Support vector machine (SVM), Gaussian Naive Bayes,
logistic regression, LightGBM, XGBoost, and random forest algorithm have been employed for developing heart disease risk
prediction model and obtained the accuracy as 80.32%, 78.68%, 80.32%, 77.04%, 73.77%, and 88.5%, respectively. The data
visualization has been generated to illustrate the relationship between the features. According to the findings of the
experiments, the random forest algorithm achieves 88.5% accuracy during validation for 303 data instances with 13 selected
features of the Cleveland HD dataset.
1. Introduction Precision prediction and timely identification of cardiac

disease are essential for improving patient survival rate.
According to WHO data, heart disease is the leading cause Because of the increased collection of medical data, practi-
of mortality globally, resulting in 17.9 million deaths annu- tioners now have a great opportunity to promote healthcare
ally [1]. The most behavioural risk factors for cardiovascular diagnosis. ML plays a vital role in many applications like text
disease and stroke are unhealthy food, lack of physical activ- detection and recognition [3], early prediction [4], power
ity, smoking, and alcohol drinking [1]. A heart attack occurs quality disturbance detection [5], truck traffic classification
when the heart’s blood circulation is obstructed by arteries [6], and agriculture [7]. ML has now become an essential
plaque build-up. A thrombus in an artery causes a stroke tool in the healthcare sector to aid with patient diagnosis.
by impeding blood flow to the brain [2]. The symptoms The current methods for predicting and diagnosing cardiac
are common to other illnesses and might be confused with disease are mostly dependent on practitioners’ evaluation
indicators of ageing, making diagnosis difficult for of a patient’s medical history, signs, and physical assessment
practitioners. reports. Nowadays, information about patients with clinical
2 Computational and Mathematical Methods in Medicine
Table 1: UCI ML repository’s Cleveland heart disease dataset—feature subset [24].
Attribute name Attribute description

Age Age in years
Sex 1 denotes male and 0 denotes female
CP Chest pain type 1, typical angina; type 2, atypical angina; type 3, nonanginal pain; and type 4, asymptomatic
trestbps Resting blood pressure (in mmHg at entry to the health center)
chol Serum lipid level in mg/dL
fbs 1 denotes true, i.e., the fasting blood sugar level > 120 mg/dL; 0 denotes false
restecg Resting ECG results: null, normal; 1, ST-T wave abnormality; and 2, probable or definite left ventricular hypertrophy
thalach Maximum heart rate achieved
exang Exercise induced angina (1 = yes; null = no)
oldpeak ST depression induced by exercise relative to rest
slope The slope of the peak exercise ST segment (1, 2, and 3): 1, upsloping; 2, flat; and 3, downsloping
ca Number of major vessels (0-3) colored by fluoroscopy
thal Thalassemia: 3 = normal, 6 = fixed defect, and 7 = reversible defect
Table 2: Statistical outline of subset attributes.
Attributes Age Sex CP trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
mean 54.44 0.68 3.16 131.69 246.69 0.15 0.99 149.61 0.33 1.04 1.60 0.66 4.70 0.94
std 9.04 0.47 0.96 17.60 51.78 0.36 0.99 22.88 0.47 1.16 0.62 0.93 1.97 1.23
min 29 0 1 94 126 0 0 71 0 0 1 0 0 0
25% 48 0 3 120 211 0 0 133.5 0 0 1 0 3 0
50% 56 1 3 130 241 0 1 153 0 0.8 2 0 3 0
75% 61 1 4 140 275 0 2 166 1 1.6 2 1 7 2
max 77 1 4 200 564 1 2 202 1 6.2 3 3 7 4
reports is widely accessible in databases in the healthcare Woodland ML classifier had an 85 percent precision for car-
field, and it is rising rapidly day by day. In this article, the diovascular disease [10].
UCI ML repository’s Cleveland HD dataset was utilized for Gudadhe et al. [11] described the framework for predict-
developing the prediction model to heart disease. The ing the heart disease using SVM and obtained the accuracy
machine is trained for learning patterns based on the fea- as 80.41%. Kahramanli and Allahverdi [12] combined fuzzy
tures that are already present in the dataset. Classification and crisp values in health data and attained accuracy rates of
is an effective ML approach for prediction. When properly 84.24% to Pima Indian diabetes dataset and 86.8% for the
trained with adequate data, classification is an effective Cleveland HD dataset, respectively. Various ML classifica-
supervised ML method for identifying disease [8]. The pri- tion models [13–17] could be used to improve intelligence.
mary goal of this work is to employ contemporary ML tech- Kahramanli and Allahverdi [12] established the artificial
niques to construct the healthcare heart disease predictive and fuzzy-based model to the Pima Indian diabetes dataset
model. The Cleveland HD dataset was subjected to SVM and the Cleveland HD dataset and found 84.24% and
with radial basis function (RBF) kernel, Gaussian Naive 86.8% accuracy, respectively.
Bayes, logistic regression, LightGBM, XGBoost, and random Olaniyi et al. [18] established a prediction model and
forest algorithm, and the best performing prediction model achieved an accuracy of 85% using feedforward multilayer
for early diagnosis of heart disease was found. perceptron (MLP) and 87.5% using SVM on the UCI ML
datasets. Polat et al. [19] have employed k-nearest neighbour
algorithm and an artificial immune recognition framework
2. Related Work and achieved 87% accuracy on the Cleveland dataset. On a
Cleveland dataset, Detrano et al. [20] achieved 77% using
Nave Bayes, random forest, PART, C4.5, and multilevel per- the logistic regression algorithm. Saw et al. [21] have imple-
ceptron algorithm-based predictive model accuracy to HD mented the improved logistic regression classification model
dataset were determined to be in the range of 75.58%– for heart disease dataset. The fast decision tree and C4.5 tree
83.17% [9]. Moreover, Nave Bayes algorithm has the highest have been employed for HD prediction [22]. As a result of
accuracy as 83.17%, while other algorithms have less than the proposed model’s initial phase, trees and features have
80% accuracy [9]. Kumar et al. discovered that the Random been extracted. The genetic and fuzzy logic-based approach
Computational and Mathematical Methods in Medicine 3
Male Chest pain type: 1
68.0%
Chest pain type: 2 7.6%
16.5%
Chest pain type: 4

47.5%
32.0%
28.4%
Female
Chest pain type: 3
(a) Sex (b) Chest pain type

Fasting blood sugar > 120 mg/dL
8.2%
91.8%
Fasting blood sugar > 120 mg/dL
(c) Fasting blood sugar level
Figure 1: Visualization of features of the Cleveland heart dataset.
1.0
Age 1 −0.098 0.1 0.28 0.21 0.12 0.15 −0.39 0.092 0.2 0.16 0.37 0.13 0.22
Sex −0.098 1 0.01 −0.064 −0.2 0.048 0.022 −0.049 0.15 0.1 0.038 0.086 0.38 0.22
0.8
CP 0.1 0.01 1 −0.036 0.072 −0.04 0.068 −0.33 0.38 0.2 0.15 0.23 0.25 0.41
trestbps 0.28 −0.064 −0.036 1 0.13 0.18 0.15 −0.045 0.065 0.19 0.12 0.098 0.13 0.16
0.6
chol 0.21 −0.2 0.072 0.13 1 0.0098 0.17 −0.0034 0.061 0.047 −0.0041 0.12 0.025 0.071
fbs 0.12 0.048 −0.04 0.18 0.0098 1 0.07 −0.0079 0.026 0.0057 0.06 0.14 0.054 0.059
0.4
restecg 0.15 0.022 0.068 0.15 0.17 0.07 1 −0.083 0.085 0.11 0.13 0.13 0.024 0.18
thalach −0.39 −0.049 −0.33 −0.045 −0.0034 −0.0079 −0.083 1 −0.38 −0.34 −0.39 −0.27 −0.26 −0.42
0.2
exang 0.092 0.15 0.38 0.065 0.061 0.026 0.085 −0.38 1 0.29 0.26 0.15 0.32 0.4
oldpeak 0.2 0.1 0.2 0.19 0.047 0.0057 0.11 −0.34 0.29 1 0.58 0.3 0.34 0.5
0.0
slope 0.16 0.038 0.15 0.12 −0.0041 0.06 0.13 −0.39 0.26 0.58 1 0.11 0.28 0.38
ca 0.37 0.086 0.23 0.098 0.12 0.14 0.13 −0.27 0.15 0.3 0.11 1 0.26 0.52
−0.2
thal 0.13 0.38 0.25 0.13 0.025 0.054 0.024 −0.26 0.32 0.34 0.28 0.26 1 0.5
target 0.22 0.22 0.41 0.16 0.071 0.059 0.18 −0.42 0.4 0.5 0.38 0.52 0.5 1
−0.4
Age Sex CP trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
Figure 2: Heat map of subset attributes.

25
20
15
10
0
80 100 120 140 160 180 200
The person’s maximum heart rate achieved
(a) thalach
35
30
25
20
15
10
0
200 300 400 500
Serum cholesterol in mg(dl)
(b) chol
40
30
20
10
0
100 120 140 160 180 200
Resting blood pressure in mmHg
(c) trestbps
Figure 3: Continued.
10
6
count
0
29 34 35 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 74 76 77
Age
Target
0
1
2
3
4
(d) The age distribution of people with heart disease
Figure 3: Distribution of various attributes.
has been proposed [23] which is a hybrid model to instantly value of “mean” indicates the feature’s average value. The
generate the rules using a fitness function, appropriate value of “std” reflects the feature’s standard deviation. The
genetic operators, and a rule encoding method. “min” indicates the feature’s minimal value. The 25%, 50%,
In this article, SVM with RBF kernel, Gaussian Naive and 75% are the percentile/quartile of each feature. The
Bayes, logistic regression, LightGBM, XGBoost, and random maximum value of the attribute is indicated by “max.”
forest algorithms were employed to evaluate the classifica- Statistical tests will be useful in determining which attri-
tion accuracy on UCI ML repository’s Cleveland HD dataset butes are having the most powerful relationship with the
[24]. The data visualization has also been done to illustrate performance variable. The “SelectKBest” class in Python’s
the relationship between the features. scikit-learn library is utilized to choose a distinct attribute
in a statistical test set. For nonnegative characteristics in this
3. Materials and Methods dataset, the statistical chi-square (chi2) test was used to pick
13 of the best features.
3.1. Data. The UCI ML repository’s Cleveland HD dataset
was used in this investigation [24]. As indicated in Table 1, 3.3. Dataset Visualization. The data visualization of features
a subset of 13 attributes were utilized in prediction of heart such as gender, chest pain category, and fasting blood sugar
disease with 303 data instances. Table 1 describes about level of the Cleveland heart dataset is shown in Figure 1.
the attributes and its description that were used in the pro- Males are more likely than females to get heart disease,
posed classification model. The clinical variables that were according to this Cleveland dataset. The majority of individ-
considered to be essential were given under attribute column uals with cardiovascular disease experience asymptomatic
in Table 1, and it is chosen based on the chi-square (chi2) chest discomfort.
feature selection method [25]. To develop the heart risk pre- Figure 2 depicts a heat map of the subset attributes,
diction model, the remaining 61 attributes of the dataset which serves as an instant visual summary. Thalassemia is
were excluded to improve the accuracy of the model. Except a genetic disorder that causes people to have low haemoglo-
for null, all other target values from 1 to 4 were considered bin levels than normal. Haemoglobin allows erythrocyte to
as risk of cardiovascular disease for developing the model. transmit oxygen. Figure 3 illustrates the distribution of tha-
The classification model consists of two classes, namely, lach, chol, trestbps, and people count those who are suffering
class 0 and 1. The target values 1 to 4 have been changed from cardiovascular disease based on to their age. Cardio-
as 1 during preprocessing. vascular disease is quite common in people over the age of
60, as well as adults aged 41 to 60. However, it is uncommon
3.2. Feature Selection. The statistical overview of subset attri- in the 19-year to 40-year-old age category and extremely
butes is shown in Table 2 for 303 instances. The count shows uncommon in the 0-year to 18-year-old age category.
us how many nonempty rows are there in a feature. The Figure 4 shows the correlation between attributes such as
200
180
160
thalach
140
120
100
80
200 300 400 500

chol
Target
0 3
1 4
2
(a) Scatterplot for thalach vs. chol
70
60
Age
50
40
30
0 1 2 3 4
target
(b) Age vs. target
100
80
60
Age
40
20
0 1 2 3
ca
Target
0 3
1 4
2
(c) Age vs. ca
200
180
160
Thalach
140
120
100
80
1 2 3 4
CP
Target
0 3
1 4
2
(d) thalach vs. CP
oldpeak 4
−2
0 1
exang
Target
0 3
1 4
2
(e) oldpeak vs. exang
Figure 4: Subset attribute correlation.
60
Age
40
0.05
sex
0.00
−0.05
200
trestbps
150
100
400
chol
200
1.0
fbs
0.5
0.0
2
restecg
0
200
thalach
150
100
1.0
exang
0.5
0.0
6
oldpeak
4
2
0
3
slope
1
3
2
ca
1
0
6
4
thal
2
0
4
target
0
25 50 75 −0.05 0.00 0.05 100 200 250 500 0 1 0.0 2.5 100 200 0 1 0 5 0.0 2.5 0.0 2.5 0 10 0 5
age sex trestbps chol fbs restecg thalach exang oldpeak slope ca thal target
CP
1 3
2 4
Figure 5: Pair plot.

import lightgbm
Assign d_train from lgb.Dataset with X_train & label = y_train
clf = lgb.train with categorical_feature=auto
Prediction y_pred = clf. predict(X_test)
convert into binary values for i range(0, len(y_pred))
setting threshold if y_pred[i]>=0.46:
import and print confusion matrix
print Accuracy
Pseudocode 1: LightGBM.
Table 3: Classification model—prediction accuracy.

import train & test_split
import XGBClassifier Accuracy
Machine learning classifier
xg.predict(X_test) Training set (80%) Test set (20%)
setting threshold if y_pred_train[i]>=0.5: SVM 92.56 80.32
y_pred_train[i]=1
Gaussian Naive Bayes 86.77 78.68
else:
y_pred_train[i]=0 Logistic regression 85.95 80.32
import confusion_matrix LightGBM 98.76 77.04
Compute confusion_matrix (y_pred, y_test) XGBoost 99.58 73.77
Print Accuracy Random forest 100 88.5
Pseudocode 2: XGBoost.
Bayes, logistic regression, LightGBM, XGBoost, and random
forest.
import RandomForestClassifier 4.1. Support Vector Machine. The SVM [26] classifier with
n_estimators=50. RBF kernel is a function that turns a nonlinear problem into
Fit the model (X_train, y_train) a linear problem in a multidimensional space. The RBF ker-
Predict model.predict(X_test) nel in SVM classification algorithm is defined as
Get model Score

K x, x ′ = e−γjjx−x jj ,
2
Rate people =0 ′
Check if len(people) >0:
ð1Þ
Rate people = len(people)/len(output)
Get the prediction of heart disease 2
where kx − x ′ k is the squared Euclidean distance between
import confusion_matrix
Compute confusion_matrix (y_pred, y_test) two feature vectors and γ is a scalar.
Print Accuracy
4.2. Gaussian Naive Bayes. Gaussian Naive Bayes is the clas-
sification algorithm, and here, the 13 features stochastically
Pseudocode 3: Random forest. independent for every class c and the prediction are given as
1
p xi ∣ c j = qffiffiffiffiffiffiffiffiffiffiffiffiffi e−1/2ðxi −μi, j /σi,j Þ for i = 1, 2, ⋯, 13 and j = 0, 1, 2, 3&4,
2
thalach and chol, age and target, age and ca, thalach and CP, 2Πσ2i, j
and oldpeak and exang with respect to target. Figure 5 shows
the pair plot that is useful to quickly explore distributions ð2Þ
and relationships between the attributes. In adult people,
total cholesterol levels < 200 mg/dL are generally preferred. where μi, j is the mean and σi, j is the root-mean square devi-
In the range 200-239 mg/dL, 240 mg/dL, and above, border- ation of the dataset.
lines are regarded to be high. A value of <40 mg/dL is mea-
sured as a risk factor for HD. A level of 41 mg/dL to 59 mg/ 4.3. Logistic Regression. The logistic regression model is
dL is considered borderline low. The maximal HDL level expressed as
that may be measured is 60 mg/dL. T
eα+X β
Pðy = 1 ∣ xÞ = pðα, βÞ = T
, ð3Þ
1 + eα+X β
4. Proposed Machine Learning Classifiers
where α is intercept arguments, β is slope argument vector,
To evaluate the heart disease risk prediction, six ML classi- and Dn = fðX i , yi Þ, i = 1, 2, 3, ⋯, ng is the independent data
fiers were used: SVM with RBF kernel, Gaussian Naive size of n with 303 data instances.
Machine learning classifiers - Model accuracy
Random forest 88.5

100
73.77
XGBoost 99.58
LightGBM 77.04
98.76
80.32
Logistic regression 85.95
78.68
Gaussian naive bayes 86.77
80.32
Support vector machine 92.56
0 20 40 60 80 100 120
Test set
Training set
Figure 6: ML classification models—prediction accuracy.
Confusion matrix for SVM - Test set Confusion matrix for Gaussian Naive Bayes - Test set
Predicted label Predicted label
30 30
0 32 3 25 0 30 5 25
Actual label
Actual label
20 20
15 15
1 9 17 10 1 8 18 10
5 5
0 1 0 1
(a) SVM (b) Gaussian Naive Bayes

Confusion matrix for Logistic Regression - Test set Confusion matrix for LightGBM - Test set
30
25
0 32 3 25 0 29 6
Actual label
Actual label
20
20
15 15
1 9 17 10 1 8 18
10
5
0 1 0 1
(c) Logistic regression (d) LightGBM

Confusion matrix for XGBoost - Test set Confusion matrix for Random Forest - Test set
30
25
0 29 6 0 30 2 25
Actual label
Actual label
20 20
15 15
1 10 16 1 5 24 10
10
5
0 1 0 1
(e) XGBoost (f) Random forest
Figure 7: Confusion matrix.

ROC for SVM test set

1.0
0.8
True positive rate

0.6 AUC: 0.78
0.4
0.2
0.0
0.0 0.2 0.4 0.6 0.8 1.0
False positive rate
(a) SVM
ROC for Gaussian naive bayes test set
1.0
0.8
True positive rate
0.6 AUC: 0.77
0.4
0.2
0.0
0.0 0.2 0.4 0.6 0.8 1.0
False positive rate
(b) Gaussian Naive Bayes
ROC for Logistic regression test set
1.0
0.8
True positive rate
0.6 AUC: 0.78
0.4
0.2
0.0
0.0 0.2 0.4 0.6 0.8 1.0
False positive rate
(c) Logistic regression
ROC for LightGBM test set

1.0
0.8
True positive rate

0.6 AUC: 0.76
0.4
0.2
0.0
0.0 0.2 0.4 0.6 0.8 1.0
False positive rate
(d) LightGBM
ROC for XGBoost test set
1.0
0.8
True positive rate
0.6 AUC: 0.72
0.4
0.2
0.0
0.0 0.2 0.4 0.6 0.8 1.0
False positive rate
(e) XGBoost
ROC for random forest
1.0
0.8
True positive rate
AUC: 0.92
0.6
0.4
0.2
0.0
0.0 0.2 0.4 0.6 0.8 1.0

False positive rate
(f) Random forest
Figure 8: Performance of classification models—ROC curves.

4.4. LightGBM. The LightGBM [27] is a gradient-based Conflicts of Interest

boosting approach which makes use of tree-based learning
methods. The pseudocode of the algorithm is given below. The authors declare no conflict of interest.
4.5. XGBoost. XGBoost algorithm is adopted from [28] and Authors’ Contributions
the pseudocode of the algorithm is given below.
Karthick was responsible for the conceptualization and data
4.6. Random Forest. The random forest [29] constructs mul- curation and wrote the original draft; Aruna was responsible
tiple decision trees and the pseudocode of the algorithm is for the investigation and methodology supervision; Ravi
given below. Samikannu carried out formal analysis; Ramya Kuppusamy
and Yuvaraja Teekaraman wrote, reviewed, and edited the
5. Results and Discussion manuscript; Amruth Ramesh Thelkar carried out methodol-
ogy validation.
The Cleveland HD dataset is split into training and testing
set with a ratio of 80 : 20. The classification model accuracy
is evaluated using the performance matrices from confusion References
matrix and it is expressed as
[1] Who link: https://www.who.int/health-topics/cardiovascular-
diseases/#tab=tab_1.
TP + TN [2] J. L. Quah, S. Yap, S. O. Cheah et al., “Knowledge of signs and
%Accuracy = × 100, ð4Þ symptoms of heart attack and stroke among Singapore resi-
TP + TN + FP + FN
dents,” BioMed Research International, vol. 2014, 2014.
[3] K. Karthick, S. K. Aruna, and R. Manikandan, “Development
where TP stands for true positive, TN stands for true nega- and evaluation of the bootstrap resampling technique based
tive, FP stands for false positive, and FN stands for false neg- statistical prediction model for Covid-19 real time data: a data
driven approach,” Journal of Interdisciplinary Mathematics,
ative. Table 3 gives testing set and training set accuracy in %
pp. 1–13, 2022.
for all the six classifier models. Figure 6 depicts the accuracy
[4] K. Kanagarathinam and K. Sekar, “Estimation of reproduction
of all models graphically. Figures 7 and 8 show the confusion
number and early prediction of 2019 novel coronavirus disease
matrix and receiver operating characteristic (ROC) curves (COVID-19) outbreak in India using statistical computing
for all six ML classification models. The validation indicates approach,” Epidemiology and Health, vol. 42, 2020.
that the random forest algorithm provides better accuracy in [5] K. Sekar, S. Kumar, and K. K, “Power quality disturbance
prediction. The test set prediction accuracy of the random detection using machine learning algorithm,” in 2020 IEEE
forest algorithm is 88.5% with ROC of 0.92 for the selected International Conference on Advances and Developments in
13 attributes of the 303 data instances of the UCI ML repo- Electrical and Electronics Engineering (ICADEE), pp. 1–5,
sitory’s Cleveland HD dataset. The area under the curve 2020.
(AUC) is an indicator of a classifier’s ability to differentiate [6] N. Tahaei, J. J. Yang, M. G. Chorzepa, S. S. Kim, and S. A. Dur-
among classes and can be used to analyse the receiver oper- ham, “Machine learning of truck traffic classification groups
ating characteristic (ROC) curve. The greater the AUC, the from weigh-in-motion data,” Machine Learning with Applica-
more accurate the model is at discriminating between tions, vol. 6, p. 100178, 2021.
favourable and unfavourable classes. [7] V. Meshram, K. Patil, V. Meshram, D. Hanchate, and S. D.
Ramkteke, “Machine learning in agriculture domain: a state-
6. Conclusion of-art survey,” Artificial Intelligence in the Life Sciences,
vol. 1, p. 100010, 2021.
The six ML classification algorithms, namely, SVM with [8] S. Uddin, A. Khan, M. E. Hossain, and M. A. Moni, “Compar-
RBF kernel, Gaussian Naive Bayes, logistic regression, ing different supervised machine learning algorithms for dis-
LightGBM, XGBoost, and random forest, were applied to ease prediction,” BMC medical informatics and decision
making, vol. 19, no. 1, p. 281, 2019.
UCI ML repository’s Cleveland HD dataset, and the predic-
tion model has been developed for cardiovascular disease. [9] C. B. C. Latha and S. C. Jeeva, “Improving the accuracy of pre-
diction of heart disease risk based on ensemble classification
The random forest algorithm provides better accuracy as
techniques,” Informatics in Medicine Unlocked, vol. 16,
88.5% followed by SVM, and logistic regression provides p. 100203, 2019.
80.32% accuracy for the selected 13 attributes using the
[10] N. K. Kumar, G. S. Sindhu, D. K. Prashanthi, and A. S.
chi-square distribution. In this classification model, totally Sulthana, “Analysis and prediction of cardio vascular disease
303 data instances have been used. In future, various heart using machine learning classifiers,” in 2020 6th International
disease datasets from health data repository can be com- Conference on Advanced Computing and Communication Sys-
bined, and the best performing classification model using tems (ICACCS), pp. 15–21, 2020.
contemporary machine learning models can be outlined. [11] M. Gudadhe, K. Wankhade, and S. Dongre, “Decision support
system for heart disease based on support vector machine and
Data Availability artificial neural network,” in 2010 International Conference on
Computer and Communication Technology (ICCCT), pp. 741–
The dataset is available in publicly accessible database. 745, 2010.
[12] H. Kahramanli and N. Allahverdi, “Design of a hybrid system [28] T. Chen and C. Guestrin, “Xgboost: a scalable tree boosting
for the diabetes and heart diseases,” Expert systems with appli- system,” in Proceedings of the 22nd ACM SIGKDD Interna-
cations, vol. 35, no. 1-2, pp. 82–89, 2008. tional Conference on Knowledge Discovery and Data Mining,
[13] A. Gupta, Anjum, S. Gupta, and R. Katarya, “InstaCovNet-19: pp. 785–794, San Francisco, CA, USA, 2016.
a deep learning classification model for the detection of [29] A. Sarica, A. Cerasa, and A. Quattrone, “Random forest algo-
COVID-19 patients using chest X-ray,” Applied Soft Comput- rithm for the classification of neuroimaging data in Alzhei-
ing, vol. 99, p. 106859, 2021. mer's disease: a systematic review,” Frontiers in Aging
[14] I. H. Sarker, “Machine learning: algorithms, real-world appli- Neuroscience, vol. 9, p. 329, 2017.
cations and research directions,” SN COMPUT. SCI., vol. 2,
no. 3, p. 160, 2021.
[15] Y. Pathak, P. K. Shukla, A. Tiwari, S. Stalin, S. Singh, and P. K.
Shukla, “Deep transfer learning based classification model for
COVID-19 disease,” Ing Rech Biomed, vol. 22, 2022.
[16] R. Muazu Musa, A. P. A. Majeed, Z. Taha, S. W. Chang, A. F.
A. Nasir, and M. R. Abdullah, “A machine learning approach
of predicting high potential archers by means of physical fit-
ness indicators,” PLoS One, vol. 14, no. 1, p. e0209638, 2019.
[17] P. Kota, A. Madenahalli, and R. Guturi, “Heart disease classifica-
tion comparison among patients and normal subjects using
machine learning and artificial neural network techniques,” Inter-
national Journal of Biosensors & Bioelectronics, vol. 7, no. 3, 2021.
[18] E. O. Olaniyi, O. K. Oyedotun, and K. Adnan, “Heart diseases
diagnosis using neural networks arbitration,” International
Journal of Intelligent Systems and Applications (IJISA), vol. 7,
no. 12, pp. 75–82, 2015.
[19] K. Polat, S. Sahan, and S. Günes, “Automatic detection of heart
disease using an artificial immune recognition system (AIRS)
with fuzzy resource allocation mechanism and k-nn (nearest
neighbour) based weighting preprocessing,” Expert Systems
with Applications, vol. 32, no. 2, pp. 625–631, 2007.
[20] R. Detrano, A. Janosi, W. Steinbrunn et al., “International
application of a new probability algorithm for the diagnosis
of coronary artery disease,” The American journal of cardiol-
ogy, vol. 64, no. 5, pp. 304–310, 1989.
[21] M. Saw, T. Saxena, S. Kaithwas, R. Yadav, and N. Lal, “Estima-
tion of prediction for getting heart disease using logistic regres-
sion model of machine learning,” in 2020 International
Conference on Computer Communication and Informatics
(ICCCI), pp. 1–6, 2020.
[22] R. El-Bialy, M. A. Salamay, O. H. Karam, and M. E. Khalifa,
“Feature analysis of coronary artery heart disease data sets,”
Procedia Computer Science, vol. 65, pp. 459–468, 2015.
[23] K. Mankad, P. S. Sajja, and R. Akerkar, “Evolving rules using
genetic fuzzy approach - an educational case study,” International
Journal on Soft Computing (IJSC), vol. 2, no. 1, pp. 35–46, 2011.
[24] Heart disease dataset: https://archive.ics.uci.edu/ml/datasets/
heart+Disease.
[25] T. M. Franke, T. Ho, and C. A. Christie, “The chi-square test:
often used and more often misinterpreted,” American Journal
of Evaluation, vol. 33, no. 3, pp. 448–458, 2012.
[26] Y. Zhang, “Support vector machine classification algorithm
and its application,” in Information Computing and Applica-
tions. ICICA 2012, C. Liu, L. Wang, and A. Yang, Eds.,
Springer, Berlin, Heidelberg, 2012.
[27] G. Ke, Q. Meng, T. Finley et al., “LightGBM: a highly efficient
gradient boosting decision tree,” Advances in Neural Informa-
tion Processing Systems, vol. 30, pp. 3149–3157, 2017.

Implementation of A Heart Disease Risk Prediction Model Using Machine Learning

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Implementation of A Heart Disease Risk Prediction Model Using Machine Learning

Uploaded by

Copyright:

Available Formats

Hindawi

Computational and Mathematical Methods in Medicine

K. Karthick ,1 S. K. Aruna ,2 Ravi Samikannu ,3 Ramya Kuppusamy ,4

Correspondence should be addressed to Yuvaraja Teekaraman; yuvarajastr@ieee.org

Received 3 February 2022; Accepted 23 March 2022; Published 2 May 2022

Academic Editor: Deepika Koundal

1. Introduction Precision prediction and timely identiﬁcation of cardiac

Table 1: UCI ML repository’s Cleveland heart disease dataset—feature subset [24].

Attribute name Attribute description

Table 2: Statistical outline of subset attributes.

Male Chest pain type: 1

Chest pain type: 4

(a) Sex (b) Chest pain type

Fasting blood sugar > 120 mg/dL

(c) Fasting blood sugar level

Figure 1: Visualization of features of the Cleveland heart dataset.

Figure 2: Heat map of subset attributes.

Figure 3: Distribution of various attributes.

200 300 400 500

(e) oldpeak vs. exang

Figure 4: Subset attribute correlation.

Figure 5: Pair plot.

Table 3: Classiﬁcation model—prediction accuracy.

Machine learning classifiers - Model accuracy

Random forest 88.5

Figure 6: ML classiﬁcation models—prediction accuracy.

(a) SVM (b) Gaussian Naive Bayes

(c) Logistic regression (d) LightGBM

(e) XGBoost (f) Random forest

Figure 7: Confusion matrix.

ROC for SVM test set

True positive rate

0.6 AUC: 0.77

0.6 AUC: 0.78

ROC for LightGBM test set

True positive rate

0.6 AUC: 0.72

0.0 0.2 0.4 0.6 0.8 1.0

Figure 8: Performance of classiﬁcation models—ROC curves.

4.4. LightGBM. The LightGBM [27] is a gradient-based Conflicts of Interest

You might also like