Olayinka Babe-2

Abstract:
Machine learning and artificial intelligence have been found useful in various disciplines during the course
of their development, especially in the enormous increasing data in recent years. It can be more reliable
for making better and faster decisions for disease predictions. So, machine learning algorithms are
increasingly finding their application to predict various diseases. Constructing a model can also help us
visualize and analyze diseases to improve reporting consistency and accuracy. This project has investigated
how to detect heart disease by applying various machine learning algorithms. The study in this project has
shown a two-step process. The heart disease dataset is first prepared into a required format for running
through machine learning algorithms. Medical records and other information about patients are gathered
from the UCI repository. The heart disease dataset is then used to determine whether or not the patients
have heart disease. Secondly, Many valuable results are shown in this project.
The accuracy rate of the machine learning algorithms, such as Logistic Regression, Support vector machine,
K-Nearest-Neighbors, Random Forest, and Gradient Boosting Classifier, are validated through the
confusion matrix. Current findings suggest that the Logistic Regression algorithm gives a high accuracy rate
of 95% compared to other algorithms. It also shows high accuracy for f1-score, recall, and precision than
the other four different algorithms. However, increasing the accuracy rates to approximately 97% to 100%
of the machine learning algorithms is the future study and challenging part of this research.
Keywords: Machine Learning, Artificial Intelligence, Heart Disease, Linear Regression, Support Vector
Machine,
Chapter 1.0
INTRODUCTION
Among all fatal diseases, heart disease is also considered one of the biggest killers in Africa. 20.1 million
people die each year from heart disease. Heart disease is a common and deadly disease today, shortening
people's lives. Life is based on the working of the heart. Because the heart is part of the body that we
desperately need and the place where life is impossible. Heart disease impairs heart function and can lead
to or cause discomfort to the patient before death. Estimating the likelihood of a lack of blood flow to the
heart is one of the main problems of heart disease. Prediction methods can be constructed by applying
autonomic regression analysis of longitudinal studies. Medical institutions are required to store large
amounts of data in databases due to advances in technology, and interpreting the data is a very difficult task,
by using data mining techniques and machine learning algorithms to overcome many specific data
management problems in medical centers and data analytics. Methods and algorithms help build specific
models and draw important conclusions from data sets. An important function of the heart involves pumping
interstitial fluid from the blood into the extracellular space. From the discussion above, it is clear that the
heart is an important organ of the body. A disturbed mind leads to a disturbed whole body. Factors that
cause heart failure refer to serious abnormal conditions of the heart and blood vessels (arteries and veins)
called heart disease [1]. Different types of heart disease are:
• Coronary artery disease – refers to the buildup of cholesterol plaques that cause hardening or narrowing
of the coronary arteries (arteries that supply blood to the heart).
• Cardiomyopathy – refers to diseases of the heart muscle, which have a variety of causes.
• Angina pectoris – Chest pain caused by reduced blood flow to part of the heart muscle.
• Valvular disease - Refers to diseases that affect one or more of the four heart valves.
• Congenital heart disease – malformation of the heart structure at birth.
• Cerebrovascular disease – refers to diseases of the blood vessels that supply blood to the brain.
• Rheumatic heart disease – a condition in which rheumatic fever damages the heart muscle and valves.
• Heart attack – Permanent damage to a portion of the heart muscle that is cut off from its blood supply.
• Heart Failure – The heart's ability to pump is reduced, leading to heart failure.
The heart diseases mentioned above are just a few of the many heart diseases that affect both women and
men. Machine learning (ML) is a part of artificial intelligence (AI) that allows a software application to
improve its prediction accuracy without being formally programmed. In order to forecast new output
values, machine learning algorithms use historical data as input. Machine learning is a significant and
diversified field, and its scope and application are expanding daily. For this reason, machine learning has
become a crucial competitive differentiation in many organizations. Machine learning includes supervised,
unsupervised, and ensemble learning classifiers that are used to predict and find the accuracy of a dataset.
ML algorithms can build a model based on sample data called train data to make a decision or prediction.
The use of machine learning methods in the medical industry is the subject of the current study, which
mainly focuses on mimicking some human activities or mental processes and recognizing diseases from a
variety of inputs. The term “heart disease” refers to a group of conditions that affect the heart. According
to World Health Organization reports, cardiovascular diseases are now the leading cause of death
worldwide, approximately 17.9 million. Many types of research have been studied and performed with
various machine learning algorithms to diagnose heart diseases [2]. Machine learning is a technology which
can help to achieve diagnosis of heart disease before much damage happens to a person. As an emerging
field in science and technology, machine learning can classify whether a person might be suffering from a
heart disease or not. The usage of information technology in the health care industry is increasing day by
day to aid doctors in decision making activities. It helps doctors and physicians in disease management,
medications, and discovery of patterns and relationships among diagnosis data [3]. Current approaches to
predict cardiovascular risk fail to identify many people who would benefit from preventive treatment,
while others receive unnecessary intervention. Machine-learning offers an opportunity to improve
accuracy by exploiting complex interactions between risk factors.
Chapter 2.0
LITERATURE REVIEW
According to Ghumbre et al., machine learning and deep learning algorithms are applied to predict heart
diseases in the UCI dataset. The authors concluded that machine learning algorithms performed better for
this analysis. Machine learning techniques for heart disease prediction are published by Rohit Bharti et al.,
where the article concluded that different data mining and neural system should be used to find the
seriousness of among patients. Some analysis has been led to think about the implementation of a
predictive data mining strategy on the same dataset. Prediction of heart disease using machine learning is
studied by Jee S H et al. in which training and the testing dataset are performed by using a neural network
algorithm. K-Nearest Neighbor algorithm is reviewed to diagnose heart disease by Mai Shouman et al.
Some efficient algorithms have been used to detect HD, which shows results that each algorithm has its
strength to register the defined objectives [10]. The supervised network has been applied for HD diagnosis,
which is studied by Raihan M et al. This research idea has been broadened and inspired us worldwide by
publishing many articles. This article will construct an ML predictive model, which will help analyze heart
disease regarding the medical history. Data is collected from the UCI repository with patients' medical
records and attributes. This dataset would be utilized to predict whether the patients have heart disease
or not [12] . To diagnose the HD dataset, this article considers 14 attributes of a patient. It classifies
whether the disease is present or not and can help us diagnose diseases with fewer medical treatments.
For this study, this article considers various attributes of patients like age, sex, serum cholesterol, blood
pressure, exang, etc. Five different ML algorithms such as Logistic Regression (LR), Support vector machine
(SVM), K-Nearest-Neighbors (KNN), Random Forest (RF), and Gradient Boosting Classifier (GBC) are applied
for the purpose of classification and prediction of heart disease. Many beneficial results are presented in
this article. The attributes of the given dataset are trained under these algorithms. Based on the
characteristics of the HD dataset, a comparative analysis of algorithms has been studied regarding the
accuracy rate. All the selected ML algorithms are efficient by showing their accuracy, which is greater than
80%. The most efficient algorithm is Logistic Regression (LR), which gives us an accuracy rate of
approximately 95%. Finally, Logistic Regression (LR) algorithm will be considered to predict and diagnose
for heart disease of a patient. This write up is rearranged sequentially. The methodology has been
discussed. Various ML algorithms are studied briefly in chapter 3. Results and analysis are shown in Chapter
4. In the result section, algorithms are compared regarding the confusion matrix. Finally, a conclusion and
future scope have been drawn in chapter 5.

Chapter 3
3.0 Methodology
In this chapter, the method and analysis are described, which is performed in this research work. First of
all, the collection of data and selection of relevant attributes are the initial steps in this study. After that,
the relevant data is preprocessed into the required format. The given data is then separated into two
categories: training and testing datasets. The algorithms are then used, and the given data train the model.
The accuracy of this model is obtained by using the testing data. The procedures of this study are loaded
by using several modules such as a collection of data, selection of attributes, pre-processing of data, data
balancing, and prediction of disease.
3.1.0 Data Collection
In this write up, the dataset is collected from the UCI repository, which is considered in research analysis
by the many authors. So, the first step is organizing the dataset from the UCI repository to predict the
heart disease and then dividing the dataset into two sections: training and testing. In this article, 80% data
has been considered as a training dataset, and 20% dataset is used for testing purposes.
3.1.1. Dataset and Attributes
Attributes of a dataset are properties of a dataset, which are important to analyze and make a prediction
regarding our concern. Various attributes of the patient, like gender, chest pain, serum cholesterol, fasting
blood pressure, exang, etc., are considered for predicting diseases. However, the correlation matrix can
be used for attribute selection to construct a model[13].
Table 1: Attributes used are listed
Attributes Description Values
Sex Sex of subject Male/Female
(male-0, female-1)
CP Chest pain type Four types
Trestbps Resting blood Continuous
pressure
Chol Serum cholesterol Continuous

in mg/dl
FBS Fasting blood < or >120 mg/dl
pressure
Restecg Resting Five values
Electrocardiograph
Thalach Maximum heart Continuous
rate achieved
exang Exercise Induced Yes/No
Angina
oldpeak ST Depression Continuous
introduced by exer
slope Slope of Peak up/flat/down
Exercise ST
segment
Ca Number of major 0-3
vessels
thal Defect type Reversible/Fixed/Normal
Targets Heart disease 1 (disease), 0 (no
disease)
Age Patients age in Continuous
years
3.1.2. Pre-processing of Data
We need to clean and remove the missing or noise values from the dataset to obtain accurate and perfect
results, known as data cleaning. Using some standard techniques in python 3.8, we can fill missing and
noise values, then we need to transform our dataset by considering the dataset's normalization,
smoothing, generalization, and aggregation. Integration is one of the crucial phases in data preprocessing,
and various issues are considered here to integrate. Sometimes the dataset is more complex or difficult to
understand. In this case, the dataset needs to be reduced in a required format, which is best to get a good
result.
3.1.3. Balancing of Data
Balancing the dataset is necessary to improve the performance of machine learning algorithms. A balanced
dataset has the same amount of input samples for each output class (or target class). The imbalanced
dataset can be balanced by considering two methods, such as under sampling and over sampling.
3.1.4. Prediction of Disease
In this article, five different machine learning algorithms are implemented for classification. A comparative
analysis of the algorithms has been studied. Finally, this article considers an ML algorithm that gives the
highest accuracy rate for heart disease prediction [15].

Figure 1. Architecture of prediction models.
Chest FBS pain over
Age Sex type BP Cholesterol 120 EKG

Exercise ST Slope
results Max HR angina depression ST

1 0
70 4 130 322 2 109 0 2.4

2
67 0 3 115 564 2 160 0 1.6

2
57 1 2 124 261 0 0 141 0 0.3

1
64 1 4 128 263 0 0 105 1 0.2

2
74 0 2 120 269 0 2 121 1 0.2

1
65 1 4 120 177 0 0 140 0 0.4

1
56 1 3 130 256 1 2 142 1 0.6

2
59 1 4 110 239 0 2 142 1 1.2 2
60 1 4 140 293 0 2 170 0 1.2 2
63 0 4 150 407 0 2 154 0 4

2
59 1 4 135 234 0 0 161 0 0.5

2
53 1 4 142 226 0 2 111 1 0

1
44 1 3 140 235 0 2 180 0 0

1
1 0
1 0 0
61 1 1 134 234 0 0 145 0 2.6

2
57 0 4 128 303 0 2 159 0 0

1
71 0 4 112 149 0 0 125 0 1.6

2
46 1 4 140 311 0 0 120 1 1.8

2
53 1 4 140 203 1 2 155 1 3.1

3
64 1 1 110 211 0 2 144 1 1.8

2
40 1 1 140 199 0 0 178 1 1.4

1
67 1 4 120 229 0 2 129 1 2.6

2
48 2 130 245 0 2 180 0.2

2
43 4 115 303 181 0 1.2

2
1 2 0 0
1 0
0
47 1 4 112 204 0 143 0 0.1
1
54 0 2 132 288 1 2 159 1 0

1
48 0 3 130 275 0 0 139 0 0.2

1
46 0 4 138 243 0 2 152 1 0

2
51 0 3 120 295 0 2 157 0 0.6

1
58 1 3 112 230 0 2 165 0 2.5

2
71 0 3 110 265 1 2 130 0 0

1
57 1 3 128 229 0 2 150 0 0.4

2
66 1 4 160 228 0 2 138 0 2.3

1
37 0 3 120 215 0 0 170 0 0

1
59 1 4 170 326 0 2 140 1 3.4

3
50 1 4 144 200 0 2 126 1 0.9

2
48 1 4 130 256 1 2 150 1 0

1
1 0
1 0 0
61 1 4 140 207 0 2 138 1 1.9

1
59 1 1 160 273 0 2 125 0 0

1
42 1 3 130 180 0 0 150 0 0

1
48 1 4 122 222 0 2 186 0 0

1
40 1 4 152 223 0 0 181 0 0

1
62 0 4 124 209 0 0 163 0 0

1
44 1 3 130 233 0 0 179 1 0.4

1
46 1 2 101 197 1 0 156 0 0

1
59 3 126 218 1 0 134 0 2.2

2
58 3 140 211 1 165 1
1 2 0 0
1 0 0
1 0
1 0
49 3 118 149 2 126 0.8 1
44 4 110 197 2 177 0 0 1
66 2 160 246 0 120 1 0 2
65 0 4 150 225 2 114 0 1 2
42 1 4 136 315 0 0 125 1 1.8 2
52 1 2 128 205 1 0 184 0 0 1
65 0 3 140 417 1 2 157 0 0.8 1
63 0 2 140 195 0 0 179 0 0 1
45 0 2 130 234 0 2 175 0 0.6 2
41 0 2 105 198 0 0 168 0 0 1
61 1 4 138 166 0 2 125 1 3.6 2
60 0 3 120 178 1 0 96 0 0 1
59 0 4 174 249 0 0 143 1 0 2
62 1 2 120 281 0 2 103 0 1.4 2
57 1 3 150 126 1 0 173 0 0.2 1
51 0 4 130 305 0 0 142 1 1.2 2
44 1 3 120 226 0 0 169 0 0 1
60 0 1 150 240 0 0 171 0 0.9 1
63 1 1 145 233 1 2 150 0 2.3 3
1 0 0 1
1 0 0
57 1 4 150 276 0 2 112 1 0.6 2
51 1 4 140 261 0 2 186 1 0 1
58 0 2 136 319 1 2 152 0 0 1
44 0 3 118 242 0 0 149 0.3 2
47 3 108 243 0 0 152
61 4 120 260 140 1 3.6 2
57 0 4 120 354 0 163 1 0.6 1
70 1 2 156 245 0 2 143 0 0 1
76 0 3 140 197 0 1 116 0 1.1 2
67 0 4 106 223 0 0 142 0 0.3 1
45 1 4 142 309 0 2 147 1 0 2
45 1 4 104 208 0 2 148 1 3 2
39 0 3 94 199 0 0 179 0 0 1
42 0 3 120 209 0 0 173 0 0 2
56 1 2 120 236 0 0 178 0 0.8 1
58 1 4 146 218 0 0 105 0 2 2
35 1 4 120 198 0 0 130 1 1.6 2
58 1 4 150 270 0 2 111 1 0.8 1
41 1 3 130 214 0 2 168 0 2 2
1 4 0
1 4 0 2 0 2
1 0 0 0
1 0 2 1
1 0 0
1 0
57 1 4 110 201 0 0 126 1 1.5 2
42 1 1 148 244 0 2 178 0 0.8 1
62 1 2 128 208 1 2 140 0 0 1
59 1 1 178 270 0 2 145 0 4.2 3
41 0 2 126 306 0 0 163 0 0 1
50 1 4 150 243 0 2 128 0 2.6 2
59 1 2 140 221 0 0 164 1 0 1
61 0 4 130 330 0 2 169 0 0 1
54 124 266 2 109 1 2.2 2
54 110 206 108 1
52 4 125 212 168 1 1
47 4 110 275 118 1 2
66 4 120 302 2 151 0.4 2
58 4 100 234 0 156 0 0.1 1
64 0 3 140 313 0 0 133 0 0.2 1
50 0 2 120 244 0 0 162 0 1.1 1
44 0 3 108 141 0 0 175 0 0.6 2
67 1 4 120 237 0 0 71 0 1 2
4 0
4 2 1 3
1 0 0
49 0 4 130 269 0 0 163 0 0 1
57 1 4 165 289 1 2 124 0 1 2
63 1 4 130 254 0 2 147 0 1.4 2
48 1 4 124 274 0 2 166 0 0.5 2
51 1 3 100 222 0 0 143 1 1.2 2
60 0 4 150 258 0 2 157 0 2.6 2
59 1 4 140 177 0 0 162 1 0 1
45 0 2 112 160 0 0 138 0 0 2
55 0 4 180 327 0 1 117 1 3.4 2
41 1 2 110 235 0 0 153 0 0 1
60 0 4 158 305 0 2 161 0 0 1
54 0 3 135 304 1 0 170 0 0 1
42 1 2 120 295 0 0 162 0 0 1
49 0 2 134 271 0 0 162 0 0 2
46 1 120 249 0 2 144 0.8 1
56 0 200 288 1 133 4
1 4 0
1 4 0 2 0 4
0 0 0
2 1
66 0 1 150 226 114 2.6 3
56 1 4 130 283 1 103 1.6 3
49 1 3 120 188 0 0 139 2 2
54 1 4 122 286 0 2 116 1 3.2 2
57 1 4 152 274 0 0 88 1 1.2 2
65 0 3 160 360 0 2 151 0 0.8 1
54 1 3 125 273 0 2 152 0 0.5 3
54 0 3 160 201 0 0 163 0 0 1
62 1 4 120 267 0 0 99 1 1.8 2
52 0 3 136 196 0 2 169 0 0.1 2
52 1 2 134 201 0 0 158 0 0.8 1
60 1 4 117 230 1 0 160 1 1.4 1
63 0 4 108 269 0 0 169 1 1.8 2
66 1 4 112 212 0 2 132 1 0.1 1
42 1 4 140 226 0 0 178 0 0 1
64 1 4 120 246 0 2 96 1 2.2 3
54 1 3 150 232 0 2 165 0 1.6 1
46 0 3 142 177 0 2 160 1 1.4 3
67 0 3 152 277 0 0 172 0 0 1
1 4 0
1 4 0 2
0 2 0 1
2 0
56 1 4 125 249 1 2 144 1 1.2 2
34 0 2 118 210 0 0 192 0 0.7 1
57 1 4 132 207 0 0 168 1 0 1
64 145 212 2 132 0 2 2
59 138 271 182 0 0 1
50 3 140 233 163 0.6 2
51 1 125 213 125 1.4 1
54 2 192 283 2 195 0 1
53 4 123 282 0 95 1 2 2
52 1 4 112 230 0 0 160 0 0 1
40 1 4 110 167 0 2 114 1 2 2
58 1 3 132 224 0 2 173 0 3.2 1
41 0 3 112 268 0 2 172 1 0 1
41 1 3 112 250 0 0 179 0 0 1
50 0 3 120 219 0 0 158 0 1.6 2
54 0 3 108 267 0 2 167 0 0 1
64 0 4 130 303 0 0 122 0 2 2
0 0 0
1 0 1 2
1 0 0 0
1 0 2 1
1 0 0
1 0
51 0 3 130 256 0 2 149 0 0.5 1
46 0 2 105 204 0 0 172 0 0 1
55 1 4 140 217 0 0 111 1 5.6 3
45 1 2 128 308 0 2 170 0 0 1
56 1 1 120 193 0 2 162 0 1.9 2
66 0 4 178 228 1 0 165 1 1 2
38 1 1 120 231 0 0 182 1 3.8 2
62 0 4 150 244 0 0 154 1 1.4 2
55 1 2 130 262 0 0 155 0 0 1
58 1 4 128 259 0 2 130 1 3 2
43 1 110 211 0 161 0 1
64 0 180 325 0 154 0 1
50 0 4 110 254 159 0
53 1 3 130 197 1 152 1.2 3
45 0 4 138 236 0 2 152 1 0.2 2
65 1 1 138 282 1 2 174 0 1.4 2
4 0 0
4 0 1
0 2 0 1
2 0
69 1 1 160 234 1 2 131 0 0.1 2
69 1 3 140 254 0 2 146 0 2 2
67 1 4 100 299 0 2 125 1 0.9 2
68 0 3 120 211 0 2 115 0 1.5 2
34 1 1 118 182 0 2 174 0 0 1
62 0 4 138 294 1 0 106 0 1.9 2
51 1 4 140 298 0 0 122 1 4.2 2
46 1 3 150 231 0 0 147 0 3.6 2
67 1 4 125 254 1 0 163 0 0.2 2
50 1 3 129 196 0 0 163 0 0 1
42 1 3 120 240 1 0 194 0 0.8 3
56 0 4 134 409 0 2 150 1 1.9 2
41 1 4 110 172 0 2 158 0 0 1
42 0 4 102 265 0 2 122 0 0.6 2
53 1 3 130 246 1 2 173 0 0 1
43 1 3 130 315 0 0 162 0 1.9 1
56 1 4 132 184 0 2 105 1 2.1 2
52 1 4 108 233 1 0 147 0.1 1
62 4 140 394 2 157 1.2 2
0 0 0
1 0 1 2
1 0 0 0
1 0 2 1
1 0 0
1 0
70 3 160 269 0 112 2.9
4 0 0
4 0 1
1 0 0 1
1 0
1 0
1 0
54 4 140 239 0 160 1.2
70 4 145 174 0 125 1 2.6 3
54 2 108 309 0 156 0 0 1
35 4 126 282 2 156 1 0 1
48 1 3 124 255 1 0 175 0 0 1
55 0 2 135 250 0 2 161 0 1.4 2
58 0 4 100 248 0 2 122 0 1 2
54 0 3 110 214 0 0 158 0 1.6 2
69 0 1 140 239 0 0 151 0 1.8 1
77 1 4 125 304 0 2 162 1 0 1
68 1 3 118 277 0 0 151 0 1 1
58 1 4 125 300 0 2 171 0 0 1
60 1 4 125 258 0 2 141 1 2.8 2
51 1 4 140 299 0 0 173 1 1.6 1
55 1 4 160 289 0 2 145 1 0.8 2
52 1 1 152 298 1 0 178 0 1.2 2
60 0 3 102 318 0 0 160 0 0 1
58 1 3 105 240 0 2 154 1 0.6 2
64 1 3 125 309 0 0 131 1 1.8 2

0
0 0 0
1 4 0 2 1 2
1 0 2 0 1
1 0 0
1
37 1 3 130 250 0 0 187 0 3.5 3
59 1 1 170 288 0 2 159 0.2 2
51 1 3 125 245 1 2 166 2.4 2
43 3 122 213 0 165 0.2 2
58 128 216 131 2.2
29 2 130 204 202 0
41 0 2 130 204 2 172 1.4 1
63 0 3 135 252 0 2 172 0 0 1
51 1 3 94 227 0 0 154 1 0 1
54 1 3 120 258 0 2 147 0 0.4 2
44 1 2 120 220 0 0 170 0 0 1
54 1 4 110 239 0 0 126 1 2.8 2
65 1 4 135 254 0 2 127 0 2.8 2
57 1 3 150 168 0 0 174 0 1.6 1
63 1 4 130 330 1 2 132 1 1.8 1
35 0 4 138 183 0 0 182 0 1.4 1
41 1 2 135 203 0 0 132 0 0 2

0
0 0 0
1 4 0 2 1 2
1 0 2 0 1
0 0
62 0 3 130 263 0 0 97 0 1.2 2
43 0 4 132 341 1 2 136 1 3 2
58 0 1 150 283 1 2 162 0 1 1
52 1 1 118 186 0 2 190 0 0 2
61 0 4 145 307 0 2 146 1 1 2
39 1 4 118 219 0 0 140 0 1.2 2
45 1 4 115 260 0 2 185 0 0 1
52 1 4 128 255 0 0 161 1 0 1
62 1 3 130 231 0 0 146 1.8 2
62 0 4 160 164 0 2 145 6.2 3
53 4 138 234 2 160 0 1
43 120 177 120 2.5
47 3 138 257 156 0
52 2 120 325 0 172 0.2 1
68 3 180 274 1 2 150 1 1.6 2
39 3 140 321 0 2 182 0 0 1

0
0 0 0
1 4 0 2 1 2
1 0 2 0 1
1 0 0
1
53 0 4 130 264 0 2 143 0 0.4 2
62 0 4 140 268 0 2 160 0 3.6 3
51 0 3 140 308 0 2 142 0 1.5 1
60 1 4 130 253 0 0 144 1 1.4 1
65 1 4 110 248 0 2 158 0 0.6 1
65 0 3 155 269 0 0 148 0 0.8 1
60 1 3 140 185 0 2 155 0 3 2
60 1 4 145 282 0 2 142 1 2.8 2
54 1 4 120 188 0 0 113 0 1.4 2
44 1 2 130 219 0 2 188 0 0 1
44 1 4 112 290 0 2 153 0 0 1
51 1 3 110 175 0 0 123 0 0.6 1
59 1 3 150 212 1 0 157 0 1.6 1
71 0 2 160 302 0 0 162 0 0.4 1
61 1 3 150 243 1 0 137 1 1 2
55 1 4 132 353 0 0 132 1 1.2 2
64 1 3 140 335 0 0 158 0 1
43 1 4 150 247 0 0 171 1.5 1
58 3 120 340 0 172 0 1
0 0 0
1 4 0 2 1 2
1 0 2 0 1
0 0
60 130 206 132 2.4
0 0 0
1 4 0 2 1 2
1 0 2 0
0 0
58 2 120 284 160 1.8 2
49 1 2 130 266 0 171 0.6 1
48 1 2 110 229 0 0 168 0 1 3
52 1 3 172 199 1 0 162 0 0.5 1
44 1 2 120 263 0 0 173 0 0 1
56 0 2 140 294 0 2 153 0 1.3 2
57 1 4 140 192 0 0 148 0 0.4 2
67 1 4 160 286 0 2 108 1 1.5 2

3.2. Machine Learning Algorithms
A data analysis technique called machine learning automates the development of analytical models. In this
observation, five different algorithms are studied to obtain the accuracy for finding the best one.
3.2.1 Logistic Regression Model
This ML model is often used for classification and predictive analysis, also known as logit regression. It is
also utilized to estimate the discrete values, like the binary outcome, from a collection of independent
variables. A binary result means two possibilities will happen: either the event happens (say 1), or it does
not happen (say 0
Figure 2. Logistic Regression model.
Where z is a function of x1, x2, w1, w2, and b. So,

3.2.2 Support Vector Machine (SVM)
SVM is the most popular supervised machine learning algorithm, which is used for classification as well as
regression [15]. Although, we primarily consider this algorithm for classification problems in ML. The
purpose of the SVM algorithm is to construct the optimum decision boundary or line that can divide
ndimensional space into classes so that we can quickly put the new data point in the correct category.
This optimal decision boundary is known as hyperplane [15]. SVM selects the extreme vectors that help
to create the hyperplane. The extreme vectors are known as support vectors, and the algorithm with
support vectors is called the support vector machine. Here below is a figure of SVM, where the decision
boundaries or hyperplane classifies two different categories. The training sample dataset is (x2, x1), where
x1 is the x-axis vector, and x2 is the target vector, see figure 3. Figure 3. Support Vector Machine.
vector, and x2 is the target vector, see figure 3.
Figure 3. Support Vector Machine.
3.2.3 K-Nearest Neighbors (K-NN)
K-NN is the most straightforward classification algorithm based on supervised learning techniques.
However, the K-NN algorithm can also be used for regression but is mostly used for classification [16]. A
new data point is classified by using the K-NN algorithm depending on how similar the existing data is
stored. It indicates that the K-NN algorithm can quickly classify new data when it appears in a suitable
category, see Figure 4. K-Nearest Neighbors. Here, the horizontal x-axis and vertical y-axis are independent
and dependent variables of a function, respectively. Figure 4 is a simple example of the KNN classification
algorithm. The test sample (Yellow Square with what symbol) should be classified as either a green triangle
or a red star in this algorithm. When k=3 is considered in a small dash circle, the yellow square would be
a green triangle because the majority number in this region is green triangles, not red stars. Now, if we
consider k=7, which is in a large dash circle, then the yellow square would be red stars because the number
of red stars is four and the green triangles are 3. So, It can conclude that the majority vote in a specific
region is important here, see Figure 5.
Figure 4. K-Nearest Neighbors.
3.2.4. Random Forest
Random Forest (RF) is a popular supervised machine learning algorithm used for both classifications and
regression. However, it is mainly used in classification problems. RF algorithm is based on the concept of
ensemble learning. Ensemble learning is a general machine learning procedure that can be used for
multiple learning algorithms to seek better predictive performance [2, 19]. So, the RF technique creates
several decision trees on the data samples, obtains the prediction from each tress, and finally gets the
better solution by considering the majority voting [17]. It is noted that the ensemble method is better
than a single decision tree because it mitigates the over-fitting by averaging results. The large number of
decision trees in RF helps us to get the accuracy and prevent over-fitting of the problems. The following
procedures are completed by RF algorithm, see all
Figure 5. General procedure of Random Forest.
3.2.5. Gradient Boosting
Gradient Boosting (GB) is a machine learning technique that is used in classification and regression
problems like others. It is a powerful algorithm in the field of machine learning [21]. As is well known, the
errors are classified into two categories in machine learning algorithms: Bias error and Variance error. GBC
helps us to minimize bias error sequentially in the model, see Figure 6. A diagram is described as follows
below, Figure 6. Diagram Gradient Boosting (Source [22]). As we can see that the ensemble consists of N
trees; see Figure 6. First of all, the feature matrix X and the labels y are used to train Tree 1. To calculate
the training set residual error r1, the predictions labeled are used. Then, Tree 2 is trained using feature
matrix X and residual errors r1 of Tree 1 as labels. The residual error r2 is then calculated by using
predictive error, see Figure 6.

Figure 6. Diagram Gradient Boosting (Source [22]).
4. 0 Result Analysis
4.1. Analysis of Heart Disease Dataset
Figure 7. Target class. Before going to study the performance of considering machine learning algorithms
in this research, analysis of the features of the heart disease dataset will be focused on here. The total
number of observations in the target attributes is 1025, where not having heart disease 499 (denoted by
0) and having heart disease 526 (represented by 1), see Figure 7. So, the percentage of not having heart
disease is 45.7%, and the percentage of having heart disease is 54.3%, see Figure 8(a). It is shown that the
rate of heart disease is more than the rate of no heart disease. In Figure 8(b), the sex feature of the HD
dataset is observed through the target feature. In sex attribute, the female and male numbers are 312
and 713, respectively. So, the male number is more than double of female number. We can see in this
figure 8(b) that the number of heart diseases in males is higher than in females. Similarly, no heart disease
among males is higher than in females. Figure 8(b) concludes that male is sufferer than female; for more
information, see figure 8(b). Figure 8. (a) Percentage of no heart disease and heart disease, (b) Comparison
between sex and target feature. Figure 9(a) shows a relationship between age and cholesterol with the
target feature. For an experiment, these features from the dataset are considered randomly. The trend of
no heart disease is higher from 55 to 68 when the cholesterol level is between 200 mg/dl and 300 mg/dl.
For validation, the KDE plot 9(b) is studied and shows similar statistics
4.1. Analysis of Heart Disease Dataset
Figure 7. Target class.
Figure 8 (a) Percentage of no heart disease and heart disease, (b)

Figure 10.Correlation matrix of the attributes.
The correlation of the features is drawn in figure 10. The main purpose of the correlation plot is
to define the positive and negative correlation between the features. However, it weak correlation. For
this reason, this article added another figure 11 to obtain these correlations efficiently. In figure 11, we
can see that three features like cp, thalack, and slope.
Figure 11. Correlation with the target feature.
Two strong correlations by cp and slope with target feature are studied statistically. As we can see in figure 12(a),
there is no heart disease when the cp level is more than 350; between 200 and 250. In addition, when the slope
is in 300 < slope-1 < 350, it shows that there is no disease, see figure 12(a). In contrast, for slope-2, there is a
heart disease in 300
4.2. Performance Analysis

In this article, various machine learning algorithms like
Logistic Regression (LR), Support vector machine (SVM), K Nearest-Neighbors (KNN), Random Forest
Classifier (RF), and Gradient Boosting Classifier (GBC) are studied broadly to predict the heart disease. The
accuracy rate of each algorithm has been measured, and selects the algorithm with the highest accuracy.
The accuracy rate is a correct prediction ratio to the total number of given datasets. It can be written as,
Accuracy =
Where, TP: True Positive
TN: True Negative
FP: False Positive
however,heart disease is sustainedmore when the cp is < slope-2 < 350.
Figure 12. (a) cp v/s target, (b) slope v/s target.

FN: False Negative
After performing the machine learning algorithms for training and testing the dataset, we can find the
better algorithm by considering the accuracy rate. The rate of accuracy is calculated with the support of a
confusion matrix. As shown in Table 2, the Logistic Regression algorithm gives us the best accuracy to
compare with other ML algorithms.

Table 2. Accuracy comparison of algorithms.
Figure 11. Correlation with the targ et

feature.
Algorithms Accuracy
Logistic Regression (LR) 0.95
Support vector machine (SVM) 0.90
K-Nearest-Neighbors (KNN) 0.87
Random Forest Classifier (RF) 0.79
Gradient Boosting Classifier (GBC) 0.80
Figure 13. Accuracy comparison of machine learning algorithms by bar
diagram.
This has been studied more on the LR machine learning algorithm through confusion matrix and f1-score.
The confusion matrix shows that the correct predicted value is 95%, see figure 14. f1-score is calculated
by, which is shown in figure 15,
f1 = 2 *
Where, Precision, P = , and Recall R = .
Figure 14. Confusion matrix of LR algorithm.
Figure 15. f1-score, precision, and recall of LR algorithm.

5. Conclusion and Future Scope
The heart is a vital organ in the human body; however, heart disease is a major concern in the world
because this disease is increasing day by day. So, we can handle this disease if we have a model which can
predict the initial condition of heart disease. So, we need to create a machine learning model that can be
more accurate and help to diagnose heart disease with less doubt and cost. It can be a primary technique
for knowing the condition of the heart. For this reason, this article focuses on the heart disease prediction
based on the accuracy rate of the confusion matrix. Following this idea, the statistics of the given
algorithms are used to estimate the accuracy rate of confusion matrix and validated the statistics among
the machine learning algorithms. When five algorithms are compared, it is found that Logistic Regression
algorithm is selected regarding the performance of high accuracy rate. The accuracy rate of Logistic
Regression model is 95%, which indicates that machine learning algorithm will be considered as a
predefined tool to seek heart diseases in the near future. Other statistics such as f1-score, recall, and
precision rate have been calculated for Logistic Regression as 95%, 95%, and 95%, respectively. These
estimated values suggest the highest accuracy of this algorithm.

These findings suggest that machine learning algorithms can effectively learn about the disease
predictions. We may extend this kind of study to diagnose other diseases. We may also analyze the past
history of data and combine other machine learning techniques for better study. Other possible further
applications of this study can include such as, cardiovascular disease prediction, diabetic prediction, breast
cancer prediction, tumor prediction, and multiple disease predictions.
References
[1] Animesh Hazra, Arkomita Mukherjee, Amit Gupta, Asmita Mukherjee, “Heart Disease Diagnosis and Prediction
Using Machine Learning : A Review”, Research Gate Publications, July 2020, pp.21372159.
[2] Wikipedia contributors. (2022, June 22). Machine learning. In
Wikipedia, The Free Encyclopedia. Retrieved 06:31, June 26,
2022, from https://en.wikipedia.org/w/index.php?title=Machine_learning &oldid=1094363111.
[3] Victor Chang, Vallabhanent Rupa Bhavani, Ariel Qianwen Xu, MA Hossain. An artificial intellegence model for
heart disease detection using machine learning. Healthcare Analytics, volume 2, November 2022, 100016.
https://doi.org/10.1016/j.health.2022.100016.
[4] Ghumbre, S. U., & Ghatol, A. A. (2020). Heart disease diagnosis using machine learning algorithm. In
Proceedings of the International Conference on Information Systems Design and Intelligent
Applications 2012 (INDIA 2020) held in Visakhapatnam, India, January 2012 (pp. 217-225). Springer,
Berlin, Heidelberg.
[5] Rohit Bharti, Aditya Khamparia, Mohammed Shabaz, Gaurav
Dhiman, Sagar pande, and Parneet Singh. Prediction of Heart Disease Using a combination of Machine
Learning and Deep learning. Hindawi Computational Intelligence and Neuroscience, Volume 2021, Article ID
8387680, 11 pages. https://doi.org/10.1155/2021/8387680.
[6] Khaled Mohamed Almustafa. Prediction of heart disease and classifiers sensitivity analysis. Almustafa
BMC Bioinfirmatics (2020) 21: 278. https://doi.org/10.1186/s12859-020-03626-y.
[7] Jee S H, Jang Y, Oh D J, Oh B H, Lee S H, Park S W & Yun Y D (2014), A coronary heart disease prediction model.
The Korean Heart Study. BMJ open, 4 (5), e005025.
[8] Mai Shouman, Tim Turner, and Rob Stocker. Applying kNearest Neighbour in diagnosis heart disease patients..
International Journal of Information and Education Technology, vol. 2, No. 3, June 2021.
[9] Ganna A, Magnusson P K, Pedersen N L, de Faire U, Reilly M, Arnlov J & Ingelsson E (2013). Multilocus genetic
risk scores for coronary heart disease prediction. Arteriosclerosis, thrombosis, and vascular biology, 33 (9),
2267-72.
[10] Raihan M, Mondal S, More A, Sagor M O F, Sikder G, Majumder M A & Ghosh K (2016, December). Smartphone
based ischeme heart disease (heart attact) risk prediction using clinical data and data mining approaches, a
prototype design. 23th International conference on Computer and Information Technology (ICCIT) (pp. 299-
303). IEEE.
[11] Acharya U R, Fujita H, Oh S L, Hagiwara Y, Tan J H & Adam M (2020). Application of deep convolutional neural
network for automated detection of myocardial infarction using ECG signals. Information Sciences, 415, 190-8.
[12] Takci H (2018). Improvement of heart attack prediction by the feature selection methods. Turkish
Journal of Electrical Engineering & Computer Sciences, 26 (1), 1-10.
[13] (https://www.kaggle.com/ronitf/heart-disease-uci).
[14] https://archive.ics.uci.edu/ml/datasets/heart+disease.
[15] Ahmad Shahin, Walid Moudani, Fadi Chakik, Mohamad Khalil et al. ”Data Mining in Healthcare Information
Systems: Case Studies in Northern Lebanon”, ISBN: 978 - 1 -4799 -3166 -8 ©2020 IEEE. [16] Puneet Bansal and
Ridhi Saini et al. “Classification of heart diseases from ECG signals using wavelet transform and kNN classifier”,
International Conference on Computing, Communication and Automation (ICCCA2020).
[17] Quazi Abidur Rahman, Larisa G. Tereshchenko, Matthew Kongkatong, Theodore Abraham, M. Roselle
Abraham, and Hagit Shatkay et al. “Utilizing ECG-based Heartbeat Classification for Hypertrophic
Cardiomyopathy Identification”, DOI 10.1109/TNB.2015.2426213, IEEE Transactions on Nano
Bioscience TNB-00035-2021.

Olayinka Babe-2

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Olayinka Babe-2

Uploaded by

Copyright:

Available Formats

Abstract:

called heart disease [1]. Different types of heart disease are:

of the coronary arteries (arteries that supply blood to the heart).

while others receive unnecessary intervention. Machine-learning offers an opportunity to improve

accuracy by exploiting complex interactions between risk factors.

future scope have been drawn in chapter 5.

balancing, and prediction of disease.

3.1.0 Data Collection

3.1.1. Dataset and Attributes

be used for attribute selection to construct a model[13].

Table 1: Attributes used are listed

Attributes Description Values

Sex Sex of subject Male/Female

CP Chest pain type Four types

Trestbps Resting blood Continuous

Chol Serum cholesterol Continuous

FBS Fasting blood < or >120 mg/dl

Restecg Resting Five values

Thalach Maximum heart Continuous

oldpeak ST Depression Continuous

slope Slope of Peak up/flat/down

Ca Number of major 0-3

thal Defect type Reversible/Fixed/Normal

Targets Heart disease 1 (disease), 0 (no

Age Patients age in Continuous

3.1.2. Pre-processing of Data

3.1.3. Balancing of Data

3.1.4. Prediction of Disease

highest accuracy rate for heart disease prediction [15].

Chest FBS pain over

Age Sex type BP Cholesterol 120 EKG

results Max HR angina depression ST

70 4 130 322 2 109 0 2.4

67 0 3 115 564 2 160 0 1.6

57 1 2 124 261 0 0 141 0 0.3

64 1 4 128 263 0 0 105 1 0.2

74 0 2 120 269 0 2 121 1 0.2

65 1 4 120 177 0 0 140 0 0.4

56 1 3 130 256 1 2 142 1 0.6

59 1 4 110 239 0 2 142 1 1.2 2

60 1 4 140 293 0 2 170 0 1.2 2

63 0 4 150 407 0 2 154 0 4

59 1 4 135 234 0 0 161 0 0.5

53 1 4 142 226 0 2 111 1 0

44 1 3 140 235 0 2 180 0 0

61 1 1 134 234 0 0 145 0 2.6

57 0 4 128 303 0 2 159 0 0

71 0 4 112 149 0 0 125 0 1.6

46 1 4 140 311 0 0 120 1 1.8

53 1 4 140 203 1 2 155 1 3.1

64 1 1 110 211 0 2 144 1 1.8

40 1 1 140 199 0 0 178 1 1.4

67 1 4 120 229 0 2 129 1 2.6

48 2 130 245 0 2 180 0.2

43 4 115 303 181 0 1.2

54 0 2 132 288 1 2 159 1 0

48 0 3 130 275 0 0 139 0 0.2

46 0 4 138 243 0 2 152 1 0

51 0 3 120 295 0 2 157 0 0.6

58 1 3 112 230 0 2 165 0 2.5

71 0 3 110 265 1 2 130 0 0

57 1 3 128 229 0 2 150 0 0.4