Download as pdf or txt
Download as pdf or txt
You are on page 1of 48

Abstract:

Machine learning and artificial intelligence have been found useful in various disciplines during the course

of their development, especially in the enormous increasing data in recent years. It can be more reliable

for making better and faster decisions for disease predictions. So, machine learning algorithms are

increasingly finding their application to predict various diseases. Constructing a model can also help us

visualize and analyze diseases to improve reporting consistency and accuracy. This project has investigated

how to detect heart disease by applying various machine learning algorithms. The study in this project has

shown a two-step process. The heart disease dataset is first prepared into a required format for running

through machine learning algorithms. Medical records and other information about patients are gathered

from the UCI repository. The heart disease dataset is then used to determine whether or not the patients

have heart disease. Secondly, Many valuable results are shown in this project.

The accuracy rate of the machine learning algorithms, such as Logistic Regression, Support vector machine,

K-Nearest-Neighbors, Random Forest, and Gradient Boosting Classifier, are validated through the

confusion matrix. Current findings suggest that the Logistic Regression algorithm gives a high accuracy rate

of 95% compared to other algorithms. It also shows high accuracy for f1-score, recall, and precision than

the other four different algorithms. However, increasing the accuracy rates to approximately 97% to 100%

of the machine learning algorithms is the future study and challenging part of this research.

Keywords: Machine Learning, Artificial Intelligence, Heart Disease, Linear Regression, Support Vector

Machine,
Chapter 1.0

INTRODUCTION

Among all fatal diseases, heart disease is also considered one of the biggest killers in Africa. 20.1 million

people die each year from heart disease. Heart disease is a common and deadly disease today, shortening

people's lives. Life is based on the working of the heart. Because the heart is part of the body that we

desperately need and the place where life is impossible. Heart disease impairs heart function and can lead

to or cause discomfort to the patient before death. Estimating the likelihood of a lack of blood flow to the

heart is one of the main problems of heart disease. Prediction methods can be constructed by applying

autonomic regression analysis of longitudinal studies. Medical institutions are required to store large

amounts of data in databases due to advances in technology, and interpreting the data is a very difficult task,

by using data mining techniques and machine learning algorithms to overcome many specific data

management problems in medical centers and data analytics. Methods and algorithms help build specific

models and draw important conclusions from data sets. An important function of the heart involves pumping

interstitial fluid from the blood into the extracellular space. From the discussion above, it is clear that the

heart is an important organ of the body. A disturbed mind leads to a disturbed whole body. Factors that

cause heart failure refer to serious abnormal conditions of the heart and blood vessels (arteries and veins)

called heart disease [1]. Different types of heart disease are:

• Coronary artery disease – refers to the buildup of cholesterol plaques that cause hardening or narrowing

of the coronary arteries (arteries that supply blood to the heart).

• Cardiomyopathy – refers to diseases of the heart muscle, which have a variety of causes.

• Angina pectoris – Chest pain caused by reduced blood flow to part of the heart muscle.

• Valvular disease - Refers to diseases that affect one or more of the four heart valves.
• Congenital heart disease – malformation of the heart structure at birth.

• Cerebrovascular disease – refers to diseases of the blood vessels that supply blood to the brain.

• Rheumatic heart disease – a condition in which rheumatic fever damages the heart muscle and valves.

• Heart attack – Permanent damage to a portion of the heart muscle that is cut off from its blood supply.

• Heart Failure – The heart's ability to pump is reduced, leading to heart failure.

The heart diseases mentioned above are just a few of the many heart diseases that affect both women and

men. Machine learning (ML) is a part of artificial intelligence (AI) that allows a software application to

improve its prediction accuracy without being formally programmed. In order to forecast new output

values, machine learning algorithms use historical data as input. Machine learning is a significant and

diversified field, and its scope and application are expanding daily. For this reason, machine learning has

become a crucial competitive differentiation in many organizations. Machine learning includes supervised,

unsupervised, and ensemble learning classifiers that are used to predict and find the accuracy of a dataset.

ML algorithms can build a model based on sample data called train data to make a decision or prediction.

The use of machine learning methods in the medical industry is the subject of the current study, which

mainly focuses on mimicking some human activities or mental processes and recognizing diseases from a

variety of inputs. The term “heart disease” refers to a group of conditions that affect the heart. According

to World Health Organization reports, cardiovascular diseases are now the leading cause of death

worldwide, approximately 17.9 million. Many types of research have been studied and performed with

various machine learning algorithms to diagnose heart diseases [2]. Machine learning is a technology which

can help to achieve diagnosis of heart disease before much damage happens to a person. As an emerging

field in science and technology, machine learning can classify whether a person might be suffering from a

heart disease or not. The usage of information technology in the health care industry is increasing day by

day to aid doctors in decision making activities. It helps doctors and physicians in disease management,
medications, and discovery of patterns and relationships among diagnosis data [3]. Current approaches to

predict cardiovascular risk fail to identify many people who would benefit from preventive treatment,

while others receive unnecessary intervention. Machine-learning offers an opportunity to improve

accuracy by exploiting complex interactions between risk factors.

Chapter 2.0
LITERATURE REVIEW

According to Ghumbre et al., machine learning and deep learning algorithms are applied to predict heart

diseases in the UCI dataset. The authors concluded that machine learning algorithms performed better for

this analysis. Machine learning techniques for heart disease prediction are published by Rohit Bharti et al.,

where the article concluded that different data mining and neural system should be used to find the

seriousness of among patients. Some analysis has been led to think about the implementation of a

predictive data mining strategy on the same dataset. Prediction of heart disease using machine learning is

studied by Jee S H et al. in which training and the testing dataset are performed by using a neural network

algorithm. K-Nearest Neighbor algorithm is reviewed to diagnose heart disease by Mai Shouman et al.

Some efficient algorithms have been used to detect HD, which shows results that each algorithm has its

strength to register the defined objectives [10]. The supervised network has been applied for HD diagnosis,

which is studied by Raihan M et al. This research idea has been broadened and inspired us worldwide by

publishing many articles. This article will construct an ML predictive model, which will help analyze heart

disease regarding the medical history. Data is collected from the UCI repository with patients' medical

records and attributes. This dataset would be utilized to predict whether the patients have heart disease

or not [12] . To diagnose the HD dataset, this article considers 14 attributes of a patient. It classifies

whether the disease is present or not and can help us diagnose diseases with fewer medical treatments.

For this study, this article considers various attributes of patients like age, sex, serum cholesterol, blood

pressure, exang, etc. Five different ML algorithms such as Logistic Regression (LR), Support vector machine

(SVM), K-Nearest-Neighbors (KNN), Random Forest (RF), and Gradient Boosting Classifier (GBC) are applied

for the purpose of classification and prediction of heart disease. Many beneficial results are presented in

this article. The attributes of the given dataset are trained under these algorithms. Based on the

characteristics of the HD dataset, a comparative analysis of algorithms has been studied regarding the

accuracy rate. All the selected ML algorithms are efficient by showing their accuracy, which is greater than
80%. The most efficient algorithm is Logistic Regression (LR), which gives us an accuracy rate of

approximately 95%. Finally, Logistic Regression (LR) algorithm will be considered to predict and diagnose

for heart disease of a patient. This write up is rearranged sequentially. The methodology has been

discussed. Various ML algorithms are studied briefly in chapter 3. Results and analysis are shown in Chapter

4. In the result section, algorithms are compared regarding the confusion matrix. Finally, a conclusion and

future scope have been drawn in chapter 5.


Chapter 3

3.0 Methodology

In this chapter, the method and analysis are described, which is performed in this research work. First of

all, the collection of data and selection of relevant attributes are the initial steps in this study. After that,

the relevant data is preprocessed into the required format. The given data is then separated into two

categories: training and testing datasets. The algorithms are then used, and the given data train the model.

The accuracy of this model is obtained by using the testing data. The procedures of this study are loaded

by using several modules such as a collection of data, selection of attributes, pre-processing of data, data

balancing, and prediction of disease.

3.1.0 Data Collection

In this write up, the dataset is collected from the UCI repository, which is considered in research analysis

by the many authors. So, the first step is organizing the dataset from the UCI repository to predict the

heart disease and then dividing the dataset into two sections: training and testing. In this article, 80% data

has been considered as a training dataset, and 20% dataset is used for testing purposes.

3.1.1. Dataset and Attributes

Attributes of a dataset are properties of a dataset, which are important to analyze and make a prediction

regarding our concern. Various attributes of the patient, like gender, chest pain, serum cholesterol, fasting
blood pressure, exang, etc., are considered for predicting diseases. However, the correlation matrix can

be used for attribute selection to construct a model[13].

Table 1: Attributes used are listed

Attributes Description Values

Sex Sex of subject Male/Female

(male-0, female-1)

CP Chest pain type Four types

Trestbps Resting blood Continuous

pressure

Chol Serum cholesterol Continuous


in mg/dl

FBS Fasting blood < or >120 mg/dl

pressure

Restecg Resting Five values

Electrocardiograph

Thalach Maximum heart Continuous

rate achieved
exang Exercise Induced Yes/No

Angina

oldpeak ST Depression Continuous

introduced by exer

slope Slope of Peak up/flat/down

Exercise ST

segment

Ca Number of major 0-3

vessels

thal Defect type Reversible/Fixed/Normal

Targets Heart disease 1 (disease), 0 (no

disease)

Age Patients age in Continuous

years

3.1.2. Pre-processing of Data

We need to clean and remove the missing or noise values from the dataset to obtain accurate and perfect

results, known as data cleaning. Using some standard techniques in python 3.8, we can fill missing and
noise values, then we need to transform our dataset by considering the dataset's normalization,

smoothing, generalization, and aggregation. Integration is one of the crucial phases in data preprocessing,

and various issues are considered here to integrate. Sometimes the dataset is more complex or difficult to

understand. In this case, the dataset needs to be reduced in a required format, which is best to get a good

result.

3.1.3. Balancing of Data

Balancing the dataset is necessary to improve the performance of machine learning algorithms. A balanced

dataset has the same amount of input samples for each output class (or target class). The imbalanced

dataset can be balanced by considering two methods, such as under sampling and over sampling.

3.1.4. Prediction of Disease

In this article, five different machine learning algorithms are implemented for classification. A comparative

analysis of the algorithms has been studied. Finally, this article considers an ML algorithm that gives the

highest accuracy rate for heart disease prediction [15].


Figure 1. Architecture of prediction models.

Chest FBS pain over

Age Sex type BP Cholesterol 120 EKG


Exercise ST Slope

results Max HR angina depression ST


1 0

70 4 130 322 2 109 0 2.4


2

67 0 3 115 564 2 160 0 1.6


2

57 1 2 124 261 0 0 141 0 0.3


1

64 1 4 128 263 0 0 105 1 0.2


2

74 0 2 120 269 0 2 121 1 0.2


1

65 1 4 120 177 0 0 140 0 0.4


1

56 1 3 130 256 1 2 142 1 0.6


2

59 1 4 110 239 0 2 142 1 1.2 2

60 1 4 140 293 0 2 170 0 1.2 2

63 0 4 150 407 0 2 154 0 4


2

59 1 4 135 234 0 0 161 0 0.5


2

53 1 4 142 226 0 2 111 1 0


1

44 1 3 140 235 0 2 180 0 0


1

1 0
1 0 0

61 1 1 134 234 0 0 145 0 2.6


2

57 0 4 128 303 0 2 159 0 0


1

71 0 4 112 149 0 0 125 0 1.6


2

46 1 4 140 311 0 0 120 1 1.8


2

53 1 4 140 203 1 2 155 1 3.1


3

64 1 1 110 211 0 2 144 1 1.8


2

40 1 1 140 199 0 0 178 1 1.4


1

67 1 4 120 229 0 2 129 1 2.6


2

48 2 130 245 0 2 180 0.2


2

43 4 115 303 181 0 1.2


2

1 2 0 0
1 0

0
47 1 4 112 204 0 143 0 0.1
1

54 0 2 132 288 1 2 159 1 0


1

48 0 3 130 275 0 0 139 0 0.2


1

46 0 4 138 243 0 2 152 1 0


2

51 0 3 120 295 0 2 157 0 0.6


1

58 1 3 112 230 0 2 165 0 2.5


2

71 0 3 110 265 1 2 130 0 0


1

57 1 3 128 229 0 2 150 0 0.4


2

66 1 4 160 228 0 2 138 0 2.3


1

37 0 3 120 215 0 0 170 0 0


1

59 1 4 170 326 0 2 140 1 3.4


3

50 1 4 144 200 0 2 126 1 0.9


2

48 1 4 130 256 1 2 150 1 0


1

1 0
1 0 0

61 1 4 140 207 0 2 138 1 1.9


1

59 1 1 160 273 0 2 125 0 0


1

42 1 3 130 180 0 0 150 0 0


1

48 1 4 122 222 0 2 186 0 0


1

40 1 4 152 223 0 0 181 0 0


1

62 0 4 124 209 0 0 163 0 0


1

44 1 3 130 233 0 0 179 1 0.4


1

46 1 2 101 197 1 0 156 0 0


1

59 3 126 218 1 0 134 0 2.2


2

58 3 140 211 1 165 1

1 2 0 0
1 0 0

1 0

1 0

49 3 118 149 2 126 0.8 1

44 4 110 197 2 177 0 0 1

66 2 160 246 0 120 1 0 2

65 0 4 150 225 2 114 0 1 2

42 1 4 136 315 0 0 125 1 1.8 2

52 1 2 128 205 1 0 184 0 0 1

65 0 3 140 417 1 2 157 0 0.8 1

63 0 2 140 195 0 0 179 0 0 1

45 0 2 130 234 0 2 175 0 0.6 2

41 0 2 105 198 0 0 168 0 0 1

61 1 4 138 166 0 2 125 1 3.6 2

60 0 3 120 178 1 0 96 0 0 1

59 0 4 174 249 0 0 143 1 0 2

62 1 2 120 281 0 2 103 0 1.4 2

57 1 3 150 126 1 0 173 0 0.2 1

51 0 4 130 305 0 0 142 1 1.2 2

44 1 3 120 226 0 0 169 0 0 1

60 0 1 150 240 0 0 171 0 0.9 1

63 1 1 145 233 1 2 150 0 2.3 3

1 0 0 1
1 0 0

57 1 4 150 276 0 2 112 1 0.6 2

51 1 4 140 261 0 2 186 1 0 1

58 0 2 136 319 1 2 152 0 0 1

44 0 3 118 242 0 0 149 0.3 2

47 3 108 243 0 0 152

61 4 120 260 140 1 3.6 2

57 0 4 120 354 0 163 1 0.6 1

70 1 2 156 245 0 2 143 0 0 1

76 0 3 140 197 0 1 116 0 1.1 2

67 0 4 106 223 0 0 142 0 0.3 1

45 1 4 142 309 0 2 147 1 0 2

45 1 4 104 208 0 2 148 1 3 2

39 0 3 94 199 0 0 179 0 0 1

42 0 3 120 209 0 0 173 0 0 2

56 1 2 120 236 0 0 178 0 0.8 1

58 1 4 146 218 0 0 105 0 2 2

35 1 4 120 198 0 0 130 1 1.6 2

58 1 4 150 270 0 2 111 1 0.8 1

41 1 3 130 214 0 2 168 0 2 2

1 4 0

1 4 0 2 0 2
1 0 0 0

1 0 2 1

1 0 0

1 0
57 1 4 110 201 0 0 126 1 1.5 2

42 1 1 148 244 0 2 178 0 0.8 1

62 1 2 128 208 1 2 140 0 0 1

59 1 1 178 270 0 2 145 0 4.2 3

41 0 2 126 306 0 0 163 0 0 1

50 1 4 150 243 0 2 128 0 2.6 2

59 1 2 140 221 0 0 164 1 0 1

61 0 4 130 330 0 2 169 0 0 1

54 124 266 2 109 1 2.2 2

54 110 206 108 1

52 4 125 212 168 1 1

47 4 110 275 118 1 2

66 4 120 302 2 151 0.4 2

58 4 100 234 0 156 0 0.1 1

64 0 3 140 313 0 0 133 0 0.2 1

50 0 2 120 244 0 0 162 0 1.1 1

44 0 3 108 141 0 0 175 0 0.6 2

67 1 4 120 237 0 0 71 0 1 2

4 0

4 2 1 3
1 0 0

49 0 4 130 269 0 0 163 0 0 1

57 1 4 165 289 1 2 124 0 1 2

63 1 4 130 254 0 2 147 0 1.4 2

48 1 4 124 274 0 2 166 0 0.5 2

51 1 3 100 222 0 0 143 1 1.2 2

60 0 4 150 258 0 2 157 0 2.6 2

59 1 4 140 177 0 0 162 1 0 1

45 0 2 112 160 0 0 138 0 0 2

55 0 4 180 327 0 1 117 1 3.4 2

41 1 2 110 235 0 0 153 0 0 1

60 0 4 158 305 0 2 161 0 0 1

54 0 3 135 304 1 0 170 0 0 1

42 1 2 120 295 0 0 162 0 0 1

49 0 2 134 271 0 0 162 0 0 2

46 1 120 249 0 2 144 0.8 1

56 0 200 288 1 133 4

1 4 0

1 4 0 2 0 4
0 0 0

2 1

66 0 1 150 226 114 2.6 3

56 1 4 130 283 1 103 1.6 3

49 1 3 120 188 0 0 139 2 2

54 1 4 122 286 0 2 116 1 3.2 2

57 1 4 152 274 0 0 88 1 1.2 2

65 0 3 160 360 0 2 151 0 0.8 1

54 1 3 125 273 0 2 152 0 0.5 3

54 0 3 160 201 0 0 163 0 0 1

62 1 4 120 267 0 0 99 1 1.8 2

52 0 3 136 196 0 2 169 0 0.1 2

52 1 2 134 201 0 0 158 0 0.8 1

60 1 4 117 230 1 0 160 1 1.4 1

63 0 4 108 269 0 0 169 1 1.8 2

66 1 4 112 212 0 2 132 1 0.1 1

42 1 4 140 226 0 0 178 0 0 1

64 1 4 120 246 0 2 96 1 2.2 3

54 1 3 150 232 0 2 165 0 1.6 1

46 0 3 142 177 0 2 160 1 1.4 3

67 0 3 152 277 0 0 172 0 0 1

1 4 0

1 4 0 2
0 2 0 1

2 0

56 1 4 125 249 1 2 144 1 1.2 2

34 0 2 118 210 0 0 192 0 0.7 1

57 1 4 132 207 0 0 168 1 0 1

64 145 212 2 132 0 2 2

59 138 271 182 0 0 1

50 3 140 233 163 0.6 2

51 1 125 213 125 1.4 1

54 2 192 283 2 195 0 1

53 4 123 282 0 95 1 2 2

52 1 4 112 230 0 0 160 0 0 1

40 1 4 110 167 0 2 114 1 2 2

58 1 3 132 224 0 2 173 0 3.2 1

41 0 3 112 268 0 2 172 1 0 1

41 1 3 112 250 0 0 179 0 0 1

50 0 3 120 219 0 0 158 0 1.6 2

54 0 3 108 267 0 2 167 0 0 1

64 0 4 130 303 0 0 122 0 2 2

0 0 0

1 0 1 2
1 0 0 0

1 0 2 1

1 0 0

1 0
51 0 3 130 256 0 2 149 0 0.5 1

46 0 2 105 204 0 0 172 0 0 1

55 1 4 140 217 0 0 111 1 5.6 3

45 1 2 128 308 0 2 170 0 0 1

56 1 1 120 193 0 2 162 0 1.9 2

66 0 4 178 228 1 0 165 1 1 2

38 1 1 120 231 0 0 182 1 3.8 2

62 0 4 150 244 0 0 154 1 1.4 2

55 1 2 130 262 0 0 155 0 0 1

58 1 4 128 259 0 2 130 1 3 2

43 1 110 211 0 161 0 1

64 0 180 325 0 154 0 1

50 0 4 110 254 159 0

53 1 3 130 197 1 152 1.2 3

45 0 4 138 236 0 2 152 1 0.2 2

65 1 1 138 282 1 2 174 0 1.4 2

4 0 0

4 0 1
0 2 0 1

2 0

69 1 1 160 234 1 2 131 0 0.1 2

69 1 3 140 254 0 2 146 0 2 2

67 1 4 100 299 0 2 125 1 0.9 2

68 0 3 120 211 0 2 115 0 1.5 2

34 1 1 118 182 0 2 174 0 0 1

62 0 4 138 294 1 0 106 0 1.9 2

51 1 4 140 298 0 0 122 1 4.2 2

46 1 3 150 231 0 0 147 0 3.6 2

67 1 4 125 254 1 0 163 0 0.2 2

50 1 3 129 196 0 0 163 0 0 1

42 1 3 120 240 1 0 194 0 0.8 3

56 0 4 134 409 0 2 150 1 1.9 2

41 1 4 110 172 0 2 158 0 0 1

42 0 4 102 265 0 2 122 0 0.6 2

53 1 3 130 246 1 2 173 0 0 1

43 1 3 130 315 0 0 162 0 1.9 1

56 1 4 132 184 0 2 105 1 2.1 2

52 1 4 108 233 1 0 147 0.1 1

62 4 140 394 2 157 1.2 2

0 0 0

1 0 1 2
1 0 0 0

1 0 2 1

1 0 0

1 0
70 3 160 269 0 112 2.9

4 0 0

4 0 1
1 0 0 1

1 0

1 0

1 0

54 4 140 239 0 160 1.2

70 4 145 174 0 125 1 2.6 3

54 2 108 309 0 156 0 0 1

35 4 126 282 2 156 1 0 1

48 1 3 124 255 1 0 175 0 0 1

55 0 2 135 250 0 2 161 0 1.4 2

58 0 4 100 248 0 2 122 0 1 2

54 0 3 110 214 0 0 158 0 1.6 2

69 0 1 140 239 0 0 151 0 1.8 1

77 1 4 125 304 0 2 162 1 0 1

68 1 3 118 277 0 0 151 0 1 1

58 1 4 125 300 0 2 171 0 0 1

60 1 4 125 258 0 2 141 1 2.8 2

51 1 4 140 299 0 0 173 1 1.6 1

55 1 4 160 289 0 2 145 1 0.8 2

52 1 1 152 298 1 0 178 0 1.2 2

60 0 3 102 318 0 0 160 0 0 1

58 1 3 105 240 0 2 154 1 0.6 2

64 1 3 125 309 0 0 131 1 1.8 2


0

0 0 0

1 4 0 2 1 2
1 0 2 0 1

1 0 0

1
37 1 3 130 250 0 0 187 0 3.5 3

59 1 1 170 288 0 2 159 0.2 2

51 1 3 125 245 1 2 166 2.4 2

43 3 122 213 0 165 0.2 2

58 128 216 131 2.2

29 2 130 204 202 0

41 0 2 130 204 2 172 1.4 1

63 0 3 135 252 0 2 172 0 0 1

51 1 3 94 227 0 0 154 1 0 1

54 1 3 120 258 0 2 147 0 0.4 2

44 1 2 120 220 0 0 170 0 0 1

54 1 4 110 239 0 0 126 1 2.8 2

65 1 4 135 254 0 2 127 0 2.8 2

57 1 3 150 168 0 0 174 0 1.6 1

63 1 4 130 330 1 2 132 1 1.8 1

35 0 4 138 183 0 0 182 0 1.4 1

41 1 2 135 203 0 0 132 0 0 2


0

0 0 0

1 4 0 2 1 2
1 0 2 0 1

0 0

62 0 3 130 263 0 0 97 0 1.2 2

43 0 4 132 341 1 2 136 1 3 2

58 0 1 150 283 1 2 162 0 1 1

52 1 1 118 186 0 2 190 0 0 2

61 0 4 145 307 0 2 146 1 1 2

39 1 4 118 219 0 0 140 0 1.2 2

45 1 4 115 260 0 2 185 0 0 1

52 1 4 128 255 0 0 161 1 0 1

62 1 3 130 231 0 0 146 1.8 2

62 0 4 160 164 0 2 145 6.2 3

53 4 138 234 2 160 0 1

43 120 177 120 2.5

47 3 138 257 156 0

52 2 120 325 0 172 0.2 1

68 3 180 274 1 2 150 1 1.6 2

39 3 140 321 0 2 182 0 0 1


0

0 0 0

1 4 0 2 1 2
1 0 2 0 1

1 0 0

1
53 0 4 130 264 0 2 143 0 0.4 2

62 0 4 140 268 0 2 160 0 3.6 3

51 0 3 140 308 0 2 142 0 1.5 1

60 1 4 130 253 0 0 144 1 1.4 1

65 1 4 110 248 0 2 158 0 0.6 1

65 0 3 155 269 0 0 148 0 0.8 1

60 1 3 140 185 0 2 155 0 3 2

60 1 4 145 282 0 2 142 1 2.8 2

54 1 4 120 188 0 0 113 0 1.4 2

44 1 2 130 219 0 2 188 0 0 1

44 1 4 112 290 0 2 153 0 0 1

51 1 3 110 175 0 0 123 0 0.6 1

59 1 3 150 212 1 0 157 0 1.6 1

71 0 2 160 302 0 0 162 0 0.4 1

61 1 3 150 243 1 0 137 1 1 2

55 1 4 132 353 0 0 132 1 1.2 2

64 1 3 140 335 0 0 158 0 1

43 1 4 150 247 0 0 171 1.5 1

58 3 120 340 0 172 0 1

0 0 0

1 4 0 2 1 2
1 0 2 0 1

0 0

60 130 206 132 2.4

0 0 0

1 4 0 2 1 2
1 0 2 0

0 0

58 2 120 284 160 1.8 2

49 1 2 130 266 0 171 0.6 1

48 1 2 110 229 0 0 168 0 1 3

52 1 3 172 199 1 0 162 0 0.5 1

44 1 2 120 263 0 0 173 0 0 1

56 0 2 140 294 0 2 153 0 1.3 2

57 1 4 140 192 0 0 148 0 0.4 2

67 1 4 160 286 0 2 108 1 1.5 2


3.2. Machine Learning Algorithms

A data analysis technique called machine learning automates the development of analytical models. In this

observation, five different algorithms are studied to obtain the accuracy for finding the best one.

3.2.1 Logistic Regression Model

This ML model is often used for classification and predictive analysis, also known as logit regression. It is

also utilized to estimate the discrete values, like the binary outcome, from a collection of independent

variables. A binary result means two possibilities will happen: either the event happens (say 1), or it does

not happen (say 0

Figure 2. Logistic Regression model.

Where z is a function of x1, x2, w1, w2, and b. So,


3.2.2 Support Vector Machine (SVM)

SVM is the most popular supervised machine learning algorithm, which is used for classification as well as

regression [15]. Although, we primarily consider this algorithm for classification problems in ML. The

purpose of the SVM algorithm is to construct the optimum decision boundary or line that can divide

ndimensional space into classes so that we can quickly put the new data point in the correct category.

This optimal decision boundary is known as hyperplane [15]. SVM selects the extreme vectors that help

to create the hyperplane. The extreme vectors are known as support vectors, and the algorithm with

support vectors is called the support vector machine. Here below is a figure of SVM, where the decision

boundaries or hyperplane classifies two different categories. The training sample dataset is (x2, x1), where

x1 is the x-axis vector, and x2 is the target vector, see figure 3. Figure 3. Support Vector Machine.

vector, and x2 is the target vector, see figure 3.

Figure 3. Support Vector Machine.

3.2.3 K-Nearest Neighbors (K-NN)

K-NN is the most straightforward classification algorithm based on supervised learning techniques.

However, the K-NN algorithm can also be used for regression but is mostly used for classification [16]. A
new data point is classified by using the K-NN algorithm depending on how similar the existing data is

stored. It indicates that the K-NN algorithm can quickly classify new data when it appears in a suitable

category, see Figure 4. K-Nearest Neighbors. Here, the horizontal x-axis and vertical y-axis are independent

and dependent variables of a function, respectively. Figure 4 is a simple example of the KNN classification

algorithm. The test sample (Yellow Square with what symbol) should be classified as either a green triangle

or a red star in this algorithm. When k=3 is considered in a small dash circle, the yellow square would be

a green triangle because the majority number in this region is green triangles, not red stars. Now, if we

consider k=7, which is in a large dash circle, then the yellow square would be red stars because the number

of red stars is four and the green triangles are 3. So, It can conclude that the majority vote in a specific

region is important here, see Figure 5.

Figure 4. K-Nearest Neighbors.

3.2.4. Random Forest

Random Forest (RF) is a popular supervised machine learning algorithm used for both classifications and

regression. However, it is mainly used in classification problems. RF algorithm is based on the concept of

ensemble learning. Ensemble learning is a general machine learning procedure that can be used for

multiple learning algorithms to seek better predictive performance [2, 19]. So, the RF technique creates
several decision trees on the data samples, obtains the prediction from each tress, and finally gets the

better solution by considering the majority voting [17]. It is noted that the ensemble method is better

than a single decision tree because it mitigates the over-fitting by averaging results. The large number of

decision trees in RF helps us to get the accuracy and prevent over-fitting of the problems. The following

procedures are completed by RF algorithm, see all

Figure 5. General procedure of Random Forest.

3.2.5. Gradient Boosting

Gradient Boosting (GB) is a machine learning technique that is used in classification and regression

problems like others. It is a powerful algorithm in the field of machine learning [21]. As is well known, the

errors are classified into two categories in machine learning algorithms: Bias error and Variance error. GBC

helps us to minimize bias error sequentially in the model, see Figure 6. A diagram is described as follows

below, Figure 6. Diagram Gradient Boosting (Source [22]). As we can see that the ensemble consists of N

trees; see Figure 6. First of all, the feature matrix X and the labels y are used to train Tree 1. To calculate

the training set residual error r1, the predictions labeled are used. Then, Tree 2 is trained using feature

matrix X and residual errors r1 of Tree 1 as labels. The residual error r2 is then calculated by using

predictive error, see Figure 6.


Figure 6. Diagram Gradient Boosting (Source [22]).
4. 0 Result Analysis

4.1. Analysis of Heart Disease Dataset

Figure 7. Target class. Before going to study the performance of considering machine learning algorithms

in this research, analysis of the features of the heart disease dataset will be focused on here. The total

number of observations in the target attributes is 1025, where not having heart disease 499 (denoted by

0) and having heart disease 526 (represented by 1), see Figure 7. So, the percentage of not having heart

disease is 45.7%, and the percentage of having heart disease is 54.3%, see Figure 8(a). It is shown that the

rate of heart disease is more than the rate of no heart disease. In Figure 8(b), the sex feature of the HD

dataset is observed through the target feature. In sex attribute, the female and male numbers are 312

and 713, respectively. So, the male number is more than double of female number. We can see in this

figure 8(b) that the number of heart diseases in males is higher than in females. Similarly, no heart disease

among males is higher than in females. Figure 8(b) concludes that male is sufferer than female; for more

information, see figure 8(b). Figure 8. (a) Percentage of no heart disease and heart disease, (b) Comparison

between sex and target feature. Figure 9(a) shows a relationship between age and cholesterol with the

target feature. For an experiment, these features from the dataset are considered randomly. The trend of

no heart disease is higher from 55 to 68 when the cholesterol level is between 200 mg/dl and 300 mg/dl.

For validation, the KDE plot 9(b) is studied and shows similar statistics
4.1. Analysis of Heart Disease Dataset

Figure 7. Target class.

Figure 8 (a) Percentage of no heart disease and heart disease, (b)


Figure 10.Correlation matrix of the attributes.
The correlation of the features is drawn in figure 10. The main purpose of the correlation plot is

to define the positive and negative correlation between the features. However, it weak correlation. For

this reason, this article added another figure 11 to obtain these correlations efficiently. In figure 11, we

can see that three features like cp, thalack, and slope.
Figure 11. Correlation with the target feature.

Two strong correlations by cp and slope with target feature are studied statistically. As we can see in figure 12(a),

there is no heart disease when the cp level is more than 350; between 200 and 250. In addition, when the slope

is in 300 < slope-1 < 350, it shows that there is no disease, see figure 12(a). In contrast, for slope-2, there is a

heart disease in 300

4.2. Performance Analysis


In this article, various machine learning algorithms like

Logistic Regression (LR), Support vector machine (SVM), K Nearest-Neighbors (KNN), Random Forest

Classifier (RF), and Gradient Boosting Classifier (GBC) are studied broadly to predict the heart disease. The

accuracy rate of each algorithm has been measured, and selects the algorithm with the highest accuracy.

The accuracy rate is a correct prediction ratio to the total number of given datasets. It can be written as,

Accuracy =

Where, TP: True Positive

TN: True Negative

FP: False Positive

however,heart disease is sustainedmore when the cp is < slope-2 < 350.

Figure 12. (a) cp v/s target, (b) slope v/s target.


FN: False Negative

After performing the machine learning algorithms for training and testing the dataset, we can find the

better algorithm by considering the accuracy rate. The rate of accuracy is calculated with the support of a

confusion matrix. As shown in Table 2, the Logistic Regression algorithm gives us the best accuracy to

compare with other ML algorithms.


Table 2. Accuracy comparison of algorithms.

Figure 11. Correlation with the targ et


feature.
Algorithms Accuracy

Logistic Regression (LR) 0.95

Support vector machine (SVM) 0.90

K-Nearest-Neighbors (KNN) 0.87

Random Forest Classifier (RF) 0.79

Gradient Boosting Classifier (GBC) 0.80

Figure 13. Accuracy comparison of machine learning algorithms by bar

diagram.
This has been studied more on the LR machine learning algorithm through confusion matrix and f1-score.

The confusion matrix shows that the correct predicted value is 95%, see figure 14. f1-score is calculated

by, which is shown in figure 15,

f1 = 2 *

Where, Precision, P = , and Recall R = .

Figure 14. Confusion matrix of LR algorithm.

Figure 15. f1-score, precision, and recall of LR algorithm.


5. Conclusion and Future Scope

The heart is a vital organ in the human body; however, heart disease is a major concern in the world

because this disease is increasing day by day. So, we can handle this disease if we have a model which can

predict the initial condition of heart disease. So, we need to create a machine learning model that can be

more accurate and help to diagnose heart disease with less doubt and cost. It can be a primary technique

for knowing the condition of the heart. For this reason, this article focuses on the heart disease prediction

based on the accuracy rate of the confusion matrix. Following this idea, the statistics of the given

algorithms are used to estimate the accuracy rate of confusion matrix and validated the statistics among

the machine learning algorithms. When five algorithms are compared, it is found that Logistic Regression

algorithm is selected regarding the performance of high accuracy rate. The accuracy rate of Logistic

Regression model is 95%, which indicates that machine learning algorithm will be considered as a

predefined tool to seek heart diseases in the near future. Other statistics such as f1-score, recall, and

precision rate have been calculated for Logistic Regression as 95%, 95%, and 95%, respectively. These

estimated values suggest the highest accuracy of this algorithm.


These findings suggest that machine learning algorithms can effectively learn about the disease

predictions. We may extend this kind of study to diagnose other diseases. We may also analyze the past

history of data and combine other machine learning techniques for better study. Other possible further

applications of this study can include such as, cardiovascular disease prediction, diabetic prediction, breast

cancer prediction, tumor prediction, and multiple disease predictions.

References

[1] Animesh Hazra, Arkomita Mukherjee, Amit Gupta, Asmita Mukherjee, “Heart Disease Diagnosis and Prediction
Using Machine Learning : A Review”, Research Gate Publications, July 2020, pp.21372159.

[2] Wikipedia contributors. (2022, June 22). Machine learning. In

Wikipedia, The Free Encyclopedia. Retrieved 06:31, June 26,

2022, from https://en.wikipedia.org/w/index.php?title=Machine_learning &oldid=1094363111.

[3] Victor Chang, Vallabhanent Rupa Bhavani, Ariel Qianwen Xu, MA Hossain. An artificial intellegence model for

heart disease detection using machine learning. Healthcare Analytics, volume 2, November 2022, 100016.

https://doi.org/10.1016/j.health.2022.100016.

[4] Ghumbre, S. U., & Ghatol, A. A. (2020). Heart disease diagnosis using machine learning algorithm. In

Proceedings of the International Conference on Information Systems Design and Intelligent

Applications 2012 (INDIA 2020) held in Visakhapatnam, India, January 2012 (pp. 217-225). Springer,

Berlin, Heidelberg.

[5] Rohit Bharti, Aditya Khamparia, Mohammed Shabaz, Gaurav

Dhiman, Sagar pande, and Parneet Singh. Prediction of Heart Disease Using a combination of Machine
Learning and Deep learning. Hindawi Computational Intelligence and Neuroscience, Volume 2021, Article ID

8387680, 11 pages. https://doi.org/10.1155/2021/8387680.

[6] Khaled Mohamed Almustafa. Prediction of heart disease and classifiers sensitivity analysis. Almustafa

BMC Bioinfirmatics (2020) 21: 278. https://doi.org/10.1186/s12859-020-03626-y.

[7] Jee S H, Jang Y, Oh D J, Oh B H, Lee S H, Park S W & Yun Y D (2014), A coronary heart disease prediction model.

The Korean Heart Study. BMJ open, 4 (5), e005025.

[8] Mai Shouman, Tim Turner, and Rob Stocker. Applying kNearest Neighbour in diagnosis heart disease patients..

International Journal of Information and Education Technology, vol. 2, No. 3, June 2021.

[9] Ganna A, Magnusson P K, Pedersen N L, de Faire U, Reilly M, Arnlov J & Ingelsson E (2013). Multilocus genetic

risk scores for coronary heart disease prediction. Arteriosclerosis, thrombosis, and vascular biology, 33 (9),

2267-72.

[10] Raihan M, Mondal S, More A, Sagor M O F, Sikder G, Majumder M A & Ghosh K (2016, December). Smartphone

based ischeme heart disease (heart attact) risk prediction using clinical data and data mining approaches, a

prototype design. 23th International conference on Computer and Information Technology (ICCIT) (pp. 299-

303). IEEE.

[11] Acharya U R, Fujita H, Oh S L, Hagiwara Y, Tan J H & Adam M (2020). Application of deep convolutional neural

network for automated detection of myocardial infarction using ECG signals. Information Sciences, 415, 190-8.

[12] Takci H (2018). Improvement of heart attack prediction by the feature selection methods. Turkish

Journal of Electrical Engineering & Computer Sciences, 26 (1), 1-10.

[13] (https://www.kaggle.com/ronitf/heart-disease-uci).

[14] https://archive.ics.uci.edu/ml/datasets/heart+disease.
[15] Ahmad Shahin, Walid Moudani, Fadi Chakik, Mohamad Khalil et al. ”Data Mining in Healthcare Information

Systems: Case Studies in Northern Lebanon”, ISBN: 978 - 1 -4799 -3166 -8 ©2020 IEEE. [16] Puneet Bansal and

Ridhi Saini et al. “Classification of heart diseases from ECG signals using wavelet transform and kNN classifier”,

International Conference on Computing, Communication and Automation (ICCCA2020).

[17] Quazi Abidur Rahman, Larisa G. Tereshchenko, Matthew Kongkatong, Theodore Abraham, M. Roselle

Abraham, and Hagit Shatkay et al. “Utilizing ECG-based Heartbeat Classification for Hypertrophic

Cardiomyopathy Identification”, DOI 10.1109/TNB.2015.2426213, IEEE Transactions on Nano

Bioscience TNB-00035-2021.

You might also like