
CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

Heart Disease Prediction using Machine Learning algorithms

A thesis submitted in partial fulfillment of the requirements

For the degree of Master of Science

In Computer Science

By

Yash Sharma

May 2023
The thesis of Yash Sharma is approved:

_______________________________________ ________________

Dr. Mahdi Ebrahimi Date

_______________________________________ ________________

Dr. Robert McIlhenny Date

_______________________________________ ________________

Dr. Taehyung Wang, Chair Date

California State University, Northridge

Acknowledgements

I would like to thank Dr. George Wang for being my committee chair for my thesis and

for all his help throughout the entire year. I would also like to thank Dr. Mahdi Ebrahimi for

considering becoming part of my committee this semester and Dr. Robert McIlhenny for

accepting my request to be on my committee this semester.

Table of Contents

Signature Page……………………………………………………………………………………ii

Acknowledgements………………………………………………………………………………iii

List of Tables……………………………………………………………………………………..vi

List of Figures…………………………………………………………………………………….vi

Abstract…………………………………………………………………………………………..vii

1. Introduction……………………………………………………………………………………1

1.1. Objective…………………………………………………………………………………1

1.2. Problem Statement……………………………………………………………………….2

1.3. Target……………………………………………………………………………………..4

1.4. Planning (Updated timeline & issues encountered)……………………………………...4

2. Survey and Background……………………….....……………………………………………7

2.1. Survey…………………………………………………………………………………….7

2.2. Background…………………………………………………………………………….....9

2.3. Related Work…………………………………………………………………………....11

3. Project Development and Design……………………………………………………………13

3.1. Technical Approach & Python Libraries Discussion……………………………………13

3.2. Data Understanding and Preprocessing…………………………………………………14

3.3. Decision Tree Classifier Overview……………………………………………………...16

3.4. K-Neighbors Classifier Overview………………………………………………………18

3.5. Random Forest Classifier Overview…………………………………………………….20

3.6. Support Vector Classifier Overview…………………………………………………….21

3.7. Risks involved…………………………………………………………………………..25

4. Results………………………………………………………………………………………..26

4.1. Decision Tree Classifier results………………………………………………………....26

4.2. K-Neighbors Classifier results…………………………………………………………..27

4.3. Random Forest Classifier results………………………………………………………..28

4.4. Support Vector Classifier results………………………………………………………..29

4.5. Result Analysis...………………………………………………………………………..31

5. Conclusion and Future Work………………………………………………………………...33

6. Bibliography…………………………………………………………………………………35

List of Tables

1. Table 1.1 Revised timeline for the project………………………………………………….5-6

List of Figures

1. Figure 2.1 Examining Kaggle and IEEE dataset…………………………………………….15

2. Figure 2.2 Generalized layout of Decision Tree………………………………..……………17

3. Figure 2.3 K-Neighbors Classifier Diagram…………………………………………………19

4. Figure 2.4 Random Forest Classifier using Decision Trees…………………………………20

5. Figure 2.5 Optimal hyperplane selection graphs for Support Vector Classifier…………21-22

6. Figure 2.6 Linear Kernel……………………………………………………………………..23

7. Figure 2.7 Polynomial Kernel………………………………………………………………..23

8. Figure 2.8 Radial Basis Function Kernel…………………………………………………….24

9. Figure 2.9 Decision Tree Classifier Result Graph…………………………………………...26

10. Figure 2.10 K-Neighbors Classifier Result Graph…………………………………………...27

11. Figure 2.11 Random Forest Classifier Result Graph………………………………………...28

12. Figure 2.12 Support Vector Classifier Result Graph………………………………………...30

Abstract

Heart Disease Prediction using Machine Learning algorithms

By

Yash Sharma

Master of Science in Computer Science

In a rapidly evolving world, humans deal with numerous challenges, ranging from

learning how to talk and walk to leading a nation or fighting enemies, or simply anything the

mind can imagine. With such a variety of issues to deal with regularly, people naturally look

for easy solutions that reduce the stress they are facing and, in doing so, compromise their

health. When health is compromised, the consequences can sometimes be deadly. One such

consequence is developing heart disease, which can become life-ending if it is not discovered

in its preliminary stages. To help prevent loss of life due to lack of awareness, it is crucial to

have a model that can help predict whether an individual is suffering from heart disease or not.

In the experiment, the dataset is imported along with all the necessary Python libraries,

including those for the machine learning algorithms. The dataset is then explored and enters the

preprocessing phase where, using tools such as a correlation matrix and bar plots, it is

standardized and scaled for optimal performance. Once the dataset is standardized and scaled,

multiple algorithms are applied, each of which produces an accuracy score for predicting

whether a person has heart disease. Once the accuracy scores are computed, we will know

which algorithm provided the best results for our dataset.

1. Introduction

1.1. Objective

Heart disease is a type of disease in which there is a blockage in the coronary arteries, which

are blood vessels that carry blood and oxygen to the heart. Coronary heart disease is caused

by the development of fatty material and plaque inside the coronary arteries.1 As deadly as it

sounds, heart diseases are not easy to identify in their preliminary stages of development.

Researchers have constantly been attempting to identify better detection techniques that can

identify heart disease in a person in time, as the current techniques are not as effective at

detecting this disease in its early development stages as one would hope, due to limitations in

accuracy and computational time.20 Additionally, when professional health experts and advanced

technology are not readily available, it can become tremendously challenging to detect any

symptoms of heart disease until the person starts experiencing chest pain or complains of

trouble breathing, by which time it may be very difficult to cure the disease and save

that person's life. To prevent individuals from going through this hardship, and to help doctors

detect the disease in its preliminary stages, it is important to have a tool that detects such a

life-threatening disease early on.

The goal of this research is to develop a model that will help doctors predict whether an

individual is suffering from a heart disease or not before the disease spreads in the body and

makes the situation difficult to control. It focuses on four main machine learning techniques: 1)

Decision Tree Classifier, 2) K-Neighbors Classifier, 3) Random Forest Classifier, and 4) Support

Vector Classifier. A dataset containing factors related to heart disease will also be preprocessed,

evaluated, and used for training and, once it has been properly scaled and standardized, will be

applied to each of the four algorithms above, producing for each algorithm an accuracy score

for predicting whether an individual has heart disease.

1.2. Problem Statement

According to the Centers for Disease Control and Prevention, or CDC, website,2 heart

diseases, also known as cardiovascular diseases, are not easily detected until a person

experiences signs or symptoms of a heart attack such as chest pain, upper back/neck pain,

heartburn, nausea, vomiting, etc.; heart failure such as trouble when breathing, fatigue, swelling

of the feet, etc.; or an arrhythmia like palpitation. Top factors that contribute to this deadly

disease are high blood pressure, high cholesterol, and smoking and, surprisingly, almost half of the

population (47%) in the United States suffers from at least one of these three. Statistics

released for 2020 indicated that approximately 697,000 people in the United States died of

heart disease; in other words, one in every five deaths in the country in 2020 was caused by heart disease.

Other factors that also contribute to the development of heart diseases are unhealthy diet,

overweight/obesity, diabetes, excessive alcohol consumption, etc.

According to the article, Heart Disease and Stroke Statistics – 2023 Update, on the American

Heart Association website, cardiovascular disease, or heart disease, was the leading cause of

death in the United States in 2020, with a whopping total of 928,741 deaths reported.

Coronary heart disease was the leading contributor to these cardiovascular deaths at 41.2%,

followed by stroke, which contributed 17.3%; high blood pressure caused 12.9% of these

individuals to lose their battle for life, 9.2% died due to heart failure, and 2.6% lost their

battle to diseases of the arteries. Based on data collected in 2020, the age-adjusted death rate

due to heart disease in the United States was 224.4 per 100,000 people, not far from the global

age-adjusted death rate of 239.8 per 100,000. Shifting the focus to the financial aspect of

cardiovascular disease, the statistics indicated that direct spending to tackle this disease in

2018 and 2019 combined was $251.4 billion, while lost productivity and mortality cost another

$155.9 billion, bringing the total cost of battling cardiovascular disease to a whopping

$407.3 billion for both of these years combined.18

Data science and machine learning have been two of the main driving factors in the

advancement of technology in the medical and healthcare sector. The majority of companies in

this sector are increasingly adopting and integrating data science and machine learning into their

systems and, by doing so, are improving their chances of finding patterns for different

diseases, as well as providing patients with better feedback on diseases they could develop in

the future based on their medical history and data.

According to the Neptune Blog,3 adopting machine learning tools allows medical

organizations to “find patterns, extract knowledge from data and tackle a diverse set of

computationally hard tasks.” Data scientists can utilize the tools of machine learning and

determine the relationship between “various attributes and features of the patients with the

labelled disease.” This, in turn, allows doctors to understand the patterns of the disease and

produce better, preventive solutions for the patients.

When dealing with a large set of data, it is important to consider the threats that surround it.

Hacking into the system and stealing the data of millions of patients can negatively impact an

organization’s integrity and reputation. To avoid such situations from developing, it is crucial

that developers create a model using a language that contains tools providing security over any

outside threats. According to the 2017 Stack Overflow Developer Survey, a chart from which is

cited in Belitsoft's article "Python in Healthcare",4 Python is one of the top five most popular

languages used for developing healthcare systems.

1.3. Target

This model will be useful for particular hospitals and healthcare centers, and specifically for

health experts who have a detailed understanding of heart disease, of how to weigh each

factor related to it, and of how to provide their patients with an educated explanation. Using

several factors related to heart disease, these health experts will be able to predict whether a

patient is suffering from heart disease and, if so, diagnose that patient appropriately and

possibly even cure the disease, thereby saving their life.

1.4. Planning

Throughout the year, different challenges and issues forced the original timeline to be

revised. Some of the challenges encountered during the development were as follows:

1. An error occurred while importing rcParams, which contains properties from the matplotlibrc

file and is used to add styling to the plots/graphs.

2. The unexpected breakdown of my personal device, on which my project program and

other material was saved, caused a major delay in staying consistent with the

originally set timeline.

3. The original dataset from Kaggle consisted of 13 features, while the newly found

dataset, from IEEE, consisted of 11 features. Hence, the additional two features, ca

(the number of major vessels, from 0 to 3) and thal (a blood disorder called

thalassemia), were removed. Removing these two features did not noticeably interfere

with the consistency of the overall model, while it also allowed the larger

dataset to be merged with the other dataset used for this model.

With the above issues, the original timeline was adjusted and revised as follows:

| Dates | Task | Estimated Duration (in weeks) | Output |
| --- | --- | --- | --- |
| Prior to work | Identify problems and challenges with detecting heart diseases (DONE) | 1 | Determining idea for thesis |
| Prior to work | Research on existing work done in this field (DONE) | 1 | Determining problem statement |
| Prior to work | Determine the requirements and tools needed to execute this research (DONE) | 0.5 | Determine technical requirements for project |
| Prior to work | Complete initial proposal (DONE) | 1.5 | Written up 5-page proposal for professor |
| | Total | 4 | |
| Semester 1 (Aug - Dec 2022) | Work on the preprocessing of the dataset (i.e., cleanup the dataset) (DONE) | 2 | Cleanup the dataset; split into testing and training |
| Semester 1 (Aug - Dec 2022) | Research on K-Neighbors Classifier algorithm and begin to implement it into the program (DONE) | 3 | Model will be executed using K-Neighbors Classifier algorithm |
| Semester 1 (Aug - Dec 2022) | Research on Decision Tree Classifier algorithm and begin to implement it into the program (DONE) | 4 | Model will be executed using Decision Tree Classifier algorithm |
| Semester 1 (Aug - Dec 2022) | Research on Random Forest Classifier algorithm (DONE) | 3 | Find background information about the algorithm (what it is, what it is used for, etc.) |
| | Total | 12 | |
| Semester 2 (Jan - May 2023) | Implementation of Random Forest Classifier algorithm (DONE) | 3 | Model will be executed using Random Forest Classifier algorithm |
| Semester 2 (Jan - May 2023) | Research on Support Vector Classifier algorithm and begin to implement it into the program (DONE) | 3 | Model will be executed using Support Vector Classifier algorithm |
| Semester 2 (Jan - May 2023) | Finding of additional dataset (DONE) | 2 | Increase in the size of the overall dataset |
| Semester 2 (Jan - May 2023) | Thesis Writeup | 7 | Draft the actual thesis |
| Semester 2 (Jan - May 2023) | Prepare defense | 2 | Create presentation slides and prepare for defense talk |
| | Total | 17 | |

Table 1.1 Revised timeline for the project

Tasks completed before the final progress meeting were:

a. Installation of required technologies for model programming.

b. Importing of most of the Python libraries and all of the algorithms

c. Research on K-Neighbors Classifier

d. Research on Decision Tree Classifier

e. Research on Random Forest Classifier

f. Installation of matplotlib library to be able to properly import rcParams

g. Preprocessing of the dataset

h. Implementation of K-Neighbors Classifier

i. Implementation of Decision Tree Classifier

j. Finding of the additional dataset

k. Research and implementation of Support Vector Classifier

Tasks in progress before the final meeting were:

a. Thesis write-up and defense preparation

2. Survey and Background

2.1. Survey

Research suggests that there are many methodologies that can help in predicting heart

disease using machine learning. In their research, Heart Disease Prediction System

Using Machine Learning, Ranjit Shrestha and Jyotir Moy Chatterjee mention that the Artificial

Neural Network algorithm is one of the recommended techniques for predicting

heart disease, mainly due to its ability to reduce the cost of diagnosis, in that a

different test could be performed "to make a decision for diagnosis of HD (heart disease)."19

Based on Hindawi’s article, A Method for Improving Prediction of Human Heart Disease Using

Machine Learning Algorithms, Abdul Saboor and his co-contributors stated that auscultation

method was the primary method that physicians used to differentiate between normal and

abnormal cardiac sounds. The physicians listened to these cardiac sounds using their

stethoscopes to identify several types of heart disease but, as it is said that, with every benefit,

there is a flaw as well. Despite providing the ability to the physicians to detect several types of

heart disease, the auscultation method did not always provide with the accurate classification and

clarification of the distinct sounds because that is related to the knowledge and practices of the

doctors that can only be gained after lengthy examinations. Apart from the manual procedure,

the authors also include the use of machine learning algorithms. Several machine learning

algorithms are mentioned, namely Multinomial Naïve Bayes Classifier, Support Vector Machine

Classifier, Logistic Regression Classifier, Classification and Regression Tree (CART)

Algorithm, Latent Dirichlet Allocation Algorithm, AdaBoost Classifier, Random Forest

Classifier, Extra Tree Classifier, and XGBoost Classifier, which can be used to predict heart

disease. Using a dataset of 303 records and 76 attributes, the 10-fold cross-validation

method has been used to train and evaluate the model. This method has been chosen

primarily because of the small size of the dataset, which does not provide a large

number of training samples: a single train-test split may underestimate the model's

predictive performance, whereas in each fold of 10-fold cross-validation the model has the

other 90% of the data to learn from. Recall, accuracy, precision, and F-measure have been used to evaluate the

model’s performance. Data preprocessing eliminated any missing, or null, values from the

dataset and scaled and standardized the dataset. The results retrieved after conducting the model

experiment showed that Extra Tree Classifier and AdaBoost Classifier returned with the highest

accuracy of 90%. Random Forest Classifier and Latent Dirichlet Allocation Algorithm returned

with an accuracy of roughly 88%, XGBoost Classifier resulted in an accuracy of 85%, the

Classification and Regression Tree Algorithm in roughly 84%, Logistic

Regression returned a 75% accuracy, Support Vector Machine Classifier a

70% accuracy, and Multinomial Naïve Bayes Classifier resulted in an accuracy of 59%.20
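The 10-fold cross-validation protocol described above can be sketched with scikit-learn as follows; the synthetic data and the choice of the Extra Tree Classifier here are illustrative stand-ins for the cited study's 303-record dataset, not its actual code:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 303-record heart disease dataset.
X, y = make_classification(n_samples=303, n_features=13, random_state=0)

# 10-fold cross-validation: each fold trains on roughly 90% of the
# records and is evaluated on the remaining 10%.
clf = ExtraTreesClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print(scores.mean())
```

Averaging the ten fold accuracies gives a less pessimistic performance estimate than a single train-test split on so small a dataset.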

In their article titled, "Effective Heart Disease Prediction Using Hybrid Machine Learning

Techniques", published by IEEE, S. Mohan, C. Thirumalai, and G. Srivastava discuss the

challenges that are faced when tackling the life-threatening heart disease. They claim that there

are several factors that contribute to the advancement of this disease such as diabetes, high blood

pressure, high cholesterol, abnormal pulse rate, etc. To help identify this disease in its

preliminary stages, the authors suggest a few machine learning algorithms such as Naïve Bayes,

Generalized Linear Model, Logistic Regression, Deep Learning, Decision Tree, Random Forest,

and Support Vector Machine. They aim to implement their research by combining the

perceptions of medical science and data mining to find different metabolic syndromes,

investigate the data, and eventually perform heart disease prediction. The dataset contains a total

of 303 records and, after performing preprocessing on the dataset, six records with null values

have been removed. Along with the 303 records, the dataset also includes thirteen attributes, two

of which, age and gender, have been used to identify personal information of the patient, while

the remaining eleven have been used to hold vital clinical records of the patient. Furthermore,

this dataset, utilized for classification in their research, is a RBFN, or Radial Basis Function

Network, which is divided into 70% for the training dataset and 30% used for the testing dataset.

Upon applying every algorithm into the model, many standard performance metrics like

accuracy, precision, and error in classification have been considered to determine the

computation of the performance efficacy of this model. Among the results returned after each

algorithm was applied to the dataset, Deep Learning achieved the highest

accuracy score of 87.4%, followed by Random Forest and Support Vector Classifiers, both tied

at 86.1%; the Generalized Linear Model scored 85.1%, closely followed by the

Decision Tree Classifier at 85%; Logistic Regression had

an accuracy score of 82.9% and, lastly, Naïve Bayes recorded an accuracy of 75.8% in

terms of successfully predicting heart disease.21

2.2. Background

As mentioned previously in this paper, heart disease is one of the deadliest diseases, if not

the deadliest, that humans lose their battle to every year. Although there have been

advancements in the medical field to detect various life-threatening diseases, there still remains a

significant risk that the disease may not be diagnosed in its early development stage and, as a result,

may not be cured at the right time. According to their article, Call to Action: Urgent Challenges

in Cardiovascular Disease: A Presidential Advisory From the American Heart Association,

Mark McClellan and his co-authors state that cardiovascular disease, or heart disease, poses a

tremendous threat to people of the United States and globally. 85.6 million Americans have been

affected by this disease in one form or another (i.e., have died or have lived with a disability)

and, financially, the disease accounts for 1 of every 6 dollars spent on healthcare. Innovation

remains a big concern when it comes to finding effective medicines and treatments for

cardiovascular disease, and the overall cost to treat this disease continues to rise. The authors of

this article also share public feedback about treatment: while many feel that advancements in

technology have certainly improved their chances of successfully battling this disease, many

others are deeply concerned about the skyrocketing cost they have to pay for treatment. Besides

affordability, patients share another concern, which is that the care and treatment they receive

do not always align with their priorities and goals, in that patients feel some of the care or

treatment they receive is not always required or necessary. Meanwhile, they continue

to struggle with "bad habits" such as an unhealthy diet, smoking, and drinking that

act as catalysts in increasing their risk of heart disease. Cardiovascular disease accounts for

over 17 million deaths per year globally, a figure expected to rise to more than 23.6 million

deaths by 2030. Mortality due to this disease varies with numerous factors such as

gender, race, and ethnicity, although, over the last few years, the influential factors

have shifted toward geographic location, wealth, and education. Besides

the concerning death rates, spending for heart disease has also increased. The American Heart

Association and American Stroke Association estimated that a total of $318 billion was spent

on all aspects of cardiovascular disease. Aside from the direct expenses, there are also

indirect expenses related to this disease. For example, individuals in the workforce

needing to stay on medical leave or disability during their recovery period result in lost

productivity and indirectly affect the overall economy of the nation. Additional costs include

hiring someone to provide immediate medical assistance or simply help with

everyday routines, individuals leaving the workforce, etc.22 At the same time, advancements in

healthcare-related technology have, even negatively, affected the fight against

diseases such as heart disease, primarily because of the increased costs that apply during the entire

treatment process. Many middle-class and lower-income individuals hesitate to begin

their treatment for heart disease since affordability becomes their biggest issue.

Affordability causes individuals within these groups to compromise on their treatment, in that

they may not pursue all the steps that a full treatment for heart disease may include.

With all the challenges mentioned above, and many more, that are involved with heart

disease, this project aims to make a small contribution towards addressing the problem

by creating a model, using a few machine learning algorithms and several factors related to heart

disease, that will assist medical professionals, who understand this disease

and know how to make use of these factors, in predicting whether an individual has

heart disease or not.

2.3. Related Work

A great amount of work has been done related to predicting heart diseases using machine

learning algorithms. One such study, conducted by Madhumita Pal and her co-authors,

titled Risk prediction of cardiovascular disease using machine learning classifiers and

published on the website of the National Library of Medicine, explains a couple of machine learning

methodologies, namely the K-Nearest Neighbors Algorithm and the Multi-Layer Perceptron Algorithm,

to predict heart disease. They use a dataset of 303 records, of which 80%, or 243

records, have been used for the training dataset and the remaining 20%, or 60 records, have been

used for the testing dataset. The dataset also contains 13 important features such as age, gender,

chest pain, rest blood pressure, cholesterol, fasting blood sugar, resting electrocardiographic

result, maximum heart rate, exercise-induced angina, ST depression, slope of peak ST segment,

number of major vessels, thallium stress result, and a target variable (1 indicates the person

having cardiovascular disease and 0 indicates the person not having cardiovascular disease).

After the dataset is preprocessed using the correlation matrix, the two machine learning

algorithms are applied to the model. K-Nearest Neighbors Algorithm returned with an accuracy

score of 73.77% in predicting the heart disease while Multi-Layer Perceptron Algorithm resulted

in a higher accuracy score of 82.47% in predicting the heart disease.23

3. Project Development and Design

3.1. Technical Approach & Python Libraries Discussion

The solution I am proposing is to use Python as the primary language in implementing a

machine learning model that will predict whether a person has a heart disease or not. This model

will be implemented through Jupyter Notebook. I will import several libraries such as:

• NumPy – a numeric library in Python that can work with linear algebra as well as arrays.

• Pandas – this library in Python is used for data analysis and machine learning. It is used
to manipulate and analyze the csv files and data frames.

• Matplotlib – this library in Python is used for plotting and visualizing the data. It will
help in plotting graphs, line plots, histograms, etc., as well as define parameters for the

charts and color the charts.

• train_test_split – this scikit-learn function in Python will be used to split the dataset into
two parts – a training dataset and a test dataset.

• StandardScaler – this scikit-learn class in Python will be applied as a preprocessing

step in order to standardize the features of the input dataset.

Next, using the scikit-learn, or sklearn, library in Python, I import the four machine learning

algorithms that will be used to implement the model which will predict heart disease in a person.

These algorithms are K Neighbors Classifier, Decision Tree Classifier, Random Forest

Classifier, and Support Vector Classifier.
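As a sketch, assuming standard package names, the imports described in this section would look as follows (SVC is scikit-learn's name for the Support Vector Classifier):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams

# Preprocessing and data-splitting utilities from scikit-learn.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# The four classifiers used in this model.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
```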

3.2. Data Understanding and Preprocessing

The dataset will be imported from Kaggle as well as IEEE and consists of

approximately 1,500 random individuals, each record containing the following attributes:

• Age – measured in years

• Sex – male or female (1 = male, 0 = female)

• Chest pain type, or cp – this will hold values between 1 and 4

o 1 = atypical angina, which means that one feels chest pain which does not meet with the

criteria for angina5

o 2 = typical angina, which is defined as substernal chest pain precipitated by physical

exertion or emotional stress and relieved with rest6

o 3 = asymptomatic anginal pain

o 4 = non-anginal pain

• Resting blood pressure, or trestbps – this is the blood pressure when the person is admitted

to the hospital; it is measured in mmHg.

• Serum cholesterol, or chol – it represents the amount of total cholesterol in a person’s blood

and is measured in mg/dl.

• Fasting blood sugar, or fbs – this measures blood sugar after overnight fasting; a fasting blood

sugar below 100 mg/dl is considered normal, between 100 and 125 mg/dl is considered

prediabetic, and 126 mg/dl or higher is considered diabetic. In the dataset, 1 = fbs > 120

mg/dl and 0 = fbs <= 120 mg/dl

• Resting electrocardiographic results, or restecg – it records the electrical activity of your

heart while you are at rest, provides information about your heart rate and rhythm, and can

also show if there is enlargement of the heart or evidence of a previous heart attack. 0 =

showing probable or definite left ventricular hypertrophy by Estes’ criteria, 1 = normal, and 2

= having ST-T wave abnormality (T wave inversion and/or ST elevation or depression

greater than 0.05 mV)

• Maximum achieved heart rate, or thalch

• Exercise-induced angina, or exang – this is a type of chest pain triggered by exercise,

stress, etc., which places an overload on the heart. In the dataset, 1 = pain, 0 = no pain.

• oldpeak – ST depression induced by exercise relative to rest

• slope – represents the ST segment shift relative to the increase in heart rates, which is

induced by exercise. 0 = down sloping, 1 = flat, 2 = up sloping.

• target – represents the 2 target classes in this model: 0 represents no disease, 1 represents

disease.

Figure 2.1 shows the information of the dataset after using the info() method.

Figure 2.1 Examining Kaggle and IEEE dataset
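A minimal sketch of this inspection step, using a tiny illustrative frame in place of the merged Kaggle/IEEE CSV (which would instead be read with pd.read_csv()):

```python
import pandas as pd

# A tiny illustrative frame with a few of the columns described above;
# the actual project loads roughly 1,500 records from the merged CSV.
df = pd.DataFrame({
    "age":    [63, 45, 57],
    "sex":    [1, 0, 1],
    "chol":   [233, 204, 192],
    "target": [1, 0, 1],
})
df.info()                           # dtypes and non-null counts, as in Figure 2.1
print(df["target"].value_counts())  # class balance: disease vs. no disease
```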

In order to understand the data, I will create a correlation matrix that will show the

positive and negative correlations among the above-mentioned features. Then, bar plots will be

created for the two target classes, one for the presence of disease and the other for its

absence. The plots will help isolate the categorical variables, and some of these variables

will then be converted into temporary dummy variables so that all the values can be scaled

before any machine learning algorithms are applied.

Now that the dataset has been scaled and standardized, I will split the dataset into two parts –

testing and training. Then, I will start applying each of the machine learning algorithms, as

mentioned in the beginning of this section.
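The preprocessing steps above can be sketched as follows; the columns, dummy-variable choice, and split ratio are illustrative assumptions, not the exact code of this project:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative frame with one categorical feature (cp) and the target.
df = pd.DataFrame({
    "age":    [63, 45, 57, 52, 60, 41],
    "chol":   [233, 204, 192, 212, 178, 250],
    "cp":     [1, 2, 3, 4, 2, 1],
    "target": [1, 0, 1, 0, 1, 0],
})

# Convert the categorical variable into dummy columns, then separate
# the features from the target class.
df = pd.get_dummies(df, columns=["cp"])
X = df.drop(columns="target")
y = df["target"]

# Split into training and testing parts, then standardize all feature
# values using statistics computed from the training part only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training part only avoids leaking information from the test set into the standardization statistics.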

3.3. Decision Tree Classifier Overview

Decision Tree Classifier will use a classification tree to produce a predictive model that is

used to formulate conclusions from a range of observations. The decisions, or tests, will be

performed based on the features provided in the dataset. As the name of this algorithm suggests,

it uses a tree-like structure to display the results (i.e., predictions) that are achieved based on a

number of feature-based splits. As shown in Figure 2.2 below, the process of Decision Tree

Classifier begins at the root node. Then, this root node is divided into various nodes, which are

decision nodes, and finally, based on the decision that is selected, we reach the conclusion, or

leaf node, and have a prediction.10

Figure 2.2 Generalized layout of Decision Tree (Source: Decision Tree)10

A few reasons for using this algorithm include: 1) it is useful for solving decision-related problems such as the one in this thesis; 2) its tree-like structure helps in thinking through all practical solutions; 3) it requires less data clean-up than other algorithms.7

In my model, using the Decision Tree Classifier, most of the coding effort was reduced once the decision tree module was imported from scikit-learn. It provided access to the DecisionTreeClassifier() class and the variables needed to calculate accuracy. First, a class value was assigned to every data point. Then, since the maximum number of features to select for my model was variable, I chose a range of 1 to 23. The DecisionTreeClassifier() class was given two parameters: max_features, which selects the maximum number of features from the above range, and random_state, which controls the random selection of features at each decision node. Then, the fit() method took the training dataset as an argument, and the accuracy on the testing dataset was appended to a list. Finally, a line graph was created to plot the accuracy scores received for the selected range of maximum features (i.e., 1 to 23).
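The loop just described can be sketched roughly as follows. Synthetic data from make_classification stands in for the preprocessed heart-disease dataset; this is an illustration of the approach, not the exact thesis code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed data: 23 features after dummies
X, y = make_classification(n_samples=300, n_features=23, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

scores = []
for k in range(1, 24):                        # max_features from 1 to 23
    clf = DecisionTreeClassifier(max_features=k, random_state=0)
    clf.fit(X_train, y_train)                 # train on the training split
    scores.append(clf.score(X_test, y_test))  # accuracy on the test split

# scores can now be plotted as a line graph against range(1, 24)
```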

3.4. K-Neighbors Classifier Overview

The K Neighbors Classifier is a type of non-parametric algorithm that will use the proximity

of the K nearest neighbors from a given data point to make classifications, or predictions, about

the clustering of the individual data points. This algorithm is also known as the lazy-learner

algorithm because “it does not learn from the training set immediately instead it stores the

dataset and at the time of classification, it performs an action on the dataset.”12 During the training phase the dataset is simply stored; once new data is received, the algorithm classifies it into a category based on its similarity to the stored data. K Neighbors Classifier is used for classification problems, but it can

also be applied to problems based on regression. Given a point with an unknown class, the

purpose is to try and understand which other neighbors in that feature space are closest to this

point. These neighbors become the K nearest neighbors. In the case of a classification problem, a data point is assigned the class label that occurs most frequently among its K nearest neighbors (a majority vote).11 For reference, an image of K Neighbors Classifier is provided below in Figure 2.3:

Figure 2.3 K-Neighbors Classifier Diagram (Source: Webtunix Solutions - Business Solution

Provider)12

In Figure 2.3, the point colored in green in the innermost circle is the query data point and, based on the vicinity of the other data points to it, the green point is assigned a class label.12

In my model, using the K Neighbors Classifier, I looked for the K nearest neighbors around my target variable’s data point and, based on the majority class among them, assigned a class label to that data point. Then, I determined how many neighbor counts to test: 30 values of K, ranging from 1 to 30. For each iteration, the KNeighborsClassifier() class, available after importing it from scikit-learn, took the parameter n_neighbors, which determined the number of neighbors used. After this, the fit() method fit the model to the training dataset, the score() function returned the accuracy on the test inputs and target data, and the append() function added each result to a list. The results achieved were then plotted onto a line graph.
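A rough sketch of this sweep over K, again using synthetic stand-in data rather than the thesis dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed heart-disease data
X, y = make_classification(n_samples=300, n_features=23, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

scores = []
for k in range(1, 31):                        # K from 1 to 30
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)                 # store the training data
    scores.append(knn.score(X_test, y_test))  # accuracy on the test split

# scores can now be plotted as a line graph against range(1, 31)
```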

3.5. Random Forest Classifier Overview

Random Forest Classifier will create multiple decision trees, each considering a different subset of the total features, to make better predictions for the overall dataset. Each individual tree in this algorithm will make a class prediction, and the class that receives the most votes across all the trees becomes the model’s prediction. To provide a visual

representation of this explanation, consider Figure 2.4 as a reference:

Figure 2.4 Random Forest Classifier using Decision Trees (Source: Yiu, Tony)8

The fundamental reason the Random Forest Classifier performs so well, compared to other algorithms, is that “A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models.”8

In my model, the Random Forest Classifier created multiple decision trees by performing random selection of features from the total feature set. Then, to predict the class, I tested forest sizes of 200, 400, 800, 1600, and 3200 trees. The RandomForestClassifier() class, available after importing it from scikit-learn, took two parameters: n_estimators, which determined the total number of trees used in that iteration, and random_state, which controlled the sampling of features considered when looking for the best split at each decision node. After this, the fit() method built a forest of the selected size from the training dataset, the score() function returned the accuracy on the test inputs and target data, and the append() function added each result to a list. Finally, I plotted the accuracy scores for each forest size onto a bar graph to check for optimal results.
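This sweep over forest sizes can be sketched as follows, with synthetic data standing in for the real dataset (an illustration under those assumptions, not the thesis code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed heart-disease data
X, y = make_classification(n_samples=300, n_features=23, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

scores = {}
for n in [200, 400, 800, 1600, 3200]:         # forest sizes from the thesis
    rf = RandomForestClassifier(n_estimators=n, random_state=0)
    rf.fit(X_train, y_train)                  # build a forest of n trees
    scores[n] = rf.score(X_test, y_test)      # accuracy on the test split

# scores can now be plotted as a bar graph, one bar per forest size
```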

3.6. Support Vector Classifier Overview

The goal of the Support Vector Classifier is to find a hyperplane in an N-dimensional space, where N is the total number of features, that distinctly classifies every data point in that space.

Figure 2.5 Optimal hyperplane selection graphs for Support Vector Classifier13

As shown in Figure 2.5 above, there are many different hyperplanes that we can select

from in order to separate the two classes of data points. The ultimate goal is to find a hyperplane

that has the maximum distance between the data points of both the classes. Hyperplanes are

extremely useful when trying to classify the classes for the data points as those data points that

are positioned on either side of the hyperplane will be grouped in different classes.13

In my model, using the Support Vector Classifier, I selected the 3 kernels below that I

wanted to use to calculate the accuracy score.

a. Linear: this is the simplest of all the known kernels since the data is represented over

an x-y plane with a constant term, c, which is optional. Linear kernels are quite

simple as they are not represented over higher dimensions. They are also extremely

useful for problems where the dataset consists of a greater number of features.14 A

visual representation of the linear kernel is shown in Figure 2.6, which shows how a

set of data points can be separated in a linear manner:

Figure 2.6 Linear Kernel25

b. Polynomial, or poly: when the data cannot be linearly separated, the polynomial

kernel becomes useful since it maps the data over higher-dimensional planes and, in

doing so, sometimes will identify the separation of the two classes in the

hyperplane.15 A visual representation of the polynomial kernel is shown in Figure 2.7,

which shows how a set of data points can be separated in a non-linear manner:

Figure 2.7 Polynomial Kernel25

c. Radial Basis Function, or rbf: this kernel is slightly more advanced than the

polynomial kernel due to its ability to plot data over the plane of infinite number of

dimensions. For discussion purposes, if there is an N-dimensional plane, where N is

the total number of features, the Radial Basis Function kernel can make calculations

for any N > 1 such that a graphical representation of quadratic, cubic, or polynomial

equations can be mapped on a regression or classification line.16 A visual representation of the Radial Basis Function kernel is shown in Figure 2.8, which shows how a set of data points can be separated in a non-linear manner:

Figure 2.8 Radial Basis Function Kernel25

Continuing with the process in my model, one of the above three kernels is selected by passing the kernel parameter to the SVC() class, available after importing the Support Vector Classifier from scikit-learn. Then, the fit() method fits the SVM model to the training dataset. After this, the score() function returns the accuracy score on the test data and target values, and the append() function adds the result to a list. Finally, the accuracy result for each of the three kernels is plotted onto a bar graph.
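The kernel comparison can be sketched as follows, again with synthetic stand-in data rather than the thesis dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed heart-disease data
X, y = make_classification(n_samples=300, n_features=23, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

scores = {}
for kernel in ["linear", "poly", "rbf"]:      # the three kernels compared
    svc = SVC(kernel=kernel)
    svc.fit(X_train, y_train)                 # fit the SVM model
    scores[kernel] = svc.score(X_test, y_test)

# scores can now be plotted as a bar graph, one bar per kernel
```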

3.7. Risks involved

Some risks in this project are as follows:

• A trained model, based on machine learning, is dependent on the efficiency of the

algorithms used to train it and, sometimes, they may not produce the best results.

• The implementation phase may take longer than expected due to personal and

employment commitments.

• Since the running time of a program depends partly on the CPU speed and processor, should the program take too long to run, the dataset may need to be reduced.

4. Results

4.1. Decision Tree Classifier results

➔ The below results are based on the following split of the dataset: 67% training and 33% test

Figure 2.9 Decision Tree Classifier Result Graph

In the line graph shown in Figure 2.9 above, the x-axis holds the values for the maximum number of features considered during each iteration of the decision tree process, and the y-axis shows the accuracy scores. An ordered pair on this line graph gives the maximum number of features the decision tree may use when choosing a path from the root node, which considers the total number of features (including dummy variables), together with the accuracy, resulting from that path, in predicting whether a person has heart disease.

Looking at the graph, when the decision tree selects only 1 feature to predict heart disease, it returns an accuracy of approximately 81%. Similarly, when the Decision Tree Classifier

selects 2 maximum features, an accuracy of 83% was achieved. The lowest accuracy score of

approximately 79% was achieved when the algorithm included 11 total features and, on the

contrary, the highest accuracy score of 83% was achieved when the maximum number of

features selected was 10.

4.2. K-Neighbors Classifier results

➔ The below results are based on the following split of the dataset: 67% training and 33% test

Figure 2.10 K-Neighbors Classifier Result Graph

The line graph in Figure 2.10 above shows the comparison between different values of K

and the accuracy that each one achieves. The number of Neighbors, K, represents the x-value of

the graph and shows how many neighbors are considered around the target variable data point. If we use only 1 neighbor to predict whether a person has a heart disease,

then we get the accuracy of 84%. Similarly, if there are 2 neighbors surrounding our target

variable data point, we have an accuracy of 79.5% to predict heart disease. The lowest accuracy

of 79% was achieved when there were 4, 13, and 29 neighbors selected for the target variable

data point. From the line graph above, the highest accuracy score of 84% was achieved when the

number of neighbors was selected to be 1.

4.3. Random Forest Classifier results

➔ The below results are based on the following split of the dataset: 67% training and 33% test

Figure 2.11 Random Forest Classifier Result Graph

The bar graph, shown in Figure 2.11 above, includes five varied sizes of decision tree,

represented as Number of estimators, and the accuracy score that each of the tree size achieved.

Applying the concepts of the Decision Tree Classifier, the results of the Random Forest

Classifier combined multiple decision trees and, based on the varied sizes shown in the bar graph

above, an accuracy score was computed for each decision tree size. For forests of 200, 400, 800, 1600, and 3200 decision trees, the accuracy scores achieved were 90.7%, 90.5%, 90.5%, 90.5%, and 90.9%, respectively. From the results, it can be concluded that the forest with 3200 decision trees had the highest accuracy score, 90.9%.

4.4. Support Vector Classifier results

➔ The below results are based on the following split of the dataset: 67% training and 33% test

Figure 2.12 Support Vector Classifier Result Graph

The bar graph in Figure 2.12 above plots results of three types of kernels in the Support

Vector Classifier. When the model uses the Linear kernel to predict heart disease, an accuracy of

83% is achieved. Similarly, when the Polynomial kernel is used to predict heart disease, an accuracy score of 84.4% is achieved. Finally, when the Radial Basis Function kernel is used to predict heart disease,

an accuracy score of 85.6% is achieved. As expected, the Radial Basis Function resulted in the

best outcome out of all the three kernels because, as mentioned in Section 3.6.c, this kernel has

the ability to plot data over the plane of infinite number of dimensions, where dimensions are

determined by the total number of features.

4.5. Result Analysis

Collecting the maximum accuracy scores from each of the four classifiers, it can be seen that

Decision Tree Classifier returned the highest accuracy score of 83% when the maximum number

of features selected was 10. K-Neighbors Classifier returned its highest accuracy score of 84%, which was achieved when the number of neighbors was 1. Random Forest Classifier returned its highest accuracy score of 90.9% when a forest of 3200 decision trees was selected. Finally, Support Vector Classifier returned its best accuracy score of 85.6% when using the Radial Basis Function kernel. Therefore, it can be concluded that the Random Forest Classifier performed best amongst all four classifiers. A few factors that may explain discrepancies in the results are as follows:

4.5.1. Lack of quality data – in this research, as previously mentioned, a dataset of 1,493 records was used. The presence of noisy data could have impacted the performance of each algorithm but, to remove as much of this noise as possible, data preprocessing methods such as the correlation matrix and bar plots were applied. Despite applying these two methods, it is possible that the data was not fully cleaned up and, as a result, some classifiers may not have been able to use the dataset optimally and predict heart disease accurately enough.

4.5.2. Implementation deficiency – inadequate data is one of the factors that can cause

the program to not execute effectively and, consequently, also not return the best

results.

4.5.3. Absence of enough knowledge about the programming language and the topic –

when doing this project, I initially had little knowledge about Python and, therefore,

throughout the year, it was a learning curve for me to understand this language so

that a model using machine learning algorithms could be successfully created.

Additionally, beyond an extremely basic knowledge of heart disease itself, I did not know about many disease-related facts such as the statistics on death rates, expenditures, some risk factors, etc. All of this information, some of which has been described in earlier portions of this paper, was learned through detailed research but, because the area of study for heart disease is so broad and deep, not all of it could be stated in this paper.

5. Conclusion and Future Work

The goal of this research was to develop a model that could help doctors predict whether an

individual is suffering from a heart disease or not before the disease spreads in the body and

reaches the stage where it becomes incurable and the individual loses their life. Luckily, with today’s tremendous advancements in science and technology, curing various life-threatening diseases, including heart disease, has become possible; had the same disease been detected in a person half a century ago, the chances of a medical expert detecting and curing it would have been exceptionally low. A common thread across generations is the stress that humans accumulate over the span of their lives. As time has passed, stress levels in humans have continued to increase. According to the American Psychological Association report titled “Stress in America 2020”, the average stress

level, on a scale of 1-10, of all adults across the United States in 2020 was 5.0, up from 4.9 in the previous two years, while Gen Z adults, who are adults born

between mid- to late-90s and early 2010s, dealt with an average stress level of 5.6 in 2018, 5.8 in

2019, and 6.1 in 2020. As stress increases, humans turn to unhealthy habits such as smoking, drinking, unhealthy diet consumption, etc., which invite unwanted diseases into the body. When a disease enters, an individual might not be aware of its presence or of the sometimes deadly consequences it brings. Elevated stress levels, unhealthy diet, drinking, smoking, and chest pain are among the many risk factors and symptoms associated with heart disease. This thesis was inspired by such heart-wrenching facts about those who suffer from heart disease, and that is why its goal became to develop a model that can help medical experts detect the disease in its preliminary stages and try to cure and save a human life.

Despite receiving decent accuracy scores for all the classifiers used in this project to predict

heart disease, there are still many areas that were left untouched or not explored in detail.

Therefore, the future scope of this project will be to include more sophisticated prediction

algorithms that can perform detailed analyses on the dataset, provide better results when predicting an outcome, and, in the end, be compared in a detailed analysis of the results.

Speaking of the dataset, it will be vital to have larger datasets that cover a broader population, because a larger dataset will evaluate the model more thoroughly and, after the algorithms have been applied, return results (i.e., accuracy scores) that are more realistic. Another task, particularly for the dataset, is to visit healthcare facilities such as hospitals and labs to collect actual patient data, combine that data, apply it to the different machine learning techniques, and compare the results with those obtained from the original dataset.

6. Bibliography

1. “NCI Dictionary of Cancer Terms.” National Cancer Institute,

https://www.cancer.gov/publications/dictionaries/cancer-terms/def/coronary-heart-disease.

2. “About Heart Disease.” Centers for Disease Control and Prevention, Centers for Disease

Control and Prevention, 12 July 2022, https://www.cdc.gov/heartdisease/about.htm.

3. Barla, Nilesh. “Data Science and Machine Learning in the Medical Industry.” Neptune.ai, 21

July 2022, https://neptune.ai/blog/data-science-and-machine-learning-in-the-medical-

industry.

4. Shestel, Alex. “Python in Healthcare.” Belitsoft, 18 Mar. 2021, https://belitsoft.com/custom-

application-development-services/healthcare-software-development/python-healthcare.

5. “Atypical Chest Pain.” Premier Pain & Spine, 23 Dec. 2020,

https://www.ppschicago.com/pain-management/chest-pain/atypical-chest-

pain/#:~:text=What%20is%20Atypical%20Chest%20Pain,adequate%20supply%20of%20ox

ygenated%20blood.

6. AlBadri, Ahmed, et al. “Typical Angina Is Associated with Greater Coronary Endothelial

Dysfunction but Not Abnormal Vasodilatory Reserve.” Clinical Cardiology, Wiley

Periodicals, Inc., Oct. 2017,

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5680106/#:~:text=Typical%20angina%20(T

A)%20is%20defined,coronary%20artery%20disease%20(CAD).

7. “Decision Tree Algorithm in Machine Learning - Javatpoint.” Www.javatpoint.com,

https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm.

8. Yiu, Tony. “Understanding Random Forest.” Medium, Towards Data Science, 29 Sept. 2021,

https://towardsdatascience.com/understanding-random-forest-58381e0602d2.

9. Saini, Anshul. “Decision Tree Algorithm - A Complete Guide.” Analytics Vidhya, 5 Apr.

2023, https://www.analyticsvidhya.com/blog/2021/08/decision-tree-algorithm/.

10. “Decision Tree.” Decision Tree - Learn Everything About Decision Trees,

https://www.smartdraw.com/decision-tree/.

11. “What Is the K-Nearest Neighbors Algorithm?” IBM, https://www.ibm.com/topics/knn.

12. Webtunix Solutions - Business Solution Provider. “K-Nearest Neighbors Classifier.” K-

Nearest Neighbors Classifier, https://www.ris-ai.com/k-nearest-neighbors-classification.

13. Gandhi, Rohith. “Support Vector Machine - Introduction to Machine Learning Algorithms.”

Medium, Towards Data Science, 5 July 2018, https://towardsdatascience.com/support-vector-

machine-introduction-to-machine-learning-algorithms-934a444fca47.

14. Dye, Steven. “An Intro to Kernels.” Medium, Towards Data Science, 5 Mar. 2020,

https://towardsdatascience.com/an-intro-to-kernels-

9ff6c6a6a8dc#:~:text=The%20linear%20kernel%20is%20typically,this%20kind%20of%20d

ata%20set.

15. Sidharth. “SVM Kernels: Polynomial Kernel - from Scratch Using Python.” PyCodeMates,

PyCodeMates, 12 Dec. 2022, https://www.pycodemates.com/2022/10/svm-kernels-

polynomial-

kernel.html#:~:text=The%20polynomial%20kernel%20is%20often,hyperplane%20that%20s

eparates%20the%20classes.

16. “Radial Basis Function Kernel - Machine Learning.” GeeksforGeeks, GeeksforGeeks, 22

July 2021, https://www.geeksforgeeks.org/radial-basis-function-kernel-machine-learning/.

17. “Stress in America™ 2020: A National Mental Health Crisis.” American Psychological

Association, American Psychological Association,

https://www.apa.org/news/press/releases/stress/2020/report-october#.

18. “Heart Disease and Stroke Statistics - 2023 Update.” Professional.heart.org,

https://professional.heart.org/en/science-news/heart-disease-and-stroke-statistics-2023-

update.

19. Shrestha, Ranjith, and Jyotir Moy Chatterjee. "Heart Disease Prediction System Using

Machine Learning." LBEF Research Journal of Science, Technology and Management 1.2

(2019).

20. Saboor, Abdul, et al. “A Method for Improving Prediction of Human Heart Disease Using

Machine Learning Algorithms.” Mobile Information Systems, Hindawi, 9 Mar. 2022,

https://www.hindawi.com/journals/misy/2022/1410169/.

21. S. Mohan, C. Thirumalai, and G. Srivastava, "Effective Heart Disease Prediction Using

Hybrid Machine Learning Techniques," in IEEE Access, vol. 7, pp. 81542-81554, 2019, doi:

10.1109/ACCESS.2019.2923707.

22. McClellan, Mark, et al. “Call to Action: Urgent Challenges in Cardiovascular Disease: A

Presidential Advisory From the American Heart Association.” AHA Journals, American

Heart Association, 24 Jan. 2019,

https://www.ahajournals.org/doi/full/10.1161/CIR.0000000000000652.

23. Pal M, Parija S, Panda G, Dhama K, Mohapatra RK. Risk prediction of cardiovascular

disease using machine learning classifiers. Open Med (Wars). 2022 Jun 17;17(1):1100-1113.

doi: 10.1515/med-2022-0508. PMID: 35799599; PMCID: PMC9206502.

24. Finkelhor, Robert S, et al. “The St Segment/Heart Rate Slope as a Predictor of Coronary

Artery Disease: Comparison with Quantitative Thallium Imaging and Conventional ST

Segment Criteria.” American Heart Journal, Mosby, 18 June 2004,

https://www.sciencedirect.com/science/article/abs/pii/0002870386902656#:~:text=The%20S

T%20segment%20shift%20relative,coronary%20artery%20disease%20(CAD).

25. Fraj, Mohtadi Ben. “In Depth: Parameter Tuning for SVC.” Medium, All Things AI, 5 Jan.

2018, https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-758215394769.
