Sharma Yash Thesis 2023
In Computer Science
By
Yash Sharma
May 2023
The thesis of Yash Sharma is approved:
_______________________________________ ________________
_______________________________________ ________________
_______________________________________ ________________
Acknowledgements
I would like to thank Dr. George Wang for being my committee chair for my thesis and
for all his help throughout the entire year. I would also like to thank Dr. Mahdi Ebrahimi for
considering becoming part of my committee this semester and Dr. Robert McIlhenny for
Table of Contents
Signature Page
Acknowledgements
List of Tables
List of Figures
Abstract
1. Introduction
1.1. Objective
1.3. Target
2.1. Survey
2.2. Background
4. Results
6. Bibliography
List of Tables
List of Figures
5. Figure 2.5 Optimal hyperplane selection graphs for Support Vector Classifier
Abstract
By
Yash Sharma
In a rapidly evolving world, humans face numerous challenges, ranging from learning how to talk and walk to leading a nation, and nearly anything else a mind can imagine. Confronted with so many regular demands, people naturally look for easy solutions that reduce the stress they are facing and, in doing so, often compromise their health. When health is compromised, the consequences can be deadly. One such consequence is developing heart disease, which can become life-ending if not discovered in its preliminary stages. To help prevent loss of life due to lack of awareness, it is crucial to have a model that can help predict heart disease in its early stages.
In the experiment, the dataset is imported along with all the necessary Python libraries, including those for the machine learning algorithms. The dataset is then explored and enters the preprocessing phase, where, with the help of a correlation matrix and bar plots, it is standardized and scaled for optimality. Once the dataset is standardized and scaled, multiple algorithms are applied, each producing an accuracy score for predicting whether a person has heart disease. Once the accuracy scores are computed, we will know which algorithm provided the best results for our dataset.
1. Introduction
1.1. Objective
Heart disease is a condition in which there is a blockage in the coronary arteries, the blood vessels that carry blood and oxygen to the heart. Coronary heart disease is caused by the buildup of fatty material and plaque inside the coronary arteries.1 As deadly as it sounds, heart disease is not easy to identify in its preliminary stages of development. Researchers have been constantly attempting to find better detection techniques that can identify heart disease in a person in time, as current techniques are not as effective at detecting the disease in its early development stages as one would hope, because of limits on accuracy and computational time.20 Additionally, when professional health experts and advanced technology are not readily available, it can become tremendously challenging to detect any symptoms of heart disease until the person starts experiencing chest pain or trouble breathing, by which time it might be very difficult to cure the disease and save that person's life. To prevent individuals from going through this hardship, and to help doctors detect this disease in its preliminary stages, it is important to have a tool that detects such a disease early.
The goal of this research is to develop a model that will help doctors predict whether an individual is suffering from heart disease before the disease spreads in the body and makes the situation difficult to control. It focuses on four main machine learning techniques: 1) Decision Tree Classifier, 2) K-Neighbors Classifier, 3) Random Forest Classifier, and 4) Support Vector Classifier. A dataset containing factors related to heart disease will be preprocessed, evaluated, and trained and, once it has been properly scaled and standardized, will be applied to each of the above four algorithms to produce accuracy scores for predicting whether an individual has heart disease.
According to the Centers for Disease Control and Prevention, or CDC, website,2 heart diseases, also known as cardiovascular diseases, are not easily detected until a person experiences signs or symptoms of a heart attack, such as chest pain, upper back/neck pain, heartburn, nausea, or vomiting; of heart failure, such as trouble breathing, fatigue, or swelling of the feet; or of an arrhythmia, such as palpitations. The top factors that contribute to this deadly disease are high blood pressure, high cholesterol, and smoking and, surprisingly, almost half of the population (47%) of the United States has at least one of these three risk factors. Statistics released for 2020 indicated that approximately 697,000 people in the United States died of heart disease; this means that heart disease accounted for about one in every five deaths in the country in 2020. Other factors that also contribute to the development of heart disease include an unhealthy diet, physical inactivity, and excessive alcohol use.
According to the article Heart Disease and Stroke Statistics – 2023 Update on the American Heart Association website, cardiovascular disease, or heart disease, was the leading cause of death in the United States in 2020, with a whopping 928,741 total deaths reported. Coronary heart disease was the leading contributor to cardiovascular deaths at 41.2%, followed by stroke, which contributed 17.3%; high blood pressure caused 12.9% of these deaths, 9.2% were due to heart failure, and 2.6% were due to diseases of the arteries. Based on data collected in 2020, the age-adjusted death rate due to heart disease in the United States was 224.4 per 100,000 people, not far from the global age-adjusted rate of 239.8 per 100,000. Shifting the focus to the financial aspect of cardiovascular disease, the statistics indicated that direct spending to tackle the disease in 2018 and 2019 combined was $251.4 billion, while lost productivity and mortality cost another $155.9 billion, bringing the total cost of battling cardiovascular disease to $407.3 billion.
Data science and machine learning have been two of the main driving forces behind the advancement of technology in the medical and healthcare sector. The majority of companies in this sector are increasingly adopting and integrating data science and machine learning into their systems and, by doing so, are improving their chances of finding patterns for different diseases as well as providing patients with better feedback on diseases they may develop.
According to the Neptune Blog3, adopting machine learning tools allows medical organizations to "find patterns, extract knowledge from data and tackle a diverse set of computationally hard tasks." Data scientists can utilize machine learning tools to determine the relationship between "various attributes and features of the patients with the labelled disease." This, in turn, allows doctors to understand the patterns of the disease.
When dealing with a large set of data, it is important to consider the threats that surround it. Hackers breaking into a system and stealing the data of millions of patients can negatively impact an organization's integrity and reputation. To prevent such situations, it is crucial that developers build their models in a language whose ecosystem provides tools for securing against outside threats. The Stack Overflow Developer Survey conducted in 2017, whose chart is cited in Belitsoft's article "Python in Healthcare"4, indicates that Python is one of the top five most popular languages used for developing healthcare systems.
1.3. Target
This model will be useful for specific types of hospitals and healthcare centers, as well as for health experts who have a detailed understanding of heart disease, know how to weigh each factor related to it, and can provide their patients with an educated explanation. Using several factors related to heart disease, these health experts will be able to predict whether a patient is suffering from heart disease and, if so, diagnose that patient appropriately and possibly even cure the disease, thereby saving a life.
1.4. Planning
Throughout the year, different challenges and issues forced the original timeline to be
revised. Some of the challenges encountered during the development were as follows:
1. An error occurred while importing rcParams, which contains properties in the matplotlibrc
other material was saved, caused a major delay in staying consistent with the
3. The original dataset from Kaggle consisted of 13 features while the newly found dataset, from IEEE, consisted of 11 features. Hence, the two additional features, ca (the number of major vessels, from 0 to 3) and thal (a blood disorder called thalassemia), were removed. Removing these two features did not really interfere with the consistency of the overall model, while it also allowed the larger dataset to be merged with the original one.
With the above issues, the original timeline was adjusted and revised as follows:
| Dates | Task | Estimated Duration (weeks) | Output |
| Prior to work | Identify problems and challenges with detecting heart diseases (DONE) | 1 | Determining idea for thesis |
| | Research on existing work done in this field (DONE) | 1 | Determining problem statement |
| | Determine the requirements and tools needed to execute this research (DONE) | 0.5 | Determine technical requirements for project |
| | Complete initial proposal (DONE) | 1.5 | Written up 5-page proposal for professor |
| | Total | 4 | |
| Semester 1 (Aug - Dec 2022) | Work on the preprocessing of the dataset (i.e., cleanup the dataset) (DONE) | 2 | Cleanup the dataset; split into testing and training |
| | Research on Decision Tree Classifier algorithm and begin to implement it into the program (DONE) | 4 | Model will be executed using Decision Tree Classifier algorithm |
| | Research on Random Forest Classifier algorithm (DONE) | 3 | Find background information about the algorithm such as what it is, what it is used for, etc. |
| | Total | 12 | |
| Semester 2 (Jan - May 2023) | Implementation of Random Forest Classifier algorithm (DONE) | 3 | Model will be executed using Random Forest Classifier algorithm |
| | Research on Support Vector Classifier algorithm and begin to implement it into the program (DONE) | 3 | Model will be executed using Support Vector Classifier algorithm |
| | Finding of additional dataset (DONE) | 2 | Increase in the size of the overall dataset |
| | Thesis Writeup | 7 | Draft the actual thesis |
| | Prepare defense | 2 | Create presentation slides and prepare for defense talk |
| | Total | 17 | |
Table 1.1 Revised timeline for the project
2. Survey and Background
2.1. Survey
Research suggests that there are many methodologies that can help in predicting heart disease using machine learning. In their research, Heart Disease Prediction System Using Machine Learning, Ranjit Shrestha and Jyotir Moy Chatterjee mention that the Artificial Neural Network algorithm is one of the recommended techniques for predicting heart disease, mainly due to its ability to reduce the cost of diagnosis, in that a different test could be performed "to make a decision for diagnosis of HD (heart disease)."19 In Hindawi's article, A Method for Improving Prediction of Human Heart Disease Using Machine Learning Algorithms, Abdul Saboor and his co-contributors state that auscultation was the primary method physicians used to differentiate between normal and abnormal cardiac sounds. Physicians listened to these cardiac sounds with their stethoscopes to identify several types of heart disease but, as the saying goes, every benefit comes with a flaw. Despite allowing physicians to detect several types of heart disease, the auscultation method did not always yield accurate classification and clarification of the distinct sounds, because the result depends on the knowledge and practice of the doctors, which can only be gained through lengthy examination. Apart from the manual procedure, the authors also cover the use of machine learning algorithms. Several algorithms are mentioned, namely the Multinomial Naïve Bayes Classifier, Support Vector Machine Classifier, Extra Tree Classifier, and XGBoost Classifier, which can be used to predict heart disease. Using a dataset of 303 records and 76 attributes, the 10-fold cross-validation method was used to train and evaluate the model. It was chosen primarily because of the small size of the dataset, which does not provide a large amount of training data: with a plain train-test split the model's predictive performance may be underestimated, whereas under 10-fold cross-validation the model always has the other 90% of the data to learn from. Recall, accuracy, precision, and F-measure were used to evaluate the model's performance. Data preprocessing eliminated any missing, or null, values from the dataset and scaled and standardized it. The results of the experiment showed that the Extra Tree Classifier and AdaBoost Classifier returned the highest accuracy, 90%. The Random Forest Classifier and the Latent Dirichlet Allocation algorithm returned an accuracy of roughly 88%, the XGBoost Classifier 85%, the Classification and Regression Tree algorithm roughly 84%, Logistic Regression 75%, the Support Vector Machine Classifier 70%, and the Multinomial Naïve Bayes Classifier 59%.20
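The 10-fold procedure described above can be sketched briefly; the dataset below is a synthetic stand-in (the study's actual 303 records are not reproduced here), so the resulting score is illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the 303-record heart-disease dataset.
X, y = make_classification(n_samples=303, n_features=13, random_state=0)

model = ExtraTreesClassifier(random_state=0)
# cv=10 splits the small dataset into 10 folds; each fold is held out
# once while the model trains on the remaining ~90% of the data.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(round(scores.mean(), 3))
```

Averaging the ten fold scores gives a less pessimistic estimate than a single train-test split when data is scarce.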
In their article titled "Effective Heart Disease Prediction Using Hybrid Machine Learning Techniques", published by IEEE, S. Mohan, C. Thirumalai, and G. Srivastava discuss the challenges faced when tackling life-threatening heart disease. They note that several factors contribute to the advancement of this disease, such as diabetes, high blood pressure, high cholesterol, and an abnormal pulse rate. To help identify the disease in its preliminary stages, the authors evaluate several machine learning algorithms, including Naïve Bayes, Generalized Linear Model, Logistic Regression, Deep Learning, Decision Tree, Random Forest, and Support Vector Machine. They aim to combine the perspectives of medical science and data mining to find different metabolic syndromes, investigate the data, and ultimately perform heart disease prediction. The dataset contains a total of 303 records; after preprocessing, six records with null values were removed. The dataset also includes thirteen attributes, two of which, age and gender, identify personal information about the patient, while the remaining eleven hold vital clinical records. Furthermore, classification in their research uses an RBFN, or Radial Basis Function Network, with the dataset divided into 70% for training and 30% for testing. After applying each algorithm to the model, standard performance metrics such as accuracy, precision, and classification error were considered to determine the model's performance efficacy. Among the results, Deep Learning achieved an accuracy score of 87.4%, followed by the Random Forest and Support Vector Classifiers, both tied at 86.1%; the Generalized Linear Model scored 85.1%, closely followed by the Decision Tree Classifier at 85%; Logistic Regression had an accuracy score of 82.9%; and, lastly, Naïve Bayes recorded 75.8%.
2.2. Background
As mentioned previously in this paper, heart disease is one of the deadliest diseases, if not the deadliest, that humans lose their lives to every year. Although there have been advancements in the medical field for detecting various life-threatening diseases, there remains a substantial risk that the disease will not be diagnosed in its early development stage and, as a result, cannot be cured at the right time. In their article, Call to Action: Urgent Challenges
Mark McClellan and co-authors state that cardiovascular disease, or heart disease, poses a tremendous threat to people in the United States and globally. 85.6 million Americans have been affected by this disease in one form or another (i.e., have died or have lived with a disability),
and, financially, the disease costs one of every six dollars spent on healthcare. Innovation remains a big concern when it comes to finding effective medicines and treatments for cardiovascular disease, and the overall cost of treating it continues to rise. The authors also share public feedback about treatment: while many feel that advances in technology have certainly improved their chances of successfully battling this disease, many others are deeply concerned about the sky-rocketing cost they have to pay for treatment. Besides affordability, patients share another concern: the care and treatment do not always match their priorities and goals, in that some of the care they receive does not feel required or necessary. Meanwhile, they continue to struggle with "bad habits" such as an unhealthy diet, smoking, and drinking that act as catalysts in increasing their risk of heart disease. Cardiovascular disease accounts for over 17 million deaths per year globally, a figure expected to rise to more than 23.6 million by 2030. Mortality from this disease varies with numerous factors such as gender, race, and ethnicity, although over the last few years the influential factors have shifted toward geographic location, wealth, and education. Besides the concerning death rates, spending on heart disease has also increased. The American Heart Association and American Stroke Association estimated that a total of $318 billion was spent on all aspects of cardiovascular disease. Aside from direct expenses, there are also indirect costs related to the disease. For example, individuals in the workforce needing to stay on medical leave or disability during their recovery results in lost productivity and indirectly affects the overall economy of the nation. Additional costs include hiring someone to provide immediate medical assistance or simply help with everyday routines, individuals leaving the workforce, etc.22 The benefits of advancing healthcare technology have also, paradoxically, hindered the fight against diseases such as heart disease, primarily because of the increased costs incurred throughout the treatment process. Many middle-class and lower-income individuals hesitate to begin treatment for heart disease because affordability is their biggest issue. Affordability causes individuals in these groups to compromise on their treatment, in that they might not pursue all the steps that a full course of treatment would include.
With all the above-mentioned challenges, and many more, involved with heart disease, this project aims to make a small contribution toward the treatment aspect of the problem by creating a model, built on a few factors related to heart disease, using several machine learning algorithms, that will assist medical professionals, who understand this disease and know how to make use of these factors, to predict whether an individual has heart disease.
A great amount of work has been done on predicting heart disease using machine learning algorithms. One such study, conducted by Madhumita Pal and her co-authors, titled Risk prediction of cardiovascular disease using machine learning classifiers, and published on the National Library of Medicine website, examines a couple of machine learning classifiers used
to predict heart disease. They use a dataset of 303 records, of which 80%, or 243 records, were used for training and the remaining 20%, or 60 records, for testing. The dataset contains 13 important features: age, gender, chest pain, resting blood pressure, cholesterol, fasting blood sugar, resting electrocardiographic result, maximum heart rate, exercise-induced angina, ST depression, slope of peak ST segment, number of major vessels, thallium stress result, and a target variable (1 indicates the person has cardiovascular disease and 0 indicates they do not). After the dataset is preprocessed using the correlation matrix, the two machine learning algorithms are applied to the model. The K-Nearest Neighbors algorithm returned an accuracy score of 73.77% in predicting heart disease, while the Multi-Layer Perceptron algorithm resulted
3. Project Development and Design
This project develops a machine learning model that will predict whether a person has heart disease. The model will be implemented in a Jupyter Notebook. I will import several libraries, such as:
• NumPy – a numeric library for Python that supports linear algebra as well as arrays.
• Pandas – a Python library used for data analysis and machine learning; it is used here to manipulate and analyze the CSV files and data frames.
• Matplotlib – a Python library used for plotting and visualizing data. It will help in plotting graphs, line plots, histograms, etc., as well as defining parameters for the figures.
• train_test_split – this scikit-learn function will be used to split the dataset into two parts: a training dataset and a test dataset.
Next, using the scikit-learn, or sklearn, library in Python, I import the four machine learning algorithms that will be used to implement the model that predicts heart disease in a person. These algorithms are the K Neighbors Classifier, Decision Tree Classifier, Random Forest Classifier, and Support Vector Classifier.
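The import cell described above might look roughly as follows; the module paths are the standard scikit-learn locations, though the original notebook's exact import list may differ slightly:

```python
import numpy as np                 # linear algebra and arrays
import pandas as pd                # CSV handling and data frames
import matplotlib.pyplot as plt    # plots, histograms, figure parameters

# Dataset splitting utility.
from sklearn.model_selection import train_test_split

# The four classifiers used in this project.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
```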
3.2. Data Understanding and Preprocessing
The dataset will be imported from Kaggle as well as IEEE and consists of approximately 1,500 random people, each record containing the following attributes:
o 1 = atypical angina, which means that one feels chest pain which does not meet with the
o 4 = non-anginal pain
• Resting blood pressure, or trestbps – the blood pressure measured when the person is admitted
• Serum cholesterol, or chol – the amount of total cholesterol in a person's blood
• Fasting blood sugar, or fbs – blood sugar measured after overnight fasting; fasting blood sugar of 99 mg/dl or less is considered normal, between 100 and 125 mg/dl prediabetic, and 126 mg/dl or greater diabetic. In the dataset, fbs = 1 if the fasting blood sugar is greater than 120 mg/dl, and 0 otherwise.
heart while you are at rest, provides information about your heart rate and rhythm, and can also show if there is enlargement of the heart or evidence of a previous heart attack. 0 = showing probable or definite left ventricular hypertrophy by Estes' criteria, 1 = normal, and 2
• Exercise-induced angina, or exang – a type of chest pain caused, or induced, by exercise, stress, etc., creating an overload on the heart. In the dataset, 1 = pain, 0 = no pain.
• slope – represents the ST segment shift relative to the increase in heart rate, which is
disease.
Figure 2.1 shows the information of the dataset after using the info() method.
In order to understand the data, I will create a correlation matrix that will help show the positive and negative correlations of the above-mentioned features. Then, bar plots will be created for two target classes, one for records where the disease is present and one for records where it is not. The plots will help isolate categorical variables, and some of these variables will then be converted into temporary dummy variables so that all the values can be scaled before any machine learning algorithm is applied.
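The dummy-variable, scaling, and splitting steps described above might be sketched as follows; the tiny frame and its values are hypothetical stand-ins for the real dataset, and the real notebook's column set is larger:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Hypothetical frame standing in for the real heart-disease dataset;
# column names follow the attributes described in Section 3.2.
df = pd.DataFrame({
    "age": [52, 61, 45, 58],
    "cp": [0, 1, 2, 3],            # chest-pain type (categorical)
    "chol": [212, 240, 198, 260],
    "target": [1, 0, 0, 1],
})

# Convert the categorical variable into dummy columns.
df = pd.get_dummies(df, columns=["cp"], dtype=int)

X = df.drop(columns="target")
y = df["target"]

# Standardize/scale the features so they share a common range.
X_scaled = StandardScaler().fit_transform(X)

# Split into training and testing portions.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=0)
```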
Once the dataset has been scaled and standardized, I will split it into two parts, testing and training. Then, I will start applying each of the machine learning algorithms, as described in the sections below.
The Decision Tree Classifier uses a classification tree to produce a predictive model that formulates conclusions from a range of observations. The decisions, or tests, are performed based on the features provided in the dataset. As the name of this algorithm suggests, it uses a tree-like structure to display the results (i.e., predictions) achieved through a number of feature-based splits. As shown in Figure 2.2 below, the process of the Decision Tree Classifier begins at the root node. The root node is then divided into various decision nodes and finally, based on the decisions selected, we reach the conclusion, or leaf node.
Figure 2.2 Generalized layout of Decision Tree (Source: Decision Tree)10
A few reasons for using this algorithm include: 1) it is useful for solving decision-related problems such as the one in this thesis; 2) its tree-like structure helps in thinking through all practical solutions; and 3) it requires less data clean-up compared to other algorithms.7
In my model, using the Decision Tree Classifier, most of the coding hassle was removed once the Decision Tree module was imported from scikit-learn. It provided access to the DecisionTreeClassifier() class and the few variables needed to calculate accuracy. First, a class value was assigned to every single data point. Then, since the maximum number of features to select for my model was variable, I chose a range of maximum features between 1 and 23. Using the DecisionTreeClassifier() class, the parameters max_features and random_state were set, where max_features selects the maximum number of features from the above range and random_state controls the random selection of features at each decision node. Then, using the fit() method, the training dataset was taken as an argument, and the results on the testing dataset were appended to a list. Finally, a line graph was created to plot the accuracy scores received over the selected range of maximum features (i.e., 1 to 23).
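The loop described above could be sketched roughly as follows; the synthetic dataset and the random_state value are assumptions standing in for the thesis's actual preprocessed data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed dataset (23 columns, matching
# the max_features range described in the text).
X, y = make_classification(n_samples=300, n_features=23, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

dt_scores = []
# max_features ranges over 1..23, as described above.
for k in range(1, 24):
    clf = DecisionTreeClassifier(max_features=k, random_state=0)
    clf.fit(X_train, y_train)
    dt_scores.append(clf.score(X_test, y_test))  # test-set accuracy
```

Each entry of `dt_scores` would then be plotted against its max_features value to form the line graph.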
The K Neighbors Classifier is a non-parametric algorithm that uses the proximity of the K nearest neighbors of a given data point to make classifications, or predictions, about how individual data points cluster. This algorithm is also known as a lazy learner because "it does not learn from the training set immediately instead it stores the dataset and at the time of classification, it performs an action on the dataset."12 The dataset is simply stored during the training phase but, once new data arrives, the algorithm classifies each new point into the category of the stored examples it most resembles. The K Neighbors Classifier is used for classification problems, but it can also be applied to regression problems. Given a point with an unknown class, the goal is to determine which other points in the feature space lie closest to it. These points become the K nearest neighbors. In a classification problem, the data point is given a class label on the basis of the most frequent label among those neighbors.11 For reference, an image of the K Neighbors Classifier is provided below in Figure 2.3:
Figure 2.3 K-Neighbors Classifier Diagram (Source: Webtunix Solutions - Business Solution
Provider)12
In Figure 2.3, the point colored in green in the innermost circle is the data point and, based
on the vicinity of the other data points to the green data point, they each are assigned class
labels.12
In my model, using the K Neighbors Classifier, I looked for the K nearest neighbors around my target variable's data point and, based on the majority class among them, assigned a class label to that data point. I then chose how many neighbor counts to evaluate: 30 values of K, from 1 to 30. Using the KNeighborsClassifier() class, available after importing it from scikit-learn, the n_neighbors parameter determined the number of neighbors used in each iteration. After this, using the fit() method, I fit the model on the training dataset. Then, using the score() function, I passed the test inputs and targets to compute the accuracy score and, using the append() function, added each result to a list alongside the training results. The results achieved were then plotted onto a line graph.
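A rough sketch of this loop, again using a synthetic stand-in dataset rather than the thesis's real one:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed heart-disease dataset.
X, y = make_classification(n_samples=300, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn_scores = []
# Evaluate K = 1..30 neighbors, as described above.
for k in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    knn_scores.append(knn.score(X_test, y_test))  # test-set accuracy
```

Plotting `knn_scores` against K reproduces the line graph mentioned in the text.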
3.5. Random Forest Classifier Overview
The Random Forest Classifier will create multiple decision trees, each considering a different subset of the total features, to make better predictions for the overall dataset. Each individual tree in this algorithm performs a class prediction, and the class receiving the most votes among all the trees becomes the model's prediction. A visual representation is shown in Figure 2.4:
Figure 2.4 Random Forest Classifier using Decision Trees (Source: Yiu, Tony)8
The fundamental reason the Random Forest Classifier performs so well is that a large number of relatively uncorrelated trees operating together will outperform any of the individual trees.
In my model, the Random Forest Classifier created multiple decision trees by performing random selection of features from the total number of features. To predict the class, I used different forest sizes: 200, 400, 800, 1600, and 3200 trees. Using the RandomForestClassifier() class, available after importing it from scikit-learn, I passed two parameters: n_estimators and random_state. n_estimators determined the total number of trees used in each iteration, and random_state controlled the sampling of the features considered when looking for the best split at each decision node. After this, using the fit() method, I built a forest of the chosen size from the training dataset. Then, using the score() function, I passed the test inputs and targets to compute the accuracy score and, using the append() function, added each result to a list alongside the training results. Finally, I plotted the accuracy scores for each of the forest sizes onto a bar graph.
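This loop might be sketched as follows, using a synthetic stand-in dataset; note that the larger forest sizes are slow to fit and are kept here only to mirror the sizes named above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed heart-disease dataset.
X, y = make_classification(n_samples=300, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf_scores = []
# Forest sizes named in the text above.
for n in [200, 400, 800, 1600, 3200]:
    rf = RandomForestClassifier(n_estimators=n, random_state=0)
    rf.fit(X_train, y_train)
    rf_scores.append(rf.score(X_test, y_test))  # test-set accuracy
```

Each entry of `rf_scores` becomes one bar in the bar graph of accuracy versus forest size.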
The Support Vector Classifier aims to find a hyperplane in N-dimensional space, where N is the total number of features, that specifically classifies every data point in that plane.
Figure 2.5 Optimal hyperplane selection graphs for Support Vector Classifier13
As shown in Figure 2.5 above, there are many different hyperplanes that we can select in order to separate the two classes of data points. The ultimate goal is to find the hyperplane that has the maximum distance (margin) from the data points of both classes. Hyperplanes are extremely useful when classifying the data points because points positioned on either side of the hyperplane are grouped into different classes.13
In my model, using the Support Vector Classifier, I selected the three kernels below for comparison:
a. Linear: this is the simplest of the known kernels since the data is represented over an x-y plane with an optional constant term, c. Linear kernels are quite simple as they are not represented over higher dimensions, yet they are extremely useful for problems where the dataset consists of a large number of features.14 A visual representation of the linear kernel is shown in Figure 2.6, which shows how a set of data points can be separated linearly:

Figure 2.6 Linear Kernel25
b. Polynomial, or poly: when the data cannot be linearly separated, the polynomial kernel becomes useful since it maps the data over higher-dimensional planes and, in doing so, can sometimes identify the separation of the two classes in the higher-dimensional space. A visual representation of the polynomial kernel is shown in Figure 2.7, which shows how a set of data points can be separated in a non-linear manner:
c. Radial Basis Function, or rbf: this kernel is slightly more advanced than the polynomial kernel due to its ability to plot data over a plane with an infinite number of dimensions. Where N is the total number of features, the Radial Basis Function kernel can make calculations for any N > 1, such that quadratic, cubic, or higher-order polynomial separations can be represented. A visual representation of this kernel is shown in Figure 2.8, which shows how a set of data points can be separated over higher dimensions:
Continuing with the process in my model, any one of the above three kernels is selected by passing the kernel parameter to the SVC() class, available after the Support Vector Classifier is imported from scikit-learn. Then, using the fit() method, I fit the SVM model to the training dataset. After this, the score() function returns the accuracy score on the test data and target values, and using the append() function, the results from the training data are appended to the testing data results. Finally, the accuracy result for each of the three kernels is plotted onto a bar graph.
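The kernel-selection loop described above can be sketched as follows. Again, the dataset is a synthetic stand-in; only the three kernel names and the SVC()/fit()/score() calls come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the heart-disease dataset.
X, y = make_classification(n_samples=500, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

kernel_scores = {}
for kernel in ("linear", "poly", "rbf"):
    svc = SVC(kernel=kernel)        # select the kernel for this iteration
    svc.fit(X_train, y_train)       # fit the SVM to the training data
    kernel_scores[kernel] = svc.score(X_test, y_test)  # test accuracy

print(kernel_scores)
```

The per-kernel accuracies collected in the dictionary are what the results bar graph compares.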
3.7. Risks involved
• The accuracy of the model depends on the quality of the dataset and the algorithms used to train it and, sometimes, they may not produce the best results.
• The implementation phase may take longer than expected due to personal and
employment commitments.
• Since the running time of a program depends partly on the CPU speed and processor, should the program take too long to run, the dataset may need to be reduced.
4. Results
➔ The below results are based on the following split of the dataset: 67% training and 33% test
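A 67%/33% split of this kind corresponds to scikit-learn's train_test_split with test_size=0.33; a minimal sketch on stand-in data (the row and feature counts are illustrative, echoing the dataset size mentioned earlier):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for the 1,493-row heart-disease dataset described earlier.
X, y = make_classification(n_samples=1493, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)  # 67% train / 33% test

print(len(X_train), len(X_test))
```
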
In the line graph shown in Figure 2.9 above, the x-axis is assigned to the values for the maximum number of features considered during each iteration of the decision tree process, and the y-axis provides the accuracy scores. An ordered pair on this line graph gives the maximum number of features used when the decision tree chooses a path from the root node, out of the total number of features (including dummy variables), and the resulting accuracy in predicting whether a person has heart disease. Looking at the graph, when the decision tree selects only 1 feature to predict heart disease, it returns an accuracy of approximately 81%. Similarly, when the Decision Tree Classifier selects a maximum of 2 features, an accuracy of 83% is achieved. The lowest accuracy score of approximately 79% was achieved when the algorithm included 11 total features and, on the contrary, the highest accuracy score of 83% was achieved when the maximum number of features selected was 10.
➔ The below results are based on the following split of the dataset: 67% training and 33% test
The line graph in Figure 2.10 above shows the comparison between different values of K and the accuracy that each one achieves. The number of neighbors, K, represents the x-value of the graph and indicates how many neighboring data points are considered around the target variable data point. If we use only 1 neighbor (i.e., the single nearest data point) to predict whether a person has heart disease, then we get an accuracy of 84%. Similarly, if there are 2 neighbors surrounding our target variable data point, we have an accuracy of 79.5% in predicting heart disease. The lowest accuracy of 79% was achieved when 4, 13, or 29 neighbors were selected for the target variable data point. From the line graph above, the highest accuracy score of 84% was achieved when the number of neighbors was 1.
➔ The below results are based on the following split of the dataset: 67% training and 33% test
The bar graph, shown in Figure 2.11 above, includes five varied decision tree counts, represented as the number of estimators, and the accuracy score that each tree count achieved. Applying the concepts of the Decision Tree Classifier, the Random Forest Classifier combined multiple decision trees and, for each of the forest sizes shown in the bar graph above, an accuracy score was computed. For forests of 200, 400, 800, 1600, and 3200 decision trees, accuracy scores of 90.7%, 90.5%, 90.5%, 90.5%, and 90.9%, respectively, were achieved. From these results, it can be concluded that the forest of 3200 decision trees had the highest accuracy score of 90.9%.
➔ The below results are based on the following split of the dataset: 67% training and 33% test
Figure 2.12 Support Vector Classifier Result Graph
The bar graph in Figure 2.12 above plots the results of the three kernel types in the Support Vector Classifier. When the model uses the Linear kernel to predict heart disease, an accuracy of 83% is achieved. Similarly, when the Polynomial kernel is used, an accuracy score of 84.4% is achieved. Finally, when the Radial Basis Function kernel is used, an accuracy score of 85.6% is achieved. As expected, the Radial Basis Function kernel produced the best outcome of the three because, as mentioned in Section 3.6.c, it has the ability to plot data over a plane with an infinite number of dimensions, allowing it to capture more complex separations between the classes.
4.5. Result Analysis
Collecting the maximum accuracy scores from each of the four classifiers, it can be seen that the Decision Tree Classifier returned its highest accuracy score of 83% when the maximum number of features selected was 10. The K-Neighbors Classifier returned its highest accuracy score of 84%, achieved when the number of neighbors was 1. The Random Forest Classifier returned its highest accuracy score of 90.9% when a forest of 3200 decision trees was selected. Finally, the Support Vector Classifier returned its best accuracy score of 85.6% when using the Radial Basis Function kernel. Therefore, it can be concluded that the Random Forest Classifier performed best amongst all four classifiers. A few discrepancies in the results are as follows:
4.5.1. Lack of quality data – in this research, as previously mentioned, a dataset of 1,493 records was used. The presence of noisy data could have impacted the results. To clean the data as much as possible, data preprocessing methods such as computing a correlation matrix and plotting bar graphs were applied. Despite applying these two methods, it is possible that the data may not have been fully cleaned up and, as a result, some classifiers may not have been able to use the dataset optimally and predict heart disease accurately enough.
4.5.2. Implementation deficiency – inadequate data is one of the factors that can cause
the program to not execute effectively and, consequently, also not return the best
results.
4.5.3. Absence of enough knowledge about the programming language and the topic – when doing this project, I initially had little knowledge of Python and, therefore, throughout the year, it was a learning curve for me to understand this language well enough to implement the model. Additionally, beyond extremely basic knowledge of heart disease itself, I did not know about many disease-related details such as statistics on death rates, expenditures, and some risk factors. All of this information, some of which has been described in the earlier portions of this paper, was learnt after detailed research but, because the area of study for heart disease is so broad and deep, not all of the information could be stated in this paper.
5. Conclusion and Future Work
The goal of this research was to develop a model that could help doctors predict whether an
individual is suffering from a heart disease or not before the disease spreads in the body and
reaches the stage where it becomes incurable, and the individual loses their life. Luckily today,
with tremendous advancements in the field of science and technology, curing various life-threatening diseases, including heart disease, has become possible; had the same life-threatening disease been detected in a person half a century ago, the chances of a medical expert detecting and curing that person's disease would have been exceptionally low. One thing common across generations is the stress that humans accumulate during the span of their lives, and, as time has passed, stress levels have continued to increase. According to the American Psychological Association report titled "Stress in America 2020", the average stress level, on a scale of 1-10, of all adults across the United States in 2020 was 5.0, up from 4.9 in each of the previous two years, while Gen Z adults, who were born between the mid-to-late 1990s and early 2010s, dealt with average stress levels of 5.6 in 2018, 5.8 in 2019, and 6.1 in 2020.17 As stress increases, humans turn to unhealthy habits such as smoking, drinking, and unhealthy diet consumption, which invite unwanted diseases into the body. When a disease enters, an individual might not be aware of its presence or of the sometimes deadly consequences that it brings. Elevated stress, unhealthy diet, drinking, and smoking are among the many risk factors for heart disease, and chest pain is one of its common symptoms. This thesis was inspired by such heart-wrenching facts about people who suffer from heart disease, and that is why its goal became to develop a model that can help medical experts detect this disease in its preliminary stages.
Despite receiving decent accuracy scores for all the classifiers used in this project to predict
heart disease, there are still many areas that were left untouched or not explored in detail.
Therefore, the future scope of this project will be to include more sophisticated prediction algorithms that can perform detailed analyses on the dataset, provide better results when predicting an outcome and, in the end, be used for a detailed comparison of the results. Regarding the dataset, it will be vital to have larger datasets that cover a larger portion of the population, because a large dataset will evaluate the model more thoroughly and, after the algorithms have been applied, return results (i.e., accuracy scores) that are more realistic. Another piece of work that can be done, particularly for the dataset, is to visit healthcare facilities such as hospitals and labs, collect actual patient data, combine that data, apply this dataset to different machine learning techniques, and compare the results with those obtained in this research.
6. Bibliography
https://www.cancer.gov/publications/dictionaries/cancer-terms/def/coronary-heart-disease.
2. “About Heart Disease.” Centers for Disease Control and Prevention, Centers for Disease
3. Barla, Nilesh. “Data Science and Machine Learning in the Medical Industry.” Neptune.ai, 21
industry.
application-development-services/healthcare-software-development/python-healthcare.
https://www.ppschicago.com/pain-management/chest-pain/atypical-chest-
pain/#:~:text=What%20is%20Atypical%20Chest%20Pain,adequate%20supply%20of%20ox
ygenated%20blood.
6. AlBadri, Ahmed, et al. “Typical Angina Is Associated with Greater Coronary Endothelial
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5680106/#:~:text=Typical%20angina%20(T
A)%20is%20defined,coronary%20artery%20disease%20(CAD).
https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm.
8. Yiu, Tony. “Understanding Random Forest.” Medium, Towards Data Science, 29 Sept. 2021,
https://towardsdatascience.com/understanding-random-forest-58381e0602d2.
9. Saini, Anshul. “Decision Tree Algorithm - A Complete Guide.” Analytics Vidhya, 5 Apr.
2023, https://www.analyticsvidhya.com/blog/2021/08/decision-tree-algorithm/.
10. “Decision Tree.” Decision Tree - Learn Everything About Decision Trees,
https://www.smartdraw.com/decision-tree/.
13. Gandhi, Rohith. “Support Vector Machine - Introduction to Machine Learning Algorithms.”
machine-introduction-to-machine-learning-algorithms-934a444fca47.
14. Dye, Steven. “An Intro to Kernels.” Medium, Towards Data Science, 5 Mar. 2020,
https://towardsdatascience.com/an-intro-to-kernels-
9ff6c6a6a8dc#:~:text=The%20linear%20kernel%20is%20typically,this%20kind%20of%20d
ata%20set.
15. Sidharth. “SVM Kernels: Polynomial Kernel - from Scratch Using Python.” PyCodeMates,
polynomial-
kernel.html#:~:text=The%20polynomial%20kernel%20is%20often,hyperplane%20that%20s
eparates%20the%20classes.
17. “Stress in America™ 2020: A National Mental Health Crisis.” American Psychological
https://www.apa.org/news/press/releases/stress/2020/report-october#.
https://professional.heart.org/en/science-news/heart-disease-and-stroke-statistics-2023-
update.
19. Shrestha, Ranjith, and Jyotir Moy Chatterjee. "Heart Disease Prediction System Using
Machine Learning." LBEF Research Journal of Science, Technology and Management 1.2
(2019).
20. Saboor, Abdul, et al. “A Method for Improving Prediction of Human Heart Disease Using
https://www.hindawi.com/journals/misy/2022/1410169/.
21. S. Mohan, C. Thirumalai, and G. Srivastava, "Effective Heart Disease Prediction Using
Hybrid Machine Learning Techniques," in IEEE Access, vol. 7, pp. 81542-81554, 2019, doi:
10.1109/ACCESS.2019.2923707.
22. McClellan, Mark, et al. “Call to Action: Urgent Challenges in Cardiovascular Disease: A
Presidential Advisory From the American Heart Association.” AHA Journals, American
https://www.ahajournals.org/doi/full/10.1161/CIR.0000000000000652.
23. Pal M, Parija S, Panda G, Dhama K, Mohapatra RK. Risk prediction of cardiovascular
disease using machine learning classifiers. Open Med (Wars). 2022 Jun 17;17(1):1100-1113.
24. Finkelhor, Robert S, et al. “The St Segment/Heart Rate Slope as a Predictor of Coronary
https://www.sciencedirect.com/science/article/abs/pii/0002870386902656#:~:text=The%20S
T%20segment%20shift%20relative,coronary%20artery%20disease%20(CAD).
25. Fraj, Mohtadi Ben. “In Depth: Parameter Tuning for SVC.” Medium, All Things AI, 5 Jan.
2018, https://medium.com/all-things-ai/in-depth-parameter-tuning-for-svc-758215394769.