University of Sunderland Assignment Coversheet: Running Head: Heart Disease Prediction by Machine Learning Algorithm


Running Head: Heart disease prediction by Machine learning Algorithm

UNIVERSITY OF SUNDERLAND

ASSIGNMENT COVERSHEET
Student ID: 219311425
Student Name/Names of all group members: Anamol Karki
Programme: BSc (Hons) Computer Systems Engineering
Module Code and Name: CET 313
Module Leader/Module Tutor: Himalayan kakshepati
Due Date: Apr 8
Hand in Date:

Assessment Title: Heart disease prediction using machine learning

Learning Outcomes Assessed: (number as appropriate)

Mark

Areas for Commendation

Areas for Improvement


General Comments

Assessor Signature: Anamol
Overall mark (subject to ratification by the assessment board):
Moderator Signature:

……………………………………………………………………
…………………………..
I confirm that in submitting this assignment I have read, understood and adhered to the University’s Rules and procedures governing infringements of Assessment Regulations.
CET-351 Research-Project Plan

Anamol Karki
PRINT Student Name: ____________________________________________ Faculty Stamp (date/time)

Student Signature : ______________________________________________

CET-
Module Code and Name: _________________________________________

Name of Module Tutor : __________________________________________


CET 313
Anamol Karki
219311425
BSc (Hons) Computer Systems Engineering
ISMT COLLEGE, TINKUNE GAIRIGAUN

Contents
Introduction
Aim:
Objectives:
Section 1: Prototype Identification and Planning
Section 1.1 Literature Review on Prototype Identification
Section 1.2 Reflection of the Prototype Identification
Prototype Development
Section 2: Development
Used General Purpose Library
DataSet Loading
Datatypes of attributes and missing value exploration
Visual of Class Distribution:
Model Evaluation
Building Predictive System
Section 3
Data Analysis
Data pre-processing:
Training
SVM
Naïve Bayes
Logistic Regression
Decision Tree
Random Forest
LightGBM
XGBoost
Conclusion
References

Introduction
The goal of this project is to create a software-based prototype that can identify heart disease from a range of medical attributes. The project's mission statement, aim, and objectives are as follows:

Mission Statement: The prototype's mission is to provide an automatic heart condition recognition system that, given a patient's medical attributes, aids medical practitioners in decision-making and diagnosis.

Aim:

Development of an automatic heart condition recognition prototype using machine learning algorithms.

Objectives:

1. To gather a suitable dataset for training the prototype's machine learning models for heart condition detection.
2. To select appropriate machine learning classifiers for the task so that sufficient model performance can be attained for use in a real-world medical scenario.
3. To assess the performance of the models in order to identify the best model for the job.
4. To suggest ways of using the best-fit model(s) for prediction, and to describe their limitations.
5. To critically evaluate the artifact or product using cybersecurity approaches and accepted procedures, assessing the strengths and limitations of the work.

The main goal of this project is to determine whether a patient's medical characteristics, such as gender, age, chest pain, and fasting blood sugar level, indicate that the patient is likely to be diagnosed with cardiovascular disease. A dataset containing patients' medical history and attributes is taken from the UCI repository and used to predict whether or not a patient has heart disease (Harshit Jindal, 2072). Heart disease is one of the leading causes of sickness and mortality in the world's population, and cardiovascular disease prediction is one of the most important topics in clinical data analysis. The healthcare industry holds a massive amount of data, and data mining converts this raw healthcare data into information that can support better decisions and forecasts (Rawat, 2019). A machine learning solution is therefore an efficient and effective way to improve the treatment of these patients while reducing the workload of medical personnel. To detect the likely state of a patient's heart health, machine learning algorithms require only a set of relevant patient attributes. The handling of these attributes, on the other hand, should comply with the HIPAA checklist for ensuring the security of medical data. Diabetes, obesity and being overweight, a poor diet, excessive alcohol consumption, and physical inactivity are all key risk factors for heart disease.

Section 1: Prototype Identification and Planning

Section1: Prototype Identification and Planning


Section 1.1 Literature Review on Prototype Identification
Many studies have been conducted on disease prediction systems that apply machine learning algorithms in medical centers. Global technological progress has brought AI, digital technologies, machine learning, and data analysis to the classification of heart disease. In today's healthcare, artificial intelligence assists providers with both patient care and administrative tasks, and machine learning is the most prevalent form of artificial intelligence used to enhance the core capabilities of healthcare technologies. It is commonly used to gain a better understanding of predictive treatment methods based on a treatment and welfare framework.

One prediction system was designed and implemented using the learning vector quantization algorithm, which is an artificial neural network learning technique. The data was obtained from the University of California at Irvine's repository and contains 303 instances and 14 clinical attributes, on which the algorithm was trained. The front end was created with three panels: a data input panel, a ROC curve display, and a performance display. Sensitivity, accuracy, and specificity were also calculated, and the accuracy of this prediction technique was close to 80%.

Syed Umar Amin, Kavita Agarwal, and Rizwan Beg employed two techniques to predict cardiac disease: neural networks and genetic algorithms. The risk factors for heart disease that they consider include age, blood cholesterol, fitness, blood pressure, and stress. The data was taken from a database that holds these risk factors as attributes, and the dataset was divided into training data and testing data. The model was trained in a MATLAB GUI using a neural network and a genetic algorithm, and the resulting model has an accuracy of roughly 89 percent.

A Cascaded Neural Network, a deep learning method, was used to develop the Heart Attack Prediction System. Its dataset is derived from the UCI machine learning repository and consists of patients' medical records. It includes 76 attributes for 270 patients, of which only 13 were selected using a feature selection technique. Noise and many duplicate records were removed by filtering. Of the 270 records, 150 were used for training and 120 for testing. The Cascaded Neural Network then classifies each record into one of two categories: the patient has the disease or does not. This approach achieved an accuracy of roughly 84 percent (Yuvaraj, 2019).

Another approach, suggested by Chala Beyene, aims to predict the occurrence of heart disease within a short period of time, allowing for an automated early diagnosis. The strategy is particularly significant in health systems whose personnel lack experience and skill. It analyzes a variety of medical attributes, including blood sugar and heart rate as well as age and sex, to determine whether a person has heart disease, and the datasets are analyzed using WEKA software (Akhand Pratap Singh, 2020).

M. Raihan devised a simple method for predicting the likelihood of ischemic heart disease (IHD) using a smartphone. An Android-based prototype application was developed from clinical data acquired from IHD patients. Clinical data from 787 patients was analyzed and linked to risk factors such as high blood pressure, diabetes, high cholesterol, smoking, family history, obesity, depression, and present clinical symptoms that could indicate undiagnosed IHD. Data mining techniques were used to extract the data and calculate a score, and IHD risk was classified as low, medium, or high. For the patients whose data was collected to construct the scores, the authors found a significant relationship between the low, medium, and high risk grades and cardiac events (p=0.0001 and 0.0001). The method is intended to direct at-risk patients to cardiology care in order to avoid sudden deaths and to provide an easy way of recognizing the threat of IHD. Currently, constraints on available resources cause such tools to be underutilized by the population (Akhand Pratap Singh, 2020).

Artificial neural networks trained with backpropagation can also be used to build the model. The accuracy of this model is reported to increase as the number of hidden layers is increased. With TensorFlow, the model can be implemented in Anaconda Navigator. Although the MATLAB GUI is more interactive and easier to build, its packages must be installed manually, which is a time-consuming operation. Anaconda Navigator comes with built-in packages and is entirely based on Python code.

Section 1.2 Reflection of the Prototype Identification

Section:1.2 Reflection of the prototype Identification


Machine learning is used to predict heart disease by analyzing a dataset and developing several models to arrive at the paper's conclusion. The study gives a thorough account of the history of AI in healthcare and of the challenges of detecting and diagnosing heart disease. Machine learning is being used to build a better understanding of prediction metrics based on diverse frameworks for human health. Data classification is one of the best-known tasks for machine learning algorithms, and machine learning is frequently used to extract knowledge from business activity datasets and transfer it to larger databases. Most machine learning approaches rely on a large number of features to describe the algorithm's behavior, which directly or indirectly increases the model's complexity. To combine the heart disease diagnosis algorithms mentioned earlier, hybrid approaches are used in conjunction with logistic regression, naive Bayes, K-nearest neighbor, and neural networks. In this work, the system was trained and implemented on the Python platform using the benchmark dataset from the UCI machine learning repository (victorchang, 2022).

Cardiac diseases encompass coronary artery disease, arrhythmias, heart anomalies, and a wide range of other conditions, including disorders such as cardiomyopathy and heart infections. Chest pain, a symptom of cardiovascular illness, is the most common indicator of heart risk. The illness may also manifest as nausea, indigestion, heartburn, or stomach pain. The paper demonstrates how a program can be written in Python to determine whether or not a person has cardiovascular disease. Its approach is based on a dataset of fourteen test-outcome attributes collected from around 100 people. The diagnosis is encoded as a binary value: 1 (true) means that the patient has heart disease, and 0 (false) means that the patient does not have any kind of heart disease (victorchang, 2022). The report also includes a comprehensive analysis of the numerous health challenges that patients with heart disease and diabetes face. AI and machine learning are necessary for determining the scope of the proposed application prototype.

Prototype Development
Section 2: Development
The prototype that has been identified and created is a machine learning solution that predicts heart disease from medical parameters. The complete classification system for heart disease prediction is built in Python using the Jupyter Notebook platform or Google Colab. Several machine learning models are constructed for the same objective, and their performance is evaluated on a small portion of the data collected from the Kaggle repository. The target attribute for heart disease prediction has two labels, encoded as 1 (heart disease) and 0 (no heart disease). This is therefore a classification problem, and the models to be built are classification models. Multiple libraries are used in this project for dataset loading, pre-processing, visualization, and, most importantly, model fitting and evaluation. The overall procedure consists of the following stages: dataset loading, dataset exploration and pre-processing, splitting the feature and response variables into train and test sets, fitting the chosen machine learning classifiers, and evaluating the classifiers on the test set to compare their performance. These steps, as implemented in a Jupyter notebook, are documented in detail in the following sections.

Used General Purpose Library

As shown in the screenshot above, NumPy is used for pre-processing and manipulating arrays, while the Pandas library is used mainly to load the data, collected in CSV format from the Kaggle source, into a DataFrame. The train_test_split function is needed to split the original data into training and test data. Logistic regression is imported from sklearn.linear_model, and finally accuracy_score is imported, which is used to evaluate how well the model is performing.
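The imports described above can be sketched as follows. This is a minimal sketch that assumes scikit-learn is the library referred to; the aliases `np` and `pd` are the usual conventions rather than something stated in the report.

```python
# Assumed import list for the workflow described in this section.
import numpy as np                                     # array handling and reshaping
import pandas as pd                                    # loading the CSV into a DataFrame
from sklearn.model_selection import train_test_split   # train/test splitting
from sklearn.linear_model import LogisticRegression    # the classifier
from sklearn.metrics import accuracy_score             # evaluation metric

print(np.__name__, pd.__name__, LogisticRegression.__name__)
```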

DataSet Loading
Here the head function prints the first five rows of the DataFrame, in which the target value of each row is either 0 or 1. As shown in the screenshot below, the last five rows of the dataset can be printed as well: the head function returns the first five rows, whereas the tail function returns the last five rows and shows their values.

As shown in the screenshot above, we can also check how many rows and columns there are in the dataset.
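The loading and inspection steps above might look like the sketch below. The column names follow the common naming of the heart disease CSV and are an assumption, as are the file name in the comment and the seven sample rows, which are illustrative values only and not output from the project.

```python
import io
import pandas as pd

# Illustrative rows only. The project itself would read the full file,
# e.g. pd.read_csv("heart.csv") (file name assumed).
csv_text = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
67,1,0,160,286,0,0,108,1,1.5,1,3,2,0
62,0,0,140,268,0,0,160,0,3.6,0,2,2,0
"""
heart_data = pd.read_csv(io.StringIO(csv_text))

print(heart_data.head())   # first five rows
print(heart_data.tail())   # last five rows
print(heart_data.shape)    # (number of rows, number of columns)
```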
Datatypes of attributes and missing value exploration

The total number of entries and columns can be read from the output in the screenshot above.

We can also check for missing values, which could otherwise cause problems when predicting from the dataset. All of the attributes imported into the DataFrame are numeric, including the class variable, which is shown as an integer type in the above outputs. This is because the class labels are coded as integers, so a numeric-to-nominal translation is needed to interpret them. There are no missing values in any of the attributes, so no attribute filtering or imputation is required.
As shown in the screenshots, this function provides statistical measures for every column. The count is the number of data points in the column, the mean is the average value, and the standard deviation measures the spread of the values. The minimum and maximum values and the 25th, 50th, and 75th percentiles are also given for each column.
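The exploration described above corresponds to three pandas calls, sketched here on a small made-up DataFrame. The three columns and their random values are hypothetical stand-ins for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the heart disease DataFrame (values are made up).
rng = np.random.default_rng(0)
heart_data = pd.DataFrame({
    "age": rng.integers(29, 78, size=20),
    "chol": rng.integers(126, 565, size=20),
    "target": rng.integers(0, 2, size=20),
})

heart_data.info()                               # entries, columns, and dtypes
missing_per_column = heart_data.isnull().sum()  # missing values per column
print(missing_per_column)
summary = heart_data.describe()                 # count, mean, std, min, quartiles, max
print(summary)
```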

Visual of Class Distribution:

As you can see, 165 people (data points) have a diseased or defective heart, whereas 138 people do not have the disease. Ideally the two classes should contain an almost equal number of data points, and here the distribution is reasonably balanced.
A value of 1 represents an unhealthy patient and 0 represents a healthy patient. To analyze the data we now split it into features and target. The target indicates whether the person has a heart defect or not, so it is either zero or one, and the column that holds it is called the target. All the other columns are called features, because they are used to predict the target.
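The class distribution can be checked with `value_counts`. The sketch below rebuilds only a target column with the two class sizes quoted above; the column name `target` is an assumption.

```python
import pandas as pd

# Rebuild just the target column using the class sizes reported in the text.
target = pd.Series([1] * 165 + [0] * 138, name="target")

class_counts = target.value_counts()               # absolute counts per class
class_share = target.value_counts(normalize=True)  # fraction of rows per class
print(class_counts)
print(class_share.round(3))
```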

To obtain the features, the target column is removed from the data. As the screenshot above shows, the dataset previously had 14 columns and now only 13 are left.
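A minimal sketch of separating features and target, assuming the 14-column layout and column names commonly used for this dataset. The random 0/1 values are placeholders, not real patient data.

```python
import numpy as np
import pandas as pd

columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Placeholder data with the same 14-column layout (values are made up).
rng = np.random.default_rng(1)
heart_data = pd.DataFrame(rng.integers(0, 2, size=(10, 14)), columns=columns)

X = heart_data.drop(columns="target")  # 13 feature columns
Y = heart_data["target"]               # the column to predict
print(heart_data.shape, X.shape, Y.shape)
```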
Now that the target and features have been separated, X and Y can be fed to the machine learning algorithm. Before that, the data must be split into training data and testing data.

X_train contains the features of the training data and X_test contains the features of the test data, while Y_train contains the target values for the rows in X_train and Y_test contains those for the rows in X_test. The test_size parameter specifies what proportion of the data is held out for testing. The stratify parameter ensures that the two classes, zero and one, are distributed in the training data and test data in the same proportions as in the original data. The random_state parameter makes the split reproducible, so the data is split in the same way on every run. As you can see, the data has been split successfully, and we can now train the machine learning model.
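The split described above can be sketched with scikit-learn's `train_test_split`. The values `test_size=0.2` and `random_state=2` are assumptions, since the report does not state them, and the feature matrix is random placeholder data with the reported class sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features; the target reuses the reported 165/138 class sizes.
rng = np.random.default_rng(2)
X = rng.normal(size=(303, 13))
Y = np.array([1] * 165 + [0] * 138)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y,
    test_size=0.2,     # assumed: hold out 20% of the rows for testing
    stratify=Y,        # keep the 0/1 ratio the same in both parts
    random_state=2,    # assumed seed, makes the split reproducible
)
print(X.shape, X_train.shape, X_test.shape)
```

With 303 rows, a 20% test share yields 242 training rows and 61 test rows.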
Training finds the relationship between the features in X_train and the corresponding target values; the model examines age, sex, and the other parameters and the values they take.
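Fitting the model might look like the sketch below, which assumes logistic regression as the classifier. The training data is synthetic, with a target that depends on the first two features, and `max_iter=1000` is an added assumption to make convergence likely.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data: the label depends on the first two features.
rng = np.random.default_rng(3)
X_train = rng.normal(size=(200, 13))
Y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)   # learn one weight per feature

print(model.coef_.shape)      # one row of 13 coefficients
```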

Model Evaluation

We obtain 85% accuracy on the training data and 81% on the test data. The two scores are close, which suggests that the model is not badly overfitted; a training accuracy far above the test accuracy would be the sign of overfitting. Because the dataset is very small, however, these estimates are uncertain, so a learning approach that generalizes well should be preferred. As the last stage of the machine learning project, we build a predictive system that predicts whether or not a patient has a heart defect.
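Comparing training and test accuracy can be sketched as follows. The data is synthetic, so the printed scores are not the project's results; the point is the comparison itself, where a large positive gap between training and test accuracy is the usual sign of overfitting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, noisy data standing in for the heart disease dataset.
rng = np.random.default_rng(4)
X = rng.normal(size=(303, 13))
Y = (X[:, 0] + X[:, 1] + rng.normal(size=303) > 0).astype(int)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=2)

model = LogisticRegression(max_iter=1000).fit(X_train, Y_train)

train_accuracy = accuracy_score(Y_train, model.predict(X_train))
test_accuracy = accuracy_score(Y_test, model.predict(X_test))
gap = train_accuracy - test_accuracy   # large positive gap hints at overfitting

print(f"train={train_accuracy:.2f} test={test_accuracy:.2f} gap={gap:.2f}")
```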

Building Predictive System

Before building the predictive system, the input must be processed: the input values are converted to an array and reshaped so that the machine learning model knows it is predicting the target for only one instance. When the values of the feature columns are supplied as input data, the model can then predict whether or not the person has a heart defect.
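The reshaping step can be sketched as below. The model is trained on synthetic data and the 13 input values are arbitrary placeholders, so the prediction itself carries no medical meaning.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train a stand-in model on synthetic data with 13 features.
rng = np.random.default_rng(5)
X_train = rng.normal(size=(200, 13))
Y_train = (X_train[:, 0] > 0).astype(int)
model = LogisticRegression(max_iter=1000).fit(X_train, Y_train)

# One record with 13 feature values (placeholder numbers).
input_data = (0.5, -1.2, 0.3, 0.0, 1.1, -0.4, 0.2, 0.9, -0.7, 0.1, 0.0, -0.3, 0.6)

input_array = np.asarray(input_data)         # tuple -> 1-D array of length 13
input_reshaped = input_array.reshape(1, -1)  # 1 row, 13 columns: a single instance

prediction = model.predict(input_reshaped)
if prediction[0] == 0:
    print("The person does not have heart disease")
else:
    print("The person has heart disease")
```

Without the reshape, scikit-learn would reject the 1-D array, because `predict` expects a 2-D input of shape (instances, features).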

Section 3

3.1 Report on the Evaluation

The evaluation results of the models in the project are presented and compared in this part in order to determine the best model for heart disease prediction. The evaluation metrics mentioned previously are presented in a classification report. Anaconda Navigator is installed and used because it comes with built-in Python packages, is user-friendly, and is simple to set up. It includes Jupyter Notebook and Spyder. Spyder with TensorFlow is used here, and the artificial neural network application is written in Python. Spyder is a development environment whose tools make the debugging process go more smoothly. Both Jupyter and Google Colab notebooks were used in developing the model. The dataset contains the data of roughly 300 individuals, of which only the following fourteen columns are used:

1. Age: the person's age.
2. Sex: the person's gender, where 1 = male and 0 = female.
3. Chest pain type: the type of chest pain the patient experiences.
4. Serum cholesterol: the serum cholesterol level in milligrams per deciliter (mg/dl).
5. Blood pressure: the blood pressure value in mmHg.
6. Fasting blood sugar: whether the fasting blood sugar exceeds 120 mg/dl (1 = true, 0 = false).
7. Resting ECG: the resting electrocardiographic result, where 0 = normal, 1 = ST-T wave abnormality, and 2 = left ventricular hypertrophy.
8. Maximum heart rate attained: the highest heart rate attained by the individual.
9. Exercise-induced angina: 1 = yes, 0 = no.
10. Exercise-induced ST depression: the value, which can be an integer or a float.
11. Slope of the peak exercise ST segment: 1 = upsloping, 2 = flat, 3 = downsloping.
12. Number of major vessels (0–3) colored by fluoroscopy: the value as an integer or float.
13. Thal: the thalassemia result, where 3 = normal, 6 = fixed defect, and 7 = reversible defect.
14. Diagnosis of heart disease: whether or not the individual has heart disease, where 0 indicates absence and 1, 2, 3, or 4 indicates presence.
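Two of the encodings in the list above are simple threshold rules, sketched here on a few made-up rows. The column names `fbs_mg_dl` and `num` are hypothetical labels for the raw fasting blood sugar reading and the 0 to 4 diagnosis value.

```python
import pandas as pd

# Made-up raw values: fasting blood sugar in mg/dl and a 0-4 diagnosis code.
raw = pd.DataFrame({
    "fbs_mg_dl": [98, 135, 120, 180],
    "num": [0, 2, 0, 4],
})

# Item 6: 1 if fasting blood sugar is above 120 mg/dl, else 0.
raw["fbs"] = (raw["fbs_mg_dl"] > 120).astype(int)

# Item 14: 0 means absence; 1, 2, 3, or 4 all mean presence of heart disease.
raw["target"] = (raw["num"] > 0).astype(int)

print(raw)
```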

Although this database has 76 attributes, all published studies use a subset of 14 of them. To date, machine learning researchers have used only this specific database. One of the key tasks on this dataset is to predict, from a patient's attributes, whether or not the patient has heart disease; another is to explore the dataset for insights that could help in better understanding the problem.

Data Analysis

Let's take a look at the ages of those who are or are not affected by the disease.

Target = 1 indicates that the person has heart disease, while target = 0 indicates that the person
does not have heart disease.

We can see that most of those who are affected are 57 or 58 years old.

The condition primarily affects adults aged 50 and over.

Let's have a look at the age and gender breakdown for each target class.
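A breakdown like this can be computed with `groupby` and `crosstab`. The eight rows below are made up for illustration, and the column names `age`, `sex`, and `target` are assumptions.

```python
import pandas as pd

# Made-up rows; sex: 1 = male, 0 = female; target: 1 = heart disease.
df = pd.DataFrame({
    "age":    [63, 37, 41, 56, 57, 67, 62, 58],
    "sex":    [1, 1, 0, 1, 0, 1, 0, 1],
    "target": [1, 1, 1, 1, 1, 0, 0, 0],
})

# Age statistics per target class.
age_by_target = df.groupby("target")["age"].agg(["count", "mean", "min", "max"])
print(age_by_target)

# Number of patients for each (sex, target) combination.
sex_by_target = pd.crosstab(df["sex"], df["target"])
print(sex_by_target)
```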
Data pre-processing:

The dataset has fourteen columns and more than 300 rows. Let's have a look at the null values.
There are only six cells with null values, four in the attribute ca and two in the attribute thal. Because the number of null values is so small, we can either drop the affected rows or impute the values. The data was also split into two sets, train and test, in order to measure accuracy.
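Both options, dropping and imputing, are sketched below on a tiny made-up frame. Filling with the most frequent value of each column is one possible imputation choice and is an assumption here, not necessarily the one used in the project.

```python
import numpy as np
import pandas as pd

# Made-up rows with a few missing values in ca and thal.
df = pd.DataFrame({
    "age":  [63, 67, 41, 56, 52, 58],
    "ca":   [0.0, 3.0, np.nan, 1.0, np.nan, 0.0],
    "thal": [6.0, 3.0, 3.0, np.nan, 7.0, 3.0],
})

null_counts = df.isnull().sum()   # missing cells per column
print(null_counts)

dropped = df.dropna()             # option 1: drop rows that contain a null

# Option 2: fill each column's nulls with that column's most frequent value.
imputed = df.fillna(df.mode().iloc[0])
print(imputed)
```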

Training

To obtain the findings, all of the models outlined above are used. The confusion matrix is used as
an assessment metric.

Pic: confusion matrix

This matrix shows how many values a classifier predicted correctly and how many it predicted incorrectly, which helps us to understand the results. The number of items correctly identified by the classifier is equal to the sum of TP (true positives) and TN (true negatives) in the confusion matrix.
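The relationship between the confusion matrix and accuracy can be sketched with scikit-learn on a toy set of labels. The eight true and predicted labels are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = heart disease, 0 = no heart disease.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# For binary labels 0/1, ravel() yields the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

correct = tp + tn                            # correctly identified items
accuracy = correct / (tp + tn + fp + fn)     # share of all items that are correct
print(tn, fp, fn, tp, accuracy)
```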

SVM
SVM training set accuracy = ((124+100)/(5+13+124+100))*100 = 92.56 percent. The SVM accuracy for the test set was 80.32 percent. Let's also take a look at the confusion matrices for each classifier.
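The training accuracy expression can be re-evaluated directly from the four confusion matrix counts quoted above. This sketch only repeats the arithmetic; which of the counts 5 and 13 is the false positives and which the false negatives does not affect the result.

```python
# Counts taken from the SVM training-set expression in the text.
correct = 124 + 100            # correctly classified training records
total = 5 + 13 + 124 + 100     # all training records

svm_train_accuracy = correct / total * 100
print(round(svm_train_accuracy, 2))  # 92.56
```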

Naïve Bayes

Logistic Regression

Decision Tree
Random Forest

LightGBM

XGBoost
To summarize, here are the accuracies of all the classifiers at once.

We can observe that Logistic Regression and SVM have the highest accuracy for the test set, 80.32 percent. Decision Tree achieves the highest accuracy for the training set, 100 percent, which combined with its lower test accuracy suggests overfitting. The algorithms are run with their default parameters only.
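A comparison of this kind can be sketched as a loop over classifiers. This is an illustration on synthetic data from `make_classification`, so the scores are not the project's results. LightGBM and XGBoost are left out because they need separate packages, and `max_iter` and `random_state` are the only non-default settings, added as assumptions for convergence and reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 303 rows and 13 features.
X, y = make_classification(n_samples=303, n_features=13, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

classifiers = {
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for each classifier.
    results[name] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

for name, (train_acc, test_acc) in results.items():
    print(f"{name:20s} train={train_acc:.3f} test={test_acc:.3f}")
```

An unpruned decision tree typically reaches 100 percent on its own training data, as reported above, which is why the test score is the more informative column.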

Conclusion
In conclusion, this report has described a prototype that predicts heart disease from numeric patient attributes using machine learning classifier models. The performance of the models is tested on a small subset of the acquired data, called the test set, on which Logistic Regression and SVM reach the highest accuracy, 80.32 percent. After the prototype models are initialized with default settings, the model parameters are adjusted repeatedly until the metric scores no longer improve much, and the prototype as a whole is developed from its previous version. The first section describes using Python to predict heart disease from the given attributes. Python is an object-oriented, high-level programming language with short development cycles and dynamic features, which supports accurate prediction of heart disease. This study predicts which people will develop heart disease by extracting, from a dataset of patients' medical history such as chest discomfort, sugar level, and blood pressure, the history that leads to a fatal heart illness. This Heart Disease Detection System assists a patient on the basis of clinical data from a previous heart disease diagnosis. Logistic regression and Random Forest Classifier are the algorithms used to create the given model (Harshit Jindal, Heart disease prediction using machine learning, 2021).

The models are chosen based on the theory of machine learning and on their use by earlier researchers in developing classification models for various applications, and the models' results are generally adequate. The parameters to adjust are also chosen based on what is known to have a substantial influence on performance. All of the generated models, including pre-processing and feature selection, run in a computationally feasible time, allowing them to be used in real-world systems to identify heart disease. Although the models perform well, they may not be fully optimized, because each of them has a wide parameter space and it is impossible to tune all of them by brute force. A future direction for this research might therefore be to develop an intelligent algorithm for optimizing models with large parameter spaces, although the results may not improve considerably. Furthermore, because the dataset on which the models are built has a small sample size, sampling error might be severe when predicting for large volumes of unlabeled data. Manually calculating the chance of developing heart disease from risk factors is difficult. Machine learning approaches, on the other hand, can be used to predict the outcome from existing data.

References
Akhand Pratap Singh, D. B. (2020). A Review on Heart Disease Prediction using. Journal of Xi'an University of Architecture & Technology, 4123-4136.

Harshit Jindal, S. A. (2021). Heart disease prediction using machine learning. IOP Conference Series: Materials Science and Engineering, 1022.

Harshit Jindal, S. A. (2072). Heart disease prediction using machine learning algorithms. IOP Conference Series: Materials Science and Engineering, 1-10.

Rawat, S. (2019, Aug 19). Towards Data Science. Retrieved from Towards Data Science: https://towardsdatascience.com/heart-disease-prediction-73468d630cfc

victorchang, v. h. (2022). An artificial intelligence model for heart disease detection using machine learning algorithms. Elsevier Healthcare Analytics, 100016.

Yuvaraj, R. S. (2019). Artificial Intelligence Model for Earlier Prediction of. Journal of Physics: Conference Series, 1-16.
