University of Sunderland Assignment Coversheet: Running Head: Heart Disease Prediction by Machine Learning Algorithm
UNIVERSITY OF SUNDERLAND
ASSIGNMENT COVERSHEET
Student ID: 219311425
Student Name/Names of all group members: Anamol Karki
Programme: BSc (Hons) Computer Systems Engineering
Module Code and Name: CET 313
Module Leader/Module Tutor: Himalayan kakshepati
Due Date: Apr 8
Hand in Date:
Mark:
I confirm that in submitting this assignment that I have read, understood and adhered to the University’s Rules and procedures
governing infringements of Assessment Regulations.
CET-351 Research-Project Plan
Anamol Karki
PRINT Student Name: ____________________________________________ Faculty Stamp (date/time)
CET-
Module Code and Name: _________________________________________
Contents
Introduction
Aim:
Objectives:
Prototype Development
Section 2: Development
DataSet Loading
Model Evaluation
Section 3
Data Analysis
Data pre-processing:
Training
SVM
Naïve Bayes
Logistic Regression
Decision Tree
Random Forest
LightGBM
XGBoost
Conclusion
References
Introduction
The goal of this project is to create a software-based prototype that can successfully identify
heart disease from a range of medical attributes. This study's mission statement, aim, and
objectives are as follows:
Mission Statement: The prototype's mission is to provide an automatic heart condition
recognition system that, given a patient's medical attributes, aids medical practitioners in
their decision-making and diagnosis.
Aim:
Development of an automatic heart condition recognition prototype using machine learning
algorithms.
Objectives:
1. To gather a suitable dataset for training the prototype's machine learning models for heart
condition detection.
2. To select appropriate machine learning classifiers for the task so that sufficient model
performance can be attained for use in a real-world medical scenario.
3. To assess the performance of the models in order to identify the optimal model for the job.
4. To suggest ways of using the best-fit model(s) for prediction, and to note their limitations.
5. To critically evaluate the artifact or product using cybersecurity approaches and accepted
procedures, assessing the work's limitations and strengths.
The main goal of this project is to determine whether a patient's medical characteristics, such as
gender, age, chest pain, fasting blood sugar level, and so on, indicate that they are likely to be
diagnosed with cardiovascular disease. A dataset containing patients' medical history
and attributes is taken from the UCI repository. Using this dataset, we can predict
whether or not a patient is likely to have heart disease (Jindal et al., 2021). Heart disease is
one of the leading causes of sickness and mortality among the world's population, and
cardiovascular disease prediction is one of the most important topics in clinical data analysis.
The healthcare industry holds a massive amount of data, and data mining converts this raw
healthcare data into information that can be used to make better decisions and forecasts
(Rawat, 2019). As a result, a machine learning solution is an efficient and effective way to
treat patients better while also reducing the workload of medical personnel. To detect the likely
cardiac health state, machine learning algorithms require only a few relevant patient traits.
The approach, moreover, complies with the HIPAA checklist for ensuring the
security of medical data. Diabetes, obesity, a poor diet, being overweight, excessive
alcohol consumption, and physical inactivity are all key risk factors for heart disease.
Prototype Development
Section 2: Development
The prototype that has been identified and created is a machine learning solution for detecting
heart disease from medical parameters. The complete classification system
for heart disease prediction is built in Python using the Jupyter Notebook platform or
Google Colab. For this purpose, several machine learning models are constructed, and their
performance is evaluated on a small portion of the data collected from the Kaggle repository.
The target attribute for heart disease prediction has two labels, healthy and unhealthy,
which are encoded as 0 and 1 respectively. As a result, this is a classification
problem, and the models that need to be built are classification models. Multiple libraries
are used in this project for dataset loading, pre-processing, visualization, and, most
importantly, model fitting and evaluation. The stages of the overall technique are: dataset
loading; dataset exploration and pre-processing; splitting the feature and response variables
into train and test sets; fitting the chosen machine learning classifiers; and evaluating the
classifiers on the test set to compare their performance. These procedures, as implemented in
a Jupyter notebook, are documented in detail in the following sections.
NumPy is used for pre-processing and manipulating arrays, as shown in the above screenshot,
while the Pandas library is used mainly to load the data collected in CSV format from the Kaggle
source. The train_test_split function is imported to split the original data into training and
test data, LogisticRegression is imported from sklearn.linear_model, and finally accuracy_score
is imported; this accuracy score is used to evaluate how well the model is performing.
DataSet Loading
The head function prints the first five rows of the DataFrame, where the target column holds the
value to be predicted, either 0 or 1. As shown in the screenshot below, the last five rows of the
dataset can be printed as well: the head function returns the first five rows, whereas the tail
function returns the last five rows and shows their values.
As shown in the screenshot above, we can also check how many rows and columns there are in
the dataset.
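The loading and inspection steps described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the real CSV file is not reproduced here, so a few made-up rows stand in for it, and the column selection (four columns instead of fourteen) and all values are assumptions.

```python
import io

import pandas as pd

# Stand-in for the Kaggle CSV file; these rows are illustrative, not real patient data.
csv_text = """age,sex,chol,target
63,1,233,1
37,1,250,1
41,0,204,1
56,1,236,1
57,0,354,1
57,1,192,0
67,1,286,0
62,0,268,0
"""

# With a real file, a path would be passed to read_csv instead of a StringIO buffer.
heart_data = pd.read_csv(io.StringIO(csv_text))

first_rows = heart_data.head()     # first five rows
last_rows = heart_data.tail()      # last five rows
n_rows, n_cols = heart_data.shape  # number of rows and columns

print(first_rows)
print(last_rows)
print(n_rows, n_cols)
```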
Datatypes of attributes and missing value exploration
The output in the screenshot above shows the total number of entries and columns.
We can also check for missing values, which could otherwise cause problems when predicting
from this dataset. All of the attributes imported into the DataFrame are
numeric, including the class variable, which is shown as an integer type in the above outputs.
This is because the class labels are coded as distinct integers, so a numeric-to-nominal
translation is needed when interpreting them. There are no missing values in any of the
attributes, so no attribute filtering or imputation is required.
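A check of data types and missing values along these lines might look like the sketch below. The small DataFrame is a made-up stand-in for the real dataset, and the three column names are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative frame; the values are made up and only three columns are shown.
heart_data = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "target": [1, 1, 0, 0],
})

column_types = heart_data.dtypes                # data type of every column
missing_per_column = heart_data.isnull().sum()  # number of missing cells per column

print(column_types)
print(missing_per_column)
```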
As shown in the screenshots, this function provides statistical measures for all columns. The
count is the number of data points in each column, the mean is the average value of each
column, and the standard deviation measures how widely the values in each column are spread.
The minimum and maximum values of each column are also reported, together with percentiles,
each of which gives the value below which that share of the data falls.
As you can see, 165 people (data points) have a diseased or defective heart, whereas
138 people do not have the disease. The two classes should contain an almost equal number of
data points, and this distribution is reasonably close to balanced.
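The class balance can be checked with a short sketch like the one below. Since the dataset itself is not included here, the target column is rebuilt from the two counts reported above; the variable names are assumptions.

```python
import pandas as pd

# Rebuild a target column from the class counts reported in the text:
# 165 data points with heart disease (1) and 138 without (0).
target = pd.Series([1] * 165 + [0] * 138, name="target")

class_counts = target.value_counts()                # absolute count per class
class_share = target.value_counts(normalize=True)   # fraction of the data per class

print(class_counts)
print(class_share.round(3))
```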
Here, 1 represents an unhealthy patient and 0 represents a healthy patient. To analyze the data,
we now split it into features and target. The target is the prediction of whether or not the
person has a heart defect, so it is either zero or one, and the column that holds it is known as
the target. All of the other columns are known as features, because we use them to
predict the target.
To obtain the features, we removed the target column. As you can see in the screenshots above,
there were previously 14 columns and now only 13 are left.
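Separating the features from the target can be sketched as below. The DataFrame is a made-up stand-in with three feature columns instead of thirteen; the names X and Y follow the text, while the column names and values are assumptions.

```python
import pandas as pd

# Illustrative frame: three feature columns plus the target column.
heart_data = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "sex": [1, 1, 0, 1],
    "chol": [233, 250, 204, 236],
    "target": [1, 1, 0, 0],
})

X = heart_data.drop(columns="target")  # features: every column except the target
Y = heart_data["target"]               # target: the column to be predicted

print(heart_data.shape, X.shape, Y.shape)
```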
Now that the target and features have been split successfully, X and Y need to be fed to the
machine learning algorithm. Before that, we need to split the data into training data and
testing data.
X_train contains the features of all the training data, X_test contains the
features of all the test data, Y_train contains the targets for the rows in X_train, and
Y_test contains the targets for the rows in X_test. The test_size parameter specifies what
percentage of the data is held out as test data. When stratify is set, the two classes, zero
and one, are distributed through the training data and test data in the same proportion as in
the original data. The random state makes the split reproducible, so that the data is split in
the same way every time. As you can see, we have successfully split the data, and we are now
going to train our machine learning model.
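A stratified train-test split of this kind can be sketched with scikit-learn as follows. The data is synthetic, and the test fraction of 0.2 and the random_state value are assumed for illustration; the text does not state the values used in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 3 features, and a perfectly balanced 0/1 target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = np.array([0, 1] * 50)

X_train, X_test, Y_train, Y_test = train_test_split(
    X,
    Y,
    test_size=0.2,    # assumed: hold out 20% of the rows as test data
    stratify=Y,       # keep the 0/1 proportions the same in both parts
    random_state=2,   # assumed seed, so the split is reproducible
)

print(X.shape, X_train.shape, X_test.shape)
```

Because of stratify, both parts keep the same half-and-half class mix as the full synthetic target.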
Fitting the model finds the relationship between the features present in X_train and the
corresponding target; it looks at age, sex, and the other parameters and their particular values.
Model Evaluation
We obtain about 85% accuracy on the training data and 81% on the test data. As you can see, the
two scores are close, which suggests that the model is not severely overfitted; a training score
far above the test score would have indicated overfitting. Because the dataset is very small,
these estimates remain uncertain, and an approach that generalizes well should still be
preferred. In the last stage of the machine learning project, we build a predictive system that
predicts whether or not the patient has a heart defect.
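Training a logistic regression model and comparing training and test accuracy can be sketched as below. The data is synthetic (generated with make_classification), so the accuracies it prints will not match the figures reported above; the sample count of 303 mirrors the class counts given earlier, while the split parameters and max_iter are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 303 rows and 13 features, not the real heart dataset.
X, Y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=2
)

model = LogisticRegression(max_iter=1000)  # max_iter raised to avoid convergence warnings
model.fit(X_train, Y_train)

# Accuracy on data the model has seen versus data it has not seen.
train_accuracy = accuracy_score(Y_train, model.predict(X_train))
test_accuracy = accuracy_score(Y_test, model.predict(X_test))

print("Training accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
```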
Before using the predictive system, the input needs some processing: it is reshaped so that the
machine learning model knows it is predicting the target for a single instance rather than for
a whole dataset. When values are given for the input data columns, the model can then predict
whether or not the person has a heart defect.
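The reshaping step for a single prediction can be sketched as follows. The tiny training set, its two features, and the input values are all made up for illustration; the real model takes thirteen feature values per patient.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set with two features per row (illustration only).
X_train = np.array([[40, 120], [45, 130], [60, 180], [65, 190]], dtype=float)
Y_train = np.array([0, 0, 1, 1])
model = LogisticRegression(max_iter=1000).fit(X_train, Y_train)

# One new patient, given as a plain tuple of feature values.
input_data = (62, 185)
input_array = np.asarray(input_data, dtype=float)

# The model expects a 2-D array of shape (n_samples, n_features),
# so the single instance is reshaped to one row.
input_reshaped = input_array.reshape(1, -1)

prediction = model.predict(input_reshaped)
if prediction[0] == 0:
    print("The person does not have a heart defect")
else:
    print("The person has a heart defect")
```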
Section 3
The evaluation results of the various models in this project are displayed and compared
in this part in order to determine the optimal model for heart disease prediction. The previously
mentioned evaluation metrics are presented in a classification report. Anaconda Navigator is
installed and used because it comes with built-in Python packages, is user-friendly, and is
simple to set up. It includes Jupyter Notebook and Spyder. Spyder with TensorFlow is used
here, and the artificial neural network application is written in Python. Spyder is an
integrated development environment that makes the debugging process go more smoothly. Both
Jupyter and Colab notebooks were used in the process of developing the model. The dataset
includes data on more than 300 individuals, but we used only fourteen of its columns, as
mentioned below:
Although this database has 76 attributes, all published studies focus on a subset of 14 of them.
To date, machine learning researchers have used only this specific database. One of the key
tasks on this dataset is to predict whether or not a patient has heart disease based on the
patient's given attributes; another is to explore the dataset and find insights that could
help in better understanding the problem.
Data Analysis
Let's take a look at the ages of those who are or are not affected by the disease.
Target = 1 indicates that the person has heart disease, while target = 0 indicates that the person
does not have heart disease.
We can see that most of those who are suffering are aged 57 or 58.
The condition primarily affects adults aged 50 and over.
Let's have a look at the age and gender breakdown for each target class.
Data pre-processing:
There are fourteen columns and more than 300 rows in the dataset. Let's have a look at the null
values.
There are only six cells with null values, four of which belong to the attribute ca and two to
the attribute thal. Because the number of null values is so small, we can either ignore them or
impute them. The data was also split into two sets, train and test, in order to measure accuracy.
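The two options for the null cells, dropping the affected rows or imputing them, can be sketched as below. The DataFrame is a made-up stand-in with one null in ca and one in thal, and filling with the most frequent value (the mode) is an assumed choice; the text does not say which imputation was used.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value in ca and one in thal.
heart_data = pd.DataFrame({
    "age": [63, 67, 41, 56, 57],
    "ca": [0.0, np.nan, 0.0, 1.0, 2.0],
    "thal": [6.0, 3.0, np.nan, 3.0, 7.0],
    "target": [0, 1, 0, 0, 1],
})

null_counts = heart_data.isnull().sum()  # missing cells per column

# Option 1: drop every row that contains a null value.
dropped = heart_data.dropna()

# Option 2 (assumed strategy): fill each null with the column's most frequent value.
imputed = heart_data.fillna(heart_data.mode().iloc[0])

print(null_counts)
print(dropped.shape, imputed.shape)
```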
Training
All of the models considered are trained to obtain the results. The confusion matrix is used as
an assessment metric.
This matrix shows how many values a classifier predicted correctly and how many it predicted
incorrectly, which helps us understand its behavior on the data. The number of items correctly
identified by the classifier is equal to the sum of TP (true positives) and TN (true negatives)
from the confusion matrix.
SVM
SVM training set accuracy = ((124+100)/(5+13+124+100))*100 = 92.56 percent. SVM accuracy
for the test set was 80.32 percent. Let's take a look at the confusion matrices for each
classifier as well.
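The accuracy calculation from confusion matrix counts can be written as a small helper, sketched below with the SVM training counts from the expression above. The function name is hypothetical, and the text does not say which of the two correct counts is TP and which is TN (or which error count is FP and which is FN), so the assignment here is an assumption; it does not affect the accuracy.

```python
def accuracy_percent(tp, tn, fp, fn):
    """Accuracy as a percentage: correctly classified items over all items."""
    total = tp + tn + fp + fn
    return 100.0 * (tp + tn) / total


# SVM training-set counts from the text; the TP/TN and FP/FN labels are assumed.
svm_train_accuracy = accuracy_percent(tp=124, tn=100, fp=5, fn=13)

print(f"SVM training accuracy: {svm_train_accuracy:.2f} percent")
```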
Naïve Bayes
Logistic Regression
Decision Tree
Random Forest
LightGBM
XGBoost
To summarize, here are the accuracies of all of the classifiers at once.
We can observe that Logistic Regression and SVM have the highest accuracy on the test set,
at 80.32 percent. Decision Tree achieves the highest accuracy on the training
set, at 100 percent. The algorithms are implemented using only their default parameters.
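A comparison loop over several classifiers with default parameters might look like the sketch below. It runs on synthetic data, so its scores will differ from those reported here. LightGBM and XGBoost are left out because they need separate packages, GaussianNB is an assumed choice of Naïve Bayes variant, and random_state is set only to make the sketch reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 303 rows and 13 features, not the real heart dataset.
X, Y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=0
)

classifiers = {
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, Y_train)
    test_predictions = clf.predict(X_test)
    results[name] = {
        "train": accuracy_score(Y_train, clf.predict(X_train)),
        "test": accuracy_score(Y_test, test_predictions),
        "confusion": confusion_matrix(Y_test, test_predictions),
    }

for name, scores in results.items():
    print(f"{name}: train={scores['train']:.4f}, test={scores['test']:.4f}")
```

An unpruned decision tree memorizes the training rows, which is why a 100 percent training accuracy for that model says little about how well it generalizes.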
Conclusion
In summary, this report has described a prototype for heart disease prediction that applies
machine learning classifier models to numeric patient characteristics. The performance of those
models is tested on a small subset of the acquired data known as the test set, and it is found
that random forest outperforms the others in all respects. After the prototype models are
initialized with default settings, the model parameters are adjusted repeatedly
until the metric scores no longer improve much, and the prototype as a whole is
developed from its previous version.

The first section describes using Python to predict heart
disease from the provided attributes. Python is both an object-oriented and a high-level
programming language, with short development cycles and dynamic building
options. This language aids in the correct prediction of heart disease. This study
predicts which people are likely to have heart disease by extracting the relevant medical
history from a dataset that contains patients' medical attributes such as chest
pain, sugar level, blood pressure, and other factors. This Heart Disease Detection System
aids a patient based on clinical data from previous heart disease diagnoses. Logistic regression
and Random Forest Classifier are the algorithms used to create the given model (Jindal et al.,
2021). The models are chosen based on the
theory of machine learning models as used by earlier researchers to develop classification
models for various applications, and the models' outcomes are generally adequate. The parameters
to adjust are also chosen based on what is known to have a substantial influence on
performance. All of the generated models, including pre-processing and feature selection,
run in a computationally feasible time, allowing them to be used in real-world systems to
identify heart disease.

Although the models' performance is impressive, especially for random
forest, they may not be fully optimized, because each of them has a wide parameter space and it
is impossible to tune all of the parameters by brute force. As a result, a future focus of this
research might be to develop an intelligent algorithm for optimizing models with large parameter
spaces, although the results are unlikely to improve considerably. Furthermore, because the
dataset on which the models are built has a small sample size, sampling error might be severe
when predicting for large volumes of data with unknown labels. Manually calculating the chances
of developing heart disease based on risk factors is difficult. Machine learning approaches, on
the other hand, can be used to predict the outcome from existing data.
References
Akhand Pratap Singh, D. B. (2020). A Review on Heart Disease Prediction using. Journal of
Xi'an University of Architecture & Technology, 4123-4136.
Harshit Jindal, S. A. (2021). Heart disease prediction using machine learning. IOP Conference
Series: Materials Science and Engineering, 1022.
Harshit Jindal, S. A. (2072). Heart disease prediction using machine learning algorithms. IOP
Conference Series: Materials Science and Engineering, 1-10.
Rawat, S. (2019, Aug 19). Towards Data Science. Retrieved from Towards Data Science:
https://towardsdatascience.com/heart-disease-prediction-73468d630cfc
victorchang, v. h. (2022). An artificial intelligence model for heart disease detection using
machine learning algorithms. Elsevier Healthcare Analytics, 100016.
Yuvaraj, R. S. (2019). Artificial Intelligence Model for Earlier Prediction of. Journal of Physics:
Conference Series, 1-16.