A Prediction of Heart Disease Using Machine Learning Algorithms

Mohd Faisal Ansari, Bhavya Alankar, and Harleen Kaur(✉)
Department of Computer Science and Engineering, School of Engineering
Sciences and Technology, Jamia Hamdard, New Delhi, India
faisalrehman.2903@gmail.com,
{balankar,harleen}@jamiahamdard.ac
Abstract. Heart disease is emerging as one of the deadliest diseases. According to a
report published by the World Health Organization (WHO), heart disease has been one of
the most hazardous diseases to humans, causing deaths all over the world for the last
20 years. Approximately 12 million people die of it every year, which makes early and
accurate diagnosis of heart disease one of the biggest challenges for medical
professionals. In this paper, we apply different machine learning algorithms and compare
their classification accuracies. We propose a modified algorithm that uses logistic
regression with principal component analysis to predict heart disease more accurately
from attributes such as age, blood pressure, chest pain, serum cholesterol level, heart
rate, and other characteristic attributes. Patients are classified according to varying
degrees of coronary artery disease.
Keywords: Machine learning · Heart disease prediction · Logistic regression ·
Support vector machine · Principal component analysis · Medical diagnosis
1 Introduction
In a world of rapidly advancing computer technology, artificial intelligence (machine
learning and deep learning), one of the major subfields of computer science, is used in
the medical field to predict whether heart disease is present. The predictions are based
on patients' medical records (image files or .csv files) extracted from medical
databases known as Electronic Health Records, using various algorithms [1, 17].
Heart disease is now among the deadliest diseases. According to a report published by
the World Health Organization (WHO), it has been one of the most hazardous diseases to
humans, causing deaths all over the world for the last 20 years. Millions of people
around the world suffer from heart disease, and approximately 12 million people die of
it every year. This makes early and accurate diagnosis of heart disease one of the
biggest challenges for medical professionals [2].
There are many traditional methods for predicting such illness, but they do not appear
sufficient: data mining algorithms do not predict heart disease as accurately as machine
learning algorithms do (support vector machine, logistic regression, naïve Bayes, random
forest, and decision tree). With data mining, problems arise from the very first step,
data extraction, in the form of incomplete data, missing values, and inconsistency, and
the predicted results are not very accurate. The medical industry needs a diagnosis
system that can predict heart disease at an early stage and offer a more accurate
diagnosis than traditional methods [3, 18].
© The Editor(s) (if applicable) and The Author(s), under exclusive license
to Springer Nature Switzerland AG 2021
J. I.-Z. Chen et al. (Eds.): ICIPCN 2020, AISC 1200, pp. 497–504, 2021.
https://doi.org/10.1007/978-3-030-51859-2_45
Following the success of machine learning algorithms in various real-world industries,
we have observed that machine learning can also be a promising, highly accurate solution
for medical diagnosis, and it can be seen as a key application in the healthcare
industry [4, 17].
In this paper, we apply machine learning algorithms and compare their accuracies to
determine which algorithm classifies most accurately. On this basis, we propose a
modified algorithm for predicting heart disease from attributes such as age, blood
pressure, chest pain, serum cholesterol level, heart rate, and other characteristic
attributes, and patients are classified according to varying degrees of coronary artery
disease. We use the UCI machine learning repository dataset of 304 patients, which
contains 304 rows and 14 columns.
2 Related Works
Many researchers are working continuously in the field of heart disease prediction to
achieve better accuracy with various algorithms [5]. According to the literature,
various techniques have been applied to large datasets for heart disease prediction in
order to find trends, patterns, and associations. A short literature review is presented
here. Recently, integrating more than one machine learning technique has been reported
to improve model performance in heart disease diagnosis.
Studies show that researchers use techniques such as data mining and machine learning
to identify the risk factors associated with heart disease. Statistical analysis has
identified these risk factors as age, blood pressure, cholesterol, smoking, high blood
pressure together with cholesterol levels, family history, obesity, physical inactivity,
and high stress. Sufficient information about these risk factors helps health care
professionals and doctors to identify patients at high risk of heart disease [6].
Several studies have also focused on the security issues that arise when data are
imported before mining. Specifically, they examine scenarios in which data mining
algorithms such as association rule mining and data clustering require privacy
safeguards [16]. Data mining is a promising approach to meeting this challenging
requirement [7].
Subsequently, machine learning emerged as a new research area for heart disease
diagnosis and prediction. Many authors have presented the concept of “heart disease
prediction using machine learning over data mining concepts” as a means of extracting
interesting patterns: machine learning algorithms focus on prediction based on a
training dataset with known factors, while data mining identifies the unknown
properties of the data [8]. Machine learning is based on identifying unique patterns in
data and extracting usable knowledge from them. Support vector machine and logistic
regression techniques have been applied to such data to improve prediction accuracy and
build an expert system [9].
Later, researchers proposed a hybrid machine learning model based on Decision Tree,
Support Vector Machine, and Naïve Bayes, showing how three different classifiers can be
combined through a majority voting scheme [10]. The scheme has two parts. In the first
part, the classifiers produce their decisions separately. In the second part, these
decisions are combined so that the output of the new model follows the majority vote.
Many research theses focus mainly on data modeling and classification using machine
learning techniques (supervised and unsupervised) from pattern recognition and data
mining [11].
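The two-part majority voting scheme described above can be sketched in Python as follows. This is only a minimal illustration, not the implementation of the cited work: the function names and the example predictions are hypothetical, and labels are assumed to be 0 (no disease) or 1 (disease).

```python
from collections import Counter


def majority_vote(votes):
    """Return the label chosen by most classifiers for one patient."""
    label, _ = Counter(votes).most_common(1)[0]
    return label


def combine_classifiers(per_classifier_predictions):
    """Combine the separate decisions of several classifiers by majority vote."""
    # zip(*...) regroups the lists so that each tuple holds the votes of
    # all classifiers for a single patient.
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]


# Part 1: each classifier produces its decisions separately.
# These outputs are made up for four hypothetical patients.
decision_tree = [1, 0, 1, 0]
svm = [1, 1, 0, 0]
naive_bayes = [0, 1, 1, 0]

# Part 2: the decisions are combined into one majority-vote prediction.
combined = combine_classifiers([decision_tree, svm, naive_bayes])
print(combined)
```

With three classifiers and two classes a tie cannot occur, which is one practical reason to combine an odd number of binary classifiers.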
3 Algorithms Used
This field was originally inspired by a cardiologist who sought to develop and test a
computational analogy of heart muscles. In the heart disease prediction system, disease
risk attributes are taken as input variables, and ‘disease existing’ and ‘disease
non-existing’ are the output variables.
3.1 Logistic Regression

Logistic regression is a classification method that assigns observations to a discrete
set of classes. It is a technique that machine learning has borrowed from the field of
statistics [13]. The logistic sigmoid function transforms the output values in logistic
regression and returns a probability. Logistic regression relies on a more intricate
function than a plain linear model, namely the “sigmoid function”. The algorithm does
not use the weighted sum of inputs directly; instead, the sum is passed through the
sigmoid function, which maps any real value into the range between 0 and 1.
The activation function is the sigmoid function $\frac{1}{1 + e^{-\text{value}}}$. The
logistic function takes any real-valued number and maps it, along an ‘S’-shaped curve,
to a value between 0 and 1. The coefficients of the logistic regression algorithm must
be estimated from the training data (see Fig. 1).
Fig. 1. Difference between graphs (Source: towardsdatascience.com)
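The sigmoid mapping described in this section can be illustrated with a short Python sketch. The helper names `sigmoid` and `predict_probability`, and the example weights, are assumptions made for illustration; the paper does not give an implementation.

```python
import math


def sigmoid(value):
    """Map a real number to a value between 0 and 1."""
    # Branch on the sign so that math.exp never receives a large positive
    # argument, which would overflow.
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    exp_value = math.exp(value)
    return exp_value / (1.0 + exp_value)


def predict_probability(features, weights, bias):
    """Pass the weighted sum of the inputs through the sigmoid."""
    weighted_sum = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(weighted_sum)


print(sigmoid(0.0))  # 0.5, the midpoint of the S-shaped curve
print(predict_probability([1.0, 2.0], [0.5, -0.25], 0.0))  # weighted sum is 0
```

A probability above a chosen threshold (0.5 is a common default) would be read as "disease existing", and a probability below it as "disease non-existing".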

Use of Logistic Regression to Solve the Problem

Logistic regression has been widely used in medical research to predict disease,
whether heart disease or cancer (e.g., classifying a tumor as malignant or benign),
with accurate results expressed in a 0–1 format. It may be used to predict the risk of
developing a given disease based on observed characteristics of the patient [14].
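As a rough sketch of how the coefficients might be estimated and then used to predict risk, the following Python code fits a logistic regression by batch gradient descent on the log-loss. Gradient descent, the learning rate, the number of epochs, and the toy data (a single standardized characteristic per patient) are all assumptions for illustration; the paper does not state which fitting procedure or library was used.

```python
import math


def sigmoid(value):
    """Logistic function; inputs stay small here, so no overflow guard."""
    return 1.0 / (1.0 + math.exp(-value))


def predict_risk(features, weights, bias):
    """Estimated probability that a patient belongs to class 1 (disease)."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def fit_logistic_regression(rows, labels, learning_rate=0.1, epochs=2000):
    """Estimate weights and bias by batch gradient descent on the log-loss."""
    n_samples, n_features = len(rows), len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            error = predict_risk(x, weights, bias) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        # Step against the average gradient.
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


# Made-up toy data: one standardized characteristic, label 1 = disease.
rows = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = fit_logistic_regression(rows, labels)
high_risk = predict_risk([2.0], weights, bias)
low_risk = predict_risk([-2.0], weights, bias)
print(f"risk at +2.0: {high_risk:.3f}, risk at -2.0: {low_risk:.3f}")
```

The output is a probability in the 0–1 format mentioned above, so it can be thresholded into a class label or reported directly as a risk estimate.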
3.2 Support Vector Machine (SVM)

The support vector machine (SVM) is a supervised machine learning technique that can be
applied to both classification and regression problems. SVMs are based on the concept
of decision planes, which define decision boundaries, as shown in Fig. 2. A decision
plane separates groups of objects with different class memberships. SVMs can solve both
linear and nonlinear problems and work well for many practical problems [15].
Fig. 2. Support vector machines (Source: mdpi.com)
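The idea of a decision plane separating two classes can be illustrated with scikit-learn's `SVC`. This is a sketch under stated assumptions: the paper does not say which SVM implementation, kernel, or parameters were used, and the points below are made up rather than taken from the heart disease data.

```python
from sklearn.svm import SVC

# Made-up, clearly separated 2-D points standing in for two patient groups
# (0 = no disease, 1 = disease).
X = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
     [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel yields a flat decision plane; kernel="rbf" would allow a
# nonlinear boundary instead.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# New points are classified by the side of the decision plane they fall on.
predictions = clf.predict([[1.2, 1.4], [6.8, 6.2]])
print(predictions)

# Only the support vectors, the training points closest to the boundary,
# determine where the plane lies.
n_support = len(clf.support_vectors_)
print(n_support)
```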
4 Proposed Model
Steps Involved for Modelling of Dataset [as shown in Fig. 3]:

1) Data Collection: In this phase, we fetch the data from the data source in the form
of a CSV file with a few lines of code and pass it as input to the proposed model.
[Figure: flowchart with the stages Data Collection; Data Processing (EDA):
Visualization, Cleaning, Feature Extraction, Feature Selection, Analysis; Applied
Algorithms: Logistic Regression, Support Vector Machines; Pattern Evaluation; Proposed
Algorithm: Logistic Regression with PCA; Results]
Fig. 3. Proposed Model
2) Data Pre-processing: This phase is also known as Exploratory Data Analysis. In this
phase, we handle missing values and reduce noise, and afterward feature selection is
carried out for training purposes.
3) Pattern Evaluation: In this phase, we identify interesting patterns that represent
knowledge based on some measures. In other words, performance analysis is performed to
evaluate the results.
4) Output/Results: In this phase, we extract the results produced in the analysis part.
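The proposed algorithm named in Fig. 3, logistic regression with PCA, might be assembled along the following lines with scikit-learn. This is a hedged sketch, not the authors' code: the data are synthetic (generated only to match the stated shape of 304 rows and 13 input features), and the scaling step, the 95% explained-variance threshold, and the 80/20 split are illustrative assumptions that the paper does not specify.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with an assumed shape of 304 rows x 13 features.
# These are NOT the UCI heart disease records.
X, y = make_classification(n_samples=304, n_features=13, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Scale the features, project them onto principal components, then classify.
# n_components=0.95 keeps as many components as are needed to explain 95% of
# the variance (an illustrative choice, not a value from the paper).
model = make_pipeline(StandardScaler(),
                      PCA(n_components=0.95),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
n_components = model.named_steps["pca"].n_components_
print(f"components kept: {n_components}, held-out accuracy: {accuracy:.2f}")
```

Because the data are synthetic, the printed accuracy says nothing about the results reported in this paper; the sketch only shows how the stages fit together.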
4.1 Data Source

The UCI machine learning repository heart disease dataset is widely used by machine
learning researchers for heart disease prediction systems. The dataset contains 304 rows
and 14 columns covering many medical indexes; the goal is to perform exploratory data
analysis on the status of heart disease. The ‘Num’ field, which indicates the heart
disease status, takes values in the range 0–1.
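The data collection and pre-processing steps for a CSV file of this kind could look roughly like the sketch below, which uses only the Python standard library. The inline sample is made up (it is not taken from the UCI dataset), the lowercase column names and the use of `?` as the missing-value marker are assumptions, and mean imputation is just one possible way of handling missing values.

```python
import csv
import io
import statistics

# Tiny made-up sample in CSV form; "?" marks a missing value (assumed).
SAMPLE_CSV = """age,trestbps,chol,num
60,140,240,1
45,130,?,0
58,?,300,1
50,120,210,0
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, turning '?' into None."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: (None if value == "?" else float(value))
             for key, value in row.items()}
            for row in reader]


def impute_with_mean(rows, column):
    """Replace missing values in one column with the mean of the known ones."""
    known = [row[column] for row in rows if row[column] is not None]
    mean = statistics.mean(known)
    for row in rows:
        if row[column] is None:
            row[column] = mean


rows = load_rows(SAMPLE_CSV)
for column in ("trestbps", "chol"):
    impute_with_mean(rows, column)

# Split the input attributes from the target field.
features = [[row["age"], row["trestbps"], row["chol"]] for row in rows]
target = [int(row["num"]) for row in rows]
print(features)
print(target)
```

For a real file, `io.StringIO(text)` would be replaced by an opened file handle.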
4.2 Performance Evaluation Criteria

Four metrics, accuracy, precision, recall, and F1-measure, are used to assess the
performance of the proposed models. With TP, TN, FP, and FN denoting the numbers of true
positives, true negatives, false positives, and false negatives, they are defined as
follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
$$\text{F1-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
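These formulas translate directly into code. The sketch below is illustrative: the function name and the confusion-matrix counts are hypothetical. Specificity, which appears in Table 1 but is not defined in the text, is computed with its standard definition TN / (TN + FP); that definition is an assumption about what the authors used.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute evaluation metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        # Standard definition; not given explicitly in the paper.
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for 100 test patients (not results from the paper).
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The function assumes that none of the denominators is zero; a production version would need to handle a model that never predicts the positive class.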
5 Results and Discussion
Investigation of the data indicated that Oldpeak, Thalach, cp (asymptomatic pain), CA
(>1), and Thal (reversible defect) are useful features for predicting the presence of
cardiac disease. Age, exang, slope, trestbps, chol, gender, FBS, and restecg were also
found to have potentially slight predictive power.
Attributes with Strong Predictive Power, as Concluded in the EDA Part: Oldpeak,
Thalach, cp, CA (>1), Thal (reversible defect)
i. Patients with an oldpeak value >1 are more likely to have heart disease than
patients with an oldpeak value <1.
ii. Patients with cardiac disease tend to have a lower heart rate (<140) than patients
without cardiac disease (>140).
iii. Patients who suffer from asymptomatic chest pain are very likely to have heart
disease.
Attributes with Average Predictive Power, as Concluded in the EDA Part: Age, Exang,
Slope
i. People aged over 35 are more likely to have heart disease.
ii. The disease cohort is more likely to have chest pain after exercise.
iii. The disease cohort is more likely to have a flat ST-wave slope than the
non-disease cohort.
Attributes with Poor Predictive Power, as Concluded in the EDA Part: Trestbps, Chol,
Gender, FBS, Restecg
Table 1. Experimental results

Models                                             Accuracy  Recall  Specificity  Precision  F1-score
Logistic (all attributes)                          87%       .68     .65          .75        .71
Logistic (most significant attributes)             82%       .68     .69          .77        .72
Logistic (removing least significant attributes)   83%       .94     .60          .78        .85
SVM                                                68%       .79     .58          .65        .71
Logistic (with PCA)                                86%       .68     .69          .77        .72
These attributes (Trestbps, Chol, Gender, FBS, Restecg) have no predictive power; they
cannot differentiate between the disease and non-disease groups of people.
6 Conclusion
In this paper, we proposed a new model after applying two existing models. The model is
evaluated on the UCI machine learning repository dataset, and the aim was to predict
whether a person has heart disease from attributes such as blood pressure, heartbeat,
exang, fbs, and others with better accuracy than other models. First, we trained
logistic regression with all attributes, then with the attributes of strong predictive
power identified in the EDA part, and finally after removing the least significant
attributes. Second, we applied support vector machines. We then proposed a model,
logistic regression with principal component analysis. The logistic regression model
with all attributes and the logistic regression model with PCA performed best; the PCA
model achieved an accuracy of 86%, recall of 68%, specificity of 69%, precision of 77%,
and F1-score of 72% [Table 1]. The models output whether heart disease is present or
not, with different levels of presence.
References
1. W.H Organisation: “New initiative launched to tackle cardiovascular disease, the world
number one killer” Intra-Health International (2017)
2. Shen, Z., Clarke, M., Jones, R.: Detecting the risk factors of coronary heart disease by use of
neural networks. In: Engineering in Medicine and Biology Society (1993)
3. Subhash, S., Patil, S.: Disease prediction using machine learning over big data. Int. J. Innov.
Res. Sci. Eng. Technol. 7 (2018)
4. Ambekar, S., Phalnikar, R: Disease prediction by using machine learning. Int. J. Comput.
Eng. Appl. (2018). ISSN 2321-3469
5. Wilson, P.W.F., D’Agostino, R.B., Levy, D., Belanger, A.M.: Prediction of coronary heart
disease using risk factor categories. J. Am. Heart Assoc. 97, 1837–1847 (1998)
6. Amin, S.U., Agarwal, K.: Genetic neural network based data mining in prediction of heart
disease using risk factors. In: 2013 IEEE Conference on Information & Communication
Technologies (ICT) (2013)
7. Kumar, B.S.: Adaptive personalized clinical decision support system using effective data
mining algorithms. J. Netw. Commun. Emerg. Technol. (2018)
8. Stephen, J., Pejaver, V.: Big data in public health: terminology, machine learning, and
privacy. Annu. Rev. Public Health 39, 95–112 (2018)
9. Raj, J.S., Ananthi, J.V.: Recurrent neural networks and nonlinear prediction in support
vector machines. J. Soft Comput. Paradigm (JSCP) 1(01), 33–40 (2019)
10. Simons, L.A., Simons, J., Friedlander, Y: Risk functions for prediction of cardiovascular
disease in elderly Australians: the Dubbo study. Med. J. Aust. (2003)
11. Bashar, A.: Survey on evolving deep learning neural network architectures. J. Artif. Intell. 1
(02), 73–82 (2019)
12. Kumar, B.S.: Data mining methods and techniques for clinical decision support systems.
J. Netw. Commun. Emerg. Technol. (JNCET) (2017)
13. Vapnik, V.N., Vapnik, V.: Statistical Learning Theory, vol. 2. Wiley, New York
14. Bharti, S.: Analytical study of heart disease prediction comparing with different algorithms.
In: International Conference on Computing, Communication and Automation (ICCA2015)
(2015)
15. Burges, J.C.: A tutorial on support vector machines for pattern recognition. Data Min.
Knowl. Discov. 2(2), 121–167 (1998)
16. Baboota, R., Kaur, H.: Predictive analysis and modelling football results using machine
learning approach for English premier league. Int. J. Forecast. 35(2), 745–755 (2019)
17. Kaur, H., Alam, M.A., Jameel, R., Mourya, A.K., Chang, V.: A proposed solution and future
direction for blockchain-based heterogeneous medicare data in cloud environment. J. Med.
Syst. 42(8), 1–11 (2018). https://doi.org/10.1007/s10916-018-1007-5
18. Kaur, H., Kumari, V.: Predictive modelling and analytics for diabetes using a machine
learning approach. Appl. Comput. Inf. (2018)
