
HEART DISEASE PREDICTION

Ms. Shubhada Labde, Dept. of Computer Engineering, K.J. Somaiya Institute of Engineering and Information Technology, shubhada.l@somaiya.edu
Mr. Chetan Patil, Dept. of Computer Engineering, K.J. Somaiya Institute of Engineering and Information Technology, patil.c@somaiya.edu
Mr. Chinmay Tavarej, Dept. of Computer Engineering, K.J. Somaiya Institute of Engineering and Information Technology, chinmay.tavarej@somaiya.edu
Ms. Krushi Mehta, Dept. of Computer Engineering, K.J. Somaiya Institute of Engineering and Information Technology, krushi.mehta@somaiya.edu

Abstract-- According to the Department of Health, cardiac diseases claim the lives of 17.7 million people each year, accounting for 31% of all deaths worldwide. Heart disease has also become the leading cause of death in India: according to the 2016 Global Burden of Disease report, it claimed the lives of 1.7 million Indians in 2016. We referred to around 15-20 IEEE papers and found that all of them use the same attributes. We therefore propose a model, developed in consultation with a doctor, that uses new and easily manageable attributes: gender, breathlessness during activity, breathlessness at rest, being awakened by breathlessness at night, exercise-induced angina (chest pain after exercise), history of cyanosis (bluish discoloration of fingers/around lips), diabetes, clubbing, and blood pressure (if more than 140/90). The main goal of our project is to build an intelligent system using machine learning, namely Naive Bayes, KNN, Random Forest, Decision Tree, Logistic Regression, XGBoost, LSVM, and ANN. Based on the obtained results, the system can predict whether or not a person has a chance of heart disease. It is implemented as a web application. We implement this system specifically for the age group under 50, for early prediction. With early prediction and proper medical treatment, one can reduce the cost of treatment and further damage. If a user does not know his/her diabetes and BP status, he/she can answer no to the main question; the type 1 symptoms of the disease are then shown on the screen, and the system makes its prediction based on his/her responses.

Keywords-- KNN, Naive Bayes, Decision Tree, XGBoost, Random Forest, Logistic Regression, LSVM, ANN

1.INTRODUCTION
In most animals, the heart is a muscular organ that pumps blood through the blood vessels of the circulatory system. The pumped blood transports oxygen and nutrients to the body and carries metabolic waste such as carbon dioxide to the lungs. If the heart fails to function properly, the brain and several other organs stop working, and the individual dies within minutes. Most cardiovascular diseases can be prevented by addressing observable risk factors such as cigarette smoking, poor eating habits, obesity, physical inactivity, and harmful consumption of alcohol. Individuals with cardiovascular disease, or who are at high cardiovascular risk (due to the presence of at least one risk factor, such as hypertension, diabetes, hyperlipidemia, or an already established illness), require early detection and management, including prescribed medication, as advised. Cardiovascular disease is caused by the accumulation of fatty deposits inside the arteries and the formation of blood clots. It is also linked to damage to the brain, heart, kidneys, and eyes, among other organs. According to estimates made by the Health Organization, India has lost up to 237 billion dollars due to cardiovascular disease in the last decade. As a result, timely and precise prognosis of heart-related
disease is essential. This system is a web-based application in which the user answers a sequence of questions. Data analytics is valuable for controlling, contrasting, and managing large data sets, and it can be applied with much success to predict, prevent, and manage cardiovascular disease.

2.LITERATURE SURVEY
This section highlights ongoing research on using machine learning classifiers to predict chronic and persistent diseases.

Yuanyuan [1] uses the Silhouette approach to determine an optimal value of K when building the clusters for discovering anomalies. They next use the five most prominent machine learning classification techniques to remove the discovered anomalies from the data.

The work in [2] explored the Levy-based crow search algorithm (LCSA). ANFIS, which suits highly non-linear, complex, and dynamic computational processes, is used in this framework to make predictions on cardiac diseases. MSSO is used to optimize the learning parameters of ANFIS, leading to better results.

In the RFRF-ILM model given in [3], the Decision Tree (DT) feature variables and criteria are used to cluster datasets. The classifier is then applied to each data set to estimate its performance. The best-performing models, those with a lower error rate, are selected based on the results. The output is further optimized by taking a decision tree cluster with a high error rate and eliminating its related class-type information.

Chidambaram [4] first cleansed the dataset using the pandas tool and processed it with preprocessing techniques such as Data Integration, Data Reduction, Data Transformation, and Data Cleaning. Patient records were visualized. The cleansed data is split into 60 percent training and 40 percent test data, and the dataset is then tested on five machine learning classifiers.

Rony [5] works on the EDCNN model. This model focuses on a more detailed architecture that encompasses the multilayer perceptron model as well as regularization learning techniques. For the diagnosis, the UCI repository dataset was used, together with the CNN classifier and a multi-layer perceptron (MLP) module.

Santhana Krishnan [6] uses a dataset retrieved from an online website. Decision Tree (DT) and Naive Bayes (NB) classifier algorithms are used: these two data mining algorithms were applied to the dataset to estimate the likelihood of a patient developing heart disease.

In [7], a hybrid machine learning technique was used to identify heart problems. The hybrid technique for predicting cardiac disease combines random forest classifiers with simple k-means algorithms. The final results were obtained using a random forest classifier, and the resulting confusion matrix supports the approach's stability.

Nabaouia Louridi [8] worked on the SVM model. SVM assists in data analysis for classification and regression. Its goal is to find a hyperplane in N-dimensional space that clearly separates the data. The work compares machine learning algorithms on several performance indicators in order to increase their efficiency.

In [9], machine learning techniques were used to analyze raw data to provide a new perspective on cardiac disease. If disease is identified early and preventative actions are taken as soon as feasible, the death rate can be reduced dramatically. In the Decision Tree, high-entropy inputs are used to build trees. In Random Forest, numerous decision trees are created and combined to get the best outcome. The neuron components include inputs, hidden layers, and output.

Vijeta Sharma [10] examined the neural network. To predict the diagnosis of cardiovascular heart disease, they used neural networks as classifiers. An artificial neuron comprises an activation function that determines its output. The input to this activation function is the pre-activation value, derived as the weighted sum of all the neuron's inputs. Whether a neuron should fire or not is decided based on the output of the activation function. Neural networks and many strategies used to increase neural network performance were covered in this study.

Meryem [11] uses the mean value to replace missing data in the preprocessing step. The results show that using the mean to replace missing values works well. Using SVM with a linear kernel, an accuracy of 86.8% was attained.

Pranav Motarwar [12] used techniques such as SVM, Gaussian NB, Hoeffding Tree, Random Forest, and LMT to boost the accuracy of his predictions. On the Cleveland dataset, the performance of each method was evaluated, and the results were compared in terms of accuracy. Random forest is also determined to be preferable because of its ability to meet an individual's interests.

Only 14 fundamental features are considered in the work in [13]. They used K-nearest neighbor (KNN), Naive Bayes, decision tree, and random forest as data mining classification approaches. The information was
pre-processed before being included in the model. The algorithms that exhibit the best results in this model are K-nearest neighbor, Naive Bayes, and random forest, with K-nearest neighbors (k = 7) achieving the maximum accuracy among the four algorithms applied.

Archana [14] calculates the accuracy of machine learning models for estimating heart disease, using the dataset for training and testing. The algorithms include KNN, decision tree, linear regression, and support vector machine (SVM).

In [15], machine learning-based models analyze multidimensional medical datasets and produce more insightful results. A cardiovascular dataset is classified in this work using a number of state-of-the-art supervised machine learning algorithms that are specifically used for disease prediction. The findings show that the Decision Tree classification model outperformed SVM, KNN, Logistic Regression, Naive Bayes, and Random Forest-based techniques in predicting cardiovascular illnesses. With 73 percent accuracy, the Decision Tree provided the best outcome.

3.EXISTING SYSTEM
In this section we take a look at the work that developers and researchers have done in heart disease prediction. Almost all of them have used a data set that includes 304 instances of 10 parameters such as sex, age, trestbps, cp, thalach, restecg, chol, fbs, ca, and target, collected from the University of California repository. The dataset is cleansed and processed at the first level using preprocessing techniques such as Data Integration, Data Transformation, Data Reduction, and Data Cleaning with the Pandas tool. In all, 304 patient records were viewed. Data visualization tools assist the data scientist in comprehending the dataset's viability. The cleansed data is divided into 60 percent training and 40 percent test groups, and the dataset is then applied to five machine learning classifiers: Logistic Regression, Support Vector Machine, Decision Tree (DT), Random Forest (RF), and K-Nearest Neighbors (KNN). The confusion matrix was used to calculate the classifiers' accuracy. The best classifier is the one that achieves the maximum accuracy.

4.PROPOSED SYSTEM
Flowchart:
This is a flowchart of the heart disease prediction system. It gives a quick overview of the system. We give the user two options:
1) Quick predict, which does not require any registration
2) Sign up/in, which gives users more features, such as previous records and prevention.

Fig 4.1 Flowchart of proposed system

Methodology:
In this system, we employ machine-learning techniques to develop an effective cardiac disease prediction. We propose a model, developed in consultation with an expert, Dr. Haresh Dhondi, with new and easily manageable parameters. All the parameters that we use are non-medical, so anyone with basic knowledge of health can fill in his/her personal data, and based on the given input data the system estimates whether or not a person has heart problems. The system can process input in the form of a CSV file. The algorithms are applied to the input after it has been taken. The process is carried out after obtaining the data set, and a heart disease prediction is provided. If a user does not know his/her diabetes and BP status, he/she can answer no to the main question; the type 1 symptoms of the disease are then shown on the screen, and the system makes its prediction based on his/her responses.
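The existing-system workflow described in Section 3 (split the cleansed records 60/40, score each classifier, keep the one with the highest accuracy) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names `split_records` and `best_classifier`, the fixed seed, and the placeholder scores are assumptions made for the example and do not come from the surveyed work.

```python
import random


def split_records(records, train_fraction=0.6, seed=0):
    """Shuffle a copy of the records and split it into train and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def best_classifier(accuracy_by_name):
    """Return the name of the classifier with the highest accuracy."""
    return max(accuracy_by_name, key=accuracy_by_name.get)


# 304 placeholder record ids stand in for the patient records.
train, test = split_records(range(304))
print(len(train), len(test))

# Placeholder scores for illustration only, not results from any paper.
print(best_classifier({"model_a": 0.81, "model_b": 0.86}))
```

Changing `train_fraction` to 0.7 gives the 70/30 split used for the proposed system.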
Architecture:
The system's operation begins with the collection of data and the selection of relevant attributes. The relevant data is then preprocessed and converted to the required format. The data is then separated into two sets: training and testing. The algorithms are applied, and the model is trained with the training data. The system's correctness is determined by testing it with the testing data.

Fig 4.2 Architecture of system

Collection of dataset:
We started by gathering data for our heart disease prediction system. After the dataset was collected, we divided it into training data and testing data. The training dataset is used to learn the prediction model, whereas the testing dataset is used to evaluate the model. For this project, 70% of the data is used for training and 30% for testing. The dataset used for this project was provided by Dr. Haresh Dhondi (MD).

Selection of attributes:
Attribute or feature selection is the selection of appropriate attributes for the prediction system. It is used to boost the system's efficiency. Various attributes of the patient, such as gender, breathlessness during activity, breathlessness at rest, being awakened by breathlessness at night, exercise-induced angina (chest pain after exercise), history of cyanosis (bluish discoloration of fingers/around lips), diabetes, clubbing, and blood pressure (if more than 140/90), are selected for the prediction. The correlation matrix is used for attribute selection for this model.

Fig 4.3. Correlation matrix

A correlation matrix plot shows, for each pair of variables, a correlation coefficient that measures the strength of their linear relationship. The correlation matrix represents the strength and direction of a linear relationship between two variables, with values ranging from -1 to +1. Each cell of the matrix shows the correlation between one pair of variables. Showing the correlation matrix as a heat map is an effective way to check for relationships between features.

Data Pre-processing:
Preprocessing data is a crucial stage in the development of a machine learning model. At first, data may not be clean or in the appropriate format for the model, which can lead to inaccurate results. During pre-processing we change the data into the format we need. Pre-processing is used to deal with the dataset's noise, duplication, and missing values. Importing datasets, attribute scaling, and dividing datasets are all part of data pre-processing. Data preprocessing is essential to improve the model's accuracy.

Figure: Data Pre-processing

Balancing of Data:
Imbalanced datasets can be balanced in two ways: undersampling and oversampling.
● UnderSampling: The size of the abundant class is reduced to achieve dataset balance. This technique is considered when the amount of data is adequate.
● OverSampling: Dataset balance is achieved by increasing the size of the limited samples. This method is considered when the amount of data available is insufficient.

Fig 4.4 Data Balancing

Here, count 0 shows the number of people who do not have heart disease, and count 1 shows the number of people who have been diagnosed with heart disease.

Prediction of Disease:
SVM, Naive Bayes, Decision Tree, Random Forest, Logistic Regression, Artificial Neural Network, and XGBoost are some of the machine learning methods used for classification. For heart disease prediction, a comparative analysis of the algorithms is performed, and the algorithm with the highest accuracy is used.

Data Set Creation:
Medical Expert: Dr. Haresh Dhondi (MD)

Parameters:
We referred to different IEEE papers and found that all the papers use the same attributes. We therefore propose a model, developed in consultation with an expert, Dr. Haresh Dhondi, with new and easily manageable attributes. All the parameters that we use are non-medical, so anyone with basic knowledge of health can fill in his/her personal data, and based on the given input data the system predicts whether or not the person has heart disease. Table 4.1 shows the parameters used in the proposed model.

Table 4.1 Feature information of dataset

Sr.No  Attribute Name                                              Range of values
1      Age                                                         Int (years)
2      Gender                                                      Categorical code
3      Breathlessness during activity                              Binary
4      Breathlessness at rest                                      Binary
5      Awake by breathlessness at night                            Binary
6      Exercise induced angina (chest pain after exercise)         Binary
7      History of cyanosis (bluish discoloration of fingers/lips)  Binary
8      Diabetes                                                    Binary
       If user selects NO: 1. Excessive thirst 2. Excessive urination 3. Weight loss 4. Tingling hands or feet
9      Clubbing                                                    Binary
10     Blood pressure (if more than 140/90)                        Binary
       If user selects NO: 1. Severe headache 2. Fatigued 3. Unusual change in behavior 4. Nosebleed
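One possible in-memory representation of a questionnaire response with the ten attributes of Table 4.1 is sketched below. The class name, the field names, and the 0/1 coding of gender are hypothetical choices made for illustration; the paper does not specify how the web application stores the answers or how the follow-up symptom questions are mapped onto the diabetes and blood pressure fields.

```python
from dataclasses import dataclass


@dataclass
class QuestionnaireResponse:
    age: int                       # years
    gender: int                    # categorical code; the 0/1 coding is assumed
    breathless_on_activity: bool
    breathless_at_rest: bool
    awake_by_breathlessness: bool
    exercise_induced_angina: bool
    history_of_cyanosis: bool
    diabetes: bool
    clubbing: bool
    high_blood_pressure: bool      # more than 140/90

    def to_features(self):
        """Return the ten attributes as a numeric list, binary answers as 0/1."""
        binary_answers = (
            self.breathless_on_activity,
            self.breathless_at_rest,
            self.awake_by_breathlessness,
            self.exercise_induced_angina,
            self.history_of_cyanosis,
            self.diabetes,
            self.clubbing,
            self.high_blood_pressure,
        )
        return [self.age, self.gender] + [int(answer) for answer in binary_answers]


# Made-up answers for illustration, not a record from the dataset.
example = QuestionnaireResponse(
    age=42, gender=1,
    breathless_on_activity=True, breathless_at_rest=False,
    awake_by_breathlessness=False, exercise_induced_angina=True,
    history_of_cyanosis=False, diabetes=False,
    clubbing=False, high_blood_pressure=True,
)
features = example.to_features()
print(features)
```

Keeping the field order identical to the Sr.No order of the table makes it easy to check a feature vector against the questionnaire by eye.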
5.Algorithms Tested:
We apply eight prominent machine learning classifiers to the heart disease dataset.

1. K-nearest neighbor (KNN)
K-nearest neighbor refers to the K data points that lie closest to a given point. The KNN method is built on the idea that similar objects are found close together. In most existing approaches, this closeness is estimated using straight-line distance, with the Euclidean distance being the most prominent. The KNN algorithm first calculates the distance between each stored example and the query example. Each example's index and distance are then added to a collection, which is sorted by distance. Finally, the query is classified as the mode of the first K labels in the sorted collection. KNN is a simple and easy-to-understand non-parametric approach. We split our data 70% train and 30% test.

[[661 83]
[158 869]]
Here, TP = 661, FP = 83, FN = 158, TN = 869
We got 86.67% accuracy

Fig 5.1 KNN

2. Random forest (RF)
Random forest is a supervised classification technique based on the decision tree paradigm. It builds K separate training data subsets from the original dataset using a bootstrap sampling approach. It then trains a decision tree on each subset, creating K decision trees. Finally, those decision trees are combined into a random forest, and each sample of the testing dataset is classified based on the votes of these trees. Random forest solves the overfitting problem of decision trees by using the ensemble learning strategy to reduce variance and hence enhance accuracy. We split our data 70% train and 30% test.

[[742 2]
[ 66 961]]
Here, TP = 742, FP = 2, FN = 66, TN = 961
We got 97.18% accuracy

3. Support vector machine (SVM)
Support vector machine is a supervised approach that may be applied to classification and regression problems. The SVM classifier starts by plotting each data point in n-dimensional space, where n is the number of features in the dataset. Each feature's value corresponds to the value of a specific coordinate. It then determines the hyperplane that best distinguishes the two classes and performs classification. SVM generalizes well, which helps prevent overfitting, and uses the kernel trick to handle non-linear data efficiently. On the other hand, SVM takes longer to train on larger datasets, and selecting the right kernel function is tricky.

Linear SVM:
Linear SVM is used for linearly separable data: if a dataset can be classified into two classes using a single straight line, it is linearly separable, and the classifier used is Linear SVM. We split our data 70% train and 30% test.

[[743 1]
[ 0 1027]]
Here, TP = 743, FP = 1, FN = 0, TN = 1027
We got 99.80% accuracy
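The KNN steps described earlier in this section (compute the Euclidean distance from the query to every stored example, sort, and take the mode of the first K labels) can be sketched in a few lines. This is an illustrative standard-library version, not the implementation used to obtain the reported accuracy; the function name `knn_predict` and the toy points are assumptions.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` as the most common label among its k nearest points."""
    # Pair each Euclidean distance with the index of the example, then sort.
    distances = sorted(
        (math.dist(point, query), index)
        for index, point in enumerate(train_points)
    )
    nearest_labels = [train_labels[index] for _, index in distances[:k]]
    # The mode of the first k labels is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy two-feature data for illustration only.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, query=(5.5, 5.5), k=3))
print(knn_predict(points, labels, query=(0.2, 0.1), k=3))
```

An odd K avoids tied votes in a two-class problem such as this one.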
Fig.5.2 LSVM

4. Naive Bayes (NB)
Bayes' theorem of probability is the basis for the Naive Bayes classification algorithm. It begins by turning the dataset into a frequency table. It then generates a likelihood table by computing the probability for each feature category. The posterior probability of each class is then calculated using Bayes' formula. Finally, the predicted outcome is the class with the highest posterior probability. Because it can estimate its parameters from a small amount of training data, Naive Bayes requires less training time. We split our data 70% train and 30% test.

[[742 2]
[202 825]]
Here, TP = 742, FP = 2, FN = 202, TN = 825
We got 87.24% accuracy

5. Logistic regression (LR)
Logistic regression is a supervised classification approach that uses a logistic function such as the sigmoid function to represent the probability of a class for each test instance. The sigmoid function, which has an S-shaped curve, maps any real-valued number to a value between 0 and 1, which is then converted to either 0 or 1. When the dataset is linearly separable, logistic regression performs better and is less prone to overfitting. We split our data 70% train and 30% test.

[[742 3]
[203 825]]
Here, TP = 742, FP = 3, FN = 203, TN = 825
We got 88.60% accuracy

Fig.5.3 LR

6.Decision Tree
Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. In this tree-structured classifier, internal nodes represent dataset attributes, branches represent decision rules, and each leaf node represents the outcome. We split our data 70% train and 30% test.

[[705 38]
[62 966]]
Here, TP = 705, FP = 38, FN = 62, TN = 966
We got 94.35% accuracy

Fig.5.4 DT
7.XGBoost
XGBoost is an implementation of gradient boosted decision trees. Decision trees are created sequentially in this approach. In XGBoost, weights are very significant. All of the independent variables are given weights, which are then fed into the decision tree, which predicts outcomes. The weight of the variables that the tree predicted incorrectly is increased, and these variables are fed into the second decision tree. We split our data 70% train and 30% test.

[[743 1]
[ 22 1005]]
Here, TP = 743, FP = 1, FN = 22, TN = 1005
We got 98.31% accuracy

8.Neural Network: A neural network is a set of algorithms that attempts to recognise underlying relationships in a set of data using a method that resembles how the human brain works. Neural networks, in this context, refer to systems of neurons that can be organic or artificial in nature. Because neural networks can adapt to changing input, they can produce the best possible outcome without requiring the output criteria to be redesigned. The artificial intelligence-based notion of neural networks is quickly gaining traction in the creation of trading systems. We split our data 70% train and 30% test.

[[730 13]
[ 0 1028]]
Here, TP = 730, FP = 13, FN = 0, TN = 1028
We got 99.20% accuracy

Fig 5.6 ANN

Experimental Evaluation
The performance parameters used in this study to evaluate all classification models in terms of heart disease prediction are defined in this section.

Fig 5.7 Experimental Evaluation

True positives (TP) are positive instances that the classifier correctly labeled as positive, false positives (FP) are negative instances that the classifier incorrectly labeled as positive, true negatives (TN) are negative instances that the classifier correctly labeled as negative, and false negatives (FN) are positive instances that the classifier incorrectly labeled as negative.
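Given the four counts, accuracy is (TP + TN) / (TP + FP + FN + TN). The sketch below computes it, together with precision and recall, for the Decision Tree counts reported in Section 5, keeping the paper's own assignment of TP, FP, FN, and TN to the cells of the matrix. The function name `classification_metrics` is an assumption, and precision and recall are standard measures added for illustration rather than figures the paper reports.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision and recall from confusion-matrix counts.

    Assumes tp + fp and tp + fn are non-zero.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Decision Tree counts as labeled in Section 5.
metrics = classification_metrics(tp=705, fp=38, fn=62, tn=966)
print(f"accuracy = {metrics['accuracy']:.2%}")
```

For these counts the computed accuracy is (705 + 966) / 1771, which rounds to the 94.35% reported for the Decision Tree.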

6.RESULT
After testing all the above algorithms on our dataset, we obtained the following accuracies. We obtained the highest accuracy, 99.80%, with the LSVM model.

Comparative Analysis:

Table 6.1 Obtained Accuracy with different Algorithms

SR.NO  ALGORITHM  ACCURACY
1      NN         99.20%
2      DT         94.35%
3      RF         97.18%
4      XGBOOST    98.31%
5      KNN        86.67%
6      NB         87.24%
7      LR         88.60%
8      LSVM       99.80%

Fig 6.1 Obtained Accuracy

Findings from study:
Young age (6.8%) + Middle age (39.9%) = 46.7% of people up to age 50 have heart-related disease. One shocking finding of this research is that, in our dataset, around 6.8% of young persons and 39.9% of middle-aged persons were detected with heart disease, so in total 46.7% of persons in the young and middle age groups have heart disease. This is an alarming situation. With early detection and proper medical assistance, these numbers can be reduced.

Fig 6.2 Heart Disease patients Age group wise

In Fig 6.3, the data distribution is shown with age on the X axis and the number of records on the Y axis.

Fig 6.3 Age Wise Distribution
7. CONCLUSION

In this application we have found the best optimized prediction models for heart disease with simple and easily manageable parameters, so that the system can identify heart disease at an early stage. We implement this system specifically for the age group under 50, for early prediction. With early prediction and proper medical treatment, one can reduce the cost of treatment and further damage.

8.FUTURE SCOPE
A mechanism has been developed to determine the accuracy of heart disease prediction. The proposed technology has a high level of accuracy when it comes to diagnosing cardiac problems. In the future, we can integrate this web application with an Android app so that it is easily accessible to Android users. We also plan to build a network of heart specialist hospitals and provide them with this system, so that a patient can easily find an available hospital for treatment.
