Project Synopsis

On
“Heart Disease Prediction Using Machine Learning”

Under the Guidance of: Prof. P. M. Bihade

Submitted by:
Bharti Meshram
Pankaj Nimkar
Madhuri Rahngdale
Pankaj Shahare
Pawan Dhekwar

DEPARTMENT OF COMPUTER ENGINEERING



SMT. RADHIKATAI PANDAV COLLEGE OF ENGINEERING, NAGPUR
Rashtrasant Tukdoji Maharaj Nagpur University, Nagpur
2020-2021
"Heart Disease Prediction Using Machine Learning"

1. Abstract…………………………………………………
2. Introduction …………………………………………...
3. Objective……………………………………………….
4. Problem Definition……………………………………
5. Related Work………………………………………….
6. Literature Survey…………………………………….
7. Proposed Work ………………………………………

8. Techniques/Databases/Algorithms……………………
9. Modules………………………………………………..
10. Conclusion…………………………………………....
References……………………………………………….
1.ABSTRACT
Machine Learning (ML), one of the most prominent applications of Artificial
Intelligence, is producing remarkable results across many fields of research. In this
paper, machine learning is used to detect whether a person has heart disease. A large
number of people suffer from cardiovascular diseases (CVDs), which cost many lives
around the world every year. Machine learning can detect whether a person is suffering
from a cardiovascular disease by considering attributes such as chest pain, cholesterol
level, age and several others. Classification algorithms based on supervised learning, a
branch of machine learning, can make the diagnosis of cardiovascular diseases easier.
Algorithms such as K-Nearest Neighbor (KNN) and Random Forest are used to separate
people who have heart disease from people who do not. Two supervised machine
learning algorithms are used in this paper: K-Nearest Neighbor (K-NN) and Random
Forest. The prediction accuracy obtained by K-Nearest Neighbor (K-NN) is 86.885%,
and the prediction accuracy obtained by the Random Forest algorithm is 90.16%.
2.INTRODUCTION
According to the World Health Organization, 12 million deaths occur worldwide every
year due to heart disease. The burden of cardiovascular disease has been rising rapidly
all over the world over the past few years. Many studies have been conducted in an
attempt to pinpoint the most influential factors of heart disease as well as to predict the
overall risk accurately. Heart disease is often called a silent killer because it can lead to
the death of a person without obvious symptoms. Early diagnosis of heart disease plays
a vital role in making decisions on lifestyle changes in high-risk patients and, in turn,
reduces complications. This project aims to predict future heart disease by analyzing
patient data and classifying whether patients have heart disease or not using
machine-learning algorithms. Data mining techniques can be useful in predicting heart
disease: predictive models can be built by finding previously unknown patterns and
trends in databases and using the obtained information. Data mining means extracting
knowledge from large amounts of data. Machine learning is a technology that can help
achieve the diagnosis of heart disease.
3.Objectives
The main objectives of developing this project are:

1. To develop a machine learning model to predict the future possibility of heart
disease by implementing Logistic Regression.
2. To determine significant risk factors, based on the medical dataset, which may
lead to heart disease.
3. To analyse feature selection methods and understand their working principle.
4. To automate the prediction process, since heart disease prediction is a complex
task, in order to avoid the associated risks and alert the patient well in advance.

4.Problem Definition:-
The major challenge with heart disease is its detection. Instruments are available that
can predict heart disease, but they are either expensive or not efficient at calculating
the chance of heart disease in a human. Early detection of cardiac diseases can decrease
the mortality rate and overall complications. However, it is not possible to monitor
patients accurately every day in all cases, and round-the-clock consultation of a patient
by a doctor is not available, since it requires more sapience, time and expertise. Since a
good amount of data is available in today's world, various machine learning algorithms
can be used to analyse the data for hidden patterns, and these hidden patterns can be
used for health diagnosis from medical data.
5.RELATED WORKS:-

With the growing development in the field of medical science alongside machine
learning, various experiments and studies have been carried out in recent years,
resulting in several significant papers.

[1] proposes heart disease prediction using KStar, J48, SMO, Bayes Net and Multilayer
Perceptron with the WEKA software. Based on performance across different factors,
SMO (89% accuracy) and Bayes Net (87% accuracy) achieved better performance than
the KStar, Multilayer Perceptron and J48 techniques using k-fold cross-validation. The
accuracy achieved by those algorithms is still not satisfactory, so the accuracy needs to
be improved further to support better diagnostic decisions.

[2] In a study conducted using the Cleveland heart disease dataset, which contains 303
instances, 10-fold cross-validation and 13 attributes were used and four different
algorithms were implemented; the authors concluded that Gaussian Naïve Bayes and
Random Forest gave the maximum accuracy of 91.2 percent.

[3] Using a similar dataset from Framingham, Massachusetts, experiments were carried
out with four models, which were trained and tested with the following maximum
accuracies: K-Neighbors Classifier 87%, Support Vector Classifier 83%, Decision Tree
Classifier 79%, and Random Forest Classifier.

[4] Nagaraj M Lutimath, et al. performed heart disease prediction using Naive Bayes
classification and SVM (Support Vector Machine). The performance measures used in
the analysis were Mean Absolute Error, Sum of Squared Error and Root Mean Squared
Error, and it was established that SVM emerged as the superior algorithm in terms of
accuracy over Naive Bayes.
6.Literature Survey:-

Research has been done in this field, and methods have been produced to predict
cardiovascular disease using supervised machine learning algorithms. Several research
papers have been written on this topic. A survey has been presented in the form of a
paper which analyzes the performance of various models based on machine learning
algorithms and techniques [1]. In one of the papers, work has been done to create a
Graphical User Interface (GUI) to predict whether a person is suffering from heart
disease or not, using a Weighted Association rule based Classifier [2]. In another paper,
a new approach has been presented which is based on the coactive neuro-fuzzy
inference system (CANFIS) for the prediction of heart disease [3]. A summary of
commonly used techniques for heart disease prediction and their complexities is given
in one of the papers [4]. One of the papers presented a classifier approach for heart
disease detection and showed how Naive Bayes can be used for classification purposes
[5]. In one of the papers, a survey is done which covers different papers in which one or
more data mining algorithms have been used for heart disease prediction.
7.Proposed Work:-

The proposed work predicts heart disease by exploring four classification algorithms
(K-Nearest Neighbor, Random Forest, Decision Tree and Support Vector Machine) and
performing a performance analysis. The objective of this study is to effectively predict
whether a patient suffers from heart disease. The health professional enters the input
values from the patient's health report. The data is fed into the model, which predicts
the probability of having heart disease.

In the K-NN algorithm, a data point whose classification is not known is taken, and the number of
neighbors, k, is defined. Then the k neighbors with the lowest Euclidean distance to the selected data
point are chosen. The data point is classified into the category that holds the majority among those
k neighbors.
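As an illustration only, the K-NN procedure above could be run in Python with scikit-learn as sketched below; the file name heart.csv, the target column and k = 7 are assumptions, not values fixed by this synopsis.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("heart.csv")                        # Kaggle UCI heart data; file name is an assumption
X, y = df.drop(columns=["target"]), df["target"]     # 'target' column assumed per the dataset description
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()                            # scaling keeps Euclidean distances comparable
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=7)            # k = 7 is illustrative; tune for real use
knn.fit(X_train, y_train)
print("K-NN accuracy:", accuracy_score(y_test, knn.predict(X_test)))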

Random Forest, Decision Tree and Support Vector Machine, the other classifiers explored in this work,
are described in detail in Section 8.2.

Experimental Setup
The first step of the setup is to obtain a data set containing the features of people with
and without heart disease, along with the label indicating whether each person suffers
from the disease or not. The data set used in this experiment is taken from Kaggle
(https://www.kaggle.com/ronitf/heart-disease-uci). The programming language used for
the experiment is Python. Thirteen attributes available in the data set are used;
information about the attributes is available on Kaggle. The next step is to analyze the
data, for which a summary of the data set is required. To obtain a concise summary of
the DataFrame, the info() function provided by the Pandas library is called on the data
set.
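A minimal sketch of this step is shown below; the local file name heart.csv is an assumption.

import pandas as pd

df = pd.read_csv("heart.csv")   # data set downloaded from Kaggle; file name is an assumption
print(df.shape)                 # number of records and attributes
df.info()                       # concise summary of the DataFrame: columns, non-null counts, dtypes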
8.DATASETS:-

The dataset is publicly available on the Kaggle website at [4] and comes
from an ongoing cardiovascular study of residents of the town of
Framingham, Massachusetts. It provides patient information comprising
over 4000 records and 14 attributes. The attributes include: age, sex,
chest pain type, resting blood pressure, serum cholesterol, fasting blood
sugar, resting electrocardiographic results, maximum heart rate,
exercise-induced angina, ST depression induced by exercise, slope of the
peak exercise, number of major vessels, and a target ranging from 0 to 2,
where 0 indicates the absence of heart disease. The data set is in CSV
(Comma Separated Values) format and is loaded into a DataFrame using
the pandas library in Python.

Figure 1: Original Dataset Snapshot

The education attribute is irrelevant to an individual's heart disease, so it is dropped.
Pre-processing and the experiments are then carried out on this dataset.
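The pre-processing step described above might look like the following sketch; the file name framingham.csv, the column name education and the handling of missing values are assumptions made only for illustration.

import pandas as pd

df = pd.read_csv("framingham.csv")           # Framingham data from Kaggle; file name is an assumption
df = df.drop(columns=["education"])          # education is irrelevant to heart disease, so it is dropped
df = df.dropna()                             # one simple way to handle missing values (an assumption)
print(df.shape)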
8.1.METHODS AND ALGORITHMS USED
The main purpose of designing this system is to predict the ten-year risk of future heart disease. We have
used Logistic Regression as the machine-learning algorithm to train our system, together with feature
selection algorithms such as Backward Elimination and Recursive Feature Elimination. These algorithms
are discussed in detail below.

1.Logistic Regression

Logistic Regression is a supervised classification algorithm. It is a predictive analysis algorithm based
on the concept of probability. It measures the relationship between the dependent variable
(TenYearCHD) and one or more independent variables (risk factors) by estimating probabilities using
the underlying logistic (sigmoid) function. The sigmoid function is used to limit the hypothesis of
logistic regression between 0 and 1 (squashing), i.e. 0 ≤ hθ(x) ≤ 1.
In logistic regression the cost function is defined as:

Cost(hθ(x), y) = −log(hθ(x))       if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x))   if y = 0

Logistic Regression relies heavily on a proper presentation of the data. So, to make the model more
powerful, important features are selected from the available data set using the Backward Elimination
and Recursive Feature Elimination techniques.
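A small NumPy sketch of the sigmoid and the piecewise logistic cost defined above, using toy values for illustration only:

import numpy as np

def sigmoid(z):
    # squashes any real value into (0, 1), so 0 <= h_theta(x) <= 1
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(h, y):
    # averages -log(h) when y = 1 and -log(1 - h) when y = 0
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

h = sigmoid(np.array([2.0, -1.0, 0.5, -2.5]))   # toy predicted scores for four patients
y = np.array([1, 0, 1, 0])                      # toy TenYearCHD labels
print(logistic_cost(h, y))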

2.Backward Elimination Method:

While building a machine learning model, only the features that have a significant influence on the
target variable should be selected. In the backward elimination method for feature selection, the first
step is selecting a significance level (P-value threshold). For our model, we have chosen a 5%
significance level, i.e. a P-value threshold of 0.05. The feature with the highest P-value is identified,
and if its P-value is greater than the significance level it is removed from the dataset. The model is fit
again on the new dataset, and the process is repeated until the P-values of all remaining features are
below the significance level. In this model, the factors male, age, cigsPerDay, prevalentStroke,
diabetes and sysBP were chosen as significant after applying the backward elimination algorithm.
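One common way to implement this loop is with statsmodels, which reports a P-value for each coefficient. The sketch below only illustrates the procedure described above; the synopsis does not name the library, and the file name framingham.csv and column name TenYearCHD follow the dataset description.

import pandas as pd
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    # repeatedly fit a logistic model and drop the feature with the highest P-value
    features = list(X.columns)
    while features:
        model = sm.Logit(y, sm.add_constant(X[features])).fit(disp=0)
        pvalues = model.pvalues.drop("const")
        worst = pvalues.idxmax()
        if pvalues[worst] > significance_level:
            features.remove(worst)      # remove the least significant feature and refit
        else:
            break                       # every remaining feature is below the threshold
    return features

df = pd.read_csv("framingham.csv").drop(columns=["education"]).dropna()   # names are assumptions
selected = backward_elimination(df.drop(columns=["TenYearCHD"]), df["TenYearCHD"])
print(selected)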

3.Recursive Feature Elimination using Cross-Validation (RFECV)

RFECV is a greedy optimization algorithm that aims to find the best-performing feature subset.
Recursive Feature Elimination (RFE) fits a model repeatedly and removes the weakest feature until the
specified number of features is reached. RFECV applies RFE with cross-validation to score different
feature subsets and select the best-scoring collection of features. The main issue with this algorithm is
that it can be expensive to run, so it is better to reduce the number of features beforehand. Since
correlated features provide the same information, such features can be eliminated prior to RFECV: a
correlation matrix is plotted and the correlated features are removed.
The arguments for the RFECV instance are:
a. estimator - model instance (RandomForestClassifier)
b. step - number of features removed on each iteration (1)
c. cv - cross-validation strategy (StratifiedKFold)
d. scoring - scoring metric (accuracy)

Once RFECV has finished running, the least important features can be extracted and dropped from the
dataset; a code sketch follows the ranking below. The top 10 features ranked by the RFECV technique
in our model are listed below, from least important to most important.

1. prevalentStroke
2. Diabetes
3. BPMeds
4. CurrentSmoker
5. PrevalentHyp
6. Male
7. CigsPerDay
8. Heartrate
9. Glucose
10. DiaBP
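A sketch of constructing RFECV with the arguments listed above; the file and column names are assumptions, and the number of cross-validation folds is illustrative.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("framingham.csv").drop(columns=["education"]).dropna()   # names are assumptions
X, y = df.drop(columns=["TenYearCHD"]), df["TenYearCHD"]

selector = RFECV(estimator=RandomForestClassifier(random_state=42),   # a. estimator
                 step=1,                                              # b. one feature removed per iteration
                 cv=StratifiedKFold(n_splits=5),                      # c. cross-validation strategy
                 scoring="accuracy")                                  # d. scoring metric
selector.fit(X, y)

# ranking_ == 1 marks the selected features; larger ranks are the weakest features to drop
for name, rank in sorted(zip(X.columns, selector.ranking_), key=lambda t: t[1], reverse=True):
    print(rank, name)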
8.2 ALGORITHMS:-

1.Random Forest
Random Forest works by constructing multiple decision trees from the training data. Each tree predicts
a class as its output, and in classification the class output by the greatest number of decision trees is
taken as the final result. In this algorithm we need to define the number of trees we want to create.
Random Forest is a bootstrap aggregating (bagging) technique, which is used to decrease the variance
in the results.

Figure: Random Forest
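A minimal training sketch for Random Forest on the prepared data; the file name, column names and number of trees are assumptions used only for illustration.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("framingham.csv").drop(columns=["education"]).dropna()   # names are assumptions
X, y = df.drop(columns=["TenYearCHD"]), df["TenYearCHD"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# n_estimators sets how many trees are built; each tree votes and the majority class is the prediction
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random Forest accuracy:", accuracy_score(y_test, rf.predict(X_test)))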

2.Decision Tree
The Decision Tree algorithm takes the form of a flowchart in which the inner nodes represent dataset
attributes and the branches lead to the outcomes. Decision Trees are chosen because they are fast,
reliable, easy to interpret and require very little data preparation. In a Decision Tree, prediction of the
class label starts from the root of the tree: the value of the root attribute is compared to the record's
attribute, and based on the result of the comparison, the corresponding branch is followed and the
process jumps to the next node.
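A corresponding Decision Tree sketch, with the same assumed file and column names; the max_depth value is illustrative.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("framingham.csv").drop(columns=["education"]).dropna()   # names are assumptions
X, y = df.drop(columns=["TenYearCHD"]), df["TenYearCHD"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# each internal node tests one attribute; prediction follows branches from the root down to a leaf
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
print("Decision Tree accuracy:", accuracy_score(y_test, dt.predict(X_test)))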
3.Support Vector Machines
Support vector machines (SVMs) were first introduced in statistical learning theory. An SVM is
basically a binary classifier that creates a linear separating hyperplane to classify data points. SVMs
are used in classification, regression and clustering. In the context of global optimization, SVMs can
handle more complex problems arising in high-dimensional spaces, which makes them attractive in
various applications. Commonly used SVM algorithms are support vector regression, the least squares
support vector machine, and the successive projection algorithm-support vector machine [11].

Figure: Support Vector Machine algorithm
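A Support Vector Machine sketch under the same assumptions; a linear kernel is chosen here to match the linear separating hyperplane described above, and feature scaling is added because SVMs are sensitive to feature ranges.

import pandas as pd
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("framingham.csv").drop(columns=["education"]).dropna()   # names are assumptions
X, y = df.drop(columns=["TenYearCHD"]), df["TenYearCHD"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# scale the features, then fit a linear SVM that builds the separating hyperplane
svm = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="linear"))])
svm.fit(X_train, y_train)
print("SVM accuracy:", accuracy_score(y_test, svm.predict(X_test)))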


9.Modules Used:-

The modules used and the classes imported from each are:

a. sklearn.impute - SimpleImputer
b. sklearn.preprocessing - StandardScaler
c. sklearn.pipeline - Pipeline
d. sklearn.feature_selection - RFECV
e. sklearn.ensemble - RandomForestClassifier
f. sklearn.model_selection - train_test_split, StratifiedKFold
g. sklearn.linear_model - LogisticRegression
h. sklearn.metrics - accuracy_score, confusion_matrix
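For reference, the corresponding import statements for the classes listed above are:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix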
10.Conclusion:-

With the increasing number of deaths due to heart disease, it has become essential to
develop a system that predicts heart disease effectively and accurately. The motivation
for this study was to find the most efficient ML algorithm for the detection of heart
disease. This study compares the accuracy scores of the Decision Tree, Logistic
Regression, Random Forest and Naive Bayes algorithms for predicting heart disease
using the UCI machine learning repository dataset. The results of this study indicate
that the Random Forest algorithm is the most efficient, with an accuracy score of
90.16% for the prediction of heart disease. In the future, this work can be enhanced by
developing a web application based on the Random Forest algorithm and by using a
larger dataset than the one used in this analysis, which will help to provide better
results and help health professionals predict heart disease effectively and efficiently.
REFERENCES:-

[1] Nagaraj M Lutimath, Chethan C, Basavaraj S Pol., "Prediction of Heart Disease using
Machine Learning", International Journal of Recent Technology and Engineering, 8,
(2S10), pp. 474-477, 2019.
[2] UCI, "Heart Disease Data Set" [Online]. Available (accessed on May 1, 2020):
https://www.kaggle.com/ronitf/heart-disease-uci.
[3] Sayali Ambekar, Rashmi Phalnikar, "Disease Risk Prediction by Using Convolutional
Neural Network", 2018 Fourth International Conference on Computing Communication
Control and Automation.
[4] C. B. Rjeily, G. Badr, E. Hassani, A. H., and E. Andres, "Medical Data Mining for
Heart Diseases and the Future of Sequential Mining in Medical Field", in Machine
Learning Paradigms, 2019, pp. 71-99.
[5] Jafar Alzubi, Anand Nayyar, Akshi Kumar, "Machine Learning from Theory to
Algorithms: An Overview", Journal of Physics: Conference Series, 2018.
[6] Fajr Ibrahem Alarsan and Mamoon Younes, "Analysis and classification of heart
diseases using heartbeat features and machine learning algorithms", Journal of Big
Data, 2019; 6:81.
[7] Internet source [Online]. Available (accessed on May 1, 2020): http://acadpubl.eu/ap
