
IV SEM U.G.

Project Report

2020 On

“Data Visualization and Prediction of
Heart Disease by Machine Learning Algorithms”

Supervisor:
Dr. (Mr) M.S. Muthu
Associate Professor in Pharmaceutics,
Department of Pharmaceutical Engg. & Technology,
Indian Institute of Technology (BHU),
(Assistant Professor)

Submitted by:
Saurabh Kumar

DEPT. OF PHARMACEUTICAL ENGINEERING & TECHNOLOGY
INDIAN INSTITUTE OF TECHNOLOGY (BANARAS HINDU UNIVERSITY)
VARANASI-221005, INDIA

1 May 2021 Roll. No.18165043


Department of Pharmaceutical
Engineering & Technology
Indian Institute of Technology, (BHU) Varanasi
VARANASI - 221005, INDIA

Assistant Professor
Dept. of Pharmaceutical Engineering and Technology
IIT(BHU) Varanasi

27 May 2020

This is to certify that the present work entitled “Data Visualization and Prediction of
Heart Disease by Machine Learning Algorithms” has been carried out by Mr.
Saurabh Kumar under my direct supervision and guidance during his academic
semester IV. He has conducted his studies very sincerely, meticulously and
methodically and the results of the work are embodied in this report. I wish him
success in all his future endeavors.

Dr. (Mr). M.S Muthu sir


Supervisor
CONTENTS

1 Acknowledgement
2 Abstract
3 Introduction
4 Objective
5 Types of Cardiovascular Diseases
6 Prevalence of Cardiovascular Diseases
7 Machine Learning Algorithms
8 Material and Methodology
9 Work Plan
10 Data Visualization
11 Result
12 Conclusion

1 Acknowledgement

During the period of my project in this University, several respected and kind
people contributed directly and indirectly to my work. Without their support it would
have been impossible for me to complete it, which is why I wish to dedicate this section to
recognizing their support.

First and foremost, I would like to thank my guide Prof. Shreyans Kumar Jain for guiding
me thoughtfully and efficiently through this project, giving me an opportunity to work at my
own pace along my own lines, while providing me with very useful directions whenever
necessary.

I offer my sincere thanks to all the other people who knowingly or unknowingly helped me
complete this project. I regard this project as a big milestone in my career development.
I will strive to use the skills and knowledge I have gained in the best possible way, and I will
continue to work on improving them in order to attain my career objectives. I hope to continue
cooperating with everyone in the future.

Saurabh Kumar
Dept. of Pharmaceutical Engineering & Technology
Indian Institute of Technology (BHU)
Varanasi-221005
2 Abstract

Heart-related diseases, or cardiovascular diseases, have been the main cause of a huge number of
deaths over the last few decades and have emerged as the most life-threatening
diseases, not only in India but in the whole world. There is therefore a need for a reliable,
accurate and feasible system to diagnose such diseases in time for proper treatment. Machine
learning algorithms and techniques have been applied to various medical datasets to automate the
analysis of large and complex data. In recent times, many researchers have been using
several machine learning techniques to help the health care industry and its professionals in
the diagnosis of heart-related diseases. This project presents various models based on such
algorithms and techniques and analyzes their performance.
3 Introduction

The heart is one of the main organs of the human body. It pumps blood through the blood
vessels of the circulatory system. The circulatory system is extremely important because it
transports blood, oxygen and other materials to the different organs of the body. The heart plays
the most crucial role in the circulatory system. If the heart does not function properly, it can
lead to serious health conditions, including death. Changes in lifestyle, work-related stress and
poor food habits contribute to the increasing rate of several heart-related diseases.

Medical organizations all around the world collect data on various health-related issues.
These data can be exploited using various machine learning techniques to gain useful
insights. However, the data collected are massive and often noisy. Such datasets, which are
too large for people to comprehend unaided, can be explored far more easily using machine
learning techniques. These algorithms have therefore become very useful in recent times for
predicting the presence or absence of heart-related diseases accurately.
4 Objective

The aim of the project is to explore the problems in the healthcare system and build a suitable
prototype to address them.

The objective is to implement various machine learning algorithms, namely logistic
regression, k-nearest neighbors, support vector machine, random forest and decision tree, in
order to predict heart disease. The project also includes data visualization of the Heart
Disease UCI dataset.
5 Types of Cardiovascular Diseases

Heart diseases, or cardiovascular diseases, are a class of diseases that involve the heart and
blood vessels. Cardiovascular disease includes coronary artery diseases such as angina and
myocardial infarction (commonly known as a heart attack). In coronary heart disease
(another name for coronary artery disease), a waxy substance called plaque develops inside the
coronary arteries, which are the arteries that supply oxygen-rich blood to the heart muscle.
When plaque builds up in these arteries, the condition is called atherosclerosis. The
development of plaque occurs over many years. With the passage of time, this plaque can
harden or rupture. Hardened plaque eventually narrows the coronary arteries, which in turn
reduces the flow of oxygen-rich blood to the heart. If the plaque ruptures, a blood clot can
form on its surface. A large blood clot can mostly or completely block blood flow
through a coronary artery. Over time, the ruptured plaque also hardens and narrows the
coronary arteries. If the blocked blood flow isn’t restored quickly, the affected section of heart
muscle begins to die. Without quick treatment, a heart attack can lead to serious health
problems and even death. Heart attack is a common cause of death worldwide. Some of the
common symptoms of a heart attack are as follows.

 Chest pain: This is the most common symptom of a heart attack. Someone who has a
blocked artery or is having a heart attack may feel pain, tightness or pressure in
the chest.

 Nausea, Indigestion, Heartburn and Stomach Pain: These are some of the often
overlooked symptoms of a heart attack. Women tend to show these symptoms more
than men.

 Pain in the Arms: The pain often starts in the chest and then moves towards the
arms, especially on the left side.

 Feeling Dizzy and Light Headed: These sensations can lead to a loss of balance.

 Fatigue: Tiredness brought on by simple chores should not be ignored.

 Sweating

Some other cardiovascular diseases which are quite common are stroke,
heart failure, hypertensive heart disease, rheumatic heart disease, cardiomyopathy,
cardiac arrhythmia, congenital heart disease, valvular heart disease, aortic
aneurysms, peripheral artery disease and venous thrombosis. Heart diseases develop
due to certain abnormalities in the functioning of the circulatory system or may be
aggravated by lifestyle choices such as smoking, certain eating habits, a sedentary
life and others. If heart diseases are detected early, they can be treated properly
and kept under control. Early detection is the key. Being well informed
about the causes of heart disease also helps in prevention.

6 Prevalence of Cardiovascular Diseases

An estimated 17.5 million deaths occur each year due to cardiovascular diseases worldwide. More
than 75% of deaths due to cardiovascular diseases occur in middle-income and low-income
countries. Also, 80% of the deaths due to cardiovascular diseases are caused by
stroke and heart attack. India too has a growing number of cardiovascular disease patients
added every year. Currently, the number of heart disease patients in India is more than 30
million. Over two lakh (200,000) open heart surgeries are performed in India each year. A matter of
growing concern is that the number of patients requiring coronary interventions has been
rising at 20% to 30% for the past few years.

7 Machine Learning Algorithms

Research on machine learning has led to the formulation of several machine learning
algorithms. These algorithms can be applied directly to a dataset to create models or
to draw conclusions and inferences from that dataset. Some popular machine learning
algorithms are Regression, Decision Tree, K Nearest Neighbor, Random Forest and Support
Vector Machine. They are discussed below.

 Regression: Regression is a statistical technique used to determine the strength
of the relationship between one dependent variable (usually denoted by Y) and a series
of other changing variables (known as independent variables). Two basic types of
regression are linear regression and polynomial regression. There are also several
non-linear regression methods, such as logistic regression, that are used for more
complicated data analysis. Logistic regression models the probability of a binary
outcome, which is why it is used here for classification.
 Decision Tree: A decision tree is a decision support tool that uses a tree-like graph
or model of decisions and their possible consequences, including chance event
outcomes and utility. It is one way to display an algorithm. Decision trees are
commonly used in operations research, specifically in decision analysis, to help
identify the strategy most likely to reach a goal. They are also a popular tool in
machine learning. A decision tree can easily be transformed into a set of rules by
tracing each path from the root node to a leaf node. By following these
rules, appropriate conclusions can be reached. Decision trees are easy to use and
implement, need little data preparation, resemble human reasoning, and can handle
categorical data.

 Support Vector Machine (SVM): It is a supervised learning method which
classifies data into two classes separated by a hyperplane. A support vector machine does
not use decision trees at all. It attempts to maximize the margin
(the distance between the hyperplane and the closest data points of each
class) to reduce the chance of misclassification.

 Random Forest: Random Forest is an ensemble learning method (also thought of as
a form of nearest neighbor predictor) for classification and regression. It
constructs a number of decision trees at training time and outputs the class that is the
mode of the classes output by the individual trees. It also tries to minimize the problems
of high variance and high bias by averaging to find a natural balance between the two
extremes. Both R and Python have robust packages that implement this algorithm.
Because it is simple and flexible, Random Forest is widely used for both
classification and regression.

 KNN (K Nearest Neighbors): KNN is a machine learning algorithm which is
widely used for classification. The algorithm finds the k training points nearest
to the point being classified and assigns that point to the class most common among
those neighbors. Nearness is defined by a similarity measure, such as the distance
between feature vectors.
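Two of the descriptions above, KNN and Random Forest, both end in a vote: KNN takes the most common class among the k nearest points, and a random forest takes the mode of its trees' outputs. The following is a minimal sketch of that shared idea in plain Python. The function names, the toy points and the choice of Euclidean distance are assumptions made for illustration; they are not taken from the project's notebook.

```python
import math
from collections import Counter


def majority_vote(labels):
    """Return the most common label (the mode) in `labels`."""
    return Counter(labels).most_common(1)[0][0]


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return majority_vote(nearest_labels)


# Hypothetical two-feature toy data: class 0 clusters low, class 1 clusters high.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
train_labels = [0, 0, 0, 1, 1, 1]

low_query_class = knn_predict(train_points, train_labels, (2.0, 2.0), k=3)
high_query_class = knn_predict(train_points, train_labels, (6.0, 7.0), k=3)

# A random forest combines its trees the same way: mode of the per-tree classes.
forest_class = majority_vote([1, 0, 1])

print(low_query_class, high_query_class, forest_class)
```

In practice the features would usually be scaled before computing distances, since an attribute with a large numeric range would otherwise dominate the distance.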
8 Material and Methodology

The following machine learning algorithms have been used:

 Logistic Regression
 K-nearest neighbors
 Support Vector Machine (SVM)
 Random Forest
 Decision Tree

For prediction of heart disease, the data were collected from the UC Irvine Machine Learning
Repository.

Data Set Information:


This database contains 76 attributes, but all published experiments refer to using a subset of
14 of them. In particular, the Cleveland database is the only one that has been used by ML
researchers to this date. The “target” field refers to the presence of heart disease in the
patient. It is integer valued from 0 (no presence) to 1 (presence). The names and social
security numbers of the patients were recently removed from the database, replaced with
dummy values.

Attribute’s Information:
 age: The person's age in years
 sex
 value ‘1’ = male
 value ‘0’ = female
 cp or chest pain type
 value ‘0’ = asymptomatic
 value ‘1’ = typical angina
 value ‘2’ = atypical angina
 value ‘3’ = non-anginal pain
 trestbps: The person's resting blood pressure (mm Hg on admission to the hospital)
 chol: The person's cholesterol measurement in mg/dl
 fbs: The person’s fasting blood sugar > 120 mg/dl
 value ‘0’ = false
 value ‘1’ = true
 restecg: Resting electrocardiographic measurement
 value ‘0’ = normal
 value ‘1’ = having ST-T wave abnormality
 value ‘2’ = showing probable or definite left ventricular hypertrophy by
Estes' criteria
 thalach: The person's maximum heart rate achieved
 exang: Exercise induced angina
 value ‘0’ = no
 value ‘1’ = yes
 oldpeak: ST depression induced by exercise relative to rest ('ST' relates to positions
on the ECG plot)
 slope: The slope of the peak exercise ST segment
 value ‘0’ = down sloping
 value ‘1’ = up sloping
 value ‘2’ = flat
 ca: The number of major vessels (0-3) colored by fluoroscopy
 thal: A blood disorder called thalassemia
 value ‘1’ = normal
 value ‘2’ = fixed defect
 value ‘3’ = reversible defect
 target: Heart Disease
 value ‘0’ = no heart disease
 value ‘1’ = heart disease present
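Because most attributes are stored as integer codes, it can help to translate them back into labels before plotting or inspecting rows. The sketch below does this with the standard library only. The code tables are copied from the attribute list above for a few of the columns; the two sample rows and the helper name `decode_row` are hypothetical and exist only to make the example runnable.

```python
import csv
import io

# Code-to-label tables copied from the attribute list above (subset of columns).
CODES = {
    "sex": {"0": "female", "1": "male"},
    "cp": {
        "0": "asymptomatic",
        "1": "typical angina",
        "2": "atypical angina",
        "3": "non-anginal pain",
    },
    "fbs": {"0": "false", "1": "true"},
    "exang": {"0": "no", "1": "yes"},
    "target": {"0": "no heart disease", "1": "heart disease present"},
}

# Hypothetical two-row sample; the values are made up for illustration.
SAMPLE_CSV = """age,sex,cp,trestbps,chol,fbs,exang,target
54,1,2,130,240,0,0,1
61,0,0,140,260,1,1,0
"""


def decode_row(row):
    """Return a copy of `row` with coded columns replaced by readable labels."""
    # Columns without a code table (age, chol, ...) pass through unchanged.
    return {col: CODES.get(col, {}).get(val, val) for col, val in row.items()}


rows = [decode_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for r in rows:
    print(r["age"], r["sex"], r["cp"], r["target"])
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle for the downloaded dataset.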
9 Work Plan

The heart disease data were collected from the UCI Machine Learning Repository. The
data were then analyzed through visualization plots and charts. Next, the machine
learning algorithms (logistic regression, k-nearest neighbors, support vector machine,
random forest and decision tree) were applied to the data, and each algorithm
was used to make predictions. A score was then calculated for each algorithm, and the
algorithms were ranked by their scores. On the basis of these scores, the models
were compared to see how accurately each algorithm predicts heart disease.
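The work plan can be sketched with scikit-learn roughly as follows. This is an illustration under stated assumptions, not the project's actual notebook: the data here are synthetic (generated by `make_classification` as a stand-in for the 13 predictors and the binary target), and the 80/20 split, the feature scaling, the hyperparameters and the use of accuracy as the score are all choices made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 13 predictors and a binary target.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)

# Hold out 20% of the rows for scoring (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Scale-sensitive models are wrapped in a pipeline that standardizes inputs.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "K Nearest Neighbors": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}


def score_models(models, X_train, X_test, y_train, y_test):
    """Fit each model and return its accuracy on the held-out rows."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)
    return scores


scores = score_models(models, X_train, X_test, y_train, y_test)

# Rank the algorithms from best to worst score.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name:<24}{100 * score:6.2f}")
```

Scores from a single split on a small dataset can shift by several points with a different random seed, so a ranking obtained this way is best read as indicative.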

10 Data Visualization

There is not much correlation in the data.

The plot points out that most of the heart patients are in their 40s and 50s.

The plot points out that most of the heart patients have atypical angina type chest pain.

The plot points out that most of the heart patients have a blood cholesterol level in the range
of 200-250 mg/dl.

The plot points out that heart disease is less likely when the fasting blood sugar level is
above 120 mg/dl.
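The counts behind an age-group plot like the one described above can be tabulated directly. The sketch below groups patients with heart disease by decade of age. The `(age, target)` pairs are hypothetical values chosen for illustration, not rows from the dataset, and the helper names are assumptions.

```python
from collections import Counter

# Hypothetical (age, target) pairs; target 1 means heart disease present.
records = [
    (43, 1), (47, 1), (52, 1), (58, 1), (55, 0),
    (61, 0), (66, 0), (38, 1), (49, 1), (54, 1),
]


def decade_label(age):
    """Map an age in years to its decade, e.g. 47 -> '40s'."""
    return f"{(age // 10) * 10}s"


def positives_by_decade(records):
    """Count patients with heart disease (target == 1) in each age decade."""
    counts = Counter()
    for age, target in records:
        if target == 1:
            counts[decade_label(age)] += 1
    return counts


counts = positives_by_decade(records)
for decade in sorted(counts):
    print(decade, counts[decade])
```

The same tally, fed to a bar chart, gives the kind of age-group plot the section refers to.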
11 Result

The following scores were obtained after implementing the machine learning
algorithms:

Algorithm                 Score (%)
Logistic Regression       86.88524590163934
Support Vector Machine    85.24590163934426
Decision Tree             80.32786885245902
K Nearest Neighbors       86.88524590163934
Random Forest             85.24590163934426
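The long decimals in these scores carry some information of their own. If each score is an accuracy, that is, the number of correctly classified test samples divided by the test set size (an assumption, since the report does not define the score), then every score must be a whole number divided by the same denominator. The sketch below searches for the smallest such denominator. The function name and the search limit are choices made for this example.

```python
# Scores copied from the table above (percent).
reported = {
    "Logistic Regression": 86.88524590163934,
    "Support Vector Machine": 85.24590163934426,
    "Decision Tree": 80.32786885245902,
    "K Nearest Neighbors": 86.88524590163934,
    "Random Forest": 85.24590163934426,
}


def smallest_test_size(scores, max_n=400, tol=1e-6):
    """Smallest n for which every score equals (whole number) / n, or None."""
    for n in range(1, max_n + 1):
        counts = [s * n / 100 for s in scores]
        if all(abs(c - round(c)) < tol for c in counts):
            return n
    return None


n = smallest_test_size(reported.values())

# Implied number of correct predictions per algorithm at that test size.
correct = {name: round(s * n / 100) for name, s in reported.items()}
print(n, correct)
```

Under that assumption the scores are consistent with a test set of 61 samples, with 53, 52 and 49 correct predictions for the three distinct score values. The top two algorithms would then differ from the next two by a single test sample, so the ranking among them should be read with caution.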

The complete project, with its Jupyter Notebook, is uploaded to a GitHub repository at the link
given below:
https://github.com/bhaveshpancholi/Heart-Disease-Prediction/blob/master/Heart%20Disease%20Prediction.ipynb

12 Conclusion

After implementing several machine learning algorithms, the highest score, about 0.869, was
achieved by both Logistic Regression and K Nearest Neighbors. The lowest score, about 0.803,
was obtained by Decision Tree.
