HDP Report
A PROJECT REPORT
Submitted by
of
BACHELOR OF TECHNOLOGY
IN
JANUARY 2021
HEART DISEASE PREDICTION
USING MACHINE LEARNING
A Project Submitted
BACHELOR OF TECHNOLOGY
In
Computer Science & Engineering
By
TULA’S INSTITUTE
THE ENGINEERING AND MANAGEMENT COLLEGE
(DEHRADUN)
Jan, 2021
UTTARAKHAND TECHNICAL UNIVERSITY
CERTIFICATE
MR. LOKESH KUMAR
HEAD OF THE DEPARTMENT
COMPUTER SCIENCE ENGINEERING
TULA’S INSTITUTE OF TECHNOLOGY
AND MANAGEMENT, DHOOLKUT,
DEHRADUN, 248011
COMPUTER SCIENCE ENGINEERING
TULA’S INSTITUTE OF TECHNOLOGY
AND MANAGEMENT, DHOOLKUT,
DEHRADUN, 248011
______________
SIGNATURE
CERTIFICATE DECLARATION
I declare that this written submission represents my work and ideas in my own words, and where
others' ideas and work have been included, I have adequately cited and referenced the
sources. I also declare that I have adhered to all principles of academic honesty and integrity and
have not misrepresented, fabricated, or falsified any idea, data, fact, or source in my submission. I
understand that any violation of the above will be cause for disciplinary action by the University
and can also evoke penal action from the sources which have thus not been properly cited or from
whom proper permission has not been taken when needed.
(Signature)
AFTAB ALAM KHAN (201704076)
ALKA JOSHI (201704079)
AMIT RAI (201704041)
GAURAV KUMAR (201704040)
ACKNOWLEDGMENT
I take this opportunity to remember the Almighty and my parents, who bestowed the strength, courage,
and perseverance to undertake the present course of study and complete it successfully.
First and foremost, I would like to express a deep sense of gratitude and sincere regard toward
the Director, for his sustained guidance, meticulous supervision, and constant encouragement during
my dissertation work.
My special thanks to Mr. Lokesh Kumar (HOD, CSE), Mr. Sachin Kumar (Project
Coordinator, CSE) and Ms. Akansha Singh (Project Co-coordinator, CSE), who helped me in
the execution of this work by providing the necessary facilities with full cooperation, and for
their excellent guidance, encouragement, inspiring motivation, valuable suggestions and painstaking
help throughout this project work.
I express my heartfelt and sincere thanks to my project guide Mr. Ram Narayan Pal (Tula’s
Institute of Engineering and Management, Dehradun) for his excellent guidance, care,
patience, and for providing me with an excellent atmosphere for completion of this dissertation project.
I also thank the teaching and non-teaching staff members of the department for their kind
cooperation, and others who helped me through the course of the minor project work.
Without their encouragement, patience, and moral support, it would not
have been possible for me to complete my dissertation.
Date: _/_/2021
Submitted By –
AFTAB ALAM KHAN (201704076)
ALKA JOSHI (201704079)
AMIT RAI (201704041)
GAURAV KUMAR (201704040)
ABSTRACT
The heart and brain are crucial organs with the highest priority in the human body. Heart
disease, in which the heart is unable to push the required amount of blood to other parts of the
body, is among the most lethal health problems in the world. The diagnosis of heart disease through
traditional methods has not been considered reliable in many respects. Nowadays there are various
machine learning techniques and tools available to extract effective information for accurate
diagnosis and decision making.
In this paper, machine learning techniques such as Artificial Neural Network (ANN),
Decision Tree, K-Nearest Neighbour (KNN), Naive Bayes, and Support Vector Machine (SVM)
are used. The main objective of this paper is to predict whether a patient has heart
disease. After testing all these models, the model with the best accuracy is taken for the prediction
of heart disease.
Today the healthcare industry is information-rich yet still poor in knowledge, and most of
the data are not publicly available. This paper uses the Cleveland dataset containing 303
individuals and 14 attributes such as age, sex, chest pain type (cp), resting blood pressure (trestbps),
cholesterol (chol), resting electrocardiographic results (restecg), maximum heart rate achieved (thalach),
ST depression induced by exercise (oldpeak), exercise induced angina (exang), slope of the peak
exercise ST segment (slope), number of vessels coloured by fluoroscopy (ca), and thalassemia (thal).
LIST OF FIGURES
Fig. No Description
2.1 Data obtained from the Cleveland dataset.
2.2 Displays the number of rows and columns, as well as the column names.
2.3 Returns the number of unique values for each variable.
2.4 Summarizes the count, mean, standard deviation, min, and max for numeric variables.
2.5 Hungarian dataset before and after applying SSV to CSV conversion.
2.6 Overall records processed.
2.7 Records achieved after excluding missing values from the joined dataset.
2.8 Displays the number of missing values for each column.
2.9 Data after handling missing values.
2.10 It appears we have a good balance between the two binary outputs.
3.1 Calculated correlation matrix
3.2 SNS heatmap of correlation matrix
3.3 Pairplot of positive or negative correlation
3.4 Heart disease affected patient sex ratio.
3.5 Variation in age for each target class.
3.6 Outliers in oldpeak vs. heart disease
3.7 Maximum heart rate achieved (thalach) vs. target
3.8 Data of positive heart disease patient
3.9 Data of negative heart disease patient.
3.10 Positive and negative ST depression
3.11 Positive and negative max heart rate
5.1 Logistic Regression Prediction [Accuracy 85%]
5.2 K-NN Prediction [Accuracy 88%]
5.3 SVM Prediction [Accuracy 85%]
5.4 Naïve Bayes Classifier Prediction [Accuracy 83%]
5.5 Decision Tree Prediction [Accuracy 77%]
5.6 Random Forest Prediction [Accuracy 88%]
5.7 Models Accuracy Graph
7.1 Confusion Matrix and Accuracy Score.
8.1 Feature Importance Score
8.2 Top 4 significant features concluded from the Feature Importance graph were chest pain type (cp), maximum heart rate achieved (thalach), number of major vessels (ca), and ST depression induced by exercise relative to rest (oldpeak).
9.1 Providing input on designed Web Page for Prediction.
9.2 Output Positive
9.3 Random Forest Model predicted data
TABLE OF CONTENTS
CHAPTER NO TITLE
ABSTRACT
LIST OF FIGURES
1 INTRODUCTION
1.1 Scenario
1.2 Goals
1.3 Features and Predictor
2 DATA WRANGLING
2.1 Data conjunction
2.2 Handling missing values
5 MODELLING / TRAINING
5.1 Model 1 - Logistic Regression
5.2 Model 2 - K-NN
5.3 Model 3 - SVM (Support Vector Machine)
5.4 Model 4 - Naive Bayes Classification
5.5 Model 5 - Decision Trees
5.6 Model 6 - Random Forest
6 PRECISION, RECALL, F1 SCORE AND SUPPORT
8 FEATURE IMPORTANCE
9 PREDICTIONS
9.1 Scenario
9.2 Predicting the test set results
10 CONCLUSIONS
11 LIST OF REFERENCES
Chapter 1
1. INTRODUCTION
The work done in our project focuses mainly on the data analysis processes that
are used in heart disease prediction. Changes in lifestyle, work-related stress and bad
food habits contribute to the rising rate of several heart-related diseases. Heart
diseases have emerged as one of the most prominent causes of death all around the
world. According to the World Health Organization (WHO), heart-related diseases are
responsible for taking 17.7 million lives every year, 31% of all global deaths. In
India too, heart-related diseases have become the leading cause of mortality.
Estimates made by the WHO suggest that India
lost up to $237 billion from 2005 to 2015 due to heart-related or cardiovascular
diseases.
The main challenge in today's healthcare is the provision of best-quality services and
effective, accurate diagnosis. Even though heart diseases have been the prime cause of
death in the world in recent years, they are also ones that can be controlled and
managed effectively. How well a disease can be managed depends on
how early it is detected. This project is dedicated to detecting
heart diseases at an early stage to avoid disastrous consequences. If such a
prediction is accurate enough, we can not only avoid wrong diagnoses but also save
human resources. When a patient without heart disease is diagnosed with heart
disease, he will fall into unnecessary panic, and when a patient with heart disease is
not diagnosed with heart disease, he will miss the best chance to cure his disease.
Such wrong diagnoses are painful to both patients and hospitals. With accurate
predictions, we can avoid this unnecessary trouble. Besides, if we can apply our
machine learning tool to medical prediction, we will save human resources because
we do not need a complicated diagnosis process in hospitals. The input to our
algorithm is 13 features with numeric values. We use several algorithms such as
Logistic Regression, SVM, Naïve Bayes, Random Forest and Artificial Neural
Network to output a binary number, 1 or 0: 1 indicates that the patient has heart disease
and 0 indicates that the patient does not.
1.1 Scenario:
1.2 Goal:
1.3 Features and Predictor:
1. age (#)
2. sex : 1 = Male, 0 = Female (Binary)
3. (cp) : chest pain type (4 values, Ordinal): Value 1: typical angina, Value 2:
atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic
4. (trestbps) : resting blood pressure (#)
5. (chol) : serum cholesterol in mg/dl (#)
6. (fbs) : fasting blood sugar > 120 mg/dl (Binary) (1 = true; 0 = false)
7. (restecg) : resting electrocardiography results (values 0, 1, 2)
8. (thalach) : maximum heart rate achieved (#)
9. (exang) : exercise induced angina (Binary) (1 = yes; 0 = no)
10. (oldpeak) : ST depression induced by exercise relative to rest (#)
11. (slope) : slope of the peak exercise ST segment (Ordinal) (Value 1: up sloping,
Value 2: flat, Value 3: down sloping)
12. (ca) : number of major vessels (0–3, Ordinal) colored by fluoroscopy
13. (thal) : thalassemia (Ordinal): 3 = normal; 6 = fixed
defect; 7 = reversible defect
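The feature list above can be turned into column names when the raw file is loaded. The sketch below is only an illustration: the two rows are stand-ins for the real file, the name COLUMNS is hypothetical, and treating '?' as the missing-value marker is an assumption about the raw data.

```python
import io
import pandas as pd

# Column order follows the feature list; 'target' is the column we predict.
COLUMNS = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
           'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']

# Illustrative rows standing in for the real data file.
# Assumption: a '?' in the file marks a missing value.
raw = io.StringIO(
    "63,1,1,145,233,1,2,150,0,2.3,3,0,6,0\n"
    "67,1,4,160,286,0,2,108,1,1.5,2,?,3,2\n"
)

df = pd.read_csv(raw, header=None, names=COLUMNS, na_values='?')
print(df.shape)                # rows, columns
print(df['ca'].isnull().sum()) # number of missing 'ca' values
```

Naming the columns at load time keeps later steps such as df['ca'] and df['target'] readable.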
Chapter 2
2. Data Wrangling
i. import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Fig. 2.2 Displays Number of Rows & Columns as well as the Column names.
iv. df.nunique(axis=0) # returns the number of unique values for each variable.
Fig. 2.3 Returns the number of unique values for each variable.
v. df.describe() #summarizes the count, mean, standard deviation, min, and max for numeric variables
Fig. 2.4 Summarizes the count, mean, standard deviation, min, and max for numeric variables.
Class Distribution:
Database: 0 1 2 3 4 Total
Cleveland: 164 55 36 35 13 303
Hungarian: 188 37 26 28 15 294
Switzerland: 8 48 32 30 5 123
Long Beach VA: 51 56 41 42 10 200
Total: 920
input_file.close()
output_file.close()
Fig. 2.5 Hungarian dataset before and after applying SSV (space-separated values) to CSV conversion.
We joined all the files expecting to get more records than
Cleveland alone, but the joined data has too many missing values; after excluding
the records with missing values we get a total of only 299 records, so we decided to use
only the Cleveland dataset.
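The conversion shown in Fig. 2.5 can be sketched as follows. This is a minimal illustration rather than the project's actual script: the function name ssv_to_csv is hypothetical, the sample rows are made up, and it assumes that fields are separated by whitespace only.

```python
import csv
import io

def ssv_to_csv(input_file, output_file):
    """Rewrite whitespace-separated records as comma-separated records."""
    writer = csv.writer(output_file, lineterminator="\n")
    rows_written = 0
    for line in input_file:
        fields = line.split()   # split on any run of spaces or tabs
        if not fields:          # skip blank lines
            continue
        writer.writerow(fields)
        rows_written += 1
    return rows_written

# In-memory files stand in for the real input and output files.
sample = io.StringIO("63 1 1 145 233\n67 1 4 160 286\n")
converted = io.StringIO()
count = ssv_to_csv(sample, converted)
print(count)
print(converted.getvalue())
```

With real files, the same function can be called on objects returned by open(), closing both afterwards.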
output.write(corrected_line+"\n")
input_.close()
print("Total record read:",total_input)
output.close()
print("record achieved after joining and excluding missing value in record:\n",heartdf[0].count())
Fig. 2.7 Records achieved after excluding missing values from the joined dataset.
Fig. 2.8 Displays the number of missing values for each column.
Finally, we converted the non-zero target values (1, 2, 3 and 4) to 1 and kept 0 as 0.
i. df['ca'] = df['ca'].fillna(1.5)      # fill missing 'ca' values with 1.5
df['thal'] = df['thal'].fillna(3.0)  # fill missing 'thal' values with 3.0 (normal)
# replace 1,2,3,4 -> 1 and 0 -> 0 in target for binary prediction
def rep_target(x):
    if x == 0:
        return 0
    else:
        return 1
df['target'] = df['target'].apply(rep_target)
ii. df['target'].value_counts()
Fig. 2.10 It appears we have a good balance between the two binary outputs.
Chapter 3
3.1 Correlations
i. df.corr()
plt.figure(figsize=(13,7))
sns.heatmap(df.corr(),annot=True)
We can see there is a positive correlation between chest pain type (cp) & target (the
value we predict). This suggests that the chance of having heart disease rises with the
cp value. Because cp is a coded category rather than a measure of pain intensity, the
correlation should be read with that coding in mind.
Cp (chest pain) is an ordinal feature with 4 values:
Value 1: typical angina
Value 2: atypical angina
Value 3: non-anginal pain
Value 4: asymptomatic.
Pair plots are also a great way to see the relationships between all
variables at a glance. With so many features it can be difficult to read each
panel, so we make a pair plot with only the continuous columns of our data.
ii. subData = df[['age','trestbps','chol','thalach','oldpeak']]
sns.pairplot(subData)
Fig. 3.3 A smaller pair plot with only the continuous variables, to dive deeper into the relationships.
It is also a great way to see whether a correlation is positive or negative.
3.2 Count Plot
iii. sns.countplot(x='sex',hue='target',data=df)
iv. plt.figure(figsize=(20,8))
sns.countplot(x='age',data=df, hue='target')
We saw that most people who are suffering are 58 years old,
followed by those aged 57.
The majority of people suffering from the disease belong to the
50+ age group.
3.3 Violin & Box Plots
The advantage of Box & Violin plots is that they show
the basic statistics of the data as well as its distribution. These plots are often used
to compare the distribution of a given variable across some categories.
They show the median and the interquartile range (IQR) as part of the five-number summary:
minimum, first quartile (Q1), median, third quartile (Q3), and maximum.
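The statistics that these plots draw can also be computed directly. The sketch below is an illustration only: the helper name five_number_summary and the sample values are hypothetical and are not taken from the dataset.

```python
import numpy as np

def five_number_summary(values):
    """Return min, Q1, median, Q3, max and the IQR of a sequence of numbers."""
    arr = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(arr, [25, 50, 75])
    return {
        "min": float(arr.min()),
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "max": float(arr.max()),
        "iqr": float(q3 - q1),  # interquartile range = Q3 - Q1
    }

# Made-up oldpeak-like values, evenly spaced for easy checking.
sample = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
summary = five_number_summary(sample)
print(summary)
```

For this sample the quartiles fall at 1.0, 2.0 and 3.0, so the IQR is 2.0. The same summary applied to df['oldpeak'] grouped by target would describe what the violin plot shows.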
i. plt.figure(figsize=(8,4))
sns.violinplot(x='target',y='oldpeak',inner='quartile',hue='sex',data=df)
plt.title("Oldpeak vs Heart_Disease(target)")
ii. plt.figure(figsize=(8,4))
sns.boxplot(x='target',y='thalach',hue='sex',data=df)
plt.title("Thalach vs Target")
Fig. 3.7 Maximum heart rate achieved (thalach) vs. Target
Fig. 3.9 Data of negative heart disease patient.
Chapter 4
Assign
Assign the 13 features to X, & the last column to our classification predictor, y
i. X=df.drop('target',axis=1)
y=df['target']
Split
Split the dataset into a training set and a test set.
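A minimal sketch of the split step is given here, since the snippet for it is not reproduced in this section. The 80/20 ratio, the random_state value and the use of stratify are assumptions; the data is synthetic and only mimics the shape of the Cleveland data (303 rows, 13 features).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X (13 features) and y (binary target).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(303, 13)),
                 columns=[f"f{i}" for i in range(1, 14)])
y = pd.Series(rng.integers(0, 2, size=303), name="target")

# Assumption: hold out 20% of the rows for testing and keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)
```

With 303 rows, a 20% test share gives 61 test records, which matches the total of the confusion matrix counts reported in Chapter 7.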
Scaling
Standardize features by removing the mean and scaling to unit variance.
z = (x - u) / s
where u is the mean of the training samples and s is their standard deviation.
iii. from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
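The scaling step can be checked by hand with the formula z = (x - u) / s. The sketch below uses made-up values for two features; it computes u and s on the training rows only and reuses them for the test row, which is what fitting the scaler on X_train and then transforming X_test does.

```python
import numpy as np

# Made-up training and test rows for two features.
X_train = np.array([[120.0, 200.0],
                    [130.0, 240.0],
                    [140.0, 280.0]])
X_test = np.array([[150.0, 220.0]])

u = X_train.mean(axis=0)  # per-feature mean of the training set
s = X_train.std(axis=0)   # per-feature population standard deviation (ddof=0)

X_train_scaled = (X_train - u) / s
X_test_scaled = (X_test - u) / s  # test data reuses the training u and s

print(X_train_scaled.mean(axis=0))  # close to 0 for every feature
print(X_train_scaled.std(axis=0))   # close to 1 for every feature
print(X_test_scaled)
```

Reusing the training statistics keeps information from the test set out of the preprocessing step.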
Chapter 5
5. Modeling /Training
Now we train various classification models on the training set and see
which yields the highest accuracy. We compare the accuracy of Logistic
Regression, K-NN (k-Nearest Neighbours), SVM (Support Vector Machine),
Naive Bayes Classifier, Decision Trees, and Random Forest.
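The comparison described above can be sketched as a loop over the six classifiers. This is only an illustration under stated assumptions: the data is synthetic (generated with make_classification), all models use default settings apart from a fixed random_state, and the accuracies it prints are not the results reported in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic data with the same shape as the Cleveland data (303 x 13).
X, y = make_classification(n_samples=303, n_features=13,
                           n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scale with statistics learned from the training set only.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(),
    "K-NN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from most to least accurate.
for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.2f}")
```

Keeping the models in a dictionary makes it easy to add or remove a classifier without changing the evaluation loop.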
5.2 Model 2: K-NN (K-Nearest Neighbors)
Fig.5.3 SVM Prediction [Accuracy 85%]
5.5 Model 5: Decision Trees
Fig.5.6 Random Forest Prediction [Accuracy 88%]
Chapter 6
6.1 Precision: of all the samples predicted to belong to a class, the fraction that
actually belong to that class.
6.2 Recall: of all the samples that actually belong to a class, the fraction that
the model finds.
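As a worked example, precision and recall for the positive class can be computed from the confusion matrix counts reported in Chapter 7 (27 true positives, 7 false positives, 2 false negatives). The F1 score is the harmonic mean of precision and recall. The function name below is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from its counts."""
    precision = tp / (tp + fp)  # correct positives among predicted positives
    recall = tp / (tp + fn)     # correct positives among actual positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = precision_recall_f1(tp=27, fp=7, fn=2)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.794 recall=0.931 f1=0.857
```

The sketch assumes that none of the denominators is zero, which holds for these counts.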
Chapter 7
Note: As a rule of thumb, any accuracy above 70% is considered good, but be careful:
if your accuracy is extremely high, it may be too good to be true (a sign of
overfitting). Thus, an accuracy of around 90% is a reasonable target.
27 is the number of True Positives in our data, while 25 is the number of True
Negatives.
7 & 2 are the numbers of errors.
There are 7 Type 1 errors (False Positives): the model predicted positive and the
prediction was wrong.
There are 2 Type 2 errors (False Negatives): the model predicted negative and the
prediction was wrong.
Hence, the accuracy is the number of correct predictions divided by the total number
of predictions:
Accuracy = (TP + TN)/(TP + TN + FP + FN)
where TP, FN, FP and TN represent the number of true positives,
false negatives, false positives and true negatives.
Accuracy = (27+25)/(27+25+7+2) = 52/61 ≈ 0.85 = 85% accuracy
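The same calculation can be reproduced from label arrays. In the sketch below the two arrays are constructed only to reproduce the counts above; they are not the project's actual test labels, and the helper name confusion_counts is hypothetical.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = disease, 0 = no disease)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, tn, fp, fn

# Labels built to give 27 TP, 25 TN, 7 FP and 2 FN.
y_true = np.array([1] * 27 + [0] * 25 + [0] * 7 + [1] * 2)
y_pred = np.array([1] * 27 + [0] * 25 + [1] * 7 + [0] * 2)

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, round(accuracy, 4))  # 27 25 7 2 0.8525
```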
Chapter 8
8. Feature Importance
Feature Importance provides a score that indicates how helpful each feature
was in our model.
The higher the Feature Score, the more that feature is used to make key
decisions & thus the more important it is.
# get importance
i. importance = RFC_Model.feature_importances_
# summarize feature importance
ii. for i, score in enumerate(importance, start=1):
    print('Feature:%0d\tScore:%0.5f' % (i, score))
iii. index= df.columns[:-1]
sns.barplot(x=importance,y=index,palette='Blues_r')
Fig. 8.2 From the Feature Importance graph, the top 4 significant features were ST depression induced by
exercise (oldpeak), chest pain type (cp), maximum heart rate achieved (thalach), and number of major vessels
colored by fluoroscopy (ca).
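Ranking the scores by name makes the top features easier to read than a list of feature indexes. The sketch below is illustrative only: it trains a Random Forest on synthetic data whose target is driven by the cp and oldpeak columns, so its ranking is not the ranking found in this project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']

# Synthetic data: random features, target driven by two of the columns.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(303, 13)), columns=feature_names)
y = ((X['cp'] + X['oldpeak']) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each score with its feature name and sort from most to least important.
ranked = pd.Series(model.feature_importances_, index=feature_names)
ranked = ranked.sort_values(ascending=False)
top4 = ranked.head(4)
print(top4)
```

The same two lines that build and sort ranked can be applied to RFC_Model.feature_importances_ with df.columns[:-1] as the index.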
Chapter 9
9. Predictions
9.1 Scenario:
A patient develops cardiac symptoms & you input his vitals into the
Machine Learning Algorithm.
He is a 20-year-old male, with a chest pain value of 2 (atypical angina)
and a resting blood pressure of 110.
In addition, he has a serum cholesterol of 230 mg/dl.
His fasting blood sugar is > 120 mg/dl.
He has a resting electrocardiographic result of 1.
The patient's maximum heart rate achieved is 140.
He also has exercise induced angina.
His ST depression induced by exercise relative to rest is 2.2.
The slope of the peak exercise ST segment is flat.
He has no major vessels colored by fluoroscopy, and in addition
his thalassemia (thal) result is a reversible defect.
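The scenario above can be encoded as a single input row using the feature coding from Section 1.3 (for example, flat slope = 2 and reversible defect = 7). The sketch only builds the row; the variable names are hypothetical.

```python
import pandas as pd

feature_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']

patient = {
    'age': 20, 'sex': 1,  # 20-year-old male
    'cp': 2,              # atypical angina
    'trestbps': 110,      # resting blood pressure
    'chol': 230,          # serum cholesterol in mg/dl
    'fbs': 1,             # fasting blood sugar > 120 mg/dl
    'restecg': 1,         # resting electrocardiographic result
    'thalach': 140,       # maximum heart rate achieved
    'exang': 1,           # exercise induced angina: yes
    'oldpeak': 2.2,       # ST depression induced by exercise
    'slope': 2,           # flat
    'ca': 0,              # no major vessels colored
    'thal': 7,            # reversible defect
}

# One row, with columns in the same order as the training features.
new_data = pd.DataFrame([patient], columns=feature_names)
print(new_data.shape)  # (1, 13)
```

Before prediction, this row has to be transformed with the scaler that was fitted on the training set.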
Fig.9.1 Providing input on designed Web Page for Prediction.
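from flask import request, render_template
import numpy as np

# app, scaler, model and word() are defined elsewhere in the application
@app.route('/predict',methods=['POST'])
def predict():
    # form fields arrive as strings; oldpeak is a decimal, so parse as float
    new_data = [float(i) for i in request.form.values()]
    new_data = np.array(new_data)
    new_data = new_data.reshape(1,-1)
    # apply the scaler fitted on the training set before predicting
    new_data_scaled = scaler.transform(new_data)
    prediction = model.predict(new_data_scaled)
    return render_template('result.html', prediction=word(prediction))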
iv. y_pred = RFC_Model.predict(X_test)
y_pred = y_pred.reshape(len(y_pred),1)
y_test = np.array(y_test).reshape(len(y_test),1)
concated = np.column_stack((y_test,y_pred))
dataset = pd.DataFrame(concated,columns=['actual','prediction'])
Chapter 10
10. Conclusions
Out of the 13 features we examined, the top 4 significant features that helped us
classify between a positive & negative diagnosis were chest pain type (cp),
maximum heart rate achieved (thalach), number of major vessels colored by
fluoroscopy (ca), and ST depression induced by exercise relative to rest (oldpeak).
Our machine learning algorithm can now classify patients with heart disease.
This can help to diagnose patients & get them the help they need to recover.
By detecting these features early, we may prevent worse symptoms
from arising later.
Our Random Forest algorithm yields the highest accuracy, 88%. Any accuracy
above 70% is considered good, but we have to be careful: if our accuracy
is extremely high, it may be too good to be true (a sign of overfitting).
Thus, an accuracy of around 90% is a reasonable target.
List of References
I. Ramadoss A, Shah B, et al., “Responding to the threat of chronic diseases
in India”, Lancet, 2005; 366:1744–1749. DOI: 10.1016/S0140-6736(05)67343-6.
III. Avinash Golande, Pavan Kumar T, “Heart Disease Prediction Using Effective
Machine Learning Techniques”, International Journal of Recent Technology
and Engineering, Vol 8, pp. 944-950, 2019.