HDP Report

HEART DISEASE PREDICTION
USING MACHINE LEARNING
A PROJECT REPORT
Submitted by
AFTAB ALAM KHAN

ALKA JOSHI
AMIT RAI
GAURAV KUMAR
In partial fulfillment for the award of the degree
of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE & ENGINEERING
TULA’S INSTITUTE OF TECHNOLOGY AND MANAGEMENT
UTTARAKHAND TECHNICAL UNIVERSITY
JANUARY 2021
HEART DISEASE PREDICTION
USING MACHINE LEARNING
A Project Submitted
In Partial Fulfillment of the Requirements

for the Degree of
BACHELOR OF TECHNOLOGY
In
Computer Science & Engineering
By
AFTAB ALAM KHAN

ALKA JOSHI
AMIT RAI
GAURAV KUMAR
(Enrollment Nos. 170120101010, 170120101011, 170120101013, 170120101033)
Under the Supervision of

Mr. Ram Narayan Pal
FACULTY OF COMPUTER SCIENCE & ENGINEERING
TULA’S INSTITUTE
THE ENGINEERING AND MANAGEMENT COLLEGE
(DEHRADUN)
Jan, 2021
CERTIFICATE
Certified that this project report “HEART DISEASE PREDICTION USING
MACHINE LEARNING” is the bona fide work of “AFTAB ALAM KHAN,
ALKA JOSHI, AMIT RAI, GAURAV KUMAR”, who carried out the
project work under my supervision.
______________
SIGNATURE
MR. LOKESH KUMAR
HEAD OF THE DEPARTMENT
COMPUTER SCIENCE ENGINEERING
TULA’S INSTITUTE OF TECHNOLOGY
AND MANAGEMENT, DHOOLKUT,
DEHRADUN, 248011

______________
SIGNATURE
MR. RAM NARAYAN PAL
SUPERVISOR
COMPUTER SCIENCE ENGINEERING
TULA’S INSTITUTE OF TECHNOLOGY
AND MANAGEMENT, DHOOLKUT,
DEHRADUN, 248011
CERTIFICATE DECLARATION
I declare that this written submission represents my work and ideas in my own words, and where
others' ideas and work have been included, I have adequately cited and referenced the
source. I also declare that I have adhered to all principles of academic honesty and integrity and
have not misrepresented, fabricated, or falsified any idea, data, fact, or source in my submission. I
understand that any violation of the above will be cause for disciplinary action by the University
and can also evoke penal action from the sources which have thus not been properly cited or from
whom proper permission has not been taken when needed.
(Signature)
AFTAB ALAM KHAN (201704076)
ALKA JOSHI (201704079)
AMIT RAI (201704041)
GAURAV KUMAR (201704040)
Department of Computer Science and Engineering

Date: _______
ACKNOWLEDGMENT
I take this opportunity to remember the Almighty and my parents, who bestowed the strength, courage,
and perseverance to undertake the present course of study and complete it successfully.
First and foremost, I would like to express a deep sense of gratitude and sincere regard toward the
Director, for his sustained guidance, meticulous supervision, and constant encouragement during
my dissertation work.
My special thanks to Mr. Lokesh Kumar (HOD, CSE), Mr. Sachin Kumar (Project
Coordinator, CSE) and Ms. Akansha Singh (Project Co-coordinator, CSE), who helped me in
the execution of this work by providing the necessary facilities with full cooperation, and for
their excellent guidance, encouragement, inspiring motivation, valuable suggestions and painstaking
help throughout this project work.
I express my heartfelt and sincere thanks to my project guide Mr. Ram Narayan Pal (Tula’s
Institute of Engineering and Management, Dehradun) for his excellent guidance, care,
patience, and for providing me an excellent atmosphere for completion of this dissertation project.
I also thank the teaching and non-teaching staff members of the department for their kind
cooperation, and others who helped me through the course of the minor project work.
Without their encouragement, patience, and moral support it would not
have been possible for me to complete my dissertation.
Date: _/_/2021
Submitted By –
AFTAB ALAM KHAN (201704076)
ALKA JOSHI (201704079)
AMIT RAI (201704041)
GAURAV KUMAR (201704040)
ABSTRACT
The heart and brain are crucial organs with the highest priority in the human body. Heart
disease is among the most lethal health problems in the world; in heart disease, the heart is unable
to push the required amount of blood to other parts of the body. The diagnosis of heart disease through
traditional methods has not been considered reliable in many respects. Nowadays there are various
machine learning techniques and tools available to extract effective information for accurate
diagnosis and decision making.
In this paper, machine learning techniques such as Artificial Neural Network (ANN),
Decision Tree, K-Nearest Neighbour (KNN), Naive Bayes, and Support Vector Machine (SVM)
are used. The main objective of this paper is to predict whether a patient has heart
disease. After testing all these models, the model with the best accuracy is taken for the prediction
of heart disease.
Today the healthcare industry is information-rich yet still poor in knowledge, and most of
its data are not publicly available. This paper uses the Cleveland dataset containing 303
individuals and 14 attributes such as age, sex, chest pain type (cp), resting blood pressure (trestbps),
cholesterol, resting electrocardiographic results (restecg), maximum heart rate, ST depression
induced by exercise, exercise induced angina, slope of the peak exercise ST segment, number of vessels
coloured by fluoroscopy, and thalassemia.
LIST OF FIGURES
Fig. No.  Description
2.1   Data obtained from the Cleveland dataset.
2.2   Displays the number of rows and columns, as well as the column names.
2.3   Returns the number of unique values for each variable.
2.4   Summarizes the count, mean, standard deviation, min, and max for numeric variables.
2.5   Hungarian dataset before and after applying SSV to CSV conversion.
2.6   Overall records processed.
2.7   Records obtained after excluding missing values from the joined dataset.
2.8   Displays the number of missing values for each column.
2.9   Data after handling missing values.
2.10  It appears we have a good balance between the two binary outputs.
3.1   Calculated Correlation matrix
3.2   SNS Heatmap of Correlation Matrix
3.3   Pairplot of positive or negative correlation
3.4   Heart Disease affected patient Sex ratio.
3.5   Variation in age for each target class.
3.6   Outliers in OldPeak vs. Heart Disease
3.7   Maximum heart rate achieved (thalach) vs. Target
3.8   Data of positive heart disease patients.
3.9   Data of negative heart disease patients.
3.10  Positive and Negative ST depression
3.11  Positive and Negative Max Heart rate
5.1   Logistic Regression Prediction [Accuracy 85%]
5.2   K-NN Prediction [Accuracy 88%]
5.3   SVM Prediction [Accuracy 85%]
5.4   Naïve Bayes Classifier Prediction [Accuracy 83%]
5.5   Decision Tree Prediction [Accuracy 77%]
5.6   Random Forest Prediction [Accuracy 88%]
5.7   Models Accuracy Graph
7.1   Confusion Matrix and Accuracy Score.
8.1   Feature Importance Score
8.2   Top 4 significant features concluded from the Feature Importance graph were chest
      pain type (cp), maximum heart rate achieved (thalach), number of major vessels
      (ca), and ST depression induced by exercise relative to rest (oldpeak).
9.1   Providing input on the designed Web Page for Prediction.
9.2   Output Positive
9.3   Random Forest Model predicted data
TABLE OF CONTENTS
CHAPTER NO.   TITLE
              ABSTRACT
              LIST OF FIGURES
1             INTRODUCTION
              1.1 Scenario
              1.2 Goals
              1.3 Features and Predictor
2             DATA WRANGLING
              2.1 Data conjunction
              2.2 Handling missing values
3             EXPLORATORY DATA ANALYSIS
              3.1 Correlations
              3.2 Count Plot
              3.3 Violin and Box plots
              3.4 Filtering data by positive and negative heart disease patient
4             MACHINE LEARNING + PREDICTIVE ANALYSIS
              4.1 Prepare data for modelling
5             MODELLING / TRAINING
              5.1 Model 1 - Logistic Regression
              5.2 Model 2 - K-NN
              5.3 Model 3 - SVM (Support Vector Machine)
              5.4 Model 4 - Naive Bayes classification
              5.5 Model 5 - Decision Trees
              5.6 Model 6 - Random Forest
6             PRECISION, RECALL, F1 SCORE AND SUPPORT
7             MAKING THE CONFUSION MATRIX
              7.1 How to interpret Confusion Matrix
8             FEATURE IMPORTANCE
9             PREDICTIONS
              9.1 Scenario
              9.2 Predicting the test set results
10            CONCLUSIONS
11            LIST OF REFERENCES
Chapter 1
1. INTRODUCTION
The work done in our project focuses mainly on the data analysis processes that
are used in heart disease prediction. Changes in lifestyle, work-related stress and bad
food habits contribute to the increase in the rate of several heart-related diseases. Heart
diseases have emerged as one of the most prominent causes of death all around the
world. According to the World Health Organization (WHO), heart-related diseases are
responsible for taking 17.7 million lives every year, 31% of all global deaths. In
India too, heart-related diseases have become the leading cause of mortality.
Estimates made by the WHO suggest that India lost up to $237 billion from 2005
to 2015 due to heart-related or cardiovascular diseases.
The main challenge in today's healthcare is the provision of best-quality services and
effective, accurate diagnosis. Even though heart diseases have been the prime cause of
death in the world in recent years, they can also be controlled and
managed effectively. How well a disease can be managed depends on
detecting it at the proper time. This project is dedicated to detecting
heart diseases at an early stage to avoid disastrous consequences. If such a
prediction is accurate enough, we can not only avoid wrong diagnoses but also save
human resources. When a patient without heart disease is diagnosed with heart
disease, the patient falls into unnecessary panic, and when a patient with heart disease is
not diagnosed, the patient misses the best chance to treat the disease.
Such wrong diagnoses are painful to both patients and hospitals. With accurate
predictions, we can avoid this unnecessary trouble. Besides, applying a
machine learning tool to medical prediction can save human resources by
reducing the need for complicated diagnostic processes in hospitals. The input to our
algorithm is 13 features with numeric values. We use several algorithms, such as
Logistic Regression, SVM, Naïve Bayes, Random Forest and Artificial Neural
Network, to output a binary number, 1 or 0: 1 indicates that the patient has heart disease
and 0 that the patient does not.
1.1 Scenario:
You are working as a Data Scientist, reporting on various cardiac symptoms. A
cardiologist measures vitals and hands you this data to perform Data
Analysis and predict whether certain patients have Heart Disease. We would like
to build a Machine Learning model that we can train to learn
and improve from experience. Thus, we want to classify patients as either
positive or negative for Heart Disease.
1.2 Goals:
• Predict whether a patient should be diagnosed with Heart Disease. This is
a binary outcome.
Positive (+) = 1, patient diagnosed with Heart Disease
Negative (-) = 0, patient not diagnosed with Heart Disease
• Experiment with various Classification Models and see which yields the
greatest accuracy.
• Examine trends and correlations within our data.
• Determine which features are most important to a Positive/Negative Heart
Disease diagnosis.
1.3 Features & Predictor:
Our Predictor (Y, Positive or Negative diagnosis of Heart Disease) is
determined by 13 features (X):
1. age (#)
2. sex : 1 = Male, 0 = Female (Binary)
3. (cp) : chest pain type (4 values - Ordinal): Value 1: typical angina, Value 2:
atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic
4. (trestbps) : resting blood pressure (#)
5. (chol) : serum cholesterol in mg/dl (#)
6. (fbs) : fasting blood sugar > 120 mg/dl (Binary) (1 = true; 0 = false)
7. (restecg) : resting electrocardiography results (values 0, 1, 2)
8. (thalach) : maximum heart rate achieved (#)
9. (exang) : exercise induced angina (Binary) (1 = yes; 0 = no)
10. (oldpeak) : ST depression induced by exercise relative to rest (#)
11. (slope) : slope of the peak exercise ST segment (Ordinal) (Value 1: up sloping,
Value 2: flat, Value 3: down sloping)
12. (ca) : number of major vessels (0–3, Ordinal) colored by fluoroscopy
13. (thal) : thalassemia (Ordinal): 3 = normal; 6 = fixed
defect; 7 = reversible defect
Note: Our data has 3 types of data:
Continuous (#): quantitative data that can be measured.
Ordinal Data: categorical data that has an order to it (0, 1, 2, 3, etc.).
Binary Data: data whose unit can take on only two possible states (0 & 1).
Chapter 2
2. Data Wrangling
i. import numpy as np
   import pandas as pd
   import seaborn as sns
   import matplotlib.pyplot as plt
ii. # Attributes taken into account
    attr = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','target']
    # The file has no header row; '?' marks a missing value
    df = pd.read_csv('cleveland.csv', sep=',', header=None, names=attr, na_values='?')
    df.head(5)
Fig. 2.1 Data obtained from the Cleveland dataset.
iii. print("(Rows, columns): " + str(df.shape))
     df.columns
Fig. 2.2 Displays the number of rows and columns, as well as the column names.
iv. df.nunique(axis=0)  # returns the number of unique values for each variable
Fig. 2.3 Returns the number of unique values for each variable.
v. df.describe()  # summarizes the count, mean, standard deviation, min, and max for numeric variables
Fig. 2.4 Summarizes the count, mean, standard deviation, min, and max for numeric variables.
2.1 Data Conjunction
There were 4 processed datasets available in the UCI repository:
1. Cleveland Dataset
2. Hungarian Dataset
3. VA Dataset
4. Switzerland Dataset
Class Distribution:
Database         0    1    2    3    4   Total
Cleveland      164   55   36   35   13     303
Hungarian      188   37   26   28   15     294
Switzerland      8   48   32   30    5     123
Long Beach VA   51   56   41   42   10     200
Total                                      920
• The Cleveland, Switzerland and VA datasets were obtained in CSV format, so
they needed no preparation before merging, but the Hungarian dataset was in
space-separated value (SSV) format, so we first had to convert it to CSV format.
• The following program converts the space-separated values of the Hungarian
dataset into comma-separated values so that all datasets can be merged.
vi. input_file = open('hungarian.data', 'r')
    output_file = open('hungarian.csv', 'w')
    for line in input_file:
        if line.strip():
            line = line.replace(" ", ",")
            output_file.write(line)
    input_file.close()
    output_file.close()
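The conversion above replaces every single space with a comma, which works when fields are separated by exactly one space. As a minimal sketch, assuming the file may also contain runs of spaces or trailing spaces (an assumption, not something stated in this report), splitting on whitespace and writing with the standard csv module avoids empty fields. The function name ssv_to_csv and the sample rows are illustrative only.

```python
import csv
import io


def ssv_to_csv(src, dst):
    """Copy whitespace-separated rows from src to dst as comma-separated rows."""
    writer = csv.writer(dst, lineterminator="\n")
    rows_written = 0
    for line in src:
        fields = line.split()  # splits on any run of whitespace
        if fields:             # skip blank lines
            writer.writerow(fields)
            rows_written += 1
    return rows_written


# Illustrative input only: two made-up rows, one with a double space.
sample = io.StringIO("28 1 2 130 132\n29 1  2 120 243 \n\n")
result = io.StringIO()
count = ssv_to_csv(sample, result)
print(result.getvalue())
```

With real files, the two StringIO objects would be replaced by open('hungarian.data') and open('hungarian.csv', 'w', newline='').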
Fig. 2.5 Hungarian dataset before and after applying SSV to CSV conversion.
• We joined all four files expecting to obtain more records than Cleveland alone
provides, but the joined data contained too many missing values. After excluding
the records with missing values, only 299 records remained, so we decided to use
only the Cleveland dataset.
vii. input_files = ['cleveland.csv', 'hungarian.csv', 'va.csv', 'switzerland.csv']
     output_file = 'heart_disease.csv'
     output = open(output_file, 'w')
     total_input = 0    # records read
     total_output = 0   # records kept
     for input_file in input_files:
         input_ = open(input_file, 'r')
         for line in input_:
             total_input += 1
             # skip records that contain a missing-value marker ('?' or -9)
             if ("?" not in line) and ("-9" not in line):
                 features_list = line.split(",")
                 features_list = [float(item) for item in features_list[0:14]]
                 corrected_line = ",".join(map(str, features_list))
                 output.write(corrected_line + "\n")
                 total_output += 1
         input_.close()
     output.close()
     print("Total records read:", total_input)
     print("Records obtained after joining and excluding missing values:", total_output)
Fig. 2.6 Overall records processed.
Fig. 2.7 Records obtained after excluding missing values from the joined dataset.
viii. print(df.isna().sum())  # display the missing values in the Cleveland dataset
Fig. 2.8 Displays the number of missing values for each column.
2.2 Handling Missing Values
Using the Cleveland dataset, we found a total of 6 missing values: 4 in ca and 2 in thal. To
handle these missing values we intended to replace them with the column mean; the code below
fills ca with 1.5 and thal with 3.0.
Finally, we converted the target values 1, 2, 3 and 4 to 1 and kept 0 as 0.
i. df['ca'] = df['ca'].fillna(1.5)
   df['thal'] = df['thal'].fillna(3.0)
   # replace 1, 2, 3, 4 -> 1 and 0 -> 0 in target for binary prediction
   def rep_target(x):
       if x == 0:
           return 0
       else:
           return 1
   df['target'] = df['target'].apply(rep_target)
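The code above fills the gaps with fixed constants. As a minimal sketch of what mean imputation looks like when the mean is computed from the observed values, the following uses plain Python lists with None marking a missing entry. The function name fill_with_mean and the sample values are made up for illustration and are not taken from the Cleveland data.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    column_mean = mean(observed)
    return [column_mean if v is None else v for v in values]


# Made-up 'ca' values; None marks a missing entry.
ca = [0.0, 3.0, None, 1.0, 0.0, None]
ca_filled = fill_with_mean(ca)
print(ca_filled)
```

On a pandas column the same idea is usually written as df['ca'].fillna(df['ca'].mean()).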
Fig. 2.9 Data after handling missing values.
• Next, we check the proportion of positive and negative values of our binary
predictor.
ii. df['target'].value_counts()
Fig. 2.10 It appears we have a good balance between the two binary outputs.
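The balance can also be checked by hand from the Class Distribution table in Section 2.1. The sketch below assumes that no Cleveland rows are dropped, since the missing values are filled rather than removed, and that classes 1 to 4 are all mapped to the positive class.

```python
# Class counts for Cleveland, taken from the Class Distribution table in Section 2.1.
cleveland_counts = {0: 164, 1: 55, 2: 36, 3: 35, 4: 13}

negative = cleveland_counts[0]
positive = sum(n for label, n in cleveland_counts.items() if label != 0)
total = negative + positive

print("negative:", negative, round(negative / total, 3))
print("positive:", positive, round(positive / total, 3))
```

Under these assumptions roughly 54% of the records are negative and 46% are positive.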
Chapter 3
3. Exploratory Data Analysis
3.1 Correlations
• Correlation Matrix - lets you see the correlations between all variables.
With a correlation matrix you can see within seconds whether a variable is
positively or negatively correlated with our predictor (target).
i. df.corr()
   plt.figure(figsize=(13,7))
   sns.heatmap(df.corr(), annot=True)
Fig. 3.1 Calculated Correlation matrix
Fig. 3.2 SNS Heatmap of Correlation Matrix
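Each cell of the correlation matrix is a Pearson correlation coefficient, which is the default method of df.corr(). The sketch below computes it from scratch for two columns, to show what a value near +1 or -1 means. The function name pearson and the sample values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values, for illustration only.
feature = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # moves with the feature
falling = [10, 8, 6, 4, 2]   # moves against the feature

print(pearson(feature, rising))
print(pearson(feature, falling))
```

A column that rises with the feature gives a coefficient of +1, and one that falls as the feature rises gives -1; real features in the heatmap lie somewhere in between.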
We can see there is a positive correlation between chest pain type (cp) and target (our
predictor): higher cp codes are associated with a greater chance of having heart
disease.
cp (chest pain type) is an ordinal feature with 4 values:
Value 1: typical angina
Value 2: atypical angina
Value 3: non-anginal pain
Value 4: asymptomatic
In addition, we see a negative correlation between exercise-induced angina (exang)
and our predictor. This makes sense because when you exercise, your heart requires
more blood, but narrowed arteries slow down blood flow.
• Pair plots are also a great way to see the correlations between all variables at a
glance. With so many features, however, a full pair plot is hard to read, so we make
a pair plot with only the continuous columns of our data.
ii. subData = df[['age','trestbps','chol','thalach','oldpeak']]
    sns.pairplot(subData)
Fig. 3.3 Pair plot of only the continuous variables, to dive deeper into their relationships.
It is also a good way to see whether a correlation is positive or negative.
3.2 Count Plot
iii. sns.countplot(x='sex', hue='target', data=df)
Fig. 3.4 Heart disease affected patients by sex.
We saw that males (sex = 1) are the most affected by heart disease.
iv. plt.figure(figsize=(20,8))
    sns.countplot(x='age', data=df, hue='target')
Fig. 3.5 Variation in age for each target class.
• Most of the people suffering from the disease are aged 58, followed by 57.
• The majority of people suffering from the disease belong to the 50+ age group.
3.3 Violin & Box Plots
The advantage of Box and Violin plots is that they show the basic statistics of
the data as well as its distribution. These plots are often used
to compare the distribution of a given variable across some categories.
A box plot shows the median and the interquartile range (IQR) through the
five-number summary: minimum, first quartile (Q1), median, third quartile (Q3),
and maximum.
In addition it can provide us with outliers in our data.
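The quantities a box plot draws can be computed directly. The sketch below uses the common 1.5 x IQR rule for outliers, which is also the default whisker length of the seaborn box plot; the exact quartile method of a plotting library may differ slightly from the one used here. The sample values are made up for illustration.

```python
from statistics import quantiles

# Made-up ST depression values, for illustration only.
oldpeak = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 6.2]

# Quartiles Q1, median (Q2) and Q3.
q1, median, q3 = quantiles(oldpeak, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in oldpeak if v < lower_fence or v > upper_fence]

print("Q1:", q1, "median:", median, "Q3:", q3)
print("outliers:", outliers)
```

Here only the value 6.2 falls outside the fences and would be drawn as a separate point.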
i. plt.figure(figsize=(8,4))
   sns.violinplot(x='target', y='oldpeak', inner='quartile', hue='sex', data=df)
   plt.title("Oldpeak vs Heart_Disease(target)")
Fig. 3.6 Outliers in OldPeak vs. Heart Disease
ii. plt.figure(figsize=(8,4))
    sns.boxplot(x='target', y='thalach', hue='sex', data=df)
    plt.title("Thalach vs Target")
Fig. 3.7 Maximum heart rate achieved (thalach) vs. Target
3.4 Filtering data by positive & negative Heart Disease patient
i. # Filtering data by POSITIVE Heart Disease patient
   pos = df[df['target']==1]
   pos.describe()
Fig. 3.8 Data of positive heart disease patients.
ii. # Filtering data by NEGATIVE Heart Disease patient
    neg = df[df['target']==0]
    neg.describe()
Fig. 3.9 Data of negative heart disease patients.
iii. print("Positive Patients ST Depression:", pos['oldpeak'].mean())
     print("Negative Patients ST Depression:", neg['oldpeak'].mean())
Fig. 3.10 Positive and Negative ST depression
iv. print("Positive Patients Max Heart Rate:", pos['thalach'].mean())
    print("Negative Patients Max Heart Rate:", neg['thalach'].mean())
Fig. 3.11 Positive and Negative Max Heart rate
Comparing positive and negative patients, we can see large differences in the
means of many of our 13 features. From examining the details, we
can observe that positive patients show a heightened average maximum heart rate
achieved (thalach). In addition, positive patients exhibit about one third of the
ST depression induced by exercise relative to rest (oldpeak).
Chapter 4
4. Machine Learning + Predictive Analytics
4.1 Prepare Data for Modeling

To prepare data for modelling, just remember ASN (Assign, Split and
Normalize).
• Assign
Assign the 13 features to X, and the last column to our classification predictor, y.
i. X = df.drop('target', axis=1)
   y = df['target']
• Split
Split the data set into the Training set and the Test set.
ii. from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
• Normalize (Scaling)
Standardize features by removing the mean and scaling to unit variance.
The standard score of a sample x is calculated as:
z = (x - u) / s
where u is the mean of the training samples (or zero if with_mean=False)
and s is the standard deviation of the training samples (or one if with_std=False).
Centering and scaling happen independently on each feature by computing the
relevant statistics on the samples in the training set. The mean and standard deviation
are then stored to be used on later data using transform.
iii. from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
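The Split and Scaling steps can be sketched without any library to show what happens to the numbers. This is an illustration under stated assumptions, not the scikit-learn implementation: the helper names split_rows, fit_scaler and transform are hypothetical, the shuffle order differs from the one train_test_split produces, and a single made-up feature stands in for the 13 real ones. Two details follow scikit-learn: the test share is rounded up, and the population standard deviation is used.

```python
import math
import random
from statistics import mean, pstdev


def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle rows and return (train, test); the test share is rounded up."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


def fit_scaler(column):
    """Return the mean u and population standard deviation s of a column."""
    return mean(column), pstdev(column)


def transform(column, u, s):
    """Apply z = (x - u) / s with statistics learned elsewhere."""
    return [(x - u) / s for x in column]


# 303 made-up single-feature rows standing in for the Cleveland records.
rows = [float(i) for i in range(303)]
train, test = split_rows(rows)

u, s = fit_scaler(train)              # statistics come from the training set only
train_scaled = transform(train, u, s)
test_scaled = transform(test, u, s)   # the test set reuses u and s

print(len(train), len(test))
```

With 303 records and a test size of 0.2 this gives 242 training rows and 61 test rows, which matches the 61 predictions counted in the confusion matrix of Chapter 7. Fitting on the training set only keeps information from the test set out of the preprocessing step.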
Chapter 5
5. Modeling / Training
Now we train various Classification Models on the Training set and see
which yields the highest accuracy. We compare the accuracy of Logistic
Regression, K-NN (k-Nearest Neighbours), SVM (Support Vector Machine),
Naive Bayes Classifier, Decision Tree and Random Forest.
Note: These are all supervised learning models.
5.1 Model 1: Logistic Regression
i. from sklearn.linear_model import LogisticRegression
   from sklearn.metrics import confusion_matrix, classification_report
   log_model = LogisticRegression()  # get instance of model
   log_model.fit(X_train, y_train)  # fitting model
   y_pred1 = log_model.predict(X_test)  # prediction
   cm_logistic = confusion_matrix(y_test, y_pred1)  # confusion matrix
   print("Confusion matrix for logistic Regression:\n", cm_logistic)
   print("\n\n\n\nClassification report for logistic Regression:\n",
         classification_report(y_test, y_pred1))  # classification report
Fig. 5.1 Logistic Regression Prediction [Accuracy 85%]
5.2 Model 2: K-NN (K-Nearest Neighbors)
ii. from sklearn.neighbors import KNeighborsClassifier
    KNN_Model = KNeighborsClassifier()  # get instance of model
    KNN_Model.fit(X_train, y_train)  # model training
    y_pred2 = KNN_Model.predict(X_test)  # prediction
    cm_KNN = confusion_matrix(y_test, y_pred2)  # confusion matrix
    print("Confusion matrix for K_Nearest Neighbour:\n", cm_KNN)
    print("\n\n\n\nClassification report for K_Nearest Neighbour:\n",
          classification_report(y_test, y_pred2))  # classification report
Fig. 5.2 K-NN Prediction [Accuracy 88%]
5.3 Model 3: SVM (Support Vector Machine)
iii. from sklearn.svm import SVC
     svc_Model = SVC(C=100, gamma=0.0001)
     svc_Model.fit(X_train, y_train)
     y_pred4 = svc_Model.predict(X_test)
     cm_SVM = confusion_matrix(y_test, y_pred4)
     print("Confusion matrix for Support Vector Machine:\n", cm_SVM)
     print("\n\n\n\nClassification report for Support Vector Machine:\n",
           classification_report(y_test, y_pred4))
Fig. 5.3 SVM Prediction [Accuracy 85%]
5.4 Model 4: Naive Bayes Classifier
iv. from sklearn.naive_bayes import GaussianNB
    NBayes = GaussianNB()
    NBayes.fit(X_train, y_train)
    y_pred3 = NBayes.predict(X_test)
    cm_Naive = confusion_matrix(y_test, y_pred3)
    print("Confusion matrix for Naive Bayes:\n", cm_Naive)
    print("\n\n\n\nClassification report for Naive Bayes:\n", classification_report(y_test, y_pred3))
Fig. 5.4 Naïve Bayes Classifier Prediction [Accuracy 83%]
5.5 Model 5: Decision Trees
v. from sklearn.tree import DecisionTreeClassifier
   dec_Tree = DecisionTreeClassifier(random_state=1)
   dec_Tree.fit(X_train, y_train)
   y_pred5 = dec_Tree.predict(X_test)
   cm_dec_Tree = confusion_matrix(y_test, y_pred5)
   print("Confusion matrix for Decision Tree:\n", cm_dec_Tree)
   print("\n\n\n\nClassification report for Decision Tree:\n", classification_report(y_test, y_pred5))
Fig. 5.5 Decision Tree Prediction [Accuracy 77%]
5.6 Model 6: Random Forest
vi. from sklearn.ensemble import RandomForestClassifier
    RFC_Model = RandomForestClassifier(n_estimators=50)
    RFC_Model.fit(X_train, y_train)
    y_pred6 = RFC_Model.predict(X_test)
    cm_R_Forest = confusion_matrix(y_test, y_pred6)
    print("Confusion matrix for Random Forest:\n", cm_R_Forest)
    print("\n\n\n\nClassification report for Random Forest:\n",
          classification_report(y_test, y_pred6))
Fig. 5.6 Random Forest Prediction [Accuracy 88%]
From comparing the 6 models, we can conclude that Model 6: Random
Forest yields the highest accuracy, 88%, level with Model 2: K-NN in the
figures above.
Fig. 5.7 Models Accuracy Graph
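The comparison can be reproduced from the accuracies quoted in the captions of Fig. 5.1 to Fig. 5.6. The sketch below only ranks those reported numbers; it does not retrain any model, and the variable names are illustrative.

```python
# Accuracies in percent, as reported in the captions of Fig. 5.1 to Fig. 5.6.
reported_accuracy = {
    "Logistic Regression": 85,
    "K-NN": 88,
    "SVM": 85,
    "Naive Bayes": 83,
    "Decision Tree": 77,
    "Random Forest": 88,
}

best_score = max(reported_accuracy.values())
best_models = [name for name, acc in reported_accuracy.items() if acc == best_score]

# Print the models from most to least accurate.
for name, acc in sorted(reported_accuracy.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20}{acc}%")
print("Best:", best_models, best_score)
```

Collecting every model that reaches the top score, rather than a single winner, makes ties visible.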
Chapter 6
6. Precision, Recall, F1-score and Support
6.1 Precision: of all samples predicted to belong to a class, the fraction that truly
belong to it: true positives / (true positives + false positives).
6.2 Recall: of all samples that truly belong to a class, the fraction that the model
finds: true positives / (true positives + false negatives).
6.3 F1-score: the harmonic mean of the precision and recall values.
The F1 score reaches its best value at 1 and its worst value at 0.
F1 Score = 2 x ((precision x recall) / (precision + recall))
6.4 Support: the number of samples of the true response that lie in that class.
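These definitions can be checked by hand. The sketch below applies them to the positive class, using the counts quoted with the confusion matrix in Chapter 7 (27 true positives, 25 true negatives, 7 false positives, 2 false negatives). The function name precision_recall_f1 is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 score for one class from its error counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f1


# Counts for the positive class, taken from the confusion matrix in Chapter 7.
tp, tn, fp, fn = 27, 25, 7, 2
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
support = tp + fn  # number of truly positive samples

print(round(precision, 3), round(recall, 3), round(f1, 3), support)
```

For these counts the precision is about 0.794, the recall about 0.931, the F1 score about 0.857, and the support is 29.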
Chapter 7
7. Making the Confusion Matrix
i. from sklearn.metrics import confusion_matrix, accuracy_score
   print("Confusion matrix :\n", confusion_matrix(y_test, y_pred))
ii. print("\naccuracy:\n", accuracy_score(y_test, y_pred))
Fig. 7.1 Confusion Matrix and Accuracy Score.
Note: As a rough rule of thumb, an accuracy above 70% is often considered good, but be careful:
an extremely high accuracy may be too good to be true (a possible sign of overfitting).
By this rule of thumb, an accuracy around 90% is ideal.
7.1 How to Interpret Confusion Matrix:
• 27 is the number of True Positives in our data, while 25 is the number of True
Negatives.
• 7 and 2 are the numbers of errors.
• There are 7 Type 1 errors (False Positives): you predicted positive and it is
false.
• There are 2 Type 2 errors (False Negatives): you predicted negative and it is
false.
• Hence, the accuracy is # Correctly Predicted / # Total.
In other words, where TP, FN, FP and TN represent the number of true positives,
false negatives, false positives and true negatives:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy = (27 + 25) / (27 + 25 + 7 + 2) = 52 / 61 ≈ 0.85 = 85% accuracy
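The same arithmetic can be sketched in code by counting the four outcomes from paired labels. The label lists below are constructed to reproduce the quoted counts and are not the real test set; the function name confusion_counts is illustrative. For binary labels 0 and 1, scikit-learn's confusion_matrix arranges the same counts as [[TN, FP], [FN, TP]].

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels 0 and 1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp


# Labels constructed to reproduce the quoted counts (27 TP, 25 TN, 7 FP, 2 FN).
y_true = [1] * 27 + [0] * 25 + [0] * 7 + [1] * 2
y_pred = [1] * 27 + [0] * 25 + [1] * 7 + [0] * 2

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tn, fp, fn, tp, round(accuracy, 4))
```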
Chapter 8
8. Feature Importance
• Feature Importance provides a score that indicates how helpful each feature
was in our model.
• The higher the Feature Score, the more that feature is used to make key
decisions, and thus the more important it is.
i. # get importance
   importance = RFC_Model.feature_importances_
ii. # summarize feature importance
    for i, feat in enumerate(importance, start=1):
        print('Feature:%0d\tScore:%0.5f' % (i, feat))
Fig. 8.1 Feature Importance Score
iii. index = df.columns[:-1]
     sns.barplot(x=importance, y=index, palette='Blues_r')
Fig. 8.2 From the Feature Importance graph, the top 4 significant features were ST depression induced by
exercise (oldpeak), chest pain type (cp), maximum heart rate achieved (thalach), and number of major vessels
colored by fluoroscopy (ca).
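Reading the top features off the bar plot can also be done in code by pairing each score with its column name and sorting. The scores below are hypothetical placeholders chosen for illustration; they are not the values shown in Fig. 8.1. Like the importances of a scikit-learn random forest, they sum to 1.

```python
feature_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']

# Hypothetical scores for illustration; they are NOT the values from Fig. 8.1.
importance = [0.08, 0.03, 0.14, 0.07, 0.08, 0.01, 0.02,
              0.12, 0.06, 0.13, 0.05, 0.12, 0.09]

# Pair names with scores and sort from most to least important.
ranked = sorted(zip(feature_names, importance), key=lambda pair: pair[1], reverse=True)
top4 = [name for name, _ in ranked[:4]]

for name, score in ranked:
    print(f"{name:<10}{score:.2f}")
print("Top 4:", top4)
```

In the report's own code, the list of scores would be replaced by RFC_Model.feature_importances_ and the names by df.columns[:-1].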
Chapter 9
9. Predictions
9.1 Scenario:
• A patient develops cardiac symptoms and you input his vitals into the
Machine Learning Algorithm.
• He is a 20-year-old male, with a chest pain value of 2 (atypical angina)
and a resting blood pressure of 110.
• In addition, he has a serum cholesterol of 230 mg/dl.
• His fasting blood sugar is > 120 mg/dl.
• He has a resting electrocardiographic result of 1.
• The patient's maximum heart rate achieved is 140.
• Also, he has exercise-induced angina.
• His ST depression induced by exercise relative to rest is 2.2.
• The slope of the peak exercise ST segment is flat.
• He has no major vessels colored by fluoroscopy, and in addition
his thal result is a reversible defect.
Based on this information, can we classify this patient with Heart Disease?
Fig. 9.1 Providing input on the designed Web Page for Prediction.
@app.route('/predict', methods=['POST'])
def predict():
    # float() rather than int(), so that decimal inputs such as oldpeak = 2.2 are accepted
    new_data = [float(i) for i in request.form.values()]
    new_data = np.array(new_data)
    new_data = new_data.reshape(1, -1)
    new_data_scaled = scaler.transform(new_data)
    prediction = model.predict(new_data_scaled)
    return render_template('result.html', prediction=word(prediction))
Fig. 9.2 Output Positive
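The route above relies on the form fields arriving in the same order as the training columns. As a minimal sketch, the input can instead be read by field name and checked before it reaches the model. The field names below are assumed to match the column names, which this report does not show, and the helper parse_patient is hypothetical. The values are those of the patient in Section 9.1, encoded with the codes from Section 1.3.

```python
FEATURE_ORDER = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']


def parse_patient(form):
    """Turn a mapping of form fields into a list of floats in model order."""
    missing = [name for name in FEATURE_ORDER if name not in form]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return [float(form[name]) for name in FEATURE_ORDER]


# The patient from Section 9.1, as the strings a web form would send.
form = {'age': '20', 'sex': '1', 'cp': '2', 'trestbps': '110', 'chol': '230',
        'fbs': '1', 'restecg': '1', 'thalach': '140', 'exang': '1',
        'oldpeak': '2.2', 'slope': '2', 'ca': '0', 'thal': '7'}

row = parse_patient(form)
print(row)
```

Converting with float matters here because int('2.2') raises a ValueError, so an ST depression of 2.2 could not be submitted otherwise.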
9.2 Predicting the Test set results:
• The first value represents our actual value; the second value represents
our predicted value.
• If the values match, then we predicted correctly.
iv. y_pred = RFC_Model.predict(X_test)
    y_pred = y_pred.reshape(len(y_pred), 1)
    y_test = np.array(y_test).reshape(len(y_test), 1)
    concated = np.column_stack((y_test, y_pred))
    dataset = pd.DataFrame(concated, columns=['actual', 'prediction'])
Fig. 9.3 Random Forest Model predicted data.
• Our model achieves an accuracy of 89%!
Chapter 10
10. Conclusions
• Out of the 13 features we examined, the top 4 significant features that helped us
classify between a positive and negative diagnosis were chest pain type (cp),
maximum heart rate achieved (thalach), number of major vessels colored by
fluoroscopy (ca), and ST depression induced by exercise relative to rest (oldpeak).
• Our machine learning algorithm can now classify patients with Heart Disease.
This can support diagnosing patients and getting them the help they need to recover.
By detecting these features early, we may prevent worse symptoms
from arising later.
• Our Random Forest algorithm yields the highest accuracy, 89%. As a rough rule of thumb,
an accuracy above 70% is often considered good, but we have to be careful because if our accuracy
is extremely high, it may be too good to be true (a possible sign of overfitting).
By this rule of thumb, an accuracy around 90% is ideal.
List of References
I. Ramadoss A., Shah B., et al., “Responding to the threat of chronic diseases
in India”, Lancet, 2005; 366:1744–1749. DOI: 10.1016/S0140-6736(05)67343-6.
II. Global Atlas on Cardiovascular Disease Prevention and Control. Geneva,
Switzerland: World Health Organization, 2011.
III. Avinash Golande, Pavan Kumar T, “Heart Disease Prediction Using Effective
Machine Learning Techniques”, International Journal of Recent Technology
and Engineering, Vol. 8, pp. 944-950, 2019.
IV. International Journal of Engineering Research & Technology (IJERT), Vol. 9,
Issue 04, April 2020.
V. International Journal of Engineering and Technology (UAE), Vol. 7, No. 2.8,
DOI: 10.14419/ijet.v7i2.8.10557
VI. Jarar Zaidi, https://towardsdatascience.com/project-predicting-heart-disease-with-classification-machine-learning-algorithms-fd69e6fdc9d6
VII. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing