
HEART DISEASE PREDICTION

USING MACHINE LEARNING

A PROJECT REPORT

Submitted by

AFTAB ALAM KHAN


ALKA JOSHI
AMIT RAI
GAURAV KUMAR

In partial fulfillment for the award of the degree

of

BACHELOR OF TECHNOLOGY
IN

COMPUTER SCIENCE & ENGINEERING

TULA’S INSTITUTE OF TECHNOLOGY AND MANAGEMENT

UTTARAKHAND TECHNICAL UNIVERSITY

JANUARY 2021

1
HEART DISEASE PREDICTION
USING MACHINE LEARNING
A Project Submitted

In Partial Fulfillment of the Requirements


for the Degree of

BACHELOR OF TECHNOLOGY
In
Computer Science & Engineering
By

AFTAB ALAM KHAN


ALKA JOSHI
AMIT RAI
GAURAV KUMAR
(Enrollment Nos. 170120101010, 170120101011, 170120101013, 170120101033)

Under the Supervision of


Mr. Ram Narayan Pal

FACULTY OF COMPUTER SCIENCE & ENGINEERING

TULA’S INSTITUTE
THE ENGINEERING AND MANAGEMENT COLLEGE
(DEHRADUN)
Jan, 2021

2
UTTARAKHAND TECHNICAL UNIVERSITY

CERTIFICATE

Certified that this project report "HEART DISEASE PREDICTION USING
MACHINE LEARNING" is the bonafide work of "AFTAB ALAM KHAN,
ALKA JOSHI, AMIT RAI, GAURAV KUMAR", who carried out the
project work under my supervision.

______________
SIGNATURE

MR. RAM NARAYAN PAL
SUPERVISOR
COMPUTER SCIENCE ENGINEERING
TULA’S INSTITUTE OF TECHNOLOGY
AND MANAGEMENT, DHOOLKUT,
DEHRADUN, 248011

______________
SIGNATURE

MR. LOKESH KUMAR
HEAD OF THE DEPARTMENT
COMPUTER SCIENCE ENGINEERING
TULA’S INSTITUTE OF TECHNOLOGY
AND MANAGEMENT, DHOOLKUT,
DEHRADUN, 248011

3
UTTARAKHAND TECHNICAL UNIVERSITY

DECLARATION

I declare that this written submission represents my work and ideas in my own words, and where
others' ideas or work have been included, I have adequately cited and referenced the original
sources. I also declare that I have adhered to all principles of academic honesty and integrity and
have not misrepresented, fabricated, or falsified any idea/data/fact/source in my submission. I
understand that any violation of the above will be cause for disciplinary action by the University
and can also evoke penal action from the sources which have thus not been properly cited or from
whom proper permission has not been taken when needed.

(Signature)
AFTAB ALAM KHAN (201704076)
ALKA JOSHI (201704079)
AMIT RAI (201704041)
GAURAV KUMAR (201704040)

Department of Computer Science and Engineering


Date: _______

4
ACKNOWLEDGMENT

I take this opportunity to remember the Almighty and my parents, who bestowed the strength,
courage, and perseverance to undertake the present course of study and complete it successfully.
First and foremost, I would like to express a deep sense of gratitude and sincere regard towards the
Director, for his sustained guidance, meticulous supervision, and constant encouragement during
my dissertation work.
My special thanks to Mr. Lokesh Kumar (HOD, CSE), Mr. Sachin Kumar (Project
Coordinator, CSE), and Ms. Akansha Singh (Project Co-coordinator, CSE), who helped me in
the execution of this work and provided the necessary facilities with full cooperation, bestowing
their excellent guidance, encouragement, inspiring motivation, valuable suggestions, and painstaking
help throughout this project work.
I express my heartfelt and sincere thanks to my project guide Mr. Ram Narayan Pal (Tula’s
Institute of Engineering and Management, Dehradun) for his excellent guidance, caring,
patience, and for providing an excellent atmosphere for the completion of this dissertation project.
I also thank the teaching and non-teaching staff members of the department for their kind
cooperation, and all others who helped me through the course of the project work.

Without their encouragement, patience, and moral support, it would not have been possible
for me to complete this dissertation.

Date: _/_/2021
Submitted By –
AFTAB ALAM KHAN (201704076)
ALKA JOSHI (201704079)
AMIT RAI (201704041)
GAURAV KUMAR (201704040)

5
ABSTRACT

The heart and brain are the most critical organs in the human body. Heart
disease is one of the most lethal health problems in the world; in it, the heart is unable to pump the
required amount of blood to other parts of the body. The diagnosis of heart disease through
traditional methods has not been considered reliable in many respects. Nowadays, various
machine learning techniques and tools are available to extract effective information for accurate
diagnosis and decision making.

In this project, machine learning techniques such as Artificial Neural Network (ANN),
Decision Tree, K-Nearest Neighbours (KNN), Naive Bayes, and Support Vector Machine (SVM)
are used. The main objective is to predict whether a patient will develop heart
disease. After testing all these models, the model with the best accuracy is used for the prediction
of heart disease.

Today the healthcare industry is information-rich but still knowledge-poor, and much of the
data is not publicly available. This project uses the Cleveland dataset, containing 303
individuals and 14 attributes: age, sex, chest pain type (cp), resting blood pressure (trestbps),
cholesterol, resting electrocardiographic results (restecg), maximum heart rate, ST depression
induced by exercise, exercise-induced angina, slope of the peak exercise ST segment, number of
vessels coloured by fluoroscopy, and thalassemia.

6
LIST OF FIGURES
Fig. No Description Page No.
2.1 Data obtained from the Cleveland dataset. 14
2.2 Displays the number of rows & columns, as well as the column names. 14
2.3 Returns the number of unique values for each variable. 15
2.4 Summarizes the count, mean, standard deviation, min, and max for numeric variables. 15
2.5 Hungarian dataset before and after applying the SSV-to-CSV conversion. 17
2.6 Overall records processed 18
2.7 Records achieved after excluding missing values from the joined dataset. 18
2.8 Display the Number of Missing Values for each column. 18
2.9 Data after handling missing values. 19
2.10 It appears we have a good balance between the two binary outputs. 19
3.1 Calculated Correlation matrix 20
3.2 SNS Heatmap of Correlation Matrix 20
3.3 Pairplot of positive or negative correlation 22
3.4 Heart Disease affected patient Sex ratio. 23
3.5 Variation in age for each target class. 23
3.6 Outliers in OldPeak vs. Heart Disease 24
3.7 Maximum heart rate achieved (thalach) vs. Target 25
3.8 Data of positive heart disease patient 25
3.9 Data of negative heart disease patient. 26
3.10 Positive and Negative ST depression 26
3.11 Positive and Negative Max Heart rate 26
5.1 Logistic Regression Prediction [Accuracy 85%] 29
5.2 K-NN Prediction [Accuracy 88%] 30
5.3 SVM Prediction [Accuracy 85%] 31
5.4 Naïve Bayes Classifier Prediction [Accuracy 83%] 31
5.5 Decision Tree Prediction [Accuracy 77%] 32
5.6 Random Forest Prediction [Accuracy 88%] 33

7
5.7 Models Accuracy Graph 33
7.1 Confusion Matrix and Accuracy Score. 35
8.1 Feature Importance Score 37
8.2 Top 4 significant features concluded from the Feature Importance graph: chest pain type (cp), maximum heart rate achieved (thalach), number of major vessels (ca), and ST depression induced by exercise relative to rest (oldpeak). 38
9.1 Providing input on designed Web Page for Prediction. 40
9.2 Output Positive 40
9.3 Random Forest Model predicted data 41

8
TABLE OF CONTENTS
CHAPTER NO TITLE PAGE NO.
ABSTRACT vi
LIST OF FIGURES vii

1 INTRODUCTION 11
1.1 Scenario 12
1.2 Goals 12
1.3 Features and Predictor 13

2 DATA WRANGLING 14
2.1 Data conjunction 15
2.2 Handling missing values 18

3 EXPLORATORY DATA ANALYSIS 20


3.1 Correlations 20
3.2 Count Plot 23
3.3 Violin and Box plots 24
3.4 Filtering data by positive and negative heart disease patients 25

4 MACHINE LEARNING + PREDICTIVE ANALYSIS 27
4.1 Prepare data for modelling 27

5 MODELLING/ TRAINING 29
5.1 Model 1- Logistic Regression 29
5.2 Model 2- K-NN 30
5.3 Model 3- SVM (Support Vector Machine) 30
5.4 Model 4- Naive Bayes Classification 31
5.5 Model 5- Decision Trees 32
5.6 Model 6- Random Forest 32

9
6 PRECISION, RECALL, F1 SCORE AND SUPPORT 34

7 MAKING THE CONFUSION MATRIX 35


7.1 How to interpret Confusion Matrix 35

8 FEATURE IMPORTANCE 37

9 PREDICTIONS 39
9.1 Scenario 39
9.2 Predicting the test set results 40

10 Conclusions 42

11 List of References 43

10
Chapter 1

1. INTRODUCTION

The work done in our project mainly focuses on the various data analysis processes
used in heart disease prediction. Changes in lifestyle, work-related stress, and bad
food habits contribute to the rising rate of several heart-related diseases. Heart
diseases have emerged as one of the most prominent causes of death all around the
world. According to the World Health Organization, heart-related diseases are
responsible for taking 17.7 million lives every year, which is 31% of all global deaths. In
India too, heart-related diseases have become the leading cause of mortality.
Estimates made by the World Health Organization (WHO) suggest that India lost
up to $237 billion from 2005 to 2015 due to heart-related or cardiovascular
diseases.

The main challenge in today's healthcare is the provision of best-quality services and
effective, accurate diagnosis. Even though heart diseases have been found to be the prime
cause of death in the world in recent years, they are also among the diseases that can be
controlled and managed effectively. The accuracy of managing a disease depends on
detecting that disease at the proper time. This project work is dedicated to detecting
these heart diseases at an early stage to avoid disastrous consequences. If such a
prediction is accurate enough, we can not only avoid wrong diagnoses but also save
human resources. When a patient without heart disease is diagnosed with heart
disease, he falls into unnecessary panic, and when a patient with heart disease is
not diagnosed with it, he misses the best chance to cure his disease.
Such wrong diagnoses are painful to both patients and hospitals. With accurate
predictions, we can avoid this unnecessary trouble. Besides, if we can apply our
machine learning tool to medical prediction, we will save human resources because
we will not need a complicated diagnosis process in hospitals. The input to our
algorithm is 13 numeric features. We use several algorithms such as
Logistic Regression, SVM, Naïve Bayes, Random Forest, and Artificial Neural
Network to output a binary value: 1 indicates the patient has heart disease,
and 0 indicates the patient does not.

1.1 Scenario:

You are working as a Data Scientist, reviewing various cardiac symptoms. A
cardiologist measures vitals & hands you this data to perform data
analysis and predict whether certain patients have Heart Disease. We would like
to build a Machine Learning algorithm where we can train our model to learn
& improve from experience. Thus, we want to classify patients as either
positive or negative for Heart Disease.

1.2 Goals:

 Predict whether a patient should be diagnosed with Heart Disease. This is a binary outcome.
Positive (+) = 1: patient diagnosed with Heart Disease
Negative (-) = 0: patient not diagnosed with Heart Disease

 Experiment with various Classification Models & see which yields the greatest accuracy.

 Examine trends & correlations within our data.

 Determine which features are most important to a Positive/Negative Heart Disease diagnosis.

1.3 Features & Predictor:

Our predictor (Y, positive or negative diagnosis of Heart Disease) is determined by 13 features (X):

12
1. age (#)
2. (sex) : 1 = Male, 0 = Female (Binary)
3. (cp) : chest pain type (4 values, Ordinal): Value 1: typical angina, Value 2:
atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic
4. (trestbps) : resting blood pressure (#)
5. (chol) : serum cholesterol in mg/dl (#)
6. (fbs) : fasting blood sugar > 120 mg/dl (Binary) (1 = true; 0 = false)
7. (restecg) : resting electrocardiographic results (values 0, 1, 2)
8. (thalach) : maximum heart rate achieved (#)
9. (exang) : exercise-induced angina (Binary) (1 = yes; 0 = no)
10. (oldpeak) : ST depression induced by exercise relative to rest (#)
11. (slope) : slope of the peak exercise ST segment (Ordinal) (Value 1: upsloping,
Value 2: flat, Value 3: downsloping)
12. (ca) : number of major vessels (0–3, Ordinal) colored by fluoroscopy
13. (thal) : thalassemia (Ordinal): 3 = normal; 6 = fixed defect; 7 = reversible defect

Note: Our data has 3 types of data:

Continuous (#): quantitative data that can be measured.
Ordinal Data: categorical data that has an order to it (0, 1, 2, 3, etc.).
Binary Data: data whose unit can take on only two possible states (0 & 1).
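
For later convenience (for example, when selecting columns for the plots in Chapter 3), these groupings can be written down in code. A small sketch that simply restates the list above:

# feature groups, restating the 13-feature list above
continuous = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']   # quantitative (#)
ordinal = ['cp', 'restecg', 'slope', 'ca', 'thal']               # ordered categories
binary = ['sex', 'fbs', 'exang']                                 # two possible states (0 & 1)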

13
Chapter 2

2. Data Wrangling
i. import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

ii. filePath = '/Users/aftab/Downloads/clevland-dataset.csv'
# attributes we take into account
attr = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','target']
# the CSV has no header row, so label the columns with our attribute names
df = pd.read_csv(filePath, sep=',', header=None, names=attr)
df.head(5)

Fig. 2.1 Data obtained from the Cleveland dataset.

iii. print("(Rows, columns): " + str(df.shape))


df.columns

Fig. 2.2 Displays Number of Rows & Columns as well as the Column names.

14
iv. df.nunique(axis=0) # returns the number of unique values for each variable.

Fig. 2.3 Returns the number of unique values for each variable.

v. df.describe() #summarizes the count, mean, standard deviation, min, and max for numeric variables

Fig. 2.4 Summarizes the count, mean, standard deviation, min, and max for numeric variables.

2.1 Data Conjunction

There were 4 processed datasets available on the UCI repository:
1. Cleveland Dataset
2. Hungarian Dataset
3. VA Dataset
4. Switzerland Dataset

15
Class Distribution:

Database         0    1    2    3    4   Total
Cleveland:     164   55   36   35   13    303
Hungarian:     188   37   26   28   15    294
Switzerland:     8   48   32   30    5    123
Long Beach VA:  51   56   41   42   10    200
Total:                                    920

 The Cleveland, Switzerland, and VA datasets were already available in CSV format, so
we did not have to rework them for merging, but the Hungarian dataset was in
SSV format, so we first had to convert that dataset to CSV format.

 Since we are merging all the datasets and the Hungarian dataset is space-separated,
the following program converts space-separated values into comma-separated
values.

vi. input_file = open('hungarian.data', 'r')
output_file = open('hungarian.csv', 'w')
for line in input_file:
    if line.strip():
        line = line.replace(" ", ",")   # replace each space with a comma
        output_file.write(line)

input_file.close()
output_file.close()

16
Fig. 2.5 Hungarian dataset before and after applying the SSV-to-CSV conversion.

 We joined the files expecting to obtain a higher number of records than from
Cleveland alone, but there were too many missing values; after tackling
those missing values we were left with only 299 records in total, so we decided to use
only the Cleveland dataset.

vii. input_files = ['cleveland.csv','hungarian.csv','va.csv','switzerland.csv']
output_file = "heart_disease.csv"
output = open(output_file, 'w')
total_input = 0
total_output = 0
for input_file in input_files:
    input_ = open(input_file, 'r')
    for line in input_:
        total_input += 1
        # keep only complete rows: '?' and -9 mark missing values
        if ("?" not in line) and ("-9" not in line):
            features_list = line.split(",")
            features_list = [float(item) for item in features_list[0:14]]
            corrected_line = ",".join(map(str, features_list))
            output.write(corrected_line + "\n")
            total_output += 1
    input_.close()
output.close()
print("Total records read:", total_input)
heartdf = pd.read_csv(output_file, header=None)   # read the joined file back
print("Records achieved after joining and excluding missing values:\n", heartdf[0].count())

Fig. 2.6 Overall records processed

Fig. 2.7 Records achieved after excluding missing values from the joined dataset.

viii. print(df.isna().sum()) # display the number of missing values in the Cleveland dataset

Fig. 2.8 Displays the number of missing values for each column.

2.2 Handling Missing Values

Using the Cleveland dataset, we found a total of 6 missing values: 4 in ca and 2 in thal. To
handle these missing values we decided to replace them with their mean values.

Finally, we converted the non-zero target values (1, 2, and 3) to 1, keeping 0 as 0, for binary prediction.

18
i. df['ca'].fillna(1.5, inplace=True)
df['thal'].fillna(3.0, inplace=True)
# replace 1, 2, 3 -> 1 and 0 -> 0 in target for binary prediction
def rep_target(x):
    if x == 0:
        return 0
    else:
        return 1
df['target'] = df['target'].apply(rep_target)

Fig. 2.9 Data after handling missing values.

 Now, to see the proportion between our positive & negative binary outcomes:

ii. df['target'].value_counts()

Fig. 2.10 It appears we have a good balance between the two binary outputs.

19
Chapter 3

3. Exploratory Data Analysis

3.1 Correlations

 Correlation Matrix - it lets you see the correlations between all variables.
With a correlation matrix, you can see within seconds whether something is
positively or negatively correlated with our predictor (target).

i. df.corr()
plt.figure(figsize=(13,7))
sns.heatmap(df.corr(),annot=True)

Fig. 3.1 Calculated Correlation matrix

Fig. 3.2 SNS Heatmap of Correlation Matrix

20
We can see there is a positive correlation between chest pain (cp) & target (our
predictor). This makes sense, since a greater amount of chest pain results in a
greater chance of having heart disease.
cp (chest pain) is an ordinal feature with 4 values:
Value 1: typical angina
Value 2: atypical angina
Value 3: non-anginal pain
Value 4: asymptomatic

In addition, we see a negative correlation between exercise-induced angina (exang)
& our predictor. This makes sense, because when you exercise, your heart requires
more blood, but narrowed arteries slow down blood flow.

 Pair plots are also a great way to immediately see the correlations between all
variables. However, with so many features it can be difficult to see each one, so
I will instead make a pair plot with only our continuous features.

21
ii. subData = df[['age','trestbps','chol','thalach','oldpeak']]
sns.pairplot(subData)

Fig. 3.3 A smaller pair plot with only the continuous variables, to dive deeper into the relationships.
It is also a great way to see whether there is a positive or negative correlation.

22
3.2 Count Plot

iii. sns.countplot(x='sex', hue='target', data=df)

Fig. 3.4 Heart-disease-affected patient sex ratio.

We saw that males (sex = 1) are the most affected by heart disease.

iv. plt.figure(figsize=(20,8))
sns.countplot(x='age',data=df, hue='target')

Fig. 3.5 Variation in age for each target class.

 We saw that most people who are suffering are 58 years old,
followed by 57.
 Mostly, people belonging to the 50+ age group suffer from
the disease.

23
3.3 Violin & Box Plots

The advantage of showing the box & violin plots is that they show
the basic statistics of the data as well as its distribution. These plots are often used
to compare the distribution of a given variable across some categories.

They show the median and the IQR (minimum, first quartile (Q1), median, third quartile
(Q3), and maximum).

In addition, they can reveal the outliers in our data.

i. plt.figure(figsize=(8,4))
sns.violinplot(x='target',y='oldpeak',inner='quartile',hue='sex',data=df)
plt.title("Oldpeak vs Heart_Disease(target)")

Fig. 3.6 Outliers in OldPeak vs. Heart Disease

ii. plt.figure(figsize=(8,4))
sns.boxplot(x='target',y='thalach',hue='sex',data=df)
plt.title("Thalach vs Target")

24
Fig. 3.7 Maximum heart rate achieved (thalach) vs. Target

3.4 Filtering data by positive & negative Heart Disease patient


i. # Filtering data by POSITIVE Heart Disease patient
pos = df[df['target']==1]
pos.describe()

Fig.3.8 Data of positive heart disease patient

ii. # Filtering data by NEGATIVE Heart Disease patient


neg = df[df['target']==0]
neg.describe()

25
Fig. 3.9 Data of negative heart disease patient.

iii. print("Positive Patients ST Depression:",pos['oldpeak'].mean())


print("Negative Patients ST Depression:",pos['oldpeak'].mean())

Fig. 3.10 Positive and Negative ST depression

iv. print("Positive Patients Max Heart Rate:",pos['thalach'].mean())


print("Negative Patients Max Heart Rate:",pos['thalach'].mean())

Fig. 3.11 Positive and Negative Max Heart Rate

From comparing positive and negative patients, we can see that there are vast
differences in the means of many of our 13 features. From examining the details, we
can observe that positive patients have a heightened average maximum heart rate
achieved (thalach). In addition, positive patients exhibit about one third the
amount of ST depression induced by exercise relative to rest (oldpeak).

26
Chapter 4

4. Machine Learning + Predictive Analytics

4.1 Prepare Data for Modeling


To prepare the data for modeling, just remember ASN (Assign, Split, and
Normalize).

 Assign

 Assign the 13 features to X, & the last column to our classification predictor, y

i. X=df.drop('target',axis=1)
y=df['target']

 Split

To split the data set into the Training set and Test set.

ii. from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

 Scaling
Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples (or zero if with_mean=False),
and s is the standard deviation of the training samples (or one if with_std=False).

Centering and scaling happen independently on each feature by computing the
relevant statistics on the samples in the training set. The mean and standard deviation
are then stored, to be applied to later data using transform.

27
iii. from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

28
Chapter 5

5. Modeling /Training
Now we will train various classification models on the Training set & see
which yields the highest accuracy. We will compare the accuracy of Logistic
Regression, K-NN (K-Nearest Neighbours), SVM (Support Vector Machine),
Naive Bayes Classifier, Decision Trees, and Random Forest.

Note: These are all supervised learning models.

5.1 Model 1: Logistic Regression

i. from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
log_model = LogisticRegression()      # get instance of model
log_model.fit(X_train, y_train)       # fit the model
y_pred1 = log_model.predict(X_test)   # prediction
cm_logistic = confusion_matrix(y_test, y_pred1)   # confusion matrix
print("Confusion matrix for Logistic Regression:\n", cm_logistic)
print("\n\n\n\nClassification report for Logistic Regression:\n", classification_report(y_test, y_pred1))   # classification report

Fig.5.1 Logistic Regression Prediction [Accuracy 85%]

29
5.2 Model 2: K-NN (K-Nearest Neighbors)

ii. from sklearn.neighbors import KNeighborsClassifier

KNN_Model = KNeighborsClassifier()    # get instance of model
KNN_Model.fit(X_train, y_train)       # model training
y_pred2 = KNN_Model.predict(X_test)   # prediction
cm_KNN = confusion_matrix(y_test, y_pred2)   # confusion matrix
print("Confusion matrix for K-Nearest Neighbours:\n", cm_KNN)
print("\n\n\n\nClassification report for K-Nearest Neighbours:\n", classification_report(y_test, y_pred2))   # classification report

Fig.5.2 K-NN Prediction [Accuracy 88%]

5.3 Model 3: SVM (Support Vector Machine)

iii. from sklearn.svm import SVC

svc_Model = SVC(C=100, gamma=0.0001)
svc_Model.fit(X_train, y_train)
y_pred4 = svc_Model.predict(X_test)
cm_SVM = confusion_matrix(y_test, y_pred4)
print("Confusion matrix for Support Vector Machine:\n", cm_SVM)
print("\n\n\n\nClassification report for Support Vector Machine:\n", classification_report(y_test, y_pred4))

30
Fig.5.3 SVM Prediction [Accuracy 85%]

5.4 Model 4: Naive Bayes Classifier


iv. from sklearn.naive_bayes import GaussianNB
NBayes = GaussianNB()
NBayes.fit(X_train,y_train)
y_pred3 = NBayes.predict(X_test)
cm_Naive = confusion_matrix(y_test,y_pred3)
print("Confusion matrix for Naive Bayes:\n",cm_Naive)
print("\n\n\n\nClassification report for Naive Bayes:\n",classification_report(y_test,y_pred3))

Fig.5.4 Naïve Bayes Classifier Prediction [Accuracy 83%]

31
5.5 Model 5: Decision Trees

v. from sklearn.tree import DecisionTreeClassifier


dec_Tree  = DecisionTreeClassifier(random_state=1)
dec_Tree.fit(X_train,y_train)
y_pred5 = dec_Tree.predict(X_test)
cm_dec_Tree = confusion_matrix(y_test,y_pred5)
print("Confusion matrix for Decision Tree:\n",cm_dec_Tree )
print("\n\n\n\nClassification report for Decision Tree:\n",classification_report(y_test,y_pred5))

Fig.5.5 Decision Tree Prediction [Accuracy 77%]

5.6 Model 6: Random Forest 

vi. from sklearn.ensemble import RandomForestClassifier

RFC_Model = RandomForestClassifier(n_estimators=50)
RFC_Model.fit(X_train, y_train)
y_pred6 = RFC_Model.predict(X_test)
cm_R_Forest = confusion_matrix(y_test, y_pred6)
print("Confusion matrix for Random Forest:\n", cm_R_Forest)
print("\n\n\n\nClassification report for Random Forest:\n", classification_report(y_test, y_pred6))

32
Fig.5.6 Random Forest Prediction [Accuracy 88%]

From comparing the 6 models, we can conclude that Model 6, Random Forest,
yields the highest accuracy, at 88%.

Fig. 5.7 Models Accuracy Graph
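
A graph like Fig. 5.7 can be produced with a short snippet such as the following. This is a sketch that hard-codes the accuracies reported in Figs. 5.1–5.6; they could equally be recomputed with accuracy_score(y_test, y_predN) for each model's predictions:

models = ['Logistic Regression', 'K-NN', 'SVM', 'Naive Bayes', 'Decision Tree', 'Random Forest']
accuracies = [0.85, 0.88, 0.85, 0.83, 0.77, 0.88]   # from Figs. 5.1-5.6
plt.figure(figsize=(12, 4))
sns.barplot(x=models, y=accuracies, palette='Blues_r')   # one bar per model
plt.ylabel('Accuracy')
plt.title('Model Accuracy Comparison')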

33
Chapter 6

6. Precision, Recall, F1-score and Support

6.1 Precision: how many of the samples classified into a class are correctly classified.

6.2 Recall: how many samples of a class are found out of the whole number of
elements of that class.

6.3 F1-score: the harmonic mean of the precision and recall values.

The F1-score reaches its best value at 1 and worst value at 0.

F1 Score = 2 × ((precision × recall) / (precision + recall))

6.4 Support: the number of samples of the true response that lie in that class.
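
These are the per-class metrics printed by classification_report in Chapter 5. They can also be computed directly; a minimal sketch, using the Random Forest predictions y_pred6 from Section 5.6:

from sklearn.metrics import precision_recall_fscore_support

# per-class precision, recall, F1-score and support for the Random Forest predictions
precision, recall, f1, support = precision_recall_fscore_support(y_test, y_pred6)
for cls in (0, 1):
    print("Class %d: precision=%.2f, recall=%.2f, f1=%.2f, support=%d"
          % (cls, precision[cls], recall[cls], f1[cls], support[cls]))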

34
Chapter 7

7. Making the Confusion Matrix


i. from sklearn.metrics import confusion_matrix, accuracy_score
# y_pred holds the predictions of the model being evaluated
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

ii. print("\nAccuracy:\n", accuracy_score(y_test, y_pred))

Fig. 7.1 Confusion Matrix and Accuracy Score.

Note: A good rule of thumb is that any accuracy above 70% is considered good, but be careful,
because if your accuracy is extremely high, it may be too good to be true (an example of
overfitting). Thus, around 90% is the ideal accuracy.

7.1 How to Interpret the Confusion Matrix:

 27 is the number of True Positives in our data, while 25 is the number of True
Negatives.
 7 & 2 are the numbers of errors.
 There are 7 Type 1 errors (False Positives): we predicted positive and it is
false.
 There are 2 Type 2 errors (False Negatives): we predicted negative and it is
false.
 Hence, if we calculate the accuracy, it is # correct predictions / # total predictions.
In the formulas below, TP, FN, FP and TN represent the numbers of true positives,
false negatives, false positives and true negatives.

35
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy = (27 + 25) / (27 + 25 + 7 + 2) = 0.85 = 85% accuracy
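
The same arithmetic can be verified in code; a small sketch using the counts read from Fig. 7.1:

# recompute the accuracy by hand from the counts in Fig. 7.1
tp, tn, fp, fn = 27, 25, 7, 2
print((tp + tn) / (tp + tn + fp + fn))   # 0.852... ≈ 85%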

36
Chapter 8

8. Feature Importance
 Feature Importance provides a score that indicates how useful each feature
was in our model.
 The higher the feature score, the more that feature is used to make key
decisions & thus the more important it is.

# get importance
i. importance = RFC_Model.feature_importances_
# summarize feature importance
ii. for i, feat in enumerate(importance, start=1):
    print('Feature: %0d\tScore: %0.5f' % (i, feat))

Fig. 8.1 Feature Importance Score

37
iii. index= df.columns[:-1]
sns.barplot(x=importance,y=index,palette='Blues_r')

Fig. 8.2 From the Feature Importance graph, the top 4 significant features concluded were ST depression induced by
exercise (oldpeak), chest pain type (cp), maximum heart rate achieved (thalach), and number of major vessels
colored by fluoroscopy (ca).

38
Chapter 9

9. Predictions

9.1 Scenario:
 A patient develops cardiac symptoms & you input his vitals into the
Machine Learning algorithm.
 He is a 20-year-old male, with a chest pain value of 2 (atypical angina)
and a resting blood pressure of 110.
 In addition, he has a serum cholesterol of 230 mg/dl.
 His fasting blood sugar is > 120 mg/dl.
 He has a resting electrocardiographic result of 1.
 The patient's maximum heart rate achieved is 140.
 Also, he has exercise-induced angina.
 His ST depression induced by exercise relative to rest has a value of 2.2.
 The slope of his peak exercise ST segment is flat.
 He has no major vessels colored by fluoroscopy, and in addition
his thalassemia value indicates a reversible defect.

Based on this information, can we classify this patient as having Heart Disease?
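
A minimal sketch of how this patient could be scored directly with the Random Forest model trained in Chapter 5 (the numeric encodings, e.g. slope flat = 2 and thal reversible defect = 7, follow the feature list in Section 1.3):

# 13 features in the order of Section 1.3:
# [age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal]
patient = [[20, 1, 2, 110, 230, 1, 1, 140, 1, 2.2, 2, 0, 7]]
patient_scaled = scaler.transform(patient)   # reuse the scaler fitted in Chapter 4
print(RFC_Model.predict(patient_scaled))     # 1 = positive, 0 = negative for Heart Disease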

39
Fig. 9.1 Providing input on the designed web page for prediction.

@app.route('/predict', methods=['POST'])
def predict():
    # collect the form fields as numbers (float, since oldpeak can be a decimal)
    new_data = [float(i) for i in request.form.values()]
    new_data = np.array(new_data).reshape(1, -1)
    new_data_scaled = scaler.transform(new_data)   # apply the fitted scaler
    prediction = model.predict(new_data_scaled)
    # word() maps the 0/1 prediction to a readable label (defined elsewhere in the app)
    return render_template('result.html', prediction=word(prediction))

Fig.9.2 Output Positive

9.2 Predicting the Test set results:

 The first value represents our predicted value; the second value represents
our actual value.
 If the values match, then we predicted correctly.

40
iv. y_pred = RFC_Model.predict(X_test).reshape(-1, 1)   # predictions as a column vector
y_test = np.array(y_test).reshape(-1, 1)                # actual values as a column vector
concated = np.column_stack((y_test, y_pred))            # pair actual with predicted
dataset = pd.DataFrame(concated, columns=['actual', 'prediction'])

Fig.9.3 Random Forest Model predicted data.

 Our model achieves an accuracy of 89%!

41
Chapter 10

10. Conclusions
 Out of the 13 features we examined, the top 4 significant features that helped us
classify between a positive & negative diagnosis were chest pain type (cp),
maximum heart rate achieved (thalach), number of major vessels colored by
fluoroscopy (ca), and ST depression induced by exercise relative to rest (oldpeak).
 Our machine learning algorithm can now classify patients with Heart Disease.
Now we can properly diagnose patients & get them the help they need to recover.
By detecting these features early, we may prevent worse symptoms
from arising later.
 Our Random Forest algorithm yields the highest accuracy, 89%. Any accuracy
above 70% is considered good, but we have to be careful, because if our accuracy
is extremely high, it may be too good to be true (an example of overfitting).
Thus, around 90% is the ideal accuracy.

42
List of References

I. Ramadoss and Shah B et al., "Responding to the threat of chronic diseases in India", Lancet, 2005; 366:1744–1749. DOI: 10.1016/S0140-6736(05)67343-6.

II. Global Atlas on Cardiovascular Disease Prevention and Control. Geneva, Switzerland: World Health Organization, 2011.

III. Avinash Golande, Pavan Kumar T, "Heart Disease Prediction Using Effective Machine Learning Techniques", International Journal of Recent Technology and Engineering, Vol. 8, pp. 944–950, 2019.

IV. International Journal of Engineering Research & Technology (IJERT), Vol. 9, Issue 04, April 2020.

V. International Journal of Engineering and Technology (UAE), Vol. 7, No. 2.8. DOI: 10.14419/ijet.v7i2.8.10557.

VI. Jarar Zaidi, "Predicting Heart Disease with Classification Machine Learning Algorithms", Towards Data Science, https://towardsdatascience.com/project-predicting-heart-disease-with-classification-machine-learning-algorithms-fd69e6fdc9d6.

VII. scikit-learn documentation, https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing

43
