
COMSATS UNIVERSITY ISLAMABAD, WAH

CAMPUS WAH CANTT

Heart Disease Prediction


6/7/2023
Artificial Intelligence

Department of Computer Science, CUI Wah Campus



Submitted By
Group Members
Name Reg No. Section
Sohail Afzal SP20-BCS-164 7-D
M. Usman SP20-BCS-160 7-D

Submitted To
Instructor : Engr. Adnan Saleem Mughal Sb
Table of Contents
Introduction
    Problem Statement
    Project Goal
    Methodology
    Proposed Model
Data Description
    Data Set Overview
    Feature Description
Data Preprocessing
    Handling Missing Values
    Handling Categorical Variables
Exploratory Data Analysis (EDA)
    Summary Statistics
    Visualizations
Model Building
    Logistic Regression
        Model Description
        Training and Evaluation
        Performance
    Naïve Bayes
        Model Description
        Training and Evaluation
        Performance
    Support Vector Machine (SVM)
        Model Description
        Training and Evaluation
        Performance
    K-Nearest Neighbors (KNN)
        Model Description
        Training and Evaluation
        Performance
    Decision Tree
        Model Description
        Training and Evaluation
        Performance
    Neural Networks
        Model Description
        Training and Evaluation
        Performance
Model Evaluation
    Comparison of Accuracy Scores
Code
Conclusion
Resources

Introduction
Problem Statement
Heart disease is a leading cause of death in the United States. According to the American Heart
Association, one in four deaths in the United States is caused by heart disease. There are many
factors that can contribute to heart disease, including high blood pressure, high cholesterol,
smoking, diabetes, and obesity.
Project Goal
The goal of this project is to develop a machine learning algorithm that can predict the risk of
heart disease. By analyzing large datasets of patient data, machine learning algorithms can
identify patterns that can be used to predict who is at risk for heart disease. This information
can be used to help doctors identify patients who need to be monitored more closely and to
develop preventive measures.
Methodology

Figure 1

Proposed Model

Figure 2

Data Description
Data Set Overview
The data set used in this project is the Heart Disease data set from Kaggle, distributed as a
CSV file. It contains 1025 entries and 14 attributes. Each row represents a patient and each
column represents a specific attribute or feature associated with that patient. The target
variable takes the values 1 and 0: target = 1 means presence of heart disease and target = 0
means absence. The remaining 13 attributes are the features: age, sex, chest pain type, resting
blood pressure, cholesterol level, fasting blood sugar, resting electrocardiographic results,
maximum heart rate achieved, exercise-induced angina, ST depression induced by exercise,
slope of the peak exercise ST segment, number of major vessels colored by fluoroscopy, and
thalassemia.
Feature Description
Age: The age of the patient in years.
Sex: The sex of the patient (1 = male, 0 = female).
Chest Pain Type: The type of chest pain experienced by the patient (1 = typical angina, 2 =
atypical angina, 3 = non-anginal pain, 4 = asymptomatic).
Resting Blood Pressure: The resting blood pressure of the patient in mm Hg.
Serum Cholesterol: The serum cholesterol level of the patient in mg/dl.
Fasting Blood Sugar: Whether the fasting blood sugar level of the patient exceeds 120 mg/dl
(1 = true, 0 = false).

Resting Electrocardiographic Results: The results of the resting electrocardiogram (ECG) test
(0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left
ventricular hypertrophy).
Exercise-Induced Angina: Whether the patient experienced angina due to exercise (1 = yes, 0
= no).
Slope of the Peak Exercise ST Segment: The slope of the peak exercise ST segment (1 =
upsloping, 2 = flat, 3 = downsloping).
Thalassemia: A blood disorder that affects the amount of oxygen carried by red blood cells (3
= normal, 6 = fixed defect, 7 = reversible defect).

Data Preprocessing
Handling Missing Values
Handling missing values is an important step in data preprocessing. Missing values can occur
in datasets due to various reasons such as incomplete data collection, data corruption, or data
entry errors. It is crucial to address missing values before performing any analysis or building
a machine learning model.
If the number of missing values is small relative to the size of the dataset, removing the rows
with missing values can be a viable option. However, this approach can lead to information
loss if the missing values are not randomly distributed. Our data set does not contain any
missing values.
We checked this with dataset.info(), which reports the number of non-null entries in every
column. A screenshot of its output follows.
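
As a complement to dataset.info(), the missing values can also be counted directly. The sketch below is only an illustration: the small toy frame and its values are hypothetical stand-ins for the real CSV, and one value is removed on purpose so the check has something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the heart disease CSV.
toy = pd.DataFrame({
    "age": [52, 53, 70, 61],
    "chol": [212.0, np.nan, 174.0, 203.0],  # one value deliberately missing
    "target": [0, 0, 0, 1],
})

# Count missing values per column; all zeros means nothing to drop or impute.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Dropping incomplete rows is reasonable only when few rows are affected.
cleaned = toy.dropna()
print(cleaned.shape)
```

On the real data set, every count from dataset.isnull().sum() would be expected to be zero, in line with the dataset.info() output.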

Handling Categorical Variables
Categorical variables are variables that take on discrete values and represent categories or
groups. These variables can be nominal (no specific order) or ordinal (ordered categories).
Many machine learning algorithms require numerical inputs, so handling categorical variables
is essential in the preprocessing stage.
In our code we examine the categorical attributes of the data set in one or two places. For
example, dataset["sex"].unique() returns the distinct values of the "sex" column, which
confirms that it is a categorical variable and shows which categories it takes.
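
The inspection step can be sketched as follows. The toy frame is hypothetical, and the one-hot encoding with pd.get_dummies is an optional extra step that the project code does not perform; it is shown only as one common way to turn a nominal column into numerical inputs.

```python
import pandas as pd

# Hypothetical toy frame with two categorical columns.
toy = pd.DataFrame({
    "sex": [1, 0, 1, 1],
    "cp": [0, 2, 1, 2],
})

# Inspect the distinct categories of each column.
print(sorted(toy["sex"].unique().tolist()))
print(sorted(toy["cp"].unique().tolist()))

# Optional (not in the project code): one-hot encode the nominal column "cp".
encoded = pd.get_dummies(toy, columns=["cp"], prefix="cp")
print(encoded.columns.tolist())
```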

Exploratory Data Analysis (EDA)


Summary Statistics
We computed summary statistics as part of the exploratory data analysis. The code,
screenshots of its output, and the details are given below.
dataset.shape: Gives the shape of the dataset, indicating the number of rows and columns.
dataset.head(5): Displays the first five rows of the dataset.
dataset.describe(): Generates summary statistics of the numerical variables in the dataset, such
as count, mean, standard deviation, minimum, and maximum values.

Visualizations
Our code contains various visualizations using the seaborn and matplotlib libraries. These
visualizations provide insights into the dataset's features and relationships. Some of the
visualizations present in the code are:
sns.countplot(x="target", data=dataset): Displays a count plot of the target variable,
indicating the distribution of heart disease presence and absence.
sns.barplot(x="sex", y="target", data=dataset): Plots a bar plot showing the relationship
between the sex variable and the target variable.
sns.barplot(x="ca", y="target", data=dataset): Creates a bar plot illustrating the association
between the ca variable and the target variable.
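
The numbers behind these plots can be reproduced with pandas alone. By default sns.barplot draws the mean of y for each category of x, so with a 0/1 target each bar is the fraction of patients in that group who have heart disease. The sketch below uses a hypothetical toy frame, not the real data.

```python
import pandas as pd

# Hypothetical toy frame: eight patients with sex and target columns.
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 1, 1, 1, 1, 1],
    "target": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Number of patients in each target class (what the count plot draws).
class_counts = toy["target"].value_counts().sort_index()
print(class_counts)

# Mean of target per sex (the bar heights in the sex vs. target bar plot).
disease_rate_by_sex = toy.groupby("sex")["target"].mean()
print(disease_rate_by_sex)
```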

Figure 3

Figure 4

Figure 5

Figure 6

Model Building
Logistic Regression
Model Description
Logistic regression is a classification algorithm used to predict the probability of a binary
outcome. In this heart disease project, logistic regression is applied to predict whether a patient
has heart disease (target variable) based on various input features such as age, sex, chest pain
type, blood pressure, cholesterol levels, and more. Logistic regression assumes a linear
relationship between the input features and the log-odds of the target variable.
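
To make the model concrete, the sketch below computes one prediction by hand. The two weights, the bias, and the feature values are hypothetical numbers chosen for illustration; a fitted model would learn its own coefficients from the training data.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and (already scaled) feature values.
weights = [0.8, -1.2]
bias = 0.1
features = [1.5, 0.5]

# Linear part: the log-odds of the positive class.
log_odds = bias + sum(w * x for w, x in zip(weights, features))

# Non-linear part: convert log-odds to a probability, then threshold at 0.5.
probability = sigmoid(log_odds)
predicted_class = 1 if probability >= 0.5 else 0

print(log_odds, probability, predicted_class)
```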
Training and Evaluation
The dataset is divided into training and testing sets using the train_test_split() function from
the sklearn.model_selection module. The training set (X_train, Y_train) is used to train the
logistic regression model.
The logistic regression model is instantiated using LogisticRegression() from the
sklearn.linear_model module.
The model is trained on the training set using the fit() method: lr.fit(X_train, Y_train).
Once the model is trained, predictions are made on the testing set using the predict() method:
Y_pred_lr = lr.predict(X_test).

The accuracy score is calculated using accuracy_score(Y_pred_lr, Y_test) to evaluate the
performance of the logistic regression model.
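
The steps above can be sketched end to end as follows. Because heart.csv is not bundled here, the example uses synthetic data from make_classification as a stand-in; the StandardScaler step, the stratify argument, and max_iter=1000 are assumptions added for a stable fit and are not part of the original code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 13 predictors and the binary target.
X, y = make_classification(n_samples=500, n_features=13, n_informative=6,
                           random_state=0)

# Hold out 20% of the rows for testing, as in the project code.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)

# Scaling is an added assumption; it helps the solver converge.
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr.fit(X_train, Y_train)

Y_pred_lr = lr.predict(X_test)

# accuracy_score takes the true labels first, then the predictions.
score_lr = round(accuracy_score(Y_test, Y_pred_lr) * 100, 2)
print("The accuracy score achieved using Logistic Regression is: " + str(score_lr) + " %")
```

Accuracy is symmetric in its two arguments, so the argument order used in the project code gives the same number.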

Performance
The accuracy score achieved by logistic regression is stored in score_lr and printed using
print("The accuracy score achieved using Logistic Regression is: " + str(score_lr) + " %").
The same training, evaluation, and reporting steps are repeated for the other machine
learning algorithms: Naive Bayes, Support Vector Machine, K-Nearest Neighbors, Decision
Tree, and Neural Network. The accuracy scores for all algorithms are stored in the scores list
and displayed using a bar plot.

Naïve Bayes
Model Description
Naive Bayes is a classification algorithm that is based on Bayes' theorem and assumes that
features are conditionally independent given the target variable. In this heart disease project,
the Gaussian Naive Bayes variant is used. It assumes that the continuous input features follow
a Gaussian distribution.
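
The calculation behind Gaussian Naive Bayes can be worked through by hand. In the sketch below, the per-class means, standard deviations, and priors are hypothetical numbers, not statistics from the real data set; the point is only to show how the per-feature Gaussian likelihoods are multiplied with the prior and normalised.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    coeff = 1.0 / (std * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mean) ** 2) / (2.0 * std ** 2))

# Hypothetical per-class statistics: feature -> (mean, std), plus a prior.
stats = {
    0: {"prior": 0.5, "age": (50.0, 10.0), "chol": (200.0, 20.0)},
    1: {"prior": 0.5, "age": (60.0, 10.0), "chol": (240.0, 20.0)},
}
patient = {"age": 60.0, "chol": 220.0}

# Naive assumption: multiply the per-feature likelihoods with the prior.
scores = {}
for label, class_stats in stats.items():
    score = class_stats["prior"]
    for feature, value in patient.items():
        mean, std = class_stats[feature]
        score *= gaussian_pdf(value, mean, std)
    scores[label] = score

# Normalise so the two posteriors sum to one, then pick the larger.
total = sum(scores.values())
posterior = {label: score / total for label, score in scores.items()}
predicted = max(posterior, key=posterior.get)
print(posterior, predicted)
```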
Training and Evaluation
The dataset is divided into training and testing sets using the train_test_split() function. The
Naive Bayes model is instantiated using GaussianNB() from the sklearn.naive_bayes module.
The model is trained on the training set using the fit() method: nb.fit(X_train, Y_train).
Predictions are made on the testing set using the predict() method: Y_pred_nb =
nb.predict(X_test)

Performance
The accuracy score achieved by Naive Bayes is calculated using accuracy_score(Y_pred_nb,
Y_test) and stored in score_nb. The score is then printed using print("The accuracy score
achieved using Naive Bayes is: " + str(score_nb) + " %").

Like logistic regression, the code also includes training, evaluation, and performance for other
machine learning algorithms, such as Support Vector Machine, K-Nearest Neighbors, and
Decision Tree.

Support Vector Machine SVM


Model Description
SVM is a powerful supervised learning algorithm used for classification and regression tasks.
It separates the data points by creating a hyperplane that maximizes the margin between
different classes. In this heart disease project, a linear SVM model is used.
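
The idea of a maximum-margin hyperplane can be checked on a tiny example. The four points below are hypothetical and linearly separable; the large value C=1000.0 is an assumption used to approximate a hard margin. For a linear SVM with weight vector w, the margin width is 2 / ||w||.

```python
import numpy as np
from sklearn import svm

# Hypothetical separable data: class 0 at x = 0, class 1 at x = 2.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A large C approximates a hard-margin classifier (an assumption here).
clf = svm.SVC(kernel="linear", C=1000.0)
clf.fit(X, y)

# The separating hyperplane is w . x + b = 0.
w = clf.coef_[0]
b = clf.intercept_[0]

# Distance between the two margin boundaries.
margin_width = 2.0 / np.linalg.norm(w)
print(w, b, margin_width)
```

Here the two classes are two units apart along the first axis, so the widest possible margin is about 2.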
Training and Evaluation
The dataset is divided into training and testing sets using the train_test_split() function. The
SVM model is instantiated using svm.SVC(kernel='linear') from the sklearn.svm module. The
model is trained on the training set using the fit() method: sv.fit(X_train, Y_train). Predictions
are made on the testing set using the predict() method: Y_pred_svm = sv.predict(X_test).

Performance
The accuracy score achieved by SVM is calculated using accuracy_score(Y_pred_svm, Y_test)
and stored in score_svm. The score is then printed using print("The accuracy score achieved
using Linear SVM is: " + str(score_svm) + " %").
Similarly, the code also includes training, evaluation, and performance for other machine
learning algorithms such as Logistic Regression, Naive Bayes, K-Nearest Neighbors, and
Decision Tree.

K-Nearest Neighbors (KNN)


Model Description
K-Nearest Neighbors (KNN) is a non-parametric classification algorithm that classifies new
instances based on their similarity to the k nearest neighbors in the training data. In this heart
disease project, a KNN classifier with n_neighbors=7 is used.
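
The voting rule can be written in a few lines of plain Python. This is a minimal sketch with hypothetical two-dimensional points, not the scikit-learn implementation; knn_predict is a helper name introduced only for this illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Return the majority label among the k training points nearest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data: two well separated groups.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6), (7, 7)]
train_labels = [0, 0, 0, 1, 1, 1, 1]

near_first_group = knn_predict(train_points, train_labels, (2, 2), k=3)
near_second_group = knn_predict(train_points, train_labels, (6.5, 6.5), k=3)

# With k equal to the whole training set, the majority class always wins.
all_neighbors = knn_predict(train_points, train_labels, (2, 2), k=7)
print(near_first_group, near_second_group, all_neighbors)
```

The last call shows why the choice of k matters: with too many neighbors the prediction drifts toward the overall majority class regardless of where the query lies.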
Training and Evaluation
The dataset is divided into training and testing sets using the train_test_split() function. The
KNN model is instantiated using KNeighborsClassifier(n_neighbors=7) from the
sklearn.neighbors module. The model is trained on the training set using the fit() method:
knn.fit(X_train, Y_train). Predictions are made on the testing set using the predict() method:
Y_pred_knn = knn.predict(X_test).

Performance
The accuracy score achieved by KNN is calculated using accuracy_score(Y_pred_knn, Y_test)
and stored in score_knn. The score is then printed using print("The accuracy score achieved
using KNN is: " + str(score_knn) + " %"). Similar to logistic regression, Naive Bayes, SVM,
and Decision Tree, the code also includes training, evaluation, and performance metrics for
KNN.

Decision Tree
Model Description
Decision Tree is a non-parametric supervised learning algorithm used for both classification
and regression tasks. It creates a tree-like model of decisions and their possible consequences.
In this heart disease project, a Decision Tree classifier is used.
Training and Evaluation
The dataset is divided into training and testing sets using the train_test_split() function. A loop
over 200 values of the random_state parameter trains a DecisionTreeClassifier(random_state=x)
from the sklearn.tree module for each value and records the value that gives the highest
accuracy on the testing set (best_x). The final model is then instantiated with that value: dt =
DecisionTreeClassifier(random_state=best_x). The model is trained on the training set using
the fit() method: dt.fit(X_train, Y_train). Predictions are made on the testing set using the
predict() method: Y_pred_dt = dt.predict(X_test). Because best_x is selected using the testing
set, the reported score is an optimistic estimate of performance on unseen patients.
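
One alternative, sketched below under stated assumptions, is to choose the seed with cross-validation on the training set only, so that the testing set is touched once at the end. This is a swapped-in selection procedure, not what the project code does; the data is synthetic, and the range of 10 seeds and the 5 folds are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease predictors and target.
X, y = make_classification(n_samples=400, n_features=13, n_informative=6,
                           random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

# Select the seed using 5-fold cross-validation on the training set only.
best_x, best_cv_score = None, 0.0
for x in range(10):
    candidate = DecisionTreeClassifier(random_state=x)
    cv_score = cross_val_score(candidate, X_train, Y_train, cv=5).mean()
    if cv_score > best_cv_score:
        best_x, best_cv_score = x, cv_score

# Refit on the full training set and evaluate once on the held-out test set.
dt = DecisionTreeClassifier(random_state=best_x)
dt.fit(X_train, Y_train)
score_dt = round(dt.score(X_test, Y_test) * 100, 2)
print("Selected random_state:", best_x, "test accuracy:", score_dt, "%")
```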

Performance
The accuracy score achieved by the Decision Tree classifier is calculated using
accuracy_score(Y_pred_dt, Y_test) and stored in score_dt. The score is then printed using
print("The accuracy score achieved using Decision Tree is: " + str(score_dt) + " %"). Similarly,
the code includes training, evaluation, and performance metrics for other machine learning
algorithms such as Logistic Regression, Naive Bayes, SVM, K-Nearest Neighbors and Neural
Networks.

Neural Networks
Model Description
We implemented the neural network using the Keras library. It consists of two dense
layers:
The first dense layer has 11 units/neurons with the ReLU activation function. It takes an input
of dimension 13.
The second dense layer has 1 unit/neuron with the sigmoid activation function, which is used
for binary classification.
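
The shape of this network can be illustrated with a NumPy forward pass. This is only a sketch of the 13 -> 11 -> 1 architecture with randomly initialised, untrained weights; it does not use Keras and does not reproduce the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer 1: 13 inputs -> 11 hidden units. Layer 2: 11 hidden units -> 1 output.
W1 = rng.normal(scale=0.1, size=(13, 11))
b1 = np.zeros(11)
W2 = rng.normal(scale=0.1, size=(11, 1))
b2 = np.zeros(1)

def forward(X):
    """Forward pass: ReLU hidden layer followed by a sigmoid output."""
    hidden = np.maximum(0.0, X @ W1 + b1)
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical batch of four patients with 13 features each.
batch = rng.normal(size=(4, 13))
probabilities = forward(batch)

# Trainable parameters: (13*11 + 11) + (11*1 + 1).
n_parameters = W1.size + b1.size + W2.size + b2.size
print(probabilities.shape, n_parameters)
```

The network has 166 trainable parameters in total, which is small compared with the 820 training rows produced by an 80/20 split of 1025 entries.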
Training and Evaluation
We compiled the model using the compile() method with the binary cross-entropy loss
function (loss='binary_crossentropy'), the Adam optimizer (optimizer='adam'), and the
accuracy metric computed during training (metrics=['accuracy']). The model is trained on
the training set for 300 epochs using the fit() method: model.fit(X_train, Y_train, epochs=300)
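
The binary cross-entropy loss minimised during training can be computed by hand. The sketch below is a plain NumPy version with hypothetical labels and predicted probabilities; the clipping constant eps is an assumption added to avoid taking log(0).

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(losses))

# Hypothetical labels and two sets of predicted probabilities.
labels = [1, 0, 1, 0]
confident_right = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.2])
confident_wrong = binary_cross_entropy(labels, [0.1, 0.9, 0.2, 0.8])
print(confident_right, confident_wrong)

# Class predictions are obtained by thresholding the probabilities at 0.5.
predicted_classes = (np.array([0.9, 0.1, 0.8, 0.2]) >= 0.5).astype(int)
print(predicted_classes.tolist())
```

Confident correct predictions give a small loss, while confident wrong predictions are penalised heavily, which is what drives the optimizer toward well-calibrated outputs.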

Performance
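The network outputs a probability for each test patient using model.predict(X_test). Each
probability is rounded to 0 or 1, and the accuracy score is calculated using
accuracy_score(Y_pred_nn, Y_test) and stored in score_nn. The score is then printed using
print("The accuracy score achieved using Neural Network is: " + str(score_nn) + " %").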

Model Evaluation
Comparison of Accuracy Scores
Model                           Accuracy Score
Logistic Regression             86.34%
Naïve Bayes                     85.37%
Support Vector Machine (SVM)    83.9%
K-Nearest Neighbors (KNN)       72.2%
Decision Tree                   100%
Neural Network                  85.85%

Of all the models applied in this project, the Decision Tree achieved the highest accuracy
score (100%), so we consider it the best model.
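
Accuracy alone does not show which kind of error a model makes. As a hedged illustration that is not part of the project code listing, the sketch below derives a confusion matrix, precision, and recall from a hypothetical set of ten test labels and predictions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical true labels and predictions for ten test patients.
Y_test = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
Y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(Y_test, Y_pred).ravel()

accuracy = accuracy_score(Y_test, Y_pred)
precision = precision_score(Y_test, Y_pred)  # tp / (tp + fp)
recall = recall_score(Y_test, Y_pred)        # tp / (tp + fn)
print(tn, fp, fn, tp, accuracy, precision, recall)
```

In a diagnostic setting, a false negative (a patient with heart disease predicted as healthy) is usually the more costly error, so recall is worth reporting next to accuracy.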

Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import os
print(os.listdir())

import warnings
warnings.filterwarnings('ignore')
dataset = pd.read_csv("/content/heart.csv")
type(dataset)
dataset.shape
dataset.head(5)
dataset.sample(5)
dataset.describe()
dataset.info()
# Human-readable description of each of the 13 predictor columns.
info = [
    "age",
    "1: male, 0: female",
    "chest pain type, 1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic",
    "resting blood pressure",
    "serum cholesterol in mg/dl",
    "fasting blood sugar > 120 mg/dl",
    "resting electrocardiographic results (values 0,1,2)",
    "maximum heart rate achieved",
    "exercise induced angina",
    "oldpeak = ST depression induced by exercise relative to rest",
    "the slope of the peak exercise ST segment",
    "number of major vessels (0-3) colored by fluoroscopy",
    "thal: 3 = normal; 6 = fixed defect; 7 = reversible defect",
]

for i in range(len(info)):
    print(dataset.columns[i]+":\t\t\t"+info[i])
dataset["target"].describe()
dataset["target"].unique()
print(dataset.corr()["target"].abs().sort_values(ascending=False))
y = dataset["target"]
sns.countplot(x="target", data=dataset)
target_temp = dataset["target"].value_counts()
print(target_temp)

# Divide by the actual number of rows instead of a hard-coded count.
print("Percentage of patients without heart problems: "+str(round(target_temp[0]*100/len(dataset),2)))
print("Percentage of patients with heart problems: "+str(round(target_temp[1]*100/len(dataset),2)))
dataset["sex"].unique()
sns.barplot(x="sex", y="target", data=dataset)
dataset["cp"].unique()
sns.barplot(x="cp", y="target", data=dataset)
dataset["fbs"].describe()
dataset["fbs"].unique()
sns.barplot(x="fbs", y="target", data=dataset)
dataset["restecg"].unique()
sns.barplot(x="restecg", y="target", data=dataset)
dataset["exang"].unique()
sns.barplot(x="exang", y="target", data=dataset)
dataset["slope"].unique()
sns.barplot(x="slope", y="target", data=dataset)
dataset["ca"].unique()
sns.barplot(x="ca", y="target", data=dataset)
sns.countplot(x="ca", data=dataset)
dataset["thal"].unique()
sns.barplot(x="thal", y="target", data=dataset)
sns.distplot(dataset["thal"])
from sklearn.model_selection import train_test_split
predictors = dataset.drop("target",axis=1)
target = dataset["target"]
X_train,X_test,Y_train,Y_test = train_test_split(predictors,target,test_size=0.20,random_state=0)
X_train.shape
X_test.shape
Y_train.shape
Y_test.shape
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train,Y_train)
Y_pred_lr = lr.predict(X_test)
Y_pred_lr.shape

score_lr = round(accuracy_score(Y_pred_lr,Y_test)*100,2)
print("The accuracy score achieved using Logistic Regression is: "+str(score_lr)+" %")
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train,Y_train)
Y_pred_nb = nb.predict(X_test)
Y_pred_nb.shape
score_nb = round(accuracy_score(Y_pred_nb,Y_test)*100,2)
print("The accuracy score achieved using Naive Bayes is: "+str(score_nb)+" %")
from sklearn import svm
sv = svm.SVC(kernel='linear')
sv.fit(X_train, Y_train)
Y_pred_svm = sv.predict(X_test)
Y_pred_svm.shape
score_svm = round(accuracy_score(Y_pred_svm,Y_test)*100,2)
print("The accuracy score achieved using Linear SVM is: "+str(score_svm)+" %")
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train,Y_train)
Y_pred_knn=knn.predict(X_test)
Y_pred_knn.shape
score_knn = round(accuracy_score(Y_pred_knn,Y_test)*100,2)
print("The accuracy score achieved using KNN is: "+str(score_knn)+" %")
from sklearn.tree import DecisionTreeClassifier
max_accuracy = 0
# Try 200 seeds and keep the one with the highest test-set accuracy.
for x in range(200):
    dt = DecisionTreeClassifier(random_state=x)
    dt.fit(X_train,Y_train)
    Y_pred_dt = dt.predict(X_test)
    current_accuracy = round(accuracy_score(Y_pred_dt,Y_test)*100,2)
    if(current_accuracy>max_accuracy):
        max_accuracy = current_accuracy
        best_x = x
dt = DecisionTreeClassifier(random_state=best_x)
dt.fit(X_train,Y_train)
Y_pred_dt = dt.predict(X_test)
print(Y_pred_dt.shape)

score_dt = round(accuracy_score(Y_pred_dt,Y_test)*100,2)
print("The accuracy score achieved using Decision Tree is: "+str(score_dt)+" %")
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(11,activation='relu',input_dim=13))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.fit(X_train,Y_train,epochs=300)
Y_pred_nn = model.predict(X_test)
Y_pred_nn.shape
rounded = [round(x[0]) for x in Y_pred_nn]
Y_pred_nn = rounded
score_nn = round(accuracy_score(Y_pred_nn,Y_test)*100,2)
print("The accuracy score achieved using Neural Network is: "+str(score_nn)+" %")
algorithms = ["Logistic Regression", "Naive Bayes", "Support Vector Machine", "K-Nearest Neighbors", "Decision Tree", "Neural Network"]
scores = [score_lr, score_nb, score_svm, score_knn, score_dt, score_nn]

for i in range(len(algorithms)):
    print("The accuracy score achieved using " + algorithms[i] + " is: " + str(scores[i]) + " %")
import matplotlib.pyplot as plt
sns.set(rc={'figure.figsize':(15,8)})
plt.xlabel("Algorithms")
plt.ylabel("Accuracy score")
sns.barplot(x=algorithms, y=scores)
plt.show()

Conclusion
In this heart disease project, we used machine learning to create models that can help classify
whether a person has heart disease based on their information. We tried different models like
Logistic Regression, Naive Bayes, Support Vector Machine, K-Nearest Neighbors, Decision
Tree, and Neural Network.
We found that all the models did a decent job of predicting heart disease, but each had its own
strengths. Logistic Regression and Naive Bayes were simple and easy to understand, while
Support Vector Machine and Decision Tree worked well with complex patterns. K-Nearest
Neighbors adapted to different types of data. The Neural Network is able to capture complex
relationships, but the Decision Tree achieved the highest accuracy score.
Choosing the best model depends on what's important for the specific situation. We looked at
accuracy scores, confusion matrices, and classification reports to evaluate the models. It's also
essential to consider any specific knowledge about heart disease.
Overall, every algorithm except one scored above 80%. KNN performed worst, with a score of
about 72%, which we do not consider sufficient. The Decision Tree scored 100%, the best
result, which makes it our preferred model for diagnosing the disease.

Resources
Data Set
We took the data set from Kaggle.
https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset/download?datasetVersionNumber=2

To understand the project, we used a research article; the link is given below.
https://cse.anits.edu.in/projects/projects2021C3.pdf
