Heart Disease Prediction


University College of Engineering, BIT Campus, Anna University,
Tiruchirappalli.

NM ID: au810021114002
Name: R. AAKASH

Trainer Name: RAMAR BOSE, Sr. AI Master Trainer

ABSTRACT
Heart disease cases are rising at an alarming rate, and it is critical to be able to predict
these diseases in advance. The project focuses on predicting which patients are more likely
to have heart disease based on a variety of medical factors. Cardiovascular disease refers
to any critical condition that impacts the heart. Because heart diseases can be
life-threatening, researchers are focusing on designing smart systems to accurately
diagnose them based on electronic health data, with the aid of machine learning algorithms.
This work presents several machine learning approaches for predicting heart diseases,
using data of major health factors from patients. The paper demonstrated four classification
methods: Multilayer Perceptron (MLP), Support Vector Machine (SVM), Random Forest (RF),
and Naïve Bayes (NB), to build the prediction models. Data preprocessing and feature
selection steps were done before building the models. The models were evaluated based
on the accuracy, precision, recall, and F1-score. The SVM model performed best with 91.67%
accuracy. Heart disease is a significant global health concern, and early prediction plays a
crucial role in improving patient outcomes. Researchers have explored various AI
techniques to enhance heart disease prediction accuracy.

1. Problem statement
2. Data collection
3. Existing solution
4. Proposed solution with used models
5. Result

INDEX
Sr. No.   Table of Contents

1   Chapter 1: Introduction
2   Chapter 2: Services and Tools Required
3   Chapter 3: Project Architecture
4   Chapter 4: Modeling and Project Outcome
5   Conclusion
6   Future Scope
7   References
8   Links

CHAPTER 1
INTRODUCTION
1.1 Problem Statement

This project performs heart disease prediction using logistic regression. The World Health
Organization has estimated that more than four out of five cardiovascular disease (CVD) deaths
are due to heart attacks and strokes. This study aims to identify the proportion of patients who
are at high risk of CVD and to predict their overall risk using logistic regression.

Currently, the health care sector generates data from many facilities and patients. By making
good use of this data, doctors can anticipate better methods of treatment and improve the overall
delivery system of the health care sector. One important use is that the Python ecosystem can help
extract valuable insights from health care data. Moreover, Python is one of the most widely used
programming languages in the world. 32% of UK individuals considered it a secure language for
developing healthcare applications.

High levels of LDL cholesterol, or “bad” cholesterol, can cause the most common form of heart
disease, coronary artery disease (CAD). CAD arises from plaque that has developed in the arteries
of the patient’s heart, and it has no symptoms in its early stages. Patients can experience
symptoms such as chest pain, shortness of breath, and fatigue when the plaque grows large enough
to obstruct blood flow.

Additionally, health care projects built with Python must meet HIPAA (Health Insurance
Portability and Accountability Act) requirements for handling healthcare records. In this
context, according to Nithya and Ilango, Python supports computer security, as it has built-in
tools that provide software-defined security. According to McPadden et al., Python is currently
used in the health care field for data science and machine learning applications that improve
patient outcomes. In the opinion of Panesar, machine learning algorithms encourage healthcare
analytics to use Python, as developers can easily build tracking and health monitoring
applications. Thus, in this project as well, Python is used for detecting heart disease.

1.2 Proposed Solution

In the proposed system, the UCI cardiac disease dataset is analyzed through suitable data
acquisition, preprocessing by cleaning the data, and then selecting the features that have a high
correlation with the target variable. A logistic regression model is then trained and tested to
predict whether cardiac disease is present or not. The figure shows the workflow for building the
logistic regression cardiac disease classification model.
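
The correlation-based selection step described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the data is synthetic, the column names (cp, thalach, chol) are borrowed from the UCI table purely for readability, and the threshold of 0.2 and the helper select_by_correlation are hypothetical choices, not values taken from the project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the UCI heart disease table (made-up values).
rng = np.random.default_rng(0)
n = 300
target = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "cp": target * 1.5 + rng.normal(0, 1, n),       # strongly related to the target
    "thalach": target * 1.0 + rng.normal(0, 1, n),  # moderately related
    "chol": rng.normal(0, 1, n),                    # unrelated noise
    "target": target,
})

def select_by_correlation(frame, target_col, threshold):
    """Return feature names whose Pearson correlation with the target exceeds threshold."""
    corr = frame.corr()[target_col].drop(target_col)
    return corr[corr > threshold].index.tolist()

# Hypothetical threshold; keeps only positively correlated features.
selected = select_by_correlation(df, "target", threshold=0.2)
print(selected)
```

Only positive correlations are kept here because the report speaks of high positive correlation values; a variant that uses the absolute value would also retain strongly negative predictors.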

1.3 Features

• The UCI dataset is used to predict disease.
• Features are selected based on high positive correlation values with the target, and the data
is used in random order (without sorting).
• The performance of the model is evaluated with five different training and testing ratios of
the dataset.
• These ratios are used to check the behavior of the model from low to high proportions of
training data.
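
Evaluating the model at several training and testing ratios might look like the sketch below. It assumes scikit-learn and synthetic data from make_classification; the five training fractions are hypothetical, since the report names only the 90/10 split explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the selected UCI features (assumption).
X, y = make_classification(n_samples=300, n_features=6, n_informative=4, random_state=0)

# Hypothetical training fractions; only the 90/10 split is named in the report.
train_fractions = [0.5, 0.6, 0.7, 0.8, 0.9]

results = {}
for frac in train_fractions:
    # shuffle=True uses the rows in random order rather than sorted order.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=frac, shuffle=True, stratify=y, random_state=0
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    results[frac] = accuracy_score(y_test, model.predict(X_test))

for frac, acc in results.items():
    print(f"train={frac:.0%} test={1 - frac:.0%} accuracy={acc:.3f}")
```

Because the test set shrinks as the training share grows, accuracy at the 90/10 split rests on few samples and can vary noticeably between random seeds.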

1.4 Advantages

Accurate predictions obtained through logistic regression can guide
healthcare professionals in identifying individuals at risk and implementing
preventive measures or tailored treatment plans. The computational efficiency
of the model further enhances its applicability in real-time decision support
systems.

1.5 Scope

The system serves as a tool to determine a patient's personal illness risk. The
framework can be modified to work with additional models, including neural
networks, ensemble techniques, etc. Various machine learning techniques, such as
decision trees (DT), naive Bayes (NB), and support vector machines (SVM), can
also be used to predict cardiac diseases. The scope for heart disease prediction
is vast and encompasses various aspects:

1. Early Detection: Predictive models can help in the early detection of heart disease by identifying
individuals at higher risk based on their demographic, lifestyle, and clinical characteristics.
2. Preventive Healthcare: Heart disease prediction enables healthcare providers to offer
personalized preventive interventions and lifestyle modifications to individuals identified as
high-risk, thereby reducing the likelihood of developing cardiovascular complications.
3. Population Health Management: Heart disease prediction at the population level enables
public health authorities and policymakers to implement targeted interventions and policies aimed at
reducing cardiovascular disease burden within communities.
4. Research and Development: Predictive modeling facilitates research into the
underlying risk factors, biomarkers, and genetic predispositions associated with heart disease,
leading to advancements in disease understanding, prevention, and treatment.

CHAPTER 2
SERVICES AND TOOLS REQUIRED
To develop a heart disease prediction model using logistic regression, you'll need a combination of services
and tools for data processing, model training, evaluation, and deployment. Here's a list of essential services
and tools:

1. Data Collection and Storage:

o Data Collection Services: APIs or tools for accessing healthcare databases, electronic health
records (EHR), or clinical research datasets.
o Data Storage: Cloud-based storage solutions like Amazon S3, Google Cloud Storage, or Azure
Blob Storage for storing collected healthcare data securely.

2. Data Preprocessing:

o Data Cleaning Tools: Python libraries such as Pandas for data cleaning, handling missing
values, and removing duplicates.
o Feature Engineering Tools: Scikit-learn for feature scaling, normalization, encoding
categorical variables, and creating new features if necessary.

3. Model Development:

o Machine Learning Libraries: Scikit-learn, TensorFlow, or PyTorch for implementing
logistic regression models and other machine learning algorithms.
o Hyperparameter Tuning: Tools like GridSearchCV or RandomizedSearchCV from
Scikit-learn for hyperparameter optimization.

4. Model Evaluation:

o Evaluation Metrics: Scikit-learn provides functions for computing classification metrics
such as accuracy, precision, recall, F1-score, and ROC-AUC.
o Cross-Validation: Cross-validation techniques for assessing model performance on multiple
folds of the data to ensure robustness.

5. Deployment:

o Model Deployment Platforms: Services like Amazon SageMaker, Google Cloud AI Platform,
or Microsoft Azure Machine Learning for deploying machine learning models in a production
environment.
o API Development: Flask or Django frameworks for building RESTful APIs to serve
predictions from the deployed model.
o Containerization: Docker for containerizing the application and ensuring consistency
across different environments.
o Cloud Computing: Utilize cloud infrastructure providers (AWS, Google Cloud, Azure) for
hosting and scaling deployed applications.

6. Monitoring and Maintenance:

o Logging and Monitoring Tools: Services like Amazon CloudWatch, Google Cloud
Monitoring, or Azure Monitor for logging model predictions, monitoring performance
metrics, and detecting anomalies.
o Continuous Integration/Continuous Deployment (CI/CD): CI/CD pipelines for
automating model updates, testing, and deployment.

7. Security and Compliance:

o Data Security: Implement encryption and access control mechanisms to protect sensitive
healthcare data.
o Regulatory Compliance: Ensure compliance with healthcare regulations such as HIPAA
(Health Insurance Portability and Accountability Act) or GDPR (General Data Protection
Regulation) when handling patient data.

8. Collaboration and Documentation:

o Version Control: Git for version control of code and machine learning models.
o Documentation: Tools like Jupyter Notebooks, Markdown, or Google Docs for documenting
data preprocessing steps, model development, and evaluation results.

By leveraging these services and tools effectively, you can build, deploy, and maintain a heart disease
prediction model using logistic regression while adhering to best practices in data privacy, security, and
regulatory compliance.
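
As an illustration of the hyperparameter tuning and cross-validation tools listed above, the following sketch runs GridSearchCV over a scaled logistic regression. The synthetic data and the grid of C values are assumptions made for the example, not settings from the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data in place of real patient records (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid over C, the inverse regularization strength.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)

print("best C:", search.best_params_["logisticregression__C"])
print("mean cross-validated ROC-AUC:", round(search.best_score_, 3))
```

Keeping the scaler inside the pipeline avoids leaking statistics from the validation folds into training, which would otherwise inflate the cross-validated score.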

Tools and Software Requirements:

1. Jupyter Notebook (Google Colab)
2. GitHub
3. Python
4. Pandas
5. PyTorch

CHAPTER 3
PROJECT ARCHITECTURE
Logistic Regression
The output of the logistic regression model is shown in the figure. Logistic
regression is a supervised classification algorithm. This predictive analysis
algorithm is built on the idea of probability. By calculating probabilities
using the underlying logistic (sigmoid) function, it assesses the
relationship between the dependent variable (Ten-year CHD) and one or more
independent variables (risk factors). The sigmoid function limits the logistic
regression hypothesis to values between 0 and 1 (squashing), that is,
0 ≤ h(x) ≤ 1. In logistic regression, the cost function is referred to as the
log loss (binary cross-entropy).
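
To make the squashing behavior concrete, the sketch below evaluates the sigmoid hypothesis and the standard log-loss (binary cross-entropy) cost on a toy example. The data values and the two parameter vectors are made up for illustration, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h(x) = sigmoid(theta . x), the predicted probability of the positive class."""
    return sigmoid(X @ theta)

def log_loss_cost(theta, X, y, eps=1e-12):
    """Average binary cross-entropy between labels y and predictions h(x)."""
    h = np.clip(hypothesis(theta, X), eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# Toy data: the first column is the intercept term, the second is one risk factor.
X = np.array([[1.0, -2.0], [1.0, -0.5], [1.0, 0.5], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta_zero = np.zeros(2)           # predicts 0.5 for every row
theta_good = np.array([0.0, 2.0])  # separates the two toy classes

print(hypothesis(theta_zero, X))
print(log_loss_cost(theta_zero, X, y))
print(log_loss_cost(theta_good, X, y))
```

With all parameters at zero every prediction is 0.5 and the cost equals log 2; parameters that push predictions toward the correct labels give a lower cost, which is what training minimizes.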

The accurate presentation of data is crucial to the success of logistic
regression. Essential elements from the available data set are thus chosen
using backward elimination and recursive elimination strategies to
increase the model's potency. “In statistics, the outcome of a categorical
dependent variable is forecast from a set of independent or predictor factors
using a type of regression analysis called logistic regression. In logistic
regression, the dependent variable is always binary”. Prediction and success
probability estimation are the two main uses of logistic regression. Some of
the attributes have P values greater than the preferred alpha (5%) in the
results, which indicates a statistically weak link between them and
the likelihood of developing “heart disease”. The backward elimination
strategy is used to remove the attributes with the highest P values one at a
time: the regression is performed repeatedly until all the remaining
attributes have P values less than 0.05.
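
A backward elimination loop of this kind can be sketched as below. The report does not say how the P values were computed, so this example makes its own assumptions: the model is fitted by Newton's method in NumPy, P values come from a two-sided Wald test, the data is synthetic, and the feature names are hypothetical.

```python
import math
import numpy as np

def fit_logit(design, y, n_iter=25):
    """Fit logistic regression by Newton's method; return coefficients and Wald p-values."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(design @ beta)))
        w = p * (1.0 - p)
        hessian = design.T @ (design * w[:, None])
        beta = beta + np.linalg.solve(hessian, design.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-(design @ beta)))
    w = p * (1.0 - p)
    cov = np.linalg.inv(design.T @ (design * w[:, None]))
    z = beta / np.sqrt(np.diag(cov))
    # Two-sided p-value under the standard normal distribution.
    p_values = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return beta, p_values

def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the feature with the largest p-value until all are below alpha."""
    names = list(names)
    X = X.copy()
    while names:
        design = np.column_stack([np.ones(len(y)), X])  # intercept is always kept
        _, p_values = fit_logit(design, y)
        feature_p = p_values[1:]
        worst = int(np.argmax(feature_p))
        if feature_p[worst] < alpha:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return names

# Synthetic data: only the first two columns influence the outcome.
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))
true_logits = 1.5 * X[:, 0] - 1.2 * X[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logits))).astype(float)
names = ["age", "sysBP", "noise_a", "noise_b"]  # hypothetical feature names

kept = backward_eliminate(X, y, names)
print(kept)
```

A noise feature can still survive by chance at the 5% level, so the surviving set should be read as a screening result rather than proof of a causal link.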

Logistic regression is a statistical analysis technique that uses previous data
from a dataset to forecast a binary outcome, such as true or false. By
examining the relationship between one or more independent factors, a
logistic regression model predicts a dependent data variable. The logistic
regression approach is used to classify individuals based on one or
more predictor factors (x). It is used to model a variable with a binary
outcome that can only have two feasible values: 0 or 1, yes or no, or
diseased or not.
Here's a high-level architecture for a heart disease prediction project using logistic regression:

1. Data Collection and Storage:

o Gather data from various sources such as electronic health records (EHR), research
databases, or publicly available datasets.
o Store the collected data securely in a data warehouse or cloud storage solution like Amazon
S3, Google Cloud Storage, or Azure Blob Storage.

2. Data Preprocessing:

o Preprocess the raw data to clean, transform, and prepare it for model training.
o Handle missing values, encode categorical variables, and perform feature scaling or
normalization as needed.
o Split the dataset into training, validation, and testing sets.

3. Model Development:

o Utilize machine learning libraries like Scikit-learn in Python to implement
logistic regression models.
o Train the logistic regression model on the training dataset using relevant features associated
with heart disease risk.
o Tune hyperparameters using techniques like cross-validation or grid search to optimize
model performance.

4. Model Evaluation:

o Evaluate the trained model's performance using various metrics such as accuracy, precision,
recall, F1-score, and area under the ROC curve (AUC-ROC) on the validation and testing
datasets.
o Validate the model's generalizability and robustness through techniques like
cross-validation.

5. Deployment:

o Deploy the trained logistic regression model using cloud-based services like Amazon
SageMaker, Google Cloud AI Platform, or Microsoft Azure Machine Learning.
o Build a RESTful API using Flask or Django to serve predictions from the deployed model.
o Containerize the application using Docker for portability and scalability.
o Utilize cloud infrastructure providers (AWS, Google Cloud, Azure) for hosting and scaling the
deployed application.

6. Monitoring and Maintenance:

o Implement logging and monitoring mechanisms to track model performance, monitor
prediction requests, and detect anomalies.
o Set up alerts for monitoring critical metrics such as prediction accuracy and system health.
o Establish continuous integration/continuous deployment (CI/CD) pipelines for automating
model updates, testing, and deployment.

7. Security and Compliance:

o Ensure data security by implementing encryption, access controls, and compliance with
regulations such as HIPAA or GDPR.
o Implement authentication and authorization mechanisms to restrict access to sensitive data
and prediction endpoints.

8. Documentation and Reporting:

o Document the project architecture, data preprocessing steps, model development,
evaluation results, and deployment procedures.
o Create reports and presentations summarizing key findings, insights, and recommendations
for stakeholders.

By following this project architecture, you can develop a robust heart disease prediction system
using logistic regression while adhering to best practices in data management, model development,
deployment, and maintenance.
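
The preprocessing, training, and evaluation stages of this architecture can be sketched end to end as follows. This is a minimal illustration assuming scikit-learn, synthetic data, and a hypothetical 60/20/20 split; the helper name report is likewise made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data in place of real patient records (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Hypothetical 60/20/20 split into training, validation, and testing sets.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

# Feature scaling and the classifier are fitted together on the training set only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

def report(X_part, y_part):
    """Compute the metrics named in the evaluation step for one data split."""
    pred = model.predict(X_part)
    proba = model.predict_proba(X_part)[:, 1]
    return {
        "accuracy": accuracy_score(y_part, pred),
        "precision": precision_score(y_part, pred),
        "recall": recall_score(y_part, pred),
        "f1": f1_score(y_part, pred),
        "roc_auc": roc_auc_score(y_part, proba),
    }

val_metrics = report(X_val, y_val)
test_metrics = report(X_test, y_test)
print("validation:", val_metrics)
print("test:", test_metrics)
```

The validation set is meant for model selection and threshold choices, while the test set is touched once at the end so that its metrics remain an honest estimate.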

CHAPTER 4
MODELING AND PROJECT OUTCOME
(code and results)

APP INTERFACE/PROJECT RESULT


Different machine learning algorithms were applied, followed by deep
learning, to see what difference each makes on the data. Three
approaches were used. In the first approach, the dataset as acquired is used
directly for classification. In the second approach, feature selection is
applied to the data, but no outlier detection is performed; the results
achieved are quite promising. In the third approach, the dataset is
normalized, with both outlier handling and feature selection applied; the
results achieved are much better than those of the previous techniques, and
they compare well with the accuracies reported in other research.

CONCLUSION
Prediction of cardiovascular disease is one of the important areas in the
medical industry: the available patient data is used to predict the
absence or presence of cardiac disease. Several techniques and
methods exist for predicting cardiovascular disease. In this research,
the logistic regression supervised ML algorithm is used to classify heart
disease. To improve performance, the data is pre-processed through cleaning
and handling of missing values. The vital part is feature selection, which
increases the accuracy of the algorithm and also sheds light on its
behavior: with logistic regression, the accuracy of prediction increases as
the share of training data increases. The LR classifier achieved 87.10%
accuracy with 90% of the data used for training and 10% for testing. The
results outperformed those of previous research work. The limitation is that
only the UCI dataset is used in the study; future work will try to apply the
approach to multiple datasets.

As heart disease is prevalent today, predicting it in patients is crucial for
timely intervention and recovery. Logistic regression offers a valuable method
to predict heart disease by analyzing patterns in patient data. This
information can assist healthcare professionals in identifying high-risk
patients and implementing preventive measures.
The ML-based prediction system developed in this study performs well in early
diagnosis of CVDs and can be accessed via the Internet. This study offers promising
results suggesting the potential use of an ML-based heart disease prediction system as a
screening tool to diagnose heart diseases in primary healthcare centres in India,
which would otherwise go undetected.

FUTURE SCOPE
In this paper, we proposed three methods on which a comparative analysis was done,
and promising results were achieved. The conclusion we reached is that machine
learning algorithms performed better in this analysis. Many researchers have
previously suggested that ML should be used where the dataset is not that large, which
is confirmed in this paper. The measures used for comparison are the confusion
matrix, precision, specificity, sensitivity, and F1 score. For the 13 features in the
dataset, the K-Neighbors classifier performed better in the ML approach when data
preprocessing was applied. The computational time was also reduced, which is helpful
when deploying a model. It was also found that the dataset should be normalized;
otherwise, the trained model sometimes overfits and the accuracy achieved is
not sufficient when the model is evaluated on real-world data, which can vary
drastically from the dataset on which the model was trained. It was also found that
statistical analysis is important when a dataset is analyzed and that the data should
have a Gaussian distribution. Outlier detection is also important, and a
technique known as Isolation Forest is used for handling it. If a large dataset is
available, the results can improve considerably in deep learning and in ML as well. The
algorithm we applied in the ANN architecture increased the accuracy, which we
compared with that of different researchers. The dataset size can be increased, and then
deep learning with various other optimizations can be used to achieve more promising
results. Machine learning and various other optimization techniques
can also be used so that the evaluation results improve further. More
ways of normalizing the data can be tried and the results compared. More
ways could also be found to integrate heart disease-trained ML and DL
models with certain multimedia for the ease of patients and doctors.
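
The outlier handling and normalization steps mentioned above could be combined as in the following sketch. It assumes scikit-learn's IsolationForest and StandardScaler; the data, the three extreme rows, and the contamination value of 0.02 are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0]])  # made-up extreme rows
X = np.vstack([inliers, outliers])

# contamination is the assumed share of outliers; 0.02 is a hypothetical choice.
detector = IsolationForest(contamination=0.02, random_state=0)
labels = detector.fit_predict(X)  # +1 for inliers, -1 for flagged outliers

# Drop flagged rows first, then normalize what remains.
X_clean = X[labels == 1]
X_scaled = StandardScaler().fit_transform(X_clean)

print("rows before:", X.shape[0], "rows after:", X_clean.shape[0])
```

Removing outliers before scaling keeps extreme rows from distorting the mean and variance used for normalization.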

REFERENCES
1. Project GitHub link, Ramar Bose, 2024
2. Project video recorded link (YouTube/GitHub), Ramar Bose, 2024
3. Project PPT & Report GitHub link, Ramar Bose, 2024

GitHub Link of Project Code:

https://github.com/AakashAU002/au810021114002.git
