Heart Disease Prediction Synopsis
Submitted to
RAJIV GANDHI PROUDYOGIKI VISHWAVIDYALAYA
BHOPAL (M.P)
CERTIFICATE
This is to certify that the work embodied in this Major Project Synopsis
entitled “Prediction of Heart Diseases using Machine Learning” has
been satisfactorily completed by Harshit More [0103CS193D05] and
Nikhil Kute [0103CS193D11]. It is a bona fide piece of work, carried out
under the guidance of the Department of Computer Science &
Engineering, Lakshmi Narain College of Technology, Bhopal, in
partial fulfillment of the Bachelor of Technology degree during the academic
year 2021-22.
Approved By
BHOPAL
ACKNOWLEDGEMENT
Special thanks go to Dr. Sadhna K. Mishra (Prof. & HOD), who
helped us by providing timely suggestions while completing this project
work. She shared her interesting ideas and thoughts, which made this
project work successful.
We would also thank our institution and all the faculty members
without whom this project work would have been a distant reality.
ABSTRACT
Heart-related diseases, or cardiovascular diseases (CVDs), have been the main cause of a
huge number of deaths in the world over the last few decades and have emerged as the most
life-threatening diseases, not only in India but in the whole world. There is therefore a need
for a reliable, accurate, and feasible system to diagnose such diseases in time for proper
treatment. Machine learning algorithms and techniques have been applied to various medical
datasets to automate the analysis of large and complex data. Many researchers, in recent
times, have been using several machine learning techniques to help the health care industry
and its professionals in the diagnosis of heart-related diseases. This synopsis presents a
survey of various models based on such algorithms and techniques and analyses their
performance. Models based on supervised learning algorithms such as Support Vector
Machines (SVM), K-Nearest Neighbour (KNN), Naïve Bayes, Decision Trees (DT), Random
Forest (RF), and ensemble models are found to be very popular among researchers.
CHAPTER 1
INTRODUCTION
1.1 Introduction
We are living in a fast-paced world with hectic schedules, and many people pay little
attention to their health while concentrating on their careers. Due to this negligence of
health, the number of heart disease patients is growing very fast. If you have just been
diagnosed with a heart condition, it is very important to start understanding the illness:
its progression can often be slowed by starting to live a healthier life.
There are many causes of heart failure, but the condition is generally broken down into two types:
➢ The lower left chamber of the heart (the left ventricle) gets bigger (enlarges) and cannot
squeeze (contract) hard enough to pump the right amount of oxygen-rich blood to the rest of
the body.
➢ The heart contracts and pumps normally, but the bottom chambers of the heart (the
ventricles) are thicker and stiffer than normal. Because of this, the ventricles cannot relax
properly and fill up all the way, and since there is less blood in the ventricles, less blood is
pumped out to the rest of the body when the heart contracts.
1.2 Machine Learning
This field is not new: going through multiple papers and publications, we find that machine
learning emerged around 1960. Over the last several years, machine learning has become
involved in many domains due to the growing volume of data generated in daily life; many
online web applications produce huge amounts of logs and data that can be processed by
learning algorithms.
Predictive analytics is an analytics process that analyses historical data together with new
data to forecast future behaviour, trends, and activity; it is closely related to advanced
analytics.
Figure 1.2: Types of Prediction (machine learning techniques and nature-inspired algorithms)
A1C: The A1C test measures your average blood sugar over the past two to three months. The
advantage of being diagnosed this way is that you don't have to fast or drink anything.
Fasting Plasma Glucose (FPG): This test checks your fasting blood sugar level. Fasting means
not having anything to eat or drink (except water) for at least 8 hours before the test. This
test is usually done first thing in the morning, before breakfast.
Oral Glucose Tolerance Test (OGTT): The OGTT is a two-hour test that checks your blood sugar
levels before and two hours after you drink a special sweet drink. It tells the doctor how
your body processes sugar.
Type 1 research:
Type 1 diabetes is caused by an autoimmune attack on beta cells, eliminating the body's
ability to produce insulin. In 1921, research led to the discovery of insulin, changing type 1
diabetes from a life-threatening condition to a manageable one.
Type 2 research:
Type 2 diabetes develops from both genetic and environmental factors and affects the body's
capability to make or use insulin. Research over the last three decades has aimed to improve
patients' glucose control.
Type 1 and type 2
Type 1 and type 2 diabetes have different underlying causes, but both result in high blood
glucose and lead to similar complications.
Obesity
Obesity significantly increases the risk of type 2 diabetes and complicates the management of
type 1 diabetes.
1.6 Motivation
The motivation for our dissertation is that in today's hectic life nobody has time to examine
the behaviour of their lifestyle; once disease sets in, we all come under threat. So we came
to the conclusion that if we could detect our lifestyle behaviour through some mechanism, it
would give us prior information about disease.
The origin of data mining technology meets people's necessities. Data mining (DM) is
sometimes also called Knowledge Discovery from Databases (KDD). A tremendous amount of data
and information is being collected with the help of computing devices and the latest
technologies. Now data is everywhere: from business transactions, government, healthcare,
websites, scientific data, and so on. Retrieval alone is not enough for decision-making, so
DM comes into the picture for summarising data into valuable information, i.e. knowledge
discovery and the discovery of patterns in raw data [9].
In the beginning, we simply stored all data. Unfortunately, these gigantic collections of
data, accumulated in dissimilar data structures, very rapidly became overwhelming. In
practical applications, DM can extract implicit but potentially useful information and
knowledge, which people do not know in advance, from large amounts of noisy, incomplete,
random, and fuzzy data. DM is an emerging field and a powerful means of extracting useful
knowledge from massive amounts of data, bridging the gap between data and knowledge.
Another definition of DM is the investigation and analysis of huge quantities of data in
order to discover valid, novel, potentially useful, and ultimately understandable patterns
in the data: the process of analysing large databases with intelligent algorithms to find
such patterns.
DM and KDD form a new interdisciplinary field, merging ideas from statistics, machine
learning, databases, and parallel computing.
Researchers have defined the term ‘data mining’ in many ways; a few definitions of DM or
KDD available in the literature are given below.
The KDD process is a data mining methodology used to extract hidden knowledge from a large
database by applying pre-processing and data transformation steps.
1. Developing an understanding of the application domain and the goals of the process.
2. Creating a target data set, or selecting a data set on which discovery is to be accomplished.
3. Data cleaning and pre-processing.
4. Data reduction:
➢ Finding useful features to represent the data, depending on the aim of the task.
➢ Using dimensionality reduction methods to reduce the number of variables in the
representation of the data.
➢ Choosing the aim of the KDD process: classification, regression, clustering, or another task.
7. Data mining.
Data mining is the process of extracting hidden, previously unknown patterns from a huge
database or data warehouse; it is also known as knowledge discovery from data (KDD). Data
mining plays an important role in various areas like banking, education, health care, and
medicine. Many organisations use data mining techniques to analyse large datasets, to support
their decision-making processes, and to get better results for their long-term needs.
Health organisations use data mining techniques to identify hidden patterns in disease and
drug datasets; these are used for the prediction and detection of different diseases and also
support decision-making in clinical diagnosis. Different data mining techniques are used for
the prediction and detection of different diseases; some of these techniques are listed
below. [24]
Classification is the process of finding a model which describes and distinguishes data
classes or concepts based on a class label. There are different classification algorithms,
some of which are Artificial Neural Networks (ANN), decision trees, Bayesian networks, and
Naïve Bayes.
Clustering is the process of analysing data objects without consulting a class label. It is
the process of grouping objects into new classes by maximising the intra-class similarity and
minimising the inter-class similarity. There are different clustering algorithms, such as
k-means clustering and hierarchical clustering.
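As an illustration of the assignment/update loop at the heart of k-means, here is a minimal sketch in plain Python (a toy example on 1-D data, not this project's actual implementation):

```python
# Minimal k-means sketch: repeat "assign each point to its nearest
# centroid" and "move each centroid to the mean of its cluster".
def kmeans_1d(values, k, iterations=20):
    # Initialise centroids with the first k distinct values.
    centroids = sorted(set(values))[:k]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

print(kmeans_1d([1.0, 1.2, 0.8, 10.0, 10.5, 9.5], k=2))
```

On this toy data the two centroids settle near 1.0 and 10.0, i.e. one cluster per obvious group.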
Association rule learning is a machine learning method used for finding frequent patterns.
Some association algorithms are the Apriori algorithm, the Eclat algorithm, and the FP-Growth
algorithm.
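The core step shared by these algorithms is counting the support of candidate itemsets. The sketch below is an illustrative toy (brute-force counting up to pairs, not a full Apriori with candidate pruning):

```python
from itertools import combinations

# Count itemsets of size 1..max_size and keep those whose support
# (fraction of transactions containing them) meets the threshold.
def frequent_itemsets(transactions, min_support, max_size=2):
    n = len(transactions)
    result = {}
    for size in range(1, max_size + 1):
        counts = {}
        for t in transactions:
            for itemset in combinations(sorted(set(t)), size):
                counts[itemset] = counts.get(itemset, 0) + 1
        for itemset, c in counts.items():
            if c / n >= min_support:
                result[itemset] = c / n
    return result

baskets = [["milk", "bread"], ["milk", "butter"], ["milk", "bread", "butter"]]
print(frequent_itemsets(baskets, min_support=0.6))
```

Here {bread, milk} and {butter, milk} each appear in 2 of 3 baskets (support 0.67) and survive, while {bread, butter} (support 0.33) is pruned.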
2.3.2 Applications of Data Mining
➢ Traffic Prediction
➢ Video Surveillance
➢ Search Engine Result Refining
➢ Online Fraud Detection
➢ Product Recommendations
➢ Future Healthcare
➢ Manufacturing Engineering
Traffic Prediction: Google uses DM algorithms in traffic prediction. We all use GPS
navigation systems; through these systems, data is saved in a central database that updates
each vehicle's location. The underlying problem is that only a minimum number of cars are
equipped with GPS. Machine learning in such scenarios helps to estimate the regions where
congestion can be found on the basis of daily experience. [7]
Video Surveillance: Imagine a single person monitoring multiple video cameras, a difficult
and boring job. This is why the idea of training computers to do this job makes sense. Video
surveillance systems nowadays are powered by AI, which makes it possible to detect crime
before it happens by tracking unusual behaviour, such as people standing immobile for a long
time, stumbling, or sleeping on benches.
Search Engine Result Refining: Google and other search engines use DM to improve your search
results. Every time you execute a search, the algorithms at the backend watch how you respond
to the results. If you open the top results and stay on the web page for a long time, the
search engine assumes that the results it displayed were in accordance with the query.
Similarly, if you reach the second or third page of the search results but do not open any of
them, the search engine estimates that the results served did not match your requirements.
This is how the algorithms working at the backend improve the search results. [7]
Online Fraud Detection: DM is proving its potential to make cyberspace a secure place, and
tracking monetary fraud online is one example. For instance, PayPal uses ML for protection
against money laundering.
Product Recommendations: DM algorithms are used in product recommendations: a user may be
shown, on a social media account, the same product they viewed on an e-commerce website.
Future Healthcare: Data mining improves health systems. It uses data and analytics to
identify best practices that improve care and reduce costs. Researchers use data mining
approaches such as multi-dimensional databases, machine learning, soft computing, data
visualisation, and statistics. Mining can be used to predict the volume of patients in every
category, and methods are being developed to make sure that patients get appropriate care at
the right place and at the right time.
Market Basket Analysis: Market basket analysis is a modelling technique based on the theory
that if you buy a certain group of items you are more likely to buy another group of items.
This method allows the shopkeeper to learn the purchase behaviour of customers. This
information can help the shopkeeper understand purchasers' requirements and change the shop's
layout accordingly.
Education: There is a new emerging field, known as Educational Data Mining (EDM), concerned
with developing techniques that discover knowledge from data obtained from educational
environments. The objectives of EDM include predicting students' future learning behaviour,
understanding the effects of educational support, and advancing scientific knowledge about
learning. Data mining can be used by an institution to take correct decisions and also to
predict students' progress reports; with these results, the institution can focus on how to
teach and what to teach. [7]
CRM: Customer Relationship Management is about acquiring and retaining customers, advancing
customer loyalty, and developing customer-focused strategies in order to maintain a proper
relationship with the customer.
Machine learning works on a very simple concept: understanding through experience. Machine
learning teaches computers to learn from experience, as humans and animals do. It contains
algorithms that learn from past data and predict future data: we train the computer by
running an algorithm on some data and then use the resulting model to predict future results.
The algorithms adaptively improve their performance as the number of samples available for
learning increases.
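The train-then-predict idea can be sketched with the simplest possible model, a straight line fitted to past data by ordinary least squares (a toy illustration, not one of the models used in this project):

```python
# Fit y = a*x + b by ordinary least squares (closed form), then predict.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope from covariance over variance; intercept from the means.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

# "Training" data generated by y = 2x + 1.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)        # learned slope and intercept
print(a * 10 + b)  # prediction for unseen x = 10
```

The model "learns" the slope 2 and intercept 1 from past samples, then predicts 21 for the unseen input x = 10.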
2.4.1 Types of Techniques of Machine Learning
Supervised ML
Unsupervised ML
Semi supervised ML
Reinforcement ML
Ensemble Learning
Neural Network
Supervised Learning: In the supervised learning mechanism we have to train the model with
prior knowledge (labelled examples) so that it can behave like an intelligent program. Once
trained, the program can be used for further predictions.
Reinforcement Learning: In this form of learning, programs learn their steps on the basis of
their experiences; it sits between supervised and unsupervised learning. Here the term
'agent' comes into the picture and plays a very important role: the agent takes actions and
learns decisions on the basis of its prior interactions.
Multitask Learning: Multitask Learning (MTL) is an inductive transfer mechanism whose main
goal is to enhance generalisation. MTL improves generalisation by leveraging the
domain-specific information contained in the training signals of related tasks.
Decision Tree Model: The decision tree is one of the most common data mining models. It is
popular because the resulting model is easy to understand. The algorithms use a recursive
partitioning approach. The decision tree is a type of supervised learning algorithm that is
mostly used for classification problems.
Based on the type of target variable, decision trees can be of two types: categorical
variable decision trees and continuous variable decision trees.
Categorical Variable Decision Tree: A decision tree with a categorical target variable is
called a categorical variable decision tree. Example: a target variable such as “Will it rain
today?” with values YES or NO.
Continuous Variable Decision Tree: A decision tree with a continuous target variable is
called a continuous variable decision tree. Example: the salary of a person.
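The recursive partitioning step can be illustrated with a one-level tree (a "decision stump") that chooses the split minimising Gini impurity. This is a hypothetical toy sketch, not the project's code:

```python
# Gini impurity of a set of 0/1 labels: 1 - p0^2 - p1^2.
def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n          # fraction of class 1
    return 1.0 - p * p - (1 - p) * (1 - p)

# Try every candidate threshold and keep the one with the lowest
# weighted impurity of the two child nodes (one split = one tree level).
def best_stump(xs, ys):
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[1]:
            best = (t, score)
    return best

# Toy data: label 1 exactly when x >= 5, so splitting at 3 is perfect.
print(best_stump([1, 2, 3, 6, 7, 8], [0, 0, 0, 1, 1, 1]))
```

A full decision tree simply applies this search recursively to each child node until the nodes are pure or a depth limit is reached.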
Support Vector Machine Model: A Support Vector Machine (SVM) searches for so-called support
vectors, which are data points found to lie at the edge of an area in space forming a
boundary between one class of points and another. In SVM terminology, the space between the
regions containing data points of different classes is called the margin between those
classes. The support vectors are used to identify a hyperplane (when we are talking about
many dimensions in the data, or a line if we are talking about only two-dimensional data)
that separates the classes. [6]
Figure: Structure of an artificial neural network (input layer, hidden layer, output layer).
An artificial neural network (ANN) is a prediction algorithm which uses a learning rate and
momentum to classify data accurately. An ANN predicts the output by adjusting weights, and it
consists of three layers: input, hidden, and output.
Backpropagation is an artificial neural network training algorithm in which each neuron
learns by adjusting its associated weights in order to correct or reduce the error. It is a
supervised learning algorithm which uses gradient descent optimisation to adjust the weights
of the neurons by computing the gradient of the loss function. [6]
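The weight-adjustment idea can be shown on a single sigmoid neuron trained by gradient descent on the squared error (an illustrative sketch under simplified assumptions, not the network used in this project):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One neuron, one weight w and one bias b, trained sample by sample.
def train_neuron(samples, lr=0.5, epochs=2000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in samples:
            out = sigmoid(w * x + b)
            # Gradient of 0.5*(out - target)^2 w.r.t. w and b.
            delta = (out - target) * out * (1 - out)
            w -= lr * delta * x
            b -= lr * delta
    return w, b

# Learn a step function: output near 0 for x = -1, near 1 for x = +1.
w, b = train_neuron([(-1.0, 0.0), (1.0, 1.0)])
print(sigmoid(w * -1 + b), sigmoid(w * 1 + b))
```

After training, the neuron's output for x = -1 is close to 0 and for x = +1 close to 1, showing how repeatedly subtracting the gradient of the loss drives the error down.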
Clustering: Clustering is the process of grouping physical or abstract objects into classes
of similar objects: partitioning a set of data (or objects) into a set of meaningful
sub-classes, called clusters. It is an unsupervised learning method; there are no predefined
classes. A good clustering technique will generate high-quality clusters in which intra-class
similarity is high and inter-class similarity is low. The quality of a clustering result also
depends on both the similarity measure used by the technique and its implementation, and a
clustering technique is measured by its ability to find some or all of the hidden patterns.
Boosting: Boosting is a very important classification method in recent developments. It works
by applying a classification algorithm sequentially to reweighted versions of the training
dataset and then taking the weighted majority vote of the sequence of classifiers produced.
This simple procedure results in dramatic improvements in performance for many classification
algorithms. The phenomenon can be understood in terms of statistical principles, namely
additive modelling on the logistic scale using a Bernoulli likelihood criterion.
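The reweight-and-vote loop described above can be sketched as a minimal AdaBoost with 1-D threshold stumps as the weak learners (a toy illustration of the principle, not the boosting implementation used in this project):

```python
import math

# Weak learner: predict +polarity when x > threshold, else -polarity.
def stump_predict(x, threshold, polarity):
    return polarity if x > threshold else -polarity

def adaboost(xs, ys, rounds=5):
    n = len(xs)
    weights = [1.0 / n] * n
    model = []                      # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error.
        best = None
        for t in xs:
            for pol in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump_predict(x, t, pol) != y)
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        err = max(err, 1e-10)       # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, t, pol))
        # Reweight: misclassified points gain weight, correct ones lose it.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, pol))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

# Weighted majority vote of all stumps.
def predict(model, x):
    s = sum(a * stump_predict(x, t, p) for a, t, p in model)
    return 1 if s > 0 else -1

xs = [1, 2, 3, 7, 8, 9]
ys = [-1, -1, -1, 1, 1, 1]
model = adaboost(xs, ys)
print([predict(model, x) for x in xs])
```

Each round the weak learner is forced to focus on the points the previous stumps got wrong, and the final classifier is the alpha-weighted vote of all rounds.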
Association Rule Mining: Association rule analysis is a technique to uncover how items are
associated with each other: finding frequent patterns, associations, correlations, or causal
structures among sets of items in transaction databases, for example discovering what a
customer is buying by finding associations and correlations between the different items that
customers place in their baskets. Applications include:
1) Basket data analysis.
2) Cross-marketing.
3) Catalog design.
4) Loss-leader analysis.
2.5 Importance of Boosting Method
Boosting is a machine learning meta-algorithm for reducing bias and variance in supervised
learning, which converts weak learners into a strong learner. A question was posed by Kearns
and Valiant: “Can a group of weak learners make a strong learner?” Here a weak learner is
defined as a classifier that is only slightly correlated with the true classification (it can
label examples better than random guessing); on the contrary, a strong learner is a
classifier that is arbitrarily well correlated with the true classification.
Classification algorithms: Naïve Bayes, Logistic Regression, Decision Tree, Random Forest,
K-Means, Neural Network, Fuzzy k-NN, and Genetic Algorithm.
3.1 Introduction
We all know that health is a very important concern for all of us nowadays. Many countries
like India, Bangladesh, and Pakistan are really struggling with heart disease patients, and
people in America are also struggling at this stage, so many researchers have started
contributing their efforts in this field. In the section below we study a number of research
papers and build a summary for our research work.
According to this paper, data mining is a sub-branch of computer science: it is the way
through which we find information from given huge data. New technologies come into existence
every day, like artificial intelligence, DBMS, ML, and DL. The work of data mining is to find
structural data that will provide reasonable information from given huge data. Here the
authors proposed applying algorithms like Bayesian classifiers and KNN to patient data and
tried to predict heart disease based on the given features. [10]
Finally, the authors conclude that they used a large dataset to ensure a better prediction
result, and they give some recommendations to patients on how to control the disease in the
case of young patients. The authors built a system which anticipates heart disease patients,
in which knowledge-base assistance plays a vital role in the prediction system. The authors
took a dataset of 2,000 records which gives the nearness levels of heart disease patients.
Prediction is done with the help of Naïve Bayes and k-Nearest Neighbours, and the two are
compared on the basis of some performance parameters. This developed system may be very
useful for healthcare industries for finding pre-disease patients. [11]
Here the authors explained that we have several machine learning techniques which are used
for better prediction over a big dataset. We all know that, due to its complexity, prediction
in the health sector is a challenging job for data scientists, but it is very important for
the healthcare sector. This paper discusses six different machine learning algorithms
utilised in the prediction system, with the performance and accuracy of each evaluated on a
dataset. The authors applied different comparison parameters and tried to establish which
algorithm gives the better result in terms of accuracy. The aim of this research is to help
doctors and practitioners find early predictions of disease with ML techniques.
The authors concluded that predictive analysis in any healthcare system may change the
mindset of doctors by finding insightful information in given data using machine learning.
The authors used different algorithms, such as SVM, KNN, RF, NB, DT, and LR, on the Pima
Indian dataset for their analysis. They claim that SVM and KNN give the highest accuracy,
with both algorithms achieving 77% accuracy. For finding better accuracy, they need large
volumes of real data for creating the model. [12]
In this paper, the authors explain that diabetes mellitus is a very common disease in many
people due to a disorder of metabolic functionality, through which many organs, such as the
blood vessels and nerves, become affected. If we can make an early prediction, then it may be
possible to stop the disease reaching a very dangerous stage in the human body. Machine
learning techniques provide efficient results for extracting knowledge by creating predictive
models from diagnostic medical datasets collected from real patients. From such a dataset we
can extract much insightful information using machine learning mechanisms. In this work the
authors applied very popular ML models: SVM, NB, k-Nearest Neighbours, and the C4.5 decision
tree. In this case, the decision tree gives the better result in terms of accuracy and the
other performance parameters. [13]
Here the authors conclude that early prediction can reduce the risk factor by using machine
learning techniques. The authors extracted insightful information from the given dataset;
they applied multiple ML models, out of which the C4.5 decision tree gives the best results
in terms of accuracy.
In this work the authors focus on the major causes of death in the 21st century from such
diseases and syndromes. If the trend continues, then by 2030 millions of people may die due
to this disease. The health sector is collecting real data from different hospitals and test
centres for research, and machine learning gives very good support in terms of finding
insights from this data.
CHAPTER 4
PROBLEM DEFINITION & PROPOSED METHOD
Proposed flow: Data Preprocessing → Apply ML Model → Performance Result
Data Processing
As researchers, we know that we have two types of data: numerical data and nominal data. Both
have specific uses in their fields, and sometimes we have to convert one form to the other.
Here we convert numerical data to nominal data.
The patient’s age is classified into three categories:
Table 4.2: Data Conversion
S. No. | Classification | Numerical Value
1      | Young          | 10-25 years
2      | Adult          | 26-50 years
3      | Old            | Above 50 years
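The numerical-to-nominal conversion in Table 4.2 can be sketched as a simple binning function (an illustrative sketch following the ranges in the table; the "Unknown" label for ages below 10 is our assumption, since the table does not cover them):

```python
# Map a numerical age to the nominal category of Table 4.2.
def age_category(age):
    if 10 <= age <= 25:
        return "Young"
    elif 26 <= age <= 50:
        return "Adult"
    elif age > 50:
        return "Old"
    return "Unknown"   # ages below 10 are not covered by the table

print([age_category(a) for a in [15, 40, 62]])
```

Applied to a whole column, this replaces raw ages with the three nominal classes used by the model.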
Precision: Precision = TP / (TP + FP)
Recall: Recall = TP / (TP + FN)
F-measure: F-measure = (2 × Recall × Precision) / (Precision + Recall)
Accuracy: Accuracy = (TP + TN) / (TP + TN + FP + FN)
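The four measures above can be computed directly from the confusion-matrix counts; the counts in the example below are made up for illustration:

```python
# Evaluation measures from confusion-matrix counts (TP, TN, FP, FN).
def metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f_measure, accuracy

# Hypothetical example: 40 TP, 45 TN, 5 FP, 10 FN out of 100 cases.
p, r, f, a = metrics(40, 45, 5, 10)
print(round(p, 3), round(r, 3), round(f, 3), round(a, 3))
```

With these counts, precision is 40/45 ≈ 0.889, recall is 40/50 = 0.8, and accuracy is 85/100 = 0.85.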
4.2 Algorithms
a) Data Conversion
Figure 4.2: Flow of Operation (input dataset → desired data → evaluate performance parameters,
accuracy/precision → final result; repeat for the next data)
CHAPTER 5
EXPERIMENT SETUP
Anaconda is a totally free environment whose source is open to all. Python and its libraries
are used very efficiently in data science and data analysis; they are also largely used for
creating scalable machine learning algorithms. Python can apply various machine learning
techniques such as classification, regression, recommendation, and clustering. Python offers
researchers a ready-to-use environment for performing data mining tasks on huge volumes and
varieties of data effectively in less time.
Pandas
SciKit-Learn
Python Utility
SciPy
Matplotlib
5.3 Implementation
The model employs filters for faster evaluation and less overall time. The pre-processing
methods and the application of filters strongly affect the final evaluation results of the
classifiers (ML-based models). Feature extraction, conversion of nominal attributes to
binary, and data cleaning are a few of those filters.
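The "nominal to binary" filter mentioned above amounts to one-hot encoding a categorical column, which can be sketched in plain Python (an illustrative sketch, not the project's actual pre-processing code):

```python
# One-hot encode a categorical column: each distinct category becomes
# a binary column, with a 1 marking the row's category.
def one_hot(values):
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

ages = ["Young", "Adult", "Old", "Adult"]
# Columns follow sorted category order: Adult, Old, Young.
print(one_hot(ages))
```

Each nominal value is replaced by a vector of 0s and 1s, which is the form most classifiers expect for non-numeric attributes.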
Explanation: In Figure 5.3 we import the libraries that provide all the required
functionality.
Explanation: In Figure 5.4 we show the major columns available in our dataset; we have 9
columns in our data set.
Figure 5.5: Major Columns
Explanation: In Figure 5.5 we show all the attributes available in our 9-column dataset.
Explanation: In Figure 5.6 we show how the disease varies with age.
Figure 5.7: All Parameter Dependencies
Explanation: In Figure 5.7 we show the impact of all parameters on the disease.
CHAPTER - 6
CONCLUSION AND FUTURE WORK
6.1 Conclusion
We implemented a number of machine learning algorithms to find the best results in terms of
performance. We proposed tuning the given models to improve performance, and after applying
GradientBoostingClassifier and LGBMClassifier with changes such as the random state and some
other parameters, the performance values improved.
Future work will focus on applying other techniques to improve the performance of these
methods to the maximum extent. Another concept that can be implemented is deep learning in
place of classical machine learning, since it is among the best and most efficient techniques
in use nowadays and is becoming more popular for classification purposes. So, we can also
implement deep learning in future work.