
Prediction of Heart Diseases using Machine Learning

A Major Project Synopsis Report


Submitted in Partial fulfillment for the award of
Bachelor of Technology in Computer Science & Engineering

Submitted to
RAJIV GANDHI PROUDYOGIKI VISHWAVIDYALAYA
BHOPAL (M.P)

MAJOR PROJECT SYNOPSIS REPORT


Submitted by
Harshit More [0103CS193D05] Nikhil Kute [0103CS193D11]

Under the supervision of

Prof Deepak Rathore

Department of Computer Science & Engineering


Lakshmi Narain College of Technology, Bhopal (M.P.)
Session 2021-22
LAKSHMI NARAIN COLLEGE OF TECHNOLOGY,

BHOPAL

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

CERTIFICATE

This is to certify that the work embodied in this Major Project Synopsis
entitled “Prediction of Heart Diseases using Machine Learning” has
been satisfactorily completed by Harshit More [0103CS193D05] and
Nikhil Kute [0103CS193D11]. It is a bonafide piece of work, carried out
under guidance from the Department of Computer Science &
Engineering, Lakshmi Narain College of Technology, Bhopal, in
partial fulfillment of the Bachelor of Technology during the academic
year 2021-22.

Prof Deepak Rathore



Approved By

Dr. Sadhna K. Mishra


Prof. & Head
Department of Computer Science & Engineering


ACKNOWLEDGEMENT

We express our deep sense of gratitude to Prof. Deepak Rathore,
Department of Computer Science & Engineering, L.N.C.T., Bhopal,
whose valuable guidance and timely help encouraged us to complete
this project.

Special thanks go to Dr. Sadhna K. Mishra (Prof. & HOD), who
helped us with timely suggestions in completing this project work.
She shared interesting ideas and thoughts which made this project
work successful.

We would also like to thank our institution and all the faculty members,
without whom this project work would have been a distant reality.

Harshit More [0103CS193D05]

Nikhil Kute [0103CS193D11]


CONTENTS

SN  Title
1.  Abstract
2.  Introduction
3.  Data Mining and Its Applications
4.  Literature Review
5.  Problem Definition & Proposed Method
6.  Experiment Setup
7.  Conclusion & Future Work
Abstract

Heart-related diseases, or Cardiovascular Diseases (CVDs), have been the main cause of a
huge number of deaths in the world over the last few decades and have emerged as the most
life-threatening diseases, not only in India but in the whole world. There is therefore a need
for a reliable, accurate and feasible system to diagnose such diseases in time for proper
treatment. Machine learning algorithms and techniques have been applied to various medical
datasets to automate the analysis of large and complex data. Many researchers, in recent
times, have been using several machine learning techniques to help the health care industry
and professionals in the diagnosis of heart-related diseases. This paper presents a survey of
various models based on such algorithms and techniques and analyses their performance.
Models based on supervised learning algorithms such as Support Vector Machines (SVM),
K-Nearest Neighbour (KNN), Naïve Bayes, Decision Trees (DT), Random Forest (RF) and
ensemble models are found to be very popular among researchers.
CHAPTER 1
INTRODUCTION

1.1 Introduction
We are living in a very fast and hectic world. Hardly anybody pays attention to their health;
everybody is concentrating on their career. Due to this negligence of health, the number of heart
disease patients is growing very fast. Suppose you have just been diagnosed with a heart condition:
it is very important to start understanding this illness, because you can slow its progression day by
day by starting to live your life more healthily.

There are many causes of heart failure, but the condition is generally broken down into two types:

Heart failure with reduced left ventricular function (HF-rEF)

The lower left chamber of the heart (left ventricle) gets bigger (enlarges) and cannot squeeze
(contract) hard enough to pump the right amount of oxygen-rich blood to the rest of the body.

Heart failure with preserved left ventricular function (HF-pEF)

The heart contracts and pumps normally, but the bottom chambers of the heart (ventricles) are
thicker and stiffer than normal. Because of this, the ventricles can't relax properly and fill up all the
way. Because there's less blood in the ventricles, less blood is pumped out to the rest of the body
when the heart contracts.
1.2 Machine Learning
Machine learning is not new: going through multiple papers and publications, we find that the
field evolved around 1960. Over the last few years machine learning has entered many domains
due to the growing volume of data in daily life. Many online web applications generate huge
amounts of logs or data that can be processed by learning algorithms.

Predictive analysis is an analytics process which analyses historical data together with new data
to forecast behaviour and trends for future prediction; it is closely related to advanced analytics.
Types of prediction:

Machine learning techniques
Data mining techniques
Nature-inspired algorithms

Figure 1.2: Types of Prediction

1.3 Introduction of Diseases


The disease is arising as a serious chronic condition and has now become an epidemic, with a
higher percentage in urban areas. In India, disease occurrence has risen significantly, from
approximately 12 to 19 percent in urban areas and to 6.5 percent in rural areas. It is important
to note that this growth rate is 49 to 79 percent higher than that of China.

Figure 1.3: Diseases % Infected


Figure 1.3 depicts the overall estimated statistics of pre-disease, heart disease and total
cases in India, which clearly indicates the seriousness of the issue concerned.
1.4 Diagnosis

Figure 1.4: Types of Diagnosis

A1C: The A1C test measures your average blood sugar for the past two to three months. The
advantage of being diagnosed this way is that you don't have to fast or drink anything.

Fasting Plasma Glucose (FPG): This test checks your fasting blood sugar level. Fasting means
not having anything to eat or drink (except water) for at least 8 hours before the test. This test
is usually done first thing in the morning, before breakfast.

Oral Glucose Tolerance Test (OGTT): The OGTT is a two-hour test that checks your blood sugar
levels before and two hours after you drink a special sweet drink. It tells the doctor how your body
processes sugar.

1.5 Research Area in Heart diseases


Research in this field is promoted by the American Heart Diseases Society; grants from
different sources and industries provide the funds for heart disease research. Many promising
innovations are emerging in the field. Every scientific group works on different, specific
projects, and everyone contributes to improving the lives of people throughout society.
The distribution of funds across different areas is given below:
Figure 1.5: Research areas according to the American Heart Diseases Society [8]

Type 1 research:
This disease is caused by an autoimmune attack on beta cells, eliminating the body's ability to
produce insulin. In 1921, research led to the discovery of insulin, changing the type 1 disease
from a life-threatening condition to a manageable one.
Type 2 research:
This disease develops from both genetic and environmental factors and affects the body's
capability to make or use insulin. Research has been ongoing for three decades to improve
patients' glucose control.
Type 1 and type 2
Type 1 and Type 2 diseases have different underlying causes, but both result in high blood glucose
and lead to similar complications.
Obesity
Obesity significantly increases the risk of the type 2 disease and complicates management of
the type 1 disease.
1.6 Motivation
The motivation of our dissertation is that in today's hectic life nobody has time to examine the
behaviour of their lifestyle. Once a person is affected, everyone around them comes under
threat. So we came to the conclusion that if we detect our behaviour through some mechanism,
it will give us prior information about the disease.

1.7 What does the science say about diseases


"What can I eat?" is one of the top questions asked by people when they are diagnosed with
the disease. Everybody has to follow a diet plan once diagnosed with an illness. A sample food
plan is given below:

Figure 1.6: Food Plan [8]

1.8 Pre-disease Factors


A pre-disease condition can be recognized by the following factors, which we need to watch;
by taking care of them we can reduce heart disease risk:

Are you 45 or older?
Do you have high blood pressure?
Does a family member have diabetes?
Do you have low HDL cholesterol?
Are you overweight?
Did you have diabetes during pregnancy?
Are you physically inactive?
Do you have polycystic ovary syndrome?

Figure 1.7: Pre-disease Factors
CHAPTER 2
DATA MINING AND ITS APPLICATIONS

2.1 Data Mining (DM)


“Knowledge shows the way to Power and Success”

The origin of data mining technology lies in people's necessities. DM is sometimes also called
Knowledge Discovery from Databases (KDD). A terrific amount of data and information is being
collected with the help of computing devices and the latest technologies. Data is now everywhere:
business transactions, government, healthcare, websites, scientific data and so on. Retrieval alone
is not enough for decision-making, so DM comes into the picture to summarize data into valuable
information, i.e. knowledge discovery and the discovery of patterns in raw data [9].
In the beginning we simply stored all data. Unfortunately, these gigantic collections of data,
accumulated in dissimilar data structures, very rapidly became overwhelming. In practical
applications DM can extract implicit but potentially useful information and knowledge, which
people do not know in advance, from noisy, incomplete, random and fuzzy data. DM is an active
field and a powerful means to extract useful knowledge from massive amounts of data, bridging
the gap between knowledge and data.

Another definition of DM is the investigation and analysis of huge quantities of data in order to
discover valid, novel, potentially useful and ultimately understandable patterns. It is the process
of analysing large databases through intelligent algorithms to find patterns that are:

✓ Valid: true patterns that hold in general.

✓ Novel: patterns we do not know beforehand.
✓ Valuable: patterns from which we can derive actions.
✓ Understandable: patterns we can interpret and explain.

DM and KDD form a new interdisciplinary field, merging ideas from statistics, machine learning,
databases and parallel computing.
Researchers have defined the term 'data mining' in many ways.
A few definitions of DM or KDD, available in the literature, are given below.

2.2 KDD (Knowledge Data Discovery)

The KDD process is a data mining methodology used to extract hidden knowledge from a large
database by implementing a pre-processing step and a data transformation step.

1. Identification of Goal: definition of the problem, application goal, known prior knowledge

2. Target Data Set: data set selection, data set creation

3. Data Pre-Processing: removing noisy data, handling missing data

4. Data Transformation: finding useful features, finding weighted values

5. Data Mining: choosing the DM function, searching for patterns

6. Presentation: visualization, replacing redundant patterns

Figure 2.1: KDD Process

This research will predict diseases by using the Knowledge Discovery in Databases (KDD)
methodology. KDD is the process of extracting knowledge from large databases and emphasizes
the "high-level" application of particular data mining methods. The KDD process consists of nine
steps; the steps are iterative and interactive in nature [9]. Note that the process is iterative at each
step, meaning that one might have to move back to a previous step. The process starts with
determining the KDD goals, and ends with the implementation of the discovered knowledge.
KDD Steps:

1. Developing an understanding of

➢ The appropriate prior knowledge


➢ The Aim of the end-user

2. Creating a target data set, or selecting a data set on which discovery is to be accomplished.
3. Data cleaning and pre-processing.

➢ Removal of noise in the dataset.


➢ A plan of action for handling missing data.

4. Data reduction

➢ Finding useful features to represent the data, depending on the aim of the task.
➢ Use of dimensionality reduction methods to decrease the number of variables
in the representation of the data.

5. Choosing the data mining task.

➢ Deciding whether the aim of the KDD process is classification, regression, clustering
or another task.

6. Choosing the data mining algorithms.

➢ Selecting methods to be used for searching for patterns in the data.


➢ Deciding which models and parameters may be appropriate.

7. Data mining.

➢ Searching for patterns of interest in a particular representational form: classification
rules or trees, regression, clustering.

8. Interpreting the mined patterns.


9. Consolidating the discovered knowledge.
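As an illustration of step 4 (data reduction), the following sketch applies scikit-learn's PCA to a toy feature matrix; the data and component count are invented for illustration, since the synopsis does not fix a particular reduction method:

```python
# Sketch of KDD step 4 (data reduction): standardize the features,
# then compress them into fewer principal components with PCA.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy matrix: 6 records with 4 correlated clinical-style features.
X = np.array([
    [120, 80, 25.0, 45],
    [140, 90, 31.2, 60],
    [110, 70, 22.8, 30],
    [150, 95, 33.0, 65],
    [130, 85, 28.5, 50],
    [115, 75, 24.1, 35],
], dtype=float)

X_scaled = StandardScaler().fit_transform(X)             # zero mean, unit variance
X_reduced = PCA(n_components=2).fit_transform(X_scaled)  # 4 variables -> 2 components
```

Each record is now described by 2 components instead of 4 variables, which is exactly the "decrease the number of variables" goal of this step.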

2.3 Data mining process

Data mining is the process of extracting hidden, previously unknown patterns from a huge
database or data warehouse. Data mining is also known as knowledge discovery from data (KDD).
It plays an important role in various areas like banking, education, health care and medicine.
Many organizations use data mining techniques to analyse large datasets, to support the decision-
making process and to get better results for their long-term needs.
Data Selection → Data Processing → Data Transformation → Data Mining → Data Evaluation

Figure 2.1: Data Mining Process Steps

Health organizations use data mining techniques to identify hidden patterns in disease and drug
datasets, for prediction and detection of different diseases, and to support decision-making in
clinical diagnosis. Different data mining techniques are used for prediction and detection of
different diseases; some of these techniques are listed below. [24]

2.3.1 Data Mining Techniques

Classification is the process of finding a model which describes and distinguishes data classes or
concepts based on a class label. There are different classification algorithms, including Artificial
Neural Networks (ANN), decision trees, Bayesian networks and naïve Bayes.

Clustering is the process of analysing data objects without consulting a class label. It groups
objects into new classes by maximizing the intra-class similarity and minimizing the inter-class
similarity. There are different clustering algorithms, including k-nearest neighbour and k-means
clustering.
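A minimal sketch of the k-means idea mentioned above; the points and parameters are illustrative, not from the project:

```python
# k-means groups points so that intra-cluster similarity is high
# and inter-cluster similarity is low.
import numpy as np
from sklearn.cluster import KMeans

# Two visually obvious groups of points.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])

# Fit 2 clusters and read back the label assigned to each point.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
```

The three points near (1, 1) receive one label and the three near (8, 8) the other; no class labels were supplied, which is what makes this unsupervised.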

Association rule learning is a machine learning method used for finding frequent patterns.
Association algorithms include the Apriori algorithm, the Eclat algorithm and the FP-growth
algorithm.
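The frequent-pattern counting at the core of Apriori can be sketched in plain Python; the baskets and support threshold here are invented for illustration:

```python
# Count how often each pair of items appears together across baskets;
# pairs meeting a minimum support are the "frequent" 2-itemsets that
# Apriori-style algorithms build rules from.
from itertools import combinations
from collections import Counter

# Toy market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

min_support = 2  # an itemset is frequent if it appears in at least 2 baskets

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent_pairs = {pair for pair, count in pair_counts.items()
                  if count >= min_support}
```

Here every pair (bread-milk, bread-butter, butter-milk) occurs in two baskets, so all three are frequent; real implementations prune candidates level by level rather than enumerating all pairs.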
2.3.2 Applications of Data mining

Traffic Prediction

Video Surveillance

Search Engine Result Refining

Online Fraud Detection

Product Recommendations

Future Healthcare

Manufacturing Engineering

Figure 2.2: Areas where DM is Used

Traffic Prediction: Google uses DM algorithms in traffic prediction. We all use GPS navigation
systems; because of these systems, data is saved in a central database and the location of each
vehicle is updated. The underlying problem is that only a minimum number of cars are equipped
with GPS. Machine learning in such scenarios helps to estimate the regions where congestion can
be found on the basis of daily experience. [7]
Video Surveillance: Imagine a single person monitoring multiple video cameras, a difficult job to
do and a boring one as well. This is why the idea of training computers to do this job makes sense.

Video surveillance devices nowadays are powered by AI that makes it possible to detect crimes
before they happen. They track unusual behaviour of people, such as standing immobile for a
long time, stumbling or sleeping on benches.

Search Engine Result Refining: Google and other search engines use DM to improve the search
results for you. Every time you execute a search, the algorithms at the backend keep watch on how
you respond to the results. If you open the top results and stay on the web page for long, the search
engine assumes that the results it displayed were in accordance with the query. Similarly, if you
reach the second or third page of the search results but do not open any of the results, the search
engine estimates that the results served did not match the requirement. This way, the algorithms
working at the backend improve the search results. [7]

Online Fraud Detection: DM is proving its potential to make cyberspace a secure place, and
tracking monetary fraud online is one of its examples. For example, PayPal uses ML for
protection against money laundering.

Product Recommendations: DM algorithms are used in product recommendations; a user may see
the same product on their social media account that they saw on an e-commerce website.

Future Healthcare: Data mining improves health systems. It uses data and analytics to identify
best practices that improve care and reduce costs. Researchers use data mining approaches like
multi-dimensional databases, machine learning, soft computing, data visualization and statistics.
Mining can be used to predict the volume of patients in every category. Methods are developed
to make sure that patients get appropriate care at the right place and at the right time.

Market Basket Analysis: Market basket analysis is a modelling technique based on the theory that
if you buy a certain group of items, you are more likely to buy another group of items. This method
allows the shopkeeper to learn the purchase behaviour of a customer. This information can help
the shopkeeper to understand the customer's requirements and change the shop's layout
accordingly.

Education: There is a new emerging field, known as Educational Data Mining (EDM), concerned
with developing techniques that discover knowledge from data obtained from educational
environments. The objectives of EDM include predicting students' future learning behaviour,
understanding the effects of educational support, and advancing scientific knowledge about
learning. Data mining can be used by an institution to take correct decisions and to predict a
student's progress. With the results, the institution can focus on what to teach and how to
teach it. [7]
CRM: Customer Relationship Management is about acquiring and retaining customers, advancing
customer loyalty, developing customer-focused strategies and maintaining a proper relationship
with the customer.


2.3.3 Data Mining Challenges:

• Developing a Unifying Theory of Data Mining.

• Scaling Up for High Dimensional Data/High Speed Streams.

• Mining Sequence Data and Time Series Data.

2.4 Introduction to Machine Learning

Machine learning works on a very simple concept: understanding through experience. Inspired by
how humans and animals learn, machine learning teaches computers to learn from experience. It
comprises algorithms that learn from past data and predict future data: we train a computer by
running an algorithm on some data, and the trained model then predicts future results. The
algorithms adaptively improve their performance as the number of samples available for learning
increases.
2.4.1 Types of Techniques of Machine Learning

Supervised ML

Unsupervised ML

Semi-supervised ML

Reinforcement ML

Multitask Learning

Ensemble Learning

Neural Network

Instance-Based Learning

Figure 2.3: Types of Machine Learning

Supervised Learning: In the supervised learning mechanism we have to educate the model with
some prior knowledge (labelled training data) so that it can behave like an intelligent program.
After training, the program can be used on new data.

Unsupervised Learning: In the unsupervised learning mechanism we have to educate the model
without any prior knowledge, which makes it harder to make a program behave intelligently.

Reinforcement Learning: In this kind of learning, programs learn their steps on the basis of their
experiences. It sits between supervised and unsupervised learning. Here the term 'agent' comes
into the picture and plays a very important role: the agent takes actions or learns decisions on the
basis of prior outcomes.

Multitask Learning: Multitask Learning (MTL) is an approach whose main motive is to improve
generalization. MTL improves on single-task learning by leveraging the domain-related
information contained in the training signals of related tasks.

Decision Tree Model: A decision tree model is one of the most common data mining models. It is
popular because the resulting model is easy to understand. The algorithms use a recursive
partitioning approach. A decision tree is a type of supervised learning algorithm that is mostly
used in classification problems.

The type of decision tree is based on the type of the target variable; it can be one of two types:

Categorical Variable Decision Tree

Continuous Variable Decision Tree

Figure 2.4: Types of Decision Tree

Categorical Variable Decision Tree: A decision tree with a categorical target variable is called a
categorical variable decision tree.

Example: a prediction problem where the target variable is "Will it rain today?", with values
YES or NO.

Continuous Variable Decision Tree: A decision tree with a continuous target variable is called a
continuous variable decision tree. Example: the salary of a person.
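A hedged sketch of a categorical-target decision tree with scikit-learn; the features and labels below are toy values, not the project's dataset:

```python
# A decision tree with a categorical (0/1) target, learned by
# recursive partitioning of the feature space.
from sklearn.tree import DecisionTreeClassifier

# Toy records: [age, diastolic blood pressure] -> at-risk (1) or not (0).
X = [[30, 70], [35, 75], [40, 72], [55, 95], [60, 100], [65, 98]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
prediction = tree.predict([[58, 97]])  # classify an unseen record
```

The fitted tree amounts to a single learned split separating the younger, lower-pressure records from the rest, which is the recursive-partitioning idea in miniature.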

Support Vector Machine Model: A Support Vector Machine (SVM) searches for so-called support
vectors, which are data points found to lie at the edge of an area in space that forms a boundary
between one class of points and another. In SVM terminology, the space between the regions
containing data points of different classes is called the margin between those classes. The support
vectors are used to identify a hyperplane (when we are talking about many dimensions in the data;
a line if we are talking about only two-dimensional data) that separates the classes. [6]

Figure 2.5: Model of Support Vector Machine


Artificial Neural Network

An artificial neural network (ANN) is a prediction algorithm which uses a learning rate and
momentum to classify data accurately. An ANN predicts the output by adjusting weights. It
consists of three layers: an input layer, a hidden layer and an output layer.

Figure 2.6: Layers of an Artificial Neural Network

Backpropagation is a type of artificial neural network training algorithm in which each neuron
learns by adjusting the weights associated with it in order to reduce the error. It is a supervised
learning algorithm which uses gradient descent optimization to adjust the weights of the neurons
by computing the gradient of the loss function. [6]
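A small sketch of such a network using scikit-learn's MLPClassifier; the data, layer size and solver choice are illustrative assumptions, not the project's configuration:

```python
# A one-hidden-layer network; fitting adjusts the weights to reduce
# the loss, i.e. backpropagation-style training.
from sklearn.neural_network import MLPClassifier

# Two well-separated toy classes.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
     [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y = [0, 0, 0, 1, 1, 1]

net = MLPClassifier(hidden_layer_sizes=(4,), solver="lbfgs",
                    max_iter=2000, random_state=0).fit(X, y)
```

`lbfgs` is chosen here only because it converges reliably on a tiny dataset; larger datasets typically use a stochastic gradient variant instead.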

Advantages of Artificial Neural Networks


This study chooses the ANN algorithm because of the following advantages:

1) The ability to classify nonlinear data and complex relationships.

2) A high tolerance for noisy data and missing values.

3) The ability to classify unseen data.

Clustering: Clustering is the process of grouping physical and abstract objects into classes of
similar objects. It partitions a set of data (or objects) into meaningful sub-classes, called clusters.
It is an unsupervised learning method: there are no predefined classes. A clustering technique
generates high-quality clusters in which intra-class similarity is high and inter-class similarity is
low. The quality of a clustering result depends on both the similarity measure used by the
technique and its implementation, and is measured by the technique's ability to find some or all
of the hidden patterns.

Boosting: Boosting is a very important classification method in recent development. It works by
applying a classification algorithm sequentially to reweighted versions of the training dataset,
then taking the weighted majority vote of the sequence of classifiers produced. This simple
procedure results in dramatic improvements in performance for many classification algorithms.
The phenomenon can be understood in terms of statistical principles, namely additive modelling
on the logistic scale using the Bernoulli criterion.

Association Rule Mining: Association rule analysis is a technique to uncover how items are
associated with each other: finding frequent patterns, associations, correlations or causal
structures among sets of items in transaction databases. For example, what a customer is likely
to buy can be inferred by finding associations and correlations between the different items that
customers place in their baskets.

Applications of association rule mining

1) Basket data analysis.

2) Cross-marketing.

3) Catalog design.

4) Loss-leader analysis.
2.5 Importance of Boosting Method
Boosting is a machine learning meta-algorithm for reducing bias and variance in supervised
learning, which converts weak learners into a strong learner. A question was posed by Kearns and
Valiant: "Can a group of weak learners make a strong learner?" Here a weak learner is defined as
a classifier that is only slightly correlated with the true classification (it can label examples better
than random guessing); on the contrary, a strong learner is a classifier that is arbitrarily well
correlated with the true classification.
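The weak-to-strong idea can be sketched with AdaBoost, whose default weak learner in scikit-learn is a one-split decision stump; the one-feature data below is invented for illustration:

```python
# AdaBoost reweights the training set after each round and takes a
# weighted vote of the weak learners (decision stumps by default).
from sklearn.ensemble import AdaBoostClassifier

# One feature; the class flips between the values 4 and 5.
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

boosted = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X, y)
train_accuracy = boosted.score(X, y)
```

On this toy data a single stump already suffices, so the ensemble is perfect on the training set; the reweighting only becomes visible on data no single split can separate.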

2.6 Types of Classification Algorithms

Naïve Bayes

Support Vector Machine

Logistic Regression

Decision Tree

Random Forest

K-Means

Neural Network

Fuzzy k-NN

Genetic Algorithm

Figure 2.7: Types of Classification Algorithms


CHAPTER 3
LITERATURE REVIEW

3.1 Introduction

We all know that health is one of the most important concerns for all of us nowadays. Many
countries like India, Bangladesh and Pakistan are really struggling with heart disease patients;
people in America are struggling too. Hence many researchers have started contributing their
efforts in this field. In the section below we studied a number of research papers and built a
summary for our research work.

According to the first paper, data mining is a sub-branch of computer science: it is the way we
find information in given huge data. Every day new technologies come into existence, like
artificial intelligence, DBMS, ML and DL. The work of data mining is to find structure in the
data that will provide some reasonable information from the given huge data. Here the authors
proposed algorithms like naïve Bayes and KNN to apply to patient data and tried to predict
heart disease based upon the given features. [10]
Finally, the authors conclude that they used a large dataset to ensure better prediction results.
They also give some recommendations to patients on how to control the disease in the case of
young patients. The authors built a system which anticipates heart disease patients, in which a
knowledge-base assistant plays a vital role in the prediction system. They took a dataset of 2000
records which gives the nearness levels of heart disease patients. Prediction was performed with
the help of naïve Bayes and k-nearest neighbours, and the two were compared on the basis of
some performance parameters. The developed system may be very useful for healthcare
industries for finding pre-disease patients. [11]
The authors of the next paper explained that we have several machine learning techniques which
are used for better prediction over a big dataset. Due to its complexity, prediction in the health
sector is a challenging job for data scientists, but it is very important for the healthcare sector.
This paper discussed six different machine learning algorithms utilized in their prediction
system, and the performance and accuracy of each when applied to a dataset, compared using
different parameters. The authors tried to establish which one gives better results in terms of
accuracy. The aim of this research is to help doctors and practitioners find early predictions of
disease with ML techniques.
The authors concluded that predictive analysis in any healthcare system may change the
mindset of doctors by finding insight information in given data using machine learning. They
used different algorithms, including SVM, KNN, RF, NB, DT and LR, on the Pima Indian
dataset. They claim that SVM and KNN give the highest accuracy; both algorithms reach 77%
accuracy. For better accuracy they need a large amount of real data for creating the model. [12]

In the next paper, the authors explained that the disease is very common in many people due to
a disorder of metabolic functionality, through which many organs, such as the blood vessels and
nerves, get affected. If we make an early prediction, it may be possible to stop the disease in a
human body before a very dangerous stage. Machine learning techniques provide efficient
results for extracting knowledge by creating predictive models from diagnostic medical datasets
collected from real patients. From such a dataset we can extract much insightful information
using machine learning mechanisms. In this work the authors applied very popular ML models:
SVM, NB, k-nearest neighbours and the C4.5 decision tree. In this case the decision tree gives
better results in terms of accuracy and other performance parameters. [13]

The authors conclude that early prediction can reduce the risk factor by using machine learning
techniques. They extracted insightful information from the given dataset and applied multiple
ML models, out of which the C4.5 decision tree gives the best results in terms of accuracy.

In the final work the authors focus on the fact that in the 21st century a major cause of death is
this disease/syndrome. If the trend continues, then by 2030 millions of people could die from it.
The health sector is collecting real data from different hospitals and test centres for research,
and machine learning gives very good support in analysing it.
CHAPTER 4
PROBLEM DEFINITION & PROPOSED METHOD

4.1 Diseases Prediction Methods


In order to reach our goal, our methodology contains a number of stages, which we explain
below:
A. Datasets & Properties
B. Data Preprocessing
C. Applying Different Machine Learning Techniques
D. Finding Performance Measures
For better understanding, we present it in the form of a process flow diagram, given below:

Real-Time Problem → Relevant Data Collection (Health Data Storage) → Data Preprocessing →
Training Dataset / Testing Dataset → Apply ML Model → Performance Result

Figure 4.1: Proposed Process Flow


Dataset & Properties

Table 4.1: Properties Description

S.No  Property                    Remark
1     Pregnancies                 Number of times pregnant
2     Glucose                     Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3     Blood Pressure              Diastolic blood pressure (mm Hg)
4     Skin Thickness              Triceps skin fold thickness (mm)
5     Insulin                     2-hour serum insulin (mu U/ml)
6     BMI                         Body mass index (weight in kg / (height in m)^2)
7     Diseases Pedigree Function  Diseases pedigree function
8     Age                         Age (years)
9     Outcome                     Class variable (0 or 1); 268 of the 768 records are 1, the rest are 0
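A hedged sketch of preparing these properties with pandas; the few rows below are illustrative stand-ins for the full 768-row Kaggle CSV, and the column names simply mirror Table 4.1:

```python
# Build a tiny frame with the columns of Table 4.1 and split it into
# model inputs and the class variable; a real run would read the
# full Kaggle CSV instead of hard-coding rows.
import pandas as pd

df = pd.DataFrame({
    "Pregnancies": [6, 1, 8],
    "Glucose": [148, 85, 183],
    "BloodPressure": [72, 66, 64],
    "SkinThickness": [35, 29, 0],
    "Insulin": [0, 0, 0],
    "BMI": [33.6, 26.6, 23.3],
    "DiseasesPedigreeFunction": [0.627, 0.351, 0.672],
    "Age": [50, 31, 32],
    "Outcome": [1, 0, 1],
})

features = df.drop(columns="Outcome")  # the eight predictor properties
target = df["Outcome"]                 # class variable (0 or 1)
```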

Data Processing
As researchers we know that data comes in two types, numerical data and nominal data. Each has its own uses, and sometimes one form must be converted into the other. Here we convert numerical data to nominal data: the patient's age is classified into three categories.

Table 4.2 Data Conversion

S. No | Classification | Age Range
1     | Young          | 10-25 years
2     | Adult          | 26-50 years
3     | Old            | Above 50 years
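The numerical-to-nominal age conversion of Table 4.2 can be done with pandas' `cut` function. A small sketch; the sample ages below are made up for illustration:

```python
import pandas as pd

# Example patient ages; bin edges follow Table 4.2
ages = pd.Series([15, 30, 47, 62, 25, 51])

# Bins: [10, 25] -> Young, (25, 50] -> Adult, (50, 120] -> Old
age_group = pd.cut(
    ages,
    bins=[10, 25, 50, 120],
    labels=["Young", "Adult", "Old"],
    include_lowest=True)

print(age_group.tolist())
# → ['Young', 'Adult', 'Adult', 'Old', 'Young', 'Old']
```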

Apply Machine Learning


Once our data is ready, any ML technique can use it to create a model. Here we apply a number of machine learning algorithms to find the best results.

Apply Performance Measures
Using the following equations, we can compute several evaluation parameters, some of which are given below:

Precision: Precision = TP / (TP + FP)

Recall: Recall = TP / (TP + FN)

F-measure: F-measure = (2 × Precision × Recall) / (Precision + Recall)

Accuracy: Accuracy = (TP + TN) / (TP + TN + FP + FN)
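These four measures can be computed directly from confusion-matrix counts. The counts below are hypothetical, chosen only to illustrate the formulas:

```python
# Hypothetical confusion-matrix counts for a binary classifier
TP, FP, FN, TN = 80, 10, 20, 90

precision = TP / (TP + FP)                       # 80 / 90
recall    = TP / (TP + FN)                       # 80 / 100
f_measure = 2 * precision * recall / (precision + recall)
accuracy  = (TP + TN) / (TP + TN + FP + FN)      # 170 / 200

print(f"Precision={precision:.3f} Recall={recall:.3f} "
      f"F-measure={f_measure:.3f} Accuracy={accuracy:.3f}")
```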
4.2 Algorithm

Step 01: Obtain the data from the Kaggle repository

Step 02: Import the required libraries

Step 03: Import the dataset

Step 04: Apply feature extraction

a) Data conversion

b) Apply encoding techniques

Step 05: Visualize the data for better understanding

Step 06: Apply machine learning algorithms

Step 07: Train and evaluate a model

Step 08: Repeat Step 07 with different algorithms

Step 09: Finally, compare the results using performance parameters such as accuracy
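Steps 06-09 amount to fitting several models in a loop and comparing their accuracies. A minimal sketch, using synthetic data in place of the real dataset and an arbitrary selection of scikit-learn models (the project's actual model set may differ):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 01-03 would load the real Kaggle file; synthetic data stands in here
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 06-08: fit each model in turn
models = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}
results = {name: accuracy_score(y_test, m.fit(X_train, y_train).predict(X_test))
           for name, m in models.items()}

# Step 09: compare accuracies, best first
for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```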


4.3 Flow Diagram of proposed methodology

Input Dataset → Preprocessing (numerical to nominal) → Desired Data → Training Data Applied → Apply Different Algorithms → Performance Parameters (Accuracy / Precision) → Evaluate Final Result. If more algorithms remain to be tried, the flow repeats from the training stage for the next algorithm.

Figure 4.2: Flow of Operation
CHAPTER 5
EXPERIMENT SETUP

5.1 Experimental Framework


Python is a prominent environment used by researchers for the development and deployment of generated systems. It has a vast set of libraries with numerous modules and packages that help programmers complete their work efficiently in many ways.

Figure 5.1: GUI Anaconda

Anaconda is a completely free, open-source environment. Python and its libraries are used very efficiently in data science and data analysis. They are also widely used for creating scalable machine learning algorithms. Python can apply various machine learning techniques such as classification, regression, recommendation, and clustering.

Python offers researchers a ready-to-use environment for performing data mining tasks on large volumes and varieties of data effectively and in less time.
Figure 5.2: Libraries of Python (Pandas, SciKit-Learn, SciPy, Matplotlib, and other Python utilities)

5.2 Dataset & Features


Machine learning data is usually described by a matrix called a dataset. This matrix is structured so that each row corresponds to an observation (example) in the dataset and each column represents a feature (also called a variable or attribute) that describes the data. Data values can take many representations. Data can be numerical (integer or real numbers) or nominal, where values are differentiated by name. Nominal data is a type of categorical data in which, as the name indicates, the data can only take a fixed set of nominal values (or categories).
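As a small illustration, such a dataset matrix can be represented in pandas as follows. The values and the `AgeGroup` column are made up for the example:

```python
import pandas as pd

# Each row is an observation, each column a feature
df = pd.DataFrame({
    "Age": [34, 51, 22],                    # numerical (integer)
    "BMI": [27.5, 31.2, 22.8],              # numerical (real)
    "AgeGroup": ["Adult", "Old", "Young"],  # nominal (fixed set of categories)
})
df["AgeGroup"] = df["AgeGroup"].astype("category")

print(df.shape)   # → (3, 3): 3 observations, 3 features
print(df.dtypes)
```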

5.3 Implementation
The model employs filters for faster evaluation and lower overall time. The pre-processing methods and the applied filters strongly affect the final evaluation results of the classifiers (ML-based models). Feature extraction, nominal-to-binary conversion, and cleaning are a few of those filters.
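Two of these filters, cleaning and nominal-to-binary conversion, can be sketched with pandas. The data below is illustrative only; the real pipeline may use different columns and encoders:

```python
import pandas as pd

df = pd.DataFrame({
    "Glucose": [148.0, None, 183.0, 89.0],
    "AgeGroup": ["Adult", "Old", "Adult", "Young"],
})

# Cleaning filter: drop records with missing values
df = df.dropna().reset_index(drop=True)

# Nominal-to-binary filter: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["AgeGroup"])

print(df.columns.tolist())
# → ['Glucose', 'AgeGroup_Adult', 'AgeGroup_Young']
```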

5.4 Different Process Stage


Figure 5.3: Calling Libraries

Explanation: In Figure 5.3 we import the libraries that provide all the functionality required.

Figure 5.4: Major Columns

Explanation: In Figure 5.4 we show the major columns available in our dataset; there are 9 columns in our dataset.

Figure 5.5: Dataset Attributes

Explanation: In Figure 5.5 we show all the attributes available in our dataset; there are 9 columns in our dataset.

Figure 5.6: Histogram of Age in terms of Diseases

Explanation: In Figure 5.6 we show how age varies with disease status.

Figure 5.7: All Parameter Dependencies

Explanation: In Figure 5.7 we show the impact of every parameter on disease status.
CHAPTER 6
CONCLUSION AND FUTURE WORK

6.1 Conclusion
We implemented a number of machine learning algorithms to find the best results in terms of performance. We proposed tuning the given models to improve performance; after applying GradientBoostingClassifier and LGBMClassifier with changes such as the random state and some other parameters, the performance values improved. The values we obtained are RF: 0.897368, XGB: 0.901316, LightGBM: 0.896053.

6.2 Future Work

Future work will focus on applying other techniques to improve the performance of these methods as far as possible. Another option is to implement deep learning in place of machine learning, as deep learning is among the most effective and efficient techniques in use today and is becoming increasingly popular for classification. We can therefore also apply deep learning in future work.
