 Data mining for healthcare is an interdisciplinary field of study that originated in
database statistics and is useful in examining the effectiveness of medical therapies.
It draws on machine learning and data visualization. Diabetes-related heart disease is
a kind of heart disease that affects diabetics. Diabetes is a chronic condition that
occurs when the pancreas fails to produce enough insulin or when the body fails to
properly use the insulin that is produced. Heart disease, often known as cardiovascular
disease, refers to a set of conditions that affect the heart or blood vessels. Although
various data mining classification algorithms exist for predicting heart disease, there
is inadequate data for predicting heart disease in a diabetic individual. Because the
decision tree model consistently outperformed the naive Bayes and support vector
machine models, we fine-tuned it for best performance in forecasting the likelihood of
heart disease in diabetic individuals.
 The model uses a dataset of 132 symptoms from which the user can select their own
symptoms. The user does not need a medical report to use this system, because the
prediction is based on symptoms alone, which saves them money. The system also has
an easy-to-use interface, so any user can use it to predict generic diseases.

Machine learning is the programming of computers to optimize performance using sample
data or past data; it is the study of computer systems that learn from data and
experience. A machine-learning workflow has two parts: training and testing. Machine
learning technology has been developing for decades and can be used to predict disease
from symptoms and patient history. It provides a powerful platform in the medical field
for resolving health issues effectively. We apply machine learning to maintain complete
hospital data.
In terms of data collection and processing, healthcare is one of the most demanding
industries. With the advent of the digital era and technological advancements, a vast
quantity of multidimensional data on patients is created, including clinical factors, hospital
resources, illness diagnostic information, patients' records, and medical equipment. The
enormous, dense, and complex data must be processed and evaluated in order to extract
knowledge for effective decision making. Medical data mining offers a lot of potential for
uncovering hidden patterns in medical data sets.
By identifying significant patterns and detecting correlations and relationships among many
variables in huge databases, the use of various data mining tools and machine learning
approaches has changed healthcare organizations. It serves as an important instrument in
the medical sector, providing and comparing existing data for the future course of action.
This technology combines multiple analytic methodologies with modern and complex
algorithms, allowing for the exploration of massive amounts of data. It is used in healthcare
to gather, organize, and analyze patient data in a systematic manner. It may be used to
identify inherent inefficiencies and best practices for providing better services, which may
lead to improved diagnosis, better medicine, and more successful treatment, as well as a
platform for a deeper knowledge of the mechanisms in practically all elements of the
medical domain. Overall, it assists in the early detection and prevention of disease
epidemics by searching medical databases for pertinent information. The process of
determining a condition based on a person’s symptoms and indicators is known as medical
diagnosis. In the diagnostic process, one or more diagnostic procedures, such as diagnostic
tests, are performed. Diagnosis of chronic illnesses is a vital issue in the medical industry
since it is based on many symptoms. It is a complex procedure that frequently leads to
incorrect assumptions. When diagnosing illnesses, the clinical judgment is based mostly on
the patient’s symptoms as well as the physicians' knowledge and experience.

 One previously proposed model is used for disease prediction and combines several ML
algorithms: iForest to correct problems in the dataset, SMOTET to balance the dataset,
and an ensemble learning technique for the prediction itself. The input to the ML
model comes only from the electronic reports produced by a blood examination of the
patient or user. Inputs to this model include glucose level, cholesterol, lipoprotein,
blood pressure, and other values that can only be obtained through a physical
examination of the user or patient.

 Another proposed model uses big data analytics and deep learning models for disease
prediction. Because the dataset is large, big data tools such as MapReduce are applied
first and the deep learning models are then used on top of them for the prediction,
which makes the overall process large and very time consuming. This model needs a
full medical examination of the user or patient. The full medical history of the
patient is taken as input, stored with the help of big data tools, and then used by
the deep learning models to predict the disease. The model also needs all of the
patient's medical records, such as the medications the patient was taking and the list
of doctors he or she has visited, which help in properly analysing the patient’s
problem.

 A further proposed model uses different ML algorithms, such as random forest,
logistic regression, and decision tree, to predict heart disease, breast cancer, and
diabetes. Each algorithm in the system predicts disease in its own way and is used
accordingly. A different dataset is used for each disease; for example, heart disease
and breast cancer each have their own dataset. The algorithms achieve different
accuracies on the different datasets and are selected according to that accuracy.

 Another proposed model uses data mining and classification algorithms for disease
prediction. It is mainly used to predict heart disease, and the algorithms it uses are
the decision tree and naive Bayes algorithms. Various data mining techniques are also
used to correct and balance the dataset so that the system works correctly and
predicts the correct disease.

 Many existing machine learning models for health care analysis concentrate on one
disease per analysis: for example, one for liver analysis, one for cancer analysis, and
one for lung diseases. A user who wants to predict more than one disease has to go
through different sites. There is no common system in which one analysis can perform
more than one disease prediction. Some of the models have low accuracy, which can
seriously affect patients’ health. An organization that wants to analyse its patients’
health reports has to deploy many models, which increases both cost and time. Some
existing systems also consider very few parameters, which can yield false results.
 One existing model is more time consuming because it involves both structured and
unstructured data, so processing takes longer than for a dataset that contains only
structured data, as in the proposed project. The classification algorithms used in the
proposed project are decision tree, naive Bayes, and random forest. The reported
accuracy of that model is above 90%, which may indicate overfitting, whereas the
proposed model has an accuracy of about 86%, which is good enough for a disease
prediction model.
 The model given by [2] has a very limited scope, as it is meant only for the
prediction of diabetes and hypertension, whereas the proposed model is used for the
prediction of basic general diseases. The model given by [2] needs the blood report of
the patient or user to predict diabetes or hypertension, and it uses ensemble learning
techniques, whereas the proposed model does not need any blood report or the physical
presence of the user or patient. The proposed system contains a list of symptoms from
which the user selects the symptoms he or she is experiencing, so the disease can be
predicted easily, and its algorithms differ from those of the given model. The inputs
required by the given model are based on the user's medical report, such as cholesterol
and blood glucose, whereas the proposed system does not require any type of blood
report for the prediction of the disease.
 Another model uses a very large dataset and manages it with big data analytics, which
makes the system slow and demanding in terms of system requirements. The deep learning
algorithms used in that project are FISM, NAIS, and Deep ICF, in contrast to the
proposed model, which uses lightweight classification algorithms.

In multiple disease prediction, more than one disease can be predicted at a time, so the
user does not need to visit different sites in order to predict different diseases. We
consider three diseases, liver disease, diabetes, and heart disease, because the three are
correlated with each other. To implement multiple disease analysis we use machine learning
algorithms and Django. When accessing this API, the user sends the parameters of the
disease along with the disease name. Django invokes the corresponding model and returns
the status of the patient.
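
The request flow described above can be sketched without Django itself. The following is a minimal, framework-free illustration only: the registry `MODELS`, the function `predict_status`, and the threshold rules that stand in for trained models are hypothetical names and placeholders, not part of the actual system.

```python
from typing import Callable, Dict

Params = Dict[str, float]

def make_stub(threshold: float) -> Callable[[Params], str]:
    """Return a placeholder 'model' (NOT a trained classifier, NOT clinical logic)."""
    def model(params: Params) -> str:
        return "positive" if sum(params.values()) > threshold else "negative"
    return model

# Hypothetical registry: one model per supported disease name.
MODELS: Dict[str, Callable[[Params], str]] = {
    "liver": make_stub(1.0),
    "diabetes": make_stub(2.0),
    "heart": make_stub(3.0),
}

def predict_status(disease: str, params: Params) -> Dict[str, str]:
    """Look up the model for the requested disease and return the patient status."""
    key = disease.strip().lower()
    if key not in MODELS:
        raise ValueError(f"unsupported disease: {disease!r}")
    return {"disease": key, "status": MODELS[key](params)}

print(predict_status("Heart", {"a": 2.0, "b": 2.0}))
```

In a Django deployment, a view function would presumably play the role of `predict_status`, reading the disease name and parameters from the request and returning the result as a response.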
Liver disease causes a high number of deaths in India and is considered a life-threatening
disease worldwide. Because liver disease is difficult to detect at an early stage, an
automated program based on machine learning algorithms can help detect it accurately. In
the work considered, the SVM, decision tree, and random forest algorithms were compared
using precision, accuracy, and recall as quantitative measures. The accuracies were 95%,
87%, and 92%, respectively.
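
For reference, the three metrics named above can be computed from the counts of true and false positives and negatives. The sketch below uses made-up labels purely for illustration; `binary_metrics` is a hypothetical helper name, and the numbers have no relation to the accuracies reported above.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = disease, 0 = no disease)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Made-up example labels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75}
```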

4.1 Random Forest Algorithm

Step-1: Select K random data points from the training set.

Step-2: Build the decision tree associated with the selected data points (subset).

Step-3: Choose the number N of decision trees that you want to build.

Step-4: Repeat steps 1 and 2 until N trees have been built.

Step-5: For a new data point, find the prediction of each decision tree and assign the
point to the category that wins the majority vote.
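
The steps above can be sketched in plain Python. This is a simplified illustration, not the implementation used in the project: each "tree" is reduced to a single-split decision stump to keep the example short, and the function names, the toy data, and the choice of a random feature subset per tree are assumptions.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature_ids):
    """One-split 'tree': choose the (feature, threshold) with the fewest training errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            left_label = majority(left)
            right_label = majority(right) if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    _, f, threshold, left_label, right_label = best
    return lambda row: left_label if row[f] <= threshold else right_label

def fit_forest(rows, labels, n_trees=15, k=None, seed=0):
    rng = random.Random(seed)
    k = k or len(rows)
    n_features = len(rows[0])
    forest = []
    for _ in range(n_trees):                      # Steps 3-4: build N trees
        idx = [rng.randrange(len(rows)) for _ in range(k)]   # Step 1: K random points
        feats = rng.sample(range(n_features), max(1, int(n_features ** 0.5)))
        forest.append(fit_stump([rows[i] for i in idx],      # Step 2: tree on the subset
                                [labels[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    """Step 5: every tree votes; the majority category wins."""
    return majority([tree(row) for tree in forest])

# Toy, made-up data: two well-separated classes.
rows = [[1.0, 1.0], [1.5, 0.8], [0.9, 1.2], [5.0, 5.2], [5.5, 4.8], [4.9, 5.1]]
labels = [0, 0, 0, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict_forest(forest, [0.5, 0.5]), predict_forest(forest, [6.0, 6.0]))
```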

4.2 Decision Tree Algorithm

Step-1: Begin the tree with the root node, say S, which contains the complete dataset.

Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).

Step-3: Divide S into subsets that contain the possible values of the best attribute.

Step-4: Generate the decision tree node, which contains the best attribute.

Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue until the nodes cannot be split any further; the final nodes are leaf nodes.
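
As an illustration of these steps, the sketch below builds a small tree using information gain as the attribute selection measure. Information gain is only one possible ASM (the text does not say which one is used), and the attribute names, labels, and toy rows are made up.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    total = len(rows)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [y for row, y in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    # Leaf: all labels agree, or no attribute is left to split on.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))   # Step 2
    node = {"attr": best, "children": {}}                                # Step 4
    remaining = [a for a in attrs if a != best]
    for value in {row[best] for row in rows}:                            # Step 3
        pairs = [(row, y) for row, y in zip(rows, labels) if row[best] == value]
        node["children"][value] = build_tree(                            # Step 5
            [row for row, _ in pairs], [y for _, y in pairs], remaining)
    return node

def classify(tree, row):
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attr"]]]
    return tree

# Toy, made-up symptom table (Step 1: the root holds the complete dataset).
rows = [
    {"fever": "yes", "cough": "yes"},
    {"fever": "yes", "cough": "no"},
    {"fever": "no", "cough": "yes"},
    {"fever": "no", "cough": "no"},
]
labels = ["disease_A", "disease_A", "disease_B", "disease_C"]
tree = build_tree(rows, labels, ["fever", "cough"])
print(tree["attr"], classify(tree, {"fever": "no", "cough": "yes"}))
```

On this toy table the gain for `fever` is 1.0 bit and for `cough` is 0.5 bit, so `fever` becomes the root attribute.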

4.3 Logistic Regression Algorithm

Step 1: Import Libraries

Step 2: Load and Prepare the Dataset

Step 3: Feature Scaling (Optional but recommended)

Step 4: Train the Logistic Regression Model

Step 5: Make Predictions

Step 6: Evaluate the Model
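
The six steps can be illustrated with a small, dependency-free sketch. In practice a library such as scikit-learn would normally be used; the version below trains logistic regression by batch gradient descent only to make each step visible, and the data, learning rate, and function names are assumptions.

```python
# Step 1: import libraries (standard library only in this sketch).
import math

# Step 2: load and prepare the dataset (made-up values; two features per sample).
X_train = [[1.0, 20.0], [2.0, 25.0], [6.0, 60.0], [7.0, 65.0]]
y_train = [0, 0, 1, 1]
X_test = [[3.0, 30.0], [8.0, 70.0]]
y_test = [0, 1]

# Step 3: feature scaling (standardize with statistics from the training set only).
def fit_scaler(X):
    cols = list(zip(*X))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds

def apply_scaler(X, means, stds):
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Step 4: train the model by gradient descent on the log-loss.
def train(X, y, lr=0.5, epochs=500):
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for row, target in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - target
            grad_w = [g + err * xi for g, xi in zip(grad_w, row)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Step 5: make predictions (probability >= 0.5 is labelled 1).
def predict(X, w, b):
    return [int(sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) >= 0.5) for row in X]

means, stds = fit_scaler(X_train)
w, b = train(apply_scaler(X_train, means, stds), y_train)
y_pred = predict(apply_scaler(X_test, means, stds), w, b)

# Step 6: evaluate the model on the held-out samples.
accuracy = sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test)
print(y_pred, accuracy)
```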


5.1 Application type

The variable to be predicted is the obesity level (Insufficient Weight, Normal Weight,
Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity
Type III), which is encoded numerically and treated as continuous. Therefore, this is an
approximation project. Here, the basic goal is to model the obesity level as a function
of the input variables and advise the patient on how to improve it.

5.2 Data set

The data set contains three concepts:

• Data source.
• Variables.
• Instances.

The ObesityDataSet.csv file contains the data for this application. The number of instances
(rows) in the data set is 2111, and the number of variables (columns) is 17.

The number of input variables, or attributes for each sample, is 14. Height and weight are
left unused because they are directly related to the target variable. The input variables
are numeric-valued, binary, and categorical. There is 1 target variable, which represents
the estimated obesity level of the individual. The following list summarizes the variables:

• gender: (1=Female or 0=Male).
• age: (Numeric).
• height: (Numeric).
• weight: (Numeric).
• calories: (0=Yes/1=No). Calorie consumption monitoring.
• activity: (0, 1, 2 or 3). Physical activity frequency.
• transportation: (Automobile, motorbike, bike, public transportation or walking).
Transportation used.
• obesity_level: (1=Insufficient_Weight, 2=Normal_Weight, 3=Overweight_Level_I,
4=Overweight_Level_II, 5=Obesity_Type_I, 6=Obesity_Type_II, 7=Obesity_Type_III).

Finally, the use of all instances is set. Note that each instance contains the input and
target variables of a different patient. The data set is divided into training, validation
(selection), and testing subsets: 60% of the instances are assigned to training, 20% to
validation, and 20% to testing. More specifically, 1267 samples are used for training,
422 for validation, and 422 for testing.
Once the data set has been configured, we are ready to perform a few related analytics. We
check the provided information and make sure that the data is of good quality.
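
The subset sizes quoted above follow directly from applying 60/20/20 fractions to 2111 instances. A minimal sketch of such a split, assuming a random shuffle of row indices (the function name and seed are illustrative):

```python
import random

def split_indices(n, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle the row indices and cut them into training, validation, and testing parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(2111)
print(len(train_idx), len(val_idx), len(test_idx))  # 1267 422 422
```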

5.3 Architecture Design

Block Diagram

As shown in the block diagram, we experimented on three diseases, heart disease, diabetes,
and liver disease, as these are correlated with each other. The first step is to import the
datasets: the UCI dataset for heart disease, the PIMA dataset for diabetes, and the Indian
liver dataset for liver disease. Once the datasets are imported, each one is visualized.
After visualization, the data is pre-processed: we check for outliers and missing values
and scale the dataset. We then split the updated dataset into training and testing sets.
On the training set we applied the KNN, XGBoost, and random forest algorithms, and we
evaluated the trained classifiers on the testing set. We then chose the algorithm with the
best accuracy for each disease, built pickle files for all the diseases, and integrated
the pickle files with the Django framework to present the output of the model.
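
The "choose the most accurate model, then pickle it" step can be sketched as follows. This is an illustration under stated assumptions: the two candidate "models" are simple threshold rules stored as dictionaries (stand-ins for the trained KNN, XGBoost, and random forest classifiers), the test data is made up, and `pickle.dumps` is used in place of writing a file.

```python
import pickle

def predict(model, row):
    """Placeholder classifier: 1 if the chosen feature exceeds the threshold, else 0."""
    return int(row[model["feature"]] > model["threshold"])

def accuracy(model, X, y):
    return sum(predict(model, row) == target for row, target in zip(X, y)) / len(y)

# Hypothetical candidates standing in for the trained classifiers.
candidates = {
    "rule_a": {"feature": 0, "threshold": 3.0},
    "rule_b": {"feature": 1, "threshold": 0.5},
}

# Made-up held-out test data.
X_test = [[1.0, 0.9], [2.0, 0.2], [4.0, 0.7], [5.0, 0.1]]
y_test = [0, 0, 1, 1]

scores = {name: accuracy(model, X_test, y_test) for name, model in candidates.items()}
best_name = max(scores, key=scores.get)

# Serialize the winner; with a real file one would use pickle.dump(model, file) instead.
blob = pickle.dumps(candidates[best_name])
restored = pickle.loads(blob)
print(best_name, scores, predict(restored, [4.5, 0.0]))
```

The web layer would then load the stored model once and call its prediction function for each request.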
6. System Requirements
6.1 Hardware Requirements

• Processor : Core i3/i5/i7

• RAM : 2-4GB
• HDD : 500 GB

6.2 Software Requirements

• Platform : Windows XP/7/8/10/11
• Coding Language : Python
7. Diagram

7.1 Data Flow Diagram

Data flow Diagram

7.2 ER-Diagram

8. Neural network

The third step is to set the model parameters. For an approximation project, the neural
network is composed of:
• Scaling layer.
• Perceptron layers.
• Unscaling layer.

The mean and standard deviation method is used for scaling, while the minimum and maximum
method is used for unscaling. The activation functions chosen for this model are the
hyperbolic tangent for the hidden layer and the linear function for the output layer.

The network contains a scaling layer, two perceptron layers, and an unscaling layer. The
number of inputs is 18, and the number of outputs is 1. The complexity, represented by the
number of neurons in the hidden layer, is 3.
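
A forward pass through a network of this shape can be sketched as below. The weights are random placeholders rather than trained values, the scaling statistics are dummies, and the unscaling formula assumes that the scaled output lies in [-1, 1] and that the target ranges from 1 to 7; all of these are assumptions made for illustration.

```python
import math
import random

N_INPUTS, N_HIDDEN = 18, 3

def scale(x, means, stds):
    """Scaling layer: mean and standard deviation method."""
    return [(v - m) / s for v, m, s in zip(x, means, stds)]

def unscale(y_scaled, y_min, y_max):
    """Unscaling layer: minimum and maximum method (assumes y_scaled in [-1, 1])."""
    return 0.5 * (y_scaled + 1.0) * (y_max - y_min) + y_min

def forward(x, p):
    z = scale(x, p["means"], p["stds"])
    # First perceptron layer: 3 neurons with hyperbolic tangent activation.
    hidden = [math.tanh(sum(w * v for w, v in zip(row, z)) + b)
              for row, b in zip(p["W1"], p["b1"])]
    # Second perceptron layer: 1 neuron with linear activation.
    y_scaled = sum(w * h for w, h in zip(p["W2"], hidden)) + p["b2"]
    return unscale(y_scaled, p["y_min"], p["y_max"])

rng = random.Random(0)
params = {
    "means": [0.0] * N_INPUTS, "stds": [1.0] * N_INPUTS,      # dummy statistics
    "W1": [[rng.uniform(-0.1, 0.1) for _ in range(N_INPUTS)] for _ in range(N_HIDDEN)],
    "b1": [0.0] * N_HIDDEN,
    "W2": [rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)],
    "b2": 0.0,
    "y_min": 1.0, "y_max": 7.0,                               # assumed target range
}

n_params = sum(len(r) for r in params["W1"]) + len(params["b1"]) + len(params["W2"]) + 1
print(n_params, forward([0.0] * N_INPUTS, params))
```

With 18 inputs, 3 hidden neurons, and 1 output, such a network has 18·3 + 3 + 3 + 1 = 61 trainable parameters.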

9. Supervised Learning

Supervised learning is a type of machine learning where the algorithm is trained on a
labeled dataset, which means that each input data point is associated with a corresponding
output label. The goal of supervised learning is to learn a mapping from inputs to outputs, so
that the algorithm can make predictions or decisions on new, unseen data.

In a supervised learning scenario, the training data consists of input-output pairs, and the
algorithm learns to map the input data to the correct output by adjusting its internal
parameters during the training process. The training process involves presenting the
algorithm with a set of labeled examples, allowing it to make predictions, and then adjusting
its parameters based on the error between the predicted outputs and the actual labels.
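
The training loop described above (predict, measure the error against the labels, adjust the parameters) can be illustrated with the simplest possible model, a straight line fitted by gradient descent. The data, learning rate, and function name are assumptions chosen for the example.

```python
def fit_line(xs, ys, lr=0.05, epochs=2000):
    """Fit y = w * x + b by repeatedly reducing the mean squared prediction error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Error between the predicted outputs and the actual labels.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Adjust the internal parameters against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Labeled input-output pairs generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```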

There are two main types of supervised learning:

1. Classification: In classification tasks, the goal is to predict a discrete label or
category for each input. For example, classifying emails as spam or not spam, or
identifying whether an image contains a cat or a dog.

2. Regression: In regression tasks, the goal is to predict a continuous numeric value.
Examples include predicting the price of a house based on its features, or predicting
a person's age based on certain attributes.

Supervised learning is widely used in various applications, including image and speech
recognition, natural language processing, recommendation systems, and many others. The
key idea is to leverage the labeled training data to enable the algorithm to generalize its
learning to new, unseen data and make accurate predictions or classifications.

• In today's world most data is computerized, but it is distributed and not utilized
properly. By analysing the data that is already available, we can also discover
unknown patterns. The primary motive of this project is the prediction of diseases
with a high rate of accuracy. For predicting disease, we can use the logistic
regression and naive Bayes algorithms as implemented in scikit-learn. The future
scope of the work is the prediction of diseases using advanced techniques and
algorithms with lower time complexity. Computer-aided diagnosis (CAD) is beneficial
because such systems can sometimes diagnose better than doctors. Machine learning
and its different branches are used in cancer detection as well, where they assist
in making decisions on critical cases or on therapies. Artificial intelligence plays
an important role in the development of many health-related procedures and methods,
and is now common in surgery, for example robotic surgery. Because the population is
growing, we need technology that can help us meet patients' expectations: their cure,
their better health, and smooth, easily approachable access to health care so that
they can heal and get well soon.
• Data mining for healthcare is an interdisciplinary topic of research that evolved from
database statistics and is valuable in assessing the efficacy of medical interventions.
It draws on machine learning and data visualization. Diabetes-related heart disease is
a kind of heart disease that occurs in diabetics. Diabetes is a chronic disease that
arises when the pancreas fails to create enough insulin or when the body fails to
properly utilize the insulin that is generated.
• In the future we can add more diseases to the existing API.
• We can try to improve the accuracy of prediction in order to decrease the mortality rate.


