MDPS
A Project Report submitted in partial fulfillment of the requirements for the award of the
degree of
BACHELOR OF TECHNOLOGY
in
Computer Science & Engineering
by
Souvik Nandi
Sangram Khan
Mirsad Mondal
Md Arsad Ali
Abhijit Rout
Place: NITMAS
Date: 29/11/2023
CERTIFICATE OF RECOMMENDATION
This is to certify that the project entitled “Multiple Disease Prediction System” is the
original work done by Souvik Nandi (Roll No.: 14400120003), Sangram Khan (Roll No.:
14400120011), Mirsad Mondal (Roll No.: 14400120016), Md Arsad Ali (Roll No.:
14400120022) and Abhijit Rout (Roll No.: 14400120012) under my supervision in partial
fulfillment of the degree of Bachelor of Technology in Computer Science & Engineering
at Neotia Institute of Technology, Management and Science (affiliated to MAKAUT and
approved by AICTE) during the academic year 2023-2024.
------------------------
(Prof.) Suman Haldar
Project Supervisor
Dept. of Computer Science & Engineering
Neotia Institute of Technology, Management and Science
Place: NITMAS
Date: 29/11/2023
CERTIFICATE OF APPROVAL
The project report entitled “Multiple Disease Prediction System” submitted by Souvik
Nandi (Roll No.: 14400120003, Registration No.: 201440100110032 (2020-21)), Sangram
Khan (Roll No.: 14400120011, Registration No.: 201440100110024 (2020-21)), Mirsad
Mondal (Roll No.: 14400120016, Registration No.: 201440100110019 (2020-21)), Md Arsad
Ali (Roll No.: 14400120022, Registration No.: 201440100110024 (2020-21)) and Abhijit Rout
(Roll No.: 14400120016, Registration No.: 01440100110019 (2020-21)) of Neotia Institute
of Technology, Management and Science (NITMAS) is approved for the degree of
Bachelor of Technology in Computer Science & Engineering (CSE) under MAKAUT,
West Bengal, India.
Principal,
Neotia Institute of Technology, Management and Science
ACKNOWLEDGEMENT
We would like to take this opportunity to thank everyone whose cooperation and
encouragement throughout the ongoing course of this project remain invaluable to us.
We are sincerely grateful to our guide Prof. Suman Haldar of the Department of Computer
Science, NITMAS, for his wisdom, guidance, and inspiration that helped us to go through
with this project and take it to where it stands now.
Last but not the least, we would like to extend our warm regards to our families and peers
who have kept supporting us and always had faith in our work.
Souvik Nandi
Sangram Khan
Mirsad Mondal
Md Arsad Ali
Abhijit Rout
Place: NITMAS
Date: 29/11/2023
PUBLICATIONS
LIST OF FIGURES
1. Sign Up/Log In
2. Registration
3. Diabetes
4. Heart
5. Parkinson's
CONTENTS
Abstract
1. Introduction
1.1 Literature Survey
1.2 Problem Statements
2. Related work
2.1. Data Entry
2.1.1.
2.2. Modeling (SMOTET)
3. Proposed Methodology
3.1
4. Algorithm
4.1 Random Forest
4.2 Logistic Regression
4.3 Decision Tree
1.1 LITERATURE SURVEY
One earlier model for disease prediction uses several ML algorithms: iForest to correct
problems in the dataset, SMOTET to balance the dataset, and then an ensemble learning
technique for the prediction itself. The input to this model comes only from the
electronic reports produced by a blood examination of the patient or user. Inputs
include glucose level, cholesterol, lipoprotein, blood pressure and other values that
can only be obtained through a physical examination of the patient.
A second model uses big data analytics and deep learning for disease prediction.
Because the dataset is large, big data techniques such as MapReduce are applied first,
and the deep learning models are then run on the result, which makes the overall
process long and time consuming. This model needs a full medical examination of the
patient: the complete medical history is taken as input, stored with big data tools,
and then used by the deep learning models to predict the disease. It also needs the
patient's full medical record, such as the medications taken and the list of doctors
visited, which helps in analysing the patient's problem properly.
A third model uses data mining and classification algorithms and is mainly used for
the prediction of heart disease. The algorithms used are Decision Tree and Naive
Bayes, and various data mining techniques are applied to correct and balance the
dataset so that the system can predict the disease correctly.
1.2 PROBLEM STATEMENTS
Many of the existing machine learning models for health care analysis concentrate on
one disease per analysis. For example, one model is for liver analysis, one for cancer
analysis, one for lung diseases, and so on. If a user wants to predict more than one
disease, he or she has to go through different sites. There is no common system in
which one analysis can perform more than one disease prediction. Some of the models
have low accuracy, which can seriously affect patients' health. When an organization
wants to analyse its patients' health reports, it has to deploy many models, which
increases both cost and time. Some of the existing systems consider very few
parameters, which can yield false results.
One existing model is more time consuming because it processes both structured and
unstructured data, whereas the proposed project uses only structured data. The
classification algorithms used in the proposed project are decision tree, Naive Bayes
and random forest. The reported accuracy of that model is above 90%, which may
indicate overfitting, whereas the proposed model has an accuracy of about 86%, which
is good enough for a disease prediction model.
Another model has a very limited scope, as it is meant only for the prediction of
diabetes and hypertension, whereas the proposed model is used for the prediction of
basic general diseases. The model given by [2] needs the blood report of the patient
for the prediction of diabetes or hypertension and uses ensemble learning techniques,
whereas the proposed model does not need any blood report or the physical presence of
the patient. The proposed system contains a list of symptoms from which the user
selects the symptoms he or she is facing, and it then predicts the disease. The
algorithms used also differ from those of the given model. The inputs required by the
given model are based on the user's medical report, such as cholesterol and blood
glucose, whereas the proposed system does not require any type of blood report for
the prediction of the disease.
Another model uses a very large dataset that is managed with big data analytics, which
makes the system slow and demanding in system resources. The deep learning algorithms
used in it are FISM, NAIS and DeepICF, unlike the proposed model, which uses
lightweight classification algorithms.
PROPOSED SOLUTION
In multiple disease prediction, more than one disease can be predicted from a single system,
so the user does not need to visit different sites to predict different diseases. We consider
three diseases, liver disease, diabetes and heart disease, as the three are correlated with
each other. To implement multiple disease analysis we use machine learning algorithms and
Django. When accessing the API, the user sends the parameters of the disease along with the
disease name. Django invokes the corresponding model and returns the status of the patient.
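The routing described above can be sketched in plain Python as follows. This is only an illustration of the idea, not the project's actual Django view: the function names, the payload layout and the threshold rules are hypothetical placeholders standing in for the trained models.

```python
# Hypothetical stand-ins for the trained models. In the real project each
# entry would be a classifier loaded from disk; the thresholds below are
# placeholders for illustration, not medical cut-offs from the report.
def predict_diabetes(params):
    return "positive" if params.get("glucose", 0) > 140 else "negative"


def predict_heart(params):
    return "positive" if params.get("cholesterol", 0) > 240 else "negative"


def predict_liver(params):
    return "positive" if params.get("total_bilirubin", 0) > 1.2 else "negative"


# Map each disease name to the model that handles it.
MODELS = {
    "diabetes": predict_diabetes,
    "heart": predict_heart,
    "liver": predict_liver,
}


def handle_request(payload):
    """Route a request of the assumed form {"disease": ..., "params": {...}}."""
    disease = str(payload.get("disease", "")).strip().lower()
    model = MODELS.get(disease)
    if model is None:
        return {"error": f"unknown disease: {disease!r}"}
    return {"disease": disease, "status": model(payload.get("params", {}))}
```

Keeping the models in one dictionary means that adding a disease only requires registering one more entry, which matches the goal of a single common system.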
Liver disease causes a high number of deaths in India and is considered a life-threatening
disease worldwide. It is difficult to detect at an early stage, so an automated program
using machine learning algorithms can help detect it accurately. An earlier study compared
the SVM, Decision Tree and Random Forest algorithms, using precision, accuracy and recall
as quantitative metrics. The accuracies were 95%, 87% and 92% respectively.
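For reference, the three metrics named above can be computed from true and predicted labels as in the sketch below. The function name and the example labels are illustrative assumptions and are not data from the report.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall for a binary classifier."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Precision: of the cases predicted positive, how many were positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Recall: of the truly positive cases, how many were found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels, used only to exercise the function.
example = classification_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                                 [1, 1, 0, 0, 0, 1, 1, 0])
```

In disease prediction recall is usually watched closely, because a false negative means a sick patient is missed.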
4. ALGORITHM
4.1 Random Forest
Step-1: Select k random data points from the training set.
Step-2: Build the decision tree associated with the selected data points (subset).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2 until N trees have been built.
Step-5: For a new data point, find the prediction of each decision tree and assign the
point to the category that wins the majority vote.
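The random forest steps above can be sketched in a few lines of standard-library Python. To keep the example short, each "tree" here is a one-level decision stump rather than a full decision tree, which is a simplification and not the project's implementation. All function names are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, rng):
    """Fit a one-level tree: the best threshold on one randomly chosen feature."""
    feature = rng.randrange(len(rows[0]))
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
        right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
        if not left or not right:
            continue
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = (sum(l != left_label for l in left)
                  + sum(l != right_label for l in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        # The feature is constant in this sample: always predict the majority.
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    return (feature, best[1], best[2], best[3])


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def train_forest(rows, labels, n_trees=25, k=None, seed=0):
    rng = random.Random(seed)
    k = k or len(rows)
    forest = []
    for _ in range(n_trees):  # build N trees by repeating the two steps below
        # Select k random data points (sampled with replacement).
        idx = [rng.randrange(len(rows)) for _ in range(k)]
        # Build the tree associated with the selected subset.
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], rng))
    return forest


def predict_forest(forest, row):
    # Each tree votes; the category with the majority of votes wins.
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Toy, linearly separable data used only to exercise the sketch.
rows = [[1, 1], [2, 1.5], [1.5, 2], [8, 9], [9, 8], [8.5, 9.5]]
labels = [0, 0, 0, 1, 1, 1]
forest = train_forest(rows, labels)
```

Calling `predict_forest(forest, [0.5, 0.5])` returns the vote for a new point. In practice a library implementation with full trees would normally be used instead of this sketch.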
4.3 Decision Tree
Step-1: Begin the tree with the root node, S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3.
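The attribute selection step can be illustrated with information gain, one common Attribute Selection Measure. The report does not state which measure was used, so this choice is an assumption, and the symptom data below is made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s)
                    for s in subsets.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Made-up records: "fever" separates the classes, "cough" does not.
rows = [
    {"fever": "yes", "cough": "yes"},
    {"fever": "yes", "cough": "no"},
    {"fever": "no", "cough": "yes"},
    {"fever": "no", "cough": "no"},
]
labels = ["sick", "sick", "healthy", "healthy"]
chosen = best_attribute(rows, labels, ["fever", "cough"])
```

A full tree builder would call `best_attribute` at every node and recurse on each subset until the labels in a subset are all the same.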
Data source.
Variables.
Instances.
The ObesityDataSet.csv file contains the data for this application. The number of instances
(rows) in the data set is 2111, and the number of variables (columns) is 17.
The number of input variables, or attributes for each sample, is 14. Height and weight are
unused variables related to the target variable. The input variables are numeric-valued,
binary, and categorical. The number of target variables is 1 and represents the estimation of
obesity levels in individuals. The following list summarizes the variables information:
Finally, the use of all instances is set. Note that each instance contains the input and
target variables of a different patient. The data set is divided into training, validation
(selection), and testing subsets: 60% of the instances are assigned for training, 20% for
validation, and 20% for testing. More specifically, 1267 samples are used for training,
422 for validation, and 422 for testing.
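The subset sizes quoted above follow directly from the 60/20/20 split of 2111 instances. A minimal sketch of such a split is shown here. The function name and the random seed are assumptions, and the report does not say how the instances were shuffled.

```python
import random


def split_indices(n_instances, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle row indices and cut them into training/validation/testing parts."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    n_train = round(n_instances * train_frac)
    n_val = round(n_instances * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to testing
    return train, val, test


train, val, test = split_indices(2111)
print(len(train), len(val), len(test))  # 1267 422 422
```

Giving the remainder to the last subset guarantees that every instance is used exactly once even when the fractions do not divide the data evenly.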
Once the data set has been set, we are ready to perform a few related analytics. We
check the provided information and make sure that the data has good quality.
Block Diagram
As shown in the figure, we experimented on three diseases, heart disease, diabetes and liver
disease, as these are correlated with each other. The first step is importing the datasets:
the UCI dataset for heart disease, the PIMA dataset for diabetes, and the Indian liver
dataset for liver disease. Once the datasets are imported, each one is visualized. After
visualization the data is pre-processed: we check for outliers and missing values and scale
the dataset. The updated dataset is then split into training and testing sets. On the
training set we apply the KNN, XGBoost and random forest algorithms and evaluate the trained
classifiers on the testing set. We then choose the algorithm with the best accuracy for each
disease, build a pickle file for each disease, and integrate the pickle files with the
Django framework to show the output of the model on the webpage.
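The pickle step at the end of this pipeline can be sketched as below. The dictionary is a hypothetical stand-in for a trained classifier, and the file name is an assumption. Any picklable model object can be saved and restored in the same way.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a trained model to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a model back, e.g. once when the web application starts."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical stand-in for the best classifier chosen for one disease.
model = {"disease": "heart", "algorithm": "random forest",
         "weights": [0.4, -1.2, 0.7]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "heart_model.pkl")
    save_model(model, path)
    restored = load_model(path)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.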
6. System Requirements
6.1 Hardware Requirements
ER-Diagram
8. Neural network
The third step is to set the model parameters. For approximation project types, the neural
network is composed of:
• Scaling layer.
• Perceptron layers.
• Unscaling layer.
The mean and standard deviation method is set as the scaling method, while the minimum and
maximum method is set as the unscaling method. The activation functions chosen for this
model are the hyperbolic tangent for the hidden layer and the linear function for the
output layer.
It contains a scaling layer, two perceptron layers, and an unscaling layer. The number of
inputs is 18, and the number of outputs is 1. The complexity, represented by the number of
neurons in the hidden layer, is 3.
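A forward pass through this architecture can be sketched as follows. The weights, statistics and output range are placeholder values, and the unscaling formula assumes the scaled output lies in [-1, 1], which the report does not state.

```python
import math


def forward(x, means, stds, w_hidden, b_hidden, w_out, b_out, y_min, y_max):
    """Scaling layer -> tanh hidden layer -> linear output -> unscaling layer."""
    # Scaling layer: mean and standard deviation method.
    scaled = [(xi - m) / s for xi, m, s in zip(x, means, stds)]
    # First perceptron layer: hyperbolic tangent activation.
    hidden = [math.tanh(sum(w * v for w, v in zip(ws, scaled)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    # Second perceptron layer: linear activation.
    y_scaled = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    # Unscaling layer: minimum and maximum method (assumes [-1, 1] input).
    return 0.5 * (y_scaled + 1.0) * (y_max - y_min) + y_min


# Placeholder parameters with the sizes given in the text: 18 inputs,
# 3 hidden neurons, 1 output. They are not trained values.
n_inputs, n_hidden = 18, 3
means = [1.0] * n_inputs
stds = [2.0] * n_inputs
w_hidden = [[0.1] * n_inputs for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [0.5] * n_hidden
b_out = 0.0

# An input equal to the means scales to zero, so the output is the midpoint
# of the unscaling range.
y = forward(list(means), means, stds, w_hidden, b_hidden, w_out, b_out,
            y_min=0.0, y_max=1.0)
```

With these sizes the network has 18 × 3 + 3 = 57 parameters in the hidden layer and 3 + 1 = 4 in the output layer.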
9. Supervised Learning
In a supervised learning scenario, the training data consists of input-output pairs, and the
algorithm learns to map the input data to the correct output by adjusting its internal
parameters during the training process. The training process involves presenting the
algorithm with a set of labeled examples, allowing it to make predictions, and then adjusting
its parameters based on the error between the predicted outputs and the actual labels.
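The training loop described above can be illustrated with logistic regression trained by gradient descent. This is a minimal sketch on made-up data. The learning rate, epoch count and function names are assumptions, and a library implementation would normally be used in practice.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.1, epochs=2000):
    """Adjust the weights from the error between predictions and labels."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Error between the predicted probability and the actual label.
            error = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        # Move the parameters against the average gradient.
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if sigmoid(z) >= 0.5 else 0


# Made-up labeled examples: small values are class 0, large values class 1.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logistic_regression(X, y)
```

Each pass presents the labeled examples, measures the prediction error, and nudges the parameters to reduce it, which is the input-output learning process this section describes.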
• In today's world most data is computerized and distributed, but it is not utilized
properly. By analysing the data that already exists, we can also discover unknown
patterns. The primary motive of this project is the prediction of diseases with a
high rate of accuracy. For predicting the disease, we can use the logistic regression
and naive Bayes algorithms available in scikit-learn. The future scope of the paper
is the prediction of diseases using advanced techniques and algorithms with lower
time complexity. A technology called CAD (computer-aided diagnosis) is beneficial,
as systems are sometimes better diagnosticians than doctors. Machine learning and
its branches are used in cancer detection as well, where they assist in making
decisions on critical cases or therapies. Artificial intelligence plays an important
role in the development of many health-related procedures and methods, and it is now
common in surgery, for example robotic surgery. With a growing population, we need
technology that can help us meet the expectations of patients: a flawless cure,
better health, and smooth and easily approachable access to the health care
industry.
• Data mining for healthcare is an interdisciplinary topic of research that evolved from
database statistics and is valuable in assessing the efficacy of medical interventions.
Diabetes-related heart disease, which can be studied through data visualization with
machine learning, is a kind of heart disease that occurs in diabetics. Diabetes is a
chronic disease that arises when the pancreas fails to produce enough insulin or when
the body fails to use the insulin it produces properly.
• In the future we can add more diseases in the existing API.
• We can try to improve the accuracy of prediction in order to decrease the mortality
rate.
BIBLIOGRAPHY
⇨ www.google.com
⇨ www.wikipedia.com
⇨ www.youtube.com