Project Documents Group12

Diabetes Prediction using Machine Learning
Bachelor of Technology in Information Technology
By
Group No: - 12
Abhishek Sinha -14800218056

Arka Dutta -14800218049
Ritayan Midya -14800218026
Pritam Pal -14800218030
Ehsan Hassan -14800218043
Under the Guidance of

Prof. Mousumi Biswas
DEPARTMENT OF INFORMATION TECHNOLOGY

FUTURE INSTITUTE OF ENGINEERING AND MANAGEMENT
(Affiliated to West Bengal University of Technology)
KOLKATA 700 150
2021
Content of the Project Document
CERTIFICATE
ACKNOWLEDGEMENT
INTRODUCTION
MOTIVATION OF THE PROJECT
HARDWARE AND SOFTWARE TOOLS TO BE USED
FLOW-CHART OF THE PROJECT
ABOUT DATASET
PREPROCESSING DATASET
ABOUT CLASSIFICATION SUPERVISED MODEL
CONFUSION MATRIX
ROC CURVE
OUTPUT COMPARISON
FUTURE SCOPE
CONCLUSION
REFERENCES
Department of Information Technology

FUTURE INSTITUTE OF ENGINEERING AND MANAGEMENT
Sonarpur Station Road, Kolkata – 700150
Tel: 033-2434 5640 (Extn. – 238) URL: www.futureengineering.in
CERTIFICATE
We do hereby declare that the work presented in the Project Report entitled Diabetes
Prediction using Machine Learning, in partial fulfilment of the requirements for the award of the Bachelor of
Technology in Information Technology and submitted to the Department of Information Technology of Future
Institute of Engineering and Management, Kolkata, is an authentic record of our own work carried out during the
period from September 2021 to June 2022, under the supervision of Prof. Mousumi Biswas.
The matter presented in this thesis has not been submitted by us for the award of any other degree elsewhere.
Full Signature of the Student(s)

a) Abhishek Sinha
b) Arka Dutta
c) Ritayan Midya
d) Pritam Pal
e) Ehsan Hassan
This is to certify that the above statement made by the students is correct to the best of my knowledge.
Date: 11.01.2022
Signature of the Supervisor
Prof. Mousumi Biswas

Assistant Professor
Head
Department of Information Technology
Future Institute of Engineering and Management
Kolkata, WB

Signature of the External Examiner/Panel Members
ACKNOWLEDGEMENT
We have put considerable effort into this project. However, it would not have
been possible without the kind support and help of many individuals. We would
like to extend our sincere thanks to all of them.
We are highly indebted to our guide Prof. Mousumi Biswas for guidance and
constant supervision, for providing necessary information regarding the
project, and for support in completing the project.
Special thanks also go to Prof. Debjyoti Basu and Prof. Subhasis Mitra for
helping us in this project.
We express our thanks to our Principal Dr. Aloke Ghosh and our Head of the
Department Prof. Prasenjit Basu for extending their support. We would also
like to thank our Institution and the faculty members, without whom this
project would have been a distant reality.
Our thanks and appreciation also go to all the people who have willingly
helped us with their abilities.
Abhishek Sinha
Arka Dutta
Ritayan Midya
Pritam Pal
Ehsan Hassan
INTRODUCTION
Diabetes is one of the most harmful diseases in the world. It is associated
with obesity, high blood glucose levels, and other factors. It affects the
hormone insulin, resulting in abnormal metabolism of carbohydrates and raising
the level of sugar in the blood. Diabetes occurs when the body does not make
enough insulin. According to the World Health Organization (WHO), about 422
million people suffer from diabetes, particularly in low- and middle-income
countries, and this could increase to 490 million by 2030. Diabetes is
prevalent in many countries, including Canada, China, and India. India's
population is now more than one billion, and the number of diabetics in India
is an estimated 72.9 million. Diabetes is a major cause of death in the world.
Early prediction of a disease like diabetes allows it to be controlled and can
save lives. To this end, this work explores the prediction of diabetes from
various attributes related to the disease.
MOTIVATION OF THE PROJECT
In recent times, many people have been suffering from diabetes. There are an
estimated 72.96 million cases of diabetes in the adult population of India.
Prevalence ranges between 10.9% and 14.2% in urban areas and between 3.0% and
7.8% in rural India among the population aged 20 years and above, with a much
higher prevalence among individuals aged over 50 years. To study this problem
we use the Pima Indian Diabetes Dataset and apply machine learning
classification methods to predict diabetes. Machine learning is a method of
training computers to learn from data instead of being explicitly programmed.
Machine learning techniques extract knowledge efficiently by building
classification and ensemble models from collected data, and such data can be
useful for predicting diabetes. Many machine learning techniques are capable
of prediction, but it is hard to choose the best one. For this reason we apply
two popular classification methods, K-NN and Logistic Regression, to the
dataset. The main objective of this project is to compare these two methods
and choose the better prediction method.
HARDWARE & SOFTWARE TOOLS TO BE USED
HARDWARE:
• Any kind of laptop or desktop (Windows 10) with internet connectivity
• GPU
SOFTWARE:
• Google Colab
• MS Excel
• Python
• Sklearn
FLOW-CHART OF THE PROJECT
This is the most important phase, which covers building the models for the
prediction of diabetes. Here we implement the machine learning algorithms
described in this report.
Procedure of Proposed Methodology-
Step1: Import the required libraries and the diabetes dataset.
Step2: Pre-process the data to remove missing data.
Step3: Split the dataset into a training set (80%) and a test set (20%).
Step4: Select the machine learning algorithm, i.e. K-Nearest Neighbour or
Logistic Regression.
Step5: Build the classifier model for the selected algorithm on the
training set.
Step6: Test the classifier model for the selected algorithm on the test set.
Step7: Compare the experimental performance results obtained for each
classifier.
Step8: After analysing the various measures, conclude which algorithm
performs best.
[Flow chart. Boxes: Pima Diabetes Dataset; Data Processing; Split Dataset; Train Dataset (80%); Test Dataset (20%); Fitting Classification Supervised Model (K-NN, Logistic Regression); Classifier; Predicting Test Result; Confusion Matrix; Graph Visualization and Analysing Best Model]
ABOUT DATASET
This dataset is originally from the National Institute of Diabetes and
Digestive and Kidney Diseases. The objective is to predict, based on
diagnostic measurements, whether a patient has diabetes or not.
• This dataset has 768 samples of diabetic and healthy individuals.
• In particular, all patients here are females of at least 21 years of age.
• The diabetes dataset is credited to the UCI machine learning database
repository.
• The dataset has a total of 9 attributes, of which 8 are independent
variables and one is the dependent (target) variable, which indicates
whether the patient has diabetes or not.
Attribute Details:
• Pregnancies (Number of times pregnant)
• Glucose level
• Blood Pressure
• Skin Thickness
• Insulin
• BMI (Body Mass Index)
• Diabetes Pedigree Function (It provides information about
diabetes history in relatives and the genetic relationship of those
relatives with the patient.)
• Age
• Outcome (0 means Non-diabetic and 1 means Diabetic)
PREPROCESSING DATASET
1. Replace 0 values with the median of each attribute:

The columns Pregnancies, Glucose, Blood Pressure, Skin Thickness,
Insulin and BMI have minimum values of 0. It makes sense to have 0
pregnancies, but it does not make sense for the other variables to have
a value of 0. So we can conclude that Glucose, Blood Pressure, Skin
Thickness, Insulin and BMI have missing data. The 0's in these columns
should be replaced with the median, since the median is less affected by
outliers than the mean.
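The replacement step can be sketched in plain Python as follows. This is only an illustration, not the project's own code: the column names, the toy rows, and the choice to compute each median over the non-zero entries only are all assumptions.

```python
from statistics import median

# Assumed column names; a 0 in these columns is treated as a missing value.
# Pregnancies is deliberately left out because 0 is a valid value there.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def replace_zeros_with_median(rows, columns=ZERO_AS_MISSING):
    """Return a copy of rows with each 0 in the given columns replaced by
    the median of that column's non-zero values."""
    cleaned = [dict(row) for row in rows]
    for col in columns:
        observed = [row[col] for row in rows if row[col] != 0]
        if not observed:
            continue  # nothing to estimate the median from
        col_median = median(observed)
        for row in cleaned:
            if row[col] == 0:
                row[col] = col_median
    return cleaned


# Toy rows (made-up values, not taken from the dataset).
rows = [
    {"Pregnancies": 0, "Glucose": 100, "BMI": 30.0},
    {"Pregnancies": 2, "Glucose": 0, "BMI": 0},
    {"Pregnancies": 1, "Glucose": 140, "BMI": 20.0},
]
cleaned = replace_zeros_with_median(rows, columns=["Glucose", "BMI"])
```

Excluding the zeros before taking the median keeps the missing entries from pulling the estimate down; whether the project did this is not stated.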
2. Scaling the Data:

A feature with a larger range can overshadow or completely diminish a
feature with a smaller range, and this will affect the prediction. So we
need to scale the data to get a good result. The data is rescaled so that
each feature has mean μ = 0 and standard deviation σ = 1, using the formula
$$z = \frac{x - \mu}{\sigma}$$
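A minimal sketch of this rescaling for a single feature, using only the standard library. The function name and sample values are illustrative, and the use of the population standard deviation is an assumption (it is the convention scikit-learn's StandardScaler follows).

```python
from statistics import fmean, pstdev


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1: z = (x - mu) / sigma."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant feature carries no information
    return [(v - mu) / sigma for v in values]


glucose = [90.0, 120.0, 150.0]  # made-up values for illustration
scaled = standardize(glucose)
```

In practice the mean and standard deviation would normally be computed on the training set and then reused to scale the test set.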
3. Split the dataset:

After processing the data we split the dataset into two parts: a Train
Dataset (80%) and a Test Dataset (20%).
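An 80/20 split could look like the sketch below. Shuffling before the split, the fixed seed, and the helper name are assumptions made for the example; the report does not say how the rows were selected.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle row indices reproducibly and split them into train and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = round(len(rows) * test_fraction)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


# Stand-in for the 768 samples: each "row" is just its index here.
rows = list(range(768))
train, test = train_test_split_rows(rows)
```

With 768 samples this gives 614 training rows and 154 test rows.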
ABOUT CLASSIFICATION SUPERVISED MODEL

Supervised learning is a type of machine learning in which machines
are trained using well "labelled" training data and, on the basis of that
data, predict the output. Labelled data means that the input data is
already tagged with the correct output.
Classification algorithms are used when the output variable is categorical,
for example when there are two classes such as Yes-No or True-False.
Examples of classification algorithms include:
• K-NN Algorithm
• Logistic Regression
K-Nearest Neighbour - KNN is a supervised machine learning algorithm
that can solve both classification and regression problems. KNN is a lazy
learning technique. It assumes that similar things are near to each other:
data points that are similar usually lie close together. KNN assigns a new
data point to a group based on a similarity measure. The algorithm stores
all the training records and classifies new records according to their
similarity to them. To find the distances between points efficiently, a
tree-like structure can be used. To make a prediction for a new data point,
the algorithm finds the closest data points in the training data set, its
nearest neighbours. Here K is the number of nearby neighbours and is always
a positive integer. The predicted value is chosen from the set of classes of
those neighbours. Closeness is mainly defined in terms of Euclidean
distance. The Euclidean distance between two points P = (p1, p2, ..., pn)
and Q = (q1, q2, ..., qn) is defined by the following equation:
$$d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
Algorithm-
• Take the Pima Indian Diabetes data set as the sample (training)
dataset of rows and columns.
• Take a test dataset with the same attributes.
• Find the Euclidean distance from each test row to every training row
using the Euclidean distance formula.
• Then choose a value of K, the number of nearest neighbours.
• Then select the K training rows with the smallest distances and read
the outcome (last) column of each.
• Find the outcome value shared by the majority of these neighbours.
If most of them are diabetic, then the patient is predicted to be
diabetic, otherwise not.
In this dataset, for the value K = 19, the prediction score is high.
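The K-NN procedure can be sketched in a few lines of plain Python. This is a brute-force illustration under stated assumptions (two made-up features, a toy training set, majority voting), not the scikit-learn classifier the project presumably used.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two points of equal length."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def knn_predict(train_x, train_y, query, k):
    """Predict the label of query by majority vote among its k nearest training points."""
    distances = sorted(
        (euclidean(x, query), label) for x, label in zip(train_x, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy training set (made-up scaled features): 0 = non-diabetic, 1 = diabetic.
train_x = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 5.5), (6.5, 7.0)]
train_y = [0, 0, 0, 1, 1, 1]

pred_low = knn_predict(train_x, train_y, (1.2, 1.1), k=3)
pred_high = knn_predict(train_x, train_y, (6.4, 6.2), k=3)
```

An odd K avoids tied votes in a two-class problem, which may be one reason a value such as 19 works well here.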
Logistic Regression- Logistic regression is also a supervised learning
classification algorithm. It is used to estimate the probability of a binary
response based on one or more predictors, which can be continuous or
discrete. Logistic regression is used when we want to classify or
distinguish data items into categories.
It classifies the data in binary form, that is, only as 0 or 1, which here
means classifying a patient as negative or positive for diabetes.
The main aim of logistic regression is to find the best-fitting model
describing the relationship between the target and the predictor variables.
Logistic regression builds on the linear regression model and uses the
sigmoid function to predict the probability of the positive and negative
classes.
Sigmoid function: $P = \frac{1}{1 + e^{-(a + bx)}}$, where P is the
probability and a and b are the parameters of the model.
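To make the sigmoid concrete, the sketch below evaluates P for a single predictor x. The parameter values a = -6.0 and b = 0.05 and the 0.5 decision threshold are purely illustrative assumptions; in the project the parameters would be learned from the training data.

```python
import math


def sigmoid_probability(x, a, b):
    """P = 1 / (1 + e^-(a + b*x)): probability of the positive class."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def classify(x, a, b, threshold=0.5):
    """Return 1 (diabetic) if the probability reaches the threshold, else 0."""
    return 1 if sigmoid_probability(x, a, b) >= threshold else 0


# Hypothetical parameters chosen so that the boundary falls at x = 120.
a, b = -6.0, 0.05
p_boundary = sigmoid_probability(120, a, b)
label_high = classify(180, a, b)
label_low = classify(80, a, b)
```

At x = 120 the linear term a + bx is zero, so the probability is 0.5; larger x pushes it towards 1 and smaller x towards 0.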
CONFUSION MATRIX
The confusion matrix is a technique for summarizing the performance of a
classification algorithm, here one with binary outputs. For this Diabetes
Prediction-
• Cases in which the classifier predicted that the patient has the
disease, and the patient has the disease, are termed TRUE POSITIVES
(TP). The classifier has correctly predicted that the patient has the
disease.
• Cases in which the classifier predicted that the patient does not have
the disease, and the patient does not have the disease, are termed
TRUE NEGATIVES (TN). The classifier has correctly predicted that the
patient does not have the disease.
• Cases in which the classifier predicted that the patient has the
disease, but the patient does not have the disease, are termed FALSE
POSITIVES (FP). Also known as "Type I error".
• Cases in which the classifier predicted that the patient does not have
the disease, but the patient has the disease, are termed FALSE
NEGATIVES (FN). Also known as "Type II error".
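The four counts can be computed from the true and predicted labels as sketched below. The sketch uses the usual convention that the positive class is Outcome = 1 (diabetic); the label lists are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classifier."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted diabetic, actually diabetic
        elif predicted != positive and actual != positive:
            tn += 1  # predicted healthy, actually healthy
        elif predicted == positive:
            fp += 1  # predicted diabetic, actually healthy (Type I error)
        else:
            fn += 1  # predicted healthy, actually diabetic (Type II error)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Made-up labels: 1 = diabetic, 0 = non-diabetic.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
```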
1. Confusion Matrix for K-NN Algorithm:

2. Confusion Matrix for Logistic Regression:
ROC CURVE
A Receiver Operating Characteristic Curve (ROC curve) is a graphical
plot that illustrates the diagnostic ability of a binary classifier system as
its discrimination threshold is varied. The ROC curve is created by
plotting the true positive rate against the false positive rate at various
threshold settings.
TPR = TP / (TP + FN)
SPECIFICITY = TN / (TN + FP)
FPR = 1 - SPECIFICITY
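These rates, and the area under the resulting curve, can be sketched as follows. The threshold sweep and the trapezoidal area are one common way to build an ROC curve and are assumptions here, as are the four made-up labels and scores; the sketch also assumes both classes are present.

```python
def rates(tp, fn, tn, fp):
    """Return (TPR, FPR) using TPR = TP/(TP+FN) and FPR = 1 - TN/(TN+FP)."""
    tpr = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return tpr, 1 - specificity


def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold over each distinct score."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
        tn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 0)
        tpr, fpr = rates(tp, fn, tn, fp)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up labels and predicted probabilities for illustration.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = area_under_curve(roc_points(y_true, scores))
```

An area of 1.0 would mean the scores separate the two classes perfectly, while 0.5 corresponds to random guessing.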
1. ROC Curve for K-NN
Area Under ROC: 0.8232323232323232
2. ROC Curve for Logistic Regression
Area Under ROC: 0.8229568411386594
OUTPUT COMPARISON
Method Name            Accuracy Rate (%)     Miscalculation Rate (%)
K-NN                   77.27272727272727     22.727272727272734
Logistic Regression    75.32467532467533     24.675324675324674
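The two columns are related by miscalculation rate = 100 - accuracy rate, as the sketch below shows. The counts of correct predictions (119 and 116 out of 154 test samples) are not stated in this report; they are inferred from the reported percentages and a 20% test split of 768 samples, and should be read as an assumption.

```python
def accuracy_and_error(correct, total):
    """Return (accuracy %, miscalculation %) for a number of correct predictions."""
    accuracy = 100.0 * correct / total
    return accuracy, 100.0 - accuracy


# Inferred, not reported: 154 test samples, 119 (K-NN) and 116 (Logistic Regression) correct.
knn_accuracy, knn_error = accuracy_and_error(119, 154)
lr_accuracy, lr_error = accuracy_and_error(116, 154)
```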
FUTURE SCOPE
• Implement SVM and Random Forest classification, with the aim of
improving the accuracy rate.
• Implement a GUI as the front end.
CONCLUSION
The main aim of this project was to design and implement diabetes
prediction using machine learning methods and to analyse the performance of
those methods, and this has been achieved successfully. The proposed
approach applies two classification methods, K-NN and Logistic Regression.
The experimental results can assist health care providers in predicting
diabetes early and making early decisions to treat it and save lives.
REFERENCES
• https://www.javatpoint.com/supervised-machine-learning
• www.youtube.com
• www.kaggle.com
• www.ijert.org