
Abstract

Diabetes is a fatal disease and one of the five major causes of death, affecting millions of
people across the globe. A review of recently published papers on diabetes prediction showed
that most of the models were built on small datasets of about 1,000 records drawn from a single
region or country. Several of these papers name larger datasets and ensemble approaches as
future scope. The dataset used for this study has more than 117,000 records covering people of
different ethnicities. Different ensemble models, such as PCA-Logistic Regression, PCA-Naive
Bayes, PCA-Decision Tree, CNN-LSTM and PCA-RF, have been implemented to predict diabetes.
Several metrics, such as accuracy, have been used to evaluate and compare the predictive
capability and performance of these models.
Table of Contents

Abstract
1. Background
2. Related Work
4. Aim and Objectives
5. Significance of the Study
6. Scope of the Study
7. Research Methodology
8. Requirements Resources
9. Research Plan
References

1. Background

The International Diabetes Federation states that more than 500 million adults across the
world are living with this chronic illness, and this figure is expected to grow beyond 700
million in another 20 years (Menaka et al., 2022). The body of a person with diabetes cannot
adequately process and use glucose from the food consumed. While certain types of diabetes
may be prevented by adopting a healthy lifestyle, others have to be controlled by treatment,
including medication and/or insulin. Insulin is a hormone made by the pancreas (an organ
located behind the stomach), which releases it into the bloodstream.

There are two broad categories of diabetes, namely Type I diabetes and Type II diabetes. The
former depends on insulin secretion, and the latter is due to improper utilization of insulin.
Type I diabetes is more dangerous than Type II diabetes, as it may cause severe damage to the
vital organs; specifically, the heart, kidneys, blood vessels and eyes are affected by Type I
diabetes. The common symptoms of Type I diabetes are abdominal pain, nausea, lack of energy,
loss of body weight, etc. The symptoms of Type II diabetes include blurred vision, frequent
urination and excessive thirst. Most diabetic patients are diagnosed with Type II diabetes, as
it is the more common type (Menaka et al., 2022).

Multiple factors contribute to the chance of developing diabetes, including family history,
injury to the pancreas, presence of autoantibodies, physical stress, exposure to illnesses
caused by viruses, obesity, high blood pressure, physical inactivity, age and smoking.

2. Related Work

Many studies have built models to predict diabetes. Some of the studies published in 2022 are
summarized below:

“A Comparative analysis of various supervised machine learning algorithms for the early
prediction of type-II diabetes mellitus was done on 400 instances of data from across a wide
geographical region based on the suggestion and expert advice of the concerned medical domain.
The data for the experiment were also collected from various pathological labs. Various
non-pathological parameters were decided with the help of diabetes specialists. As part
of pre-processing, outliers were removed and corrupted and missing values were handled. The
algorithms used were LR, NB, SVM, DT, RF and ANN, and were verified using a 10-fold
cross-validation technique. The RF model also produced good results for other statistical
parameters like precision, recall, specificity, F1-score, false-positive rate, false-negative
rate and negative predicted value.
For future work, in order to establish the efficiency, reliability and validity of the current
research, hybridisation or ensemble approaches may be used for better results and detection of
diabetes in early stages to save human lives.” (Mohammad Ganie and Bashir Malik, 2022)

“A Data Analytics Suite for Exploratory, Predictive, and Visual Analysis of Type 2 Diabetes
was done on patient data from the Croydon/Prowellness database. As part of data
pre-processing, missing values were imputed, and standardisation was performed by conversion
of metric units (e.g., mg/dL to mmol/L) and merging of similar data columns. The survival time
of the patients is computed from the time of diabetes diagnosis to the occurrence of the
complication. The model was an analytics suite that performs exploratory, predictive, and
visual analysis of Type 2 Diabetes data. The prediction model was implemented in the R
programming environment. The SVM model was validated via a 5-fold cross-validation study. An
average prediction accuracy of 65.05% was obtained in the 5-fold cross-validation, and the best
prediction accuracy obtained was 73.3%.
Possibilities for future work include building and training the model on larger databases to
increase the prediction accuracy and develop more robust prediction models by adopting
artificial intelligence methods, and clinical validation of the data.” (IEEE Xplore Full-Text
PDF:, 2023a)

“A Novel Diabetes Healthcare Disease Prediction Framework Using Machine Learning
Techniques. Pima Indian Diabetes Database is a familiar and commonly used data set for the
prediction of diabetes. This data set consists of 768 rows and 9 columns. As part of data
pre-processing, outliers were removed and the data were standardized. The algorithms used in
this model were four classifier models: LR, RF, SVM, and KNN. Outliers were removed from the
data set before training. ML algorithm comparison is indicated in the bar chart. RF and SVM
have a high accuracy of 83%.
For future work, developing and deploying an Android application is proposed for the suggested
hypothetical diabetes monitoring system, including the proposed categorization and prediction
algorithms. Genetic algorithms, in conjunction with the suggested prediction mechanism,
may be investigated for improved monitoring.” (Krishnamoorthi et al., 2022)

“An ensemble Machine Learning approach for predicting Type-II diabetes mellitus based on
lifestyle indicators. The dataset used for the study has been collected from different
geographical regions based on the expert advice of specialists like diabetologists,
endocrinologists, etc. The dataset comprised 1939 records and 11 biological/lifestyle
parameters, where the first 10 parameters are predictor/independent variables and the last one
(Outcome) is the target/dependent variable. The collected dataset was pre-processed using
techniques such as resampling and discretization. Data transformation has been performed to
improve the efficiency of data before building the machine learning models. SMOTE was used for
data balancing.
Different ensemble learning techniques like Bagging, Boosting, and Voting are employed.
Among all the classification techniques, the bagged decision tree achieved the highest accuracy
rate (99.41%), precision (99.13%), recall (95.83%), specificity (99.11%), F1-score (99.15%),
misclassification rate (MCR) (0.86%), and receiver operating characteristic (ROC) curve
(99.07%), respectively.
In the future, this proposed framework can be used to identify the probability of disease in
patients and probably patients at earlier stages. Based on the inclination of the disease different
categories of patients can be advised to make appropriate changes to their lifestyle (Diet Plans
and Physical Exercise Charts). Additionally, mobile applications can be developed to help
healthcare providers to detect and predict TII2M disease. It will be also useful for end users in
reducing the complications of diabetes at earlier stages and avoiding hospital readmissions,
unnecessary medical check-ups, and regularity of visiting clinical labs, etc.”(Ganie and Malik,
2022)

“Accurate Prediction of Type 1 and Type 2 Diabetes. In this study, the PIMA dataset that is
hosted at https://www.kaggle.com/johndasilva/diabetes is used for predicting diabetes. This
dataset has the details of 2000 people, of whom more than 900 have diabetes. The data
set is a balanced dataset and hence there is no gap between the count of diabetic and non-
diabetic people. The pre-processing stage focuses on removing outliers, removing missing
values, standardizing the data and splitting the dataset into training and testing sets. Five types
of classification algorithms were used in the study. The experimental results reveal that
Decision Tree with Glucose and BMI as the important features achieves the highest accuracy
of 98% for the training set and 99% for the testing set.
For future, this work can be extended and improved by using ensemble strategies for the
automation of diabetes analysis.”(Menaka et al., 2022)

“Prediction of Diabetes Empowered with Fused Machine Learning. The dataset used in this
research is taken from the UCI Machine Learning Repository. Data is cleaned, normalized, and
divided into training and test datasets during the pre-processing stage. In this proposed model,
only two widely used ML algorithms (SVMs and ANNs) are used. The proposed fuzzy decision
system has achieved the accuracy of 94.87, which is higher than the other existing
systems.”(Ahmed et al., 2022)

“Diabetes mellitus prediction and diagnosis from a data preprocessing and machine learning
perspective. PIMA Indian dataset and the laboratory of the Medical City Hospital (LMCH)
diabetes dataset were used for this study. As part of data pre-processing, the framework includes
the adoption of Spearman correlation and polynomial regression for feature selection and
missing value imputation, respectively, from a perspective that strengthens their performances.
The study proposes different supervised machine learning models for classification: a random
forest (RF) model, a support vector machine (SVM) model, and a designed twice-growth deep
neural network (2GDNN) model. Through experiments on the
PIMA Indian and LMCH diabetes datasets, precision, sensitivity, F1-score, train-accuracy, and
test-accuracy scores of 97.34%, 97.24%, 97.26%, 99.01%, 97.25 and 97.28%, 97.33%, 97.27%,
99.57%, 97.33, are achieved with the proposed 2GDNN model, respectively.”(Olisah et al.,
2022)

“Data-Driven Machine-Learning Methods for Diabetes Risk Prediction. The Kaggle data set
https://www.kaggle.com/datasets/andrewmvd/early-diabetes-classification is used for this study.
No specific processing was performed on this dataset as there were no missing and extreme
values. The number of participants is 520 and all the attributes (16 as input to machine-learning
models and 1 for the target class) are analyzed. SMOTE method, based on a 5-NN classifier,
was used to create synthetic data based on 60% of the minority class, i.e., Non-Diabetes, such
that the instances in the two classes are equally distributed (i.e., 50%–50%). In this research
work, various ML models, such as BayesNet, NB, SVM, LR, ANN, KNN, J48, LMT, RF, RT,
RepTree, RotF, AdaBoostM1 and SGD and Ensemble method (Stacking), are evaluated in
terms of the accuracy, precision, recall, F-measure and AUC. After applying SMOTE with 10-
fold cross-validation, the Random Forest and KNN outperformed the other models with an
accuracy of 98.59%. Similarly, applying SMOTE with a percentage split (80:20), the Random
Forest and KNN outperformed the other models with an accuracy of 99.22%.
In future work, the aim is to extend the machine-learning framework through the use of
deep-learning methods by applying a Long-Short-Term-Memory (LSTM) algorithm and
Convolutional Neural Networks (CNN) to the same dataset and comparing the results in terms
of accuracy with relevant published works.” (Barmparis et al., 2022)

Likewise, the following studies were done in the year 2021:

“Comparative performance analysis of quantum machine learning with deep learning for
diabetes prediction. PIDD (developed by the National Institute of Diabetes and Digestive and
Kidney Diseases) has been employed for this study. After performing EDA, it has been
observed that the PIDD contains many missing values, outliers, and the values of attributes
are also not normalized. Therefore, the present investigation utilizes outlier rejection (OR),
filling missing values (MV), and normalization (N) in pre-processing. Based on the features
present in the dataset, two prediction models have been proposed by employing deep
learning (DL) and quantum machine learning (QML) techniques. The accuracy has been used
to evaluate the prediction capability of these developed models. The performance measures
such as precision, accuracy, recall, F1 score, specificity, balanced accuracy, false detection
rate, missed detection rate, and diagnostic odds ratio have been achieved as 0.90, 0.95, 0.95,
0.93, 0.95, 0.95, 0.03, 0.02, and 399.00 for the DL model, respectively. However, for QML, these
measures have been computed as 0.74, 0.86, 0.85, 0.79, 0.86, 0.86, 0.11, 0.05, and 35.89
respectively.
In the future, the developed DL model will be examined on other diabetes datasets to examine
the robustness of the model and a user-friendly web application will be developed. Moreover,
the proposed QML model needs to integrate with the deep learning framework which
may boost the performance against the developed models and state-of-the-art
techniques.” (Gupta et al., 2022)

“An Improved Artificial Neural Network Model for Effective Diabetes Prediction. The dataset
used in the study is obtained from the National Institute of Diabetes and Digestive and Kidney
Diseases [17]. The purpose behind this is to predict the disease considering some selected
diagnostic key attributes included in the dataset whether a person is a diabetes patient or not.
The dataset contains data only for female patients aged at least 21 years who are residents
of Arizona, USA. Noisy data have been eliminated before receiving the dataset from
the concerned authority. The algorithms used are artificial backpropagation neural network
(ABPNN), ABP-SCGNN framework, ABP-SCGNN. The proposed ABP-SCGNN framework
is effective and efficient, with a 93% success ratio when simulated with a test PIDD
dataset”.(Bukhari et al., 2021)

This research will apply dimensionality reduction techniques such as PCA to the features,
together with classification algorithms such as Random Forest, Naive Bayes and Decision Tree,
in order to create ensemble models to predict diabetes.

4. Aim and Objectives

The main aim of this research is to propose a model to predict diabetes based on different
risk factors. A review of recently published papers on diabetes prediction showed that most of
the models were built on small datasets of about 1,000 records drawn from a single region or
country, and several papers name larger datasets and ensemble approaches as future scope. The
proposed model will predict diabetes on a large dataset using an ensemble approach.

The research objectives, formulated from the aim of this study, are as follows:

• To analyze the patterns and relationships between the risk factors of diabetes using a
dimensionality reduction technique on a large dataset.
• To remove outliers, handle corrupted and missing values, standardize the data and split
the dataset into training and testing sets.
• To compare the predictive ensemble models in order to identify the most accurate
model for predicting diabetes.
• To evaluate the performance of the classifiers.

5. Significance of the Study
This study will help in building a model with higher accuracy to recognize diabetic patients.
The model will be more robust because it is built on data from people of different ethnicities
rather than people from a specific country or region.

6. Scope of the Study


The scope of the study is to predict diabetes based on the different risk factors on a large dataset
using ensemble models.

7. Research Methodology

A review of recently published papers on diabetes prediction showed that most of the models
were built on small datasets of about 1,000 records drawn from a single region or country.
Several papers name larger datasets and ensemble approaches as future scope. The dataset used
for this study has more than 117,000 records covering people of different ethnicities.
The methodology involves key processes such as selecting the target data, pre-processing the
chosen data, applying a dimensionality reduction technique such as PCA, implementing
supervised learning techniques and evaluating the machine learning performance using
evaluation measures.

7.1 Dataset Description


The training dataset used for the study is from Kaggle (Diabetes2 | Kaggle, 2023a), as is
the testing dataset (Diabetes2 | Kaggle, 2023b). The training dataset has ~117000
records and 180 columns, including the target column. The testing dataset has ~13000
records and 179 columns, excluding the target column.

7.2 Data Preprocessing


The quality of the dataset was improved using pre-processing techniques such as removing
outliers, handling corrupted and missing values, and standardizing the data.
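
The three pre-processing steps named above can be sketched as follows. This is a minimal illustration using NumPy, not the study's actual pipeline: the median imputation, the 1.5 x IQR outlier rule and the helper name preprocess are assumptions chosen for the example, since the proposal does not specify them.

```python
import numpy as np


def preprocess(X):
    """Impute missing values, drop outlier rows and standardize columns.

    Hypothetical helper: the imputation and outlier rules are assumptions.
    Returns the cleaned array and a boolean mask of the rows that were kept.
    """
    X = np.asarray(X, dtype=float).copy()

    # 1. Missing values: replace NaN with the column median.
    medians = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]

    # 2. Outliers: keep rows whose every feature lies within 1.5 * IQR
    #    of the first and third quartiles.
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    keep = np.all((X >= q1 - 1.5 * iqr) & (X <= q3 + 1.5 * iqr), axis=1)
    X = X[keep]

    # 3. Standardization: zero mean and unit variance per column.
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    X = (X - X.mean(axis=0)) / std
    return X, keep


# Toy data: one missing value and one extreme value in the first column.
raw = [[1, 10], [2, np.nan], [3, 12], [4, 11], [100, 13]]
clean, kept = preprocess(raw)
print(clean.shape, kept)
```

In a real experiment the medians, quartiles, means and standard deviations would be computed on the training set only and then reused on the testing set, so that no information leaks from the test data. With 180 columns, a rule that drops a row when any single feature is extreme may discard many records, so a per-feature or clipping rule may be preferable.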

7.3 Modelling Techniques


We plan to implement five ensemble models to predict diabetes: PCA-Logistic Regression,
PCA-Naive Bayes, PCA-Decision Tree, CNN-LSTM and PCA-RF.
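
One possible way to build the PCA-based models is to chain scaling, PCA and a classifier in a scikit-learn pipeline. The sketch below is an illustration under stated assumptions: it uses synthetic data from make_classification as a stand-in for the Kaggle dataset, keeps enough principal components to explain 95% of the variance, and uses default hyperparameters. None of these choices comes from the proposal. CNN-LSTM is omitted because it requires a deep-learning framework.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the diabetes data (assumption: numeric features).
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

classifiers = {
    "PCA-Logistic Regression": LogisticRegression(max_iter=1000),
    "PCA-Naive Bayes": GaussianNB(),
    "PCA-Decision Tree": DecisionTreeClassifier(random_state=0),
    "PCA-RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    # Scale, project onto the components explaining 95% of the variance,
    # then classify. The pipeline fits the scaler and PCA on training data only.
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=0.95, svd_solver="full"),
        clf,
    )
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out data

for name, acc in scores.items():
    print(f"{name}: accuracy = {acc:.3f}")
```

Wrapping the steps in a pipeline keeps the scaler and the PCA projection tied to the training split, which avoids leaking test-set statistics into the model.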

7.4 Evaluation Metrics
Accuracy and other metrics such as the receiver operating characteristic (ROC) curve, the
area under the curve (AUC), the confusion matrix, sensitivity and specificity will be used to
evaluate and compare the prediction capability and performance of the developed models.

Accuracy = Correctly Predicted Labels / Total Number of Labels

Sensitivity = Number of actual Yeses correctly predicted / Total number of actual Yeses

Specificity = Number of actual Nos correctly predicted / Total number of actual Nos

ROC Curve - It shows the tradeoff between the True Positive Rate (TPR) and the False
Positive Rate (FPR). TPR is the same quantity as Sensitivity, and FPR equals 1 - Specificity.
True Positive Rate (TPR) = True Positives / (True Positives + False Negatives)
False Positive Rate (FPR) = False Positives / (True Negatives + False Positives)
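
As a worked illustration of these definitions, the sketch below counts the four confusion-matrix cells and derives each metric from them. The function names and the toy labels are hypothetical; labels are assumed to be 1 for diabetic ("Yes") and 0 for non-diabetic ("No"), with both classes present.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = Yes)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity (TPR), specificity and FPR."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # same as TPR
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),          # equals 1 - specificity
    }


# Toy example: 4 actual Yeses and 6 actual Nos.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
for name, value in classification_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.3f}")
```

In this toy example 3 of the 4 actual Yeses and 4 of the 6 actual Nos are predicted correctly, so accuracy is 7/10, sensitivity is 3/4 and specificity is 4/6.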

AUC - The area under the curve (AUC) of a ROC curve is used to determine how good the
model is. If the ROC curve lies close to the upper-left corner of the graph, the model
separates the two classes well; if it lies close to the 45-degree diagonal, the model is
almost completely random. An AUC of 1 corresponds to a perfect classifier and an AUC of 0.5
to random guessing, so the larger the AUC, the better the model.
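
To make the link between the ROC curve and the AUC concrete, the sketch below builds the ROC points by sweeping the decision threshold over the predicted scores and then integrates them with the trapezoidal rule. The function names and the toy scores are hypothetical, and both classes are assumed to be present; in practice a library routine such as scikit-learn's roc_auc_score would normally be used instead.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit samples from the highest to the lowest score.
    ordered = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    i = 0
    while i < len(ordered):
        j = i
        # Samples with tied scores cross the threshold together.
        while j < len(ordered) and ordered[j][0] == ordered[i][0]:
            if ordered[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class
curve = roc_points(y_true, scores)
print(curve)
print(auc(curve))
```

For these toy scores one negative sample is ranked above one positive sample, which pulls the curve away from the upper-left corner and gives an AUC of 0.75.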

Figure 1 Proposed data flow diagram for diabetes prediction

8. Requirements Resources

Software Required: Anaconda 3.0, Jupyter Notebook, Matplotlib, scikit-learn, seaborn, NumPy, Google
Colab

Hardware Required: System with GPUs

9. Research Plan

Project Plan

ACTIVITY                                                         PLAN START  PLAN DURATION  ACTUAL START  ACTUAL DURATION  PERCENT COMPLETE
Research Overview & Research Interest Form submission            1           4              1             4                100%
Literature search                                                4           6              4             6                100%
Literature review                                                6           8              6             8                100%
Project Topic Approval                                           8           11             8             11               100%
Research Proposal Writing and Submission                         11          16             11            17               100%
Interim Report Template and Effective Thesis Writing             16          17             17            0                85%
Writing Literature                                               17          19             0             0                50%
Interim report submission                                        19          22             0             0                60%
Design Model                                                     22          24             0             0                75%
Develop & train model                                            24          26             0             0                100%
Test Model                                                       26          27             0             0                60%
Review statistical tests                                         27          28             0             0                0%
Effective Final Thesis Writing & Video Presentation              28          29             0             0                50%
Video Presentation Deadline + Final Deadline + Dissertation End  29          30             0             0                0%

References

Ahmed, U., Issa, G.F., Khan, M.A., Aftab, S., Khan, M.F., Said, R.A.T., Ghazal, T.M. and
Ahmad, M., (2022) Prediction of Diabetes Empowered With Fused Machine Learning. IEEE
Access, 10, pp.8529–8538.
Anon (2023a) Diabetes2 | Kaggle. [online] Available at:
https://www.kaggle.com/datasets/leonardocadasrocha/diabetes2?select=Dados_Treino.csv
[Accessed 19 Feb. 2023].
Anon (2023b) Diabetes2 | Kaggle. [online] Available at:
https://www.kaggle.com/datasets/leonardocadasrocha/diabetes2?select=Dados_Teste.csv
[Accessed 19 Feb. 2023].
Anon (2023a) IEEE Xplore Full-Text PDF: [online] Available at:
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9694592 [Accessed 18 Feb. 2023].
Anon (2023b) IEEE Xplore Full-Text PDF: [online] Available at:
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9694592 [Accessed 12 Feb. 2023].
Barmparis, G.D., Marketou, M.E., Tsironis, G.P., Dritsas, E. and Trigka, M., (2022) Data-Driven
Machine-Learning Methods for Diabetes Risk Prediction. Sensors, [online] 22(14), p.5304.
Available at: https://www.mdpi.com/1424-8220/22/14/5304/htm [Accessed 13 Feb. 2023].
Bukhari, M.M., Alkhamees, B.F., Hussain, S., Gumaei, A., Assiri, A. and Ullah, S.S., (2021) An
Improved Artificial Neural Network Model for Effective Diabetes Prediction. Complexity, 2021.
Ganie, S.M. and Malik, M.B., (2022) An ensemble Machine Learning approach for predicting
Type-II diabetes mellitus based on lifestyle indicators. Healthcare Analytics, 2, p.100092.
Gupta, H., Varshney, H., Kumar Sharma, T., Pachauri, N. and Prakash Verma, O., (2022)
Comparative performance analysis of quantum machine learning with deep learning for
diabetes prediction. Complex & Intelligent Systems, [online] 8, pp.3073–3087. Available at:
https://doi.org/10.1007/s40747-021-00398-7 [Accessed 12 Feb. 2023].
Krishnamoorthi, R., Joshi, S., Almarzouki, H.Z., Shukla, P.K., Rizwan, A., Kalpana, C. and
Tiwari, B., (2022) A Novel Diabetes Healthcare Disease Prediction Framework Using Machine
Learning Techniques. Journal of Healthcare Engineering, 2022.
Menaka, V., Likitha, V., Shravya, M. and Pari, R., (2022) Accurate Prediction of Type 1 and
Type 2 Diabetes. Proceedings of the 2022 3rd International Conference on Intelligent
Computing, Instrumentation and Control Technologies: Computational Intelligence for Smart
Systems, ICICICT 2022, pp.1117–1121.
Mohammad Ganie, S. and Bashir Malik, M., (2022) Comparative analysis of various supervised
machine learning algorithms for the early prediction of type-II diabetes mellitus. Int. J. Medical
Engineering and Informatics, [online] 14(6), pp.473–483. Available at:
https://www.researchgate.net/publication/349858232 [Accessed 12 Feb. 2023].
Olisah, C.C., Smith, L. and Smith, M., (2022) Diabetes mellitus prediction and diagnosis from a
data preprocessing and machine learning perspective. Computer Methods and Programs in
Biomedicine, 220.
