
Healthcare Analytics 4 (2023) 100273

Contents lists available at ScienceDirect

Healthcare Analytics
journal homepage: www.elsevier.com/locate/health

A risk assessment and prediction framework for diabetes mellitus using machine learning algorithms
Salliah Shafi Bhat a, *, Madhina Banu a, Gufran Ahmad Ansari b, Venkatesan Selvam a
a B. S. Abdur Rahman Crescent Institute of Science and Technology, Chennai, 48, India
b Department of Computer Science, Dr. Vishwanath Karad MIT World Peace University, Pune, 411 038, India

ARTICLE INFO

Handling Editor: Madijd Tavana

Keywords: Machine learning; Diabetes mellitus; Logistic regression; Gradient boost; Decision tree; Prediction

ABSTRACT

Diabetes seriously threatens people's health and is becoming increasingly common. Diabetes Mellitus (DM) is a condition characterized by high blood sugar levels and associated with inactivity, unhealthy eating, being overweight, and other factors. This research article analyzes and examines various risk prediction models and algorithms for diabetes, including Type 1, Type 2, and Gestational Diabetes. The study develops several Machine Learning (ML) models for predicting diabetes using various datasets. The process involves producing highly informative features, known as Feature Engineering (FE). We used the Pima Indian Diabetes Dataset (PIDD) to experiment with and examine the effectiveness of ML models in predicting diabetes. Using Python, we applied three classification algorithms, Logistic Regression, Gradient Boost, and Decision Tree, combined with feature selection techniques. Among the classification techniques, Decision Tree has the highest accuracy (91 %), precision (96 %), recall (92 %), and F1 score (94 %).

1. Introduction

The incidence of diabetes mellitus, an autoimmune illness that can be brought on by a variety of inherited, environmental, dietary, and other variables, is increasing rapidly in comparison to other diseases in the world today. A chronic condition is a sickness or illness that persists over time or has long-lasting symptoms [1]. Such disorders have a serious negative effect on lifestyle. Diabetes is one of the most severe and widespread of these diseases. Chronic illness is an important component in adult fatalities all around the world [2]. In disease-related mortality, diabetes is now the seventh most fatal disease. Diabetes can cause a variety of issues, including an elevated risk of coronary artery disease and stroke [3]. Unfortunately, there is no known cure for these diseases; the only option is to control blood glucose levels. Worldwide, 8.8 % of adults have diabetes in 2023, and by 2045 that number is expected to increase to 9.9 % [4]. There are currently 422 million diabetics worldwide, and the number of patients is growing faster in low- and middle-income countries than in high-income areas. Blood glucose levels rise in diabetes mellitus patients, which can be attributed to either decreased physiologic effects of insulin or faulty insulin secretion. If left untreated, it can harm the nervous system, blood vessels, eyes, heart, and kidneys, and can lead to stroke and lower limb loss, ultimately leading to death [5]. Implementing early preventative measures and treatments can lower the death ratio if this dangerous disease is identified early. Machine learning algorithms have been used to forecast diabetes mellitus, in contrast with the conventional diagnosis process. Using information from people's everyday physical examinations can help to develop a preliminary assessment and act as a resource for doctors [6]. Owing to the rising occurrence of diabetes mellitus, a number of machine learning algorithms, including Random Forests (RF), Regression Models, and various ensembles [7], have been proposed for the early prediction of diabetes mellitus. In addition to model architecture and optimization, feature engineering is a crucial component for improving efficiency and boosting classification accuracy. For this goal, many feature selection strategies have been applied in the past. Feature extraction, Support Vector Machine (SVM), Decision Tree (DT), principal component analysis (PCA), and linear discriminant analysis (LDA) have all been used, for instance [8]. The three main kinds of diabetes are Type 1 diabetes, Type 2 diabetes, and gestational diabetes. Type 1 diabetes is the most common form and is known as an aetiological type. Its cases are attributed to an auto-immune process or are ketoacidosis-prone and have no established pathogenesis or

* Corresponding author.
E-mail addresses: Salliahshafi678@gmail.com (S.S. Bhat), madhina@crescent.education (M. Banu), gufran.ansari@mitwpu.edu.in (G.A. Ansari),
selvamvenkatesan@gmail.com (V. Selvam).

https://doi.org/10.1016/j.health.2023.100273
Received 25 March 2023; Received in revised form 5 September 2023; Accepted 11 October 2023
Available online 23 October 2023
2772-4425/© 2023 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
S.S. Bhat et al. Healthcare Analytics 4 (2023) 100273

pathophysiology [9]. Type 2 diabetes mellitus can be predicted, prevented, and reported on the basis of lifestyle factors, gender, and family history. It typically develops later in life [10]. Women who have gestational diabetes during pregnancy are more likely to experience difficulties affecting both the mother and the unborn child [11]. Diabetes needs proper, fast molecular diagnoses that can analyse disease risk, anticipate the disease, and quickly identify individuals and family members who are at higher risk. Specific diabetes is defined as diabetes arising from gene mutations or associated genetic problems. Predictive molecular/genetic testing and preventive care can be critical in such circumstances. Amputation, Charcot joints, autonomic dysfunction, including infertility, and nephropathy linked to renal failure are among the progressive symptoms of diabetes mellitus. Other implications include the advanced progression of retinopathy problems that might cause blindness [12]. The risk of cardiovascular, peripheral vascular, and cerebrovascular illness is rising in people with diabetes, so this research examines the many risk prediction models currently in use and suggests an appropriate model for the prediction of diabetes mellitus. Age, changes in lifestyle and eating habits, and rapidly changing socioeconomic factors such as access to medical facilities all affect the risk and progression of diabetes-related disorders. Other risk factors for diabetes complications include high blood pressure, higher glucose levels, elevated blood lipids, obesity, and being overweight [13]. It is necessary to investigate machine learning algorithms for better prediction of diabetes so that complications can be avoided and preventative measures can be taken in advance. Moreover, machine learning algorithms can be thoroughly investigated to support healthcare governance and resources for improved patient health services. This will immediately benefit patient groups, practitioners, telehealth systems, healthcare providers, and hospital management. In this research, we want to use machine learning algorithms to improve the model for predicting diabetes mellitus. In addition, considering the significance of the application, we want to increase the model's accuracy and other performance metrics for predicting the progression of diabetes. In recent research, machine learning has been successful in predicting many chronic illnesses such as diabetes and has shown good results across a number of statistical metrics. Early detection of diabetes mellitus is important, and using machine learning methods to increase accuracy is essential [14,15].

The major contributions of this paper are as follows. The data set used for this study has been collected from the UCI repository. The accuracy, precision, recall, F1 score, and AUC-ROC of several supervised machine learning algorithms are analyzed. A flow diagram for risk assessment of diabetes mellitus prediction using machine learning algorithms was created, and different statistical metrics have been used to assess the performance of these classifiers. Data pre-processing is applied to enhance dataset feature evaluation, and results before and after pre-processing are compared. Moreover, we contrast our suggested approach with the most recent techniques that used the same experimental setup, datasets, and performance evaluations.

The rest of the paper is structured as follows: Section 2 discusses related work. The details of the research methodology and dataset are presented in Section 3. The experiment, results, and discussion are presented in Section 4. Finally, Section 5 describes the conclusion and future work.

2. Related work

This section reviews previous research on risk assessment and prediction frameworks for diabetes mellitus using machine learning algorithms. Many ML techniques are used in the medical sector to detect and forecast disorders. Diabetes is one such illness, for which ML techniques are used to find the most effective treatments. Machine learning techniques are applied in nearly every field of life to address practical issues because of their promise to produce outcomes that are regular, trustworthy, and accurate. Many research investigations have used machine learning algorithms to predict diseases [16,17]. To support future research in the diagnosis of diabetes mellitus, scientists have examined various datasets, algorithms, and techniques. Some key works from the literature are discussed below.

In one paper, an SVM with a Radial Basis Function (RBF) kernel was used, and zero values were replaced with the mean, to achieve an accuracy of 74.80 %. Three features out of eight were recovered by the authors in Ref. [18]. The author then used Linear Discriminant Analysis (LDA) to classify the data and obtained a 74.40 % accuracy rate. A different strategy [19] was to remove the entries with zero values, reducing the number of entries from 768 to 460. These 460 entries yielded an accuracy of 75.5 % after 200 were used for training and the remainder for testing. Other authors used SVM in a Feed Forward Neural Network to classify the diabetic dataset, obtaining an accuracy of 75.65 %, and Linear Discriminant Analysis (LDA) to select two features out of eight [20]. An ANN approach reached an accuracy of 76 % when classifying the diabetic dataset using Automatic Identification System techniques. In another study, the authors used the naïve Bayes and decision tree methods to categorise diabetes after using the correlation-based feature selection technique to choose two of the eight available features. They also used the average to fill in the missing values, reporting a 74.79 % accuracy [21]. Multi-Layer Perceptron (MLP) and Bayes Net classifiers were used, and the results showed an accuracy of 81.19 %. The J48 classification algorithm was used and 76.58 % accuracy was attained. The J48, NB, LR, and Random Forest classification algorithms were also employed, yielding an accuracy of 80.43 % [22]. When the Gaussian Process Classifier (GPC) and Naive Bayes classification algorithms were applied to classify diabetes patients, GPC provided the highest accuracy, around 82 %, utilizing the radial basis kernel [23]. A Hierarchical Multi-Level Classifier with Multi-Objective Voting Technique (HM-Bag) was introduced for classification and compared with the Naïve Bayes, support vector machine, logistic regression, Quadratic Discriminant Analysis, k nearest neighbor, and Artificial Neural Network classifiers. The authors found that HM-Bag provided the highest accuracy, 77.21 %. SVM, Naive Bayes, Random Forest, AdaBoost, and case-based reasoning are a few examples of notable methodologies that are improving the accuracy of diabetes prediction. Almost all of the solutions, including Complicated Valued Neural Networks, Spiral Premise Work, Genuine Valued Neural Networks, and Decision Tree, are also mentioned along with their drawbacks [24].

Using machine learning techniques and algorithms including NB, LR, and GB, a model was proposed for future diabetes and its risk level using the PIMA diabetes dataset [25]. The Boruta technique was used to determine how closely the attributes are related. Among all classifiers, GB achieved the greatest accuracy, 86 %. A Probabilistic Neural Network (PNN) was employed to predict diabetic illnesses. The algorithm was applied to the "PIMA Indian dataset", and pre-processing was not used by the author. The dataset was reportedly separated into 10 % for the standard setting and 90 % for the training set, and the proposed technique achieved accuracies of 81.49 % and 89.56 % on the testing and training data, respectively. Ref. [26] introduced a neural network that predicts outcomes based on blood glucose levels, streamlining the estimation of uncertainty during the prediction process. The dataset in use is a Type 1 diabetes dataset that takes the sugar level into account [27]. The methodology is assessed using the root-mean-square error (RMSE) metric and the surveillance-error-grid (SEG) technique. Table 1 summarizes the related work. One study used eleven machine learning techniques (MLT) to construct a model for predicting the development of diabetic illness. The support vector machine produced the best results of all the classifiers taken into consideration, with an accuracy rate of 83.49 % using the UCI dataset. For other metrics, the method improved accuracy and decreased the prediction error rate. To validate the suggested framework, various metrics are examined. The authors proposed using a hybrid technique to expand on the previous research for more accurate prediction [28] and suggested a system for diabetic illness prediction utilizing techniques


based on machine learning. For other metrics, the method improved accuracy and decreased the prediction error rate. To validate the suggested framework, various metrics are examined. The authors proposed using a hybrid technique to expand on the previous research for more accurate prediction [29]. Another work suggested a system for diabetic illness prediction utilizing techniques based on machine learning. The author constructed a model for predicting diabetes by contrasting various machine learning algorithms. Comparative and performance assessments were conducted on seven machine learning classifiers. The hybrid random forest with linear model generated the best results of all classifiers, with an accuracy rate of 88.4 %. The output of the suggested model was obtained without the use of any data preparation methods. The authors recommended that massive datasets and a variety of machine learning algorithms could be used in future work. The great majority of the previous research did not fully make use of data pre-processing before developing machine learning algorithms. Inadequate outputs were the result [30].

Another study investigated the use of a variety of machine learning models for diabetes diagnosis prediction, including conventional models like support vector machine, random forest, and decision tree, as well as ensemble models centered on SVM rule extraction. The precision scores were shown to be higher when this method was combined with a random forest classifier than for the base random forest and support vector models (89.6 % compared with 81.2 % and 88.4 %, respectively). On the other hand, the recall scores of the ensemble classifier (44.3 %) were lower than those of the base random forest model (49.0 %), but still higher than those of the base support vector model (40.0 %). This was due to the ensemble's use of support vector models and random forest rules, which were less complex than those in the original random forest model. The authors ultimately concluded that the support vector + random forest ensemble was still preferable to both base models [31].

A framework for disease prediction in healthcare has also been developed using machine learning approaches. The Pima Indian Diabetes Dataset was subjected to a machine learning approach in the study. According to the outcomes, logistic regression outperformed the other machine learning algorithms. It was made clear that glucose and BMI are related in terms of diabetes condition. The study's use of a structured dataset rather than an unstructured dataset for testing has some drawbacks. It is proposed that the techniques utilized in the paper can be readily transferred to other medical facilities in order to forecast different diseases. Based on the existing research, it is reasonable to conclude that human behavior and medical information can be investigated for early risk assessment for diabetes mellitus prediction. Because the prediction can be made sooner to prevent the onset of disease, the approach can also benefit current and future patients. According to research, there are 232 million [32] people worldwide who do not even know that they have diabetes as a result of ignorance and an underfunded healthcare system. Such technical assistance would be very helpful if made accessible to the wider society. For comparison and performance experiments, seven machine learning classifiers were taken into consideration. With an accuracy of 88.4 %, the hybrid random forest with a linear model outperformed all other classifiers. No data preparation method was applied to enhance the output of the suggested model. The authors indicated that the next step in this research might be to use massive datasets and a variety of machine learning techniques. The vast majority of the previous studies did not fully make use of data pre-processing before creating the machine learning models. Inadequate outputs were the result. We therefore felt the need to use exploratory data analysis to increase the data quality needed for the prediction model. Furthermore, although these techniques are essential for achieving improved prediction performance, data normalization and standardisation were missing from the majority of the research that has already been published.

Table 1
A review of existing research on diabetes detection.

Ref    Algorithms                Data Set          Reported Accuracy
[33]   DT, LR, SVM               PIMA data set     79.86 % in SVM
[34]   DT, Gradient Boost        Clinical dataset  82.6 % in Gradient Boost
[35]   DT, LR, SVM               PIDD set          89.7 % in DT
[36]   KNN, SV, DT, NB           PIMA data set     81.7 % in NB
[37]   RF, DT, NB                PIMA data set     83.5 % in RF
[38]   Gradient Boost, NB, SVM   PIMA data set     84.9 % in SVM
[39]   LR, SVM, KNN, DT          Clinical dataset  89.9 % in LR

3. Methodology

3.1. Proposed methodology for risk assessment of diabetes mellitus prediction using machine learning algorithms

The technique used for this experimental study is shown in Fig. 1. It outlines the steps that must be taken in order to use machine learning algorithms for early diabetes mellitus risk assessment and prediction. For the experiment design, the publicly available Pima Indian diabetes dataset was loaded into the web-based Jupyter Notebook (an open-source platform). Using the Python programming language, the necessary library packages are imported from Sklearn. The classifiers are initially used to predict the disease without any data preparation. During exploratory data analysis, we observed that preparing data can be essential for achieving better outcomes. Data imputation is used in the pre-processing phase to identify missing values and handle duplicates. The interquartile range approach is utilized to find and replace outliers in the dataset. Additional programs are also executed to check for information loss in the dataset, if any. The dataset is divided into two parts, with 30 % used to test the models and 70 % used to train them. After data pre-processing, the Decision Tree algorithm is finally applied to obtain the best outcomes.

3.1.1. K-fold cross validation
Fig. 2 illustrates the data splitting (10-fold cross-validation) used in the present research. The k-fold cross-validation procedure has been applied with a k value of 10. Ten partitions of equal size were created at random from the full dataset. In each iteration, one of the ten partitions was kept as the validation (test) set, while the remaining nine partitions served as training data. Each of the 10 partitions was used as the validation data exactly once over the ten iterations. The results of all iterations were combined by averaging. Because performance is compared on both training and testing data, the problems of overfitting and underfitting are reduced. The advantage of this strategy is that it reduces data bias, enabling the ML models to produce accurate outcomes. Each bin of testing data has been used exactly once to validate results, and all data samples have been used for both training and testing.

3.2. Diabetes dataset

The experiment made use of a popular diabetes dataset from the machine learning repository at the University of California, Irvine. The data collection contains 728 records, 528 of which are diabetic and 200 of which are not. There are 9 attributes in the data set, as depicted in Table 2. The table lists the attributes that were taken into consideration, a description of each attribute, its unit of measurement, and its value range [40].

3.2.1. Data set description
Data pre-processing was applied to the Pima Indian Data Set of 768 patients from the University of California [41]. Descriptive statistics are essential for identifying data properties. They organize data to make analysis easier. Table 3 describes the Pima Indian Diabetes Dataset attributes and statistical data, including the count of records, Minimum


Fig. 1. Proposed methodology for risk assessment of diabetes mellitus prediction.
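Two steps of the workflow in Fig. 1, mean imputation of missing values and the 70 %/30 % train/test split, can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the glucose readings, the use of `None` as the missing-value marker, the seed, and the function names are assumptions made for the example. In a scikit-learn workflow, `sklearn.model_selection.train_test_split` would typically perform the split.

```python
import random
from statistics import mean


def impute_with_mean(column):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    fill_value = mean(observed)
    return [fill_value if v is None else v for v in column]


def split_70_30(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and hold out `test_fraction` of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical glucose readings with two missing values.
glucose = [148, None, 183, 89, None, 116]
glucose_filled = impute_with_mean(glucose)

# Stand-in for 768 patient records (row indices only).
train_rows, test_rows = split_70_30(range(768))
print(glucose_filled)
print(len(train_rows), len(test_rows))
```

Fixing the seed keeps the split reproducible, so that results before and after pre-processing can be compared on the same held-out rows.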

Fig. 2. K-Fold Cross Validation.
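The 10-fold procedure of Fig. 2 can be sketched as an index-splitting routine. This is an illustrative standard-library version under stated assumptions: `k_fold_splits`, `cross_validate`, and the placeholder scorer `dummy_evaluate` are hypothetical names, and a real scorer would fit a classifier on the training indices and return its accuracy on the validation indices. With scikit-learn, `KFold` together with `cross_val_score` would usually take this role.

```python
import random


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding gives k random partitions whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(evaluate, n_samples, k=10):
    """Average the score returned by evaluate(train, validation) over k folds."""
    scores = [evaluate(train, val) for train, val in k_fold_splits(n_samples, k)]
    return sum(scores) / len(scores)


# Placeholder scorer (assumption): returns the training fraction of each fold.
def dummy_evaluate(train, validation):
    return len(train) / (len(train) + len(validation))


mean_score = cross_validate(dummy_evaluate, n_samples=768, k=10)
print(mean_score)
```

Every sample appears in exactly one validation fold and in the training set of the other nine folds, which is the property the k-fold description relies on.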

(min) and maximum (max) values, as well as mean and standard deviation (Std). For example, Pregnancies has a mean value of 4.034 and a standard deviation of 3.449, and the min and max numbers of pregnancies are 0 and 4, respectively. Such statistics are calculated for the remaining attributes as well [42].

3.3. Data pre-processing

The performance of a prediction model is determined by how well it predicts the future, which is considered to be an outcome of a number of variables. In other words, data isn't always accurate. In a dataset, there could be duplicate data features, inconsistent data features, noise, and/


or missing data. Before machine learning techniques are applied, the data must first be processed to reduce redundancy, inconsistency, noise, and missing values. We can therefore conclude that data pre-processing is a crucial step in creating a successful classification model. Duplicate records and incorrect data, if any, are eliminated during data pre-processing. A high proportion of missing data reduces classification accuracy. A statistical cleaner filter is used to evaluate features and discover missing values in the data without utilizing any classification techniques. This filter removes numerical data that is either too large or too small by converting it to a current value [43]. Furthermore, data cleaning is a critical step in the data pre-processing pipeline in machine learning that ensures that the dataset used to train or evaluate the model is correct, consistent, and relevant. Data cleaning helps to eliminate bias and unpredictability in the model, producing more accurate and dependable predictions [44]. The dataset was pre-processed with techniques such as resampling and discretization utilizing several statistics libraries, using the Spyder Integrated Development Environment and Python (3.9.1) as the programming language [45]. Missing values have been filled by averaging specific attribute values such as Pregnancy, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, Age, and Outcome. Boxplots were used to detect outliers, and the interquartile range technique was applied to replace each outlier with an acceptable sample value. Data transformation was conducted before creating the machine learning models to improve data efficiency.

Table 2
Attribute information of the dataset.

#N  Attributes                  Description                                                  Measurement                     Value range
1   Pregnancies                 How many times a person became pregnant                      Count                           0 to 3
2   Glucose                     Plasma glucose level after 2 h                               mg/dL                           0 to 199
3   Blood Pressure              An individual's BP                                           mmHg                            0 to 122
4   Skin Thickness              The triceps fold thickness                                   mm                              0 to 99
5   Insulin                     Patient blood insulin levels                                 μU/mL                           0 to 846
6   BMI                         Body mass index                                              kg/m²                           0 to 67
7   Diabetes Pedigree Function  It shows the capability that analyses diabetes disease risk  1 = diabetic, 0 = non-diabetic  0 to 2.45
8   Age                         Age of an individual                                         Years                           1 to 78
9   Outcome                     Class attribute                                              1 = diabetic, 0 = non-diabetic  0 or 1

Table 3
Data set description.

S. No.  Attributes                  Count  Mean    Std Dev  Min  Max
1       Pregnancies                 728    4.034   3.449    0    4
2       Glucose                     728    0.554   0.497    0    1
3       Blood Pressure              728    0.529   0.513    0    1
4       Skin Thickness              728    0.46    0.4999   0    1
5       Insulin                     728    0.623   0.483    0    1
6       BMI                         728    0.496   0.513    0    1
7       Diabetes Pedigree Function  728    0.242   0.428    0    1
8       Age                         728    48.228  12.299   16   90
9       Outcome                     728    0.520   0.4800   0    1

3.3.1. Class balance

If the dataset used for the problem statement is not balanced, machine learning algorithms produce poor results. If the target class is not evenly distributed, sampling approaches can be employed to create a balanced dataset. The experiment dataset contains two classes, where class 1 (528 instances) is diabetic and class 0 (200 instances) is non-diabetic, as depicted in Fig. 3. The Synthetic Minority Oversampling Technique (SMOTE), which is effective and popular, is used to balance classes of data with significant levels of imbalance in order to address a variety of practical difficulties. Working with unbalanced datasets makes it difficult to create models because they perform poorly on numerous statistical tests. To increase the prediction power of the framework in the present research, class balancing using the SMOTE method was done before the machine learning models were created. This method enlarges the minority class in the dataset by oversampling. It takes minority class instances at random, locates their k nearest minority class neighbors, and then picks one of the neighbors and generates a synthetic example on the line segment joining the two in feature space, that is, as a convex combination of the two chosen instances, say A and B, as shown in Fig. 4.

Fig. 3. Instances of outcome variables.

3.3.2. Histogram of data set
A histogram is used to show and evaluate the distribution of data samples. Histograms can be uniform, normal, left-skewed, or right-skewed. The histograms in Fig. 5 group the values of each attribute into ranges. The x-axis reflects the value range of the attribute, while the y-axis represents the number of samples in each range.

3.3.3. Box Plot of the data set
The boxplots for each attribute in the dataset are shown in Fig. 6. To manage the outliers in the dataset, the confidence interval range approach with the probability density function was used to generate boxplots for the attributes.
Fig. 7 depicts the CCA of all attributes used to predict disease, with correlation values ranging from +1 to −1 on the X- and Y-axes. Each cell value represents the degree of the link between the intersecting attributes. For example, the correlation between Pregnancy and Glucose is 0.39.

3.4. Applying machine learning algorithm

Almost every sector is exploring the use of machine learning algorithms (MLA) to tackle real-world issues. These models have significantly improved the ability to forecast, identify, diagnose, and prognosticate many diseases. In this research on risk prediction for diabetes mellitus, we consider the following


Fig. 4. Class balancing of data set before and after SMOTE.
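The SMOTE interpolation step, choosing a minority instance A, one of its k nearest minority neighbours B, and a random point on the segment between them, can be sketched as follows. This is a simplified standard-library illustration rather than the implementation used in the study: the function name `smote`, the (glucose, BMI) sample values, and the seed are assumptions. In practice the `SMOTE` class from the imbalanced-learn package would normally be used.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a minority instance A at random.
        i = rng.randrange(len(minority))
        a = minority[i]
        # Find its k nearest minority-class neighbours (excluding A itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(a, p))[:k]
        # Pick one neighbour B and interpolate: A + gap * (B - A), gap in [0, 1).
        b = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return synthetic


# Hypothetical minority-class samples as (glucose, BMI) pairs.
minority_samples = [
    (148.0, 33.6), (183.0, 23.3), (137.0, 43.1),
    (197.0, 30.5), (189.0, 30.1), (166.0, 25.8),
]
new_samples = smote(minority_samples, n_new=4, k=3)
print(new_samples)
```

Because each synthetic point is a convex combination of two real minority samples, it always lies within the range spanned by the existing minority class.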

Fig. 5. Histogram of attributes.

MLAs: Logistic Regression, Gradient Boost, and Decision Tree.

3.4.1. Logistic regression (LR)
It is a linear model used for classification rather than regression. It estimates the dependent variable by maximum likelihood. The main advantages of LR are its consistency and ability to handle nonlinear data [46]. Consider n attributes A1, A2, …, An, where q is the likelihood that the event will occur, and (1 − q)


Fig. 6. Box plot of attributes.
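The interquartile-range rule behind the boxplots can be sketched in a few lines. The paper does not say which value replaces an outlier, so clipping to the nearest fence is an assumption of this example, as are the 1.5 factor (the usual boxplot convention), the function names, and the sample insulin values.

```python
from statistics import quantiles


def iqr_fences(values, factor=1.5):
    """Return the lower and upper outlier fences Q1 - f*IQR and Q3 + f*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def clip_outliers(values, factor=1.5):
    """Replace values outside the fences with the nearest fence value."""
    low, high = iqr_fences(values, factor)
    return [min(max(v, low), high) for v in values]


# Hypothetical readings with one extreme value.
insulin = [10, 12, 11, 13, 12, 11, 10, 300]
cleaned = clip_outliers(insulin)
print(iqr_fences(insulin))
print(cleaned)
```

Clipping keeps the record count unchanged, which matters when the same rows must remain aligned with the Outcome labels.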

Fig. 7. Correlation between features.
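A correlation matrix like the one in Fig. 7 can be computed pairwise. The sketch below assumes the Pearson coefficient (the text does not name the coefficient used) and uses small hypothetical columns; `pearson` and `correlation_matrix` are illustrative names. With pandas, `DataFrame.corr()` would give the same kind of table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def correlation_matrix(columns):
    """Build a nested dict of pairwise correlations between named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical attribute columns for five patients.
data = {
    "Age": [21, 30, 45, 52, 60],
    "Glucose": [85, 99, 120, 140, 155],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
}
matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Each entry lies between −1 and +1, the diagonal equals 1, and the matrix is symmetric, which is why such figures are usually read from one triangle only.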

is the likelihood that it won't. The model is then given by

$$\log\left(\frac{q}{1-q}\right) = \beta_0 + \beta_1 A_1 + \cdots + \beta_n A_n \qquad (1)$$

in which β0 is the intercept and β1, …, βn are the regression coefficients.

3.4.2. Gradient Boost (GB)
Weak learners are trained sequentially, and all estimators are then combined into the final model through weighting. The GB method seeks to reduce the gap between the predicted and actual values by fitting each new learner to the residual errors of the earlier estimators [47].
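The residual-fitting idea can be illustrated with a deliberately small example. The sketch below boosts one-split decision stumps on a single feature under a squared-error loss, which is a simplification: boosting for classification normally follows the gradient of a log-loss, and the study would more likely rely on a library such as scikit-learn's `GradientBoostingClassifier`. The data, learning rate, number of rounds, and function names are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Fit stumps sequentially, each one to the current residual errors."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Hypothetical one-feature data with a noisy step in the target.
feature = [1, 2, 3, 4, 5, 6, 7, 8]
target = [0, 0, 0, 1, 0, 1, 1, 1]

baseline_mse = mean_squared_error(lambda x: sum(target) / len(target),
                                  feature, target)
boosted = gradient_boost(feature, target)
boosted_mse = mean_squared_error(boosted, feature, target)
print(baseline_mse, boosted_mse)
```

The learning rate scales each stump's contribution, so more rounds with a smaller rate trade training speed for a smoother fit.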


3.4.3. Decision tree (DT)

It is the most basic MLA and creates rules to identify and forecast the target variables. To construct the tree, DT chooses the root node, then descends to the leaf node to forecast the label [48]. The Gini index and Information Gain (IG) are the two primary methods used to locate the root node in DT [49]. In DT, the top node is chosen by default using IG.

4. Experiment, results, and discussion

In this section, the authors discuss the experimental details and outcomes obtained when using machine learning algorithms for the prediction of diabetes mellitus. After the implementation of the suggested methodology, all outcomes are shown and systematically evaluated. The evaluation uses performance criteria including accuracy, precision, recall, F1 score, the receiver operating characteristic (ROC) curve, and the confusion matrix for the machine learning algorithms. The testing accuracy of the machine learning algorithms before and after pre-processing is depicted in Fig. 8, which shows the effectiveness of diabetes mellitus risk assessment and prediction in terms of the accuracy of all classifiers before and after pre-processing. Before pre-processing, the accuracies are LR (79 %), Decision Tree (89 %), and GB (87 %). After applying pre-processing techniques, Decision Tree has the highest accuracy, 91 %, compared with LR and GB.

4.1. Equipment requirements and processing time

This research was carried out on an HP Z60 workstation. The system configuration is as follows: Windows 10 Pro 64-bit, an Intel XEON 2.4 GHz CPU with 12 cores, 4 GB of memory, and a 1 TB hard drive. On this machine, the execution times for the algorithms LR, DT, and GB were 4.23, 3.57, and 4.51 s, respectively. Software utilized for the implementation includes Python as the programming language, the web-based computing platform Jupyter Notebook, and the graphical user interface-based Anaconda Navigator.

4.2. Performance evaluations

After data preparation, the accuracy, precision, recall, and F1 score of the three classifiers under consideration were evaluated, as shown in Table 4.

Table 4
Classification performance on other metrics.

#N  Algorithms           Accuracy (%)  Precision (%)  Recall (%)  F1 score (%)
1   Logistic Regression  81            86             82          84
2   Decision Tree        91            96             92          94
3   Gradient Boost       89            94             90          92

The results were computed for both groups (0: no Diabetes Mellitus disease, 1: Diabetes Mellitus disease) in percentage terms, as shown in Fig. 9 and Fig. 10. Decision Tree performed best on all performance evaluation measures before pre-processing, whereas LR performed the worst.

4.3. Feature importance

Feature importance is a method for scoring input features (independent predictor variables) based on their contribution to the prediction of the dependent (target) variable. It contributes significantly to building machine learning models with a higher accuracy rate. In this research, the feature importance score (F-score) indicates the number of times an attribute is used for splitting during training [50]. If a characteristic (like Blood Pressure) has a higher F-score, it is considered to be a more significant trait. Fig. 11 displays the contribution of each characteristic to prediction, in decreasing order of F-score. For instance, Blood Pressure has the greatest predictive importance, whereas BMI has the lowest.

4.4. ROC curve

The ROC (receiver operating characteristic) curve was employed to demonstrate the predictive power of the machine learning algorithms under consideration at various thresholds. It plots the false-positive rate on the X-axis against the true-positive rate on the Y-axis. We evaluated the ability of our models to differentiate between classes (0: no diabetes disease and 1: diabetes disease) using the ROC curve. A larger area under the ROC curve indicates that the model is better at distinguishing between outcomes 0 and 1. The best measure of separability is when the model's AUC is

Fig. 8. Classification of accuracy.

8
S.S. Bhat et al. Healthcare Analytics 4 (2023) 100273

Fig. 9. Performance evaluation before pre-processing.

Fig. 10. Performance evaluation after pre-processing.

close to 1; the worst is when the AUC is close to 0, where the
predictions are effectively inverted. When the AUC equals 0.5, the model
cannot separate the classes at all. The ROC curves for LR, DT, and GB are
shown in Figs. 12–14, respectively. The research leads us to the
conclusion that DT performs better than GB and LR.

4.5. Confusion matrix

The confusion matrix is used to assess the performance of the MLAs by
identifying incorrectly classified patients in DM disease prediction. It
compares predicted values with actual values along four key dimensions:
True Positive (TP), True Negative (TN), False Positive (FP), and False
Negative (FN). Figs. 15–17 show the confusion matrices (training and
testing) of the classifiers, used to evaluate their performance in
Diabetes Mellitus prediction. The confusion matrices were used to assess
the different MLAs using statistical and machine learning measures.

4.6. Comparative analysis

The suggested method produced favourable outcomes on several assessment
metrics for predicting diabetes mellitus. Table 5 compares the
performance of our proposed framework with a number of pertinent studies
in terms of the methods used, the dataset, and accuracy. Our suggested
framework performed well on several evaluation metrics, especially
accuracy in predicting diabetes mellitus. It obtained better results than
earlier relevant research by employing pre-processing methods such as
data imputation to manage missing values and the Boxplot technique to
identify and replace outliers.

5. Conclusion and future work

This research efficiently predicted diabetic illness using MLA. To
enhance the dataset's prediction outcomes and quality evaluation,
several pre-processing techniques, including missing-value imputation
and cleaning procedures, have been used. Three different MLAs were also
used in this research, namely LR, DT, and GB. Several statistical and
machine learning measures were used to evaluate the experimental results.
According to the experimental findings, DT has the greatest accuracy rate
at 91 %. Moreover, DT produced the best results across all the metrics
considered: F1-score, Precision, Accuracy, Recall, and the confusion
matrix. The feature importance procedure was used to determine the
contribution of the independent features to the outcome. To improve its
efficacy, additional MLAs may be utilized alongside LR and GB in this
work. Other healthcare datasets with characteristics similar to those
used in this suggested technique may be utilized to extend the reach of
this research work. Deep learning methods may also be studied to predict
diabetes mellitus more accurately.

Authors' contributions

SS, MB, GA and VS conceived the idea of this study as part of PhD
work. SS analyzed the data. SS wrote the first draft of the manuscript.
SS, MB, VS and GA designed and reviewed the manuscript. SS
approved the final version of this research article. SS, MB, GA and VS
were responsible for the overall supervision of this work.

Funding

Not Applicable.

Ethical approval

The Study Design and constants are approved by the B. S Abdur
Rahman Science & Technology Chennai.

Consent to publication

Not Applicable.

Fig. 11. Feature importance prediction.

Fig. 12. ROC curve for LR

Fig. 13. ROC curve for DT


Fig. 14. Roc curve for GBC
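
The ROC curves captioned in Figs. 12–14 plot the true-positive rate against the false-positive rate as the decision threshold varies. The sketch below shows one way such a curve and its area could be computed from predicted scores. It is a plain-Python illustration rather than the authors' code: the helper names `roc_points` and `trapezoid_auc` and the labels and scores are all hypothetical.

```python
def roc_points(y_true, scores):
    """Sweep the threshold from high to low and collect (FPR, TPR) points."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples sharing the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / negatives)
        tpr.append(tp / positives)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )


# Hypothetical labels (1 = diabetes) and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
area = trapezoid_auc(fpr, tpr)

# A scorer that ranks every positive above every negative.
perfect_area = trapezoid_auc(*roc_points([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
```

In this toy case one negative sample is scored above one positive sample, so the area falls below that of the perfectly ranking scorer.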

Fig. 15. Confusion matrix for LR


Fig. 17. Confusion matrix for GBC
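
The confusion matrices captioned in Figs. 15–17 are the basis for the Accuracy, Precision, Recall, and F1-score values reported for each classifier. As a hedged illustration of that relationship, the sketch below derives the four metrics from the TP, TN, FP, and FN counts. The function names and the label vectors are invented for the example and do not reproduce the paper's data.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = diabetes, 0 = no diabetes)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Derive accuracy, precision, recall, and F1-score from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical ground truth and predictions for ten patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
scores = classification_metrics(*counts)
```

Precision penalises false positives and recall penalises false negatives, so reporting both alongside accuracy gives a fuller picture than accuracy alone when the two classes are unevenly represented.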

Table 5
Comparison with the existing system.

Research Work   Techniques Adopted     Data set          Highest Accuracy
[51]            XGB, ADB, GBM, LGBM    PIDD Set          87.56 % in GBM
[52]            XGB, KNN, GB, DT       Clinical Dataset  78.9 % in DT
[53]            KNN, SVM, DT           PIMA Data set     77.3 % in KNN
[54]            RF, DT, KNN, SVM       PIDD Set          81 % in RF
[55]            DT, KNN, SVM, XGB      PIMA Data set     82 % in SVM
[56]            KNN, RF, DT            PIDD Set          75.24 % in RF
Our Method      LR, DT, GB             PIMA Data set     91 % in DT

Fig. 16. Confusion matrix for DT

Financial disclosure

Not Applicable.

Declaration of competing interest

The authors declare that they have no known competing financial
interests or personal relationships that could have appeared to influence
the work reported in this paper.

Data availability

The data that has been used is confidential.

References

[1] A. Alazwari, A. Johnstone, L. Tafakori, M. Abdollahian, A.M. AlEidan, K. Alfuhigi, M.A. Alshamrani, Predicting the development of T1D and identifying its Key Performance Indicators in children; a case-control study in Saudi Arabia, PLoS One 18 (3) (2023), e0282426.


[2] S. Musa, I. Dergaa, V. Bachiller, H.B. Saad, Global implications of COVID-19 pandemic on adults' lifestyle behavior: the invisible pandemic of noncommunicable disease, Int. J. Prev. Med. 14 (1) (2023) 15.
[3] S.S. Bhat, V. Selvam, G.A. Ansari, M.D. Ansari, M.H. Rahman, Prevalence and early prediction of diabetes using machine learning in North Kashmir: a case study of district Bandipora, Comput. Intell. Neurosci. 2022 (2022) 12.
[4] S.N. Hong, I.L. Mak, W.Y. Chin, E.Y.T. Yu, E.T.Y. Tse, J.Y. Chen, E.Y.F. Wan, Age-specific associations between the number of co-morbidities, all-cause mortality and public direct medical costs in patients with Type 2 diabetes: a retrospective cohort study, Diabetes Obes. Metabol. 25 (2) (2023) 454–467.
[5] Bhat, S. S., & Ansari, G. A. Prediction of diabetes mellitus using machine learning. In Machine Learning and Deep Learning in Efficacy Improvement of Healthcare Systems (pp. 93-108). CRC Press.
[6] E.S. Almutairi, M.F. Abbod, Machine learning methods for diabetes prevalence classification in Saudi arabia, Modelling 4 (1) (2023) 37–55.
[7] J. Rashid, S. Batool, J. Kim, M. Wasif Nisar, A. Hussain, S. Juneja, R. Kushwaha, An augmented artificial intelligence approach for chronic diseases prediction, Front. Public Health 10 (2022) 559.
[8] J.W. Mao, Y. He, Z.T. Liu, Speech emotion recognition based on linear discriminant analysis and support vector machine decision tree, in: 2018 37th Chinese Control Conference (CCC), IEEE, 2018, pp. 5529–5533.
[9] C. Nederstigt, B.S. Uitbeijerse, L.G.M. Janssen, E.P.M. Corssmit, E.J.P. de Koning, O.M. Dekkers, Associated auto-immune disease in Type 1 diabetes patients: a systematic review and meta-analysis, Eur. J. Endocrinol. 180 (2) (2019) 135–144.
[10] S. Adam, H.D. McIntyre, K.Y. Tsoi, A. Kapur, R.C. Ma, S. Dias, FIGO Committee on the Impact of Pregnancy on Long-term Health and the FIGO Division of Maternal and Newborn Health, Pregnancy as an opportunity to prevent type 2 diabetes mellitus: FIGO best practice advice, Int. J. Gynecol. Obstet. 160 (2023) 56–67.
[11] S.S. Bhat, V. Selvam, G.A. Ansari, M.D. Ansari, Analysis of diabetes mellitus using machine learning techniques, in: 2022 5th International Conference on Multimedia, Signal Processing and Communication Technologies (IMPACT), IEEE, 2022, November, pp. 1–5.
[12] D. Sugumar, E. Rymbai, S. Vasu, D. Selvaraj, Nuclear receptors and other molecular targets in Type 2 diabetes mellitus, J. Appl. Pharmaceut. Sci. 13 (7) (2023) 85–101.
[13] G. Hu, J. Ding, D.H. Ryan, Trends in Obesity Prevalence and Cardiometabolic Risk Factor Control in US Adults with Diabetes, Obesity, 2023, pp. 1999–2020.
[14] J. Bradley, S. Rajendran, Developing predictive models for early detection of intervertebral disc degeneration risk, Healthcare Anal. 2 (2022), 100054.
[15] M.M. Hassan, S. Mollick, F. Yasmin, An unsupervised cluster-based feature grouping model for early diabetes detection, Healthcare Anal. 2 (2022), 100112.
[16] A.I. Maghsoudi, A.E. Torkayesh, L.C. Wood, E. Herrera-Viedma, K. Govindan, A machine learning driven multiple criteria decision analysis using LS-SVM feature elimination: sustainability performance assessment with incomplete data, Eng. Appl. Artif. Intell. 119 (2023), 105785.
[17] R. Ahuja, S.C. Sharma, M. Ali, A diabetic disease prediction model based on classification algorithms, Print ISSN, Ann. Emerg. Technol. Comput. (AETiC) 3 (2019) 44–52, 2516-028.
[18] J. Chen, J. Xia, P. Du, J. Chanussot, Z. Xue, X. Xie, Kernel supervised ensemble classifier for the classification of hyperspectral data using few labeled samples, Rem. Sens. 8 (7) (2016) 601.
[19] M.Y. Kabir, Social Media Analytics with Applications in Disaster Management and COVID-19 Events, Missouri University of Science and Technology, 2022.
[20] A. Javeed, S.S. Rizvi, S. Zhou, R. Riaz, S.U. Khan, S.J. Kwon, Heart risk failure prediction using a novel feature selection method for feature refinement and neural network for classification, Mobile Inf. Syst. 2020 (2020) 1–11.
[21] D. Malik, G. Munjal, Reviewing classification methods on health care, in: Intelligent Healthcare: Applications of AI in eHealth, Springer International Publishing, Cham, 2021, pp. 127–142.
[22] R. Ahuja, S.C. Sharma, M. Ali, A diabetic disease prediction model based on classification algorithms Print ISSN, Ann. Emerg. Technol. Comput. (AETiC) 3 (2019) 44–52, 2516-0281.
[23] A. Mahabub, A robust voting approach for diabetes prediction using traditional machine learning techniques, SN Appl. Sci. 1 (12) (2019) 1667.
[24] S. Shafi, G.A. Ansari, Early prediction of diabetes disease & classification of algorithms using machine learning approach, in: Proceedings of the International Conference on Smart Data Intelligence (ICSMDI 2021), 2021, May.
[25] N. Ahmed, R. Ahammed, M.M. Islam, M.A. Uddin, A. Akhter, M.A.A. Talukder, B.K. Paul, Machine learning based diabetes prediction and development of smart web application, Int. J. Cognitive Comput. Eng. 2 (2021) 229–241.
[26] J. Martinsson, A. Schliep, B. Eliasson, O. Mogren, Blood glucose prediction with variance estimation using recurrent neural networks, J. Healthcare Inf. Res. 4 (2020) 1–18.
[27] S.S. Bhat, V. Selvam, G.A. Ansari, M.D. Ansari, Hybrid prediction model for type-2 diabetes mellitus using machine learning approach, in: 2022 Seventh International Conference on Parallel, Distributed and Grid Computing (PDGC), IEEE, 2022, November, pp. 150–155.
[28] E.P. Hong, S.G. Heo, J.W. Park, The liability threshold model for predicting the risk of cardiovascular disease in patients with type 2 diabetes: a multi-cohort study of Korean adults, Metabolites 11 (1) (2020) 6.
[29] R. Kamalraj, S. Neelakandan, M.R. Kumar, V.C.S. Rao, R. Anand, H. Singh, Interpretable filter based convolutional neural network (IF-CNN) for glucose prediction and classification using PD-SS algorithm, Measurement 183 (2021), 109804.
[30] T.O. Omotehinwa, D.O. Oyewola, E.G. Dada, A light gradient-boosting machine algorithm with tree-structured parzen estimator for breast cancer diagnosis, Healthcare Anal. (2023), 100218.
[31] V. Chang, M.A. Ganatra, K. Hall, L. Golightly, Q.A. Xu, An assessment of machine learning models and algorithms for early prediction and diagnosis of diabetes using health indicators, Healthcare Anal. 2 (2022), 100118.
[32] S. Wankhade, S. Vigneshwari, A novel hybrid deep learning method for early detection of lung cancer using neural networks, Healthcare Anal. 3 (2023), 100195.
[33] J. Gao, Y. Yang, P. Lin, D.S. Park, Computer vision in healthcare applications, J. Healthc. Eng. 2018 (2018) 4.
[34] R. Krishnamoorthi, S. Joshi, H.Z. Almarzouki, P.K. Shukla, A. Rizwan, C. Kalpana, B. Tiwari, A novel diabetes healthcare disease prediction framework using machine learning techniques, J. Healthc. Eng. 23 (2022) 1.
[35] M.A. Widyananda, I. Palupi, Implementation of the spiral optimization algorithm in the support vector machine (SVM) classification method (case study: diabetes prediction), in: 2021 International Conference Advancement in Data Science, E-Learning and Information Systems (ICADEIS), IEEE, 2021, October, pp. 1–6.
[36] A. Cahn, A. Shoshan, T. Sagiv, R. Yesharim, R. Goshen, V. Shalev, I. Raz, Prediction of progression from pre-diabetes to diabetes: development and validation of a machine learning model, Diabetes/metabolism Res. Rev. 36 (2) (2020), e3252.
[38] A. Gupta, I.S. Rajput, V. Jain, S. Chaurasia, NSGA-II-XGB: meta-heuristic feature selection with XGBoost framework for diabetes prediction, Concurrency Comput. Pract. Ex. 34 (21) (2022), e7123.
[39] K. Kangra, J. Singh, Comparative analysis of predictive machine learning algorithms for diabetes mellitus, Bull. Elec. Eng. Inf. 12 (3) (2023) 1728–1737.
[40] H.F. Ahmad, H. Mukhtar, H. Alaqail, M. Seliaman, A. Alhumam, Investigating health-related features and their impact on the prediction of diabetes using machine learning, Appl. Sci. 11 (3) (2021) 1173.
[41] https://www.kaggle.com/datasets/mathchi/diabetes-data-set.
[42] R. El-Yafouri, L. Klieb, V. Sabatier, The impact of office-related metrics on meeting physician expectations from Electronic Medical Record systems, Healthcare Anal. (2023), 100208.
[43] E.S. Mohamed, T.A. Naqishbandi, S.A.C. Bukhari, I. Rauf, V. Sawrikar, A. Hussain, A hybrid mental health prediction model using Support Vector Machine, Multilayer Perceptron, and Random Forest algorithms, Healthcare Anal. 3 (2023), 100185.
[44] G.R. Ashisha, S.T. George, X.A. Mary, K.M. Sagayam, S. Pramanik, Analysis of Diabetes Disease Using Machine Learning Techniques: A Review, 2022.
[45] J. Chaki, S.T. Ganesh, S.K. Cidham, S.A. Theertan, Machine learning and artificial intelligence based Diabetes Mellitus detection and self-management: a systematic review, J. King Saud Univ.-Comput. Inf. Sci. 34 (6) (2022) 3204–3225.
[46] A. Fahmi, F.A. Muqtadiroh, D. Purwitasari, S. Sumpeno, M.H. Purnomo, A multi-class classification of dengue infection cases with feature selection in imbalanced clinical diagnosis data, Int. J. Intelli. Eng. Syst 15 (3) (2022) 2022.
[47] Y. Pan, S. Chen, F. Qiao, S.V. Ukkusuri, K. Tang, Estimation of real-driving emissions for buses fueled with liquefied natural gas based on gradient boosted regression trees, Sci. Total Environ. 660 (2019) 741–750.
[48] S. Nawar, A.M. Mouazen, Comparison between random forests, artificial neural networks and gradient boosted machines methods of on-line Vis-NIR spectroscopy measurements of soil total nitrogen and total carbon, Sensors 17 (10) (2017) 2428.
[49] G.N. Ahmad, H. Fatima, M. Abbas, O. Rahman, M.S. Alqahtani, Mixed machine learning approach for efficient prediction of human heart disease by identifying the numerical and categorical features, Appl. Sci. 12 (15) (2022) 7449.
[50] B.E.R. Singh, E. Sivasankar, Risk analysis in electronic payments and settlement system using dimensionality reduction techniques, in: 2018 8th International Conference on Cloud Computing, Data Science & Engineering (Confluence), IEEE, 2018, January, pp. 14–19.
[51] T.A. Shaikh, R. Ali, Applying machine learning algorithms for early diagnosis and prediction of breast cancer risk, in: Proceedings of 2nd International Conference on Communication, Computing and Networking: ICCCN 2018, NITTTR Chandigarh, Springer Singapore, India, 2019, pp. 589–598.
[52] M.R. Nalluri, D.S. Roy, Hybrid disease diagnosis using multiobjective optimization with evolutionary parameter optimization, J. Healthc. Eng. 2017 (2017) 27.
[53] S. Martínez-Agüero, I. Mora-Jiménez, J. Lérida-García, J. Álvarez-Rodríguez, C. Soguero-Ruiz, Machine learning techniques to identify antimicrobial resistance in the intensive care unit, Entropy 21 (6) (2019) 603.
[54] S.S. Bhat, V. Selvam, G.A. Ansari, M.D. Ansari, Analysis of diabetes mellitus using machine learning techniques, in: 2022 5th International Conference on Multimedia, Signal Processing and Communication Technologies (IMPACT), IEEE, 2022, November, pp. 1–5.
[55] K. Kangra, J. Singh, Comparative analysis of predictive machine learning algorithms for diabetes mellitus, Bull. Elec. Eng. Inf. 12 (3) (2023) 1728–1737.
[56] N. Nipa, M.M.H. Riyad, M.S. Satu, M. Walliullah, K.C. Howlader, M.A. Moni, Clinically adaptable machine learning model to identify early appreciable features of diabetes in Bangladesh, Intelli. Med. 14 (2023) 96.
