
Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 218 (2023) 1257–1269
www.elsevier.com/locate/procedia

International Conference on Machine Learning and Data Engineering


Predictive Modeling and Analytics for Diabetes using Hyperparameter tuned Machine Learning Techniques

Subhash Chandra Gupta*, Noopur Goel

Department of Computer Applications, V.B.S. Purvanchal University, Jaunpur, India

Abstract

The accuracy of a classifier is important for the success of any prediction model: the more accurate the classifier, the more robust the system built on it. In this paper, a disease prediction model is developed in Python for the classification of diabetes in patients, and a comparative analysis of the performance of machine learning classification algorithms is carried out. The classifiers' performances are enhanced by tuning their hyperparameters and by applying different dataset preprocessing methods. In this experimental analysis, four models have been created, each based on a dataset obtained by a different preprocessing of the PIMA dataset. To each model, the K-Nearest Neighbors, Decision Tree, Random Forest and Support Vector Machine classification algorithms have been applied, with the classifiers' hyperparameters tuned to get better results. A detailed analysis has also been performed to identify the best prediction model, the best classifier and the most effective preprocessing method. The prediction model uses F1 score as the main metric. The highest F1 score and accuracy are 75.68% and 88.61% respectively, achieved by the Random Forest classifier on dataset model D3, obtained by removing the samples having missing or unknown values from the PIMA dataset.
© 2023 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the International Conference on Machine Learning and Data Engineering

Keywords: Hyperparameter; K-Nearest Classifier; Support Vector Machine; F1 score; diabetes mellitus; classification

1. Introduction

* Corresponding author. Tel.: +91-9454734289; E-mail address: csubhashgupta@gmail.com

1877-0509 © 2023 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the International Conference on Machine Learning and Data Engineering
10.1016/j.procs.2023.01.104

Glucose is the main source of energy for the human body and is present in the blood in the form of sugar. When a disorder impairs the extraction of glucose from the blood, sugar accumulates to a high level in the body; this condition is known as diabetes. A hormone called insulin, which regulates the blood sugar level in the body, is responsible for it [21]. In a diabetic patient, either the pancreas does not produce enough insulin or the produced insulin is not absorbed by the body. Diabetes causes major health complications such as kidney failure, heart attack and blindness [2]. Although it is incurable, its threats can be avoided by regular exercise, healthy eating and maintaining body weight [6].
Type-1 diabetes, type-2 diabetes and gestational diabetes are the three types of diabetes found in humans. In type-1 diabetes, the pancreas becomes unable to produce enough insulin because its beta cells are destroyed, and the glucose level in the body rises. Type-1 diabetes is also known as Insulin-Dependent Diabetes Mellitus (IDDM) [2], [12, 13, 25]. Type-2 diabetes, also known as Non-Insulin-Dependent Diabetes Mellitus (NIDDM) or adult-onset diabetes, is the result of insulin resistance developed in the body's cells. The produced insulin is not consumed by the body, and this situation stimulates the pancreas to produce more insulin; overworked, the pancreas eventually stops producing insulin, which raises the glucose level in the body. Gestational diabetes is seen only in pregnant women, when a high glucose level is identified during pregnancy; it disappears after the delivery of the baby [4], [19], [27].
Today diabetes is a major threat to the health of the world population. Every seven seconds, diabetes is the cause of death of a person under 60 years of age, and it accounts for 50% of such deaths worldwide. According to a WHO report, about 108 million people were under the threat of diabetes in 1980, and by 2014 this number had increased to 422 million [4, 25], [27]. About 77% of the world's diabetic population belongs to low- and middle-income countries. Globally, diabetes particularly attacks middle-aged people between the ages of 40 and 59, which has serious social and economic implications. The overall prevalence of diabetes in the 35 to 39, 45 to 49, 55 to 59 and 65 to 69 years age groups is approximately 5%, 10%, 15% and 20% respectively [13]. The number of people with diabetes has risen in India too: 69.2 million people were affected in 2015, increasing to 72.9 million by 2017 [21].
Machine learning, a branch of artificial intelligence connected with statistics, is concerned with the development of algorithms and techniques that permit computers to learn and develop intelligence based on the analysis of past data. The system develops this ability in the following steps: gathering data from different resources, preparing the dataset with preprocessing methods, building a model using classifiers on the training data, and analyzing the model's performance on the test data [16]. Machine learning techniques are useful for problems such as classification, prediction and pattern recognition. They can be applied in different areas such as email filtering, web page ranking, search engines, face tagging and recognition, robotics, traffic management and disease prediction/classification [23].
A disease prediction model is used to correctly classify a given sample as positive or negative. The correct diagnosis of a patient's disease increases his chances of recovery, while a wrong classification may cost the patient heavily, with lifelong poor health or sometimes even his life. A model with better performance has less chance of error in its classifications. In the past, a number of studies have been made on the classification of disease, but scope for improvement still exists.
The objective of this study is to enhance the performance of classifiers by improving the working of the classification model through the selection of appropriate preprocessing methods and the tuning of the classifiers' hyperparameters. In our experimental work, the model is developed using machine learning classifiers in Python for the PIMA diabetes dataset [28]. The PIMA dataset has been prepared for model building using different preprocessing techniques and converted into four dataset versions. On each version, a model is built using the same hyperparameter-tuned classifiers: KNN, decision tree, random forest and support vector machine. The obtained results have been evaluated on different performance metrics such as F1 score, accuracy, precision and recall, and finally the results are analyzed.
The research paper has seven sections. The introduction, the global prevalence of the diabetes disease and the objective of the paper are discussed in the first section. A review of past literature is made in Section 2, while Sections 3 and 4 cover the working methodology and the discussion of results. The best classifier, the best dataset model, the effect of preprocessing and the conclusion are discussed in Sections 5, 6 and 7 respectively.

2. Literature Review

Sneha et al. [21] used a feature selection method to select the optimal attributes of the PIMA diabetes dataset for their prediction model. The machine learning algorithms used in their model are Support Vector Machine, K-Nearest Neighbors, Naive Bayes, Decision Tree and Random Forest, and they achieved their best accuracy of 82.2% with the Naive Bayes technique. R. J. Steffi et al. [22] worked on the PIMA diabetes dataset and made a comparative analysis of the performance of the ANN, Logistic Regression, Naive Bayes, SVM and C5.0 machine learning algorithms. They obtained their best accuracy of 74.67% for the prediction model with Logistic Regression. They also compared the time taken by these algorithms to make predictions, and found that C5.0 took the minimum time.
Amina Azrar et al. [5] worked on the PIMA diabetes dataset for their prediction model and converted numerical data into categorical data during preprocessing. They applied K-Nearest Neighbors, Naive Bayes and Decision Tree to predict diabetic and non-diabetic patients in the dataset and cross-validated the obtained results. The best result of their research work is 79.56% with the decision tree classifier; the model is implemented in WEKA. Aiswarya Iyer et al. [14] also implemented a prediction model in WEKA using the PIMA dataset. The input dataset was normalized and a feature selection method applied to get better performance. They got their best accuracy of 79.56% from the Naive Bayes algorithm.
Neha Tigga et al. [23] made a study to assess the risk of diabetes considering the lifestyle and family background of people. The experiment was done on data collected in online and offline mode from about 952 people. They applied major machine learning algorithms such as Naive Bayes, Logistic Regression, KNN, SVM, Decision Tree and Random Forest for their prediction model. The same model was also applied to the PIMA diabetes dataset and a comparative analysis made between the results obtained from both; they found that Random Forest showed the best results on both datasets. Mani Abedini et al. [1] worked on an ensemble hierarchical model with two levels. In the first level, the model is trained independently by Logistic Regression and a Decision Tree (ID3); in the second level, the outputs of the previous level are combined using an ANN. The prediction model was applied to the PIMA diabetes dataset and achieved an accuracy of 83.08% with their proposed ensemble hierarchical model (Artificial Neural Network + Logistic Regression + Decision Tree).
Using Python, Gupta S.C. et al. [10] worked on the PIMA diabetes dataset to enhance the performance of the K-Nearest Neighbors machine learning algorithm through data normalization and feature selection. They obtained an accuracy of 85.06% and an F1 score of 78.18% from KNN when the number of neighbors is 19. Elliot B. Sloane et al. [20] proposed a cloud-based mobile application model for a diabetes monitoring system integrating three components: the diabetic patient, the physician and diabetes coaches. After monitoring the patient's lifestyle, it would intervene if it found the patient in a critical situation. Choubey et al. [8] made a review of research done during the period 2003 to 2014 and tabulated it by the classification techniques used and their results, tools and applicability. S. Traymbak et al. [24] developed a comparative model using the R programming tool with the LDA, KNN, SVM, Random Forest and adaptive boosting (AdaBoost) machine learning classifiers.
Huma Naz et al. [17] worked on the PIMA diabetes dataset, building a prediction model using a deep learning approach and applying the Artificial Neural Network (ANN), Naive Bayes (NB), Decision Tree (DT) and Deep Learning (DL) classifiers for the prediction of diabetic and non-diabetic patients. Among these, Decision Tree and Deep Learning are the best performing classifiers, with Deep Learning scoring slightly better in terms of accuracy. An electronic diagnostic system is proposed by Roy, K. et al. in [18] with three machine learning classifiers, Naive Bayes, Random Forest and the J48 decision tree, taking PIMA diabetes as the base dataset. In their experimental analysis, they scored accuracies of 75.65%, 73.91% and 79.13% with the decision tree, random forest and Naive Bayes classifiers respectively. Using the PIMA dataset, Z. Zaman et al. [26] implemented a classification model using the Naive Bayes, SVM and Decision Tree classifiers and scored 81%, 79% and 70% accuracy respectively. In [7], a decision support system is developed for the diagnosis of type-2 diabetes in patients. In the first phase of their study, they preprocessed the noisy diabetes dataset using imputation methods, and in the second phase they applied various classification algorithms such as linear, tree-based and ensemble algorithms, achieving the best accuracy with an Artificial Neural Network. Since the input dataset was an imbalanced binary dataset, they also used the SMOTE technique to equalize the samples of both classes before applying the classification algorithms. The objective of the study was to compare the performances of these classifiers and also to classify the disease into mild, moderate and severe categories considering patients' various factors. Table 1 shows some recent research using PIMA as a dataset.

Table 1. Recent literature related to diabetes classification

Reference  Authors & Year                                    Classification Algorithm (Accuracy in %)
[1]        Mani Abedini, A. Bijari and T. Banirostam (2020)  Naive Bayes, Decision Tree (J48), PLS-LDA, SVM, BLR, KNN
[5]        A. Azrar, Y. Ali, M. Awais and K. Zaheer          KNN (65.19), Naive Bayes (71.74) and Decision Tree (75.65)
[7]        V. Chang, J. Bailey, Q. A. Xu, and Z. Sun (2022)  Naive Bayes (79.13), random forest (73.91) and J48 decision tree (75.65)
[9]        Henock M. Deberneh and Intaek Kim (2021)          LR (71), RF (73), SVM (73), and XGBoost (73)
[10]       S.C. Gupta and Noopur Goel (2020)                 K-Nearest Neighbors classifier (87)
[17]       Huma Naz and Sachin Ahuja (2020)                  ANN, Naive Bayes (NB), Decision Tree (DT) and DL
[21]       N. Sneha and Tarun Gangil (2019)                  KNN (63.04), SVM (77.73), NB (73.48), Decision Tree (73.18) and Random Forest (75.39)
[22]       J. Steffi, R. Balasubramanian and K. Arvind Kumar (2018)  Naive Bayes (73.57), Logistic Regression (74.67), Decision Tree (C4.5) (74.63), SVM (72.17) and ANN (72.29)
[23]       N. P. Tigga and S. Garg (2020)                    Decision tree (J48) and Naive Bayes
[24]       S. Traymbak and N. Issar (2021)                   LDA, k-Nearest Neighbour (KNN), SVM, RF, and AdaBoost
[26]       Z. Zaman, M. A. A. A. Shohas, M. H. Bijoy, M. Hossain, and S. Al Sakib (2022)  SVM (79), Naive Bayes (81) and Decision tree (70)

3. Methodology of the Proposed Model

To address the problem statement given in the introduction, the diabetes prediction model is implemented in the Python language and works on the PIMA diabetes dataset. By preprocessing the actual PIMA dataset, four versions of the dataset are created, and a prediction model is built on each version. The machine learning classifiers used for each model are KNN, SVM, decision tree and random forest. The objective of the model is to classify samples into the different class labels (diabetic and non-diabetic) and to analyze the effect of the preprocessing methods on the performance of the classifiers.

3.1 Loading of PIMA Dataset

The PIMA diabetes dataset is a collection of medical test data related to women of the PIMA community of the town of Phoenix in the United States of America. Due to their very high prevalence of type-2 diabetes, they have been a subject of research studies. The PIMA dataset is freely available for research purposes and has been downloaded from the UCI Machine Learning Repository [28]. The dataset is imbalanced and holds the medical test data of 768 persons, of which 500 samples are of non-diabetic persons and 268 of diabetic persons; in other words, the majority class is non-diabetic (negative) while the minority class is diabetic (positive). PIMA is a binary class-label dataset, with the value "1" for a positive outcome and "0" for a negative one.
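As an illustration, the dataset can be loaded with pandas. The column names below follow the commonly used naming of the UCI/Kaggle distribution and are an assumption, as is the two-row inline sample standing in for the real 768-row file:

```python
import io

import pandas as pd

# Two illustrative rows in the UCI file layout (no header row); the real
# file downloaded from the repository [28] has 768 rows.
sample = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)
COLUMNS = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]

df = pd.read_csv(sample, header=None, names=COLUMNS)
print(df.shape)  # (2, 9)
```

With the full file, `df["Outcome"].value_counts()` would show the 500/268 class imbalance described above.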

3.2 Dataset Preprocessing

The dimension of the PIMA dataset is 768 × 9: it contains the medical data of 768 patients on 8 different test attributes, and the 9th attribute is the class attribute, which shows the diagnosis result of the patient. Six attributes have zero values for some samples in the dataset, namely pregnancy, glucose, BloodPressure, skinthickness, insulin and BMI, with 111, 5, 35, 227, 374 and 11 occurrences of zero respectively. A "zero" value in these attributes may be the result of a wrong data collection process, typographic errors, or the unavailability of data during collection. These zero values may adversely affect the classifiers' performances, so the missing values must be handled appropriately to build a robust prediction model.

Data preprocessing is the collection of processes used to tackle these kinds of shortcomings in a dataset. Here, the dataset is preprocessed by four different methods to handle missing values and outliers. The details of the four dataset versions obtained from preprocessing are given in Table 2.

Table 2. Shape of the different versions of the PIMA dataset preprocessed by different methods.

Dataset                                                       Shape    (+ve / -ve) Samples  Preprocessing Activities
D1 - Actual PIMA diabetes dataset                             768 × 9  268 / 500            Preprocessing is not performed
D2 - Dataset filling missing values with column's mean value  768 × 9  268 / 500            Fill all unknown values with the corresponding column's mean value
D3 - Dataset removing missing-value rows                      393 × 9  130 / 263            Remove all rows which have an unknown value
D4 - Dataset removing outliers and rows with missing values   337 × 9  98 / 239             Remove rows having missing values and outliers
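The four versions in Table 2 can be sketched as follows. This is a hypothetical reconstruction, not the authors' code: the paper does not state exactly which columns have their zeros treated as missing for each version, nor the outlier rule used for D4, so the five-column choice and the 3-standard-deviation cut-off below are assumptions.

```python
import numpy as np
import pandas as pd


def make_versions(df):
    """Sketch of the four dataset versions of Table 2 (column names assumed)."""
    # Zero is implausible in these columns, so it is treated as missing;
    # zero pregnancies, by contrast, is a valid value.
    zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness",
                       "Insulin", "BMI"]

    d1 = df.copy()  # D1: actual dataset, no preprocessing

    # D2: replace zeros (treated as missing) with the column mean
    d2 = df.copy()
    d2[zero_as_missing] = d2[zero_as_missing].replace(0, np.nan)
    d2[zero_as_missing] = d2[zero_as_missing].fillna(d2[zero_as_missing].mean())

    # D3: drop every row holding a zero in any of those columns
    d3 = df[(df[zero_as_missing] != 0).all(axis=1)].reset_index(drop=True)

    # D4: additionally drop outliers; a 3-standard-deviation rule is assumed
    num = d3[zero_as_missing]
    mask = (np.abs(num - num.mean()) <= 3 * num.std()).all(axis=1)
    d4 = d3[mask].reset_index(drop=True)

    return d1, d2, d3, d4
```

On the real file, `d3` would contain the 393 complete rows reported in Table 2.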

Fig. 1. Workflow of the proposed model

3.3 Hyperparameter Tuning

The model has been trained with 80% of the samples and tested on the remaining 20% of each dataset version. To the model, the KNN, SVM, decision tree and random forest classifiers have been applied. The performance of a classifier depends upon a number of parameters which define the architecture of the model. These parameters are called hyperparameters, and to get better results from a classifier on a dataset, its hyperparameters are tried at different values and the values on which it produces the best result are selected. This process is called hyperparameter tuning. Finding the optimum results from classifiers through parameter tuning is a very difficult task.

The experimental work has been implemented in the Python language. While building a classifier model in the experimental program, the hyperparameter values are tested over different ranges of values through nested loops. In Table 3, the tuned hyperparameters and their tested ranges/values are given.

Table 3. Classifiers and their hyperparameters in the prediction model

S. No.  Classifier              Hyperparameter        Tested values of hyperparameter
1       KNN                     algorithm             ['auto', 'kd_tree', 'ball_tree', 'brute']
                                metric                ['euclid', 'minkowski']
                                n_neighbors           range (1, 100)
                                leaf_size             30
2       Decision tree           criterion             ['entropy', 'gini']
                                maximum_depth         range 1 to 30
                                random_state          range 1 to 66
                                minimum_sample_size   [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
3       Random forest           n_estimators          range 1 to 50
                                criterion             ['entropy', 'gini']
                                random_state          range 1 to 50
                                max_features          'auto'
                                min_samples_leaf      range 1 to 50
4       Support Vector Machine  kernel                ['rbf', 'poly', 'linear']
                                C                     range (1, 100)
                                gamma                 'scale'
                                degree                [1, 2, 3] (applicable for 'poly' kernel)
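The paper tunes hyperparameters through nested loops; an equivalent, idiomatic sketch using scikit-learn's GridSearchCV over the KNN row of Table 3 is shown below. The synthetic data, the reduced n_neighbors range and the spelling 'euclidean' for the table's 'euclid' are assumptions made to keep the example small and self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a preprocessed PIMA version (8 features, binary label)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Grid mirroring the KNN row of Table 3; F1 score is the paper's main metric
grid = {
    "algorithm": ["auto", "kd_tree", "ball_tree", "brute"],
    "metric": ["euclidean", "minkowski"],
    "n_neighbors": range(1, 30, 2),  # the paper sweeps range(1, 100)
}
search = GridSearchCV(KNeighborsClassifier(leaf_size=30), grid,
                      scoring="f1", cv=5)
search.fit(X_tr, y_tr)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to the decision tree, random forest and SVM rows of the table by swapping the estimator and the grid.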

3.4 Comparative Analysis of Results

In the above prediction model there are four dataset models, each based on a different version of the diabetes dataset. Each dataset model applies the same classifiers and produces its results. The classifiers' performances are measured on accuracy, precision, sensitivity, specificity and F1 score; to evaluate the performance of a model, F1 score is taken as the main metric. A detailed analysis is performed on the results of the models to identify the best performing dataset model and its best classifier. The study also focuses on the best methods to fill missing values in a dataset.

4. Result and Discussion

For a disease prediction model, the performance of the classifier plays an important role: a patient has to pay a heavy cost whenever a classifier makes a wrong prediction. In a data analysis system, the performance of a classifier is measured on accuracy, F1 score, precision, sensitivity and specificity. The confusion matrix provides the necessary data to calculate these values [15].

Accuracy    = (TP + TN) / (TP + TN + FP + FN)                  (1)

Sensitivity = TP / (TP + FN)                                   (2)

Specificity = TN / (TN + FP)                                   (3)

Precision   = TP / (TP + FP)                                   (4)

F1 score    = 2 * (precision * recall) / (precision + recall)  (5)
Accuracy alone cannot be a good metric for a prediction model. The high accuracy of a classifier may merely reflect the correct diagnosis of most of the true-negative cases while the model's true-positive predictions are few. The costs of the two kinds of wrong prediction, false-positive and false-negative results, need not be the same in a given system. For example, in a diabetes prediction system the cost of a false-negative diagnosis is higher than the cost of a false-positive one: a diabetic patient may ignore his treatment when a wrong diagnosis predicts him to be non-diabetic (a false-negative result). So we need a model in which the number of wrong predictions (false positives and false negatives) is small; in other words, the model should have high precision and high recall. High precision ensures that the model makes a small number of false-positive predictions, while high recall shows a small number of false-negative predictions. When the cost of false positives is high, precision is the appropriate measure, and recall when the cost of false negatives is high. Accuracy may be used as the performance metric when the costs of false negatives and false positives are the same; when they differ, precision, recall or both are used. F1 score can be a good metric since it is the weighted average of sensitivity and precision: a higher F1 score ensures that the model is performing better on both false-positive and false-negative cases. In this research paper, F1 score is considered the main metric for the evaluation of the classifiers used in the model.
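Equations (1)-(5) can be computed directly from the four confusion-matrix counts. The helper below is a minimal sketch; the counts in the example are illustrative, not values from the paper:

```python
def metrics_from_confusion(tp, tn, fp, fn):
    """Accuracy, sensitivity (recall), specificity, precision and F1 score
    from confusion-matrix counts, per equations (1)-(5)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f1


# Illustrative counts: 20 TP, 70 TN, 5 FP, 5 FN
print(metrics_from_confusion(tp=20, tn=70, fp=5, fn=5))
```

With these counts, accuracy is 0.90 while F1 is 0.80, illustrating how a class-imbalanced test set can make accuracy look better than the balance of precision and recall warrants.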

4.1 Application of Classifiers on Models based on different Versions of PIMA dataset

In this paper the K-Nearest Neighbors, Support Vector Machine, Decision Tree and Random Forest classifiers have been applied to all datasets and their performances analyzed. For the division of the dataset into training and test sets, the 80:20 rule has been followed; the training and test sets are used to build the model and to check its performance respectively.
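A minimal sketch of this setup is shown below, with synthetic data standing in for a preprocessed PIMA version; the classifiers use scikit-learn defaults rather than the tuned values of Table 3, so the printed scores are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one dataset version (8 features, binary label)
X, y = make_classification(n_samples=400, n_features=8, random_state=1)

# 80:20 split, as in the paper
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

classifiers = {
    "KNN": KNeighborsClassifier(),
    "Decision tree": DecisionTreeClassifier(random_state=1),
    "Random forest": RandomForestClassifier(random_state=1),
    "SVM": SVC(),
}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    print(name, round(f1_score(y_te, clf.predict(X_te)), 3))
```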

4.1.1 Precision Analysis

A precision of 1.00 indicates that all of a classifier's positive predictions are actually correct. The classifiers' results are shown in Fig. 2. For dataset model D3, random forest shows the maximum precision level of 100%, which indicates that all positive predictions made by it are actually correct. The minimum precision of 64% is obtained by the Decision Tree classifier for dataset D4. The analysis of these observations indicates that unknown or missing values negatively affect the performance of the classifiers: in comparison to the other dataset models, dataset model D1, which is based on the actual dataset and contains missing or unknown values, has lower precision.

4.1.2 Sensitivity Analysis (True Positive Rate)

In this experimental examination, a high recall value is achieved by random forest for datasets D1 and D2, and by decision tree for datasets D3 and D4. The highest recall amongst all classifiers over all datasets is 73.91%, obtained on dataset D3. KNN is the worst performer among the classifiers for dataset D3, SVM for datasets D1 and D2, and random forest for dataset D4, since they have the lowest recall values. The sensitivities of the classifiers for all dataset models are shown in Fig. 3.

4.1.3 Analysis of Accuracy

Accuracy is a better metric when the numbers of positive and negative samples are balanced [1], [24]. Fig. 4 shows that the maximum accuracy on datasets D1, D2 and D3 is achieved by random forest, at 81.17%, 81.17% and 88.61% respectively, while SVM returns the best output of 82.35% for D4. The model based on dataset D3 scores the highest accuracy, which shows that preprocessing methods affect the performance of a model. SVM is the worst performing classifier for the models based on datasets D1, D2 and D3, while for dataset D4 the decision tree is the worst performer. Although the missing values are filled with the means of the corresponding columns in dataset D2, this does not bring any improvement over dataset D1: both show approximately the same level of accuracy.

Fig. 2. Precision of all dataset models Fig. 3. Sensitivity of all dataset models

4.1.4 Analysis of F1 Score

The F1 score is the harmonic mean of precision and recall and lies between 0 and 1. F1 score is selected as the main metric for the selection of the best dataset model in the experimental examination, since a high F1 score shows the better performance of a model [25]. The F1 scores of all dataset models are given in Fig. 5.

Fig. 4. Accuracy of all datasets models Fig. 5. F1 Score of for all dataset models

Among these datasets, dataset D3 shows the maximum F1 score of 75.68%, followed by dataset D2 (73.39%) and dataset D1 (72.72%); random forest is the main performer behind these results. The worst performance is for dataset D4, where all classifiers except SVM score the minimum. The F1 scores of the random forest classifier for datasets D1 and D2 are 72.73% and 73.39% respectively. All classifiers perform better on the dataset D2 model than on the D1 model. From the obtained results it may be concluded that a model improves its performance when unknown or missing values are replaced with the corresponding column means, or removed from the dataset (as in the D3 model).

4.1.5 Specificity Analysis (True negative Rate)

SVM scores the best specificity for all dataset models: 100%, 95.83%, 90.74% and 90.74% for dataset models D4, D3, D1 and D2 respectively. For dataset D4, this means it classified all negative samples correctly.

4.2 Detailed Analysis of the Best Selected Dataset Model

The accuracy and F1 score of all four dataset models are shown in Fig. 4 and Fig. 5. Among these, the dataset D3 model shows the best performance for all classifiers. It shows its best accuracy of 88.61% when the random forest classifier is used, while the decision tree, SVM and KNN classifiers show 86.1%, 84.8% and 81% accuracy respectively. Dataset D3 also shows the highest F1 scores, for random forest (75.68%), decision tree (75.56%) and SVM (75%). The scores of recall, precision and specificity for the dataset D3 model are also better than those of the other dataset models. Considering these performances on the different metrics, the dataset D3 model is selected as the best prediction model, followed by the models based on datasets D2, D1 and D4 respectively. The observations obtained from the experimental analysis show that the best prediction is made by the model based on the dataset from which the samples with missing or unknown values are deleted. Since the model based on dataset D3 shows the best performance on all metrics, it is selected for detailed analysis.

4.3 Detailed Analysis of Performance of Classifiers for Dataset Model D3

The actual PIMA diabetes dataset is preprocessed by different preprocessing methods to create four datasets.
Dataset version D3 is one of them; it is obtained by removing the rows that have a missing value in any column,
which reduces the dataset to 393 rows. The model developed on it produces better results than the other dataset
models. In this section, a detailed analysis of the performance of each classifier of this model is made.
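The paper does not reproduce the preprocessing code. The following pandas sketch shows one plausible construction of the four dataset versions, under two assumptions of ours: that zeros in certain PIMA columns encode missing values, and that D4 removes outliers with a 1.5 * IQR rule (the paper does not specify its rule). The function name `make_dataset_versions` is illustrative only.

```python
import numpy as np
import pandas as pd

def make_dataset_versions(df, cols_with_zero_as_missing):
    """Build the four dataset variants D1-D4 described in the paper (sketch)."""
    d1 = df.copy()                                    # D1: actual dataset, untouched
    work = df.copy()
    # In PIMA, physiologically impossible zeros act as missing values
    work[cols_with_zero_as_missing] = work[cols_with_zero_as_missing].replace(0, np.nan)
    d2 = work.fillna(work.mean())                     # D2: missing -> column mean
    d3 = work.dropna()                                # D3: drop rows with missing values
    # D4: additionally drop rows outside 1.5 * IQR on any column (assumed rule)
    q1, q3 = d3.quantile(0.25), d3.quantile(0.75)
    iqr = q3 - q1
    mask = ((d3 >= q1 - 1.5 * iqr) & (d3 <= q3 + 1.5 * iqr)).all(axis=1)
    d4 = d3[mask]
    return d1, d2, d3, d4
```

With these rules, D3 is always a subset of the rows of D1, and D4 a subset of D3, matching the reported size ordering (768, 768, 393, 337).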

4.3.1 K-Nearest Neighbors

The performance of the model varies with the value of K used for the KNN classifier, and for a given dataset it is
very difficult to find the optimum K that produces the best result [5, 8, 10, 11, 16, 23, 24]. So, the number of
neighbors (K) for the K-Nearest Neighbors classifier has been varied over the range 1 to 100, tuning this
hyperparameter to obtain better performance. The observations show that the performance of KNN improves as the
number of neighbors increases until it reaches its peak value, after which performance declines. The KNN classifier
shows its best accuracy of 81.1% when the number of neighbors (K) is 20, 24 or 30. At these values of K, the
F1 score, precision, recall and specificity are 61.54%, 75%, 52.17% and 92.85% respectively.
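The K sweep described above can be sketched with scikit-learn. Since the paper's actual train/test split is not specified, a synthetic stand-in for the 8-feature data is used here, so the numbers will differ from those reported:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the 393-row, 8-feature D3 dataset
X, y = make_classification(n_samples=393, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

best_k, best_acc = 1, 0.0
for k in range(1, 101):                      # the paper sweeps K from 1 to 100
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    acc = accuracy_score(y_te, knn.predict(X_te))
    if acc > best_acc:
        best_k, best_acc = k, acc
```

The accuracy-versus-K curve produced by such a sweep typically rises to a peak and then declines, which is the behavior reported above.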

4.3.2 Decision Tree

The PIMA diabetes dataset is a binary class-labelled dataset, so a standard decision tree classifier [1, 17, 18, 21] is
suitable for it. The decision tree classifier has a number of hyperparameters, and tuning them produces a better result
for any model. The tuned hyperparameters are the split criterion (information gain), the maximum depth, the minimum
sample size per leaf and the random state. The maximum depth is 5 and the random state is 66. When the minimum
number of samples per leaf is 32, the classifier reaches its best accuracy level of 86.1%. With the same parameters,
the F1 score, precision, recall and specificity of the decision tree are 75.6%, 77.27%, 73.91% and 91.07% respectively.
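A sketch of the tuned decision tree, again on a synthetic stand-in for the data (the reported hyperparameter values are used, but the resulting scores will differ from the paper's):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=393, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=66)

# Hyperparameters reported above: information gain (entropy), max_depth=5,
# min_samples_leaf=32, random_state=66
clf = DecisionTreeClassifier(criterion="entropy", max_depth=5,
                             min_samples_leaf=32, random_state=66).fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te))
```

Capping the depth and the minimum leaf size in this way restricts tree growth, which is the usual mechanism for controlling decision-tree overfitting.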

4.3.3 Random Forest

Random forest [21, 23, 24] is an ensemble method that builds many decision trees and combines their predictions.
For dataset D3 it is the best scorer, and its accuracy, F1 score, precision, recall and specificity are 88.61%, 75.68%,
100%, 60.87% and 100% respectively. The performance of random forest depends upon its hyperparameters, such as
"n_estimators" and "min_samples_leaf". The "n_estimators" parameter of the random forest classifier sets the number
of decision trees built, and its best value may vary from one dataset to another. In this experimental analysis, the
88.61% accuracy level is achieved when "n_estimators" is 5, with the parameter tried over the range 1 to 50. The
same experiment is done with the "min_samples_leaf" value.
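The n_estimators sweep can be sketched as follows, again with a synthetic stand-in for dataset D3, so the winning value will not necessarily be 5 as reported:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=393, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

best = {"n_estimators": 1, "accuracy": 0.0}
for n in range(1, 51):                          # same 1..50 range as the paper
    clf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(X_te))
    if acc > best["accuracy"]:
        best = {"n_estimators": n, "accuracy": acc}
```

The same loop structure applies to "min_samples_leaf" by varying that keyword argument instead.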

4.3.4 Support Vector Machine

The SVM classifier offers four types of kernel to process the data: the linear kernel, the RBF (Radial Basis Function)
kernel, the polynomial kernel and the sigmoid kernel. The kernel used for the SVM classifier plays an important role
in its performance [3, 8, 16]. The effect of these kernels is checked along with different values of the regularization
parameter C. SVM achieves an 84.81% accuracy level when the regularization parameter C is 4 and the kernel is
either "linear" or "polynomial". The other SVM kernels ("sigmoid" and "RBF") obtain lower accuracy than the
"linear" and "polynomial" kernels for the given dataset. In Table 4, the performance of the different SVM kernels is
given in detail.

Table 4. Performance of the hyperparameter-tuned SVM classifier

SVM Kernel                    RBF Kernel   Linear Kernel   Polynomial Kernel
Regularization Parameter C        33             4                 1
Accuracy (%)                    83.54          84.81             84.81
Precision (%)                   80.95          79.17             81.89
Sensitivity (%)                 65.38          73.10             69.23
Specificity (%)                 92.45          90.60             92.45
F1 Score (%)                    72.34          76.00             75.00
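The kernel-and-C grid behind Table 4 can be sketched as follows, using a synthetic stand-in for the dataset (so the best combination found here need not match the table):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=393, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for kernel in ["linear", "rbf", "poly", "sigmoid"]:
    for C in [1, 4, 33]:                    # C values appearing in Table 4
        clf = SVC(kernel=kernel, C=C).fit(X_tr, y_tr)
        results[(kernel, C)] = accuracy_score(y_te, clf.predict(X_te))

best_kernel, best_C = max(results, key=results.get)
```

In practice such a grid is often run with `sklearn.model_selection.GridSearchCV` and cross-validation rather than a single held-out split.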

5. Effect of Preprocessing Methods on the Performance of Dataset Models

Dataset D1 and dataset D2 are equal in size, with 768 samples each. The models based on these two datasets show
the same pattern of classifier performance on the different metrics. Random Forest is the best classifier among all,
considering accuracy, F1 score and sensitivity. For the dataset D1 model it shows a 72.73% F1 score, 80.52%
accuracy and 72.72% sensitivity, while for the dataset D2 model these are 73.39%, 81.17% and 72.73% respectively.
SVM and decision tree show the best performance among the other classifiers in precision and specificity for both models.

The D2 dataset model performs slightly better than the D1 dataset model. From this, it is concluded that missing or
unknown values in a dataset may lower the performance of a model, and that replacing them with their mean values
improves performance. Dataset D4 is the smallest dataset, with only 337 samples, created by removing samples that
are either outliers or have missing or unknown values. The SVM classifier scores the best results for this dataset in
terms of accuracy, precision, specificity and F1 score.

For the model based on dataset D3, Random Forest scores best on most metrics, including accuracy, precision,
specificity and F1 score. Its F1 score is 75.68% and its precision is 100%. The 100% precision level shows that every
sample it declares positive is actually positive. The KNN classifier is the worst performing classifier for this dataset
model.

The performance of the Random Forest classifier is better than that of the other classifiers for most of the dataset
models. It contains a number of decision trees and selects the outcome by majority vote among them. Random Forest
and decision tree beat KNN due to their automatic handling of feature interactions and lower susceptibility to
overfitting. For the KNN classifier, computation happens at prediction time and therefore takes longer; this is why
KNN is called a lazy learning model. The following table, Table 5, shows the score of each classifier, ranked from
highest to lowest.

Table 5. Rank of classifiers for the dataset D3 model (classifiers arranged from the highest score, Rank 1, to the lowest, Rank 4)

Metrics        Max. (%)   Rank 1   Rank 2   Rank 3   Rank 4   Min. (%)
F1 Score         75.68      RF       DT      SVM      KNN       61.53
Accuracy         88.61      RF       DT      SVM      KNN       81.01
Precision       100.00      RF      SVM       DT      KNN       75.00
Recall           73.91      DT      SVM       RF      KNN       52.17
Specificity     100.00      RF      KNN      SVM       DT       91.07

6. Findings of Analysis

The experimental analysis of the models based on the four versions of the dataset shows that classifier performance
is affected by the preprocessing methods used for dataset preparation. The four models are based on four versions of
the actual PIMA dataset, named D1, D2, D3 and D4, created with different preprocessing methods. The model based
on the D3 dataset is the most robust, achieving the best scores with the Random Forest classifier on the different
performance metrics.

Table 6. Performances of all versions of the PIMA diabetes dataset, showing the best performing classifier for each metric

Performance     D1: Actual PIMA       D2: Missing values    D3: Rows with missing   D4: Rows with missing
Metrics         Diabetes Dataset      filled by mean        values removed          values & outliers removed
                Max. Score   BPC*     Max. Score   BPC      Max. Score   BPC        Max. Score   BPC
F1 Score          72.73      RF         73.39      RF         75.68      RF           62.50      SVM
Accuracy          80.52      RF         81.17      RF         88.61      RF           82.35      SVM
Precision         74.47      DT         76.09      DT        100.00      RF           83.33      SVM
Recall            72.73      RF         72.72      RF         73.91      DT           59.26      DT
Specificity       90.74      SVM        90.74      SVM       100.00      RF           95.83      SVM

*BPC: best performing classifier

From the experimental analysis of these four datasets, it is also found that Random Forest is the best classifier: it
shows the maximum F1 score and accuracy for datasets D1, D2 and D3. Table 6 gives a detailed view of the
best-scoring classifier for each dataset, considering the different performance metrics.

7. Conclusion and Future Scope

The correct diagnosis of a disease is necessary for any classification system: the earlier a diagnosis is made, the
sooner treatment can be started. Since the cost of an incorrect diagnosis (predicting a diabetic person as non-diabetic)
is very high, the developed model should be robust and reduce the chance of such errors. In this experimental
analysis, four models are developed from four datasets (D1, D2, D3 and D4) created by applying different
preprocessing methods to the PIMA diabetes dataset; the dataset D1 model uses the actual PIMA dataset for
classification. Different classification techniques are applied to these datasets and their performances have been
evaluated. Taking F1 score as the primary metric, the model based on dataset D3 is found at the top of the table: its
F1 score is the maximum, at 76%. Dataset D3 is created by removing the rows with missing values from the original
dataset. The F1 scores of the other datasets D1, D2 and D4 are 73%, 73% and 63% respectively. Further, a
comparative analysis is performed between the classifiers used for the analysis. Random Forest shows the best
performance of all the classifiers: its F1 score, accuracy, precision, recall and specificity are 76%, 89%, 100%, 74%
and 100% respectively.

Limitation: The analysis of the results obtained from the different models shows that classifier performance is
negatively impacted by missing or unknown values in datasets. To obtain better results, those samples can be removed
from the dataset, but this causes another problem: after their removal, the size of the dataset may shrink below a
critical level, making it difficult for the model to make the right prediction.

Future scope: The obtained observations apply only to the PIMA diabetes dataset. In the future, a larger diabetes
dataset can be used to validate them. In this experimental model, only four classifiers have been applied to the
imbalanced dataset; in the future, advanced techniques such as deep learning algorithms, along with feature selection
methods, can be applied to a class-balanced dataset prepared using up-sampling methods.

References

1. Abedini, M. et al.: Classification of Pima Indian Diabetes Dataset using Ensemble of Decision Tree, Logistic Regression and Neural
Network. Ijarcce. 9, 7, 1–4 (2020). https://doi.org/10.17148/ijarcce.2020.9701.
2. Alehegn, M. et al.: Analysis and Prediction of Diabetes Mellitus using Machine Learning Algorithm. Int. J. Pure Appl. Math. 118, 9,
871–877 (2018).
3. Alehegn, M., Joshi, R.R.: Type II diabetes prediction using combo of SVM. Int. J. Eng. Adv. Technol. 8, 6, 712–715 (2019).
https://doi.org/10.35940/ijeat.F7974.088619.
4. American Diabetes Association: Classification and Diagnosis of Diabetes: Standards of Medical Care in Diabetes - 2018. American
Diabetes Association Inc. (2018). https://doi.org/10.2337/dc18-S002.
5. Azrar, A. et al.: Data mining models comparison for diabetes prediction. Int. J. Adv. Comput. Sci. Appl. 9, 8, 320–323 (2018).
https://doi.org/10.14569/ijacsa.2018.090841.
6. American Diabetes Association: 2. Classification and Diagnosis of Diabetes: Standards of Medical Care in Diabetes - 2018. Diabetes
Care. 41, Suppl. 1, S13–S27 (2018). https://doi.org/10.2337/dc18-S002.
7. Chang, V. et al.: Pima Indians diabetes mellitus classification based on machine learning (ML) algorithms. Neural Comput. Appl.
0123456789, (2022). https://doi.org/10.1007/s00521-022-07049-z.
8. Choubey, D.K., Paul, S.: Classification techniques for diagnosis of diabetes: A review. Int. J. Biomed. Eng. Technol. 21, 1, 15–39 (2016).
https://doi.org/10.1504/IJBET.2016.076730.
9. Deberneh, H.M., Kim, I.: Prediction of type 2 diabetes based on machine learning algorithm. Int. J. Environ. Res. Public Health. 18, 6,
9–11 (2021). https://doi.org/10.3390/ijerph18063317.
10. Gupta, S.C., Goel, N.: Performance enhancement of diabetes prediction by finding optimum K for KNN classifier with feature selection
method. In: Proceedings of the 3rd International Conference on Smart Systems and Inventive Technology, ICSSIT 2020. pp. 980–986
Institute of Electrical and Electronics Engineers Inc. (2020). https://doi.org/10.1109/ICSSIT48917.2020.9214129.
11. Gupta, S.C., Goel, N.: Selection of Best K of K-Nearest Neighbors Classifier for Enhancement of Performance for the Prediction of
Diabetes. Adv. Intell. Syst. Comput. 1299 AISC, 135–142 (2021). https://doi.org/10.1007/978-981-33-4299-6_11.
12. International Diabetes Federation: IDF Diabetes Atlas Eight Edition 2017. International Diabetes Federation (2017).
13. International Diabetes Federation: IDF Diabetes Atlas Ninth edition 2019. (2019).
14. Iyer, A. et al.: Diagnosis of Diabetes Using Classification Mining Techniques. Int. J. Data Min. Knowl. Manag. Process. 5, 1, 01–14
(2015). https://doi.org/10.5121/ijdkp.2015.5101.
15. Jakka, A., J, V.R.: Performance Evaluation of Machine Learning Models for Diabetes Prediction. Int. Journal of Innovative Technology
and Exploring Engineering. 8, 11, 1976–1980 (2019). https://doi.org/10.35940/ijitee.K2155.0981119.
16. Kaur, H., Kumari, V.: Predictive modelling and analytics for diabetes using a machine learning approach. Appl. Comput. Informatics.
(2019). https://doi.org/10.1016/j.aci.2018.12.004.
17. Naz, H., Ahuja, S.: Deep learning approach for diabetes prediction using PIMA Indian dataset. J. Diabetes Metab. Disord. 19, 1, 391–
403 (2020). https://doi.org/10.1007/s40200-020-00520-5.

18. Roy, K. et al.: An Enhanced Machine Learning Framework for Type 2 Diabetes Classification Using Imbalanced Data with Missing
Values. Complexity. 2021, (2021). https://doi.org/10.1155/2021/9953314.
19. Sengamuthu, R. et al.: Various Data Mining Techniques Analysis To Predict. Int. Res. J. Eng. Technol. 5, 5, 676–679 (2018).
20. Sloane, E.B. et al.: Cloud-based diabetes coaching platform for diabetes management. 3rd IEEE EMBS Int. Conf. Biomed. Heal.
Informatics, BHI 2016. 3, December 2012, 610–611 (2016). https://doi.org/10.1109/BHI.2016.7455972.
21. Sneha, N., Gangil, T.: Analysis of diabetes mellitus for early prediction using optimal features selection. J. Big Data. 6, 1, (2019).
https://doi.org/10.1186/s40537-019-0175-6.
22. Steffi, J. et al.: Predicting Diabetes Mellitus using Data Mining Techniques. Int. J. Eng. Dev. Res. 6, 2, 460–467 (2018).
23. Tigga, N.P., Garg, S.: Prediction of Type 2 Diabetes using Machine Learning Classification Methods. Procedia Comput. Sci. 167, 2019,
706–716 (2020). https://doi.org/10.1016/j.procs.2020.03.336.
24. Traymbak, S., Issar, N.: Data Mining Algorithms in Knowledge Management for Predicting Diabetes After Pregnancy by Using R. Indian
J. Comput. Sci. Eng. 12, 6, 1542–1558 (2021). https://doi.org/10.21817/indjcse/2021/v12i6/211206006.
25. World Health Organization: Global Report on Diabetes. WHO Library, Geneva (2016).
26. Zaman, Z. et al.: Assessing Machine Learning Methods for Predicting Diabetes among Pregnant Women. Int. J. Adv. Life Sci. Res. 05,
01, 29–34 (2022). https://doi.org/10.31632/ijalsr.2022.v05i01.005.
27. Diagnostic Criteria and Classification of Hyperglycaemia First Detected in Pregnancy. 1–63.
28. Pima Indians Diabetes Database | Kaggle, https://www.kaggle.com/uciml/pima-indians-diabetes-database, last accessed 2021/09/27.
