Download as pdf or txt
Download as pdf or txt
You are on page 1of 6

International Journal of Multidisciplinary Research and Explorer (IJMRE)

Comparative Analysis of Machine Learning


Algorithms to Predict Type II Diabetes
Puneet Misra Arun Singh Yadav
Department of Computer Science Department of Computer Science
University of Lucknow University of Lucknow
Lucknow, India Lucknow, India
puneetmisra@gmail.com arun.ai.lkouniv@gmail.com

Abstract— Machine Learning (ML) models are becoming makes the human body insulin resistance that causes other
robust and more accurate nowadays as the rapid increase in serious complications. Thus, the early and timely diagnosis
the amount and quality of training data. Researchers are of diabetes may prevent serious complications. The various
proposing complex models for real-life problems to achieve machine learning-based system has been developed in recent
higher accuracy, which requires high computing and other years to predict diabetes[5][6] still scientists and medical
resources. In the context of the healthcare disease diagnosis, experts evolving new and intelligent algorithms and proved
detection and prediction is still a challenge. that machine learning algorithms[7][8] performed better in
Early diagnosis of a disease or ailment helps in timely recovery. disease diagnosing. The capability to work on extensive,
Moreover, health been core to every individual, a lot of work is heterogeneous data taken from different sources and keep
being done in this field to improve upon by using all available improving the model performance by adding the background
information. details to make it a more powerful tool[9][10], [11]. The only
Current paper experiments on Pima Indian Diabetes Dataset objective of these developed systems is to improve the
(PIDDS) in two stages A and B. The main objective of this accuracy that leads to the correct prediction of the disease.
study is to review the accuracy of the applied machine learning
algorithms and analyze their efficiency in predictions. Another
The main objective of this study is to do a comparative
essential objective is to show the efficacy of simpler models. study of different supervised ML algorithms. We will
Fields like computer vision and NLP have given rise to deep investigate their logic, assumptions, feature selection,
learning with complex and high computational models setting preprocessing impact, etc. and will show that the less
the trend to apply them in almost all the fields While they help complex algorithms can do better on less complex problems.
where we have an abundance of data and complex The rest of the paper is organized as follows the Section
relationships, simpler models still can do wonders and on their II includes the related work and limitations of the previous
day can challenge these behemoths. We have also applied system. The article organized as following sections, section I
preprocessing methods (imputation, feature selection, scaling provides the brief introduction about the work, section II
and discretization) to improve the classification accuracy. The contains the related work, sections III focuses on the adopted
algorithms selected for this problem are Logistic regression methodology Type II diabetes prediction and section IV
(LR), Artificial Neural Networks (ANN), Support Vector included the comparison and discussion on the produced
Machine(SVM), Naïve Bayes (NB), and Decision Tree(DT). LR results by the various classifier.
provided the best accuracy, and the rest of the models are very
close to each other. II. REVIEW OF LITERATURE
Keywords—Machine learning, Disease prediction, Machine learning is the problem of induction, where general
classification, Preprocessing, Logistic Regression, ANN, Naïve rules are learned from a domain-specific observed data. It is
Bayes, SVM, Decision Tree. not feasible to know what representation or what algorithm
I. INTRODUCTION is best on the given problem beforehand. Without knowing
the problem so well, you probably don’t need machine
Intelligent learning for prediction and forecasting is the learning, to begin with. The machine learning model allows
topic that is under consideration in today's promising a section of preprocessing, which removes irrelevant
research related to Artificial Intelligence(AI). Learning is the information from the data sources. The removal of
critical requirement for any intelligent behavior. Researchers
unwanted data must be done very carefully by
have agreed that without learning, there is no intelligence.
understanding the nature of data and the correlation of
Therefore, machine learning has become a rapidly
developing subfield of AI research. These intelligent different features.
algorithms were from the very beginning designed and used The logistic regression method compares the relationship
to analyze medical, clinical information[1]. Machine learning among a dependent and independent features of the dataset.
algorithms analyze the historical data and extract the useful These variables are usually continuous. The LR predicts the
and not so apparent patterns from the dataset for prediction value of a dependent features using prior probability[12].
and diagnosis[2][3]. Challenges with medical data is that it is The outliers undoubtedly impact prediction accuracy. In
non-linear, heterogeneous, and noisy[4]. So that information paper [10], the author used distance-based outlier detection
needs to be preprocessed to get the better result. Diabetes is a as a preprocessing method and proposed a modified
severe health problem in which the amount of sugar content prediction model for diabetes type II prediction. The model
cannot be regulated. Type I diabetes is caused when the achieved 79% of accuracy by using the sigmoid function,
human body refused to produce insulin. Type II diabetes but after applying the Neuro based weight activation

http://doi-ds.org/doilink/01.2021-94842525/ Website: www.ijmre.com Volume No.1 Issue-1 16


International Journal of Multidisciplinary Research and Explorer (IJMRE)

function to calculate bipolar sigmoid, the accuracy reached III. EXPERIMENTAL SETUP
90.4%[13]. The impact of preprocessing techniques like In this study, we have used the famous Pima Indian
feature selection, missing values imputation and reducing Diabetes Dataset(PIDD)[23]. Pima Indian is a group of
class imbalance improves the classifier prediction of risk of Native Americans living in Southern Arizona. Due to some
30-days hospital readmission for diabetes patients [14]. A genetic issues, they take a poor diet of carbohydrates.
very slight improvement can be seen in the Naïve Bayes However, in recent years they moved towards processed
model after applying preprocessing technique as compared food rather than traditional agriculture food with minimum
to logistic regression and decision tree[14]. But this study physical activity. This sudden change in habit and food
also shows that the impact of these schemes varies by makes them the highest prevalence of type-II diabetes which
techniques and problem formulation. The problem of the makes them a reason for the research. This database was
highly skewed dataset can be overcome using subsampling, taken from the UCI machine learning library[24]. PIDD is a
but the class imbalance problem cannot produce a good benchmark for comparing methods and widely adopted free
prediction model. N. Barakat [15]proposed a hybrid diabetes datasets for research purposes [25] in the machine learning
prediction model using the SVM classifier. The author has community.
used K-means clustering algorithms for the preprocessing The experimental setup for this study is divided into two
scheme to handle the class imbalance problem. A total of stages. The first stage deals with data-preprocessing(A)
five clusters are derived from the dataset and every cluster methods as we have seen in the literature review and
positive samples are taken based on Euclidian distance. The previous study[26] that the data preprocessing has drastically
final dataset is divided into training and test dataset. Here improved the results. The preprocessed dataset of the first
the SVM provides promising results for diabetes prediction stage forms the input for the second stage classification(B)
with 94% accuracy. A. A. Al Jarullah[16] has used the J48 where the 5 ML methods make the predictions for diabetes.
decision tree classifier on the modified dataset (pre- All the experiments have done in this study on the jupyter
processed data). After applying the unsupervised k-means notebook[27] using python programming language. Here
clustering for class imbalance problem and numerical Ipython compiler is used to run python programs
discretization to make small groups of each attribute, the The methodology includes the data collection and
author achieved 78.17 % of accuracy. But the decision tree analyzing the nature, the preprocessing methods, and the
can do better and W. Chen et al.[17] has used k-means and predictions. The model proposed for this study are as
10-fold cross-validation technique for data pre-processing. follows:
The author significantly improved the performance of the
A. Data Preprocessing
decision tree model by maintaining the security and privacy
issues[18]. With this dataset, the author achieved 90.04% The PIDD contains 768 records of pregnant females with
accuracy on the PIDD dataset. The outlier problem may eight characteristics and one more column for the outcome.
produce the wrong result. R. Ramezaniet al.[19] used Each attribute is assigned the numeric value. In the dataset
multiple imputation methods for missing value treatment 65.10% (500 females) are non-diabetic (represented with
and OT for dimensionality reduction. This modified dataset value 0) and 34.90% (268 females) have diabetes
(represented with value 1). We have used the following
applied to the hybrid model LANFIS(Logistic Adaptive
attributes of PIDD dataset:
Network-based Fuzzy Inference System). This model has
achieved 88.05% accuracy. Sometimes the uncorrelated TABLE I. THE PIMA INDIAN DIABETES DATASET WITH A
variables reduce the performance of any learning model, so DESCRIPTION
finding uncorrelated attributes means the principal S.No. Parameter Description Data Type
components. M. K. M. Dhomse Kanchan B.[7] used the 1. PREGNANT Number of time women get Numeric
PCA as a pre-processing scheme. The modified dataset pregnant
applied to classifiers where SVM outperform after applying 2. PGLUCOSE Plasma glucose concentration Numeric
PCA. One of the closest work can be seen where the author measured using a 2-hour oral
glucose tolerance test in mm Hg
has used the PCA and some other unsupervised ml methods 3. DBP Diastolic blood pressure Numeric
for pre-processing. This pre-processed dataset then applied 4. INSULIN Two-hour serum insulin in Numeric
on ANN classifier which predicts diabetes with 92.28% muU/ml
accuracy[20]. Model selection for the problem is the biggest 5. TSFT Triceps skin fold thickness in mm Numeric
6. BMI Body mass index in mm2 Numeric
challenge where even the less complicated models can make
7. DPF Diabetes pedigree function Numeric
a better prediction but here, the quality of data plays a 8. AGE Age of the patient Numeric
significant role. H. Wu et al.[21] has done excellent work 9. OUTCOME Patient with diabetes onset within Nominal
on data using a feature selection approach with correlation five years(0 or 1)
check and k-means clustering. They prepared the data so
well that even the less complicated models like logistic The initial investigation of the dataset suggests that it is a
regression classified the diabetic positive and negative supervised classification problem. The PIDD contains
patient with 95.42 % accuracy. Naïve Bayes always worked several inconsistencies in it as the metadata shows no
better for imbalanced and missing data[22]. It fairly missing values but table 2 exhibits biologically implausible
achieved the 76.3% accuracy after applying k-means and zero values. This situation suggests that metadata is incorrect
weka tool filtering approach. and must be treated as missing values. Some of the
previously published studies have overlooked this and

http://doi-ds.org/doilink/01.2021-94842525/ Website: www.ijmre.com Volume No.1 Issue-1 17


International Journal of Multidisciplinary Research and Explorer (IJMRE)

directly used them as recorded. However, this was a serious B. Classification


concern because INSULIN variable has more than 40% Classification is the task of assigning the new observation
values are zero. After that, researchers start treating them as to the class to which they most likely belong, i.e. close to
missing data and have published several studies. The
the accuracy, based on the classification model built from
occurrences of zero value in different variables are as
the labeled training data. E.g., A good classifier can predict
follows:
the condition of the patient in the future based on various
TABLE II. OCCURRENCES OF ZERO IN DIFFERENT VARIABLES symptoms and other parameters.
The classification can be binary and multilevel. When only
S.No. Variable No. of Zero
two target classes are there in the problem, it is known as
1. PREGNANT 111
2. PGLUCOSE 5 binary classification. For example, whether the patient has
3. DBP 35 type-2 diabetes or not? Nevertheless, in multilevel
4. TSFT 227 classification, there must be more than one target class
5. INSULIN 374 present in the problem statement. For example, a patient
6. BMI 11 admitted in the ICU has a low, medium and high risk of
7. DPF 0
8. AGE 0
mortality. The dataset taken for this study is a binary
classification problem.
However, we cannot be sure in some cases that the In the machine learning approach, the actual dataset is
presence of zero should be treated as missing or not. In case divided into two parts. The first part of the data(training
of variable PREGNANT (number of times a woman gets data) is used to build the classification model by training it
pregnant), it can be zero times or more than one both cases and the second part(test data)validates the model accuracy.
can be considered but treat it as non-missing is more relevant Splitting of data must be done carefully else the information
than missing instance. Missing data can severely distort the leakage can happen from test data. In this study, we have
correlation between the variables. In the case of BMI and used train_test_split() method of Scikit-Learn library of
TFST both variables used to measure obesity and must be python. Through this function, we divide the dataset into a
highly correlated, but the computed correlation coefficient different ratio. However, 80/20 (train/test) rule is mostly
recorded 0.393, which is a weak positive correlation. After used in the studies. We have used the following
removing the record of zero instances of TFST yields classification algorithms:
correlation coefficient 0.632(highly positive). Instead of
removing the missing instances, we have calculated the 1) Logistic Regression(LR): It is a supervised machine
feature importance on the sample with no missing values learning algorithm borrowed from the traditional statistics
using Recursive Feature Elimination (RFE). To get more
which uses a Logistic function called sigmoid function g(z)
confidence in feature selection k-fold cross-validation with
Stratified k-fold is used. that takes any value (independent variables) and predicts the
discrete categories (dependent variables) between 0 and 1.
But using OvR technique, this model extended to multiclass
classification. As it is borrowed from the linear regression, so
the z value is similar to linear regression:

The means to p(y=1|x), i.e. the probability of


predicted positive events. For example, the probability that
the patient has type II diabetes, given features x. So, the
inverse probability, not having disease p(y=0│x)=1-h(θ).
Logistic regression uses cross-entropy as a loss function due
to non-linear sigmoid function at the end. The cost function
will use two equations as given below:
Fig. 1. The most suitable & number of
features for predictions

The top-performing features are based on Recursive In this experiment, we have used GridSearch with k-fold
Feature Elimination with Cross-Validation (RFECV) results cross-validation to find the best parameters for the LR. The
are PREGNANT, PGLUCOSE, BMI, and DPF. The selected parameters used with LR are given as follows:
features have few missing values which have replaced by the
mean. After selecting the above features, we have used the TABLE III. PARAMETERS USED IN LR AS RETURNED BY GRID SEARCH
complete dataset for the experiment. Scaling is performed on Parameters Values
features at unit variance for efficient learning. The C 1
preprocessed dataset is then used in Stage B for the Penalty L1
predictions. Solver newton-cg

http://doi-ds.org/doilink/01.2021-94842525/ Website: www.ijmre.com Volume No.1 Issue-1 18


International Journal of Multidisciplinary Research and Explorer (IJMRE)

window functions or kernels are responsible for converting


the inputs into the required format. SVM have different types
2) Artificial Neural Network(ANN): In this study, we of kernels according to the problem like linear, non-linear,
have used a Multilayer Perceptron (MLP) model of Artificial polynomial, radial basis function (RBF) and sigmoid. It
Neural Network. It is a supervised algorithm that learns from returns the inner product of two points in a suitable feature
the labeled training set of the given data for the function space and thus can work well with a high dimensional
f(.):Rm==>Ro. Here m represents the number of features dataset. In this experiment RBF, the most popular kernel is
given as an input vector whereas o denotes the number of used. Gamma and C parameters are tuned to get the optimal
features for the output vector. It learns the non-linear values to achieve higher accuracy.
function approximation for regression or classification
problem from the given independent variable X=x1,x2,x3….. TABLE V. OPTIMAL PARAMETER COMBINATION USED IN SVM USING
GRID SEARCH
and dependent variable Y. We have used MLP classifier for
our problem. The gradient descent approach used in ANN Parameters Levels
training. These gradients are calculated using Kernal Rbf
Gamma 0.05
backpropagation which reduces cross-entropy loss function Regularization (C) 12
in classification. Two-layer feed-forward backpropagation
neural network is employed for the experiment in this paper. 5) Decision Tree(DT): It constructs a hierarchical tree-
Grid search was utilized for the optimal parameter setting of like structure of the given training data. It divides the
ANN. Parameters used in this experiment are given below in training data on the value of a feature. This model learns
the table : decision rules inferred from the features and predict the
target class. In this experiment, we have used the CART
(classification and regression tree) algorithm of decision tree
TABLE IV. PARAMETERS USED FOR ANN SUGGESTED BY
because the training space has only numerical
GRIDSEARCH values. CART creates the binary tree using the features and
Parameters Levels
threshold that yields the maximum information gain using
Learning rate Constant Gini index at each node. We have used the Decision Tree
Hidden layers 2 classifier from sklearn library that contains fourteen
Activation Relu different parameters, but we tune only two parameters that
Maximum Iterations 500
are max_depth and min_samples_split to control the size
As we have used four best features to train the model so and complexity of the tree. The optimal parameters used in
that the input layer comprises four neurons. Each neuron the model are given in the table below:
represents a unique feature. One hidden layer was used with
five hundred neurons as set in the maximum iterations. TABLE VI. BEST PARAMETERS GIVEN BY GRID SEARCH FOR DT
Similarly, the result was obtained from another hidden layer Parameters Values
with the constant initial learning rate 0.001 and activation Maximum depth of the model 3
function relu. Minimum samples to split 2

3) Naïve Bayes(NB): It is a probabilistic method that IV. EVALUATION MEASURES AND RESULT
applies Bayes theorem. It calculates the probability of a Accuracy, sensitivity and specificity matrices are used in
given record belonging to a specific class. It assumes that this experiment to evaluate the performance of predictions of
given the class, features are statistically independent of each the model. If the training space is balanced correctly, then
other. This assumption is called class conditional the accuracy measure is enough to evaluate the model
independence, which significantly simplifies the learning performance. However, in this experiment, the target
process. It is a generative method that generates the data variable is imbalanced, i.e. 34.9% are diabetic and 65.1% are
from the assumptions and distributions and then uses this non-diabetic patients that’s why precision, recall, and F-score
prior knowledge to predict the unseen data. It performs measures have used. To calculate all these measures
better on less training data despite naive assumptions. NB is confusion matrix is needed that are True Positive, False
Positive, True Negative, False Negative. The formulation of
always the best choice for quick and dirty implementation
the measures is given below in the table:
and considered to be the benchmark. In this experiment, we
have used Gaussian Naive Bayes to predict likelihood. We TABLE VII. MEASURES USED FOR MODEL EVALUATION
have not used a grid search for NB because it has nothing to
Matric Formula
tune. Precision(P) TP/(TP+FP) & TN/(TN+FN)
Recall(R) TP/(TP+FN) & TN/(TN+FP)
4) Support Vector Machine (SVM): Support vector F1-Score 2*P*R/(P+R)
classifier is also called the maximum margin classifier Accuracy (TP+TN)/(TP+TN+FP+FN)
because it creates the maximum margin hyperplane. to
achieve this the decision boundary defined to maximize the All the tests were conducted on the discussed
margin between the positive and negative classes. The experimental setup and only the best results are taken of the

http://doi-ds.org/doilink/01.2021-94842525/ Website: www.ijmre.com Volume No.1 Issue-1 19


International Journal of Multidisciplinary Research and Explorer (IJMRE)

discussed model evaluation matrices. The results of each true in all respect and depends on the nature of data, its
prediction model are reported in table 7 and the comparative quality, volume, etc. It is also possible that complex models
chart of the model performance is given in figure 3. The can give better results by going the deep dive in the problem
results produced by each model are satisfactory with the set and its inconsistencies.
average accuracy of 80%, while the best is achieved around
84% by the LR. From the recorded values in table 7, LR has REFERENCES
identified diabetic patients with a high recall of 84% but at
the same time, model has classified the non-diabetic patients [1] G. D. Magoulas and A. Prentza, “Machine Learning in Medical
with 84% precision. The computed harmonic mean(F1- Applications,” Mach. Learn. Its Appl., vol. 2049, pp. 300–307,
Score) of the precision and recall for LR is also 84%. The 2001, doi: 10.1007/3-540-44673-7_19.
performance of ANN is very close to the best performing [2] A. J. Frandsen, “Machine Learning for Disease Prediction,”
thesis, p. Paper 5975, 2016.
model and has achieved 81% of accuracy. ANN is more [3] I. Kononenko, “Machine learning for medical diagnosis: history,
complicated than LR, challenging to train, overfitting issues, state of the art and perspective.,” Artif. Intell. Med., vol. 23, no. 1,
difficult to optimize, and need large number of training pp. 89–109, 2001, doi: 10.1016/S0933-3657(01)00077-X.
examples for generalization. [4] E. Menasalvas and C. Gonzalo-Martin, “Challenges of Medical
Text and Image Processing: Machine Learning Approaches,”
TABLE VIII. BEST RESULT OBTAINED FROM EACH MODEL ON USED Springer, Cham, 2016, pp. 221–242.
EVALUATION MEASURES [5] S. Habibi, M. Ahmadi, and S. Alizadeh, “Type 2 Diabetes
Mellitus Screening and Risk Factors Using Decision Tree:
Model Precision Recall F1-Score Accuracy Results of Data Mining,” Glob. J. Health Sci., vol. 7, no. 5, pp.
LR 0.84 0.84 0.84 0.837 304–310, Sep. 2015, doi: 10.5539/gjhs.v7n5p304.
ANN 0.81 0.81 0.81 0.817 [6] B. Farran, A. M. Channanath, K. Behbehani, and T. A. Thanaraj,
NB 0.80 0.80 0.81 0.805 “Predictive models to assess risk of type 2 diabetes, hypertension
SVM 0.80 0.80 0.80 0.798 and comorbidity: machine-learning algorithms and validation
DT 0.80 0.81 0.80 0.805 using national health data from Kuwait—a cohort study,” BMJ
In our case, ANN could perform better if we have more Open, vol. 3, no. 5, p. e002457, May 2013, doi:
10.1136/bmjopen-2012-002457.
training examples and a balanced dataset. The Gaussian [7] M. K. M. Dhomse Kanchan B., “Study of Machine Learning
Naïve Bayes(NB) and the Decision Tree(DT) predicted with Algorithms for Special Disease Prediction using Principal of
80% accuracy, but DT predicted diabetic patients better than Component Analysis,” 2016 Int. Conf. Glob. Trends Signal
NB. The worst performance is given by SVM with 79% Process. Inf. Comput. Commun., pp. 5–10, 2016.
[8] I. Kavakiotis, O. Tsave, A. Salifoglou, N. Maglaveras, I.
accuracy. The SVM performs worse with a small dataset this Vlahavas, and I. Chouvarda, “Machine Learning and Data
is because the data points near the support vectors (decision Mining Methods in Diabetes Research,” Comput. Struct.
boundary) may not be a true representation of classification Biotechnol. J., vol. 15, pp. 104–116, 2017, doi:
decision boundary and thus creates the false maximum 10.1016/j.csbj.2016.12.005.
[9] S. Gambhir, S. K. Malik, and Y. Kumar, “Role of Soft
margin hyperplane. Computing Approaches in HealthCare Domain : A Mini
Review,” J. Med. Syst., 2016, doi: 10.1007/s10916-016-0651-x.
[10] N. Kumar and S. Kumar, “A novel scale approach for load
V. CONCLUSION balancing with hardware and software in cloud and iot platform,”
J. Theor. Appl. Inf. Technol., vol. 97, 2019.
This experimental study aims to do a comparative [11] N. Kumar and S. Kumar, “Virtual Machine Placement Using
analysis of different ML models for predicting Type II Statistical Mechanism in Cloud Computing Environment,” Int. J.
diabetic patients. As we have shown in the literature review Appl. Evol. Comput., vol. 9, no. 3, pp. 23–31, 2018, doi:
that many complex ML models have accurately predicted 10.4018/ijaec.2018070103.
Type II diabetic and non-diabetic patients with greater [12] L. J. Davis and K. P. Offord, “Logistic regression:Modeling
Conditional Probabilities,” Emerg. Issues Methods Personal.
accuracy. But we hypothesize that even the simplest ML Assess., pp. 273–283, 2013, doi: 10.4324/9780203774618-23.
model can do better than complex models if we properly [13] M. Nirmala Devi, A. A. Balamurugan, and M. Reshma Kris,
examine the problem type and apply the suitable “Developing a modified logistic regression model for diabetes
preprocessing techniques. In previous studies, authors have mellitus and identifying the0 important factors of type II DM,”
trimmed the dataset to treat the inconsistencies but, in this Indian J. Sci. Technol., vol. 9, no. 4, pp. 1–8, 2016, doi:
10.17485/ijst/2016/v9i4/87028.
experiment, we have taken a complete dataset. We have [14] R. Duggal, S. Shukla, S. Chandra, B. Shukla, and S. K. Khatri,
applied the RFECV method on the complete dataset and “Impact of selected pre-processing techniques on prediction of
found that the features that contain the many missing values risk of early readmission for diabetic patients in India,” Int. J.
have minimal impact on the prediction accuracy and the top Diabetes Dev. Ctries., vol. 36, no. 4, pp. 469–476, 2016, doi:
four features are giving the best result. After excluding these 10.1007/s13410-016-0495-4.
[15] N. H. Barakat, A. P. Bradley, and M. N. H. Barakat, “Intelligible
features preprocessed dataset applied on different ML support vector machines for diagnosis of diabetes mellitus.,”
models, LR outperforms on all used model evaluation IEEE Trans. Inf. Technol. Biomed., vol. 14, no. 4, pp. 1114–
measures. While another complex model performs 1120, 2010, doi: 10.1109/TITB.2009.2039485.
approximately the same on all measures. We have also [16] A. A. Al Jarullah, “Decision tree discovery for the diagnosis of
observed that the feature selection method on the few type II diabetes,” 2011 Int. Conf. Innov. Inf. Technol., pp. 303–
dimensions (in our case 8 independent features) has 307, 2011, doi: 10.1109/INNOVATIONS.2011.5893838.
[17] W. Chen, S. Chen, H. Zhang, and T. Wu, “A hybrid prediction
contributed to improving the model accuracy and has helped model for type 2 diabetes using K-means and decision tree,”
to avoid the severe concerns like multicollinearity. Proc. IEEE Int. Conf. Softw. Eng. Serv. Sci. ICSESS, vol. 2017-
Novem, no. 61272399, 2018, doi:
This study tries to establish the fact that not every time 10.1109/ICSESS.2017.8342938.
we need to with highly complex models and even the less [18] J. Kumar, “Cloud computing security issues and its challenges: A
sophisticated models can give better accuracy. But this is not comprehensive research,” Int. J. Recent Technol. Eng., vol. 8, no.

http://doi-ds.org/doilink/01.2021-94842525/ Website: www.ijmre.com Volume No.1 Issue-1 20


International Journal of Multidisciplinary Research and Explorer (IJMRE)

1 Special Issue 4, pp. 10–14, 2019. [24] “PIMA INDIAN DIABETES DATASET,” UCI Machine
[19] R. Ramezani, M. Maadi, and S. M. Khatami, “A novel hybrid Learning Repository, 1988.
intelligent system with missing value imputation for diabetes http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
diagnosis,” Alexandria Eng. J., 2016, doi: (accessed Apr. 15, 2018).
10.1016/j.aej.2017.03.043. [25] A. Idri, H. Benhar, J. L. Fernández-Alemán, and I. Kadi, “A
[20] M. Nilashi, O. Ibrahim, M. Dalvi, H. Ahmadi, and L. systematic map of medical data preprocessing in knowledge
Shahmoradi, “Accuracy Improvement for Diabetes Disease discovery,” Comput. Methods Programs Biomed., vol. 162, pp.
Classification: A Case on a Public Medical Dataset,” Fuzzy Inf. 69–85, 2018, doi: 10.1016/j.cmpb.2018.05.007.
Eng., vol. 9, no. 3, pp. 345–357, 2017, doi: [26] P. Misra and A. Yadav, “Impact of Preprocessing Methods on
10.1016/j.fiae.2017.09.006. Healthcare Predictions,” SSRN Electron. J., Jan. 2019, doi:
[21] H. Wu, S. Yang, Z. Huang, J. He, and X. Wang, “Type 2 diabetes 10.2139/ssrn.3349586.
mellitus prediction model based on data mining,” Informatics [27] D. Avila, M. Bussonnier, S. Corlay, Brian Granger, and J. Grout,
Med. Unlocked, vol. 10, pp. 100–107, 2018, doi: “Juypter Notebook with Ipython,” 2014. http://jupyter.org/install
10.1016/j.imu.2017.12.006. (accessed May 18, 2018).
[22] D. Sisodia and D. S. Sisodia, “Prediction of Diabetes using
Classification Algorithms,” Procedia Comput. Sci., vol. 132, no.
Iccids, pp. 1578–1585, 2018, doi: 10.1016/j.procs.2018.05.122.
[23] C. O. Rep, “HHS Public Access,” vol. 4, no. 1, pp. 92–98, 2016,
doi: 10.1007/s13679-014-0132-9.High-Risk.

http://doi-ds.org/doilink/01.2021-94842525/ Website: www.ijmre.com Volume No.1 Issue-1 21

You might also like