Professional Documents
Culture Documents
Comparative Analysis of Machine Learning Algorithms To Predict Type II Diabetes
Comparative Analysis of Machine Learning Algorithms To Predict Type II Diabetes
Abstract— Machine Learning (ML) models are becoming makes the human body insulin resistance that causes other
robust and more accurate nowadays as the rapid increase in serious complications. Thus, the early and timely diagnosis
the amount and quality of training data. Researchers are of diabetes may prevent serious complications. The various
proposing complex models for real-life problems to achieve machine learning-based system has been developed in recent
higher accuracy, which requires high computing and other years to predict diabetes[5][6] still scientists and medical
resources. In the context of the healthcare disease diagnosis, experts evolving new and intelligent algorithms and proved
detection and prediction is still a challenge. that machine learning algorithms[7][8] performed better in
Early diagnosis of a disease or ailment helps in timely recovery. disease diagnosing. The capability to work on extensive,
Moreover, health been core to every individual, a lot of work is heterogeneous data taken from different sources and keep
being done in this field to improve upon by using all available improving the model performance by adding the background
information. details to make it a more powerful tool[9][10], [11]. The only
Current paper experiments on Pima Indian Diabetes Dataset objective of these developed systems is to improve the
(PIDDS) in two stages A and B. The main objective of this accuracy that leads to the correct prediction of the disease.
study is to review the accuracy of the applied machine learning
algorithms and analyze their efficiency in predictions. Another
The main objective of this study is to do a comparative
essential objective is to show the efficacy of simpler models. study of different supervised ML algorithms. We will
Fields like computer vision and NLP have given rise to deep investigate their logic, assumptions, feature selection,
learning with complex and high computational models setting preprocessing impact, etc. and will show that the less
the trend to apply them in almost all the fields While they help complex algorithms can do better on less complex problems.
where we have an abundance of data and complex The rest of the paper is organized as follows the Section
relationships, simpler models still can do wonders and on their II includes the related work and limitations of the previous
day can challenge these behemoths. We have also applied system. The article organized as following sections, section I
preprocessing methods (imputation, feature selection, scaling provides the brief introduction about the work, section II
and discretization) to improve the classification accuracy. The contains the related work, sections III focuses on the adopted
algorithms selected for this problem are Logistic regression methodology Type II diabetes prediction and section IV
(LR), Artificial Neural Networks (ANN), Support Vector included the comparison and discussion on the produced
Machine(SVM), Naïve Bayes (NB), and Decision Tree(DT). LR results by the various classifier.
provided the best accuracy, and the rest of the models are very
close to each other. II. REVIEW OF LITERATURE
Keywords—Machine learning, Disease prediction, Machine learning is the problem of induction, where general
classification, Preprocessing, Logistic Regression, ANN, Naïve rules are learned from a domain-specific observed data. It is
Bayes, SVM, Decision Tree. not feasible to know what representation or what algorithm
I. INTRODUCTION is best on the given problem beforehand. Without knowing
the problem so well, you probably don’t need machine
Intelligent learning for prediction and forecasting is the learning, to begin with. The machine learning model allows
topic that is under consideration in today's promising a section of preprocessing, which removes irrelevant
research related to Artificial Intelligence(AI). Learning is the information from the data sources. The removal of
critical requirement for any intelligent behavior. Researchers
unwanted data must be done very carefully by
have agreed that without learning, there is no intelligence.
understanding the nature of data and the correlation of
Therefore, machine learning has become a rapidly
developing subfield of AI research. These intelligent different features.
algorithms were from the very beginning designed and used The logistic regression method compares the relationship
to analyze medical, clinical information[1]. Machine learning among a dependent and independent features of the dataset.
algorithms analyze the historical data and extract the useful These variables are usually continuous. The LR predicts the
and not so apparent patterns from the dataset for prediction value of a dependent features using prior probability[12].
and diagnosis[2][3]. Challenges with medical data is that it is The outliers undoubtedly impact prediction accuracy. In
non-linear, heterogeneous, and noisy[4]. So that information paper [10], the author used distance-based outlier detection
needs to be preprocessed to get the better result. Diabetes is a as a preprocessing method and proposed a modified
severe health problem in which the amount of sugar content prediction model for diabetes type II prediction. The model
cannot be regulated. Type I diabetes is caused when the achieved 79% of accuracy by using the sigmoid function,
human body refused to produce insulin. Type II diabetes but after applying the Neuro based weight activation
function to calculate bipolar sigmoid, the accuracy reached III. EXPERIMENTAL SETUP
90.4%[13]. The impact of preprocessing techniques like In this study, we have used the famous Pima Indian
feature selection, missing values imputation and reducing Diabetes Dataset(PIDD)[23]. Pima Indian is a group of
class imbalance improves the classifier prediction of risk of Native Americans living in Southern Arizona. Due to some
30-days hospital readmission for diabetes patients [14]. A genetic issues, they take a poor diet of carbohydrates.
very slight improvement can be seen in the Naïve Bayes However, in recent years they moved towards processed
model after applying preprocessing technique as compared food rather than traditional agriculture food with minimum
to logistic regression and decision tree[14]. But this study physical activity. This sudden change in habit and food
also shows that the impact of these schemes varies by makes them the highest prevalence of type-II diabetes which
techniques and problem formulation. The problem of the makes them a reason for the research. This database was
highly skewed dataset can be overcome using subsampling, taken from the UCI machine learning library[24]. PIDD is a
but the class imbalance problem cannot produce a good benchmark for comparing methods and widely adopted free
prediction model. N. Barakat [15]proposed a hybrid diabetes datasets for research purposes [25] in the machine learning
prediction model using the SVM classifier. The author has community.
used K-means clustering algorithms for the preprocessing The experimental setup for this study is divided into two
scheme to handle the class imbalance problem. A total of stages. The first stage deals with data-preprocessing(A)
five clusters are derived from the dataset and every cluster methods as we have seen in the literature review and
positive samples are taken based on Euclidian distance. The previous study[26] that the data preprocessing has drastically
final dataset is divided into training and test dataset. Here improved the results. The preprocessed dataset of the first
the SVM provides promising results for diabetes prediction stage forms the input for the second stage classification(B)
with 94% accuracy. A. A. Al Jarullah[16] has used the J48 where the 5 ML methods make the predictions for diabetes.
decision tree classifier on the modified dataset (pre- All the experiments have done in this study on the jupyter
processed data). After applying the unsupervised k-means notebook[27] using python programming language. Here
clustering for class imbalance problem and numerical Ipython compiler is used to run python programs
discretization to make small groups of each attribute, the The methodology includes the data collection and
author achieved 78.17 % of accuracy. But the decision tree analyzing the nature, the preprocessing methods, and the
can do better and W. Chen et al.[17] has used k-means and predictions. The model proposed for this study are as
10-fold cross-validation technique for data pre-processing. follows:
The author significantly improved the performance of the
A. Data Preprocessing
decision tree model by maintaining the security and privacy
issues[18]. With this dataset, the author achieved 90.04% The PIDD contains 768 records of pregnant females with
accuracy on the PIDD dataset. The outlier problem may eight characteristics and one more column for the outcome.
produce the wrong result. R. Ramezaniet al.[19] used Each attribute is assigned the numeric value. In the dataset
multiple imputation methods for missing value treatment 65.10% (500 females) are non-diabetic (represented with
and OT for dimensionality reduction. This modified dataset value 0) and 34.90% (268 females) have diabetes
(represented with value 1). We have used the following
applied to the hybrid model LANFIS(Logistic Adaptive
attributes of PIDD dataset:
Network-based Fuzzy Inference System). This model has
achieved 88.05% accuracy. Sometimes the uncorrelated TABLE I. THE PIMA INDIAN DIABETES DATASET WITH A
variables reduce the performance of any learning model, so DESCRIPTION
finding uncorrelated attributes means the principal S.No. Parameter Description Data Type
components. M. K. M. Dhomse Kanchan B.[7] used the 1. PREGNANT Number of time women get Numeric
PCA as a pre-processing scheme. The modified dataset pregnant
applied to classifiers where SVM outperform after applying 2. PGLUCOSE Plasma glucose concentration Numeric
PCA. One of the closest work can be seen where the author measured using a 2-hour oral
glucose tolerance test in mm Hg
has used the PCA and some other unsupervised ml methods 3. DBP Diastolic blood pressure Numeric
for pre-processing. This pre-processed dataset then applied 4. INSULIN Two-hour serum insulin in Numeric
on ANN classifier which predicts diabetes with 92.28% muU/ml
accuracy[20]. Model selection for the problem is the biggest 5. TSFT Triceps skin fold thickness in mm Numeric
6. BMI Body mass index in mm2 Numeric
challenge where even the less complicated models can make
7. DPF Diabetes pedigree function Numeric
a better prediction but here, the quality of data plays a 8. AGE Age of the patient Numeric
significant role. H. Wu et al.[21] has done excellent work 9. OUTCOME Patient with diabetes onset within Nominal
on data using a feature selection approach with correlation five years(0 or 1)
check and k-means clustering. They prepared the data so
well that even the less complicated models like logistic The initial investigation of the dataset suggests that it is a
regression classified the diabetic positive and negative supervised classification problem. The PIDD contains
patient with 95.42 % accuracy. Naïve Bayes always worked several inconsistencies in it as the metadata shows no
better for imbalanced and missing data[22]. It fairly missing values but table 2 exhibits biologically implausible
achieved the 76.3% accuracy after applying k-means and zero values. This situation suggests that metadata is incorrect
weka tool filtering approach. and must be treated as missing values. Some of the
previously published studies have overlooked this and
The top-performing features are based on Recursive In this experiment, we have used GridSearch with k-fold
Feature Elimination with Cross-Validation (RFECV) results cross-validation to find the best parameters for the LR. The
are PREGNANT, PGLUCOSE, BMI, and DPF. The selected parameters used with LR are given as follows:
features have few missing values which have replaced by the
mean. After selecting the above features, we have used the TABLE III. PARAMETERS USED IN LR AS RETURNED BY GRID SEARCH
complete dataset for the experiment. Scaling is performed on Parameters Values
features at unit variance for efficient learning. The C 1
preprocessed dataset is then used in Stage B for the Penalty L1
predictions. Solver newton-cg
3) Naïve Bayes(NB): It is a probabilistic method that IV. EVALUATION MEASURES AND RESULT
applies Bayes theorem. It calculates the probability of a Accuracy, sensitivity and specificity matrices are used in
given record belonging to a specific class. It assumes that this experiment to evaluate the performance of predictions of
given the class, features are statistically independent of each the model. If the training space is balanced correctly, then
other. This assumption is called class conditional the accuracy measure is enough to evaluate the model
independence, which significantly simplifies the learning performance. However, in this experiment, the target
process. It is a generative method that generates the data variable is imbalanced, i.e. 34.9% are diabetic and 65.1% are
from the assumptions and distributions and then uses this non-diabetic patients that’s why precision, recall, and F-score
prior knowledge to predict the unseen data. It performs measures have used. To calculate all these measures
better on less training data despite naive assumptions. NB is confusion matrix is needed that are True Positive, False
Positive, True Negative, False Negative. The formulation of
always the best choice for quick and dirty implementation
the measures is given below in the table:
and considered to be the benchmark. In this experiment, we
have used Gaussian Naive Bayes to predict likelihood. We TABLE VII. MEASURES USED FOR MODEL EVALUATION
have not used a grid search for NB because it has nothing to
Matric Formula
tune. Precision(P) TP/(TP+FP) & TN/(TN+FN)
Recall(R) TP/(TP+FN) & TN/(TN+FP)
4) Support Vector Machine (SVM): Support vector F1-Score 2*P*R/(P+R)
classifier is also called the maximum margin classifier Accuracy (TP+TN)/(TP+TN+FP+FN)
because it creates the maximum margin hyperplane. to
achieve this the decision boundary defined to maximize the All the tests were conducted on the discussed
margin between the positive and negative classes. The experimental setup and only the best results are taken of the
discussed model evaluation matrices. The results of each true in all respect and depends on the nature of data, its
prediction model are reported in table 7 and the comparative quality, volume, etc. It is also possible that complex models
chart of the model performance is given in figure 3. The can give better results by going the deep dive in the problem
results produced by each model are satisfactory with the set and its inconsistencies.
average accuracy of 80%, while the best is achieved around
84% by the LR. From the recorded values in table 7, LR has REFERENCES
identified diabetic patients with a high recall of 84% but at
the same time, model has classified the non-diabetic patients [1] G. D. Magoulas and A. Prentza, “Machine Learning in Medical
with 84% precision. The computed harmonic mean(F1- Applications,” Mach. Learn. Its Appl., vol. 2049, pp. 300–307,
Score) of the precision and recall for LR is also 84%. The 2001, doi: 10.1007/3-540-44673-7_19.
performance of ANN is very close to the best performing [2] A. J. Frandsen, “Machine Learning for Disease Prediction,”
thesis, p. Paper 5975, 2016.
model and has achieved 81% of accuracy. ANN is more [3] I. Kononenko, “Machine learning for medical diagnosis: history,
complicated than LR, challenging to train, overfitting issues, state of the art and perspective.,” Artif. Intell. Med., vol. 23, no. 1,
difficult to optimize, and need large number of training pp. 89–109, 2001, doi: 10.1016/S0933-3657(01)00077-X.
examples for generalization. [4] E. Menasalvas and C. Gonzalo-Martin, “Challenges of Medical
Text and Image Processing: Machine Learning Approaches,”
TABLE VIII. BEST RESULT OBTAINED FROM EACH MODEL ON USED Springer, Cham, 2016, pp. 221–242.
EVALUATION MEASURES [5] S. Habibi, M. Ahmadi, and S. Alizadeh, “Type 2 Diabetes
Mellitus Screening and Risk Factors Using Decision Tree:
Model Precision Recall F1-Score Accuracy Results of Data Mining,” Glob. J. Health Sci., vol. 7, no. 5, pp.
LR 0.84 0.84 0.84 0.837 304–310, Sep. 2015, doi: 10.5539/gjhs.v7n5p304.
ANN 0.81 0.81 0.81 0.817 [6] B. Farran, A. M. Channanath, K. Behbehani, and T. A. Thanaraj,
NB 0.80 0.80 0.81 0.805 “Predictive models to assess risk of type 2 diabetes, hypertension
SVM 0.80 0.80 0.80 0.798 and comorbidity: machine-learning algorithms and validation
DT 0.80 0.81 0.80 0.805 using national health data from Kuwait—a cohort study,” BMJ
In our case, ANN could perform better if we have more Open, vol. 3, no. 5, p. e002457, May 2013, doi:
10.1136/bmjopen-2012-002457.
training examples and a balanced dataset. The Gaussian [7] M. K. M. Dhomse Kanchan B., “Study of Machine Learning
Naïve Bayes(NB) and the Decision Tree(DT) predicted with Algorithms for Special Disease Prediction using Principal of
80% accuracy, but DT predicted diabetic patients better than Component Analysis,” 2016 Int. Conf. Glob. Trends Signal
NB. The worst performance is given by SVM with 79% Process. Inf. Comput. Commun., pp. 5–10, 2016.
[8] I. Kavakiotis, O. Tsave, A. Salifoglou, N. Maglaveras, I.
accuracy. The SVM performs worse with a small dataset this Vlahavas, and I. Chouvarda, “Machine Learning and Data
is because the data points near the support vectors (decision Mining Methods in Diabetes Research,” Comput. Struct.
boundary) may not be a true representation of classification Biotechnol. J., vol. 15, pp. 104–116, 2017, doi:
decision boundary and thus creates the false maximum 10.1016/j.csbj.2016.12.005.
[9] S. Gambhir, S. K. Malik, and Y. Kumar, “Role of Soft
margin hyperplane. Computing Approaches in HealthCare Domain : A Mini
Review,” J. Med. Syst., 2016, doi: 10.1007/s10916-016-0651-x.
[10] N. Kumar and S. Kumar, “A novel scale approach for load
V. CONCLUSION balancing with hardware and software in cloud and iot platform,”
J. Theor. Appl. Inf. Technol., vol. 97, 2019.
This experimental study aims to do a comparative [11] N. Kumar and S. Kumar, “Virtual Machine Placement Using
analysis of different ML models for predicting Type II Statistical Mechanism in Cloud Computing Environment,” Int. J.
diabetic patients. As we have shown in the literature review Appl. Evol. Comput., vol. 9, no. 3, pp. 23–31, 2018, doi:
that many complex ML models have accurately predicted 10.4018/ijaec.2018070103.
Type II diabetic and non-diabetic patients with greater [12] L. J. Davis and K. P. Offord, “Logistic regression:Modeling
Conditional Probabilities,” Emerg. Issues Methods Personal.
accuracy. But we hypothesize that even the simplest ML Assess., pp. 273–283, 2013, doi: 10.4324/9780203774618-23.
model can do better than complex models if we properly [13] M. Nirmala Devi, A. A. Balamurugan, and M. Reshma Kris,
examine the problem type and apply the suitable “Developing a modified logistic regression model for diabetes
preprocessing techniques. In previous studies, authors have mellitus and identifying the0 important factors of type II DM,”
trimmed the dataset to treat the inconsistencies but, in this Indian J. Sci. Technol., vol. 9, no. 4, pp. 1–8, 2016, doi:
10.17485/ijst/2016/v9i4/87028.
experiment, we have taken a complete dataset. We have [14] R. Duggal, S. Shukla, S. Chandra, B. Shukla, and S. K. Khatri,
applied the RFECV method on the complete dataset and “Impact of selected pre-processing techniques on prediction of
found that the features that contain the many missing values risk of early readmission for diabetic patients in India,” Int. J.
have minimal impact on the prediction accuracy and the top Diabetes Dev. Ctries., vol. 36, no. 4, pp. 469–476, 2016, doi:
four features are giving the best result. After excluding these 10.1007/s13410-016-0495-4.
[15] N. H. Barakat, A. P. Bradley, and M. N. H. Barakat, “Intelligible
features preprocessed dataset applied on different ML support vector machines for diagnosis of diabetes mellitus.,”
models, LR outperforms on all used model evaluation IEEE Trans. Inf. Technol. Biomed., vol. 14, no. 4, pp. 1114–
measures. While another complex model performs 1120, 2010, doi: 10.1109/TITB.2009.2039485.
approximately the same on all measures. We have also [16] A. A. Al Jarullah, “Decision tree discovery for the diagnosis of
observed that the feature selection method on the few type II diabetes,” 2011 Int. Conf. Innov. Inf. Technol., pp. 303–
dimensions (in our case 8 independent features) has 307, 2011, doi: 10.1109/INNOVATIONS.2011.5893838.
[17] W. Chen, S. Chen, H. Zhang, and T. Wu, “A hybrid prediction
contributed to improving the model accuracy and has helped model for type 2 diabetes using K-means and decision tree,”
to avoid the severe concerns like multicollinearity. Proc. IEEE Int. Conf. Softw. Eng. Serv. Sci. ICSESS, vol. 2017-
Novem, no. 61272399, 2018, doi:
This study tries to establish the fact that not every time 10.1109/ICSESS.2017.8342938.
we need to with highly complex models and even the less [18] J. Kumar, “Cloud computing security issues and its challenges: A
sophisticated models can give better accuracy. But this is not comprehensive research,” Int. J. Recent Technol. Eng., vol. 8, no.
1 Special Issue 4, pp. 10–14, 2019. [24] “PIMA INDIAN DIABETES DATASET,” UCI Machine
[19] R. Ramezani, M. Maadi, and S. M. Khatami, “A novel hybrid Learning Repository, 1988.
intelligent system with missing value imputation for diabetes http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
diagnosis,” Alexandria Eng. J., 2016, doi: (accessed Apr. 15, 2018).
10.1016/j.aej.2017.03.043. [25] A. Idri, H. Benhar, J. L. Fernández-Alemán, and I. Kadi, “A
[20] M. Nilashi, O. Ibrahim, M. Dalvi, H. Ahmadi, and L. systematic map of medical data preprocessing in knowledge
Shahmoradi, “Accuracy Improvement for Diabetes Disease discovery,” Comput. Methods Programs Biomed., vol. 162, pp.
Classification: A Case on a Public Medical Dataset,” Fuzzy Inf. 69–85, 2018, doi: 10.1016/j.cmpb.2018.05.007.
Eng., vol. 9, no. 3, pp. 345–357, 2017, doi: [26] P. Misra and A. Yadav, “Impact of Preprocessing Methods on
10.1016/j.fiae.2017.09.006. Healthcare Predictions,” SSRN Electron. J., Jan. 2019, doi:
[21] H. Wu, S. Yang, Z. Huang, J. He, and X. Wang, “Type 2 diabetes 10.2139/ssrn.3349586.
mellitus prediction model based on data mining,” Informatics [27] D. Avila, M. Bussonnier, S. Corlay, Brian Granger, and J. Grout,
Med. Unlocked, vol. 10, pp. 100–107, 2018, doi: “Juypter Notebook with Ipython,” 2014. http://jupyter.org/install
10.1016/j.imu.2017.12.006. (accessed May 18, 2018).
[22] D. Sisodia and D. S. Sisodia, “Prediction of Diabetes using
Classification Algorithms,” Procedia Comput. Sci., vol. 132, no.
Iccids, pp. 1578–1585, 2018, doi: 10.1016/j.procs.2018.05.122.
[23] C. O. Rep, “HHS Public Access,” vol. 4, no. 1, pp. 92–98, 2016,
doi: 10.1007/s13679-014-0132-9.High-Risk.