Prognostic Biomarkers Identification For Diabetes Prediction by Utilizing Machine Learning Classifiers
Abstract—Diabetes caused 4.2 million deaths in 2019 alone, which makes it the seventh leading cause of death worldwide. Although diabetes can be treated, late treatment can be fatal and may result in early death. Moreover, diabetes is a costly disease to manage; hence, early detection can help patients by indicating when to seek treatment and how to prepare mentally and financially. Previously, various studies proposed different approaches for achieving near-perfect accuracy, but not many works focused on finding the appropriate attributes that can predict the disease at an early stage. In this study, we focused on finding those significant features, and our experimental analysis identified 10 significant features that can achieve a near-perfect recognition accuracy of 98.08%. The feature selection approaches used in this research are the Chi-Square test, the Minimum Redundancy Maximum Relevance (mRMR) test, and Recursive Feature Elimination based on Random Forest (RFE-RF). The seven classifiers utilized in this research are Decision Tree (DT), K-Nearest Neighbors (KNN), Logistic Regression (LR), Naïve Bayes (NB), Random Forest (RF), Neural Network (NN), and Support Vector Machine (SVM).

Index Terms—Diabetes Classification, Chi-Square, mRMR, Recursive Feature Elimination, Decision Tree, K-Nearest Neighbors, Naïve Bayes, Logistic Regression, Random Forest, Neural Network, Support Vector Machine

I. INTRODUCTION

Diabetes mellitus, popularly known as diabetes, is a group of metabolic disorders characterized by an elevated blood sugar level over a prolonged period [8]. Symptoms usually include frequent urination, increased thirst, and increased hunger. If neglected, diabetes may produce numerous complications [3]. Acute complications can include diabetic ketoacidosis, the hyperosmolar hyperglycemic state, or even loss of life [9]. Serious long-term complications include cardiovascular disease, stroke, chronic kidney disease, foot ulcers, nerve damage, loss of eyesight, and cognitive impairment [3], [10].

Diabetes is due to either the pancreas not producing sufficient insulin or the cells of the body not responding properly to the insulin produced [11]. There are three main kinds of diabetes. Type 1 diabetes is caused by the pancreas's failure to produce sufficient insulin because of the loss of beta cells. Type 2 diabetes begins with insulin resistance, a condition in which cells fail to respond to insulin properly. Gestational diabetes is the last main type and occurs when pregnant women with no previous history of diabetes exhibit high blood sugar levels [3].
II. LITERATURE REVIEW
Previously, several investigations have been conducted on the classification of diabetes. One notable work was conducted on data collected from 865 patients with 9 attributes, where the considered variables were diabetes probability, 2-hour serum insulin, number of times pregnant, diabetes pedigree type, skin fold thickness, plasma glucose, diastolic blood pressure, and sex [11]. Another study reported an overall accuracy of 88.10% by utilizing the Support Vector Machine and Linear Discriminant Analysis algorithms together on a dataset of 738 patients [12]. One study suggested the Support Vector Machine as the best classifier when applied to a dataset consisting of seven attributes: glucose, skin thickness, blood pressure, insulin, BMI, age, and diabetes pedigree function [13]. The Bayesian Regulation algorithm was also utilized to achieve 88.8% accuracy on a dataset of 250 diabetes patients [14]. Promising research suggested Random Forest as the best classifier on the Pima Indian Diabetes dataset [15], and 70.8% accuracy was reported on a dataset of 318 records [16]. Recent research reported an overall accuracy of 97.4% while utilizing 16 attributes of a dataset of 520 persons, applying Random Forest with ten-fold cross validation (CV). The same work also claimed an accuracy of 99% while keeping 100 samples in the test data [17]. We worked with the same dataset and achieved an overall accuracy of 98.08% with only 10 attributes in consideration, which indicates that the achieved accuracy is nearly the same while we need only 10 attributes instead of all 16 to classify the patients.

Fig. 1: Proposed Workflow Diagram of this Research
III. MATERIALS AND METHODS

In this section, firstly, the dataset description has been discussed. After that, the feature selection approaches and the classification methods have been presented.

The Chi-square test compares the observed frequencies with the expected frequencies in one or more segments of a probability table. The measures are arranged into mutually exclusive categories. Under the null hypothesis there are no differences among the associations in the population, and the test statistic computed from the measures follows a chi-square frequency distribution. The goal of the test is to evaluate how likely the observed rates would be, assuming the null hypothesis is true.
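To make the ranking step concrete, the following is a minimal sketch (not the authors' code) of how features could be scored with the Chi-square statistic using scikit-learn. The file name diabetes.csv, the column names, and the "class" target label are assumptions for illustration only.

```python
# Minimal sketch: Chi-square feature ranking with scikit-learn.
# Assumptions: a CSV file "diabetes.csv" with categorical/non-negative
# features and a binary "class" column; all names are illustrative.
import pandas as pd
from sklearn.feature_selection import chi2
from sklearn.preprocessing import OrdinalEncoder

df = pd.read_csv("diabetes.csv")
X = df.drop(columns=["class"])
y = df["class"]

# Chi-square requires non-negative inputs, so encode categorical answers
# (e.g. Yes/No symptom indicators) as ordinal integers.
X_enc = OrdinalEncoder().fit_transform(X)

scores, p_values = chi2(X_enc, y)
ranking = (
    pd.DataFrame({"feature": X.columns, "chi2": scores, "p": p_values})
    .sort_values("chi2", ascending=False)
)
print(ranking.head(10))  # top-10 features by Chi-square score
```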
TABLE V: Performance of classifiers for each of the nine scenarios in terms of accuracy, sensitivity, specificity, Matthews
correlation coefficient (MCC), F1-score, and area under curve (AUC) value
Classifier Name Overall Accuracy Sensitivity Specificity MCC Value F1-score AUC Value
Chi-square Test
DT 0.90385 0.87500 0.95000 0.80813 0.91803 0.91250
KNN (k=5) 0.91346 0.90625 0.92500 0.82121 0.92800 0.91560
LR 0.85577 0.89063 0.80000 0.69402 0.88372 0.84530
NB 0.83654 0.84375 0.82500 0.66067 0.86400 0.83440
NN 0.92308 0.90625 0.95000 0.84318 0.93548 0.92810
RF 0.93269 0.92188 0.95000 0.86134 0.94400 0.93590
SVM(RBF) 0.95192 0.95313 0.95000 0.89910 0.96063 0.95156
mRMR Feature Selection
DT 0.93269 0.92188 0.95000 0.86134 0.94400 0.93594
KNN (k=9) 0.86538 0.85938 0.87500 0.72316 0.88710 0.86720
LR 0.86538 0.90625 0.80000 0.71353 0.89231 0.85310
NB 0.82692 0.82813 0.82500 0.64315 0.85484 0.82660
NN 0.94231 0.93750 0.95000 0.87997 0.95238 0.94380
RF 0.92308 0.89063 0.97500 0.84792 0.93443 0.93280
SVM(RBF) 0.94238 0.92188 0.97500 0.88318 0.95161 0.94844
RFE-RF Feature Selection
DT 0.91346 0.85938 1.00000 0.83757 0.92437 0.92969
KNN (k=5) 0.90385 0.92188 0.87500 0.79688 0.92188 0.89840
LR 0.86538 0.85938 0.87500 0.72316 0.88710 0.86720
NB 0.83654 0.84375 0.82500 0.66067 0.86400 0.83440
NN 0.89423 0.82813 1.00000 0.80592 0.90598 0.91410
RF 0.94231 0.92188 0.97500 0.88318 0.95161 0.94840
SVM(RBF) 0.98077 0.98438 0.97500 0.95938 0.98438 0.97969
Common in Chi-square, mRMR and RFE-RF
DT 0.88462 0.85938 0.92500 0.76834 0.90164 0.89219
KNN (k=5) 0.87500 0.84375 0.92500 0.75148 0.89256 0.88440
LR 0.85577 0.87500 0.82500 0.69688 0.88189 0.85000
NB 0.83654 0.84375 0.82500 0.66067 0.86400 0.83440
NN 0.88462 0.85938 0.92500 0.76834 0.90164 0.89220
RF 0.86535 0.89063 0.82500 0.71563 0.89063 0.85750
SVM(RBF) 0.89423 0.87500 0.92500 0.78556 0.91057 0.90000
The feature selection techniques used in this research are the Chi-square test, the Minimum Redundancy Maximum Relevance (mRMR) test, and the Recursive Feature Elimination based on Random Forest (RFE-RF) test. The features were selected based on their ranks, and the different feature selection approaches ranked the features in different orders. Table-II denotes the ranks of the variables calculated by the Chi-square test, and Table-III illustrates the ranks of the top 10 features calculated by the mRMR test.
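mRMR ranks features by balancing relevance to the target against redundancy with already-selected features. The fragment below is a simplified, mutual-information-based sketch of that idea rather than the exact algorithm or implementation used in the paper; X_enc, y, and the column names carry over from the previous sketch and are assumptions.

```python
# Simplified mRMR-style ranking: greedily pick the feature with the highest
# (relevance to target) - (mean redundancy with already-selected features).
# Uses discrete mutual information; X_enc, y, X come from the previous sketch.
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_rank(X_arr, y_arr, n_select=10):
    n_features = X_arr.shape[1]
    relevance = np.array(
        [mutual_info_score(X_arr[:, j], y_arr) for j in range(n_features)]
    )
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < n_select:
        def score(j):
            if not selected:
                return relevance[j]
            redundancy = np.mean(
                [mutual_info_score(X_arr[:, j], X_arr[:, k]) for k in selected]
            )
            return relevance[j] - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected  # feature indices in selection order

top10_idx = mrmr_rank(X_enc, np.asarray(y), n_select=10)
print([X.columns[i] for i in top10_idx])
```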
For all the feature selection approaches, the top 10 features were taken under consideration. Table-IV illustrates the features selected by the Chi-square test, the mRMR test, and the RFE-RF test; it also shows the six common features selected by all three feature selection approaches. After that, seven machine learning classifiers were utilized for the features selected by each of the feature selection approaches and for the six common features. The seven classifiers considered in this research are Decision Tree, K-Nearest Neighbors, Logistic Regression, Naïve Bayes, Neural Network (Back Propagation), Random Forest, and Support Vector Machine with a Radial Basis Function (RBF) kernel. The performance of the classifiers was measured with the assistance of several performance metrics, e.g. sensitivity, specificity, MCC value, F1-score, AUC value, and overall accuracy.
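As an illustration of this pipeline (RFE-RF selection of 10 features followed by an RBF-kernel SVM and the metrics listed above), the following sketch shows one possible realization with scikit-learn. It is not the authors' implementation; the label encoding, train/test split, and estimator settings are assumptions, and X_enc and y carry over from the earlier sketches.

```python
# Sketch of an RFE-RF -> SVM(RBF) pipeline with the reported metrics.
# X_enc and y are assumed from the earlier sketches; split sizes are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef, roc_auc_score)

y_bin = (np.asarray(y) == "Positive").astype(int)  # assumed label encoding

# Rank features with RFE wrapped around a Random Forest and keep the top 10.
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=42),
          n_features_to_select=10)
X_sel = rfe.fit_transform(X_enc, y_bin)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_sel, y_bin, test_size=0.2, stratify=y_bin, random_state=42)

svm = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)
y_pred = svm.predict(X_te)

tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
print("accuracy   :", accuracy_score(y_te, y_pred))
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("MCC        :", matthews_corrcoef(y_te, y_pred))
print("F1-score   :", f1_score(y_te, y_pred))
print("AUC        :", roc_auc_score(y_te, svm.predict_proba(X_te)[:, 1]))
```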
TABLE VI: Comparison in terms of accuracy and selected features between the proposed approach and previous approaches

Study                         Number of Features   Accuracy
RF (10-fold CV) [17]          16                   97.40%
RF (percentage split) [17]    16                   99.00%
Proposed Approach             10                   98.08%
Table-V illustrates the performance of the classifiers for each of the scenarios. It can be observed that the highest accuracy achieved with the features selected by the Chi-square test is 95.19%, for the Support Vector Machine classifier. An overall accuracy of 94.23% was achieved by utilizing the features selected by the mRMR test, again for the Support Vector Machine classifier. However, an overall accuracy of 98.08% was achieved by utilizing the features selected by the RFE-RF test for the Support Vector Machine classifier. The common features selected by all three approaches did not produce as good a result as the other feature selection scenarios; the common features produced their highest accuracy of 89.42%, also for the Support Vector Machine classifier.

Recent work on the same dataset reported an overall accuracy of 97.4% for the Random Forest classifier with cross validation, where all 16 features of the dataset were utilized for the classification [17]. In our research, we utilized only the 10 features selected by the RFE-RF test and achieved an accuracy of 98.08%, which outperformed the previous work. The same investigation also reported an accuracy of 99% while taking 100 samples in the test set [17]. However, our approach minimizes the information needed for the early classification of diabetes among patients, which will eventually result in a more cost-effective and nearly as accurate classification process. Table-VI illustrates the comparison with the previous study.
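Since Table VI contrasts a 10-fold cross-validation estimate with a percentage-split estimate, the hedged sketch below shows how both evaluation protocols could be reproduced for a Random Forest baseline. The estimator settings and split size are illustrative assumptions, not the settings used in [17]; X_sel and y_bin carry over from the earlier sketches.

```python
# Sketch: comparing 10-fold cross-validation with a simple percentage split,
# as in Table VI. X_sel and y_bin are assumed from the earlier sketches.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Protocol 1: 10-fold cross-validation accuracy.
cv_acc = cross_val_score(rf, X_sel, y_bin, cv=10, scoring="accuracy")
print("10-fold CV accuracy:", cv_acc.mean())

# Protocol 2: percentage split, holding out 100 samples as described in [17].
X_tr, X_te, y_tr, y_te = train_test_split(
    X_sel, y_bin, test_size=100, stratify=y_bin, random_state=42)
print("held-out split accuracy:", rf.fit(X_tr, y_tr).score(X_te, y_te))
```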
V. CONCLUSION

Being the seventh leading cause of death, diabetes detection at an early stage has been an area of research for a decade now. Previously, many works suggested different methods on different datasets for the accurate recognition of the disease at an early stage. While most of the studies achieved high accuracy, the identification of the significant attributes still lacks proper investigation. In this study, we considered a publicly available dataset and applied three feature selection approaches. We also considered the six common features selected by all three feature selection techniques. After that, we applied seven machine learning classifiers and concluded that the Support Vector Machine can achieve a near-perfect accuracy of 98.08% with only the 10 features selected by the RFE-RF approach. We hope that our research will be beneficial for the development of a diabetes recognition system and will decrease the mortality rate.

VI. ACKNOWLEDGEMENT

We are grateful to the Green University of Bangladesh for financing this research.
REFERENCES

[1] R. Thomas, S. Halim, S. Gurudas, S. Sivaprasad, and D. Owens, "IDF diabetes atlas: A review of studies utilising retinal photography on the global prevalence of diabetes related retinopathy between 2015 and 2018," Diabetes Research and Clinical Practice, vol. 157, p. 107840, 2019.
[2] T. Vos, A. D. Flaxman, M. Naghavi, R. Lozano, C. Michaud, M. Ezzati, K. Shibuya, J. A. Salomon, S. Abdalla, V. Aboyans et al., "Years lived with disability (YLDs) for 1160 sequelae of 289 diseases and injuries 1990–2010: a systematic analysis for the global burden of disease study 2010," The Lancet, vol. 380, no. 9859, pp. 2163–2196, 2012.
[3] W. H. Organization et al., "Diabetes fact sheet no. 312. October 2013," Archived from the original on, vol. 26, 2013.
[4] U. Diabetes and H. Lobby, "What is diabetes," Diabetes UK, 2014.
[5] W. H. Organization et al., "The top 10 causes of death fact sheet no. 310," Geneva, Switzerland: World Health Organization, 2013.
[6] A. D. Association et al., "Economic costs of diabetes in the US in 2017," Diabetes Care, vol. 41, no. 5, pp. 917–928, 2018.
[7] Centers for Disease Control and Prevention et al., "National diabetes statistics report, 2020," Atlanta, GA: Centers for Disease Control and Prevention, US Department of Health and Human Services, pp. 12–15, 2020.
[8] W. H. Organization et al., Guidelines for the prevention, management and care of diabetes mellitus, 2006.
[9] A. E. Kitabchi, G. E. Umpierrez, J. M. Miles, and J. N. Fisher, "Hyperglycemic crises in adult patients with diabetes," Diabetes Care, vol. 32, no. 7, pp. 1335–1343, 2009.
[10] E. Saedi, M. R. Gheini, F. Faiz, and M. A. Arami, "Diabetes mellitus and cognitive impairments," World Journal of Diabetes, vol. 7, no. 17, p. 412, 2016.
[11] V. Kumar and L. Velide, "A data mining approach for prediction and treatment of diabetes disease," Int J Sci Invent Today, vol. 3, pp. 73–79, 2014.
[12] P. Agrawal and A. Dewangan, "A brief survey on the techniques used for the diagnosis of diabetes-mellitus," Int. Res. J. of Eng. and Tech. (IRJET), vol. 2, pp. 1039–1043, 2015.
[13] T. N. Joshi and P. Chawan, "Diabetes prediction using machine learning techniques," IJERA, vol. 8, no. 1, pp. 9–13, 2018.
[14] M. A. Sapon, K. Ismail, and S. Zainudin, "Prediction of diabetes by using artificial neural network," in Proceedings of the 2011 International Conference on Circuits, System and Simulation, Singapore, vol. 2829, 2011, pp. 299–303.
[15] D. Singh, E. J. Leavline, and B. S. Baig, "Diabetes prediction using medical data," Journal of Computational Intelligence in Bioinformatics, vol. 10, no. 1, pp. 1–8, 2017.
[16] T. M. Ahmed, "Developing a predicted model for diabetes type 2 treatment plans by using data mining," Journal of Theoretical and Applied Information Technology, vol. 90, no. 2, p. 181, 2016.
[17] M. F. Islam, R. Ferdousi, S. Rahman, and H. Y. Bushra, "Likelihood prediction of diabetes at early stage using data mining techniques," in Computer Vision and Machine Intelligence in Medical Image Analysis. Springer, 2020, pp. 113–125.
[18] M. L. McHugh, "The chi-square test of independence," Biochemia Medica, vol. 23, no. 2, pp. 143–149, 2013.
[19] M. Radovic, M. Ghalwash, N. Filipovic, and Z. Obradovic, "Minimum redundancy maximum relevance feature selection approach for temporal gene expression data," BMC Bioinformatics, vol. 18, no. 1, pp. 1–14, 2017.
[20] T. M. Phuong, Z. Lin, and R. B. Altman, "Choosing SNPs using feature selection," in 2005 IEEE Computational Systems Bioinformatics Conference (CSB'05). IEEE, 2005, pp. 301–309.
[21] N. Bhargava, G. Sharma, R. Bhargava, and M. Mathuria, "Decision tree analysis on J48 algorithm for data mining," Proceedings of International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, no. 6, 2013.
[22] S. Kaghyan and H. Sarukhanyan, "Activity recognition using k-nearest neighbor algorithm on smartphone with tri-axial accelerometer," International Journal of Informatics Models and Analysis (IJIMA), ITHEA International Scientific Society, Bulgaria, vol. 1, pp. 146–156, 2012.
[23] D. G. Kleinbaum, K. Dietz, M. Gail, M. Klein, and M. Klein, Logistic Regression. Springer, 2002.
[24] H. Zhang, "The optimality of naive Bayes," AA, vol. 1, no. 2, p. 3, 2004.
[25] X. Yang, G. Zhang, J. Lu, and J. Ma, "A kernel fuzzy c-means clustering-based fuzzy support vector machine algorithm for classification problems with outliers or noises," IEEE Transactions on Fuzzy Systems, vol. 19, no. 1, pp. 105–115, 2010.
[26] J. Ali, R. Khan, N. Ahmad, and I. Maqsood, "Random forests and decision trees," International Journal of Computer Science Issues (IJCSI), vol. 9, no. 5, p. 272, 2012.
[27] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning. MIT Press, Cambridge, 2016, vol. 1.