Nayana Paper GUCON

An Optimal Decision Tree Model for Diabetes
Prediction and Diagnosis

Nayana S, Lakshmi Shrinivasan

Nayana S is with the Department of Electronics and Communication, Ramaiah Institute of Technology, Bangalore, India (email: nayanashannika134@gmail.com).
Lakshmi Shrinivasan is with the Department of Electronics and Communication, Ramaiah Institute of Technology, Bangalore, India (email: lakshmi.s@msrit.edu).

Abstract-Diabetes is often described as one of the deadliest chronic illnesses and is characterized by elevated blood sugar levels. Designing an efficient medical decision support system plays a crucial role in predicting the illness early and helping specialists choose the right medication. The disease stems from impaired insulin production or action in the body, resulting in abnormal carbohydrate metabolism and increased urine and blood sugar levels. This can lead to long-term damage and dysfunction of various organs, especially the eyes, kidneys, nerves and heart. Therefore, early detection and appropriate medication can reduce the risk of these complications. Classification techniques continue to be used in medicine for exploring patient data and deriving predictive models or sets of rules. This study supports the diagnosis of diabetes by selecting an optimal decision tree model. To control the fitting of the decision tree model, we use the Expectation-maximization (EM) clustering algorithm to reduce the data, and we split the data into three datasets. The decision tree model is built by trying various hyperparameters so that the most accurate model is chosen as the optimal model. To obtain the prediction results, we used the Pima Indians Diabetes Dataset (PIDD) from the UCI Machine Learning Repository. The results show that the proposed optimal decision tree model reached higher accuracy than previous studies published in the literature. The results indicate that the proposed model is useful for predicting and diagnosing type 2 diabetes.

Keywords-diabetes diagnosis; data mining; hyperparameters; optimal model; decision tree

I. INTRODUCTION
In recent years, people's irregular eating habits have led to the frequent occurrence of diseases. Diabetes is a common disease and one of the fastest growing. It arises because the body's ability to produce or respond to insulin is impaired, resulting in abnormal carbohydrate metabolism and increased blood sugar and urine sugar levels [1].

A decision support system (DSS) is an expert system used in the medical field to help doctors and health professionals make appropriate knowledge-based decisions in different fields [2]. An expert system is a computer program that integrates knowledge to solve complex problems, and it can replace or support human experts [3][4]. In medicine, the interpretation of data for the diagnosed illness is linked to the diagnostic knowledge and information on which appropriate treatment depends. Artificial intelligence (AI) has received considerable interest for diagnosis, especially from researchers in various disciplines of medicine, because an AI-based DSS is more efficient than other support systems [2].

The chronic high blood sugar in diabetes is associated with damage, disability, and dysfunction of several organs, particularly the eyes, kidneys, nerves, heart, and blood vessels. Most cases of diabetes fall into two categories: type 1 and type 2. The cause of type 1 diabetes is an absolute lack of insulin secretion. Type 2 diabetes, on the other hand, is much more prevalent, and its cause is a combination of resistance to insulin action and a poorly compensated insulin-secreting response [5].

II. BACKGROUND
Diabetes mellitus is a chronic disease. In type 1 diabetes, the pancreas produces too little insulin; in type 2 diabetes, the body becomes unresponsive to the insulin produced. In both cases, blood sugar levels rise above normal [6].
In the long run, if they are not controlled by appropriate medication, both conditions can lead to heart failure and serious damage to various organs in the body. According to a survey conducted by the World Health Organization, the number of people living with this chronic disease currently costs around 3.7 billion rupees worldwide, and that number is expected to double by 2030.
Therefore, there is a need for an expert system for early diagnosis and prediction of diabetes that can handle the ambiguous and uncertain data commonly occurring in the medical field. Fuzzy-based models are among the cost-effective systems that benefit such medical applications: their powerful reasoning capabilities eliminate these uncertainties and provide an accurate solution [7]. The Adaptive Neuro-Fuzzy Inference System (ANFIS) integrates the principles of fuzzy logic and neural networks into a single framework, with the capabilities of both interpolation and learning. As a result, nonlinear functions are efficiently estimated using these combined capabilities. The neural network of ANFIS is integrated with the Takagi-Sugeno fuzzy reasoning system according to
mathematical calculations that can solve complex problems [8].
Type 2 diabetes has a very high incidence worldwide, and its prevention and treatment require early detection. Today, data mining technology has become increasingly important in the field of medical diagnostics for its classification and predictive capabilities. A hybrid predictive model for diagnosing type 2 diabetes is presented in [9]; in that model, the K-means algorithm is used for data reduction and the J48 decision tree is used as the classifier.
The key purpose of the work in [10] is to provide better diagnosis and prediction of diabetes by predicting the blood glucose level two hours in advance. Many other methodologies are currently employed for the classification of diabetes disease (DD); the methodology adopted there for classification and prediction on the selected features is the J48 decision tree algorithm [10].
Data mining is a common method for investigating unknown patterns and rules for prediction and classification. One of the data mining methods is the decision tree algorithm. It is a classification method that uses a decision tree as its representation and includes three kinds of nodes: the root node, internal nodes, and end (target) nodes [11].
Advances in science and technology can help professionals diagnose diabetes appropriately by analyzing diabetes data. Many methodologies are applied in the field of diabetes, among which data mining is one of the most important. The decision tree is among the fastest methods used in data mining. This paper proposes a decision tree-based algorithm for diabetes prediction and diagnosis.

III. PROPOSED METHODOLOGY
The proposed work presents a simple but effective optimal decision tree-based classification and prediction of type 2 diabetes. The decision tree was developed using the Pima Indians Diabetes Dataset (PIDD).
A well-designed clinical decision support system (CDSS) enhances medical decisions using clinical knowledge and patient health information. The general block diagram of an artificial intelligence based expert system for diagnosis in the medical field is shown in Figure 1. It is a tool that provides medical professionals, patients and caretakers with all the necessary information belonging to a person. This improves the quality of care and medication and avoids errors or unforeseen events, thus making the entire process more efficient and accurate. CDSS tools are increasingly adopting artificial intelligence (AI) expert systems to manage complex and big data analytics. AI algorithms such as the optimal decision tree, fuzzy logic, neural networks, and machine learning are able to consume large amounts of data, process them by identifying patterns, and provide users with the desired results.

Figure 1. General block diagram of artificial intelligence as an expert system in medical diagnosis

In the PIDD observations, "0" indicates a negative diagnosis and "1" indicates a positive diagnosis. The decision tree algorithm was used to determine the probability of a patient developing diabetes based on clinical examination and observed symptoms. Diabetes can affect both women and men of different ages. Parameters such as insulin level, blood glucose, body mass index (BMI), diabetes pedigree function (DPF), and age were given as inputs for each patient, and the probability of occurrence of diabetes was calculated.

A. Decision tree
Data mining is expected to be one of the most revolutionary developments of the coming decades. It is a common means of exploring unknown patterns or predictive rules. One of the methods of data mining is the decision tree. Decision trees are a classification method whose structure comprises three kinds of nodes: the root node, internal nodes, and end (target) nodes. Each node represents an attribute, each branch represents a value of that attribute, and each leaf represents a class. The decision tree family of algorithms can process both categorical and numerical features. The first step in the proposed system is to identify the input and output attributes. The output is set between 0 and 1. The likelihood of diabetes is based on the given input parameters. A MATLAB toolbox was used to design the decision trees.

B. EM Clustering
Expectation-maximization (EM) clustering uses a multivariate Gaussian probability distribution model to estimate the probability that a particular data point belongs to a cluster. That is, each cluster is modeled as a Gaussian function. This is mainly done by alternating two steps.
a) Expectation (E-Step): For every data point, the probability of belonging to each cluster is calculated and used as the weight of the data point. If a point clearly belongs to one cluster, its probability for that cluster is assigned a value close to 1. If a point is likely to belong to more than one cluster, a probability distribution over the clusters is established for that data point.
b) Maximization (M-Step): This step mainly uses the weights
of the data points computed in the previous step to estimate the relevant parameters of each cluster. The mean and variance of each cluster are computed from the data points, weighted by the probabilities obtained in the E-Step. The overall likelihood of the resulting clustering is then evaluated.
These two steps are alternated, increasing the total likelihood of the clustering until convergence occurs. Several iterations are required to avoid a local optimum. For the classifier, the maximum depth of the tree is the maximum number of successive decisions that can be made to classify a given sample. Decreasing the number of levels in the decision tree helps avoid overfitting the training data.

C. Dataset
We used the Pima Indians Diabetes Dataset, which is publicly available from the UCI repository, to carry out this research. This dataset consists of multiple independent medical predictor variables and one dependent target variable (the result). The independent variables include attributes such as number of pregnancies, body mass index, blood sugar, blood pressure, insulin, skin thickness, and age. The result indicates whether the patient has diabetes (0 is "no diabetes", 1 is "yes").

IV. CLASSIFICATION PROCESS
The steps for the classification and prediction of diabetes using the decision tree are shown in Figure 2:

Figure 2. Proposed diabetes classification model.

1. Data Preprocessing
The misclassified data were deleted using the EM clustering algorithm, and the remaining data were partitioned into different datasets for training and evaluation of the model. We construct training set 1 by randomly choosing 70% of the entire dataset. The remaining 30% of the dataset is used as a test dataset to estimate the performance of the decision tree algorithm. Because the model is tested multiple times, it is easy to overfit the model by evaluating it on the same dataset. To evaluate the model more effectively, training set 1 is split into two subsets of 90% and 10%. The 90% subset, called training set 2, is used to train each model. The remaining 10% is used as a cross-validation set (CV set) to evaluate the model. As shown in Figure 3, the entire dataset is split into three subsets: training set 2, CV set, and test set.

Figure 3. Partitioned datasets

2. Training Model
Finding the right hyperparameter values for a decision tree algorithm requires building models that test combinations of hyperparameters. 70% of the dataset is utilized for training the decision tree model.
3. Prediction
After training the models with the dataset, we choose the optimal decision tree model. The prediction of diabetes is calculated from the probability-of-occurrence values.
4. Model Evaluation and Performance Evaluation
In this step, several measures are computed to evaluate how accurately the classifier predicts the class labels of the tuples in the dataset. Accuracy, sensitivity, and specificity: four terms are used in the calculation of these and many other evaluation measures.
a) True positive (TP): positive records that are correctly labeled by the classifier.
b) True negative (TN): negative records that are correctly labeled by the classifier.
c) False positive (FP): negative records that are incorrectly labeled as positive.
d) False negative (FN): positive records that are incorrectly labeled as negative.
This study measures accuracy, sensitivity, and specificity, all of which are computed from these four counts.

V. CONFUSION MATRIX
The confusion matrix is a useful tool for analyzing how well the classifier recognizes records of different classes. TP and TN indicate when the classifier is getting things right, while FP and FN indicate when it is getting things wrong. For an accurate classifier, ideally most records are represented along the diagonal of the confusion matrix, with the rest of the entries being 0 or close to 0. The confusion matrix, expressed in terms of the actual and predicted classes, is shown in Figure 4.

Figure 4. Confusion matrix
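Assuming the standard definitions are the ones intended, the three measures named in Section IV can be written in terms of the four counts TP, TN, FP, and FN as follows. This is a conventional formulation, not one confirmed by the text itself.

```latex
\begin{align}
\text{Accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Sensitivity} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP}
\end{align}
```

Under these definitions, sensitivity is the fraction of diabetic patients that the classifier labels correctly, and specificity is the corresponding fraction for non-diabetic patients.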
VI. RESULTS AND DISCUSSION
In this study, we first used EM clustering to remove inaccurately classified samples. We then divided the entire sample into a training set and a test set. The training set is used to train the decision tree model. From the confusion matrix, we found that 83 samples were TP, 59 were TN, 6 were FN, and 5 were FP. A graph of the probability of developing diabetes against glucose level is shown in Figure 5. The decision tree classifier visualization is shown in Figure 6, and the regression tree is shown in Figure 7.

Figure 5. The probability of occurrence of diabetes vs. glucose values.
Figure 6. The decision tree classification visualization.
Figure 7. Regression tree.
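As a worked check of the counts reported above, the short Python sketch below plugs them into the conventional definitions of accuracy, sensitivity, and specificity. The function name `classification_metrics` is hypothetical, and the definitions are assumed to match the ones the study used.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    # Sensitivity: share of actual positives that were labeled positive.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: share of actual negatives that were labeled negative.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity


# Counts reported in the Results section: 83 TP, 59 TN, 5 FP, 6 FN.
acc, sens, spec = classification_metrics(tp=83, tn=59, fp=5, fn=6)
print(f"accuracy={acc:.3f}  sensitivity={sens:.3f}  specificity={spec:.3f}")
```

With these counts, 142 of 153 samples are classified correctly, an accuracy of roughly 92.8%, which is close to the 92.6% listed for this study in Table 1.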
TABLE 1
Performance validation of different classification models

Classification Model                  Accuracy Percentage
J48 Algorithm                         73.82%
C4.5 Algorithm                        74.5%
K-means and Decision Tree             78.16%
ANFIS                                 86.46%
This Study (Optimal decision tree)    92.6%
VII. CONCLUSION
The results in Table 1 show that the optimal decision tree classification model is more accurate than the other classification models. Compared with those results, the proposed model obtains very accurate results in classifying type 2 diabetic patients. The proposed model applies only to numerical data and could be extended to handle other types of medical data, such as images and signals. Further research is needed to evaluate the effectiveness of the proposed method with a larger amount of data for practical implementation.
REFERENCES
[1] W. Kerner and J. Brückel, “Definition, classification and diagnosis of diabetes mellitus,” Experimental and Clinical Endocrinology & Diabetes, vol. 122, no. 7, 2014, pp. 384.
[2] M. Ohlsson, “WeAidU-a decision support system for myocardial perfusion images using artificial neural networks,” Artif. Intell. Med., vol. 30, pp. 49–60, 2004.
[3] P. Patra, D. Sahu, and I. Mandal, “An Expert System for Diagnosis of Human Diseases,” International Journal for Computer Applications (IJCA), vol. 1, no. 13, 2010.
[4] C. Yau and A. Sattar, “Developing Expert System with Soft Systems Concept,” Proceedings of International Conference on Expert Systems for Development, pp. 79-84, 1994.
[5] C. D. Malchoff, “Diagnosis and classification of diabetes mellitus,” Diabetes Care, vol. 34, Suppl. 1, pp. S62-S69, 2011.
[6] O. Geman, I. Chiuchisan, and R. Toderean, “Application of Adaptive Neuro-Fuzzy Inference System for diabetes classification and prediction,” in 2017 E-Health and Bioengineering Conference (EHB), pp. 639-642, IEEE, 2017.
[7] B. Pandey and R. B. Mishra, “Knowledge and intelligent computing system in medicine,” Comput. Biol. Med., vol. 39, 215, 2009. doi:10.1016/j.compbiomed.2008.12.008
[8] R. Seising, “From vagueness in medical thought to the foundations of fuzzy reasoning in medical diagnosis,” Artificial Intelligence in Medicine, vol. 38, no. 3, pp. 237-256, 2006.
[9] Wenqian Chen, Shuyu Chen, Zhang, Tianshu Wu, “A Hybrid Prediction Model for Type 2 Diabetes Using K-means and Decision Tree,” IEEE, 2017.
[10] Pradeep K R and Dr. Naveen N C, “Predictive Analysis of Diabetes using J48 Algorithm of Classification Techniques,” IEEE, 2016.
[11] G. Shi, “Decision trees,” Chapter 5, in: G. Shi (Ed.), Data Mining and Knowledge Discovery for Geoscientists, Elsevier, Oxford, 2014, pp. 111–138.
