
Inteligencia Artificial 20(59) (2017), 123-127

doi: 10.4114/intartf.vol20iss59pp123-127

INTELIGENCIA ARTIFICIAL
http://journal.iberamia.org/

PERFORMANCE ANALYSIS OF (SVC) SVM BASED CLASSIFICATION APPROACH FOR THE PREDICTION OF DIABETES
UmaMaheswari Gurusamy (1), M.Revathi, R.Kaviyapriya
Department of Computer Science, Kamaraj College of Engineering and Technology, Tamil Nadu. India
(1) uma.optimist@gmail.com

Abstract Diabetes mellitus, a major non-communicable disease, is becoming an epidemic and a global
public health crisis. Hypertension, often known as high blood pressure, is one of the many potentially fatal
consequences of diabetes and is frequently misdiagnosed and mistreated until symptoms are severe. Early
identification of those who are at risk can significantly reduce or even eliminate diabetic complications. In
recent years, numerous machine learning classification algorithms have been widely applied to the
diagnosis of diabetes, but very few studies have addressed the detection of hypertension in diabetic people.
The existing rule-based models are unable to provide understandable rule sets for predicting diabetes. To
resolve this limitation, this paper develops a Support Vector Machine (SVM) based approach for the
diagnosis of diabetes by extracting rules. A feature selection process is presented for choosing characteristics from
the dataset that are significantly associated with each other. The dataset used in this research has been obtained
from Pima Indians diabetes, India, comprising 300 diabetic subjects with 108 hypertensives and 192
normotensives. The findings are generalized using five publicly available diabetes-related
datasets. The training and testing portions of the dataset employed in this study are split 70:30, respectively.
The proposed approach outperforms the existing machine learning classifiers with respect to the accuracy
of prediction, and hence the risk associated with patients can be successfully communicated to healthcare
professionals to help them make decisions about prevention and intervention.

Keywords: Diabetes, SVM, diagnosis, rule based methods, feature selection

1 Introduction
Computer technology has advanced significantly in recent years, resulting in the generation of
enormous amounts of data. The healthcare industry holds very large and sensitive data, which must be
handled carefully in order to benefit from it. Diabetes mellitus is one of the diseases that has become a global
hazard. There is a need to develop accurate and efficient predictive models that help in diagnosing diabetes.
Early prediction of diabetes is difficult because different people may have different
symptoms. The symptoms of diabetes include thirst, hunger, weight loss, tiredness, vision problems, headache,
frequent urination and so on. A better diagnosis is still required to predict diabetes accurately. In order
to create a prediction model, it is therefore necessary to analyse the diabetic datasets that are already available.

Diabetes mellitus has emerged as one of the largest global health problems of the 21st century. According
to the International Diabetes Federation (IDF), 415 million adults are currently affected by diabetes, a number that is
expected to rise to 642 million by 2040, posting a huge hike of 10.4% during the next two decades. Additionally, the
prevalence of impaired glucose tolerance or pre-diabetes is also increasing at a fast pace. The estimates reveal
that over 481 million persons will suffer from pre-diabetes by 2040. Globally, over a million children and
adolescents are suffering from T1DM, and one in six live births is being affected by gestational diabetes. It is
estimated that almost 49.7% of people having diabetes are undiagnosed. Diabetes mellitus is a group of metabolic
disorders characterized by hyperglycaemia resulting in elevated blood sugar. It is caused due to impaired insulin
secretion, defective insulin action, or both. The majority of cases of diabetes mellitus are broadly classified into
two groups: Type I (T1DM) and Type II (T2DM). T1DM occurs when the pancreas is unable to produce insulin
due to autoimmune destruction of pancreatic beta cells, resulting in insulin deficiency. Another common form

ISSN: 1137-3601 (print), 1988-3064 (on-line)


©IBERAMIA and the authors
of diabetes is T2DM, in which the body does not effectively use the insulin it produces. A further category of diabetes
is gestational diabetes, which develops during pregnancy. The complications related to diabetes are responsible for
considerable morbidity and mortality. Chronic hyperglycaemia in diabetes is associated with microvascular
complications such as retinopathy, nephropathy, and neuropathy.

Therefore, diabetes is a leading cause of blindness, renal failure, impotence and diabetic foot disorders
whose severity can lead to lower-limb amputations. The complications of diabetes include cardiovascular diseases
such as heart attacks, strokes and cerebrovascular diseases. Hence, there is a need to apply specific
machine learning models for predicting the disease in its early stage. Machine learning is a branch of artificial
intelligence that enables computers to learn without being explicitly programmed and to adapt
when exposed to new or unseen data.

2 LITERATURE SURVEY
U. Ahmed et al. [1] proposed a fused machine learning technique for predicting diabetes using Support
Vector Machine (SVM) and Artificial Neural Network (ANN) model. The dataset is examined to assess if a diabetes
diagnosis is accurate or inaccurate. The training and testing portions of the dataset employed in this study are split
70:30, respectively. The output of the model becomes the input membership function for the fuzzy model, whereas
the fuzzy logic finally determines whether a diabetes diagnosis is positive or negative.

A novel method based on Local Median-based Gaussian Naive Bayes (LMeGNB) is proposed in [2] to
compensate for missing values, combined with the K-means SMOTE method to balance the positive and
negative samples of diabetes and obtain normalized, balanced data. After that, an ensemble model based on
many stages of probability is constructed using various machine learning methods. When extreme gradient
boosting, random forests, and weighted k nearest neighbours are integrated, the highest classification accuracy of
94.53% is obtained on the Pima Indian diabetes dataset. To show the applicability of the strategy in predicting
diabetes, the experiment also evaluated the PE_DIM model using two diabetes datasets, RSMH and Tabriz.

The study in [3] aims to improve the sensitivity and selectivity of glucose detection in an aqueous solution by
using light sources of various wavelengths. Multiple-wavelength measurements have the potential to
compensate for errors associated with inter- and intra-individual differences in blood and tissue components. In that
work, 18 distinct wavelengths between 410 and 940 nm are used to analyse the transmission data of a specially
constructed optical sensor. The result shows a high correlation value (0.98) between glucose concentration and
transmission intensity for four wavelengths (485, 645, 860 and 940 nm). Glucose was predicted using
five machine learning techniques. When regression methods are used, 9% of glucose predictions fall outside the
correct range (normal, hypoglycaemic or hyperglycaemic). The prediction accuracy is improved by applying
classification methods to data arranged into 21 classes.

SQRex-SVM and eclectic methods are employed in [4] to transform the SVM black box into a more
understandable model. The authors created a hybrid system for diagnosing illnesses. In
particular, they employed SVMs for the diagnosis and prediction of diabetes, with an additional rule-based
explanation component to provide comprehensibility. The SVM and the rules deduced from it are
designed to serve as a second opinion for diabetes diagnosis and as a tool to forecast diabetes by identifying those
at high risk.

The model in [5] was created and verified using data collected by healthcare providers from 451,425
people. The findings demonstrate that all suggested models performed satisfactorily, with AUC values ranging
from 0.826 to 0.850. Among the seven predictive models, the gradient boosting tree model outperformed the other
models, achieving an AUC of 0.850. The risk prediction model has great potential for automating real-time
diagnosis, helping healthcare professionals to focus effectively on high-risk individuals and to
develop proactive strategies that delay the onset or stop the progression of chronic
diseases.

Diabetes affects many people all over the world. Its annual incidence rates are rising dramatically.
Diabetes-related problems in several of the body's vital organs can be lethal if left untreated. Early diabetes
diagnosis is crucial for prompt treatment that can prevent the development of such problems. The RR-interval
signals known as heart rate variability (HRV) signals (derived from electrocardiogram (ECG) signals) can be used
successfully for the non-invasive diagnosis of diabetes. The work in [6] presented a classification of diabetic and
normal HRV signals using deep learning based on long short-term memory (LSTM) and convolutional neural
network (CNN) architectures. Combinations of these networks are used to extract complex temporal dynamic
features of the input HRV data for diagnosis. A support vector machine (SVM) is then used to classify using these
features. In comparison to the earlier approaches without SVM, the performance is enhanced by 0.03% and 0.06%
in the CNN and CNN-LSTM architectures, respectively.

The predictive analysis proposed in [7] employs a variety of data mining techniques, machine learning
algorithms, and statistics. The Hadoop / MapReduce environment's predictive analysis technique is used to
forecast the diabetes types that are common, its problems, and the sorts of care that should be given. According to
the analysis, this approach offers a productive means of treating and caring for patients with superior outcomes,
such as accessibility and affordability.

Prediction accuracy suffers when the quality of the medical data is poor. Also,
distinct regional diseases in different places have their own features, which could make it harder to forecast when a
disease would spread. Chen et al. [8] proposed a machine learning technique for efficient chronic illness
outbreak prediction in populations with high disease incidence. Compared to many conventional prediction
algorithms, the prediction accuracy of this approach reaches 94.8%, with a convergence speed that is faster than
that of the CNN based unimodal illness risk prediction algorithm.

The SVM model proposed by Ranganarayanan et al. [9] identifies seven novel putative glucose binding
sites in human serum albumin (HSA), of which two are exposed only during HSA dynamics, with a 10-fold cross
validation accuracy of 84%. These results can support the development of HSA as a different biomarker for
glycaemic control.

Decision tree and Naive Bayes algorithms are proposed in [10] for the purpose of predicting the onset of
diabetes through logistic regression analysis, over balanced and unbalanced datasets. In comparison to random
under-sampling, over-sampling, and no sampling, the results showed that Naive Bayes with the K-medoids
under-sampling technique was superior. With the higher true positive rate, it is possible to obtain an average
receiver operating characteristic performance of 79%. The findings of this study recommend additional
investigation to elucidate the pathophysiological importance of HDL and the routes in the onset of diabetes.
The prediction accuracy of the existing methods still has to be improved in order to predict the disease at an
early stage.

3 SVM BASED CLASSIFICATION (SVC) APPROACH

A learning approach for the generation of classification rules to diagnose hypertension among diabetic
individuals is introduced in the proposed work. The methodology utilizes an ensemble technique for extracting rules
from SVMs. At the initial stage, support vectors (SVs) are extracted from the SVM model with reasonable accuracy.
Then, the trained SVM model is used for predicting the actual class labels of SVs, where a modified artificial
dataset is formed by the SVs and their predicted labels. Lastly, the modified training dataset is given for generating
rules and determining the performance of the proposed classification model. The rule induction method together
with the SVs yields economical and useful diagnostic rules for detecting hypertension. The proposed method
aims to:
• Introduce a feature selection as a pre-processing step to choose significant biomarkers and risk-factors
from the six diabetes-related datasets.
• Develop a model based on rule extraction approach from SVMs using an implementation of gradient
boosted decision trees.
• Compare the proposed SVM based classification approach with the existing approaches.
The proposed rule-extraction technique is divided into two major steps. In the first step, the SVM model is
constructed from the training data by tuning the hyperparameters within the search space to obtain an
acceptable accuracy. The instances that lie close to the decision boundary, known as support vectors (SVs), are
extracted from the SVM model. This model is then used to predict the class labels of the SVs. An artificial
dataset is generated by substituting the actual labels of the SVs with the predicted labels. Since the SVs may carry
noisy class labels, replacing the original labels with those predicted by the SVM removes the label noise from the
artificial dataset. In addition, the generated rules not only imitate the predictions of the SVM but also provide a better
understanding of its internal workings. In the second step, the artificial dataset, known as the modified training
dataset, is given to the rule-generation algorithm with tuned hyperparameters to produce the best rule sets. Finally,
the rules are evaluated on the testing data.
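
The two steps described above can be sketched with scikit-learn as follows. This is only an illustration under stated assumptions: the data is synthetic (generated with make_classification), the SVM hyperparameters are fixed rather than tuned, and a shallow decision tree stands in for the rule learner, whereas the text refers to gradient boosted decision trees. It is not the authors' implementation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (assumption): 400 samples, 8 features, two classes.
X, y = make_classification(n_samples=400, n_features=8, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y)

# Step 1: train the SVM and extract its support vectors (SVs).
svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
sv = svm.support_vectors_

# Replace the original labels of the SVs with the SVM's own predictions
# to form the artificial (modified training) dataset.
sv_labels = svm.predict(sv)

# Step 2: induce rules from the artificial dataset.
# A shallow decision tree is an assumed stand-in for the rule learner.
rule_model = DecisionTreeClassifier(max_depth=3, random_state=0)
rule_model.fit(sv, sv_labels)

# Readable if-then rules, plus evaluation on the held-out test data.
rules = export_text(rule_model, feature_names=[f"x{i}" for i in range(8)])
test_accuracy = rule_model.score(X_test, y_test)
# Fidelity: how often the rules agree with the SVM on unseen data.
fidelity = (rule_model.predict(X_test) == svm.predict(X_test)).mean()

print(rules)
print("accuracy:", test_accuracy, "fidelity:", fidelity)
```

Reporting fidelity next to accuracy separates two questions: how well the rules imitate the SVM, and how well they predict the true labels.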

The Pima Indians diabetes (PIM) dataset is used for predicting the positive versus the negative
class. The entire dataset is initially divided into two parts: 90% is used as training data for generating rules
from the SVMs, and the remaining 10% is used as testing data. The ideal hyperparameters for the models are
found during the rule-extraction process using 10-fold cross validation. These hyperparameters are then
used to obtain the trained model, which in turn is used to generate rules that are evaluated on the
testing data. The average values of the performance measures, namely accuracy, precision, recall,
F-measure and Area Under Curve (AUC), are obtained by cross validation. The generated rules are
additionally assessed by rule set size and mean rule length.
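
As a small worked example of these measures, the sketch below computes accuracy, precision, recall and F-measure from predicted labels, and rule set size and mean rule length from a list of rules. The labels and the three rules are hypothetical values chosen for illustration; they are not results from the paper.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F-measure for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


def rule_set_stats(rules):
    """Rule set size and mean number of conditions per rule."""
    size = len(rules)
    mean_length = sum(len(rule) for rule in rules) / size if size else 0.0
    return size, mean_length


# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)

# Hypothetical rules, each written as a list of conditions.
rules = [
    ["Glucose > 140", "BMI > 30"],
    ["Glucose <= 140"],
    ["Age > 50", "BMI > 30", "BloodP > 80"],
]
size, mean_length = rule_set_stats(rules)
print(metrics, size, mean_length)
```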

Figure 1. System design of the proposed method

Decision tree is a fundamental classification and regression technique. The classification of instances
based on features can be described by a decision tree model, which has a tree structure. It can be viewed as a
collection of if-then rules, or as conditional probability distributions that are specified in feature space and class
space. The Random Forest (RF) classifier employs several decision trees and can carry out both classification
and regression tasks. When the RF predicts a new object from its features, each tree provides its own
classification result as a "vote", and the overall output is the class that receives the most votes. In a
regression problem, the RF output is the average of the outputs of all the decision trees.
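
The aggregation step of a random forest can be illustrated in a few lines. The per-tree outputs below are hypothetical; the sketch shows only how votes and averages are combined, not how the trees are trained.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_outputs):
    """Return the average of the individual tree outputs."""
    return mean(tree_outputs)


# Hypothetical outputs of five trees for one patient.
votes = ["T2Diabetes", "normal", "T2Diabetes", "T2Diabetes", "normal"]
predicted_class = forest_classify(votes)

# Hypothetical regression outputs of three trees.
predicted_value = forest_regress([110.0, 120.0, 130.0])
print(predicted_class, predicted_value)
```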

A supervised machine learning approach called "Support Vector Machine" (SVM) is used for the
classification. In the proposed method, classification is done by identifying the hyperplane that best separates the
two classes, “normal” and “T2Diabetes”. The algorithm plots every data point in n-dimensional space, where n is
the number of features and each feature's value is a coordinate. A significant insight is that the linear SVM can be
recast in terms of the inner product of any two given observations rather than the observations
themselves. The inner product of two vectors is the sum of the products of each
pair of input values. For example, the inner product of the vectors [2, 3] and [5, 6] is 2*5 + 3*6 = 28. The following
equation can be used to make a prediction for a new input using the dot product between the input (x) and each
support vector (xi):

$$ f(x) = B_0 + \sum_{i} a_i \, \langle x, x_i \rangle \qquad (1) $$

Equation (1) computes the inner products of a new input vector (x) with each support vector in the training set.
The coefficients B0 and ai (for each input) must be estimated from the training data by the learning
algorithm.
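
Equation (1) can be evaluated directly once the coefficients are known. In the sketch below the support vectors, the coefficients ai and the intercept B0 are hypothetical numbers chosen for illustration; in practice they come from training. Taking the sign of f(x) as the predicted class is an assumed convention here.

```python
def inner_product(u, v):
    """Sum of the products of each pair of input values."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def decision_value(x, support_vectors, coefficients, b0):
    """Evaluate f(x) = B0 + sum(ai * <x, xi>) over the support vectors."""
    return b0 + sum(a * inner_product(x, sv)
                    for a, sv in zip(coefficients, support_vectors))


# The example from the text: [2, 3] . [5, 6] = 2*5 + 3*6
example = inner_product([2, 3], [5, 6])

# Hypothetical trained model with two support vectors.
support_vectors = [[1.0, 2.0], [3.0, 0.0]]
coefficients = [0.5, -0.25]
b0 = 0.1

# f(x) = 0.1 + 0.5*4 + (-0.25)*6 for x = [2, 1]
f_x = decision_value([2.0, 1.0], support_vectors, coefficients, b0)
predicted_positive = f_x > 0
print(example, f_x, predicted_positive)
```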

4 RESULT ANALYSIS
The performance of the proposed approach is evaluated using the PyCharm IDE. The rule generating process
progresses in two steps. The SVM model is created during the first stage using training data from each fold, and
it is then used to predict the class labels. The rules are evaluated on the remaining 10% of test data to
determine the accuracy. In addition, rule set size and mean rule length are calculated.
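
The fold structure can be made concrete with a minimal index generator. The sample count of 100 is a hypothetical value, and the sketch does not shuffle or stratify the data, which a real experiment would normally do. With k = 10, each fold holds out 10% of the samples for testing and keeps 90% for training.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Hypothetical dataset of 100 samples split into 10 folds.
folds = list(k_fold_indices(100, k=10))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```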

The training set is different from the test set. In the k-fold cross validation method, the whole dataset is used to
train and test the classifier, which classifies the types as “Prediabetes” and “T2Diabetes”. The results of the study
demonstrated that blood pressure and cholesterol did not predict diabetes risk after adjustment for BMI and
glucose tolerance. Naive Bayes, support vector machine and decision tree are widely used machine learning
algorithms for solving association, prediction and classification problems in epidemiological and medical studies
because of their unique characteristics. Several feature selection methods, such as Principal Component Analysis
(PCA) and minimum Redundancy Maximum Relevance (mRMR), can help to eliminate unnecessary features by
reducing the number of attributes.

4.1 Analysis of the proposed SVC approach

Histograms are a useful visualization tool in machine learning because they show the
distribution of values for a particular attribute. This information can be used to understand the shape and
characteristics of the data, identify potential outliers, and choose appropriate machine learning algorithms or
pre-processing techniques. The diabetes dataset has multiple attributes such as age, BMI, blood pressure, etc. The
distribution of each attribute for the proposed approach is visualized using a histogram, as shown in figure 2.
A histogram divides the range of values of an attribute into a series of intervals (also called bins) and
counts the number of values that fall into each interval. The x-axis represents the range of values of a specific
attribute, while the y-axis represents the frequency, i.e. the number of values that fall in each bin.
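
The binning behind a histogram can be sketched without any plotting library. The ages and bin edges below are hypothetical values for illustration. Each bin is treated as half-open, [low, high), except the last one, which also includes its upper edge.

```python
def histogram(values, bin_edges):
    """Count how many values fall into each bin defined by bin_edges."""
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            low, high = bin_edges[i], bin_edges[i + 1]
            # Half-open bins; the final bin also includes its upper edge.
            if low <= v < high or (i == last and v == high):
                counts[i] += 1
                break
    return counts


# Hypothetical patient ages and bin edges.
ages = [21, 25, 33, 47, 52, 29, 38, 61, 22, 45]
edges = [20, 30, 40, 50, 60, 70]
age_counts = histogram(ages, edges)

for low, high, count in zip(edges, edges[1:], age_counts):
    print(f"{low}-{high}: {'#' * count}")
```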

For the first attribute, the x-axis represents ranges of age, such as 0-20, 21-30, 31-40, and so on, and the y-axis
represents the number of patients that fall into each age range or bin. The attribute BMI is represented with
ranges of values such as 0-20, 21-40, 41-60, and so on; its frequency is high in the range 21-40.
The attribute BloodP is represented with ranges of values such as 0-20, 21-40, 41-60, 61-80, 81-100 and so on;
its frequency is high in the range 61-80. The attribute NumTimesPreg is represented with ranges of values
such as 0-5, 5-10, 10-15, and so on; its frequency is high in the range 0-5.

Figure 2. Analysis of the proposed approach

4.2 Comparison of the proposed and the existing approaches

Figure 3 compares the prediction accuracy of the proposed and the existing approaches.
In the box plot shown in figure 3, the y-axis is the performance metric (accuracy) and the x-axis is the algorithm.
Each box plot has a rectangular box with whiskers extending from it. The box represents the interquartile
range (IQR) of the performance metric for each approach, while the whiskers represent the range of
values that lie within 1.5 times the IQR. Any data points that fall outside the whiskers are considered outliers and
are shown as individual points. The median is shown by the dark yellow line. The existing approaches considered
for comparison are Logistic Regression (LR), K nearest neighbour (KNN), Naïve Bayes (NB), Linear support vector
classifier (LSVC), Random forest classifier (RFC), and Decision tree (DT). The proposed approach is the support
vector classifier (SVC). Among the existing approaches, LR has an accuracy of 0.74909, KNN has 0.7375, NB has
0.715995, LSVC has 0.750269, RFC has 0.752957 and DT has 0.731452. The proposed SVC has an accuracy of
0.755645. Hence the accuracy is improved by about 0.3 percentage points over the best existing approach (RFC).
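
The size of the improvement can be checked directly from the accuracies reported above. The short calculation below uses only those reported values and expresses the margin of SVC over each baseline in percentage points.

```python
# Accuracies as reported in the text.
accuracies = {
    "LR": 0.74909,
    "KNN": 0.7375,
    "NB": 0.715995,
    "LSVC": 0.750269,
    "RFC": 0.752957,
    "DT": 0.731452,
    "SVC": 0.755645,
}

baselines = {name: acc for name, acc in accuracies.items() if name != "SVC"}
best_baseline = max(baselines, key=baselines.get)

# Margin of the proposed SVC over each baseline, in percentage points.
margins = {name: (accuracies["SVC"] - acc) * 100
           for name, acc in baselines.items()}

print("best baseline:", best_baseline)
for name, margin in sorted(margins.items(), key=lambda item: item[1]):
    print(f"SVC vs {name}: +{margin:.2f} points")
```

On these figures the margin over the strongest baseline (RFC) is roughly 0.27 percentage points, and the margin over the weakest (NB) is close to 4 points.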

Figure 3. Comparison of the proposed and the existing approaches

5 CONCLUSION
Thus, diabetes is predicted using the support vector based classification (SVC) approach. Specifically, SVMs
have been used for identifying hypertension in patients with diabetes, with a rule-based explanation module
added to produce intelligible and understandable rules. Such SVM extracted rules can be considered as a
second opinion for diagnosing hypertension and other complications related to diabetes. Medical experts
recognize the rules generated by the system to be very beneficial for outpatient screening, where basic
measurements can help to identify the severity of diabetes and its complications. The SVM model is constructed
from the training data by tuning the hyperparameters within the search space to obtain an acceptable accuracy.
It supports medical departments and government safety and welfare systems in keeping people
aware of diabetes. The experimental results also demonstrate that the proposed
approach is suitable, with optimal discrimination, for predicting the future onset of diabetes, understanding the
contributing factors and the role of data sampling techniques in generating a balanced training dataset, and
building efficient prediction models. The proposed approach can also be extended to predict other types of
ailments that arise from metabolic syndrome.

REFERENCES
[1] U. Ahmed et al., "Prediction of Diabetes Empowered With Fused Machine Learning," in IEEE Access, vol. 10,
pp. 8529-8538, 2022, doi: 10.1109/ACCESS.2022.3142097.
[2] L. Jia, Z. Wang, S. Lv and Z. Xu, "PE_DIM: An Efficient Probabilistic Ensemble Classification Algorithm for
Diabetes Handling Class Imbalance Missing Values," in IEEE Access, vol. 10, pp. 107459-107476, 2022, doi:
10.1109/ACCESS.2022.3212067.
[3] M. Shokrekhodaei, D. P. Cistola, R. C. Roberts and S. Quinones, "Non-Invasive Glucose Monitoring Using
Optical Sensor and Machine Learning Techniques for Diabetes Applications," in IEEE Access, vol. 9, pp.
73029-73045, 2021, doi: 10.1109/ACCESS.2021.3079182.
[4] N. Barakat, A. P. Bradley and M. N. H. Barakat, "Intelligible Support Vector Machines for Diagnosis of Diabetes
Mellitus," in IEEE Transactions on Information Technology in Biomedicine, vol. 14, no. 4, pp. 1114-1120, July
2010, doi: 10.1109/TITB.2009.2039485.
[5] J. Yang, X. Ju, F. Liu, O. Asan, T. S. Church and J. O. Smith, "Prediction for the Risk of Multiple Chronic
Conditions Among Working Population in the United States With Machine Learning Models," in IEEE Open
Journal of Engineering in Medicine and Biology, vol. 2, pp. 291-298, 2021, doi: 10.1109/OJEMB.2021.3117872.
[6] Anita Sachin Mahajan, "Medical Diagnosis of Diabetes Using Deep Learning Techniques and Big data
Analytics", International Journal of Emerging Technologies and Innovative Research, vol.7, no. 4, pp. 1490-
1497, April-2020.
[7] Ashwini Abhale, Shruti Gulhane, Sandhya Budhewar, Swanali Jathar, Harshada Sonwane, "Predictive analysis
of Diabetic Patient Data Using Machine Learning and Big Data," in International Journal of Research in Advent
Technology, pp. 25-28, Feb. 2019.
[8] M. Chen, Y. Hao, K. Hwang, L. Wang and L. Wang, "Disease Prediction by Machine Learning Over Big Data
From Healthcare Communities," in IEEE Access, vol. 5, pp. 8869-8879, 2017, doi:
10.1109/ACCESS.2017.2694446.
[9] P. Ranganarayanan, N. Thanigesan, V. Ananth, V. K. Jayaraman and V. Ramakrishnan, "Identification of
Glucose-Binding Pockets in Human Serum Albumin Using Support Vector Machine and Molecular Dynamics
Simulations," in IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 13, no. 1, pp. 148-
157, 1 Jan.-Feb. 2016, doi: 10.1109/TCBB.2015.2415806.
[10] S. Perveen, M. Shahbaz, K. Keshavjee and A. Guergachi, "Metabolic Syndrome and Development of Diabetes
Mellitus: Predictive Modeling Based on Machine Learning Techniques," in IEEE Access, vol. 7, pp. 1365-1375,
2019, doi: 10.1109/ACCESS.2018.2884249.
[11] J. Yang et al., "Blood Pressure States Transition Inference Based on Multi-State Markov Model," in IEEE
Journal of Biomedical and Health Informatics, vol. 25, no. 1, pp. 237-246, Jan. 2021, doi:
10.1109/JBHI.2020.3006217.
[12] Fernando López-Martínez, Aron Schwarcz.MD, Edward Rolando Núñez-Valdez, Vicente García-Díaz,
"Machine learning classification analysis for a hypertensive population as a function of several risk factors,"
Expert Systems with Applications, vol. 110, pp. 206-215, 2018, https://doi.org/10.1016/j.eswa.2018.06.006.
