Know-Stress
Predictive Modeling of Stress among Diabetes Patients
under Varying Conditions
Introduction
Around 422 million people worldwide had diabetes mellitus in 2014 (World
Health Organization, 2016). The global diabetes prevalence rate has increased
from 4.7% in 1980 to 8.5% in 2014 (World Health Organization, 2016). Type
2 diabetes (T2D) diminishes patients' quality of life. Diabetes patients are
more likely to suffer from visual and hearing impairments, overweight,
osteoarthritis, and stroke, and have a two- to threefold increased risk of physical
disability compared with people without diabetes (Gregg et al., 2000; Kirkman
et al., 2012; Schwartz et al., 2002; Volpato et al., 2002; Volpato, Leveille,
Blaum, Fried, & Guralnik, 2005).
T2D has been widely recognized as a chronic disease caused by an unhealthy
lifestyle. However, in recent years, psychological factors have drawn
researchers' attention because physical factors alone cannot fully explain the
development and progression of T2D. Compared to nondiabetic individuals,
diabetes patients experience chronic allostatic load, which manifests as
impaired post-stress recovery and blunted stress reactivity (Carvalho
et al., 2015; Steptoe et al., 2014).
The prevalence of diabetes is higher among low-income groups and ethnic
minorities (Lanting, Joung, Mackenbach, Lamberts, & Bootsma, 2005). Prior
research shows that stress levels increase as socioeconomic status
decreases (Orpana, Lemyre, & Kelly, 2007). Occupational and financial
stressors, frequently experienced by low-income groups, have been shown
to be significantly associated with the development and progression of T2D
(Morikawa et al., 2005). Social stressors – for example, social rejection – have
also been shown to correlate with the physiologic stress response (Slavich,
O'Donovan, Epel, & Kemeny, 2010). A considerable amount of empirical
evidence implies the importance of delivering stress-relief interventions for
low-income T2D patients.
In existing studies, social scientists employ statistical analyses to identify
the association between stress and the development of T2D at the group level
(Kelly & Ismail, 2015). Machine learning algorithms, by contrast, make it
possible to predict stress levels in individual cases.
The purpose of this chapter is to predict stress levels of low-income T2D
patients using machine learning–based classification techniques, aspiring to
build models that can be utilized to help identify vulnerable individuals and
thus provide timely interventions.
Data Description
A Brief Background of the Data Set Used
This project employs secondary data from the Diabetes–Depression Care-
management Adoption Trial (DCAT). The purpose of the DCAT program was
to reduce health disparities among low-income populations by optimizing depression
screening, treatment, follow-up, outcomes, and cost savings using technology
(Wu et al., 2014). In collaboration with Los Angeles County Department of
Health Services (DHS) Ambulatory Care Network (ACN), DCAT intervention
was implemented from 2010 to 2013 among 1,400 low-income diabetes
patients in Los Angeles County (Wu et al., 2014). The participants were pre-
dominantly Hispanic/Latino immigrants (89%). The DCAT program collected
baseline and 6-, 12-, and 18-month follow-up data (Wu et al., 2014). This
current study analyzes only the baseline data.
We constructed the following data sets from the baseline data:
1. Cleansed data set: One major drawback of our data is the presence of many
missing values, so the first data set was created by removing all records
with missing values. The resulting cleansed data set has 130 features and
1,340 patient records; roughly 35% of the data was removed. This data set
acts as the baseline against which we compare the performance of the data
sets described below.
2. Data sets based on categories of predictor variables: There are several
ways in which the features or predictor variables could be divided into
categories. We divided them into four major categories, namely
demographic, biological (health-related variables such as weight,
other ailments, etc.), psychological, and social. The intuition
behind this grouping is that, due to financial or time constraints, it may
not always be feasible to run the entire study and collect all the
information in our data set. Hence, if we have access to only some
categories of information, can we still obtain a model whose performance is
comparable to a model trained on the data set in (1)?
3. Top 25 features data set: We used a machine learning model called
extremely randomized trees, a tree-based ensemble method, to learn the top
25 features. We chose 25 in consultation with domain experts. The goal was
to compare the predictor variables selected by this method with those in
the data sets in (2) and to get an intuitive sense of the features selected
by the machine learning model.
Experimental Setup
For each of the data sets listed above, we randomly divide the data into 70%
for training and 30% for testing; that is, we use 70% of the data in each
data set to learn the machine learning models and then evaluate the learned
models on the held-out test data. The metrics used for evaluation are
described in the metrics section that follows.
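As a concrete illustration, the 70/30 split can be sketched as follows. Since the DCAT data are not public, synthetic data with the same shape as the cleansed data set (1,340 records, 130 features) stand in for the real records, and the use of scikit-learn here is our assumption rather than the chapter's stated tooling.

```python
# Sketch of the 70/30 train/test split, using synthetic data in place of
# the (non-public) DCAT records. Feature and record counts are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 1,340 records with 130 features, mirroring the cleansed data set's shape.
X, y = make_classification(n_samples=1340, n_features=130, random_state=42)

# Hold out 30% of the records for evaluation, as in the experimental setup.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```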
Models of Classification
Our problem is a classification task (Caruana & Niculescu-Mizil, 2006): i.e.,
using the machine learning algorithms described in this section, we want to
classify each patient in the data sets into either the high-stress or the low-stress
category.
An important term used in the context of classification is the decision
boundary, i.e., a separation point. The idea of the decision boundary is that
Subhasree Sengupta, Kexin Yu, and Behnam Zahiri 159
it serves as a threshold such that any value above the threshold is classified
as "True" (high stress in our application) and anything below is classified as
"False" (low stress in our application). This separation can be expressed as a
line or hyperplane in the case of linear decision boundaries, but often a line
cannot serve as an efficient separator. So, depending on the distribution of
the data, we may need to resort to nonlinear separators.
Given this, we use three widely used machine learning models of classification.
Their popularity stems from several factors, such as efficient software
implementations that allow the models to scale well with increasing data-set
size, and interpretability, i.e., the ease of understanding what the model is
accomplishing.
Neural Network
Very briefly, the mathematical intuition behind a feed-forward neural network
can be expressed as follows:
1. Each layer consists of a set of neurons that serve as the inputs for that layer.
2. We then take a weighted combination of the inputs coming through these
neurons.
3. Some form of activation function is applied.
4. Now we take the output of (3) to become inputs of the next layer, and this
follows until the output layer.
5. Finally, the output obtained from the final layer is compared against the
actual output; the error is computed and the weights all the way back to the
input layer are adjusted. This process is also called back propagation.
6. This entire process can be repeated many times to better train the model;
each full cycle of steps 1–5 is called an epoch.
We use the Keras wrapper to implement the architecture of the neural
network, with TensorFlow as the backend for training the weights.
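The chapter's networks were implemented in Keras with TensorFlow. To make steps 1 through 6 concrete without those dependencies, the following is a minimal NumPy sketch of a single-hidden-layer network trained by back propagation on toy data; the data, layer sizes, and learning rate are illustrative assumptions, not the chapter's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-classification data (a stand-in for the patient records).
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Step 1: weights connecting the input layer to one hidden layer, and the
# hidden layer to the output layer.
W1 = rng.normal(scale=0.5, size=(4, 8))
W2 = rng.normal(scale=0.5, size=(8, 1))

losses = []
for epoch in range(500):          # step 6: each full cycle is one epoch
    h = sigmoid(X @ W1)           # steps 2-3: weighted sum + activation
    out = sigmoid(h @ W2)         # step 4: hidden outputs feed the next layer
    loss = np.mean((out - y) ** 2)
    losses.append(loss)
    # Step 5: compare with the actual output, compute the error, and adjust
    # the weights all the way back to the input layer (back propagation).
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out / len(X)
    W1 -= 0.5 * X.T @ d_h / len(X)
```

Repeating the cycle drives the loss down over the epochs, which is exactly the training behavior step 6 describes.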
Decision Tree
Decision tree–based methods have gained a lot of importance in the mining
of health care data (Koh & Tan, 2011). The ease of interpreting this
method makes it a popular choice among machine learning methods. For a
concrete understanding of how a decision tree is constructed, please refer to
Quinlan (1987).
Aside from adding validity to the results obtained by the neural network,
another purpose of the decision tree method is to get an idea about which
predictor variables significantly affect the predictive capability of the learned
model, thereby giving us an intuition into which features we necessarily need
to keep when selecting the top 25 features from the dataset.
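A minimal sketch of this use of a decision tree, assuming scikit-learn and synthetic data in place of the DCAT records: the fitted tree exposes a per-predictor importance score, which is the kind of signal described above for deciding which features to keep.

```python
# Fit a shallow decision tree and read off per-feature importance scores.
# The data are synthetic; feature counts and depth are illustrative.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Importance scores are normalized to sum to 1; a higher score means the
# feature drives more of the tree's splits.
importances = tree.feature_importances_
```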
Logistic Regression
Logistic regression is a common technique for classification, used when the
outcome variable is categorical (binary in this application).
Thus, this technique will serve more as a baseline or benchmark for us to
compare with the machine learning classifiers as described earlier.
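A baseline of this kind can be sketched as follows; scikit-learn and synthetic data are assumptions for illustration, not the chapter's exact setup.

```python
# A logistic regression baseline: fit on 70% of the data and score the rest.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The held-out accuracy serves as the benchmark against which the other
# classifiers are compared.
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = baseline.score(X_te, y_te)
```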
Feature Selection
The top 25 features were selected by a machine learning model called
extremely randomized trees (Geurts, Ernst, & Wehenkel, 2006). It is a tree-
based ensemble method. An ensemble is a collection of many decision trees.
Besides accuracy, the main strength of the resulting algorithm is computational
efficiency. This technique gives us the importance value for each feature
(predictor) in obtaining the output. We then sort the features by this value in
descending order and select the top 25 features.
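A sketch of this selection procedure, assuming scikit-learn's ExtraTreesClassifier (an implementation of extremely randomized trees) and synthetic data with the same feature count as the cleansed data set:

```python
# Rank features by ensemble importance and keep the top 25, mirroring the
# selection step described above. The data are synthetic and illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=500, n_features=130, n_informative=25,
                           random_state=0)
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# Sort feature indices by importance in descending order and keep the top 25.
order = np.argsort(forest.feature_importances_)[::-1]
top25 = order[:25]
```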
Metrics of Evaluation
Some important terminology:
1. True positives (TP): Those patients in the test set who had “high” stress
and were predicted to have high stress.
2. False positives (FP): Those patients in the test set who had “low” stress
and were predicted to have high stress.
3. True negatives (TN): Those patients in the test set who had “low” stress
and were predicted to have low stress.
4. False negatives (FN): Those patients in the test set who had “high” stress
and were predicted to have low stress.
Ideally, we want a model in which the number of true positives and true
negatives is high, which means we have learned a very good model.
5. AUC: Area under the ROC (receiver operating characteristic) curve.
The ROC curve plots the true positive rate (y-axis) against the false
positive rate (x-axis); since we want to increase the true positive rate, we
want a higher AUC value. The larger the area under the plotted curve, the
better our model's performance.
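The four counts and the two ROC axes can be computed by hand. The labels below are a made-up illustration (1 = high stress), not drawn from the DCAT data:

```python
# Confusion counts and ROC-axis rates for a small illustrative label set.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # actual stress category (1 = high)
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]  # model's predicted category

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

true_positive_rate = tp / (tp + fn)   # recall; the ROC curve's y-axis
false_positive_rate = fp / (fp + tn)  # the ROC curve's x-axis
```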
Table 9.4. Results for data set with only demographic info
Table 9.5. Results for data set with only biological info
Table 9.6. Results for data set with demographic and biological info
Table 9.7. Results for data set with demographic, biological, and psychological info
Table 9.8. Results for data set with demographic, biological, psychological, and social info
The remaining results track the neural network experiments; they reflect the
same trends and hence have not been explicitly presented here. An important
observation is that increasing the cutoff value produces a significant drop in
the recall score: the true positive rate drastically falls, which is not a
desirable outcome. Overall, taking all the metrics into consideration, a
cutoff score of 15 seems to perform best; in terms of a range for the cutoff
value, 15–19 performs best.
Model Stability
One significant observation about the experiments described earlier in the
chapter is that when the data set is manually partitioned into the demographic,
biological, psychological, and social data sets, experimenting on these data
sets, or on combinations of them, severely affects model performance. For
example, on the combined demographic, biological, psychological, and social
data set, at a cutoff of 39 the recall score drops to 0.04 and the FP:TP ratio
is 22:1, which is not desirable at all. This is counterintuitive: because most
of the features chosen by our feature selection procedure are psychological, we
would presume that the model would perform better. A probable reason is that
some form of overfitting is taking place. We also find that, rather than
experimenting with only demographic or only biological data, using a
combination of both improves model performance, but adding psychological and
social variables does not boost performance; in fact, quite the opposite.
Feature Selection
We perform feature selection using a machine learning model called extremely
randomized trees. Using this we can identify the top 25 features in the data
set. The selected features are a mixture of biological, psychological, and some
demographic features. Interestingly, we find that the features located at the top
two levels of the decision tree as applied to the demographic and biological
data sets also appear in the features selected by the machine learning algorithm
for feature selection. This helps interpret the machine learning–based feature
selection method.
We find that there is not as much fluctuation in model performance as we saw
when trying the data sets with distinct categories created manually
(demographic, biological, psychological, and social). Thus, this machine
learning approach is more stable than the manual category–based
data-partitioning method we have presented.
Conclusion
We present a rigorous machine learning-based analysis with varying data
sets. We present results based on three popular classification models (neural
networks, decision trees, logistic regression). We address the challenge of
transforming our continuous predicted outcome variable into a categorical
variable with two categories (high versus low). To achieve this, we experiment
with several "cutoff scores" to divide the values into the two categories. We
present a comparison of various cutoffs across these different data sets and
find an optimal cutoff range: based on the experiments and across all data
sets, the optimal cutoff should be in the range of 15–19. We also conduct
feature selection to select the top 25 features in the data set, in view of
situations where gathering the full list of variables in the given data set
might be infeasible, and present the results. The selected features
are predominantly psychological variables, some biological variables, and a
few demographic variables. We further validate this feature selection process
by comparing it with results obtained from manual splits of the data based on
demographic, biological, social, and psychological categories. In conclusion,
our extensive analysis provides important insights and a foundation for using
computational tools from artificial intelligence to answer challenging social
work questions.
Future Work
There are several ways to extend the analysis techniques presented in this
chapter. An interesting direction for further exploration is to add a layer of
time-series analysis to the current investigation. This can be achieved by
gathering data at different time periods and then predicting stress using that
information along with historical observations. Further, the feature selection
process can be applied to each of the data sets representing the four
categories mentioned previously, and more categories of predictor variables
can be explored than those we have used. Finally, more sophisticated machine
learning techniques can be used to tackle missing data.
References
Caruana, R., & Niculescu-Mizil, A. (2006, June). An empirical comparison of super-
vised learning algorithms. In Proceedings of the 23rd International Conference on
Machine Learning (pp. 161–168). Association for Computing Machinery.
Carvalho, L. A., Urbanova, L., Hamer, M., Hackett, R. A., Lazzarino, A. I., & Steptoe,
A. (2015). Blunted glucocorticoid and mineralocorticoid sensitivity to stress in
people with diabetes. Psychoneuroendocrinology, 51, 209–218. https://doi.org/
10.1016/j.psyneuen.2014.09.023
Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine
Learning, 63(1), 3–42.
Gregg, E. W., Beckles, G. L., Williamson, D. F., Leveille, S. G., Langlois, J. A.,
Engelgau, M. M., & Narayan, K. M. (2000). Diabetes and physical disability
among older US adults. Diabetes Care, 23(9), 1272–1277.
Kelly, S. J., & Ismail, M. (2015). Stress and type 2 diabetes: A review of how stress
contributes to the development of type 2 diabetes. Annual Review of Public Health,
36, 441–462. https://doi.org/10.1146/annurev-publhealth-031914-122921.
Kirkman, M. S., Briscoe, V. J., Clark, N., Florez, H., Haas, L. B., Halter, J. B., . . . Swift,
C. S. (2012). Diabetes in older adults. Diabetes Care, 35(12), 2650–2664. https://
doi.org/10.2337/dc12-1801.
Koh, H. C., & Tan, G. (2011). Data mining applications in healthcare. Journal of
Healthcare Information Management, 19(2), 65.
Lanting, L. C., Joung, I. M. A., Mackenbach, J. P., Lamberts, S. W. J., & Bootsma, A.
H. (2005). Ethnic differences in mortality, end-stage complications, and quality of
care among diabetic patients. Diabetes Care, 28(9), 2280–2288. https://doi.org/10
.2337/diacare.28.9.2280.
Lisboa, P. J., & Taktak, A. F. (2006). The use of artificial neural networks in decision
support in cancer: A systematic review. Neural Networks, 19(4), 408–415.
Morikawa, Y., Nakagawa, H., Miura, K., Soyama, Y., Ishizaki, M.,
Kido, T., . . . Nogawa, K. (2005). Shift work and the risk of diabetes mellitus among
Japanese male factory workers. Scandinavian Journal of Work, Environment &
Health, 31(3), 179–183.
Orpana, H. M., Lemyre, L., & Kelly, S. (2007). Do stressors explain the association
between income and declines in self-rated health? A longitudinal analysis of the
National Population Health Survey. International Journal of Behavioral Medicine,
14(1), 40–47.
Quinlan, J. R. (1987). Simplifying decision trees. International Journal of
Man-Machine Studies, 27(3), 221. https://doi.org/10.1016/S0020-7373(87)80053-6.
Schwartz, A. V., Hillier, T. A., Sellmeyer, D. E., Resnick, H. E., Gregg, E., Ensrud, K.
E., . . . Cummings, S. R. (2002). Older women with diabetes have a higher risk of
falls. Diabetes Care, 25(10), 1749–1754.
Slavich, G. M., O’Donovan, A., Epel, E. S., & Kemeny, M. E. (2010). Black sheep
get the blues: A psychobiological model of social rejection and depression.
Neuroscience & Biobehavioral Reviews, 35(1), 39–45.
Steptoe, A., Hackett, R. A., Lazzarino, A. I., Bostock, S., La Marca, R., Carvalho,
L. A., & Hamer, M. (2014). Disruption of multisystem responses to stress
in type 2 diabetes: Investigating the dynamics of allostatic load. Proceedings
of the National Academy of Sciences of the United States of America, 111(44),
15693–15698. https://doi.org/10.1073/pnas.1410401111
Thompson, B. (2004). Exploratory and confirmatory factor analysis: Understand-
ing concepts and applications. Washington, DC: American Psychological
Association.
Volpato, S., Blaum, C., Resnick, H., Ferrucci, L., Fried, L. P., & Guralnik, J. M. (2002).
Comorbidities and impairments explaining the association between diabetes and
lower extremity disability. Diabetes Care, 25(4), 678–683. https://doi.org/10
.2337/diacare.25.4.678
Volpato, S., Leveille, S. G., Blaum, C., Fried, L. P., & Guralnik, J. M. (2005). Risk
factors for falls in older disabled women with diabetes: The women’s health
and aging study. The Journals of Gerontology Series A: Biological Sciences and
Medical Sciences, 60(12), 1539–1545.
World Health Organization. (2016). Global report on diabetes. Retrieved January 21,
2017, from http://www.who.int/diabetes/global-report/en/.
Wu, S., Ell, K., Gross-Schulman, S. G., Sklaroff, L. M., Katon, W. J.,
Nezu, A. M., . . . Guterman, J. J. (2014). Technology-facilitated depression care
management among predominantly Latino diabetes patients within a public safety
net care system: Comparative effectiveness trial design. Contemporary Clinical
Trials, 37(2), 342–354.