Considering Interpersonal Differences in Validating Wearable Sleep-Tracking Technologies

2017 Tenth International Conference on Mobile Computing and Ubiquitous Network (ICMU)
Zilu Liang
Dept. of Engineering, The University of Tokyo
Hongo 7-3-1, Tokyo, Japan 113-0022
z-liang@t-adm.t.u-tokyo.ac.jp

Mario Alberto Chapa Martell
CAC Corporation
Hakozaki-cho 24-1, Tokyo, Japan
Mchapam0300@gmail.com

Abstract—This study investigated an important yet usually neglected question in validating consumer sleep tracking devices: do interpersonal differences play a role in the validity of consumer sleep trackers? Most existing validation studies assume that the accuracy of these devices is consistent across individuals, ignoring the fact that human sleep demonstrates significant interpersonal differences. This study aimed to test this assumption by validating two of the newest home sleep tracking devices, i.e. Fitbit Charge 2 (wearable wristband) and Neuroon (wearable EEG eye mask), against a clinical device. Two participants were recruited to track their sleep using a Fitbit, a Neuroon, and the clinical device SLEEP SCOPE simultaneously for approximately 40 nights. Data analysis was conducted for each participant separately. The results showed that the Fitbit measurements completely deviated from the clinical measurements for one participant, but agreed very well for the other participant, especially in the dimension of total sleep time. Neuroon measurements agreed well with but were uncorrelated to the clinical measurements for one participant, and were significantly correlated but agreed poorly for the other participant. This study suggests that interpersonal differences should be considered in future validation studies and in designing better consumer sleep-tracking technologies.

Keywords—Sleep; wearables; EEG; Fitbit; validation; n-of-1; personal informatics.

I. INTRODUCTION

As sensing and wearable technologies advance, many home sleep tracking technologies have emerged in the consumer market. These technologies range from free mobile applications (e.g. Sleep As Android, Sleep Bot) to affordable wearable devices (e.g. Fitbit, Jawbone) to portable EEG (e.g. Neuroon and Sleep Shepherd). Since human sleep demonstrates significant interpersonal differences [1-2], it is an interesting question whether consumer sleep trackers are equally accurate for all individuals. Most of the work in the literature assumes that the accuracy of these devices is consistent across individuals [3].

The aim of this study is to test this assumption by validating two of the newest home sleep tracking devices against a clinical device. To investigate interpersonal differences, we adopted the n-of-1 approach (i.e. personalized approach) [4] in place of the 1-of-n approach (i.e. large-sample approach) [5] that has been used in most of the existing studies. We recruited two participants who wore a Fitbit Charge 2 wearable wristband, a Neuroon wearable EEG eye mask, and a clinical device named SLEEP SCOPE (Sleep Well Co., Osaka, Japan) for approximately 40 nights. The measurements of the consumer devices were validated against those of SLEEP SCOPE, and this analysis was done for each participant separately. We then compared the results obtained from each of the participants. The measurements by SLEEP SCOPE were used as the ground truth.

We used several statistical techniques to analyze the validity of the consumer sleep trackers. The results revealed significant interpersonal differences. First, the Fitbit measurements completely deviated from the clinical measurements for one participant, but agreed very well for the other participant, especially in the dimension of total sleep time (TST). Second, the Neuroon measurements agreed well with but were uncorrelated to the clinical measurements for one participant (on SOL), and were significantly correlated but agreed poorly for the other participant (on WASO and SE). In addition, we found that good agreement between Neuroon and SLEEP SCOPE generally occurred when the user had a good night of sleep. The main contributions of this study are two-fold: (1) the results revealed that the validity of consumer wearable devices varies from user to user; (2) we identified conditions for good validity of consumer wearable devices to inform the future design of accurate consumer sleep trackers. The outcome of this study brings forward an important question that needs to be considered in future validation studies and in designing better consumer sleep tracking technologies.

II. RELATED WORK

In clinical studies, polysomnography (PSG) and actigraphy are the most widely used methods for objectively measuring sleep. PSG has been most useful for the diagnosis and treatment of sleep disorders such as obstructive sleep apnea (OSA), narcolepsy, and rapid-eye-movement sleep behavior disorder [6]. Actigraphy, on the other hand, is mostly used for the diagnosis of circadian rhythm disorders [7].

In addition to the clinical sleep monitoring devices, many consumer sleep tracking devices have emerged over recent years. These technologies can be generally divided into two categories, i.e. sleep tracking devices based on movement (e.g. Fitbit, Jawbone, mobile applications) and sleep tracking devices based on brain activity signals (e.g. Zeo headband, Neuroon eye mask, Sleep Shepherd headband) [14-15]. This field is expanding rapidly and new sleep tracking devices are introduced to the consumer market every year. Validation studies on previous versions of Fitbit and Jawbone found that
This study was sponsored by JSPS KAKENHI Grant-in-Aid for Research
Activity Start-up (Grant Number 16H07469). The opinions expressed in
this paper do not represent the views of the second author’s company.
978-4-907626-31-0/17/$31.00 ©2017
Fitbit devices overestimated TST and SE and underestimated WASO compared to PSG and clinical actigraphy in “normal mode”, whereas in “sensitive mode” Fitbit devices substantially underestimated TST and SE and overestimated WASO [8-9]. In this study, we aim to investigate the impact of interpersonal differences on the accuracy of recently released sleep trackers in the consumer market, i.e. Fitbit Charge 2 wristband and Neuroon wearable EEG eye mask, with the purpose of establishing a baseline for the state of the art in this field.

TABLE I. DEFINITION OF MEASURED SLEEP METRICS
Metrics                          Definition
Total Sleep Time (TST)           Time in minutes from sleep onset to sleep offset.
Wake after Sleep Onset (WASO)    Periods of wakefulness occurring after defined sleep onset.
Sleep Onset Latency (SOL)        Time in minutes from “light out” to the first epoch scored as sleep.
Awake Count (NAWK)               The number of awakenings occurring after defined sleep onset.
Sleep Efficiency (SE)            Percentage of total time in bed actually spent in sleep: TST/(TST+WASO).

III. METHOD

Two participants without sleep problems were recruited for collecting sleep data (participant 1: sex=F, age=28; participant 2: sex=M, age=32). Each participant wore a Fitbit Charge 2 on the non-dominant wrist and a Neuroon wearable EEG eye mask on the lower forehead. The two electrodes of the clinical device SLEEP SCOPE were attached to the upper forehead and behind the ear, respectively. These devices were used simultaneously to measure the sleep metrics summarized in Table I, which also gives the definitions of the metrics [10]. We used Bland-Altman plots [11], which plot the difference between two measurements against their mean, to evaluate the agreement between the consumer sleep trackers (i.e., Fitbit and Neuroon) and the clinical sleep monitor (SLEEP SCOPE). The Wilcoxon Signed Rank test [12] was used to compare the sleep measurements of the consumer devices with those of the clinical device. Pearson correlation coefficients were used to evaluate the linear relationship between the consumer devices and the clinical device.

IV. RESULTS

A. Data Preprocessing

In total, 43 nights of sleep were collected from participant 1 and 49 nights of sleep were collected from participant 2. Many data were missing due to device failures, especially for Neuroon and SLEEP SCOPE. The reasons for device failures include sensor misplacement (e.g. device sliding off the face, poor electrode contact with the skin), battery issues (e.g. device running out of battery during the measurement), and user non-compliance (e.g. terminating the measurement in the middle of the night). After pairwise removal of nights with missing data, we obtained two datasets of 11 (Neuroon vs SLEEP SCOPE) and 14 (Fitbit vs SLEEP SCOPE) entries for participant 1, and two datasets of 22 (Neuroon vs SLEEP SCOPE) and 21 (Fitbit vs SLEEP SCOPE) entries for participant 2. The basic statistics of sleep quality for both participants are shown in Table II (according to the measurements of SLEEP SCOPE).

TABLE II. BASIC STATISTICS OF SLEEP QUALITY
                 TST (min)   WASO (min)   SOL (min)   NAWK    SE (%)
Participant 1    408±30      34±11        21±14       34±8    88±4
Participant 2    412±40      34±33        8±7         27±8    90±7

B. Agreement between Consumer and Clinical Devices

The Bland-Altman plots on aggregated sleep metrics are shown in Figure 1 (Neuroon vs SLEEP SCOPE) and Figure 2 (Fitbit vs SLEEP SCOPE). Generally, Fitbit achieved better agreement with SLEEP SCOPE than Neuroon did. For participant 1, however, Neuroon yielded better agreement with SLEEP SCOPE in terms of SOL. For both participants, good agreement between Neuroon and SLEEP SCOPE occurred when the sleep was good, i.e., TST > 400 min, WASO < 18 min, SOL < 30 min and SE > 80% [13].

We delved deeper into the trend of device difference as a function of device mean and found that interpersonal differences mainly manifested in the dimension of SOL. In what follows, the term “device difference” refers to the difference between the measurement of a consumer device (either Neuroon or Fitbit) and the measurement of the clinical device SLEEP SCOPE, while the term “device mean” refers to the mean of the measurements by the two devices. According to the Bland-Altman plots for Neuroon in Figure 1, device difference increases as device mean decreases for participant 2, but no trend was observed for participant 1. In the case of Fitbit, shown in Figure 2, device difference increases as device mean increases for participant 1, but no trend was observed for participant 2.

C. Differences Between Consumer and Clinical Devices

The results of the Wilcoxon Signed Rank (WSR) test are shown in Table III (for TST, WASO, SOL, NAWK, SE). Bold numbers indicate significant differences between devices (p < 0.05). There are statistically significant differences between consumer and clinical devices in many sleep dimensions, but the results are not the same for the two participants. Neuroon worked better on participant 1, for whom no statistically significant difference between Neuroon and SLEEP SCOPE was found in the dimension of SOL. In comparison, Fitbit worked better on participant 2, for whom no statistically significant difference was found in the dimensions of TST and WASO.

TABLE III. Z-VALUES OF WSR TEST ON SLEEP METRICS (a)
                                          TST     WASO    SOL     NAWK    SE
Fitbit vs SLEEP SCOPE (Participant 1)     -3.29   2.73    3.31    3.29    -3.29
Neuroon vs SLEEP SCOPE (Participant 1)    2.75    -2.93   -1.33   2.93    2.67
Fitbit vs SLEEP SCOPE (Participant 2)     -1.49   -0.35   2.80    3.98    -2.05
Neuroon vs SLEEP SCOPE (Participant 2)    3.81    -4.04   -3.78   4.11    3.94
a. Bold indicates a significant difference between devices (p < 0.05).
D. Relationship between Consumer and Clinical Devices

The Pearson correlation coefficients were computed to assess the linear relationship between the measurements by consumer devices and those by the clinical device. The results on TST, WASO, SOL, NAWK and SE are summarized in Table IV. For participant 1, no significant correlation was found between Neuroon/Fitbit and SLEEP SCOPE. For participant 2, strong correlation was found between Fitbit and SLEEP SCOPE in the dimension of TST (r = 0.72, p < 0.001), and moderate correlation was found in terms of WASO (r = 0.43, p = 0.049). Moderate correlation was also found between Neuroon and SLEEP SCOPE in multiple sleep dimensions for participant 2, including WASO (r = 0.46, p = 0.029), SE (r = 0.42, p = 0.054), and Wake Ratio (r = 0.45, p = 0.036).

E. Discussions

The results presented above demonstrated that the validity of Fitbit and Neuroon was not consistent between the two participants.

(a) Bland-Altman plots between Neuroon and SLEEP SCOPE on aggregated sleep metrics for participant 1.
(b) Bland-Altman plots between Neuroon and SLEEP SCOPE on aggregated sleep metrics for participant 2.
Fig. 1. Bland-Altman plots of aggregated sleep metrics (TST, WASO, NAWK, SOL, SE) measured by Neuroon and SLEEP SCOPE for each participant. (The x-axis is the mean of Neuroon and SLEEP SCOPE, and the y-axis is the difference between the two devices. Dotted lines represent two standard deviations from the mean.)

(a) Bland-Altman plots between Fitbit and SLEEP SCOPE on aggregated sleep metrics for participant 1.
(b) Bland-Altman plots between Fitbit and SLEEP SCOPE on aggregated sleep metrics for participant 2.
Fig. 2. Bland-Altman plots of aggregated sleep metrics (TST, WASO, NAWK, SOL, SE) measured by Fitbit and SLEEP SCOPE for each participant. (The x-axis is the mean of Fitbit and SLEEP SCOPE, and the y-axis is the difference between the two devices. Dotted lines represent two standard deviations from the mean.)
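
The quantities plotted in Figs. 1 and 2, and the reason agreement and correlation are reported separately, can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code. The SOL values are invented, the function names are hypothetical, and the limits are taken as the mean difference plus or minus two standard deviations, following the figure captions.

```python
import math
from statistics import mean, stdev


def bland_altman(consumer, clinical):
    """Bias and limits of agreement (bias +/- 2 SD) for paired measurements."""
    diffs = [c - k for c, k in zip(consumer, clinical)]        # y-axis values
    means = [(c + k) / 2 for c, k in zip(consumer, clinical)]  # x-axis values
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return {"means": means, "diffs": diffs, "bias": bias,
            "lower": bias - 2 * sd, "upper": bias + 2 * sd}


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical SOL values in minutes (illustrative only, not study data).
clinical = [10, 12, 11, 13, 12, 10]
tracker_a = [12, 10, 12, 11, 11, 12]  # close to the clinical value every night
tracker_b = [30, 36, 33, 39, 36, 30]  # exactly three times the clinical value

ba_a, r_a = bland_altman(tracker_a, clinical), pearson_r(tracker_a, clinical)
ba_b, r_b = bland_altman(tracker_b, clinical), pearson_r(tracker_b, clinical)

print(f"A: bias = {ba_a['bias']:.1f} min, r = {r_a:.2f}")
print(f"B: bias = {ba_b['bias']:.1f} min, r = {r_b:.2f}")
```

In this toy data, tracker A has zero mean difference from the clinical device yet is not positively correlated with it, while tracker B is perfectly correlated but reads far too high. The two analyses therefore answer different questions, which is why a device can agree well but be uncorrelated for one participant and be correlated but agree poorly for another.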
TABLE IV. CORRELATION ANALYSIS ON AGGREGATED SLEEP METRICS (a)
                                          TST        WASO    SOL     NAWK    SE
Fitbit vs SLEEP SCOPE (Participant 1)     0.25       0.25    0.04    0.08    0.23
Neuroon vs SLEEP SCOPE (Participant 1)    0.42       0.09    -0.53   0.18    -0.16
Fitbit vs SLEEP SCOPE (Participant 2)     0.72****   0.43*   0.22    0.02    0.23
Neuroon vs SLEEP SCOPE (Participant 2)    0.36       0.46*   -0.28   0.27    0.42*
a. Bold indicates a significant correlation (*p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001, ****p < 0.000).

Fitbit measurements completely deviated from those by SLEEP SCOPE for participant 1, whereas for participant 2 Fitbit achieved good agreement and significant correlation with SLEEP SCOPE in the dimensions of TST and WASO. An investigation into the scatterplot between Fitbit and SLEEP SCOPE on SOL showed that Fitbit underestimated SOL compared to SLEEP SCOPE. On the other hand, Neuroon showed good agreement with SLEEP SCOPE in the dimension of SOL for participant 1, but agreed poorly for participant 2. Noting that participant 1 had longer SOL than participant 2, it is likely that participant 1 was simply lying still during the process of falling asleep. Since Fitbit measures sleep based on movement, it may count this segment as time asleep rather than time to fall asleep. In contrast, because Neuroon measures sleep based on brainwaves, it was able to accurately classify this segment as time to fall asleep even if the participant was not moving. As for participant 2, the disparity between Neuroon and SLEEP SCOPE in the dimension of SOL may be due to the fact that he often fell asleep while reading in bed without putting on the Neuroon eye mask properly. After a little while, he would wake up to put on the eye mask. This may explain the overestimation of SOL for him.

In addition to the interpersonal differences, we also found a common trend in the validity of Neuroon for both participants. Good agreement between Neuroon and SLEEP SCOPE occurred when the sleep quality of the user matched the clinical definition of good sleep, which suggests that wearable EEG sleep trackers may be more accurate when the user has a good night of sleep.

Based on this analysis, we may conclude that the validity of consumer sleep tracking devices is affected by users’ sleep quality per se as well as users’ sleep habits. Fitbit may produce less accurate sleep measurements for people who take a long time to fall asleep, while Neuroon may not work well for people who read in bed. In the next step, we will recruit a larger cohort to investigate the characteristics of people who tend to generate inaccurate Fitbit and Neuroon measurements, as well as the role of demographic variables such as gender, BMI and age. Based on such information, we may tune the sleep analysis model for each type of user.

V. CONCLUSION

In this study we validated two of the newest consumer sleep tracking devices, i.e., Fitbit Charge 2 and Neuroon wearable EEG eye mask, against the clinical device SLEEP SCOPE on two participants separately. The results demonstrated significant interpersonal differences. On the one hand, the Fitbit measurements completely deviated from the clinical measurements for one participant, but agreed very well for the other participant, especially in the dimension of TST. On the other hand, the Neuroon measurements agreed well with but were uncorrelated to the clinical measurements for one participant (on SOL), and were significantly correlated but agreed poorly for the other participant (on WASO, SE, Wake Ratio). Our study revealed that the validity of consumer sleep tracking devices varies from user to user. Such interpersonal differences should be considered in future validation studies and in designing better consumer sleep tracking technologies.

REFERENCES

[1] Z. Liang, M. A. Chapa-Martell, B. Ploderer, “Inter-individual differences in sleep quality: insights from mining wearable sleep-tracking data,” IPSJ Technical Report, 2017-MBL-82(54), pp. 1-6, 2017.
[2] H. P. A. van Dongen, K. M. Vitellaro, D. F. Dinges, “Individual differences in adult human sleep and wakefulness: Leitmotif for a research agenda,” Sleep, vol. 28, no. 4, pp. 479-496, 2005.
[3] J. Mantua, N. Gravel, and R. Spencer, “Reliability of sleep measures from four personal health monitoring devices compared to research-based actigraphy and polysomnography,” Sensors, vol. 16, 646, 2016.
[4] E. O. Lillie, B. Patay, J. Diamant, et al., “The n-of-1 clinical trial: the ultimate strategy for individualizing medicine?” Personalized Medicine, vol. 8, no. 2, pp. 161-173, 2011.
[5] J. C. Cappelleri, N. Ting, “A modified large-sample approach to approximate interval estimation for a particular intraclass correlation coefficient,” Statistics in Medicine, vol. 22, no. 11, pp. 1861-1877, 2003.
[6] C. A. Kushida, M. R. Littner, T. M. Morgenthaler, et al., “Practice parameters for the indications for polysomnography and related procedures: an update for 2005,” Sleep, vol. 28, no. 4, pp. 499-519, 2005.
[7] M. Littner, C. A. Kushida, W. M. Anderson, et al., “Practice parameters for the role of actigraphy in the study of sleep and circadian rhythms: an update for 2002,” Sleep, vol. 26, no. 3, pp. 337-341, 2003.
[8] B. P. Kolla, S. Mansukhani, and M. P. Mansukhani, “Consumer sleep tracking devices: a review of mechanisms, validity and utility,” Expert Review of Medical Devices, vol. 13, no. 5, pp. 497-506, 2016.
[9] M. de Zambotti, S. Claudatos, S. Inkelis, et al., “Evaluation of a consumer fitness-tracking device to assess sleep in adults,” Chronobiol Int., vol. 32, no. 7, pp. 1024-1028, 2015.
[10] D. Shrivastava, S. Jung, M. Saadat, et al., “How to interpret the results of a sleep study,” Journal of Community Hospital Internal Medicine Perspectives, vol. 4, 24983, 2014.
[11] J. M. Bland, D. G. Altman, “Statistical methods for assessing agreement between two methods of clinical measurement,” Lancet, vol. 1, pp. 307-310, 1986.
[12] F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80-83, 1945.
[13] M. M. Ohayon, M. A. Carskadon, C. Guilleminault, M. V. Vitiello, “Meta-analysis of quantitative sleep parameters from childhood to old age in healthy individuals: developing normative sleep values across the human lifespan,” Sleep, vol. 27, no. 7, 2004.
[14] Z. Liang, B. Ploderer, “Sleep tracking in the real world: a qualitative study into barriers for improving sleep,” in Proceedings of the 28th Australian Conference on Computer-Human Interaction, pp. 537-541, 2016.
[15] W. Liu, B. Ploderer, and T. Hoang, “In bed with technology: challenges and opportunities for sleep tracking,” in Proceedings of the Australian Computer-Human Interaction Conference, pp. 142-151, 2015.
