Article
From Pulses to Sleep Stages: Towards Optimized Sleep
Classification Using Heart-Rate Variability
Pavlos I. Topalidis 1 , Sebastian Baron 2,3 , Dominik P. J. Heib 1,4 , Esther-Sevil Eigl 1 ,
Alexandra Hinterberger 1 and Manuel Schabus 1, *
1 Laboratory for Sleep, Cognition and Consciousness Research, Department of Psychology, Centre for
Cognitive Neuroscience Salzburg (CCNS), Paris-Lodron University of Salzburg, 5020 Salzburg, Austria;
pavlos.topalidis@plus.ac.at (P.I.T.); dominik.heib@proschlaf.eu (D.P.J.H.); esther-sevil.eigl@plus.ac.at (E.-S.E.)
2 Department of Mathematics, Paris-Lodron University of Salzburg, 5020 Salzburg, Austria
3 Department of Artificial Intelligence and Human Interfaces (AIHI), Paris-Lodron University of Salzburg,
5020 Salzburg, Austria
4 Institut Proschlaf, 5020 Salzburg, Austria
* Correspondence: manuel.schabus@sbg.ac.at; Tel.: +43-662-8044-5113
Abstract: More and more people quantify their sleep using wearables and are becoming obsessed with their pursuit of optimal sleep (“orthosomnia”). However, many of these wearables have been criticized for giving inaccurate feedback, which can even lead to negative daytime consequences. Acknowledging these facts, we here optimize our previously suggested sleep classification procedure in a new sample of 136 self-reported poor sleepers to minimize erroneous classification during ambulatory sleep sensing. Firstly, we introduce an advanced interbeat-interval (IBI) quality control using a random forest method to account for wearable recordings in naturalistic and noisier settings. We further aim to improve sleep classification by opting for a loss function model instead of the overall epoch-by-epoch accuracy to avoid model biases towards the majority class (i.e., “light sleep”). Using these implementations, we compare the classification performance between the optimized (loss function) model and the accuracy model. We use signals derived from PSG, one-channel ECG, and two consumer wearables: the ECG breast belt Polar® H10 (H10) and the Polar® Verity Sense (VS), an optical photoplethysmography (PPG) heart-rate sensor. The results reveal a high overall accuracy for the loss function model in ECG (86.3%, κ = 0.79), as well as the H10 (84.4%, κ = 0.76) and VS (84.2%, κ = 0.75) sensors, with improvements in deep sleep and wake. In addition, the new optimized model displays moderate to high correlations and agreement with PSG on primary sleep parameters, while measures of reliability, expressed in intra-class correlations, suggest excellent reliability for most sleep parameters. Finally, it is demonstrated that the new model still classifies sleep accurately into 4 classes in users taking heart-affecting and/or psychoactive medication, which can be considered a prerequisite in older individuals with or without common disorders. Further improving and validating automatic sleep stage classification algorithms based on signals from affordable wearables may resolve existing scepticism and open the door for such approaches in clinical practice.

Keywords: automatic sleep-staging; machine learning; CNN; wearables; heart-rate variability; inter-beat intervals

Citation: Topalidis, P.I.; Baron, S.; Heib, D.P.J.; Eigl, E.-S.; Hinterberger, A.; Schabus, M. From Pulses to Sleep Stages: Towards Optimized Sleep Classification Using Heart-Rate Variability. Sensors 2023, 23, 9077. https://doi.org/10.3390/s23229077

Academic Editor: Yvonne Tran
Received: 7 October 2023; Revised: 31 October 2023; Accepted: 3 November 2023; Published: 9 November 2023

Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

Because sleep is an important factor related to health, disease, overall quality of life and peak performance [1], more and more people are monitoring it using wearable devices. The pursuit of good sleep can sometimes go too far, creating an obsession with monitoring and quantifying sleep that can result in increased stress and discomfort [2], a condition termed orthosomnia (i.e., from the Greek “ortho”, meaning straight or upright, and the Latin “somnia”). Many users of wearables have been erroneously led to believe, however, that commercial wearables can accurately and reliably measure sleep, even though the
majority lack scientific soundness and/or independent validation studies [3] against the
gold standard, Polysomnography (PSG), which is a combination of electroencephalography
(EEG), electrooculography (EOG), and electromyography (EMG).
Erroneous feedback about one’s sleep (e.g., substantial underestimation of deep sleep,
or wrong wake classification at sleep onset or during the night) can have serious adverse
effects, exacerbating people’s misperception and even leading to negative daytime con-
sequences [4]. Such erroneous feedback on wearables may also lead to inappropriate
suggestions for adjusting sleep habits and work against the aim of promoting better sleep
health [5]. This is particularly worrisome for people with sleep problems and preoccupa-
tions who are especially sensitive to feedback on their sleep [6–8]. The potential adverse
effects of inaccurate feedback in combination with limited and poor validation studies
against PSG certainly justify the skepticism around the broad use of wearables in the clinical
field [9]. At the same time, there are undeniably benefits of using accurate wearable devices
that affordably capture daily sleep changes in natural environments, outside of the laboratory, especially in light of recent studies showing that implementing such technologies together with sleep intervention protocols can have positive therapy outcomes [10–13]. It
is our opinion that such technologies, if optimized and carefully validated, will soon play a
central role in research and clinical practice as they allow continuous sleep measurements
(and feedback) in ecologically valid home environments and at an affordable cost.
However, only a few of the wearable technologies that claim multiclass epoch-by-
epoch sleep classification have been transparent considering their sensing technology and
models [14]. There are even fewer studies validated against PSG in suitable large (and
heterogeneous) samples, using appropriate performance metrics (e.g., Cohen’s κ, sensitivity
and specificity, F1) rather than solely relying on mean accuracy values per night or even
overall epoch-by-epoch accuracy percentages across sleep stages (see Topalidis et al. [15]).
Among the few, Chinoy et al. [16] recently compared the performance of seven consumer
sleep-tracking devices to PSG and reported that the reviewed devices displayed poor
detection of sleep stages compared to PSG, with consistent under- or overestimation of the
amount of REM or deep sleep. Chee et al. [17] validated two widely used sensors, the Oura
ring (v. 1.36.3) and Actiwatch (AW2 v. 6.0.9), against PSG in a sample of 53 participants
with multiple recordings each, varying in length (i.e., 6.5, 8, or 9 h). Compared to PSG,
the Oura ring displayed an average underestimation of about 40 min for TST (Total Sleep
Time), 16 min for REM (Rapid Eye Movement) sleep, and 66 min for light sleep, while it
overestimated Wake After Sleep Onset (WASO) by about 38 min and deep (N3) sleep by
39 min. Altini and Kinnunen [18] examined 440 nights from 106 individuals wearing the
Oura ring (v. Gen2M) and found a 4-class overall accuracy of 79% when including various
autonomic nervous system signals, such as heart-beat variability, temperature, acceleration
and circadian features. Crucially, the results of Altini and Kinnunen [18] suggest that it is
the autonomic nervous system signals, such as Heart-Rate Variability (HRV), that drive
higher accuracy of 4-class classification performance, as it is drastically increased when
adding these signals into the model trained only on movement and temperature signals.
We have recently developed a 4-class sleep stage classification model (wake, light, deep, REM) that reaches up to 81% accuracy and a κ of 0.69 [15] when using only data from
low-cost HRV sensors. Although such classification accuracies are approaching expert inter-
rater reliability [15], there are edge cases where classification is particularly difficult (e.g.,
longer periods of rest and absent movement while reading as compared to falling asleep) and
prone to errors, and may result in erroneous feedback to users. To deal with these cases, we first suggest implementing advanced signal quality controls that detect artefacts related to sensor detachment or excessive movement during sleep, which lead to insufficient signal quality (i.e., unreliable inter-beat-interval estimations) for meaningful classification. In fact, some of the
previously suggested models developed for one-channel ECG recordings (e.g., Mathunjwa
et al. [19], Habib et al. [20], Sridhar et al. [21]) address such issues by implementing
individual normalization and simple outlier corrections (e.g., linearly interpolating too
short/long inter-beat intervals). However, as we illustrate in Figure 1 with an example from our data, such a simple approach is sometimes insufficient: bad IBI signal quality can lead to erroneous sleep stage classification. We thus suggest incorporating advanced machine-learning approaches for bad-signal detection and more reliable classification.
Figure 1. Bad IBI signal, if not detected, can lead to erroneous sleep stage classification. There are bad epochs in (a) the raw IBI signal that can be identified using (b) an advanced IBI signal quality control procedure based on a trained random forest model. Due to bad signal quality, these epochs are erroneously sleep-staged (c), thus misrepresenting the ground truth (d).
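The simple outlier correction attributed above to prior one-channel ECG models (linearly interpolating implausibly short or long inter-beat intervals) can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation; the `lo`/`hi` plausibility thresholds are hypothetical values chosen for illustration.

```python
import numpy as np

def interpolate_ibi_outliers(ibi_ms, lo=300.0, hi=2000.0):
    """Linearly interpolate inter-beat intervals (in ms) that fall outside
    a plausible physiological range. Thresholds are illustrative only."""
    ibi = np.asarray(ibi_ms, dtype=float)
    bad = (ibi < lo) | (ibi > hi)
    if bad.all():
        raise ValueError("no valid IBIs to interpolate from")
    idx = np.arange(len(ibi))
    # Replace flagged samples by linear interpolation between valid neighbors.
    ibi[bad] = np.interp(idx[bad], idx[~bad], ibi[~bad])
    return ibi, bad
```

As the text argues, such a rule-based correction cannot distinguish longer stretches of corrupted signal from genuine physiology, which motivates the trained random forest quality control used here instead.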
Figure 2 further illustrates the bias of the accuracy model towards the majority class and how opting for the loss function model resolves this issue.
Figure 2. Example of a night where the accuracy model overestimates the majority class (light sleep). Panel (a) displays the actual PSG-based hypnogram as sleep-staged automatically using the G3 Sleepware gold standard. Panel (b) displays sleep staging using the “accuracy model”, while panel (c) displays the “loss function model”. Note that the accuracy model displays a bias towards the majority class (e.g., epochs marked in red shading), as it strives to maximize overall classification accuracy, especially in cases where the model is unsure. In contrast, the loss function model incorporates the skewed class distribution using categorical cross entropy weighted by class, thus correcting for a bias towards the majority class. In this example, both predicted hypnograms (b,c) use signals derived from the H10 sensor.
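The class-weighted categorical cross entropy described in the caption can be illustrated with a minimal NumPy sketch. The inverse-frequency weighting shown here is one common choice and an assumption for illustration, not necessarily the exact scheme the authors used.

```python
import numpy as np

def class_weights(labels, n_classes):
    """Inverse-frequency weights so that rare stages (e.g., deep sleep)
    contribute as much to the loss as the majority class (light sleep)."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * counts)  # assumes every class occurs

def weighted_cross_entropy(probs, labels, weights):
    """Mean class-weighted categorical cross entropy.
    probs: (N, C) predicted probabilities; labels: (N,) integer stages."""
    eps = 1e-12
    p = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    return float(np.mean(weights[labels] * -np.log(p)))
```

With uniform weights this reduces to plain cross entropy; with inverse-frequency weights, misclassifying a rare stage costs the model proportionally more, which is what counteracts the majority-class bias.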
Finally, we examine whether the optimized model still classifies sleep accurately in users taking psychoactive substances and/or medication with expected effects on the heart (beta blockers, etc.).
2. Methods
2.1. Participants
We recorded ambulatory PSG from 136 participants (female = 40; Mean Age = 45.29,
SD = 16.23, range = 20–76) who slept in their homes for one or more nights. In total, we obtained 265 nights of recordings, including the gold-standard PSG and ECG. From these participants,
112 wore a Polar® H10 chest sensor (see the Section 2.2) and 99 participants wore the Polar®
Verity Sense, for 178 and 135 nights, respectively. This sample was part of an ongoing
study, investigating the effects of online sleep training programs on participants with
self-reported sleep complaints, which included participants with no self-reported acute
mental or neurological disorders and who were capable of using smartphones. While we did not set strict exclusion criteria, our sample naturally consists of people with sleep difficulties.
Specifically, 84.8% of our sample had a Pittsburgh Sleep Quality Index (PSQI) score above
5, with an average score of 9.36 (SD = 3.21). For a subset of participants, we took a medical history and grouped them into those who were on psychoactive and/or heart medication (with medication, N = 17, Nnights = 40), and those who reported taking no medication
(without medication, N = 39, Nnights = 87). The study was conducted according to the
ethical guidelines of the Declaration of Helsinki.
2.2. Materials
We recorded PSG using the ambulatory varioport 16-channel EEG system (Becker
Meditec®, Karlsruhe, Germany) with gold cup electrodes (Grass Technologies, Astro—Med
GmbH®, Rodgau, Germany), according to the international 10–20 system, at frontal (F3,
Fz, F4), central (C3, Cz, C4), parietal (P3, P4), and occipital (O1, O2) derivations. We
recorded EOG using two electrodes (placed 1 cm below the left outer canthus and 1 cm
above the right outer canthus, respectively), and the chin muscle activity from two EMG
electrodes. Importantly, we recorded the ECG signal with two electrodes that we placed
below the right clavicle and on the left side below the pectoral muscle at the lower edge of
the left rib cage. Before actual sleep staging, we used BrainVision Analyzer 2.2 (Brain Products GmbH, Gilching, Germany) to re-reference the signal to the average mastoid signal, filter it according to the American Academy of Sleep Medicine (AASM) recommendations for routine PSG recordings (AASM Standards and Guidelines Committee [22]), and downsample it to 128 Hz (the original sampling rate was 512 Hz). Sleep was then automatically scored in
30-s epochs using standard AASM scoring criteria, as implemented in the Sleepware G3
software (Sleepware G3, Koniklijke Philips N.V., Eindhoven, The Netherlands). The G3
software is considered to be non-inferior to manual human staging and can be readily used
without the need for manual adjustment [23]. All PSG recordings in the current analysis
have been carefully manually inspected for severe non-physiological artifacts in the EEG,
EMG as well as EOG, as such effects would render our PSG-based classification (serving as
the gold standard) less reliable.
Having conducted extensive research on multiple sensors, we decided on two Polar®
(Polar® Electro Oy, Kempele, Finland) sensors as they came with the most accurate signal
as compared to gold-standard ECG [24–26]: the H10 chest strap (Model: 1W) and the Verity
Sense (VS, Model: 4J). Apart from good signal quality, both sensors have good battery life
(about 400 h for the H10 and 30 h for the VS), as well as low overall weight and volume, which
makes them comfortable to sleep with.
We synchronized the inter-beat intervals recorded by the wearables in time to the inter-beat intervals from the ECG channels of the PSG recording (using a windowed cross-correlation approach before sleep classification). As recordings involved
ambulatory PSG and sensor recordings in home settings, we excluded recordings in cases
where (i) signal quality was poor (>25% bad IBI data), (ii) sensor data could not be read
out, (iii) recordings were too short due to battery failure, and (iv) where synchronization
between sensor and PSG was not possible (due to one of the two time series being too
noisy). For eligible recordings with less than 25% bad signal quality, we “excluded” (and filled with padding) a total of 806 30-s epochs for ECG (0.32% of all epochs), 3506 epochs for H10 (2.1%), and 818 epochs for VS (0.66%). In total, this resulted
in 37 H10 and 54 VS recordings that were lost due to the aforementioned problems.
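The cross-correlation alignment of sensor and PSG-derived inter-beat intervals can be illustrated with a minimal sketch. It assumes both series have already been resampled onto a common time grid, and uses a simple circular shift over a bounded lag range rather than the authors' exact windowed procedure.

```python
import numpy as np

def best_lag(reference, signal, max_lag):
    """Return the shift (in samples) of `signal` that maximizes its
    cross-correlation with `reference` (both mean-centered, same rate).
    Uses a circular shift for simplicity; fine when max_lag << series length."""
    ref = np.asarray(reference, float) - np.mean(reference)
    sig = np.asarray(signal, float) - np.mean(signal)
    lags = list(range(-max_lag, max_lag + 1))
    scores = [np.sum(ref * np.roll(sig, k)) for k in lags]
    return lags[int(np.argmax(scores))]
```

Applying `np.roll(signal, best_lag(...))` then brings the two IBI series into temporal register before epoch-wise comparison against the PSG hypnogram.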
We assessed the relationship between PSG- and wearable-derived sleep parameters using Spearman rank correlations and inspected the linear trends using scatterplots. We determined the agreement between the PSG and
wearables sleep parameters using Bland–Altman plots, reporting the biases, Standard
Measurement Error (SME), Limits of Agreement (LOA), Minimal Detectable Change [30] (MDC = SME · 1.96 · √2), as well as absolute intraclass correlation (ICC, a two-way model with agreement type [31]). The MDC, or smallest detectable change, refers to the smallest change detected by a method that exceeds measurement error [30]. The intraclass
correlation of reliability encapsulates both the degree of correlation and agreement between
measurements [31]. According to Koo and Li [31], values less than 0.5, between 0.5 and
0.75, between 0.75 and 0.9, and greater than 0.90 are indicative of poor, moderate, good,
and excellent reliability, respectively. All data and statistical analyses were performed in R (R Core Team [32]; version 4).
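The agreement statistics above can be sketched in Python for illustration (the authors used R). Note that definitions of the SME vary in the literature; taking it as the standard error of the mean difference is an assumption of this sketch, while the bias, LOA, and the MDC formula follow the text directly.

```python
import math

def bland_altman(psg, wearable):
    """Bias, 95% limits of agreement, SME, and MDC for paired measurements.
    SME is taken here as the standard error of the mean difference
    (an assumption; definitions vary), and MDC = SME * 1.96 * sqrt(2)."""
    d = [w - p for p, w in zip(psg, wearable)]   # wearable minus gold standard
    n = len(d)
    bias = sum(d) / n
    sd = math.sqrt(sum((x - bias) ** 2 for x in d) / (n - 1))
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)   # lower, upper
    sme = sd / math.sqrt(n)
    mdc = sme * 1.96 * math.sqrt(2)
    return {"bias": bias, "loa": loa, "sme": sme, "mdc": mdc}
```

A positive bias indicates the wearable overestimates the parameter relative to PSG, matching the sign convention used in Tables 1 and 2 and Figure 6.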
3. Results
3.1. Comparison of the Accuracy and Loss Function Models and Performance after Optimization
Figure 3 illustrates the confusion matrices for each model and sensor. When looking
at the overall performance, we observed a small (on the order of 1–2%) increase in the overall classification accuracy for the loss function model. Statistical within-subject comparison of the two models showed a benefit only for the ECG (p < 0.001) recordings, but not for the H10 (p = 0.094) and VS (p = 0.13). Comparison of κ values yielded a significant increase for the ECG (p < 0.001) and H10 (p = 0.036), but not the VS (p = 0.14). Figure 4a
displays the average model κ values for each sensor separately. In addition, when using
the loss function model, κ values for ECG (M = 0.798, SD = 0.08) were significantly higher
(p < 0.001) than the H10 (M = 0.772, SD = 0.09) and VS (M = 0.772, SD = 0.1), while the
two sensors did not significantly differ. Compared to classifying only light sleep for the
whole recording, considered here as the chance level, the loss function model achieved
significantly higher classification accuracies (p < 0.001) for all sensors. Figure 4b illustrates
the classification accuracies achieved for every recording, using the loss function model,
and the corresponding accuracies when staging only the majority class. Finally, we tested
whether F1 scores, incorporating both the precision and recall of the model, improve
significantly using the loss function model when considering each class separately. For the ECG sensor, we observed a small but significant increase in the F1 scores for all sleep stages (wake: p = 0.001; REM: p < 0.001; light: p = 0.004; deep: p = 0.003). F1 scores for the H10 were also higher for the loss function model in all sleep stages except light sleep (p = 0.56). For the VS, only wake showed a benefit from the loss function model (p = 0.015). The median F1 scores for each sensor, model, and sleep stage are illustrated in Figure 4c.
Figure 3. Confusion matrices of the accuracy (blue) and loss function (red) models. The IBIs were extracted from gold-standard ECG (left), the chest-belt H10 (middle), and the PPG-based VS (right), and were classified using the two models. In each confusion matrix, rows represent predicted classes (Output Class) and columns represent true classes (Target Class). Cells on the diagonal indicate correct classifications, while off-diagonal cells represent incorrect classifications. Each cell displays the count and percentage of classifications. Precision (truePositives/(truePositives + falsePositives)) is displayed in the gray squares on the right, while recall (truePositives/(truePositives + falseNegatives)) is displayed at the bottom. The number of epochs has been equalized between the two models for a fairer comparison. Note that, in addition to the small improvement in overall accuracy compared to the accuracy model, the loss function model displays an increase in the recall of the wake and deep sleep stages. This is arguably enough to address some of the nights that are difficult to classify.
Figure 4. Performance metrics of the accuracy and loss function models as computed for ECG (left), H10 (middle), and VS (right). (a) The loss function model yielded a small but significant increase in the κ values for all sensors. (b) The loss function model displayed significantly higher classification accuracies compared to staging the majority class for the whole recording. (c) When considering the performance of the two models separately in each class, as reflected in the class F1 scores, we observed a small but significant increase in the performance of the loss function model. ns—not significant; * < 0.05, ** < 0.01, *** < 0.001, **** < 0.0001.
3.2. Correlation and Agreement between the Gold-Standard PSG and the Two Wearable Devices on
Primary Sleep Parameters
We observed significant (p < 0.001) high Spearman correlations between all PSG- and wearable-derived sleep parameters of interest. More specifically, between the H10 and PSG we found high correlations for Sleep Onset Latency (ρ = 0.82, p < 0.001), Sleep Efficiency (ρ = 0.88, p < 0.001), and Total Sleep Time (ρ = 0.96, p < 0.001), as well as Wake After Sleep Onset (ρ = 0.89, p < 0.001), Deep (ρ = 0.6, p < 0.001), and REM sleep (ρ = 0.76, p < 0.001).
Similar correlations were observed between VS and PSG: Sleep Onset Latency (ρ = 0.82,
p < 0.001); Sleep efficiency (ρ = 0.84, p < 0.001), and Total Sleep Time (ρ = 0.94, p < 0.001);
Wake After Sleep Onset (ρ = 0.8, p < 0.001); Deep (ρ = 0.6, p < 0.001); REM sleep (ρ = 0.78,
p < 0.001). Figure 5 displays the correlations between VS and PSG.
Figure 5. Correlations between VS and PSG sleep parameters as computed with Spearman’s rank correlations (ρ). PSG sleep parameters are plotted on the x axis and VS metrics on the y axis. The individual points reflect each recording. The solid line reflects the corresponding linear model and the shaded areas reflect the 95% confidence intervals. Note that all sleep metrics correlate highly with PSG, with deep sleep showing the weakest positive correlation.
We explored the extent of the agreement between the gold standard and both sensors
using Bland–Altman plots on the sleep parameters of interest. Tables 1 and 2 summarize
the extent of agreement between the gold standard, PSG, and the two wearable devices,
H10 and VS on the sleep parameters of interest. In Figure 6 we visualize the agreement
between PSG and VS using Bland–Altman plots.
Table 1. Agreement between PSG and the H10 sensor. Agreement between the gold standard PSG and the H10 wearable in measuring the sleep parameters of interest. Note that there is good to excellent reliability for all sleep parameters except deep sleep, which shows moderate reliability. SOL = Sleep Onset Latency, SE = Sleep Efficiency, TST = Total Sleep Time, WASO = Wake After Sleep Onset, REM = Rapid Eye Movement Sleep, LOA = Limits of Agreement (upper/lower); SME = Standard Measurement Error; MDC = Minimal Detectable Change; ICC = Intra-Class Correlation.
Parameter | Mean PSG (SD) | Mean H10 (SD) | LOA (upper/lower) | Bias | SME | MDC | ICC
SOL [min.] | 18.4 (18) | 18.3 (21.6) | 31.8 / −31.9 | −0.1 | 1.2 | 3.4 | 0.7
SE [%] | 85.5 (8.8) | 86.3 (8.8) | 10 / −8.5 | 0.8 | 0.4 | 1.0 | 0.9
TST [min.] | 400.4 (59) | 402.5 (60.6) | 49.2 / −45 | 2.1 | 1.8 | 5.0 | 0.9
WASO [min.] | 51.3 (42.3) | 47.4 (40.9) | 25.1 / −32.9 | −3.9 | 1.1 | 3.1 | 0.9
REM [min.] | 92.2 (27.9) | 92.9 (26.7) | 34.9 / −33.5 | 0.7 | 1.3 | 3.6 | 0.8
Deep [min.] | 66.6 (28.3) | 71.4 (23.4) | 48.3 / −38.8 | 4.8 | 1.7 | 4.6 | 0.6
Table 2. Agreement between PSG and the VS sensor. Agreement between the gold standard PSG and the VS wearable in measuring the sleep parameters of interest. Similarly to the H10, there is good to excellent reliability for all sleep parameters except deep sleep, which shows moderate reliability. SOL = Sleep Onset Latency, SE = Sleep Efficiency, TST = Total Sleep Time, WASO = Wake After Sleep Onset, REM = Rapid Eye Movement Sleep, LOA = Limits of Agreement (upper/lower); SME = Standard Measurement Error; MDC = Minimal Detectable Change; ICC = Intra-Class Correlation.
Parameter | Mean PSG (SD) | Mean VS (SD) | LOA (upper/lower) | Bias | SME | MDC | ICC
SOL [min.] | 22 (27.8) | 21.8 (27.3) | 28.6 / −29.1 | −0.3 | 1.3 | 3.5 | 0.9
SE [%] | 86 (7.8) | 87.3 (8.3) | 8.1 / −5.5 | 1.3 | 0.3 | 0.8 | 0.9
TST [min.] | 397.6 (56) | 398.3 (62) | 50.6 / −49.3 | 0.7 | 2.2 | 6.1 | 0.9
WASO [min.] | 43.9 (30.5) | 37.6 (32) | 22.2 / −34.8 | −6.3 | 1.3 | 3.5 | 0.9
REM [min.] | 90.7 (26.6) | 91.4 (24.5) | 33 / −31.7 | 0.7 | 1.4 | 4.0 | 0.8
Deep [min.] | 67.4 (27.3) | 71.1 (21.6) | 50.2 / −42.9 | 3.6 | 2.1 | 5.7 | 0.5
Figure 6. Agreement between the PSG-based sleep metrics and the VS sensor, as visualized using Bland–Altman plots. The dashed red line represents the mean difference (i.e., bias) between the two measurements. The black solid line represents the point of equality (where the difference between the two devices equals 0), while the dotted lines represent the upper and lower limits of agreement (LOA). The shaded areas indicate the 95% confidence intervals of the bias and the lower and upper agreement limits. A positive bias value indicates a VS overestimation, while a negative bias value reflects a VS underestimation, using the gold standard, PSG, as the point of reference (VS − PSG). Note that the VS underestimates SOL and WASO, while it overestimates the remaining sleep parameters. However, the degree of bias is minimal.
Sensor | Medication | N (% Females) | Mean Age (SD) | Mean PSQI (SD) | Age Effect | PSQI Effect
ECG | without meds | 87 (66.7) | 41.7 (16) | 8.8 (3.3) | |
ECG | with meds | 40 (80) | 47 (15.6) | 10.4 (3.4) | + | *
H10 | without meds | 60 (70) | 42.5 (16.6) | 8.6 (3.2) | |
H10 | with meds | 30 (80) | 48.1 (15.4) | 10.5 (3.7) | ns | *
VS | without meds | 35 (65.7) | 38.8 (13.2) | 8.5 (3) | |
VS | with meds | 14 (71.4) | 47.6 (15.4) | 11.1 (4.4) | + | ns
Figure 7. The effects of heart-affecting or psychoactive medication on sleep stage classification using the optimized model. The boxplot illustrates the median κ values for recordings obtained from people with and without medication, separately for each sensor. The table contains the group descriptives, including the number of subjects and the group averages for age and PSQI, as well as the statistical effects of Age and PSQI. Note that there is a significant difference between the two groups in the ECG recordings, but at the same time a trend for age and a significant difference in the PSQI scores. ns—not significant; * < 0.05; + trend.
4. Discussion
In the current study, we validated our previous findings in a new sample and showed
that the model suggested in the study of Topalidis et al. [15] can be optimized further by
including advanced IBI signal control and opting for the loss function for choosing the
final classification model. Results reveal statistically higher classification performance
for the loss function model, compared to choosing the model with the highest overall
accuracy during validation (accuracy model), although numerically the benefit is small.
We discuss why even a small but reliable increase is relevant here while highlighting that
we optimized a model that already displays very high classification performance. We
further show that psychoactive and/or heart-affecting medication does not have a strong
impact on sleep stage classification accuracy. Lastly, we evaluated our new optimized
model for measuring sleep-related parameters and found that our model shows substantial
agreement and reliability with PSG-extracted sleep parameters.
In the following section, we discuss the results, acknowledge the limitations of the
current study, and provide an outlook on sleep classification using wearables in sleep
research and clinical practice for the near future.
We reasoned that opting for the loss function model would have a beneficial effect
on classification performance, minimizing biases towards the majority class. When looking
the classification accuracy for the “deep sleep” class, while implementing standardized
frameworks for evaluating the performance of sleep-tracking technology [34].
The classification performances reported here are, to our knowledge, among the highest achieved using wearable devices while capitalizing on only a single signal source, namely the IBI signal. The classification approach here takes into consideration recommendations for optimal model training, such as incorporating the temporal dynamics of sleep (information about epoch N is contained in epochs N − 1 and N + 1). Ebrahimi and Alizadeh [35], for example, systematically reviewed automatic sleep staging using cardiorespiratory signals and suggested that taking into account signal delays and implementing sequence-learning models capable of incorporating temporal relations results in better classification performance.
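One simple way to give a classifier access to the temporal context of neighboring epochs (N − 1 and N + 1), short of a full sequence-learning model, is to concatenate each 30-s epoch's feature vector with those of its neighbors. This is an illustrative sketch of the general idea, not the authors' CNN architecture.

```python
import numpy as np

def add_temporal_context(features, k=1):
    """Concatenate each epoch's feature vector with its k preceding and
    k following epochs; recording edges are padded by repeating the
    first/last epoch. features: (n_epochs, n_features)."""
    f = np.asarray(features, float)
    padded = np.pad(f, ((k, k), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(f)] for i in range(2 * k + 1)])
```

With k = 1, each epoch is represented by a (3 × n_features) vector spanning epochs N − 1, N, and N + 1, letting even a stateless classifier exploit the slow stage transitions of sleep.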
We also explored the effects of psychoactive and heart medication on sleep stage
classification performance. We reasoned that such medication could have a direct effect
on PSG and ECG [36,37] and thereby affect relevant features of the signal, resulting in
decreased classification performance. We observed a small but statistically significant
decrease in κ values between people with and without medication, but only for the ECG data and not the wearables. These results could be explained by the large difference in available data between the devices in combination with small effect sizes. It is important to note, however, that for the ECG data, age and PSQI differences were observed between the two groups (the group with medication being older and having worse sleep quality), and thus the drop in classification performance could be driven by these differences. However, the median κ for all classified recordings from participants in the medication groups was above 0.7 for all devices, suggesting substantial agreement in people taking psychoactive
and/or heart-affecting medication. Future studies should look at the effects of psychoactive
and/or heart-affecting medication, or even stimulants (e.g., caffeine), on sleep staging more
closely, including more controlled or even within-subjects designs, to explore how even
subtle changes in the cardiac signal can impact sleep stage classification.
As Western health systems increasingly adopt digital health interventions, including those targeting sleep (see Arroyo and Zawadzki [38] for a systematic review of mHealth interventions in
sleep), objective and reliable tracking of the effects of such interventions becomes more
and more relevant and allows for ecologically valid and continuous measurements [9] in
natural home settings. Recently, for example, Spina et al. [11] used sensor-based sleep
feedback in a sample of 103 participants suffering from sleep disturbances and found that
such sensor-based sleep feedback can already reduce some of the insomnia symptoms.
Interestingly, however, such feedback alone was not enough to induce changes in sleep–wake misperception, which may require additional interventions (see Hinterberger et al. [39]).
Given that people with sleep problems and preoccupations about their sleep are especially
sensitive to such feedback, there is a strong ethical obligation to provide patients only with
reliable and accurate feedback, in order to prevent negative side effects [4].
Conflicts of Interest: M.S. is the co-founder of and CSO at the NUKKUAA GmbH. D.P.J.H. is also an
employee of the Institut Proschlaf Austria (IPA).
Abbreviations
The following abbreviations are used in this manuscript:
IBI Inter-Beat-Interval
HRV Heart Rate Variability
PSG Polysomnography
ECG Electrocardiography
References
1. Ramar, K.; Malhotra, R.K.; Carden, K.A.; Martin, J.L.; Abbasi-Feinberg, F.; Aurora, R.N.; Kapur, V.K.; Olson, E.J.; Rosen, C.L.;
Rowley, J.A.; et al. Sleep is essential to health: An American Academy of Sleep Medicine position statement. J. Clin. Sleep Med.
2021, 17, 2115–2119. [CrossRef] [PubMed]
2. Baron, K.G.; Abbott, S.; Jao, N.; Manalo, N.; Mullen, R. Orthosomnia: Are some patients taking the quantified self too far? J. Clin.
Sleep Med. 2017, 13, 351–354. [CrossRef] [PubMed]
3. Rentz, L.E.; Ulman, H.K.; Galster, S.M. Deconstructing commercial wearable technology: Contributions toward accurate and
free-living monitoring of sleep. Sensors 2021, 21, 5071. [CrossRef]
4. Gavriloff, D.; Sheaves, B.; Juss, A.; Espie, C.A.; Miller, C.B.; Kyle, S.D. Sham sleep feedback delivered via actigraphy biases
daytime symptom reports in people with insomnia: Implications for insomnia disorder and wearable devices. J. Sleep Res. 2018,
27, e12726. [CrossRef] [PubMed]
5. Ravichandran, R.; Sien, S.W.; Patel, S.N.; Kientz, J.A.; Pina, L.R. Making sense of sleep sensors: How sleep sensing technologies
support and undermine sleep health. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems,
Denver, CO, USA, 6–11 May 2017; pp. 6864–6875.
6. Downey, R.; Bonnet, M.H. Training subjective insomniacs to accurately perceive sleep onset. Sleep 1992, 15, 58–63. [CrossRef]
7. Tang, N.K.; Harvey, A.G. Correcting distorted perception of sleep in insomnia: A novel behavioural experiment? Behav. Res. Ther.
2004, 42, 27–39. [CrossRef]
8. Tang, N.K.; Harvey, A.G. Altering misperception of sleep in insomnia: Behavioral experiment versus verbal feedback. J. Consult.
Clin. Psychol. 2006, 74, 767. [CrossRef]
9. Roomkham, S.; Lovell, D.; Cheung, J.; Perrin, D. Promises and challenges in the use of consumer-grade devices for sleep
monitoring. IEEE Rev. Biomed. Eng. 2018, 11, 53–67. [CrossRef]
10. Aji, M.; Glozier, N.; Bartlett, D.J.; Grunstein, R.R.; Calvo, R.A.; Marshall, N.S.; White, D.P.; Gordon, C. The effectiveness of digital
insomnia treatment with adjunctive wearable technology: A pilot randomized controlled trial. Behav. Sleep Med. 2022, 20, 570–583.
[CrossRef]
11. Spina, M.A.; Andrillon, T.; Quin, N.; Wiley, J.F.; Rajaratnam, S.M.; Bei, B. Does providing feedback and guidance on sleep
perceptions using sleep wearables improve insomnia? Findings from Novel Insomnia Treatment Experiment (“NITE”), a
randomised controlled trial. Sleep 2023, 46, zsad167. [CrossRef]
12. Song, Y.M.; Choi, S.J.; Park, S.H.; Lee, S.J.; Joo, E.Y.; Kim, J.K. A real-time, personalized sleep intervention using mathematical
modeling and wearable devices. Sleep 2023, 46, zsad179. [CrossRef] [PubMed]
13. Murray, J.M.; Magee, M.; Giliberto, E.S.; Booker, L.A.; Tucker, A.J.; Galaska, B.; Sibenaller, S.M.; Baer, S.A.; Postnova, S.; Sondag,
T.A.; et al. Mobile app for personalized sleep–wake management for shift workers: A user testing trial. Digit. Health 2023,
9, 20552076231165972. [CrossRef] [PubMed]
14. Imtiaz, S.A. A systematic review of sensing technologies for wearable sleep staging. Sensors 2021, 21, 1562. [CrossRef] [PubMed]
15. Topalidis, P.; Heib, D.P.; Baron, S.; Eigl, E.S.; Hinterberger, A.; Schabus, M. The Virtual Sleep Lab—A Novel Method for Accurate
Four-Class Sleep Staging Using Heart-Rate Variability from Low-Cost Wearables. Sensors 2023, 23, 2390. [CrossRef]
16. Chinoy, E.D.; Cuellar, J.A.; Huwa, K.E.; Jameson, J.T.; Watson, C.H.; Bessman, S.C.; Hirsch, D.A.; Cooper, A.D.; Drummond, S.P.;
Markwald, R.R. Performance of seven consumer sleep-tracking devices compared with polysomnography. Sleep 2021, 44, zsaa291.
[CrossRef]
17. Chee, N.I.; Ghorbani, S.; Golkashani, H.A.; Leong, R.L.; Ong, J.L.; Chee, M.W. Multi-night validation of a sleep tracking ring in
adolescents compared with a research actigraph and polysomnography. Nat. Sci. Sleep 2021, 13, 177–190. [CrossRef] [PubMed]
18. Altini, M.; Kinnunen, H. The promise of sleep: A multi-sensor approach for accurate sleep stage detection using the oura ring.
Sensors 2021, 21, 4302. [CrossRef]
19. Mathunjwa, B.M.; Lin, Y.T.; Lin, C.H.; Abbod, M.F.; Sadrawi, M.; Shieh, J.S. Automatic IHR-based sleep stage detection using
features of residual neural network. Biomed. Signal Process. Control 2023, 85, 105070. [CrossRef]
20. Habib, A.; Motin, M.A.; Penzel, T.; Palaniswami, M.; Yearwood, J.; Karmakar, C. Performance of a Convolutional Neural Network
Derived from PPG Signal in Classifying Sleep Stages. IEEE Trans. Biomed. Eng. 2022, 70, 1717–1728. [CrossRef]
21. Sridhar, N.; Shoeb, A.; Stephens, P.; Kharbouch, A.; Shimol, D.B.; Burkart, J.; Ghoreyshi, A.; Myers, L. Deep learning for automated
sleep staging using instantaneous heart rate. NPJ Digit. Med. 2020, 3, 106. [CrossRef] [PubMed]
22. Berry, R.B.; Brooks, R.; Gamaldo, C.E.; Harding, S.M.; Marcus, C.; Vaughn, B.V. The AASM manual for the scoring of sleep and
associated events. In Rules, Terminology and Technical Specifications; American Academy of Sleep Medicine: Darien, IL, USA, 2012;
Volume 176, p. 2012.
23. Bakker, J.P.; Ross, M.; Cerny, A.; Vasko, R.; Shaw, E.; Kuna, S.; Magalang, U.J.; Punjabi, N.M.; Anderer, P. Scoring sleep with
artificial intelligence enables quantification of sleep stage ambiguity: Hypnodensity based on multiple expert scorers and
auto-scoring. Sleep 2023, 46, zsac154. [CrossRef]
24. Schaffarczyk, M.; Rogers, B.; Reer, R.; Gronwald, T. Validity of the polar H10 sensor for heart rate variability analysis during
resting state and incremental exercise in recreational men and women. Sensors 2022, 22, 6536. [CrossRef]
25. Gilgen-Ammann, R.; Schweizer, T.; Wyss, T. RR interval signal quality of a heart rate monitor and an ECG Holter at rest and
during exercise. Eur. J. Appl. Physiol. 2019, 119, 1525–1532. [CrossRef]
26. Hettiarachchi, I.T.; Hanoun, S.; Nahavandi, D.; Nahavandi, S. Validation of Polar OH1 optical heart rate sensor for moderate and
high intensity physical activities. PLoS ONE 2019, 14, e0217288. [CrossRef]
27. Dorffner, G.; Vitr, M.; Anderer, P. The effects of aging on sleep architecture in healthy subjects. In GeNeDis 2014: Geriatrics;
Springer International Publishing: Cham, Switzerland, 2014; pp. 93–100.
28. Radha, M.; Fonseca, P.; Moreau, A.; Ross, M.; Cerny, A.; Anderer, P.; Long, X.; Aarts, R.M. Sleep stage classification from heart-rate
variability using long short-term memory neural networks. Sci. Rep. 2019, 9, 14149. [CrossRef] [PubMed]
29. Viera, A.J.; Garrett, J.M. Understanding interobserver agreement: The kappa statistic. Fam. Med. 2005, 37, 360–363. [PubMed]
30. de Vet, H.C.; Terwee, C.B.; Ostelo, R.W.; Beckerman, H.; Knol, D.L.; Bouter, L.M. Minimal changes in health status questionnaires:
Distinction between minimally detectable change and minimally important change. Health Qual. Life Outcomes 2006, 4, 54.
[CrossRef] [PubMed]
31. Koo, T.; Li, M. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med.
2016, 15, 155–163. [CrossRef]
32. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria,
2017.
33. Reddy, O.C.; van der Werf, Y.D. The sleeping brain: Harnessing the power of the glymphatic system through lifestyle choices.
Brain Sci. 2020, 10, 868. [CrossRef]
34. Menghini, L.; Cellini, N.; Goldstone, A.; Baker, F.C.; de Zambotti, M. A standardized framework for testing the performance of
sleep-tracking technology: Step-by-step guidelines and open-source code. Sleep 2021, 44, zsaa170. [CrossRef]
35. Ebrahimi, F.; Alizadeh, I. Automatic sleep staging by cardiorespiratory signals: A systematic review. Sleep Breath. 2021, 26,
965–981. [CrossRef] [PubMed]
36. Doghramji, K.; Jangro, W.C. Adverse effects of psychotropic medications on sleep. Psychiatr. Clin. 2016, 39, 487–502.
37. Symanski, J.D.; Gettes, L.S. Drug effects on the electrocardiogram: A review of their clinical importance. Drugs 1993, 46, 219–248.
[CrossRef] [PubMed]
38. Arroyo, A.C.; Zawadzki, M.J. The implementation of behavior change techniques in mHealth apps for sleep: Systematic review.
JMIR MHealth UHealth 2022, 10, e33527. [CrossRef] [PubMed]
39. Hinterberger, A.; Eigl, E.S.; Schwemlein, R.N.; Topalidis, P.; Schabus, M. Investigating the Subjective and Objective Efficacy of a
Cognitive Behavioral Therapy for Insomnia (CBT-I)-Based Smartphone App on Sleep: A Randomized Controlled Trial. 2023.
Available online: https://osf.io/rp2qf/ (accessed on 30 October 2023).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.