An Attention-Based Deep Learning Approach For Sleep Stage Classification With Single-Channel EEG
Abstract— Automatic sleep stage classification is of great importance to measure sleep quality. In this paper, we propose a novel attention-based deep learning architecture called AttnSleep to classify sleep stages using single-channel EEG signals. This architecture starts with a feature extraction module based on a multi-resolution convolutional neural network (MRCNN) and adaptive feature recalibration (AFR). The MRCNN extracts low- and high-frequency features, and the AFR improves the quality of the extracted features by modeling the inter-dependencies between them. The second module is the temporal context encoder (TCE), which leverages a multi-head attention mechanism to capture the temporal dependencies among the extracted features. In particular, the multi-head attention deploys causal convolutions to model the temporal relations in the input features. We evaluate the performance of our proposed AttnSleep model using three public datasets. The results show that AttnSleep outperforms state-of-the-art techniques in terms of different evaluation metrics. Our source code, experimental data, and supplementary materials are available at https://github.com/emadeldeen24/AttnSleep.

Index Terms— Sleep stage classification, multi-resolution convolutional neural network, adaptive feature recalibration, temporal context encoder, multi-head attention.

I. INTRODUCTION

apnea [2]. In particular, sleep stages (e.g., light sleep and deep sleep) are important for the immune system, memory, metabolism, etc. [3]–[5]. Therefore, it is highly desired to measure sleep quality through sleep monitoring and sleep stage classification.

Sleep specialists usually determine the sleep stages based on polysomnography (PSG), which consists of electroencephalogram (EEG), electrooculogram (EOG), electromyogram (EMG) and electrocardiogram (ECG) [6]. Single-channel EEG has recently become attractive for sleep monitoring due to its ease of use. In particular, PSG or single-channel EEG recordings are usually divided into 30-second segments, and each segment is manually checked by sleep specialists and then classified into one of six stages, i.e., wake (W), rapid eye movement (REM) and four non-REM stages (N1, N2, N3 and N4) [7]. This manual process is very exhausting, tedious, and time-consuming. As such, automatic sleep stage classification systems are required to assist sleep specialists.

Many studies have adopted conventional machine learning methods to classify EEG signals into the corresponding sleep stages. These methods usually consist of two steps, namely, manual feature extraction and sleep stage classification. First, they design and extract various features from the time and frequency domains. Feature selection algorithms are often applied to further select the most discriminative features. Second,
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
810 IEEE TRANSACTIONS ON NEURAL SYSTEMS AND REHABILITATION ENGINEERING, VOL. 29, 2021
Fig. 1. Overall framework of the proposed AttnSleep model for automatic sleep stage classification.
to classify the raw data from three channels (i.e., EEG, EOG, and EMG). In [15], the authors designed a relatively deep CNN architecture to show that better performance can be achieved by relying on network depth. In [17], the authors converted each raw signal into log-power spectra and used a CNN to perform a joint classification and prediction task for identifying sleep stages. Generally, the CNN models above have achieved good performance for sleep stage classification. However, most of them are unable to effectively model the temporal dependencies in the EEG data.

Recurrent neural networks (RNNs) were proposed to capture temporal dependencies in time-series EEG data. For example, [19] used a cascaded RNN architecture to classify sleep stages. Some researchers combined CNNs with RNNs by using the CNN for feature extraction and the RNN for modeling the temporal dependencies [20]–[23]. For example, [20] used a CNN architecture to extract features from the raw data, and then exploited a long short-term memory (LSTM) network to learn transition rules across sleep stages. Similarly, in [23], the authors used successive blocks of CNN followed by an LSTM to classify raw EEG signals. In addition, [24] used an LSTM-based encoder-decoder with an attention mechanism after the encoder to find the most relevant parts of the input sequences. However, RNNs also have limitations due to their recurrent nature, i.e., they usually have high model complexity, and it is thus difficult to train them in parallel. Some works used attention mechanisms entirely instead of RNNs. For example, [25] segmented the EEG epochs and used self-attention to learn both intra-epoch features and inter-epoch temporal features.

Aside from selecting classification models, sleep staging also needs to address the data imbalance problem, since humans spend different amounts of time in each stage. Oversampling is a common strategy to address this issue. For example, the studies in [13], [17], [20] replicated the minority classes for model training. The authors in [24] applied the Synthetic Minority Over-sampling TEchnique (SMOTE) [26] to balance the data. However, oversampling techniques expand the training data and thus increase the training time.

To address the above issues, we propose a novel architecture called AttnSleep for automatic sleep stage classification. First, we propose a novel feature extraction module based on a multi-resolution CNN (MRCNN) and adaptive feature recalibration (AFR). The MRCNN extracts features corresponding to low and high frequencies from different frequency bands, and the AFR models the feature inter-dependencies to enhance feature learning. Second, we propose a novel temporal context encoder (TCE) that deploys multi-head attention with causal convolutions to efficiently capture the temporal dependencies in the extracted features. We also design a class-aware loss function to effectively address the data imbalance issue without additional computations. We perform extensive experiments on three public datasets, and the experimental results demonstrate that our AttnSleep model outperforms the state-of-the-art methods for sleep stage classification.

Overall, the main contributions of our proposed model can be summarized as follows.
1) We propose a novel feature extraction technique, i.e., a multi-resolution CNN module, to extract features corresponding to low and high frequencies from different frequency bands, and an adaptive feature recalibration to learn the feature interdependencies and enhance the representation capability of the extracted features.
2) We propose a novel temporal context encoder that deploys multi-head self-attention with causal convolutions to efficiently capture the temporal dependencies in the extracted features.
3) We design a class-aware loss function to efficiently handle the class imbalance without introducing additional computations.
4) These novel components are supported by extensive experiments on three public datasets. The results demonstrate that our proposed model outperforms the state-of-the-art in sleep stage classification.

The rest of the paper is organized as follows: Section II illustrates the details of the proposed model. In Section III, we introduce the datasets, evaluation metrics, experimental setup and the baseline methods. We then present the comparison results against the baseline methods and the ablation study of our AttnSleep model, as well as a sensitivity analysis of the choice of the number of heads in the MHA. Finally, the conclusion of the paper is presented in Section IV.

II. PROPOSED METHOD

In this section, we introduce our proposed AttnSleep model for sleep stage classification from single-channel EEG data.
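The causal convolutions used in the temporal context encoder can be illustrated independently of any deep learning framework: left-padding the input by (kernel length − 1) guarantees that the output at time step t depends only on inputs at steps ≤ t. The following is a minimal pure-Python sketch of that idea, not the authors' implementation:

```python
def causal_conv1d(x, kernel):
    """1-D causal convolution: output[t] uses only x[0..t].

    Causality is enforced by left-padding the signal with
    (len(kernel) - 1) zeros instead of padding symmetrically.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    # output[t] = sum_j kernel[j] * x[t - j]  (terms with t - j < 0 are zero)
    return [sum(kernel[j] * padded[t + k - 1 - j] for j in range(k))
            for t in range(len(x))]

signal = [1.0, 2.0, 3.0, 4.0]
out = causal_conv1d(signal, kernel=[1.0, 1.0])  # out[t] = x[t] + x[t-1]
print(out)  # [1.0, 3.0, 5.0, 7.0]

# Changing a future sample never affects earlier outputs:
out2 = causal_conv1d([1.0, 2.0, 3.0, 99.0], kernel=[1.0, 1.0])
assert out2[:3] == out[:3]
```

This masking-by-padding is the same mechanism that lets attention layers built on such convolutions respect temporal order.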
ELDELE et al.: ATTENTION-BASED DEEP LEARNING APPROACH FOR SLEEP STAGE CLASSIFICATION 811
L = −(1/M) Σ_{k=1}^{K} Σ_{i=1}^{M} y_i^k log(ŷ_i^k),  (10)

where y_i^k is the actual label of the i-th sample for class k, ŷ_i^k is the predicted probability of the i-th sample for class k, M is the total number of samples, and K is the number of classes. Note that various sleep datasets are imbalanced, i.e., the amount of data for each class varies a lot. The loss function in Equation 10 equally penalizes the misclassification of all classes, and thus the trained model may be biased towards the majority classes.

We propose a class-aware loss function to address the above issue, which uses a weighted cross-entropy loss as follows:

L = −(1/M) Σ_{k=1}^{K} Σ_{i=1}^{M} w_k y_i^k log(ŷ_i^k),  (11)

w_k = μ_k · max(1, log(μ_k M/M_k)),  (12)

where w_k represents the weight assigned to class k, μ_k is a tunable parameter, and M_k is the number of samples in class k.

The choice of the class weight w_k relies on two factors, i.e., the number of samples of the class (controlled by M/M_k) and the distinctness of the class (controlled by μ_k). With analyzing

where a, b, c are hyperparameters that change for each dataset. To fulfill the above recommendation, we chose a < b < c. Note that we use μ_k to scale down the M/M_k value, so we keep the values of μ_k less than 1 by dividing them by K. As such, the values of a, b, c are better kept less than K to scale down the weights. Additional experiments on the effect of different choices of a, b, and c are provided in Section S.II of the supplementary materials.

Finally, we use Adam [35] as the optimizer to minimize our class-aware loss in Equation 11 and learn the model parameters.

III. EXPERIMENTAL RESULTS

In this section, we first introduce the experimental setup. Then, we demonstrate the evaluation results of our proposed AttnSleep.

A. Datasets and Evaluation Metrics

In our experiments, we used three public datasets, namely, Sleep-EDF-20, Sleep-EDF-78 and the Sleep Heart Health Study (SHHS), as shown in Table I. For each dataset, we used a single EEG channel for the various models in our experiments.

Sleep-EDF-20 and Sleep-EDF-78 were obtained from PhysioBank [36]. Sleep-EDF-20 contains data files for
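Equations (10)–(12) can be checked numerically. The sketch below implements the class weights of Equation (12) and the weighted cross-entropy of Equation (11) in plain Python; the μ_k values and class counts are made-up examples, not the paper's per-dataset settings:

```python
import math

def class_weights(mu, counts):
    """w_k = mu_k * max(1, ln(mu_k * M / M_k))  -- Equation (12)."""
    M = sum(counts)
    return [m * max(1.0, math.log(m * M / Mk)) for m, Mk in zip(mu, counts)]

def class_aware_loss(probs, labels, weights):
    """Weighted cross-entropy of Equation (11).

    probs:  predicted probability of the true class for each sample
    labels: true class index for each sample
    """
    M = len(labels)
    return -sum(weights[k] * math.log(p) for p, k in zip(probs, labels)) / M

# Example: 3 classes, majority class 0, minority class 2 (made-up counts).
mu = [0.2, 0.4, 0.6]      # tunable, kept below 1 as the paper recommends
counts = [700, 200, 100]  # M_k per class, so M = 1000
w = class_weights(mu, counts)
print(w)  # minority classes receive larger weights
assert w[0] < w[1] < w[2]
```

Note how the max(1, ·) term leaves well-represented classes at weight μ_k, while rare classes (small M_k) get their weight amplified by the log factor.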
TABLE I
Details of the Three Datasets Used in Our Experiments (Each Sample Is a 30-Second Epoch)

TABLE V
Comparison Among AttnSleep and State-of-the-Art Models. The Best Values on Each Dataset Are Highlighted in Bold
D. Comparison With State-of-the-Art Approaches

We evaluated the performance of our AttnSleep model against various state-of-the-art approaches. We compared their performance in terms of overall accuracy, macro F1-score, Cohen's kappa, macro G-mean, and the average training time on three datasets.

Table V shows the comparison among DeepSleepNet [20], SleepEEGNet [24], ResnetLSTM [43], MultitaskCNN [17] and our AttnSleep. We observe that our AttnSleep achieves better classification performance than the other four approaches, due to its powerful feature extraction module as well as the TCE with its attention mechanism. In particular, our AttnSleep achieves better MF1 and MGm on Sleep-EDF-78 and SHHS, indicating that the designed cost-sensitive loss function is helpful for handling imbalanced data. In addition, we can observe that our AttnSleep achieves lower performance for class N1 than [20], [24]. As shown in Fig. 4, W, REM and N1 have similar features in our framework. Therefore, our AttnSleep tends to misclassify N1 as other classes including W and REM, which is also demonstrated in the confusion matrices in Tables II, III and IV.

Note that all five methods in Table V use a single epoch (i.e., a 30-second EEG signal) as the model input. Differently, SeqSleepNet [37] takes 3 epochs as input and then predicts the label for the middle epoch. For a fair comparison, we compare our AttnSleep with SeqSleepNet in Table VI by using 3 epochs as input. As shown in Table VI, our AttnSleep outperforms SeqSleepNet in terms of all four metrics (ACC, MF1, κ and MGm). By comparing Tables V and VI, we also observe that using more epochs as input includes more temporal relations and helps our AttnSleep model achieve better performance.

In addition, the training time of our method is much less than that of the other methods, as shown in Tables V and VI. First, DeepSleepNet [20], SleepEEGNet [24] and SeqSleepNet [37] all exploit LSTMs, which slow down the training due to their recurrent processing. Second, MultitaskCNN [17] and SeqSleepNet [37] require additional computation to pre-train a DNN-based filter bank before training the main model. Differently, our AttnSleep model captures the temporal dependency among EEG data using the TCE instead of an LSTM, and can thus benefit from parallel computation to achieve reduced training complexity.

Fig. 6. Ablation study conducted on the Sleep-EDF-20 dataset.

E. Ablation Study

Note that our AttnSleep consists of the MRCNN, AFR and TCE modules together with the class-aware loss function. To analyze the effectiveness of each module in our AttnSleep, we present an ablation study conducted on the Sleep-EDF-20 dataset, as shown in Fig. 6. Specifically, we derive five model variants as follows; the first four variants do not use the class-aware loss function.
1) MRCNN: MRCNN module only.
2) MRCNN+AFR: MRCNN and AFR without TCE.
3) MRCNN+TCE: MRCNN and TCE without AFR.
4) MRCNN+AFR+TCE: training MRCNN, AFR and TCE together, without the class-aware loss function.
5) AttnSleep: training MRCNN, AFR and TCE together, with the class-aware loss function.

We can draw the following three conclusions based on the ablation study shown in Fig. 6. First, AFR can enhance the classification performance, which demonstrates the necessity of modeling the feature inter-dependencies. This is further demonstrated by comparing the third and fourth variants (i.e., MRCNN+TCE vs. MRCNN+AFR+TCE). Second, by comparing MRCNN and MRCNN+TCE (similarly MRCNN+AFR vs. MRCNN+AFR+TCE), we conclude that capturing the temporal dependencies with the TCE is important for sleep stage classification. Moreover, the TCE is even more important than AFR, as MRCNN+TCE outperforms MRCNN+AFR. Third, AttnSleep achieves significantly better MF1 and MGm than the other four variants, indicating that the proposed class-aware cost-sensitive loss function can effectively address the data imbalance issue without any added computation overhead. We also conducted the ablation study for the Sleep-EDF-78 and SHHS datasets, which can be found in Figs. S.3 and S.4 in our supplementary materials.

F. Sensitivity Analysis for the Number of Heads in MHA

As MHA is one key component of our model, it is important to study how the number of heads affects the model performance. In particular, we fix the other parameters and
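For reference, the macro-averaged metrics used in Section III-D can be computed from the per-class recalls: the macro G-mean (MGm) is the geometric mean of the per-class recalls (sensitivities). The following sketch illustrates the metric definitions on made-up predictions; it is not the paper's evaluation code:

```python
import math

def per_class_recall(y_true, y_pred, num_classes):
    """Recall (sensitivity) of each class: TP_k / (number of true k)."""
    recalls = []
    for k in range(num_classes):
        support = sum(1 for t in y_true if t == k)
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        recalls.append(tp / support if support else 0.0)
    return recalls

def macro_gmean(y_true, y_pred, num_classes):
    """Macro G-mean: geometric mean of the per-class recalls."""
    r = per_class_recall(y_true, y_pred, num_classes)
    return math.prod(r) ** (1.0 / num_classes)

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]
print(per_class_recall(y_true, y_pred, 3))  # [0.75, 1.0, 0.5]
print(macro_gmean(y_true, y_pred, 3))
```

Because the geometric mean collapses to zero if any single class is never recovered, MGm is a stricter indicator of minority-class performance than overall accuracy.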
TABLE VI
Comparison of the Performance of AttnSleep Against SeqSleepNet With 3 Epochs as Input
[14] A. Sors, S. Bonnet, S. Mirek, L. Vercueil, and J.-F. Payen, "A convolutional neural network for sleep stage scoring from raw single-channel EEG," Biomed. Signal Process. Control, vol. 42, pp. 107–114, Apr. 2018.
[15] M. Sokolovsky, F. Guerrero, S. Paisarnsrisomsuk, C. Ruiz, and S. A. Alvarez, "Deep learning for automated feature discovery and classification of sleep stages," IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 17, no. 6, pp. 1835–1845, Nov. 2020.
[16] S. Chambon, M. N. Galtier, P. J. Arnal, G. Wainrib, and A. Gramfort, "A deep learning architecture for temporal sleep stage classification using multivariate and multimodal time series," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 26, no. 4, pp. 758–769, Apr. 2018.
[17] H. Phan, F. Andreotti, N. Cooray, O. Y. Chén, and M. De Vos, "Joint classification and prediction CNN framework for automatic sleep stage classification," IEEE Trans. Biomed. Eng., vol. 66, no. 5, pp. 1285–1296, May 2019.
[18] F. Li et al., "End-to-end sleep staging using convolutional neural network in raw single-channel EEG," Biomed. Signal Process. Control, vol. 63, Jan. 2021, Art. no. 102203.
[19] N. Michielli, U. R. Acharya, and F. Molinari, "Cascaded LSTM recurrent neural network for automated sleep stage classification using single-channel EEG signals," Comput. Biol. Med., vol. 106, pp. 71–81, Mar. 2019.
[20] A. Supratak, H. Dong, C. Wu, and Y. Guo, "DeepSleepNet: A model for automatic sleep stage scoring based on raw single-channel EEG," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 25, no. 11, pp. 1998–2008, Nov. 2017.
[21] Z. Chen, M. Wu, W. Cui, C. Liu, and X. Li, "An attention based CNN-LSTM approach for sleep-wake detection with heterogeneous sensors," IEEE J. Biomed. Health Informat., early access, Jun. 30, 2020, doi: 10.1109/JBHI.2020.3006145.
[22] Z. Chen et al., "A novel ensemble deep learning approach for sleep-wake detection using heart rate variability and acceleration," IEEE Trans. Emerg. Topics Comput. Intell., early access, Jun. 8, 2020, doi: 10.1109/TETCI.2020.2996943.
[23] A. Malafeev et al., "Automatic human sleep stage scoring using deep neural networks," Frontiers Neurosci., vol. 12, p. 781, Nov. 2018.
[24] S. Mousavi, F. Afghah, and U. R. Acharya, "SleepEEGNet: Automated sleep stage scoring with sequence to sequence deep learning approach," PLoS ONE, vol. 14, no. 5, May 2019, Art. no. e0216456.
[25] T. Zhu, W. Luo, and F. Yu, "Convolution- and attention-based neural network for automated sleep stage classification," Int. J. Environ. Res. Public Health, vol. 17, no. 11, p. 4152, Jun. 2020.
[26] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, pp. 321–357, Jun. 2002.
[27] W. Huang, J. Cheng, Y. Yang, and G. Guo, "An improved deep convolutional neural network with multi-scale information for bearing fault diagnosis," Neurocomputing, vol. 359, pp. 77–92, Sep. 2019.
[28] W. Dai, C. Dai, S. Qu, J. Li, and S. Das, "Very deep convolutional neural networks for raw waveforms," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Mar. 2017, pp. 421–425.
[29] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 448–456.
[30] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 7132–7141.
[31] B. Xu, N. Wang, T. Chen, and M. Li, "Empirical evaluation of rectified activations in convolutional network," 2015, arXiv:1505.00853.
[32] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[33] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Assoc. Comput. Linguistics, 2019, pp. 4171–4186.
[34] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," 2016, arXiv:1607.06450.
[35] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–15.
[36] A. L. Goldberger et al., "PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals," Circulation, vol. 101, no. 23, pp. e215–e220, Jun. 2000.
[37] H. Phan, F. Andreotti, N. Cooray, O. Y. Chén, and M. De Vos, "SeqSleepNet: End-to-end hierarchical recurrent neural network for sequence-to-sequence automatic sleep staging," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 27, no. 3, pp. 400–410, Mar. 2019.
[38] G.-Q. Zhang et al., "The national sleep research resource: Towards a sleep data commons," J. Amer. Med. Inform. Assoc., vol. 25, no. 10, pp. 1351–1358, Oct. 2018.
[39] S. F. Quan et al., "The sleep heart health study: Design, rationale, and methods," Sleep, vol. 20, no. 12, pp. 1077–1085, 1997.
[40] P. Fonseca, N. den Teuling, X. Long, and R. M. Aarts, "Cardiorespiratory sleep stage detection using conditional random fields," IEEE J. Biomed. Health Informat., vol. 21, no. 4, pp. 956–966, Jul. 2017.
[41] J. Cohen, "A coefficient of agreement for nominal scales," Educ. Psychol. Meas., vol. 20, no. 1, pp. 37–46, Apr. 1960.
[42] Y. Tang, Y.-Q. Zhang, N. V. Chawla, and S. Krasser, "SVMs modeling for highly imbalanced classification," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 1, pp. 281–288, Feb. 2009.
[43] Y. Sun, B. Wang, J. Jin, and X. Wang, "Deep convolutional network method for automatic sleep stage classification based on neurophysiological signals," in Proc. 11th Int. Congr. Image Signal Process., Biomed. Eng. Informat. (CISP-BMEI), Oct. 2018, pp. 1–5.