2023 IEEE International Conference on E-health Networking, Application & Services (Healthcom)

A Multi-center Clinical Trial for Camera-based
Infant Sleep and Awake Detection in Neonatal
Intensive Care Unit

979-8-3503-0230-1/23/$31.00 ©2023 IEEE | DOI: 10.1109/HEALTHCOM56612.2023.10472347


Yuya Yuan1† , Dongmin Huang1† , Lirong Ren2 , Xiaoyan Song3 , Liping Pan4 , Hongzhou Lu4,∗ , Wenjin Wang1,∗

Abstract—Infants need adequate sleep to develop their brain and cardiovascular systems, especially preterm infants in the Neonatal Intensive Care Unit. Camera-based infant monitoring is an emerging direction of research in video health monitoring. Intuitively, a camera can easily identify the sleep-awake stage of infants by detecting the state of the eyes, e.g., closed or open. Thus, in this paper, we propose to explore the unique advantage of camera-based facial analysis for sleep-awake detection, as a fundamental step toward infant sleep monitoring. A multi-center clinical trial was conducted to collect infant videos for investigating the feasibility of our proposal. A benchmark including four machine learning methods (SVM, KNN, MLP, and a CNN (ResNet18)) was set up to classify the sleep/awake stage of infants. To alleviate the overfitting issue caused by over-sampling of a sleeping infant, we propose to integrate ResNet18 with a contrastive learning strategy to strengthen the consistency of facial features learned from different infants. The clinical evaluation shows that all benchmarked methods obtained an accuracy above 75%, while the proposed method achieved the best accuracy of 86%. This invites further exploration of using facial/eye features of infants for sleep-awake staging, towards intelligent contactless sleep analysis of infants in combination with camera-based vital signs monitoring.

Index Terms—Infant sleep, sleep and awake classification, clinical trial, contrastive learning

I. INTRODUCTION

Infants usually need to sleep 14-17 hours per day to secrete enough growth hormone for developing various body tissues and organs, especially the nervous system [1]. However, due to disruptions caused by certain diseases (sudden infant death syndrome, sleep apnea, etc.) and frequent medical interventions, infants in the Neonatal Intensive Care Unit (NICU) are particularly susceptible to sleep disorders. This may lead to malnutrition and weakened immunity, aggravating the problem of irregular breathing patterns or apnea [2]. There is an urgent need to monitor infant sleep in the NICU. As a first step, sleep-awake detection can provide clinicians with basic information on the total amount of an infant's sleeping time. This may guide caregivers to optimize the workflow for neonatal care and take appropriate medical interventions to improve the prognosis of infants in the NICU.

Clinically, polysomnography, which assembles both cerebral and physiological measurements, is considered the gold standard for infant sleep monitoring [3]. However, it requires multiple electrodes attached to the infant's fragile skin to obtain physiological signals, which increases the risk of skin damage and infections. This is not preferred for preterm infants, especially critically ill infants. Recently, contact-free infant monitoring has been achieved by video cameras [4], which can solve the above issues related to contact-based monitoring. However, almost all related work in this area has focused on vital signs monitoring, and less on behavioral monitoring such as context understanding in the NICU, though there is a clear need for tracking the cognition-related neuro-development of infants. Long et al. [5] extracted whole-body motion from videos to detect the sleep-awake state of infants.

Although there are prospective experiments, camera-based infant sleep-awake detection has not been fully explored and still faces many challenges. First, most methods are trained and evaluated on datasets with fewer than 20 infants [6]–[8]. Since sleeping infants have limited variations in the images sampled from a video, a network trained on such data may be overfitted to a few infants in specific conditions. Second, most methods exploit motion cues from an infrared camera for sleep-awake detection rather than facial features [5], [8]–[10]. Since sleep and awake states can be derived from the eyes (i.e., closing or opening) directly, we consider that exploiting facial features for sleep-awake detection is a more straightforward option, especially for awake but quiet infants without much body motion.

To explore the feasibility of using facial information for infant sleep-awake classification, in this paper, we conducted a multi-center clinical trial at three Chinese hospitals to record videos of infants in the NICU. This study was approved by the hospitals' Institutional Review Boards, and informed consent was obtained from the legal guardians of the infants. To detect the sleep-awake state of an infant, we benchmarked four machine-learning methods including support vector machine (SVM), k-nearest neighbors (KNN), multilayer perceptron

This work is supported by the National Key R&D Program of China (2022YFC2407800), General Program of National Natural Science Foundation of China (62271241), Guangdong Basic and Applied Basic Research Foundation (2023A1515012983), Shenzhen Science and Technology Program (JSGGKQTD20221103174704003), and Shenzhen Fundamental Research Program (JCYJ20220530112601003).
1 Department of Biomedical Engineering, Southern University of Science and Technology, China.
2 Department of Obstetrics, Baoan Hospital of Traditional Chinese Medicine in Shenzhen, China.
3 Neonatal Intensive Care Unit, Nanfang Hospital of Southern Medical University, China.
4 Neonatal Intensive Care Unit, The Third People's Hospital of Shenzhen, China.
† These authors contributed equally to this work.
∗ Corresponding author: Hongzhou Lu (luhongzhou@fudan.edu.cn), Wenjin Wang (wangwj3@sustech.edu.cn)


[Fig. 1 diagram: in the clinical scenario, camera sensors (JX6420, IDS) record a video stream of the infant while medical staff annotate the awake/sleep states; preprocessing applies MediaPipe to extract the face image (224×224×3) and the left/right eye images (24×24×3); HOG features (grey-scale → gradient → cell histogram → feature vector) feed SVM, KNN, and MLP, while the face image feeds ResNet18; the improved model ResNet18-CL adds a mapper and a classifier on top of ResNet18, trained with Lmsup and Lc.]

Fig. 1. Clinical setup and pipeline for infant sleep-awake detection. The multi-center clinical trial was performed at three hospitals to collect infant videos including the awake and sleep states. The preprocessing unifies the data format. The benchmarked methods evaluate the feasibility of using infant facial features for sleep and awake classification, and ResNet18-CL is proposed to improve classification performance.
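The annotation-driven frame sampling performed in preprocessing (one frame every 10 s during sleep, every 2 s when awake, at the reported 20 FPS) can be sketched as follows. This is a minimal illustration; the function name and the segment format are assumptions of this sketch, not part of the paper's codebase.

```python
FPS = 20  # recording frame rate reported for the NICU cameras

def sample_frame_indices(segments, fps=FPS):
    """Pick frame indices from annotated segments.

    segments: list of (start_s, end_s, state) with state in {"sleep", "awake"}.
    A frame is kept every 10 s within sleep segments and every 2 s within
    awake segments, which limits over-sampling of sleeping infants.
    """
    interval_s = {"sleep": 10, "awake": 2}
    indices = []
    for start_s, end_s, state in segments:
        step = interval_s[state] * fps          # frames between saved samples
        indices.extend(range(int(start_s * fps), int(end_s * fps), step))
    return indices
```

For example, a 60 s sleep segment followed by a 10 s awake segment yields 6 sleep frames and 5 awake frames, reflecting the denser sampling of awake data.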

(MLP), and convolutional neural network (CNN). In particular, to address the scarcity of data diversity caused by over-sampling the data from a sleeping infant, we proposed a novel method based on the combination of ResNet18 and contrastive learning to improve the consistency of the features learned from different infants with the same state (or label), by pairwise comparison between the data from different infants. Extensive experiments on clinical data demonstrate the feasibility of camera-based sleep and awake detection on unseen infants.

II. CLINICAL INFANT SLEEP DATASET

Previous studies [6]–[10] collected video data from a small number of infants (fewer than 20) to train and evaluate their proposed methods. Due to the lack of diversity of the training data, this may create bias when evaluating the devices or methods. To address this, a multi-center clinical trial was conducted to collect videos of infants with different gestational ages and physical conditions (see Fig. 1).

In the NICU of the Third People's Hospital of Shenzhen and the Nanfang Hospital of Southern Medical University, an RGB camera (JX6420, JieXiang Optoelectronics, China) was used to record the infant videos at resolutions of 160×90 pixels and 320×180 pixels, sampled at 20 frames per second (FPS). Each video, ranging from 10 minutes to 2 hours, recorded a variety of infant activities without any constraints, including crying, body movement, wakefulness, sleep, etc. 26 preterm or critically ill infants, with gestational ages from 27 to 39 weeks, were recorded.

In the NICU of the Shenzhen Baoan Hospital of Traditional Chinese Medicine, we used a different RGB camera (IDS-UI3860C, Germany) to capture the infant videos at a resolution of 968×608 pixels, sampled at 20 FPS. Three videos of 2 preterm infants and 1 full-term infant were recorded, with recording lengths from 2 to 10 minutes. Their gestational ages were between 32 and 37 weeks.

In the Department of Obstetrics of the Shenzhen Baoan Hospital of Traditional Chinese Medicine, the infant videos were recorded by nurses in an unconstrained environment using the 48MP camera of a smartphone. Each video was recorded for between 20 and 35 seconds, at a resolution of 720×1280 pixels. The goal of this setup is to evaluate the performance of the trained models in a new setting with limited training data. Here we collected videos of 26 full-term infants within one hour of birth.

A clinical dataset including a total of 55 infants (full-term, preterm, and critically ill) was created in our multi-center clinical trial. To annotate the dataset, medical physicians labeled two states (awake and sleep) based on the guidelines [11]. The machine learning methods were then trained and evaluated on 5754 images (2831 sleep images and 2923 awake images) from the 55 infants.

III. METHOD

The pipelines of the benchmarked methods and the proposed method are shown in Fig. 1. The benchmarked methods fall into two categories: (i) handcrafted feature-based methods. Monitoring of eye movements can identify the infant sleep/awake state, and rapid eye movement is associated with different sleep stages [12]. For facial analysis, it has been reported that the histogram of oriented gradients (HOG) is an effective feature for characterizing the texture patterns of the eye [13]. Thus, we extract HOG features from the eye areas and use SVM, KNN, and MLP to classify between the sleep and awake states; (ii) CNN-based methods. A CNN can automatically learn task-relevant features from labeled data in an end-to-end fashion. It can also leverage existing data from neighboring fields (e.g., ImageNet) to improve domain-specific performance by fine-tuning. ResNet18, which


demonstrates success in various vision tasks, was used in our benchmark.

Since a sleeping infant is sampled multiple times in a video while its facial images may not have much variation during sleep, the scarcity of data diversity may limit the performance of the CNN model to the trained infants. To solve this issue, we proposed a method, called ResNet18-CL, that integrates ResNet18 and contrastive learning [14]. Its main idea is to reduce the similarity between samples from different categories and increase that of samples from the same category by pairwise comparison of training samples from different infants. The pairwise comparison forces the model to learn features emphasizing intra-class compactness and inter-class separateness, which improves the consistency of deep features learned from different infants with the same state.

A. Preprocessing

Since the model requires a uniform data format as input, preprocessing was performed to unify the input, including face detection, resizing, and eye area extraction. First, according to the annotation, images were sampled by saving a frame every 10 seconds when the infant was sleeping, and every 2 seconds when awake. Second, MediaPipe (proposed by Google) [15] was applied to detect the facial landmarks of the infant and then crop the face region. Third, this region was resized to 224×224 pixels. Finally, the eye area was extracted using the facial landmarks and then resized to 24×24 pixels.

B. Benchmarked methods

1) SVM, KNN, MLP with the input of handcrafted features: HOG is used as input to train and evaluate these methods. To explore the effect of different parameter settings, we tested cell sizes of [4×4, 6×6, 8×8] pixels and block sizes of [1×1, 2×2, 3×3] cells. For HOG classification, SVM adopts the Gaussian kernel with the penalty parameter C set to one. The number of neighbors of KNN is set from 10 to 20. MLP consists of three fully-connected layers with 64, 32, and 2 neural units, respectively. The best performance among the parameter combinations was reported as the final result.

2) ResNet18 with the input of face images: The highlight of ResNet18 is the residual module, which uses skip connections to retain shallow features and allow interaction between shallow and deep features. It effectively alleviates the problem of vanishing gradients and improves the generalization ability of the model. In our work, the output layer of ResNet18 was changed to a fully-connected layer with 2 neural units to predict the binary class of sleep and awake. The model was pre-trained on ImageNet, then fine-tuned on our clinical dataset.

C. Improved model with contrastive learning

As shown in Fig. 1, ResNet18-CL constrains the optimization process to maximize the cosine similarity of data from different infants with the same label and minimize that of data with different labels. This improves the consistency of feature representations related to sleep and awake states among different infants. Specifically, the output of the last layer of ResNet is defined as the embedding h. Then a mapper M and a classifier C are added to project h into different feature spaces, respectively. The mapper M consists of two fully-connected layers, each with 32 neural units, and its normalized output is used to calculate the constraint term, as shown below:

L_m^{sup} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)},  (1)

where z_i, z_p, and z_a denote the outputs of M for the i-th input sample x_i, a positive sample of x_i (with the same label as x_i), and a contrasting sample of x_i (A(i) being the set of all other samples in the batch), respectively, and P(i) and |P(i)| denote the set of all positives of x_i in one training batch and its cardinality, respectively.

Based on the cross-entropy, the classifier C performs classification between sleep and awake using two fully-connected layers with 32 and 2 neural units:

L_c = -\frac{1}{N} \sum_{i} \sum_{c=1}^{M} y_{i,c} \log(\hat{y}_{i,c}),  (2)

where y_{i,c} and \hat{y}_{i,c} denote the one-hot label and the predicted probability of class c for the i-th sample, N is the number of samples, and M = 2 is the number of classes. Therefore, the final loss function is the sum of L_c and L_m^{sup}. During optimization, the embedding h is constrained by the mapper M to gradually separate different kinds of data and aggregate the same kind of data, making h more discriminative. Meanwhile, the classifier improves the decision boundary based on h.

IV. EXPERIMENTS AND RESULTS

All deep methods were trained using the Adam optimizer with a learning rate of 10^-4 for 100 epochs. In the experiment, the clinical infant sleep dataset was randomly split into 80% for training and 20% for testing in a subject-wise way^1, where 20% of the training set is used for validation. The training set contains 4928 images (2421 sleep and 2507 awake) from 44 infants, while the test set has 826 images (410 sleep and 416 awake) from 11 infants. To evaluate the performance, quality metrics including accuracy (ACC), precision (PRE), sensitivity (SEN), specificity (SPE), and F1-score are used, following [8].

TABLE I
BENCHMARKING RESULTS OF INFANT SLEEP-AWAKE DETECTION (%)

Input  Method        ACC    PRE    SEN    SPE    F1
HOG    SVM           75.16  76.13  72.41  74.82  76.00
HOG    KNN           79.61  81.51  77.49  79.19  79.98
HOG    MLP           73.59  76.42  71.81  74.48  75.66
Face   ResNet18      82.08  82.12  81.89  82.09  81.97
Face   ResNet18-CL   86.56  86.65  87.00  86.54  86.78

^1 An infant's data only appears in either the training set or the test set.


[Fig. 2 panels, row-normalized values (rows: true sleep/awake; columns: predicted sleep/awake): (a) SVM 0.64, 0.36 / 0.15, 0.85; (b) KNN 0.71, 0.29 / 0.073, 0.93; (c) MLP 0.62, 0.38 / 0.13, 0.87; (d) ResNet18 0.84, 0.16 / 0.2, 0.8; (e) ResNet18-CL 0.84, 0.16 / 0.11, 0.89.]

Fig. 2. The confusion matrices obtained by the benchmarked methods on the test set of the infant sleep dataset.
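The metrics reported in Table I follow the standard binary-classification definitions and can be derived from confusion-matrix counts. The sketch below treats sleep as the positive class; that choice, and the 100-images-per-class counts in the usage example, are assumptions of this illustration (the actual test set has 410 sleep and 416 awake images).

```python
def clf_metrics(tp, fn, fp, tn):
    """ACC, PRE, SEN, SPE, F1 from binary confusion-matrix counts.

    tp/fn: sleep images classified as sleep/awake;
    fp/tn: awake images classified as sleep/awake.
    """
    acc = (tp + tn) / (tp + fn + fp + tn)
    pre = tp / (tp + fp)            # precision of the sleep class
    sen = tp / (tp + fn)            # sensitivity (recall of sleep)
    spe = tn / (tn + fp)            # specificity (recall of awake)
    f1 = 2 * pre * sen / (pre + sen)
    return acc, pre, sen, spe, f1
```

For example, counts matching the row-normalized ResNet18-CL matrix under the 100-per-class assumption, `clf_metrics(tp=84, fn=16, fp=11, tn=89)`, give an accuracy of 0.865, sensitivity of 0.84, and specificity of 0.89.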

[Fig. 3 panels: two t-SNE scatter plots of the embedding, (a) ResNet18 and (b) ResNet18-CL.]

Fig. 3. The t-SNE visualization of the embedding.

Table I reports the clinical evaluation of all methods. For the benchmarked methods, it shows that SVM and MLP achieve similar performance, with 75.16% and 73.59% ACC. KNN outperformed SVM and MLP, improving the ACC by 4% for awake/sleep classification. In comparison, ResNet18 clearly achieves the best performance among them in all metrics. To explore the reasons for this improvement, the confusion matrices of the benchmarked methods are shown in Fig. 2. They show that SVM, KNN, and MLP are not sensitive enough for detecting the infant sleep state, whereas ResNet18 maintains good performance on awake recognition while improving the sensitivity of sleep recognition. ResNet18 also has a more balanced performance in the classification between sleep and awake.

The performance of our proposed method is also reported in Table I. Compared to ResNet18, it obtained better performance, with an improvement of almost 5% in all metrics. The improved performance may stem from the pairwise comparison of data from different infants, which forms a variety of input combinations to train the model and thus enriches the diversity of samples, whereas the other methods only learn from one sample at a time. t-SNE was used to visualize the difference in the embedding between ResNet18 and ResNet18-CL, as shown in Fig. 3. It shows that in the embedding, the separability between different classes is clearly increased, while the compactness within the same class is improved. This demonstrates the effectiveness of ResNet18-CL, which applies contrastive learning to enhance the classification between the sleep and awake states of infants.

In summary, the clinical evaluation demonstrates the feasibility of using facial attributes (e.g., eyes) for sleep and awake detection of infants. Although the performance still has room for improvement, it shows a promising direction for infant sleep analysis in future combination with camera-based vital signs monitoring, towards camera-based polysomnography for infant monitoring.

V. CONCLUSIONS

In this paper, a multi-center clinical trial was conducted to collect videos from 55 infants (full-term, preterm, and critically ill) to investigate camera-based infant sleep and awake detection. The clinical validation shows that all benchmarked methods achieved over 75% in the metrics of accuracy, sensitivity, and specificity. Among them, the proposed ResNet18-CL obtained the best performance, with 86.56% accuracy, 87.00% sensitivity, and 86.54% specificity, by using the contrastive learning strategy.

REFERENCES

[1] F. Jiang, "Sleep and early brain development," Annals of Nutrition and Metabolism, vol. 75, no. 1, pp. 44–54, 2019.
[2] O. Verschuren et al., "Sleep: an underemphasized aspect of health and development in neurorehabilitation," Early Human Development, vol. 113, pp. 120–128, 2017.
[3] V. Bertelle et al., "Sleep in the neonatal intensive care unit," The Journal of Perinatal & Neonatal Nursing, vol. 21, no. 2, pp. 140–148, 2007.
[4] Y. Zeng et al., "A multi-modal clinical dataset for critically-ill and premature infant monitoring: EEG and videos," in IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI). IEEE, 2022, pp. 1–5.
[5] X. Long et al., "Video-based actigraphy for monitoring wake and sleep in healthy infants: A laboratory study," Sensors, vol. 19, no. 5, p. 1075, 2019.
[6] S. Cabon et al., "Audio- and video-based estimation of the sleep stages of newborns in neonatal intensive care unit," Biomedical Signal Processing and Control, vol. 52, pp. 362–370, 2019.
[7] M. Awais et al., "Can pre-trained convolutional neural networks be directly used as a feature extractor for video-based neonatal sleep and wake classification?" BMC Research Notes, vol. 13, no. 1, pp. 1–6, 2020.
[8] M. Awais et al., "A hybrid DCNN-SVM model for classifying neonatal sleep and wake states based on facial expressions in video," IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 5, pp. 1441–1449, 2021.
[9] X. Long et al., "Video-based actigraphy is an effective contact-free method of assessing sleep in preterm infants," Acta Paediatrica (Oslo, Norway: 1992), vol. 110, no. 6, p. 1815, 2021.
[10] A. Heinrich et al., "Body movement analysis during sleep based on video motion estimation," in IEEE 15th International Conference on e-Health Networking, Applications and Services, 2013, pp. 539–543.
[11] J. Robinson et al., "Eyelid opening in preterm neonates," Archives of Disease in Childhood, vol. 64, no. 7, pp. 943–948, 1989.
[12] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2005, pp. 886–893.
[13] Y. Dong et al., "Comparison of random forest, random ferns and support vector machine for eye state classification," Multimedia Tools and Applications, vol. 75, pp. 11763–11783, 2016.
[14] P. Khosla et al., "Supervised contrastive learning," Advances in Neural Information Processing Systems, vol. 33, pp. 18661–18673, 2020.
[15] C. Lugaresi et al., "MediaPipe: A framework for building perception pipelines," arXiv preprint arXiv:1906.08172, 2019.
