IEEE SENSORS JOURNAL, VOL. 23, NO. 13, 1 JULY 2023

Future Frame Prediction Network for Human Fall Detection in Surveillance Videos
Suyuan Li and Xin Song

Abstract—Video fall detection is one of the most significant challenges in the computer vision domain, and it usually involves the recognition of events that do not conform to expected normal behavior. Recently, unsupervised models have become popular to address the need for substantial manually labeled training data in supervised learning. However, almost all existing unsupervised methods minimize reconstruction errors, which may lead to an insufficient gap in reconstruction error between fall and nonfall video frames because of the powerful representation ability of the neural network. In this article, we propose a novel efficient fall detection method based on future frame prediction. Specifically, an attention U-Net with flexible global aggregation blocks that can achieve better performance is regarded as the frame prediction network, so that several previous video frames are used to predict the next future frame. In the training phase, commonly used appearance constraints on intensity and gradient and a motion constraint are combined to generate higher quality frames. Such constraints promote the performance of the prediction network, which can enlarge the difference between the predicted fall frame and the real fall frame. In the testing phase, a fall score based on the error between the predicted frame and the real frame is computed to distinguish fall events. Exhaustive experiments have been conducted on the UR fall dataset, the multiple cameras fall dataset (MCFD), and the high-quality fall simulation dataset, and the results verify the effectiveness of the proposed method, which outperforms other existing state-of-the-art methods.

Index Terms—Attention gate, fall detection, prediction network, U-Net.

Manuscript received 27 February 2023; revised 11 May 2023; accepted 12 May 2023. Date of publication 19 May 2023; date of current version 29 June 2023. This work was supported in part by the National Natural Science Foundation of China under Grant 61473066 and Grant 61601109, in part by the 2023 Hebei Provincial Doctoral Candidate Innovation Ability Training Funding Project under Grant CXZZBS2023170, and in part by the Natural Science Foundation of Hebei under Grant F2021501020. The associate editor coordinating the review of this article and approving it for publication was Prof. Yu-Dong Zhang. (Corresponding author: Xin Song.)

The authors are with the School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China (e-mail: 2010649@stu.neu.edu.cn; sxin78916@neuq.edu.cn).

Digital Object Identifier 10.1109/JSEN.2023.3276891

I. INTRODUCTION

Fall detection aims to identify events that do not conform to expected normal behavior in a video sequence, which is an essential task in video surveillance. With the rapid growth of the aging population, it is more imperative than ever to address issues related to the health of the elderly, with falls among the senior population standing out in particular. The World Health Organization estimates that 646 000 falls result in fatalities each year [1]. In such cases, visual and muscle impairment, frequent loss of balance, lightheadedness, pharmacological side effects, unconsciousness, and slippage are the most common causes of falls [2]. As shown in Fig. 1, failing to recognize falls in the elderly can lead to a variety of familial, medical, and psychological issues. Therefore, in order to rigorously monitor and guarantee the health of the elderly, it is essential to develop an automatic and reliable fall detection system.

Fig. 1. Falling complications.

Recently, a variety of researchers have focused their efforts on developing an intelligent monitoring system that can reliably detect falls. In general, in light of the involved sensors, cameras can provide comprehensive information about posture and position, as well as other advantages such as low cost and noninvasiveness, compared to wearable sensors and ambient sensors.

In camera device-based systems, traditional methods based on handcrafted features mostly rely on extracting personal shape and geometry to distinguish unusual motions. The performance of these methods, however, may be impacted by shadow, occlusion, variation in illumination, and viewpoint variations. Recently, deep learning, particularly convolutional neural networks (CNNs) [4], [5], has had a great deal of success in computer vision tasks such as action recognition and object classification. These deep-learning-based methods automatically extract features and possess a greater discriminating ability for image representation, which can effectively cope with the aforementioned limitations. As a consequence, fall detection algorithms based on CNNs, such as VGG16 [6], ResNet [7], and GoogleNet [8], have become increasingly popular.

Fig. 2. Classification of fall detection methods.

These previous algorithms can generally achieve superior performance, whereas they are trained in a supervised manner on data that are time-consuming to acquire and manually annotate. Therefore, autoencoder-based solutions are leveraged to perform video fall detection, which can be trained on unlabeled data in an unsupervised manner. Specifically, these methods based on convolutional auto-encoder (CAE) networks [9], [10], [11] mainly utilize the reconstruction error between the input frame and the reconstructed frame to distinguish falls, assuming that the reconstruction error of a fall frame is obviously distinct from that of a nonfall frame. However, an identity mapping between the input and the output is the core of the autoencoder. In light of the above, it is not sufficient to assume that a fall event yields a large reconstruction error, because of the powerful representation ability of the neural network, so the performance of reconstruction-error-based methods may not be optimal.

Nowadays, with the rise of generative adversarial networks (GANs) [12], video prediction performance has significantly increased. As a consequence, in this article, to solve the above problem with self-reconstruction representation, we propose a unique method based on predicted future frames for recognizing fall events in video surveillance, identifying fall events by comparing real frames with predicted frames. The core idea of the proposed method is that the prediction network can predict the future nonfall frame better than the future fall frame. Specifically, we utilize attention U-Net as a predictor to generate the future frame, and it is trained only on nonfall data. In the training phase, we first adopt the intensity loss and gradient loss to constrain the appearance, which makes the predicted frame close to the real frame. Then, we adequately consider motion as another important feature, which is constrained by an optical flow loss. In addition, a GAN module is applied to generate high-quality frames. Finally, in the testing phase, the trained attention U-Net is used to predict the future frame, and the error between the predicted future frame and the real frame can be applied to compute a fall score, further detecting falls.

The following is a list of our contributions: 1) we propose a frame prediction framework for fall detection that leverages attention U-Net to predict the next future frame; 2) we leverage an optical flow constraint to force the optical flow of the predicted frame to be close to that of the real frame, maintaining motion consistency; and 3) we further develop a novel fall score based on the error between the predicted future frame and the real frame to improve the overall detection performance. Also, the performance of the proposed method is verified through experiments on three public datasets.

The remainder of this article begins with a review of relevant video fall detection works in Section II. The proposed future frame prediction framework is then introduced in Section III, which includes the appearance constraints, motion constraint, and adversarial constraint. In Section IV, the proposed method is thoroughly evaluated, and the experimental results are reported on three public datasets. Finally, in Section V, the conclusion is drawn.

II. RELATED WORK

Fall detection is becoming a more and more popular research topic in the field of public healthcare and medical services. As shown in Fig. 2, the devices used in fall detection systems can be divided into three categories: wearable sensors, ambient sensors, and cameras [13]. The typical methods based on wearable sensors mainly depend on sensors embedded on the body, such as accelerometers, to effectively recognize the movement and position of the subject [14], [15], [52], [53], [54]. However, these wearable sensors have to be placed in a particular location for a long amount of time, which will be uncomfortable for the elderly. Another common approach mainly adopts ambient sensors [55], such as pressure sensors, to collect audio and visual information for detecting fall instances. Consequently, it is laborious to install these sensors throughout the whole surface, and they are easily exposed to noise. The methods based on camera sensors have recently gotten a lot of attention because of their low cost and better user-friendliness. RGB cameras, Kinect cameras, thermal sensors, or even several cameras are often used to collect visual data. This section delves into the research on vision-based fall detection systems [56], which can be separated into two categories based on different features: handcrafted features and deep features.

A. Handcrafted Features Based on Fall Detection

Traditionally, fall detection methods based on handcrafted features mainly rely on body shape, human postures, and tracking head motions. Human silhouette information from provided sequences for describing falls is considered the basic traditional way to extract body shape features. In [16], frame differencing is adopted to obtain the human silhouette.


The contour of the human body is used to increase privacy protection in human body recognition, and the vertical projection histogram and statistical scheme of the contour image are used to lower the impact of the upper limbs of the human body. According to the ratio and difference of the height and width of the body contour bounding box, a k-nearest neighbor (KNN) classification method is used to categorize human falls. Unlike the frame difference method, the background subtraction method eliminates the backdrop from the video while retaining the dynamic foreground target. In [17], a foreground human body is extracted using a background subtraction method. After morphological operations, the human silhouette is altered and covered with a fitted ellipse. Second, shape elements from the covered silhouette are quantified to illustrate numerous human positions. Finally, a trained directed acyclic graph support vector machine distinguishes falls from other daily activities. In contrast to body shape features, the use of posture information concentrates on the process of movement change. Cucchiara et al. [18] calculated the projection histogram of each individual in each frame and compared it with the probability projection map stored for each pose in the training phase. After that, they use the tracking module's information to further verify the obtained pose, demonstrating the reliability of the first-stage classification. In addition, optical flow can also be adopted to estimate human posture. Iazzi et al. [19] proposed to utilize optical flow to extract numerous characteristics from the three blocks corresponding to the head, the center of the body, and the feet for motion changes, as well as to compute the speed and direction of movement for each object, categorizing falls by a support vector machine. To effectively model spatial–temporal characteristics, a Gaussian mixture model (GMM) employing histogram of optical flow and motion boundary histogram features is proposed [20]. Different from body shape features and human posture, head analysis is mainly based on head tracking when big movements occur in a video stream. Because the amplitude of head motion during falls is high, the strategy of segmenting the head and torso is recommended in [21]. Nevertheless, all known video-based approaches involve extracting the object, which is prone to image noise, brightness change, and occlusion.

B. Deep Learning Features Based on Fall Detection

In this decade, deep neural networks have undergone rapid, revolutionary development. Recently, deep learning has been applied to fall detection systems as a result of its effectiveness in image classification, object identification, and recognition tasks. Unlike the methods based on handcrafted features, deep learning, such as CNNs, can automatically learn efficient deep features of objects. In [22], CNNs are used as automatic extractors to obtain deep features from optical flow images for directly and effectively identifying fall events in sequential data. In addition, due to the great success of recurrent neural networks on sequential data, long short-term memory (LSTM) is gradually used to overcome the problem of gradient disappearance and gradient explosion during extended series training of falls. In [23], to detect pedestrians in the frames and complete the track assignment, the YOLO v3 and Deep SORT methods are used, and then, effective features are fed into an attention-guided LSTM model for final detection. With the development of 3-D convolutional neural networks, a 3-D CNN is employed in combination with an LSTM-based attention mechanism to locate the activity and obtain discriminant features [24]. Xiong et al. [57] proposed a skeleton-based 3-D consecutive-low-pooling neural network to extract the skeleton representation for fall recognition. Different from the above methods based on a single input, multiple inputs, such as optical flows and RGB video frames, are combined and fed into the CNN to further improve the detection performance [25]. However, the above outlined approaches based on deep learning mostly use supervised classification, which necessitates a large amount of manually annotated data and consumes a huge amount of resources. Considering the rarity of falls as well as simplifying the data collection procedure, autoencoders [26], [27], [28], [29], [30], [31] are adopted for video fall detection. Generally, these existing methods mainly rely on CAE to compute the reconstruction errors between the inputs and outputs, potentially learning the discrimination of reconstructions between fall and nonfall frames. In [32], a novel dilated convolutional autoencoder with LSTM for aggregating high-level spatial and temporal features is proposed.

C. Video Frame Prediction

Video frame prediction has received a great deal of attention recently since it offers so much potential for unsupervised video representation learning. Due to the further development of GANs, video frame prediction has been used in more and more potential applications, such as video anomaly detection. Luo et al. [33] explored a video anomaly detection prediction network based on a graph convolutional network (GCN), which can effectively form spatial and temporal connections of joints. To improve the performance of anomaly detection in surveillance videos, Chen et al. [34] designed a bidirectional prediction-based architecture that uses forward and backward prediction subnetworks to generate the same target frame. Then, the true target frame and its bidirectionally generated frame are used to construct a loss function for recognizing unusual instants. To better describe the intrinsic spatial–temporal relationship between frames, Wang et al. [35] aimed to make full use of RGB pixel synthesis and optical flow warping methods for establishing a multibranch mask network, where a mask layer can be added in each branch adaptively to control the magnitude range of the estimated optical flow and the weight of the predicted frames. Li et al. [36] designed a two-step spatial–temporal cascade autoencoder model: the spatial–temporal adversarial autoencoder preliminarily identifies anomalous video cuboids and excludes normal cuboids in the first phase, and the spatial–temporal convolutional autoencoder utilizes a reconstruction error-based strategy that makes use of the CAE and skip connections to categorize the anomalous cuboids again in the second phase.

III. PROPOSED METHOD

In this article, we propose a visual fall detection method based on attention U-Net.


Fig. 3. Overall framework of our fall recognition based on prediction network.

Fig. 4. Schematic of attention U-Net.

Specifically, the overall framework is shown in Fig. 3. The goal of the proposed method is to train attention U-Net to predict the future frame based on previous frames, and the error between the predicted frame and the real frame can be leveraged to distinguish the presence or absence of fall behaviors. To obtain high-quality predicted nonfall frames, intensity loss, gradient loss, and adversarial loss are considered as appearance constraints in the training phase. In addition, to focus on the temporal information, the optical flow loss is regarded as a motion constraint, where the optical flow can be directly calculated by the trained FlowNet [37]. In this case, fall frames can be easily recognized by the bigger errors between predicted frames and real frames in the test phase. By definition, $I_{t-T}, \ldots, I_{t-1}$ denotes the $T$ consecutive input frames, $\hat{I}_t$ denotes the future frame predicted by the model, and $I_t$ denotes the corresponding real frame. Next, all the components of our framework are introduced in detail.

A. Attention U-Net

U-Net is a popular frame generation or image generation network among existing methods [48]. Generally, it primarily consists of an encoder and a decoder that extract features with spatial resolutions that progressively decrease and gradually increase, and restore video frames with those resolutions. In addition, the encoder features can then be transferred to the decoder via skip connections. Consequently, low- and high-level features are combined to reinforce detailed decoders. Nevertheless, because of the complexity and variety of fall behavior, the strategy of transferring features by copying and sharing between the same scales is too simplistic to adequately predict the complex shape and motion of human falls. Furthermore, because similar low-level features are repeatedly extracted in the U-Net, the model can result in excessive and redundant use of computational resources and model parameters. Accordingly, the attention gate is integrated into the U-Net for the prediction via skip connections. The attention gate can gradually suppress feature responses in the irrelevant background areas and enhance useful salient features about the person, aiming to focus on the person of varying shapes and sizes. Furthermore, the region where the human body occurs can be predicted better, which will result in more accurate and robust fall recognition performance.

Fig. 4 shows the construction of the attention U-Net network, which primarily consists of skip connection layers with attention mechanisms, an encoder, and a decoder. Specifically, the size of the input is 256 × 256 × T × 3, where 256 denotes the length and width of the input frames, T denotes the quantity of input video frames, and 3 denotes the channels. The encoder mainly includes three blocks for downsampling, each realized by two 3 × 3 convolution operations and a 2 × 2 max-pooling operation. Correspondingly, the decoder includes three blocks for upsampling, and the output size is 256 × 256 × 3.
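As a rough illustration of this kind of predictor, the following is a minimal, self-contained PyTorch sketch of an encoder-decoder with skip connections; the module names, channel widths, and the choice to stack the T input frames along the channel dimension are assumptions of this sketch rather than details taken from the paper, and the attention gate on the skips is left as an identity here (a sketch of the gate itself follows the equations in Section III-A).

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions with ReLU, as in each encoder/decoder block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class PredictionUNet(nn.Module):
    """Encoder-decoder predictor: T past RGB frames in, one future frame out."""
    def __init__(self, num_input_frames=4, base_ch=64):
        super().__init__()
        in_ch = 3 * num_input_frames            # T frames stacked along channels (assumption)
        self.enc1 = ConvBlock(in_ch, base_ch)
        self.enc2 = ConvBlock(base_ch, base_ch * 2)
        self.enc3 = ConvBlock(base_ch * 2, base_ch * 4)
        self.pool = nn.MaxPool2d(2)              # 2x2 max pooling for downsampling
        self.bottleneck = ConvBlock(base_ch * 4, base_ch * 8)
        self.up3 = nn.ConvTranspose2d(base_ch * 8, base_ch * 4, 2, stride=2)
        self.dec3 = ConvBlock(base_ch * 8, base_ch * 4)
        self.up2 = nn.ConvTranspose2d(base_ch * 4, base_ch * 2, 2, stride=2)
        self.dec2 = ConvBlock(base_ch * 4, base_ch * 2)
        self.up1 = nn.ConvTranspose2d(base_ch * 2, base_ch, 2, stride=2)
        self.dec1 = ConvBlock(base_ch * 2, base_ch)
        self.out = nn.Conv2d(base_ch, 3, kernel_size=1)   # predicted RGB frame
        self.gate = nn.Identity()                # placeholder for the attention gate

    def forward(self, frames):                   # frames: (B, T*3, 256, 256)
        e1 = self.enc1(frames)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        b = self.bottleneck(self.pool(e3))
        d3 = self.dec3(torch.cat([self.gate(e3), self.up3(b)], dim=1))
        d2 = self.dec2(torch.cat([self.gate(e2), self.up2(d3)], dim=1))
        d1 = self.dec1(torch.cat([self.gate(e1), self.up1(d2)], dim=1))
        return torch.tanh(self.out(d1))           # pixels normalized to (-1, 1)

# Example: predict frame t from the previous 4 frames.
model = PredictionUNet(num_input_frames=4)
past = torch.randn(1, 12, 256, 256)
pred = model(past)                                # (1, 3, 256, 256)
```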


Fig. 5. Schematic of the attention module.

In addition, it is worth noting that the attention module is implemented before each block of upsampling, which can suppress feature responses in irrelevant background regions and boost performance. In the attention module, the feature maps produced by the upsampling blocks and the downsampling blocks at the corresponding position are subjected to various full convolution operations, and these features are finally combined to obtain the attention coefficients. Fig. 5 shows the schematic of the attention module mechanism, which gradually connects to the decoding network. In more detail, the decoder feature $\phi^l$ and the encoder feature $\psi^l$ can be, respectively, represented as follows:

$$\phi^l = \delta\Bigg(\sum_m \sum_{i=0}^{H} \sum_{j=0}^{W} w_g^l g^l + b_g^l\Bigg) \tag{1}$$

$$\psi^l = \delta\Bigg(\sum_m \sum_{i=0}^{H} \sum_{j=0}^{W} w_x^l x^l + b_x^l\Bigg) \tag{2}$$

where $g^l$ and $x^l$ are, respectively, the feature maps of the upsampling block and the downsampling block, $w_g^l$ and $w_x^l$ are, respectively, the learned weights used to extract the feature maps $g^l$ and $x^l$, $b_g^l$ and $b_x^l$ are the correspondingly learned biases, $H$ and $W$ are the height and width of the feature maps, respectively, and $\delta$ denotes ReLU. After the features of the encoder and decoder are extracted, the attention coefficient can be calculated as follows:

$$M^l = \sigma\Bigg(\sum_m \sum_{i=0}^{H} \sum_{j=0}^{W} w^l\big(\phi^l + \psi^l\big) + b^l\Bigg) \tag{3}$$

where $w^l$ and $b^l$ are, respectively, the learned weights and bias applied to the feature maps $\phi^l$ and $\psi^l$ in the $l$th layer and $\sigma$ denotes the sigmoid activation function, which makes each output value range from 0 to 1, i.e., $M^l \in [0, 1]$. Furthermore, all the above learned weights adopt $1 \times 1 \times 1$ convolutions. The relevance of each element in $g^l$ and $x^l$ can be determined by updating the feature weights of the $g^l$ and $x^l$ feature maps using backpropagation learning. Eventually, the feature of the attention module can be obtained as follows:

$$\hat{x}^l = M^l \cdot x^l. \tag{4}$$

In the attention module, the human action pixel channel can be weighted by the attention coefficient, and other background-related weights can be reduced, which improves the prediction of the region in which the human body may act. The performance of the network model in predicting human behavior can thus be improved by incorporating the attention module in the skip connections of the encoder and decoder.
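To make (1)-(4) concrete, the following is a minimal PyTorch sketch of one such attention gate; the intermediate channel width and the use of 2-D 1 × 1 convolutions for the learned weights are assumptions of this sketch, not a reproduction of the authors' code.

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Attention gate on a skip connection, following (1)-(4).

    g: decoder (gating) feature map g^l, x: encoder feature map x^l.
    All learned weights are 1x1 convolutions, delta is ReLU, sigma is sigmoid.
    """
    def __init__(self, g_ch, x_ch, inter_ch):
        super().__init__()
        self.w_g = nn.Conv2d(g_ch, inter_ch, kernel_size=1)   # w_g^l, b_g^l
        self.w_x = nn.Conv2d(x_ch, inter_ch, kernel_size=1)   # w_x^l, b_x^l
        self.w = nn.Conv2d(inter_ch, 1, kernel_size=1)        # w^l, b^l
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, g, x):
        phi = self.relu(self.w_g(g))          # (1): phi^l = delta(w_g^l g^l + b_g^l)
        psi = self.relu(self.w_x(x))          # (2): psi^l = delta(w_x^l x^l + b_x^l)
        m = self.sigmoid(self.w(phi + psi))   # (3): M^l in [0, 1]
        return m * x                          # (4): x_hat^l = M^l . x^l

# Example: gate a 128-channel encoder map with a 128-channel decoder map.
gate = AttentionGate(g_ch=128, x_ch=128, inter_ch=64)
g = torch.randn(1, 128, 64, 64)
x = torch.randn(1, 128, 64, 64)
x_hat = gate(g, x)   # same shape as x, with background responses suppressed
```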
B. Loss Function

Generally, it is vital to design loss functions that achieve convergence and better performance. In our scheme, to minimize the gap between the predicted frame and the real frame, the similarity in appearance and motion between the predicted frame and the real frame is adequately considered. Eventually, the appearance constraints, motion constraint, and adversarial constraint are jointly optimized in the final loss function.

1) Appearance Constraints: Inspired by the previous method in [38], intensity and gradient constraints can usually be adopted. Therefore, an intensity loss is first utilized to ensure the appearance similarity of the predicted frames and real frames pixel by pixel. Specifically, the intensity loss based on the $\ell_2$ distance can be formally represented as follows:

$$L_{\text{int}}\big(I_t, \hat{I}_t\big) = \sum_{i, j} \big\| I_t - \hat{I}_t \big\|_2 \tag{5}$$

in which $\hat{I}_t$ and $I_t$ are, respectively, the $t$th predicted and real frames and $(i, j)$ denotes the pixel index. In addition, the image gradient expresses the trend of image intensity variation, which can sharpen the image and further guarantee that the appearance of the predicted frame and the real frame is consistent. Thus, the gradient difference can be calculated in the horizontal and vertical orientations as follows:

$$\Delta_{\text{grad}}^{[i]}(I, i, j) = |I(i, j) - I(i-1, j)| \tag{6}$$

$$\Delta_{\text{grad}}^{[j]}(I, i, j) = |I(i, j) - I(i, j-1)|. \tag{7}$$

Based on (6) and (7), the gradient loss is computed as follows:

$$L_{\text{grad}}\big(I_t, \hat{I}_t\big) = \sum_{i, j} \Big\| \Delta_{\text{grad}}^{[i]}(I_t, i, j) - \Delta_{\text{grad}}^{[i]}\big(\hat{I}_t, i, j\big) \Big\|_1 + \sum_{i, j} \Big\| \Delta_{\text{grad}}^{[j]}(I_t, i, j) - \Delta_{\text{grad}}^{[j]}\big(\hat{I}_t, i, j\big) \Big\|_1. \tag{8}$$

Thus, the final appearance loss can be obtained as follows:

$$L_{\text{appearance}}\big(I_t, \hat{I}_t\big) = L_{\text{int}}\big(I_t, \hat{I}_t\big) + L_{\text{grad}}\big(I_t, \hat{I}_t\big). \tag{9}$$
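A small PyTorch sketch of these two appearance terms, following (5)-(8) directly, might look as follows; the tensor layout (B, C, H, W) and the absence of any batch averaging are assumptions of the sketch.

```python
import torch

def intensity_loss(real, pred):
    """Eq. (5): l2 distance between the real and predicted frames."""
    return torch.norm(real - pred, p=2)

def gradient_loss(real, pred):
    """Eqs. (6)-(8): l1 distance between horizontal/vertical image gradients."""
    def grads(img):
        gi = torch.abs(img[:, :, 1:, :] - img[:, :, :-1, :])   # gradient along i (height)
        gj = torch.abs(img[:, :, :, 1:] - img[:, :, :, :-1])   # gradient along j (width)
        return gi, gj
    ri, rj = grads(real)
    pi, pj = grads(pred)
    return torch.sum(torch.abs(ri - pi)) + torch.sum(torch.abs(rj - pj))

def appearance_loss(real, pred):
    """Eq. (9): sum of intensity and gradient losses."""
    return intensity_loss(real, pred) + gradient_loss(real, pred)

# Example on dummy frames normalized to (-1, 1).
real = torch.rand(1, 3, 256, 256) * 2 - 1
pred = torch.rand(1, 3, 256, 256) * 2 - 1
loss = appearance_loss(real, pred)
```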


2) Motion Constraint: Generally, when only the gradient loss and the intensity loss are optimized, the optimal prediction of the future frame cannot be obtained, because the motion of the human body may result in large changes with only small distortions in intensity at a few pixels. In addition, for recognizing fall behaviors, temporal correlation is usually crucial for predicting the future frame, so an optical flow loss is added to guarantee the correctness of motion prediction. As a consequence, we apply the advanced CNN-based method [37] to estimate the optical flow. In this case, the pretrained FlowNet can decrease the complexity of our network, hasten the network training process, and guarantee the correctness of the optical flow estimation. Specifically, the optical flow at each pixel can be determined for the predicted frame and the real frame, respectively, in the horizontal and vertical directions, and the optical flow loss can be expressed as follows:

$$L_{\text{optical}}\big(I_t, \hat{I}_t\big) = \Big\| f\big(\hat{I}_{t+1}, I_t\big) - f\big(I_{t+1}, I_t\big) \Big\|_1 \tag{10}$$

where $f(\cdot)$ denotes the pretrained FlowNet optical flow estimation.
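A hedged sketch of such a motion term is shown below; `flow_net` is a stand-in for the pretrained FlowNet [37] (any callable returning a flow field of shape (B, 2, H, W) would do), since the authors' exact FlowNet wrapper is not reproduced here.

```python
import torch
import torch.nn as nn

def optical_flow_loss(flow_net, pred_next, real_next, real_prev):
    """Eq. (10): l1 distance between the flow estimated from the predicted
    frame pair and the flow estimated from the real frame pair."""
    with torch.no_grad():
        flow_real = flow_net(real_next, real_prev)   # target flow, no gradients
    flow_pred = flow_net(pred_next, real_prev)       # flow of the predicted frame
    return torch.sum(torch.abs(flow_pred - flow_real))

# Illustration-only stand-in: a real FlowNet would be loaded from pretrained
# weights and kept frozen during training.
class DummyFlowNet(nn.Module):
    def forward(self, frame_a, frame_b):
        b, _, h, w = frame_a.shape
        return torch.zeros(b, 2, h, w)

flow_net = DummyFlowNet().eval()
loss = optical_flow_loss(flow_net,
                         pred_next=torch.rand(1, 3, 256, 256),
                         real_next=torch.rand(1, 3, 256, 256),
                         real_prev=torch.rand(1, 3, 256, 256))
```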
3) Adversarial Constraint: Motivated by the success of GANs in image generation and video generation, an adversarial constraint is added to make the predicted frame even closer to the real frame in our proposed method. Commonly, a GAN is mainly comprised of a generative model G, which is utilized to generate data, and a discriminative model D, which is regarded as a binary classifier that distinguishes between generated data and ground-truth data. In fact, the attention U-Net is regarded as the generator G to generate the predicted future frame. Furthermore, three fully connected layers and a sigmoid activation layer constitute the discriminative model D, which is devoted to distinguishing the generated frame obtained by G from the real frame. As for D, class 1 represents the real frame and class 0 represents the generated frame. Specifically, G and D are optimized alternately, and the discriminative loss when G is fixed can be expressed formally as follows:

$$L_{\text{adversarial}}^{D}\big(\hat{I}_t, I_t\big) = \sum_{i, j} \frac{1}{2} L_{\text{mse}}\big(D(I_t)_{i, j}, 1\big) + \sum_{i, j} \frac{1}{2} L_{\text{mse}}\big(D\big(\hat{I}_t\big)_{i, j}, 0\big) \tag{11}$$

where $L_{\text{mse}}$ is the mean square error (MSE) loss function, $D(I_t)_{i, j}$ takes values in $\{0, 1\}$, and $D(\hat{I}_t)_{i, j} \in [0, 1]$. As for G, the goal is to attempt to generate frames that can be judged as real frames. Therefore, the generative loss when D is fixed can be expressed formally as follows:

$$L_{\text{adversarial}}^{G}\big(\hat{I}_t, I_t\big) = \sum_{i, j} \frac{1}{2} L_{\text{mse}}\big(D\big(\hat{I}_t\big)_{i, j}, 1\big). \tag{12}$$

To sum up, the appearance loss, motion loss, and adversarial loss are combined to optimize the attention U-Net to predict higher quality future frames, and the final training loss can be formally expressed as follows:

$$L = \lambda_{\text{appearance}} L_{\text{appearance}}\big(I_t, \hat{I}_t\big) + \lambda_{\text{optical}} L_{\text{optical}}\big(I_t, \hat{I}_t\big) + \lambda_{\text{adversarial}} L_{\text{adversarial}}\big(I_t, \hat{I}_t\big) \tag{13}$$

where $\lambda_{\text{appearance}}$, $\lambda_{\text{optical}}$, and $\lambda_{\text{adversarial}}$ are, respectively, the weights of the appearance loss, motion loss, and adversarial loss, limiting the impact of the different loss functions.
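The alternating optimization of (11)-(13) could be organized roughly as in the sketch below; the discriminator interface, the `content_loss` helper standing in for the appearance and optical flow terms, and the default loss weights (taken from the values reported in Section IV-B) are assumptions of this sketch, not the authors' released training code.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

def discriminator_loss(d, real_frame, pred_frame):
    """Eq. (11): D is pushed toward 1 on real frames and 0 on predicted ones."""
    real_score = d(real_frame)
    fake_score = d(pred_frame.detach())          # G is fixed during this step
    return 0.5 * mse(real_score, torch.ones_like(real_score)) + \
           0.5 * mse(fake_score, torch.zeros_like(fake_score))

def generator_adv_loss(d, pred_frame):
    """Eq. (12): G tries to make D score the predicted frame as real."""
    fake_score = d(pred_frame)
    return 0.5 * mse(fake_score, torch.ones_like(fake_score))

def train_step(g, d, opt_g, opt_d, past_frames, real_frame, content_loss,
               lam_app=1.0, lam_flow=1.0, lam_adv=0.2):
    """One combined step of eq. (13); content_loss(real, pred) is assumed to
    return the appearance and optical-flow terms as two scalars."""
    pred = g(past_frames)

    # Update D with G fixed.
    opt_d.zero_grad()
    d_loss = discriminator_loss(d, real_frame, pred)
    d_loss.backward()
    opt_d.step()

    # Update G with D fixed.
    opt_g.zero_grad()
    app, flow = content_loss(real_frame, pred)
    g_loss = lam_app * app + lam_flow * flow + lam_adv * generator_adv_loss(d, pred)
    g_loss.backward()
    opt_g.step()
    return g_loss.item(), d_loss.item()
```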
C. Fall Score

In order to recognize fall behaviors well, we anticipate that the attention U-Net will predict future nonfall video frames better than fall video frames. Thus, only normal data are used for training. In this case, the historical information in the model is beneficial to nonfall frames, which will result in smaller errors between predicted frames and real frames. In contrast, fall frames cannot be accurately predicted, and the errors between the predicted and real frames will be bigger. Then, the error can be utilized to distinguish falls and nonfalls. In the context of reconstruction-based fall detection, MSE and peak signal-to-noise ratio (PSNR) are two widely used measures for evaluating image quality. Following the work in [38], MSE is not suitable for evaluating prediction tasks. In this article, PSNR is utilized to evaluate the quality of the predicted video frame, which can be expressed formally as follows:

$$\text{PSNR}\big(\hat{I}_t, I_t\big) = 10 \log_{10} \frac{\big[\max_{\hat{I}_t}\big]^2}{\frac{1}{N} \sum_{i=0}^{N} \big(\hat{I}_i - I_i\big)^2}. \tag{14}$$

Usually, a high PSNR of the video frame indicates a nonfall event, whereas a small PSNR indicates a more likely fall event. As a consequence, the PSNR of all video frames can be normalized and computed as follows:

$$S(t) = \frac{\text{PSNR}\big(\hat{I}_t, I_t\big) - \min_i \text{PSNR}\big(\hat{I}_i, I_i\big)}{\max_i \text{PSNR}\big(\hat{I}_i, I_i\big) - \min_i \text{PSNR}\big(\hat{I}_i, I_i\big)} \tag{15}$$

where falls and nonfalls can be recognized according to a threshold based on $S(t)$. However, it may not be realistic to merely consider the fall behavior at the frame level, since the fall behavior is also a temporally continuous, cohesive activity. Generally, the PSNR curve of the video frames is often smooth in nonfall videos, but in fall videos, the PSNR curve may suddenly drop and rise sharply. As a result, in order to further identify falls well, the variance of the PSNR of all frames in a video can be calculated as follows:

$$\sigma^2 = \frac{1}{N} \sum_{i=0}^{N} \big(\text{PSNR}_i - \text{PSNR}_\mu\big)^2 \tag{16}$$

where $N$ is the total number of frames in a video, and $\text{PSNR}_\mu$ is the mean PSNR of the whole video. Eventually, we can set a threshold based on $\sigma^2$ to distinguish falls and nonfalls.
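A compact sketch of this scoring procedure (per-frame PSNR, min-max normalization, and the per-video PSNR variance) is given below; the decision threshold is dataset dependent, and the value used here is purely illustrative.

```python
import numpy as np

def psnr(pred, real, max_val=1.0):
    """Eq. (14): PSNR between a predicted and a real frame (arrays in [0, max_val])."""
    mse = np.mean((pred - real) ** 2)
    return 10.0 * np.log10((max_val ** 2) / (mse + 1e-12))

def fall_scores(pred_frames, real_frames):
    """Eq. (15): min-max normalized score S(t) for every frame of a video."""
    p = np.array([psnr(p_, r_) for p_, r_ in zip(pred_frames, real_frames)])
    s = (p - p.min()) / (p.max() - p.min() + 1e-12)
    return s, p

def psnr_variance(psnr_values):
    """Eq. (16): variance of the per-frame PSNR over the whole video."""
    return np.mean((psnr_values - psnr_values.mean()) ** 2)

def is_fall_video(pred_frames, real_frames, threshold=4.0):
    """A video whose PSNR variance exceeds the chosen threshold (assumed value
    here) is flagged as containing a fall."""
    _, p = fall_scores(pred_frames, real_frames)
    return psnr_variance(p) > threshold
```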


TABLE I
THREE BENCHMARK FALL DATASETS

IV. EXPERIMENTS

In this section, the description of the three benchmark datasets and the details of implementation are first provided; then, the experimental results are given and the impact of the different components is analyzed and verified. Finally, the performance of the proposed method is assessed and compared with several other state-of-the-art methods.

A. Datasets

We perform our experiments on three publicly available fall detection datasets, including the UR fall dataset, the multiple cameras fall dataset (MCFD), and the high-quality fall simulation dataset. Table I shows the detailed information of these datasets.

1) UR Fall Dataset: The University of Rzeszow's Computational Modeling Discipline Centre created the UR fall detection dataset (URFD) [39]. The dataset includes 70 videos in total, of which 30 are fall videos and 40 are nonfall videos, such as walking, sitting, squatting, and leaning. The performers in the videos also displayed several fall behaviors, including leaning backward, slanting, and suddenly falling to the ground. In the dataset, all fall-related behaviors and daily activities are recorded as RGB images with a resolution of 640 × 480. Some keyframes are shown in Fig. 6(a).

2) Multiple Cameras Fall Dataset: Auvinet et al. [40] created the MCFD. The dataset contains 192 videos in total over 24 scenes, including 22 fall scenes and two daily activity scenes, which are recorded by eight calibrated cameras with a resolution of 720 × 480. The dataset differs from other fall datasets in that it captures behavior and motion data from a variety of perspectives. In terms of daily behavior, activities including moving boxes, dressing, and cleaning rooms are incorporated in addition to typical behaviors such as walking, bending over, and sitting, which enriches the category of falls and is more representative of actual life scenarios. Some keyframes are shown in Fig. 6(b).

3) High-Quality Fall Simulation Dataset: Baldewijns et al. [41] created a new fall dataset that is closer to real life. The dataset includes 55 fall videos and 15 nonfall videos in total, with a resolution of 320 × 240. This dataset is more complex than the above two datasets and perfectly simulates fall events in actual circumstances, including some realistic fall scenarios such as the use of a walker, occlusion, and the presence of multiple people. Some keyframes are shown in Fig. 6(c).

Fig. 6. Some keyframes from three public fall datasets. (a)–(c) UR Fall Dataset, Multiple Cameras Fall Dataset, and High-Quality Fall Simulation Dataset, respectively.

B. Implementation Details

All experiments are implemented in Python with the PyTorch framework on an Intel Core i9-12900KF CPU @3.20 GHz with a GeForce 2080 Ti. For parameter optimization, the learning rates for the generator and discriminator are set at 0.0002 and 0.00002, respectively, when the attention U-Net is trained using the Adam method. The size of each frame in the three public datasets is resized to 256 × 256, and the intensity of pixels is normalized to (−1, 1). Furthermore, the number of input frames is set to 4, and the batch size is set to 16. The hyperparameters $\lambda_{\text{appearance}}$, $\lambda_{\text{optical}}$, and $\lambda_{\text{adversarial}}$ in (13) are, respectively, set to 1.0, 1.0, and 0.2 through experiments.
C. Experimental Results

In order to further verify the effectiveness of the attention U-Net, we select several video samples from the three public datasets for testing and compute the corresponding PSNR values between the predicted future frames and the real frames for both fall and nonfall frames. Fig. 7(a) shows the PSNR of video samples in the UR fall dataset. From Fig. 7(a), the PSNR curve moves gently in a straight direction in the first 80 frames and the last 25 frames of the UR video samples, which correspond to nonfall frames, such as walking and lying after falling. It is worth noting that the PSNR curve suddenly decreases and increases from the 80th frame to the 135th frame, which is highlighted with the red box, representing the process of falling. Fig. 7(b) shows the PSNR of video samples in the MCFD. From Fig. 7(b), the PSNR curve moves gently in a straight direction in the first 23 frames and the last 30 frames of the MCFD video samples, which correspond to nonfall frames, such as sitting, walking, standing, and squatting. It is worth noting that the PSNR curve suddenly decreases and increases from the 23rd frame to the 32nd frame, which is highlighted with the red box, representing the process of falling. Fig. 7(c) shows the PSNR of video samples in the high-quality fall simulation dataset. From Fig. 7(c), the PSNR curve moves gently in a straight direction in the first 400 frames and the last 600 frames of the HQFSD video samples, which correspond to nonfall frames, such as walking and lying after falling. It is worth noting that the PSNR curve suddenly decreases and increases from the 400th frame to the 600th frame, which is highlighted with the red box, representing the process of falling.


As we can see from Fig. 7, the nonfall frames usually correspond to higher PSNR values, and the PSNR values of fall frames suddenly decrease and increase when fall events occur, which demonstrates that the attention U-Net can effectively recognize fall behaviors in videos. Furthermore, despite the fact that the lengths of the video samples from the three different public datasets vary greatly, the model can achieve good detection performance, which further illustrates the robustness of the proposed method on different kinds of scenes.

Fig. 7. PSNR curves of video samples from three public datasets. (a)–(c) UR Fall Dataset, Multiple Cameras Fall Dataset, and High-Quality Fall Simulation Dataset, respectively.

D. Ablation Studies

In this section, an ablation study on the public datasets is conducted. The effectiveness of the components in our framework, such as the attention module, the various constraints, and the evaluation measures, is validated in detail. After that, the optimal attention U-Net can be found.

TABLE II
COMPARISON OF U-NET AND ATTENTION U-NET

1) Impact of Attention Module: To demonstrate the applicability of the attention module in fall detection based on predicted future frames, we evaluate the metrics of U-Net and attention U-Net on the UR fall dataset, with respect to accuracy, sensitivity, and specificity. These metrics are popular and common in fall detection, where higher values usually represent better fall recognition performance. Table II shows the comparison of U-Net and attention U-Net. It can be seen that the accuracy, sensitivity, and specificity of attention U-Net are all 100%, which are obviously better than those of U-Net. In terms of accuracy, the performance is improved by 2.9% with attention U-Net. Here, the reason is that the attention module in U-Net can suppress feature responses in irrelevant background regions and preserve relevant activations to generate the predicted future frame. A higher attention coefficient $M^l$ indicates more relevant activations, and conversely, a lower attention coefficient $M^l$ indicates more feature responses in irrelevant background regions. Therefore, attention U-Net can generate high-quality future frames thanks to the attention module, which is beneficial for distinguishing fall events and nonfall events. Furthermore, Table II shows the sizes of the two models, and the model sizes of U-Net and attention U-Net are, respectively, 167 and 166 MB. Although the improvement in model size is not large, it still optimizes the prediction process and reduces resource consumption.
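As a side note, the three metrics reported in Tables II and V follow the standard confusion-matrix definitions, sketched below; this is generic bookkeeping, not code from the paper.

```python
def detection_metrics(tp, tn, fp, fn):
    """Standard definitions of the metrics reported in Tables II and V.

    tp/fn count fall samples detected/missed; tn/fp count nonfall samples
    correctly accepted/falsely flagged.
    """
    sensitivity = tp / (tp + fn)               # recall on fall events
    specificity = tn / (tn + fp)               # recall on nonfall events
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return accuracy, sensitivity, specificity

# Example: 30 falls and 40 nonfalls, all classified correctly.
print(detection_metrics(tp=30, tn=40, fp=0, fn=0))   # (1.0, 1.0, 1.0)
```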
2) Impact of Different Constraints: To verify and analyze the impact of the different constraints on fall detection, we conduct ablation experiments by combining different loss functions on the UR fall dataset with respect to the area under the curve (AUC) value and the PSNR variance. Fig. 8 shows the PSNR variance of different constraints on the UR fall dataset.

Fig. 8. PSNR variance of different constraints on the UR fall dataset.

As shown in Fig. 8, the blue column represents the average PSNR variance of predicted fall frames, the red column represents the average PSNR variance of predicted nonfall frames, and the green column represents the gap between the PSNR variance of predicted nonfall frames and fall frames. It is obvious that the PSNR variance of fall frames is bigger than that of nonfall frames, which can be utilized to discriminate between fall events and nonfall events.


TABLE III
GAP OF PSNR VARIANCES AND THE AUC VALUE WITH DIFFERENT CONSTRAINTS

It is also worth noting that the gap of PSNR variances varies with the number of loss constraints; more constraints generate a bigger gap of PSNR variances. Furthermore, Table III shows the gap of PSNR variances and the AUC values with different constraints. From Table III, the AUC value and the gap of PSNR variances vary with the number of loss constraints, and a bigger gap of PSNR variances corresponds to a bigger AUC value. Furthermore, it is worth noting that the AUC values with and without motion consistency are, respectively, 99.2% and 98.1%. Here, the reason is that the motion constraint can enforce the motion consistency between the predicted frame and the real frame in the training phase, so the difference between the predicted frame and the real frame can be enlarged. Synthetically, the combination of the appearance constraints, motion constraint, and adversarial constraint is significant to enhance the quality of the predicted future frame, which can further improve fall detection performance.

TABLE IV
COMPARISON OF DETECTION PERFORMANCE WITH DIFFERENT MEASURES

3) Impact of PSNR and MSE: In order to explore whether the fall score based on PSNR has a better performance, we have verified and analyzed the common metrics, with respect to accuracy, on the three public datasets. Table IV shows the comparison of detection performance with fall scores based on PSNR and MSE. It is obvious that all the metrics based on PSNR are better than those based on MSE. In particular, compared to the fall score with MSE, the accuracy of the fall score with PSNR can reach 100% on the UR fall dataset and MCFD. This is mainly because the degree of visual distortion of the image may be large even when the MSE values are equal, which will influence the detection performance. Accordingly, PSNR is utilized to compute fall scores to recognize fall behaviors in our method.

TABLE V
COMPARISON OF THE PROPOSED METHOD WITH EXISTING METHODS ON THREE PUBLIC FALL DATASETS

E. Comparison With Other State-of-the-Art Methods

Finally, we conduct experiments on these three public datasets to demonstrate the fall recognition performance of our proposed method. As shown in Table V, the sensitivity and specificity are compared with those of existing state-of-the-art methods on the different datasets. From Table V, it is worth mentioning that our proposed method can achieve a sensitivity of 100% and a specificity of 100% on the MCFD, which verifies that our model has the best performance compared with the reported methods, such as human shape deformation [42], Grassmann manifold [20], and CNN [23], [46], [49]. Furthermore, our proposed method can also achieve a sensitivity of 100% and a specificity of 100% on the UR fall dataset. It is obvious that the performance of our model is better than that of the reported methods, such as unsupervised learning methods based on reconstruction [32], CNN [50], and the weakly supervised learning-based dual-modal network [51]. Again, we notice that our proposed method can achieve a sensitivity of 68.4% and a specificity of 78.7% on the high-quality fall simulation dataset. Even though our proposed method cannot achieve high performance on this more complex dataset, it obtains competitive results compared with the reported methods, including [44] and [46]. These results on three public datasets demonstrate that our proposed method is able to recognize fall events in various scenes.


TABLE VI
COMPARISON OF EXECUTION TIME ON THE MULTIPLE CAMERAS FALL DATASET

F. Processing Time

Our framework is implemented with an NVIDIA GeForce 2080 Ti and PyTorch. We analyze the training time and testing time of our proposed method on the multiple cameras fall dataset. In the training phase, training an attention U-Net detection model needs about 3.5 h. In the testing phase, our proposed method can reach 42 frames per second (FPS). Table VI shows the comparison of execution time on the MCFD. Compared with other methods, it is obvious that our method performs better. These experimental results mean that our proposed method can be applied for real-time detection.

G. Discussion on the Limitation

Despite the fact that our method can achieve excellent results on datasets such as the UR fall dataset and the MCFD, it has some limitations on relatively more complex datasets. In complex scenarios, if there are several people or people using a walking aid in the video, these moving individuals can generate additional optical flows, which will influence the prediction of the person who is really falling, further leading to erroneous identification. In addition, our proposed method may be ineffective in scenarios with extreme occlusions, such as a falling person being occluded by some furniture. In this instance, it is difficult to evaluate the optical flows caused by the falling person since they are occluded. As a result, a minor prediction error will be yielded, and then, a fall event may be recognized as a nonfall event.

V. CONCLUSION

In this article, we propose a novel efficient fall detection method based on future frame prediction. Specifically, the attention U-Net is adopted as the basic prediction network for predicting the next future frame, and the attention gate can solve the problem that similar low-level features are repeatedly extracted in the U-Net. To generate a more realistic future frame, appearance constraints on intensity and gradient and a motion constraint are combined to train the attention U-Net. In this way, nonfall events can be generated well in terms of motion and appearance, and the difference between the predicted frame and the real frame is larger for fall events, which is beneficial for recognizing anomalies. On a number of relatively simple datasets, our proposed method can achieve high detection performance, such as a sensitivity of 100% and a specificity of 100% on the UR fall dataset and the multiple cameras fall dataset. Furthermore, our proposed method can also achieve competitive results on the high-quality fall simulation dataset, which is more closely related to real-world scenarios. The experimental results demonstrate that the attention U-Net model can actually accomplish accurate recognition, outperforming other existing state-of-the-art methods, including methods based on traditional features and methods based on deep features. In the future, we will substantially concentrate on occluded scenarios and multiperson scenarios.

REFERENCES

[1] L. Yang, Y. Ren, H. Hu, and B. Tian, “New fast fall detection method based on spatio-temporal context tracking of head by using depth images,” Sensors, vol. 15, no. 9, pp. 23004–23019, Sep. 2015.
[2] X. Gao, Z. Chen, S. Tang, Y. Zhang, and J. Li, “Adaptive weighted imbalance learning with application to abnormal activity recognition,” Neurocomputing, vol. 173, pp. 1927–1935, Jan. 2016.
[3] A. Abobakr, M. Hossny, and S. Nahavandi, “A skeleton-free fall detection system from depth images using random decision forest,” IEEE Syst. J., vol. 12, no. 3, pp. 2994–3005, Sep. 2018.
[4] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Commun. ACM, vol. 60, no. 6, pp. 1097–1105, Jun. 2017.
[6] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” Comput. Sci., vol. 1, no. 1, pp. 1409–1556, Sep. 2014.
[7] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[8] C. Szegedy et al., “Going deeper with convolutions,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[9] M. Ribeiro, A. E. Lazzaretti, and H. S. Lopes, “A study of deep convolutional auto-encoders for anomaly detection in videos,” Pattern Recognit. Lett., vol. 105, pp. 13–22, Apr. 2018.
[10] T. Li, X. Chen, F. Zhu, Z. Zhang, and H. Yan, “Two-stream deep spatial–temporal auto-encoder for surveillance video abnormal event detection,” Neurocomputing, vol. 439, pp. 256–270, Jun. 2021.
[11] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904–1916, Sep. 2015.
[12] C. Ledig et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 105–114.
[13] M. Mubashir, L. Shao, and L. Seed, “A survey on fall detection: Principles and approaches,” Neurocomputing, vol. 100, pp. 144–152, Jan. 2013.
[14] M. Fáñez, J. R. Villar, E. de la Cal, V. M. González, J. Sedano, and S. B. Khojasteh, “Mixing user-centered and generalized models for fall detection,” Neurocomputing, vol. 452, pp. 473–486, Sep. 2021.
[15] J. R. Villar, C. Chira, E. de la Cal, V. M. González, J. Sedano, and S. B. Khojasteh, “Autonomous on-wrist acceleration-based fall detection systems: Unsolved challenges,” Neurocomputing, vol. 452, pp. 404–413, Sep. 2021.
[16] C.-L. Liu, C.-H. Lee, and P.-M. Lin, “A fall detection system using k-nearest neighbor classifier,” Expert Syst. Appl., vol. 37, no. 10, pp. 7174–7181, Oct. 2010.
[17] B. Mirmahboub, S. Samavi, N. Karimi, and S. Shirani, “Automatic monocular system for human fall detection based on variations in silhouette area,” IEEE Trans. Biomed. Eng., vol. 60, no. 2, pp. 427–436, Feb. 2013.
[18] R. Cucchiara, C. Grana, A. Prati, and R. Vezzani, “Probabilistic posture classification for human-behavior analysis,” IEEE Trans. Syst., Man, Cybern. A, Syst. Humans, vol. 35, no. 1, pp. 42–54, Jan. 2005.
[19] A. Iazzi, M. Rziza, and R. O. H. Thami, “Efficient fall activity recognition by combining shape and motion features,” Comput. Vis. Media, vol. 6, no. 3, pp. 247–263, Sep. 2020.
[20] P. Soni and A. Choudhary, “Grassmann manifold based framework for automated fall detection from a camera,” Image Vis. Comput., vol. 122, no. 1, Jun. 2022, Art. no. 104431.
[21] C. Yao, J. Hu, W. Min, Z. Deng, S. Zou, and W. Min, “A novel real-time fall detection method based on head segmentation and convolutional neural network,” J. Real-Time Image Process., vol. 17, no. 6, pp. 1939–1949, Jun. 2020.
[22] A. Núñez-Marcos, G. Azkune, and I. Arganda-Carreras, “Vision-based fall detection with convolutional neural networks,” Wireless Commun. Mobile Comput., vol. 2017, pp. 1–16, Jan. 2017.


[23] Q. Feng, C. Gao, L. Wang, Y. Zhao, T. Song, and Q. Li, “Spatio-temporal fall event detection in complex scenes using attention guided LSTM,” Pattern Recognit. Lett., vol. 130, pp. 242–249, Feb. 2020.
[24] N. Lu, Y. Wu, L. Feng, and J. Song, “Deep learning for fall detection: Three-dimensional CNN combined with LSTM on video kinematic data,” IEEE J. Biomed. Health Informat., vol. 23, no. 1, pp. 314–323, Jan. 2019.
[25] C. Khraief, F. Benzarti, and H. Amiri, “Elderly fall detection based on multi-stream deep convolutional networks,” Multimedia Tools Appl., vol. 79, nos. 27–28, pp. 19537–19560, Mar. 2020.
[26] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, “Learning temporal regularity in video sequences,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[27] X. Cai, S. Li, X. Liu, and G. Han, “Vision-based fall detection with multi-task hourglass convolutional auto-encoder,” IEEE Access, vol. 8, pp. 44493–44502, 2020.
[28] J. Zhou and T. Komuro, “Recognizing fall actions from videos using reconstruction error of variational autoencoder,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2019, pp. 3372–3376.
[29] S. Khan, J. Nogas, and A. Mihailidis, “Spatio-temporal adversarial learning for detecting unseen falls,” Pattern Anal. Appl., vol. 24, no. 1, pp. 191–381, Mar. 2021.
[30] C. Fan and F. Gao, “A new approach for smoking event detection using a variational autoencoder and neural decision forest,” IEEE Access, vol. 8, pp. 120835–120849, 2020.
[31] C. Sun, Y. Jia, H. Song, and Y. Wu, “Adversarial 3D convolutional auto-encoder for abnormal event detection in videos,” IEEE Trans. Multimedia, vol. 23, pp. 3292–3305, 2021.
[32] S. Li, X. Song, S. Xu, H. Qi, and Y. Xue, “Dilated spatial–temporal convolutional auto-encoders for human fall detection in surveillance videos,” ICT Express, Jul. 2022, doi: 10.1016/j.icte.2022.07.003.
[33] W. Luo, W. Liu, and S. Gao, “Normal graph: Spatial temporal graph convolutional networks based prediction network for skeleton based video anomaly detection,” Neurocomputing, vol. 444, no. 1, pp. 322–337, Nov. 2022.
[34] D. Chen, P. Wang, L. Yue, Y. Zhang, and T. Jia, “Anomaly detection in surveillance video based on bidirectional prediction,” Image Vis. Comput., vol. 98, no. 1, pp. 1–8, Jun. 2020.
[35] X. Wang et al., “Robust unsupervised video anomaly detection by multipath frame prediction,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 6, pp. 2301–2312, Jun. 2022.
[36] N. Li, F. Chang, and C. Liu, “Spatial–temporal cascade autoencoder for video anomaly detection in crowded scenes,” IEEE Trans. Multimedia, vol. 23, pp. 203–215, 2021.
[37] A. Dosovitskiy et al., “FlowNet: Learning optical flow with convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2758–2766.
[38] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2016, pp. 1–14.
[39] B. Kwolek and M. Kepski, “Human fall detection on embedded platform using depth maps and wireless accelerometer,” Comput. Methods Programs Biomed., vol. 117, no. 3, pp. 489–501, Dec. 2014.
[40] E. Auvinet, F. Multon, A. Saint-Arnaud, J. Rousseau, and J. Meunier, “Fall detection with multiple cameras: An occlusion-resistant method based on 3-D silhouette vertical distribution,” IEEE Trans. Inf. Technol. Biomed., vol. 15, no. 2, pp. 290–300, Mar. 2011.
[41] G. Baldewijns, G. Debard, G. Mertes, B. Vanrumste, and T. Croonenborghs, “Bridging the gap between real-life data and simulated data by providing a highly realistic fall dataset for evaluating camera-based fall detection algorithms,” Healthcare Technol. Lett., vol. 3, no. 1, pp. 6–11, Mar. 2016.
[42] C. Rougier, J. Meunier, A. St-Arnaud, and J. Rousseau, “Robust video surveillance for fall detection based on human shape deformation,” IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 5, pp. 611–622, May 2011.
[43] W. Feng, R. Liu, and M. Zhu, “Fall detection for elderly person care in a vision-based home surveillance environment using a monocular camera,” Signal, Image Video Process., vol. 8, no. 6, pp. 1129–1138, May 2014.
[44] G. Debard et al., “Camera-based fall detection using real-world versus simulated data: How far are we from the solution?” J. Ambient Intell. Smart Environ., vol. 8, no. 2, pp. 149–168, Mar. 2016.
[45] X. Ma, H. Wang, B. Xue, M. Zhou, B. Ji, and Y. Li, “Depth-based human fall detection via shape features and improved extreme learning machine,” IEEE J. Biomed. Health Informat., vol. 18, no. 6, pp. 1915–1922, Nov. 2014.
[46] Y. Fan, M. D. Levine, G. Wen, and S. Qiu, “A deep neural network for real-time detection of falling humans in naturally occurring scenes,” Neurocomputing, vol. 260, pp. 43–58, Oct. 2017.
[47] Y. Yun and I. Y.-H. Gu, “Human fall detection in videos by fusing statistical features of shape and motion dynamics on Riemannian manifolds,” Neurocomputing, vol. 207, pp. 726–734, Sep. 2016.
[48] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., Munich, Germany, 2015, pp. 234–241.
[49] C. Vishnu, R. Datla, D. Roy, S. Babu, and C. K. Mohan, “Human fall detection in surveillance videos using fall motion vector modeling,” IEEE Sensors J., vol. 21, no. 15, pp. 17162–17170, Aug. 2021.
[50] Y. Chen, W. Li, L. Wang, J. Hu, and M. Ye, “Vision-based fall event detection in complex background using attention guided bi-directional LSTM,” IEEE Access, vol. 8, pp. 161337–161348, 2020.
[51] L. Wu et al., “Robust fall detection in video surveillance based on weakly supervised learning,” Neural Netw., vol. 163, pp. 286–297, Jun. 2023.
[52] J. Clemente, F. Li, M. Valero, and W. Song, “Smart seismic sensing for indoor fall detection, location, and notification,” IEEE J. Biomed. Health Informat., vol. 24, no. 2, pp. 524–532, Feb. 2020.
[53] R. Jain and V. B. Semwal, “A novel feature extraction method for preimpact fall detection system using deep learning and wearable sensors,” IEEE Sensors J., vol. 22, no. 23, pp. 22943–22951, Dec. 2022.
[54] F. A. S. F. de Sousa, C. Escriba, E. G. A. Bravo, V. Brossa, J. Fourniols, and C. Rossi, “Wearable pre-impact fall detection system based on 3D accelerometer and subject’s height,” IEEE Sensors J., vol. 22, no. 2, pp. 1738–1745, Jan. 2022.
[55] Z. Liu, M. Yang, Y. Yuan, and K. Y. Chan, “Fall detection and personnel tracking system using infrared array sensors,” IEEE Sensors J., vol. 20, no. 16, pp. 9558–9566, Aug. 2020.
[56] E. Alam, A. Sufian, P. Dutta, and M. Leo, “Vision-based human fall detection systems using deep learning: A review,” Comput. Biol. Med., vol. 146, no. 1, pp. 1–22, Jul. 2022.
[57] X. Xiong, W. Min, W.-S. Zheng, P. Liao, H. Yang, and S. Wang, “S3D-CNN: Skeleton-based 3D consecutive-low-pooling neural network for fall detection,” Int. J. Speech Technol., vol. 50, no. 10, pp. 3521–3534, Jun. 2020.

Suyuan Li received the B.S. degree from Liaoning Normal University, Dalian, China, in 2017, and the M.E. degree in electronics and communication engineering from Northeastern University, Shenyang, China, in 2020, where he is currently pursuing the Ph.D. degree in information and communication engineering. His research interests include fall detection based on deep learning and image processing.

Xin Song was born in Jilin, China, in 1978. She received the Ph.D. degree in communication and information system from Northeastern University, Shenyang, China, in 2008. She is now working as a Teacher with Northeastern University. Her research interests are in the area of robust adaptive beamforming and wireless communication.

