1 s2.0 S0925231223006847 Main
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
A R T I C L E I N F O

Communicated by Zidong Wang

Keywords:
Abnormal event detection
Pose estimation
Optical flow
Object detection
Graph convolutional neural network
Adversarial learning
Data fusion

A B S T R A C T

Abnormal event detection is a critical component of intelligent surveillance systems, focusing on identifying abnormal objects or unusual human behaviours in video sequences. However, conventional methods struggle due to the scarcity of labelled data. Existing solutions typically train on normal data, establish boundaries for regular events, and identify outliers during testing. These approaches are often inadequate as they do not efficiently leverage the geometry and image texture information, and they lack a specific focus on different types of abnormal events. This paper introduces a novel two-stream fusion algorithm for abnormal event detection to address these diverse abnormal events better. We first extract the object, pose, and optical flow features. Then, the object and pose information is combined early on to eliminate occluded pose graphs. The trusted pose graphs are fed into a Spatio-Temporal Graph Convolutional Network (ST-GCN) to detect abnormal behaviours. Simultaneously, we propose a video prediction framework that identifies abnormal frames by measuring the difference between predicted and ground truth frames. Lastly, we execute a decision-level fusion between the classification and prediction streams to achieve the final results. Our results on the UCSD PED1 dataset indicate the enhanced performance of the fusion model for various abnormal events. Furthermore, experimental results on the UCSD PED2 dataset and the ShanghaiTech campus dataset underscore our approach’s effectiveness compared to other related works.
* Corresponding author.
https://doi.org/10.1016/j.neucom.2023.126561
Received 20 March 2023; Received in revised form 11 June 2023; Accepted 8 July 2023
Available online 18 July 2023
0925-2312/© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-
nc-nd/4.0/).
Y. Yang et al. Neurocomputing 553 (2023) 126561
Fig. 1. (a): The overall architecture of the proposed two-stream fusion framework. The raw images are processed through three models: object detection, pose
estimation, and optical flow calculation. These processes extract object class information, human-related pose, and optical flow results, respectively. There are two
concurrent streams: (b) action-based classification and (c) motion-based prediction. In the action-based classification stream (b), the object detector, the YOLOv3
model, primarily classifies the detected objects into four categories: pedestrian, bicycle, skateboard, and car [9]. Frames containing detected pedestrians and pose
information are then fed into the Spatio-Temporal Graph Convolutional Network (ST-GCN) [14] to capture the spatial and temporal features of regular body joints. The
extracted features are combined into a latent vector and further processed via a feature clustering step to output the normality scores. Concurrently, in the
motion-based prediction stream (c), raw images undergo adversarial training to predict the subsequent frame based on historical trajectories. The optical flow between
each predicted frame and the preceding ground-truth frame is then computed and compared with the optical flow between the corresponding ground-truth frames. By combining the motion and appearance
loss, we calculate the normality scores of the frames in the test clip. Finally, the results from all features are fused at the decision level to produce the final normality
score. Best viewed in colour.
driver is occluded by a car or truck, and the target information cannot be extracted by feature extraction models.

Another increasing trend in the AED framework is based on reconstruction or future prediction models [16–21]. Reconstruction-based algorithms generally train autoencoders using normal data, with abnormal frames distinguished in video sequences due to their higher reconstruction errors compared to normal frames. During inference, abnormal frames are identified based on large prediction errors, as these events rarely occur during training and thus have different probability distributions compared to the training data. Other prediction-based AED algorithms [22–24] use raw RGB data or optical flow features to train a network and predict the next frame based on previous frames. The strength of these algorithms lies in their ability to leverage the spatial and temporal appearance and motion information, enabling them to detect a variety of abnormal events such as unusual behaviours like chasing, jumping, and fighting and abnormal object classifications like skateboard, bicycle, and car. Furthermore, since the training data is normal, there is no need for labelling, which significantly reduces labour costs. However, these algorithms are challenging to train well and are not sensitive to small abnormal classes.

This paper proposes a unified AED framework by fusing multiple features to detect different types of video-based abnormal events, as shown in Fig. 1. Our framework consists of two parallel processing branches. The first stream extracts class information to make preliminary classifications for pedestrians and other objects by the object detector YOLOv3 [25]. Then a pose estimation algorithm is used to extract body pose information. After initial fusion with class information, the reduced pose information is converted into high-level features, further processed by a graph convolutional neural network. The first integrated pose and class based stream can improve the accuracy of pose-level detection results and efficiently detect non-human-related abnormal events. The second stream extracts optical flow results by Flownet [26]. Then, raw RGB data and optical flow features are processed by U-net as a generator. Moreover, a discriminator is used to predict whether the next frame is real or fake, which is the theory of adversarial learning. The second stream can compensate for the first stream’s limitations and enrich appearance and motion description, which has a positive impact on abnormal event detection with the crowded environment and small objects. The normality scores of tested data from the two streams would do a final fusion to decide whether they are normal or abnormal. Extensive experiments are provided on the UCSD PED1 & PED2 [8], ShanghaiTech Campus [22] and CUHK Avenue [27] datasets to highlight the efficacy and robustness of the proposed system.

We demonstrate that multiple features can efficiently boost RGB-based networks for classification and prediction. On the other hand, raw RGB data and calculated optical flow results can compensate for pose data in motion and appearance features. The contributions of this work are mainly:

1. We propose a unified classification and prediction fusion framework designed to detect various types of abnormal events in surveillance videos.
2. By incorporating motion features, our AED algorithm shows increased sensitivity towards abnormal events, even in crowded environments.
3. Our extensive experiments demonstrate the effectiveness and robustness of the proposed framework in AED.
4. We analyze the types of abnormal events in several public AED datasets and evaluate the accuracy of our method in identifying these abnormal events.
Fig. 2. Some typical samples after object detection and pose estimation. Different abnormal objects: bicycles, trucks and skateboards are shown. Best viewed
in colour.
This paper introduces a joint fusion approach for AED utilizing comprehensive pose, class, and motion information. Preliminary aspects of this work, primarily addressing non-human-related AED through the fusion of pose and class information, have previously been presented in [9]. We further expand upon this initial research by conducting extensive experiments on several public datasets to compare our proposed detection method with recent state-of-the-art works. This work also includes an ablation study on different contribution components and abnormal event types, providing a more holistic and in-depth analysis of our approach.

2. Related Work

2.1. Image-based and Pose-based AED

Most State-of-the-art AED methods rely on RGB data to detect strange or rare events in surveillance video sequences. Sultani et al. [17] proposed a frame-level anomaly detection in a supervised manner. Video sequences are divided into a fixed number of segments. Multiple instances learning (MIL) algorithm defines abnormal video sequences, and a deep MIL ranking model is trained to detect anomalies. However, this method is inefficient since labelling video clips is time-consuming. Reconstruction methods are more prevalent in the recent AED work since these autoencoders (AE)-based models are trained in semi-supervised learning [19,28]. Training data are normal, and testing samples simultaneously contain normal and abnormal events. Raw RGB data are trained to generate the latent vectors of normal samples. In testing, video sequences similar to the training data will be reconstructed with low error. In contrast, strange actions or rare events which deviate from normal data are expected to be reconstructed with higher error. Hasan et al. [19] proposed a combination of CNN-based autoencoders to model the temporal evolution of HOG-HOF features. Sabokrou et al. [28] proposed an adversarially learned one-class classifier for novelty detection. An autoencoder model is trained as a generator. At the same time, a sequence of convolution layers is trained to distinguish the novel or outlier sample as a discriminator. After adversarial learning, the testing samples with normal events are enhanced, and those with outliers are distorted as expected. These methods distinguish abnormal frames due to reconstruction errors and can not provide insight into the types of abnormal events. In this paper, our fusion model will focus on the binary classification of normal and abnormal events and analyze the different types of abnormal events. [23,29,30] provide other recent approaches where attention-based methods are used for the original data. These methods split the input images into the foreground and background images and only focus on the motion area. However, these methods are not sensitive to cluttered or crowded background scenes.

Pose-based AED methods are proposed with the development of pose estimation and graph neural networks. In [13], human poses are represented as pose graphs. The graph information is encoded in a spatial and temporal graph convolution neural network and clustered into a latent vector. The Dirichlet process mixture model (DPMM) is used to analyze the distribution of these vectors and distinguish the actions represented by these vectors. Similarly, in [31], a pose-based human activity anomaly detection algorithm is presented, leveraging the MPED-RNN (single encoder dual decoder) architecture. Recently, Zeng et al. [24] proposed a joint pose-based classification and prediction architecture for anomaly detection. High-level graph features encode the trajectories of people and the interactions between multiple persons, while low-level graph features encode the body postures and the represented actions. Pose-based AED methods can effectively reduce the computational cost of the training since the input is the pose graphs and not the raw images. However, the algorithms in [31,24] do not consider non-human-related abnormal events. Since all pose-based methods need to extract pose features for action recognition or prediction, the pose estimation model plays a crucial role in AED work. It is significant to not only extract all keypoints of each person precisely but also need to identify multiple targets in video sequences. That is why this step is called pose estimation rather than pose extraction. Thus, low-resolution datasets are hard to estimate the pose information and are unfriendly to pose-based AED algorithms. Recently, it is attention-grabbing to utilize object detectors to acquire prior knowledge from different types of images, such as RGB, infrared, depth, and remote sensing images [32,33]. Our proposed algorithm also adds object detection to talk about non-human-related abnormal events like driving and makes an early fusion between the pose information and the results of object detection to enhance the localization and identifications of detected persons.

2.2. The Data Fusion Mechanism

Deep learning in a single-domain dataset has been successful. Data fusion is used to extract and mix information from multiple domains or different sensors [34–37]. Compared with individual modalities, the fused data would have a richer representation and performance and are robust to the given tasks. Thus, data fusion is a practical solution to several fields like medicine, business and driverless technology [38]. Traditional data fusion methods have three techniques: early, intermediate, and late fusion. Early fusion is also called data-level fusion, which means multiple data are fused before conducting the data analysis model. Intermediate fusion is the most flexible method in which multiple data can be fused at different stages of model training. Late fusion methods allow data to be fused after multiple model decisions. One of AED’s challenges is that the types of abnormal events in video surveillance are various: Abnormal objects, such as cars, bicycles and skateboards; Abnormal positions, such as walking on the grass and driving on the sidewalk; Abnormal behaviours, such as shooting, fighting and chasing. Data fusion methods for AED can robustly mix multiple features such as pose and optical flow to detect different types of abnormal events.

3. Proposed Framework

3.1. Pose-driven Action Classification Model

3.1.1. Object Detection and Pose Estimation
A pre-trained object detector YOLOv3 model is conducted to capture the targets’ classes, localization information and corresponding confidence scores, shown in Fig. 2. We set the video sequences as V. For each detected objects, classes are represented as C, the localization
objects in the traffic surveillance system. In the m-th frame, if there are K
objects detected by the object detector and J human poses extracted by
pose estimation:
$$G_T^m = \left( G_J^m \cap V_K^m \right) \tag{3}$$

There are two parts in the early data-level fusion step: One is to strengthen the confidence of detected persons, and the other is to remove some occluded pose graphs during pose estimation. We set $B \in$ [bicycle, skateboard, car, truck, person] and $F_k^m$ is the confidence of the $k$-th object in the $m$-th frame.

$$S_k^m = \begin{cases} V_k^m & \text{if } C_k^m \in B \ \&\ F_k^m > \varepsilon \\ 0 & \text{otherwise} \end{cases} \tag{4}$$
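The class-and-confidence test of Eq. (4) can be illustrated with a short Python sketch. The `Detection` container, its field names and the default threshold `eps=0.5` are assumptions made purely for illustration (the text does not give a value for ε), and `None` stands in for the zero branch of the equation.

```python
from dataclasses import dataclass

# Hypothetical container for one detected object in frame m.
@dataclass
class Detection:
    label: str         # class C_k^m
    confidence: float  # confidence F_k^m
    box: tuple         # localisation V_k^m as (x1, y1, x2, y2)

# The trusted class set B named in the text.
TRUSTED_CLASSES = {"bicycle", "skateboard", "car", "truck", "person"}

def filter_detections(detections, eps=0.5):
    """Keep a box only if its class is in B and its confidence exceeds eps."""
    kept = []
    for det in detections:
        if det.label in TRUSTED_CLASSES and det.confidence > eps:
            kept.append(det.box)
        else:
            kept.append(None)  # zero branch of Eq. (4)
    return kept
```

Under these assumptions, a low-confidence car or an object outside B is discarded before its region can influence the pose graphs of the same frame.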
Fig. 3. The body joints information for human behaviours: the red box represents cycling and the black box represents walking. Best viewed in colour.
pose results are crucial when classifying human activities. If the human’s body is partly occluded by other objects or blurred due to fast-moving, the extracted pose information will result in a lousy performance in AED work. Thus, we introduce class information to strengthen the human pose information and distort the influence of other familiar

where $\sigma$ is an activation function. $W_0^m$, $W_1^m$ are two trainable weights, and $\gamma_{j,k}$ is a normalization variable. The adjacent matrix $A_{p,q}$ means the weights of the connection between $\mu_p^m$ and $\mu_q^m$. A fixed matrix A is set and
used for all frames, and the model is implemented with the following formula [41]:

$$\mathrm{GCN}(N^m) = \Lambda^{-\frac{1}{2}} (A + I) \Lambda^{-\frac{1}{2}} N^m W^m \tag{6}$$

where $\Lambda$ is the degree matrix and $I$ is the identity matrix representing self-connections. After the GCN implementation between the neighbouring keypoints in the spatial domain, the pose graph relationships between time $t$ and $t'$ with the same identification should be considered. The next step is analysing and clustering the spatio-temporal features from the neural network.

3.1.4. Deep Embedded Clustering
A hand-crafted action cluster is set, which contains K actions, such as hand waving, jumping, running, chasing, and falling. Supposing the features extracted from ST-GCN [14] are ${G'}^m$, the probability that pose graph $y$ belongs to cluster $k$ is represented as:

$$p_{mk} = P(y = k) = \frac{\exp(\Theta_k {G'}^m)}{\sum_{k'=1}^{K} \exp(\Theta_{k'} {G'}^m)} \tag{7}$$

where $\Theta$ are the parameters of the clustering model. The main task of the clustering model is to construct the probability distribution of normal samples. As mentioned above, the ST-GCN [14] is trained semi-supervised, which means the training data are all normal events. Furthermore, optimising the training minimises the Kullback–Leibler (KL) divergence between the clustering probability distribution P and the distribution of the detected objects Q. The loss function of the clustering model is:

$$L_c = KL(Q \,\|\, P) = \sum_a \sum_b q_{ab} \log \frac{q_{ab}}{p_{ab}} \tag{8}$$

where a, b mean the a-th pose graph sample assigned to b-th cluster. $p_{ab}$ is the detected classification probability, and $q_{ab}$ is the true target probability.

3.2. Motion-based Prediction Model

The enhanced action-based classification stream is proposed in both spatial and time domains. The latent vector is represented with a fully connected layer. A deep clustering model follows to build a dictionary for human behaviours [42]. However, there are still some problems with the classification stream. The pose-level algorithm relies heavily on pose estimation results and cannot capture abnormal behaviours with tiny movement changes, such as skiing. Inspired by [22], we added a future frame prediction model as the motion stream into our framework. The prediction stream leverages adversarial learning with the U-net network as the generator to discriminate the normality of videos at the frame level. Optical flow constraints of each frame are also calculated to enhance the motion information in the temporal domain. The theory of the next frame prediction algorithm for AED is to generate the next frame according to historical trajectory and compare it with the ground truth. Since the training data are all normal, the frames containing abnormal events would have larger differences than normal frames in the testing step. The optimisation of training is to decrease intensity loss, gradient loss, adversarial training loss and optical flow loss. Supposing the m-th frame in the video is $V^m$, and the predicted m-th frame is $\hat{V}^m$.

3.2.1. Loss Functions
To ensure the predicted frame is close to its ground truth, we need to reduce the distance in intensity and gradient. The intensity loss makes the pixels of the generated frame close to the ground truth at the RGB level. The formula for intensity loss is as follows:

$$L_i = \sum_{m=1}^{M} \left\| \hat{V}^m - V^m \right\|_2^2 \tag{9}$$

Although the intensity loss can capture the major pixel changes in images, the generated prediction frames would be blurred if only intensity loss were applied. Thus, the gradient loss is added to the generator training loss function to sharpen the predicted results. The formula is shown below:

$$L_g = \sum_{m=1}^{M} \sum_{i,j} \left\| \left| \hat{V}^m_{i,j} - \hat{V}^m_{i-1,j} \right| - \left| V^m_{i,j} - V^m_{i-1,j} \right| \right\|_1 + \left\| \left| \hat{V}^m_{i,j} - \hat{V}^m_{i,j-1} \right| - \left| V^m_{i,j} - V^m_{i,j-1} \right| \right\|_1 \tag{10}$$

where i and j are the pixel indices of the frame.

A CNN-based optical flow estimation algorithm, Flownet [26], is used to track the multiple moving targets. In the prediction stream, the optical flow results can strengthen the relationship of moving objects in the temporal domain and are also sensitive to the movement of small objects. Here $f$ denotes the Flownet function, which is pre-trained with fixed weights and parameters. The optical flow loss is shown below:

$$L_f = \left\| f(\hat{V}^{m+1}, V^m) - f(V^{m+1}, V^m) \right\|_1 \tag{11}$$

3.2.2. Adversarial training
In the prediction stream, we used a U-net based autoencoder as a generator and a patch-level discriminator. U-net [43] can reduce the training samples and improve the reconstruction performance. The generative adversarial network is also efficient for image reconstruction. The optimization of the generator is to reduce the intensity, gradient, optical flow and adversarial learning loss. The task of the generator is to generate a fake sample corresponding to the probability distribution of real data. Since the neural network is trained semi-supervised, which means there are only normal samples in the training data, the loss of the adversarial learning reduces the distance between the generated samples and class 1, as shown in Eq. 12:

$$L_a = \sum_m \sum_{i,j} \left( D(\hat{V})_{i,j} - 1 \right)^2 \tag{12}$$

Finally, the overall generator loss is obtained after assigning the same set of weights to each set of losses. Algorithm 2 specifically describes the generator loss calculation process. The task of the discriminator is to distinguish real data from fake data. The loss function of the discriminator is illustrated in Eq. 13:

$$L_d = \sum_m \sum_{i,j} \left( \left( D(\hat{V})_{i,j} - 0 \right)^2 + \left( D(V)_{i,j} - 1 \right)^2 \right) \tag{13}$$

In traditional semi-supervised learning, the model is trained with a large amount of unlabelled data and a small amount of labelled data. However, in the abnormal event detection scenario, another form of semi-supervised learning, often called one-class classification, is used. The model is trained exclusively with normal data (labelled), and it is then expected to detect instances in the test data, which contains both normal and abnormal instances, that deviate significantly from this normal behaviour. In the training settings, the parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ in
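As a rough illustration of the intensity, gradient and adversarial terms in Eqs. (9), (10) and (12), the following pure-Python sketch evaluates them on small greyscale clips stored as nested lists (frames, then rows, then columns). The function names and the data layout are hypothetical and chosen only for readability. The optical flow loss of Eq. (11) is omitted because it needs a pre-trained Flownet model, and a practical implementation would use a tensor library rather than Python loops.

```python
def intensity_loss(pred, true):
    """Eq. (9): squared L2 distance, summed over all frames and pixels."""
    return sum(
        (p - t) ** 2
        for pred_frame, true_frame in zip(pred, true)
        for pred_row, true_row in zip(pred_frame, true_frame)
        for p, t in zip(pred_row, true_row)
    )

def gradient_loss(pred, true):
    """Eq. (10): L1 distance between absolute vertical and horizontal differences."""
    total = 0.0
    for pf, tf in zip(pred, true):
        height, width = len(pf), len(pf[0])
        for i in range(height):
            for j in range(width):
                if i > 0:  # neighbour (i-1, j)
                    total += abs(abs(pf[i][j] - pf[i - 1][j])
                                 - abs(tf[i][j] - tf[i - 1][j]))
                if j > 0:  # neighbour (i, j-1)
                    total += abs(abs(pf[i][j] - pf[i][j - 1])
                                 - abs(tf[i][j] - tf[i][j - 1]))
    return total

def adversarial_loss(disc_scores):
    """Eq. (12): push every patch score of the discriminator towards class 1."""
    return sum((d - 1.0) ** 2 for row in disc_scores for d in row)
```

A single bright pixel that is predicted but absent from the ground truth raises both the intensity and the gradient term, which is the behaviour the two losses are meant to penalise together.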
$$\mathrm{PSNR}(V, \hat{V}) = 10 \log_{10} \frac{(V_{\max})^2}{\frac{1}{T} \sum_{t=1}^{T} (V - \hat{V})^2} \tag{16}$$

Higher PSNR indicates a less likely abnormal event in the testing frames. Then, the normality scores from the action classification stream $N_A$ and the motion prediction stream $N_M$ are each normalized according to:

$$\tilde{N} = \frac{N - N_{\min}}{N_{\max} - N_{\min}} \tag{17}$$
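The PSNR of Eq. (16) and the min-max normalization of Eq. (17) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: frames are flattened to lists of pixel values, the sum in Eq. (16) is read as a mean over the T compared values, `max_value=255.0` assumes 8-bit images, and a clip with constant scores is mapped to zeros. None of these choices is specified in the text.

```python
import math

def psnr(frame, predicted, max_value=255.0):
    """Eq. (16): 10 * log10(max^2 / mean squared error)."""
    mse = sum((a - b) ** 2 for a, b in zip(frame, predicted)) / len(frame)
    if mse == 0:
        return math.inf  # identical frames: no prediction error
    return 10.0 * math.log10(max_value ** 2 / mse)

def min_max_normalise(scores):
    """Eq. (17): rescale the scores of one clip to the range [0, 1]."""
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        return [0.0 for _ in scores]  # assumption: constant scores map to 0
    return [(s - lowest) / (highest - lowest) for s in scores]
```

Normalizing each stream separately puts both sets of scores on the same scale, which is what makes a later decision-level combination of the two streams meaningful.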
According to the results of object detection, the equation for the normality scores in frame m:

$$N_C^m = \frac{\sum_{k=1}^{K} R_k^m F_k^m W_i}{U} \tag{14}$$

where U represents the area of the frames and $W_i$ are the weights of different types of abnormal objects. The first abnormal class is bicycle which can be easily detected in high-resolution datasets or sparse environments. The second abnormal class is skateboard which is hard to be recognized due to its small size and low confidence. The third abnormal classes are cars and trucks, easily detected in all test clips. Since these vehicles occlude the drivers, the pose estimator can not detect this abnormal driving event. Meanwhile, the normality scores of ST-GCN and deep clustering are represented as $N_P$. And the normality scores of the action classification stream are represented as:

The final classification is as follows:

$$\mathrm{Frame}(m) = \begin{cases} \text{Normal} & \text{if } N_F^m > \zeta \\ \text{Abnormal} & \text{otherwise} \end{cases} \tag{19}$$

where $\zeta$ is a threshold to determine the abnormal events from video sequences.

4. Experiments

In this section, we introduce the AED datasets tested in our proposed framework and then elaborate on the evaluation metrics and the detailed setting of our experiment implementation. Next, we present experimental results using validation sequences from the datasets: UCSD PED1 & PED2 [8], ShanghaiTech Campus (SHTC) [22] and CUHK Avenue (Avenue) [27] to investigate the effectiveness of different components in the proposed framework and the influence of different abnormal events
Table 1
Comparison in detection efficiency on SHTC dataset.
Model    Parameters (millions) (↓)    Execution time (s) (↓)    FPS (↑)

Fig. 4. Illustration of the abnormal events distribution in the UCSD PED1, UCSD PED2, SHTC and Avenue datasets. The UCSD PED1 & PED2 datasets mainly include abnormal objects. The Avenue dataset mostly includes abnormal actions.

4.1. Datasets

• The UCSD datasets have two parts, PED1 and PED2, both recorded from a fixed viewpoint in low resolution. PED1 has 34 training video sequences and 36 testing video sequences. PED2 has 16 training video sequences and 12 testing video sequences. The main abnormal events in PED1 include bicycles, cars, skateboarding and walking on the grass. The main abnormal events in PED2 are the same as those in PED1, except for walking on the grass, so there are only strange objects in PED2.
• The ShanghaiTech campus dataset contains 330 training video clips and 109 testing video clips with 13 different scenes. There are abnormal behaviours such as chasing, jumping and fighting and strange objects such as trucks, bicycles and skateboards.
• The CUHK Avenue dataset contains 16 training video clips and 21 testing video clips with the same background. However, most abnormal events are abnormal behaviours such as throwing, loitering and fast running.
Fig. 6. Illustration of some abnormal frames in the UCSD PED1, UCSD PED2, SHTC and Avenue datasets. The red bounding boxes denote different abnormal objects (bicycles, trucks and skateboards) and strange actions. Best viewed in colour.
Fig. 7. The AUC performance of the baseline [9] published in the conference ICASSP 2022 and the proposed two-stream fusion method.
Fig. 8. ROC curves on the UCSD PED2 dataset using the classification, prediction, and fusion models. Best viewed in colour.
Fig. 9. ROC curves on UCSD PED1 dataset using classification, prediction, and fusion models. Best viewed in colour.
Fig. 10. ROC performance under classification, prediction, and fusion models on SHTC datasets. Best viewed in colour.
Table 2
Comparison in AED with different object detectors.
Detectors         AUC—X    AUC—XY   AUC—XYZ
DAMO-YOLO [51]    0.698    0.777    0.786
RTMDet [52]       0.673    0.772    0.780
YOLOv3*           0.708    0.819    0.838
X: The classification stream only with object information.
XY: The enhanced classification stream with pose and object information.
XYZ: The enhanced fusion framework with classification and prediction streams.

y-axis. Meanwhile, F1 scores are calculated when counting the number of abnormal events in different datasets.

$$\mathrm{TPR} = \frac{TP}{P} = \frac{TP}{TP + FN} \tag{20}$$

negative; FP and N represent the false positive and real negative (normal) samples, respectively, and TN represents the true negative in video sequences. A higher AUC and F1 scores indicate better anomaly detection performance.

4.3. Performance Analysis

In this section, we present the detection efficiency and accuracy to evaluate the effectiveness of our proposed method, including the detection efficiency, the ablation study of different proposed components and the effects of different abnormal events. Due to different types of abnormal events, the specific normality scores of each frame in video sequences are also analysed. Furthermore, we count the number of abnormal events without including events starting and ending.
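To make the true-positive rate of Eq. (20) and the resulting AUC concrete, the sketch below sweeps a threshold over frame-level scores and integrates the ROC curve with the trapezoidal rule. It is an illustrative implementation, not the evaluation code used in the paper: the function names are hypothetical, the inputs are assumed to be anomaly scores (higher means more abnormal, so a normality score would be negated first), and both classes are assumed to be present in the labels.

```python
def roc_points(anomaly_scores, labels):
    """Return (FPR, TPR) pairs; labels are 1 for abnormal and 0 for normal."""
    positives = sum(labels)              # P = TP + FN
    negatives = len(labels) - positives  # N = FP + TN
    points = [(0.0, 0.0)]
    # Lower the threshold step by step so that more frames are flagged.
    for threshold in sorted(set(anomaly_scores), reverse=True):
        tp = sum(1 for s, y in zip(anomaly_scores, labels)
                 if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(anomaly_scores, labels)
                 if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))  # TPR as in Eq. (20)
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A score list that ranks every abnormal frame above every normal frame yields an AUC of 1.0, while a fully reversed ranking yields 0.0.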
Fig. 11. Normality scores for different abnormal events in video sequences: Bicycle, Skateboard, Car and Throwing. The first stream is action-based classification,
and the second is motion-based prediction. Best viewed in colour.
Table 3 Table 5
AUC performance on different anomalies on UCSD PED2/SHTC datasets. AUC Performance for abnormal event detection compared to state-of-the-art
Classification stream Prediction stream Fusion model
methods.
SHTC PED2 PED1 Avenue
Bicycles 0.930/0.886 0.964/0.722 0.977/0.872
Cars 0.999/0.889 1/0.684 1/0.833 Luo et al. [44] 0.680 0.922 - 0.817
Skateboards 0.930/0.758 0.935/0.802 0.950/0.805 Conv-AE [19] 0.609 0.811 0.750 0.800
Strange actions -/0.718 -/0.762 -/0.772 MDT [45] - 0.829 0.818 -
Total 0.927/0.819 0.953/0.730 0.979/0.838 ConvLSTM-AE [46] - 0.881 0.755 0.770
Abati et al. [47] 0.725 0.954 - -
Markovitz1 et al. [15] 0.761 - - -
proposed classification stream, which is driven by pose information, Yang et al. [29] - 0.940 - -
exhibits a notable advantage in terms of model complexity and resource Zhou et al. [23] - 0.960 0.839 0.860
Zhang et al. [30] 0.803 0.929 0.942 0.805
utilization. With a significantly lower parameter count compared to the
Yan et al. [7] - - 0.677 0.796
baseline models, this stream offers potential benefits in terms of effi Li et al. [4] 0.717 - - 0.820
ciency and resource allocation. Considering execution time, the com Hyun et al. [2] 0.740 0.972 - 0.868
parison reveals that the rGAN [49] model requires a relatively longer Proposed 0.838 0.979 0.855 0.842
execution time, while the MPN [50] model demonstrates significantly
faster performance. In this context, the proposed classification stream
walking Figs. 8–10 detail the AUC performance of our component
falls in between, with an execution time of 164 s. This suggests that the
models on different individual abnormal events of UCSD PED1 & PED2,
proposed model strikes a balance between computational efficiency and
SHTC datasets. First of all, compared with the individual models, the
detection accuracy, offering improved performance compared to rGAN
fusion model performs the best on the tested datasets. The prediction
[49] while being outperformed by MPN [50] in terms of execution time.
stream is sensitive to object movements and robust to abnormal events.
Furthermore, the evaluation of frames per second (FPS) provides in
However, if the background environment is filled with crowded pedes
sights into the capabilities of the models. The proposed classification
trians, the prediction stream would not be effective for abnormal events.
stream achieves an FPS of 12.3, indicating its ability to process video
frames at a relatively faster rate compared to rGAN [49], although slightly lower than MPN [50]. It demonstrates promising detection performance while maintaining reasonable computational efficiency. While the proposed fusion model incurs a higher computational cost compared to a single classification stream, its utilization of motion and appearance features contributes to improved detection accuracy. By combining multiple streams of information, the fusion model achieves enhanced performance and robustness in detecting abnormal events.

These findings highlight the trade-offs between model complexity, execution time, and processing performance in abnormal event detection. The proposed classification stream shows promising results in terms of parameter efficiency and processing capabilities, primarily because it is a pose-centric model. Although the fusion model is more complex in terms of model parameters and execution time, it achieves superior detection accuracy due to the additional motion and appearance information. This means we can flexibly choose different models to handle different tasks, optimizing for either efficiency or accuracy depending on the application. Further investigations and optimizations can be pursued as future work to enhance the overall detection efficiency of the proposed algorithms.

4.3.2. Ablation Study
Fig. 7 compares the AUC performance of our work presented at ICASSP 2022 and the proposed method. The enhanced fusion framework outperforms the previous work across all tested datasets due to its ability to efficiently recognize and classify normal and abnormal movements using both motion and appearance features. The improvement observed on the PED1 and PED2 datasets is higher than that on the SHTC dataset. This is not only because SHTC is larger in scale than PED1 and PED2 but also because the primary abnormal events in the PED1 and PED2 datasets, such as driving and cycling, exhibit different motion features than

The improved classification stream can detect abnormal objects such as bicycles and cars and abnormal actions such as jumping and chasing. Nevertheless, the action classification stream does not capture tiny objects like skateboards well.

From Fig. 8, the AUC of the action-based classification stream is 92.7%, which is lower than the 96.9% of the motion-based prediction stream. This is because the prediction stream is more sensitive to the partly moving targets, which are removed by pose estimation and early fusion steps at the beginning. Thus, the prediction stream can detect the start and end of abnormal events more accurately.

In addition, the fusion model can accurately capture the occluded abnormal vehicles such as trucks and cars. For the UCSD PED1 dataset, the AUC performance of the prediction stream is better than that of the classification stream, contrary to the ShanghaiTech Campus dataset. This is because the prediction model is more accurate in capturing partly moving targets, and the classification stream is more sensitive to high-resolution and densely populated datasets.

In testing on the Avenue dataset, the fusion model showed that the action-based classification stream does not achieve a good AUC performance. An analysis of the Avenue dataset shows that the ground truth labels are pixel-level and that the main types of abnormal events are loitering and walking in the wrong direction, which should not be considered anomalies.

4.3.3. Analysis of Different Object Detectors
Table 2 compares the detection performance of YOLOv3 [25] with that of DAMO-YOLO [51] and RTMDet [52] on the SHTC [22] dataset. The evaluation includes the classification stream utilizing only object information, the enhanced classification stream incorporating pose and object information, and the fusion model combining the classification and prediction streams. The results indicate that YOLOv3 outperforms DAMO-YOLO [51] and RTMDet [52] in all three scenarios. This suggests that YOLOv3 provides higher abnormal event detection performance
when considering object information alone, as well as when incorporating additional pose information and in the fusion model.

It is worth noting that DAMO-YOLO [51] and RTMDet [52] are considered efficient and accurate object detectors for general object detection tasks. However, their strong object detection capabilities can sometimes interfere with abnormal event detection, particularly due to their robust ability to detect objects of various shapes and sizes. This interference can result in false alarms being detected as anomalies, thereby reducing the specificity and accuracy of the detection system. As

Table 4
F1 scores for different α parameters on UCSD PED1 dataset, which contains 55 abnormal events.

α      Truly detected   Miss detection   False alarm   Accuracy   F1 scores
0.2    51               4                29            0.927      0.773
0.4    49               6                9             0.891      0.867
0.6    49               6                1             0.891      0.933
0.8    46               9                0             0.836      0.911
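The relation between the columns of Table 4 can be checked with a short calculation. The sketch below is illustrative only and rests on two assumptions that this section does not state explicitly: that the Accuracy column is the number of truly detected events divided by the 55 abnormal events, and that the F1 score is the usual event-level harmonic mean of precision and recall. The names TABLE_4, detection_rate, f1_score and summarize are hypothetical helpers, not part of the proposed method. Under these assumptions the Accuracy column and the F1 scores for α = 0.4, 0.6 and 0.8 are reproduced, whereas the counts listed for α = 0.2 give roughly 0.756 rather than the tabulated 0.773, so either that row or the assumed definition may differ from what was used to produce the table.

```python
# Counts copied from Table 4 (UCSD PED1, 55 abnormal events).
# alpha -> (truly detected, miss detection, false alarm)
TABLE_4 = {
    0.2: (51, 4, 29),
    0.4: (49, 6, 9),
    0.6: (49, 6, 1),
    0.8: (46, 9, 0),
}


def detection_rate(tp: int, fn: int) -> float:
    """Assumed meaning of the 'Accuracy' column: detected events / all events."""
    return tp / (tp + fn)


def f1_score(tp: int, fn: int, fp: int) -> float:
    """Standard F1: harmonic mean of precision tp/(tp+fp) and recall tp/(tp+fn)."""
    return 2 * tp / (2 * tp + fp + fn)


def summarize(table: dict) -> dict:
    """Return alpha -> (detection rate, F1), both rounded to three decimals."""
    return {
        alpha: (round(detection_rate(tp, fn), 3), round(f1_score(tp, fn, fp), 3))
        for alpha, (tp, fn, fp) in table.items()
    }


if __name__ == "__main__":
    for alpha, (rate, f1) in summarize(TABLE_4).items():
        print(f"alpha={alpha}: detection rate={rate:.3f}, F1={f1:.3f}")
```

Read this way, the table shows the trade-off controlled by α: a small α detects the most events but raises many false alarms, while a large α suppresses false alarms at the cost of missed events, with the best balance among the listed values at α = 0.6.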
Y. Yang et al. Neurocomputing 553 (2023) 126561
[12] F. Angelini, Y. Jiawei, S.M. Naqvi, Privacy-Preserving Online Human Behaviour Anomaly Detection Based On Body Movements and Objects Positions (2019).
[13] N. Li, F. Chang, C. Liu, Human-related anomalous event detection via spatial-temporal graph convolutional autoencoder with embedded long short-term memory network, Neurocomputing 490 (2022) 482–494.
[14] S. Yan, Y. Xiong, D. Lin, Spatial temporal graph convolutional networks for skeleton-based action recognition, AAAI Conference on Artificial Intelligence (2018).
[15] A. Markovitz, G. Sharir, I. Friedman, L. Zelnik-Manor, S. Avidan, Graph Embedded Pose Clustering for Anomaly Detection (2020).
[16] Y. Yang, Z. Fu, S.M. Naqvi, Enhanced adversarial learning based video anomaly detection with object confidence and position, 13th International Conference on Signal Processing and Communication Systems (ICSPCS) (2019).
[17] W. Sultani, C. Chen, M. Shah, Real-World Anomaly Detection in Surveillance Videos (2018).
[18] M. Ye, Q. Zhang, L. Wang, J. Zhu, R. Yang, J. Gall, A survey on human motion analysis from depth data (2013) 149–187.
[19] M. Hasan, J. Choi, J. Neumann, A.K. Roy-Chowdhury, L.S. Davis, Learning temporal regularity in video sequences, IEEE International Conference on Computer Vision and Pattern Recognition (2016).
[20] Y.S. Chong, Y.H. Tay, Abnormal event detection in videos using spatiotemporal autoencoder, CoRR (2017).
[21] M. Astrid, M.Z. Zaheer, S.-I. Lee, Pseudobound: Limiting the anomaly reconstruction capability of one-class classifiers using pseudo anomalies, Neurocomputing 534 (2023) 147–160.
[22] W. Liu, W. Luo, D. Lian, S. Gao, Future frame prediction for anomaly detection – a new baseline, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018).
[23] J.T. Zhou, L. Zhang, Z. Fang, J. Du, X. Peng, Y. Xiao, Attention-driven loss for anomaly detection in video surveillance, IEEE Transactions on Circuits and Systems for Video Technology 30 (12) (2020) 4639–4647.
[24] X. Zeng, Y. Jiang, W. Ding, H. Li, Y. Hao, Z. Qiu, A hierarchical spatio-temporal graph convolutional neural network for anomaly detection in videos, CoRR 2112.04294 (2021).
[25] J. Redmon, A. Farhadi, Yolov3: An incremental improvement, arXiv (2018).
[26] P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, T. Brox, Flownet: Learning optical flow with convolutional networks, CoRR 1504.06852 (2015).
[27] J.S.C. Lu, J. Jia, Abnormal event detection at 150 fps in matlab, in: IEEE International Conference on Computer Vision (ICCV), 2013.
[28] M. Sabokrou, M. Khalooei, M. Fathy, E. Adeli, Adversarially learned one-class classifier for novelty detection, IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018).
[29] Y. Yang, Y. Xian, Z. Fu, S.M. Naqvi, Video anomaly detection for surveillance based on effective frame area, in: IEEE 24th International Conference on Information Fusion (FUSION), 2021.
[30] S. Zhang, M. Gong, Y. Xie, A.K. Qin, H. Li, Y. Gao, Y.-S. Ong, Influence-aware attention networks for anomaly detection in surveillance videos, IEEE Transactions on Circuits and Systems for Video Technology (2022).
[31] R. Morais, V. Le, T. Tran, B. Saha, M. Mansour, S. Venkatesh, Learning Regularity in Skeleton Trajectories for Anomaly Detection in Videos, in: Computer Vision and Pattern Recognition (CVPR), 2019.
[32] Q. Wang, Y. Liu, Z. Xiong, Y. Yuan, Hybrid feature aligned network for salient object detection in optical remote sensing imagery, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–15.
[33] Y. Liu, Q. Li, Y. Yuan, Q. Du, Q. Wang, Abnet: Adaptive balanced network for multiscale object detection in remote sensing imagery, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–14.
[34] A. Ali, P. Angelov, Anomalous behaviour detection based on heterogeneous data and data fusion, Soft Computing 22 (2018).
[35] V. Chatzigiannakis, G. Androulidakis, K. Pelechrinis, S. Papavassiliou, V. Maglaris, Data fusion algorithms for network anomaly detection: classification and evaluation, in: International Conference on Networking and Services (ICNS), 2007.
[36] Z. Fu, S.M. Naqvi, J.A. Chambers, Collaborative detector fusion of data-driven phd filter for online multiple human tracking, in: International Conference on Information Fusion (FUSION), 2018.
[37] K.B.-Y. Wong, T. Zhang, H. Aghajan, Data fusion with a dense sensor network for anomaly detection in smart homes, Human Behavior Understanding in Networked Sensing, Theory and Applications of Networks of Sensors (2014) 211–237.
[38] D. Lahat, T. Adali, C. Jutten, Multimodal data fusion: An overview of methods, challenges, and prospects, Proceedings of the IEEE 103 (9) (2015) 1449–1477.
[39] F. Angelini, Z. Fu, Y. Long, L. Shao, S.M. Naqvi, 2D Pose-based Real-time Human Action Recognition with Occlusion-handling, IEEE Transactions on Multimedia (2019).
[40] Y. Xiu, J. Li, H. Wang, Y. Fang, C. Lu, Pose Flow: Efficient online pose tracking, British Machine Vision Conference (BMVC) (2018).
[41] W. Luo, W. Liu, S. Gao, Graph convolutional neural network for skeleton-based video abnormal behavior detection, Generalization with Deep Learning (2021) 139–155.
[42] Z. Chen, Y. Tian, W. Zeng, T. Huang, Detecting abnormal behaviors in surveillance videos based on fuzzy clustering and multiple auto-encoders, in: IEEE International Conference on Multimedia and Expo (ICME), 2015.
[43] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, pp. 234–241.
[44] W. Luo, W. Liu, S. Gao, A revisit of sparse coding based anomaly detection in stacked rnn framework, in: IEEE International Conference on Computer Vision (ICCV), 2017.
[45] V. Mahadevan, W. Li, V. Bhalodia, N. Vasconcelos, Anomaly detection in crowded scenes, IEEE International Conference on Computer Vision and Pattern Recognition (2010).
[46] W. Luo, W. Liu, S. Gao, Remembering history with convolutional lstm for anomaly detection, IEEE International Conference on Multimedia and Expo (2017).
[47] D. Abati, A. Porrello, S. Calderara, R. Cucchiara, Latent space autoregression for novelty detection, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019).
[48] Z. Wang, Y. Zou, Z. Zhang, Cluster attention contrast for video anomaly detection, ACM International Conference on Multimedia (2020) 2463–2471.
[49] Y. Lu, F. Yu, M.K.K. Reddy, Y. Wang, Few-shot scene-adaptive anomaly detection, ECCV (2020).
[50] H. Lv, C. Chen, Z. Cui, C. Xu, Y. Li, J. Yang, Learning normal dynamics in videos with meta prototype network, in: the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 15425–15434.
[51] X. Xu, Y. Jiang, W. Chen, Y. Huang, Y. Zhang, X. Sun, Damo-yolo: A report on real-time object detection design, arXiv preprint arXiv:2211.15444v2 (2022).
[52] C. Lyu, W. Zhang, H. Huang, Y. Zhou, Y. Wang, Y. Liu, S. Zhang, K. Chen, Rtmdet: An empirical study of designing real-time object detectors (2022). arXiv:2212.07784.

Yuxing Yang earned his Master’s degree from the School of Engineering, Newcastle University, U.K., in 2018. Recently, he defended his Ph.D. thesis with the Intelligent Sensing and Communications Research Group at the same institution. In 2021, he worked as a research assistant on an EPSRC IAA project, which applied deep learning for multimodal human security surveillance. His research interests include video anomaly detection, human behavior analysis, and intelligent surveillance systems.

Zeyu Fu received the B.Eng. (with First-Class Hons.) and Ph.D. degrees from the School of Engineering, Newcastle University, U.K., in 2015 and 2019, respectively. During his PhD study, he worked on a DSTL & EPSRC-funded project which develops algorithms for video-based multiple human tracking. At the same institution, he was a research assistant working on an MRC-CiC funded project which applies machine learning for ocular imaging. From 2020 to 2022, he was a postdoctoral researcher at the Institute of Biomedical Engineering, University of Oxford, U.K., working on two projects about AI for healthcare (NIH-funded CIFASD and ERC-adv funded PULSE). He is currently a lecturer (assistant professor) in computer vision at the Department of Computer Science, University of Exeter, U.K. His research interests include visual surveillance, machine learning, and medical image analysis.

Syed Mohsen Naqvi received the Ph.D. degree from Loughborough University, Loughborough, U.K., in 2010. He is currently a Reader in Signal and Information Processing, the Director of the Intelligent Sensing Laboratory, and the Deputy Head of the Intelligent Sensing and Communications Research Group, Newcastle University, Newcastle, U.K. His research contributions have been in human action, activity, behavior analyses, multiple human target detection, localization, and tracking, human speech enhancement and separation, and explainable AI, all for defence and healthcare applications. Dr Naqvi has above 130 publications in peer-reviewed articles in high impact journals and proceedings of leading international conferences. He organized special sessions in FUSION (2013-2022), delivered Seminars, and was a Speaker with University Defence Research Collaboration (UDRC) Summer Schools (2015-2017). He was involved in more than 15 research projects funded by UKRI and Industry (e.g., EPSRC, BBSRC, MoD, UDRC, Thales, Innovate U.K., NHS). He also successfully supervised and graduated above 20 Ph.D., including the authors of this paper. He is a Senior Member of IEEE and a Fellow of the Higher Education Academy. He was an Associate Editor of Elsevier Journal on Signal Processing (2018-2022). He served two terms of Associate Editor of IEEE Transactions on Signal Processing (2019-2023). He is an Associate Editor of IEEE/ACM Transactions on Audio Speech and Language Processing (2019-to date).