
Neurocomputing 553 (2023) 126561
Abnormal event detection for video surveillance using an enhanced two-stream fusion method
Yuxing Yang a,*, Zeyu Fu b, Syed Mohsen Naqvi a
a Intelligent Sensing and Communications Research Group, Newcastle University, Newcastle upon Tyne, UK
b Department of Computer Science, College of Engineering, Mathematics, and Physical Sciences, University of Exeter, Exeter, UK
ARTICLE INFO

Communicated by Zidong Wang

Keywords: Abnormal event detection; Pose estimation; Optical flow; Object detection; Graph convolutional neural network; Adversarial learning; Data fusion

ABSTRACT

Abnormal event detection is a critical component of intelligent surveillance systems, focusing on identifying abnormal objects or unusual human behaviours in video sequences. However, conventional methods struggle due to the scarcity of labelled data. Existing solutions typically train on normal data, establish boundaries for regular events, and identify outliers during testing. These approaches are often inadequate as they do not efficiently leverage the geometry and image texture information, and they lack a specific focus on different types of abnormal events. This paper introduces a novel two-stream fusion algorithm for abnormal event detection to address these diverse abnormal events better. We first extract the object, pose, and optical flow features. Then, the object and pose information is combined early on to eliminate occluded pose graphs. The trusted pose graphs are fed into a Spatio-Temporal Graph Convolutional Network (ST-GCN) to detect abnormal behaviours. Simultaneously, we propose a video prediction framework that identifies abnormal frames by measuring the difference between predicted and ground truth frames. Lastly, we execute a decision-level fusion between the classification and prediction streams to achieve the final results. Our results on the UCSD PED1 dataset indicate the enhanced performance of the fusion model for various abnormal events. Furthermore, experimental results on the UCSD PED2 dataset and the ShanghaiTech campus dataset underscore our approach’s effectiveness compared to other related works.
1. Introduction

Abnormal event detection (AED) is a critical yet challenging area of research in computer vision. The primary tasks of AED are to identify human activities in video sequences and distinguish abnormalities in each frame. AED holds significant potential for a variety of applications in modern society, such as surveillance systems, assisted living, healthcare, robotics, and sports analysis [1–7]. However, some challenging issues still hinder the development of AED technologies. Human events constitute complex temporal sequences of mixed gestures under diverse styles, attitudes, viewpoints, and lighting conditions. The precise definition of abnormal events is difficult due to the different types of anomalies present in different datasets. Furthermore, the context of the scene plays a crucial role in developing AED systems. For instance, in the UCSD PED1 dataset [8], walking-on-the-sidewalk is defined as a normal event, while walking-on-grass is abnormal. Moreover, detecting the beginning and end of the activities is not trivial as the appearances of detected objects are not integral [9,10].

Various approaches are proposed to deal with the challenges mentioned above. Human action recognition algorithms can detect human-related actions such as running, bending, jumping, and hand-waving in a supervised or semi-supervised manner. Similarly, multiple types of actions can be considered binary normal or abnormal movements. With the rapid development of video-based or pose-based human action recognition, these methods [11–15] have also been transferred to the task of AED. In [11], an end-to-end hierarchical RNN for skeleton-based action recognition is demonstrated for video anomaly detection. Moreover, in [12], a joint RGB-Pose MLSTM model is proposed for human activity anomaly detection.

While these algorithms are effective at interpreting human-related actions, they do not take scene context information into account. The same actions in different scenes may have different semantic labels. Further challenges arise due to blurred target appearances, which can be caused by high-speed movement or poor dataset resolution, making it difficult for action recognition algorithms to estimate precise information for the neural network. Additionally, these action recognition algorithms are limited to classifying human-related behaviours, and cannot detect non-human-related abnormal events such as driving, where the
* Corresponding author.
https://doi.org/10.1016/j.neucom.2023.126561
Received 20 March 2023; Received in revised form 11 June 2023; Accepted 8 July 2023
Available online 18 July 2023
0925-2312/© 2023 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-
nc-nd/4.0/).
Fig. 1. (a): The overall architecture of the proposed two-stream fusion framework. The raw images are processed through three models: object detection, pose
estimation, and optical flow calculation. These processes extract object class information, human-related pose, and optical flow results, respectively. There are two
concurrent streams: (b) action-based classification and (c) motion-based prediction. In the action-based classification stream (b), the object detector, the YOLOv3
model, primarily classifies the detected objects into four categories: pedestrian, bicycle, skateboard, and car [9]. Frames containing detected pedestrians and pose
information are then fed into the Spatio-Temporal Graph Convolutional Network (ST-GCN) [14] to capture the spatial and temporal features of regular body joints. The
extracted features are combined into a latent vector and further processed via a feature clustering step to output the normality scores. Concurrently, in the motion-
based prediction stream (c), raw images undergo adversarial training to predict the subsequent frame based on historical trajectories. The predicted frames are then
calculated for their optical flow results and compared with the optical flows of the preceding frame in the ground truth. By combining the motion and appearance
loss, we calculate the normality scores of the frames in the test clip. Finally, the results from all features are fused at the decision level to produce the final normality
score. Best viewed in colour.
driver is occluded by a car or truck, and the target information cannot be extracted by feature extraction models.

Another increasing trend in AED is based on reconstruction or future prediction models [16–21]. Reconstruction-based algorithms generally train autoencoders using normal data, with abnormal frames distinguished in video sequences due to their higher reconstruction errors compared to normal frames. During inference, abnormal frames are identified based on large prediction errors, as these events rarely occur during training and thus have different probability distributions compared to the training data. Other prediction-based AED algorithms [22–24] use raw RGB data or optical flow features to train a network and predict the next frame based on previous frames. The strength of these algorithms lies in their ability to leverage the spatial and temporal appearance and motion information, enabling them to detect a variety of abnormal events such as unusual behaviours like chasing, jumping, and fighting, and abnormal object classes like skateboard, bicycle, and car. Furthermore, since the training data is normal, there is no need for labelling, which significantly reduces labour costs. However, these algorithms are challenging to train well and are not sensitive to small abnormal classes.

This paper proposes a unified AED framework by fusing multiple features to detect different types of video-based abnormal events, as shown in Fig. 1. Our framework consists of two parallel processing branches. The first stream uses the object detector YOLOv3 [25] to extract class information and make preliminary classifications of pedestrians and other objects. Then a pose estimation algorithm is used to extract body pose information. After initial fusion with class information, the reduced pose information is converted into high-level features, which are further processed by a graph convolutional neural network. This first stream, which integrates pose and class information, can improve the accuracy of pose-level detection results and efficiently detect non-human-related abnormal events. The second stream extracts optical flow with FlowNet [26]. Then, raw RGB data and optical flow features are processed by a U-Net generator. Moreover, a discriminator is used to predict whether the next frame is real or fake, following the principle of adversarial learning. The second stream can compensate for the first stream’s limitations and enrich the appearance and motion description, which benefits abnormal event detection in crowded environments and with small objects. The normality scores of the test data from the two streams are fused in a final step to decide whether the frames are normal or abnormal. Extensive experiments are provided on the UCSD PED1 & PED2 [8], ShanghaiTech Campus [22] and CUHK Avenue [27] datasets to highlight the efficacy and robustness of the proposed system.

We demonstrate that multiple features can efficiently boost RGB-based networks for classification and prediction. On the other hand, raw RGB data and calculated optical flow results can compensate for pose data in motion and appearance features. The contributions of this work are mainly:

1. We propose a unified classification and prediction fusion framework designed to detect various types of abnormal events in surveillance videos.
2. By incorporating motion features, our AED algorithm shows increased sensitivity towards abnormal events, even in crowded environments.
3. Our extensive experiments demonstrate the effectiveness and robustness of the proposed framework in AED.
4. We analyze the types of abnormal events in several public AED datasets and evaluate the accuracy of our method in identifying these abnormal events.
Fig. 2. Some typical samples after object detection and pose estimation. Different abnormal objects: bicycles, trucks and skateboards are shown. Best viewed
in colour.
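As a rough illustration of how detector output and pose output could be combined before the graph network, the Python sketch below keeps a detection only if its class belongs to an allowed list and its confidence exceeds a threshold, and then keeps a pose only if enough of its keypoints fall inside a retained person box. The class list and the confidence test follow the paper's initial-fusion step; the keypoint-in-box matching rule, the `min_inside` fraction, the default thresholds and all function names are assumptions made for illustration and are not taken from the paper's Algorithm 1.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Keypoint = Tuple[float, float]  # (x, y) in image coordinates


@dataclass
class Detection:
    cls: str     # class label C
    x1: float    # left edge of the bounding box
    y1: float    # top edge
    x2: float    # right edge
    y2: float    # bottom edge
    conf: float  # confidence score F


# Class list used by the paper's initial-fusion step.
ALLOWED = {"bicycle", "skateboard", "car", "truck", "person"}


def filter_detections(dets: Sequence[Detection], eps: float) -> List[Detection]:
    """Keep detections whose class is allowed and whose confidence exceeds eps."""
    return [d for d in dets if d.cls in ALLOWED and d.conf > eps]


def inside_fraction(pose: Sequence[Keypoint], box: Detection) -> float:
    """Fraction of keypoints that lie inside the box (hypothetical matching rule)."""
    if not pose:
        return 0.0
    hits = sum(1 for x, y in pose
               if box.x1 <= x <= box.x2 and box.y1 <= y <= box.y2)
    return hits / len(pose)


def trusted_poses(poses: Sequence[Sequence[Keypoint]],
                  dets: Sequence[Detection],
                  eps: float = 0.5,
                  min_inside: float = 0.5) -> List[Sequence[Keypoint]]:
    """Keep poses supported by at least one confident person detection."""
    persons = [d for d in filter_detections(dets, eps) if d.cls == "person"]
    return [p for p in poses
            if any(inside_fraction(p, b) >= min_inside for b in persons)]
```

In this sketch a pose without a confident person box is simply dropped, which is one plausible way to discard occluded or spurious pose graphs; other matching rules (for example, IoU between the pose's bounding box and the detection) would fit the same interface.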
This paper introduces a joint fusion approach for AED utilizing comprehensive pose, class, and motion information. Preliminary aspects of this work, primarily addressing non-human-related AED through the fusion of pose and class information, have previously been presented in [9]. We further expand upon this initial research by conducting extensive experiments on several public datasets to compare our proposed detection method with recent state-of-the-art works. This work also includes an ablation study on different contribution components and abnormal event types, providing a more holistic and in-depth analysis of our approach.

2. Related Work

2.1. Image-based and Pose-based AED

Most state-of-the-art AED methods rely on RGB data to detect strange or rare events in surveillance video sequences. Sultani et al. [17] proposed a frame-level anomaly detection method in a supervised manner. Video sequences are divided into a fixed number of segments. A multiple instance learning (MIL) algorithm defines abnormal video sequences, and a deep MIL ranking model is trained to detect anomalies. However, this method is inefficient since labelling video clips is time-consuming. Reconstruction methods are more prevalent in recent AED work since these autoencoder (AE)-based models are trained in a semi-supervised manner [19,28]. Training data are normal, and testing samples contain both normal and abnormal events. Raw RGB data are used to learn the latent vectors of normal samples. In testing, video sequences similar to the training data will be reconstructed with low error. In contrast, strange actions or rare events which deviate from normal data are expected to be reconstructed with higher error. Hasan et al. [19] proposed a combination of CNN-based autoencoders to model the temporal evolution of HOG-HOF features. Sabokrou et al. [28] proposed an adversarially learned one-class classifier for novelty detection. An autoencoder model is trained as a generator. At the same time, a sequence of convolution layers is trained as a discriminator to distinguish novel or outlier samples. After adversarial learning, the testing samples with normal events are enhanced, and those with outliers are distorted as expected. These methods distinguish abnormal frames by their reconstruction errors and cannot provide insight into the types of abnormal events. In this paper, our fusion model focuses on the binary classification of normal and abnormal events and analyzes the different types of abnormal events. Other recent approaches [23,29,30] apply attention-based methods to the original data. These methods split the input images into foreground and background images and only focus on the motion area. However, these methods are not sensitive to cluttered or crowded background scenes.

Pose-based AED methods are proposed with the development of pose estimation and graph neural networks. In [13], human poses are represented as pose graphs. The graph information is encoded in a spatial and temporal graph convolution neural network and clustered into a latent vector. The Dirichlet process mixture model (DPMM) is used to analyze the distribution of these vectors and distinguish the actions represented by these vectors. Similarly, in [31], a pose-based human activity anomaly detection algorithm is presented, leveraging the MPED-RNN (single encoder dual decoder) architecture. Recently, Zeng et al. [24] proposed a joint pose-based classification and prediction architecture for anomaly detection. High-level graph features encode the trajectories of people and the interactions between multiple persons, while low-level graph features encode the body postures and the represented actions. Pose-based AED methods can effectively reduce the computational cost of training since the input is the pose graphs and not the raw images. However, the algorithms in [31,24] do not consider non-human-related abnormal events. Since all pose-based methods need to extract pose features for action recognition or prediction, the pose estimation model plays a crucial role in AED work. It is important not only to extract all keypoints of each person precisely but also to identify multiple targets in video sequences. That is why this step is called pose estimation rather than pose extraction. Thus, pose information is hard to estimate on low-resolution datasets, which are therefore unfriendly to pose-based AED algorithms. Recently, there has been growing interest in utilizing object detectors to acquire prior knowledge from different types of images, such as RGB, infrared, depth, and remote sensing images [32,33]. Our proposed algorithm also adds object detection to handle non-human-related abnormal events like driving and makes an early fusion between the pose information and the results of object detection to enhance the localization and identification of detected persons.

2.2. The Data Fusion Mechanism

Deep learning in a single-domain dataset has been successful. Data fusion is used to extract and mix information from multiple domains or different sensors [34–37]. Compared with individual modalities, the fused data would have a richer representation and better performance and be more robust for the given tasks. Thus, data fusion is a practical solution in several fields like medicine, business and driverless technology [38]. Traditional data fusion methods fall into three techniques: early, intermediate, and late fusion. Early fusion, also called data-level fusion, means multiple data are fused before the data analysis model is applied. Intermediate fusion is the most flexible method, in which multiple data can be fused at different stages of model training. Late fusion methods allow data to be fused after multiple model decisions. One of AED’s challenges is that the types of abnormal events in video surveillance are various: abnormal objects, such as cars, bicycles and skateboards; abnormal positions, such as walking on the grass and driving on the sidewalk; and abnormal behaviours, such as shooting, fighting and chasing. Data fusion methods for AED can robustly mix multiple features such as pose and optical flow to detect different types of abnormal events.

3. Proposed Framework

3.1. Pose-driven Action Classification Model

3.1.1. Object Detection and Pose Estimation

A pre-trained YOLOv3 object detector is used to capture the targets’ classes, localization information and corresponding confidence scores, as shown in Fig. 2. We denote the video sequence as V. For each detected object, the class is represented as C, the localization
information is represented as $x_1, x_2, y_1, y_2$, and F is the confidence score. Thus, if there are K objects in the m-th frame, the formula for the frame is:

$$ V^m_K = \left\{ (C, x_1, x_2, y_1, y_2, F)_{k,m} \mid k = 1, \dots, K \right\} \tag{1} $$

We exploit a pose-based action classification architecture for AED, inspired by skeleton-based human action recognition [39] and multiple human tracking [10]. Pose information is estimated and fed into an ST-GCN [14] model, which is advantageous for analyzing non-grid data, to cluster abnormal human actions. The pre-trained pose estimator is based on the AlphaPose network [40], which can capture the 17 landmarks of the detected humans. To decrease the influence of external factors on testing video sequences, such as background illumination, objects’ contours and different viewpoints, we capture the latent features from the estimated body graphs and recognize the different human behaviours. Each person is represented by a pose graph containing two essential factors: nodes and edges. Nodes are the physical locations under the coordinate system, and edges are the vectors between two nodes. The m-th frame’s output after pose estimation is:

$$ G^m_J = \left\{ [m, j, x_l, y_l, c_l] \mid j = 1, \dots, J;\ l = 1, \dots, 17 \right\} \tag{2} $$

There are J persons detected. The outputs of the detected video sequences are 3-D lists, which contain the frame numbers, the detected person’s identification and physical localization information. Fig. 3 visualizes the pose graphs of two detected persons with different events: walking and cycling. The movement trajectory of each detected person can be described as temporal information according to the pose estimation of video sequences. Meanwhile, the localization of detected persons in each frame is the spatial information for learning.

Fig. 3. The body joints information for human behaviours: the red box represents cycling and the black box represents walking. Best viewed in colour.

3.1.2. Initial Fusion

Graph convolutional neural networks (GCNs) are well suited to non-grid data such as body joints. Thus, there are plenty of related works for human-related action recognition. However, for AED, non-human-related abnormal events cannot be detected this way. Meanwhile, the extracted pose results are crucial when classifying human activities. If a person’s body is partly occluded by other objects or blurred due to fast movement, the extracted pose information will lead to poor AED performance. Thus, we introduce class information to strengthen the human pose information and reduce the influence of other familiar objects in the traffic surveillance system. In the m-th frame, if there are K objects detected by the object detector and J human poses extracted by pose estimation:

$$ G^m_T = \left( G^m_J \cap V^m_K \right) \tag{3} $$

There are two parts in the early data-level fusion step: one is to strengthen the confidence of detected persons, and the other is to remove some occluded pose graphs during pose estimation. We set B ∈ [bicycle, skateboard, car, truck, person], and $F^m_k$ is the confidence of the k-th object in the m-th frame.

$$ S^m_k = \begin{cases} V^m_k & \text{if } C^m_k \in B \ \&\ F^m_k > \varepsilon \\ 0 & \text{otherwise} \end{cases} \tag{4} $$

Algorithm 1: Initial fusion for pose graph

3.1.3. ST-GCN and Implementation

With the pose information of each detected person from the pre-trained pose estimation on the testing dataset, a spatio-temporal graph convolutional network (ST-GCN) is constructed to extract the latent features of normal samples in training and distinguish abnormal events in the test step. The physical connection in the localisation of body pose graphs is analysed in the spatial domain, while the movement of keypoints is analysed across consecutive frames in the temporal domain. In the m-th frame of a specific video sequence, if there are K detected poses, each pose graph is represented as $\Gamma^m_k = \{N^m_k, E^m_k\}$, where $N^m_k = \{\mu^m_{j,k} \in S^D \mid j = 1, \dots, 17\}$ is the set of nodes of the k-th pose in the m-th frame. S is the set of all pose keypoints, while D is the dimension of the joints; it equals 2 or 3 when the poses are in two or three dimensions. $E^m_k = \{\epsilon^m_{j,k}\}$ is the set of edges, which describe the connections of neighbouring keypoints. When considering the time domain, the nodes from the neighbouring frames are represented as $C(\mu^m_{j,k})$. With these settings, the formula of the graph convolutional network is [41]:

$$ p\left(\mu^m_{j,k}\right) = \sigma\left( \mu^m_{j,k} W^m_0 + \sum_{\mu^m_k \in C(\mu^m_{j,k})} \frac{1}{\gamma_{j,k}}\, \mu^m_k W^m_1 \right) \tag{5} $$

where σ is an activation function, $W^m_0$ and $W^m_1$ are two trainable weights, and $\gamma_{j,k}$ is a normalization variable. The adjacency matrix $A_{p,q}$ gives the weights of the connection between $\mu^m_p$ and $\mu^m_q$. A fixed matrix A is set and
used for all frames, and the model is implemented with the following formula [41]:

$$ \mathrm{GCN}(N^m) = \Lambda^{-\frac{1}{2}} (A + I) \Lambda^{-\frac{1}{2}} N^m W^m \tag{6} $$

where Λ is the degree matrix and I is the identity matrix representing self-connections. After the GCN implementation between the neighbouring keypoints in the spatial domain, the pose graph relationships between time t and t′ with the same identification should be considered. The next step is analysing and clustering the spatio-temporal features from the neural network.

3.1.4. Deep Embedded Clustering

A hand-crafted action cluster is set, which contains K actions, such as hand waving, jumping, running, chasing, and falling. Supposing the features extracted from ST-GCN [14] are $G'^m$, the probability of pose graph y belonging to cluster k is represented as:

$$ p^m_k = P(y = k) = \frac{\exp\left(\Theta_k G'^m\right)}{\sum_{k'=1}^{K} \exp\left(\Theta_{k'} G'^m\right)} \tag{7} $$

where Θ are the parameters of the clustering model. The main task of the clustering model is to construct the probability distribution of normal samples. As mentioned above, the ST-GCN [14] is trained in a semi-supervised manner, which means the training data are all normal events. Furthermore, training is optimised by minimising the Kullback–Leibler (KL) divergence between the clustering probability distribution P and the distribution of the detected objects Q. The loss function of the clustering model is:

$$ L_c = \mathrm{KL}(Q \,\|\, P) = \sum_a \sum_b q_{ab} \log \frac{q_{ab}}{p_{ab}} \tag{8} $$

where a and b indicate the a-th pose graph sample assigned to the b-th cluster, $p_{ab}$ is the detected classification probability, and $q_{ab}$ is the true target probability.

3.2. Motion-based Prediction Model

The enhanced action-based classification stream is proposed in both spatial and time domains. The latent vector is represented with a fully connected layer, followed by a deep clustering model that builds a dictionary for human behaviours [42]. However, some problems remain with the classification stream. The pose-level algorithm relies too heavily on pose estimation results and cannot capture abnormal behaviours with tiny movement changes, such as skiing. Inspired by [22], we added a future frame prediction model as the motion stream of our framework. The prediction stream leverages adversarial learning with the U-Net network as the generator to discriminate the normality of videos at the frame level. Optical flow constraints of each frame are also calculated to enhance the motion information in the temporal domain. The principle of next-frame prediction for AED is to generate the next frame according to the historical trajectory and compare it with the ground truth. Since the training data are all normal, the frames containing abnormal events would have larger differences than normal frames in the testing step. Training is optimised to decrease the intensity loss, gradient loss, adversarial training loss and optical flow loss. Suppose the m-th frame in the video is $V^m$, and the predicted m-th frame is $\hat{V}^m$.

3.2.1. Loss Functions

To ensure the predicted frame is close to its ground truth, we need to reduce the distance in intensity and gradient. The intensity loss makes the pixels of the generated frame close to the ground truth at the RGB level. The formula for intensity loss is as follows, where M denotes the number of frames:

$$ L_i = \sum_{m=1}^{M} \left\| \hat{V}^m - V^m \right\|_2^2 \tag{9} $$

Although the intensity loss can capture the major pixel changes in images, the generated prediction frames would be blurred if only intensity loss were applied. Thus, the gradient loss is added to the generator training loss function to sharpen the predicted results. The formula is shown below:

$$ L_g = \sum_{m=1}^{M} \sum_{i,j} \left( \left\| \left| \hat{V}^m_{i,j} - \hat{V}^m_{i-1,j} \right| - \left| V^m_{i,j} - V^m_{i-1,j} \right| \right\|_1 + \left\| \left| \hat{V}^m_{i,j} - \hat{V}^m_{i,j-1} \right| - \left| V^m_{i,j} - V^m_{i,j-1} \right| \right\|_1 \right) \tag{10} $$

where i and j are the pixel indices of the frame.

A CNN-based optical flow estimation algorithm, FlowNet [26], is used to track the movement of multiple targets. In the prediction stream, the optical flow results can strengthen the relationship of moving objects in the temporal domain and are also sensitive to the movement of small objects. Here f denotes the FlowNet function, which is pre-trained with fixed weights and parameters. The optical flow loss is shown below:

$$ L_f = \left\| f\left(\hat{V}^{m+1}, V^m\right) - f\left(V^{m+1}, V^m\right) \right\|_1 \tag{11} $$

3.2.2. Adversarial training

In the prediction stream, we use a U-Net-based autoencoder as the generator and a patch-level discriminator. U-Net [43] can reduce the number of training samples required and improve the reconstruction performance, and the generative adversarial network is also efficient for image reconstruction. The generator is optimised to reduce the intensity, gradient, optical flow and adversarial learning losses. The task of the generator is to generate a fake sample corresponding to the probability distribution of real data. Since the neural network is trained in a semi-supervised manner, which means there are only normal samples in the training data, the adversarial loss reduces the distance between the generated samples and class 1. Eq. 12 is shown below:

$$ L_a = \sum_m \sum_{i,j} \left( D(\hat{V})_{i,j} - 1 \right)^2 \tag{12} $$

Finally, the overall generator loss is obtained after assigning a fixed set of weights to the individual losses. Algorithm 2 specifically describes the generator loss calculation process. The task of the discriminator is to distinguish real data from fake data. The loss function of the discriminator is given in Eq. 13:

$$ L_d = \sum_m \sum_{i,j} \left( \left( D(\hat{V})_{i,j} - 0 \right)^2 + \left( D(V)_{i,j} - 1 \right)^2 \right) \tag{13} $$

In traditional semi-supervised learning, the model is trained with a large amount of unlabelled data and a small amount of labelled data. However, in the abnormal event detection scenario, another form of semi-supervised learning, often called one-class classification, is used. The model is trained exclusively with normal data (labelled), and it is then expected to detect instances in the test data, which contains both normal and abnormal instances, that deviate significantly from this normal behaviour. In the training settings, the parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ in
Algorithm 2, which act as the weights of the intensity loss, gradient loss, optical flow loss, and adversarial training loss, are empirically selected as 1, 1, 2 and 0.05, respectively.

Algorithm 2: Generator loss of prediction stream

3.3. Final Fusion Steps

According to the results of object detection, the equation for the normality scores in frame m is:

$$ N^C_m = \frac{\sum_{k=1}^{K} R^m_k F^m_k W_i}{U} \tag{14} $$

where U represents the area of the frame and $W_i$ are the weights of the different types of abnormal objects. The first abnormal class is bicycle, which can be easily detected in high-resolution datasets or sparse environments. The second abnormal class is skateboard, which is hard to recognize due to its small size and low confidence. The third abnormal class comprises cars and trucks, which are easily detected in all test clips. Since these vehicles occlude the drivers, the pose estimator cannot detect this abnormal driving event. Meanwhile, the normality scores of ST-GCN and deep clustering are represented as $N^P$, and the normality scores of the action classification stream are represented as:

$$ N^A_m = N^P_m + N^C_m \tag{15} $$

For the motion-based prediction stream, AED compares the difference between the predicted frames and the ground truth. Peak Signal-to-Noise Ratio (PSNR) is a common evaluation metric to judge the normality of the images. The PSNR values are calculated at the frame level:

$$ \mathrm{PSNR}(V, \hat{V}) = 10 \log_{10} \frac{(V_{\max})^2}{\frac{1}{T} \sum_{t=1}^{T} (V - \hat{V})^2} \tag{16} $$

A higher PSNR indicates that an abnormal event is less likely in the testing frame. Then, the normality scores from the action classification stream $N^A$ and the motion prediction stream $N^M$ are each normalized according to:

$$ \tilde{N} = \frac{N - N_{\min}}{N_{\max} - N_{\min}} \tag{17} $$

After normalization, the final normality score $N^F$ is represented as:

$$ N^F = \delta \tilde{N}^A + (1 - \delta) \tilde{N}^M \tag{18} $$

where δ is the stream weight, which is set per dataset.

Algorithm 3: Final fusion for classification and prediction stream

The final classification is as follows:

$$ \mathrm{Frame}(m) = \begin{cases} \text{Normal} & \text{if } N^F_m > \zeta \\ \text{Abnormal} & \text{otherwise} \end{cases} \tag{19} $$

where ζ is a threshold to determine the abnormal events from video sequences.

4. Experiments

In this section, we introduce the AED datasets tested in our proposed framework and then elaborate on the evaluation metrics and the detailed settings of our experiment implementation. Next, we present experimental results using validation sequences from the datasets UCSD PED1 & PED2 [8], ShanghaiTech Campus (SHTC) [22] and CUHK Avenue (Avenue) [27] to investigate the effectiveness of different components in the proposed framework and the influence of different abnormal events
on the overall detection performance. Then, we explore the accuracy of finding abnormal events in specific video sequences. Lastly, we compare the proposed method with other state-of-the-art works [44,19,45–47,15,29,23,24,30] and discuss some failure cases for AED.

Table 1
Comparison in detection efficiency on SHTC dataset.
Model                    Parameters (millions) (↓)   Execution time (s) (↓)   FPS (↑)
rGAN [49]                19.0                        957                      2.1
MPN [50]                 12.7                        12                       166.8
Classification stream*   0.79                        164                      12.3
Fusion model*            89                          1732                     2.7

4.1. Datasets

Here we briefly introduce the datasets used in our experiments. Fig. 4 shows the number of abnormal events in the different datasets, which can help analyze the validity of different models when detecting different abnormal events. Fig. 5 quantitatively illustrates the proportion of abnormal and normal events in each dataset. Some image samples are shown in Fig. 6.

Fig. 4. Illustration of the abnormal events distribution in the UCSD PED1, UCSD PED2, SHTC and Avenue datasets. The UCSD PED1 & PED2 datasets mainly include abnormal objects. The Avenue dataset mostly includes abnormal actions.

Fig. 5. Illustration of the proportion of normal and abnormal events distribution in the UCSD PED1, UCSD PED2, SHTC and Avenue datasets.

• The UCSD datasets have two parts, PED1 and PED2, both recorded from a fixed viewpoint in low resolution. PED1 has 34 training video sequences and 36 testing video sequences. PED2 has 16 training video sequences and 12 testing video sequences. The main abnormal events include bicycle, car, skateboarding and walking on the grass. The main abnormal events in PED2 are the same as those in PED1, except for walking on the grass, so there are only strange objects in PED2.
• The ShanghaiTech campus dataset contains 330 training video clips and 109 testing video clips with 13 different scenes. There are abnormal behaviours such as chasing, jumping and fighting and strange objects such as trucks, bicycles and skateboards.
• The CUHK Avenue dataset contains 16 training video clips and 21 testing video clips with the same background. Most abnormal events are abnormal behaviours such as throwing, loitering and fast running.

In Fig. 5, there is a slight imbalance towards abnormal events in the PED1 and SHTC testing datasets. On the other hand, there is a significant disparity in the number of abnormal and normal events in the PED2 and Avenue testing datasets. PED2 has a larger proportion of abnormal events, while Avenue has a larger proportion of normal events. The distribution of abnormal events in the Avenue dataset more closely resembles real-life scenarios, where abnormal events are relatively rare or short in duration. This data imbalance presents a challenge for abnormal event detection algorithms. When training datasets have too few abnormal events, it can impact the ability of the algorithms to detect and classify such events accurately. Similarly, when testing datasets have an insufficient number of abnormal events, it can affect the reliability and accuracy of the detection results. Addressing data imbalance is of utmost importance to ensure the effectiveness and generalizability of abnormal event detection models. Techniques such as data augmentation, oversampling, and class weighting [7,16,48] can help to mitigate the impact of data imbalance and improve the performance of abnormal event detection models.
Fig. 6. Illustration of some abnormal frames in the UCSD PED1, UCSD PED2, SHTC and Avenue datasets. The red bounding box denotes different abnormal objects: bicycles, trucks, skateboards, and strange actions. The colour version is better for understanding the figure.
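As an illustration of the class weighting mentioned in the discussion of data imbalance, the following minimal Python sketch computes inverse-frequency weights from per-class frame counts. The function name and the frame counts are hypothetical and are not taken from the datasets or from the method of this paper; the sketch only shows how a rarer class receives a proportionally larger weight.

```python
def inverse_frequency_weights(counts):
    """Return one weight per class, proportional to 1 / class frequency.

    The weights are scaled so that the weighted sample count equals the
    real sample count (i.e. the average weight over all samples is 1).
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Hypothetical frame counts, chosen only for illustration.
frame_counts = {"normal": 9000, "abnormal": 1000}
weights = inverse_frequency_weights(frame_counts)

for label, weight in weights.items():
    print(f"{label}: weight = {weight:.3f}")
```

With these assumed counts, the abnormal class is nine times rarer than the normal class and therefore receives a weight nine times larger.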
In real-life applications, all these datasets can be instrumental for training and evaluating abnormal event detection models for surveillance applications. The UCSD datasets [8] are particularly valuable for scenarios where vehicles are not expected, such as pedestrian zones or specific parks. However, the UCSD datasets [8] only contain abnormal objects within a fixed scene and are of low resolution. The diverse nature of abnormal events in the SHTC dataset [22] makes it suitable for general-purpose surveillance applications, such as monitoring public spaces in an outdoor environment. Although the SHTC dataset [22] provides several different scenes, these are still specific to a campus environment. Consequently, models trained on this dataset might not generalize well to other environments. Furthermore, the dataset relies on human annotation, which might be prone to errors. The annotations are binary (normal/abnormal) and do not provide detailed categories or localization of anomalies, limiting the range of detection tasks that can be performed. Similarly, the CUHK Avenue dataset [27] has the same limitations as SHTC [22] in that all videos are captured from a single campus location. Moreover, abnormal events are primarily focused on relatively simple, human-centric behaviours, such as loitering or running. This makes the CUHK Avenue dataset [27] particularly applicable for contexts like assisted living and healthcare monitoring, where the central focus is on human behaviour. (See Table 1).

Fig. 7. The AUC performance of the baseline [9] published in the conference ICASSP 2022 and the proposed two-stream fusion method.

4.2. Experimental Settings, Parameters and Evaluation Metrics
We split the data into a training set, which contains only normal samples, and a testing set, which contains both normal and abnormal samples. A 2D pose estimation model extracts human pose information from every clip. Moreover, the object detector makes a direct detection. The threshold of the object detection model ε in Eq. 4 is 0.2. In the early fusion, the IoU threshold ∊ in Algorithm 1 is set to 0.9 to ensure the accuracy of detected pose graphs. In the ST-GCN network, the training patch size is 16, and the autoencoder's learning rate and batch size are 0.0001 and 512, respectively. The number of frames for the training segment sliding window is 12, and the stride is 8. We set K to 10 clusters for action classification in the clustering model. In the final fusion, the normality scores from the classification and prediction models are normalised before calculation, and the stream weights δ in Eq. 18 are 0.2 for UCSD PED1 & PED2 and 0.8 for SHTC & Avenue.
The Receiver Operating Characteristic (ROC) curve is a common graphical plot that illustrates binary classification performance as the threshold on the normality scores is varied. The Area Under the Curve (AUC) is a scalar equal to the area under the ROC curve, with the false positive rate (FPR) as the x-axis and the true positive rate (TPR) as the y-axis. Meanwhile, F1 scores are calculated when counting the number of abnormal events in different datasets.

$$\mathrm{TPR} = \frac{TP}{P} = \frac{TP}{TP + FN} \tag{20}$$

$$\mathrm{FPR} = \frac{FP}{N} = \frac{FP}{FP + TN} \tag{21}$$

$$F1 = \frac{2TP}{2TP + FP + FN} \tag{22}$$

where TP and P represent, respectively, the true positive and real positive (abnormal) samples in video sequences, FN represents the false negative; FP and N represent the false positive and real negative (normal) samples, respectively, and TN represents the true negative in video sequences. Higher AUC and F1 scores indicate better anomaly detection performance.

Fig. 8. ROC curves on the UCSD PED2 dataset using the classification, prediction, and fusion models. Best viewed in colour.

Fig. 9. ROC curves on UCSD PED1 dataset using classification, prediction, and fusion models. Best viewed in colour.

Fig. 10. ROC performance under classification, prediction, and fusion models on SHTC datasets. Best viewed in colour.

Table 2
Comparison in AED with different object detectors.
Detectors         AUC-X    AUC-XY   AUC-XYZ
DAMO-YOLO [51]    0.698    0.777    0.786
RTMDet [52]       0.673    0.772    0.780
YOLOv3*           0.708    0.819    0.838
X: The classification stream only with object information.
XY: The enhanced classification stream with pose and object information.
XYZ: The enhanced fusion framework with classification and prediction streams.

4.3. Performance Analysis
In this section, we evaluate the effectiveness of our proposed method in terms of detection efficiency, an ablation study of the proposed components, and the effects of different abnormal events. Because the types of abnormal events differ, the specific normality scores of each frame in video sequences are also analysed. Furthermore, we count the number of abnormal events without including events starting and ending.

4.3.1. Detection Efficiency
In order to evaluate the detection efficiency of the proposed abnormal event detection algorithms, a comparison was conducted with two recent AED models: rGAN [49] and MPN [50]. These two methods are based on meta-learning, which is efficient for training and testing for AED. The evaluation was performed across three dimensions: parameter count, execution time, and frames per second (FPS).
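The three efficiency dimensions named in Section 4.3.1 (parameter count, execution time and FPS) can be measured along the following lines. This is a minimal sketch under stated assumptions, not the measurement code used for Table 1: `count_parameters`, `measure_throughput`, the toy tensor shapes, the dummy frames and `toy_model` are all hypothetical stand-ins for a real network and a real video clip.

```python
import time


def count_parameters(tensor_shapes):
    """Total number of scalar parameters, given the shape of each tensor."""
    total = 0
    for shape in tensor_shapes:
        n = 1
        for dim in shape:
            n *= dim
        total += n
    return total


def measure_throughput(process_frame, frames):
    """Run process_frame on every frame; return (seconds, frames per second)."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    fps = len(frames) / elapsed if elapsed > 0 else float("inf")
    return elapsed, fps


# Hypothetical stand-ins for a model and a test clip.
toy_shapes = [(3, 64), (64,), (64, 10), (10,)]      # weights and biases
dummy_frames = [[0.0] * 256 for _ in range(200)]    # 200 fake "frames"


def toy_model(frame):
    # Placeholder per-frame computation.
    return sum(frame) / len(frame)


n_params = count_parameters(toy_shapes)
elapsed, fps = measure_throughput(toy_model, dummy_frames)
print(f"parameters: {n_params / 1e6:.6f} million")
print(f"execution time: {elapsed:.4f} s, FPS: {fps:.1f}")
```

Execution time and FPS depend on the hardware and on the number of frames processed, so values measured this way are only comparable between models when they are obtained on the same machine and the same clips.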
The results of the parameter count analysis demonstrate that the proposed classification stream, which is driven by pose information, exhibits a notable advantage in terms of model complexity and resource utilization. With a significantly lower parameter count compared to the baseline models, this stream offers potential benefits in terms of efficiency and resource allocation. Considering execution time, the comparison reveals that the rGAN [49] model requires a relatively longer execution time, while the MPN [50] model demonstrates significantly faster performance. In this context, the proposed classification stream falls in between, with an execution time of 164 s. This suggests that the proposed model strikes a balance between computational efficiency and detection accuracy, offering improved performance compared to rGAN [49] while being outperformed by MPN [50] in terms of execution time. Furthermore, the evaluation of frames per second (FPS) provides insights into the capabilities of the models. The proposed classification stream achieves an FPS of 12.3, indicating its ability to process video frames at a faster rate than rGAN [49], although considerably lower than MPN [50]. It demonstrates promising detection performance while maintaining reasonable computational efficiency.
While the proposed fusion model incurs a higher computational cost compared to a single classification stream, its utilization of motion and appearance features contributes to improved detection accuracy. By combining multiple streams of information, the fusion model achieves enhanced performance and robustness in detecting abnormal events. These findings highlight the trade-offs between model complexity, execution time, and processing performance in abnormal event detection. The proposed classification stream shows promising results in terms of parameter efficiency and processing capabilities, primarily because it is a pose-centric model. Although the fusion model is more complex in terms of model parameters and execution time, it boasts superior detection accuracy due to the additional motion and appearance information. This means we can flexibly choose different models to handle different tasks, optimizing for either efficiency or accuracy depending on the application. Further investigations and optimizations can be pursued as future work to enhance the overall detection efficiency of the proposed algorithms.

4.3.2. Ablation Study
Fig. 7 compares the AUC performance of our work presented at ICASSP 2022 and the proposed method. The enhanced fusion framework outperforms the previous work across all tested datasets due to its ability to efficiently recognize and classify normal and abnormal movements using both motion and appearance features. The improvement observed on the PED1 and PED2 datasets is higher than that on the SHTC dataset. This is not only because SHTC is larger in scale than PED1 and PED2 but also because the primary abnormal events in the PED1 and PED2 datasets, such as driving and cycling, exhibit different motion features than walking. Figs. 8–10 detail the AUC performance of our component models on different individual abnormal events of the UCSD PED1 & PED2 and SHTC datasets. First of all, compared with the individual models, the fusion model performs the best on the tested datasets. The prediction stream is sensitive to object movements and robust to abnormal events. However, if the background environment is filled with crowded pedestrians, the prediction stream would not be effective for abnormal events. The improved classification stream can detect abnormal objects such as bicycles and cars and abnormal actions such as jumping and chasing. Nevertheless, the action classification stream does not capture tiny objects like skateboards well.
From Fig. 8, the AUC of the action-based classification stream is 92.7%, which is lower than 96.9% of the motion-based prediction stream. This is because the prediction stream is more sensitive to the partly moving targets, which are removed by pose estimation and early fusion steps at the beginning. Thus, the prediction stream can detect the start and end of abnormal events more accurately.
In addition, the fusion model can accurately capture the occluded abnormal vehicles such as trucks and cars. For the UCSD PED1 dataset, the AUC performance of the prediction stream is better than the classification stream, contrary to the ShanghaiTech Campus dataset. This is because the prediction model is more accurate in capturing partly moving targets, and the classification stream is more sensitive to high-resolution and densely populated datasets.
Testing the fusion model on the Avenue dataset shows that the action-based classification stream does not achieve a good AUC performance. After analyzing the Avenue dataset, we found that the ground truth labels are pixel-level and that the main types of abnormal events are loitering and walking in the wrong direction, which should not be considered anomalies.

Fig. 11. Normality scores for different abnormal events in video sequences: Bicycle, Skateboard, Car and Throwing. The first stream is action-based classification, and the second is motion-based prediction. Best viewed in colour.

Table 3
AUC performance on different anomalies on UCSD PED2/SHTC datasets.
                  Classification stream   Prediction stream   Fusion model
Bicycles          0.930/0.886             0.964/0.722         0.977/0.872
Cars              0.999/0.889             1/0.684             1/0.833
Skateboards       0.930/0.758             0.935/0.802         0.950/0.805
Strange actions   -/0.718                 -/0.762             -/0.772
Total             0.927/0.819             0.953/0.730         0.979/0.838

Table 4
F1 scores for different α parameters on UCSD PED1 dataset, which contains 55 abnormal events.
α     Truly detected   Miss detection   False alarm   Accuracy   F1 scores
0.2   51               4                29            0.927      0.773
0.4   49               6                9             0.891      0.867
0.6   49               6                1             0.891      0.933
0.8   46               9                0             0.836      0.911

Table 5
AUC Performance for abnormal event detection compared to state-of-the-art methods.
Method                   SHTC    PED2    PED1    Avenue
Luo et al. [44]          0.680   0.922   -       0.817
Conv-AE [19]             0.609   0.811   0.750   0.800
MDT [45]                 -       0.829   0.818   -
ConvLSTM-AE [46]         -       0.881   0.755   0.770
Abati et al. [47]        0.725   0.954   -       -
Markovitz et al. [15]    0.761   -       -       -
Yang et al. [29]         -       0.940   -       -
Zhou et al. [23]         -       0.960   0.839   0.860
Zhang et al. [30]        0.803   0.929   0.942   0.805
Yan et al. [7]           -       -       0.677   0.796
Li et al. [4]            0.717   -       -       0.820
Hyun et al. [2]          0.740   0.972   -       0.868
Proposed                 0.838   0.979   0.855   0.842

4.3.3. Analysis of Different Object Detectors
Table 2 demonstrates the detection performance of YOLOv3 [25] compared to DAMO-YOLO [51] and RTMDet [52] on the SHTC [22] dataset. The evaluation includes the classification stream utilizing only object information, the enhanced classification stream incorporating pose and object information, and the fusion model combining the classification and prediction streams. The results indicate that YOLOv3 outperforms DAMO-YOLO [51] and RTMDet [52] in all three scenarios. This suggests that YOLOv3 provides higher abnormal event detection performance when considering object information alone, as well as when incorporating additional pose information and in the fusion model.
It is worth noting that DAMO-YOLO [51] and RTMDet [52] are considered efficient and accurate object detectors for general object detection tasks. However, their strong object detection capabilities can sometimes interfere with abnormal event detection, particularly due to their robust ability to detect objects of various shapes and sizes. This interference can result in false alarms being detected as anomalies, thereby reducing the specificity and accuracy of the detection system. As
such, the simpler architecture of YOLOv3 [25] could, in fact, be an advantage, especially if the abnormal events are relatively simple or the background is less cluttered. To mitigate this issue, it is recommended to retrain the object detection model specifically for abnormal event detection. This retraining process could help focus the object detection model on relevant abnormal objects, reduce false alarms, and improve the overall performance and reliability of the abnormal event detection system.

4.3.4. Analysis of Different Abnormal Events
Fig. 11 shows the normality scores for abnormal events, such as cycling, skateboarding, driving and throwing. The red line represents the normality scores generated from the motion-based prediction stream, and the black line denotes the scores from the enhanced classification stream. The proposed AED algorithm has the best performance in car detection. This result is consistent with the expectation that large-scale objects and high-resolution images can effectively enhance the detection ability. For the normality scores of strange behaviours, if there is only one or a few moving targets, the motion-based prediction stream can predict the next frame accurately, as shown in the last panel of Fig. 11. In the second panel of Fig. 11, the normality score increases at the 170-th frame while the following frames are all abnormal in the ground truth. Decision errors occur due to the limitation of the recording system and the long distance between the camera and the pedestrians. Another error is in the last panel of Fig. 11, in which the classification stream did not find fast-running anomalies.
Table 3 reports the AUC performance of different types of abnormal events on the UCSD PED2 and SHTC datasets. Among the different abnormal events, cars have the highest AUC performance due to their appearance being different from pedestrians and their fast movement. The skateboard is the hardest to detect because of its tiny shape and because skateboarding is similar to walking. Since the SHTC dataset has complex and changing scenes with multiple pedestrians and is a large AED dataset compared with UCSD, the AUC performance of our fusion model on SHTC is 83.8%, much lower than that on UCSD PED2.

4.3.5. Accuracy of Finding Abnormal Events
To decrease the influence of missing detection of abnormal events starting and ending, we also set a hyperparameter to count the number of detected, missed and false alarm events and calculate the F1 scores on the UCSD PED1 dataset for our proposed fusion model. As shown in Table 4, the parameter α is the ratio of detected abnormal event frames to total abnormal event frames in each video sequence. The number of false alarms decreases as α increases. Meanwhile, the number of missed detections increases. The framework achieves the highest F1 score at α = 0.6, where the accuracy of truly detected abnormal events is 89.1%. This measurement is a reliable indicator of whether an abnormal event happened, compared with the frame-level AUC performance.

4.4. Comparisons with State-of-the-art Methods
Table 5 shows the AUC performance of the proposed fusion model compared with state-of-the-art methods on the SHTC, UCSD PED1 & PED2 and Avenue datasets. We achieved relatively high AUC performance on all public AED datasets. The common solutions for abnormal event detection are generative models such as Conv-AE [19] and ConvLSTM-AE [46]. These methods construct the normal event probability model and detect the outlier event in testing. However, most of the time, abnormal events happen in a small region, which is hard to detect at the frame level. The other solutions, such as Markovitz et al. [15], only focus on the human-related abnormal events and cannot handle occluded abnormal events, which makes them unsuitable for a crowded environment. Our fusion model achieves the highest AUC performance in UCSD PED2 and SHTC, which are 4.18% and 1.98% higher than the second-highest AED algorithms.

5. Conclusions
In this paper, we proposed a novel and unified AED framework to address different abnormal events in different models by fusing the pose and motion features with class information. An ablation study calculates the AUC performance of our proposed framework on different datasets. An analysis of different abnormal events is also illustrated to investigate the validity of different streams on different types of anomalies in detail. Furthermore, the number of abnormal events on the UCSD PED1 dataset is counted to discuss the influence of the starting and ending of abnormal events. The experimental results show that our proposed fusion framework has the best AUC performance compared with state-of-the-art work on UCSD PED2 and SHTC datasets. The proposed framework is suitable for abnormal events containing abnormal behaviours and objects. It is crucial to highlight that the proposed joint AED approach can detect abnormal events in a multi-target real-world setting. Future research will focus on the dynamic background and real-time processing challenges. Our pose-level algorithm is naturally suitable for background-agnostic abnormal event detection in surveillance videos. Meanwhile, compared with raw images as input, pose graphs are efficient in computational complexity, which makes online processing possible. Related applications in healthcare, surveillance and automated alarm are straightforward.

CRediT authorship contribution statement
Yuxing Yang: Conceptualization, Methodology, Software, Validation, Visualization, Writing - original draft. Zeyu Fu: Investigation, Writing - review & editing. Syed Mohsen Naqvi: Methodology, Supervision.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Data will be made available on request.

References
[1] Y. Wang, T. Liu, J. Zhou, J. Guan, Video anomaly detection based on spatio-temporal relationships among objects, Neurocomputing 532 (2023) 141–151.
[2] W. Hyun, W.J. Nam, S.W. Lee, Dissimilate-and-assimilate strategy for video anomaly detection and localization, Neurocomputing 522 (2023) 203–213.
[3] N. Li, J.-X. Zhong, X. Shu, H. Guo, Weakly-supervised anomaly detection in video surveillance via graph convolutional label noise cleaning, Neurocomputing 481 (2022) 154–167.
[4] N. Li, F. Chang, C. Liu, A self-trained spatial graph convolutional network for unsupervised human-related anomalous event detection in complex scenes, IEEE Transactions on Cognitive and Developmental Systems (2022), 1–1.
[5] S. Zhang, Z. Wei, J. Nie, L. Huang, S. Wang, Z. Li, A Review on Human Activity Recognition Using Vision-Based Method, Journal of Healthcare Engineering (2017) 1–31.
[6] Z. Fu, X. Lai, S.M. Naqvi, Enhanced detection reliability for human tracking based video analytics (2019).
[7] S. Yan, J.S. Smith, W. Lu, B. Zhang, Abnormal event detection from videos using a two-stream recurrent variational autoencoder, IEEE Transactions on Cognitive and Developmental Systems 12 (1) (2020) 30–42.
[8] A.B. Chan, N. Vasconcelos, Modeling, clustering, and segmenting video with mixtures of dynamic textures, IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 30(5) (2008) 909–926.
[9] Y. Yang, Z. Fu, S.M. Naqvi, A two-stream information fusion approach to abnormal event detection in video, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2022) 5787–5791.
[10] Z. Fu, F. Angelini, J. Chambers, S.M. Naqvi, Multi-level cooperative fusion of GM-PHD filters for online multiple human tracking, IEEE Transactions on Multimedia 21 (9) (2019) 2277–2291.
[11] Y. Du, W. Wang, L. Wang, Hierarchical recurrent neural network for skeleton based action recognition (2015).
[12] F. Angelini, Y. Jiawei, S.M. Naqvi, Privacy-Preserving Online Human Behaviour [44] W. Luo, W. Liu, S. Gao, A revisit of sparse coding based anomaly detection in
Anomaly Detection Based On Body Movements and Objects Positions (2019). stacked rnn framework, in: IEEE International Conference on Computer Vision
[13] N. Li, F. Chang, C. Liu, Human-related anomalous event detection via spatial- (ICCV), 2017.
temporal graph convolutional autoencoder with embedded long short-term [45] V. Mahadevan, W. Li, V. Bhalodia, N. Vasconcelos, Anomaly detection in crowded
memory network, Neurocomputing 490 (2022) 482–494. scenes, IEEE International Conference on Computer Vision and Pattern Recognition
[14] S. Yan, Y. Xiong, D. Lin, Spatial temporal graph convolutional networks for (2010).
skeleton-based action recognition., AAAI Conference on Artificial Intelligence [46] W. Luo, W. Liu, S. Gao, Remembering history with convolutional lstm for anomaly
(2018). detection, IEEE International Conference on Multimedia and Expo (2017).
[15] A. Markovitz, G. Sharir, I. Friedman, L. Zelnik-Manor, S. Avidan, Graph Embedded [47] D. Abati, A. Porrello, S. Calderara, R. Cucchiara, Latent space autoregression for
Pose Clustering for Anomaly Detection (2020). novelty detection, IEEE/CVF Conference on Computer Vision and Pattern
[16] Y. Yang, Z. Fu, S.M. Naqvi, Enhanced adversarial learning based video anomaly Recognition (CVPR) (2019).
detection with object confidence and position, 13th International Conference on [48] Z. Wang, Y. Zou, Z. Zhang, Cluster attention contrast for video anomaly detection,
Signal Processing and Communication Systems (ICSPCS) (2019). ACM International Conference on Multimedia (2020) 2463–2471.
[17] W. Sultani, C. Chen, M. Shah, Real-World Anomaly Detection in Surveillance [49] Y. Lu, F. Yu, M.K.K. Reddy, Y. Wang, Few-shot scene-adaptive anomaly detection,
Videos (2018). ECCV (2020).
[18] M. Ye, Q. Zhang, L. Wang, J. Zhu, R. Yang, J. Gall, A survey on human motion [50] H. Lv, C. Chen, Z. Cui, C. Xu, Y. Li, J. Yang, Learning normal dynamics in videos
analysis from depth data (2013) 149–187. with meta prototype network, in: the IEEE/CVF Conference on Computer Vision
[19] M. Hasan, J. Choi, J. Neumann, A.K. Roy-Chowdhury, L.S. Davis, Learning and Pattern Recognition (CVPR), 2021, pp. 15425–15434.
temporal regularity in video sequences, IEEE International Conference on [51] X. Xu, Y. Jiang, W. Chen, Y. Huang, Y. Zhang, X. Sun, Damo-yolo: A report on real-
Computer Vision and Pattern Recognition (2016). time object detection design, arXiv preprint arXiv:2211.15444v2 (2022).
[20] Y.S. Chong, Y.H. Tay, Abnormal event detection in videos using spatiotemporal [52] C. Lyu, W. Zhang, H. Huang, Y. Zhou, Y. Wang, Y. Liu, S. Zhang, K. Chen, Rtmdet:
autoencoder, CoRR (2017). An empirical study of designing real-time object detectors (2022). arXiv:
[21] M. Astrid, M.Z. Zaheer, S.-I. Lee, Pseudobound: Limiting the anomaly 2212.07784.
reconstruction capability of one-class classifiers using pseudo anomalies,
Neurocomputing 534 (2023) 147–160.
[22] W. Liu, W. Luo, D. Lian, S. Gao, Future frame prediction for anomaly detection – a
new baseline, IEEE Conference on Computer Vision and Pattern Recognition Yuxing Yang earned his Master’s degree from the School of
(CVPR) (2018). Engineering, Newcastle University, U.K., in 2018. Recently, he
[23] J.T. Zhou, L. Zhang, Z. Fang, J. Du, X. Peng, Y. Xiao, Attention-driven loss for defended his Ph.D. thesis with the Intelligent Sensing and
anomaly detection in video surveillance, IEEE Transactions on Circuits and Communications Research Group at the same institution. In
2021, he worked as a research assistant on an EPSRC IAA
Systems for Video Technology 30 (12) (2020) 4639–4647.
[24] X. Zeng, Y. Jiang, W. Ding, H. Li, Y. Hao, Z. Qiu, A hierarchical spatio-temporal project, which applied deep learning for multimodal human
graph convolutional neural network for anomaly detection in videos, CoRR security surveillance. His research interests include video
2112.04294 (2021). anomaly detection, human behavior analysis, and intelligent
[25] J. Redmon, A. Farhadi, Yolov3: An incremental improvement, arXiv (2018). surveillance systems.
[26] P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. van der
Smagt, D. Cremers, T. Brox, Flownet: Learning optical flow with convolutional
networks, CoRR 1504.06852 (2015).
[27] J.S.C. Lu, J. Jia, Abnormal event detection at 150 fps in matlab, in: IEEE
International Conference on Computer Vision (ICCV), 2013.
[28] M. Sabokrou, M. Khalooei, M. Fathy, E. Adeli, Adversarially learned one-class
classifier for novelty detection, IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (2018). Zeyu Fu received the B.Eng. (with First-Class Hons.) and Ph.D.
[29] Y. Yang, Y. Xian, Z. Fu, S.M. Naqvi, Video anomaly detection for surveillance based degrees from the School of Engineering, Newcastle University,
on effective frame area, in: IEEE 24th International Conference on Information U.K., in 2015 and 2019, respectively. During his PhD study, he
Fusion (FUSION), 2021. worked on a DSTL & EPSRC-funded project which develops
[30] S. Zhang, M. Gong, Y. Xie, A.K. Qin, H. Li, Y. Gao, Y.-S. Ong, Influence-aware algorithms for video-based multiple human tracking. At the
attention networks for anomaly detection in surveillance videos, IEEE Transactions same institution, he was a research assistant working on an
on Circuits and Systems for Video Technology (2022). MRC-CiC funded project which applies machine learning for
[31] R. Morais, V. Le, T. Tran, B. Saha, M. Mansour, S. Venkatesh, Learning Regularity ocular imaging. From 2020 to 2022, he was a postdoctoral
in Skeleton Trajectories for Anomaly Detection in Videos, in: Computer Vision and researcher at the Institute of Biomedical Engineering, Univer
Pattern Recognition (CVPR), 2019. sity of Oxford, U.K., working on two projects about AI for
[32] Q. Wang, Y. Liu, Z. Xiong, Y. Yuan, Hybrid feature aligned network for salient healthcare (NIH-funded CIFASD and ERC-adv funded PULSE).
object detection in optical remote sensing imagery, IEEE Transactions on He is currently a lecturer (assistant professor) in computer
Geoscience and Remote Sensing 60 (2022) 1–15. vision at the Department of Computer Science, University of
[33] Y. Liu, Q. Li, Y. Yuan, Q. Du, Q. Wang, Abnet: Adaptive balanced network for Exeter, U.K. His research interests include visual surveillance, machine learning, and
multiscale object detection in remote sensing imagery, IEEE Transactions on medical image analysis.
Geoscience and Remote Sensing 60 (2022) 1–14.
[34] A. Ali, P. Angelov, Anomalous behaviour detection based on heterogeneous data
and data fusion, Soft Computing 22 (2018).
Syed Mohsen Naqvi received the Ph.D. degree from Lough
[35] V. Chatzigiannakis, G. Androulidakis, K. Pelechrinis, S. Papavassiliou, V. Maglaris,
borough University, Loughborough, U.K., in 2010. He is
Data fusion algorithms for network anomaly detection: classification and
evaluation, in: International Conference on Networking and Services (ICNS), 2007.
[36] Z. Fu, S.M. Naqvi, J.A. Chambers, Collaborative detector fusion of data-driven PHD
filter for online multiple human tracking, in: International Conference on
Information Fusion (FUSION), 2018.
[37] K.B.-Y. Wong, T. Zhang, H. Aghajan, Data fusion with a dense sensor network for
anomaly detection in smart homes, Human Behavior Understanding in Networked
Sensing, Theory and Applications of Networks of Sensors (2014) 211–237.
[38] D. Lahat, T. Adali, C. Jutten, Multimodal data fusion: An overview of methods,
challenges, and prospects, Proceedings of the IEEE 103 (9) (2015) 1449–1477.
[39] F. Angelini, Z. Fu, Y. Long, L. Shao, S.M. Naqvi, 2D Pose-based Real-time Human
Action Recognition with Occlusion-handling, IEEE Transactions on Multimedia
(2019).
[40] Y. Xiu, J. Li, H. Wang, Y. Fang, C. Lu, Pose Flow: Efficient online pose tracking,
British Machine Vision Conference (BMVC) (2018).
[41] W. Luo, W. Liu, S. Gao, Graph convolutional neural network for skeleton-based
video abnormal behavior detection, Generalization with Deep Learning (2021)
139–155.
[42] Z. Chen, Y. Tian, W. Zeng, T. Huang, Detecting abnormal behaviors in surveillance
videos based on fuzzy clustering and multiple auto-encoders, in: IEEE International
Conference on Multimedia and Expo (ICME), 2015.
[43] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical
image segmentation, in: Medical Image Computing and Computer-Assisted
Intervention (MICCAI), 2015, pp. 234–241.

currently a Reader in Signal and Information Processing, the
Director of the Intelligent Sensing Laboratory, and the Deputy
Head of the Intelligent Sensing and Communications Research
Group, Newcastle University, Newcastle, U.K. His research
contributions have been in human action, activity, and behavior
analyses, multiple human target detection, localization, and
tracking, human speech enhancement and separation, and
explainable AI, all for defence and healthcare applications.
Dr Naqvi has more than 130 publications in peer-reviewed
articles in high-impact journals and proceedings of leading
international conferences. He organized special sessions in
FUSION (2013-2022), delivered seminars, and was a speaker at the University Defence
Research Collaboration (UDRC) Summer Schools (2015-2017). He was involved in more
than 15 research projects funded by UKRI and industry (e.g., EPSRC, BBSRC, MoD, UDRC,
Thales, Innovate U.K., NHS). He has also successfully supervised more than 20 Ph.D.
students to graduation, including the authors of this paper. He is a Senior Member of IEEE and a Fellow of the
Higher Education Academy. He was an Associate Editor of Elsevier Journal on Signal
Processing (2018-2022). He served two terms as Associate Editor of IEEE Transactions on
Signal Processing (2019-2023). He is an Associate Editor of IEEE/ACM Transactions on
Audio Speech and Language Processing (2019 to date).