Video Anomaly Detection in 10 Years: A Survey and Outlook
Moshira Abdalla, Sajid Javed, Muaz Al Radi, Anwaar Ulhaq, and Naoufel Werghi

(Corresponding author: Sajid Javed, sajid.javed@ku.ac.ae. M. Abdalla, S. Javed, M. Al Radi, and N. Werghi are with the Department of Computer Science, Khalifa University of Science and Technology, Abu Dhabi, UAE. A. Ulhaq is with the School of Engineering & Technology, Central Queensland University Australia, 400 Kent Street, Sydney 2000, Australia.)

Abstract—Video anomaly detection (VAD) holds immense importance across diverse domains such as surveillance, healthcare, and environmental monitoring. While numerous surveys focus on conventional VAD methods, they often lack depth in exploring specific approaches and emerging trends. This survey explores deep learning-based VAD, expanding beyond traditional supervised training paradigms to encompass emerging weakly supervised, self-supervised, and unsupervised approaches. A prominent feature of this review is the investigation of core challenges within the VAD paradigms, including large-scale datasets, feature extraction, learning methods, loss functions, regularization, and anomaly score prediction. Moreover, this review also investigates vision-language models (VLMs) as potent feature extractors for VAD. VLMs integrate visual data with textual descriptions or spoken language from videos, enabling a nuanced understanding of scenes crucial for anomaly detection. By addressing these challenges and proposing future research directions, this review aims to foster the development of robust and efficient VAD systems leveraging the capabilities of VLMs for enhanced anomaly detection in complex real-world scenarios. This comprehensive analysis seeks to bridge existing knowledge gaps, provide researchers with valuable insights, and contribute to shaping the future of VAD research.

Index Terms—Video Anomaly Detection, Weak Supervision, Reconstruction-based Techniques, Vision-Language Models, Video Surveillance.

I. INTRODUCTION

Anomaly detection aims to pinpoint events or patterns that stray from the typical or expected behavior within a given data modality [1], [2]. It holds diverse applications, spanning from fraud detection and cybersecurity for network intrusion detection to quality assurance in manufacturing, fault identification in industrial machinery, healthcare monitoring, and beyond. Our primary emphasis, however, revolves around video anomaly detection (VAD) [3], a pivotal technology that automates the identification of unusual or suspicious activities within video sequences.

While conventional methods for VAD have been extensively studied, the rapid advancements in deep learning techniques have opened up new avenues for more effective anomaly detection [4]. Deep learning algorithms, such as convolutional neural networks (CNNs) and vision transformers (ViTs), have shown remarkable capabilities in learning complex patterns and representations from large-scale data [5], [6]. These advancements have led to significant improvements in VAD performance, enabling more accurate and reliable detection of anomalies in video data. By leveraging deep learning techniques, researchers and practitioners have been able to develop innovative approaches that outperform traditional methods and address the limitations associated with traditional machine learning [7], [8].

While the majority of deep learning-based VAD systems rely on supervised learning paradigms, recent years have witnessed a surge of interest in exploring novel approaches, including weakly supervised, self-supervised, and unsupervised methods [9]–[11]. These alternative approaches present promising solutions to the challenges encountered by conventional VAD methods, such as the requirement for extensively annotated datasets and the complexity of capturing intricate spatiotemporal patterns. Through the utilization of deep learning techniques, researchers aspire to cultivate robust and efficient VAD systems adept at navigating diverse real-world scenarios and applications [1], [12]. Similarly, several research challenges are either not thoroughly explored or remain unexplored. For instance, a crucial consideration is the quality and diversity of available datasets. The development of emerging comprehensive datasets, such as the UCF-Crime dataset [11], the XD-Violence dataset [13], and the ShanghaiTech dataset [14], plays a pivotal role in this progress. These datasets encompass various types of anomalies, significantly contributing to the enhancement of VAD techniques.

In VAD, videos consist of sequences of frames, resulting in complex, high-dimensional spatiotemporal data. Detecting anomalies within such complex data necessitates methods capable of effectively capturing spatial, temporal, spatiotemporal, and textual features. To address these challenges, numerous VAD methods and deep feature extractors have been introduced in academic research, which has played a significant role in advancing the current state-of-the-art. It is also worth mentioning the importance of choosing a correct loss function based on the type of VAD paradigm; hence, loss functions and regularization techniques are another challenge in the VAD problem. A further significant challenge lies in the diverse paradigms, including self-supervised, weakly supervised, fully supervised, and vision-language models, utilized to address the VAD problem. These paradigms can be categorized into classical and deep learning approaches. The classical approaches employ handcrafted features, including spatiotemporal gradients [15], Histograms of Oriented Gradients (HOG) [16], [17], and Histograms of Optical Flows (HOF) [18], within a temporal cuboid. These features are selected for their effectiveness in capturing appearance and motion information in a spatiotemporal context. The deep learning approaches are more powerful in terms of rich feature representation and end-to-end learning and have gained much popularity in the past decade due to advancements in different learning models. Specifically, these approaches leverage powerful feature
extractors to capture meaningful spatiotemporal features. Examples include convolutional neural networks [19], autoencoders [20], GANs [21], vision transformers [22], [23], and vision-language models [24], [25].

In addition to the aforementioned inherent challenges within the VAD problem, each paradigm comes with diverse loss functions, spatiotemporal regularization/constraints, and anomaly score prediction components. We also identify these different modules in this survey and provide take-home messages for each of the abovementioned challenges. Figure 1 shows the performance variation of representative deep learning-based VAD methods on two publicly available benchmark datasets, UCF-Crime [11] and ShanghaiTech [14]. The performance is reported in terms of the area under the ROC curve (AUC%). The improvement trend in this figure underscores a significant advancement in deep learning methodologies over the past decade, with the most notable improvement achieved by the recently proposed vision-language model [26].

Fig. 1. Performance improvement from 2017 until 2023 on two popular benchmarks. Performance is measured by the area under the ROC curve (AUC%). Note that some of the models were developed before the datasets were created but were used after the creation by other researchers, such as [15], [16]. The other proposed models are: [23], [26]–[33].

Motivation: Despite the growing interest in VAD and the proliferation of deep learning-based approaches, there remains a need for a comprehensive survey that explores the latest developments in this field. Existing surveys often focus on traditional VAD methods and may not adequately cover emerging trends and methodologies such as vision-language models (VLMs) [33]–[36]. By delving into VLMs and deep learning-based VAD, including supervised, weakly supervised, self-supervised, and unsupervised approaches, this survey aims to provide a thorough understanding of the state-of-the-art techniques and their potential applications. Our main focus lies on deep learning-based solutions for VAD problems. Particularly, we explore emerging VAD paradigms that employ different datasets, feature extraction techniques, loss functions, and spatiotemporal regularization. In addition, we offer a comprehensive analysis of over 50 different methodologies employed in VAD, with a specific focus on the potential of textual features extracted with vision-language models. In this survey, our main contributions are summarized as follows:

• We identify the core challenges of the SOTA VAD paradigms, including the vision-language models, in terms of feature extraction, large-scale VAD training datasets, loss functions, spatiotemporal regularization, and video anomaly score prediction.

• We provide a comparative analysis through quantitative and qualitative comparisons of SOTA models on different benchmarking datasets. This analysis sheds light on the strengths and weaknesses of existing methodologies, offering valuable insights for researchers and practitioners in the field.

• Following our analysis, we outline our proposed recommendations for addressing the open challenges in VAD. These recommendations are informed by our deep understanding of the current landscape of VAD research and aim to guide future research directions toward overcoming existing limitations and advancing the SOTA.

The structure of this paper is as follows: Section II outlines the methodology used to select the research studies included in this review over the past decade. Section III examines previous surveys conducted in the field of video anomaly detection. Section IV presents the formulation of the video anomaly detection problem. Section V presents core VAD challenges. Section VI outlines the experimentation and evaluation protocols used. Section VII provides a comparative analysis through quantitative and qualitative comparisons of state-of-the-art models. Section VIII provides a visualization of bibliometric networks for thematic analysis. Finally, we conclude our work and present additional future directions in Section IX.

II. REVIEW METHODOLOGY

This survey paper exclusively considers research directly related to "Video Anomaly Detection." The survey is conducted systematically, utilizing publications from top computer vision venues, including CVPR (Conference on Computer Vision and Pattern Recognition), ICCV (International Conference on Computer Vision), ECCV (European Conference on Computer Vision), IEEE TPAMI (Transactions on Pattern Analysis and Machine Intelligence), IJCV (International Journal of Computer Vision), and CVIU (Computer Vision and Image Understanding). The objective of this review paper is to present the state-of-the-art in Video Anomaly Detection. Accordingly, this study focuses on identifying published research concerning the implementation of various computer vision and deep learning-based methods to address the challenges of VAD. The reviewed works span the last decade, covering over 50 articles.

III. PREVIOUS LITERATURE REVIEW

In the field of anomaly detection, several survey papers have been published in the past decade [1], [4], [7], [8], [37]. The first survey paper, published in 2018 by [8], focused on deep learning techniques for VAD, with particular attention to unsupervised and semi-supervised methods. Their classification of models included three distinct categories: reconstruction-based, spatio-temporal predictive, and generative models. Notably, this paper preceded the prominent Multiple
Instance Learning (MIL) approach, resulting in the omission of weakly supervised methods from their study.

Another significant study by Chalapathy et al. [37] highlighted the potential of deep anomaly detection methods in addressing various detection challenges. Their work covered diverse application areas such as the Internet of Things, intrusion detection, and surveillance videos. They further classified anomalies into collective, contextual, and point anomalies, and categorized deep learning methods into four primary classes: unsupervised, semi-supervised, hybrid, and One-Class Neural Network models.

In another survey, Ramachandra et al. [1] focus primarily on detecting anomalies within a single scene while also highlighting the differences from multi-scene anomaly detection. An important distinction lies in the fact that single-scene VAD may involve anomalies dependent on specific locations, whereas multi-scene detection cannot. The survey also sheds light on benchmark datasets employed for single-scene versus multi-scene detection and the associated evaluation procedures. In a broader context, the survey classifies previous research in video anomaly detection into three main categories: distance-based, probabilistic, and reconstruction-based approaches.

In a different work, Nayak et al. [7] categorize learning frameworks into four main categories: supervised, unsupervised, semi-supervised, and active learning. In the context of deep-learning-based VAD, state-of-the-art methods fall into several distinct categories, including trajectory-based methods, global pattern-based methods, grid pattern-based methods, representation learning models, discriminative models, predictive models, deep generative models, deep one-class neural networks, and deep hybrid models. Moreover, the research provides a comprehensive analysis of performance evaluation methodologies, covering aspects such as the choice of datasets, computational infrastructure, evaluation criteria, and performance metrics.

Another recent work by Pang et al. [4] focused on deep learning techniques for anomaly detection and explored various challenges within the anomaly detection (AD) problem. These challenges included issues such as class imbalance in the data, complex anomaly detection, the presence of noisy instances in weakly supervised AD methods, and more. They discussed how deep learning methods offer solutions to these diverse challenges. To structure their analysis, Pang et al. introduced a hierarchical taxonomy for categorizing deep anomaly detection methods. This taxonomy comprised three principal categories: deep learning for feature extraction, learning feature representations of normality, and end-to-end anomaly score learning. Additionally, they provided 11 fine-grained subcategories from a modeling perspective.

Previous research efforts have overlooked the critical importance of employing diverse feature extractors within deep learning models, especially in the context of video anomaly detection. This oversight becomes glaringly apparent when considering emerging trends such as the widespread adoption of transformers and vision-language pretraining models within video anomaly detection. These innovations have significantly influenced the overall performance of deep learning models.

In our survey paper, we aim to address these research gaps that have not been adequately outlined in previous surveys. Furthermore, our study strongly emphasizes the latest trends in VAD, ensuring that our analysis remains current and relevant in the rapidly evolving landscape of anomaly detection research. We go beyond simply summarizing existing literature; instead, we explore recent advancements that build upon the foundation laid by earlier survey papers in this field. By synthesizing this wealth of information, we provide readers with a comprehensive overview of the state-of-the-art in VAD.

Additionally, we classify learning and supervision methods into four distinct groups: supervised, unsupervised, self-supervised, and weakly supervised techniques. This classification scheme allows us to systematically analyze and compare different approaches, providing readers with valuable insights into the diverse methodologies employed in video anomaly detection research. By organizing our study in this manner, we aim to facilitate a deeper understanding of the underlying principles and techniques driving advancements in this field.

IV. DEFINING THE VIDEO ANOMALY DETECTION PROBLEM

Anomalies within video data encompass events or behaviors that notably diverge from anticipated or regular patterns. The primary goal of video anomaly detection is to devise and implement resilient algorithms and models capable of autonomously identifying and flagging these anomalies in real time. This involves converting raw video data into interpretable feature representations utilizing robust feature extractors adept at capturing both spatial and temporal characteristics. Additionally, it necessitates the selection of appropriate algorithms or techniques and the establishment of effective evaluation metrics to assess detection performance accurately.

In the context of supervised learning scenarios where frame-level labels are available, the video anomaly detection problem can be succinctly described as follows. We represent each video as $V_i$, which consists of a sequence of frames $\{f_{i,1}, f_{i,2}, \ldots, f_{i,n}\}$. From each frame, we extract essential feature representations, denoted by $x_{i,j}$. Define a model $M$ that takes the features extracted from each frame and produces an anomaly score for that frame. For each frame $f_{i,j}$, the anomaly score is $S(f_{i,j}) = M(x_{i,j})$. The total anomaly score for video $V_i$ is the sum of the scores of its frames:

$$S(V_i) = \sum_{j=1}^{n} S(f_{i,j}) \qquad (1)$$

Comparing this anomaly score $S(V_i)$ with a predetermined threshold $T$, we can define the predicted binary label $\hat{Y}_i$ for the video as:

$$\hat{Y}_i = \begin{cases} 1 & \text{if } S(V_i) \geq T \\ 0 & \text{if } S(V_i) < T \end{cases}$$

where 1 indicates that $V_i$ is anomalous and 0 indicates that $V_i$ is normal. The true label of the video is represented as $Y_i \in \{0, 1\}$. The objective is to train the model $M$ such that the difference between $\hat{Y}_i$ and $Y_i$ is minimized for all videos in the training set and generalizes well to unseen videos.
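To make this formulation concrete, the following minimal sketch (Python/NumPy) computes per-frame scores, aggregates them according to Eq. (1), and thresholds the video-level score. The score_model callable is a hypothetical stand-in for the trained model M, and the threshold T is assumed to be tuned on validation data:

```python
import numpy as np

def predict_video_label(frame_features, score_model, threshold):
    """Frame-level scoring and video-level decision for one video V_i.

    frame_features : (n, d) array, one feature vector x_{i,j} per frame.
    score_model    : callable mapping a feature vector to a score in [0, 1];
                     a stand-in for the trained model M.
    threshold      : video-level threshold T.
    """
    # Per-frame anomaly scores S(f_{i,j}) = M(x_{i,j}).
    frame_scores = np.array([score_model(x) for x in frame_features])

    # Video-level score from Eq. (1): S(V_i) = sum over frames of S(f_{i,j}).
    video_score = frame_scores.sum()

    # Predicted binary label: 1 (anomalous) if S(V_i) >= T, else 0 (normal).
    y_hat = int(video_score >= threshold)
    return frame_scores, video_score, y_hat
```

In practice, the per-frame scores themselves are what most benchmark protocols evaluate (see Section VI-B).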
V. ANALYZING VIDEO ANOMALY DETECTION: A SYSTEMATIC APPROACH

In this section, we dig deeper into the complexities of the VAD process by analyzing relevant literature. The diagram presented in Figure 2 acts as a guide for navigating VAD research in the subsequent sections, where we explore the challenges associated with the VAD problem.

Our exploration begins with the utilization of diverse datasets and a spectrum of feature extraction techniques, encompassing CNNs, 3D CNNs, autoencoders, GANs, LSTMs, transformers, vision-language features, and hybrid feature extractors. These methodologies are applied to extract spatial, temporal, spatiotemporal, and textual features. Additionally, this section addresses various learning and supervision strategies, including supervised methods, self-supervised techniques, and unsupervised techniques (often classified as reconstruction-based or one-class classification approaches), alongside weakly supervised and predictive models. We also examine aspects such as loss functions, regularization methods, and anomaly score computation; evaluation protocols for models are discussed in Section VI-B. Our objective is to provide a comprehensive analysis of these challenges and review the solutions proposed by researchers in this field.

A. Datasets Building and Selection

The field of Video Anomaly Detection (VAD) relies significantly on publicly available datasets, which are used for testing and benchmarking the proposed models. In this section, we present an overview of the common datasets utilized in the field of VAD, each carefully curated to facilitate the study of anomalies in various scenarios. These datasets encompass a wide range of scenes, offering diverse challenges for anomaly detection. We will investigate the number of videos contained in each dataset, the specific types of scenes they cover, and the anomalies present.

1) Subway dataset: The Subway dataset [38] consists of two videos collected using a CCTV camera, each capturing a distinct perspective of an underground train station. The first video focuses on the "entrance gate" area, where individuals typically descend through the turnstiles and enter the platform with their backs turned to the camera. In contrast, the second video is situated at the "exit gate," observing the platform where passengers ascend while facing the camera. These two cameras provide unique vantage points within the station, offering valuable insights for analysis and surveillance. The dataset has a combined duration of 2 hours. Anomalies within this dataset encompass activities such as walking in incorrect directions and loitering. Notably, this dataset is recorded within an indoor environment.

2) UCSD Pedestrian: The UCSD anomaly detection dataset [39] was collected using a stationary camera positioned at an elevated vantage point to monitor pedestrian walkways. This dataset offers a wide range of scenes depicting varying crowd densities, spanning from sparsely populated to highly congested environments. Normal videos within the dataset predominantly feature pedestrians, while abnormal events arise from two primary sources: the presence of non-pedestrian entities in the walkways and the occurrence of anomalous pedestrian motion patterns. Common anomalies observed include bikers, skaters, small carts, individuals walking across designated walkways or in adjacent grass areas, as well as instances of people in wheelchairs. The dataset comprises two distinct subsets, Peds1 and Peds2, each capturing different scenes. Peds1 depicts groups of people walking towards and away from the camera, often exhibiting perspective distortion, while Peds2 focuses on scenes where pedestrians move parallel to the camera plane.

The dataset is further divided into clips, with each clip containing approximately 200 frames. Ground truth annotations are provided at the frame level, indicating the presence or absence of anomalies. Additionally, a subset of clips in both Peds1 and Peds2 is accompanied by manually generated pixel-level binary masks, enabling the evaluation of algorithms' ability to localize anomalies. The UCSD Ped1 & Ped2 datasets are publicly available: http://www.svcl.ucsd.edu/projects/anomaly/.

3) UCF-Crime dataset: The UCF-Crime dataset [11] stands as a widely utilized large-scale dataset within recent research endeavors, boasting a multi-scene approach. Comprised of long, untrimmed surveillance videos, this dataset encapsulates 13 real-world anomalies with profound implications for public safety. The anomalies span a broad spectrum, including Abuse, Arrest, Arson, Assault, Road Accident, Burglary, Explosion, Fighting, Robbery, Shooting, Stealing, Shoplifting, and Vandalism.

To ensure the dataset's integrity and quality, a meticulous curation process was undertaken. Ten annotators were trained to collect videos from platforms like YouTube and LiveLeak, utilizing text search queries across multiple languages. Pruning conditions were enforced to exclude manually edited videos, prank videos, those captured by handheld cameras, and those sourced from news or compilations. The resultant dataset comprises 950 unedited real-world surveillance videos, each featuring clear anomalies, alongside an equal number of normal videos, totaling 1900 videos.

Temporal annotations were acquired by assigning the same videos to multiple annotators and averaging their annotations. The dataset is partitioned into a training set, consisting of 800 normal and 810 anomalous videos, and a testing set, comprising 150 normal and 140 anomalous videos. With its expansive coverage of various
Fig. 2. Video anomaly detection paradigm including (A) state-of-the-art dataset building and selection (Section V-A), (B) spatial, temporal, spatiotemporal, and textual deep feature extraction (Section V-B), (C) diverse deep learning and supervision schemes (supervised, self-supervised, weakly supervised, and unsupervised methods) (Section V-C), (D) selection of loss functions (Section V-D), (E) integration of regularization techniques within loss functions (Section V-E), (F) anomaly score calculation (Section V-F), and (G) model evaluation techniques (Section VI-B).

anomalous events, the UCF-Crime dataset serves as a comprehensive resource for evaluating anomaly detection algorithms across diverse real-world scenarios. The dataset is publicly available: http://crcv.ucf.edu/projects/real-world/.

4) CUHK Avenue: The CUHK Avenue dataset [15] is a widely employed resource for Video Anomaly Detection, meticulously crafted by researchers from the Chinese University of Hong Kong (CUHK). Set within the CUHK campus avenue, this dataset primarily focuses on anomalies commonly encountered in urban environments and public streets. It offers a diverse array of lighting conditions, weather variations, and human activities, thereby presenting formidable challenges for VAD models.

An array of anomalous behaviors is captured within the dataset, encompassing both physical anomalies, such as fighting and sudden running, and non-physical anomalies, like unusual gatherings and incorrect movement directions. Notably, the dataset introduces several key challenges for VAD models, including slight camera shake in certain frames, occasional absence of normal behaviors in the training set, and outliers in the training data. In total, the dataset comprises 30,652 frames, with 15,328 frames allocated for training purposes and the remaining 15,324 frames earmarked for testing. With its diverse scenarios and realistic challenges, the CUHK Avenue dataset serves as a valuable benchmark for evaluating and advancing VAD techniques. The dataset is publicly available: http://www.cse.cuhk.edu.hk/leojia/projects/detectabnormal/dataset.html.

5) ShanghaiTech: The ShanghaiTech Campus dataset [14] stands as a substantial contribution to the realm of anomaly detection, presenting a vast and diverse collection of data. Across its 13 scenes, characterized by intricate light conditions and varied camera angles, the dataset encapsulates a comprehensive array of challenging scenarios. Notably, it boasts an extensive compilation of over 270,000 training frames, alongside 130 abnormal events meticulously annotated at the pixel level, facilitating precise evaluation and analysis.

Comprising a total of 330 training videos and 107 testing videos, all rendered at a resolution of 856x480, the dataset ensures consistency and compatibility across its contents. Each video maintains a frame rate of 24 frames per second (24 fps), ensuring smooth and standardized playback. With its wealth of data and meticulous annotations, the ShanghaiTech Campus dataset serves as a cornerstone resource for advancing anomaly detection methodologies in real-world scenarios. The dataset is publicly available: https://svip-lab.github.io/dataset/campus_dataset.html.

6) XD-Violence: The XD-Violence dataset [13] is a large-scale and multi-scene dataset with a total duration of 217 hours, containing 4754 untrimmed videos. This dataset comprises 2405 violent videos and 2349 non-violent videos, all of which include audio signals and weak labels. The focus of the dataset is on weakly supervised violence detection, where only video-level labels are available in the training set. This approach offers the advantage of being more labor-saving compared to annotating frame-level labels; as a result, forming large-scale datasets of untrimmed videos and training a data-driven, practical system is no longer a challenging endeavor. The dataset incorporates audio signals and is sourced from both movies and in-the-wild scenarios. It encompasses six physically violent classes: Abuse, Car Accident, Explosion, Fighting, Riot, and Shooting. The dataset is divided into a training set, which comprises 3954 videos, and a test set, which includes 800 videos. Within the test set, there are 500 violent videos and 300 non-violent videos. The dataset is publicly available: https://roc-ng.github.io/XD-Violence/.

7) NWPU Campus dataset: The NWPU Campus dataset [40] is newly available and represents a significant contribution to the field of video anomaly detection and anticipation. The dataset was compiled by setting up cameras at 43 outdoor locations on the campus to capture the activities of pedestrians and vehicles. To ensure a sufficient number of anomalous events, more than 30 volunteers were involved in performing both normal and abnormal activities. The dataset covers a wide range of normal events, including regular walking, cycling, driving, and other daily behaviors that adhere to rules. The anomalies encompass various categories such as single-person anomalies, interaction anomalies, group anomalies, scene-dependent anomalies, location anomalies, appearance anomalies, and trajectory anomalies. The dataset consists of 305 training videos and 242 testing videos, totaling 16 hours of footage. Frame-level annotations are provided for the testing videos, indicating the presence or absence of anomalous events. Privacy concerns are addressed by blurring the faces of volunteers and pedestrians. Compared
Fig. 3. Sample frames showcasing the diversity of scenes and anomalies present in publicly available datasets used for Video Anomaly Detection. These frames offer a glimpse into the range of challenges and scenarios addressed within the field, providing valuable insights for testing and benchmarking anomaly detection models.

to other datasets, the NWPU Campus dataset stands out due to its larger volume of data, diverse scenes, and inclusion of scene-dependent anomalies. Additionally, it is the first dataset designed for video anomaly anticipation. These features make the NWPU Campus dataset a valuable resource for advancing research in video anomaly detection and anticipation. The dataset is publicly available: https://campusvad.github.io/.

8) Discussion on Datasets: The existing literature presents a plethora of diverse and extensive datasets covering a broad spectrum of normal and abnormal scenarios. These datasets vary from single-scene datasets, which focus on specific situations and their anomalies, such as the UCSD Pedestrian dataset [39], to more diverse datasets featuring multiple scenes and corresponding anomalous conditions, like the UCF-Crime [11] and XD-Violence [13] datasets.

Furthermore, these datasets exhibit variations in size and total duration, ranging from as little as five minutes to more than two hundred hours. Some datasets comprise a small number of long videos, primarily collected from fixed scenes, as seen in the Subway dataset [38], while others contain a larger number of shorter videos sourced from a wide array of locations, like the XD-Violence dataset [13]. Additionally, the datasets showcase a noticeable diversity in the types of anomalies present, spanning from action-related anomalies, such as unexpected movement direction and criminal activities, to object-related anomalies, like suspicious personnel and specific vehicle types.

Table I provides a comprehensive summary of the most commonly used publicly available datasets, while Figure 3 illustrates sample frames extracted from these datasets.

Despite the availability of numerous datasets dedicated to the problem of VAD, several crucial trends and challenges need to be addressed:

• The majority of the publicly available VAD datasets offer limited environmental diversity and are restricted to a specific setting. Examples are the ShanghaiTech [14] and CUHK Avenue [15] datasets, which are limited to university campus scenes and anomalies. Such a trend can significantly hinder the trained models' capability to generalize well to other types of settings.

• A recurring theme in multiple datasets is the limited number of anomalous events contained in the dataset. Some datasets contain as few as twelve anomalous event types (UCSD Ped2 [39]), in comparison to other datasets that feature more than 130 types of anomalies (ShanghaiTech [14]). This constraint lessens the models' efficacy in practical applications by limiting their capacity
TABLE I
OVERVIEW OF THE MOST IMPORTANT PUBLICLY AVAILABLE DATASETS.

Dataset | Paper | Year | Balanced | # Videos | Length | Example anomalies | Frames per video | Train:Test | Challenges
Subway Entrance Gate | [38] | 2008 | No | 1 | 1.5 hrs | Wrong direction, no payment | 121,749 | 13:87 | Single scene, indoor only, limited number of anomalies
Subway Exit Gate | [38] | 2008 | No | 1 | 1.5 hrs | Wrong direction, no payment | 64,901 | 13:87 | Single scene, indoor only, limited number of anomalies
UCSD Ped1 | [39] | 2010 | No | 70 | 5 min | Bikers, small carts, walking across walkways, wheelchair users | 201 | 49:51 | Small size, single scene, outdoor only, only vehicle anomalies
UCSD Ped2 | [39] | 2010 | No | 28 | 5 min | Bikers, small carts, walking across walkways, wheelchair users | 163 | 55:45 | Small size, single scene, outdoor only, only vehicle anomalies
CUHK Avenue | [15] | 2013 | No | 37 | 30 min | Strange action, wrong direction, abnormal object | 839 | 50:50 | Small size, single scene, outdoor only, camera shake
ShanghaiTech Campus | [14] | 2017 | No | 437 | 3.67 hrs | Running, loitering, biking in restricted areas, unusual gatherings, theft | 1,040 | 86:14 | Only university-setting anomalies, single geographic location
UCF-Crime | [11] | 2018 | Yes | 1,900 | 128 hrs | Abuse, arrest, arson, assault, accident, burglary, fighting, robbery | 7,247 | 85:15 | Imbalance between normal and abnormal classes, variation in video quality
XD-Violence | [13] | 2020 | No | 4,754 | 214 hrs | Abuse, car accident, explosion, fighting, riot, shooting | 3,944 | 83:17 | Limited number of anomalies, variation in video quality
NWPU Campus | [40] | 2023 | No | 547 | 16 hrs | Single-person, interaction, group, location, appearance, trajectory | 2,527 | 56:44 | Only university-setting anomalies, single geographic location

to learn a broad range of potential abnormal behaviors.

• Most datasets available in the literature are imbalanced in terms of the normal and abnormal classes. Particularly in larger datasets like UCF-Crime [11], the imbalance between normal and abnormal classes makes it difficult to train models that detect anomalies accurately without being overwhelmed by the majority (normal) class.

We draw the following future directions for the VAD problem in surveillance videos, particularly in the realm of benchmark datasets, for addressing the aforementioned issues:

• Benchmark datasets should be diverse and strive to encompass a wide range of anomalies, including both subtle and overt deviations from normal behavior.

• Datasets should aim for realism, mimicking real-world surveillance environments as closely as possible. Additionally, scalability is crucial to accommodate the growing size and complexity of surveillance video data.

• Anomalies often manifest over time, making temporal context essential for effective detection. Benchmark datasets should incorporate long-term temporal information to capture the dynamics of normal and abnormal behaviors accurately.

• Beyond detecting anomalies, future benchmark datasets could focus on localizing and segmenting anomalous regions within video frames or sequences. This finer granularity aids in understanding the nature and extent of anomalies, facilitating prompt response measures.

• Integrating data from multiple modalities, such as visual, auditory, and textual information, can enhance anomaly detection performance. Future benchmark datasets might include multi-modal data to reflect the complexity of real-world surveillance systems.

• For advancing anomaly detection research, openness and collaboration are crucial; therefore, future benchmark datasets should be openly accessible, inviting contributions from researchers worldwide and fostering innovation in the field.

By addressing these aspects, datasets can facilitate the development of more robust and effective anomaly detection approaches, ultimately improving the security and efficiency of surveillance systems.
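Since most of the benchmarks above ship frame-level ground truth and the dominant metric in this survey is the frame-level AUC (as in Fig. 1), the standard evaluation can be summarized in a few lines. The sketch below assumes per-frame anomaly scores and binary labels per test video, and uses scikit-learn; the micro-averaged concatenation protocol shown here is the common convention, though some works instead average the AUC per video:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_level_auc(scores_per_video, labels_per_video):
    """Micro-averaged frame-level AUC over a whole test set.

    scores_per_video : list of 1-D arrays of per-frame anomaly scores.
    labels_per_video : list of 1-D binary arrays (1 = anomalous frame).
    """
    # Concatenate all test videos into one long sequence of frames,
    # the usual protocol on UCSD, Avenue, ShanghaiTech, and UCF-Crime.
    scores = np.concatenate(scores_per_video)
    labels = np.concatenate(labels_per_video)
    return roc_auc_score(labels, scores)  # in [0, 1]; papers report AUC%
```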
B. Feature Learning in Deep Video Anomaly Detection

Feature learning plays a pivotal role in the effectiveness of deep learning models for video anomaly detection. This section examines various feature extraction techniques utilized in the literature, emphasizing their significance and performance in surveillance video analysis.

1) Exploring Different Feature Types: Video frames encompass diverse types of features crucial for VAD. These include spatial, temporal, spatiotemporal, and textual features.

a) Spatial Features: Spatial features pertain to the visual characteristics present within individual frames of a video. These encompass attributes such as shapes, textures, colors, and object positions within the frame. In the realm of anomaly detection, the analysis of spatial features aids in the detection of unusual patterns or objects within specific regions of the video frame. Initially, traditional machine learning methods dominated this area, employing techniques like Gaussian mixture models and manually constructed features [17], [18]. However, the transition to deep learning facilitated automated feature extraction, thereby augmenting the capability to discern intricate spatial details within video data.

b) Temporal Features: Temporal features encompass changes or movements occurring over time within videos, including object motion, speed variations, and environmental alterations from frame to frame. For instance, Liu et al. [41] employed the optical flow approach to capture motion features within frames. In video anomaly detection, temporal features play a pivotal role in identifying unusual actions or events spanning consecutive frames, such as unauthorized running in restricted areas [11], [16].
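As a minimal illustration of such motion cues (a generic sketch, not the specific pipeline of [41]), the mean magnitude of dense optical flow between consecutive frames can serve as a crude temporal feature; the code below uses OpenCV's Farnebäck algorithm:

```python
import cv2
import numpy as np

def flow_magnitude_feature(prev_frame, next_frame):
    """Mean optical-flow magnitude between two consecutive BGR frames.

    A crude temporal cue: large values indicate fast motion, which
    classical VAD pipelines often treat as a potential anomaly signal.
    """
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Dense Farneback optical flow: one (dx, dy) vector per pixel.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)
    return magnitude.mean()
```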
c) Spatiotemporal Features: Relying solely on one type of feature can be restrictive: temporal features might fail to pinpoint where an anomaly occurs, while spatial features might miss when it happens. Incorporating spatiotemporal features offers a more holistic perspective, capturing not only the occurrence but also the precise location of an anomaly. This approach leads to more accurate and effective anomaly detection [41]–[43].
d) Textual Features: In video anomaly detection, textual features such as captions and labels significantly enhance a system's ability to recognize and understand anomalies. By incorporating descriptive captions and relevant labels into video frames, these systems gain a deeper understanding of the content, context, and actions depicted in the videos. This semantic layer aids in distinguishing normal from abnormal activities more effectively. Advanced techniques like vision-language pretraining, including image-to-text models, are employed to analyze these textual annotations, which can also encompass temporal and spatial context. The integration of textual features with visual data leads to more sophisticated, context-aware anomaly detection [24]–[26], [33].

2) Deep Feature Extractors: Different feature extractors have been utilized by previous researchers.

a) Convolutional Neural Networks (CNNs):

• 2D Convolutional Neural Networks (2D CNNs): CNNs have revolutionized spatial feature processing, enabling detailed analysis of structural elements in scenes. The authors of [44] discuss the use of Faster R-CNN, a specific type of CNN architecture, because of its high accuracy and its dual capability of performing object classification and bounding box regression simultaneously. This means it can both locate and classify objects within video frames, which is particularly useful for identifying and localizing anomalies.

• 3D Convolutional Networks (3D CNNs): 3D CNNs enhance traditional CNNs by incorporating temporal analysis, allowing for an effective evaluation of spatiotemporal characteristics within video data. The adoption of models such as C3D [45] and I3D [46] has markedly elevated the performance of state-of-the-art (SOTA) systems. A significant number of studies, including those by [11], [28], [47], have employed these 3D CNN architectures as their foundational backbone, demonstrating their outstanding efficacy in spatiotemporal feature extraction (a minimal extraction sketch follows this list).
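To illustrate how such 3D CNN backbones are typically used as frozen spatiotemporal feature extractors, the following sketch uses torchvision's pretrained R3D-18, a smaller, publicly available stand-in for C3D/I3D rather than the exact backbone of the cited works:

```python
import torch
from torchvision.models.video import r3d_18, R3D_18_Weights

weights = R3D_18_Weights.DEFAULT
backbone = r3d_18(weights=weights)
backbone.fc = torch.nn.Identity()  # drop the classifier; keep 512-d features
backbone.eval()

@torch.no_grad()
def extract_clip_feature(clip: torch.Tensor) -> torch.Tensor:
    """clip: (3, T, H, W) float tensor, preprocessed as in weights.transforms().
    Returns a (512,) spatiotemporal feature vector for the whole clip."""
    return backbone(clip.unsqueeze(0)).squeeze(0)
```

Features extracted this way per video snippet are the typical input to the downstream scoring models discussed in Section V-C.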
b) Autoencoders (AEs): AEs excel in VAD due to their unsupervised learning ability. They encode data into a lower-dimensional space and then reconstruct it, learning key data features without needing labeled examples. This is vital in anomaly detection, where anomalies are rare and often not well-defined, allowing AEs to effectively identify unusual patterns in video data.

In the work of [16], the approach utilized a fully convolutional autoencoder to learn low-level motion features, enhancing the ability to learn regular dynamics in long-duration videos and making it an effective model for detecting irregularities. This method is versatile and applicable to various tasks, including temporal regularity analysis, frame prediction, and abnormal event detection in videos.

c) Generative Adversarial Networks (GANs): GANs consist of two parts: a generator that creates realistic data and a discriminator that distinguishes between generated and real data. Through this adversarial process, GANs effectively learn the distribution of real data. This ability is particularly valuable in anomaly detection, as GANs can generate data that closely resembles normal instances, making it easier to identify anomalies that deviate from this learned pattern [21]. Their use in reconstruction-based approaches is especially noteworthy; these approaches reconstruct input data (like video snippets) using high-level representations learned from normal videos [48]. The premise is that anomalies, being out-of-distribution inputs, are harder to reconstruct accurately compared to normal data, making reconstruction error a viable metric for anomaly detection.

As seen in the work of [42], GANs are employed for future frame prediction, with the predicted frames then reconstructed, showcasing their versatility and effectiveness in anomaly detection tasks. In this domain, GANs alongside AEs have been shown to be effective in capturing intricate patterns in video data, aiding in the more precise identification of anomalous activities. Their combined use enables the learning of high-level representations and the generation of realistic data, enhancing the ability to detect anomalies accurately and efficiently.
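The reconstruction-error scoring principle shared by AE- and GAN-based methods can be condensed into a few lines. Below is a toy sketch assuming a small frame-level convolutional autoencoder trained only on normal frames; real systems typically operate on spatiotemporal volumes and richer error measures:

```python
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    """Toy convolutional autoencoder trained only on normal frames."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

@torch.no_grad()
def anomaly_score(model: FrameAutoencoder, frame: torch.Tensor) -> float:
    """Per-frame score = mean squared reconstruction error.
    Out-of-distribution (anomalous) frames reconstruct poorly,
    so a higher score suggests an anomaly."""
    x = frame.unsqueeze(0)          # (1, 3, H, W), H and W divisible by 4
    recon = model(x)
    return torch.mean((recon - x) ** 2).item()
```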
d) Sequential Deep Learning:

• Long Short-Term Memory (LSTM): LSTMs are a specialized type of recurrent neural network designed to capture temporal dependencies. This characteristic makes them exceptionally well suited for processing sequential data, such as textual content in NLP-based systems [49]. Moreover, their ability to remember and integrate long-range patterns is also invaluable in video applications. Specifically, they facilitate tracking and identifying temporal anomalies within sequential video streams, a critical function in spatiotemporal analysis [12].

• Vision Transformers (ViT): Vision Transformers, known for their attention mechanisms, have revolutionized the way sequential data is processed. They can weigh the importance of different parts of the input data, giving precedence to the most relevant features. This makes them highly effective for extracting both temporal and spatiotemporal features in videos, which is crucial for detecting complex anomalies. Yang et al. [50] introduce a novel video anomaly detection method that restores events from keyframes using a U-shaped Swin Transformer network (a special type of ViT). This approach, distinct from traditional methods, focuses on inferring missing frames and capturing complex spatiotemporal relationships, demonstrating improved anomaly detection in video surveillance.

e) Vision-Language Models (VLMs): Conventional techniques often rely solely on spatial-temporal characteristics, which might fall short in complex real-life situations where a deeper semantic insight is required for greater precision. Vision-language features involve training models using both visual and textual encoders, enabling comprehensive analysis of complex surveillance scenarios.

The notable rise of vision-language feature extractors trained with contrastive learning, like CLIP [34] and BLIP [51], also aims to align vision and language, promising to bring about a transformative change in how surveillance videos are processed and interpreted. These models are designed to augment video content with rich semantic understanding, effectively narrowing the gap between simple pixel-based data and a more human-like interpretation of video content [35]. Additionally, language models using text captions are used in text-to-video anomaly retrieval, as in [52]. In the methodology of the VadCLIP model [35] for video anomaly detection, Wu et al. combined the CLIP model's vision-language features with a dual-branch system using both visual and textual encoders. It includes a local-global temporal adapter (LGT-Adapter) for modeling video temporal relations and leverages both visual and textual features for anomaly detection. One branch performs binary classification using visual features, while the other aligns visual and textual data for finer-grained anomaly detection. In contrast, [33] produced text features from video captions with SwinBERT [36], enriching the semantic context for detecting abnormalities. This approach expands the semantic understanding of video content beyond the pixel level, enhancing anomaly detection capabilities.
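To illustrate the vision-language alignment these methods build on, the sketch below scores a single frame against a hypothetical set of textual prompts using the public CLIP checkpoint from HuggingFace Transformers; it demonstrates only the underlying mechanism, not the full VadCLIP or caption-based pipelines:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical prompt set: one "normal" description plus anomaly classes.
PROMPTS = ["a normal street scene", "a fight", "an explosion", "a road accident"]

@torch.no_grad()
def frame_text_scores(frame: Image.Image) -> torch.Tensor:
    """Return a probability distribution over PROMPTS for one frame.
    High mass on an anomaly prompt relative to the normal prompt
    can serve as a zero-shot anomaly cue."""
    inputs = processor(text=PROMPTS, images=frame,
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # (1, len(PROMPTS))
    return logits.softmax(dim=-1).squeeze(0)
```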
f) Hybrid Feature Learners: Certain studies have developed hybrid feature extractors that integrate various types of feature extraction techniques to capture both spatial and temporal information effectively for video anomaly detection.

For example, [53] proposed a hybrid architecture that combines U-Net and a modified Video Vision Transformer (ViViT). The U-Net, known for its encoder-decoder structure with skip connections, excels at capturing detailed spatial information. In contrast, the modified ViViT, originally designed for video classification tasks, is adapted here to encode both spatial and temporal information effectively for video prediction. This hybrid model leverages the detailed high-resolution spatial information from the CNN features of the U-Net and the global context captured by the transformer.

Similarly, the work by [6] employed a composite system that combines CNNs with transformers. In this configuration, the CNN component is responsible for discerning spatial characteristics, whereas the transformers are tasked with identifying extended temporal attributes. This hybrid approach offers a complementary fusion of spatial and temporal features, enhancing the model's capability to detect anomalies in video sequences. Additionally, [54] proposed a CNN architecture that integrates convolutional autoencoders (CAEs) and a U-Net. Each stream within the network contributes uniquely to the task of detecting anomalous frames. The CAEs are responsible for extracting spatial features, while the U-Net focuses on capturing contextual information, enabling the model to effectively identify anomalies in video frames by leveraging both local and global features.

Similarly, [55] combined I3D 3D CNNs with LSTM networks to enhance anomaly detection performance. The I3D CNN extracts spatio-temporal features from video frames, effectively identifying potential anomalies. However, as CNNs primarily capture short-term dynamics, LSTMs are integrated to capture the long-range temporal dependencies crucial in lengthy, untrimmed surveillance footage. To handle the high-dimensional data from the I3D network, a pooling strategy is employed to optimize feature vectors for efficient processing by the LSTM. This hybrid architecture effectively integrates the strengths of both CNNs and LSTMs, enhancing the model's ability to detect anomalies at varying temporal scales.

While each feature representation has its strengths and weaknesses, as empirically investigated by the SOTA VAD approaches [6], [26], [35], [50], end-to-end deep feature learning paradigms are the recent trend, and these approaches have advanced VAD performance. We provide the following future directions in terms of feature representations to improve detection accuracy and efficiency:

• A continuous exploration and refinement of deep learning architectures tailored for VAD is needed. This may involve developing more efficient architectures, such as spatiotemporal CNNs, ViTs, or RNNs, that can effectively capture temporal dependencies and spatial information in video sequences.

• Hybrid models, which integrate multiple types of features, including appearance, motion, and spatiotemporal information, could also improve performance. This could involve combining deep learning-based features with handcrafted features or leveraging multi-modal approaches such as VLMs to capture diverse aspects of anomalies.

• Investigation of cross-modal feature representations that integrate information from multiple modalities, such as visual, audio, and textual cues present in surveillance videos. Cross-modal representations can capture complementary information sources and improve the robustness of anomaly detection models to diverse types of anomalies and environmental conditions.

• Exploration of self-supervised learning techniques for pre-training feature representations in an unsupervised manner can help in learning rich representations from unlabeled data. The pre-trained models can then be fine-tuned for specific anomaly detection tasks with limited labeled data.

• Integration of attention mechanisms into deep learning architectures to focus on relevant spatiotemporal regions or frames within videos. Attention mechanisms can help improve the model's ability to attend to informative parts of the input data and suppress irrelevant noise, leading to more robust anomaly detection features.

• Utilization of graph-based feature representations to model complex relationships among different entities (e.g., objects, regions) within surveillance videos. Graph neural networks (GNNs) can capture dependencies and interactions among entities, enabling more effective anomaly detection in scenarios where anomalies manifest as abnormal interactions or relationships.

• Investigation of adversarial learning techniques to enhance the robustness of feature representations against adversarial attacks. Adversarial training methods can help improve the model's ability to generalize to unseen anomalies and mitigate the risk of evasion attacks in real-world surveillance systems.

• Development of incremental learning approaches to adaptively update feature representations over time as new data becomes available. Such approaches can help the model adapt to evolving patterns of normal and abnormal behavior in dynamic surveillance environments without forgetting previously learned knowledge.
JOURNAL OF IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 10

C. Learning and Supervision Schemes are then processed through feature extractors to glean spa-
tiotemporal features before being directed to a fully connected
In the context of this survey, we will delve into various
network to produce the final output. The output anomaly score
learning approaches aimed at addressing the VAD problem.
ranges from 0 to 1, with the goal of increasing the model’s
1) Supervised Approaches: In supervised learning, algo-
output anomaly score for abnormal segments while decreasing
rithms are trained on datasets where frames are pre-classified
it for normal segments.
as either ’normal’ or ’anomalous.’ However, supervised meth-
The authors [28] introduced a novel approach to weakly
ods are less common due to the scarcity of datasets with
supervised anomaly detection (WSVAD), reframing it as a
detailed annotations at the frame level for the training set.
supervised learning task with noisy labels, departing from the
An exemplary study by [56] introduced an approach for
conventional MIL framework. Leveraging a Graph Convolu-
detecting and localizing anomalies in crowded scenes using
tional Network (GCN), this method effectively cleans noisy
spatial-temporal Convolutional Neural Networks (CNNs). This
labels, thereby enhancing the training and reliability of fully
method can simultaneously process spatial and temporal data
supervised action classifiers for anomaly detection. Addition-
from video sequences, capturing both appearance and motion
ally, the authors [57] introduced a binarization-embedded WS-
information. Tailored specifically for crowded environments, it
VAD (BE-WSVAD) method, which innovates by embedding
efficiently identifies anomalies by focusing on moving pixels,
binarization into GCN-based anomaly detection module.
enhancing accuracy and robustness.
Another work by [48] addresses the scarcity of labeled In another study, [27] enhances traditional MIL with a
anomalies in anomaly detection by employing a conditional Temporal Convolutional Network (TCN) and a unique Inner
Generative Adversarial Network (cGAN) to generate supple- Bag Loss (IBL). The IBL strategically focuses on the variation
mentary training data that is class balanced. This approach of anomaly scores within each bag (video), emphasizing a
utilizes labeled data for training the model. They propose larger score gap in positive bags (containing anomalies) and a
a novel supervised anomaly detector, the Ensemble Active smaller gap in negative bags (without anomalies). Meanwhile,
Learning Generative Adversarial Network (EAL-GAN). This the TCN effectively captures the temporal dynamics in videos,
network, unique in its architecture of one generator against a crucial aspect often overlooked in standard MIL approaches.
multiple discriminators, leverages an ensemble learning loss Moreover, authors [29], [58] propose a self-reasoning ap-
function and an active learning algorithm. The goal is to proach based on binary clustering of spatio-temporal video
alleviate the class imbalance problem and reduce the cost of features to mitigate label noise in anomalous videos. Their
labeling real-world data. framework involves generating pseudo labels through cluster-
2) Self-supervised Approaches: In self-supervised approaches, algorithms generate their supervision from the input data, bypassing the need for external labels provided by humans. This paradigm is particularly valuable for anomaly detection, where labeled anomalous data is scarce.
In the work by [9], researchers propose a method based on self-supervised and multi-task learning at the object level. Self-supervision involves training a 3D CNN on several proxy tasks that do not require labeled anomaly data. These tasks include determining the direction of object movement (the arrow of time), identifying motion irregularities by comparing objects in consecutive versus intermittent frames, and reconstructing object appearances based on their preceding and succeeding frames. By learning from the video data itself what normal object behavior looks like, the model becomes adept at detecting anomalies when deviations from this learned behavior occur. This methodology enables effective anomaly detection even in the absence of explicit labels.
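To make the self-supervision concrete, the sketch below illustrates one way such a proxy label can be manufactured from raw video alone. It is an illustrative example of the arrow-of-time idea only, not the training pipeline of [9], and the tensor layout is an assumption:

    import torch

    def arrow_of_time_batch(clips: torch.Tensor):
        """clips: (B, T, C, H, W) video clips. Returns a batch in which
        roughly half of the clips are temporally reversed, together with
        binary labels (1 = forward, 0 = reversed) derived from the data
        itself, so no human annotation is required."""
        b = clips.size(0)
        labels = torch.randint(0, 2, (b,))
        reversed_clips = torch.flip(clips, dims=[1])  # reverse the time axis
        keep = labels.view(b, 1, 1, 1, 1).bool()
        return torch.where(keep, clips, reversed_clips), labels

A 3D CNN trained to predict these labels learns the typical temporal direction of normal motion; at test time, low confidence on a clip hints that its dynamics deviate from normality.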
3) Weakly Supervised Approaches: In weakly supervised approaches, obtaining precise frame-level annotations for anomalies in long video sequences can be challenging and time-consuming. Instead, annotators may label "snippets" or short segments within the video where anomalies are observed, serving as weak labels for training the model.
In their work, [11] introduced a pioneering multiple instance learning (MIL) model, the first to adopt the notion of weakly labeled training videos. Here, normal and anomalous videos are considered negative and positive bags, respectively, with video segments serving as instances in MIL. These bags are then used to train the model with a ranking objective that pushes the highest anomaly score in a positive bag above the highest score in a negative bag (see the MIL loss in Section D and the sketch below).
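The bag construction just described can be summarized in a few lines. The following minimal sketch assumes frame-level features have already been extracted and that each video yields at least as many frames as segments; the 32-segment split follows the protocol popularized by [11]:

    import numpy as np

    def video_to_bag(frame_feats: np.ndarray, num_segments: int = 32) -> np.ndarray:
        """frame_feats: (num_frames, feat_dim) features for one video.
        Splits the sequence into num_segments temporal chunks and
        average-pools each chunk, yielding one instance per segment
        (i.e., the MIL bag for this video)."""
        chunks = np.array_split(frame_feats, num_segments)
        return np.stack([c.mean(axis=0) for c in chunks])

    # An anomalous video (video-level label 1) becomes a positive bag;
    # a normal video (label 0) becomes a negative bag.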
Subsequently, the authors of [28] addressed weakly supervised anomaly detection (WSVAD) by reframing it as a supervised learning task with noisy labels, departing from the conventional MIL framework. Leveraging a Graph Convolutional Network (GCN), this method effectively cleans noisy labels, thereby enhancing the training and reliability of fully supervised action classifiers for anomaly detection. Additionally, the authors of [57] introduced a binarization-embedded WSVAD (BE-WSVAD) method, which innovates by embedding binarization into a GCN-based anomaly detection module.
In another study, [27] enhances traditional MIL with a Temporal Convolutional Network (TCN) and a unique Inner Bag Loss (IBL). The IBL strategically focuses on the variation of anomaly scores within each bag (video), emphasizing a larger score gap in positive bags (containing anomalies) and a smaller gap in negative bags (without anomalies). Meanwhile, the TCN effectively captures the temporal dynamics in videos, a crucial aspect often overlooked in standard MIL approaches.
Moreover, the authors of [29], [58] propose a self-reasoning approach based on binary clustering of spatio-temporal video features to mitigate label noise in anomalous videos. Their framework involves generating pseudo labels through clustering, facilitating the cleaning of noise from the labels, and enhancing the overall anomaly detection performance. This method not only removes noisy labels but also improves the network's performance through a clustering distance loss.
Traditional MIL approaches often neglect the intricate interplay of features over time. In [31], the method commences with a relation-aware feature extractor that captures multi-scale CNN features from videos. The unique aspect of their approach is the integration of self-attention with Conditional Random Fields (CRFs), leveraging self-attention to capture short-range feature correlations and CRFs to learn feature interdependencies. This approach offers a more comprehensive analysis of complex movements and interactions for anomaly detection.
On the other hand, [32] proposed Robust Temporal Feature Magnitude Learning (RTFM). RTFM addresses the challenge of recognizing rare abnormal snippets in videos dominated by normal events, particularly subtle anomalies that exhibit only minor differences from normal events. It employs temporal feature magnitude learning to improve the robustness of the MIL approach against negative instances from abnormal videos, and integrates dilated convolutions and self-attention mechanisms to capture both long- and short-range temporal dependencies.
Furthermore, [47] introduced "MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection", a novel approach for WSVAD. MIST diverges from traditional MIL by introducing a pseudo label generator with a sparse continuous sampling strategy for more accurate clip-level pseudo labels, and a self-guided attention-boosted feature encoder for focusing on anomalous regions within frames.
Additionally, [59] presents a novel weakly supervised temporal relation learning framework (WSTR). The framework, which uses I3D for feature extraction and incorporates snippet-level classifiers and top-k video classification for weak supervision, is the first of its kind to apply transformer technology in this context.
In [26], a novel approach called CLIP-TSA leverages vision-language model features for Weakly Supervised Video Anomaly Detection (WSVAD). Unlike conventional models such as C3D or I3D, CLIP-TSA utilizes the Vision Transformer (ViT)-encoded visual features from CLIP [34] to efficiently extract discriminative representations. It incorporates a Temporal Self-Attention (TSA) mechanism to model both long- and short-range temporal dependencies, thereby enhancing detection performance in VAD.
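As a rough illustration of this feature-extraction step, the snippet below uses OpenAI's public clip package to encode individual frames with a ViT backbone. It is a simplified stand-in for the CLIP-TSA pipeline (the temporal self-attention module is omitted), and the choice of the ViT-B/32 checkpoint is an assumption:

    import torch
    import clip  # OpenAI's CLIP package: https://github.com/openai/CLIP

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    @torch.no_grad()
    def clip_frame_features(frames):
        """frames: list of PIL images sampled from a video. Returns a
        (T, 512) tensor of L2-normalized ViT features, one row per frame."""
        batch = torch.stack([preprocess(f) for f in frames]).to(device)
        feats = model.encode_image(batch)
        return feats / feats.norm(dim=-1, keepdim=True)

The resulting per-snippet features can then be fed to any of the MIL-style heads discussed in this section.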
4) Unsupervised and Reconstruction-based Approaches: Reconstruction-based approaches for video anomaly detection operate under the principle that normal events can be effectively reconstructed from a learned representation, while anomalies or abnormal events deviate significantly from this representation and hence are harder to reconstruct. Essentially, the model learns to represent or "reconstruct" normal data, and anomalies are detected based on how poorly the model reconstructs them.
These methods are particularly suitable when labeled anomaly data is scarce. During training, only normal videos are considered, while during testing, the model is evaluated on both normal and abnormal videos to assess its anomaly detection capabilities.
Deep learning techniques, especially CNNs [54] or autoencoders [16], [60], are widely utilized for this approach. An autoencoder attempts to learn a compressed representation of its input data and then reconstructs the original data from this representation. During training, the model learns to minimize the reconstruction error between the input and the output.
After training, when the model encounters new data, it tries to reconstruct it based on what it has learned. The difference between the reconstructed output and the original input is measured, typically as a reconstruction error. A high reconstruction error indicates that the input is significantly different from what the model considers "normal", implying that the input might be an anomaly.
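A minimal sketch of this paradigm follows, assuming single-channel frames whose sides are divisible by four. It is an illustrative toy model, far shallower than the (often spatio-temporal) architectures used in practice, and the training loop corresponds to the squared-error objective of Equation 4 below:

    import torch
    import torch.nn as nn

    class ConvAE(nn.Module):
        """Tiny convolutional autoencoder trained on normal frames only."""
        def __init__(self):
            super().__init__()
            self.enc = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
            self.dec = nn.Sequential(
                nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid())

        def forward(self, x):
            return self.dec(self.enc(x))

    model = ConvAE()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    def train_step(normal_frames):            # (B, 1, H, W), normal data only
        optimizer.zero_grad()
        loss = ((model(normal_frames) - normal_frames) ** 2).mean()
        loss.backward()
        optimizer.step()
        return loss.item()

    @torch.no_grad()
    def reconstruction_error(frame):          # high error => likely anomalous
        return ((model(frame) - frame) ** 2).mean().item()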
In the work by [16], the authors developed a method to assess the regularity of frames in video sequences using reconstruction models. They employed two autoencoders: one with convolutional layers and the other without. The model processed two distinct types of inputs: manually designed features (such as HOG and HOF, enhanced with trajectory-based characteristics), and a combination of 10 successive frames aligned along the time axis. The error in reconstructing these frames serves as an indicator of their regularity score.
The reconstruction-based paradigm is typically categorized within certain learning frameworks as one-class classification (OCC) or unsupervised learning, primarily due to its emphasis on training using only the "normal" class of videos. Within the OCC paradigm, the model is trained only on data from one class (typically the "normal" class), and any deviation from this learned structure in the testing phase (an "anomaly") results in a higher reconstruction error.
On the other hand, the unsupervised learning perspective focuses on understanding the structure or distribution within the data itself. Autoencoder-based anomaly detection is primarily unsupervised, as it learns to reconstruct data without explicitly being told what is normal or anomalous; it simply minimizes the difference between the input and the reconstructed output.
Future Frame Prediction Approaches: The concept of reconstruction-based approaches evolved as researchers observed that deep neural networks may not always produce significant reconstruction errors for unusual events. Additionally, certain normal videos, previously unencountered, could be incorrectly labeled as abnormal. To address these challenges, researchers began focusing on reconstructing future frames based on previous video frames while considering optical flow constraints to ensure motion consistency. This evolved methodology, termed "prediction approaches", introduced Generative Adversarial Networks (GANs), which played a significant role in enhancing this approach [41], [42].
Building upon this framework, the HSTGCNN [61] model incorporates a sophisticated Future Frame Prediction (FFP) mechanism that significantly refines the anomaly detection process. By integrating hierarchical graph representations, the model not only predicts future frames but also encodes complex interactions among individuals and their movements, thus providing a more robust and context-aware anomaly detection system.
A hybrid system incorporating flow reconstruction and frame prediction was introduced by [20]. This system detects unusual events in videos by memorizing normal activity patterns and predicting future frames. It utilizes a memory-augmented autoencoder for accurate flow reconstruction and a Conditional Variational Autoencoder for frame prediction. Anomalies are highlighted by larger errors in flow reconstruction and subsequent frame prediction.
Different VAD paradigms have their strengths and weaknesses. ViT-based approaches with the integration of language models or language supervision are the recent trend for improving VAD performance. In the realm of different approaches for VAD, several promising directions are emerging:
• Development of vision-language models for VAD problems will become more dominant compared to vision-only models. These models not only capture the semantic representation of the videos but also consider natural language descriptions of anomalies.
• Exploration of methods to localize and describe anomalies in videos using natural language descriptions is a recent trend. Language-guided anomaly localization techniques enable the model to identify spatial and temporal regions corresponding to anomalies and generate human-interpretable descriptions, enhancing situational awareness and facilitating response efforts.
• Exploration of GAN-based approaches for VAD, where the generator learns to generate normal video frames while the discriminator distinguishes between real and generated frames. GANs can capture complex distributions of normal behavior and detect deviations indicative of anomalies.
• Development of graph-based models that represent spatial and temporal dependencies among entities in surveillance videos as a graph structure. Spatiotemporal graph models can capture intricate relationships among objects, activities, and contexts, enabling more accurate anomaly detection in complex scenes.
• Investigation of transfer learning techniques to transfer knowledge from labeled source domains to unlabeled or sparsely labeled target domains. This can help mitigate the scarcity of labeled data in target domains and improve anomaly detection performance by leveraging knowledge learned from related surveillance scenarios.
• Adoption of multi-resolution analysis techniques to analyze videos at different spatial and temporal resolutions. Multi-resolution approaches enable the detection of anomalies occurring at varying scales, from small-scale events to large-scale spatial or temporal patterns, enhancing the model's sensitivity to diverse types of anomalies.
• Adoption of dynamic ensemble learning strategies to combine predictions from multiple anomaly detection models dynamically. Dynamic ensemble methods can adaptively adjust the ensemble composition based on the current surveillance context, improving detection robustness and resilience to evolving anomalies.

D. Loss Functions
Loss functions play a pivotal role in quantifying the disparity between the outcomes predicted by a model and the actual results. They serve as guides for the optimization process, facilitating adjustments and enhancements to model parameters during training. The selection of an appropriate loss function is critical, as it directly impacts the model's capacity to discern underlying patterns in the data and, consequently, its performance on unseen data. Various tasks necessitate distinct loss functions to adeptly capture the specific characteristics or complexities inherent in the problem domain.
1) Multiple Instance Learning (MIL) Loss: Within the category of weakly supervised learning, notably Multiple Instance Learning (MIL) [11], the objective function is crafted to proficiently distinguish between normalcy and anomaly within video segments. Consider a dataset of labeled videos, denoted as V, whose segments V serve as instances. The videos are categorized into positive bags B+ and negative bags B−. The overall objective function is represented as

\arg\min_{\theta} \sum_{B_+, B_- \in \mathcal{V}} \max\left(0,\; 1 - \max_{V \in B_+} S(V;\theta) + \max_{V \in B_-} S(V;\theta)\right) \qquad (2)

This function aims to ensure that the anomaly score S(V; θ) of the top-scoring segment in a positively labeled video is greater than that of any segment within the negatively labeled set. This is achieved by employing a hinge loss function parameterized by θ.
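In PyTorch-like code, the ranking objective of Equation 2 reduces to a hinge on the top-scoring segments of each bag pair. The batching convention below (one positive and one negative bag per row) is an assumption made for illustration:

    import torch

    def mil_ranking_loss(pos_scores: torch.Tensor,
                         neg_scores: torch.Tensor,
                         margin: float = 1.0) -> torch.Tensor:
        """pos_scores / neg_scores: (B, n) segment anomaly scores for B
        anomalous (positive) and B normal (negative) bags. Implements
        Eq. (2): the maximum score in each positive bag should exceed the
        maximum score in each negative bag by at least the margin."""
        max_pos = pos_scores.max(dim=1).values
        max_neg = neg_scores.max(dim=1).values
        return torch.clamp(margin - max_pos + max_neg, min=0).mean()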
2) Cross Entropy Loss: Alongside the MIL loss, other researchers [28], [62] used the binary cross entropy within their loss functions to classify between normal and abnormal video snippets in WSVAD tasks.
The goal is to refine the model so that the snippet with the highest anomaly score in a normal video is identified as normal and, conversely, the highest-scoring snippet in an anomalous video is identified as anomalous. For each video instance, MIL formulates a pair that includes the model's prediction for the snippet with the highest anomaly score and the true label for the video's anomaly (i.e., (max{f(V_i)}_{i=1}^{n}, y_i), where y_i = 0 for normal and y_i = 1 for anomalous instances). MIL subsequently consolidates these pairs from all the videos to form a set C of confidently labeled snippets. The model f is refined by optimizing the Binary Cross-Entropy (BCE) loss, calculated as follows:

\mathrm{BCE}(C) = -\frac{1}{N} \sum_{(y_i, \hat{y}_i) \in C} \big[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\big] \qquad (3)

where the maximum score \hat{y}_i = \max\{f(V_i)\}_{i=1}^{n}. Under this paradigm, the model f must assign the lowest anomaly probability to all snippets in a normal video, thereby minimizing max{f(V_i)}_{i=1}^{n}, and the highest anomaly probability in an anomalous video, hence maximizing max{f(V_i)}_{i=1}^{n}. The strategy is to focus on the snippet with the strongest abnormality, even if the video is largely normal, to generate a set of snippets labeled with high confidence as anomalous.
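A short sketch of this max-snippet selection with the BCE objective of Equation 3 follows; it assumes the network already outputs per-snippet probabilities in [0, 1]:

    import torch
    import torch.nn.functional as F

    def max_snippet_bce(snippet_scores: torch.Tensor,
                        video_labels: torch.Tensor) -> torch.Tensor:
        """snippet_scores: (B, n) per-snippet anomaly probabilities;
        video_labels: (B,) video-level labels (0 = normal, 1 = anomalous).
        Pairs each video's highest-scoring snippet with its video-level
        label, then applies binary cross-entropy as in Eq. (3)."""
        y_hat = snippet_scores.max(dim=1).values
        return F.binary_cross_entropy(y_hat, video_labels.float())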
3) Reconstruction Error Loss: As for the unsupervised approach, the objective of the model, as shown in Equation 4, is to minimize the reconstruction error (the squared Euclidean distance, ‖·‖₂²), as proposed by [16], between the original input frames' pixel values and their reconstruction by the model with weights θ. The input frames, denoted as V_i, are passed through an encoder to obtain a compressed representation, which is then reconstructed back into frames F(V_i; θ) using a decoder; N is the size of the minibatch.

\arg\min_{\theta} \frac{1}{2N} \sum_{i} \lVert V_i - F(V_i;\theta) \rVert_2^2 \qquad (4)

Future directions for video anomaly detection in terms of loss functions involve the exploration and development of novel loss functions tailored to address specific challenges and objectives in anomaly detection tasks. The promising future directions in this domain are:
• Designing anomaly-aware losses that explicitly account for the characteristics of anomalies. Such losses could penalize model errors differently based on the severity or rarity of anomalies, helping the model focus more on detecting critical anomalies while reducing false alarms for common or benign events.
• Incorporating temporal consistency constraints into the loss function to encourage smooth transitions and coherent predictions over time. These may penalize abrupt changes or inconsistencies in the model's predictions across consecutive frames, promoting more stable and coherent anomaly detection results.
• Integrating adversarial losses into the training process to improve the robustness of anomaly detection models against adversarial attacks. This can encourage the model to generate predictions that are resilient to adversarial perturbations or manipulations, enhancing the reliability and effectiveness of anomaly detection systems in real-world scenarios.
• Incorporating uncertainty-aware losses to quantify and mitigate uncertainty in anomaly detection predictions. These may enable the model to estimate the confidence or uncertainty associated with its predictions, facilitating more reliable decision-making and uncertainty quantification in anomaly detection systems.

E. Regularization
Regularization techniques in deep learning, particularly for video anomaly detection, are crucial for preventing overfitting and improving the generalization of the models.
1) Weight Decay Regularization: Weight decay is a form of regularization commonly used in training neural networks to prevent overfitting. The term "weight decay" specifically refers to a technique that modifies the learning process to shrink the weights of the neural network during training. Weight decay works on the principle of adding a penalty to the loss function of a neural network related to the size of the weights [63]. The primary objective is to keep the weights small, which helps in reducing the model's complexity and its tendency to overfit the training data. By penalizing large weights, weight decay ensures that the model does not rely too heavily on any single feature or combination of features, leading to better generalization.
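In modern frameworks, weight decay is rarely added to the loss by hand; it is typically requested from the optimizer. A one-line sketch, where the model and hyperparameters are placeholders:

    import torch
    import torch.nn as nn

    model = nn.Linear(512, 1)  # stand-in for any anomaly-scoring network
    # weight_decay adds an L2-style lambda * ||theta||^2 penalty at every
    # update, shrinking the weights exactly as described above.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)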
a) L1 Regularization: L1 regularization adds a penalty to the loss function of the model equivalent to the absolute value of the magnitude of the model's weights. The general equation for a loss function with L1 regularization can be expressed as follows:

L(\theta) = L_0(\theta) + \lambda \sum_{i=1}^{n} |\theta_i| \qquad (5)

where L(θ) is the total loss function after including the L1 regularization term, L0(θ) is the original loss function without regularization, and λ is a regularization parameter that controls the strength of the regularization effect. Higher values of λ lead to greater regularization, potentially resulting in more features being effectively ignored by reducing their coefficients to zero. The key aspect of L1 regularization is its ability to create sparse models. Sparsity here refers to the property where some of the coefficients become exactly zero, effectively excluding some features from the model. This is particularly useful in feature selection, where the most important features in the model are to be identified.
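Unlike weight decay, an L1 term is usually added to the loss explicitly. A minimal sketch of Equation 5, with an illustrative value for λ:

    import torch

    def l1_penalty(model: torch.nn.Module, lam: float = 1e-5) -> torch.Tensor:
        # lambda * sum_i |theta_i| -- the regularization term of Eq. (5)
        return lam * sum(p.abs().sum() for p in model.parameters())

    # total objective: loss = task_loss + l1_penalty(model)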
b) L2 Regularization: L2 regularization, also known as Ridge regularization, modifies the loss function of a model by adding a penalty equivalent to the square of the magnitude of the model's weights. The formula for a loss function with L2 regularization is generally expressed as follows:

L(\theta) = L_0(\theta) + \lambda \sum_{i=1}^{n} \theta_i^2 \qquad (6)

Here, L(θ) represents the total loss function, including the L2 regularization term; L0(θ) is the original loss function without regularization; and λ is the regularization parameter, which determines the strength of the regularization effect. Increasing the value of λ enhances the regularization effect, penalizing larger weights more significantly. The key characteristic of L2 regularization is its ability to prevent overfitting by keeping the weights small, which leads to a more generalized model. Unlike L1 regularization, which promotes sparsity in the model (some coefficients becoming exactly zero), L2 regularization tends to shrink all coefficients towards zero but does not make them exactly zero. This attribute makes it particularly useful in scenarios where many features contribute small amounts to the predictions. By penalizing the square of the weights, L2 regularization ensures that no single feature dominates the model, which can be essential for models where all features carry some level of importance.
2) Temporal and Spatial Constraints: Temporal and spatial constraints are other types of regularization terms commonly used in VAD [11]. The addition of these terms to the loss function enforces important properties in the model's spatial and temporal learning for distinguishing between normal and abnormal events in videos [64]. Temporal constraints are used to guarantee the continuity and progression of events over time, while spatial constraints are used to enhance the model's understanding of objects' positions and physical locations throughout the frames of the video and their relation to the existence of anomalies. By incorporating both types of constraints in the loss function of a VAD model, the model is inclined toward achieving more robust anomaly detection.
a) Temporal Smoothing Constraint: The temporal smoothing constraint is a crucial component of loss functions in VAD, employed to maintain stability in predicted anomaly scores across consecutive frames. This constraint aims to mitigate abrupt fluctuations in anomaly scores between adjacent frames, aligning with the expectation that successive frames typically exhibit similar anomaly characteristics. The temporal smoothing constraint penalizes sudden changes in the anomaly score between sequential frames, as shown in Equation 7:

L_{\mathrm{smooth}} = \lambda_1 \sum_{i=1}^{|B_+|} \big( S(V_i;\theta) - S(V_{i+1};\theta) \big)^2 \qquad (7)

where λ1 is a smoothing coefficient, S(V_i; θ) is the anomaly score of frame i, and S(V_{i+1}; θ) is the anomaly score of frame i + 1.
b) Sparsity Constraint: The sparsity constraint is a commonly used regularization term in VAD that forces the anomalous frames to be a minority among the video frames. In VAD, the anomalous frames are generally assumed to be the minority compared to the normal frames. The sparsity constraint penalizes the total anomaly score of the frames in the video, as shown in Equation 8:

L_{\mathrm{sparsity}} = \lambda_2 \sum_{V_i \in B_+} S(V_i;\theta) \qquad (8)

where λ2 is a sparsity coefficient and S(V_i; θ) is the anomaly score of frame i.
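Both constraints are a few lines each when the model emits a vector of segment scores per video. The sketch below follows Equations 7 and 8; the coefficient values are illustrative, not prescribed by this survey:

    import torch

    def temporal_smoothness(scores: torch.Tensor, lam1: float = 8e-5) -> torch.Tensor:
        """scores: (B, n) anomaly scores for the n segments of each positive
        bag. Eq. (7): penalize squared differences between consecutive
        segment scores."""
        return lam1 * ((scores[:, 1:] - scores[:, :-1]) ** 2).sum(dim=1).mean()

    def sparsity(scores: torch.Tensor, lam2: float = 8e-5) -> torch.Tensor:
        """Eq. (8): the total anomaly score within a video should stay small,
        reflecting the assumption that anomalous frames are a minority."""
        return lam2 * scores.sum(dim=1).mean()

    # Both terms are simply added to the MIL ranking loss of Eq. (2).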
While spatio-temporal constraints are well suited for handling abrupt anomalies within the video sequence, recent trends toward incorporating VAD-specific constraints improve the overall performance. Future directions in this domain include the following:
• Adversarial regularization techniques may help the model learn more resilient features by introducing adversarial perturbations during training, thereby enhancing its ability to detect anomalies in the presence of adversarial manipulations.
• Incorporating temporal regularization techniques to enforce temporal consistency and coherence in anomaly detection predictions over time can encourage smooth transitions and consistent predictions across consecutive frames.
• Graph-based regularization techniques can impose structural constraints on the learned representations, encouraging the model to capture meaningful interactions and contextual information among objects, scenes, or events, leading to more accurate anomaly detection.
• Integrating sparse regularization techniques to encourage sparsity in the learned representations, focusing the model's attention on salient features or regions within videos, can promote the selection of informative features while suppressing irrelevant noise.
• Addressing catastrophic forgetting and model degradation over time by incorporating continual learning regularization techniques can enable the model to adapt and learn from new data while preserving previously learned knowledge, facilitating adaptation in surveillance environments.
F. Anomaly Score
The anomaly score indicates the likelihood of a segment or frame in a video being abnormal, calculated based on how much it deviates from normal patterns. A high anomaly score suggests a high probability of an anomaly.
For the reconstruction approaches [16], [60], after training the model, its effectiveness is measured by inputting test data to see if it can accurately identify unusual activities with a minimal false alarm rate. The anomaly score S(V_i) for each frame V_i is then obtained by scaling the reconstruction error of the frame, e(V_i), between 0 and 1:

S(V_i) = \frac{e(V_i) - \min_{V_i} e(V_i)}{\max_{V_i} e(V_i)}. \qquad (9)

Hence, we can define the regularity score to be:

S_r(V_i) = 1 - S(V_i). \qquad (10)

However, for future frame prediction methods, [65] has shown that the Peak Signal-to-Noise Ratio (PSNR) is a superior method for evaluating image quality [41], [42], as indicated below:

\mathrm{PSNR}(Y, \hat{Y}) = 10 \log_{10}\!\left( \frac{\max_{\hat{Y}}^{2}}{\frac{1}{N}\sum_{i=0}^{N} (Y_i - \hat{Y}_i)^2} \right), \qquad (11)

The assumption is that we can make accurate predictions for normal events; therefore, we can detect anomalies by measuring the difference between a predicted frame Ŷ and its corresponding ground truth Y. Here, max_Ŷ is the maximum possible value of the image intensities, and its square is divided by the mean squared error (MSE) between Y and Ŷ. In image reconstruction tasks, a higher PSNR indicates a lower error and, thus, higher fidelity to the original image, which is the desired outcome.
After calculating the PSNR between the predicted Ŷ and its corresponding ground truth Y for each frame V_i in a test video and normalizing these values, we determine each frame's regularity score S_r(V_i) using the equation below:

S_r(V_i) = \frac{\mathrm{PSNR}(Y_{V_i}, \hat{Y}_{V_i}) - \min_{V_i} \mathrm{PSNR}(Y_{V_i}, \hat{Y}_{V_i})}{\max_{V_i} \mathrm{PSNR}(Y_{V_i}, \hat{Y}_{V_i}) - \min_{V_i} \mathrm{PSNR}(Y_{V_i}, \hat{Y}_{V_i})} \qquad (12)

This score indicates the likelihood of a frame being normal or abnormal, with the potential to set a threshold point for distinguishing between the two.

VI. EXPERIMENTS
A. Datasets Guideline
As discussed in Section V-A, several datasets have been used in the literature for training and evaluating VAD models. These datasets are well suited to different types of supervision schemes. Most of the datasets, including UCF-Crime [11], ShanghaiTech [14], XD-Violence [13], and CUHK Avenue [15], provide video-level labels for the training set, which can be utilized for training weakly-supervised VAD as in the works by Joo et al. [26] and Cho et al. [64]. These datasets can also be utilized for unsupervised VAD by using the training set without the video-level labels, as in the works of Shi et al. [66] and Zaheer et al. [67], and for self-supervised learning, as in the work of Zhou et al. [56].

B. Evaluation Metrics
Most previous works mainly compared their methods on different datasets using the area under the ROC curve (AUC), the Equal Error Rate (EER), and Average Precision (AP) to measure performance. The main aim of these metrics is to measure the model's proficiency in differentiating between normal and abnormal videos. One significant metric is the AUC, the area under the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) across various threshold settings. The TPR is defined as:

\mathrm{TPR} = \frac{TP}{TP + FN} \qquad (13)

and the FPR is given by:

\mathrm{FPR} = \frac{FP}{FP + TN}. \qquad (14)

A higher AUC value indicates better model performance, where an AUC of 1 signifies a perfect model and an AUC of 0.5 suggests no discriminative power between positive and negative classes. The other method used is the AP, which computes the average value of precision over different recall values. It summarizes the precision-recall curve into a single value representing average performance across all threshold levels. Additionally, the EER represents the point at which the FPR equals the false negative rate (FNR). Lower EER values indicate a greater degree of accuracy in the system [56].
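The normalization of Equations 9-10 and the metrics of Equations 13-14 can be computed with standard tooling. A brief sketch using NumPy and scikit-learn, where the EER is located numerically on the ROC curve and the per-frame errors and labels are assumed to be given:

    import numpy as np
    from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

    def regularity_scores(errors: np.ndarray) -> np.ndarray:
        # Eqs. (9)-(10): scale per-frame reconstruction errors, then invert.
        s = (errors - errors.min()) / (errors.max() + 1e-12)
        return 1.0 - s

    def evaluate(frame_labels: np.ndarray, anomaly_scores: np.ndarray):
        auc = roc_auc_score(frame_labels, anomaly_scores)            # ROC AUC
        ap = average_precision_score(frame_labels, anomaly_scores)   # AP
        fpr, tpr, _ = roc_curve(frame_labels, anomaly_scores)
        fnr = 1.0 - tpr
        eer = fpr[np.nanargmin(np.abs(fpr - fnr))]   # point where FPR == FNR
        return auc, ap, eer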
TABLE II
BENCHMARKING OF RECENT ANOMALY DETECTION WORKS ON PUBLICLY AVAILABLE DATASETS.

Method Venue Year Supervision type Feature extractor UCF-Crime AUC ShanghaiTech AUC XD-Violence AUC Avenue AUC Ref
Lu et al. ICCV 2013 Unsupervised 3D cube gradient and PCA 65.51 68.00 // // [15]
Hasan et al. CVPR 2016 Unsupervised Convolutional autoencoder 50.60 60.85 // 70.20 [16]
Sultani et al. CVPR 2018 Weakly Sup. C3D 75.41 // // // [11]
AMC ICCV 2019 Unsupervised Conv-AE // // // 86.9 [54]
MemAE ICCV 2019 Unsupervised 2D CNN Encoder-Decoder // 71.2 // 83.3 [68]
TCN-IBL ICIP 2019 Weakly Sup. C3D 78.66 82.50 // // [27]
GCN-TSN CVPR 2019 Weakly Sup. TSN - RGB 82.12 84.44 // // [28]
MPED-RNN CVPR 2019 Unsupervised RNN // 73.4 // // [69]
SRF IEEE SPL 2020 Weakly Sup. C3D 79.54 84.16 // // [29]
CVAE ICCV 2021 Unsupervised ML-MemAE-SC, CVAE // 76.2 // 91.1 [20]
SS-MTL CVPR 2021 self-supervised 3D CNN // // // 86.9 [9]
Noise Cleaner CVPRW 2021 Weakly Sup. C3D 78.26 84.16 // // [58]
DAM AVSS 2021 Weakly Sup. I3D- RGB 82.67 88.22 // // [30]
SA-CRF ICCV 2021 Weakly Sup. Relation-aware TSN ResNet-50 85.00 96.85 // // [31]
RTFM ICCV 2021 Weakly Sup. I3D- RGB 84.30 97.21 77.81 // [32]
MIST CVPR 2021 Weakly Sup. I3D- RGB 82.30 94.84 // // [47]
MSLNet-CTE AAAI 2022 Weakly Sup. VideoSwin-RGB 85.62 97.32 78.59 // [23]
GCL CVPR 2022 Unsupervised ResNext 71.04 // 78.93 // [67]
CUPL CVPR 2023 Weakly Sup. I3D+VGGish 86.22 // // // [70]
USTN-DSC CVPR 2023 Unsupervised Transformer Encoder-Decoder // 73.8 // 89.9 [50]
FPDM ICCV 2023 Unsupervised Diffusion Model 74.7 78.6 // // [71]
HSC CVPR 2023 Weakly Sup. AE // 83.4 // 93.7 [22]
SLMPT ICCV 2023 Unsupervised U-Net // 78.8 // 90.9 [66]
UMIL CVPR 2023 Weakly Sup. X-CLIP-B/32 86.75 // // // [62]
TeD-SPAD ICCV 2023 Weakly Sup. U-Net + I3D 75.06 // // // [72]
CLAV CVPR 2023 Weakly Sup. I3D-RGB 86.1 97.6 81.3 89.8 [64]
EVAL CVPR 2023 Unsupervised 3D CNN // 76.63 // 86.02 [73]
TEVAD CVPRW 2023 Weakly Sup. ResNet-50 I3D-RGB / SwinBERT 84.90 98.10 79.80 // [33]
CLIP-TSA ICIP 2023 Weakly Sup. CLIP 87.58 98.32 82.19 // [26]
In this survey, we use the AUC as the performance metric in our quantitative comparison of the SOTA models in Section VII-A.

VII. COMPARATIVE ANALYSIS OF SOTA MODELS
A. Quantitative comparison
Several works in the literature have proposed different deep learning architectures to tackle the problem of VAD. Table II presents a benchmarking of recent anomaly detection works on publicly available datasets. The table offers a detailed look at the evolution and current state of anomaly detection techniques, primarily focusing on publicly available datasets including UCF-Crime, ShanghaiTech, XD-Violence, and Avenue. The data spans a decade, from 2013 to 2023, providing a comprehensive view of the field's progression over the last ten years. An important trend that can be observed in the table is the shift in research interest from supervised VAD approaches to weakly supervised VAD learning methods. This change, particularly notable from 2018 onwards and greatly influenced by the work of Sultani et al. [11], who introduced the UCF-Crime dataset and the Multi-Instance Learning (MIL) pipeline, suggests a growing preference for weakly supervised approaches that rely on video-level labels, since fully labeled data is often expensive or difficult to obtain. This shift reflects the practical challenges and evolving needs of anomaly detection applications, particularly those where hundreds of hours of unlabelled or weakly labeled video footage are available.
A notable aspect to consider is the wide array of feature extractors utilized in these studies. Ranging from simpler methods like 3D cube gradients with Principal Component Analysis (PCA) to more sophisticated deep learning architectures such as Convolutional Auto-Encoders (CAE), 3D Convolutional Networks (C3D), Temporal Segment Networks (TSN), Inflated 3D ConvNet (I3D), and CLIP, the diversity in approaches underscores the intricate nature of the VAD task, emphasizing the necessity for methodologies tailored to diverse scenarios.
Another critical facet of the analysis pertains to performance metrics, particularly the Area Under the Curve (AUC) scores across different datasets, indicating notable enhancements in model performance over time. This trend not only signifies advancements in methodological efficacy but also underscores the increasing accuracy and reliability of anomaly detection systems. Leading models, notably recent ones like CLIP-TSA published in 2023, exhibit exceptionally high AUC scores, illustrating the rapid progress achieved in recent years, particularly with the integration of advanced techniques from Natural Language Processing (NLP) into models like CLIP, BLIP, and GPT.
However, the analysis reveals several instances of missing data. These gaps across various methodologies and datasets highlight the necessity for more comprehensive benchmarking efforts. Addressing such gaps is crucial to foster a deeper understanding and facilitate meaningful comparisons among different anomaly detection methods, ultimately advancing the field.

B. Qualitative comparison
To examine in more detail how current state-of-the-art models perform on various normal and anomalous surveillance videos, this section presents a qualitative analysis of these models' performance. Figure 4 depicts a qualitative assessment of correctly classified and misclassified video frames using four VAD models, namely Sultani et al. [11], GCN [67], CLAV [64], and VAD-CLIP [35].
Fig. 4. A qualitative comparison and illustration of correctly and incorrectly classified frames using four VAD models, namely Sultani et al. [11], GCN [67], CLAV [64], and VAD-CLIP [35].
As shown in the figure's first row, these four VAD models were able to correctly classify normal and anomalous video frames taken from different scenes and environments. However, these models misclassified other similar frames taken from the same videos, either by detecting anomalies in a normal video or by failing to detect an anomalous video segment. For example, the model proposed by [11] misclassified frame number 10000 in the video "Burglary032", which shows a person entering an office through a window, by labeling it as "Normal" while it is an "Anomaly". This was attributed to the immense darkness of the scene, which made recognizing such an ambiguous anomaly difficult. Another example of misclassification is with the GCN model by [67], which misclassified frame number 4400 in the video "Stealing058", showing a car leaving a parking spot, by labeling it as an "Anomaly" while it is "Normal". This could be attributed to the sudden movement of the car with its lights turned on. Similarly, the model proposed by [64], named CLAV, misclassified the initial frames of a scene showing cars stopping and accelerating again in a street. This increased anomaly score could be explained by the abrupt stopping of the cars, which in many cases could be considered an anomaly. Finally, the VAD-CLIP model proposed by [35] misclassified frame number 360 in the video "Shooting008", which shows a person crawling on the floor, by classifying it as an "Anomaly" while it is "Normal". This misclassification could be explained by the fact that the crawling person moves unconventionally, in contrast with what commonly happens in normal street and shopping-center scenes. It can be concluded that some normal and abnormal frames may be misclassified when the overall context of the video cannot be inferred directly from the frame, in cases where the frames give a false impression of a different context.

VIII. VISUALIZING BIBLIOMETRIC NETWORKS FOR THEMATIC ANALYSIS

In bibliometrics, visualization emerges as a potent technique for analyzing diverse networks, encompassing citation-based, co-authorship-based, or keyword co-occurrence-based networks. Density visualizations provide a rapid overview of the essential components within a bibliometric network. Employing NLP techniques, one can construct term co-occurrence networks for textual data in English. The rationale behind employing interconnected networks is to illustrate the temporal and relevance-driven evolution of the field.
We utilized VOSviewer [74], a tool employing a distance-based method to generate visualizations of bibliometric networks. Directed networks, like those formed by citation relationships, are treated as undirected in this context. VOSviewer autonomously organizes network nodes into clusters, where nodes with similar attributes are grouped, and each node in the network is assigned to exactly one of these clusters. Furthermore, VOSviewer employs color to denote a node's membership within a cluster in bibliometric network visualizations.
Five distinct clusters have emerged, as shown in Figure 5, each representing a different research theme. The green cluster, centered around the theme of diffusion, explores methods based on diffusion models for anomaly detection, with representative work such as "Anomaly Detection in Satellite Videos using Diffusion Models" [75]. In the yellow cluster, efficiency takes precedence, with research efforts directed towards developing algorithms capable of accurate anomaly detection at millisecond-level latencies, as demonstrated in the paper "EfficientAD: Accurate Visual Anomaly Detection at Millisecond-Level Latencies" [76].
Fig. 5. Visualizing Bibliometric Networks for Thematic Analysis of Recent Literature (50 top-cited papers) on Video Anomaly Detection between the years 2023 and 2024.
TABLE III
CLUSTER COLOR AND CENTRAL THEME FOR EACH CLUSTER, WITH A REPRESENTATIVE PAPER

Cluster  Representative Research Paper (Theme)
Green    Anomaly Detection in Satellite Videos using Diffusion Models (Theme: Diffusion)
Red      Delving into CLIP latent space for Video Anomaly Recognition (Theme: VLM)
Yellow   EfficientAD: Accurate Visual Anomaly Detection at Millisecond-Level Latencies (Theme: Efficiency)
Blue     Video Anomaly Detection using GAN (Theme: GAN)
Purple   Interleaving One-Class and Weakly-Supervised Models with Adaptive Thresholding for Unsupervised Video Anomaly Detection (Theme: Weakly-Supervised)
The purple cluster focuses on interleaving one-class and weakly-supervised models with adaptive thresholding for unsupervised video anomaly detection, exemplified by the paper "Interleaving One-Class and Weakly-Supervised Models with Adaptive Thresholding for Unsupervised Video Anomaly Detection" [77]. Within the red cluster, the exploration delves into vision-language models (VLMs), particularly CLIP, for video anomaly recognition, as seen in "Delving into CLIP latent space for Video Anomaly Recognition" [78]. Finally, the blue cluster is dedicated to research involving Generative Adversarial Networks (GANs) for VAD tasks, with representative work titled "Video Anomaly Detection using GAN" [79]. All clusters are further illustrated in Table III.

IX. DISCUSSION AND FUTURE RESEARCH

In this survey paper, we present a comprehensive review that serves as a guideline for solving challenges related to VAD. The past decade has witnessed a remarkable evolution in the field of Video Anomaly Detection (VAD), marked by a notable transition from supervised to weakly supervised learning approaches and reconstruction-based techniques. This shift reflects a growing preference for methods capable of operating effectively with less reliance on fully labeled data, which can be both costly and challenging to obtain. Notably, the majority of benchmarking datasets are weakly supervised, featuring video-level annotations, while reconstruction approaches utilize only normal data in an unsupervised manner during training.
The existing literature presents a diverse array of VAD datasets, ranging from specific single-scene datasets like UCSD Pedestrian to more comprehensive, multi-scene collections such as UCF-Crime and XD-Violence.
These datasets vary significantly in size, duration, and the nature of anomalies they encompass.
However, despite the progress made, challenges persist within the realm of VAD. Issues such as limited environmental diversity, a narrow range of anomalous event types, and class imbalance continue to impact the generalization and detection accuracy of models across different settings. This persistent gap underscores the urgent need for more diverse datasets, emphasizing the importance of expanding available resources to facilitate comprehensive and varied testing of VAD models.
Furthermore, the utilization of various hybrid deep learning techniques for feature extraction, including Convolutional Neural Networks (CNNs), Autoencoders (AEs), Generative Adversarial Networks (GANs), sequential deep learning, and vision-language models, highlights the complexity of VAD tasks. These techniques enable the extraction of crucial types of features, such as spatiotemporal and textual features, underscoring the necessity for specialized approaches tailored to specific scenarios and reflecting the dynamic nature of the field.
Moreover, the selection of appropriate loss functions has played a pivotal role in the effectiveness of various tasks, serving as a fundamental component for model optimization and directly influencing a model's capacity to learn and make accurate predictions. Additionally, incorporating regularization terms into loss functions, particularly sparsity and smoothing constraints, has been essential to enhance the model's capability to discern between normal and abnormal events.
In the testing phase, evaluating the anomaly score using metrics such as Area Under the Curve (AUC), Average Precision (AP), and Equal Error Rate (EER) has been critical for comprehending the quality and biases of false positives, thereby providing insights into the model's limitations. It has been imperative to assess models across multiple datasets to ensure their robustness and generalization ability.

A. Future Directions for Research
Exploring the fusion of state-of-the-art vision-language models, particularly integrating textual features, with traditional VAD approaches presents a promising avenue for future investigations. This interdisciplinary approach holds the potential to enhance anomaly detection systems by imbuing them with a deeper understanding of complex video content, where semantic meanings are enriched through textual annotations.
While significant progress has been made in the field of VAD, there remains a pressing need for more diverse and extensive datasets. Specifically, datasets covering a broader range of scenarios, encompassing multiple scenes and anomalies, are essential. Such datasets would not only facilitate the benchmarking of existing models but also stimulate innovation by presenting researchers with more challenging real-world situations to address.
Furthermore, since VLMs are emerging, an important direction for future research involves the integration of textual descriptions containing contextual details, such as frame-level captions, into anomaly detection models. This integration has the potential to significantly improve model performance by providing rich, descriptive information that aids in the interpretation and analysis of visual content. Consequently, this trend will necessitate the development of stronger loss functions capable of more effectively managing textual information. In conclusion, this survey paper highlights an exciting phase of growth and transformation in the field of VAD, characterized by methodological advancements, the integration of new technologies, and a shift towards more efficient learning approaches. However, the identified gaps and challenges underscore the need for continued efforts within the research community to develop comprehensive datasets and explore novel methodologies, ultimately advancing the state of the art in anomaly detection.

REFERENCES
[1] B. Ramachandra, M. J. Jones, and R. R. Vatsavai, "A survey of single-scene video anomaly detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 5, pp. 2293–2312, 2020.
[2] C. C. Aggarwal and C. C. Aggarwal, "Applications of outlier analysis," Outlier Analysis, pp. 373–400, 2013.
[3] M. Jiang, C. Hou, A. Zheng, X. Hu, S. Han, H. Huang, X. He, P. S. Yu, and Y. Zhao, "Weakly supervised anomaly detection: A survey," arXiv preprint arXiv:2302.04549, 2023.
[4] G. Pang, C. Shen, L. Cao, and A. V. D. Hengel, "Deep learning for anomaly detection: A review," ACM Computing Surveys (CSUR), vol. 54, no. 2, pp. 1–38, 2021.
[5] Z. Shao, H. Bian, Y. Chen, Y. Wang, J. Zhang, X. Ji et al., "Transmil: Transformer based correlated multiple instance learning for whole slide image classification," Advances in Neural Information Processing Systems, vol. 34, pp. 2136–2147, 2021.
[6] W. Ullah, T. Hussain, F. U. M. Ullah, M. Y. Lee, and S. W. Baik, "Transcnn: Hybrid cnn and transformer mechanism for surveillance anomaly detection," Engineering Applications of Artificial Intelligence, vol. 123, p. 106173, 2023.
[7] R. Nayak, U. C. Pati, and S. K. Das, "A comprehensive review on deep learning-based methods for video anomaly detection," Image and Vision Computing, vol. 106, p. 104078, 2021.
[8] B. R. Kiran, D. M. Thomas, and R. Parakkal, "An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos," Journal of Imaging, vol. 4, no. 2, p. 36, 2018.
[9] M.-I. Georgescu, A. Barbalau, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah, "Anomaly detection in video via self-supervised and multi-task learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12742–12752.
[10] G. Liu, L. Shu, Y. Yang, and C. Jin, "Unsupervised video anomaly detection in uavs: a new approach based on learning and inference," Frontiers in Sustainable Cities, vol. 5, p. 1197434, 2023.
[11] W. Sultani, C. Chen, and M. Shah, "Real-world anomaly detection in surveillance videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6479–6488.
[12] R. Nawaratne, D. Alahakoon, D. De Silva, and X. Yu, "Spatiotemporal anomaly detection using deep learning for real-time video surveillance," IEEE Transactions on Industrial Informatics, vol. 16, no. 1, pp. 393–402, 2019.
[13] P. Wu, J. Liu, Y. Shi, Y. Sun, F. Shao, Z. Wu, and Z. Yang, "Not only look, but also listen: Learning multimodal violence detection under weak supervision," in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX. Springer, 2020, pp. 322–339.
[14] W. Luo, W. Liu, and S. Gao, "A revisit of sparse coding based anomaly detection in stacked rnn framework," in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 341–349.
[15] C. Lu, J. Shi, and J. Jia, "Abnormal event detection at 150 fps in matlab," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2720–2727.
[16] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, "Learning temporal regularity in video sequences," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 733–742.
[17] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1. IEEE, 2005, pp. 886–893.
[18] N. Dalal, B. Triggs, and C. Schmid, "Human detection using oriented histograms of flow and appearance," in Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7–13, 2006, Proceedings, Part II. Springer, 2006, pp. 428–441.
[19] W. Luo, W. Liu, and S. Gao, "Remembering history with convolutional lstm for anomaly detection," in 2017 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2017, pp. 439–444.
[20] Z. Liu, Y. Nie, C. Long, Q. Zhang, and G. Li, "A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 13588–13597.
[21] D. Li, X. Nie, R. Gong, X. Lin, and H. Yu, "Multi-branch gan-based abnormal events detection via context learning in surveillance videos," IEEE Transactions on Circuits and Systems for Video Technology, 2023.
[22] S. Sun and X. Gong, "Hierarchical semantic contrast for scene-aware video anomaly detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22846–22856.
[23] S. Li, F. Liu, and L. Jiao, "Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 1395–1403.
[24] B. Ni, H. Peng, M. Chen, S. Zhang, G. Meng, J. Fu, S. Xiang, and H. Ling, "Expanding language-image pretrained models for general video recognition," in European Conference on Computer Vision. Springer, 2022, pp. 1–18.
[25] C. Ju, T. Han, K. Zheng, Y. Zhang, and W. Xie, "Prompting visual-language models for efficient video understanding," in European Conference on Computer Vision. Springer, 2022, pp. 105–124.
[26] H. K. Joo, K. Vo, K. Yamazaki, and N. Le, "Clip-tsa: Clip-assisted temporal self-attention for weakly-supervised video anomaly detection," in 2023 IEEE International Conference on Image Processing (ICIP). IEEE, 2023, pp. 3230–3234.
[27] J. Zhang, L. Qing, and J. Miao, "Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection," in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 4030–4034.
[28] J.-X. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li, "Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1237–1246.
[29] M. Z. Zaheer, A. Mahmood, H. Shin, and S.-I. Lee, "A self-reasoning framework for anomaly detection using video-level labels," IEEE Signal Processing Letters, vol. 27, pp. 1705–1709, 2020.
[30] S. Majhi, S. Das, and F. Brémond, "Dam: dissimilarity attention module for weakly-supervised video anomaly detection," in 2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2021, pp. 1–8.
[31] D. Purwanto, Y.-T. Chen, and W.-H. Fang, "Dance with self-attention: A new look of conditional random fields on anomaly detection in videos," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 173–183.
[32] Y. Tian, G. Pang, Y. Chen, R. Singh, J. W. Verjans, and G. Carneiro, "Weakly-supervised video anomaly detection with robust temporal feature magnitude learning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 4975–4986.
[33] W. Chen, K. T. Ma, Z. J. Yew, M. Hur, and D. A.-A. Khoo, "Tevad: Improved video anomaly detection with captions," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5548–5558.
[34] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., "Learning transferable visual models from natural language supervision," in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
[35] P. Wu, X. Zhou, G. Pang, L. Zhou, Q. Yan, P. Wang, and Y. Zhang, "Vadclip: Adapting vision-language models for weakly supervised video anomaly detection," arXiv preprint arXiv:2308.11681, 2023.
[36] K. Lin, L. Li, C.-C. Lin, F. Ahmed, Z. Gan, Z. Liu, Y. Lu, and L. Wang, "Swinbert: End-to-end transformers with sparse attention for video captioning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17949–17958.
[37] R. Chalapathy and S. Chawla, "Deep learning for anomaly detection: A survey," arXiv preprint arXiv:1901.03407, 2019.
[38] A. Adam, E. Rivlin, I. Shimshoni, and D. Reinitz, "Robust real-time unusual event detection using multiple fixed-location monitors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 3, pp. 555–560, 2008.
[39] V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos, "Anomaly detection in crowded scenes," in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 1975–1981.
[40] C. Cao, Y. Lu, P. Wang, and Y. Zhang, "A new comprehensive benchmark for semi-supervised video anomaly detection and anticipation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2023, pp. 20392–20401.
[41] W. Liu, W. Luo, D. Lian, and S. Gao, "Future frame prediction for anomaly detection–a new baseline," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6536–6545.
[42] W. Luo, W. Liu, D. Lian, and S. Gao, "Future frame prediction network for video anomaly detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7505–7520, 2021.
[43] Y. Zhong, X. Chen, Y. Hu, P. Tang, and F. Ren, "Bidirectional spatio-temporal feature learning with multiscale evaluation for video anomaly detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 12, pp. 8285–8296, 2022.
[44] R. F. Mansour, J. Escorcia-Gutierrez, M. Gamarra, J. A. Villanueva, and N. Leal, "Intelligent video anomaly detection and classification using faster rcnn with deep reinforcement learning model," Image and Vision Computing, vol. 112, p. 104229, 2021.
[45] J. Carreira and A. Zisserman, "Quo vadis, action recognition? a new model and the kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, 2012.
[47] J.-C. Feng, F.-T. Hong, and W.-S. Zheng, "Mist: Multiple instance self-training framework for video anomaly detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14009–14018.
[48] Z. Chen, J. Duan, L. Kang, and G. Qiu, "Supervised anomaly detection via conditional generative adversarial network and ensemble active learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, pp. 7781–7798, 2022.
[49] M. Abdalla, H. Hassan, N. Mostafa, S. Abdelghafar, A. Al-Kabbany, and M. Hadhoud, "An nlp-based system for modulating virtual experiences using speech instructions," Expert Systems with Applications, vol. 249, p. 123484, 2024.
[50] Z. Yang, J. Liu, Z. Wu, P. Wu, and X. Liu, "Video event restoration based on keyframes for video anomaly detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14592–14601.
[51] J. Li, D. Li, S. Savarese, and S. Hoi, "Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models," arXiv preprint arXiv:2301.12597, 2023.
[52] P. Wu, J. Liu, X. He, Y. Peng, P. Wang, and Y. Zhang, "Toward video anomaly retrieval from video anomaly detection: New benchmarks and model," IEEE Transactions on Image Processing, vol. 33, pp. 2213–2225, 2024.
[53] H. Yuan, Z. Cai, H. Zhou, Y. Wang, and X. Chen, "Transanomaly: Video anomaly detection using video vision transformer," IEEE Access, vol. 9, pp. 123977–123986, 2021.
[54] T.-N. Nguyen and J. Meunier, "Anomaly detection in video sequence with appearance-motion correspondence," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1273–1283.
[55] S. Majhi, R. Dash, and P. K. Sa, "Temporal pooling in inflated 3dcnn for weakly-supervised video anomaly detection," in 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT). IEEE, 2020, pp. 1–6.
[56] S. Zhou, W. Shen, D. Zeng, M. Fang, Y. Wei, and Z. Zhang, "Spatial–temporal convolutional neural networks for anomaly detection and localization in crowded scenes," Signal Processing: Image Communication, vol. 47, pp. 358–368, 2016.
[57] Z. Yang, Y. Guo, J. Wang, D. Huang, X. Bao, and Y. Wang, "Towards video anomaly detection in the real world: A binarization embedded weakly-supervised network," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2023.
[58] M. Z. Zaheer, J.-h. Lee, M. Astrid, A. Mahmood, and S.-I. Lee, "Cleaning label noise with clusters for minimally supervised anomaly detection," arXiv preprint arXiv:2104.14770, 2021.
[59] D. Zhang, C. Huang, C. Liu, and Y. Xu, "Weakly supervised video anomaly detection via transformer-enabled temporal relation learning," IEEE Signal Processing Letters, vol. 29, pp. 1197–1201, 2022.
[60] Y. S. Chong and Y. H. Tay, "Abnormal event detection in videos using spatiotemporal autoencoder," in Advances in Neural Networks–ISNN 2017: 14th International Symposium, ISNN 2017, Sapporo, Hakodate, and Muroran, Hokkaido, Japan, June 21–26, 2017, Proceedings, Part II. Springer International Publishing, 2017, pp. 189–196.
[61] X. Zeng, Y. Jiang, W. Ding, H. Li, Y. Hao, and Z. Qiu, "A hierarchical spatio-temporal graph convolutional neural network for anomaly detection in videos," IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 1, pp. 200–212, 2023.
[62] H. Lv, Z. Yue, Q. Sun, B. Luo, Z. Cui, and H. Zhang, "Unbiased multiple instance learning for weakly supervised video anomaly detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8022–8031.
[63] J. Kukačka, V. Golkov, and D. Cremers, "Regularization for deep learning: A taxonomy," arXiv preprint arXiv:1710.10686, 2017.
[64] M. Cho, M. Kim, S. Hwang, C. Park, K. Lee, and S. Lee, "Look around for anomalies: Weakly-supervised anomaly detection via context-motion relational learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12137–12146.
[65] M. Mathieu, C. Couprie, and Y. LeCun, "Deep multi-scale video prediction beyond mean square error," arXiv preprint arXiv:1511.05440, 2015.
[66] C. Shi, C. Sun, Y. Wu, and Y. Jia, "Video anomaly detection via sequentially learning multiple pretext tasks," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10330–10340.
[67] M. Z. Zaheer, A. Mahmood, M. H. Khan, M. Segu, F. Yu, and S.-I. Lee, "Generative cooperative learning for unsupervised video anomaly detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14744–14754.
[68] D. Gong, L. Liu, V. Le, B. Saha, M. R. Mansour, S. Venkatesh, and A. v. d. Hengel, "Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1705–1714.
[69] R. Morais, V. Le, T. Tran, B. Saha, M. Mansour, and S. Venkatesh, "Learning regularity in skeleton trajectories for anomaly detection in videos," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11996–12004.
[70] C. Zhang, G. Li, Y. Qi, S. Wang, L. Qing, Q. Huang, and M.-H. Yang, "Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 16271–16280.
[71] C. Yan, S. Zhang, Y. Liu, G. Pang, and W. Wang, "Feature prediction diffusion model for video anomaly detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 5527–5537.
[72] J. Fioresi, I. R. Dave, and M. Shah, "TeD-SPAD: Temporal distinctiveness for self-supervised privacy-preservation for video anomaly detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 13598–13609.
[73] A. Singh, M. J. Jones, and E. G. Learned-Miller, "EVAL: Explainable video anomaly localization," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18717–18726.
[74] N. Van Eck and L. Waltman, "Software survey: VOSviewer, a computer program for bibliometric mapping," Scientometrics, vol. 84, no. 2, pp. 523–538, 2010.
[75] A. Awasthi, S. Ly, J. Nizam, S. Zare, V. Mehta, S. Ahmed, K. Shah, R. Nemani, S. Prasad, and H. Van Nguyen, "Anomaly detection in satellite videos using diffusion models," arXiv preprint arXiv:2306.05376, 2023.
[76] K. Batzner, L. Heckler, and R. König, "EfficientAD: Accurate visual anomaly detection at millisecond-level latencies," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 128–138.
[77] Y. Nie, H. Huang, C. Long, Q. Zhang, P. Maji, and H. Cai, "Interleaving one-class and weakly-supervised models with adaptive thresholding for unsupervised video anomaly detection," arXiv preprint arXiv:2401.13551, 2024.
[78] L. Zanella, B. Liberatori, W. Menapace, F. Poiesi, Y. Wang, and E. Ricci, "Delving into CLIP latent space for video anomaly recognition," arXiv preprint arXiv:2310.02835, 2023.
[79] A. Sethi, K. Saini, and S. M. Mididoddi, "Video anomaly detection using GAN," 2023.

Moshira Abdalla is a Graduate Research and Teaching Assistant (GRTA) and a Ph.D. student in Electrical and Computer Engineering at Khalifa University. She received her B.Sc. in Computer and Systems Engineering from Minya University, Egypt, in 2020 and her Master's degree in Electrical and Computer Engineering from the University of Ottawa, Canada, in 2022. Her research focuses on Computer Vision, Anomaly Detection, and Artificial Intelligence (AI).

Sajid Javed is a faculty member at Khalifa University (KU), UAE. Prior to that, he was a research fellow at KU from 2019 to 2021 and at the University of Warwick, U.K., from 2017 to 2018. He received his B.Sc. degree in computer science from the University of Hertfordshire, U.K., in 2010, and completed his combined Master's and Ph.D. degrees in computer science at Kyungpook National University, Republic of Korea, in 2017.

Muaz Al Radi is a Graduate Research and Teaching Assistant (GRTA) and a Ph.D. student in Electrical and Computer Engineering at Khalifa University. He received his B.Sc. degree in Sustainable and Renewable Energy Engineering from the University of Sharjah, Sharjah, UAE, in 2020 and his M.Sc. in Electrical and Computer Engineering from Khalifa University, Abu Dhabi, UAE, in 2022. His research focuses on Computer Vision, Anomaly Detection, Vision-based Control, Artificial Intelligence (AI), and Robotics.

Anwaar Ulhaq received the Ph.D. degree in artificial intelligence from Monash University, Australia. He is currently a Senior Lecturer (AI) with the School of Computing, Mathematics, and Engineering, Charles Sturt University, Australia. He has developed national and international recognition in computer vision and image processing, and his research has been featured 16 times in national and international news venues, including ABC News and IFIP (UNESCO). He is an active member of IEEE, ACS, and the Australian Academy of Sciences. As the Deputy Leader of the Machine Vision and Digital Health Research Group (MaViDH), he promotes AI research by mentoring junior researchers and supervising HDR students.

Naoufel Werghi is a Professor at the Department of Computer Science at Khalifa University for Science and Technology, UAE. He received his Habilitation and Ph.D. in Computer Vision from the University of Strasbourg. His main research area is 2D/3D image analysis and interpretation, where he has been leading several funded projects related to biometrics, medical imaging, remote sensing, and intelligent systems.