UCF-Crime Annotation: A Benchmark for Surveillance Video-and-Language Understanding

Tongtong Yuan1, Xuange Zhang1, Kun Liu2, Bo Liu1∗, Jian Jin3, Zhenzhen Jiao4
1 Beijing University of Technology
3 Institute of Industrial Internet of Things, CAICT
4 Beijing Teleinfo Technology Co., Ltd., CAICT
{yuantt, liubo}@bjut.edu.cn
∗ Corresponding author, Beijing University of Technology. E-mail: liubo@bjut.edu.cn

arXiv:2309.13925v1 [cs.CV] 25 Sep 2023

Preprint. Under review.

Abstract
Surveillance videos are an essential component of daily life, with various critical
applications, particularly in public security. However, current surveillance video tasks
mainly focus on classifying and localizing anomalous events. Although existing methods
achieve considerable performance, they are limited to detecting and classifying predefined
events, with unsatisfactory generalization ability and semantic understanding. To address
this issue, we construct the first multimodal surveillance video dataset by manually
annotating the real-world surveillance dataset UCF-Crime with fine-grained event content and
timing. Our newly annotated dataset, UCA (UCF-Crime Annotation), provides a novel benchmark
for multimodal surveillance video analysis. It not only describes events with detailed
captions but also provides precise temporal grounding of the events at 0.1-second intervals.
UCA contains 20,822 sentences, with an average length of 23 words, and its annotated videos
total 102 hours. Furthermore, we benchmark state-of-the-art models for multiple multimodal
tasks on this newly created dataset, including temporal sentence grounding in videos, video
captioning, and dense video captioning. Our experiments show that mainstream models, which
perform well on previously available public datasets, perform poorly in multimodal
surveillance video scenarios, highlighting the necessity of constructing this dataset. The
dataset and code are available at: https://github.com/Xuange923/UCA-dataset.

1 Introduction
Surveillance videos are crucial and indispensable for public security. In recent years, various
surveillance-video-oriented tasks, e.g., anomaly detection, have been widely studied. However,
existing surveillance video datasets provide only category labels of anomalous events and require
all anomalous categories to be predefined. Thus, the related methods remain limited to detecting
and classifying predefined events, lacking both generalization to other tasks and semantic
understanding of video content. To address these challenges, we propose extending conventional
surveillance video datasets for anomalous event classification to multimodal scenarios.
Specifically, we define several characteristics that a multimodal surveillance dataset should
possess: it should be composed of real-world videos, contain detailed event and timing
descriptions, and label as many events as possible.
In recent years, a large number of multimodal video datasets [30, 31, 33] have been released, on
which various multimodal learning tasks [17, 1, 16] have been explored. However, to our knowledge,
surveillance-video-oriented multimodal learning is still understudied. One main reason is that
current surveillance datasets have no sentence-level language annotations, so tasks such as video
captioning cannot be conducted. Moreover, learning multimodal patterns from conventional video
datasets is relatively easier than from surveillance datasets [35, 24], because conventional videos
generally have higher image quality, more recognizable events, and cleaner backgrounds. This calls
for a timely multimodal surveillance video dataset to enable the development of both multimodal
learning and surveillance AI.
In this work, we contribute a dataset termed UCA (UCF-Crime Annotation), created by manually adding
fine-grained annotations of event content and event timing to the real-world surveillance dataset
UCF-Crime. UCA not only captions the events in UCF-Crime with detailed descriptions but also
provides precise temporal grounding of the events at 0.1-second intervals. It can support multiple
multimodal understanding tasks, such as Temporal Sentence Grounding in Videos (TSGV) [17],
Video Captioning [1], and Dense Video Captioning [16]. Compared with UCF-Crime, the main features
of our UCA are: (1) although the base videos are sourced from UCF-Crime, some low-quality videos
are excluded; (2) the annotation is detailed and fine-grained, covering as many event descriptions
as possible; (3) considering the realistic demand for temporal localization in the security field,
the annotation includes event timing along with the activity descriptions. Overall, UCA is the
first real-world multimodal surveillance dataset with both language descriptions and event timing,
and it offers a valuable resource for developing multimodal understanding capabilities in the
surveillance field. To demonstrate the range of learning tasks that UCA supports, we also conduct
experiments on three mainstream multimodal learning tasks on the UCA dataset, and these results
can serve as baselines for multimodal surveillance learning.

2 Related Work

2.1 Video Datasets

Recently, numerous video datasets have been released for different video-language understanding
tasks, such as video captioning, dense video captioning, and temporal sentence grounding in videos
(TSGV). We review some video datasets widely used in these tasks. Textually Annotated Cooking
Scenes (TACoS) [30] was built on the "MPII Cooking Composite Activities" dataset [31]; it provides
detailed textual annotations for high-quality video recordings and contains 17,334 action tokens.
The ActivityNet Captions dataset has nearly 20k videos with a total duration of 849 hours; its
content is diverse and open, and its temporal annotations contain 100k descriptions in total.
Temporal Sentence Grounding in Videos, also known as Temporal Activity Localization via Language
query (TALL), is a video-language understanding task proposed by Gao et al. [11]. In [11], they
also built a new dataset for this task, Charades-STA, by adding clip-level sentence temporal
annotations to Charades. The original Charades dataset [33] has approximately 10k videos and 27k
activity descriptions from 157 activity categories; Charades-STA has 19,509 clip-sentence pairs
and can be used for the TALL task. DiDeMo [3] is collected from Flickr and contains various
activities and around 40k video-sentence pairs. Overall, these video datasets consist of
high-quality videos annotated with detailed descriptions, and they have played indispensable roles
in video-and-language tasks.
Besides the above regular video datasets, there are surveillance video datasets used for anomaly
detection. UCSD Ped1 [19] contains 70 surveillance videos with 40 events, and UCSD Ped2 [19]
contains 28 surveillance videos with 12 events. Both Ped1 and Ped2 are captured at a single
location, and the given abnormal cases are not realistic anomalies. The Avenue dataset [24] has 14
unusual events, such as running, throwing objects, and loitering, with a total of 35,240 frames.
The Subway dataset [2] has two categories, i.e., Entrance and Exit; abnormal events are limited to
walking in the wrong direction and loitering, and the videos were shot in an indoor environment.
The ShanghaiTech Campus dataset [25] has 13 scenes with complex lighting conditions and camera
angles and contains 130 abnormal events. Recently, the NWPU Campus dataset [4] has been released,
which is currently the largest and most complex dataset in the semi-supervised video anomaly
detection field, with 43 scenes and 28 categories of anomalous events.

The above anomaly datasets have limitations in the number of videos or in realism. Sultani et al.
[35] constructed a real-world surveillance video dataset called UCF-Crime, which consists of 1,900
surveillance videos covering 13 real-world anomalies, such as Abuse, Burglary, and Explosion.
However, the annotation of UCF-Crime includes only abnormal categories, so it can only be used for
anomaly detection. Additionally, the SAVCHOI paper [26] mentions annotations on the UCF-Crime
dataset for monitoring suspicious activities, but it annotates summaries for only 300 videos and
lacks detailed temporal information on activities. In contrast, our research aims to develop
multimodal surveillance video learning tasks. Hence, we manually annotated the event content and
event occurrence times for 1,854 videos from UCF-Crime, yielding UCF-Crime Annotation (UCA), a
multimodal surveillance video dataset.

2.2 Multimodal Video Learning

In the following, we briefly review some mainstream multimodal video learning tasks and related
methods.
Temporal sentence grounding in videos (TSGV) is a recent multimodal task [22, 17, 46] that
localizes a temporal activity in a given video with respect to a given language query, i.e., the
goal is to localize the start and end times of the described activity within the video. The
Cross-modal Temporal Regression Localizer (CTRL) [11], proposed by Gao et al., was the earliest
TSGV work. Yuan et al. [45] proposed semantic conditioned dynamic modulation (SCDM) for TSGV
learning, which improved the precision of temporal boundary prediction. He et al. [13] proposed
A2C, which formulated TSGV as a sequential decision-making process within a reinforcement
learning framework. Zhang et al. proposed 2D-TAN [49] and MS-2D-TAN [48], which learn 2D temporal
adjacent networks for TSGV and achieve better performance. Further, Wang et al. [43] proposed a
metric learning method built on 2D-TAN that mines negative samples.
Video captioning is a multimodal task of producing a natural-language utterance that describes the
visual content of a video. It plays an important role in automatic visual understanding and content
summarization [18, 1]. Venugopalan et al. [39] proposed a sequence-to-sequence model that
incorporates a stacked LSTM to generate a sequence of words. Wang et al. [40] proposed RecNet, a
reconstruction network with a novel encoder-decoder-reconstructor architecture for video
captioning. Pei et al. [29] proposed the Memory-Attended Recurrent Network (MARN), which utilizes
a memory structure to explore the correspondence between textual words and visual content in the
training data. Ryu et al. proposed SGN [32], which partially decodes the caption and encodes a
video into semantic groups with visual-textual alignment. Transformer-based architectures have
shown promise in multimodal learning in recent years [44]. Lin et al. [21] proposed SWINBERT, the
first end-to-end fully Transformer-based model for video captioning, which achieves clear
performance improvements.
Dense video captioning aims to temporally localize and caption all events in an untrimmed video.
It is more challenging than standard video captioning, which generates a single caption for a
video clip [1, 16]. Wang et al. proposed TDA-CG [41], which is based on a context-gating mechanism
and aims to balance the current event and its surrounding contexts. Wang et al. [42] proposed
end-to-end dense video captioning with parallel decoding (PDVC) based on a Transformer
architecture, formulating dense caption generation as a set prediction task. Zhang et al. proposed
UEDVC [47], which defines event detection as a sequence generation task and deploys a unified
framework to naturally enhance the relation between event detection and textual captioning.
Benefiting from its fine-grained manual annotations, our UCA dataset can be used in various
multimodal video learning tasks, including but not limited to the three tasks above.

3 The UCA Dataset

Our dataset is based on the UCF-Crime dataset, which is a real-world surveillance video dataset
containing 13 real-world anomalies and some normal videos. To extend UCF-Crime to a multi-
modal dataset, we conducted fine-grained language annotation on UCF-Crime, recording each event or
change in the videos with detailed descriptions and precise timestamps. The result of our
annotation is a novel dataset named UCA, which could be the first multimodal surveillance video
dataset for temporal sentence grounding, video captioning, and dense video captioning. In this
section, we provide a comprehensive outline of the UCA dataset, covering data collection and
annotation, dataset analysis, and comparison with existing datasets. By presenting these details,
we aim to provide a clear and concise understanding of the UCA dataset and its potential for
advancing research in the field of surveillance video-and-language understanding.

Table 1: Data split and video statistics in our UCA.
Video Statistics         UCF-Crime   UCA       UCA train   UCA val   UCA test
#Video                   1,900       1,854     1,164       380       310
Video length             128 h       121.8 h   75.4 h      25.2 h    21.2 h
Annotated video length   —           102.4 h   66.4 h      20.3 h    15.7 h

3.1 Collection and Annotation

During video collection, we filtered out some low-quality videos from the original UCF-Crime to
ensure the quality and fairness of our UCA. These low-quality videos contained repetition, severe
occlusion, or excessively accelerated playback, which affected the clarity of manual annotation
and the precision of event timing localization. We therefore collected 1,854 videos from the
UCF-Crime dataset, removing 46 videos from the original 1,900. The data split of UCA is shown in
Table 1.
When labeling a video from UCF-Crime, our goal is to make fine-grained annotations, i.e., to
describe each event that can be expressed in language in as much detail as possible, regardless of
whether it is abnormal. We also record the start and end time of each event in an individual video
at 0.1-second precision. The annotated video length of UCA is 102.4 hours, accounting for 84% of
the total video duration of UCA. Figure 1 presents some annotation examples in UCA.
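To make the annotation format concrete, the snippet below sketches what a single annotated video
entry could look like. The field names and the video identifier are hypothetical illustrations,
not the released schema; the actual format is documented in the project repository.

```python
# Hypothetical sketch of one UCA-style annotation entry.
# Field names and the video id are illustrative only; consult the released
# dataset for the real schema.
annotation = {
    "video_id": "RoadAccidents021_x264",  # hypothetical file name
    "duration": 120.0,                    # video length in seconds
    "events": [
        {
            "start": 10.3,                # recorded at 0.1-second precision
            "end": 25.7,
            "sentence": "A white car runs a red light and collides with a "
                        "motorcycle crossing the intersection.",
        },
    ],
}
```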
During the dataset annotation process, we selected 10 volunteers with computer science backgrounds
as annotators and 3 AI researchers as the review team, following [7, 14]. To ensure the accuracy
and consistency of the annotations, we provided the annotators with detailed annotation
instructions. These rules are designed to ensure that annotators use semantically rich language to
describe the events occurring in the videos clearly and accurately, and that they record the start
and end times of each event precisely. Before starting the annotation work, we provided
comprehensive training to all annotators. During annotation, annotators repeatedly watched the
videos to ensure that their localization and description of events were accurate. To ensure
annotation quality, after every 100 annotated instances, annotators were required to undergo
validation by the review team, which focused on the quality and consistency of annotations
provided by different annotators. After all annotations were completed, the reviewers further
reviewed the data. The total duration of the video annotation work exceeds 1,000 person-hours, and
the total review duration is about 500 person-hours.

Figure 1: Examples in our UCA dataset, including the time axis and sentence queries.

Table 2: Abnormal and normal video splits of our UCA dataset.
                    Train                  Val                    Test
Statistics          Abnormal   Normal      Abnormal   Normal      Abnormal   Normal
#Video              574        590         162        218         206        104
Video length (h)    20.91      54.54       5.88       15.30       6.85       18.38

Table 3: Statistics of the annotations in our UCA dataset, including the number of queries and the
number of words per query.
Annotation Statistics   Train    Val     Test    Summary
#Query                  13,864   3,124   3,834   20,822
#Word/Query             23.79    21.00   23.27   23.27
#Nouns                  22,965   6,156   5,000   34,121
#Verbs                  25,445   6,465   5,076   36,986
#Adj                    11,531   3,224   2,339   17,094

3.2 Dataset Analysis

As shown in Table 1, the UCA dataset comprises 1,854 videos, divided into three subsets: train,
validation, and test. Moreover, following the partitioning of the original UCF-Crime dataset, UCA
is categorized into two main groups, "Abnormal" and "Normal" videos, as shown in Table 2. In this
context, "Abnormal" videos denote those containing exceptional occurrences or criminal activities
in the original videos. These abnormal instances cover 13 real-world anomalies: Abuse, Arrest,
Arson, Assault, Burglary, Explosion, Fighting, RoadAccidents, Robbery, Shooting, Shoplifting,
Stealing, and Vandalism [35]. In the video modality, UCA differs little from the original
UCF-Crime dataset, so we focus on analyzing statistics of the language modality. As shown in
Table 1, the total duration of the annotated video is 102.4 hours. Figure 2 (a) shows the
distribution of the duration of each annotated event in our dataset: the most common event
duration is 5-10 seconds, the majority of events are shorter than 30 seconds, and a few events
exceed 40 seconds.

Figure 2: (a) The duration of annotated events and (b) the number of words in the annotated
queries of the UCA dataset.

Table 3 shows the number of query descriptions for the events we labeled and the average number of
words per query; the average is around 23 words. The distribution of the number of words in the
annotated queries of UCA is shown in Figure 2 (b): sentences with 10-20 words are the most
frequent, followed by sentences with 20-30 words. Additionally, we report vocabulary statistics
for the UCA queries in Table 3: the ratio of nouns, verbs, and adjectives across the train, val,
and test sentences is approximately 2:2:1.
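As a rough illustration of how such part-of-speech statistics could be reproduced, the sketch
below tags the annotated queries with NLTK. The authors' actual tooling is not specified in the
paper, so this is only an assumed workflow.

```python
# A minimal sketch of tallying noun/verb/adjective counts over annotated queries,
# assuming NLTK with the 'punkt' and 'averaged_perceptron_tagger' resources installed
# (nltk.download(...)). The paper does not specify the tool actually used.
from collections import Counter
import nltk

def pos_counts(queries):
    counts = Counter()
    for query in queries:
        for _, tag in nltk.pos_tag(nltk.word_tokenize(query)):
            if tag.startswith("NN"):    # Penn Treebank noun tags
                counts["noun"] += 1
            elif tag.startswith("VB"):  # verb tags
                counts["verb"] += 1
            elif tag.startswith("JJ"):  # adjective tags
                counts["adj"] += 1
    return counts

print(pos_counts(["A man in a black jacket walks into the store and hides a bottle in his coat."]))
```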

3.3 Comparison with Existing Datasets

Difference with video-language datasets. Table 4 presents a comparison between our UCA dataset and
video-language datasets for multimodal learning tasks. The comparison shows that our dataset has a
moderate number of annotated sentences but the highest average number of words per sentence,
indicating that our annotated sentence descriptions are more specific than those of other
datasets. The primary difference between UCA and other video-language datasets stems from the
characteristics of real-world surveillance videos: the domain of the videos (see Table 4) and the
video quality (i.e., uneven image quality, complex backgrounds, and complex events in surveillance
videos). Consequently, for the same learning task, the learning difficulty on our dataset is often
much higher than on a regular video dataset. The subsequent experiments report the results of
several state-of-the-art learning methods on our dataset, highlighting this difficulty.

Table 4: Comparison of the statistics of our UCA and other standard video datasets for multimodal
learning tasks. "Avg word" is the average number of words per sentence; "Temp. anno." indicates
temporal annotation. The statistics of previous datasets are taken from [8].
Dataset                     Domain         #Videos   Duration (h)   #Queries   Avg word   Temp. anno.
MSVD [6]                    Open           1,970     5.3            70,028     8.7        ✓
TACoS [30]                  Cooking        127       10.1           18,818     9.0        ✓
YouCook [9]                 Cooking        88        2.3            3,502      12.6       ✗
MPII-MD [31]                Movie          94        73.6           68,375     9.6        ✓
ActivityNet Captions [16]   Open           20,000    849.0          73,000     13.5       ✓
DiDeMo [3]                  Open           10,464    144.2          41,206     7.5        ✓
UCA                         Surveillance   1,854     122            20,822     23.27      ✓
Difference with abnormal video datasets. Our work differs from abnormal video datasets [35, 25] in
one key respect: we conducted the first sentence-level annotation of events in real monitored
scenarios. In contrast to the original simple category annotation, we provide specific,
fine-grained event descriptions. As a result, our dataset can be used for various
video-and-language understanding tasks in monitored scenarios. To our knowledge, research on such
monitored scenarios is still largely unexplored, and it holds great practical significance.

3.4 Ethical Considerations

The videos used in our research originate from the publicly available UCF-Crime dataset [35],
which has not raised any privacy or ethical concerns; we also note that many recent papers in the
anomaly detection field cite this dataset. During our annotation procedure, we implemented a range
of measures to mitigate potential risks. Additionally, we found that the source videos do not
focus on any specific race or gender and instead cover a diverse range of scenes. Thus, our UCA
dataset, which adds language-level annotations, does not directly raise any privacy or ethical
concerns. Users should strictly comply with our data license agreement, which can be found in the
appendix or on our project homepage. We explicitly restrict its usage to academic and research
purposes, and we welcome suggestions from the research community to improve the quality and
ethical reliability of the dataset.

4 Experiments

We conduct experiments on three multimodal tasks on the UCA dataset with TensorFlow and PyTorch,
using an RTX 3090 GPU. For each task, we first describe the task, the evaluation metrics, and the
baseline methods, and then present the performance of the selected baselines. The basic
experimental settings for the methods used in this section follow their respective published
papers and code repositories.

4.1 Temporal Sentence Grounding in Videos

Task. TSGV aims to localize a temporal activity in a given video with respect to a given language
query [22].
Metric. Following existing methods [11, 45, 48], we adopt R@K for IoU=θ as the evaluation metric,
defined as the percentage of queries for which at least one of the top-K predicted moments has an
IoU with the ground-truth moment larger than θ [11]. In the following, we report R@K for IoU=θ
with K = 1, 5 and θ = 0.3, 0.5, 0.7.
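The metric can be written down compactly. The sketch below is a minimal reference implementation
of temporal IoU and R@K, assuming each query comes with a ranked list of predicted (start, end)
moments in seconds; it is an illustration, not the official evaluation script of any baseline.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) moments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_k(results, k, threshold):
    """results: list of (ranked_predictions, ground_truth) pairs, one per query."""
    hits = sum(
        any(temporal_iou(p, gt) >= threshold for p in preds[:k])
        for preds, gt in results
    )
    return hits / len(results)

# One query whose top-1 prediction overlaps the ground truth with IoU > 0.5.
results = [([(10.0, 24.0), (3.0, 8.0)], (11.0, 26.0))]
print(recall_at_k(results, k=1, threshold=0.5))  # 1.0
```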

Baselines. We benchmark UCA with six methods: CTRL [11], SCDM [45], A2C [13], 2D-TAN [49],
LGI [27], and MMN [43]. For all of these methods, visual features are extracted with C3D [37] on
our UCA.
Implementation Details. We use C3D [37] pretrained on Sports-1M to extract video features for all
experiments, taking the 4,096-dimensional output of the FC6 layer as the video feature. For CTRL,
we employ skip-thought [15] as the sentence encoder, resulting in 4,800-dimensional sentence
features. For SCDM, the length of input video clips is set to 256 to accommodate the temporal
convolution. For A2C, we also obtain sentence features via skip-thought. For 2D-TAN, the number of
video clips is set to 16, and the scaling thresholds tmin and tmax are set to 0.5 and 1.0; we
choose the stacked convolution approach to extract moment-level features. For LGI, the number of
sampled segments per video is 128, and the maximum query length is 50. The settings of MMN follow
2D-TAN, except that max-pooling is used when obtaining moment-level features.
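As an illustration of the clip-level feature extraction described above, the following sketch
slices a video into non-overlapping 16-frame clips and passes each through a C3D network truncated
at its FC6 layer. Here `c3d_fc6` is a placeholder for such a pretrained model (not bundled with
PyTorch); the paper uses Sports-1M weights, and the actual extraction follows the baselines' code.

```python
# A minimal sketch of extracting 4,096-dim FC6 features from non-overlapping
# 16-frame clips, assuming `c3d_fc6` is a pretrained C3D network truncated at FC6.
import torch

def extract_clip_features(frames, c3d_fc6, clip_len=16):
    """frames: float tensor of shape (T, 3, 112, 112), already normalized."""
    feats = []
    for start in range(0, frames.shape[0] - clip_len + 1, clip_len):
        clip = frames[start:start + clip_len]            # (16, 3, 112, 112)
        clip = clip.permute(1, 0, 2, 3).unsqueeze(0)     # (1, 3, 16, 112, 112)
        with torch.no_grad():
            feats.append(c3d_fc6(clip).squeeze(0))       # 4,096-dim FC6 output
    return torch.stack(feats)                            # (num_clips, 4096)
```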
Performance. The results in Table 5 indicate that the TSGV task is highly challenging in the
surveillance domain. For R@1 at IoU = 0.3, all results are below 10%, and as the IoU threshold
increases to 0.7, the best R@1 result is a mere 1.88%. This underscores the difficulty of precise
event localization in surveillance videos. Among the benchmarked methods, the A2C model, which
relies on a reinforcement learning mechanism, performs relatively poorly on the UCA dataset, while
the two models based on 2D temporal adjacent networks, 2D-TAN and MMN, achieve the best overall
performance. This advantage can possibly be attributed to their ability to capture richer
contextual information from video data and thus better model the temporal characteristics of
events.

Table 5: Benchmarking of TSGV baselines on our UCA dataset.
                                          IoU=0.3          IoU=0.5          IoU=0.7
Method        Features   Batch Size       R@1     R@5      R@1    R@5       R@1    R@5
CTRL [11]     C3D        32               6.08    19.93    3.05   10.69     0.73   3.23
SCDM [45]     C3D        8                6.37    11.25    3.58   7.10      1.59   3.08
A2C [13]      C3D        16               4.56    -        1.98   -         0.50   -
2D-TAN [49]   C3D        32               8.45    21.28    4.17   12.47     1.85   5.56
LGI [27]      C3D        64               8.97    -        3.99   -         1.54   -
MMN [43]      C3D        8                9.55    22.30    4.83   13.85     1.88   6.94

4.2 Video Captioning

Task. The goal of video captioning is to understand a video clip and describe it in natural
language [1].
Metric. We evaluate performance with the metrics commonly used in [21, 32]: Bilingual Evaluation
Understudy (BLEU) [B@n] [28], Metric for Evaluation of Translation with Explicit Ordering
(METEOR) [M] [10], Recall-Oriented Understudy for Gisting Evaluation (ROUGE-L) [R] [20], and
Consensus-based Image Description Evaluation (CIDEr) [C] [38].
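The reported numbers come from each baseline's own evaluation code; as an illustrative alternative,
the sketch below shows how these metrics could be computed with the pycocoevalcap package (METEOR
is omitted because it additionally requires a Java runtime). Inputs are assumed to be dicts mapping
a video id to lists of caption strings.

```python
# A hedged sketch of caption evaluation, assuming the pycocoevalcap package is installed.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

def caption_scores(references, hypotheses):
    """references/hypotheses: dict video_id -> list of caption strings."""
    bleu, _ = Bleu(4).compute_score(references, hypotheses)   # [B@1, B@2, B@3, B@4]
    rouge, _ = Rouge().compute_score(references, hypotheses)  # ROUGE-L
    cider, _ = Cider().compute_score(references, hypotheses)  # CIDEr
    return {"BLEU": bleu, "ROUGE-L": rouge, "CIDEr": cider}

refs = {"v1": ["a man breaks the window of a parked car"]}   # hypothetical example
hyps = {"v1": ["a man smashes a car window"]}
print(caption_scores(refs, hyps))
```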
Baselines. We select five video captioning methods as baselines: S2VT [39], RecNet [40],
MARN [29], SGN [32], and SwinBERT [21].
Implementation Details. For S2VT, we use two pretrained models, Inception V4 [36] and VGG 16 [34],
to extract visual features, and each video segment is fixed to 80 frames. RecNet and MARN use the
same Inception V4 feature extractor. For the sentences, we remove punctuation marks, split on
spaces, and convert all words to lowercase; sentences longer than 80 words are truncated. For
SwinBERT, the VidSwin backbone is initialized with Kinetics-600 pretrained weights [23]. For SGN,
we uniformly sample 30 frames and clips from each video, and the visual encoder uses pretrained
ResNet and 3D-ResNeXt [12] models. The ResNeXt features in MARN and SGN serve as motion features,
capturing the dynamic variation between video frames.
Performance. The performance of the video captioning baselines is presented in Table 6. Among
these baselines, the MARN and SGN models perform best on the UCA dataset, which we attribute
primarily to their effective integration of motion and appearance features. Notably, the SGN
model, owing to its semantic group design, achieves a significant improvement by better aligning
annotated information with video frames. However, the most recent model, SwinBERT, yields
comparatively lower results. This could be attributed to the pretrained weights provided by the
original authors, which might not capture the characteristics of surveillance videos well. This
observation also reflects the inherent challenges of the UCA dataset.

Table 6: Benchmarking of video captioning baselines on the UCA dataset.
Method               Features                     Batch Size   B1      B2      B3      B4      M       R       C
S2VT [39]            Inception V4                 64           21.00   12.42   7.47    4.48    9.42    22.71   13.06
S2VT [39]            VGG 16                       64           22.90   12.95   7.44    4.24    8.53    22.06   10.10
RecNet_global [40]   Inception V4                 64           23.49   14.37   8.86    5.58    9.21    24.77   12.17
RecNet_local [40]    Inception V4                 64           23.04   14.39   9.07    5.94    9.29    25.09   12.49
SwinBERT [21]        End-to-end                   6            15.60   8.07    4.34    2.12    6.92    17.06   8.72
MARN [29]            Inception V4 + ResNeXt101    32           24.92   13.14   7.63    4.39    9.34    25.29   11.93
SGN [32]             ResNet101 + ResNeXt101       16           29.93   18.80   12.17   7.90    12.85   27.94   16.73

4.3 Dense Video Captioning

Task. Dense video captioning aims to temporally localize and caption all events in an untrimmed
video.
Metric. We assess performance from two perspectives. For localization, we use the evaluation tools
provided by the ActivityNet Captions Challenge 2018 and compute the average precision and average
recall across IoU thresholds of 0.3, 0.5, 0.7, and 0.9. For dense captioning, we report the
classic caption metrics BLEU, METEOR, and CIDEr. Furthermore, to gain a more comprehensive view of
how well a model describes the video story, we also compute the SODA_c score.
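To make the localization metric concrete, the sketch below computes event-level recall and
precision at a single IoU threshold, in the spirit of the ActivityNet Captions evaluation; the
official toolkit should be used to reproduce the reported numbers.

```python
# A minimal sketch of event-level recall/precision at one IoU threshold.
# pred_events and gt_events are lists of (start, end) pairs in seconds.
def localization_scores(pred_events, gt_events, threshold):
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0

    # Recall: fraction of ground-truth events matched by at least one prediction.
    recall = sum(
        any(iou(p, g) >= threshold for p in pred_events) for g in gt_events
    ) / max(len(gt_events), 1)
    # Precision: fraction of predictions matching at least one ground-truth event.
    precision = sum(
        any(iou(p, g) >= threshold for g in gt_events) for p in pred_events
    ) / max(len(pred_events), 1)
    return recall, precision
```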
Baselines. We select three dense video captioning methods with different input features as our
baselines: TDA-CG [41], PDVC [42], and UEDVC [47]. The visual input features for all of these
methods come in two varieties, C3D [37] and I3D [5], both extracted from our UCA.
Table 7: Event localization on the UCA dataset.
                       Recall                                    Precision
Method     Features    0.3     0.5     0.7     0.9     avg       0.3     0.5     0.7    0.9    avg
TDA-CG     C3D         91.35   84.42   73.82   45.15   73.68     54.22   26.12   8.21   0.94   22.37
PDVC       C3D         95.63   89.60   70.09   14.01   67.33     48.73   21.48   6.50   0.66   19.34
TDA-CG     I3D         73.61   61.84   48.10   30.82   53.59     57.50   28.49   8.72   0.88   23.90
PDVC       I3D         95.69   89.91   72.11   16.18   68.47     46.53   20.13   6.05   0.76   18.37

Table 8: Dense captioning on the UCA dataset with predicted proposals.
Method     Features   B1     B2     B3     B4     M      C      SODA_c
TDA-CG     C3D        6.07   2.87   1.28   0.49   3.06   2.76   0.16
PDVC       C3D        8.09   4.05   1.75   0.56   4.00   8.86   1.21
TDA-CG     I3D        6.28   3.26   1.40   0.50   3.08   2.39   0.11
PDVC       I3D        8.47   4.60   2.33   0.91   4.45   7.65   1.28

Implementation Details. For the three models, we use pretrained C3D and I3D models for video
feature extraction. In TDA-CG, the maximum frame length is set to 300, and frames beyond this
length are truncated. For PDVC, the number of event queries is 100. Similarly, for UEDVC, the
maximum frame length is set to 300. The batch size for all dense video captioning experiments is
set to 1.
Performance. The localization results on the UCA dataset are shown in Table 7. Overall, the models
tend to achieve relatively high recall but relatively low precision. Across the different IoU
thresholds, PDVC consistently outperforms TDA-CG, indicating PDVC's notable competence in event
localization.

The dense captioning results are given in Table 8 and Table 9. With predicted proposals, PDVC
surpasses TDA-CG on most metrics, especially BLEU and METEOR, implying superior n-gram matching
and semantic alignment in PDVC-generated captions. However, all methods exhibit suboptimal CIDEr
performance, suggesting room for improving caption diversity. In terms of the SODA_c metric,
PDVC's captions capture the overall video story somewhat better than TDA-CG's. With ground-truth
proposals, PDVC outperforms UEDVC, especially on METEOR and CIDEr, indicating better semantic
consistency and diversity.
In summary, PDVC shows commendable performance on the UCA dataset. These findings provide useful
guidance for further improving dense video captioning and for generating more accurate and diverse
captions.

4.4 Testing Performance on Abnormal and Normal Videos of UCA

Given that our dataset can be divided into two major categories, abnormal and normal videos (as
shown in Table 2), we conduct a further series of experiments using the previously trained models
to investigate performance differences between the two categories. For each task, we select two
methods with relatively good performance for the evaluation. The results for the three multimodal
tasks are presented in Table 10 and Table 11.
In the TSGV experiments, we observe that performance on normal videos lags significantly behind
that on abnormal videos. Examining the differences, we find that normal videos tend to be longer
than abnormal ones, with relatively minor variation in event dynamics. While this makes event
localization in normal videos harder, we contend that this characteristic could improve video
storytelling performance in dense video captioning. Furthermore, the SGN approach with ResNeXt101
features yields notably superior results on the UCA normal videos in video captioning, which could
be attributed to the model's emphasis on capturing dynamic variation; this offers valuable
insights for advancing multimodal surveillance learning research.

5 Future Work

The experimental results on the three multimodal tasks reveal that significant challenges remain
in learning from multimodal surveillance video datasets. We describe these challenges from two
perspectives, the dataset and the model, and propose some future directions.
Dataset Challenges. Compared with previous multimodal video datasets, surveillance video datasets
exhibit unique characteristics such as uneven image quality, complex backgrounds, and complex
events. Because collecting surveillance video data remains difficult, our UCA, built on UCF-Crime,
also suffers from an insufficient data volume. Consequently, research with multimodal surveillance
videos still faces significant challenges in terms of dataset quality and quantity. In the future,
we plan to annotate newly released surveillance videos [4].
Model Challenges. Current models are well trained on standard datasets, but their performance
deteriorates significantly when trained on the UCA dataset. To mitigate this, it is crucial to
retrain models on large amounts of data, and the model architecture should be redesigned according
to the unique characteristics of surveillance video datasets. In the future, we will pay more
attention to capturing the dynamic characteristics of videos and to localizing events in long
videos.

Table 9: Dense captioning on the UCA dataset with ground-truth proposals.
Method    Features   B1      B2      B3     B4     M       C
PDVC      C3D        22.54   11.63   5.51   2.21   10.06   16.39
UEDVC     C3D        23.78   13.56   7.76   4.52   9.30    11.32
PDVC      I3D        22.35   11.99   5.97   2.32   10.47   20.44
UEDVC     I3D        25.66   14.43   8.23   4.81   9.98    13.68

Table 10: Testing performance on abnormal and normal videos for TSGV.
                              IoU=0.3          IoU=0.5         IoU=0.7
Method        Partition       R@1     R@5      R@1    R@5      R@1    R@5
2D-TAN [49]   Abnormal        14.52   37.30    6.88   22.16    3.58   10.05
              Normal          4.28    9.74     2.23   5.88     1.18   3.36
MMN [43]      Abnormal        17.21   39.57    8.81   25.67    3.51   12.59
              Normal          4.87    11.76    2.39   6.64     0.88   3.49

Table 11: Testing performance on abnormal and normal videos for two tasks: video captioning and
dense video captioning with predicted proposals.
Task                     Method               Partition   B1      B2      B3      B4      M       C       R       SODA_c
Video captioning         RecNet_global [40]   Abnormal    26.18   14.06   7.40    4.01    9.00    11.92   23.16   -
                                              Normal      20.65   13.26   8.55    5.58    9.31    10.87   25.77   -
                         SGN [32]             Abnormal    20.77   11.74   6.53    3.55    11.03   9.83    23.39   -
                                              Normal      35.86   23.32   15.69   10.55   13.67   18.47   30.76   -
Dense video captioning   TDA-CG [41]          Abnormal    6.03    2.72    1.14    0.37    3.14    3.15    -       0.13
                                              Normal      6.15    3.17    1.55    0.72    2.91    1.99    -       0.19
                         PDVC [42]            Abnormal    8.10    3.99    1.65    0.52    4.07    9.69    -       0.96
                                              Normal      8.07    4.19    1.97    0.65    3.85    7.24    -       1.63

6 Conclusions
In the current AI field, there is a significant gap in research on multimodal surveillance video
datasets, despite their potential contribution to social security and daily life. To bridge this
gap, we propose UCA, the first multimodal surveillance video dataset, derived by re-annotating the
UCF-Crime dataset. Our annotation provides detailed event descriptions and start-end time marks
for normal and abnormal events occurring in surveillance videos, resulting in a multimodal
surveillance video dataset with approximately 20,000 sentence-level descriptions and frame-level
time records. Moreover, we conducted experiments on three multimodal tasks using the UCA dataset
and evaluated a set of benchmark methods. Based on these evaluations, we identify significant
opportunities in this field and emphasize the need for further research on both the dataset and
the model aspects.

Acknowledgement
We gratefully acknowledge the numerous members of BJUT who helped us with the annotations. This
work has been supported by the China Postdoctoral Science Foundation (No. 2021M690270) and the
Chaoyang District Postdoctoral Foundation.

References
[1] Moloud Abdar, Meenakshi Kollati, Swaraja Kuraparthi, Farhad Pourpanah, Daniel McDuff, Mo-
hammad Ghavamzadeh, Shuicheng Yan, Abduallah Mohamed, Abbas Khosravi, Erik Cambria,
et al. A review of deep learning for video captioning. arXiv preprint arXiv:2304.11431, 2023.
[2] Amit Adam, Ehud Rivlin, Ilan Shimshoni, and David Reinitz. Robust real-time unusual event
detection using multiple fixed-location monitors. IEEE transactions on pattern analysis and
machine intelligence, 30(3):555–560, 2008.
[3] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan
Russell. Localizing moments in video with natural language. In Proceedings of the IEEE
international conference on computer vision, pages 5803–5812, 2017.
[4] Congqi Cao, Yue Lu, Peng Wang, and Yanning Zhang. A new comprehensive benchmark for
semi-supervised video anomaly detection and anticipation. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), pages 20392–20401, June
2023.
[5] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the
kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 6299–6308, 2017.
[6] David Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In
Proceedings of the 49th annual meeting of the association for computational linguistics: human
language technologies, pages 190–200, 2011.
[7] Du Chen, Jie Liang, Xindong Zhang, Ming Liu, Hui Zeng, and Lei Zhang. Human guided
ground-truth generation for realistic image super-resolution. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 14082–14091, 2023.
[8] Shaoxiang Chen, Ting Yao, and Yu-Gang Jiang. Deep learning for video captioning: A review.
In IJCAI, volume 1, page 2, 2019.
[9] Pradipto Das, Chenliang Xu, Richard F Doell, and Jason J Corso. A thousand frames in just
a few words: Lingual description of videos through latent topics and sparse object stitching.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
2634–2641, 2013.
[10] Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation
for any target language. In Proceedings of the ninth workshop on statistical machine translation,
pages 376–380, 2014.
[11] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization
via language query. In Proceedings of the IEEE international conference on computer vision,
pages 5267–5275, 2017.
[12] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3d cnns retrace the
history of 2d cnns and imagenet? In Proceedings of the IEEE conference on Computer Vision
and Pattern Recognition, pages 6546–6555, 2018.
[13] Dongliang He, Xiang Zhao, Jizhou Huang, Fu Li, Xiao Liu, and Shilei Wen. Read, watch,
and move: Reinforcement learning for temporally grounding natural language descriptions in
videos. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages
8393–8400, 2019.
[14] Muhammad Haris Khan, John McDonagh, Salman Khan, Muhammad Shahabuddin, Aditya
Arora, Fahad Shahbaz Khan, Ling Shao, and Georgios Tzimiropoulos. Animalweb: A large-
scale hierarchical dataset of annotated animal faces. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 6939–6948, 2020.

[15] Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio
Torralba, and Sanja Fidler. Skip-thought vectors. Advances in neural information processing
systems, 28, 2015.
[16] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-
captioning events in videos. In Proceedings of the IEEE international conference on computer
vision, pages 706–715, 2017.
[17] Xiaohan Lan, Yitian Yuan, Xin Wang, Zhi Wang, and Wenwu Zhu. A survey on temporal
sentence grounding in videos. ACM Transactions on Multimedia Computing, Communications
and Applications, 19(2):1–33, 2023.
[18] Sheng Li, Zhiqiang Tao, Kang Li, and Yun Fu. Visual to text: Survey of image and video
captioning. IEEE Transactions on Emerging Topics in Computational Intelligence, 3(4):297–
312, 2019.
[19] Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in
crowded scenes. IEEE transactions on pattern analysis and machine intelligence, 36(1):18–32,
2013.
[20] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization
branches out, pages 74–81, 2004.
[21] Kevin Lin, Linjie Li, Chung-Ching Lin, Faisal Ahmed, Zhe Gan, Zicheng Liu, Yumao Lu, and
Lijuan Wang. Swinbert: End-to-end transformers with sparse attention for video captioning. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
17949–17958, 2022.
[22] Meng Liu, Liqiang Nie, Yunxiao Wang, Meng Wang, and Yong Rui. A survey on video moment
localization. ACM Computing Surveys, 55(9):1–37, 2023.
[23] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin
transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern
recognition, pages 3202–3211, 2022.
[24] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In
Proceedings of the IEEE international conference on computer vision, pages 2720–2727, 2013.
[25] Weixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection
in stacked rnn framework. In Proceedings of the IEEE International Conference on Computer
Vision (ICCV), Oct 2017.
[26] Ansh Mittal, Shuvam Ghosal, and Rishibha Bansal. Savchoi: Detecting suspicious activities
using dense video captioning with human object interactions. arXiv preprint arXiv:2207.11838,
2022.
[27] Jonghwan Mun, Minsu Cho, and Bohyung Han. Local-global video-text interactions for
temporal grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 10810–10819, 2020.
[28] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic
evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association
for Computational Linguistics, pages 311–318, 2002.
[29] Wenjie Pei, Jiyuan Zhang, Xiangrong Wang, Lei Ke, Xiaoyong Shen, and Yu-Wing Tai. Memory-
attended recurrent network for video captioning. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 8347–8356, 2019.
[30] Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and
Manfred Pinkal. Grounding action descriptions in videos. Transactions of the Association for
Computational Linguistics, 1:25–36, 2013.
[31] Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. A dataset for movie
description. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 3202–3212, 2015.
[32] Hobin Ryu, Sunghun Kang, Haeyong Kang, and Chang D Yoo. Semantic grouping network for
video captioning. In proceedings of the AAAI Conference on Artificial Intelligence, volume 35,
pages 2514–2522, 2021.

[33] Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta.
Hollywood in homes: Crowdsourcing data collection for activity understanding. In Computer
Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14,
2016, Proceedings, Part I 14, pages 510–526. Springer, 2016.
[34] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
[35] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance
videos. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 6479–6488, 2018.
[36] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander Alemi. Inception-v4,
inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI
conference on artificial intelligence, volume 31, 2017.
[37] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spa-
tiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international
conference on computer vision, pages 4489–4497, 2015.
[38] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image
description evaluation. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 4566–4575, 2015.
[39] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor
Darrell, and Kate Saenko. Sequence to sequence-video to text. In Proceedings of the IEEE
international conference on computer vision, pages 4534–4542, 2015.
[40] Bairui Wang, Lin Ma, Wei Zhang, and Wei Liu. Reconstruction network for video captioning.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
7622–7631, 2018.
[41] Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, and Yong Xu. Bidirectional attentive fusion
with context gating for dense video captioning. In CVPR, 2018.
[42] Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, and Ping Luo. End-to-end
dense video captioning with parallel decoding. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 6847–6857, 2021.
[43] Zhenzhi Wang, Limin Wang, Tao Wu, Tianhao Li, and Gangshan Wu. Negative sample matters:
A renaissance of metric learning for temporal grounding. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 36, pages 2613–2623, 2022.
[44] Peng Xu, Xiatian Zhu, and David A Clifton. Multimodal learning with transformers: A survey.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[45] Yitian Yuan, Lin Ma, Jingwen Wang, Wei Liu, and Wenwu Zhu. Semantic conditioned dynamic
modulation for temporal sentence grounding in videos. Advances in Neural Information
Processing Systems, 32, 2019.
[46] Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. Temporal sentence grounding in
videos: A survey and future directions. arXiv preprint arXiv:2201.08071, 2022.
[47] Qi Zhang, Yuqing Song, and Qin Jin. Unifying event detection and captioning as sequence
generation via pre-training. In Computer Vision–ECCV 2022: 17th European Conference, Tel
Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI, pages 363–379. Springer, 2022.
[48] Songyang Zhang, Houwen Peng, Jianlong Fu, Yijuan Lu, and Jiebo Luo. Multi-scale 2d temporal
adjacency networks for moment localization with natural language. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 44(12):9073–9087, 2021.
[49] Songyang Zhang, Houwen Peng, Jianlong Fu, and Jiebo Luo. Learning 2d temporal adjacent
networks for moment localization with natural language. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 34, pages 12870–12877, 2020.

Checklist
1. For all authors...

(a) Do the main claims made in the abstract and introduction accurately reflect the paper’s
contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes]
(c) Did you discuss any potential negative societal impacts of your work? [N/A]
(d) Have you read the ethics review guidelines and ensured that your paper conforms to
them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments (e.g. for benchmarks)...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental
results (either in the supplemental material or as a URL)? [Yes] See the GitHub link in the
abstract.
(b) Did you specify all the training details (e.g., data splits, hyperparameters, how they
were chosen)? [Yes]
(c) Did you report error bars (e.g., with respect to the random seed after running experi-
ments multiple times)? [N/A]
(d) Did you include the total amount of compute and the type of resources used (e.g., type
of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a) If your work uses existing assets, did you cite the creators? [Yes]
(b) Did you mention the license of the assets? [Yes]
(c) Did you include any new assets either in the supplemental material or as a URL? [Yes]
(d) Did you discuss whether and how consent was obtained from people whose data you’re
using/curating? [N/A]
(e) Did you discuss whether the data you are using/curating contains personally identifiable
information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
(a) Did you include the full text of instructions given to participants and screenshots, if
applicable? [N/A]
(b) Did you describe any potential participant risks, with links to Institutional Review
Board (IRB) approvals, if applicable? [N/A]
(c) Did you include the estimated hourly wage paid to participants and the total amount
spent on participant compensation? [N/A]
