MMM2023 Duy
Abstract. Dashcam video has recently become popular because it improves the safety of both individuals and communities. While individuals gain undeniable evidence for legal and insurance purposes, communities can benefit from shared dashcam videos for traffic education and criminal investigation. Moreover, leveraging recent developments in computer vision and AI, a few companies have launched so-called AI dashcams that can alert drivers to near-risk accidents (e.g., following-distance detection, forward-collision warning) to improve driver safety. Unfortunately, even though dashcam videos create a driver's travel log (i.e., a traveling diary), little research focuses on creating a valuable and friendly tool that can find an incident or event from a few sketches described by users. Inspired by these observations, we introduce an interactive incident detection and retrieval system for first-view travel-log data that can retrieve fine-grained incidents of both defined and undefined types. The system gives promising results when evaluated on several public datasets and against popular text-image retrieval methods. The source code is published at
https://github.com/PDD0911-HCMUS/Cross_Model_Attention
1 Introduction
Intelligent transportation systems have become an essential topic for improving the quality of current transportation and reducing street accidents in every country. Moreover, with the development of telecommunication and information technology over the last few years, people have had more and more opportunities to integrate such systems into their vehicles and receive the associated support whenever they drive. To build an intelligent transportation system, different companies utilize recently advanced techniques in computer vision, signal processing, and
2 Dinh-Duy Pham et al.
natural language processing, along with sensors, to detect the real-time behaviors and interactions of people and other vehicles on the street. The systems can then recommend quick actions to avoid incidents [1].
Using dashcams installed in vehicles is increasingly ubiquitous in different countries. People can easily record everything that happens during their road trips and, like a moving surveillance camera, provide potential digital evidence for any related traffic accident or relevant crime scene [2]. In addition, it is worth noting that these dashcams can monitor in different directions: in front of the vehicle, behind it, or even inside the car. Kim and colleagues studied the differences in dashcam video sharing motives and privacy concerns across multiple countries [3]. Based on these advantages, people have started utilizing dashcams as an economical and common resource to enhance traveling safety. Evans et al. [4] presented a brief review of urban road incident detection research and highlighted potential enhancements for road incident detection algorithms (IDAs). Adamová showed the importance of using dashcams in real-time activities from different vehicle views to enhance drivers' safety on the street [5]. Bazilinskyy et al. [6] investigated how dashcam videos collected from four locations worldwide can lead to a better understanding of road traffic safety.
Connecting the dashcams mounted on vehicles produces a large IoT network that requires big data techniques to handle and to extract useful information from. Mohandu et al. [7] provided a detailed survey of the big data techniques currently applied in intelligent transportation systems and listed several open problems related to big data analytics in this domain.
Considering the emerging trends in incident detection with dashcam videos mentioned above, we observe an urgent need for a quick query tool that can return a relevant incident image from a user's free-style textual query. Beyond incident-detection methods that produce a database of short videos indexed by incident type, users need a tool that searches not only by predefined incident types but also by user-defined ones. In other words, users can imagine what an accident scene looks like and expect the system to return exactly what they want. Moreover, while regular users tend to generate less detailed queries, experts (e.g., police officers, insurance staff, detectives) can create more detailed ones. Hence, multi-level fine-grained textual incident retrieval is an authentic requirement in very high demand.
One family of solutions to this challenge comprises image-text retrieval [8] and adversarial text-to-image synthesis [9], where people can find or generate a relevant image by describing it in a textual query. These approaches rely heavily on a vast training database in which perfect (image, text) pairs are created carefully. Applications that deal with text-image retrieval either reuse a big pre-trained model and adapt it to smaller datasets or train a new model on a particular dataset. Neither approach suits text-image incident retrieval: popular text-image pre-trained models do not cover the incident case, where how an incident happens and what an accident scene looks like are frequently searched content.
MM-trafficEvent: Fine-Grained Incident Retrieval
2 Problem Statements
In this section, we introduce the problem we want to tackle. It lies at the core of our system, which aims to retrieve fine-grained incidents from a user's textual queries with additional constraint supports.
Problem Statement: Given an image dataset D_I and a textual query Q whose content is typically a various-granularity, cognition-based expression of a want-to-be-found incident, let I_j ∈ D_I be the j-th image. The problem is to construct a function F(Q, D_I) → {I_k} that finds the set of best-matched pairs {(Q, I_k)}, where Q is the set of captions, descriptions, or query sentences for the image.
Proposed Solution: Let EI_j^i be the i-th image embedding extracted from I_j, and let M_EI be the model used to extract it. Here, the image-embedding terminology expresses features (at both high-semantic and low levels) extracted from I_j. Let ET_j be the text embedding of the annotation, caption, or query Q_j, and let M_ET be the model used to build ET_j. Let JS be the joint representation space into which the query Q and the images {I_j} are projected. Inside JS, we find the best-matched embedding vectors of Q and {I_j} interactively, with judgments made by users. In other words, we construct a joint representation space in which a textual query and its relevant images have the minimum similarity distance. In this space, information from both data modalities is exchangeable. Moreover, with user support, we can tune the similarity measure to decrease intra-class errors and increase inter-class distance.
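The projection into the joint space JS can be sketched as follows. This is a minimal NumPy illustration, not the paper's actual model: the projection matrices W_txt and W_img, the embedding dimensions, and the random features are all hypothetical stand-ins for the learned encoders.

```python
import numpy as np

def project(features, W):
    # Project modality-specific features into the joint space JS
    # and L2-normalize so dot products equal cosine similarity.
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def similarity(text_z, image_zs):
    # Cosine similarity between one text query and every image embedding.
    return image_zs @ text_z

rng = np.random.default_rng(0)
# Hypothetical dimensions: 768-d text features, 2048-d image features,
# both projected into a shared 256-d space.
W_txt = rng.normal(size=(768, 256))
W_img = rng.normal(size=(2048, 256))

query_feat = rng.normal(size=768)         # text embedding ET of query Q
image_feats = rng.normal(size=(5, 2048))  # image embeddings EI of dataset D_I

q = project(query_feat, W_txt)
imgs = project(image_feats, W_img)
scores = similarity(q, imgs)
best = int(np.argmax(scores))  # index of the best-matched image I_k
```

In the actual system, W_txt and W_img would be replaced by the trained text and vision encoders, and minimizing the similarity distance for matching pairs is what training the joint space amounts to.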
3 Methodology
This section describes the methodology used to solve the mentioned problem. We start with a method overview and then present the system components, individually or jointly, to give readers both inter- and intra-component views.
To build our attention component, we inherit the spirit of the attention article [13]. We reuse this attention structure with a modified setting of (Q (Query), K (Key), V (Value)): we set K and Q to the image embedding vectors and V to the text embedding vectors. The bottom part of Figure 1 illustrates this block structure.
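The (Q, K, V) assignment above can be sketched as a single scaled dot-product attention head. This NumPy sketch is illustrative only (the paper uses a multi-head Keras implementation); the shapes are hypothetical, and it assumes the image and text sequences are aligned to the same length so that image-derived attention weights can route text values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(image_emb, text_emb):
    # Q and K come from the image embeddings, V from the text embeddings,
    # following the modified (Q, K, V) setting described above.
    Q = image_emb                       # (n, d_k) image tokens as queries
    K = image_emb                       # (n, d_k) image tokens as keys
    V = text_emb                        # (n, d_v) text tokens as values
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n) attention map
    return weights @ V                  # text features routed by image affinity

rng = np.random.default_rng(1)
img = rng.normal(size=(10, 64))   # 10 hypothetical image patch embeddings
txt = rng.normal(size=(10, 32))   # 10 hypothetical text token embeddings
out = cross_modal_attention(img, txt)  # shape (10, 32)
```

A multi-head version would split d_k across heads and concatenate the outputs, as in the original attention article.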
Searching by only an incident class cannot match a user's cognition of the scene and therefore cannot meet users' requirements.
For example, a class-based incident query returns all incident moments that meet search criteria such as "find all car-crash incidents that happened last week," with "car-crash" being an accident class/type. Nevertheless, when users (e.g., police officers, lawyers, and insurance companies) want to find more detailed incident scenes, we have to run a fine-grained search that shrinks the search space accordingly. For example, a fine-grained query could be "find the incident where a white car hit a blue truck from behind, near an intersection."
Assume we have the joint representation space created by training the multi-head cross-modal attention model on a dataset. We then want a search engine that can consume a text query and utilize the joint representation space to find the set of images with the maximum similarity measure. We apply the approximate similarity matching approach [15], which utilizes frameworks such as ScaNN [16], Annoy, or Faiss [17], to work with a large number of images in real-time mode.
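The retrieval step can be illustrated with an exact inner-product search over normalized vectors. This NumPy stand-in mirrors what an inner-product index such as Faiss's IndexFlatIP does (the approximate indices in ScaNN, Annoy, or Faiss trade a little accuracy for much faster search at scale); the dimensions and data here are illustrative.

```python
import numpy as np

def build_index(image_embeddings):
    # Normalize once so that inner product equals cosine similarity,
    # the usual preparation before adding vectors to an IP index.
    x = np.asarray(image_embeddings, dtype=np.float32)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def search(index, query_embedding, top_k=5):
    # Return the top_k image indices most similar to the text query,
    # sorted by descending similarity.
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = index @ q
    top = np.argsort(-scores)[:top_k]
    return top, scores[top]

rng = np.random.default_rng(2)
index = build_index(rng.normal(size=(1000, 256)))  # 1000 images in joint space
ids, sims = search(index, rng.normal(size=256), top_k=5)
```

Swapping this brute-force scan for an approximate index changes only the build and search calls, not the surrounding retrieval logic.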
4 Experimental Results
This section describes, explains, and discusses the datasets, working environment, evaluation, and comparisons with other methods. It should be noted that this project is ongoing, and not all system components have reached their best results.
4.1 Dataset
We use various datasets, some gathered from public sources and one created by ourselves. The former include the 8-class incident dataset [24], BDD100K [23], and RetroTruck [22]; the latter is the I4W dataset [25]. The 8-class dataset contains around 12K positive samples for eight incident classes, whose volumes are illustrated in Figure 3. The RetroTruck dataset has 254 videos (25 fps) of normal driving scenes and 56 of abnormal ones. I4W contains 600 videos (15 minutes each) recorded repeatedly along four courses in central Tokyo, Japan.
While the former datasets already have incident labels, the latter needed manual incident labeling. First, we asked volunteers to manually label the incidents recorded in the latter. Second, we merged all datasets to form an extensive dataset for our evaluation. Finally, we asked volunteers to label the data and created a structured dataset of the form D_k = {(I_k; [T_ki])}, where I_k is an image and [T_ki] is a list of captions matching I_k's content, with emphasis on the incident aspect. We organize the dataset as {key: value} pairs, where the key is an image's ID and the value is the image's description (i.e., its label/annotation/caption), and store them in JSON format. The significant criterion that sets our dataset apart is that we asked the volunteers to describe each accident scene in detail, as if they were reporting a witnessed accident to the police.
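The {key: value} JSON layout described above might look like the following. The image IDs and captions here are invented for illustration; only the structure (image ID mapped to a list of incident-focused captions) follows the description in the text.

```python
import json

# Illustrative record: each key is an image ID, each value a list of
# detailed, incident-focused captions written by volunteers.
dataset = {
    "img_000123": [
        "A white car hit a blue truck from behind near an intersection.",
        "Rear-end collision between a white sedan and a blue truck.",
    ],
    "img_000124": [
        "A motorbike slipped on a flooded street at night.",
    ],
}

serialized = json.dumps(dataset, indent=2)  # stored on disk in JSON format
restored = json.loads(serialized)           # loaded back for training
```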
We also use the CIFAR and MS-COCO datasets, with five captions per image, to compare our method with others. In this case, we do not emphasize incident images in particular; instead, we want to see how well our method works on datasets from broad domains.
4.2 Experimental Settings
We set the working environment as follows: Keras and TensorFlow, both version 2.8.0, with CUDA 10.1, running on Python 3.8 with an NVIDIA RTX 3060 12GB GPU. We simulate online mode by assuming that one video is continuously sent to the system from a dashcam. The hyperparameters of our model are given in Table 1.
Table 1. Model parameters

Parameter                       Value
Pre-trained models              Xception and BERT
Trainable parameters            all backbone layers frozen
Normalized image size           299x299
Batch size                      32
Number of epochs                21
Optimizer                       Adam
Learning rate                   0.0001
Learning-rate schedule          reduce learning rate on plateau
Learning-rate reduction factor  0.2
Training split                  70% of dataset
Validation split                20% of dataset
Testing split                   10% of dataset
Training time                   around 3.5 hours
Query time                      around 1.2 seconds
Vision encoder parameters       21,452,328 (trainable: 590K; non-trainable: 20.8M)
Text encoder parameters         28,961,281 (trainable: 197K; non-trainable: 28.7M)
Total model parameters          53,560,107 (trainable: 3.9M; non-trainable: 49.6M)
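The 70/20/10 split and the plateau schedule from Table 1 can be sketched as follows. This is a hedged illustration: the function names are ours, the dataset size of 12K matches the 8-class dataset mentioned earlier, and the plateau detection itself (monitoring validation loss) is left abstract.

```python
import numpy as np

def split_dataset(n_samples, ratios=(0.7, 0.2, 0.1), seed=0):
    # Shuffle indices and cut them into train/validation/test
    # using the 70%/20%/10% proportions from Table 1.
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(ratios[0] * n_samples)
    n_val = int(ratios[1] * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def reduce_lr_on_plateau(lr, plateaued, factor=0.2, min_lr=1e-7):
    # Table 1's schedule: multiply the learning rate by 0.2 whenever
    # the monitored validation metric stops improving.
    return max(lr * factor, min_lr) if plateaued else lr

train_idx, val_idx, test_idx = split_dataset(12000)
lr = reduce_lr_on_plateau(1e-4, plateaued=True)  # 1e-4 becomes 2e-5
```

In a Keras 2.8 setup, the schedule would typically be handled by the built-in ReduceLROnPlateau callback rather than by hand.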
4.3 Results
5 Conclusions
This paper introduces a fine-grained text-image incident retrieval using multi-
modal cross-modal attention with the FAISS index. The former aim to create a
joint representation space where textual and visual features can compensate for and assist each other to link salient zones together. We emphasize a specific domain, incidents, where describing an accident scene requires more detail and where the interactions among traffic objects (e.g., car, truck, tree, pedestrian, flood) and the objects' attributes (e.g., color, size, position) deserve the utmost attention. To the best of our knowledge, these criteria have not been addressed in prior work. Besides, the FAISS index supports running on a massive database in real-time mode. We evaluate our model on different datasets and against different models, with judgments from both naive and expert users. The experimental results show our advantages and open a new approach to exploiting and exploring first-view travel-log data, contributing to smart mobility, where safety is the priority.
In the future, we will consider the positions of textual and visual embedding vectors, since, as mentioned, traffic-object interaction is one of the incident domain's significant criteria. We will also recruit more volunteers to enrich the labels of each incident image, to build better cross-links between textual and visual components. We will investigate a transformer decoder to retrieve relevant images from the query directly, without utilizing the FAISS index. We also want to benefit from vast pre-trained models of objects and their topology in an image (e.g., models trained on Visual Genome) and adapt them to our dataset, which could bring more accuracy and alleviate the burden of manual labeling.
Acknowledgement
The results of this study are based on the collaborative research "Research and Development of Interactive Visual Lifelog Retrieval Method for Multimedia Sensing" between the National Institute of Information and Communications Technology, Japan, and the University of Science, Vietnam National University - Ho Chi Minh City, Vietnam, from April 2020 to March 2022.
References
1. Xu, Y., Liang, X., Dong, X. & Chen, W. Intelligent Transportation System and Fu-
ture of Road Safety. 2019 IEEE International Conference On Smart Cloud (Smart-
Cloud). pp. 209-214 (2019)
2. Lee, K., Choi, J., Park, J. & Lee, S. Your Car Is Recording: Metadata-driven Dash-
cam Analysis System. DFRWS APAC. (2021)
3. Kim, J., Park, S. & Lee, U. Dashcam Witness: Video Sharing Motives and Privacy
Concerns Across Different Nations. IEEE Access. 8 pp. 110425-110437 (2020)
4. Evans, J., Waterson, B. & Hamilton, A. Evolution and Future of Urban Road In-
cident Detection Algorithms. Journal Of Transportation Engineering, Part A: Sys-
tems. 146 (2020)
5. Adamová, V. Dashcam as a Device to Increase the Road Safety Level. Int. Conf. On Innovations In Science And Education (CBU). pp. 1-5 (2020)
6. Bazilinskyy, P., Eisma, Y., Dodou, D. & Winter, J. Risk perception: A study us-
ing dashcam videos and participants from different world regions. Traffic Injury
Prevention. 21, 347-353 (2020)