MMM2023 Duy
Abstract. Dashcam video has recently become popular because it improves the safety of both individuals and communities. While individuals gain undeniable evidence for legal and insurance purposes, communities can benefit from shared dashcam videos for traffic education and criminal investigation. Moreover, leveraging recent developments in computer vision and AI, a few companies have launched so-called AI dashcams that can alert drivers to near-risk accidents (e.g., following-distance detection, forward-collision warning) to improve driver safety. Unfortunately, even though dashcam videos create a driver's travel log (i.e., a traveling diary), little research focuses on creating a valuable and friendly tool that can find an incident or event from a few sketches described by users. Inspired by these observations, we introduce an interactive incident detection and retrieval system for first-view travel-log data that can retrieve fine-grained incidents of both defined and undefined types. The system gives promising results when evaluated on several public datasets and against popular text-image retrieval methods. The source code is published at
https://github.com/PDD0911-HCMUS/Cross_Model_Attention
1 Introduction
Intelligent transportation systems have become an essential topic for improving the quality of current transportation and reducing street accidents in every country. Moreover, with the development of telecommunication and information technology over the last few years, people have had more and more opportunities to integrate such systems into their vehicles and receive the associated support whenever they drive. To build an intelligent transportation system, different companies utilize recently advanced techniques in computer vision, signal processing, and
2 Dinh-Duy Pham et al.
natural language processing, along with sensors, to detect the real-time behaviors and interactions of people and other vehicles on the street. The systems can then recommend quick actions to avoid incidents [1].
Using dashcams installed in vehicles is increasingly ubiquitous in different countries. People can easily record everything that happens during their road trips and, like a moving surveillance camera, provide potential digital evidence for any related traffic accident or relevant crime scene [2]. In addition, it is worth noting that these dashcams can monitor in different directions: in front of the vehicle, behind it, or even inside the car. Kim and colleagues studied the differences in dashcam video sharing motives and privacy concerns across multiple countries [3]. Based on these advantages, people have started utilizing dashcams as an economical and common resource to enhance traveling safety. Evans et al. [4] presented a brief review of urban road incident detection research and highlighted potential enhancements for road incident detection algorithms (IDAs). Adamová showed the importance of using dashcams in real-time activities from different vehicle views to enhance drivers' safety on the street [5]. Bazilinskyy et al. [6] investigated how dashcam videos collected from four locations worldwide can lead to a better understanding of road traffic safety.
Connecting the dashcams mounted on vehicles produces a large IoT network that requires big data techniques to handle and to extract useful information from. Mohandu et al. [7] provided a detailed survey of the big data techniques currently applied in intelligent transportation systems and listed several open problems related to big data analytics in this domain.
Considering the emerging trends in incident detection with dashcam videos mentioned above, we observe an urgent need for a quick query tool that can return a relevant incident image from a user's free-style textual query. Beyond incident-detection methods that produce a database of short videos indexed by incident type, users need a tool that searches not only by predefined incident types but also by user-defined ones. In other words, users can imagine what an accident scene looks like and expect the system to return exactly what they want. Moreover, while regular users tend to generate less detailed queries, experts (e.g., police officers, insurance staff, detectives) can create more detailed ones. Hence, multi-level fine-grained textual incident retrieval is an authentic requirement in very high demand.
One family of solutions to this challenge comprises image-text retrieval [8] and adversarial text-to-image synthesis [9], where people can find or generate a relevant image by describing it in a textual query. These approaches rely heavily on a vast training database in which perfect (image, text) pairs are created carefully. Applications that deal with text-image retrieval either reuse a big pre-trained model and adapt it to smaller datasets or train a new model on a particular dataset. Neither approach suits text-image incident retrieval: popular text-image pre-trained models do not cover the incident case, where how an incident happens and what an accident scene looks like are frequently searched content.
MM-trafficEvent: Fine-Grained Incident Retrieval
2 Problem Statements
In this section, we introduce the problem we want to tackle. It lies at the core of our system, which aims to retrieve fine-grained incidents from a user's textual queries with additional constraint supports.
Problem Statement: Given an image dataset D_I and a textual query Q whose content is typically a various-granularity, cognition-based expression of a want-to-be-found incident, let I_j ∈ D_I be the j-th image. The problem is to construct a function F(Q, D_I) → {I_k} that finds the set of best-matched pairs {(Q, I_k)}, where Q is the set of captions, descriptions, or query sentences for the image.
Proposed Solution: Let EI_j^i be the i-th image embedding extracted from I_j, and let M_EI be the model used to extract it. Here, the image-embedding terminology expresses features (at both high-semantic and low levels) extracted from I_j. Let ET_j be the text embedding of the annotation, caption, or query Q_j, and let M_ET be the model used to build ET_j. Let JS be the joint representation space into which the query Q and the images {I_j} are projected. Inside JS, we find the best-matched embedding vectors of Q and {I_j} interactively, with judgments made by users. In other words, we construct a joint representation space in which a textual query and its relevant images have the minimum similarity distance. In this space, information from both data modalities is exchangeable. Moreover, with user support, we can tune the similarity measure to decrease intra-class errors and increase inter-class distance.
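The projection into the joint space JS can be sketched as follows. This is a minimal NumPy illustration, not the paper's actual model: the projection matrices W_txt and W_img, the embedding dimensions, and the random features are all hypothetical stand-ins for the learned encoders.

```python
import numpy as np

def project(features, W):
    # Project modality-specific features into the joint space JS
    # and L2-normalize so dot products equal cosine similarity.
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def similarity(text_z, image_zs):
    # Cosine similarity between one text query and every image embedding.
    return image_zs @ text_z

rng = np.random.default_rng(0)
# Hypothetical dimensions: 768-d text features, 2048-d image features,
# both projected into a shared 256-d space.
W_txt = rng.normal(size=(768, 256))
W_img = rng.normal(size=(2048, 256))

query_feat = rng.normal(size=768)         # text embedding ET of query Q
image_feats = rng.normal(size=(5, 2048))  # image embeddings EI of dataset D_I

q = project(query_feat, W_txt)
imgs = project(image_feats, W_img)
scores = similarity(q, imgs)
best = int(np.argmax(scores))  # index of the best-matched image I_k
```

In the actual system, W_txt and W_img would be replaced by the trained text and vision encoders, and minimizing the similarity distance for matching pairs is what training the joint space amounts to.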
3 Methodology
This section describes the methodology used to solve the mentioned problem. We start with a method overview and then present the system components, individually or jointly, to give readers both inter- and intra-component views.
To build our attention component, we inherit the spirit of the attention article [13]. We reuse this attention structure with a modified setting of (Q (Query), K (Key), V (Value)): we set K and Q to the image embedding vectors and V to the text embedding vectors. The bottom part of Figure 1 illustrates this block structure.
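The (Q, K, V) assignment above can be sketched as a single scaled dot-product attention head. This NumPy sketch is illustrative only (the paper uses a multi-head Keras implementation); the shapes are hypothetical, and it assumes the image and text sequences are aligned to the same length so that image-derived attention weights can route text values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(image_emb, text_emb):
    # Q and K come from the image embeddings, V from the text embeddings,
    # following the modified (Q, K, V) setting described above.
    Q = image_emb                       # (n, d_k) image tokens as queries
    K = image_emb                       # (n, d_k) image tokens as keys
    V = text_emb                        # (n, d_v) text tokens as values
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n) attention map
    return weights @ V                  # text features routed by image affinity

rng = np.random.default_rng(1)
img = rng.normal(size=(10, 64))   # 10 hypothetical image patch embeddings
txt = rng.normal(size=(10, 32))   # 10 hypothetical text token embeddings
out = cross_modal_attention(img, txt)  # shape (10, 32)
```

A multi-head version would split d_k across heads and concatenate the outputs, as in the original attention article.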
Searching by only an incident class cannot match a user's cognition of the scene and therefore cannot meet users' requirements.
For example, a class-based incident query returns all incident moments that meet search criteria such as "find all car-crash incidents that happened last week," with "car-crash" being an accident class/type. Nevertheless, when users (e.g., police officers, lawyers, and insurance companies) want to find more detailed incident scenes, we have to run a fine-grained search that shrinks the search space accordingly. For example, a fine-grained query could be "find the incident where a white car hit a blue truck from behind, near an intersection."
Assume we have the joint representation space created by training the multi-head cross-modal attention model on a dataset. We then want a search engine that can consume a text query and utilize the joint representation space to find the set of images with the maximum similarity measure. We apply the approximate similarity matching approach [15], which utilizes frameworks such as ScaNN [16], Annoy, or Faiss [17], to work with a large number of images in real-time mode.
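The retrieval step can be illustrated with an exact inner-product search over normalized vectors. This NumPy stand-in mirrors what an inner-product index such as Faiss's IndexFlatIP does (the approximate indices in ScaNN, Annoy, or Faiss trade a little accuracy for much faster search at scale); the dimensions and data here are illustrative.

```python
import numpy as np

def build_index(image_embeddings):
    # Normalize once so that inner product equals cosine similarity,
    # the usual preparation before adding vectors to an IP index.
    x = np.asarray(image_embeddings, dtype=np.float32)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def search(index, query_embedding, top_k=5):
    # Return the top_k image indices most similar to the text query,
    # sorted by descending similarity.
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = index @ q
    top = np.argsort(-scores)[:top_k]
    return top, scores[top]

rng = np.random.default_rng(2)
index = build_index(rng.normal(size=(1000, 256)))  # 1000 images in joint space
ids, sims = search(index, rng.normal(size=256), top_k=5)
```

Swapping this brute-force scan for an approximate index changes only the build and search calls, not the surrounding retrieval logic.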
4 Experimental Results
This section describes, explains, and discusses the datasets, working environment, evaluation, and comparisons with other methods. It should be noted that this project is ongoing, and not all system components have reached their best results.
4.1 Dataset
We use various datasets, some gathered from public sources and one created by ourselves. The former include the 8-class incident dataset [24], BDD100K [23], and RetroTruck [22]; the latter is the I4W dataset [25]. The 8-class dataset contains around 12K positive samples for eight incident classes, whose volumes are illustrated in Figure 3. The RetroTruck dataset has 254 videos (25 fps) of normal driving scenes and 56 of abnormal ones. I4W contains 600 videos (15 minutes each) recorded repeatedly along four courses in central Tokyo, Japan.
While the former datasets already have incident labels, the latter needed manual incident labeling. First, we asked volunteers to manually label the incidents recorded in the latter. Second, we merged all datasets to form an extensive dataset for our evaluation. Finally, we asked volunteers to label the data and created a structured dataset of the form D_k = {(I_k; [T_ki])}, where I_k is an image and [T_ki] is a list of captions matching I_k's content, with emphasis on the incident aspect. We organize the dataset as {key: value} pairs, where the key is an image's ID and the value is the image's description (i.e., its label/annotation/caption), and store them in JSON format. The significant criterion that sets our dataset apart is that we asked the volunteers to describe each accident scene in detail, as if they were reporting a witnessed accident to the police.
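The {key: value} JSON layout described above might look like the following. The image IDs and captions here are invented for illustration; only the structure (image ID mapped to a list of incident-focused captions) follows the description in the text.

```python
import json

# Illustrative record: each key is an image ID, each value a list of
# detailed, incident-focused captions written by volunteers.
dataset = {
    "img_000123": [
        "A white car hit a blue truck from behind near an intersection.",
        "Rear-end collision between a white sedan and a blue truck.",
    ],
    "img_000124": [
        "A motorbike slipped on a flooded street at night.",
    ],
}

serialized = json.dumps(dataset, indent=2)  # stored on disk in JSON format
restored = json.loads(serialized)           # loaded back for training
```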
We also use the CIFAR and MS-COCO datasets, with five captions per image, to compare our method with others. In this case, we do not emphasize incident images in particular; instead, we want to see how well our method works on datasets from broad domains.
4.2 Experimental Settings
We set the working environment as follows: Keras and TensorFlow, both version 2.8.0, with CUDA 10.1, running on Python 3.8 with an NVIDIA RTX 3060 12GB GPU. We simulate online mode by assuming that one video is continuously sent to the system from a dashcam. The hyperparameters of our model are given in Table 1.
Table 1. Model parameters

Parameter                       Value
Pre-trained models              Xception and BERT
Trainable parameters            all backbone layers frozen
Normalized image size           299x299
Batch size                      32
Number of epochs                21
Optimizer                       Adam
Learning rate                   0.0001
Learning-rate schedule          reduce learning rate on plateau
Learning-rate reduction factor  0.2
Training split                  70% of dataset
Validation split                20% of dataset
Testing split                   10% of dataset
Training time                   around 3.5 hours
Query time                      around 1.2 seconds
Vision encoder parameters       21,452,328 (trainable: 590K; non-trainable: 20.8M)
Text encoder parameters         28,961,281 (trainable: 197K; non-trainable: 28.7M)
Total model parameters          53,560,107 (trainable: 3.9M; non-trainable: 49.6M)
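The 70/20/10 split and the plateau schedule from Table 1 can be sketched as follows. This is a hedged illustration: the function names are ours, the dataset size of 12K matches the 8-class dataset mentioned earlier, and the plateau detection itself (monitoring validation loss) is left abstract.

```python
import numpy as np

def split_dataset(n_samples, ratios=(0.7, 0.2, 0.1), seed=0):
    # Shuffle indices and cut them into train/validation/test
    # using the 70%/20%/10% proportions from Table 1.
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(ratios[0] * n_samples)
    n_val = int(ratios[1] * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def reduce_lr_on_plateau(lr, plateaued, factor=0.2, min_lr=1e-7):
    # Table 1's schedule: multiply the learning rate by 0.2 whenever
    # the monitored validation metric stops improving.
    return max(lr * factor, min_lr) if plateaued else lr

train_idx, val_idx, test_idx = split_dataset(12000)
lr = reduce_lr_on_plateau(1e-4, plateaued=True)  # 1e-4 becomes 2e-5
```

In a Keras 2.8 setup, the schedule would typically be handled by the built-in ReduceLROnPlateau callback rather than by hand.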
4.3 Results
5 Conclusions
This paper introduces a fine-grained text-image incident retrieval using multi-
modal cross-modal attention with the FAISS index. The former aim to create a
joint representation space where textual and visual features can compensate for and assist each other to link salient zones together. We emphasize a specific domain, incidents, where describing an accident scene requires more detail and where the interactions among traffic objects (e.g., car, truck, tree, pedestrian, flood) and the objects' attributes (e.g., color, size, position) deserve the utmost attention. To the best of our knowledge, these criteria have not been addressed in prior work. Besides, the FAISS index supports running on a massive database in real-time mode. We evaluate our model on different datasets and against different models, with judgments from both naive and expert users. The experimental results show our advantages and open a new approach to exploiting and exploring first-view travel-log data, contributing to smart mobility, where safety is the priority.
In the future, we will consider the positions of textual and visual embedding vectors, since, as mentioned, traffic-object interaction is one of the incident domain's significant criteria. We will also recruit more volunteers to enrich the labels of each incident image, to build better cross-links between textual and visual components. We will investigate a transformer decoder to retrieve relevant images from the query directly, without utilizing the FAISS index. We also want to benefit from vast pre-trained models of objects and their topology in an image (e.g., models trained on Visual Genome) and adapt them to our dataset, which could bring more accuracy and alleviate the burden of manual labeling.
Acknowledgement
The results of this study are based on the collaborative research "Research and Development of Interactive Visual Lifelog Retrieval Method for Multimedia Sensing" between the National Institute of Information and Communications Technology, Japan, and the University of Science, Vietnam National University - Ho Chi Minh City, Vietnam, from April 2020 to March 2022.
References
1. Xu, Y., Liang, X., Dong, X. & Chen, W. Intelligent Transportation System and Fu-
ture of Road Safety. 2019 IEEE International Conference On Smart Cloud (Smart-
Cloud). pp. 209-214 (2019)
2. Lee, K., Choi, J., Park, J. & Lee, S. Your Car Is Recording: Metadata-driven Dash-
cam Analysis System. DFRWS APAC. (2021)
3. Kim, J., Park, S. & Lee, U. Dashcam Witness: Video Sharing Motives and Privacy
Concerns Across Different Nations. IEEE Access. 8 pp. 110425-110437 (2020)
4. Evans, J., Waterson, B. & Hamilton, A. Evolution and Future of Urban Road In-
cident Detection Algorithms. Journal Of Transportation Engineering, Part A: Sys-
tems. 146 (2020)
5. Adamová, V. Dashcam as a Device to Increase the Road Safety Level. Int. Conf. On Innovations In Science And Education (CBU). pp. 1-5 (2020)
6. Bazilinskyy, P., Eisma, Y., Dodou, D. & Winter, J. Risk perception: A study us-
ing dashcam videos and participants from different world regions. Traffic Injury
Prevention. 21, 347-353 (2020)