
DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps

Dongsheng Xu1, Qingbao Huang1,2, Yi Cai3
1 School of Electrical Engineering, Guangxi University
2 Guangxi Key Laboratory of Multimedia Communications and Network Technology
3 School of Software Engineering, South China University of Technology
chrishsu1998@163.com, qbhuang@gxu.edu.cn, ycai@scut.edu.cn
arXiv:2302.01540v1 [cs.CV] 3 Feb 2023

Abstract

Text-based image captioning is an important but under-explored task, aiming to generate descriptions containing visual objects and scene text. Recent studies have made encouraging progress, but they still suffer from a lack of overall understanding of scenes and generate inaccurate captions. One possible reason is that current studies mainly focus on constructing plane-level geometric relationships of scene text without depth information, which leads to insufficient scene text relational reasoning, so models may describe scene text inaccurately. The other possible reason is that existing methods fail to generate fine-grained descriptions of some visual objects. In addition, they may ignore essential visual objects, so the scene text belonging to these ignored objects cannot be utilized. To address the above issues, we propose a DEpth and VIsual ConcEpts Aware Transformer (DEVICE) for TextCaps. Concretely, to construct three-dimensional geometric relations, we introduce depth information and propose a depth-enhanced feature updating module to ameliorate OCR token features. To generate more precise and comprehensive captions, we introduce semantic features of detected visual object concepts as auxiliary information. Our DEVICE is capable of generalizing scenes more comprehensively and boosting the accuracy of described visual entities. Sufficient experiments demonstrate the effectiveness of our proposed DEVICE, which outperforms state-of-the-art models on the TextCaps test set. Our code will be publicly available.

Figure 1. Example of the influence of depth information and visual object concepts on descriptions. Without the assistance of depth information, the model may confuse the front and rear positional relationship of the OCR tokens. The semantic information of visual object concepts can assist the model in generating accurate and comprehensive captions. The visual object concepts are detected by the visual object concepts extractor. Green indicates scene text, and Blue indicates visual objects. (In case (a), the strong baseline CNMT without depth information produces "a bottle of victorian coca sits on a table next to a glass.", while our DEVICE with depth information produces "a bottle of coca-cola sits on a table in front of a glass."; in case (b), CNMT without visual object concepts produces "a person is wearing clothes with move on it.", while our DEVICE with the salient visual object concepts ['bike', 'equipment', 'athlete'] produces "an athlete in move pants is on a spinner bike.")

1. Introduction

Image captioning is one of the important tasks at the intersection of vision and language, which aims to help the visually impaired understand visual information as accurately as possible. Some recent studies have achieved superior performance and are even comparable to humans under certain circumstances [4, 41, 42]. However, these traditional image captioning models perform unsatisfactorily when describing scenes containing text. Taking the upper-left case in Fig. 1 as an example, "a bottle of Coca-Cola" should be described rather than "a bottle". To address this defect, Sidorov et al. [27] collected the high-quality TextCaps dataset, in which captions contain scene text from the images.

To utilize the scene text in pictures, extracting Optical Character Recognition (OCR) tokens from the given images is the precondition. M4C-Captioner [27] uses the Rosetta [10] OCR system to obtain OCR tokens, and then employs a multimodal transformer to directly model the relationships between visual objects and OCR tokens. However, it is inappropriate to treat all OCR tokens equally, since deformed scene text may be detected incorrectly. Wang et al. [36] introduce the concept of OCR confidence, which largely alleviated this issue.
Then Wang et al. [34, 35] committed to improving the methods for constructing spatial relationships between OCR tokens, which have proven effective, and the proposed model LSTM-R [35] achieves state-of-the-art performance. However, current methods are still not accurate and comprehensive enough when describing scenes, which is mainly manifested in the following two aspects.

Firstly, the real world is three-dimensional (3D). However, current studies [27, 34-36, 39] only exploit the spatial relationships of OCR tokens at the plane level, which leads to visual location information being utilized incompletely and can even result in inaccurate scene text. As illustrated in case (a) of Fig. 1, CNMT [36] mistakenly regards "victorian" and "coca" as closely related OCR tokens in the same plane, although they are irrelevant and actually one behind the other. Naturally, we consider that introducing depth information on top of two-dimensional (2D) information can simulate real-world geometric information and improve the accuracy of scene text in captions.

Secondly, scenes contain abundant but complex visual entities (visual objects and scene text), and existing methods suffer from coarse-grained descriptions of visual objects and the absence of crucial visual entities in captions. As shown in case (b) on the right of Fig. 1, CNMT [36] generates the coarse-grained word "person" for the visual object "athlete". Besides, CNMT ignores an important entity, "bike", so the OCR token "spinner" on it cannot be utilized effectively. One possible reason for the absence of "bike" is that the visual features of mutilated objects are not conducive to modeling. Intuitively, we consider that the semantic information of salient visual object concepts can alleviate coarse-grained and partial captioning.

To tackle the aforementioned issues, we propose a DEpth and VIsual ConcEpts Aware Transformer (DEVICE) model for TextCaps. It mainly comprises a reading module, a depth-enhanced feature updating module, a salient visual object concepts extractor, and a multimodal transformer. Concretely, in the reading module, we employ a monocular depth estimator BTS [19] to generate depth maps, and then we extract depth values for visual objects and OCR tokens, respectively. The depth-enhanced feature updating module is designed to optimize the visual features of OCR tokens, which is beneficial for constructing three-dimensional geometric relationships. In the salient visual object concepts extractor, we first employ a pre-trained vision-language model CLIP [25] as a filter to generate the K concepts most related to the given picture. Then we adopt the semantic information of these visual object concepts as augmentations to improve the overall expressiveness of our model. Subsequent experiments in Sec. 4 demonstrate the effectiveness of our proposed DEVICE.

Our main contributions can be summarized as follows:

• We propose a DEpth and VIsual ConcEpts Aware Transformer (DEVICE) for TextCaps. DEVICE performs better than previous methods as it improves the accuracy and completeness of captions.

• We devise a Depth-enhanced Feature Updating Module (DeFUM), which effectively helps the model evaluate the relevance of OCR tokens. To the best of our knowledge, we are the first to apply depth information to the text-based image captioning task.

• We introduce the semantic information of salient visual objects through the Visual Object Concepts extractor (VOC). Semantic information of salient visual objects can help the multimodal transformer generate fine-grained objects and better understand holistic scenes.

• Our DEVICE outperforms state-of-the-art models on the TextCaps test set, inspiringly boosting CIDEr-D from 100.8 to 109.7.

2. Related Work

2.1. Text-based Image Captioning

The image captioning task has made a lot of progress in recent years [4, 11, 16, 33], but most models perform poorly on images with scene text. Text-based image captioning, proposed by Sidorov et al. [27], was naturally born to alleviate this problem. The authors introduced the TextCaps dataset and a multimodal transformer-based [30] baseline model, M4C-Captioner. M4C-Captioner was modified from a Text-VQA [8, 17, 28, 40] model, M4C [15], and can encode both image and OCR tokens to generate more proper captions. However, this paradigm suffers from a lack of diversity and information integrity. Zhang et al. [43] and Xu et al. [38] were committed to solving this problem by introducing additional captions. Although introducing multiple captions alleviates the problem of missing scene information, it is inherently hard to summarize the complex entities of a scene well within one caption. Apart from this, M4C-Captioner performs less than satisfactorily on TextCaps due to insufficient use of the visual and semantic information of images. To distinguish possibly misidentified OCR recognition results, Wang et al. [36] introduced confidence embedding and alleviated the adverse effects of inaccurate OCR tokens to a certain extent. Yang et al. [39] designed a pre-training model that can exploit OCR tokens and achieved strong results on multiple tasks, including text-based image captioning. Wang et al. [34, 35] and Tang et al. [29] employ 2D spatial relationships to measure the correlation between OCR tokens. Nevertheless, we find that simply leveraging two-dimensional spatial relationships may cause incorrect captions. Therefore, we ameliorate this problem by utilizing depth information.
Figure 2. Overview of our DEpth and VIsual ConcEpts Aware Transformer for TextCaps (DEVICE). In the reading module, we first extract visual objects and OCR tokens along with their depth values. Then a depth-enhanced feature updating module is used to enhance OCR appearance features with the help of depth values. After that, DEVICE utilizes depth information and plane geometry relationships to construct 3D positional relationships. Semantic information of salient visual object concepts promotes fine-grained and holistic captioning. (The diagram shows the input image passing through object detection, OCR detection, and depth estimation in the reading module, the depth-enhanced feature updating module, the Visual Object Concepts Extractor built on CLIP, and the multimodal transformer with a linear layer and pointer network in the sentence decoder, which produces a caption such as "a billboard for tanger outlets is above a building.")

2.2. Depth Information for Multimodal Tasks

Monocular depth estimation is a challenging computer vision problem whose purpose is to use depth values to measure the distance of objects relative to the camera [12, 14, 19]. With depth estimation technology, significant improvements have been achieved on many multimodal tasks [6, 21, 37]. In the visual question answering (VQA) task, Banerjee et al. [6] and Liu et al. [22] proved that exploiting three-dimensional relationships through depth information could enhance the spatial reasoning ability of VQA models. Apart from this, Qiu et al. [24] and Liao et al. [21] point out that ignoring depth information could tremendously restrict applications of scene change captioning in the real world. We notice that text-based image captioning faces the same problem. Therefore, we construct three-dimensional spatial relationships based on depth information.

3. Approach

3.1. Overview

Given an image I, text-based image captioning models aim to automatically generate a caption C = [y_0, y_1, ..., y_n] based on the scene text S in the image.

As depicted in Fig. 2, our DEVICE mainly consists of a reading module, a salient visual object concepts extractor, a depth-enhanced feature updating module, a multimodal transformer, and a sentence decoder. Concretely, the input image is first put into an object detector and two OCR detectors to extract N visual objects and M OCR tokens along with their two-dimensional regional coordinates. Then we utilize a pre-trained depth estimation model to generate the pixel-level depth map D. After that, we update the OCR visual features with the depth-enhanced feature updating module. To generate salient visual object semantic concepts for each input image, we employ a pre-trained cross-modal tool CLIP [25] to match the top-K related concepts (e.g., "billboard" and "advertisement") in a vocabulary, which alleviates the model's tendency to miss influential visual objects in the scene and helps it generate fine-grained visual objects for more accurate descriptions. Depth information is included in the coordinates of both scene text and objects, hence the multimodal transformer can establish three-dimensional relations between the visual entities in the picture. The image captions are generated in the auto-regressive paradigm.

3.2. Multimodal Embedding

First of all, we exploit a pre-trained Faster R-CNN [26] model to get the bounding box coordinates {b^{obj}_n}_{n=1:N} of objects {a^{obj}_n}_{n=1:N}. Following LSTM-R [35], we apply external OCR systems [1, 10] to obtain both OCR tokens {a^{ocr}_m}_{m=1:M} and the corresponding bounding box coordinates {b^{ocr}_m}_{m=1:M}, respectively. After that, we extract the appearance features {x^{of}_n}_{n=1:N} of objects and the appearance features {x^{tf}_m}_{m=1:M} of OCR tokens with Faster R-CNN [26]. To generate depth maps of the images in the TextCaps dataset, we utilize a monocular depth estimation model, BTS [19].

Figure 3. Illustration of the process of obtaining depth values and of 3D geometric relationship modeling. The pie chart represents the gray value distribution of object a^{obj}_n. We take the gray value with the highest proportion as the depth value (0 means nearest) of object a^{obj}_n and OCR token a^{ocr}_m. b^{obj}_n and b^{ocr}_m denote the bounding boxes of a^{obj}_n and a^{ocr}_m, respectively. With depth information, relationships between visual entities become clearer. (The example contrasts the two-dimensional relationship with a three-dimensional Cartesian coordinate system; the most frequent gray value yields d^{obj}_n = 56.)

Depth values. To measure the relative distance of objects and OCR tokens to the observer, we employ the pre-trained depth estimation model BTS [19] to generate depth maps. Subsequently, we rescale the values of the depth maps to the range of 0 to 255. For object a^{obj}_n and OCR token a^{ocr}_m, we count the grayscale distribution within the bounding boxes b^{obj}_n and b^{ocr}_m in the depth map at pixel scale. We take the gray value with the highest frequency as the depth value for object a^{obj}_n and OCR token a^{ocr}_m (cf. Fig. 3). We denote the depth values by {dv^{obj}_n}_{n=1:N} for {a^{obj}_n}_{n=1:N} and {dv^{ocr}_m}_{m=1:M} for {a^{ocr}_m}_{m=1:M}, respectively.
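As a concrete illustration of the depth-value step above, the following is a minimal sketch (ours, not the authors' released code) of taking the most frequent gray value inside a bounding box; it assumes the depth map has already been rescaled to the 0-255 range.

```python
import numpy as np

def entity_depth_value(depth_map: np.ndarray, bbox) -> int:
    """Return the most frequent (mode) gray value inside a bounding box.

    depth_map: HxW array already rescaled to the 0..255 range.
    bbox: (x_tl, y_tl, x_br, y_br) in pixel coordinates.
    """
    x_tl, y_tl, x_br, y_br = [int(round(v)) for v in bbox]
    region = depth_map[y_tl:y_br, x_tl:x_br]
    # Histogram over the 256 possible gray values; the peak is the depth value.
    hist = np.bincount(region.reshape(-1).astype(np.int64), minlength=256)
    return int(hist.argmax())
```

The resulting value is later divided by 255 when it is concatenated with the normalized box coordinates, cf. the location features x^{os}_n and x^{ts}_m below.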
Embedding of objects. To measure the relationships between visual entities more effectively, we incorporate depth information into the object features. For object a^{obj}_n, we denote its 2D geometry coordinates by b^{obj}_n = [x^{tl}/W; y^{tl}/H; x^{br}/W; y^{br}/H], where x^{tl} and y^{tl} are the coordinates of the upper-left corner of object a^{obj}_n, and x^{br} and y^{br} are the coordinates of its bottom-right corner, respectively. Then we denote the 3D spatial location feature of a^{obj}_n by x^{os}_n = [b^{obj}_n; dv^{obj}_n/255]. The embedding of object a^{obj}_n is calculated as

x^{obj}_n = LN(W_{of} x^{of}_n) + LN(W_s x^{os}_n),    (1)

where W_{of} and W_s are learnable matrices, and LN represents layer normalization.

Embedding of OCR tokens. Unlike objects, OCR tokens not only contain rich visual clues but also rich semantic information. To get rich representations of OCR tokens, following Hu et al. [15] and Sidorov et al. [27], we apply FastText [9] to extract the sub-word feature x^{ft}_m for OCR token a^{ocr}_m, and employ PHOC (Pyramidal Histogram of Characters) [2] to extract the character-level feature x^{ph}_m for a^{ocr}_m. Similarly, the spatial location of a^{ocr}_m is denoted as x^{ts}_m = [b^{ocr}_m; dv^{ocr}_m/255], where b^{ocr}_m = [x^{tl}/W; y^{tl}/H; x^{br}/W; y^{br}/H]. To better fuse the three-dimensional relative spatial relationships into OCR features, we propose a Depth-enhanced Feature Updating Module (DeFUM, cf. Sec. 3.3). After feeding the OCR token appearance features x^{tf}_{1:M} into DeFUM along with x^{of}_{1:N}, we obtain the depth-enhanced OCR appearance features x^{tf'}_{1:M}:

x^{tf'}_{1:M} = DeFUM(x^{of}_{1:N}, x^{tf}_{1:M}, dv^{obj}_{1:N}, dv^{ocr}_{1:M}).    (2)

Considering the gap between visual features and semantic features, we do not update the FastText and PHOC features (semantic embeddings). Following Wang et al. [36], we adopt confidence features {x^{conf}_m}_{m=1:M} of the detected OCR tokens, since some OCR tokens may be amiss. The final embedding of OCR token a^{ocr}_m is calculated by

x^{ocr}_m = LN(W_{tf} x^{tf'}_m + W_{ft} x^{ft}_m + W_{ph} x^{ph}_m) + LN(W_s x^{ts}_m) + LN(W_{conf} x^{conf}_m),    (3)

where W_{tf}, W_{ft}, W_s, and W_{conf} are learnable parameters, and LN represents layer normalization.
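The embedding equations (1) and (3) amount to per-modality linear projections followed by layer normalization. Below is a hedged PyTorch sketch of that fusion; the class name, the shared LayerNorm instance, and the default FastText/PHOC dimensions are our own assumptions rather than details given in the paper, and the upstream extractors (Faster R-CNN, FastText, PHOC, the OCR confidence estimator) are assumed to be run elsewhere.

```python
import torch
import torch.nn as nn

class EntityEmbedding(nn.Module):
    """Sketch of the projection-plus-LayerNorm fusion in Eqs. (1) and (3)."""
    def __init__(self, d_model=768, d_app=2048, d_ft=300, d_phoc=604):
        super().__init__()
        self.w_of = nn.Linear(d_app, d_model)    # object appearance, Eq. (1)
        self.w_tf = nn.Linear(d_app, d_model)    # depth-enhanced OCR appearance
        self.w_ft = nn.Linear(d_ft, d_model)     # FastText sub-word feature
        self.w_ph = nn.Linear(d_phoc, d_model)   # PHOC character-level feature
        self.w_s = nn.Linear(5, d_model)         # [bbox(4); depth/255] location
        self.w_conf = nn.Linear(1, d_model)      # OCR confidence score
        self.ln = nn.LayerNorm(d_model)

    def object_embedding(self, x_of, x_os):                    # Eq. (1)
        return self.ln(self.w_of(x_of)) + self.ln(self.w_s(x_os))

    def ocr_embedding(self, x_tf, x_ft, x_ph, x_ts, x_conf):   # Eq. (3)
        fused = self.ln(self.w_tf(x_tf) + self.w_ft(x_ft) + self.w_ph(x_ph))
        return fused + self.ln(self.w_s(x_ts)) + self.ln(self.w_conf(x_conf))
```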
3.3. Depth-enhanced Feature Updating Module

Figure 4. Illustration of the depth-enhanced feature updating module, which consists of a depth-aware self-attention module and N layers of transformer. The DeFUM can effectively enhance the appearance features of OCR tokens with the help of depth information, which is conducive to constructing the three-dimensional geometric relationships and improving the accuracy of captions.

Since OCR tokens are generally not contained in the word vocabulary, the model needs to pick the predicted tokens from the detected OCR tokens through a pointer network [32]. Therefore, the quality of the OCR token features is especially critical to the generated captions. For this reason, we propose the DeFUM to enhance the OCR appearance features {x^{tf}_m}_{m=1:M}. Intuitively, we consider that object features can play the role of a bridge linking adjacent scene text. For example, if different OCR tokens lie on the same object, then these OCR tokens are likely to have a strong correlation. Therefore, we first obtain the appearance embedding x^v of the visual entities by concatenating the visual features {x^{of}_n}_{n=1:N} of objects {a^{obj}_n}_{n=1:N} and the visual features {x^{tf}_m}_{m=1:M} of OCR tokens {a^{ocr}_m}_{m=1:M}:

x^v = Concat({x^{of}_n}_{n=1:N}, {x^{tf}_m}_{m=1:M}),    (4)

where x^v ∈ R^{(m+n)×d}. Subsequently, we calculate the query Q, key K, and value V as follows:

Q = x^v W_Q,  K = x^v W_K,  V = x^v W_V,    (5)

where Q ∈ R^{(m+n)×d}, K ∈ R^{(m+n)×d}, and V ∈ R^{(m+n)×d}. Then we obtain the attention weight matrix A by

A = QK^T / √d,    (6)

where A ∈ R^{(m+n)×(m+n)}. To measure the relative depth between different visual entities, we calculate the relative depth score of x^v_i and x^v_j by

R_{i,j} = log(dv_j / dv_i),    (7)

where dv = Concat(dv^{obj}, dv^{ocr}) and the relative depth map R ∈ R^{(m+n)×(m+n)}. Then the output of the depth-aware self-attention module (cf. Fig. 4) is computed as

x^{vti} = LN(x^v + softmax(A + R)V),    (8)

where the depth-aware visual feature x^{vti} ∈ R^{(m+n)×d} and LN denotes layer normalization. Eventually, x^{vti} is fed into N layers of transformer [30] to update the appearance features with depth information:

x^{vto} = transformer(x^{vti}),    (9)

where the depth-enhanced appearance feature x^{vto} ∈ R^{(m+n)×d} (x^{vto} = [x^{of'}, x^{tf'}]), with x^{of'} ∈ R^{n×d} and x^{tf'} ∈ R^{m×d}, respectively.
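The depth-aware self-attention of Eqs. (5)-(8) can be read as standard scaled dot-product attention whose logits are biased by the pairwise relative-depth matrix R. The single-head module below is an illustrative sketch of that reading, not the released implementation; the clamp that avoids log(0) is our own safeguard, and the N follow-up transformer layers of Eq. (9) are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthAwareSelfAttention(nn.Module):
    """Single-head sketch of Eqs. (5)-(8) with the relative-depth bias of Eq. (7)."""
    def __init__(self, d_model=768):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.ln = nn.LayerNorm(d_model)
        self.d_model = d_model

    def forward(self, x_v, dv):
        # x_v: (m+n, d) concatenated object and OCR appearance features, Eq. (4)
        # dv:  (m+n,)  float depth values of the same entities
        q, k, v = self.w_q(x_v), self.w_k(x_v), self.w_v(x_v)
        a = q @ k.transpose(-2, -1) / self.d_model ** 0.5        # Eq. (6)
        dv = dv.clamp(min=1.0)                                   # guard against log(0)
        r = torch.log(dv.unsqueeze(0) / dv.unsqueeze(1))         # Eq. (7): R[i, j] = log(dv_j / dv_i)
        return self.ln(x_v + F.softmax(a + r, dim=-1) @ v)       # Eq. (8)
```

In the full module, this output would then be refined by N standard transformer layers to give the depth-enhanced features of Eq. (9).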
3.4. Salient Visual Object Concepts Extractor

To improve the multimodal transformer's ability to model scene objects, avoid missing vital objects, and generate more fine-grained captions, we propose a visual object concepts extractor to introduce the semantic information of visual object concepts. This semantic information can enhance the visual information of objects during the interaction in the multimodal transformer. Specifically, we employ the pre-trained model CLIP [25] to filter the K object concepts most related to the given image from a vocabulary. The visual object concepts and their corresponding scores are represented as {a^{voc}_k}_{k=1:K} and {a^{score}_k}_{k=1:K}. The embedding of the visual object concepts is calculated as follows:

x^{voc}_k = LN(W_{sort} FT(a^{voc}_k)) + LN(W_{score} a^{score}_k),    (10)

where FT(a^{voc}_k) denotes the FastText [9] embedding of a^{voc}_k, and W_{sort} and W_{score} are learnable parameters.
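A minimal sketch of the top-K concept filtering with CLIP, assuming the publicly available CLIP package and a ViT-B/32 checkpoint; the concept vocabulary, the image path, and the prompt template are placeholders of ours, since the paper does not specify them.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

concept_vocab = ["billboard", "advertisement", "sign", "bike", "athlete", "mascot"]  # placeholder vocabulary
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize([f"a photo of a {c}" for c in concept_vocab]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)       # image-to-concept similarity
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

K = 5
scores, indices = probs.topk(min(K, len(concept_vocab)))
top_concepts = [(concept_vocab[i], float(s)) for i, s in zip(indices.tolist(), scores)]
# The retained similarity scores play the role of a^{score}_k in Eq. (10).
```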
3.5. Reasoning and Generation Module

Following Sidorov et al. [27], we adopt a multimodal transformer to encode all the obtained embeddings. Considering that most OCR tokens (e.g., "tanger" and "outlets") are not common words, it is not appropriate to generate captions from a fixed vocabulary alone. Following M4C-Captioner [27], we utilize two classifiers, one over the common vocabulary and one over the candidate OCR tokens, to generate words. Overall, the multimodal transformer models the multimodal embeddings and generates common words y'_t. Afterward, a dynamic pointer network [32] is employed to generate the final words y_t:

y'_t, x^{tocr}_{1:M} = mmt(x^{obj}_{1:N}, x^{ocr}_{1:M}, x^{voc}_{1:K}, x^{dec}_{t-1}),    (11)

where mmt represents the multimodal transformer, x^{tocr}_m indicates the feature of x^{ocr}_m updated by mmt, and x^{dec}_{t-1} is the embedding of the previous output y'_{t-1}, respectively. Eventually, the dynamic pointer network makes the final prediction from the common words in the fixed vocabulary and the special words x^{tocr}_{1:M} from the OCR tokens:

y_t = argmax([lm^w(y'_t), PN(y'_t, x^{tocr}_{1:M})]),    (12)

where lm^w(·) denotes a linear classifier over the fixed common vocabulary, PN(·) denotes the pointer network, and the caption C consists of the predicted words y_0, ..., y_t. We train our model by optimizing the cross-entropy loss:

L(θ) = − Σ_{t=1}^{T} log p(ŷ_t | y_{1:t−1}, x^{tocr}_{1:M}),    (13)

where ŷ_t is the corresponding token in the ground truth.
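Eq. (12) reduces to a greedy argmax over the concatenation of vocabulary-classifier scores and pointer-network scores. The helper below is an illustrative sketch under that reading; the tensor shapes and names are ours, not the authors' API.

```python
import torch

def select_words(vocab_logits, ocr_logits, vocab, ocr_tokens):
    """Sketch of Eq. (12): per step, pick the argmax over the concatenation of
    fixed-vocabulary scores (T, V) and pointer scores over this image's OCR tokens (T, M)."""
    joint = torch.cat([vocab_logits, ocr_logits], dim=-1)   # (T, V + M)
    ids = joint.argmax(dim=-1)                              # greedy choice per step
    words = []
    for idx in ids.tolist():
        if idx < vocab_logits.size(-1):
            words.append(vocab[idx])                                   # common word
        else:
            words.append(ocr_tokens[idx - vocab_logits.size(-1)])      # copied OCR token
    return words
```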
Method                              BLEU-4  METEOR  ROUGE-L  SPICE  CIDEr-D
Up-Down (CVPR 2018) [4]             20.1    17.8    42.9     11.7   41.9
AoA (ICCV 2019) [16]                20.4    18.9    42.9     13.2   42.7
M4C-Captioner (ECCV 2020) [27]      23.3    22.0    46.2     15.6   89.6
MMA-SR (ACM MM 2020) [34]           24.6    23.0    47.3     16.2   98.0
SS-Baseline (AAAI 2021) [44]        24.9    22.7    47.2     15.7   98.8
CNMT (AAAI 2021) [36]               24.8    23.0    47.1     16.3   101.7
ACGs-Captioner (CVPR 2021) [38]     24.7    22.5    47.1     15.9   95.5
MAGIC (AAAI 2022) [43]              22.2    20.5    42.3     13.8   76.6
OMOT (ICMR 2022) [29]               26.4    22.2    47.5     14.9   95.1
LSTM-R (CVPR 2021) [35]             27.9    23.7    49.1     16.6   109.3
DEVICE (Ours)                       27.4    24.5    49.2     17.7   115.9
M4C-Captioner (w/ GT OCRs) [27]     26.0    23.2    47.8     16.2   104.3
TAP† (CVPR 2021) [39]               25.8    23.8    47.9     17.1   109.2

Table 1. Comparison of our performance with other baseline models on the TextCaps validation set; TAP† is a pre-trained model. Our DEVICE acquires competitive performance compared to state-of-the-art models with p-values < 0.05.

Method                              BLEU-4  METEOR  ROUGE-L  SPICE  CIDEr-D
Up-Down (CVPR 2018) [4]             14.9    15.2    39.9     8.8    33.8
AoA (ICCV 2019) [16]                15.9    16.6    40.4     10.5   34.6
M4C-Captioner (ECCV 2020) [27]      18.9    19.8    43.2     12.8   81.0
MMA-SR (ACM MM 2020) [34]           19.8    20.6    44.0     13.2   88.0
SS-Baseline (AAAI 2021) [44]        20.2    20.3    44.2     12.8   89.6
CNMT (AAAI 2021) [36]               20.0    20.8    44.4     13.4   93.0
ACGs-Captioner (CVPR 2021) [38]     20.7    20.7    44.6     13.4   87.4
OMOT (ICMR 2022) [29]               21.2    19.6    44.4     12.1   84.9
LSTM-R (CVPR 2021) [35]             22.9    21.3    46.1     13.8   100.8
DEVICE (Ours)                       23.0    22.5    46.7     15.0   109.7
M4C-Captioner (w/ GT OCRs) [27]     21.3    21.1    45.0     13.5   97.2
TAP† (CVPR 2021) [39]               21.9    21.8    45.6     14.6   103.2
Human [27]                          24.4    26.1    47.0     18.8   125.5

Table 2. Comparison of our performance with other baseline models on the TextCaps test set. Our DEVICE outperforms state-of-the-art models on all metrics, notably boosting CIDEr-D from 100.8 to 109.7.
4. Experiments

4.1. Dataset and Settings

Dataset and evaluation metrics. The TextCaps dataset [27] is constructed for text-based image captioning and contains 28408 images with 5 captions per image. The training, validation, and test sets contain 21953, 3166, and 3289 images, respectively. For evaluation, we choose five widely used image captioning metrics, i.e., BLEU-4 [23], ROUGE-L [20], METEOR [7], SPICE [3], and CIDEr-D [31]. Following Sidorov et al. [27], we focus on CIDEr-D when comparing different methods, because CIDEr-D has a high correlation with human evaluation scores and puts more weight on informative special tokens (e.g., OCR tokens) in the captions.

Settings and implementation details. To detect sufficient OCR tokens, we employ the Rosetta OCR system [10] and the Google OCR system [1] following Wang et al. [35]. The number of OCR tokens for each image is limited to at most 80. We extract 100 object appearance features for each picture; the dimension d of the appearance features is 2048, and the dimension of the common embedding space in the multimodal transformer is 768. The maximum generation length is 30. DeFUM contains two transformer layers in our experiments; we set the number of layers of the multimodal transformer to 4 and the number of self-attention heads to 12. We adopt default settings for the other parameters following BERT-BASE [13]. The fixed common vocabulary has 6736 words. To get visual object concepts, we filter the top-k (k = 5 in our implementation) visual object concepts from the vocabulary with CLIP [25]. We retain the similarity score given by CLIP as a reference for measuring the reliability of the visual object concepts. To get the confidence of OCR tokens, we use the pre-trained four-stage STR [5] following Wang et al. [36]. For OCR tokens that only appear in one OCR system, we set the confidence to 0.9. We train the model for about 10 epochs on a single 3090 Ti GPU, and the batch size is 64. We adopt the Adam [18] optimizer; the initial learning rate is 1e-4 and is decayed to 0.1 times its value every 3 epochs. We monitor the CIDEr-D metric to choose the best model and evaluate it on both the validation set and the test set. All the experimental results are computed via EvalAI online platform submissions.
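For readability, the hyper-parameters reported above can be collected in a single configuration sketch; the key names are ours and do not correspond to the authors' released configuration files.

```python
# Hyper-parameters as reported in Sec. 4.1 (key names are illustrative only).
DEVICE_CONFIG = {
    "max_ocr_tokens": 80,
    "num_object_features": 100,
    "appearance_dim": 2048,
    "common_embedding_dim": 768,
    "max_caption_length": 30,
    "defum_transformer_layers": 2,
    "mmt_layers": 4,
    "attention_heads": 12,
    "vocab_size": 6736,
    "num_visual_object_concepts": 5,      # K
    "single_system_ocr_confidence": 0.9,
    "epochs": 10,
    "batch_size": 64,
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "lr_decay": 0.1,                      # applied every 3 epochs
}
```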
Row  Google  DI  DeFUM  VOC  B-4   C-D
1    -       -   -      -    24.5  101.5
2    X       -   -      -    24.9  105.3
3    X       X   -      -    25.6  109.8
4    X       -   -      X    26.3  112.4
5    X       X   X      -    26.1  113.9
6    X       X   X      X    27.4  115.9

Table 3. Ablation of each module in DEVICE on the TextCaps validation set. Google denotes the Google OCR system, DI denotes Depth Information, DeFUM represents the Depth-enhanced Feature Updating Module, and VOC indicates Visual Object Concepts. B-4 and C-D represent BLEU-4 and CIDEr-D, respectively.

Number of VOC    CIDEr-D
DEVICE, K = 3    114.9 (↓1.0)
DEVICE, K = 5    115.9
DEVICE, K = 8    115.5 (↓0.4)
DEVICE, K = 10   114.6 (↓1.3)

Table 4. Analysis of the number K of visual object concepts on the validation set. VOC indicates Visual Object Concepts.

4.2. Experimental Results

Compared models. We make comparisons with other models in Tab. 1 and Tab. 2. Up-Down [4] and AoA [16] are widely used image captioning models; both are trained and inferred without OCR tokens. M4C-Captioner [27], SS-Baseline [44], ACGs-Captioner [38], MAGIC [43], and CNMT [36] are transformer-based strong baseline models with OCR tokens and other modalities as input. MMA-SR [34], OMOT [29], and LSTM-R [35] are LSTM-based strong baseline models that focus more on modeling two-dimensional geometrical relationships. TAP† [39] is a text-aware pre-trained model trained with extra pre-training data. M4C-Captioner (w/ GT OCRs) [27] indicates that M4C-Captioner takes ground-truth OCR tokens as input and is evaluated on a subset of the TextCaps test set. Human [27] represents human-generated captions.

Main results on the TextCaps validation set. The comparisons on the validation set between our DEVICE and other models are shown in Tab. 1. The traditional methods Up-Down and AoA perform less well than the other models with OCR tokens because they cannot generate scene text. CNMT proves the effectiveness of confidence embedding and the repetition mask, improving CIDEr-D from 89.6 to 101.7 compared with M4C-Captioner. LSTM-R constructs sufficient 2D spatial relations and boosts all metrics significantly. By incorporating depth information and the semantic information of visual object concepts into the transformer, our DEVICE improves METEOR, SPICE, and CIDEr-D by 0.8, 1.1, and 6.6, respectively, compared with the SOTA model LSTM-R [35]. The performance of DEVICE on BLEU-4 ranks second only to LSTM-R (0.5 below). Note that LSTM-R cleaned the training captions by removing the undetected OCR symbols, which increases BLEU-4 by 0.6 [35]. Unlike LSTM-R, for a fair comparison, we train our model with the raw captions provided by TextCaps [27], following the majority of prior work [27, 36, 39, 44]. The performance of our model on the TextCaps validation set demonstrates the effectiveness of both depth information and the semantic information of visual object concepts.

Main results on the TextCaps test set. Compared with LSTM-R, our model shows superiority on the TextCaps test set, gaining rises of 0.1, 1.2, 0.6, 1.2, and 8.9 on BLEU-4, METEOR, ROUGE-L, SPICE, and CIDEr-D, respectively. Significantly, the inspiring improvement on CIDEr-D shows that our DEVICE generates more accurate scene text. The results of M4C-Captioner (w/ GT OCRs) [27] are evaluated on a subset of the TextCaps test set, excluding those samples without OCR annotations. That DEVICE outperforms M4C-Captioner (w/ GT OCRs) verifies that our DEVICE is capable of compensating for the errors of OCR systems to some extent. Encouragingly, our model outperforms the pre-trained model TAP† on all metrics, demonstrating that DEVICE utilizes the limited training data more efficiently. Finally, the gap between machines and humans narrows on the text-based image captioning task.

4.3. Ablation Studies

To analyze the respective effect of each module on the model performance, we conduct ablation experiments, as shown in Tab. 3. Comparing Line 1 and Line 2, CIDEr-D is boosted from 101.5 to 105.3; it is evident that more abundant and more accurate OCR tokens positively impact the model performance. While the improvement on BLEU-4 is relatively moderate, we think one reason may be that BLEU-4 treats all matched words equally, whereas CIDEr-D pays more attention to OCR tokens. Lines 2 and 3 prove that simply concatenating depth values with the two-dimensional space coordinates boosts the performance substantially, increasing BLEU-4 by 0.7 and CIDEr-D by 4.5, respectively. Comparing Lines 3 and 5, we find that updating the appearance features of OCR tokens under the supervision of depth information boosts all metrics. The significant improvements between Lines 5 and 6 demonstrate the effectiveness of the semantic information of visual object concepts, which boosts BLEU-4 by 1.3 and CIDEr-D by 2.0, respectively. More remarkably, the total increase of CIDEr-D
is 10.6 and that of BLEU-4 is 2.5. This demonstrates that all modules work together very efficiently.

We also evaluate the influence of the number K of visual object concepts on the TextCaps validation set (cf. Tab. 4). The experimental results indicate that some salient visual object concepts may be missed when K is less than 5, while irrelevant visual concepts may be retrieved when K is larger than 7. Therefore, we set K to 5.

Figure 5. Examples of captions generated by the strong baseline CNMT [36], our DEVICE, and GT (ground truth). Red indicates inappropriate scene text generated by CNMT. The depth maps and visual object concepts are displayed in the middle. The score of a visual object concept measures how similar the concept is to the image. Blue indicates scene text and Green indicates visual object concepts. (Case 1 - CNMT: "A yellow and white van with the word ats mobile on the front." Ours: "A yellow and green van with the word ats on the front." GT: "The ats van is mobile and going to a service for tyre fitting." Case 2 - CNMT: "A plane with the words unsafe on it." Ours: "A sign in front of a plane that says unsafe area keep out." GT: "The body of a large airplane is being worked on behind a sign that say Unsafe Area Keep Out." Case 3, with salient visual object concepts book [0.277], comic [0.248], cover [0.223], shelf [0.146], bookshelf [0.106] - CNMT: "A table with many books including one called 'pey'." Ours: "A table full of comic books including one called final." GT: "Several comic books, including Spider-Man and X-Men, are laid out." Case 4, with salient visual object concepts mascot [0.767], hockey [0.153], podium [0.040], teammates [0.024], banner [0.016] - CNMT: "A man in a blue shirt that says winnipeg jet on it." Ours: "A mascot for the pylon is holding a woman wearing a winnipeg jet shirt." GT: "Men wearing Mario and Luigi costumes wearing advertisements for the pylon festival pose with a woman.")

4.4. Qualitative Analysis

We show some cases generated by CNMT [36] and DEVICE on the TextCaps validation set in Fig. 5, where "GT" means ground truth. We pick CNMT [36] for comparison because our geometric relationship construction methods are the same except for depth, and both are transformer-based structures. Case 1 and Case 2 demonstrate the effectiveness of depth information, while Case 3 and Case 4 illustrate the influence of visual object concepts. In Case 1, CNMT, which lacks depth information, mistakenly takes "ats" and "mobile" as tokens in the same plane and connects them incorrectly. By introducing 3D geometric relationships, our DEVICE generates accurate scene text. In Case 2, CNMT represents the text on the sign incompletely and confuses the positional relationship between the airplane and the text on the sign. DEVICE enhances the correlation of the scene text on the warning sign and clarifies the spatial relationship between visual entities by introducing 3D geometric relationships. For Case 3, CNMT only generates relatively coarse-grained words such as "many books". In contrast, with the help of the semantic information of visual object concepts, DEVICE is capable of generating the more accurate words "comic books", like a human. In the last case, unlike CNMT, which ignores the visual object "mascots", DEVICE generates a more comprehensive description, enabling rational use of the scene text "pylon". The aforementioned cases demonstrate the effectiveness of DEVICE well.

However, in Case 4 we can also see the gap between our DEVICE and humans; we consider one possible reason to be that humans have common knowledge. Although adopting CLIP [25] to extract the visual object concept "mascot" in Case 4 can be regarded as a way of implicitly using external knowledge, this knowledge is not fine-grained and accurate enough compared to "Mario" and "Luigi".

5. Conclusion

In this paper, we propose a DEpth and VIsual ConcEpts Aware Transformer (DEVICE) for TextCaps. We introduce depth information and the semantic information of salient visual object concepts to improve the accuracy and integrity of captions. We also design a depth-enhanced feature updating module to improve OCR appearance features, which is capable of facilitating the construction of three-dimensional
geometric relations. More remarkably, our model achieves state-of-the-art performance on the TextCaps test set. By comparing the experimental results with human annotations, we consider that extracting more explicit external knowledge can further improve this task.

References

[1] Google OCR. https://cloud.google.com/functions/docs/tutorials/ocr.
[2] Jon Almazán, Albert Gordo, Alicia Fornés, and Ernest Valveny. Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12):2552–2566, 2014.
[3] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. In European Conference on Computer Vision, pages 382–398. Springer, 2016.
[4] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6077–6086, 2018.
[5] Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh, and Hwalsuk Lee. What is wrong with scene text recognition model comparisons? Dataset and model analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4715–4723, 2019.
[6] Pratyay Banerjee, Tejas Gokhale, Yezhou Yang, and Chitta Baral. Weakly supervised relative spatial reasoning for visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1908–1918, 2021.
[7] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005.
[8] Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Appalaraju, and R. Manmatha. Latr: Layout-aware transformer for scene-text vqa. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16548–16558, 2022.
[9] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.
[10] Fedor Borisyuk, Albert Gordo, and Viswanath Sivakumar. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 71–79, 2018.
[11] Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10578–10587, 2020.
[12] Camille Couprie, Clément Farabet, Laurent Najman, and Yann LeCun. Indoor semantic segmentation using depth information. arXiv preprint arXiv:1301.3572, 2013.
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[14] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.
[15] Ronghang Hu, Amanpreet Singh, Trevor Darrell, and Marcus Rohrbach. Iterative answer prediction with pointer-augmented multimodal transformers for textvqa. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9992–10002, 2020.
[16] Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. Attention on attention for image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4634–4643, 2019.
[17] Zan-Xia Jin, Mike Zheng Shou, Fang Zhou, Satoshi Tsutsui, Jingyan Qin, and Xu-Cheng Yin. From token to word: OCR token evolution via contrastive learning and semantic matching for text-vqa. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4564–4572, 2022.
[18] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[19] Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326, 2019.
[20] Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. Textboxes: A fast text detector with a single deep neural network. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[21] Zeming Liao, Qingbao Huang, Yu Liang, Mingyi Fu, Yi Cai, and Qing Li. Scene graph with 3d information for change captioning. In Proceedings of the 29th ACM International Conference on Multimedia, pages 5074–5082, 2021.
[22] Yuhang Liu, Wei Wei, Daowan Peng, Xian-Ling Mao, Zhiyong He, and Pan Zhou. Depth-aware and semantic guided relational attention network for visual question answering. IEEE Transactions on Multimedia, pages 1–14, 2022.
[23] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
[24] Yue Qiu, Yutaka Satoh, Ryota Suzuki, Kenji Iwata, and Hirokatsu Kataoka. 3d-aware scene change captioning from multiview images. IEEE Robotics and Automation Letters, 5(3):4743–4750, 2020.
[25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[26] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015.
[27] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: A dataset for image captioning with reading comprehension. In European Conference on Computer Vision, pages 742–758. Springer, 2020.
[28] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
[29] Wenliang Tang, Zhenzhen Hu, Zijie Song, and Richang Hong. Ocr-oriented master object for text image captioning. In Proceedings of the 2022 International Conference on Multimedia Retrieval, pages 39–43, 2022.
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[31] Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.
[32] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. Advances in Neural Information Processing Systems, 28, 2015.
[33] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
[34] Jing Wang, Jinhui Tang, and Jiebo Luo. Multimodal attention with image text spatial relationship for ocr-based image captioning. In Proceedings of the 28th ACM International Conference on Multimedia, pages 4337–4345, 2020.
[35] Jing Wang, Jinhui Tang, Mingkun Yang, Xiang Bai, and Jiebo Luo. Improving ocr-based image captioning by incorporating geometrical relationship. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1306–1315, 2021.
[36] Zhaokai Wang, Renda Bao, Qi Wu, and Si Liu. Confidence-aware non-repetitive multimodal transformers for textcaps. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 2835–2843, 2021.
[37] Ziwei Wang, Yadan Luo, Yang Li, Zi Huang, and Hongzhi Yin. Look deeper see richer: Depth-aware image paragraph captioning. In Proceedings of the 26th ACM International Conference on Multimedia, pages 672–680, 2018.
[38] Guanghui Xu, Shuaicheng Niu, Mingkui Tan, Yucheng Luo, Qing Du, and Qi Wu. Towards accurate text-based image captioning with content diversity exploration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12637–12646, 2021.
[39] Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, and Jiebo Luo. Tap: Text-aware pre-training for text-vqa and text-caption. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8751–8761, 2021.
[40] Gangyan Zeng, Yuan Zhang, Yu Zhou, and Xiaomeng Yang. Beyond ocr + vqa: Involving ocr into the flow for robust and accurate textvqa. In Proceedings of the 29th ACM International Conference on Multimedia, pages 376–385, 2021.
[41] Zheng-Jun Zha, Daqing Liu, Hanwang Zhang, Yongdong Zhang, and Feng Wu. Context-aware visual policy network for fine-grained image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[42] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5579–5588, 2021.
[43] Wenqiao Zhang, Haochen Shi, Jiannan Guo, Shengyu Zhang, Qingpeng Cai, Juncheng Li, Sihui Luo, and Yueting Zhuang. Magic: Multimodal relational graph adversarial inference for diverse and unpaired text-based image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3335–3343, 2022.
[44] Qi Zhu, Chenyu Gao, Peng Wang, and Qi Wu. Simple is not easy: A simple strong baseline for textvqa and textcaps. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 3608–3615, 2021.