Professional Documents
Culture Documents
Device: Depth and Visual Concepts Aware Transformer For Textcaps
Device: Depth and Visual Concepts Aware Transformer For Textcaps
els may describe scene text inaccurately. The other possible 15 15 135
in front of a glass. Salient visual object concepts:
96 [‘bike’, ‘equipment’, ‘athlete’]
reason is that existing methods fail to generate fine-grained
an athlete in move pants is on a spinner bike.
descriptions of some visual objects. In addition, they may
ignore essential visual objects, leading to the scene text (a) (b)
belonging to these ignored objects not being utilized. To
address the above issues, we propose a DEpth and VI- Figure 1. Example of influence of depth information and visual
sual ConcEpts Aware Transformer (DEVICE) for TextCaps. object concepts on descriptions. Without the assistance of depth
Concretely, to construct three-dimensional geometric rela- information, the model may confuse the front and rear positional
tions, we introduce depth information and propose a depth- relationship of the OCR tokens. The semantic information of vi-
enhanced feature updating module to ameliorate OCR to- sual object concepts can assist the model in generating accurate
and comprehensive captions. The visual object concepts are de-
ken features. To generate more precise and comprehensive
tected by the visual object concepts extractor. Green indicates
captions, we introduce semantic features of detected visual
scene text, and Blue indicates visual objects.
object concepts as auxiliary information. Our DEVICE is
capable of generalizing scenes more comprehensively and
boosting the accuracy of described visual entities. Sufficient scribing scenes containing text. Taking the upper left Case
experiments demonstrate the effectiveness of our proposed in Fig. 1 as an example, “a bottle of Coca-Cola” should
DEVICE, which outperforms state-of-the-art models on the be described rather than “a bottle”. To solve this defect,
TextCaps test set. Our code will be publicly available. Sidorov et al. [27] collect a high-quality dataset TextCaps,
in which captions contain scene text from the images.
1. Introduction To utilize the scene text in pictures, extracting Optical
Character Recognition (OCR) tokens from given images is
Image captioning is one of the important tasks in the in- the precondition. M4C-Captioner [27] utilizes Rosetta [10]
tersection of vision and language, which aims to help the OCR system to obtain OCR tokens, and then employ a mul-
visually impaired understand visual information as accu- timodal transformer directly to model the relationships be-
rately as possible. Some recent studies have achieved supe- tween visual objects and OCR tokens. However, it is inap-
rior performance and are even comparable to humans under propriate to treat all OCR tokens equally, since deformed
certain circumstances [4, 41, 42]. However, these traditional scene texts may be detected incorrectly. Wang et al. [36]
image captioning models perform unsatisfactorily when de- introduce a concept of OCR confidence which largely alle-
viated this issue. Then Wang et al. [34, 35] are committed Transformer (DEVICE) for TextCaps. DEVICE per-
to improving the spatial relationship construct methods for forms better than previous methods as it improves the
OCR tokens, which have been proven effective, and the pro- accuracy and completeness of captions.
posed model LSTM-R [35] achieves the state-of-the-art per-
formance. However, current methods are not accurate and • We devise a Depth-enhanced Feature Updating Mod-
comprehensive enough when describing the scenes, which ule (DeFUM), which effectively helps the model eval-
can be mainly manifested in the following two-fold. uate the relevance of OCR tokens. To the best of our
Firstly, the real-world is three-dimensional (3D). How- knowledge, we are the first to apply depth information
ever, current studies [27, 34–36, 39] only exploit the spatial to the text-based image captioning task.
relationships of OCR tokens at the plane-level, which leads
to visual location information being utilized incompletely, • We introduce semantic information of salient vi-
and even results in generating inaccurate scene text. As il- sual objects by the Visual Object Concepts Extractor
lustrated in the Case (a) of Fig. 1, CNMT [36] mistakenly (VOC). Semantic information of salient visual objects
regards “victorian” and “coca” as closely related OCR to- can help the multimodal transformer generate fine-
kens in the same plane, which are irrelevant and actually grained objects and better understand holistic scenes.
one behind the other. Naturally, we consider that intro-
• Our DEVICE outperforms the state-of-the-art models
ducing depth information based on two-dimensional (2D)
on the TextCaps test set, inspiringly boosting CIDEr-D
information can simulate real-world geometric information
from 100.8 to 109.7.
for improving the accuracy of scene text in captions.
Secondly, scenes contain abundant but complex visual
entities (visual objects and scene text), existing methods
2. Related Work
suffer from coarse-grained descriptions of visual objects
and the absence of crucial visual entities in captions. As 2.1. Text-based Image Captioning
shown in the Case (b) on the right of Fig. 1, CNMT [36]
generates the coarse-grained word “person” for the visual Image captioning task has made a lot of progress in re-
object “athlete”. Besides, CNMT ignores an important en- cent years [4, 11, 16, 33], but most of them perform poorly
tity “bike”, causing the OCR token “spinner” on it cannot facing images with scene text. Text-based image caption-
be utilized effectively. One possible reason for the absence ing was naturally born to alleviate this problem, which was
of “bike” is that the visual features of mutilated objects are proposed by Sidorov et al. [27]. The authors introduced the
not conducive to modeling. Intuitively, we consider that the Textcaps dataset and a multimodal transformer-based [30]
semantic information of salient visual object concepts can baseline model M4C-Captioner. The M4C-Captioner was
alleviate coarse-grained and partial captioning. modified from a Text-VQA [8, 17, 28, 40] model M4C [15],
To tackle the aforementioned issues, we propose a DEpth and can encode both image and OCR tokens to generate
and VIsual ConcEpts Aware Transformer (DEVICE) model more property captions. However, this paradigm suffers
for TextCaps. It mainly comprises a reading module, a from a lack of diversity and information integrity. Zhang
depth-enhanced feature updating module, a salient visual et al. [43] and Xu et al. [38] were committed to solving this
object concepts extractor, and a multimodal transformer. problem by introducing additional captions. Although in-
Concretely, in the reading module, we employ a monocu- troducing multiple captions alleviates the problem of miss-
lar depth estimator BTS [19] to generate depth maps, then ing scene information, it is inherently hard to summarize
we extract depth values for visual objects and OCR tokens, the complex entities of a scene within one caption well.
respectively. The depth-enhanced feature updating mod- Apart from this, M4C-Captioner is less than satisfactory
ule is designed to optimize the visual features of OCR to- performed on Textcaps due to insufficient use of visual and
kens, which is beneficial for three-dimensional geometry re- semantic information of images. To distinguish possible
lationship construction. In the salient visual object concepts misidentified OCR recognition results, Wang et al. [36] in-
semantic extractor, we first employ a pre-trained visual- troduced confidence embedding and alleviated the adverse
language model CLIP [25] as a filter to generate K concepts effects of inaccurate OCR tokens to a certain extent. Yang
most related to the given picture. Then we adopt the seman- et al. [39] designed a pre-training model that can exploit
tic information of visual object concepts as augmentations OCR tokens and achieved strong results on multiple tasks,
to improve the overall expressiveness of our model. Subse- including text-based image captioning. Wang et al. [34, 35]
quent experiments in Sec. 4 demonstrate the effectiveness and Tang et al. [29] employ the 2D spatial relationships to
of our proposed DEVICE. measure correlation between OCR tokens. Nevertheless,
Our main contributions can be summarized as follows: we find that simply leveraging two-dimensional spatial rela-
tionships may cause incorrect captions. Therefore, we ame-
• We propose a DEpth and VIsual ConcEpts Aware liorate this problem by utilizing depth information.
Input image Reading module
𝑜𝑜𝑜𝑜
Object Objects 𝑥𝑥1:𝑛𝑛
detection embedding
Multimodal
𝑡𝑡𝑡𝑡
Depth-enhanced 𝑥𝑥1:𝑚𝑚
′
𝑦𝑦𝑡𝑡′ Linear Pointer
OCR OCR tokens
detection embedding feature updating Layer Network
𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡
module 𝑥𝑥1:𝑚𝑚
Transformer
𝑣𝑣𝑣𝑣𝑣𝑣
Billboard 𝑥𝑥1:𝐾𝐾
Vocabulary CLIP Advertisement Salient visual object a billboard for tanger outlets is
Sign concepts embedding
above a building.
…
Top-k concepts
Visual Object Concepts Extractor Generated Caption
𝑑𝑑𝑒𝑒𝑒𝑒
𝑥𝑥𝑡𝑡−1
Previous output embedding
Figure 2. Overview of our DEpth and VIsual ConcEpts Aware Transformer for TextCaps (DEVICE). In the reading module, we first
extract visual objects and OCR tokens along with their depth values. Then a depth-enhanced feature updating module is used to enhance
OCR appearance features with the help of depth values. After that, DEVICE utilizes depth information and plane geometry relationships to
construct 3D positional relationships. Semantic information of salient visual object concepts promotes fine-grained and holistic captioning.
2.2. Depth Information For Multimodal Tasks transformer, and a sentence decoder. Concretely, the input
image is first put into an object detector and two OCR de-
Monocular depth estimation is a challenging problem tectors to extract N visual objects and M OCR tokens along
of computer vision, whose purpose is to use depth val- with their two-dimensional regional coordinates. Then we
ues to measure the distance of objects relative to the cam- utilize a pre-trained depth estimation model to generate the
era [12, 14, 19]. With depth estimation technology, sig- pixel-level depth map D. After that, we update OCR vi-
nificant improvements have been achieved on many multi- sual features with a depth-enhanced feature updating mod-
modal tasks [6,21,37]. In the visual question answer (VQA) ule. To generate salient visual object semantic concepts for
task, Banerjee et al. [6] and Liu et al. [22] proved that each input image, we employ a pre-trained cross-modal tool
exploiting three-dimensional relationships by depth infor- CLIP [25] to match the top-K related concepts (e.g., “bill-
mation could enhance the spatial reasoning ability of VQA board” and “advertisement”) in vocabulary, which could al-
models. Apart from this, Qiu et al. [24] and Liao et al. [21] leviate the model from missing influential visual objects in
point out that ignoring the depth information could tremen- the scene, and generate fine-grained visual objects for more
dously place restrictions on applications of scene change accurate descriptions. Depth information is included in the
captioning in the real world. At the same time, we notice coordinates of both scene text and objects, hence the mul-
that text-based image captioning is facing the same prob- timodal transformer could establish three-dimensional re-
lem. Therefore, we construct three-dimensional spatial re- lations between visual entities in the picture. The image
lationships based on deep information. captions are generated in the auto-regressive paradigm.
where the depth-enhanced appearance feature xvto ∈ where ŷt is the corresponding token in the ground truth.
0 0 0 0
R(m+n)×d (xvto = [xof , xtf ]), xf r ∈ Rn×d , xf t ∈
m×d
R , respectively. 4. Experiments
3.4. Salient Visual Object Concepts Extractor 4.1. Dataset and Settings
To improve the modeling ability of the multimodal trans- Dataset and evaluation metrics. The TextCaps dataset
former to the scene objects, avoid the lack of vital objects, [27] is constructed for text-based image captioning, which
BLUE-4 METEOR ROUGE-L SPICE CIDEr-D
Up-Down (CVPR 2018) [4] 20.1 17.8 42.9 11.7 41.9
AoA (ICCV 2019) [16] 20.4 18.9 42.9 13.2 42.7
M4C-Captioner (ECCV 2020) [27] 23.3 22.0 46.2 15.6 89.6
MMA-SR (ACM MM 2020) [34] 24.6 23.0 47.3 16.2 98.0
SS-Baseline (AAAI 2021) [44] 24.9 22.7 47.2 15.7 98.8
CNMT (AAAI 2021) [36] 24.8 23.0 47.1 16.3 101.7
ACGs-Captioner (CVPR 2021) [38] 24.7 22.5 47.1 15.9 95.5
MAGIC (AAAI 2022) [43] 22.2 20.5 42.3 13.8 76.6
OMOT (ICMR 2022) [29] 26.4 22.2 47.5 14.9 95.1
LSTM-R (CVPR 2021) [35] 27.9 23.7 49.1 16.6 109.3
DEVICE (Ours) 27.4 24.5 49.2 17.7 115.9
M4C-Captioner(w/ GT OCRs) [27] 26.0 23.2 47.8 16.2 104.3
TAP† (CVPR 2021) [39] 25.8 23.8 47.9 17.1 109.2
Table 1. Comparison of our performance with other baseline models on the TextCaps validation set, and TAP† is a pre-trained model. Our
DEVICE acquires competitive performance compared to state-of-the-art models with p-values < 0.05.
Table 2. Comparison of our performance with other baseline models on the TextCaps test set. Our DEVICE outperforms state-of-the-art
models on all metrics, notably boosting CIDEr-D from 100.8 to 109.7.
contains 28408 images with 5 captions per image. The the dimension of common embedding space in multimodal
training, validation, and test set contain 21953, 3166, and transformer is 768. The maximum generation length is 30.
3289 images, respectively. For evaluation metrics, we DeFUM contains two layers of transformer in our exper-
choose five widely used evaluation metrics for image cap- iments, we set the layer number of the multimodal trans-
tioning, i.e., BLEU-4 [23], ROUGE-L [20], METEOR [7], former to 4 and the number of self-attention heads to 12.
SPICE [3], and CIDEr-D [31]. Following Sidorov et al. We adopt default settings for other parameters following
[27], we focus on the CIDEr-D when comparing different BERT-BASE [13]. The fixed common vocabulary has 6736
methods, because CIDEr-D has a high correlation with hu- words. To get visual object concepts, we filter top-k (k = 5
man evaluation scores and puts more weight on informative in our implement) visual object concepts from the vocabu-
special tokens (e.g. OCR tokens) in the captions. lary by CLIP [25]. We retain the similarity score given by
CLIP as a reference for measuring reliability of visual ob-
Settings and implementation details. To detect suffi- ject concepts. To get the confidence of OCR tokens, we use
cient OCR tokens, we employ the Rosetta OCR system [10] pre-trained four-stage STR [5] following Wang et al. [36].
and Google OCR system [1] following Wang et al. [35]. For those OCR tokens that only appear in one OCR system,
The number of OCR tokens for each image is limited to 80 we set the confidence to 0.9.
at most. We extract 100 object appearance features for each
picture, the dimension d of appearance features is 2048, and We train the model for about 10 epochs on a single 3090
Google DI DeFUM VOC B-4 C-D Number of VOC CIDEr-D
1 24.5 101.5 DEVICE K=3 114.9 ↓1.0
2 X 24.9 105.3 DEVICE K=5 115.9
3 X X 25.6 109.8 DEVICE K=8 115.5 ↓0.4
4 X X 26.3 112.4 DEVICE K = 10 114.6 ↓1.3
5 X X X 26.1 113.9
6 X X X X 27.4 115.9 Table 4. Analysis of visual object concepts number K on valida-
tion set. VOC indicates Visual Object Concepts.
Table 3. Ablation of each module in DEVICE on the TextCaps
validation set. Google denotes Google OCR system, DI denotes
Depth Information, DeFUM represents the Depth-enhanced Fea- tions by removing the undetected OCR symbols, which in-
ture Updating Module, and VOC indicates Visual Object Con- creases BLUE-4 by 0.6 [35]. Unlike LSTM-R, for a fair
cepts. B-4, C-D represent BLUE-4 and CIDEr-D, respectively. comparison, we train our model with raw captions provided
by TextCaps [27], following the majority [27, 36, 39, 44].
The performance of our model on the TextCaps validation
Ti GPU, and the batch size is 64. We adapt Adam [18] set demonstrates the effectiveness of both depth information
optimizer, the initial learning rate is 1e-4 and is declined and semantic information of visual object concepts.
to 0.1 times every 3 epochs. We monitored the CIDEr-D Main results on the TextCaps test set. Compared with
metric to choose the best model and evaluate it on both the LSTM-R, our model shows superiority on the TextCaps test
validation set and test set. All the experimental results are set, which gains rises of 0.1, 1.2, 0.6, 1.2, 8.9 on BLUE-
computed by Eval AI online platform submissions. 4, METEOR, ROUGE-L, SPICE, CIDEr-D, respectively.
Significantly, the inspiring improvement on the CIDEr-D
4.2. Experimental Results shows that our DEVICE generates more accurate scene text.
Compared models. We make comparisons with other The results of m4c-captioner (w/ GT OCRs) [27] are eval-
models in Tab. 1 and Tab. 2. Up-Down [4] and AOA [16] uated on a subset of the TextCaps test set, excluding those
are widely used image captioning models. Both of them are samples without OCR annotations. DEVICE outperforms
trained and inferred without OCR tokens. M4C-Captioner m4c-captioner(w/ GT OCRs) verifies that our DEVICE is
[27], SS-Baseline [44], ACGs-Captioner [38], MAGIC capable of compensating the negative impact of the error of
[43], and CNMT [36] are transformer-based strong base- OCR systems to some extent. Encouragingly, our model
line models with OCR tokens and other modalities as in- outperforms the pre-trained model TAP† on all metrics,
put. MMA-SR [34], OMOT [29], and LSTM-R [35] are demonstrating that DEVICE utilize limited training data
LSTM-based strong baseline models that focus more on more efficiently. Finally, the gap between machines and hu-
the modeling of two-dimensional geometrical relationships. mans narrows on the text-based image captioning task.
TAP† [39] is a text-aware pre-trained model trained with ex-
4.3. Ablation Studies
tra pre-training data. M4C-Captioner (w/ GT OCRs) [27]
indicates M4C-Cpationer takes ground truth OCR tokens To analyze the respective effects of each module on the
as input and evaluates on a subset of the TextCaps test set. model performance, we conducte ablation experiments, as
Human [27] represents human-generated captions. shown in Tab. 3. Comparing Line 1 and Line 2, CIDEr-D is
Main results on the TextCaps validation set. The com- boosted from 101.5 to 105.3. It is evident that more affluent
parisons on the validation set between our DEVICE and and more accurate OCR tokens positively impact the model
other models are shown in Tab. 1. Traditional methods UP- performance. While the improvement on BLUE-4 is rela-
Down and AOA perform less well than other models with tively moderate, we think one reason may be that BLUE-4
OCR tokens due to they cannot generate scene text. CNMT treats all matched words equally, but CIDEr-D pays more
proves the effectiveness of confidence embedding and rep- attention to OCR tokens. Line 2 and 3 prove that simply
etition mask, which improves CIDEr-D from 89.6 to 101.7 concatenating depth values with two-dimensional space co-
compared with M4C-Captioner. LSTM-R constructs suf- ordinates boosts the performance substantially, which in-
ficient 2D spatial relations and boosts all metrics signif- creases BLUE-4 by 0.7 and CIDEr-D by 4.5, respectively.
icantly. By incorporating depth information and seman- Comparing Line 3 and 5, we find that updating the ap-
tic information of visual object concepts with transformer, pearance features of OCR tokens under the supervision of
compared with SOTA model LSTM-R [35], our DEVICE depth information boosts all metrics. The significant im-
improves the performance on METEOR, SPICE, CIDEr- provements between Line 5 and 6 demonstrate the effec-
D by 0.8, 1.1, and 6.6, respectively. The performance of tiveness of visual object concepts’ semantic information,
DEVICE on BLUE-4 ranks second only to LSTM-R (0.5 which boosts BLUE-4 by 1.3 and CIDEr-D by 2.0, respec-
below). Note that LSTM-R has cleaned the training cap- tively. More remarkably, the total increase ratio of CIDEr-D
MOBILE
ATS
unsafe
area
keep out
CNMT: A yellow and white CNMT: A plane with the words unsafe CNMT: A table with many books CNMT: A man in a blue shirt that says
van with the word ats mobile on on it. including one called "pey". winnipeg jet on it.
the front. Ours: A sign in front of a plane that Ours: A table full of comic books Ours: A mascot for the pylon is holding
Ours: A yellow and green van says unsafe area keep out. including one called final. a woman wearing a winnipeg jet shirt.
with the word ats on the front. GT: The body of a large airplane is GT: Several comic books, including GT: Men wearing Mario and Luigi
GT: The ats van is mobile and being worked on behind a sign that say Spider-Man and X-Men, are laid out. costumes wearing advertisements for the
going to a service for tyre fitting. Unsafe Area Keep Out. pylon festival pose with a woman.
Case 1 Case 2 Case 3 Case 4
Figure 5. Example of captions generated by strong baseline CNMT [36], DEVICE, and GT (ground truth). Red indicates inappropriate
sence text generated by CNMT. The depth maps and visual object concepts are displayed in the middle. The score of visual object concepts
measures how similar these concepts are to the image. Blue indicates scene text and Green indicates visual object concepts.
is 10.6 and BLEU-4 is 2.5. This demonstrates that all mod- spatial relationship between visual entities by introducing
ules work together very efficiently. 3D geometric relationships. For Case 3, CNMT only gener-
We also evaluate the influences of the number K of vi- ates relatively coarse-grained words such as “many books”.
sual object concepts on the TextCaps validation (cf. Tab. 4). In contrast, with the help of semantic information of visual
The experimental results indicate that some salient visual object concepts, DEVICE is capable of generating more ac-
object concepts may be missed when K is less than 5. curate words “comic books”, like human. In the last case,
Meanwhile, irrelevant visual concepts may be retrieved unlike CNMT, which ignores the visual object “mascots”,
when K is larger than 7. Therefore, we set K to 5. DEVICE generates a more comprehensive description, en-
abling rational use of scene text “pylon”. The aforemen-
4.4. Qualitative Analysis tioned cases demonstrate the effectiveness of DEVICE well.
However, in Case 4, we can see the gap between our
We show some cases which are generated by CNMT [36]
DEVICE and human, we consider one possible reason is
and DEVICE on the the TextCaps validation set in Fig. 5,
that humans have common knowledge. Although adopting
where “GT” means ground truth. We pick CNMT [36] for
CLIP [25] to extract the visual object concept “mascot” in
comparison because our geometric relationship construc-
Case 4 can also be regarded as a way of implicitly using
tion methods are the same except for depth, and both are
external knowledge, this knowledge is not fine-grained and
transformer-based structures. Case 1 and Case 2 demon-
accurate enough compared to “Mario” and “Luigi”.
strate the effectiveness of depth information, Case 3 and 4
illustrate the influences of visual object concepts. In Case 1, 5. Conclusion
CNMT which without depth information mistakenly takes
“ats” and “mobile” as tokens in the same plane and con- In this paper, we propose a DEpth and VIsual ConcEpts
nects them incorrectly. By introducing 3D geometric re- Aware Transformer (DEVICE) for TextCaps. We introduce
lationship, our DEVICE generates accurate scene text. In depth information and semantic information of salient vi-
Case 2, CNMT incompletely represents the text in the sign sual object concepts to improve the accuracy and integrity
and confuses the positional relationship between the air- of captions. We also design a depth-enhanced feature updat-
plane and the text on the sign. DEVICE enhances the cor- ing module to improve OCR appearance features, which is
relation of scene text on the warning sign and clarifies the capable of facilitating the construction of three-dimensional
geometric relations. More remarkably, our model achieves puter vision and pattern recognition, pages 10578–10587,
state-of-the-art performances on the TextCaps test set. By 2020. 2
comparing experimental results with human annotations, [12] Camille Couprie, Clément Farabet, Laurent Najman, and
we consider that extracting more explicit external knowl- Yann LeCun. Indoor semantic segmentation using depth in-
edge can further improve this task. formation. arXiv preprint arXiv:1301.3572, 2013. 3
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina
Toutanova. Bert: Pre-training of deep bidirectional
References transformers for language understanding. arXiv preprint
[1] Googleocr. https://cloud.google.com/functions/docs/tutorials arXiv:1810.04805, 2018. 6
/ocr. 3, 6 [14] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we
[2] Jon Almazán, Albert Gordo, Alicia Fornés, and Ernest Val- ready for autonomous driving? the kitti vision benchmark
veny. Word spotting and recognition with embedded at- suite. In 2012 IEEE conference on computer vision and pat-
tributes. IEEE transactions on pattern analysis and machine tern recognition, pages 3354–3361. IEEE, 2012. 3
intelligence, 36(12):2552–2566, 2014. 4 [15] Ronghang Hu, Amanpreet Singh, Trevor Darrell, and Mar-
[3] Peter Anderson, Basura Fernando, Mark Johnson, and cus Rohrbach. Iterative answer prediction with pointer-
Stephen Gould. Spice: Semantic propositional image cap- augmented multimodal transformers for textvqa. In Proceed-
tion evaluation. In European conference on computer vision, ings of the IEEE/CVF Conference on Computer Vision and
pages 382–398. Springer, 2016. 6 Pattern Recognition, pages 9992–10002, 2020. 2, 4
[16] Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei.
[4] Peter Anderson, Xiaodong He, Chris Buehler, Damien
Attention on attention for image captioning. In Proceedings
Teney, Mark Johnson, Stephen Gould, and Lei Zhang.
of the IEEE/CVF international conference on computer vi-
Bottom-up and top-down attention for image captioning and
sion, pages 4634–4643, 2019. 2, 6, 7
visual question answering. In Proceedings of the IEEE con-
[17] Zan-Xia Jin, Mike Zheng Shou, Fang Zhou, Satoshi Tsutsui,
ference on computer vision and pattern recognition, pages
Jingyan Qin, and Xu-Cheng Yin. From token to word: Ocr
6077–6086, 2018. 1, 2, 6, 7
token evolution via contrastive learning and semantic match-
[5] Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park,
ing for text-vqa. In Proceedings of the 30th ACM Interna-
Dongyoon Han, Sangdoo Yun, Seong Joon Oh, and Hwal-
tional Conference on Multimedia, pages 4564–4572, 2022.
suk Lee. What is wrong with scene text recognition model
2
comparisons? dataset and model analysis. In Proceedings of
[18] A KingaD. A methodforstochasticoptimization. Anon. Inter-
the IEEE/CVF international conference on computer vision,
nationalConferenceon Learning Representations. SanDego:
pages 4715–4723, 2019. 6
ICLR, 2015. 7
[6] Pratyay Banerjee, Tejas Gokhale, Yezhou Yang, and Chitta [19] Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and
Baral. Weakly supervised relative spatial reasoning for vi- Il Hong Suh. From big to small: Multi-scale local planar
sual question answering. In Proceedings of the IEEE/CVF guidance for monocular depth estimation. arXiv preprint
International Conference on Computer Vision, pages 1908– arXiv:1907.10326, 2019. 2, 3, 4
1918, 2021. 3
[20] Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang,
[7] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic and Wenyu Liu. Textboxes: A fast text detector with a sin-
metric for mt evaluation with improved correlation with hu- gle deep neural network. In Thirty-first AAAI conference on
man judgments. In Proceedings of the acl workshop on in- artificial intelligence, 2017. 6
trinsic and extrinsic evaluation measures for machine trans- [21] Zeming Liao, Qingbao Huang, Yu Liang, Mingyi Fu, Yi Cai,
lation and/or summarization, pages 65–72, 2005. 6 and Qing Li. Scene graph with 3d information for change
[8] Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Ap- captioning. In Proceedings of the 29th ACM International
palaraju, and R Manmatha. Latr: Layout-aware transformer Conference on Multimedia, pages 5074–5082, 2021. 3
for scene-text vqa. In Proceedings of the IEEE/CVF Con- [22] Yuhang Liu, Wei Wei, Daowan Peng, Xian-Ling Mao, Zhiy-
ference on Computer Vision and Pattern Recognition, pages ong He, and Pan Zhou. Depth-aware and semantic guided
16548–16558, 2022. 2 relational attention network for visual question answering.
[9] Piotr Bojanowski, Edouard Grave, Armand Joulin, and IEEE Transactions on Multimedia, pages 1–14, 2022. 3
Tomas Mikolov. Enriching word vectors with subword in- [23] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing
formation. Transactions of the association for computational Zhu. Bleu: a method for automatic evaluation of machine
linguistics, 5:135–146, 2017. 4, 5 translation. In Proceedings of the 40th annual meeting of the
[10] Fedor Borisyuk, Albert Gordo, and Viswanath Sivakumar. Association for Computational Linguistics, pages 311–318,
Rosetta: Large scale system for text detection and recogni- 2002. 6
tion in images. In Proceedings of the 24th ACM SIGKDD in- [24] Yue Qiu, Yutaka Satoh, Ryota Suzuki, Kenji Iwata, and Hi-
ternational conference on knowledge discovery & data min- rokatsu Kataoka. 3d-aware scene change captioning from
ing, pages 71–79, 2018. 1, 3, 6 multiview images. IEEE Robotics and Automation Letters,
[11] Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and 5(3):4743–4750, 2020. 3
Rita Cucchiara. Meshed-memory transformer for image cap- [25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
tioning. In Proceedings of the IEEE/CVF conference on com- Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- captioning with content diversity exploration. In Proceed-
ing transferable visual models from natural language super- ings of the IEEE/CVF Conference on Computer Vision and
vision. In International Conference on Machine Learning, Pattern Recognition, pages 12637–12646, 2021. 2, 6, 7
pages 8748–8763. PMLR, 2021. 2, 3, 5, 6, 8 [39] Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei
[26] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, and Jiebo
Faster r-cnn: Towards real-time object detection with region Luo. Tap: Text-aware pre-training for text-vqa and text-
proposal networks. Advances in neural information process- caption. In Proceedings of the IEEE/CVF conference on
ing systems, 28, 2015. 3, 4 computer vision and pattern recognition, pages 8751–8761,
[27] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and 2021. 2, 6, 7
Amanpreet Singh. Textcaps: a dataset for image caption- [40] Gangyan Zeng, Yuan Zhang, Yu Zhou, and Xiaomeng Yang.
ing with reading comprehension. In European conference on Beyond ocr+ vqa: involving ocr into the flow for robust and
computer vision, pages 742–758. Springer, 2020. 1, 2, 4, 5, accurate textvqa. In Proceedings of the 29th ACM Interna-
6, 7 tional Conference on Multimedia, pages 376–385, 2021. 2
[28] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, [41] Zheng-Jun Zha, Daqing Liu, Hanwang Zhang, Yongdong
Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Zhang, and Feng Wu. Context-aware visual policy network
Rohrbach. Towards vqa models that can read. In Proceedings for fine-grained image captioning. IEEE transactions on pat-
of the IEEE/CVF conference on computer vision and pattern tern analysis and machine intelligence, 2019. 1
recognition, pages 8317–8326, 2019. 2 [42] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang,
[29] Wenliang Tang, Zhenzhen Hu, Zijie Song, and Richang Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao.
Hong. Ocr-oriented master object for text image caption- Vinvl: Revisiting visual representations in vision-language
ing. In Proceedings of the 2022 International Conference on models. In Proceedings of the IEEE/CVF Conference on
Multimedia Retrieval, pages 39–43, 2022. 2, 6, 7 Computer Vision and Pattern Recognition, pages 5579–
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- 5588, 2021. 1
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia [43] Wenqiao Zhang, Haochen Shi, Jiannan Guo, Shengyu
Polosukhin. Attention is all you need. Advances in neural Zhang, Qingpeng Cai, Juncheng Li, Sihui Luo, and Yuet-
information processing systems, 30, 2017. 2, 5 ing Zhuang. Magic: Multimodal relational graph adversarial
[31] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi inference for diverse and unpaired text-based image caption-
Parikh. Cider: Consensus-based image description evalua- ing. In Proceedings of the AAAI Conference on Artificial
tion. In Proceedings of the IEEE conference on computer Intelligence, volume 36, pages 3335–3343, 2022. 2, 6, 7
vision and pattern recognition, pages 4566–4575, 2015. 6 [44] Qi Zhu, Chenyu Gao, Peng Wang, and Qi Wu. Simple is
[32] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer not easy: A simple strong baseline for textvqa and textcaps.
networks. Advances in neural information processing sys- In Proceedings of the AAAI Conference on Artificial Intelli-
tems, 28, 2015. 4, 5 gence, volume 35, pages 3608–3615, 2021. 6, 7
[33] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Du-
mitru Erhan. Show and tell: A neural image caption gen-
erator. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 3156–3164, 2015. 2
[34] Jing Wang, Jinhui Tang, and Jiebo Luo. Multimodal atten-
tion with image text spatial relationship for ocr-based image
captioning. In Proceedings of the 28th ACM International
Conference on Multimedia, pages 4337–4345, 2020. 2, 6, 7
[35] Jing Wang, Jinhui Tang, Mingkun Yang, Xiang Bai, and
Jiebo Luo. Improving ocr-based image captioning by in-
corporating geometrical relationship. In Proceedings of
the IEEE/CVF conference on computer vision and pattern
recognition, pages 1306–1315, 2021. 2, 3, 6, 7
[36] Zhaokai Wang, Renda Bao, Qi Wu, and Si Liu. Confidence-
aware non-repetitive multimodal transformers for textcaps.
In Proceedings of the AAAI Conference on Artificial Intel-
ligence, volume 35, pages 2835–2843, 2021. 1, 2, 4, 6, 7,
8
[37] Ziwei Wang, Yadan Luo, Yang Li, Zi Huang, and Hongzhi
Yin. Look deeper see richer: Depth-aware image paragraph
captioning. In Proceedings of the 26th ACM international
conference on Multimedia, pages 672–680, 2018. 3
[38] Guanghui Xu, Shuaicheng Niu, Mingkui Tan, Yucheng Luo,
Qing Du, and Qi Wu. Towards accurate text-based image