Open Ended VQA Models Using Transformers
Alberto Mario Bellini (1,2), Natalie Parde (2), Matteo Matteucci (1), and Mark J. Carman (1)
(1) Politecnico di Milano
(2) University of Illinois at Chicago
albertomario.bellini@mail.polimi.it, parde@uic.edu, matteo.matteucci@polimi.it, mark.carman@polimi.it
input questions. Since, in our context, we do not need VGGNet to act as a classifier but rather as a feature extractor, we follow common practice and drop the last three fully-connected layers, retaining only the output of the last average pooling layer. This way, given an image, the encoder produces a set of maps containing high-level features.

Specifically, letting $c_i = 3$ be the number of color channels (red, green, and blue), each of width $w_i = 224$ and height $h_i = 224$, the visual input consists of an image $I \in \mathbb{R}^{c_i \times w_i \times h_i}$. The output of the image encoder then consists of a set of maps $M \in \mathbb{R}^{c_o \times w_o \times h_o}$, with $c_o = 512$ channels of width $w_o = 7$ and height $h_o = 7$. $M$ then serves as one of the two inputs to our attention mechanism.

3.2 Language Model

We use OpenAI's GPT-2 (small, 117M) Transformer (Radford et al., 2019) as our base language model, refining its pre-trained architecture in order to integrate it into our system. GPT-2 is a powerful, Transformer-based language model that relies entirely on attention mechanisms to condition the generation of tokens on the given context and the tokens generated at previous time steps. It consists of two core blocks: (1) the Transformer, which is responsible for computing the multi-head attention discussed previously; and (2) the final classifier which, for each token processed by the Transformer, produces a probability distribution over the vocabulary of words.2

Since our goal is for the language model to interact with the image encoder, we split apart these two blocks and use them separately in our model. Given a sequence of $n$ input tokens $S \in \mathbb{R}^{n \times 1}$, the output of the Transformer $T \in \mathbb{R}^{n \times \mathrm{hidden}}$ consists of a sequence of $n$ hidden vectors of size $\mathrm{hidden} = 768$. We use $T$ as the second input to our attention mechanism.

2 Recall that this is a language model that exploits attention, which means that the probability $p(w_i \mid t_i, t_{i-1}, \ldots, t_{i-k})$ of generating the word $w_i$ at time step $i$ is conditioned on the input tokens $t_i, t_{i-1}, \ldots, t_{i-k}$ up to time step $i$.

3.3 Attention

A crucial component of our architecture is a novel co-attention mechanism, which is responsible for combining features from the two different subspaces into a single representation. In designing this mechanism, we use the question to drive the focus on the image and vice versa. To formally define this component, we recall that it leverages the following inputs:

• Image Encoder output: $M \in \mathbb{R}^{512 \times 7 \times 7}$
• Language Model output: $T \in \mathbb{R}^{n \times 768}$

Our model first flattens $M$ into a more convenient representation $M_{\mathrm{flat}} \in \mathbb{R}^{512 \times 49}$. Then, both $M_{\mathrm{flat}}$ and $T$ are fed to two separate linear layers ($Lin_M$ and $Lin_T$) that bring the two representations into a common subspace of dimensionality $\mathrm{AttDim} = 512$:

• $Lin_M$: Maps $M_{\mathrm{flat}} \in \mathbb{R}^{512 \times 49}$ to $M_{\mathrm{att}} \in \mathbb{R}^{\mathrm{AttDim} \times 49}$
• $Lin_T$: Maps $T \in \mathbb{R}^{n \times 768}$ to $T_{\mathrm{att}} \in \mathbb{R}^{n \times \mathrm{AttDim}}$

At this point, for every element $T^i_{\mathrm{att}} \in \mathbb{R}^{\mathrm{AttDim}}$ in $T_{\mathrm{att}}$, with $i \in \mathbb{N} \wedge i \in [0, n)$, we repeat the following set of operations:

1. We sum $T^i_{\mathrm{att}}$ with each element $M^k_{\mathrm{att}} \in \mathbb{R}^{\mathrm{AttDim}}$ of $M_{\mathrm{att}}$, with $k \in \mathbb{N} \wedge k \in [0, 49)$, obtaining the first multimodal attention representation $A^i_1 \in \mathbb{R}^{\mathrm{AttDim} \times 49}$ for the $i$th token in the input sequence.
2. We rectify $A^i_1$ using a ReLU unit and obtain an intermediate attention representation $A^i_2 \in \mathbb{R}^{\mathrm{AttDim} \times 49}$.

3. Using a linear layer ($Lin_{\mathrm{soft}}$) we compress AttDim down to 1. In other words, $Lin_{\mathrm{soft}}$ brings $A^i_2 \in \mathbb{R}^{\mathrm{AttDim} \times 49}$ into $A^i_3 \in \mathbb{R}^{1 \times 49}$.

4. To obtain a softmap that highlights relevant image regions given a specific word at position $i$, we pass $A^i_3$ through a Softmax function and obtain $A^i_4 \in \mathbb{R}^{1 \times 49}$, where $A^{i,h}_4 \in (0, 1)$ with $h \in \mathbb{N} \wedge h \in [0, 49)$. At this point every pixel in $A^i_4$ has an associated value between 0 and 1, which indicates how important that pixel is to answer the question.

5. We perform a point-wise multiplication between every element $M^k_{\mathrm{flat}} \in \mathbb{R}^{1 \times 49}$ in the original flattened maps $M_{\mathrm{flat}}$ and $A^i_4$, obtaining a new vector $A^i_5 \in \mathbb{R}^{512 \times 49}$. In this step we perform the question-to-image attention, since we exploit the softmaps computed at the previous step to mask out irrelevant portions of the image.

6. We sum, for each channel in $A^i_5$, all the masked pixels together, reducing the dimension from 49 to 1. In other words, we go from $A^i_5 \in \mathbb{R}^{512 \times 49}$ to $A^i_6 \in \mathbb{R}^{512 \times 1}$ by summing all the elements together on the pixel axis.

7. We bring $A^i_6$ back to the Transformer space through a final linear layer ($Lin_{\mathrm{out}}$), which expands $A^i_6 \in \mathbb{R}^{512 \times 1}$ to $A^i_7 \in \mathbb{R}^{768}$.

8. We compute the image-to-question attention through a final point-wise multiplication between $A^i_7$ and every element $T_i \in \mathbb{R}^{768}$ of $T$, which is the original output of the Transformer before this layer. The final output of the attention mechanism for every embedded token $T_i$ and the maps $M$ coming from the image encoder is the vector $A^i_{\mathrm{out}} \in \mathbb{R}^{768}$.

We described how the attention layer works considering one single embedded token $T_i$ at a time. However, it can be trivially scaled to a sequence of $n$ elements by including an additional dimension for each vector that accounts for the token under consideration. The output $A_{\mathrm{out}}$ of the attention mechanism is then combined back with the output of the language model to generate the answer.

3.4 Answer Generator

We generate the final answer by once again exploiting the language model. This last component diverges from classical VQA systems: while most existing work has fed the output of the attention mechanism to a classifier, thereby distributing probabilities over a predefined set of answers, we condition the generation of words using the output of the attention layer.

Instead of feeding the output $T$ of the Transformer directly to its head, we first take the attention output $A_{\mathrm{out}}$ and concatenate it with the Transformer output $T$ into a new vector $O \in \mathbb{R}^{n \times 1536}$. Then, with the help of a new linear layer ($Lin_{\mathrm{Classifier}}$), we distribute probabilities over the original vocabulary of words from the pre-trained GPT-2 model. In order not to lose the valuable information contained in the weights of the pre-trained classifier $C$ (or head), we do the following:

1. We initialize the first 768 × 50254 weights of $Lin_{\mathrm{Classifier}}$ with the ones present in the original head of GPT-2.

2. We initialize the last 768 × 50254 weights, accounting for the attention layer output, to 0.

This allows the model to start training from a state in which it is already able to generate natural sequences of words, since all weights relevant to the attention mechanism are initialized to 0, and the only ones that influence the output are coming from $T$. This last linear layer consists of roughly 77.2M parameters (1536 × 50254 = 77,190,144), so it is imperative to perform such an initialization to train the system in a reasonable amount of time.

Afterward, we feed the logits that $Lin_{\mathrm{Classifier}}$ computes from $O$ to a Softmax function that distributes probabilities over our vocabulary. To generate the output word $w_i$ corresponding to the input token at position $i$, we pick the most probable word in a greedy fashion,3 dynamically generating each word and conditioning it on the previously generated words.

3 We also experimented with beam search, finding that with beam sizes greater than one, the model generated shorter answers more characteristic of those produced by traditional classification-based VQA systems.

At evaluation time, this allows the system to generate arbitrarily long answers in the following fashion:

1. Given a question of $n$ words and an image, we first let the system generate the initial word in the answer, conditioned only on the question and the image.
2. After generating the first word, we append it to the original question, which becomes a sequence of $n+1$ words. The next word in the answer is generated conditioned on the new sequence.

3. The procedure continues until the system outputs the <eos> token or a maximum length threshold is reached.

4 Dataset

Following other recent work on VQA, we train and evaluate our model using the VQAv2 dataset (Goyal et al., 2017), a large collection of real-world question-image pairs. Although the corpus is primarily meant for standard classifier-driven VQA models, its use facilitates direct comparison with existing models. Such a comparison is key to demonstrating the effectiveness of open-ended, generative VQA models such as that designed here. Importantly, the dataset is based on real-world images rather than synthetic ones, allowing the model to learn better generalization capabilities. It consists of more than 200K images, 1M questions and 10M ground truth answers, manually annotated. There are at least 3 questions (5.4 on average) per image, and each question/image pair has 10 different annotated answers. To disincentivize answers based purely on linguistic bias, some questions have complementary image pairs. Additionally, answers were balanced for different question types in order not to have a predominant value.

4.1 Preprocessing

We apply a variety of preprocessing steps to the text and images in order to most productively make use of the corpus. We summarize the structure of the preprocessed dataset in Table 1, and elaborate further on our preprocessing steps below.

Grayscale removal. We discarded all samples with non-RGB images, retaining only colored ones.

Answer length. Since we were interested in building an architecture capable of generating open-ended answers, we prioritized longer samples and gave less importance to those with concise answers.

Answer variability. We randomly selected up to four annotations for each question, ensuring that at most four identical question/image pairs were fed to the model, with four different answers.

                        Train     Test
Split size              $10^6$    $10^5$
Sequence length         24        22
Question + Answer       yes       no
Min. question length    5         6
Max. question length    21        22
Avg. question length    8.45      9.48
Min. answer length      3         -
Max. answer length      18        -
Avg. answer length      3.89      -
Avg. <pad> count        11.65     -
# of unique images      71450     32822

Table 1: Structure of the processed dataset

Data formatting. Given a question/answer pair for a given image, we concatenated the pair to form a sequence of words. This sequence was then fed to the model at training time. For the test data, we dropped all answers and kept only the question/image pairs. All textual inputs were byte-pair encoded (Shibata et al., 1999) using the pre-trained GPT-2 vocabulary.

Custom tokens. We introduce four new tokens: <bos>, <eos>, <sep>, and <pad> to indicate the beginning and end of the sequence, the separation between question and answer, and finally the padding token. In the training set, we make use of all four tokens while encoding the input sequence. In the test set, we only use <bos> and <sep> since no answer is present. Since we introduce these new tokens, we set the language model's weights to be trainable in order to update the embedding matrix and the attention layers.

5 Training

To train our architecture we employ a technique called Teacher Forcing (Lamb et al., 2016): given a question with its associated answer, we combine the two to form a sequence of tokens that are fed to the model. Exploiting a CrossEntropy loss function, we train our system to predict the input sequence shifted by 1 time step, effectively telling the model to condition the probability of generating a new token on the previously seen context. The loss function ignores the <pad> tokens and focuses only on portions of the sequence that contain meaningful information. We block weight updates in our image encoder, using it solely as a visual feature extractor. However, we set our language
model's weights to be trainable. This is necessary for the model to learn both the nature of the task and the meaning of the newly introduced tokens.

We train the system for 20 epochs using batches of 20 elements.4 The learning rate is initialized to $5 \times 10^{-5}$ and Adam is employed as the optimizer. The overall architecture consists of 202M parameters, since we fine-tune GPT-2 (117M) alongside the attention mechanism (8M) and the final classifier (77M).

4 Training took less than 60 hours on an NVidia RTX2070 GPU.

6 Evaluation

To assess the performance of our architecture, we compare it against three different baselines:

Captioning baseline. The first baseline that we consider is a captioning system (Xu et al., 2015): given an image, this model tries to generate a meaningful description about what it sees. Our goal was to understand the extent to which the textual modality (i.e., the question) was useful to generate answers. The captioning baseline exploits a VGGNet-11 image encoder to extract visual features that are later combined with the outputs of an LSTM. The joint representation then passes through an attention layer, which focuses on specific parts of the image and updates the hidden states of the LSTM. The caption is generated one token at a time and the process goes on until an <eos> token is produced. We train this model using Teacher Forcing and a loss computed with a doubly stochastic attention regularization that updates the gradient using CrossEntropy on the output sequence, with a regularization factor computed starting from the attention maps.

Answering baseline. To check if our architecture was making use of the information contained in the input images, we considered another baseline starting from a pre-trained instance of GPT-2 (Small). We fine-tuned the system to learn how to answer the questions exploiting only the linguistic bias present in the question, without conditioning the generation of the answer on the images.

VQA baseline. For a direct comparison with a standard, publicly-available VQA system, we consider a third baseline (Kazemi and Elqursh, 2017) that uses both question and image to generate the answer. This is a strong baseline specifically designed for VQA tasks, and a sound comparative architecture. The system consists of an image encoder (ResNet-152 (He et al., 2016)), alongside an LSTM which extracts textual features. A compact, but powerful, attention layer brings the two modalities into a common subspace where they are combined to generate attention maps over the image. Finally, the attention output is concatenated back with the LSTM hidden state and the resulting vector is fed to a classifier which distributes probabilities over 3000 possible answers. We do not re-train or fine-tune this architecture, and download the authors' pre-trained instance on the VQAv2 dataset. This architecture follows the standard VQA approach of addressing the task in a classification manner, generating short and concise answers.

Model         Overall   Yes/No   Number   Other
Captioning    0.00      0.00     0.00     0.00
Answering     0.11      0.20     0.20     0.14
VQA           0.46      0.71     0.27     0.41
VGGPT-2       0.34      0.73     0.23     0.24

Table 2: Accuracy scores of baselines and VGGPT-2.

6.1 Quantitative Results

We evaluate our models using three different quantitative metrics: Accuracy, BLEU (Papineni et al., 2002), and Word Mover's Distance (Kusner et al., 2015). We describe each briefly here.

Accuracy. Nearly all existing VQA work reports accuracy as the primary metric, computed as the percentage of times an answer exactly matches one of the ground truths. Even though our architecture does not generate answers that span across a fixed set of possible choices, we report these results as a means of comparison with traditional VQA approaches. It is critical to note that if a ground truth is, say, "sunny," and our system outputs "it is sunny," the accuracy would be 0. Since our architecture's goal is to generate open-ended answers, this problem occurs very frequently. Nevertheless, we find that our model's accuracy is comparable to that of the VQA baseline (0.34 versus 0.46 overall; Table 2). Interestingly, when we compare the shortest answers (i.e., binary outputs), VGGPT-2 achieves a higher accuracy than the VQA baseline (0.73 versus 0.71 on Yes/No questions); it appears that when the model does output short answers, it does so with relatively high precision. For other question types, the VQA baseline performs better, as expected given the preponderance of short, concise answers in both the gold standard and VQA baseline's output.
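The exact-match criterion described above can be illustrated with a minimal sketch. The function name `exact_match_accuracy` and the normalization step (lowercasing and collapsing whitespace) are assumptions made for illustration; the evaluation code actually used to produce the reported accuracy scores may normalize answers differently.

```python
def exact_match_accuracy(predictions, ground_truths):
    """Fraction of predictions that exactly match at least one ground truth.

    predictions:   list of generated answer strings
    ground_truths: list of lists; ground_truths[j] holds the annotated
                   answers for the j-th question/image pair
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must be aligned")

    def normalize(text):
        # Assumed normalization: lowercase and collapse runs of whitespace.
        return " ".join(text.lower().split())

    hits = 0
    for prediction, references in zip(predictions, ground_truths):
        if normalize(prediction) in {normalize(ref) for ref in references}:
            hits += 1
    return hits / len(predictions) if predictions else 0.0
```

With the example from the text, a prediction of "it is sunny" against the single ground truth "sunny" counts as a miss, even though the answer is semantically correct. This is why exact-match accuracy tends to understate the quality of open-ended answers.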
Model         BLEU-1   BLEU-2   BLEU-3   BLEU-4
Captioning    0.100    0        0.100    0
Answering     0.293    0.149    0.074    0.027
VQA           0.549    0.187    0.066    0.011
VGGPT-2       0.463    0.223    0.101    0.036
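BLEU (Papineni et al., 2002) builds on modified n-gram precision, in which each candidate n-gram is credited at most as many times as it occurs in any single reference. The sketch below computes only that component for one tokenized answer. It omits the brevity penalty and the combination across n-gram orders, and the helper names are hypothetical rather than taken from the evaluation code behind the table above.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against its references.

    candidate:  list of tokens in the generated answer
    references: list of token lists (the ground-truth answers)
    """
    candidate_counts = Counter(ngrams(candidate, n))
    if not candidate_counts:
        # The candidate is shorter than n tokens, so it has no n-grams.
        return 0.0

    # For each n-gram, the largest count observed in any single reference.
    max_reference_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_reference_counts[gram] = max(max_reference_counts[gram], count)

    # Clip every candidate count by the corresponding reference maximum.
    clipped = sum(min(count, max_reference_counts[gram])
                  for gram, count in candidate_counts.items())
    return clipped / sum(candidate_counts.values())
```

Unlike exact-match accuracy, this measure gives partial credit: the answer "it is sunny" scored against the reference "sunny" obtains a unigram precision of 1/3 rather than 0, while its bigram precision is 0 because the reference contains no bigrams.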
7 Conclusion

In this work, we proposed a novel Transformer-based architecture for visual question answering that combines textual and visual modalities to generate unconstrained, open-ended answers. We empirically verify the feasibility of this architecture by framing the problem in a way that diverges from the standard classification perspective, by building a system that does not distribute probabilities over a fixed set of possible answers but generates them dynamically one token at a time. In future work we plan to experiment with deeper image encoders, such as Residual Networks (He et al., 2016), and will experiment with a multi-head variation of our attention mechanism. We will also examine deeper Transformer architectures in this context.7 It is our hope that the work reported here stimulates additional interest in open-ended models in the VQA domain, and to that end we will also release our model and source code online to facilitate ease of replication.8

Figure 3: Examples of questions and automatically generated responses for three different images. Attention maps for each question/answer token are shown. Special tokens <bos>, <sep> and <eos> delimit the start and end of the question and answer.

7 For example, GPT-2 (Large) or perhaps even a variant of GPT-3 (Brown et al., 2020), a 175 billion parameter model unveiled a few days before the deadline for this submission.

8 Link in camera-ready paper.
References

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 6077–6086. IEEE.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 39–48.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In The IEEE International Conference on Computer Vision (ICCV).

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, pages 693–703. Springer.

Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 457–468, Austin, Texas. Association for Computational Linguistics.

Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. 2016. Compact bilinear pooling. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 317–326.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR).

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 804–813.

Vahid Kazemi and Ali Elqursh. 2017. Show, ask, attend, and answer: A strong baseline for visual question answering.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966.

Alex M Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. In Advances In Neural Information Processing Systems, pages 4601–4609.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Yusuke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, and Takeshi Shinohara. 1999. Byte pair encoding: A text compression scheme that accelerates pattern matching. Technical Report DOI-TR-161, Department of Informatics, Kyushu University.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5099–5110, Hong Kong, China. Association for Computational Linguistics.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.