
Towards Open-Ended VQA Models Using Transformers

Alberto Mario Bellini1,2 and Natalie Parde2 and Matteo Matteucci1 and Mark J. Carman1
1Politecnico di Milano, 2University of Illinois at Chicago
albertomario.bellini@mail.polimi.it, parde@uic.edu
matteo.matteucci@polimi.it, mark.carman@polimi.it

Abstract

Visual Question Answering (VQA) is an open and active research task with many exciting practical applications. However, most prior work on this problem shares a common underlying limitation: the number of possible answers is restricted to a limited set of candidates, constraining the resulting models' power and flexibility. We address this shortcoming by introducing a novel Transformer-based architecture for VQA, designed to generate open-ended answers. This shift towards generating unconstrained text rather than selecting from discrete categories represents a substantial contribution towards developing more humanlike VQA systems. We show that our architecture's performance is competitive with traditional, classification-based VQA architectures, laying promising groundwork for future open-ended, generative VQA models.

1 Introduction

Visual Question Answering (VQA) is a multidisciplinary research problem that synthesizes contributions from computer vision and natural language processing to interpret and respond to open-domain questions about images. This task is challenging because it requires both an ability to comprehend the question and an ability to apply reasoning skills to the associated image to seek relevant information. If both tasks are performed adequately, the resulting VQA system should be capable of generating meaningful natural language answers that preserve both semantic and syntactic correctness.

Despite the difficulty of these interrelated tasks, current VQA architectures achieve strong results and can generalize reasonably well across different question-image pairs (Antol et al., 2015; Andreas et al., 2016; Lu et al., 2016; Hu et al., 2017; Tan and Bansal, 2019). Still, they possess a shared limitation: the generated answers tend to be short and concise (e.g., "yes," "no," "blue"), with limited or no supplemental context. This is partially a side effect of how the VQA problem is commonly framed, with language generation modelled as a classification task rather than the open-ended exercise more reflective of how it is approached by humans. Because of this framing, answers are forced to lie within a limited, predefined set of candidates, creating a dichotomy between the human and machine scenarios; while humans enrich their answers with supplemental content and arguments, VQA systems tend to rely on outputting single words as a performance-preserving mechanism.

We tackle the VQA task from a different perspective. Instead of distributing probabilities over a predefined set of possible candidate answers, we employ a state-of-the-art Transformer model (Radford et al., 2019) to generate image-relevant answers autoregressively. In doing so, we seek to empirically assess the potential of Transformers for an established, inherently multimodal task; advance the current understanding of generative VQA models; and examine their utility as an alternative to more classical, discriminative approaches. Our contributions are as follows:

1. We leverage the Transformer architecture in a VQA system to develop a model capable of exploiting linguistic knowledge to generate more natural and open-ended answers.

2. We empirically demonstrate the feasibility of combining a Transformer language model with common visual feature extractors (e.g., convolutional neural networks) to condition the generation of each token not only on the textual modality but on visual input as well.

3. To this end, we introduce a variation of the Transformer architecture such that feature maps output from the image encoder are taken into consideration when generating tokens.

4. We show that our novel architecture is competitive with classical VQA models, providing evidence that unconstrained, generative models are a viable approach for VQA and establishing a benchmark to stimulate further research in this area.

We describe our methods and evaluation procedures in more detail in the following sections.

2 Related Work

Prior work in VQA has primarily shared a common structure (Antol et al., 2015; Andreas et al., 2016; Lu et al., 2016; Hu et al., 2017; Tan and Bansal, 2019). First, encoders (e.g., long short-term memory (LSTM) or convolutional neural network (CNN) models) extract features from the visual and textual inputs. Then, the two modalities are joined in a common subspace to form a higher-level representation. Finally, a classifier distributes probabilities over a vocabulary of possible answers. A host of smaller variations to this common structure exist across architectures; we describe high-performing recent models in more detail below.

MCB. Multimodal Compact Bilinear Pooling (MCB) (Fukui et al., 2016) is a robust architecture that exploits an outer product to combine the two modalities, maximizing the multiplicative interaction between their two vector representations. Fukui et al. (2016) devised this mechanism under the hypothesis that existing methods for combining the two modalities were insufficient for projecting them into the right space. In MCB, the vector representations for the textual and visual inputs are computed, respectively, by a two-layer LSTM and a Residual Network (He et al., 2016). Once the multimodal representations are encoded, they are pooled together. Traditional bilinear pooling results in high-dimensional vectors; MCB instead efficiently compresses the output for every modality by first projecting image and question into a higher-dimensional space using the Count Sketch method (Charikar et al., 2002), and then convolving the two representations via the element-wise product in Fast Fourier Transform space (Gao et al., 2016). Given the attention output computed by MCB and the output of the LSTM, the final answer is generated using a classifier, achieving 62.5% accuracy on the VQA dataset of Antol et al. (2015).

BUTD. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (Anderson et al., 2018) is a recent architecture that achieves an accuracy of 70.3% on the Visual Question Answering V2 (VQAv2) dataset (Goyal et al., 2017). It combines bottom-up and top-down attention mechanisms that enable attention to be calculated both at the level of objects and of other salient image regions. The bottom-up attention mechanism accounts for determining interesting regions in the input image, whereas the top-down mechanism determines how to weight each region. The final classification model distributes probabilities over a fixed set of candidate answers.

LXMERT. Tan and Bansal's (2019) architecture is the current state of the art on the VQAv2 dataset, achieving 72.5% accuracy. Similarly to our approach, LXMERT incorporates a Transformer architecture, although it does so for a different purpose. The large-scale model consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder. To train the model to connect vision and language, the authors pre-train it on large amounts of image-and-sentence pairs. The model does not generate open-ended answers, with the authors instead exploiting Transformers as more powerful encoders. Ultimately, the cross-modal intermediate output is fed to a classifier, which still distributes probabilities over a predefined vocabulary of answers.

3 Architecture

In this work, we introduce a multimodal generative VQA model (VGGPT-2) that employs attention mechanisms at multiple levels and focuses on specific portions of the image using information contained in the question. This information is also employed to mask out irrelevant portions of the question, facilitating the generation of accurate answers. Four components work together to process the visual and textual modalities and subsequently form natural, unconstrained answers. We describe these components in the subsections below, and provide a high-level overview of the architecture in Figure 1.

Figure 1: VGGPT-2 architecture overview, highlighting the answer generation process beginning from a question/image pair.

3.1 Image Encoder

The first component, an image encoder, consists of a pre-trained VGGNet-11 model (Simonyan and Zisserman, 2014). (We do not experiment with deeper vision architectures here, since our primary focus is on enhancing language output.) We employ this model to extract visual features from the images associated with the input questions. Since, in our context, we do not need VGGNet to act as a classifier, but rather as a feature extractor, we follow common practice and drop the last three fully-connected layers, retaining only the output of the last average pooling layer. This way, given an image, the encoder produces a set of maps containing high-level features.

Specifically, letting c_i be the three input channels (red, blue, and green) of width w_i = 224 and height h_i = 224, the visual input consists of an image I ∈ R^(c_i × w_i × h_i). The output of the image encoder then consists of a set of maps M ∈ R^(c_o × w_o × h_o), where c_o = 512 is the number of output channels, each of width w_o = 7 and height h_o = 7. M then serves as one of the two inputs to our attention mechanism.
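To make this feature-extraction step concrete, the sketch below shows one way to obtain the 512×7×7 maps from a pre-trained VGG-11, assuming the torchvision implementation; it is an illustrative approximation of the encoder described above, not the authors' released code.

```python
import torch
import torchvision.models as models

# Load a pre-trained VGG-11 and discard its fully-connected classifier,
# keeping only the convolutional trunk and the final average-pooling layer.
# (Older torchvision versions use pretrained=True instead of the weights argument.)
vgg = models.vgg11(weights="IMAGENET1K_V1")
vgg.eval()

def encode_image(image_batch: torch.Tensor) -> torch.Tensor:
    """Map a batch of RGB images (B, 3, 224, 224) to feature maps (B, 512, 7, 7)."""
    with torch.no_grad():                      # the encoder stays frozen during training
        feats = vgg.features(image_batch)      # convolutional feature maps
        feats = vgg.avgpool(feats)             # last pooling layer -> (B, 512, 7, 7)
    return feats

# Example: a dummy batch of two images.
maps = encode_image(torch.randn(2, 3, 224, 224))
print(maps.shape)  # torch.Size([2, 512, 7, 7])
```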
3.2 Language Model

We use OpenAI's GPT-2 (small, 117M parameters) Transformer (Radford et al., 2019) as our base language model, refining the pre-trained architecture in order to integrate it into our system. GPT-2 is a powerful, Transformer-based language model that relies entirely on attention mechanisms to condition the generation of tokens on the given context and the tokens generated at previous time steps. It consists of two core blocks: (1) the Transformer, which is responsible for computing the multi-head attention discussed previously; and (2) the final classifier which, for each token processed by the Transformer, produces a distribution of probabilities over the vocabulary of words. (Recall that this is a language model that exploits attention, which means that the probability p(w_i | t_i, t_{i-1}, ..., t_{i-k}) of generating the word w_i at time step i is conditioned on the input tokens t_i, t_{i-1}, ..., t_{i-k} up to time step i.)

Since our goal is for the language model to interact with the image encoder, we split apart these two blocks and use them separately in our model. Given a sequence of n input tokens S ∈ R^(n×1), the output of the Transformer T ∈ R^(n×hidden) consists of a sequence of n hidden vectors of size hidden = 768. We use T as the second input to our attention mechanism.
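As a concrete illustration of this split, the snippet below separates the Transformer body from the language-modelling head using the Hugging Face transformers package; attribute names such as `transformer` and `lm_head` come from that library's GPT-2 implementation and are an assumption about tooling, not a description of the authors' exact code.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # GPT-2 small (117M)
model = GPT2LMHeadModel.from_pretrained("gpt2")

transformer = model.transformer   # block (1): stack of multi-head attention layers
lm_head = model.lm_head           # block (2): linear head over the vocabulary

tokens = tokenizer("What color is the train?", return_tensors="pt")
with torch.no_grad():
    # T: one 768-dimensional hidden vector per input token, i.e. shape (1, n, 768)
    T = transformer(**tokens).last_hidden_state
print(T.shape)
```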
3.3 Attention

A crucial component of our architecture is a novel co-attention mechanism, which is responsible for combining features from the two different subspaces into a single representation. In designing this mechanism, we use the question to drive the focus on the image and vice versa. To formally define this component, we recall that it leverages the following inputs:

• Image Encoder output: M ∈ R^(512×7×7)

• Language Model output: T ∈ R^(n×768)

Our model first flattens M into a more convenient representation M_flat ∈ R^(512×49). Then, both M_flat and T are fed to two separate linear layers (Lin_M and Lin_T) that bring the two vectors into a common subspace of dimensionality AttDim = 512:

• Lin_M maps M_flat ∈ R^(512×49) to M_att ∈ R^(AttDim×49)

• Lin_T maps T ∈ R^(n×768) to T_att ∈ R^(n×AttDim)

At this point, for every element T_att^i ∈ R^AttDim in T_att, with i ∈ N ∧ i ∈ [0, n), we repeat the following set of operations:

1. We sum T_att^i with each element M_att^k ∈ R^AttDim of M_att, with k ∈ N ∧ k ∈ [0, 49), obtaining the first multimodal attention representation A_1^i ∈ R^(AttDim×49) for the i-th token in the input sequence.

2. We rectify A_1^i using a ReLU unit and obtain an intermediate attention representation A_2^i ∈ R^(AttDim×49).

3. Using a linear layer (Lin_soft), we compress AttDim down to 1. In other words, Lin_soft brings A_2^i ∈ R^(AttDim×49) into A_3^i ∈ R^(1×49).

4. To obtain a softmap that highlights relevant image regions given a specific word at position i, we pass A_3^i through a Softmax function and obtain A_4^i ∈ R^(1×49), where A_4^(i,h) ∈ (0, 1) with h ∈ N ∧ h ∈ [0, 49). At this point every pixel in A_4^i has an associated value between 0 and 1, which indicates how important that pixel is to answer the question.

5. We perform a point-wise multiplication between every element M_flat^k ∈ R^(1×49) in the original flattened maps M_flat and A_4^i, obtaining a new vector A_5^i ∈ R^(512×49). In this step we perform the question-to-image attention, since we exploit the softmaps computed at the previous step to mask out irrelevant portions of the image.

6. We sum, for each channel in A_5^i, all the masked pixels together, reducing the dimension from 49 to 1. In other words, we go from A_5^i ∈ R^(512×49) to A_6^i ∈ R^(512×1) by summing all the elements together along the pixel axis.

7. We bring A_6^i back to the Transformer space through a final linear layer (Lin_out), which expands A_6^i ∈ R^(512×1) to A_7^i ∈ R^768.

8. We compute the image-to-question attention through a final point-wise multiplication between A_7^i and every element T^i ∈ R^768 of T, the original output of the Transformer before this layer. The final output of the attention mechanism for every embedded token T^i and the maps M coming from the image encoder is the vector A_out^i ∈ R^768.

We have described how the attention layer works considering one single embedded token T^i at a time. However, it can be trivially scaled to a sequence of n elements by including an additional dimension for each vector that accounts for the token under consideration. The output A_out of the attention mechanism is then combined back with the output of the language model to generate the answer.
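The following PyTorch module is a minimal sketch of the eight steps above, vectorized over the n tokens; the layer names mirror the notation in the text (Lin_M, Lin_T, Lin_soft, Lin_out), but the code itself is illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    """Combines image maps M (512x7x7) with Transformer states T (n x 768)."""

    def __init__(self, map_channels=512, hidden=768, att_dim=512):
        super().__init__()
        self.lin_m = nn.Linear(map_channels, att_dim)   # Lin_M: 512 -> AttDim (per pixel)
        self.lin_t = nn.Linear(hidden, att_dim)         # Lin_T: 768 -> AttDim (per token)
        self.lin_soft = nn.Linear(att_dim, 1)           # Lin_soft: AttDim -> 1
        self.lin_out = nn.Linear(map_channels, hidden)  # Lin_out: 512 -> 768

    def forward(self, maps, T):
        # maps: (512, 7, 7), T: (n, 768)
        m_flat = maps.flatten(1)                 # M_flat: (512, 49)
        m_att = self.lin_m(m_flat.t())           # (49, AttDim), one vector per pixel
        t_att = self.lin_t(T)                    # (n, AttDim), one vector per token

        # Steps 1-2: broadcast sum of token and pixel vectors, then ReLU.
        a2 = F.relu(t_att.unsqueeze(1) + m_att.unsqueeze(0))        # (n, 49, AttDim)

        # Steps 3-4: compress AttDim to 1 and normalize over the 49 pixels (softmaps).
        softmaps = F.softmax(self.lin_soft(a2).squeeze(-1), dim=-1) # (n, 49)

        # Steps 5-6: question-to-image attention; weight pixels and sum them per channel.
        a6 = softmaps @ m_flat.t()               # (n, 512)

        # Step 7: back to the Transformer space.
        a7 = self.lin_out(a6)                    # (n, 768)

        # Step 8: image-to-question attention via element-wise product with T.
        return a7 * T                            # A_out: (n, 768)
```

For readability the sketch omits the batch dimension; adding one only changes the tensor shapes.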
3.4 Answer Generator

We generate the final answer by once again exploiting the language model. This last component diverges from classical VQA systems: while most existing work has fed the output of the attention mechanism to a classifier, thereby distributing probabilities over a predefined set of answers, we condition the generation of words using the output of the attention layer.

Instead of feeding the output T of the Transformer directly to its head, we first take the attention output A_out and concatenate it with the Transformer output T into a new vector O ∈ R^(n×1536). Then, with the help of a new linear layer (Lin_Classifier), we distribute probabilities over the original vocabulary of words from the pre-trained GPT-2 model. In order not to lose the valuable information contained in the weights of the pre-trained head C, we do the following:

1. We initialize the first 768 × 50254 weights of Lin_Classifier with the ones present in the original head of GPT-2.

2. We initialize the last 768 × 50254 weights, which account for the attention layer output, to 0.

This allows the model to start training from a state in which it is already able to generate natural sequences of words: since all weights relevant to the attention mechanism are initialized to 0, the only ones that influence the output are those coming from T. This last linear layer consists of roughly 77.1M parameters (1536 × 50254), so it is imperative to perform such an initialization to train the system in a reasonable amount of time.
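A minimal sketch of this initialization is shown below, assuming the GPT-2 head is an nn.Linear over a vocabulary of size V; the exact vocabulary size and the absence of a bias term follow the Hugging Face GPT-2 head, so treat those details as assumptions rather than the authors' exact code.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
old_head = gpt2.lm_head                   # nn.Linear(768, V), no bias
V = old_head.weight.shape[0]              # vocabulary size

# New head over the concatenation O = [T ; A_out], i.e. 1536 input features.
new_head = nn.Linear(1536, V, bias=False)

with torch.no_grad():
    # First 768 columns: copy the pre-trained GPT-2 head weights (they act on T).
    new_head.weight[:, :768] = old_head.weight
    # Last 768 columns: zero-initialize (they act on A_out), so at the start of
    # training the model behaves exactly like the pre-trained language model.
    new_head.weight[:, 768:].zero_()
```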
Afterward, we feed O through Lin_Classifier and pass the resulting logits to a Softmax function that distributes probabilities over our vocabulary. To generate the output word w_i corresponding to the input token at position i, we pick the most probable word in a greedy fashion, dynamically generating each word and conditioning it on the previously-generated words. (We also experimented with beam search, finding that with beam sizes greater than one, the model generated shorter answers more characteristic of those produced by traditional classification-based VQA systems.) At evaluation time, this allows the system to generate arbitrarily long answers in the following fashion (sketched in code below):

1. Given a question of n words and an image, we first let the system generate the initial word in the answer, conditioned only on the question and the image.

2. After generating the first word, we append it to the original question, which becomes a sequence of n+1 words. The next word in the answer is then generated conditioned on the new sequence.

3. The procedure continues until the system outputs the <eos> token or a maximum length threshold is reached.
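The evaluation-time loop can be sketched as follows; `model` is assumed to be a callable that returns per-token logits given the token sequence and the image maps (a stand-in for the full VGGPT-2 forward pass), so the function names here are illustrative.

```python
import torch

def generate_answer(model, tokenizer, question_ids, image_maps, max_len=20):
    """Greedy, open-ended decoding: extend the question one token at a time."""
    eos_id = tokenizer.convert_tokens_to_ids("<eos>")
    sequence = question_ids.clone()                    # (1, n) encoded question

    for _ in range(max_len):
        with torch.no_grad():
            logits = model(sequence, image_maps)       # (1, seq_len, vocab)
        next_id = logits[0, -1].argmax().item()        # most probable next token
        sequence = torch.cat([sequence, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:                          # stop at <eos> or max_len
            break

    answer_ids = sequence[0, question_ids.shape[1]:]   # tokens generated after the question
    return tokenizer.decode(answer_ids.tolist(), skip_special_tokens=True)
```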
4 Dataset

Following other recent work on VQA, we train and evaluate our model using the VQAv2 dataset (Goyal et al., 2017), a large collection of real-world question/image pairs. Although the corpus is primarily meant for standard classifier-driven VQA models, its use facilitates direct comparison with existing models. Such a comparison is key to demonstrating the effectiveness of open-ended, generative VQA models such as that designed here. Importantly, the dataset is based on real-world images rather than synthetic ones, allowing the model to learn better generalization capabilities. It consists of more than 200K images, 1M questions and 10M manually annotated ground truth answers. There are at least 3 questions (5.4 on average) per image, and each question/image pair has 10 different annotated answers. To disincentivize answers based purely on linguistic bias, some questions have complementary image pairs. Additionally, answers were balanced across different question types so that no single value predominates.

4.1 Preprocessing

We apply a variety of preprocessing steps to the text and images in order to most productively make use of the corpus. We summarize the structure of the preprocessed dataset in Table 1, and elaborate further on our preprocessing steps below.

                          Train    Test
Split size                10^6     10^5
Sequence length           24       22
Question + Answer         yes      no
Min. question length      5        6
Max. question length      21       22
Avg. question length      8.45     9.48
Min. answer length        3        -
Max. answer length        18       -
Avg. answer length        3.89     -
Avg. <pad> count          11.65    -
# of unique images        71450    32822

Table 1: Structure of the processed dataset.

Grayscale removal. We discarded all samples with non-RGB images, retaining only colored ones.

Answer length. Since we were interested in building an architecture capable of generating open-ended answers, we prioritized longer samples and gave less importance to those with concise answers.

Answer variability. We randomly selected up to four annotations for each question, ensuring that at most four identical question/image pairs were fed to the model, with four different answers.

Data formatting. Given a question/answer pair for a given image, we concatenated the pair to form a sequence of words. This sequence was then fed to the model at training time. For the test data, we dropped all answers and kept only the question/image pairs. All textual inputs were byte-pair encoded (Shibata et al., 1999) using the pre-trained GPT-2 vocabulary.

Custom tokens. We introduce four new tokens, <bos>, <eos>, <sep>, and <pad>, to indicate the beginning and end of the sequence, the separation between question and answer, and padding, respectively. In the training set, we make use of all four tokens while encoding the input sequence. In the test set, we only use <bos> and <sep>, since no answer is present. Since we introduce these new tokens, we set the language model's weights to be trainable in order to update the embedding matrix and the attention layers.
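One way to realize this encoding with the Hugging Face GPT-2 tokenizer is sketched below; the token strings come from the paper, while the specific API calls (add_special_tokens, resize_token_embeddings) and the padding scheme are assumptions about tooling.

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Register the four custom tokens and grow the embedding matrix accordingly.
tokenizer.add_special_tokens({
    "bos_token": "<bos>",
    "eos_token": "<eos>",
    "pad_token": "<pad>",
    "additional_special_tokens": ["<sep>"],
})
model.resize_token_embeddings(len(tokenizer))

def encode_train(question, answer, max_len=24):
    """Training sequence: <bos> question <sep> answer <eos>, padded to max_len."""
    ids = tokenizer.encode(f"<bos> {question} <sep> {answer} <eos>")
    ids += [tokenizer.pad_token_id] * (max_len - len(ids))
    return ids[:max_len]

def encode_test(question):
    """Test sequence: <bos> question <sep> (the answer is generated)."""
    return tokenizer.encode(f"<bos> {question} <sep>")
```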
5 Training

To train our architecture we employ a technique called Teacher Forcing (Lamb et al., 2016): given a question with its associated answer, we combine the two to form a sequence of tokens that is fed to the model. Exploiting a CrossEntropy loss function, we train our system to predict the input sequence shifted by one time step, effectively telling the model to condition the probability of generating a new token on the previously seen context. The loss function ignores the <pad> tokens and focuses only on portions of the sequence that contain meaningful information. We block weight updates in our image encoder, using it solely as a visual feature extractor. However, we set our language model's weights to be trainable. This is necessary for the model to learn both the nature of the task and the meaning of the newly introduced tokens.

We train the system for 20 epochs using batches of 20 elements. (Training took less than 60 hours on an NVidia RTX 2070 GPU.) The learning rate is initialized to 5 × 10^-5 and Adam is employed as the optimizer. The overall architecture consists of 202M parameters, since we fine-tune GPT-2 (117M) alongside the attention mechanism (8M) and the final classifier (77M).
Finally, the attention output is concatenated back
Captioning baseline. The first baseline that we with the LSTM hidden state and the resulting vector
consider is a captioning system (Xu et al., 2015): is fed to a classifier which distributes probabilities
given an image, this model tries to generate a mean- over 3000 possible answers. We do not re-train or
ingful description about what it sees. Our goal was fine-tune this architecture, and download the au-
to understand the extent to which the textual modal- thors’ pre-trained instance on the VQAv2 dataset.
ity (i.e., the question) was useful to generate an- This architecture follows the standard VQA ap-
swers. The captioning baseline exploits a VGGNet- proach of addressing the task in a classification
11 image encoder to extract visual features that are manner, generating short and concise answers.
later combined with the outputs of an LSTM. The
joint representation then passes through an atten- 6.1 Quantitative Results
tion layer, which focuses on specific parts of the We evaluate our models using three different quan-
image and updates the hidden states of the LSTM. titative metrics: Accuracy, BLEU (Papineni et al.,
The caption is generated one token at a time and 2002), and Word Mover’s Distance (Kusner et al.,
the process goes on until an <eos> token is pro- 2015). We describe each briefly here.
duced. We train this model using Teacher Forcing
and a loss computed with a doubly stochastic atten- Accuracy. Nearly all existing VQA work reports
tion regularization that updates the gradient using accuracy as the primary metric, computed as the
CrossEntropy on the output sequence, with a percentage of times an answer exactly matches one
regularization factor computed starting from the of the ground truths. Even though our architec-
attention maps. ture does not generate answers that span across a
Answering baseline. To check if our architec- fixed set of possible choices, we report these re-
ture was making use of the information contained sults as a comparative means for traditional VQA
in the input images, we considered another base- approaches. It is critical to note that if a ground
line starting from a pre-trained instance of GPT-2 truth is, say, “sunny,” and our system outputs “it is
(Small). We fine-tuned the system to learn how to sunny,” the accuracy would be 0. Since our architec-
answer the questions exploiting only the linguistic ture’s goal is to generate open-ended answers, this
bias present in the question, without conditioning problem occurs very frequently. Nevertheless, we
the generation of the answer on the images. find that our model’s accuracy is comparable to that
of the VQA baseline (Table 2). Interestingly, when
VQA baseline. For a direct comparison with a we compare the shortest answers (i.e., binary out-
standard, publicly-available VQA system, we con- puts), VGGPT-2 achieves a higher accuracy than
sider a third baseline (Kazemi and Elqursh, 2017) the VQA baseline; it appears that when the model
that uses both question and image to generate the does output short answers, it does so with relatively
answer. This is a strong baseline specifically de- high precision. For other question types, the VQA
signed for VQA tasks, and a sound comparative baseline performs better, as expected given the pre-
4
Training took less than 60 hours on an NVidia RTX2070 ponderance of short, concise answers in both the
GPU. gold standard and VQA baseline’s output.
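The exact-match scoring we report can be approximated by the helper below, a simplified stand-in for the official VQA evaluation script shown only to make the "sunny" versus "it is sunny" mismatch concrete.

```python
def exact_match_accuracy(predictions, ground_truths):
    """predictions: list of strings; ground_truths: list of lists of strings."""
    def normalize(text):
        return text.lower().strip().rstrip(".")

    hits = sum(
        normalize(pred) in {normalize(gt) for gt in truths}
        for pred, truths in zip(predictions, ground_truths)
    )
    return hits / len(predictions)

# "it is sunny" does not exactly match the ground truth "sunny", so it scores 0.
print(exact_match_accuracy(["it is sunny"], [["sunny", "sunny", "cloudy"]]))  # 0.0
```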
Model        BLEU-1   BLEU-2   BLEU-3   BLEU-4
Captioning   0.100    0        0.100    0
Answering    0.293    0.149    0.074    0.027
VQA          0.549    0.187    0.066    0.011
VGGPT-2      0.463    0.223    0.101    0.036

Table 3: BLEU-k scores of baselines and VGGPT-2.

BLEU. A commonly used evaluation metric for generative language tasks is BLEU (Papineni et al., 2002), and thus we incorporate it as an additional evaluation metric here. We compute a corpus-level BLEU score by comparing each output with its respective 10 ground truth answers from the VQAv2 corpus. (To compute BLEU-k, we equally weight n-gram matches from 1 through k.) In Table 3 we report BLEU-k scores. From these results we highlight two interesting findings:

1. As we consider longer n-gram overlaps (BLEU-2, -3 and -4), VGGPT-2 performs better relative to the VQA baseline. Thus, our architecture is superior when it comes to generating longer answers, while the VQA baseline is strongest with single-word outputs.

2. The two VQA models outperform the captioning and (particularly importantly) answering baselines, confirming that multimodality is key to answer quality. (In general, we observe that the image captioning baseline failed to properly learn the task, as evidenced by dismal scores for both accuracy and BLEU.)
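This corpus-level, equally-weighted BLEU-k computation can be reproduced roughly as below with NLTK; the tokenization choice is an assumption, since the paper does not specify it.

```python
from nltk.translate.bleu_score import corpus_bleu

def bleu_k(hypotheses, references, k):
    """hypotheses: list of answer strings; references: list of lists of ground-truth strings."""
    weights = tuple([1.0 / k] * k)          # equal weight on 1- through k-grams
    hyp_tokens = [h.lower().split() for h in hypotheses]
    ref_tokens = [[r.lower().split() for r in refs] for refs in references]
    return corpus_bleu(ref_tokens, hyp_tokens, weights=weights)

# Example: BLEU-2 for a single question with two (of ten) ground truths shown.
print(bleu_k(["red and white"], [["red", "red and white"]], k=2))
```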
Word Mover's Distance. To check the semantic similarity between the answers generated by our models and the ground truths, we also evaluate our results using a metric that exploits pre-trained word embeddings, known as Word Mover's Distance (WMD) (Kusner et al., 2015). We employ WMD with 100-dimensional pre-trained GloVe embeddings (Pennington et al., 2014), finding evidence that VGGPT-2 indeed generates semantically high-quality answers, further supporting the trend of high performance observed with both the accuracy and BLEU metrics.

In Figure 2 we plot the WMD distribution for each model.

Figure 2: Word Mover's Distance distribution.

The plot demonstrates that the VQA architectures generate a larger number of answers whose WMD is small (a sharper peak around 0), indicating that these models are making good use of both the visual and textual modalities to generate their outputs. If we compare the VQA baseline with VGGPT-2, we see a strong correlation between the two models, since they both tend to produce answers which are semantically close to the ground truths. Overall, the WMD distribution effectively confirms the general trend seen with our other evaluation metrics, and supports the notion that an open-ended, generative architecture offers a feasible alternative to traditional means of addressing the VQA task.
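A rough sketch of the WMD computation with 100-dimensional GloVe vectors is shown below using gensim; the pre-trained vector name and the whitespace tokenization are assumptions (and, depending on the gensim version, wmdistance additionally requires the POT or pyemd package).

```python
import gensim.downloader as api

# 100-dimensional GloVe vectors trained on Wikipedia + Gigaword (assumed source).
glove = api.load("glove-wiki-gigaword-100")

def wmd(answer, ground_truth):
    """Lower is better: 0 means the two answers use identical word distributions."""
    return glove.wmdistance(answer.lower().split(), ground_truth.lower().split())

print(wmd("it is sunny", "sunny"))          # small distance despite no exact match
print(wmd("walking in water", "drinking"))
```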
6.2 Qualitative Results

In addition to the quantitative results, we perform a qualitative analysis of our architecture's performance. In Table 4, we compare answers for specific question/image pairs between VGGPT-2 and the VQA baseline. We find that our model generates, most of the time, longer and more explanatory answers, exploiting the autoregressive behavior of the underlying language model. In contrast, the VQA baseline generates shorter and more concise outputs. This becomes particularly evident when the question requires the output to be descriptive, such as in the first two questions of Table 4, where the baseline is forced to pick only one of the objects present on the plate, or one of the colors of the train. Finally, the third question demonstrates the language model's ability to generate more specific action descriptions than are possible with the classification-based approach.

Question                            VQA baseline   VGGPT-2
(a) What is in the plate?           Salad          Toast, orange and salad
(b) What color is the train?        Red            Red and white
(c) What are the animals doing?     Drinking       Walking in water

Table 4: Qualitative results on real-world images.

In Figure 3, we show the answers produced by VGGPT-2 for three other images (a baby brushing its teeth, a man holding scissors, and a clock tower). The attention mask inferred by the model for each question and answer token is shown, demonstrating the ability of the model to identify salient aspects of the images in order to answer the question correctly.

Figure 3: Examples of questions and automatically generated responses for three different images. Attention maps for each question/answer token are shown. Special tokens <bos>, <sep> and <eos> delimit the start and end of the question and answer.

7 Conclusion

In this work, we proposed a novel Transformer-based architecture for visual question answering that combines textual and visual modalities to generate unconstrained, open-ended answers. We empirically verified the feasibility of this architecture by framing the problem in a way that diverges from the standard classification perspective, building a system that does not distribute probabilities over a fixed set of possible answers but instead generates them dynamically, one token at a time. In future work we plan to experiment with deeper image encoders, such as Residual Networks (He et al., 2016), and with a multi-head variation of our attention mechanism. We will also examine deeper Transformer architectures in this context (for example, GPT-2 (Large) or perhaps even a variant of GPT-3 (Brown et al., 2020), a 175-billion-parameter model unveiled a few days before the deadline for this submission). It is our hope that the work reported here stimulates additional interest in open-ended models in the VQA domain, and to that end we will also release our model and source code online to facilitate replication (link in camera-ready paper).
References

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 6077–6086. IEEE.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 39–48.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In The IEEE International Conference on Computer Vision (ICCV).

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, pages 693–703. Springer.

Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 457–468, Austin, Texas. Association for Computational Linguistics.

Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. 2016. Compact bilinear pooling. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 317–326.

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR).

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.

Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. 2017. Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pages 804–813.

Vahid Kazemi and Ali Elqursh. 2017. Show, ask, attend, and answer: A strong baseline for visual question answering.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966.

Alex M. Lamb, Anirudh Goyal Alias Parth Goyal, Ying Zhang, Saizheng Zhang, Aaron C. Courville, and Yoshua Bengio. 2016. Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, pages 4601–4609.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pages 289–297.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Yusuke Shibata, Takuya Kida, Shuichi Fukamachi, Masayuki Takeda, Ayumi Shinohara, and Takeshi Shinohara. 1999. Byte pair encoding: A text compression scheme that accelerates pattern matching. Technical Report DOI-TR-161, Department of Informatics, Kyushu University.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5099–5110, Hong Kong, China. Association for Computational Linguistics.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.
