
Learning to Compose Neural Networks for Question Answering

Jacob Andreas and Marcus Rohrbach and Trevor Darrell and Dan Klein
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
{jda,rohrbach,trevor,klein}@eecs.berkeley.edu

arXiv:1601.01705v4 [cs.CL] 7 Jun 2016

Abstract

We describe a question answering model that applies to both images and structured knowledge bases. The model uses natural language strings to automatically assemble neural networks from a collection of composable modules. Parameters for these modules are learned jointly with network-assembly parameters via reinforcement learning, with only (world, question, answer) triples as supervision. Our approach, which we term a dynamic neural module network, achieves state-of-the-art results on benchmark datasets in both visual and structured domains.

[Figure 1 shows the question What cities are in Georgia? answered with Atlanta: a module inventory (Section 4.1) containing lookup, find, relate, and and modules; a network layout (Section 4.2) composing find[city], relate[in], and lookup[Georgia] under and; and a knowledge source listing Montgomery, Georgia, Atlanta.]

Figure 1: A learned syntactic analysis (a) is used to assemble a collection of neural modules (b) into a deep neural network (c), and applied to a world representation (d) to produce an answer.
1 Introduction
This paper presents a compositional, attentional model for answering questions about a variety of world representations, including images and structured knowledge bases. The model translates from questions to dynamically assembled neural networks, then applies these networks to world representations (images or knowledge bases) to produce answers. We take advantage of two largely independent lines of work: on one hand, an extensive literature on answering questions by mapping from strings to logical representations of meaning; on the other, a series of recent successes in deep neural models for image recognition and captioning. By constructing neural networks instead of logical forms, our model leverages the best aspects of both linguistic compositionality and continuous representations.

Our model has two components, trained jointly: first, a collection of neural "modules" that can be freely composed (Figure 1a); second, a network layout predictor that assembles modules into complete deep networks tailored to each question (Figure 1b). Previous work has used manually-specified modular structures for visual learning (Andreas et al., 2016). Here we:

• learn a network structure predictor jointly with module parameters themselves

• extend visual primitives from previous work to reason over structured world representations

Training data consists of (world, question, answer) triples: our approach requires no supervision of network layouts. We achieve state-of-the-art performance on two markedly different question answering tasks: one with questions about natural images, and another with more compositional questions about United States geography.¹

¹ We have released our code at http://github.com/jacobandreas/nmn2

2 Deep networks as functional programs

We begin with a high-level discussion of the kinds of composed networks we would like to learn.
Andreas et al. (2016) describe a heuristic approach for decomposing visual question answering tasks into a sequence of modular sub-problems. For example, the question What color is the bird? might be answered in two steps: first, "where is the bird?" (Figure 2a); second, "what color is that part of the image?" (Figure 2c). This first step, a generic module called find, can be expressed as a fragment of a neural network that maps from image features and a lexical item (here bird) to a distribution over pixels. This operation is commonly referred to as the attention mechanism, and is a standard tool for manipulating images (Xu et al., 2015) and text representations (Hermann et al., 2015).

[Figure 2 panels: (a) find[bird] computing an attention over an image; (b) find[state] applied to a knowledge base listing Montgomery, Georgia, Atlanta; (c) describe[color] producing "black and white"; (d) exists producing "true".]

Figure 2: Simple neural module networks, corresponding to the questions What color is the bird? and Are there any states? (a) A neural find module for computing an attention over pixels. (b) The same operation applied to a knowledge base. (c) Using an attention produced by a lower module to identify the color of the region of the image attended to. (d) Performing quantification by evaluating an attention directly.

The first contribution of this paper is an extension and generalization of this mechanism to enable fully-differentiable reasoning about more structured semantic representations. Figure 2b shows how the same module can be used to focus on the entity Georgia in a non-visual grounding domain; more generally, by representing every entity in the universe of discourse as a feature vector, we can obtain a distribution over entities that corresponds roughly to a logical set-valued denotation.

Having obtained such a distribution, existing neural approaches use it to immediately compute a weighted average of image features and project back into a labeling decision—a describe module (Figure 2c). But the logical perspective suggests a number of novel modules that might operate on attentions: e.g. combining them (by analogy to conjunction or disjunction) or inspecting them directly without a return to feature space (by analogy to quantification, Figure 2d). These modules are discussed in detail in Section 4. Unlike their formal counterparts, they are differentiable end-to-end, facilitating their integration into learned models. Building on previous work, we learn behavior for a collection of heterogeneous modules from (world, question, answer) triples.

The second contribution of this paper is a model for learning to assemble such modules compositionally. Isolated modules are of limited use—to obtain expressive power comparable to either formal approaches or monolithic deep networks, they must be composed into larger structures. Figure 2 shows simple examples of composed structures, but for realistic question-answering tasks, even larger networks are required. Thus our goal is to automatically induce variable-free, tree-structured computation descriptors. We can use a familiar functional notation from formal semantics (e.g. Liang et al., 2011) to represent these computations.²

² But note that unlike formal semantics, the behavior of the primitive functions here is itself unknown.

We write the two examples in Figure 2 as

(describe[color] find[bird])

and

(exists find[state])

respectively. These are network layouts: they specify a structure for arranging modules (and their lexical parameters) into a complete network. Andreas et al. (2016) use hand-written rules to deterministically transform dependency trees into layouts, and are restricted to producing simple structures like the above for non-synthetic data. For full generality, we will need to solve harder problems, like transforming What cities are in Georgia? (Figure 1) into

(and find[city] (relate[in] lookup[Georgia]))

In this paper, we present a model for learning to select such structures from a set of automatically generated candidates. We call this model a dynamic neural module network.
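To make the layout notation concrete, the following is a minimal Python sketch (our own illustration, not the released implementation; the entries of the modules dictionary are hypothetical callables standing in for the neural modules of Section 4.1) that represents layouts as nested tuples and evaluates them recursively:

def evaluate(layout, world, modules):
    """Apply the module at the head of `layout` to the values of its
    evaluated sub-layouts (cf. Figure 1b-c)."""
    if isinstance(layout, str):                    # leaf, e.g. "find[city]"
        name, _, arg = layout.partition("[")
        return modules[name](world, arg.rstrip("]"))
    head, *children = layout                       # e.g. ("and", ..., ...)
    inputs = [evaluate(child, world, modules) for child in children]
    name, _, arg = head.partition("[")
    return modules[name](world, arg.rstrip("]"), *inputs)

Under this representation, the layout in Figure 1 is the nested tuple ("and", "find[city]", ("relate[in]", "lookup[Georgia]")).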
3 Related work

There is an extensive literature on database question answering, in which strings are mapped to logical forms, then evaluated by a black-box execution model to produce answers. Supervision may be provided either by annotated logical forms (Wong and Mooney, 2007; Kwiatkowski et al., 2010; Andreas et al., 2013) or from (world, question, answer) triples alone (Liang et al., 2011; Pasupat and Liang, 2015). In general the set of primitive functions from which these logical forms can be assembled is fixed, but one recent line of work focuses on inducing new predicates automatically, either from perceptual features (Krishnamurthy and Kollar, 2013) or the underlying schema (Kwiatkowski et al., 2013). The model we describe in this paper has a unified framework for handling both the perceptual and schema cases, and differs from existing work primarily in learning a differentiable execution model with continuous evaluation results.

Neural models for question answering are also a subject of current interest. These include approaches that model the task directly as a multiclass classification problem (Iyyer et al., 2014), models that attempt to embed questions and answers in a shared vector space (Bordes et al., 2014), and attentional models that select words from document sources (Hermann et al., 2015). Such approaches generally require that answers can be retrieved directly based on surface linguistic features, without requiring intermediate computation. A more structured approach described by Yin et al. (2015) learns a query execution model for database tables without any natural language component. Previous efforts toward unifying formal logic and representation learning include those of Grefenstette (2013), Krishnamurthy and Mitchell (2013), Lewis and Steedman (2013), and Beltagy et al. (2013).

The visually-grounded component of this work relies on recent advances in convolutional networks for computer vision (Simonyan and Zisserman, 2014), and in particular the fact that late convolutional layers in networks trained for image recognition contain rich features useful for other vision tasks while preserving spatial information. These features have been used for both image captioning (Xu et al., 2015) and visual QA (Yang et al., 2015).

Most previous approaches to visual question answering either apply a recurrent model to deep representations of both the image and the question (Ren et al., 2015; Malinowski et al., 2015), or use the question to compute an attention over the input image, and then answer based on both the question and the image features attended to (Yang et al., 2015; Xu and Saenko, 2015). Other approaches include the simple classification model described by Zhou et al. (2015) and the dynamic parameter prediction network described by Noh et al. (2015). All of these models assume that a fixed computation can be performed on the image and question to compute the answer, rather than adapting the structure of the computation to the question.

As noted, Andreas et al. (2016) previously considered a simple generalization of these attentional approaches in which small variations in the network structure per question were permitted, with the structure chosen by (deterministic) syntactic processing of questions. Other approaches in this general family include the "universal parser" sketched by Bottou (2014), the graph transformer networks of Bottou et al. (1997), the knowledge-based neural networks of Towell and Shavlik (1994), and the recursive neural networks of Socher et al. (2013), which use a fixed tree structure to perform further linguistic analysis without any external world representation. We are unaware of previous work that simultaneously learns both parameters for and structures of instance-specific networks.

4 Model

Recall that our goal is to map from questions and world representations to answers. This process involves the following variables:

1. w, a world representation
2. x, a question
3. y, an answer
4. z, a network layout
5. θ, a collection of model parameters

Our model is built around two distributions: a layout model p(z|x; θℓ), which chooses a layout for a sentence, and an execution model p_z(y|w; θe), which applies the network specified by z to w.
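Read as code, this factorization amounts to the following interface (a schematic sketch; the function names are illustrative, not those of the released implementation):

def predict(question, world, layout_model, execution_model):
    # layout_model implements p(z | x; theta_l) over candidate layouts;
    # execution_model implements p_z(y | w; theta_e) for a chosen layout.
    layout_probs = layout_model(question)        # {layout: probability}
    z = max(layout_probs, key=layout_probs.get)  # highest-scoring layout
    return execution_model(z, world)             # distribution over answers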
For ease of presentation, we introduce these models in reverse order. We first imagine that z is always observed, and in Section 4.1 describe how to evaluate and learn modules parameterized by θe within fixed structures. In Section 4.2, we move to the real scenario, where z is unknown. We describe how to predict layouts from questions and learn θe and θℓ jointly without layout supervision.

4.1 Evaluating modules

Given a layout z, we assemble the corresponding modules into a full neural network (Figure 1c), and apply it to the knowledge representation. Intermediate results flow between modules until an answer is produced at the root. We denote the output of the network with layout z on input world w as $\llbracket z \rrbracket_w$; when explicitly referencing the substructure of z, we can alternatively write $\llbracket m(h_1, h_2) \rrbracket$ for a top-level module m with submodule outputs h₁ and h₂. We then define the execution model:

$$p_z(y \mid w) = (\llbracket z \rrbracket_w)_y \tag{1}$$

(This assumes that the root module of z produces a distribution over labels y.) The set of possible layouts z is restricted by module type constraints: some modules (like find above) operate directly on the input representation, while others (like describe above) also depend on input from specific earlier modules. The two base types considered in this paper are Attention (a distribution over pixels or entities) and Labels (a distribution over answers).
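These type constraints can be enforced mechanically. The following sketch is our own illustration (the signature table mirrors the module inventory presented below; "Attention*" marks the variadic and module):

ATTENTION, LABELS = "Attention", "Labels"

MODULE_TYPES = {
    "lookup":   ([], ATTENTION),
    "find":     ([], ATTENTION),
    "relate":   ([ATTENTION], ATTENTION),
    "and":      ("Attention*", ATTENTION),
    "describe": ([ATTENTION], LABELS),
    "exists":   ([ATTENTION], LABELS),
}

def layout_type(layout):
    """Return the output type of a layout, raising on a type mismatch."""
    head = layout[0] if isinstance(layout, tuple) else layout
    children = layout[1:] if isinstance(layout, tuple) else ()
    expected, out = MODULE_TYPES[head.split("[")[0]]
    child_types = [layout_type(child) for child in children]
    if expected == "Attention*":
        well_typed = all(t == ATTENTION for t in child_types)
    else:
        well_typed = child_types == expected
    if not well_typed:
        raise TypeError(f"{head}: got {child_types}, expected {expected}")
    return out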
Parameters are tied across multiple instances of the same module, so different instantiated networks may share some parameters but not others. Modules have both parameter arguments (shown in square brackets) and ordinary inputs (shown in parentheses). Parameter arguments, like the running bird example in Section 2, are provided by the layout, and are used to specialize module behavior for particular lexical items. Ordinary inputs are the result of computation lower in the network. In addition to parameter-specific weights, modules have global weights shared across all instances of the module (but not shared with other modules). We write A, a, B, b, ... for global weights and uᵢ, vᵢ for weights associated with the parameter argument i. ⊕ and ⊙ denote (possibly broadcasted) elementwise addition and multiplication respectively. The complete set of global weights and parameter-specific weights constitutes θe. Every module has access to the world representation, represented as a collection of vectors w₁, w₂, ... (or W expressed as a matrix). The nonlinearity σ denotes a rectified linear unit.

The modules used in this paper are shown below, with names and type constraints in the first row and a description of the module's computation following.

Lookup (→ Attention)
lookup[i] produces an attention focused entirely at the index f(i), where the relationship f between words and positions in the input map is known ahead of time (e.g. string matches on database fields).

$$\llbracket \texttt{lookup}[i] \rrbracket = e_{f(i)} \tag{2}$$

where $e_i$ is the basis vector that is 1 in the ith position and 0 elsewhere.

Find (→ Attention)
find[i] computes a distribution over indices by concatenating the parameter argument with each position of the input feature map, and passing the concatenated vector through an MLP:

$$\llbracket \texttt{find}[i] \rrbracket = \mathrm{softmax}(a \odot \sigma(B v_i \oplus C W \oplus d)) \tag{3}$$

Relate (Attention → Attention)
relate directs focus from one region of the input to another. It behaves much like the find module, but also conditions its behavior on the current region of attention h. Let $\bar{w}(h) = \sum_k h_k w_k$, where $h_k$ is the kth element of h. Then,

$$\llbracket \texttt{relate}[i](h) \rrbracket = \mathrm{softmax}(a \odot \sigma(B v_i \oplus C W \oplus D \bar{w}(h) \oplus e)) \tag{4}$$

And (Attention* → Attention)
and performs an operation analogous to set intersection for attentions. The analogy to probabilistic logic suggests multiplying probabilities:

$$\llbracket \texttt{and}(h_1, h_2, \ldots) \rrbracket = h_1 \odot h_2 \odot \cdots \tag{5}$$

Describe (Attention → Labels)
describe[i] computes a weighted average of w under the input attention. This average is then used to predict an answer representation. With $\bar{w}$ as above,

$$\llbracket \texttt{describe}[i](h) \rrbracket = \mathrm{softmax}(A\, \sigma(B \bar{w}(h) + v_i)) \tag{6}$$

Exists (Attention → Labels)
exists is the existential quantifier, and inspects the incoming attention directly to produce a label, rather than an intermediate feature vector like describe:

$$\llbracket \texttt{exists}(h) \rrbracket = \mathrm{softmax}\Big(\big(\max_k h_k\big) a + b\Big) \tag{7}$$
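Equations 2-7 translate almost directly into code. The following numpy sketch is our own illustration for the knowledge-base case (weight shapes are arbitrary; the per-position scoring reads the a ⊙ σ(·) terms as a dot product per index, and the released implementation may differ):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def relu(x):
    return np.maximum(x, 0)

# W holds the world representation, one row per entity; the remaining
# arguments are illustrative stand-ins for global (A, B, C, D, a, b, d, e)
# and parameter-specific (v_i) weights.

def lookup(f_i, n):                        # Eq. 2: one-hot attention at f(i)
    h = np.zeros(n)
    h[f_i] = 1.0
    return h

def find(W, v_i, B, C, a, d):              # Eq. 3
    hidden = relu(v_i @ B.T + W @ C.T + d) # B v_i broadcast over positions
    return softmax(hidden @ a)             # one score per entity

def relate(W, h, v_i, B, C, D, a, e):      # Eq. 4
    w_bar = h @ W                          # attended average, sum_k h_k w_k
    hidden = relu(v_i @ B.T + W @ C.T + w_bar @ D.T + e)
    return softmax(hidden @ a)

def and_(*attentions):                     # Eq. 5: elementwise product
    out = attentions[0].copy()
    for h in attentions[1:]:
        out = out * h
    return out

def describe(W, h, v_i, A, B):             # Eq. 6
    w_bar = h @ W
    return softmax(A @ relu(B @ w_bar + v_i))

def exists(h, a, b):                       # Eq. 7: quantify the attention
    return softmax(h.max() * a + b)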
[Figure 3 panels: (a) the sentence What cities are in Georgia?; (b) its dependency parse over be, city, Georgia, what, and in; (c) the module fragments relate[in], find[city], and lookup[Georgia]; (d) assembled layouts such as (and relate[in] find[city] lookup[Georgia]).]

Figure 3: Generation of layout candidates. The input sentence (a) is represented as a dependency parse (b). Fragments of this dependency parse are then associated with appropriate modules (c), and these fragments are assembled into full layouts (d).

With z observed, the model we have described so far corresponds largely to that of Andreas et al. (2016), though the module inventory is different—in particular, our new exists and relate modules do not depend on the two-dimensional spatial structure of the input. This enables generalization to non-visual world representations.

Learning in this simplified setting is straightforward. Assuming the top-level module in each layout is a describe or exists module, the fully-instantiated network corresponds to a distribution over labels conditioned on layouts. To train, we maximize

$$\sum_{(w,y,z)} \log p_z(y \mid w; \theta_e)$$

directly. This can be understood as a parameter-tying scheme, where the decisions about which parameters to tie are governed by the observed layouts z.

4.2 Assembling networks

Next we describe the layout model p(z|x; θℓ). We first use a fixed syntactic parse to generate a small set of candidate layouts, analogously to the way a semantic grammar generates candidate semantic parses in previous work (Berant and Liang, 2014).

A semantic parse differs from a syntactic parse in two primary ways. First, lexical items must be mapped onto a (possibly smaller) set of semantic primitives. Second, these semantic primitives must be combined into a structure that closely, but not exactly, parallels the structure provided by syntax. For example, state and province might need to be identified with the same field in a database schema, while all states have a capital might need to be identified with the correct (in situ) quantifier scope.

While we cannot avoid the structure selection problem, continuous representations simplify the lexical selection problem. For modules that accept a vector parameter, we associate these parameters with words rather than semantic tokens, and thus turn the combinatorial optimization problem associated with lexicon induction into a continuous one. Now, in order to learn that province and state have the same denotation, it is sufficient to learn that their associated parameters are close in some embedding space—a task amenable to gradient descent. (Note that this is easy only in an optimizability sense, and not an information-theoretic one—we must still learn to associate each independent lexical item with the correct vector.) The remaining combinatorial problem is to arrange the provided lexical items into the right computational structure. In this respect, layout prediction is more like syntactic parsing than ordinary semantic parsing, and we can rely on an off-the-shelf syntactic parser to get most of the way there. In this work, syntactic structure is provided by the Stanford dependency parser (De Marneffe and Manning, 2008).

The construction of layout candidates is depicted in Figure 3, and proceeds as follows:

1. Represent the input sentence as a dependency tree.

2. Collect all nouns, verbs, and prepositional phrases that are attached directly to a wh-word or copula.

3. Associate each of these with a layout fragment: ordinary nouns and verbs are mapped to a single find module, proper nouns to a single lookup module, and prepositional phrases to a depth-2 fragment, with a relate module for the preposition above a find module for the enclosed head noun.

4. Form subsets of this set of layout fragments. For each subset, construct a layout candidate by
joining all fragments with an and module, and inserting either an exists or a describe module at the top (each subset thus results in two parse candidates).
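Step 4 can be sketched as follows (our illustration; fragment extraction in steps 2-3 is assumed to have already been done, and the describe argument, which in practice comes from the wh-word, is fixed to [what] for brevity):

from itertools import combinations

def layout_candidates(fragments):
    """Assemble candidate layouts from the fragments produced in step 3,
    e.g. ["find[city]", ("relate[in]", "lookup[Georgia]")]."""
    candidates = []
    for size in range(1, len(fragments) + 1):
        for subset in combinations(fragments, size):
            body = subset[0] if len(subset) == 1 else ("and",) + subset
            candidates.append(("describe[what]", body))  # one candidate
            candidates.append(("exists", body))          # per root module
    return candidates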
All layouts resulting from this process feature a relatively flat tree structure with at most one conjunction and one quantifier. This is a strong simplifying assumption, but appears sufficient to cover most of the examples that appear in both of our tasks. As our approach includes both categories, relations, and simple quantification, the range of phenomena considered is generally broader than previous perceptually-grounded QA work (Krishnamurthy and Kollar, 2013; Matuszek et al., 2012).

Having generated a set of candidate parses, we need to score them. This is a ranking problem; as in the rest of our approach, we solve it using standard neural machinery. In particular, we produce an LSTM representation of the question, a feature-based representation of the query, and pass both representations through a multilayer perceptron (MLP). The query feature vector includes indicators on the number of modules of each type present, as well as their associated parameter arguments. While one can easily imagine a more sophisticated parse-scoring model, this simple approach works well for our tasks.
Formally, for a question x, let h_q(x) be an LSTM encoding of the question (i.e. the last hidden layer of an LSTM applied word-by-word to the input question). Let {z₁, z₂, ...} be the proposed layouts for x, and let f(zᵢ) be a feature vector representing the ith layout. Then the score s(zᵢ|x) for the layout zᵢ is

$$s(z_i \mid x) = a^\top \sigma(B h_q(x) + C f(z_i) + d) \tag{8}$$

i.e. the output of an MLP with inputs h_q(x) and f(zᵢ), and parameters θℓ = {a, B, C, d}. Finally, we normalize these scores to obtain a distribution:

$$p(z_i \mid x; \theta_\ell) = e^{s(z_i \mid x)} \Big/ \sum_{j=1}^{n} e^{s(z_j \mid x)} \tag{9}$$

Having defined a layout selection module p(z|x; θℓ) and a network execution model p_z(y|w; θe), we are ready to define a model for predicting answers given only (world, question) pairs. The key constraint is that we want to minimize evaluations of p_z(y|w; θe) (which involves expensive application of a deep network to a large input representation), but can tractably evaluate p(z|x; θℓ) for all z (which involves application of a shallow network to a relatively small set of candidates). This is the opposite of the situation usually encountered in semantic parsing, where calls to the query execution model are fast but the set of candidate parses is too large to score exhaustively.

In fact, the problem more closely resembles the scenario faced by agents in the reinforcement learning setting (where it is cheap to score actions, but potentially expensive to execute them and obtain rewards). We adopt a common approach from that literature, and express our model as a stochastic policy. Under this policy, we first sample a layout z from a distribution p(z|x; θℓ), and then apply z to the knowledge source and obtain a distribution over answers p(y|z, w; θe).

After z is chosen, we can train the execution model directly by maximizing log p(y|z, w; θe) with respect to θe as before (this is ordinary backpropagation). Because the hard selection of z is non-differentiable, we optimize p(z|x; θℓ) using a policy gradient method. The gradient of the reward surface J with respect to the parameters of the policy is

$$\nabla J(\theta_\ell) = \mathbb{E}[\nabla \log p(z \mid x; \theta_\ell) \cdot r] \tag{10}$$

(this is the REINFORCE rule (Williams, 1992)). Here the expectation is taken with respect to rollouts of the policy, and r is the reward. Because our goal is to select the network that makes the most accurate predictions, we take the reward to be identically the negative log-probability from the execution phase, i.e.

$$\mathbb{E}[(\nabla \log p(z \mid x; \theta_\ell)) \cdot \log p(y \mid z, w; \theta_e)] \tag{11}$$

Thus the update to the layout-scoring model at each timestep is simply the gradient of the log-probability of the chosen layout, scaled by the accuracy of that layout's predictions. At training time, we approximate the expectation with a single rollout, so at each step we update θℓ in the direction (∇ log p(z|x; θℓ)) · log p(y|z, w; θe) for a single z ∼ p(z|x; θℓ). θe and θℓ are optimized using ADADELTA (Zeiler, 2012) with ρ = 0.95, ε = 10⁻⁶, and gradient clipping at a norm of 10.
                        test-dev                  test-std
              Yes/No   Number   Other   All         All
Zhou (2015)     76.6     35.0    42.6   55.7        55.9
Noh (2015)      80.7     37.2    41.7   57.2        57.4
Yang (2015)     79.3     36.6    46.1   58.7        58.9
NMN             81.2     38.0    44.0   58.6        58.7
D-NMN           81.1     38.6    45.5   59.4        59.4

Table 1: Results on the VQA test server. NMN is the parameter-tying model from Andreas et al. (2016), and D-NMN is the model described in this paper.

[Figure 4 examples: What is in the sheep's ear? → (describe[what] (and find[sheep] find[ear])) → tag; What color is she wearing? → (describe[color] find[wear]) → white; What is the man dragging? → (describe[what] find[man]) → boat (board).]

Figure 4: Sample outputs for the visual question answering task. The second row shows the final attention provided as input to the top-level describe module. For the first two examples, the model produces reasonable parses, attends to the correct region of the images (the ear and the woman's clothing), and generates the correct answer. In the third image, the verb is discarded and a wrong answer is produced.
5 Experiments

The framework described in this paper is general, and we are interested in how well it performs on datasets of varying domain, size, and linguistic complexity. To that end, we evaluate our model on tasks at opposite extremes of both these criteria: a large visual question answering dataset, and a small collection of more structured geography questions.

5.1 Questions about images

Our first task is the recently-introduced Visual Question Answering challenge (VQA) (Antol et al., 2015). The VQA dataset consists of more than 200,000 images paired with human-annotated questions and answers, as in Figure 4.

We use the VQA 1.0 release, employing the development set for model selection and hyperparameter tuning, and reporting final results from the evaluation server on the test-standard set. For the experiments described in this section, the input feature representations wᵢ are computed by the fifth convolutional layer of a 16-layer VGGNet after pooling (Simonyan and Zisserman, 2014). Input images are scaled to 448×448 before computing their representations. We found that performance on this task was best if the candidate layouts were relatively simple: only describe, and, and find modules are used, and layouts contain at most two conjuncts.

One weakness of this basic framework is a difficulty modeling prior knowledge about answers (of the form most bears are brown). This kind of linguistic "prior" is essential for the VQA task, and is easily incorporated. We simply introduce an extra hidden layer for recombining the final module network output with the input sentence representation h_q(x) (see Equation 8), replacing Equation 1 with:

$$\log p_z(y \mid w, x) = (A h_q(x) + B \llbracket z \rrbracket_w)_y \tag{12}$$

(Now modules with output type Labels should be understood as producing an answer embedding rather than a distribution over answers.) This allows the question to influence the answer directly.
Results are shown in Table 1. The use of dynamic networks provides a small gain, most noticeably on "other" questions. We achieve state-of-the-art results on this task, outperforming a highly effective visual bag-of-words model (Zhou et al., 2015), a model with dynamic network parameter prediction (but fixed network structure) (Noh et al., 2015), a more conventional attentional model (Yang et al., 2015), and a previous approach using neural module networks with no structure prediction (Andreas et al., 2016).

Some examples are shown in Figure 4. In general, the model learns to focus on the correct region of the image, and tends to consider a broad window around the region. This facilitates answering questions like Where is the cat?, which requires knowledge of the surroundings as well as the object in question.
              Accuracy
Model     GeoQA    GeoQA+Q
LSP-F     48       –
LSP-W     51       –
NMN       51.7     35.7
D-NMN     54.3     42.9

Table 2: Results on the GeoQA dataset, and the GeoQA dataset with quantification. Our approach outperforms both a purely logical model (LSP-F) and a model with learned perceptual predicates (LSP-W) on the original dataset, and a fixed-structure NMN under both evaluation conditions.

Is Key Largo an island?
(exists (and lookup[key-largo] find[island]))
yes: correct

What national parks are in Florida?
(and find[park] (relate[in] lookup[florida]))
everglades: correct

What are some beaches in Florida?
(exists (and lookup[beach] (relate[in] lookup[florida])))
yes (daytona-beach): wrong parse

What beach city is there in Florida?
(and lookup[beach] lookup[city] (relate[in] lookup[florida]))
[none] (daytona-beach): wrong module behavior

Figure 5: Example layouts and answers selected by the model on the GeoQA dataset. For incorrect predictions, the correct answer is shown in parentheses.

5.2 Questions about geography

The next set of experiments we consider focuses on GeoQA, a geographical question-answering task first introduced by Krishnamurthy and Kollar (2013). This task was originally paired with a visual question answering task much simpler than the one just discussed, and is appealing for a number of reasons. In contrast to the VQA dataset, GeoQA is quite small, containing only 263 examples. Two baselines are available: one using a classical semantic parser backed by a database, and another which induces logical predicates using linear classifiers over both spatial and distributional features. This allows us to evaluate the quality of our model relative to other perceptually grounded logical semantics, as well as strictly logical approaches.

The GeoQA domain consists of a set of entities (e.g. states, cities, parks) which participate in various relations (e.g. north-of, capital-of). Here we take the world representation to consist of two pieces: a set of category features (used by the find module) and a different set of relational features (used by the relate module). For our experiments, we use a subset of the features originally used by Krishnamurthy et al. The original dataset includes no quantifiers, and treats the questions What cities are in Texas? and Are there any cities in Texas? identically. Because we are interested in testing the parser's ability to predict a variety of different structures, we introduce a new version of the dataset, GeoQA+Q, which distinguishes these two cases, and expects a Boolean answer to questions of the second kind.

Results are shown in Table 2. As in the original work, we report the results of leave-one-environment-out cross-validation on the set of 10 environments. Our dynamic model (D-NMN) outperforms both the logical (LSP-F) and perceptual models (LSP-W) described by Krishnamurthy and Kollar (2013), as well as a fixed-structure neural module net (NMN). This improvement is particularly notable on the dataset with quantifiers, where dynamic structure prediction produces a 20% relative improvement over the fixed baseline. A variety of predicted layouts are shown in Figure 5.

6 Conclusion

We have introduced a new model, the dynamic neural module network, for answering queries about both structured and unstructured sources of information. Given only (question, world, answer) triples as training data, the model learns to assemble neural networks on the fly from an inventory of neural modules, and simultaneously learns weights for these modules so that they can be composed into novel structures. Our approach achieves state-of-the-art results on two tasks. We believe that the success of this work derives from two factors:

Continuous representations improve the expressiveness and learnability of semantic parsers: by replacing discrete predicates with differentiable neural network fragments, we bypass the challenging combinatorial optimization problem associated with induction of a semantic lexicon. In structured world
representations, neural predicate representations allow the model to invent reusable attributes and relations not expressed in the schema. Perhaps more importantly, we can extend compositional question-answering machinery to complex, continuous world representations like images.

Semantic structure prediction improves generalization in deep networks: by replacing a fixed network topology with a dynamic one, we can tailor the computation performed to each problem instance, using deeper networks for more complex questions and representing combinatorially many queries with comparatively few parameters. In practice, this results in considerable gains in speed and sample efficiency, even with very little training data.

These observations are not limited to the question answering domain, and we expect that they can be applied similarly to tasks like instruction following, game playing, and language generation.

Acknowledgments

JA is supported by a National Science Foundation Graduate Fellowship. MR is supported by a fellowship within the FIT weltweit-Program of the German Academic Exchange Service (DAAD). This work was additionally supported by DARPA, AFRL, DoD MURI award N000141110688, NSF awards IIS-1427425 and IIS-1212798, and the Berkeley Vision and Learning Center.

References

Jacob Andreas, Andreas Vlachos, and Stephen Clark. 2013. Semantic parsing as machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Neural module networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition.

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the International Conference on Computer Vision.

Islam Beltagy, Cuong Chau, Gemma Boleda, Dan Garrette, Katrin Erk, and Raymond Mooney. 2013. Montague meets Markov: Deep semantics with probabilistic logical form. In Proceedings of the Joint Conference on Distributional and Logical Semantics, pages 11–21.

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, volume 7, page 92.

Antoine Bordes, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Léon Bottou, Yoshua Bengio, and Yann Le Cun. 1997. Global training of document processing systems using graph transformer networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 489–494. IEEE.

Léon Bottou. 2014. From machine learning to machine reasoning. Machine Learning, 94(2):133–149.

Marie-Catherine De Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In Proceedings of the International Conference on Computational Linguistics, pages 1–8.

Edward Grefenstette. 2013. Towards a formal distributional semantics: Simulating logical calculi with tensors. In Joint Conference on Lexical and Computational Semantics.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1684–1692.

Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. 2014. A neural network for factoid question answering over paragraphs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Jayant Krishnamurthy and Thomas Kollar. 2013. Jointly learning to parse and perceive: connecting natural language to the physical world. Transactions of the Association for Computational Linguistics.

Jayant Krishnamurthy and Tom Mitchell. 2013. Vector space semantic parsing: A framework for compositional vector space models. In Proceedings of the ACL Workshop on Continuous Vector Space Models and their Compositionality.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2010. Inducing probabilistic CCG grammars from logical form with higher-order unification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1223–1233, Cambridge, Massachusetts.

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Mike Lewis and Mark Steedman. 2013. Combining distributional and logical semantics. Transactions of the Association for Computational Linguistics, 1:179–192.

Percy Liang, Michael Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In Proceedings of the Human Language Technology Conference of the Association for Computational Linguistics, pages 590–599, Portland, Oregon.

Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. 2015. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the International Conference on Computer Vision.

Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A joint model of language and perception for grounded attribute learning. In International Conference on Machine Learning.

Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. 2015. Image question answering using convolutional neural network with dynamic parameter prediction. arXiv preprint arXiv:1511.05756.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Mengye Ren, Ryan Kiros, and Richard Zemel. 2015. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Geoffrey G. Towell and Jude W. Shavlik. 1994. Knowledge-based artificial neural networks. Artificial Intelligence, 70(1):119–165.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Yuk Wah Wong and Raymond J. Mooney. 2007. Learning synchronous grammars for semantic parsing with lambda calculus. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, volume 45, page 960.

Huijuan Xu and Kate Saenko. 2015. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. arXiv preprint arXiv:1511.05234.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning.

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2015. Stacked attention networks for image question answering. arXiv preprint arXiv:1511.02274.

Pengcheng Yin, Zhengdong Lu, Hang Li, and Ben Kao. 2015. Neural enquirer: Learning to query tables. arXiv preprint arXiv:1512.00965.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. 2015. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167.
