X-Linear Attention Networks for Image Captioning

Yingwei Pan, Ting Yao, Yehao Li, and Tao Mei
Abstract

Recent progress on fine-grained visual recognition and visual question answering has featured Bilinear Pooling, which effectively models the 2nd order interactions across multi-modal inputs. Nevertheless, there has not been evidence in support of building such interactions concurrently with attention mechanism for image captioning. In this paper, we introduce a unified attention block, the X-Linear attention block, that fully employs bilinear pooling to selectively capitalize on visual information or perform multi-modal reasoning. Technically, the X-Linear attention block simultaneously exploits both the spatial and channel-wise bilinear attention distributions to capture the 2nd order interactions between the input single-modal or multi-modal features. Higher and even infinity order feature interactions are readily modeled through stacking multiple X-Linear attention blocks and equipping the block with Exponential Linear Unit (ELU) in a parameter-free fashion, respectively. Furthermore, we present X-Linear Attention Networks (dubbed X-LAN) that novelly integrate X-Linear attention block(s) into the image encoder and sentence decoder of the image captioning model to leverage higher order intra- and inter-modal interactions. The experiments on the COCO benchmark demonstrate that our X-LAN obtains the best published CIDEr performance to date of 132.0% on the COCO Karpathy test split. When further endowing Transformer with X-Linear attention blocks, CIDEr is boosted up to 132.8%. Source code is available at https://github.com/Panda-Peter/image-captioning.

[Figure 1. Comparison between the conventional attention mechanism and our X-Linear attention block for image captioning. (a) The conventional attention mechanism linearly fuses query (Q) and key (K) via element-wise sum to compute a spatial attention weight for each value (V), which characterizes the 1st order interaction between query and key. (b) The X-Linear attention block fully capitalizes on bilinear pooling to capture the 2nd order feature interaction in between, and measures both spatial and channel-wise attention distributions. The two attention weights are adopted to accumulate the enhanced values of bilinear pooling on query and value.]

1. Introduction

Image captioning is the task of automatically producing a natural-language sentence to describe the visual content of an image. The essential practice of neural captioning models follows the encoder-decoder paradigm [24, 33], which is derived from neural machine translation [30]. In between, a Convolutional Neural Network (CNN) is utilized to encode an input image and a Recurrent Neural Network (RNN) is adopted as the sentence decoder to generate the output sentence, one word at each time step. Despite involving two different major modalities (visual content and textual sentence) in image captioning, such a paradigm seldom explores the multi-modal interactions, particularly at the early stage. In other words, vision and language are treated independently. That prompts the recent state-of-the-art methods [2, 35] to adopt visual attention mechanisms which trigger the interaction between visual content and natural sentence. Concretely, these visual attention mechanisms boost performance by learning to identify selective spatial regions in an image conditioned on the current hidden state of the language decoder, and in turn accumulating encoded region features with attention weights to guide the decoding process. Figure 1(a) illustrates the most conventional attention measure, which estimates attention weights by linearly fusing the given query (hidden state of the sentence decoder) and keys (encoded image features) from different modalities. The attention is then applied to the values (encoded image features) to derive a weighted sum. Nevertheless, we argue that this design of conventional attention inherently exploits only the 1st order feature interaction and is still lacking in efficacy, which severely limits the capacity of complex multi-modal reasoning in image captioning.
A natural way to mitigate the problem is to capture higher order interactions. We start our exploration from the 2nd order interaction through the unique design of a unified attention block, namely the X-Linear attention block, as shown in Figure 1(b). Technically, the outer product of key and query is computed through bilinear pooling to take all pairwise interactions between query and key into account. After bilinear pooling, two embedding layers are exploited to predict an attention weight for each spatial region, followed by a softmax layer to normalize the spatial attention vector. In the meantime, the embedded outer product (feature map) is passed through a squeeze-excitation operation. The squeeze operation aggregates the feature map across spatial regions to produce a channel descriptor, and the excitation operation performs a self-gating mechanism with a sigmoid on the channel descriptor to obtain the channel-wise attention vector. Finally, the outer product of query and value via bilinear pooling is weighted-summed with the spatial attention vector, and we take the channel-wise multiplication of the sum and the channel-wise attention vector as the attended features. As such, our X-Linear attention block builds the 2nd order interactions and infers the joint representations for image features and hidden states. It is also appealing in view that a stack of the blocks is readily grouped to go beyond bilinear models and extract higher order interactions. In the extreme case, our model could create infinity order interactions by stacking numerous X-Linear attention blocks, and we implement this via the kernel trick, e.g., Exponential Linear Unit (ELU), in practice.
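To make the ELU kernel-trick argument concrete, the following minimal PyTorch check (our illustration, not the authors' code) verifies that ELU(x) + 1 coincides with exp(x) for non-positive inputs; since the exponential expands into terms of every order, gating bilinear features through ELU can be read as a parameter-free route to unbounded interaction order.

```python
# Minimal numeric check of the kernel-trick reading of ELU (our illustration).
# For x <= 0 and alpha = 1, ELU(x) = exp(x) - 1, so ELU(x) + 1 equals exp(x),
# whose Taylor series 1 + x + x^2/2! + ... contains interactions of every order.
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 0.0, steps=7)
print(torch.allclose(F.elu(x) + 1.0, torch.exp(x)))  # True
```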
By integrating X-Linear attention block(s) into image captioning structures, we present new X-Linear Attention Networks (X-LAN) to leverage high order intra- and inter-modal interactions in the encoder and decoder, respectively. Specifically, for the image encoder, Faster R-CNN is firstly utilized to detect a set of image regions. After that, a stack of X-Linear attention blocks is adopted to encode the region-level features with the higher order intra-modal interaction in between, leading to a set of enhanced region-level and image-level features. Conditioned on the enhanced visual features induced by the image encoder, we further employ an X-Linear attention block in the sentence decoder to perform multi-modal reasoning. This encourages the exploration of high order inter-modal interactions between visual content and natural sentence to boost sentence generation.

The main contribution of this work is the proposal of a unified X-Linear attention block that models the 2nd order interactions with both spatial and channel-wise bilinear attention. This also leads to the elegant view of how the block should be extended for mining higher or even infinity order interactions, and how to integrate such block(s) into the image captioning structure. Through an extensive set of experiments, we demonstrate that our new X-LAN model achieves new state-of-the-art performances on the COCO dataset.

2. Related Work

Image Captioning. Image captioning is an active research area [2, 12, 19, 23, 24, 28, 33, 34, 35, 37, 39, 40, 41]. The early attempts [24, 33] exploit the encoder-decoder paradigm that firstly utilizes a CNN to encode the image and then adopts an RNN based decoder to generate the output word sequence, leading to promising results for this task. After that, a series of innovations have been proposed to boost image captioning by encouraging more interactions between the two different modalities via attention mechanisms [5]. In particular, [35] integrates soft and hard attention mechanisms into an LSTM based decoder, aiming to select the most relevant image regions for word prediction at each decoding stage. [41] presents semantic attention that learns to selectively focus on the semantic attributes in the image for sentence generation. Instead of fully performing visual attention as in [35], [23] proposes an adaptive attention model that dynamically decides whether to attend to image regions at each decoding stage. Furthermore, the bottom-up and top-down attention mechanism [2] exploits visual attention at the object level via a bottom-up mechanism, and all salient image regions are associated with the output words through a top-down mechanism for image captioning. [26] presents the look back method to integrate attention weights from the previous time step into the measurement of attention at the current time step, which suits the visual coherence of humans. Later on, the recently proposed attention on attention module [12] enhances visual attention by further measuring the relevance between the attention result and the query.

Most existing attention mechanisms in image captioning have concentrated on the exploration of only the 1st order feature interaction between image content and sentence, reflecting a limited capacity for multi-modal reasoning. In contrast, we design a novel X-Linear attention block to capture higher and even infinity order interactions, which facilitates both single-modal feature enhancement and multi-modal reasoning for image captioning.
Bilinear Pooling. Bilinear pooling is an operation that calculates the outer product between two feature vectors. Such a technique enables the 2nd order interaction across all elements of the feature vectors and thus provides more discriminative representations than linear pooling. An early pioneering work [22] demonstrates the advantage of bilinear pooling for the fine-grained visual recognition task: local pairwise feature interactions are modeled by leveraging bilinear pooling over the outputs of two CNNs. Later on, [9] proposes compact bilinear pooling that efficiently compresses the high-dimensional bilinear pooling feature into a compact one with a few thousand dimensions, while retaining the same discriminative power. [8] further extends compact bilinear pooling to the multi-modal scenario, where visual and textual representations are combined for the visual question answering task. Instead of compact bilinear pooling, which needs complex computations, [16] proposes a flexible low-rank bilinear pooling structure with linear mapping and Hadamard product. Recently, [42] presents a hierarchical bilinear pooling model to aggregate multiple cross-layer bilinear pooling features for fine-grained visual recognition. [15] exploits low-rank bilinear pooling to construct a bilinear attention network, aiming to learn bilinear attention distributions for visual question answering.

The aforementioned bilinear pooling techniques are mainly designed for fine-grained visual recognition or visual question answering. Instead, our X-Linear attention block is applicable to both the image encoder and the sentence decoder to exploit higher order intra- and inter-modal interactions for the image captioning task.
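As a concrete reference for the low-rank scheme attributed to [16] above, here is a minimal PyTorch sketch (our illustration; layer names and dimensions are assumptions): two linear mappings project the two inputs into a shared space, and their Hadamard product approximates the full outer-product interaction with far fewer parameters.

```python
# Low-rank bilinear pooling via linear mappings and a Hadamard product,
# a minimal sketch in the spirit of [16] (names/dimensions are assumptions).
import torch
import torch.nn as nn

class LowRankBilinearPooling(nn.Module):
    def __init__(self, dim_x, dim_y, dim_joint):
        super().__init__()
        self.proj_x = nn.Linear(dim_x, dim_joint, bias=False)  # projects modality x
        self.proj_y = nn.Linear(dim_y, dim_joint, bias=False)  # projects modality y

    def forward(self, x, y):
        # Element-wise product of the two projections captures pairwise
        # (2nd order) interactions without materializing the full outer product.
        return self.proj_x(x) * self.proj_y(y)

pool = LowRankBilinearPooling(dim_x=2048, dim_y=1024, dim_joint=512)
print(pool(torch.randn(4, 2048), torch.randn(4, 1024)).shape)  # torch.Size([4, 512])
```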
3. X-Linear Attention Networks (X-LAN)

In this section, we introduce a novel unified formulation of the attention module, named X-Linear attention block, that fully capitalizes on bilinear pooling to capture the 2nd order feature interactions with both spatial and channel-wise bilinear attention. Moreover, we show a specific integration of the X-Linear attention block into the image encoder and sentence decoder to capture higher order intra- and inter-modal interactions, aiming to enhance visual information and perform complex multi-modal reasoning for image captioning.

3.1. Conventional Attention Module

We first provide a brief review of the most conventional attention module [35] applied in image captioning, which learns to selectively attend to salient image regions for sentence generation. Formally, at decoding time step $t$, conditioned on the query $Q$ (the current hidden state $h_t$ of the sentence decoder), we can obtain the attention distribution $\alpha^t$ over a set of keys $K = \{k_i\}_{i=1}^{N}$ ($N$ local image features):

$a_i^t = W_a\,[\tanh(W_k k_i + W_q Q)]\,, \quad \alpha^t = \mathrm{softmax}(a^t)\,, \qquad (1)$

where $W_a$, $W_k$, and $W_q$ are embedding matrices, and $a_i^t$ denotes the $i$-th element of $a^t$. In this sense, the normalized attention weight $\alpha_i^t$ for each local image feature (the $i$-th element of $\alpha^t$) is derived from the linear fusion of the given query and key via element-wise sum. Such a scheme inherently exploits only the 1st order feature interaction between natural sentence and visual content for attention measurement. Next, the attention module produces the attended image feature $\hat{v}_t$ by accumulating all values $V = \{v_i\}_{i=1}^{N}$ ($N$ local image features) with the spatial attention weights: $\hat{v}_t = \sum_{i=1}^{N} \alpha_i^t v_i$.
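As a sketch, Eq. (1) and the weighted sum translate into a few lines of PyTorch (our illustration; layer names, hidden dimension, and the single-query interface are assumptions, not the authors' code):

```python
# Conventional additive attention of Eq. (1): linear fusion of query and keys,
# softmax over regions, then a weighted sum of the values.
import torch
import torch.nn as nn

class ConventionalAttention(nn.Module):
    def __init__(self, dim_q, dim_k, dim_hidden):
        super().__init__()
        self.W_q = nn.Linear(dim_q, dim_hidden, bias=False)
        self.W_k = nn.Linear(dim_k, dim_hidden, bias=False)
        self.W_a = nn.Linear(dim_hidden, 1, bias=False)

    def forward(self, Q, K, V):
        # Q: (D_q,), K: (N, D_k), V: (N, D_v)
        a = self.W_a(torch.tanh(self.W_k(K) + self.W_q(Q)))  # (N, 1), Eq. (1)
        alpha = torch.softmax(a, dim=0)                      # spatial weights
        return (alpha * V).sum(dim=0)                        # attended feature v^_t

attn = ConventionalAttention(dim_q=512, dim_k=1024, dim_hidden=512)
v_hat = attn(torch.randn(512), torch.randn(36, 1024), torch.randn(36, 1024))
print(v_hat.shape)  # torch.Size([1024])
```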
3.2. X-Linear Attention Block

Such a conventional design, however, exploits only the 1st order feature interaction, which limits the capacity of complex multi-modal reasoning in image captioning. Inspired by the recent successes of bilinear pooling applied in fine-grained visual recognition [9, 42] or visual question answering [8, 15], we fully capitalize on bilinear pooling techniques to construct a unified attention module (the X-Linear attention block) for image captioning, as depicted in Figure 1(b). Such a design of the X-Linear attention block strengthens the representative capacity of the output attended feature by exploiting higher order interactions between the input single-modal or multi-modal features.

In particular, suppose we have the query $Q \in \mathbb{R}^{D_q}$, a set of keys $K = \{k_i\}_{i=1}^{N}$, and a set of values $V = \{v_i\}_{i=1}^{N}$, where $k_i \in \mathbb{R}^{D_k}$ and $v_i \in \mathbb{R}^{D_v}$ denote the $i$-th key/value pair. The X-Linear attention block firstly performs low-rank bilinear pooling to achieve a joint bilinear query-key representation $B_i^k \in \mathbb{R}^{D_B}$ between the query $Q$ and each key $k_i$:

$B_i^k = \sigma(W_k k_i) \odot \sigma(W_q^k Q)\,, \qquad (2)$

where $W_k \in \mathbb{R}^{D_B \times D_k}$ and $W_q^k \in \mathbb{R}^{D_B \times D_q}$ are embedding matrices, $\sigma$ denotes the ReLU unit, and $\odot$ represents element-wise multiplication. As such, the learnt bilinear query-key representation $B_i^k$ conveys the 2nd order feature interactions between query and key.

Next, depending on all bilinear query-key representations $\{B_i^k\}_{i=1}^{N}$, two kinds of bilinear attention distributions are obtained to aggregate both spatial and channel-wise information within all values. More specifically, the spatial bilinear attention distribution is introduced by projecting each bilinear query-key representation into the corresponding attention weight via two embedding layers, followed by a softmax layer for normalization:

$B_i^{\prime k} = \sigma(W_B^k B_i^k)\,, \quad b_i^s = W_b B_i^{\prime k}\,, \quad \beta^s = \mathrm{softmax}(b^s)\,, \qquad (3)$

where $W_B^k \in \mathbb{R}^{D_c \times D_B}$ and $W_b \in \mathbb{R}^{1 \times D_c}$ are embedding matrices, $B_i^{\prime k}$ is the transformed bilinear query-key representation, and $b_i^s$ is the $i$-th element of $b^s$. Here each element $\beta_i^s$ of $\beta^s$ denotes the normalized spatial attention weight for each key/value pair. Meanwhile, we perform a squeeze-excitation operation [11] over all transformed bilinear query-key representations $\{B_i^{\prime k}\}_{i=1}^{N}$ for channel-wise attention measurement. Concretely, the squeeze operation aggregates all transformed bilinear query-key representations via average pooling, leading to a global channel descriptor $\bar{B}$:

$\bar{B} = \frac{1}{N}\sum_{i=1}^{N} B_i^{\prime k}\,. \qquad (4)$
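Putting Eqs. (2)-(4) together, a minimal PyTorch sketch of the block's forward pass might look as follows. The excitation step and the final aggregation follow the prose description in the Introduction (sigmoid self-gating on the channel descriptor; a spatially weighted sum of bilinear query-value features gated channel-wise), since their equations fall outside this excerpt; all layer names, the query-value branch, and the excitation shape are our assumptions, not the authors' code.

```python
# Sketch of the X-Linear attention block assembled from Eqs. (2)-(4) plus the
# excitation/aggregation steps as described in the Introduction (assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class XLinearAttention(nn.Module):
    def __init__(self, dim_q, dim_k, dim_v, dim_B, dim_c):
        super().__init__()
        self.W_k  = nn.Linear(dim_k, dim_B, bias=False)  # key embedding
        self.W_qk = nn.Linear(dim_q, dim_B, bias=False)  # query embedding (key branch)
        self.W_v  = nn.Linear(dim_v, dim_B, bias=False)  # value embedding (assumed)
        self.W_qv = nn.Linear(dim_q, dim_B, bias=False)  # query embedding (value branch, assumed)
        self.W_B  = nn.Linear(dim_B, dim_c, bias=False)  # first spatial-attention embedding
        self.W_b  = nn.Linear(dim_c, 1, bias=False)      # second spatial-attention embedding
        self.W_e  = nn.Linear(dim_c, dim_B, bias=False)  # excitation projection (assumed shape)

    def forward(self, Q, K, V):
        # Q: (D_q,), K: (N, D_k), V: (N, D_v)
        B_k = F.relu(self.W_k(K)) * F.relu(self.W_qk(Q))  # Eq. (2): bilinear query-key
        B_k2 = F.relu(self.W_B(B_k))                      # Eq. (3): transformed B'^k_i
        beta_s = torch.softmax(self.W_b(B_k2), dim=0)     # Eq. (3): spatial attention
        B_bar = B_k2.mean(dim=0)                          # Eq. (4): squeeze (channel descriptor)
        beta_c = torch.sigmoid(self.W_e(B_bar))           # excitation: channel attention (prose)
        B_v = F.relu(self.W_v(V)) * F.relu(self.W_qv(Q))  # bilinear query-value features
        return beta_c * (beta_s * B_v).sum(dim=0)         # attended feature

block = XLinearAttention(dim_q=512, dim_k=1024, dim_v=1024, dim_B=512, dim_c=512)
out = block(torch.randn(512), torch.randn(36, 1024), torch.randn(36, 1024))
print(out.shape)  # torch.Size([512])
```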