Self-Attention with Relative Position Representations

Peter Shaw (Google), petershaw@google.com
Jakob Uszkoreit (Google Brain), usz@google.com
Ashish Vaswani (Google Brain), avaswani@google.com

arXiv:1803.02155v2 [cs.CL] 12 Apr 2018

Abstract

Relying entirely on an attention mechanism, the Transformer introduced by Vaswani et al. (2017) achieves state-of-the-art results for machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure. Instead, it requires adding representations of absolute positions to its inputs. In this work we present an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements. On the WMT 2014 English-to-German and English-to-French translation tasks, this approach yields improvements of 1.3 BLEU and 0.3 BLEU over absolute position representations, respectively. Notably, we observe that combining relative and absolute position representations yields no further improvement in translation quality. We describe an efficient implementation of our method and cast it as an instance of relation-aware self-attention mechanisms that can generalize to arbitrary graph-labeled inputs.

1 Introduction

Recent approaches to sequence to sequence learning typically leverage recurrence (Sutskever et al., 2014), convolution (Gehring et al., 2017; Kalchbrenner et al., 2016), attention (Vaswani et al., 2017), or a combination of recurrence and attention (Bahdanau et al., 2014; Cho et al., 2014; Luong et al., 2015; Wu et al., 2016) as basic building blocks. These approaches incorporate information about the sequential position of elements differently.

Recurrent neural networks (RNNs) typically compute a hidden state h_t as a function of their input at time t and a previous hidden state h_{t-1}, capturing relative and absolute positions along the time dimension directly through their sequential structure. Non-recurrent models do not necessarily consider input elements sequentially and may hence require explicitly encoding position information to be able to use sequence order.

One common approach is to use position encodings which are combined with input elements to expose position information to the model. These position encodings can be a deterministic function of position (Sukhbaatar et al., 2015; Vaswani et al., 2017) or learned representations. Convolutional neural networks inherently capture relative positions within the kernel size of each convolution. They have been shown to still benefit from position encodings (Gehring et al., 2017), however.

For the Transformer, which employs neither convolution nor recurrence, incorporating explicit representations of position information is an especially important consideration since the model is otherwise entirely invariant to sequence ordering. Attention-based models have therefore used position encodings or biased attention weights based on distance (Parikh et al., 2016).

In this work we present an efficient way of incorporating relative position representations in the self-attention mechanism of the Transformer. Even when entirely replacing its absolute position encodings, we demonstrate significant improvements in translation quality on two machine translation tasks.

Our approach can be cast as a special case of extending the self-attention mechanism of the Transformer to considering arbitrary relations between any two elements of the input, a direction we plan to explore in future work on modeling labeled, directed graphs.
2 Background

2.1 Transformer

The Transformer (Vaswani et al., 2017) employs an encoder-decoder structure, consisting of stacked encoder and decoder layers. Encoder layers consist of two sublayers: self-attention followed by a position-wise feed-forward layer. Decoder layers consist of three sublayers: self-attention followed by encoder-decoder attention, followed by a position-wise feed-forward layer. It uses residual connections around each of the sublayers, followed by layer normalization (Ba et al., 2016). The decoder uses masking in its self-attention to prevent a given output position from incorporating information about future output positions during training.

Position encodings based on sinusoids of varying frequency are added to encoder and decoder input elements prior to the first layer. In contrast to learned, absolute position representations, the authors hypothesized that sinusoidal position encodings would help the model generalize to sequence lengths unseen during training by allowing it to learn to attend also by relative position. This property is shared by our relative position representations which, in contrast to absolute position representations, are invariant to the total sequence length. Residual connections help propagate position information to higher layers.

2.2 Self-Attention

Self-attention sublayers employ h attention heads. To form the sublayer output, results from each head are concatenated and a parameterized linear transformation is applied.

Each attention head operates on an input sequence x = (x_1, ..., x_n) of n elements, where x_i ∈ R^{d_x}, and computes a new sequence z = (z_1, ..., z_n) of the same length, where z_i ∈ R^{d_z}.

Each output element z_i is computed as a weighted sum of linearly transformed input elements:

    z_i = \sum_{j=1}^{n} \alpha_{ij} (x_j W^V)    (1)

Each weight coefficient \alpha_{ij} is computed using a softmax function:

    \alpha_{ij} = \frac{\exp e_{ij}}{\sum_{k=1}^{n} \exp e_{ik}}

And e_{ij} is computed using a compatibility function that compares two input elements:

    e_{ij} = \frac{(x_i W^Q)(x_j W^K)^T}{\sqrt{d_z}}    (2)

Scaled dot product was chosen for the compatibility function, which enables efficient computation. Linear transformations of the inputs add sufficient expressive power. W^Q, W^K, W^V ∈ R^{d_x × d_z} are parameter matrices. These parameter matrices are unique per layer and attention head.
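For concreteness, the following NumPy sketch implements a single attention head as defined by eqs. (1) and (2). It is only a minimal illustration, not the implementation used in our experiments; the variable names (W_q, W_k, W_v, d_x, d_z) and the random initialization are chosen for readability.

```python
# Minimal sketch of one self-attention head, following eqs. (1) and (2).
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """x: (n, d_x) input sequence; returns z: (n, d_z)."""
    q = x @ W_q                                  # (n, d_z) queries
    k = x @ W_k                                  # (n, d_z) keys
    v = x @ W_v                                  # (n, d_z) values
    d_z = q.shape[-1]
    e = q @ k.T / np.sqrt(d_z)                   # eq. (2): scaled dot-product compatibilities
    e = e - e.max(axis=-1, keepdims=True)        # subtract row max for numerical stability
    alpha = np.exp(e)
    alpha /= alpha.sum(axis=-1, keepdims=True)   # softmax over positions j
    return alpha @ v                             # eq. (1): weighted sum of transformed inputs

# Example usage with random inputs and parameters.
rng = np.random.default_rng(0)
n, d_x, d_z = 5, 512, 64
x = rng.standard_normal((n, d_x))
W_q, W_k, W_v = (rng.standard_normal((d_x, d_z)) * 0.02 for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)    # (5, 64)
```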
3 Proposed Architecture

3.1 Relation-aware Self-Attention

We propose an extension to self-attention to consider the pairwise relationships between input elements. In this sense, we model the input as a labeled, directed, fully-connected graph.

The edge between input elements x_i and x_j is represented by vectors a^V_{ij}, a^K_{ij} ∈ R^{d_a}. The motivation for learning two distinct edge representations is that a^V_{ij} and a^K_{ij} are suitable for use in eq. (3) and eq. (4), respectively, without requiring additional linear transformations. These representations can be shared across attention heads. We use d_a = d_z.

We modify eq. (1) to propagate edge information to the sublayer output:

    z_i = \sum_{j=1}^{n} \alpha_{ij} (x_j W^V + a^V_{ij})    (3)

This extension is presumably important for tasks where information about the edge types selected by a given attention head is useful to downstream encoder or decoder layers. However, as explored in 4.3, this may not be necessary for machine translation.

We also, importantly, modify eq. (2) to consider edges when determining compatibility:

    e_{ij} = \frac{x_i W^Q (x_j W^K + a^K_{ij})^T}{\sqrt{d_z}}    (4)

The primary motivation for using simple addition to incorporate edge representations in eq. (3) and eq. (4) is to enable an efficient implementation described in 3.3.
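A sketch of how eqs. (3) and (4) extend the head above: the key-side edge vectors a^K_{ij} enter the compatibility scores and the value-side edge vectors a^V_{ij} enter the weighted sum. This direct form materializes dense (n, n, d_z) edge tensors and is only illustrative; 3.3 describes the efficient formulation. Names and shapes are assumptions for the sketch, not the reference implementation.

```python
# Relation-aware self-attention, following eqs. (3) and (4).
import numpy as np

def relation_aware_self_attention(x, W_q, W_k, W_v, a_k, a_v):
    """x: (n, d_x); a_k, a_v: (n, n, d_z) edge representations; returns (n, d_z)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v                    # each (n, d_z)
    d_z = q.shape[-1]
    # eq. (4): e_ij = x_i W^Q (x_j W^K + a_k[i, j])^T / sqrt(d_z)
    e = np.einsum('id,ijd->ij', q, k[None, :, :] + a_k) / np.sqrt(d_z)
    e = e - e.max(axis=-1, keepdims=True)                  # numerical stability
    alpha = np.exp(e)
    alpha /= alpha.sum(axis=-1, keepdims=True)             # softmax over positions j
    # eq. (3): z_i = sum_j alpha_ij (x_j W^V + a_v[i, j])
    return np.einsum('ij,ijd->id', alpha, v[None, :, :] + a_v)

# Example usage with random edge representations (see 3.2 for how they are built).
rng = np.random.default_rng(0)
n, d_x, d_z = 5, 512, 64
x = rng.standard_normal((n, d_x))
W_q, W_k, W_v = (rng.standard_normal((d_x, d_z)) * 0.02 for _ in range(3))
a_k = rng.standard_normal((n, n, d_z)) * 0.02
a_v = rng.standard_normal((n, n, d_z)) * 0.02
print(relation_aware_self_attention(x, W_q, W_k, W_v, a_k, a_v).shape)  # (5, 64)
```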
[Figure 1: Example edges representing relative positions, or the distance between elements. We learn representations for each relative position within a clipping distance k. The figure assumes 2 ≤ k ≤ n − 4 and labels edges such as a^V_{2,1} = w^V_{-1}, a^K_{2,4} = w^K_{2}, and a^K_{4,n} = w^K_{k}. Note that not all edges are shown.]

3.2 Relative Position Representations

For linear sequences, edges can capture information about the relative position differences between input elements. The maximum relative position we consider is clipped to a maximum absolute value of k. We hypothesized that precise relative position information is not useful beyond a certain distance. Clipping the maximum distance also enables the model to generalize to sequence lengths not seen during training. Therefore, we consider 2k + 1 unique edge labels:

    a^K_{ij} = w^K_{clip(j-i, k)}
    a^V_{ij} = w^V_{clip(j-i, k)}
    clip(x, k) = max(-k, min(k, x))

We then learn relative position representations w^K = (w^K_{-k}, ..., w^K_{k}) and w^V = (w^V_{-k}, ..., w^V_{k}), where w^K_i, w^V_i ∈ R^{d_a}.
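A short sketch of how the 2k + 1 learned vectors per table can be expanded into the per-pair tensors used above by clipping the relative distance j − i. The table names and initialization are illustrative only.

```python
# Build edge representations a^K_{ij} and a^V_{ij} from 2k + 1 learned vectors each.
import numpy as np

def relative_position_tables(n, k, d_a, rng):
    """Return a_k, a_v of shape (n, n, d_a) by indexing two (2k + 1, d_a) tables."""
    w_k_table = rng.standard_normal((2 * k + 1, d_a)) * 0.02   # w^K_{-k}, ..., w^K_{k}
    w_v_table = rng.standard_normal((2 * k + 1, d_a)) * 0.02   # w^V_{-k}, ..., w^V_{k}
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    rel = np.clip(j - i, -k, k) + k          # clip(j - i, k), shifted to indices 0..2k
    return w_k_table[rel], w_v_table[rel]    # each (n, n, d_a)

# For n = 5 and k = 2 the clipped distances j - i (before the +k shift) are:
# [[ 0  1  2  2  2]
#  [-1  0  1  2  2]
#  [-2 -1  0  1  2]
#  [-2 -2 -1  0  1]
#  [-2 -2 -2 -1  0]]
a_k, a_v = relative_position_tables(n=5, k=2, d_a=64, rng=np.random.default_rng(0))
print(a_k.shape)  # (5, 5, 64)
```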
K sult was a modest 7% decrease in steps per sec-
ond, but we were able to maintain the same model
3.3 Efficient Implementation and batch sizes on P100 GPUs as Vaswani et
There are practical space complexity concerns al. (2017).
when considering edges between input elements,
as noted by Veličković et al. (2017), which consid- 4 Experiments
ers unlabeled graph inputs to an attention model. 4.1 Experimental Setup
For a sequence of length n and h attention
heads, we reduce the space complexity of storing We use the tensor2tensor 1 library for training and
relative position representations from O(hn2 da ) evaluating our model.
to O(n2 da ) by sharing them across each heads. We evaluated our model on the WMT 2014
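The splitting of eq. (5) can be sketched in NumPy as follows: the first term is a standard batched matrix multiplication, while the second term is computed with n parallel (bh × d_z)(d_z × n) products so that the (n, n, d_z) table is never expanded per batch or head. This is an illustrative rendering of the reshaping described above, not the library code; shapes and names are assumptions.

```python
# Compatibility logits of eq. (5) without broadcasting relative position tables.
import numpy as np

def relative_logits(q, k, a_k):
    """q, k: (b, h, n, d_z); a_k: (n, n, d_z); returns e: (b, h, n, n)."""
    b, h, n, d_z = q.shape
    # First term: ordinary scaled dot-product logits, bh parallel (n, d_z)(d_z, n) products.
    term1 = q @ k.transpose(0, 1, 3, 2)                      # (b, h, n, n)
    # Second term: for each query position i, multiply the (b*h, d_z) slice of q
    # by the (d_z, n) slice a_k[i].T, sharing a_k across batches and heads.
    q_t = q.transpose(2, 0, 1, 3).reshape(n, b * h, d_z)     # (n, b*h, d_z)
    term2 = q_t @ a_k.transpose(0, 2, 1)                     # (n, b*h, n)
    term2 = term2.reshape(n, b, h, n).transpose(1, 2, 0, 3)  # back to (b, h, n, n)
    return (term1 + term2) / np.sqrt(d_z)

# Sanity check against the direct per-pair computation.
rng = np.random.default_rng(0)
b, h, n, d_z = 2, 4, 5, 8
q = rng.standard_normal((b, h, n, d_z))
k = rng.standard_normal((b, h, n, d_z))
a_k = rng.standard_normal((n, n, d_z))
direct = np.einsum('bhid,ijd->bhij', q, a_k)
assert np.allclose(relative_logits(q, k, a_k),
                   (q @ k.transpose(0, 1, 3, 2) + direct) / np.sqrt(d_z))
```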
4 Experiments

4.1 Experimental Setup

We use the tensor2tensor library for training and evaluating our model. [1]

We evaluated our model on the WMT 2014 machine translation task, using the WMT 2014 English-German dataset consisting of approximately 4.5M sentence pairs and the 2014 WMT English-French dataset consisting of approximately 36M sentence pairs.

[1] The tensor2tensor library is available at https://github.com/tensorflow/tensor2tensor.
Model Position Information EN-DE BLEU EN-FR BLEU
Transformer (base) Absolute Position Representations 26.5 38.2
Transformer (base) Relative Position Representations 26.8 38.7
Transformer (big) Absolute Position Representations 27.9 41.2
Transformer (big) Relative Position Representations 29.2 41.5

Table 1: Experimental results for WMT 2014 English-to-German (EN-DE) and English-to-French (EN-FR) trans-
lation tasks, using newstest2014 test set.

For all experiments, we split tokens into a 32,768 word-piece vocabulary (Wu et al., 2016). We batched sentence pairs by approximate length, and limited input and output tokens per batch to 4096 per GPU. Each resulting training batch contained approximately 25,000 source and 25,000 target tokens.

We used the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.98, and ε = 10^-9. We used the same warmup and decay strategy for learning rate as Vaswani et al. (2017), with 4,000 warmup steps. During training, we employed label smoothing of value ε_ls = 0.1 (Szegedy et al., 2016). For evaluation, we used beam search with a beam size of 4 and length penalty α = 0.6 (Wu et al., 2016).

For our base model, we used 6 encoder and decoder layers, d_x = 512, d_z = 64, 8 attention heads, 1024 feed-forward inner-layer dimensions, and P_dropout = 0.1. When using relative position encodings, we used clipping distance k = 16, and used unique edge representations per layer and head. We trained for 100,000 steps on 8 K40 GPUs, and did not use checkpoint averaging.

For our big model, we used 6 encoder and decoder layers, d_x = 1024, d_z = 64, 16 attention heads, 4096 feed-forward inner-layer dimensions, and P_dropout = 0.3 for EN-DE and P_dropout = 0.1 for EN-FR. When using relative position encodings, we used k = 8, and used unique edge representations per layer. We trained for 300,000 steps on 8 P100 GPUs, and averaged the last 20 checkpoints, saved at 10 minute intervals.
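For reference, a sketch of the warmup-and-decay schedule mentioned above, assuming the formula from Vaswani et al. (2017) with 4,000 warmup steps and treating the model dimension as d_x; the exact constants and implementation details in the training library may differ.

```python
# Learning rate schedule assumed from Vaswani et al. (2017): linear warmup,
# then inverse-square-root decay with the step number.
def transformer_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The schedule peaks at step 4,000 and decays afterwards.
print(transformer_lr(1), transformer_lr(4000), transformer_lr(100000))
```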
4.2 Machine Translation

We compared our model using only relative position representations to the baseline Transformer (Vaswani et al., 2017) with sinusoidal position encodings. We generated baseline results to isolate the impact of relative position representations from any other changes to the underlying library and experimental configuration.

For English-to-German our approach improved performance over our baseline by 0.3 and 1.3 BLEU for the base and big configurations, respectively. For English-to-French it improved by 0.5 and 0.3 BLEU for the base and big configurations, respectively. In our experiments we did not observe any benefit from including sinusoidal position encodings in addition to relative position representations. The results are shown in Table 1.

4.3 Model Variations

We performed several experiments modifying various aspects of our model. All of our experiments in this section use the base model configuration without any absolute position representations. BLEU scores are calculated on the WMT English-to-German task using the development set, newstest2013.

We evaluated the effect of varying the clipping distance, k, of the maximum absolute relative position difference. Notably, for k ≥ 2, there does not appear to be much variation in BLEU scores. However, as we use multiple encoder layers, precise relative position information may be able to propagate beyond the clipping distance. The results are shown in Table 2.

k      EN-DE BLEU
0      12.5
1      25.5
2      25.8
4      25.9
16     25.8
64     25.9
256    25.8

Table 2: Experimental results for varying the clipping distance, k.
We also evaluated the impact of ablating each of the two relative position representations defined in Section 3.1, a^V_{ij} in eq. (3) and a^K_{ij} in eq. (4). Including relative position representations solely when determining compatibility between elements may be sufficient, but further work is needed to determine whether this is true for other tasks. The results are shown in Table 3.

a^V_{ij}   a^K_{ij}   EN-DE BLEU
Yes        Yes        25.8
No         Yes        25.8
Yes        No         25.3
No         No         12.5

Table 3: Experimental results for ablating relative position representations a^V_{ij} and a^K_{ij}.

5 Conclusions

In this paper we presented an extension to self-attention that can be used to incorporate relative position information for sequences, which improves performance for machine translation.

For future work, we plan to extend this mechanism to consider arbitrary directed, labeled graph inputs to the Transformer. We are also interested in nonlinear compatibility functions to combine input representations and edge representations. For both of these extensions, a key consideration will be determining efficient implementations.
References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In Empirical Methods in Natural Language Processing.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems, pages 2440–2448.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
