
The Best of Both Worlds:

Combining Recent Advances in Neural Machine Translation


Mia Xu Chen ∗ Orhan Firat ∗ Ankur Bapna ∗

Melvin Johnson Wolfgang Macherey George Foster Llion Jones Niki Parmar

Noam Shazeer Ashish Vaswani Jakob Uszkoreit Lukasz Kaiser

Mike Schuster Zhifeng Chen Yonghui Wu Macduff Hughes


{miachen,orhanf,ankurbpn,yonghui}@google.com
Google AI
∗ Equal contribution.

Abstract

The past year has witnessed rapid advances in sequence-to-sequence (seq2seq) modeling for Machine Translation (MT). The classic RNN-based approaches to MT were first out-performed by the convolutional seq2seq model, which was then out-performed by the more recent Transformer model. Each of these new approaches consists of a fundamental architecture accompanied by a set of modeling and training techniques that are in principle applicable to other seq2seq architectures. In this paper, we tease apart the new architectures and their accompanying techniques in two ways. First, we identify several key modeling and training techniques, and apply them to the RNN architecture, yielding a new RNMT+ model that outperforms all of the three fundamental architectures on the benchmark WMT'14 English→French and English→German tasks. Second, we analyze the properties of each fundamental seq2seq architecture and devise new hybrid architectures intended to combine their strengths. Our hybrid models obtain further improvements, outperforming the RNMT+ model on both benchmark datasets.

1 Introduction

In recent years, the emergence of seq2seq models (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014) has revolutionized the field of MT by replacing traditional phrase-based approaches with neural machine translation (NMT) systems based on the encoder-decoder paradigm. In the first architectures that surpassed the quality of phrase-based MT, both the encoder and decoder were implemented as Recurrent Neural Networks (RNNs), interacting via a soft-attention mechanism (Bahdanau et al., 2015). The RNN-based NMT approach, or RNMT, was quickly established as the de-facto standard for NMT, and gained rapid adoption into large-scale systems in industry, e.g. Baidu (Zhou et al., 2016), Google (Wu et al., 2016), and Systran (Crego et al., 2016).

Following RNMT, convolutional neural network based approaches (LeCun and Bengio, 1998) to NMT have recently drawn research attention due to their ability to fully parallelize training to take advantage of modern fast computing devices such as GPUs and Tensor Processing Units (TPUs) (Jouppi et al., 2017). Well-known examples are ByteNet (Kalchbrenner et al., 2016) and ConvS2S (Gehring et al., 2017). The ConvS2S model was shown to outperform the original RNMT architecture in terms of quality, while also providing greater training speed.

Most recently, the Transformer model (Vaswani et al., 2017), which is based solely on a self-attention mechanism (Parikh et al., 2016) and feed-forward connections, has further advanced the field of NMT, both in terms of translation quality and speed of convergence.

In many instances, new architectures are accompanied by a novel set of techniques for performing training and inference that have been carefully optimized to work in concert. This 'bag of tricks' can be crucial to the performance of a proposed architecture, yet it is typically under-documented and left for the enterprising researcher to discover in publicly released code (if any) or through anecdotal evidence. This is not simply a problem for reproducibility; it obscures the central scientific question of how much of the observed gains come from the new architecture and how much can be attributed to the associated training and inference techniques. In some cases, these new techniques may be broadly applicable to other architectures and thus constitute a major, though implicit, contribution of an architecture paper. Clearly, they need to be considered in order to ensure a fair comparison across different model architectures.

In this paper, we therefore take a step back and look at which techniques and methods contribute significantly to the success of recent architectures, namely ConvS2S and Transformer, and explore applying these methods to other architectures, including RNMT models. In doing so, we come up with an enhanced version of RNMT, referred to as RNMT+, that significantly outperforms all individual architectures in our setup. We further introduce new architectures built with different components borrowed from RNMT+, ConvS2S and Transformer. In order to ensure a fair setting for comparison, all architectures were implemented in the same framework, use the same pre-processed data and apply no further post-processing, as this may confound bare model performance.

Our contributions are three-fold:

1. In ablation studies, we quantify the effect of several modeling improvements (including multi-head attention and layer normalization) as well as optimization techniques (such as synchronous replica training and label smoothing), which are used in recent architectures. We demonstrate that these techniques are applicable across different model architectures.

2. Combining these improvements with the RNMT model, we propose the new RNMT+ model, which significantly outperforms all fundamental architectures on the widely-used WMT'14 En→Fr and En→De benchmark datasets. We provide a detailed model analysis and comparison of RNMT+, ConvS2S and Transformer in terms of model quality, model size, and training and inference speed.

3. Inspired by our understanding of the relative strengths and weaknesses of individual model architectures, we propose new model architectures that combine components from the RNMT+ and the Transformer model, and achieve better results than both individual architectures.

We quickly note two prior works that provided empirical solutions to the difficulty of training NMT architectures (specifically RNMT). In (Britz et al., 2017) the authors systematically explore which elements of NMT architectures have a significant impact on translation quality. In (Denkowski and Neubig, 2017) the authors recommend three specific techniques for strengthening NMT systems and empirically demonstrate how incorporating those techniques improves the reliability of the experimental results.

2 Background

In this section, we briefly discuss the commonly used NMT architectures.

2.1 RNN-based NMT Models - RNMT

RNMT models are composed of an encoder RNN and a decoder RNN, coupled with an attention network. The encoder summarizes the input sequence into a set of vectors while the decoder conditions on the encoded input sequence through an attention mechanism, and generates the output sequence one token at a time.

The most successful RNMT models consist of stacked RNN encoders with one or more bidirectional RNNs (Schuster and Paliwal, 1997; Graves and Schmidhuber, 2005), and stacked decoders with unidirectional RNNs. Both encoder and decoder RNNs consist of either LSTM (Hochreiter and Schmidhuber, 1997; Gers et al., 2000) or GRU units (Cho et al., 2014), and make extensive use of residual (He et al., 2015) or highway (Srivastava et al., 2015) connections.

In Google-NMT (GNMT) (Wu et al., 2016), the best performing RNMT model on the datasets we consider, the encoder network consists of one bi-directional LSTM layer, followed by 7 uni-directional LSTM layers. The decoder is equipped with a single attention network and 8 uni-directional LSTM layers. Both the encoder and the decoder use residual skip connections between consecutive layers.

In this paper, we adopt GNMT as the starting point for our proposed RNMT+ architecture.

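To make the attention computation described above concrete, here is a minimal, self-contained sketch of single-head additive (Bahdanau-style) attention producing a context vector from a decoder state and a set of encoder vectors. It is illustrative only: the parameter shapes and the plain NumPy setting are assumptions made for this sketch, not the configuration of any of the systems discussed in this paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(query, enc_outputs, W_q, W_k, v):
    """Single-head additive attention.

    query:       decoder state, shape [d_dec]
    enc_outputs: encoder output vectors, shape [src_len, d_enc]
    W_q, W_k, v: learned parameters (random placeholders below)
    Returns the context vector (a weighted sum of encoder vectors).
    """
    # Score each source position with a small additive network.
    scores = np.tanh(enc_outputs @ W_k + query @ W_q) @ v   # [src_len]
    weights = softmax(scores)                               # attention distribution
    context = weights @ enc_outputs                         # [d_enc]
    return context, weights

# Toy usage with random placeholder parameters.
rng = np.random.default_rng(0)
src_len, d_enc, d_dec, d_att = 7, 16, 16, 8
enc = rng.normal(size=(src_len, d_enc))
query = rng.normal(size=(d_dec,))
W_q, W_k = rng.normal(size=(d_dec, d_att)), rng.normal(size=(d_enc, d_att))
v = rng.normal(size=(d_att,))
context, weights = additive_attention(query, enc, W_q, W_k, v)
```

In the models above, one such context vector is computed at every decoding step and combined with the decoder state before the next output token is predicted.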
2.2 Convolutional NMT Models - ConvS2S

In the most successful convolutional sequence-to-sequence model (Gehring et al., 2017), both the encoder and decoder are constructed by stacking multiple convolutional layers, where each layer contains 1-dimensional convolutions followed by gated linear units (GLU) (Dauphin et al., 2016). Each decoder layer computes a separate dot-product attention by using the current decoder layer output and the final encoder layer outputs. Positional embeddings are used to provide explicit positional information to the model. Following the practice in (Gehring et al., 2017), we scale the gradients of the encoder layers to stabilize training. We also use residual connections across each convolutional layer and apply weight normalization (Salimans and Kingma, 2016) to speed up convergence. We follow the public ConvS2S codebase (https://github.com/facebookresearch/fairseq-py) in our experiments.

2.3 Conditional Transformation-based NMT Models - Transformer

The Transformer model (Vaswani et al., 2017) is motivated by two major design choices that aim to address deficiencies in the former two model families: (1) Unlike RNMT, but similar to ConvS2S, the Transformer model avoids any sequential dependencies in both the encoder and decoder networks to maximally parallelize training. (2) To address the limited context problem (limited receptive field) present in ConvS2S, the Transformer model makes pervasive use of self-attention networks (Parikh et al., 2016) so that each position in the current layer has access to information from all other positions in the previous layer.

The Transformer model still follows the encoder-decoder paradigm. Encoder transformer layers are built with two sub-modules: (1) a self-attention network and (2) a feed-forward network. Decoder transformer layers have an additional cross-attention layer sandwiched between the self-attention and feed-forward layers to attend to the encoder outputs.

There are two details which we found very important to the model's performance: (1) Each sub-layer in the transformer (i.e. self-attention, cross-attention, and the feed-forward sub-layer) follows a strict computation sequence: normalize → transform → dropout → residual-add. (2) In addition to per-layer normalization, the final encoder output is again normalized to prevent a blow up after consecutive residual additions.

In this paper, we follow the latest version of the Transformer model in the Tensor2Tensor codebase (https://github.com/tensorflow/tensor2tensor).

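The sub-layer ordering just described (normalize → transform → dropout → residual-add, with one extra normalization of the final encoder output) can be sketched as follows. This is an illustrative NumPy sketch under assumed shapes, with the attention and feed-forward transforms left as opaque callables; it is not the Tensor2Tensor implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize the last dimension to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def dropout(x, rate, rng, train=True):
    if not train or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def sublayer(x, transform, rate, rng):
    # Strict order: normalize -> transform -> dropout -> residual-add.
    return x + dropout(transform(layer_norm(x)), rate, rng)

def transformer_encoder(x, attn_layers, ffn_layers, rate, rng):
    # Each encoder layer = a self-attention sub-layer and a feed-forward sub-layer.
    for attn, ffn in zip(attn_layers, ffn_layers):
        x = sublayer(x, attn, rate, rng)
        x = sublayer(x, ffn, rate, rng)
    # The final encoder output is normalized once more to prevent a blow-up
    # after consecutive residual additions.
    return layer_norm(x)

# Toy usage with identity transforms standing in for the real sub-modules.
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 512))                 # [src_len, d_model]
identity = [lambda h: h for _ in range(6)]
out = transformer_encoder(x, identity, identity, rate=0.1, rng=rng)
```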
2.4 A Theory-Based Characterization of NMT Architectures

From a theoretical point of view, RNNs belong to the most expressive members of the neural network family (Siegelmann and Sontag, 1995), assuming that data complexity is satisfied. Possessing an infinite Markovian structure (and thus an infinite receptive field) equips them to model sequential data (Elman, 1990), especially natural language (Grefenstette et al., 2015), effectively. In practice, RNNs are notoriously hard to train (Hochreiter, 1991; Bengio et al., 1994; Hochreiter et al., 2001), confirming the well-known dilemma of trainability versus expressivity.

Convolutional layers are adept at capturing local context and local correlations by design. A fixed and narrow receptive field for each convolutional layer limits their capacity when the architecture is shallow. In practice, this weakness is mitigated by stacking more convolutional layers (e.g. 15 layers as in the ConvS2S model), which makes the model harder to train and demands meticulous initialization schemes and carefully designed regularization techniques.

The Transformer network is capable of approximating arbitrary squashing functions (Hornik et al., 1989), and can be considered a strong feature extractor with extended receptive fields capable of linking salient features from the entire sequence. On the other hand, lacking a memory component (as present in the RNN models) prevents the network from modeling a state space, reducing its theoretical strength as a sequence model; it therefore requires additional positional information (e.g. sinusoidal positional encodings).

The above theoretical characterizations will drive our explorations in the following sections.

3 Experiment Setup

We train our models on the standard WMT'14 En→Fr and En→De datasets that comprise 36.3M and 4.5M sentence pairs, respectively. Each sentence was encoded into a sequence of sub-word units obtained by first tokenizing the sentence with the Moses tokenizer, then splitting tokens into sub-word units (also known as "wordpieces") using the approach described in (Schuster and Nakajima, 2012).

We use a shared vocabulary of 32K sub-word units for each source-target language pair. No further manual or rule-based post-processing of the output was performed beyond combining the sub-word units to generate the targets. We report all our results on newstest 2014, which serves as the test set. A combination of newstest 2012 and newstest 2013 is used for validation.

To evaluate the models, we compute the BLEU metric on tokenized, true-case output, following the procedure used in the literature to which we compare (Gehring et al., 2017; Wu et al., 2016). For each training run, we evaluate the model every 30 minutes on the dev set. Once the model converges, we determine the best window based on the average dev-set BLEU score over 21 consecutive evaluations. We report the mean test score and standard deviation over the selected window. This allows us to compare model architectures based on their mean performance after convergence rather than individual checkpoint evaluations, as the latter can be quite noisy for some models.

To enable a fair comparison of architectures, we use the same pre-processing and evaluation methodology for all our experiments. We refrain from using checkpoint averaging (exponential moving averages of parameters) (Junczys-Dowmunt et al., 2016) or checkpoint ensembles (Jean et al., 2015; Chen et al., 2017) to focus on evaluating the performance of individual models.

Figure 1: Model architecture of RNMT+. On the left side, the encoder network has 6 bidirectional LSTM layers. At the end of each bidirectional layer, the outputs of the forward layer and the backward layer are concatenated. On the right side, the decoder network has 8 unidirectional LSTM layers, with the first layer used for obtaining the attention context vector through multi-head additive attention. The attention context vector is then fed directly into the rest of the decoder layers as well as the softmax layer.

4 RNMT+

4.1 Model Architecture of RNMT+

The newly proposed RNMT+ model architecture is shown in Figure 1. Here we highlight the key architectural choices that are different between the RNMT+ model and the GNMT model. There are 6 bidirectional LSTM layers in the encoder instead of 1 bidirectional LSTM layer followed by 7 unidirectional layers as in GNMT. For each bidirectional layer, the outputs of the forward layer and the backward layer are concatenated before being fed into the next layer. The decoder network consists of 8 unidirectional LSTM layers similar to the GNMT model. Residual connections are added to the third layer and above for both the encoder and decoder. Inspired by the Transformer model, per-gate layer normalization (Ba et al., 2016) is applied within each LSTM cell. Our empirical results show that layer normalization greatly stabilizes training. No non-linearity is applied to the LSTM output. A projection layer is added to the encoder final output; this additional projection reduces the dimensionality of the encoder output representations to match the decoder stack dimension. Multi-head additive attention is used instead of the single-head attention in the GNMT model. Similar to GNMT, we use the bottom decoder layer and the final encoder layer output after projection for obtaining the recurrent attention context. In addition to feeding the attention context to all decoder LSTM layers, we also feed it to the softmax by concatenating it with the layer input. This is important for both the quality of the models with multi-head attention and the stability of the training process.

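As a schematic of the encoder just described (6 bidirectional layers whose forward and backward outputs are concatenated, residual connections from the third layer onwards, and a final projection to the decoder dimension), the sketch below treats each directional LSTM as an opaque callable. It is an assumption-laden illustration of the layer wiring only, not the actual implementation; per-gate layer normalization would live inside the LSTM cells themselves.

```python
import numpy as np

def rnmt_plus_encoder(x, fwd_lstms, bwd_lstms, proj, num_layers=6):
    """x: [src_len, d_in]. fwd_lstms / bwd_lstms: lists of callables mapping a
    sequence of inputs to a sequence of hidden states. proj: final projection."""
    h = x
    for i in range(num_layers):
        fwd = fwd_lstms[i](h)                        # forward pass over time
        bwd = bwd_lstms[i](h[::-1])[::-1]            # run on reversed input, re-reverse
        out = np.concatenate([fwd, bwd], axis=-1)    # concatenate both directions
        # Residual connections are added from the third layer and above.
        if i >= 2 and out.shape == h.shape:
            out = out + h
        h = out
    # Project the final encoder output down to the decoder stack dimension.
    return h @ proj
```

On the decoder side, the multi-head additive attention context computed from the first LSTM layer would then be concatenated with the inputs of the remaining decoder layers and of the softmax, as described above.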
Since the encoder network in RNMT+ consists solely of bi-directional LSTM layers, model parallelism is not used during training. We compensate for the resulting longer per-step time with increased data parallelism (more model replicas), so that the overall time to reach convergence of the RNMT+ model is still comparable to that of GNMT.

We apply the following regularization techniques during training.

• Dropout: We apply dropout to both embedding layers and each LSTM layer output before it is added to the next layer's input. Attention dropout is also applied.

• Label Smoothing: We use uniform label smoothing with an uncertainty=0.1 (Szegedy et al., 2015). Label smoothing was shown to have a positive impact on both Transformer and RNMT+ models, especially in the case of RNMT+ with multi-head attention. Similar to the observations in (Chorowski and Jaitly, 2016), we found it beneficial to use a larger beam size (e.g. 16, 20, etc.) during decoding when models are trained with label smoothing.

• Weight Decay: For the WMT'14 En→De task, we apply L2 regularization to the weights with λ = 10^{-5}. Weight decay is only applied to the En→De task as the corpus is smaller and thus more regularization is required.

We use the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.999, ε = 10^{-6} and vary the learning rate according to this schedule:

lr = 10^{-4} \cdot \min\left(1 + \frac{t \cdot (n-1)}{n \cdot p},\ n,\ n \cdot (2n)^{\frac{s - n \cdot t}{e - s}}\right)   (1)

Here, t is the current step, n is the number of concurrent model replicas used in training, p is the number of warmup steps, s is the start step of the exponential decay, and e is the end step of the decay. Specifically, we first increase the learning rate linearly during the warmup steps, keep it constant until the decay start step s, then exponentially decay until the decay end step e, and keep it at 5 · 10^{-5} after the decay ends. This learning rate schedule is motivated by a similar schedule that was successfully applied in training the ResNet-50 model with a very large batch size (Goyal et al., 2017).

In contrast to the asynchronous training used for GNMT (Dean et al., 2012), we train RNMT+ models with synchronous training (Chen et al., 2016). Our empirical results suggest that when hyper-parameters are tuned properly, synchronous training often leads to improved convergence speed and superior model quality.

To further stabilize training, we also use adaptive gradient clipping. We discard a training step completely if an anomaly in the gradient norm value is detected, which is usually an indication of an imminent gradient explosion. More specifically, we keep track of a moving average and a moving standard deviation of the log of the gradient norm values, and we abort a step if the norm of the gradient exceeds four standard deviations of the moving average.

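Below is a minimal sketch of the learning-rate schedule of Eq. 1 and of the adaptive gradient-clipping rule described above, in plain Python/NumPy. The warm-up, decay and window constants are placeholders, and the moving-statistics update is one plausible reading of the description; this is an illustration, not the exact implementation used for RNMT+.

```python
import numpy as np

def rnmt_plus_lr(t, n, p, s, e, base=1e-4, floor=5e-5):
    """Eq. 1: warm up linearly, hold at n * base, decay exponentially between
    steps s and e, then stay at `floor`.
    t: current step, n: number of replicas, p: warm-up steps, s/e: decay start/end."""
    lr = base * min(
        1.0 + t * (n - 1) / (n * p),               # linear warm-up
        n,                                          # constant plateau
        n * (2 * n) ** ((s - n * t) / (e - s)),     # exponential decay
    )
    return max(lr, floor)

class AdaptiveGradientClipper:
    """Discard a step if the log gradient norm deviates from its moving average
    by more than `k` moving standard deviations (a sketch of the rule above)."""

    def __init__(self, k=4.0, decay=0.99):
        self.k, self.decay = k, decay
        self.mean, self.var = 0.0, 1.0              # moving stats of the log grad norm

    def should_discard(self, grad_norm):
        log_norm = float(np.log(grad_norm + 1e-12))
        anomalous = abs(log_norm - self.mean) > self.k * np.sqrt(self.var)
        if not anomalous:
            # Update the moving average / variance only with healthy steps.
            self.mean = self.decay * self.mean + (1 - self.decay) * log_norm
            diff = log_norm - self.mean
            self.var = self.decay * self.var + (1 - self.decay) * diff * diff
        return anomalous
```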
4.2 Model Analysis and Comparison

In this section, we compare the results of RNMT+ with ConvS2S and Transformer. All models were trained with synchronous training. RNMT+ and ConvS2S were trained with 32 NVIDIA P100 GPUs, while the Transformer Base and Big models were trained using 16 GPUs.

For RNMT+, we use sentence-level cross-entropy loss. Each training batch contained 4096 sentence pairs (4096 source sequences and 4096 target sequences). For ConvS2S and Transformer models, we use token-level cross-entropy loss. Each training batch contained 65536 source tokens and 65536 target tokens. For the GNMT baselines on both tasks, we cite the largest BLEU score reported in (Wu et al., 2016) without reinforcement learning.

Table 1 shows our results on the WMT'14 En→Fr task. Both the Transformer Big model and RNMT+ outperform GNMT and ConvS2S by about 2 BLEU points. RNMT+ is slightly better than the Transformer Big model in terms of its mean BLEU score. RNMT+ also yields a much lower standard deviation, and hence we observed much less fluctuation in the training curve. It takes approximately 3 days for the Transformer Base model to converge, while both RNMT+ and the Transformer Big model require about 5 days to converge. Although the batching schemes are quite different between the Transformer Big and the RNMT+ model, they have processed about the same amount of training samples upon convergence.

Model         Test BLEU      Epochs   Training Time
GNMT          38.95          -        -
ConvS2S       39.49 ± 0.11   62.2     438h
Trans. Base   39.43 ± 0.17   20.7     90h
Trans. Big    40.73 ± 0.19   8.3      120h
RNMT+         41.00 ± 0.05   8.5      120h

Table 1: Results on WMT'14 En→Fr. The numbers before and after '±' are the mean and standard deviation of the test BLEU score over an evaluation window. Note that Transformer models are trained using 16 GPUs, while ConvS2S and RNMT+ are trained using 32 GPUs. Since the ConvS2S model convergence is very slow, we did not explore further tuning on En→Fr, and validated our implementation on En→De. The BLEU scores for the Transformer models are slightly lower than those reported in (Vaswani et al., 2017) due to four differences: 1) we report the mean test BLEU score using the strategy described in Section 3; 2) we did not perform checkpoint averaging since it would be inconsistent with our evaluation for other models; 3) we avoided any manual post-processing, like unicode normalization using Moses replace-unicode-punctuation.perl or output tokenization using Moses tokenizer.perl, to rule out its effect on the evaluation (we observed a significant BLEU increase of about 0.6 on applying these post-processing techniques); 4) in (Vaswani et al., 2017), reported BLEU scores are calculated using mteval-v13a.pl from Moses, which re-tokenizes its input.

Table 2 shows our results on the WMT'14 En→De task. The Transformer Base model improves over GNMT and ConvS2S by more than 2 BLEU points, while the Big model improves by over 3 BLEU points. RNMT+ further outperforms the Transformer Big model and establishes a new state of the art with an averaged value of 28.49. In this case, RNMT+ converged slightly faster than the Transformer Big model and maintained much more stable performance after convergence with a very small standard deviation, which is similar to what we observed on the En→Fr task.

Model         Test BLEU      Epochs   Training Time
GNMT          24.67          -        -
ConvS2S       25.01 ± 0.17   38       20h
Trans. Base   27.26 ± 0.15   38       17h
Trans. Big    27.94 ± 0.18   26.9     48h
RNMT+         28.49 ± 0.05   24.6     40h

Table 2: Results on WMT'14 En→De. Note that Transformer models are trained using 16 GPUs, while ConvS2S and RNMT+ are trained using 32 GPUs.

Table 3 summarizes training performance and model statistics. The Transformer Base model is the fastest model in terms of training speed. RNMT+ is slower to train than the Transformer Big model on a per-GPU basis. However, since the RNMT+ model is quite stable, we were able to offset the lower per-GPU throughput with higher concurrency by increasing the number of model replicas, and hence the overall time to convergence was not slowed down much. We also computed the number of floating point operations (FLOPs) in the model's forward path as well as the number of total parameters for all architectures (cf. Table 3). RNMT+ requires fewer FLOPs than the Transformer Big model, even though both models have a comparable number of parameters.

Model         Examples/s   FLOPs    Params
ConvS2S       80           15.7B    263.4M
Trans. Base   160          6.2B     93.3M
Trans. Big    50           31.2B    375.4M
RNMT+         30           28.1B    378.9M

Table 3: Performance comparison. Examples/s are normalized by the number of GPUs used in the training job. FLOPs are computed assuming that source and target sequence length are both 50.

5 Ablation Experiments

In this section, we evaluate the importance of four main techniques for both the RNMT+ and the Transformer Big models. We believe that these techniques are universally applicable across different model architectures, and should always be employed by NMT practitioners for best performance.

We take our best RNMT+ and Transformer Big models and remove each one of these techniques independently. By doing this we hope to learn two things about each technique: (1) How much does it affect the model performance? (2) How useful is it for stable training of other techniques and hence the final model?

Model                    RNMT+   Trans. Big
Baseline                 41.00   40.73
- Label Smoothing        40.33   40.49
- Multi-head Attention   40.44   39.83
- Layer Norm.            *       *
- Sync. Training         39.68   *

Table 4: Ablation results of RNMT+ and the Transformer Big model on WMT'14 En→Fr. We report average BLEU scores on the test set. An asterisk '*' indicates an unstable training run (training halts due to non-finite elements).

From Table 4 we draw the following conclusions about the four techniques:

• Label Smoothing: We observed that label smoothing improves both models, leading to an average increase of 0.7 BLEU for RNMT+ and 0.2 BLEU for Transformer Big models.

• Multi-head Attention: Multi-head attention contributes significantly to the quality of both models, resulting in an average increase of 0.6 BLEU for RNMT+ and 0.9 BLEU for Transformer Big models (minimal sketches of label smoothing and multi-head attention follow this list).

• Layer Normalization: Layer normalization is most critical to stabilize the training process of either model, especially when multi-head attention is used. Removing layer normalization results in unstable training runs for both models. Since, by design, we remove one technique at a time in our ablation experiments, we were unable to quantify how much layer normalization helped in either case. To be able to successfully train a model without layer normalization, we would have to adjust other parts of the model and retune its hyper-parameters.

• Synchronous Training: Removing synchronous training has different effects on RNMT+ and Transformer. For RNMT+, it results in a significant quality drop, while for the Transformer Big model, it causes the model to become unstable. We also notice that synchronous training is only successful when coupled with a tailored learning rate schedule that has a warmup stage at the beginning (cf. Eq. 1 for RNMT+ and Eq. 2 for Transformer). For RNMT+, removing this warmup stage during synchronous training causes the model to become unstable.

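For concreteness, here are minimal, self-contained sketches of two of the techniques above: uniform label smoothing (uncertainty 0.1) and multi-head additive attention. The shapes, the smoothing convention (mass spread over the non-target tokens) and the head combination by concatenation are assumptions chosen for illustration, not the exact implementation used in these experiments.

```python
import numpy as np

def label_smoothed_xent(logits, target, uncertainty=0.1):
    """Cross-entropy against a uniformly smoothed target distribution.
    logits: [vocab], target: integer token id."""
    vocab = logits.shape[0]
    m = logits.max()
    log_probs = logits - (m + np.log(np.exp(logits - m).sum()))   # log-softmax
    smoothed = np.full(vocab, uncertainty / (vocab - 1))          # spread the uncertainty
    smoothed[target] = 1.0 - uncertainty                          # keep most mass on target
    return -(smoothed * log_probs).sum()

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multihead_additive_attention(query, enc, head_params):
    """head_params: list of per-head (W_q, W_k, v) tuples.
    Each head scores the source positions with an additive network and the
    per-head context vectors are concatenated."""
    contexts = []
    for W_q, W_k, v in head_params:
        scores = np.tanh(enc @ W_k + query @ W_q) @ v    # [src_len]
        contexts.append(softmax(scores) @ enc)           # per-head context
    return np.concatenate(contexts, axis=-1)
```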
6 Hybrid NMT Models

In this section, we explore hybrid architectures that shed some light on the salient behavior of each model family. These hybrid models outperform the individual architectures on both benchmark datasets and provide a better understanding of the capabilities and limitations of each model family.

6.1 Assessing Individual Encoders and Decoders

In an encoder-decoder architecture, a natural assumption is that the role of an encoder is to build feature representations that can best encode the meaning of the source sequence, while a decoder should be able to process and interpret the representations from the encoder and, at the same time, track the current target history. Decoding is inherently auto-regressive, and keeping track of the state information should therefore be intuitively beneficial for conditional generation.

We set out to study which family of encoders is more suitable to extract rich representations from a given input sequence, and which family of decoders can make the best of such rich representations. We start by combining the encoder and decoder from different model families. Since it takes a significant amount of time for a ConvS2S model to converge, and because the final translation quality was not on par with the other models, we focus on two types of hybrids only: Transformer encoder with RNMT+ decoder and RNMT+ encoder with Transformer decoder.

Encoder      Decoder      En→Fr Test BLEU
Trans. Big   Trans. Big   40.73 ± 0.19
RNMT+        RNMT+        41.00 ± 0.05
Trans. Big   RNMT+        41.12 ± 0.16
RNMT+        Trans. Big   39.92 ± 0.21

Table 5: Results for encoder-decoder hybrids.

From Table 5, it is clear that the Transformer encoder is better at encoding or feature extraction than the RNMT+ encoder, whereas RNMT+ is better at decoding or conditional language modeling, confirming our intuition that a stateful decoder is beneficial for conditional language generation.

6.2 Assessing Encoder Combinations

Next, we explore how the features extracted by an encoder can be further enhanced by incorporating additional information. Specifically, we investigate the combination of transformer layers with RNMT+ layers in the same encoder block to build even richer feature representations. We exclusively use RNMT+ decoders in the following architectures since stateful decoders show better performance according to Table 5.

We study two mixing schemes in the encoder (see Fig. 2):

(1) Cascaded Encoder: The cascaded encoder aims at combining the representational power of RNNs and self-attention. The idea is to enrich a set of stateful representations by cascading a feature extractor with a focus on vertical mapping, similar to (Pascanu et al., 2013; Devlin, 2017). Our best performing cascaded encoder involves fine-tuning transformer layers stacked on top of a pre-trained frozen RNMT+ encoder. Using a pre-trained encoder avoids optimization difficulties while significantly enhancing encoder capacity. As shown in Table 6, the cascaded encoder improves over the Transformer encoder by more than 0.5 BLEU points on the WMT'14 En→Fr task. This suggests that the Transformer encoder is able to extract richer representations if the input is augmented with sequential context.

(2) Multi-Column Encoder: As illustrated in Fig. 2b, a multi-column encoder merges the outputs of several independent encoders into a single combined representation. Unlike a cascaded encoder, the multi-column encoder enables us to investigate whether an RNMT+ decoder can distinguish information received from two different channels and benefit from its combination. A crucial operation in a multi-column encoder is therefore how different sources of information are merged into a unified representation. Our best multi-column encoder performs a simple concatenation of individual column outputs.

The model details and hyperparameters of the above two encoders are described in Appendix A.5 and A.6. As shown in Table 6, the multi-column encoder followed by an RNMT+ decoder achieves better results than the Transformer and the RNMT model on both WMT'14 benchmark tasks.

Figure 2: Vertical and horizontal mixing of Transformer and RNMT+ components in an encoder. (a) Cascaded Encoder; (b) Multi-Column Encoder.

Model        En→Fr BLEU     En→De BLEU
Trans. Big   40.73 ± 0.19   27.94 ± 0.18
RNMT+        41.00 ± 0.05   28.49 ± 0.05
Cascaded     41.67 ± 0.11   28.62 ± 0.06
MultiCol     41.66 ± 0.11   28.84 ± 0.06

Table 6: Results for hybrids with cascaded encoder and multi-column encoder.

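The two mixing schemes of Section 6.2 reduce to simple composition rules over encoder outputs. The sketch below treats every encoder and transformer layer as an opaque callable and shows the cascading of fine-tuned transformer layers on top of a frozen, pre-trained RNMT+ encoder, as well as the concatenation-based multi-column merge. It is a schematic under assumed interfaces, not the configuration described in Appendix A.5/A.6.

```python
import numpy as np

def cascaded_encoder(x, frozen_rnmt_encoder, transformer_layers):
    """Vertical mixing: (fine-tuned) transformer layers stacked on top of a
    pre-trained RNMT+ encoder whose parameters are kept frozen."""
    h = frozen_rnmt_encoder(x)          # pre-trained stage, not updated
    for layer in transformer_layers:    # only these layers are fine-tuned
        h = layer(h)
    return h

def multi_column_encoder(x, columns):
    """Horizontal mixing: run several independent encoders on the same input
    and merge their outputs; the best-performing merge here is concatenation."""
    outputs = [column(x) for column in columns]   # each: [src_len, d_column]
    return np.concatenate(outputs, axis=-1)       # [src_len, sum of d_column]
```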
7 Conclusion

In this work we explored the efficacy of several architectural and training techniques proposed in recent studies on seq2seq models for NMT. We demonstrated that many of these techniques are broadly applicable to multiple model architectures. Applying these new techniques to RNMT models yields RNMT+, an enhanced RNMT model that significantly outperforms the three fundamental architectures on WMT'14 En→Fr and En→De tasks. We further presented several hybrid models developed by combining encoders and decoders from the Transformer and RNMT+ models, and empirically demonstrated the superiority of the Transformer encoder and the RNMT+ decoder in comparison with their counterparts. We then enhanced the encoder architecture by horizontally and vertically mixing components borrowed from these architectures, leading to hybrid architectures that obtain further improvements over RNMT+.

We hope that our work will motivate NMT researchers to further investigate generally applicable training and optimization techniques, and that our exploration of hybrid architectures will open paths for new architecture search efforts for NMT.

Our focus on a standard single-language-pair translation task leaves important open questions to be answered: How do our new architectures compare in multilingual settings, i.e., modeling an interlingua? Which architecture is more efficient and powerful in processing finer-grained inputs and outputs, e.g., characters or bytes? How transferable are the representations learned by the different architectures to other tasks? And what are the characteristic errors that each architecture makes, e.g., linguistic plausibility?

Acknowledgments

We would like to thank the entire Google Brain Team and Google Translate Team for their foundational contributions to this project. We would also like to thank the entire Tensor2Tensor development team for their useful inputs and discussions.

References

Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR abs/1607.06450. http://arxiv.org/abs/1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations. http://arxiv.org/abs/1409.0473.

Y. Bengio, P. Simard, and P. Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. Trans. Neur. Netw. 5(2):157–166.

Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. Massive exploration of neural machine translation architectures. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 1442–1451. https://www.aclweb.org/anthology/D17-1151.

Hugh Chen, Scott Lundberg, and Su-In Lee. 2017. Checkpoint ensembles: Ensemble methods from a single training process. CoRR abs/1710.03282. http://arxiv.org/abs/1710.03282.

Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Józefowicz. 2016. Revisiting distributed synchronous SGD. CoRR abs/1604.00981. http://arxiv.org/abs/1604.00981.

Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing. http://arxiv.org/abs/1406.1078.

Jan Chorowski and Navdeep Jaitly. 2016. Towards better decoding and language model integration in sequence to sequence models. CoRR abs/1612.02695. http://arxiv.org/abs/1612.02695.

Josep Maria Crego, Jungi Kim, Guillaume Klein, Anabel Rebollo, Kathy Yang, Jean Senellart, Egor Akhanov, Patrice Brunelle, Aurelien Coquard, Yongchao Deng, Satoshi Enoue, Chiyo Geiss, Joshua Johanson, Ardas Khalsa, Raoum Khiari, Byeongil Ko, Catherine Kobus, Jean Lorieux, Leidiana Martins, Dang-Chuan Nguyen, Alexandra Priori, Thomas Riccardi, Natalia Segal, Christophe Servan, Cyril Tiquet, Bo Wang, Jin Yang, Dakun Zhang, Jing Zhou, and Peter Zoldan. 2016. Systran's pure neural machine translation systems. CoRR abs/1610.05540. http://arxiv.org/abs/1610.05540.

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language modeling with gated convolutional networks. CoRR abs/1612.08083. http://arxiv.org/abs/1612.08083.

Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. 2012. Large scale distributed deep networks. In NIPS.

Michael Denkowski and Graham Neubig. 2017. Stronger baselines for trustable results in neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation. Association for Computational Linguistics, Vancouver, pages 18–27. http://www.aclweb.org/anthology/W17-3203.

Jacob Devlin. 2017. Sharp models on dull hardware: Fast and accurate neural machine translation decoding on the CPU. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 2820–2825. http://aclweb.org/anthology/D17-1300.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science 14(2):179–211.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. CoRR abs/1705.03122. http://arxiv.org/abs/1705.03122.

Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation 12(10):2451–2471.

Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR abs/1706.02677. http://arxiv.org/abs/1706.02677.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5):602–610. IJCNN 2005.

Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. 2015. Learning to transduce with unbounded memory. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. MIT Press, Cambridge, MA, USA, NIPS'15, pages 1828–1836. http://dl.acm.org/citation.cfm?id=2969442.2969444.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. CoRR abs/1512.03385. http://arxiv.org/abs/1512.03385.

Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen Netzen. Diploma, Technische Universität München 91:1.

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2(5):359–366.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing.

Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, et al. 2017. In-datacenter performance analysis of a tensor processing unit. CoRR abs/1704.04760. http://arxiv.org/abs/1704.04760.

Marcin Junczys-Dowmunt, Tomasz Dwojak, and Rico Sennrich. 2016. The AMU-UEDIN submission to the WMT16 news translation task: Attention-based NMT models as feature functions in phrase-based SMT. arXiv preprint arXiv:1605.04809.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Conference on Empirical Methods in Natural Language Processing.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. CoRR abs/1610.10099. http://arxiv.org/abs/1610.10099.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980. http://arxiv.org/abs/1412.6980.

Yann LeCun and Yoshua Bengio. 1998. The handbook of brain theory and neural networks. MIT Press, Cambridge, MA, USA, chapter Convolutional Networks for Images, Speech, and Time Series, pages 255–258. http://dl.acm.org/citation.cfm?id=303568.303704.

Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In EMNLP.

Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. 2013. How to construct deep recurrent neural networks. CoRR abs/1312.6026. http://arxiv.org/abs/1312.6026.

Tim Salimans and Diederik P. Kingma. 2016. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. CoRR abs/1602.07868. http://arxiv.org/abs/1602.07868.

M. Schuster and K. K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673–2681.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing.

H. T. Siegelmann and E. D. Sontag. 1995. On the computational power of neural nets. Journal of Computer and System Sciences 50(1):132–150.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway networks. CoRR abs/1505.00387.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the inception architecture for computer vision. CoRR abs/1512.00567. http://arxiv.org/abs/1512.00567.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR abs/1706.03762. http://arxiv.org/abs/1706.03762.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144. http://arxiv.org/abs/1609.08144.

Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. CoRR abs/1606.04199. http://arxiv.org/abs/1606.04199.

