The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation
Melvin Johnson Wolfgang Macherey George Foster Llion Jones Niki Parmar
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pages 76–86
Melbourne, Australia, July 15 - 20, 2018. © 2018 Association for Computational Linguistics
and how much can be attributed to the associated training and inference techniques. In some cases, these new techniques may be broadly applicable to other architectures and thus constitute a major, though implicit, contribution of an architecture paper. Clearly, they need to be considered in order to ensure a fair comparison across different model architectures.

In this paper, we therefore take a step back and look at which techniques and methods contribute significantly to the success of recent architectures, namely ConvS2S and Transformer, and explore applying these methods to other architectures, including RNMT models. In doing so, we come up with an enhanced version of RNMT, referred to as RNMT+, that significantly outperforms all individual architectures in our setup. We further introduce new architectures built with different components borrowed from RNMT+, ConvS2S and Transformer. In order to ensure a fair setting for comparison, all architectures were implemented in the same framework, use the same pre-processed data and apply no further post-processing as this may confound bare model performance.

Our contributions are three-fold:

1. In ablation studies, we quantify the effect of several modeling improvements (including multi-head attention and layer normalization) as well as optimization techniques (such as synchronous replica training and label-smoothing), which are used in recent architectures. We demonstrate that these techniques are applicable across different model architectures.

2. Combining these improvements with the RNMT model, we propose the new RNMT+ model, which significantly outperforms all fundamental architectures on the widely-used WMT’14 En→Fr and En→De benchmark datasets. We provide a detailed model analysis and comparison of RNMT+, ConvS2S and Transformer in terms of model quality, model size, and training and inference speed.

3. Inspired by our understanding of the relative strengths and weaknesses of individual model architectures, we propose new model architectures that combine components from the RNMT+ and the Transformer model, and achieve better results than both individual architectures.

We quickly note two prior works that provided empirical solutions to the difficulty of training NMT architectures (specifically RNMT). In (Britz et al., 2017) the authors systematically explore which elements of NMT architectures have a significant impact on translation quality. In (Denkowski and Neubig, 2017) the authors recommend three specific techniques for strengthening NMT systems and empirically demonstrate how incorporating those techniques improves the reliability of the experimental results.

2 Background

In this section, we briefly discuss the commonly used NMT architectures.

2.1 RNN-based NMT Models - RNMT

RNMT models are composed of an encoder RNN and a decoder RNN, coupled with an attention network. The encoder summarizes the input sequence into a set of vectors while the decoder conditions on the encoded input sequence through an attention mechanism, and generates the output sequence one token at a time.

The most successful RNMT models consist of stacked RNN encoders with one or more bidirectional RNNs (Schuster and Paliwal, 1997; Graves and Schmidhuber, 2005), and stacked decoders with unidirectional RNNs. Both encoder and decoder RNNs consist of either LSTM (Hochreiter and Schmidhuber, 1997; Gers et al., 2000) or GRU units (Cho et al., 2014), and make extensive use of residual (He et al., 2015) or highway (Srivastava et al., 2015) connections.

In Google-NMT (GNMT) (Wu et al., 2016), the best performing RNMT model on the datasets we consider, the encoder network consists of one bi-directional LSTM layer, followed by 7 uni-directional LSTM layers. The decoder is equipped with a single attention network and 8 uni-directional LSTM layers. Both the encoder and the decoder use residual skip connections between consecutive layers.

In this paper, we adopt GNMT as the starting point for our proposed RNMT+ architecture.

2.2 Convolutional NMT Models - ConvS2S

In the most successful convolutional sequence-to-sequence model (Gehring et al., 2017), both the encoder and decoder are constructed by stacking multiple convolutional layers, where each layer contains 1-dimensional convolutions followed by a gated linear unit (GLU) (Dauphin et al., 2016). Each decoder layer computes a separate dot-product attention by using the current decoder layer output and the final encoder layer outputs. Positional embeddings are used to provide explicit positional information to the model. Following the practice in (Gehring et al., 2017), we scale the gradients of the encoder layers to stabilize training. We also use residual connections across each convolutional layer and apply weight normalization (Salimans and Kingma, 2016) to speed up convergence. We follow the public ConvS2S codebase¹ in our experiments.

¹ https://github.com/facebookresearch/fairseq-py

2.3 Conditional Transformation-based NMT Models - Transformer

The Transformer model (Vaswani et al., 2017) is motivated by two major design choices that aim to address deficiencies in the former two model families: (1) Unlike RNMT, but similar to the ConvS2S, the Transformer model avoids any sequential dependencies in both the encoder and decoder networks to maximally parallelize training. (2) To address the limited context problem (limited receptive field) present in ConvS2S, the Transformer model makes pervasive use of self-attention networks (Parikh et al., 2016) so that each position in the current layer has access to information from all other positions in the previous layer.

The Transformer model still follows the encoder-decoder paradigm. Encoder transformer layers are built with two sub-modules: (1) a self-attention network and (2) a feed-forward network. Decoder transformer layers have an additional cross-attention layer sandwiched between the self-attention and feed-forward layers to attend to the encoder outputs.

There are two details which we found very important to the model’s performance: (1) Each sub-layer in the transformer (i.e. self-attention, cross-attention, and the feed-forward sub-layer) follows a strict computation sequence: normalize → transform → dropout → residual-add. (2) In addition to per-layer normalization, the final encoder output is again normalized to prevent a blow up after consecutive residual additions.

In this paper, we follow the latest version of the Transformer model in the Tensor2Tensor² codebase.

² https://github.com/tensorflow/tensor2tensor

2.4 A Theory-Based Characterization of NMT Architectures

From a theoretical point of view, RNNs belong to the most expressive members of the neural network family (Siegelmann and Sontag, 1995)³. Possessing an infinite Markovian structure (and thus an infinite receptive field) equips them to model sequential data (Elman, 1990), especially natural language (Grefenstette et al., 2015), effectively. In practice, RNNs are notoriously hard to train (Hochreiter, 1991; Bengio et al., 1994; Hochreiter et al., 2001), confirming the well known dilemma of trainability versus expressivity.

³ Assuming that data complexity is satisfied.

Convolutional layers are adept at capturing local context and local correlations by design. A fixed and narrow receptive field for each convolutional layer limits their capacity when the architecture is shallow. In practice, this weakness is mitigated by stacking more convolutional layers (e.g. 15 layers as in the ConvS2S model), which makes the model harder to train and demands meticulous initialization schemes and carefully designed regularization techniques.

The transformer network is capable of approximating arbitrary squashing functions (Hornik et al., 1989), and can be considered a strong feature extractor with extended receptive fields capable of linking salient features from the entire sequence. On the other hand, lacking a memory component (as present in the RNN models) prevents the network from modeling a state space, reducing its theoretical strength as a sequence model; it therefore requires additional positional information (e.g. sinusoidal positional encodings).

The above theoretical characterizations will drive our explorations in the following sections.

3 Experiment Setup

We train our models on the standard WMT’14 En→Fr and En→De datasets that comprise 36.3M and 4.5M sentence pairs, respectively. Each sentence was encoded into a sequence of sub-word units obtained by first tokenizing the sentence with the Moses tokenizer, then splitting tokens into sub-word units (also known as “wordpieces”) using the approach described in (Schuster and Nakajima, 2012).
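Returning briefly to the sub-layer ordering described in Section 2.3 (normalize → transform → dropout → residual-add, plus a final normalization of the encoder output), the following is a minimal NumPy sketch of that computation sequence. It is an illustration only, not the Tensor2Tensor implementation: the function names are hypothetical, the layer normalization omits the learned gain and bias, and `transform` stands in for any sub-layer (self-attention, cross-attention, or feed-forward).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize the last axis to zero mean and unit variance.
    Simplification: no learned gain or bias parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer(x, transform, dropout_rate=0.0, rng=None):
    """One sub-layer: normalize -> transform -> dropout -> residual-add."""
    h = layer_norm(x)              # normalize
    h = transform(h)               # transform (attention or feed-forward)
    if dropout_rate > 0.0:         # dropout (inverted scaling), training only
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(h.shape) >= dropout_rate
        h = h * keep / (1.0 - dropout_rate)
    return x + h                   # residual-add

def encoder_stack(x, transforms, dropout_rate=0.0, rng=None):
    """Apply the sub-layers in order, then normalize the final output
    once more, as described for the encoder in Section 2.3."""
    for f in transforms:
        x = sublayer(x, f, dropout_rate, rng)
    return layer_norm(x)
```

Because the residual path adds the un-normalized input at every sub-layer, the magnitude of `x` can grow with depth; the last `layer_norm` call is what keeps the encoder output on a fixed scale.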
Figure 1: Model architecture of RNMT+. On the left side, the encoder network has 6 bidirectional LSTM
layers. At the end of each bidirectional layer, the outputs of the forward layer and the backward layer
are concatenated. On the right side, the decoder network has 8 unidirectional LSTM layers, with the first
layer used for obtaining the attention context vector through multi-head additive attention. The attention
context vector is then fed directly into the rest of the decoder layers as well as the softmax layer.
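To make the "multi-head additive attention" named in the caption more concrete, here is a minimal NumPy sketch of one decoding step. It is not the RNMT+ implementation: the parameter names (`Wq`, `Wk`, `v`), their shapes, and the choice to let every head attend over the unprojected encoder outputs are assumptions made purely for illustration. Each head scores encoder position i with v_h · tanh(W_q,h q + W_k,h k_i), and the per-head contexts are concatenated.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_additive_attention(query, keys, Wq, Wk, v):
    """Illustrative multi-head additive attention for one decoder step.

    query: (d_q,)        bottom decoder layer output
    keys:  (T, d_k)      encoder outputs for T source positions
    Wq:    (H, d_q, d_a) per-head query projections (assumed shape)
    Wk:    (H, d_k, d_a) per-head key projections (assumed shape)
    v:     (H, d_a)      per-head scoring vectors (assumed shape)
    Returns the concatenated context (H * d_k,) and the weights (H, T).
    """
    q_proj = np.einsum("q,hqa->ha", query, Wq)    # (H, d_a)
    k_proj = np.einsum("tk,hka->hta", keys, Wk)   # (H, T, d_a)
    hidden = np.tanh(k_proj + q_proj[:, None, :])  # additive combination
    scores = np.einsum("hta,ha->ht", hidden, v)   # (H, T)
    weights = softmax(scores)                     # one distribution per head
    heads = weights @ keys                        # (H, d_k)
    return heads.reshape(-1), weights
```

The returned context vector is what the caption describes as being fed to the remaining decoder layers and to the softmax layer.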
We use a shared vocabulary of 32K sub-word units for each source-target language pair. No further manual or rule-based post processing of the output was performed beyond combining the sub-word units to generate the targets. We report all our results on newstest 2014, which serves as the test set. A combination of newstest 2012 and newstest 2013 is used for validation.

To evaluate the models, we compute the BLEU metric on tokenized, true-case output.⁴ For each training run, we evaluate the model every 30 minutes on the dev set. Once the model converges, we determine the best window based on the average dev-set BLEU score over 21 consecutive evaluations. We report the mean test score and standard deviation over the selected window. This allows us to compare model architectures based on their mean performance after convergence rather than individual checkpoint evaluations, as the latter can be quite noisy for some models.

⁴ This procedure is used in the literature to which we compare (Gehring et al., 2017; Wu et al., 2016).

To enable a fair comparison of architectures, we use the same pre-processing and evaluation methodology for all our experiments. We refrain from using checkpoint averaging (exponential moving averages of parameters) (Junczys-Dowmunt et al., 2016) or checkpoint ensembles (Jean et al., 2015; Chen et al., 2017) to focus on evaluating the performance of individual models.

4 RNMT+

4.1 Model Architecture of RNMT+

The newly proposed RNMT+ model architecture is shown in Figure 1. Here we highlight the key architectural choices that are different between the RNMT+ model and the GNMT model. There are 6 bidirectional LSTM layers in the encoder instead of 1 bidirectional LSTM layer followed by 7 unidirectional layers as in GNMT. For each bidirectional layer, the outputs of the forward layer and the backward layer are concatenated before being fed into the next layer. The decoder network consists of 8 unidirectional LSTM layers similar to the GNMT model. Residual connections are added to the third layer and above for both the encoder and decoder. Inspired by the Transformer model, per-gate layer normalization (Ba et al., 2016) is applied within each LSTM cell. Our empirical results show that layer normalization greatly stabilizes training. No non-linearity is applied to the LSTM output. A projection layer is added to the encoder final output.⁵ Multi-head additive attention is used instead of the single-head attention in the GNMT model. Similar to GNMT, we use the bottom decoder layer and the final encoder layer output after projection for obtaining the recurrent attention context. In addition to feeding the attention context to all decoder LSTM layers, we also feed it to the softmax by concatenating it with the layer input. This is important for both the quality of the models with multi-head attention and the stability of the training process.

⁵ Additional projection aims to reduce the dimensionality of the encoder output representations to match the decoder stack dimension.

Since the encoder network in RNMT+ consists solely of bi-directional LSTM layers, model parallelism is not used during training. We compensate for the resulting longer per-step time with increased data parallelism (more model replicas), so that the overall time to reach convergence of the RNMT+ model is still comparable to that of GNMT.

We apply the following regularization techniques during training.

• Dropout: We apply dropout to both embedding layers and each LSTM layer output before it is added to the next layer’s input. Attention dropout is also applied.

• Label Smoothing: We use uniform label smoothing with an uncertainty = 0.1 (Szegedy et al., 2015). Label smoothing was shown to have a positive impact on both Transformer and RNMT+ models, especially in the case of RNMT+ with multi-head attention. Similar to the observations in (Chorowski and Jaitly, 2016), we found it beneficial to use a larger beam size (e.g. 16, 20, etc.) during decoding when models are trained with label smoothing.

• Weight Decay: For the WMT’14 En→De task, we apply L2 regularization to the weights with λ = 10⁻⁵. Weight decay is only applied to the En→De task as the corpus is smaller and thus more regularization is required.

We use the Adam optimizer (Kingma and Ba, 2014) with β1 = 0.9, β2 = 0.999, ε = 10⁻⁶ and vary the learning rate according to this schedule:

\[
lr = 10^{-4} \cdot \min\left( 1 + \frac{t \cdot (n - 1)}{n p},\; n,\; n \cdot (2n)^{\frac{s - n t}{e - s}} \right) \tag{1}
\]

Here, t is the current step, n is the number of concurrent model replicas used in training, p is the number of warmup steps, s is the start step of the exponential decay, and e is the end step of the decay. Specifically, we first increase the learning rate linearly during the number of warmup steps, keep it constant until the decay start step s, then exponentially decay until the decay end step e, and keep it at 5 · 10⁻⁵ after the decay ends. This learning rate schedule is motivated by a similar schedule that was successfully applied in training the Resnet-50 model with a very large batch size (Goyal et al., 2017).

In contrast to the asynchronous training used for GNMT (Dean et al., 2012), we train RNMT+ models with synchronous training (Chen et al., 2016). Our empirical results suggest that when hyper-parameters are tuned properly, synchronous training often leads to improved convergence speed and superior model quality.

To further stabilize training, we also use adaptive gradient clipping. We discard a training step completely if an anomaly in the gradient norm value is detected, which is usually an indication of an imminent gradient explosion. More specifically, we keep track of a moving average and a moving standard deviation of the log of the gradient norm values, and we abort a step if the norm of the gradient exceeds the moving average by more than four standard deviations.

4.2 Model Analysis and Comparison

In this section, we compare the results of RNMT+ with ConvS2S and Transformer.

All models were trained with synchronous training. RNMT+ and ConvS2S were trained with 32 NVIDIA P100 GPUs while the Transformer Base and Big models were trained using 16 GPUs.

For RNMT+, we use sentence-level cross-entropy loss. Each training batch contained 4096 sentence pairs (4096 source sequences and 4096 target sequences). For ConvS2S and Transformer models, we use token-level cross-entropy loss. Each training batch contained 65536 source tokens and 65536 target tokens. For the GNMT baselines on both tasks, we cite the largest BLEU score reported in (Wu et al., 2016) without reinforcement learning.

Table 1 shows our results on the WMT’14 En→Fr task. Both the Transformer Big model and RNMT+ outperform GNMT and ConvS2S by about 2 BLEU points. RNMT+ is slightly better than the Transformer Big model in terms of its mean BLEU score. RNMT+ also yields a much lower standard deviation, and hence we observed much less fluctuation in the training curve. It takes approximately 3 days for the Transformer Base model to converge, while both RNMT+ and the Transformer Big model require about 5 days to converge. Although the batching schemes are quite different between the Transformer Big and the RNMT+ model, they have processed about the same amount of training samples upon convergence.

Model         Test BLEU       Epochs   Training Time
GNMT          24.67           -        -
ConvS2S       25.01 ± 0.17    38       20h
Trans. Base   27.26 ± 0.15    38       17h
Trans. Big    27.94 ± 0.18    26.9     48h
RNMT+         28.49 ± 0.05    24.6     40h
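The "mean ± standard deviation" entries reported in the tables follow the window procedure of Section 3: pick the 21 consecutive evaluations with the highest average dev-set BLEU, then report the mean and standard deviation of the test BLEU over that same window. A minimal sketch of that selection is shown below. The function name and list-based inputs are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

def select_window(dev_bleu, test_bleu, window=21):
    """Pick the best window of consecutive evaluations by mean dev BLEU.

    dev_bleu, test_bleu: per-evaluation scores, aligned by checkpoint.
    Returns (start_index, mean_test_bleu, std_test_bleu).
    Assumption: sample standard deviation (statistics.stdev).
    """
    if len(dev_bleu) != len(test_bleu):
        raise ValueError("dev and test scores must be aligned")
    if window < 2 or len(dev_bleu) < window:
        raise ValueError("need at least `window` >= 2 evaluations")
    # Comparing window sums is equivalent to comparing window means.
    best_start = max(
        range(len(dev_bleu) - window + 1),
        key=lambda i: sum(dev_bleu[i:i + window]),
    )
    chunk = test_bleu[best_start:best_start + window]
    return best_start, mean(chunk), stdev(chunk)
```

Selecting the window on dev scores and reporting test scores keeps the test set out of the model-selection loop while still averaging out checkpoint-to-checkpoint noise.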
it affect the model performance? (2) How useful is it for stable training of other techniques and hence the final model?

Eq. 2 for Transformer). For RNMT+, removing this warmup stage during synchronous training causes the model to become unstable.
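For reference, the RNMT+ learning-rate schedule with its warmup stage (Eq. 1 in Section 4.1) can be written as a short function. This is a sketch under stated assumptions: the function and argument names are hypothetical, and the explicit floor at 5 · 10⁻⁵ is added to match the textual description ("keep it at 5 · 10⁻⁵ after the decay ends"), since the printed formula alone would keep decaying. Evaluating the formula as printed, the decay factor equals n / (2n) = 1/2 when n · t = e, which gives the stated final value.

```python
def rnmt_plus_lr(t, n, p, s, e, base=1e-4, floor=5e-5):
    """Learning rate at step t, following Eq. 1.

    t: current step            n: number of model replicas
    p: warmup steps            s: decay start step
    e: decay end step
    base and floor follow the values quoted in the text.
    """
    if n < 1 or p <= 0 or e <= s:
        raise ValueError("require n >= 1, p > 0 and e > s")
    warmup = 1.0 + t * (n - 1) / (n * p)              # linear increase
    decay = n * (2.0 * n) ** ((s - n * t) / (e - s))  # exponential decay
    # Floor added per the text: hold at 5e-5 once the decay has ended.
    return max(base * min(warmup, n, decay), floor)
```

With purely illustrative values n = 4, p = 10, s = 1000, e = 2000, the rate starts at 10⁻⁴, plateaus at 4 · 10⁻⁴, and is back down to 5 · 10⁻⁵ at step 500 (where n · t = e).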
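The other stabilization device from Section 4.1, discarding a step when the gradient norm looks anomalous, can be sketched in the same spirit. The class below is an illustrative interpretation, not the authors' code: it tracks exponential moving statistics of the log gradient norm and flags a step when the current log norm exceeds the moving average by more than four moving standard deviations. The decay constant, the `min_steps` grace period, and the decision not to update the statistics on a skipped step are all assumptions.

```python
import math

class GradNormMonitor:
    """Flags training steps whose gradient norm is anomalously large."""

    def __init__(self, decay=0.99, num_std=4.0, min_steps=10):
        self.decay = decay          # assumed smoothing constant
        self.num_std = num_std      # four standard deviations, per the text
        self.min_steps = min_steps  # assumed grace period before skipping
        self.mean = 0.0             # moving average of log(grad_norm)
        self.var = 0.0              # moving variance of log(grad_norm)
        self.count = 0              # number of accepted steps seen

    def should_skip(self, grad_norm):
        """Return True if the step should be discarded."""
        if not math.isfinite(grad_norm):
            return True
        x = math.log(max(grad_norm, 1e-12))
        if self.count == 0:
            self.mean, self.count = x, 1
            return False
        threshold = self.mean + self.num_std * math.sqrt(self.var)
        if self.count >= self.min_steps and x > threshold:
            return True  # anomaly: statistics are left untouched
        # Exponentially weighted update of mean and variance.
        delta = x - self.mean
        alpha = 1.0 - self.decay
        self.mean += alpha * delta
        self.var = self.decay * (self.var + alpha * delta * delta)
        self.count += 1
        return False
```

In a training loop one would call `should_skip` with the global gradient norm before applying the optimizer update and simply move on to the next batch whenever it returns True.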
coder is beneficial for conditional language generation.

6.2 Assessing Encoder Combinations

Next, we explore how the features extracted by an encoder can be further enhanced by incorporating additional information. Specifically, we investigate the combination of transformer layers with RNMT+ layers in the same encoder block to build even richer feature representations. We exclusively use RNMT+ decoders in the following architectures since stateful decoders show better performance according to Table 5.

We study two mixing schemes in the encoder (see Fig. 2):

(1) Cascaded Encoder: The cascaded encoder aims at combining the representational power of RNNs and self-attention. The idea is to enrich a set of stateful representations by cascading a feature extractor with a focus on vertical mapping, similar to (Pascanu et al., 2013; Devlin, 2017). Our best performing cascaded encoder involves fine tuning transformer layers stacked on top of a pre-trained frozen RNMT+ encoder. Using a pre-trained encoder avoids optimization difficulties while significantly enhancing encoder capacity. As shown in Table 6, the cascaded encoder improves over the Transformer encoder by more than 0.5 BLEU points on the WMT’14 En→Fr task. This suggests that the Transformer encoder is able to extract richer representations if the input is augmented with sequential context.

(2) Multi-Column Encoder: As illustrated in Fig. 2b, a multi-column encoder merges the outputs of several independent encoders into a single combined representation. Unlike a cascaded encoder, the multi-column encoder enables us to investigate whether an RNMT+ decoder can distinguish information received from two different channels and benefit from its combination. A crucial operation in a multi-column encoder is therefore how different sources of information are merged into a unified representation. Our best multi-column encoder performs a simple concatenation of individual column outputs.

The model details and hyperparameters of the above two encoders are described in Appendix A.5 and A.6. As shown in Table 6, the multi-column encoder followed by an RNMT+ decoder achieves better results than the Transformer and the RNMT model on both WMT’14 benchmark tasks.

Model        En→Fr BLEU      En→De BLEU
Trans. Big   40.73 ± 0.19    27.94 ± 0.18
RNMT+        41.00 ± 0.05    28.49 ± 0.05
Cascaded     41.67 ± 0.11    28.62 ± 0.06
MultiCol     41.66 ± 0.11    28.84 ± 0.06

Table 6: Results for hybrids with cascaded encoder and multi-column encoder.

(a) Cascaded Encoder (b) Multi-Column Encoder

Figure 2: Vertical and horizontal mixing of Transformer and RNMT+ components in an encoder.

7 Conclusion

In this work we explored the efficacy of several architectural and training techniques proposed in recent studies on seq2seq models for NMT. We demonstrated that many of these techniques are broadly applicable to multiple model architectures. Applying these new techniques to RNMT models yields RNMT+, an enhanced RNMT model that significantly outperforms the three fundamental architectures on WMT’14 En→Fr and En→De tasks. We further presented several hybrid models developed by combining encoders and decoders from the Transformer and RNMT+ models, and empirically demonstrated the superiority of the Transformer encoder and the RNMT+ decoder in comparison with their counterparts. We then enhanced the encoder architecture by horizontally and vertically mixing components borrowed from these architectures, leading to hybrid architectures that obtain further improvements over RNMT+.

We hope that our work will motivate NMT researchers to further investigate generally applicable training and optimization techniques, and that our exploration of hybrid architectures will open paths for new architecture search efforts for NMT.
Our focus on a standard single-language-pair translation task leaves important open questions to be answered: How do our new architectures compare in multilingual settings, i.e., modeling an interlingua? Which architecture is more efficient and powerful in processing finer grained inputs and outputs, e.g., characters or bytes? How transferable are the representations learned by the different architectures to other tasks? And what are the characteristic errors that each architecture makes, e.g., linguistic plausibility?

Acknowledgments

We would like to thank the entire Google Brain Team and Google Translate Team for their foundational contributions to this project. We would also like to thank the entire Tensor2Tensor development team for their useful inputs and discussions.

References

Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR abs/1607.06450. http://arxiv.org/abs/1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations. http://arxiv.org/abs/1409.0473.

Y. Bengio, P. Simard, and P. Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. Trans. Neur. Netw. 5(2):157–166.

Denny Britz, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. Massive exploration of neural machine translation architectures. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, pages 1442–1451. https://www.aclweb.org/anthology/D17-1151.

Hugh Chen, Scott Lundberg, and Su-In Lee. 2017. Checkpoint ensembles: Ensemble methods from a single training process. CoRR abs/1710.03282. http://arxiv.org/abs/1710.03282.

Jianmin Chen, Rajat Monga, Samy Bengio, and Rafal Józefowicz. 2016. Revisiting distributed synchronous SGD. CoRR abs/1604.00981. http://arxiv.org/abs/1604.00981.

Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing. http://arxiv.org/abs/1406.1078.

Jan Chorowski and Navdeep Jaitly. 2016. Towards better decoding and language model integration in sequence to sequence models. CoRR abs/1612.02695. http://arxiv.org/abs/1612.02695.

Josep Maria Crego, Jungi Kim, Guillaume Klein, Anabel Rebollo, Kathy Yang, Jean Senellart, Egor Akhanov, Patrice Brunelle, Aurelien Coquard, Yongchao Deng, Satoshi Enoue, Chiyo Geiss, Joshua Johanson, Ardas Khalsa, Raoum Khiari, Byeongil Ko, Catherine Kobus, Jean Lorieux, Leidiana Martins, Dang-Chuan Nguyen, Alexandra Priori, Thomas Riccardi, Natalia Segal, Christophe Servan, Cyril Tiquet, Bo Wang, Jin Yang, Dakun Zhang, Jing Zhou, and Peter Zoldan. 2016. Systran’s pure neural machine translation systems. CoRR abs/1610.05540. http://arxiv.org/abs/1610.05540.

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language modeling with gated convolutional networks. CoRR abs/1612.08083. http://arxiv.org/abs/1612.08083.

Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. 2012. Large scale distributed deep networks. In NIPS.

Michael Denkowski and Graham Neubig. 2017. Stronger baselines for trustable results in neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation. Association for Computational Linguistics, Vancouver, pages 18–27. http://www.aclweb.org/anthology/W17-3203.

Jacob Devlin. 2017. Sharp models on dull hardware: Fast and accurate neural machine translation decoding on the cpu. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 2820–2825. http://aclweb.org/anthology/D17-1300.

Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science 14(2):179–211.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. CoRR abs/1705.03122. http://arxiv.org/abs/1705.03122.

Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. 2000. Learning to forget: Continual prediction with LSTM. Neural computation 12(10):2451–2471.

Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR abs/1706.02677. http://arxiv.org/abs/1706.02677.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks 18(5):602–610. IJCNN 2005.

Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. 2015. Learning to transduce with unbounded memory. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. MIT Press, Cambridge, MA, USA, NIPS’15, pages 1828–1836. http://dl.acm.org/citation.cfm?id=2969442.2969444.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition. CoRR abs/1512.03385. http://arxiv.org/abs/1512.03385.

Sepp Hochreiter. 1991. Untersuchungen zu dynamischen neuronalen netzen. Diploma, Technische Universität München 91:1.

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2(5):359–366.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing.

Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, et al. 2017. In-datacenter performance analysis of a tensor processing unit. CoRR abs/1704.04760. http://arxiv.org/abs/1704.04760.

Marcin Junczys-Dowmunt, Tomasz Dwojak, and Rico Sennrich. 2016. The amu-uedin submission to the wmt16 news translation task: Attention-based nmt models as feature functions in phrase-based smt. arXiv preprint arXiv:1605.04809.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Conference on Empirical Methods in Natural Language Processing.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. CoRR abs/1610.10099. http://arxiv.org/abs/1610.10099.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980. http://arxiv.org/abs/1412.6980.

Yann LeCun and Yoshua Bengio. 1998. The handbook of brain theory and neural networks. MIT Press, Cambridge, MA, USA, chapter Convolutional Networks for Images, Speech, and Time Series, pages 255–258. http://dl.acm.org/citation.cfm?id=303568.303704.

Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. In EMNLP.

Razvan Pascanu, Çaglar Gülçehre, Kyunghyun Cho, and Yoshua Bengio. 2013. How to construct deep recurrent neural networks. CoRR abs/1312.6026. http://arxiv.org/abs/1312.6026.

Tim Salimans and Diederik P. Kingma. 2016. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. CoRR abs/1602.07868. http://arxiv.org/abs/1602.07868.

M. Schuster and K.K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673–2681.

Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing.

H.T. Siegelmann and E.D. Sontag. 1995. On the computational power of neural nets. Journal of Computer and System Sciences 50(1):132–150.

Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway networks. CoRR abs/1505.00387.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. pages 3104–3112.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. Rethinking the inception architecture for computer vision. CoRR abs/1512.00567. http://arxiv.org/abs/1512.00567.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR abs/1706.03762. http://arxiv.org/abs/1706.03762.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144. http://arxiv.org/abs/1609.08144.

Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. CoRR abs/1606.04199. http://arxiv.org/abs/1606.04199.