Improving Neural Machine Translation Models With Monolingual Data
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 86–96,
Berlin, Germany, August 7-12, 2016. © 2016 Association for Computational Linguistics
…training set.

• we investigate two different methods to fill the source side of monolingual training instances: using a dummy source sentence, and using a source sentence obtained via back-translation, which we call synthetic. We find that the latter is more effective.

• we successfully adapt NMT models to a new domain by fine-tuning with either monolingual or parallel in-domain data.

2 Neural Machine Translation

We follow the neural machine translation architecture by Bahdanau et al. (2015), which we will briefly summarize here. However, we note that our approach is not specific to this architecture.

The neural machine translation system is implemented as an encoder-decoder network with recurrent neural networks.

The encoder is a bidirectional neural network with gated recurrent units (Cho et al., 2014) that reads an input sequence x = (x_1, ..., x_m) and calculates a forward sequence of hidden states (→h_1, ..., →h_m), and a backward sequence (←h_1, ..., ←h_m). The hidden states →h_j and ←h_j are concatenated to obtain the annotation vector h_j.

The decoder is a recurrent neural network that predicts a target sequence y = (y_1, ..., y_n). Each word y_i is predicted based on a recurrent hidden state s_i, the previously predicted word y_{i−1}, and a context vector c_i. c_i is computed as a weighted sum of the annotations h_j. The weight of each annotation h_j is computed through an alignment model α_ij, which models the probability that y_i is aligned to x_j. The alignment model is a single-layer feedforward neural network that is learned jointly with the rest of the network through backpropagation.

A detailed description can be found in (Bahdanau et al., 2015). Training is performed on a parallel corpus with stochastic gradient descent. For translation, a beam search with a small beam size is employed.

3 NMT Training with Monolingual Training Data

Monolingual data in the target language serves to improve the estimate of the prior probability p(T) of the target sentence T, before taking the source sentence S into account. In contrast to Gülçehre et al. (2015), who train separate language models on monolingual training data and incorporate them into the neural network through shallow or deep fusion, we propose techniques to train the main NMT model with monolingual data, exploiting the fact that encoder-decoder neural networks already condition the probability distribution of the next target word on the previous target words. We describe two strategies to do this: providing monolingual training examples with an empty (or dummy) source sentence, or providing monolingual training examples with a synthetic source sentence that is obtained by automatically translating the target sentence into the source language, which we will refer to as back-translation.

3.1 Dummy Source Sentences

The first technique we employ is to treat monolingual training examples as parallel examples with an empty source side, essentially adding training examples whose context vector c_i is uninformative, and for which the network has to rely fully on the previous target words for its prediction. This can be conceived of as a form of dropout (Hinton et al., 2012), with the difference that the training instances whose context vector is dropped out constitute novel training data. We can also conceive of this setup as multi-task learning, the two tasks being translation when the source is known, and language modelling when it is unknown.

During training, we use both parallel and monolingual training examples in a 1-to-1 ratio, and randomly shuffle them. We define an epoch as one iteration through the parallel data set, and resample from the monolingual data set for every epoch. We pair monolingual sentences with the single-word dummy source side <null> to allow processing of both parallel and monolingual training examples with the same network graph. For monolingual minibatches, we freeze the network parameters of the encoder and the attention model.

One problem with this integration of monolingual data is that we cannot arbitrarily increase the ratio of monolingual training instances, or fine-tune a model with only monolingual training data, because different output layer parameters are optimal for the two tasks, and the network ‘unlearns’ its conditioning on the source context if the ratio of monolingual training instances is too high.

dataset          sentences
WMTparallel      4 200 000
WITparallel        200 000
WMTmono_de     160 000 000
WMTsynth_de      3 600 000
WMTmono_en     118 000 000
WMTsynth_en      4 200 000

Table 1: English↔German training data.
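The context-vector computation of Section 2 can be made concrete with a small numpy sketch. This is our illustration, not the authors' code: the name `attention_context` is ours, and for brevity the alignment model is reduced to a bilinear score, whereas the paper uses a single-layer feedforward network.

```python
import numpy as np

def attention_context(s_prev, annotations, W_a):
    """Compute the context vector c_i as a weighted sum of the encoder
    annotations h_j, with weights alpha_ij given by a softmax over
    alignment scores. A bilinear score replaces the paper's single-layer
    feedforward alignment model, purely for illustration."""
    # alignment scores e_ij between the previous decoder state and each h_j
    scores = annotations @ (W_a @ s_prev)      # shape: (m,)
    # softmax -> attention weights alpha_ij, which sum to 1
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # context vector c_i: weighted sum of the annotations
    c_i = alpha @ annotations
    return c_i, alpha
```

Here `annotations` has one row per source position (the concatenated forward and backward hidden states), so the returned weights can be read as a soft alignment of target word i to the source positions.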
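The 1-to-1 mixing scheme of Section 3.1 can be sketched in a few lines of Python. This is a minimal sketch (the helper name `build_epoch` is ours); freezing the encoder and attention parameters for monolingual minibatches is framework-specific and only noted in a comment.

```python
import random

NULL_TOKEN = "<null>"  # single-word dummy source side

def build_epoch(parallel, monolingual, seed=0):
    """Build one training epoch as in Sec. 3.1: one full pass over the
    parallel data, plus an equal number of monolingual target sentences,
    resampled anew for every epoch and paired with a dummy source,
    shuffled together in a 1-to-1 ratio."""
    rng = random.Random(seed)
    # resample monolingual targets for this epoch; pair them with <null>
    mono_sample = [(NULL_TOKEN, t) for t in rng.choices(monolingual, k=len(parallel))]
    # for minibatches drawn from mono_sample, encoder and attention
    # parameters would be frozen (not shown here)
    epoch = list(parallel) + mono_sample
    rng.shuffle(epoch)
    return epoch
```

A dummy-source pair goes through the same network graph as a genuine parallel pair, which is exactly why the single-word `<null>` source is used rather than a truly empty one.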
…vocabulary size is 90 000.

We also perform experiments on the IWSLT 15 test sets to investigate a cross-domain setting.⁵ The test sets consist of TED talk transcripts. As in-domain training data, IWSLT provides the WIT3 parallel corpus (Cettolo et al., 2012), which also consists of TED talks.

4.1.2 Turkish→English

We use data provided for the IWSLT 14 machine translation track (Cettolo et al., 2014), namely the WIT3 parallel corpus (Cettolo et al., 2012), which consists of TED talks, and the SETimes corpus (Tyers and Alperen, 2010).⁶ After removal of sentence pairs which contain empty lines or lines with a length ratio above 9, we retain 320 000 sentence pairs of training data. For the experiments with monolingual training data, we use the English LDC Gigaword corpus (Fifth Edition). The amount of training data is shown in Table 2. With only 320 000 sentences of parallel data available for training, this is a much lower-resourced translation setting than English↔German.

dataset            sentences
WIT                  160 000
SETimes              160 000
Gigawordmono     177 000 000
Gigawordsynth      3 200 000

Table 2: Turkish→English training data.

Gülçehre et al. (2015) segment the Turkish text with the morphology tool Zemberek, followed by a disambiguation of the morphological analysis (Sak et al., 2007), and removal of non-surface tokens produced by the analysis. We use the same preprocessing.⁷ For both Turkish and English, we represent rare words (or morphemes in the case of Turkish) as character bigram sequences (Sennrich et al., 2016). The 20 000 most frequent words (morphemes) are left unsegmented. The networks have a vocabulary size of 23 000 symbols.

To obtain a synthetic parallel training set, we back-translate a random sample of 3 200 000 sentences from Gigaword. We use an English→Turkish NMT system trained with the same settings as the Turkish→English baseline system.

We found overfitting to be a bigger problem than with the larger English↔German data set, and follow Gülçehre et al. (2015) in using Gaussian noise (stddev 0.01) (Graves, 2011) and dropout on the output layer (p=0.5) (Hinton et al., 2012). We also use early stopping, based on BLEU measured every three hours on tst2010, which we treat as a development set. For Turkish→English, we use gradient clipping with threshold 5, following Gülçehre et al. (2015), in contrast to the threshold of 1 that we use for English↔German, following Jean et al. (2015a).

4.2 Results

4.2.1 English→German WMT 15

Table 3 shows English→German results with WMT training and test data. We find that mixing parallel training data with monolingual data with a dummy source side in a ratio of 1-1 improves quality by 0.4–0.5 BLEU for the single system, and by 1 BLEU for the ensemble. We train the system for twice as long as the baseline to provide the training algorithm with a similar amount of parallel training instances. To ensure that the quality improvement is due to the monolingual training instances, and not just increased training time, we also continued training our baseline system for another week, but saw no improvements in BLEU.

Including synthetic data during training is very effective, and yields an improvement over our baseline of 2.8–3.4 BLEU. Our best ensemble system also outperforms a syntax-based baseline (Sennrich and Haddow, 2015) by 1.2–2.1 BLEU. We also substantially outperform the NMT results reported by Jean et al. (2015a) and Luong et al. (2015), who previously reported SOTA results.⁸ We note that the difference is particularly large for single systems, since our ensemble is not as diverse as that of Luong et al. (2015), who used 8 independently trained ensemble components, whereas we sampled 4 ensemble components from the same training run.

4.2.2 English→German IWSLT 15

Table 4 shows English→German results on IWSLT test sets. IWSLT test sets consist of TED talks, and are thus very dissimilar from the WMT …

⁵ http://workshop2015.iwslt.org/
⁶ http://workshop2014.iwslt.org/
⁷ github.com/orhanf/zemberekMorphTR
⁸ Luong et al. (2015) report 20.9 BLEU (tokenized) on newstest2014 with a single model, and 23.0 BLEU with an ensemble of 8 models. Our best single system achieves a tokenized BLEU (as opposed to the untokenized scores reported in Table 3) of 23.8, and our ensemble reaches 25.0 BLEU.
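The construction of the synthetic parallel set can be sketched as follows. The helper `make_synthetic_pairs` is ours, with the reverse-direction NMT system (e.g. the English→Turkish model above) abstracted into a callable.

```python
def make_synthetic_pairs(mono_targets, back_translate):
    """Build synthetic parallel data by back-translation: each monolingual
    target sentence is machine-translated into the source language with a
    reverse-direction model, and the machine-translated source is paired
    with the genuine, human-produced target sentence."""
    return [(back_translate(t), t) for t in mono_targets]
```

The key property, discussed in Section 5, is that the target side remains human-produced; only the input is synthetic, which makes training comparatively robust to noise in the automatic translation.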
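The gradient clipping used in training might be sketched as follows. The text gives only the thresholds (1 for English↔German, 5 for Turkish→English) and does not say whether clipping is applied per parameter or to the global norm; global-norm clipping is assumed here, and the function name is ours.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale a list of gradient arrays when their global L2 norm
    exceeds the threshold, leaving them untouched otherwise.
    (Global-norm clipping is an assumption; the paper only states
    the threshold values.)"""
    norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if norm > threshold:
        scale = threshold / norm
        grads = [g * scale for g in grads]
    return grads, norm
```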
                                                        BLEU
name            training instances                  newstest2014        newstest2015
                                                    single   ens-4      single   ens-4
syntax-based (Sennrich and Haddow, 2015)            22.6     -          24.4     -
Neural MT (Jean et al., 2015b)                      -        -          22.4     -
parallel        37m (parallel)                      19.9     20.4       22.8     23.6
+monolingual    49m (parallel) / 49m (monolingual)  20.4     21.4       23.2     24.6
+synthetic      44m (parallel) / 36m (synthetic)    22.7     23.8       25.7     26.5

Table 3: English→German translation performance (BLEU) on WMT training/test sets. Ens-4: ensemble of 4 models. Number of training instances varies due to differences in training time and speed.

Table 4: English→German translation performance (BLEU) on IWSLT test sets (TED talks). Single models. [Table body not recoverable from this extraction.]
4.2.3 German→English WMT 15

Results for German→English on the WMT 15 data sets are shown in Table 5. As in the reverse translation direction, we see substantial improvements (3.6–3.7 BLEU) from adding monolingual training data with synthetic source sentences, which is substantially bigger than the improvement observed with deep fusion (Gülçehre et al., 2015); our ensemble outperforms the previous state of the art on newstest2015 by 2.3 BLEU.

4.2.4 Turkish→English IWSLT 14

Table 6 shows results for Turkish→English. On average, we see an improvement of 0.6 BLEU on the test sets from adding monolingual data with a dummy source side in a 1-1 ratio,¹⁰ although we note a high variance between different test sets. With synthetic training data (Gigawordsynth), we outperform the baseline by 2.7 BLEU on average, and also outperform the results obtained via shallow or deep fusion by Gülçehre et al. (2015) by 0.5 BLEU on average. To assess to what extent synthetic data has a regularization effect, even without novel training data, we also back-translate the target side of the parallel training text to obtain the training corpus parallelsynth. Mixing the original parallel corpus with parallelsynth (ratio 1-1) gives some improvement over the baseline (1.7 BLEU on average), but the novel monolingual training data (Gigawordmono) gives higher improvements, despite being out-of-domain in relation to the test sets. We speculate that novel in-domain monolingual data would lead to even higher improvements.

4.2.5 Back-translation Quality for Synthetic Data

One question that our previous experiments leave open is how the quality of the automatic back-translation affects training with synthetic data. To investigate this question, we back-translate the same German monolingual corpus with three different German→English systems:

• with our baseline system and greedy decoding

• with our baseline system and beam search (beam size 12). This is the same system used for the experiments in Table 3.

• with the German→English system that was itself trained with synthetic data (beam size 12).

BLEU scores of the German→English systems, and of the resulting English→German systems that are trained on the different back-translations, are shown in Table 7. The quality of the German→English back-translation differs substantially, with a difference of 6 BLEU on newstest2015. Regarding the English→German systems trained on the different synthetic corpora, we find that the 6 BLEU difference in back-translation quality leads to a 0.6–0.7 BLEU difference in translation quality. This is balanced by the fact that we can increase the speed of back-translation by trading off some quality, for instance by reducing beam size, and we leave it to future research to explore how much the amount of synthetic data affects translation quality.

                        DE→EN      EN→DE
back-translation        2015       2014    2015
none                    -          20.4    23.6
parallel (greedy)       22.3       23.2    26.0
parallel (beam 12)      25.0       23.8    26.5
synthetic (beam 12)     28.3       23.9    26.6
ensemble of 3           -          24.2    27.0
ensemble of 12          -          24.7    27.6

Table 7: English→German translation performance (BLEU) on WMT training/test sets (newstest2014; newstest2015). Systems differ in how the synthetic training data is obtained. Ensembles of 4 models (unless specified otherwise).

We also show results for an ensemble of 3 models (the best single model of each training run), and of 12 models (all 4 models of each training run). Thanks to the increased diversity of the ensemble components, these ensembles outperform the ensembles of 4 models that were all sampled from the same training run, and we obtain another improvement of 0.8–1.0 BLEU.

4.3 Contrast to Phrase-based SMT

The back-translation of monolingual target data into the source language to produce synthetic parallel text has been previously explored for phrase-based SMT (Bertoldi and Federico, 2009; Lambert et al., 2011). While our approach is technically similar, synthetic parallel data fulfills novel roles …

¹⁰ We also experimented with higher ratios of monolingual data, but this led to decreased BLEU scores.
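Ensemble decoding as used in these experiments can be sketched per decoding step. The combination rule is not specified in the text; uniform arithmetic averaging of the models' next-word distributions, a common choice, is assumed here, and the function name is ours.

```python
import numpy as np

def ensemble_step_probs(model_probs):
    """One decoding step of an NMT ensemble: combine the next-word
    probability distributions predicted by the individual models by
    uniform arithmetic averaging (an assumed combination rule)."""
    stacked = np.stack(model_probs)   # (n_models, vocab_size)
    avg = stacked.mean(axis=0)        # uniform average over models
    return avg / avg.sum()            # renormalize against rounding error
```

During beam search, this averaged distribution replaces the single model's softmax output at every step, so more diverse ensemble components (e.g. independently trained models) can disagree more and correct each other's errors.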
name            training data              instances    tst2011   tst2012   tst2013   tst2014
baseline (Gülçehre et al., 2015)                        18.4      18.8      19.9      18.7
deep fusion (Gülçehre et al., 2015)                     20.2      20.2      21.3      20.6
baseline        parallel                   7.2m         18.6      18.2      18.4      18.3
parallelsynth   parallel/parallelsynth     6m/6m        19.9      20.4      20.1      20.0
Gigawordmono    parallel/Gigawordmono      7.6m/7.6m    18.8      19.6      19.4      18.2
Gigawordsynth   parallel/Gigawordsynth     8.4m/8.4m    21.2      21.1      21.8      20.4

Table 6: Turkish→English translation performance (tokenized BLEU) on IWSLT test sets (TED talks). Single models. Number of training instances varies due to early stopping.
[Figure 1: cross-entropy as a function of training time for the Turkish→English models; recovered legend entries: Gigawordmono (train), Gigawordsynth (dev), Gigawordsynth (train); y-axis: cross-entropy. Caption not recoverable from this extraction.]

PBSMT gain    +0.7    +0.1
NMT gain      +2.9    +1.2

[Table 8: header row and caption not recoverable from this extraction.]

4.4 Analysis

We previously indicated that overfitting is a concern with our baseline system, especially on small data sets of several hundred thousand training sentences, despite the regularization employed. This overfitting is illustrated in Figure 1, which plots training and development set cross-entropy by training time for Turkish→English models. For comparability, we measure training set cross-entropy for all models on the same random sample of the parallel training set. We can see that the model trained on only parallel training data quickly overfits, while all three monolingual data sets (parallelsynth, Gigawordmono, or Gigawordsynth) delay overfitting, and give better perplexity on the development set. The best development set cross-entropy is reached by Gigawordsynth.

Figure 2 shows cross-entropy for English→German, comparing the system trained on only parallel data and the system that includes synthetic training data. Since more training data is available for English→German, there is no indication that overfitting happens during the first 40 million training instances (or 7 days of training); while both systems obtain comparable training set cross-entropies, the system with synthetic data reaches a lower cross-entropy on the development set. One explanation for this is the domain effect discussed in the previous section.

[Figure 2: English→German training and development set (newstest2013) cross-entropy as a function of training time (number of training instances) for different systems. Recovered legend entries: WMTparallel (train), WMTparallel (dev), WMTsynth (train), WMTsynth (dev); axes: cross-entropy vs. training time (training instances ·10⁶).]

A central theoretical expectation is that monolingual target-side data improves the model’s fluency, its ability to produce natural target-language sentences. As a proxy for sentence-level fluency, we investigate word-level fluency, specifically words produced as sequences of subword units, and whether NMT systems trained with additional monolingual data produce more natural words. For instance, the English→German systems translate the English phrase civil rights protections as a single compound, composed of three subword units: Bürger|rechts|schutzes,¹² and we analyze how many of the multi-unit words that the translation systems produce are well-formed German words.

We compare the number of words in the system output for the newstest2015 test set which are produced via subword units, and which do not occur in the parallel training corpus. We also count how many of them are attested in the full monolingual corpus or the reference translation, all of which we consider ‘natural’. Additionally, the main author, a native speaker of German, annotated a random subset (n = 100) of unattested words of each system according to their naturalness,¹³ distinguishing between natural German words (or names) such as Literatur|klassen ‘literature classes’, and nonsensical ones such as *As|best|atten (a misspelling of Asbestmatten ‘asbestos mats’).

system        produced   attested   natural
parallel      1078       53.4%      74.9%
+mono         994        61.6%      84.6%
+synthetic    1217       56.4%      82.5%

Table 9: Number of words in system output that do not occur in the parallel training data (countref = 1168), and the proportion that is attested in data, or natural according to a native speaker. English→German; newstest2015; ensemble systems.

In the results (Table 9), we see that the systems trained with additional monolingual or synthetic data have a higher proportion of novel words attested in the non-parallel data, and a higher proportion that is deemed natural by our annotator. This supports our expectation that additional monolingual data improves the (word-level) fluency of the NMT system.

¹² Subword boundaries are marked with ‘|’.
¹³ For the annotation, the words were blinded regarding the system that produced them.
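The word-level counting behind Table 9 can be approximated with a short script. This is our sketch: the ‘|’ subword-boundary notation is taken from footnote 12, but the helper name and the exact counting procedure are assumptions.

```python
def count_novel_words(output_words, parallel_vocab, attested_vocab):
    """Word-level fluency proxy (Sec. 4.4): among output words built from
    multiple subword units ('|' marks boundaries), keep those whose joined
    form never occurs in the parallel training data, and count how many of
    these novel words are attested in the monolingual corpus or the
    reference translation."""
    novel = [w.replace("|", "") for w in output_words
             if "|" in w and w.replace("|", "") not in parallel_vocab]
    attested = sum(1 for w in novel if w in attested_vocab)
    return len(novel), attested
```

In the paper's setting, `output_words` would be the newstest2015 system output, `parallel_vocab` the vocabulary of the parallel training corpus, and `attested_vocab` the union of the monolingual corpus and the reference vocabulary.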
5 Related Work

To our knowledge, the integration of monolingual data for pure neural machine translation architectures was first investigated by Gülçehre et al. (2015), who train monolingual language models independently, and then integrate them during decoding through rescoring of the beam (shallow fusion), or by adding the recurrent hidden state of the language model to the decoder state of the encoder-decoder network, with an additional controller mechanism that controls the magnitude of the LM signal (deep fusion). In deep fusion, the controller parameters and output parameters are tuned on further parallel training data, but the language model parameters are fixed during the fine-tuning stage. Jean et al. (2015b) also report on experiments with reranking of NMT output with a 5-gram language model, but improvements are small (between 0.1–0.5 BLEU).

The production of synthetic parallel texts bears resemblance to data augmentation techniques used in computer vision, where datasets are often augmented with rotated, scaled, or otherwise distorted variants of the (limited) training set (Rowley et al., 1996).

Another similar avenue of research is self-training (McClosky et al., 2006; Schwenk, 2008). The main difference is that self-training typically refers to a scenario where the training set is enhanced with training instances with artificially produced output labels, whereas we start with a human-produced output (i.e. the translation), and artificially produce an input. We expect that this is more robust towards noise in the automatic translation. Improving NMT with monolingual source data, following similar work on phrase-based SMT (Schwenk, 2008), remains possible future work.

Domain adaptation of neural networks via continued training has been shown to be effective for neural language models by Ter-Sarkisov et al. (2015), and, in work parallel to ours, for neural translation models (Luong and Manning, 2015). We are the first to show that we can effectively adapt neural translation models with monolingual data.

6 Conclusion

In this paper, we propose two simple methods to use monolingual training data during the training of NMT systems, with no changes to the network architecture. Providing training examples with dummy source context was successful to some extent, but we achieve substantial gains in all tasks, and new SOTA results, via back-translation of monolingual target data into the source language, and treating this synthetic data as additional training data. We also show that small amounts of in-domain monolingual data, back-translated into the source language, can be effectively used for domain adaptation. In our analysis, we identified domain adaptation effects, a reduction of overfitting, and improved fluency as reasons for the effectiveness of using monolingual data for training.

While our experiments did make use of monolingual training data, we only used a small random sample of the available data, especially for the experiments with synthetic parallel data. It is conceivable that larger synthetic data sets, or data sets obtained via data selection, will provide bigger performance benefits.

Because we do not change the neural network architecture to integrate monolingual training data, our approach can be easily applied to other NMT systems. We expect that the effectiveness of our approach not only varies with the quality of the MT system used for back-translation, but also depends on the amount (and similarity to the test set) of available parallel and monolingual data, and the extent of overfitting of the baseline model. Future work will explore the effectiveness of our approach in more settings.

Acknowledgments

The research presented in this publication was conducted in cooperation with Samsung Electronics Polska sp. z o.o. - Samsung R&D Institute Poland. This project received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement 645452 (QT21).

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Nicola Bertoldi and Marcello Federico. 2009. Domain adaptation for statistical machine translation with monolingual resources. In Proceedings of the Fourth Workshop on Statistical Machine Translation,
StatMT 09. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 Workshop on Statistical Machine Translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal. Association for Computational Linguistics.

P.F. Brown, S.A. Della Pietra, V.J. Della Pietra, F. Jelinek, J.D. Lafferty, R.L. Mercer, and P.S. Roossin. 1990. A Statistical Approach to Machine Translation. Computational Linguistics, 16(2):79–85.

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web Inventory of Transcribed and Translated Talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268, Trento, Italy.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2014. Report on the 11th IWSLT Evaluation Campaign, IWSLT 2014. In Proceedings of the 11th Workshop on Spoken Language Translation, pages 2–16, Lake Tahoe, CA, USA.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.

Alex Graves. 2011. Practical Variational Inference for Neural Networks. In J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2348–2356. Curran Associates, Inc.

Çaglar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On Using Monolingual Corpora in Neural Machine Translation. CoRR, abs/1503.03535.

Barry Haddow, Matthias Huck, Alexandra Birch, Nikolay Bogoychev, and Philipp Koehn. 2015. The Edinburgh/JHU Phrase-based Machine Translation Systems for WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 126–133, Lisbon, Portugal. Association for Computational Linguistics.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015a. On Using Very Large Target Vocabulary for Neural Machine Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Beijing, China. Association for Computational Linguistics.

Sébastien Jean, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015b. Montreal Neural Machine Translation Systems for WMT’15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 134–140, Lisbon, Portugal. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the ACL-2007 Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Patrik Lambert, Holger Schwenk, Christophe Servan, and Sadaf Abdul-Rauf. 2011. Investigations on Translation Model Adaptation Using Monolingual Data. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 284–293, Edinburgh, Scotland. Association for Computational Linguistics.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford Neural Machine Translation Systems for Spoken Language Domains. In Proceedings of the International Workshop on Spoken Language Translation 2015, Da Nang, Vietnam.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective Self-training for Parsing. In Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL ’06, pages 152–159, New York. Association for Computational Linguistics.

Henry Rowley, Shumeet Baluja, and Takeo Kanade. 1996. Neural Network-Based Face Detection. In Computer Vision and Pattern Recognition ’96.
Haşim Sak, Tunga Güngör, and Murat Saraçlar. 2007.
Morphological Disambiguation of Turkish Text with
Perceptron Algorithm. In CICLing 2007, pages
107–118.
Holger Schwenk. 2008. Investigations on Large-Scale
Lightly-Supervised Training for Statistical Machine
Translation. In International Workshop on Spoken
Language Translation, pages 182–189.
Rico Sennrich and Barry Haddow. 2015. A Joint
Dependency Model of Morphological and Syntac-
tic Structure for Statistical Machine Translation. In
Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing, pages
2081–2087, Lisbon, Portugal. Association for Com-
putational Linguistics.
Rico Sennrich, Barry Haddow, and Alexandra Birch.
2016. Neural Machine Translation of Rare Words
with Subword Units. In Proceedings of the 54th An-
nual Meeting of the Association for Computational
Linguistics (ACL 2016), Berlin, Germany.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014.
Sequence to Sequence Learning with Neural Net-
works. In Advances in Neural Information Process-
ing Systems 27: Annual Conference on Neural Infor-
mation Processing Systems 2014, pages 3104–3112,
Montreal, Quebec, Canada.