CUNI Submission for Low-Resource Languages in WMT News 2019

Tom Kocmi Ondřej Bojar

Charles University, Faculty of Mathematics and Physics


Institute of Formal and Applied Linguistics
Malostranské náměstí 25, 118 00 Prague, Czech Republic
<surname>@ufal.mff.cuni.cz

Abstract

This paper describes the CUNI submission to the WMT 2019 News Translation Shared Task for the low-resource languages: Gujarati-English and Kazakh-English. We participated in both language pairs in both translation directions. Our system combines transfer learning from a different high-resource language pair followed by training on backtranslated monolingual data. Thanks to the simultaneous training in both directions, we can iterate the backtranslation process. We use the Transformer model in a constrained submission.

1 Introduction

Recently, the rapid development of Neural Machine Translation (NMT) systems has led to a situation where the translation quality of NMT approaches human translation for high-resource language pairs like Chinese-English (Hassan et al., 2018) or English-Czech (Bojar et al., 2018). However, NMT systems tend to be very data-hungry, and with limited data they can be surpassed by phrase-based MT (Koehn and Knowles, 2017). Recent research focus has thus been on low-resource NMT, where the goal is to improve the performance on a language pair that has only limited parallel data available.

In this paper, we describe our approach to low-resource NMT. We use the standard Transformer-big model (Vaswani et al., 2017) and apply two techniques to improve the performance on the low-resource language, namely transfer learning (Kocmi and Bojar, 2018) and iterative backtranslation (Hoang et al., 2018).

A model trained solely on the authentic parallel data of the low-resource language pair has poor performance, and using it directly for the backtranslation of monolingual data leads to poor translations as well. Hence transfer learning is a great tool to first improve the performance of the NMT system that is later used for backtranslating the monolingual data.

This paper is organized as follows. First, we describe the techniques of transfer learning and backtranslation, followed by an overview of the used datasets and the NMT model architecture. Next, we present our experiments, final submissions, and a follow-up analysis of the synthetic training data usage. The paper is concluded in Section 5.

2 Background

In this section, we first describe the techniques of transfer learning and iterative backtranslation, followed by our training procedure that combines both approaches.

2.1 Transfer Learning

Kocmi and Bojar (2018) presented a trivial method of transfer learning that uses a high-resource language pair to train a "parent" model. After convergence, the parent training data are replaced with the training data of the low-resource "child" language pair, and the training continues as if the replacement had not happened: without changing any parameters and without resetting the optimizer moments or the learning rate.

This technique of fine-tuning the model parameters is often used for domain adaptation within the same language pair. When it is used across different language pairs, a problem of vocabulary mismatch emerges. Kocmi and Bojar (2018) overcome this problem by preparing a shared vocabulary for all languages of both the parent and the child language pair in advance. Their approach is to prepare a mixed vocabulary from the training corpora of both language pairs and to generate a wordpiece vocabulary (Vaswani et al., 2017) from it.
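To make the parent-child switch concrete, the following minimal sketch (PyTorch-style, not our actual tensor2tensor setup; the model, data and loss are toy placeholders) shows the only thing that changes at the switch: the training data. The model weights, the optimizer moments and the learning-rate schedule are carried over untouched.

import torch
from torch import nn, optim

# Toy stand-ins for the real Transformer and data pipeline; any
# model/optimizer pair behaves the same way with respect to the switch.
model = nn.Linear(512, 512)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

def batches(corpus_name, steps):
    # Placeholder generator; in reality it yields tokenized parallel
    # sentence pairs read from the named corpus.
    for _ in range(steps):
        x = torch.randn(32, 512)
        yield x, x

def train_on(corpus_name, steps):
    for src, tgt in batches(corpus_name, steps):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(src), tgt)
        loss.backward()
        optimizer.step()

# 1) Train the parent model on the high-resource pair until convergence.
train_on("czech-english", steps=100)

# 2) Transfer learning: only the training data are swapped; the optimizer
#    and its accumulated moments are reused exactly as they are.
train_on("gujarati-english", steps=100)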
Corpus   Language pair   Sentence pairs   Words (1st lang.)   Words (English)
Commoncrawl Russian-English 878k 17.4M 18.8M
News Commentary Russian-English 235k 5.0M 5.4M
UN corpus Russian-English 11.4M 273.2M 294.4M
Yandex Russian-English 1000k 18.7M 21.3M
CzEng 1.7 Czech-English 57.4M 546.2M 621.9M
Crawl Kazakh-English 97.7k 1.0M 1.3M
News commentary Kazakh-English 9.6k 174.1k 213.2k
Wiki titles Kazakh-English 112.7k 174.9k 204.5k
Bible Gujarati-English 7.8k 198.6k 177.1k
Dictionary Gujarati-English 19.3k 19.3k 28.8k
Govincrawl Gujarati-English 10.7k 121.2k 150.6k
Software Gujarati-English 107.6k 691.5k 681.3k
Wiki texts Gujarati-English 18.0k 317.9k 320.4k
Wiki titles Gujarati-English 9.2k 16.6k 17.6k

Table 1: The parallel training corpora used to train our models with counts of the total number of sentences as
well as the number of words (segmented on space). More details on the individual corpora can be obtained at
http://statmt.org/wmt19/.

We use a balanced vocabulary approach, which combines an equal amount of parallel data from both training corpora, low-resource and high-resource, undersampling the high-resource language pair as needed. Hence the low-resource language subwords are represented in the vocabulary with roughly the same prominence as those of the high-resource language pair.
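The balanced vocabulary can be illustrated by the sketch below, which undersamples the high-resource corpus to the size of the low-resource one before training a single shared subword model. SentencePiece is used here only as a stand-in for the wordpiece vocabulary of tensor2tensor, and the file names are hypothetical.

import random
import sentencepiece as spm

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Hypothetical inputs: each file contains both sides of one parallel corpus.
child = read_lines("gujarati-english.both-sides.txt")   # low-resource pair
parent = read_lines("czech-english.both-sides.txt")     # high-resource pair

# Undersample the high-resource data so that both pairs contribute
# roughly the same number of sentences to the vocabulary.
random.seed(1)
parent_sample = random.sample(parent, min(len(parent), len(child)))

with open("vocab-input.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(child + parent_sample))

# One shared subword vocabulary, used by both the parent and the child model.
spm.SentencePieceTrainer.train(
    input="vocab-input.txt",
    model_prefix="shared_subwords",
    vocab_size=32000,
    character_coverage=1.0,   # keep all scripts: Latin, Cyrillic, Gujarati
)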
As Kocmi and Bojar (2018) showed, the language pairs do not have to be linguistically related; the most important criterion is the amount of parent parallel data. For this reason, we selected Czech-English as the parent language pair for Gujarati-English and Russian-English as the parent for Kazakh-English. Russian was selected due to its use of the Cyrillic script and the availability of high-resource parallel data. All language pairs share English. We prepare the Gujarati-English and Kazakh-English systems separately from each other.
2.2 Backtranslation

The amount of available monolingual data typically exceeds the amount of available parallel data. The standard technique for using monolingual data in NMT is called backtranslation (Sennrich et al., 2016). It uses a second model, trained in the reverse direction, to translate monolingual data into the source language of the first model.

The backtranslated data are aligned with their monolingual sentences to create a synthetic parallel corpus. The standard practice is to mix the authentic parallel corpora with the synthetic ones, although it is not the only possible approach: Popel (2018) obtained better results by repeatedly alternating between training on the authentic and on the synthetic portion of the parallel data instead of mixing them.

This new corpus is used to train the first model, with the backtranslated data as the source and the monolingual data as the target side of the model.

Hoang et al. (2018) showed that backtranslation can be iterated: with the second round of backtranslation, we improve the performance of both models. However, the third round of backtranslation did not yield better results.

The performance of the backtranslation model is essential. Especially in the low-resource scenario, the baseline models trained only on the authentic parallel data have a poor score (2.0 BLEU for English→Gujarati). As a result, they generate backtranslated data of very low quality. We have improved the baseline with transfer learning to improve the performance and to generate synthetic data of better quality.
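The data flow of the iteration can be summarized by the following schematic sketch; train() and translate() are trivial stubs standing in for full NMT training and decoding runs, and the toy corpora are placeholders.

def train(model, parallel_data):
    # Stub: fine-tune `model` on `parallel_data` and return the new model.
    return {"name": model["name"], "last_data": parallel_data}

def translate(model, sentence):
    # Stub: decode `sentence` with `model`.
    return "<translation by %s of: %s>" % (model["name"], sentence)

def make_synthetic(reverse_model, mono_target):
    # Backtranslate target-side monolingual sentences into the source
    # language and pair them, giving a synthetic parallel corpus.
    return [(translate(reverse_model, s), s) for s in mono_target]

# Placeholder data for a pair X-EN.
authentic_en2x = [("an english sentence", "a sentence in language X")]
authentic_x2en = [(t, s) for s, t in authentic_en2x]
mono_english = ["monolingual english text"]
mono_x = ["monolingual text in language X"]

en2x = train({"name": "en2x"}, authentic_en2x)   # after transfer learning
x2en = train({"name": "x2en"}, authentic_x2en)   # after transfer learning

for _ in range(2):   # two rounds for Gujarati-English, one for Kazakh-English
    # Improve X->EN: its reverse model EN->X backtranslates English monolingual data.
    x2en = train(x2en, make_synthetic(en2x, mono_english))
    # Improve EN->X: the freshly improved X->EN backtranslates X monolingual data.
    en2x = train(en2x, make_synthetic(x2en, mono_x))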
2.3 Training Procedure

We train two models in parallel, one for each translation direction. Our training procedure is as follows. We train four parent models on the high-resource language pairs until convergence: English→Czech, Czech→English, English→Russian and Russian→English. We stop training a model if there is no improvement bigger than 0.1 BLEU in the last 20% of the training time.

When training the parent model Czech→English, we apply a hyperparameter search (see Section 4.1) on the transfer learning to Gujarati→English. It gives us a better performing model on this language pair; however, we use the updated parameters for all child models.

Afterwards, we apply transfer learning on the authentic dataset of the corresponding low-resource language pair. We preserve the English side; thus Czech→English serves as the parent to Gujarati→English, and English→Czech to English→Gujarati. The same strategy is used for the transfer from Russian to Kazakh.
After transfer learning, we select one of the translation directions to translate the monolingual data. As the starting systems for the backtranslation process, we selected English→Gujarati and Kazakh→English. The decision for Kazakh-English was motivated by choosing the better performing model, see Table 3 below. This is, however, only a rough estimate, because higher BLEU scores across various language pairs do not always indicate better performance; properties of the target language, such as its morphological richness, affect the absolute value of the score. For Gujarati-English, we decided to start with the model with English on the source side, in contrast to Kazakh→English.

After the backtranslation, we mix the synthetic data with the authentic parallel data. We then keep repeating this process: use the improved system to backtranslate the data, and use these data to build an even better system in the reverse direction.

We make two rounds of backtranslation in both directions for Gujarati-English and only one round of backtranslation for Kazakh-English, due to the time consumption of the NMT translation process.

At the end, we take the model with the highest BLEU score on the devset and average it with the seven previous checkpoints to create the final model.
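Checkpoint averaging of this kind can be sketched as follows (PyTorch-style, with hypothetical checkpoint paths; it is not the averaging utility we actually used):

import torch

def average_checkpoints(paths):
    # Element-wise average of the parameters stored in several checkpoints.
    # Assumes every file holds a state_dict with identical keys and shapes.
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# Hypothetical paths: the best checkpoint plus the seven preceding ones,
# saved roughly 1.5 hours of training time apart.
paths = ["checkpoints/model-%d.pt" % i for i in range(8)]
torch.save(average_checkpoints(paths), "checkpoints/model-averaged.pt")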
3 Datasets and Model

In this section, we describe the datasets used to train our final models. All our models were trained only on the data allowed for the WMT 2019 News shared task; hence our submission is constrained.

All used training data are presented in Table 1. We used all parallel corpora allowed and accessible by WMT 2019, except for the Czech-English language pair, where we used only CzEng 1.7. We have not cleaned any of the parallel corpora, except for deduplication and removing pairs with the same source and target translations from the Wiki Titles dataset.

As the development set, we used the official WMT test sets from the year 2013 for Czech-English and Russian-English. For Gujarati-English, we used the official 2019 development set. Finally, for Kazakh-English, the organizers did not provide any development set; therefore, we separated the first 2000 sentence pairs from the News Commentary training set and used them as our development set.

The monolingual data used for the backtranslation are shown in Table 2. We use all available monolingual data for Gujarati and Kazakh. We did not use all available English monolingual data, because the backtranslation process is time-consuming; therefore we use only the 2018 News Crawl.

Corpus            Lang.   Sentences   Words
News crawl 2018   EN      15.4M       344.3M
Common Crawl      KK      12.5M       189.2M
News commentary   KK      13.0k       218.7k
News crawl        KK      772.9k      10.3M
Common Crawl      GU      3.7M        67.3M
News crawl        GU      244.9k      3.3M
Emille            GU      273.2k      11.4M

Table 2: Statistics of all monolingual data used for the backtranslation: the number of sentences in each corpus and the number of words segmented on space. We mixed together all corpora for each language separately.

The available monolingual corpora are usually of high quality. However, we noticed that Common Crawl contains many sentences in a different language and also long paragraphs which are not useful for sentence-level translation.

Therefore, we ran the language identification tool by Lui and Baldwin (2012) on the Common Crawl corpus and dropped all sentences automatically annotated as a language other than Gujarati or Kazakh, respectively. We broke suspected paragraphs into individual sentences by splitting them on all full stops whenever the segment was longer than 100 words.
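A possible implementation of this cleaning step, using the langid.py tool cited above; the 100-word threshold comes from the text, while the splitting on full stops is simplified and the file name is hypothetical:

import langid   # language identification tool of Lui and Baldwin (2012)

def clean_monolingual(lines, lang):
    # Keep only sentences identified as `lang`; break over-long segments
    # on full stops before the check.
    kept = []
    for line in lines:
        segments = [line]
        if len(line.split()) > 100:
            segments = [s.strip() + "." for s in line.split(".") if s.strip()]
        for seg in segments:
            predicted, _score = langid.classify(seg)
            if predicted == lang:
                kept.append(seg)
    return kept

# Hypothetical usage on the Gujarati part of Common Crawl.
with open("commoncrawl.gu.txt", encoding="utf-8") as f:
    gujarati_clean = clean_monolingual([l.strip() for l in f], "gu")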
3.1 Model

The Transformer model seems superior to other NMT approaches, as documented on several language pairs in the manual evaluation of WMT18 (Bojar et al., 2018).

We use version 1.11 of the sequence-to-sequence implementation of the Transformer called tensor2tensor (https://github.com/tensorflow/tensor2tensor). We use the Transformer "big single GPU" configuration as described by Vaswani et al. (2017): the model translates through an encoder-decoder architecture, with each layer involving an attention network followed by a feed-forward network. The architecture is much faster than other NMT architectures due to the absence of recurrent layers.

Popel and Bojar (2018) documented best practices for improving the performance of the model. Based on their observations, we use the Adafactor optimizer with inverse square root decay. Based on our previous experiments (Kocmi et al., 2018), we set the maximum number of subwords in a sentence to 100, which drops less than 0.1 percent of the training sentences. The benefit is that the batch size can be increased to 4500 for our GPUs. The experiments are trained on a single NVidia GeForce 1080 Ti GPU.
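The subword-length filter amounts to a simple preprocessing pass; the sketch below reuses the shared subword model from the vocabulary example in Section 2.1 (SentencePiece as a stand-in for the wordpiece encoder, and the parallel data are placeholders):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="shared_subwords.model")

def keep_pair(src, tgt, max_subwords=100):
    # Drop the pair if either side exceeds the subword limit; in our data
    # this removes less than 0.1 percent of the training sentences.
    return (len(sp.encode(src, out_type=str)) <= max_subwords and
            len(sp.encode(tgt, out_type=str)) <= max_subwords)

corpus = [("a short source sentence", "a short target sentence")]   # placeholder
filtered = [(s, t) for s, t in corpus if keep_pair(s, t)]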
[Figure 1 (plot): uncased BLEU on the development set (y-axis, roughly 0-25) against training steps in thousands (x-axis, 2000-3000); curves: En-Gu and Gu-En after Transfer Learning, Synth 1 and Synth 2.]

Figure 1: Learning curves for both directions of the Gujarati-English models. The BLEU score is uncased and computed on the development set.

4 Experiments

In this section, we describe our experiments, starting with the hyperparameter search, our training procedure, and supporting experiments.

All reported results are calculated on the test set of WMT 2019 and evaluated with case-sensitive SacreBLEU (Post, 2018; signature BLEU + case.mixed + numrefs.1 + smooth.exp + tok.13a + version.1.2.12), if not specified otherwise. Statistical significance is tested by paired bootstrap resampling (1000 samples, confidence level 0.05; Koehn, 2004).
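The evaluation can be reproduced with the sacreBLEU Python package; the paired bootstrap below is a simplified sketch of the Koehn (2004) test, and the system outputs are toy placeholders.

import random
import sacrebleu

def paired_bootstrap(sys_a, sys_b, refs, samples=1000, seed=1):
    # Fraction of resampled test sets on which system A beats system B.
    rng = random.Random(seed)
    n, wins = len(refs), 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]
        bleu_a = sacrebleu.corpus_bleu([sys_a[i] for i in idx],
                                       [[refs[i] for i in idx]]).score
        bleu_b = sacrebleu.corpus_bleu([sys_b[i] for i in idx],
                                       [[refs[i] for i in idx]]).score
        wins += bleu_a > bleu_b
    return wins / samples

# Toy outputs of two systems and the reference translations.
sys_a = ["a small example translation"]
sys_b = ["a small example"]
refs = ["a small example translation"]
print(sacrebleu.corpus_bleu(sys_a, [refs]).score)        # cased corpus BLEU
print(paired_bootstrap(sys_a, sys_b, refs, samples=100))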
4.1 Hyperparameter Search

Before the first step of transfer learning, we performed a hyperparameter search on Gujarati→English over the set of parameters that are not fixed by the parent (unlike, e.g., the dimensions of matrices or the structure of layers). We examined the following hyperparameters: learning rate, dropout, layer prepostprocess dropout, label smoothing, and attention dropout.

The performance before the hyperparameter search was 9.8 BLEU for Gujarati→English (computed on the devset by averaging the last 8 models, saved one and a half hours of training time apart); this score was improved to 11.0 BLEU. Based on the hyperparameter search, we set both the layer prepostprocess dropout and the label smoothing to 0.2 in the Transformer-big setup.

These improvements show that transfer learning is not strictly tied to the parent setup and that some parameters can be further optimized. It must, however, be noted that we experimented only with a small subset of all hyperparameters, and it is possible that other parameters could also be changed without damaging the parent model.

In this paper, we reuse these parameters for all experiments (except for the parent models). A hyperparameter search on each language pair, or even for each dataset switch separately, might have led to better results, but it was beyond the scope of this paper.
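A search of this kind can be driven by a simple loop; train_and_evaluate is a hypothetical callback standing for one full transfer-learning run scored on the development set, and the value grids are purely illustrative (only the selected values 0.2/0.2 come from our experiments).

import itertools

# Illustrative grids; only layer_prepostprocess_dropout=0.2 and
# label_smoothing=0.2 are values actually selected in our experiments.
GRID = {
    "learning_rate": [0.1, 0.2, 0.4],
    "dropout": [0.1, 0.2],
    "layer_prepostprocess_dropout": [0.1, 0.2, 0.3],
    "label_smoothing": [0.1, 0.2],
    "attention_dropout": [0.0, 0.1],
}

def grid_search(train_and_evaluate):
    # train_and_evaluate(config) -> dev-set BLEU of one transfer-learning run.
    best_bleu, best_config = float("-inf"), None
    for values in itertools.product(*GRID.values()):
        config = dict(zip(GRID.keys(), values))
        bleu = train_and_evaluate(config)
        if bleu > best_bleu:
            best_bleu, best_config = bleu, config
    return best_bleu, best_config

# Dummy objective used only to make the sketch self-contained.
print(grid_search(lambda cfg: -abs(cfg["label_smoothing"] - 0.2)))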
4.2 Problems with Backtranslation

The synthetic data have a quality similar to that of the model by which they were produced. Since the low-resource scenario has an overall low quality, we observed that the synthetic data contain many errors:

• Repeated sequences of words ("model spasm"): The State Department has made no reference in statements, statements, statements, statements ...

• Sentences in Czech or Russian, most probably due to the parent model.

• Source sentences only copied, untranslated.
Training dataset                 EN→GU      GU→EN      EN→KK     KK→EN
Authentic (baseline)              2.0        1.8        0.5       4.2
Parent dataset                    0.7        0.1        0.7       0.6
Authentic (transfer learning)     9.1 (1)    9.2        6.2      14.4 (1)
Synth generated by model (1)       -        14.2 (2)    8.3 (2)    -
Synth generated by model (2)     13.4 (3)     -          -       17.3
Synth generated by model (3)       -        16.2 (4)     -         -
Synth generated by model (4)     13.7         -          -         -
Averaging + beam 8               14.3       17.4        8.7      18.5

Table 3: Test set BLEU scores of our setup. Except for the baseline, each column shows the improvements obtained after fine-tuning a single model on different datasets, beginning with the score of the trained parent model. The numbers in parentheses mark the models whose output is used as synthetic training data in the corresponding "Synth generated by model" rows.

To avoid these problems, we cleaned all synthetic data in the following way. We dropped all sentences that contained any repetitive sequence of words. Then we checked the sentences with the language identification tool (Lui and Baldwin, 2012) and dropped all sentences automatically annotated as a wrong language. The second step also filtered out some remaining gibberish translations.
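The repetition filter can be as simple as the sketch below; the window size is our illustrative assumption (the criterion in the text is any repeated sequence of words), and the language check reuses langid.py as in Section 3.

import langid

def has_repeated_sequence(sentence, max_ngram=4):
    # True if any sequence of 1..max_ngram words occurs twice in a row,
    # e.g. "statements, statements, statements".
    words = sentence.split()
    for n in range(1, max_ngram + 1):
        for i in range(len(words) - 2 * n + 1):
            if words[i:i + n] == words[i + n:i + 2 * n]:
                return True
    return False

def clean_synthetic(pairs, source_lang):
    kept = []
    for src, tgt in pairs:
        if has_repeated_sequence(src):
            continue
        if langid.classify(src)[0] != source_lang:   # wrong language or copied input
            continue
        kept.append((src, tgt))
    return kept

print(has_repeated_sequence("no reference in statements, statements, statements"))   # True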
We did not use beam search during the backtranslation of monolingual data, in order to speed up the translation process roughly 20 times compared to a beam search of size 8.
4.3 Final Models

We trained our parent models as described in Section 2.3 for two million steps. One exception from the described approach is that we used a subset of 2M monolingual English sentences for the first round of backtranslation by the English→Gujarati model, to cut down on the total time requirements.

Figure 1 above shows the progress of training the Gujarati-English models in both directions. The learning curves start at step 2M, visualizing first the parent model. We can notice that after each change of the parallel data, there is a substantial improvement in performance. The learning curve is plotted on the development data; the corresponding scores for the test sets are in Table 3.

The baseline models in Table 3 are trained on the authentic data only, and it seems that the amount of parallel data is not sufficient to train the NMT model for the investigated language pairs. The remaining rows show incremental improvements as we perform the various training stages. The last stage of model averaging takes the best performing model and averages it with the previous seven checkpoints that are one and a half hours of training time apart from each other.

We see that transfer learning can be combined with iterated backtranslation on a low-resource language to obtain an absolute improvement of 12.3 BLEU compared to the baseline in Gujarati→English and 15.6 in English→Gujarati.

For the final submission, we selected models at the following steps: step 2.99M for English→Gujarati, step 3.03M for Gujarati→English, step 2.48M for English→Kazakh and step 2.47M for Kazakh→English.

4.4 Ratio of Parallel Data

Poncelas et al. (2018) showed that the balance between the synthetic and authentic data matters and that there should always be a part of authentic parallel data. We started our experiments with this intuition. However, the low-resource scenario complicates the setup, since the amount of authentic data is several times smaller than the synthetic. In order to balance the authentic and synthetic parallel data, we duplicated the authentic data several times. We noticed that the performance did not change compared to the setup relying on synthetic data only.

Thus we prepared an experiment with a second round of backtranslation on Gujarati→English, varying the ratio of authentic and synthetic parallel data. For this experiment, we mix the authentic parallel corpus of 173k sentences with 3.6M randomly selected sentences from the synthetic corpus, which is equal to 20 times the size of the authentic data. We present the ratio between the authentic and synthetic corpora as the number of copies of the authentic data. The synthetic corpus is never oversampled; we only duplicate the authentic corpus. For example, the ratio "auth:synth 10:1" means that the authentic corpus has been multiplied ten times (1.7M sentences overall) and the synthetic corpus was used once (3.6M sentences). Based on the sizes of the available data, the mix "auth:synth 20:1" contains the same number of sentences from the authentic and the synthetic corpora.
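The mixing itself only duplicates the authentic corpus and concatenates it with the synthetic one, as in the sketch below (the corpora are placeholders; in the experiment the authentic part has 173k pairs and the synthetic part 3.6M pairs).

def mix(authentic, synthetic, copies_of_authentic):
    # The synthetic data are never oversampled; only the authentic part
    # is repeated `copies_of_authentic` times.
    return authentic * copies_of_authentic + synthetic

# Placeholder corpora standing in for the real sentence-pair lists.
authentic_pairs = [("authentic source", "authentic target")]
synthetic_pairs = [("backtranslated source", "monolingual target")] * 20

# "auth:synth 20:1": with the real data sizes this gives roughly the same
# number of authentic and synthetic sentences in the mix.
mixed_20_to_1 = mix(authentic_pairs, synthetic_pairs, copies_of_authentic=20)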
[Figure 2 (plot): BLEU (y-axis, roughly 18-28) against training steps in thousands (x-axis, 2350-2750); curves: Synth only, Auth:Synth 10:1, Auth:Synth 20:1, Auth:Synth 80:1, Auth:Synth 160:1, Auth only.]

Figure 2: Comparison of different ratios of authentic and synthetic data as a continuation of the previous learning.

In Figure 2, we can see the effect of varying the ratio of synthetic and authentic data. It seems that using only synthetic data gives the best performance, and whenever we increase the authentic part, the performance slowly decreases, contrary to Poncelas et al. (2018). One possible explanation would be noise in the authentic data. The synthetic data can thus be effectively cleaner and more suitable for the training of the model.

4.5 Synthetic from Scratch

In the previous section, we have shown that during the iterative backtranslation of low-resource languages, the authentic data hurt the performance. In this section, we use the various ratios of training data and train the model from scratch, without transfer learning or other backtranslation steps. Notably, all the parameters, as well as the wordpiece vocabulary, are kept unchanged.

Table 4 presents the results of using the synthetic data directly without any adaptation. It shows that having more authentic data hurts in the low-resource setting. However, the most surprising fact is that training from scratch leads to a significantly better model (cased BLEU higher by 0.7) than the model trained by transfer learning and two rounds of backtranslation. Unfortunately, we came up with the idea of training from scratch only after the submission; our systems submitted to the WMT manual evaluation are thus of worse performance.

Training dataset    cased   uncased
Auth (baseline)      1.8     2.2
Synth only          16.9    18.7
Auth:Synth 20:1     16.8    18.4
Auth:Synth 40:1     16.3    17.8
Auth:Synth 80:1     15.2    16.8
Submitted model     16.2    17.9

Table 4: BLEU scores for training English→Gujarati from scratch on synthetic data from the second round of backtranslation, evaluated on the test set. None of the models uses averaging or beam search; thus the submitted model row is our submitted model before averaging and beam search (the model (3) in Table 3). The scores are equal to those from http://matrix.statmt.org.

We believe that the observed gain from retraining from scratch could result from a subtle overfitting to the development set. We observe that, contrary to Table 4, the performance on the development set is higher for our final submitted model (26.9 BLEU) compared to 25.8 BLEU for the synthetic-only training. In the gradual training of the final submitted model, we used the development set three times: first to select the best model from the transfer learning, then when selecting the best performing model in the first round of backtranslation, and then for the third time during the second round of backtranslation. Training on synthetic data from scratch used the development set only once, for the selection of the best performing model to evaluate.

Another possible explanation is that the final model is already so overspecialized to the data from the first round of backtranslation that it is not able to adapt to the improved second-round synthetic data.

5 Conclusion

We participated in four translation directions of the low-resource language pairs in the WMT 2019 News Translation Shared Task. We combined transfer learning with iterated backtranslation and obtained significant improvements.
We showed that mixing authentic data with backtranslated data in a low-resource scenario does not affect the performance of the model: the synthetic data are far more important. This is a different result from what Poncelas et al. (2018) observed on higher-resource language pairs.

Lastly, in some scenarios, it is better to train the model on the backtranslated data from scratch instead of fine-tuning the previous model.

In future work, we want to investigate why training from scratch on the backtranslated data has led to better results. One of the reviewers suggested keeping the parent Czech→English corpus mixed in even during the later stages of training as an additional source of parallel data, which we would also like to evaluate.

Acknowledgments

This study was supported in parts by the grants SVV 260 453 of the Charles University, 18-24210S of the Czech Science Foundation and 825303 (Bergamot) of the European Union. This work has been using language resources and tools stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (LM2015071).

References

Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 Conference on Machine Translation (WMT18). In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 272-307, Brussels, Belgium. Association for Computational Linguistics.

Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving Human Parity on Automatic Chinese to English News Translation. CoRR, abs/1803.05567.

Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. 2018. Iterative Back-Translation for Neural Machine Translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 18-24.

Tom Kocmi and Ondřej Bojar. 2018. Trivial Transfer Learning for Low-Resource Neural Machine Translation. In Proceedings of the 3rd Conference on Machine Translation (WMT): Research Papers, Brussels, Belgium.

Tom Kocmi, Roman Sudarikov, and Ondřej Bojar. 2018. CUNI Submissions in WMT18. In Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers, pages 435-441, Brussels, Belgium. Association for Computational Linguistics.

Philipp Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. In Proceedings of EMNLP, volume 4, pages 388-395.

Philipp Koehn and Rebecca Knowles. 2017. Six Challenges for Neural Machine Translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28-39, Vancouver. Association for Computational Linguistics.

Marco Lui and Timothy Baldwin. 2012. langid.py: An Off-the-Shelf Language Identification Tool. In Proceedings of the ACL 2012 System Demonstrations, pages 25-30, Jeju Island, Korea. Association for Computational Linguistics.

Alberto Poncelas, Dimitar Shterionov, Andy Way, Gideon Maillette de Buy Wenniger, and Peyman Passban. 2018. Investigating Backtranslation in Neural Machine Translation. arXiv preprint arXiv:1804.06189.

Martin Popel. 2018. Machine Translation Using Syntactic Analysis. Univerzita Karlova.

Martin Popel and Ondřej Bojar. 2018. Training Tips for the Transformer Model. The Prague Bulletin of Mathematical Linguistics, 110(1):43-70.

Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. arXiv preprint arXiv:1804.08771.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86-96, Berlin, Germany. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems 30, pages 6000-6010. Curran Associates, Inc.