CUNI Submission for Low-Resource Languages in WMT News 2019
Table 1: The parallel training corpora used to train our models with counts of the total number of sentences as
well as the number of words (segmented on space). More details on the individual corpora can be obtained at
http://statmt.org/wmt19/.
both training corpora, low-resource and high-resource, undersampling the high-resource language pair as needed. Hence the low-resource language subwords are represented in the vocabulary with roughly the same prominence as those of the high-resource language pair.
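As an illustration, the corpus balancing step could be sketched as follows; the function name and file handling are hypothetical, and the actual vocabulary is then built with tensor2tensor's subword tooling on the mixed corpus.

```python
import random

def balanced_sample(high_resource_path, low_resource_path, seed=1):
    """Subsample the high-resource corpus so that both corpora contribute
    a comparable number of lines to the shared subword vocabulary.
    Hypothetical helper, not the exact script used in the submission."""
    with open(low_resource_path, encoding="utf-8") as f:
        low = f.readlines()
    with open(high_resource_path, encoding="utf-8") as f:
        high = f.readlines()
    # Undersample the high-resource side to roughly match the low-resource
    # size, so low-resource subwords stay prominent in the vocabulary.
    random.seed(seed)
    high_sampled = random.sample(high, min(len(high), len(low)))
    return low + high_sampled
```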
As Kocmi and Bojar (2018) showed, the language pairs do not have to be linguistically related, and the most important criterion is the amount of parent parallel data. For this reason, we have selected Czech-English as the parent language pair for Gujarati-English, and Russian-English as the parent for Kazakh-English. Russian was selected due to the use of Cyrillic and being a high-resource language pair. All language pairs share English. We prepare the Gujarati-English and Kazakh-English systems separately from each other.

2.2 Backtranslation

The amount of available monolingual data typically exceeds the amount of available parallel data. The standard technique for using monolingual data in NMT is called backtranslation (Sennrich et al., 2016). It uses a second model, trained in the reverse direction, to translate monolingual data into the source language of the first model.

Backtranslated data are aligned with their monolingual sentences to create synthetic parallel corpora. The standard practice is to mix the authentic parallel corpora with the synthetic ones, although it is not the only possible approach. Popel (2018) obtained better results by repeatedly alternating between training on the authentic and on the synthetic portion of the parallel data instead of mixing them.

This new corpus is used to train the first model by using the backtranslated data as the source and the monolingual data as the target side of the model.
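Pairing the backtranslated output with the original monolingual text amounts to the following sketch; the file names are illustrative, not taken from the paper.

```python
def build_synthetic_corpus(monolingual_path, backtranslated_path,
                           src_out="synth.src", tgt_out="synth.tgt"):
    """Pair backtranslated sentences (synthetic source side) with the
    original monolingual sentences (authentic target side)."""
    with open(backtranslated_path, encoding="utf-8") as bt, \
         open(monolingual_path, encoding="utf-8") as mono, \
         open(src_out, "w", encoding="utf-8") as src, \
         open(tgt_out, "w", encoding="utf-8") as tgt:
        for synthetic, authentic in zip(bt, mono):
            src.write(synthetic)   # machine-translated side
            tgt.write(authentic)   # clean human-written side
```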
Hoang et al. (2018) showed that backtranslation can be iterated: with the second round of backtranslation, we improve the performance of both models. However, the third round of backtranslation did not yield better results.

The performance of the backtranslation model is essential. Especially in the low-resource scenario, the baseline models trained only on the authentic parallel data have a poor score (2.0 BLEU for English→Gujarati). As a result, they generate very low quality backtranslated data. We have improved the baseline with transfer learning in order to improve performance and generate synthetic data of better quality.

2.3 Training Procedure

We train two models in parallel, one for each translation direction. Our training procedure is as follows. We train four parent models on the high-resource language pairs until convergence: English→Czech, Czech→English, English→Russian and Russian→English. We stop training a model if there was no improvement bigger than 0.1 BLEU in the last 20% of the training time.
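A minimal sketch of this stopping rule, assuming BLEU is evaluated at regular checkpoints (the exact bookkeeping in our training code may differ):

```python
def should_stop(bleu_history, min_delta=0.1, window_fraction=0.2):
    """Stop when the best BLEU in the most recent 20% of training so far
    improves on the best of the earlier 80% by less than 0.1 BLEU.
    `bleu_history` is a list of dev-set BLEU scores at regular intervals."""
    if len(bleu_history) < 10:   # too little history to judge
        return False
    cut = int(len(bleu_history) * (1 - window_fraction))
    earlier_best = max(bleu_history[:cut])
    recent_best = max(bleu_history[cut:])
    return recent_best - earlier_best < min_delta
```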
Whenever we train the parent model Czech→English, we apply a hyperparameter search (see Section 4.1) on the transfer learning of Gujarati→English. It gives us a better performing model on this language pair; however, we update the parameters for all child models.

Afterwards, we apply transfer learning on the authentic dataset of the corresponding low-resource language pair. We preserve the English side; thus Czech→English serves as the parent to Gujarati→English, and English→Czech to English→Gujarati. The same strategy is used for the transfer from Russian to Kazakh.
After the transfer learning, we select one of the translation directions to translate the monolingual data. As the starting systems for the backtranslation process, we have selected English→Gujarati and Kazakh→English. The decision for Kazakh-English is motivated by choosing the better performing model, see Table 3 below. This is, however, only a rough estimate, because bigger BLEU scores across various language pairs do not always indicate better performance; the properties of the target language, such as its morphological richness, affect the absolute value of the score. For Gujarati-English, we decided to start with the model with English on the source side, in contrast to Kazakh→English.

After the backtranslation, we mix the synthetic data with the authentic parallel data. We continue repeating this process: use the improved system to backtranslate the data, and use these data to build an even better system in the reverse direction.

We make two rounds of backtranslation in both directions for Gujarati-English, and only one round of backtranslation for Kazakh-English, due to the time consumption of the NMT translation process.
At the end, we take the model with the highest BLEU score on the devset and average it with the seven previous checkpoints to create the final model.
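Checkpoint averaging itself is an element-wise mean over the saved weights, sketched below; in practice tensor2tensor ships its own averaging utility, so this is only an illustration.

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Average the weights of several checkpoints element-wise.
    `checkpoints` is a list of dicts mapping variable names to numpy
    arrays, e.g. loaded from the last eight saved models."""
    names = checkpoints[0].keys()
    return {
        name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
        for name in names
    }
```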
3 Datasets and Model

In this section, we describe the datasets used to train our final models. All our models were trained only on the data allowed for the WMT 2019 News shared task; hence, our submission is constrained.
All used training data are presented in Table 1. We used all parallel corpora allowed and accessible by WMT 2019, except for the Czech-English language pair, where we used only CzEng 1.7. We have not cleaned any of the parallel corpora, except for deduplication and removing pairs with the same source and target translations in the Wiki Titles dataset.

As the development sets, we used the official WMT test sets from the year 2013 for Czech-English and Russian-English. For Gujarati-English, we used the official 2019 development set. Finally, for Kazakh-English, the organizers did not provide any development set; therefore, we separated the first 2000 sentence pairs from the News Commentary training set and used them as our development set.

The monolingual data used for the backtranslation are shown in Table 2. We use all available monolingual data for Gujarati and Kazakh. We did not use all available English monolingual data, due to the backtranslation process being time-consuming; therefore, we use only the 2018 News Crawl.

Corpus            Lang.   Sent.    Words
News crawl 2018   EN      15.4M    344.3M
Common Crawl      KK      12.5M    189.2M
News commentary   KK      13.0k    218.7k
News crawl        KK      772.9k   10.3M
Common Crawl      GU      3.7M     67.3M
News crawl        GU      244.9k   3.3M
Emille            GU      273.2k   11.4M

Table 2: Statistics of all monolingual data used for the backtranslation: the number of sentences in each corpus and the number of words (segmented on space). We mixed together all corpora for each language separately.

The available monolingual corpora are usually of high quality. However, we noticed that Common Crawl contains many sentences in a different language, as well as long paragraphs which are not useful for sentence-level translation.

Therefore, we used the language identification tool by Lui and Baldwin (2012) on the Common Crawl corpus and dropped all sentences automatically annotated as a language other than Gujarati or Kazakh, respectively. We broke suspected paragraphs into individual sentences by splitting them on all full stops whenever the segment was longer than 100 words.
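This cleaning can be sketched as follows with the langid.py package of Lui and Baldwin (2012); the thresholds follow the text above, while the helper name and the exact processing order are assumptions.

```python
import langid  # language identification tool of Lui and Baldwin (2012)

def clean_monolingual(lines, lang, max_words=100):
    """Keep only sentences identified as the expected language ('gu' or
    'kk'), splitting over-long segments (suspected paragraphs) on full
    stops first. A sketch; the actual pipeline may differ in details."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Break suspected paragraphs whenever longer than 100 words.
        segments = [line]
        if len(line.split()) > max_words:
            segments = [s.strip() + "." for s in line.split(".") if s.strip()]
        for segment in segments:
            predicted, _ = langid.classify(segment)
            if predicted == lang:
                yield segment
```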
3.1 Model

The Transformer model seems superior to other NMT approaches, as documented on several language pairs in the manual evaluation of WMT18 (Bojar et al., 2018).

We use version 1.11 of the sequence-to-sequence implementation of the Transformer called tensor2tensor.¹ We use the Transformer “big single GPU” configuration as described by Vaswani et al. (2017), a model which translates through an encoder-decoder architecture, with each layer involving an attention network followed by a feed-forward network. The architecture is much faster than other NMT architectures due to the absence of recurrent layers.

¹ https://github.com/tensorflow/tensor2tensor

Popel and Bojar (2018) documented best practices to improve the performance of the model. Based on their observations, we use the Adafactor optimizer with inverse square root decay. Based on our previous experiments (Kocmi et al., 2018), we set the maximum number of subwords in a sentence to 100, which drops less than 0.1 percent of the training sentences. The benefit is that the batch size can be increased to 4500 for our GPUs. The experiments are trained on a single NVidia GeForce 1080 Ti GPU.
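A sketch of this configuration using the tensor2tensor Python API; the attribute names follow tensor2tensor 1.11, but treat the block as an approximation of our setup rather than the exact training script.

```python
from tensor2tensor.models import transformer

def cuni_hparams():
    """Sketch of the configuration on top of the Transformer 'big single
    GPU' preset in tensor2tensor 1.11; overrides are approximations."""
    hparams = transformer.transformer_big_single_gpu()
    hparams.optimizer = "Adafactor"             # Adafactor optimizer
    hparams.learning_rate_schedule = "rsqrt_decay"  # inverse square root decay
    hparams.max_length = 100                    # drop sentences over 100 subwords
    hparams.batch_size = 4500                   # subwords per batch on a 1080 Ti
    hparams.layer_prepostprocess_dropout = 0.2  # from the hyperparameter search
    hparams.label_smoothing = 0.2               # from the hyperparameter search
    return hparams
```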
[Plot: BLEU (roughly 0-25) over training steps 2,000k-3,000k, with curves for En-Gu and Gu-En after transfer learning and after the first and second synthetic rounds.]

Figure 1: Learning curves for both directions of the Gujarati-English models. The BLEU score is uncased and computed on the development set.
4 Experiments

In this section, we describe our experiments, starting with the hyperparameter search, followed by our training procedure and supporting experiments.

All reported results are calculated on the test set of WMT 2019 and evaluated with case-sensitive SacreBLEU (Post, 2018)² if not specified otherwise. Statistical significance is tested by paired bootstrap resampling (1000 samples, conf. level 0.05; Koehn, 2004).

² The SacreBLEU signature is BLEU + case.mixed + numrefs.1 + smooth.exp + tok.13a + version.1.2.12.
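For illustration, paired bootstrap resampling can be implemented in a few lines on top of the sacrebleu Python API; this simplified sketch reports the fraction of resampled test sets on which system A outperforms system B.

```python
import random
import sacrebleu

def paired_bootstrap(refs, sys_a, sys_b, samples=1000, seed=1):
    """Paired bootstrap resampling (Koehn, 2004): resample the test set
    with replacement and count how often system A beats system B.
    A is significantly better if the returned fraction exceeds 0.95."""
    random.seed(seed)
    n = len(refs)
    wins = 0
    for _ in range(samples):
        idx = [random.randrange(n) for _ in range(n)]
        ref_sample = [refs[i] for i in idx]
        a = sacrebleu.corpus_bleu([sys_a[i] for i in idx], [ref_sample]).score
        b = sacrebleu.corpus_bleu([sys_b[i] for i in idx], [ref_sample]).score
        if a > b:
            wins += 1
    return wins / samples
```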
4.1 Hyperparameter Search

Before the first step of transfer learning, we performed a hyperparameter search on Gujarati→English over the set of hyperparameters that are not fixed by the parent model (unlike, for example, the dimensions of matrices or the structure of layers, which must stay fixed). We examined the following hyperparameters: learning rate, dropout, layer prepostprocess dropout, label smoothing, and attention dropout.

The performance before the hyperparameter search was 9.8 BLEU³ for Gujarati→English; this score was improved to 11.0 BLEU. Based on the hyperparameter search, we set the layer prepostprocess dropout and the label smoothing both to 0.2 in the setup of Transformer-big.

³ This score is computed on the devset by averaging the last 8 models, saved one and a half hours of training time apart.

These improvements show that transfer learning is not strictly tied to the parent setup and that some parameters can be further optimized. It must, however, be noted that we experimented only with a small subset of all hyperparameters, and it is possible that other parameters could also be changed without damaging the parent model.

In this paper, we reuse these parameters for all experiments (except for the parent models). A hyperparameter search for each language pair, or even for each dataset switch separately, might have led to better results, but it was beyond the scope of this paper.
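A sketch of the search: we draw configurations only over hyperparameters that can change without breaking compatibility with the parent model. The candidate values below are illustrative; the paper does not list the exact grids.

```python
import random

# Illustrative candidate values; only the hyperparameter names come
# from the paper.
SEARCH_SPACE = {
    "learning_rate": [0.5, 1.0, 2.0],
    "dropout": [0.0, 0.1, 0.2],
    "layer_prepostprocess_dropout": [0.1, 0.2, 0.3],
    "label_smoothing": [0.1, 0.2],
    "attention_dropout": [0.0, 0.1, 0.2],
}

def sample_config(seed=None):
    """Draw one configuration for transfer learning of Gujarati→English;
    matrix dimensions and layer structure stay fixed to remain
    compatible with the parent model."""
    rng = random.Random(seed)
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
```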
4.2 Problems with Backtranslation

The synthetic data have a quality similar to that of the model by which they were produced. Since the low-resource scenario yields an overall low quality, we observed that the synthetic data contain many errors:

• Repeated sequences of words (“model spasm”): The State Department has made no reference in statements, statements, statements, statements ...

• Sentences in Czech or Russian, most probably due to the parent model.

• Source sentences only copied, untranslated.
To avoid these problems, we cleaned all the synthetic data in the following way. We dropped all sentences that contained any repetitive sequence of words. Then we checked the sentences with the language identification tool of Lui and Baldwin (2012) and dropped all sentences automatically annotated as a wrong language. The second step also filtered out some remaining gibberish translations.

We have not used beam search during the backtranslation of monolingual data, which speeds up the translation process roughly 20 times compared to a beam search of size 8.
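The repetitive-sequence filter described above can be sketched as a check for any n-gram repeating consecutively; the thresholds are our assumptions, as the paper only states that such sentences were dropped.

```python
def has_repeated_sequence(sentence, max_ngram=4, min_repeats=3):
    """Flag 'model spasm' output, e.g. 'statements, statements,
    statements': some n-gram (n = 1..4) repeated at least three times
    in a row. Thresholds are illustrative assumptions."""
    tokens = sentence.lower().split()
    for n in range(1, max_ngram + 1):
        for start in range(len(tokens) - n * min_repeats + 1):
            gram = tokens[start:start + n]
            if all(tokens[start + k * n:start + (k + 1) * n] == gram
                   for k in range(1, min_repeats)):
                return True
    return False
```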
Training dataset                 EN→GU     GU→EN     EN→KK     KK→EN
Authentic (baseline)               2.0       1.8       0.5       4.2
Parent dataset                     0.7       0.1       0.7       0.6
Authentic (transfer learning)   ① 9.1       9.2       6.2    ① 14.4
Synth generated by model ①           -   ② 14.2     ② 8.3         -
Synth generated by model ②     ③ 13.4         -         -      17.3
Synth generated by model ③           -   ④ 16.2         -         -
Synth generated by model ④       13.7         -         -         -
Averaging + beam 8               14.3      17.4       8.7      18.5

Table 3: Test set BLEU scores of our setup. Except for the baseline, each column shows the improvements obtained after fine-tuning a single model on different datasets, beginning with the score of the trained parent model. The circled numbers ①-④ mark the intermediate models referred to in the rows “Synth generated by model …”.

4.3 Final Models

We trained our parent models as described in Section 2.3 for two million steps. One exception from the described approach is that we used a subset of 2M sentences of the monolingual English data for the first round of backtranslation by the English→Gujarati model, to cut down on the total time requirements.

Figure 1 above shows the progress of training the Gujarati-English models in both directions. The learning curves start at step 2M, visualizing first the parent model. We can notice that after each change of the parallel data, there is a substantial improvement in performance. The learning curve is plotted on the development data; the corresponding scores for the test sets are in Table 3.
The baseline models in Table 3 are trained on the authentic data only, and it seems that the amount of parallel data is not sufficient to train an NMT model for the investigated language pairs. The remaining rows show incremental improvements as we perform the various training stages.

The last stage of model averaging takes the best performing model and averages it with the previous seven checkpoints, which are one and a half hours of training time apart from each other.

We see that transfer learning can be combined with iterated backtranslation on a low-resource language to obtain an absolute improvement of 12.3 BLEU compared to the baseline in Gujarati→English, and 15.6 in English→Gujarati.

For the final submission, we selected the models at the following steps: step 2.99M for English→Gujarati, step 3.03M for Gujarati→English, step 2.48M for English→Kazakh, and step 2.47M for Kazakh→English.

4.4 Ratio of Parallel Data

Poncelas et al. (2018) showed that the balance between the synthetic and authentic data matters, and that there should always be a part of authentic parallel data. We started our experiments with this intuition. However, the low-resource scenario complicates the setup, since the amount of authentic data is several times smaller than the synthetic. In order to balance the authentic and synthetic parallel data, we duplicated the authentic data several times.

We noticed that the performance did not change compared to the setup relying on synthetic data only. Thus we prepared an experiment with the second round of backtranslation on Gujarati→English, varying the ratio of authentic and synthetic parallel data. For this experiment, we mix the authentic parallel corpus of 173k sentences with 3.6M randomly selected sentences from the synthetic corpus, which is equal to 20 times the size of the authentic data. We present the ratio between the authentic and synthetic corpora as the number of copies of the authentic data. The synthetic corpus is never oversampled; we only duplicate the authentic corpora. For example, the ratio “auth:synth 10:1” means that the authentic corpus has been multiplied ten times.
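Expressed as code, building the “auth:synth N:1” mixes is just a duplication of the authentic corpus; the function below is a hypothetical helper illustrating the construction.

```python
def mix_with_ratio(authentic, synthetic, copies):
    """Build a training corpus with `copies` duplicates of the authentic
    sentence pairs, e.g. copies=10 for the 'auth:synth 10:1' setting.
    The synthetic corpus is used exactly once and never oversampled."""
    return authentic * copies + synthetic

# Example: 173k authentic pairs duplicated 20 times, mixed with the
# 3.6M-sentence synthetic sample (the 'auth:synth 20:1' setting).
# corpus = mix_with_ratio(auth_pairs, synth_pairs, copies=20)
```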
[Plot: BLEU (roughly 20-28) for English→Gujarati models trained with different data mixes: Synth only, Auth:Synth 10:1, 20:1, 80:1, 160:1, and Auth only.]

Training dataset    cased   uncased
Auth (baseline)      1.8      2.2
Synth only          16.9     18.7
Auth:Synth 20:1     16.8     18.4
Auth:Synth 40:1     16.3     17.8
Auth:Synth 80:1     15.2     16.8
Submitted model     16.2     17.9

Table 4: BLEU scores for training English→Gujarati from scratch on synthetic data from the second round of backtranslation, evaluated on the test set. None of the models uses averaging or beam search; thus the “submitted model” row is our submitted model before averaging and beam search (the model ③). The scores are equal to those from http://matrix.statmt.org.