Survey On Neural Machine Translation Into Polish: Proceedings of The 11th International Conference MISSI 2018
All content following this page was uploaded by Krzysztof Wołk on 22 December 2018.
1 Introduction
Machine translation (MT) started about 50 years ago with rule-based systems. In the 1990s, statistical MT (SMT) systems were introduced. They build statistical models by analyzing aligned source-target language data (the training set) and use these models to generate translations. The approach was later extended to phrases [1] and to additional linguistic features (e.g. part-of-speech tags), and such methods have dominated over the last decades. During the training phase, SMT creates a translation model and a language model. The former stores the different translations of phrases, while the latter stores the probabilities of phrase sequences on the target side. During the translation phase, the decoder chooses the best translation based on a combination of these two models. This naturally requires huge training sets (for proper estimation of the statistical models), which limits translation quality, especially for grammatically rich languages. A recent development, Neural Machine Translation (NMT), uses vector-space word representations and deep learning techniques to learn the best neural network weights for transforming segments from the source to the target language. This is achieved using different network architectures: recurrent networks [2], networks with an attention mechanism [3], or convolutional networks [4]. After the initial enthusiasm generated by better NMT results on shared tasks [5], several observations showed that NMT does not guarantee better MT performance. Koehn & Knowles [6] found that for the English-Spanish and German-English pairs, NMT systems (compared to SMT) have worse out-of-domain performance, worse performance in low-resource conditions, worse translation of long sequences, sometimes odd word alignments produced by the attention mechanism, and problems with large beam decoding, but better translation of infrequent words (perhaps because of the use of subword units).
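The SMT decoding step described above, combining a translation model with a language model, can be sketched as follows. This is a toy illustration: the phrase tables, probabilities, and function name are invented for the example, not taken from any trained system.

```python
import math

# Toy translation model: P(source phrase | target phrase) -- invented values.
translation_model = {
    ("dom", "house"): 0.7,
    ("dom", "home"): 0.3,
}

# Toy language model: P(target phrase) -- invented values.
language_model = {
    "house": 0.02,
    "home": 0.05,
}

def best_translation(source_phrase):
    """Pick the target phrase maximizing log P(s|t) + log P(t)."""
    candidates = [t for (s, t) in translation_model if s == source_phrase]
    return max(
        candidates,
        key=lambda t: math.log(translation_model[(source_phrase, t)])
                      + math.log(language_model[t]),
    )
```

With these toy numbers the decoder prefers "home" over "house", because the higher language-model probability outweighs the lower translation probability, which is exactly the trade-off real SMT decoders make at scale.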
This study surveys SMT and NMT toolkits for Polish-English translations.
The general quality of MT systems depends strongly on the language pair, the amount and quality of training data, and the domain match. Translation to or from a low-resourced language with different syntax and morphology is particularly challenging. Polish, as a Slavic language, has a rather free word order and is highly inflected: the inflectional morphology is very rich for all word classes, seven distinct cases affect not only common nouns but also proper nouns, pronouns, adjectives, and numerals, and the orthography is complex. For Polish-English translation this creates tasks that are hard for statistical systems to solve: unbalanced dictionaries (the Polish one is usually 4-5 times bigger than the English one), probability estimation for segment sequences (due to free word order), frequent use of foreign words but with Polish inflection, and limited sizes of parallel corpora.
The baseline system testing was done using the Moses open-source SMT toolkit [7] with its Experiment Management System (EMS) [8]. The SRI Language Modeling Toolkit (SRILM) [9] with an interpolated version of Kneser-Ney discounting (-interpolate -unk -kndiscount) was used for 5-gram language model training. We used the MGIZA++ tool [10] for word and phrase alignment. KenLM [11] was used to binarize the language model, with lexical reordering set to the msd-bidirectional-fe model.
As a second SMT toolkit we used the state-of-the-art ModernMT (MMT) system [12], created in cooperation between Translated, FBK, UEDIN, and TAUS. ModernMT also has a secondary neural translation engine. From the code on GitHub we know that it is based on PyTorch [13] and OpenNMT [14], and that it uses BPE [15] for sub-word unit generation by default. More detailed information is not stated on the project's manual pages; the project probably includes many more default optimizations.
The third toolkit we used is based on Google's TensorFlow, an open-source software library for high-performance numerical computation. One of the modules within TensorFlow is seq2seq [2, 16]. Sequence-to-sequence (seq2seq) models have enjoyed great success in a variety of tasks such as machine translation, speech recognition, and text summarization. This work uses seq2seq for the task of Neural Machine Translation (NMT), which was the very first testbed for seq2seq models.
The next system we experimented with was MarianNMT [17] (formerly known as AmuNMT), an efficient Neural Machine Translation framework written in pure C++ with minimal dependencies. It has mainly been developed at the Adam Mickiewicz University in Poznań (AMU) and at the University of Edinburgh. Its advantages are up to 15x faster translation than Nematus [18] and similar toolkits on a single GPU, up to 2x faster training than toolkits based on Theano, TensorFlow, or Torch on a single GPU, multi-GPU training and translation, and support for different types of models, including deep ones.
3 Data preparation
The experiments described in this article were conducted using official test and training sets from the IWSLT'13 conference [20, 21] as well as from the WMT'17 conference. From IWSLT we borrowed the TED Lectures corpora, and from WMT we used the Europarl v7 corpus [22]. Whereas the TED Lectures were ready to use, Europarl had to be pre-processed: it was necessary to deduplicate the dataset and to ensure that none of the development or test data was present in the training data set. The specification of both corpora is presented in Table 1.
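The deduplication and overlap check can be sketched as follows. The function name and the sentence-pair representation are assumptions for illustration, not the actual pre-processing script used in the study.

```python
def clean_training_data(train_pairs, held_out_pairs):
    """Deduplicate parallel sentence pairs and drop any pair whose
    source sentence also appears in the development/test data."""
    held_out_sources = {src for src, _ in held_out_pairs}
    seen = set()
    cleaned = []
    for src, tgt in train_pairs:
        # Skip exact duplicates and any sentence leaking from dev/test.
        if (src, tgt) in seen or src in held_out_sources:
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned
```

Removing such overlap matters because any dev/test sentence left in the training data would artificially inflate the reported scores.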
For data sub-word division and augmentation, we used a tool of our own. It was implemented as part of this study of the Polish language and is able to segment Polish text into prefix, core, and suffix, as well as into syllables. An additional advantage is the possibility of dividing the text into grammatical groups and tagging texts with POS tags. This type of tool is of considerable significance not only for scientists but also for business. In the rapidly evolving field of neural machine translation, morphological segmentation is necessary to reduce the size of dictionaries consisting of full word forms (the so-called open dictionary), which are used, among other things, in the training of the translation system and in language modelling.
Sample stemmed Polish sentence:
dział++ --@@pos_noun++ --@@b_38++ --@@sb_1++ --ania pod++ --@@pos_past_participle++ --@@b_46++ --@@sb_2++ --jęte w++ --@@pos_other_x++ --@@b_0++ --@@sb_0 wy++ --ni++ --@@pos_noun++ --@@infl_M3++ --@@b_0++ --@@sb_4++ --ku
For tagging the English-side text we utilized the spaCy POS tagger, for which we wrote a Python script to unify the format of both tools. The tagging speed was about 70 sentences per second. A sample result:
we --%%pos_pronoun++ can --%%pos_verb++ not --%%pos_adverb++ run --%%pos_verb++ the --%%pos_determiner++ risk --%%pos_noun++ of --%%pos_adposition++ creating --%%pos_verb++ a --%%pos_determiner++ regulatory
Whenever "stemmed" is used in the experiments section, it means that the data was processed into this format.
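Assuming spaCy's coarse-grained tags (each token's pos_ attribute) as input, the format unification might look like the sketch below; the function name and the exact lower-casing rule are our own assumptions for illustration.

```python
def tag_to_format(tagged_tokens):
    """Render (token, POS) pairs in the inline tag format shown above.
    `tagged_tokens` is a list of (text, coarse_pos) pairs, e.g. as
    produced by spaCy via [(t.text, t.pos_) for t in nlp(sentence)]."""
    return " ".join(f"{tok} --%%pos_{pos.lower()}++" for tok, pos in tagged_tokens)
```

For example, tag_to_format([("we", "PRON"), ("can", "VERB")]) yields "we --%%pos_pron++ can --%%pos_verb++".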
In addition, we applied byte pair encoding (BPE) [15], a compression algorithm, to the task of word segmentation. BPE allows the representation of an open vocabulary through a fixed-size vocabulary of variable-length character sequences, making it a very suitable word segmentation strategy for neural network models. We tried this method independently as well as in conjunction with our "stemmer". After tokenizing and applying BPE to a dataset, the original sentences may look like the following. Note that the name "Nikitin" is a rare word that has been split into subword units delimited by @@.
Madam President , I should like to draw your attention to a case in which this Parliament has consistently shown an interest . It is the case of Alexander Ni@@ ki@@ tin .
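The core of BPE can be sketched as follows: learn the most frequent adjacent-symbol merges from word frequencies, then apply them and mark non-final subwords with the "@@ " delimiter. This is a toy reimplementation for illustration, not the actual implementation from [15].

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} dict."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the best pair merged everywhere.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

def apply_bpe(word, merges):
    """Segment a word with learned merges; '@@ ' joins non-final subwords."""
    symbols = list(word)
    for a, b in merges:
        i, out = 0, []
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return "@@ ".join(symbols)
```

A frequent word ends up as a single symbol, while a rare word (like "Nikitin" above) is left split into several subword units, keeping the vocabulary fixed in size.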
4 Experiments
All of the experiments were conducted on the same machine: a 32-core CPU, 256 GB of RAM, and 4 x nVidia Tesla K80 GPUs. We used only CUDA-enabled toolkits that were able to use the cuDNN library for faster neural computing. This allowed us to measure not only the quality but also the time cost of similar operations on different toolkits and settings [23].
We started the neural machine translation experiments with the most common option, Recurrent Neural Networks (RNN). First, we focused on the TensorFlow toolkit provided by Google. Using its default settings and the official IWSLT 2013 PL-EN test sets, we compared it against the SMT (Moses) quality. The results of this comparison are shown in Table 2.
NMT was not only much lower in quality but also much slower. We concluded that this might be because TED covers a very wide domain, with a diverse dictionary and a large disproportion between the PL and EN dictionary sizes. On the other hand, most successful NMT research was done on narrow domains with similar vocabulary on both sides. We therefore decided to switch to the European Parliament Proceedings (EUP) parallel corpus. Those results are shown in Table 3. As can be seen, even though Adagrad optimization had a positive impact on the training outcome, the time cost and lower quality were still not satisfying [24].
In conclusion, we assumed that the baseline RNN topology and training parameters in TensorFlow were not properly optimized for the Polish language. That is why we turned our attention to the Polish NMT system called MarianNMT. Table 4 shows our experiments in the PL-EN direction.
52    30.48    28.06
54    30.41    28.18
56    30.63    28.37
58    30.66    28.29
60    30.82    28.43
62    30.72    28.28
64    30.28    28.59
66             28.52
204   31.35
Even though we finally obtained satisfactory results, they were still too similar to the SMT method. Another problem was training performance: we required one week to complete a proper experiment. We also found that using sub-word units (stemming) improves translation into Polish by a visible factor, whereas in the opposite direction the results were negative.
All this motivated the use of CNNs instead of RNNs, which should provide similar results much faster. We decided to use the FairSeq toolkit developed at the Facebook laboratories. We show the results of translation into Polish in Table 6. We were able to obtain results very similar to MarianNMT, but in much less time: a full training required only 1.5 days on average, whereas in MarianNMT it took about 6 days. We also decided to translate only in the Polish direction, because from the MarianNMT experiments we concluded that sub-word units are only usable when translating into Polish; in the opposite direction they generated too many wrong hypotheses. The same conclusions could be drawn from the FairSeq experiments.
Our next step was a training hyperparameter search. For this purpose, we used only 25% of the dataset in order to keep the runtime manageable.
Our baseline translation system scored 24.95, whereas we finally obtained a score of 26.18 on the BLEU metric [25]. Most importantly, we showed that our "stemmer" works well and that it really improves translation quality. The highest score was obtained using stemmed data augmented with the grammatical group and base suffix code. Some minor improvement was observed when using only stemmed word forms without extra codes. Interestingly, the most advanced tagging and stemming, with group codes and POS tags, did not work as anticipated. The most likely reason is that we simply added too much information, which artificially extended the number of tokens in the sentences. This could reduce the training accuracy, especially since our word-embedding vector size was set to only 256 (--dim-embeddings).
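The BLEU metric [25] used for these scores can be sketched as below: the geometric mean of modified n-gram precisions, scaled by a brevity penalty. This is a simplified sentence-level version without smoothing, not the exact evaluation script used in the experiments.

```python
import math
from collections import Counter

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions times a brevity penalty (no smoothing)."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Penalize hypotheses shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp))) if hyp else 0.0
    return brevity * math.exp(sum(log_precisions) / max_n)
```

Corpus-level BLEU, as reported in the tables, aggregates the clipped counts over all sentences before taking the geometric mean rather than averaging per-sentence scores.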
Next, we chose the best settings from the hyperparameter search and applied them to 100% of our data set. Because of the POS issue, we also decided to increase the maximum number of accepted tokens per sentence and the size of the embedding vector to 512 and 768, respectively. This improved the BLEU score even more, to 31.53 without POS tags (but with stemming and BPE) and to 31.28 with POS tags and BPE.
The unanticipated problem of a slightly lower score when using POS tags remained. That is why we performed a manual system evaluation by analysing the translation results. That is how we discovered a looping issue: for instance, for one sentence the generated hypothesis was partially correct but repeated many times. This was the most likely reason for the BLEU disproportion.
We judged that the attention window on the English side was too small and made the system go into a loop. Unfortunately, we have not yet succeeded in eliminating this issue.
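Such looping can also be detected mechanically rather than by manual inspection; a simple diagnostic sketch follows (the function name and thresholds are our own choices, not part of any toolkit).

```python
def find_loops(sentence, n=3, min_repeats=2):
    """Return n-grams repeated at least `min_repeats` times in a
    hypothesis -- a cheap diagnostic for decoder looping."""
    tokens = sentence.split()
    counts = {}
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return {" ".join(g): c for g, c in counts.items() if c >= min_repeats}
```

Running this over all generated hypotheses would flag sentences like the one above, where a partially correct fragment is emitted many times.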
The final step of our research was a comparison of our systems with the commercial Google Translate engine and the context-aware ModernMT system (Table 8). Both of these systems are state-of-the-art tools, but their closed architecture (especially MMT's) makes it impossible to directly compare the results. Nonetheless, they shed some light on the outcomes of this research.
5 Conclusions
To conclude our work, we put everything into one big table for easier comparison. We include only the best score for each toolkit used in the research, omitting whether the data was augmented with tags, BPE, or stemming; only the best scores remain. We compare our work to the Google Translate API [26] and ModernMT.
References
1. Koehn, P., Och, F. J., & Marcu, D.: Statistical phrase-based translation. In Proceedings of
the 2003 Conference of the North American Chapter of the Association for Computational
Linguistics on Human Language Technology-Volume 1 (pp. 48-54). Association for Com-
putational Linguistics. (2003, May)
2. Sutskever, I., Vinyals, O., & Le, Q. V.: Sequence to sequence learning with neural networks.
In Advances in neural information processing systems (pp. 3104-3112). (2014)
3. Bahdanau, D., Cho, K., & Bengio, Y.: Neural machine translation by jointly learning to align
and translate. arXiv preprint arXiv:1409.0473, (2014).
4. Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N.: Convolutional sequence
to sequence learning. arXiv preprint arXiv:1705.03122. (2017).
5. Luong, M. T., & Manning, C. D.: Stanford neural machine translation systems for spoken
language domains. In Proceedings of the International Workshop on Spoken Language
Translation (pp. 76-79). (2015).
6. Koehn, P., & Knowles, R.: Six challenges for neural machine translation. arXiv preprint
arXiv:1706.03872. (2017).
7. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., ... & Dyer,
C.: Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th
annual meeting of the ACL on interactive poster and demonstration sessions (pp. 177-180).
Association for Computational Linguistics. (2007, June).
8. Vasiļjevs, A., Skadiņš, R., & Tiedemann, J.: LetsMT!: a cloud-based platform for do-it-
yourself machine translation. In Proceedings of the ACL 2012 System Demonstrations (pp.
43-48). Association for Computational Linguistics. (2012, July).
9. Stolcke, A.: SRILM-an extensible language modeling toolkit. In Seventh international con-
ference on spoken language processing. (2002).
10. Junczys-Dowmunt, M., & Szał, A.: Symgiza++: symmetrized word alignment models for
statistical machine translation. In Security and Intelligent Information Systems (pp. 379-
390). Springer, Berlin, Heidelberg. (2012).
11. Heafield, K.: KenLM: Faster and smaller language model queries. In Proceedings of the
Sixth Workshop on Statistical Machine Translation (pp. 187-197). Association for Compu-
tational Linguistics. (2011, July).
12. Jelinek, R.: Modern MT systems and the myth of human translation: Real World Status Quo.
In proceedings of the International Conference Translating and the Computer. (2004, No-
vember).
13. Team, PyTorch Core.: Pytorch: Tensors and dynamic neural networks in python with strong
gpu acceleration. (2017).
14. Klein, G., Kim, Y., Deng, Y., Senellart, J., & Rush, A. M.: Opennmt: Open-source toolkit
for neural machine translation. arXiv preprint arXiv:1701.02810 (2017).
15. Sennrich, R., Haddow, B., & Birch, A.: Neural machine translation of rare words with sub-
word units. arXiv preprint arXiv:1508.07909. (2015).
16. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., &
Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical ma-
chine translation. arXiv preprint arXiv:1406.1078. (2014).
17. Junczys-Dowmunt, M., Grundkiewicz, R., Dwojak, T., Hoang, H., Heafield, K., Neckermann, T., ... & Martins, A.: Marian: Fast Neural Machine Translation in C++. arXiv preprint arXiv:1804.00344. (2018).
18. Sennrich, R., Firat, O., Cho, K., Birch, A., Haddow, B., Hitschler, J., ... & Nădejde, M.:
Nematus: a toolkit for neural machine translation. arXiv preprint arXiv:1703.04357. (2017).
19. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P.: Gradient-based learning applied to docu-
ment recognition. Proceedings of the IEEE, 86(11), 2278-2324. (1998).
20. Wołk, A., Wołk, K., & Marasek, K.: Analysis of complexity between spoken and written
language for statistical machine translation in West-Slavic group. In Multimedia and Net-
work Information Systems (pp. 251-260). Springer, Cham. (2017).
21. Wołk, K., & Marasek, K.: Polish-English speech statistical machine translation systems for
the IWSLT 2013. arXiv preprint arXiv:1509.09097. (2013).
22. Koehn, P.: Europarl: A parallel corpus for statistical machine translation. In MT summit
(Vol. 5, pp. 79-86). (2005, September).
23. Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., ... & Klingner, J.:
Google's neural machine translation system: Bridging the gap between human and machine
translation. arXiv preprint arXiv:1609.08144. (2016).
24. Kingma, D. P., & Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980. (2014).
25. Papineni, K., Roukos, S., Ward, T., & Zhu, W. J.: BLEU: a method for automatic evaluation
of machine translation. In Proceedings of the 40th annual meeting on association for com-
putational linguistics (pp. 311-318). Association for Computational Linguistics. (2002,
July).
26. Groves, M., & Mundt, K.: Friend or foe? Google Translate in language for academic pur-
poses. English for Specific Purposes, 37, 112-121. (2015).