


Leveraging BERT for English-Arabic Machine Translation

MuhammadMahdi Abdurahimov
National Research Nuclear University MEPhI (Moscow Engineering Physics Institute)

Abstract

Large pre-trained models have gained great attention in NLP. In this work, we use the Transformer as our baseline model for machine translation from Arabic to English as well as from English to Arabic. We first preprocess the data: Unicode normalization, orthographic normalization, and dediacritization for Arabic, and lowercasing and punctuation normalization for English. We also use the BPE tokenization scheme for both languages. Besides the baseline model, and building on related work, we use the BERT-fused model and compare it with the baseline MT models on the 6 separate test sets of the IWSLT2017 dataset. As a result, for both English-to-Arabic and Arabic-to-English machine translation, we achieve a relatively good BLEU score compared to the baseline model.

1 Introduction

Machine Translation (MT) is an important task in NLP. It has undergone many changes, and the most popular approaches today are rule-based MT, statistical MT (SMT), and neural MT (NMT). Rule-based MT relies on rules created by linguists and programmers, who use dictionaries and grammar rules to build a translation rule base; this approach has significant limitations, and any new corpus introduces new problems. Statistical machine translation, on the other hand, learns a mapping between two parallel corpora; it moved the earlier word-based machine translation to phrase-based translation and incorporates syntactic information to further improve translation accuracy.

However, because such models capture little syntactic and semantic information, problems easily arise when dealing with syntactically different language pairs, such as Chinese-English. NMT is a deep learning approach to translating text, and unlike traditional translation systems, all parts of the neural translation model are trained jointly (end-to-end) to maximize translation performance. The first scientific paper on the use of neural networks in machine translation appeared in 2014 (Sutskever et al., 2014). NMT is able to handle a wider range of inputs and produce more natural translations, but it requires a large amount of training data and computational resources. To some extent, neural machine translation is more efficient and versatile than statistical machine translation.

Arabic is spoken by more than 400 million people and is one of the official languages of the United Nations; this Semitic language is widely used as a first language in Arab countries. Most Arabic-speaking countries are located in the Middle East, North Africa and the Arabian Gulf region. A total of 22 countries have Arabic as their official language, and Arabic is also the liturgical language of Islam, which means that there is a large demand for accurate and reliable machine translation services in Arabic, both for personal and professional purposes. Arabic is a morphologically rich language that usually packs many kinds of morphemes into a single word, so a single prefix or suffix may have a significant impact on the meaning. Moreover, due to diacritization (the addition of short vowels to each word), two words written with the same letters can mean opposite things. Because of these problems, the word vectors produced by existing models are not sufficient for accurate machine translation, and in this work we would like to obtain better word vectors.

As mentioned before, the rich morphology of the Arabic language leads to several challenges. The specific properties of Arabic as a language cause the biggest difficulties for Arabic machine translation: the absence of letters representing short vowels, the absence of capital letters, multiple morphological letter forms, and few punctuation marks. Furthermore, the variant symbols have a great impact. As a result, machine translation can be particularly challenging for Arabic, and the development of effective Arabic machine translation algorithms can help to advance the field of machine translation as a whole.

Traditional methods are not easy to apply to a morphologically rich language such as Arabic. Our work here is to preprocess the data first: we use Unicode normalization, orthographic normalization, and dediacritization, and we apply the BPE tokenization scheme in the experiment. We also take into account that the BERT (Devlin et al., 2018) pre-trained model has a bidirectional Transformer structure that can consider contextual content, so we may obtain better results in machine translation; therefore we use a model that is based on BERT. We also preprocess the English data by lowercasing and normalizing punctuation. We use the IWSLT2017 En-Ar dataset, an open corpus that is often used for training Arabic machine translation. Our work focuses on preprocessing the data first, then applying BPE tokenization for both languages; after BPE tokenization, we set up the models and train them while tuning the parameters to improve the accuracy of English-to-Arabic and Arabic-to-English translation.
tasks. Their Experiment discovered a consistent
2 Related Work improvement in BLEU score with 12.9%. Which
inspires us thinking about the use of encoding
At present, most of the research papers in the field forms and word vectors.
of Arabic-English translation have mainly focused (Zoph et al., 2016) employs a migration learning
on statistical and neural machine translation strategy, not necessarily for the English to Arabic
methods. In this section, we will introduce some translation task. The main idea is twofold, and
research works in the tasks of Arabic-English or we use the translation task of X and Y as an
English-Arabic translation that is based on neural example here. The first aspect is to first train the
machine translation. NMT model between X and Z, and then train the
translation model of X and Y afterwards. The
second aspect is to pre-train on the high-resource
2.1 Arabic Machine Translation
language and then migrate to the low-resource for
In 2016, a neural machine translation (Almahairi training. Due to the problem of a low resource
et al., 2016) in the task of Arabic translation was language in English-Arabic and Arabic-English,
compared to the typical phrase-based translation (Ren et al., 2018) proposed a triangular approach
system, which the neural machine translation to train English to French and translate English
method outperformed the typical phrase- based to Arabic and Arabic to French using the well
system on the MT05 dataset with BLEU 33.62%. translated corpus, which was informed by (Zoph
(Shapiro and Duh, 2018) introduced word em- et al., 2016). This method first trains a parent
beddings to process the morphological resources model using a high-resource dataset, and then uses
in Arabic language, which using sub-word infor- this parent model to train a low-resource dataset
mation outperformed regular word embeddings with initialization constraints. French-English
on a word similarity task in the experiments of is used as the parent model and other languages
Arabic-English on a small corpus of TED subtitles. are used as child models for training. Their
(Ding et al., 2019) demonstrated that different
experiments showed that both English to Ara- in Figure 11 . In this section, we introduce the
bic and the Arabic to French tasks got better results. BERT-fused(Zhu et al., 2019) model in detail and
employ the BERT-fused model in Arabic-English
and English-Arabic machine translation.
2.2 Pre-training in Machine Translation In traditional natural language tasks, pre-trained
Pre-trained embeddings also have a good applica- models have been used widely and it can be used
tion on translation tasks. (Qi et al., 2018) employed in the encoder and decoder modules or be used as
pre-trained embeddings vectors in Neural Machine the input of machine translation tasks. However,
Translation and obtained a consistent BLEU score different from text classification or other natural
improvement. language tasks that usually apply pre-trained model
(Mikolov et al., 2013) proposed a new word2vec as fine-tuning to perform experiments, BERT-fused
method, which is the most universal word embed- model was designed to explore the contextual em-
dings and each word is linked to a vector represen- bedding in machine translation.
tation with the advantage of captureing semantic More specifically, as shown in Figure 22 , the rep-
relationship. Another approaches is ELMo based resentation of input information is extracted using
on features (Peters et al., 2018) using bidirectional BERT by feeding it into decoder and encoder lay-
LSTM structures. It can be used to capture context- ers instead of served as only input embedding and
related meanings in the whole sentences. In ad- then attention mechanism is fused in the layers of
dition, there are other word embedding methods encoder modules and decoder modules in BERT-
for Arabic, such as Polyglot (Al-Rfou et al., 2013), fused model architecture. In order to solve the
AraVec(Soliman et al., 2017). problem of different sequence lengths, the atten-
In contrast to word-level representation, many pre- tion modules were designed adaptively to process
trained language model can be used in the level different word segmentation lengths. In addition,
of sentence representation and can be fine-tuned two new modules named BERT-encoder attention
downstream tasks with funing very few parame- and BERT-decoder attention were fused with NMT
ters. One of these language models is OpenAI encoder and decoder to obtain fused output repre-
GPT (Radford et al., 2018) which can capture a sentations.
long range of linguistic information based on the It can be seen in Figure 2 that given any input sen-
Transformer network(Vaswani et al., 2017). An- tence or tokenization x, it firstly is feed into BERT
other widely used pre-trained language models module and encoded into representation. Here, We
is BERT(Devlin et al., 2018) based Transformer defined the output of final layer in BERT module
attention networks, which uses the bidirectional as HB . After that, we define HEl as the hidden
Transformer architecture to capture both left and representation of l-th layer in the encoder, and HE0
right context. These pre-trained language models as word embedding of input x. Denote the i-th
achieve state-of-the-art results in many NLP tasks, element in HEl as hli for any i ∈ [lx ].
which is useful for machine translation as well. Therefore, the output in l-th layer is that,
In this work, we mainly focus on using differ-
ent pre-trained models to generate embedding vec-
tors in Arabic-English and English-Arabic machine hˆli = 12 (attns (hl−1 l−1 l−1
i , HE , HE )
translation tasks and compare the difference be- (1)
tween different language models. +attnB (hl−1
i , HB , HB )), i ∈ [lx ]

3 Method

A typical neural machine translation model has two parts: an encoder, which forms contextualized word embeddings from a source sentence, and a decoder, which generates a target translation from left to right. In our experiments, we first use the Transformer as our baseline; its architecture (Vaswani et al., 2017) is shown in Figure 1. In this section, we introduce the BERT-fused model (Zhu et al., 2019) in detail and employ it for Arabic-English and English-Arabic machine translation.

Figure 1: The architecture of the Transformer model (Vaswani et al., 2017).

In traditional natural language tasks, pre-trained models have been used widely; they can be used in the encoder and decoder modules or as the input of machine translation tasks. However, unlike text classification and other natural language tasks, which usually fine-tune a pre-trained model directly, the BERT-fused model was designed to exploit contextual embeddings in machine translation.

More specifically, as shown in Figure 2, the representation of the input is extracted using BERT and fed into the encoder and decoder layers, instead of serving only as the input embedding; an attention mechanism then fuses it into the layers of the encoder and decoder modules of the BERT-fused architecture. To handle differing sequence lengths, the attention modules are designed to adapt to different word segmentation lengths. In addition, two new modules, BERT-encoder attention and BERT-decoder attention, are fused with the NMT encoder and decoder to obtain fused output representations.

Figure 2: The architecture of the BERT-fused model (Zhu et al., 2019). From left to right, the modules are BERT, the encoder and the decoder.

As can be seen in Figure 2, any input sentence (tokenization) x is first fed into the BERT module and encoded into a representation. We define the output of the final BERT layer as H_B. We then define H_E^l as the hidden representation of the l-th encoder layer, and H_E^0 as the word embedding of the input x. Denote the i-th element of H_E^l as h_i^l for any i \in [l_x]. The output of the l-th layer is

    \hat{h}_i^l = \frac{1}{2} \left( attn_S(h_i^{l-1}, H_E^{l-1}, H_E^{l-1}) + attn_B(h_i^{l-1}, H_B, H_B) \right), \quad i \in [l_x],    (1)

where attn_S and attn_B are attention models with different parameters, defined in Eqn. (2). Then each \hat{h}_i^l is further processed by FFN(x), defined in Eqn. (3), and we get the output of the l-th layer: H_E^l = (FFN(\hat{h}_1^l), ..., FFN(\hat{h}_{l_x}^l)). The encoder eventually outputs H_E^L from its last layer.

The attention layer is defined as

    attn(q, K, V) = \sum_{i=1}^{|V|} a_i W_v v_i, \quad a_i = \frac{\exp((W_q q)^T (W_k k_i))}{Z}, \quad Z = \sum_{i=1}^{|V|} \exp((W_q q)^T (W_k k_i)),    (2)

where q, K and V represent the query, keys and values respectively. Here q is a d_q-dimensional vector (d_q \in \mathbb{Z}), and K and V are two sets with |K| = |V|. Each k_i \in K and v_i \in V is a d_k-/d_v-dimensional vector (d_q, d_k and d_v can be different), i \in [|K|], and W_q, W_k and W_v are the parameters to be learned.

We define the non-linear transformation layer as

    FFN(x) = W_2 \max(W_1 x + b_1, 0) + b_2,    (3)

where x is the input and W_1, W_2, b_1, b_2 are the parameters to be learned.
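For concreteness, the following sketch implements Eqn. (2) and Eqn. (3) directly in PyTorch for a single query; the tensor shapes and parameter names are our own illustrative choices under these assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    """Single-query attention following Eqn. (2): a softmax-weighted sum of projected values."""
    def __init__(self, d_q, d_k, d_v, d_model):
        super().__init__()
        self.W_q = nn.Linear(d_q, d_model, bias=False)
        self.W_k = nn.Linear(d_k, d_model, bias=False)
        self.W_v = nn.Linear(d_v, d_model, bias=False)

    def forward(self, q, K, V):
        # q: (d_q,), K: (n, d_k), V: (n, d_v)
        scores = self.W_k(K) @ self.W_q(q)   # (n,): (W_q q)^T (W_k k_i) for each i
        a = torch.softmax(scores, dim=0)     # a_i = exp(.)/Z
        return a @ self.W_v(V)               # sum_i a_i W_v v_i -> (d_model,)

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward layer of Eqn. (3): FFN(x) = W2 max(W1 x + b1, 0) + b2."""
    return W2 @ torch.clamp(W1 @ x + b1, min=0) + b2
```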
Let S_{<t}^l denote the hidden states of the l-th decoder layer preceding time step t. Note that the first state s^0 is a special token indicating the start of a sequence, and s_t^0 is the embedding of the word predicted at time step t-1. At the l-th layer, we have

    \hat{s}_t^l = attn_S(s_t^{l-1}, S_{<t+1}^{l-1}, S_{<t+1}^{l-1}),
    \tilde{s}_t^l = \frac{1}{2} \left( attn_B(\hat{s}_t^l, H_B, H_B) + attn_E(\hat{s}_t^l, H_E^L, H_E^L) \right), \quad s_t^l = FFN(\tilde{s}_t^l),    (4)

where attn_S, attn_B and attn_E denote the self-attention model, the BERT-decoder attention model and the encoder-decoder attention model respectively. This computation iterates over layers and we eventually obtain s_t^L. Finally, s_t^L is mapped via a linear transformation and a softmax to obtain the t-th predicted word \hat{y}_t. Decoding continues until the end-of-sentence token is produced. In addition, to ensure that the features obtained from BERT and from the conventional encoder are both fully exploited, a drop-net trick is used. In short, the BERT-fused model combines the output of BERT with attention modules and incorporates it into the machine translation model.
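As an illustration (not the authors' code), a BERT-fused encoder layer corresponding to Eqn. (1) can be sketched as below; the module names, the use of torch.nn.MultiheadAttention, and the 768-dimensional BERT output are our own simplifying assumptions, and the drop-net trick is reduced to a plain average.

```python
import torch.nn as nn

class BertFusedEncoderLayer(nn.Module):
    """Illustrative encoder layer: averages self-attention over encoder states
    and attention over the frozen BERT output H_B, then applies an FFN (Eqn. 1)."""
    def __init__(self, d_model=512, d_bert=768, n_heads=4, d_ffn=1024):
        super().__init__()
        self.attn_s = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True,
                                            kdim=d_bert, vdim=d_bert)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(),
                                 nn.Linear(d_ffn, d_model))

    def forward(self, h_prev, h_bert):
        # h_prev: (batch, src_len, d_model), h_bert: (batch, bert_len, d_bert)
        self_out, _ = self.attn_s(h_prev, h_prev, h_prev)   # attn_S over encoder states
        bert_out, _ = self.attn_b(h_prev, h_bert, h_bert)   # attn_B over BERT output
        fused = 0.5 * (self_out + bert_out)                 # the 1/2 average of Eqn. (1)
        return self.ffn(fused)                              # residuals/LayerNorm omitted
```

In the actual BERT-fused model, residual connections, layer normalization and the drop-net regularization are applied on top of this, and the BERT parameters are kept frozen.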

4 Experiments

4.1 Dataset

In our experiments, we use the IWSLT2017 En-Ar dataset (https://wit3.fbk.eu/2017-01-c), which was constructed from transcripts and manual translations of TED talks. As shown in Table 1, the dataset contains 235,527 parallel sentences in the training set and 888 parallel sentences in the validation set. It also contains 6 test sets, collected from TED talks given between 2010 and 2015. Figure 3 shows some examples of En→Ar and Ar→En translations produced by the baseline models. For each translation direction, we show a good case, a tricky case and a bad case (Example 1 to Example 3, respectively). As shown in Example 2, the Arabic sentence contains two unknown words. This causes two problems: 1) for En→Ar translation, the BLEU score does not reflect the actual translation quality; 2) for Ar→En translation, the MT model cannot generate correct text because information is missing. Therefore, it is important to reconsider the tokenization scheme for Arabic text.

                     Parallel sentences
  train              235,527
  dev                888
  test   tst2010     1,565
         tst2011     1,427
         tst2012     1,705
         tst2013     1,380
         tst2014     1,301
         tst2015     1,205

Table 1: Overview of the IWSLT2017 En-Ar dataset.
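For reference, this parallel data can be loaded through the Hugging Face datasets hub; the dataset and configuration names below are assumptions based on the public IWSLT2017 release, not something stated in the paper.

```python
from datasets import load_dataset

# Assumed dataset/config names on the Hugging Face hub for IWSLT2017 Arabic-English.
data = load_dataset("iwslt2017", "iwslt2017-ar-en")

print(data)                                   # train / validation / test splits
sample = data["train"][0]["translation"]      # {"ar": "...", "en": "..."}
print(sample["ar"], "->", sample["en"])
```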

4.2 Setting

Preprocessing  For English, we normalize punctuation, remove non-printing characters and lowercase all sentences using the mosesdecoder scripts (https://github.com/moses-smt/mosesdecoder). For Arabic, we first preprocess the text with Unicode normalization, orthographic normalization and dediacritization using camel_tools (Obeid et al., 2020), and then clean the text with camel_arclean from camel_tools (https://github.com/CAMeL-Lab/camel_tools). Preprocessing is a very important step, especially for Arabic, since Arabic texts are often inconsistent in terms of punctuation, digits, diacritics and spelling. In informal texts such as TED talks these phenomena occur frequently, exacerbating the data sparsity problem in Arabic.
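A minimal sketch of this Arabic cleaning step, assuming the normalization and dediacritization helpers exposed by the camel_tools Python package; the paper does not list exactly which orthographic normalizations were applied, so the choice below is illustrative.

```python
from camel_tools.utils.normalize import (
    normalize_unicode,
    normalize_alef_ar,
    normalize_alef_maksura_ar,
    normalize_teh_marbuta_ar,
)
from camel_tools.utils.dediac import dediac_ar

def preprocess_arabic(line: str) -> str:
    """Unicode + orthographic normalization followed by dediacritization (assumed pipeline)."""
    line = normalize_unicode(line)          # canonical Unicode forms
    line = normalize_alef_ar(line)          # collapse alef variants
    line = normalize_alef_maksura_ar(line)  # alef maksura -> yeh
    line = normalize_teh_marbuta_ar(line)   # teh marbuta -> heh
    return dediac_ar(line)                  # strip short-vowel diacritics

print(preprocess_arabic("أحبُّ الترجمةَ الآليةَ"))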
Tokenization  We use the byte-pair-encoding (BPE) (Sennrich et al., 2016) tokenization scheme in our experiments. For each language, the vocabulary size is set to 8,000. We train the BPE model on the training set and then use the trained model to tokenize the train/dev/test data.
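As an example of this step (the paper does not name the exact BPE tool, so the sentencepiece library below is an assumption), a language-specific BPE model with an 8,000-token vocabulary could be trained and applied as follows:

```python
import sentencepiece as spm

# Train a BPE model on the (preprocessed) Arabic training text.
spm.SentencePieceTrainer.train(
    input="train.ar",            # hypothetical path to the training-side text
    model_prefix="bpe_ar",
    vocab_size=8000,
    model_type="bpe",
)

# Apply the trained model to tokenize train/dev/test files.
sp = spm.SentencePieceProcessor(model_file="bpe_ar.model")
pieces = sp.encode("مرحبا بالعالم", out_type=str)
print(pieces)  # list of subword pieces
```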
Models  For Ar→En MT, we use BERT-base-arabic (https://huggingface.co/asafaya/bert-base-arabic) with a standard transformer-base model in our BERT-fused method. Similarly, we use BERT-base-english (https://huggingface.co/bert-base-uncased) for En→Ar MT. For the baseline, we use the Transformer for machine translation from Arabic to English as well as from English to Arabic. Since the training set is quite small, instead of the standard transformer-base architecture we use a smaller model with 6 layers, 4 attention heads, 512 embedding dimensions, and 1,024 feed-forward embedding dimensions for both the encoder and the decoder.
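To make the size of this baseline concrete, the following PyTorch snippet instantiates a Transformer with the same dimensions; it is only an illustration of the configuration, not the fairseq model actually trained in the paper.

```python
import torch.nn as nn

# Baseline dimensions reported above: 6 layers, 4 heads, 512-dim embeddings, 1,024-dim FFN.
baseline = nn.Transformer(
    d_model=512,
    nhead=4,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=1024,
)
n_params = sum(p.numel() for p in baseline.parameters())
print(f"{n_params / 1e6:.1f}M parameters (excluding embeddings and output projection)")
```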

Training Following the practice of (Zhu et al., baseline model achieves a 26.71 BLEU score on av-
2019), we first train the transformer until con- erage across the 6 test sets with a standard deviation
vergence and then initialize the encoder and de- of 1.80. The Bert-fused model achieves a 28.52
coder of the BERT-fused model with the obtained BLEU score on average, outperforming the base-
model. The BERT-encoder attention and BERT- line by an absolute improvement of 1.81 BLEU
decoder attention are randomly initialized. Dur- score. For En→Ar translation, the baseline model
ing training, all parameters in BERT are frozen. achieves a 12.78 BLEU score on average with a
We use fairseq8 for training. For each transla- standard deviation of 1.96. The Bert-fused model
tion direction, we train the BERT-fused model with achieves a 13.81 BLEU score on average, outper-
max_tokens = 4, 000 in each batch. It takes roughly forming the baseline by an absolute improvement
10 hours to train on a single Nvidia A6000 48G of 1.03 BLEU score. For both translation direc-
GPU. We also employ label smoothing of value 0.1 tions, incorporating BERT into Transformer results
during training. in consistent improvements across all six test sets
over the vanilla transformer model. In addition,
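For orientation, a baseline run with these batch and label-smoothing settings might be launched roughly as below; the data path, learning-rate schedule and remaining flags are assumptions (the paper does not list its full fairseq command), and the BERT-fused model itself requires the authors' extension of fairseq rather than the stock trainer.

```python
import subprocess

# Hypothetical invocation of the stock fairseq trainer for the Transformer baseline.
subprocess.run([
    "fairseq-train", "data-bin/iwslt17.ar-en",        # assumed binarized data directory
    "--arch", "transformer",
    "--encoder-layers", "6", "--decoder-layers", "6",
    "--encoder-attention-heads", "4", "--decoder-attention-heads", "4",
    "--encoder-embed-dim", "512", "--decoder-embed-dim", "512",
    "--encoder-ffn-embed-dim", "1024", "--decoder-ffn-embed-dim", "1024",
    "--max-tokens", "4000",
    "--criterion", "label_smoothed_cross_entropy", "--label-smoothing", "0.1",
    "--optimizer", "adam", "--lr", "5e-4",
    "--lr-scheduler", "inverse_sqrt", "--warmup-updates", "4000",
], check=True)
```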
Evaluation  For evaluation, we use the model with the best validation score to generate translations of the given source input. For decoding, we use beam search with a beam size of 5. The evaluation metric is the BLEU score (Papineni et al., 2002), which automatically measures word and phrase matching between the MT output and reference translations. Specifically, we use BLEU4, following common practice.
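For instance, corpus-level BLEU can be computed with the sacrebleu package; the file names are placeholders and the paper does not state which BLEU implementation it used, so treat this only as a sketch of the metric.

```python
import sacrebleu

# Hypothetical files: one detokenized sentence per line.
with open("hyp.en") as f:
    hypotheses = [line.strip() for line in f]
with open("ref.en") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])  # BLEU-4 by default
print(f"BLEU = {bleu.score:.2f}")
```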
4.3 Results

Main Results  Table 2 shows the performance of the baseline MT models and our BERT-fused models on the 6 separate test sets. For Ar→En translation, the baseline model achieves a 26.71 BLEU score on average across the 6 test sets, with a standard deviation of 1.80. The BERT-fused model achieves a 28.52 BLEU score on average, outperforming the baseline by an absolute improvement of 1.81 BLEU. For En→Ar translation, the baseline model achieves a 12.78 BLEU score on average, with a standard deviation of 1.96. The BERT-fused model achieves a 13.81 BLEU score on average, outperforming the baseline by an absolute improvement of 1.03 BLEU. For both translation directions, incorporating BERT into the Transformer yields consistent improvements over the vanilla Transformer across all six test sets. In addition, comparing the absolute performance of Ar→En MT and En→Ar MT, we can see that translating into Arabic is much more difficult than translating into English. However, it is also possible that BLEU4 is not an appropriate metric for evaluating Arabic translation.

Figure 3: Examples of En→Ar and Ar→En translations from the baseline models.

         Model              tst2010  tst2011  tst2012  tst2013  tst2014  tst2015   Avg    Std
  Ar→En  Transformer          25.86    26.5     29.73    27.76    24.66    25.73   26.71   1.80
         BERT-fused model     29.94    28.22    32.95    27.98    25.32    26.72   28.52   2.66
         ∆                    +4.08    +1.72    +3.22    +0.22    +0.66    +0.99   +1.81
  En→Ar  Transformer          10.06    11.52    15.37    14.3     11.96    13.48   12.78   1.96
         BERT-fused model     10.77    12.32    17.22    15.58    13.32    13.66   13.81   2.30
         ∆                    +0.71    +0.80    +1.85    +1.28    +1.36    +0.18   +1.03

Table 2: Results of baseline models and BERT-fused models. We report BLEU4 in this table. Bold denotes the best result.

Effect of Tokenization  Table 3 shows the average performance of Transformer models trained with different tokenization schemes. From Table 3, we can see that for both translation directions, using BPE results in a much better BLEU score than using whole-word tokenization. This is mainly because BPE addresses out-of-vocabulary issues.

  Model            Ar→En   En→Ar
  word, raw Ar     17.95    8.24
  BPE, raw Ar      26.31   10.53
  ∆                +8.36   +2.29
  word, clean Ar   21.02    9.93
  BPE, clean Ar    26.71   12.78
  ∆                +5.69   +2.85

Table 3: Results of Transformer models trained with different tokenization schemes. We report the average BLEU score over the six test sets. "word" refers to whole-word tokenization, "BPE" to byte-pair encoding; "raw Ar" refers to the raw Arabic corpus and "clean Ar" to the preprocessed Arabic corpus.

Effect of Preprocessing  Table 4 shows the average performance of Transformer models trained on the raw Arabic corpus and on the preprocessed Arabic corpus. When using BPE, the results of models trained on preprocessed Arabic and on raw Arabic are almost the same for Ar→En MT. This indicates that for Ar→En MT, preprocessing Arabic is not necessary when BPE is used, which makes it easier for people who do not understand any Arabic to perform this task. However, for En→Ar MT, preprocessing Arabic is important both with and without BPE.

  Model            Ar→En   En→Ar
  word, raw Ar     17.95    8.24
  word, clean Ar   21.02    9.93
  ∆                +3.07   +1.69
  BPE, raw Ar      26.31   10.53
  BPE, clean Ar    26.71   12.78
  ∆                +0.40   +2.25

Table 4: Results of Transformer models trained on the raw and the preprocessed Arabic corpus. We report the average BLEU score over the six test sets.

5 Conclusion and Future Work

In this work, we have used the Transformer as our baseline model for machine translation from Arabic to English as well as from English to Arabic. During the experiments, we performed Unicode normalization, orthographic normalization, dediacritization and BPE tokenization on the IWSLT2017 dataset. We compared the performance of the baseline MT models on 6 separate test sets, and the results on all of the sets are quite good. We not only explored the low-resource translation directions Ar→En and En→Ar, but also leveraged pre-trained BERT and fused it into the Transformer, and the results improved consistently across the six test sets for both directions. Furthermore, we found that preprocessing Arabic is critical for translating English to Arabic; we think this could also hold for some other languages. In future work, we will continue this work by using Arabic-BERT instead of English-BERT to help En→Ar MT. We also think that the evaluation metric for Arabic translation can be improved (e.g., BERTScore (Zhang et al., 2019)), because the current metric may not be appropriate for Arabic.
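As a pointer for that future direction, BERTScore can be computed with the bert-score package; the snippet below is only a sketch with made-up example sentences, and the multilingual model it uses for Arabic is chosen by the library, not by this paper.

```python
from bert_score import score

candidates = ["الترجمة الآلية مفيدة جدا"]      # system outputs (Arabic)
references = ["الترجمة الآلية مفيدة للغاية"]    # reference translations

# lang="ar" lets the library pick a suitable multilingual model under the hood.
P, R, F1 = score(candidates, references, lang="ar")
print(f"BERTScore F1 = {F1.mean().item():.4f}")
```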
References

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual nlp. arXiv preprint arXiv:1307.1662.

Amjad Almahairi, Kyunghyun Cho, Nizar Habash, and Aaron Courville. 2016. First result on arabic neural machine translation. arXiv preprint arXiv:1606.02680.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Shuoyang Ding, Adithya Renduchintala, and Kevin Duh. 2019. A call for prudent choice of subword merge operations in neural machine translation. arXiv preprint arXiv:1905.10453.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26.

Ossama Obeid, Nasser Zalmout, Salam Khalifa, Dima Taji, Mai Oudah, Bashar Alhafni, Go Inoue, Fadhl Eryani, Alexander Erdmann, and Nizar Habash. 2020. CAMeL tools: An open source python toolkit for Arabic natural language processing. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 7022–7032, Marseille, France. European Language Resources Association.

Mai Oudah, Amjad Almahairi, and Nizar Habash. 2019. The impact of preprocessing on arabic-english statistical and neural machine translation. arXiv preprint arXiv:1906.11751.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv.

Ye Qi, Devendra Singh Sachan, Matthieu Felix, Sarguna Janani Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? arXiv preprint arXiv:1804.06323.

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training.

Shuo Ren, Wenhu Chen, Shujie Liu, Mu Li, Ming Zhou, and Shuai Ma. 2018. Triangular architecture for rare language translation. arXiv preprint arXiv:1805.04813.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Uri Shaham and Omer Levy. 2020. Neural machine translation without embeddings. arXiv preprint arXiv:2008.09396.

Pamela Shapiro and Kevin Duh. 2018. Morphological word embeddings for arabic neural machine translation in low-resource settings. In Proceedings of the Second Workshop on Subword/Character LEvel Models, pages 1–11.
Abu Bakr Soliman, Kareem Eissa, and Samhaa R El-Beltagy. 2017. Aravec: A set of arabic word embedding models for use in arabic nlp. Procedia Computer Science, 117:256–265.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2019. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.

Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tieyan Liu. 2019. Incorporating bert into neural machine translation. In International Conference on Learning Representations.

Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. arXiv preprint arXiv:1604.02201.

