Comparing Machine Translation Accuracy of Attention Models
Abstract—Machine translation models built on the encoder-decoder architecture do not give accuracy as high as expected. One reason for this ineffectiveness is the lack of an attention mechanism during the training phase. Attention-based models overcome the drawbacks of earlier ones and obtain a noteworthy improvement in accuracy. In this paper, we experiment with three attention models and evaluate their BLEU scores on small data sets. The Bahdanau model achieves high accuracy, the Transformer model obtains good accuracy, while the Luong model only reaches acceptable accuracy.

Keywords—attention, embedding, multi-head, recurrent
I. INTRODUCTION

Today, English has become a global language, and this leads to the need for translation between English and other languages. Bilingual translation supports those who are beginning to learn English. For instance, it helps pupils reduce grammar mistakes and improve their writing skills. In recent years, the neural machine translation approach has been researched and developed.
One challenge for machine translation models is that languages have different grammar structures. Moreover, one word of the source language can be translated into several different words of the target language. Therefore, source and target sentences will convey different meanings if sentences are only translated sequentially. In fact, we translate a source sentence into a target sentence after understanding the meaning and context of the source sentence. In order to generate correct target sentences, mechanisms have been devised to build a machine translation model that simulates this human way of thinking.
Many machine translation models are based on neural networks, and one of them was first introduced in a scientific paper in 2014. This model used the Encoder-Decoder architecture, in which the learning stages applied recurrent neural networks. Since then, new mechanisms have been introduced to improve the effectiveness of machine translation models. Typical such mechanisms are global attention, local attention, and self-attention; they are described in the Bahdanau, Luong, and Transformer models.
In this paper, we focus on comparing the machine translation accuracy of three models. The project experiments with the models on small data sets of English-Vietnamese sentences. BLEU scores are used to evaluate the accuracy of neural machine translation [4]. For the 1st set, the differences between the maximum BLEU scores of the models are insignificant. In contrast, on the 2nd set, the BLEU scores of the Bahdanau model are much higher than those of the remaining two models. When working on the test set, the Bahdanau model achieves relatively good BLEU scores, while the Transformer and Luong models obtain moderate and rather low BLEU scores, respectively.
II. THEORETICAL BACKGROUND

A. Machine Translation with Recurrent Neural Networks

The architecture of a recurrent neural network (RNN) consists of input, hidden, and output layers. The input layer takes vectors that represent sequential data. Units in the hidden layer are connected to each other through time by recurrent connections. At each timestep, the output layer generates an element of the output sequence, which is inferred from the hidden units and weighted connections as well as non-linear activations [2].
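To make this structure concrete, the following minimal sketch implements one RNN forward pass. The weight names (W_xh, W_hh, W_hy) and the toy dimensions are our own illustrative assumptions, not notation from the paper.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a simple RNN over a list of input vectors xs."""
    h = np.zeros(W_hh.shape[0])               # initial hidden state
    outputs = []
    for x in xs:
        # hidden units depend on the current input and, through the
        # recurrent connection W_hh, on the previous hidden state
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        # the output at each timestep is inferred from the hidden units
        outputs.append(W_hy @ h + b_y)
    return outputs, h

# toy sizes: 4-dimensional inputs, 8 hidden units, 5-dimensional outputs
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(5, 8))
outs, _ = rnn_forward([rng.normal(size=4) for _ in range(3)],
                      W_xh, W_hh, W_hy, np.zeros(8), np.zeros(5))
```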
Recurrent neural networks are well suited to sequential-data problems. In one article, Cho and colleagues introduce an Encoder-Decoder architecture for neural machine translation. The mechanism of this architecture is that the Encoder builds a fixed-length vector from an input sequence, and the Decoder uses that vector to predict the output sequence [7]. The article proposes two models, one of which is based on RNNs.
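The core idea of [7] can be sketched as two such recurrent passes. This is a simplified illustration under our own naming conventions, not the exact model from the article.

```python
import numpy as np

def encode(source, W_e, U_e):
    """Encoder: compress the whole source sequence into one fixed-length vector."""
    h = np.zeros(U_e.shape[0])
    for x in source:
        h = np.tanh(W_e @ x + U_e @ h)
    return h                                   # the fixed-length context vector

def decode(context, steps, W_d, U_d, V_d):
    """Decoder: predict output scores conditioned on the context vector alone."""
    h = context.copy()
    y = np.zeros(W_d.shape[1])                 # previous output, initially empty
    outputs = []
    for _ in range(steps):
        h = np.tanh(W_d @ y + U_d @ h)         # state driven by context and last output
        y = V_d @ h                            # unnormalized scores for the next token
        outputs.append(y)
    return outputs
```

Note that the decoder sees the source only through the single context vector; this bottleneck is precisely what the attention mechanisms compared in this paper remove.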
In theory, an RNN can learn from variable-length sequences. However, it does not perform well on tasks involving long-term dependencies. The activation function is a cause of this problem: hyperbolic tangent and sigmoid activations cause vanishing gradients [2, 13]; ReLU is not resistant to a large gradient flow, and as the weight matrix grows, neurons may become inactive during the training phase. One author explains that vanishing and exploding gradients are responsible for this weakness of RNNs [1]. In the Back-Propagation Through Time algorithm, if gradient magnitudes converge to zero rapidly after several epochs, the gradients no longer contribute anything to the learning phase. On the other hand, an exploding gradient might occur: gradient magnitudes increase quickly, so the learning phase becomes unstable.
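The following toy computation (our own illustration, not from the paper) shows this effect: backpropagating through many timesteps repeatedly multiplies the gradient by the transposed recurrent matrix, so its norm collapses or explodes depending on the scale of that matrix.

```python
import numpy as np

def bptt_gradient_norm(scale, steps=50, n=8):
    """Propagate a gradient back through `steps` timesteps of a linear RNN."""
    rng = np.random.default_rng(1)
    W = scale * rng.normal(size=(n, n)) / np.sqrt(n)   # recurrent weight matrix
    g = np.ones(n)                                     # gradient at the final timestep
    for _ in range(steps):
        g = W.T @ g                # one BPTT step (non-linearity omitted for clarity)
    return np.linalg.norm(g)

print(bptt_gradient_norm(0.5))     # near zero: vanishing gradient
print(bptt_gradient_norm(2.0))     # enormous: exploding gradient
```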
Two popular variants of the RNN are Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), which have various applications in natural language processing. An LSTM contains three gates called "input", "forget", and "output". They control the flow of information to the hidden neurons and prevent the rest of the network from modifying the content of the memory cell. Meanwhile, a GRU has two gates: the "update" gate controls how the present input and the previous hidden state are used to compute the current state, while the "reset" gate allows the unit to ignore the previous hidden state [2, 3]. LSTM and GRU have some points in common: they are capable of learning long-term dependencies and are robust to vanishing gradients [2]. As for the relative effectiveness of LSTM and GRU, any conclusion holds only for specific tasks.
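As a concrete picture of the two gates, here is a minimal GRU step in NumPy; the weight names are our own and biases are omitted for brevity, so this is a sketch rather than a faithful reimplementation of [3].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    """One GRU timestep with update gate z and reset gate r (biases omitted)."""
    z = sigmoid(W_z @ x + U_z @ h_prev)              # update: mix old state with candidate
    r = sigmoid(W_r @ x + U_r @ h_prev)              # reset: optionally ignore previous state
    h_cand = np.tanh(W_h @ x + U_h @ (r * h_prev))   # candidate state
    return (1.0 - z) * h_prev + z * h_cand           # new hidden state
```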
$$
PE_{(p,\,2l)} = \sin\!\left(\frac{p}{10000^{2l/d_{model}}}\right), \qquad
PE_{(p,\,2l+1)} = \cos\!\left(\frac{p}{10000^{2l/d_{model}}}\right) \tag{5}
$$

Fig. 5. Losses of three models.
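Equation (5) is the Transformer's sinusoidal positional encoding, where p is the token position, l indexes the embedding dimension, and d_model is the embedding size. A direct NumPy implementation might look as follows (the function name and shapes are our own choices):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[p, 2l] = sin(p / 10000^(2l/d_model)), PE[p, 2l+1] = cos(same angle)."""
    p = np.arange(max_len)[:, None]                # positions, shape (max_len, 1)
    two_l = np.arange(0, d_model, 2)[None, :]      # even dimension indices 2l
    angle = p / np.power(10000.0, two_l / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions take the sine
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions take the cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)  # one row per token position
```

Each position thus receives a unique pattern of sines and cosines, which lets the model recover word order without recurrence.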