
2020 7th NAFOSTED Conference on Information and Computer Science (NICS)

Comparing Machine Translation Accuracy of Attention Models

Dat Pham Tuan
Faculty of Information Technology
Vietnam Maritime University
Hai Phong, Vietnam
datpt@vimaru.edu.vn

Duy Pham Ngoc
Faculty of Information Technology
Vietnam Maritime University
Hai Phong, Vietnam
duypn@vimaru.edu.vn

Abstract—Machine translation models using the encoder-decoder architecture do not give accuracy as high as expected. One reason for this ineffectiveness is the lack of an attention mechanism during the training phase. Attention-based models overcome the drawbacks of previous ones and obtain noteworthy improvements in terms of accuracy. In this paper, we experiment with three attention models and evaluate their BLEU scores on small data sets. The Bahdanau model achieves high accuracy, the Transformer model obtains good accuracy, while the Luong model only gets acceptable accuracy.

Keywords—attention, embedding, multi-head, recurrent

I. INTRODUCTION

Today, English has become a global language, and this leads to the need for translation between English and other languages. Bilingual translation supports those who begin learning English. For instance, it helps pupils reduce grammar mistakes and enhance their writing skill. In recent years, the approach of neural machine translation has been researched and developed.

One challenge for machine translation models is that languages have different grammar structures. Moreover, one word of the source language can be translated into several different words of the target language. Therefore, source and target sentences will carry different meanings if sentences are translated only sequentially. In fact, we translate a source sentence into a target sentence after understanding the meaning and context of the source sentence. In order to generate correct target sentences, mechanisms have been proposed to build a machine translation model that simulates human thinking.

Many machine translation models are based on neural networks, and one of them was first introduced in a scientific paper in 2014. This model used the Encoder-Decoder architecture, in which the learning stages applied recurrent neural networks. Since then, mechanisms have been innovated to improve the effectiveness of machine translation models. Typical mechanisms are global attention, local attention and self-attention, which are described in the Bahdanau, Luong, and Transformer models.

In this paper, we focus on comparing the machine translation accuracy of three models. The project experiments with the models on small data sets that include English-Vietnamese sentence pairs. BLEU scores are used to evaluate the accuracy of neural machine translation [4]. For the 1st set, the differences between the maximum BLEU scores of the models are insignificant. In contrast, on the 2nd set, the BLEU scores of the Bahdanau model are much higher than those of the remaining two models. When working on the test set, the Bahdanau model achieves relatively good BLEU scores while the Transformer and Luong models obtain moderate and rather low BLEU scores respectively.

II. THEORETICAL BACKGROUND

A. Machine Translation with Recurrent Neural Networks

The architecture of a recurrent neural network (RNN) consists of input, hidden, and output layers. The input layer takes vectors that represent sequential data. Units in the hidden layer are connected to each other through time with recurrent connections. At each timestep, the output layer generates a sequence, which is inferred from the hidden units and weighted connections as well as non-linear activations [2].

Recurrent neural networks are the most suitable models for sequential data problems. In an article, Cho and colleagues introduce an Encoder-Decoder architecture for neural machine translation. The mechanism of this architecture is that the Encoder builds a fixed-length vector from an input sequence, and the Decoder uses that vector to predict the output sequence [7]. The article proposes two models, one of which is based on the RNN.

In theory, an RNN can learn variable-length sequences. However, it does not perform well on tasks involving long-term dependencies. The activation function is one cause of this problem: hyperbolic tangent and sigmoid activations cause vanishing gradients [2, 13]; ReLU is not resistant to a large gradient flow, and as the weight matrix grows, neurons may become inactive during the training phase. Sherstinsky explains that vanishing and exploding gradients are responsible for the weakness of the RNN [1]. In the Back Propagation Through Time algorithm, if gradient magnitudes converge to zero rapidly after several epochs, then the gradients will not contribute anything to the learning phase. On the other hand, exploding gradients might happen: gradient magnitudes increase fast, so the learning phase becomes unstable.

Two popular variants of the RNN are Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU), which have various applications in natural language processing. LSTM contains three gates called "input", "forget", and "output". They control the flow of information to hidden neurons and prevent the rest of the network from modifying the content of the memory cell. Meanwhile, GRU has two gates: "update" controls how the present input and the previous hidden state are used to compute the current state, and "reset" allows the unit to ignore the previous hidden state [2, 3]. LSTM and GRU have some common points: they are capable of learning long-term dependencies and are robust to vanishing gradients [2]. As for comparing the effectiveness of LSTM and GRU, any conclusion only holds for specific tasks.
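To make the Encoder-Decoder idea of this subsection concrete, the sketch below builds a minimal GRU-based model with tf.keras. It is only an illustration in the spirit of [7] and [8]; the vocabulary sizes, dimensions and variable names are assumptions, not the configuration used in our experiments.

```python
# Minimal Encoder-Decoder sketch with GRU (illustrative sizes, not the paper's setup).
import tensorflow as tf

VOCAB_SRC, VOCAB_TGT, EMB_DIM, UNITS = 5000, 5000, 256, 512

# Encoder: compresses the whole source sequence into one fixed-length state vector.
enc_in = tf.keras.Input(shape=(None,), name="source_tokens")
enc_emb = tf.keras.layers.Embedding(VOCAB_SRC, EMB_DIM)(enc_in)
_, enc_state = tf.keras.layers.GRU(UNITS, return_state=True)(enc_emb)

# Decoder: predicts the target sequence conditioned only on that single vector.
dec_in = tf.keras.Input(shape=(None,), name="target_tokens")
dec_emb = tf.keras.layers.Embedding(VOCAB_TGT, EMB_DIM)(dec_in)
dec_out = tf.keras.layers.GRU(UNITS, return_sequences=True)(dec_emb, initial_state=enc_state)
logits = tf.keras.layers.Dense(VOCAB_TGT)(dec_out)

model = tf.keras.Model([enc_in, dec_in], logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```

The single state vector passed from the encoder to the decoder is exactly the fixed-length bottleneck that the attention mechanisms in Section II.B are designed to remove.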

Fig. 1. Model of Sequence to Sequence.

LSTM is one solution to the problems of the ordinary RNN. In 2014, Sutskever published a Sequence to Sequence model using LSTM. The results of the experiment showed that LSTM overcame the weakness of the RNN [8]. Nevertheless, similar to the Encoder-Decoder architecture, the Sequence to Sequence model also mapped each input sequence into a vector of fixed dimensionality.

Bahdanau and Luong are the first authors to propose a new method, which is known as the attention mechanism. There is a semantic relationship among the words in each sentence, so its context plays an essential role. To predict the words of the output sequence, the Decoder needs information about the relevant words in the context. The attention mechanism assigns weights to the words in each sentence and allows the Decoder to concentrate on suitable words.

B. Attention Mechanisms

Bahdanau inserts the attention mechanism into the Encoder-Decoder model. The Decoder finds the positions in a source sentence where the most relevant information is concentrated [9]. The weight α_ij gives the probability of Y_i being translated from X_j.

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}    (1)

A choice for e_ij is a fully connected layer of a feed-forward network with the hyperbolic tangent activation. Each annotation h_i contains information about the input sequence with a strong focus on the words around the i-th word. The context vector c_i is computed as a weighted sum. Finally, the model combines the context vector with the previously generated words to predict the current target word.

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j    (2)

Fig. 2. Generating the tth target word in Bahdanau model.

The differences between the Luong and Bahdanau mechanisms lie in the score functions and computation paths. Additionally, Luong classifies attention-based models into local and global categories, whereas the mechanism of the Bahdanau model is similar to the global attention approach.

The global attention of the Luong model also generates a context vector to predict the target word in the output sentence.

Fig. 3. Global attention of Luong mechanism.

There are three choices of score estimation: Dot, General, and Concat. They are used to estimate the alignment vector.

\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^{\top} \bar{h}_s & \text{(dot)} \\
h_t^{\top} W_a \bar{h}_s & \text{(general)} \\
v_a^{\top} \tanh(W_a [h_t; \bar{h}_s]) & \text{(concat)}
\end{cases}    (3)

Local attention focuses on a few source positions instead of all source positions. As a result, it takes less time in the training phase than global attention. One variant of this mechanism is monotonic alignment, which generates an aligned position (p_t = t) for the t-th target word. The score estimations are used to compute the alignment vector, but its parameters are the current target state and the source states within the window [10]. The context vector is derived as a weighted average of the source hidden states in the window.

The Transformer model computes query, key and value matrices from the inputs. Next, it calculates attention functions on those matrices. Dot-product attention in the model is used to estimate the compatibility between a query and the corresponding key. To avoid extremely small gradients, the model divides each of the dot products by a scale factor. Multi-head attention lets the model find dependency relationships between words in each sentence.

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}), \quad i = 1, \ldots, H
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)W^{O}    (4)

Fig. 4. Architecture of Transformer model.
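The formulas in Eqs. (1)-(4) translate directly into code. The NumPy sketch below is only an illustration of the score functions and attention outputs; the parameter names (W_a, U_a, v_a) and shapes are our assumptions, not the implementation used in the experiments.

```python
# Illustrative NumPy versions of Eqs. (1)-(4); shapes and parameters are assumptions.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bahdanau_context(s_prev, H, W_a, U_a, v_a):
    """Additive attention: e_j = v_a . tanh(W_a s_prev + U_a h_j), Eqs. (1)-(2)."""
    e = np.tanh(s_prev @ W_a + H @ U_a) @ v_a       # (Tx,) alignment scores
    alpha = softmax(e)                              # Eq. (1): attention weights
    return alpha @ H, alpha                         # Eq. (2): context vector c_i

def luong_score(h_t, h_s, kind="dot", W_a=None, v_a=None):
    """The three score choices of Eq. (3): dot, general, concat."""
    if kind == "dot":
        return h_t @ h_s
    if kind == "general":
        return h_t @ W_a @ h_s
    if kind == "concat":
        return v_a @ np.tanh(W_a @ np.concatenate([h_t, h_s]))
    raise ValueError(kind)

def scaled_dot_product_attention(Q, K, V):
    """Eq. (4): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

# Tiny demo: one decoder state attending over three encoder annotations of size 4.
H = np.random.rand(3, 4)
c, alpha = bahdanau_context(np.random.rand(4), H,
                            np.random.rand(4, 8), np.random.rand(4, 8), np.random.rand(8))
```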


Word embedding represents each token as a vector of floating-point values. This representation saves memory if the word embedding dimensionality is smaller than the number of words in the data set. It also allows finding semantic relationships between words in each sentence [5, 6]. Other representations have disadvantages compared to word embedding. For example, one-hot encoding is appropriate for pattern classification, but it should not be used to represent words because it wastes memory.
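As a rough illustration of the memory argument above, the sketch below contrasts a one-hot representation with an embedding lookup; the vocabulary size, embedding dimensionality and token ids are arbitrary assumptions.

```python
# One-hot vs. embedding representation of a three-token sentence (illustrative sizes).
import numpy as np

vocab_size, emb_dim = 10_000, 256
sentence = [17, 42, 933]                             # token ids

one_hot = np.zeros((len(sentence), vocab_size), dtype=np.float32)
one_hot[np.arange(len(sentence)), sentence] = 1.0    # 10,000 values per token

table = np.random.normal(size=(vocab_size, emb_dim)).astype(np.float32)
embedded = table[sentence]                           # 256 values per token

print(one_hot.shape, embedded.shape)                 # (3, 10000) (3, 256)
```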

The model does not contain recurrence or convolution, so it lacks information about the order of the words in each sentence. The Encoder and the Decoder will know the relative position of each token in the sequence if the model encodes positions in the sequences and adds sinusoidal positional encoding to both parts. The positional encoding has the same dimensionality as the word embedding.

PE_{(p,\,2l)} = \sin\!\left(\frac{p}{10000^{2l/d_{\mathrm{model}}}}\right), \quad
PE_{(p,\,2l+1)} = \cos\!\left(\frac{p}{10000^{2l/d_{\mathrm{model}}}}\right)    (5)
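Equation (5) can be transcribed almost line for line; the NumPy sketch below is our illustration (the sequence length and d_model are arbitrary), and its output would simply be added to the word embeddings.

```python
# Sinusoidal positional encoding of Eq. (5); max_len and d_model are illustrative.
import numpy as np

def positional_encoding(max_len, d_model):
    p = np.arange(max_len)[:, None]                 # positions p
    two_l = np.arange(0, d_model, 2)[None, :]       # even indices 2l
    angle = p / np.power(10000.0, two_l / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                     # PE(p, 2l)
    pe[:, 1::2] = np.cos(angle)                     # PE(p, 2l + 1)
    return pe

pe = positional_encoding(max_len=50, d_model=256)   # added to the word embeddings
```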
Multi-head attention of the Transformer model has the following features: the model applies the self-attention mechanism in both the Encoder and the Decoder; Encoder-Decoder attention layers let each position in the Decoder attend over all positions in the input sequence; furthermore, every position in the Encoder can attend over all positions in the previous layer of the Encoder; lastly, in the masked sub-layer of the Decoder, predictions for position i depend only on the known outputs at positions less than i [11].
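The last property is usually enforced with a look-ahead mask applied to the decoder's self-attention scores before the softmax; the small sketch below is our illustration of such a mask, not the authors' code.

```python
# Look-ahead (causal) mask: position i may not attend to positions after i.
import numpy as np

def look_ahead_mask(size):
    # Ones above the main diagonal mark the future positions to be blocked.
    return np.triu(np.ones((size, size)), k=1)

scores = np.random.rand(5, 5)                             # raw self-attention scores
masked = np.where(look_ahead_mask(5) == 1, -1e9, scores)  # blocked entries vanish after softmax
```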
III. EXPERIMENTAL WORK

A. Building Attention-Based Models

The principle of the attention mechanism is adaptable to various networks such as CNN, LSTM, GRU, etc. Therefore, in the project, the Luong model uses LSTM and the Bahdanau model uses GRU. In the Transformer model, both the Encoder and the Decoder have 6 layers, and each layer contains 8 heads.

The optimizer in the three models is Adam [12]. The parameters of this algorithm in the Luong and Bahdanau models are β1 = 0.9, β2 = 0.999, and a learning rate of 10^-3. For the Transformer model, β1 = 0.9, β2 = 0.98, and the learning rate depends on the dimensionality of the model [11]. The translation accuracy of this model may decrease if its learning rate is constant and equal to 10^-3.

B. Experiment and Comparison

Data sets are collected from "manythings.org/anki" and "https://sites.google.com/a/uit.edu.vn/hungnq/evbcorpus/EVBCorpus_EVBNews_v1.0.rar?attredirects=0&d=1". The 1st set (4082 pairs) includes short sentences and the 2nd set (939 pairs) includes long sentences.

In the project, sparse categorical cross entropy is used to estimate the loss of the three models. During the training phase on a subset of the 1st set, the loss of the Transformer model is higher than the losses of the Bahdanau and Luong models, as shown in Figure 5. The project uses the NLTK library to evaluate BLEU scores (1-gram and 2-gram) for each pair (input sentence and predicted sentence). From the BLEU scores of the three models on the set, we assess the accuracy of the machine translation models on that set.

Fig. 5. Losses of three models.

Fig. 6. BLEU scores of Bahdanau model.

The project builds the Luong model with global attention. In order to get the results in Table I, the project sets up a 512-dimensional embedding for this model. Comparing the score estimations of the model in the training phase, Concat and General give much higher BLEU scores than Dot. In contrast, Dot takes less time for training than Concat and General. Decreasing the dimensionality of the word vectors from 512 to 256, the BLEU scores of the model decline significantly.

TABLE I. BLEU SCORES OF LUONG MODEL

Function    1-Gram    2-Gram
General     0.90      0.87
Dot         0.32      0.18
Concat      0.95      0.94

TABLE II. BLEU SCORES OF MODELS ON THE 1ST SET

Model        1-Gram    2-Gram
Luong        0.90      0.87
Bahdanau     0.95      0.94
Transformer  0.96      0.94

The project sets up a 256-dimensional embedding for both the Transformer and Bahdanau models. Although the word embedding dimensionality of the Luong model is twice as large as that of the Transformer model, the BLEU scores of the Luong model are not higher than those of the Transformer model in the training phase on the 1st set.

We extract 929 short sentence pairs from the 1st set and create the test set by modifying a few words in each pair. Obviously, the words of the test set must be found in the 1st set.
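The per-pair 1-gram and 2-gram BLEU evaluation can be reproduced with NLTK's sentence_bleu. The weight settings and smoothing below are our assumptions about how the two scores were computed; the example pair is adapted from the test-set results in Table III.

```python
# Per-pair BLEU with NLTK; weights for 1-gram and 2-gram scores are our assumption.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "i'm not sure tom told you the truth".split()
candidate = "i'm sure tom told you might be truth".split()

smooth = SmoothingFunction().method1
bleu_1 = sentence_bleu([reference], candidate, weights=(1.0,), smoothing_function=smooth)
bleu_2 = sentence_bleu([reference], candidate, weights=(0.5, 0.5), smoothing_function=smooth)
print(round(bleu_1, 2), round(bleu_2, 2))
```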


Fig. 7. Visualization of attention weights.

Figure 7 illustrates the attention weights of the words in a pair (an input sentence in the test set and its predicted sentence) after the translation phase of the Bahdanau model. Weights are displayed as coloured cells; bright and dark colours express large and small values respectively. However, if the attention weights are not evaluated exactly, the output sequence may be wrong. In Table III, for the input sentence "Tôi không chắc chắn Tom đã nói thật với bạn", the Transformer and Bahdanau models predict several words incorrectly.

TABLE III. EXAMPLE FROM THE TEST SET

Actual sentence: I'm not sure Tom told you the truth.

Model        Predictive Sentence
Luong        I'm not sure Tom told you the truth.
Transformer  I'm sure Tom told you might be truth.
Bahdanau     I don't Tom told you the truth.

The Luong model generates the highest proportion of incorrect words in the test phase, so it gives rather low BLEU scores (0.56 and 0.40). The BLEU scores of the Transformer model are around 0.61 and 0.46. The maximum BLEU scores belong to the Bahdanau model: 0.70 and 0.60.

TABLE IV. BLEU SCORES OF MODELS ON THE TEST SET

Model        1-Gram    2-Gram
Luong        0.56      0.40
Bahdanau     0.70      0.60
Transformer  0.61      0.46

TABLE V. BLEU SCORES OF MODELS ON THE 2ND SET

Model        1-Gram    2-Gram
Luong        0.59      0.53
Bahdanau     0.99      0.98
Transformer  0.75      0.71

The project experiments with another training phase on the 2nd set. If the project sets up a 256-dimensional embedding for the Bahdanau and Transformer models, they give unreliable translation accuracy. Thus, the project increases the word embedding dimensionality from 256 to 512. In the training phase, the Bahdanau model achieves nearly perfect BLEU scores and the Transformer model maintains good BLEU scores, but the results of the Luong model fall below 0.6. Generally, the Luong mechanism can cope with long sentences, and in this case the Luong model translates many long sequences correctly. The model chooses Adam instead of Adamax, because the experiment on the 2nd set shows that the model performs poorly with Adamax. Perhaps the small amount of data in this set is one cause that reduces the effectiveness of the Luong model.

Table VI shows the result of neural machine translation for a given sentence in the 2nd set (training phase): "Osd thường xảy ra ở thanh thiếu niên hay hoạt động thể chất gần thời điểm bắt đầu lớn nhanh như thổi của chúng, thời gian này kéo dài xấp xỉ gần 2 năm - đây là thời gian phát triển nhanh nhất của các bạn thanh thiếu niên".

TABLE VI. EXAMPLE FROM THE 2ND SET

Actual sentence: Osd usually strikes active teens around the beginning of their growth spurts, the approximately 2-year period during which they grow most rapidly.

Model        Predictive Sentence
Luong        Osd is a group of injuries that part at the brands and needs are easy at an earlier age.
Transformer  Osd usually strikes active teens around the beginning of their growth spurts, they're 2-year period during which they grow one way too.
Bahdanau     Osd usually strikes active teens around the beginning of their growth spurts, the approximately 2-year period during which they grow most rapidly.

IV. CONCLUSION

Neural machine translation has attracted research projects in the field of deep learning. The authors study and search for an optimal model based on neural networks; recurrent neural networks have proven to be the most suitable models for this problem.

Encoder-Decoder models achieve impressive results, but they do not give reliable accuracy. Hence, research projects have innovated models with attention mechanisms. The experiments in those articles show that attention mechanisms improve the effectiveness of neural machine translation.

In our paper, the project chooses the attention mechanisms of Luong, Bahdanau and the Transformer. With the Python language and relevant libraries, the project experiments with the models on small data sets of bilingual sentences. In general, the three models get good BLEU scores on the 1st set; the Bahdanau model gains the highest BLEU scores on the 2nd set. When the project experiments with the models on the test set, the Bahdanau model obtains relatively good accuracy while the Transformer and Luong models achieve moderate and rather low accuracy respectively.

REFERENCES

[1] Alex Sherstinsky, "Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network", Physica D: Nonlinear Phenomena (Elsevier), vol. 404, March 2020.
[2] Hojjat Salehinejad, Sharan Sankar, Joseph Barfett, Errol Colak, and Shahrokh Valaee, "Recent Advances in Recurrent Neural Networks", arXiv:1801.01078v3, February 2018.
[3] Rahul Dey and Fathi M. Salem, "Gate-Variants of Gated Recurrent Unit (GRU) Neural Network", IEEE Conference, August 2017.
[4] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, "BLEU: a Method for Automatic Evaluation of Machine Translation", Proceedings of ACL, July 2002, pp. 311-318.
[5] Amit Mandelbaum and Adi Shalev, "Word Embeddings and Their Use in Sentence Classification Tasks", arXiv:1610.08229v1, October 2016.
[6] Tim vor der Brück and Marc Pouly, "Text Similarity Estimation Based on Word Embedding and Matrix Norms for Targeted Marketing", Proceedings of the Conference of the Association for Computational Linguistics, vol. 1, June 2019.
[7] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio, "On the Properties of Neural Machine Translation: Encoder-Decoder Approaches", Association for Computational Linguistics, October 2014.
[8] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to Sequence Learning with Neural Networks", Proceedings of the 27th International Conference on Neural Information Processing Systems, vol. 2, December 2014.
[9] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR, 2015.
[10] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning, "Effective Approaches to Attention-based Neural Machine Translation", Proceedings of EMNLP, 2015.
[11] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention Is All You Need", 31st Conference on Neural Information Processing Systems (NIPS), 2017.
[12] Diederik P. Kingma and Jimmy Lei Ba, "Adam: A Method for Stochastic Optimization", ICLR, 2015.
[13] Jianli Feng and Shengnan Lu, "Performance Analysis of Various Activation Functions in Artificial Neural Networks", Journal of Physics: Conference Series, 2019.
