CS 224n: Assignment #4

This assignment is split into two sections: Neural Machine Translation with RNNs and Analyzing NMT Systems. The first is primarily coding and implementation focused, whereas the second consists entirely of written analysis questions. If you get stuck on the first section, you can always work on the second. That being said, the NMT system is more complicated than the neural networks we have previously constructed within this class and takes about 4 hours to train on a GPU. Thus, we strongly recommend you get started early with this assignment. Finally, the notation and implementation of the NMT system are a bit tricky, so if you ever get stuck along the way, please come to Office Hours so that the TAs can support you.

1. Neural Machine Translation with RNNs (45 points)


In Machine Translation, our goal is to convert a sentence from the source language (e.g. Spanish) to the
target language (e.g. English). In this assignment, we will implement a sequence-to-sequence (Seq2Seq)
network with attention, to build a Neural Machine Translation (NMT) system. In this section, we
describe the training procedure for the proposed NMT system, which uses a Bidirectional LSTM
Encoder and a Unidirectional LSTM Decoder.

Figure 1: Seq2Seq Model with Multiplicative Attention, shown on the third step of the decoder. Note that
for readability, we do not picture the concatenation of the previous combined-output with the decoder input.

Given a sentence in the source language, we look up the word embeddings from an embeddings matrix, yielding $x_1, \ldots, x_m$ ($x_i \in \mathbb{R}^{e \times 1}$), where $m$ is the length of the source sentence and $e$ is the embedding size. We feed these embeddings to the bidirectional Encoder, yielding hidden states and cell states for both the forwards (→) and backwards (←) LSTMs. The forwards and backwards versions are concatenated to give hidden states $h_i^{enc}$ and cell states $c_i^{enc}$:


$$h_i^{enc} = [\overleftarrow{h_i^{enc}}; \overrightarrow{h_i^{enc}}] \quad \text{where } h_i^{enc} \in \mathbb{R}^{2h \times 1},\ \overleftarrow{h_i^{enc}}, \overrightarrow{h_i^{enc}} \in \mathbb{R}^{h \times 1},\ 1 \le i \le m \qquad (1)$$

$$c_i^{enc} = [\overleftarrow{c_i^{enc}}; \overrightarrow{c_i^{enc}}] \quad \text{where } c_i^{enc} \in \mathbb{R}^{2h \times 1},\ \overleftarrow{c_i^{enc}}, \overrightarrow{c_i^{enc}} \in \mathbb{R}^{h \times 1},\ 1 \le i \le m \qquad (2)$$

We then initialize the Decoder's first hidden state $h_0^{dec}$ and cell state $c_0^{dec}$ with a linear projection of the Encoder's final hidden state and final cell state.¹

$$h_0^{dec} = W_h [\overleftarrow{h_1^{enc}}; \overrightarrow{h_m^{enc}}] \quad \text{where } h_0^{dec} \in \mathbb{R}^{h \times 1},\ W_h \in \mathbb{R}^{h \times 2h} \qquad (3)$$

$$c_0^{dec} = W_c [\overleftarrow{c_1^{enc}}; \overrightarrow{c_m^{enc}}] \quad \text{where } c_0^{dec} \in \mathbb{R}^{h \times 1},\ W_c \in \mathbb{R}^{h \times 2h} \qquad (4)$$

¹ If it's not obvious, think about why we regard $[\overleftarrow{h_1^{enc}}; \overrightarrow{h_m^{enc}}]$ as the 'final hidden state' of the Encoder.
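As a rough illustration (and not the assignment's reference implementation), Eqs. (1)-(4) can be realized in PyTorch along the following lines. The names encoder, h_projection, and c_projection, the concrete sizes, and the tensor layout are assumptions made for this sketch only.

import torch
import torch.nn as nn

e, h = 256, 512                                   # illustrative embedding and hidden sizes
encoder = nn.LSTM(e, h, bidirectional=True)       # forward and backward LSTMs
h_projection = nn.Linear(2 * h, h, bias=False)    # W_h in Eq. (3)
c_projection = nn.Linear(2 * h, h, bias=False)    # W_c in Eq. (4)

src_emb = torch.randn(10, 4, e)                   # (m, batch, e) dummy source embeddings
enc_hiddens, (last_h, last_c) = encoder(src_emb)  # enc_hiddens: (m, batch, 2h), Eqs. (1)-(2)

# last_h is (2, batch, h): index 0 is the forward LSTM's final state (step m),
# index 1 is the backward LSTM's final state (step 1), matching Eqs. (3)-(4).
init_decoder_hidden = h_projection(torch.cat([last_h[1], last_h[0]], dim=1))  # h_0^dec, (batch, h)
init_decoder_cell = c_projection(torch.cat([last_c[1], last_c[0]], dim=1))    # c_0^dec, (batch, h)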

With the Decoder initialized, we must now feed it a matching sentence in the target language. On the $t$th step, we look up the embedding for the $t$th word, $y_t \in \mathbb{R}^{e \times 1}$. We then concatenate $y_t$ with the combined-output vector $o_{t-1} \in \mathbb{R}^{h \times 1}$ from the previous timestep (we will explain what this is below!) to produce $\overline{y_t} \in \mathbb{R}^{(e+h) \times 1}$. Note that for the first target word (i.e. the start token) $o_0$ is a zero-vector. We then feed $\overline{y_t}$ as input to the Decoder LSTM.

$$h_t^{dec}, c_t^{dec} = \mathrm{Decoder}(\overline{y_t}, h_{t-1}^{dec}, c_{t-1}^{dec}) \quad \text{where } h_t^{dec} \in \mathbb{R}^{h \times 1},\ c_t^{dec} \in \mathbb{R}^{h \times 1} \qquad (5)$$

We then use $h_t^{dec}$ to compute multiplicative attention over $h_1^{enc}, \ldots, h_m^{enc}$:

$$e_{t,i} = (h_t^{dec})^T W_{attProj} h_i^{enc} \quad \text{where } e_t \in \mathbb{R}^{m \times 1},\ W_{attProj} \in \mathbb{R}^{h \times 2h},\ 1 \le i \le m \qquad (7)$$

$$\alpha_t = \mathrm{Softmax}(e_t) \quad \text{where } \alpha_t \in \mathbb{R}^{m \times 1} \qquad (8)$$

$$a_t = \sum_{i=1}^{m} \alpha_{t,i} h_i^{enc} \quad \text{where } a_t \in \mathbb{R}^{2h \times 1} \qquad (9)$$
i

We now concatenate the attention output $a_t$ with the decoder hidden state $h_t^{dec}$ and pass this through a linear layer, Tanh, and Dropout to attain the combined-output vector $o_t$.

$$u_t = [a_t; h_t^{dec}] \quad \text{where } u_t \in \mathbb{R}^{3h \times 1} \qquad (10)$$

$$v_t = W_u u_t \quad \text{where } v_t \in \mathbb{R}^{h \times 1},\ W_u \in \mathbb{R}^{h \times 3h} \qquad (11)$$

$$o_t = \mathrm{Dropout}(\mathrm{Tanh}(v_t)) \quad \text{where } o_t \in \mathbb{R}^{h \times 1} \qquad (12)$$

Then, we produce a probability distribution $P_t$ over target words at the $t$th timestep:

$$P_t = \mathrm{Softmax}(W_{vocab}\, o_t) \quad \text{where } P_t \in \mathbb{R}^{V_t \times 1},\ W_{vocab} \in \mathbb{R}^{V_t \times h} \qquad (13)$$

Here, $V_t$ is the size of the target vocabulary. Finally, to train the network we compute the softmax cross entropy loss between $P_t$ and $g_t$, where $g_t$ is the one-hot vector of the target word at timestep $t$:

$$J_t(\theta) = \mathrm{CE}(P_t, g_t) \qquad (14)$$

Here, θ represents all the parameters of the model and Jt (θ) is the loss on step t of the decoder. Now
that we have described the model, let’s try implementing it for Spanish to English translation!
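Before turning to the setup instructions, here is a compact PyTorch sketch of how Eqs. (5)-(13) might be realized in a single decoder step. It is illustrative only: the names decoder, att_projection, combined_output_projection, and target_vocab_projection, the dropout rate, and the tensor layout are assumptions for this sketch, not the interface required by the starter code.

import torch
import torch.nn as nn
import torch.nn.functional as F

e, h, V_t = 256, 512, 5000                                    # illustrative sizes
decoder = nn.LSTMCell(e + h, h)                               # consumes y_bar_t = [y_t; o_{t-1}]
att_projection = nn.Linear(2 * h, h, bias=False)              # W_attProj in Eq. (7)
combined_output_projection = nn.Linear(3 * h, h, bias=False)  # W_u in Eq. (11)
target_vocab_projection = nn.Linear(h, V_t, bias=False)       # W_vocab in Eq. (13)
dropout = nn.Dropout(0.3)

def step(y_bar_t, dec_state, enc_hiddens, enc_hiddens_proj):
    # y_bar_t: (batch, e + h); enc_hiddens: (batch, m, 2h);
    # enc_hiddens_proj: (batch, m, h), i.e. att_projection already applied to each h_i^enc.
    dec_hidden, dec_cell = decoder(y_bar_t, dec_state)                     # Eq. (5)
    e_t = torch.bmm(enc_hiddens_proj, dec_hidden.unsqueeze(2)).squeeze(2)  # Eq. (7), (batch, m)
    alpha_t = F.softmax(e_t, dim=1)                                        # Eq. (8)
    a_t = torch.bmm(alpha_t.unsqueeze(1), enc_hiddens).squeeze(1)          # Eq. (9), (batch, 2h)
    u_t = torch.cat([a_t, dec_hidden], dim=1)                              # Eq. (10), (batch, 3h)
    o_t = dropout(torch.tanh(combined_output_projection(u_t)))             # Eqs. (11)-(12)
    P_t = F.softmax(target_vocab_projection(o_t), dim=1)                   # Eq. (13), (batch, V_t)
    return (dec_hidden, dec_cell), o_t, P_t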

Follow the instructions in the CS224n Azure Guide (link also provided on website and Piazza) in order
to create your VM instance. This should take you approximately 45 minutes. Though you will need
the GPU to train your model, we strongly advise that you first develop the code locally and ensure
that it runs, before attempting to train it on your VM. GPU time is expensive and limited. It takes
approximately 4 hours to train the NMT system. We don’t want you to accidentally use all your GPU
time for the assignment, debugging your model rather than training and evaluating it. Finally, make
sure that your VM is turned off whenever you are not using it.
If your Azure subscription runs out of money your VM will be locked and all code and
data on the VM will be lost. Turn off your VM and request more Azure credits before
your subscription runs out. See Piazza for instructions on requesting more credits if you
are about to run out.
In order to run the model code on your local machine, please run the following command to create the proper virtual environment:
conda env create --file local_env.yml
Note that this virtual environment will not be needed on the VM.

(a) (2 points) In order to apply tensor operations, we must ensure that the sentences in a given batch are of the same length. Thus, we must identify the longest sentence in a batch and pad the others to the same length. Implement the pad_sents function in utils.py, which will produce these padded sentences (a minimal illustrative sketch follows).
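A minimal sketch of this padding logic is given below, assuming each sentence is a list of word strings; the exact signature and docstring of pad_sents in the starter code may differ.

def pad_sents(sents, pad_token):
    # Pad each sentence (a list of words) with pad_token so that all sentences
    # in the batch share the length of the longest one.
    max_len = max(len(s) for s in sents)
    return [s + [pad_token] * (max_len - len(s)) for s in sents]

For example, pad_sents([['hola'], ['buenos', 'dias']], '<pad>') would return [['hola', '<pad>'], ['buenos', 'dias']].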
(b) (3 points) Implement the __init__ function in model_embeddings.py to initialize the necessary source and target embeddings.
(c) (4 points) Implement the __init__ function in nmt_model.py to initialize the necessary model embeddings (using the ModelEmbeddings class from model_embeddings.py) and layers (LSTM, projection, and dropout) for the NMT system.
(d) (8 points) Implement the encode function in nmt_model.py. This function converts the padded source sentences into the tensor $X$, generates $h_1^{enc}, \ldots, h_m^{enc}$, and computes the initial state $h_0^{dec}$ and initial cell $c_0^{dec}$ for the Decoder. You can run a non-comprehensive sanity check by executing:
python sanity_check.py 1d
(e) (8 points) Implement the decode function in nmt_model.py. This function constructs $\bar{y}$ and runs the step function over every timestep for the input. You can run a non-comprehensive sanity check by executing:
python sanity_check.py 1e
(f) (10 points) Implement the step function in nmt_model.py. This function applies the Decoder's LSTM cell for a single timestep, computing the encoding of the target word $h_t^{dec}$, the attention scores $e_t$, the attention distribution $\alpha_t$, the attention output $a_t$, and finally the combined output $o_t$. You can run a non-comprehensive sanity check by executing:
python sanity_check.py 1f

(g) (3 points) (written) The generate_sent_masks() function in nmt_model.py produces a tensor called enc_masks. It has shape (batch size, max source sentence length) and contains 1s in positions corresponding to 'pad' tokens in the input, and 0s for non-pad tokens. Look at how the masks are used during the attention computation in the step() function (lines 295-296).
First explain (in around three sentences) what effect the masks have on the entire attention computation. Then explain (in one or two sentences) why it is necessary to use the masks in this way.

Solution:
The masks set the attention scores $e_{t,i} = -\infty$ for all positions $i$ that correspond to 'pad' tokens in the source sentence. This means that after we apply the softmax function, the attention distribution has $\alpha_{t,i} = 0$ for all positions $i$ that correspond to 'pad' tokens, so the encoder hidden states $h_i^{enc}$ that correspond to 'pad' tokens have no effect on the attention output $a_t$ (a small illustrative sketch follows this solution).
It is necessary to apply the masks in this way because we do not want to put any attention on the encoder hidden states that correspond to 'pad' tokens. These 'padding' hidden states were computed only because we have to pad the batch of source sentences up to the maximum length $m$; they are meaningless and we do not want them to affect our model.
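As a small illustration of the masking described above (the names and shapes are assumptions; the starter code's step() performs the equivalent operation):

import torch
import torch.nn.functional as F

e_t = torch.randn(4, 10)                   # (batch, max_src_len) raw attention scores
enc_masks = torch.zeros(4, 10)
enc_masks[:, 7:] = 1                       # pretend the last three source positions are padding

e_t = e_t.masked_fill(enc_masks.bool(), -float('inf'))
alpha_t = F.softmax(e_t, dim=1)            # pad positions now receive exactly zero attention weight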

(h) Now it’s time to get things running! Execute the following to generate the necessary vocab file:
sh run.sh vocab
As noted earlier, we recommend that you develop the code on your personal computer. Confirm
that you are running in the proper conda environment and then execute the following command to
train the model on your local machine:
sh run.sh train_local
Once you have ensured that your code does not crash (i.e. let it run till iter 10 or iter 20),
power on your VM from the Azure Web Portal. Then read the Managing Code Deployment to a
VM section of our Practical Guide to VMs (link also given on website and Piazza) for instructions
on how to upload your code to the VM.
Next, install necessary packages to your VM by running:
pip install -r gpu_requirements.txt
Finally, turn to the Managing Processes on a VM section of the Practical Guide and follow the instructions to create a new tmux session. Concretely, run the following command to create a tmux session called nmt:
tmux new -s nmt
Once your VM is configured and you are in a tmux session, execute:
sh run.sh train
Once you know your code is running properly, you can detach from the session and close your ssh connection to the server. To detach from the session, run:
tmux detach
You can return to your training model by ssh-ing back into the server and attaching to the tmux
session by running:
tmux a -t nmt
(i) (4 points) Once your model is done training (this should take about 4 hours on the VM), execute the following command to test the model:
sh run.sh test
Please report the model's corpus BLEU Score. It should be larger than 21.
(j) (3 points) In class, we learned about dot product attention, multiplicative attention, and additive attention. Please provide one possible advantage and one disadvantage of each attention mechanism, with respect to either of the other two attention mechanisms. As a reminder, dot product attention is $e_{t,i} = s_t^T h_i$, multiplicative attention is $e_{t,i} = s_t^T W h_i$, and additive attention is $e_{t,i} = v^T (W_1 h_i + W_2 s_t)$.

Solution:
One advantage of dot product attention is that it is the most computationally efficient: it is only an O(h) operation, i.e. a dot product, and it does not require us to learn and store any weight matrix W. One disadvantage of dot product attention is that it requires the encoder and decoder states to be of the same dimension. Additionally, it is less expressive, as there is no relative weighting of hidden units.

One advantage of multiplicative attention is that the hidden states of the encoder and decoder can
be of different dimensions. This is why we used multiplicative attention, rather than dot product
attention, in our NMT system. Another advantage over dot product attention is that it is more
expressive, by learning the weights W. An advantage that multiplicative attention provides over
additive attention is that it is more efficient, i.e. can be implemented solely through matrix multi-
plication and requires fewer weights.

One advantage of additive attention is that it is a fundamentally different operation than the dot product and matrix multiplications, so it may be able to capture different relationships between words. One disadvantage of additive attention is that it is the most computationally complicated, relying on learning two weight matrices and a vector; the more parameters we add to a model, the more difficult it is to train. (A small sketch contrasting the three scoring functions follows.)
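For concreteness, a small PyTorch sketch of the three scoring functions is given below. The dimensions are purely illustrative; the dot-product variant is left commented out precisely because, with different encoder and decoder sizes, it cannot be applied.

import torch
import torch.nn as nn

d_dec, d_enc, d_att, m = 512, 1024, 256, 10
s_t = torch.randn(d_dec)                       # a single decoder state
H = torch.randn(m, d_enc)                      # encoder states h_1, ..., h_m (one per row)

# Dot-product attention: parameter-free, but requires d_dec == d_enc.
# e_dot = H @ s_t

# Multiplicative attention: one matrix W; the two state sizes may differ.
W = nn.Parameter(torch.randn(d_dec, d_enc))
e_mult = H @ (W.t() @ s_t)                     # e_{t,i} = s_t^T W h_i

# Additive attention: two matrices and a vector, as in the formula above.
W1 = nn.Parameter(torch.randn(d_att, d_enc))
W2 = nn.Parameter(torch.randn(d_att, d_dec))
v = nn.Parameter(torch.randn(d_att))
e_add = (H @ W1.t() + W2 @ s_t) @ v            # e_{t,i} = v^T (W_1 h_i + W_2 s_t)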

2. Analyzing NMT Systems (30 points)


(a) (12 points) Here we present a series of errors we found in the outputs of our NMT model (which
is the same as the one you just trained). For each example of a Spanish source sentence, reference
(i.e., ‘gold’) English translation, and NMT (i.e., ‘model’) English translation, please:
1. Identify the error in the NMT translation.
2. Provide a reason why the model may have made the error (either due to a specific linguistic
construct or specific model limitations).
3. Describe one possible way we might alter the NMT system to fix the observed error.
Below are the translations that you should analyze as described above. Note that out-of-vocabulary
words are underlined.
i. (2 points) Source Sentence: Aquí otro de mis favoritos, "La noche estrellada".
Reference Translation: So another one of my favorites, “The Starry Night”.
NMT Translation: Here’s another favorite of my favorites, “The Starry Night”.

Solution:
1. Error: Repetition (favorite of my favorites).
2. Repetition is a common problem in the output of RNN decoders. A possible reason is that the model attended to favoritos twice, thus producing both favorite and favorites in the output. Another possible reason: note that the reference translation says another one, where the word one has no direct counterpart in the source sentence. These types of words (called 'spurious' words) can be difficult for sequence-to-sequence+attention systems to produce, because there is nothing concrete to attend to. On the step when the NMT system should have produced one, perhaps it attended to favoritos, and the influence of the attention output was too strong, so the model produced Here's another favorite, thus leading to the repetition.
3. Possible fix: Introduce an explicit penalty for word repetition (either during training or during decoding), or add a coverage mechanism (which discourages repeated attention on the same words).

ii. (2 points) Source Sentence: Ustedes saben que lo que yo hago es escribir para los niños, y, de hecho, probablemente soy el autor para niños, más leído en los EEUU.
Reference Translation: You know, what I do is write for children, and I'm probably America's most widely read children's author, in fact.
NMT Translation: You know what I do is write for children, and in fact, I'm probably the author for children, more reading in the U.S.

Solution:
1. Error: Unnatural word order (particularly author for children, more reading in the U.S.)
2. Reason for error: There are many possible reasons for this error. Perhaps our NMT system is not good at modeling long-term dependencies, so it struggles to produce a long output sentence that makes sense. Perhaps the NMT system has not been trained on enough data to have learned how to rearrange word ordering appropriately. In particular, perhaps the decoder's language model is not strong enough to recognize that this output translation is unnatural English.
3. Possible fix: To improve modeling of long term dependencies, we might make the model
more powerful architecturally (e.g. increase hidden size, number of layers, add self-attention,
or switch to Transformer). If we think the problem is insufficient data, we might train the
NMT system on more data, or build our system on top of a pretrained system (e.g., ELMo
or BERT). In particular if we think that the decoder’s English Language Model is too weak,
we might try initializing our decoder with a strong English LM trained on lots of data.

iii. (2 points) Source Sentence: Un amigo me hizo eso – Richard Bolingbroke.


Reference Translation: A friend of mine did that – Richard Bolingbroke.
NMT Translation: A friend of mine did that – Richard <unk>

Solution:
1. Error: UNK instead of Bolingbroke
2. Reason for error: The last name Bolingbroke is an infrequent word, so it is not in the
vocabulary; thus the model is only able to output UNK.
3. Possible fix: We could add a neural copy/pointer mechanism to copy words from the source sentence (especially those that do not need translation, such as names). Or we could add a post-processing step that replaces UNKs in the output with the source word that was attended to on that step. We could switch to a subword-based NMT model (e.g. one using characters, BPE or word-pieces); this would enable the decoder to produce new (out-of-vocabulary) words. We could simply increase the vocabulary size of our word-based NMT system, though this is not a very effective solution for errors like these (as you'll never be able to capture all names in your vocabulary).

iv. (2 points) Source Sentence: Solo tienes que dar vuelta a la manzana para verlo como una epifanía.
Reference Translation: You’ve just got to go around the block to see it as an epiphany.
NMT Translation: You just have to go back to the apple to see it as a epiphany.

Solution:
1. Error: manzana translated to apple instead of block
2. Reason for error: In Spanish, the word manzana is polysemous: it can mean either apple or block. Here the NMT system picks the wrong meaning. This might be because NMT systems can be biased towards picking the more common meaning (here apple). Another reason could be lack of world knowledge – go around the block makes more sense than go back to the apple, but the NMT system doesn't know that. Another reason could be lack of context: NMT systems translate sentences out of context. If we saw more of the preceding context, perhaps it would be more obvious that manzana means block, not apple.
3. Possible fix: Word Sense Disambiguation (WSD) is the task of determining, given a usage of a polysemous word in a sentence, which of the possible meanings is correct. WSD is an active NLP research area. We could run a pretrained WSD system on our source sentences and condition our NMT system on the results. If we think the Spanish word embedding for manzana is insufficiently powerful to capture its polysemous meaning, we might try increasing the word embedding size or using multi-sense word embeddings. Adding world knowledge to NMT systems is an ongoing research area; there are efforts to integrate NMT systems with knowledge bases. We could also try training and testing our NMT systems on whole documents rather than single sentences.

v. (2 points) Source Sentence: Ella salvó mi vida al permitirme entrar al baño de la sala de
profesores.
Reference Translation: She saved my life by letting me go to the bathroom in the teachers’
lounge.
NMT Translation: She saved my life by letting me go to the bathroom in the women’s room.

Solution:
1. Error: NMT Translation contains bathroom in the women’s room instead of bathroom in the
teachers’ lounge; in particular women’s is unfounded.
2. Reason for error: The model might be producing women’s because it was attending to (or
influenced by) the feminine pronouns earlier in the sentence. Alternatively, if the training
data contained more examples of female teachers than male teachers, the model might have
learned a connection between women and teaching, leading to incorrectly producing the
word women’s. Another explanation is that the model got confused between the two rooms
in the source sentence (the bathroom and the teachers’ lounge); and was attending to baño
(bathroom) when it produced women’s room (which is an approximately correct translation),
but it should have been attending to teachers' lounge. Another possible reason is that in the women's room is a more common phrase than in the teachers' lounge – NMT systems sometimes have a tendency to default to general, unconditional target (English) Language Modeling.
3. Possible fix: First inspect the attention distribution on the step when the decoder produced the word women's. This might help us figure out which (if any) of the possible reasons above is responsible. If we believe the problem is the dataset, we might try to add more examples of male teachers to the dataset, or try any of the current research techniques to debias our word vectors and/or model (this is an open research area).

vi. (2 points) Source Sentence: Eso es más de 100,000 hectáreas.


Reference Translation: That’s more than 250 thousand acres.
NMT Translation: That’s over 100,000 acres.

Solution:
1. Error: Incorrect numeric conversion (100,000 hectares = 250,000 acres, not 100,000 acres).
2. Reason for error: The NMT system simply translates hectáreas (a unit of area common in
Spanish-speaking countries) as acres (a unit of area common in the U.S.); this is incorrect
as the two have different values. This may have happened because hectares was not in the
target (English) vocabulary. A deeper reason for this error is that NMT systems do not
generally have logic, reasoning or numerical abilities – here the system considered only the
semantic (not numerical) similarity of hectáreas and acres.
3. Possible fix: The simple solution for this particular example would be to make sure hectares
is in the target (English) vocabulary, e.g. by increasing vocabulary size or adding subword
abilities. A more complex solution would be to supply the NMT system with a knowledge
base of units of measurement and their conversion rates, and train the system to convert
e.g. metric to imperial. A much more complex and general solution would be to work
on imbuing NMT systems with reasoning/numerical abilities. This is an open research
problem! The most difficult solution of all would be to convince the U.S. to convert to the
metric system.

(b) (4 points) Now it is time to explore the outputs of the model that you have trained! The test-set translations your model produced in question 1-i should be located in outputs/test_outputs.txt. Please identify 2 examples of errors that your model produced.² The two examples you find should be different error types from one another and different error types than the examples provided in the previous question. For each example you should:
1. Write the source sentence in Spanish. The source sentences are in en_es_data/test.es.
2. Write the reference English translation. The reference translations are in en_es_data/test.en.
3. Write your NMT model's English translation. The model-translated sentences are in outputs/test_outputs.txt.
4. Identify the error in the NMT translation.
5. Provide a reason why the model may have made the error (either due to a specific linguistic construct or specific model limitations).
6. Describe one possible way we might alter the NMT system to fix the observed error.

² An 'error' is not just an NMT translation that doesn't match the reference translation. There must be something wrong with the NMT translation, in your opinion.

(c) (14 points) BLEU Score is the most commonly used automatic evaluation metric for NMT systems. It is usually calculated across the entire test set, but here we will consider BLEU defined for a single example.³ Suppose we have a source sentence s, a set of k reference translations $r_1, \ldots, r_k$, and a candidate translation c. To compute the BLEU score of c, we first compute the modified n-gram precision $p_n$ of c, for each of $n = 1, 2, 3, 4$:

$$p_n = \frac{\sum_{\text{ngram} \in c} \min\big(\max_{i=1,\ldots,k} \mathrm{Count}_{r_i}(\text{ngram}),\ \mathrm{Count}_{c}(\text{ngram})\big)}{\sum_{\text{ngram} \in c} \mathrm{Count}_{c}(\text{ngram})} \qquad (15)$$

³ This definition of sentence-level BLEU score matches the sentence_bleu() function in the nltk Python package: http://www.nltk.org/api/nltk.translate.html#nltk.translate.bleu_score.sentence_bleu

Here, for each of the n-grams that appear in the candidate translation c, we count the maximum number of times it appears in any one reference translation, capped by the number of times it appears in c (this is the numerator). We divide this by the number of n-grams in c (the denominator).

Next, we compute the brevity penalty BP. Let len(c) be the length of c and let $r^*$ be the length of the reference translation that is closest to len(c) (in the case of two equally-close reference translation lengths, choose $r^*$ as the shorter one).

$$BP = \begin{cases} 1 & \text{if } \mathrm{len}(c) \ge r^* \\ \exp\!\big(1 - \frac{r^*}{\mathrm{len}(c)}\big) & \text{otherwise} \end{cases} \qquad (16)$$
Lastly, the BLEU score for candidate c with respect to $r_1, \ldots, r_k$ is:

$$\mathrm{BLEU} = BP \times \exp\!\Big(\sum_{n=1}^{4} \lambda_n \log p_n\Big) \qquad (17)$$

where $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ are weights that sum to 1.

i. (5 points) Please consider this example:

Source Sentence s: El amor todo lo puede


Reference Translation r1 : Love can always find a way
Reference Translation r2 : Love makes anything possible
NMT Translation c1 : The love can always do
NMT Translation c2 : Love can make anything possible

Please compute the BLEU scores for c1 and c2. Let $\lambda_i = 0.5$ for $i \in \{1, 2\}$ and $\lambda_i = 0$ for $i \in \{3, 4\}$ (this means we ignore 3-grams and 4-grams, i.e., don't compute $p_3$ or $p_4$).
When computing BLEU scores, show your working (i.e., show your computed values for $p_1$, $p_2$, len(c), $r^*$ and BP).

Which of the two NMT translations is considered the better translation according to the BLEU
Score? Do you agree that it is the better translation?

Solution: For c1, $p_1 = \frac{3}{5}$ and $p_2 = \frac{2}{4} = \frac{1}{2}$. len(c) = 5 and the closest reference length is $r^* = 4$ (because 4 and 6 are equally close to 5, so we choose the shorter, 4). Thus the brevity penalty BP = 1. Therefore BLEU $= \exp(\frac{1}{2}\log\frac{3}{5} + \frac{1}{2}\log\frac{1}{2}) \approx 0.548$.

For c2, $p_1 = \frac{4}{5}$ and $p_2 = \frac{2}{4} = \frac{1}{2}$. Again, len(c) = 5 and $r^* = 4$, so BP = 1. Therefore BLEU $= \exp(\frac{1}{2}\log\frac{4}{5} + \frac{1}{2}\log\frac{1}{2}) \approx 0.632$.

According to these BLEU scores, NMT translation c2 is the better one. c2 is indeed the better translation – it has the correct meaning, whereas c1 translates the Spanish phrase too literally, leading to unnatural and nonsensical English. (A small verification sketch follows.)
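As a sanity check on this arithmetic, here is a small self-contained sketch of Eqs. (15)-(17) restricted to unigrams and bigrams (λ = (0.5, 0.5, 0, 0)). The helper is made up for this illustration (it is not nltk's implementation) and it assumes every p_n is nonzero, as it is in this example.

import math
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def sentence_bleu(refs, cand, weights=(0.5, 0.5)):
    # Modified n-gram precisions p_n (Eq. 15), for n = 1..len(weights).
    log_p = 0.0
    for n, w in enumerate(weights, start=1):
        if w == 0:
            continue
        cand_counts = Counter(ngrams(cand, n))
        clipped = sum(min(max(Counter(ngrams(r, n))[g] for r in refs), c)
                      for g, c in cand_counts.items())
        log_p += w * math.log(clipped / sum(cand_counts.values()))
    # Brevity penalty (Eq. 16): r* is the closest reference length, ties broken toward the shorter.
    r_star = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) >= r_star else math.exp(1 - r_star / len(cand))
    return bp * math.exp(log_p)                                # Eq. (17)

# Sentences lower-cased so that 'The'/'Love' match, as in the hand counts above.
r1 = 'love can always find a way'.split()
r2 = 'love makes anything possible'.split()
c1 = 'the love can always do'.split()
c2 = 'love can make anything possible'.split()
print(sentence_bleu([r1, r2], c1), sentence_bleu([r1, r2], c2))   # ~0.548 and ~0.632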

ii. (5 points) Our hard drive was corrupted and we lost Reference Translation r2 . Please recom-
pute BLEU scores for c1 and c2 , this time with respect to r1 only. Which of the two NMT
translations now receives the higher BLEU score? Do you agree that it is the better translation?

Solution: For c1, $p_1 = \frac{3}{5}$ and $p_2 = \frac{2}{4} = \frac{1}{2}$. len(c) = 5 and $r^* = 6$, so the brevity penalty BP $= \exp(1 - \frac{6}{5})$. Therefore BLEU $= \exp(1 - \frac{6}{5}) \times \exp(\frac{1}{2}\log\frac{3}{5} + \frac{1}{2}\log\frac{1}{2}) \approx 0.448$.

For c2, $p_1 = \frac{2}{5}$ and $p_2 = \frac{1}{4}$. len(c) = 5 and $r^* = 6$, so the brevity penalty BP $= \exp(1 - \frac{6}{5})$. Therefore BLEU $= \exp(1 - \frac{6}{5}) \times \exp(\frac{1}{2}\log\frac{2}{5} + \frac{1}{2}\log\frac{1}{4}) \approx 0.259$.

According to these BLEU scores, NMT translation c1 is the better one. As noted before, c1 is not the better translation. (See the usage note below.)
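Using the hypothetical sentence_bleu sketch from part i, the single-reference case is just a one-element reference list:

print(sentence_bleu([r1], c1), sentence_bleu([r1], c2))   # ~0.448 and ~0.259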

iii. (2 points) Due to data availability, NMT systems are often evaluated with respect to only a
single reference translation. Please explain (in a few sentences) why this may be problematic.

Solution: Often there are many valid ways to translate a source sentence. This is particularly true for idiomatic phrases such as the previous example. The BLEU metric is designed to accommodate this flexibility: an n-gram in c is rewarded if it appears in any one of the reference translations. If we have multiple reference translations, the BLEU metric will thus reward similarity to any of the several valid translations. But if we only have one reference translation, the BLEU metric only recognizes similarity to that particular translation – potentially penalizing other valid candidate translations.

iv. (2 points) List two advantages and two disadvantages of BLEU, compared to human evaluation,
as an evaluation metric for Machine Translation.

Solution:
• Advantage: BLEU is automatic, so it is fast to compute (unlike human evaluation, which
is slow).
• Advantage: BLEU is automatic, so it is free to compute (unlike human evaluation, which
is expensive).
• Advantage: BLEU has a concrete definition (unlike human evaluation, which is hard to define and varies depending on the human judge). This means that researchers can reproduce each other's BLEU results and use BLEU to compare different systems.
• Disadvantage: BLEU requires a reference translation (whereas human evaluation doesn't, assuming the human judges are bilingual), and optimally requires multiple reference translations.
• Disadvantage: BLEU is based on absolute n-gram matching, so it doesn’t reward synonyms,
paraphrases, or different inflections of the same word (e.g. make and makes).
• There are lots more disadvantages of BLEU vs human evaluation (basically any way in
which BLEU does not fully capture the true notion of good translation) – e.g. not having
world knowledge, not knowing idioms, not recognizing what ‘sounds good’ and what doesn’t,
etc.
