
When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion

Elena Voita1,2  Rico Sennrich3,4  Ivan Titov3,2
1 Yandex, Russia  2 University of Amsterdam, Netherlands
3 University of Edinburgh, Scotland  4 University of Zurich, Switzerland
lena-voita@yandex-team.ru  rico.sennrich@ed.ac.uk  ititov@inf.ed.ac.uk

Abstract

Though machine translation errors caused by the lack of context beyond one sentence have long been acknowledged, the development of context-aware NMT systems is hampered by several problems. Firstly, standard metrics are not sensitive to improvements in consistency in document-level translations. Secondly, previous work on context-aware NMT assumed that the sentence-aligned parallel data consisted of complete documents, while in most practical scenarios such document-level data constitutes only a fraction of the available parallel data. To address the first issue, we perform a human study on an English-Russian subtitles dataset and identify deixis, ellipsis and lexical cohesion as three main sources of inconsistency. We then create test sets targeting these phenomena. To address the second shortcoming, we consider a set-up in which a much larger amount of sentence-level data is available compared to that aligned at the document level. We introduce a model that is suitable for this scenario and demonstrate major gains over a context-agnostic baseline on our new benchmarks without sacrificing performance as measured with BLEU.1

1 We release code and data sets at https://github.com/lena-voita/good-translation-wrong-in-context.

1 Introduction

With the recent rapid progress of neural machine translation (NMT), translation mistakes and inconsistencies due to the lack of extra-sentential context are becoming more and more noticeable among otherwise adequate translations produced by standard context-agnostic NMT systems (Läubli et al., 2018). Though this problem has recently triggered a lot of attention to context-aware translation (Jean et al., 2017a; Wang et al., 2017; Tiedemann and Scherrer, 2017; Bawden et al., 2018; Voita et al., 2018; Maruf and Haffari, 2018; Agrawal et al., 2018; Miculicich et al., 2018; Zhang et al., 2018), the progress and widespread adoption of the new paradigm is hampered by several important problems. Firstly, it is highly non-trivial to design metrics which would reliably trace the progress and guide model design. Standard machine translation metrics (e.g., BLEU) do not appear appropriate as they do not sufficiently differentiate between consistent and inconsistent translations (Wong and Kit, 2012).2 For example, if multiple translations of a name are possible, forcing consistency is essentially as likely to make all occurrences of the name match the reference translation as making them all different from the reference. Second, most previous work on context-aware NMT has made the assumption that all the bilingual data is available at the document level. However, isolated parallel sentences are a lot easier to acquire and hence only a fraction of the parallel data will be at the document level in any practical scenario. In other words, a context-aware model trained only on document-level parallel data is highly unlikely to outperform a context-agnostic model estimated from a much larger sentence-level parallel corpus. This work aims to address both these shortcomings.

2 We use the term ‘inconsistency’ to refer to any violations causing good translations of isolated sentences not to work together, independently of which linguistic phenomena (e.g., ellipsis or lexical cohesion) impose the violated constraints.

A context-agnostic NMT system would often produce plausible translations of isolated sentences; however, when put together in a document, these translations end up being inconsistent with each other. We investigate which linguistic phenomena cause the inconsistencies using the OpenSubtitles (Lison et al., 2018) corpus for the English-Russian language pair. We identify deixis, ellipsis and lexical cohesion as three main sources of the violations, together amounting to about 80% of the cases. We create test sets focusing specifically on the three identified phenomena (6000 examples in total).

We show that by using a limited amount of document-level parallel data, we can already achieve substantial improvements on these benchmarks without negatively affecting performance as measured with BLEU. Our approach is inspired by the Deliberation Networks (Xia et al., 2017). In our method, the initial translation produced by a baseline context-agnostic model is refined by a context-aware system which is trained on a small document-level subset of parallel data.

The key contributions are as follows:

• we analyze which phenomena cause context-agnostic translations to be inconsistent with each other;

• we create test sets specifically addressing the most frequent phenomena;

• we consider a novel and realistic set-up where a much larger amount of sentence-level data is available compared to that aligned at the document level;

• we introduce a model suitable for this scenario, and demonstrate that it is effective on our new benchmarks without sacrificing performance as measured with BLEU.

2 Analysis

We begin with a human study, in which we:

1. identify cases when good sentence-level translations are not good when placed in context of each other,

2. categorize these examples according to the phenomena leading to a discrepancy in translations of consecutive sentences.

The test sets introduced in Section 3 will then target the most frequent phenomena.

2.1 Human annotation

To find what makes good context-agnostic translations incorrect when placed in context of each other, we start with pairs of consecutive sentences. We gather data with context from the publicly available OpenSubtitles2018 corpus (Lison et al., 2018) for English and Russian. We train a context-agnostic Transformer on 6m sentence pairs. Then we translate 2000 pairs of consecutive sentences using this model. For more details on model training and data preprocessing, see Section 5.3.

Then we use human annotation to assess the adequacy of the translations without context and in the context of each other. The whole process is two-stage:

1. sentence-level evaluation: we ask if the translation of a given sentence is good,

2. evaluation in context: for pairs of consecutive good translations according to the first stage, we ask if the translations are good in context of each other.

In the first stage, the annotators are instructed to mark as “good” translations which (i) are fluent sentences in the target language (in our case, Russian) and (ii) can be reasonable translations of a source sentence in some context.

For the second stage we only consider pairs of sentences with good sentence-level translations. The annotators are instructed to mark translations as bad in context of each other only if there is no other possible interpretation or extra additional context which could have made them appropriate. This was done to get more robust results, avoiding the influence of personal preferences of the annotators (for example, for using formal or informal speech), and excluding ambiguous cases that can only be resolved with additional context.

The statistics of answers are provided in Table 1. We find that our annotators labelled 82% of sentence pairs as good translations. In 11% of cases, at least one translation was considered bad at the sentence level, and in another 7%, the sentences were considered individually good, but bad in context of each other. This indicates that in our setting, a substantial proportion of translation errors are only recognized as such in context.

         all     one/both bad    both good
                                 bad pair    good pair
         2000    211             140         1649
         100%    11%             7%          82%

Table 1: Human annotation statistics of pairs of consecutive translations.
type of phenomena      frequency
deixis                 37%
ellipsis               29%
lexical cohesion       14%
ambiguity              9%
anaphora               6%
other                  5%

Table 2: Types of phenomena causing discrepancy in context-agnostic translation of consecutive sentences when placed in the context of each other.

type of discrepancy            frequency
T-V distinction                67%
speaker/addressee gender:
  same speaker                 22%
  different speaker            9%
other                          2%

Table 3: Types of discrepancy in context-agnostic translation caused by deixis (excluding anaphora).

Figure 1: Examples of violation of (a) T-V form consistency, (b) speaker gender consistency. In color: (a) red – V-form, blue – T-form; (b) red – feminine, blue – masculine.
2.2 Types of phenomena

From the results of the human annotation, we take all instances of consecutive sentences with good translations which become incorrect when placed in the context of each other. For each, we identify the language phenomenon which caused a discrepancy. The results are provided in Table 2.

Below we discuss these types of phenomena, as well as problems in translation they cause, in more detail. In the scope of current work, we concentrate only on the three most frequent phenomena.

2.2.1 Deixis

In this category, we group several types of deictic words or phrases, i.e. referential expressions whose denotation depends on context. This includes personal deixis (“I”, “you”), place deixis (“here”, “there”), and discourse deixis, where parts of the discourse are referenced (“that’s a good question.”). Most errors in our annotated corpus are related to person deixis, specifically gender marking in the Russian translation, and the T-V distinction between informal and formal you (Latin “tu” and “vos”).

In many cases, even when having access to neighboring sentences, one cannot make a confident decision which of the forms should be used, as there are no obvious markers pointing to one form or another (e.g., for the T-V distinction, words such as “officer”, “mister” for formal and “honey”, “dude” for informal). However, when pronouns refer to the same person, the pronouns, as well as verbs that agree with them, should be translated using the same form. See Figure 1(a) for an example translation that violates T-V consistency. Figure 1(b) shows an example of inconsistent first person gender (marked on the verb), although the speaker is clearly the same.

Anaphora are a form of deixis that received a lot of attention in MT research, both from the perspective of modelling (Le Nagard and Koehn, 2010; Hardmeier and Federico, 2010; Jean et al., 2017b; Bawden et al., 2018; Voita et al., 2018, among others) and targeted evaluation (Hardmeier et al., 2015; Guillou and Hardmeier, 2016; Müller et al., 2018), and we list anaphora errors separately, and will not further focus on them.

2.2.2 Ellipsis

Ellipsis is the omission from a clause of one or more words that are nevertheless understood in the context of the remaining elements.

In machine translation, elliptical constructions in the source language pose a problem if the target language does not allow the same types of ellipsis (requiring the elided material to be predicted from context), or if the elided material affects the syntax of the sentence; for example, the grammatical function of a noun phrase and thus its inflection in Russian may depend on the elided verb (Figure 2(a)), or the verb inflection may depend on the elided subject. Our analysis focuses on ellipses that can only be understood and translated with
context beyond the sentence-level. This has not been studied extensively in MT research.3

3 Exceptions include (Yamamoto and Sumita, 1998), and work on the related phenomenon of pronoun dropping (Russo et al., 2012; Wang et al., 2016; Rios and Tuggener, 2017).

We classified ellipsis examples which lead to errors in sentence-level translations by the type of error they cause. Results are provided in Table 4. It can be seen that the most frequent problems related to ellipsis that we find in our annotated corpus are wrong morphological forms, followed by wrongly predicted verbs in case of verb phrase ellipsis in English, which does not exist in Russian, thus requiring the prediction of the verb in the Russian translation (Figure 2(b)).

type of discrepancy           frequency
wrong morphological form      66%
wrong verb (VP-ellipsis)      20%
other error                   14%

Table 4: Types of discrepancy in context-agnostic translation caused by ellipsis.

Figure 2: Examples of discrepancies caused by ellipsis. (a) wrong morphological form, incorrectly marking the noun phrase as a subject. (b) correct meaning is “see”, but MT produces хотели khoteli (“want”).

2.2.3 Lexical cohesion

Lexical cohesion has been studied previously in MT (Tiedemann, 2010; Gong et al., 2011; Wong and Kit, 2012; Kuang et al., 2018; Miculicich et al., 2018, among others).

There are various cohesion devices (Morris and Hirst, 1991), and a good translation should exhibit lexical cohesion beyond the sentence level. We focus on repetition, with two frequent cases in our annotated corpus being reiteration of named entities (Figure 3(a)) and reiteration of more general phrase types for emphasis (Figure 3(b)) or in clarification questions.

Figure 3: Examples of lack of lexical cohesion in MT. (a) Name translation inconsistency. (b) Inconsistent translation. Using either of the highlighted translations consistently would be good.

3 Test Sets

For the most frequent phenomena from the above analysis we create test sets for targeted evaluation.

Each test set contains contrastive examples. It is specifically designed to test the ability of a system to adapt to contextual information and handle the phenomenon under consideration. Each test instance consists of a true example (sequence of sentences and their reference translation from the data) and several contrastive translations which differ from the true one only in the considered aspect. All contrastive translations we use are correct plausible translations at a sentence level, and only context reveals the errors we introduce. All the test sets are guaranteed to have the necessary context in the provided sequence of 3 sentences.

The system is asked to score each candidate example, and we compute the system accuracy as the proportion of times the true translation is preferred over the contrastive ones.

Test set statistics are shown in Table 5.
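To make this scoring protocol concrete, the following is a minimal sketch of how accuracy on a contrastive test set could be computed. It is not the authors' evaluation code; `score_translation` is a hypothetical stand-in for a model's (log-)probability of a target sequence given the source and its context.

```python
from typing import Callable, List


def contrastive_accuracy(
    test_instances: List[dict],
    score_translation: Callable[[List[str], str, str], float],
) -> float:
    """Fraction of instances where the true translation outscores all contrastive ones.

    Each instance holds the source context, the current source sentence, the
    reference ("true") translation and its contrastive variants.
    score_translation(context, source, target) is assumed to return a score
    where higher means the model prefers that target.
    """
    correct = 0
    for inst in test_instances:
        true_score = score_translation(inst["context"], inst["source"], inst["true"])
        contrastive_scores = [
            score_translation(inst["context"], inst["source"], variant)
            for variant in inst["contrastive"]
        ]
        # The true translation must be strictly preferred over every contrastive one.
        if all(true_score > s for s in contrastive_scores):
            correct += 1
    return correct / len(test_instances)
```

Because the deixis test set described below is symmetric in T and V forms, a model that ignores context cannot do better than chance on it, which is why a context-agnostic model scores 50%.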
                      latest relevant context
                 total    1st     2nd     3rd
deixis           3000     1000    1000    1000
lex. cohesion    2000     855     630     515
ellipsis (infl.)  500
ellipsis (VP)     500

Table 5: Size of test sets: total number of test instances and with regard to the latest context sentence with politeness indication or with the named entity under consideration. For ellipsis, we distinguish whether the model has to predict correct noun phrase inflection, or correct verb sense (VP ellipsis).

3.1 Deixis

From Table 3, we see that the most frequent error category related to deixis in our annotated corpus is the inconsistency of T-V forms when translating second person pronouns. The test set we construct for this category tests the ability of a machine translation system to produce translations with a consistent level of politeness.

We semi-automatically identify sets of consecutive sentences with consistent politeness markers on pronouns and verbs (but without nominal markers such as “Mr.” or “officer”) and switch T and V forms. Each automatic step was followed by human postprocessing, which ensures the quality of the final test sets.4 This gives us two sets of translations for each example, one consistently informal (T), and one consistently formal (V). For each, we create an inconsistent contrastive example by switching the formality of the last sentence. The symmetry of the test set ensures that any context-agnostic model has 50% accuracy on the test set.

4 Details are provided in the appendix.

3.2 Ellipsis

From Table 4, we see that the two most frequent types of ambiguity caused by the presence of an elliptical structure have a different nature, hence we construct individual test sets for each of them.

Ambiguity of the first type comes from the inability to predict the correct morphological form of some words. We manually gather examples with such structures in a source sentence and change the morphological inflection of the relevant target phrase to create contrastive translations. Specifically, we focus on noun phrases where the verb is elided, and the ambiguity lies in how the noun phrase is inflected.

The second type we evaluate are verb phrase ellipses. Mostly these are sentences with an auxiliary verb “do” and an omitted main verb. We manually gather such examples and replace the translation of the verb, which is only present on the target side, with other verbs with different meaning, but the same inflection. Verbs which are used to construct such contrastive translations are the top-10 lemmas of translations of the verb “do” which we get from the lexical table of Moses (Koehn et al., 2007) induced from the training data.

3.3 Lexical cohesion

Lexical cohesion can be established for various types of phrases and can involve reiteration or other semantic relations. In the scope of the current work, we focus on the reiteration of entities, since these tend to be non-coincidental, and can be easily detected and transformed.

We identify named entities with alternative translations into Russian, find passages where they are translated consistently, and create contrastive test examples by switching the translation of some instances of the named entity. For more details, please refer to the appendix.

4 Model and Setting

4.1 Setting

Previous work on context-aware neural machine translation used data where all training instances have context. This setting limits the set of available training sets one can use: in a typical scenario, we have a lot of sentence-level parallel data and only a small fraction of document-level data. Since machine translation quality depends heavily on the amount of training data, training a context-aware model is counterproductive if this leads to ignoring the majority of available sentence-level data and sacrificing general quality. We will also show that a naive approach to combining sentence-level and document-level data leads to a drop in performance.

In this work, we argue that it is important to consider an asymmetric setting where the amount of available document-level data is much smaller than that of sentence-level data, and propose an approach specifically targeting this scenario.

4.2 Model

We introduce a two-pass framework: first, the sentence is translated with a context-agnostic model, and then this translation is refined using the context of several previous sentences (context includes source sentences as well as their translations). We expect this architecture to be suitable in the proposed setting: the baseline context-agnostic model can be trained on a large amount of sentence-level data, and the second-pass model can be estimated on a smaller subset of parallel data which includes context. As the first-pass translation is produced by a strong model, we expect no loss in general performance when training the second part on a smaller dataset.
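As a concrete illustration of this two-pass procedure, here is a minimal sketch of document translation with a base model followed by context-aware refinement. It is only a sketch of the setting described above, not the released implementation; `base_translate` and `cadec_refine` are hypothetical wrappers around the context-agnostic model and the context-aware decoder.

```python
def translate_document(sentences, base_translate, cadec_refine, max_context=3):
    """Two-pass translation: context-agnostic first pass, context-aware refinement.

    sentences: list of source sentences of one document, in order.
    base_translate(src) -> str: translation by the context-agnostic base model.
    cadec_refine(src, draft, ctx_src, ctx_tgt) -> str: refined translation given
    the first-pass draft and up to max_context previous source sentences with
    their translations.
    """
    # First pass: translate every sentence in isolation.
    first_pass = [base_translate(src) for src in sentences]

    refined = []
    for i, src in enumerate(sentences):
        ctx_src = sentences[max(0, i - max_context):i]
        ctx_tgt = refined[max(0, i - max_context):i]  # previously refined context
        if not ctx_src:
            # Sentences without context keep the base translation.
            refined.append(first_pass[i])
        else:
            refined.append(cadec_refine(src, first_pass[i], ctx_src, ctx_tgt))
    return refined
```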
Figure 4: Model architecture.

The model is close in spirit to the Deliberation networks (Xia et al., 2017). The first part of the model is a context-agnostic model (we refer to it as the base model), and the second one is a context-aware decoder (CADec) which refines context-agnostic translations using context. The base model is trained on sentence-level data and then fixed. It is used only to sample context-agnostic translations and to get vector representations of the source and translated sentences. CADec is trained only on data with context.

Let $D_{sent} = \{(x_i, y_i)\}_{i=1}^{N}$ denote the sentence-level data with $N$ paired sentences and $D_{doc} = \{(x_j, y_j, c_j)\}_{j=1}^{M}$ denote the document-level data, where $(x_j, y_j)$ are the source and target sides of a sentence to be translated, and $c_j$ are several preceding sentences along with their translations.

Base model  For the baseline context-agnostic model we use the original Transformer-base (Vaswani et al., 2017), trained to maximize the sentence-level log-likelihood

$$\frac{1}{N} \sum_{(x_i, y_i) \in D_{sent}} \log P(y_i \mid x_i, \theta_B).$$

Context-aware decoder (CADec)  The context-aware decoder is trained to correct translations given by the base model using contextual information. Namely, we maximize the following document-level log-likelihood:

$$\frac{1}{M} \sum_{(x_j, y_j) \in D_{doc}} \log \mathbb{E}_{y_j^B \sim P(y \mid x_j, \theta_B)} \, P(y_j \mid x_j, y_j^B, c_j, \theta_C),$$

where $y_j^B$ is sampled from $P(y \mid x_j, \theta_B)$.

CADec is composed of a stack of $N = 6$ identical layers and is similar to the decoder of the original Transformer. It has a masked self-attention layer and attention to encoder outputs, and additionally each layer has a block attending over the outputs of the base decoder (Figure 4). We use the states from the last layer of the base model's encoder of the current source sentence and all context sentences as input to the first multi-head attention. For the second multi-head attention we input both the last states of the base decoder and the target-side token embedding layer; this is done for translations of the source and also all context sentences. All sentence representations are produced by the base model. To encode the relative position of each sentence, we concatenate both the encoder and decoder states with one-hot vectors representing their position (0 for the source sentence, 1 for the immediately preceding one, etc.). These distance embeddings are shown in blue in Figure 4.
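The distance embeddings can be illustrated with a small sketch; the array shapes and the number of distinct positions are assumptions made for the example, not the released code.

```python
import numpy as np


def add_distance_embeddings(sentence_states, num_positions=4):
    """Concatenate a one-hot sentence-position vector to every token state.

    sentence_states: list over sentences (index 0 = current sentence,
    1 = immediately preceding one, ...), each an array of shape
    [num_tokens, hidden_size]. Returns arrays of shape
    [num_tokens, hidden_size + num_positions].
    """
    augmented = []
    for position, states in enumerate(sentence_states):
        one_hot = np.zeros((states.shape[0], num_positions), dtype=states.dtype)
        one_hot[:, position] = 1.0
        augmented.append(np.concatenate([states, one_hot], axis=-1))
    return augmented
```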
5 Experiments

5.1 Training

At training time, we use reference translations as translations of the previous sentences. For the current sentence, we either sample a translation from the base model or use a corrupted version of the reference translation. We propose to stochastically mix objectives corresponding to these versions:

$$\frac{1}{M} \sum_{(x_j, y_j) \in D_{doc}} \log \Big[\, b_j \cdot P(y_j \mid x_j, \tilde{y}_j, c_j, \theta_C) + (1 - b_j) \cdot P(y_j \mid x_j, y_j^B, c_j, \theta_C) \Big],$$

where $\tilde{y}_j$ is a corrupted version of the reference translation and $b_j \in \{0, 1\}$ is drawn from a Bernoulli distribution with parameter $p$, $p = 0.5$ in our experiments. Reference translations are corrupted by replacing 20% of their tokens with random tokens.

We discuss the importance of the proposed training strategy, as well as the effect of varying the value of $p$, in Section 6.5.

5.2 Inference

As input to CADec for the current sentence, we use the translation produced by the base model. Target sides of the previous sentences are produced by our two-stage approach for those sentences which have context and with the base model for those which do not. We use beam search with a beam of 4 for all models.

5.3 Data and setting

We use the publicly available OpenSubtitles2018 corpus (Lison et al., 2018) for English and Russian. As described in detail in the appendix, we apply data cleaning after which only a fraction of the data has context of several previous sentences. We use up to 3 context sentences in this work. We randomly choose 6 million training instances from the resulting data, among which 1.5m have context of three sentences. We randomly choose two subsets of 10k instances for development and testing and construct our contrastive test sets from 400k held-out instances from movies not encountered in training. The hyperparameters, preprocessing and training details are provided in the supplementary material.

6 Results

We evaluate in two different ways: using BLEU for general quality and the proposed contrastive test sets for consistency. We show that models indistinguishable with BLEU can be very different in terms of consistency.

We randomly choose 500 out of 2000 examples from the lexical cohesion set and 500 out of 3000 from the deixis test set for validation and leave the rest for final testing. We compute BLEU on the development set as well as scores on the lexical cohesion and deixis development sets. We use convergence in both metrics to decide when to stop training. The importance of using both criteria is discussed in Section 6.4. After the convergence, we average 5 checkpoints and report scores on the final test sets.

6.1 Baselines

We consider three baselines.

baseline  The context-agnostic baseline is Transformer-base trained on all sentence-level data. Recall that it is also used as the base model in our 2-stage approach.

concat  The first context-aware baseline is a simple concatenation model. It is trained on 6m sentence pairs, including 1.5m having 3 context sentences. For the concatenation baseline, we use a special token separating sentences (both on the source and target side).

s-hier-to-2.tied  This is the version of the model s-hier-to-2 introduced by Bawden et al. (2018), where the parameters between encoders are shared (Müller et al., 2018). The model has an additional encoder for source context, whereas the target side of the corpus is concatenated, in the same way as for the concatenation baseline. Since the model is suitable only for one context sentence, it is trained on 6m sentence pairs, including 1.5m having one context sentence. We chose s-hier-to-2.tied as our second context-aware baseline because it also uses context on the target side and performed best in a contrastive evaluation of pronoun translation (Müller et al., 2018).
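As a concrete illustration of the stochastic mixing used at training time (Section 5.1), the sketch below picks, for one training example, either the base model's sampled translation or a corrupted reference as the first-pass input to CADec. It is a sketch under stated assumptions, not the authors' implementation: the 20% corruption rate is approximated here per token independently, and the vocabulary and random source are placeholders.

```python
import random


def make_first_pass_input(reference_tokens, sampled_tokens, vocab,
                          p=0.5, corruption_rate=0.2, rng=random):
    """Choose the first-pass translation used as CADec input for one example.

    With probability p (the Bernoulli parameter from Section 5.1) a corrupted
    reference is used; otherwise the translation sampled from the base model.
    Corruption replaces roughly corruption_rate of the reference tokens with
    random vocabulary tokens.
    """
    if rng.random() < p:
        return [
            rng.choice(vocab) if rng.random() < corruption_rate else token
            for token in reference_tokens
        ]
    return sampled_tokens
```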
6.2 General results

BLEU scores for our model and the baselines are given in Table 6.5 For context-aware models, all sentences in a group were translated, and then only the current sentence is evaluated. We also report BLEU for the context-agnostic baseline trained only on the 1.5m dataset to show how the performance is influenced by the amount of data.

5 We use bootstrap resampling (Koehn, 2004) for significance testing.

model               BLEU
baseline (1.5m)     29.10
baseline (6m)       32.40
concat              31.56
s-hier-to-2.tied    26.68
CADec               32.38

Table 6: BLEU scores. CADec trained with p = 0.5. Scores for CADec are not statistically different from the baseline (6m).

We observe that our model is no worse in BLEU than the baseline despite the second-pass model being trained only on a fraction of the data. In contrast, the concatenation baseline, trained on a mixture of data with and without context, is about 1 BLEU below the context-agnostic baseline and our model when using all 3 context sentences. CADec's performance remains the same independently of the number of context sentences (1, 2 or 3) as measured with BLEU.

s-hier-to-2.tied performs worst in terms of BLEU, but note that this is a shallow recurrent model, while the others are Transformer-based. It also suffers from the asymmetric data setting, like the concatenation baseline.

6.3 Consistency results

Scores on the deixis, cohesion and ellipsis test sets are provided in Tables 7 and 8. For all tasks, we observe a large improvement from using context. For deixis, the concatenation model (concat) and CADec improve over the baseline by 33.5 and 31.6 percentage points, respectively. On the lexical cohesion test set, CADec shows a large improvement over the context-agnostic baseline (12.2 percentage points), while concat performs similarly to the baseline. For ellipsis, both models improve substantially over the baseline (by 19-51 percentage points), with concat stronger for inflection tasks and CADec stronger for VP-ellipsis. Despite its low BLEU score, s-hier-to-2.tied also shows clear improvements over the context-agnostic baseline in terms of consistency, but underperforms both the concatenation model and CADec, which is unsurprising given that it uses only one context sentence. When looking only at the scores where the latest relevant context is in the model's context window (column 2 in Table 7), s-hier-to-2.tied outperforms the concatenation baseline for lexical cohesion, but remains behind the performance of CADec.

                         latest relevant context
                    total    1st     2nd     3rd
deixis
baseline            50.0     50.0    50.0    50.0
concat              83.5     88.8    85.6    76.4
s-hier-to-2.tied    60.9     83.0    50.1    50.0
CADec               81.6     84.6    84.4    75.9
lexical cohesion
baseline            45.9     46.1    45.9    45.4
concat              47.5     48.6    46.7    46.7
s-hier-to-2.tied    48.9     53.0    46.1    45.4
CADec               58.1     63.2    52.0    56.7

Table 7: Accuracy for deixis and lexical cohesion.

                    ellipsis (infl.)    ellipsis (VP)
baseline            53.0                28.4
concat              76.2                76.6
s-hier-to-2.tied    66.4                65.6
CADec               72.2                80.0

Table 8: Accuracy on ellipsis test set.

The proposed test sets let us distinguish models which are otherwise identical in terms of BLEU: the performance of the baseline and CADec is the same when measured with BLEU, but very different in terms of handling contextual phenomena.

6.4 Context-aware stopping criteria

Figure 5 shows that for context-aware models, BLEU is not sufficient as a criterion for stopping: even when a model has converged in terms of BLEU, it continues to improve in terms of consistency. For CADec trained with p = 0.5, the BLEU score has stabilized after 40k batches, but the lexical cohesion score continues to grow.

Figure 5: BLEU and lexical cohesion accuracy on the development sets during CADec training.
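A minimal sketch of the stopping rule this observation suggests: training stops only once both the BLEU and the consistency scores on the development sets have stopped improving. The patience window and the improvement threshold are assumptions for illustration, not values from the paper.

```python
def should_stop(bleu_history, consistency_history, patience=5, min_delta=1e-3):
    """Stop only when BLEU *and* the consistency score have both converged.

    A metric counts as converged if its best value over the last `patience`
    evaluations improves on the earlier best by no more than `min_delta`.
    """
    def converged(history):
        if len(history) <= patience:
            return False
        best_before = max(history[:-patience])
        recent_best = max(history[-patience:])
        return recent_best <= best_before + min_delta

    return converged(bleu_history) and converged(consistency_history)
```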
p          BLEU     deixis    lex. c.    ellipsis
p = 0      32.34    84.1      48.7       65 / 75
p = 0.25   32.31    83.3      52.4       67 / 78
p = 0.5    32.38    81.6      58.1       72 / 80
p = 0.75   32.45    80.0      65.0       70 / 80

Table 9: Results for different probabilities of using the corrupted reference at training time. BLEU for 3 context sentences. For ellipsis, we show inflection/VP scores.

6.5 Ablation: using corrupted reference

At training time, CADec uses either a translation sampled from the base model or a corrupted reference translation as the first-pass translation of the current sentence. The purpose of using a corrupted reference instead of just sampling is to teach CADec to rely on the base translation and not to change it much. In this section, we discuss the importance of the proposed training strategy.

Results for different values of p are given in Table 9. All models have about the same BLEU, not statistically significantly different from the baseline, but they are quite different in terms of incorporating context. The denoising positively influences almost all tasks except for deixis, yielding the largest improvement on lexical cohesion.

7 Additional Related Work

In concurrent work, Xiong et al. (2018) also propose a two-pass context-aware translation model inspired by deliberation networks. However, while they consider a symmetric data scenario where all available training data has document-level context, and train all components jointly on this data, we focus on an asymmetric scenario where we have a large amount of sentence-level data, used to train our first-pass model, and a smaller amount of document-level data, used to train our second-pass decoder, keeping the first-pass model fixed.

Automatic evaluation of the discourse phenomena we consider is challenging. For lexical cohesion, Wong and Kit (2012) count the ratio between the number of repeated and lexically similar content words over the total number of content words in a target document. However, Guillou (2013) and Carpuat and Simard (2012) find that translations generated by a machine translation system tend to be similarly or more lexically consistent, as measured by a similar metric, than human ones. This even holds for sentence-level systems, where the increased consistency is not due to improved cohesion, but accidental – Ott et al. (2018) show that beam search introduces a bias towards frequent words, which could be one factor explaining this finding. This means that a higher repetition rate does not mean that a translation system is in fact more cohesive, and we find that even our baseline is more repetitive than the human reference.

8 Conclusions

We analyze which phenomena cause otherwise good context-agnostic translations to be inconsistent when placed in the context of each other. Our human study on an English–Russian dataset identifies deixis, ellipsis and lexical cohesion as three main sources of inconsistency. We create test sets focusing specifically on the identified phenomena.

We consider a novel and realistic set-up where a much larger amount of sentence-level data is available compared to that aligned at the document level and introduce a model suitable for this scenario. We show that our model effectively handles contextual phenomena without sacrificing general quality as measured with BLEU despite using only a small amount of document-level data, while a naive approach to combining sentence-level and document-level data leads to a drop in performance. We show that the proposed test sets allow us to distinguish models (even though identical in BLEU) in terms of their consistency. To build context-aware machine translation systems, such targeted test sets should prove useful, for validation, early stopping and for model selection.

Acknowledgments

We would like to thank the anonymous reviewers for their comments and Ekaterina Enikeeva for the help with initial phenomena classification. The authors also thank the Yandex Machine Translation team for helpful discussions and inspiration. Ivan Titov acknowledges support of the European Research Council (ERC StG BroadSem 678254) and the Dutch National Science Foundation (NWO VIDI 639.022.518). Rico Sennrich acknowledges support from the Swiss National Science Foundation (105212_169888), the European Union's Horizon 2020 research and innovation programme (grant agreement no 825460), and the Royal Society (NAF\R1\180122).
References Machine Translation, pages 54–57. Association for
Computational Linguistics.
Ruchit Agrawal, Turchi Marco, and Negri Matteo.
2018. Contextual Handling in Neural Machine Diederik Kingma and Jimmy Ba. 2015. Adam: A
Translation: Look Behind, Ahead and on Both method for stochastic optimization. In Proceedings
Sides. of the International Conference on Learning Repre-
Rachel Bawden, Rico Sennrich, Alexandra Birch, and sentation (ICLR 2015).
Barry Haddow. 2018. Evaluating Discourse Phe-
nomena in Neural Machine Translation. In Proceed- Philipp Koehn. 2004. Statistical significance tests for
ings of the 2018 Conference of the North American machine translation evaluation. In Proceedings of
Chapter of the Association for Computational Lin- the 2004 Conference on Empirical Methods in Nat-
guistics: Human Language Technologies, Volume ural Language Processing.
1 (Long Papers), pages 1304–1313, New Orleans,
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris
USA. Association for Computational Linguistics.
Callison-Burch, Marcello Federico, Nicola Bertoldi,
Marine Carpuat and Michel Simard. 2012. The trouble Brook Cowan, Wade Shen, Christine Moran,
with smt consistency. In Proceedings of the Seventh Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra
Workshop on Statistical Machine Translation, pages Constantin, and Evan Herbst. 2007. Moses: Open
442–449, Montréal, Canada. Association for Com- Source Toolkit for Statistical Machine Translation.
putational Linguistics. In Proceedings of the 45th Annual Meeting of the
Association for Computational Linguistics Compan-
Zhengxian Gong, Min Zhang, and Guodong Zhou. ion Volume Proceedings of the Demo and Poster Ses-
2011. Cache-based document-level statistical ma- sions, pages 177–180, Prague, Czech Republic. As-
chine translation. In Proceedings of the 2011 Con- sociation for Computational Linguistics.
ference on Empirical Methods in Natural Language
Processing, pages 909–919, Edinburgh, Scotland, Mikhail Korobov. 2015. Morphological analyzer and
UK. Association for Computational Linguistics. generator for russian and ukrainian languages. In
Analysis of Images, Social Networks and Texts, vol-
Liane Guillou. 2013. Analysing lexical consistency in ume 542 of Communications in Computer and In-
translation. In Proceedings of the Workshop on Dis- formation Science, pages 320–332. Springer Inter-
course in Machine Translation, pages 10–18, Sofia, national Publishing.
Bulgaria. Association for Computational Linguis-
tics. Shaohui Kuang, Deyi Xiong, Weihua Luo, and
Guodong Zhou. 2018. Modeling coherence for
Liane Guillou and Christian Hardmeier. 2016. Protest: neural machine translation with dynamic and topic
A test suite for evaluating pronouns in machine caches. In Proceedings of the 27th International
translation. In Proceedings of the Tenth Interna- Conference on Computational Linguistics, pages
tional Conference on Language Resources and Eval- 596–606. Association for Computational Linguis-
uation (LREC 2016), Paris, France. European Lan- tics.
guage Resources Association (ELRA).
Samuel Läubli, Rico Sennrich, and Martin Volk. 2018.
Christian Hardmeier and Marcello Federico. 2010.
Has Machine Translation Achieved Human Parity?
Modelling Pronominal Anaphora in Statistical Ma-
A Case for Document-level Evaluation. In Proceed-
chine Translation. In Proceedings of the seventh In-
ings of the 2018 Conference on Empirical Methods
ternational Workshop on Spoken Language Transla-
in Natural Language Processing, pages 4791–4796.
tion (IWSLT), pages 283–289.
Association for Computational Linguistics.
Christian Hardmeier, Preslav Nakov, Sara Stymne,
Jörg Tiedemann, Yannick Versley, and Mauro Cet- Ronan Le Nagard and Philipp Koehn. 2010. Aiding
tolo. 2015. Pronoun-focused mt and cross-lingual pronoun translation with co-reference resolution. In
pronoun prediction: Findings of the 2015 discomt Proceedings of the Joint Fifth Workshop on Statis-
shared task on pronoun translation. In Proceedings tical Machine Translation and MetricsMATR, pages
of the Second Workshop on Discourse in Machine 252–261, Uppsala, Sweden. Association for Com-
Translation, pages 1–16. Association for Computa- putational Linguistics.
tional Linguistics.
Pierre Lison, Jörg Tiedemann, and Milen Kouylekov.
Sebastien Jean, Stanislas Lauly, Orhan Firat, and 2018. Opensubtitles2018: Statistical rescoring of
Kyunghyun Cho. 2017a. Does Neural Machine sentence alignments in large, noisy parallel corpora.
Translation Benefit from Larger Context? In In Proceedings of the Eleventh International Confer-
arXiv:1704.05135. ArXiv: 1704.05135. ence on Language Resources and Evaluation (LREC
2018), Miyazaki, Japan.
Sébastien Jean, Stanislas Lauly, Orhan Firat, and
Kyunghyun Cho. 2017b. Neural machine transla- Sameen Maruf and Gholamreza Haffari. 2018. Docu-
tion for cross-lingual pronoun prediction. In Pro- ment context neural machine translation with mem-
ceedings of the Third Workshop on Discourse in ory networks. In Proceedings of the 56th Annual
Meeting of the Association for Computational Lin- Jörg Tiedemann and Yves Scherrer. 2017. Neural Ma-
guistics (Volume 1: Long Papers), pages 1275– chine Translation with Extended Context. In Pro-
1284, Melbourne, Australia. Association for Com- ceedings of the Third Workshop on Discourse in
putational Linguistics. Machine Translation, DISCOMT’17, pages 82–92,
Copenhagen, Denmark. Association for Computa-
Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, tional Linguistics.
and James Henderson. 2018. Document-level neural
machine translation with hierarchical attention net- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
works. In Proceedings of the 2018 Conference on Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz
Empirical Methods in Natural Language Process- Kaiser, and Illia Polosukhin. 2017. Attention is all
ing, pages 2947–2954, Brussels, Belgium. Associ- you need. In NIPS, Los Angeles.
ation for Computational Linguistics.
Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan
Jane Morris and Graeme Hirst. 1991. Lexical cohe- Titov. 2018. Context-aware neural machine trans-
sion computed by thesaural relations as an indicator lation learns anaphora resolution. In Proceedings of
of the structure of text. Computational Linguistics the 56th Annual Meeting of the Association for Com-
(Volume 17), pages 21–48. putational Linguistics (Volume 1: Long Papers),
pages 1264–1274, Melbourne, Australia. Associa-
Mathias Müller, Annette Rios, Elena Voita, and Rico
tion for Computational Linguistics.
Sennrich. 2018. A Large-Scale Test Set for the
Evaluation of Context-Aware Pronoun Translation Longyue Wang, Zhaopeng Tu, Andy Way, and Qun
in Neural Machine Translation. In Proceedings of Liu. 2017. Exploiting Cross-Sentence Context for
the Third Conference on Machine Translation: Re- Neural Machine Translation. In Proceedings of the
search Papers , pages 61–72, Belgium, Brussels. As- 2017 Conference on Empirical Methods in Natu-
sociation for Computational Linguistics. ral Language Processing, EMNLP’17, pages 2816–
Myle Ott, Michael Auli, David Grangier, and 2821, Denmark, Copenhagen. Association for Com-
Marc’Aurelio Ranzato. 2018. Analyzing uncer- putational Linguistics.
tainty in neural machine translation. In ICML, vol- Longyue Wang, Zhaopeng Tu, Xiaojun Zhang, Hang
ume 80 of JMLR Workshop and Conference Pro- Li, Andy Way, and Qun Liu. 2016. A novel ap-
ceedings, pages 3953–3962. JMLR.org. proach to dropped pronoun translation. In Proceed-
Martin Popel and Ondrej Bojar. 2018. Training Tips ings of the 2016 Conference of the North Ameri-
for the Transformer Model. pages 43–70. can Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages
Annette Rios and Don Tuggener. 2017. Co-reference 983–993, San Diego, California. Association for
resolution of elided subjects and possessive pro- Computational Linguistics.
nouns in spanish-english statistical machine trans-
lation. In Proceedings of the 15th Conference of the Billy T. M. Wong and Chunyu Kit. 2012. Extend-
European Chapter of the Association for Computa- ing machine translation evaluation metrics with lex-
tional Linguistics: Volume 2, Short Papers, pages ical cohesion to document level. In Proceedings of
657–662, Valencia, Spain. Association for Compu- the 2012 Joint Conference on Empirical Methods
tational Linguistics. in Natural Language Processing and Computational
Natural Language Learning, pages 1060–1068, Jeju
Lorenza Russo, Sharid Loáiciga, and Asheesh Gu- Island, Korea. Association for Computational Lin-
lati. 2012. Improving machine translation of null guistics.
subjects in italian and spanish. In Proceedings of
the Student Research Workshop at the 13th Confer- Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin,
ence of the European Chapter of the Association for Nenghai Yu, and Tie-Yan Liu. 2017. Deliberation
Computational Linguistics, pages 81–89, Avignon, networks: Sequence generation beyond one-pass de-
France. Association for Computational Linguistics. coding. In NIPS, Los Angeles.

Rico Sennrich, Barry Haddow, and Alexandra Birch. Hao Xiong, Zhongjun He, Hua Wu, and Haifeng Wang.
2016. Neural machine translation of rare words 2018. Modeling Coherence for Discourse Neural
with subword units. In Proceedings of the 54th An- Machine Translation. In arXiv:1811.05683. ArXiv:
nual Meeting of the Association for Computational 1811.05683.
Linguistics (Volume 1: Long Papers), pages 1715–
1725, Berlin, Germany. Association for Computa- Kazuhide Yamamoto and Eiichiro Sumita. 1998. Fea-
tional Linguistics. sibility study for ellipsis resolution in dialogues by
machine-learning technique. In 36th Annual Meet-
Jörg Tiedemann. 2010. Context adaptation in statisti- ing of the Association for Computational Linguis-
cal machine translation using models with exponen- tics and 17th International Conference on Compu-
tially decaying cache. In Proceedings of the 2010 tational Linguistics, Volume 2.
Workshop on Domain Adaptation for Natural Lan-
guage Processing, pages 8–15, Uppsala, Sweden. Jiacheng Zhang, Huanbo Luan, Maosong Sun, Feifei
Association for Computational Linguistics. Zhai, Jingfang Xu, Min Zhang, and Yang Liu. 2018.
Improving the transformer translation model with document-level context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 533–542, Brussels, Belgium. Association for Computational Linguistics.

A Protocols for test sets

In this section we describe the process of constructing the test suites.

A.1 Deixis

The English second person pronoun “you” may have three different interpretations important when translating into Russian: the second person singular informal (T form), the second person singular formal (V form) and the second person plural (there is no T-V distinction for the plural form of second person pronouns).

Morphological forms for the second person singular (V form) and the second person plural pronoun are the same, which is why, to automatically identify examples in the second person polite form, we look for morphological forms corresponding to second person plural pronouns.

To derive morphological tags for Russian, we use the publicly available pymorphy2 (Korobov, 2015).6

6 https://github.com/kmike/pymorphy2

Below, all the steps performed to obtain the test suite are described in detail.

A.1.1 Automatic identification of politeness

For each sentence we try to automatically find indications of using the T or V form. Presence of the following words and morphological forms is used as an indication of usage of T/V forms:

1. second person singular or plural pronoun,

2. verb in a form corresponding to a second person singular/plural pronoun,

3. verbs in imperative form,

4. possessive forms of second person pronouns.

For 1-3 we used morphological tags predicted by pymorphy2; for the 4th we used hand-crafted lists of forms of second person pronouns, because pymorphy2 fails to identify them.

A.1.2 Human postprocessing of identification of politeness

After examples with an indication of usage of the T/V form are extracted automatically, we manually filter out examples where

1. the second person plural form corresponds to a plural pronoun, not the V form,

2. there is a clear indication of politeness.

The first rule is needed as morphological forms for second person plural and second person singular V form pronouns and related verbs are the same, and there is no simple and reliable way to distinguish these two automatically.

The second rule is to exclude cases where there is only one appropriate level of politeness according to the relation between the speaker and the listener. Such markers include “Mr.”, “Mrs.”, “officer”, “your honour” and “sir”. For the impolite form, these include terms denoting family relationship (“mom”, “dad”), terms of endearment (“honey”, “sweetie”) and words like “dude” and “pal”.

A.1.3 Automatic change of politeness

To construct contrastive examples aiming to test the ability of a system to produce translations with a consistent level of politeness, we have to produce an alternative translation by switching the formality of the reference translation. First, we do it automatically:

1. change the grammatical number of second person pronouns, verbs, imperative verbs,

2. change the grammatical number of possessive pronouns.

For the first transformation we use pymorphy2; for the second we use manual lists of possessive second person pronouns, because pymorphy2 cannot change them automatically.

A.1.4 Human postprocessing of automatic change of politeness

We manually correct the translations from the previous step. Mistakes of the described automatic change of politeness happen because of:

1. ambiguity arising when imperative and indicative verb forms are the same,
2. inability of pymorphy2 to inflect the singular number to some verb forms (e.g., to inflect singular number to past tense verbs),

3. presence of related adjectives, which have to agree with the pronoun,

4. ambiguity arising when a plural form of a pronoun may have different singular forms.

A.1.5 Human annotation: are both polite and impolite versions appropriate?

After the four previous steps, we have text fragments of several consecutive sentences with a consistent level of politeness. Each fragment uses second person singular pronouns, either T form or V form, without nominal markers indicating which of the forms is the only one appropriate. For each group we have both the original version, and the version with the switched formality.

To control for the appropriateness of both levels of politeness in the context of a whole text fragment we conduct a human annotation. Namely, humans are given both versions of the same text fragment corresponding to different levels of politeness, and asked if these versions are natural. The answers they can pick are the following:

1. both appropriate,

2. polite version is not appropriate,

3. impolite version is not appropriate,

4. both versions are bad.

The annotators are not given any specific guidelines, and asked to answer according to their intuition as a native speaker of the language (Russian).

There are a small number of examples where one of the versions is not appropriate and not equally natural as the other one: 4%. Cases where annotators claimed both versions to be bad come from mistakes in target translations: OpenSubtitles data is not perfect, and target sides contain translations which are not reasonable sentences in Russian. These account for 1.5% of all examples. We do not include these 5.5% of examples in the resulting test sets.

A.2 Lexical cohesion

The process of creating the lexical cohesion test set consists of several stages:

1. find passages where named entities are translated consistently,

2. extract alternative translations for these named entities from the lexical table of Moses (Koehn et al., 2007) induced from the training data,

3. construct alternative translations of each example by switching the translation of instances of the named entity,

4. for each example construct several test instances.

A.2.1 Identification of examples with consistent translations

We look for infrequent words that are translated consistently in a text fragment. Since the target language has rich morphology, to verify that translations are the same we have to use lemmas of the translations. More precisely, we

1. train the Berkeley aligner on about 6.5m sentence pairs from both training and held-out data,

2. find lemmas of all words in the reference translations in the held-out data using pymorphy2,

3. find words in the source which are not in the 5000 most frequent words in our vocabulary and whose translations have the same lemma.

A.2.2 Finding alternative translations

For the words under consideration, we find alternative translations which would be (i) equally appropriate in the context of the remaining sentence and text fragment and (ii) possible for the model to produce. To address the first point, we focus on named entities, and we assume that all translations of a given named entity seen in the training data are appropriate. To address the second point, we choose alternative translations from the reference translations encountered in the training data, and pick only ones with a probability of at least 10%. The sequence of actions is as follows:

1. train Moses on the training data (6m sentence pairs),

2. for each word under consideration (from A.2.1), get possible translations from the lexical table of Moses,
3. group possible translations by their lemma using pymorphy2,

4. if a lemma has a probability of at least 10%, we consider this lemma as a possible translation for the word under consideration,

5. leave only examples with the word under consideration having several alternative translations.

After that, more than 90% of examples are translations of named entities (incl. names of geographical objects). We manually filter the examples with named entities.

A.2.3 Constructing a test set

From the two previous steps, we have examples with named entities in context and source sentences and several alternative translations for each named entity. Then we

1. construct alternative translations of each example by switching the translation of instances of the named entity; since the target language has rich morphology, we do it manually,

2. for each example, construct several test instances. For each version of the translation of a named entity, we use this translation in the context, and vary the translation of the entity in the current sentence to create one consistent, and one or more inconsistent (contrastive), translations.

B Experimental setup

B.1 Data preprocessing

We use the publicly available OpenSubtitles2018 corpus (Lison et al., 2018) for English and Russian.7 We pick sentence pairs with a relative time overlap of subtitle frames between source and target language subtitles of at least 0.9 to reduce noise in the data. As context, we take the previous sentence if its timestamp differs from the current one by no more than 7 seconds. Each long group of consecutive sentences is split into fragments of 4 sentences, with the first 3 sentences treated as context. More precisely, from a group of consecutive sentences s1, s2, ..., sn we get (s1, ..., s4), (s2, ..., s5), ..., (s_{n-3}, ..., s_n). For CADec we also include (s1, s2) and (s1, s2, s3) as training examples. We do not add these two groups with less context for the concatenation model, because in preliminary experiments, this performed worse both in terms of BLEU and consistency as measured on our test sets.

7 http://opus.nlpl.eu/OpenSubtitles2018.php

We use the tokenization provided by the corpus and use multi-bleu.perl8 on lowercased data to compute the BLEU score. We use beam search with a beam of 4 for both the base model and CADec. Sentences were encoded using byte-pair encoding (Sennrich et al., 2016), with source and target vocabularies of about 32000 tokens. Translation pairs were batched together by approximate sequence length. For the Transformer models (baselines and concatenation) each training batch contained a set of translation pairs containing approximately 16000 source tokens.9 It has been shown that the Transformer's performance depends heavily on the batch size (Popel and Bojar, 2018), and we chose a large batch size to ensure that models show their best performance. For CADec, we use a batch size that contains approximately the same number of translation instances as the baseline models.

8 https://github.com/moses-smt/mosesdecoder/tree/master/scripts/generic
9 This can be reached by using several GPUs or by accumulating the gradients for several batches and then making an update.

B.2 Model parameters

We follow the setup of the Transformer base model (Vaswani et al., 2017). More precisely, the number of layers in the base encoder, base decoder and CADec is N = 6. We employ h = 8 parallel attention layers, or heads. The dimensionality of input and output is d_model = 512, and the inner layer of the feed-forward networks has dimensionality d_ff = 2048. We use regularization as described in (Vaswani et al., 2017).

B.3 Optimizer

The optimizer we use is the same as in (Vaswani et al., 2017). We use the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.98 and ε = 10^{-9}. We vary the learning rate over the course of training, according to the formula:

lrate = scale · min(step_num^{-0.5}, step_num · warmup_steps^{-1.5})

We use warmup_steps = 16000, scale = 4 for the models trained on 6m data (baseline (6m) and concatenation) and scale = 1 for the models trained on 1.5m data (baseline (1.5m) and CADec).
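To illustrate the fragment construction in B.1, here is a small sketch of the sliding window over a group of consecutive sentences, including the two shorter groups added for CADec as described above. It is an illustration of the described preprocessing, not the released pipeline.

```python
def make_training_fragments(sentences, for_cadec=False):
    """Split a group of consecutive sentences into 4-sentence fragments.

    From s1, ..., sn we take (s1, ..., s4), (s2, ..., s5), ..., (s_{n-3}, ..., s_n),
    with the first three sentences of each fragment treated as context.
    For CADec, the shorter prefixes (s1, s2) and (s1, s2, s3) are also included.
    """
    fragments = [tuple(sentences[i:i + 4]) for i in range(len(sentences) - 3)]
    if for_cadec and len(sentences) >= 3:
        fragments = [tuple(sentences[:2]), tuple(sentences[:3])] + fragments
    return fragments
```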
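Similarly, the learning-rate schedule from B.3 can be written as a one-line helper; this simply mirrors the formula above and is not taken from the released code.

```python
def learning_rate(step_num, warmup_steps=16000, scale=4.0):
    """Inverse-square-root schedule with linear warmup (step_num starts at 1)."""
    return scale * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)
```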
