
Don’t Until the Final Verb Wait:

Reinforcement Learning for Simultaneous Machine Translation

Alvin C. Grissom II and Jordan Boyd-Graber
Computer Science, University of Colorado, Boulder, CO
Alvin.Grissom@colorado.edu, Jordan.Boyd.Graber@colorado.edu

He He, John Morgan, and Hal Daumé III
Computer Science and UMIACS, University of Maryland, College Park, MD
{hhe,jjm,hal}@cs.umd.edu

Abstract

We introduce a reinforcement learning-based approach to simultaneous machine translation—producing a translation while receiving input words—between languages with drastically different word orders: from verb-final languages (e.g., German) to verb-medial languages (English). In traditional machine translation, a translator must "wait" for source material to appear before translation begins. We remove this bottleneck by predicting the final verb in advance. We use reinforcement learning to learn when to trust predictions about unseen, future portions of the sentence. We also introduce an evaluation metric to measure expeditiousness and quality. We show that our new translation model outperforms batch and monotone translation strategies.

1 Introduction

We introduce a simultaneous machine translation (MT) system that predicts unseen verbs and uses reinforcement learning to learn when to trust these predictions and when to wait for more input.

Simultaneous translation is producing a partial translation of a sentence before the input sentence is complete, and is often used in important diplomatic settings. One of the first noted uses of human simultaneous interpretation was the Nuremberg trials after the Second World War. Siegfried Ramler (2009), the Austrian-American who organized the translation teams, describes the linguistic predictions and circumlocutions that translators would use to achieve a tradeoff between translation latency and accuracy. The audio recording technology used by those interpreters sowed the seeds of technology-assisted interpretation at the United Nations (Gaiba, 1998).

Performing real-time translation is especially difficult when information that comes early in the target language (the language you're translating to) comes late in the source language (the language you're translating from). A common example is when translating from a verb-final (sov) language (e.g., German or Japanese) to a verb-medial (svo) language (e.g., English). In the example in Figure 1, for instance, the main verb of the sentence (in bold) appears at the end of the German sentence. An offline (or "batch") translation system waits until the end of the sentence before translating anything. While this is a reasonable approach, it has obvious limitations. Real-time, interactive scenarios—such as online multilingual video conferences or diplomatic meetings—require comprehensible partial interpretations before a sentence ends. Thus, a significant goal in interpretation is to make the translation as expeditious as possible.

We present three components for an sov-to-svo simultaneous mt system: a reinforcement learning framework that uses predictions to create expeditious translations (Section 2), a system to predict how a sentence will end (e.g., predicting the main verb; Section 4), and a metric that balances quality and expeditiousness (Section 3). We combine these components in a framework that learns when to begin translating sections of a sentence (Section 5).

ich bin mit dem Zug nach Ulm gefahren
I am with the train to Ulm traveled
I (. . . waiting . . . ) traveled by train to Ulm

Figure 1: An example of translating from a verb-final language to English. The verb, in bold, appears at the end of the sentence, preventing coherent translations until the final source word is revealed.

Section 6 combines this framework with a translation system that produces simultaneous translations. We show that our data-driven system can successfully predict unseen parts of the sentence, learn when to trust them, and outperform strong baselines (Section 7).

While some prior research has approached the problem of simultaneous translation—we review these systems in more detail in Section 8—no current model learns when to definitively begin translating chunks of an incomplete sentence. Finally, in Section 9, we discuss the limitations of our system: it only uses the most frequent source language verbs, it only applies to sentences with a single main verb, and it uses an idealized translation system. However, these limitations are not insurmountable; we describe how a more robust system can be assembled from these components.

2 Decision Process for Simultaneous Translation

Human interpreters learn strategies for their profession with experience and practice. As words in the source language are observed, a translator—human or machine—must decide whether and how to translate, while, for certain language pairs, simultaneously predicting future words. We would like our system to do the same. To this end, we model simultaneous mt as a Markov decision process (mdp) and use reinforcement learning to effectively combine predicting, waiting, and translating into a coherent strategy.

2.1 States: What is, what is to come

The state s_t represents the current view of the world given that we have seen t words of a source language sentence.[1] The state contains information both about what is known and what is predicted based on what is known.

To compare the system to a human translator in a decision-making process, the state is akin to the translator's cognitive state. At any given time, we have knowledge (observations) and beliefs (predictions) with varying degrees of certainty: that is, the state contains the revealed words x_{1:t} of a sentence; the state also contains predictions about the remainder of the sentence: we predict the next word in the sentence and the final verb.

More formally, we have a prediction at time t of the next source language word that will appear, n_{t+1}^{(t)}, and for the final verb, v^{(t)}. For example, given the partial observation "ich bin mit dem", the state might contain a prediction that the next word, n_{t+1}^{(t)}, will be "Zug" and that the final verb v^{(t)} will be "gefahren". We discuss the mechanics of next-word and verb prediction further in Section 4; for now, consider these black boxes which, after observing every new source word x_t, make predictions of future words in the source language. This representation of the state allows for a richer set of actions, described below, permitting simultaneous translations that outpace the source language input[2] by predicting the future.

[1] We use t to evoke a discrete version of time. We only allow actions after observing a complete source word.
[2] Throughout, "input" refers to source language input, and "output" refers to target language translation.

2.2 Actions: What our system can do

Given observed and hypothesized input, our simultaneous translation system must decide when to translate them. This is expressed in the form of four actions: our system can commit to a partial translation, predict the next word and use it to update the translation, predict the verb and use it to update the translation, or wait for more words.

We discuss each of these actions in turn before describing how they come together to incrementally translate an entire sentence:

Wait Waiting is the simplest action. It produces no output and allows the system to receive more input, biding its time, so that when it does choose to translate, the translation is based on more information.

Commit Committing produces translation output: given the observed source sentence, produce the best translation possible.

Figure 2: A simultaneous translation from source (German) to target (English). The agent chooses to wait until after (3). At this point, it is sufficiently confident to predict the final verb of the sentence (4). Given this additional information, it can now begin translating the sentence into English, constraining future translations (5). As the rest of the sentence is revealed, the system can translate the remainder of the sentence. (The figure steps through the states for the German input "Mit dem Zug bin ich nach Ulm gefahren" and the incremental English output "I traveled by train to Ulm.")

Next Word The next word action takes a prediction of the next source word and produces an updated translation based on that prediction, i.e., appending the predicted word to the source sentence and translating the new sentence.

Verb Our system can also predict the source sentence's final verb (the last word in the sentence). When our system takes the verb action, it uses its verb prediction to update the translation by placing the predicted verb at the end of the source sentence.

We can recreate a traditional batch translation system (interpreted temporally) by a sequence of wait actions until all input is observed, followed by a commit to the complete translation. Our system can commit to partial translations if it is confident, but producing a good translation early in the sentence often depends on missing information.
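To make the state and action space concrete, the following is a minimal Python sketch (an illustration only, not the authors' implementation; the predictors that fill in the next word and the verb are the black boxes described in Section 2.1):

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional

class Action(Enum):
    WAIT = auto()       # produce no output; read another source word
    COMMIT = auto()     # translate the observed words as they stand
    NEXT_WORD = auto()  # append the predicted next word, then translate
    VERB = auto()       # append the predicted final verb, then translate

@dataclass
class State:
    observed: List[str]  # x_{1:t}, the source words seen so far
    next_word: str       # n_{t+1}^{(t)}, predicted next source word
    verb: str            # v^{(t)}, predicted sentence-final verb

def source_for_action(state: State, action: Action) -> Optional[List[str]]:
    """Source fragment handed to the black-box translation system."""
    if action is Action.WAIT:
        return None                               # nothing is translated yet
    if action is Action.COMMIT:
        return list(state.observed)
    if action is Action.NEXT_WORD:
        return state.observed + [state.next_word]
    return state.observed + [state.verb]          # VERB: prediction goes sentence-final

# Example mirroring the running example: after "ich bin mit dem" the state might
# predict "Zug" as the next word and "gefahren" as the final verb.
s = State(["ich", "bin", "mit", "dem"], next_word="Zug", verb="gefahren")
print(source_for_action(s, Action.VERB))  # ['ich', 'bin', 'mit', 'dem', 'gefahren']
```

Under this sketch, wait leaves the output untouched, while the other three actions hand a (possibly extended) source fragment to the translation system.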
2.3 Translation Process

Having described the state, its components, and the possible actions at a state, we present the process in its entirety. In Figure 2, after each German word is received, the system arrives at a new state, which consists of the source input, target translation so far, and predictions of the unseen words. The translation system must then take an action given information about the current state. The action will result in receiving and translating more source words, transitioning the system to the next state. In the example, for the first few source-language words, the translator lacks the confidence to produce any output due to insufficient information at the state. However, after State 3, the state shows high confidence in the predicted verb "gefahren". Combined with the German input it has observed, the system is sufficiently confident to act on that prediction to produce an English translation.

2.4 Consensus Translations

Three straightforward actions—commit, next word, and verb—all produce translations. These rely on black-box access to a translation system (discussed in detail in Section 6): that is, given a source language sentence fragment, the translation system produces a target language sentence fragment.

Because these actions can happen more than once in a sentence, we must form a single consensus translation from all of the translations that we might have seen. If we have only one translation or if translations are identical, forming the consensus translation is trivial. But how should we resolve conflicting translations? Any time our system chooses an action that
produces output, the observed input (plus extra predictions in the case of next-word or verb) is passed into the translation system. That system then produces a complete translation of its input fragment.

Any new words—i.e., words whose target index is greater than the length of any previous translation—are appended to the previous translation.[3] Table 1 shows an example of forming these consensus translations.

Pos | Input                      | Intermediate                       | Consensus
1   |                            |                                    |
2   | Er                         | He1                                | He1
3   | Er wurde gestaltet         | It1 was2 designed3                 | He1 was2 designed3
4   |                            | It1 was2 designed3                 | He1 was2 designed3
5   | Er wurde gestern renoviert | It1 was2 renovated3 yesterday4     | He1 was2 designed3 yesterday4

Table 1: How intermediate translations are combined into a consensus translation. Incorrect translations (e.g., "he" for an inanimate object in position 3) and incorrect predictions (e.g., incorrectly predicting the verb gestaltet in position 5) are kept in the consensus translation. When no translation is made, the consensus translation remains static.

[3] Using constrained decoding to enforce consistent translation prefixes would complicate our method but is an appealing extension.
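The append-only merging rule can be sketched in a few lines (an illustration of the scheme described above and in Table 1, not the authors' code):

```python
from typing import List

def update_consensus(consensus: List[str], new_translation: List[str]) -> List[str]:
    """Append only words whose target index exceeds the current consensus length;
    earlier positions are never revised (cf. Table 1)."""
    if len(new_translation) <= len(consensus):
        return consensus                    # no new words: consensus stays static
    return consensus + new_translation[len(consensus):]

# Position 5 of Table 1: the later translation cannot overwrite "designed",
# but its new fourth word "yesterday" is appended.
print(update_consensus(["He", "was", "designed"],
                       ["It", "was", "renovated", "yesterday"]))
# ['He', 'was', 'designed', 'yesterday']
```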
Now that we have defined how states evolve based on our system's actions, we need to know how to select which actions to take. Eventually, we will formalize this as a learned policy (Section 5) that maps from states to actions. First, however, we need to define a reward that measures how "good" an action is.

3 Objective: What is a good simultaneous translation?

Good simultaneous translations must optimize two objectives that are often at odds, i.e., producing translations that are, in the end, accurate, and producing them in pieces that are presented expeditiously. While there are existing automated metrics for assessing translation quality (Papineni et al., 2002; Banerjee and Lavie, 2005; Snover et al., 2006), these must be modified to find the necessary compromise between translation quality and expeditiousness. That is, a good metric for simultaneous translation must achieve a balance between translating chunks early and translating accurately. All else being equal, maximizing either goal in isolation is trivial: for accurate translations, use a batch system and wait until the sentence is complete, translating it all at once; for a maximally expeditious translation, create monotone translations, translating each word as it appears, as in Tillmann et al. (1997) and Pytlik and Yarowsky (2006). The former is not simultaneous at all; the latter is mere word-for-word replacement and results in awkward, often unintelligible translations of distant language pairs.

Once we have predictions, we have an expanded array of possibilities, however. On one extreme, we can imagine a psychic translator—one that can completely translate an imagined sentence after one word is uttered—as an unobtainable system. On the other extreme is a standard batch translator, which waits until it has access to the utterer's complete sentence before translating anything.

Again, we argue that a system can improve on this by predicting unseen parts of the sentence to find a better tradeoff between these conflicting goals. However, to evaluate and optimize such a system, we must measure where a system falls on the continuum of accuracy versus expeditiousness.

Consider partial translations in a two-dimensional space, with time (quantized by the number of source words seen) increasing from left to right on the x axis and the bleu score (including brevity penalty against the reference length) on the y axis. At each point in time, the system may add to the consensus translation, changing the precision (Figure 3). Like an roc curve, a good system will be high and to the left, optimizing the area under the curve: the ideal system would produce points as high as possible immediately. A translation which is, in the end, accurate, but which is less expeditious, would accrue its score more slowly but outperform a similarly expeditious system which nevertheless translates poorly.

An idealized psychic system achieves this, claiming all of the area under the curve, as it would have a perfect translation instantly, having no need of even waiting for future input.[4] A batch system has only a narrow (but tall) sliver to the right, since it translates nothing until all of the words are observed.

Figure 3: Comparison of lbleu (the area under the curve given by Equation 1) for an impossible psychic system, a traditional batch system, a monotone (German word order) system, and our prediction-based system, shown for the source sentence "Er ist zum Laden gegangen". By correctly predicting the verb "gegangen" (to go), we achieve a better overall translation more quickly.

[4] One could reasonably argue that this is not ideal: a fluid conversation requires the prosody and timing between source and target to match exactly. Thus, a psychic system would provide too much information too quickly, making information exchange unnatural. However, we take the information-centric approach: more information faster is better.

Formally, let Q be the score function for a partial translation, x the sequentially revealed source words x_1, x_2, . . . , x_T from time step 1 to T, and y the partial translations y_1, y_2, . . . , y_T, where T is the length of the source language input. Each incremental translation y_t has a bleu-n score with respect to a reference r. We apply the usual bleu brevity penalty to all the incremental translations (initially empty) to
obtain latency-bleu (lbleu),

    Q(x, y) = (1/T) Σ_t bleu(y_t, r) + T · bleu(y_T, r)    (1)

The lbleu score is a word-by-word integral across the input source sentence. As each source word is observed, the system receives a reward based on the bleu score of the partial translation. lbleu, then, represents the sum of these T rewards at each point in the sentence. The score of a simultaneous translation is the sum of the scores of all individual segments that contribute to the overall translation. We multiply the final bleu score by T to ensure good final translations in learned systems to compensate for the implicit bias toward low latency.[5]

[5] One could replace T with a parameter, β, to bias towards different kinds of simultaneous translations. As β → ∞, we recover batch translation.
+ T · bleu(yT , r) and Ney, 1995). We use Pauls and Klein
(2011). The prior probability p(v) is estimated
The lbleu score is a word-by-word inte- by simple relative frequency estimation. The
gral across the input source sentence. As each context, x1:t , consists of all words observed.
source word is observed, the system receives a We model p(x1:t | v) with verb-specific n-gram
reward based on the bleu score of the partial language models. The predicted verb v (t) at
translation. lbleu, then, represents the sum of time t is then:
these T rewards at each point in the sentence. t
Y
The score of a simultaneous translation is the arg max p(v) p(xi | v, xi−n+1:i−1 ) (2)
sum of the scores of all individual segments v
i=1
that contribute to the overall translation. 5
One could replace T with a parameter, β, to bias
We multiply the final bleu score by T to en- towards different kinds of simultaneous translations. As
sure good final translations in learned systems β → ∞, we recover batch translation.

where x_{i-n+1:i-1} is the (n − 1)-gram context. To narrow the search space, we consider only the 100 most frequent final verbs, where a "final verb" is defined as the sentence-final sequence of verbs and particles as detected by a German part-of-speech tagger (Toutanova et al., 2003).[6]

[6] This has the obvious disadvantage of ignoring morphology and occasionally creating duplicates of common verbs that may be associated with multiple particles; nevertheless, it provides a straightforward verb to predict.
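In sketch form, Equation 2 is an argmax over the candidate verbs, scored in log space. The prior table and the per-verb n-gram models with a prob(word, context) lookup are assumed placeholders here, not a real library interface:

```python
import math
from typing import Dict, List

def predict_verb(observed: List[str],
                 priors: Dict[str, float],
                 verb_lms: Dict[str, object],
                 order: int = 5) -> str:
    """Return arg max_v p(v) * prod_i p(x_i | v, context), computed in log space."""
    best_verb, best_score = None, -math.inf
    for verb, prior in priors.items():
        lm = verb_lms[verb]                              # verb-specific n-gram model
        score = math.log(prior)
        for i, word in enumerate(observed):
            context = tuple(observed[max(0, i - order + 1):i])
            score += math.log(lm.prob(word, context))    # assumed LM lookup
        if score > best_score:
            best_verb, best_score = verb, score
    return best_verb
```

Next-word prediction is the analogous argmax under the single bigram model, restricted to the candidate set described in Section 7.1.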
5 Learning a Policy

We have a framework (states and actions) for simultaneous machine translation and a metric for assessing simultaneous translations. We now describe the use of reinforcement learning to learn a policy, a mapping from states to actions, to maximize lbleu reward.

We use imitation learning (Abbeel and Ng, 2004; Syed et al., 2008): given an optimal sequence of actions, learn a generalized policy that maps states to actions. This can be viewed as a cost-sensitive classification (Langford and Zadrozny, 2005): a state is represented as a feature vector, the loss corresponds to the quality of the action, and the output of the classifier is the action that should be taken in that state.

In this section, we explain each of these components: generating an optimal policy, representing states through features, and learning a policy that can generalize to new sentences.

5.1 Optimal Policies

Because we will eventually learn policies via a classifier, we must provide training examples to our classifier. These training examples come from an oracle policy π* that demonstrates the optimal sequence—i.e., with maximal lbleu score—of actions for each sequence. Using dynamic programming, we can determine such actions for a fixed translation model.[7] From this oracle policy, we generate training examples for a supervised classifier.

State s_t is represented as a tuple of the observed words x_{1:t}, predicted verb v^{(t)}, and the predicted word n_{t+1}^{(t)}. We represent the state to a classifier as a feature vector φ(x_{1:t}, n_{t+1}^{(t)}, v^{(t)}).

[7] This is possible for the limited class of consensus translation schemes discussed in Section 2.4.

5.2 Feature Representation

We want a feature representation that will allow a classifier to generalize beyond the specific examples on which it is trained. We use several general classes of features: features that describe the input, features that describe the possible translations, and features that describe the quality of the predictions.

Input We include both a bag of words representation of the input sentence as well as the most recent word and bigram to model word-specific effects. We also use a feature that encodes the length of the source sentence.

Prediction We include the identity of the predicted verb and next word as well as their respective probabilities under the language models that generate the predictions. If the model is confident in the prediction, the classifier can learn to trust the predictions more.

Translation In addition to the state, we include features derived from the possible actions the system might take. This includes a bag of words representation of the target sentence, the score of the translation (decreasing the score is undesirable), the score of the current consensus translation, and the difference between the current and potential translation scores.
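As a rough illustration of how such a state might be featurized, reusing the State sketch from Section 2 (the feature names and exact set below are illustrative, not the paper's actual feature templates):

```python
from typing import Dict, List

def featurize(state, candidate_translation: List[str],
              prediction_probs: Dict[str, float],
              scores: Dict[str, float]) -> Dict[str, float]:
    """Sparse feature dictionary for the cost-sensitive policy classifier."""
    feats = {f"src_bow={w}": 1.0 for w in state.observed}          # input bag of words
    feats[f"last_word={state.observed[-1]}"] = 1.0
    if len(state.observed) > 1:
        feats[f"last_bigram={state.observed[-2]}_{state.observed[-1]}"] = 1.0
    feats["src_length"] = float(len(state.observed))
    feats.update({f"tgt_bow={w}": 1.0 for w in candidate_translation})
    # Prediction features: identities and language-model confidences.
    feats[f"pred_verb={state.verb}"] = 1.0
    feats[f"pred_next={state.next_word}"] = 1.0
    feats["verb_prob"] = prediction_probs["verb"]
    feats["next_word_prob"] = prediction_probs["next_word"]
    # Translation features: current versus potential translation scores.
    feats["consensus_score"] = scores["consensus"]
    feats["candidate_score"] = scores["candidate"]
    feats["score_delta"] = scores["candidate"] - scores["consensus"]
    return feats
```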
5.3 Policy Learning

Our goal is to learn a classifier that can accurately mimic the oracle's choices on previously unseen data. However, at test time, when we run the learned policy classifier, the learned policy's state distribution may deviate from the optimal policy's state distribution due to imperfect imitation, arriving in states not on the oracle's path. To address this, we use searn (Daumé III et al., 2009), an iterative imitation learning algorithm. We learn from the optimal policy in the first iteration, as in standard supervised learning; in the following iterations, we run an interpolated policy

    π_{k+1} = ε π_k + (1 − ε) π*,    (3)

with k as the iteration number and ε the mixing probability. We collect examples by asking the policy to label states on its path. The interpolated policy will execute the optimal
action with probability 1 − ε and the learned policy's action with probability ε. In the first iteration, we have π_0 = π*.

Mixing in the learned policy allows the learned policy to slowly change from the oracle policy. As it trains on these no-longer-perfect state trajectories, the state distribution at test time will be more consistent with the states used in training.

searn learns the policy by training a cost-sensitive classifier. Besides providing the optimal action, the oracle must also assign a cost to an action,

    C(a_t, x) ≡ Q(x, π*(x_t)) − Q(x, a_t(x_t)),    (4)

where a_t(x_t) represents the translation outcome of taking action a_t. The cost is the regret of not taking the optimal action.
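The overall training loop can be sketched as follows (a schematic of the searn-style procedure described above; oracle, simulate, and train_classifier stand in for components the paper describes in prose but does not name as code):

```python
import random
from typing import Callable, List, Optional

def mixture_policy(state, learned: Optional[Callable], oracle: Callable, epsilon: float):
    """Interpolated policy of Equation 3: oracle with probability 1 - epsilon,
    learned classifier with probability epsilon (pure oracle in iteration 0)."""
    if learned is None or random.random() < 1.0 - epsilon:
        return oracle(state)
    return learned(state)

def searn_train(sentences: List, oracle: Callable, simulate: Callable,
                featurize_fn: Callable, train_classifier: Callable,
                iterations: int = 5, epsilon: float = 0.5):
    """Collect states along the mixture policy's path, label them with the oracle's
    action, and retrain the classifier each iteration."""
    learned = None
    for _ in range(iterations):
        examples = []
        for sentence in sentences:
            policy = lambda s: mixture_policy(s, learned, oracle, epsilon)
            # `simulate` replays the sentence word by word under `policy`
            # and yields every state visited along the way.
            for state in simulate(sentence, policy):
                examples.append((featurize_fn(state), oracle(state)))
        learned = train_classifier(examples)
    return learned
```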
6 Translation System

The focus of this work is to show that given an effective batch translation system and predictions, we can learn a policy that will turn this into a simultaneous translation system. Thus, to separate translation errors from policy errors, we perform experiments with a nearly optimal translation system we call an omniscient translator.

More realistic translation systems will naturally lower the objective function, often in ways that make it difficult to show that we can effectively predict the verbs in verb-final source languages. For instance, German to English translation systems often drop the verb; thus, predicting a verb that will be ignored by the translation system will not be effective.

The omniscient translator translates a source sentence correctly once it has been fed the appropriate source words as input. There are two edge cases: empty input yields an empty output, while a complete, correct source sentence returns the correct, complete translation. Intermediate cases—where the input is either incomplete or incorrect—require using an alignment. The omniscient translator assumes as input a reference translation r, a partial source language input x_{1:t} and a corresponding partial output y. In addition, the omniscient translator assumes access to an alignment between r and x. In practice, we use the hmm aligner (Vogel et al., 1996; Och and Ney, 2003).

We first consider incomplete but correct inputs. Let y = τ(x_{1:t}) be the translator's output given a partial source input x_{1:t} with translation y. Then, τ(x_{1:t}) produces all target words y_j if there is a source word x_i in the input aligned to those words—i.e., (i, j) ∈ a_{x,y}—and all preceding target words can be translated. (That translations must be contiguous is a natural requirement for human recipients of translations.) In the case where y_j is unaligned, the closest aligned target word to y_j that has a corresponding alignment entry is aligned to x_i; then, if x_i is present in the input, y_j appears in the output. Thus, our omniscient translation system will always produce the correct output given the correct input.

However, our learned policy can make wrong predictions, which can produce partial translations y that do not match the reference. In this event, an incorrect source word x̃_i produces incorrect target words ỹ_j, for all j : (i, j) ∈ a_{x,y}. These ỹ_j are sampled from the ibm Model 1 lexical probability table multiplied by the source language model, ỹ_j ∼ Mult(θ_{x̃_i}) p_LM(x̃).[8] Thus, even if we predict the correct verb using a next word action, it will be in the wrong position and thus generate a translation from the lexical probabilities. Since translations based on Model 1 probabilities are generally inaccurate, the omniscient translator will do very well when given correct input but will produce very poor translations otherwise.

[8] If a policy chooses an incorrect unaligned word, it has no effect on the output. Alignments are position-specific, so "wrong" refers to position and type.
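For the correct-input case, the rule above reduces to a prefix check over the reference translation; the following simplified sketch (ours, ignoring unaligned target words and the incorrect-prediction case) illustrates it:

```python
from typing import List, Sequence, Set, Tuple

def omniscient_translate(observed: Sequence[str], reference: Sequence[str],
                         alignment: Set[Tuple[int, int]]) -> List[str]:
    """Emit reference words y_j in order while every source word aligned to y_j
    (alignment pairs (i, j), 0-indexed) has already been observed; stop at the
    first gap, since translations must be contiguous."""
    seen = set(range(len(observed)))
    output = []
    for j, target_word in enumerate(reference):
        aligned_sources = {i for (i, jj) in alignment if jj == j}
        if aligned_sources and aligned_sources <= seen:
            output.append(target_word)
        else:
            break
    return output
```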
7 Experiments

In this section, we describe our experimental framework and results from our experiments. From aligned data, we derive an omniscient translator. We use monolingual data in the source language to train the verb predictor and the next word predictor. From these features, we compute an optimal policy from which we train a learned policy.

7.1 Data sets

For translation model and policy training, we use data from the German-English Parallel "de-news" corpus of radio broadcast news (Koehn, 2000), which we lower-cased and stripped of
punctuation. A total of 48,601 sentence pairs are randomly selected for building our system. Of these, we use 70% (34,528 pairs) for training word alignments.

For training the translation policy, we restrict ourselves to sentences that end with one of the 100 most frequent verbs (see Section 4). This results in a data set of 4401 training sentences and 1832 test sentences from the de-news data. We did this to narrow the search space (from thousands of possible, but mostly very infrequent, verbs).

We used 1 million words of news text from the Leipzig Wortschatz (Quasthoff et al., 2006) German corpus to train 5-gram language models to predict a verb from the 100 most frequent verbs.

For next-word prediction, we use the 18,345 most frequent German bigrams from the training set to provide a set of candidates in a language model trained on the same set. We use frequent bigrams to reduce the computational cost of finding the completion probability of the next word.

7.2 Training Policies

In each iteration of searn, we learn a multi-class classifier to implement the policy. The specific learning algorithm we use is arow (Crammer et al., 2013). In the complete version of searn, the cost of each action is calculated as the highest expected reward starting at the current state minus the actual roll-out reward. However, computing the full roll-out reward is computationally very expensive. We thus use a surrogate binary cost: if the predicted action is the same as the optimal action, the cost is 0; otherwise, the cost is 1. We then run searn for five iterations. Results on the development data indicate that continuing for more iterations yields no benefit.
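In other words, the cost vector attached to each training example is 0 for the oracle's action and 1 for every other action; a brief sketch of this surrogate for the roll-out regret of Equation 4 (again schematic, not the authors' code):

```python
def surrogate_costs(state, actions, oracle):
    """Binary surrogate cost: 0 for the oracle's action, 1 for everything else."""
    best = oracle(state)
    return {a: 0.0 if a == best else 1.0 for a in actions}
```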
7.3 Policy Rewards on Test Set

In Figure 4, we show performance of the optimal policy vis-à-vis the learned policy, as well as the two baseline policies: the batch policy and the monotone policy. The x-axis is the percentage of the source sentence seen by the model, and the y-axis is a smoothed average of the reward as a function of the percentage of the sentence revealed. The monotone policy's performance is close to the optimal policy for the first half of the sentence, as German and English have similar word order, though they diverge toward the end. Our learned policy outperforms the monotone policy toward the end and of course outperforms the batch policy throughout the sentence.

Figure 4: The final reward of policies on German data. Our policy outperforms all baselines by the end of the sentence.

Figure 5 shows counts of actions taken by each policy. The batch policy always commits at the end. The monotone policy commits at each position. Our learned policy has an action distribution similar to that of the optimal policy, but is slightly more cautious.

Figure 5: Histogram of actions taken by the policies.

7.4 What Policies Do

Figure 6 shows a policy that, predicting incorrectly, still produces sensible output. The policy correctly intuits that the person discussed
is Angela Merkel, who was the environmental minister at the time, but the policy uses an incorrectly predicted verb. Because of our poor translation model (Section 6), it renders this word as "shown", which is a poor translation. However, it is still comprehensible, and the overall policy is similar to what a human would do: intuit the subject of the sentence from early clues and use a more general verb to stand in for a more specific one.

Figure 6: An imperfect execution of a learned policy. Despite choosing the wrong verb "gezeigt" (showed) instead of "vorgestellt" (presented), the translation retains the meaning. (The figure's final input is "bundesumweltministerin merkel hat den entwurf eines umweltpolitischen programms vorgestellt", with output "federal minister of the environment angela merkel shown the draft of an ecopolitical program".)

8 Related Work

Just as mt was revolutionized by statistical learning, we suspect that simultaneous mt will similarly benefit from this paradigm, both from a systematic system for simultaneous translation and from a framework for learning how to incorporate predictions.

Simultaneous translation has been dominated by rule and parse-based approaches (Mima et al., 1998a; Ryu et al., 2006). In contrast, although Verbmobil (Wahlster, 2000) performs incremental translation using a statistical mt module, its incremental decision-making module is rule-based. Other recent approaches in speech-based systems focus on waiting until a pause to translate (Sakamoto et al., 2013) or using word alignments (Ryu et al., 2012) between languages to determine optimal translation units.

Unlike our work, which focuses on prediction and learning, previous strategies for dealing with sov-to-svo translation use rule-based methods (Mima et al., 1998b) (for instance, passivization) to buy time for the translator to hear more information in a spoken context—or use phrase table and reordering probabilities to decide where to translate with less delay (Fujita et al., 2013). Oda et al. (2014) is the most similar to our work on the translation side. They frame word segmentation as an optimization problem, using a greedy search and dynamic programming to find segmentation strategies that maximize an evaluation measure. However, unlike our work, the direction of translation was from svo to svo, obviating the need for verb prediction. Simultaneous translation is more straightforward for languages with compatible word orders, such as English and Spanish (Fügen, 2008).

To our knowledge, the only attempt to specifically predict verbs or any late-occurring terms (Matsubara et al., 2000) uses pattern matching on what would today be considered a small data set to predict English verbs for Japanese to English simultaneous mt.

Incorporating verb predictions into the translation process is a significant component of our framework, though n-gram models strongly prefer highly frequent verbs. Verb prediction might be improved by applying the insights from psycholinguistics. Ferreira (2000) argues that verb lemmas are required early in sentence production—prior to the first noun phrase argument—and that multiple possible syntactic hypotheses are maintained in parallel as the sentence is produced. Schriefers et al. (1998) argues that, in simple German sentences, non-initial verbs do not need lemma planning at all. Momma et al. (2014), investigating these prior claims, argues that the abstract relationship between the internal arguments and verbs triggers selective verb planning.

9 Conclusion and Future Work

Creating an effective simultaneous translation system for sov to svo languages requires not only translating partial sentences, but also effectively predicting a sentence's verb. Both elements of the system require substantial refinement before they are usable in a real-world system.

Replacing our idealized translation system is the most challenging and most important next step. Supporting multiple translation hypotheses and incremental decoding (Sankaran
et al., 2010) would improve both the efficiency and effectiveness of our system. Using data from human translators (Shimizu et al., 2014) could also add richer strategies for simultaneous translation: passive constructions, reordering, etc.

Verb prediction also can be substantially improved, both in its scope in the system and in how we predict verbs. Verb-final languages also often place verbs at the end of clauses, and also predicting these verbs would improve simultaneous translation, enabling its effective application to a wider range of sentences. Instead of predicting an exact verb early (which is very difficult), predicting a semantically close or a more general verb might yield interpretable translations.

A natural next step is expanding this work to other languages, such as Japanese, which not only has sov word order but also requires tokenization and morphological analysis, perhaps requiring sub-word prediction.

Acknowledgments

We thank the anonymous reviewers, as well as Yusuke Miyao, Naho Orita, Doug Oard, and Sudha Rao for their insightful comments. This work was supported by NSF Grant IIS-1320538. Boyd-Graber is also partially supported by NSF Grant CCF-1018625. Daumé III and He are also partially supported by NSF Grant IIS-0964681. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the view of the sponsor.

References

Pieter Abbeel and Andrew Y. Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the International Conference of Machine Learning.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Association for Computational Linguistics.

Koby Crammer, Alex Kulesza, and Mark Dredze. 2013. Adaptive regularization of weight vectors. Machine Learning, 91(2):155–187.

Hal Daumé III, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning Journal (MLJ).

Fernanda Ferreira. 2000. Syntax in language production: An approach using tree-adjoining grammars. Aspects of Language Production, pages 291–330.

Christian Fügen. 2008. A system for simultaneous translation of lectures and speeches. Ph.D. thesis, KIT-Bibliothek.

Tomoki Fujita, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2013. Simple, lexicalized choice of translation timing for simultaneous speech translation. In INTERSPEECH.

Francesca Gaiba. 1998. The Origins of Simultaneous Interpretation: The Nuremberg Trial. University of Ottawa Press.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the Association for Computational Linguistics.

Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for n-gram language modeling. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE.

Philipp Koehn. 2000. German-English parallel corpus "de-news".

John Langford and Bianca Zadrozny. 2005. Relating reinforcement learning performance to classification performance. In Proceedings of the International Conference of Machine Learning.

Shigeki Matsubara, Keiichi Iwashima, Nobuo Kawaguchi, Katsuhiko Toyama, and Yasuyoshi Inagaki. 2000. Simultaneous Japanese-English interpretation based on early prediction of English verb. In Symposium on Natural Language Processing.

Hideki Mima, Hitoshi Iida, and Osamu Furuse. 1998a. Simultaneous interpretation utilizing example-based incremental transfer. In Proceedings of the 17th International Conference on Computational Linguistics, Volume 2, pages 855–861. Association for Computational Linguistics.

Hideki Mima, Hitoshi Iida, and Osamu Furuse. 1998b. Simultaneous interpretation utilizing example-based incremental transfer. In Proceedings of the Association for Computational Linguistics.

Shota Momma, Robert Slevc, and Colin Phillips. 2014. The timing of verb selection in English active and passive sentences.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Yusuke Oda, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Optimizing segmentation strategies for simultaneous speech translation. In Proceedings of the Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Association for Computational Linguistics.

Adam Pauls and Dan Klein. 2011. Faster and smaller n-gram language models. In Proceedings of the Association for Computational Linguistics.

Brock Pytlik and David Yarowsky. 2006. Machine translation for languages lacking bitext via multilingual gloss transduction. In 5th Conference of the Association for Machine Translation in the Americas (AMTA).

Uwe Quasthoff, Matthias Richter, and Christian Biemann. 2006. Corpus portal for search in monolingual corpora. In International Language Resources and Evaluation, pages 1799–1802.

Siegfried Ramler and Paul Berry. 2009. Nuremberg and Beyond: The Memoirs of Siegfried Ramler from 20th Century Europe to Hawai'i. Booklines Hawaii Limited.

Koichiro Ryu, Shigeki Matsubara, and Yasuyoshi Inagaki. 2006. Simultaneous English-Japanese spoken language translation based on incremental dependency parsing and transfer. In Proceedings of the Association for Computational Linguistics.

Koichiro Ryu, Shigeki Matsubara, and Yasuyoshi Inagaki. 2012. Alignment-based translation unit for simultaneous Japanese-English spoken dialogue translation. In Innovations in Intelligent Machines–2, pages 33–44. Springer.

Akiko Sakamoto, Nayuko Watanabe, Satoshi Kamatani, and Kazuo Sumita. 2013. Development of a simultaneous interpretation system for face-to-face services and its evaluation experiment in real situation.

Baskaran Sankaran, Ajeet Grewal, and Anoop Sarkar. 2010. Incremental decoding for phrase-based statistical machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation.

H. Schriefers, E. Teruel, and R. M. Meinshausen. 1998. Producing simple sentences: Results from picture–word interference experiments. Journal of Memory and Language, 39(4):609–632.

Hiroaki Shimizu, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Collection of a simultaneous translation corpus for comparative analysis. In International Language Resources and Evaluation.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas.

Umar Syed, Michael Bowling, and Robert E. Schapire. 2008. Apprenticeship learning using linear programming. In Proceedings of the International Conference of Machine Learning.

Christoph Tillmann, Stephan Vogel, Hermann Ney, and Alex Zubiaga. 1997. A DP-based search using monotone alignments in statistical translation. In Proceedings of the Association for Computational Linguistics.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Conference of the North American Chapter of the Association for Computational Linguistics, pages 173–180.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the 16th International Conference on Computational Linguistics (COLING).

Wolfgang Wahlster. 2000. Verbmobil: Foundations of Speech-to-Speech Translation. Springer.
