Don't Until The Final Verb Wait: Reinforcement Learning For Simultaneous Machine Translation
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1342–1352, October 25–29, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics
ich bin mit dem Zug nach Ulm gefahren
I am with the train to Ulm traveled
I ( . . . waiting . . . ) traveled by train to Ulm

Figure 1: An example of translating from a verb-final language to English. The verb, in bold, appears at the end of the sentence, preventing coherent translations until the final source word is revealed.

translation system that produces simultaneous translations. We show that our data-driven system can successfully predict unseen parts of the sentence, learn when to trust them, and outperform strong baselines (Section 7).

While some prior research has approached the problem of simultaneous translation—we review these systems in more detail in Section 8—no current model learns when to definitively begin translating chunks of an incomplete sentence. Finally, in Section 9, we discuss the limitations of our system: it only uses the most frequent source language verbs, it only applies to sentences with a single main verb, and it uses an idealized translation system. However, these limitations are not insurmountable; we describe how a more robust system can be assembled from these components.

To compare the system to a human translator in a decision-making process, the state is akin to the translator's cognitive state. At any given time, we have knowledge (observations) and beliefs (predictions) with varying degrees of certainty: that is, the state contains the revealed words x1:t of a sentence; the state also contains predictions about the remainder of the sentence: we predict the next word in the sentence and the final verb.

More formally, we have a prediction at time t of the next source language word that will appear, n_{t+1}^{(t)}, and for the final verb, v^{(t)}. For example, given the partial observation "ich bin mit dem", the state might contain a prediction that the next word, n_{t+1}^{(t)}, will be "Zug" and that the final verb v^{(t)} will be "gefahren". We discuss the mechanics of next-word and verb prediction further in Section 4; for now, consider these black boxes which, after observing every new source word xt, make predictions of future words in the source language. This representation of the state allows for a richer set of actions, described below, permitting simultaneous translations that outpace the source language input2 by predicting the future.
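The state described above can be rendered as a small data structure. This is a minimal sketch of the description in the text, not the paper's implementation; the field names `revealed`, `next_word`, and `final_verb`, and the confidence values below, are our own illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class TranslationState:
    """State after observing t source words: the revealed prefix x_{1:t}
    plus predictions (with confidences) for the next word and final verb."""
    revealed: List[str] = field(default_factory=list)   # x_{1:t}
    next_word: Optional[Tuple[str, float]] = None       # (n_{t+1}^{(t)}, confidence)
    final_verb: Optional[Tuple[str, float]] = None      # (v^{(t)}, confidence)

    def observe(self, word: str) -> None:
        """Reveal one more source word; predictions are then stale and cleared."""
        self.revealed.append(word)
        self.next_word = None
        self.final_verb = None

# The running example: after "ich bin mit dem", the predictors might guess
# "Zug" as the next word and "gefahren" as the final verb.
state = TranslationState()
for w in "ich bin mit dem".split():
    state.observe(w)
state.next_word = ("Zug", 0.6)        # illustrative confidences, not from the paper
state.final_verb = ("gefahren", 0.4)
```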
(Figure 2 panels: (1) "Mit dem Zug", Wait; (2) "Mit dem Zug bin ich", Wait; (3) "Mit dem Zug bin ich nach", Predict "… gefahren …"; (4) Commit, output "I traveled by train"; (5) "Mit dem Zug bin ich nach Ulm … gefahren …", Commit, output "I traveled to Ulm by train"; (6) full input "Mit dem Zug bin ich nach Ulm gefahren.", output fixed.)
Figure 2: A simultaneous translation from source (German) to target (English). The agent
chooses to wait until after (3). At this point, it is sufficiently confident to predict the final verb
of the sentence (4). Given this additional information, it can now begin translating the sentence
into English, constraining future translations (5). As the rest of the sentence is revealed, the
system can translate the remainder of the sentence.
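The observe/act loop that Figure 2 depicts can be sketched as follows. This is a simplified rendering of the episode, not the paper's system; `toy_policy`, the verb predictor, and the translator are hypothetical stand-ins.

```python
def run_episode(source_words, policy, predict_verb, translate):
    """Reveal source words one at a time; after each observation the policy
    chooses WAIT, PREDICT (guess the final verb), or COMMIT (emit output)."""
    revealed, verb_guess, output = [], None, []
    for word in source_words:
        revealed.append(word)
        action = policy(revealed, verb_guess)
        if action == "PREDICT":
            verb_guess = predict_verb(revealed)
        elif action == "COMMIT":
            # Translate the observed prefix plus the predicted verb, if any.
            output = translate(revealed + ([verb_guess] if verb_guess else []))
        # On WAIT, nothing happens until the next source word arrives.
    return output

# Toy stand-ins: wait for three words, predict the verb, then commit.
def toy_policy(revealed, verb_guess):
    if len(revealed) >= 3 and verb_guess is None:
        return "PREDICT"
    return "COMMIT" if verb_guess else "WAIT"

output = run_episode("mit dem zug bin ich".split(), toy_policy,
                     lambda revealed: "gefahren",
                     lambda source: ["I", "traveled", "by", "train"])
```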
Next Word: The next word action takes a prediction of the next source word and produces an updated translation based on that prediction, i.e., appending the predicted word to the source sentence and translating the new sentence.

Verb: Our system can also predict the source sentence's final verb (the last word in the sentence). When our system takes the verb action, it uses its verb prediction to update the translation, placing the predicted verb at the end of the source sentence.

The system must then take an action given information about the current state. The action will result in receiving and translating more source words, transitioning the system to the next state. In the example, for the first few source-language words, the translator lacks the confidence to produce any output due to insufficient information in the state. However, after State 3, the state shows high confidence in the predicted verb "gefahren". Combined with the German input it has observed, the system is sufficiently confident to act on that prediction to produce an English translation.
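Concretely, both prediction actions just extend the observed source prefix before re-invoking the translator. The sketch below uses a toy word-for-word gloss as a stand-in for a real batch MT system; the function names and the `GLOSS` table are our own illustration.

```python
# A toy word-for-word "translator" standing in for a real batch MT system.
GLOSS = {"ich": "I", "bin": "am", "mit": "with", "dem": "the",
         "Zug": "train", "gefahren": "traveled"}

def translate(source_words):
    return [GLOSS.get(w, w) for w in source_words]

def next_word_action(revealed, predicted_next):
    # NEXT WORD: append the predicted next source word, then retranslate.
    return translate(revealed + [predicted_next])

def verb_action(revealed, predicted_verb):
    # VERB: place the predicted final verb at the sentence-final position
    # (with only a prefix observed, that is the current end), then retranslate.
    return translate(revealed + [predicted_verb])

prefix = ["ich", "bin", "mit", "dem"]
next_word_action(prefix, "Zug")      # → ["I", "am", "with", "the", "train"]
verb_action(prefix, "gefahren")      # → ["I", "am", "with", "the", "traveled"]
```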
When an action produces output, the observed input (plus extra predictions in the case of next-word or verb) is passed into the translation system. That system then produces a complete translation of its input fragment. Any new words—i.e., words whose target index is greater than the length of any previous translation—are appended to the previous translation.3 Table 1 shows an example of forming these consensus translations.

Now that we have defined how states evolve based on our system's actions, we need to know how to select which actions to take. Eventually, we will formalize this as a learned policy (Section 5) that maps from states to actions. First, however, we need to define a reward that measures how "good" an action is.

3 Objective: What is a good simultaneous translation?

Good simultaneous translations must optimize two objectives that are often at odds: producing translations that are, in the end, accurate, and producing them in pieces that are presented expeditiously. While there are existing automated metrics for assessing translation quality (Papineni et al., 2002; Banerjee and Lavie, 2005; Snover et al., 2006), these must be modified to find the necessary compromise between translation quality and expeditiousness. That is, a good metric for simultaneous translation must achieve a balance between translating chunks early and translating accurately. All else being equal, maximizing either goal in isolation is trivial: for accurate translations, use a batch system and wait until the sentence is complete, translating it all at once; for a maximally expeditious translation, create monotone translations, translating each word as it appears, as in Tillmann et al. (1997) and Pytlik and Yarowsky (2006). The former is not simultaneous at all; the latter is mere word-for-word replacement and results in awkward, often unintelligible translations for distant language pairs.

Once we have predictions, however, we have an expanded array of possibilities. On one extreme, we can imagine a psychic translator—one that can completely translate an imagined sentence after one word is uttered—as an unobtainable system. On the other extreme is a standard batch translator, which waits until it has access to the utterer's complete sentence before translating anything.

Again, we argue that a system can improve on this by predicting unseen parts of the sentence to find a better tradeoff between these conflicting goals. However, to evaluate and optimize such a system, we must measure where a system falls on the continuum of accuracy versus expeditiousness.

Consider partial translations in a two-dimensional space, with time (quantized by the number of source words seen) increasing from left to right on the x axis and the BLEU score (including brevity penalty against the reference length) on the y axis. At each point in time, the system may add to the consensus translation, changing the precision (Figure 3). Like an ROC curve, a good system will be high and to the left, optimizing the area under the curve: the ideal system would produce points as high as possible immediately. A translation which is, in the end, accurate, but which is less expeditious, would accrue its score more slowly but outperform a similarly expeditious system which nevertheless translates poorly.

An idealized psychic system achieves this, claiming all of the area under the curve, as it would have a perfect translation instantly, having no need even to wait for future input.4 A batch system has only a narrow (but tall) sliver to the right, since it translates nothing until all of the words are observed.

Formally, let Q be the score function for a partial translation, x the sequentially revealed source words x1, x2, . . . , xT from time step 1 to T, and y the partial translations y1, y2, . . . , yT, where T is the length of the source language input. Each incremental translation yt has a BLEU-n score with respect to a reference r. We apply the usual BLEU brevity penalty to all the incremental translations (initially empty) to

3 Using constrained decoding to enforce consistent translation prefixes would complicate our method but is an appealing extension.
4 One could reasonably argue that this is not ideal: a fluid conversation requires the prosody and timing between source and target to match exactly. Thus, a psychic system would provide too much information too quickly, making information exchange unnatural. However, we take the information-centric approach: more information faster is better.
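The area-under-the-curve intuition above can be sketched numerically: if we record a quality score for the partial translation after each revealed source word, the (normalized) sum of those scores rewards early gains at every later step. This is a minimal sketch of the intuition, not the paper's LBLEU metric; the step scores below are illustrative, not BLEU values.

```python
def time_quality_area(step_scores):
    """Area under the quality-vs-time curve, normalized by sentence length.
    step_scores[t] is the quality of the partial translation after t+1 words."""
    return sum(step_scores) / len(step_scores)

# A "psychic" system is perfect immediately; a batch system only scores
# once the whole sentence has been observed.
psychic_scores = [1.0, 1.0, 1.0, 1.0]
batch_scores = [0.0, 0.0, 0.0, 1.0]
```

Under this measure the psychic system attains the full area (1.0) while the batch system, despite a perfect final translation, attains only 0.25, which is exactly the "narrow but tall sliver" described above.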
Pos  Input                       Intermediate                    Consensus
1
2    Er                          He1                             He1
3    Er wurde gestaltet          It1 was2 designed3              He1 was2 designed3
4                                It1 was2 designed3              He1 was2 designed3
5    Er wurde gestern renoviert  It1 was2 renovated3 yesterday4  He1 was2 designed3 yesterday4

Table 1: How intermediate translations are combined into a consensus translation. Incorrect translations (e.g., "he" for an inanimate object in position 3) and incorrect predictions (e.g., incorrectly predicting the verb gestaltet in position 5) are kept in the consensus translation. When no translation is made, the consensus translation remains static.
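The append-only merging that Table 1 illustrates can be sketched directly from the rule above: only words whose target index exceeds the length of the current consensus are appended, and earlier consensus words are never revised. This is a minimal sketch of that rule, with our own function name.

```python
def update_consensus(consensus, intermediate):
    """Append only the words of the intermediate translation that extend past
    the current consensus; earlier consensus words are never revised."""
    if len(intermediate) > len(consensus):
        return consensus + intermediate[len(consensus):]
    return list(consensus)

# Mirrors Table 1: "He" and "designed" (from the wrongly predicted verb) are
# kept even after the intermediate translation switches to "renovated".
c = []
c = update_consensus(c, ["He"])
c = update_consensus(c, ["It", "was", "designed"])
c = update_consensus(c, ["It", "was", "renovated", "yesterday"])
# c is now ["He", "was", "designed", "yesterday"]
```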
where x_{i−n+1:i−1} is the (n−1)-gram context. To narrow the search space, we consider only the 100 most frequent final verbs, where a "final verb" is defined as the sentence-final sequence of verbs and particles as detected by a German part-of-speech tagger (Toutanova et al., 2003).6

5 Learning a Policy

We have a framework (states and actions) for simultaneous machine translation and a metric for assessing simultaneous translations. We now describe the use of reinforcement learning to learn a policy, a mapping from states to actions, that maximizes the LBLEU reward.

We use imitation learning (Abbeel and Ng, 2004; Syed et al., 2008): given an optimal sequence of actions, learn a generalized policy that maps states to actions. This can be viewed as cost-sensitive classification (Langford and Zadrozny, 2005): a state is represented as a feature vector, the loss corresponds to the quality of the action, and the output of the classifier is the action that should be taken in that state. In this section, we explain each of these components: generating an optimal policy, representing states through features, and learning a policy that can generalize to new sentences.

5.2 Feature Representation

We want a feature representation that will allow a classifier to generalize beyond the specific examples on which it is trained. We use several general classes of features: features that describe the input, features that describe the possible translations, and features that describe the quality of the predictions.

Input: We include both a bag-of-words representation of the input sentence and the most recent word and bigram, to model word-specific effects. We also use a feature that encodes the length of the source sentence.

Prediction: We include the identity of the predicted verb and next word, as well as their respective probabilities under the language models that generate the predictions. If the model is confident in a prediction, the classifier can learn to trust that prediction more.

Translation: In addition to the state, we include features derived from the possible actions the system might take. These include a bag-of-words representation of the target sentence, the score of the translation (decreasing the score is undesirable), the score of the current consensus translation, and the difference between the current and potential translation scores.
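The three feature classes above can be sketched as a single extraction function. This is our own illustrative rendering, not the paper's feature code; the dictionary field names and feature keys are assumptions.

```python
def extract_features(state):
    """Map a state (observed words, predictions, translation scores) to a
    sparse feature dictionary for a cost-sensitive classifier."""
    feats = {}
    words = state["source_words"]
    for w in words:                                    # input: bag of words
        feats[f"src_bow={w}"] = 1.0
    feats[f"last_word={words[-1]}"] = 1.0              # most recent word
    if len(words) >= 2:                                # most recent bigram
        feats[f"last_bigram={words[-2]}_{words[-1]}"] = 1.0
    feats["src_length"] = float(len(words))            # source length
    # Prediction features: identity and language-model probability.
    verb, p_verb = state["verb_prediction"]
    feats[f"pred_verb={verb}"] = 1.0
    feats["p_verb"] = p_verb
    # Translation features: potential vs. current consensus score.
    feats["score_delta"] = state["potential_score"] - state["consensus_score"]
    return feats

feats = extract_features({
    "source_words": ["ich", "bin", "mit", "dem"],
    "verb_prediction": ("gefahren", 0.4),     # illustrative values
    "potential_score": 0.7,
    "consensus_score": 0.5,
})
```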
policy's action with some probability. In the first iteration, we have π0 = π*. Mixing in the learned policy allows it to slowly depart from the oracle policy. As it trains on these no-longer-perfect state trajectories, the state distribution at test time will be more consistent with the states used in training.

SEARN learns the policy by training a cost-sensitive classifier. Besides providing the optimal action, the oracle must also assign a cost to an action,

C(a_t, x) ≡ Q(x, π*(x_t)) − Q(x, a_t(x_t)),  (4)

where a_t(x_t) represents the translation outcome of taking action a_t. The cost is the regret of not taking the optimal action.

6 Translation System

The focus of this work is to show that, given an effective batch translation system and predictions, we can learn a policy that will turn this into a simultaneous translation system. Thus, to separate translation errors from policy errors, we perform experiments with a nearly optimal translation system we call an omniscient translator.

More realistic translation systems will naturally lower the objective function, often in ways that make it difficult to show that we can effectively predict the verbs in verb-final source languages. For instance, German-to-English translation systems often drop the verb; thus, predicting a verb that will be ignored by the translation system will not be effective.

The omniscient translator translates a source sentence correctly once it has been fed the appropriate source words as input. There are two edge cases: empty input yields an empty output, while a complete, correct source sentence returns the correct, complete translation. Intermediate cases—where the input is either incomplete or incorrect—require using an alignment. The omniscient translator assumes as input a reference translation r, a partial source language input x1:t, and a corresponding partial output y. In addition, the omniscient translator assumes access to an alignment between r and x. In practice, we use the HMM aligner (Vogel et al., 1996; Och and Ney, 2003).

We first consider incomplete but correct inputs. Let y = τ(x1:t) be the translator's output given a partial source input x1:t with translation y. Then, τ(x1:t) produces all target words yj if there is a source word xi in the input aligned to those words—i.e., (i, j) ∈ a_{x,y}—and all preceding target words can be translated. (That translations must be contiguous is a natural requirement for human recipients of translations.) In the case where yj is unaligned, the closest aligned target word to yj that has a corresponding alignment entry is aligned to xi; then, if xi is present in the input, yj appears in the output. Thus, our omniscient translation system will always produce the correct output given the correct input.

However, our learned policy can make wrong predictions, which can produce partial translations y that do not match the reference. In this event, an incorrect source word x̃i produces incorrect target words ỹj, for all j : (i, j) ∈ a_{x,y}. These ỹj are sampled from the IBM Model 1 lexical probability table multiplied by the source language model, ỹj ∼ Mult(θ_{x̃i}) pLM(x̃).8 Thus, even if we predict the correct verb using a next word action, it will be in the wrong position and will thus generate a translation from the lexical probabilities. Since translations based on Model 1 probabilities are generally inaccurate, the omniscient translator will do very well when given correct input but will produce very poor translations otherwise.

7 Experiments

In this section, we describe our experimental framework and results. From aligned data, we derive an omniscient translator. We use monolingual data in the source language to train the verb predictor and the next-word predictor. From these features, we compute an optimal policy from which we train a learned policy.

7.1 Data sets

For translation model and policy training, we use data from the German-English parallel "de-news" corpus of radio broadcast news (Koehn, 2000), which we lower-cased and stripped of

8 If a policy chooses an incorrect unaligned word, it has no effect on the output. Alignments are position-specific, so "wrong" refers to position and type.
punctuation. A total of 48,601 sentence pairs are randomly selected for building our system. This results in a data set of 4401 training sentences and 1832 test sentences from the de-news data. We did this to narrow the search space (from thousands of possible, but mostly very infrequent, verbs).

We used 1 million words of news text from the Leipzig Wortschatz (Quasthoff et al., 2006) German corpus to train 5-gram language models to predict a verb from the 100 most frequent verbs.

For next-word prediction, we use the 18,345 most frequent German bigrams from the training set to provide a set of candidates in a language model trained on the same set. We use frequent bigrams to reduce the computational cost of finding the completion probability of the next word.

7.2 Training Policies

In each iteration of SEARN, we learn a multi-class classifier to implement the policy. The specific learning algorithm we use is AROW (Crammer et al., 2013). In the complete version of SEARN, the cost of each action is calculated as the highest expected reward starting at the current state minus the actual roll-out reward. However, computing the full roll-out reward is computationally very expensive. We thus use a surrogate binary cost: if the predicted action is the same as the optimal action, the cost is 0; otherwise, the cost is 1. We then run SEARN for five iterations. Results on the development data indicate that continuing for more iterations yields no benefit.

7.3 Policy Rewards on Test Set

In Figure 4, we show performance of the optimal policy vis-à-vis the learned policy, as well as the two baseline policies: the batch policy and the monotone policy. The x-axis is the percentage of the source sentence seen by the model, and the y-axis is a smoothed average of the reward as a function of the percentage of the sentence revealed. The monotone policy's performance is close to the optimal policy for the first half of the sentence, as German and English have similar word order, though they diverge toward the end. Our learned policy outperforms the monotone policy toward the end and of course outperforms the batch policy throughout the sentence.

Figure 4: The final reward of policies on German data. Our policy outperforms all baselines by the end of the sentence.

Figure 5 shows counts of actions taken by each policy. The batch policy always commits at the end. The monotone policy commits at each position. Our learned policy has an action distribution similar to that of the optimal policy, but is slightly more cautious.

Figure 5: Histogram of actions (COMMIT, WAIT, NEXT WORD, VERB) taken by the policies.

7.4 What Policies Do

Figure 6 shows a policy that, predicting incorrectly, still produces sensible output. The policy correctly intuits that the person discussed
(Figure 6 panels: given the input "bundesumweltministerin merkel", a next word action yields the output "federal minister of the environment angela merkel"; a verb action wrongly predicts "gezeigt", yielding "… shown the draft"; after the full input "bundesumweltministerin merkel hat den entwurf eines umweltpolitischen programms vorgestellt", the committed output is "federal minister of the environment angela merkel shown the draft of an ecopolitical program".)

Figure 6: An imperfect execution of a learned policy. Despite choosing the wrong verb "gezeigt" (showed) instead of "vorgestellt" (presented), the translation retains the meaning.

is Angela Merkel, who was the environmental minister at the time, but the policy uses an incorrectly predicted verb. Because of our poor translation model (Section 6), it renders this word as "shown", which is a poor translation. However, it is still comprehensible, and the overall policy is similar to what a human would do: intuit the subject of the sentence from early clues and use a more general verb to stand in for a more specific one.

8 Related Work

Just as MT was revolutionized by statistical learning, we suspect that simultaneous MT will similarly benefit from this paradigm, both from a systematic system for simultaneous translation and from a framework for learning how to incorporate predictions.

Simultaneous translation has been dominated by rule- and parse-based approaches (Mima et al., 1998a; Ryu et al., 2006). In contrast, although Verbmobil (Wahlster, 2000) performs incremental translation using a statistical MT module, its incremental decision-making module is rule-based. Other recent approaches in speech-based systems focus on waiting until a pause to translate (Sakamoto et al., 2013) or using word alignments (Ryu et al., 2012) between languages to determine optimal translation units.

Unlike our work, which focuses on prediction and learning, previous strategies for dealing with SOV-to-SVO translation use rule-based methods such as passivization (Mima et al., 1998b) to buy time for the translator to hear more information in a spoken context, or use phrase table and reordering probabilities to decide where to translate with less delay (Fujita et al., 2013). Oda et al. (2014) is the most similar to our work on the translation side. They frame word segmentation as an optimization problem, using a greedy search and dynamic programming to find segmentation strategies that maximize an evaluation measure. However, unlike our work, the direction of translation was from SVO to SVO, obviating the need for verb prediction. Simultaneous translation is more straightforward for languages with compatible word orders, such as English and Spanish (Fügen, 2008).

To our knowledge, the only attempt to specifically predict verbs or any late-occurring terms (Matsubara et al., 2000) uses pattern matching on what would today be considered a small data set to predict English verbs for Japanese-to-English simultaneous MT.

Incorporating verb predictions into the translation process is a significant component of our framework, though n-gram models strongly prefer highly frequent verbs. Verb prediction might be improved by applying insights from psycholinguistics. Ferreira (2000) argues that verb lemmas are required early in sentence production—prior to the first noun phrase argument—and that multiple possible syntactic hypotheses are maintained in parallel as the sentence is produced. Schriefers et al. (1998) argues that, in simple German sentences, non-initial verbs do not need lemma planning at all. Momma et al. (2014), investigating these prior claims, argues that the abstract relationship between internal arguments and verbs triggers selective verb planning.

9 Conclusion and Future Work

Creating an effective simultaneous translation system for SOV-to-SVO languages requires not only translating partial sentences, but also effectively predicting a sentence's verb. Both elements of the system require substantial refinement before they are usable in a real-world system.

Replacing our idealized translation system is the most challenging and most important next step. Supporting multiple translation hypotheses and incremental decoding (Sankaran
et al., 2010) would improve both the efficiency and effectiveness of our system. Using data from human translators (Shimizu et al., 2014) could also add richer strategies for simultaneous translation: passive constructions, reordering, etc.

Verb prediction can also be substantially improved, both in its scope within the system and in how we predict verbs. Verb-final languages also often place verbs at the end of clauses, and predicting these verbs as well would improve simultaneous translation, enabling its effective application to a wider range of sentences. Instead of predicting an exact verb early (which is very difficult), predicting a semantically close or a more general verb might yield interpretable translations.

A natural next step is expanding this work to other languages, such as Japanese, which not only has SOV word order but also requires tokenization and morphological analysis, perhaps requiring sub-word prediction.

References

Pieter Abbeel and Andrew Y. Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the International Conference of Machine Learning.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Association for Computational Linguistics.

Koby Crammer, Alex Kulesza, and Mark Dredze. 2013. Adaptive regularization of weight vectors. Machine Learning, 91(2):155–187.

Hal Daumé III, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning Journal (MLJ).

Fernanda Ferreira. 2000. Syntax in language production: An approach using tree-adjoining grammars. Aspects of Language Production, pages 291–330.

Christian Fügen. 2008. A System for Simultaneous Translation of Lectures and Speeches. Ph.D. thesis, KIT-Bibliothek.

Tomoki Fujita, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2013. Simple, lexicalized choice of translation timing for simultaneous speech translation. In INTERSPEECH.

Francesca Gaiba. 1998. The Origins of Simultaneous Interpretation: The Nuremberg Trial. University of Ottawa Press.

Kenneth Heafield, Ivan Pouzyrevsky, Jonathan H. Clark, and Philipp Koehn. 2013. Scalable modified Kneser-Ney language model estimation. In Proceedings of the Association for Computational Linguistics.

Hideki Mima, Hitoshi Iida, and Osamu Furuse. 1998a. Simultaneous interpretation utilizing example-based incremental transfer. In Proceedings of the 17th International Conference on Computational Linguistics, pages 855–861. Association for Computational Linguistics.

Hideki Mima, Hitoshi Iida, and Osamu Furuse. 1998b. Simultaneous interpretation utilizing example-based incremental transfer. In Proceedings of the Association for Computational Linguistics.

Shota Momma, Robert Slevc, and Colin Phillips. 2014. The timing of verb selection in English active and passive sentences.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Yusuke Oda, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Optimizing segmentation strategies for simultaneous speech translation. In Proceedings of the Association for Computational Linguistics, June.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Association for Computational Linguistics.

Adam Pauls and Dan Klein. 2011. Faster and smaller n-gram language models. In Proceedings of the Association for Computational Linguistics.

Brock Pytlik and David Yarowsky. 2006. Machine translation for languages lacking bitext via multilingual gloss transduction. In 5th Conference of the Association for Machine Translation in the Americas (AMTA), August.

Uwe Quasthoff, Matthias Richter, and Christian Biemann. 2006. Corpus portal for search in monolingual corpora. In International Language Resources and Evaluation, pages 1799–1802.

Siegfried Ramler and Paul Berry. 2009. Nuremberg and Beyond: The Memoirs of Siegfried Ramler from 20th Century Europe to Hawai'i. Booklines Hawaii Limited.

Koichiro Ryu, Shigeki Matsubara, and Yasuyoshi Inagaki. 2006. Simultaneous English-Japanese spoken language translation based on incremental dependency parsing and transfer. In Proceedings of the Association for Computational Linguistics.

Koichiro Ryu, Shigeki Matsubara, and Yasuyoshi Inagaki. 2012. Alignment-based translation unit for simultaneous Japanese-English spoken dialogue translation. In Innovations in Intelligent Machines–2, pages 33–44. Springer.

Akiko Sakamoto, Nayuko Watanabe, Satoshi Kamatani, and Kazuo Sumita. 2013. Development of a simultaneous interpretation system for face-to-face services and its evaluation experiment in real situation.

Baskaran Sankaran, Ajeet Grewal, and Anoop Sarkar. 2010. Incremental decoding for phrase-based statistical machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation.

H. Schriefers, E. Teruel, and R. M. Meinshausen. 1998. Producing simple sentences: Results from picture–word interference experiments. Journal of Memory and Language, 39(4):609–632.

Hiroaki Shimizu, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Collection of a simultaneous translation corpus for comparative analysis. In International Language Resources and Evaluation.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas.

Umar Syed, Michael Bowling, and Robert E. Schapire. 2008. Apprenticeship learning using linear programming. In Proceedings of the International Conference of Machine Learning.

Christoph Tillmann, Stephan Vogel, Hermann Ney, and Alex Zubiaga. 1997. A DP-based search using monotone alignments in statistical translation. In Proceedings of the Association for Computational Linguistics.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Conference of the North American Chapter of the Association for Computational Linguistics, pages 173–180.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the 16th International Conference on Computational Linguistics (COLING).

Wolfgang Wahlster. 2000. Verbmobil: Foundations of Speech-to-Speech Translation. Springer.