SPEECH-TO-SPEECH TRANSLATION:
A MASSIVELY PARALLEL MEMORY-BASED APPROACH

Hiroaki Kitano

THE KLUWER INTERNATIONAL SERIES
IN ENGINEERING AND COMPUTER SCIENCE
LIST OF FIGURES IX
PREFACE Xv
1 INTRODUCTION 1
1.1 Speech-to-Speech Dialogue Translation 1
1.2 Why Is Spoken Language Translation So Difficult? 4
1.3 A Brief History of Speech Translation Related Fields 8
5 DMSNAP: AN IMPLEMENTATION ON
THE SNAP SEMANTIC NETWORK ARRAY
PROCESSOR 115
5.1 Introduction 115
5.2 SNAP Architecture 116
5.3 Philosophy Behind DMSNAP 119
5.4 Implementation of DMSNAP 121
5.5 Linguistic Processing in DMSNAP 125
5.6 Performance 131
5.7 Conclusion 133
8 CONCLUSION 173
8.1 Summary of Contributions 173
8.2 Future Works 175
8.3 Final Remarks 176
Bibliography 177
Index 191
LIST OF FIGURES
Chapter 1
1.1 Process flow of Speech-to-speech translation 2
1.2 Overall process flow of Speech-to-speech dialog translation system 4
Chapter 2
2.1 An example of sentence analysis result 19
2.2 JANUS using the generalized LR parser. 22
2.3 JANUS using the connectionist parser. 23
Chapter 3
3.1 Translation as Analogy 34
3.2 Distribution by Sentence Length 37
3.3 Coverage by Sentence Length 37
3.4 Real space and possible space 41
Chapter 4
4.1 Lexical Nodes for 'Kaigi' and 'Conference' 51
4.2 Grammar using LFG-like notation 52
4.3 Grammar using Semantic-oriented notation 53
4.4 Grammar using mixture of surface string and generalized case 53
4.5 Example of an A-Marker and a P-Marker 55
4.6 Example of a G-Marker and a V-Marker 56
4.7 Movement of P-Markers 58
4.8 Movement of P-Marker on Hierarchical CSCs 58
4.9 Parsing with a small grammar 59
4.10 A simple parsing example. 61
4.11 Examples of Noisy Phoneme Sequences 63
Chapter 5
5.1 SNAP Architecture 117
5.2 Concept Sequence on SNAP 122
5.3 Part of Memory Network 126
5.4 Parsing Performance of DMSNAP 132
Chapter 6
6.1 Syntactic Recognition Time vs. Sentence Length 140
6.2 Performance Improvement by Learning New Cases 142
6.3 Training Sentences vs. Syntactic Patterns 144
6.4 Overall Architecture of the Parsing Part 145
6.5 Network for 'about' and its phoneme sequence 146
6.6 Parsing Time vs. Length of Input 149
6.7 Parsing Time vs. KB Size 150
6.8 Number of Active Hypotheses per Processor 152
Chapter 7
7.1 Overall Architecture 159
7.2 Abstraction-based Word Distance Definition 162
7.3 DP-Matching of Input and Examples 164
7.4 Multiple Match between Examples 165
LIST OF TABLES
Chapter 1
1.1 Major speech recognition systems 9
Chapter 2
2.1 A portion of a confusion matrix 14
2.2 Examples of Sentences Processed by SpeechTrans 16
2.3 Performance of the SL-TRANS 19
2.4 Performance of JANUS1 and JANUS2 on N-Best Hypotheses 22
2.5 Performance of JANUS1 and JANUS2 on the First Hypothesis 23
2.6 Performance of the MINDS system 25
Chapter 3
3.1 Knowledge and parallelism involved in the speech translation task 30
3.2 Distribution of the global ill-formedness 41
Chapter 4
4.1 Types of Nodes in the Memory Network 50
4.2 Markers in the Model 55
4.3 Transcript: English to Japanese 96
4.4 Transcript: Japanese to English (1) 97
4.5 Transcript: Japanese to English (2) 97
4.6 Simultaneous interpretation in ΦDMDIALOG 99
Chapter 5
5.1 Execution times for DmSNAP 132
Chapter 6
Chapter 7
7.1 Examples of Translation Pair 159
7.2 A Part of Memory-Base (Morphological tags are omitted) 160
7.3 Examples matched for a simple input 163
7.4 Difference Table 165
7.5 Adaptation Operations 166
7.6 Adaptation for a simple sentence translation 166
7.7 Retrieved Examples 167
7.8 Adaptive Translation Process 168
PREFACE
Massively parallel implementations on IXM2, SNAP-1, and CM-2 have been carried out with different variations of the original model. These massively parallel implementations proved the validity of the approach and demonstrated that real-time speech-to-speech translation is attainable.
It is interesting to see how my ideas have changed and grown as the research has progressed. The original ΦDMDIALOG system reflects my early vision of natural language processing, whereas the updated chapters reflect my more recent thinking. The work is consistent in the sense that memory-based processing and massively parallel computing remain the basis of the model. However, the use of rules has changed drastically.
For me, the work described in this book is an important milestone. The ideas that grew out of this work led me to propose massively parallel artificial intelligence, which is now being recognized as a distinct research field.
ing idea, and Jim Hendler helped me propose massively parallel AI. Members of the Carnegie Mellon research community, James McClelland, David Touretzky, Kai-Fu Lee, Alex Waibel, Carl Pollard, Lori Levin, Sergei Nirenburg, Wayne Ward, and Takeo Kanade, gave me various suggestions on my research and on my thesis. Hitoshi Iida and Akira Kurematsu at ATR Interpreting Telephony Research Laboratories allowed me to use the ATR corpus on which the system operates. The massively parallel implementations would not have been possible without research collaboration with Tetsuya Higuchi and his colleagues at the Electrotechnical Laboratory, and Dan Moldovan and his colleagues at the University of Southern California.
This research has been supported by National Science Foundation grant MIP-9009111, Pittsburgh Supercomputing Center grant IRI-910002P, and a research contract between Carnegie Mellon University and ATR Interpreting Telephony Research Laboratories. NEC Corporation supported my stay at Carnegie Mellon University.
SPEECH-TO-SPEECH
TRANSLATION:
A MASSIVELY PARALLEL
MEMORY-BASED APPROACH
1
INTRODUCTION
[Figure 1.1: Process flow of speech-to-speech translation. The audio signal of the spoken sentence goes through phoneme recognition, lexical activation, parsing, machine translation (producing the meaning of the utterance), generation, and voice synthesis, yielding the translated sentence as an audio signal. Predictions of possible next phonemes and possible next words are fed back from the parser to the recognition stages.]
and must be capable of interpreting very elliptical (where some words are not
said) and ill-formed sentences which may appear in real spoken dialogues. In
addition, an interface between the parser and the speech recognition module
must be well designed so that necessary information is passed to the parser,
and an appropriate feedback is given from the parser to the speech recognition
module in order to improve recognition accuracy. In figure 1.1, we assume that the interface is made at both the phoneme hypothesis and word hypothesis levels, so that predictions made by the parser can be immediately fed back to
the phoneme recognition device. No speech recognition module is capable of
recognizing input speech with perfect accuracy, thus it sometimes provides a
false word sequence as its first choice. However, it is often the case that a cor-
rect word is in the second or third best hypothesis. Thus, phoneme and word
hypotheses given to the parser consist of several competitive phoneme or word
hypotheses, each of which is assigned a probability of being correct. With this mechanism, the accuracy of recognition can be improved, because the parser filters out false first choices of the speech recognition module and selects grammatically and semantically plausible second or third best hypotheses. To implement this mechanism, the parser needs to handle multiple hypotheses in parallel, rather than a single word sequence as seen in text-input machine translation
systems. For the translation scheme, we use an interlingua, i.e., a language-independent representation of the meaning of the sentence, so that translation into multiple
languages can be done efficiently. A generation module needs to be designed
so that appropriate spoken sentences can be generated with correct articula-
tion control. In addition to these functional challenges, it should be noted that
real-time response is a major requirement of the system, because speech-to-speech dialog translation systems would be used for real-time transactions, imposing a far more severe performance challenge than for text-based machine
translation systems.
[Figure 1.2: Overall process flow of the speech-to-speech dialog translation system, translating utterances between Japanese and English and producing the translation in sound.]
Continuous speech is far more complex than isolated word speech, mainly due to (1) unclear word boundaries, (2) co-articulation effects, and (3) poor articulation of function words. Unclear word boundaries significantly increase the search space because of the large number of segmentation candidates. Contextual effects, i.e., changes in articulation due to the existence of other words (previous or following) and due to the placement of stress (emphasis and de-emphasis), account for the co-articulation effects and the poor articulation of function words.

For those who are interested in the details of speech recognition research, there are ample publications in this field, such as [Waibel and Lee, 1990].

Translation The issue of translation has been one of the major problems in the natural language processing and artificial intelligence communities. Generally, translation of sentences between two languages entails various problems, many of which are yet to be solved. The following is a partial list of problems that need to be solved and questions that need to be answered.
In the late 80's, the SPHINX [Lee, 1988] and BYBLOS [Chow et al., 1987] systems were developed, both using Hidden Markov Models (HMMs). SPHINX was extended to the vocabulary-independent VOCIND system [Hon, 1992]. Early in the 90's, we saw the first neural network-based speech recognition systems [Waibel et al., 1989] [Tebelskis et al., 1991].
The dramatic change in MT research came with the ALPAC report [ALPAC, 1966], which strongly criticized the MT research of that era.
The ALPAC report was correct in most of its assessment of the problems and of the limitations of the state-of-the-art technologies against the difficulties. The report pointed out the necessity of more basic and computational research toward understanding the nature of natural language itself. After the ALPAC report, we saw a sharp decline in the MT effort in the United States, though research in other countries, such as Japan and the European countries, continued.
In 1986, the Center for Machine Translation (CMT) was formed at Carnegie Mellon University, which symbolizes the scientific comeback of MT research in the United States. Several state-of-the-art systems, such as KBMT-89 [Goodman and Nirenburg, 1991], SpeechTrans [Tomita et al., 1989], the ΦDMDIALOG system, and others, have been developed at the CMT. The KBMT-89 system is the first system claiming a full interlingua approach, and it employs the knowledge-based machine translation (KBMT) paradigm [Nirenburg et al., 1989a].
In 1985, Hillis proposed the Connection Machine [Hillis, 1985], and the Thinking Machines Corporation was formed to commercially deliver the CM-1 Connection Machine. The CM-1 has 64K 1-bit processors. Thinking Machines Corporation soon announced the upgraded version, the CM-2, with floating-point capability, attaining 28 GFlops in single precision [Thinking Machines Corporation, 1989].
There have been architectural evolutions. The first evolution is the use of more powerful processors. While most of the first-generation massively parallel machines were equipped with fine-grained processors, as seen in the CM-2 (1-bit) and MP-1 (4-bit), newer generation machines employ 32-bit or 64-bit processors. For example, the CM-5 uses a 32-bit SPARC chip and vector processors for each node.
• SpeechTrans (CMU)
• SL-TRANS (ATR)
• JANUS (CMU)
• HMM-LR Parser (CMU)
• MINDS System (CMU)
• Knowledge Based Machine Translation System (CMU)
2.1 SPEECHTRANS
SpeechTrans [Tomita et al., 1989] [Tomabechi et al., 1989] is a Japanese-
English speech-to-speech translation system developed at Center for Machine
Translation, Carnegie Mellon University. It translates spoken Japanese into
English and produces audio outputs. It operates on the doctor-patient domain.
The system consists of four parts:
                 Output
Input    /a/    /o/    /u/    /i/    /e/    /j/    /w/   ...   (I)    (II)
/a/      93.8    1.1    1.3    0      2.7    0      0    ...   0.9    5477
/o/       2.4   84.3    5.8    0      0.3    0      0.6  ...   6.5    7529
/u/       0.3    1.8   79.7    2.4    4.6    0.1    0    ...   9.7    5722
/i/       0.2    0      0.9   91.2    3.5    0.7    0    ...   2.9    6158
/e/       1.9    0      4.5    3.3   89.1    0.1    0    ...   1.1    3248
/j/       0      0      1.1    2.3    2.2   80.1    0.3  ...  11.4    2660
/w/       0.2    5.1    5.8    0.5    0      2.6   56.1  ...  11.2     428
Matsushita's custom speech recognition device [Morii et al., 1985] takes a continuous speech utterance, such as 'atamagaitai' ('I have a headache.'), and produces a noisy phoneme sequence. The speech recognition device has only phonotactic rules, which define possible adjacent phoneme combinations, but does not have any syntactic or semantic knowledge. Also, it produces only a single phoneme sequence, not a phoneme lattice. Therefore, we need some mechanism to make the best guess based solely on the phoneme sequence generated by the speech device. There are three types of errors caused by the speech device: (1) substituted phonemes, (2) deleted phonemes, and (3) inserted phonemes. SpeechTrans uses the confusion matrix to restore a possible
phoneme lattice from the noisy phoneme sequence. The confusion matrix is a
matrix which shows acoustic confusability among phonemes. Table 2.1 shows
an example of the confusion matrix for the Matsushita's speech recognition
device. In the table, (I) denotes the possibility of deleted phonemes; (II) the
number of samples; and (III) the number of times this phoneme has been
spuriously inserted in the given samples. For example, when the actual phoneme is /o/, the device outputs /o/ with a probability of 84.3% and mis-recognizes it as /a/ with a probability of 2.4%, and so forth.
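As an illustration of how such a matrix can drive lattice restoration, the sketch below expands a noisy phoneme sequence into per-position substitution candidates. It is only a sketch under assumed data: the toy probabilities are loosely based on the /a/, /o/, and /u/ columns of Table 2.1, the threshold is invented, and deletion and insertion hypotheses (which SpeechTrans handles in separate parser branches) are not covered.

# Toy confusion data: recognized phoneme -> possible actual phonemes with probabilities.
# Values are illustrative only, not the actual SpeechTrans tables.
CONFUSION = {
    "a": {"a": 0.938, "o": 0.024, "e": 0.019},
    "o": {"o": 0.843, "a": 0.011, "u": 0.018},
    "u": {"u": 0.797, "o": 0.058, "e": 0.045},
}

def substitution_lattice(sequence, threshold=0.01):
    """For each recognized phoneme, list the actual phonemes it may stand for.

    Returns one list of (phoneme, probability) candidates per input position,
    sorted by probability; candidates below the threshold are pruned.
    """
    lattice = []
    for recognized in sequence:
        column = CONFUSION.get(recognized, {recognized: 1.0})
        candidates = [(actual, p) for actual, p in column.items() if p >= threshold]
        lattice.append(sorted(candidates, key=lambda c: -c[1]))
    return lattice

print(substitution_lattice(["a", "o", "u"]))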
Noun --> /w/ /a/ /t/ /a/ /s/ /i/
instead of
Noun --> "watasi".
This rule defines the correct phoneme sequence for the word watashi (I). The SpeechTrans system has two versions of the grammar: one that utilizes modularized syntactic (LFG) and semantic (case-frame) knowledge, merging them at run-time, and another that uses a hand-coded grammar with syntax and semantics precompiled into one pseudo-unification grammar. For demonstrations, SpeechTrans uses the latter grammar due to its run-time speed.
1. A very efficient parsing algorithm, since parsing of the noisy phoneme se-
quence requires much more search than conventional typed sentence pars-
ing.
2. Capability to compute a score for each hypothesis, because SpeechTrans needs to select the most likely hypothesis out of multiple candidates and to prune out unlikely hypotheses during the parse.
The error recovery strategies of the ΦGLR parser are as follows [Nirenburg et al., 1989a]:
Input Translation
atama ga itai I have a headache
me ga itai I have a pain in my eyes
kata ga koru I have a stiff shoulder
asupirin wo nonde kudasai Please take an aspirin
arukuto koshi ga itai When I walk, I have a pain in my lower back
each branch of the parser. Not all phonemes can be altered to any other
phoneme. For example, while /0/ can be mis-recognized as /u/, /i/ can
never be mis-recognized as /0/. This kind of information can be obtained
from a confusion matrix, which we shall discuss in the next subsection.
With the confusion matrix, the parser does not have to exhaustively create
alternate phoneme candidates.
• Inserted phonemes: Each phoneme in a phoneme sequence may be an
extra, and the parser has to consider these possibilities. We have one
branch of the parser consider an extra phoneme by simply ignoring the
phoneme. The parser assumes at most two inserted phonemes can ex-
ist between two real phonemes, and we have found the assumption quite
reasonable and safe.
Table 2.2 shows examples of sentences and their translations in the SpeechTrans system.
The problem with this method, however, is that the language model does not provide feedback to the speech recognition module. It simply gets one phoneme at a time and restores a possible phoneme lattice to be used for parsing. The ΦGLR parser does not make predictions about possible next phonemes. Since the perplexity reduction effect of top-down prediction from the language model is considered to be significant, this shortcoming may be a serious flaw in this approach. This problem led to the development of more tightly coupled models such as the HMM-LR parser, which will be described later.
2.2 SL-TRANS
SL-TRANS [Ogura et al., 1989] is a Japanese effort to develop a speech-to-speech dialogue translation system, undertaken by ATR Interpreting Telephony Research Laboratories. SL-TRANS translates spoken Japanese into English in the ATR conference registration domain.
For the speech recognition module, they have introduced a discrete HMM
phoneme recognition with improvements over the standard model using a new
duration control mechanism, separate vector quantization, and fuzzy vector
quantization.
sional grammar. The grammar for the HMM-LR parser covers the entire domain of the ATR corpus, but its scope is limited to the intra-phrase (bunsetsu) level. Predictions made by the LR parser are passed to the HMM phoneme verifier to verify the existence of predicted phonemes. Multiple hypotheses are created and maintained during the parsing process. With a vocabulary of 1,035 words and trained with 5,240 words, the HMM-LR parser attains an 89% phrase recognition rate.
The analysis module has a phrase structure analysis module and a zero-pronoun
resolution module. The parser is based on an active chart parser and uses a
Typed Feature Structure Propagation (TFSP) method. The parser outputs the feature structure with the highest analysis score. The analysis score is based on syntactic criteria such as phrase structure complexity and degree of left-branching, syntactic-semantic criteria such as missing obligatory elements, and pragmatic criteria such as a pragmatic felicity condition violation penalty.
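The analysis score equation itself has not survived in this copy. Given the variable definitions in the next sentence, a plausible reconstruction (an assumption, not the original formula) is a weighted linear combination:

$$\mathrm{Score}(x) = a_1 S(x) - a_2 N_t(x) - a_3 N_u(x) - a_4 N_p(x)$$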
where S(x) is the speech recognition score, Nt(x) is the number of nodes in the syntactic tree, Nu(x) is the number of unfilled obligatory elements, and Np(x) is the number of pragmatic constraint violations. The weights a1, a2, a3, and a4 are decided experimentally.
Figure 2.1 is an analysis result for the sentence Kaigi eno touroku wo sitainodesuga.

Table 2.3 shows the accuracy of speech recognition, sentence filtering, and translation.
[[reln REQUEST]
 [agen !sp *speaker*]
 [recp *hearer*]
 [manner indirect]
 [obje [[reln SURU]
        [agen !sp]
        [obje [[parm !x <TOUROKU>]
               [restr [[reln MODIFY]
                       [arg1 !x]
                       [sloc <KAIGI>]]]]]]]]
One weakness of the SL-TRANS is, however, that it has two separate parsers:
the predictive LR parser in the HMM-LR module, and the Active Chart Parser
in the language analysis module. This is obviously a redundant architecture,
and changes of grammar made in one of the parsers need to be reflected in the other parser in a consistent manner, which is perhaps a costly process. Also, predictions at the sentence level are not fed back to the speech recognition module because the grammar for the predictive LR parser only deals with the intra-phrase level. However, these problems are relatively trivial - they are matters of design decisions, not theoretical limitations - so they can be remedied easily.
2.3 JANUS
JANUS is yet another speech-to-speech translation system developed at Carnegie Mellon University [Waibel et al., 1991]. Unlike other systems, which largely depend upon statistical methods of speech recognition such as Hidden Markov Models, JANUS is based on a connectionist speech recognition module. The Linked Predictive Neural Network (LPNN) [Tebelskis et al., 1991] offers highly accurate, continuous, large-vocabulary speech recognition. When combined with a statistical bigram grammar with a perplexity of 5 over a vocabulary of 400 words, the LPNN attains 90% sentence accuracy within the top 3 hypotheses.
1. Forward pass: For each input speech frame at time t, the frames at time
t - 1 and t - 2 are fed into all the networks that are linked into this word.
Each of these nets then makes a prediction of frame(t), and the prediction
errors are computed and stored in a matrix.
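The following is a minimal sketch of the forward pass in step 1, assuming each phone-state network is a small linear predictor over the two previous frames; the frame dimension, the number of states, and the error-matrix layout are assumptions for illustration, not the LPNN implementation.

import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, N_STATES, N_FRAMES = 16, 3, 40

# One tiny predictor per linked phone state: predicts frame(t) from frames t-1 and t-2.
predictors = [rng.normal(scale=0.1, size=(FRAME_DIM, 2 * FRAME_DIM))
              for _ in range(N_STATES)]
speech = rng.normal(size=(N_FRAMES, FRAME_DIM))   # stand-in for the input speech frames

# errors[t, s] = squared prediction error of state s for frame t (the stored matrix).
errors = np.full((N_FRAMES, N_STATES), np.inf)
for t in range(2, N_FRAMES):
    context = np.concatenate([speech[t - 1], speech[t - 2]])
    for s, weights in enumerate(predictors):
        prediction = weights @ context
        errors[t, s] = float(np.sum((speech[t] - prediction) ** 2))

# A DTW alignment over this error matrix would then score each word candidate;
# the word whose linked networks accumulate the lowest error is recognized.
print(errors[2:5].round(2))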
JANUS1 uses the generalized LR parser. The grammar rules are hand-written to cover the entire ATR conference registration domain. In this implementation, a semantic grammar has been used, with notation similar to Lexical Functional Grammar. Figure 2.2 shows a recognition result of the LPNN, the parser output, and the translation results.
LPNN output:
(HELLO IS THIS THE OFFICE FOR
THE CONFERENCE $)
Japanese translation:
MOSHI MOSHI KAIGI JIMUKYOKU DESUKA
German translation:
HALLO IST DIES DAS KONFERENZBUERO
It should be noted that JANUS2 outperformed JANUS1 in the first hypothesis case, but not in the N-best case. This is because the connectionist parser simply provides the best available output from the first N-best
LPNN output:
(HELLO IS THIS THE OFFICE FOR
THE CONFERENCE $)
Connectionist parse:
((QUESTION 0.9)
 ((GREETING 0.8)
  ((MISC 0.9) HELLO))
 ((MAIN-CLAUSE 0.9)
  ((ACTION 0.9) IS)
  ((AGENT 0.9) THIS)
  ((PATIENT 0.8) THE OFFICE)
  ((MOD-1 0.9) FOR THE CONFERENCE)))
Japanese translation:
MOSHI MOSHI KAIGI JIMUKYOKU DESUKA
German translation:
HALLO IST DIES DAS KONFERENZBUERO
candidate, even when the correct hypothesis is in the second or third best place. When only one word sequence is given, as in the first hypothesis case, JANUS2 is better because it provides the best guess, hopefully a correct one. This characteristic of the connectionist parser comes from the nature of neural networks: the network does not retain the correct instances given at the training stage. The neural network simply changes weights and makes generalizations. This means that the neural network does not know how far the input is from the known training data, and thus it does not have a means to tell how bad
2.4 MINDS
The MINDS system [Young et al., 1989] is a spoken-input user interface system for database query on the DARPA resource management domain. The speech recognition part is the SPHINX system [Lee, 1988] with a 1,000-word vocabulary. The main feature of the MINDS system is its layered prediction method for reducing perplexity. The basic idea for accomplishing this reduction of perplexity is the use of plan-based constraints, tracking all information communicated (user questions and database answers).
Recognition Performance

Constraints used:        grammar    layered predictions
Test Set Perplexity      242.4      18.3
Word Accuracy            82.1       96.5
Semantic Accuracy        85%        100%
Insertions               0.0%       0.5%
Deletions                8.5%       1.6%
Substitutions            9.4%       1.4%
The system size measured by the knowledge-base size is about 1,500 concepts
for the domain model, 800 words for Japanese, and 900 words for English.
The KBMT-89 system uses a set of modular components developed at the Center for Machine Translation. These are the FRAMEKIT frame-based knowledge representation system [Nyberg, 1988], the generalized LR parser, a semantic mapper for treating additional semantic constraints, an interactive augmentor for resolving remaining ambiguities [Brown, 1990], and the semantic and syntactic generation modules [Nirenburg et al., 1988b]. In addition to these modules, KBMT-89's knowledge base and grammar were developed using the ONTOS knowledge acquisition tool [Nirenburg et al., 1988a] and a grammar writing environment. The system was tested on 300 sentences without pre-editing, though some of the sentences could not be translated automatically.
All partial parses are represented using the graph structured stack, and each
partial parse has its probability based on the probability measure from the
HMM phone verifier. Partial parses are pruned when their probability falls
below a predefined threshold. The HMM-LR method uses beam search [Lowerre, 1976] for this pruning. In case multiple hypotheses survive at the end, the one with the highest probability is selected.
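A schematic sketch of this pruning step is shown below; the PartialParse record, the threshold, and the beam width are assumptions made for illustration rather than details of the HMM-LR implementation.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PartialParse:
    probability: float        # accumulated probability from the HMM phone verifier
    stack: Tuple[str, ...]    # stand-in for the graph-structured stack state

def beam_prune(parses: List[PartialParse], threshold: float = 1e-6, beam: int = 20):
    """Drop partial parses below the probability threshold, then keep the best `beam`."""
    survivors = [p for p in parses if p.probability >= threshold]
    survivors.sort(key=lambda p: p.probability, reverse=True)
    return survivors[:beam]

hypotheses = [PartialParse(0.4, ("NP",)),
              PartialParse(1e-9, ("DET",)),
              PartialParse(0.2, ("VP",))]
hypotheses = beam_prune(hypotheses)
best = max(hypotheses, key=lambda p: p.probability)   # chosen if several survive at the end
print(best)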
Similar to the ΦGLR parser, the grammar has phone names as its terminal symbols instead of words. A very simple example of context-free grammar rules with a phonetic lexicon is as follows:
(a) S   --> NP VP
(b) NP  --> DET N
(c) VP  --> V
(d) VP  --> V NP
(e) DET --> /z/ /a/
(f) DET --> /z/ /i/
(g) N   --> /m/ /ae/ /n/
(h) N   --> /ae/ /p/ /a/ /l/
(i) V   --> /iy/ /ts/
(j) V   --> /s/ /ih/ /ng/ /s/
Rule (e) represents the definite article the pronounced /z/ /a/ before consonants, while rule (f) represents the same article pronounced /z/ /i/ before vowels. Rules (g), (h), (i), and (j) correspond to the words man, apple, eats, and sings, respectively.
3
DESIGN PHILOSOPHY BEHIND
THE ΦDMDIALOG SYSTEM
3.1 INTRODUCTION
This chapter discusses the ideas behind the model of spoken language translation implemented as the ΦDMDIALOG system. The three major design decisions regarding its basic framework were:
These design decisions were made by analyzing knowledge and parallelism in-
volved in the speech-to-speech translation task. Table 3.1 shows how different
levels of parallelism are involved in the speech-to-speech translation task (this
table is by no means exhaustive).
Table 3.1 Knowledge and parallelism involved in the speech translation task
such operations can be simulated from the neural level in the future, it is more computationally efficient, at this moment, to carry out symbolic processing or a hybrid approach. ΦDMDIALOG is an instance of this idea. In ΦDMDIALOG, various parallel operations - from parallel numeric computations (mostly, but not exclusively, at the phonological level) to parallel constraint satisfaction (at the syntactic/semantic and discourse levels) - are integrated. Although sub-symbolic processing is experimentally incorporated in ΦDMDIALOG, we focus our discussion on symbolic processing using a parallel marker-passing scheme.
Obviously, the ideas behind memory-based and case-based reasoning are similar. The only difference is, perhaps, that case-based reasoning attaches more importance to so-called case adaptation. Case adaptation is a process of adapting retrieved cases to the case under consideration in order to offer a solution. In
Since parsing is processed directly by accessing the memory, they claim DMAP
to be a "recognize-and-record" model of language understanding; this is in con-
trast to the "build-and-store" model used by traditional parsers. DMAP uses
hierarchically organized Memory Organization Packets (MOPs) [Schank, 1982]
as knowledge for understanding sentences. Marker-passing is used for marking
activated parts of the memory and to predict the next concept to be activated.
This is a very attractive approach to natural language processing because of its
potential parallelism, use of cases for understanding, and contextual processing
capability. In fact, the initial design of ΦDMDIALOG incorporated major
features of the DMAP system.
Although our initial work was based on the DMAP model, we soon augmented
the model in various ways in order to overcome its problems and take advantage
of its potential benefits which were not well exploited in the DMAP. As a
result of these modifications, however, our model became quite different from DMAP. In particular, the development of the memory-based generation method is a significant addition to the memory-based paradigm. These augmentations will be described in the relevant parts of this book.
Sato and Sumita took Nagao's idea to implement their experimental machine
translation systems. Sato developed MBT-I, -II, and -III [Sato and Nagao,
[Figure 3.1: Translation as Analogy. A new input sentence is covered by previous input sentences stored in the case base; the corresponding previous translations are combined through a transform process to synthesize a derivation of the translation for the input.]
3.2.3 Rationale
Although substantial research efforts and funding have been devoted to solving these problems, no major breakthrough has been reported so far. This implies that the basic approach taken in traditional systems faces a serious dead end, and that a dramatically different paradigm is needed to overcome these problems.
1. Finite vocabulary
2. Finite grammar rules
3. Finite sentence length

Obviously, the first and second conditions hold, since no individual nor group of individuals has an infinite vocabulary or infinite grammar rules. The third condition also holds, since it is biologically impossible to produce sentences of infinite length. Also, in practice, the length of sentences has a certain upper bound.
An analysis using corpora from ATR, DARPA, and CNN prime news shows that most sentences are within 30-40 words in length. As shown in figure 3.2, the ATR and DARPA corpora show very similar characteristics: both have a peak of sentences between 7 and 9 words long. CNN has longer sentences, though it too has peaks at 7 to 13 words in length. The ATR sentences' maximum length was 19 words and that of the DARPA corpus was 24. CNN Prime News was 48, and CNN segmented was 35. 99.7% of the sentences from the ATR, DARPA, and CNN Segmented corpora are less than 25 words in length (Figure 3.3).
[Figures 3.2 and 3.3: Distribution by sentence length (percentage vs. length in words) and coverage by sentence length for the Conference Registration, CNN Prime News, and CNN (Segmented) corpora.]
Therefore, the number of possible sentences is not infinite. It is finite, but very large, hence VLFS.
First, it is widely recognized that sentences used in daily life are highly stereo-
typical. The following is an excerpt from the ATR corpus on conference regis-
tration:
This KWIC view on the word "would" demonstrates that, in this specific cor-
pus, the usage of the word "would" is very limited. Possible sentence patterns
involving "would" can be expressed by three templates:
By the same token, given these example sentences, most new sentences are expected to be similar to one of the examples, so that a translation can be reconstructed using the translation pairs of those examples. For example, an input I would like to participate the conference is similar to one of the examples, I would like to attend the conference. Thus, the translation of the example, which is "会議に出席したいのですが", can be used to create a correct translation of the input, which is "会議に参加したいのですが".
Taking another example, the following is a KWIC view of the preposition "for" from the ATR corpus:
Performance
The possibility of real-time performance is the real advantage of the memory-based approach. From the performance point of view, traditional parsing is a time-consuming task. A few seconds or even a few minutes are required to complete the parsing of one sentence. Thus, the time complexity of parsing
[Figure 3.4: Real space and possible space, showing the solution space covered by the rule-based approach.]
algorithms has been one of the central issues of research in parsing technologies.
The most efficient parsing algorithm known to date is Tomita's generalized LR parser, which takes less than O(n^3) time for most practical cases [Tomita, 1986]. Although some attempts have been made to implement a parallel parsing process to improve its performance, the degree of parallelism attained by implementing a parallel version of the traditional parsing scheme is rather low. In fact, it is usually less than 100, and it takes about 0.3 seconds to parse a 10-word sentence [Tanaka and Numazaki, 1989]. Plus, it degrades more than linearly
as the length of the input sentence gets longer. Thus, no dramatic increase
in speed can be expected. In our model, serial rule application is eliminated
and replaced by a parallel memory-search process. This approach has not
been taken in the serial machine because it would result in a trade-off between
improved efficiency due to the use of cases and degradation due to an increase
in search cost. However, by using massively parallel computers, we expect that
our model attains high performance processing. Performance evaluation on
actual massively parallel machines will be given in chapters 5 and 6 of this
book.
pear most frequently, memory-based processing will be used; when there is no match at the memory-based processing stage, unification-based processing with wide coverage will be used. We merged these two levels of processing by using a mechanism called feature aggregation. The goal of the integration is to ensure that the computationally cheapest path is taken for any input, while ensuring linguistic soundness.
An alternative approach exists that views rules as a monitor for the translation process. With this view, rules do not play a central role in parsing or generating sentences. The memory-based process handles the entire process. However, rules are used to check whether word choice, style, and other constraints are satisfied. In this division of labor, the memory-based process handles the autonomous part of translation, which is difficult to formalize in an explicit manner. The rule-based process handles the conscious part of translation, which can be explicitly formalized. Although this approach was not a central option in the original ΦDMDIALOG system, chapter 7 describes one of its descendants, the Memoir system, which emphasizes this approach.
Historically, the acquisition of greater computing power and memory space has
changed the way things are done. For example, it was once believed that a grandmaster-level chess system is attainable only by using extensive heuristics of expert chess players. However, the history of computer chess shows that the key factor for a stronger chess system is computing power. The computing power of chess systems and the ratings of these systems have a direct correspondence [Hsu et al., 1990]. Independently, the success of various memory-based reasoning systems proves the importance of massive data streams.
Three massively parallel machines have been accessible to the author for this
project: IXM2 associative memory processor, SNAP semantic network array
processor, and CM-2 connection machine. We have implemented versions of
the ΦDMDIALOG system on IXM2 and SNAP. Implementation on the CM-2 is
underway. Thus, in this book, we will describe implementations on IXM2 and
SNAP. Both IXM2 and SNAP are marker-passing machines, and are particu-
larly suitable for memory-based processing due to their data-parallelism.
The SNAP project was started at the University of Southern California in 1983
by Dan Moldovan. The initial goal of the project was to develop a parallel
machine for semantic network processing. From 1990, the project became a
joint project between the University of Southern California and Carnegie Mellon
University. The joint project focuses on the development of the SNAP-1 and
implementation of the ΦDMDIALOG system.
The IXM2 project was initiated by Tetsuya Higuchi at the Electrotechnical Laboratory in Japan. Similar to the SNAP project, the main goal was to develop a high-performance and cost-effective machine for semantic network processing. During 1990-91, the IXM2 machine was installed at the Center for Machine Translation at Carnegie Mellon University. A version of the ΦDMDIALOG
model has been experimentally implemented on the IXM2.
3.4 MARKER-PASSING
Marker-passing has been used as a central means to carry out inference on mas-
sively parallel computers. Marker-passing was first proposed by Scott Fahlman
[Fahlman, 1979] as a means to perform inferencing on the semantic network.
The marker in his NETL system was a bit marker. Charniak, however, was the
first to apply marker-passing to natural language processing [Charniak, 1983].
In [Charniak, 1983], a marker-passing mechanism was used to handle contex-
tual processing outside of a syntactic parser. He also extended the concept of
marker-passing to include numeric values and a source of activation in order to
control the search space and perform path evaluation. Numeric values were also used by Norvig [Norvig, 1986] to compute contextual recognition in story understanding. The path evaluation idea was further extended by Hendler [Hendler, 1988], who allowed a marker to carry the entire path of marker propagation. These marker-passing methods have been used, however, as adjunct modules connected to the main component of the system. Perhaps DMAP [Riesbeck and Martin, 1985] was the first to propose a parsing method carried out entirely by marker-passing. In DMAP, two types of markers (Activation markers and Prediction markers) are used to carry out parsing. Our model is largely influenced by the ideas presented in DMAP, and we have further augmented and redefined the model in the ΦDMDIALOG system.
4.1 INTRODUCTION
ΦDMDIALOG is an experimental speech-to-speech dialog translation system developed at the Center for Machine Translation at Carnegie Mellon University. It is one of the first experimental speech-to-speech translation systems currently up and running. The system employs a parallel marker-passing algorithm as the basic architecture of the model. Speech, natural language, and discourse processing are integrated to improve the speech recognition rate by
providing top-down predictions on possible next inputs. A parallel incremental
generation scheme is employed, and a generation process and the parsing pro-
cess run almost concurrently. Thus, a part of the utterance may be generated
while parsing is in progress. Unlike most machine translation systems, where
parsing and generation operate by different principles, our system adopts com-
mon computation principles in both parsing and generation, and thus allows
integration of these processes. The system has been publicly demonstrated
since March 1989.
during parsing [Kitano, 1989a]. Both the parsing and generation processes employ parallel incremental algorithms. This enables our model to generate a part of the translation of the input utterance while the rest of the utterance is still being parsed.
Lexical Entry
In our model, lexical items are represented by Concept Sequence Class nodes
applied to the lexical level which we call lexical nodes. Each lexical node has
knowledge of how each word should be pronounced in the form of a phoneme
sequence. For example, definitions for a lexical node for a Japanese word 'Kaigi'
(conference) and an English word 'Conference' are shown in figure 4.1.
(defLEX '(kaigi
          (is-a (conference))
          (language (japanese))
          (surface (kaigi))
          (gen-phon (ka i gi))
          (sequence (k a i g i))))

(defLEX '(conference
          (is-a (conference))
          (language (english))
          (surface (conference))
          (gen-phon (conference))
          (sequence (K AA N F R AX N S))))
Ontological Hierarchy
The class/subclass relation is represented in the memory network in the form of an ontological hierarchy. Each CC represents a specific concept, and the CCs are connected through IS-A links. The highest (most general) concept is *entity, which entails all possible concepts in the network. Subclasses are linked under the *entity node, and each subclass node has its own subclasses. As the basis of the ontological hierarchy, we use the hierarchy developed for the MU project [Tsujii, 1985]; domain-specific knowledge and other knowledge necessary for processing in the system have been added.
Memory-Base
The memory-base is represented as a memory network using specific cases and generalized cases. Specific cases represent the surface strings, or near-surface forms, of sentences. A translation pair is indexed under one node representing the concept of the sentence, or just an ID-tag. When a source language CSC is activated by the input, the translation is created using the target language CSC which is connected via the concept node. When only specific cases are used in the memory network, the system is a pure memory-based translation system. However, generalized cases and grammar rules can also be represented, so that various experiments can be carried out in a uniform framework. Generalized cases are similar to concept sequences in DMAP, and are templates of sentences. Both specific cases and generalized cases are represented using the same encoding style and compiler.
Grammar Rules
Grammar rules can be written using notations similar to Lexical-Functional
Grammar (LFG:[Kaplan and Bresnan, 1982]) (figure 4.2) or using more
semantic-oriented encoding (figure 4.3). LFG is a kind of unification-based grammar. It consists of phrase structure rules and associated constraints. The unification operation is the central operation, which builds up the meaning representation and imposes constraints. In figure 4.2, x0 indicates the left-hand side of the rule; thus (x0 head) is the head of the <NP>. By the same token, x1 and x2 indicate the first and the second terms on the right-hand side, respectively.
Also, mixing levels of abstraction in a grammar rule is permitted in our model (figure 4.4). Although the use of such a semantic-oriented grammatical encoding method may be linguistically controversial (it would provide less linguistic generalization than other formalisms such as Lexical Functional Grammar [Kaplan and Bresnan, 1982] or Head-driven Phrase Structure Grammar [Pollard and Sag, 1987]), it is one of the best ways to write a grammar for speech-input parsing due to its perplexity reduction effect. Perplexity is a measure of the complexity of the task, similar to an average branching factor; a small perplexity measure means a task is rather simple. Generally, smaller perplexity improves the accuracy and response speed of the speech recognition module. We extend the idea of the semantic-oriented grammar to allow direct encoding of a surface string sequence into a specific case of utterances. The use of specific cases with stochastic measurements makes a significant contribution to perplexity reduction, while strong constraints at the syntactic/semantic level can directly influence the speech processing level.
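The book does not give a formula here; for reference, the standard definition of test-set perplexity used in the speech recognition literature is

$$PP = 2^{H}, \qquad H = -\frac{1}{N}\sum_{t=1}^{N} \log_2 P(w_t \mid w_1 \ldots w_{t-1}),$$

so a perplexity of k behaves like choosing uniformly among k equally likely next words; this is the sense in which specific cases and a semantic-oriented grammar lower the effective branching factor seen by the speech module.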
Figure 4.4 Grammar using mixture of surface string and generalized case
4.2.2 Markers
Markers are entities which carry information and pass through the memory network in order to make inferences and predictions. Our model uses five types of markers:
G-Markers are created at the lexical level and each contains a surface string,
linguistic and semantic features, a cost measure, and concept instance nodes.
G-Markers are passed up through the memory network. At a certain point of
processing, surface strings of G-Markers are concatenated following the order
of concept sequence class nodes, and a final string of the utterance is created.
When incremental sentence production is performed, V-Markers record part
of the sentence which is already verbalized and anticipate the next possible
verbalization strings. Figure 4.6 shows an example of a G-Marker and a V-
Marker.
The basic concept introduced in our model, which is substantially different from other marker-passing models, is the use of probabilistic and structured parallel marker-passing. Each marker carries a probability measure that indicates how likely the hypothesis (represented in the marker) is to be true. It also contains symbolic data such as linguistic features, a pointer to a discourse entity, and
(A-MARKER0236
  (Probability: 0.14)
  (CI: Conference045)
  (Concept: Conference)
  (Feature: nil)
  (Type: Event)
)

(P-MARKER0196
  (Probability: 0.50)
  (Constraints: ((x0 = x2)
                 ((x0 subj) = x1)))
  (Feature: nil)
)
Our model uses five types of markers. These markers are (1) Activation Mark-
ers (A-Markers), which show activation of nodes, (2) Prediction Markers (P-
Markers), which are passed along the conceptual and phonemic sequences to
make predictions about nodes to be activated next, (3) Contextual Markers (C-
Markers), which are placed on nodes with contextual preferences, (4) Genera-
(G-MARKER0886
  (Probability: 0.67)
  (CI: John021)
  (Concept: Male-Person)
  (Feature: ((Gender: Male)
             (Number: 3sg)))
  (Type: Object)
)

(V-MARKER0180
  (Probability: 0.50)
  (Constraints: (((x0 actor) = x1)
                 ((x0 object) = x3)
                 (x0 = x5)))
  (Feature: ((Actor (CI: John021)
                    (Gender: Male)
                    (Number: 3sg))))
  (Surface String: "Jon ha")
)
tion Markers (G-Markers), which each contain a surface string and an instance
which the surface string represents and (5) Verbalization Markers (V-Markers),
which anticipate and keep track of the verbalization of surface strings. Informa-
tion on which instances caused activations, linguistic and semantic features, and
probabilistic measures are carried by A-Markers. The binding list of instances
and their roles, probability measures, and constraints are held in P-Markers.
G-Markers are created at the lexical level and passed up through the memory
network. At a certain point of processing, surface strings of G-Markers are concatenated following the order of a CSC, and a final string of the utterance is created. When incremental sentence production is performed, V-Markers record the parts of sentences which have already been verbalized and anticipate the next possible verbalization strings.
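To make the marker contents concrete, here is a minimal sketch of the five marker types as Python dataclasses. The field names follow the descriptions above and the examples in figures 4.5 and 4.6; everything else (types, defaults) is an assumption of the sketch, not the book's implementation.

from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass
class AMarker:   # Activation Marker: signals activation of a node
    probability: float
    concept: str                                  # e.g. "Conference"
    instance: Optional[str] = None                # CI, e.g. "Conference045"
    features: Dict[str, Any] = field(default_factory=dict)

@dataclass
class PMarker:   # Prediction Marker: sits on the next element expected to be activated
    probability: float
    constraints: List[str] = field(default_factory=list)    # e.g. "(x0 subj) = x1"
    bindings: Dict[str, Any] = field(default_factory=dict)  # instances bound to roles

@dataclass
class CMarker:   # Contextual Marker: marks nodes with contextual preference
    source: str

@dataclass
class GMarker:   # Generation Marker: carries a surface string up the network
    probability: float
    surface: str                                  # e.g. "Jon ha"
    instance: Optional[str] = None
    features: Dict[str, Any] = field(default_factory=dict)

@dataclass
class VMarker:   # Verbalization Marker: tracks what has already been verbalized
    probability: float
    verbalized: str = ""                          # part of the sentence already produced
    expected: List[str] = field(default_factory=list)   # next possible strings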
Marker Passing: Markers are passed up through IS-A links. Features are
aggregated through this process.
Figure 4.8 shows the movement of a P-Marker on the layers of CSCs. When the P-Marker at the last element of a CSC gets an A-Marker, the CSC is accepted and an A-Marker is passed up to the element in the higher-layer CSC. Then, a P-Marker on the element of that CSC collides with the A-Marker, and the P-Marker is moved to the next element. At this time, a P-Marker which contains information relevant to the lower CSC is passed down and placed on the first element of the lower CSC. This is a process of accepting one CSC and predicting the possible next word and syntactic structure.
Although a real memory network has a highly layered and indexed structure,
for the sake of clarity, figure 4.9 shows an example of how our algorithm parses
an input with a simple context-free grammar. This example assumes the following context-free grammar:
[Figures 4.7-4.9: Movement of P-Markers. Figure 4.7 shows simple prediction and dual prediction along a concept sequence <e0 e1 e2 ... en>; figure 4.8 shows P-Marker movement on hierarchical CSCs; figure 4.9 shows a step-by-step parse with the small grammar below, from the initial prediction through A-P collisions, shift-and-predict steps, and reduction at <*D *E> and <*A *B>.]
• S → A B
• A → C
• B → D E | F
• C → a
• D → b
• E → c
• F → d
In figure 4.9b, the activation of node 'a' triggers propagation of an A-Marker from node 'a' to node 'C', to node 'A', and to node '*A'. As a result of the A-Marker propagation up to the element '*A' of the concept sequence class <*A *B>, an A-P-Collision takes place at *A. Then, in figure 4.9c, the P-Marker is shifted to the next element of the concept sequence class (element '*B'). Then, a P-Marker is passed down to predict the next possible inputs, from element *B to element *D, and to node b. Also, a P-Marker is passed down from element '*B' to node F, and to node d. In figure 4.9d, the activation of node 'b' triggers an A-Marker propagation from node b to node D, resulting in an A-P-Collision at <*D *E>. Figure 4.9e shows a shift of a P-Marker and a top-down prediction with a P-Marker to node c. In figure 4.9f, the activation of node c causes a reduce, first at <*D *E> and then at <*A *B>. Finally, an A-Marker activates S and the input sentence is accepted.
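The following is a minimal sequential sketch of the shift/reduce behaviour just described, run on the small grammar above. The data structures (tuples for P-Markers, a dictionary for IS-A links) and the rule that a failed lexical prediction is simply dropped are simplifications made for the sketch; the actual system keeps multiple scored hypotheses and performs these steps in parallel by marker-passing.

GRAMMAR = {                       # CSCs: class -> alternative concept sequences
    "S": [["A", "B"]],
    "A": [["C"]],
    "B": [["D", "E"], ["F"]],
}
IS_A = {"a": "C", "b": "D", "c": "E", "d": "F"}   # terminal -> activated class

def predict(marker, markers):
    """Place a P-Marker and recursively pass predictions down to lower CSCs."""
    if marker in markers:
        return
    markers.append(marker)
    cls, alt, pos = marker
    for sub in GRAMMAR.get(alt[pos], []):
        predict((alt[pos], sub, 0), markers)

def parse(tokens):
    preterminals = set(IS_A.values())
    p_markers = []
    for alt in GRAMMAR["S"]:                       # initial prediction on S
        predict(("S", alt, 0), p_markers)

    for i, tok in enumerate(tokens):
        a_markers = [IS_A[tok]]                    # A-Marker passed up the IS-A link
        while a_markers:
            symbol = a_markers.pop()
            new_markers, accepted = [], []
            for cls, alt, pos in p_markers:
                if alt[pos] == symbol:                             # A-P collision
                    if pos + 1 < len(alt):
                        predict((cls, alt, pos + 1), new_markers)  # shift and predict down
                    else:
                        accepted.append(cls)                       # last element: CSC accepted
                elif symbol in preterminals and alt[pos] in preterminals:
                    continue                  # competing lexical prediction failed: drop it
                elif (cls, alt, pos) not in new_markers:
                    new_markers.append((cls, alt, pos))            # unaffected marker stays
            p_markers = new_markers
            if "S" in accepted and i == len(tokens) - 1:
                return True                   # top-level CSC accepted at the end of input
            a_markers.extend(a for a in accepted if a != "S")      # reduce: pass A-Marker up
    return False

print(parse(["a", "b", "c"]))   # True:  S -> A B, A -> C, B -> D E
print(parse(["a", "d"]))        # True:  B -> F
print(parse(["a", "b", "d"]))   # False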
We will further illustrate this basic parsing algorithm using a simple memory
network, as in Figure 4.10. Part (a) of the figure shows an initial prediction
stage. P-markers are placed on *person in a CSC at the syntax/semantics
level. Also, the other P-marker is placed on the first element of CSCs at the
phonological level. In part (b) of Figure 4.10, the word john is activated as a result of speech recognition, and an A-marker is passed up through the IS-A link. It reaches *person in the CSC, which has the P-marker. An A-P-
collision takes place and features in the A-marker are incorporated into the
P-marker following the constraint equation specified in the CSC. Next, the P-
marker shift takes place; this may be seen in part (c) of the figure. Now the
P-Marker is placed on *want. Also, the prediction is made that the possible
next word is wants. Part (d) shows the movement of P-markers after recognizing
to. In (e), the last word of the sentence, conference, comes in and causes an
A-P-collision at *event. Since this is the last element of the CSC <*attend
*def *event>, the CSC is accepted and a new A-marker is created. The newly
created A-marker contains the information built up by a local parse with this
CSC. Then, the A-marker is propagated upward, and it causes another A-P-
collision at *circumstance. Again, because *circumstance is the last element
of the CSC <*person *want *to *circumstance>, the CSC is accepted, and
interpretation of the sentence is stored in a newly created A-marker. The A-
marker further propagates upward to perform discourse-level processing.
[Figure 4.10: A simple parsing example. P-markers on the CSC <*person *want *to *circumstance> at the syntax/semantics level, and on the phoneme sequences of 'john', 'wants', 'to', 'attend', 'the', 'conference' at the phonological level, are shifted by successive A-P collisions as each word is recognized, with top-down predictions placed at each step.]
generates an A-Marker that hits the P-Marker on the (i+1)-th element, the P-Marker is again moved using the dual prediction method. The probability density measure computed on the P-Marker is as follows:
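The formula itself has been lost to the scan. Judging from the worked example that follows, each move multiplies the probability already carried by the P-Marker by an observation term and a transition term; a plausible reconstruction (not the original equation) is

$$\pi' = \pi \times b_{i_t,\,p} \times a_{p,\,p'},$$

where $\pi$ is the probability carried by the P-Marker, $b_{i_t,p}$ is the probability of observing input $i_t$ at state $p$, and $a_{p,p'}$ is the transition probability to the next state $p'$.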
In figure 4.12, the input sequence is i_0 i_1 ... i_n. p_ij in the diagram denotes a phoneme p_j at the i-th element of the CSC; p_ij is a state rather than an actual phoneme, and p_j in the CSC refers to the actual phoneme. P-Markers at p_00, p_01, and p_02, and P-Markers on the 0-th elements of the CSCs referring to p_0, p_1, and p_2, respectively, are hit by A-Markers. Eventually, P-Markers are moved to the next elements of the CSCs. For instance, p_00 will move to p_10, p_11, p_20, or p_21, depending on which CSC the P-Marker is placed on. Probabilities are computed with each movement. A P-Marker at p_00 has the probability $\pi_0$. When the P-Marker receives an A-Marker from i_1, the probability is re-computed and it becomes $\pi_0 \times b_{i_0,p_{00}} \times a_{p_{00},p_{10}}$. Transitions such as p_00 → p_21 and p_00 → p_20 insert an
[Figure: activated phonemes plotted against the input phoneme sequence, showing paths for the lexical hypotheses KAIGI, GOMI, and GOEI NE...]
extra phoneme which does not exist in the input sequence. The probability for such transitions is computed as $\pi_0 \times b_{i_0,p_{00}} \times a_{i_0,\phi} \times b_{i_2,p_{20}} \times a_{\phi,i_2}$. A P-Marker at p_10 does not get an A-Marker from i_1 due to the threshold. In such cases, the probability measure of the P-Marker is re-computed as $\pi_0 \times b_{i_0,p_{00}} \times a_{i_0,\mathrm{noise}}$. This represents a decrease of probability due to an extra input symbol.
P-Markers at the last element (p_n) and the one before the last (p_{n-1}) are involved in the word boundary problem. When a P-Marker at p_n is hit by an A-Marker, the phoneme sequence is accepted and an A-Marker which contains the probability and the phoneme sequence is passed up to the syntactic/semantic level of the network. Then, the next possible words are predicted using syntactic/semantic knowledge, and P-Markers are placed on the first and second elements of the phoneme sequences of the predicted words. When a P-Marker at p_{n-1} is hit by an A-Marker, the P-Marker is moved to p_n and, independently, the phoneme sequence is accepted due to the dual prediction, and the first and second elements of the predicted phoneme sequences get P-Markers.
drawn in this figure. A path shown by a thick solid line shows a lexical hypoth-
esis for KAIGI (conference), a thick dashed line shows GOMI (garbage), and a
thin dashed line shows GOEI NE .. (guard and a part of a following word). In
activating the lexical hypothesis GOMI, the third input phoneme 'I' is considered to be noise, and thus the path ignores either 'i' or 'e' activated by 'I'. On the other hand, the path for GOEI has a transition that adds a phoneme which does not appear in the input phoneme sequence; a phoneme 'o' is inserted while transiting from 'g' to 'e'.
where k denotes a transition from i_{i-2} to i_{i-1}. It is interesting that our context-dependent model is quite similar to the Hidden Markov Model (HMM) when the transitions of the states of P-Markers are synchronously determined by, for example, certain time intervals. We can implement a forward algorithm and the Viterbi algorithm [Viterbi, 1967] using our model. This implies that if we decide to employ an HMM as our speech recognition model, instead of the current speech input device, it can be implemented within the framework of our model.
The algorithm itself is simple, but it has more flexibility than Litman's model when performed on the memory network, which represents the domain plan hierarchy for each speaker. There are two major differences between Litman's model and our model which explain the flexibility and computational advantages of our model.
Second, our model assumes specific domain plans for each speaker. The domain plan, which has previously been considered a joint plan, is now separated into two domain plans, each of which represents a domain plan of a specific speaker. Each speaker can only carry out his or her own domain plans in the stack. Progression from one domain plan to another can only be accomplished through utterances in the dialogue. A domain plan becomes a joint plan when both speakers execute or recognize the same domain plan at the same specific point in the speech event; it occurs separately for each speaker in the domain plan hierarchy in the memory network.
Goal Marker) is created and it is passed upward in the plan hierarchy. All nodes existing along the path of the IG-Marker are marked with the IG-Marker (figure 4.14). They represent possible goals/subgoals of the speaker. Then, the P-Marker is moved to the next element, and its copy is passed down to the lower CSC representing a sequence of actions to attain the predicted goal/subgoal (figure 4.15). Then the next A-Marker hits the P-Marker, and an IG-Marker is created and propagated upward (figure 4.16). Although this illustration is much too simplified, the basic process flow is captured. When an A-Marker and a P-Marker collide, constraint satisfaction generally takes place in order to ensure the coherence of dialogue recognition. This process is similar to the constraint-directed search used in [Litman and Allen, 1987].
The initial state of the network is shown in figure 4.17(A). Notice that there are different domain plans for each speaker. Speaker A is the customer and Speaker B is the travel agent. The first elements of the CSCs for both speakers are marked with P-Markers. As the first utterance comes in (utterance (1)), an A-Marker propagated from the parsing stage comes up and hits a state-destination (state-dest) action. An A-P-Collision takes place, and an IG-Marker is created. The IG-Marker propagates upward, marking all possible goals of Speaker A. Also, the P-Marker is moved to the next element of the CSC. This is shown in figure 4.17(B). Next, utterance (2) comes in, and an A-Marker hits a P-Marker at confirm-destination (figure 4.17(C)). An IG-Marker is created and it marks sell-ticket, which is the goal of the travel agent as inferred from utterance (2). Utterances (3) and (4) are replies by Speaker A to utterance (2) made by Speaker B. For such replies, generally, P-Markers and IG-Markers at domain plans do not move. When Speaker B, the travel agent, makes utterance (5), an A-Marker hits the P-Marker on tell-best-option predicted from the previous utterance. However, IG-Markers are unaffected because nothing has been accomplished yet. If Speaker A accomplishes buy-ticket, an A-Marker is created at the CSC <state-dest ...> and hits the P-Marker at buy-ticket. Then, the P-Marker is moved to the next element
of the CSC, and the IG-Marker is removed since this subgoal has been accomplished. It is perhaps the next series of utterances that may create IG-Markers to mark take-airplane as the next subgoal. As at the syntax/semantics level, multiple levels of abstraction can co-exist. For the sake of clarity, however, this example only shows one level of abstraction.
Aspects which distinguish our model from the simple scripts or MOPs used in past models are (i) utterance sequences can be dynamically created from more productive knowledge on dialogue and domain, as well as from previously acquired case knowledge, whereas scripts and MOPs are simple predefined sequential memory structures; and (ii) an utterance has an internal constraint structure which enables constraints to be imposed to ensure discourse processing coherence. Again, the level of abstraction may vary in each CSC representing a discourse script. Some CSCs may involve elements linked to specific instances of utterances, while others may use elements linked to abstract nodes representing utterance types.
[Figure 4.17: Domain plan hierarchies for Speaker-1 and Speaker-2 in states (A), (B), and (C), showing IG-, A-, and P-Marker placement on CSCs such as <attend-conference>, <buy-ticket take-airplane ...>, <goto-city goto-site register>, <confirm-destination tell-best-option establish-agreement>, and <state-destination ...>.]
One other type of CI represents the meaning of a specific utterance. Such CIs are recorded in the memory network as cases of utterances in a specific discourse context. In ΦDMDIALOG, this type of CI has the following links:
In addition, each CSC has world knowledge which triggers modification of the memory network to reflect what was told to the system. For instance, if an utterance conveys the information that 'John bought a car', the memory network is modified so that '@person001', which is the CI for John, has a POSSESS link to the specific instance of the car, @CAR001. A correct understanding of an utterance refers to the state where a CI is created under an appropriate CC and connected to appropriate CIs with appropriate links. This knowledge is particularly important when resolving ambiguities and identifying anaphoric references.
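An illustrative sketch of this kind of memory-network update follows; the class, data structures, and naming scheme are ours, and only the '@person001'/POSSESS example comes from the text.

from collections import defaultdict

class MemoryNetwork:
    def __init__(self):
        self.links = defaultdict(list)    # node -> [(link_type, target)]
        self.counters = defaultdict(int)

    def create_ci(self, concept_class):
        # Create a concept instance (CI) under a concept class (CC).
        self.counters[concept_class] += 1
        name = concept_class.replace("C-", "", 1)          # e.g. "C-CAR" -> "CAR"
        ci = "@%s%03d" % (name, self.counters[concept_class])
        self.links[ci].append(("INSTANCE-OF", concept_class))
        return ci

    def add_link(self, source, link_type, target):
        self.links[source].append((link_type, target))

net = MemoryNetwork()
john = "@person001"                       # CI for John, already in the network
car = net.create_ci("C-CAR")              # creates "@CAR001"
net.add_link(john, "POSSESS", car)        # reflect 'John bought a car'
print(net.links[john])                    # [('POSSESS', '@CAR001')]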
[Figure: probability measures (P values) assigned to nodes A0-A5 in a simple network.]
are assigned for words which are not predicted by the case-based process, but are predicted by the unification-based process, so that even utterances which have been unexpected by the case-based process can be handled. Figure 4.19 shows how probability measures are propagated in a simple network. With a certain threshold value, we obtained an experimental result which shows that the top-down prediction effectively reduced perplexity. With a small test set which has a word-choice perplexity of 247.0 with no constraints, the perplexity was reduced to 19.7 with syntactic and semantic constraints, and further reduced to 2.4 with discourse-level constraints. However, the perplexity reduction from adding discourse-level knowledge would be less effective when we apply our method to a large mixed-initiative dialog domain. This problem will be discussed in section 4.13. The perplexity can be controlled by the threshold value. The introduction of a threshold relaxation method would take advantage of the probabilistic approach to prediction. A high threshold is assumed at the beginning of a search to narrow down the search space; if no solution is found, the threshold is lowered and the search space is widened to find a solution. This idea is similar to layering prediction [Young et al., 1989] and probabilistic marker speed control [Wu, 1989].
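A hedged sketch of the threshold-relaxation idea follows: search first with a high probability threshold, and relax it only if no solution is found. The scoring fields and the threshold values are placeholders, not figures from the system.

def search_with_relaxation(hypotheses, find_solution, thresholds=(0.5, 0.2, 0.05)):
    for threshold in thresholds:                       # progressively relax
        candidates = [h for h in hypotheses if h["prob"] >= threshold]
        solution = find_solution(candidates)
        if solution is not None:
            return solution, threshold
    return None, None

# Usage with toy hypotheses: the answer is only admitted once the threshold
# drops enough to include the lower-probability hypothesis.
hyps = [{"word": "conference", "prob": 0.6}, {"word": "kaigi", "prob": 0.1}]
result = search_with_relaxation(
    hyps, lambda cs: next((c for c in cs if c["word"] == "kaigi"), None))
print(result)   # ({'word': 'kaigi', 'prob': 0.1}, 0.05)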
(the global minima) through the path with the least workload. We believe such a law of physics is applicable at the abstract level, since cognitive processes are manifestations of a dynamic system, the brain, which consists of large numbers of neurons. In addition, several psycholinguistic studies [Crain and Steedman, 1985] [Ford, Bresnan and Kaplan, 1981] [Prather and Swinney, 1988] were taken into account in deriving the cost-based theory. The cost-based disambiguation scheme applies to both the parsing and the generation process. In a speech-input natural language system, ambiguities caused by noisy speech inputs are added to lexical and structural ambiguities. The cost-based approach enables us to handle these different levels of ambiguity in a uniform manner. This has not been attained in past models of ambiguity resolution. In the parsing process, costs are added when: (1) substitution, deletion and insertion of phonemes are performed to activate certain lexical items from noisy speech inputs (this part is handled using probabilistic measures as described earlier), (2) new CIs are created, (3) CCs without contextual priming are used for further processing, and (4) the memory network is modified to satisfy constraints. Costs are subtracted when: (1) a prediction has been made from discourse knowledge, and (2) CCs with contextual priming have been used.
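As a rough sketch of this cost-based scoring, the snippet below charges and credits a hypothesis for the factors listed above. The individual weights are illustrative, not values used in ΦDMDIALOG; only the -C log P conversion of phonological probabilities follows the text.

import math

C = 10.0   # constant for converting probabilities into costs

def hypothesis_cost(h):
    cost = -C * math.log(h["phoneme_prob"])       # noisy speech input
    cost += 2.0 * h["new_cis_created"]            # referential failures
    cost += 1.0 * h["unprimed_ccs_used"]          # CCs without priming
    cost += 5.0 * h["forced_constraints"]         # network modified to satisfy
    cost -= 3.0 * h["discourse_predictions"]      # predicted from discourse
    cost -= 1.0 * h["primed_ccs_used"]            # contextually primed CCs
    return cost

hypotheses = [
    {"phoneme_prob": 0.8, "new_cis_created": 1, "unprimed_ccs_used": 0,
     "forced_constraints": 0, "discourse_predictions": 1, "primed_ccs_used": 2},
    {"phoneme_prob": 0.4, "new_cis_created": 2, "unprimed_ccs_used": 3,
     "forced_constraints": 1, "discourse_predictions": 0, "primed_ccs_used": 0},
]
best = min(hypotheses, key=hypothesis_cost)        # least-cost reading wins
print(round(hypothesis_cost(best), 2))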
where cc_ij, constraint_ik, and bias_i denote a cost of the j-th element of CSC_i, a cost of assuming the k-th constraint, and the lexical preference of CSC_i. lex_j, instantiate_ci, and priming_j denote a cost of the lexical node LEX_j, a cost of creating new CIs by referential failure, and contextual priming, respectively. P denotes the probability measure of phonological-level activities, and is converted into cost using -C log P, where C is a constant.

P = e^(-cost/C)    (4.8)
cost = -C log P    (4.9)

where aac(i), cc_{i_{i-1}, p_{i-1}}, tc_{i_{i-2}, i_{i-1}}, and pe are an AAC measure of the P-Marker at the i-th element, a confusion cost between i_{i-1} and p_{i-1}, a transition cost between i_{i-2} and i_{i-1}, and phonetic energy, respectively. Phonetic energy reflects an influx of energy from an external acoustic energy source.
4.8.5 Constraints
Constraints are attached to each CSC. These constraints play important roles during disambiguation. Constraints define relations between instances when sentences or sentence fragments are accepted. When a constraint is satisfied, the parse is regarded as plausible. On the other hand, the parse is less plausible when the constraint is unsatisfied. Whereas traditional parsers simply reject a parse which does not satisfy a given constraint, our model builds or removes links between nodes, forcing them to satisfy the constraints. A parse with such forced constraints will record an increased cost and will be less preferred than parses without such added costs.
[Figure: Levels of abstraction of CSCs connected by IS-A links — a unification-based grammar at the most abstract level, generalized cases such as person-want-circumstance (<*person *ha *circum *want>), and specific cases such as I-want-to-attend-conference (<I want to attend the conference> with its translation).]
meaning representations. When a CSC is a specific case (level a), its meaning
representation is directly associated.
4.10 GENERATION
The generation algorithm of ΦDMDIALOG can be characterized by its highly integrated processing and its capability of simultaneous interpretation. The generation algorithm employs a parallel incremental model which is coupled with the parsing process in order to perform simultaneous interpretation. In addition, the case-based process and the constraint-based process are integrated in order to generate the most specific expressions using past cases and their generalizations, while maintaining the syntactic coverage of the generator.
[Figure: Movement of V-Markers and G-Markers over concept sequence elements <e0 e1 e2 ... en>, cases (a)-(c), and over hierarchically linked CSCs <e00 e01 ... e0l> and <e10 e11 ... e1m>.]
realization for the element is retrieved when the V-Marker passes through the element.
There are cases in which an element of the CSC is linked to other CSCs, as seen in figure 4.22. In such instances, when the last element with a V-Marker gets a G-Marker, that CSC is accepted and a G-Marker that contains relevant information is passed up through an IS-A link. Then an element of the higher-layer CSC gets the G-Marker and a V-Marker is moved to the next element. Since the element is linked to the other CSCs, constraints recorded on the V-Marker are passed down to lower CSCs. This movement of the V-Marker allows our algorithm to generate complex sentences.
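The following is a simplified sketch (our own data structures, not the system's) of verbalization by G-/V-Marker collision: G-Markers carry target-language realizations of activated concepts, and a V-Marker walks the target CSC, appending closed-class items directly and waiting at open-class elements until a G-Marker has arrived.

def generate(target_csc, closed_class, g_markers):
    # target_csc: ordered elements; g_markers: element -> surface string.
    surface, v_pos = [], 0
    while v_pos < len(target_csc):
        element = target_csc[v_pos]
        if element in closed_class:
            surface.append(closed_class[element])   # e.g. particles 'ha', 'ni'
            v_pos += 1
        elif element in g_markers:                   # G-V collision
            surface.append(g_markers[element])
            v_pos += 1
        else:
            break                                    # wait for more input
    return " ".join(surface), v_pos

csc = ["*person", "*ha", "*event", "*ni", "*attend", "*want"]
closed = {"*ha": "ha", "*ni": "ni"}
g = {"*person": "Jon", "*event": "kaigi", "*attend": "sanka", "*want": "shitai"}
print(generate(csc, closed, g)[0])    # 'Jon ha kaigi ni sanka shitai'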
[Figure: Incremental construction of syntactic trees during generation — stepwise trees for 'She ... at the hotel ... to unpack her luggage', and trees (S1)-(S3) contrasting 'John was surprised by Mary' with 'Mary surprised John'.]
the order in which conceptual fragments are given is based on the order in which conceptual fragments can be identified when parsing a corresponding Japanese sentence incrementally. It should be noted that our model does not necessarily assume a special grammar, as is the case in other incremental generation models such as [Kempen and Hoekamp, 1987].
Let us illustrate the generation process using the simple example shown in
Figure 4.25. In keeping with earlier examples, the input sentence is John wants
to attend the conference.
First, part (a) of the figure shows the concept activation stage: john is activated and an A-marker is created. The A-marker propagates and activates a CC node, *john. This is a part of the parsing process.
Next, part (c) of Figure 4.25 shows the V-marker shift. Since the element *ha is a closed-class item, it retrieves the lexical realization of *ha, which is 'ha', and the V-Marker is moved to *event. At this point, the V-marker contains the surface string 'Jon ha' along with other syntactic and semantic information. This is a partial realization of the surface string.
Part (d) shows the processing of want and attend. This is the concept activation stage and the lexical hypothesis activation stage. Due to the difference in word order between English and Japanese, the V-marker is not placed on *want and *attend in the CSC for Japanese. G-markers propagated from shitai ("want") and sanka ("attend") simply stay at each element of the CSC until the V-marker arrives.
The processing of kaigi, triggered by the input word conference, is shown in part (e). A G-marker and a V-marker collide at *event, and the V-marker is moved to *ni. Since *ni is a closed-class item, its surface realization is appended to the V-marker and the V-marker further moves to *attend. Now, *attend already has a G-marker, so a G-V-collision takes place there and the V-marker moves to *want. Again, *want already has a G-marker and a G-V-collision occurs. Since *want is the last element of the CSC, the V-marker contains the surface realization created by this local generation process: 'Jon ha kaigi ni
sanka shitai'. This is the realization stage. Although a possible translation has been created, it does not mean that this is the translation of the input sentence, because the whole process is based on lexical-level translation and no result of analysis from the parsing stage is involved. At this stage, it is generally the case that multiple generation hypotheses are activated.
[Figure: Generation hypotheses (a)-(d), showing CC nodes (CC1, CC2) linked to target-language lexical realizations (LEX1-TL, LEX2-TL), target-language concept sequences (CSC1-TL, CSC2-TL), and CIs.]
4.11 SIMULTANEOUS
INTERPRETATION: GENERATION
WHILE PARSING IS IN PROGRESS
The development of a model of simultaneous interpretation is a major goal of the project and makes our project unique among other research efforts in this field. We have investigated actual recordings of simultaneous interpretation sessions and simulated telephone conversation experiments, and made several hypotheses as to how such activities are performed, as a basis for designing the ΦDMDIALOG system.
The two transcripts of Japanese to English translation (tables 4.4 and 4.5) show that the interpreter divides an original sentence into several sentences
[Figure: Timing of simultaneous interpretation — Speaker-1's speech (an utterance in Japanese), its translation into English by the translation system, Speaker-2's utterance in English, its translation into Japanese, and the delay time between them.]
[Transcript excerpt (J = Japanese source, not reproducible here; e = word-by-word gloss; E = interpreter's English output):
e: Japan economy success actually remarkable things exist, people
E: The success of Japanese economic development
e: QOL improved CPI stabilized being desirable thing
E: has actually been remarkable. The living standard has risen and the CPI has
e: success all not because, for example, JNR
E: stabilized. These are to be desired, but not all are successes.
e: case look-at obvious
E: JNR is an obvious example.
e: can't-take work such image Japan labor system
E: It seems to those who say this that workers are tied to the company and work
e: scratched-the-surface Westerners have it-seems
E: This is the image of the working system held by Europeans and Americans towards the Japanese workers.]
These observations support our hypotheses stated at the beginning of this paper. We can therefore derive several requirements that the generator of a simultaneous interpretation system must satisfy:
[Figure 4.28 (rotated in the original): A part of the memory network for this example, including concept nodes such as *john, *want, *attend, *conference, and *because, and lexical entries 'John'/'Jon', 'wants'/'shitai', 'to', 'attend'/'sanka', 'the', 'conference'/'kaigi', 'ga', 'ni', and 'because'.]
Let us explain this process with an example. Table 4.6 indicates the temporal relationship between a series of words given to the system and incremental generation in the target language. Figure 4.28 shows part of the memory network involved in this translation (simplified for the sake of clarity). An incremental translation and generation of the input (John wants to attend the conference because he is interested in interpreting telephony) results in two connected Japanese sentences: Jon wa kaigi ni sanka shitai. Toiunoha kare ha tuuyaku denwa ni kyoumi ga arukara desu. Speech processing is conducted using the method already outlined. The following explanation of processing is from the perspective of lexical activations.
The G-marker is passed up through the memory network. This process takes place for each word in the sentence. P-markers are initially located on the first element of the CSCs for the source language. In this example, a P-marker is at *person in <*person *want *to *circumstance> and <*attend *def *conference>. V-markers are located on the first element of the CSCs for the target language.
Instantiation takes place when this sequence is accepted, i.e. when the P-marker is placed on the last element of the sequence and that element gets an A-marker. As a result, a CI is created under the CC (*want-attend-conference), which is the root node of the accepted CSC, and linked with relevant instances. The CI represents the meaning of the analyzed utterance.
There is a point during analysis at which the semantics of a part of the sentence is determined. For instance, John (@John001) can only be an agent of the action when wants is analyzed. At this time, 'Jon ga' ((John Role-Agent)) can be verbalized. A V-marker which is initially located on *person is now moved to *conference, which is the next element of the CSC. The V-marker simply passes through *ga because it is a closed-class item. The next verbalization will not be triggered until because comes in, because the role of John wants to attend the conference in the discourse is still ambiguous. At this point, *want-attend-conference and its superclass nodes, including *assert-goal, are activated. A-markers passed up through discourse-level knowledge carry relevant information needed to impose constraints on possible next utterances. A P-marker is now placed on *because in a CSC (namely <*assert-goal *because *assert-reason>), and a V-Marker is still placed on *assert-goal of the corresponding Japanese CSC (<*assert-goal *touiunoha *assert-reason>). Activation of *because will determine the role of *want-attend-conference. The word because acts as a clue which divides the assertion of the speaker's goal from the reasons for the goal, as represented in the CSC. Verbalization is now triggered, i.e., a Japanese translation of John wants to attend the conference is vocalized, and the V-marker is moved to the next element of the CSC. When the whole sentence is parsed, its entire meaning is made clear and the rest of the sentence is verbalized: Jon wa kaigi ni sanka shitai. Toiunoha kare ha tuuyaku denwa ni kyoumi ga arukara desu.
This example illustrates generation in the target language at the earliest possible point. Although this is a perfectly acceptable Japanese translation, it leaves much room for stylistic improvement. We are currently investigating algorithms for producing more stylistically natural translations without undermining the simultaneity of translation. The style of a translated sentence is greatly affected by temporal factors such as the speed of the input speech, and by prosodic factors.
The surface string and other information is, again, stored in a G-Marker and
[Figure: Part of the memory network spanning discourse-level CSCs (<*Get-Instruction ...>, <*Show-Instruction ...>, <*Listen-Registration ...>, <*Assert-Registration *Confirm ...>, <*Attend-Conference>), concept and lexical nodes such as *Conference and *First-of-all, and their phoneme sequences (/k a n f e r e n s/, /m a z u/).]
First, several efforts have been made to integrate speech and natural language processing. [Tomabechi et al., 1988] attempt to extend the marker-passing model to speech input. Their model uses environments without the probabilistic measures which would allow environmental rules to be applied. Since mis-recognitions are stochastic in nature, the lack of a probability measure seems to be a shortcoming of their model. [Saito and Tomita, 1988], [Kita et al., 1989] and [Chow and Roukos, 1989] are examples of approaches to integrating speech with unification-based parsing, but, unfortunately, discourse processing has not been incorporated. [Young et al., 1989] describes an attempt to integrate speech and natural language processing by implementing layered prediction. They reported that the use of layered prediction involving discourse knowledge reduced the perplexity of the task.
4.13 DISCUSSIONS
The use of cases reduces perplexity because predictions from cases are more specific than predictions from linguistic knowledge, and more constrained than a bi-gram grammar. In an experimental grammar which has a test-set perplexity of 3.66 with a bi-gram grammar, the perplexity for the same test set was reduced to 2.44 by using cases of utterances. Probabilities carried by markers from each level of processing are merged when they meet at certain nodes as predicted, and determine a final a priori probability distribution. Although this method successfully reduces the perplexity measure on the test set used in the experiment, there is some question as to its effectiveness when it is applied to larger domains.
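For readers unfamiliar with the measure, the small illustration below shows how a test-set perplexity figure is computed as 2 raised to the negative average log2 probability per word. The probability values are made up purely to show the calculation and are not data from the corpus.

import math

def perplexity(word_probs):
    n = len(word_probs)
    return 2 ** (-sum(math.log2(p) for p in word_probs) / n)

bigram_probs = [0.30, 0.25, 0.20, 0.35, 0.30]   # P(w_i | w_{i-1}) per word
case_probs = [0.50, 0.45, 0.40, 0.60, 0.35]     # sharper, case-based predictions
print(round(perplexity(bigram_probs), 2))       # higher perplexity
print(round(perplexity(case_probs), 2))         # lower perplexity with cases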
While there are considerable doubts regarding the effectiveness of using syntactic and semantic levels of knowledge alone for prediction, the use of pragmatic and discourse knowledge, such as discourse plans [Litman and Allen, 1987] and discourse structure [Grosz and Sidner, 1990], has gained attention with the hope that the introduction of these higher levels of constraints may help by further reducing perplexity, and thus attain higher recognition rates. As a matter of fact, [Young et al., 1989] reports that perplexity was reduced dramatically by introducing discourse knowledge using a layered prediction method, and that the semantic accuracy of the recognition result was 100%. The introduction of discourse knowledge would be useful for highly goal-oriented and relatively limited domains such as the DARPA resource management domain. We have investigated the effectiveness of using predictions from the discourse level in the ATR conference registration domain, because the ATR domain is a mixed-initiative and less goal-oriented domain.
The domain of this subdialog seems to be a credit card charge, but it has subdialogs asking whether the proceedings will be published and asking for a registration form to be sent out. Although predictions of speech acts may be attainable, since more than 80% of the interaction is based on the Request-Inform discourse plan, predictions of which subdomain the dialog may switch into, and when it may happen, are hopelessly difficult. In the above dialog, how can we predict that the questioner may ask whether the proceedings will be published in the middle of a dialog about a credit card? This means that although stronger preferences can be placed on some of the subdomains, the system must be able to expand its search space to nearly the entire domain so that sudden switching of subdomains in such complicated dialog structures can be handled. When this happens, the perplexity measure would drastically increase. In the case of our experimental set, it should fall somewhere between 2.4 and 19.7. However, obviously, expanding the search space to the entire domain significantly undermines the recognition rate. We still do not have an answer to this problem.
Third, prediction failures run the risk of undermining recognition rates by pruning out a correct hypothesis in favor of incorrect but predicted hypotheses. The chances of making wrong predictions depend upon the coverage of the corpus collected from real dialogs. If the corpus covers a sufficient portion of possible dialog transitions, the chances of making wrong predictions would be much lower. In ATR's conference registration domain, which involves various topics such as sightseeing, dinner, and hotel reservations, covering all possible subdomains and transitions is nearly impossible. Actually, one dialog of the corpus involves how to spend time with geisha girls in Kyoto! While covering all possible transitions is not feasible, the problem remains of how to avoid selecting wrong but predicted hypotheses when an unexpected utterance is made. We believe that higher-level knowledge can help only a little with this problem, and that it can even be harmful in some cases. The only solution we suggest is to improve speech recognition at the lower levels.
In summary, the language model cannot be 100% correct in providing a priori probabilities to the speech processing level. The use of discourse knowledge is effective only for a task in a relatively limited domain, and it would be less effective in the mixed-initiative and wide domains with which we intend to deal. Given the fact that highly accurate prediction of what may be said next is not feasible, we still need to improve the speech recognition system's accuracy without depending on higher-level knowledge sources such as discourse knowledge.
Studies of the planning unit in sentence production [Ford and Holmes, 1978] give additional support to the psychological plausibility of our model. They report that the deep clause, rather than the surface clause, is the unit of sentence planning. This is consistent with our model, which employs CSCs that account for deep propositional units and the realization of deep clauses as the basic units of sentence planning. They also report that people plan the next clause while speaking the current clause. This is exactly what our model does, and it is consistent with our observations from transcripts of simultaneous interpretation.
4.13.4 Learning
Learning is one of the areas which we have not addressed in the discussion so far. This is a subject of ongoing effort in the project. Although we do not have concrete algorithms to incorporate learning in our system, we describe some of our motivations and basic ideas toward learning in our model.
There are several motivations for developing a learning scheme in our model:
Learning by Parsing: Input utterances that are provided during the translation session are used for learning utterance cases. Since our system is a bi-directional system, both Japanese and English are provided to the system, and we can assume that the speakers of each language are native. The purpose of learning from such examples is to acquire cases of utterances which are actually used by native speakers. By preferring the use of cases acquired by this process over other hypotheses, we can avoid generating sentences that would never be used by native speakers, even though they are syntactically and semantically correct. The process of acquisition consists of the following operations:
1. Generation of Utterance-Case: New utterance-cases are generated from input utterances by using syntactic and semantic knowledge and generalized cases. The syntactic and semantic knowledge involved in this process can serve as an explanation for the case and can be used for generalization. This part of the process is an Explanation-Based Learning scheme [Minton, 1988].
4.14 CONCLUSION
Although interpreting telephony, or a real-time speech-to-speech translation system, has been considered one of the prime research goals in speech and natural language processing, this is perhaps the first work to propose a comprehensive
Several experiments have been conducted using the ATR corpus of telephone dialogs. We confirmed that the use of utterance cases and discourse knowledge contributes towards reducing perplexity. However, at the same time, we found that the effect of perplexity reduction by discourse knowledge in a larger domain is severely restricted due to the inherent unpredictability of subdomain transitions in mixed-initiative dialogs. In order for our model to generate translated sentences simultaneously, resolution of ambiguity at the earliest possible moment is desirable. Extra ambiguities caused by the addition of speech processing pose serious problems which need to be resolved. Since limitations on the usefulness of discourse knowledge in reducing perplexity have been found in mixed-initiative domains, we need to conduct research on a better speech processing module and on methods of reducing the search space without heavily depending upon discourse knowledge.
Of course, our model is by no means complete, and we have a long list of future research issues. However, we believe that the importance of developing an actual prototype lies in the fact that we have actually faced these problems and identified what needs to be done next.
One of the most significant problems was performance. On serial machines, such as the IBM RT-PC, it took 3 to 20 seconds to translate from speech to speech even when the vocabulary was less than 100 words. With a vocabulary of 450 words, it took over a minute. The solution to this is to use actual massively parallel machines.
5
DMSNAP: AN IMPLEMENTATION
ON THE SNAP SEMANTIC
NETWORK ARRAY PROCESSOR
5.1 INTRODUCTION
This chapter describes the DMSNAP system, a version of ΦDMDIALOG implemented on the Semantic Network Array Processor (SNAP). The goal of our work is to develop a scalable and high-performance natural language processing system which utilizes the high degree of parallelism provided by the SNAP machine.
In the next section, we briefly describe the SNAP architecture, then describe the design philosophy behind DMSNAP, followed by descriptions of its implementation and linguistic processing. Finally, performance results are presented.
[Figure 5.1: SNAP architecture — a host computer for program development using the SNAP instruction set, the SNAP-1 controller holding compiled SNAP code, and the SNAP-1 array processor (eight 9U-size boards) holding the knowledge base and executing SNAP instructions.]
markers, (2) address markers, and (3) values¹. Propagation of feature structures and heavy symbolic operations at each PE, as seen in the original version of ΦDMDIALOG, are not practical assumptions to make, at least on current massively parallel machines, due to processor power, memory capacity on each PE, and the communication bottleneck. Propagation of feature structures would impose serious hardware design problems since the size of the message is unbounded (unbounded message passing). Also, PEs capable of performing unification would be large in physical size, which causes assembly problems when thousands of processors are to be assembled into one machine. Even on machines which overcome these problems, applications with a restricted message passing model would run much faster than applications with an unbounded message passing model. Thus, in DMSNAP, the information propagated is restricted to bit markers, address markers, and values. These are readily supported by SNAP at the hardware level.
In the syntactic constraint network model, all syntactic constraints are rep-
¹We call a type of marker-passing which propagates feature structures (or graphs) Unbounded Message Passing. A type of marker-passing which passes fixed-length packets, as seen in DMSNAP, is Finite Message Passing. This classification is derived from [Blelloch, 1986]. Within the classification of [Blelloch, 1986], our model is close to the Activity Flow Network.
and the target language (ENG and JPN), role links (ROLE), constraint links (CONSTRAINT), contextual links (CONTEXT), and others.
In addition, concept instance nodes (CIs) and concept sequence instance structures (CSIs) are dynamically created during parsing. Each CI or CSI is connected to the associated CC or CSC by an INST link. CIs correspond to the discourse entities proposed in [Webber, 1983]. Three additional links are used to facilitate pragmatic inferences. They are CONTEXT links, CONSTRAINT
5.4.3 Markers
The processing of natural language on a marker-propagation architecture re-
quires the creation and movement of markers on the memory network. The
following types of markers are used:
NEXT link, where they collide with A-MARKERS at the element nodes.
There are some other markers used for control and timing, but they are not described here. These five markers are sufficient to understand the central part of the algorithm in this paper.
This parsing algorithm is similar to a shift-reduce parser except that our algorithm handles ambiguities, processes each hypothesis in parallel, and makes top-down predictions of the possible next input symbols. The generation algorithm implemented on SNAP is a version of the lexically guided bottom-up algorithm described in chapter 3.
Example I
s1 John wanted to attend IJCAI-91.
s2 He is at the conference.
s3 He said that the quality of the paper is superb.
Example II
s4 Dan planned to develop a parallel processing
computer.
s5 Eric built a SNAP simulator.
s6 Juntae found bugs in the simulator.
s7 Dan tried to persuade Eric to help Juntae modify
the simulator.
s8 Juntae solved a problem with the simulator.
s9 It was the bug that Juntae mentioned.
The examples contain various linguistic phenomena such as: lexical ambiguity, structural ambiguity, referencing (pronoun reference, definite noun reference, etc.), control, and unbounded dependencies. It should be noted that each example consists of a set of sentences (not a single sentence isolated from the context) in order to demonstrate the contextual processing capability of DMSNAP.
The sentences in these examples are not all the sentences which DMSNAP can handle. Currently, DMSNAP handles a substantial portion of the ATR conference registration domain (vocabulary of 450 words, 329 sentences) and sentences from other corpora.
[Figure 5.3: Part of the memory network, showing concept nodes such as WANT-CIRCUM and WANT-ATTEND-CONF, instance nodes such as JOHN#1, ATTEND-CONF#5, and IJCAI-91#2, and associated lexical items.]
Initially, the first CSE in every CSC on the memory network gets a P-MARKER. This P-MARKER is passed down ISA links. The CCs receiving a P-MARKER are C-PERSON and C-ATTEND. Also, the closed-class lexical items (CCIs) in the target language propagate G-MARKERs up ISA links.
Upon processing the first word 'John' in sentence s1, C-JOHN is activated so that C-JOHN gets an A-MARKER and a CI JOHN#1 is created under C-JOHN. At this point, the corresponding Japanese lexical item is searched for, and JON is found. A G-MARKER is created on JON. The A-MARKER and G-MARKER propagate up through ISA links (activating C-MALE-PERSON and C-PERSON in sequence) and, then, ROLE links. When an A-MARKER collides with a P-MARKER at a CSE, the associated case role is bound to the source of the A-MARKER and the prediction is updated by passing the P-MARKER to the next CSE. This P-MARKER is passed down ISA links. In this memory network, the ACTOR role of the concept sequence WANT-CIRCUM-E is bound to JOHN#1, pointed to by the A-MARKER. This is made possible by the SNAP architecture, which allows markers to carry addresses as well as bit-vectors and values, whereas many other marker-passing machines such as NETL [Fahlman, 1979] and IXM2 [Higuchi et al., 1991] only allow bit-vectors to be passed around. Also, G-MARKERs are placed on the ACTOR role CSE of WANT-CIRCUM-J. The G-MARKER points to the Japanese lexical item jon.
After processing 'wanted' and 'to', a P-MARKER is passed to CIRCUM and, then, to ATTEND-CONF. At this point, a source-language (English) expression for the concept ATTEND-CONF is searched for and ATTEND-CONF-E is found. The first CSE of ATTEND-CONF-E gets a P-MARKER. After processing 'attend' and 'IJCAI-91', ATTEND-CONF-E becomes fully recognized² so that a CSI having CIs is created under ATTEND-CONF-E. Then the associated concept ATTEND-
²Fully recognized means that the CSC can be reduced, in shift-reduce parser terms.
With this algorithm, the first set of sentences (s1, s2 and s3) is translated into Japanese:
5.5.2 Anaphora
Anaphoric references are resolved by searching for discourse entities represented by CIs under a specific type of concept node. Sentence s2 contains anaphora problems due to 'He' and 'the conference'. When processing 'He', DMSNAP searches for any CIs under the concept C-MALE-PERSON and its subclass concepts such as C-JOHN. In the current discourse, JOHN#1 is found under C-JOHN. JOHN#1 and IJCAI-91#2 were created when s1 was parsed. An A-MARKER pointing to JOHN#1 propagates up through ISA links. Likewise, IJCAI-91#2 is found for C-CONFERENCE. In this sentence, there is only one discourse entity (a CI in our model) as a candidate for each anaphoric reference, thus a simple instance search over the typed hierarchy network suffices. However, when there are multiple candidates, we use centering theory, introducing forward-looking centers (Cf), the backward-looking center (Cb), etc. [Brennan et al., 1986]. Also, incorporating the notion of focus is straightforward [Sidner, 1979].
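The following is a sketch (types and data are ours) of the instance search described above: to resolve 'He', collect the CIs found under C-MALE-PERSON and all of its subclass concepts; with a single candidate the search itself suffices, and centering would only be needed for multiple candidates.

class Concept:
    def __init__(self, name):
        self.name, self.subclasses, self.instances = name, [], []

    def find_instances(self):
        found = list(self.instances)
        for sub in self.subclasses:          # walk down ISA links
            found.extend(sub.find_instances())
        return found

male_person = Concept("C-MALE-PERSON")
john = Concept("C-JOHN")
male_person.subclasses.append(john)
john.instances.append("JOHN#1")              # CI created while parsing s1

candidates = male_person.find_instances()
referent = candidates[0] if len(candidates) == 1 else None  # else: centering
print(referent)                              # JOHN#1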
5.5.4 Control
Control is handled using the syntactic constraint network. Sentence s7 is an example of a sentence involving functional control [Bresnan, 1982]. In s7, both subject control and object control exist: the subject of 'persuade' should be the subject of 'tried' (subject control), and the subject of 'help' should be the object of 'persuade' (object control). In this case, the CSCs for infinitival complements have a CSE without a NEXT link. Such a CSE represents the missing subject. There are SUBJ, OBJ, and OBJ2 nodes (these are functional controllers) in the syntactic constraint network, each of which stores a pointer to the CI node for the possible controllee. Syntactic constraint links from each lexical item of the verb determine which functional controller is active. The activated functional controller propagates a pointer to the CI node to the unbound subject nodes of the CSCs for infinitival complements. Basically, one set of functional controller nodes handles deeply nested cases due to functional locality.
each CSC for infinitival complement. This way, DMSNAP performs control.
In this case, two hypotheses are activated at the end of the parse. Then, DMSNAP computes the cost of each hypothesis. The factors involved are contextual priming, lexical preference, the existence of discourse entities, and consistency with world knowledge. In this example, consistency with the world knowledge plays the central role. The world knowledge is a body of common-sense knowledge and knowledge obtained from understanding previous sentences. To resolve the ambiguity in this example, DMSNAP checks whether there is a problem in the simulator. Constraint checks are performed by bit-marker propagation through CONSTRAINT links and EQROLE links. Since there is a CI which packages instances of ERROR and SNAP-SIMULATOR, the constraint is satisfied and the second interpretation incurs no cost from the constraint check. However, there is no CI which packages instances of JUNTAE and SNAP-SIMULATOR. Therefore the first interpretation incurs a cost for the constraint violation (15 in our current implementation). Thus DMSNAP is able to resolve the structural ambiguity in favor of the second interpretation.
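A rough sketch of this constraint check follows: a reading is penalized (15, the figure given in the text) when no CI in the world knowledge packages the pair of instances that the reading requires. The CI representation and instance names below are simplified stand-ins.

VIOLATION_COST = 15

def constraint_cost(required_pair, world_cis):
    # required_pair: instances that must co-occur in some CI.
    satisfied = any(required_pair <= ci for ci in world_cis)
    return 0 if satisfied else VIOLATION_COST

# CIs built while understanding the earlier sentences (s4-s8), simplified.
world = [{"ERROR#1", "SNAP-SIMULATOR#1"}, {"JUNTAE#1", "BUG#1"}]

reading_1 = constraint_cost({"JUNTAE#1", "SNAP-SIMULATOR#1"}, world)  # 15
reading_2 = constraint_cost({"ERROR#1", "SNAP-SIMULATOR#1"}, world)   # 0
print(reading_1, reading_2)   # the second interpretation is preferred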
address of the CI for the displaced phrase (such as 'the bug' in example s9) is propagated to the TOPIC or FOCUS nodes in the syntactic constraint network. Further propagation of the address of the CI is controlled by the activation of nodes along the syntactic constraint network. The network virtually encodes a finite-state transition equivalent to {COMP-XCOMP}*GF-COMP [Kaplan and Zaenen, 1989], where GF-COMP denotes grammatical functions other than COMP. The address of the CI bound to TOPIC or FOCUS can propagate through the path based on the activation patterns of the syntactic constraint network, and the activation patterns are essentially controlled by marker flow from the memory network. When the CSC is accepted and there is a case-role not bound to any CI (OBJECT in the example), the CSE for the case-role is bound with the CI propagated from the syntactic constraint network.
5.6 PERFORMANCE
DMSNAP completes parsing on the order of milliseconds. While the actual SNAP hardware is now being assembled and is to be fully operational by May 1991, this section provides performance estimates based on a precise simulation of the SNAP machine. Simulations of the DMSNAP algorithm have been performed on a SUN 3/280 using the SNAP simulator which has been developed at USC [Lin and Moldovan, 1990]. The simulator is implemented in both SUN Common LISP and C, and simulates the SNAP machine at the processor level. The LISP version of the simulator also provides information about the number of SNAP clock cycles required to perform the simulation.
There are two versions of DMSNAP, one written in LISP and one in C. The high-level languages only take care of the process flow control, and the actual processing is done with SNAP instructions. The performance data summarized in Table 5.1 was obtained with the first version of DMSNAP written in LISP. Furthermore, with a clock speed of 10 MHz, these execution times are on the order of 1 millisecond. These and other simulation results verify the operation of the algorithm and indicate that typical runtime is on the order of milliseconds per sentence.
The size of the memory network for example II is far larger than that of example I, yet we see no notable increase in the processing time. This is due to the use of guided marker-passing, which constrains the propagation paths of markers. Our analysis of the algorithm shows that parsing time grows only sublinearly with the size of the network.
[Figure 5.4: Parsing performance of DMSNAP — parsing time (milliseconds) versus sentence length (words).]
5.7 CONCLUSION
In this chapter, we have demonstrated that high-performance natural language processing, with parsing speeds on the order of milliseconds, is achievable without making substantial compromises in linguistic analysis. On the contrary, our model is superior to other traditional natural language processing models in several aspects, particularly in contextual processing.
6.1 INTRODUCTION
In this chapter, we report experimental results on ASTRAL, a partial implementation of ΦDMDIALOG on the IXM2 associative memory processor. Using the IXM2, we have investigated the feasibility and the performance of the memory-based parsing part of the ΦDMDIALOG model.
The system consists of two parts: a syntactic recognition part on the IXM2
and a semantic interpretation part on the host computer.
For the syntactic recognition part on the IXM2, the memory consists of three
layers: a lexical entry layer, a syntactic category layer, and a syntactic pattern
layer.
Lexical Entry Layer: The lexical entry layer is a set of nodes each of which represents a specific lexical entry. Most of the information is encoded in lexical entries in accordance with modern linguistic theories such as HPSG [Pollard and Sag, 1987], and the information is represented as a feature structure. Obviously, it is a straightforward task to represent huge numbers of lexical entries on the IXM2.
N V-BSE DET N
N V- BSE N
N BE-V V-PAS PP-by N
6.3.2 Algorithm
The algorithm is simple. Two markers, activation markers (A-Markers) and prediction markers (P-Markers), are used to control the parsing process. A-Markers are propagated through the memory network from the lexical items which are activated by the input. P-Markers are used to mark the next possible elements to be activated. A general algorithm follows:
[Figure 6.1: Syntactic recognition time (milliseconds) versus sentence length (words).]
6.4 PERFORMANCE
We carried out several experiments to measure the system's performance. Figure 6.1 shows the syntactic recognition time against sentences of various lengths. Syntactic recognition on the order of milliseconds is attained. This experiment uses a memory containing 1,800 syntactic patterns. On average, 30 syntactic patterns are loaded onto each associative processor. Processing speed improves as parsing progresses. This is because the computational cost of the sequential part of the process is reduced as the number of activated hypotheses decreases. There is one sequential process which checks active hypotheses on each of the 64 transputers. During this process, the parallelism of the total system is 64.
It should be noted that this speed has been attained by extensive use of the associative memory in the IXM2 architecture; simple use of 64 parallel processors would not attain this speed. In order to illustrate this point, we measured the performance of a single associative processor of the IXM2 (one of the 64 associative processors) and of the SUN-4/330, the CM-2 Connection Machine, and the Cray X-MP.
The program on each machine uses optimized C code for this task. The number of syntactic patterns is 30 for both a single associative processor of the IXM2 and the other machines. The experimental results are shown in Table 6.3. A single processor of the IXM2 is almost 16 times faster than the SUN-4/330 and the Cray X-MP even on such a small task¹. The CM-2 Connection Machine is very slow due to a communication bottleneck between processors. While both the IXM2 and the SUN-4/330 use CPUs of comparable speed, the superiority of the IXM2 can be attributed to its intensive use of associative memory, which attains a massively parallel search.
This trend becomes even clearer when we look at the scaling properties of both systems. Figure 6.4 shows the performance for a sentence of length 8, for syntactic pattern sets of size 10 and 30. While a single processor of the IXM2 maintains less-than-linear degradation, the SUN-4/330 and the Cray X-MP degrade more than linearly. It should be noted that 30 syntactic patterns on the other machines literally means 30 patterns, but on a single processor of the IXM2 it means 1,800 patterns when all 64 processors are used.
It is expected that a larger task set would demonstrate a dramatic difference in total computation time. The IXM2 can load more than 20,000 syntactic patterns, which is sufficient to cover the large-vocabulary tasks currently available for speech recognition systems. With up-to-date associative memory chips, the
¹The Cray X-MP is very slow in this experiment mainly due to its subroutine call overhead. We have tested this benchmark on a Cray X-MP in Japan and at the Pittsburgh Supercomputing Center, and obtained the same result. Thus this is not a hardware problem or other irregularity.
[Figure 6.2: Performance improvement by learning new cases — expected parsing time (seconds) versus number of training sentences.]
number of syntactic patterns which can be loaded on the IXM2 exceeds 100,000.
Also, extending the IXM2 architecture to load over one million syntactic pat-
terns is both economically and technically feasible.
The memory-based parser can improve its performance over time. While the previous experiments stored the necessary syntactic patterns beforehand, a more comprehensive system starts with no pre-stored cases and tries to improve its performance by acquiring syntactic patterns. Figure 6.2 shows the performance improvement of our system assuming that each new case of syntactic patterns is incrementally stored at run time². In other words, first the input is given to the memory-based parser, and if it fails to parse, i.e. no case in the memory corresponds to the input sentence, then the conventional parser will parse the input. Parsing by the conventional parser takes about 2 seconds on average. New syntactic patterns can be generated from the parse tree of the conventional parser and loaded into the memory-based parser to improve coverage. This way, the overall performance of the system can be improved over time. Memory-based parsing can thus be combined with a conventional parser to improve the overall performance of the system by incrementally learning syntactic patterns in the task domain.
²Notice that the parsing time is an expected time. When the memory-based parser covers the input, it should complete parsing in a few milliseconds; otherwise the conventional parser will parse it, taking about 2 seconds. The expected parsing time will improve as the memory-based parser covers more inputs.
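A back-of-the-envelope sketch of the expected parsing time plotted in Figure 6.2 follows: a few milliseconds when the memory-based parser covers the input, about 2 seconds on fallback to the conventional parser. The coverage growth curve below is invented purely for illustration.

T_MEMORY = 0.003      # seconds, memory-based parser
T_CONVENTIONAL = 2.0  # seconds, conventional parser

def expected_time(coverage):
    # coverage: fraction of inputs handled by the memory-based parser.
    return coverage * T_MEMORY + (1.0 - coverage) * T_CONVENTIONAL

for trained in (0, 500, 1000, 2000):
    coverage = min(1.0, trained / 2500.0)      # hypothetical coverage curve
    print(trained, round(expected_time(coverage), 3))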
[Figure 6.3: Number of syntactic patterns versus number of training sentences.]
The hierarchical memory network model avoids this problem by layering the levels of abstraction incorporated in the memory. Figure 6.3 shows an example of the memory-saving effect of the hierarchical memory network. The model assumes three levels of abstraction: surface sequences, generalized cases, and syntactic rules. The surface sequences are simple sequences of words. This level of abstraction is useful for processing such utterances as "How's it going?" or "What can I do for you?" These are a kind of canned phrase which frequently appears in conversations. They also exemplify an extended notion of the phrasal lexicon. By pre-encoding such phrases in their surface form, computational costs can be saved. However, we cannot store all sentences in this way. This leads to the next level of abstraction, which is the generalized cases. Generalized cases are a kind of semantic grammar whose phrase structure rules use non-terminal symbols to represent concepts with specific syntactic and semantic features.
link(first,ax31,about).
link(last,t34,about).
link(instance_of,ax31,ax).
link(destination,ax31,b32).
link(instance_of,b32,b).
link(destination,b32,aw33).
link(instance_of,aw33,aw).
link(destination,aw33,t34).
link(instance_of,t34,t).
Figure 6.5 Network for 'about' and its phoneme sequence
Lexical Entry Layer: The lexical entry layer is a set of nodes each of which
represents a specific lexical entry.
Abstraction Hierarchy: The class/subclass relation is represented using IS-A links. The highest (most general) concept is *all, which entails all the possible concepts in the network. Subclasses are linked under the *all node, and each subclass node has its own subclasses. As a basis of the ontological hierarchy, we use the hierarchy developed for the MU project [Tsujii, 1985], with domain-specific knowledge added.
Concept Sequence: Concept sequences which represent patterns of input
sentences are represented in the form of a network. Concept sequences
capture linguistic knowledge (syntax) with selectional restrictions.
Figure 6.5 shows a part of the network. The figure shows a node for the word 'about', and how its phoneme sequence is represented. The left side of the figure is a set of IXM instructions which encode the network on the right side on the IXM2 processor. Refer to [Higuchi et al., 1991] for details of the mapping of semantic networks onto IXM2. We have encoded a network including phonemes, phoneme sequences, lexical entries, abstraction hierarchies, and concept sequences which covers the entire task of ATR's conference registration domain. The vocabulary size is 405 words in one language, and over 300 sentences in the corpus are covered. The average fanout of the network is 40.6. The weight value has not been set in this experiment in order to compare the performance with other parsers which do not handle stochastic inputs. In real operation, however, a fully tuned weight is used. The implementation in this version uses hierarchical memory networks, thereby attaining wider coverage with smaller memory requirements⁴.
The table of templates for the target language is stored in the host computer (SUN-3/250). The binding table for each concept and concept sequence, and specific substrings, are also created. When parsing is complete, the generation process is invoked on the host. It is also possible to compute it distributively on the 64 T800 transputers. The generation process is computationally cheap since it only retrieves and concatenates substrings (lexical realizations in the target language) bound to conceptual nodes, following the patterns of the concept sequence in the target language.
5. If the A-P-Collision takes place at the last element of the phoneme se-
quence, an A-Marker is passed up to the Lexical Entry. (Reduce) Else,
Goto 2.
6. Pass the A-Marker from the lexical entry to the Concept Node.
7. Pass the A-Marker from the Concept Node to the elements in the Concept
Sequence.
8. If the A-Marker and a P-Marker co-exist at an element in the Concept
Sequence, then the P-Marker is moved to the next element of the Concept
Sequence (Shift).
9. If an A-P-Collision takes place at the last element of the Concept Sequence,
the Concept Sequence is temporarily accepted (Reduce), and an A-Marker
is passed up to abstract nodes. Else, Goto 2.
10. If the Top-level Concept Sequence is accepted, invoke the generation pro-
cess.
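The snippet below is a compressed sketch (our own, much simplified) of the marker flow in steps 5-10 above for a single word: an A-Marker climbs from a recognized phoneme sequence to its lexical entry and concept node, then collides with the P-Marker on a concept sequence, shifting the prediction to the next element. The phoneme labels follow Figure 6.5; the concept sequence is invented for illustration.

phoneme_net = {"about": ["ax", "b", "aw", "t"]}           # word -> phonemes
lexicon = {"about": "*about"}                             # word -> concept node
concept_seq = ["*talk", "*about", "*topic"]               # one CSC (hypothetical)
p_marker = 1                                              # prediction position

def recognize_word(phonemes):
    # Step 5: an A-P collision at the last phoneme passes an A-Marker up.
    for word, seq in phoneme_net.items():
        if seq == phonemes:
            return word
    return None

word = recognize_word(["ax", "b", "aw", "t"])
concept = lexicon[word]                                    # steps 6-7
if concept_seq[p_marker] == concept:                       # step 8: A-P collision
    p_marker += 1                                          # shift the prediction
    if p_marker == len(concept_seq):                       # step 9: reduce
        print("CSC accepted; invoke generation")           # step 10
print(concept_seq[p_marker])                               # '*topic' predicted next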
6.8 PERFORMANCE
We carried out several experiments to measure the system's performance. Figure 6.6 shows the parsing time against sentences of various lengths. Parsing on the order of milliseconds is attained. PLR is a parallel version of Tomita's LR parser. The performance of PLR is shown only to provide a general idea of the speed of traditional parsing models. Since the machines and grammars used for PLR and for our experiments are different, we cannot make a direct comparison. However, its order of time required and its exponentially increasing parsing time clearly demonstrate the problems inherent in the traditional approach. The memory-based approach on the IXM2 (MBT on IXM2) shows parsing performance that is an order of magnitude faster. Also, its parsing time increases almost linearly with the length of the input sentences, as opposed to the exponential increase seen in PLR. Notice that this graph is drawn with a log scale for the Y-axis. The CM-2 is slower, but exhibits characteristics similar to those of the IXM2. The speed difference is due to the PEs' capabilities and the machine architecture, and the fact that the CM-2 shows a similar curvature indicates the benefits of the MBT. The SUN-4 shows a similar curve, too. However, because the SUN-4 is a serial machine, its performance degrades drastically as the size of the KB grows, as discussed below.
[Figure 6.6: Parsing time versus length of input (log scale), comparing MBT on IXM2 with the CM-2, SUN-4, and PLR.]
[Figure 6.7: Parsing time versus KB size (number of nodes).]
This trend is the opposite of that of the traditional parser, in which the parsing time grows more than linearly with the size of the grammar KB (generally with the square of the number of grammar rules, O(G²)) due to a combinatorial explosion of serial rule applications. The CM-2 shows a curve similar to the IXM2's, but is much slower due to the slow processing capability of its 1-bit PEs. The SUN-4 is at a disadvantage with a scaled-up KB due to its serial architecture. In particular, the MBT algorithm involves extensive set operations to find nodes with an A-P-Collision, which are well suited to SIMD machines. Serial machines need to search the entire KB, which leads to the undesirable performance shown in the figures in this section.
Figure 6.8 shows the number of active hypotheses per associative processor. Although processing starts with a high load, where a significant percentage of hypotheses are activated, the number of hypotheses decreases drastically as processing proceeds. Since the IXM2 uses associative memory chips to store syntactic patterns, no processor will be idle unless all the hypotheses assigned to that processor are eliminated. However, in other massively parallel machines that assign processors to all the hypotheses, most of the processors will be idle because most of the hypotheses will be eliminated as processing progresses. Since associative memory chips are far cheaper than processor chips for storing and carrying out the operations necessary in the implementation described in this paper, the IXM2's architecture would be more cost-effective than other architectures for this task.
[Figure 6.8: Number of active hypotheses per processor versus word position in the input.]
associative memory would suffice for its purpose, since sixty-four 32-bit CPUs can distributively perform higher levels of symbolic operations. Even in most memory-based reasoning tasks, similarity matching uses relatively simple similarity measures based on numeric computations which can be computed on associative memory. Thus, the IXM2 architecture which we advocate in this paper is a cost-effective architecture not only for the memory-based parser, but also for more general memory-based reasoning systems.
6.10 CONCLUSION
We have shown, using data obtained from our experiments, that massively parallel memory-based parsing is a promising approach for implementing a high-performance real-time parsing system for certain task domains.
One of the major contributions of this paper, however, is that we have shown that the time complexity of natural language processing can be traded for space complexity, thereby drastically improving parsing performance when executed on massively parallel machines. This assumption is the basic thrust of the memory-based and case-based reasoning paradigm. This point has been clearly illustrated by comparing a version of Tomita's LR parsing algorithm and the memory-based parsing approach. Traditional parsing strategies exhibited an exponential degradation due to extensive rule application, even in a parallel algorithm. The memory-based approach avoids this problem by using a hierarchical network which compiles grammars and knowledge in a memory-intensive way. While many AI researchers have speculatively assumed the speed-up offered by massively parallel machines, this is the first report to actually support the benefit of the memory-based approach to natural language processing on massively parallel machines.
7.1 INTRODUCTION
This chapter presents the Memoir system, which is yet another model of
memory-based machine translation and of the integration of memory-based and rule-based
processing. However, it is based on ideas different from those of the
original ΦDMDIALOG. In the ΦDMDIALOG system, a set of rules was used
to create the meaning representation of input sentences. The rules played a role
similar to that of sentence examples and templates, but at a different level of abstraction.
The Memoir system takes a different view: rules are used solely to monitor
whether correct word choices and pronoun references were made.
The major emphases of the proposed model are that it: (1) integrates the memory-based
process and the rule-based process in a novel fashion, (2) employs a user-customizable
example representation, and (3) introduces a robust and dynamic matching
of examples against the input sentence.
First, the grammatical inference module uses local and relation-driven control
to infer the head, subject, object, and focus of the sentence. The result of this
inference is accessible from the adaptive translation module and the monitor
module. The adaptive translation module uses information on what the Center
of the previous sentence is in order to handle zero pronouns in Japanese-English
translation. Identification of the center is based on the centering constraint
[Kameyama, 1988]. The monitor module uses information on what the subject
or object of a certain verb is in order to decide which expert rules to apply.
Second, the monitor module uses a set of rules encoding translators' knowledge
to ensure correct word choice and stylistics. The monitor module initially checks
whether any word in the sentence matches any part of the condition part
of the rules. If there are rules which involve words in the sentence, the monitor
module dispatches a request to the grammatical inference module to check whether the
condition part of the rule (such as: the object of 'increase' is a short-term interest rate)
can be met. The grammatical inference module uses its local rules and relation-driven
control to check whether or not this condition can be satisfied, and returns the result.
Depending on the query result, the monitor module may invoke its rules to rewrite the
translated sentences.
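A hedged sketch of this dispatch cycle is given below: rules are triggered by words in the output, their conditions are verified by querying the grammatical inference module, and matching rules rewrite the translated sentence. The rule fields and the query interface are illustrative assumptions, not Memoir's actual implementation.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ExpertRule:
    trigger: str                   # word that makes the rule a candidate
    relation: str                  # grammatical relation to verify, e.g. "object"
    filler: str                    # required filler of that relation
    rewrite: Callable[[str], str]  # correction applied when the condition holds

def monitor(sentence: str, rules: list[ExpertRule], infer_relation) -> str:
    for rule in rules:
        if rule.trigger not in sentence:
            continue                                # rule is not even a candidate
        # Ask the grammatical inference module whether the condition is met,
        # e.g. "is the object of 'increase' a short-term interest rate?"
        if infer_relation(sentence, rule.trigger, rule.relation) == rule.filler:
            sentence = rule.rewrite(sentence)
    return sentence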
Although the model uses a rule-based process, this process carries out very
different tasks from those of traditional rule-based MT. In fact,
there is no place in the proposed model where a complete parse tree is
built or a full parse is carried out. This approach coincides with
recent research in MUC-3 [Sundheim, 1991], MUC-4, and TIPSTER, which
heavily relies on partial parsing and dynamic control strategies such as
relation-driven control [Jacobs, 1992].
For example, translation examples such as those in Table 7.1 are stored in the memory-base
after morphological analysis (Table 7.2).
[Figure 7.1: Overall Architecture (Morphological Analysis, Grammatical Inference, Morphological Generation)]
The segment map defines which SSE segment corresponds to which TSE segment.
In the case of TX-3, 'I' in the SSE has segment position 1. Since the segment
map defines 1 = 1, 'I' corresponds to '私' in the TSE, which has segment position
1. The term 'give up' is an idiomatic expression which should be treated as
one word. The segment map defines that 'give up' corresponds to 'あきらめる'
by the notation 2-3 = 3, which denotes that SSE segments 2 to 3
correspond to segment 3 in the TSE.
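The sketch below shows one way such an example and its segment map could be represented, following the notation in the text ('1 = 1', '2-3 = 3'): keys are spans over SSE segments and values are spans over TSE segments. The exact sentence, target segments, and field names are illustrative assumptions.

example_tx3 = {
    "sse": ["I", "give", "up"],         # source-side segments (illustrative)
    "tse": ["私", "は", "あきらめる"],      # target-side segments (illustrative)
    "segment_map": {
        (1, 1): (1, 1),                 # SSE segment 1 ("I") -> TSE segment 1
        (2, 3): (3, 3),                 # SSE segments 2-3 ("give up") -> TSE segment 3
    },
}

def target_span(example, sse_span):
    """Return the TSE segments aligned with a span of SSE segments."""
    lo, hi = example["segment_map"][sse_span]
    return example["tse"][lo - 1:hi]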
The model uses segment mapping at the surface level, rather than at the parse
tree or canonical representation level, as seen in current MBMT models. There are
three reasons why the author chose to define the mapping at this level. First, most
non-linguists cannot understand what is represented in a parse tree or a canonical
representation. Although professional translators are experts in translation, they are not
necessarily experts in linguistics. Marking the correspondence at the surface
level, however, is possible for many potential users and was reported as being
preferred by users in interviews with the author. Second, deriving the correct
parse tree (for both SSE and TSE) requires extensive work. Automating
this process is not possible at the current level of technology. Even if a large
tree-bank were available for translation pairs, user customization would not be
possible, as users are unlikely to understand and work on this representation.
Third, using a parse tree or canonical representation as the starting point of the
memory-based process leaves parsing problems unresolved. As is well understood
among commercial MT system developers, ambiguities and difficulties
in tracking the system's behavior during parsing are the major bottlenecks in
quality improvement. Any MBMT approach that left this process to the
traditional model would not attain a significant improvement over
traditional MT systems.
Table 7.3
         Distance   Sentence
Input    0.0        You gave me a direction
TX-4     0.9        I gave her a lesson
TX-2     1.4        I gave her a gift
d(S_i, S_j) = \sum_{p=1}^{n} d(w_{pi}, w_{pj})        (7.1)
where n is the length of the sentence and d(w_{pi}, w_{pj}) is the distance between the
words w_{pi} and w_{pj}. Similarity of the environment in which the sentence fragment is
embedded should also be a factor. Issues concerning what makes the best environment-similarity
measure are presently under experiment, so they are omitted from the description in this paper.
Assume that the input is 'You gave me a direction.' TX-4 and TX-2 give the best
matches (Table 7.3).
This is the simplest case, where there is no alignment dislocation and a single
example covers the input sentence.
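A minimal sketch of Equation (7.1) and of best-match retrieval in this simple case follows. The word-distance function here is only a stand-in for the abstraction-based measure of Figure 7.2, so the scores it produces differ from the distances in Table 7.3.

def word_distance(w1: str, w2: str) -> float:
    return 0.0 if w1 == w2 else 1.0            # illustrative placeholder

def sentence_distance(s1: list, s2: list) -> float:
    # Equation (7.1): sum of position-wise word distances.
    return sum(word_distance(a, b) for a, b in zip(s1, s2))

examples = {
    "TX-4": "I gave her a lesson".split(),
    "TX-2": "I gave her a gift".split(),
}
inp = "You gave me a direction".split()
ranked = sorted(examples, key=lambda k: sentence_distance(inp, examples[k]))
# With the real abstraction-based word distance, TX-4 (0.9) ranks above TX-2 (1.4).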
[Figure 7.3: DP-Matching of Input and Examples (word position in examples vs. word position in input sentence; paths A, B, and C)]
[Figure 7.4: Multiple Match between Examples]
Table 7.4 Difference Table
TX-4      Input
I         You
her       me
lesson    direction
Next, the difference between the input sentence and TX-4 is checked. Table 7.4 shows
the differences between the input sentence and TX-4.
Table 7.5 Adaptation Operations
English               Japanese
I → You               私 → あなた
her → me              彼女 → 私
lesson → direction    ~IDII → 1T~h
[Table 7.6: segment-by-segment adaptation of the target-language expression of the BME (TX-4) into the final translation]
In this case, the input sentence and TX-4 are sufficiently similar that a lexical-level
adaptation alone is sufficient to produce a translation.
Then, adaptation operations are derived from the difference table. For
example, the word 'I' in TX-4 needs to be replaced with 'You' to cover the input
sentence. This operation is associated with a change of 私 ('I') into あなた ('you')
in the target-language part of TX-4. Table 7.5 shows the adaptations involved in this
translation. Word choices may be based on a statistical likelihood computed
from segment maps. In effect, the method is similar to MBT-I [Sato, 1991b].
Once the adaptation operations are defined, the final stage is to reconstruct the
translation from the target-language part of the cases. In this example, TX-4
is the best match example (BME) and the only case involved. Thus, the TSE of
TX-4 is adapted using the derived adaptation operations. This process is
shown in Table 7.6.
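The sketch below illustrates this lexical adaptation step under simplified assumptions: position-wise differences between the input and the BME yield source-side substitutions, and a small bilingual lexicon stands in for the word choices that Memoir derives from segment maps and statistics. All Japanese glosses and the TSE are illustrative, not the book's tables.

def derive_adaptations(inp, bme):
    # Difference table: pairs (bme_word, input_word) that disagree position-wise.
    return [(b, i) for b, i in zip(bme, inp) if b != i]

def apply_adaptations(tse, adaptations, lexicon):
    # Replace the target-side counterpart of each substituted source word.
    out = list(tse)
    for bme_word, input_word in adaptations:
        src_ja, tgt_ja = lexicon[bme_word], lexicon[input_word]
        out = [tgt_ja if w == src_ja else w for w in out]
    return out

lexicon = {"I": "私", "You": "あなた", "her": "彼女", "me": "私",
           "lesson": "レッスン", "direction": "行き方"}          # illustrative glosses
bme_sse = "I gave her a lesson".split()
bme_tse = ["私", "は", "彼女", "に", "レッスン", "を", "与えた"]   # illustrative TSE of TX-4
inp = "You gave me a direction".split()
translation = apply_adaptations(bme_tse, derive_adaptations(inp, bme_sse), lexicon)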
Table 7.7
         1  2   3    4     5   6    7     8    9  10
Input    I  am  not  sure  if  she  gave  him  a  gift
TX-6     I  am  not  sure  if  it   was   snowing
TX-2     I  gave her a gift
Multiple combinations of examples to cover the input are possible. The proposed model
chooses the minimum-cost covering, which takes into account the size, similarity,
and DP-matching cost of each fragment. Heuristics for scoring combinations
of examples and an algorithm for minimum-cost covering of tree-structured
examples have been proposed [Sato and Nagao, 1990, Maruyama and Watanabe,
1992]. The proposed model employs the 1-D version of these algorithms. In the
given example, the combination of examples B D F is likely to be preferred over
combinations such as B C E F or A C E F. This is because B D F covers the input
with a minimum number of examples and no example involves dislocation
in the DP-matching.
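A hedged sketch of a 1-D minimum-cost covering in this spirit is given below: among fragment combinations that cover the input span, prefer the one with the fewest fragments and the lowest total DP-matching cost. The fragment data and scoring are illustrative; the actual heuristics are those of the works cited above.

from itertools import combinations

def covers(fragments, length):
    covered = set()
    for frag in fragments:
        covered.update(range(frag["start"], frag["end"] + 1))
    return covered == set(range(1, length + 1))

def best_cover(fragments, length):
    candidates = []
    for r in range(1, len(fragments) + 1):
        for combo in combinations(fragments, r):
            if covers(combo, length):
                cost = (len(combo), sum(f["dp_cost"] for f in combo))
                candidates.append((cost, combo))
    return min(candidates, key=lambda c: c[0])[1] if candidates else None

fragments = [
    {"name": "A", "start": 1, "end": 3, "dp_cost": 0.4},
    {"name": "B", "start": 1, "end": 5, "dp_cost": 0.2},
    {"name": "C", "start": 4, "end": 6, "dp_cost": 0.5},
    {"name": "D", "start": 6, "end": 8, "dp_cost": 0.1},
    {"name": "E", "start": 6, "end": 9, "dp_cost": 0.6},
    {"name": "F", "start": 9, "end": 10, "dp_cost": 0.1},
]
print([f["name"] for f in best_cover(fragments, 10)])   # -> ['B', 'D', 'F']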
Assume we have the input: 'I am not sure if she gave him a gift'.
None of the cases in the memory-base covers the input sentence; even the best
matching cases cover only part of it. Table 7.7 shows the input
sentence and the retrieved cases. TX-6 matches only sentence segments 1
to 5, 'I am not sure if', and TX-2 matches segments 6 to 10, 'she gave him a
gift'.
At this point, which segments can be used from each case can be determined using
the segment map. In this case, TX-6 has segments for 1-5 ('I am not sure if')
and 6-8 ('it was snowing'). Therefore, the obvious operation is to use TX-6 for
'I am not sure if' and TX-2 for 'she gave him a gift'. Since the corresponding part
of the input sentence is 'she gave him a gift', the TX-2 example 'I gave her a gift'
has to be adapted. The segment map is used again to identify which words should be
replaced. As a result of these operations, adaptation operations are defined and
cross-applied to produce the Japanese translation.

Since segment 1 of TX-6 has high similarity and segment 2 has low similarity,
only segment 1 will be used, and TX-2 will be used to translate 'she gave him a gift'.
Table 7.8 shows how adaptive translation is carried out in this example.
[Table 7.8: adaptive translation combining TX-6 and TX-2, showing Adaptation-1, the intermediate result, Adaptation-2, and the final translation]
7.7 MONITOR
The monitor process checks whether or not the translation result is stylistically
and grammatically appropriate. This process has not been incorporated in
previous models of MBMT. The main rationale for the monitor process is the
existence of domain rules for translation. These rules are widely used by
professional translators and are explicitly documented. For example, in the economic
domain, the English word 'prediction' must be translated as 予想 when
no specific number is given, but the same word must be translated as 予測 when
specific numbers are given. Also, the word 'increase' must be translated as
引き上げ (pull up) in the context of 'increase in the official discount
rates', but it should be 上昇 (elevation) for 'increase in the short-term
interest rates'. Thus, 短期金利の引き上げ (pull up in short-term interest
rates) is considered a mistranslation. An extreme example is that the
use of the word 投資銀行 (investment bank) for Salomon Brothers and
Goldman Sachs is officially unacceptable to the Japanese government, as these
banks are approved as securities companies, not as investment banks. In this
case, the English words 'investment bank' must be phonetically transcribed as
インベストメント・バンク.
Using the examples in Table 7.2, conventional MBMT systems would translate
'CD-rate was increased' into 'CDレートが引き上げられた'. Although
TX-10 implicitly encodes the knowledge that 上昇 is associated with CDレート, it
did not contribute to this translation because it did not match the input
sentence, due to the substantial difference in their syntactic structures. Since
the CD-rate is a kind of short-term interest rate, this translation is incorrect. In
the model, the output from the adaptive translation module is checked
against the expert rules. The monitor module searches the input sentence for words
that appear in rules. If there is any rule which involves words in the sentence,
requests are dispatched to the grammatical inference module to verify whether
or not the condition to invoke the rule can be met. In this case, the condition
is met (as explained in Section 4), so the word 引き上げられた will be
replaced with 上昇した. Thus, the final translation will be CDレートが上昇した,
which is a correct translation. Currently, there are 82 such rules for the
financial domain.
Independently, the effects of the monitor process were examined using 100
sentences from the Wall Street Journal (WSJ) and the Financial Times (FT).
95 words considered to be problematic in translation were
selected as a benchmark. While semantic accuracy was 100%, there were stylistic
problems, as discussed in Section 7. Pure MBMT attained stylistically correct
translations for 48 of the 95 words. When combined with the expert
rules (using the monitor module), 84 of the 95 words were translated with
correct stylistics. The rules in the monitor module were invoked in 76 cases.
Some 40 cases were consistent with the pure MBMT translation, while 36 cases
resulted in a modification of the translation. 9 words were translated with incorrect
style due to a lack of examples and rules. This figure, however, should not be
viewed as indicating MBMT accuracy, because it focuses on problematic
events. It should be viewed as evidence supporting the effectiveness of the
monitor process, and should not be used to draw any quantitative conclusions.
With a larger memory-base and more detailed rules, the accuracy is expected to
improve drastically.
7.9 CONCLUSION
In this chapter, the author proposed the Memoir system as an alternative view
of memory-based machine translation. In effect, the model offers a basis for
achieving user-customizable example definition, handling of contextual phenomena
such as zero pronouns, broad-coverage and robust translation, and use of expert
knowledge.
Use of expert knowledge in rule form is the key factor that mitigates the sparse-corpus
problem inherent in the early phase of deployment. Although a more formal model
of how local grammar and expert knowledge should be encoded needs to be developed,
preliminary evaluation suggests that the monitor process actually improves the quality
of translation.
The author hopes the model presented here will serve as a basis for further
discussion on how MBMT should be developed as a self-contained system,
which is expected to be a mainstay approach in the next generation of machine
translation.
8
CONCLUSION
Second, we have demonstrated that high-performance natural language processing
is attainable by using appropriate algorithms and specialized hardware.
Experimental implementations on the IXM2 associative memory processor and
Third, several methods to integrate the memory-based process and the rule-based
process have been proposed. The role of rules has been the major issue
in this research. Two alternative views were proposed. The first approach
is to complement the memory-based process and the rule-based process at different
levels of abstraction. The second approach is to rely mostly on memory-based
translation, but to use rules to monitor translated sentences using explicit expert
knowledge. The first approach has been implemented in ΦDMDIALOG and
DMSNAP, and the second approach has been implemented in MEMOIR.
In the first approach, the model allows specific cases, generalized cases, and a
unification-based grammar to co-exist. This contrasts with most machine translation
systems, which allow only a single level of abstraction in the processing.
In addition, our model allows translation to be carried out at the most specific
applicable level among the several levels of abstraction. This ensures that
translation can be carried out with the least costly process at any time.
The second model, on the other hand, views rules as a monitor that checks explicit
stylistics and word choices of the translation. Neither a parse tree nor a meaning
representation is created in this approach.
such a system.
Translator's Assistant: The approach taken in the MEMOIR system can be applied
to assist human translators by providing retrieval of past translation
examples and by providing a first-cut translation. Such a system should
be implemented on personal computers and workstations, so fast and
efficient memory search on a serial machine would be one of the major critical
factors.
Any of these projects would require non-trivial effort. However, their economic,
social, and scientific impact would be enormous.
Although this book closes with these final remarks, research is now underway
toward the next generation of systems. It is our wish that we can report significant
achievements before the end of this century.
BIBLIOGRAPHY
[Bowler and Pawley, 1984] Bowler, K. C. and Pawley, G. S., "Molecular Dynamics and Monte Carlo Simulation in Solid-State and Elementary Particle Physics," Proceedings of the IEEE, 74, January, 1984.
[Brachman and Schmolze, 1985] Brachman, R. J. and Schmolze, J. G., "An Overview of The KL-ONE Knowledge Representation System," Cognitive Science 9, 171-216, August 1985.
[Brennan et. al., 1986] Brennan, S., Friedman, M., and Pollard, C., "A Centering Approach to Pronouns," Proceedings of the ACL-86, 1986.
[Chow and Roukos, 1989] Chow, Y.L. and Roukos, S., "Speech Understanding using a Unification Grammar," In Proc. of ICASSP - IEEE International Conference on Acoustics, Speech, and Signal Processing, 1989.
[Chow et. al., 1987] Chow, Y.L., Dunham, M.O., Kimball, O.A., Krasner, M.A., Kubala, G.F., Makhoul, J., Roucos, S., and Schwartz, R.M., "BYBLOS: The BBN Continuous Speech Recognition System," Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-87), pp 89-92, 1987.
[Cole et. al., 1983] Cole, R.A., Stern, R.M., Phillips, M.S., Brill, S.M., Specker, P., and Pilant, A.P., "Feature-Based Speaker Independent Recognition of English Letters," Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-83), 1983.
[DeMara and Moldovan, 1990] DeMara, R. and Moldovan, D., "The SNAP-1 Parallel AI Prototype", Technical Report PKPL 90-15, University of Southern California, Department of EE-Systems, 1990.
[De Smedt, 1989] De Smedt, K., "Distributed Unification in Parallel Incremental Syntactic Tree Formation," In Proceedings of the Second European Workshop on Natural Language Generation, 1989.
[Dwork et. al., 1984] Dwork, C., Kanellakis, P. and Mitchell, J., "On the Sequential Nature of Unification," Journal of Logic Programming, vol. 1, 1984.
[Fahlman, 1979] Fahlman, S., NETL: A System for Representing and Using Real-World Knowledge, The MIT Press, 1979.
[Fikes and Nilsson, 1971] Fikes, R. and Nilsson, N., "STRIPS: A new approach to the application of theorem proving to problem solving," Artificial Intelligence, 2, 189-208, 1971.
[Ford, Bresnan and Kaplan, 1981] Ford, M., Bresnan, J. and Kaplan, R., "A Competence-Based Theory of Syntactic Closure," In Bresnan, J. (Ed.), The Mental Representation of Grammatical Relations, MIT Press, 1981.
[Ford and Holmes, 1978] Ford, M. and Holmes, V., "Planning Units and Syntax in Sentence Production," Cognition, 6, pp 35-53, 1978.
[Furuse and Iida, 1992] Furuse, O. and Iida, H., "Cooperation between Transfer and Analysis in Example-Based Framework," Proc. of COLING-92, 1992.
[Furuse et. al., 1990] Furuse, O., Sumita, E., Iida, H., "A Method for Realizing Transfer-Driven Machine Translation," Workshop on Natural Language Processing, IPSJ, 1990 (in Japanese).
[Gibson, 1990] Gibson, T., "Memory Capacity and Sentence Processing," Proceedings of ACL-90, 1990.
[Higuchi et. al., 1989] Higuchi, T., Furuya, T., Kusumoto, H., Handa, K., and Kokubu, A., "The Prototype of a Semantic Network Machine IXM," Proceedings of the International Conference on Parallel Processing, 1989.
[Higuchi et. al., 1991] Higuchi, T., Kitano, H., Handa, K., Furuya, T., Takahashi, H., and Kokubu, A., "IXM2: A Parallel Associative Processor for Knowledge Processing," Proceedings of AAAI-91, 1991.
[Hillis, 1985] Hillis, D. W., The Connection Machine, The MIT Press, Cambridge, MA, 1985.
[Hon, 1992] Hon, H., Large Scale Vocabulary Independent Speech Recognition: The VOCIND System, Carnegie Mellon University, 1992.
[Hovy, 1988] Hovy, E. H., Generating Natural Language Under Pragmatic Constraints, Lawrence Erlbaum Associates, 1988.
[Hsu et. al., 1990] Hsu, F., Anantharaman, T., Campbell, M. and Nowatzyk, A., "A Grand Master Chess Machine," Scientific American, October, 1990.
[Isabelle, 1987] Isabelle, P., "Machine Translation at the TAUM Group," King, M. (Ed.), Machine Translation Today, Edinburgh: Edinburgh University Press, 247-277, 1987.
[Itakura, 1975] Itakura, F., "Minimum Prediction Residual Principle Applied to Speech Recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-23(1):67-72, 1975.
[Jacobs, 1992] Jacobs, P., "Parsing Run Amok: Relation-Driven Control for Text Analysis," Proc. of AAAI-92, 1992.
[Jain, 1990] Jain, A., "Parsing complex sentences with structured connectionist networks," Neural Computation, 3:110-120, 1990.
[Kaji, 1989] Kaji, H., "A Japanese-English Machine Translation System Based on Semantics," Nagao, M. (Ed.-in-Chief), Machine Translation Summit, Tokyo, 1989.
[Kameyama, 1988] Kameyama, M., "A Property-Sharing Constraint in Centering," ACL-88, 1988.
[Kaplan and Bresnan, 1982] Kaplan, R. and Bresnan, J., "Lexical-Functional Grammar: A Formal System for Grammatical Representation," In Bresnan, J. (Ed.), The Mental Representation of Grammatical Relations, MIT Press, 1982.
[Kaplan and Zaenen, 1989] Kaplan, R. and Zaenen, A., "Long-distance Dependencies, Constituent Structure, and Functional Uncertainty," 1989.
[Kasper, 1989] Kasper, R., "Unification and Classification: An Experiment in Information-Based Parsing," Proceedings of the International Workshop on Parsing Technologies, Pittsburgh, 1989.
[Kempen, 1987] Kempen, G., "A Framework for Incremental Syntactic Tree Formation," In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-87), 1987.
[Kempen and Hoenkamp, 1987] Kempen, G. and Hoenkamp, E., "An Incremental Procedural Grammar for Sentence Formulation," Cognitive Science, 11, 201-258, 1987.
[Kita et. al., 1989] Kita, K., Kawabata, T. and Saito, H., "HMM Continuous Speech Recognition using Predictive LR Parsing," In Proc. of ICASSP - IEEE International Conference on Acoustics, Speech, and Signal Processing, 1989.
[Kitano and Hendler, 1993] Kitano, H. and Hendler, J. (Eds.), Massively Parallel Artificial Intelligence, The MIT Press, 1993.
[Kitano, 1993] Kitano, H., "Challenges of Massive Parallelism," Proc. of IJCAI-93, Chambery, 1993.
[Kitano et. al., 1991] Kitano, H., Hendler, J., Higuchi, T., Moldovan, D., and Waltz, D., "Massively Parallel Artificial Intelligence," Proc. of IJCAI-91, Sydney, 1991.
[Kitano et. al., 1989a] Kitano, H., Tomabechi, H. and Levin, L., "Ambiguity Resolution in DMTRANS PLUS," In Proceedings of the Fourth Conference of the European Chapter of the Association for Computational Linguistics, 1989.
[Kitano et. al., 1989b] Kitano, H., Tomabechi, H., Mitamura, T. and Iida, H., "A Massively Parallel Model of Speech-to-Speech Dialog Translation: A Step Toward Interpreting Telephony," In Proceedings of the European Conference on Speech Communication and Technology (EuroSpeech-89), 1989.
[Kitano et. al., 1989c] Kitano, H., Mitamura, T. and Tomita, M., "A Massively Parallel Parsing in ΦDMDIALOG: An Integrated Architecture for Parsing Speech Inputs," In Proceedings of the International Workshop on Parsing Technologies, 1989.
[Kitano, 1988] Kitano, H., "Multilingual Information Retrieval Mechanism using VLSI," Proceedings of RIAO-88, Boston, 1988.
[Kim and Moldovan, 1990] Kim, J. and Moldovan, D., "Parallel Classification for Knowledge Representation on SNAP," Proceedings of the 1990 International Conference on Parallel Processing, 1990.
[Kogure et. al., 1990] Kogure, K., Iida, H., Hasegawa, T., and Ogura, K., "NADINE: An Experimental Dialogue Translation System from Japanese to English," Proceedings of InfoJapan-90, Tokyo, Japan, 1990.
[Lee, 1988] Lee, K.F., Large-Vocabulary Speaker-Independent Continuous Speech Recognition: The SPHINX System, Ph.D. Thesis, Carnegie Mellon University, 1988.
[Lee and Moldovan, 1990] Lee, W. and Moldovan, D., "The Design of a Marker Passing Architecture for Knowledge Processing", Proceedings of AAAI-90, 1990.
[Lesser et. al., 1975] Lesser, V.R., Fennell, R.D., Erman, L.D., Reddy, R.D., "The Hearsay II Speech Understanding System," IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-23(1):11-24, 1975.
[Levelt and Maassen, 1981] Levelt, W.J.M. and Maassen, B., "Lexical Search and Order of Mention in Sentence Production," In Klein, W. and Levelt, W.J.M. (Eds.), Crossing the Boundaries in Linguistics: Studies Presented to Manfred Bierwisch, Dordrecht, Reidel, 1981.
[Levinson et. al., 1979] Levinson, S.E., Rabiner, L.R., Rosenberg, A.E., and Wilpon, J.G., "Interactive Clustering Techniques for Selecting Speaker-Independent Reference Templates for Isolated Word Recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-27(2):134-141, April, 1979.
[Lin and Moldovan, 1990] Lin, C. and Moldovan, D., "SNAP: Simulator Results", Technical Report PKPL 90-5, University of Southern California, Department of EE-Systems, 1990.
[Litman and Allen, 1987] Litman, D. and Allen, J., "A Plan Recognition Model for Subdialogues in Conversation," Cognitive Science 11 (1987), 163-200.
[Lowerre, 1976] Lowerre, B., The HARPY Speech Recognition System, Ph.D. Thesis, Carnegie Mellon University, 1976.
[Maruyama and Watanabe, 1992] Maruyama, H. and Watanabe, H., "Tree Cover Search Algorithm for Example-Based Translation," Proceedings of TMI-92, 1992.
[Morii et. al., 1985] Morii, S., Niyada, K., Fujii, S. and Hoshimi, M., "Large Vocabulary Speaker-Independent Japanese Speech Recognition System," In Proceedings of ICASSP - IEEE International Conference on Acoustics, Speech, and Signal Processing, 1985.
[Morimoto et. al., 1990] Morimoto, T., Iida, H., Kurematsu, A., Shikano, K., and Aizawa, T., "Spoken Language Translation: Toward Realizing an Automatic Telephone Interpretation System," Proceedings of InfoJapan-90, Tokyo, 1990.
[Muraki, 1989] Muraki, K., "Two-Phase Machine Translation System," Nagao, M. (Ed.-in-Chief), Machine Translation Summit, Tokyo, 1989.
[Nagao, 1989] Nagao, M., Machine Translation: How Far Can It Go?, Oxford, 1989.
[Nagao, 1984] Nagao, M., "A Framework of a Mechanical Translation between Japanese and English by Analogy Principle," Artificial and Human Intelligence, Elithorn, A. and Banerji, R. (Eds.), Elsevier Science Publishers, B.V., 1984.
[Nirenburg et. al., 1989a] Nirenburg, S. (Ed.), Knowledge-Based Machine Translation, Center for Machine Translation Project Report, Carnegie Mellon University, 1989.
[Nirenburg et. al., 1989b] Nirenburg, S., Lesser, V. and Nyberg, E., "Controlling a Language Generation Planner," In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-89), 1989.
[Prather and Swinney, 1988] Prather, P. and Swinney, D., "Lexical Processing and Ambiguity Resolution: An Autonomous Processing in an Interactive Box," In Lexical Ambiguity Resolution, Small, S. et. al. (Eds.), Morgan Kaufmann Publishers, 1988.
[Quillian, 1968] Quillian, M., "Semantic Memory," Semantic Information Processing, M. Minsky (Ed.), 216-270, The MIT Press, Cambridge, MA, 1968.
[Rabiner et. al., 1979] Rabiner, L.R., Levinson, S.E., Rosenberg, A.E., and Wilpon, J.G., "Speaker-Independent Recognition of Isolated Words Using Clustering Techniques," IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-27(4):336-349, August, 1979.
[Riesbeck and Martin, 1985] Riesbeck, C. and Martin, C., "Direct Memory Access Parsing," Yale University Report 154, 1985.
[Riesbeck and Martin, 1986] Riesbeck, C. and Martin, C., "Direct Memory Access Parsing," Experience, Memory, and Reasoning, Lawrence Erlbaum Associates, 1986.
[Riesbeck and Schank, 1989] Riesbeck, C. and Schank, R., Inside Case-Based Reasoning, Lawrence Erlbaum Associates, 1989.
[Sacerdoti, 1977] Sacerdoti, E. D., A Structure for Plans and Behavior, New York: American Elsevier, 1977.
[Saito and Tomita, 1988] Saito, H. and Tomita, M., "Parsing Noisy Sentences," In Proceedings of COLING-88, 1988.
[Sato, 1993] Sato, S., Example-Based Translation of Technical Terms, IS-RR-93-41, Japan Advanced Institute of Science and Technology, 1993.
[Sato, 1991a] Sato, S., Example-Based Machine Translation, Ph.D. Thesis, Kyoto University, 1991.
[Sato, 1991b] Sato, S., "MBT-I: Word Choice based on Examples," Journal of Japanese Society for AI, Vol. 6, No. 4, 1991 (in Japanese).
[Sato and Nagao, 1990] Sato, S. and Nagao, M., "Toward Memory-based Translation," Proceedings of COLING-90, 1990.
[Selfridge, 1980] Selfridge, M., A Process Model of Language Acquisition, Ph.D. Thesis, Yale University Department of Computer Science, 1980.
[Schank, 1982] Schank, R., Dynamic Memory: A Theory of Learning in Computers and People, Cambridge University Press, 1982.
[Schank, 1975] Schank, R., Conceptual Information Processing, North-Holland, 1975.
[Sidner, 1979] Sidner, C., Towards a Computational Theory of Definite Anaphora Comprehension in English Discourse, Ph.D. Thesis, Artificial Intelligence Lab., M.I.T., 1979.
[Small et. al., 1988] Small, S., et. al. (Eds.), Lexical Ambiguity Resolution, Morgan Kaufmann Publishers, Inc., CA, 1988.
[Sowa, 1984] Sowa, J., Conceptual Structures, Reading, Addison-Wesley, 1984.
[Stanfill and Waltz, 1988] Stanfill, C. and Waltz, D., "The Memory-Based Reasoning Paradigm," Proceedings of the Case-Based Reasoning Workshop, DARPA, 1988.
[Sumita et. al., 1993] Sumita, E., Oi, K., Iida, H., Higuchi, T., Takahashi, N., Kitano, H., "Example-Based Machine Translation on Massively Parallel Processors," Proc. of IJCAI-93, 1993.
[Sumita and Iida, 1991] Sumita, E. and Iida, H., "Experiments and Prospects of Example-Based Machine Translation," Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, 1991.
[Tebelskis et. al., 1991] Tebelskis, J., Waibel, A., Petek, B., and Schmidbauer, O., "Continuous speech recognition using linked predictive neural networks," IEEE Proceedings of the 1991 International Conference on Acoustics, Speech, and Signal Processing, April 1991.
[Tomabechi et. al., 1989] Tomabechi, H., Saito, H. and Tomita, M., "SpeechTrans: An Experimental Real-Time Speech-to-Speech Translation," In Proceedings of the 1989 Spring Symposium of the American Association for Artificial Intelligence, 1989.
[Tomabechi et. al., 1988] Tomabechi, H., Mitamura, T. and Tomita, M., "Direct Memory Translation for Speech Input: A Massively Parallel Network for Episodic/Thematic and Phonological Memory," In Proceedings of the International Conference on Fifth Generation Computer Systems, 1988.
[Tomita, 1986] Tomita, M., Efficient Parsing for Natural Language, Kluwer Academic Publishers, 1986.
[Tomita and Carbonell, 1987] Tomita, M. and Carbonell, J. G., "The Universal Parser Architecture for Knowledge-Based Machine Translation," In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-87), 1987.
[Tomita and Knight, 1988] Tomita, M. and Knight, K., "Pseudo-Unification and Full Unification," CMU-CMT-88-MEMO, 1988.
[Tomita et. al., 1989] Tomita, M., Kee, M., Saito, H., Mitamura, T., and Tomabechi, H., "Towards a Speech-to-Speech Translation System," Journal of Applied Linguistics, 3.1, 1989.
[Viterbi, 1967] Viterbi, A.J., "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm," In IEEE Transactions on Information Theory, IT-13(2): 260-269, April, 1967.
[Waibel et. al., 1989] Waibel, A., Hanazawa, T., Hinton, G., Shikano, K. and Lang, K., "Phoneme Recognition Using Time-Delay Neural Networks," IEEE Transactions on Acoustics, Speech and Signal Processing, March, 1989.
[Waibel et. al., 1991] Waibel, A., Jain, A., McNair, A., Saito, H., Hauptmann, A., and Tebelskis, J., "JANUS: A Speech-to-Speech Translation System Using Connectionist and Symbolic Processing Strategies," IEEE Proceedings of the 1991 International Conference on Acoustics, Speech, and Signal Processing, April 1991.
[Waibel and Lee, 1990] Waibel, A. and Lee, K.F. (Eds.), Readings in Speech Recognition, Morgan Kaufmann, 1990.
[Walker and Whittaker, 1990] Walker, M. and Whittaker, S., "Mixed Initiative in Dialogue: An Investigation into Discourse Segmentation," Proceedings of ACL-90, Pittsburgh, 1990.
[Webber, 1983] Webber, B., "So What Can We Talk About Now?" In Brady, M. and Berwick, R. (Eds.), Computational Models of Discourse, The MIT Press, 1983.
[Wilensky, 1987] Wilensky, R., "Some Problems and Proposals for Knowledge Representation", Technical Report UCB/CSD 87/351, University of California, Berkeley, Computer Science Division, 1987.
[Winograd, 1983] Winograd, T., Language as a Cognitive Process. Vol. 1: Syntax, Addison-Wesley, 1983.
[Zue, 1985] Zue, V.W., "The Use of Speech Knowledge in Automatic Speech Recognition," Proceedings of the IEEE 73(11): 1602-1615, November, 1985.
Index

C
C-Marker, 54
case-based reasoning, 31
CC, 49
Center for Machine Translation, 10
CI, 50
combination of cases, 166
concept class, 49
concept instance, 50
concept sequence class, 49
concurrent parsing and generation, 48, 93
confusion matrix, 14, 64
connection machine, 11
constraint satisfaction, 70
contextual marker, 54
continuous speech, 4, 5
control, 129
cost-based ambiguity resolution, 48, 78
CSC, 49

D
DAP, 11
deleted phoneme, 16
direct memory access parser, 32
discourse, 71
discourse entities, 50
DMAP, 32, 45
DmDialog, 10, 29, 47
DmSNAP, 115
DRAGON, 8
dynamic confusion matrix, 64
dynamic time warp, 8

H
HARPY, 8
Head-driven Phrase Structure Grammar, 18
HEARSAY-II, 8
Hidden Markov Model, 17
HMM-LR, 17, 26

I
IG-Marker, 71
ill-formed inputs, 6
ILLIAC-IV, 11
inferred goal marker, 71
inserted phonemes, 16
intension, 7
interlingua, 3, 6, 82
interpreting telephony, 1
IXM2, 11, 44, 135

J
JANUS, 20

K
KBMT, 25
KBMT-89, 10
Knowledge-Based Machine Translation, 25
KSR-1, 12

L
language model, 62
large vocabulary, 4, 5
learning, 110
lexical choice, 6, 93
lexical entry, 50
lexical-functional grammar, 52
Linked Predictive Neural Network, 20
LR parser, 14

M
machine translation, 3, 9
marker, 54, 123
marker-passing, 29, 45, 119
massively parallel artificial intelligence, 11, 43
massively parallel computing, 11, 29, 43, 48
MBRtalk, 11
meaning representation, 6
Memoir, 157
memory network, 49, 51, 121, 144
Memory-base, 51, 158
memory-based approach, 31, 47, 119, 157
memory-based parser, 137
memory-based parsing, 29, 67
memory-based reasoning, 11, 31
MIND, 105
MINDS, 24
mixed-initiative dialog, 106
monitor, 158, 168
MPP, 11
MU project, 10
multiple hypotheses, 7

N
NETL, 11, 45
noisy inputs, 6
noisy phoneme sequence, 14, 62

O
ontological hierarchy, 51

P
P-Marker, 54
parsing, 68
perplexity, 78, 105
phoneme hypothesis, 3
phoneme recognition, 3
phoneme-based generalized LR parser, 15
phonological processing, 63
possible space, 40
pragmatic choice, 6
pragmatics, 24
prediction, 77
prediction marker, 54
predictive LR parser, 17, 26
probability matrix, 63
productivity of language, 36
propagation rules, 118

R
real space, 40
real-time response, 3, 7
recognize-and-record, 33
references, 6
rule-based approach, 42, 157
rule-based processing, 48

S
semantic network array processor, 44, 115
simultaneous interpretation, 93
SL-TRANS, 10, 17
SNAP, 11, 44, 115
speaker independence, 4
speaker-independent, 4
specific case, 68, 83
speech input processing, 60
speech recognition, 1, 4, 8
speech-to-speech translation, 1
SpeechTrans, 10, 13
SPHINX, 5, 8
spoken language translation, 4
stylistics, 6
sub-word model, 5
substituted phonemes, 15
syntactic choice, 6, 93
syntactic constraint network, 120, 123
SYSTRAN, 10

T
TANGORA, 8
TAUM-METEO, 10
transition matrix, 64
transition probability, 64
translation, 5
translation by analogy principle, 33
Typed Feature Structure Propagation, 18

U
unbounded dependency, 130

V
V-Marker, 54
verbalization marker, 54
very large finite space, 36
VOCIND, 5, 8

W
word boundary, 5
word hypothesis, 3