Hiroaki Kitano (Auth.) Speech-To-Speech Translation - A Massively Parallel Memory-Based Approach 1994


SPEECH-TO-SPEECH

TRANSLATION:
A MASSIVELY PARALLEL
MEMORY-BASED APPROACH
THE KLUWER INTERNATIONAL SERIES
IN ENGINEERING AND COMPUTER SCIENCE

NATURAL LANGUAGE PROCESSING


AND MACHINE TRANSLATION
Consulting Editor
Jaime Carbonell

Other books in the series:


REVERSIBLE GRAMMAR IN NATURAL LANGUAGE PROCESSING, Tomek Strzalkowski
ISBN: 0-7923-9416-X
THE FUNCTIONAL TREATMENT OF PARSING, René Leermakers
ISBN: 0-7923-9376-7
NATURAL LANGUAGE PROCESSING: THE PLNLP APPROACH, Karen Jensen, George E.
Heidorn, Stephen D. Richardson
ISBN: 0-7923-9279-5
ADAPTIVE PARSING: Self-Extending Natural Language Interfaces, J. F. Lehman
ISBN: 0-7923-9183-7
GENERALIZED LR PARSING, M. Tomita
ISBN: 0-7923-9201-9
CONCEPTUAL INFORMATION RETRIEVAL: A Case Study in Adaptive Partial Parsing,
M. L. Mauldin
ISBN: 0-7923-9214-0
CURRENT ISSUES IN PARSING TECHNOLOGY, M. Tomita
ISBN: 0-7923-9131-4
NATURAL LANGUAGE GENERATION IN ARTIFICIAL INTELLIGENCE AND
COMPUTATIONAL LINGUISTICS, C. L. Paris, W. R. Swartout, W. C. Mann
ISBN: 0-7923-9098-9
UNDERSTANDING EDITORIAL TEXT: A Computer Model of Argument Comprehension,
S. J. Alvarado
ISBN: 0-7923-9123-3
NAIVE SEMANTICS FOR NATURAL LANGUAGE UNDERSTANDING, K. Dahlgren
ISBN: 0-89838-287-4
INTEGRATED NATURAL LANGUAGE DIALOGUE: A Computational Model, R. E.
Frederking
ISBN: 0-89838-255-6
A NATURAL LANGUAGE INTERFACE FOR COMPUTER AIDED DESIGN, T. Samad
ISBN: 0-89838-222-X
EFFICIENT PARSING FOR NATURAL LANGUAGE: A Fast Algorithm for Practical
Systems, M. Tomita
ISBN: 0-89838-202-5
SPEECH-TO-SPEECH
TRANSLATION:
A MASSIVELY PARALLEL
MEMORY-BASED APPROACH

Hiroaki Kitano

Carnegie Mellon University


Pittsburgh, Pennsylvania

and

Sony Computer Science Laboratory


Tokyo, Japan

"
~.

SPRINGER SCIENCE+BUSINESS MEDIA, LLC


Library of Congress Cataloging-in-Publication Data

A C.I.P. Catalogue record for this book is available from


the Library of Congress.

ISBN 978-1-4613-6178-7 ISBN 978-1-4615-2732-9 (eBook)


DOI 10.1007/978-1-4615-2732-9

Copyright © 1994 by Springer Science+Business Media New York


Originally published by Kluwer Academic Publishers in 1994
Softcover reprint of the hardcover 1st edition 1994
All rights reserved. No part of this publication may be reproduced, stored in
a retrieval system or transmitted in any form or by any means, mechanical,
photo-copying, recording, or otherwise, without the prior written permission of
the publisher, Springer Science+Business Media, LLC.

Printed on acid-free paper.


CONTENTS

LIST OF FIGURES ix

LIST OF TABLES xiii

PREFACE xv

1 INTRODUCTION 1
1.1 Speech-to-Speech Dialogue Translation 1
1.2 Why Spoken Language Translation is So Difficult? 4
1.3 A Brief History of Speech Translation Related Fields 8

2 CURRENT RESEARCH TOWARD SPEECH-TO-SPEECH TRANSLATION 13
2.1 SpeechTrans 13
2.2 SL-TRANS 17
2.3 JANUS 20
2.4 MINDS 24
2.5 Knowledge-Based Machine Translation System 25
2.6 The HMM-LR Method 26

3 DESIGN PHILOSOPHY BEHIND THE ΦDMDIALOG SYSTEM 29
3.1 Introduction 29
3.2 Memory-Based Approach to Natural Language Processing 31
3.3 Massively Parallel Computing 43
3.4 Marker-Passing 45
vi SPEECH-TO-SPEECH TRANSLATION

4 THE ΦDMDIALOG SYSTEM 47


4.1 Introduction 47
4.2 An Overview of the Model 49
4.3 Speech Input Processing 60
4.4 Memory-Based Parsing 67
4.5 Syntactic/Semantic Parsing 68
4.6 Discourse Processing 71
4.7 Prediction from the Language Model 77
4.8 Cost-based Ambiguity Resolution 78
4.9 Interlingua with Multiple Levels of Abstraction 82
4.10 Generation 84
4.11 Simultaneous Interpretation: Generation while Parsing is in
Progress 93
4.12 Related Works 104
4.13 Discussions 105
4.14 Conclusion 112

5 DMSNAP: AN IMPLEMENTATION ON
THE SNAP SEMANTIC NETWORK ARRAY
PROCESSOR 115
5.1 Introduction 115
5.2 SNAP Architecture 116
5.3 Philosophy Behind DMSNAP 119
5.4 Implementation of DMSNAP 121
5.5 Linguistic Processing in DMSNAP 125
5.6 Performance 131
5.7 Conclusion 133

6 ASTRAL: AN IMPLEMENTATION ON THE IXM2 ASSOCIATIVE MEMORY PROCESSOR 135
6.1 Introduction 135
6.2 The Massively Parallel Associative Processor IXM2 135
6.3 Experimental Implementation I: A Flat Pattern Model 136
6.4 Performance 140
6.5 Memory and Processor Requirements 143
6.6 Enhancement: Hierarchical Memory Network 144

6.7 Experimental Implementation II: Hierarchical Memory Network Model 145
6.8 Performance 148
6.9 Hardware Architecture for Memory-Based Parsing 151
6.10 Conclusion 152

7 MEMOIR: AN ALTERNATIVE VIEW 157


7.1 Introduction 157
7.2 Overall Architecture 157
7.3 Knowledge Sources 158
7.4 Grammatical Inference 162
7.5 Examples Retrieval 163
7.6 Adaptive Translation 165
7.7 Monitor 168
7.8 Preliminary Evaluation 170
7.9 Conclusion 171

8 CONCLUSION 173
8.1 Summary of Contributions 173
8.2 Future Works 175
8.3 Final Remarks 176

Bibliography 177

Index 191
LIST OF FIGURES

Chapter 1
1.1 Process flow of Speech-to-speech translation 2
1.2 Overall process flow of Speech-to-speech dialog translation system 4

Chapter 2
2.1 An example of sentence analysis result 19
2.2 JANUS using the generalized LR parser. 22
2.3 JANUS using the connectionist parser. 23

Chapter 3
3.1 Translation as Analogy 34
3.2 Distribution by Sentence Length 37
3.3 Coverage by Sentence Length 37
3.4 Real space and possible space 41

Chapter 4
4.1 Lexical Nodes for 'Kaigi' and 'Conference' 51
4.2 Grammar using LFG-like notation 52
4.3 Grammar using Semantic-oriented notation 53
4.4 Grammar using mixture of surface string and generalized case 53
4.5 Example of an A-Marker and a P-Marker 55
4.6 Example of a G-Marker and a V-Marker 56
4.7 Movement of P-Markers 58
4.8 Movement of P-Marker on Hierarchical CSCs 58
4.9 Parsing with a small grammar 59
4.10 A simple parsing example. 61
4.11 Examples of Noisy Phoneme Sequences 63

4.12 Phoneme-level State Transition 65


4.13 Phoneme Processing 66
4.14 A simple plan recognition example (Activation) 74
4.15 A simple plan recognition example (Prediction) 74
4.16 A simple plan recognition example (Activation) 74
4.17 A simple plan recognition example with Multiple Hierarchy. 76
4.18 Branching and Merging of Markers 78
4.19 Prediction 79
4.20 Translation paths at different levels of abstraction 83
4.21 Movement of V-Marker in the CSC 86
4.22 Movement of V-Marker in the Hierarchy of CSCs 86
4.23 An Incremental Tree Construction 87
4.24 Change of Produced Sentence due to the Different Semantic Inputs 88
4.25 A simple example of the generation process 90
4.26 Activation of Syntactic and Lexical Hypotheses 92
4.27 Transaction with Conventional and Simultaneous Interpretation
Architecture 95
4.28 A part of the memory network 100
4.29 A Process of Parsing, Generation and Prediction 103

Chapter 5
5.1 SNAP Architecture 117
5.2 Concept Sequence on SNAP 122
5.3 Part of Memory Network 126
5.4 Parsing Performance of DMSNAP 132

Chapter 6
6.1 Syntactic Recognition Time vs. Sentence Length 140
6.2 Performance Improvement by Learning New Cases 142
6.3 Training Sentences vs. Syntactic Patterns 144
6.4 Overall Architecture of the Parsing Part 145
6.5 Network for 'about' and its phoneme sequence 146
6.6 Parsing Time vs. Length of Input 149
6.7 Parsing Time vs. KB Size 150
6.8 Number of Active Hypotheses per Processor 152

6.9 Parallel Marker-Propagation Time vs. Fanout 155

Chapter 7
7.1 Overall Architecture 159
7.2 Abstraction-based Word Distance Definition 162
7.3 DP-Matching of Input and Examples 164
7.4 Multiple Match between Examples 165
LIST OF TABLES

Chapter 1
1.1 Major speech recognition systems 9

Chapter 2
2.1 A portion of a confusion matrix 14
2.2 Examples of Sentences Processed by SpeechTrans 16
2.3 Performance of the SL-TRANS 19
2.4 Performance of JANUS1 and JANUS2 on N-Best Hypotheses 22
2.5 Performance of JANUS1 and JANUS2 on the First Hypothesis 23
2.6 Performance of the MINDS system 25

Chapter 3
3.1 Knowledge and parallelism involved in the speech translation task 30
3.2 Distribution of the global ill-formedness 41

Chapter 4
4.1 Types of Nodes in the Memory Network 50
4.2 Markers in the Model 55
4.3 Transcript: English to Japanese 96
4.4 Transcript: Japanese to English (1) 97
4.5 Transcript: Japanese to English (2) 97
4.6 Simultaneous interpretation in ΦDMDIALOG 99

Chapter 5
5.1 Execution times for DMSNAP 132

Chapter 6

6.1 Pre-Expanded Syntactic Structures 138


6.2 Case-Role Table 139
6.3 Syntactic Recognition Time vs. Sentence Length (milliseconds) 141
6.4 Syntactic Recognition Time vs. Grammar Size (milliseconds) 141

Chapter 7
7.1 Examples of Translation Pair 159
7.2 A Part of Memory-Base (Morphological tags are omitted) 160
7.3 Examples matched for a simple input 163
7.4 Difference Table 165
7.5 Adaptation Operations 166
7.6 Adaptation for a simple sentence translation 166
7.7 Retrieved Examples 167
7.8 Adaptive Translation Process 168
PREFACE

Development of a speech-to-speech translation or an interpreting telephony system is one of the ultimate goals of research in speech recognition, natural language processing, and artificial intelligence. It is considered to be the grand challenge for modern computer science and engineering.

This book describes ΦDMDIALOG and its descendants: DMSNAP, ASTRAL, and Memoir. ΦDMDIALOG is a speech-to-speech dialog translation system developed at the Center for Machine Translation (CMT), Carnegie Mellon University. It accepts speaker-independent continuous speech inputs and produces audio outputs of translated sentences in real-time. ΦDMDIALOG is one of the first experimental systems that perform speech-to-speech translation, and the first system to demonstrate the possibility of simultaneous interpretation. The original version of ΦDMDIALOG on a serial machine was implemented and has been publicly demonstrated since March 1989. It translates Japanese into English in real-time, and operates on the ATR (Advanced Telecommunication Research Interpreting Telephony Research Laboratories) conference registration domain.

Massively parallel implementations on IXM2, SNAP-1, and CM-2 have been carried out with different variations of the original model. These massively parallel implementations proved the validity of the approach and demonstrated that real-time speech-to-speech translation is attainable.

This book is based on my dissertation at Kyoto University, but was updated before publication in book form. In particular, chapter 7 describes a recent development which was only at the level of conceptualization when I was writing the dissertation.

It is interesting to see how my ideas change and grow as research progresses. The original ΦDMDIALOG system reflects my early vision of natural language processing, whereas the updated chapters reflect my recent thinking. The work is consistent in the sense that memory-based processing and massively parallel computing remain the basis of the model. However, the use of rules has changed drastically.

For me, the work described in this book is an important milestone. The ideas grown out of this work led me to propose massively parallel artificial intelligence, which is now being recognized as a distinct research field.

I would like to express my sincere thanks to Makoto Nagao, my thesis committee chair, Jaime Carbonell, director of CMT, and Masaru Tomita, associate director of CMT. There are many people who helped me and influenced me in various ways. David Waltz influenced me with his memory-based reasoning idea, and Jim Hendler helped me propose massively parallel AI. Members of the Carnegie Mellon research community, James McClelland, David Touretzky, Kai-Fu Lee, Alex Waibel, Carl Pollard, Lori Levin, Sergei Nirenburg, Wayne Ward, and Takeo Kanade, gave me various suggestions on my research and on my thesis. Hitoshi Iida and Akira Kurematsu at ATR Interpreting Telephony Research Laboratories allowed me to use the ATR corpus on which the system operates. Massively parallel implementation would not have been possible without research collaboration with Tetsuya Higuchi and his colleagues at the Electrotechnical Laboratory, and Dan Moldovan and his colleagues at the University of Southern California.

This research has been supported by National Science Foundation grant MIP-9009111, Pittsburgh Supercomputing Center grant IRI-910002P, and a research contract between Carnegie Mellon University and ATR Interpreting Telephony Research Laboratories. NEC Corporation supported my stay at Carnegie Mellon University.
SPEECH-TO-SPEECH
TRANSLATION:
A MASSIVELY PARALLEL
MEMORY-BASED APPROACH
1
INTRODUCTION

1.1 SPEECH-TO-SPEECH DIALOGUE TRANSLATION
Development of a speech-to-speech translation system, or interpreting telephony, is one of the ultimate goals of research in natural language, speech recognition, and artificial intelligence. The task of speech-to-speech translation ultimately requires recognition and understanding of speaker-independent, large-vocabulary, continuous speech in the context of mixed-initiative dialogues. It also needs to accurately translate and produce appropriately articulated audio output in real-time (figure 1.1). The utility of speech-to-speech translation is far-reaching. Beside obvious scientific and engineering significance, there are immeasurable economic and cultural impacts. Consider even a small and restricted speech-to-speech translation system that helps travelers order meals at restaurants, buy tickets, ask directions, and carry out other small talk; it would greatly reduce the burden on travelers and add flexibility to their activities. Also, imagine telephone services that translate limited-domain interactions such as asking for phone numbers, asking for train schedules, making reservations of all kinds, and more. Commercial payoffs would be enormous even for restricted systems. As technologies advance, we will be able to relax some of the constraints imposed on the first-generation systems. Then, speech-to-speech translation systems will attain unparalleled utility in our society.

Accomplishment of the task requires the collective effort of various researchers.


Speech recognition modules need to exhibit highly accurate and real-time per-
formance under a speaker-independent, continuous speech, large vocabulary
condition. A machine translation module consists of parsing and generation,
Figure 1.1 Process flow of Speech-to-speech translation (stages: audio signal, phoneme recognition, lexical activation of word hypotheses, parsing, meaning of utterance, generation, voice synthesis, translated sentence in sound)

and must be capable of interpreting very elliptical (where some words are not said) and ill-formed sentences which may appear in real spoken dialogues. In addition, an interface between the parser and the speech recognition module must be well designed so that necessary information is passed to the parser, and appropriate feedback is given from the parser to the speech recognition module in order to improve recognition accuracy. In figure 1.1, we assumed that an interface is made at both the phoneme hypothesis and word hypothesis levels, so that predictions made by the parser can be immediately fed back to the phoneme recognition device. No speech recognition module is capable of recognizing input speech with perfect accuracy, so it sometimes provides a false word sequence as its first choice. However, it is often the case that the correct word is in the second or third best hypothesis. Thus, the phoneme and word hypotheses given to the parser consist of several competing phoneme or word hypotheses, each of which is assigned a probability of being correct. With this mechanism, the accuracy of recognition can be improved, because it filters out false first choices of the speech recognition module and selects grammatically and semantically plausible second or third best hypotheses. To implement this mechanism, the parser needs to handle multiple hypotheses in parallel rather than a single word sequence as seen in text-input machine translation systems. For the translation scheme, we use an interlingua, i.e. a language-independent representation of the meaning of the sentence, so that translation into multiple languages can be done efficiently. A generation module needs to be designed so that appropriate spoken sentences can be generated with correct articulation control. In addition to these functional challenges, it should be noted that real-time response is the major requirement of the system, because speech-to-speech dialog translation systems would be used for real-time transactions, imposing a far more severe performance challenge than for text-based machine translation systems.
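To make the filtering mechanism concrete, the following sketch rescores an N-best hypothesis list by combining each hypothesis's acoustic probability with a crude grammaticality score. The hypotheses, the scores, the equal weighting, and the toy bigram grammar are all invented for illustration; an actual system would consult a full parser rather than a word-pair table.

```python
# Minimal sketch of N-best rescoring: the recognizer's acoustically best
# choice is overridden when a lower-ranked hypothesis is linguistically
# more plausible. All data here is invented for illustration.

# Each hypothesis: (word sequence, acoustic probability from the recognizer)
n_best = [
    (["I", "sink", "so"], 0.42),    # acoustically best, but implausible
    (["I", "think", "so"], 0.40),
    (["eye", "think", "so"], 0.18),
]

# Toy stand-in for a grammar: the set of word bigrams the parser accepts.
valid_bigrams = {("I", "think"), ("think", "so")}

def parser_score(words):
    """Fraction of adjacent word pairs the toy grammar accepts."""
    pairs = list(zip(words, words[1:]))
    return sum(p in valid_bigrams for p in pairs) / len(pairs)

def rescore(hypotheses):
    # Combine acoustic and linguistic evidence; the 0.5/0.5 weights are
    # arbitrary for this sketch.
    return max(hypotheses, key=lambda h: 0.5 * h[1] + 0.5 * parser_score(h[0]))

best_words, _ = rescore(n_best)
print(" ".join(best_words))  # -> I think so
```

The false first choice "I sink so" scores highest acoustically but fails the grammaticality check, so the second-best hypothesis is selected, which is exactly the filtering behavior described above.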

Furthermore, since a comprehensive system must handle two-way conversations, the system should have bi-directional translation capability, with an ability to understand interaction at the discourse knowledge level, predict the possible next utterance, understand what particular pronouns refer to, and provide high-level constraints for the generation of contextually appropriate sentences involving various context-dependent phenomena. To attain these features, the overall picture of the system should look like figure 1.2. The knowledge base in figure 1.2 is used for keeping track of discourse and world knowledge established during the conversation, and is continuously updated during processing. It is now clear that development of a speech-to-speech translation system requires a new set of technologies: it is not just an assembly of existing speech recognition, machine translation, and voice synthesis systems.
Figure 1.2 Overall process flow of Speech-to-speech dialog translation system (Japanese utterance translated into English in sound, and English utterance translated into Japanese in sound)

1.2 WHY SPOKEN LANGUAGE TRANSLATION IS SO DIFFICULT?
Spoken language translation is a difficult task; indeed, it has been one of the ultimate goals of speech recognition, natural language processing, machine translation, and artificial intelligence researchers. The difficulties of spoken language processing can be roughly subdivided into two major issues:

• Difficulties in speech recognition,

• Difficulties in translating spoken sentences.

Speech Recognition The central issue in speech recognition research has been the development of highly accurate speech recognition systems which are free from constraints regarding (1) vocabulary size, (2) speaker-independency, (3) continuous speech, and (4) task characteristics. In other words, current speech recognition systems aim at large-vocabulary, speaker-independent, continuous speech recognition on natural tasks.

Speaker-independence is one of the most difficult constraints to relax. The difficulty comes from the obvious fact that no two people create identical acoustic waves even when articulating the same word. Past approaches toward speaker-independence have been to find invariant parameters which are independent of personal variation [Cole et al., 1980]. Although there are reports on high accuracy systems [Cole et al., 1983] [Cole, 1986a] [Zue, 1985], this approach has not been successful on more difficult tasks [Cole et al., 1986b]. Other approaches include multiple representations, so that each representation accounts for a class of speakers sharing similar acoustic features [Levinson et al., 1979] [Rabiner et al., 1979]. However, this approach was not able to attain a high recognition rate on large tasks. In the SPHINX system [Lee, 1988], a large set of speaker-independent training data has been used to attain high accuracy with speaker-independence. Recently, some success has been reported on vocabulary-independent speech recognition. For example, the VOCIND system attained a vocabulary-independent word error rate of 4.6% [Hon, 1992]. The VOCIND system uses powerful vocabulary-independent subword modeling, effective vocabulary learning, and an environment normalization algorithm.

Continuous speech is far more complex than isolated speech, mainly due to (1) unclear word boundaries, (2) co-articulation effects, and (3) poor articulation of functional words. Unclear word boundaries significantly increase the search space because of the large number of segmentation candidates. Contextual effects, i.e. changes in articulation due to the existence of other words (previous or following) and due to the placement of stress (emphasis and de-emphasis), account for the co-articulation effects and the poor articulation of functional words.

No system can be truly practical unless it has a sufficiently large vocabulary, so that users can use most of their daily vocabulary. However, a larger vocabulary significantly increases acoustic confusability and complicates the search space. One of the problems is the increasing difficulty of obtaining a sufficiently large set of training data. The training data would inevitably be sparse, since acquisition of data which contains co-articulation of all possible word pairs would be extremely costly and time-consuming. The alternative approach is to use a sub-word model so that dense training data can be obtained. However, this approach has an obvious trade-off in that it does not capture co-articulation effects.

For those who are interested in the details of speech recognition research, there are ample publications in this field, such as [Waibel and Lee, 1990].

Translation The issue of translation has been one of the major problems in the natural language processing and artificial intelligence communities. Generally, translation of sentences between two languages entails various problems, many of which are yet to be solved. The following is a partial list of problems to be solved and questions to be answered.

• Understanding the meaning of sentences.


- Resolving lexical and structural ambiguity.
- Resolving references.
- Resolving ellipsis.
• Representing the meaning of the sentence.
- What information should be represented?
- How should it be represented?
- Is there an interlingua, or do we need transfer?
• Mapping the analysis result in the source language into the representation
for the target language.
- Do we need this phase at all?
- What knowledge do we need for mapping?
- How do we restore missing information?
• Generating syntactically, semantically, contextually, pragmatically, and
stylistically acceptable translation.
- Lexical and Syntactic choice.
- Pragmatic choice.
- Stylistics.

These are some of the issues in designing conventional machine translation systems for translating written text. In addition to these issues, we need to consider the following problems in the translation of spoken language.

Parsing ill-formed inputs: Spoken utterances are often ill-formed, involving false starts, repeats, missing endings, etc. Methods to parse sentences such as Well, I .... Oh, I think you are right. need to be developed for the practical deployment of speech-to-speech translation with high habitability.
Parsing Noisy inputs: Noisy inputs refers to inputs which contain errors from the speech recognition device. It should be distinguished from ill-formed inputs, in which the cause of error is the speaker. For example, if we use a speech recognition device which provides a phoneme sequence to the translation module, the phoneme sequence generally contains some errors such as insertion, substitution, and deletion of phonemes. Parsing methods that make the best guess at restoring the correct hypothesis need to be developed.

Parsing with Multiple Hypotheses: Due to noisy inputs, parsers for speech inputs need to retain multiple hypotheses during the parse. For example, the speech recognition device may provide a word lattice with a probability measure attached to each word hypothesis. Input to the parser, in this case, is not a simple word sequence. The parser needs to handle multiple input hypotheses.
Restoring severely elliptical sentences: Spoken sentences are highly elliptical. The subject or the object of a sentence may not be articulated when it is obvious from the context. This is particularly true in Japanese. Some Japanese sentences even drop both subject and verb, or subject and object at the same time. A method to restore this missing information is essential when translating into English, in which clear articulation of subjects and objects is required.

Understanding the intention of the speaker: Unlike much written text translation, which mainly targets computer manuals or other descriptive texts, a speech-to-speech translation system encounters sentences with illocutionary acts. For example, the speaker of the utterance Can you tell me about yourself? does not expect to hear a yes or no answer. The intention of the question is to get information about the hearer. The speaker obviously expects the hearer to understand the real intention of the question, and to start talking about him/herself. Since such speech acts are frequently observed in spoken conversations, the system needs to have some capability to understand the intention of the utterance.
Real-Time performance: Quick translation is always desired, even in text-to-text translation. However, this requirement is even more pressing in a speech-to-speech translation system, which inherently requires real-time response. Due to its nature, the system will be used predominantly in interactive mode, as opposed to the batch mode of some text translation systems.
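The restoration problem for noisy phoneme input described above can be sketched with a classic dynamic-programming edit distance: pick the lexicon entry closest to the noisy sequence under unit-cost insertion, substitution, and deletion. The lexicon and the noisy input below are invented for illustration; a practical recognizer would weight each error type with a phoneme confusion matrix rather than uniform costs.

```python
# Minimal sketch of restoring a noisy phoneme sequence: choose the lexicon
# entry with the smallest edit distance, where insertions, substitutions,
# and deletions all cost 1. The data below is invented for illustration.

def edit_distance(a, b):
    # Classic dynamic-programming (Levenshtein) distance over phoneme lists,
    # keeping only one row of the DP table at a time.
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (pa != pb)))   # substitution
        prev = cur
    return prev[-1]

# A tiny invented lexicon of words as phoneme sequences.
lexicon = {
    "kaigi":  ["k", "a", "i", "g", "i"],
    "kaigan": ["k", "a", "i", "g", "a", "n"],
    "kaisha": ["k", "a", "i", "sh", "a"],
}

def restore(noisy):
    """Best guess: the lexicon entry nearest to the noisy input."""
    return min(lexicon, key=lambda w: edit_distance(noisy, lexicon[w]))

# A noisy sequence with one substitution and one deletion of "kaigi".
print(restore(["k", "a", "e", "g"]))  # -> kaigi
```

The same distance computation extends naturally to the word-lattice case: instead of one noisy sequence, each path through the lattice is matched against the lexicon, and the path-word pair with the best combined score wins.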

The list is by no means exhaustive. There should be many more problems to be solved. It is fair to say that there are problems which we do not anticipate even at this moment, and which we will face in the future. However, the above listing is a fair mapping of the recognized problems in developing speech-to-speech translation systems.

1.3 A BRIEF HISTORY OF SPEECH TRANSLATION RELATED FIELDS
While it was only recently, perhaps since the late 1980's, that researchers have specifically targeted the development of speech-to-speech translation, there is a long history of related fields such as speech recognition and machine translation. The concept of speech-to-speech translation has been recognized for years, but it was a sheer dream even with the state-of-the-art technologies of past years. Now, we are entering a stage which allows us to explicitly target speech-to-speech translation systems. Although we need years of effort to build practical systems, we can build and test prototype systems in some of the most advanced research institutions in the world. This section briefly reviews the history of the related fields which brought us to the state of technology which we are now enjoying.

1.3.1 Speech Recognition


Major efforts on speech recognition have been made since DARPA started to support speech recognition research in 1971. A number of systems have been developed and various approaches have been proposed. The first large vocabulary systems appeared in the mid 70's, exemplified by HEARSAY-II [Lesser et al., 1975] and HARPY [Lowerre, 1976]. The HEARSAY-II system used a blackboard architecture to attain dynamic interaction of various knowledge sources. The HARPY system integrated the network representation used in DRAGON [Baker, 1975] and the beam search technique. Dynamic time warping (DTW) [Itakura, 1975] was proposed in 1975. Early in the 80's, a group of speaker-independent systems was developed, such as the FEATURE system [Cole et al., 1983]. The FEATURE system is based on the idea that spectral features can be the clue for speech recognition, just as spectrogram readers can accurately identify phonemes from spectrograms. In the middle of the 80's, a group at IBM developed the TANGORA system [IBM, 1985], which has a 5,000-word vocabulary and works on a natural-language-like grammar with a perplexity of 160. This is perhaps the first system which works on a natural task.

In the late 80's, the SPHINX [Lee, 1988] and BYBLOS [Chow et al., 1987] systems were developed, both using the Hidden Markov Model (HMM). SPHINX was extended to the vocabulary-independent VOCIND system [Hon, 1992].

Early in the 90's, we saw the first neural-network-based speech recognition systems [Waibel et al., 1989] [Tebelskis et al., 1991].

System      Speaker-Indep.  Continuous Speech  Vocabulary  Recog. rate
NTT         No              No                 200         97.5%
DRAGON      No              Yes                194         84%
HEARSAY-II  No              Yes                1011        87%
HARPY       No              Yes                1011        97%
Bell-1      Yes             No                 129         91%
FEATURE     Yes             No                 26          90%
TANGORA     No              Yes                5,000       97%
BYBLOS      No              Yes                997         93%
SPHINX      Yes             Yes                997         96%
VOCIND      Yes             Yes                Indep.      96%

Table 1.1 Major speech recognition systems

Table 1.1 shows major speech recognition systems developed to date.

1.3.2 Machine Translation


Perhaps the first explicit document on machine translation was the so-called Weaver Memorandum of 1949. Warren Weaver, who was a vice president of the Rockefeller Foundation at the time, distributed a memorandum which suggested the possibility of machine translation. The first research group on machine translation was headed by Andrew Booth and Richard Richens at the University of London, who had already started MT research before the Weaver Memorandum. Immediately after the Weaver Memorandum, a number of research institutions started MT research. These were the Massachusetts Institute of Technology (MIT), the University of Washington, the University of California at Los Angeles, the RAND Corporation, the National Bureau of Standards, Harvard University, and Georgetown University. The first MT system was demonstrated by the Georgetown group in 1954, between Russian and English. It was, of course, a very small system with a vocabulary of only around 250 words. Although it was limited in vocabulary, no pre-editing was required and the translation was of adequate quality. This led to the growth of MT research throughout the 50's and early 60's. Bar-Hillel [Bar-Hillel, 1959] made a strong criticism of MT research, but was in the minority in this era.

The dramatic change in MT research came with the ALPAC report [ALPAC, 1966], which strongly criticized the MT research of that age. The ALPAC report was correct in most of its assessment of the problems and the limitations of the state-of-the-art technologies against the difficulties. The report pointed out the necessity of more basic and computational research toward understanding the nature of natural language itself. After the ALPAC report, we saw a sharp decline in the MT effort in the United States, though research in other countries such as Japan and the European countries continued.

In 1977, the University of Montreal developed a historic system, TAUM-METEO [Isabelle, 1987], which is the first practical, and successful, MT system ever developed. The system has been used to translate weather reports from English to French. The SYSTRAN system was used in the Apollo-Soyuz space mission in 1975.

The Eurotra project started in 1978, aiming at translation among all official EEC languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese, and Spanish.

In Japan, a group led by Nagao at Kyoto University started the MU project [Tsujii, 1985], which became the mother of the MT systems developed by the Japanese mainframers in the 1980s. Several commercial systems have been developed by Japanese companies, such as PIVOT (NEC; [Muraki, 1989]) and HICAT (Hitachi; [Kaji, 1989]). The translation paradigm of these systems is the transfer model.

The ATR Interpreting Telephony Research Laboratories was established specifically to focus on the development of speech-to-speech translation systems. This Japanese effort resulted in the development of the SL-TRANS speech-to-speech translation system [Ogura et al., 1989].

In 1986, the Center for Machine Translation (CMT) was formed at Carnegie
Mellon University, which symbolized the scientific comeback of MT research
in the United States. Several state-of-the-art systems, such as KBMT-89
[Goodman and Nirenburg, 1991], SpeechTrans [Tomita et al., 1989], the
ΦDMDIALOG system, and others, have been developed at the CMT. The KBMT-89
system is the first system to claim a fully interlingua approach, employing
the knowledge-based machine translation (KBMT) paradigm [Nirenburg et al., 1989a].

In 1991, DARPA started to fund a major MT research effort involving the CMT at
Carnegie Mellon University, New Mexico State University, and the Information
Sciences Institute at the University of Southern California.
Introduction 11

1.3.3 Massively Parallel Computing


The history of massively parallel computing is perhaps the youngest among
the fields relevant to this book. Although the ILLIAC-IV [Bouknight et al.,
1972] was the first parallel computer, it had only 64 processors. The history
of massively parallel machines starts in the late 1970s with the development
of the MPP massively parallel processor by Goodyear Corporation [Batcher, 1980]
in 1979. The MPP has 16K 1-bit processors interconnected in a 128 by 128 mesh.
It was used for image processing. Another early attempt is the DAP [Bowler and
Pawley, 1984], which has from 1,024 to 4,096 processors.

In 1985, Hillis proposed the Connection Machine [Hillis, 1985] and the Think-
ing Machines Corporation was formed to commercially deliver the CM-1 Con-
nection Machine. The CM-1 has 64K 1-bit processors. Thinking Machines
Corporation soon announced the upgraded version, the CM-2, with floating point
capability, attaining 28 GFlops in single precision [Thinking Machines
Corporation, 1989].

The application of massively parallel computing to natural language, or more
broadly to artificial intelligence, has been discussed since NETL [Fahlman,
1979]. However, it was Stanfill and Waltz who first implemented a seemingly
intelligent system on an actual massively parallel machine. Their MBRtalk sys-
tem [Stanfill and Waltz, 1988] uses the Memory-Based Reasoning paradigm
to pronounce English words. Several AI systems, or algorithms, have been
implemented on the Connection Machine in the late 1980s. For details of various
massively parallel AI applications and the inevitability of massive parallelism,
see [Waltz, 1990, Kitano, 1993, Kitano and Hendler, 1993].

Increasing attention to massively parallel architectures resulted in several ex-
perimental massively parallel machines such as IXM2 [Higuchi et al., 1991],
SNAP [Moldovan et al., 1990], and the J-Machine [Dally, 1990], and commer-
cial machines such as the MasPar MP-1 [MasPar Corporation, 1990], MP-2, Intel
PARAGON, CRAY T3D, NCR 3600, Fujitsu AP-1000, and Thinking Machines'
CM-5 [Thinking Machines Corporation, 1991].

There have been architectural evolutions. The first evolution is the use of more
powerful processors. While most of the first-generation massively parallel
machines have been equipped with fine-grained processors, as seen in the CM-2
(1-bit) and MP-1 (4-bit), newer generation machines employ 32-bit or 64-bit
processors. For example, the CM-5 uses a 32-bit SPARC chip and vector
processors for each node.

The second evolution is the emergence of the virtual shared memory (VSM)
architecture in commercial products. Kendall Square Research's KSR-1 employs
the ALLCACHE architecture, which is a hardware-supported VSM architecture.
In the KSR-1, the physical configuration resembles that of a distributed memory
architecture, but the logical configuration maintains a single memory space. This
architecture has advantages in ease of programming and porting of commercial
software. A significant part of the silicon area has been dedicated to maintaining
cache consistency.

The bottom line of these phenomena is that massive parallelism is economically
and technically inevitable. This warrants the central theme of the research
described in this book - a massively parallel architecture for speech-to-speech
translation.
2
CURRENT RESEARCH TOWARD
SPEECH-TO-SPEECH
TRANSLATION

This chapter describes some of the efforts toward speech-to-speech translation
systems and some related research efforts. Specifically, we describe the following
systems:

• SpeechTrans (CMU)
• SL-TRANS (ATR)
• JANUS (CMU)
• HMM-LR Parser (CMU)
• MINDS System (CMU)
• Knowledge Based Machine Translation System (CMU)

2.1 SPEECHTRANS
SpeechTrans [Tomita et al., 1989] [Tomabechi et al., 1989] is a Japanese-
English speech-to-speech translation system developed at the Center for Machine
Translation, Carnegie Mellon University. It translates spoken Japanese into
English and produces audio output. It operates in the doctor-patient domain.
The system consists of four parts:

• Speech recognition hardware (Matsushita Research Institute),


14 CHAPTER 2

          Output
          /a/    /o/    /u/    /i/    /e/    /j/    /w/   ...   (I)    (II)
Input
/a/       93.8    1.1    1.3   0      2.7    0      0     ...   0.9    5477
/o/        2.4   84.3    5.8   0      0.3    0      0.6   ...   6.5    7529
/u/        0.3    1.8   79.7   2.4    4.6    0.1    0     ...   9.7    5722
/i/        0.2    0      0.9  91.2    3.5    0.7    0     ...   2.9    6158
/e/        1.9    0      4.5   3.3   89.1    0.1    0     ...   1.1    3248
/j/        0      0      1.1   2.3    2.2   80.1    0.3   ...  11.4    2660
/w/        0.2    5.1    5.8   0.5    0      2.6   56.1   ...  11.2     428

(III)    327    176    564    512    290    864    212    ...

Table 2.1 A portion of a confusion matrix.

• Phoneme-based generalized LR parser,

• Natural language generator GenKit,


• Speech synthesis module (DECtalk).

The Matsushita custom speech recognition device [Morii et al., 1985] takes a
continuous speech utterance, such as 'atamagaitai' ('I have a headache.'), and
produces a noisy phoneme sequence. The speech recognition device only has
phonotactic rules which define possible adjacent phoneme combinations, but
does not have any syntactic or semantic knowledge. Also, it only produces
a single phoneme sequence, not a phoneme lattice. Therefore, we need some
mechanism to make the best guess based solely on the phoneme sequence
generated by the speech device. There are three types of errors caused by
the speech device: (1) substituted phonemes, (2) deleted phonemes, and (3)
inserted phonemes. SpeechTrans uses a confusion matrix to restore a possible
phoneme lattice from the noisy phoneme sequence. The confusion matrix is a
matrix which shows the acoustic confusability among phonemes. Table 2.1 shows
an example of the confusion matrix for the Matsushita speech recognition
device. In the table, (I) denotes the possibility of deleted phonemes; (II) the
number of samples; and (III) the number of times this phoneme has been
spuriously inserted in the given samples. When the device outputs /o/, it may
actually be an /o/ with a probability of 84.3%, or it may actually be an /a/
with a probability of 2.4%, and so forth.
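The table lookup behind this restoration step can be sketched as follows; only a few of the probabilities from Table 2.1 are encoded here, and the cutoff threshold is an illustrative device rather than part of the original system:

```python
# Sketch of confusion-matrix lookup for restoring phoneme candidates.
# Rows are the phonemes actually spoken, columns the phonemes the
# device output; values are percentages quoted from Table 2.1.
CONFUSION = {
    "a": {"a": 93.8, "o": 1.1, "u": 1.3, "e": 2.7},
    "o": {"a": 2.4, "o": 84.3, "u": 5.8, "e": 0.3, "w": 0.6},
    "u": {"a": 0.3, "o": 1.8, "u": 79.7, "i": 2.4, "e": 4.6, "j": 0.1},
}

def candidates(heard, threshold=1.0):
    """Phonemes that may actually have been spoken when the device
    output `heard`, keeping only entries above a cutoff threshold."""
    result = [(spoken, row.get(heard, 0.0)) for spoken, row in CONFUSION.items()]
    result = [(s, p) for s, p in result if p >= threshold]
    return sorted(result, key=lambda sp: -sp[1])

print(candidates("o"))   # /o/ is most likely; /u/ and /a/ are alternates
```

In the parser, each surviving candidate spawns a branch of the search, so the cutoff directly trades recall against search cost.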
Current Research toward Speech-to-Speech Translation 15

SpeechTrans uses the Phoneme-based Generalized LR Parser (ΦGLR; [Saito
and Tomita, 1988]), which is based on the Universal Parser Architecture [Tomita
and Carbonell, 1987] [Tomita, 1986] augmented to handle a stream of phonemes
instead of text. Unlike parsers for text input, the ΦGLR parser takes a stream of
phonemes. Therefore, the grammar is written in such a way that its terminal
symbols are phonemes, instead of words. An example of a grammar rule is:

Noun --> /w/ /a/ /t/ /a/ /s/ /i/

instead of

Noun --> "watasi".

This rule defines the correct phoneme sequence for the word watashi ('I'). The
SpeechTrans system has two versions of the grammar: one that utilizes modularized
syntactic (LFG) and semantic (case-frame) knowledge, merging them at run-time,
and another version which uses a hand-coded grammar with syntax and se-
mantics precompiled into one pseudo-unification grammar. For demonstration,
SpeechTrans uses the latter grammar due to its run-time speed.
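The rewriting from word-level to phoneme-level rules can be sketched mechanically; the lexicon entries and the arrow notation below are illustrative, not the actual grammar format:

```python
# Sketch: expanding a word-level rule into a phoneme-terminal rule,
# as the ΦGLR grammar requires. Lexicon entries are illustrative.
LEXICON = {
    "watasi": ["w", "a", "t", "a", "s", "i"],   # "I"
    "atama":  ["a", "t", "a", "m", "a"],        # "head"
}

def phoneme_rule(category, word):
    """Rewrite e.g.  Noun --> "watasi"  as  Noun --> /w/ /a/ /t/ /a/ /s/ /i/."""
    phonemes = " ".join(f"/{p}/" for p in LEXICON[word])
    return f"{category} --> {phonemes}"

print(phoneme_rule("Noun", "watasi"))
# Noun --> /w/ /a/ /t/ /a/ /s/ /i/
```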

The ΦGLR parser was developed to meet the following requirements:

1. A very efficient parsing algorithm, since parsing of a noisy phoneme se-
quence requires much more search than conventional typed-sentence pars-
ing.
2. The capability to compute a score for each hypothesis, because SpeechTrans
needs to select the most likely hypothesis out of multiple candidates and to
prune out unlikely hypotheses during the parse.

The error recovery strategies of the ΦGLR parser are as follows [Nirenburg et
al., 1989a]:

• Substituted phonemes: Each phoneme in the phoneme sequence may have
been substituted and thus may be incorrect. The parser has to consider all
these possibilities. We can create a phoneme lattice dynamically by placing
alternate phoneme candidates in the same location as the original phoneme.
Each possibility is then explored by

Input Translation
atama ga itai I have a headache
me ga itai I have a pain in my eyes
kata ga koru I have a stiff shoulder
asupirin wo nonde kudasai Please take an aspirin
arukuto koshi ga itai When I walk, I have a pain in my lower back

Table 2.2 Examples of Sentences Processed by SpeechTrans

each branch of the parser. Not all phonemes can be altered to any other
phoneme. For example, while /o/ can be mis-recognized as /u/, /i/ can
never be mis-recognized as /o/. This kind of information can be obtained
from the confusion matrix discussed above.
With the confusion matrix, the parser does not have to exhaustively create
alternate phoneme candidates.
• Inserted phonemes: Each phoneme in a phoneme sequence may be an
extra one, and the parser has to consider these possibilities. We have one
branch of the parser consider an extra phoneme by simply ignoring the
phoneme. The parser assumes that at most two inserted phonemes can ex-
ist between two real phonemes, and we have found this assumption quite
reasonable and safe.

• Deleted phonemes: Deleted phonemes can be handled by inserting possible
deleted phonemes between two real phonemes. The parser assumes
that at most one phoneme can be missing between two real phonemes.
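The three strategies amount to expanding the noisy sequence into a set of hypothesized 'real' sequences. A minimal sketch of that expansion follows; in the actual parser the expansion is interleaved with the ΦGLR search and pruned by scores, and the confusabilities and phoneme inventory here are made up:

```python
# Sketch of the three error-recovery strategies as hypothesis expansion
# over a noisy phoneme sequence. ALTERNATES stands in for the confusion
# matrix; PHONEMES is the inventory used to hypothesize deletions.
ALTERNATES = {"o": ["u"], "u": ["o", "i"]}   # illustrative confusabilities
PHONEMES = ["a", "o", "u", "i", "e"]

def expand(seq, max_len=6):
    """Enumerate candidate 'real' sequences: substitutions, one ignored
    (inserted) phoneme, and one restored (deleted) phoneme per position."""
    if not seq:
        return {()}
    head, rest = seq[0], seq[1:]
    out = set()
    for tail in expand(rest, max_len):
        out.add((head,) + tail)                 # phoneme was correct
        for alt in ALTERNATES.get(head, []):
            out.add((alt,) + tail)              # substituted phoneme
        out.add(tail)                           # inserted phoneme: ignore it
        for p in PHONEMES:
            out.add((p, head) + tail)           # deleted phoneme: restore it
    return {h for h in out if len(h) <= max_len}

hyps = expand(("o", "u"))
# substitution, insertion, and deletion hypotheses are all present:
print(("u", "u") in hyps, ("u",) in hyps, ("a", "o", "u") in hyps)
```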

These strategies effectively restore a possible phoneme lattice and provide
multiple parsing hypotheses. However, some of the hypotheses are pruned
out according to the following heuristics:

• Discarding the low-score shift-waiting branches when a phoneme is applied.

• Discarding the low-score branches in a local ambiguity packing.

Table 2.2 shows examples of sentences and their translations in the SpeechTrans
system.

The main advantage of the approach taken in the SpeechTrans system is its
portability. Once we can create a confusion matrix, the system can be quickly
adapted to other speech recognition systems, as long as they produce phoneme
sequences. For relatively small domains, we can attain reasonably accurate
translation at a reasonable, though not surprisingly quick, processing speed.

The problem with this method, however, is that the language model does not
provide feedback to the speech recognition module. It simply gets one phoneme
at a time and restores a possible phoneme lattice to be used for parsing. The
ΦGLR parser does not make predictions on possible next phonemes. Since the
perplexity reduction effect of top-down prediction from the language model is
considered to be effective, this shortcoming may be a serious flaw in this
approach. This problem obviously led to the development of more tightly coupled
models such as the HMM-LR parser, which will be described later.

In summary, however, the SpeechTrans system is an important landmark sys-
tem which led to several more sophisticated systems developed later.

2.2 SL-TRANS
SL-TRANS [Ogura et al., 1989] is a Japanese effort to develop a speech-to-
speech dialogue translation system, undertaken by the ATR Interpreting Telephony
Research Laboratories. SL-TRANS translates spoken Japanese into English,
naturally, on the ATR conference registration domain.

SL-TRANS is composed of an HMM speech recognition system combined with a
predictive LR parser [Kita et al., 1989], the NADINE dialogue translation system
[Kogure et al., 1990], and the DECtalk speech synthesizer.

For the speech recognition module, they introduced discrete HMM
phoneme recognition with improvements over the standard model, using a new
duration control mechanism, separate vector quantization, and fuzzy vector
quantization.

In order to better constrain the search space, SL-TRANS employs the HMM-LR
method, which combines an HMM speech recognition module and a modified ver-
sion of the generalized LR parser. The LR parser is used to predict the next possible
phonemes in the speech input. Obviously, the grammar is written in such a way
that the terminal symbols are phonemes instead of words, as in conventional
grammars. The grammar for the HMM-LR parser covers the entire domain
of the ATR corpus, but its scope is limited to the intra-phrase (bunsetsu) level.
Predictions made by the LR parser are passed to the HMM phoneme verifier to
verify the existence of the predicted phonemes. Multiple hypotheses are created and
maintained during the parsing process. With a vocabulary of 1,035 words and
trained with 5,240 words, the HMM-LR parser attains an 89% phrase recognition
rate.

Translation is carried out by the NADINE dialogue translation system. NA-
DINE uses Head-driven Phrase Structure Grammar (HPSG) [Pollard and
Sag, 1987] as its grammar formalism. NADINE consists of analysis, transfer,
and generation modules.

The analysis module has a phrase structure analysis module and a zero-pronoun
resolution module. The parser is based on an active chart parser and uses the
Typed Feature Structure Propagation (TFSP) method. The parser outputs the
feature structure with the highest analysis score. The analysis score is based on
syntactic criteria such as phrase structure complexity and degree of left-branching,
syntactic-semantic criteria such as missing obligatory elements, and pragmatic
criteria such as a pragmatic felicity condition violation penalty.

The score of each hypothesis is obtained by the following equation:

Score(x) = a1 S(x) + a2 Nt(x) + a3 Nu(x) + a4 Np(x)

where S(x) is the speech recognition score, Nt(x) is the number of nodes in
the syntactic tree, Nu(x) is the number of unfilled obligatory elements, and Np(x)
is the number of pragmatic constraint violations. The weights a1, a2, a3, a4 are
decided experimentally.
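Under this reading, the score is a simple weighted sum; the weight values below are purely illustrative (in SL-TRANS they are decided experimentally), with negative weights acting as penalties:

```python
# Sketch of the hypothesis score as a weighted sum of the four terms
# named in the text. Weight values are illustrative only.
def score(S, Nt, Nu, Np, a1=1.0, a2=-0.1, a3=-0.5, a4=-0.5):
    """S: speech recognition score; Nt: nodes in the syntactic tree;
    Nu: unfilled obligatory elements; Np: pragmatic violations."""
    return a1 * S + a2 * Nt + a3 * Nu + a4 * Np

# With equal speech scores, the hypothesis with fewer violations wins:
print(score(10.0, 12, 0, 0) > score(10.0, 12, 1, 1))   # True
```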

Figure 2.1 shows an analysis result for the sentence Kaigi eno touroku wo
sitainodesuga.

Table 2.3 shows the accuracy of speech recognition, sentence filtering, and trans-
lation.

The SL-TRANS system is a heavy-duty system using the HMM-LR method,
HPSG-based parsing and generation, and the intension transfer method. The
SL-TRANS architecture provides appropriate feedback from local syntax to the
speech recognition module using the HMM-LR method. Also, the SL-TRANS

[[reln REQUEST]
 [agen !sp *speaker*]
 [recp *hearer*]
 [manner indirect]
 [obje [[reln SURU]
        [agen !sp]
        [obje [[parm !x <TOUROKU>]
               [restr [[reln MODIFY]
                       [arg1 !x]
                       [sloc <KAIGI>]]]]]]]]

Figure 2.1 An example of sentence analysis result

Bunsetsu (phrase)    81.5% (First)
recognition          93.2% (Top 5 hypotheses)
Sentence Filtering   5 -> 1.8 candidates on average
                     Including correct candidates: 76.0%
Correct Parse        68.8% (First)
                     74.6% (Top 3 choices)

Table 2.3 Performance of the SL-TRANS



system employs fully developed HPSG-based sentence analysis and advanced
linguistic processing strategies such as the intension transfer method. These
features make SL-TRANS an important testbed for modern linguistic theories.

One weakness of SL-TRANS is, however, that it has two separate parsers:
the predictive LR parser in the HMM-LR module, and the active chart parser
in the language analysis module. This is obviously a redundant architecture,
and changes of grammar made in one of the parsers need to be reflected
in the other parser in a consistent manner, which is perhaps a costly process.
Also, predictions at the sentence level are not fed back to the speech recog-
nition module, because the grammar for the predictive LR parser only deals with
the intra-phrase level. However, these problems are relatively trivial - they are
matters of design decision, not theoretical limitations - so they can
be remedied easily.

2.3 JANUS
JANUS is yet another speech-to-speech translation system developed at
Carnegie Mellon University [Waibel et al., 1991]. Unlike other systems, which
largely depend upon statistical methods of speech recognition such as Hidden
Markov Models, JANUS is based on a connectionist speech recognition mod-
ule. The Linked Predictive Neural Network (LPNN) [Tebelskis et al., 1991] offers
highly accurate, continuous-speech, large-vocabulary speech recognition
capability. When combined with a statistical bigram grammar whose perplex-
ity is 5 and whose vocabulary is 400 words, the LPNN attains 90% sentence
accuracy within the top 3 hypotheses.

The system organization is as follows: the LPNN speech recognition module, a con-
nectionist parser or an LR parser for parsing, the GenKit sentence generation
module, Digital Equipment Corporation's DECtalk DTC01 for German voice syn-
thesis, and the Panasonic Text-to-Speech System EV-3 for Japanese output. There
are two versions of JANUS: JANUS1, which uses an LR parser, and JANUS2,
which uses a connectionist parser. JANUS translates English sentences into
Japanese and into German, on the ATR conference registration domain.

The LPNN is based on canonical phoneme models which can be concatenated
in any order using a linkage pattern to create a template as a word model. A
predictive neural network serves as a phone model, as in an HMM. The network
predicts the next frame of speech. The network is trained through three steps:

a forward pass, an alignment step, and a backward pass.

We briefly describe the three-step training algorithm of the LPNN on a word
(from [Tebelskis et al., 1991, Waibel et al., 1991]):

1. Forward pass: For each input speech frame at time t, the frames at time
t - 1 and t - 2 are fed into all the networks that are linked into this word.
Each of these nets then makes a prediction of frame(t), and the prediction
errors are computed and stored in a matrix.

2. Alignment step: Dynamic programming is applied to the prediction error


matrix to find the optimal alignment between the speech signal and the
phoneme models.
3. Backward pass: Errors are propagated backward along the alignment path.
For each frame, the error is back-propagated into the network that best
predicted the frame according to the alignment. Note that this alignment-
controlled back-propagation causes each subnetwork to specialize on a dif-
ferent section of speech, resulting eventually in a model for each phoneme.
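The alignment step is the pivot of this loop: given the matrix of prediction errors from the forward pass, dynamic programming decides which network receives each frame's back-propagated error. A simplified pure-Python sketch follows (the error matrix would come from stand-in predictors, not the actual LPNN networks):

```python
import math

def align(E):
    """E[t][i]: prediction error of phoneme network i on frame t.
    Networks are visited left to right (the linkage order of the word
    model); returns the network index assigned to each frame."""
    T, N = len(E), len(E[0])
    cost = [[math.inf] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    cost[0][0] = E[0][0]
    for t in range(1, T):
        for i in range(N):
            stay = cost[t - 1][i]                            # same network
            advance = cost[t - 1][i - 1] if i else math.inf  # next network
            back[t][i] = 0 if stay <= advance else 1
            cost[t][i] = E[t][i] + min(stay, advance)
    path, i = [], N - 1                                      # backtrack
    for t in range(T - 1, 0, -1):
        path.append(i)
        i -= back[t][i]
    path.append(i)
    return path[::-1]

# Frames 0-1 are best predicted by network 0, frames 2-3 by network 1,
# so the backward pass would train each network on its own segment:
E = [[0, 9], [0, 9], [9, 0], [9, 0]]
print(align(E))   # [0, 0, 1, 1]
```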

JANUS1 uses the generalized LR parser. The grammar rules are hand-written
to cover the entire ATR conference registration domain. In this implementation,
a semantic grammar has been used, with notations similar to Lexical Func-
tional Grammar. Figure 2.2 shows a recognition result of the LPNN, the parser
output, and the translation results.

The connectionist parser in JANUS2 has a highly modular architecture. There
are several modules, such as word-level feature units, phrase-level units, and
structure-level units. Each lexical entry node is linked to a set of nodes which
constitutes a feature unit. For example, the word John will activate features such
as Proper, Animate, and Human. Phrases are represented as head words and
their modifiers, using a noun block, a verb block, feature units,
and gating units. The gating units control the behavior of the phrase-level
module. The structure-level module consists of nodes representing possible
roles of each phrase: agent, patient, recipient, prepositional mod-
ification, relative clause, and subordinate clause. [Jain, 1990] reports some
interesting behaviors of the connectionist parser, including dynamic behavior,
generalization, and robustness. Figure 2.3 shows the LPNN recognition result,
the connectionist parser output, and the translation results. Note that a numeric
value is assigned to each case-role, indicating the activation
level of the concept to be filled into the case-role slot.

LPNN output:
(HELLO IS THIS THE OFFICE FOR
THE CONFERENCE $)

Parser's interlingual output:


((CFNAME *IS-THIS-PHONE)
 (MOOD *INTERROGATIVE)
 (OBJECT ((NUMBER SG)
          (DET THE)
          (CFNAME *CONF-OFFICE)))
 (SADJUNCT1
  ((CFNAME *HELLO))))

Japanese translation:
MOSHI MOSHI KAIGI JIMUKYOKU DESUKA

German translation:
HALLO IST DIES DAS KONFERENZBUERO

Figure 2.2 JANUS using the generalized LR parser.

The JANUS system reports translation accuracy to be over 80%. Specifically,
JANUS1 attains 89.7% and JANUS2 attains 82.8% translation accuracy with
N-best recognition hypotheses (Table 2.4). With the first hypothesis only,
JANUS1 attains 77.0% and JANUS2 attains 78.2% (Table 2.5).

It should be noted that JANUS2 outperformed JANUS1 in the first-
hypothesis case, but not in the N-best case. This is because the connec-
tionist parser simply provides the best available output from the first N-best
Results JANUS1 JANUS2


Correct recognition and translation 76 66
Incorrect recognition but correct translation 2 6
Total correct translation 78 (89.7%) 72 (82.8%)

Table 2.4 Performance of JANUS1 and JANUS2 on N-Best Hypotheses



LPNN output:
(HELLO IS THIS THE OFFICE FOR
THE CONFERENCE $)

Connectionist parse:
((QUESTION 0.9)
 ((GREETING 0.8)
  ((MISC 0.9) HELLO))
 ((MAIN-CLAUSE 0.9)
  ((ACTION 0.9) IS)
  ((AGENT 0.9) THIS)
  ((PATIENT 0.8) THE OFFICE)
  ((MOD-1 0.9) FOR THE CONFERENCE)))

Japanese translation:
MOSHI MOSHI KAIGI JIMUKYOKU DESUKA

German translation:
HALLO IST DIES DAS KONFERENZBUERO

Figure 2.3 JANUS using the connectionist parser.

candidate, even when the correct hypothesis is in the second or third
best place. When only one word sequence is given, as in the first-hypothesis
case, JANUS2 is better because it provides the best guess, hopefully a cor-
rect one. This characteristic of the connectionist parser comes from the nature
of neural networks: the network does not retain the correct instances given at the
training stage. The neural network simply changes its weights and makes general-
izations. This means that the neural network does not know how far the input
is from the known training data, and thus it does not have a means to tell how bad
Results JANUS1 JANUS2


Correct recognition and translation 65 63
Incorrect recognition but correct translation 2 5
Total correct translation 67 (77.0%) 68 (78.2%)

Table 2.5 Performance of JANUS1 and JANUS2 on the First Hypothesis



the answer could be. It would be particularly interesting to develop a method
to assess the goodness of the solution given by the neural network, perhaps by
combining it with the memory-based approach.

2.4 MINDS
The MINDS system [Young et. al., 1989] is an spoken input user interface
system for data-base query on the DARPA resource management domain. The
speech recognition part is the SPHINX system [Lee, 1988] with 1,000 words
vocabulary. The main feature of the MINDS system is on its layered pre-
diction method to reduce perplexity. The basic diea to accomplish reduction
of perplexity is the use of plan based constraints by tracking all information
communicated (user questions and database answers).

The MINDS system uses the following knowledge:

• knowledge of problem-solving plans and goals, represented hierarchically,

• a finite set of discourse plans,
• semantic knowledge about the application domain's objects, attributes and
their interrelations,
• knowledge about methods of speaking, local and global focus,
• dialog history knowledge about information previously communicated,
• discrete models of user domain expertise,
• information about user preferences for ordering conjunctive subgoals.

The introduction of layered predictions has reduced perplexity significantly. They
reported that the test set perplexity was reduced from 242.4 with grammar to
18.3 with layered predictions (Table 2.6).

The significant accomplishment of the MINDS system is that it demonstrates
that high semantic accuracy can be obtained by using pragmatic levels of knowl-
edge. For those types of tasks which have a highly goal-oriented nature and
highly predictable topic transitions, the approach taken in the MINDS system
works well. Since there are many such tasks, it would be a useful method for
practical systems.

Recognition Performance
Constraints used: grammar layered predictions
Test Set Perplexity 242.4 18.3
Word Accuracy 82.1 96.5
Semantic Accuracy 85% 100%
Insertions 0.0% 0.5%
Deletions 8.5% 1.6%
Substitutions 9.4% 1.4%

Table 2.6 Performance of the MINDS system

However, it is questionable whether the method can be useful in more compli-
cated domains such as telephone dialogues, which are less goal-oriented, mixed-
initiative, and have very unpredictable topic transitions. Investigation of
methods to account for such domains would be a future issue in this direc-
tion.

2.5 KNOWLEDGE-BASED MACHINE


TRANSLATION SYSTEM
The Knowledge-Based Machine Translation (KBMT) system [Nirenburg et al., 1989a]
is an approach to providing high-quality translation using extensive knowledge
of languages and of the domain of translation.

The KBMT system has the following features:

• Translates between English and Japanese, bi-directionally,

• Uses an interlingua paradigm of translation,


• Computational architecture is a distributed, coarsely parallel system, and

• Domain is personal computer installation and maintenance manuals.

The system size, measured by the size of the knowledge base, is about 1,500
concepts for the domain model, 800 words for Japanese, and 900 words for English.

KBMT uses a set of modular components developed at the Center for Machine
Translation. These are the FRAMEKIT frame-based knowledge representation sys-
tem [Nyberg, 1988], the generalized LR parser, a semantic mapper for treating
additional semantic constraints, an interactive augmentor for resolving remain-
ing ambiguities [Brown, 1990], and the semantic and syntactic generation mod-
ules [Nirenburg et al., 1988b]. In addition to these modules, KBMT's
knowledge base and grammar were developed using the ONTOS knowledge
acquisition tool [Nirenburg et al., 1988a] and a grammar writing environment.

In terms of grammar formalisms, KBMT employs a specialized grammar
based on Lexical Functional Grammar (LFG) and uses pseudo-unification, in-
stead of full unification, as its unification operation.

The system was tested on 300 sentences without pre-editing, though some of
the sentences could not be translated automatically.

2.6 THE HMM-LR METHOD


The HMM-LR method [Kita et al., 1989] combines Hidden Markov Models
(HMMs) and the generalized LR parsing technique. The basic idea is for the
generalized LR parser to provide a set of possible next phonemes (these can be
words or syllables when the method is applied at those levels), while the HMM
verifier returns their probabilistic values. Since the generalized LR parser in this
method provides predictions, it is called the predictive LR parser. The
predictions are made using the pre-compiled LR parsing table. In addition to
grammatical ambiguity, confusable phone variations also split the parsing stack.

All partial parses are represented using the graph-structured stack, and each
partial parse has its probability based on the probability measure from the
HMM phone verifier. Partial parses are pruned when their probability falls
below a predefined threshold. The HMM-LR method uses the BEAM search
[Lowerre, 1976] for this pruning. In case multiple hypotheses have survived at
the end, the one with the highest probability is selected.

Similar to the ΦGLR parser, the grammar should have phone names as its
terminal symbols, instead of words. A very simple example of context-free
grammar rules with a phonetic lexicon is as follows:

(a) S --> NP VP
(b) NP --> DET N

(c) VP --> V
(d) VP --> V NP
(e) DET --> /z/ /a/
(f) DET --> /z/ /i/
(g) N --> /m/ /ae/ /n/
(h) N --> /ae/ /p/ /a/ /l/
(i) V --> /iy/ /ts/
(j) V --> /s/ /ih/ /ng/ /s/

Rule (e) represents the definite article the pronounced /z/ /a/ before con-
sonants, while rule (f) represents the pronounced /z/ /i/ before vowels.
Rules (g), (h), (i) and (j) correspond to the words man, apple, eats and sings,
respectively.
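At the lexical level, the predictive step over rules (e)-(j) can be illustrated as below; a real predictive LR parser derives the next-phoneme set from the pre-compiled LR parsing table and predicts across rule boundaries, so this is only a simplified sketch:

```python
# Phonetic expansions of rules (e)-(j) above.
RULES = {
    "DET": [["z", "a"], ["z", "i"]],
    "N":   [["m", "ae", "n"], ["ae", "p", "a", "l"]],
    "V":   [["iy", "ts"], ["s", "ih", "ng", "s"]],
}

def predict_next(category, prefix):
    """Phonemes that may come next within `category` after `prefix`;
    each predicted phoneme would then be verified by the HMM."""
    nexts = set()
    for expansion in RULES[category]:
        if expansion[:len(prefix)] == list(prefix) and len(expansion) > len(prefix):
            nexts.add(expansion[len(prefix)])
    return sorted(nexts)

print(predict_next("DET", ["z"]))   # ['a', 'i']: "the" before consonant or vowel
```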
3
DESIGN PHILOSOPHY BEHIND
THE ΦDMDIALOG SYSTEM

3.1 INTRODUCTION
This chapter discusses the ideas behind the model of spoken language translation
implemented as the ΦDMDIALOG system. The three major design decisions re-
garding its basic framework were:

• use of memory-based parsing and generation

• use of massively parallel computing model

• use of marker-passing scheme

These design decisions were made by analyzing knowledge and parallelism in-
volved in the speech-to-speech translation task. Table 3.1 shows how different
levels of parallelism are involved in the speech-to-speech translation task (this
table is by no means exhaustive).

It is clear that no single level of parallelism or knowledge alone enables this
task; rather, it is a hybrid of various levels of parallelism and knowledge that
enables such a highly intelligent task. The underlying idea in designing the
system, therefore, is that of hybrid parallelism [Kitano, 1989c]. Under this
concept, heterogeneity is the foundation of higher intelligence, and various levels
of knowledge and parallelism need to be integrated to accomplish intelligent
tasks. At the physiological level, the human brain is composed of billions
of neurons which execute massively parallel computing. At the functional level,
we can observe coarse-grain symbolic operations. Although we believe that even
30 CHAPTER 3

                symbolic                                         sub-symbolic
computational   constraint-based      memory-based               numeric and neural
model
Granularity     Coarse-grain          Fine- or coarse-grain      Fine-grain
Parallelism     Medium                Massive or medium          Massive
Speech          Phonological          Phoneme sequence           Neural network or
processing      rules                 recognition                stochastic model
Syntax and      Unification grammar,  Memory-based               Ambiguity resolution,
semantics       semantic restriction  parsing, generation        contextual priming
Discourse       Plan recognition      Case-based dialogue        Context recognition
                                      understanding

Table 3.1 Knowledge and parallelism involved in the speech translation task

such operations can be simulated from the neural level in the future, it is more
computationally efficient, at this moment, to carry out symbolic processing or
a hybrid of systems. ΦDMDIALOG is an instance of this idea. In ΦDMDIALOG,
various parallel operations - from parallel numeric computations (mostly, but
not exclusively, at the phonological level) to parallel constraint satisfaction (at
the syntactic/semantic and discourse levels) - are integrated. Although sub-
symbolic processing is experimentally incorporated in ΦDMDIALOG, we focus
our discussion on symbolic processing using a parallel marker-passing scheme.

Since we believe a mixture of various levels of parallelism and knowledge is
necessary to carry out intelligent activities, we do not confine our model to
memory-based processing. ΦDMDIALOG integrates memory-based processing
and unification- or constraint-based processing¹. Nevertheless, memory-based
processing is a notable characteristic of our system and has not been investi-
gated in past studies on machine translation.

These design decisions to use parallel marker-passing and memory-based pro-
cessing naturally lead us to use massively parallel computers for our experi-
mental implementations. While our model involves computationally expensive
symbolic operations which are best performed on coarse-grained computers,
massively parallel machines provide us with the opportunity to evaluate a part of
our model, and allow us to investigate the possible future development of hybrid
parallel machines that satisfy the various requirements of highly intelligent systems.
¹The role of syntactic knowledge in the system changes from the early version of ΦDMDIALOG
to the more recent version of the model presented in Chapter 7. This reflects a change in the
author's perception in the course of this research program.
Design Philosophy 31

3.2 MEMORY-BASED APPROACH TO NATURAL LANGUAGE PROCESSING
Memory-based parsing was inspired by the concepts of memory-based reasoning
and case-based reasoning, which place memory at the basis of reasoning. First,
we will describe memory-based and case-based reasoning. Then, we will discuss
their application to natural language processing.

3.2.1 Memory-Based Reasoning and Case-Based Reasoning
Memory-based and case-based reasoning place memory at the foundation of
intelligence. These paradigms are, by definition, memory-intensive and usu-
ally assume the use of massively parallel machines for the implementation of
practical systems. The basic idea is that if the system has a large set of past
experiences, a new problem presented to the system can be solved by retrieving
similar cases from memory and adapting them to the case under consideration.
Stanfill and Waltz describe memory-based reasoning as:

The traditional assumption in artificial intelligence (AI) is that
most expert knowledge is encoded in the form of rules. We consider the
phenomenon of reasoning from memories of specific episodes, however,
to be the foundation of an intelligent system, rather than an adjunct
to some other reasoning method. [Stanfill and Waltz, 1986]

Riesbeck and Schank describe their case-based reasoning as:

Case-based reasoning means reasoning from prior examples. A
case-based reasoner (CBR) has a case library. In a problem-solving
system, each case would describe a problem and a solution to that
problem. The reasoner solves new problems by adapting relevant cases
from the library. [Riesbeck and Schank, 1989]

Obviously, the ideas behind memory-based and case-based reasoning are sim-
ilar. The only difference, perhaps, is that case-based reasoning attaches more
importance to so-called case adaptation. Case adaptation is the process of adapt-
ing retrieved cases to the case under consideration in order to offer a solution. In

memory-based reasoning, more importance is placed on how to collect a large
volume of cases covering the possible problem space so that adaptation can be
minimal. Regardless of such differences, both adopt the same basic stance, that
reasoning is based on memory, and we will use the terms memory-based rea-
soning and case-based reasoning interchangeably.
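The retrieve-and-adapt cycle described above can be sketched in a few lines of Python. This is purely an illustrative toy (the case library, similarity measure, and all names are invented here), not the mechanism of any system discussed in this book:

```python
# A minimal sketch of memory-based reasoning. Each case pairs a feature
# tuple (the problem) with a solution; a new problem is solved by
# retrieving the most similar stored case.

def similarity(a, b):
    """Count matching features; a crude stand-in for a real metric."""
    return sum(1 for x, y in zip(a, b) if x == y)

def solve(case_library, problem, threshold=1):
    """Return the solution of the nearest case, or None ("not knowing")."""
    best = max(case_library, key=lambda case: similarity(case[0], problem))
    if similarity(best[0], problem) < threshold:
        return None          # no relevant case: fall back to another method
    return best[1]           # case adaptation would happen here in full CBR

library = [(("rainy", "cold"), "stay-home"),
           (("sunny", "warm"), "picnic")]
print(solve(library, ("sunny", "warm")))
print(solve(library, ("foggy", "hot")))  # no similar case: "not knowing"
```

Note how the explicit None return models the "state of not knowing" discussed below, which a connectionist network cannot report.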

Some advantages of memory-based reasoning have been analyzed [Stanfill and
Waltz, 1986]; here, we list the advantages particularly relevant to our research.

First, memory-based reasoning attains higher performance than rule-based sys-
tems when it is implemented on massively parallel computers. A case is,
generally, a larger chunk than what is expressed in the form of a rule. By
using this larger chunk as the unit of reasoning, memory-based reasoning avoids
combinatorial explosion in the search for a solution.

Second, memory-based reasoning is more scalable, since the addition of cases,
even by non-experts, can augment the system's capability. In a rule-based system,
the addition of rules might have an unexpected impact on the entire reasoning
process, and thus extensive analysis by an expert of the specific implementation
is required. Also, with each addition of rules, the system's performance degrades
significantly (due to combinatorial explosion), whereas the performance of a
memory-based reasoning system degrades only linearly with the additional
knowledge.

Third, memory-based reasoning recognizes the state of "not knowing". When
the system fails to find similar cases, a memory-based reasoning system realizes
the absence of a relevant case. Thus, the system can employ other reasoning
paradigms that are available. In comparison, a connectionist approach cannot
tell whether it has the correct results or not.

3.2.2 Application to Natural Language Processing
The memory-based approach has been applied to natural language processing
by the direct memory access parser (DMAP) [Riesbeck and Martin, 1985]. In
DMAP, understanding is viewed as a state of memory network activation.
Riesbeck and Martin wrote:

Direct Memory Access Parsing views conceptual language analysis
as a problem of recognizing the structure in memory to which a text
is referring, not as a problem of building a meaning structure for
the text. That is, conceptual analysis is a memory search process
very similar to other recently developed dynamic memory inference
processes. [Riesbeck and Martin, 1985]

Since parsing is performed directly by accessing the memory, they claim DMAP
to be a "recognize-and-record" model of language understanding; this is in con-
trast to the "build-and-store" model used by traditional parsers. DMAP uses
hierarchically organized Memory Organization Packets (MOPs) [Schank, 1982]
as knowledge for understanding sentences. Marker-passing is used for marking
activated parts of the memory and for predicting the next concept to be activated.
This is a very attractive approach to natural language processing because of its
potential parallelism, use of cases for understanding, and contextual processing
capability. In fact, the initial design of ΦDMDIALOG incorporated major
features of the DMAP system.

Although our initial work was based on the DMAP model, we soon augmented
the model in various ways in order to overcome its problems and take advantage
of potential benefits which were not well exploited in DMAP. As a result of
these modifications, however, our model became quite different from DMAP.
In particular, the development of the memory-based generation method is a
significant addition to the memory-based paradigm. These augmentations will
be described in the relevant parts of this book.

An independent root of the ΦDMDIALOG system can be found in Nagao's pro-
posal of the Translation by Analogy principle. Nagao believes that "man does not
translate a simple sentence by doing deep linguistic analysis," and considered:

Man does the translation, first, by properly decomposing an input
sentence into certain fragmental phrases (very often, into case frame
units), then, by translating these fragmental phrases into other lan-
guage phrases, and finally by properly composing these fragmental
translations into one long sentence. The translation of each frag-
mental phrase will be done by the analogy translation principle with
proper examples as its reference... [Nagao, 1984]

Sato and Sumita took Nagao's idea to implement their experimental machine
translation systems. Sato developed MBT-I, -II, and -III [Sato and Nagao,

(Figure: an input sentence is covered by previous input sentences stored in the
case base; their previous translations are transformed and composed into a
synthesized derivation of the translation of the input.)

Figure 3.1 Translation as Analogy

1990, Sato, 1991a, Sato, 1993]. Sumita developed Example-Based Machine
Translation (EBMT) [Sumita and Iida, 1991].

These approaches to natural language processing can be illustrated as shown in
Figure 3.1.

Translation can be decomposed into several stages:

1. Identify memory of past sentences similar to the input.

2. Identify differences between the input and past examples.

3. Modify past examples of translation.

4. Compose translation of the input.
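As a toy illustration of the four stages, the following Python sketch retrieves a similar stored example, identifies the differing fragment, and adapts the stored translation. The example base, slot notation, and romanized Japanese are all invented for this sketch and are not drawn from the actual system:

```python
# A toy sketch of the four translation stages above. The example base,
# slot notation, and romanized Japanese are invented for illustration.

example_base = [
    # (source pattern with one slot, target pattern with the same slot)
    ("i would like to {verb} the conference",
     "kaigi ni {verb} shitai no desu ga"),
]
lexicon = {"attend": "shusseki", "participate in": "sanka"}  # toy transfer lexicon

def translate(sentence):
    for src_pat, tgt_pat in example_base:          # 1. retrieve a similar example
        prefix, suffix = src_pat.split("{verb}")
        if sentence.startswith(prefix) and sentence.endswith(suffix):
            # 2. identify the difference between input and example
            verb = sentence[len(prefix):len(sentence) - len(suffix)]
            if verb in lexicon:
                # 3-4. modify the stored translation and compose the output
                return tgt_pat.format(verb=lexicon[verb])
    return None

print(translate("i would like to participate in the conference"))
```

The transfer lexicon here does the work that similarity-based substitution does in a full example-based system.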

This process is drastically different from the traditional model of translation,
which is carried out by:

1. Construct a parse tree of the input sentence.

2. Construct an interlingua meaning representation from the parse tree.

3. Generate a translation from the interlingua meaning representation.

3.2.3 Rationale

Problems of Traditional Approach


Researchers who advocate the memory-based approach and industrial re-
searchers who actually develop large-scale commercial systems seem to agree
on what the problems with the traditional method of machine translation are.
These are:

Quality: The quality of translation is not sufficient for practical use with-
out pre- and post-editing. The need for pre- and post-editing to attain accept-
able translation quality is a fatal problem for current approaches, since
no pre- or post-editing would be allowed for spoken language translation.

Performance: The performance of most existing machine translation systems is
not good. It takes from a few seconds to a few minutes to translate one
sentence. This performance is insufficient to carry out real-time spoken
language translation or bulk text translation.

Robustness: Current machine translation systems assume inputs to be gram-
matical. However, this assumption does not stand in the real world. A signifi-
cant percentage of sentences in spoken dialogues are ungrammatical in one
way or another. Thus, a significant portion of the sentences must be pre-edited
or will result in translation failures.

Scalability: Current machine translation systems are difficult to scale up,
because the complexity of processing makes the system's behavior almost
intractable.

Grammar Writing: By the same token, grammar writing is very difficult,
since complex sentences have to be described by piecewise rules whose
behavior is hard to trace when they are added to the whole system.

Although substantial research effort and funding has been devoted to solving these
problems, no major breakthrough has been reported so far. This implies that
the basic approach taken in traditional systems faces a serious dead end, and
that a dramatically different paradigm is needed to overcome these problems.

While it is premature to conclude that the approach presented in this book is
the answer to these problems, it seems evident that the answer does not lie in
the direction of incremental improvements to the traditional paradigm.

Very Large Finite Space


The basic assumption of the memory-based approach is the concept of a Very
Large Finite Space (VLFS). Memory-based natural language processing is
justified because natural language occupies a VLFS, as opposed to an infinite
space.

One of the major objections against memory-based natural language process-
ing is that a memory-based system cannot cover the infinite number of sentences
which humans produce. This objection is based on the productivity-of-language
hypothesis. Chomsky postulated the productivity-of-language hypothesis, which
dictates that humans can produce an infinite number of sentences with a finite
set of rules. The author argues, however, that the productivity-of-language
hypothesis is false. It should be noted that in order to produce an infinite number
of sentences, one must have either an infinite vocabulary or infinite grammar
rules, or one must be able to produce sentences of infinite length. In other words,
the productivity of language is false when the following conditions are met:

1. Finite vocabulary

2. Finite grammar rules

3. Finite sentence length.

Obviously, the first and second conditions hold, since no individual, nor any
group of individuals, has an infinite vocabulary or infinite grammar rules. The
third condition also holds, since it is biologically impossible to produce sentences
of infinite length. Also, in practice, the length of sentences has a certain
upper bound.

An analysis using corpora from ATR, DARPA, and CNN Prime News shows
that most sentences are within 30-40 words in length. As shown in Figure 3.2, the
ATR and DARPA corpora show very similar characteristics: both have a peak
of sentences between 7 and 9 words long. CNN has longer sentences, though it
too peaks in the 7 to 13 word range. The maximum sentence length in the ATR
corpus was 19 words, and that of the DARPA corpus was 24. For CNN Prime
News it was 48, and for CNN Segmented, 35. 99.7% of the sentences from the
ATR, DARPA, and CNN Segmented corpora are less than 25 words in length
(Figure 3.3).
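The kind of corpus measurement summarized here can be reproduced with a few lines of Python. The three sentences below are a stand-in corpus for illustration; the actual analysis was run over the ATR, DARPA, and CNN corpora:

```python
# Sketch of the measurements behind Figures 3.2 and 3.3: sentence-length
# distribution and cumulative coverage. The corpus here is invented.
from collections import Counter

sentences = ["i would like to register for the conference",
             "is there a discount for members",
             "that would be helpful for me thank you very much"]

lengths = [len(s.split()) for s in sentences]
dist = Counter(lengths)                        # distribution by length (Fig. 3.2)
total = len(lengths)

def coverage(max_len):
    """Fraction of sentences no longer than max_len words (Fig. 3.3)."""
    return sum(1 for n in lengths if n <= max_len) / total

print(dist)
print(coverage(25))   # analogous to the 99.7%-under-25-words figure
```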

(Figure: percentage of sentences plotted against sentence length in words, for
the DARPA Resource Management, ATR Conference Registration, CNN Prime
News, and CNN (Segmented) corpora.)

Figure 3.2 Distribution by Sentence Length

(Figure: cumulative coverage in percent plotted against sentence length in
words, for the DARPA Resource Management, ATR Conference Registration,
CNN Prime News, and CNN (Segmented) corpora.)

Figure 3.3 Coverage by Sentence Length



Therefore, the number of possible sentences is not infinite. It is finite, but very
large in number; hence, a VLFS.

Similarity of Sentence Structures


Although natural language occupies a VLFS, it would be practically infeasible to
use the memory-based approach if there were no effective means to cover the
VLFS with a realistic amount of data. Fortunately, there are experimental data
and arguments to back up the memory-based approach.

First, it is widely recognized that sentences used in daily life are highly stereo-
typical. The following is an excerpt from the ATR corpus on conference regis-
tration:

I would like to register for the conference.
I would like to take part in the conference.
I would like to attend the conference.
I would like to attend the conference.
I would like to contribute a paper to the conference.
That would be helpful for me. Thank you very much.
I would like to know the details of the conference.
so I would like to cancel.
Then I would also like to go.
I would like to contribute a paper to the conference.
I would like to ask you about hotel accommodations for the conference.
Then I would like to make a reservation for the Hilton Hotel.

This KWIC view of the word "would" demonstrates that, in this specific cor-
pus, the usage of the word "would" is very limited. The possible sentence patterns
involving "would" can be expressed by three templates:

1. <I would like to *circumstance>

2. <I would also like to *circumstance>

3. <that would be *state>

By having a set of templates with semantic restrictions, we can save significant
computational cost during parsing. These templates can also be applied to choose
the appropriate translation.
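A minimal sketch of how such templates with semantic restrictions might be matched is shown below; the semantic-class table and the matching procedure are invented for illustration and are much simpler than the actual mechanism:

```python
# Illustrative sketch of matching an input against the three "would"
# templates above. Semantic classes and word lists are invented.

semantic_class = {
    "register for the conference": "*circumstance",
    "attend the conference": "*circumstance",
    "helpful for me": "*state",
}
templates = ["i would like to *circumstance",
             "i would also like to *circumstance",
             "that would be *state"]

def match(sentence):
    for t in templates:
        fixed, slot = t.rsplit(" ", 1)         # split off the slot symbol
        if sentence.startswith(fixed + " "):
            filler = sentence[len(fixed) + 1:]
            if semantic_class.get(filler) == slot:   # semantic restriction
                return t, filler
    return None

print(match("i would like to attend the conference"))
```

The semantic restriction on the slot is what prevents a spurious match when the filler is of the wrong class.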

By the same token, by having these examples of sentences, most new sentences
can be expected to be similar to one of these examples, so that a translation can
be reconstructed using the translation pairs of these examples. For example, the
input I would like to participate in the conference is similar to one of the examples,
I would like to attend the conference. Thus, the translation of the example, which
is 「会議に出席したいのですが」, can be used to create a correct translation of
the input, which is 「会議に参加したいのですが」.

As another example, the following is a KWIC view of the preposition "for"
from the ATR corpus:

Hello, is this the office for the conference?
I would like to register for the conference.
This is the office for the Conference.
How much will it cost if I apply for the conference right now?
Is there a discount for members?
That would be helpful for me.
The fee for participation is $500.
Would you spell your last name for me?
I sent in the registration form for the conference.
you have already paid $400 for your registration fee,
Would you spell your first name for me, Mr. Ohara?
Would you spell your last name for me please?
We'll also enclose special forms for your ....
We have a special form for the summary .
. you about hotel accommodations for the conference.
would like to make a reservation for the Hilton Hotel.
We'll be able to reserve rooms for you at either the Hilton Hotel

Although the single preposition "for" is used in English, different translations
must be made in Japanese depending upon the semantics of the modifiee. In
the memory-based approach, we store a set of parsing templates and the corre-
sponding translation templates. A set of templates for parsing and generation
for "for" is:

*office for *event → *event *office
*action for *event → *event に *action
*action for *person → *person に *action
*action for *object → *object に *action
*object for *action → *action *object
*object for *event → *event の *object
*object for *object → *object 用の *object
*object for *person → *person に *object
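The table above can be read as a lookup keyed on the semantic types of the two phrases joined by "for". The following toy sketch (with romanized particles and an invented semantic-typing table) illustrates the idea:

```python
# A toy rendering of a few of the transfer templates above: the Japanese
# particle chosen for "for" depends on the semantic types of the two
# phrases. The semantic typing is a hand-made stand-in; "ni" and "no"
# are romanizations of the particles in the table.

types = {"registration": "*action", "the conference": "*event",
         "a discount": "*object", "members": "*person"}

transfer = {  # (left type, right type) -> target-side pattern
    ("*action", "*event"):  "{right} ni {left}",
    ("*action", "*person"): "{right} ni {left}",
    ("*object", "*event"):  "{right} no {left}",
    ("*object", "*person"): "{right} ni {left}",
}

def translate_for(left, right):
    """Translate '<left> for <right>' by semantic-type template lookup."""
    pattern = transfer.get((types[left], types[right]))
    return pattern.format(left=left, right=right) if pattern else None

print(translate_for("registration", "the conference"))
# selects the *action-for-*event template
```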

Real Space vs. Possible Space


In addition to the stereotypical nature of sentences, it should be noted that humans
do not use all legitimate sentence structures. For example, it is almost incon-
ceivable for us to actually produce or listen to sentences such as The dog saw the
cat that chased the mouse into the house that Jack persuaded Susie to build and
John told the man Mary kissed that Bill saw Phil [Gibson, 1990]. These sen-
tences are syntactically legitimate, but difficult to comprehend for most people.
While the possibility of encountering, or accidentally producing, such sentences
remains, the likelihood of such events would be extremely small. Independently,
an empirical study shows that only 0.53% of possible sentences are actually
observed in large corpora. At the same time, a significant portion of sponta-
neous speech involves ungrammatical sentences. Table 3.2 shows statistics
of ill-formed sentences observed in two spontaneous speech dialogue corpora. Al-
though the statistics only count global ill-formedness, which concerns
breakdown of major sentence structure, a significant percentage of sentences in-
volves such ill-formedness. If we counted minor syntactic problems, this figure
would go up drastically. Thus, only a small part of the real space of natu-
ral language overlaps with the possible space. Since the memory-based approach
constructs its memory base using real corpora, it is expected that the approach
would cover only the real space and avoid covering the possible-but-unreal space
(Figure 3.4). Combined with similarity-based processing, it is conceivable that
a very large, but manageable, number of examples would effectively cover the
real space of natural language sentences.

Performance
The possibility of real-time performance is the real advantage of the memory-
based approach. From the performance point of view, traditional parsing is
a time-consuming task. A few seconds or even a few minutes are required to
complete the parsing of one sentence. Thus, the time complexity of parsing

Ill-formedness type    Corpus 1 (%)   Corpus 2 (%)
False start                19.6            6.7
Insertion                  16.1            6.7
Repeat                      7.1            5.8
No-end                      5.4           10.0
Dead end                    1.8            1.7
Total                      50.0           30.8

Table 3.2 Distribution of the global ill-formedness

(Figure: the real space of natural language sentences overlaps only partially
with the solution space covered by the rule-based approach, which also covers
the possible-but-unreal space.)

Figure 3.4 Real space and possible space



algorithms has been one of the central issues of research in parsing technology.
The most efficient parsing algorithm known to date is Tomita's generalized LR
parser, which takes less than O(n³) time for most practical cases [Tomita, 1986].
Although some attempts have been made to implement parallel parsing processes
to improve performance, the degree of parallelism attained by implementing
a parallel version of a traditional parsing scheme is rather low. In fact,
it is usually less than 100, and it takes about 0.3 seconds to parse a 10-word
sentence [Tanaka and Numazaki, 1989]. Moreover, performance degrades more
than linearly as the input sentence gets longer. Thus, no dramatic increase
in speed can be expected. In our model, serial rule application is eliminated
and replaced by a parallel memory-search process. This approach has not
been taken on serial machines because it would result in a trade-off between
improved efficiency due to the use of cases and degradation due to an increase
in search cost. By using massively parallel computers, however, we expect our
model to attain high-performance processing. Performance evaluation on
actual massively parallel machines will be given in Chapters 5 and 6 of this
book.

3.2.4 The Role of Rules


The commitment to the memory-based approach does not imply the elimination of
rules from the process of translation. In fact, one of the major focuses of this
research program is to identify the role of rules in natural language. Several
ideas and implementations have been introduced in ΦDMDIALOG and its
descendants.

In the original implementation of ΦDMDIALOG, rules have been used to cover
linguistic phenomena which are not covered by the limited number of examples and
templates. Although the memory-based model of natural language processing
seems an attractive idea, the use of only surface patterns, or near-surface patterns,
would be too memory-inefficient. Also, collecting large numbers of examples
could be an expensive enterprise. Although it is not possible to cover the entire
space of natural language using a set of rules at affordable cost, the use of
rules covers a wider range of linguistic phenomena when only a small number of
examples has been collected. Thus, at the early stage of the system, the use of
rules has advantages. The question is how to integrate the memory-based approach
and the rule-based approach in a consistent manner.

Our approach to this issue is to integrate memory-based processing and
unification-based processing. For phrases or idiomatic expressions which ap-
pear most frequently, memory-based processing will be used, and when there is
no match at the memory-based processing stage, unification-based processing
with wide coverage will be used. We merge these two levels of processing by
using a mechanism called feature aggregation. The goal of the integration is to
ensure that the computationally cheapest path is taken for any input, while
ensuring linguistic soundness.
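The intended control flow, memory-based first with rule-based processing as the wide-coverage fallback, can be sketched as below. Both components are stubs invented for illustration; the actual system integrates the levels through feature aggregation rather than a simple dictionary lookup:

```python
# Sketch of the two-level design described above: try the (cheap)
# memory-based match first, and fall back to (wide-coverage) rule-based
# processing only when no stored pattern fires. Both components are stubs.

memory_patterns = {
    "hello is this the office for the conference": "greeting-inquiry",
}

def memory_based(sentence):
    return memory_patterns.get(sentence)       # exact-match stub

def rule_based(sentence):
    return "parsed-by-rules"                   # wide-coverage fallback stub

def analyze(sentence):
    result = memory_based(sentence)            # cheapest path first
    return result if result is not None else rule_based(sentence)

print(analyze("hello is this the office for the conference"))
print(analyze("an unseen and rather unusual sentence"))
```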

An alternative approach exists which views rules as a monitor of the translation
process. In this view, rules do not play a central role in parsing or generating
sentences. A memory-based process handles the entire process. However, rules
are used to check whether word choice, style, and other constraints are sat-
isfied. In this division of labor, the memory-based process handles the autonomous
part of translation, which is difficult to formalize in an explicit manner. The
rule-based process handles the conscious part of translation, which can be explic-
itly formalized. Although this approach was not a central option in the
original ΦDMDIALOG system, Chapter 7 describes one of its descendants, the
Memoir system, which emphasizes this approach.

3.3 MASSIVELY PARALLEL COMPUTING


Massively parallel computing is the basic computing paradigm assumed in this
research program. Massive parallelism is considered to be a promising, and
indeed inevitable, direction for the future computing paradigm. Massively parallel
artificial intelligence [Kitano et al., 1991] has been proposed as a research
field for building intelligent systems using massive parallelism. This section
briefly describes the warrants for massive parallelism. For details of the arguments
on the problems of the traditional approach and the benefits of massively parallel
AI, see [Kitano, 1993].

Massively parallel computing is technically inevitable from the viewpoint of
computer architecture. The speed of uni-processor vector supercomputers is
saturating, and further improvements require prohibitive research and produc-
tion costs. Various factors, such as heat, memory access speed, clocks, and
switching speeds, impede performance improvements for high-performance uni-
processors. Thus, the most promising way to attain greater computing power is
to introduce large-scale parallelism. In fact, CRAY announced a development
plan for the massively parallel T3D supercomputer, and most other supercom-
puter producers are working on their own massively parallel supercomputer
projects.

From the AI perspective, massively parallel computing liberates us from the
computational constraints imposed over the past 30 years. The characteristics of
traditional sequential computers guided AI research in a particular direction.
Supposing a computer with 0.1 MIPS and 4M bytes of main memory is the only
available computing resource, it would be infeasible for researchers to venture
into computing- and memory-intensive paradigms, such as memory-based
reasoning and genetic algorithms. Within these hardware constraints, the best
approach has been a set of sequentially optimized, computationally cheap, and
memory-saving algorithms. While there are some grounds for these algorithms,
such an approach is not necessarily the best and correct approach in the light of
the state-of-the-art parallel computing paradigm.

Historically, the acquisition of greater computing power and memory space has
changed the way things are done. For example, it was once believed that a
grand-master-level chess system was attainable only by using extensive heuristics
from expert chess players. However, the history of computer chess proves
that the key factor for a stronger chess system is computing power. The
computing power of chess systems and the ratings of these systems have a di-
rect correspondence [Hsu et al., 1990]. Independently, the success of various
memory-based reasoning systems proves the importance of massive data streams.

Three massively parallel machines have been accessible to the author for this
project: the IXM2 associative memory processor, the SNAP semantic network
array processor, and the CM-2 Connection Machine. We have implemented
versions of the ΦDMDIALOG system on IXM2 and SNAP. Implementation on
the CM-2 is underway. Thus, in this book, we will describe the implementations
on IXM2 and SNAP. Both IXM2 and SNAP are marker-passing machines, and
are particularly suitable for memory-based processing due to their data-parallelism.

The SNAP project was started at the University of Southern California in 1983
by Dan Moldovan. The initial goal of the project was to develop a parallel
machine for semantic network processing. From 1990, the project became a
joint project between the University of Southern California and Carnegie Mellon
University. The joint project focuses on the development of the SNAP-1 and
the implementation of the ΦDMDIALOG system.

The IXM2 project was initiated by Tetsuya Higuchi at the Electrotechnical
Laboratory in Japan. Similar to the SNAP project, its main goal is to develop
a high-performance and cost-effective machine for semantic network process-
ing. During 1990-91, the IXM2 machine was installed at the Center for Ma-
chine Translation at Carnegie Mellon University. A version of the ΦDMDIALOG
model has been experimentally implemented on the IXM2.

3.4 MARKER-PASSING
Marker-passing has been used as the central means of carrying out inference on mas-
sively parallel computers. Marker-passing was first proposed by Scott Fahlman
[Fahlman, 1979] as a means to perform inferencing on a semantic network.
The marker in his NETL system was a bit marker. Charniak, however, was the
first to apply marker-passing to natural language processing [Charniak, 1983]:
a marker-passing mechanism was used to handle contextual processing outside
of a syntactic parser. He also extended the concept of marker-passing to include
numeric values and a source of activation in order to control the search space
and perform path evaluation. Numeric values were also used by Norvig [Norvig,
1986] to compute contextual recognition in story understanding. The path
evaluation idea was further extended by Hendler [Hendler, 1988], who allowed a
marker to carry the entire path of marker propagation. These marker-passing
methods have been used, however, as adjunct modules connected to the main
component of a system. Perhaps DMAP [Riesbeck and Martin, 1985] was the
first to propose a parsing method carried out entirely by marker-passing. In DMAP,
two types of markers (Activation markers and Prediction markers) are used to
carry out parsing. Our model is largely influenced by the ideas presented in
DMAP, and we have further augmented and redefined the model in the
ΦDMDIALOG system.

In our model, we have augmented traditional models of marker-passing by al-
lowing markers to carry information such as the source of activation, feature struc-
tures, and numeric values. Although the computational cost is much higher
than in traditional marker-passing, this augmentation enables marker-
passing to perform various complex tasks. We call our scheme stochastic struc-
tured marker-passing, because our markers carry probabilistic information and
feature structures along with other information.

From the aspect of propagation control, marker-passing in our model is dif-
ferent from most other marker-passing models. In traditional models, markers
propagate in an almost unrestricted manner: a marker may go through any link in
the semantic network. Some attempts have been made to constrain propaga-
tion through the use of branching factors, or of activation values that terminate
propagation. These approaches have been ad hoc, however, and did not pro-
vide a consistent solution to the problem. The usage of markers in our model is
quite different from these types of models. We consider a marker to be an object
that carries information through designated paths. In this aspect, our model
inherits the concept of marker-passing in the DMAP system. For example, a
certain type of marker called an A-Marker can only propagate upward along
IS-A links.
46 CHAPTER 3

We call this type of marker-passing Guided or Constrained Marker-Passing,
because there are propagation rules that guide or constrain the propagation paths
of markers.
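A guided marker-passing step of this kind can be illustrated with a toy memory network. Here an A-Marker is allowed to travel only upward along IS-A links; the network and all node names are invented for this sketch:

```python
# A minimal sketch of guided marker-passing: an A-Marker may travel only
# upward along IS-A links, so activation from a node reaches its abstract
# classes but nothing else. The network is a toy invented here.

isa = {  # child -> parent (IS-A links of a toy memory network)
    "hilton-hotel": "hotel",
    "hotel": "accommodation",
    "conference": "event",
}

def pass_a_marker(node):
    """Propagate an A-Marker from node up every IS-A link; return the
    activated nodes in order of activation."""
    activated = [node]
    while node in isa:            # follow only the designated link type
        node = isa[node]
        activated.append(node)
    return activated

print(pass_a_marker("hilton-hotel"))  # activates hotel, then accommodation
```

Restricting each marker type to designated link types is what distinguishes guided marker-passing from the unrestricted propagation of traditional models.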
4
THE ΦDMDIALOG SYSTEM

4.1 INTRODUCTION
ΦDMDIALOG is an experimental speech-to-speech dialog translation system de-
veloped at the Center for Machine Translation at Carnegie Mellon University.
It is one of the first experimental speech-to-speech translation systems cur-
rently up and running. The system employs a parallel marker-passing algorithm
as the basic architecture of the model. Speech processing, natural language
processing, and discourse processing are integrated to improve the speech
recognition rate by providing top-down predictions of possible next inputs. A
parallel incremental generation scheme is employed, and the generation process
and the parsing process run almost concurrently. Thus, a part of the utterance
may be generated while parsing is in progress. Unlike most machine translation
systems, where parsing and generation operate on different principles, our system
adopts common computational principles in both parsing and generation, and
thus allows the integration of these processes. The system has been publicly
demonstrated since March 1989.

The major features of the ΦDMDIALOG system include:

Memory-based approach to natural language processing: The central
assumption of the model is the memory-based approach to natural lan-
guage. The details of this idea have been described in the previous chapter.

Almost concurrent parsing and generation: Unlike traditional methods
of machine translation, in which a generation process is invoked after pars-
ing is completed, our model concurrently executes the generation process
during parsing [Kitano, 1989a]. Both the parsing and generation processes
employ parallel incremental algorithms. This enables our model to gen-
erate the translation of a part of the input utterance while the rest of the
utterance is still being parsed.

Dynamic participation of discourse and world knowledge: The model
attains highly interactive processing of knowledge from the morphopho-
netic level to the discourse level by representing that knowledge distribu-
tively in a memory network on which the actual computations are performed.
Discourse plan hierarchies for each participant in the dialog provide the
ability to handle complex mixed-initiative dialogs, and enable our system
to perform the discourse processing which is essential for simultaneous inter-
pretation.
The cost-based ambiguity resolution scheme: We adopted the cost-based scheme [Kitano et al., 1989a] to attain a highly interactive and uniform mechanism of ambiguity resolution. In this scheme, the least-cost hypothesis is selected from among the competing hypotheses. The idea behind this scheme is to view parsing and generation as processes of a dynamic system which converges to a global minimum through the path with the least cost. The scheme is consistent with several psycholinguistic studies [Crain and Steedman, 1985] [Ford, Bresnan and Kaplan, 1981] [Prather and Swinney, 1988].

Integration of Memory-Based and Rule-Based Processing: In the model, memory-based and rule-based processing are integrated using feature aggregation and constraint satisfaction methods. The integration of memory-based and rule/constraint-based processing provides both specificity of prediction and productive syntactic and discourse processing.

Massively parallel model: The model is built on a hybrid parallel paradigm [Kitano, 1989c], a parallel processing scheme which integrates a marker-passing algorithm and a connectionist network in a consistent manner. The marker-passing part of our model captures the flow of information during processing, and the connectionist network acts as a discriminator which selects one hypothesis out of multiple hypotheses. The marker-passing part of the model borrowed some ideas from the direct memory access (DMA) paradigm [Riesbeck and Martin, 1985] in the initial stage of our research; however, significant theoretical modifications have since been made, so the two models should be regarded as distinct. The choice of the massively parallel computation scheme was one of the crucial factors in implementing our model.

The ΦDMDIALOG system described in this chapter has been implemented on an IBM RT-PC workstation and an HP-9000 workstation using CommonLisp and MultiLisp running on the Mach OS. Speech input and voice synthesis are handled by connected hardware systems, namely Matsushita Institute's Japanese speech recognition device [Morii et al., 1985] and DECTalk. The concept hierarchy of the memory network is based on [Tsujii, 1985], with domain-specific and other general knowledge added. Parallelism is simulated on these serial machines. Implementations on actual parallel machines are described in later chapters.

4.2 AN OVERVIEW OF THE MODEL


This section describes the basic organization and the algorithm of the model. The ΦDMDIALOG system is composed of a memory network which represents the various levels of knowledge involved in the speech-to-speech translation task, and markers which carry information through the memory network. In addition, a connectionist network with a localist implementation performs spreading activation for sub-symbolic processing, which complements the symbolic processing carried out by marker-passing. Although the connectionist network has been experimentally incorporated in the system, this dissertation focuses on processing by the marker-passing scheme.

4.2.1 Memory Network


The memory network incorporates knowledge ranging from morphophonetics to the plan hierarchies of each participant in a dialog. This knowledge includes both abstract knowledge and specific cases of utterances and discourses. Knowledge of specific cases is important since we assume that a considerable portion of human language comprehension and production is strongly memory/case-based. Each node is a type and represents either a concept (Concept Class node; CC) or a sequence of concepts (Concept Sequence Class node; CSC). Strictly speaking, both CCs and CSCs are collections or families, since they are, most of the time, sets of classes. CCs represent such knowledge as phonemes (e.g. /k/, /a/), concepts (e.g. *Conference, *Event, *Mtrans-Action), and plans (e.g. *Declare-Want-Attend). When a strict linear-order constraint is imposed, CSCs represent sequences of concepts and their relations, such as phoneme sequences (e.g. /k a i g i/), concept sequences1
1 Concept sequences are the representations of the integrated syntax/semantics level of
knowledge in our model. These sequences can be used to represent abstract knowledge or

        Phonology          Syntax/Semantics              Discourse
CSC     Phoneme Sequence   Concept Sequence              Discourse Plan Sequence
CSI     ---                Instance of Concept Sequence  Instance of Discourse Plan Sequence
CC      Phoneme            Concept                       Discourse Plan
CI      ---                Instance of Concept           Instance of Discourse Plan

Table 4.1 Types of Nodes in the Memory Network

(e.g. <*Conference *Goal-Role *Attend *Want>) or plan sequences (e.g. <*Declare-Want-Attend *Listen-Instruction>)2 of the two participants of the dialog3. They are summarized in table 4.1. Each type of node creates instances during parsing, which are called concept instances (CI) and concept sequence instances (CSI), respectively. CIs correspond to the 'discourse entities' described in [Webber, 1983]. Nodes are connected through several types of links. A guided marker-passing scheme is employed for inference over the memory network.

Lexical Entry
In our model, lexical items are represented by Concept Sequence Class nodes applied at the lexical level, which we call lexical nodes. Each lexical node has knowledge of how the word should be pronounced, in the form of a phoneme sequence. For example, the definitions of the lexical nodes for the Japanese word 'Kaigi' (conference) and the English word 'Conference' are shown in figure 4.1.

The first definition represents 'kaigi' as the Japanese lexical realization of the concept 'conference'; its surface representation, used for written text processing, is 'kaigi'; the symbol sequence used by the generation process is 'ka i gi'; and its recognition phoneme sequence is 'k a i * i'4. By the same token, the lexical node for 'conference' has an expected phoneme sequence 'K AA N F R AX N S' and other information similar to that of the Japanese lexical node shown in the figure. Each word has its own lexical node containing such information, and these
specific cases. We assume the use of phrasal lexicons [Becker, 1975] [Hovy, 1988] as generic patterns that are induced from a large sample of utterance cases and map between specific surface representations and semantics.
2 This should not be confused with 'discourse segments' [Grosz and Sidner, 1990]. In our model, information represented in discourse segments is distributively incorporated in the memory network.
3 The use of plan hierarchies of each speaker as discourse knowledge is another unique feature of our model. Most other studies of dialog have been dedicated to one-speaker-initiative domains [Cohen and Fertig, 1986] [Litman and Allen, 1987].
4 '*' represents a nasal phoneme.

(defLEX '(kaigi
           (is-a (conference))
           (language (japanese))
           (surface (kaigi))
           (gen-phon (ka i gi))
           (sequence (k a i * i))))

(defLEX '(conference
           (is-a (conference))
           (language (english))
           (surface (conference))
           (gen-phon (conference))
           (sequence (K AA N F R AX N S))))

Figure 4.1 Lexical Nodes for 'Kaigi' and 'Conference'

definitions are compiled into the memory network.
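To make the compilation step concrete, the following Python sketch shows how defLEX-style definitions could be compiled into a table of lexical nodes. This is our own illustration, not the system's CommonLisp implementation; the names LexicalNode and compile_lexicon are invented for exposition.

```python
# Illustrative sketch: compiling defLEX-style property lists into
# lexical-node objects of a memory network. All class and function
# names here are assumptions, not the original implementation.

class LexicalNode:
    def __init__(self, name, is_a, language, surface, gen_phon, sequence):
        self.name = name          # lexical symbol, e.g. 'kaigi'
        self.is_a = is_a          # concept the word realizes
        self.language = language
        self.surface = surface    # written-text form
        self.gen_phon = gen_phon  # symbols used by the generation process
        self.sequence = sequence  # expected recognition phoneme sequence

def compile_lexicon(defs):
    """Compile (name, property-alist) pairs into a name -> node table."""
    network = {}
    for name, props in defs:
        p = dict(props)
        network[name] = LexicalNode(name, p['is-a'], p['language'],
                                    p['surface'], p['gen-phon'], p['sequence'])
    return network

lexicon = compile_lexicon([
    ('kaigi', [('is-a', 'conference'), ('language', 'japanese'),
               ('surface', 'kaigi'), ('gen-phon', ['ka', 'i', 'gi']),
               ('sequence', ['k', 'a', 'i', '*', 'i'])]),
    ('conference', [('is-a', 'conference'), ('language', 'english'),
                    ('surface', 'conference'), ('gen-phon', ['conference']),
                    ('sequence', ['K', 'AA', 'N', 'F', 'R', 'AX', 'N', 'S'])]),
])
```

In this rendering, the shared 'conference' value of the is-a property is what lets the source and target lexical nodes meet at one concept node.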

Ontological Hierarchy
The class/subclass relation is represented in the memory network in the form of an ontological hierarchy. Each CC represents a specific concept, and CCs are connected through IS-A links. The highest (most general) concept is *entity, which subsumes all possible concepts in the network. Subclasses are linked under the *entity node, and each subclass node has its own subclasses. As the basis of the ontological hierarchy, we use the hierarchy developed for the MU project [Tsujii, 1985], to which domain-specific knowledge and other knowledge necessary for processing in the system has been added.

Memory-Base
The memory-base is represented as a memory network containing specific cases and generalized cases. Specific cases represent the surface strings, or near-surface forms, of sentences. A translation pair is indexed under a single node representing the concept of the sentence, or simply an ID-tag. When a source language CSC is activated by the input, the translation is created using the target language CSC connected to it through the concept node. When only specific cases are used in the memory network, the system is a pure memory-based translation system.

(<NP> <==> (<NP> <PostP>)
  (((x0 head) = x1)
   ((x0 case) = (x2 case))))

Figure 4.2 Grammar using LFG-like notation

However, generalized cases and grammar rules can also be represented, so that various experiments can be carried out within a uniform framework. Generalized cases are similar to the concept sequences in DMAP, and are templates of sentences. Both specific cases and generalized cases are represented using the same encoding style and compiler.

Grammar Rules
Grammar rules can be written using notations similar to Lexical-Functional Grammar (LFG: [Kaplan and Bresnan, 1982]) (figure 4.2) or using a more semantics-oriented encoding (figure 4.3). LFG is a kind of unification-based grammar. It consists of phrase structure rules and associated constraints; the unification operation is the central operation which builds up the meaning representation and imposes the constraints. In figure 4.2, x0 indicates the left-hand side of the rule, thus (x0 head) is the head of the <NP>. By the same token, x1 and x2 indicate the first and the second term on the right-hand side, respectively. Also, mixing levels of abstraction in a grammar rule is permitted in our model (figure 4.4). Although the use of such a semantics-oriented grammatical encoding method may be linguistically controversial (it provides less linguistic generalization than formalisms such as Lexical-Functional Grammar [Kaplan and Bresnan, 1982] or Head-driven Phrase Structure Grammar [Pollard and Sag, 1987]), it is one of the best ways to write a grammar for speech input parsing because of its perplexity reduction effects. Perplexity is a measure of the complexity of a task, similar to an average branching factor; a small perplexity means the task is rather simple. Generally, smaller perplexity improves the accuracy and response speed of the speech recognition module. We extend the idea of the semantics-oriented grammar to allow direct encoding of a surface string sequence as a specific case of utterances. The use of specific cases with stochastic measurement makes a significant contribution to perplexity reduction, while strong constraints at the syntactic/semantic level can directly influence the speech processing level.
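As a minimal illustration of the perplexity measure (with invented toy numbers, not data from the system): treating perplexity as 2 raised to the per-word entropy of the language model, a grammar that narrows each next-word choice from four equally likely alternatives to two halves the perplexity the recognizer faces.

```python
import math

# Toy illustration of perplexity as a geometric-mean branching factor:
# perplexity = 2 ** H, where H is the per-word entropy. The probability
# values below are invented for the example.

def perplexity(word_probs):
    """word_probs: the probability the language model assigned to each
    successive word of a test text."""
    entropy = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** entropy

# Four equally likely continuations at each of 8 positions...
loose = perplexity([0.25] * 8)
# ...versus semantic constraints narrowing each choice to two.
tight = perplexity([0.5] * 8)
```

On this toy text the loose model has perplexity 4 and the constrained model perplexity 2, which is the sense in which case-specific grammars "reduce perplexity."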

(<attend-action> <== (<attend> <event>)
  ((x0 = x1)
   ((x0 ACTION) = (x1 root))
   ((x0 OBJECT) = x2)))

Figure 4.3 Grammar using Semantic-oriented notation

(<i-want-attend-event> <== (I want to *circumstance)
  (((x0 ACTOR) = x1)
   ((x0 ACTION) = x2)
   ((x0 OBJECT) = x4)))

(<*circumstance> <== (register for the *event)
  (((x0 ACTOR) = (x0 ACTOR))
   ((x0 ACTION) = x1)
   ((x0 OBJECT) = x4)))

Figure 4.4 Grammar using mixture of surface string and generalized case

4.2.2 Markers
Markers are entities which carry information and pass through the memory network in order to make inferences and predictions. Our model uses five types of markers:

• Activation Markers (A-Markers), which show the activation of nodes.

• Prediction Markers (P-Markers), which are passed along the conceptual and phonemic sequences to make predictions about the nodes to be activated next.

• Contextual Markers (C-Markers), which are placed on nodes with contextual preferences.

• Generation Markers (G-Markers), each of which contains a surface string and an instance which the surface string represents.

• Verbalization Markers (V-Markers), which anticipate and keep track of the verbalization of surface strings.

These markers are summarized in table 4.2. A-Markers carry information about a concept instance node, a probability/cost measure, linguistic and semantic features, and so on. P-Markers contain constraint equations, feature structures, and a probability measure of the hypothesis. Figure 4.5 shows an example of the information carried by an A-Marker and a P-Marker.

G-Markers are created at the lexical level, and each contains a surface string, linguistic and semantic features, a cost measure, and concept instance nodes. G-Markers are passed up through the memory network. At a certain point in processing, the surface strings of G-Markers are concatenated following the order of the elements of a concept sequence class node, and the final string of the utterance is created. When incremental sentence production is performed, V-Markers record the part of the sentence which has already been verbalized and anticipate the next possible verbalization strings. Figure 4.6 shows an example of a G-Marker and a V-Marker.

The basic concept introduced in our model, which is substantially different from other marker-passing models, is the use of probabilistic and structured parallel marker-passing. Each marker carries a probability measure that weights how likely the hypothesis (represented in the marker) is to be true. It also contains symbolic data such as linguistic features, a pointer to a discourse entity, and

(A-MARKER0236
  (Probability: 0.14)
  (CI: Conference045)
  (Concept: Conference)
  (Feature: nil)
  (Type: Event))

(P-MARKER0196
  (Probability: 0.50)
  (Constraints: ((x0 = x2)
                 ((x0 subj) = x1)))
  (Feature: nil))

Figure 4.5 Example of an A-Marker and a P-Marker

Marker      Passing Direction          Information
A-Markers   IS-A                       Instance, Features, Cost
P-Markers   The next element on CSCs   Constraints, Cost, Instances
G-Markers   IS-A                       Instance, Features, Cost, Lexical Realization
V-Markers   The next element on CSCs   Constraints, Cost, Instances, Verbalized words
C-Markers   Contextual links           Activation level

Table 4.2 Markers in the Model

constraints. The inclusion of rich information in markers dramatically increases the capability of the marker-passing scheme, compared with traditional marker-passing, which merely carried bit-vectors or an identifier of an activation source. Propagation of a probability measure enables stochastic analysis (which is essential in processing speech inputs), and propagation of syntactic and semantic features and discourse entities enables sound linguistic analysis.
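The shape of such structured markers can be sketched as plain records. The Python below is our own illustrative rendering of figure 4.5 (field names follow the figure; the multiplicative combination of probabilities at a collision is a simplifying assumption standing in for the model's cost recalculation):

```python
from dataclasses import dataclass, field

@dataclass
class AMarker:
    """Flows bottom-up along IS-A links (fields modeled on figure 4.5)."""
    probability: float
    instance: str              # concept instance, e.g. 'Conference045'
    concept: str
    features: dict = field(default_factory=dict)

@dataclass
class PMarker:
    """Sits on the predicted element of a concept sequence class."""
    probability: float
    constraints: list = field(default_factory=list)
    features: dict = field(default_factory=dict)

def collide(a, p):
    """A-P-collision: fold the A-Marker's information into the prediction.
    Multiplying the two probabilities is our simplifying assumption."""
    merged = {**p.features, **a.features}
    return PMarker(a.probability * p.probability, p.constraints, merged)

hypothesis = collide(
    AMarker(0.14, 'Conference045', 'Conference', {'Type': 'Event'}),
    PMarker(0.50, constraints=[('x0', '=', 'x2')]))
```

The point of the sketch is only that a collision merges symbolic content (features, constraints) while updating a numeric measure, rather than passing a bare bit as in classical marker-passing.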

Our model uses five types of markers. These markers are (1) Activation Markers (A-Markers), which show the activation of nodes, (2) Prediction Markers (P-Markers), which are passed along the conceptual and phonemic sequences to make predictions about the nodes to be activated next, (3) Contextual Markers (C-Markers), which are placed on nodes with contextual preferences, (4) Generation Markers (G-Markers), each of which contains a surface string and the instance which the surface string represents, and (5) Verbalization Markers (V-Markers), which anticipate and keep track of the verbalization of surface strings. Information on which instances caused activations, linguistic and semantic features, and probabilistic measures are carried by A-Markers. The binding list of instances and their roles, probability measures, and constraints are held in P-Markers. G-Markers are created at the lexical level and passed up through the memory network. At a certain point in processing, the surface strings of G-Markers are concatenated following the order of a CSC, and the final string of the utterance is created. When incremental sentence production is performed, V-Markers record the parts of sentences which have already been verbalized and anticipate the next possible verbalization strings.

(G-MARKER0886
  (Probability: 0.67)
  (CI: John021)
  (Concept: Male-Person)
  (Feature: ((Gender: Male)
             (Number: 3sg)))
  (Type: Object))

(V-MARKER0180
  (Probability: 0.50)
  (Constraints: (((x0 actor) = x1)
                 ((x0 object) = x3)
                 (x0 = x5)))
  (Feature: ((Actor (CI: John021)
                    (Gender: Male)
                    (Number: 3sg))))
  (Surface String: "Jon ha"))

Figure 4.6 Example of a G-Marker and a V-Marker

4.2.3 Baseline Parsing Algorithm


The basic cycle of our model is as follows:

Activation: Nodes are activated either by external inputs or as a result of accepting a CSC. A new A-Marker (for parsing) and a G-Marker (for generation) are created containing the relevant information.

Marker Passing: Markers are passed up through IS-A links. Features are aggregated through this process.

Collision: When an A-Marker or a G-Marker activates a node with a P-Marker or a V-Marker, a collision takes place. Information stored in the A-Marker or G-Marker is combined with that of the P-Marker or V-Marker. Recalculation of costs and probability measures, and constraint satisfaction, are conducted at this stage.

Prediction or Acceptance of CSC: After the collision, the P-Marker or V-Marker is moved to the next element of the CSC and makes a prediction as to what may be activated next. In the event that the P-Marker or V-Marker is at the last element of the CSC, the CSC is accepted, and a new A-Marker or G-Marker is created which contains the information stored in the P-Marker or V-Marker. This creation of a new marker is the activation stage of the next cycle.
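In outline, the cycle on a single CSC behaves like the following Python sketch (our own illustration; cost recalculation and feature aggregation are omitted):

```python
# Minimal sketch of the collision/shift/accept cycle on one concept
# sequence class (CSC). A P-Marker starts on element 0; each A-Marker
# collision shifts it right; reaching past the last element accepts the CSC.

class CSC:
    """One concept sequence class with a single P-Marker on it."""
    def __init__(self, elements):
        self.elements = elements
        self.p_pos = 0            # element currently predicted

    def receive(self, activated):
        """React to an incoming A-Marker; return what happened."""
        if activated != self.elements[self.p_pos]:
            return 'miss'         # no A-P-collision at the prediction
        self.p_pos += 1           # collision: shift the P-Marker
        if self.p_pos == len(self.elements):
            return 'accept'       # last element hit: the CSC is accepted
        return 'shift'            # predict the next element

csc = CSC(['*person', '*want', '*to', '*circumstance'])
events = [csc.receive(x) for x in ['*person', '*want', '*to', '*circumstance']]
# events == ['shift', 'shift', 'shift', 'accept']
```

Acceptance would, in the full model, create a new A-Marker and restart the cycle one layer up.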

The movement of P-Markers is important in understanding parsing in our model. (The movement of G-Markers for generation will be described in a later section.) The movements of P-Markers on a CSC are illustrated in figure 4.7. In (a), a P-Marker (initially located on e0) is hit by an A-Marker and moved to the next element. In (b), two P-Markers are used and moved to e2 and e3. In dual prediction, two P-Markers are placed on elements of the CSC (on e0 and e1). This dual prediction is used for phonological processing.

Figure 4.8 shows the movement of a P-Marker across layers of CSCs. When the P-Marker at the last element of a CSC gets an A-Marker, the CSC is accepted and an A-Marker is passed up to the corresponding element in the higher-layer CSC. There, a P-Marker on that element collides with the A-Marker, and the P-Marker is moved to the next element. At this time, a P-Marker which contains information relevant to the lower CSC is passed down and placed on the first element of the lower CSC. This is the process of accepting one CSC and predicting the possible next word and syntactic structure.

Although a real memory network has a highly layered and indexed structure, for the sake of clarity, figure 4.9 shows an example of how our algorithm parses an input with a simple context-free grammar. This example assumes the following context-free grammar:

[Diagram omitted: a P-Marker moving along a CSC <e0 e1 e2 e3 ... en>, for (a) simple prediction and (b) dual prediction.]

Figure 4.7 Movement of P-Markers

[Diagram omitted: a P-Marker on a hierarchical arrangement of CSCs; accepting a lower CSC sends an A-Marker up, shifts the P-Marker on the higher CSC, and passes a new P-Marker down to the first element of the next lower CSC.]

Figure 4.8 Movement of P-Marker on Hierarchical CSCs



[Diagram omitted: six panels showing the parse of figure 4.9: (a) initial prediction; (b) input 'a' causes an A-P-collision; (c) shift and predict; (d) input 'b'; (e) shift and predict; (f) input 'c' causes reduce at <*D *E> and <*A *B>.]

Figure 4.9 Parsing with a small grammar

• S → A B

• A → C

• B → D E | F

• C → a

• D → b

• E → c

• F → d

In figure 4.9a, initialization is conducted and P-Markers are passed down to predict possible starting symbols. In figure 4.9b, the activation of node 'a' triggers the propagation of an A-Marker from node 'a' to node 'C', to node 'A', and to node '*A'. As a result of the A-Marker propagating up to the element '*A' of the concept sequence class <*A *B>, an A-P-collision takes place at *A. Then, in figure 4.9c, the P-Marker is shifted to the next element of the concept sequence class (element '*B'). Next, a P-Marker is passed down to predict the next possible inputs, from element *B to element *D, and to node b. Also, a P-Marker is passed down from element '*B' to node F, and to node d. In figure 4.9d, the activation of node 'b' triggers an A-Marker propagation from node b to node D, resulting in an A-P-collision at <*D *E>. Figure 4.9e shows a shift of a P-Marker and a top-down prediction with a P-Marker to node c. In figure 4.9f, the activation of node c causes reduces, first at <*D *E> and then at <*A *B>. Finally, an A-Marker activates S and the input sentence is accepted.
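The walkthrough above can be simulated in a few lines. The Python sketch below is our own reconstruction of the behavior, not the book's implementation: unit productions play the role of IS-A links that A-Markers climb, and each sequence rule is a CSC carrying one P-Marker.

```python
# Toy simulation of marker-passing over the small grammar:
# S -> A B;  A -> C;  B -> D E | F;  C -> a;  D -> b;  E -> c;  F -> d.

# Unit productions: climbing these mimics A-Marker propagation up IS-A links.
UNIT = {'a': 'C', 'b': 'D', 'c': 'E', 'd': 'F', 'C': 'A'}
# Sequence rules act as CSCs: B -> D E | F, S -> A B.
SEQS = [(['D', 'E'], 'B'), (['F'], 'B'), (['A', 'B'], 'S')]

def parse(tokens):
    """Return the categories accepted (reduced) while scanning the tokens."""
    markers = [[elems, 0, result] for elems, result in SEQS]  # P-Marker per CSC
    accepted = []
    queue = list(tokens)              # pending A-Markers (input activations)
    while queue:
        sym = queue.pop(0)
        cats = [sym]                  # climb unit (IS-A-like) links
        while cats[-1] in UNIT:
            cats.append(UNIT[cats[-1]])
        for cat in cats:
            for m in markers:
                elems, pos, result = m
                if pos < len(elems) and elems[pos] == cat:
                    m[1] += 1                     # A-P-collision: shift
                    if m[1] == len(elems):        # last element: accept CSC
                        accepted.append(result)
                        queue.append(result)      # new A-Marker passed up
    return accepted
```

For the input 'a b c', the sketch accepts <*D *E> as B and then <*A *B> as S, mirroring panels (b) through (f); the input 'a d' is accepted through the F alternative.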

We will further illustrate this basic parsing algorithm using a simple memory network, as in figure 4.10. Part (a) of the figure shows the initial prediction stage. A P-Marker is placed on *person in a CSC at the syntax/semantics level. Also, another P-Marker is placed on the first element of the CSCs at the phonological level. In part (b) of figure 4.10, a word, john, is activated as a result of speech recognition, and an A-Marker is passed up through IS-A links. It reaches *person in the CSC, which has the P-Marker. An A-P-collision takes place, and the features in the A-Marker are incorporated into the P-Marker following the constraint equations specified in the CSC. Next, the P-Marker shift takes place; this may be seen in part (c) of the figure. Now the P-Marker is placed on *want. Also, the prediction is made that the possible next word is wants. Part (d) shows the movement of P-Markers after recognizing to. In (e), the last word of the sentence, conference, comes in and causes an A-P-collision at *event. Since this is the last element of the CSC <*attend *def *event>, the CSC is accepted and a new A-Marker is created. The newly created A-Marker contains the information built up by a local parse with this CSC. Then, the A-Marker is propagated upward, and it causes another A-P-collision at *circumstance. Again, because *circumstance is the last element of the CSC <*person *want *to *circumstance>, the CSC is accepted, and the interpretation of the sentence is stored in a newly created A-Marker. The A-Marker further propagates upward to perform discourse-level processing.

4.3 SPEECH INPUT PROCESSING


The integration of speech recognition and natural language processing is one of the most important topics in spoken language processing.

[Diagram omitted: figure 4.10, the simple parsing example discussed above, panels (a) initial prediction, (b) processing 'john', (c) shift and prediction, (d) predicting *circumstance and *attend, (e) processing 'conference'.]

Figure 4.10 A simple parsing example.

The benefit of integrating speech recognition and natural language processing is that it improves


the recognition rate for speech inputs in two ways. First, the integration provides a more appropriate assignment of an a priori probability to each hypothesis, so that several highly ambiguous hypotheses can be differentiated on the basis of expectations and the correct one selected. Second, it imposes more constraints to reduce the search space. Given the same computational power, reduction of the search space results in an improvement of the recognition rate. Thus, the quality of the language model is an important factor. Since our goal is to create accurate translations from speech input, a sophisticated parsing and discourse understanding scheme is necessary.

In our system, we assume that an acoustic processing device provides a symbol sequence for a given speech input. In this chapter, we assume that a phoneme-level sequence is provided to the system5. The phoneme sequence given by the phoneme recognition device contains substitutions, insertions, and deletions of phonemes, as compared to a correct transcription which contains only the expected phonemes. We call such a phoneme sequence a noisy phoneme sequence. The task of phonological-level processing is to activate hypotheses as to the correct phoneme sequence from this noisy phoneme sequence. Inevitably, multiple hypotheses are generated due to the stochastic nature of phoneme recognition errors. Thus, we want each hypothesis to be assigned a measure of its being correct. In stochastic models of speech recognition, the probability of each hypothesis is determined by P(y|h) x P(h). P(y|h) is the probability of the series of input symbols being observed when the hypothesis h is articulated. P(h) is the a priori probability of the hypothesis, derived from the language model. Clearly, when the phonological-level processing is the same, a system with a sophisticated language model attains a higher recognition rate, because the a priori probability differentiates among hypotheses of high acoustic similarity which would otherwise lead to confusion. At the same time, we want to eliminate less-plausible hypotheses as early as possible so that the search space is kept within a certain size. We use syntactic/semantic and discourse knowledge to impose constraints which reduce the search space, in addition to probability-based pruning within the phonological level. The a priori probability is given by a P-Marker passed down from the higher-level processing.
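The role of the a priori probability can be seen in a two-line example (our own toy numbers, not system output): two hypotheses with identical acoustic likelihoods P(y|h) are separated only by the language-model prior P(h).

```python
# Toy illustration of selecting the hypothesis maximizing P(y|h) * P(h).
# The hypothesis names and all probability values are invented.

def best_hypothesis(hypotheses):
    """hypotheses: (name, likelihood P(y|h), prior P(h)) triples."""
    return max(hypotheses, key=lambda h: h[1] * h[2])[0]

candidates = [
    ('kaigi (conference)', 0.30, 0.20),   # predicted by the discourse context
    ('kaiki (regression)', 0.30, 0.01),   # acoustically similar, unexpected
]
winner = best_hypothesis(candidates)
```

Since the likelihoods tie at 0.30, only the prior supplied by the syntactic/semantic and discourse levels makes 'kaigi' win, which is exactly the leverage an integrated language model provides.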

This section describes the phonological-level activities. We assume a noisy phoneme

5 We use Matsushita Institute's Japanese speech recognition system [Morii et al., 1985] for the current implementation. However, this does not mean that our framework is incapable of attaining phoneme recognition. In fact, an introduction of probabilistic time-synchronous marker-passing would add a speech processing capability comparable to the Hidden Markov Model. Alternatively, neural network-based phoneme recognition such as TDNN [Waibel et al., 1989] is highly compatible with our framework.

kaigi ni sanka shitai nodesu


DAI*I*IPAUTAQPAINO*EKU
BAII*IPAA=KAS@PAINODUSU
BAII*I*IPAU =KAIQPAI*O*ESU
KAIIMIPAA=KAS@PEEI*ODESU
KAI*I*IPAA=ZAS@PAIWO*USJU
youshi ha arimasuka
BJOHIRAARI* ATAWA
JOSJUWAARINAOQZAA
IOUSIWAARIMAUQKA
JOOSIHAKARI* AUQKA
IOOSJUWAWARI* AACA
oname wo onegai shimasu
0* A*AEJOORE*EISI* AS@
WO* A*AEJOORE*EEHJ AN A
WONA *AEJOBO*E*EIHJAH@
0* A*AEJ O*O*E*EEISIN AKU
0* A*AEJ OO*E*EEIHJ AZU

Figure 4.11 Examples of Noisy Phoneme Sequences

sequence, as shown in figure 4.11, to be the input to the phonological-level processing. In order to capture the stochastic nature of speech inputs, we adopt a probabilistic model similar to those used in other speech recognition research. First, we describe a simple model using a static probability matrix; in this model, probability is context-independent. Then we extend the model to capture context-dependent probability. In subsequent sections, we describe the language model of our system, and how predictions are provided to the speech processing level to attain integrated processing.

4.3.1 The Organization of Phonological Processing

The algorithm described above as the baseline algorithm is deployed over phonetic-level knowledge. In the memory network there are CSCs representing the phoneme sequence of each lexical entry. The dual prediction method is used in order to handle the deletion of a phoneme. The probability measures involved are: an a priori

probability given by the language model, a confusion probability given by a confusion matrix, and a transition probability given by a transition matrix.

The a priori probability is derived from the language model and is a measure of which phoneme sequence is likely to be recognized. The method of deriving the a priori probability is described in the sections on syntactic/semantic parsing, discourse processing, and prediction (sections 4.5, 4.6 and 4.7).

The confusion matrix defines the output probability of a phoneme when an input symbol is given. Given an input symbol i_i, the confusion matrix entry a_ij determines the probability that the symbol i_i will be recognized as the phoneme p_j. It is a measure of the distance between symbols and phonemes, as well as a measure of the cost of hypotheses that interpret the symbol i_i as the phoneme p_j. In the context-dependent model, the confusion matrix is defined as a_ijk, which gives the probability of a symbol i_i being interpreted as a phoneme p_j at a transition t_k. We call such a matrix a dynamic confusion matrix.

The transition matrix defines the transition probability, which is the probability of a symbol i_{i+1} following a symbol i_i. For an input sequence i_0 i_1 ... i_n, the probability of the transition between i_0 and i_1 is given by b_{i0,i1}. Since we have a finite set of input symbols, each transition can be indexed as t_k. The transition probability and the confusion probability are intended to capture the context-dependency of phoneme substitutions, a phenomenon whereby a certain phoneme can actually be articulated as other phonemes in certain environments.

4.3.2 The Context-Independent Model

First, we explain our algorithm using a simple model whose confusion matrix is context-independent. Later, we describe the context-dependent model, which uses a dynamic confusion matrix. Initially, P-Markers contain the a priori probability (π_l) given by the language model. In ΦDMDIALOG, the language model reflects full natural language knowledge, from syntax/semantics to discourse. The P-Markers are placed on each first and second element of the CSCs representing expected phoneme sequences. For an input symbol i_i, A-Markers are passed up to all phoneme nodes that have a probability (b_ij) greater than the threshold (Th). When a P-Marker, which is at the i-th element, and an A-Marker collide, the P-Marker is moved to the i+1-th and i+2-th elements of the sequence (this is the dual prediction). When the next input symbol i_{i+1}

[Diagram omitted: phoneme-level state transitions among states p_ij.]

Figure 4.12 Phoneme-level State Transition

generates an A-Marker that hits the P-Marker on the i+1-th element, the P-Marker is again moved using the dual prediction method. The probability density measure computed on the P-Marker is as follows:

ppm(i) = ppm(i-1) x t_{s_{i-2},s_{i-1}} x c_{p_{i-2},s_{i-2}}    (4.1)
ppm(1) = ppm(0) x t_{onset,i_0}                                   (4.2)
ppm(0) = π_l                                                      (4.3)

Here ppm(i) is the probability measure of a P-Marker at the i-th element of the concept sequence class node, which is the probability of the input sequence being recognized as the phoneme sequence traced by the P-Marker, and t_{onset,i_0} is the transition probability from the onset to the first phoneme, i.e. the probability of the phoneme being given as the first phoneme of the input.
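One plausible reading of the recurrence in equations (4.1)-(4.3) — the P-Marker probability starts from the language-model prior and is multiplied by a transition term and a confusion term at each shift — can be sketched as follows. This is our own code with invented matrix values; insertion and deletion transitions are omitted for brevity.

```python
# Illustrative scoring of a phoneme hypothesis from a prior, a confusion
# matrix, and a transition matrix. All probability values are invented.

def hypothesis_probability(prior, inputs, phonemes, confusion, transition):
    """Score the hypothesis that noisy `inputs` realize expected `phonemes`
    (equal lengths; insertion/deletion transitions are not modeled here)."""
    p = prior                                        # ppm(0): the prior
    p *= transition[('onset', inputs[0])]            # onset term, cf. (4.2)
    for k, (sym, ph) in enumerate(zip(inputs, phonemes)):
        p *= confusion[(sym, ph)]                    # confusion term of (4.1)
        if k > 0:
            p *= transition[(inputs[k - 1], sym)]    # transition term of (4.1)
    return p

# Invented probabilities: input symbol 'D' is often a misrecognized /k/.
confusion = {('D', 'k'): 0.2, ('A', 'a'): 0.9}
transition = {('onset', 'D'): 0.5, ('D', 'A'): 0.8}
score = hypothesis_probability(0.1, ['D', 'A'], ['k', 'a'],
                               confusion, transition)
```

Each shift multiplies in one confusion lookup and one transition lookup, so the score decays monotonically along the sequence, which is what allows threshold-based pruning of implausible paths.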

In figure 4.12, the input sequence is i_0 i_1 ... i_n. p_ij in the diagram denotes the phoneme p_j at the i-th element of a CSC. p_ij is a state rather than an actual phoneme; p_j in the CSC refers to the actual phoneme. P-Markers at p_00, p_01, p_02, i.e. the P-Markers on the 0-th elements of the CSCs referring to p_0, p_1, and p_2, respectively, are hit by A-Markers. Eventually, the P-Markers are moved to the next elements of their CSCs. For instance, the P-Marker at p_00 will move to p_10, p_11, p_20, or p_21, depending on which CSC the P-Marker is placed on. Probabilities are computed with each movement. A P-Marker at p_11 has the probability π_0; when the P-Marker receives an A-Marker from i_1, the probability is re-computed and becomes π_0 x b_{i0,p00} x a_{p00,p11}. Transitions such as p_00 → p_21 and p_00 → p_20 insert an

[Diagram omitted: a lattice of activated phonemes over the input phoneme sequence, with paths for the lexical hypotheses KAIGI, GOMI, and GOEI NE...]

Figure 4.13 Phoneme Processing

extra phoneme which does not exist in the input sequence. The probability for such transitions is computed as: π_0 x b_{i0,p00} x a_{i0,φ} x b_{i2,p20} x a_{φ,i2}. A P-Marker at p_10 does not get an A-Marker from i_1, due to the threshold. In such cases, the probability measure of the P-Marker is re-computed as π_0 x b_{i0,p00} x a_{i0,noise}. This represents a decrease in probability due to an extra input symbol.

P-Markers at the last element (p_n) and the one before the last (p_{n-1}) are involved in the word boundary problem. When a P-Marker at p_n is hit by an A-Marker, the phoneme sequence is accepted and an A-Marker which contains the probability and the phoneme sequence is passed up to the syntactic/semantic-level of the network. Then, the next possible words are predicted using syntactic/semantic knowledge, and P-Markers are placed on the first and the second elements of the phoneme sequences of the predicted words. When a P-Marker at p_{n-1} is hit by an A-Marker, the P-Marker is moved to p_n and, independently, the phoneme sequence is accepted due to the dual prediction, and the first and second elements of the predicted phoneme sequences get P-Markers.

Figure 4.13 shows a simplified view of this process. Suppose the input phoneme sequence is 'DAI*I...'. Each input phoneme activates phoneme hypotheses based on a confusion matrix. For example, the input phoneme 'D' activates the phoneme hypotheses 't', 'k', 'd', 'g', and '*' with corresponding probability measures. For the sake of simplicity, the transition probability is not drawn in this figure. The path shown by a thick solid line shows a lexical hypothesis for KAIGI (conference), a thick dashed line shows GOMI (garbage), and a thin dashed line shows GOEI NE.. (guard and a part of a following word). In activating the lexical hypothesis GOMI, the third input phoneme 'I' is considered to be noise, and, thus, the path ignores either 'i' or 'e' activated by 'I'. On the other hand, the path for GOEI has a transition that adds a phoneme which does not appear in the input phoneme sequence; a phoneme 'o' is inserted while transiting from 'g' to 'e'.
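The thresholded activation step can be sketched as follows; the confusion matrix entries and the threshold value are made-up illustrative numbers, not the system's measured values.

```python
# Thresholded activation of phoneme hypotheses from a confusion
# matrix, as in figure 4.13.  Matrix entries are hypothetical.

CONFUSION = {                     # P(hypothesized phoneme | input phoneme)
    "D": {"d": 0.55, "t": 0.20, "k": 0.10, "g": 0.08, "*": 0.05},
    "A": {"a": 0.80, "o": 0.15},
}
TH = 0.07                         # threshold Th from the text

def activate(input_phoneme):
    """Return the hypotheses whose probability b_ij exceeds Th;
    only these nodes receive A-Markers."""
    row = CONFUSION.get(input_phoneme, {})
    return {p: b for p, b in row.items() if b > TH}

print(sorted(activate("D")))      # 'd', 'g', 'k', 't' kept; '*' pruned
```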

4.3.3 The Context-Dependent Model


The context-dependent model can be implemented by using the dynamic con-
fusion matrix. The algorithm described above can be applied with some modi-
fications. First, A-Markers are passed up to phonemes whose maximum output
probability is above the threshold. Second, output probability used for proba-
bility calculation is defined by the dynamic confusion matrix. The probability
measure is computed by:

ppm(i) = ppm(i-1) × t(s_{i-2}, s_{i-1}) × c_k(i_{i-2}, s_{i-2})    (4.4)

where k denotes a transition from i_{i-2} to i_{i-1}, i.e. the confusion matrix entry is selected dynamically according to the preceding input transition. It is interesting that our context-
dependent model is quite similar to the Hidden Markov Model (HMM) when the transitions of the state of P-Markers are synchronously determined by, for example, certain time intervals. We can implement a forward-passing algorithm and the Viterbi algorithm [Viterbi, 1967] using our model. This implies that when we decide to employ the HMM as our speech recognition model, instead of the current speech input device, it can be implemented within the framework of our model.
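To make the HMM connection concrete, here is a minimal Viterbi decoder over a toy two-state model; the states, parameters, and observation symbols are all hypothetical, chosen only to illustrate the algorithm the text refers to.

```python
# Minimal Viterbi decoding over a toy HMM (all parameters hypothetical).

def viterbi(obs, states, pi, a, b):
    """Most probable state path for an observation sequence."""
    v = [{s: pi[s] * b[s][obs[0]] for s in states}]   # initial column
    back = [{}]
    for t in range(1, len(obs)):
        v.append({}); back.append({})
        for s in states:
            prob, prev = max((v[t-1][r] * a[r][s] * b[s][obs[t]], r)
                             for r in states)
            v[t][s], back[t][s] = prob, prev
    last = max(v[-1], key=v[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):              # trace back pointers
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ("voiced", "unvoiced")
pi = {"voiced": 0.6, "unvoiced": 0.4}
a = {"voiced": {"voiced": 0.7, "unvoiced": 0.3},
     "unvoiced": {"voiced": 0.4, "unvoiced": 0.6}}
b = {"voiced": {"d": 0.6, "t": 0.1},
     "unvoiced": {"d": 0.2, "t": 0.7}}
print(viterbi(["d", "t", "t"], states, pi, a, b))
```

The dynamic-programming recurrence here plays the role that synchronous P-Marker movement plays in the text's model.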

4.4 MEMORY-BASED PARSING


Memory-based parsing in the ΦDMDIALOG uses specific cases and generalized cases. Since specific cases represent surface strings of sentences, any input sentence has to be matched based on a similarity measure. In order to handle inputs which are not identical to stored specific cases, A-Markers are propagated downward to find CSCs which are indexed to words similar to the word in the input sentence. Suppose an input sentence has the word John, but a specific case has the word Jack. An A-Marker is activated by John and propagates upward to hit the concept Male-Person. Then, the A-Marker propagates downward to hit Jack. If Jack is part of a specific case and has a P-Marker, an A-P-Collision takes place and the next element will be predicted. While the distance measure attached to the arc diminishes the probability measure of the activation, this mechanism allows the system to utilize specific cases even when identical sentences are not spoken to the system. One of the problems of the method is the explosive search space and its computational requirements. A massively parallel implementation would be necessary to build any practical system using this approach. One of the descendants of the ΦDMDIALOG, the Memoir system, represents an extreme end of this approach.

The same mechanism applies to generalized cases. When a concept activated by the input is not indexed to any CSC, the A-Marker propagates downward to find similar concepts which are indexed to CSCs.

4.5 SYNTACTIC/SEMANTIC PARSING


Unlike most other language models employed in speech recognition research,
our language model is a complete implementation of a natural language pars-
ing system. Thus, complete semantic interpretations, constraint checks, ambi-
guity resolution and discourse interpretations are performed. The process of
prediction is a part of parsing in our model, thereby attaining an integrated
architecture of speech input parsing. In syntactic/semantic processing, the
central focus is on how to build the informational content of the utterance
and how to reflect syntactic/semantic constraints at phonological-level activi-
ties. Throughout the syntactic/semantic-level and the discourse-level, we use a
method to fuse constraint-based and case-based approaches. In our model, the difference between a constraint-based process and a case-based process is the level of abstraction; the case-based process is specific and the constraint-based process is more abstract. The constraint-based approach is represented by various unification-based grammar formalisms [Pollard and Sag, 1987] [Kaplan and Bresnan, 1982]. We employ a mixed approach which uses syntactic and semantic constraints at three levels of abstraction: specific cases, generalized cases, and syntactic rules⁶. In our model, propagation of features and unification
are conducted as feature aggregation by A-Markers and constraint satisfaction
is performed by operations involving P-Markers. The case-based approach is a
basic feature of our model. Specific cases of utterances are indexed in the mem-
ory network and reactivated when similar utterances are given to the system.
⁶In fact, we are now developing a cross-compiler that compiles grammar rules written in LFG for the universal parser [Tomita and Carbonell, 1987] into our network. Design of a cross-compiler from HPSG to our network is also underway.

One of the motivations for case-based parsing is that it encompasses phrasal lexicons [Becker, 1975]⁷. The scheme described in this section is applied to discourse-level processing and attains an integration of the syntactic/semantic-level and the discourse-level.

4.5.1 Feature Aggregation


Feature aggregation is an operation which combines features in the process of
passing up A-Markers so that minimal features are carried up. Due to the
hierarchical organization of the memory network, features which need to be
carried by A-Markers are different depending upon which level of abstraction
is used for parsing. When knowledge of cases is used for parsing, features are
not necessary because this knowledge is already indexed to specific discourse
entities. Features need to be carried when more abstract knowledge is used
for parsing. For example, the parsing of a sentence She runs can be handled
at different levels of abstraction using the same mechanism. The word she
refers to a certain discourse entity so that very specific case-based parsing can
directly access a memory which recalls previous memory in the network. Since
previous cases are indexed into specific discourse entities, the activation can
directly identify which memory to recall. When this word she is processed
in a more abstract level such as PERSON, we need to check features such as
number and gender. Thus, these features need to be contained in the A-Marker.
Further abstraction requires more features to be contained in the A-Marker.
Therefore, the case-based process and the constraint-based process are treated
in one mechanism. Aggregation is a cheap operation since it simply adds new
features to existing features in the A-Marker. Given the fact that unification is a
computationally expensive operation, aggregation is an efficient mechanism for
propagating features because it ensures only minimal features are aggregated
when features are unified.
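As a sketch, aggregation can be seen as a plain merge of the feature sets carried by an A-Marker, with only a clash check in place of full unification; the feature names below are illustrative, not the system's actual feature inventory.

```python
# Aggregation as a cheap union of the features carried by an A-Marker,
# as opposed to full unification.  Feature names are hypothetical.

def aggregate(marker_features, new_features):
    """Add only the features the next abstraction level needs;
    a dict merge with a clash check instead of unification."""
    for k, v in new_features.items():
        if k in marker_features and marker_features[k] != v:
            raise ValueError(f"feature clash on {k}")
        marker_features[k] = v
    return marker_features

a_marker = {}                         # at the lexical node: no features yet
aggregate(a_marker, {"num": "3s"})    # needed once PERSON-level knowledge is used
aggregate(a_marker, {"gender": "fem"})
print(a_marker)
```

The point of the sketch is the cost profile: each step is a constant-time insertion, so the marker accumulates only the minimal features the more abstract levels will actually test.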

Feature aggregation is applied in order to interface with different levels of knowledge. At the phonological level, only a probability measure and a phoneme
sequence are involved. Thus, when an A-Marker hits a CC node represent-
ing a certain concept, i.e. female-person-3sg for she, the A-Marker does
not contain any linguistically significant information. However, when the A-
Marker is passed up to more abstract CC nodes, i.e. person, linguistically
significant features are contained in the A-Marker and unnecessary informa-
tion is discarded. When a sentence is analyzed at the syntactic/semantic-level,
⁷Discussions on benefits of phrasal lexicons for parsing and generation are found in [Riesbeck and Martin, 1985] [Hovy, 1988].

propositional content is established and is passed up to the discourse-level by an A-Marker, and some linguistic information which is necessary only within the syntactic/semantic-level is discarded.

4.5.2 Constraint Satisfaction


Constraint is a central notion in modern syntax theories. Each CSC has con-
straint equations which define the constraints imposed for that CSC depending
upon their level of abstraction. CSCs representing specific cases do not have
constraint equations since they are already instantiated and the CSCs are in-
dexed in the memory network. The more abstract the knowledge is the more
it contains constraint equations. Feature structures and constraint equations
interact at two stages. At the prediction stage, if a P-Marker placed on the
first element of the CSC already contains a feature structure that is non-nil,
the feature structure determines, according to the constraint equations, possi-
ble feature structures of A-Markers that subsequent elements of the CSC can
accept. At an A-P-Collision stage, a feature structure in the A-Marker is tested
to see if it can meet what is anticipated. If the feature structure passes this
test, information in the A-Marker and the P-Marker are combined and more
precise predictions are made as to what can be acceptable in the subsequent
element. For She runs, we assume a constraint equation (Agent num = Action num) associated with a CSC representing a generalized case which has, for example, <*Agent *Action> as a concept sequence. When a P-Marker initially has a feature structure that is nil, no expectation is made. In this example, at an A-P-Collision, an A-Marker has a feature structure containing (num = 3s) constraints for the possible verb form which can follow, because the feature in the A-Marker is assigned in the constraint equation so that (Agent num 3s) requires (Action num 3s). This guarantees that only the verb form runs can be legitimate⁸. When predicting what comes as an *Action, P-Markers can be passed down via IS-A links and only lexical entries that meet the constraint (Action num 3s) can be predicted. When we need to relax grammatical constraints, P-Markers can be placed on every verb form, but those which meet the constraint are assigned a higher a priori probability.
can be used to conduct operations described in this section. As a result of pars-
ing at the syntactic/semantic-level, the propositional content of the utterance
is established. Since our model is a memory-based parsing model, the memory
network is modified to reflect what is understood as a result of previous parsing.
⁸When we use abstract notation such as NP or VP, the same mechanism applies and captures linguistic phenomena.
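The two-stage interaction of feature structures and constraint equations can be sketched as follows, using the She runs example; the flat dictionary representation and the `ap_collision` helper are illustrative assumptions, not the system's data structures.

```python
# Checking a constraint equation (Agent num = Action num) when an
# A-Marker collides with a P-Marker.  Representation is a sketch.

def ap_collision(p_marker, a_marker, equations):
    """Test the A-Marker's features against what the P-Marker
    anticipates; on success, combine them so the next element is
    predicted more precisely."""
    for src, dst in equations:            # e.g. Agent num -> Action num
        expected = p_marker["expect"].get(dst)
        got = a_marker["features"].get(src)
        if expected is not None and got is not None and expected != got:
            return None                    # constraint violated
        if got is not None:
            p_marker["expect"][dst] = got  # propagate, e.g. num = 3s
    return p_marker

p = {"expect": {}}                         # initially nil: no expectation
a = {"features": {"num": "3s"}}            # aggregated from "She"
print(ap_collision(p, a, [("num", "num")]))  # now only a 3s verb fits
```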

4.6 DISCOURSE PROCESSING


4.6.1 Plan-Based Dialogue Understanding
We use hierarchical plan sequences, represented by CSCs, to understand dia-
logues and to predict possible next utterances. Plan hierarchies are organized
for each participant of the dialog in order to capture complex dialog which often
takes place in a mixed-initiative dialog. This is one of the major differences of
our model from other discourse models. Each element of the plan sequence rep-
resents a domain-specific instance of a plan or an utterance type [Litman and
Allen, 1987]. Major differences which distinguish our model from simple scripts
or MOPs [Schank, 1982] as used in past memory-based models are: (1) our plan
sequences can be dynamically created from more productive knowledge on dia-
log and domain as well as previously acquired case knowledge, whereas scripts
and MOPs are simple predefined sequential memory structures, and (2) a plan
sequence has an internal constraint structure which enables constraints to be
imposed to ensure coherence of the discourse processing. These features attain
hierarchical organization of discourse knowledge and productivity of knowledge
depending upon the level of abstraction. Abstract plan sequences are similar
to plan schemata described in [Litman and Allen, 1987] in that they represent generic constraints and relationships between an utterance and a domain
plan. They are parameterized and domain-independent knowledge of discourse
plans. When an element of the plan sequence of this abstraction is activated,
the rest of the elements of the plan sequence have constraints imposed, derived
from the information given to the activated elements. This ensures coherence
of the discourse. On the other hand, specific plan sequences representing the
discourse cases are already indexed in the memory as a result of instantiating
abstract knowledge based on past cases of discourse and, therefore, they con-
tain domain-specific knowledge. When such a plan sequence is activated, it
simply predicts the next plan elements because these specific plan sequences
are regarded as records of past cases, and, thus, most constraints are already
imposed and the sequence is indexed according to the specific constraints.

The basic algorithm of plan-based dialogue understanding is similar to the parsing stage, except for IG-Markers. For plan-based dialogue recognition, we introduce a new marker called the Inferred Goal Marker (IG-Marker). IG-Markers propagate upward in the plan hierarchy of each speaker, and mark all possible goals/subgoals of the speaker. An IG-Marker is created when an A-P-Collision takes place at domain plan sequences. The overall algorithm is:

1. Put P-Markers on all possible elements of the CSCs.


2. When an A-P-Collision takes place,

   (a) An IG-Marker is created and propagates upward.
   (b) The P-Marker is moved to the next element of the CSC, and a top-down prediction is made by passing down copies of the P-Marker.

3. When an A-P-Collision takes place at the last element of the CSC, an A-Marker is created and propagates upward.
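The steps above can be sketched directly; the plan hierarchy mirrors the attend-conference example used later in this section, and the marker bookkeeping is deliberately simplified to sets and indices (an illustrative sketch, not the marker-passing machinery itself).

```python
# Sketch of the three-step plan-recognition algorithm above.
# Hierarchy follows the attend-conference example; representation
# is a simplification.

PARENT = {"buy-ticket": "goto-site", "goto-site": "attend-conference"}
SEQ = {"attend-conference": ["register", "goto-site", "attend-session"],
       "goto-site": ["buy-ticket", "take-train"]}

def collide(plan, pos, ig_marks):
    """A-P-Collision at SEQ[plan][pos]: create an IG-Marker and
    propagate it upward (step 2a), then advance the P-Marker (2b)."""
    node = plan
    while node is not None:            # mark all inferred goals/subgoals
        ig_marks.add(node)
        node = PARENT.get(node)
    return pos + 1                     # P-Marker moves to the next element

ig = set()
pos = collide("goto-site", 0, ig)      # "buy-ticket" observed
print(sorted(ig), SEQ["goto-site"][pos])
```

After one collision on buy-ticket, both goto-site and attend-conference carry IG-Marks, matching the upward propagation shown in figure 4.14.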

The algorithm itself is simple, but it has more flexibility than Litman's model when performed on a memory network which represents the domain plan hierarchy for each speaker. There are two major differences between Litman's model and our model which explain the flexibility and computational advantages of our model.

First, in the ΦDMDIALOG, plan recognition is performed directly using the plan hierarchy represented as a part of the memory network. In Litman's model, domain plans are expanded in the stack whenever a new domain plan is recognized. This approach is computationally expensive since each plan has to be retrieved from the plan library and expanded into the stack. Also, when there are ambiguities, that model has to create several stacks, each of which corresponds to a specific interpretation.

Second, our model assumes specific domain plans for each speaker. The domain plan, which has previously been considered a joint plan, is now separated into two domain plans, each of which represents a domain plan of a specific speaker. Each speaker can only carry out his/her own domain plans in the stack. Progression from one domain plan to another can only be accomplished through utterances in the dialogue. A domain plan becomes a joint plan when both speakers execute or recognize the same domain plan at the same specific point in the speech event; it occurs separately for each speaker in the domain plan hierarchy in the memory network.

We describe the basic plan recognition process using a simple example, in which plan recognition for one of the speakers is shown. We have two CSCs representing domain plans. A CSC representing actions to attain attend-conference has a sequence of plans (or actions) register goto-site attend-session. The other CSC, representing a sequence of plans for goto-site, has buy-ticket take-train as its decomposition in Litman's terms. Similar to the parsing stage, the first elements of all CSCs are marked with P-Markers. When an A-Marker collides with a P-Marker, an IG-Marker (Inferred Goal Marker) is created and passed upward in the plan hierarchy. All nodes along the path of the IG-Marker are marked with the IG-Marker (figure 4.14). They represent possible goals/subgoals of the speaker. Then, the P-Marker is moved to the next element, and its copy is passed down to the lower CSC representing a sequence of actions to attain the predicted goal/subgoal (figure 4.15). Then the next A-Marker hits the P-Marker, and an IG-Marker is created and propagates upward (figure 4.16). Although this illustration is much simplified, the basic process flow is captured. When an A-Marker and a P-Marker collide, constraint satisfaction generally takes place in order to ensure coherence of dialogue recognition. This process is similar to the constraint-directed search used in [Litman and Allen, 1987].

Next, we describe how our model handles mixed-initiative dialogues using a short dialogue on airline flight reservation:

SpA: I'd like to buy a ticket for San Francisco. (1)


SpB: Oh, you are going to visit San Francisco. (2)
SpA: No. (3)
SpA: I have a conference at Berkeley. (4)
SpB: Then you better fly into the Oakland airport. (5)
SpA: Really? (6)
SpA: How much does the ticket cost? (7)

The initial state of the network is shown in figure 4.17(A). Notice that there are different domain plans for each speaker: Speaker A is the customer and Speaker B is the travel agent. The first element of each CSC for both speakers is marked with P-Markers. As the first utterance comes in (utterance (1)), an A-Marker propagated up from the parsing stage hits the state-destination (state-dest) action. An A-P-Collision takes place, and an IG-Marker is created. The IG-Marker propagates upward, marking all possible goals of Speaker A. Also, the P-Marker is moved to the next element of the CSC. This is shown in figure 4.17(B). Next, utterance (2) comes in, and an A-Marker hits a P-Marker at confirm-destination (figure 4.17(C)). An IG-Marker is created and marks sell-ticket, which is the goal of the travel agent as inferred from utterance (2). Utterances (3) and (4) are replies by Speaker A to utterance (2) made by Speaker B. For such replies, generally, P-Markers and IG-Markers at domain plans do not move. When Speaker B, the travel agent, makes utterance (5), an A-Marker hits the P-Marker on tell-best-option predicted from the previous utterance. However, IG-Markers are unaffected because nothing has been accomplished yet. If Speaker A accomplishes buy-ticket, an A-Marker is created at the CSC <state-dest ...>, and hits the P-Marker at buy-ticket. Then, the P-Marker is moved to the next element
Figure 4.14 A simple plan recognition example (Activation)

Figure 4.15 A simple plan recognition example (Prediction)

Figure 4.16 A simple plan recognition example (Activation)


of the CSC, and the IG-Marker is removed since this subgoal has been accomplished. It is perhaps the next series of utterances that may create IG-Markers to mark take-airplane as the next subgoal. Similar to the syntax/semantics-level, multiple levels of abstraction can co-exist. For the sake of clarity, however, this example only shows one level of abstraction.

4.6.2 Script-based Discourse Understanding


In addition to plan-based discourse understanding, the model entails script-based discourse understanding, which is an extended version of script- or memory organization packet (MOP) based story understanding [Schank, 1982]. In this approach, a sequence of utterances is represented as a CSC at a discourse level tangential to the plan-based model; here, however, each utterance is also linked to nodes representing speech acts and domain plans. This part of the model accounts for canned conversations in which speakers do not really recompute the planning behind the utterance.

Aspects which distinguish our model from simple scripts or MOPs as used in past models are (i) utterance sequences can be dynamically created from more productive knowledge on dialogue and domain as well as previously acquired case knowledge, whereas scripts and MOPs are simple predefined sequential memory structures; and (ii) an utterance has an internal constraint structure which enables constraints to be imposed to ensure discourse processing coherence. Again, the level of abstraction may vary in each CSC representing a discourse script. Some CSCs may involve elements linked to specific instances of utterances, while others may use elements linked to abstract nodes representing utterance types.

4.6.3 Discourse Entities and Their Relations


Discourse entities and their relations are established as a result of understanding the utterances in the dialogue. Understanding in ΦDMDIALOG is regarded as the activity of dynamically modifying the memory network as a result of parsing. The memory network is modified in two ways: creation of a new instance in the network, and creation and deletion of links between nodes. As each utterance is processed, CIs are created under CCs [Kitano et al., 1989a]. These instances represent discourse entities (similar to [Webber, 1983]). CIs are created when new discourse entities are introduced and utterances are processed, and they are referred to later when the same discourse entity is referred to in utterances.
Figure 4.17 A simple plan recognition example with Multiple Hierarchy.



One other type of CI represents the meaning of a specific utterance. Such CIs are recorded in the memory network as cases of utterances in the specific discourse context. In ΦDMDIALOG, this type of CI has the following links:

• An abstraction link to a CC,

• Links to each speaker's plan hierarchy,

• A speaker link which shows the speaker of the utterance,

• A hearer link which shows the hearer of the utterance,

• Links which establish propositional content.

In addition, each CSC has world knowledge which triggers modification of the memory network to reflect what was told to the system. For instance, if an utterance conveys the information that 'John bought a car', the memory network is modified so that '@person001', which is the CI for John, has a POSSESS link to the specific instance of the car, @CAR001. A correct understanding of an utterance refers to the state where a CI is created under an appropriate CC and connected to appropriate CIs with appropriate links. This knowledge is particularly important when resolving ambiguities and identifying anaphoric references.

4.7 PREDICTION FROM THE LANGUAGE MODEL
Predictions are provided in a top-down manner. When A-Markers are passed up, a probability measure for each hypothesis is carried up. At each collision between an A-Marker and a P-Marker, a prediction is associated, and P-Markers are passed down. During this process, a probability measure should follow P_0 = Σ_i P_i in branching and merging (figure 4.18). A parameter for each branch needs to be learned through exposure to a large corpus of utterances of the applied domain. Case-based parsing provides the most specific predictions and gives a high a priori probability. Prediction by more abstract knowledge provides less specific predictions and gives a weaker a priori probability compared to predictions by more specific cases. The introduction of cases for prediction has proven to be useful since it reduces perplexity to 2.44 where the perplexity measure using a bi-gram grammar was 3.60. Some probabilities
Figure 4.18 Branching and Merging of Markers

are assigned for words which are not predicted from a case-based process, but
are predicted from a unification-based process, so that even utterances which
have been unexpected by the case-based process can be handled. Figure 4.19
shows how probability measures are propagated in a simple network. With a
certain threshold value, we obtained an experimental result which shows that
the top-down prediction effectively reduced perplexity. With a small test set
which has a perplexity of word choice of 247.0 with no constraints, the perplex-
ity was reduced to 19.7 with syntactic and semantic constraints, and further
reduced to 2.4 with discourse-level constraints. However, the effect of perplex-
ity reduction by adding the discourse-level knowledge would be less effective
when we apply our method in a large mixed-initiative dialog domain. This
problem will be discussed in section 4.13. The perplexity can be controlled
by the threshold value. Introduction of a threshold relaxation method would
take advantage of the probabilistic approach to prediction. A high threshold is assumed at the beginning of a search to narrow down the search space; if no solution is found, the threshold is lowered and the search space is widened to find a solution. This idea is similar to layering prediction [Young et al., 1989] and probabilistic marker speed control [Wu, 1989].
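The relaxation loop can be sketched as follows; the candidate words, their scores, and the threshold schedule are illustrative values, not the reported experimental settings.

```python
# Threshold relaxation: start with a high threshold to keep
# perplexity low, and widen the search only on failure.
# Candidate scores and thresholds are hypothetical.

def search_with_relaxation(candidates, thresholds=(0.5, 0.2, 0.05)):
    """Return the survivors at the first (highest) threshold
    that yields any hypothesis at all."""
    for th in thresholds:
        survivors = [w for w, p in candidates if p >= th]
        if survivors:
            return th, survivors
    return None, []

candidates = [("kaigi", 0.32), ("gomi", 0.11), ("goei", 0.04)]
print(search_with_relaxation(candidates))
```

A tight first threshold keeps the effective search space small in the common case; the lower thresholds are only paid for when the narrow search fails.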

4.8 COST-BASED AMBIGUITY RESOLUTION
A cost-based scheme of ambiguity resolution [Kitano et al., 1989a] has been adopted in ΦDMDIALOG. In the cost-based theory, the hypothesis with the least cost will be selected. The cost-based theory rests on the idea that, in dynamic systems, the state of the system converges into the most stable point

Figure 4.19 Prediction

(the global minima) through the path with the least workload. We believe such
a law of physics is applicable at the abstract level since cognitive processes are
manifestations of a dynamic system which consists of large numbers of neurons,
the brain. In addition, several psycholinguistic studies [Crain and Steedman, 1985] [Ford, Bresnan and Kaplan, 1981] [Prather and Swinney, 1988] were taken
into account in deriving the cost-based theory. The cost-based disambiguation
scheme applies to both the parsing and the generation process. In a speech
input natural language system, ambiguities caused by noisy speech inputs are
added along with lexical and structural ambiguities. The cost-based approach
enables us to handle these different levels of ambiguity in a uniform manner.
This has not been attained in the past models of ambiguity resolution. In the
parsing process, costs are added when: (1) substitution, deletion and insertion
of phonemes are performed to activate certain lexical items from noisy speech
inputs (this part is handled using probabilistic measures as described earlier),
(2) new CIs are created, (3) CCs without contextual priming are used for fur-
ther processing, and (4) the memory network is modified to satisfy constraints.
Costs are subtracted when: (1) prediction has been made from discourse knowl-
edge, and (2) CCs with contextual priming have been used.

CSC_i = Σ_j CC_ij + Σ_k constraints_k + bias_i    (4.5)

CC_j = LEX_j + instantiateCI - priming_j    (4.6)

LEX_j = -C log P    (4.7)

where CC_ij, constraints_k, and bias_i denote the cost of the j-th element of CSC_i, the cost of assuming the k-th constraint, and the lexical preference of CSC_i. LEX_j, instantiateCI, and priming_j denote the cost of the lexical node LEX_j, the cost of creating new CIs upon referential failure, and contextual priming, respectively. P denotes the probability measure of phonological-level activities, and is converted into cost using -C log P, where C is a constant.
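Equations 4.5-4.7 can be sketched numerically as follows; the constant C, the instantiation and priming costs, and the probabilities are all hypothetical values chosen only to show how the terms combine.

```python
# Sketch of the cost of a CSC hypothesis per equations 4.5-4.7.
# All numeric values are hypothetical.

import math

C = 1.0                                    # constant in -C log P

def lex_cost(p):                           # equation 4.7
    return -C * math.log(p)

def cc_cost(p, new_instance=False, primed=False):   # equation 4.6
    cost = lex_cost(p)
    if new_instance:
        cost += 2.0                        # referential failure adds cost
    if primed:
        cost -= 1.0                        # contextual priming subtracts cost
    return cost

def csc_cost(cc_costs, constraint_costs, bias=0.0):  # equation 4.5
    return sum(cc_costs) + sum(constraint_costs) + bias

cost = csc_cost([cc_cost(0.8, primed=True), cc_cost(0.5)], [0.3])
print(round(cost, 3))
```

The least-cost hypothesis wins, so a primed, highly probable reading can beat a reading that must instantiate new CIs.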

4.8.1 Initial Conditions


Our model parses utterances under a given context. Thus, the cost assigned
to a certain hypothesis is not always the same. It is dependent on the context,
the initial conditions according to which the utterance is given to the system.
The discourse context which is the initial condition of the system is determined
based on the previous course of discourse. Major factors are the state of the
memory network modified as a result of processing previous utterances, con-
textual priming, and predictions from discourse plans. The memory network
is modified based on the knowledge conveyed by the series of utterances in
the discourse as described briefly in the previous section. Contextual priming
is imposed either by using C-Marker passing or by the connectionist network.
The mechanism of assigning preference is based on top-down prediction using
discourse knowledge. Such predictions are reflected as an a priori probability
at the speech processing level.

4.8.2 Phonological Processing


The phonological level has been handled with a probability measure. When a
probability is introduced, the process which requires more workload is less likely
to be chosen. Thus, qualitatively, higher probability means less cost and lower
probability means higher cost. The probability/cost conversion equations are⁹:

P = e^(-cost/C)    (4.8)

cost = -C log P    (4.9)

A version of our implementation uses a cost-based scheme because the use of probability requires multiplication, whereas the use of cost requires only
⁹The equations are based on the Maxwell-Boltzmann distribution P = e^(-E/T).

addition, which is computationally less expensive than multiplication. It is also a straightforward implementation of our model, which perceives parsing as a physical process (an energy dispersion process). For such cases, we introduce an accumulated acoustic cost (AAC) as a measure of cost, which is computed by:

aac(i) = aac(i-1) + cc(i_{i-1}, p_{i-1}) + tc(i_{i-2}, i_{i-1}) - pe    (4.10)

where aac(i), cc(i_{i-1}, p_{i-1}), tc(i_{i-2}, i_{i-1}), and pe are the AAC measure of the P-Marker at the i-th element, the confusion cost between i_{i-1} and p_{i-1}, the transition cost between i_{i-2} and i_{i-1}, and the phonetic energy, respectively. Phonetic energy reflects an influx of energy from an external acoustic energy source.
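The additive update of equation 4.10 can be sketched as follows; the cost and energy values are illustrative, not measured ones.

```python
# Accumulated acoustic cost (equation 4.10): the additive,
# cost-domain counterpart of the ppm update.  Values are hypothetical.

def aac_update(aac_prev, cc, tc, pe):
    """aac(i) = aac(i-1) + confusion cost + transition cost
    - phonetic energy (the energy influx lowers the cost)."""
    return aac_prev + cc + tc - pe

aac = 0.0
aac = aac_update(aac, cc=0.4, tc=0.2, pe=0.1)   # first input phoneme
aac = aac_update(aac, cc=0.9, tc=0.3, pe=0.1)   # a noisier phoneme
print(round(aac, 2))
```

Because the update is a pure sum, each step costs one addition per marker, which is the computational advantage over the multiplicative probability form claimed above.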

4.8.3 Reference to the Discourse Entity


When a lexical node activates any CC node, a CI node under the CC node is
searched for. This activity models reference to an already established discourse
entity [Webber, 1983] in the hearer's mind. If such a CI node exists, the
reference succeeds and this parse will be attached with no cost. However, if no
such instance is found, reference failure results. If this happens, an instantiation
activity is performed creating a new instance with certain costs. As a result, a
parse using a newly created instance node will be attached with some cost.

4.8.4 Contextual Priming


Either a C-Marker-based method or a connectionist network method is used for
contextual priming. In the C-Marker-based scheme [Tomabechi, 1987], some
CC nodes designated as Contextual Root Nodes have a list of contextually
relevant nodes. C-Markers are sent to these nodes as soon as a Contextual
Root Node is activated. Thus each sentence, phrase, or word might influence
the interpretation of succeeding sentences, phrases, or words. When a node
with a C-Marker is activated by receiving an A-Marker, the activation will
be propagated with no cost. Thus, a parse using such nodes would have no
cost. However, when a node without a C-Marker is activated, a small cost is
attached to the interpretation using that node. However, the problem of this
C-Marker-based scheme is that it cannot capture the dynamic nature of con-
textual priming - phenomena such as backward priming and a winner-take-all
process cannot be simulated. The connectionist network is adopted with some
computational costs in order to overcome these problems. When a connectionist
network is fully deployed, every node in the network is connected with
weighted links. A competitive excitation and inhibition process is performed
to select one hypothesis [Waltz and Pollack, 1985]. Although both A-Markers
and G-Markers carry cost information, their actual values change over time
according to the change in the activation level of the lexical and conceptual
nodes.
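A toy version of the C-Marker scheme described above; the node names and the penalty value are illustrative:

```python
class CMarkerMemory:
    """Contextual priming with C-Markers over a memory network."""

    def __init__(self, contextual_roots):
        # Each Contextual Root Node lists its contextually relevant nodes.
        self.contextual_roots = contextual_roots
        self.c_marked = set()

    def activate_root(self, node):
        # C-Markers are sent out as soon as a Contextual Root Node fires.
        self.c_marked.update(self.contextual_roots.get(node, ()))

    def activation_cost(self, node, penalty=0.5):
        # A node already holding a C-Marker propagates at no cost; an
        # unprimed node attaches a small cost to its interpretation.
        return 0.0 if node in self.c_marked else penalty

memory = CMarkerMemory({"*conference": ["*attend", "*register", "*talk"]})
memory.activate_root("*conference")
```

After *conference fires, interpretations through *attend are free while interpretations through unprimed nodes carry the penalty.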

4.8.5 Constraints
Constraints are attached to each CSC. These constraints play important roles
during disambiguation. Constraints define relations between instances when
sentences or sentence fragments are accepted. When a constraint is satisfied,
the parse is regarded as plausible. On the other hand, the parse is less plausible
when the constraint is unsatisfied. Whereas traditional parsers simply reject
a parse which does not satisfy a given constraint, our system instead builds or
removes links between nodes, forcing them to satisfy the constraints. A parse
with such forced constraints will record an increased cost and will be less
preferred than parses without attached costs.
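The cost-attaching constraint check can be sketched like this; the feature paths and the penalty value are illustrative assumptions:

```python
FORCED_CONSTRAINT_COST = 0.8  # assumed penalty for forcing a constraint

def check_constraints(parse, equations):
    """Apply constraint equations such as (agent num) = (action num).

    Instead of rejecting a parse whose constraint fails, force the
    offending feature to agree and record an added cost, so forced
    parses survive but are dispreferred.
    """
    cost = 0.0
    for left, right in equations:
        if parse.get(left) != parse.get(right):
            parse[right] = parse[left]   # rebuild the link to satisfy it
            cost += FORCED_CONSTRAINT_COST
    return cost

parse = {"agent.num": "sg", "action.num": "pl"}
cost = check_constraints(parse, [("agent.num", "action.num")])
```

The disagreeing parse is repaired rather than rejected, but it now carries the forced-constraint cost and loses to any parse that satisfied the equation outright.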

4.9 INTERLINGUA WITH MULTIPLE LEVELS OF ABSTRACTION
ΦDMDIALOG is an interlingua system in which meaning representations are as-
sumed to be language independent. However, unlike other machine translation
systems which employ a single level of abstraction in parsing and generation,
ΦDMDIALOG uses multiple levels of abstraction. For instance, the KBMT-89
system [Nirenburg et al., 1989a] uses Lexical Functional Grammar (LFG) as
a basis for parsing and generation, but it does not use any phrasal lexicons
or semantic grammars along with the LFG rules. In ΦDMDIALOG, by contrast,
specific cases, generalized cases, and a unification grammar co-exist.
This is illustrated in Figure 4.20. There, line (α) represents the process of
translating a specific case, i.e., a representation of a particular source language
sentence. The level of abstraction increases as we move up to line (β), which
traces translation of what we call "generalized cases" or conceptual representa-
tions (given as <*person *want *to *circum>). At the most abstract level,
(γ), we rely on a unification-based grammar. Translation occurs at the low-
est (the least abstract) possible level. At all levels, however, CCs which
are linked to CSCs represent language-independent concepts. Each CSC, at
any level of abstraction, is associated with constraint equations which create
meaning representations. When a CSC is a specific case (level α), its meaning
representation is directly associated.

[Figure: three levels of abstraction, from bottom to top: specific cases such as
"I want to attend the conference" / "Kaigi ni sanka shitai" (linked by
I-want-to-attend-conference), generalized cases such as <*person *want *to
*circum> / <*person *ha *circum *want> (linked via IS-A), and a
unification-based grammar at the top; translation paths run horizontally at
each level.]

Figure 4.20  Translation paths at different levels of abstraction

Advantages of using multiple levels of abstraction are as follows. The approach:

1. Improves performance by performing translations whenever possible at a
level closer to the surface; there is no need for expensive unification or
constraint-based processes.
2. Ensures scalability since new sentence structures can be handled simply
by adding CSCs.
3. Attains high performance with massively parallel machines because, in
most cases, translation can be done by finding specific or generalized cases
during parsing and by invoking corresponding cases in the target language.
This essentially converts time-complexity into space-complexity, which is
not a problem with massively parallel machines.
Knowledge at different levels of abstraction can be integrated by the feature
aggregation scheme described in Section 4.5. With feature aggregation, parsing
and generation are attempted with the most specific cases, which involve
no or only a small amount of feature unification (or constraint satisfaction).
If a specific case is not found, the A-marker propagates upward and searches
for more abstract cases or those which involve more expensive constraint sat-
isfaction processes. When none of the cases is applicable for the input, the
unification- or constraint-based process is invoked; this is stored at the highest
level of abstraction in the memory network. The levels of abstraction are by no
means discrete. Levels between, say, specific case and generalized case are per-
missible. In such cases, a CSC may represent a sequence such as <I want to
attend *EVENT>, which is a mixture of a specific case and a generalized case.
CSCs representing specific cases are mostly used for "canned expressions" and
CSCs at more abstract levels cover a wider range of expressions. It is also
possible to use specific cases for analysis and general cases for generation, and
vice versa, if necessary. Since the system translates at the most specific level
of abstraction, computational costs are automatically kept to a minimum.
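Schematically, the fallback from specific cases through generalized cases to the unification-based grammar can be written as a lookup chain. The entries, keys, and function names below are invented for illustration, not the actual memory network contents:

```python
# Illustrative translation memory at decreasing levels of specificity.
SPECIFIC_CASES = {
    "i want to attend the conference": "Kaigi ni sanka shitai",
}
GENERALIZED_CASES = {
    ("*person", "*want", "*to", "*circum"): "<*person *ha *circum *want>",
}

def translate(sentence, concepts, unify):
    """Try the least abstract level first; only fall back to the more
    expensive constraint- or unification-based process when no case
    at a lower level matches."""
    if sentence.lower() in SPECIFIC_CASES:
        return SPECIFIC_CASES[sentence.lower()]
    if concepts in GENERALIZED_CASES:
        return GENERALIZED_CASES[concepts]
    return unify(sentence)   # unification-based grammar as last resort

out = translate("I want to attend the conference",
                ("*person", "*want", "*to", "*circum"),
                unify=lambda s: "<unification result>")
```

A canned sentence is answered directly from the specific-case memory; only genuinely novel inputs reach the expensive bottom branch.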

4.10 GENERATION
The generation algorithm of ΦDMDIALOG can be characterized by its highly
integrated processing and its capability of simultaneous interpretation. The
generation algorithm employs a parallel incremental model which is coupled
with the parsing process in order to perform simultaneous interpretation. In
addition, the case-based process and the constraint-based process are integrated
in order to generate the most specific expressions using past cases and their
generalization, while maintaining syntactic coverage of the generator.

4.10.1 A Basic Generation Algorithm


Generally, natural language generation involves several stages: content delin-
eation, text structuring, lexical selection, syntactic selection, coreference treat-
ment, constituent ordering, and realization [Nirenburg et al., 1989b]. In our
model, the content is determined at the parsing stage, and most other pro-
cesses are unified into one stage because, in our model, lexical items, phrases,
syntactic fragments, and sentences are treated by the same mechanism. There is no
need to subdivide the generation process into text structuring, lexical selection
and syntactic selection. The common thrust in our model is the hypothesis
activation-selection cycle in which multiple hypotheses are activated and where


one of them is finally selected. This cycle is adopted throughout parsing and
generation. Lexical and syntactic hypotheses are activated at the same time
and one hypothesis among them is selected to form a final output string. Thus,
the translation process of our model can be viewed as the following processes:

Concept Activation: A part of the parsing process. Individual concepts rep-
resented by CCs are activated as a result of parsing speech inputs. A-
Markers are created and passed up by activating the concept.

Lexical and Phrasal Hypotheses Activation: Hypotheses for lexical items
and phrases which represent the activated concept are searched for, and
G-Markers are created and passed up as a result of this process. Usually,
multiple candidates are activated at one time.
Propositional Content Activation: A part of the parsing process, by which
propositional content of the utterance is determined.
Syntactic and Lexical Selection: Selection of one hypothesis from multiple
lexical or phrasal candidates. First, the syntactic and semantic con-
straints are checked to ensure the correctness of the hypotheses, and the
final selection is made using cost/activation-based selection.
Realization: The surface string (which can be either a sequence of words or
a sequence of phonological signs) is formed from the selected hypothesis
and sent to the speech synthesis device.
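The last two stages, constraint checking followed by cost-based choice, can be sketched as a single activation-selection step. The candidate fields and values are illustrative:

```python
def select_hypothesis(hypotheses, constraints_ok):
    """One hypothesis activation-selection step: filter out hypotheses
    violating syntactic/semantic constraints, then keep the cheapest
    of the survivors."""
    viable = [h for h in hypotheses if constraints_ok(h)]
    return min(viable, key=lambda h: h["cost"]) if viable else None

candidates = [
    {"surface": "kaigi ni sanka shitai", "cost": 0.2, "grammatical": True},
    {"surface": "kaigi ni deru",         "cost": 0.5, "grammatical": True},
    {"surface": "kaigi ga sanka",        "cost": 0.1, "grammatical": False},
]
best = select_hypothesis(candidates, lambda h: h["grammatical"])
```

Note that the cheapest raw candidate is eliminated by the constraint check, so the selection falls to the cheapest grammatical hypothesis.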

The movement of V-Markers is important in understanding our algorithm.


First, a V-Marker is located on the first element of the CSC. When a G-Marker
hits the element with the V-Marker, the V-Marker is moved to the next element
of the CSC (Figure 4.21a). In the case where the G-Marker hits an element
without a V-Marker, the G-Marker is stored in the element. When another
G-Marker hits the element with a V-Marker, the V-Marker is moved to the
next element. Since the next element already has a G-Marker, the V-Marker
is further moved to the subsequent element of the CSC (Figure 4.21b). In
Figure 4.21c, e1 is a closed-class lexical item^10. When a G-Marker hits the first
element, a V-Marker on the first element is moved to the third element by
passing through the second element, which is a closed-class item. In this case,
the element for the closed-class item need not have a G-Marker. The lexical
^10 Closed-class lexical items refer to function words such as in, at, of, and to in English, and
ga, ha, and wo in Japanese.

[Figure: three schematic CSCs <e0 e1 e2 ... en> showing (a) a G-Marker
hitting the element carrying the V-Marker, which then moves to the next
element; (b) a G-Marker arriving at the V-Marker's element and the V-Marker
moving past a following element that already holds a G-Marker to the
subsequent element; (c) the V-Marker passing through a closed-class element
e1, which needs no G-Marker.]

Figure 4.21  Movement of the V-Marker in the CSC

[Figure: a higher-layer CSC <e0 e1 e2 ... en> whose elements are linked to
lower CSCs <e00 e01 ... e0l> and <e10 e11 ... e1m>; when the lower CSC is
accepted, a G-Marker is passed up the IS-A link and the V-Marker on the
higher layer moves to the next element.]

Figure 4.22  Movement of the V-Marker in the hierarchy of CSCs

realization for the element is retrieved when the V-Marker passes through the
element.

There are cases in which an element of the CSC is linked to other CSCs as
seen in figure 4.22. In such instances, when the last element with a V-Marker
gets a G-Marker, that CSC is accepted and a G-Marker that contains relevant
information is passed up through an IS-A link. Then an element of the higher
layer CSC gets the G-Marker and a V-Marker is moved to the next element.
Since the element is linked to the other CSCs, constraints recorded on the V-
Marker are passed down to lower CSCs. This movement of the V-Marker allows
our algorithm to generate complex sentences.
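The V-Marker movement described above can be sketched as follows: a G-Marker landing on the V-Marker's element emits its realization, and the V-Marker then skips forward over closed-class items (retrieving their canned forms) and over elements whose G-Markers arrived earlier. The data mirrors the running example but is simplified:

```python
def advance_v_marker(csc, v, g_markers, closed_class, surface):
    """Advance the V-Marker after a G-Marker lands on element csc[v].

    Emits csc[v]'s realization, then keeps moving the V-Marker over
    closed-class items and over elements whose G-Markers are already
    stored, until it reaches an element that has neither.
    """
    surface.append(g_markers[csc[v]])
    v += 1
    while v < len(csc):
        element = csc[v]
        if element in closed_class:       # e.g. particles 'ha', 'ni'
            surface.append(closed_class[element])
        elif element in g_markers:        # G-Marker stored earlier
            surface.append(g_markers[element])
        else:
            break
        v += 1
    return v

# The running example: <*person *ha *event *ni *attend *want>.
csc = ["*person", "*ha", "*event", "*ni", "*attend", "*want"]
closed = {"*ha": "ha", "*ni": "ni"}
g_markers, surface = {}, []

g_markers["*person"] = "jon"      # G-Marker from 'John'
v = advance_v_marker(csc, 0, g_markers, closed, surface)
g_markers["*want"] = "shitai"     # stored: V-Marker is not here yet
g_markers["*attend"] = "sanka"
g_markers["*event"] = "kaigi"     # G-Marker from 'conference'
v = advance_v_marker(csc, v, g_markers, closed, surface)
```

After the second call the V-Marker has passed the last element and `surface` holds the full realization 'jon ha kaigi ni sanka shitai'.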

Figure 4.23 shows an example of how an analysis tree can be constructed in


our model. In this example, we assume LFG as a grammar formalism, and
[Figure: an analysis tree built incrementally in stages (1)-(3) and a final
tree, from the fragments "She", "at the hotel", "her luggage", and
"to unpack her luggage at the hotel".]

Figure 4.23  An Incremental Tree Construction


[Figure: two sequences of incrementally built syntax trees. Panels (A1)-(A3)
construct the passive sentence "John was surprised by Mary"; panels (B1)-(B3)
construct the active sentence "Mary surprised John".]

Figure 4.24  Change of Produced Sentence due to the Different Semantic
Inputs

the order in which conceptual fragments are given is based on the order in which
conceptual fragments can be identified when parsing a corresponding Japanese
sentence incrementally. It should be noted that our model does not necessarily
assume a special grammar, as is the case in other incremental generation models
such as [Kempen and Hoenkamp, 1987].

Figure 4.24 shows variations in generated sentences due to differences in
the order of input concept fragments. In Figure 4.24 (A1-A3), concept fragments
are given in the order of (OBJECT JOHN), (ACTION SURPRISE), and (ACTOR
MARY). This order results in the generation of a passive sentence. However, in
Figure 4.24 (B1-B3), conceptual fragments are given in the order of (ACTOR MARY),
(ACTION SURPRISE), and (OBJECT JOHN), which creates an active
voice sentence. This flexibility of sentence structure, which is dependent upon
the order and timing of an input concept fragment, is an essential feature for
an incremental sentence production scheme.

Let us illustrate the generation process using the simple example shown in
Figure 4.25. In keeping with earlier examples, the input sentence is John wants
to attend the conference.

First, part (a) of the figure shows the concept activation stage: john is activated
and an A-marker is created. The A-marker propagates and activates a CC node,
*john. This is a part of the parsing process.

Part (b) is a lexical hypothesis activation stage: a G-marker is created at the
lexical node under the CC *john. The G-marker contains the surface realization of
the concept *john, which is 'jon', along with other linguistic information. Then,
the A-marker and the G-marker propagate upward. The A-marker traverses
toward CSCs representing English syntax, and the G-marker traverses towards
the CSC representing Japanese syntax. On the CSC <*person *ha ... >, a
V-marker is already placed on the first element of the CSC. The G-marker and
the V-marker collide at *person.

Next, part (c) of Figure 4.25 shows the V-marker shift. Since the element *ha
is a closed-class item, it retrieves the lexical realization of *ha which is 'ha',
and the V-Marker is moved to *event. At this point, the V-marker contains
the surface string 'Jon ha' along with other syntax and semantic information.
This is a partial realization of the surface string.

Part (d) shows the processing of want and attend. This is the concept activation
stage and the lexical hypothesis activation stage. Due to the difference in
word order between English and Japanese, the V-marker is not placed on *want and
*attend in the CSC for Japanese. G-markers propagated from shitai ("want")
and sanka ("attend") simply stay at each element of the CSC until the V-marker
arrives.

The processing of kaigi, triggered by the input word conference, is shown in part (e).
A G-marker and a V-marker collide at *event, and the V-marker is moved to
*ni. Since *ni is a closed-class item, its surface realization is appended to the
V-marker and the V-marker further moves to *attend. Now, *attend already
has a G-marker, so a G-V-collision takes place there and the V-marker
moves to *want. Again, *want already has a G-marker and a G-V-collision
occurs. Since *want is the last element of the CSC, the V-marker contains the
surface realization created by this local generation process: 'Jon ha kaigi ni
sanka shitai'. This is the realization stage. Although the possible translation
is created, it does not mean that this is the translation of the input sentence,
because the whole process is based on lexical-level translation, and no result
of analysis from the parsing stage is involved. At this stage, it is generally the
case that multiple generation hypotheses are activated.

[Figure: six snapshots of the memory network during generation: (a) concept
activation ('john' activated); (b) passing up A- and G-Markers; (c) shift of
the V-Marker ('jon ha'); (d) after processing 'attend'; (e) processing
'conference'; (f) final translation 'Jon ha kaigi ni sanka shitai' created.]

Figure 4.25  A simple example of the generation process

When parsing of the sentence, as a whole or as a local phrase, is complete, its


interlingua representation is constructed. It should be noted that for each CSC,
there is a CC node which represents its concept (see Figure 4.26). As a result of
parsing (whether of a complete sentence or of a local phrase), certain CC nodes
will be activated and one will be selected as a CC representing the meaning of
the sentence (or a phrase). This is the propositional content activation stage.
Then, the target language CSC under the CC will be selected as a translation
of the input sentence. This is the syntactic and lexical selection stage. This
time, a constraint check is performed to ensure the legitimacy of the sentence
to be generated. When there is more than one CSC active under the CC,
the one with the lowest cost in the G-marker is selected.

4.10.2 Hypotheses Activation

When a concept is recognized by the parsing process, hypotheses for translation
will be activated. The concept can be an individual concept, a phrase, or a
sentence. In our model, they are all represented as CC nodes, and each instance
of the concept is represented as a CI node. The basic process is that for each
activated CC, LEX nodes^11 in the target language will be activated. There
are several cases:

Word-to-Word: This is the case when a word in the source language can be
translated into a word in the target language. In Figure 4.26a, the word
LEX_SL activates CC1. LEX1_TL is activated as a hypothesis of transla-
tion for LEX_SL interpreted as CC1. A G-Marker is created at LEX1_TL
containing a surface realization, cost, features, and the instance which
LEX1_TL represents (CI). The G-Marker is passed up through an IS-A link.
When CC1 does not have LEX1_TL, CC2 is activated and LEX2_TL will
be activated. Thus, the most specific word in the target language will be
activated as a hypothesis.
^11 LEX nodes are a kind of CSC. They represent the lexical entry and phonological realization
of a word.
[Figure: four configurations of source-language nodes, CC nodes, and
target-language nodes: (a) word-to-word, LEX_SL activating CC1 and LEX1_TL
(or CC2 and LEX2_TL via IS-A); (b) word-to-phrase, LEX_SL activating CC1 and
CSC1_TL; (c) phrase-to-word, CSC_SL activating CC1 and LEX1_TL; (d)
phrase-to-phrase, CSC_SL translated via CC1 into CSC1_TL.]

Figure 4.26  Activation of Syntactic and Lexical Hypotheses

Word-to-Phrase: When a CC can be represented by a phrase or sentence,
a CSC node is activated and a G-Marker which contains that phrase or
sentence will be created. In Figure 4.26b, LEX_SL activates CC1 which
has CSC1_TL. In this case, CSC1_TL will be activated as a hypothesis to
translate LEX_SL interpreted as CC1.

Phrase-to-Word: There are cases where a phrasal or sentential expression can
be expressed in one word in the target language. In Figure 4.26c, CSC_SL ac-
tivates CC1, which can be expressed in one word using LEX1_TL. LEX1_TL
will be activated as a hypothesis for translating CSC_SL.

Phrase-to-Phrase: In cases where the expressions in both languages corre-
spond at the phrase level, the phrase-to-phrase translation mechanism is
adopted. In Figure 4.26d, CSC_SL will be translated using CSC1_TL via
CC1. Such cases are often found in greetings or canned phrases.

4.10.3 Syntactic and Lexical Selection


Syntactic and lexical selection is conducted through three processes: fea-
ture aggregation, constraint satisfaction, and competitive activation. Feature
aggregation and constraint satisfaction correspond to a symbolic approach to
syntactic and lexical selection which guarantees grammaticality and local se-
mantic accuracy of the generated sentences, and the competitive activation
process is added in order to select the best decision among multiple candidates.
Features are carried up by G-Markers using feature aggregation. At each CSC,
constraint satisfaction is performed in order to ensure the grammaticality of
each hypothesis. Hypotheses which do not meet grammatical constraints are
eliminated at this stage. Grammatical constraints are imposed using constraint
equations, an example of which is (agent num) = (action num) which re-
quires number agreement. Among hypotheses which are grammatically sound,
one hypothesis is selected using the cost-based scheme; i.e. the hypothesis with
the least cost will be selected. Priming of each hypothesis can be done by
C-Marker passing or by the connectionist network. There are cases in which
hypotheses from the case-based and constraint-based processes are both activated.
In such cases, the system prefers the hypothesis from the case-based process,
unless ungrammaticality is observed.
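Feature aggregation and a constraint-equation check of the kind shown above, such as (agent num) = (action num), can be sketched as follows; the feature names are illustrative:

```python
def aggregate(g_marker_features):
    """Feature aggregation: merge the features carried up by G-Markers."""
    merged = {}
    for features in g_marker_features:
        merged.update(features)
    return merged

def satisfies(equation, features):
    """Check a constraint equation such as (agent num) = (action num):
    the two feature paths must carry the same value."""
    left, right = equation
    return features.get(left) == features.get(right)

# Features contributed by two G-Markers are aggregated, then the
# number-agreement equation is checked against the merged set.
features = aggregate([{"agent.num": "sg"}, {"action.num": "sg"}])
agrees = satisfies(("agent.num", "action.num"), features)
```

Hypotheses failing the check are eliminated; the cost-based competition then runs only over the grammatically sound survivors.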

4.11 SIMULTANEOUS
INTERPRETATION: GENERATION
WHILE PARSING IS IN PROGRESS
Development of a model of simultaneous interpretation is a major goal of the
project, and it makes our project unique among other research efforts in this field. We
have investigated actual recordings of simultaneous interpretation sessions and
simulated telephone conversation experiments, and made several hypotheses as
to how such activities are performed, as a basis for designing the ΦDMDIALOG
system.

The process of simultaneous interpretation is a knowledge-intensive and
highly interactive process requiring the dynamic participation of various knowl-
edge sources. Simultaneity of interpretation emerges from the fact that inter-
preters actually start translation even before the whole sentence is spoken by
the speaker. We hypothesize that such activity is made possible because simul-
taneous interpreters process parsing and generation almost concurrently, and
their knowledge, especially discourse and world knowledge, enables appropriate
prediction and selection of hypotheses as to the meanings of utterances.

From the practical aspect, the simultaneous interpretation capability is essen-
tial for real-time deployment of the system. In real dialogs, the length of each
utterance can be considerable. Utterances that take 10-15 seconds each are
frequently observed. This imposes critical problems for sequential parse-then-
generate architectures. Supposing that one utterance is 15 seconds in length,
the hearer would need to wait more than 15 seconds to start hearing the
translation of her/his dialog partner's utterance. Then, assuming that she/he
responds with an utterance of 15 seconds in length, the first speaker would
have to wait at least 30 seconds to start hearing her/his dialog partner's
response. We believe that unless speech-to-speech translation systems overcome
this problem, practical deployment is hopeless.
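The latency argument can be made concrete with a little arithmetic. The processing-time and lag figures below are assumptions for illustration, not measurements:

```python
utterance = 15.0    # seconds per utterance, as in the text
processing = 2.0    # assumed processing time of a sequential system
lag = 2.0           # assumed ear-voice span of an incremental system

# Sequential parse-then-generate: each translation starts only after
# the full utterance has been spoken and processed, so the first
# speaker sits through two full utterance + processing cycles.
sequential_wait = 2 * (utterance + processing)

# Simultaneous interpretation: each translation trails its source
# utterance by only the incremental lag.
simultaneous_wait = utterance + 2 * lag
```

Even under these charitable assumptions, the sequential architecture keeps the first speaker waiting roughly twice as long before the partner's reply begins.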

4.11.1 Empirical Studies of Simultaneous Interpretation
The approach we take is to simulate actual simultaneous interpreters at
work. Here, we briefly examine portions of transcripts of simultaneous
interpretation sessions. The transcripts shown in Tables 4.3, 4.4, and 4.5 are
taken from actual simultaneous interpretation sessions. J shows source (or
translated) sentences in Japanese, e is an English annotation of the Japanese
sentences, and E shows translations made by the interpreter (or sentences spoken
by the speaker). The transcripts are time-aligned so that the time before the
interpreter starts translating the speaker's utterance can be analyzed.

Table 4.3 is a transcript of English to Japanese simultaneous interpretation.


Judging from this transcript and other transcripts, there are substantial num-
bers of canned phrases used by the interpreter. The first two sentences are
good examples. It seems that for some sentences, and typically in greetings,
phrasal lexicons representing canned phrases may be useful even in computer-
based processing. In the fourth sentence, we can see how parts of a sentence
are translated and incrementally generated. In this example, a subject is trans-
lated before a verb is spoken. Due to the verb-final nature of Japanese, some
parts of the sentence were not translated until the accusative was recognized
in the source language.

The two transcripts of Japanese to English translation (Tables 4.4 and 4.5)
show that the interpreter divides an original sentence into several sentences
[Figure: two timing diagrams. (a) With a conventional sequential system,
Speaker-1's 15-second Japanese utterance is translated only after it ends, the
English translation occupies a further 15 seconds plus processing time, and
Speaker-1 waits 30 seconds plus processing time before hearing a response.
(b) With a simultaneous interpretation system, each translation overlaps its
source utterance, separated only by a short delay time.]

Figure 4.27  Transaction with Conventional and Simultaneous Interpretation
Architecture

E  On behalf of the Financial Times,
J  [Japanese not legible in the scan]
e  Financial Times Co. represent
E  may I welcome you all to the Imperial Hotel.
J  [Japanese not legible in the scan]
e  you -PL the imperial hotel welcome
E  It is a great pleasure to see so many people here today.
J  [Japanese not legible in the scan]
e  I for
J  [Japanese not legible in the scan]
e  you meet pleasure
E  Our presentation today is in three parts.
J  [Japanese not legible in the scan]
e  today presentation -TOP three -GEN part from consists
E  First of all, I would like to give you a brief
J  [Japanese not legible in the scan] goku kantan ni
e  first of all, I -NOM briefly
E  historical review of the Financial Times -
J  [Japanese not legible in the scan]
e  Financial Times Co. -GEN history about talk
E  where we are today,
J  [Japanese not legible in the scan]
e  today status and
E  and what our plans are for the future.
J  [Japanese not legible in the scan]
e  future plans what status talk

Table 4.3  Transcript: English to Japanese



J  [Japanese not legible in the scan]
e  Japan economy success actually remarkable things exist, people
E  The success of Japanese economic development
J  [Japanese not legible in the scan]
e  QOL improved CPI stabilized being desirable thing
E  has actually been remarkable. The living standard has risen and the CPI has
J  [Japanese not legible in the scan]
e  success all not because, for example, JNR
E  stabilized. These are to be desired, but not all are successes.
J  [Japanese not legible in the scan]
e  case look-at obvious
E  JNR is an obvious example.

Table 4.4  Transcript: Japanese to English (1)

J  [Japanese not legible in the scan]
e  can't-take work such image Japan labor system
E  It seems to those who say this that workers are tied to the company and work
J  [Japanese not legible in the scan]
e  scratched-the-surface Westerners have it-seems
E  This is the image of the working system held by Europeans and Americans
E  towards the Japanese workers.

Table 4.5  Transcript: Japanese to English (2)


in translation. This is because a long Japanese sentence often contains several
distinct parts, each of which can be expressed as a sentence, and translation of
such a sentence into one English sentence is almost impossible. By subdividing
a long sentence into multiple sentences, the interpreter (1) is able to produce
understandable translations, and (2) avoids delays in translation mainly caused
by the verb-final characteristics of Japanese. Behind this, we can assume, is
the fact that the interpreter has a strong expectation about what can be said
in the sentence currently being processed using discourse context and world
knowledge. For example, in the second sentence (table 4.5), the verb of the
sentence, motteiru (have, or hold), comes at the very end of the sentence.
Simultaneous interpretation is only possible because the interpreter made a
guess, from the context, that issues of the Japanese labor system are to be
described in the images held by western peoples. Thus, the interpreter made
translations using... are what have been pointed to or ... It seems to those who
say this ... . It is important to notice that these translations were made before
the main verb was spoken.

Several observations can be made from these transcripts:

• Translation began even in the middle of the input sentence.
• The interpreter uses a phrasal lexicon of canned expressions.
• Translation generally starts after a phrase is spoken.
• Long sentences are translated into multiple sentences. This is typically
observed in Japanese-to-English translation.
• The interpreter makes strong guesses as to what will be said.

These observations support our hypotheses stated at the beginning of this paper.
We can therefore derive several requirements that the generator of a simulta-
neous interpretation system must satisfy:

• The system must have incremental parsing capability.
• The system must be able to produce sentences incrementally.
• The system must have opportunistic sentence planning capability to avoid
syntactic dead-ends.
• The system must be able to divide one sentence into multiple sentences.
• The system must be able to predict what may be said.

INPUT UTTERANCE          TRANSLATION

John
wants                    Jon ga (John role-agent)
to
attend
the
conference
because                  kaigi ni sanka shitai (want to attend the conference)
he
is                       toiunoha (because)
interested               kare ga (he role-agent)
in
interpreting telephony   tuuyaku denwa ni kyoumi ga aru kara desu
                         (interested in interpreting telephony)

Table 4.6  Simultaneous interpretation in ΦDMDIALOG

4.11.2 Simultaneous Interpretation in ΦDMDIALOG

In this section, we describe how our model performs simultaneous interpreta-


tion. The basis for this capability is the use of a parallel incremental model
of parsing and generation, which has been described in previous sections, so
that these can run almost concurrently with certain interactions. Of course,
formulation of each part of the sentence takes place after it is processed and its
meaning is determined. However, it is concurrent in the sense that the genera-
tion process does not wait until the entire parse is completed, so the translated
utterance is generated incrementally^12. Lexical selection and partial produc-
tion of utterances are conducted while parsing is in progress. Thus, for some
inputs, a part of the utterance can be generated before parsing of the entire
sentence is completed. We do this by verbalizing the surface string or phonolog-
ical realization of an instance whose role is determined, i.e., not ambiguous,
and delaying verbalization of ambiguous instances until they are disambiguated.
The part of the sentence which is verbalized is recorded in a V-Marker and
the V-Marker is moved to the next possible verbalization element. This avoids
redundant verbalization. Only the element with a V-Marker can be verbalized
in order to ensure the consistency of the produced sentence.
12 Unlike an incremental generation by IPG[Kempen and Hoekamp, 1987J, which assigns
procedures to each syntactic category, our algorithm uses markers to carry information. Also,
concepts to be expressed are incrementally determined as parsing progresses.
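The V-Marker bookkeeping described above can be sketched roughly as follows. This is an illustrative sketch only: the class, method names, and the three-element target sequence are hypothetical, not the actual ΦDMDIALOG implementation.

```python
# Sketch of V-Marker-driven incremental verbalization (hypothetical names).
# A V-Marker sits on the next element of the target-language sequence that
# may be verbalized; an element is spoken only once its role is unambiguous.

class VMarker:
    def __init__(self, sequence):
        self.sequence = sequence  # target-language elements, left to right
        self.position = 0         # index of the next element to verbalize
        self.output = []          # surface strings produced so far

    def try_verbalize(self, resolved):
        """Verbalize elements whose meaning is already determined, then
        stop at the first still-ambiguous element (avoiding redundant
        verbalization, since only the element with the V-Marker may speak)."""
        while self.position < len(self.sequence):
            element, surface = self.sequence[self.position]
            if element not in resolved:
                break                      # wait until disambiguated
            self.output.append(surface)    # emit and advance the V-Marker
            self.position += 1
        return self.output

vm = VMarker([("*person", "jon ga"),
              ("*conference", "kaigi ni"),
              ("*attend", "sanka shitai")])
# After 'wants' is heard, only the agent role is determined:
print(vm.try_verbalize({"*person"}))
# Once the clause is complete, the rest is verbalized:
print(vm.try_verbalize({"*person", "*conference", "*attend"}))
```

The key design point mirrored here is that verbalization is monotone left to right: once a surface string is emitted it is never retracted, which is why emission must wait for disambiguation.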
[Figure 4.28: part of the memory network involved in this translation, showing CC nodes such as *john, *to, and *because, with paired English/Japanese LEX nodes 'John'/'jon', 'wants'/'shitai', 'to', 'attend'/'sanka', 'the', 'conference'/'kaigi', 'ga', 'ni', and 'because'.]
Let's explain this process with an example. Table 4.6 indicates a temporal re-
lationship between a series of words given to the system and incremental gen-
eration in the target language. Figure 4.28 shows part of the memory network
involved in this translation (simplified for the sake of clarity). An incremental translation and generation of the input (John wants to attend the conference because he is interested in interpreting telephony) results in two connected
Japanese sentences: Jon wa kaigi ni sanka shitai. Toiunoha kare ha tuuyaku
denwa ni kyoumi ga arukara desu. Speech processing is conducted using the
method already outlined. The following explanation of processing is from the
perspective of lexical activations.

When a LEX node for John is activated as a result of phonological processing, an A-marker is created containing information relevant to John and sent to the CC nodes *john, *male (superclass of *john), and *person (superclass
of *male). The A-marker should contain a cost, feature and discourse entity
for John. When the CC node *john is activated, the program searches for
a Japanese lexical entry for the concept *john and finds jon. A G-marker
is created and includes an instance (@John001; although this is not shown in
Figure 4.28, @John001 is created under *john as a discourse entity) and a
surface string ('jon').
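The upward propagation of A-Markers through IS-A links can be sketched as follows. The hierarchy fragment and function names are hypothetical illustrations of the scheme, not the system's actual code.

```python
# Sketch of A-Marker propagation up IS-A links in the memory network
# (hypothetical structures; markers carry cost, features, and a discourse
# entity in the model, reduced here to a plain dict).

ISA = {"*john": "*male", "*male": "*person"}  # fragment of the class hierarchy

def propagate_a_marker(start, marker):
    """Send an A-Marker from an activated CC node to all its superclasses."""
    activated = []
    node = start
    while node is not None:
        activated.append((node, marker))
        node = ISA.get(node)      # follow the IS-A link upward, if any
    return activated

# Recognizing 'John' activates *john, *male, and *person:
hits = propagate_a_marker("*john", {"instance": "@John001", "cost": 0.0})
print([node for node, _ in hits])   # ['*john', '*male', '*person']
```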

The G-marker is passed up through the memory network. This process takes
place for each word in the sentence. P-markers are initially located on the
first element of CSCs for the source language. In this example, a P-marker
is at *person in <*person *want *to *circumstance> and <*attend *def
*conference>. V-markers are located in the first element of CSCs for the
target language.

When *person receives an A-marker, information in the A-marker is tested against constraints imposed by the P-marker. Since there is no a priori
constraint in this case, features and instances contained in the A-marker
are simply assigned to constraint equations. A constraint equation ((agent
num) = (action num)) constrains the number feature of action to be third-
person singular in this case because (agent num) is third-person singular.
Then, the P-marker is moved to *want. When *want receives an A-marker,
the P-marker is moved to *to and then to *circumstance. Similar con-
straint checks are conducted for each collision. When the P-marker is moved
to *circumstance, constraints are passed down through a copy of the P-
marker, which is a P-marker with identical information, and located on
the first element of the lower level CSC, *attend. The lower-level CSC
(<*attend *def *conference>) has constraint equations including (actor =
(↑ actor)) which describe control. Such propagation of constraints ensures
linguistically legitimate processing. When the lower level CSC is accepted, a new A-marker is created and passed up through IS-A links. It will eventually
activate *circumstance.
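The P-Marker movement just described can be sketched as follows. The function is an illustrative simplification: constraint checking at each collision is reduced to plain symbol matching, which is an assumption of this sketch rather than the model's actual mechanism.

```python
# Sketch of P-Marker movement along a CSC as A-Markers arrive
# (hypothetical names; constraint equations are not modeled here).

def advance_p_marker(csc, a_markers):
    """Advance the P-Marker by one element per matching A-Marker;
    return True when the whole sequence is accepted."""
    position = 0
    for marker in a_markers:
        if position >= len(csc):
            break
        if marker == csc[position]:   # collision: A-Marker meets P-Marker
            position += 1             # constraints would be checked here
    return position == len(csc)

csc = ["*person", "*want", "*to", "*circumstance"]
accepted = advance_p_marker(csc, ["*person", "*want", "*to", "*circumstance"])
print(accepted)  # True: instantiation would now create a CI under the root CC
```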

Instantiation takes place when this sequence is accepted, i.e. when the P-
marker is placed on the last element of the sequence and that element gets an
A-marker. As a result, a CI is created under the CC (*want-attend-conference),
which is the root node of the accepted CSC, and linked with relevant
instances. The CI represents the meaning of the analyzed utterance.

There is a point during analysis at which the semantics of a part of the sen-
tence is determined. For instance, John (@John001) can only be an agent
of the action when wants is analyzed. At this time, Jon ga ((John
Role-Agent)) can be verbalized. A V-marker which is initially located on
*person is now moved to *conference, which is the next element of the
CSC. The V-marker simply passes through *ga because it is a closed-class
item. The next verbalization will not be triggered until because comes in
because the role of John wants to attend the conference in the discourse
is still ambiguous. At this point, *want-attend-conference and its su-
perclass node including *assert-goal are activated. A-markers passed up
through discourse-level knowledge carry relevant information needed to impose
constraints on possible next utterances. A P-marker is now placed on
*because in a CSC (namely <*assert-goal *because *assert-reason>)
and a V-Marker is still placed on *assert-goal of the corresponding Japanese
CSC (<*assert-goal *touiunoha *assert-reason>). Activation of
*because will determine the role of *want-attend-conference. The word
because acts as a clue which divides the assertion of the speaker's goal from the
reasons for the goal, as represented in the CSC. Verbalization is now triggered,
i.e., a Japanese translation of John wants to attend the conference is vocalized,
and the V-marker is moved to the next element of the CSC. When a whole
sentence is parsed, its entire meaning is made clear and the rest of the sentence
is verbalized: Jon wa kaigi ni sanka shitai. Toiunoha kare ha tuuyaku denwa
ni kyoumi ga arukara desu.

This example illustrates generation in the target language at the earliest possi-
ble point. Although this is a perfectly acceptable Japanese translation, it leaves
much room for stylistic improvement. We are currently investigating algorithms
for producing more stylistic translations without undermining the simultaneity
of translation. Translated-sentence style is greatly affected by temporal factors
such as the speed of input speech and by prosodic factors.

The surface string and other information are, again, stored in a G-Marker and
[Figure 4.29: discourse-level CSCs such as <*Declare-Want-Attend *Listen-Instruction ...>, <*Listen-Want-Attend *Assist-Registration ...>, and <*Assert-Registration *Confirm ...>, linked through concept and lexical nodes ('I'/'watashi', 'conference'/'kaigi', 'mazu') down to phoneme sequences such as /k a n f e r e n s/ and /m a z u/.]
Figure 4.29 A Process of Parsing, Generation and Prediction


passed up to discourse-level CSCs in order to generate multiple sentences. Figure 4.29 shows a part of the memory network including discourse-level knowledge. Activation is further passed up to the discourse knowledge layer and
*Declare-Want-Attend and *Listen-Want-Attend are activated. As a result
of this activation, the next possible utterances *Assert-Registration,
*First-of-All, and /m a z u/ are predicted (shown as downward arrows).

4.12 RELATED WORKS


Since other speech-to-speech translation systems have been reviewed in Chapter 1, we will review specific component technologies here.

First, several efforts have been made to integrate speech and natural language
processing. [Tomabechi et al., 1988] attempt to extend the marker-passing
model to speech input. Their model uses environment without probabilistic
measures which would allow environmental rules to be applied. Since mis-
recognitions are stochastic in nature, lack of a probability measure seems to be
a shortcoming in their model. [Saito and Tomita, 1988], [Kita et al., 1989] and
[Chow and Roukos, 1989] are examples of approaches to integrate speech with
unification-based parsing, but, unfortunately, discourse processing has not been
incorporated. [Young et al., 1989] describes an attempt to integrate speech
and natural language processing implementing layered prediction. They re-
ported that use of layered prediction involving discourse knowledge reduced
the perplexity of the task.

Incremental generation is an important feature of our model. Pioneering studies of parallel incremental sentence production are seen in [Kempen and Hoenkamp, 1987] [Kempen, 1987]. They use a segment grammar which is composed of
Node-Arc-Node building blocks to attain incremental formation of trees. Their
studies parallel our generation algorithm in many aspects. The segment gram-
mar is a kind of semantic grammar since the arc label of each segment makes
each segment a syntax/semantic object. Feature aggregation and constraint
satisfaction by G-Markers and V-Markers in our model correspond to dis-
tributed unification [De Smedt, 1989] in the segment grammar. However, their
model is limited to the syntactic layer and was not tested at the discourse-level.
4.13 DISCUSSIONS

4.13.1 Integration of Speech and Natural Language Processing
Although top-down prediction from the language model is known to improve
speech recognition accuracy, the appropriate method of providing prediction is
still an open question. In our system, we introduced (1) utterance cases and
(2) discourse knowledge as sources of predictions.

Use of cases reduces perplexity because predictions from cases are more spe-
cific than predictions from linguistic knowledge, and more constrained than a
bi-gram grammar. In an experimental grammar which has a test set perplex-
ity of 3.66 with a bi-gram grammar, the perplexity for the same test set was
reduced to 2.44 by using cases of utterances. Probabilities carried by markers
from each level of processing are merged when they meet at certain nodes as
predicted, to decide a final a priori probability distribution. Although this
method successfully reduces the perplexity measure in the test set used in the
experiment, there is some question as to its effectiveness when it is applied to
larger domains.
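For reference, the test-set perplexity figures quoted here follow the standard definition: the geometric mean of the inverse word probabilities assigned by the model. The sketch below is illustrative only and is not tied to the experimental grammar used in the study.

```python
# Test-set perplexity: 2 raised to the average negative log2 probability
# the model assigns to each test-set word (standard definition).

import math

def perplexity(word_probs):
    """word_probs: model probability assigned to each test-set word."""
    n = len(word_probs)
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_sum / n)

# A model spreading its mass uniformly over 4 choices has perplexity 4:
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```

Intuitively, reducing perplexity from 3.66 to 2.44 means the recognizer faces, on average, the equivalent of a uniform choice among fewer next words.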

While there are considerable doubts regarding the effectiveness of using syntactic and semantic levels of knowledge alone for prediction, the use of pragmatic
and discourse knowledge, such as discourse plans [Litman and Allen, 1987] and
discourse structure [Grosz and Sidner, 1990], has gained attention in the hope that
the introduction of these higher levels of constraints may help by further reducing
perplexity, and thus attain higher recognition rates. As a matter of fact, [Young
et al., 1989] reports that perplexity was reduced dramatically by introducing
discourse knowledge using a layered prediction method, and that the semantic
accuracy of the recognition result was 100%. The introduction of discourse
knowledge would be useful for highly goal-oriented and relatively limited do-
mains such as the DARPA resource management domain. We have investigated
the effectiveness of using predictions from the discourse level in the ATR con-
ference registration domain, because the ATR domain is a mixed-initiative and
less goal-oriented domain.

We have carried out an experiment using a small corpus of 3 dialogs consisting of 92 utterances. For a test set with a perplexity of 19.7 using syntactic and semantic constraints, the addition of discourse knowledge reduced the measure to 2.4.
This result alone is a significant success, and seems to confirm the effective-
ness of using discourse knowledge. However, the effectiveness of predictions from discourse knowledge largely depends upon the task domain and the coverage
of the corpus compared to dialogs in real deployment. There are three basic
of the corpus compared to dialogs in real deployment. There are three basic
problems:

First, although use of discourse knowledge generally helps in reducing perplexity, this assumes that patterns of dialogs, i.e. transition patterns among
subdomains, are relatively limited so that discourse-level knowledge can further
constrain possible next word choices. We have investigated patterns of
subdomain transitions in the ATR corpus in order to examine how well this
assumption holds in our domain. We took 30 dialogs from the ATR corpus to
measure the perplexity of subdomain transitions. The total number of utterances
in the 30 dialogs was 1325; 31 subdomains and 177 transitions were identified.
Within each subdomain there were further subdomains, but we did not
count these details; we simply counted major transitions between relatively
abstract subdomains. Each subdomain consisted of utterances ranging from 2
utterances to over 50 utterances. Perplexity of the test set without constraint
was 4.95 (note that this is not a word choice perplexity). The test set perplexity
of subdomain choices was reduced to 3.44 by using a bi-gram grammar at the
subdomain transition level. Yet, on average, the system has to select from 11
hypotheses when transiting to a new subdomain. Moreover, none of the dialog
transitions were equal to any other dialog transitions, which implies that the
perplexity of the task with a larger corpus of data would be significantly
larger than that which we encountered in our experiment. It should
be noted that a considerable portion of syntactic structures and vocabulary
are shared among subdomains, so that even if the possible next subdomains
are reduced to 1/3 of all subdomains, this does not mean possible next words
can be reduced to 1/3. Therefore, unfortunately, we must conclude that the
use of discourse knowledge which captures transition patterns among subdialog
domains, i.e. statistical transition models, bi-grams at the discourse level, and
goal calculations, would have only a limited impact in reducing perplexity in
mixed-initiative domains with larger topic spaces.
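A bi-gram model over subdomain transitions of the kind used in this experiment can be sketched as follows. The dialog data and subdomain labels below are invented for illustration; they are not the ATR corpus labels.

```python
# Sketch of a bi-gram model over subdomain transitions: estimate
# P(next_subdomain | current_subdomain) from dialog transcripts
# (illustrative data only).

from collections import Counter, defaultdict

def bigram_model(dialogs):
    """Each dialog is a sequence of subdomain labels in order of occurrence."""
    counts = defaultdict(Counter)
    for dialog in dialogs:
        for cur, nxt in zip(dialog, dialog[1:]):
            counts[cur][nxt] += 1          # count observed transitions
    # normalize counts into conditional probabilities
    return {cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for cur, nxts in counts.items()}

dialogs = [["registration", "payment", "accommodation"],
           ["registration", "payment", "sightseeing"]]
model = bigram_model(dialogs)
print(model["registration"])  # {'payment': 1.0}
```

The limitation discussed above shows up directly in such a model: any transition absent from the training dialogs receives probability zero, so unseen subdomain switches are pruned out unless the search space is widened.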

Second, the nature of mixed-initiative dialog makes accurate prediction even more difficult. Unlike the DARPA resource management domain, the ATR
domain is a mixed-initiative dialog where the two participants in the dialog
have their own intentions and goals. This is one of the inherent characteristics
of the task which a speech-to-speech translation system is expected to process.
Here is an example taken from the ATR corpus:

Secretary: Please give me your card name and number.


Questioner: It's American Express, the number is 123-45678-90123. Would the proceedings be published?
Secretary: Yes, it will be published in July. Can I charge a registration fee
to your AMEX account?
Questioner: Yes, and send me a registration form please.
Secretary: OK. I need your name and phone number, too.

The domain of this subdialog seems to be a credit card charge, but it has subdialogs of asking if the proceedings will be published and asking for a registration
form to be sent out. Although predictions of speech acts may be attainable since
more than 80% of the interaction is based on the Request-Inform discourse plan,
predictions on which subdomain the dialog may switch into and when it may
happen are hopelessly difficult. In the above dialog, how can we predict that
the questioner may ask if the proceedings will be published in the middle of a
dialog on a credit card? This means that although stronger preferences can be
placed on some of the subdomains, the system must be able to expand its search
space to nearly the entire domain so that sudden switching of subdomains in
such complicated dialog structures can be handled. When this happens, the
perplexity measure would drastically increase. In the case of our experimental
set it should fall somewhere in the middle between 2.4 and 19.7. However,
obviously, expanding search space to entire domains significantly undermines
the recognition rate. We still do not have an answer to this problem.

Third, prediction failures run the risk of undermining recognition rates by prun-
ing out a correct hypothesis in favor of incorrect but predicted hypotheses.
Chances of making wrong predictions depend upon the coverage of the corpus
collected from real dialogs. If the corpus covers a sufficient portion of possible
dialog transitions, the chances of making wrong predictions would be much
lower. In the ATR's conference registration domain which involves various
topics such as sightseeing, dinner, and hotel reservations, covering all possible
subdomains and transitions is nearly impossible. Actually, one dialog in the
corpus involves how to spend time with geisha girls in Kyoto! While covering
all possible transitions is not feasible, the problem remains of how to avoid se-
lecting wrong but predicted hypotheses when an unexpected utterance is made.
We believe that higher level knowledge can help only a little with this problem,
and that it can even be harmful in some cases. The only solution we suggest
is to improve speech recognition at lower levels.

In summary, the language model cannot be 100% correct in providing a priori
probability to the speech processing level. Use of discourse knowledge is effec-
tive only with a task of a relatively limited domain, and it would be less effective
in mixed-initiative and wide domains with which we intend to deal. Given the
fact that a highly accurate prediction of what may be said next is not feasible, we still need to improve the speech recognition system's accuracy without
depending on higher levels of knowledge sources such as discourse knowledge.

4.13.2 Psychological Plausibility


Although we do not claim that our model is a psychologically plausible model, we
have taken some psychological studies into account in designing our model.
The phenomenon of lexical ambiguity has been studied by many psycholin-
guistic researchers including [Prather and Swinney, 1988], [Cottrell, 1988], and
[Small et al., 1988]. These studies have identified contextual priming as an
important factor in ambiguity resolution. Contextual priming is incorporated
in our model using the C-Marker passing scheme and the connectionist net-
work. In resolving structural ambiguity, [Crain and Steedman, 1985] argue for
the principle of referential success. Their experiments demonstrate that people
prefer the interpretation which is most plausible and that they access previ-
ously defined discourse entities. This psycholinguistic claim and experimental
result was incorporated in our model by adding costs for instance creation and
constraint satisfaction. [Ford, Bresnan and Kaplan, 1981] propose the Lexical
Preference Theory, which assumes a preference order among lexical entries of
verbs which differ in subcategorization for prepositional phrases. This type of
preference was incorporated as the bias term in our cost equation.

Psychological studies of sentence production [Garrett, 1975] [Garrett, 1980] [Levelt and Maassen, 1981] [Bock, 1982] [Bock, 1987] and [Kempen and Huijbers, 1983] were taken into account in designing the model. In [Kempen and
Huijbers, 1983], two independent retrieval processes are assumed, one account-
ing for abstract pre-phonological items (L1-items) and the other for phonolog-
ical items (L2-items). Lexicalization in their model proceeds as follows: (1) a simultaneous multiple L1-item retrieval, (2) a monitoring process which watches the
output of L1-lexicalization to check that it is keeping within constraints upon
the utterance format, (3) retrieval of L2-items after waiting until the L1-item has
been checked by the monitor and all other L1-items become available. In our
model, the CC activation stage corresponds to multiple L1-item retrieval, constraint checks by V-Markers correspond to the monitoring, and the realization
stage which concatenates the surface string in a V-Marker corresponds to the
L2-item retrieval stage. The difference between our model and their model is
that, in our model, L2-items are already incorporated in G-Markers whereas
they assume L2-items are accessed only after monitoring. Phenomenologically,
this does not make a significant difference because L2-items (phonological realizations) in our model are not explicitly selected until constraints are met, at
which point the monitoring is completed. However, this difference may be more
explicit in the production of sentences because of the difference in the scheduling of the L2-item retrieval and the monitoring. This is due to the fact that
our model retains interaction between the two levels as investigated by [Bock, 1987].
Our model also explains contradictory observations by [Bock, 1982] and [Levelt
and Maassen, 1981] because activations of CC nodes (L1-items) and LEX nodes
(L2-items) are separated with some interactions. Also, our model is consistent
with a two-stage model [Garrett, 1975] [Garrett, 1980]. The functional and positional levels of processing in his model correspond to the parallel activation of
CCs and CSCs, the V-Marker movement which is left to right, and the surface
string concatenation during that movement.

Studies of the planning unit in sentence production [Ford and Holmes, 1978]
give additional support to the psychological plausibility of our model. They
report that the deep clause, instead of the surface clause, is the unit of sentence planning. This is consistent with our model, which employs CSCs, which account for
deep propositional units and the realization of deep clauses as the basic units
of sentence planning. They also report that people plan the next clause
while speaking the current clause. This is exactly what our model is performing, and is consistent with our observations from transcripts of simultaneous
interpretation.

4.13.3 Commitment and Ambiguities


We found that the development of an incremental disambiguation scheme,
which effectively eliminates implausible or impossible interpretations, is essential for a simultaneous interpretation system. As a solution to the delay in
translation, we have introduced the simultaneous interpretation model, so that
translation can start even before the entire sentence is parsed. The premise
of this scheme is that the input sentence can be disambiguated at a reasonable point in a long utterance. If an input utterance is totally ambiguous
until the very end of the sentence, translation cannot be started because its
meaning cannot be determined. This premise is a very severe one when
we consider the extra ambiguities added due to the multiple hypotheses that come
out of the speech processing module. Although the input sentence does not
need to be fully disambiguated to start translation (because some part of the
sentence can be translated without risking mistranslation regardless of how
the rest of the sentence is analyzed), full disambiguation at the earliest point of
the utterance is desired. This requirement can be met only if we employ an incremental scheme of disambiguation (as we have experimentally implemented
as a cost-based disambiguation scheme). Then, less probable or impossible
interpretations are pruned out immediately whenever they are detected. This
requires world knowledge and a strong top-down expectation of what can be
uttered next. Although predicting the next subdomain has proven to be highly
difficult, we have observed that more than 80% of utterances were involved in
interactions called request-inform interaction plans, such as How much is the
registration fee? and It is 100 dollars. While spoken dialog has a less complicated syntactic structure, high predictability of the type of utterance at the
pragmatic level would help disambiguation at an early stage of processing.
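Cost-based pruning of interpretation hypotheses can be sketched as follows. The hypotheses, cost values, and beam width are invented for illustration; in the model, costs accumulate from instance creation, constraint violations, and lexical bias terms.

```python
# Sketch of cost-based pruning of interpretation hypotheses
# (hypothetical costs; lower cost = more plausible interpretation).

def prune(hypotheses, beam):
    """Keep only hypotheses within `beam` of the cheapest interpretation,
    discarding less probable or impossible readings as soon as possible."""
    best = min(cost for _, cost in hypotheses)
    return [(h, c) for h, c in hypotheses if c <= best + beam]

hypotheses = [("attend-conference", 1.0),
              ("attend-concert", 4.5),      # misrecognition hypothesis
              ("a-tenth-conference", 9.0)]  # implausible parse
print(prune(hypotheses, beam=2.0))  # [('attend-conference', 1.0)]
```

Incremental disambiguation amounts to applying such a pruning step after each word, so that by a clause boundary only one hypothesis typically survives and generation can commit to it.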

Obviously, a modular architecture of pipe-lining the syntactic parser and the semantic ambiguity resolver is not an appropriate solution. The ambiguity
resolver needs to be embedded in the syntactic parsing stage so that feedback
from semantic and even discourse levels can be obtained immediately after an
ambiguity is detected. This is a very difficult task because (1) it undermines
the ease of tracing and debugging, and (2) ambiguity resolution itself is not fully
accomplished. However, one hope is that, as can be seen from the transcript,
the interpreter does not start translating unless she/he is sure about what the
sentence means so that if ambiguity can be resolved at the end of each clause,
delay in generation would be minimal. Another problem is how to decide to
which hypothesis to commit. If some ambiguities still remain, the generator
needs to commit to one of the hypotheses, which may turn out to be false.
Theories of commitment in ambiguity resolution and generation are not yet
established, and thus they are a subject of further investigation.

4.13.4 Learning
Learning is one of the areas which we have not addressed in the discussion so far.
It is a subject of ongoing effort in the project. Although we do not have
concrete algorithms to incorporate learning in our system, we describe some of
our motivations and basic ideas toward learning in our model.

There are several motivations for developing a learning scheme in our model:

Acquisition of Utterance Cases: Although our model requires extensive
knowledge of specific instances of utterances as cases, it is neither practical nor psychologically plausible to assume that all phrases appearing
in the generated sentence were pre-stored in memory. Thus, a learning
mechanism that can learn expressions from examples is necessary.
Avoiding Over-Generation: We would like to avoid the generation of ut-
terances which may be possible to generate with a given set of syntactic
and semantic knowledge, but which are never used by native speakers.
The task of a learning scheme is to generate utterance-specific linguistic
knowledge from the given utterances and to index them into the memory
network under the given context.
Avoiding Adopting Faulty Cases: Contrary to the problems which arise in
syntax-based generation, case-based generation runs the risk of generating
ungrammatical sentences because explicit syntactic knowledge is not incor-
porated. Hypotheses generated by combining utterance-cases and phrasal
lexicons may be syntactically unsound. We need a mechanism to moni-
tor such hypotheses using explicit syntactic knowledge, which will learn to
avoid forming such hypotheses in the future.
Efficiency: By acquiring specific utterance cases that are linked to semantic
representation, recalculation to form surface realization is no longer nec-
essary. By having a large body of utterance cases specific to each
situation, the efficiency of the generation process, as well as the parsing process,
would improve significantly.

Two types of learning mechanisms are considered in our model:

Learning by Parsing: Input utterances that are provided during the trans-
lation session are used for learning utterance cases. Since our system is a
bi-directional system, both Japanese and English are provided to the sys-
tem, and we can assume that speakers of each language are native. The
purpose of learning from such examples is to acquire cases of utterances
which are actually used by native speakers. By preferring use of cases
acquired by this process over other hypotheses, we can avoid generating
sentences that will never be used by native speakers, although they are
syntactically and semantically correct. The process of acquisition consists
of the following operations:
1. Generation of Utterance-Case: New utterance-cases are gener-
ated from input utterances by using syntactic and semantic knowledge
and generalized cases. Syntactic and semantic knowledge involved in
this process can serve as an explanation for the case and can be used
for generalization. This part of the process is an Explanation-Based
Learning scheme [Minton, 1988].
2. Indexing into the Memory Network: New utterance-cases are
indexed into the memory network so that they can be used in the
parsing and generation processes. They should be indexed under the
least abstract node which subsumes the new utterance-case. This is a
basic condition to maintain the consistency of the memory network.
Indexing of the utterance-cases is conducted by parallel subsumption
tests.
3. Generalization: A generalized utterance-case is created by generalizing specific cases. Domain knowledge is used to make generalizations. The generalized case is described in the form of a CSC, and
indexed into the memory network.

Learning by Generation: Hypotheses which are generated during the generation process can be sources of knowledge for learning. Existing knowledge of cases is used for hypothesizing a new sentence pattern. Syntactic
knowledge monitors this sentence to assure its grammaticality. Failures to
meet the grammaticality judgment are recorded to avoid hypothesization
of ungrammatical sentences. The basic idea is failure-driven learning.

1. Hypothesization: Using the cases of past utterances, sentences for
a given meaning are hypothesized.
2. Monitoring: Syntactic knowledge monitors these hypotheses, and
rejects those which violate syntactic constraints. Knowledge of such
failures is recorded in each case, to avoid hypothesization of syntac-
tically unsound sentences in the future.
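The condition that a new utterance-case be indexed under the least abstract node that subsumes it can be sketched as follows. The hierarchy fragment and function names are hypothetical; the actual model performs parallel subsumption tests on the memory network.

```python
# Sketch of indexing a new utterance-case under the least abstract
# subsuming node (hypothetical IS-A hierarchy fragment).

ISA = {"*request-registration": "*request", "*request": "*utterance"}

def subsumes(general, specific):
    """True if `general` is `specific` itself or one of its ancestors."""
    node = specific
    while node is not None:
        if node == general:
            return True
        node = ISA.get(node)
    return False

def depth(node):
    """Number of ancestors above `node`; deeper means less abstract."""
    d = 0
    while node in ISA:
        node = ISA[node]
        d += 1
    return d

def index_case(case_type, nodes):
    """Choose the most specific node that still subsumes the new case,
    which keeps the memory network consistent."""
    candidates = [n for n in nodes if subsumes(n, case_type)]
    return max(candidates, key=depth)

print(index_case("*request-registration", ["*utterance", "*request"]))
```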

Learning schemes in our model focus on the acquisition of specific patterns of
utterances and on learning to avoid ungrammatical sentences from phrasal combinations. This is counter to most previous work in language acquisition,
the focus of which has been placed on either acquisition of lexicon [Granger,
1977], syntax [Pinker, 1984], or language development in children [Selfridge,
1980]. The work closest to our model is [Zernik, 1987] which focused on the
acquisition of phrasal lexicons.

4.14 CONCLUSION
Although interpreting telephony, or a real-time speech-to-speech translation
system, has been considered one of the prime research goals in speech and
natural language processing, ours is perhaps the first proposal of a comprehensive
model of speech-to-speech dialogue translation. The ΦDMDIALOG system is one of the first speech-to-speech translation systems, and the first to demonstrate the possibility of simultaneous interpretation.

The model has several significant features. First, it is based on a massively parallel computational model which provides considerably higher performance
using parallel computers. Second, it integrates the parsing and generation pro-
cesses, which enables it to translate input utterances even before the whole
sentence is spoken to the system. The generation process is lexically guided
and uses both surface and semantic information carried by G-Markers and A-
Markers. Predictions and tracking of verbalization are made by V-Markers.
Third, discourse knowledge has been fully incorporated in the model. This
allows effective simultaneous translation and prediction of next possible utter-
ances. Fourth, a cost-based disambiguation scheme and a connectionist net-
work have been useful faculties in dynamically selecting one hypothesis among
multiple hypotheses.

Several experiments have been conducted using the ATR corpus of telephone
dialogs. We confirmed that the use of utterance cases and discourse knowledge
contributes towards reducing perplexity. However, at the same time, we found
that the effect of perplexity reduction by discourse knowledge in a larger do-
main is severely restricted due to the inherent unpredictability of subdomain
transitions in mixed-initiative dialogs. In order for our model to generate trans-
lated sentences simultaneously, resolution of ambiguity at the earliest possible
moment is desirable. Extra ambiguities caused by the addition of speech pro-
cessing pose serious problems which need to be resolved. Since limitations of
the usefulness of discourse knowledge in reducing perplexity have been found
in mixed-initiative domains, we need to conduct research on a better speech
processing module and methods of reducing search space without heavily de-
pending upon discourse knowledge.

Of course, our model is by no means complete, and we have a long list of future research issues. However, we believe that the importance of developing an actual prototype lies in the fact that we have actually faced these problems and identified what needs to be done next.

One of the most significant problems was performance. On serial machines, such as the IBM RT-PC, it took 3 to 20 seconds to translate from speech to speech even when the vocabulary was less than 100 words. With a vocabulary of 450 words, it took over a minute. The solution to this is to use actual massively parallel machines.
5
DMSNAP: AN IMPLEMENTATION
ON THE SNAP SEMANTIC
NETWORK ARRAY PROCESSOR

5.1 INTRODUCTION
This chapter describes the DMSNAP system, a version of the ΦDMDIALOG system implemented on the Semantic Network Array Processor (SNAP). The goal of our work is to develop a scalable and high-performance natural language processing system which utilizes the high degree of parallelism provided by the SNAP machine.

To accomplish high-performance natural language processing, we have designed a highly parallel machine called the Semantic Network Array Processor (SNAP) [Moldovan et al., 1990] [Lee and Moldovan, 1990], and implemented an experimental machine translation system called DMSNAP using a parallel marker-passing scheme. DMSNAP is a SNAP implementation of the ΦDMDIALOG speech-to-speech dialogue translation system [Kitano, 1989d] [Kitano, 1991a], but with some modifications to meet hardware constraints. Despite its high performance, our system carries out sound syntactic and semantic analysis covering lexical ambiguity, structural ambiguity, pronoun reference, control, unbounded dependency, and other phenomena.

In the next section, we briefly describe the SNAP architecture, then describe the design philosophy behind DMSNAP, followed by descriptions of the implementation and linguistic processing. Finally, performance results are presented.
116 CHAPTER 5

5.2 SNAP ARCHITECTURE


The Semantic Network Array Processor (SNAP) is a highly parallel array processor fully optimized for semantic network processing with a marker-passing mechanism. In order to facilitate efficient propagation of markers and to ease development of applications, a set of marker propagation instructions has been microcoded. SNAP supports propagation of markers containing (1) bit-vectors, (2) addresses, and (3) numeric values. By limiting the content of markers, a significant reduction in cost and resources has been attained without undermining performance requirements for knowledge processing. Several AI applications, such as a natural language processing system, a classification system [Kim and Moldovan, 1990], and a rule-based system, have been developed on SNAP.

5.2.1 The Architecture


SNAP consists of a processor array and an array controller (figure 5.1). The processor array has processing cells which contain the nodes and links of a semantic network. The SNAP array consists of 160 processing elements, each of which consists of a TMS320C30 DSP chip, local SRAM, etc. Each processing element stores 1024 nodes which act as virtual processors. They are interconnected via a modified hypercube network. The SNAP controller interfaces the SNAP array with a SUN 3/280 host and broadcasts instructions to control the operation of the array. The instructions for the array are distributed through a global bus by the controller. Propagation of markers and the execution of other instructions can be processed simultaneously.

5.2.2 Instruction Sets


A set of 30 high-level instructions specific to semantic network processing is implemented directly in hardware. These include associative search, marker setting and propagation, logical/arithmetic operations involving markers, creation and deletion of nodes and relations, and collection of a list of nodes with a certain marker set. Currently, the instruction set can be called from the C language so that users can develop applications with an extended version of C. At the programming level, SNAP provides a data-parallel programming environment similar to C* on the Connection Machine [Thinking Machines Corporation, 1989], but specialized for semantic network processing with marker passing.
DmSNAP: SNAP Implementation 117

[Figure omitted: it relates the hardware environment (host computer, SNAP-1 controller, and SNAP-1 array of 160 processors on eight 9U-size boards), the software environment (program development using the SNAP instruction set, compiled SNAP code, knowledge base, and SNAP instruction execution), and the physical design.]

Figure 5.1 SNAP Architecture



5.2.3 Propagation Rules


Several marker propagation rules are provided to govern the movement of markers. Marker propagation rules enable us to implement guided, or constrained, marker passing as well as unguided marker passing. This is done by specifying the types of links through which markers can propagate. All markers in DMSNAP are guided markers; thus they are controlled by propagation rules. The following are some of the propagation rules of SNAP:

• Seq(r1,r2): The Seq (sequence) propagation rule allows the marker to propagate through r1 once and then to r2.

• Spread(r1,r2): The Spread propagation rule allows the marker to travel through a chain of r1 links and then r2 links.

• Comb(r1,r2): The Comb (combine) propagation rule allows the marker to propagate to all r1 and r2 links without limitation.
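As an illustration, the semantics of the three rules can be mimicked in a few lines of Python over a toy graph. This is a sketch of the rule behavior only, not the SNAP microcode; the graph, node names, and function names are our own.

```python
def chain(graph, starts, link):
    """Transitive closure along one link type (one marker wave)."""
    seen, frontier = set(starts), set(starts)
    while frontier:
        frontier = {m for n in frontier for m in graph.get((n, link), [])} - seen
        seen |= frontier
    return seen

def seq(graph, start, r1, r2):
    """Seq(r1,r2): one hop through r1, then one hop through r2."""
    step1 = set(graph.get((start, r1), []))
    return step1 | {m for n in step1 for m in graph.get((n, r2), [])}

def spread(graph, start, r1, r2):
    """Spread(r1,r2): a chain of r1 links, then a chain of r2 links."""
    along_r1 = chain(graph, {start}, r1)
    return (along_r1 | chain(graph, along_r1, r2)) - {start}

def comb(graph, start, r1, r2):
    """Comb(r1,r2): closure over all r1 and r2 links without limitation."""
    seen, frontier = {start}, {start}
    while frontier:
        frontier = {m for n in frontier for r in (r1, r2)
                    for m in graph.get((n, r), [])} - seen
        seen |= frontier
    return seen - {start}

# Toy semantic network: (node, link-type) -> target nodes.
graph = {("jack", "isa"): ["man"], ("man", "isa"): ["person"],
         ("person", "isa"): ["entity"], ("man", "role"): ["actor"]}
```

Under this graph, `seq` reaches only `man` and `actor`, while `spread` and `comb` also reach the rest of the ISA chain, which is exactly the difference between one-hop and chained propagation.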

5.2.4 Knowledge Representation on SNAP


SNAP provides four knowledge representation elements: node, link, node color, and link value. These elements allow a wide range of knowledge representation schemes to be mapped onto SNAP. On SNAP, a concept is represented by a node. A relation can be represented either by a node called a relation node or by a link between two nodes. The node color indicates the type of node. For example, when representing USC is in Los Angeles and CMU is in Pittsburgh, we may assign a relation node for in. The IN node is shared by the two facts. In order to prevent wrong interpretations such as USC in Pittsburgh and CMU in Los Angeles, we assign IN#1 and IN#2 to two distinct IN relations, and group the two relation nodes by a node color IN. Each link has assigned to it a link value which indicates the strength of the relation between concepts. This link value supports probabilistic reasoning and connectionist-like processing. These four basic elements allow SNAP to support virtually any kind of graph-based knowledge representation formalism, such as KL-ONE [Brachman and Schmolze, 1985], Conceptual Graphs [Sowa, 1984], KODIAK [Wilensky, 1987], etc.
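The four elements can be sketched for the USC/CMU example as follows. The class and field names are our own; only the scheme (distinct IN#1/IN#2 relation nodes grouped by the node color IN, plus a link value per link) follows the text.

```python
class Node:
    def __init__(self, name, color):
        self.name, self.color = name, color
        self.links = []                     # outgoing (relation, target, value)

    def link(self, relation, target, value=1.0):
        # The link value indicates the strength of the inter-concept relation.
        self.links.append((relation, target, value))

usc, cmu = Node("USC", "CONCEPT"), Node("CMU", "CONCEPT")
la, pgh = Node("Los-Angeles", "CONCEPT"), Node("Pittsburgh", "CONCEPT")

# Two distinct relation nodes grouped by the node color IN, so the two
# facts cannot be crossed into "USC in Pittsburgh".
in1, in2 = Node("IN#1", "IN"), Node("IN#2", "IN")
usc.link("arg1-of", in1); in1.link("arg2", la)
cmu.link("arg1-of", in2); in2.link("arg2", pgh)

def located_in(city_node):
    """Follow arg1-of to an IN-colored relation node, then its arg2 link."""
    for rel, r, _ in city_node.links:
        if rel == "arg1-of" and r.color == "IN":
            for rel2, target, _ in r.links:
                if rel2 == "arg2":
                    return target.name
```

Because each fact has its own relation node, the lookup for USC can only reach Los Angeles, and the lookup for CMU only Pittsburgh.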

5.3 PHILOSOPHY BEHIND DMSNAP


DMSNAP is a SNAP implementation of the ΦDMDIALOG speech-to-speech dialogue translation system. Naturally, it inherits the basic ideas and mechanisms of the ΦDMDIALOG system, such as the memory-based approach to natural language processing and parallel marker-passing. A syntactic constraint network is introduced in DMSNAP, whereas ΦDMDIALOG assumed unification operations to handle linguistic processing.

5.3.1 Memory-Based Natural Language Processing

Memory-based NLP is the idea of viewing NLP as a memory activity. For example, parsing is considered a memory-search process which identifies similar past cases in the memory and provides an interpretation based on the identified cases. It can be considered an application of Memory-Based Reasoning (MBR) [Stanfill and Waltz, 1988] and Case-Based Reasoning (CBR) [Riesbeck and Schank, 1989] to NLP. This view, however, counters the traditional idea of viewing NLP as an extensive rule application process which builds up a meaning representation. Some models have been proposed in this direction, such as Direct Memory Access Parsing (DMAP) [Riesbeck and Martin, 1985] and ΦDMDIALOG [Kitano, 1989d]. We consider the memory-based approach superior to the traditional approach. For arguments concerning the superiority of the memory-based approach over the traditional approach, see [Nagao, 1984], [Riesbeck and Martin, 1985], and [Sumita and Iida, 1991].

5.3.2 Parallel Marker-Passing

Another feature inherited from ΦDMDIALOG is the use of parallel marker-passing. In DMSNAP, however, a different approach has been taken with regard to the content of the markers that propagate through the network. Since ΦDMDIALOG was designed and implemented on an idealized simulation of massively parallel machines, its markers carry feature structures (or graphs) along with other information such as probabilistic measures, and unification or similarly heavy symbolic operations have been assumed at each processing element (PE). In DMSNAP, the content of a marker is restricted to (1) bit markers, (2) address markers, and (3) values.¹ Propagation of feature structures and heavy symbolic operations at each PE, as seen in the original version of ΦDMDIALOG, are not practical assumptions to make, at least on current massively parallel machines, due to limited processor power and memory capacity on each PE and the communication bottleneck. Propagation of feature structures would impose serious hardware design problems since the size of the message is unbounded (unbounded message passing). Also, PEs capable of performing unification would be large in physical size, which causes assembly problems when thousands of processors are to be assembled into one machine. Even on machines which overcome these problems, applications with a restricted message passing model would run much faster than applications with an unbounded message passing model. Thus, in DMSNAP, the information propagated is restricted to bit markers, address markers, and values. These are supported directly by the SNAP hardware.

5.3.3 Syntactic Constraint Network

The syntactic constraint network (SCN) is a new feature which has not been used in previous work in memory-based NLP. The SCN is used to handle syntactic phenomena without undermining the benefits of the memory-based approach. Although unification has been the central operation in recent syntactic theories such as LFG [Kaplan and Bresnan, 1982] and HPSG [Pollard and Sag, 1987], we prefer the SCN over a unification-based approach because unification is computationally expensive and not suitable for massively parallel implementation. Although there is a report on a unification algorithm for massively parallel machines [Kitano, 1991b], it is still computationally expensive and takes up a major part of the computing time even on SNAP. In addition, there is a report that unification is not necessarily the correct mechanism for enforcing agreement [Ingria, 1990]. Also, the classification-based approach [Kasper, 1989], which pre-compiles a hierarchy of feature structures in the form of a semantic network, can carry out similar tasks at less computational cost [Kim and Moldovan, 1990]. Finally, current unification hard-rejects failure, which is not desirable from our point of view. We want the system to be robust enough that, while recognizing a minor syntactic violation, it keeps processing to get the meaning of the sentence.

In the syntactic constraint network model, all syntactic constraints are represented in a finite-state network consisting of (1) nodes representing specific syntactic constraints (such as 3SGPRES), (2) nodes representing grammatical functions (such as SUBJ, OBJ, and OBJ2 for functional control), and (3) syntactic constraint links which control state transitions and the passing of information among them. Although unification has been used to (1) enforce formal agreement, (2) percolate features, and (3) build up feature structures, we argue that these functions are attained by independent mechanisms in our model. Formal agreement is enforced by activation and inhibition of nodes through active syntactic constraints. Percolation of features is attained by passing addresses through the memory and syntactic constraint networks. It should be noted that not all features now being carried by unification grammars need to be carried around in order to make an interpretation of sentences. Our model propagates only the necessary information to the relevant nodes. Finally, instead of building up features, we represent the meaning of the sentence distributively. When parsing is complete, we have a set of new nodes, each of which represents an instance of a concept, with links defining the relations among them.

¹We call the type of marker-passing which propagates feature structures (or graphs) Unbounded Message Passing. The type of marker-passing which passes fixed-length packets, as seen in DMSNAP, is Finite Message Passing. This classification is derived from [Blelloch, 1986]. Under the classification in [Blelloch, 1986], our model is close to the Activity Flow Network.

We are currently investigating whether our model is consistent with human language processing, which has a limited memory capacity [Gibson, 1990].

5.4 IMPLEMENTATION OF DMSNAP

DMSNAP consists of the memory network, the syntactic constraint network, and markers to carry out inference. The memory network and the syntactic constraint network are compiled from a set of grammar rules written for DMSNAP. This section describes these components and the basic parsing algorithm to give a brief view of the implementation of DMSNAP.

5.4.1 Memory Network on SNAP

The major types of knowledge required for language translation in DMSNAP are: a lexicon, a concept type hierarchy, concept sequences, and syntactic constraints. Among them, the syntactic constraints are represented in the syntactic constraint network, and the rest of the knowledge is represented in the memory network. The memory network consists of various types of nodes, such as concept sequence classes (CSC), lexical item nodes (LEX), concept nodes (CC), and others. Nodes are connected by a number of different links, such as concept abstraction links (ISA), expression links for both the source and target languages (ENG and JPN), role links (ROLE), constraint links (CONSTRAINT), contextual links (CONTEXT), and others.

Figure 5.2 Concept Sequence on SNAP

A CSC captures the ordering constraints of natural language, and it roughly corresponds to a phrase structure rule. CSCs can be used to represent the syntax and semantics of sentences at different levels of abstraction, from instances of surface sequences to linguistically motivated grammars such as Lexical-Functional Grammar (LFG) [Kaplan and Bresnan, 1982]. As shown in figure 5.2, a CSC consists of a root node (CSR), element nodes (CSE), a FIRST link, a LAST link, NEXT link(s), and ROLE links. A CSR is a representative node for the meaning of the entire CSC structure. CSRs are connected to their designated interlingual concepts by ENG or JPN links. Each CSC has one or more CSEs linked to the CSR by ROLE links. The ordering constraint between two concept sequence element nodes is represented by a NEXT link. The FIRST and LAST links in each CSC point to the first and last elements, respectively. Also, each CSE represents the relevant case role, and the case role has a selectional restriction. Since we want to avoid heavy symbolic operations during parsing, ROLE links and associated constraint links are used instead of performing type and value consistency checks by unification. Therefore each CSE is used both for enforcing the ordering constraint and for capturing semantic information.
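The CSC structure described above can be rendered as a minimal data structure: a root carrying FIRST/LAST links and elements chained by NEXT, each holding a case role and a selectional restriction. The concrete sequence below is a hypothetical example, not a rule taken from the system.

```python
class CSE:
    """Concept sequence element: a case role with a selectional restriction."""
    def __init__(self, role, restriction):
        self.role = role                 # e.g. ACTOR
        self.restriction = restriction   # selectional restriction, e.g. C-PERSON
        self.next = None                 # NEXT link to the following element

class CSC:
    """Concept sequence class: CSR plus CSEs with FIRST/LAST/NEXT links."""
    def __init__(self, concept, elements):
        self.concept = concept           # the CSR's interlingual concept
        self.first = elements[0]         # FIRST link
        self.last = elements[-1]         # LAST link
        for a, b in zip(elements, elements[1:]):
            a.next = b                   # NEXT links encode the ordering

# Hypothetical sequence roughly in the spirit of WANT-CIRCUM:
want = CSC("WANT-CIRCUM",
           [CSE("ACTOR", "C-PERSON"), CSE("WANT", "C-WANT"),
            CSE("CIRCUM", "C-CIRCUMSTANCE")])

# Walking NEXT links from FIRST reaches LAST, mirroring left-to-right parsing.
roles, e = [], want.first
while e:
    roles.append(e.role)
    e = e.next
```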

In addition, concept instance nodes (CI) and concept sequence instance structures (CSI) are dynamically created during parsing. Each CI or CSI is connected to the associated CC or CSC by an INST link. CIs correspond to the discourse entities proposed in [Webber, 1983]. Three additional links are used to facilitate pragmatic inferences: CONTEXT links, CONSTRAINT links, and EQROLE links. A CONTEXT link is a path of contextual priming, which is crucial in word sense disambiguation. When a word is activated during processing, the activation spreads through CONTEXT links and imposes contextual priming on relevant concepts. A CONSTRAINT link denotes an antecedent/consequence relationship between two events or states, and is created between two CSRs. An EQROLE link denotes the necessary argument matching condition for testing an antecedent/consequence relationship, and is created between two CSEs in different CSCs.

5.4.2 Syntactic Constraint Network

DMSNAP has a syntactic constraint network (SCN) which captures various syntactic constraints such as agreement, control, etc. The syntactic constraint network consists of syntactic constraint nodes (SC nodes), syntactic function nodes (SF nodes), and syntactic constraint links (SC-links). SC nodes represent syntactic constraints such as 3rd Singular Present (3SgPres) and Reflexive Pronoun (Ref). These nodes simply receive bit markers to indicate whether these syntactic constraints are active or not, and send bit markers to show which lexical items are legal candidates for the next word. SF nodes represent grammatical functions such as functional controllers. SF nodes generally receive an address marker and a bit marker. The address marker carries a pointer to the CI node which should be bound to a case-frame in a dislocated place in the network, and the bit marker shows whether the specific grammatical function node should be activated. When both a bit marker and an address marker exist at a certain SF node, the address marker is further propagated through SC-links to send the information necessary to carry out the interpretation of sentences involving control and unbounded dependency.

5.4.3 Markers

The processing of natural language on a marker-propagation architecture requires the creation and movement of markers on the memory network. The following types of markers are used:

A-MARKERS indicate activation of nodes. They propagate upward through ISA links, and carry a pointer to the source of activation and a cost measure.

P-MARKERS indicate the next possible nodes to be activated. They are initially placed on the first element nodes of the CSCs, and move through NEXT links, where they collide with A-MARKERS at the element nodes.

G-MARKERS indicate activation of nodes in the target language. They carry pointers to the lexical nodes to be lexicalized, and propagate upward through ISA links.

V-MARKERS indicate the current state of verbalization. When a V-MARKER collides with a G-MARKER, the surface string (which is specified by the pointer in the G-MARKER) is verbalized.

C-MARKERS indicate contextual priming. Nodes with C-MARKERS are contextually primed. A C-MARKER moves from the designated contextual root node to other contextually relevant nodes through contextual links.

SC-MARKERS indicate active syntactic constraints, and nodes primed and/or inhibited by currently active syntactic constraints. They also carry pointers to specific nodes.

There are some other markers used for process control and timing; they are not described here. The markers above are sufficient to understand the central part of the algorithm described in this chapter.
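The point of restricting marker content (Section 5.3.2) shows up directly in this inventory: each marker is at most a few bits, an address, and a number. A sketch of the marker records, with field names of our own choosing:

```python
from dataclasses import dataclass

@dataclass
class AMarker:
    """Activation marker: propagates up ISA links."""
    source: int     # address of the source-of-activation node
    cost: float     # accumulated cost measure

@dataclass
class GMarker:
    """Target-language activation marker."""
    lex: int        # address of the lexical node to be lexicalized

@dataclass
class VMarker:
    """Tracks the current state of verbalization."""
    position: int   # address of the CSE currently being verbalized

# P-, C-, and SC-markers can be represented as plain bits on a node:
P_BIT, C_BIT, SC_BIT = 0b001, 0b010, 0b100
flags = P_BIT | C_BIT   # a node that is predicted and contextually primed
```

Every record here fits in a fixed-length packet, which is exactly what makes this finite message passing rather than unbounded message passing.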

5.4.4 DMSNAP Parsing Algorithm

The overall flow of the algorithm implemented on SNAP consists of the following steps:

1. Activate a lexical node.
2. Pass an A-Marker through ISA links; pass SC-Markers through SC-links.
3. When the A-Marker collides with a P-Marker on a CSE, the P-Marker is passed through the NEXT link once. However, if the CSE was the last element of the CSC, then the CSC is accepted and an A-Marker is passed up through the ISA link from the CSR of the accepted CSC.
4. When the P-Marker is passed through the NEXT link in step 3, a copy of the P-Marker is passed down through inverse ISA links to make top-down predictions.
5. Pass SC-Markers from active nodes to activate and/or inhibit syntactic constraints, and to percolate pointers to the specific CI.
6. Repeat 1 through 5 until all words are read.
7. Compute the total cost for each hypothesis.
8. Select the lowest-cost hypothesis.
9. Remove the other hypotheses.

This parsing algorithm is similar to a shift-reduce parser except that our algorithm handles ambiguities, processes each hypothesis in parallel, and makes top-down predictions of the possible next input symbols. The generation algorithm implemented on SNAP is a version of the lexically guided bottom-up algorithm described in Chapter 3.
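The A-/P-marker collision at the heart of steps 1-3 can be sketched serially for a single CSC. This is a didactic reduction, not the SNAP implementation: abstraction up ISA links, multiple hypotheses, and the SC-marker machinery are omitted, and the lexicon and sequence are hypothetical.

```python
def parse(csc, words, lexicon):
    """csc: list of (role, concept) pairs; lexicon: word -> concept.
    Returns (accepted, role bindings)."""
    bindings, pos = {}, 0                 # pos: CSE where the P-marker waits
    for word in words:
        concept = lexicon[word]           # step 1-2: A-marker on this concept
        role, expected = csc[pos]
        if concept != expected:           # no collision: prediction fails
            return False, bindings
        bindings[role] = word             # bind role to the activation source
        pos += 1                          # step 3: P-marker moves down NEXT
    return pos == len(csc), bindings      # accepted once the LAST CSE is hit

csc = [("ACTOR", "C-PERSON"), ("ACTION", "C-ATTEND"),
       ("EVENT", "C-CONFERENCE")]
lexicon = {"John": "C-PERSON", "attends": "C-ATTEND",
           "IJCAI-91": "C-CONFERENCE"}
ok, bindings = parse(csc, ["John", "attends", "IJCAI-91"], lexicon)
```

When the final element is consumed the CSC is "reduced", which corresponds to passing an A-marker up from the CSR in step 3.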

5.5 LINGUISTIC PROCESSING IN DMSNAP

We will explain how DMSNAP carries out linguistic analysis using two sets of examples:

Example I
s1 John wanted to attend IJCAI-91.
s2 He is at the conference.
s3 He said that the quality of the paper is superb.

Example II
s4 Dan planned to develop a parallel processing computer.
s5 Eric built a SNAP simulator.
s6 Juntae found bugs in the simulator.
s7 Dan tried to persuade Eric to help Juntae modify the simulator.
s8 Juntae solved a problem with the simulator.
s9 It was the bug that Juntae mentioned.

The examples contain various linguistic phenomena such as: lexical ambiguity, structural ambiguity, referencing (pronoun reference, definite noun reference, etc.), control, and unbounded dependencies. It should be noted that each example consists of a set of sentences (not a single sentence isolated from context) in order to demonstrate the contextual processing capability of DMSNAP.

The sentences in these examples are not the only sentences which DMSNAP can handle. Currently, DMSNAP handles a substantial portion of the ATR conference registration domain (a vocabulary of 450 words, 329 sentences) and sentences from other corpora.

[Figure omitted: a fragment of the memory network for sentence s1, showing concept nodes (e.g., WANT-CIRCUM, ATTEND-CONF, C-IJCAI-91), instance nodes (e.g., ATTEND-CONF#5, IJCAI-91#2), and the NEXT, ROLE, ISA, and Instance-of links among them.]

Figure 5.3 Part of Memory Network

5.5.1 Basic Parsing and Generation - Translation

The essence of the DMSNAP parsing and generation algorithm is described using sentence s1. The part of the memory network involved in this explanation is shown in figure 5.3. C- denotes concepts, and "..." denotes a surface string in a lexical node. Notice that only a part of the memory network is shown, and no part of the syntactic constraint network is shown here. Also, the following explanation does not describe the activity of the syntactic constraint network; this will be described in a relevant part later.

Initially, the first CSE in every CSC on the memory network gets a P-MARKER. This P-MARKER is passed down ISA links. The CCs receiving a P-MARKER are C-PERSON and C-ATTEND. Also, the closed-class lexical items (CCI) in the target language propagate G-MARKERS upward through ISA links.

Upon processing the first word 'John' in sentence s1, C-JOHN is activated, so C-JOHN gets an A-MARKER and a CI JOHN#1 is created under C-JOHN. At this point, the corresponding Japanese lexical item is searched for, and JON is found. A G-MARKER is created on JON. The A-MARKER and G-MARKER propagate up through ISA links (activating C-MALE-PERSON and C-PERSON in sequence) and then ROLE links. When an A-MARKER collides with a P-MARKER at a CSE, the associated case role is bound to the source of the A-MARKER, and the prediction is updated by passing the P-MARKER to the next CSE. This P-MARKER is passed down ISA links. In this memory network, the ACTOR role of the concept sequence WANT-CIRCUM-E is bound to JOHN#1, pointed to by the A-MARKER. This is made possible by the SNAP architecture, which allows markers to carry addresses as well as bit-vectors and values, whereas many other marker-passing machines such as NETL [Fahlman, 1979] and IXM2 [Higuchi et al., 1991] only allow bit-vectors to be passed around. Also, a G-MARKER is placed on the ACTOR role CSE of WANT-CIRCUM-J. The G-MARKER points to the Japanese lexical item jon.

After processing 'wanted' and 'to', a P-MARKER is passed to CIRCUM and then to ATTEND-CONF. At this point, a source language (English) expression for the concept ATTEND-CONF is searched for, and ATTEND-CONF-E is found. The first CSE of ATTEND-CONF-E gets a P-MARKER. After processing 'attend' and 'IJCAI-91', ATTEND-CONF-E becomes fully recognized,² so a CSI with its CIs is created under ATTEND-CONF-E. Then the associated concept ATTEND-CONF is activated. An A-MARKER is passed up from ATTEND-CONF to the last element of the CSC WANT-CIRCUM-E. As a result, the CSR WANT-CIRCUM-E and its CC WANT-CIRCUM are activated in sequence. Therefore the parsing result is represented by the activated CC WANT-CIRCUM and the associated CSI. Also, upon processing 'IJCAI-91', the concept C-CONFERENCE is activated and C-MARKERS are passed to the nodes connected to C-CONFERENCE by CONTEXT links. This is the operation for contextual priming.

²Fully recognized means that the CSC can be reduced, in shift-reduce parser terms.

When the parsing is done, a V-MARKER is passed from WANT-CIRCUM to the target language (Japanese) expression WANT-CIRCUM-J, and then to the first CSE of WANT-CIRCUM-J. Since the first CSE has a G-MARKER pointing to JON, jon becomes the first word in the translated Japanese sentence, and then the V-MARKER is passed to the next CSE. This operation is repeated for all CSEs in the CSC. Finally, the Japanese sentence t1 is constructed for the English sentence s1.

With this algorithm, the first set of sentences (s1, s2, and s3) is translated into Japanese:

t1 Jon ha ichikai-91 ni sanka shitakatta.
t2 Kare ha kaigi ni iru.
t3 Kare ha ronbun no shitsu ga subarashii to itta.
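The V-/G-marker interaction that produces t1 can be sketched as a walk over the target-language CSC: G-markers already point at Japanese lexical items, and the V-marker visits each CSE in order, emitting the pointed-to string. The word list is a simplified stand-in for WANT-CIRCUM-J, not the actual network.

```python
def verbalize(csc_j):
    """csc_j: ordered CSEs, each holding the target-language string its
    G-marker points to. The V-marker is the loop position."""
    out = []
    for cse in csc_j:                  # V-marker moves from CSE to CSE
        out.append(cse["g_marker"])    # emit string pointed to by the G-marker
    return " ".join(out)

# G-markers placed while parsing s1 (content words such as 'jon'), plus
# closed-class items ('ha', 'ni') contributed by the CCI nodes:
want_circum_j = [{"g_marker": w} for w in
                 ["jon", "ha", "ichikai-91", "ni", "sanka", "shitakatta"]]
```

Calling `verbalize(want_circum_j)` reproduces the word order of t1, which is the point: generation is a traversal of an already-instantiated structure, not a separate rule application pass.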

5.5.2 Anaphora

Anaphoric reference is resolved by searching for a discourse entity, represented by a CI, under a specific type of concept node. Sentence s2 contains anaphora problems due to 'He' and 'the conference'. When processing 'He', DMSNAP searches for any CIs under the concept C-MALE-PERSON and its subclass concepts such as C-JOHN. In the current discourse, JOHN#1 is found under C-JOHN. JOHN#1 and IJCAI-91#2 were created when s1 was parsed. An A-MARKER pointing to JOHN#1 propagates up through ISA links. Likewise, IJCAI-91#2 is found for C-CONFERENCE. In this sentence, there is only one discourse entity (a CI in our model) as a candidate for each anaphoric reference, so a simple instance search over the type hierarchy network suffices. However, when there are multiple candidates, we use centering theory, introducing the forward-looking center (Cf), the backward-looking center (Cb), etc. [Brennan et al., 1986]. Also, incorporating the notion of focus is straightforward [Sidner, 1979].
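The instance search just described amounts to collecting CIs whose concept lies under the queried node in the abstraction hierarchy. The hierarchy below is a toy reconstruction of the s1/s2 discourse state; only the search logic follows the text.

```python
# ISA links (child -> parent) and the CIs created while parsing s1:
isa = {"C-JOHN": "C-MALE-PERSON", "C-MALE-PERSON": "C-PERSON",
       "C-IJCAI-91": "C-CONFERENCE"}
instances = {"C-JOHN": ["JOHN#1"], "C-IJCAI-91": ["IJCAI-91#2"]}

def resolve(concept):
    """Return all CIs under `concept`: instances of it or of a subclass."""
    found = []
    for c, cis in instances.items():
        node = c
        while node is not None:          # climb the ISA chain
            if node == concept:
                found.extend(cis)
                break
            node = isa.get(node)
    return found
```

Resolving 'He' is then `resolve("C-MALE-PERSON")`, and 'the conference' is `resolve("C-CONFERENCE")`; when such a call returns more than one CI, the centering machinery mentioned above would pick among them.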

5.5.3 Lexical Ambiguity

DMSNAP is capable of resolving lexical ambiguity through the use of contextual priming with the contextual marker (C-Marker) [Tomabechi, 1987] and cost-based disambiguation [Kitano et al., 1989a]. Sentence s3 contains a word sense ambiguity in the interpretation of the word 'paper' as either a technical document or a sheet of paper. Upon reading 'paper', C-THESIS and C-PAPER are activated. At this time, C-THESIS has a C-MARKER. The C-MARKER comes from the activation of C-IJCAI-91 and C-CONFERENCE in previous sentences, which have contextual links connecting concepts relevant to academic conferences, such as C-THESIS. The meaning hypothesis containing C-THESIS costs less than the one with C-PAPER, so it is selected as the best hypothesis.
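The interplay of C-marker priming and cost can be sketched as follows. The context links and cost values here are illustrative inventions, not the system's actual weights; only the mechanism (primed senses cost less, cheapest hypothesis wins) follows the text.

```python
# Hypothetical CONTEXT links out of the contextual root node:
context_links = {"C-CONFERENCE": ["C-THESIS", "C-TALK"]}
BASE_COST, PRIMING_DISCOUNT = 10, 5      # illustrative values only

def cheapest_sense(senses, active_contexts):
    # Place C-markers on everything reachable via CONTEXT links:
    primed = {c for ctx in active_contexts
              for c in context_links.get(ctx, [])}
    costs = {s: BASE_COST - (PRIMING_DISCOUNT if s in primed else 0)
             for s in senses}
    return min(costs, key=costs.get)     # lowest-cost hypothesis wins

# 'paper' is ambiguous between C-THESIS and C-PAPER-SHEET; C-CONFERENCE
# is active from the earlier 'IJCAI-91':
best = cheapest_sense(["C-THESIS", "C-PAPER-SHEET"], ["C-CONFERENCE"])
```

With C-CONFERENCE active, C-THESIS carries a C-marker and wins; with no active context, neither sense would be discounted and other cost factors would have to decide.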

5.5.4 Control

Control is handled using the syntactic constraint network. Sentence s7 is an example of a sentence involving functional control [Bresnan, 1982]. In s7, both subject control and object control exist: the subject of 'persuade' should be the subject of 'tried' (subject control), and the subject of 'help' should be the object of 'persuade' (object control). In this case, the CSCs for infinitival complements have a CSE without a NEXT link. Such a CSE represents the missing subject. There are SUBJ, OBJ, and OBJ2 nodes (the functional controllers) in the syntactic constraint network, each of which stores a pointer to the CI node for the possible controllee. Syntactic constraint links from each lexical item of the verb determine which functional controller is active. The activated functional controller propagates a pointer to the CI node to the unbound subject nodes of the CSCs for infinitival complements. Basically, one set of functional controller nodes handles deeply nested cases, due to functional locality.

To take an example from s7: when processing 'Dan', a pointer to an instance of 'Dan', which is C-DAN#1, is passed to the SUBJ node of the functional controllers. Then, when processing 'tried', an SC-Marker propagates from the lexical node of 'tried' to SUBJ through an SC-link, and the SUBJ node is activated. The pointer to C-DAN#1 in the SUBJ node then propagates to the SUBJ role node (or ACTOR node) of the CSC for the infinitival complement. After processing 'to', the CSCs for infinitival complements are predicted. Temporary bindings take place in each predicted CSC. When processing 'persuade', however, OBJ gets activated, since 'persuade' enforces object control, not subject control. Thus, after 'Eric' is processed, a pointer to an instance of 'Eric' propagates into the already active OBJ node, and then propagates to the SUBJ role node (or ACTOR role node) of each CSC for infinitival complements. This is how DMSNAP performs control.

5.5.5 Structural Ambiguity

Structural ambiguity is resolved by the cost-based ambiguity resolution method [Kitano et al., 1989a]. Cost-based ambiguity resolution takes into account various psycholinguistic studies such as [Crain and Steedman, 1985] and [Ford, Bresnan and Kaplan, 1981]. Sentence s8 contains a structural ambiguity in PP-attachment. It can be interpreted as either:

• [S Juntae [VP solved [NP the problem] [PP with [NP the simulator]]]]

• [S Juntae [VP solved [NP the problem [PP with [NP the simulator]]]]]

In this case, two hypotheses are active at the end of the parse. Then, DMSNAP computes the cost of each hypothesis. The factors involved are contextual priming, lexical preference, the existence of discourse entities, and consistency with world knowledge. In this example, consistency with world knowledge plays the central role. The world knowledge is a set of common-sense knowledge and knowledge obtained from understanding previous sentences. To resolve the ambiguity in this example, DMSNAP checks whether there is a problem in the simulator. Constraint checks are performed by bit-marker propagation through CONSTRAINT links and EQROLE links. Since there is a CI which packages instances of ERROR and SNAP-SIMULATOR, the constraint is satisfied and the second interpretation incurs no cost from the constraint check. However, there is no CI which packages instances of JUNTAE and SNAP-SIMULATOR. Therefore the first interpretation incurs a constraint violation cost (15 in our current implementation). Thus DMSNAP resolves the structural ambiguity in favor of the second interpretation.
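The constraint-check arithmetic for s8 can be sketched as follows. The `world` set is a toy stand-in for the CIs built from sentence s6; the violation cost of 15 is the value quoted above, while the pair encoding is our own simplification of the CONSTRAINT/EQROLE check.

```python
# CIs packaging instance pairs, built while understanding earlier sentences
# (s6 established that there are bugs, i.e. errors, in the simulator):
world = {("ERROR", "SNAP-SIMULATOR")}
VIOLATION_COST = 15   # constraint violation cost from the text

def hypothesis_cost(required_pairs):
    """Sum a violation cost for every instance pair the reading requires
    to be packaged in some CI but which no CI supports."""
    return sum(VIOLATION_COST for p in required_pairs if p not in world)

# Reading 1: 'with the simulator' modifies 'solved' -> Juntae uses the simulator.
cost_vp_attach = hypothesis_cost([("JUNTAE", "SNAP-SIMULATOR")])
# Reading 2: 'with the simulator' modifies 'the problem' -> a fault in it.
cost_np_attach = hypothesis_cost([("ERROR", "SNAP-SIMULATOR")])
```

The NP-attachment reading incurs zero constraint cost and the VP-attachment reading incurs 15, so the lowest-cost selection in step 8 of the parsing algorithm picks the second interpretation.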

5.5.6 Unbounded Dependency


There are two ways to handle sentences with unbounded dependency. The first
approach is straightforward memory-based approach which simply store a set
of CSCs involves unbounded dependency. A large set of CSCs would have to
be prepared for this, but its simplicity minimized computational requirements.
Alternatively, we can employ somewhat linguistic treatment of this phenomena
within our framework. The syntactic constraint network has a node represent-
ing TOPIC and FOCUS which usually bound to the displaced phrase. An
DmSNAP: SNAP Implementation 131

address of CI for the displaced phrase (such as 'the bug' in the example 59) is
propagated to the TOPIC or FOCUS nodes in the syntactic constraint net-
work. Further propagation of the address of the CI is controlled by the activation
of nodes along the syntactic constraint network. The network virtually encodes
a finite-state transition equivalent to {COMP|XCOMP}*GF-COMP [Kaplan
and Zaenen, 1989], where GF-COMP denotes grammatical functions other than
COMP. The address of the CI bound to TOPIC or FOCUS can propagate
through the path based on the activation patterns of the syntactic constraint
network, and the activation patterns are essentially controlled by marker flow
from the memory network. When the CSC is accepted and there is a case-role
not bound to any CI (OBJECT in the example), the CSE for that case-role is
bound to the CI propagated from the syntactic constraint network.
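The finite-state path constraint above can be illustrated with a small regular-expression check. This is a sketch under assumptions: the dot-separated path encoding and the function names are invented for illustration, and only the {COMP|XCOMP}*GF-COMP pattern comes from the text.

```python
import re

# Sketch of the path constraint {COMP|XCOMP}* GF-COMP: a displaced
# phrase's CI address may travel through any number of COMP/XCOMP links
# and must end at a grammatical function other than COMP.
PATH = re.compile(r"(?:(?:COMP|XCOMP)\.)*(?!COMP\b)\w+$")

def path_licensed(path):
    """`path` is a dot-separated chain of grammatical-function links."""
    return bool(PATH.match(path))

print(path_licensed("COMP.COMP.OBJECT"))  # True
print(path_licensed("COMP.COMP"))         # False: ends in COMP
```

In the actual system this regular language is not matched against strings; it is encoded in the activation patterns of the syntactic constraint network itself, and the CI address simply follows the active links.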

5.6 PERFORMANCE
DMSNAP completes parsing on the order of milliseconds. While the actual SNAP
hardware is now being assembled and is expected to be fully operational by May 1991, this
section provides performance estimates based on a precise simulation of the SNAP
machine. Simulations of the DMSNAP algorithm have been performed on a
SUN 3/280 using the SNAP simulator which has been developed at USC [Lin
and Moldovan, 1990]. The simulator is implemented in both SUN Common
LISP and C, and simulates the SNAP machine at the processor level. The
LISP version of the simulator also provides information about the number of
SNAP clock cycles required to perform the simulation.

There are two versions of DMSNAP, one written in LISP and one in C. The
high-level languages only take care of the process flow control; the actual
processing is done with SNAP instructions. The performance data summarized
in Table 5.1 were obtained with the first version of DMSNAP, written in LISP.
With a clock speed of 10 MHz, these execution times are on the
order of 1 millisecond. These and other simulation results verify the operation
of the algorithm and indicate that typical runtime is on the order of milliseconds
per sentence.

The size of the memory network for example II is far larger than that of example
I, yet we see no notable increase in the processing time. This is due to the use
of guided marker-passing, which constrains the propagation paths of markers.
Our analysis of the algorithm shows that parsing time grows only sublinearly
with the size of the network.
132 CHAPTER 5

Sentence                Length (words)   Machine cycles   Time at 10 MHz (msec)
s2: He is at ...        4                 6500             0.65
s3: He said that ...    10               15000             1.50
s5: Eric build ...      5                 5500             0.55
s6: Juntae found ...    6                10500             1.05
s8: Juntae solved ...   7                16500             1.65

Table 5.1 Execution times for DmSNAP

[Figure 5.4 Parsing Performance of DmSNAP: parsing time (milliseconds, roughly 0.5-1.5) plotted against sentence length (words, up to about 10).]



5.7 CONCLUSION
In this chapter, we have demonstrated that high-performance natural language
processing with parsing speeds on the order of milliseconds is achievable without
making substantial compromises in linguistic analysis. On the contrary, our
model is superior to traditional natural language processing models in
several respects, particularly in contextual processing.

DMSNAP is based on the idea of a memory-based model of natural language
processing, and is a variation of the ΦDMDIALOG speech-to-speech
dialog translation system. We use the parallel marker-passing scheme to per-
form parsing, generation, and inferencing. The syntactic constraint network
was introduced to handle linguistically complex phenomena without under-
mining the benefits of the memory-based approach.

Not only does DMSNAP exhibit high-performance natural language process-
ing, it also demonstrates the capability to carry out linguistically sound parsing,
particularly in contextual processing. The use of the memory network to dis-
tributively represent knowledge and modify it to reflect new states of the mental
model is an effective way to handle such phenomena as pronoun reference and
control.

In summary, we have demonstrated that the model presented here is
a promising approach to high-performance natural language processing with
highly contextual and linguistically sound processing. We hope to extend this
work to real-world domains in the near future. We are convinced that mil-
lisecond performance opens new possibilities for natural language processing.
6
ASTRAL: AN IMPLEMENTATION
ON THE IXM2 ASSOCIATIVE
MEMORY PROCESSOR

6.1 INTRODUCTION
In this chapter, we report experimental results on ASTRAL, a partial imple-
mentation of ΦDMDIALOG on the IXM2 associative memory processor. On
the IXM2 associative memory processor, we have investigated the feasibility
and the performance of the memory-based parsing part of the ΦDMDIALOG
model.

Two implementations will be described: a parser with flat syntactic patterns,
and a parser with a hierarchical memory network. The first implementation
takes the extreme view that all possible syntactic structures are pre-expanded in a
flat memory structure. This is the most memory-intensive version of the model.
The latter model is a more moderate strategy, using some abstraction in encoding the
memory network, and is closer to ΦDMDIALOG.

The experimental results were impressive. Syntactic recognition completes on the
order of a few milliseconds. The scaling properties appear desirable, since only
linear degradation is observed as the memory base is scaled up.

6.2 THE MASSIVELY PARALLEL ASSOCIATIVE PROCESSOR IXM2
IXM2 is a massively parallel associative processor designed and developed by
one of the authors at the Electrotechnical Laboratory [Higuchi et al., 1991].
136 CHAPTER 6

It is dedicated to semantic network processing using marker-passing.

IXM2 consists of 64 processors, called associative processors, which operate on
an associative memory with a total capacity of 256K words by 40 bits.
Each associative processor is connected to the other associative processors
through network processors.

An associative processor consists of an IMS T800 transputer, 8 associative
memory chips, RAM, link adapters, and associated logic. When operated at a
20 MHz clock, the T800 attains 10 MIPS [Inmos, 1987]. Each associative memory
chip is a 20 Kbit CAM (512 words x 40 bits) manufactured by NTT [Ogura et
al., 1989]. The IXM2 has 64 such processors, thus attaining 256K parallelism,
which is far larger than the 64K parallelism of the Connection Machine [Hillis, 1985].
This high level of parallelism allows us to implement practical memory-based
systems.
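The parallelism figure quoted above can be checked with a little arithmetic over the hardware numbers given in this section (512-word by 40-bit CAM chips, 8 chips per associative processor, 64 processors):

```python
# Back-of-envelope check of the 256K parallelism figure quoted above.
words_per_chip = 512          # 512 words x 40 bits per CAM chip
chips_per_processor = 8
processors = 64

words_per_processor = words_per_chip * chips_per_processor    # 4,096
total_parallelism = words_per_processor * processors          # 262,144
print(total_parallelism, total_parallelism == 256 * 1024)     # 262144 True
```

Every CAM word can participate in a search simultaneously, which is what "256K parallelism" refers to.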

Network processors are used to handle communication between associative pro-
cessors. There is one top-level network processor which deals with communi-
cation among the lower-level network processors, and 8 lower-level network
processors, each of which is connected to 8 associative processors. Unlike most
other massively parallel architectures, which use N-cube connections or cross-bar
connections, IXM2 employs a full connection, so that communication between
any two processors can be attained by going through only 2 network proces-
sors. This full-connection architecture ensures the high communication bandwidth
and expandability which are critical factors in implementing real-time applica-
tions. Each interconnection uses high-speed serial links (20 Mbits/sec) which
enable a maximum transfer rate per link of 2.4 Mbytes/sec.

6.3 EXPERIMENTAL IMPLEMENTATION I: A FLAT PATTERN MODEL
This section describes the implementation used in the experiments reported here.
It should be understood that the idea of memory-based parsing is new and
that it is at an early stage of development. Thus the specific implementation
described here should be regarded as an example of an implementation, not the
definitive implementation of the memory-based parser. In fact, we will discuss
some enhancements later. The experimental implementation has two major
parts: the massively parallel associative processor IXM2 and a memory-based
parser implemented on the IXM2.
ASTRAL: IXM2 Implementation 137

6.3.1 Organization of the Parser


Now, we describe the organization and algorithm of the memory-based parser
on the IXM2. As an experimental implementation designed to test the practi-
cality of the approach, we employed a flat memory structure, i.e. no hierarchy
was used to encode syntactic patterns. This is because the flat structure is the
most memory-intensive way of implementing the memory-based parsing model.
Thus, should this implementation be judged to be practically useful, other ver-
sions which use a more memory-efficient implementation can also be judged to
be practical.

The system consists of two parts: a syntactic recognition part on the IXM2
and a semantic interpretation part on the host computer.

For the syntactic recognition part on the IXM2, the memory consists of three
layers: a lexical entry layer, a syntactic category layer, and a syntactic pattern
layer.

Lexical Entry Layer: The lexical entry layer is a set of nodes each of which
represents a specific lexical entry. Most of the information is encoded
in lexical entries in accordance with modern linguistic theories such as
HPSG [Pollard and Sag, 1987], and the information is represented as a
feature structure. Obviously, it is a straightforward task to represent
huge numbers of lexical entries on the IXM2.

Syntactic Category Layer: The second layer comprises a group of nodes


representing the syntactic features. Perhaps the most important feature
for parsing is the head major category, generally known as the syntactic
category. In the specific implementation examined in this paper, we use
the head major category as a single feature to index syntactic structures.
However, it is also possible to incorporate other features to index syntactic
structures. The choice of features to be incorporated largely depends on
the strategy: how precisely to differentiate syntactic structures, and how
heavily the constraint checks are to be conducted on each processor or on the
host computer.
Syntactic Patterns Layer: All possible syntactic structures are directly
mapped onto the associative memory as a syntactic patterns layer. As
mentioned earlier, a syntactic structure is a flat sequence of syntactic
categories which can be generated from the given grammar or from a cor-
pus of training sentences. Table 6.1 shows a part of the simple syntactic
structures loaded into the associative memory. Grammatical constraints can

N  V-BSE  DET  N
N  V-BSE  N
N  BE-V  V-PAS  PP-by  N

Table 6.1 Pre-Expanded Syntactic Structures

be incorporated when expanding grammar rules. The layer allows for recursive
structure, so that the number of actual syntactic structures loaded is less
than the actual number of syntactic patterns the system can accept.

The degree of constraints which are incorporated in the expanded syntactic


structures largely affects the memory requirements and the processing load
on the host processor. If only the head major category is incorporated, most
constraint checks must be done by the host computer or at the transputer. On
the other hand, if all constraints are incorporated in expanding grammar, the
number of possible syntactic structures will be explosive and it will require far
more associative memory chips. In this experiment, we only used the head
major category (such as NOUN, VERB), thus most constraint processing is
done at each transputer and at the host processor. It is also possible to use
more subdivided symbols at the cost of memory requirements.

On the host computer (SUN-3/250), a case-role binding table is pre-compiled
which indicates the correspondence between case-roles and word positions. Table
6.2 shows a part of a simple case-role binding table. Each position in the table is
associated with actions to be taken in order to build the meaning representation. In
building the meaning representation, the program resides on the host computer
and carries out role-bindings and some constraint checks, depending on how the
constraints are incorporated into the syntactic recognition part. If there are
ambiguous parses, more than one item in the table needs to be processed.
However, it should be noted that all items reported by the IXM2
are already accepted parsing hypotheses as far as syntactic structure
is concerned. This architecture drastically minimizes the number of operations
required for parsing by eliminating operations on parses which turn out to be
false.

6.3.2 Algorithm
The algorithm is simple. Two markers, activation markers (A-Markers) and
prediction markers (P-Markers), are used to control the parsing process. A-
Markers are propagated through the memory network from the lexical items
which are activated by the input. P-Markers are used to mark the next possible
elements to be activated.

001  ACTOR   ACTION  DET  OBJECT
002  ACTOR   ACTION  OBJECT
003  OBJECT  was  ACTION  by  ACTOR
Words: John  was  kicked  by  Mary

Table 6.2 Case-Role Table

A general algorithm follows:

1. Place P-Markers at all first elements of the Syntactic Patterns.

2. Activate the lexical entry.

3. Pass the A-Marker to the Syntactic Category Node.

4. Pass the A-Marker to the elements in the Syntactic Patterns.

5. If the A-Marker and a P-Marker co-exist at an element in a Syntactic
Pattern, then the P-Marker is moved to the next element of the Syntactic
Pattern.

6. If there are no more elements, the syntactic pattern is temporarily accepted
and its pattern ID is sent to the host or local processors for semantic inter-
pretation.

7. Repeat 2 through 6 until the end of the sentence.
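The steps above can be sketched as a serial simulation. This is a simplified illustration: on the IXM2 every pattern advances in parallel in associative memory, failed hypotheses are handled implicitly, and the patterns and lexicon below are invented for the example.

```python
# Minimal serial sketch of the flat-pattern A-Marker/P-Marker algorithm.
# Patterns follow Table 6.1; the tiny lexicon is illustrative.
patterns = [
    ["N", "V-BSE", "DET", "N"],
    ["N", "V-BSE", "N"],
]
lexicon = {"john": "N", "eats": "V-BSE", "the": "DET", "apple": "N"}

def parse(words):
    # Step 1: place a P-Marker on the first element of every pattern.
    p_markers = {i: 0 for i in range(len(patterns))}
    accepted = []
    for word in words:
        category = lexicon[word]            # steps 2-4: A-Marker via category
        for i, pos in list(p_markers.items()):
            if patterns[i][pos] == category:        # step 5: A-P collision
                if pos + 1 == len(patterns[i]):     # step 6: pattern complete
                    accepted.append(i)              # report pattern ID
                    del p_markers[i]
                else:
                    p_markers[i] = pos + 1          # move P-Marker forward
            else:
                del p_markers[i]                    # prediction failed
    return accepted

print(parse(["john", "eats", "the", "apple"]))  # [0]
```

Each input word activates one category, all surviving hypotheses advance in lockstep, and only completed pattern IDs are reported, mirroring steps 1 through 7.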

On the host computer or on the 64 T800 transputers, semantic interpreta-
tion is performed for each hypothesis. The general flow follows:

1. Receive the syntactic pattern ID

2. If words remain in the sentence, then ignore the ID received.


3. If no words remain, perform semantic interpretation by executing the func-
tions associated with each hypothesis in the table. Most operations are
reduced to a bit-marker constraint check and case-role bindings at compile
time.
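The case-role binding of step 3 can be sketched against Table 6.2. This is an illustrative assumption about the table's representation: the pattern-ID keys and the convention of skipping literal words are invented for the example.

```python
# Illustrative sketch of case-role binding from a Table 6.2-style table.
# The ID->roles mapping and literal-word filtering are assumptions.
case_role_table = {
    "001": ["ACTOR", "ACTION", "DET", "OBJECT"],
    "002": ["ACTOR", "ACTION", "OBJECT"],
    "003": ["OBJECT", "was", "ACTION", "by", "ACTOR"],
}

def interpret(pattern_id, words):
    """Bind each word position to its case role; literals like 'was' are skipped."""
    roles = case_role_table[pattern_id]
    return {role: word for role, word in zip(roles, words)
            if role.isupper()}  # keep role slots, drop literal words

print(interpret("003", ["John", "was", "kicked", "by", "Mary"]))
# {'OBJECT': 'John', 'ACTION': 'kicked', 'ACTOR': 'Mary'}
```

Because the table is indexed by an already-accepted pattern ID, interpretation reduces to position-wise binding, which is why the host-side work is so cheap.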

[Figure 6.1 Syntactic Recognition Time vs. Sentence Length: syntactic recognition time (milliseconds, roughly 0.5-1.0) plotted against sentence length (5-15 words).]

6.4 PERFORMANCE
We carried out several experiments to measure the system's performance. Fig-
ure 6.1 shows the syntactic recognition time against sentences of various
lengths. Syntactic recognition on the order of milliseconds is attained. This exper-
iment uses a memory containing 1,800 syntactic patterns. On average, 30
syntactic patterns are loaded onto each associative processor. Processing speed
improves as parsing progresses. This is because the computational cost of a
sequential part of the process is reduced as the number of activated hypotheses
decreases. There is one sequential process which checks active hypotheses on
each of the 64 transputers. During this process, the parallelism of the total system is
64.

It should be noted that this speed has been attained by extensive use of asso-
ciative memory in the IXM2 architecture - simple use of 64 parallel processors
will not attain this speed. In order to illustrate this point, we measured the
performance of a single associative processor of the IXM2 (one of the 64 as-
sociative processors) and of the SUN-4/330, CM-2 Connection Machine, and
Cray X-MP.

The program on each machine uses optimized C code for this task. The
number of syntactic patterns is 30 for both a single associative

Sentence Length (words)   IXM2   CM2     SUN-4/330   Cray X-MP
5                         0.8    377.3   12.8        14.5
6                         1.1    457.6   17.9        19.8
7                         1.3    533.3   18.3        20.3
8                         1.3    620.8   18.2        20.4

Table 6.3 Syntactic Recognition Time vs. Sentence Length (milliseconds)

Number of Patterns   IXM2   CM2     SUN-4/330   Cray X-MP
10                   0.7    608.4   4.4         4.7
30                   1.3    620.8   18.2        20.4

Table 6.4 Syntactic Recognition Time vs. Grammar Size (milliseconds)

processor of the IXM2 and the other machines. The experimental results are shown
in Table 6.3. A single processor of the IXM2 is almost 16 times faster than
the SUN-4/330 and the Cray X-MP, even on such a small task.¹ The CM-2
Connection Machine is very slow due to a communication bottleneck between
processors. While the IXM2 and the SUN-4/330 use CPUs of compa-
rable speed, the superiority of the IXM2 can be attributed to its intensive use
of associative memory, which attains a massively parallel search.

This trend becomes even clearer when we look into the scaling properties
of the systems. Table 6.4 shows the performance for a sentence of length
8, for syntactic pattern sets of size 10 and 30. While a single processor of the
IXM2 maintains less-than-linear degradation, the SUN-4/330 and Cray X-MP
degrade more than linearly. It should be noted that 30 syntactic patterns on the
other machines literally means 30 patterns, but for a single processor of the
IXM2 it means 1,800 patterns when all 64 processors are used.

It is expected that a larger task set would demonstrate a dramatic difference
in total computation time. The IXM2 can load more than 20,000 syntactic pat-
terns, which is sufficient to cover the large-vocabulary tasks currently available
for speech recognition systems. With up-to-date associative memory chips, the
¹The Cray X-MP is very slow in this experiment, mainly due to its subroutine call overhead.
We have tested this benchmark on a Cray X-MP in Japan and at the Pittsburgh Supercom-
puting Center, and obtained the same result, so this is not a hardware problem or other
irregularity.

[Figure 6.2 Performance Improvement by Learning New Cases: expected performance (seconds, up to about 2.0) plotted against the number of training sentences (500-2,000).]

number of syntactic patterns which can be loaded on the IXM2 exceeds 100,000.
Also, extending the IXM2 architecture to load over one million syntactic pat-
terns is both economically and technically feasible.

The memory-based parser can improve its performance over time. While the pre-
vious experiments stored the necessary syntactic patterns beforehand, a more com-
prehensive system starts from no pre-stored cases and tries to improve its per-
formance by acquiring syntactic patterns. Figure 6.2 shows the per-
formance improvement of our system assuming that each new case of syntactic
patterns is incrementally stored at run time.² In other words, first the input
is given to the memory-based parser, and if it fails to parse, i.e. no case in the
memory corresponds to the input sentence, then the conventional parser will
parse the input. Parsing by the conventional parser takes about an average of
2 seconds. New syntactic patterns can be given from the parse tree of the con-
ventional parser to be loaded on the memory-based parser to improve coverage.
This way, overall performance of the system can be improved over time. The
memory-based parsing can be combined with a conventional parser to improve
overall performance of the system by incrementally learning syntactic patterns
in the task domain.
²Note that the parsing time is an expected time. When the memory-based parser covers
the input, it completes parsing in a few milliseconds; otherwise the conventional parser
parses it in about 2 seconds. The expected parsing time improves as the memory-based
parser covers more inputs.
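The expected-time model in the footnote can be written down directly. The 2-millisecond hit cost and 2-second fallback cost are the text's estimates; the coverage values below are arbitrary illustrations.

```python
# Expected parsing time of the hybrid system: a memory-based hit costs a
# few milliseconds, a miss falls back to the conventional parser (~2 s).
def expected_time(coverage, hit_ms=2.0, miss_ms=2000.0):
    """coverage: fraction of inputs covered by stored syntactic patterns."""
    return coverage * hit_ms + (1.0 - coverage) * miss_ms

for coverage in (0.0, 0.5, 0.9):
    print(f"{coverage:.0%} coverage -> {expected_time(coverage):.1f} ms")
```

As coverage grows with each acquired pattern, expected time falls almost linearly toward the memory-based parser's millisecond regime, which is the curve sketched in Figure 6.2.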

6.5 MEMORY AND PROCESSOR REQUIREMENTS
While the high performance of memory-based parsing on a massively parallel
machine has been clearly demonstrated, we now look into its memory require-
ments. We examine whether, in practice, the number of syntactic structures which
appear in the given task domain saturates at a certain number. Empirical
observation using a corpus taken from the DARPA task shows that it does
converge in a restricted domain (Figure 6.3). However, the number
of syntactic patterns necessary to cover the task domain was 1,500 with the
flat structure, and it was reduced to 900 with a simple hierarchical network.
Since the IXM2 is capable of loading over 20,000 syntactic patterns, the model is
capable of covering the task even with the flat memory approach, and a much
wider domain can be covered with the hierarchical model. However, a larger-scale
experiment will be necessary to see if the number of syntactic patterns sat-
urates, and where it saturates. We are currently investigating this issue using
a large corpus of real-world data such as CNN.

Independently, we have carried out an experiment to cover a given domain
based on syntactic patterns pre-expanded from a set of grammar rules. We pre-
expanded syntactic patterns from a set of context-free rules to see the memory
requirements.
expanded syntactic patterns from a set of context-free rules to see the memory
requirements. A set of 6 basic grammar rules will produce about 2,000 patterns
when the maximum length is 10 words, and about 20,000 patterns when the
maximum length is 15 words. However, this has been reduced to 1/20 by
using local networks which handle noun-noun modifications, adjective-noun
modifications, etc. Thus, by imposing additional constraints, pre-expansion
of syntactic patterns from a set of grammar rules is also feasible, and can
be loaded on the IXM2. In addition, it should be noted that not all syntactic
patterns are actually used in the real world; thus the number of syntactic
patterns that we really need to load on the machine would be far smaller.
Psycholinguistic study shows that there is an upper bound on the complexity
of sentences which people can process [Gibson, 1990]. The hypothesis that
the number of syntactic patterns that actually appear in a given task is
relatively small can be independently confirmed. Nagao [Nagao, 1989] reported
that the syntactic patterns appearing in the titles of over 10,000 scientific papers
numbered around 1,000, and that they were reduced to just 18 with simple reduction
rules. While we can only confirm our hypothesis on the basis of our experiments on
small and medium size domains, the increasing availability of large memory space
and large numbers of processors provided by massively parallel machines offers
a realistic opportunity for massively parallel memory-based parsing to be
deployed on practical tasks.
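The pre-expansion experiment above can be sketched as bounded leftmost rewriting of context-free rules into flat category sequences. The toy grammar below is illustrative, not the 6-rule grammar used in the experiment, and the length bound is in category symbols rather than words.

```python
# Sketch of pre-expanding flat category sequences from CFG rules, bounded
# by a maximum sequence length. The toy grammar is an assumption.
grammar = {
    "S":  [["NP", "VP"]],
    "NP": [["N"], ["DET", "N"], ["NP", "PP"]],
    "VP": [["V", "NP"], ["VP", "PP"]],
    "PP": [["P", "NP"]],
}

def expand(symbols, max_len):
    """Yield all terminal-category sequences derivable within max_len symbols."""
    if len(symbols) > max_len:        # rewrites never shrink, so prune here
        return
    for i, sym in enumerate(symbols):
        if sym in grammar:            # rewrite the leftmost nonterminal
            for rhs in grammar[sym]:
                yield from expand(symbols[:i] + rhs + symbols[i + 1:], max_len)
            return
    yield tuple(symbols)              # all symbols are terminal categories

patterns = set(expand(["S"], 6))
print(len(patterns))
```

Note that distinct derivations often yield the same flat sequence (the `set` deduplicates them), which is one reason the flat pattern count stays manageable; local sub-networks for noun-noun and adjective-noun modification reduce it much further, as described above.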

[Figure 6.3 Training Sentences vs. Syntactic Patterns: the number of distinct syntactic patterns (up to about 1,500) plotted against the number of training sentences (500-2,000).]

6.6 ENHANCEMENT: HIERARCHICAL MEMORY NETWORK
The hierarchical memory model incorporates syntactic and semantic knowledge
at various levels of abstraction in order to capture the productivity of language
with efficient memory use. As we have seen, the flat memory model, which simply
pre-expands the possible syntactic patterns, requires a far larger memory space
when the task domain is enlarged. Thus, the memory-based parsing model in
this primitive form will only suffice in restricted, medium-size domains.

The hierarchical memory network model avoids this problem by layering the
levels of abstraction incorporated in the memory. Figure 6.3 shows an exam-
ple of the memory-saving effect of the hierarchical memory network. The model
assumes three levels of abstraction: surface sequences, generalized cases, and
syntactic rules. The surface sequences are simple sequences of words. This level
of abstraction is useful for processing such utterances as "How's it going" or "What
can I do for you?" These are a kind of canned phrase which frequently appear
in conversations. They also exemplify an extended notion of the phrasal lexicon.
By pre-encoding such phrases in their surface form, computational costs can be
saved. However, we cannot store all sentences this way. This leads to
the next level of abstraction, the generalized cases. Generalized cases
are a kind of semantic grammar whose phrase structure rules use non-terminal
symbols to represent concepts with specific syntactic and semantic features.

[Figure 6.4 Overall Architecture of the Parsing Part: an abstraction hierarchy and a conceptual sequence layer (with sequences such as <*register for *conference>) at the top, above a lexical entry layer (e.g. 'John'), a phoneme sequence layer, and a phoneme layer (AE, I, O, U, ...).]

One example of such a sequence is <*agent *want-to *attend-event>. This
level of knowledge is, of course, less productive than syntactic rules. But it de-
creases the cost of semantic interpretation, since some semantic features are in-
corporated at pre-expansion time, and it imposes far more constraints on a speech
recognition module. The latter is extremely important for language models
for spoken language understanding systems. The third layer directly encodes
syntactic rules (with no or minimal pre-expansion), thereby guaranteeing
wide coverage of the system.

6.7 EXPERIMENTAL IMPLEMENTATION II: HIERARCHICAL MEMORY NETWORK MODEL
ASTRAL³ is an implementation of memory-based translation on the IXM2.
The overall architecture is shown in figure 6.4. The memory consists of four
layers: a phoneme sequence layer, a lexical entry layer, an abstraction hierarchy,
and a concept sequence layer.

Phoneme Layer: Phonemes are represented as nodes in the network, and
³ASTRAL is an acronym for the Associative model of Translation of Language.

link(first,ax31,about).
link(last,t34,about).
link(instance_of,ax31,ax).
link(destination,ax31,b32).
link(instance_of,b32,b).
link(destination,b32,aw33).
link(instance_of,aw33,aw).
link(destination,aw33,t34).
link(instance_of,t34,t).

Figure 6.5 Network for 'about' and its phoneme sequence

they are connected to each instance of a phoneme in the phoneme sequence
layer. Weights are associated with the links, representing the likelihood of
acoustic confusion between phonemes.

Phoneme Sequence Layer: The phoneme sequence of each word is repre-
sented in the form of a network. This part is shown in figure 6.5.

Lexical Entry Layer: The lexical entry layer is a set of nodes each of which
represents a specific lexical entry.
Abstraction Hierarchy: The class/subclass relation is represented using IS-
A links. The highest (most general) concept is *all, which entails all
the possible concepts in the network. Subclasses are linked under the *all
node, and each subclass node has its own subclasses. As the basis of the
ontological hierarchy, we use the hierarchy developed for the MU project
[Tsujii, 1985], with domain-specific knowledge added.
Concept Sequence: Concept sequences which represent patterns of input
sentences are represented in the form of a network. Concept sequences
capture linguistic knowledge (syntax) with selectional restrictions.
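The interaction between the abstraction hierarchy and the concept sequences can be sketched as marker propagation up IS-A links. The toy hierarchy and concept names below are illustrative, not the MU-project ontology.

```python
# Sketch of selectional restriction checking over the abstraction hierarchy:
# an A-Marker on a lexical concept propagates up IS-A links, and a concept
# sequence element matches if the marker reaches it. Toy hierarchy.
isa = {"*conference": "*event", "*event": "*all",
       "*john": "*human", "*human": "*all"}

def matches(instance, restriction):
    """True if `instance` is `restriction` or a descendant via IS-A links."""
    node = instance
    while node is not None:
        if node == restriction:
            return True
        node = isa.get(node)   # propagate the marker one IS-A link up
    return False

print(matches("*conference", "*event"), matches("*john", "*event"))
# True False
```

This is how a generalized case such as <*agent *want-to *attend-event> accepts '*conference' for its event slot while rejecting concepts outside that subtree.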

Figure 6.5 shows a part of the network. The figure shows a node for the word
'about' and how its phoneme sequence is represented. The left side of the
figure is the set of IXM instructions that encodes the network on the right side
on the IXM2 processor. Refer to [Higuchi et al., 1991] for details of the mapping of
semantic networks onto the IXM2. We have encoded a network including phonemes,
phoneme sequences, lexical entries, abstraction hierarchies, and concept sequences
which covers the entire task of ATR's conference registration domain. The
vocabulary size is 405 words in one language, and over 300 sentences
in the corpus are covered. The average fanout of the network is 40.6.
The weight value has not been set in this experiment in order to compare the

performance with other parsers which do not handle stochastic inputs. In
real operation, however, fully tuned weights are used. The implementation in
this version uses a hierarchical memory network, thereby attaining wider
coverage with smaller memory requirements.⁴

The table of templates for the target language is stored on the host computer
(SUN-3/250). The binding table for each concept and concept sequence, and the
specific substrings, are also created there. When parsing is complete, the genera-
tion process is invoked on the host. It is also possible to compute it distributively
on the 64 T800 transputers. The generation process is computationally cheap, since
it only retrieves and concatenates substrings (each a lexical realization in
the target language) bound to conceptual nodes, following the patterns of the
concept sequence in the target language.

The algorithm is simple. Two markers, activation markers (A-Markers) and
prediction markers (P-Markers), are used to control the parsing process. A-
Markers are propagated through the memory network from the lexical items
which are activated by the input. P-Markers are used to mark the next possible
elements to be activated. This algorithm is similar to the basic framework of
the ΦDMDIALOG speech-to-speech translation system [Kitano, 1989d], and in-
herits the basic notion of direct memory access parsing (DMAP) [Riesbeck
and Martin, 1985]. The parsing algorithm can process context-free grammar
(CFG) and augmented CFG using constraints (in effect, augmented CFG is
context-sensitive due to the constraints added to the CFG). Part of the parsing
process is analogous to an Earley-type shift-reduce parser. To aid under-
standing, shift and reduce operations have been labeled where appropriate. However, the
basic operation is highly parallel. In particular, it exhibits a data-parallel
nature due to simultaneous operations on all the data in the
memory. A general algorithm follows (only the basic framework is shown; some
extra procedures are necessary to handle CFG and augmented CFG):

1. Place P-Markers at all first elements of Concept Sequence.


2. Activate Phoneme Node.
3. Pass A-Markers from the Phoneme Node to Nodes of Phoneme Sequences.
4. If the A-Marker and a P-Marker co-exist (this is called an A-P-Collision)
at an element in the Phoneme Sequence, then the P-Marker is moved to
the next element of the Phoneme Sequence. (Shift)
⁴An alternative method of covering wider inputs is to use similarity-based matching, as
seen in [Sumita and Iida, 1991]. Combining such an approach with our model is feasible.

5. If the A-P-Collision takes place at the last element of the phoneme se-
quence, an A-Marker is passed up to the Lexical Entry. (Reduce) Else,
Goto 2.
6. Pass the A-Marker from the lexical entry to the Concept Node.
7. Pass the A-Marker from the Concept Node to the elements in the Concept
Sequence.
8. If the A-Marker and a P-Marker co-exist at an element in the Concept
Sequence, then the P-Marker is moved to the next element of the Concept
Sequence (Shift).
9. If an A-P-Collision takes place at the last element of the Concept Sequence,
the Concept Sequence is temporarily accepted (Reduce), and an A-Marker
is passed up to abstract nodes. Else, Goto 2.
10. If the Top-level Concept Sequence is accepted, invoke the generation pro-
cess.
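The two-level shift-reduce marker passing above can be condensed into a serial sketch. This is a simplified illustration: the phoneme strings, concepts, and the single concept sequence are invented, there is no acoustic-confusion weighting, and real ASTRAL propagates markers for all sequences in parallel.

```python
# Condensed sketch of the two-level marker passing: phoneme-level A-P
# collisions reduce to lexical entries, which drive concept-level shifts.
phoneme_seqs = {"john": ["jh", "aa", "n"], "runs": ["r", "ah", "n", "z"]}
concept_of = {"john": "*agent", "runs": "*action"}
concept_seq = ["*agent", "*action"]      # one top-level concept sequence

def recognize(phonemes):
    word_p = {w: 0 for w in phoneme_seqs}   # P-Markers on phoneme sequences
    seq_pos = 0                             # P-Marker on the concept sequence
    for ph in phonemes:                     # step 2: activate phoneme node
        for w, pos in list(word_p.items()):
            if phoneme_seqs[w][pos] == ph:          # A-P collision (shift)
                if pos + 1 == len(phoneme_seqs[w]): # last phoneme: reduce
                    word_p[w] = 0                   # re-predict the word
                    if concept_of[w] == concept_seq[seq_pos]:  # concept shift
                        seq_pos += 1
                        if seq_pos == len(concept_seq):
                            return "accepted"       # top-level reduce
                else:
                    word_p[w] = pos + 1
            else:
                word_p[w] = 0                       # prediction failed: reset
    return "incomplete"

print(recognize(["jh", "aa", "n", "r", "ah", "n", "z"]))  # accepted
```

Acceptance of the top-level concept sequence is what triggers the generation process in step 10.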

6.8 PERFORMANCE
We carried out several experiments to measure the system's performance. Fig-
ure 6.6 shows the parsing time against sentences of various lengths. Parsing
on the order of milliseconds is attained. PLR is a parallel version of Tomita's LR
parser. The performance of PLR is shown only to provide a general idea of
the speed of traditional parsing models. Since the machines and grammars
differ between PLR and our experiments, we cannot make a direct comparison.
However, its order of time required and its exponentially increasing parsing time
clearly demonstrate the problems inherent in the traditional approach. The
memory-based approach on IXM2 (MBT on IXM2) shows parsing performance
that is an order of magnitude faster. Also, its parsing time increases almost
linearly with the length of the input sentences, as opposed to the exponential
increase seen in PLR. Notice that this graph is drawn with a log scale on the
Y-axis. CM-2 is slow in absolute speed but exhibits characteristics similar to
the IXM2. The speed is due to the PEs' capabilities and the machine architecture,
and the fact that CM-2 shows a similar curvature indicates the benefits of MBT.
The SUN-4 shows a similar curve, too. However, because the SUN-4 is a serial
machine, its performance degrades drastically as the size of the KB grows, as
discussed below.

Scalability is demonstrated in figure 6.7. The parsing time of a sentence
with 14 input symbols is shown for various sizes of KBs. The size of the
KB is measured by the number of nodes in the network. The performance
degradation is less than linear, due to the local activation of the algorithm.
ASTRAL: IXM2 Implementation 149

[Figure 6.6 here: log-scale plot of parsing time (milliseconds, Y-axis) against input length (X-axis, 5 to 20 words) for MBT on IXM-2, MBT on CM-2, MBT on SUN-4, and PLR.]

Figure 6.6 Parsing Time vs. Length of Input


150 CHAPTER 6

[Figure 6.7 here: plot of parsing time (milliseconds, Y-axis, roughly 1 to 7.5 ms) against KB size (X-axis, 100 to 400 nodes) for IXM-2, CM-2, and SUN-4.]

Figure 6.7 Parsing Time vs. KB Size



This trend is the opposite of that of traditional parsers, in which the parsing time grows faster than linearly with the size of the grammar KB (which generally grows as the square of the number of grammar rules, O(G²)) due to a combinatorial explosion of serial rule applications. The CM-2 shows a curve similar to the IXM2's, but is much slower due to the slow processing capability of its 1-bit PEs. The SUN-4 is at a disadvantage on a scaled-up KB due to its serial architecture. In particular, the MBT algorithm involves extensive set operations to find nodes with an A-P-Collision, which are well suited to SIMD machines. Serial machines need to search the entire KB, which leads to the poor performance shown in the figures in this section.

6.9 HARDWARE ARCHITECTURE FOR


MEMORY-BASED PARSING
First, the IXM2-type associative memory machine has a cost-performance advantage for the flat memory architecture, which is the simplest version of a memory-based parser. In the IXM2, the associative memory stores syntactic patterns and allows for a massively parallel search (up to 256K in the current implementation), while limiting the number of processors to 64.

Figure 6.8 shows the number of active hypotheses per one associative processor.
Although it starts with a high processing load where a significant percentage of
hypotheses are activated, the number of hypotheses decreases drastically as the
processing proceeds. Since the IXM2 uses associative memory chips to store
syntactic patterns, no processor will be idle unless all the hypotheses assigned to
the processor are eliminated. However, in other massively parallel machines
that assign processors to all the hypotheses, most of the processors will be idle
because most of the hypotheses will be eliminated as the processing progresses.
Since the use of associative memory chips would be far cheaper than processor
chips to store and carry out the operations necessary in the implementation
in this paper, the IXM2's architecture would be more cost effective than other
architectures for this task.

However, there are advantages to assigning processors to all the hypotheses.


When each hypothesis needs to make a complex calculation, such as a probability measure, a source-of-activation check, or a dynamic reconfiguration of the network, simple associative memory does not suffice for these operations, and the use of processors would be necessary regardless of the possibility of substantial idle time. For many possible memory-based parsers, the simple

[Figure 6.8 here: number of active hypotheses (Y-axis, up to about 10) against word position (X-axis, 5 to 15).]

Figure 6.8 Number of Active Hypotheses per Processor

associative memory would suffice, since the sixty-four 32-bit CPUs can distributively perform the higher-level symbolic operations. Even in most memory-based reasoning tasks, similarity matching uses relatively simple similarity measures based on numeric computations, which can be computed on associative memory. Thus, the IXM2 architecture which we advocate in this paper is a cost-effective architecture not only for the memory-based parser, but also for more general memory-based reasoning systems.

6.10 CONCLUSION
We have shown, using data obtained from our experiments, that massively parallel memory-based parsing is a promising approach to implementing a high-performance real-time parsing system for certain task domains.

Major claims and observations made by our experiments are:

• Massively parallel memory-based translation attains real-time translation when implemented on a massively parallel machine. Our experiments using the IXM2 associative memory processor show that parsing is completed on the order of a few milliseconds, whereas traditional parsers require a few seconds to even a few minutes. The main reason for this performance is the data-parallel nature of the memory-based translation paradigm, in which a parallel search is carried out over all sentence patterns (represented as concept sequences). In addition, the parsing time grows only linearly (or sublinearly) with the size of the input (≤ O(n)), whereas traditional parsers generally require O(n³). The system not only attains millisecond-order parsing performance, but also exhibits a desirable scaling property: the parsing time grows only sublinearly with the size of the knowledge base loaded. This scaling property is the real benefit of using a massively parallel machine, and it indicates that the memory-based approach is promising for large-scale domains.

• Massively parallel memory-based parsing can be implemented within practical memory and processor requirements when designed for suitable task domains. Our observations from spoken-language corpora demonstrate that the length of sentences converges within a manageable range, and that the number of syntactic patterns also converges to a practical scale. This enables us to implement the memory-based parsing model on massively parallel machines that already exist. The task domains which we have examined are tasks for large-vocabulary speech recognition systems, which assume a vocabulary size of over 1,000 words; thus, our model should suffice as a parsing back-end for the largest-scale speech recognition systems, such as SPHINX. With the possible development of larger-scale massively parallel machines, such as the one targeted by DARPA for TeraOps by 1995 [Waltz, 1990], large-scale massively parallel memory-based parsing would be within sight.

• The use of a hierarchical memory structure allows our model to cover broader task domains without significant loss in its performance.

• The effectiveness of the IXM2 architecture for large-scale parallelism has been confirmed. In memory-based translation, a large set of sentence patterns is stored in associative memory. In natural language processing, each phoneme, word, and concept appears in various places due to the vast combinatorial possibilities of sentence production. This is particularly true for memory-based translation, because surface, near-surface, and conceptual sequences are used, which are more specific than most grammar rules. Because of this representation level, the average fanout of the semantic network which represents the linguistic knowledge is large; the network used in this experiment has an average fanout of 40.6. The IXM2 has an architectural advantage in processing networks with a large fanout, and an additional experiment verifies this advantage. Given networks with different fanouts, the IXM2 increasingly outperforms the other machines as the average fanout becomes larger (Figure 6.9). While the other machines' performance degrades, the IXM2 takes constant time to complete the propagation of the markers to all nodes linked to the source of activation, owing to its use of associative memory. For memory-based natural language processing, this is extremely powerful, because semantic networks for natural language processing tend to have a large fanout factor, as seen in the example in this paper.

• The IXM2 architecture is cost effective for memory-based reasoning tasks. The use of associative memory to attain a high level of parallelism (256K, in the current implementation) allows us to build a practical memory-based system with comparatively low resource requirements. Because only a few parts of memory are actually involved in solving a specific problem, most processors would be idle if a fully fine-grained processor architecture were used. The IXM2 is one of the ideal and cost-effective architectures for building practical, large-scale memory-based systems.
• The massively parallel memory-based parsing can benefit various appli-
cation areas related to natural language processing, including real-time
speech understanding systems, bulk-text processing and retrieval systems,
and real-time text summary systems.

One of the major contributions of this paper, however, is to have shown that the time-complexity of natural language processing can be traded for space-complexity, thereby drastically improving parsing performance when executed on massively parallel machines. This trade-off is the basic thrust of the memory-based and case-based reasoning paradigms. The point has been clearly illustrated by comparing a version of Tomita's LR parsing algorithm with the memory-based parsing approach. The traditional parsing strategy exhibited an exponential degradation due to extensive rule application, even in a parallel algorithm. The memory-based approach avoids this problem by using a hierarchical network which compiles grammars and knowledge in a memory-intensive way. While many AI researchers have speculated about the speed-up obtainable from massively parallel machines, this is the first report to actually substantiate the benefit of the memory-based approach to natural language processing on massively parallel machines.

In addition, we have shown that differences in architecture between massively parallel machines significantly affect the total performance of the application. The IXM2 is significantly faster than the CM-2, mainly due to the parallel marker-passing capability of its associative memory.

[Figure 6.9 here: log-log plot of parallel marker-propagation time (microseconds, Y-axis) against fanout (X-axis, 1 to 100) for the IXM2, CM-2, SUN-4, and Cray; the IXM2 curve remains flat while the others grow with fanout.]

Figure 6.9 Parallel Marker-Propagation Time vs. Fanout


7
MEMOIR: AN ALTERNATIVE
VIEW

7.1 INTRODUCTION
This chapter presents the Memoir system, which is yet another model of memory-based machine translation and of the integration of memory-based and rule-based processing. However, it is based on ideas different from those of the original ΦDMDIALOG. In the ΦDMDIALOG system, a set of rules was used to create the meaning representation of input sentences; the rules played a role similar to that of examples and sentence templates, but at a different level of abstraction. The Memoir system takes a different view: rules are used solely to monitor whether a correct word choice and pronoun reference was made.

The major emphases of the proposed model are that it: (1) integrates memory-based and rule-based processing in a novel fashion, (2) employs a user-customizable example representation, and (3) introduces robust and dynamic matching of examples against the input sentence.

7.2 OVERALL ARCHITECTURE


The architecture is shown in Figure 7.1. The major differences from other memory-based machine translation systems (MBMTs) are the monitoring process and the grammatical inference process. Pure MBMT advocates the extreme position of viewing translation as a purely memory-based process. However, it is widely recognized that rules are obviously involved in the translation process. The author's model makes use of rules and linguistic knowledge in two ways.
158 CHAPTER 7

First, the grammatical inference module uses local and relation-driven control to infer the head, subject, object, and focus of the sentence. The result of this inference is accessible from the adaptive translation module and the monitor module. The adaptive translation module uses information on what the center of the previous sentence is, in order to handle zero pronouns in Japanese-to-English translation. Identification of the center is based on the centering constraint [Kameyama, 1988]. The monitor module uses information on what the subject or object of a certain verb is, in order to decide which expert rules to apply.

Second, the monitor module uses a set of rules encoding translator's knowledge
to ensure correct word choice and stylistics. The monitor module initially checks
to determine if any word in the sentence matches any part of the condition part
of the rules. If there are rules which involve words in the sentence, the monitor
module dispatches a request to the grammar inference module, to check if the
condition part of the rule (such as object of increase is short-term interest rate)
can be met. The grammatical inference module uses its local rules and relation-
driven control to check whether or not this condition can be satisfied. Then, the
grammar inference module returns the result. Depending on the query result,
the monitor module may invoke its rules to rewrite the translated sentences.

Although the model uses the rule-based process, this process carries out very
different tasks from those employed in the traditional rule-based MT. In fact,
there is no place in the proposed model where a complete parse tree is to
be built or a full parse is to be carried out. This approach coincides with recent research in MUC-3 [Sundheim, 1991], MUC-4, and TIPSTER, which relies heavily on partial parsing and dynamic control strategies, such as relation-driven control [Jacobs, 1992].

7.3 KNOWLEDGE SOURCES


7.3.1 Memory-Base
A memory-base consists of examples of translation pairs. A translation pair (TX) consists of a source sentence example (SSE), its translation (the target sentence example, TSE), and a segment map, which defines the segment-level correspondence between the SSE and the TSE. In the real system, sentences are stored after morphological analysis.

For example, the translation examples shown in Table 7.1 are stored in the memory-base after morphological analysis (Table 7.2).
Memoir: An Alternative View 159

[Figure 7.1 here: the input sentence flows through Morphological Analysis and Example Retrieval to Adaptive Translation and the Monitor, then through Morphological Generation to the translation result; the Grammatical Inference module is consulted by the adaptive translation and monitor modules.]

Figure 7.1 Overall Architecture

1  I gave a talk.                    私は、講演をした。
2  I gave her a gift.                私は、彼女に、プレゼントを贈った。
3  I gave up.                        私は、あきらめた。
4  I gave her a lesson.              私は、彼女に教訓を与えた。
5  He gave an example.               彼は、例を挙げた。
6  I am not sure if it was snowing.  雪が降っていたかどうか、よく分からない
7  Shall I send you a document       資料を、送りましょうか。
8  I already have a document         もう持っています。
9  ODR was increased                 公定歩合が引き上げられた。
10 The increase in CD-rate           CDレートの上昇。

Table 7.1 Examples of Translation Pair


English Sentences

No.  Tokens
1    I give -PAST a talk
2    I give -PAST her a gift
3    I give -PAST up
4    I give -PAST her a lesson
5    He give -PAST an example
6    I am not sure if it be -PAST snowing
7    Shall I send you a document
8    I already have a document
9    ODR be -PAST increase -PAST
10   The increase in CD-rate

Japanese Sentences

No.  Tokens
1    私 は 講演 を する -PAST
2    私 は 彼女 に プレゼント を 贈る -PAST
3    私 は あきらめる -PAST
4    私 は 彼女 に 教訓 を 与える -PAST
5    彼 は 例 を 挙げる -PAST
6    雪 が 降っている -PAST かどうか よく分からない
7    資料 を 送る -GER ましょうか
8    もう 持っています
9    公定歩合 が 引き上げ -PSV -PAST
10   CDレート の 上昇

Segment Maps

No.  Segment Map
1    1=1 2=5 4=3
2    1=1 2=7 3=3 5=5
3    1=1 2-3=3
4    1=1 2=7 3=3 5=5
5    1=1 2=5 4=3
6    1-5=4-5 6-8=1-3
7    2=0 3=3 4=0 6=1
8    1=0 2=1 3=2 5=0
9    1=1 2-3=3
10   2=3 4=1

Table 7.2 Translation Pairs after Morphological Analysis

The segment map defines which SSE segment corresponds to which TSE segment. In the case of TX-3, 'I' in the SSE has segment position 1. Since the segment map defines 1=1, 'I' corresponds to '私' in the TSE, which has segment position 1. The term 'give up' is an idiomatic expression which should be treated as one word; the segment map defines that 'give up' corresponds to 'あきらめる' by the notation 2-3=3, which denotes that SSE segments 2 through 3 correspond to segment 3 in the TSE.

The model uses segment mapping at the surface level, rather than at the parse-tree or canonical-representation level, as seen in current MBMT models. There are three reasons why the author chose to define the mapping at this level. First, most non-linguists cannot understand what is represented in a parse tree or a canonical representation. Although professional translators are experts in translation, they are not necessarily experts in linguistics. Marking the correspondence at the surface level, however, is possible for many potential users, and was reported as being preferred by users in interviews with the author. Second, deriving the correct parse tree (for both the SSE and the TSE) requires extensive work. Automating this process is not possible at the current level of technology. Even if a large tree-bank were available for translation pairs, user customization would not be possible, as users are unlikely to understand and work on this representation. Third, using a parse tree or canonical representation as the starting point of the memory-based process leaves parsing problems unresolved. As is well understood among commercial MT system developers, ambiguities and the difficulty of tracking the system's behavior during parsing are the major bottlenecks in quality improvement. Any MBMT approach which left this process to the traditional model would not attain significant improvement over traditional MT systems.

7.3.2 Abstraction Hierarchy


There are several ways to calculate the distance between words. A method using an abstraction hierarchy, or thesaurus (left side of Figure 7.2), is readily available [Sumita and Iida, 1991, Sato and Nagao, 1990]. The problem with using a hand-crafted abstraction hierarchy is that it is difficult to fine-tune the distances between words so that they reflect the real corpus. Alternatives are statistical or neural network methods [McLean, 1992]. However, these methods cannot provide accurate distance measures for low-frequency words; for example, they cannot determine the appropriate distance for words which appear only once in the corpus. Also, methods using neural networks involve re-training problems when new words and examples are added.

[Figure 7.2 here: left, a hand-crafted abstraction hierarchy (thesaurus) over words; right, a low-frequency word under a node receiving the average of its siblings' arc distances.]

Figure 7.2 Abstraction-based Word Distance Definition

The approach used in Memoir is knowledge-based clustering (KBC). KBC uses a hand-crafted hierarchy or network as the basis of the distance calculation, and uses corpus-based similarity clustering to determine the distance to be assigned to each arc. For a low-frequency word, the distance is set to the average of the distances of the other words under the same node. For example, if W1 and W2, both under the same node, have distances to the node of 1.2 and 2.6, respectively, a low-frequency word W3 under the same node will have a distance of 1.9 to the node (right side of Figure 7.2).
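The averaging step for low-frequency words can be written down directly; this is a minimal illustration, and the function name is hypothetical.

```python
def low_frequency_distance(sibling_distances):
    """Arc distance assigned to a low-frequency word: the average of the
    corpus-derived distances of its siblings under the same node."""
    return sum(sibling_distances) / len(sibling_distances)

# W1 and W2 under the node have corpus-derived distances 1.2 and 2.6;
# a low-frequency sibling W3 receives their average:
print(round(low_frequency_distance([1.2, 2.6]), 1))  # -> 1.9
```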

7.4 GRAMMATICAL INFERENCE


The grammatical inference module takes its input from the morphological analysis module, which contains basic linguistic information: word stem, tense, part of speech, etc. For example, the sentence 'CD-rate was increased' is represented as ((cd-rate () N) (be (-PAST) V) (increase (-PAST) V)) when it is given to the grammatical inference module. When there are multiple possibilities for the part of speech or other information, all candidates are listed. The grammatical inference module uses local and relation-driven control to identify the head and the fillers of certain relations. For example, when a request is dispatched from the monitor module to check whether the object of increase is CD-rate, the grammatical inference module first spots the two words 'increase' and 'cd-rate'. If these words cannot be found, nil is returned. After confirming the existence of these words, local grammar rules are invoked to check whether or not the requested relation can be confirmed. In this case, the relation can be confirmed, because 'increase' is in -PAST form and is preceded by (be (-PAST) V), which, in turn, is preceded by (cd-rate () N). There are several ways to encode linguistic knowledge without creating a full parse of the sentence; however, the best and theoretically well-established method is a subject for further study. At present, the author uses a combination of an island-driven parser and ad hoc hand-coded rules.

         Distance   Sentence
Input    0.0        You gave me a direction
TX-4     0.9        I gave her a lesson
TX-2     1.4        I gave her a gift

Table 7.3 Examples matched for a simple input
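The local, relation-driven check described above can be sketched as follows, assuming the (stem, features, part-of-speech) token format shown earlier. The pattern covers only the passive 'N be -PAST V' case and is an illustrative stand-in, not the author's actual rule set.

```python
def object_of(verb, noun, tokens):
    """Check an object-of relation locally, without a full parse:
    in the passive pattern 'NOUN be -PAST VERB -PAST', NOUN is
    taken to be the object of VERB. Tokens are (stem, features, pos)."""
    for i in range(len(tokens) - 2):
        n, be, v = tokens[i], tokens[i + 1], tokens[i + 2]
        if (n[0] == noun and n[2] == "N"
                and be[0] == "be" and "-PAST" in be[1]
                and v[0] == verb and "-PAST" in v[1]):
            return True
    return False

# 'CD-rate was increased' after morphological analysis:
sentence = [("cd-rate", (), "N"), ("be", ("-PAST",), "V"),
            ("increase", ("-PAST",), "V")]
print(object_of("increase", "cd-rate", sentence))  # -> True
```

A production system would need further local rules (active voice, intervening modifiers, etc.); this sketch shows only why no complete parse tree is required.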

7.5 EXAMPLES RETRIEVAL


Identification of similar examples plays an important role in determining the success of MBMT. Assuming that there is no alignment dislocation between two sentences, the distance between S_i and S_j is given by:

    d(S_i, S_j) = Σ_{p=1}^{n} d(w_pi, w_pj)        (7.1)

where n is the length of the sentence and d(w_pi, w_pj) is the distance between the words w_pi and w_pj. The similarity of the environment in which a sentence fragment is embedded should also be a factor; issues concerning what makes the best environment similarity measure are presently under experiment, so they are omitted from the description in this paper.
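Equation 7.1 can be sketched directly. The word-distance values below are hypothetical stand-ins for a thesaurus-based measure, chosen so that the TX-4 case reproduces the 0.9 distance of Table 7.3.

```python
def sentence_distance(s1, s2, word_distance):
    """Equation 7.1: the distance between two sentences of equal length
    (no alignment dislocation) is the sum of positionwise word distances."""
    assert len(s1) == len(s2)
    return sum(word_distance(w1, w2) for w1, w2 in zip(s1, s2))

# Hypothetical word distances standing in for a thesaurus-based measure:
toy = {("you", "i"): 0.3, ("me", "her"): 0.3, ("direction", "lesson"): 0.3}
def word_distance(w1, w2):
    return 0.0 if w1 == w2 else toy.get((w1, w2), 1.0)

inp = ["you", "gave", "me", "a", "direction"]
tx4 = ["i", "gave", "her", "a", "lesson"]
print(round(sentence_distance(inp, tx4, word_distance), 1))  # -> 0.9
```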

Assume that the input is: 'You gave me a direction.' TX-4 and TX-2 result in the best matches (Table 7.3).

This is the simplest case, where there is no alignment dislocation and a single
example covers the input sentence.

[Figure 7.3 here: word position in the examples (Y-axis) against word position in the input sentence (X-axis, 1 to 5), with matching paths A, B, and C.]

Figure 7.3 DP-Matching of Input and Examples

Matching the examples against the input sentence is based on a version of parallel DP-matching. Figure 7.3 shows the basic idea. Each word in the input sentence is matched against all examples in the memory-base. When a word matches (an exact match or a similarity match), the next word in the input is matched against the next word in the candidate examples. In Figure 7.3, the words in positions 1 and 2 in the input matched the words in positions 1 and 2 in the examples. However, the word in position 3 in the input matched only example B. Example A has an extra word inserted between the words in positions 2 and 3 of the input, and example C is missing a word at position 3 of the input. For example, let us assume that example A is 'a b f c d e', example B is 'a b c d e', example C is 'a b d e', and the input is 'a b c d e'. Example B results in a complete match, which is shown as a straight line in Figure 7.3. Example A has an extra word 'f' between 'b' and 'c'; thus 'c' in the input, which is at position 3 (the horizontal axis), matches 'c' in the example, which is at position 4 (the vertical axis). Since this matching is not exact, it incurs some cost; however, the example is still considered a candidate. The same is true for example C. Due to space limitations, it is not feasible to describe the details of this matching mechanism; interested readers should consult the literature on DP-matching and Dynamic Time Warping for the basic ideas and mathematics. Multiple matches will take place in real sentences, as seen in Figure 7.4.
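A conventional DP-matching recurrence reproduces the behavior described for examples A, B, and C above; the gap and substitution costs here are illustrative, not Memoir's.

```python
def dp_match(inp, example, sub_cost, gap=1.0):
    """DP-matching of the input against one example: exact (or similar)
    matches are cheap, while insertions and deletions (alignment
    dislocations) each incur the `gap` cost."""
    n, m = len(inp), len(example)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * gap
    for j in range(1, m + 1):
        d[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + sub_cost(inp[i - 1], example[j - 1]),
                          d[i - 1][j] + gap,   # word missing from the example
                          d[i][j - 1] + gap)   # extra word in the example
    return d[n][m]

cost = lambda a, b: 0.0 if a == b else 2.0
inp = list("abcde")
print(dp_match(inp, list("abcde"), cost))   # -> 0.0  (example B: exact match)
print(dp_match(inp, list("abfcde"), cost))  # -> 1.0  (example A: extra 'f')
print(dp_match(inp, list("abde"), cost))    # -> 1.0  (example C: missing 'c')
```

Dislocated examples survive as candidates, just with a higher cost than the straight-line match of example B.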

[Figure 7.4 here: word position in the examples (Y-axis, 1 to 6) against word position in the input sentence (X-axis, 1 to 10), showing multiple partial matches from examples A through F.]

Figure 7.4 Multiple Match between Examples

Difference Table
TX-4 Input
I You
her me
lesson direction

Table 7.4 Difference Table

7.6 ADAPTIVE TRANSLATION


Once the system has retrieved similar cases, the next step is to synthesize a derivation from the input sentence to the translation. This process is divided into (1) covering the input sentence with the retrieved cases, (2) producing an adaptation-operation list, and (3) reconstructing the translation from the target-language part of the retrieved cases.

7.6.1 Adaptation with a Single Case


This process can be illustrated using the previous example. For the input sentence, two examples (TX-4 and TX-2) were retrieved. Since a single case can cover the input sentence, the system simply uses TX-4, the most similar case which covers it. An example of using multiple cases to cover an input sentence will be shown later.

Next, the differences between the input sentence and TX-4 are checked. Table 7.4 shows them. In this case, the input sentence and TX-4 are sufficiently similar that a lexical-level adaptation suffices to produce a translation.

Then, adaptation operations are derived from the difference table. For example, the word 'I' in TX-4 needs to be replaced with 'you' to cover the input sentence. This operation is associated with the change of 私 ('I') into あなた ('you') in the target-language part of TX-4. Table 7.5 shows the adaptation operations involved in this translation. Word choices may be based on a statistical likelihood computed from the segment maps; in effect, the method is similar to MBT-I [Sato, 1991b].

English               Japanese
I → You               私 → あなた
her → me              彼女 → 私
lesson → direction    教訓 → 行き方

Table 7.5 Adaptation Operations

Once the adaptation operations are defined, the final stage is to reconstruct the translation from the target-language part of the cases. In this example, TX-4 is the best match example (BME) and the only case involved. Thus, the TSE of TX-4 is adapted, using the derived adaptation operations. This process is shown in Table 7.6.¹

BME           私 は、彼女 に 教訓 を 与えた。
Adaptation    私 → あなた   彼女 → 私   教訓 → 行き方
Translation   あなた は、私 に 行き方 を 与えた。

Table 7.6 Adaptation for a simple sentence translation
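The lexical-level adaptation of the TSE can be sketched as a token-by-token substitution. The Japanese forms below are reconstructions of the TX-4 example and are illustrative only.

```python
def adapt(tse_tokens, operations):
    """Apply target-language adaptation operations (in the style of
    Table 7.5) to the TSE of the best match example, token by token."""
    return [operations.get(token, token) for token in tse_tokens]

# TX-4's TSE with the Japanese side of the adaptation operations applied:
tse = ["私", "は", "彼女", "に", "教訓", "を", "与えた"]
ops = {"私": "あなた", "彼女": "私", "教訓": "行き方"}
print(adapt(tse, ops))
# -> ['あなた', 'は', '私', 'に', '行き方', 'を', '与えた']
```

Each token is looked up independently, so the substitutions do not cascade: the 私 produced from 彼女 is not itself rewritten to あなた.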

7.6.2 Combination of Cases


For more complex sentences, the input sentence needs to be covered by multiple
examples. Figure 7.4 shows an example of a matching diagram for such an
event. Examples A through F provide partial matches to the input, and several combinations that cover the input are possible. The proposed model chooses the minimum-cost covering, which takes into account the size, similarity, and DP-matching cost of each fragment. Heuristics for scoring combinations of examples and an algorithm for the minimum-cost covering of tree-structured examples have been proposed [Sato and Nagao, 1990, Maruyama and Watanabe, 1992]; the proposed model employs the 1-D version of these algorithms. In the given example, the combination of examples B D F is likely to be preferred over combinations such as B C E F or A C E F, because B D F covers the input with the minimum number of examples and none of its fragments involves a dislocation in the DP-matching.

¹ For the sake of simplicity, the morphological part is ignored in this example.

       1  2  3   4    5  6   7    8   9  10
Input  I  am not sure if she gave him a  gift
TX-6   I  am not sure if it  was snowing
TX-2   I  gave her a gift

Table 7.7 Retrieved Examples
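A 1-D minimum-cost covering can be computed with a simple dynamic program over fragment end positions. The fragments and costs below are illustrative stand-ins for the size/similarity/DP-cost scores described above.

```python
def min_cost_cover(fragments, input_len):
    """1-D minimum-cost covering: choose matched fragments, given as
    (start, end, cost) with inclusive 1-based positions, so that
    positions 1..input_len are covered at the lowest total cost."""
    INF = float("inf")
    best = [INF] * (input_len + 1)   # best[k] = cheapest cover of 1..k
    best[0] = 0.0
    for end in range(1, input_len + 1):
        for (start, stop, cost) in fragments:
            if stop == end and best[start - 1] + cost < best[end]:
                best[end] = best[start - 1] + cost
    return best[input_len]

# Fragments B, D, F cover 1..10 with no dislocation cost; the dislocated
# alternatives carry a penalty, so B D F wins:
frags = [(1, 3, 0.0), (4, 6, 0.0), (7, 10, 0.0),   # B, D, F
         (1, 4, 0.5), (5, 7, 0.5), (8, 10, 0.5)]   # dislocated alternatives
print(min_cost_cover(frags, 10))  # -> 0.0
```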

Assume we have the input: 'I am not sure if she gave him a gift'. None of the cases in the memory-base covers the input sentence; even the best-matching cases cover only part of it. Table 7.7 shows the input sentence and the retrieved cases. TX-6 matches sentence segments 1 to 5, 'I am not sure if', and TX-2 matches segments 6 to 10, 'she gave him a gift'.

At this point, the segment map can be used to determine which segments of each case are usable. TX-6 has segments 1-5 ('I am not sure if') and 6-8 ('it was snowing'). Therefore, an obvious operation is to take 'I am not sure if' from TX-6 and to cover the remainder, 'she gave him a gift', with TX-2. Since TX-2 is 'I gave her a gift', it has to be adapted to match 'she gave him a gift'. The segment map is used again to identify which words should be replaced. As a result of these operations, adaptation operations are defined, and they are cross-applied to produce the Japanese translation.

Since segment 1 of TX-6 has high similarity and segment 2 has low similarity, only segment 1 of TX-6 is used, and TX-2 is used to translate 'she gave him a gift.' Table 7.8 shows how the adaptive translation is carried out in this example.

TX-6          雪 が 降っていた かどうか よく分からない
TX-2          私 は 彼女 に プレゼント を 贈った
Adaptation-1  雪が降っていた → 私は彼女にプレゼントを贈った
Intermediate  私 は 彼女 に プレゼント を 贈った かどうか よく分からない
Adaptation-2  私 → 彼女   彼女 → 彼
Translation   彼女 は 彼 に プレゼント を 贈った かどうか よく分からない

Table 7.8 Adaptive Translation Process

7.6.3 Zero Pronoun Handling


A substantial percentage of Japanese sentences involve a zero pronoun (a kind of ellipsis), which needs to be resolved and articulated in the English translation. The following dialogue illustrates the problem of the zero pronoun (the words in [...] are not articulated):

Speaker-1: 資料を送りましょうか (Shall [I] send [you] a document?)

Speaker-2: もう、持っています ([I] already have [a document].)

The author's solution to this problem is to define the zero pronouns explicitly in the segment map. TX-7 and TX-8 in Table 7.2 show translation pairs for sentences with zero pronouns. Words which are not articulated in Japanese are marked as 0 in the segment map. In the Japanese-to-English translation process, a segment whose Japanese correspondence is 0 needs to be recovered using a list of centers. Information on what is in the center is provided by the grammatical inference module.
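The recovery step can be sketched as follows, assuming segment maps in the style of Table 7.2. The pronoun list and the single most-salient center are simplified stand-ins for the centering-based mechanism.

```python
def fill_zero_segments(tse_tokens, segment_map, centers):
    """Japanese-to-English direction: TSE positions whose Japanese
    correspondence is 0 have no source counterpart. Pronouns among them
    are re-instantiated from the centering list of the dialogue."""
    out = []
    for pos, word in enumerate(tse_tokens, start=1):
        if (segment_map.get(pos) == 0
                and word.lower() in ("i", "you", "he", "she", "it")):
            out.append(centers[0])      # most salient center
        else:
            out.append(word)            # keep articulated or non-pronoun words
    return out

# TX-8: 'I already have a document', map {1: 0, 2: 1, 3: 2, 5: 0};
# with the speaker as the center, 'I' is recovered unchanged:
print(fill_zero_segments(["I", "already", "have", "a", "document"],
                         {1: 0, 2: 1, 3: 2, 5: 0}, ["I"]))
```

With a different center list (say, `["He"]`), the same zero-marked position would surface as the new subject instead.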

7.7 MONITOR
The monitor process checks whether or not the translation result is stylistically and grammatically appropriate. This process has not been incorporated in previous models of MBMT. The main rationale for the monitor process is the existence of domain rules for translation. These rules are widely used by
Memoir: An Alternative View 169

professional translators and are explicitly documented. For example, in the
economic domain, the English word 'prediction' must be translated as 予想 when
no specific number is given, but the same word must be translated as 予測 when
specific numbers are given. Also, the word 'increase' must be translated as
引き上げ (pull up) in the context of 'increase in the official discount
rates', but it should be 上昇 (elevation) for 'increase in the short-term
interest rates'. Thus, 短期金利の引き上げ (pull up in short-term interest
rates) is considered a mistranslation. An extreme example is the use of the
word 投資銀行 (investment bank) for Salomon Brothers and Goldman Sachs, which
is officially unacceptable to the Japanese government, as these firms are
approved as securities companies, not as investment banks. In this case, the
English word 'investment bank' must be phonetically transcribed as
インベストメント・バンク.

Segregating these rules from examples is extremely inefficient, even if it
were possible. Given that professional translators, who are the users of the
system, have these rules explicitly written down, the most efficient and
user-acceptable way is to use these rules directly in the translation process.
The main task of the monitor process is the monitoring of translation using
these domain-specific rules. For example, one rule specifies that if the word
'increase' appears in the input and the object of 'increase' is 'CD-rate',
then 'increase' should be translated as 上昇; its internal representation
(not exposed to users) is shown below:

((COND (AND (EQUAL HEAD "increase")
            (EQUAL OBJECT "CD-rate")))
 (ACTION (TRANSLATE "increase" "上昇")))
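Read procedurally, a rule of this form amounts to a condition test over a flat parse followed by a translation override. The sketch below is a hypothetical reading of the rule format (with 上昇 romanized as "joushou"), not the system's actual rule engine:

```python
# Minimal interpreter for monitor rules of the shape
# ((COND ...) (ACTION (TRANSLATE word translation))).
# An illustrative sketch under assumed data structures.
def rule_applies(rule_cond, parse):
    """rule_cond: list of (slot, value) tests against a flat parse dict."""
    return all(parse.get(slot) == value for slot, value in rule_cond)

rule = {
    "cond": [("HEAD", "increase"), ("OBJECT", "CD-rate")],
    "action": ("increase", "joushou"),   # translate 'increase' as 上昇
}

parse = {"HEAD": "increase", "OBJECT": "CD-rate"}
if rule_applies(rule["cond"], parse):
    word, translation = rule["action"]
    print(f"translate {word!r} as {translation!r}")
```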

Using the examples in Table 7.2, conventional MBMT systems would translate
'CD-rate was increased' into 'CDレートは引き上げられた.' Although TX-10
implicitly encodes the knowledge that 上昇 is associated with CDレート, it did
not contribute to this translation because it did not match the input
sentence, due to the substantial difference in their syntactic structures.
Since the CD rate is a kind of short-term interest rate, this translation is
incorrect. In the model, the output from the adaptive translation module is
checked against the expert rules. The monitor module searches the input
sentence for words that appear in rules. If any rule involves words in the
sentence, a request is dispatched to the grammatical inference module to
verify whether or not the condition to invoke the rule is met. In this case,
the condition is met (as explained in Section 4), so the phrase 引き上げられた
is replaced with 上昇した. Thus, the final translation is CDレートは上昇し
た, which is correct. Currently, there are 82 such rules for the financial
domain.

7.8 PRELIMINARY EVALUATION


A brief report on the initial evaluation phase helps illustrate the strengths
of the model. The author has implemented an experimental prototype of
English-Japanese translation for the financial domain. The example matching
and retrieval module has been implemented on the Semantic Network Array
Processor (SNAP) [Moldovan et. al., 1990] and the CM-2 Connection Machine.
The unoptimized version performs on the order of a few hundred milliseconds.
This figure is almost constant regardless of the memory-base size, due to the
data-parallel nature of the algorithm. Since a major part of the computing
time is spent on interprocessor communication, optimization would certainly
provide better performance and scalability. Independently, joint research
with ATR and the Electrotechnical Laboratory obtained promising performance
for phrase translation tasks [Sumita et. al., 1993].

Independently, the effects of the monitor process were examined using 100
sentences from the Wall Street Journal (WSJ) and the Financial Times (FT).
Ninety-five words considered problematic in translation were selected as a
benchmark. While semantic accuracy was 100%, there were stylistic problems,
as discussed in Section 7. Pure MBMT attained stylistically correct
translations for 48 of the 95 words. When combined with the expert rules
(using the monitor module), 84 of the 95 words were translated with correct
style. The rules in the monitor module were invoked in 76 cases: some 40
cases were consistent with the pure MBMT translation, while 36 cases resulted
in a modification of the translation. Nine words were translated with
incorrect style due to a lack of examples and rules. This figure, however,
should not be viewed as indicating MBMT accuracy, because the benchmark
focuses on problematic cases. It should be viewed as evidence supporting the
effectiveness of the monitor process, and should not be used to draw
quantitative conclusions. With a larger memory base and more detailed rules,
the accuracy is expected to improve drastically.

7.9 CONCLUSION
In this chapter, the author proposed the Memoir system as an alternative view
of memory-based machine translation. In effect, the model offers a basis for
achieving user-customizable example definition, handling contextual phenomena
such as zero pronouns, broad-coverage and robust translation, and the use of
expert knowledge.

Use of expert knowledge in rule form is the key factor that mitigates the
problem of the sparse corpus inherent in the early phase of deployment.
Although a more formal model needs to be developed of how local grammar and
expert knowledge should be encoded, the preliminary evaluation suggests that
the monitor process actually improves the quality of translation.

The massively parallel implementation of example retrieval and matching
attained promising performance. This implies that MBMT can attain both
high-speed and high-quality translation.

The author hopes the model presented here will serve as a basis for further
discussion of how MBMT should be developed as a self-contained system, which
is expected to be a mainstay approach in next-generation machine translation.
8
CONCLUSION

8.1 SUMMARY OF CONTRIBUTIONS


Although interpreting telephony, or real-time speech-to-speech translation,
has been considered one of the prime research goals in speech and natural
language processing, the work described in this book is perhaps the first to
propose a comprehensive model of speech-to-speech dialogue translation. The
ΦDMDIALOG speech-to-speech dialogue translation system was implemented and
has been up and running since March 1989. The system is one of the first
speech-to-speech translation systems ever developed. Three descendants
(DMSNAP, ASTRAL, and MEMOIR) have been developed and implemented on massively
parallel machines.

This series of research efforts has made several major contributions to the field:

First, it demonstrates that the memory-based approach is feasible and
effective for speech-to-speech dialogue translation. The central thread of
memory-based translation is translation by memory recall and adaptation.
ΦDMDIALOG is the first system to accomplish speech-to-speech translation
using this idea. While the original model was a mixture of memory-based and
traditional rule-based processing, the descendants have a clear orientation
toward the memory-based approach, because most of the problems encountered in
the development of ΦDMDIALOG arose from the rule-based part rather than the
memory-based part.

Second, we have demonstrated that high-performance natural language
processing is attainable by using appropriate algorithms and specialized
hardware. Experimental implementations on the IXM2 associative memory
processor and the SNAP semantic network array processor clearly demonstrate
this point. In both cases, parsing completes on the order of milliseconds.
Thus, the problem of the memory-based approach requiring extensive data has
been solved from the aspect of computational cost. Due to the data-parallel
nature of the memory-based approach, high-performance translation has been
attained on massively parallel computers. These results are based on a
massively parallel computational model which provides considerably higher
performance on parallel computers.

Marker-passing has been used as the basic computational mechanism. Of course,
we did not use naive marker-passing, which only carries a bit vector.
Instead, we significantly extended the notion of marker-passing to allow
complex features and probabilistic measures to be passed around. The movement
of markers is also restricted by specific propagation rules. For the hardware
implementation, we decomposed some functions of markers into bit-vectors,
address passing, value passing, and host interrupts. Some of the findings
regarding necessary features for marker-passing influenced the basic hardware
design of SNAP-1.
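The idea of extended markers can be pictured as structured objects rather than single bits. The following is a loose, hypothetical sketch: the field names, link types, and weights are invented for illustration and do not reflect the IXM2 or SNAP implementations.

```python
# Loose sketch of extended marker-passing: markers carry feature structures
# and a probabilistic score, and propagation is gated by link-type rules.
from dataclasses import dataclass

@dataclass
class Marker:
    features: dict          # complex features, not just a bit
    prob: float             # probabilistic measure carried along

# Propagation rule: which link types a marker may traverse, and the
# score factor applied on each hop (all values illustrative).
PROPAGATE = {"is-a": 1.0, "instance-of": 1.0, "part-of": 0.0}

def propagate(marker, links):
    """links: list of (link_type, target_node). Returns (node, marker)
    pairs for every link the propagation rules allow."""
    results = []
    for link_type, target in links:
        weight = PROPAGATE.get(link_type, 0.0)
        if weight > 0.0:
            results.append((target, Marker(marker.features, marker.prob * weight)))
    return results

m = Marker({"cat": "noun"}, 0.9)
out = propagate(m, [("is-a", "animal"), ("part-of", "leg")])
print([node for node, _ in out])   # only the 'is-a' link is traversed
```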

Third, several methods to integrate the memory-based process and the
rule-based process have been proposed. The role of rules has been a major
issue in this research. Two alternative views were proposed. The first
approach complements the memory-based process and the rule-based process at
different levels of abstraction. The second approach relies mostly on
memory-based translation, but uses rules to monitor translated sentences
using explicit expert knowledge. The first approach has been implemented in
ΦDMDIALOG and DMSNAP, and the second approach has been implemented in
MEMOIR.

In the first approach, the model allows specific cases, generalized cases,
and a unification-based grammar to co-exist. This contrasts with most machine
translation systems, which allow only a single level of abstraction in
processing. In addition, our model allows translation to be carried out at
the most specific level of abstraction among the several levels available.
This ensures that translation can be carried out by the least-cost process at
any time.
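The most-specific-level-first strategy can be sketched as a simple fallback chain. The matchers below are stand-ins for the actual memory-based and grammar processes, and the example data is invented for illustration:

```python
# Sketch of translation at the most specific available level of abstraction:
# try exact memorized cases first, then generalized cases (patterns with
# variables), and only then the full grammar -- the most expensive process.
def translate(sentence, specific_cases, generalized_cases, grammar):
    if sentence in specific_cases:               # most specific, cheapest
        return specific_cases[sentence]
    for pattern, template in generalized_cases:  # case with variables
        if pattern(sentence):
            return template(sentence)
    return grammar(sentence)                     # rule-based fallback

specific = {"good morning": "ohayou gozaimasu"}
generalized = [(lambda s: s.startswith("thank"), lambda s: "arigatou")]
grammar = lambda s: "<compositional translation of %r>" % s

print(translate("good morning", specific, generalized, grammar))
print(translate("thank you", specific, generalized, grammar))
```

The ordering is what guarantees the least-cost property: a sentence is only handed to the grammar when no stored case, specific or generalized, covers it.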

On the other hand, the second model views rules as monitors that check the
explicit stylistics and word choices of the translation. Neither a parse tree
nor a meaning representation is created in this approach.

Fourth, it provides an integrated model of speech and natural language
processing. Processing at the phonological level directly interfaces with the
natural language processing stage, and the predictions made by the language
processing part are fed back to the speech recognition module.

Fifth, it integrates the parsing and generation processes, which enables the
system to translate input utterances even before the whole sentence has been
spoken. The generation process is lexically guided and uses both surface and
semantic information carried by G-Markers and A-Markers. Predictions and
tracking of verbalization are made by V-Markers.

With regard to dialogue processing, plan-based as well as script-based
dialogue understanding has been incorporated, for the first time in a machine
translation system. The plan-based model extends the traditional model by
assuming a domain plan hierarchy for each speaker in a dialogue. This
extension significantly expands the capability of the model for
mixed-initiative dialogues.

Finally, a cost-based disambiguation scheme and a connectionist network have
been useful faculties for dynamically selecting one hypothesis among multiple
hypotheses.
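The selection step of cost-based disambiguation is simple to state: each hypothesis accumulates a cost, and the least-cost hypothesis is chosen. The sketch below uses invented readings and costs purely for illustration:

```python
# Sketch of cost-based disambiguation: select the hypothesis with the
# minimal accumulated cost. Costs here are illustrative, not the system's.
def select(hypotheses):
    """hypotheses: list of (reading, cost); return the cheapest reading."""
    return min(hypotheses, key=lambda h: h[1])[0]

hyps = [("bank = financial institution", 0.3), ("bank = river bank", 1.7)]
print(select(hyps))
# -> bank = financial institution
```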

8.2 FUTURE WORKS


There are several possibilities by which the work reported in this book may be
extended. Although the basic ideas developed here could be applied to a broad
range of applications, the following three possibilities are the most
probable.

Deployable speech-to-speech translation system: This is an obvious extension
of the work, but it requires major effort. There are two possible paths to
attain this goal. First, we could make an effort to scale up the system to
allow users to speak to the system on a broad range of topics. This goal is
significant, but it would take years to accomplish. Nevertheless, this goal
will continue to be a grand challenge of AI. An alternative goal is to build
a portable but restricted system capable of translating simple dialogues for
travel or emergencies. This would be feasible within this century.
Closed Caption Translation: The techniques developed for speech-to-speech
translation can be used to translate the closed captions of TV news programs.
Already, major networks provide full transcripts of news programs in real
time. A memory-based translation system may be able to translate closed
captions in real time so that news programs can be viewed in various
languages. The author has recently started the CAPTRAN project to develop
such a system.
Translator's Assistant: The approach taken in the MEMOIR system can be
applied to assist human translators by providing retrieval of past
translation examples and by providing a first-cut translation. Such a system
should be implemented on personal computers and workstations, so fast and
efficient memory search on a serial machine would be one of the major
critical factors.

Any of these projects would require non-trivial effort. However, their
economic, social, and scientific impact would be enormous.

8.3 FINAL REMARKS


This book has presented a model of speech-to-speech dialogue translation. We
also reported a preliminary performance evaluation of the model on massively
parallel machines such as the IXM2 and SNAP.

The development of a speech-to-speech dialogue translation system is
definitely one of the ultimate goals of modern science and engineering. It
is, however, a formidable challenge. We have just made a very first step
toward this goal.

Needless to say, the work described in this dissertation is by no means
complete. Its completion will be years away. However, we believe that we have
made a definite step with this work. We have implemented a comprehensive and
consistent model of spoken language translation on both serial and parallel
machines. The most significant contribution of this dissertation is the
demonstration that real-time spoken language translation on the order of
milliseconds is attainable when appropriate algorithms and specialized
hardware are employed.

Although this book closes with these final remarks, research is now underway
toward the next generation of systems. It is our wish to report significant
achievements before the end of this century.
BIBLIOGRAPHY

[Ait-Kaci, 1984] Ait-Kaci, H., A Lattice Theoretic Approach to Computation
Based on a Calculus of Partially Ordered Type Structures, Ph.D. Thesis,
University of Pennsylvania, 1984.
[ALPAC, 1966] Automatic Language Processing Advisory Committee, Lan-
guage and Machines: Computers in Translation and Linguistics, Washington
D.C., National Academy of Science, Publication Number 1416, 1966.
[Baker, 1975] Baker, J. K. , "The DRAGON system - An Overview," IEEE
Transactions on Acoustics, Speech, and Signal Processing, ASSP-23(1):24-
29, 1975.
[Bar-Hillel, 1959] Bar-Hillel, Y., Reports on the State of Machine Translation
in the United States and Great Britain, Technical Report, Hebrew University,
Jerusalem, 1959.
[Batcher, 1980] Batcher, K.E., "Design of a Massively Parallel Processor,"
IEEE Trans. C-29, No.9, 836-840, 1980.
[Becker, 1975] Becker, J. D., The Phrasal Lexicon, Bolt, Beranek and Newman
Technical Report 3081, 1975.
[Bisiani et. al., 1989] Bisiani, R., Anantharaman, T., and Butcher, L., BEAM:
An Accelerator for Speech Recognition, CMU-CS-89-102, 1989.

[Blelloch, 1986] Blelloch, G. E., "CIS: A Massively Parallel Concurrent
Rule-Based System," Proceedings of AAAI-86, 1986.

[Bock, 1987] Bock, J. K., "Exploring Levels of Processing in Sentence
Production," In Kempen, G. (Ed.), Natural Language Generation, Nijhoff, 1987.

[Bock, 1982] Bock, J. K., "Toward a Cognitive Psychology of Syntax:
Information Processing Contributions to Sentence Formulation," Psychological
Review, 89, pp. 1-47, 1982.
[Bouknight et. al., 1972] Bouknight, W.J., et. al., "The ILLIAC IV System,"
Proceedings of the IEEE, 369-388, April, 1972.
178 SPEECH-TO-SPEECH TRANSLATION

[Bowler and Pawley, 1984] Bowler, K. C. and Pawley, G. S., "Molecular Dynamics
and Monte Carlo Simulation in Solid-State and Elementary Particle Physics,"
Proceedings of the IEEE, 74, January, 1984.
[Brachman and Schmolze, 1985] R. J. Brachman and J. G. Schmolze, "An
Overview of The KL-ONE Knowledge Representation System," Cognitive
Science 9, 171-216, August 1985.

[Brennan et. al., 1986] Brennan, S., Friedman, M., and Pollard, C., "A
Centering Approach to Pronouns," Proceedings of ACL-86, 1986.

[Bresnan, 1982] Bresnan, J., "Control and Complementation," The Mental
Representation of Grammatical Relations, The MIT Press, 1982.

[Brown, 1990] Brown, R. and Nirenburg, S., "Human-Computer Interaction for
Semantic Disambiguation," Proc. of COLING-90, Helsinki, 1990.

[Carberry, 1985] Carberry, M., Pragmatic Modeling in Information System
Interfaces, Ph.D. Thesis, Department of Computer and Information Science,
University of Delaware, Newark, DE, 1985.

[Carbonell, 1986] Carbonell, J. G., "Derivational Analogy: A Theory of
Reconstructive Problem Solving and Expertise Acquisition," in Machine
Learning: An Artificial Intelligence Approach, Volume II, Michalski, R. S.,
Carbonell, J. G., and Mitchell, T. M., eds., Morgan Kaufmann, 1986.

[Carbonell et. al., 1981] Carbonell, J. G., Cullingford, R., and Gershman, A.,
"Steps Towards Knowledge-Based Machine Translation," IEEE Transactions on
Pattern Analysis and Machine Intelligence, 3:376-392, 1981.

[Charniak, 1983] Charniak, E., "Passing Markers: A Theory of Contextual
Influence in Language Comprehension," Cognitive Science, 7(3), 171-190, 1983.
[Charniak, 1988] Charniak, E., "A Neat Theory of Marker-Passing," Proceedings
of AAAI-88, 1988.

[Chow and Roukos, 1989] Chow, Y.L. and Roukos, S., "Speech Understanding
using a Unification Grammar," In Proc. of ICASSP - IEEE International
Conference on Acoustics, Speech, and Signal Processing, 1989.

[Chow et. al., 1987] Chow, Y.L., Dunham, M.O., Kimball, O.A., Krasner, M.A.,
Kubala, G.F., Makhoul, J., Roucos, S., and Schwartz, R.M., "BYBLOS: The BBN
Continuous Speech Recognition System," Proc. of IEEE International Conference
on Acoustics, Speech, and Signal Processing (ICASSP-87), pp. 89-92, 1987.

[Church, 1987] Church, K., Phonological Parsing in Speech Recognition,


Kluwer Academic Publishers, 1987.
[Cohen and Fertig, 1986] Cohen, P. and Fertig, S., "Discourse Structure and
the Modality of Communication," In International Symposium on Prospects and
Problems of Interpreting Telephony, 1986.

[Cole et. al., 1980] Cole, R.A., Rudnicky, A.I., Zue, V.W., and Reddy, D.R.,
"Speech as Patterns on Paper," Cole, R.A. (Ed.), Perception and Production of
Fluent Speech, Lawrence Erlbaum Associates, Hillsdale, N.J., 1980.

[Cole et. al., 1983] Cole, R.A., Stern, R.M., Phillips, M.S., Brill, S.M.,
Specker, P., and Pilant, A.P., "Feature-Based Speaker-Independent Recognition
of English Letters," Proc. of IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP-83), 1983.

[Cole, 1986a] Cole, R.A., "Phonetic Classification in New Generation Speech
Recognition Systems," Speech Tech-86, 1986.

[Cole et. al., 1986b] Cole, R.A., Phillips, M.S., and Chigier, B., "The CMU
Phonetic Classification System," Proc. of IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP-86), 1986.
[Cottrell, 1988] Cottrell, G., "A Model of Lexical Access of Ambiguous
Words," In Small, S. et. al. (Eds.), Lexical Ambiguity Resolution, Morgan
Kaufmann Publishers, 1988.

[Crain and Steedman, 1985] Crain, S. and Steedman, M., "On Not Being Led Up
The Garden Path: The Use of Context by the Psychological Syntax Processor,"
In Natural Language Parsing, 1985.
[Dally,1990] Dally, W. J., "The J-Machine System," In Winston, P. (Ed.),
Artificial Intelligence at MIT, 1990.

[DeMara and Moldovan, 1990] DeMara, R. and Moldovan, D., "The SNAP-1 Parallel
AI Prototype," Technical Report PKPL 90-15, University of Southern
California, Department of EE-Systems, 1990.

[De Smedt, 1989] De Smedt, K., "Distributed Unification in Parallel
Incremental Syntactic Tree Formation," In Proceedings of the Second European
Workshop on Natural Language Generation, 1989.

[Dwork et. al., 1984] Dwork, C., Kanellakis, P. and Mitchell, J., "On the
Sequential Nature of Unification," Journal of Logic Programming, vol. 1, 1984.

[Fahlman, 1979] Fahlman, S., NETL: A System for Representing and Using
Real-World Knowledge, The MIT Press, 1979.

[Fikes and Nilsson, 1971] Fikes, R. and Nilsson, N., "STRIPS: A New Approach
to the Application of Theorem Proving to Problem Solving," Artificial
Intelligence, 2, 189-208, 1971.

[Ford, Bresnan and Kaplan, 1981] Ford, M., Bresnan, J. and Kaplan, R., "A
Competence-Based Theory of Syntactic Closure," In Bresnan, J. (Ed.), The
Mental Representation of Grammatical Relations, MIT Press, 1981.

[Ford and Holmes, 1978] Ford, M. and Holmes, V., "Planning Units and Syntax
in Sentence Production," Cognition, 6, pp. 35-53, 1978.

[Furuse and Iida, 1992] Furuse, O. and Iida, H., "Cooperation between
Transfer and Analysis in Example-Based Framework," Proc. of COLING-92, 1992.

[Furuse et. al., 1990] Furuse, O., Sumita, E., and Iida, H., "A Method for
Realizing Transfer-Driven Machine Translation," Workshop on Natural Language
Processing, IPSJ, 1990 (in Japanese).

[Garrett, 1980] Garrett, M.F., "Levels of Processing in Sentence Production,"
In Butterworth, B. (Ed.), Language Production (Vol. 1: Speech and Talk),
Academic Press, 1980.

[Garrett, 1975] Garrett, M.F., "The Analysis of Sentence Production," In
Bower, G. (Ed.), The Psychology of Learning and Motivation, Vol. 9, Academic
Press, 1975.

[Gibson, 1990] Gibson, T., "Memory Capacity and Sentence Processing," Pro-
ceedings of ACL-90, 1990.

[Goodman and Nirenburg, 1991] Goodman, K. and Nirenburg, S. (Eds.), The KBMT
Project: A Case Study in Knowledge-Based Machine Translation, Morgan
Kaufmann, 1991.

[Granger, 1977] Granger, R. H., "FOUL-UP: A Program that Figures Out
Meanings of Words from Context," Proceedings of IJCAI-77, 1977.
[Grosz and Sidner, 1985] Grosz, B. and Sidner, C., "The Structure of Discourse
Structure," CSLI Report No. CSLI-85-39, 1985.
[Grosz and Sidner, 1990] Grosz, B. and Sidner, C., "Plans for Discourse," In
Cohen, Morgan and Pollack, eds. Intentions in Communication, MIT Press,
Cambridge, MA., 1990.

[Hendler, 1988] Hendler, J., Integrating Marker-Passing and Problem-Solving,
Lawrence Erlbaum Associates, 1988.

[Higuchi et. al., 1989] Higuchi, T., Furuya, T., Kusumoto, H., Handa, K., and
Kokubu, A., "The Prototype of a Semantic Network Machine IXM," Proceedings of
the International Conference on Parallel Processing, 1989.

[Higuchi et. al., 1991] Higuchi, T., Kitano, H., Handa, K., Furuya, T.,
Takahashi, H., and Kokubu, A., "IXM2: A Parallel Associative Processor for
Knowledge Processing," Proceedings of AAAI-91, 1991.

[Hillis, 1985] Hillis, D. W., The Connection Machine, The MIT Press,
Cambridge, MA, 1985.

[Hon, 1992] Hon, H., Large Scale Vocabulary Independent Speech Recognition:
The VOCIND System, Carnegie Mellon University, 1992.

[Hovy, 1988) Hovy, E. H., Generating Natural Language Under Pragmatic Con-
straints, Lawrence Erlbaum Associates, 1988.

[Hsu et al., 1990] Hsu, F., Anantharaman, T., Campbell, M. and Nowatzyk, A.,
"A Grand Master Chess Machine," Scientific American, October, 1990.

[IBM, 1985] IBM Speech Recognition Group, "A Real-Time, Isolated-Word,
Speech Recognition System for Dictation Transcription," Proc. of IEEE
International Conference on Acoustics, Speech, and Signal Processing, 1985.

[Iida, 1988] Iida, H., "Pragmatic Characteristics of Natural Spoken Dialogues
and Dialogue Processing Issues," Journal of Society for Artificial
Intelligence, Vol. 3, No. 4, 1988 (in Japanese).

[Ingria, 1990] Ingria, R., "The Limit of Unification," Proceedings of ACL-90,
1990.

[Inmos, 1987] Inmos, IMS T800 Transputer, 1987.

[Isabelle, 1987] Isabelle, P., "Machine Translation at the TAUM Group," King,
M. (Ed.), Machine Translation Today, Edinburgh: Edinburgh University Press,
247-277, 1987.

[Itakura, 1975] Itakura, F., "Minimum Prediction Residual Principle Applied
to Speech Recognition," IEEE Transactions on Acoustics, Speech, and Signal
Processing, ASSP-23(1):67-72, 1975.

[Jacobs, 1992] Jacobs, P., "Parsing Run Amok: Relation-Driven Control for
Text Analysis," Proc. of AAAI-92, 1992.

[Jain, 1990] Jain, A. N., "Parsing Complex Sentences with Structured
Connectionist Networks," Neural Computation, 3:110-120, 1990.

[Kaji, 1989] Kaji, H., "A Japanese-English Machine Translation System Based
on Semantics," Nagao, M. (Ed.-in-Chief), Machine Translation Summit,
Tokyo, 1989.
[Kameyama, 1988] Kameyama, M., "A Property-Sharing Constraint in Centering,"
ACL-88, 1988.
[Kaplan and Bresnan, 1982] Kaplan, R. and Bresnan, J., "Lexical-Functional
Grammar: A Formal System for Grammatical Representation," In Bresnan, J.
(Ed.), The Mental Representation of Grammatical Relations, MIT Press, 1982.
[Kaplan and Zaenen, 1989] Kaplan, R. and Zaenen, A., "Long-distance Depen-
dencies, Constituent Structure, and Functional Uncertainty," 1989.
[Kasper, 1989] Kasper, R., "Unification and Classification: An Experiment in
Information-Based Parsing," Proceedings of the International Workshop on
Parsing Technologies, Pittsburgh, 1989.

[Kempen, 1987] Kempen, G., "A Framework for Incremental Syntactic Tree
Formation," In Proceedings of the International Joint Conference on
Artificial Intelligence (IJCAI-87), 1987.

[Kempen and Hoenkamp, 1987] Kempen, G. and Hoenkamp, E., "An Incremental
Procedural Grammar for Sentence Formulation," Cognitive Science, 11, 201-258,
1987.

[Kempen and Huijbers, 1983] Kempen, G. and Huijbers, P., "The Lexicalization
Process in Sentence Production and Naming: Indirect Election of Words,"
Cognition, 14, 185-209, 1983.

[Kita et. al., 1989] Kita, K., Kawabata, T. and Saito, H., "HMM Continuous
Speech Recognition using Predictive LR Parsing," In Proc. of ICASSP - IEEE
International Conference on Acoustics, Speech, and Signal Processing, 1989.

[Kitano and Hendler, 1993] Kitano, H. and Hendler, J. (Eds.), Massively
Parallel Artificial Intelligence, The MIT Press, 1993.

[Kitano, 1993] Kitano, H., "Challenges of Massive Parallelism," Proc. of
IJCAI-93, Chambery, 1993.

[Kitano et. al., 1991] Kitano, H., Hendler, J., Higuchi, T., Moldovan, D.,
and Waltz, D., "Massively Parallel Artificial Intelligence," Proc. of
IJCAI-91, Sydney, 1991.

[Kitano, 1991a] Kitano, H., "ΦDMDIALOG: An Experimental Speech-to-Speech
Dialogue Translation System," IEEE Computer, June, 1991.

[Kitano, 1991b] Kitano, H., "Unification Algorithm for Massively Parallel
Computers," Proceedings of the International Workshop on Parsing
Technologies, Cancun, 1991.

[Kitano, 1989a] Kitano, H., "A Massively Parallel Model of Natural Language
Generation for Interpreting Telephony: Almost Concurrent Processing of
Parsing and Generation," In Proceedings of the Second European Conference on
Natural Language Generation, 1989.

[Kitano, 1989b] Kitano, H., "A Model of Simultaneous Interpretation: A
Massively Parallel Model of Speech-to-Speech Dialog Translation," In
Proceedings of the Annual Conference of the International Association of
Knowledge Engineers, 1989.

[Kitano, 1989c] Kitano, H., "Hybrid Parallelism: A Case of Speech-to-Speech
Dialog Translation," In Proceedings of the IJCAI-89 Workshop on Parallel
Algorithms for Machine Intelligence, 1989.

[Kitano, 1989d] Kitano, H., A Massively Parallel Model of Simultaneous
Interpretation: The ΦDMDIALOG System, CMU-CMT-89-116, 1989.

[Kitano et. al., 1989a] Kitano, H., Tomabechi, H. and Levin, L., "Ambiguity
Resolution in DMTRANS PLUS," In Proceedings of the Fourth Conference
of the European Chapter of the Association for Computational Linguistics,
1989.
[Kitano et. al., 1989b] Kitano, H., Tomabechi, H., Mitamura, T. and Iida, H.,
"A Massively Parallel Model of Speech-to-Speech Dialog Translation: A Step
Toward Interpreting Telephony," In Proceedings of the European Conference on
Speech Communication and Technology (EuroSpeech-89), 1989.

[Kitano et. al., 1989c] Kitano, H., Mitamura, T. and Tomita, M., "Massively
Parallel Parsing in ΦDMDIALOG: An Integrated Architecture for Parsing Speech
Inputs," In Proceedings of the International Workshop on Parsing
Technologies, 1989.

[Kitano, 1988] Kitano, H., "Multilingual Information Retrieval Mechanism
using VLSI," Proceedings of RIAO-88, Boston, 1988.
[Kim and Moldovan, 1990] Kim, J. and Moldovan, D., "Parallel Classification
for Knowledge Representation on SNAP," Proceedings of the 1990 International
Conference on Parallel Processing, 1990.

[Knight, 1989] Knight, K., "Unification: A Multi-Disciplinary Survey," ACM
Computing Surveys, Vol. 21, Number 1, 1989.

[Kogure et. al., 1990] Kogure, K., Iida, H., Hasegawa, T., and Ogura, K.,
"NADINE: An Experimental Dialogue Translation System from Japanese to
English," Proceedings of InfoJapan-90, Tokyo, Japan, 1990.

[Lee, 1988] Lee, K.F., Large-Vocabulary Speaker-Independent Continuous Speech
Recognition: The SPHINX System, Ph.D. Thesis, Carnegie Mellon University,
1988.
[Lee and Moldovan, 1990] W. Lee and D. Moldovan, "The Design of a Marker
Passing Architecture for Knowledge Processing", Proceedings of AAAI-90,
1990.
[Lesser et. al., 1975] Lesser, V.R., Fennell, R.D., Erman, L.D., and Reddy,
R.D., "The Hearsay II Speech Understanding System," IEEE Transactions on
Acoustics, Speech, and Signal Processing, ASSP-23(1):11-24, 1975.

[Levelt and Maassen, 1981] Levelt, W.J.M. and Maassen, B., "Lexical Search
and Order of Mention in Sentence Production," In Klein, W. and Levelt, W.J.M.
(Eds.), Crossing the Boundaries in Linguistics: Studies Presented to Manfred
Bierwisch, Dordrecht, Reidel, 1981.

[Levinson et. al., 1979] Levinson, S.E., Rabiner, L.R., Rosenberg, A.E., and
Wilpon, J.G., "Interactive Clustering Techniques for Selecting Speaker-
Independent Reference Templates for Isolated Word Recognition," IEEE
Transactions on Acoustics, Speech, and Signal Processing, ASSP-27(2):134-141,
April, 1979.
[Lin and Moldovan, 1990] Lin, C. and Moldovan, D., "SNAP: Simulator Results,"
Technical Report PKPL 90-5, University of Southern California, Department of
EE-Systems, 1990.

[Litman and Allen, 1987] Litman, D. and Allen, J., "A Plan Recognition Model
for Subdialogues in Conversation," Cognitive Science, 11, 163-200, 1987.

[Lowerre, 1976] Lowerre, B., The HARPY Speech Recognition System, Ph.D.
Thesis, Carnegie Mellon University, 1976.
[Maruyama and Watanabe, 1992] Maruyama, H. and Watanabe, H. , "Tree
Cover Search Algorithm for Example-Based Translation," Proceedings of
TMI-92, 1992.

[MasPar Corporation, 1990] MasPar MP-1 Computer, MasPar Corporation, 1990.

[McLean, 1992] McLean, I., "Example-Based Machine Translation using
Connectionist Matching," Proceedings of TMI-92, 1992.

[Minton, 1988] Minton, S., Learning Effective Search Control Knowledge: An
Explanation-Based Approach, CMU-CS-88-133, Carnegie Mellon University, 1988.

[Moldovan et. al., 1990] Moldovan, D., Lee, W., and Lin, C., SNAP: A
Marker-Passing Architecture for Knowledge Processing, Technical Report PKPL
90-1, Department of Electrical Engineering Systems, University of Southern
California, 1990.

[Moldovan, 1983] Moldovan, D., "An Associative Array Architecture for
Semantic Network Processing," Technical Report PPP-83-8, University of
Southern California, Department of EE-Systems, 1983.

[Morii et. al., 1985] Morii, S., Niyada, K., Fujii, S. and Hoshimi, M., "Large
Vocabulary Speaker-Independent Japanese Speech Recognition System," In
Proceedings of ICASSP - IEEE International Conference on Acoustics, Speech,
and Signal Processing, 1985.

[Morimoto et. al., 1990] Morimoto, T., Iida, H., Kurematsu, A., Shikano, K.,
and Aizawa, T., "Spoken Language Translation: Toward Realizing an Au-
tomatic Telephone Interpretation System," Proceedings of InfoJapan-90,
Tokyo, 1990.
[Muraki, 1989] Muraki, K., "Two-Phase Machine Translation System," Nagao,
M. (Ed.-in-Chief) Machine Translation Summit, Tokyo, 1989.
[Nagao, 1989] Nagao, M., Machine Translation: How Far Can It Go?, Oxford,
1989.
[Nagao, 1984] Nagao, M., "A Framework of a Mechanical Translation between
Japanese and English by Analogy Principle," Artificial and Human Intelli-
gence, Elithorn, A. and Banerji, R. (Eds.), Elsevier Science Publishers, B.V.
1984.
[Nirenburg et. al., 1989a] Nirenburg, S. (Ed.), Knowledge-Based Machine
Translation, Center for Machine Translation Project Report, Carnegie Mel-
lon University, 1989.
[Nirenburg et. al., 1989b] Nirenburg, S., Lesser, V. and Nyberg, E. , "Control-
ling a Language Generation Planner," In Proceedings of the International
Joint Conference on Artificial Intelligence (IJCAI-89), 1989.

[Nirenburg et. al., 1988a] Nirenburg, S., Monarch, I., Kaufmann, T., Niren-
burg, I., and Carbonell, J., Acquisition of Very Large Knowledge Bases, Tech-
nical Report CMU-CMT-88-108, Center for Machine Translation, Carnegie
Mellon University, Pittsburgh, PA, 1988.
[Nirenburg et. al., 1988b] Nirenburg, S., McCardell, R., Nyberg, E., Werner,
P., Huffman, S., Kenschaft, E., and Nirenburg, I., DIOGENES-88, Technical
Report CMU-CMT-88-107, Center for Machine Translation, Carnegie Mellon
University, Pittsburgh, PA, 1988.
[Norvig, 1986] Norvig, P., A Unified Theory of Inference for Text Understanding,
Ph.D. Thesis, University of California at Berkeley, 1986.
[Nyberg, 1988] Nyberg, E., The FrameKit User's Guide, Technical Memo, Center
for Machine Translation, Carnegie Mellon University, Pittsburgh, PA, 1988.
[Ogura et. al., 1989] Ogura, K., Sakano, T., Hosaka, J., and Morimoto, T.,
Spoken Language Japanese-English Translation Experimental System (SL-
TRANS), TR-I-0102, ATR Interpreting Telephony Research Laboratories,
1989.
[Ogura et. al., 1989] Ogura, T., Yamada, J., Yamada, S. and Tanno, M., "A
20-K bit Associative Memory LSI for Artificial Intelligence Machines," IEEE
Journal of Solid-State Circuits, Vol. 24, No. 4, 1989.
[Pinker, 1984] Pinker, S., Language Learnability and Language Development,
Harvard University Press, 1984.
[Pollack, 1990] Pollack, M., "Plans as Complex Mental Attitudes," In Cohen,
Morgan and Pollack, eds. Intentions in Communication, MIT Press, Cam-
bridge, MA, 1990.
[Pollard and Sag, 1987] Pollard, C. and Sag, I., Information-Based Syntax and
Semantics, Volume 1, CSLI Lecture Notes, 13, 1987.

[Prather and Swinney, 1988] Prather, P. and Swinney, D., "Lexical Processing
and Ambiguity Resolution: An Autonomous Processing in an Interactive
Box," In Lexical Ambiguity Resolution, Small, S. et. al. (Eds.), Morgan Kauf-
mann Publishers, 1988.
[Quillian, 1968] Quillian, M., "Semantic Memory," Semantic Information Pro-
cessing, M. Minsky (Ed.), 216-270, The MIT Press, Cambridge, MA, 1968.

[Rabiner et. al., 1979] Rabiner, L.R., Levinson, S.E., Rosenberg, A.E., and
Wilpon, J.G., "Speaker-Independent Recognition of Isolated Words Using
Clustering Techniques," IEEE Transactions on Acoustics, Speech, and Sig-
nal Processing, ASSP-27(4):336-349, August, 1979.

[Riesbeck and Martin, 1985] Riesbeck, C. and Martin, C., "Direct Memory Ac-
cess Parsing," Yale University Report #354, 1985.
[Riesbeck and Martin, 1986] Riesbeck, C. and Martin, C., "Direct Memory Ac-
cess Parsing," Experience, Memory, and Reasoning, Lawrence Erlbaum As-
sociates, 1986.
[Riesbeck and Schank, 1989] Riesbeck, C. and Schank, R., Inside Case-Based
Reasoning, Lawrence Erlbaum Associates, 1989.
[Sacerdoti, 1977] Sacerdoti, E. D., A Structure for Plans and Behavior, New
York: American Elsevier, 1977.
[Saito and Tomita, 1988] Saito, H. and Tomita, M., "Parsing Noisy Sentences,"
In Proceedings of COLING-88, 1988.
[Sato, 1993] Sato, S., Example-Based Translation of Technical Terms, IS-RR-
93-41, Japan Advanced Institute of Science and Technology, 1993.
[Sato, 1991a] Sato, S., Example-Based Machine Translation, Ph.D. Thesis, Ky-
oto University, 1991.
[Sato, 1991b] Sato, S., "MBT-I: Word Choice based on Examples," Journal of
Japanese Society for AI, Vol. 6, No. 4, 1991 (in Japanese).
[Sato and Nagao, 1990] Sato, S. and Nagao, M., "Toward Memory-based
Translation," Proceedings of COLING-90, 1990.
[Selfridge, 1980] Selfridge, M., A Process Model of Language Acquisition, Ph.D.
thesis, Yale University Department of Computer Science, 1980.
[Schank, 1982] Schank, R., Dynamic Memory: A Theory of Reminding and
Learning in Computers and People, Cambridge University Press, 1982.
[Schank, 1975] Schank, R., Conceptual Information Processing, North-Holland,
1975.
[Sidner, 1979] Sidner, C., Towards a Computational Theory of Definite
Anaphora Comprehension in English Discourse, Ph.D. Thesis, Artificial In-
telligence Lab., M.I.T., 1979.
[Small et. al., 1988] Small, S., et. al. (Eds.) Lexical Ambiguity Resolution, Mor-
gan Kaufmann Publishers, Inc., CA, 1988.
[Sowa, 1984] Sowa, J., Conceptual Structures, Reading, Addison-Wesley, 1984.
[Stanfill and Waltz, 1988] Stanfill, C. and Waltz, D., "The Memory-Based
Reasoning Paradigm," Proceedings of the Case-Based Reasoning Workshop,
DARPA, 1988.

[Stanfill and Waltz, 1986] Stanfill, C. and Waltz, D., "Toward Memory-Based
Reasoning," Communications of the ACM, 1986.
[Stanfill, 1988] Stanfill, C., "Memory-Based Reasoning Applied to
English Pronunciation," Proceedings of the AAAI-88, 1988.

[Sundheim, 1991] Sundheim, B. (Ed.), Proceedings of the Third Message Un-
derstanding Conference (MUC-3), Morgan Kaufmann Publishers, San Ma-
teo, CA, 1991.

[Sumita et. al., 1993] Sumita, E., Oi, K., Iida, H., Higuchi, T., Takahashi, N.,
Kitano, H., "Example-Based Machine Translation on Massively Parallel Pro-
cessors," Proc. of IJCAI-93, 1993.
[Sumita and Iida, 1991] Sumita, E. and Iida, H., "Experiments and Prospects
of Example-Based Machine Translation," Proceedings of the 29th Annual
Meeting of the Association for Computational Linguistics, 1991.

[Tanaka and Numazaki, 1989] Tanaka, H. and Numazaki, H., "Parallel Gener-
alized LR Parsing based on Logic Programming," Proceedings of the First
International Workshop on Parsing Technologies, Pittsburgh, 1989.

[Tebelskis, et. al., 1991] Tebelskis, J., Waibel, A., Petek, B., and Schmid-
bauer, O., "Continuous speech recognition using linked predictive neural
networks," IEEE Proceedings of the 1991 International Conference on
Acoustics, Speech, and Signal Processing, April 1991.

[Thinking Machines Corporation, 1989] Thinking Machines Corp., Model CM-
2 Technical Summary, Technical Report TR89-1, 1989.

[Thinking Machines Corporation, 1991] Thinking Machines Corp., The Con-
nection Machine CM-5 Technical Summary, 1991.

[Tomabechi, 1987] Tomabechi, H., "Direct Memory Access Translation," In
Proceedings of the International Joint Conference on Artificial Intelligence
(IJCAI-87), 1987.

[Tomabechi et. al., 1989] Tomabechi, H., Saito, H. and Tomita, M., "Speech-
Trans: An Experimental Real-Time Speech-to-Speech Translation," In Pro-
ceedings of the 1989 Spring Symposium of the American Association for Ar-
tificial Intelligence, 1989.

[Tomabechi et. al., 1988] Tomabechi, H., Mitamura, T. and Tomita, M., "Di-
rect Memory Translation for Speech Input: A Massively Parallel Network
for Episodic/Thematic and Phonological Memory," In Proceedings of the In-
ternational Conference on Fifth Generation Computer Systems, 1988.

[Tomita, 1986] Tomita, M., Efficient Parsing for Natural Language, Kluwer
Academic Publishers, 1986.
[Tomita and Carbonell, 1987] Tomita, M. and Carbonell, J. G., "The Universal
Parser Architecture for Knowledge-Based Machine Translation," In Proceed-
ings of the International Joint Conference on Artificial Intelligence (IJCAI-
87), 1987.
[Tomita and Knight, 1988] Tomita, M. and Knight, K., "Pseudo-Unification
and Full Unification," CMU-CMT-88-MEMO, 1988.
[Tomita et. al., 1989] Tomita, M., Kee, M., Saito, H., Mitamura, T., and
Tomabechi, H., "Towards a Speech-to-Speech Translation System," Journal
of Applied Linguistics, 3.1, 1989.

[Tsujii, 1985] Tsujii, J., "The Roles of Dictionaries in Machine Translation,"
Jouhou-syori (Information Processing), Information Processing Society of
Japan, Vol. 26, No. 10, 1985. (In Japanese)

[Viterbi, 1967] Viterbi, A.J., "Error Bounds for Convolutional Codes and an
Asymptotically Optimum Decoding Algorithm," In IEEE Transactions on
Information Theory, IT-13(2):260-269, April, 1967.

[Waibel et. al., 1989] Waibel, A., Hanazawa, T., Hinton, G., Shikano, K. and
Lang, K., "Phoneme Recognition Using Time-Delay Neural Networks,"
IEEE Transactions on Acoustics, Speech and Signal Processing, March,
1989.
[Waibel et. al., 1991] Waibel, A., Jain, A., McNair, A., Saito, H., Hauptmann,
A., and Tebelskis, J., "JANUS: A Speech-to-Speech Translation System Using
Connectionist and Symbolic Processing Strategies," IEEE Proceedings of the
1991 International Conference on Acoustics, Speech, and Signal Processing,
April 1991.
[Waibel and Lee, 1990] Waibel, A., and Lee, K.F., (Eds.) Readings in Speech
Recognition, Morgan Kaufmann, 1990.

[Walker and Whittaker, 1990] Walker, M. and Whittaker, S., "Mixed Initiative
in Dialogue: An Investigation into Discourse Segmentation," Proceedings of
ACL-90, Pittsburgh, 1990.

[Waltz, 1987] Waltz, D., "Applications of the Connection Machine," Computer,
Jan., 1987.
[Waltz, 1990] Waltz, D., "Massively Parallel AI," Proceedings of AAAI-90,
1990.

[Waltz and Pollack, 1985] Waltz, D. L. and Pollack, J. B., "Massively Parallel
Parsing: A Strongly Interactive Model of Natural Language Interpretation,"
Cognitive Science, 9(1):51-74, 1985.

[Webber, 1983] Webber, B., "So What Can We Talk About Now?" In Brady,
M. and Berwick, R. (Eds.), Computational Models of Discourse, The MIT
Press, 1983.

[Wilensky, 1987] Wilensky, R., "Some Problems and Proposals for Knowledge
Representation," Technical Report UCB/CSD 87/351, University of Califor-
nia, Berkeley, Computer Science Division, 1987.
[Winograd, 1983] Winograd, T., Language as a Cognitive Process, Vol. 1: Syn-
tax, Addison-Wesley, 1983.

[Wroblewski, 1988] Wroblewski, D., "Nondestructive Graph Unification," In
Proceedings of AAAI-88, 1988.

[Wu, 1989] Wu, D., "A Probabilistic Approach to Marker Propagation," In
Proceedings of the International Joint Conference on Artificial Intelligence
(IJCAI-89), 1989.

[Yasuura, 1984] Yasuura, H., "On Parallel Computational Complexity of Uni-
fication," in Proceedings of the International Conference on Fifth Generation
Computer Systems, 1984.
[Young et. al., 1989] Young, S., Ward, W. and Hauptmann, A., "Layering Pre-
dictions: Flexible Use of Dialog Expectation in Speech Recognition," In
Proceedings of the International Joint Conference on Artificial Intelligence
(IJCAI-89), 1989.

[Zernik, 1987] Zernik, U., Strategies in Language Acquisition: Learning
Phrases from Examples in Context, Technical Report UCLA-AI-87-1, 1987.

[Zue, 1985] Zue, V.W., "The Use of Speech Knowledge in Automatic Speech
Recognition," Proceedings of the IEEE, 73(11):1602-1615, November, 1985.
Index

A
a priori probability, 62, 64
A-Marker, 54
acoustic features, 5
activation marker, 54
adaptation, 165
ALPAC report, 9
ambiguity, 6, 109, 129, 130
anaphora, 128
articulation, 5
associative memory processor, 44, 135
ATR Interpreting Telephony Research Laboratories, 10

B
baseline parsing algorithm, 56
build-and-store, 33
BYBLOS, 8

C
C-Marker, 54
case-based reasoning, 31
CC, 49
Center for Machine Translation, 10
CI, 50
combination of cases, 166
concept class, 49
concept instance, 50
concept sequence class, 49
concurrent parsing and generation, 48, 93
confusion matrix, 14, 64
connection machine, 11
constraint satisfaction, 70
contextual marker, 54
continuous speech, 4, 5
control, 129
cost-based ambiguity resolution, 48, 78
CSC, 49

D
DAP, 11
deleted phoneme, 16
direct memory access parser, 32
discourse, 71
discourse entities, 50
DMAP, 32, 45
DmDialog, 10, 29, 47
DmSNAP, 115
DRAGON, 8
dynamic confusion matrix, 64
dynamic time warp, 8

E
ellipsis, 6, 7
error: phoneme prediction, 21
Eurotra project, 10

F
FEATURE, 8
feature aggregation, 69

G
G-Marker, 54
generalized case, 68, 83
generalized LR parser, 21, 26
generation, 84
generation marker, 54
GenKit, 14
grammar rules, 52, 68, 83
grammatical inference, 158, 162
guided marker-passing, 46

H
HARPY, 8
Head-driven Phrase Structure Grammar, 18
HEARSAY-II, 8
Hidden Markov Model, 17
HMM-LR, 17, 26

I
IG-Marker, 71
ill-formed inputs, 6
ILLIAC-IV, 11
inferred goal marker, 71
inserted phonemes, 16
intension, 7
interlingua, 3, 6, 82
interpreting telephony, 1
IXM2, 11, 44, 135

J
JANUS, 20

K
KBMT, 25
KBMT-89, 10
Knowledge-Based Machine Translation, 25
KSR-1, 12

L
language model, 62
large vocabulary, 4, 5
learning, 110
lexical choice, 6, 93
lexical entry, 50
lexical-functional grammar, 52
Linked Predictive Neural Network, 20
LR parser, 14

M
machine translation, 3, 9
marker, 54, 123
marker-passing, 29, 45, 119
massively parallel artificial intelligence, 11, 43
massively parallel computing, 11, 29, 43, 48
MBRtalk, 11
meaning representation, 6
Memoir, 157
memory network, 49, 51, 121, 144
Memory-base, 51, 158
memory-based approach, 31, 47, 119, 157
memory-based parser, 137
memory-based parsing, 29, 67
memory-based reasoning, 11, 31
MIND, 105
MINDS, 24
mixed-initiative dialog, 106
monitor, 158, 168
MPP, 11
MU project, 10
multiple hypotheses, 7

N
NETL, 11, 45
noisy inputs, 6
noisy phoneme sequence, 14, 62

O
ontological hierarchy, 51

P
P-Marker, 54
parsing, 68
perplexity, 78, 105
phoneme hypothesis, 3
phoneme recognition, 3
phoneme-based generalized LR parser, 15
phonological processing, 63
possible space, 40
pragmatic choice, 6
pragmatics, 24
prediction, 77
prediction marker, 54
predictive LR parser, 17, 26
probability matrix, 63
productivity of language, 36
propagation rules, 118

R
real space, 40
real-time response, 3, 7
recognize-and-record, 33
references, 6
rule-based approach, 42, 157
rule-based processing, 48

S
semantic network array processor, 44, 115
simultaneous interpretation, 93
SL-TRANS, 10, 17
SNAP, 11, 44, 115
speaker independence, 4
speaker-independent, 4
specific case, 68, 83
speech input processing, 60
speech recognition, 1, 4, 8
speech-to-speech translation, 1
SpeechTrans, 10, 13
SPHINX, 5, 8
spoken language translation, 4
stylistics, 6
sub-word model, 5
substituted phonemes, 15
syntactic choice, 6, 93
syntactic constraint network, 120, 123
SYSTRAN, 10

T
TANGORA, 8
TAUM-METEO, 10
transition matrix, 64
transition probability, 64
translation, 5
translation by analogy principle, 33
Typed Feature Structure Propagation, 18

U
unbounded dependency, 130

V
V-Marker, 54
verbalization marker, 54
very large finite space, 36
VOCIND, 5, 8

W
word boundary, 5
word hypothesis, 3
