Computation for Metaphors, Analogy, and Agents
Volume Editor
Chrystopher L. Nehaniv
University of Hertfordshire
Faculty of Engineering and Information Sciences
College Lane, Hatfield Herts AL10 9AB, UK
E-mail: c.l.nehaniv@herts.ac.uk
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law.
© Springer-Verlag Berlin Heidelberg 1999
Printed in Germany
Typesetting: Camera-ready by author
Preface
Advisory Committee
Rodney A. Brooks MIT Artificial Intelligence Lab, U.S.A.
Joseph Goguen University of California, San Diego, U.S.A.
Douglas R. Hofstadter Indiana University, U.S.A.
Alex Meystel National Institute of Standards and
Technology, U.S.A.
Melanie Mitchell Santa Fe Institute, U.S.A.
Referees
Steve Battle, Meurig Beynon, Aude Billard, Larry Bull, Zixue Cheng, Kerstin Dautenhahn, Gilles Fauconnier, Robert M. French, Joseph Goguen, Karsten Henckell, Masami Ito, William Martens, Jacob L. Mey, Alex Meystel, Chrystopher Nehaniv, Minetada Osano, Thomas S. Ray, John L. Rhodes, Paul Thagard, and other anonymous referees
Table of Contents
Introduction
Computation for Metaphors, Analogy and Agents ... 1
Chrystopher L. Nehaniv (University of Aizu, Japan & University of Hertfordshire, U.K.)

Metaphors and Blending

Forging Connections ... 11
Mark Turner (University of Maryland, U.S.A.)

Rough Sea and the Milky Way: 'Blending' in a Haiku Text ... 27
Masako K. Hiraga (University of the Air, Japan)

Pragmatic Forces in Metaphor Use: The Mechanics of Blend Recruitment in Visual Metaphors ... 37
Tony Veale (Dublin City University, Ireland)

Embodiment: The First Person

The Cog Project: Building a Humanoid Robot ... 52
Rodney A. Brooks, Cynthia Breazeal, Matthew Marjanovic, Brian Scassellati, Matthew M. Williamson (MIT Artificial Intelligence Lab, U.S.A.)

Embodiment as Metaphor: Metaphorizing-In the Environment ... 88
Georgi Stojanov (SS Cyril & Methodius University, Macedonia)

Interaction: The Second Person

Embodiment and Interaction in Socially Intelligent Life-Like Agents ... 102
Kerstin Dautenhahn (University of Reading, U.K.)

An Implemented System for Metaphor-Based Reasoning with Special Application to Reasoning about Agents ... 143
John A. Barnden (University of Birmingham, U.K.)

GAIA: An Experimental Pedagogical Agent for Exploring Multimodal Interaction ... 154
Tom Fenton-Kerr (University of Sydney, Australia)

When Agents Meet Cross-Cultural Metaphor: Can They Be Equipped to Parse and Generate It? ... 165
Patricia O'Neill-Brown (Japan Technology Program, U.S. Dept. of Commerce)

Imitation: First and Second Person

Imitation and Mechanisms of Joint Attention: A Developmental Structure for Building Social Skills on a Humanoid Robot ... 176
Brian Scassellati (MIT Artificial Intelligence Lab, U.S.A.)

Figures of Speech, a Way to Acquire Language ... 196
Anneli Kauppinen (University of Helsinki & Helsinki Polytechnic, Finland)

Situated Mapping: Space and Time

"Meaning" through Clustering by Self-Organization of Spatial and Temporal Information ... 209
Ulrich Nehmzow (University of Manchester, U.K.)

Conceptual Mappings from Spatial Motion to Time: Analysis of English and Japanese ... 230
Kazuko Shinohara (Otsuma Women's University, Japan)

Algebraic Engineering: Respecting Structure

An Introduction to Algebraic Semiotics, with Application to User Interface Design ... 242
Joseph Goguen (University of California, San Diego, U.S.A.)

An Algebraic Approach to Modeling Creativity of Metaphor ... 292
Bipin Indurkhya (Tokyo University of Agriculture and Technology, Japan)

Metaphor and Human-Computer Interaction: A Model Based Approach ... 307
J. L. Alty and R. P. Knott (Loughborough University, U.K.)

A Sea-Change in Viewpoints

Empirical Modelling and the Foundations of Artificial Intelligence ... 322
Meurig Beynon (University of Warwick, U.K.)

Communication as an Emergent Metaphor for Neuronal Operation ... 365
Slawomir J. Nasuto, Kerstin Dautenhahn, and Mark Bishop (University of Reading, U.K.)

The Second Person – Meaning and Metaphors ... 380
Chrystopher L. Nehaniv (University of Aizu, Japan & University of Hertfordshire, U.K.)

Author Index ... 389
Computation for Metaphors, Analogy and
Agents
Chrystopher L. Nehaniv
Metaphor and analogy were traditionally considered the strict domain of
rhetoric, poetics, and linguistics. Their study has a long scholarly history,
reaching back at least to the ancient Greece of Aristotle and the India of
Panini. More recently
it has been realized that human metaphor in language is primarily conceptual,
and moreover that metaphor transcends language, going much deeper into the
roots of human concepts, epistemologies, and cultures. Seen as a major com-
ponent in human thought, metaphor has come to be understood and studied
as belonging also to the realm of the cognitive sciences. Lakoff and Johnson’s
and Ortony’s landmark volumes [22,36] cast metaphor in cognitive terms (for
humans with their particular type of embodiment) and shed much light on the
constructive nature of metaphorical understanding and creation of conceptual
worlds.
Our thesis is that these ideas on metaphor have a power extending beyond
the human realm, not only beyond language and into human cognition, but to
the realm of animals, as well as robots and other constructed agents. In building
robots and agents, we are engaging in a kind of constructive biology, working
to realize the mechanism-as-creature metaphor, which has guided and inspired
much work on robots and agents. Such agents may have to deal with aspects of
time, space, mapping, history, and adaptation to their respective Umwelt (“world
around”).

(Current address: Interactive Systems Engineering, Department of Computer Science, University of Hertfordshire, Hatfield, Hertfordshire AL10 9AB, United Kingdom. E-mail: c.l.nehaniv@herts.ac.uk)
By looking at the linguistic and cognitive understanding of metaphor and
analogy, and at formal and computational instantiations of that understanding,
the constructors of agents, robots, and creatures may
have much to gain. Understanding through building is a powerful way to validate
theories and uncover explanatory mechanisms. Moreover, building can open one’s
eyes to the light of new understanding in both theory and practice.
An intriguing metaphorical blend is the notion of a Robot. The concept
of a robot is understood as a cognitive blend of the concepts “machine”1 and
“human” (or “animal”).2 Attempting to build such a mechanism, one is led to
the question of ‘transferring’ – realizing analogues of – human or animal-like
abilities in a new medium. Moreover, if this new mechanism should act like an
animal, how will it need to interact with, adapt to, and perhaps interpret the
world around it? How is this agent to 'compute' in this way? And how is it to
be engineered in order to meet either or both of these mutually reflective goals?
Scientific advances (and delays) have often rested on metaphors and analo-
gies, and paradigmatic shifts may be largely based on them [20]. But compu-
tation employing conceptual metaphors has mostly been carried out via human
thought. In the realms of human-computer interaction (HCI), artificial intelli-
gence (AI), artificial life, agent technology, constructive biology, cognitive sci-
ence, linguistics, robotics, and computer science, we may ask for means to em-
ploy the powerful tool of metaphor for the synthesis and analysis of systems for
which meaning makes sense3 , for which a correspondence exists between inside
and outside, among behaviors, embodiments and environments.
Richards [37] formulated a metaphor as a mapping from a source domain (the
'vehicle') to a topic domain (the 'tenor'), by means of which something is
asserted (or understood) about the topic.
1. The machine itself as tool and metaphor has had a long and creative history [11]. Indeed, the conceptualization of what we consider mechanistic explanations in physics, biology, and engineering has changed very much in the course of the history of ideas. For instance, Newton's physics was criticized as being non-mechanistic, since it required action at a distance without interconnecting parts (Toulmin [43]). Modern mechanistic scientific explanations were not necessarily mechanistic in the older sense of the term. 'Mechanistic' represents a refined, blended concept that has evolved over many centuries.
2. A related blend is the notion of 'cyborg', a 'cybernetic organism', which is more proximal for us than 'robot' in that it entails a physical blending of biological life, including ourselves, with the machine. Indeed, our use of tools such as eyeglasses, hammers, numerical notations, contact lenses, and other prosthetics to augment our bodies and minds has already made us cyborgs. This can be taken as an empowering metaphor when we take control of, and responsibility for, our own cybernetic augmentation (Haraway [12], Nehaniv [28]).
3. See the discussion paper "The Second Person – Meaning and Metaphors" [30] at the end of this book for an outline of a theory of meaning in a setting extending Shannon-Weaver information theory to situated agents and observers, and addressing the origin, evolution, and maintenance of interaction channels for perception and action.
Cognitive theories realized that metaphor
is not an exceptional decorative occurrence in language, but is a main mechanism
by which humans understand abstract concepts and carry out abstract reasoning
(e.g. Lakoff and Johnson [22], Lakoff [21], Johnson [18]). On this view, metaphors
are structure-preserving mappings (partial homomorphisms) between conceptual
domains, rather than linguistic constructions. Common metaphorical schemas in
our cultures are grounded in embodied perception. Correspondences in experi-
ence (rather than just abstract similarity) structure our cognition. Common
conceptual root metaphors in English are studied by Lakoff and Johnson [22],
and extended, with detailed attention to root analogies in the English lexicon,
by Goatly [9].
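To make the partial-homomorphism reading of metaphor concrete, here is a minimal Python sketch. It is an illustration of the idea only, not code from any of the works cited; the two toy domains and the LIFE IS A JOURNEY mapping are invented for the example.

```python
# A domain is modeled as a set of relational facts over named entities.
# Both domains and the mapping below are invented for illustration.
journey = {
    ("starts_at", "traveler", "origin"),
    ("moves_toward", "traveler", "destination"),
    ("blocks", "obstacle", "traveler"),
}
life = {
    ("starts_at", "person", "birth"),
    ("moves_toward", "person", "goal"),
    ("blocks", "difficulty", "person"),
}

# The metaphor: a partial map from source (journey) entities to target (life) entities.
life_is_a_journey = {
    "traveler": "person",
    "origin": "birth",
    "destination": "goal",
    "obstacle": "difficulty",
}

def is_partial_homomorphism(mapping, source, target):
    """Check that every source fact whose arguments are all mapped
    is carried onto a fact that actually holds in the target."""
    for rel, *args in source:
        if all(a in mapping for a in args):
            if (rel, *(mapping[a] for a in args)) not in target:
                return False
    return True

print(is_partial_homomorphism(life_is_a_journey, journey, life))  # True
```

The map is partial: entities the metaphor does not address may simply be left out of the mapping, and only mapped facts are required to be preserved.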
An important extension for conceptual metaphors is the framework of Mark
Turner and Gilles Fauconnier (see Turner’s paper in this volume), who argue that
metaphors and analogies are not sufficiently accounted for by mappings between
pre-existing static domains, but are actually better understood as constructs in
forged conceptual spaces, which are blends of conceptual domains, over some
common space, with projections from the blend space back to the constituent
factors (e.g. a 'tenor' and 'vehicle'), affording recruitment of features from
the blend space, in which much inference and new structure may be generated.
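Read category-theoretically (as in the overview of Turner's paper below), such a blend is a pushout: the disjoint union of two input spaces with the two images of each generic-space element identified. The Python sketch below is my own toy rendering of that construction, using the machine/human "robot" blend discussed above; all element names are invented, and it is not an implementation from the literature.

```python
# A toy pushout of two conceptual spaces over a shared generic space.
from itertools import chain

generic = ["actor", "action"]

# Cross-space maps from the generic space into each input space.
to_machine = {"actor": "mechanism", "action": "actuation"}
to_human = {"actor": "person", "action": "behavior"}

# Tag elements so the disjoint union keeps the two spaces apart;
# include elements outside the generic space's image as well.
machine = [("machine", e) for e in list(to_machine.values()) + ["gears"]]
human = [("human", e) for e in list(to_human.values()) + ["intentions"]]

parent = {e: e for e in chain(machine, human)}

def find(x):
    # Union-find representative lookup.
    while parent[x] != x:
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# The pushout identifications: the two images of each generic element fuse.
for g in generic:
    union(("machine", to_machine[g]), ("human", to_human[g]))

# Each resulting equivalence class is one element of the blend space.
blend = {}
for e in chain(machine, human):
    blend.setdefault(find(e), []).append(e[1])
print(list(blend.values()))
# [['mechanism', 'person'], ['actuation', 'behavior'], ['gears'], ['intentions']]
# The fused ['mechanism', 'person'] element is the robot: at once machine and human.
```

In a fuller treatment the blend would also carry relational structure and projections back to the inputs; only the identification of counterparts is shown here.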
We shall not restrict ourselves to concepts or language. A more general,
not necessarily symbolic view is also possible if one conceives of metaphor and
analogy as the study of 'meaning transfer' between domains, or, in light of the
theory of cognitive blending, as the realm of ‘meaning synthesis’ by putting
things together that already share something to create a new domain guiding
thought, perception or action. Other types of meaning can be seen for instance
in Dawkins’ notion of memes as replicators in minds and cultures [7], transmit-
ted by imitation and learning, propagating, often in difficult circumstances, via
motion through behavioral or linguistic media. Still another type of meaning is
constituted by agent behavior in response to sensory stimuli, effecting changes
in its environment. On such views, interactions become interaction games, with
the meaning of artifacts defined by the actions they afford.
A particular case is the area of ‘intelligent software agents’. This has grown
into a large arena of research and application, concerned with realizing the
software-as-agent metaphor in interfaces, entertainment and synthetic worlds,
as well as for workload and information overload reduction (cf. [38]). As with
other types of semantic change in human language and cultures, what may at
first have been marked as strange may become common: these metaphors become
definitional identities; rather than conceptual mappings, they become realities.
Some pieces of software are really agents.
Goguen's algebraic semiotics (see his paper in this volume) assesses the quality
of user interfaces in terms of the degree to which they preserve the algebraic
structure of semiotic systems. Unlike most formal approaches, it remains
agent- and user-centered, considering situated interaction in its particularity,
allowing it to avoid overconstraint and other pitfalls of objectivism, while focusing on
the central role of structure-respecting mappings. In understanding metaphors
and analogies concerning real-world things, one would do well to avoid forcing
fixed conceptual representations onto them, since conceptualizations can be
constructed dynamically in creative analogy, perception, and problem-solving
(Hofstadter et al. [14], Mitchell [23], Holyoak and Thagard [15]), which allows for fluid
‘conceptual slippage’. Scientists know well not to neglect their intuitions of vague
analogies, since these may lead to deep insights that may later be substantiated
by hard empirical data.
2 Overview of Papers
The mechanics of metaphors and blending comprise the first section of the book.
Mark Turner [44] presents a sophisticated approach to metaphor and analogy
in terms of ‘blends’, a framework that can be expressed in category-theoretical
terms of pushouts (or more general colimits) of conceptual spaces over a com-
mon skeletal space. It is shown how this framework works better for the analysis
of analogy than traditional source-target approaches, especially since elements
of the constructed (blend) space are recruited to the analogy. Thus meaning is
often constructed (‘forged’) in the blend rather than merely transferred between
domains by mapping. Masako K. Hiraga [13] illustrates the Fauconnier-Turner
framework by her detailed study of metaphorical blends in a famous haiku of the
Japanese poet Basho. She carries out a beautiful tour de force analysis involving
levels of logographics, grammar, poetics, morphophonemics, and culture. Tony
Veale [45] gives applications to visual metaphors using a sophisticated implemen-
tation of a computational system for finding and understanding metaphor with
special attention to computational feasibility and pragmatics using the blend
framework and notions of recruitment (semantic crossover from domains of the
blend) with good use of some traditional AI methods.
The agent-centered or first-person viewpoint is the focus of the next section,
which concentrates on the details of embodiment and agent-environment
coupling from an agent perspective: Rodney A. Brooks, Cynthia Breazeal, Matthew
Marjanović, Brian Scassellati and Matthew M. Williamson [4] discuss alternative
essences of intelligence and lessons from embodied AI, presenting the MIT Hu-
manoid Robot Cog and the embodied AI viewpoint. Emergent dynamics driven
by human interaction (turn-taking) and exploitation of natural dynamics in
the robot (arm swinging and force-feedback with a slinky toy) have also been
achieved by the MIT group. Key ideas are to reject monolithic control and full
internal models, not attempting general purposehood, and the recognition of
6. Here 'development' means an incremental or 'subsumption' approach, building on what has been achieved so far, suppressing, invoking, or otherwise modulating its behaviors in wider contexts by means of new layers of structure.
Forging Connections

Mark Turner (University of Maryland, U.S.A.)
On Monday, October 27, 1997, when the Dow Jones Industrial Average fell more than
five hundred points, precipitously and unnervingly, on huge volume, in a single day,
and the last two hours saw broad panic selling, investors wondered whether the next
day would be a bloodbath. Later that evening, the internet was flooded with thousands
of postings analyzing whether the crash was like the infamous crash on Black Monday
ten years earlier. I read them all evening.
These professional and amateur investors never questioned the fundamental
importance of knowing whether the analogy was true. Evidently, punishment awaited
anyone who made the wrong call. If the analogy held, then the investor in equities
should preserve positions and buy aggressively into the market, which would rise.
Yet there were reasons to doubt the analogy. Even after their five-hundred point
fall, stocks were still expensively valued by traditional measures. Most investors had
enjoyed unprecedented capital gains on paper in the previous few years, and many
could not resist the argument that it would be prudent to realize those gains before the
market plunged into the vortex of Asian currency troubles. Thailand's monetary
turmoil—in a domino cascade running through Indonesia, Korea, Hong Kong, Japan,
and the United States—could be lethal.
The analysts on the internet took it for granted that establishing analogy or
disanalogy depends upon rebuilding, reconstruing, reinterpreting the two inputs—in
this case, the two crashes. They began with provisional background structure and
connections—for example, the Dow on Black Monday corresponded to the Dow in
October, 1997 (even though the thirty companies comprising the Dow Industrials had
been changed), the drop on Black Monday in 1987 corresponded to the drop on
October 27, 1997, and so on. But this structure and these correspondences provided
only a launching pad, not the analogy itself. In particular, they provided none of the
inferences investors sought as the basis for their consequential decisions and actions.
The effective claims in the internet analyses were introduced with phrases like,
"What this crash is a case of . . .," "We must not forget that the 1987 crash . . .," and
"It would be a mistake to think of the 1987 crash as . . . ." There were injunctions like
Suppose we began to analyze this political analogy by adopting the mistaken but
common folk assumption that analogy starts with two pre-existing analogs, aligns and
matches them, and projects inferences from one to the other. The analogs to be
matched for this cartoon would be a scene with a father in the waiting room of a
maternity ward and a scene with French workers demanding a policy of early
retirement. I can see no significant matches between these two notions. I can match
the labor of the mother to the labor of the French workers, but that connection has
nothing to do with this analogy and leads nowhere. I have no pre-existing knowledge
of fetuses according to which I can match them with French workers who make
demands about their conditions of employment. There is the possible match between
the non-delivery of the baby and the non-delivery of passengers and goods—French
transportation workers were at the time striking in support of the policy—but that
match is optional, provides no inference of absurdity, and could be fatally misleading
since it matches the obstetrician responsible for the delivery with the transportation
workers responsible for delivery, and this match destroys the analogy. It seems clear
that any straightforward matching between these two pre-existing notions, if there is
any, misses the analogy.
Matching does not work, but neither does projection of inferences from source to
target. The familiar source space would be birth in a maternity ward, supplemented
with the frame of the waiting room, and the target space would be French labor
politics. But there are no fetuses in the source space who make ridiculous demands of
any kind and no doctors who toss up their hands in exasperation at the absurd ideas of
the fetus. In the source space, members of the delivery team do not come into the
waiting room to protest the unreasonable views of the fetus. None of this and none of
the associated inferences in fact exist in the source to be projected onto the target in
the first place.
The absurdity of the situation does not belong to the pre-existing source.
Interestingly, it does not belong to the pre-existing target, either. The inference of the
cartoon is that the demands of the French workers are so absolutely absurd and
unheard-of as to be completely astonishing. They are wild from any perspective. But
if such an absurdity were already part of the pre-existing target, there would be no
need to make the analogy. The motivation for making the analogy is that 61% of the
French do in fact support these demands, and those citizens need to be persuaded to
drop their support.
The cartoon is unmistakably organized by the abstract conceptual frame of the
source space—a waiting room in a maternity ward. It also contains a few specified
elements, and it is illuminating to consider what they are doing in the cartoon.
Consider the newspaper in the expectant father's right hand. Naturally, an expectant
father might read a newspaper while he waits, and the analogist exploits this
possibility. But the motivation for including the newspaper in the cartoon is not to
evoke the frame of a waiting room and not to lead us to match or project the
newspaper to some analogical counterpart in the target space. There is a counterpart
newspaper in the target space, in fact this identical newspaper, but the connection
between them is identity, not analogy. The newspaper has been incorporated deftly
into the frame of the waiting room because it is important in the target: it announces
President Chirac's resistance to the policy of retirement. The construal of the waiting
room, we see, is driven by the analogy. The source analog is being forged so the
analogy can work.
The newspaper headline is the least of the elements in the waiting room that appear
there under pressure from the target. The difficulty of the delivery and the doctor's
frustration are motivated only by the target. In fact, there are elements in this cartoon
that are impossible for the source space of real waiting rooms. The perversity of the
fetus, the disapproval of the fetus by the obstetrician and the nurse and presumably the
father, the speech of the fetus and its logic, the biting irony of putting the problem of
retirement ahead of the problem of unemployment—an irony clearly conveyed by the
cartoonist but not recognized by the doctor whose words convey it—come only from
the target.
The mental operations that account for this analogy and its work are not retrieval of
pre-existing source and target notions, alignment and matching of their elements, and
projection of inferences from source to target. Instead, the relevant mental operation
is, as Gilles Fauconnier and I have called it, "conceptual integration" (Fauconnier &
Turner 1994, 1996, in press a, in press b, and in preparation; Turner and Fauconnier
1995, in press a, and in press b; Fauconnier 1997; and Turner 1996a and 1996b). There
is a website presenting the research on conceptual integration at
http://www.wam.umd.edu/~mturn/WWW/blending.html.
Conceptual integration—sometimes called "blending" or "mental binding"—
develops a network of mental spaces, including contributing spaces and a blended
space. In the example of the cartoon, the contributing spaces are the French labor
situation, with workers, and the maternity ward, with a fetus. The blend has a single
element that is both a faction in the French labor debate and a baby. Fauconnier and I
call a network of such connections and emergent structures a "conceptual integration
network." A conceptual integration network has projection of elements from
contributing spaces to the blend; cross-space mappings between the contributing
spaces; compositions of elements in the blend; completion of structure in the blend by
recruitment of other frames; and elaboration of the structure in the blend. The
operations of composition, completion, and elaboration in the blend frequently
develop emergent structure there that is not available from the contributing spaces. In
conceptual integration networks, inferences can be projected from the blend to either
input. In the case of analogy, the contributing spaces are asymmetric: one is a source
and one is a target. But causal, ontological, intentional, modal, and frame structure
can come from the target to the blend, and inferences can be projected from the blend
to both source and target. Conceptual integration networks have structural and
dynamic properties and develop under a set of competing optimality constraints which
Fauconnier and I have discussed elsewhere.
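The operations just listed (projection, cross-space mapping, composition, completion, elaboration, and the emergence of new structure in the blend) can be pictured as a small data structure. The Python sketch below is my own schematic, with invented names drawn loosely from the cartoon; it is not Fauconnier and Turner's formal model.

```python
# A schematic conceptual integration network: contributing spaces, a
# cross-space mapping, selective projection into a blend, and a query for
# structure that is emergent (present in the blend but in neither input).
from dataclasses import dataclass, field

@dataclass
class Space:
    name: str
    elements: set = field(default_factory=set)
    structure: set = field(default_factory=set)  # relational facts as tuples

@dataclass
class IntegrationNetwork:
    inputs: list        # contributing spaces
    cross_space: dict   # counterpart links between the inputs
    blend: Space = field(default_factory=lambda: Space("blend"))

    def project(self, element):
        """Selective projection: copy an input element into the blend."""
        self.blend.elements.add(element)

    def compose(self, fact):
        """Composition: juxtapose projected elements in a new relation."""
        self.blend.structure.add(fact)

    def complete(self, recruited_fact):
        """Completion: recruit background-frame structure into the blend."""
        self.blend.structure.add(recruited_fact)

    def emergent(self):
        """Structure living only in the blend, not inherited from the inputs."""
        inherited = set().union(*(s.structure for s in self.inputs))
        return self.blend.structure - inherited

labor = Space("labor", {"workers", "retirement_demand"},
              {("demand", "workers", "retirement_demand")})
ward = Space("ward", {"fetus", "obstetrician"},
             {("delivers", "obstetrician", "fetus")})

net = IntegrationNetwork([labor, ward], {"workers": "fetus"})
for e in ("workers", "fetus", "retirement_demand", "obstetrician"):
    net.project(e)
net.compose(("demand", "fetus", "retirement_demand"))      # the worker-baby demands
net.complete(("judges_absurd", "obstetrician", "fetus"))   # recruited frame structure

print(net.emergent())
# Both facts are emergent: they hold only in the blend and, as the text
# describes, may then be projected back to either input.
```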
Of particular importance for this cartoon, construction and interpretation can be
done on any space at any time as the network develops. In particular, the input spaces
can be re-represented, rebuilt, reconstrued, and reinterpreted. For example, although
notions of the waiting room in a maternity ward do not include conventionally that the
obstetrician comes out to report a problem, or centrally that the expectant father is
reading a newspaper, nonetheless these structures can be recruited to the source space,
and are in this case, since they are needed for blending, under pressure from the target,
with its labor problems and politicians whose views are reported by the media. When
an organizing frame of the blend has been borrowed from the source, it can be
elaborated for the blend with structure left out of the source or impossible for the
source. For example, the baby in the cartoon has highly developed intentional,
expressive, and political capacities, projected to it from the workers in the target, but
we do not project those abilities to the source: we do not interpret this cartoon as
asking us to revise our notions of fetuses to include these advanced abilities.
We keep the source, the target, and the blend quite distinct in this network and do
not become confused. Given the genre of the cartoon, we know that the purpose of
this analogy is to project inferences from the blend to the target rather than to the
source. (Seana Coulson [1996] has shown that there are other genres with other
standard directions of projection.) In the blend, we develop the inference that
something has gone wrong with the natural course of things and that agents dealing
with it are exasperated, but we do not project back to the source the inference that
when delivery is actually failing, it's fine for the obstetrician to take a walk out to the
waiting room to whine for sympathy, instead of redoubling his medical efforts in the
delivery room. We do not project back to the source the inference that in a true
medical emergency the reaction of the expectant father and the obstetrician should be
dumb-founded astonishment at the uncooperative behavior of the fetus rather than
anxiety over the health of the mother and child.
We do project the absurdity of the baby's demand in the blend to the worker's
demand in the target—that is the point of the analogy—but this projection is
complicated. The baby in the blend is an individual who has not yet obtained
employment. Part of the reason we judge the baby to be irrational is that, for the
individual, it would be manifestly illogical to care more about retiring early than about
having a job, since retiring at all is conditional upon having a job. Yet this inference
cannot project identically to each individual working French citizen, who is in fact
already employed. Nor does it seem to project identically to each individual
unemployed French citizen, who may in fact be more concerned about having a job
than about retiring early. The inference projects not identically but to a related
inference for the target, an inference not for individuals but for French citizens as a
political body. The baby's individual retirement age projects to the retirement age to
be set by policy, and the baby's individual prospects for employment project to general
employment trends in France. In the target, these numbers are distributed in a way
that does not give wild absurdity—61% of French citizens are unruffled by their
conjunction—but in the blend, these numbers have become the prospects faced by a
single individual, whose passion to know his conditional retirement age but
nonchalance about his prospects for employment yield a manifest absurdity and irony,
judgments that the cartoonist hopes to induce the reader to project back to the target.
The intended implication of the analogical integration network is that since
unemployment is a general concern for the nation, French citizens should not ask for
expensive retirement policies. The two central inferences of the analogy—manifest
absurdity and biting irony—are constructed only in the blend; they are not available
from the inputs.
The analogy of this cartoon, which appears on the front page of the newspaper as
an illustration of the main story, and which presents no difficulty whatever to its
readers, gives us a picture of analogy as a simultaneous forging of contributing spaces,
a blend, and connections in a dynamic integration network.
We see quite a different picture of the nature of analogy, this time an explicit
academic picture, if we look at work in artificial intelligence. Forbus, Gentner,
Markman, and Ferguson (in press) take the view that there is consensus in AI on the
main theoretical assumptions to be made about analogy, and in particular on the
usefulness of decomposing analogical processing into constituent subprocesses such as
retrieving representations of the analogs, mapping (aligning the representations and
projecting inferences from one to the other), abstracting the common system, and so
on . . .
But for at least an important range of analogies, including many standard analogies
in political science and economics, this decompositional view of analogy fails. There
are two reasons for its failure. First, the analogies I have in mind cannot be explained
as operating over pre-derived construals that are independent of the making of the
analogy. Rather, the construal of the inputs depends upon the attempt to make the
analogical mappings.
Second, models in this Artificial Intelligence tradition do not seem to allow a place
for analogical meaning to arise that is not a composition of the meanings and
inferences of the inputs, yet the analogies I have in mind include essential emergent
meaning (e. g. absurdity) that cannot be viewed as a conjunction of structures in the
inputs.
Forbus, Gentner, Markman, and Ferguson make their claims about the theoretical
consensus for decomposition of processes as part of an attack on Douglas Hofstadter,
or rather a counterattack, since Hofstadter had claimed that their work, and similar
work in the relevant AI tradition1, is hollow, vacuous, a "dead-end" because it takes as
given what Hofstadter calls "gist extraction." Gist extraction is "the ability to see to
the core of the matter." Hofstadter views this ability as "the key to analogy making—
indeed to all intelligence" (Hofstadter, 1995). In collaboration with David Chalmers
and Robert French (1992), Hofstadter argues that there is no illumination to be found
in this tradition because the programs compute over merely meaningless symbolic
structures, because these formal structures are cooked beforehand in ways that make
matching easy, and, most importantly, because the cooking is done by the
programmer, not the program. In Hofstadter's view, the programmer has already done
the all-important gist extractions, boiled the meanings out of them, and substituted in
their place formal sets of predicate calculus symbols that already contain, implicitly,
the highly abstract, nearly vacuous formal match. The programmer then provides
these formal nuggets to the program. A program that detects the formal match
between them is not making analogies.
It seems to me that the people who understand the nature of analogy in this acerbic
debate are the practical-minded non-academics who were actually making analogies
and disanalogies and posting them on the internet on the night of October 27, 1997—
Grey Monday, as it came to be called, once its aftermath was known. For them,
finding analogy or disanalogy is a process of forging, not merely finding, connections,
1 See e.g., Falkenhainer, Forbus & Gentner, 1989; Gentner 1983; Gentner & Gentner,
1983; Gentner & Stevens, 1983; Gick & Holyoak 1980, 1983; Holland, Holyoak,
Nisbett & Thagard, 1986; Holyoak & Thagard, 1989, 1995.
and to forge those connections requires forging the inputs as you forge the
connections, revising the entire system of inputs and connections dynamically and
repeatedly, until one arrives at a network of inputs and connections that is persuasive.
My claim that analogy works by forging such a network may seem at first
counterintuitive because it runs against the folk theory according to which "finding an
analogy" consists of comparing two things in the world and locating the "hidden"
matches. We speak of "seeing" the analogy, which presupposes that the analogy is
completely there to be seen. On this folk theory, things in the world are objectively as
they are, things match objectively or not, and analogies and disanalogies are scientific
discoveries of objective truth. This view is reassuring and attractive. By contrast,
when I speak of forging inputs and connections, with continual revision and
backtracking, to build a network of spreading coherence that is "persuasive," it may
sound as if I am offering a dismal and barbarous postmodern hash in which anything
can be anything, any construal of the inputs will do, any connections will serve, since
all meaning is invented, a mere "construct," anyway.
But not so. Human beings have, over time, invented many human-scale concepts to
suit their purposes—chair, sitting, rich, Tuesday, marriage—but these inventions are
highly constrained by both our biological endowment and our experience. First, there
are mental operations we all must use. Human beings must use conceptual framing,
categorization, blending, grammar, and analogy, for example. There is such a thing as
human nature, and it includes certain fundamental kinds of mental operation, analogy
being one of them. That is one kind of constraint. Second, profound constraints come
from success and failure. Some concepts and connections lead to success while others
lead to failure. Some help you live, some make you ill. With the right analogies, you
make a killing in the market, with the wrong ones, you get slaughtered. I have no
hesitation in saying that inventive forging of analogies can result in scientific
discovery of true analogies. In fact, it has resulted in scientific discovery of true
analogies. When a network is constructed that works, we call it true.
There is another reason that the folk theory of analogy appears attractive: after the
fact, in the rearview mirror, an established analogy usually looks exactly like a match
between existing structures, and it is easy to forget the conceptual work of forging
construals and connections that went into building the network.
Reforging the inputs while constructing the analogy was common procedure on the
night of Grey Monday. The analysts on the internet expressed revisions of the inputs
elaborately and unmistakably, using phrases like "What if what really happened on
Black Monday was . . ." and "You need to think of today's events not as X but instead
as Y."
I take it that this kind of reforging is typical for analogies in business and finance.
Consider the cover of The Economist for August 9, 1997. It shows a kite high in the
air and a man in a business suit flying it. The kite is labeled "Dow," for the Dow
Jones Industrials Average, and the caption reads "Lovely while it lasts." The final
conceptual product that comes out of understanding this analogy looks as if it matches
source and target and as if it projects an inference from source to target. But that
description of the product is not a model of the process.
When I think of someone flying a kite, at least a traditional kite like this one, rather
than a trick kite, I imagine that it is easy to do in good wind. If there is a difficult
stage in flying a kite, it is the beginning, when the kite is near the ground. Once the
kite is very high, it is much easier to keep aloft, given the relative constancy of the
wind and the absence of obstructions. The kite-flier wants to keep the kite at a single
high altitude, and when he has had his fun, he winds up his string.
The phrase "Lovely while it lasts" is conventionally used to suggest that "it" won't
last, and interpreting "it" as referring to the Dow suggests that the cartoon concerns an
impending fall in the market. Under pressure from this target, we can reconstrue the
source by recruiting to it some possible but peripheral structure: namely, gravity pulls
objects down with constant force, while winds are irregular; therefore, in some
moment, the winds will die and the kite will fall.
The inevitability of this fall is the inference to be projected to the target. But it is
constructed for the source only under pressure from the target.
If we look at this blend, we see that even though the organizing conceptual frame of
the blend is indeed flying a kite, much of its central structure does not come from that
source and indeed some of it conflicts strongly with that source. In the blend of
flying-a-kite and investing-in-the-stock-market, the kite-flier faces extreme difficulty
in keeping the kite aloft. In fact, he is physically struggling. Yet the kite is very high
and the winds are so fine that they are blowing the kite-flier's tie and hair forward.
This is highly unconventional structure for the source because, given the wind, he
should not be struggling at all.
We also know that this kite-flier is not satisfied merely to keep the kite up; he is a
special, bizarre, unique kite-flier with a special kite, who will be content only if the
kite constantly gains altitude, or meets some more refined measure, such as never
dropping in any given period of time lower than eight percent above its low in the
previous period. This is highly unconventional for the source.
In this blend, it is upsetting if the kite loses two percent of its altitude, dangerous if
it loses five percent, a major correction if it loses ten percent, and a complete disaster
if it loses thirty percent. Of course, in the source, none of these events presents any
problem at all; indeed, the only great disaster would be the kite's hitting the ground.
And yet, in the target, there is no possibility that the market could fall to zero, or even
down by half. We see, then, that the projection of inferences from the source is very
complicated. We need from the source the structure according to which constant
gravity will ultimately find a moment to overcome completely the inconstant winds,
but we cannot take from the source the inference that gravity will ultimately make the
kite fall to zero altitude and be smashed.
Now consider the man flying the kite. He is wearing a business suit and tie. This is
not impossible for the source, but it is odd, and the only motivation for building it into
the source is pressure from the target world of business and investment.
What counterpart in the target could the kite-flier in the source have? He must
correspond analogically to something in the target, since the analogy is about harm
that will come to people and institutions, not to the kite. This is a more complex
question than it might seem. Consider that, in the domain of kite-flying, the actual
kite-flier could make the kite crash, raise it by letting out string, lower it by taking in
string, or reel in his kite and go home. But this structure is not recruited for the
source, projected to the blend, or given counterparts in the target. The kite-flier in the
blend cannot be any of these kite-fliers. The kite-flier-investor in the blend cannot sell
the market short and then make the kite lose altitude; he cannot make the Dow kite
crash to the ground; he cannot sell his stocks and get out of the market at its peak;
paradoxically, it is not even clear that he can have any effect on the kite at all, even
though he is holding the string. He can be affected by what happens to the kite but
probably cannot influence the kite significantly. Moreover, a real investor can make
money even if the Dow Average stays fixed, by trading stocks as they rise and fall
individually. Indeed, this is the standard way to make money in the market, since
leaving out the effects of new investment in the market, there must be a loser for each
winner. This kite-flier in the blend is someone who is somehow invested in the
continuing ascent of the kite that is the Dow, perhaps someone whose money is largely
in Dow or S&P 500 index funds, or other Dow-oriented mutual funds. But notice that
in the source domain of kite-flying, there are no such kite-fliers. These kinds of kite-
fliers exist only in the blend, not in the source.
And finally, the string to the kite is not a possible kite string. It is a somewhat
smoothed graph of the Dow Average over something like the previous fifteen years.
Interestingly, Black Monday of 1987 is not visible, because including a sharp fall of
that sort, followed by the sharp rise, would deform the string unacceptably far from
the strictly increasing smooth curve of the source space. In the source, the path of the
kite-string is a snapshot in time of a line in space, while in the target, the path of the
Dow Average is a graph of the value of a variable over time. (This is why the sky in
the blend is ruled like graph-paper.) In the source, the path of the kite string has to do
with the physics of kites, strings, wind power, and gravity, which should be crucial for
the analogy, since the central inference of the analogy has to do with this physics,
namely, gravity will at some moment be stronger than the winds. In the source
domain, the kite string is indispensable for raising the kite—without it the kite would
surely fall, quickly in light wind.
As we have seen, the blend that provides the inferences of the analogy has structure
for the kite string that either ignores the central structure of the kite string in the source
or powerfully contradicts it. The view of analogy as retrieving pre-existing
representations of analogs, matching and aligning them, and projecting inferences
from the source to the target fails for this analogy, which, like the Figaro cartoon, is
meant to be instantly intelligible and persuasive.
The Figaro and Economist examples work as serious analogical arguments, meant
to be persuasive on central issues of politics and economics, but because they are in
the form of cartoons, it might be tempting to dismiss them as exceptional. On the
contrary, when we turn to celebrated examples discussed in the literature on analogy
in fields like psychology and computer science, we find the same operations of
blending and forging, although they are more easily overlooked because they are
somewhat less visible. Consider the well-known analogy discussed by Keith Holyoak
and Paul Thagard in Mental Leaps: Analogy in Creative Thought (1995) and earlier in
Gick and Holyoak (1983), in which the target analog is a tumor to be destroyed and
the source analog is a fortress to be stormed. The problem in the target is that only a
laser beam of high intensity will kill the tumor, but it would also kill any other cells it
encountered on the way; a beam of low intensity would not harm the patient but would
be ineffective on the tumor. The source analog is a fortress whose roads are mined to
blow up under the weight of many soldiers; a few can get through without harm, but
they will be too few to take the fortress. The solution to taking the fortress is to send
many small groups of soldiers along many roads to converge simultaneously on the
fortress and take it. Analogically, the solution to killing the tumor is to send many
laser beams of low intensity along many paths at the tumor, to arrive simultaneously
and combine to have the effect of a beam of high intensity.
The analogy looks, after the fact, like a straightforward matching of source and
target and projection of useful inferences, but if we look more closely we see, I think,
that this source was put together in this fashion under pressure to make this analogy.
Of course, after the target and source are put together in the right ways so that the
analogy will work, they can be handed to someone as analogs to be connected in a
straightforward fashion, but connecting these pre-built representations is not
understanding analogy.
Consider the actual military situation in the source. When combat resources are
plentiful and easily replaced, commanders facing a crucially important military
objective have historically not hesitated to sacrifice soldiers and replace them. The
straightforward solution for the source is to run animals or soldiers up the road,
sacrificing as many as necessary to clear the mines. With a sufficient supply of
soldiers, the mines will present no problem and the fortress will be taken. After all,
there cannot be many mined places. The residents of the fortress must be able to
move vehicles over the roads, which they could do only by avoiding the few places
that are mined. Moreover, only some spots on a road are suitable for mining in any
event. Bridges, for example, are rarely mined because the mines are too easily
detected. There is no point in mining the road if the soldiers can simply walk through
the field alongside it, so one must either install entire fields of mines or pick very
narrow passes in the topography for placing mines.
But these straightforward and conventional military framings of the source do not
serve the analogy, so the representations of the source given in the scholarship
typically rebuild the source artificially so as to disallow them. For example, the
representation given in Gick and Holyoak and again in Holyoak and Thagard is this:
the attacking general has just enough men to storm the fortress—he needs his entire
army, so cannot sacrifice any of them. The purpose of this weird representation of the
source is clearly to disallow the standard representations so the analogy will work.
That particular forging of the source in the service of the analogy is explicit, but
some other crucial forgings are only implicit. For example, I have told the fortress
story to military officers of various ranks. One of them responded, "it says the fortress
is situated in the middle of the country, surrounded by farms and villages. Why
doesn't the general just send his troops through the fields?" This is an excellent
objection. However, that construal is implicitly disallowed. The Fortress Story tells
us that the attacking general is a "great general," and that he solves this problem by
dividing up his army and sending them charging down different roads. We know that
a "great general" could not have missed so obvious a solution as marching his troops
through the field, and also suspect that the defender of the fortress is unlikely to be so
inept as to mine roads running through open fields, so we conclude that in some
unspecified way the source does not allow this possibility, even though nothing
explicit forbids it. The officer asking the excellent question was answered by a
companion officer, "All of the roads must go through narrow passes or something."
The most profound conceptual reforging in the service of making analogical
connections between tumor and fortress is the most subtle. In the source, it is an
unchangeable truth and a central point in military doctrine that the armed force one
can bring to bear is also a vulnerable asset one does not wish to lose. For example, the
British Grand Fleet during World War I was exceptionally strong, but its sheer
existence as a "force-in-being" was so important that it was almost never risked in
actual battle, the single exception being the Battle of Jutland in 1916, the only major
naval battle of the war. In the source, the force and the vulnerability cannot be
separated, and their inseparability is crucial. But if the tumor-fortress analogy is to go
through, they must somehow be separated, because in the target, the force is not
vulnerable. As Holyoak and Thagard note, the laser beam and the laser are not at risk.
Nor can the vulnerability of the force in the source be ignored, because vulnerability is
indispensable structure for the target. The solution is to take what cannot be separated
in the source and to conceive of it as having two aspects—a force whose intensity
varies with the number of soldiers that constitute it, and the physical soldiers who are
vulnerable. These aspects are projected to the blend separately. The military force
with variable intensity is blended with the laser beam; the vulnerable soldiers are
blended with the patient. Again, we see that the important work of analogy is not to
match analogs but, more complexly, to create an integration network which requires
reinterpretation of the analogs.
It may still be tempting to dismiss these examples as inconsequential. Two are
cartoons and one is a hypothetical problem of the sort dreamed up by psychologists
and inflicted upon college students as subjects. However, my last example is a
historical analogy that established policy, changed law, altered the urban landscape,
and cost plenty of money. It is Justice William O. Douglas's invention of a policy as
expressed in his opinion in a case in 1954 on the constitutionality of the Federal Urban
Renewal Program in Washington, D. C. Douglas needed to justify a policy according
to which the Federal government would be authorized to condemn and destroy entire
urban areas, even though nearly all of the privately-owned properties and buildings to
be destroyed met the relevant legal codes, and most of those were in fact individually
unobjectionable. Douglas hit upon the analogical inference that, just as an entire crop,
nearly all of whose individual plants are healthy, must be destroyed and entirely
replanted when some small part of it is blighted, so an urban area, nearly all of whose
individual buildings, utilities, and roads are satisfactory, must be completely destroyed
and redesigned from scratch when it has become socially unsavory. The following
paragraph suggests his reasoning: . . .

. . . source to the target. But that hindsight analysis misses, I propose, the essential
cognitive operations and conceptual work.
If analogy in general involves dynamic forging of analogs, connections, and blends
as we create a network of spreading coherence, then we must find a new model of
analogy. I nominate the Fauconnier & Turner network model of conceptual
integration for the job.
References
Holland, J. H., Holyoak, K. J., Nisbett, R. E., and Thagard, P. R. 1986. Induction: Processes of inference, learning, and discovery. Cambridge: MIT Press.
Holyoak, K. J. and Thagard, P. 1989. "Analogical mapping by constraint satisfaction." Cognitive Science 13(3), pages 295-355.
Holyoak, K. J. and Thagard, P. 1995. Mental leaps: Analogy in creative thought. Cambridge: MIT Press.
Schön, Donald and Martin Rein. 1994. Frame Reflection: Toward the Resolution of Intractable Policy Controversies. New York: Basic.
Turner, Mark. 1996a. "Conceptual Blending and Counterfactual Argument in the Social and Behavioral Sciences," in Philip Tetlock and Aaron Belkin, editors, Counterfactual Thought Experiments in World Politics. Princeton, N.J.: Princeton University Press, pages 291-295.
Turner, Mark. 1996b. The Literary Mind. New York: Oxford University Press.
Turner, Mark. 1991. Reading Minds: The Study of English in the Age of Cognitive Science. Princeton: Princeton University Press.
Turner, Mark. 1987. Death is the Mother of Beauty: Mind, Metaphor, Criticism. Chicago: University of Chicago Press.
Turner, Mark. 1989. "Categories and Analogies," in Analogical Reasoning: Perspectives of Artificial Intelligence, Cognitive Science, and Philosophy. Edited by David Helman. Dordrecht: Kluwer, pages 3-24.
Turner, Mark and Gilles Fauconnier. 1995. "Conceptual Integration and Formal Expression." Metaphor and Symbolic Activity 10(3), pages 183-203.
Turner, Mark and Gilles Fauconnier. [in press a] "Conceptual Integration in Counterfactuals," in Conceptual Structure, Discourse, and Language, II. Edited by Jean-Pierre Koenig. Stanford: Center for the Study of Language and Information.
Turner, Mark and Gilles Fauconnier. [in press b] "A Mechanism of Creativity." Poetics Today.
Rough Sea and the Milky Way:
'Blending' in a Haiku Text*
Masako K. Hiraga
Abstract. This paper claims that the model of 'blending' proposed by Turner
and Fauconnier [16, 17] offers a useful tool for understanding poetic creativity
in general and metaphors in haiku1 in particular. It is one of the
characteristics of haiku that two or more entities (objects, ideas, and feelings)
are juxtaposed by loose grammatical configurations such as kireji (‘cutting
letters’) and kake-kotoba (‘hanging words’ or multiple puns). The
juxtaposed entities are put in comparison or equation, and contribute to
enriching the multi-layered metaphorical meaning of haiku. The analysis of
a sample text, a haiku describing a rough sea by Basho Matsuo, demonstrates
the effectiveness of ‘blending’ as an instrument for understanding the
cognitive role played by (i) metaphorical juxtaposition by kireji and (ii)
iconicity of the foregrounded elements in the text.
* I am indebted to Joseph Goguen and Mark Turner for their invaluable comments and suggestions.
1. Haiku, or hokku as it was called during the time of Basho (1644-1694), is the shortest form of Japanese traditional poetry, consisting of seventeen morae divided into three sections of 5-7-5. Originating in the first three lines of the 31-mora tanka, haiku began to rival the older form in the Edo period (1603-1867). It was elevated to the level of a profoundly serious art form by the great master Basho, and has since remained the most popular poetic form in Japan. Originally, the subject matter of haiku was restricted to an objective description of nature suggestive of one of the seasons, evoking a definite, though unstated, emotional response. Later, its subject range was broadened, but it remained an art of expression suggesting as much as possible in the fewest possible words. Along with the 31-mora tanka, haiku is composed by people of every class, men and women, young and old. As the Japanese language has only five vowel sounds, [a], [e], [i], [o], and [u], with which to form its morae, either by themselves or in combination with a consonant as in consonant-vowel sequences, it is not possible to achieve rhyming in the sense of European poetry. Brevity, suggestiveness, and ellipsis are the life and soul of haiku and tanka. The reader is invited to read the unwritten lines with the help of imagination and background knowledge.
Literary texts can be metaphoric on two levels: local and global. On the one hand,
literary texts display local metaphors, which are based on either conceptual mappings,
image mappings, or a combination of both. Conceptual mappings are often based on
conventional cognitive metaphors, which literary metaphors either extend, elaborate
or combine in a novel way. On the other hand, some texts as a whole can be read
holistically as global metaphors. According to Lakoff and Turner [9, pp. 146-147],
such global metaphorical readings are constrained in three major ways: (1) by the use
of conventional conceptual mapping; (2) by the use of commonplace knowledge in
addition to conventional metaphors; and (3) by iconicity -- a mapping between the
structure of a poem and the meaning or image it conveys. This last constraint of
iconicity, a mapping from the structure of language to the structure of the image in the
text as well as to the overall meaning of the text, is of particular importance because it
contributes to our recognition of the degree of organic unity of a text.
Hiraga [3] demonstrates that the many-space model is useful in analysing short
poetic texts such as haiku, which have rather obscure grammatical constructions and
dense cultural implications, for the following two reasons: (1) the ‘blending’ model
stresses the importance of “the emergent structure” of the blended space activated by
inferences from the input spaces and the contextual background knowledge, and
therefore, provides an effective tool for understanding the creativity of literary
metaphors (not only of haiku but also of any poetic text); (2) the many-space
approach, which does not specify unidirectional mapping between input spaces,
provides a better explanation of the rhetorical effects produced by loose grammatical
configurations in the haiku texts such as the juxtaposition of phrases by kireji
(‘cutting letters’) and kake-kotoba (‘hanging words’ or multiple puns) or those
produced by personification and allegory. One additional implication of the analysis
presented in Hiraga [3] is that understanding haiku texts, which are extremely short in
form and rich in traditional implications, requires common prior knowledge which is
long-standing in Japanese culture, and which shapes the cultural cognitive model. A
non-exhaustive list of the features of such knowledge would include: (1) pragmatic
knowledge of the context such as time, place, customs, life, etc., which contextualise
the poetic text in general terms; (2) folk models, which originate from myth and folk
beliefs about the conceptualisation of existing things; (3) conventional metaphors, in
Lakoff and Johnson's sense, which have been conventionalised in a given speech
community over time, and which a poet exploits in non-conventional ways; and (4)
the iconicity of kanji, Chinese logographs,2 which link form and meaning, particularly
with regard to their etymological derivation, and thereby serve as a cognitive medium
for haiku texts. The blending model provides an account for the process of
integration of these features of background knowledge in the reading of texts.
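To make the many-space machinery concrete, the following minimal sketch (not from Hiraga's paper; all structures and names are illustrative) shows how input spaces, a generic space, recruited background knowledge, and an emergent blended space might be represented for this kind of analysis:

    # Illustrative sketch of the many-space blend structure described above.
    from dataclasses import dataclass, field

    @dataclass
    class Space:
        name: str
        elements: set = field(default_factory=set)

    @dataclass
    class Blend:
        inputs: list          # input mental spaces
        generic: Space        # shared schematic structure
        background: set       # recruited cultural/contextual knowledge
        emergent: set = field(default_factory=set)

        def integrate(self):
            # Emergent structure: projections from the inputs enriched by
            # background knowledge, not reducible to either input alone.
            for space in self.inputs:
                self.emergent |= space.elements
            self.emergent |= self.background
            return self.emergent

    rough_sea = Space("rough sea", {"violence", "obstacle", "Sea of Japan"})
    milky_way = Space("Milky Way", {"river of heaven", "grandeur", "separated stars"})
    blend = Blend(
        inputs=[rough_sea, milky_way],
        generic=Space("generic", {"water", "crossing"}),
        background={"Sado as place of exile", "legend of the separated lovers"},
    )
    print(blend.integrate())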
The present paper looks at one of the most famous haiku compiled in the travel
sketch by Basho called Oku no hosomichi,3 an acknowledged masterpiece in Japanese
literature [11]. The poem was chosen because (1) it has a kireji which divides the
text into two parts and puts them in metaphorical juxtaposition, and (2) the revision
done by Basho results in foregrounding the elements written in kanji, which play a
cognitive role to strengthen the organic unity of the text through iconicity. In my
analysis, I hope to demonstrate that cognitive poetics offers explanations of the
dynamic creativity of poetic meanings emergent out of blends as well as the organic
unity of form and concept expressed in the text.
2 Analysis
Example 1

荒海や佐渡によこたふ天河

araumi ya / Sado ni yokotau / ama no gawa
('rough sea:' / '[which] lies toward Sado' / 'the river of heaven')4
2 The term ‘logographic’ will be used instead of ‘ideographic,’ because most kanji characters
correspond to words rather than ideas.
3 Oku no hosomichi was written as a travel sketch which consisted of a main narrative body,
fifty haiku poems by Basho and a few other poems by other authors. The fifty haiku poems are
considered as an integrated text in its own right, conforming to the general principle of
composition and structural congruence.
The poem at first glance describes natural scenes. On the one hand, the sea is rough;
and on the other hand, over one’s head, there is the Milky Way arching toward the
island of Sado. Even if one does not have much pragmatic knowledge about Sado
Island or the Milky Way in Japanese history and culture, one may sense a grandness
of scale depicted by this haiku. It is a starry night. The Milky Way is magnificent.
The grandeur of the Milky Way is put in contrast to a dark rough sea beneath the
starry skies. The waves are terrifying; the water churns and moans, as if it would not
allow the boats to cross. The crossing is dangerous and frightening in the night. This dark sea
does indeed separate the people living on the island of Sado from the mainland. The
island is visible across the troubled waves, perhaps with its scattered house-lights.
Human beings (including the poet) are so small in the face of the spectacular pageant
of powerful nature. And yet there are thousands of human lives and stories
embedded in the scenes.
The first five-syllable segment, araumi ya, consists of a noun, araumi (‘rough
sea’), and a kireji (‘cutting letter’), ya. Kireji, rhetorical devices used in tanka and
haiku, consist of about a dozen particles that mark a division point in the text.
Although the functions of the division vary according to the particles, a general effect
of kireji is to leave room for reflection on the feelings or images evoked by the
preceding segment. Ya in Example 1 is a kireji particularly favoured by Basho and
said to have “something of the effect of a preceding ‘Lo!’ It divides a haiku into
two parts and is usually followed by a description or comparison, sometimes an
illustration of the feeling evoked. There is always at least the suggestion of a kind of
equation, so that the effect of ya is often best indicated by a colon” [2, p. 189]. That
is, araumi (‘rough sea’) and the rest of the text, Sado ni yokotau ama no gawa (‘the
Milky Way, which lies toward Sado’), are juxtaposed to constitute a kind of metaphor
in which the feelings or images evoked by a rough sea are illustrated by the feelings
or images evoked by the Milky Way arching over the Island of Sado.
The next seven-syllable segment, Sado ni yokotau (‘[which] lies toward Sado’), is
an adjectival clause which modifies the last five-syllable segment, ama no gawa (‘the
river of heaven’). Sado is a place name, an island located about 50 miles away from
the coast of mid-Honshu. Ni (‘toward’) is a postpositional particle of location.
Yokotau (‘to lie’) is a verb which normally has an animate agent and describes an
action (when used as a transitive verb) or a state (when used as an intransitive verb) of
spreading one’s body on something flat. As the grammatical subject of yokotau in
this poem is ama no gawa (‘the river of heaven’), an inanimate noun, the verb is used
metaphorically. The last five-syllable segment, ama no gawa, is a proper noun
signifying the Milky Way. It also involves a metaphor in which the path-shaped set
of stars is seen as a river. The second and the third segments of the poem thus
constitute a local metaphor, in which the river of stars in the heavens spreads its body
toward the Island of Sado. There are conventional conceptual metaphors behind this
local metaphor, namely, NATURE IS ANIMATE5 (in this case RIVER IS
ANIMATE6), and A PATH-SHAPED OBJECT IS A RIVER.

4 The word-for-word translation is given by the author and is not taken from [12]; the author
consulted [10] and [12] in preparing it.
Now how does this local knowledge about the grammatical and rhetorical
structure of this poem relate to the understanding of the whole text? There are at
least two major input spaces created at the reading of this poem: a rough sea and the
Milky Way. These two input spaces are juxtaposed and mediated by the use of kireji.
The input space of araumi (‘rough sea’) connotes the Sea of Japan, which is famous
for its violent waves, and which geographically lies between the mainland and Sado
Island. Although syntactically Sado modifies ama no gawa (‘the river of heaven’),
the configurational proximity and the semantic continuity of araumi and Sado seem to
suggest a metonymic reading of araumi, particularly at the time of on-line processing
of meaning: that is, a local blend of the rough sea and Sado Island. This does not
deny, however, an interpretation of Sado and ama no gawa as being another local
blend, based on the grammatical proximity. The important point here is rather that
the understanding of this poem requires an array of blending, not only sequentially
but also simultaneously. It could be that the input space of Sado simultaneously
relates to the input spaces of a rough sea and the Milky Way.
Let us first consider the background knowledge recruited at the time of the blend,
for the Island of Sado and the Milky Way have rich cultural implications. Sado
Island has a long history. The island is geographically separated from the mainland
by the Sea of Japan. Because the rough waves prevented people from crossing the
sea by boat, the island functioned as a place of exile for felons and traitors from the
10th century up to the end of the 19th century. At the same time, gold mines were
discovered there in the early 17th century, and attracted all kinds of people. At the
time of Basho (1644-1694), the Tokugawa Shogunate had control of the gold mines,
and the people imprisoned on the island were forced to serve as free labour there.
Thus, the metonymy of a rough sea with Sado Island activates the cultural and
historical meanings of the island. Also, the roughness of the waves is consonant
with the roughness of life on the island which involves violence, cruelty, despair, and
so on. Another important point is that the name of this island is written in two
Chinese logographs, which mean ‘to help’ and ‘to cross’ respectively. The cognitive
meanings of the logographs, particularly that of ‘crossing,’ seem to be mapped onto
the image of a rough sea at the time of the blend. One can probably detect, in the
generic space of these two inputs, workings of such salient conventional metaphors as
LIFE IS A BOAT JOURNEY and THE WAVES ARE AN OBSTACLE TO SUCH A
JOURNEY. The difficulty of crossing is highlighted and emergent in the blend,
which further reinforces the sad feelings relating to the difficulty of reunion by
separated people. The blend is built up by recruiting structures from the input spaces.
5 Metaphorical concepts are indicated in uppercase letters.
6 Some rivers have human male names, such as Bando-Taro (‘place name + male name’) for the Tone
River. Furthermore, rivers are prototypically metaphorised as snakes in Japanese idioms, e.g.,
kawa ga dakoo-suru (‘a river snakes’), kawa ga hebi no yoo-ni magaru (‘a river curves like a
snake’), etc.
Foregrounding by Kanji. Let us first look at the visual elements. The Japanese
language has a unique writing system in which three different types of signs are used
to describe the same phonological text: kanji (Chinese logographs), hiragana
(syllabary for words of Japanese origin), and katakana (syllabary for words of foreign
origin other than Chinese). In the context of the present discussion, logographs are
of particular importance because they function as a cognitive medium for poetry.
Basho revised this poem orthographically from 2a to 2b [11].
Example 2

a. 荒海や佐渡に横たふ天河
b. 荒海や佐渡によこたふ天河
The poem’s three noun phrases, araumi, Sado and ama no gawa, were all spelled in
kanji in both the first (Example 2a) and the revised (Example 2b) versions. The
boxed part, the verb of lying, was revised from kanji, a Chinese logograph, to
hiragana, two syllabic letters. The main effect of changing the character type of the
verb yokotau (‘to lie’) from kanji to hiragana is to make that part of the text
a ground for the conspicuous profile of 荒海 (‘rough sea’) and 天河 (‘milky way’).
In general, because kanji, being logographic characters, have a distinct angular form
and semantic integrity, they differentiate themselves visually and cognitively as the
figure, while the remaining hiragana function as the ground.
The words that contribute to creating the input spaces at the time of the blends, i.e.,
荒海 (‘rough sea’) and 天河 (‘milky way’), were spelled in kanji. Sado 佐渡, a place
name, is also written in kanji. Notice also that the three nouns, araumi (‘rough sea’),
Sado, and ama no gawa (‘milky way’), are each written in two kanji. Also, 海 (‘sea’),
渡 (‘to cross water’), and 河 (‘river’) in these three nouns (underlined in Examples 2a
and 2b) share the same radical, which signifies water. Both 荒海 (‘rough sea’) and 天河
(‘milky way’) relate to water, as described above. The semantic similarity between
荒海 (‘rough sea’) and 天河 (‘milky way’) in terms of ‘wateriness’ and the obstacle
(in real life and in the legend explained above), and their dissimilarity (violence in
the ‘rough sea’ and peacefulness in ‘the river of heaven’), are also foregrounded.
This is a case of diagrammatic iconic effect, intensifying the meaning of the
foregrounded elements by the repetitive use of similar visual elements -- two-
character nouns and the same radical.
In addition, 渡 (‘to cross water’) in 佐渡 (Sado), the name of the island, seems
important, because this logograph means ‘to cross.’ As the background history and
the legend show, both the ‘rough sea’ and ‘the milky way’ are obstacles for the loved
ones crossing for their meeting. This character is placed in the middle of the poem,
as if it signalled the crossing.
Sound Patterns. The sound structure also exhibits interesting iconic effects which
contribute to supporting the interpretations drawn by the theory of blending. The
following analysis illustrates three possible iconic effects produced by the distribution
of vowels, consonants, and the repetition of adjacent vowels. Example 3 is a
phonological notation of the poem’s syllabic structure.
Example 3
Line 1 a-ra-u-mi ya ([-] = syllabification)
Line 2 sa-do ni yo-ko-ta-u
Line 3 a-ma no nga-wa ([ng] as in thing)
Firstly, the distribution of the vowels shows that the poem dominantly uses back
vowels such as [a] and [o]. As indicated in Table 1, there are 9 [a]’s (53%) and 4
[o]’s (24%) out of 17 vowels.
Table 1. Distribution of vowels

          a    o    i    u    e    Total
Line 1    3    0    1    1    0      5
Line 2    2    3    1    1    0      7
Line 3    4    1    0    0    0      5
Total     9    4    2    2    0     17
[a] and [o] are pronounced with a wide passage between the tongue and the roof of
the mouth, and with the back of the tongue higher than the front. The backness and
the openness often create ‘sonorous’ effects which may draw associations of
something deep and large [cf. 4]. In this poem, perhaps, these effects have
something to do with the largeness of waves in the rough sea and the depth and width
of the river of heaven.
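The figures in Table 1 can be checked mechanically; a short, purely illustrative sketch counts the vowel distribution over the romanized lines of Example 3:

    # Count vowel distribution per line of the romanized haiku (Example 3),
    # reproducing the figures in Table 1.
    lines = ["araumiya", "sadoniyokotau", "amanongawa"]
    vowels = "aoiue"
    for i, line in enumerate(lines, 1):
        counts = {v: line.count(v) for v in vowels}
        print(f"Line {i}:", counts, "total", sum(counts.values()))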
The sonorous effects are also created by the frequent use of nasals ([m], [n], and
[ng]), and vowel-like consonants ([y] and [w]). Table 2 shows the distribution of
consonants:
Dominance of sonorants such as [m], [n], [ng], [r], [y], and [w] is characteristic of the
text. The sonorants often provide prolongation and fullness of the sounds, and hence
usually produce lingering effects [cf. 13, pp. 10-12]. It could be argued that the back
vowels and sonorant consonants jointly reinforce a sound-iconic effect of the ‘depth’
or the ‘largeness’ of the image of ‘water’ elements, i.e., a rough sea and the river of
heaven expressed by the poem. Also note that the only line that has obstruents (i.e.,
non-sonorants such as [s], [d], [k], and [t]), is Line 2, in which the island is
mentioned. If one can interpret ‘sonorants’ as iconically associated with ‘water’
elements, then one can also infer that ‘obstruents’ are associated with ‘non-water,’
namely, the island in this text.
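The same kind of check applies to the consonants. A small sketch (again illustrative; the [ng] of nga-wa is written here as 'N') classifies each line's consonants as sonorants or obstruents, confirming that obstruents occur only in Line 2:

    # Classify the consonants of each line (Example 3) as sonorants or
    # obstruents; only Line 2, which names the island, contains obstruents.
    sonorants = set("mnrwyN")        # 'N' stands for [ng]
    obstruents = set("sdkt")
    lines = ["araumi ya", "sado ni yokotau", "ama no Nawa"]
    for i, line in enumerate(lines, 1):
        cons = [c for c in line if c in sonorants | obstruents]
        obs = [c for c in cons if c in obstruents]
        print(f"Line {i}: consonants={cons} obstruents={obs}")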
The last point is that the text seems to conceal, very cleverly and wittily, a key
word which is congruous with the meaning of the poem. The prototypical sound
sequence in Japanese is an alternation of a single consonant and a single vowel such
as CV-CV-CV. This general feature applies to the haiku text, too. A closer look,
however, enables us to recognise that there are a few occurrences of two vowels, [a]
and [u], adjacent to each other, such as [a-u]. They occur in araumi in Line 1 and
yokotau in Line 2. In Line 3, there is a similar sound sequence, gawa, as [w] is
phonetically close to [u]. It could be said that each line of the poem has a vowel
sequence, [a-u], hidden in the sound sequence of a word or two adjacent words.
Very interestingly, this vowel sequence, [a-u], is a verb in Japanese, which means ‘to
meet.’ The hidden repetition of [a-u] (‘to meet’) in each line could be read as an
echo of a hidden longing of the separated people. Again, the iconic effect of this
hidden element supports the reading of the text as a global metaphorical juxtaposition,
i.e., the separation of the two stars on either side of the Milky Way mapped onto the
separation of the people on the Island of Sado from their loved ones on the mainland.
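The hidden [a-u] sequence can likewise be located mechanically; a minimal, illustrative sketch over the romanized lines (treating [a-w-a] as the near-equivalent noted above):

    # Search each romanized line for the hidden vowel sequence [a-u],
    # or its near-equivalent [a-w-a], since [w] is phonetically close to [u].
    import re
    lines = ["araumi ya", "sado ni yokotau", "ama no gawa"]
    for i, line in enumerate(lines, 1):
        s = line.replace(" ", "")
        hits = re.findall(r"au|awa", s)
        print(f"Line {i}: {hits}")   # every line conceals a-u, 'to meet'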
3 Conclusion
The study of the haiku text, taken from Basho's Oku no hosomichi, has pointed out
that the blending model proposed by Turner and Fauconnier [16, 17] provides an
effective tool for understanding the creative mechanisms of haiku. It has been claimed
that the cognitive projection derived by the metaphorical juxtaposition by kireji
(‘cutting letters’) is to be explained as a global blend which integrates input mental
spaces, which are at the same time locally blended spaces. This integration occurs
as a dynamic process of ‘making sense’ over the entire array of many mental spaces,
recruiting from cultural and historical knowledge and other background
contexts, and thus creating emergent structures.
Interpretations of the literary text are constrained in certain ways -- by the use of
conventional conceptual mapping, by commonplace knowledge and by iconicity
between the structure and the meaning. The analysis has demonstrated that the
reading of haiku is also dependent on these factors. Basho used conceptual
metaphors, and exploited almost every possible resource in lexicon, syntax, and
orthography to multiply the implications of the short poetic text, e.g., kireji (‘cutting
letters’), kanji (‘Chinese logographs’), allusions, and sound patterns. It is
indispensable to rely also on cultural and historical background knowledge to
understand the enriched meanings of his texts. Finally, iconicity is of particular
importance in a short poetic text such as haiku because brevity seems to require the
form itself to participate in giving images, concepts, and feelings. This has been
demonstrated by Basho’s clever use of kanji and sound structure in visual, auditory
and cognitive terms.
References
7 Where a reference gives two years of publication, the year listed first is that of the access
volume according to which the citation is made, and the year in brackets is that of the source or
original work.
Pragmatic Forces in Metaphor Use:
The Mechanics of Blend Recruitment in Visual
Metaphors
Tony Veale
1 Introduction
If a software agent is to fluently interact with, and act on behalf of, a human user,
it will require competence with both words and pictures, and metaphors which
combine the two. However, a multitude of pragmatic pressures interact to shape
the generation and interpretation of such multimedia metaphors. These pressures
range from the need to relax strict isomorphism when identifying a mapping re-
lationship between the tenor and vehicle domains, to recruiting intermediate
blends, or self-contained metaphors, as mediators between certain cross-domain
elements that would otherwise be considered too distant in conceptual or imag-
inistic space to make for an apt and aesthetically coherent metaphor. To apply
Hofstadter’s terminology of [6], such pressures fall under the broad rubric of
‘conceptual slippage’. Slippage mechanisms allow a metaphor’s content or mes-
sage to fluidly shift from one underlying concept to another, maximising the
structural coherence of the network of ideas that comprise the message.
This paper examines the complex interactions between these various slippage
pressures, and how they can be accommodated within a computational framework
that can potentially be exploited by a software agent. Though such fluid aspects
of metaphor can be accounted for structurally, they nevertheless demonstrate
that metaphor entails more than a simple structure-matching solution to the
graph-isomorphism problem, harnessing a range of on-the-fly reasoning processes
Fig. 1. The domain structure for Microsoft, in which Microsoft creates and controls Windows’98 (whose parts include MS-Excel, MS-Word, and IExplorer), IExplorer creates IExplorerUserBase, which in turn affects Microsoft, and MicrosoftSoftware, a "soft" product targeting the MassMarket, connects to the rivalry with NetscapeInc, creator and controller of NetscapeNavigator, which enables WebAccess and creates NetscapeUserBase.
Highlighted in Figs. 1 and 2 are the relational chains common to both that
might conveniently be termed the backbones of each domain structure. In Fig. 1
we see that Microsoft creates (and controls) Windows’98, which in turn contains
the browser IExplorer, which creates a market for itself denoted IExplorerBase,
which in turn reinforces Microsoft as a company. Similarly, in Fig. 2 we note
that CocaCola creates (and controls the makeup of) Coke six-packs, which con-
tain cans of Coke-branded soda, which generate a market for themselves denoted
CokeMarket, which in turn reinforces CocaCola’s corporate status. In the vocab-
ulary of the Sapper approach, we denote these relational chains using the path
notation Microsoft–create→Windows–part→IExplorer–create→IExplorerUserBase
–affect→Microsoft and CocaCola–create→CokeSixPack–part→CokeCan#6–
create→CokeMarket–affect→CocaCola respectively. Both of these pathways are
isomorphic, and ultimately grounded in a metaphoric bridge that reconciles Mi-
crosoftSoftware with ColaSoftDrink (both are, in a sense, “soft” products that
are aimed at the mass market). This allows Sapper to generate a partial inter-
pretation of the analogy that maps Microsoft to CocaCola, Windows’98 to a
sixpack of Coke, IExplorer to a can of Coke (labelled CokeCan#6 in Sapper’s
network representation of memory) and IExplorerUserBase to CokeMarket.
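As a rough illustration of this path notation (a toy sketch, not the Sapper implementation; relation and concept names follow the figures), each domain can be held as a graph of (relation, target) edges, with two paths counted as isomorphic when their relation sequences match:

    # Toy sketch of the path representation described above.
    tenor = {
        "Microsoft": [("create", "Windows98")],
        "Windows98": [("part", "IExplorer")],
        "IExplorer": [("create", "IExplorerUserBase")],
        "IExplorerUserBase": [("affect", "Microsoft")],
    }
    vehicle = {
        "CocaCola": [("create", "CokeSixPack")],
        "CokeSixPack": [("part", "CokeCan#6")],
        "CokeCan#6": [("create", "CokeMarket")],
        "CokeMarket": [("affect", "CocaCola")],
    }

    def paths(graph, node, limit=6):
        """All relational chains of up to `limit` relations starting at node."""
        if limit == 0:
            return [[]]
        out = [[]]
        for rel, nxt in graph.get(node, []):
            for tail in paths(graph, nxt, limit - 1):
                out.append([(rel, nxt)] + tail)
        return out

    def mappings(t_root, v_root):
        """Pair tenor/vehicle paths whose relation sequences are isomorphic."""
        for p in paths(tenor, t_root):
            for q in paths(vehicle, v_root):
                if p and [r for r, _ in p] == [r for r, _ in q]:
                    yield list(zip([n for _, n in p], [n for _, n in q]))

    for m in mappings("Microsoft", "CocaCola"):
        print(m)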
Microsoft and CocaCola are viewed by Sapper as the root concepts of the
analogy, and all isomorphic pathways within a certain horizon, or size limit, orig-
inating at these nodes are considered as the basis of a new partial interpretation.
Typically Sapper only considers pathways that comprise six relations or fewer.
Fig. 2. The Mirror Domain to that of Fig. 1, Illustrating Similar Market Dynamics at Work in the Rivalry between CocaCola and PepsiCo.
Blending, on Fauconnier and Turner’s account, integrates two conceptual structures to create another, a structure which owes its semantic foun-
dations to its inputs but which also possesses an independent conceptual reality
of its own. Blending theory thus posits a multi-space extension of the classic
two-space model of metaphor and analogy, in which the traditional inputs to the
mapping process, the tenor and vehicle, are each assumed to occupy a distinct
mental space, while the product of their conceptual integration is also assumed
to occupy a separate output space of its own. This allows the newly blended
concept to acquire associations and conventions that do not strictly follow from
the logical makeup of its inputs. For instance, the concept BlackHole is a con-
venient and highly visual blend of the concepts Blackness and Hole, one which
enjoys continued usage in popular and scientific parlance despite evidence that
blackholes are neither hole-like nor black in any real sense (i.e., while blackholes
are conveniently conceptualized as holes in the fabric of space-time, they are now
understood to emit gamma radiation, and are thus not truly black in a scientific
sense; furthermore, these emissions cause the blackhole to shrink, whereas a real
hole should grow larger the more substance it emits).
In addition, blend theory allocates a distinct space, called generic space, to
those schemas which guide the construction of a blend. These schemas operate
at a low-level of description, typically the image-schematic level, and serve both
as selectional filters and basic structure combinators for the input spaces. View-
ing Sapper from the perspective of blending theory then, the tenor and vehicle
structures correspond to the input spaces of the blend, while the lattice of cross-
domain bridges newly established in memory corresponds to the output blend
space. It follows that Sapper’s generic space is the set of conceptual schemas
that enable the generation of this lattice of metaphoric and analogical map-
pings. Thus far we have encountered just one of these schemas, X–metaphor→Y,
but it is reasonable to assume that for every distinct pragmatic force that can
affect the shape of a given metaphoric mapping there will be a corresponding
mapping schema in generic space. By identifying these forces then, one can more
clearly theorize about their underlying generic schemas, and so begin to model
these schemas within a computational framework.
The example of Fig. 3 represents a very real and complex illustration of the
pragmatic pressures that interact to create a visually apt metaphor. Here we see
the Economist newspaper use an easily identified piece of consumer gadgetry, a
‘Tamagotchi’ virtual pet, to make a searing indictment of the Japanese financial
system: ‘Firms such as Yamaichi [Japan’s 4th-largest brokerage, recently col-
lapsed] have been kept alive as artificially as the “virtual pets” in Tamagotchi
toys: thank goodness those infernal gadgets are finally being turned off’.
Taken from a serious political newspaper, such a visual metaphor must be
eye-catching yet appropriate, and complex (with a non-trivial political message)
yet instantly understandable.
Fig. 3. A striking visual blend of a ‘Tamagotchi’ game and the Japanese financial
situation after the Yamaichi Brokerage scandal. (Source: ‘The Economist’, Nov.
29, 1997)
3.2 Double-Think
When one describes a person as a Wolf, one rarely employs a realistic schema for
Wolf, but a stereotypical model which many people now know to be false. This
archetype is closer in nature to the cartoon caricatures of Chuck Jones and Tex
Avery (e.g., lascivious, treacherous, ruthless and greedy) than to accepted reality
(e.g., that a wolf is a family animal, with strong social ties). This caricature is an
anthropomorphic and highly visual blend of properties drawn from both Person
and Wolf, which allows a cognitive agent to easily ascribe human qualities to
a non-human entity (similar observations are reported in French, [5]). More
importantly perhaps, blend recruitment facilitates a fundamental cognitive role
of metaphor that, following Orwell’s ‘1984’, we term ‘Doublethink’, namely, the
ability to hold two complementary perspectives on the same concept in mind
at the same time, and to combine or blend these perspectives for reasons of
inference when necessary.
Consider again the Tamagotchi visual metaphor of Fig. 3, whose creators ex-
ploit the Japanese associations of the Tamagotchi game to describe the situation
now facing Japan’s banking regulators after the downfall of the Yamaichi stock
brokerage. The metaphor particularly stresses the options open to the regula-
tors - to prop up (i.e., ‘feed’) the ailing brokerage, or let it fail (i.e., ‘die’), while
viewing the whole financial fiasco as a ‘game’ gone wrong. Tamagotchi games
conventionally centre around electronic pets such as puppies or kittens, which
the player (the regulator?) is supposed to nourish and nurture via constant in-
teraction. This animal is thus a good metaphor for Yamaichi, but the visual
impact would clearly be diminished if the artist simply substituted a picture of
a bank, no matter how iconic, into the game. This is thus a situation in which
direct mapping between tenor and vehicle elements lacks a sufficient pragmatic
force of its own.
Fortunately, a blend is available, that of ‘piggy-bank’, that possesses the
necessary iconicity to substitute for both Yamaichi and the Tamagotchi puppy
in the metaphor. A Piggy-Bank’s strong associations with money and savings
make it an ideal metaphor for Yamaichi, while its visual appearance makes it an
obvious (after-the-fact) counterpart to the electronic animal of the game.
This is where the notion of ‘double-think’ applies. While being a metaphor for
both a brokerage and a puppy, the Piggy-Bank blend is allowed to exploit con-
tradictory properties of both. Most obvious is the orientation of the Piggy-Bank
- its ‘belly-up’ position is an iconic visual commonly associated with animals -
indicating that Yamaichi is either already bankrupt (dead) or seriously insolvent
(dying). This inverse orientation would make no sense if applied to a literal image
of a bank, yet it is perfectly apt when applied to another artefact, the piggy-
bank, due to its blend of animal visual properties (the most important here being
‘legs’ and ‘belly’). The Piggy-Bank concept is not simply a structural substitute
then for Yamaichi and puppy, but a ‘living’ blend of both.
3.3 Recasting
In the case of the Tamagotchi metaphor of Fig. 3, the slippage situation is actu-
ally even more complex than this. Though the concept Piggy-Bank is identified
as an appropriately visual mid-point between a financial institution and a puppy,
recall that the source of this key sub-metaphor is not actually a puppy at all,
but an electronic simulation of one. We thus need to introduce the idea of a
resemblance schema, taking the form X–resemble→Y. A resemblance relation is
simply a bridge relation between concepts that share a number of perceptual (i.e.,
appearance-related) properties. The transformational chain linking Yamaichi to
the Tamagotchi puppy is thus: Yamaichi–metaphor→PiggyBank–resemble→Pig–
metaphor→Puppy–resemble→TamagotchiPuppy. In effect, Yamaichi and the
Tamagotchi puppy need to be recast for the mediating blend to apply.
Consider now a second metaphor, from the cover of the ‘Economist’ (November 22, 1997), which illus-
trates the rough-and-tumble dynamism of modern Russian politics. To convey
the main thrust of the magazine’s leader column, namely that certain once-
prestigious Russian politicians continue to suffer humiliating downfalls while
Boris Yeltsin remains upright and stable throughout, the ‘Economist’ chooses a
bowling metaphor in which different pins represent various politicians and bowl-
ing balls the fickleness of public opinion. The metaphor, illustrated in Fig. 4,
is well-chosen not only because bowling is a popular sport associated with the
general public as a whole, but because the up / down / stable / rocking status
of the pins conforms to a conventional mode of discourse in politics. However,
visual coherence cannot be bought simply by painting the faces of the politicians
involved onto the appropriate pins, as the conceptual and imaginistic distance
between bowling pins and people is such that the result would simply look con-
trived. Instead, the cover’s creator uses not bowling pins but nested Russian
dolls, of the political variety one frequently sees at tourist stalls. While pos-
sessing an iconic visual quality, such dolls also resemble both bowling pins and
politicians, and so act as a perfect mediating blend between the end-points of
the metaphor.
A blend which is recruited to act as a mapping intermediary in this way also acts
as a visual precedent, in effect grounding the mapping in shared background knowl-
edge between creator and reader as well as securing the aptness of the mapping.
However, not all elements of the metaphor may be externally grounded in this
fashion. For instance, in the case of the Yeltsin bowling cartoon, the Russian fi-
nance minister Anatoly Chubais is also illustrated using a Russian doll/bowling
pin blend, yet there is no background precedent for this. Nevertheless, there
exists an internal precedent - Boris Yeltsin. Because Yeltsin is also depicted in
this fashion, and because Chubais is a strong analogical counterpart of Yeltsin
(both are powerful male Russian politicians), it makes sense that any ground-
ing applied to Yeltsin can also be analogically transferred to Chubais. So while
Yeltsin visually maps to the first bowling pin via the transformational chain
Yeltsin–resemble→YeltsinRussianDoll–resemble→BowlingPin1, Chubais maps
to the second via Chubais–metaphor→Yeltsin–resemble→YeltsinRussianDoll–
resemble→BowlingPin1–resemble→BowlingPin2. It seems from such examples
that metaphor can possess an incestuous quality, feeding not only off other
metaphors and blends recruited from outside, but upon its own internal struc-
ture.
A visual metaphor can also possess a take-home message which the reader transfers from the vehi-
cle domain to the tenor. For instance, in comparing Japan to a Tamagotchi, the
Economist’s take-home message is the opinion that perhaps the Japanese govern-
ment has viewed the problems of financial regulation as a game, while treating
favoured institutions like Yamaichi as ‘virtual pets’. This form of transfer-based
inferencing is readily provided by models of analogy and metaphor such as SME,
ACME and Sapper, given that the cross-domain mapping established by these
models acts as a substitution-key which dictates how elements of the vehicle
domain can be rewritten into the tenor domain.
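A minimal sketch of such a substitution key (illustrative names only, following the Tamagotchi example; this is not the SME, ACME or Sapper code) might rewrite vehicle-domain triples into the tenor as follows:

    # Sketch of transfer-based inferencing: the cross-domain mapping acts as
    # a substitution key for rewriting vehicle assertions into the tenor.
    mapping = {"TamagotchiPuppy": "Yamaichi",
               "Player": "Regulator",
               "Game": "FinancialSystem"}

    vehicle_assertions = [
        ("Player", "feed", "TamagotchiPuppy"),
        ("TamagotchiPuppy", "die_in", "Game"),
    ]

    def transfer(assertions, key):
        # Rewrite each vehicle triple, substituting mapped concepts.
        return [tuple(key.get(x, x) for x in triple) for triple in assertions]

    print(transfer(vehicle_assertions, mapping))
    # [('Regulator', 'feed', 'Yamaichi'), ('Yamaichi', 'die_in', 'FinancialSystem')]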
However, not all metaphors provide a sufficient key for transferring elements
of the vehicle into the tenor. For instance, in the Russian bowling metaphor,
what is to be made of the fact that certain political kingpins are shown falling
on their sides? This idea of a ‘fall from grace’ has a strong metaphoric his-
tory in politics, conventionally denoting failure due to scandal, but this is a
metaphor that must be recruited from outside the current context rather than
identified and exploited internally. So, when presented with an image of a falling
Chubais doll/pin, one must draw upon political knowledge associated with a
‘fallen’ analogical counterpart of Chubais from outside the current context, if it
is not already appreciated that this particular politician is in a perilous position.
For instance, one can defer to another politician such as Nixon and his political
fall, via the analogical chain Chubais–metaphor→Nixon–perform→Resignation–
metaphor→Fall. In essence, we simply need to find a path that metaphorically
links the concept Chubais to the concept Fall, and this path should contain the
semantic sub-structure to be analogically carried into the tenor domain; in this
case the connecting sub-structure suggests that Chubais might perform an act
of resignation. It is necessary that the agent (software or human) reason via
an analogical counterpart like Nixon since the concept Fall may have different
metaphoric meanings in different contexts (e.g., one would not infer that a falling
share-price should also resign).
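The search for such a connecting path can be sketched as a bounded breadth-first traversal over metaphor and perform edges (the edges below are illustrative, following the Nixon example; this is not Sapper's actual procedure):

    # Sketch of recruiting an external metaphoric precedent: find a chain
    # of metaphor/perform edges linking a concept to Fall.
    from collections import deque

    edges = {
        "Chubais": [("metaphor", "Nixon")],
        "Nixon": [("perform", "Resignation")],
        "Resignation": [("metaphor", "Fall")],
    }

    def link(start, goal, max_len=4):
        queue = deque([(start, [])])
        while queue:
            node, chain = queue.popleft()
            if node == goal:
                return chain
            if len(chain) < max_len:
                for rel, nxt in edges.get(node, []):
                    queue.append((nxt, chain + [(rel, nxt)]))
        return None

    print(link("Chubais", "Fall"))
    # [('metaphor', 'Nixon'), ('perform', 'Resignation'), ('metaphor', 'Fall')]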
Given a recasting directive such as recast_as(tamagotchi_puppy, puppy), it
remains for the cognitive agent to ‘run’ the metaphor of ‘Japan is a Tamagotchi
game’ with the caveat that Yamaichi receives a cross-domain mapping in the
interpretation. Many computational models of analogy and metaphor, such as
SME, ACME and Sapper, already provide for this pragmatic directive. Fig. 4
illustrates the output generated by Sapper when given structured descriptions
of these concepts to metaphorically analyse.
We have seen how, starting with the Sapper bridging schema X–metaphor→Y,
this schema can be specialised to deal with appearance-based perceptual similar-
ity in the form X–resemble→Y. Taken together, these two schemas provide the
basic building blocks for reasoning about the slippage phenomena of blend re-
cruitment (both internal and external), recasting and doublethink. For instance,
the basis of the Yamaichi:Tamagotchi metaphor can be explained using the com-
posite chain of metaphor and resemblance schemas:
Yamaichi–metaphor→PiggyBank–resemble→Pig–metaphor→Puppy–
resemble→TamagotchiPuppy
while the mapping of Mr. Chubais to a bowling pin in Fig. 4 can also be ex-
plained using the chain:
Chubais–metaphor→Yeltsin–resemble→YeltsinRussianDoll–resemble→
BowlingPin1–resemble→BowlingPin2
Our initial explorations in the domain of political and economic cartoons show
these chains–each of which is a four-fold composite of the basic metaphor and
resemblance schemas–to be as complex as one is likely to find in this domain.
We can therefore view the generic space guiding the pragmatics of Sapper’s
mapping process as being populated with all permutations of these basic schemas
within a given computational limit. That is, just as there are effective cognitive
limitations on the number of elements one can store in working memory, or
nest in a centre-embedded clause, it is reasonable to assume that the amount of
structural slippage tolerated by the metaphor faculty is similarly bounded for
reasons of computational tractability. Sapper currently operates with a maxi-
mal chain size of four bridge schemas, but again, this proves effective for even
the most complex metaphors we have encountered so far. It remains to be seen
whether the computational limit is pragmatically determined - that is, whether
the context dictates how much computational effort should be applied. For in-
stance, one expects that political cartoons demand more cognitive expenditure
than, say, advertising imagery. This conjecture, among others, is the subject of
current on-going research.
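Under that bound, the generic space can be enumerated directly; a brief, illustrative sketch generates all compositions of the two basic schemas up to the four-schema limit:

    # Enumerate the generic space as all compositions of the two basic
    # bridge schemas (metaphor, resemble) up to Sapper's limit of four.
    from itertools import product

    schemas = ("metaphor", "resemble")
    generic_space = [chain
                     for n in range(1, 5)
                     for chain in product(schemas, repeat=n)]
    print(len(generic_space))   # 2 + 4 + 8 + 16 = 30 composite schemas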
5 Conclusions
We conclude on this theme of computational felicity by noting that the model
of blend recruitment presented in this paper may also offer a useful compu-
tational perspective on another intriguing aspect of Fauconnier and Turner’s
theory of blending, namely the metonymy projection principle. Since metaphors
and blends typically serve the communicative purpose of throwing certain ele-
ments of a domain into bas-relief, while de-emphasising others (e.g., see [8]), this
strengthening of associations frequently causes the relational distance between
the tenor and its highlighted association to be fore-shortened in any resulting
conceptual product.
Fauconnier and Turner cite as an example of this principle the concept Grim-
Reaper, a blend which metaphorically combines the concepts Farmer and Death.
In the latter domain, the concepts Skeleton and RottingClothes are causally as-
sociated with Death, via the intermediate concepts Decompose, Rot, Coffin,
Funeral, Graveyard, and so on. But in the resultant blend space, Skeleton and
RottingClothes become directly associated with Death, and are used together as
an explicit visual metonym; the Grim Reaper is thus conventionally portrayed as
a scythe-carrying skeleton, wrapped in a decrepit cloak and cowl. We see a sim-
ilar instance of this phenomenon in the Tamagotchi example of Fig. 3, in which
the associations between Yamaichi, a rather lofty brokerage, and the concepts of
PersonalSavings and SmallInvestor are strengthened by the use of a PiggyBank
as a visual metonym. This has the effect of personalising the metaphor and mak-
ing its consequences more relevant to the intended audience, the bulk of which
will themselves be small, rather than corporate, investors. In both these cases,
metonymic short-cuts emerge because an intermediate blend is recruited that
provides a shorter path to the relevant associations. Skeleton serves as a rich
visual analog of Farmer (both have arms, legs, torso, head, etc.) while evoking
certain abstract properties of Death, whereas PiggyBank is a rich visual analog
of a TamagotchiPuppy, while sharing key abstract properties with Yamaichi.
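The path fore-shortening at work here can itself be sketched computationally (a toy illustration, not Fauconnier and Turner's or Sapper's machinery; concept names follow the Grim Reaper example): when an association is reachable from a root concept through intermediates, the blend space acquires a direct metonymic edge.

    # Sketch of the metonymy projection principle as path fore-shortening.
    death_domain = [("Death", "Decompose"), ("Decompose", "Skeleton"),
                    ("Death", "Funeral"), ("Funeral", "Coffin")]

    def foreshorten(edges, root, target):
        # If target is reachable from root through intermediates, return
        # the direct metonymic edge that the blend space would contain.
        adj = {}
        for a, b in edges:
            adj.setdefault(a, []).append(b)
        stack, seen = [root], set()
        while stack:
            node = stack.pop()
            if node == target:
                return (root, target)   # Skeleton stands directly for Death
            if node not in seen:
                seen.add(node)
                stack.extend(adj.get(node, []))
        return None

    print(foreshorten(death_domain, "Death", "Skeleton"))   # ('Death', 'Skeleton')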
The computational account we provide of blend recruitment may thus also
provide an algorithmic basis for much of what passes for metonymic projection.
It remains as a goal of future research to establish other aspects of conceptual
integration that can be neatly accommodated within this computational frame-
work.
References
1. Black, M.: Models and Metaphors: Studies in Language and Philosophy. Ithaca, NY:
Cornell University Press (1962)
2. Falkenhainer, B., Forbus, K. D., Gentner, D.: The Structure-Mapping Engine.
Artificial Intelligence 41 (1989) 1-63
3. Fauconnier, G., Turner, M.: Conceptual Projection and Middle Spaces. UCSD:
Department of Cognitive Science Technical Report 9401 (1994)
4. Fauconnier, G., Turner, M.: Conceptual Integration Networks. Cognitive Science
(in press)
The Cog Project: Building a Humanoid Robot

Rodney A. Brooks et al.

1 Introduction
Building an android, an autonomous robot with humanoid form and human-
like abilities, has been both a recurring theme in science fiction and a “Holy
Grail” for the Artificial Intelligence community. In the summer of 1993, our
group began the construction of a humanoid robot. This research project has
two goals: an engineering goal of building a prototype general purpose flexible
and dextrous autonomous robot and a scientific goal of understanding human
cognition (Brooks & Stein 1994).
Recently, many other research groups have begun to construct integrated hu-
manoid robots (Hirai, Hirose, Haikawa & Takenaka 1998, Kanehiro, Mizuuchi,
Koyasako, Kakiuchi, Inaba & Inoue 1998, Takanishi, Hirano & Sato 1998, Morita,
Shibuya & Sugano 1998). There are now conferences devoted solely to humanoid
systems, such as the International Symposium on Humanoid Robots (HURO)
which was first hosted by Waseda University in October of 1996, as well as sec-
tions of more broadly-based conferences, including a recent session at the 1998
2 Methodology
In recent years, AI research has begun to move away from the assumptions of
classical AI: monolithic internal models, monolithic control, and general purpose
processing. However, these concepts are still prevalent in much current work and
are deeply ingrained in many architectures for intelligent systems. For example,
in the recent AAAI-97 proceedings, one sees a continuing interest in planning
(Littman 1997, Hauskrecht 1997, Boutilier & Brafman 1997, Blythe & Veloso
1997, Brafman 1997) and representation (McCain & Turner 1997, Costello 1997,
Lobo, Mendez & Taylor 1997), which build on these assumptions.
Previously, we have presented a methodology that differs significantly from
the standard assumptions of both classical and neo-classical artificial intelli-
gence (Brooks et al. 1998). Our alternative methodology is based on evidence
from cognitive science and neuroscience, and focuses on four attributes that we
believe are critical to human intelligence: developmental organization, social
interaction, embodiment and physical coupling, and multimodal integration.
In this section, we summarize some of the evidence that has led us to abandon
those assumptions about intelligence that classical AI continues to uphold. We
then briefly review the alternative methodology that we have been using in
constructing humanoid robotic systems.
By presenting stimuli to only one side of the body, the experimenters can probe the behavior of each hemi-
sphere independently (for example, by observing the subject picking up an object
appropriate to the scene that they had viewed). In one example, a snow scene
was presented to the right hemisphere and the leg of a chicken to the left. The
subject selected a chicken head to match the chicken leg, explaining with the
verbally dominant left hemisphere that “I saw the claw and picked the chicken”.
When the right hemisphere then picked a shovel to correctly match the snow,
the left hemisphere explained that you need a shovel to “clean out the chicken
shed” (Gazzaniga & LeDoux 1978, p.148). The separate halves of the subject
independently acted appropriately, but one side falsely explained the choice of
the other. This suggests that there are multiple independent control systems,
rather than a single monolithic one.
Development: Humans are not born with complete reasoning systems, com-
plete motor systems, or even complete sensory systems. Instead, they undergo
a process of development where they perform incrementally more difficult tasks
in more complex environments en route to the adult state. Building systems de-
velopmentally facilitates learning both by providing a structured decomposition
of skills and by gradually increasing the complexity of the task to match the
competency of the system.
Development is an incremental process. Behaviors and learned skills that
have already been mastered prepare and enable the acquisition of more advanced
behaviors by providing subskills and knowledge that can be re-used, by placing
simplifying constraints on the acquisition, and by minimizing new information
that must be acquired. For example, Diamond (1990) shows that infants between
five and twelve months of age progress through a number of distinct phases
in the development of visually guided reaching. In this progression, infants in
later phases consistently demonstrate more sophisticated reaching strategies to
retrieve a toy in more challenging scenarios. As the infant’s reaching competency
develops, later stages incrementally improve upon the competency afforded by
the previous stages. Within our group, Marjanović, Scassellati & Williamson
(1996) applied a similar bootstrapping technique to enable the robot to learn
to point to a visual target. Scassellati (1996) has discussed how a humanoid
robot might acquire basic social competencies through this sort of developmental
methodology. Other examples of developmental learning that we have explored
can be found in (Ferrell 1996, Scassellati 1998b).
By gradually increasing the complexity of the required task, a developmen-
tal process optimizes learning. For example, infants are born with low acuity
vision which simplifies the visual input they must process. The infant’s visual
performance develops in step with their ability to process the influx of stimula-
tion (Johnson 1993). The same is true for the motor system. Newborn infants
do not have independent control over each degree of freedom of their limbs, but
through a gradual increase in the granularity of their motor control they learn
to coordinate the full complexity of their bodies. A process in which the acuity
of both sensory and motor systems are gradually increased significantly reduces
the difficulty of the learning problem (Thelen & Smith 1994). The caregiver also
acts to gradually increase the task complexity by structuring and controlling
the complexity of the environment. By exploiting a gradual increase in complex-
ity both internal and external, while reusing structures and information gained
from previously learned behaviors, we hope to be able to learn increasingly so-
phisticated behaviors. We believe that these methods will allow us to construct
systems which scale autonomously (Ferrell & Kemp 1996, Scassellati 1998b).
Embodiment and Physical Coupling: Perhaps the most obvious, and most
overlooked, aspect of human intelligence is that it is embodied. A principal
tenet of our methodology is to build and test real robotic systems. We believe
that building human-like intelligence requires human-like interaction with the
world (Brooks & Stein 1994). Humanoid form is important both to allow hu-
mans to interact socially with the robot in a natural way and to provide similar
task constraints.
The direct physical coupling between action and perception reduces the need
for an intermediary representation. For an embodied system, internal repre-
sentations can be ultimately grounded in sensory-motor interactions with the
world (Lakoff 1987). Our systems are physically coupled with the world and op-
erate directly in that world without any explicit representations of it (Brooks
1986, Brooks 1991b). There are representations, or accumulations of state, but
these only refer to the internal workings of the system; they are meaningless
without interaction with the outside world. The embedding of the system within
the world enables the internal accumulations of state to provide useful behavior.1
In addition we believe that building a real system is computationally less
complex than simulating such a system. The effects of gravity, friction, and
natural human interaction are obtained for free, without any computation. Em-
bodied systems can also perform some complex tasks in relatively simple ways
by exploiting the properties of the complete system. For example, when putting
a jug of milk in the refrigerator, you can exploit the pendulum action of your
arm to move the milk (Greene 1982). The swing of the jug does not need to be
explicitly planned or controlled, since it is the natural behavior of the system.
Instead of having to plan the whole motion, the system only has to modulate,
guide and correct the natural dynamics. We have implemented one such scheme
using self-adaptive oscillators to drive the joints of the robot’s arm (Williamson
1998a, Williamson 1998b).
Fig. 1. Cog, an upper-torso humanoid robot. Cog has twenty-one degrees of freedom
to approximate human movement, and a variety of sensory systems that approximate
human senses, including visual, vestibular, auditory, and tactile senses.
3 Hardware
In pursuing the methodology outlined in the previous section, we have con-
structed an upper-torso humanoid robot called Cog (see Figure 1). This section
describes the computational, perceptual, and motor systems that have been im-
plemented on Cog as well as the development platforms that have been con-
structed to test additional hardware and software components.
Visual System: Cog’s visual system is designed to mimic some of the capa-
bilities of the human visual system, including binocularity and space-variant
sensing (Scassellati 1998a). Each eye can rotate about an independent vertical
axis (pan) and a coupled horizontal axis (tilt). To allow for both a wide field
of view and high resolution vision, there are two grayscale cameras per eye, one
which captures a wide-angle view of the periphery (88.6° (V) × 115.8° (H) field
of view) and one which captures a narrow-angle view of the central (foveal) area
(18.4° (V) × 24.4° (H) field of view with the same resolution). Each camera pro-
duces an NTSC signal that is digitized by a frame grabber connected to the
digital signal processor network.
Vestibular System: The human vestibular system plays a critical role in the
coordination of motor responses, eye movement, posture, and balance. The hu-
man vestibular sensory organ consists of the three semi-circular canals, which
measure the acceleration of head rotation, and the two otolith organs, which
measure linear movements of the head and the orientation of the head relative
to gravity. To mimic the human vestibular system, Cog has three rate gyroscopes
mounted on orthogonal axes (corresponding to the semi-circular canals) and two
linear accelerometers (corresponding to the otolith organs). Each of these devices
is mounted in the head of the robot, slightly below eye level. Analog signals from
each of these sensors are amplified on-board the robot, and processed off-board
by a commercial A/D converter attached to one of the PC brain nodes.
Auditory System: To mimic the localization cues of human hearing, crude pinnae were constructed around the microphones. Analog auditory signals
are processed by a commercial A/D board that interfaces to the digital signal
processor network.
Arms: Each arm is loosely based on the dimensions of a human arm with 6
degrees-of-freedom, each powered by a DC electric motor through a series spring
(a series elastic actuator, see (Pratt & Williamson 1995)). The spring provides
accurate torque feedback at each joint, and protects the motor gearbox from
shock loads. A low gain position control loop is implemented so that each joint
acts as if it were a virtual spring with variable stiffness, damping and equilibrium
position. These spring parameters can be changed, both to move the arm and
to alter its dynamic behavior. Motion of the arm is achieved by changing the
equilibrium positions of the joints, not by commanding the joint angles directly.
There is considerable biological evidence for this spring-like property of arms
(Zajac 1989, Cannon & Zahalak 1982, MacKay, Crammond, Kwan & Murphy
1986).
The spring-like property gives the arm a sensible “natural” behavior: if it is
disturbed, or hits an obstacle, the arm simply deflects out of the way. The dis-
turbance is absorbed by the compliant characteristics of the system, and needs
31"
Fig. 2. Range of motion for the neck and torso. Not shown are the neck twist (180
degrees) and body twist (120 degrees)
no explicit sensing or computation. The system also has a low frequency char-
acteristic (large masses and soft springs) which allows for smooth arm motion
at a slower command rate. This allows more time for computation, and makes
possible the use of control systems with substantial delay (a condition akin to
biological systems). The spring-like behavior also guarantees a stable system if
the joint set-points are fed-forward to the arm.
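A minimal sketch of such an equilibrium-point control law (not the actual Cog controller; the gains, unit inertia, and time step are illustrative) shows how motion follows from moving the set-point of a virtual spring:

    # Virtual-spring joint control: torque follows from a settable
    # equilibrium position, stiffness, and damping.
    def virtual_spring_torque(theta, theta_dot, theta_eq, k=20.0, b=2.0):
        """Torque for one joint: spring toward equilibrium plus damping."""
        return k * (theta_eq - theta) - b * theta_dot

    # Motion is produced by moving the equilibrium point, not by commanding
    # joint angles directly; here the joint settles toward a new set-point.
    theta, theta_dot, dt = 0.0, 0.0, 0.001
    for _ in range(3000):
        torque = virtual_spring_torque(theta, theta_dot, theta_eq=1.0)
        theta_dot += torque * dt    # unit inertia, for illustration
        theta += theta_dot * dt
    print(round(theta, 3))          # approaches the 1.0 rad equilibrium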
Neck and Torso: Cog’s body has six degrees of freedom: the waist bends side-
to-side and front-to-back, the “spine” can twist, and the neck tilts side-to-side,
nods front-to-back, and twists left-to-right. Mechanical stops on the body and
neck give a human-like range of motion, as shown in Figure 2.
In addition to the humanoid robot, we have also built three development plat-
forms, similar in mechanical design to Cog’s head, with identical computational
systems; the same code can be run on all platforms. These development platforms
allow us to test and debug new behaviors before integrating them on Cog.
Vision Platform: The vision development platform (shown at the left of Figure
3) is a copy of Cog’s active vision system. The development platform has identical
degrees of freedom, similar design characteristics, and identical computational
environment. The development platform differs from Cog’s vision system in only
three ways. First, to explore issues of color vision and saliency, the development
platform has color cameras. Second, the mechanical design of the camera mounts
Fig. 3. Two of the vision development platforms used in this work. These desktop
systems match the design of the Cog head and are used as development platforms for
visual-motor routines. The system on the right has been modified to investigate how
expressive facial gestures can regulate social learning.
has been modified for the specifications of the color cameras. Third, because the
color cameras are significantly lighter than the grayscale cameras used on Cog,
we were able to use smaller motors for the development platform while obtaining
similar eye movement speeds. Additional details on the development platform
design can be found in Scassellati (1998a).
Fig. 4. Static extremes of Kismet’s facial expressions. During operation, the 11 degrees-
of-freedom for the ears, eyebrows, mouth, and eyelids vary continuously with the cur-
rent emotional state of the robot.
child’s own beliefs, desires, and perceptions. The ability to recognize what an-
other person can see, the ability to know that another person maintains a false
belief, and the ability to recognize that another person likes games that differ
from those that the child enjoys are all part of this developmental chain. Further,
the ability to recognize oneself in the mirror, the ability to ground words in per-
ceptual experiences, and the skills involved in creative and imaginative play may
also be related to this developmental advance. These abilities are also central to
what defines human interactions. Normal social interactions depend upon the
recognition of other points of view, the understanding of other mental states,
and the recognition of complex non-verbal signals of attention and emotional
state.
If we are to build a system that can recognize and produce these complex
social behaviors, we must find a skill decomposition that maintains the com-
plexity and richness of the behaviors represented while still remaining simple
to implement and construct. Evidence from the development of these “theory
of mind” skills in normal children, as well as the abnormal development seen
in pervasive developmental disorders such as Asperger’s syndrome and autism,
demonstrates that a critical precursor is the ability to engage in joint attention
(Baron-Cohen 1995, Frith 1990). Joint attention refers to those preverbal social
behaviors that allow the infant to share with another person the experience of a
third object (Wood et al. 1976). For example, the child might laugh and point
to a toy, alternating between looking at the caregiver and the toy.
From a robotics standpoint, even the simplest of joint attention behaviors
require the coordination of a large number of perceptual, sensory-motor, atten-
tional, and cognitive processes. Our current research is the implementation of
one possible skill decomposition that has received support from developmen-
tal psychology, neuroscience, and abnormal psychology, and is consistent with
evidence from evolutionary studies of the development of joint attention behav-
iors. This decomposition is described in detail in the chapter by Scassellati, and
requires many capabilities from our robotic system including basic eye motor
skills, face and eye detection, determination of eye direction, gesture recogni-
tion, attentional systems that allow for social behavior selection at appropriate
moments, emotive responses, arm motor control, image stabilization, and many
others.
A robotic system that can recognize and engage in joint attention behav-
iors will allow for social interactions between the robot and humans that have
previously not been possible. The robot would be capable of learning from an
observer using normal social signals in the same way that human infants learn;
no specialized training of the observer would be necessary. The robot would also
be capable of expressing its internal state (emotions, desires, goals, etc.) through
social interactions without relying upon an artificial vocabulary. Further, a robot
that can recognize the goals and desires of others will allow for systems that can
more accurately react to the emotional, attentional, and cognitive states of the
observer, can learn to anticipate the reactions of the observer, and can modify its
own behavior accordingly. The construction of these systems may also provide a
new tool for investigating the predictive power and validity of the models from
natural systems that serve as the basis. An implemented model can be tested
in ways that are not possible to test on humans, using alternate developmen-
tal conditions, alternate experiences, and alternate educational and intervention
approaches.
parent receives during social exchanges serve as feedback so the parent can adjust
the nature and intensity of the structured learning episode to maintain a suitable
learning environment where the infant is neither bored nor overwhelmed.
In addition, an infant’s motivations and emotional displays are critical in
establishing the context for learning shared meanings of communicative acts
(Halliday 1975). An infant displays a wide assortment of emotive cues such as
coos, smiles, waves, and kicks. At such an early age, the mother imparts a con-
sistent meaning to her infant’s expressive gestures and expressions, interpreting
them as meaningful responses to her mothering and as indications of his inter-
nal state. Curiously, experiments by Kaye (1979) argue that the mother actually
supplies most if not all the meaning to the exchange when the infant is so young.
The infant does not know the significance his expressive acts have for his mother,
nor how to use them to evoke specific responses from her. However, because the
mother assumes her infant shares the same meanings for emotive acts, her con-
sistency allows the infant to discover what sorts of activities on his part will get
specific responses from her. Routine sequences of a predictable nature can be
built up which serve as the basis of learning episodes (Newson 1979).
Combining these ideas one can design a robot that is biased to learn how
its emotive acts influence the caretaker in order to satisfy its own drives. To-
ward this end, we endow the robot with a motivational system that works to
maintain its drives within homeostatic bounds and motivates the robot to learn
behaviors that satiate them. For our purposes, we further provide the robot with
a set of emotive expressions that are easily interpreted by a naive observer as
analogues of the types of emotive expressions that human infants display. This
allows the caretaker to observe the robot’s emotive expressions and interpret
them as communicative acts. This establishes the requisite routine interactions
for the robot to learn how its emotive acts influence the behavior of the care-
taker, which ultimately serves to satiate the robot’s own drives. By doing so,
both parties can modify both their own behavior and the behavior of the other
in order to maintain an interaction that the robot can learn from and use to
satisfy its drives.
son & Hollerbach 1988), using lightweight robots with little dynamics (Salisbury,
Townsend, Eberman & DiPietro 1988), or simply by moving slowly. Research em-
phasizing dynamic manipulation either exploits clever mechanical mechanisms
which simplify control schemes (Schaal & Atkeson 1993, McGeer 1990) or results
in computationally complex methods (Mason & Salisbury 1985).
Humans, however, exploit the mechanical characteristics of their bodies. For
example, when humans swing their arms they choose comfortable frequencies
which are close to the natural resonant frequencies of their limbs (Herr 1993,
Hatsopoulos & Warren 1996). Similarly, when placed in a jumper, infants bounce
at the natural frequency (Warren & Karrer 1984). Humans also exploit the active
dynamics of their arm when throwing a ball (Rosenbaum et al. 1993) and the
passive dynamics of their arm to allow stable interaction with objects (Mussa-
Ivaldi, Hogan & Bizzi 1985). When learning new motions, both infants and
adults quickly utilize the physical dynamics of their limbs (Thelen & Smith
1994, Schneider, Zernicke, Schmidt & Hart 1989).
On our robot, we have exploited the dynamics of the arms to perform a
variety of tasks. The compliance of the arm allows both stable motion and safe
interaction with objects. Local controllers at each joint are physically coupled
through the mechanics of the arm, allowing these controllers to interact and
produce coordinated motion such as swinging a pendulum, turning a crank, and
playing with a slinky. Our initial experiments suggest that these solutions are
very robust to perturbations, do not require accurate calibration or parameter
tuning, and are computationally simple (Williamson 1998a, Williamson 1998b).
internal systems interact with each other. Unfortunately, finding this information
is by no means trivial.
Performance measures are the most straightforward. For sensory processes,
the performance is estimated by a confidence measure, probably based on a com-
bination of repeatability, error estimates, etc. Motor performance measurements
would be based upon criteria such as power expenditure, fatigue measures, safety
limits, and actuator accuracy.
Extracting correlations between sensorimotor events is more complex. The
first step is segmentation, that is, determining what constitutes an “event” within
a stream of proprioceptive data and/or motor commands. Segmentation algo-
rithms and filters can be hard-coded (but only for the most rudimentary enu-
meration of sensing and actuating processes) or created adaptively. Adaptive
segmentation creates and tunes filters based on how well they contribute to
the correlation models. Segmentation is crucial because it reduces the amount
of redundant information produced by confluent data streams. Any correlation
routine must deal with both the combinatorial problem of looking for patterns
between many different data sources and the problem of finding correlations
between events with time delays.
A general system for multimodal coordination is too complex to implement
all at once. We plan to start on a small scale, coordinating between two and
five systems. The first goal is a mechanism for posture — to coordinate, fixate,
and properly stiffen or relax torso, neck, and limbs for a variety of reaching and
looking tasks. Posture is not merely a reflexive control; it has feed-forward com-
ponents which require knowledge of impending tasks so that the robot can ready
itself. A postural system being so reactive and pervasive, requires a significant
amount of multi-modal integration.
5 Current Tasks
In pursuing the long-term projects outlined in the previous section, we have im-
plemented many simple behaviors on our humanoid robot. This section briefly
describes the tasks and behaviors that the robot is currently capable of perform-
ing. For brevity, many of the technical details and references to similar work
have been excluded here, but are available from the original citations. In ad-
dition, video clips of Cog performing many of these tasks are available from
http://www.ai.mit.edu/projects/cog/.
Human eye movements can be classified into five categories: three voluntary
movements (saccades, smooth pursuit, and vergence) and two involuntary move-
ments (the vestibulo-ocular reflex and the opto-kinetic response) (Goldberg, Eg-
gers & Gouras 1992). We have implemented mechanical analogues of each of
these eye motions.
Saccades: Saccades are high-speed ballistic motions that focus a salient object
on the high-resolution central area of the visual field (the fovea). In humans,
saccades are extremely rapid, often up to 900° per second. To enable our machine
vision systems to saccade to a target, we require a saccade function S : (x, e) →
∆e which produces a change in eye motor position (∆e) given the current eye
motor position (e) and the stimulus location in the image plane (x). To obtain
accurate saccades without requiring an accurate model of the kinematics and
optics, an unsupervised learning algorithm estimates the saccade function. This
implementation can adapt to the non-linear optical and mechanical properties
of the vision system. Marjanović et al. (1996) learned a saccade function for
this hardware platform using a 17 × 17 interpolated lookup table. The map was
initialized with a linear set of values obtained from self-calibration. For each
learning trial, a visual target was randomly selected. The robot attempted to
saccade to that location using the current map estimates. The target was located
in the post-saccade image using correlation, and the L2 offset of the target was
used as an error signal to train the map. The system learned to center pixel
patches in the peripheral field of view. The system converged to an average of
< 1 pixel of error in a 128 × 128 image per saccade after 2000 trials (1.5 hours).
With a trained saccade function S, the system can saccade to any salient stimulus
in the image plane. We have used this mapping for saccading to moving targets,
bright colors, and salient matches to static image templates.
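As a rough sketch of how such a learned saccade map can be realized (the table size follows the description above; the learning rate, the error-to-motor scaling, and the omission of the eye-position argument of S are simplifications of ours):

import numpy as np

GRID = 17          # lookup-table resolution, as in Marjanovic et al. (1996)
IMG = 128          # image width/height in pixels
LEARN_RATE = 0.1   # illustrative; the original rate is not given

# saccade_map[i, j] estimates the change in the two eye motor positions
# needed to foveate a target seen at grid cell (i, j). It is initialized
# with a linear set of values, standing in for the self-calibration step.
lin = np.linspace(-1.0, 1.0, GRID)
saccade_map = np.stack(np.meshgrid(lin, lin, indexing="ij"), axis=-1)

def saccade_command(x, y):
    """Bilinearly interpolate the motor command for a pixel target (x, y)."""
    gx, gy = x / (IMG - 1) * (GRID - 1), y / (IMG - 1) * (GRID - 1)
    i0, j0 = int(gx), int(gy)
    i1, j1 = min(i0 + 1, GRID - 1), min(j0 + 1, GRID - 1)
    fx, fy = gx - i0, gy - j0
    return ((1 - fx) * (1 - fy) * saccade_map[i0, j0]
            + fx * (1 - fy) * saccade_map[i1, j0]
            + (1 - fx) * fy * saccade_map[i0, j1]
            + fx * fy * saccade_map[i1, j1])

def train_step(x, y, pixel_error):
    """After saccading toward (x, y), correct the nearest table entry using
    the offset of the target in the post-saccade image (found by correlation)."""
    i = round(x / (IMG - 1) * (GRID - 1))
    j = round(y / (IMG - 1) * (GRID - 1))
    saccade_map[i, j] += LEARN_RATE * np.asarray(pixel_error) / IMG

In a training loop of the kind described above, each trial picks a random visual target, saccades using the current estimates, and calls train_step with the measured offset.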
Binocular Vergence: Vergence movements adjust the eyes for viewing ob-
jects at varying depth. While the recovery of absolute depth may not be strictly
necessary, relative disparity between objects is critical for tasks such as accu-
rate hand-eye coordination, figure-ground discrimination, and collision detection.
Yamato (1998) built a system that performs binocular vergence and integrates
the saccadic and smooth-pursuit systems described previously. Building on mod-
els of the development of binocularity in infants, Yamato used local correlations
to identify matching targets in a foveal region in both eyes, moving the eyes to
match the pixel locations of the targets in each eye. The system was also capable
of smoothly responding to changes of targets after saccadic motions, and during
smooth pursuit.
Fig. 5. Orientation to a salient stimulus. Once a salient stimulus (a moving hand) has
been detected, the robot first saccades to that target and then orients the head and
neck to that target.
drift from the VOR. We are currently working on implementing models of VOR
and OKN coordination to allow both systems to operate simultaneously.
Orienting the head and neck along the angle of gaze can maximize the range of
the next eye motion while giving the robot a more life-like appearance. Once the
eyes have foveated a salient stimulus, the neck should move to point the head in
the direction of the stimulus while the eyes counter-rotate to maintain fixation
on the target (see Figure 5). To move the neck the appropriate distance, we must
construct a mapping N : (n, e) → ∆n which produces a change in neck motor
positions (∆n) given the current neck position (n) and the initial eye position
(e). Because we are mapping motor positions to motor positions with axes that
are roughly parallel, a simple linear mapping has sufficed: ∆n = k·e − n for
some constant k.²
There are two possible mechanisms for counter-rotating the eyes while the
neck is in motion: the vestibulo-ocular reflex or an efference copy signal of the
neck motion. VOR can be used to compensate for neck motion without any
additions necessary. Because the reflex uses gyroscope feedback to maintain the
eye position, no communication between the neck motor controller and the eye
motor controller is necessary. This can be desirable if there is limited bandwidth
between the processors responsible for neck and eye control. However, using VOR
to compensate for neck motion can become unstable. Because the gyroscopes
are mounted very close to the neck motors, motion of the neck can result in
additional vibrational noise on the gyroscopes. However, since the neck motion
is a voluntary movement, our system can utilize additional information in order
to counter-rotate the eyes, much as humans do (Ghez 1992). An efference copy
signal can be used to move the eye motors while the neck motors are moving. The
neck motion signal can be scaled and sent to the eye motors to compensate for
the neck motion. The scaling constant is simply 1/k, where k is the same constant
² This linear mapping has only been possible with motor-motor mappings and not sensory-motor mappings because of non-linearities in the sensors.
[Fig. 6. Schematic of the oscillator circuit: two neurons with tonic input c and self-inhibition gain β (states v1, v2), coupled by mutual inhibition ω; the proprioceptive input gj enters through gains hj as positive and negative parts [gj]+ and [gj]−; the outputs y1 and y2 combine into yout.]
that was used to determine ∆n. Just as with the vestibulo-ocular reflex, the
scaling constants can be obtained using controlled motion and feedback from
the opto-kinetic nystagmus. Using efference copy with constants obtained from
OKN training results in a stable system for neck orientation.
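A minimal sketch of these two motor-motor relations, assuming a single gain k per axis and a sign convention in which eye and neck displacements add (both choices are ours):

K = 0.8  # illustrative gain relating eye motor units to neck motor units

def neck_command(neck_pos, eye_pos):
    """The linear map N of the text: delta_n = k*e - n."""
    return K * eye_pos - neck_pos

def efference_copy(delta_n):
    """Counter-rotate the eyes during a voluntary neck motion by scaling the
    neck command by 1/k, keeping gaze fixed without gyroscope feedback."""
    return -delta_n / K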
Neural oscillators have been used to generate repetitive arm motions. The cou-
pling between a set of oscillators and the physical arm of the robot achieves
many different tasks using the same software architecture and without explicit
models of the arm or environment. The tasks include swinging pendulums at
their resonant frequencies, turning cranks, and playing with a slinky.
Using a proportional-derivative control law, the torque at the ith joint can
be described by:
u_i = k_i (θ_vi − θ_i) − b_i θ̇_i    (1)
where k_i is the stiffness of the joint, b_i the damping, θ_i the joint angle, and
θ_vi the equilibrium point. By altering the stiffness and damping of the arm, the
dynamical characteristics of the arm can be changed. The posture of the arm
can be changed by altering the equilibrium points (Williamson 1996). This type
of control preserves stability of motion. The elastic elements of the arm produce
a system that is both compliant and shock resistant, allowing the arm to operate
in unstructured environments.
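In code, equation (1) is a spring-damper law whose set point, rather than the joint angle itself, is commanded; a minimal sketch:

def joint_torque(theta, theta_dot, theta_v, k, b):
    """Eq. (1): u_i = k_i*(theta_v_i - theta_i) - b_i*theta_dot_i.
    k is the virtual stiffness, b the damping; motion is produced by
    moving the equilibrium point theta_v, not by commanding theta."""
    return k * (theta_v - theta) - b * theta_dot

Softening the arm for contact, or stiffening it for precision, then reduces to scheduling k and b.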
Two simulated neurons with mutually inhibitory connections drive each arm
joint, as shown in Figure 6. The neuron model describes the firing rate of a
biological neuron with self-inhibition (Matsuoka 1985). The firing rate of each neuron is governed by:
τ1 ẋ1 = −x1 − βv1 − ω[x2]^+ − Σ_{j=1}^{n} h_j [g_j]^+ + c    (2)
τ2 v̇1 = −v1 + [x1]^+    (3)
τ1 ẋ2 = −x2 − βv2 − ω[x1]^+ − Σ_{j=1}^{n} h_j [g_j]^− + c    (4)
τ2 v̇2 = −v2 + [x2]^+    (5)
y_i = [x_i]^+ = max(x_i, 0)    (6)
y_out = y1 − y2    (7)
³ These signals in general have an offset (due to gravity loading, or other factors). When the positive and negative parts are extracted and applied to the oscillators, a low-pass filter is used to find and remove the DC component.
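A hedged sketch of equations (2)-(7) as a forward-Euler integrator; the time step, the reading of [g]^− as the magnitude of the negative part of g, and all parameter values are assumptions of ours:

import numpy as np

def matsuoka_step(state, g, h, dt=0.001, c=1.0, beta=2.5, omega=2.0,
                  tau1=0.05, tau2=0.125):
    """One Euler step of the two-neuron oscillator, eqs. (2)-(7).
    state = (x1, v1, x2, v2); g is the vector of proprioceptive inputs g_j
    and h the corresponding gains h_j. Returns (new_state, y_out)."""
    x1, v1, x2, v2 = state
    pos = lambda z: np.maximum(z, 0.0)   # [z]^+ = max(z, 0)
    neg = lambda z: np.maximum(-z, 0.0)  # [z]^- taken as max(-z, 0)
    dx1 = (-x1 - beta * v1 - omega * pos(x2) - np.dot(h, pos(g)) + c) / tau1
    dv1 = (-v1 + pos(x1)) / tau2
    dx2 = (-x2 - beta * v2 - omega * pos(x1) - np.dot(h, neg(g)) + c) / tau1
    dv2 = (-v2 + pos(x2)) / tau2
    new = (x1 + dt * dx1, v1 + dt * dv1, x2 + dt * dx2, v2 + dt * dv2)
    y_out = pos(new[0]) - pos(new[2])    # eqs. (6)-(7): y_out = y1 - y2
    return new, y_out

Driving the equilibrium point θ_v of equation (1) with y_out closes the loop between the oscillator and the physical arm.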
Fig. 7. Entrainment of an oscillator at the elbow as the shoulder is moved. The joints
are connected only through the physical structure of the arm. Both plots show the
angle of the shoulder (solid) and the elbow (dashed) as the speed of the shoulder is
changed (speed parameter dash-dot). The top graph shows the response of the arm
without proprioception, and the bottom with proprioception. Synchronization occurs
only with the proprioceptive feedback.
Cranks: The position constraint of a crank can also be used to coordinate the
joints of the arm. If the arm is attached to the crank and some of the joints are
moved, then the other joints are constrained by the crank. The oscillators can
sense the motion, adapt, and settle into a stable crank turning motion.
In the future, we will explore issues of complex redundant actuation (such as
multi-joint muscles), utilize optimization techniques to tune the parameters of
the oscillator, produce whole-arm oscillations by connecting various joints into a
single oscillator, and explore the use of postural primitives to move the set point
of the oscillations.
[Two plots of equilibrium angle against time (0-12 s) for the left and right arms, with the feedback gain overlaid.]
Fig. 8. The robot operating the slinky. Both plots show the outputs from the oscil-
lators as the proprioception is turned on and off. With proprioception, the outputs
are synchronized. Without proprioception, the oscillators move out of phase. The only
connection between the oscillators is through the physical structure of the slinky.
over many repeated trials without human supervision, using gradient descent
methods to train forward and inverse mappings between a visual parameter space
and an arm position parameter space. This behavior uses a novel approach to
arm control, and the learning bootstraps from prior knowledge contained within
the saccade behavior (discussed in Section 5.1). As implemented, the behavior
assumes that the robot’s neck remains in a fixed position.
From an external perspective, the behavior is quite rudimentary. Given a
visual stimulus, typically by a researcher waving an object in front of its cam-
eras, the robot saccades to foveate on the target, and then reaches out its arm
toward the target. Early reaches are inaccurate, and often in the wrong direction
altogether, but after a few hours of practice the accuracy improves drastically.
The reaching algorithm involves an amalgam of several subsystems. A motion
detection routine identifies a salient stimulus, which serves as a target for the
saccade module. This foveation guarantees that the target is always at the center
of the visual field; the coordinates of the target on the retina are always the
center of the visual field, and the position of the target relative to the robot is
wholly characterized by the gaze angle of the eyes (only two degrees of freedom).
Once the target is foveated, the joint configuration necessary to point to that
target is generated from the gaze angle of the eyes using a “ballistic map.” This
configuration is used by the arm controller to generate the reach.
Training the ballistic map is complicated by the inappropriate coordinate
space of the error signal. When the arm is extended, the robot waves its hand.
This motion is used to locate the end of the arm in the visual field. The distance
of the hand from the center of the visual field is the measure of the reach error.
However, this error signal is measured in units of pixels, yet the map being
trained relates gaze angles to joint positions. The reach error measured by the
visual system cannot be directly used to train the ballistic map. However, the
saccade map has been trained to relate pixel positions to gaze angles. The saccade
map converts the reach error, measured as a pixel offset on the retina, into an
offset in the gaze angles of the eyes (as if Cog were looking at a different target).
This is still not enough to train the ballistic map. Our error is now in terms
of gaze angles, not joint positions — i.e. we know where Cog could have looked,
but not how it should have moved the arm. To train the ballistic map, we also
need a “forward map” — i.e. a forward kinematics function which gives the gaze
angle of the hand in response to a commanded set of joint positions. The error
in gaze coordinates can be back-propagated through this map, yielding a signal
appropriate for training the ballistic map.
The forward map is learned incrementally during every reach: after each
reach we know the commanded arm position, as well as the position measured
in eye gaze coordinates (even though that was not the target position). For the
ballistic map to train properly, the forward map must have the correct signs in
its derivative. Hence, training of the forward map begins first, during a “flail-
ing” period in which Cog performs reaches to random arm positions distributed
through its workspace.
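The following sketch shows the shape of this training-signal path under the deliberately crude assumption that both maps are linear; all names, learning rates, and the stand-in saccade map are ours, not the actual implementation:

import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(scale=0.1, size=(2, 2))  # ballistic map: gaze angles -> arm command
F = rng.normal(scale=0.1, size=(2, 2))  # forward map: arm command -> hand gaze angle
LR = 0.05                               # illustrative learning rate

def saccade_offset(pixel_offset):
    """Stand-in for the trained saccade map: pixel offset -> gaze-angle offset."""
    return 0.01 * np.asarray(pixel_offset)

def learn_from_reach(gaze_target, arm_cmd, hand_pixel_offset):
    """gaze_target: eye gaze angles of the foveated target; arm_cmd: the
    commanded reach; hand_pixel_offset: where the waving hand was seen
    relative to the image center after the reach."""
    global B, F
    # Where the hand ended up, expressed in gaze coordinates ("as if Cog
    # were looking at a different target"):
    hand_gaze = gaze_target + saccade_offset(hand_pixel_offset)
    # The forward map is trained on every reach, flailing ones included:
    F += LR * np.outer(hand_gaze - F @ arm_cmd, arm_cmd)
    # Reach error in gaze coordinates, back-propagated through F to give a
    # joint-space error signal for the ballistic map:
    gaze_error = gaze_target - hand_gaze
    B += LR * np.outer(F.T @ gaze_error, gaze_target)

After the flailing phase, each reach would use arm_cmd = B @ gaze_target and then call learn_from_reach with the observed hand offset.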
Although the arm has four joints active in moving the hand to a particular
position in space (the other two control the orientation of the hand), we re-
parameterize in such a way that we only control two degrees of freedom for a
reach. The position of the outstretched arm is governed by a normalized vector
of “postural primitives.” A primitive is a fixed set of joint angles, corresponding
to a static position of the arm, placed at a corner of the workspace. Three such
primitives form a basis for the workspace. The joint space command for the arm is
calculated by interpolating the joint space components between each primitive,
weighted by the coefficients of the primitive-space vector. Since the vector in
primitive space is normalized, three coefficients give rise to only two degrees of
freedom. Hence, a mapping between eye gaze position and arm position, and
vice versa, is a simple, non-degenerate R2 → R2 function. This considerably
simplifies learning.
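A minimal sketch of this interpolation; the primitive postures below are placeholders, not Cog's actual corner postures:

import numpy as np

# Three "postural primitives": fixed joint-angle vectors (four reaching
# joints here), notionally placed at corners of the workspace.
PRIMITIVES = np.array([
    [0.0, 0.8, 0.2, 0.0],
    [0.9, -0.2, 0.5, 0.3],
    [-0.7, 0.4, 0.9, 0.6],
])

def arm_command(w):
    """Interpolate joint angles from a primitive-space vector w. Because w
    is normalized, its three coefficients carry only two degrees of
    freedom -- matching the two gaze angles of the eyes."""
    w = np.asarray(w, dtype=float)
    return (w / w.sum()) @ PRIMITIVES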
Unfortunately, the notion of postural primitives as formulated is very brit-
tle: the primitives are chosen ad-hoc to yield a reasonable workspace. Finding
methods to adaptively generate primitives and divide the workspace is a subject
of active research.
image coordinates (p(x,y)) is then mapped into foveal image coordinates (f(x,y))
using a second learned mapping, the foveal map F : p(x,y) → f(x,y). The location
of the face within the peripheral image can then be used to extract the sub-image
containing the eye for further processing.
This technique has been successful at locating and extracting sub-images that
contain eyes under a variety of conditions and from many different individuals.
Additional information on this task and its relevance to building systems that
recognize joint attention can be found in the chapter by Scassellati.
learning are those exhibited by infants such as turn taking, shared attention,
and pre-linguistic vocalizations exhibiting shared meaning with the caretaker.
Towards this end, we have implemented a behavior engine for the develop-
ment platform Kismet that integrates perceptions, drives, emotions, behaviors,
and facial expressions. These systems influence each other to establish and main-
tain social interactions that can provide suitable learning episodes, i.e., where the
robot is proficient yet slightly challenged, and where the robot is neither under-
stimulated nor over-stimulated by its interaction with the human. Although we
do not claim that this system parallels infants exactly, its design is heavily in-
spired by the role motivations and facial expressions play in maintaining an
appropriate level of stimulation during social interaction with adults.
With a specific implementation, we demonstrated how the system engages
in a mutually regulatory interaction with a human while distinguishing between
stimuli that can be influenced socially (face stimuli) and those that cannot (mo-
tion stimuli) (Breazeal & Scassellati 1998). The total system consists of three
drives (fatigue, social, and stimulation), three consummatory behaviors
(sleep, socialize, and play), five emotions (anger, disgust, fear, happiness,
sadness), two expressive states (tiredness and interest), and their corre-
sponding facial expressions. A human interacts with the robot through direct
face-to-face interaction, by waving a hand at the robot, or using a toy to play
with the robot. The toys included a small plush black and white cow and an or-
ange plastic slinky. The perceptual system classifies these interactions into two
classes: face stimuli and non-face stimuli. The face detection routine classifies
both the human face and the face of the plush cow as face stimuli, while the
waving hand and the slinky are classified as non-face stimuli. Additionally, the
motion generated by the object gives a rating of the stimulus intensity. The
robot’s facial expressions reflect its ongoing motivational state and provide the
human with visual cues as to how to modify the interaction to keep the robot’s
drives within homeostatic ranges.
In general, as long as all the robot’s drives remain within their homeostatic
ranges, the robot displays interest. This cues the human that the interac-
tion is of appropriate intensity. If the human engages the robot in face-to-face
contact while its drives are within their homeostatic regime, the robot displays
happiness. However, once any drive leaves its homeostatic range, the robot’s
interest and/or happiness wane as it grows increasingly distressed. As this
occurs, the robot’s expression reflects its distressed state. In general, the facial
expressions of the robot provide visual cues which tell whether the human should
switch the type of stimulus and whether the intensity of interaction should be
intensified, diminished or maintained at its current level.
For instance, if the robot is under-stimulated for an extended period of time,
it shows an expression of sadness. This may occur either because its social
drive has migrated into the “lonely” regime due to a lack of social stimulation
(perceiving faces nearby), or because its stimulation drive has migrated into
the “bored” regime due to a lack of non-face stimulation (which could be pro-
vided by slinky motion, for instance). The expression of sadness upon the robot’s
[Two plots of activation level against time (0-200 s): the emotions anger, disgust, interest, sadness, and happiness above; the social drive, socialize behavior, and face stimulus below.]
Fig. 9. Experimental results for Kismet interacting with a person’s face. When the
face is present and moving slowly, the robot looks interested and happy. When the face
begins to move too quickly, the robot begins to show disgust, which eventually leads
to anger.
face tells the caretaker that the robot needs to be played with. In contrast, if
the robot receives an overly-intense face stimulus for an extended period of time,
the social drive moves into the “asocial” regime and the robot displays an ex-
pression of disgust. This expression tells the caretaker that she is interacting
inappropriately with the robot – moving her face too rapidly and thereby over-
whelming the robot. Similarly, if the robot receives an overly-intense non-face
stimulus (e.g. perceiving large slinky motions) for an extended period of time,
the robot displays a look of fear. This expression also tells the caretaker that
she is interacting inappropriately with the robot, probably moving the slinky
too much and over-stimulating the robot.
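As an illustration only: the regime names below follow the text, but the update rule, thresholds, and numbers are invented for the sketch.

class SocialDrive:
    """A toy homeostatically regulated drive of the kind described above."""
    LO, HI = -1.0, 1.0          # assumed homeostatic bounds

    def __init__(self):
        self.level = 0.0

    def update(self, face_intensity, dt=0.1):
        # Face stimulation pushes the drive one way; its absence lets it
        # drift toward the "lonely" regime. 0.5 is an arbitrary set point.
        self.level += dt * (0.5 - face_intensity)
        self.level = max(-2.0, min(2.0, self.level))

    def expression(self):
        if self.level > self.HI:
            return "sadness"    # "lonely": too little social stimulation
        if self.level < self.LO:
            return "disgust"    # "asocial": overly intense face stimulus
        return "interest"       # within the homeostatic range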
6.1 Coherence
We have used simple cues, such as visual motion and sounds, to focus the visual
attention of Cog. However, each of these systems has been designed indepen-
dently and assumes complete control over system resources such as actuator
positions, computational resources, and sensory processing. We need to extend
our current emotional and motivational models (Breazeal & Scassellati 1998) so
that Cog might exhibit both a wide range of qualitatively different behaviors,
and be coherent in the selection and execution of those behaviors.
It is not acceptable for Cog to be repeatedly distracted by the presence of
a single person’s face when trying to attend to other tasks such as grasping
or manipulating an object. Looking up at a face that has just appeared in the
visual field is important. Looking at the object being manipulated is also
important. Neither stimulus should completely dominate the other, but perhaps
preference should be given based upon the current goals and motivations of
the system. This simple example is multiplied by the square of the number of
basic behaviors available to Cog, and so the problem grows rapidly. At this point
neither we, nor any other robotics researchers, have focused on this problem in
a way which has produced any valid solutions.
the arms beyond direct use in feedback control — there has been no connection
of that information to other cognitive mechanisms.
Finally, we have completely ignored some of the primary senses that are used
by humans, especially infants; we have ignored the chemical senses of smell and
taste.
Physical sensors are available for all these modalities but they are very crude
compared to those that are present in humans. It may not be instructive to try
to integrate these sensory modalities into Cog when the fidelity will be so much
lower than that of the, admittedly crude, current modalities.
So far we have managed to operate with visual capabilities that are much sim-
pler than those of humans, although the performance of those that we do use is
comparable to the best available in artificial systems. We have concentrated on
motion perception, face detection and eye localization, and content-free sensory
motor routines, such as smooth pursuit, the vestibular-ocular reflex, and ver-
gence control. In addition to integrating all these pieces into a coherent whole,
we must also give the system some sort of understanding of regularities in its
environment.
A conventional approach to this would be to build object recognition systems
and face recognition systems (as opposed to our current face detection systems).
We believe that these two demands need to be addressed separately and that
neither is necessarily the correct approach.
Face recognition is an obvious step beyond simple face detection. Cog should
be able to invoke previous interaction patterns with particular people or toys
with faces whenever that person or toy is again present in its environment. Face
recognition systems typically record detailed shape or luminance information
about particular faces and compare observed shape parameters against a stored
database of previously seen data. We question whether moving straight to such
a system is necessary and whether it might not be possible to build up a more
operational sense of face recognition that may be closer to the developmental
path taken by children.
In particular we suspect that rather simple measures of color and contrast
patterns coupled with voice cues are sufficient to identify the handful of people
and toys with which a typical infant will interact. Characteristic motion cues
might also help in the recognition, leading to a stored model that is much richer
than a face template for a particular person, and leading to more widespread
and robust recognition of the person (or toy) from a wider range of viewpoints.
We also believe that classical object recognition techniques from machine
vision are not the appropriate approach for our robot. Rather than forcing all
recognition to be based on detailed shape extraction we think it is important
that a developmental path for object recognition be followed. This will include
development of vergence and binocularity, development of concepts of object
7 Acknowledgments
Support for this project is provided in part by an ONR/ARPA Vision MURI
Grant (No. N00014-95-1-0600).
References
An, C. H., Atkeson, C. G. & Hollerbach, J. M. (1988), Model-based control of a robot
manipulator, MIT Press, Cambridge, MA.
Ashby, W. R. (1960), Design for a Brain, second edn, Chapman and Hall.
Ballard, D., Hayhoe, M. & Pelz, J. (1995), ‘Memory representations in natural tasks’,
Journal of Cognitive Neuroscience pp. 66–80.
Baron-Cohen, S. (1995), Mindblindness, MIT Press.
Blythe, J. & Veloso, M. (1997), Analogical Replay for Efficient Conditional Planning,
in ‘Proceedings of the American Association of Artificial Intelligence (AAAI-97)’,
pp. 668–673.
⁴ It is well known that the human visual system, at least in adults, is sensitive to the actual pigmentation of surfaces rather than the frequency spectrum of the light that arrives on the retina. This is a remarkable and counter-intuitive fact, and is rarely used in modern computer vision, where cheap successes with simple direct color segmentation have produced impressive but non-extensible results.
Embodiment As Metaphor: Metaphorizing-in the Environment¹

Georgi Stojanov
1 Introduction
After the “behaviourist turn” [40] in the field of AI (e.g. [9]), which may be regarded as a
reaction to the classical approaches based on explicit symbolic representations, things
swung to the other extreme. The need for representation was denied, and the emphasis was
put on building reactive systems that act in the real world. The usual argument was
that in a noisy and fast-changing environment there was no time left for the agent to
¹ The author wishes to express his gratitude to the Ministry of Culture and of Science of the Republic of Macedonia for the awarded grants which supported the work described in this paper.
constantly update its internal model of the world and act accordingly; it was better for it
simply to (re)act. However, it was soon realized within this behavior-based (BB)
approach [21] that, apart from simple behaviors like obstacle avoidance, wall (light,
odour, or any gradient) following, wandering, or exploration (“insect type intelligence”), it
was impossible to introduce any type of learning or adaptation systematically and
naturally. This was a trivial consequence of the fact that there were no variables in
these “fixed-topology networks of simple finite state machines” [10] to be changed (tuned,
learned, adapted). An obvious remedy was to introduce, well – some representations.
Indeed, in the decade that followed, various architectures appeared within the frame of BB
robotic systems, which introduced different types of representations. For a taxonomy of
these systems with respect to their treatment of representations see [37]. In our opinion,
one of the most important lessons learned from this episode in AI was that the
representations were to be contingent on the particular embodiment of the artifact. We can
mention here the works of Mataric [24, 25, 26, 27], Drescher [12], Indurkhya [18], Dean
et al. [11], Bickhard [2, 3, 4], and others. These types of representations were supposed to
avoid the problems of traditional representationalism (for an excellent critique of the
traditional approach see Chalmers, French, and Hofstadter in [17]). However, what was
lacking, was some kind of general framework which would act as a common denominator
for the above mentioned research dispersed over many diverse domains.
In this paper we put forward the idea of looking at the act of learning useful
environment models as a process of constructing a metaphorical description of the
environment in terms of the agent’s internal structures (i.e. its particular embodiment)
during the agent-environment interaction. This process is considered to be a basic
cognitive operation. It is similar to
what has been called similarity creating metaphors (SCM) in the metaphor research
literature (Black [5, 6], Hausman [14, 15, 16], Ricoeur [30, 31, 32, 33], Indurkhya [18]).
In our case we have a variant of SCM where the target domain is the environment itself,
and the source domain is the agent’s internal structures. An agent acts in the environment
exercising its set of basic behaviors, and trying to satisfy its needs (goals, or drives, which
can be treated as a possibility to exercise some kind of consummatory behavior). For
example, an agent may be hungry and the goal would be to find a place in the
environment where the food is located. In order to perform better than random search, it
should somehow use the history of its interactions. The agent cannot know anything about
the external environment beyond the effects it has produced on its internal structures, i.e.
the target domain is implicit. Here, we do not treat the cases where the environment is
explicitly given to the agent, for example, by means of some connectivity matrix among
perceptually different “places”. Rather, it should build its idiosyncratic cognitive map of
the environment. This map is metaphorical in the sense that it stands for the agent’s
past interactions with the environment. How “successful” a metaphor is depends on what it is used
for, i.e. what the agent’s goal is (e.g. navigation). As noted in [18] most of the research on
metaphor in cognitive science has concentrated on similarity-based metaphors. This
thread was further pursued in computational models of metaphor (understanding or
generating): both the target and the source domain are given and the program (agent,
artifact) tries to compute the similarities and the most plausible mappings. Notable
exceptions are the works of Hofstadter and his group [17]. Their view is that the essence
of metaphor and analogy making is the very process of constructing the representation of
the situations in both domains, not finding the mapping among pregiven representations.
We describe the agent’s internal structure using the notion of inborn schemas. In Section 2
we elaborate on this notion; for the time being we can say only that a schema represents an
ordered sequence of elementary actions that the agent is capable of performing.
In the remainder of this section we will briefly expose the basic idea by means of an
example. Suppose we have an agent inhabiting some maze-like world as given in Figure
1. The agent’s basic action set consists of 3 actions: F(orward), L(eft), and R(ight). They
move the agent forward, left, or right relative to its current position and orientation. Apart
from the proprioceptive sensations, which inform the agent about the movement
constraints at the place it occupies, various other sensory inputs (visual, sonar,
chemical, etc.) may be included in the percept set S.
Fig. 1. A maze-like world and the agent that inhabits it. See text for explanations.
In this example, the agent possesses only one inborn schema: FLR. Being in the
environment the agent spontaneously tries to exercise it. To simplify the matter, we
assume that the agent can occupy only certain places in the maze (marked with circles),
can have one out of four possible orientations (E, W, N, S), and moves in discrete time
instances. A successfully performed elementary action can displace the agent only to a
place neighboring its current position. For example, if the agent is at the lower left most
corner and facing north (see Figure 2) when trying to exercise the FLR schema, it will
only succeed to move forward, that is F__, and its next position will be as shown in Figure
2b.
Fig. 2. Agent trying to exercise the FLR schema from its current position and orientation.
The environment, being as it is, will systematically impose constraints on the agent’s
behavior, favoring thus only particular instances of the initial schema. For example, being
in a corridor, the agent can only move forward, that is, use the F__ instance of the schema
(that is, the F__ behavior). Cruising through the maze for a while and depending on the
initial position and orientation, the following sets of instances of the initial schema will be
favored for obvious reasons: F__, F_R, and __R, or F__, FL_, and _L_. So, as a result
of this interaction one of the following two basic environment conceptualizations (or
metaphorizations) will emerge (Figure 3):
[Two conceptual graphs: a) nodes F__, F_R, and __R; b) nodes F__, FL_, and _L_; each node carries its enabling percept set SXXX.]
Fig. 3. Two different conceptual structures that may result from agent-environment interaction.
“SXXX”s represent percepts enabling XXX behavior.
The F__ node represents “following the corridor” concept/behavior, while “turning left”
and “turning right” behaviors are represented with FL_ or _L_ and F_R or __R,
respectively. Note that in these metaphorizations all the corridors collapse into a single F__
node. The same is true for the turns. This is so because our agent does not have any
preferred “S”s that it should strive for. However, what this conceptual structure tells the
agent is that after following the corridor it must turn to the left or to the right and then
switch again to the corridor-following concept/behavior. Another important point is that
two identical percepts are interpreted in different ways, depending on what
concept/behavior (node) is currently active.
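A toy reconstruction of this example; the maze layout, the sideways reading of L and R, and the choice to leave the heading unchanged after a sidestep are assumptions of ours, since the text leaves these details open.

MAZE = [
    "#########",
    "#.......#",
    "#.#####.#",
    "#.......#",
    "#########",
]
HEADINGS = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # N, E, S, W as (row, col) steps

def free(r, c):
    return MAZE[r][c] == "."

def attempt(pos, heading, action):
    """Try one elementary action; F steps along the heading, L and R step
    sideways relative to it. Returns (new_pos, succeeded)."""
    turn = {"F": 0, "R": 1, "L": -1}[action]
    dr, dc = HEADINGS[(heading + turn) % 4]
    r, c = pos[0] + dr, pos[1] + dc
    return ((r, c), True) if free(r, c) else (pos, False)

def exercise_schema(pos, heading, schema="FLR"):
    """Attempt the whole inborn schema; blocked actions become '_', and the
    surviving pattern is the instance the environment favored."""
    instance = ""
    for a in schema:
        pos, ok = attempt(pos, heading, a)
        instance += a if ok else "_"
    return instance, pos

# In a corridor only F succeeds: exercise_schema((3, 1), 0) returns ("F__", ...).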
In the previous section we presented an example where the inner structure of the agent
was defined via its inborn schemas. Indeed, this approach seems to be very appealing, so
that one can say that the notion of schema is a leitmotif in psychology, AI, and cognitive
science. Speaking about the origins of the concept of schema as classically used in AI
(e.g. [23]), Arbib [1] points to the neurologist Henry Head and his body schema. Head
used the concept of body schema to explain the cases of patients with parietal lobe lesions
who were neglecting, for example, half of their bodies. According to him, the lesion
destroys a part of the body schema and that part of the body is being neglected by the
patients, i.e. no context was provided to interpret the incoming sensory inputs from those
body parts. Perhaps a clearer example of the schema notion is given by Sir Frederic
Bartlett, a student of Head. In his 1932 book “Remembering” he observes that people do
not remember things (events, situations) in a photographic manner. Rather, having heard
something, for example, and being asked to repeat it, they rarely use the exact words.
Some parts are emphasized and given more space and others just sketched or even
omitted. The point is that hearing and understanding something means projecting it on the
internal individual space of schemas; remembering then, is not a passive process but an
active reconstruction in terms of those schemas that were activated during the exposure to
the story (picture, movie...). Humans, as linguistically competent agents, are constantly
producing novel schemas in terms of stories, or narratives, thus enriching the source
domain for constant metaphorization of their new experiences.
So far in our theory, we are concerned only with agents without linguistic competence.
Most closely related to our understanding of the notion of schema is Piaget’s schema as
used in his theory of mental development [28]. Initially, according to the theory, the infant
has no concept of object permanence and this concept is constructed by internalizing the
various appearances of an object through interactions. Interactions are performed by
exercising the set of schemas (or schemata) the child is born with. A schema is an
organized sequence of behavior (e.g. sucking, grasping). According to Piaget, the very
existence of a schema in a child’s repertoire of action itself creates a motivation for its
use. That is, the motivation is intrinsic to the schema. The child tries to “understand” the
object by incorporating it in some existing schema: the act (the schema) of sucking may
be provoked with whatever object is placed in the mouth. This process is called
assimilation. Depending on the result of such an exercise (that is, the consequence) and
mental growth, initial schemas may change, and this process is called accommodation. An
example is the reaching-and-grasping-objects schema [34]: initially it consists of a fairly
crude “swipe and grab” in the general direction of an attractive object. As the baby grows
the schema becomes more refined and is adapted to the object’s position and size. It
begins to accommodate to the object [29].
What is important for us is that the internal representations of the environment are
inherently contingent on the agent’s structure, that is its specific embodiment. The
“reality” is re-presented via the modifications of its schemas. These modifications
metaphorically stand for its past experiences.
We ended the previous section with an example illustrating the use of the schema
notion. Our agent metaphorized-in its environment through the history of its interactions with it.
In the next section we show how it can use the concept/behavior structure that emerged, in
order to achieve some goals – i.e. how the agent can make these metaphorical descriptions
of its environment useful with respect to a given goal.
Fig. 4. a) Agent in a maze with an object provoking desirable sd in it. b) internal structure of the
agent with a possible conceptualization of the environment (see Figure 3a for details). The “C”
node represents the consummatory behavior which may be provoked by sd.
If we now put something in the maze that provokes some desirable sd in our agent, we will
create a goal in it. If we put the agent somewhere in the maze it will try to find the desired
thing, that is, to achieve the goal. Let us call that something food and place it in the upper
right part of the maze (see Figure 4a). In order to appreciate food the agent has to be able
to exhibit appropriate behavior. Let us call it consummatory behavior and represent it with a
schema within the agent as in Figure 4b.
If we assume that the Figure 3a conceptualization took place, the agent will bump onto the
food while performing the F__ behavior. It will then “think” that in order to get to the sd it
will suffice to do F__. This means that in the conceptual network a link will be built from
F__ to the C node (Figure 5). However, for the agent in the lowermost corridor or in one of
the three small corridors this will not do. If, for instance, it is in the position shown in
Figure 4a, it may reach the goal by performing F__-(F)_R-F__-C.
Fig. 5. After bumping onto the goal this conceptual structure is built...
Fig. 6. ... But there are “F__”s not leading to the goal while performing F__.
That is, there is an instance of F__ behavior where the sd percept cannot be observed.
It comprises the percepts from the SF__ set that do not occur during the executions of
F__ that lead to sd. This distinction leads to the creation of another instance of F__, named
F__’, containing those percepts, linked with the right F__ node via the (F)_R nodes
(Figure 6).
Fig. 7. A situation where the conceptual structure from Figs. 3-6 does not help.
Fig. 8. The “correct” conceptual structure that always leads to the goal. A\B denotes percept set
difference.
According to its observations the agent assumes it is in F__’. But performing the F__’-
(F)_R-F__ sequence does not bring it to the food. Again, this expectation failure will lead
to further differentiation among the F__ instances. Introduction of the new nodes leads to the
conceptual structure shown in Figure 8.
How does the agent use this map to get to the food? Whenever performing F it
observes the percepts and locates itself in the F__ or the F__’ node. If in F__, it will
eventually perceive sd. Being in F__’, however, it should make a turn and then
continue with F__. In doing so it marks positively the percepts from the SF__’, S__R, and SF_R sets
that occurred in a successful trial that began from F__’; this marking is needed because the
bare procedure would not work if the agent started from a position like the one shown in Figure 7.
Above, we used an example which showed an autonomous agent solving the
navigation problem. However, there are no assumptions regarding the interpretation of the
concept/behavior structures. In this context their natural interpretation is that of “places”
or “landmarks” in the world. Most generally they are “objects” in the agent’s Umwelt.
These objects afford certain manipulations with them. Agents learn these affordances via
the contingencies represented in the conceptual graph. Actually the name
concept/behavior is chosen to suggest this generality. We see that the introduction of goals
imposes additional ordering and refinement of the concept/behaviors that represent the
metaphorical description of the environment. This is a natural incorporation of the
pragmatic constraints in metaphor generation.
In [35, 36] we proposed an algebraic formulation of the above informally presented
procedure which was partially inspired by [18]. We also proposed learning algorithms and
in the next subsection we present simulation results in the case of more realistic
environments.
In these experiments, the simulated agent had a body and retina (Figure 9a), and was
capable of performing four elementary motor actions: go forward, go backward, go left,
and go right. These actions displace the agent for a fixed step, relative to its current
position in a two-dimensional environment (Figure 10a). The environment is populated by
obstacles and there is only one place where the food (goal) is to be found. Percepts were
semicircular scans in front of the agent in 10 different directions, returning the
distance to obstacles in the respective directions (Figure 9b).
[Fig. 9. a) The agent’s body and retina; b) a scan returns the distance to the nearest obstacle in each direction.]
Thus, given the sensory readings in a particular direction it is possible to decide whether
the next action from the schema which is to be performed is enabled or not. These
percepts are complemented with the outputs of two binary valued sensors for food (goal)
and bump detection. Food is detected if it falls within the semicircle in front of the agent.
In these particular experimental runs, we used agents having only one inborn schema with a
length of 20 to 30 elementary actions (e.g. fffrrllffrrbllrffffllffff). So, the source domain
contained 2^n (where n is the length of the inborn schema) potential enabled schemas.
The learning algorithm used was very simple:
[Fig. 10b plots the number of schema instances executed to find the food (0-45) against trial number (1-99).]
Fig. 10. a) The environment of the simulated agent; b) learning curve: the average number of
steps to the goal decreases each time the hunger drive is activated.
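As a hypothetical sketch of the enabling test this machinery rests on (ray indices, the step size, and the treatment of the backward action are assumptions of ours):

STEP = 1.0                        # fixed displacement per elementary action
RAY = {"f": 4, "l": 9, "r": 0}    # assumed scan rays (of the 10) per action

def enabled(action, scan):
    """An action is enabled when the scan shows room for a full step in its
    direction; 'b' is taken as always enabled since the scan is frontal."""
    return True if action == "b" else scan[RAY[action]] > STEP

def schema_instance(schema, scans):
    """Given one 10-ray scan per step, mark each action of the inborn
    schema enabled or blocked; the resulting pattern is one of the 2**n
    potential enabled schemas mentioned above."""
    return "".join(a if enabled(a, s) else "_" for a, s in zip(schema, scans))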
as well as with some features discernible by the agent’s perceptual apparatus. We have
done this [39] for the case of simple simulated environments but the procedure is not
applicable to complicated, real-world ones. Another issue we did not explicitly address in
the paper is the choice of inborn schemas. For the time being we are working on applying
genetic algorithms to solve this problem, i.e. to evolve “optimal” inborn schemas for
a given agent, environmental niche, and goals. Although we have been doing simulations
only so far, we are quite optimistic regarding the scalability of the methods here proposed,
given the positive examples of relatively simpler real-world learning agents (e.g. [24],
[42]).
We conclude this section by explicating and justifying the use of the class of
similarity-creating metaphors to describe our agent’s architecture and operation. In the
process of internalizing the environment, the agent tries to describe metaphorically its
environment in terms of its internal structure by creating similarities between the
description and the environment. These similarities are, of course, similarities perceived
from the agent’s point of view. For example, having inhabited some environment for a
while and then being put in a different one, the only measure of similarity from the
agent’s perspective would be how good the old metaphor is in locating the food in the new
environment. We presented only one simple learning algorithm. There are many other
ways of introducing some other ordering among the enabled schemas, which would reflect
yet other more subtle “similarities” between the source and the implicit target domain.
(For example, while performing the elementary actions the agent can be treated as
traversing some finite state automaton. Repeating a sequence of elementary actions would
lead the agent to enter a cycle; thus, we could group the enabled schemas according to the
cycles they participate in, and use this grouping as a basis for building useful environment
models).
The work described here originated in our research of the problem of environment
representations in artificial and biological agents [37, 38, 39, 7, 41]. Among the main
results was the concept of environment representations via the process of metaphorizing-
it-in in terms of the agent’s inner structure (i.e. the agent’s particular embodiment). Various
research threads scattered across diverse areas such as embodied and situated cognition,
agency in AI, metaphor in language, and the like can easily fit this metaphorizing-in the
environment framework. The work of Tani (e.g. [42]) comes closest to the spirit of our
approach. The internal structure of his agent is represented via a Recurrent Neural Net
(RNN). The structure of the RNN represents, of course, the source domain. Mataric (e.g.
[27]) proposes biologically (rat hippocampus) inspired internal structure. Drescher’s agent
[12] uses rather symbolic schema structures inspired by Piaget’s theory.
From the purely theoretical research we can mention the work of Indurkhya [18]
where he gives a rather detailed algebraic model of metaphorical reasoning, and the work
References
1. Arbib, M. A.: In Search of the Person, The University of Massachusetts Press (1985).
2. Bickhard, M. H.: Cognition, Convention, and Communication, Praeger Publishers
(1980).
3. Bickhard, M. H.: “Representational Content in Humans and Machines”, Journal of
Theoretical and Experimental Artificial Intelligence, 5 (1993a).
4. Bickhard, M. H.: “On Why Constructivism Does Not Yield Relativism”, Journal of
Theoretical and Experimental Artificial Intelligence, 5 (1993b).
5. Black, M.: “Metaphor” in M. Black Models and Metaphors, Cornell University Press,
Ithaca, NY; originally published in Proceedings of the Aristotelian Society, N.S. 55,
1954-55; Reprinted in M. Johnson (ed.) Philosophical Perspectives on Metaphor,
University of Minnesota Press, Minneapolis, Minn. (1981).
6. Black, M.: “More about Metaphor”, in A. Ortony (ed.) Metaphor and Thought,
Cambridge University Press, UK (1979).
7. Bozinovski, S., Stojanov, G., Bozinovska, L.: “Emotion, Embodiment, and
Consequence Driven Systems”, AAAI Fall Symposium, TR FS-96-02, Boston (1996).
8. Bozinovski, S.: Consequence Driven Systems, GOCMAR Publishers, Athol (1995).
9. Brooks, R. A., “A Robust Layered Control System for a Mobile Robot”, IEEE Journal
of Robotics and Automation, RA-2, April (1986).
10. Brooks, R. A.: “Intelligence Without Representation”, Artificial Intelligence, No. 47
(1991).
11. Dean, T., Angluin, D., Basye, K., Kaelbling, L. P.: “Uncertainty in Graph-Based Map
Learning”, in J. Connell and S. Mahadevan (eds.) Robot Learning (1992).
12. Drescher, G.: Made-Up Minds, MIT Press (1991).
13. Fauconnier, G., Turner, M.: “Conceptual Projection and Middle Spaces”, UCSD
Cognitive Science Technical Report 9401, San Diego (1994).
14. Hausman, C. R.: “Metaphors, Referents, and Individuality”, Journal of Aesthetics and
Art Criticism, Vol. 42 (1983).
15. Hausman, C. R.: A Discourse on Novelty and Creation, SUNY Press, Albany, NY
(1984).
16. Hausman, C. R.: Metaphor and Art: Interactionism and Reference in Verbal and
Nonverbal Art, Cambridge University Press, Cambridge, UK (1989).
17. Hofstadter, D. R., and the Fluid Analogies Research Group: Fluid Concepts and
Creative Analogies, BasicBooks, (1995).
18. Indurkhya, B.: Metaphor and Cognition, An Interactionist Approach, Kluwer
Academic Publishers, Boston (1992).
19. Johnson, M.: The Body in the Mind, Chicago University Press, Chicago (1987).
20. Lakoff, G.: Women, Fire, and Dangerous Things, The University of Chicago Press
(1987).
21. Maes, P.: Designing Autonomous Agents: Theory and Practice from Biology to
Engineering and Back, MIT Press, Cambridge (1991).
22. Mayer, R. E.: Thinking, Problem Solving, Cognition, W.H. Freeman and Company,
New York (1992).
23. Minsky, M.: "A Framework for Representing Knowledge", in A. Collins and E. Smith
(eds.) Readings in Cognitive Science, Morgan Kaufmann Publishers (1988).
24. Mataric, M.: “Navigating With a Rat Brain: A Neurobiologically-Inspired Model for
Robot Spatial Representation”, in J. A. Meyer & S. Wilson, eds. From Animals to
Animats, International Conference on Simulation of Adaptive Behavior, The MIT Press
(1990).
25. Mataric, M.: "Integration of Representation Into Goal-Driven Behavior-Based
Robots", in IEEE Transactions on Robotics and Automation, Vol. 8, No. 3 (1992).
26. Mataric, M.: “Integration of Representation Into Goal-Driven Behaviour-Based
Robots”, IEEE Transactions on Robotics and Automation, Vol. 8, No.3, (1992).
27. Mataric, M.: “Navigating With a Rat Brain: A Neurobiologically-Inspired Model for
Robot Spatial Representation”, in J. A. Meyer & S. Wilson, eds. From Animals to
Animats, International Conference on Simulation of Adaptive Behaviour, The MIT
Press, (1990).
28. Piaget, J.: Genetic Epistemology, Columbia, New York (1970).
29. Piaget, J., Inhelder, B.: Intellectual Development of Children (in Serbo-Croatian),
Zavod za udžbenike i nastavna sredstva, Beograd (1978).
30. Ricoeur, P.: Interpretation Theory: Discourse and the Surplus of Meaning, The Texas
Christian University Press, Fort Worth, Tex., (1976).
31. Ricoeur, P.: The Rule of Metaphor, University of Toronto Press, Toronto, Canada,
(1977).
32. Ricoeur, P.: “The Metaphorical Process as Cognition, Imagination, and Feeling”,
Critical Inquiry 5, No. 1, 1978; Reprinted in M. Johnson (ed.) Philosophical
Perspectives on Metaphor, University of Minnesota Press, Minneapolis, Minn. (1981).
33. Ricoeur, P. “Imagination et Metaphore”, Psychologie Medicale, Vol. 14, No. 12,
(1982).
34. Roth, I. (ed.): Introduction to Psychology, Vol. 1., LPA and The Open University,
London (1991).
35. Stojanov, G.: Expectancy Theory and Interpretation of EXG curves in the Context of
Biological and Machine Intelligence, PhD Thesis, ETF, Skopje (1997a).
36. Stojanov, G., Bozinovski, S., Trajkovski, G.: “Interactionist Expectative View on
Agency and Learning”, IMACS Journal of Mathematics and Computers in Simulation,
North-Holland, No. 44 (1997b) 295-310.
37. Stojanov, G., Trajkovski, G., Bozinovski, S.: “The Status of Representation in
Behaviour Based Robotic Systems: The Problem and A Solution”, IEEE Conference
Systems, Man, and Cybernetics, Orlando (1997c).
38. Stojanov, G., Trajkovski, G., Bozinovski, S.: "Representation versus context: A false
dichotomy", 2nd ECCS Workshop on Context, Manchester (1997d).
39. Stojanov, G., Trajkovski, G.: “Spatial Representations for Mobile Robots: Detection
of Learnable and Unlearnable Environments”, Proceedings of the First Congress of
Mathematicians and Computer Scientists in Macedonia, Ohrid, Macedonia (1996).
40. Stojanov, G., Bozinovski, S., Simovska, V.: “AI (Re)discovers behaviorism and other
analogies”, presented at the 3rd Int. Congress on Behaviorism and Sciences of Behavior,
Yokohama (1996).
41. Stojanov, G., Stefanovski, S., Bozinovski, S.: “Expectancy Based Emergent
Environment Models for Autonomous Agents”, 5th International Symposium on
Automatic Control and Computer Science, Iasi, Romania (1995).
42. Tani, J.: “Model-Based Learning for Mobile Robot Navigation from Dynamical
System Perspective”, IEEE Transactions on Systems, Man, and Cybernetics 26(3)
(1996).
43. Turner, M.: “Conceptual Blending and Counterfactual Argument in the Social and
Behavioral Sciences”, in P. Tetlock and A. Belkin (eds.), Counterfactual Thought
Experiments in World Politics, Princeton University Press, Princeton (1996).
Embodiment and Interaction in Socially
Intelligent Life-Like Agents
Kerstin Dautenhahn
Department of Cybernetics
University of Reading, United Kingdom
Above we use the term ‘agent’ in order to account for different embodiments of agents, and also to allow the discussion of biological agents and software agents.
The issue of autonomy plays an important part in agent discussions. In [27]
the author defines autonomous agents as entities inhabiting a world, being able
to react and interact with the environment they are located in and with other
agents of the same and different kind (a variation of Franklin and Graesser’s definition [36]).
This chapter is divided as follows: section 2 discusses the general issue of
knowledge and memory in human society (section 2.1), and the specific issue of
autobiographic agents (section 2.2). Section 3 discusses embodiment in physical
(robotic) agents (section 3.1) and virtual agents (section 3.2). The latter section
shows a concrete example of behavior-oriented control which the author has
used in her work. The same programming approach, applied to an experiment
on robot-human interaction, is presented in section 3.3. Section 4 discusses the
issue of social agents in more detail, relating it to sociobiology and evolution-
ary considerations on the origin of social behavior (section 4.1). Social software
agents are discussed in section 4.2. Such issues lead to an attempt to define
(artificial) social intelligence from the perspective of an individual (section 4.3),
as well as from the perspective of social organization and control (section 4.4).
Section 5 discusses a research project which studies how an interactive robot can
be used as a remedial tool for children with autism. In section 6 we come back to
the starting point of our investigations, namely how embodiment and meaning
apply to agent research.
Primate societies can be said to exhibit the most complex social relationships
which can be found in the animal world. The social position of an individual
within a primate society is neither innate nor strictly limited to a critical im-
printing period. Especially in human 20th-century societies social structures are
in an ongoing process of re-structuring. In a way one could say that the tendency to make our non-social environment more predictable and reliable by means of technological and cultural re-structuring and control has been accompanied by the tendency for our social life to become more and more complex and unpredictable, often due to the same technologies (e.g. electric power helps to keep us warm and safe during winter, while at the same time means of social inter-networking could give rise to sociological and psychological changes in our conception of personality and social relationships [88]).
Such degrees of complexity in the social behavior of single humans, as well as the complexity of societies which emerge from interactions of groups of individuals, depend on having a good memory: both a memory as part of the individual and a shared or ‘cultural memory’ for societies. Traditionally such issues have not been considered in Artificial Intelligence (AI) or Artificial Life (Alife) research. In the former the issue of discussion was less about memory and more about knowledge. Memory (‘the hardware part’) was mostly regarded as less of a problem than knowledge (the ‘software part’: representations, algorithms).
The idea to extract knowledge from human experts and make it operational in
computer programs led to the development of professions like knowledge engi-
neer and products like (expert- or) knowledge-based systems. The knowledge
debate can best be exemplified by the Cyc-endeavour ([52]) which for more than
one decade has been trying to ‘computationalize’ common-sense knowledge. The
idea here is not to extract knowledge from single human beings but to trans-
fer encyclopedic (cultural) knowledge to a computer. In the recently emerging
internet-age the knowledge-debate has regained attention through technological
developments trying to cope with ‘community knowledge’.
In Alife research the distinction between hardware and software level is less
clearly drawn. Evolutionary mechanisms are investigated both on the hardware,
as well as on the software side (see evolutionary robotics [41] and evolvable
hardware [55]). These conceptions are closer to biology, where the ‘computa-
tional units’, e.g. neurons, are living, dynamic systems themselves, so that the
distinction between hardware and software is not useful. In the case of evolv-
ing software-agents the distinction becomes less clear. Nevertheless the question
when and whether to call software agents ‘life-like’ (if not to say ‘living’) is still
open.
A main research issue in Alife concerns the question of how ‘intelligence’ and
‘cognition’ in artifacts can be defined and achieved. The question of how best
to approach cognitive or ‘intelligent’ behavior is still open. Here we find a broad
area of intersection between AI and Alife. The main difference in the ‘artificial life
Once upon a time, in the not so far future, robots and humans enjoy
spending their tea breaks together, sitting on the grass outside the office,
gossiping about the latest generation of intelligent coffee machines which
nobody cares for, debating on whether ‘losing one’s head’ is a suitable
judgement on a robot which fell in love with another robot not of his
own kind, and telling each other stories about their lives and living in a
multi-species society.
Bodily interaction with the real world is the easiest way to learn about the
world, because it directly provides meaning, context, the ‘right’ perspective, and
sensory feedback. Moreover, it gives information about the believability of the
world and the position of the agent within the world. The next section discusses
issues of embodiment and meaning in different environments.
each other individually, i.e. they do not use any representations of other agents
or explicit communication. In contrast, the term ‘cooperation’ describes a form
of interaction which usually uses some form of more advanced communication.
“Specifically, any cooperative behaviors that require negotiation between agents
depend on directed communication in order to assign particular tasks” [57]. Dif-
ferent ‘roles’ between agents are for instance studied in [48]: a flocking behavior where one robot is the leader, but the role of the ‘leader’ is only temporarily assigned and depends on local information only. Moreover there is only one fairly
simple ‘task’ (staying together) which does not change.
Behavior-based research on the principle of stigmergy does not use explicit representations of goals; the dynamics of group behavior are emergent and self-organizing. The results of such behavior can be astonishing (e.g. see the building activities or feeding behavior of social insects), but they differ from the highly complex forms of social organization and cooperation which we find e.g. in mammal societies (see the hunting behavior of wolves or the organization of human society), employing division of labour, individual ‘roles’ and tasks allocated to specific individuals, and as such based on hierarchical organization. Hierarchies in mammal
societies can be either fairly rigid or flexible, adapted to specific needs and chang-
ing environmental conditions. The basis of an individualized society is particular
relationships and explicit communication between individuals.
Another example of fruitful scientific collaboration between biological and engineering disciplines is the ecological approach to the study of self-sufficiency and cooperation between a few robotic agents, which has been intensively pursued by David McFarland and Luc Steels. The theoretical background
and experimental results are described in [60,81,83]. The biological framework
is based on concepts and mechanisms within a sociobiological background and
rooted in economics and game theoretical evolutionary dynamics. Thus, central
concepts in the design of the ecosystem, the robots, and the control programs
which implement the behavior of the robotic agents are self-sufficiency and util-
ity (see [59] for a comprehensive treatment of this framework). A self-sufficient
robot must maintain itself in a viable state for long periods of time, so it
must be able to keep track of its energy consumption and recharge itself. This
can be seen as the basic ‘selfish’ need of a robot agent in order to guarantee
its ‘survival’. In the scenario developed by McFarland and Steels this level is
connected to cooperative behavior in the sense that viability can only be en-
sured by cooperation (note that here the term cooperation is used by Steels and
McFarland although the robots do not explicitly communicate with each other).
A second robot in the ecosystem is necessary since parasites (lights) are taking
energy from the ecosystem (including the charging station), but the parasites
can temporarily be switched off by a robot bumping into them. The ecosystem
itself was set up so that a single robot alone (alternating between switching off the parasites and recharging) could not survive.
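The energy bookkeeping of such an ecosystem can be sketched in a few lines. The toy model below (class names and constants are ours for illustration, not taken from McFarland and Steels) only shows the accounting: parasites drain the shared supply unless a robot keeps bumping them off, while robots drain their batteries and must recharge from that same supply.

```python
class Robot:
    def __init__(self):
        self.battery = 100.0

    def act(self, eco, task):
        self.battery -= 1.0                  # cost of living per step
        if task == "recharge" and eco.supply > 0:
            transfer = min(5.0, eco.supply)
            eco.supply -= transfer
            self.battery = min(100.0, self.battery + transfer)
        elif task == "fight":                # bump into the lights
            eco.parasites_off = 3            # parasites off for 3 steps

class Ecosystem:
    def __init__(self):
        self.supply, self.parasites_off = 500.0, 0

    def step(self):
        if self.parasites_off > 0:
            self.parasites_off -= 1
        else:
            self.supply -= 4.0               # parasites steal energy

eco, robots = Ecosystem(), [Robot(), Robot()]
for t in range(200):
    # Naive division of labour between the two robots:
    robots[0].act(eco, "recharge" if robots[0].battery < 50 else "fight")
    robots[1].act(eco, "fight" if eco.parasites_off == 0 else "recharge")
    eco.step()
print(eco.supply, [r.battery for r in robots])
```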
It is interesting to note that McFarland very easily transferred and applied
sociobiological concepts to robot behavior. The development of robot designs (the artificial evolution) is here also interpreted in terms of marketing
Fig. 1. The learner robot. It has to learn the teacher’s interpretations of ‘words’
on the basis of its own sensory inputs. Learning here means creating associations.
learner robot learns to associate names for ‘hill’ and ‘plane’ (see figures 1, 2, 3)
which are distinct features in its environment.
The behavioral architecture implements concepts of equilibrium and energy
potential in order to balance the internal dynamics of processes linked to in-
stinctive tendencies and individual learning. The results obtained were successful in terms of learning capacities, but they point out the limitations of using the
imitative following strategy as a means of learning. Unsuccessful or misleading
learning occurs due to the embodied nature of the agents (spatial displacement)
and the temporal delay in imitative behavior. These findings gave rise to a series
of further experiments which analyzed these limitations quantitatively and de-
termined bounds on environmental and learning parameters for successful learn-
ing [10], e.g. the impact of the parameter specifying the duration of short-term
memory which is correlated to the particular spatial distance (constraints due
to the embodiment) of the two agents.
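A minimal sketch of this kind of association learning, including the misleading-learning effect caused by spatial displacement and delay, might look as follows. The encoding (two ‘words’, two percepts, and a short-term-memory window standing in for the duration parameter just discussed) is a toy assumption of ours, not the architecture of [10].

```python
import numpy as np

WORDS = {"hill": 0, "plane": 1}
assoc = np.zeros((2, 2))       # rows: heard words; columns: own percepts

def learn(heard_word, percept_history, stm_window=3):
    """Associate a heard word with the percepts still held in short-term
    memory (the window length is the critical parameter)."""
    for percept in percept_history[-stm_window:]:
        assoc[WORDS[heard_word], percept] += 1.0

# Spatial displacement shows up as delayed percepts: the learner heard
# 'hill' while the teacher was on the hill, but it has already moved on.
percepts = [0, 0, 0, 1, 1]     # 0 = inclined (hill), 1 = flat (plane)
learn("hill", percepts)
print(assoc)                   # the wrong word/percept association dominates
```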
One of the basic conclusions from these experiments was that general bounds
on parameters controlling social learning in the teacher-learner set-up can be
specified, but that the exact quantitative values of these parameters have to be
adjusted in the concrete experiments, e.g. adapted to the kind of robots, en-
vironment, and interactions which the experiments consist of. What does this
imply for the general context of (social) learning experiments with mobile robots? A cautious suggestion, based on the results so far, is that the fine-tuning of parameters in experiments with embodied physical agents is not an undesired effect, and that it is not only a matter of time until it can be overcome by a next and better
Fig. 2. The teacher (left) and the learner (right) robot in the initial position. The robots are not identical: they have different shapes and different sensori-motor characteristics. We assume that the teacher robot ‘knows’ how to interpret the world, i.e. it emits two different signals (bitstrings) via radio link communication, one for moving on a plane and one for moving on a hill.
a ‘weak’ status of embodiment. E.g. the body of the robot is static, the posi-
tion and characteristics of the sensors and actuators are modified and adapted
to the environment by hand, not by genuine development (compare with re-
cent studies on the evolution of robot morphology, e.g. [54]). The body (the
robot’s mechanical and electronical parts) is not ‘living’, and its state does not
depend on the internal dynamics of the control program. If the robot’s energy
supply is interrupted (the robot ‘dies’), the robot’s body still remains in the
same state. This is a fundamental difference to living systems. If the dynamics (chemical-physiological processes) inside a cell stop, then the system dies: it loses its structure and dissipates (in addition to being consumed by saprobes), and cannot be reconstructed (revived); see [26].
This section illustrates the design of virtual robots in virtual worlds and discusses
the role of embodiment in virtual agents. To be concrete, the discussion is based
on the virtual laboratory INSIGHT developed by Simone Strippgen ([84,85]).
This environment uses a hilly landscape scenario with virtual robots which has
also been studied in robotic experiments ([74,21]). The environment may consist
of charging stations, areas with sand, water and trees, and other agents. IN-
SIGHT is a laboratory for experiments in an artificial ecosystem where different
environments, robots and behaviors can be designed. Visualization tools, and a
Fig. 4. Experiments in INSIGHT. a) Environment with sand, water, trees, charging station and one agent. The two sensor cones for finding the charging station are indicated (dashed lines); these sensors cover a relatively large area of the environment. The light sensors (necessary to detect other agents) have the same size. b) Design of an agent: the head indicates the back-front axis. The agent has a ring of 8 bumpers (Bumper1–8) surrounding the surface of its body, 2 sensors measuring distance to the charging station (CS1 and CS2), 3 sensors each for detecting sand and water (Water1–3; Sand1–3), 2 inclination sensors for the forward-backward and left-right orientation of the body axis (InclinationFB, InclinationLR), and 2 sensors sensitive to green light (SignalGreenLight1, 2). Each agent has a green ‘light’ on top. c) An agent approaching a charging station.
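Written out as a plain record, the sensor outfit in b) might look as follows. This is a sketch of the configuration only; INSIGHT's own data structures are not shown in the text, and the field names simply follow the caption's labels.

```python
from dataclasses import dataclass, field

@dataclass
class AgentSensors:
    bumpers: list = field(default_factory=lambda: [0] * 8)              # Bumper1..8 ring
    charging_station: list = field(default_factory=lambda: [0.0, 0.0])  # CS1, CS2
    water: list = field(default_factory=lambda: [0.0] * 3)              # Water1..3
    sand: list = field(default_factory=lambda: [0.0] * 3)               # Sand1..3
    inclination_fb: float = 0.0    # forward-backward orientation of body axis
    inclination_lr: float = 0.0    # left-right orientation of body axis
    green_light: list = field(default_factory=lambda: [0.0, 0.0])       # SignalGreenLight1,2
```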
movement repertoire (‘global’ option, only one motivation factor) or single move-
ments (‘select’ option, several motivation factors). The maximum value is 100
which means that the motor control commands are directly sent to the robot,
e.g. the commands to perform a sequence of movements. A motivation which
equals zero or is below zero means that the robot will not move at all (global
option), or will not perform that particular movement (select option). Figure 6
shows the combinations of modes and options for running the experiments. In
the autonomous mode movements with associated values which equal or are less
than zero are skipped in the sequence of movements. In that situation this partic-
ular movement would therefore (from an observer point of view) disappear from
the robot’s movement repertoire. To give a simple example, let us assume two
agents A and B which can show four and six different movements respectively: A1, A2, ..., A4 and B1, B2, ..., B6. If during nine consecutive timesteps agent B shows the sequence B1-B2-B3-B1-B2-B3-B1-B2-B3 while agent A shows A4-A4-A4-A4-A4-A4-A4-A4-A4, then the temporal coordination between the movements equals zero. B showing B4-B4-B4-B1-B1-B1-B1-B2-B2 results in an update of the weights between A4/B4 (updated twice), A4/B1 (three times), and A4/B2
weights between A4/B4 (update twice) and A4/B1 (three times) and A4/B2
(once). Thus, it does not matter if the movements of agent A and agent B are
the same, it only matters if the current pairing (e.g. A1 and B4) is maintained
over consecutive timesteps. Note that the sequences A1-A2-A3 and B2-B3-B4
are temporally not coordinated, although they might be considered as mirror or
imitated movements. This might appear counter-intuitive, but results from the
segmentation of movements which is needed for the input of the association ma-
trix. Inputs to the matrix represent movements during fractions of a second, so
not ‘behaviors’ (extended over time, e.g. seconds) in the strict sense. Parameters
which are controlling the generation of the input data for the association matrix
are therefore important features of the set-up. They were manually adapted to
the movements of the human.
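The pairing rule just described can be stated compactly. The sketch below is our own toy restatement, not the system's implementation; it reproduces the worked example with agents A and B.

```python
def coordination_updates(a_seq, b_seq):
    """A pairing (a, b) earns one weight update for every timestep at
    which it is maintained from the previous timestep."""
    updates = {}
    for t in range(1, len(a_seq)):
        pair = (a_seq[t], b_seq[t])
        if pair == (a_seq[t - 1], b_seq[t - 1]):
            updates[pair] = updates.get(pair, 0) + 1
    return updates

a = ["A4"] * 9
b = ["B4", "B4", "B4", "B1", "B1", "B1", "B1", "B2", "B2"]
print(coordination_updates(a, b))
# {('A4', 'B4'): 2, ('A4', 'B1'): 3, ('A4', 'B2'): 1}

print(coordination_updates(a, ["B1", "B2", "B3"] * 3))
# {}  (temporal coordination equals zero)
```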
[Figure: the interaction set-up. A camera classifies human hand movements (left, right, up, down, circle+, circle−) into the sensory inputs j of an association matrix; its motor outputs i drive the robot’s rotation (left/right) and translation (forward/backward); the robot carries an antenna.]
Fig. 6. Modes (Autonomous, Slave) and options (Global, Select) used in the ‘dancing with strangers’ experiments.
zero. As a result, the robot will, as long as the human reacts with temporally
coordinated movements, continuously rotate in an anti-clockwise direction. The
human’s appropriate reaction need not necessarily be clockwise rotation: horizontal movements to the left, or any other movements which are linked to the robot’s anti-clockwise movement (as specified in the association matrix), have the same effect.
Figure 8 gives an example of an experiment in the slave mode of the system.
Series 1-4 represent motivation factors associated with particular movements of the robot: 1-2 stand for rotation (1: anti-clockwise, 2: clockwise), 3-4 stand
for translational movements (3: moving forwards, 4: backwards). All weights in
the association matrix are initialized with 100 (maximum) and decrease by 0.5
in each iteration cycle if no temporal coordination between the human’s and
the robot’s movements is detected by the robot. If a temporal coordination is
detected then the weight is increased by 1.5 in each iteration cycle. Since vertical
hand movements are not used in this sequence the weights for translational
movements drop monotonically, and series 3 and 4 cannot be distinguished. Due
to reactions of the human a particular movement of the robot is selected, in this
case turning to the left. The human starts with hand movements to the right
and left, points a, b, c and d in figure 8 indicate her changes of direction. At
point e she switches to circular movements in anti-clockwise direction. During
the ‘training’ period the weights for other movement tendencies drop to zero
while the robot’s tendency for anti-clockwise rotation increases to the maximum
value. At point f the human stops circular movements and starts to move her
hand from left to right. The weight for anti-clockwise rotation drops slightly while
the weight for clockwise rotation slowly increases. However, since the weights
for movements other than anti-clockwise rotation are close to zero, the robot
does not exhibit any visible movement. Thus, the movement repertoire of the
robot has been trained towards anti-clockwise rotation. Strictly speaking this
only applies to movements (different from anti-clockwise rotation) with a short
duration. If the human changes her preferred movements from anti-clockwise
rotation to clockwise rotation then this leads to a retraining of the robot.
[Figure: association-matrix weights plotted against time steps; one panel shows a single weight decaying from 100 to 0, another shows Series 1–4.]
Of course, the learning mechanism could be changed so that once a pattern has
been trained the robot tends to memorize this movement. In the experiments
reported here we did not implement any such memory functionality.
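The constants reported above fully specify the weight dynamics: weights are initialized at the maximum of 100 and moved by +1.5 (coordination detected) or −0.5 (no coordination) per iteration cycle. A minimal sketch (the function names are ours):

```python
W_MAX = 100.0

def update(weight, coordinated):
    """One iteration cycle of the weight dynamics described above."""
    return min(W_MAX, weight + (1.5 if coordinated else -0.5))

w = W_MAX
for cycle in range(300):       # the human ignores this movement throughout
    w = update(w, coordinated=False)
print(w)                       # 100 - 300 * 0.5 = -50.0
print(w <= 0)                  # True: the movement is skipped (select option)
                               # or all movement stops (global option)
```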
[Fig. 8. Motivation weights (Series 1–4) plotted against time steps in the slave-mode experiment; the points a–g mark the changes in the human’s hand movements.]
on the reactions or the feedback by the human, its movement repertoire. A very
simple association matrix was used for training purposes; however, it turned out in demonstrations of this system⁴ that it was the human rather than the robot who was the learner in these experiments. In the slave mode humans very
quickly realized that the robot’s movements were correlated to their own move-
ments and that the robot could be operated like a passive puppet-on-a-string
toy. However, the ‘puppet’ was sensitive to how long humans interact with it
and how ‘attentive’ they were (e.g. adapting the speed of their own movements
to the robot’s speed, this was necessary e.g. when trying to change the robot’s
movement from turning left to turning right, see above). A cooperative human
paid attention to the robot’s movement and kept it moving, ‘neglect’ made the
robot slow down and finally stop. The robot could also be operated (in the select option) so that it finally performed only those movement(s) to which the human gave the longest response and most attention. The robot therefore adapted to the hu-
man and ‘personalized’, i.e. after a while only reacting to the human’s ‘favorite’
movement. This also occurred in the autonomous mode, however then the human
could only select from a given repertoire of movements, i.e. the human could
shape the robot’s autonomous behavior. A cooperative human learnt quickly to
give the appropriate feedback in order to keep the robot moving. Depending on
the human’s preference the robot then (in the autonomous mode) ended up
⁴ For instance at a workshop co-organized with Luc Steels: 7-14 September 1996
in Cortona, Italy (Cortona Konferenz - Naturwissenschaft und die Ganzheit des
Lebens,“Innen und Aussen” - “Inside/Outside”).
performing only one or a few different movements. Thus, the behavior of the robot
finally was typical of the human who interacted with it.
Potentially this method can be used to adapt the behavior of a robot to a
human’s individual needs and preferences, in particular if the ‘movements’ which
we used become complex behaviors and can be shaped individually. This process
is done in a purely non-symbolic way, without any reasoning involved except
for defining an association matrix and detecting temporal coordination. More
sophisticated learning architectures could be based on such a system, e.g. for
the study of imitation ([38,10]). This becomes particularly attractive if the robot
has more degrees of freedom than the simple system we used in these robot-human interaction experiments. This becomes important in areas where humans have
long periods of interaction with a robot, e.g. in service robotics (e.g. [91]).
Another aspect of robot-human interaction aims at believability: e.g. as [35] shows, a robot with life-like appearance and responses furthers the motivation of
a human to interact with the robot. The dynamics of the robot-human interac-
tions change both the states of the robot and the human, and that influences the
overall interaction and the way the human interprets the robot. The following
section analyses in more detail levels of interaction and how robot behavior is
interpreted by a human observer.
4 Social Matters
The term ‘social’ seems to have become a fashionable word in recent years. It is often used in different communities when describing work on models, theories or implementations which comprise interactions between at least two autonomous systems. The word ‘social’ is used intensively in research on multi-agent systems (MAS), distributed artificial intelligence (DAI), Alife, and robotics. It has been used for a much longer time in research areas primarily dealing with natural systems, such as psychology, sociology, and biology. It would go beyond the scope of this paper to discuss at length the historical and current use of the term ‘social’ in all these different research areas. Instead, we exemplify its use by discussing
distinct approaches to sociality. Particular emphasis is given to the role of the
individual in social modelling. We discuss issues which seem to be important
characteristics of this individual dimension. In order to account for the individ-
ual in social modelling we relate this to the concept of autobiographic agents
4.1 Natural Social Agents: Genes, Memes and the Role of the
Individual
Sociobiology can be defined as the science of investigating the factors of biological
adaptation of animal and human social behavior (according to [89], p. 1). In
his most influential book Sociobiology Edward O. Wilson argues for using the
term ‘social’ in an explicitly broad sense, “in order to prevent the exclusion
of many interesting phenomena” ([93]). One concept is basic to sociobiology: gene selection, namely viewing genes, and not the individual as a whole or the species, as the basic units of selection. An important term in the sociobiological
vocabulary is selfishness which means that genes or individuals behave only in
a way which tends to increase their own fitness. The principle of gene selection
is opposed to how ‘classical’ ethology views the evolution of species with the
individual as the basic unit of selection. According to [94] the new paradigm of
sociobiology is that it uses Darwin’s theory of evolution by natural selection and
has transferred it to the level of genes.
Richard Dawkins’s selfish-gene approach has, across disciplines, influenced the way people think about evolution and the role of the human species as part of
this system ([30,31]).
“There is a river out of Eden, and it flows through time, not space. It is
a river of DNA - a river of information, not a river of bones and tissues:
a river of abstract instructions for building bodies, not a river of solid
bodies themselves. The information passes through bodies and affects
them, but it is not affected by them on its way through.” ([31])
Dawkins’s definitions of an evolution based on information transfer and of replicators (self-reproducing systems) as the unit of evolution have become very
attractive for computer scientists and the Artificial Life research direction, since
it seems to open up a path towards synthesizing life (or life-like qualities) without the need and burden of rebuilding a body in all the phenomenological complexity that natural ones have. In Dawkins’s philosophy the body is merely an expression of
selfish genes in order to produce more selfish genes. In order to explain the evo-
lution of human culture Dawkins introduced the concept of memes, representing
tractable by game theory than the former. It is an interesting point that a mathematical framework has turned out to be more appropriate for describing the complex process of evolution than for describing the behavior of those creatures who invented the framework.
In articles like [39] and [67], which model the social behavior of humans on the basis of game-theoretical approaches, it is mentioned that ‘real persons’ in real life do not act only on the basis of rationality and that the game-theoretical assumptions only apply in simple situations with few alternatives of choice. [67] mentions “feelings of solidarity or selflessness” or “pressure of society” which can underlie human behavior. But nevertheless the game-theoretical models are
used to explain cooperation and developments in human societies on the abstract
level of rational choice. Axelrod himself seemed to be aware of the limitations of
the explanatory power of game-theory in modelling human behavior. In [1] he
dedicated a whole chapter to the ‘social structure of cooperation’. He identified
four factors in social structure: labels, reputation, regulation and territoriality.
Thus, while still on the basis of rational choice, Axelrod nevertheless includes the ‘human factor’ in the game, taking into account human individual and social
characteristics. He goes a step further in his subsequent book The Complexity
of Cooperation ([2]).
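For readers unfamiliar with these models, the following is a minimal iterated prisoner's dilemma in the spirit of Axelrod [1]. The payoff values and the tit-for-tat strategy are the textbook ones, not taken from this chapter.

```python
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(opponent_history):
    return opponent_history[-1] if opponent_history else "C"

def always_defect(opponent_history):
    return "D"

def play(strat_a, strat_b, rounds=10):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strat_a(hist_b), strat_b(hist_a)   # each sees the other's past
        pa, pb = PAYOFF[(a, b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))     # (30, 30): stable mutual cooperation
print(play(tit_for_tat, always_defect))   # (9, 14): defection exploits only once
```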
Francis Heylighen [45] doubts that reciprocal altruism can sufficiently ac-
count for cooperative behavior in large groups of individuals. In [46] he introduces
another model for the evolution of cooperation especially in human society. On
the basis of memes, which we described earlier, he discusses how selfishness at the
cultural level can lead to cooperation at the lower level of the individuals. In [47]
the idea of memetic evolution is discussed in the framework of metasystem tran-
sitions, namely the evolutionary integration and control of individual systems by
shared controls. The following social metasystem transitions are identified: uni-
cellular to multicellular organisms, solitary to social insects, and human sociality.
Social insects are a good example of well-integrated societies with genetically
determined shared controls. In the case of human societies, Heylighen discusses
mutual monitoring (in small, primary groups with close face-to-face contacts),
internalized restraint, legal control and market mechanisms as memetic control
structures which lead to cooperative behavior beyond the competitive level of
the individual. This has led to ambivalent sociality and weakly integrated social
metasystems.
This section was meant to give an overview of theories about the genetic and
memetic evolution of social systems. We wanted to discuss the terms selfishness,
memes, and control structures. We come back to these terms in section 4.4 where
we discuss them in the broader context of social organization and control.
a single bee from a hive no search behavior is induced. The situation is quite
different in individualized societies, to which primate societies belong. Here
individual recognition gives rise to complex kinds of social interaction and the
development of various forms of social relationships. On the behavioral level
social bonding, attachment, alliances, dynamic (not genetically determined) hi-
erarchies, social learning, etc. are visible signs of individualized societies. The
evolution of language, spreading of traditions and the evolution of culture are
further developments of individualized societies.
Fig. 9 illustrates our conception of social systems, based on concepts which we described in the previous sections. As a starting point we consider the individual, ‘selfish’ agent. The individual itself is integrated insofar as it consists of numerous components, subsystems (cells, organs) whose survival is dependent
on the survival of the system at the higher level. If the individual dies all its
subsystems will die, too. In the case of eusocial agents (e.g. social insects and
naked mole-rats) a genetically determined control structure of a ‘superorganism’
has emerged, a socially well-integrated system. The individual itself plays no
crucial role; social interactions are anonymous.
Many mammal species with long-lasting social relationships show an alterna-
tive path towards socially integrated systems. Primary groups, which typically
consist of family members and close friends, emerged with close and often long-
lasting individual relationships. We define primary groups as a network of ‘con-
specifics’ whom the individual agent uses as a testbed and as a point of reference
for his social behavior. Members of this group need not necessarily be genetically
related to the agent. Social bonding is guaranteed by complex mechanisms of
individual recognition, emotional and sexual bonding. This level is the substrate
for the development of social intelligence (cf. section 4.3) where individuals build
up shared social interaction structures, which serve as control structures of the
system at this level. Even if these bonding mechanisms are based on genetic
predispositions, social relationships develop over time and are not static. The
role of the individual agent as a life-long learning individual and social learning
system becomes most obvious in human societies. In life-long learning systems
the individual viewpoint and the complexity of coping with the non-social and
social environments furthermore reinforce the development of ‘individuality’.
We proposed in a previous section (2.2) to use the term ‘autobiographic agent’
to account for the aspect of re-interpreting remembered and experienced situa-
tions in reference to the agent’s embodied ‘history’.
Secondary and tertiary level groups emerge by additional, memetic control
structures. In contrast to Heylighen [47], we distinguish between simple market
mechanisms in secondary groups (trade and direct exchange of goods between
individuals) and complex market mechanisms in tertiary groups. The level of mu-
tual monitoring and (simple) market mechanisms is necessary in larger groups
of agents with division of labour and cooperation for the sake of survival of
the economic agents. This happens still by means of face-to-face interaction and
communication (the upper limit of the group size could probably be estimated for humans as 150, which is, according to [33], the cognitive limit on the number of individuals with whom one person can maintain stable relationships, as a function of brain size).

[Fig. 9. Conception of social systems: the individual agent with ‘selfish’ survival interests develops, via social bonding, into a ‘social’, autobiographic agent embedded in primary and tertiary groups, forming a socially integrated system; eusocial agents, by contrast, form eusocial (anonymous) societies, likewise a socially integrated system.]

Control structures in secondary groups are still based on
the needs of the individual agent. We distinguish this level from tertiary groups
where external references (legal control, religion, etc.) provide the control mech-
anisms. Complex market mechanisms which can be found in human societies,
also play a role on this level. Here, the group size is potentially unlimited, espe-
cially if effective means of communication and rules for social interaction exist
(by means of language humans can handle large group sizes by categorization of
individuals into types and instructing others to obey certain rules of behavior
towards these types, see [33]).
An important point to mention here is that secondary and tertiary control structures do not simply enslave or subsume the lower levels in the way the organism as a system ‘enslaves’ its components (organs, body parts). The individual, which as a social being is embedded in primary groups, does not depend absolutely for its survival on the survival of a specific system at a higher level.
Of course, changes in political, religious or economic conditions can dramati-
cally change the lives of the primary groups. But the dependency is weaker and
more indirect than in the case of social insects or the organ-body relationships.
This independence of the individual and the primary group from higher levels
can be an advantage in cases of dramatic changes. (Disadvantages of such less
integrated systems, e.g. part-whole competitions, are discussed in [47].)
A central point is that secondary and tertiary levels have mutual exchanges
with the level of the social, autobiographic agent. In socially integrated agents
on the primary group level, complex processes can take place when genetic and
memetic factors which are emerging at different levels of control structure mutu-
ally interact within the autobiographic agent who tries to construct and integrate
all experiences on the basis of his own embodied ‘history’. Within the mind of the
agent all the influences from the primary, secondary and tertiary groups are taken
into account for the individual decision processes, referring them to the past ex-
periences and the current state of the body. The memes which are exchanged
(either directly via personal one-to-one contact or indirectly one-to-many by
means of cultural knowledge bases like books, television, World-Wide-Web) are
integrated within the individual’s processes of constructing reality, maintain-
ing a concept of self and re-telling the autobiography. Educational systems can
assist the access to these sources of information (memes) but the knowledge
is constructed within the individual (see trends in learner-centered education
and design, [66], which stress life-long-learning and the need for engagement of
the user of educational tools). Since, as we described in the previous sections, no
two agents can have the same viewpoint and the same ‘history’ of individual and
‘memetic’ development, initial genetic variability is in this way fundamentally
enhanced on a cognitive and behavioral level.
These complex, dynamic interactions within an embodied, autobiographic,
socially integrated agent yield a unique, individual, dynamical pattern of ‘per-
sonality’ at the component level of social systems. This can account for the
In this section the project AURORA for children with autism, which addresses issues of both human and robotic social agents, is introduced.
The main characteristics of autism are: 1) qualitatively impaired social re-
lationships, 2) impairment of communication skills and fantasy, 3) significantly
reduced repertoire of activities and interests (stereotypical behavior, fixation to
stable environments).
A variety of explanations of autism have been discussed, among them the widely discussed ‘theory of mind’ model, which conceives of autism as a cognitive
6 Conclusion
1. a : the thing one intends to convey especially by language, b : the thing that
is conveyed especially by language
2. something meant or intended
3. significant quality; especially : implication of a hidden or special significance
4. a : the logical connotation of a word or phrase, b : the logical denotation or
extension of a word or phrase
Acknowledgements
My special thanks to Aude Billard, Chrystopher Nehaniv and Simone Strippgen
for discussions and collaborative work on issues which are discussed in this paper.
The thoughts presented in this paper are nevertheless the author’s own.
References
1. Robert Axelrod. The Evolution of Cooperation. Basic Books, Inc., Publishers,
1984. 126, 127
2. Robert Axelrod. The Complexity of Cooperation: Agent-based Model of Competi-
tion and Cooperation. Princeton University Press, 1997. 127
3. S. Baron-Cohen, A. M. Leslie, and U. Frith. Does the autistic child have a “theory
of mind”? Cognition, 21:37–46, 1985. 134
4. F. C. Bartlett. Remembering – A Study in Experimental and Social Psychology.
Cambridge University Press, 1932. 105, 106
5. Joseph Bates. The nature of characters in interactive worlds and the Oz project.
In: Virtual Realities: Anthology of Industry and Culture, Carl Eugene Loeffler, ed.,
1993. 128
6. R. Beckers, O. E. Holland, and J. L. Deneubourg. From local actions to global
tasks: stigmergy and collective robotics. In R. A. Brooks and P. Maes, editors,
Artificial Life IV, Proc. of the Fourth International Workshop on the Synthesis
and Simulation of Living Systems, pages 181–189, 1994. 107
7. Tony Belpaeme. Tracking objects using active vision. Thesis, second licentiate in
applied computer science (shortened programme), academic year 1995-1996, Vrije
Universiteit Brussel, Belgium, 1996. 116
8. Aude Billard. Allo kazam, do you follow me? or learning to speak through imitation
for social robots. MSc thesis, DAI Technical Paper no. 43, Dept. of AI, University
of Edinburgh, 1996. 109
26. Kerstin Dautenhahn. The role of interactive conceptions of intelligence and life in
cognitive technology. In Jonathon P. Marsh, Chrystopher L. Nehaniv, and Barbara
Gorayska, editors, Proceedings of the Second International Conference on Cognitive
Technology, pages 33–43. IEEE Computer Society Press, 1997. 111, 112, 129
27. Kerstin Dautenhahn. The art of designing socially intelligent agents: science, fiction
and the human in the loop. Applied Artificial Intelligence Journal, Special Issue
on Socially Intelligent Agents, 12(7-8):573–617, 1998. 103, 135
28. Kerstin Dautenhahn, Peter McOwan, and Kevin Warwick. Robot neuroscience —
a cybernetics approach. In Leslie S. Smith and Alister Hamilton, editors, Neu-
romorphic Systems: Engineering Silicon from Neurobiology, pages 113–125. World
Scientific, 1998. 102
29. Kerstin Dautenhahn and Chrystopher Nehaniv. Artificial life and natural stories.
In Proc. Third International Symposium on Artificial Life and Robotics (AROB
III’98 - January 19-21, 1998, Beppu, Japan), volume 2, pages 435–439, 1998. 106,
134, 135
30. Richard Dawkins. The Selfish Gene. Oxford University Press, 1976. 125
31. Richard Dawkins. River Out of Eden. Basic Books, 1995. 125
32. J. L. Deneubourg, S. Goss, N. Franks, A. Sendova-Franks, C. Detrain, and
L. Chrétien. The dynamics of collective sorting: robot-like ants and ant-like robots.
In J. A. Meyer and S. W. Wilson, editors, From Animals to Animats, Proc. of the
First International Conference on simulation of adaptive behavior, pages 356–363,
1991. 107
33. R. I. M. Dunbar. Coevolution of neocortical size, group size and language in
humans. Behavioral and Brain Sciences, 16:681–735, 1993. 130, 132
34. O. Etzioni. Intelligence without robots: a reply to Brooks. AI Magazine, pages
7–13, 1993. 115
35. C. Breazeal (Ferrell). A motivational system for regulating human-robot interac-
tion. in Proceedings of AAAI98, Madison, WI, 1998. 123
36. Stan Franklin and Art Graesser. Is it an agent, or just a program?: A taxonomy for
autonomous agents. In Proceedings of the Third International Workshop on Agent
Theories, Architectures, and Languages, published as Intelligent Agents III, pages
21–35. Springer-Verlag, 1997. 103
37. Liane Gabora. The origin and evolution of culture and creativity. Journal of
Memetics, 1(1):29–57, 1997. 133
38. P. Gaussier, S. Moga, J. P. Banquet, and M. Quoy. From perception-action loops
to imitation processes: A bottom-up approach of learning by imitation. Applied
Artificial Intelligence Journal, Special Issue on Socially Intelligent Agents, 12(7-
8):701–729, 1998. 123
39. Natalie S. Glance and Bernardo A. Huberman. Das Schmarotzer-Dilemma. Spek-
trum der Wissenschaft, 5:36–41, 1994. 126, 127
40. Deborah M. Gordon. The organization of work in social insect colonies. Nature,
380:121–124, 1996. 107
41. I. Harvey, P. Husbands, and D. Cliff. Issues in evolutionary robotics. In J. A.
Meyer, H. Roitblat, and S. Wilson, editors, From Animals to Animats, Proc. of
the Second International Conference on Simulation of Adaptive Behavior, 1992.
104
42. Barbara Hayes-Roth, Robert van Gent, and Daniel Huber. Acting in character.
In Proc. AAAI Workshop on AI and Entertainment, Portland, OR, August 1996,
1996. 128
43. Horst Hendriks-Jansen. Catching Ourselves in the Act: Situated Activity, Interac-
tive Emergence, Evolution, and Human Thought. MIT Press, Cambridge, Mass.,
1996. 106, 117
44. Horst Hendriks-Jansen. The epistemology of autism: making a case for an embod-
ied, dynamic, and historical explanation. Cybernetics and Systems, 25(8):359–415,
1997. 117, 134
45. Francis Heylighen. Evolution, selfishness and cooperation. Journal of Ideas,
2(4):70–76, 1992. 126, 127
46. Francis Heylighen. ‘selfish’ memes and the evolution of cooperation. Journal of
Ideas, 2(4):77–84, 1992. 127
47. Francis Heylighen and Donald T. Campbell. Selection of organization at the social
level: obstacles and facilitators of metasystem transitions. World Futures, 45:181–
212, 1995. 127, 130, 132
48. Ian Kelly and David Keating. Flocking by the fusion of sonar and active infrared
sensors on physical autonomous mobile robots. In The Third Int. Conf. on Mecha-
tronics and Machine Vision in Practice. 1996, Guimaraes, Portugal, Volume 1,
pages 1–4, 1996. 108
49. Volker Klingspor, John Demiris, and Michael Kaiser. Human-robot-communication
and machine learning. Applied Artificial Intelligence Journal, 11:719–746, 1997.
124
50. C. R. Kube and H. Z. Zhang. Collective robotics: from social insects to robots.
Adaptive Behavior, 2(2):189–218, 1994. 107
51. Nicholas Kushmerick. Software agents and their bodies. Minds and Machines,
7(2):227–247, 1997. 115
52. Douglas B. Lenat and R. V. Guha. Building Large Knowledge-Based Systems.
Representation and Inference in the Cyc Project. Addison-Wesley Publishing Com-
pany, 1990. 104
53. Robert Levinson. General game-playing and reinforcement learning. Computa-
tional Intelligence, 12(1):155–176, 1996. 129
54. Henrik Hautop Lund, John Hallam, and Wei-Po Lee. Evolving robot morphology.
In Proceedings of IEEE 4th International Conference on Evolutionary Computa-
tion. IEEE Press, 1997. 112
55. P. Marchal, C. Piguet, D. Mange, A. Stauffer, and S. Durand. Embryological
development on silicon. In R. A. Brooks and P. Maes, editors, Artificial Life IV,
Proc. of the Fourth International Workshop on the Synthesis and Simulation of
Living Systems, pages 365–370, 1994. 104
56. M. J. Mataric. Learning to behave socially. In J-A. Meyer D. Cliff, P. Husbands and
S. Wilson, editors, From Animals to Animats 3, Proc. of the Third International
Conference on Simulation of Adaptive Behavior, SAB-94, pages 453–462, 1994.
109
57. Maja J. Mataric. Issues and approaches in design of collective autonomous agents.
Robotics and Autonomous Systems, 16:321–331, 1995. 107, 108, 109
58. John Maynard Smith. Evolution and the Theory of Games. Cambridge University
Press, 1982. 126, 127
59. D. McFarland and T. Bosser. Intelligent Behavior in Animals and Robots. MIT
Press, 1993. 108
60. David McFarland. Towards robot cooperation. In D. Cliff, P. Husbands, J.-A.
Meyer, and S. W. Wilson, editors, From Animals to Animats 3, Proc. of the
Third International Conference on Simulation of Adaptive Behavior, pages 440–
444. IEEE Computer Society Press, 1994. 108
79. Aaron Sloman. What sort of control system is able to have a personality. In Robert
Trappl, editor, Proc. Workshop on Designing Personalities for Synthetic Actors,
Vienna, June 1995, 1995. 128
80. L. Steels. The artificial life roots of artificial intelligence. Artificial Life, 1(1):89–
125, 1994. 105
81. L. Steels. A case study in the behavior-oriented design of autonomous agents. In
D. Cliff, P. Husbands, J.-A. Meyer, and S.W. Wilson, editors, From Animals to
Animats 3, Proceedings of the Third International Conference on Simulation of
Adaptive Behavior, pages 445–452, Cambridge, MA, 1994. MIT Press/Bradford
Books. 108
82. Luc Steels. Building agents out of autonomous behavior systems. In L. Steels
and R. A. Brooks, editors, The “Artificial Life” Route to “Artificial Intelligence”:
Building Situated Embodied Agents. Lawrence Erlbaum, 1994. 113
83. Luc Steels, Peter Stuer, and Dany Vereertbrugghen. Issues in the physical reali-
sation of autonomous robotic agents. Manuscript, AI Memo, VUB Brussels, 1996.
108
84. Simone Strippgen. Insight: ein virtuelles Labor fuer Entwurf, Test und Analyse von
behaviour-basierten Agenten. Doctoral Dissertation, Department of Linguistics
and Literature, University of Bielefeld, 1996. 112
85. Simone Strippgen. Insight: A virtual laboratory for looking into behavior-based
autonomous agents. In W. L. Johnson, editor, Proceedings of the First International
Conference on Autonomous Agents. Marina del Rey, CA USA, February 5-8, 1997,
pages 474–475. ACM Press, 1997. 112
86. G. Theraulaz, S. Goss, J. Gervet, and L. J. Deneubourg. Task differentiation
in polistes wasp colonies: a model for self-organizing groups of robots. In J. A.
Meyer and S. W. Wilson, editors, From Animals to Animats, Proc. of the First
International Conference on simulation of adaptive behavior, pages 346–355, 1991.
107
87. John K. Tsotsos. Behaviorist intelligence and the scaling problem. Artificial In-
telligence, 75:135–160, 1995. 105
88. Sherry Turkle. Life on the Screen, Identity in the Age of the Internet. Simon and
Schuster, 1995. 104
89. Eckart Voland. Grundriss der Soziobiologie. Gustav Fischer Verlag, Stuttgart,
Jena, 1993. 125
90. J. von Neumann and O. Morgenstern. Theory of Games and Economic Behaviour.
Princeton University Press, 1953. 126
91. D. M. Wilkes, A. Alford, R. T. Pack, T. Rogers, R. A. Peters II, and K. Kawa-
mura. Toward socially intelligent service robots. To appear in Applied Artificial
Intelligence Journal, vol. 1, no. 7, 1998. 123
92. D. M. Wilkes, R. T. Pack, A. Alford, and K. Kawamura. Hudl, a design philosophy
for socially intelligent service robots. In Socially Intelligent Agents, pages 140–145.
AAAI Press, Technical report FS-97-02, 1997. 128
93. Edward O. Wilson. Sociobiology. The Belknap Press of Harvard University Press,
Cambridge, Massachusetts and London, England, 1980. 125
94. Franz M. Wuketits. Die Entdeckung des Verhaltens. Wissenschaftliche Buchge-
sellschaft, Darmstadt, 1995. 125
95. Robert S. Wyer. Knowledge and Memory: The Real Story. Lawrence Erlbaum
Associates, Hillsdale, New Jersey, 1995. 106
An Implemented System for Metaphor-Based
Reasoning, with Special Application
to Reasoning about Agents
John A. Barnden
ATT-Meta is merely a reasoning system, and does not itself deal with natural
language input directly. Rather, a user supplies hand-coded logic formulae that
are intended to couch the literal meaning of small discourse chunks (two or three
sentences). This will become clearer later in the paper.
The special case of mental states has particular relevance to the current
workshop, because of the workshop’s interest in the subject of intelligent agents
and societies of agents. There are many points of contact with this subject:
ATT-Meta research project has refrained from this step, which is after all only
terminological, and only explicitly countenances literal meanings for metaphor-
ical utterances. (The literal meaning of the above utterance is the ridiculous
claim that John literally had a part that literally insisted that Sally was right.)
However, the project presents no objection to the step. Thus, we can say that
ATT-Meta is “semantically agnostic” as regards metaphor. (The approach is
akin to but less extreme than that of Davidson 1979, which can be regarded as
semantically “atheist.”)
ATT-Meta’s approach is one of literal pretence. A literal-meaning represen-
tation for the metaphorical input utterance is constructed. The system then pre-
tends that this representation, however ridiculous, is true. Within the context of
this pretence, the system can do any reasoning that arises from its knowledge
of the vehicles of the metaphors involved. In our example, it can use knowledge
about interaction within groups of people, and knowledge about communicative
acts such as insistence. As a result of this knowledge, the system can infer that
the explicitly mentioned part of John believed (as well as insisted) that Sally was
right, and some other, unmentioned, part of John believed (as well as stated)
that Sally was not right. Suppose now that, as part of the system’s knowledge
of the MIND PARTS AS PERSONS metaphor, there is the knowledge that if
a “part” of someone believes something P, then the person has reasons to be-
lieve P. The system can now infer both that John had reasons to believe that
Sally was right and that John had reasons to believe that Sally was not right.
The key point here is that the reasoning from the literal meaning of
the utterance, conducted within the pretence, links up with the just-mentioned
knowledge. That knowledge is itself of a very fundamental, general nature, and
does not, for instance, rely on the notion of insistence or any other sort of
communicative act. Any line of within-pretence inference that linked up with
that knowledge could lead to conclusions that John had reasons to believe certain
things. This is the way in which ATT-Meta can deal with novel manifestations
of metaphors. There is no need at all for it to have any knowledge of how
insistence by a “part” of a person maps to some non-metaphorically describable
feature of the person. Equally, an utterance that described a part as doing things
from which it can be inferred that the part insisted that Sally was right would
also lead to the same inferences as our example utterance (unless it also led
to contrary inferences by some route).
In sum, the ATT-Meta research has taken the line that it is a mistake to focus
on the notion of the underlying meaning of a metaphorical utterance, and has
concentrated instead on the literal meaning and the inferences that can be drawn
from it. This approach is the key to being able to deal flexibly with metaphorical
utterances.
that it itself (the system) is pretending that L holds. Also, the system has the
fact, outside the cocoon, that it is pretending that PJ is a person.
As usual, the system has a goal, such as the hypothesis that John believes that
Sally is right (recall the example in the second section of this paper). Assume the
system has a rule that if someone X has reasons to believe P then, presumably,
X believes P. (This is a default rule, so its conclusion can be defeated.) Thus,
one subgoal that arises is that John had reasons to believe that Sally was right.
Now, in an earlier section we referred to the system’s knowledge about the
MIND PARTS AS PERSONS metaphor. The mentioned knowledge is couched
in the following rule:
IF I (the system) am pretending that part Y of agent X is a person AND I
am pretending that Y believes Q THEN (presumably) X has reasons to believe
Q.
Of course, this is a paraphrase of an imagined, formally expressed rule. We call
this a conversion rule, as it maps between pretence and reality. Because of the
subgoal that John had reasons to believe that Sally was right, the conversion
leads to the setting up of the subgoal that the system is pretending that PJ (the
mentioned part of John) believes that Sally is right. This subgoal is itself outside
the cocoon, but it automatically leads to the subgoal that PJ believes that
Sally is right, within the cocoon. This subgoal can then be inferred (as a default)
from the hypothesis that PJ stated that Sally was right, which itself can be
inferred (as a default) from the existing within-cocoon fact that PJ insisted that
Sally was right. Notice carefully that these last two steps are entirely within the
cocoon and merely use commonsense knowledge about real-life communication.
As well as the original goal (John believed that Sally was right) the system
also looks at the negation of this, and hence indirectly at the hypothesis that
John has reasons to believe that Sally was not right. This subgoal gets support
in a rather similar way to the above process, but it involves richer reasoning
within the cocoon.
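The mechanics just described can be made concrete with a small sketch. The following Python fragment is purely illustrative and is not ATT-Meta's actual implementation (ATT-Meta reasons backwards from goals and handles uncertainty; this toy chains forwards and treats everything as certain); the Cocoon class and all fact names are invented for the example.

    class Cocoon:
        """A pretence context holding propositions taken to be true inside it."""
        def __init__(self, label):
            self.label = label
            self.facts = set()

        def add(self, fact):
            self.facts.add(fact)

        def holds(self, fact):
            return fact in self.facts

    # Within-pretence facts from the literal meaning of the utterance.
    pretence = Cocoon("MIND-PARTS-AS-PERSONS pretence for John")
    pretence.add(("is-person", "PJ"))
    pretence.add(("insisted", "PJ", "Sally-was-right"))

    # Commonsense defaults applied entirely *inside* the cocoon:
    # insisting P -> stating P -> believing P (all defeasible in ATT-Meta).
    if pretence.holds(("insisted", "PJ", "Sally-was-right")):
        pretence.add(("stated", "PJ", "Sally-was-right"))
    if pretence.holds(("stated", "PJ", "Sally-was-right")):
        pretence.add(("believes", "PJ", "Sally-was-right"))

    # Conversion rule, mapping from pretence to reality as a default:
    # IF pretending part Y of X is a person AND pretending Y believes Q
    # THEN (presumably) X has reasons to believe Q.
    reality = set()
    if (pretence.holds(("is-person", "PJ"))
            and pretence.holds(("believes", "PJ", "Sally-was-right"))):
        reality.add(("has-reasons-to-believe", "John", "Sally-was-right"))

    print(reality)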
the peculiarities of the case at hand. Some authors (e.g., Lakoff 1994) assume that
in cases of conflict tenor information should override metaphor-based inferences,
but it appears that such assumptions are based on inadequate realization of the
fact that tenor information can itself be uncertain.
Finally, the reasoning within the cocoon is itself usually uncertain, since
commonsense knowledge rules are usually uncertain.
The ATT-Meta system has facilities for reasoning non-metaphorically about the
beliefs and reasoning acts of agents, including cases where those beliefs and
acts are themselves about the beliefs and reasoning of further agents, and so
forth. Although ATT-Meta can reason about beliefs in an ordinary rule-based
way, its main tool is simulative reasoning (e.g., Creary 1979, Konolige 1986 [but
called “attachment” there], Haas 1986, Ballim & Wilks 1991, Dinsmore 1991,
Hwang & Schubert 1993, Chalupsky 1993 and 1996, Attardi & Simi 1994; see
also related work in philosophy and psychology in Carruthers & Smith 1996,
Davies & Stone 1995). In attempting to show that agent X believes P from the
fact that X believes Q, the system puts P as a goal and Q as a fact in a simulation
cocoon for X, a special environment meant to reflect X’s own
reasoning processes. Reasoning from Q to P in the cocoon is alleged (by default)
to be reasoning by X. The reasoning within the cocoon can involve ordinary rule-
based reasoning and/or simulation of other agents. In particular, the reasoning
can be uncertain. Also, the result of the simulation of X is itself uncertain: even
if the simulation supports the hypothesis that X believes P, ordinary rule-based
reasoning may support the negation of this hypothesis more strongly.
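As a rough illustration of simulative reasoning (again a sketch under simplifying assumptions, not the ATT-Meta code: real simulation is goal-directed and its results are defeasible), the fragment below opens a "cocoon" of facts attributed to X and reasons forward, with the system's own rules standing in for X's:

    def forward_chain(facts, rules, max_steps=100):
        """Naive forward chaining over (premises, conclusion) rules."""
        facts = set(facts)
        for _ in range(max_steps):
            new = {concl for prems, concl in rules
                   if all(p in facts for p in prems) and concl not in facts}
            if not new:
                break
            facts |= new
        return facts

    def simulate_agent(believed_facts, rules, goal):
        """Test a goal inside a simulation cocoon; a positive answer only
        defeasibly supports the hypothesis that the agent believes the goal."""
        return goal in forward_chain(believed_facts, rules)

    # X believes it is raining; does X presumably believe the streets are wet?
    rules = [(("raining",), "streets-wet")]
    print(simulate_agent({"raining"}, rules, "streets-wet"))  # True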
6 Interesting Nestings
metaphorical steps nested within a cocoon for the first. That is, within the pre-
tence that the thought is a cloud there is a further pretence that the cloud is a
person.
Embedding of a metaphorical pretence cocoon within a simulation cocoon
handles a major aspect of point (e) in Section 1, namely reasoning about agents’
metaphorical reasoning. This would be needed for dealing with one interpretation
of the sentence “Mary believed that the thought hung over John like a cloud,”
viz the interpretation under which the metaphorical view of the thought as a
cloud is part of Mary’s own belief state. (But another interpretation is that the
metaphor is used only by the speaker, and not by Mary.)
Conversely, embedding of a simulation cocoon within a metaphorical pretence
cocoon handles a major aspect of point (f) in Section 1, namely reasoning about
metaphorical agents’ reasoning, as required for sentences like “My car doesn’t
want to wake up because it thinks it’s Sunday.” From the fact that the car
thinks it’s Sunday, we might want to infer that the car thinks people needn’t
wake up until some relatively late time. (That thought would then be a reason
for not wanting to wake up.) The car’s alleged reasoning would occur within a
simulation cocoon for the car, embedded within a metaphorical pretence cocoon
for the pretence that the car is a person.
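A hedged sketch of such nesting, for the car example (the data structures and fact names are invented for illustration and do not reflect ATT-Meta's representations):

    # Outer cocoon: the metaphorical pretence that the car is a person.
    pretence = {"label": "CAR-AS-PERSON",
                "facts": {("is-person", "car")},
                "children": []}

    # Inner cocoon, embedded in the pretence: a simulation of the car's
    # (alleged) reasoning, seeded with "today is Sunday".
    simulation = {"label": "simulation-of-car",
                  "facts": {("today", "Sunday")},
                  "children": []}
    pretence["children"].append(simulation)

    # Ordinary commonsense default applied within the inner cocoon:
    # on Sunday, people needn't wake up until some relatively late time.
    if ("today", "Sunday") in simulation["facts"]:
        simulation["facts"].add(("neednt-wake-early", "car"))

    # The derived fact is a belief allegedly held by the car within the
    # pretence, i.e. a reason (inside the pretence) not to want to wake up.
    print(simulation["facts"])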
7 Conclusion
References
Attardi, G. & Simi, M. (1994). Proofs in context. In J. Doyle, E. Sandewall & P. Torasso
(Eds), Principles of Knowledge Representation and Reasoning: Proceedings of the
Fourth International Conference, pp. 15–26. (Bonn, Germany, 24–27 May 1994.)
San Mateo, CA: Morgan Kaufmann.
Ballim, A. & Wilks, Y. (1991). Artificial believers: The ascription of belief. Hillsdale,
N.J.: Lawrence Erlbaum.
Barnden, J.A. (1997a). Deceived by metaphor. Behavioral and Brain Sciences, 20 (1),
pp. 105–106. Invited Commentary on A.R. Mele’s “Real Self-Deception.”
Barnden, J.A. (1997b). Consciousness and common-sense metaphors of mind. In S.
O’Nuallain, P. McKevitt & E. Mac Aogain (Eds), Two Sciences of Mind: Readings
in Cognitive Science and Consciousness, pp. 311–340. Amsterdam/Philadelphia:
John Benjamins.
Barnden, J.A. (1998). Uncertain reasoning about agents’ beliefs and reasoning. Tech-
nical Report CSRP-98-11, School of Computer Science, The University of Birming-
ham, U.K. Invited submission to a special issue of Artificial Intelligence and Law,
ed. E. Nissan.
Barnden, J.A. (in press). An AI system for metaphorical reasoning about mental states
in discourse. In Koenig, J-P. (Ed.), Conceptual Structure, Discourse, and Language
II. Stanford, CA: CSLI/Cambridge University Press.
Barnden, J.A., Helmreich, S., Iverson, E. & Stein, G.C. (1994a). An integrated imple-
mentation of simulative, uncertain and metaphorical reasoning about mental states.
In J. Doyle, E. Sandewall & P. Torasso (Eds), Principles of Knowledge Representa-
tion and Reasoning: Proceedings of the Fourth International Conference, pp. 27–38.
(Bonn, Germany, 24–27 May 1994.) San Mateo, CA: Morgan Kaufmann.
Barnden, J.A., Helmreich, S., Iverson, E. & Stein, G.C. (1994b). Combining simulative
and metaphor-based reasoning about beliefs. In Procs. 16th Annual Conference of
the Cognitive Science Society (Atlanta, Georgia, August 1994), pp. 21–26. Hillsdale,
N.J.: Lawrence Erlbaum.
Barnden, J.A., Helmreich, S., Iverson, E. & Stein, G.C. (1996). Artificial intelligence
and metaphors of mind: within-vehicle reasoning and its benefits. Metaphor and
Symbolic Activity, 11(2), pp. 101–123.
Carruthers, P. & Smith, P.K. (Eds). (1996). Theories of Theories of Mind. Cambridge,
UK: Cambridge University Press.
Chalupsky, H. (1993). Using hypothetical reasoning as a method for belief ascription.
J. Experimental and Theoretical Artificial Intelligence, 5 (2&3), pp. 119–133.
Chalupsky, H. (1996). Belief ascription by way of simulative reasoning. Ph.D. Disser-
tation, Department of Computer Science, State University of New York at Buffalo.
Creary, L. G. (1979). Propositional attitudes: Fregean representation and simulative
reasoning. Procs. 6th. Int. Joint Conf. on Artificial Intelligence (Tokyo), pp. 176–
181. Los Altos, CA: Morgan Kaufmann.
Davidson, D. (1979). What metaphors mean. In S. Sacks (Ed.), On Metaphor, pp. 29–
45. U. Chicago Press.
Davies, M & Stone, T. (Eds) (1995). Mental Simulation: Evaluations and Applications.
Oxford, U.K.: Blackwell.
Delgrande, J.P. & Schaub, T.H. (1994). A general approach to specificity in default
reasoning. In J. Doyle, E. Sandewall & P. Torasso (Eds), Principles of Knowledge
Representation and Reasoning: Proceedings of the Fourth International Conference,
pp. 146–157. (Bonn, Germany, 24–27 May 1994.) San Mateo, CA: Morgan Kauf-
mann.
Dinsmore, J. (1991). Partitioned Representations: A Study in Mental Representation,
Language Processing and Linguistic Structure. Dordrecht: Kluwer Academic Pub-
lishers.
Haas, A.R. (1986). A syntactic theory of belief and action. Artificial Intelligence, 28,
pp. 245–292.
Hobbs, J.R. (1990). Literature and Cognition. CSLI Lecture Notes, No. 21, Center for
the Study of Language and Information, Stanford University.
Hunter, A. (1994). Defeasible reasoning with structured information. In J. Doyle, E.
Sandewall & P. Torasso (Eds), Principles of Knowledge Representation and Rea-
soning: Proceedings of the Fourth International Conference, pp. 281–292. (Bonn,
Germany, 24–27 May 1994.) San Mateo, CA: Morgan Kaufmann.
Hwang, C.H. & Schubert, L.K. (1993). Episodic logic: a comprehensive, natural repre-
sentation for language understanding. Minds & Machines, 3 (4), pp. 381–419.
Konolige, K. (1986). A deduction model of belief. London: Pitman. Los Altos: Morgan
Kaufmann.
Lakoff, G. (1993). The contemporary theory of metaphor. In A. Ortony (Ed.), Metaphor
and Thought, 2nd edition, pp. 202–251. New York and Cambridge, U.K.: Cambridge
University Press.
Lakoff, G. (1994). What is metaphor? In J.A. Barnden & K.J. Holyoak (Eds.), Advances
in Connectionist and Neural Computation Theory, Vol. 3: Analogy, Metaphor and
Reminding. Norwood, N.J.: Ablex Publishing Corp.
Lakoff, G. & Turner, M. (1989). More than Cool Reason: A Field Guide to Poetic
Metaphor. Chicago: University of Chicago Press.
Loui, R.P. (1987). Defeat among arguments: a system of defeasible inference. Compu-
tational Intelligence, 3, pp. 100–106.
Loui, R.P., Norman, J., Olson, J. & Merrill, A. (1993). A design for reasoning with
policies, precedents, and rationales. In Fourth International Conference on Artifi-
cial Intelligence and Law: Proceedings of the Conference, pp. 202–211. New York:
Association for Computing Machinery.
Mele, A.R. (1997). Real self-deception. Behavioral and Brain Sciences, 20 (1).
Poole, D. (1991). The effect of knowledge on belief: conditioning, specificity and the
lottery paradox in default reasoning. Artificial Intelligence, 49 , pp. 281–307.
Reddy, M.J. (1979). The conduit metaphor—a case of frame conflict in our language
about language. In A. Ortony (Ed.), Metaphor and Thought, Cambridge, UK: Cam-
bridge University Press.
Yen, J., Neches, R. & MacGregor, R. (1991). CLASP: Integrating term subsumption
systems and production systems. IEEE Trans. on Knowledge and Data Engineer-
ing, 3 (1), pp. 25–32.
GAIA: An Experimental Pedagogical Agent for
Exploring Multimodal Interaction
Tom Fenton-Kerr
Introduction
This paper is a preliminary case study of a multimodal interface agent (GAIA:
a graphic-audio interface agent) that makes use of text-to-speech (TTS) com-
munication to assist a user with a task requiring visual point discrimination in
a geographic map with minimal graphic features. The context for this interac-
tion is a work-in-progress prototype development called the Re-mapping Europa
Mission (REM), designed to provide a setting for exploring interface agent ac-
tivity. It uses a task-driven game metaphor to teach users the locations of key
cities on a series of unlabelled maps. REM’s development was partly influenced
by the Mercator project, a study by Gerber et al. (1992) that investigated ways
of developing expertise in map reading. Oviatt (1996) found that users show a
marked preference for multimodal input (i.e. speech, keyboard and gesture) when
interacting with on-screen maps. Although input issues are discussed later in this
paper, its main intent is to deal with pedagogical, cognitive and perceptual issues
concerning interface agent output.
A pedagogical software agent is an autonomous software process, which
occupies the space between human learners and a task to be learned. The
agent’s task is likely to involve offering some kind of proactive, intelligent as-
sistance (Rich, 1996) to aid task completion. Agent software programs are cur-
rently used in a diverse range of pedagogical settings. They occupy roles such
as sophisticated tutor assistants (Frasson et al., 1997; Johnson and Shaw, 1997;
Conati et al., 1997; Schank and Cleary, 1994) offering knowledge-based advice, or
as interface agents acting in a knowledge-free capacity, guiding the user towards
a pre-specified learning goal.
Learning programs that employ pedagogical agents differ from the more ubiq-
uitous information-rich/interactively poor programs in that they can offer re-
active and sometimes corrective responses to user input. ‘Dustin’, a language-
learning simulator developed at Northwestern University’s Institute for the
Learning Sciences, gives users access to an online Tutor that can log a user’s
input and provide suitable responses to keep the student on track. Schank and
Cleary (1994) also propose the use of Searching Agents that can enhance a user’s
understanding of a given topic by locating related information consisting of fur-
ther examples or explaining the general principles involved.
An agent’s effectiveness in providing useful help to a learner is likely to be
determined by factors such as the learning context itself, the chosen mode of com-
munication, and the appropriateness of the interactions that occur. Where the
learning context is a visual sequencing task, for example, an appropriate agent
mode might be an animated graphic that can use gesture and text to guide a
user through a specified sequence of actions. Other contexts make the use of
alternate modes such as audio or multimodal interfaces more appropriate. A pi-
ano tutor program, for example, may need to display a score and simultaneously
play a sequence of notes. In choosing the most suitable means of representing
an agent in a pedagogical setting, an instructional designer needs to look closely
at the learning objectives to be met, the scope of help being offered, and the
best means of communicating that help. Factors such as the teaching paradigm
employed (e.g. problem-based learning) and a learner’s prerequisite knowledge
may have a major influence on the final design choice.
The next section is a general discussion about agent representation in a ped-
agogical context. It is followed by an examination of the REM program in depth.
Design and implementation aspects of the interface agent’s role in assisting users
to locate key cities are then discussed. Finally, future multimodal interface de-
velopment in learning contexts such as second language acquisition programs
and any learning programs where interaction is not limited to graphic or text
modes is considered.
modes such as text, TTS or recorded speech, and gestures (Kono et al., 1998).
Agents may also be equipped to deal with user input such as speech, variable
text, simple mouse clicks or their combinations (Rudnicky, 1993; Oviatt, 1996).
These highly complex interactions can give us a sense that they are implicitly
being orchestrated by an organized intelligence of some kind. Such perceptions
in turn demand a believable agent representation if we are to accept and act
on the advice being offered in a learning setting. Bates (1994) asserts that such
believability requires the incorporation of human-like traits such as emotion and
desire. Isbister (1995) believes that a user’s perception of agent intelligence is a
factor in creating believable interface characters.
In a sense, graphic anthropomorphic agents seek to model this perceived or-
ganized intelligence by manifesting it in a human-like physical form. On one level,
this acts to make such agents acceptably believable to a user and is therefore
likely to enhance the interaction in a positive way. Unfortunately, users can also
imbue anthropomorphic agents with abilities and intelligence way beyond their
true capability. Hofstadter (1995) calls this the ‘Eliza effect’2 , defining it as ‘the
susceptibility of people to read far more understanding than is warranted into
strings of symbols - especially words - strung together by computers’. Although
Hofstadter is emphasizing the text mode here, the ‘Eliza effect’ can be seen in
almost all modes of human/computer interaction. King (1995) comments that
users perceive anthropomorphic representations of agents as having ‘intrinsic
qualities and abilities which the software controlling the agent cannot possibly
achieve.’ In a pedagogical setting this effect can have a negative impact on the
learning experience. Susceptible users are likely to have unrealistic expectations
of an agent’s potential to help them in a useful way. When the expected reve-
lations are not forthcoming, a user may ignore or trivialize any future help or
suggestions given out, even if it is obviously or logically in their interests to act
on such advice.
Graphic agents inevitably call attention to themselves when represented on a
screen. Where their task is to take the role of a magister, instructing a user about
physical aspects of a particular graphic through gesture, for example, a graphic
mode may be the most suitable. If the task is to interrupt some action that may
cause damage, such as inadvertently trashing files, then an agent expressing an
alert in graphic form is probably the best way of getting a user’s attention. One
situation where a graphic representation mode might not be the best choice is
where a user needs to give his or her attention to a task requiring visual discrim-
ination while receiving instructions or assistance from an agent. In a converse
sense, vision-impaired users might rely on an audio agent as a primary source of
both information and interactive feedback. Currently, driver navigation systems
frequently make use of audio agents to provide directions and warnings, freeing
the driver from the need to visually reference a map. Nagao and Rekimoto (1996)
have integrated an audio interface into their location-aware WalkNavi naviga-
tion/guidance system that can integrate linguistic and non-linguistic contexts in
real world situations. Their self-defined ‘augmented reality’ recognizes natural
2 after J. Weizenbaum’s ELIZA program written in the 1960s.
language input and responds using various modes such as graphic maps, text
and TTS instructions or explanations.
The key element in all modes of agent communication seems to be consis-
tency of representation. Users need to know that any advice is coming from the
same reliable source. If graphic agents make frequent changes to their on-screen
physical form, a user can soon get confused about just ‘who’ is offering them
help. Conversely, audio interface agents that make use of a fairly consistent and
characteristic voice can provide certainty to the user about the source and re-
liability of their communication. Audio agents can, of course, vary parameters
such as volume, tempo, prosody and spatial displacement to modify or emphasize
speech. Such modifications are appropriate where there is an obvious need for
the expression of emotion (Picard, 1995) or intention, or simply to create believ-
able rapport-building conversation. (See ‘Future Developments’ for a discussion
of enhancements to audio-based interfaces.)
Where an interface agent makes use of multiple modes of communication,
believability seems to be retained where at least one mode maintains a consis-
tent form. GAIA, the agent from the REM program described in the following
section, makes use of a multimodal (graphic/audio) approach, but uses a consis-
tent characteristic voice for communicating useful feedback, whether the agent’s
graphic form is visible or not.
REM acts as a test-bed for implementing an interface agent and exploring its
interaction with a user. A game metaphor was chosen to provide a setting that
would (hopefully) be very engaging, but general enough to be easily mapped
onto other learning situations.
REM’s genesis is a synthesis of two concepts: The first is the type of interface
interaction that occurs in a computer flight simulation where the task is to land
a fighter on the deck of an aircraft carrier3. Apart from flying the plane, a pilot
can seek the help of an audio interface agent (a ‘Landing Systems Officer’ - LSO)
while attempting a landing. As the simulation progresses in real time, the LSO
gives audio instructions on whether the pilot is too low or high, too fast or slow,
and offers reminders about dropping the undercarriage and hook. As the pilot
is probably already suffering from cognitive overload just flying the plane, such
advice needs to be given in a mode that can be taken in and acted upon without
adding to the visual ‘clutter’ in any way. An audio interface seems to be the best
solution for instantaneous instructional delivery in this case.
The second concept relates to a toolkit for exploring agent designs, imple-
mented by Sloman and Poli (1995). The SIM-AGENT toolkit is ‘intended to
3 an example is Graphic Simulation’s F/A-18 Hornet
support exploration of design options for one or more agents interacting in dis-
crete time’4 . Sloman and Poli used the toolkit to conduct a number of simulation
experiments, some of which simulate cooperative behaviour between two agents
- the ‘Blind/Lazy Scenario’. In this scheme ‘there is a flat 2-D world inhab-
ited by two agents, a blind agent and a lazy one. The blind agent can move in
the world, can send messages and can receive messages sent by the lazy agent,
but cannot perceive where the other agent is. The lazy agent can perceive the
(roughly quantized) relative position of the blind agent and can send messages,
but cannot move’5 . The stated objective of the experiment is to see whether
rules can be evolved allowing for cooperative behaviour and resulting in the two
‘robots’ getting together. In a very general sense, the task maps loosely onto the
aircraft landing task described above, and the map-point approximation task
that drives REM.
REM’s design is an attempt to use elements of these concepts in a pedagogical
setting. Its interface design is predicated on the idea that the primary means of
instruction or help be available in a single (audio) modality. The user takes
the part of the ‘blind’ agent described above, receiving audio instructions (or
stereophonic audio tones - described below) from the ‘lazy’ agent, played by
GAIA. It should be noted that GAIA represents a simulation of an artificially
intelligent (AI) interface agent. REM’s intent is not to develop new approaches
in AI architectures, but rather to provide a setting where agents using simulated
AI techniques can be implemented to explore instructional delivery issues in
completion of a learning task.
REM’s Architecture
REM has existed in three different forms since its inception. An early prototype
built in HyperCard on the Mac OS using Plaintalk 1.5 TTS was ported to an
NT 4.0 system and re-programmed for cross-platform use in Macromedia Direc-
tor’s Lingo language. The current web-based version uses elements of Microsoft’s
Agent software, executed in JavaScript and VBScript, to drive the agent’s inter-
action, including text and TTS output. Geographic information, currently sup-
plied by a simulated ‘database agent’ embedded within the HTML page script,
will come from a true relational database, accessed by GAIA as required, in the
next version of REM.
User input in REM consists of basic navigation, filling in forms (for per-
sonalized feedback from GAIA), and mouse clicks on the map window. A fairly
straightforward algorithm captures mouse location information, determines
country and proximity-to-target, then builds GAIA’s spoken response as a con-
catenated TTS string. Graphic events, such as moving the cartographic analysis
tool to the click location, and map upgrades are handled in a similar fashion.
A future version will require coordination of multimedia output such as video,
enhanced audio and prosodic TTS.
4 ibid p. 392
5 ibid p. 401
Fig. 1. REM’s map page after successful location of Paris with Cartographic
Analysis Tool (CAT) visible, and Geo-agent GAIA in a separate window.
GAIA’s interaction with a user varies according to the current task. TTS com-
munication was determined to be a promising mode for communicating reactive
feedback to user response, which mainly consists of mouse clicks on a map. Once
a target city has been chosen, GAIA’s task is to provide immediate, personalized
feedback. By consulting a ‘database agent’, GAIA can advise a user of the cor-
rect country name for a target city, then provide appropriate advice on whether
a user has clicked inside the country borders, in addition to guiding a user to the
correct city location. Task success elicits congratulatory remarks and a prompt
to locate further cities on the map. Map details including borders and minor
towns in the surrounding terrain are then added. When all listed cities within a
country have been located, a border outline flashes briefly to signify completion.
Secondary confirmation is provided textually by the CAT and in spoken form
by GAIA.
As users can click anywhere on the map, GAIA needs to be able to contextu-
alize responses accordingly. Where the target city is Paris, for example, a click
to the east of Spain would probably elicit the following: ‘That’s in the Mediter-
ranean and it’s too far right, too low’. GAIA’s response to a click near Calais
might be ‘Yes, that’s France, but that’s a little too high, a little too far left’.
GAIA distinguishes between an absolute ‘too far left/right’ and fuzzy descriptors
such as ‘a little too high/low’, depending on user input. Clicking within a target
city’s locus circle elicits randomized responses such as ‘You’re really warm now!’
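The response-building algorithm described above can be sketched as follows. This is not the REM/Lingo or JavaScript code; the thresholds, coordinate conventions, and helper names are assumptions made for illustration.

    import math, random

    def build_response(click, target, region_name, correct_country,
                       locus_radius=10, fuzzy_cutoff=60):
        """Build GAIA's concatenated TTS string from a map click."""
        dx = click[0] - target[0]   # positive: click too far right
        dy = click[1] - target[1]   # positive: click too low (screen coords)
        dist = math.hypot(dx, dy)
        if dist <= locus_radius:    # inside the target city's locus circle
            return random.choice(["You're really warm now!",
                                  "Very close indeed!"])
        parts = []
        if region_name == correct_country:
            parts.append(f"Yes, that's {correct_country}, but")
        else:
            parts.append(f"That's in {region_name} and")
        # Fuzzy descriptors near the target, absolute ones farther away.
        fuzzy = "a little " if dist < fuzzy_cutoff else ""
        parts.append(f"it's {fuzzy}too far {'right' if dx > 0 else 'left'},")
        parts.append(f"{fuzzy}too {'low' if dy > 0 else 'high'}")
        return " ".join(parts)

    # A click east of Spain while the target is Paris:
    print(build_response((420, 310), (300, 180), "the Mediterranean", "France"))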
Although the REM program (with a rather more complex learning objective -
currently in development) has yet to be formally evaluated for pedagogical effec-
tiveness, two alpha tests of the system were conducted at the end of 1997. The
first, with the purpose of evaluating the viability of REM running on different
platforms, was carried out by volunteer NeTTL staff. Versions of the program
were run successfully on both NT 4.0 and Mac OS systems, making use of differ-
ent shell programs and TTS engines. The results showed consistency in program
execution and graphic displays but variability in the quality of the spoken out-
put produced on each system. As GAIA represents a female assistant, her ‘audio
presence’ relies on the availability of realistic female TTS voices. At the time of
writing, the MacinTalk Pro high-end female voices (Victoria and Agnes) seem to
provide a better representation for our purposes than the female voices (Lernout
and Hauspie’s TTS) used in the NT 4.0 OS version.
The second alpha test was a ‘dry run’ of two experiments designed to evaluate
the effectiveness of different modes of audio feedback to a user, using volunteer
testers as subjects. In the first experiment, feedback was provided through head-
phones in the form of a variable audio tone coupled with left/right stereophonic
input. Testers were asked to locate target cities by moving a mouse over an un-
labelled map in response to a rising or falling tone that could also pan from one
ear to the other. Successful trials were indicated by location of the point that
produced the highest tone that was simultaneously perceived as ‘most central’,
(i.e. ‘localization in the vertical median plane’ - Hendrix, 1994:12) coinciding
with the target. No form of spoken or textual feedback was available until a
target city had been successfully located. The testers were not temporally con-
strained in any way (which will be the case in the formal evaluation), but were asked
to find the target ‘as quickly as possible’. Results of this preliminary experiment
support the idea that where the task is a straightforward procedural one, the
‘tonal feedback’ approach is a very efficient way of quickly locating a fixed point
on an unlabelled map.
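A minimal sketch of the tonal-feedback mapping (the frequency range, pan law, and sign conventions are assumptions, not REM's actual values): pitch rises as the cursor approaches the target, and stereo panning is centred when the cursor is horizontally aligned with it, so the highest and most central tone coincides with the target.

    import math

    def tone_for_cursor(cursor, target, max_dist=500.0,
                        f_min=220.0, f_max=880.0):
        """Map cursor-to-target offset to (frequency in Hz, pan in [-1, 1])."""
        dx = cursor[0] - target[0]
        dist = min(math.hypot(dx, cursor[1] - target[1]), max_dist)
        # Highest pitch exactly at the target, falling off with distance.
        freq = f_max - (f_max - f_min) * (dist / max_dist)
        # Pan is 0 (perceived as central) when horizontally on target.
        pan = max(-1.0, min(1.0, dx / max_dist))
        return freq, pan

    print(tone_for_cursor((450, 300), (300, 180)))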
The second preliminary experiment made use of GAIA as the only means of
(TTS) feedback (apart from testers who were able to fortuitously click on a tar-
get city without needing the benefit of any guidance). Once again, testers were
asked to find the target as quickly as possible, relying on GAIA to provide infor-
mation about the clicked location and accurate hints about where to click next.
The graphic representation of the agent could be shown or hidden, according
to user preference. Results from this preliminary experiment indicate that users
can easily accomplish the same target-location task as that described above by
following spoken instructions, albeit at a noticeably slower rate compared to the
‘tonal feedback’ approach. This is hardly surprising, given the relative simplicity
of the task. In this alpha phase we were more interested in tester attitudes to each
experiment than in making quantitative comparisons of the time taken to locate
a given target. A frequent comment made by testers was that although GAIA’s
feedback was slower than the tonal approach, the agent provided additional, use-
ful geographic information that the first experiment was unable to supply. The
meta-level pedagogical aim here is learning about key European cities so GAIA’s
inclusion of incidental geographic feedback should help a learner to assimilate
new knowledge in an appropriate, contextually relevant form.
Volunteer tester feedback provided some useful insights into how formal eval-
uation of the system might be carried out. The alpha tests were not designed
to provide evaluative data as such, but they have suggested some effective ways
of evaluating different modes of interface agents
used in pedagogical settings. A comparative study that contrasts different modes
of agent feedback is planned for the formal evaluation phase.
Subject interactions with the program during the alpha test phase also allowed
us to make some tentative general suppositions regarding the human/agent in-
terface:
1. Where the primary mode of agent communication is through TTS and where
deictic or gestural information is not exploited graphically, animated graphic
representation of an interface agent is less important and may well be dis-
tracting and/or superfluous in tasks such as the current one.
2. Using a log-in that captures a user name can help to establish a basic rap-
port between the agent and a user. Additional benefits may include accurate
tracking of a user’s input, and providing a user with feedback on past per-
formance.
3. Agents need to be flexible enough in their communication to offer contextual
advice according to the relative accuracy of user input.
4. Randomizing an audio agent’s spoken responses can help to keep conversa-
tion novel and engaging for a user.
Agent characterization and personalized responses seem to be important fac-
tors in making a task enjoyable and easy to learn. In TTS mode, REM’s design
requires a user to listen to GAIA’s instructions in order to infer the next step,
which places the graphic representation of the interface agent in a secondary
role compared to the audio presentation. Future plans for the beta-testing phase
include the addition of contextual ambient sounds and music, and a range of
cultural graphics, videos and demographic information.
Conclusions
References
1. Bates, J., The Role of Emotion in Believable Agents. Communications of the ACM,
Special Issue on Agents (1994).
2. Campbell, N., CHATR: A High-Definition Speech Re-Sequencing System. Proceed-
ings of the 3rd ASA/ASJ Joint Meeting, Hawaii, Dec. 23-28 (1996).
3. Conati, C., Gertner, A., VanLehn, K and Druzdzel, M. J., On-Line Student Mod-
eling for Coached Problem Solving Using Bayesian Networks. Proceedings of the
Sixth International Conference on User Modeling (UM-97), Sardinia, Italy (1997).
4. Creager, W., Simulated Conversations: Speech as an Educational Tool. In The
Future of Speech and Audio in the Interface: A CHI’94 Workshop, Arons, B. and
Mynatt, E., co-convenors. SIGCHI Bulletin, Vol. 26, No. 4, October (1994) 44–48.
5. Frasson, C., Mengelle, T. and Aimeur, E., Using Pedagogical Agents in a Multi-
strategic Intelligent Tutoring System, Proceedings of the Workshop on Pedagogi-
cal Agents, World Conference on Artificial Intelligence in Education (AI-ED’97),
Kobe, Japan (1997).
6. Gerber, R., Lidstone, J. and Nason, R., Modelling Expertise in Map Reading:
Beginnings. International Research in Geographical and Environmental Education,
Volume 1, No. 1 (1992) 31–43.
7. Grosz, B. and Sidner, C., Attention, Intentions and the Structure of Discourse.
Computational Linguistics, Vol. 12, No. 3 (1986) 175–204.
8. Hendrix, C., Exploratory Studies on the Sense of Presence in Virtual Environ-
ments as a Function of Visual and Auditory Display Parameters. M.S.E. Thesis
submitted to the University of Washington (1994).
9. Hofstadter, D., Fluid Concepts and Creative Analogies. The Penguin Press, London
(1995) 157.
10. Isbister, K., Perceived Intelligence and the Design of Computer Characters. M.A
Thesis, Lifelike Computer Characters Conference, Snowbird, Utah, Sept. (1995).
11. Johnson, W. L. and Shaw E., Using Agents to Overcome Deficiencies in Web-Based
CourseWare. Proceedings of the Workshop on Intelligent Educational Systems on
the World Wide Web, 8th World Conference of the AIED Society, Kobe, Japan,
August (1997).
12. King, W., Anthropomorphic Agents: Friend, Foe, or Folly. Technical Memorandum
M-95-1, University of Washington (1995).
13. Kono, Y., Yano, T., Ikeda, T., Chino, T., Suzuki K. and Kanazawa, H., An Inter-
face Agent System Employing an ATMS-based Multimodal Input Interpretation
Method. To appear in the Journal of the Japanese Society for Artificial Intelli-
gence, Vol. 13, No. 2 (in Japanese) (1998).
14. Nagao, K. and Rekimoto, J., Agent Augmented Reality: A Software Agent Meets
the Real World. Proceedings of the Second International Conference of Multiagent
Systems (ICMAS-96) (1996).
15. Oviatt, S., Multimodal Interfaces for Dynamic Interactive Maps. Proceedings of
the Conference on Human Factors in Computing Systems (CHI’96), ACM Press,
New York. (1996) 95–102.
16. Picard, R.W., Affective Computing. MIT Press, Cambridge, Mass. 1997.
17. Prevost, S., Contextual Aspects of Prosody in Monologue Generation. Work-
shop Proceedings, Context in Natural Language Processing (IJCAI-95), Montreal
(1995).
18. Rich, C., Window Sharing with Collaborative Interface Agents. SIGCHI Bulletin,
Vol. 28, No. 1, January (1996).
19. Rickel, J. and Johnson W. L., Intelligent Tutoring in Virtual Reality: A Preliminary
Report. Proceedings of the Eighth World Conference on AI in Education, Kobe,
Japan, August (1997).
20. Rudnicky, A.I., Mode Preference in a Simple Data-retrieval Task. Proceedings of
the ARPA Workshop on Human Language Technology, San Mateo (1993) 364–369.
21. Schank, R., and Cleary, C., Engines for Education. Lawrence Erlbaum Associates
(1996).
22. Sloman, A. and Poli, R., SIM-AGENT: A Toolkit for Exploring Agent Designs.
In Wooldridge, M., Muller, J., and Tambe, M., editors, Intelligent Agents II: Pro-
ceedings of the IJCAI ’95 Workshop ATAL, August, 1995. Springer-Verlag, Berlin
(1996) 392–407.
When Agents Meet Cross-Cultural Metaphor: Can
They Be Equipped to Parse and Generate It?
Patricia O’Neill-Brown
U.S. Department of Commerce, Manager, Japan Technology Program,
14th & Constitution Ave. NW, Washington, DC
PONeillBrown@doc.gov
Metaphor is central to language and thought. Therefore, any system that attempts to
handle communicative acts must account for metaphor. Going back to The Philosophy
of Rhetoric (1936), Richards asserts that “human cognition is basically metaphoric in
nature rather than primarily literal, that the metaphors of our language actually derive
from an interaction of thoughts” and that “metaphor is not a cosmetic rhetorical
device or a stylistic ornament, but is an omnipresent principle of thought” (Johnson
1981:18-19). Similarly, Black held the view that metaphorical statements are not
replaceable by literal statements of comparison (Black 1962:31-37). It was not until
Reddy (1979) and then Lakoff and Johnson’s landmark study, Metaphors We Live By
(1980), that these views could be supported by data. These works demonstrate,
through copious examples, that metaphor shapes and influences our everyday
experience.
Once it was accepted that metaphor is ubiquitous in language and thought,
metaphor could be cast within a general theory of meaning. Indeed, the computational
models which have most effectively dealt with metaphor are those that have treated
metaphor in this manner. Due to the ubiquity of metaphor in language, therefore, it is
not a matter of “if agents encounter metaphor in a system,” or of agents perhaps
desiring to take advantage of metaphorical means, but rather a necessity that agents be
able to parse and generate metaphor. The questions then become: when and how will
metaphors arise, what forms will they take, and how will the agent respond to them
as well as produce them? One type of metaphor that agents will have to handle is cross-cultural
metaphor.
If we take a look at Chinese, the structures, Qing-fu (“light and floating”); qing-
piao (“light and drifting”); and piao-fu (“drifting and floating”), denote the idea of
“being off the ground,” yet connote not happiness, but rather, complacency, pride and
a lack of self-control, conjuring up for the native speaker of Chinese, the concept of
frivolity and superficiality. It is true that sometimes “being off the ground” in Chinese
is equated with the state of being happy. The expression Teng yun-jia wu (“ride on
a cloud”) is sometimes used to describe happiness about major progress or success.
However, the “being off the ground” Chinese metaphors have both positive and
negative values, whereas the English “being off the ground” metaphors are positive,
and not negative. Hence, this is a case that demonstrates that there is not necessarily a
one-to-one equivalency of metaphorical meaning across languages.
Teyahsútha
te-ye-yahs-út-ha
DU-she-cross-attach, put onto-ASP
she attaches a cross
She is Catholic.
(Bonvillain 1989:187)
Japanese has metaphors that do not have metaphorical counterparts in English. For
instance, the Japanese verb nagareru, in one of its literal senses, means “to flow,” as
in “the river flows,” but when it is used metaphorically, it can mean
“passed,” “drenched,” “spread,” or “forfeited”:
1. Kono machi ni utsuri sunde kara itsu no ma ni ka goju nen ijo no gabbi ga
nagaremashita.
Since I moved to this town, before I knew it, more than 50 years’ time had
passed.
4. Asu made ni o kane o shichiya ni motte ika nai to kamera ga nagarete shimau.
If you don’t bring the money to the pawn shop by tomorrow, the camera will
wind up being forfeited.
The English verb, “to flow,” has none of these metaphorical senses.
2.1 Second Language Learners and Metaphor in the Second Language (L2)
Acceptable answers for question 1 would have been “was a hit,” the sentence, in
English, reading as “Movies are always crowded out by television, but this production
was a hit and day after day the theaters were packed.” For question 2, an acceptable
answer would have been “fell upon me,” the sentence reading, “During the English
period, the reading fell upon me.”
It was predicted that when asked for a metaphorical sense in the control condition,
subjects would provide the prototypical sense of the verb, which is typically a literal
sense. This turned out to be the case. Especially for the beginners, in the control
condition, the default did seem to be to answer with the literal, most prototypical
sense of the verb. All of the subjects had higher combined total scores for Exercise 2,
the experimental condition, than Exercise 1, the control condition, as shown in Table
1. Almost all subjects, with a few exceptions, had higher individual scores for Exercise
2 than for Exercise 1. The analysis of variance (ANOVA) showed that the effect of
instruction was significant, F=117.05, p = 0.0. This experiment revealed that all
differences among means were significant, p < .05. This demonstrates that second
language learners of Japanese do not exhibit metaphorical competence in Japanese,
and therefore require instruction.
Table 1. Means and Standard Deviations of the Percentage Correct by Test Condition

Condition       M      SD
Control         18.3   20.2
Experimental    76.5   10.5
The instruction helps the subject to acquire the conceptual structuring of Japanese
metaphor. The method combines a core meaning plus a context approach. The
method 1) presents the core meaning of the word under study to the student; 2)
provides a sentence with the word in it; 3) and asks the student to think of the core
meaning and the other words surrounding it in the sentence to generate a mental
picture of the situation to 4) arrive at the lexical meaning of the word in question.
The context operates on a dual level: the context of the sentence and the context of the
image that the subject conjures up of the situation.
In this way, the learner is led to a place for understanding the meaning of Japanese
words as it is understood by native speakers. In the end, the second language learner
and the first language learner may have the same conceptual understanding of
meaning. However, especially in the beginning phases of learning, L2 learners must
engage in different processes and employ different strategies for arriving at meaning.
For instance, the L2 learner, if not immersed in the language, must conjure up images
to obtain understanding, like the method employed here, which requires the student to
visually simulate the real world situation. This is in contrast to the first language
learner, who, always immersed in the environment, already has the “images” there.
To illustrate how the method used in Exercise 2 operates, consider the lesson for the
first verb. The method begins by introducing the Japanese verb and explaining what
the core meaning of that verb is, which we refer to as the general meaning of the verb.
The reason for choosing the term “the general meaning” as opposed to “core
meaning” is because students are often intimidated by linguistic terminology and a
more generic term is less distracting.
“In this lesson, we will examine the verb, ataru in more detail. Ataru has many
meanings but are related to one another in a general way. In this lesson we'll show
you how the meanings are related. In general, when you use the verb ataru what
you are doing is conceptualizing a situation in which someone or something is
directing attention to or putting themselves or itself, either physically or mentally,
at a particular point, which can be thought of as the object or the goal. The object
or goal can be a person or a thing.”
The next step is to define for the student the prototypical sense of the verb, which we
refer to in the lesson as the main meaning, for the same reason we refer to general
meaning as opposed to core meaning. Included is a sample sentence with the English
translation:
“The main meaning of ataru, which you may already know, is 'to target.' Here's an
example sentence with this meaning of ataru in it:
1. Zenryoku o agete teki ni atatta.
We targeted the enemy with all our might.
The next stage in the process is to explain to the student how the general meaning of
the verb fits in with the main meaning, using a concrete example:
“Let's look at how our explanation of the general meaning of ataru fits in with the
main meaning of ataru. Remember that we said that in general, when you use the
verb ataru what you are doing is conceptualizing a situation in which someone or
something is directing attention to or putting themselves or itself, either physically
or mentally, at a particular point, which can be thought of as the object or the goal.
The object or goal can be a person or a thing. How does this general meaning fit in
with one of the scenarios covered by the main meaning, which we saw in the
example sentence? Let's look at it this way. When you are ataruing your enemy,
what are you doing? Remember, you are directing your attention to a particular
point and putting yourself at the point something else is at, either physically or
mentally. If you are in one place and your enemy is in another place, and you're
bringing yourself to the enemy, or making the enemy the object of something, what
are you doing? What you are doing is targeting your enemy.”
“However, every time you see ataru used in a sentence, the main meaning, ‘to
target,’ is not always used. So how can we tell what the meaning is? We can tell by
thinking of the general meaning of ataru and seeing how the parts of the sentence
fit in with this particular sense of ataru. Let's take an example.
2. Kaze ga yoku ataru umizoi no michi o aruite iru to suna ga me ni haitte kuru.
In this sentence if we were to say that its meaning is, ‘When you walk along the
coastal road where the wind targets, the sand gets in your eyes,’ 'the wind targets'
sounds funny in English. So what do we have to do to come up with a better
translation?”
The next step in our lesson is to introduce the students to the procedure of thinking
of what the general meaning of the verb is, then looking at what type of noun the verb
is paired with to determine correct sense:
“We have to look at the type of object that is being used with the verb ataru. The
type of object linked with the verb determines the interpretation you're going to be
thinking of when you see the verb ataru in a particular sentence. This is an
important principle to remember.
So you have to plug in the specific object the verb is being used with and see how it
fits in with one of the scenarios covered by the general meaning of ataru to
determine the correct meaning.
We would ask ourselves these questions: When the wind atarus a road, what is the
wind doing? Let's think. In your mind you should imagine what is happening when
the wind is at the same point that the road is or what it means when the wind is
putting itself at the same point that the road is at.”
Now, the student is brought back to the sentence and asked to think what the verb
would mean in the context of the sentence, in this way, arming her with additional
clues for deciding what the verb means via the other words in the sentence. The
lesson continues as follows:
“We could say tentatively that in this sentence, 'hit' would be the best meaning for
ataru. Then we'd have to ask ourselves if this would be the best translation for
the sentence. Let's see if it is. The sentence, again, is:
Kaze ga yoku ataru umizoi no michi o aruite iru to suna ga me ni haitte kuru.
When you walk along the coastal road where the wind hits, the sand gets in your
eyes.
We would determine that 'hits' does make sense in this sentence. We can see that
'hits' fits in with the general meaning of ataru. When one object hits another, the
two of them are in the same place.”
The student is stepped through two more examples, reinforcing the procedure for
arriving at the correct meaning. Students are then asked to work through the rest of the
exercises themselves. Here are sample questions for ataru:
Think about what it means for a mother to atari chirasus her children. Plugging in
the general meaning of the verb, the one that is doing the atari chirasuing is
making the other person the object of something. What would we say that she is
doing to her children?
When business is ataru, what does this mean? What's happening is that the
business meets a particular point, which can be considered the goal. Think about
what the goal of the people running a business would be.
For question 3, an acceptable answer is “take it out on” and for question 4, something
like, “makes it big,” “takes off” or “is successful” are acceptable.
The instructional method does not merely call upon the learner to memorize
Japanese metaphors by rote; rather, the learner is required to embrace a connectionist
approach to lexical acquisition. They are called upon to embody an understanding of
the core meaning of a word, and then dynamically determine its lexical meaning in
context. The L2 learner does not have in place the understanding that there is such a
thing as a core meaning holding the literal and the metaphorical together, or how to,
starting from the core, arrive at the meaning of a lexical item in a context. The
instructional method described here, which has been demonstrated to be effective, is
novel, since second language instructors typically do not take a connectionist
approach to teaching the L2 lexicon. Furthermore, this method has the potential to
enable students to recognize metaphor on their own. After they had been through
both exercises for about three or four verbs, several of the subjects started to perform
better on Exercise 1, though still not better than on Exercise 2.
Just as the processes for lexical acquisition must be made explicit to the second language
learner, processes for parsing and generation, including the parsing and generation of
metaphor, must be made explicit to a computational entity such as an agent.
The method employed here takes a connectionist approach to the lexicon—
something called a core meaning underlies the metaphorical and literal senses of
words, and lexical meaning is determined on the fly in the context of a situation. In
other words, the learner must memorize some content—the core meaning of a word—
which remains steady—and then step through a procedure for dynamically
understanding the lexical meaning of that word in a context. This method is
representative of the “flexible computing” approach. Checking core meaning and
relating it to other sentential constituents to conjure up an image to arrive at meaning
is procedural. Flexibility comes into the model in the sense that any verb, any
sentence, and any context can be processed. The context is not pre-composed—it is
built and computed on the fly. In addition, the method for understanding Japanese
metaphor is flexible in that it can be applied to languages other than Japanese.
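A toy rendering of this "flexible computing" idea, assuming a hand-built noun-type lexicon and sense table (both invented here; the paper's own procedure for deriving core meanings is not shown): the core meaning is the stored constant, and the contextual sense is computed on the fly from the type of noun the verb is paired with.

    # Hypothetical semantic types for nouns paired with the verb `ataru`.
    NOUN_TYPE = {"teki": "animate-opponent",     # enemy
                 "kaze": "natural-force",        # wind
                 "shoubai": "enterprise"}        # business

    # Contextual senses of `ataru`, indexed by the paired noun's type.
    ATARU_SENSES = {"animate-opponent": "to target",
                    "natural-force": "to hit",
                    "enterprise": "to be successful"}

    # The stored core ("general") meaning, held constant across contexts.
    ATARU_CORE = ("directing attention to, or putting oneself at, "
                  "a particular point, the object or goal")

    def sense_of_ataru(paired_noun):
        """Compute the contextual sense; fall back to the core meaning."""
        return ATARU_SENSES.get(NOUN_TYPE.get(paired_noun), ATARU_CORE)

    print(sense_of_ataru("kaze"))   # 'to hit'  (the wind hits the road)
    print(sense_of_ataru("teki"))   # 'to target' (targeting the enemy)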
Understanding Japanese metaphor involved developing a method for deriving core
meaning (not described here). This method was informed by a procedure used by
Brugman (1983) to derive core meaning for an English structure, as well as the
phenomena of “the verb mutability effect” uncovered by Gentner and France (1988).
The method was top-down, consisting of taking in and analyzing sentence after
sentence to determine core meaning. In turn, this method was used to develop the
instructional technique; essentially, the procedure in reverse. As described, the
method was bottom-up, starting from the core and generating up, capturing the lexical
meaning through the context of the sentence and the image conjured up by it. By
extension, a “flexible computing” approach similar to the methods for understanding
and producing metaphor described here may be a viable way for developing agents
that are capable of recognizing and generating metaphor.
References
1. Black, M.: Metaphor. In: Models and Metaphors: Studies in Language and Philosophy.
Cornell University Press, Ithaca, New York (1962) 25-47.
2. Bonvillain, N.: Noun Incorporation and Metaphor: Semantic Process in Akwesasne Mohawk.
In: Anthropological Linguistics (1989) 31:3-4.
3. Brugman, C.: Story of Over. Indiana University Linguistics Club, Bloomington Indiana
(1983).
4. Danesi, M.: Metaphorical Competence in Second Language Acquisition Research and
Language Teaching: The Neglected Dimension. In: Alatis, J. (ed.): Georgetown University
Round Table on Languages and Linguistics. Georgetown University Press, Washington,
D.C. (1992).
5. Dirven, R.: Metaphor as a Basic Means for Extending the Lexicon. In: Wolf, P., Dirven, R.
(eds.): The Ubiquity of Metaphor in Language and Thought, 85-119. John Benjamins,
Amsterdam (1985).
6. Emanatian, M.: Metaphor and the Expression of Emotion: The Value of Cross-Cultural
Perspectives. In: Metaphor and Symbolic Activity (1995) 10(3):163-182.
7. Gentner, D., France, I.M.: The Verb Mutability Effect: Studies of the Combinatorial
Semantics of Nouns and Verbs. In: Small, S. (ed.): Lexical Ambiguity Resolution:
Imitation and Mechanisms of Joint Attention
Brian Scassellati
1 Motivation
attention to objects and events in the world serves as the initial mechanism
for infants to share experiences with others and to negotiate shared meanings.
Joint attention is also a mechanism for allowing infants to leverage the skills and
knowledge of an adult caretaker in order to learn about their environment, in
part by allowing the infant to manipulate the behavior of the caretaker and in
part by providing a basis for more complex forms of social communication such
as language and gestures.
Joint attention has been investigated by researchers in a variety of fields.
Experts in child development are interested in these skills as part of the normal
developmental course, since infants acquire them extremely rapidly and in a stereotyped
sequence (Scaife & Bruner 1975, Moore & Dunham 1995). Additional work on
the etiology and behavioral manifestations of developmental disorders such as
autism and Asperger’s syndrome have focused on disruptions to joint attention
mechanisms and demonstrated how vital these skills are in our social world
(Cohen & Volkmar 1997, Baron-Cohen 1995). Philosophers have been interested
in joint attention both as an explanation for issues of contextual grounding
and as a precursor to a theory of other minds (Whiten 1991, Dennett 1991).
Evolutionary psychologists and primatologists have focused on the evolution of
these simple social skills throughout the animal kingdom, both as a means of evaluating
the presence of theory of mind and as a measure of social functioning
(Povinelli & Preuss 1995, Hauser 1996, Premack 1988).
We have approached joint attention from a slightly different perspective:
the construction of human-like robots that exhibit these social skills (Scassel-
lati 1996). This approach focuses first on the construction of useful real-world
systems that can both recognize and produce normal human social cues, and
second on the evaluation of the complex models of joint attention developed by
other disciplines.
Building machines that can recognize human social cues will provide a flex-
ibility and robustness that current systems lack. While the past few decades
have seen increasingly complex machine learning systems, the systems we have
constructed have failed to approach the flexibility, robustness, and versatility
that humans display. There have been successful systems for extracting envi-
ronmental invariants and exploring static environments, but there have been
few attempts at building systems that learn by interacting with people using
natural, social cues. With advances in embodied systems research, we can now
build systems that are robust enough, safe enough, and stable enough to allow
machines to interact with humans in a learning environment. Constructing a
machine that can recognize the social cues from a human observer allows for
more natural human-machine interaction and creates possibilities for machines
to learn by directly observing untrained human instructors. We believe that by
using a developmental program to build social capabilities we will be able to
achieve a wide range of natural interactions with untrained observers (Brooks,
Ferrell, Irie, Kemp, Marjanovic, Scassellati & Williamson 1998).
Robotics also offers a unique tool to developmental psychology and related
disciplines in evaluating complex interaction models. By implementing these
models in a real-world system, we provide a test bed for manipulating the be-
havioral progression. With an implemented developmental model, we can test
alternative learning and environmental conditions in order to evaluate alterna-
tive intervention and teaching techniques. This investigation of joint attention
asks questions about the development and origins of the complex non-verbal
communication skills that humans so easily master: What is the progression of
skills that humans must acquire to engage in shared attention? When something
goes wrong in this development, as it seems to do in autism, what problems can
occur, and what hope do we have for correcting these problems? What parts of
this complex interplay can be seen in other primates, and what can we learn
about the basis of communication from these comparisons? With a robotic im-
plementation of the theoretical models, we can further these investigations in
previously unavailable directions.
However, building a robot with the complete social skills of a human is a
Herculean task that still resides in the realm of science fiction and not artificial
intelligence. In order to build a successful implementation, we must decompose
the monolithic “social skills module” into manageable pieces. The remainder of
this chapter will be devoted to building a rough consensus of evidence from work
on autism and Asperger’s syndrome, from developmental psychology, and from
evolutionary studies on how this decomposition can best be accomplished. From
this rough consensus, we will outline a program for building a robot that can
recognize and generate simple joint attention behaviors. Finally, we will describe
some of the preliminary steps we have taken with one humanoid robot to build
this developmental program.
The studies most relevant to our purposes have occurred as developmental and
evolutionary investigations of “theory of mind” (see Whiten (1991) for a collec-
tion of these studies). The most important finding, repeated in many different
forms, is that the mechanisms of joint attention are not a single monolithic sys-
tem. Evidence from childhood development shows that not all mechanisms for
joint attention are present from birth, and there is a stereotypic progression of
skills that occurs in all infants at roughly the same rate (Hobson 1993). For
example, infants are always sensitive to eye direction before they can interpret
and generate pointing gestures.
There are also developmental disorders, such as autism, that limit and frac-
ture the components of this system (Frith 1990). Autism is a pervasive devel-
opmental disorder of unknown etiology that is diagnosed by a set of behav-
ioral criteria centered around abnormal social and communicative skills (DSM
1994, ICD 1993). Individuals with autism tend to have normal sensory and mo-
tor skills, but have difficulty with certain socially relevant tasks. For example,
autistic individuals fail to make appropriate eye contact, and while they can rec-
ognize where a person is looking, they often fail to grasp the implications of this
information. While the deficits of autism certainly cover many other cognitive
abilities, some researchers believe that the missing mechanisms of joint attention
may be critical to the other deficiencies (Baron-Cohen 1995). In comparison to
other mental retardation and developmental disorders (like Williams and Down
Syndromes), the social deficiencies of autism are quite specific (Karmiloff-Smith,
Klima, Bellugi, Grant & Baron-Cohen 1995).
Evidence from research into the social skills of other animals has also indi-
cated that joint attention can be decomposed into a set of subskills. The same
ontogenetic progression of joint attention skills that is evident in human infants
can also be seen as an evolutionary progression in which the increasingly complex
set of skills can be mapped to animals that are increasingly closer to humans on
a phylogenetic scale (Povinelli & Preuss 1995). For example, skills that infants
acquire early in life, such as sensitivity to eye direction, have been demonstrated
in relatively simple vertebrates, such as snakes (Burghardt & Greene 1990), while
skills that are acquired later tend to appear only in the primates (Whiten 1991).
This module-based description is a useful analysis tool, but does not provide
sufficient detail for a robotic implementation. To build a portion of joint behav-
ior skills, we require a set of observable behaviors that can be used to evaluate
the performance of the system incrementally. We require a task-level decom-
position of necessary skills and the developmental mechanisms that provide for
transition between stages. Our current work is on identifying and implementing
a developmental account of one possible skill decomposition, an account which
relies heavily upon imitation.
plover (Ristau 1991), and all primates (Cheney & Seyfarth 1990). Identifying
whether or not something is looking at you provides an obvious evolutionary
advantage in escaping predators, but in many mammals, especially primates, the
recognition that another is looking at you carries social significance. In monkeys,
eye contact is significant for maintaining a social dominance hierarchy (Cheney
& Seyfarth 1990). In humans, the reliance on eye contact as a social cue is
even more striking. Infants have a strong preference for looking at human faces
and eyes, and maintain (and thus recognize) eye contact within the first three
months. Maintenance of eye contact will be the testable behavioral goal for a
system in this stage.
The second step is to engage in joint attention through gaze following. Gaze
following is the rapid alternation between looking at the eyes of the individual
and looking at the distal object of their attention. While many animals are sen-
sitive to eyes that are gazing directly at them, only primates show the capability
to extrapolate from the direction of gaze to a distal object, and only the great
apes will extrapolate to an object that is outside their immediate field of view
(Povinelli & Preuss 1995).¹ This evolutionary progression is also mirrored in the
ontogeny of social skills. At least by the age of three months, human infants dis-
play maintenance (and thus recognition) of eye contact. However, it is not until
nine months that children begin to exhibit gaze following, and not until eighteen
months that children will follow gaze outside their field of view (Baron-Cohen
1995). Gaze following is an extremely useful imitative gesture which serves to
focus the child’s attention on the same object that the caregiver is attending to.
This simplest form of joint attention is believed to be critical for social
scaffolding (Thelen & Smith 1994), development of theory of mind (Baron-Cohen 1995),
and providing shared meaning for learning language (Wood, Bruner & Ross
1976). This functional imitation appears simple, but a complete implementation
of gaze following involves many separate proficiencies. Imitation is a developing
research area in the computational sciences (for excellent examples, see (Daut-
enhahn 1994, Hayes & Demiris 1994, Dautenhahn 1997)).
The third step in our account is imperative pointing. Imperative pointing is
a gesture used to obtain an object that is out of reach by pointing at that object.
This behavior is first seen in human children at about nine months of age (Baron-
Cohen 1995), and occurs in many monkeys (Cheney & Seyfarth 1990). However,
there is nothing particular to the infant’s behavior that is different from a simple
reach – the infant is initially as likely to perform imperative pointing when the
caretaker is attending to the infant as when the caretaker is looking in the other
direction or when the caretaker is not present. The caregiver’s interpretation of
the infant’s gesture provides the shared meaning. Over time, the infant learns when
the gesture is appropriate. One can imagine the child learning this behavior
through simple reinforcement. The reaching motion of the infant is interpreted
by the adult as a request for a specific object, which the adult then acquires
and provides to the child. The acquisition of the desired object serves as positive
reinforcement for the contextual setting that preceded the reward (the reaching
action in the presence of the attentive caretaker). Generation of this behavior is
then a simple extension of a primitive reaching behavior.
¹ The terms “monkey” and “ape” are not to be used interchangeably. Apes include
orangutans, gorillas, bonobos, chimpanzees, and humans. All apes are monkeys, but
not all monkeys are apes.
The fourth step is the advent of declarative pointing. Declarative pointing is
characterized by an extended arm and index finger designed to draw attention
to a distal object. Unlike imperative pointing, it is not necessarily a request
for an object; children often use declarative pointing to draw attention to ob-
jects that are clearly outside their reach, such as the sun or an airplane passing
overhead. Declarative pointing also only occurs under specific social conditions;
children do not point unless there is someone to observe their action. We propose
that imitation is a critical factor in the ontogeny of declarative pointing. This
is an appealing speculation from both an ontological and a phylogenetic stand-
point. From an ontological perspective, declarative pointing begins to emerge at
approximately 12 months in human infants, which is also the same time that
other complex imitative behaviors such as pretend play begin to emerge. From
the phylogenetic perspective, declarative pointing has not been identified in any
non-human primate (Premack 1988). This also corresponds to the phylogeny of
imitation; no non-human primate has ever been documented to display imitative
behavior under general conditions (Hauser 1996). We propose that the child first
learns to recognize the declarative pointing gestures of the adult and then imi-
tates those gestures in order to produce declarative pointing. The recognition of
pointing gestures builds upon the competencies of gaze following and imperative
pointing; the infrastructure for extrapolation from a body cue is already present
from gaze following, it need only be applied to a new domain. The generation of
declarative pointing gestures requires the same motor capabilities as imperative
pointing, but it must be utilized in specific social circumstances. By imitating
the successful pointing gestures of other individuals, the child can learn to make
use of similar gestures.
To build a system that can both recognize and produce the joint attention skills
outlined above, we require a system with both human-like sensory systems and
motor abilities. The Cog project at the MIT Artificial Intelligence Laboratory
has been constructing an upper-torso humanoid robot, called Cog, in part to
investigate how to build intelligent robotic systems by following a developmental
progression of skills similar to that observed in human development (Brooks &
Stein 1994, Brooks et al. 1998). In the past two years, a basic repertoire of
perceptual capabilities and sensory-motor skills has been implemented on the
robot (see Brooks et al. (1998) for a review).
The humanoid robot Cog has twenty-one degrees of freedom to approximate
human movement, and a variety of sensory systems that approximate human
senses, including visual, vestibular, auditory, and tactile senses. Cog’s visual
system is designed to mimic some of the capabilities of the human visual system.
Fig. 2. Images obtained from the peripheral (top) and foveal (bottom) cameras on Cog.
The peripheral image is used for detecting salient objects worthy of visual attention,
while the foveal image is used to obtain high resolution detail of those objects.
Fig. 3. Block diagram for the pre-filtering stage of face detection. The pre-filter selects
target locations based upon motion information and past history. The pre-filter allows
face detection to occur at 20 Hz with little accuracy loss.
A short summary of these steps appears below, and additional details can be
found in Scassellati (1998b).
To identify face locations, the peripheral image is converted to grayscale and
passed through a pre-filter stage (see Figure 3). The pre-filter allows us to search
only locations that are likely to contain a face, greatly improving the speed of
the detection step. The pre-filter selects a location as a potential target if it has
had motion in the last 4 frames, was a detected face in the last 5 frames, or has
not been evaluated in 3 seconds. A combination of the pre-filter and some early-
rejection optimizations allows us to detect faces at 20 Hz with little accuracy
loss.
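As an illustration only (the actual Cog implementation is not given here), the
pre-filter rule described above can be sketched in Python; the grid-cell
abstraction, the class name, and the 60-frame staleness window (3 seconds at
20 Hz) are assumptions:

    class PreFilter:
        # Pre-filter of Fig. 3, per the description above: a location is
        # searched only if it moved within the last 4 frames, held a detected
        # face within the last 5 frames, or has not been evaluated for about
        # 3 seconds (roughly 60 frames at 20 Hz). All names are illustrative.
        def __init__(self, motion_frames=4, face_frames=5, revisit_frames=60):
            self.motion_frames = motion_frames
            self.face_frames = face_frames
            self.revisit_frames = revisit_frames
            self.last_motion = {}   # location -> frame of last detected motion
            self.last_face = {}     # location -> frame of last detected face
            self.last_eval = {}     # location -> frame of last evaluation

        def is_candidate(self, loc, frame):
            never = -10**9
            if (frame - self.last_motion.get(loc, never) <= self.motion_frames
                    or frame - self.last_face.get(loc, never) <= self.face_frames
                    or frame - self.last_eval.get(loc, never) >= self.revisit_frames):
                self.last_eval[loc] = frame
                return True
            return False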
Face detection is done with a method called “ratio templates” designed to
recognize frontal views of faces under varying lighting conditions (Sinha 1996).
A ratio template is composed of a number of regions and a number of relations
(see Figure 4).
Imitation and Mechanisms of Joint Attention 185
Fig. 4. A ratio template for face detection. The template is composed of 16 regions
(the gray boxes) and 23 relations (shown by arrows).
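A minimal sketch of ratio-template matching along the lines of this description
follows; the region layout, the ratio threshold, and the acceptance rule are
illustrative assumptions, not Sinha’s published parameters:

    import numpy as np

    def ratio_template_match(window, regions, relations,
                             ratio_threshold=1.1, min_satisfied=21):
        # window   : 2-D grayscale array covering the candidate face location.
        # regions  : list of (row0, row1, col0, col1) boxes (16 in Fig. 4).
        # relations: list of (brighter, darker) index pairs (23 in Fig. 4).
        means = [window[r0:r1, c0:c1].mean() for (r0, r1, c0, c1) in regions]
        satisfied = sum(1 for (b, d) in relations
                        if means[d] > 0 and means[b] / means[d] >= ratio_threshold)
        # Accept the window as a face if enough relations hold (threshold assumed).
        return satisfied >= min_satisfied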
Fig. 5. Block diagram for finding eyes and faces. Once a target face has been located,
the system must saccade to that location, verify that the face is still present, and then
map the position of the eye from the face template onto a position in the foveal image.
are then mapped into foveal camera coordinates using a second learned mapping.
The mapping from foveal to peripheral pixel locations can be seen as an attempt
to find both the difference in scales between the images and the difference in
pixel offset. In other words, we need to estimate four parameters: the row and
column scale factor that we must apply to the foveal image to match the scale
of the peripheral image, and the row and column offset that must be applied to
the foveal image within the peripheral image. This mapping can be learned in
two steps. First, the scale factors are estimated using active vision techniques:
while moving the motor at a constant speed, we measure the optic flow of both
cameras. The ratio of the flow rates is the ratio of the image sizes. Second, we use
correlation to find the offsets. The foveal image is scaled down by the discovered
scale factors, and then correlated with the peripheral image to find the best
match location.
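The two-step estimation can be sketched as follows, assuming grayscale images as
NumPy arrays; the nearest-neighbour downscaling and brute-force correlation are
simplifications for illustration, not the routines used on Cog:

    import numpy as np

    def estimate_scale(foveal_flow, peripheral_flow):
        # While the eye motor moves at constant speed, the ratio of the
        # measured optic-flow rates gives the ratio of image scales
        # (one factor per axis).
        return foveal_flow / peripheral_flow

    def estimate_offset(foveal, peripheral, row_scale, col_scale):
        # Scale the foveal image down by the discovered factors (nearest
        # neighbour), then slide it over the peripheral image and return
        # the offset with the highest (unnormalised) correlation.
        rows = np.arange(0, foveal.shape[0], row_scale).astype(int)
        cols = np.arange(0, foveal.shape[1], col_scale).astype(int)
        small = foveal[np.ix_(rows, cols)]
        H, W = peripheral.shape
        h, w = small.shape
        best, best_score = (0, 0), -np.inf
        for r in range(H - h + 1):
            for c in range(W - w + 1):
                score = np.sum(peripheral[r:r + h, c:c + w] * small)
                if score > best_score:
                    best, best_score = (r, c), score
        return best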
Once this mapping has been learned, whenever a face is foveated we can ex-
tract the image of the eye from the foveal image (see Figure 5). This extracted
image is then ready for further processing. The left image of Figure 6 shows
the result of the face detection routines on a typical grayscale image before the
saccade. The right image of Figure 6 shows the extracted subimage of the eye
that was obtained after saccading to the target face. Additional examples of
successful detections on a variety of faces can be seen in Figure 7. This method
achieves good results in a dynamic real-world environment; in a total of 140
trials distributed between 7 subjects, the system extracted a foveal image that
contained an eye on 131 trials (94% accuracy). Of the missed trials, two resulted
from an incorrect face identification (a face was falsely detected in the back-
ground clutter), and seven resulted from either an inaccurate saccade or motion
of the subject (Scassellati 1998b).
In order to accurately recognize whether or not the caregiver is looking at
the robot, we must take into account both the position of the eye within the
head and the position of the head with respect to the body. Work on extracting
the location of the pupil within the eye and the position of the head on the body
has begun, but is still in progress.
Fig. 6. A successfully detected face and eye. The 128x128 grayscale image was captured
by the active vision system, and then processed by the pre-filtering and ratio template
detection routines. One face was found within the peripheral image, shown at left. The
right subimage was then extracted from the foveal image using a learned peripheral-
to-foveal mapping.
Fig. 7. Additional examples of successful face and eye detections. The system locates
faces in the peripheral camera, saccades to that position, and then extracts the eye
image from the foveal camera. The position of the eye is inexact, in part because the
human subjects are not motionless.
following if the distal object is within their field of view. They will not turn to
look behind them, even if the angle of gaze from the caretaker would warrant
such an action. Around 18 months, the infant begins to enter a “representational”
stage in which it will follow gaze angles outside its own field of view, that is,
it somehow represents the angle of gaze and the presence of objects outside its
own view.
Implementing this progression for a robotic system provides a simple means
of bootstrapping behaviors. The capabilities used in detecting and maintaining
eye contact can be extended to provide a rough angle of gaze. By tracking along
this angle of gaze, and watching for objects that have salient color, intensity, or
motion, we can mimic the ecological strategy. From an ecological mechanism,
we can refine the algorithms for determining gaze and add mechanisms for de-
termining vergence. A rough geometric strategy can then be implemented, and
later refined through feedback from the caretaker. A representational strategy
requires the ability to maintain information on salient objects that are outside
of the field of view including information on their appearance, location, size,
and salient properties. The implementation of this strategy requires us to make
Fig. 9. Reaching to a visual target is the product of two subskills: foveating a target
and generating a ballistic reach from that eye position. Image correlation can be used
to train a saccade map which transforms retinal coordinates into gaze coordinates (eye
positions). This saccade map can then be used in conjunction with motion detection
to train a ballistic map which transforms gaze coordinates into a ballistic reach.
Fig. 10. Generation of error signals from a single reaching trial. Once a visual target
is foveated, the gaze coordinates are transformed into a ballistic reach by the ballistic
map. By observing the position of the moving hand, we can obtain a reaching error
signal in image coordinates, which can be converted back into gaze coordinates using
the saccade map.
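One possible reading of this training signal in code, under the assumption of a
linear ballistic map (the text does not specify the map’s actual form; all names
here are illustrative):

    import numpy as np

    class BallisticMap:
        # Stand-in for the ballistic map of Fig. 9: gaze coordinates (eye
        # position) -> arm primitive. A linear map is assumed purely for
        # illustration; the real mapping need not be linear.
        def __init__(self, gaze_dim=2, arm_dim=2):
            self.W = np.zeros((arm_dim, gaze_dim))

        def command(self, gaze):
            return self.W @ gaze

        def update(self, gaze, error_gaze, rate=0.1):
            # Nudge the command for this gaze direction so as to cancel the
            # observed reaching error, which (as in Fig. 10) has already been
            # converted from image coordinates into gaze coordinates by the
            # saccade map.
            self.W += rate * np.outer(error_gaze, gaze)

Each reaching trial then supplies one training pair: foveate a target, execute
the current ballistic reach, locate the moving hand in the image, convert the
image-space error into gaze coordinates through the saccade map, and call the
update.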
The task of recognizing a declarative pointing gesture can be seen as the appli-
cation of the geometric and representational mechanisms for gaze following to
a new initial stimulus. Instead of extrapolating from the vector formed by the
angle of gaze to achieve a distal object, we extrapolate the vector formed by
the position of the arm with respect to the body. This requires a rudimentary
gesture recognition system, but otherwise utilizes the same mechanisms.
We have proposed that producing declarative pointing gestures relies upon
the imitation of declarative pointing in an appropriate social context. We have
not yet begun to focus on the problems involved in recognizing these contexts,
but we have begun to build systems capable of simple mimicry. By adding a
tracking mechanism to the output of the face detector and then classifying these
outputs, we have been able to have the system mimic yes/no head nods of the
caregiver, that is, when the caretaker nods yes, the robot responds by nodding yes
(see Figure 11). The face detection module produces a stream of face locations
at 20Hz. An attentional marker is attached to the most salient face stimulus,
and the location of that marker is tracked from frame to frame. If the tracked
position oscillates vertically with sufficient amplitude, the motion is classified
as a yes-nod and the robot produces its fixed yes response; horizontal oscillation
is classified analogously as a no-shake.
Fig. 11. Images captured from a videotape of the robot imitating head nods. The upper
two images show the robot imitating head nods from a human caretaker. The output of
the face detector is used to drive fixed yes/no nodding responses in the robot. The face
detector also picks out the face from stuffed animals, and will also mimic their actions.
The original video clips are available at http://www.ai.mit.edu/projects/cog/.
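The chapter does not spell out the nod classifier itself, but one plausible
sketch compares vertical against horizontal excursion of the tracked marker over
a short window; the threshold and all names are assumptions:

    import numpy as np

    def classify_nod(track, min_amplitude=5.0):
        # track: sequence of (row, col) positions of the attentional marker
        # over a short window of frames (the face detector runs at 20 Hz).
        track = np.asarray(track, dtype=float)
        vert = track[:, 0].max() - track[:, 0].min()
        horiz = track[:, 1].max() - track[:, 1].min()
        if max(vert, horiz) < min_amplitude:
            return "none"           # too little motion to classify
        return "yes" if vert > horiz else "no"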
4 Conclusion
5 Acknowledgements
References
Baron-Cohen, S. (1995), Mindblindness, MIT Press.
Brooks, R. & Stein, L. A. (1994), ‘Building Brains for Bodies’, Autonomous Robots
1:1, 7–25.
Brooks, R. A., Ferrell, C., Irie, R., Kemp, C. C., Marjanovic, M., Scassellati, B. &
Williamson, M. (1998), Alternative Essences of Intelligence, in ‘Proceedings of
the Fifteenth National Conference on Artificial Intelligence (AAAI-98)’, AAAI
Press.
Burghardt, G. M. & Greene, H. W. (1990), ‘Predator Simulation and Duration of Death
Feigning in Neonate Hognose Snakes’, Animal Behaviour 36(6), 1842–1843.
Butterworth, G. (1991), The Ontogeny and Phylogeny of Joint Visual Attention, in
A. Whiten, ed., ‘Natural Theories of Mind’, Blackwell.
Cheney, D. L. & Seyfarth, R. M. (1990), How Monkeys See the World, University of
Chicago Press.
Cohen, D. J. & Volkmar, F. R., eds (1997), Handbook of Autism and Pervasive Devel-
opmental Disorders, second edn, John Wiley & Sons, Inc.
Figures of Speech, a Way to Acquire Language
Anneli Kauppinen
Abstract. The main aim of this study is to discuss the assumption that analogy
and imitation may be crucial principles in the acquisition of language. The
manifestations of the acquisition process are called figures of speech in this
study. These memorized entities carry some elements of their former contexts.
Figures of speech may be identical (deferred) imitations, but parts, or the
outline, of a former utterance may also be repeated in them. A tendency of some
speech functions to be acquired as figures was found in this study. The findings
point to a construction-type grammar, in which pragmatics and semantics are an
integral part of the structure.
1 Introduction
The current data are based on longitudinal diaries and tape recordings of a Finnish
boy, Teemu, in everyday situations (T-data). This database is compared with
recordings of about twenty Finnish children made at the University of Oulu and the
University of Helsinki.
When going through the T-data, I found a tendency for some categories and speech
functions to be acquired by deferred imitation/repetition. This tendency was checked
against the other databases. Imitation is, however, not a fully adequate name for
this process, because the repeated chunks of speech may be variable. At the level of
syntax, some of them are analogically acquired frames with open slots, and they may
be coupled with other structures. In addition, there are many examples of analogy in
the acquisition of semantics.
Piaget’s theory of deferred imitation entails analogical thinking. For instance, in
order to imitate an opening and closing box with his or her hands, a child must have
realized the functional similarity between the two events. From another viewpoint,
these two courses of events have an iconic connection. A child’s utterance (speech,
movements, rhythm) is, from this perspective, an icon which stands for the
movements of the box. In T-data, functional similarity is an important ground for
early semantic inference.
In T.’s speech the word wov-wov stood, at age 1;3 to 1;7 (years;months), for
different animals from cows to birds, both toys and real ones. Such examples are
well known in the child language literature. The word ovi ‘door’ was also used for a
wooden cover. The verb avata ‘open’ was also used for ‘uncover’ and ‘peel’, as in
'please, open this orange'. At the age of 1;9 T. had one utterance, "piillon" (< piiloon
'to a hidden place'), for four kinds of activities. He said "piillon" when 1) putting a
paper roll into a stick-type stand, 2) putting a folder into its cover, 3) folding a paper
sheet with a drawing on it, and 4) crawling under an adult’s bent knees. One word for
all these activities indicates the child’s way of connecting these functionally
similar activities analogically.
Many adult prohibitions directed at the child are at first memorized as formulaic
utterances, at about two years of age. In language acquisition research there are
many examples of adults’ prohibitions switched to new contexts by children (Clark
1977, Painter 1984, Clancy 1985, Katis 1997). I call them compliance figures,
because children direct the imitated prohibitions towards themselves in private
speech, as Teemu does in (1):
(1) Teemu (2;2) takes an orange and a knife and says to himself:
Älä itte kuori, älä itte kuori.
'Don't peel by yourself, don't peel by yourself'
(2) 2;0.12
Mother: Älä ota mun kynnää
'Don't take my pen'
Teemu: äiti ikkee
'Mummy will cry'
In this example (2) the important trigger of the argument is the preceding turn.
The argument 'mummy will cry' is comparable to the compliance figures, because it
represents the adult's, not the child's own voice.
There are also other examples of this kind of acquired argument, used as delayed
imitations. Teemu's parents had earlier sometimes justified their prohibitions by
saying sitte iltapäivällä, '[not now but] later in the afternoon'. Between the ages
2;5.17 and 2;8.18 there are eight examples of different conversations including the
imitated argument sitte iltapäivällä by Teemu, as in example (3). There is also
another imitated argument in this conversation: sitte pannaan laastari, '[don't
worry] we'll put a band-aid on'.
(3) 2;6.7
Mother: Älä pelleile manteli suussa, tullee pipi.
'Don't fool around with an almond in your mouth, you'll hurt yourself.'
Teemu: Sitte pannaan laastari
'Then we'll put a band-aid on'
Mother: Ei se auta.
'That won't help.'
Teemu: Sitte iltapäivällä auttaa
'Later in the afternoon it will help'
There is a difference between the figures in examples (1) and (2) and the figure in
example (3). The arguments 'then we'll put a band-aid on' and 'later in the afternoon
it will help' in (3) are persuasive. They are imitated and they reflect the adult way
of argumentation, but the figures are put to work for the child's own intentions. It
seems that the child is testing the effect of this kind of argumentation. As we can
see, the figures in (3) do not appear in a fully appropriate context, but the
examples lend support to the notion that "delayed imitations" can be used
productively in conversation.
Definitions of everyday objects are important speech topics for children. By
describing the functions of the things around them, children prepare themselves for
future practical tasks. In T-data these definitions occur in certain repetitive forms:
relative clauses, figures with the modal verb voi 'can', and 'if - then' structures. The
child uses these formulas analogically, keeping the “frame” but varying the words.
Two examples of the formula (Pro)noun+ADE can verb+INF, 'It is possible to do
something with X' / 'One/You can do something with X', are presented in (4a) and
(4b). The adessive case has an instrumental function in them. The structure is generic,
without any subject. This kind of figure is a usual way to define things, and it has
therefore been acquired analogically from adult speech.
(4a) 2;4.18
Teemu: Mikä tää o-n?
What this be+3SG
'What is this?'
Adult: Kamera
camera
'[It is] a camera'
Table 1. The first variations of the formula saa(n)ko ottaa (‘may I take’) in T-data. The
only exceptions from adult conventionality in table 1 are phonological deviations (the
partitive forms "mevuja" and "meuja" for mehua, "ieluja" for viulua, and the verb “aako” for
saako).
saa-n-ko minä ottaa
may+1SG+INTRG I take (INF)
‘May I take?’
Permissions are important for a child in coping with everyday social situations.
Apparently for this reason, some utterances of will and permission are acquired as
figures of speech in T-data. One example of this acquisition process is represented in
table 1. It can be seen that the formula, or collocation, saako ottaa ‘may [I] take’
becomes little by little more flexible and variable. All of these syntactic structures are
conventional in adult language. The generic structure saako ottaa, without a 1SG
suffix or 1SG pronoun, mostly gets the inclusive interpretation 'may I take?' in
adult speech, too. The kernel of the formula (s)aako ottaa takes different object
complements, all of them in the partitive case (see table 1). Later, at the age of 2;3.5,
the first utterances with an explicit 1SG suffix and pronoun appear, but the infinitive
is dropped. The first person pronoun takes the place of the infinitive, and so the
outline and the rhythm of the utterance are preserved.
The conditional verb forms are acquired and memorized in figures of speech
(formulaic utterances) in T-data. There are in total about 670 occurrences of
utterances including conditional verb forms in the database. They can be grouped into
36 different formulas. These formulas are frequent also in the other Finnish databases
analyzed for this research. The main functions of the conditional utterances are
request, imagining, and planning (Kauppinen, forthcoming). The first occurrences
of the conditional verb forms appear in T-data at the age of 2;0. All the occurrences
during 2;0 - 2;1 represent one figure, the semi-idiomatic question 'What would PRON
be?' By the age of 2;4 there have been 35 occurrences of conditional verb forms, all
of which belong to 4 figures. The findings support the supposition that the child
does not acquire distinct verb forms but rather figures of speech that include
conditional verb forms. In other words, he acquires means to request, imagine and
plan. Each figure is a way to plan and handle everyday situations.
Compared to many Indo-European languages, the Finnish conditional verb form
can be said to combine the functions of subjunctive and conditional verbs (Kauppinen
1996). In most languages these two verb categories have specific contexts:
conditional verbs in the apodosis, and subjunctive forms, for example, in protasis,
final, and concessive clauses (Bybee et al. 1994; Fillmore et al. 1988). For this reason
the configuration of these verb categories is an essential feature of them, and it
therefore belongs to adult speech routines, too.
It is possible to see conditional sentences (with indicative verb forms, too) as a
figure of speech or a rhythmic pattern, as Ellis (1995) has put it. Conditional
sentences have their specific senses in different languages. It is not possible to learn
certain patterns of logical argumentation without learning the ‘if - then’ formula, a
kind of representation (see also Johnson-Laird 1983). Conditional structures also
have other special senses, e.g. threatening in many languages (If you don’t
eat, I’ll - - ). Many examples of child language suggest that this affective meaning is
acquired together with the conditional figure of speech.
The imaginative function of speech appears early. When Teemu, at the age of 1;4,
crawled on all fours and imitated a dog, he was a conscious pretender. When he, at
the age of two, calls a piece of wood a "gun" or puts a stick into his mouth and
pretends to smoke, the whole pretending pattern is an analogical representation of
earlier experiences. In pretend play the child moves to another mental space which he
knows to be different from reality. According to Vygotsky (1978), the essence of
pretend play is analogical, not symbolic. Pretend plays are planned structures, whose
parts include emplotment, enactment, and underscoring (Giffin 1984), as in the
pretend play by Aino, aged 5, and Eeva, aged 2;6:
(6a) Italian
Io sono il marito, e tu eri [IMPF] la mia moglie. (Bates 1976)
'I am the husband, and you were my wife.'
(6b) English
You were mother and she didn’t want you to go. (Lodge 1978)
(6c) Dutch
Ik was [IMPF] de vader en ik ging [IMPF] een diepe kuil graven. (Kaper 1980)
'I was the father and I was going to dig a deep pit.'
(6d) German
Dies ist ein Pferd und das wäre [KONJUNKTIV] der Stall. (Kaper 1980)
'This is a horse and that would be the stable.'
The examples (6a - d) indicate that there is a prototypical figure of speech specialized
for the emplotment of the pretend play in many languages. It has a specific space
builder function. In the studied languages the characteristic verb forms in the
emplotment of the pretend play seem to be subjunctive/conjunctive in nature.
(7a)
'I was/were mother "Child, come to eat." (C. is eating:) "Yum, yum." '
(7b)
'If I were mother, I would offer delicious food for the child.'
An essential difference between the structures is that in (7a) the built mental space is
realized immediately as action, turn taking, and underscoring. In the conditional
complex sentence (7b), by contrast, there is a distance between the space builder and
the action, and the indication of this distance is the connective 'if' (see Werth 1997).
The two structures (7a) and (7b) also carry different contextual and pragmatic
senses. This is an important point in language acquisition. The logic of pretend
play is not conditional logic; it is grounded in open possibilities and negotiation.
Children also prefer pretense-type structures because of the possibility of immediate
action (see Kauppinen 1996). Action precedes speech in human ontogeny as well.
Conditional complex sentences are common measures of children's inference
ability in language acquisition research. It is typically assumed that the ability of
logical inference and the language structures children use are directly connected
(e.g. Bowerman 1986). The theory of figures of speech, instead, takes into account
the child's need and will to use an utterance because of its sense in the social context
in question. On the basis of this view, it is possible to explain the "exceptions" to the
assumed acquisition order of language structures. It also clarifies why children, such
as Teemu, may favour pretending figures over conditional complex sentences.
8 Conclusions
Abbreviations
ADE adessive 'at', 'on', 'with' (instrumental)
CON conditional affix
INE inessive 'in'
INF infinitive
ILL illative 'into'
IMP imperative
IMPF imperfect verb form
INTRG interrogative morpheme
PL plural
PRON pronoun
PRT partitive case
SG singular
References
1. Bakhtin, M. M.: The Dialogic Imagination: Four Essays by M. M. Bakhtin. Edited by
Michael Holquist. Translated by Caryl Emerson and Michael Holquist. University of
Texas Press, Austin (1990).
2. Bates, E.: Language and Context: The Acquisition of Pragmatics. Academic Press New
York (1976).
3. Bowerman, M.: First steps in acquiring conditionals. In: Traugott, E.C., ter Meulen, A.,
Reilly, J. S., Ferguson, C. A. (eds.): On Conditionals. Cambridge University Press,
Cambridge (1986).
4. Bybee, J., Perkins R., Pagliuca W.: The Evolution of Grammar. Tense, Aspect, and
Modality in the Languages of the World. The University of Chicago Press Chicago
(1994).
5. Clancy, P. M.: The Acquisition of Japanese. In: Slobin, D. I. (ed.): The Crosslinguistic
Study of Language Acquisition. Volume 1: The Data. Lawrence Erlbaum Associates
Hillsdale, New Jersey (1985).
6. Clark, R.: What’s the use of imitation? Journal of Child Language 4 (1977) 341-358.
7. Ellis, R. D.: The imagist approach to inferential thought patterns: The crucial role of
rhythm pattern recognition. Pragmatics & Cognition. Vol. 3, No. 1, 75-109 (1995).
8. Fauconnier, G.: Mental Spaces: Aspects of Meaning Construction in Natural Language. A
Bradford Book Cambridge (1985).
9. Fillmore, C. J., Kay, P., O’Connor M. C.: Regularity and Idiomaticity in grammatical
constructions: The case of “Let alone”. Language. Volume 64 (3) 501-538 (1988).
10. Giffin, H.: The Coordination of meaning in the Creation of a Shared Make-Believe
Reality. In: Bretherton, I. (ed.) Symbolic Play. The Development of Social
Understanding. Academic Press, INC. Orlando. 73—100 (1984).
11. Hopper, P.: Emergent Grammar. In: Proceedings of the Thirteenth Annual Meeting,
Berkeley Linguistics Society (1987).
12. Johnson-Laird, P. N.: Mental Models. Towards a Cognitive Science of Language,
Inference, and Consciousness. Cambridge University Press Cambridge (1983).
13. Kaper, W.: The use of the past tense in games of pretending. Journal of Child Language
7. 213—215 (1980).
14. Katis, D. :The emergence of conditionals in child language: Are they really so late? In:
Athanasiadou, A., Dirven R. (eds.): On Conditionals Again. Current Issues in Linguistic
Theory 143 John Benjamins Amsterdam/Philadelphia (1997).
15. Kauppinen, A.: The Italian imperfetto compared to the Finnish conditional verb form:
evidence from child language. Journal of Pragmatics 26 (1996) 109-136.
16. Kauppinen, A.: Puhekuviot, tilanteen ja rakenteen liitto. [Figures of speech, a union of
situation and structure.] Suomalaisen Kirjallisuuden Seura Helsinki (1998).
17. Kauppinen, A.: Acquisition of the Finnish conditional verb forms in formulaic utterances.
In: Hiraga, M., Sinha, C., Wilcox S.(eds.): Cognitive Linguistics 95, Vol. 3: Cultural,
Psychological, and Typological approaches. John Benjamins. (forthcoming).
18. Kay, P., Fillmore, C. J.: Grammatical Constructions and Linguistic Generalizations: the
`What's X doing Y?’ Construction , <http://www.icsi.berkeley.edu/~fillmore/concon.html>
1997 (Read 27th March 97).
19. Lerner, G. H.: On the syntax of sentences-in-progress. Language in Society 20 (1991)
441-458.
20. Locke, J. L.: Development of the Capacity for Spoken Language. In: Fletcher, P.,
MacWhinney, B. (eds.): The Handbook of Child Language. Basil Blackwell Ltd.,
Cambridge (1995).
21. Lodge, K. R.: The use of the past tense in games of pretend. Journal of Child Language 6
(1978).
208 Anneli Kauppinen
22. Musatti, T., Orsolini, M.: Uses of past forms in the social pretend play of Italian children.
Journal of Child Language 20 (1993) 619-639.
23. Painter, C.: Into the Mother Tongue: A Case Study in Early Language Development.
Frances Pinter London (1984).
24. Peters, A. M.: The Units of Language Acquisition. Cambridge University Press,
Cambridge (1983).
25. Tannen, D.: Talking Voices. Repetition, Dialogue, and Imagery in Conversational
Discourse. Studies in Interactional Sociolinguistics 6. Cambridge University Press
Cambridge (1989)
26. Turner, M.: Figure. Manuscript. Forthcoming in: Cacciari, C., Gibbs, R., Katz, A.,
Turner, M.: Figurative Language and Thought. Oxford University Press (1997).
27. Voloshinov, V. N.: Marxism and the Philosophy of Language. Translated by Ladislav
Matejka and I. R. Titunik. Seminar Press, New York (1973). Originally published in
Russian as Marksizm i filosofija jazyka, Leningrad (1929).
28. Vygotsky, L.S.: Mind in Society: The Development of Higher Psychological Processes.
Edited by Michael Cole, Vera John-Steiner, Sylvia Scribner, Ellen Souberman. Harvard
University Press, Cambridge (1978).
29. Werth, P.: Conditionality as Cognitive Distance. In: Athanasiadou, A., Dirven, R. (eds.):
On Conditionals Again. Current Issues in Linguistic Theory 143. John Benjamins
Amsterdam/Philadelphia (1997).
30. Wertsch, J. V.: The Semiotic Mediation of Mental Life: L. S. Vygotsky and M. M.
Bakhtin. In: Mertz, E., Parmentier., R. J. (eds.): Semiotic Mediation. Sociocultural and
Psychological Perspectives. pp. 49—71. (1985).
31. Wong Fillmore, L.: Individual differences in second language acquisition. In: Fillmore, C.
J., Kempler, D., Wang, W. S-Y. (eds.): Individual Differences in Language Ability and
Language Behaviour. Academic Press New York (1979).
“Meaning” through Clustering by
Self-Organisation of Spatial and Temporal
Information
Ulrich Nehmzow
1 Introduction
1.1 Motivation: Transfer of Redundant Sensory Perceptions to
Meaningful Localisation Information
The ability to move towards specific, identified places within the environment is
one of the most important competences for a mobile robot.
While tasks that require only random exploration or the following of canonical
paths (such as paths marked by induction loops, beacons, or other markers)
can be achieved by low-level behaviour-based control (see e.g. [1]), more complex
navigation tasks for mobile robots (such as delivery tasks) require
the robot to determine its current position, the goal position, and the required
motion between the two.
Apart from ‘staying operational’ (involving obstacle avoidance and staying
within the machine’s operational limits) navigation requires 1) the ability to
construct a mapping between observed features and an internal representation
(“mapbuilding”), and 2) the interpretation of this mapping (“map interpreta-
tion”).
Following Webster’s definition of “analogy” as a “resemblance in some partic-
ulars between things otherwise unlike” [14], such mapbuilding is the construction
of an analogy between the real world and a robot’s perception of it.
In order to use its representation of the world at all, the robot must know
where it is in representation space: localisation is the most fundamental compo-
nent of interpreting mappings. Unless the robot can identify its current position
on the map, no path planning and hence no navigation can be performed.
The mapping used here is similar in many ways to hippocampal mappings found
in rats. In particular, place cells1 in the rat’s hippocampus can be likened to
activity patterns observed in the self-organising feature maps used here. There
have been a number of implementations of robot navigation systems that simu-
late such place cells, notably the work of Burgess, Recce and O’Keefe [2,3], but
also of others [8,9].
Self-organising mechanisms for static sensor signal clustering have been used
for robot localisation before [9,11,6,16]: the current sensory perception of a mobile
robot is clustered through an unsupervised, self-organising artificial neural
network, and the network’s excitation pattern is then taken to indicate the
robot’s current position in perceptual space. If no perceptual aliasing (ambiguous
sensory perceptions) were present, this would also identify the robot’s
position in the world unambiguously. In contrast to the work discussed in this paper,
however, no information about perception over time was encoded in these cases.
Regarding the use of episodic information as input to a self-organising struc-
ture, some work has been done using such information in the input to a single
layer Kohonen network [10]. The work discussed here differs from that approach,
in that here we use a second Kohonen network that clusters the already clus-
tered sensory information encoded in the first layer network, rather than using
sequences of raw sensory data.
There is also related work in the area of assessment of robot performance [12].
Notably, the work of Lee and Recce is relevant to the experiments reported here.
In their case [7], however, mapbuilding performance was measured against a
hand-crafted “optimal” performance (i.e. an absolute comparison). In contrast
to this, we perform quantitative comparisons between two algorithms performing
under identical conditions (i.e. comparing against a relative standard).
¹ Recordings from single units in and around the rat hippocampus show strong
correlation between a freely moving animal’s location and cell firing [2]. Certain
cells only fire when the rat is in a restricted portion of its environment.
We contend that any navigation system that is to be used on a real robot, beyond
the immediate vicinity of a “home location”, and over extended periods of
time, has to be anchored in exteroception², for the following reasons. Proprioceptive
systems are subject to uncorrectable drift errors, which means that the
anchor points of such navigation systems will change over time and introduce
navigation errors that are not correctable through proprioception alone. Only
through calibration using exteroception can this error be removed (a recent
example of such a system is presented in [15]). Drift errors are inherent, not
an engineering problem: more precise wheel encoders will simply mean that
the proprioception-based navigation system functions correctly for a longer
period of time. Eventually, however, it will fail, which is our main reason for
investigating landmark-based robot self-localisation.
2.1 Introduction
There are two main shortcomings of any episodic mapping mechanism. Firstly,
it is dependent upon robot motion along a fixed path (or a few fixed paths),
because a unique and repeatable sequence of perceptions is required to identify
a location. Secondly, localisation is affected by “freak perceptions”³ for a much
longer time than in a navigation system based on the current perception only,
because any erroneous (freak) perception is retained for n timesteps, where n
is the number of past perceptions used for localisation. Such freak perceptions
do not normally occur in computer simulations, but they occur frequently when
a real robot interacts with the real world, because of sensor properties (e.g.
specular reflection of sonar signals), sensor noise, or electronic noise.
The episodic mapping algorithm proposed here specifically addresses this
question of how to cope with freak perceptions when using an episodic mapping
mechanism.
² Sensory stimuli impinging on the robot from the outside, as opposed to
proprioception (using internal sensory stimuli).
³ Spurious sensory perceptions caused by intermittent processes such as specular
reflections or sensor crosstalk.
To cluster incoming sensory information, in both the static and the episodic
mapbuilding paradigm, a self-organising feature map ([5]) was used.
The self-organising feature map (SOFM) is an example of an artificial neural
network that performs a topological clustering of its input data using an unsu-
pervised learning mechanism. The network consists of one layer of cells typically
arranged as a two dimensional grid. Figure 1 shows the basic structure of such
a network.
Fig. 1. Structure of the SOFM
$o_j = \sum_{k=1}^{n} w_{jk}\, i_k = \vec{w}_j \cdot \vec{\imath},$   (1)
with n being the number of elements in the input vector and in the weight vectors.
The initial state of the network uses randomised values for the weights. Therefore,
when a stimulus is presented to the network, one cell of the network will
respond more strongly than the others to that particular input vector (see
equation 1).
The weight vector of this “winning” unit, as well as those of the eight
neighbouring units, is then changed according to equation 2:
$\vec{w}_j(t+1) = \vec{w}_j(t) + \alpha \left( \vec{\imath}(t) - \vec{w}_j(t) \right),$   (2)
where α is the learning rate. Typical values for this parameter are in the range
0.2 - 0.5. A value of α = 0.25 (constant over time) was used in the experiments
presented here. Weight vectors are normalised again after being adjusted.
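A compact sketch of one training step, combining equations 1 and 2 with the
renormalisation just described (the toroidal treatment of the grid edge is an
assumption, as are all names):

    import numpy as np

    def sofm_step(weights, stimulus, alpha=0.25):
        # One training step for the m x m SOFM described above. weights has
        # shape (m, m, n); stimulus has shape (n,) and is assumed normalised.
        m = weights.shape[0]
        # Equation (1): the response of every cell is the dot product w_j . i.
        response = np.tensordot(weights, stimulus, axes=([2], [0]))
        wr, wc = np.unravel_index(np.argmax(response), response.shape)
        # Equation (2): move the winner and its eight neighbours toward the
        # stimulus, then renormalise each adjusted weight vector.
        for r in range(wr - 1, wr + 2):
            for c in range(wc - 1, wc + 2):
                w = weights[r % m, c % m]
                w += alpha * (stimulus - w)
                w /= np.linalg.norm(w)
        return wr, wc   # centre of excitation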
As this process continues the network organises into a state whereby dis-
similar input vectors/patterns map onto different regions of the network, whilst
similar patterns are clustered together in groups: a topological map of the input
space develops.
When the network has settled, distinct physical locations will map onto distinct
regions of the network⁴, whilst similar perceptual patterns cluster together
in a region. To achieve this, no symbolic representations have been created, and
the robot is mapping its environment “as it sees it”.
In this way, regions of the network can be seen as representing ‘perceptual
landmarks’ within the robot’s environment, and map response can then be used
for localisation.
Fig. 2. The static mapping mechanism: the SOFM clusters the current sensory
perception and thus generates the static mapping.
The episodic mapping paradigm uses two layers of self-organising feature maps
(see figure 3). Layer one is the layer described in subsection 2.4.
Layer two is also a two-dimensional SOFM, of k x k units (k=9 or k=12 in
our experiments); it is trained using an input vector of m²-element length. All
elements of this vector are set to zero, apart from the last τ centres of excitation
of layer one, which are set to “1”. The value of the (“history”) parameter τ was
varied in our experiments.
Fig. 3. The episodic mapbuilding mechanism: the first layer SOFM clusters the
current sensory perception, the second layer SOFM clusters the last τ perceptions
and thus generates the episodic mapping.
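Constructing the layer-two input vector as described can be sketched as follows;
the names and the rolling-window bookkeeping are illustrative:

    import numpy as np
    from collections import deque

    def episodic_input(history, m):
        # Input vector for the layer-two SOFM: m*m zeros, with a "1" at the
        # grid index of each of the last tau layer-one centres of excitation.
        vec = np.zeros(m * m)
        for (r, c) in history:
            vec[r * m + c] = 1.0
        return vec

    # Keeping the last tau winners (tau was varied in the experiments):
    tau, m = 4, 12
    history = deque(maxlen=tau)   # append the (row, col) winner per step
    history.append((3, 7))        # illustrative winner from layer one
    layer2_input = episodic_input(history, m)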
per bin, but has the disadvantage that spatial resolution is no longer uniform over
physical space. For this reason we adopted a spatially uniform binning method.
Entropy-based measures were then used to determine the strength of associ-
ation between map response and robot location. The entropy (or average infor-
mation) provides a measure of the probability that a particular signal is received,
given particular contextual information. For example, if the system’s response to
a particular stimulus is known with absolute certainty (probability 1), then the
entropy (i.e. average information) of having perceived that particular stimulus
is obviously 0.
For localisation, entropy can serve as a quality metric in the following way: if
any response R of the localisation system corresponds with exactly one location L
in the physical world, then the entropy H(L|R) is zero for that case (the “perfect”
map). The larger H(L|R), the larger the uncertainty that the robot is at a
particular location, given some system response R.
H(L|R) is defined as follows [13]:
$H(L|R) = - \sum_{l,r} p_{l,r} \ln \frac{p_{l,r}}{p_{l\cdot}},$   (3)
with
$p_{l,r} = \frac{N_{l,r}}{N}$   (4)
and
$p_{l\cdot} = \frac{N_{l\cdot}}{N}.$   (5)
N is the total number of events recorded in the contingency table, $N_{l,r}$ is
the number of occurrences of response r at location l, and $N_{l\cdot}$ is the number
of occurrences of any response at location l.
H(L|R) can therefore be used as a metric to determine the suitability of the
obtained mapping for localisation. If H(L|R) is zero, perfect localisation can be
achieved, i.e. a particular system response R will indicate with absolute certainty
where the robot is in the world. If H(L|R) is non-zero, some ambiguity regarding
the robot’s current location exists; the larger H(L|R), the larger the ambiguity.
This measure allows quantitative comparison of two or more mapping
paradigms under identical experimental circumstances. In particular, bin sizes
must be identical for all experiments. In other words, the metric allows comparison
between mapping systems, but does not provide an absolute, experiment-independent
standard. This quality metric is a useful measure for the experiments presented in
this paper, because the fundamental question asked is which of two mapping
paradigms performs better, under identical experimental conditions.
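Computed from a contingency table of counts, the metric of equations 3-5 reduces
to a few lines; a sketch (the array layout is an assumption):

    import numpy as np

    def localisation_entropy(counts):
        # counts[l, r] = N_{l,r}: occurrences of map response r at location l.
        # Returns H(L|R) per equations (3)-(5); zero for a "perfect" map.
        counts = np.asarray(counts, dtype=float)
        N = counts.sum()
        p_lr = counts / N                             # equation (4)
        p_l = counts.sum(axis=1, keepdims=True) / N   # equation (5)
        mask = p_lr > 0                               # convention: 0 ln 0 = 0
        ratio = p_lr[mask] / np.broadcast_to(p_l, p_lr.shape)[mask]
        return -np.sum(p_lr[mask] * np.log(ratio))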
In all experiments reported here, we used the quality metric defined in subsection
3.1 to determine the quality of a mapping: the lower the entropy H(L|R),
the higher the map quality. A ‘perfect’ map has an entropy H(L|R) of zero.
In all our experiments, the “static mapping”, using a single-layer self-organising
feature map (see subsection 2.4), and the “episodic mapping”, using
a twin-layer self-organising feature map (see subsection 2.5), are directly
compared. Of the former we know that it provides a feasible method for mobile
robot localisation [9,11,10]. The question is: does the episodic mapping paradigm
produce better maps, with respect to the criterion discussed in section 3.1?
4 Experiments
Fig. 4. The Manchester Nomad 200 mobile robot, “FortyTwo”. The robot has
sixteen sonar sensors, sixteen infrared sensors, camera, compass, tactile sensors
and onboard odometry sensors.
Furthermore, experimental parameters such as network size and bin size were
varied, to determine their influence upon localisation performance.
Experimental procedure. Here, the robot was manually driven along a (more
or less) fixed path in an environment containing brick walls, cloth screens and
cardboard boxes. The whole route was traversed six times, and 366 data points
in total were obtained, containing the readings of the robot’s 16 infrared sensors
and its location in (x,y) coordinates (see figure 5).
Of the 366 data points, 120 were used for the initial training of the networks⁷,
i.e. the mapbuilding phase, and the remaining 246 data points were used for the
evaluation of the localisation ability.
[Fig. 5: the robot’s path in environment 1; locations were binned into regions A-F.]
This is essentially identical to the static case, with the difference that in the
static case the sixteen sonar readings provide more information than the one
excitation centre of layer 1 that is used as input to layer 2 in episodic mapping.
The expected result, therefore, is that episodic mapping with τ = 1 always
produces slightly worse results than static mapping, as is indeed the outcome
in all experiments bar experiment 4, where both methods produce very similar
results.
The conclusion to draw from this experiment, then, is that episodic mapping
produces better localisation performance than static mapping, up to a certain
maximum value of τ , indicating that too much episodic information is confusing,
rather than helpful.
In the second experiment, the spatial resolution (i.e. the localisation precision)
was reduced to 12 bins. As would be expected, localisation performance
improved (because there is less opportunity for error). The difference to experiment
1, however, was small (figure 8), and essentially the findings of experiment 2
confirm those of experiment 1.
In the final experiment conducted in environment 1, we used smaller networks
and reduced the bin size further. The results of this experiment are
shown in figure 9; a discussion of results follows in section 4.4. The contingency
table for this experiment is shown in table 1 in appendix A. Again, the earlier
findings are confirmed. In this case, episodic mapping produces better localisation
performance than static mapping in all cases, regardless of the value of τ.
The open space was traversed nine times, and 456 data points in total were
obtained by manually driving the robot. 160 of these data points were used for
training the networks⁸; the remaining 296 data points were used to evaluate
localisation performance.
Environment 2 was less structured than environment 1, in that it contained
a larger variety of perceptually distinct objects, and more clutter. It was
also bigger, and the robot’s path through it was longer than in environment 1.
Figure 10 shows the robot’s path through this environment, and the robot’s
perception of it.
Fig. 10. Robot trajectory in environment 2 (left) and accumulated infrared sen-
sor readings obtained by the robot in environment 2 (right, environment 2 “as
the robot sees it”). Dimensions are in units of 2.5mm.
[Figure residue: environment 2, divided into the nine location bins A–I.]
The optimum value of τ depends on bin size, but lies between 3 and 5 in most
cases. Note that the optimum value can be determined in real time by the robot
itself, as the contingency table and H(L|R) are available to the robot. The
choice of network size and bin size appears to be non-critical: in all cases
episodic mapping can outperform static mapping.
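A hedged sketch of this online selection (the function episodic_map_entropy is
an assumption standing in for rebuilding the map and computing H(L|R); it is
not code from the paper):

    def best_tau(episodic_map_entropy, tau_candidates=range(1, 8)):
        """Return the episode length tau with the lowest H(L|R);
        episodic_map_entropy(tau) is assumed to rebuild the episodic map
        with that tau and return its conditional entropy."""
        return min(tau_candidates, key=episodic_map_entropy)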
The performance metric introduced in section 3.1 can be applied to any
map-building system that generates categorical data, and therefore provides a
tool for comparing different paradigms, as well as for determining the
influence of any process parameters, independently of the actual paradigm used.
This visual analysis of the contingency tables therefore confirms the findings
of subsection 4.4.
[Fig. 15 residue: two contingency-table plots of map response against physical
location, one for episodic mapping (experiment 3) and one for static mapping
(experiment 3).]
Fig. 15. Assessment of the suitability of contingency tables for robot
self-localisation. Comparing the tables for static and episodic mapping
demonstrates that episodic mapping provides a clearer correlation between
physical location and map response. (The table values are shown in appendix A.)
References
1. Rodney A. Brooks, A Robust Layered Control System for a Mobile Robot, MIT AI
Memo No. 864, September 1985.
2. N. Burgess and J. O’Keefe, Neuronal computations underlying the firing of
place cells and their role in navigation, Hippocampus 7:749-762 (1996).
3. N. Burgess, J. O’Keefe and M. Recce, Using hippocampal ‘place cells’ for
navigation, exploiting phase coding, in Hanson, Giles and Cowan (eds.),
Advances in Neural Information Processing Systems 5, Morgan Kaufmann 1993.
4. Tom Duckett and Ulrich Nehmzow, Mobile Robot Self-Localization and
Measurement of Performance in Middle-Scale Environments, J. Robotics and
Autonomous Systems, Vol. 24, Nos. 1-2, 1998.
5. Teuvo Kohonen, Self-Organization and Associative Memory, Springer-Verlag,
Berlin, Heidelberg, New York, 2nd edition, 1988.
6. Andreas Kurz, Constructing maps for mobile robot navigation based on
ultrasonic range data, IEEE Trans. Systems, Man and Cybernetics B, Vol. 26,
No. 2, 1996.
7. David Charles Lee, The map-building and exploration strategies of a simple
sonar-equipped mobile robot; an experimental quantitative evaluation, PhD
thesis, University College London, 1995.
8. Maja Mataric, Navigating with a Rat Brain: A Neurobiologically-Inspired
Model for Robot Spatial Representation, in Jean-Arcady Meyer and Stuart Wilson
(eds.), From Animals to Animats, MIT Press 1991.
A Contingency Tables
Table 1. Contingency table for experiment 3, episodic mapping. Rows are map
responses; columns are physical locations. Percentages of localisation given
response are shown in brackets.

              Physical Location
       A         B        C         D         E         F
  1   18 (67%)   0 (0%)   6 (33%)   0 (0%)    0 (0%)    0 (0%)
  2    0 (0%)    0 (0%)   3 (33%)   0 (0%)    6 (67%)   0 (0%)
  3    2 (9%)    0 (0%)   8 (35%)   0 (0%)    0 (0%)   13 (57%)
  4    1 (3%)    0 (0%)   7 (24%)   0 (0%)    8 (28%)  13 (45%)
  5    1 (2%)    1 (2%)   0 (0%)    9 (18%)  12 (24%)  27 (54%)
  6   19 (43%)   4 (9%)  13 (30%)   3 (7%)    5 (11%)   0 (0%)
  7    0 (0%)    0 (0%)   1 (25%)   3 (75%)   0 (0%)    0 (0%)
  8    0 (0%)    0 (0%)   0 (0%)   11 (79%)   0 (0%)    3 (21%)
  9   16 (55%)   1 (3%)   0 (0%)    9 (31%)   0 (0%)    3 (10%)
              Physical Location
       A         B        C         D         E         F
  1   14 (44%)   0 (0%)   4 (13%)   6 (19%)   3 (9%)    5 (16%)
  2   11 (50%)   2 (9%)   1 (5%)    3 (14%)   2 (9%)    3 (14%)
  3   17 (53%)   2 (6%)   4 (13%)   5 (16%)   1 (3%)    3 (9%)
  4    3 (11%)   0 (0%)   8 (30%)   4 (15%)   5 (19%)   7 (26%)
  5    8 (30%)   1 (4%)   3 (11%)   6 (22%)   2 (7%)    7 (26%)
  6    5 (21%)   1 (4%)   4 (17%)   4 (17%)   4 (17%)   6 (25%)
  7    2 (7%)    0 (0%)   6 (21%)   3 (11%)   3 (11%)  14 (50%)
  8    1 (4%)    0 (0%)   3 (11%)  14 (52%)   6 (22%)   3 (11%)
  9    2 (8%)    0 (0%)   6 (23%)   1 (4%)    5 (19%)  12 (46%)

Table 2. Contingency table for experiment 3, static mapping. The six situations
in which first and second candidate location (given a particular map response)
differ by more than 150% are shown in heavy type. Percentages of localisation
given response are shown in brackets. See also table 1.
              Physical Location
       A        B        C       D        E        F        G       H        I
  1    0 (0)    0 (0)    0 (0)   0 (0)    0 (0)    0 (0)    0 (0)   3 (25)   9 (75)
  2    0 (0)    0 (0)    0 (0)   0 (0)    0 (0)    9 (64)   0 (0)   0 (0)    5 (36)
  3    0 (0)    7 (39)   0 (0)   0 (0)    5 (28)   3 (17)   0 (0)   2 (11)   1 (6)
  4    3 (16)   2 (11)   0 (0)   2 (11)   5 (26)   5 (26)   0 (0)   1 (5)    1 (5)
  5    2 (67)   1 (33)   0 (0)   0 (0)    0 (0)    0 (0)    0 (0)   0 (0)    0 (0)
  6    0 (0)    5 (22)   0 (0)   3 (13)  10 (44)   3 (13)   0 (0)   2 (9)    0 (0)
  7    0 (0)    3 (100)  0 (0)   0 (0)    0 (0)    0 (0)    0 (0)   0 (0)    0 (0)
  8    5 (50)   5 (50)   0 (0)   0 (0)    0 (0)    0 (0)    0 (0)   0 (0)    0 (0)
  9   23 (58)   2 (5)    0 (0)  13 (33)   1 (3)    1 (3)    0 (0)   0 (0)    0 (0)
 10    3 (14)   8 (38)   0 (0)   0 (0)    0 (0)    6 (29)   0 (0)   0 (0)    4 (19)
 11    2 (13)   0 (0)    0 (0)   0 (0)    4 (27)   8 (50)   0 (0)   0 (0)    0 (0)
 12    0 (0)    7 (54)   0 (0)   0 (0)    1 (8)    5 (39)   0 (0)   0 (0)    0 (0)
 13    0 (0)    0 (0)    0 (0)   1 (5)    8 (40)   3 (15)   0 (0)   4 (20)   4 (20)
 14    0 (0)    0 (0)    0 (0)   0 (0)    0 (0)    5 (57)   0 (0)   2 (22)   2 (22)
 15    6 (29)   0 (0)    0 (0)   4 (19)   3 (14)   1 (5)    0 (0)   0 (0)    7 (33)
 16    4 (11)   0 (0)    0 (0)   1 (3)    6 (17)   0 (0)    0 (0)   7 (19)  18 (50)
              Physical Location
       A        B        C       D        E        F        G       H        I
  1   10 (27)   3 (8)    0 (0)   4 (11)   3 (8)    2 (5)    0 (0)   4 (11)  11 (30)
  2    4 (20)   1 (5)    0 (0)   0 (0)    7 (35)   4 (20)   0 (0)   1 (5)    2 (10)
  3    8 (35)   2 (9)    0 (0)   0 (0)    2 (9)    7 (30)   0 (0)   0 (0)    4 (17)
  4    0 (0)    4 (16)   0 (0)   3 (12)   5 (20)   6 (24)   0 (0)   4 (16)   3 (12)
  5    1 (3)    0 (0)    0 (0)   3 (9)   10 (30)   1 (3)    0 (0)   3 (9)   15 (46)
  6    2 (13)   2 (13)   0 (0)   0 (0)    1 (7)    6 (40)   0 (0)   0 (0)    4 (27)
  7    2 (25)   3 (38)   0 (0)   0 (0)    0 (0)    0 (0)    0 (0)   2 (25)   1 (13)
  8    1 (9)    2 (18)   0 (0)   0 (0)    2 (18)   0 (0)    0 (0)   3 (27)   3 (27)
  9   10 (56)   2 (11)   0 (0)   2 (11)   1 (6)    0 (0)    0 (0)   2 (11)   1 (6)
 10    1 (10)   6 (60)   0 (0)   0 (0)    0 (0)    2 (20)   0 (0)   0 (0)    1 (10)
 11    2 (25)   5 (63)   0 (0)   0 (0)    1 (13)   0 (0)    0 (0)   0 (0)    0 (0)
 12    2 (50)   1 (25)   0 (0)   0 (0)    0 (0)    0 (0)    0 (0)   0 (0)    1 (25)
 13    3 (12)   0 (0)    0 (0)  10 (39)   4 (15)   8 (31)   0 (0)   1 (4)    0 (0)
 14    0 (0)    8 (42)   0 (0)   0 (0)    0 (0)    8 (42)   0 (0)   0 (0)    3 (16)
 15    6 (32)   5 (26)   0 (0)   0 (0)    4 (21)   3 (16)   0 (0)   0 (0)    1 (5)
 16    5 (25)   4 (20)   0 (0)   2 (10)   4 (20)   3 (15)   0 (0)   1 (5)    1 (5)
Conceptual Mappings from Spatial Motion to Time
Kazuko Shinohara

Abstract. In the metaphorical mapping from spatial motion to time, the path
schema is preserved but other source domain structures are constrained by the
target domain structure. There are at least four constraints: (1) the Front-Back
Constraint, (2) the Straight Path Constraint, (3) Restriction on Manner
Information, and (4) Exclusion of Cause, Circumstance, and Resultant State.
This is demonstrated by analyzing English and Japanese motion-time
metaphors. Thus it is shown that target domain structures play an important
role in determining the elements of information preserved in metaphorical
mappings which are regarded as “unidirectional.”
1 Framework
This study adopts the cognitive semantic theory of metaphor, originated and
developed by Lakoff and Johnson. In this theory, metaphor is defined as conceptual
mapping from the source domain to the target domain, and the image-schematic
structure of the source domain is said to be preserved in metaphorical mappings, as
seen in the following descriptions.
The Invariance Principle implies that the target domain structures are not totally
constructed by the mapping of the source domain structures, but they have their own
inherent structures which can restrict the mapping itself. The image schema of the
source domain, however, is said to be mapped to the target domain, therefore there
should be some kind of unidirectional mapping from the source domain to the target
domain.
The concept of image schema is taken from Johnson [1]. Since Johnson does not
give a short definition, his descriptions are combined into the following
working definition of my own.
It is presupposed that the Path Schema (an image schema which consists of a
source, a goal, and a sequence of contiguous locations connecting the source
and the goal) is the one preserved in the TIME AS MOTION metaphor. The Path
Schema includes nothing other than these elements, and no information about
lexicalization (whether each element of the concept is lexically expressed or
not).
The structure of the source domain (spatial motion) is analyzed in terms of Talmy’s
(1985) Motion Event Frame, and the constraints on the mapping from spatial motion
to time are specified in relation to this frame.
These partial mappings are seen in the mappings of specific information concerning
the elements of the Motion Event Frame (Talmy (1985) with a slight revision of my
own). The Motion Event Frame consists of the following elements.
When the source domain structures other than the elements of the Path Schema
(source, goal, and contiguous locations connecting the source and the goal) are
examined, some of them are found to be preserved in the target domain, while others
are not preserved. These partial mappings are analyzed in this study as the following
four constraints.
Spatial orientation of motion (one aspect of the Path of motion in the central elements
of the Motion Event Frame) is one of the extra-image-schematic structures, since the
Path Schema includes no information about it. The spatial orientations which can be
mapped to time are basically restricted to front and back. Other spatial orientations
such as up-down, right-left, north-south, and others are rejected in motion-time
mappings, except in some idiomatic expressions using up-down orientation. This
constraint is found both in English and Japanese, as seen in the following examples
(an asterisk indicates an incorrect or inappropriate use).
(e.g. 1) a. John died ten days before [after / *to the right of / *to the left of /
*to the south of / *above / *below] his wedding.
The spatial orientation (front-back) and the temporal orientation (future-past) are
mapped in terms of two reference points: the observer and the time. There are four
logically possible patterns of Future/Past assignment to the Front-Back slots for the
two reference points.
[Table residue: the four logically possible patterns (a)-(d) of Future/Past
assignment to the Front and Back slots of the two reference points, Observer
and Time; the individual row entries were not recovered.]
Apparently contradictory expressions like ‘We are looking forward to the following
weeks’ or ‘San nen mae o furikaeru (three years front ACC look-back)’ can be
explained in terms of these dual reference points and parameters of assignment of
orientation. (There can be other languages which select (b), (c), or (d). Malagasy is a
candidate for (b).)
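The four patterns themselves can be enumerated mechanically; a small
illustrative sketch (ours, not from the paper), in Python:

    from itertools import product

    # Each reference point (Observer, Time) assigns Future or Past to its
    # Front slot; the Back slot receives the opposite value, giving the
    # four logically possible patterns (a)-(d).
    for label, (obs_front, time_front) in zip("abcd",
            product(["Future", "Past"], repeat=2)):
        print(f"({label}) Observer: Front={obs_front}; "
              f"Time: Front={time_front}")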
The shape of the path of motion is also an extra-image-schematic structure. The use
of nonstraight paths is restricted to a considerable extent both in English and Japanese,
though this is not an absolute constraint. Cyclic time is possible in both languages, but
the application of cyclic (nonstraight) paths is not free. It seems that the cyclic path is
available only when some repetitious experience is involved.
(e.g. 5) a. Leap year [*3:17 PM / *the end of the world] came around.
b. Uruudoshi [*gogo 3-ji 17-fun / *sekai no owari] ga megutte kita.
(=(5a))
English: flow, fly, crawl, creep, dash, hurry, march, run, rush, sneak, roll,
slide, slip, glide
Japanese: nagareru (flow), ?hashiri-saru (run-leave), tobi-saru (fly-leave),
nagare-saru (flow-leave), kake-nukeru (run through), shinobi-yoru
(sneak-approach)
These verbs imply at least one of the aspects (a) saliently high or low speed,
(b) motion which is unnoticeable to the observer, (c) motion with regular
rhythm, or (d) invariable, smooth motion, as shown in Fig. 2 (English) and
Fig. 3 (Japanese).
verb    (a) speed  (b) unnoticeable  (c) rhythm  (d) invariable
flow       -           -                 -           +
fly        +h          -                 -           +-
crawl      +l          +-                -           -
creep      +l          +                 -           -
dash       +h          -                 -           -
hurry      +h          -                 -           -
march      -           -                 +           -
run        +h          -                 -           -
rush       +h          -                 -           -
sneak      +l          +                 -           -
roll       -           -                 +-          +
slide      -           +                 -           +
slip       -           +                 -           -
glide      -           +                 -           +

Fig. 2. [+] indicates that the verb has the implication; [-] indicates
otherwise; [+-] indicates that both cases are possible depending on context.
[+h] means “high speed” and [+l] means “low speed.”
verb           (a) speed  (b) unnoticeable  (c) rhythm  (d) invariable
nagareru          -           -                 -           +
?hashiri-saru     +h          -                 -           -
tobi-saru         +h          -                 -           -
nagare-saru       -           -                 -           +
kake-nukeru       +h          -                 -           -
shinobi-yoru      +l          +                 -           -

Fig. 3. (For the representation, see Fig. 2.)
Thus, the positive factors concerning manner of motion are:
(a) speed (saliently high or saliently low)
(b) unnoticeable motion
(c) invariable motion
(d) regular rhythm
There are also some negative factors for Motion+Manner Verbs in the TIME IS A
MOVING OBJECT metaphor. See Fig. 4.
verb      limbs  instrument  (a) speed  (b) unnoticeable  (c) rhythm  (d) invariable
fly        +       -            +h          -                 -           +-
crawl      +       -            +l          +-                -           -
run        +       -            +h          -                 -           -
*swim      +       -            -           -                 -           -
*shuffle   +       -            -           -                 -           -
*walk      +       -            -           -                 -           -
*skip      +       -            -           -                 -           -
*limp      +       -            -           -                 -           -
*cruise    -       +            -           -                 -           -
*canoe     -       +            -           -                 -           -
*jet       -       +            +h          -                 -           -
*rocket    -       +            +h          -                 -           -
Thus, the negative factors conditioning the use of Motion+Manner Verbs in the
TIME IS A MOVING OBJECT metaphor are (a small illustrative sketch follows this
list):
(a) up-down or random (non-front-back) motion
(b) implication of the type of instrument used
(c) implication of sound emission
(d) salient motion of limbs or body-internal motion
(e) implication of specified circumstances of motion
(f) motion of plural figures.
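The following sketch (ours, with invented feature names, not the author's
analysis verbatim) combines the positive and negative factors into the
acceptability test the text suggests for “Time ___ by/away”:

    POSITIVE = {"salient_speed", "unnoticeable", "invariable",
                "regular_rhythm"}
    NEGATIVE = {"non_front_back", "instrument", "sound",
                "circumstance", "plural_figures"}  # limb motion: see below

    def time_metaphor_ok(features):
        """At least one positive factor and no negative factor; limb
        motion alone is tolerated in English, per the discussion."""
        return bool(features & POSITIVE) and not (features & NEGATIVE)

    print(time_metaphor_ok({"salient_speed", "limb_motion"}))  # 'run': True
    print(time_metaphor_ok({"limb_motion"}))                   # '*walk': False
    print(time_metaphor_ok({"instrument", "salient_speed"}))   # '*jet': False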
While English has at least 14 Motion+Manner Verbs which are often used in the
TIME IS A MOVING OBJECT metaphor, Japanese has only 6 Motion+Manner
Verbs which can be used for the TIME IS A MOVING OBJECT metaphor. They are
‘nagareru (flow),’ ‘hashiri-saru (run-leave),’ ‘tobi-saru (fly-leave),’ ‘nagare-saru
(flow-leave),’ ‘kake-nukeru (run-go through),’ and ‘shinobi-yoru (hide-approach).’
Except for ‘nagareru,’ all of them are compound verbs formed as
[Motion+Manner Verb] + [Motion+Path Verb]. The above positive and negative
factors, however, seem to be common to English and Japanese.
The major difference between English and Japanese concerning this metaphor is
seen in the pattern of expressing manner of motion. The striking difference is that
English allows the Motion+Manner Verbs which have one or more positive factors but
not negative factors (except limb motion) to be used in single forms, in most cases
accompanied by Path expressions such as ‘by,’ ‘on,’ or ‘away,’ while Japanese allows
only one single verb (‘nagareru’ (flow)) and requires other Motion+Manner verbs
such as ‘tobu (fly),’ ‘hashiru / kakeru (run),’ ‘hau (crawl / creep),’ ‘suberu (glide /
slide)’ or ‘korogaru (roll)’ to be accompanied by a Motion+Path Verb or by a simile
marker ‘yooni (as if)’ plus Motion+Path Verb like ‘sugiru (pass)’ or ‘sugite iku (pass
go).’ This difference seems to be due to the difference in lexicalization patterns
between English and Japanese. Verbs like ‘fly,’ ‘run,’ ‘crawl,’ or ‘creep’ (and the
counterparts in Japanese) basically denote an action, which prototypically implies
change of place (these are called “Motion-Propelling Action Verbs” by Kageyama
(1997)). In these verbs, Manner information is attributed to the action itself, not to the
motion. In order to denote change of place, these English verbs require, in
most cases, Path information expressed mostly by adverbs or prepositional
phrases, since English is a “Motion+Manner”-type language. By contrast, since
Japanese is a
“Motion+Path”-type language, it does not regularly use Path expressions outside the
verbs; that is, basic Path information is conflated in verbs. Thus, when temporal
motion is expressed by a Motion-Propelling Action Verb in Japanese, the Path
information is attached to the expression by the use of a compound verb or by
attaching ‘yooni (as if)’ and a Motion+Path Verb.
In spite of this difference, it is clear that English and Japanese share the fundamental
constraints on motion-time mappings. The difference is seen only in the patterns of
lexical realization, which are consistent with the major patterns of lexicalization of the
Motion Event Frame.
The sixth group of elements of the Motion Event Frame (Cause, Circumstance,
and Resultant State) is consistently excluded from motion-time mappings both
in English and Japanese. Thus, expressions like ‘Time blew off’ (meaning ‘Time
passed quickly’), ‘Time wore wings to the past’ (meaning ‘Time flew away’),
‘The examination day stuck to next Wednesday’ (meaning ‘The examination day
came as near as next Wednesday’), and so on are rejected.
3 Conclusion
In motion-time mappings, the aspects of spatial motion such as orientation (front-
back, up-down, right-left, north-south, etc.), the shape of the path (straight, curve,
circular, zigzag, etc.), and manner of motion (‘run,’ ‘fly,’ ‘creep,’ ‘wiggle,’ etc.) are
only partially mapped. The same constraints are found in English and Japanese.
Since these constraints concern extra-image-schematic structures of the source
domain, it is concluded that the partial mappings are seen outside the image schema in
the TIME IS A MOVING OBJECT metaphor. The Path Schema is preserved, since
these constraints do not affect the mappings of this image schema. See Fig. 5.
[Fig. 5: nested source-domain regions C ⊇ B ⊇ A mapped to target-domain
regions A’ ⊆ B’ ⊆ C’.]
Fig. 5.
A --> A’ : The Path Schema is completely mapped.
B --> B’ : The part of the conceptual structure of spatial motion
which is allowed by the constraints is mapped.
C --> C’ : The rest of the conceptual structure of spatial motion
(rejected by the constraints) is not mapped.
(iii) The positive and the negative factors concerning the TIME IS A MOVING
OBJECT metaphor can also be understood as motivated by the structure of the
concept of time. Speed (high or low) and unnoticeable motion (our unawareness
of the passing of time) are subjective feelings about time projected onto the
motion of time. The other two positive factors result from our concept of time
as passing constantly, incessantly, and invariably, always in the same manner.
The negative factors, which must not be mapped to time, can also be explained
in terms of the conceptual structure of time. "Up-down or random motion" is
excluded by the Front-Back Constraint, and the other negative factors
("instrument used," "sound emission," "salient bodily motion," "specified
circumstance," and "plural figures") are also explained by the conceptual
structure of time, which we assume to lack such elements.
(iv) Exclusion of Cause, Circumstance and Resultant State is also motivated
conceptually. These elements are rejected because our concept of time tells us that
there can be no agent acting on the motion of time and thus causing time to move, that
time is engaged in no other activities than motion itself, and that time undergoes no
durative change of state caused by its motion.
Appendix
An asterisk indicates that it is inappropriate to use the verb in expressions
like “Time ________ by (away, on, etc.).” Question marks indicate that the use
of the verb is not totally inappropriate, but somewhat strange or in need of
some special context (judged by two to five native speakers).
(leap around), *hane-modoru (leap back), *hashiri-deru (run out), *hashiri-komu (run
into), *hashiri-mawaru (run around), *hashiri-oriru (run down), ?hashiri-saru (run-
leave), *kake-agaru (run up), *kake-komu (run into), *kake-mawaru (run around),
*kake-meguru (run around), *kake-modoru (run back), *kake-noboru (run up), kake-
nukeru (run through), *kake-oriru (run down), *korogari-deru (roll out), *korogari-
komu (roll into), *koroge-mawaru (roll around), *korogari-modoru (roll back),
*korogari-nukeru (roll through), *korogari-ochiru (roll-fall), *korogari-oriru (roll
down), *korogari-saru (roll-leave), *mai-agaru (dance up), *mai-komu (dance into),
*mai-modoru (dance back), *mai-ochiru (dance-fall), *mai-oriru (dance down),
*suberi-komu (slide into), *nagare-deru (flow out), *nagare-komu (flow into),
*nagare-kudaru (flow down), *nagare-ochiru (flow-fall), nagare-saru (flow-leave),
*nagare-tsuku (flow-arrive), *nige-saru (sneak away), *oyogi-mawaru (swim around),
*oyogi-saru (swim-leave), *oyogi-tsuku (swim-arrive), shinobi-yoru (sneak-
approach), *suberi-deru (slide out), *suberi-komu (slide into), *suberi-ochiru (slide-
fall), *suberi-oriru (slide down), *tobi-agaru (jump up), *tobi-dasu (jump out), *tobi-
deru (jump out), *tobi-koeru (jump over), *tobi-komu (jump into), *tobi-mawaru
(jump/fly around), *tobi-oriru (jump down), tobi-saru (fly away)
References
1. Johnson, M., The Body in the Mind: The Bodily Basis of Meaning, Imagination,
and Reason. Chicago: The University of Chicago Press (1987).
2. Kageyama, T., Nichieigo dooshi no imi to bumpoo (Meaning and grammar of
Japanese and English verbs). Handout for presentation at Summer Special
Lectures, Tokyo Gengo Kenkyuujo (1997).
3. Lakoff, G., The Invariance Hypothesis: Is abstract reason based on image-
schemas? Cognitive Linguistics 1 (1990), 39-74.
4. Lakoff, G., The syntax of metaphorical semantic roles. In J. Pustejovsky (Ed.),
Semantics and the Lexicon. Dordrecht: Kluwer (1993a), pp. 27-36.
5. Lakoff, G., The contemporary theory of metaphor. In A. Ortony (Ed.), Metaphor
and Thought, Second ed. Cambridge: Cambridge University Press (1993b), pp.
202-251.
6. Lakoff, G., Johnson, M.: Metaphors We Live By. Chicago: The University of
Chicago Press (1980).
7. Shinohara, K., Invariance and override in space-time metaphor. ICU English
Studies 5 (1996), 39-56.
8. Talmy, L., Lexicalization patterns: semantic structure in lexical forms. In T.
Shopen (Ed.), Language Typology and Syntactic Description, vol. 3, Cambridge:
Cambridge University Press (1985), pp. 57-149.
9. Yamaguchi, K., Cognitive approach to temporal expressions in Japanese and
English. Proceedings of TACL summer institute of linguistics (1995), 203-214.
10. Yamanashi, M., Ninchi Bumpooron (Cognitive Grammar). Tokyo: Hitsuji
Shoboo (1995).
An Introduction to Algebraic Semiotics,
with Application to User Interface Design
Joseph Goguen
Dept. Computer Science & Engineering, Univ. of California at San Diego
Abstract: This paper introduces a new approach to user interface design and
other areas, called algebraic semiotics. The approach is based on a notion of
sign, which allows complex hierarchical structure and incorporates the insight
(emphasized by Saussure) that signs come in systems, and should be studied
at that level, rather than individually. A user interface can be considered as a
representation of the underlying functionality to which it provides access, and
thus user interface design can be considered a craft of constructing such repre-
sentations, where both the interface and the underlying functionality are con-
sidered as (structured) sign systems. In this setting, representations appear as
mappings, or morphisms, between sign systems, which should preserve as much
structure as possible. This motivates developing a calculus having systematic
ways to combine signs, sign systems, and representations. One important mode
of composition is blending, introduced by Fauconnier and Turner; we relate this
to certain concepts from the very abstract area of mathematics called category
theory. Applications for algebraic semiotics include not only user interface design,
but also cognitive linguistics, especially metaphor theory and cognitive poetics.
The main contribution of this paper is the precision it can bring to such areas.
Building on an insight from computer science, that discrete structures can be
described by algebraic theories, sign systems are defined to be algebraic
theories with extra structure, and semiotic morphisms are defined to be
mappings of algebraic theories that (to some extent) preserve the extra
structure. As an
aid for practical design, we show that the quality of representations is closely
related to the preservation properties of semiotic morphisms; these measures of
quality also provide the orderings needed by our category theoretic formulation
of blending.
1 Introduction
Analogy, metaphor, representation and user interface have much in common:
each involves signs, meaning, one or more people, and some context, including
culture; moreover each can be looked at dually from either a design or a use
perspective. Recent research in several disciplines is converging on a general
area that includes the four topics in the first sentence above; these disciplines
include (aspects of) sociology, cognitive linguistics, computer science, literary
criticism, user interface design, psychology, semiotics, and philosophy. Of these,
semiotics takes perhaps the most general view, although much of the research in
this area has been rather vague. A goal of the research reported here is to develop
In signs, one sees an advantage for discovery that is greatest when they
express the exact nature of a thing briefly and, as it were, picture it;
then, indeed, the labor of thought is wonderfully diminished.
A good example is the difference in plane geometry between doing proofs with
diagrams and doing proofs with axioms (see also Appendix D). The above
quotation also draws attention to signs and their use, and indeed, our previous
discussion about coffee cups, elevator buttons, etc. can be re-expressed very
nicely in the language of semiotics, which is the study of signs. Signs are
everywhere: not just icons on computer screens and corporate logos on T-shirts
or racing cars, but more significantly, the organization of signs is the very
nature of language: natural human language both spoken and written, artificial
computer languages, and visual languages, as in architecture and art, both fine
and popular, including cinema.
We will see that the following ideas are basic to our general theory:
– Signs appear as members of sign systems¹, not in isolation.
– Most signs are complex objects, constructed from other, lower level signs.
– Sign systems are better viewed as theories — that is, as declarations for
symbols plus sentences, called "axioms," that restrict their use — than as
(set-based) models.
– Representations in general, and user interfaces in particular, are
"morphisms" (mappings) between sign systems.
Charles Sanders Peirce [49], a nineteenth century logician and philosopher
working in Boston, coined the word "semiotics" and introduced many of its basic
concepts. He emphasized that meanings are not directly attached to signs, but
that instead, signs mediate meaning, through events (or processes) of semiosis,
each involving a signifier (i.e., a sign), a signified (an "object" of some
kind — e.g., an idea), and an interpretant² that links these two; these three
things are often called the semiotic triad, and occur wherever there is
meaning. Signs, meanings, and referents only exist for a particular semiosis,
which must include its social context; therefore meaning is always embedded and
embodied. In general, the signified is not given, but must be inferred by some
person or persons involved. Designers work in the reverse direction, creating
signs for a given signified. Peirce's approach may sound simple, but it is very
different from more common and naive approaches, such as the use of
denotational semantics for programming languages. Peirce's theory of signs is
not a representational theory of meaning, in which a sign has a denotation;
instead, the interpretant makes it a relational theory of meaning. Peirce's
important notions of icon, index and symbol are discussed below in Section 4.
In addition, we use the term signal for a physical configuration that may or
may not be a sign.
¹ There is a difficulty with terminology here: the phrase "semiotic system"
sounds too broad, while "sign system" may sound too narrow, since it is
intended to include (descriptions of) conceptual spaces, as well as systems of
physical signs.
² This is Peirce's original terminology; "interpretant" should not be confused
with "interpreter," as it refers to the link itself.
Structure is part of our experience, and though seemingly more abstract than
immediate sensations, emotions, evaluations, etc., there is strong evidence that
it too plays a crucial role in the formation of such experiences (e.g., consider how
movies are structured). Context, which for spoken language would include the
speaker, can be at least as important for meaning as the signs involved. For an
extreme example, "Yes" can mean almost anything given an appropriate context.
Moreover, work in artificial intelligence has found contextual cues essential
for disambiguation in speech understanding, machine vision, and elsewhere.
The vowel systems of various accents within the same language show that
the same sign system can be realized in different ways; let us call these
different models of the sign system. For computer scientists, it may be helpful
to view sign systems as abstract data types, because this already includes the
idea that the same information can be represented in different ways; for
example, dates, times, and sports scores each have multiple familiar
representations. The Greek, Roman and Cyrillic alphabets show that the sets
underlying models can overlap; this example also shows that a signal that is
meaningful in one sign system may not be in another, even though they share a
medium. The same signal in a different alphabet is a different sign, because it
is in a different sign system. The vowel system example also shows that
different models of the same sign system can use exactly the same signals in
different ways; therefore it is how elements are used that makes the models
different, not the elements themselves.
Here are some further useful concepts:
– A medium expresses dimensions within which signs can vary; for example,
standard TV is a two dimensional pixel array with certain possible ranges of
intensity and color, plus a monophonic audio channel with a certain possible
range of frequency, etc.
– A genre is a collection of conventions for using a medium; these can be
seen as further delimiting a sign system. For example, the daily newspaper
is a genre within the medium of multisection collections of large size pages.
Soap operas are a genre for TV. Obviously, genres have subgenres; e.g., soap
operas about rich families.
– Multimedia are characterized by multiple simultaneous perceptual channels.
So TV is multimedia, and so (in a weak sense) are cartoons, as well as
books with pictures.
– Interactive media allow inputs as well as outputs. So PCs are (potentially)
interactive multimedia. The web provides (at least one) genre within this
medium; email is another.
We can even say that a book is interactive, because users can mark and turn
pages, and can go to any page they wish; indices, glossaries, etc. are also used
in an interactive manner. Many museums have interactive multimedia exhibits,
and every museum is interactive in a more prosaic sense.
This paper proposes a precise framework for studying sign systems and their
representations, as well as for studying what makes some representations better
than others, and how to combine representations. The framework is intended
for application to aspects of communication and cognition, such as designing
to a sign system for windows, buttons, menus, etc. [31]. A web browser can
be seen as a map from html (plus JavaScript, etc.) into the capabilities of a
particular computer on which it is running⁴. Metaphors can be seen as semiotic
morphisms from one system of concepts to another [10, 12, 58]. A given text
(spoken utterance, etc.) can be seen as the image under a morphism from some
(usually unknown) structure into the sign system of written English (or spoken
English, or whatever). Conversely, we may be given some situation, and want to
find the best way to describe it in natural language, or in some other medium
or combination of media, such as text with photos, or cartoon sequences, or
video, or online hypertext or hypermedia [27].
In these and many other cases, representations are signs in one system that
relate systematically to signs in another system. Generally it is just as
fruitless to study representations of single signs as to study single isolated
signs. For representations also occur in systems, just as signs do: usually
there are systematic regularities in how signs of one system are represented as
signs of another. Let us use the notation M : S1 → S2 for a morphism from sign
system S1 to sign system S2. Of course, in all but the most trivial cases,
there is no unique morphism S1 → S2. Think, for example, of the difficulties of
translating from one language to another. Moreover, in general, morphisms are
partial, that is, not defined for all the signs in the source system; some
signs may be untranslatable, or at least, not translated by a given morphism.
Here are some very simple examples. Let N1 be the familiar decimal Arabic
numerals and let N2 be the Roman numerals. Then there is a natural morphism
M : N1 → N2, but it is undefined for Arabic 0, since the Romans did not have
the concept of zero. We can also consider transliterations between the English
and Greek alphabets: then certain letters just don't map. Similarly,
Scandinavian alphabets make some distinctions that the English alphabet does
not; Chinese and Sanskrit raise still other problems. Ciphers (i.e., "secret
codes") are also representations, simple in their input and output alphabets,
but deliberately complex in their algorithmic construction.
Further examples and details about the systematic organization of signs are
discussed later, but it should now be clear that an ambitious enterprise is
being proposed, taking a wide interpretation of the notion of sign, and
treating sign systems and their morphisms with great rigor. However, because
this enterprise is still at an early stage, our examples cannot be both complex
and detailed. Hoping that readers will forgive the ambition and effrontery of
combining such diverse elements, I acknowledge the deep indebtedness of this
work to its precedents, and hope to have the help of readers of this paper in
developing its potential.
⁴ These two examples highlight the important but subtle point that theory
morphisms go in the opposite direction from the maps of models that they
induce; this duality is explained at an abstract level by the theory of
institutions [24], but is well outside the scope of this paper.
1.3 On Formalization
Sapir said all systems leak; he was referring to the fact that no grammatical
system has ever successfully captured a real natural language, but it is
natural to generalize his slogan to the formalization of any complex natural
sign system. There are always "loose ends"; some deep reasons for this, having
to do with the social nature of communication, are discussed in [21]. Thus we
cannot expect our semiotic models to be perfect. However, a precise description
that is somewhat wrong is better than a description so vague that no one can
tell if it's wrong. We do not seek to formalize actual living meanings, but
rather to express our partial understandings more exactly. Precision is also
needed to build computer programs that use the theory. I do not believe that
meaning in the human sense can be captured by formal sign systems; however,
human analysts can note the extent to which the meanings that they see in some
sign system are preserved by different representations. Thus we seek to
formalize particular understandings of analysts, without claiming that such
understandings are necessarily correct, or have some ideal kind of Platonic
existence.
Acknowledgements
The proofs in Appendix B were provided by Grigore Rosu, and the basic def-
initions were worked out in collaboration with Grigore Rosu and Razvan Dia-
conescu. Further results on 32 -colimits should eventually appear in a separate
paper. I wish to thank the students in my Winter 1998 class CSE 271 on user
250 Joseph Goguen
interface design, for their patience, enthusiasm, and questions. I also thank Gilles
Fauconnier, Masako Hiraga, and Mark Turner for their valuable comments on
earlier drafts of this paper, and Michael Reddy for intensifying my interest in
metaphor, as I supervised his PhD thesis at the University of Chicago.
2 Sign Systems
Sign systems usually have a classification of signs into certain sorts⁵, and
some rules for combining signs of appropriate sorts to get a new sign of
another sort; we call these rules the constructors of the system. Constructors
may have parameters. For example, a "cat" sign on a computer screen may have
parameters for the size and location of its upper lefthand corner; changing
these values does not change the identity of the cat.
Constructors may have what we call priority: a primary constructor has
greatest priority; secondary constructors have less priority than the primary
constructor but more than any non-primary or non-secondary constructor;
tertiary constructors, etc. follow the same pattern. Priority is a partial
ordering, not total. Experiments of Goguen and Linde [27] (testing subjects
after multimedia instruction in various formats about a simple electronic
device) support the hypothesis that the reasoning discourse type [32] has a
primary constructor that conjoins reasons supporting a statement⁶.
Semiotics should focus on the structure of sign systems rather than on ad
hoc properties of individual signs and their settings, just as modern biology
focuses on molecular structures like DNA rather than on descriptive
classification schemes. For example, formalizing the handwritten letter "a"
(or the spoken sound "ah") in isolation is both far harder and less useful than
formalizing relations between written letters and words (or phonemes and spoken
words).
It is natural to think of a sign system as a set of signs, grouped into sorts
and levels, not necessarily disjoint, with "constructor" functions at each
level that build new signs from old ones. But such a set-based approach does
not capture the openness of sign systems: that there might be other signs we
don't yet know about, or haven't wanted to include, because we are always
involved in constructing only partial understandings. It is therefore
preferable to view sign systems as theories than as pre-given set theoretic
objects. This motivates the following:
Definition 1: A sign system S consists of:
1. a set S of sorts for signs, not necessarily disjoint;
⁵ We deliberately avoid the more familiar word "type" because it has had so
many different uses in computer science. The so-called parts of speech in
syntax, such as noun and verb, are one example of sorts in the sense that we
intend.
⁶ The primary constructor of a given discourse type is its "default"
constructor, i.e., the constructor assumed when there is no explicit marker in
the text. In narrative, if one sentence follows another we assume they are
connected by a sequence constructor; this is called the narrative
presupposition [39].
has level 2, char has level 3, alphanum and spec have level 4, and alpha and
num have level 5 (or we could give all subsorts of char level 4, or even 3;
such choices can be a bit arbitrary until they are forced by some definite
application). There are various choices for the constructors of this sign
system. Since lines are strings of characters, one choice is an operation _
that concatenates a character with a line to get a longer line, and another
operation, also denoted _, that concatenates a line and a window to get another
window; there must also be constant constructors for the empty line and the
empty window. (The constraints on the lengths of lines and windows are given by
axioms that are discussed below.) For each sort, the concatenation operations
have priority over the constant operations.
This editor also has data sorts for fixed data types that are used in an
auxiliary way in describing its signs: these include at least the natural
numbers, and possibly colors, fonts, etc., depending on the capabilities we
want to give our little editor. Functions include windowidth and windowlength,
and there could also be predicates for the subsorts, such as a numeric
predicate on characters. Then the constraints of length can be expressed by the
following axioms:
(∀L : line) windowidth(L) ≤ 80 .
(∀W : window) windowlength(W) ≤ 24 .
Let us denote this sign system W.
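For concreteness, here is a minimal sketch (ours; W itself is an algebraic
theory, not a program) of one model of W, with a window modelled as a list of
line strings and the two axioms as runtime checks:

    MAX_LINE_WIDTH, MAX_WINDOW_LENGTH = 80, 24

    def is_line(line):
        """A line is a character string of width at most 80."""
        return isinstance(line, str) and len(line) <= MAX_LINE_WIDTH

    def is_window(window):
        """A window is at most 24 lines, each satisfying the line axiom."""
        return len(window) <= MAX_WINDOW_LENGTH and all(map(is_line, window))

    print(is_window(["hello", "world"]))   # True
    print(is_window(["x" * 81]))           # False: line too wide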
If we want to study how texts can be displayed in this window, we should
define a sign system for texts. One simple way to do this has sorts char, word,
sent (sentence), and text, in addition to the data sorts and the subsorts of
char as in W above; the sort text is level 1, sent level 2, word level 3, and
char level 4. There are several choices for constructors, one of which defines
any concatenation of alphanumeric characters to be a word, any concatenation of
words to be a sentence, and any concatenation of sentences to be a text. Let us
denote this sign system TXT. Clearly there are many different ways to display
texts in a window, and each one is a different semiotic morphism; we will see
some of these later.
A somewhat different sign system is given by simple parsed sentences, i.e.,
sentences with their "part of speech" (or syntactic category) explicitly given.
The most familiar way to describe these is probably with a context free
grammar like that below, where S, NP, VP, N, Det, V, PP and P stand for
sentence, noun phrase, verb phrase, noun, determiner, verb, prepositional
phrase, and preposition, respectively:
S -> NP VP
NP -> N
NP -> Det N
VP -> V
VP -> V PP
PP -> P NP
.....
The "parts of speech" S, NP, VP, etc. are the sorts of this sign system, and
the rules are its constructors. For example, the first rule says that a
sentence can be constructed from an NP and a VP. There should also be some
constants of the various sorts, such as
N -> time
N -> arrow
V -> flies
Det -> an
Det -> the
P -> like
......
There is a systematic way to view context free rules as operations that
"construct" things from their parts (introduced in [15]), which in this case
gives the following:
sen : NP VP -> S
nnp : N -> NP
np : Det N -> NP
vvp : V -> VP
vp : V PP -> VP
pp : P NP -> PP
.....
time : -> N
flies : -> V
.....
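To make the constructor reading concrete, a small sketch (ours, not from the
paper) builds parse terms as nested tuples using the operations above:

    # Constructors as functions returning terms (nested tuples).
    def sen(np_, vp_): return ("sen", np_, vp_)
    def nnp(n):        return ("nnp", n)
    def np(det, n):    return ("np", det, n)
    def vvp(v):        return ("vvp", v)
    def vp(v, pp_):    return ("vp", v, pp_)
    def pp(p, np_):    return ("pp", p, np_)

    # "time flies like an arrow", with 'flies' parsed as the verb:
    term = sen(nnp("time"), vp("flies", pp("like", np("an", "arrow"))))
    print(term)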
[Fig. 1: the house conceptual space (owner, resident, house) and the boat
conceptual space (owner, passenger, boat), with their own, live-in, and ride
relations drawn as lines.]
including only as much detail as is needed to analyze some particular text. For
example, a theory of houses might have constants house, owner and resident,
with relations own and live-in making the obvious assertions. Similarly, a boat
theory might have constants boat, owner and passenger, with relations own and
ride. These two spaces are illustrated in Figure 1. No sorts are shown, but for
this simple example, one is enough, say Thing. That a relation such as own
holds of two things is given by a line in the figure, and in the corresponding
logical theory is given by an axiom, e.g., own(owner,house). It is usually
assumed that relation instances that are not shown (such as ride(boat,owner))
do not hold, i.e., are false (one way to formalize this, which is related to
the so-called frame problem in artificial intelligence, is given in Chapter 8
of [23]). Let us call this the default negation assumption. But sometimes
whether or not a relation holds may be unknown. Humans generally do a good job
of figuring all this out, using what is called "common sense". However, the
deductions involved can actually be extremely complex; some hints of this
complexity may be found in the discussion of the blending examples in
Section 5.
Formalism and representation feature in much recent work in the sociology of
science, with many fascinating examples. For example, Latour [42] shows how
representation by cartographic maps was essential for European colonization,
and Bowers [6] discusses the politics of formalism, including cscw systems.
Latour leaves representation undefined, while Bowers has a slightly formal
notion of formalism. I believe that such discussions could be given greater
precision by using the framework proposed in this paper.
3 Semiotic Morphisms
The purpose of semiotic morphisms¹² is to provide a way to describe the
movement (mapping, translation, interpretation, representation) of signs in one
system to signs in another system. This is intended to include metaphors as
well as representations in the more familiar user interface design sense.
Generating a good icon, file name, explanation or metaphor, or arranging text
and graphics together in an appropriate way, each involves moving signs from
one system to
¹² Although the root "morph" of the noun "morphism" means "form," this word has
recently also become a verb meaning "to change form."
same; this is a kind of unary notation. Let N(t) be the number of s's in some
t from TOD. Let Q(t) and R(t) be the quotient and remainder after dividing
N(t) by 80. Then there will be Q(t) lines of 80 s's, followed by one line of
R(t) s's and a final 0. This is guaranteed to fit in our window because
Q(1439) = 17 is less than 24, and R(t) + 1 ≤ 80. For humans, this
representation is so detailed that it is more or less analog: I think after
getting familiar with it, a user would have a "feel" for the approximate number
of (these strange 80 minute) hours in a window and of minutes in the last line,
just from its appearance. Let us call this representation U. Figure 2 shows the
time that we would call "1:15 pm" in it.
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss
sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss0
Another obvious but naive representation just displays N(t) in decimal
notation, giving a string of 1 to 4 decimal digits. This is very different from
our usual representations; but we could imagine a culture that divides its days
into 14 "hours" each having 100 minutes, except the last hour, which only has
40 (this is less strange than what we do with our months, with their varying
numbers of days!). Here N(0) is 0, and s just adds 1, except that s(1439) = 0.
Figure 3 shows quarter after one in the afternoon in this representation; the
last two digits give the number of minutes, and those to the left of that give
the number of "hours". Let us call this representation N.
795
Fig. 3. A Naive Digital Clock
13 15
Fig. 4. A Military Time Clock
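Both representations are easy to write down as functions; a hedged sketch
(ours, not from the paper), with t the number of minutes after midnight:

    def U(t):
        """Unary clock: lines of at most 80 's' characters, ending in '0'."""
        q, r = divmod(t, 80)
        return ["s" * 80] * q + ["s" * r + "0"]

    def N(t):
        """Naive digital clock: the minute count in decimal."""
        return str(t)

    quarter_past_one = 13 * 60 + 15          # 795 minutes, "1:15 pm"
    print(len(U(quarter_past_one)))          # 10 lines, matching Figure 2
    print(N(quarter_past_one))               # '795', matching Figure 3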
situation (i.e., theory) within which we can deduce the connection between the
signified and signifying signs. For a symbol, there is no such more basic
relationship between source and target signs.
For purposes of design, other things being equal, there is a natural ordering
to these three kinds of sign: icons are better than indices, and indices are
better than symbols. However, things are not always equal. For example, base 1
notation for natural numbers is iconic, e.g., 4 is represented as ||||, 3 as
|||, and we get their sum just by copying and appending,
|||| + ||| = ||||||| ,
which is iconic. But base one notation is very inefficient for representing
large numbers. With Arabic numerals, the use of 1 for "one" is iconic (one
stroke), but the others are symbolic¹⁶. Using the blank character for "zero"
would be iconic, but of course this would undermine the positional aspect of
decimal notation and introduce ambiguities. Chinese notation for several of the
small numerals is iconic.
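The iconicity of base 1 addition is easy to check mechanically (a one-off
sketch, ours):

    def unary(n):
        return "|" * n                       # base 1: n strokes

    print(unary(4) + unary(3) == unary(7))   # True: addition is concatenation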
Peirce's three classes of sign overlap, so some signs will be hard to classify.
Also, complex situations may involve all three kinds of sign, interacting in com-
plex ways; indeed, dierent aspects of the same sign can be iconic, indexical, and
symbolic. It is often necessary to consider the context of a sign, e.g., how is it
used in practice, and of course its relation to other signs in the same system. See
[19, 35] for further examples and discussion, the former mainly from computer
science, and the latter mainly from language.
The following definition gives some precise ways to compare the quality of
representations:
Definition 3: Given a semiotic morphism M : S1 → S2, then:
(1) M is level preserving iff the partial ordering on levels is preserved by
M, in the sense that if sort s is lower level than sort s′ in S1, then M(s)
has lower (or equal) level than M(s′) in S2.
(2) M is priority preserving iff c < c′ in S1 implies M(c) < M(c′) in S2.
(3) M is axiom preserving iff for each axiom a of S1, its translation M(a) to
S2 is a logical consequence of the axioms in S2.
(4) Given also M′ : S1 → S2, then M′ is (at least) as defined as M, written
M ≤ M′, iff for each constructor c of S1, M′(c) is defined whenever M(c) is.
(5) Given also M′ : S1 → S2, then M′ preserves all axioms that M does,
written M ≤ M′, iff whenever M preserves an axiom, then so does M′.
(6) Given also M′ : S1 → S2, then M′ is (at least) as inclusive as M iff
M(x) = x implies M′(x) = x for each sign x of S1.
(7) Given also M′ : S1 → S2, then M′ preserves (at least) as much content
as M, written M ≤ M′, iff M′ is as defined as M and M′ preserves every
selector that M does, where a morphism M : S1 → S2 preserves a selector
f1 of S1 iff there is a selector f2 for S2 such that for every sign x of S1
where M is defined, f2(M(x)) = f1(x), where
¹⁶ Though there is a trick for regarding several of the small Arabic numerals
as iconic.
better; sometimes we can recover lost information some other way, and then less
preservation may be better, because it allows for a more compact representation.
For another example, let's consider representing (abstract) texts as strings,
i.e., let's consider semiotic morphisms M : TXT → STG. The sign system
TXT has sorts for sentences, words, and characters, while the sign system STG
only has sorts for strings and characters. Because characters are a data sort,
any morphism M : TXT → STG must preserve the sort char, and there is also no
choice about how to map the other sorts of TXT: they must all go to the sort
string. The top level constructor of TXT forms texts by concatenating
sentences, while its second level constructor concatenates words to form
sentences, and its third level constructor concatenates characters to form
words. Since the only constructor for STG concatenates characters to form
strings, the obvious thing to do is map each concatenation of TXT to the
concatenation of STG. However, the sign resulting from a text would now be just
one huge ugly string which "mushes" everything together. As we know, it is
usual to insert spaces between words, and a period and two spaces after each
sentence. It is easy to define a morphism that does this, though it is more
complex than the "mushing" representation.
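For illustration, a hedged sketch (ours; these functions merely stand in for
the maps of models induced by the two morphisms) of the "mushing"
representation and the space-inserting one, with a text modelled as a list of
sentences, each a list of words:

    def mush(text):
        """Preserve only concatenation: one huge string."""
        return "".join(word for sentence in text for word in sentence)

    def display(text):
        """Spaces between words; a period and two spaces per sentence."""
        return "".join(" ".join(s) + ".  " for s in text).rstrip()

    text = [["time", "flies"], ["so", "it", "goes"]]
    print(mush(text))      # timefliessoitgoes
    print(display(text))   # time flies.  so it goes.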
Both these morphisms preserve the structure of TXT. But what would it
mean for a morphism M : TXT → STG not to preserve this structure? There
are many possibilities, including dropping some characters, words, and/or
sentences, or permuting them in a random order. Phenomena like these will
clearly produce a low quality display.
Experiments reported in [27] show that preserving high levels is more
important than preserving priorities, which in turn is more important than
preserving content. They also show a strong tendency to preserve higher levels
at the expense of lower levels when some structure must be dropped. This may be
surprising, because of the emphasis by cognitive psychologists on the "basic
level" of lexical concepts (e.g., Rosch [50, 51]). For natural language, the
sentential level was long considered to be basic, but research like that of
[27] shows that the discourse level is higher in our technical sense, and thus
more important. This is consistent with the important general principle that
structure has priority over content, i.e., form is more important than content
(if something must be sacrificed to limit the complexity of the display).
Much more detailed empirical work is needed to determine more precisely the
tradeoffs among various preservation and other optimality criteria for semiotic
morphisms. A start is being made by assembling a collection of examples of bad
design arising from failures of semiotic morphisms to fully preserve structure
in the "world-famous"¹⁷ UC San Diego Semiotic Zoo. Although not all the
explanations are available yet, the animals can be visited at any hour of the
day or night, at
http://www.cs.ucsd.edu/users/goguen/zoo/
¹⁷ For some reason, the real San Diego Zoo, which really is world famous,
almost always precedes its name with "world-famous," with the hyphen.
where much additional information (and some bad jokes about zoos) can also be
found. Most of the exhibits there involve color and/or interactive graphics,
and so cannot easily be discussed in this traditional medium of print.
The tatami project at UCSD is applying semiotic morphisms and their orderings
to design the user interface of a system that supports cooperative distributed
proofs over the world wide web [31, 25]. We found that certain ways we had used
to represent proofs were not semiotic morphisms, which then led us to construct
better representations; we also used semiotic morphisms to determine aspects of
window layout, button location, etc. Details can be found especially in
[22, 25], and of course on the project website
which should always have the very latest information.
Fauconnier and Turner [10, 12] study the "blending" of conceptual spaces, to
obtain new spaces that combine parts of the input spaces. Blends are common
in natural language, for example, in words like "houseboat" and "roadkill," and
in phrases like "artificial life" and "computer virus," as well as in metaphors
that have more than one strand (as is usually the case).
The most basic kind of blend may be visualized using the diagram below,
where I1 and I2 are called the inputs, G the generic, and B the blend¹⁸.
More precisely, we define a blend of sign systems I1 and I2 over G (using
given semiotic morphisms G → I1 and G → I2) to be a sign system B with
morphisms I1 → B, I2 → B, and G → B, which are all called injections, such
that the diagram weakly commutes, in the sense that both the compositions
G → I1 → B and G → I2 → B are weakly equal to the morphism G → B,
in the sense that each sign in G gets mapped to the same sign in B under
them, provided that both morphisms are defined on it¹⁹. It follows that the
compositions G → I1 → B and G → I2 → B are also weakly equal when G → B
is totally defined, but not necessarily otherwise. The special case where all
sign systems are conceptual spaces is called a conceptual blend. In general, we
should expect the morphisms to the blend to preserve as much as possible from
the inputs and generic.
¹⁸ The form of this diagram is "upside down" from that used by Fauconnier and
Turner, in that our arrows go up, with the generic G on the bottom, and the
blend B on the top; this is consistent with the metaphor (or "image schema"
[40]) that "up is more," as well as with conventions for drawing such diagrams
in mathematics. Also, Fauconnier and Turner do not include the map G → B.
¹⁹ Strict commutativity, which is usually called just commutativity, means that
the compositions are strictly equal.
[Diagram: the basic V-blend, with the blend B at the top, the inputs I1 and I2 in the middle with injections I1 → B and I2 → B, and the generic G at the bottom with morphisms G → I1, G → I2, and G → B.]
Mathematically, it is more perspicuous to think of blending the two morphisms
ai : G → Ii than the two spaces I1, I2, and for this reason we will sometimes use
the notation a1 ◊ a2 to stand for an arbitrary blend of a1 and a2; this will be
especially helpful in writing formulae for our calculus of blending.
Blends have applications in computer interface design, some of which are
described in [31]. For a simple example, suppose we want to display both tem-
perature and time of day on the same device. This is an example of the product
of sign systems: if TOD is a sign system for time of day and TMP is a sign system
for temperature, then the sign system for our device is TOD × TMP. Before giving
the technical definition, let 1 denote the "trivial" sign system that has only one
sort (its top sort) and no operations (except those for data). Now given sign
systems S1 and S2, their product, denoted S1 × S2, is the blend of S1 and S2
over 1 with the obvious (and only) morphisms 1 → Si, formed by taking the
disjoint union20 of S1 and S2, and then identifying their top sorts to get a new
sort called the product sort. Both injections are injective and both triangles
strictly commute.
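In OBJ notation, as used for the examples in Appendix A, this might be sketched roughly as follows. The bodies of TOD and TMP are our own illustrative guesses, not the paper's; only the make ... + ... endm idiom is taken from the paper's own usage, and note that + by itself forms a blend (a colimit) without identifying the two top sorts into a product sort, so this only approximates the product as defined above:

th TOD is sort Time .
  ops morning noon evening night : -> Time .
endth

th TMP is sort Temp .
  ops cold mild warm hot : -> Temp .
endth

*** the combined display device, with signs from both systems
make DEVICE is TOD + TMP endm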
It is not hard to prove some simple properties of product, including the
following, where S, S1, S2, S3 are arbitrary sign systems:

  S × 1 ≅ S ,
  1 × S ≅ S ,
  S1 × S2 ≅ S2 × S1 ,
  S1 × (S2 × S3) ≅ (S1 × S2) × S3 .
These are only a modest addition to our calculus of representation, but the
notion of product becomes more interesting later on, when extended from sign
systems to representations. Forms of the commutative and identity laws also
hold for blends, and may be written as

  a1 ◊ a2 = a2 ◊ a1 ,
  a ◊ 1G = a ,
  1G ◊ a = a ,
where the first should be read as saying that any blend of a1, a2 is also a blend
of a2, a1, and the next two as saying that one blend of any space with its generic
space is the space itself.

20 This involves renaming sorts and operations, if necessary, so that there are no over-
laps except for the data sorts and operations. Thus this blend is a sort of "amalga-
mated sum" of its two inputs (this phrase is used in algebraic topology, among other
places). Due to the duality between theories and models (as formalized in the theory
of institutions [24]), this corresponds to taking products of models.
Before doing a slightly more complex example in some detail, we generalize
the concept of blend to a labeled graph, with sign systems on its nodes and
morphisms on its edges, such that if e is an edge from n0 to n1 , then the mor-
phism on e has as its source the sign system on n0 and as its target the one
on n1 . We will call this labeled graph the base graph. Some morphisms in the
base graph may be designated as auxiliary21, indicating that the relationships
that they embody do not need to be preserved. Then a blend for a given base
graph is some sign system, together with a morphism called an injection to it
from each sign system in the graph, such that any triangle of morphisms involv-
ing two injections and one non-auxiliary morphism in the base graph weakly
commutes. The exclusion of auxiliary morphisms is important, because commu-
tativity should not be expected for auxiliary information; this is illustrated in
the example below. The base graph for the basic kind of blend considered at the
beginning of this section has a "V" shape; let us use the term V-blends for this
case. Also, let us call a node in the base graph auxiliary if all morphisms to
and from it in the base graph are auxiliary22.
Appendix B develops the above ideas more precisely, and puts blending in the
rich mathematical framework of category theory, relating V-blends to what are
called "pushouts", and the more general blend of a base graph to what are called
colimits. In addition, Appendix B develops a special kind of category, called a
3/2-category, and shows that (what we there call) 3/2-pushouts and 3/2-colimits give
blends that are "best possible" in a certain precise sense that involves ordering
semiotic morphisms by quality, e.g., that they should be as defined as possible,
should preserve as many axioms as possible, and should be as inclusive as possible
(see Definition 3).
We now show several ways to blend spaces for the words "house" and "boat";
see Figure 6, in which the generic space is auxiliary. We do not aspire to great ac-
curacy in linguistic modeling here; certainly much more detail could be added to
the various spaces, and some details could be challenged23. Our interest is rather
to illustrate the mathematical machinery introduced in this section with a sim-
ple, intuitive example. The generic space has three constants, object, medium,
and person, plus two relations, on and use. The house input has constants for
house, land, and resident; these are mapped onto by object, medium, and
person from the generic space, respectively; the relations are live-in and on,
where the first is mapped onto by use, and where the house is on land.
21 More technically, it is the edges that are designated as auxiliary, because it is possible
that the same morphism appears on more than one edge, where not all instances of
it are auxiliary.
22 I thank Grigore Rosu for the suggestion to generalize from auxiliary nodes to auxil-
iary edges.
23 This is consistent with our belief that unique best possible theories do not exist for
most real world concepts [21].
[Figure 6: the "houseboat" (hsbt) and "boathouse" (bths) blends of the house and boat input spaces over the generic space, whose constants are object, medium, and person, and whose relations are use and on.]
Similarly, the boat input space has constants for boat, water, and passenger, which
are mapped onto by object, medium, and person, respectively; and it has rela-
tions ride and on, where the first is mapped onto by use, and where the boat
is on water. In forming a blend, there is a conflict between being on water and
being on land, and for "houseboat", water wins. Here all triangles commute.
The blend for boathouse chooses land instead of water. But the most interest-
ing things to notice about the boathouse blend are that the boat becomes the
resident, and that this leads to a non-commutative triangle of morphisms on the
right side.
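To make this concrete, the generic space and the house input might be transcribed in OBJ roughly as follows. This is our sketch, not the paper's (the paper presents these spaces only diagrammatically in Figure 6); the view name G2H is hypothetical, and the axioms in HOUSE are the ones stated or implied in the prose:

th GENERIC is sort Thing .
  ops object medium person : -> Thing .
  op _on_ : Thing Thing -> Bool .
  op _use_ : Thing Thing -> Bool .
endth

th HOUSE is sort Thing .
  ops house land resident : -> Thing .
  op _on_ : Thing Thing -> Bool .
  op _live-in_ : Thing Thing -> Bool .
  eq house on land = true .
  eq resident live-in house = true .
endth

*** the left leg of the V: object, medium, person map to
*** house, land, resident, and use maps to live-in
view G2H from GENERIC to HOUSE is
  op object to house .
  op medium to land .
  op person to resident .
  op (_ use _) to (_ live-in _) .
endv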
There are also some other, more surprising, blends for these two conceptual
spaces: one gives a boat for transporting houses, and another gives an amphibious
house! See Figure 7. The first blend (to the left in Figure 7) is dual to houseboat:
instead of the boat ending up in the house, the house ends up on the boat; there's
nothing strange about this except that we don't have any established word for
it, and it doesn't correspond to anything in (most people's) experience24. The
second blend (to the right in Figure 7) is more exotic, since the resulting object
can be either on land or on water, and the user both rides and lives in it. Although
no such thing exists in our world now, we can easily imagine some mad engineer
trying to build one. Now it is interesting to see which triangles commute for each
of these, and then to compare the naturalness of each blend with its degree of
commutativity. The left triangle of the first blend fails to commute (again just
dual to "boathouse"). For the second, although both its triangles commute, the
situation here is actually worse than if they didn't, because the injections fail
to preserve some of the relevant structure, namely the (implicit) negations of
relation instances, such as that the boat is not on land.
24 But we can easily imagine a construction project on an island where prefabricated
houses are transported by boat.

[Figure 7: the two further blends described above: to the left, a boat that transports houses (the house is on the boat, which moves on water); to the right, an amphibious house (hsbt) that its user both rides and lives in.]
The above is a good illustration of the very important fact that blends are not
unique. Ambiguity and its resolution are pervasive in natural language under-
standing. A word, phrase or sentence with an "obvious" meaning in one context,
or in isolation, can have a very different meaning in another context. What is
amazing is that we resolve ambiguities so effortlessly that we aren't even aware
that they existed, so that it takes some effort to discover the other possibilities
that were passed over so easily! For another example, Appendix A constructs a
context in which the old aphorism "Time flies like an arrow" undergoes a drastic
change of meaning, and also gives a formal specification of the conceptual spaces
involved, using the OBJ system [28] to compute the blend, parse the sentence,
and then evaluate it to reveal the "meaning". A different way to illustrate the
ambiguity of blends can be seen in the beautiful analyses done by Hiraga [37,
36] of haiku by the great Japanese poet Basho; she shows that several different
blends coexist for these haiku, and argues that this is a deliberate exploitation
of ambiguity as a poetic device.
Ambiguity also plays an interesting role in so-called "oxymorons" (like "mil-
itary intelligence"): these involve two different blends of two given words, one of
which has a standard meaning, and the other of which has some kind of conflict
in it. The second meaning only arises because the word "oxymoron" has been
introduced, and this deliberate creation of a surprising ambiguity is what makes
these a form of humor. For "military intelligence" the standard meaning is an
agency that gathers intelligence (i.e., information, especially secret information)
for military purposes, while the second, conflictual meaning is something like
"stupid smartness", playing off the common (but incorrect) prejudice that the
military are stupid, plus the more usual meaning of intelligence. A lot of hu-
6 Discussion
This paper has introduced algebraic semiotics, a new approach to user inter-
face design, cognitive linguistics, and other areas, based on a notion of sign
allowing complex hierarchical structure, thus elaborating Saussure's insight that
signs come in systems. Representations are mappings, or morphisms, between
sign systems, and a user interface is considered a representation of the underlying
functionality to which it provides access. This motivates a calculus for combining
signs, sign systems, and representations. One important mode of composition is
blending, introduced by Fauconnier and Turner, which is related to certain con-
cepts from category theory. The main contribution of this paper is the precision
that its approach can bring to applications. Building on an insight from com-
puter science, that discrete structures can be described by algebraic theories,
sign systems are defined as algebraic theories with some extra structure, and
semiotic morphisms are defined as mappings of algebraic theories that pre-
serve the extra structure to some extent; the quality of representations was found
to correlate with the degree to which structure is preserved.
When one sees concrete examples of sign systems like graphical user inter-
faces, it is easy to believe that these sign systems "really exist". It is amazing
how quickly and easily we see signs as actually existing with all their structure
"out there" in the "real world". Nevertheless, what "really exists" (in the sense
of physics) are the photons coming off the screen; the structure that we see is our
own construction. This paper provides a way to describe and study perceived
regularities, as modeled by sign systems, without claiming that these regularities
correspond to real objects, let alone that best possible descriptions exist for any
given phenomenon. This is consistent with ordinary engineering practice, which
constructs models for bridges, aircraft wings, audio amplifiers, etc. that are good
enough for the practical purpose at hand, without claiming that the models are
the reality, and indeed, with a deep awareness, based on practical experience,
that the models are definitely not adequate in certain respects, some known and
some unknown26. Another advantage of our approach is that it enables us to
avoid a lot of distracting philosophical problems, e.g., having to do with the
doctrine of realism.
The use of morphisms of theories for representations instead of morphisms of
models relates to the above point, in that we tend to think of models as finally
grounding the representation process in something \real", whereas morphisms
never claim more than to be re-representations, which may add more detail, but
do not exhaust all of the possibilities for description.
William Burroughs said language is a virus [7], meaning (for example) that
peculiarities of accent, vocabulary, attitude, disposition, confusion, neurosis, etc.
are contagious, and tend to spread within communities. Mikhail Bakhtin [3] em-
phasized that language is never a single homogeneous system, using the word
"heteroglossia". Paraphrasing Burroughs in the light of Bakhtin, we might say

26 For example, Hooke's law for the length of a spring as a function of the weight it is
holding fails if the weight is too heavy, because the spring will be damaged.
References
1. William P. Alston. Sign and symbol. In Paul Edwards, editor, Encyclopaedia of Philosophy, Volume 7, pages 437–441. Macmillan, Free Press, 1967. In 8 volumes; republished 1972 in 4 books.
2. Peter B. Andersen. Dynamic logic. Kodikas, 18(4):249–275, 1995.
3. Mikhail Bakhtin. The Dialogic Imagination: Four Essays. University of Texas at Austin, 1981.
4. Roland Barthes. S/Z: An Essay. Hill and Wang, 1974. Trans. Richard Miller.
5. Jon Barwise and John Perry. Situations and Attitudes. MIT (Bradford), 1983.
6. John Bowers. The politics of formalism. In Martin Lea, editor, Contexts of Computer-Mediated Communication. Harvester Wheatsheaf, 1992.
7. William S. Burroughs. The Adding Machine: Selected Essays. Arcade, 1986.
8. John Carroll. Learning, using, and designing filenames and command paradigms. Behavior and Information Technology, 1(4):327–346, 1982.
9. Alain Cohen. Blade Runner: Aesthetics of agonistics and the law of response. Il Cannocchiale, 3:43–58, 1996.
10. Gilles Fauconnier and Mark Turner. Conceptual projection and middle spaces. Technical Report 9401, University of California at San Diego, 1994. Dept. of Cognitive Science.
11. Gilles Fauconnier and Mark Turner. Blending as a central process of grammar. In Adele E. Goldberg, editor, Conceptual Structure, Discourse and Language, pages 113–129. CSLI, 1996.
12. Gilles Fauconnier and Mark Turner. Conceptual integration networks. Cognitive Science, 22(2):133–187, 1998.
13. Syd Field. Screenplay: The Foundations of Screenwriting. Dell, 1982. Third edition.
14. Dedre Gentner. Structure-mapping: A theoretical framework for analogy. Cognitive Science, 7(2):155–170, 1983.
15. Joseph Goguen. Semantics of computation. In Ernest Manes, editor, Proceedings, First International Symposium on Category Theory Applied to Computation and Control, pages 151–163. Springer, 1975. (San Francisco, February 1974.) Lecture Notes in Computer Science, Volume 25.
16. Joseph Goguen. What is unification? A categorical view of substitution, equation and solution. In Maurice Nivat and Hassan Aït-Kaci, editors, Resolution of Equations in Algebraic Structures, Volume 1: Algebraic Techniques, pages 217–261. Academic, 1989.
17. Joseph Goguen. A categorical manifesto. Mathematical Structures in Computer Science, 1(1):49–67, March 1991.
18. Joseph Goguen. Types as theories. In George Michael Reed, Andrew William Roscoe, and Ralph F. Wachter, editors, Topology and Category Theory in Computer Science, pages 357–390. Oxford, 1991. Proceedings of a Conference held at Oxford, June 1989.
19. Joseph Goguen. On notation (a sketch of the paper). In Boris Magnusson, Bertrand Meyer, and Jean-Francois Perrot, editors, TOOLS 10: Technology of Object-Oriented Languages and Systems, pages 5–10. Prentice-Hall, 1993. The extended version of this paper may be obtained from http://www.cs.ucsd.edu/users/goguen/ps/notn.ps.gz.
20. Joseph Goguen. Requirements engineering as the reconciliation of social and technical issues. In Marina Jirotka and Joseph Goguen, editors, Requirements Engineering: Social and Technical Issues, pages 165–200. Academic, 1994.
21. Joseph Goguen. Towards a social, ethical theory of information. In Geoffrey Bowker, Leigh Star, William Turner, and Les Gasser, editors, Social Science, Technical Systems and Cooperative Work: Beyond the Great Divide, pages 27–56. Erlbaum, 1997.
22. Joseph Goguen. Social and semiotic analyses for theorem prover user interface design, submitted for publication 1998.
23. Joseph Goguen. Theorem Proving and Algebra. MIT, to appear.
24. Joseph Goguen and Rod Burstall. Institutions: Abstract model theory for specification and programming. Journal of the Association for Computing Machinery, 39(1):95–146, January 1992.
25. Joseph Goguen, Kai Lin, Akira Mori, Grigore Rosu, and Akiyoshi Sato. Distributed cooperative formal methods tools. In Michael Lowry, editor, Proceedings, Automated Software Engineering, pages 55–62. IEEE, 1997.
26. Joseph Goguen, Kai Lin, Akira Mori, Grigore Rosu, and Akiyoshi Sato. Tools for distributed cooperative design and validation. In Proceedings, CafeOBJ Symposium. Japan Advanced Institute for Science and Technology, 1998. Numazu, Japan, April 1998.
27. Joseph Goguen and Charlotte Linde. Optimal structures for multi-media instruction. Technical report, SRI International, 1984. To Office of Naval Research, Psychological Sciences Division.
28. Joseph Goguen and Grant Malcolm. Algebraic Semantics of Imperative Programs. MIT, 1996.
29. Joseph Goguen and Grant Malcolm. A hidden agenda. Technical Report CS97–538, UCSD, Dept. Computer Science & Eng., May 1997. To appear in special issue of Theoretical Computer Science on Algebraic Engineering, edited by Chrystopher Nehaniv and Masami Ito. Early abstract in Proc., Conf. Intelligent Systems: A Semiotic Perspective, Vol. I, ed. J. Albus, A. Meystel and R. Quintero, Nat. Inst. Science & Technology (Gaithersburg, MD, 20–23 October 1996), pages 159–167.
30. Joseph Goguen and Grant Malcolm. Hidden coinduction: Behavioral correctness proofs for objects. Mathematical Structures in Computer Science, to appear 1999.
31. Joseph Goguen, Akira Mori, and Kai Lin. Algebraic semiotics, ProofWebs and distributed cooperative proving. In Yves Bertot, editor, Proceedings, User Interfaces for Theorem Provers, pages 25–34. INRIA, 1997. (Sophia Antipolis, 1–2 September 1997).
32. Joseph Goguen, James Weiner, and Charlotte Linde. Reasoning and natural explanation. International Journal of Man-Machine Studies, 19:521–559, 1983.
33. Robert Goldblatt. Topoi, the Categorial Analysis of Logic. North-Holland, 1979.
34. Martin Heidegger. Being and Time. Blackwell, 1962. Translated by John Macquarrie and Edward Robinson from Sein und Zeit, Niemeyer, 1927.
35. Masako K. Hiraga. Diagrams and metaphors: Iconic aspects in language. Journal of Pragmatics, 22:5–21, 1994.
36. Masako K. Hiraga. Rough seas and the milky way: 'Blending' in a haiku text. In Plenary Working Papers in Computation for Metaphors, Analogy and Agents, pages 17–23. University of Aizu, 1998. Technical Report 98-1-005, Graduate School of Computer Science and Engineering.
37. Masako K. Hiraga. 'Blending' and an interpretation of haiku: A cognitive approach. Poetics Today, to appear 1998.
38. Marina Jirotka and Joseph Goguen. Requirements Engineering: Social and Technical Issues. Academic, 1994.
39. William Labov. The transformation of experience in narrative syntax. In Language in the Inner City, pages 354–396. University of Pennsylvania, 1972.
40. George Lakoff and Mark Johnson. Metaphors We Live By. Chicago, 1980.
41. Saunders Mac Lane. Categories for the Working Mathematician. Springer, 1971.
42. Bruno Latour. Science in Action. Open, 1987.
43. Bruno Latour. Aramis, or the Love of Technology. Harvard, 1996.
44. John Lechte. Fifty Key Contemporary Thinkers. Routledge, 1994.
45. Eric Livingston. The Ethnomethodology of Mathematics. Routledge & Kegan Paul, 1987.
46. Grant Malcolm and Joseph Goguen. Signs and representations: Semiotics for user interface design. In Ray Paton and Irene Nielson, editors, Visual Representations and Interpretations. Springer Workshops in Computing, 1998. Proceedings of an international workshop held in Liverpool.
47. Jose Meseguer and Joseph Goguen. Initiality, induction and computability. In Maurice Nivat and John Reynolds, editors, Algebraic Methods in Semantics, pages 459–541. Cambridge, 1985.
48. Donald A. Norman. The Design of Everyday Things. Doubleday, 1988.
49. Charles Sanders Peirce. Collected Papers. Harvard, 1965. In 6 volumes; see especially Volume 2: Elements of Logic.
50. Eleanor Rosch. On the internal structure of perceptual and semantic categories. In T.M. Moore, editor, Cognitive Development and the Acquisition of Language. Academic, 1973.
51. Eleanor Rosch. Cognitive reference points. Cognitive Psychology, 7, 1975.
52. Harvey Sacks. On the analyzability of stories by children. In John Gumperz and Dell Hymes, editors, Directions in Sociolinguistics, pages 325–345. Holt, Rinehart and Winston, 1972.
53. Harvey Sacks. Lectures on Conversation. Blackwell, 1992. Edited by Gail Jefferson.
54. Ferdinand de Saussure. Course in General Linguistics. Duckworth, 1976. Translated by Roy Harris.
55. Ben Shneiderman. Designing the User Interface. Addison Wesley, 1997.
56. Susan Leigh Star. The structure of ill-structured solutions: Boundary objects and heterogeneous problem-solving. In Les Gasser and Michael Huhns, editors, Distributed Artificial Intelligence, volume 2, pages 37–54. Pitman, 1989.
57. Lucy Suchman. Plans and Situated Actions: The Problem of Human-Machine Communication. Cambridge, 1987.
58. Mark Turner. The Literary Mind. Oxford, 1997.
is the lowly article "an"! How does this happen? The "local space-time vector"
(whatever that is) prepares the reader for "an arrow", and then "time flies" are
introduced explicitly. These two conceptual spaces blend into another, where our
sentence gets its new interpretation; they share a subspace where a ship takes
realtime in a wormhole.
We describe these three conceptual spaces, form a blend, and then parse and
evaluate our sentence using the OBJ language (for more on OBJ and its under-
lying theory, see [28]), which is especially suitable because of its rich facilities for
combining theories. The keyword pair th...endth delimits OBJ modules that
introduce "theories", which allow any model that satisfies the axioms. The two
"pr SHIP" lines indicate importation of the theory SHIP in such a way that it is
shared; + tells OBJ to form a blend (which is actually their colimit in the sense of
Appendix B below), which is then named POUT as part of the make...endm con-
struct, which just builds and names a module. Predicates appear as Bool(ean)-
valued functions. Finally, red tells OBJ to parse what follows, apply equations
as left-to-right rewrite rules, and then print the final result (if there is one):
th SHIP is sort Thing .
  *** constants of the ship space
  ops (the ship) wormhole vector : -> Thing .
  *** predicates appear as Bool-valued operations
  op _in_ : Thing Thing -> Bool .
  op _makes_ : Thing Thing -> Bool .
  eq the ship in wormhole = true .
  var X : Thing .
  cq X makes vector = true if X in wormhole .
endth

th FLIES is pr SHIP .
  op time flies : -> Thing .
  ops (_like_)(_buzz around_) : Thing Thing -> Bool .
  eq time flies buzz around the ship = true .
  var X : Thing .
  *** conditional equation: time flies like anything equal to vector
  cq time flies like X = true if X == vector .
endth

th ARROW is pr SHIP .
  op an arrow : -> Thing .
  *** the shared subspace: an arrow is identified with the vector
  eq an arrow = vector .
endth
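The blend-forming and evaluation commands are not reproduced above; from the surrounding prose, they presumably ran roughly as follows (our reconstruction, not the paper's verbatim code):

make POUT is FLIES + ARROW endm

*** both sentences parse in the blend POUT, and both reduce to true
red time flies like an arrow .
red time flies buzz around the ship .

Since FLIES and ARROW both import SHIP with pr, the + shares SHIP, as the text explains.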
This shows OBJ parsing both sentences and then "understanding" that they are
"true"; note that neither sentence parses outside the blend. I hope the reader
is as pleased as the author at how easy27 all this is. Of course, we could get the
usual understanding of the sentence by evaluating it in a different context.

27 It took about 15 minutes to write the code, and less than a second for OBJ to
process it, most of which is spent on input-output, rather than on processing the
various declarations and doing the 6 applications of rewrite rules.
We now consider a somewhat more complex example, a proof that one
metaphor is better than another, under certain assumptions. The assumptions
are given in the five theories, the metaphors in the two views, and the proof in the
four reductions. The first metaphor, "The internet is an information tornado,"
comes from a press release from the Federal Communications Commission, while
the second, "The internet is an information volcano," comes from a poster that
the author of this paper prepared for a course on material in this paper at UCSD.
The keyword "us" (from "using") indicates importation by copying rather than
sharing, and *(op A to B) indicates a renaming of the operation A to become
B.
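As a hypothetical illustration of these two constructs (the module and operation names T1, T2, old, new are ours, not the paper's):

th T2 is us T1 *(op old to new) .
  *** T2 gets its own copy of T1, in which the
  *** operation old has been renamed to new
endth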
th COMMON is
th PROCESS is us COMMON .
  sort Volume .
  *** sort Agent comes from the imported theory COMMON
  ops subject process : -> Agent .
  *** flow of material between two agents
  op flow : Agent Agent -> Volume .
  ops low medium high huge : -> Volume .
endth
The OBJ3 output from this shows that the first two reductions give false
and the second two give true. This means that the first semiotic morphism does
not preserve the axioms (which concern the flow of material between the user
and the object, either tornado or volcano), while the second morphism does,
which implies that the second metaphor is better than the first with respect to
preserving these axioms. (On the other hand, the tornado metaphor resonates
with many common phrases such as "winds of change," which are part of our
culture, whereas we have less collective experience and associated language for
volcanos.)
Although this appendix is written under the assumption that readers already
know some basic category theory28, it is nonetheless essentially self-contained,
though terse, in order to fix notation for the new material. The essential intuition
behind categories is that they capture mathematical structures; for example,
sets, groups, vector spaces, and automata, along with their structure-preserving
morphisms, each form a category, and their morphisms are an essential part of
the picture.
Definition 4: A category C consists of: a collection, denoted |C|, of objects;
for each pair A, B of objects, a set C(A, B) of morphisms (also called arrows
or maps) from A to B; for each object A, a morphism 1A from A to A called the
identity at A; and for each three objects A, B, C, an operation called compo-
sition, C(A, B) × C(B, C) → C(A, C), denoted ";", such that f;(g;h) = (f;g);h
and f;1A = f and 1A;g = g whenever these compositions are defined. We write
f : A → B when f ∈ C(A, B), and call A the source and B the target of f. □
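Note that composition is written in diagrammatic order: for f : A → B and g : B → C, the composite f;g runs from A to C. As a concrete instance, in the category of sets with f(x) = x + 1 and g(y) = 2y, we get (f;g)(x) = g(f(x)) = 2(x + 1), i.e., f;g is the classical g ∘ f.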
Results in the body of this paper show that sign systems with semiotic mor-
phisms form a category. We will review the notions of pushout, cone and col-
imit for ordinary categories, relate this to blending, and then consider the more
general setting of 3/2-categories, which captures more of the phenomenology of
blending.
The intuition for colimits is that they put some components together, iden-
tifying as little as possible, with nothing left over, and with nothing essentially
new added [17]. This suggests that colimits should give some kind of optimal
blend. We will see that there are problems with this, so that the traditional
categorical notions are not quite appropriate for blending. Nevertheless, they
provide a good place to begin our journey of formalization.
28 See [33, 16, 17] for relatively gentle introductions to some basic ideas of category
theory; there are also many other papers and many other books.
diagram D′ having the same nodes as D. Commutative cones over D′ are then
cones over D that commute except possibly over the auxiliary morphisms. Now
we can also form a colimit of D′, to get a "best possible" such cone over D. It
therefore makes sense to define a blend to be a commutative cone over a diagram
with the auxiliary morphisms removed.
One advantage of formalization is that it makes it possible to prove general
laws, in this case, laws about blends based on general results from category
theory, such as that "the pushout of a pushout is a pushout." This result suggests
proving that "the blend of a blend is a blend," so that compositionality of the
kind of optimal blends given by pushouts follows from the above quoted result
about pushouts. The meaning of these assertions will be clearer if we refer to
the following diagram:
[Diagram: two diamonds stacked along the edge b2: the lower diamond has legs a2, a3 and cone b2, b3; the upper diamond has legs a1, b2 and cone c2, c3.]
Here we assume that b2, b3 is a blend of a2, a3, and c2, c3 is a blend of a1, b2, i.e.,
that a2;b2 = a3;b3 and a1;c2 = b2;c3; then the claim is that c2, b3;c3 is a blend
of a2;a1, a3, which follows because a2;a1;c2 = a3;b3;c3. Using the notation
a2 ◊ a3 for an arbitrary blend of a2, a3, we can write this result rather nicely in
the form

  a1 ◊ (a2 ◊ a3) = (a2;a1) ◊ a3 ,

taking advantage of a convention that a1 ◊ (a2 ◊ a3) indicates blending a1 with
the left injection of (a2 ◊ a3) (the top left edge of its diamond).
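Spelling out the claimed equality, using associativity of composition together with the two given diamonds:

  (a2;a1);c2 = a2;(a1;c2) = a2;(b2;c3) = (a2;b2);c3 = (a3;b3);c3 = a3;(b3;c3) ,

where the second step uses a1;c2 = b2;c3 and the fourth uses a2;b2 = a3;b3.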
The pushout composition result (proved e.g. in [33, 41]) states that if b2, b3 is
a pushout of a2, a3, and c2, c3 is a pushout of a1, b2, then c2, b3;c3 is a pushout
of a2;a1, a3. If we write a2 ⋈ a3 for the pushout of a2, a3, then this result can
also be written neatly, as

  a1 ⋈ (a2 ⋈ a3) = (a2;a1) ⋈ a3 .
We can also place a second blend (or pushout) on top of b3 instead of b2;
corresponding results then follow by symmetry, and after some renaming of
arrows can be written as follows:

  (a1 ◊ a2) ◊ a3 = a1 ◊ (a2;a3) ,
  (a1 ⋈ a2) ⋈ a3 = a1 ⋈ (a2;a3) .
We can further generalize to any pattern of diamonds: if they all commute, then
so does the outside figure; and if they are all pushouts, then so is the outside
figure. Another very general result from category theory says that the colimit
of any connected diagram can be built from pushouts of its parts. Taken all
together, these results give a good deal of calculational power for blending.
Now it's time to broaden our framework. The category of sign systems with
semiotic morphisms has some additional structure over that of a category: it is
an ordered category, because of the orderings by quality of representation that
can be put on its morphisms. This extra structure gives a richer framework
for considering blends; I believe this approach captures what Fauconnier and
Turner have called "emergent" structure, without needing any other machinery.
Moreover, all the usual categorical compositionality results about pushouts and
colimits extend to 3/2-categories.
Definition 6: A 3/2-category31 is a category C such that each set C(A, B) is
partially ordered, composition preserves the orderings, and identities are maxi-
mal. □
Because we are concerned here with ordered categories, a somewhat different
notion of pushout is appropriate, and for this notion, the uniqueness property is
(fortunately!) lost:
Definition 7: Given a V, ai : G → Ii (i = 1, 2), in a 3/2-category C, a cone b1, b2
over a1, a2 is consistent iff there exists some d : G → B such that a1;b1 ≤ d
and a2;b2 ≤ d, and is a 3/2-pushout iff given any consistent cone ci : Ii → C
over a1, a2, the set

  { h : B → C | b1;h ≤ c1 and b2;h ≤ c2 }

has a maximum element. □
Proposition 8: The composition of two 3/2-pushouts is also a 3/2-pushout.
Proof: Let b1, b2 be a 3/2-pushout of a1, a2, and let c1, c2 be a 3/2-pushout of a3, b1;
we will show that c1, b2;c2 is a 3/2-pushout of a1;a3, a2.
[Diagram for the proof: the two stacked 3/2-pushout diamonds, with legs a1, a2 and a3, b1, cones b1, b2 and c1, c2, a consistent cone d1, d2 with apex A, and mediating morphism h.]
31 In the literature, similar structures have been called "one and a half" categories,
because they are half way between ordinary ("one dimensional") categories and the
more general "two (dimensional)" categories.
[Diagrams: a square of four diamonds with legs a1, a2, cones b1, ..., b4 and c1, ..., c4, and its compression into a single diamond with legs a1;b1, a2 and cone c1, b3;c2.]
and applying Proposition 8 once more gives us that the big square is a 3/2-pushout. □
Passing from V's to arbitrary diagrams of morphisms generalizes 3/2-pushouts
to 3/2-colimits, and provides what seems a natural way to blend complex inter-
connections of meanings. The notion of consistent diamond extends naturally to
arbitrary diagrams, as follows:
Definition 10: Let D be a diagram. Then a family {φi}i∈|D| of morphisms
is D-consistent iff a;φj ≤ φi whenever there is a morphism a : i → j in D.
Similarly, given J ⊆ |D|, we say a family of morphisms {φi}i∈J is D-consistent
iff it extends to a D-consistent family {φi}i∈|D|. □
Fact 11: A diamond a1, a2, b1, b2 is consistent if and only if {b1, b2} is {a1, a2}-
consistent.
Proof: If the diamond is consistent then there is some d such that a1;b1 ≤ d
and a2;b2 ≤ d. But then {b1, b2, d} is {a1, a2}-consistent, i.e., {b1, b2} is {a1, a2}-
consistent. Conversely, if {b1, b2} is {a1, a2}-consistent, then some d exists such
that {b1, b2, d} is {a1, a2}-consistent, which says exactly that a1;b1 ≤ d and
a2;b2 ≤ d, i.e., that the diamond is consistent. □
Definition 12: Let D be a diagram. Then a family {φi}i∈|D| is a 3/2-colimit of
D iff it is a cone and, for any D-consistent family {ψi}i∈|D|, the set {h | φi;h ≤
ψi for each i ∈ |D|} has a maximum element. □
The following is another typical result that extends from ordinary colimits
to 3/2-colimits:
Theorem 13: Let a W diagram consist of two V's connected at the middle
top. If D is a W diagram, then a 3/2-colimit of D is obtained by taking a 3/2-pushout
of each V, and then taking a 3/2-pushout of those two pushouts, as shown below.
Proof: Let D contain the morphisms a1, a2, a3, a4; let b1, b2 be a 3/2-pushout of
a1, a2; let b3, b4 be a 3/2-pushout of a3, a4; and let c1, c2 be a 3/2-pushout of b2, b3.
Then we must show that the family of morphisms {b1;c1, a2;b2;c1, b2;c1, a3;b3;c2, b4;c2}
is a 3/2-colimit of D.
[Diagram for the proof: the W diagram with legs a1, a2, a3, a4, the 3/2-pushout cones b1, b2 and b3, b4, the 3/2-pushout c1, c2 of b2, b3, a D-consistent family d1, ..., d5 with apex A, and mediating morphisms h1, h2, h3, and h.]
has had a much more philosophical focus. As a result, a great deal of philosophical
discussion could be generated concerning the heretical approach of this paper.
This appendix confines itself to just a few points that seem to have some practical
significance.
Today humanists of nearly all schools reject the notion that some kind of
"Cartesian coordinates" can be imposed on experience, despite partial evidence
to the contrary from fields like linguistics and music. This rejection is understand-
able as a reaction to the scientistic reductionism that nearly always accompanies
projects to impose structure on experience. Such tendencies are deeply ingrained
in Western civilization, going back at least to Pythagoras and Plato. But evi-
dence from a wide range of fields now makes it clear that traditional reductionism
has serious limitations. The following are brief descriptions of some better known
examples:
1. Work on mechanical speech recognition has shown that contextual informa-
tion is essential for determining what phoneme some raw acoustic waveform
represents (if anything); this contextual information may include not just
prior but also subsequent speech, a profile for the individual speaker (accent,
eccentricities, etc.), the topic of discourse, and much more, up to arbitrary
shared cultural knowledge.
2. In music, the same acoustic event in a different context can have a radically
different impact, ranging from ugly and incongruous, to great beauty and
elegance. Moreover, the background of the listener is crucial; for example,
naive listeners have little chance of appreciating the subtleties and beauties
of Cecil Taylor or Ornette Coleman, however familiar with theories of psycho-
acoustics and harmony they might be.
3. Similar things happen in cinema and poetry, and indeed any art or craft,
from architecture and interior design, to basket weaving, pottery, and flower
arranging. Often a great deal of cultural context is needed to appreciate
(in any deep sense) a single artifact; buildings, rooms, baskets and pots are
used by ordinary people in their ordinary lives, as part of the complex social
fabric. The \Gucci" label on a purse is not lovely in itself, but nonetheless
it has a meaning to those who go out of their way to acquire it. A brightly
colored postmodern bank building in Lisbon has a complex cultural meaning
that does not transfer to Paris, London, or New York.
4. Despite the stunning success of applying simple atomic theory to basic molec-
ular chemistry, physics has found it necessary to postulate nonlocalized quan-
tum fields to explain many important phenomena, some of which appear even
in applied chemistry, to say nothing of more rarefied areas.
5. Metamathematics has had great success in formalizing mathematics, and in
studying what is provable. But its greatest successes have been results, like
Gödel's incompleteness theorem, that demonstrate the limitations of formal-
ization. Moreover, formal proofs lack the comprehensibility, and the human
interest, of well done informal proofs. See Appendix D for more discussion
along these lines, demonstrating the importance of context for making proofs
\come alive."
Returning now to our main point, there is a justifiable opposition to totalizing
reductionist structuralist systems, while at the same time, there is the utterly
pervasive presence of structured signs. What are we to do about this seemingly
contradictory situation?
Two alternatives have been most explored, each with some valuable results.
The rst is to pursue the quest for structure, digging deeper wherever it seems to
work, and avoiding the (very many) areas where things just seem too slippery to
admit much precision. This inevitably results in a partial view, which is open to
criticism in various ways (as post-structuralism has criticized the structuralism
of Saussure, Levi-Strauss, Barthes, etc.). The second alternative is to abandon
structure and work with intuitive experiences and descriptions (some currently
fashionable words are "rich," "nuanced," "textured," and "postmodern"). This
too inevitably results in a partial view, which in the extreme avoids criticism by
refusing to be pinned down, even to the extent of using inconsistent, incoher-
ent language. Though both are extreme positions, it seems difficult to find a
clear, consistent, defensible middle ground. (A general reference for continental
philosophy is [44].)
It seems to me that ethnomethodology provides some valuable hints on a
way out of this impasse. Often presented as a principled criticism of traditional
sociology, especially its normative category schemes (gender, race, status, etc.),
ethnomethodology can perhaps better be seen positively as an approach to un-
derstanding social phenomena (such as signs!) by seeing how members of some
group come to see those signs as present. Thus, ethnomethodology wants to know
what categories the members of a social group use, and what methods they use
to determine instances of those categories. This requires careful attention to
real social interaction, and avoids the Platonist assumption that the categories
have a pre-given existence "in nature." Rather, we see how members of a group
achieve categorization in actual practice, without having to give either the
categories or their instances any status other than what has been achieved in
a particular way at a particular time. The branch of ethnomethodology called
conversation analysis has taken a rather radical approach to the social context
of language, showing that even simple features such as whose turn it is to speak
are always negotiated in real time by actual social groups [52, 53], and should
not be considered as given. Words like "reification" and "transcendentalizing"
are used to describe approaches that take the opposite view. (Of course, any one-
paragraph description of ethnomethodology is necessarily a gross oversimplifica-
tion; more information may be found in [57] and [21] among many other places,
some of which may be very difficult to read.)
Although this paper is not the place to discuss it, phenomenology has also
been an important influence on our formulation of a philosophical foundation for
semiotics, particularly in its insistence that the only possible starting point is
the ground of our own actual experience, with all metaphysical principles firmly
bracketed.
The sign, object, interpretant triad of classical semiotics (Peirce, Morris, Eco,
etc.) presupposes an objective world, whereas our morphic semiotics is consistent
with the view that mind (usually unconsciously) constructs models by selecting
and blending (abstractions from) immediate and past experience, using (e.g.)
templates derived from embodied motion [40], so that what we see as \objects"
are actually parts of these models. This does not deny that a \world" exists,
but it does deny that we experience it directly. As Heidegger observed, we come
closest to experiencing \reality" when our models break down [34]. Similarly, we
may reinterpret the syntax, semantics, pragmatics triad of classical semiotics,
by claiming that its instances can probably be better understood through the
use of semiotic morphisms.
The above ideas suggest various ways to avoid the extremes of mindless
reductionism and mindless holism. The most straightforward approach is to ad-
mit that while each individual analysis no doubt has biases and limitations, it
nonetheless embodies certain structures, values, insights, etc. A given analysis,
if it is clear, coherent and consistent, can be formalized, and may have some
value as such; for example, its limitations will be easier to spot. Such an anal-
ysis should not pretend to be objective, factual, complete, universal, or even
self-contained; it is a momentary snapshot of a partial understanding of one (or
more) interested party, and of course, can only be understood by other interested
parties who have a more or less comparable background. It has frozen out the
fluid processes of interpretation that actually produced the understanding.
The previous paragraph may claim too little, because sometimes analyses
can have great impact, with broad acceptance, important applications, etc., e.g.,
Newtonian mechanics32. However, this paper is not the place to try to understand
why some analyses may work better than others in some given social context.
It is enough for our purposes that analyses exist, exhibit structure, and can be
formalized, without requiring a totalizing, reductionist, or realist stance.
D What is a Proof?
Mathematicians talk of "proofs" as real things. But all we can ever actually find
in the real world of actual experience are proof events, or "provings", each of
which is a social interaction occurring at a particular time and place, involv-
ing particular people, who have particular skills as members of an appropriate
mathematical social community.
A proof event minimally involves a \proof observer" with the relevant back-
ground and interest, and some mediating physical objects, such as spoken words,
gestures, hand written formulae, 3D models, or printed words, diagrams or for-
mulae. But none of these can be a \proof" by itself, because each must be
interpreted in order to come alive as a proof event.
The efficacy of some proof events depends on the marks that constitute a
diagram being seen to be drawn in a certain order; e.g., Euclidean geometric
proofs, and commutative diagrams in algebra; in some cases, the order may not
be easily inferred from just the diagram. Therefore we must generalize from proof
objects to proof processes, such as diagrams being drawn, movies being shown,
and Java applets being executed.

32 We should not forget that, according to today's science, Newtonian mechanics, de-
spite its tremendous utility, is not a correct physical theory, but only a practical
approximation that holds within certain (not entirely well specified) limits.
Mathematicians habitually and professionally reify, and it seems that what
they call proofs are idealized Platonic \mathematical objects," like numbers,
that cannot be found anywhere on this earth. So let us agree to go along with
this confusion (I almost wrote "joke") and call any object or process a "proof" if
it effectively mediates a proof event, not forgetting that an appropriate context
is also needed. Then perhaps surprisingly, almost anything can be a proof! For
example, 3 geese joining a group of 7 geese flying north is a proof that 7 + 3
= 10, to an appropriate observer. Peirce's notion of semiosis takes a cognitive
view of examples like this, placing emphasis on a sign having a relation to an
interpretation.
Notice that a proof event can have many different outcomes. For a mathe-
matician engaged in proving, the most satisfactory outcome is that all partici-
pants agree that "a proof has been given." Other outcomes may be that most
are more or less convinced, but want to see some further details; or they may
agree that the result is probably true, but believe there are significant gaps; or
they may think that the proof is bad and the result is false. And of course, some
observers may be lost or confused. In real provings, outcomes are not always
just 'true' or 'false'. Moreover, a group of proof observers need not agree among
themselves, in which case there may not be any definite socially negotiated "out-
come" at all!
Going a little further, the distinction between a proof giver and a proof
observer is often artificial or problematic; for example, a group of mathematicians
working collaboratively on a proof may argue among themselves about whether
or not some given person has contributed substantively to \the proof". Hence
we should speak of \proof participants", however they happen to be distributed
in space and time, and be aware that the nature of their participation is subject
to social negotiation, like everything else.
The above deconstruction of \proofs" as objectively existing real things is
only the rst part of a more complex story. In addition to a proof object (or
process), certain practices (also called methods) are needed to establish an in-
terpretation of a proof object as a proof event. For example, to interpret the
flying geese as a proof about addition requires a practice of counting. This runs
counter to the tendency, in mathematics as well as in literature and linguistics,
to insist on the "primacy of the text", ignoring the practices required to bring
texts to life, as well as the communities that embody those practices.
In fact, practices and their communities are at least as important as proof
objects; in particular, it is clear that they are indispensable for interpreting some
experience as a proof; if you can't count, then you can't see goose patterns as
proofs, and if you haven't been taught about the numerals `7', `3', `10', then you
can't explain your proof to the decimal digit speaking community. Of course, this
line of thought takes us further from the objective certainties that mathematics
likes to claim, but if we look at the history of mathematics, it is clear that there
have been many different communities of proving practice; for example, what
we call "mathematical rigor" is a relatively very new viewpoint, and even within
it, there are various competing schools, including formalists, intuitionists and
constructivists, each of which itself has many variants. Moreover, the availability
of calculators and computers is even now once more changing mathematical
practice.
Mathematical logic restricts attention to small sets of simple mechanical
methods, called rules of inference, and claims that all proofs can be constructed
as finite sequences of applications of such rules. While this approach is appro-
priate for foundational studies, and has been interesting and valuable in many
ways, it is far from capturing the great diversity and vital living quality of natural
proofs.
Unfortunately, we lack the detailed studies that would reveal the full richness
of mathematical practice, but it is already clear that proof participants bring a
tremendous variety of resources to bear on proof objects (see [45] for an excellent
discussion). For example, a discussion among a group of mathematicians at a
blackboard will typically involve the integration of writing, drawing, talking
and gesturing in real time multimedia interaction. In at least some cases, this
interaction has a high level \narrative" structure, in which sequentially organized
proof parts are interleaved with evaluation and motivation in complex ways.
Aristotle said "Drama is conflict", meaning that the dramatic interest, or
excitement, of a play comes from conflict, that is, from obstacles and difficulties.
Anyone who has done mathematics knows that many difficulties arise. But the
way proofs are typically presented hides those difficulties, showing only the spe-
cialized bulldozers, grenades, torpedos, etc. that were built to eradicate them.
Thus reading a conventional proof can be a highly alienating experience, since it
is difficult or impossible to understand why these particular weapons have been
deployed. No wonder the public's typical response to mathematics is something
like "I don't understand it. I can't do it. I don't like it". I believe that mathe-
maticians' systematic elision of conflict must take a significant part of the blame
for this. (Note the military metaphor used above; it is suggestive, and also very
common in mathematical discourse.)
So-called "natural deduction" (due to Gentzen) is a proof structure with
some advantages, but it is very far from "natural" in the sense of being what
provers do in natural settings; natural deduction presents proofs in a purely top
down manner, so that, for example, lemmas cannot be proved before they are
used. We need to move beyond the extreme poverty of the proof structures that
are traditional in mathematical logic, by developing more flexible and inclusive
structures. A first step towards accommodating conflict in proofs might be to
allow alternative proofs that are incomplete, or even incorrect. For example, to
show why a lemma is needed, it is helpful to first show how the proof fails without
it; or to show why transfinite induction is needed, it may help to show how
ordinary induction fails. A history of attempts to build a proof records conflicts,
and hence reintroduces drama, which can make proofs more interesting and less
alienating. Of course, we should not go too far with this; no proof reader will
want to see all the small errors a proof author makes, e.g., bad syntax, failure
to check hypotheses of a theorem before applying it, etc. As in a good movie,
conflict introduction should be carefully structured and carefully timed, so that
the clarity of the narrative line is not lost, but actually enhanced. The tatami
system, which embodies many of these ideas, is described in [25, 26], and more
detail on the application of ideas in this paper to that system can be found in
[31, 22]; for a less formal introduction to some of the ideas of algebraic semiotics,
see also [46].
The narrative structures of natural proofs seem to have much in common
with cinema: there is a hierarchical structuring (of acts, scenes, shots in cinema,
and of proof parts in mathematics); there are flashbacks and flashforwards; there
is a rich use of multimedia; etc. The traditional formal languages for proofs are
also very impoverished in the mechanisms they provide for structuring proofs
into parts, and for explaining these structures and parts. Probably we could learn
much about how to better structure proofs by studying movies, because a movie
must present a complex world, involving the parallel lives of many people, as a
linear sequence of scenes, in a way that holds audience interest, e.g., see [9]. No
doubt there are many other exciting areas for further exploration in our quest
to improve the understandability of proofs. Success in this quest could have a
significant impact on mathematics education, given the impending pervasiveness
of computers in schools, and the mounting frustration with current mathematical
education practices.
(The essay in this appendix was in part inspired by remarks of Eric Liv-
ingston, whom I wish to thank, though I may still have got it wrong. The re-
marks on narrative draw on detailed studies by the sociolinguist William Labov
[39]. See [21] for some related discussion and background.)
An Algebraic Approach to Modeling Creativity
of Metaphor
Bipin Indurkhya
Classical model theory studies the properties of and relations between different
models of a given theory. A similar approach is used in most other formaliza-
tions of semiotics (Goguen 1997). This situation is depicted in Figure 1 below.
However, to understand creativity of metaphor, we need to reverse our stand-
point and consider different theories of the same model. For example, in the
painting-as-pumping metaphor mentioned above, one would like to see how the
pumping theory restructures the painting model. In the Seascape example, we
would like to be able to describe how the harp and its related concepts (which
could be considered a theory) restructure the experiential datum (the model) of
the ocean. This situation is depicted in Figure 2.
To avoid the confusion between two senses of ‘model’: one referring to mod-
eling creativity in metaphor, and the other to the model of a theory, we will
henceforth use the term environment to refer to the model of a theory. Thus,
Figure 1 should be read as ‘Focus on multiple environments of a theory’ and
Figure 2 as ‘Focus on multiple theories of an environment’. We believe that in
order to model creativity of metaphor we must focus on Figure 2, and study how
different theories can conceptualize the same environment differently.
[Figures 1 and 2: Figure 1 shows one theory (sign system) over multiple models (environments) model 1, ..., model N; Figure 2 shows one model (environment) under multiple theories theory 1, ..., theory N.]
Fig. 1: Focus on multiple models of a theory. Fig. 2: Focus on multiple theories of a model.
Many theories and their cognitive relations are inherited, biologically or cul-
turally, or learned as we grow up. We may call these conventional cognitive
relations. These cognitive relations structure our environment in various ways,
and it is this structured environment that we live in and interact with. How-
ever, in certain situations, it becomes necessary to form new cognitive relations.
A prime example of such situations is metaphor. In metaphor, a new cognitive
4 Some Examples
We now present a few examples to illustrate our approach. The first example
is from the Copycat domain pioneered by Hofstadter (1984), which concerns
proportional analogy problems between letter strings, as in:
This domain may seem rather simple at first but in fact, as Hofstadter has
shown, a number of rich and complex analogies can be drawn in it. In particular,
the Copycat domain is quite suitable for demonstrating the context effect, ac-
cording to which an object needs to be represented differently depending on the
context, thereby revealing the limitations of fixed-representation approaches. For
instance, in the analogy problems (2) and (3) below, the first term of the analogy
(abba) is the same, but it needs to be given a different representation to solve
each problem: for analogy (2), abba needs to be represented as a symmetrical
object, with the string ab reflected and appended to itself; and for analogy (3)
it needs to be seen as an iterative structure, namely two copies of b, flanked by
the same object, namely a, on either side.
odd symmetry, for it has a pivot point in the middle ‘b’ — the representation
algebras that generate the minimum information load gestalts for each of these
terms individually have mostly different elements, and so when we combine them
to get the representation algebra that can generate both the terms, the complex-
ity of the resulting algebra is almost cumulative. However, if we represent ‘abba’
and ‘abbbbba’ as iterative structures, then their individual representation al-
gebra have a high degree of overlap, so that the complexity of the combined
representation algebra remains almost the same. Fuller details of our approach
can be found in Dastani, Indurkhya and Scha (1997), and Dastani (1998).
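As an illustrative sketch (ours, not the authors'; the sort and operation names are hypothetical), a representation algebra that generates both 'abba' and 'abbbbba' as flanked iterative structures might be specified in OBJ along these lines:

th STRINGS is sort Str .
  ops a b : -> Str .
  op __ : Str Str -> Str [assoc] .  *** juxtaposition
  op flank : Str Str -> Str .
  vars X Y : Str .
  eq flank(X, Y) = X Y X .
endth

Here flank(a, b b) rewrites to a b b a, and flank(a, b b b b b) to a b b b b b a, so the two strings are generated from the same small stock of elements; this shared structure is the overlap in their representation algebras that the text describes.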
[Figure 3: two rows of proportional analogies between geometric figures, each of the form A : B :: C : D.]
Fig. 3: Two examples of proportional analogy relations A is to B as C is to D involving geometric
figures. Notice that the terms A and B are the same in each example, yet different figures for the C term
force a different way of decomposing figures A and B.
The gestalts of various closed figures, like ‘triangle’, and their structural configurations are
essential to understanding the analogies.
What seems necessary here is to provide a sufficiently low-level description
of the figures (say, in terms of line segments and arcs), and a rich repertoire of
operators and gestalts that allow one to build different higher-level structured
representations from these low-level descriptions. For the examples in Fig. 3, we
need the gestalts of ‘triangle’, ‘hexagon’, ‘ellipse’, etc.; and operators like ‘invert’
(turn upside down), ‘juxtapose’, ‘rotate-clockwise’, and so on. A structured rep-
resentation using these gestalts and operators essentially shows how the figure
can be constructed from the line segments and arcs.1 Needless to say, there are
many ways to construct each figure, so there are many corresponding structured
representations.
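The point that one figure admits many structured representations can also be sketched in code. In the following toy Python fragment (the gestalt and operator names are our own illustrative assumptions), a term evaluates to the multiset of low-level strokes it draws, so two different decompositions can be checked to construct one and the same figure:

```python
from collections import Counter

# A figure evaluates to the multiset of primitive strokes it draws;
# gestalts and operators are just different ways of grouping them.

def segment(p, q):
    return Counter({("segment", min(p, q), max(p, q)): 1})

def group(*parts):                 # a gestalt: a named grouping of parts
    return sum(parts, Counter())

juxtapose = lambda f1, f2: f1 + f2   # place two sub-figures together

# A 'bow-tie': two triangles sharing the apex (0, 0).
left  = group(segment((-2, 1), (0, 0)), segment((-2, -1), (0, 0)),
              segment((-2, 1), (-2, -1)))
right = group(segment((2, 1), (0, 0)), segment((2, -1), (0, 0)),
              segment((2, 1), (2, -1)))
rep1 = juxtapose(left, right)      # decomposition 1: triangle + triangle

# Decomposition 2: a 'cross' of four slanted strokes plus two vertical bars.
cross = group(segment((-2, 1), (0, 0)), segment((0, 0), (2, -1)),
              segment((-2, -1), (0, 0)), segment((0, 0), (2, 1)))
bars  = group(segment((-2, 1), (-2, -1)), segment((2, 1), (2, -1)))
rep2  = juxtapose(cross, bars)

print(rep1 == rep2)                # True: distinct terms, the same figure
```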
Thus the heart of the problem, in this approach, lies in searching for a struc-
tured representation that is most appropriate in a given context. As representa-
tions correspond to algebraic terms, it means we must find suitable representa-
tion algebras for each of the figures — where ‘suitability’ must take into account
complexity of representation algebras, complexity of representations, existence
of an isomorphic mapping between representation algebras, and the complexity
of this mapping. We must emphasize two somewhat unusual aspects of our ap-
proach here. One is that we require a mapping between representation algebras,
and not between representations themselves, to capture the analogical relation.
The reason for this is that a mapping between representation algebras is more
robust with respect to trivial changes of representation — such as ones arising
from symmetry or transitivity of operators. The second distinctive feature is that
we require an isomorphism rather than a homomorphism. However, as explained
above in Section 3, this by no means constitutes a limitation of our approach;
on the contrary, it focuses attention on the isomorphism underlying each homo-
morphism. (See Indurkhya 1991 for a further elaboration of these issues and a
formally worked out example.)
The next example we would like to present, taken from Indurkhya (1997b),
concerns modeling a certain kind of creative arguments in legal reasoning. Very
briefly, the example is about a college professor, Weissman, who deducted the
expenses of maintaining an office at home from his taxable income. A precedent
that was helpful to Weissman’s arguments was the case of a concert violinist,
Drucker, who was allowed to claim home-office deduction for keeping a studio
at home where he practiced. However, the Revenue Service tried to distinguish
Weissman from Drucker on the grounds that Drucker’s employer provided no
space for practice, which is obviously required of a musician, whereas Weissman’s
employer provided an office (a shared one). The judges, however, ruled that
Weissman’s employer provided no suitable space for carrying out his required
duties (the office, being a shared one, was not safe for keeping books and other research material), just as Drucker’s employer provided no suitable space for Drucker to practice.
1 It should be noted here that the algebra corresponding to this domain would be like the algebraic specification of any drawing or graphics program such as Superpaint. In any such graphics program, the user can create various objects on the screen, group them in certain ways to create different gestalts, and apply a variety of operations on them.
The key issue in modeling this argument is how to specialize the category ‘no space provided by the employer’ to ‘no suitable space provided by the employer’,
because the former distinguishes Weissman from Drucker, but the latter cate-
gory allows Drucker to be applied to Weissman. We have argued that the new
category can be obtained from other precedents. In this example, there was an-
other precedent, Cousino, a high-school teacher who was denied home-office tax
deduction, because the judges argued that his employer provided him a suitable
space for each task for which he was responsible. A very interesting aspect of this example is that Cousino and Drucker, when individually applied to Weissman, each lead to a decision against Weissman; but when Cousino is used to reinterpret Drucker, and the reinterpreted Drucker is then applied to Weissman, a decision in favor of Weissman can be obtained.
In modeling this argument in our approach, the environment level is as-
sociated with the facts of a case, and the model or theory level is associated
with the rationale for the decision of the case (Hunter and Indurkhya 1998).
For example, facts of the Cousino case would include: ‘employer-of (Cousino) = XYZ’, ‘high-school (XYZ)’, ‘responsible (Cousino, teach)’, ‘responsible (Cousino, grade-papers)’, ‘provided (Cousino, XYZ, classroom)’, ‘provided (Cousino, XYZ, staff-room)’, ‘suitable-for (classroom, teaching)’, ‘suitable-for (staff-room, grade-papers)’, etc. Notice that because the facts are themselves composed of linguistic
and abstract categories, we need to allow predicates and relations in the envi-
ronment algebra.
The rationale of the case, in this example, would consist of a complex term
(we mean algebraic term here) ‘employer provided suitable space for the tasks for
which the employee is responsible’. As this is a precedent, that has already been
decided, the terms of the rationale level would already be connected to the facts
level (meaning that a cognitive relation exists). This already shows the grouping
phenomenon, and how the facts level seems isomorphic to the rationale level.
The object ‘tasks’ at the rationale level is connected to different objects at the
facts level, including ‘teach’, ‘grade-papers’, ‘prepare-lessons’, ‘talk-to-parents’,
etc. So all these activities are grouped together and are seen as a unit from the
rationale level. Also, many facts at the facts level are not considered relevant,
and so are not connected to anything at the rationale level. Nonetheless, it is
necessary to keep these facts, for they may become necessary in reinterpreting
the Cousino case, which is precisely what happens when Cousino is applied to
reinterpret Drucker.
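A minimal sketch of this two-level organization may be helpful. In the Python fragment below, the fact predicates are those quoted above (with ‘teaching’ normalized to ‘teach’); the dictionary encoding, the grouping under ‘tasks’ and ‘spaces’, and the checking function are our own illustrative assumptions:

```python
# Facts level of the Cousino precedent, encoded as predicate tuples.
facts = {
    ("employer-of", "Cousino"): "XYZ",
    ("high-school", "XYZ"): True,
    ("responsible", "Cousino", "teach"): True,
    ("responsible", "Cousino", "grade-papers"): True,
    ("provided", "Cousino", "XYZ", "classroom"): True,
    ("provided", "Cousino", "XYZ", "staff-room"): True,
    ("suitable-for", "classroom", "teach"): True,
    ("suitable-for", "staff-room", "grade-papers"): True,
}

# Rationale level: 'employer provided suitable space for the tasks for
# which the employee is responsible'. The cognitive relation groups
# several facts-level objects under one rationale-level object.
cognitive_relation = {
    "tasks":  ["teach", "grade-papers"],     # grouped, seen as one unit
    "spaces": ["classroom", "staff-room"],
}

def rationale_holds(facts, rel, employee, employer):
    """Check: for every task, some provided space is suitable for it."""
    return all(
        any(facts.get(("provided", employee, employer, s))
            and facts.get(("suitable-for", s, t))
            for s in rel["spaces"])
        for t in rel["tasks"]
        if facts.get(("responsible", employee, t))
    )

print(rationale_holds(facts, cognitive_relation, "Cousino", "XYZ"))  # True
```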
In applying the rationale of Cousino — which contains the gestalt ‘suitable
space’ — to the facts of Drucker, a new rationale and a new cognitive relation
between the rationale and the facts levels of the Drucker case emerges. Using
this new rationale, the facts of the Weissman case can also be organized in such
a way that a decision favorable to Weissman can be obtained, and moreover,
Drucker can be cited as a precedent to support this argument.
Our final example concerns linguistic metaphor, and is taken from a certain translation of the Bible. As Stephen is persecuted for spreading the teachings of Jesus, he rebukes his persecutors as being, among other things, ‘uncircumcised in heart and ears’ (Acts 7:51). It is this phrase, ‘uncircumcised in heart and ears’, that we would like to focus on. Now
several gestalt descriptions (or algebraic terms) can be associated with ‘circum-
cised’: for example ‘surgically removing prepuce’, ‘purify spiritually’, etc. Note
that these descriptions themselves contain gestalts like ‘prepuce’, ‘purify’, which
can be further decomposed into other gestalts. However, at some point, we have
to try to interpret the gestalt descriptions by finding similar operations in the
context of ears and heart. For example, ‘surgically remove’ is an operation ap-
plied to ‘prepuce’, so we have to find a similar operation that can be applied
to some part of the ear. This process may require creating imagery for ear (and
possibly for circumcision as well) using perceptual knowledge about it. Perhaps
the gestalt that is easiest to interpret is ‘purify’ or ‘cleanse’, which means ‘un-
circumcised’ would correspond to ‘unclean’ (negation operation is applied). But
‘unclean’ for ears could suggest ears plugged up by earwax, for example, so that
the person cannot hear the message.
Finding the right gestalt of ‘uncircumcised’ to interpret in the context of
‘heart’ is more complex, because ‘heart’ itself is used metaphorically, not for
the physical organ that pumps blood, but for feelings and understanding. Here
one can perhaps construct an image where something that is unclean cannot
receive new ideas or impressions (e.g. adding a new tint to the dirty water),
and the person with the unclean heart does not see what is the truth according
to Stephen. There may also be the association that as circumcision requires a
surgical procedure, something drastic needs to be done to purify the heart.
We should add that all this analysis is done from a viewpoint that is outside
of the Bible, for when viewed within the Bible, circumcision is a dead or a
conventional metaphor (e.g. ‘Circumcise yourselves to the Lord’, Jeremiah 4:4).
Also, in some other translations a more literal approach is taken:
“ ‘How stubborn you are!’ Stephen went on to say, ‘How heathen your
hearts, how deaf you are to God’s message! You are just like your an-
cestors: you too have always resisted the Holy Spirit!’ ” (Acts 7:51. The
Good News Bible. The Bible in Today’s English Version translated and
published by the United Bible Societies, 1976.)
5 Related Research
In the last twenty years or so there has been much interest in metaphor, and
many researchers from different disciplines have approached the problem from
various angles. Our approach outlined here is based on the insights of Max
Black (1962; 1979) and Nelson Goodman (1978), among others. However, because these ideas were not spelled out precisely, they have often been misunderstood. We already mentioned above that Black has been unfairly criticized for claiming that there is an isomorphism underlying every metaphor. Black has also been inconsistent on the symmetry of metaphor: at times suggesting
that metaphors may be symmetrical, while in most places his account is clearly
asymmetrical. This again has caused some needless misunderstanding (see, for
example, Lakoff & Turner 1989, pp. 131–133). Our approach towards formalizing their insights and extending them further, we hope, dispels many of these
misunderstandings.
The research on metaphor and its role in organizing our conceptual sys-
tem has received a huge impetus from the work of George Lakoff and his col-
leagues (Lakoff & Johnson 1980; Lakoff 1987). While the empirical data they
have amassed to demonstrate how metaphors pervade our everyday life and
discourse are indeed impressive, their attempts to explain how a metaphor can
reorganize the topic and create new features in it are fraught with contradictions.
In some places they claim that certain topic domains derive their structure pri-
marily through metaphors, and they do not have a pre-metaphorical structure.
At other places they imply that the topic constrains the possible metaphorical
mappings and creation of feature slots. (See also Indurkhya 1992, pp. 78–84,
pp. 124–127.) We believe that our formal approach clearly resolves this appar-
ent paradox of how metaphor can restructure the topic, and yet it is not the
case that anything goes.2
More recently, Gilles Fauconnier and Mark Turner have introduced a the-
ory of conceptual blending (see, for example, Turner & Fauconnier 1995), which
introduces a multiple space model. However, their theory works primarily with
concepts, showing how concepts from many spaces blend together to produce
metaphorical meanings. While we acknowledge that the multiple-space model
does indeed come close to the way real-world metaphors work, we also feel that
it is crucial to involve the object or the situation (what we have been calling the
environment) in the interaction. Without incorporating this orthogonal compo-
nent, we believe, the creativity of metaphor cannot be accounted for satisfacto-
rily. Thus, in our view the approach presented here supplements the conceptual
blending theory, and in the future we expect to broaden it by considering how
multiple environments and multiple theories interact together to produce new
meanings.
2 On the formal side, Goguen (1997) has embarked on an ambitious project to develop a formal framework for systems of signs and their representations. However,
we believe that the mechanisms proposed here would have to be incorporated in
the semiotic morphisms of Goguen in order to be able to account for creativity in
metaphor. We must add, though, that this kind of creative restructuring is neither always required nor always desirable. Therefore, there may well be many situations where semiotic morphisms that do not allow restructuring would work just fine. But a more comprehensive framework would have to allow the possibility of restructuring.
6 Conclusions
In this paper we have focused on the problem of how metaphor can restructure an
object or a situation, and create new perspectives on it. With this goal in mind,
we outlined some algebraic mechanisms that can be used to model creativity
and restructuring of metaphor. Needless to say, the approach presented here is
merely a step towards a fuller understanding of the creativity of metaphor. First
of all, the model, as it is, needs to be elaborated considerably, and computational mechanisms need to be developed to implement its different components.
For example, elsewhere (Indurkhya 1997b) we have suggested a blackboard archi-
tecture for modeling interaction between a cognitive model and an environment
in the domain of legal reasoning. Secondly, the approach needs to be expanded
to incorporate language, communication between agents, and so on. Obviously,
all these issues will keep us busy for years to come.
References
Black, M. (1962). Metaphor. In M. Black Models and Metaphors, Cornell University
Press, Ithaca, NY, pp. 25–47.
Black, M. (1979). More about Metaphor. In A. Ortony (ed.) Metaphor and Thought,
Cambridge University Press, Cambridge, UK, pp. 19–45.
Bottini, G., Corcoran, R., Sterzi, R., Paulesu, E., Schenone, P., Scarpa, P., Frackowiak,
R.S.J., and Frith, C.D. (1994). The role of the right hemisphere in the interpretation
of figurative aspects of language: A positron emission tomography activation study.
Brain, 117, pp. 1241–1253.
Brachman, R.J. and Levesque, H.J. (eds.) (1985). Readings in Knowledge Representa-
tion. Morgan Kaufmann, San Mateo, California.
Lakoff, G. and Johnson, M. (1980). Metaphors We Live By. Univ. of Chicago Press,
Chicago.
Lakoff, G. and Turner, M. (1989). More than Cool Reason: A Field Guide to Poetic Metaphor. Univ. of Chicago Press, Chicago.
Leeuwenberg, E. (1971). A Perceptual Coding Language for Visual and Auditory Pat-
tern. American Journal of Psychology 84, pp. 307–349.
Mal’cev, A.I., (1973). Algebraic Systems. B.D. Seckler & A.P. Doohovskoy (trans.).
Springer-Verlag, Berlin, Germany.
Marschark, M., Katz, A. and Paivio, A. (1983). Dimensions of Metaphors. Journal of
Psycholinguistic Research 12, pp. 17–40.
Nueckles, M. and Janetzko, D. (1997). The Role of Semantic Similarity in the Com-
prehension of Metaphor. In Proceedings of the Nineteenth Annual Conference of the
Cognitive Science Society, Lawrence Erlbaum Associates, Hillsdale, New Jersey, pp.
578–583.
Paivio, A. (1979). Imagery and Verbal Processes. Lawrence Erlbaum Associates, Hillsdale, NJ.
Piaget, J. (1953). Logic and Psychology. Manchester University Press, Manchester, UK.
Piaget, J. (1967). Biology and Knowledge. B. Walsh (trans.) (1971). Univ. of Chicago
Press, Chicago.
Rosch, E. (1977). Human Categorization. In N. Warren (ed.) Studies in Cross-Cultural
Psychology: Vol. 1. Academic Press, London, pp. 1–49.
Schön, D.A. (1963). Displacement of Concepts. Humanities Press, New York.
Schön, D.A. (1979). Generative Metaphor: A Perspective on Problem-Setting in Social
Policy. In A. Ortony (ed.) Metaphor and Thought. Cambridge Univ. Press, Cambridge,
UK, pp. 254–283.
Tourangeau, R. and Rips, L. (1991). Understanding and Appreciating Metaphors. Cog-
nition 11, pp. 203–244.
Turner, M. and Fauconnier, G. (1995). Conceptual Integration and Formal Expression.
Metaphor and Symbolic Activity, 10(3), pp. 183–204.
Van der Helm, P. and Leeuwenberg, E. (1991). Accessibility: A Criterion for Regularity
and Hierarchy in Visual Pattern Code. Journal of Mathematical Psychology, 35, 151–
213.
Metaphor and Human-Computer Interaction: A Model Based Approach
J.L. Alty and R.P. Knott
Abstract. The role of metaphor in the interface design process is examined and
the importance of formal approaches for characterizing metaphor is stressed.
Two mathematical models of metaphor are put forward - a model based upon a
set approach and a model based upon functional decomposition. The set-based
model has proved to be useful in the design process, enabling designers to identify problem areas and possible improvements. The more detailed functional model mirrors the set approach and is still under development; however, the main ideas are outlined.
The interface between a human being and a computer application consists of a set of
interface objects which map onto objects in the underlying computer system and
whose manipulation instructs the system to perform certain functions. The state of
these interface objects also reflects the current system state and provides
communication between system and user. Recently there has been more emphasis on
graphical user interfaces enabling designers to provide realistic interface controls
which can be “directly manipulated” (Shneiderman, 1978). This shifts the emphasis to
“doing” rather than linguistic reasoning when solving interface problems, resulting in
new interest in the use of metaphor. Two of the most ubiquitous metaphors used have
been the “Desktop Metaphor”, where many housekeeping functions are mapped to the
manipulation of papers on a desktop, and the “Windows Metaphor”, whereby users
have views onto different applications. These metaphors have been successful in
allowing users to manage files and to control many applications simultaneously.
Carroll & Mack (1985) state that ‘metaphors can facilitate active learning … by providing clues for abductive and adductive inferences through which learners
construct procedural knowledge of the computer’. The selection and application of
existing models of familiar objects and experiences allows users to comprehend novel situations. Lakoff & Johnson (1980) claim that all learning is metaphoric in nature.
2 What Is Metaphor?
Literary theory characterizes the role of metaphor as the presentation of one idea in
terms of another, such that understanding of the first idea is transformed. From the
fusion of the two ideas, a new one is created. Richards (1936) has proposed a
nomenclature in which he defines the original idea as the ‘tenor’ and the second idea
imported to modify or transform the tenor as the ‘vehicle’. The use of metaphor must
involve some form of transformation; otherwise the construction is simply an analogy
or juxtaposition and not a metaphor. Metaphors draw incomplete parallels between
unlike things, emphasizing some qualities and suppressing others (Lakoff & Johnson, 1980).
The mismatches are an important part of metaphor. One thinks of those lines from Auden made famous in the film “Four Weddings and a Funeral”:
“The stars are not wanted now: put out every one;
Pack up the moon and dismantle the sun;
Pour away the ocean and sweep up the wood;
For nothing now can ever come to any good.”
The mismatches are huge (“pack up the moon”, “dismantle the sun”), but the images
are powerful.
Using Richards’ terms, in the design of the Apple Macintosh interface, the real-world
desktop acts as a vehicle in order to transform the tenor, in this case the operating
system of the computer. Thus a metaphor requires three concepts; the Tenor, the
Vehicle and the transformation between them.
Although there have been many papers on the use of metaphor at the interface, there
has been a lack of formal design approaches. A mathematical approach to metaphor representation is mentioned in Kuhn et al. (1991), but is not developed.
3 Metaphoric Interfaces
When designers use a metaphor at the interface, they have to carefully design a set of
interface objects with which to represent the observable states of the system. These
objects have a dual function. They present the state of the system to the users and
inform them of system changes. At the same time they must provide a set of actions
through which the user can initiate changes in the system state. In Graphical User Interfaces (GUIs) these actions correspond to mouse clicks, dragging, etc.
Usually, each metaphor at the interface relates to a single application (or even a sub-
task such as cut-and-paste). If several applications are running there may be several
concurrent metaphors at the interface, one for each active application and others for
system functions. A user, however, is usually only concerned with one application at
a time. Note however that this application could be the operating system itself and
that some metaphors may apply across all applications.
Anderson and co-workers have put forward a model which has proved very useful
in investigating metaphoric mapping issues. The model is shown in Figure 1.
Fig. 1: Anderson’s model: the features of the system (S) and the features of the vehicle (M) overlap, dividing into the four areas S+M+, S+M-, S-M+ and S-M-.
The four areas (in what might be considered a Venn diagram) are:
S+M+ → features in the system supported by the Metaphor,
S+M- → features in the system not supported by the Metaphor,
S-M+ → features implied by the Metaphor but not provided by the system,
S-M- → features neither implied by the Metaphor nor supported by the system.
Anderson et al. (1994) used this model to investigate the importance of the concept of “conceptual baggage” - the proportion of S-M+ to S+M+ features (that is, those features of the metaphor which do not map to system functionality, compared with those which do). Anderson et al. found empirical evidence that conceptual baggage
did play an important role in the overall effectiveness of metaphor at the interface. In
the process control area, conceptual baggage is an important issue since it could lead
operators into erroneous conclusions about the process.
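The bookkeeping behind conceptual baggage is easily expressed. The following Python sketch is our own illustration, with invented feature names; the area S-M-, being the unbounded rest of the world, is left out of the partition:

```python
# Minimal sketch of Anderson-style feature bookkeeping (illustrative
# feature names; the real studies used scenario-based feature analyses).

def partition(system_features, metaphor_features):
    """Split features into three of the four areas of Anderson's model."""
    s, m = set(system_features), set(metaphor_features)
    return {
        "S+M+": s & m,      # system features supported by the metaphor
        "S+M-": s - m,      # system-only features
        "S-M+": m - s,      # metaphor-only features: the baggage source
    }

def conceptual_baggage(areas):
    """Proportion of S-M+ to S+M+ features."""
    return len(areas["S-M+"]) / len(areas["S+M+"])

system   = {"show-availability", "open-connection", "close-connection"}
metaphor = {"show-availability", "open-connection", "knock",
            "lock", "slam", "peek-through"}

areas = partition(system, metaphor)
print(conceptual_baggage(areas))   # 4 / 2 = 2.0: high baggage
```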
The prototype system used in these investigations was designed to act as an interface
to an office-based integrated digital broadband telecommunications infrastructure.
More specifically, the system was designed to broadcast the availability state of all
users of the system at any given point in time, and to enable users to make point to
point audio-visual connections. Each user of the system was represented as a
graphical icon which was available to all other users of the system. Communication
between users of the system was initiated via these icons which were also used to
display the availability state of the particular user. In order to provide an adequate
simulation of such technology, the system, known as DOORS (MITS 1994a), was
developed to utilize the audio-visual infrastructure and controlling software (Gaver et
al., 1992) available at Rank Xerox Research Centre, Cambridge.
In order to describe the relationships between system and vehicle for each of the three
pairings, it was necessary to explore the features of each of the vehicles with respect
to the proposed system functionality. Techniques suggested by Carroll et al. (1988)
were used to consider the mappings between vehicle and system at the levels of
‘tasks’, ‘methods’ and ‘appearances’ in a representative set of scenarios. The results
of this analysis were set in the context of the above model so that it was possible to
allocate attributes of the vehicle-system pairing to one of the four categories in
Anderson's model. The ease and immediacy of the allocation process formed the
basis of the characterization of each vehicle-system pairing. For example, office
doors immediately provided a wide range of possible attributes pertinent to the
initiation of point to point audio-visual connections, compared to the attributes
associated with dogs.
The first vehicle-system pairing adopted the office door as a vehicle for representing
the availability of a user. Specifically, an open door corresponded to ‘available for
communication’, a partially open door to ‘busy but interruptible’ and finally a closed door to ‘not available for communication’. The characterization of the relationship between this vehicle and the system is shown in Figure 2.
Fig. 2: Features of the vehicle and of the system for the Office Doors pairing.
5.2.2 Dogs
The second vehicle-system pairing adopted the dog as a vehicle for representing the
availability of a user. Specifically, an attentive dog corresponded to ‘available for
communication’, a digging dog to ‘busy but interruptible’ and finally a sleeping dog
to ‘not available for communication’. The characterization of the relationship between
this vehicle and the system is shown in Figure 3.
In this pairing, as in the previous case, there were also a great number of
potentially relevant features of the vehicle that were not supported by the system. For
example, dogs could not be trained to allow communications from specified people.
Thus it can be seen that the proportion of S-M+ features compared to S+M+ features
was relatively high. Again, there was considerable conceptual baggage. However, it
can be seen that very little of the system functionality was accounted for by features
of this vehicle. Such a characterization would lead to different predictions about the
patterns of user performance. Firstly it would be expected that initially subjects would
not find the system intuitive, not only because the metaphor seems less contextually
relevant, but also because the ratio of S+M- features to S+M+ features was
comparatively high. Dogs was therefore considered to be a rich but inappropriate
vehicle in the context of this pairing.
Fig. 3: Features of the vehicle and of the system for the Dogs pairing.
The third vehicle-system pairing adopted the traffic light as a vehicle for representing
the availability of a user. Specifically, a green light corresponded to ‘available for
communication’, an amber light to ‘busy but interruptible’ and finally a red light to
‘not available for communication’. The characterization of the relationship between
this vehicle and the system is shown in Figure 4.
In this pairing it can be seen that there were few potentially relevant features of the
vehicle that were not supported by the system. Thus the proportion of S-M+ features
compared to S+M+ features was relatively low. In this instance, there was
considerably less conceptual baggage than in the previous two situations. As was the
case with the dog, it can be seen that very little of the system functionality was
accounted for by features of the vehicle. This characterization would lead to further
predictions about the patterns of subject performance. Firstly it would be expected
that subjects would not initially find the system intuitive, not only because the
metaphor seems less contextually relevant, but also because the ratio of S+M- features
to S+M+ features would be quite high. For the same reason it would be expected that
even if subjects do explore the system and become familiar with the functionality, the
boundary between S+M- and S+M+ features will be apparent. Finally, owing to the
predicted lack of conceptual baggage it would be expected that the subjects would be
better able to distinguish between S-M+ features and S+M+ features associated with
this vehicle-system pairing. Traffic Lights was therefore considered to be a sparse
vehicle with limited appropriateness in the context of this pairing.
Fig. 4: Features of the vehicle and of the system for the Traffic Lights pairing.
An experiment was designed and carried out to investigate the viability of the model
by utilizing the interface metaphors Office doors, Dogs and Traffic Lights. In order to
compare and contrast the effects of each of the vehicle-system pairings, three
independent groups of subjects undertook the same task that required usage of
identical underlying telecommunications services. Experimental data was collected
using a combination of verbal protocol, activity capture using video and questionnaire
techniques. This section will focus on the data generated by the questionnaire and will
outline some preliminary findings.
It is clear from the results that the intuitive nature of the Office Door interface
metaphor caused the subjects to make incorrect assumptions concerning the nature of
the underlying system functionality.
This would imply that subjects were confident that they were able to distinguish
functionality that was in the system but not covered by the vehicle, from functionality
that was covered by the vehicle, when in fact this was found not to be the case. The
subjects exhibited a misplaced sense of confidence about their answers due to the
richness and contextual relevance of this vehicle, which had the effect of masking the
boundary of the mapping between vehicle and system. It would seem therefore that
the Office doors vehicle, while providing a contextually rich set of resources, brought
a considerable amount of conceptual baggage to this particular vehicle-system
pairing. The effect of this baggage was exacerbated by the relative simplicity of the
underlying system functionality.
In the case of Dogs, subjects were better able to identify system functionality that was
not supported by the vehicle, than functionality that was suggested by the vehicle but
was not present in the system.
In contrast to the Office doors vehicle, it would seem that Dogs provided a rich set of
resources that were largely inappropriate in the context of this particular vehicle-
system pairing. This is indicated by the fact that subjects reported a need for a manual
explaining the representations of system state at the start of the task. Thus, whilst a
degree of conceptual baggage could be expected, the lack of contextual relevance
caused the effect to be reduced.
Finally in the case of Traffic Lights, subjects were better able to identify system
functionality that was supported by the vehicle than functionality that was suggested
by the vehicle but was not present in the system.
In addition this last result indicates that the vehicle maps only to a small part of the
system functionality causing subjects to be aware of the boundary between the two.
Subjects did not find this vehicle at all intuitive as is indicated by the fact that the
majority of them expressed a need for a manual to explain the representations of
system state. Once the subjects became aware of the mapping between vehicle and
system, actual understanding of the interactions was superior to that in either of the
other two vehicle-system pairings. The Traffic Lights vehicle then, did not provide a
rich set of resources. However the resources it did provide mapped tightly to a small
subset of the system functionality. Consequently the effect of this vehicle’s inherent
conceptual baggage was not as marked as in either of the other vehicle-system
pairings.
The additional component, V (the set of objects and manipulations actually implemented at the interface), increases the number of distinct areas to eight, namely S+M+V+, S+M+V-, S+M-V+, S+M-V-, S-M+V+, S-M+V-, S-M-V+, and S-M-V-, where, of course, each area in the Anderson model subsumes two areas in our new model (e.g. S+M+ = {S+M+V+} + {S+M+V-}).
Fig. 5: The extended model: adding the implementation (V) to the system (S) and the metaphor (M) yields eight areas, S+M+V+, S+M+V-, S+M-V+, S+M-V-, S-M+V+, S-M+V-, S-M-V+ and S-M-V-.
S-M-V-. These are operations and objects in the world which are of no interest to
us. We call these Irrelevant Mappings.
S+M-V+. These are operations which are implemented at the interface, do map to
system functionality, but either have no metaphoric interpretation or have an incorrect
metaphoric interpretation. We call these Metaphor Inconsistencies. This is an area of
dissonance. The designer has implemented a function not consistent with the
metaphor. A classic example of this is dragging the disk icon into the waste bin in the
Macintosh interface. The metaphor would suggest that the disk will be deleted, or
trashed, whereas the functionality is ejection of the disk from the system.
S+M-V-. These are operations available in the system but not implemented in the
interface, nor suggested by the metaphor. We call these External Functions to this
Metaphor. These will usually be functions covered by other metaphors.
S+M+V-. These are operations which are available in the system, which the
metaphor suggests can be done, but which are not implemented. We call these Missed Opportunities or Implementation Exclusions. These are usually caused by a narrow
interpretation of the metaphor. For example, the “doors” metaphor used by Anderson
et al. provided the user with an indication of the availability of another party on a
communication link. An “open” door meant available, a “closed” door, not available,
and a door, which was “ajar”, meant possibly interruptible. The doors were merely
signaling intention. They were not connected in any way to access security. Thus, if
users closed their doors this did not prevent interruption (though it might have done in
a more embracing interpretation of the metaphor).
S-M+V+. These are operations which are consistent with the Metaphor and are
implemented but have no corresponding functionality. We call these Metaphoric
Surface Operations. These usually correspond to operations which have no system
interpretation but are useful and consistent with the metaphor. An example would be
tidying the desktop by dragging file icons.
S-M-V+. These are implementations in the interface which are neither in the
system functionality nor in the metaphor. They are like metaphoric housekeeping operations but have no relevance to the metaphor. We call these Non-Metaphoric Surface Operations. Examples of this type of operation would be changing the size or
color of an icon, or a font size (user tailoring of the objects in an interface).
S-M+V-. These are operations which are suggested by the metaphor but which are
neither in the system nor the interface. This is essentially what is meant by
“conceptual baggage”. The user is erroneously led to believe that something can be
done within the metaphor but which is not implemented and does not map onto any
system functionality. This is the area of Conceptual Baggage discussed earlier. A
good example of this is the use of the “clipboard” in the Macintosh interface. In a normal clipboard, a user can successively clip additional documents to the board (hence the clip); the board acts as a last-in/first-out storage system. However, in the Macintosh implementation, a second clip overwrites the first.
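The eight-way terminology just introduced can be collected into a small lookup table. The sketch below merely restates the classification above; the helper function and the example calls are our own illustration:

```python
# The eight areas of the extended S/M/V model, using the terminology
# introduced in the text.

NAMES = {
    (True,  True,  True ): "Consistent implemented mapping (S+M+V+)",
    (True,  True,  False): "Missed Opportunity (S+M+V-)",
    (True,  False, True ): "Metaphor Inconsistency (S+M-V+)",
    (True,  False, False): "External Function to this metaphor (S+M-V-)",
    (False, True,  True ): "Metaphoric Surface Operation (S-M+V+)",
    (False, True,  False): "Conceptual Baggage (S-M+V-)",
    (False, False, True ): "Non-Metaphoric Surface Operation (S-M-V+)",
    (False, False, False): "Irrelevant Mapping (S-M-V-)",
}

def classify(in_system, in_metaphor, implemented):
    return NAMES[(in_system, in_metaphor, implemented)]

# Dragging the disk icon to the waste bin to eject it: implemented, maps
# to system functionality, but contradicts the metaphor.
print(classify(True, False, True))    # Metaphor Inconsistency (S+M-V+)

# A multi-item clipboard the metaphor promises but nothing delivers:
print(classify(False, True, False))   # Conceptual Baggage (S-M+V-)
```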
Designers, when implementing metaphors at the interface, should first examine the objects and manipulations defined in the set V (i.e. what they have actually implemented). These manipulations can be divided into four discrete subsets: S+M+V+, S+M-V+, S-M+V+ and S-M-V+. The first two are very important. Two other important areas are S-M+V- and S+M+V- (Conceptual Baggage and Missed Opportunities; neither of these is in the implementation). Conversion of S-M+V- to S+M+V+ is a powerful design tool.
Furthermore, adding interface support for S+M+V- manipulations can strengthen the metaphor and make the interface easier to learn and remember. Such situations are often the result of system extensions, or of lack of thought about the implementation.
The underlying system (tenor) will exist in a number of possible unique states s1, s2,
...sn. System behavior is characterized by movements between these states. Today it is
common for a user to have several applications active at the same time, each having
its own state. The system state is the aggregation of these active application states and
the basic underlying system. If the user has applications A1, A2, …, Am active and application Ai can have possible states ai,1, ai,2, …, ai,r, then at any one time the total set of possible system states is
S = {s1, …, sn} × A1 × A2 × … × Am ,
where each Ai here denotes the set {ai,1, …, ai,r} of possible states of the application Ai.
We can represent a typical element, ê, of S as ê = (si, a1,i1, a2,i2, …, am,im). State changes
result from direct user actions, system responses to them, or external events. A user
may initiate a copy command, which moves the system through state changes which
write information to disk. The initial change was initiated by the user, but later state
changes result from system actions, some visible, some not.
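A small fragment can make the construction of S concrete. The following Python sketch (with toy state names of our own) enumerates the total state space as the Cartesian product described above:

```python
# Sketch: the total system state space as a Cartesian product of the base
# system states with each active application's states (toy sizes).

from itertools import product

base_states = ["s1", "s2"]                  # underlying system
app_states  = {
    "A1": ["a1,1", "a1,2"],                 # e.g. an editor
    "A2": ["a2,1", "a2,2", "a2,3"],         # e.g. a mailer
}

# Each element of S is a tuple (si, a1,j, a2,k): one base state plus one
# state per active application.
S = list(product(base_states, *app_states.values()))

print(len(S))    # 2 * 2 * 3 = 12 possible total states
print(S[0])      # ('s1', 'a1,1', 'a2,1')
```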
Although a user may control a number of applications at the same time, at any
particular moment the user will only concentrate on one of them, so for the rest of our
discussion we will assume there is a current, single application, or set of system
functions Ai.
Three sets of transformations are therefore involved:
– the set of transformations which initiate state changes in the underlying system,
– the set of transformations at the interface which cause changes in the system, and
– the set of transformations induced or predicted in the user’s mental model of the metaphor.
The system designer has to specify some functionality for the underlying system and
for the user interface. We define a system function f, which acts upon a non-empty
set of states Si ⊂ S and produces a non-empty set of states Sj ⊂ S (illustrated in Figure
6).
f: Si → Sj
Both the subject and object of this function must be subsets of S, rather than elements,
since the same function will be applicable not only in many different states of the
chosen application, but also for almost all possible states of the other active
applications. An example of a function is that of opening a new file. This can be done
at many states in the system.
Fig. 6: A system function f mapping a non-empty set of states Si ⊂ S to a non-empty set of states Sj ⊂ S.
The functionality of the system is the set F of all system functions f. F represents the
functionality of the underlying system for which we wish to build an interface and is
equivalent to S in Figure 5.
At any time, the set of objects representing the vehicle (or interface) for Ai is in one of
a number of object states O = {o1, o2, ..., ok}. Each object state oi represents a
configuration of the actual interface objects which the user can manipulate. Each
object state can be transformed into another object state through manipulation of the
real objects at the interface. This corresponds to the area V in Figure 5. There will be
many possible transformations between the elements of O. Each oi affords some
actions at the interface, which, if initiated, would move the vehicle into some new
state oj.
Finally we describe a metaphoric model in the user's mind. The metaphorical model is
similar to the set of interface objects and manipulations in the implementation. At any
time, the user's mind (in relation to the computer application) is in one of a number of
mental states U = {u1, u2, u3, …, un}. Each mental state ui represents a configuration of
objects in the user's mental model. Each mental state can be transformed into another
mental state through manipulations in the mental model (corresponding to M in
Figure 5). Each ui affords some mental actions which, if initiated, move the mental
model into a new state uj.
A metaphor function u acts upon a mental model state ui and produces a final state
uj.
u: ui →uj.
The functionality of the metaphor is the set U of all metaphor functions u.
There may be a difference between the designer's view of a metaphor and the user's
view. Indeed this can be a cause of difficulty in interface design. This is best solved
by the designer carefully checking the metaphors used against the target user
population. Thus we assume that the designer's metaphor and the user population's
metaphor will agree, and this is the metaphor described above.
There is a mapping φ which represents the relationship between the metaphor objects in the mental model and the interface objects. There is also an associated mapping υ from the elements of M to the set V.
Clearly we require that if u ∈ dom(υ) and u maps ui to uj, then if φ(ui) = oi and φ(uj) = oj, υ(u) maps oi to oj. The implementation then reflects the user’s expectations for all such mappings.
Similarly, there is a mapping θ from the set of interface object states to the set S such
that if oi is some object state, then it must correspond to some state ai, k of the
application Ai. Let êi be the element of S corresponding to this state. Then we define
θ(oi) = êi.
Fig. 7: The three levels of the functional model. A metaphor function u maps mental state ui to uj in the set of metaphor objects (mental models); φ relates mental states to interface object states oi, oj; d = υ(u) acts within the set of interface objects; θ relates object states to system states, so that f = ω(d) maps θ(oi) ∈ Si to θ(oj) ∈ Sj within the set of system functions.
Referring to Figure 7, there will be elements of M which are not in the domain of υ; these form the set M+V-. The image of υ is the set M+V+. The elements in V but not in dom(ω) form the set S-V+, while the set Im(ω) is the set S+V+.
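Once υ and ω are written down, these sets can be read off mechanically. The sketch below is our own illustration with invented operation names; it is not an implementation of any particular interface:

```python
# Sketch of the mappings in Figure 7, given as plain dictionaries.

# Mental-model transformations (M), interface manipulations (V),
# and system functions (S).
M = {"open-door", "close-door", "knock"}
V = {"click-icon-open", "click-icon-close", "drag-icon"}
S = {"set-available", "set-unavailable"}

upsilon = {                      # υ: metaphor functions -> interface ops
    "open-door":  "click-icon-open",
    "close-door": "click-icon-close",
}
omega = {                        # ω: interface ops -> system functions
    "click-icon-open":  "set-available",
    "click-icon-close": "set-unavailable",
}

print(M - upsilon.keys())        # M+V-: {'knock'}, not implemented
print(set(upsilon.values()))     # M+V+: implemented metaphor functions
print(V - omega.keys())          # S-V+: {'drag-icon'}, surface operation
print(set(omega.values()))       # S+V+: functionality reached via V
```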
10 Conclusions
Converting S-M+V- and S+M+V- elements to S+M+V+ can be a powerful driver for
extending interface functionality in novel ways.
References
Empirical Modelling and the Foundations of Artificial Intelligence
Meurig Beynon
1 Introduction
More than ten years have elapsed since McDermott’s celebrated renunciation of
logicism in AI first appeared [59]. The status of neat and scruffy approaches to
AI remains controversial, and there has been limited progress towards the two
complementary goals that might make the most decisive impact on the argument:
Goal L (“The Logicist Goal”): Develop sophisticated symbolic models with
powerful applications.
Goal NL (“The Non-Logicist Goal”): Identify general principles for application
development outside the logicist framework.
The paper is in three main sections. Section 2 contrasts logicist and non-
logicist perspectives on intelligence with reference to a typical IQ puzzle and to
the analysis of a historic railway accident. Section 3 introduces EM principles
and techniques, and illustrates their potential significance for railway accident
investigation. Section 4 discusses the new foundational perspective on AI that
EM affords with particular reference to the work of William James on Radical
Empiricism, of David Gooding on the empirical roots of science, of Mark Turner
on the roots of language and of Rodney Brooks on robotics.
2 Perspectives on Intelligence
– It only accounts for changes of state in the system to a limited degree: the
future states of the system are not circumscribed, there may be singular
states in which conflicting values are attributed to observables, and there
are no guarantees of reliable response or progress.
The problem posed in Box 1 illustrates one popular view of intelligence that has
much in common with the logicist perspective as portrayed in [55]. It is drawn
from a publication by Mensa, a society whose membership comprises people with
a high “intelligence quotient”.
The Captain of the darts team needs 72 to win. Before throwing a dart, he remarks
that (coincidentally) 72 is the product of the ages of his three daughters. After
throwing one dart, he remarks that (coincidentally) the score for the dart he has
just thrown is the sum of the ages of his daughters. Fred, his opponent, observes at
this point that he does not know the ages of the Captain’s daughters. “I’ll give you a clue”, says the Captain. “My eldest daughter is called Vanessa.” “I see”, says Fred. “Now I know their ages.”
The solution to this problem centres on the fact that factorisations of 72 into 3
factors are disambiguated by the sum of factors but for the pair of factorisations:
72 = 3 * 3 * 8 = 6 * 6 * 2.
By observing that he does not know the ages of the daughters, Fred discloses
to the solver that one or other of these factorisations of 72 is the required one.
(Note that, to make his observation, Fred does not need to know—as we as
solvers do—that no other pair of factorisations of 72 into three yields the same
sum, since he knows that the Captain has scored 14.) When he knows there is
an eldest daughter, he knows that the ages of the daughters are 3, 3 and 8.
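The combinatorial claim underlying the solution is easy to check mechanically. The following Python sketch (ours, not part of the Mensa material) enumerates the factorisations of 72 and confirms that 14 is the only ambiguous sum:

```python
# Brute-force check: factorise 72 into three ages and look for sums
# shared by more than one triple.

from itertools import combinations_with_replacement
from collections import defaultdict

triples = [t for t in combinations_with_replacement(range(1, 73), 3)
           if t[0] * t[1] * t[2] == 72]

by_sum = defaultdict(list)
for t in triples:
    by_sum[sum(t)].append(t)

ambiguous = {s: ts for s, ts in by_sum.items() if len(ts) > 1}
print(ambiguous)        # {14: [(2, 6, 6), (3, 3, 8)]}

# Fred's final clue: an *eldest* daughter exists, so the top age is unique.
answer = [t for t in ambiguous[14] if t[2] > t[1]]
print(answer)           # [(3, 3, 8)]
```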
This puzzle illustrates several ingredients of logicism discussed in [55]. The
problem is contrived around a mathematical model in the poser’s mind. The
casual and artificial way in which the abstract problem is placed in a real-world
context echoes the modularity of ‘inventing conceptualizations’ and ‘grounding
concepts’ presumed in logicism [55]. Embodiment plays a contrived role in the
problem. The issue of psychological realism is not addressed. It is assumed that
Fred exercises instantaneous—or at least very rapid—inference skills on-line,
whilst “knowing the ages of the daughters” is an abstract concept, unconnected
with being able to associate an age with a daughter who might turn up at the
darts match. Nor is indexicality respected. In order to draw any inferences, a
single Mensa-like persona must be imposed on the agents in the puzzle (the
Captain and Fred) and on the poser and solver also.
The remarkable thing about problems of this nature is that the IQ-literate
reader adopts the conventions of the problem poser so readily. Why should we
regard problem-solving of this nature as intelligent? Perhaps because it involves
being able to see through the contrived presentation to make ingenious abstract
inferences, discounting the commonsense obstacles to deduction (cf. Naur’s anal-
ysis of logical deduction in Sherlock Holmes stories [63]: “truth and logical infer-
ence in human affairs is a matter of the way in which these affairs are described”).
To some degree, facility in making abstractions is a quality of intelligence.
Some commonsense facts about the world must be taken for granted to make
sense of the problem. For example, a game of darts takes place on such a timescale
that the ages of the children are fixed for its duration. 14 is a legitimate score
for one dart. Yet the puzzle is posed so artificially that it is almost a parody of
intelligence.
A complementary mental skill is far less well-represented in logicism. This is
the ability to transpose the problem imaginatively so as to disclose the implicit
presumptions about the relationship between the abstract and the real-world el-
ements. Imagination of this kind can subvert the intelligence test. A suspension
of disbelief is needed in supposing that the Captain and Fred are mathemati-
cally adept and sober enough to factorise 72 in their heads whilst simultaneously
taking turns at darts, or that Fred determines the ages of the children because
of an inference rather than because he remembers Vanessa’s age. In some con-
texts, especially where creativity or design are concerned, such questioning of
the premises of a problem is essential, but it is out-of-place in the world of Mensa
problems. The intended world model is closed and preconceived.
The Mensa problem is an example of the kind of challenge that might be
addressed by an intelligence inference engine. It might not be easy to meet, as
it involves some meta-level reasoning. This is illustrated by the fact that if Fred
said he knew the ages of the daughters before he was told the name of the eldest,
no inference could be drawn.
Though logicism is not primarily concerned with artificial tests of intelligence
of this nature, it can be seen as construing intelligence in similar terms. It involves
establishing a formal relationship between the world and a logical model similar
to that between the mathematical model and the darts match scenario, such
that intelligent behaviour can be viewed as if it were inference of the kind used
in solving the intelligence test.
Empirical Modelling techniques address the broader view of intelligence that
encompasses creativity and imagination. They are not particularly well-suited for
exercises in inference masquerading as commonsense problems, but have direct
relevance to real-life scenarios in which abstract explanations are sought.
The following discussion refers to a 19th century railway accident [67] that is
described in Box 2 and illustrated in Figure 1. In analysing the accident (e.g. as
in conducting an accident inquiry), the significance of embodiment is particularly
clear. To assess the behaviour of the human agents, it is essential to take account
of psychological and experiential matters. How big was the red flag? How was
it displayed? Did the drivers and signalman have normal sight? How far away
could oncoming trains be seen? These are perceptual matters, which taken in
conjunction with knowledge about how fast trains travelled and how closely they
followed each other, help us to gauge the performance of human agents. There
are also conceptual matters, to be considered in the light of the training given
match). The process of identifying and actively checking the state of the signal
also has a conceptual component.
Issues of this nature have to be viewed with reference to the particular en-
vironment, such as the weather conditions. In this context, whether the speed
of the train was “too fast” is a matter of pragmatics rather than mathemat-
ics. The need to think in egocentric indexical terms is self-evident. None of the
human agents has a comprehensive view of the system. Without at least being
able to acquire some representative experience of what signalman Killick’s task
involved, it is hard to make a fair judgement about his degree of responsibility
for the accident, and to assess the relevance of his having worked for 24 hours
at a stretch.
In the railway accident scenario, unlike the Mensa problem, the interaction
between conceptual worlds and the real world is very subtle. Ironically, the prac-
tical measures designed to protect against the dangers of a breakdown in the
tunnel also generated the conceptual framework that led to the disaster. Driver
Scott’s decision to reverse his train arose from the fiction that a train may have
broken down in the tunnel ahead. Had he had another misconception, such as
that Killick had waved a white flag, there would have been no accident, and
normal operation would shortly have been resumed. In the real world, there are
degrees of physical interaction between trains that fall short of the catastrophe
that actually occurred, some of which might even have entailed no disruption to
the railway system. It is hard to envisage how logicist models could address the
range of ways in which what is informally viewed as inconsistency can be mani-
fest. Drastic colocation of trains is a particularly striking example of embodied
inconsistency. After this event, there is, in some sense, no longer a model.
can be effective in this role, accounting for the highly complex interactions in
the railway system within a robust generic conceptual framework.
There are three challenges in particular that are met in conceiving railway
system operation in closed-world terms. They are concerned with obtaining guar-
antees, so far as this is possible, on the following points:
– All human activities are framed around objective knowledge and skills.
– All significant operations are based on highly reliable assumptions.
– Practice does not depend on the specific features of particular environments.
The need to deal with first person concerns. One possible subgoal for an
investigator might be reconstructing the mechanics of the accident. A mathe-
matical model could be developed in terms of such factors as the mass, position,
velocity, acceleration, braking efficiency of the trains and friction and gradient
in the environment. In this model, agency would manifest itself as changes in
acceleration due to manipulation of the throttle and brake.
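Such a mechanical model is readily sketched. The Python fragment below is a deliberately crude illustration (all constants and units are invented), in which agency appears only through the throttle and brake settings:

```python
# A minimal 'mechanics' closed-world model: train state evolves from
# braking and gradient; agency enters only via throttle/brake values.

def step(pos, vel, throttle, brake, dt=1.0, gradient=0.02, g=9.81):
    """Advance the train state by one time step (illustrative units)."""
    accel = throttle - brake - gradient * g   # net acceleration
    vel = max(0.0, vel + accel * dt)          # trains do not run backwards
    return pos + vel * dt, vel

pos, vel = 0.0, 20.0                          # entering at 20 m/s
for _ in range(10):
    pos, vel = step(pos, vel, throttle=0.0, brake=1.2)  # full braking
print(round(pos, 1), round(vel, 1))           # position and speed after 10 s
```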
An alternative model might be aimed at reconstructing the sequence of sig-
nificant events. This could be built around an analysis of the protocols for inter-
action between the signalmen and the drivers, e.g. using a mathematical model
for concurrency such as process algebra or calculus. Such a model would register
the communications between the human agents as abstract events, and enable
their possible patterns of synchronisation to be analysed.
From each perspective, the result is a self-contained closed-world model of
the accident. That is to say, both models can be developed to the point where,
relative to their subgoal, there is apparently no need to make further reference to
the physical context in which the accident took place. In accounting for the crash,
the mechanical model can give insight into the influence of technological factors
and perhaps supply objective information about the train drivers’ actions. The
protocol model can likewise clarify what communication took place, and help to
assess its significance.
In practice, both perspectives are too deficient in psychological terms to be
helpful to an accident inquiry in making judgements about responsibility. Both
models create objective “third person” accounts that help to clarify exactly what
an external observer might have seen, and put this observation in the context of
other possible scenarios. Neither gives us insight into how the experiences of the
human agents and the physical embodiments of mechanical agents contributed
to the accident.
To construct a logicist model that is adequate for understanding the railway
accident would certainly require more sophisticated mathematics. What form
should such a model take? It would have to model agents so as to take sufficient
account of mechanics and how communication between agents is synchronised. It
would also have to characterise the interactions between agents in propositional
terms in a way that took sufficient account of psychological factors.
The need to deal with the particular context. In considering the accident
scenario, it is often necessary to speculate on the precise characteristics of the
environment for the accident. Sufficient detail has been retained in the account
of the accident given above to convey the impression of the richness of the
context surrounding the crash. The trains apparently leave Brighton just a few
minutes apart; Killick is fatigued; the trains are heavy; it is a Sunday morning.
These details may or may not be relevant. Whatever details we include, it seems
that language cannot do justice to what we need to know when we probe the
circumstances of the accident.
Did Killick have to leave the cabin in order to wave the flag? What was the
exact distance between the signal and the cabin, and how much longer would it
have to have been for Scott to see the flag? Was Scott supposed to acknowledge
seeing the flag? Did his train have a whistle? All these issues require reference
to the real situation, and are concerned with the specific characteristics of the
particular time and place.
The explanation of particular events can also invoke observables in ways that
cannot be preconceived. In the particular scenario of the Clayton Tunnel crash,
the signalman needed to know whether the driver—several hundred yards away
in the tunnel—had seen the red flag. Perhaps other accident scenarios in which
there was no violation of agreed practice would throw up different examples of
rogue observables that were never considered by the designers of the protocols
or the pioneers of railway and communications technology.
From the above discussion, modelling in the logicist tradition is seen to be
intimately connected with identifying contexts in the world that are stable with
respect to preconceived patterns of interaction. Validating that such a context
has been identified is a pragmatic and empirical matter about which no absolute
guarantees can be given. The observables that feature in these worlds, though
not necessarily statically predetermined, have to come and go according to pre-
conceived patterns. The agents that operate in these worlds must perform their
actions in a manner that respects preconceived integrity constraints. These are
the characterisations of closed worlds and circumscribed agency.
3 Empirical Modelling
The preceding discussion argues the need for an alternative to logicism as a
framework for modelling. Accident investigation demands something other than
closed-world modelling. In particular, it suggests a specific agenda: modelling
from a first-person perspective, with partial and provisional knowledge, and
with reference to a specific context. To respect the need to consult the world
in the process of model-building, the modelling process should also be situated:
it should take place in or as if in the context of the situation to which it refers.
Empirical Modelling, here introduced and illustrated with reference to the Clay-
ton Tunnel Accident scenario, has been conceived with this agenda in mind.
3.1 Orientation
The context for the Empirical Modelling Project is supplied by what Brödner [28]
has identified as a conflict between two engineering cultures:
One position, . . . the “closed world” paradigm, suggests that all real-
world phenomena, the properties and relations of its objects, can ulti-
mately, and at least in principle, be transformed by human cognition
into objectified, explicitly stated, propositional knowledge.
The counterposition, . . . the “open development” paradigm . . .
contests the completeness of this knowledge. In contrast, it assumes the
process relies upon embodiment in an essential way, and artefacts are seen as
indispensable for its representation. The experiential intuitions that inform the
construction of such artefacts are here described informally. Practical experience
is perhaps the best way to gain a deeper appreciation of EM principles.
The important intuitions on which EM draws are the experience of momen-
tary state (as in “the current situation”), and that of an identifiable pattern of
state transitions (as in “a phenomenon”). In the context of the Clayton Tunnel
illustration, Figure 1 depicts a particular situation. A phenomenon might be “a
train passing through the tunnel”; another might be “a train approaching the
tunnel whilst the alarm is ringing”. In EM, an artefact is used to model ex-
perimental interaction in a situation, with a view to identifying and construing
phenomena associated with this situation.
Construal in EM is relative to the egocentric perspective of a particular agent.
Whereas most computational modelling is aimed at realising a system behaviour,
the primary focus of EM is on modelling the way that an agent’s construal
of a situation develops and how subsequently the conception of a system may
emerge. The computer model serves to represent a situation, and transformations
associated with the contemplation of this situation. In this context, the computer
is being used not to compute a result but to represent a state metaphorically,
in much the same way that a physical artefact (such as a scale model, or VR
reconstruction of a historic building) can be used as a prototype. The term
‘computer artefact’ is used to convey this emphasis.
The interpretation of computer artefact adopted here is unusual, and merits
amplification. It derives from inviting the human interpreter to view the com-
puter as a physical object open to interaction, observation and experiment in
abstractly the same way as any other physical object in our environment. Such
a view contrasts with the conception of a computer as negotiating input and
output according to a preconceived schema for interpretation, in order to perform a
preconceived function. This contrast is much sharper than is suggested simply
by considering what are often termed the non-functional aspects of the computer
operation, such as speed, user convenience and visual effect. The computer arte-
fact is experienced without reference to specific function, and its state is not
to be conceived as meaningful only in relation to a predefined abstract pattern
of behaviour (e.g. as in the states of a finite state machine). The meaning and
significance of the state of the artefact is instead to be acquired through a prim-
itive process of conflating experiences of the artefact and of the external world
(cf. the blending to which Turner refers [73,74]). In this negotiation of meaning,
there is no necessary presumption that transitions between states in the artefact
reflect familiar objective external behaviours. Rather, like a physical object, the
artefact manifests itself in its current state, and my conception of this state is
informed by my previous experience, expectations and construal of the situa-
tion. By this token, changes to the state of the computer artefact reflect what
the human observer deems to be the case: for instance, that one-and-the-same
object is now in a different state, or that I now take a different view of this
one-and-the-same object.
a murder on a train was committed from valid but inconsistent testimony about
synchronisation of events by observers on and off the train. What is the dis-
tinction between percept and concept? The psychological subtlety of this issue is
well-illustrated by this extract from Railway Regulations of 1840 [67]: “A Signal
Ball will be seen at the entrance to Reading Station when the Line is right for
the Train to go in. If the Ball is not visible the Train must not pass it.” Such an
injunction to respond to what is not perceived only makes sense in the context
of an expectation that the ball might be seen.
An appropriate philosophical perspective for EM will be considered later. In
practice, EM takes a pragmatic stance. Where a logicist model has to address
the matter of inconsistency and incompleteness of knowledge explicitly, if only
by invoking meta-level mechanisms, EM aims at faithful metaphorical represen-
tation of situations as they are—or are construed to be—experienced. There is
no expectation that EM should generate abstract accounts of phenomena that
are complete and self-contained. In resolving singularities that arise in inter-
preting its artefacts, there is always the possibility of recourse to the mind that
is construing a phenomenon, and to further experimental investigation of the
phenomenon itself.
Observables, dependency and agency are the focus for two activities: an anal-
ysis of my experience, and the construction of a computer artefact to represent
this experience metaphorically.
In analysing my experience, I adopt a stance similar to that of an experi-
mental scientist. Repeated observation of a phenomenon leads me to ascribe
identity to particular characteristic elements. To some extent, this attribution
stems from the perceived continuity of my observation (e.g. this is the same key-
board that I have been using all the while I have been typing this sentence), but
it may stem from a more subtle presumption of conjunction (e.g. this is the same
keyboard I was using last week, though I have not been present to confirm this),
or another conceptual continuity (as e.g. when I have bought a new computer:
that was and this is my keyboard). The integrities that can be identified in this
way are observables.
Because the characterisation of observables in EM is experiential and em-
pirical, it is open to a much wider interpretation than a conventional use of
the term. When driver Scott sees the red flag, there is no physical perception
of the danger of entering the tunnel—indeed, there is no immediate physical
danger to be perceived. Nonetheless, the context for displaying the red flag has
been established indirectly with reference to expected experience. Danger, of
itself invisible—even absent, is present as a conceptual observable concomitant
with the red flag. To construe the accident, the investigator must take account
of the fictional obstruction in the tunnel that Scott infers when the red flag
is seen. And, to deconstruct Scott’s concept yet more comprehensively: though
Scott could not see even a real obstruction in the tunnel, an extrapolation
from his recollected experience potentially traces the path from the mouth of
the tunnel to the point of collision with this invisible, imaginary obstacle.
The idea of dependency is illustrated in the concomitance of ‘red flag’ and
‘danger’ as observables. Other examples of dependencies include: the electrical
linkage between the telegraphs, whereby the state of a button in one signal box
is indivisibly coupled to the state of a dial in another, the mechanical linkage
that enables Killick to reset the distant signal, and the mechanism that causes
the alarm to sound whilst the signal has not yet been reset.
Dependencies play a very significant part in the construal of a phenomenon.
They are particularly intimately connected with the role that invoking agents
plays in accounting for system behaviour. A dependency is not merely a con-
straint upon the relationship between observables but an observation concerning
how the act of changing one particular observable is perceived to change other
observables predictably and indivisibly. This concept relies essentially upon some
element of agency such as the investigator invokes in conducting experiments—
if perhaps only “thought experiments”—with the railway system. In empirical
terms, dependency is a means of associating changes to observations in the sys-
tem into causal clusters: the needle moved because—rather than simply at the
same time as—the button was pressed.
In investigating a phenomenon, dependency at a higher level of abstraction
associates clusters of observables into agents that are empirically identified as
technology can be the basis for artefacts whose characteristics are no longer so
tightly constrained.
Construal in EM can be viewed as associating a pattern of observables, depen-
dencies and agents with a given physical phenomenon. EM techniques and tools
also serve a dual role: constructing physical artefacts to realise given patterns
of observables, dependency and agency. A key role in this construction process
is played by dependency-maintenance that combines the updating mechanism
underlying a spreadsheet with perceptualisation. One technique for this involves
the use of definitive (definition-based) notations [24].
A definitive notation is used to formulate a family of definitions of variables
(a definitive script) whose semantics is loosely similar to the acyclic network of
dependencies behind the cells of a spreadsheet. The values of variables on the left-
hand side in a definitive script are updated whenever the value of a variable that
appears on the right-hand side is updated. This updating process is conceptually
atomic in nature: it is used to model dependencies between the observables
represented by the variables in the script. A visualisation is typically attached to
each variable in a script, and the visual representation is also updated indivisibly
when the value of the variable changes. Definitive notations are distinguished by
the kind of visual elements and operators that can be used in definitions.
Definitive scripts are a basis for representing construals. In typical use, the
variables in a script represent observables, and the definitions dependencies.
A script can then represent a particular state, and actions performed in this
state can be represented by redefining one or more variables in the script or by
introducing a new definition.
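To make these semantics concrete, the following is a minimal sketch, written in Python rather than in an actual definitive notation, of a script for some of the observables and dependencies identified above. The class name, the particular observables and the demand-driven evaluation strategy are illustrative assumptions; the EM tools themselves maintain values indivisibly on redefinition.

```python
# A minimal illustrative sketch (not the EM tools themselves): variables are
# either constants or definitions over other variables, and state change is
# modelled by redefinition, as described above. Evaluation here is
# demand-driven for brevity; an acyclic script is assumed.

class DefinitiveScript:
    def __init__(self):
        self.defs = {}                      # name -> (function, dependency names)

    def define(self, name, fn, deps=()):
        """(Re)define a variable; dependents reflect the change indivisibly."""
        self.defs[name] = (fn, tuple(deps))

    def value(self, name):
        fn, deps = self.defs[name]
        return fn(*(self.value(d) for d in deps))

script = DefinitiveScript()
script.define("button_depressed", lambda: True)                 # an observable
script.define("dial", lambda b: "occupied" if b else "clear",
              deps=["button_depressed"])                        # a dependency
script.define("red_flag", lambda: True)
script.define("danger", lambda flag: flag, deps=["red_flag"])   # concomitant observable

print(script.value("dial"))                # 'occupied' whilst the button is depressed
script.define("button_depressed", lambda: False)                # an agent's action
print(script.value("dial"))                # 'clear' -- dependency maintained
```

Redefinition, rather than mere assignment, is the essential operation here: replacing the definition of button_depressed models an action in the state, and every variable defined over it reflects the change when next observed.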
The use of two definitive notations in combination is illustrated in Figure 1.
One notation is used to define the screen layout and textual annotations, the
other to maintain simple line drawings. By using such notations, it is easy to
represent the kinds of dependencies that have been identified above. For instance,
the dial displays occupied whilst the appropriate button is depressed.
If a phenomenon admits an effective construal in the sense introduced above,
we can expect counterparts of the transitions that are conceived in exploring
the system to be realisable by possible redefinitions in the artefact. In practice,
the possible redefinitions do not respect semantic boundaries. For instance, in
Figure 1, they may relate to modifying the visualisation (e.g. using a dotted
line to represent the track in the tunnel), emulating the actions of agents in
the scenario (e.g. resetting the signal), or fantasising about possible scenarios
(e.g. changing the location of the signal). This accords with the view of the
investigator as resembling an experimental scientist, who, within one and the
same environment, can select the phenomenon to be studied, decide upon the
viewpoint and procedures for observation, adjust the apparatus and develop
instruments.
In practical use, a definitive script can be used judiciously so that all inter-
action is initiated and interpreted with discretion by the human investigator.
For the script to serve a richer purpose than that considered in Naur’s account
of constructed models [63], there must be interaction and interpretation that is
not preconceived. Ways of framing particular modes of interaction that are less
open-ended are nonetheless useful. For example, the actions that are attributed
to agents need to be identified, and the different categories of action available
to the investigator discriminated.
A special-purpose notation, named LSD, has been introduced for this pur-
pose. (The LSD notation was initially motivated by a study of the Specifica-
tion and Description Language SDL—widely used in the telecommunications
industry—hence its name.) The manner in which an agent is construed to act is
declared by classifying the observables through which its actions are mediated.
This classification reflects the ways in which real-world observables can be ac-
cessed by an experimenter. Certain observables can be directly observed (these
are termed oracles), some can be changed (handles), but this change is subject
to observed dependencies (derivates) and is generally possible or meaningful pro-
vided that certain conditions hold (such conditional actions are expressed as a
protocol that comprises privileges to act). It may also be appropriate for a con-
strual to take account of attributes associated with the experimenter (states).
For instance, the status of certain observations and actions may be affected by
the experimenter’s location.
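By way of illustration only, the classification just described might be transcribed as follows for the signalman. LSD itself is a textual notation; both this Python rendering and the particular observables attributed to Killick are assumptions.

```python
# A hypothetical transcription of an LSD-style account into Python; the
# class and field names mirror the classification described above.

from dataclasses import dataclass, field

@dataclass
class LSDAccount:
    agent: str
    oracles: list = field(default_factory=list)    # observables the agent can observe
    handles: list = field(default_factory=list)    # observables the agent can change
    derivates: dict = field(default_factory=dict)  # changes subject to dependency
    protocol: list = field(default_factory=list)   # (guard, privileged action) pairs
    states: list = field(default_factory=list)     # attributes of the agent itself

killick = LSDAccount(
    agent="signalman Killick",
    oracles=["alarm_ringing", "dial_reading", "train_visible"],
    handles=["telegraph_button", "distant_signal", "red_flag_shown"],
    derivates={"alarm_ringing": "distant_signal not yet reset"},
    protocol=[("train_visible and distant_signal not reset", "show red flag")],
    states=["location"],    # e.g. whether Killick is in the signal cabin
)
print(killick.agent, "handles:", killick.handles)
```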
An LSD account of an agent can be used in a wide variety of contexts. It
can represent what I personally can observe and change in a given situation.
Alternatively, it can express what I believe to be the role of an agent other than
myself, either from my perspective or from its own. In an appropriate context, it
can also be used to specify an agent’s behaviour (cf. the LSD Engine developed
by Adzhiev and Rikhlinsky [3]). These three perspectives on agency are discussed
in more detail in section 4.
When construing a complex phenomenon, the presence of several agents leads
to potential ambiguity about which perspective is being invoked. For this rea-
son, LSD accounts do not necessarily lead directly to operational models of
phenomena. It is not in general possible to develop a faithful computer model
of behaviour that can be executed fully automatically; the intervention of the
modeller in the role of super-agent is needed to emulate non-deterministic in-
teraction, to resolve ambiguity about the current state of the system, and to
arbitrate where the actions of agents conflict. Special tools have been developed
for this purpose: they include the Abstract Definitive Machine (ADM) [21], and
the distributed variant of the Eden interpreter [23] that has been used to generate
Figure 1.
There are many important respects in which the principles of EM, as described
above, engage with the fundamental issues raised by Kirsh in [55]: it is first-
person centred; it is experientially based rather than primarily language-based;
it involves embodied interaction and experiment; it addresses conceptualization
in psychological terms; and it is concerned with intentionality and meaning
rather than logical consequence.
4 The Implications of EM
Having discussed the character and significance of EM from a practical view-
point, it remains to return to the broad agenda set out in the introduction. This
section discusses how EM contributes towards three key objectives:
– giving a perspective on logicist and non-logicist approaches;
– providing a conceptual foundation for AI broader than logicism;
– providing a context for existing practice in “scruffy” AI.
There have been many criticisms of the logicist position. Where AI is concerned,
the sources include Rodney Brooks [31,32], Brian Smith [70,71], Mark Turner [73]
and Peter Naur [63]. Other relevant philosophical ideas are drawn from William
James’s ideas on Radical Empiricism, first collected for publication shortly af-
ter his death in 1910 [53], and from more contemporary work of Gooding [47]
and Hirschheim et al. [52] on methodological issues in science and information sys-
tems development respectively. These indicate that the controversy surrounding
a logicist viewpoint is neither new, nor confined to AI and computer science.
Gooding’s analysis of Faraday’s work is motivated by disclosing simplistic as-
sumptions about the relationship between scientific theory and practical experi-
ment. William James addressed similar issues in his attacks upon the rationalist
viewpoint on experience. Hirschheim [52] is concerned with information system
design as involving the development of social communication systems. This ar-
guably places the design of such systems outside the paradigm for Computer
Science proposed in Denning et al. [41]. For instance, it raises issues such as shared
meaning, and the management of ambiguity, inconsistencies and conflict in sys-
tem specifications.
Common themes that arise in these writings include:
EM: the first person perspective. Under some interpretations, Kant’s fa-
mous dictum: “sensation without conception is blind” might serve as a motto
for the logicist. An appropriate motto for EM might be that of the anonymous
little girl who, on being told—by a logicist, no doubt—to be sure of her meaning
before she spoke, said: “How can I know what I think till I see what I say?” [65].
This epitomises the first-person variant of EM that has been described above:
the dialogue between me and myself, in which understanding is construction
followed by reconstruction in the light of experience of what I have constructed.
First-person activities in EM have centred on interface development [22,10] and
conceptual design [2].
EM: the third person perspective. One of the most complex and subtle pro-
cesses that can operate in an EM framework is the transition to the third-person
perspective. The observables that can be viewed from a third-person perspec-
tive are those elements of our experience that empirically appear to be common
to all other human agents, subject to what is deemed to be the norm (cf. the
presumptions surrounding the Mensa problem above). The identification of such
observables is associated with interaction between ourselves and other human
agents in a common environment. Objectivity is empirically shaped concurrently
by our private experience, and our experience of other people’s responses.
The extent to which objective third-person observables dominate our public
agenda can obscure the sophistication of the social conventions they require. In
matters such as observing the number of items in a collection, and confirming its
objective status, complex protocols are involved: eating an item is not permitted,
interaction with the environment must be such that every item is observed and
none is duplicated, co-operation and honest reporting of observation is needed
to reach consensus.
Underpinning third-person observables are repeatable contexts for reliable
interaction, and associated behaviours of different degrees of locality and sophis-
tication. In this context, locality refers to the extent to which a pattern of activity
embraces all the agents in an environment and constrains the meaningful modes
of observation. Counting techniques provide examples of behaviours that are typ-
ically local in this sense - they involve few agents, and are applied in the context
of otherwise uncircumscribed interaction. Conventional computer programming
typically presumes a closely circumscribed context, in which human-computer
interaction is subject to global behavioural constraints (as in sequential interac-
tion between a single user and computer), and the main preoccupation is with
objective changes of state (such as are represented in reliable computer operation
and universally accepted conventions for interpretation of input-output state).
ing between asserting what is observed and asserting what is believed (cf. [2]).
The computational forum for this representation is provided by the ADM [21],
in which the modeller can prescribe the privileges of agents and retain total dis-
cretion over how these privileges are exercised. The evolution process converges
if and when the modeller has specified a set of agents, privileges and criteria
for reliability of agent response that realise the observed or intended behaviour.
System implementation is then represented in this framework as replacement of
certain agents in the model by appropriate physical devices.
More generally, EM can be applied in a concurrent engineering context [1],
where independent views may be subject to conflict, as in Gruber’s shadow
box experiment [47]. To account for the process by which such views might be
reconciled through arbitration and management requires a hierarchical model
for agent interaction in which an agent at one level acts in the role of the human
modeller in relation to those at the level below [1]. The associated “dialectic
of form and process” is specified in terms of commitments on the part of the
modeller agents similar in character to those involved in system implementation.
Our investigation of a concurrent framework for EM of this nature remains at
an early stage, but has direct relevance to requirements analysis [18] and has
been used to devise simulations of insect behaviour illustrating Minsky’s Society
of Mind paradigm [56].
mind’ has much in common with the concept of blending that has been ex-
plored by Turner and others [74]. The ontological stance that James and Turner
adopt is consonant with EM: the foundations of intelligence are to be sought in
first-person experience, not in third-person abstractions. It is in this spirit that
Turner regards the formal definitions of metaphor by logicists (cf. [35]), and the
grammatical structure of a language, as sophisticated abstractions rather than
primitive building blocks of human intellect. For Turner, the blend and the story
(which to Bradley would doubtless have seemed so ‘contradictory’ in nature),
are simple experientially-centred primitives.
something about which nothing can be said’. For James’s pure experi-
ence has to be such that nothing can be said about it, if it is to fulfil
the role for which it is cast. . . . Without some ability to characterise the
experiences we have no means of determining their identity, and even
no clear means of assessing James’s central claim that we are presented
with conjunctive relations in experience as well as atomic sensations.
the entity as an agent whose role can only be represented through recourse to
first-person agency. To create the circumscribed closed world, it is essential to
pass through the experimental realm.
5 Conclusion
Brooks has argued in [31,32] that significant progress towards the principal goals
of AI research—building intelligent systems and understanding intelligence—
demands a fundamental shift of perspective that rules out what is commonly
understood to be a hybrid logicist / non-logicist approach. This paper endorses
this view, contending that logicism relies upon relating the empirical and the
rational in a way that bars access to the primitive elements of experience that
inform intelligence. EM suggests a broader philosophical framework within which
theories are associated with circumscribed and reliably occurring patterns of
experience. The empirical processes that lead towards the identification and
formulation of such theories surely require human intelligence. The application of
such theories, taken in isolation, is associated with rule-based activity as divorced
from human intelligence as the execution of a computer program. Intelligence
itself lives and operates in experience that eludes and transcends theory.
Acknowledgments
I am much indebted to all the contributors to the Empirical Modelling Project,
and to Dominic Gehring, Theodora Polenta and Patrick Sun in particular, for
their valuable philosophical, theoretical and practical input. Most of all, I am
indebted to Steve Russ, whose constructive criticism and ideas have been crucial
in identifying the essential character of EM. I also wish to thank Mike Luck for
several useful references and feedback. The idea of relating EM to first-, second-
and third-person perspectives owes much to several workshop participants, no-
tably Joseph Goguen, Kerstin Dautenhahn and Chrystopher Nehaniv. I have
also been much encouraged and influenced by Mark Turner’s exciting ideas on
blending and the roots of language. I am especially grateful to the Programme
Committee and the Workshop sponsors for their generous invitation and finan-
cial support.
References
1. V. D. Adzhiev, W. M. Beynon, A. J. Cartwright, and Y. P. Yung. A computational
model for multi-agent interaction in concurrent engineering. In Proc. CEEDA’94,
pages 227–232. Bournemouth University, 1994.
2. V. D. Adzhiev, W. M. Beynon, A. J. Cartwright, and Y. P. Yung. A new computer-
based tool for conceptual design. In Proc. Workshop Computer Tools for Concep-
tual Design. University of Lancaster, 1994.
3. V.D. Adzhiev and A. Rikhlinsky. The LSD engine. Technical report, Moscow
Engineering Physics Institute, 1997.
Communication as an Emergent Metaphor for Neuronal Operation

Slawomir J. Nasuto, Kerstin Dautenhahn, and J. Mark Bishop

Department of Cybernetics, University of Reading, Reading RG2 6AE, UK
{sjn, kd}@cyber.rdg.ac.uk, J.M.Bishop@reading.ac.uk
1 Introduction
2 Computational Metaphor
The emergence of connectionism rests on the belief that neurons can be treated as
simple computational devices [1]. Further, the assumption that information is encoded
as the mean firing rate of neurons has been a base assumption of all the sciences
related to brain modelling. The initial Boolean McCulloch-Pitts model neuron was
quickly extended to allow for analogue computations.
The most commonly used framework for connectionist information representation
and processing is a subspace of a Euclidean space. Learning in this framework is
equivalent to extracting an appropriate mapping from the sets of existing data. Most
learning algorithms perform computations which adjust neuron interconnection
weights according to some rule, adjustment in a given time step being a function of a
training example. Weight updates are successively aggregated until the network
reaches an equilibrium in which no adjustments are made (or, alternatively, training
stops before equilibrium if it is designed to avoid overfitting). In either case,
knowledge about the whole training set is stored in the final weights. This means that
the network does not possess any internal representation of the (potentially complex)
relationships between training examples. Such information exists only as a
distribution of weight values. We do not consider representations of arity-zero
predicates (e.g. those present in NETtalk [8]) as sufficient for representing
complex relationships. These
limitations result in poor internal knowledge representation making it difficult to
interpret and analyse the network in terms of causal relationships. In particular it is
difficult to imagine how such a system could develop symbolic representation and
logical inference (cf. the symbolic/connectionist divide). Such deficiencies in the
representation of complex knowledge by neural networks have long been recog-
nised [9,10,11].
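To make this point concrete, here is a minimal sketch of the kind of learning loop described above, assuming a linear neuron trained by the delta rule on invented data: once training ends, the individual examples survive only as the final weight values.

```python
# A minimal delta-rule sketch (data, learning rate and epoch count invented)
# illustrating the point above: the trained network retains no internal
# representation of the examples, only a distribution of weight values.

training_set = [((0.0, 1.0), 1.0), ((1.0, 0.0), 0.0), ((1.0, 1.0), 1.0)]
w = [0.0, 0.0]
rate = 0.1

for _ in range(200):                                  # iterate towards equilibrium
    for x, target in training_set:
        y = sum(wi * xi for wi, xi in zip(w, x))      # linear neuron output
        error = target - y
        w = [wi + rate * error * xi for wi, xi in zip(w, x)]  # weight adjustment

print(w)   # all 'knowledge' of the training set, stored as two numbers
```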
The way in which data are processed by a single model neuron is partially
responsible for these difficulties. The algebraic operations that it performs on input
vectors are perfectly admissible in Euclidean space but do not necessarily make sense
in terms of the data represented by these vectors. Weighted sums of quantities,
averages etc., may be undefined for objects and relations of the real world, which are
nevertheless represented and learned by structures and mechanisms relying heavily on
such operations. This is connected with a more fundamental problem missed by the
connectionist community - the world (and relationships between objects in it) is
fundamentally non-linear. Classical neural networks are capable of discovering non-
linear, continuous mappings between objects or events but nevertheless they are
restricted by operating on representations embedded in linear, continuous structures
(Euclidean space is by definition a finite dimensional linear vector space equipped
with standard metric). Of course it is possible in principle that knowledge from some
domain can be represented in terms of Euclidean space. Nevertheless it seems that
only in extremely simple or artificial problems will the appropriate space be of small
dimensionality; in real-life problems, spaces of very high dimensionality are to be
expected. Moreover, even if embedded in a Euclidean space, the actual
set representing a particular domain need not be a linear subspace, or be a connected
subset of it. Yet these are among the topological properties required for the correct
operation of classical neural nets. There are no general methods of coping with such
situations in connectionism. Methods that appear to be of some use in such cases seem
to be freezing some weights (or restriction of their range) or using a ‘mixture of
experts or gated networks’ [12]. However, there is no principled way of deciding how
to perform the former. Mixture-of-experts models appear to be a better solution, as
single experts could in principle explore different regions of a high dimensional space
thus their proper co-operation could result in satisfactory behaviour. However, such
architectures need to be individually tailored to particular problems. Undoubtedly
there is some degree of modularity in the brain, however it is not clear that the brain’s
operation is based solely on a rigid modularity principle. In fact we will argue in the
next section that biological evidence seems to suggest that this view is at least
incomplete and needs revision.
We feel that many of the difficulties outlined above follow from the underlying
interpretation of neuron functioning in computational terms, which results in entirely
numerical manipulation of knowledge by neural networks. This seems too restrictive
a scheme.
Even in computational neuroscience, existing models of neurons describe them as
geometric points, although neglecting the geometric properties of neurons (treating
dendrites and axons as merely passive transmission cables) makes such models very
abstract and may strip them of some information-processing properties. In most
technical applications of neural networks the abstraction is even higher - axonic and
dendritic arborisations are completely neglected - hence they cannot in principle
model the complex information processing taking place in these arbors [13].
We think that brain functioning is best described in terms of non-linear dynamics,
but this means that the processing of information is equivalent to some form of
temporal evolution of activity. The latter, however, may depend crucially on the
geometric properties of neurons, as these properties obviously influence the activity
of neurons and thus of whole networks. Friston [14] stressed this point at a systemic
level when he pointed to the importance of appropriate connections between and
within regions - but this is exactly the geometric (or topological) property which
affects the dynamics of the whole system. Qualitatively the same reasoning is valid
for single neurons.
Undoubtedly, model neurons which do not take into account geometrical effects
perform some processing, but it is not clear what this processing has to do with the
dynamics of real neurons. It follows that networks of such neurons perform their
operations in some abstract time not related to the real time of biological networks
(We are not even sure whether time is an appropriate notion in this context; in the
case of feedforward nets, ‘algorithmic steps’ would probably be more appropriate.) This
concerns not only classical feedforward nets which are closest to classical algorithmic
processing but also many other networks with more interesting dynamical behaviour,
(e.g. Hopfield or other attractor networks).
Of course one can resort to compartmental models but then it is apparent that the
description of single neurons becomes so complex that we have to use numerical
methods to determine their behaviour. If we want to perform any form of analytical
investigation then we are bound to simpler models.
Relationships between real-life objects or events are often far too complex for
Euclidean spaces and smooth mappings between them to be the most appropriate
representations. In reality it is usually the case that objects are comparable only to
some objects in the world, but not to all. In other words one cannot equip them with a
‘natural’ ordering relation. Representing objects in a Euclidean space imposes a
serious restriction, because vectors can be compared to each other by means of
metrics; data can be in this case ordered and compared in spite of any real life
constraints. Moreover, variables are often intrinsically discrete or qualitative in nature
and in this case again Euclidean space does not seem to be a particularly good choice.
Networks implement parameterised mappings and they operate in a way implicitly
based on the Euclidean space representation assumption - they extract information
contained in distances and use it for updates of weight vectors. In other words,
distances contained in data are translated into distances of consecutive weight vectors.
This would be fine if the external world could be described in terms of Euclidean
space; it becomes a problem, however, if we need to choose a new definition of
distance each time a new piece of information arrives. Potentially, new information can
give a new context to previously learnt information, with the result that concepts
which previously seemed unrelated now become close. Perhaps this means that
our world model should be dynamic - changing each time we change the definition of
a distance? However, weight space remains constant - with Euclidean distance and
fixed dimensionality. Thus the overall performance of classical networks relies heavily
on their underlying model of the external world. In other words, it is not the networks
that are ‘smart’, it is the choice of the world model that matters. Networks need to
obtain ‘appropriate’ data in order to ‘learn’, but this amounts to choosing a static
model of the world, and in such a situation networks can indeed perform well. Our
feeling is that, to a limited extent, a similar situation appears in very low level sensory
processing in the brain, where only the statistical consistency of the external world
matters. However, as soon as the top down information starts to interact with the
bottom up processing the semantic meaning of objects becomes significant and this
can often violate the assumption of static world representations.
It follows that classical neural networks are well equipped only for tasks in which
they process numerical data whose relationships can be well reflected by Euclidean
distance. In other words classical connectionism can be reasonably well applied to the
same category of problems which could be dealt with by various regression methods
from statistics. Moreover, as in fact classical neural nets offer the same explanatory
power as regression, they can be therefore regarded as its non-linear counterparts. It is
however doubtful whether non-linear regression constitutes a satisfactory (or the most
general) model of fundamental information processing in natural neural systems.
Another problem follows from the rigidity of neurons’ actions in current
connectionist models. The homogeneity of neurons and their responses is the rule
rather than the exception. All neurons perform the same action regardless of individual
conditions or context. In reality, as we argue in the next section, neurons may
condition their response on the particular context, set by their immediate
surroundings, past behaviour and current input etc. Thus, although in principle
identical, they may behave as different individuals because their behaviour can be a
function of both morphology and context. Hence, in a sense, the way conventional
neural networks operate resembles symbolic systems - both have built in rigid
behaviour and operate in an a priori determined way. Taking different ‘histories’ into
account would allow for the context sensitive behaviour of neurons - in effect for
existence of heterogeneous neuron populations.
Standard nets are surprisingly close to classical symbolic systems although they
operate in different domains: the latter operating on discrete, and the former on
continuous spaces. The difference between the two paradigms in fact lies in the nature
of representations they act upon, and not so much in the mode of operation. Symbolic
systems manipulate whole symbols at once, whereas neural nets usually employ sub-
symbolic representations in their calculations. However, both execute programs,
which in case of neural networks simply prescribe how to update the interconnection
weights in the network. Furthermore, in practice neural networks have very well
defined input and output neurons, which together with their training set, can be
considered as a closed system relaxing to its steady state. In modular networks each of
the ‘expert’ nets operates in a similar fashion, with well defined inputs and outputs
and designed and restricted intercommunication between modules. Although many
researchers have postulated a modular structure for the brain [15], with distinct
functional areas being black boxes, more recently some [16,17] have realised that the
brain operates rather as an open system: owing to ever-changing conditions, a
system with extensive connectivity between areas and no fixed input and output. The
above taxonomy resembles a similar distinction between algorithmic and interactive
systems in computer science, the latter possessing many interesting properties [18].
3 Biological Evidence
Recent advances in neuroscience provide us with evidence that neurons are much
more complex than previously thought [19]. In particular it has been hypothesised that
neurons can select input depending on its spatial location on the dendritic tree or on
its temporal structure [19,20,21]. Some neurobiologists suggest that synapses can remember the
history of their activation or, alternatively, that whole neurons discriminate spatial
and/or temporal patterns of activity [21].
Various authors have postulated spike encoding of information in the brain
[22,23,24]. The speed of information processing in some cortical areas, the small
number of spikes emitted by many neurons in response to cognitive tasks [25,26,27],
together with the very random behaviour of neurons in vivo [28], suggest that neurons
would not be able to reliably estimate mean firing rate in the time available. Recent
results suggest that firing events of single neurons are reproducible with very high
reliability and interspike intervals encode much more information than firing
rates [29]. Others found that neurons in isolation can produce, under artificial
stimulation, very regular firing with a high reproducibility rate, suggesting that the
apparent irregularity of firing in vivo may follow from interneuronal interactions or
may be stimulus dependent [30].
The use of interspike interval coding enables richer and more structured
information to be transmitted and processed by neurons. The same mean firing rate
corresponds to a combinatorial number of interspike interval arrangements in a spike
train. What would previously be interpreted as a single number can carry much more
information in temporal coding. Moreover, temporal coding enables the system to
encode unambiguously more information than is possible with a simple mean firing
rate. Different parts of a spike train can encode qualitatively different information. All
these possibilities have been excluded in the classical view of neural information
processing. Even though a McCulloch-Pitts neuron is sufficient for the production of
spike trains, spike trains by themselves do not solve the binding problem (i.e. they do
not explain the mechanism responsible for the integration of the features constituting
an object, which are processed in a spatially and temporally distributed manner).
However, nothing would be gained, except possibly processing speed, if mean firing
rate encoding were merely replaced by temporal encoding, since the underlying
framework of knowledge representation and processing would still mix qualitatively
different information by simple algebraic operations.
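A small numerical illustration of this point, with arbitrary interval values: two spike trains can share one mean firing rate while differing in temporal structure, and the number of interval arrangements consistent with that single rate grows combinatorially.

```python
# Arbitrary illustrative interspike intervals (in ms): both trains contain
# four spikes in 20 ms, hence the same mean firing rate, yet their temporal
# structure differs and could carry different information.

from itertools import permutations

regular = [5, 5, 5, 5]
irregular = [2, 8, 3, 7]

for intervals in (regular, irregular):
    rate = len(intervals) / sum(intervals)        # spikes per ms: 0.2 for both
    print(intervals, "mean rate:", rate)

# Distinct orderings of the irregular intervals, all with the same mean rate:
print(len(set(permutations(irregular))))          # 24 arrangements
```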
The irregular pattern of neuron activity in vivo [28] is inconsistent with temporal
integration of excitatory post synaptic potentials (EPSP’s) assumed in classical model
neurons. It also introduces huge amounts of noise, thus making any task to be
performed by neurons, were they unable to differentially select their input, extremely
difficult. On the other hand, perhaps there is a reason for this irregular neuronal
behaviour. If neurons are coincidence detectors rather than temporal
integrators [19,22] then the randomness of neuron firing is an asset rather than a
liability.
One of the most difficult and as yet unresolved problems of computational
neuroscience is that of binding distinct features of the same object into a coherent
percept. However, in [31], Nelson postulates that it is the traditional view,
‘transmission first, processing later’, that introduces the binding problem. On
Nelson’s view, processing cannot be separated from transmission; once processing is
entangled with the transmission performed by neural assemblies spanning multiple
neuronal areas, the binding problem becomes non-existent [32].
4 Communication Metaphor
‘economical’ one: the brain facilitates the survival of its owner and for that purpose
uses all available resources to process information.
5 Architecture of NESTOR
Taking into account the above considerations we adopt a model neuron that inherently
operates on rich information (encoded in spike trains) rather than a simple mean firing
rate. Our neuron simply accepts information for processing dependent on conditions
imposed by a previously accepted spike train. It compares corresponding parts of the
spike trains and, depending on the result, further distributes the other parts. Thus
neurons do not perform any numerical operations on the obtained information - they
forward its unchanged parts to other neurons. Their power relies on the capability to
select appropriate information from the incoming input depending on the context set
by their history and the activity of other neurons.
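The following sketch, a loose illustration rather than the published NESTOR specification, shows a neuron of this kind: it performs no arithmetic on its input, but compares one part of a spike train against a previously accepted part and, on a match, forwards the remaining part unchanged. The (tag, feature) structure anticipates the encoding described below; the function and field names are assumptions.

```python
# An illustrative sketch of a non-numerical matching neuron: spike trains
# are modelled as (tag, feature) pairs, and the neuron forwards the tag
# part unchanged when the feature parts coincide. Names are hypothetical.

def matching_neuron(incoming, remembered):
    """Compare the feature parts of two spike trains; forward on coincidence."""
    tag, feature = incoming
    _, expected_feature = remembered
    if feature == expected_feature:        # spatio-temporal coincidence
        return ("active", tag)             # forward the tag part, unchanged
    return ("inactive", None)

print(matching_neuron(incoming=(3, "edge"), remembered=(0, "edge")))
print(matching_neuron(incoming=(3, "edge"), remembered=(0, "corner")))
```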
Although we define a single neuron as a functional unit in our architecture, we are
aware that the debate on what constitutes such a unit is far from resolved. We based
this assumption on our interpretation of the neurobiological evidence. However, we
realise that even among neuroscientists there is no agreement as to what constitutes
such an elementary functional unit (proposals range from systems of neurons or
microcircuits [34], through single neurons [35], to single synapses [13]). In fact it is
possible that qualitatively similar functional units might be found on different levels of
brain organisation.
In the characteristics of this simple model neuron we have tried to capture what we
consider to be fundamental properties of neurons. Although our model neurons are
also dimensionless, their information-processing characteristics include what might
follow for real neurons from their geometric properties (namely, the ability to
distinguish their inputs: spatio-temporal filtering).
A network of such model neurons was proposed in [36]. The NEural STochastic
diffusion search netwORk (NESTOR) consists of an artificial retina, a layer of fully
connected matching neurons and retinotopically organised memory neurons. Matching
neurons are fully connected to both retina and memory neurons.
It is important to note that matching neurons obtain both ascending and descending
inputs. Thus their operation is influenced by both bottom-up and top-down
information. As Mumford [16] notes, systems which depend on interaction between
feedforward and feedback loops are quite distinct from models based on Marr’s
feedforward theory of vision.
The information processed by neurons is encoded by a spike train consisting of two
qualitatively different parts - a tag determined by the relative position of the receptor
on the artificial retina and a feature signalled by that receptor. The neurons operate by
introducing time delays and acting as spatiotemporal coincidence detectors.
Although we exclusively used a temporal coding, we do not mean to imply that
firing rates do not convey any information in the brain. This choice was made
for simplicity of exposition and because in our simplified architecture it is not
important how the information about the stimulus is encoded. What is important is the
SDS consists of a number of simple agents acting independently, whose collective
behaviour locates the best fit to a predefined target within the specified search space.
Figure 1 illustrates the operation of SDS on an example search space consisting of a
string of digits with the target - a pattern ‘371’ - being exactly instantiated in the
search space.
It is assumed that both the target and the search space are constructed out of a
known set of basic microfeatures (e.g. bitmap pixel intensities, intensity gradients,
phonemes etc.). The task of the system is to solve the best fit matching problem - to
locate the target or if it does not exist its best instantiation in the search space. Initially
each agent samples an arbitrary position in the search space, checking if some
microfeature in that position matches with corresponding microfeature of the target. If
this is the case, then the agent becomes active otherwise it is inactive. Activity
distinguishes agents which are more likely to point to a correct position from the rest.
Next, in a diffusion phase, each inactive agent chooses at random another agent for
communication. If the chosen agent is active, then its position in the search space will
be copied by the inactive agent. If, on the other hand, the chosen agent is also inactive
then the choosing agent will reallocate itself to an arbitrary position in the search
space.
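The procedure just described is compact enough to state in full. Below is a minimal Python sketch of SDS searching a digit string for ‘371’, following the test and diffusion phases exactly as given above; the particular string, agent count and iteration count are illustrative.

```python
# A minimal SDS sketch following the description above: a test phase in
# which each agent checks one random microfeature, and a diffusion phase in
# which inactive agents copy active agents' hypotheses or reallocate.

import random

search_space = "8837125093137"        # '371' occurs once, at position 2
target = "371"
positions = len(search_space) - len(target) + 1

# Each agent is a [position hypothesis, active flag] pair.
agents = [[random.randrange(positions), False] for _ in range(5)]

for _ in range(50):
    # Test phase: check one randomly chosen microfeature of the hypothesis.
    for agent in agents:
        i = random.randrange(len(target))
        agent[1] = (search_space[agent[0] + i] == target[i])
    # Diffusion phase: inactive agents consult one randomly chosen agent.
    for agent in agents:
        if not agent[1]:
            other = random.choice(agents)
            agent[0] = other[0] if other[1] else random.randrange(positions)

print(agents)    # most agents come to point at position 2
```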
This procedure iterates until SDS reaches an equilibrium state, where a maximal
stable population of active agents will point towards a common position in the search
space. In the most general case the convergence of SDS has to be interpreted in a
statistical sense [38]. The population supporting the solution will fluctuate, and the
identities of particular agents in it will change, but the system as a whole will
nevertheless exhibit deterministic behaviour. From such competition and co-operation
between weakly randomly coupled agents emerges the deterministic behaviour of
Fig. 1. SDS consisting of five agents searching a string of digits for the pattern ‘371’. Active
agents point to corresponding features with solid arrows; inactive agents are connected to the
last checked features by dashed lines; agents pointing to the correct position are encircled by
ovals. The first number in each agent denotes the position of the potential solution, and the
second the relative position of the checked microfeature
The time complexity of SDS was analysed in [39] and shown to be sublinear in the
absence of noise when a perfect match is present. Further work has confirmed
that this characteristic also holds in more general conditions. As noted in [39] this
performance is achieved without using heuristic strategies, in contrast to the best
deterministic one- and two-dimensional string searching algorithms or their extensions
to tree matching [40], which at best achieve linear time.
taxonomy. It also supports the hypothesis that local interactions are not the most
important feature of real biological networks. The most recent findings suggest that,
contrary to the assumptions of some researchers [41], attention may operate at all
levels of the visual system, with the expectation of the whole system directly influencing
cell receptive fields and, as a result, information processing by single neurons (for an
excellent exposition see [44] and references therein).
These findings are qualitatively reflected in the architecture of NESTOR. Although
the network architecture and neuron properties correspond only very approximately
to the architecture of the visual system and the properties of real neurons, in the
light of the cited evidence we think that it is an interesting candidate for modelling
visual attention.
The formation of a dynamic assembly representing the best fit to the target
corresponds to an attentional mechanism allocating available resources to the desired
object.
The analysis of properties of our model suggests that both parallel and serial
attention may be just different facets of one mechanism. Parallel processing is
performed by individual neurons and serial attention emerges as a result of formation
of an assembly and its shifts between interesting objects in the search space.
8 Conclusions
Much new evidence is emerging from the neuroscience literature. It points to the
neuron as a complex device, acting as a spatio-temporal filter probably processing
much richer information than originally assumed. At the same time our understanding
of information processing in the brain has to be revised on the systems level. Research
suggests that communication should not be disentangled from computation, thus
bringing into question the usefulness of ‘control-theoretic’ like models based on
clearly defined separate functional units.
We claim that this new evidence suggests supplementing the oversimplistic
McCulloch-Pitts neuron model with models that take such a communication
metaphor into account. It seems more accurate and natural to describe emergent
neuron operations in terms of communication - a vital process for all living
organisms - with ‘computations’ appearing only as a means of implementing neuron
functionality in biological hardware. In this way we avoid several problems lurking
behind the computational metaphor, such as homunculus theories of mind and the
binding problem.
We propose a particular model neuron and discuss a network of such neurons
(NESTOR) effectively equivalent to the Stochastic Diffusion Search. NESTOR shows
all the interesting properties of SDS and moreover we think that it serves as an
interesting model of visual attention. The behaviour of neurons in our model is context
sensitive, and the architecture allows for extension to heterogeneous neural
populations.
Although the model advanced in this paper is based solely on exploring the
communication metaphor we argue that it shows interesting information processing
capabilities - fast search for the global optimum solution to a given problem and
Acknowledgments
The authors would like to thank an anonymous referee for critical comments which
helped us to refine and improve our paper.
References
1. McCulloch, W.S., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. Bulletin of
Mathematical Biophysics 5 (1943) 115-133.
2. Rosenblatt, F.: Principles of Neurodynamics. Spartan Books, Washington DC (1962)
3. Poggio, T., Girosi, F.: Networks for approximation and learning. Proceedings of the IEEE 78
(1990) 1481-1497.
4. Haykin, S.: Neural Networks: A Comprehensive Foundation. Macmillan, New York (1994)
5. Rumelhart, D. E., McClelland, J.L. (eds.): Parallel Distributed Processing. Explorations in
the Microstructure of Cognition, MIT Press, Cambridge MA (1986).
6. Fukushima, K.: Neocognitron: A hierarchical neural network capable of visual pattern
recognition. Neural Networks 1 (1988) 119-130.
7. Selman, B. et al.: Challenge Problems for Artificial Intelligence. Proceedings of AAAI-96,
National Conference on Artificial Intelligence, AAAI Press, 1996.
8. Sejnowski, T.J., Rosenberg, C.R.: Parallel networks that learn to pronounce English text.
Complex Systems 1 (1987) 145-168.
9. Fodor, J., Pylyshyn, Z.W.: Connectionism and Cognitive Architecture: A Critical Analysis.
In: Boden, M.A. (ed.): The Philosophy of Artificial Intelligence, Oxford University Press
(1990).
10. Barnden, J., Pollack, J. (eds.): High-Level Connectionist Models, Ablex: Norwood, NJ,
(1990).
11. Pinker, S., Prince, A.: On Language and Connectionism: Analysis of a Parallel Distributed
Processing Model of Language Acquisition. In: Pinker, S., Mahler, J. (eds.): Connections
and Symbols, MIT Press, Cambridge MA, (1988).
12. Jordan, M.I., Jacobs, R.A.: Hierarchical mixtures of experts and the EM algorithm. MIT
Comp. Cog. Sci. Tech. Report 9301 (1993).
13. Shepherd, G.M.: The Synaptic Organisation of the Brain. Oxford University Press, London
Toronto (1974).
14. Friston, K.J.: Transients, Metastability, and Neuronal Dynamics. Neuroimage 5 (1997) 164-
171.
15. Fodor, J.A.: The Modularity of Mind. MIT Press (1983).
16. Mumford, D.: Neural Architectures for Pattern-theoretic Problems. In: Koch, Ch., Davies,
J.L. (eds.): Large Scale Neuronal Theories of the Brain. The MIT Press, London, England
(1994).
17. Farah, M.: Neuropsychological inference with an interactive brain: A critique of the locality
assumption. Behavioural and Brain Sciences (1993).
18. Wegner, P.: Why Interaction is More Powerful than Algorithms. CACM May (1997).
19. Koch, C.: Computation and the single neuron. Nature 385 (1997) 207-210.
20. Barlow, H.: Intraneuronal information processing, directional selectivity and memory for
spatio-temporal sequences. Network: Computation in Neural Systems 7 (1996) 251-259.
21. Granger, R., et al.: Non-Hebbian properties of long-term potentiation enable high-capacity
encoding of temporal sequences. Proc. Natl. Acad. Sci. USA Oct (1991) 10104-10108.
22. Thomson, A.M.: More Than Just Frequency Detectors? Science 275 Jan (1997) 179-180.
23. Sejnowski, T.J.: Time for a new neural code?, Nature 376 (1995) 21-22.
24. Koenig, P., et al.: Integrator or coincidence detector? The role of the cortical neuron
revisited. Trends Neurosci. 19(4) (1996) 130-137.
25. Perret, D.I., et al.: Visual neurons responsive to faces in the monkey temporal cortex.
Experimental Brain Research 47 (1982) 329-342.
26. Rolls, E.T., Tovee, M.J.: Processing speed in the cerebral cortex and the neurophysiology of
visual backward masking. Proc. Roy. Soc. B 257 (1994) 9-15.
27. Thorpe, S.J., Imbert, M.: Biological constraints on connectionist modelling. In: Pfeifer, R.,
et al. (eds.): Connectionism in Perspective. Elsevier (1989).
28. Softky, W.R., Koch, Ch.: The highly irregular firing of cortical cells is inconsistent with
temporal integration of random EPSPs. J. of Neurosci. 13 (1993) 334-350.
29. Berry, M. J., et al.: The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci.
USA 94 (1997) 5411-5416.
30. Mainen, Z.F., Sejnowski, T.J.: Reliability of spike timing in neocortical neurons. Science
268 (1995) 1503-1506.
31. Nelson, J.I.: Visual Scene Perception: Neurophysiology. In: Arbib, M.A. (ed.): The
Handbook of Brain Theory and Neural Networks. MIT Press: Cambridge MA (1995).
32. Nelson, J.I.: Binding in the Visual System. In: Arbib, M.A. (Ed.): The Handbook of Brain
Theory and Neural Networks, MIT Press, Cambridge MA (1995).
33. Brown, R.: Social Psychology. Free Press, New York (1965).
34. Douglas, R.J., Martin, K.A.C.: Exploring cortical microcircuits. In: McKenna, Davis,
Zornetzer, (eds.): Single Neuron Computation. Academic Press (1992).
35. Barlow, H.B.: Single units and sensation: A neuron doctrine for perceptual psychology?.
Perception 1 (1972) 371-394.
36. Nasuto, S.J., Bishop, J.M.: Bivariate Processing with Spiking Neuron Stochastic Diffusion
Search Network. Neural Processing Letters (at review).
st
37. Bishop, J.M.: Stochastic Searching Networks. Proc. 1 IEE Conf. Artificial Neural
Networks, pp. 329-331, London (1989).
38. Nasuto, S.J., Bishop, J.M.: Convergence Analysis of a Stochastic Diffusion Search. Parallel
Algorithms and Applications (in press).
Communication as an Emergent Metaphor for Neuronal Operation 379
39. Nasuto, S.J., Bishop, J.M, Lauria, S.: Time Complexity Analysis of Stochastic Diffusion
Search, Proc. Neural Computation Conf., Vienna, Austria (1998).
40. van Leeuven, J. (ed.): Handbook of Theoretical Computer Science. MIT Press: Amsterdam
(1990).
41. Treisman, A.: Features and Objects: The fourteenth Bartlett Memorial Lecture. The
Quarterly Journal of Experimental Psychology 40A(2) (1998) 201-237.
42. Cowey, A.: Cortical Visual Areas and the Neurobiology of Higher Visual Processes. In:
Farah, M.J., Ratcliff, G. (eds.): The Neuropsychology of High-Level Vision. LEA Publishers
(1994).
43. Spillmann, L., Werner, J.S.: Long range interactions in visual perception. Trends Neurosci.
19(10) (1996) 428-434.
44. McCrone, J.: Wild minds. New Scientist 13 Dec (1997) 26-30.
The Second Person – Meaning and Metaphors
Chrystopher L. Nehaniv
Truth and meaning, as logicians will tell you, only make sense in reference to a
particular universe of discourse. Less obviously perhaps, meaning also only makes
sense from the standpoint of an observer, whether that observer is someone
manipulating a formal system to determine referents and applying predicates
according to compositional rules, is an animal hunting in the forest, is a Siberian
swan in a flock of swans over-wintering on a northern Japanese lake, is an
artificial agent maintaining control parameters over an industrial process, or is
the ‘mind of God’. We thus take a seemingly stricter view than that of most
logicians, that meaning only makes sense for agents, situated and embedded in
interaction with their particular Umwelt, the world around them. Actually this
is a view wider in scope in that it now includes anything that could potentially
qualify as an ‘observer’, not only a universal third-person or external impersonal
one. The agent may be as simple as an active process on the CPU of your
Current address: Interactive Systems Engineering, Department of Computer Science,
University of Hertfordshire, Hatfield, Hertfordshire AL10 9AB, United Kingdom,
E-mail: c.l.nehaniv@herts.ac.uk
2 Locus of Meaning
Where is the meaning for an agent? It is in the observer, who, as we said, may be
the agent itself. So in looking for meaning in any situation one must ask, Where
are the observers?
An agent interacts with the world through its sensors, embodiment and actuators. An evolved biological agent uses sensory and action channels that have
been varied and selected over the course of evolution. The channels it uses are
meaningful to it for its survival, homeostasis, reproduction, etc. The access to
the particular channels has evolved because they are of use to the agent for such
purposes, and thus meaning arises for the agent as it accesses these channels.
In this access, the agent is in the role of an observer (though not necessarily a
conscious one) and this observer is also an actor.
What is meaning then? It is information considered with respect to channels
of interaction (perception and/or action) whose source and target are determined
with respect to an observer. The source and target may be known, uncertain,
or unknown; they may be agents or aspects of environments; information in the
channel may or may not be accessible to the observer; the observer may be an
agent at one end (or possibly both ends) of the channel, or may be external to
the channel.
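To fix ideas, this observer-relative definition can be rendered as a small data structure. The following Python sketch is only an illustration, with every name in it hypothetical and ours rather than drawn from any formalism in this volume: a channel records its source and target as determined with respect to some observer, and information in the channel counts as meaning only for an observer with access to it.

from dataclasses import dataclass
from typing import Optional, Set

@dataclass
class Channel:
    # Endpoints as determined with respect to an observer;
    # None encodes "unknown to that observer".
    source: Optional[str]
    target: Optional[str]
    content: bytes  # the information carried in the channel

@dataclass
class Observer:
    name: str
    accessible: Set[str]  # names of the channels this observer can read

def meaning_for(observer: Observer, channel_name: str, channel: Channel):
    # Meaning: information considered with respect to a channel of
    # interaction, from the standpoint of a particular observer.
    if channel_name not in observer.accessible:
        return None  # the information exists, but not for this observer
    return (observer.name,
            channel.source or "unknown",
            channel.target or "unknown",
            channel.content)

On this sketch the same channel can carry meaning for one observer and none for another, which is exactly the observer-relativity claimed above.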
The attempts and successes of formalization and rationalism to escape from context, to formulate universal scientific laws that do not depend on the particular observer and aspects of the messiness of embodiment, useful Platonic entities such as numbers, and generic impersonal statements about ‘he’/‘she’/‘it’/‘they’ have been extremely important in the history of science and engineering. They have led to great successes in the physical sciences, mathematics and engineering, achieving somewhat less success in the case of animate beings, such as in biology at the level of the organism, psychology and economics (where agents matter). Such logical positivistic approaches tend to presuppose a single unique plane of description, one universal coordinate system or model in which all phenomena may be described and understood. (Note, however, that sometimes more sophisticated versions allow several viewpoints, which agree where they overlap but may also explain some areas which are not mutually explainable in a consistent manner, e.g. in relativistic physics, the theory of manifolds in differential geometry and topology – obtained by ‘gluing’ locally Euclidean pieces of space – and more general coordinate systems affording formal understanding of systems [15].)
We propose that first- and second-person perspectives can assist in these
agent sciences. The third-person observer perspective is thus an extra-agent view.
Nevertheless, there is an agent present in this viewpoint, namely, the observer
itself.
Other agents in an agent’s environment may have embodiment and structure very similar to that of the agent. This similarity can be a substrate for interaction and provides structure that the agent’s own structure can be related and mapped to. These other agents are thus ‘second persons’, alter-egos (i.e. other ‘I’s) in the world whose actions could be analyzed and possibly ‘understood’ as corresponding to one’s own. A tendency to regard such others as ‘egomorphic’, similar to the self, or to expect that their actions in given situations should be similar to what one’s own would be, could thus be adaptive. This egomorphic principle may be at the root of the ability of animals to perceive signals of intent in others. For example, a dog might not have a theory of other minds, but may well growl when it perceives and acts on signals, such as gaze direction, of another animal looking at a piece of food it has grasped in its teeth and paws.
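Read computationally, the egomorphic principle amounts to self-simulation: the agent predicts the other by running its own situation-to-action mapping on the other’s observed situation. The following Python sketch is ours and purely illustrative; the policy table is a hypothetical stand-in, not a model proposed in this volume.

def my_policy(situation):
    # The agent's own mapping from situations to actions (a stand-in).
    actions = {"rival_gazing_at_my_food": "grab", "threatened": "flee"}
    return actions.get(situation, "ignore")

def predict_other(observed_situation):
    # Egomorphic prediction: map the other onto the self and reuse
    # one's own policy; no explicit theory of other minds is needed.
    return my_policy(observed_situation)

# The dog example: perceiving another animal's gaze on its food, the
# agent expects what it itself would do in that situation, and growls.
if predict_other("rival_gazing_at_my_food") == "grab":
    print("growl")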
A generalization of the egomorphic principle in humans is their anthropomorphizing tendency to view other animals and objects around them as having
human-like consciousness, feelings, intentions or goals. This tendency may lead
to appropriate behavior in response to, say, perceived threat and anger in a
snarling carnivore protecting its young, or to less successful behavior in, say,
attributing a vengeful state of mind to storm clouds and trying to appease them
with burnt offerings.
The notion ‘second person’ refers to the experience by an agent of other agents and of the interaction dynamics with other agents. It is thus an inter-agent notion. Aspects include theory of other mind and empathic resonance [7]; biographic reconstruction for others [17]; perception of signals of intention; interaction; and mapping of the self to the other. In mapping the self to the other, the latter becomes for this observer a blend of the self with the notions of otherness: the second person, to whom are attributed states and dynamics (e.g. intentions, drives, feelings, desires, goals) and possibly a biographic history [17]. As the second person, the other ceases to be an object and becomes an agent. As just mentioned, it may be that such mapping from ‘I’ to ‘Thou’ also lies at the core of the anthropomorphizing tendencies so often observed in human interaction with computers and robots. How such interaction dynamics work in natural agents and could be constructed in artificial ones leads one into the study of imitation, social dynamics, communication and the understanding of language games and interaction games. Some of the second person techniques for interaction illustrated in this book are in Dautenhahn [8] (learning by imitation, temporal synchronization (‘dancing’)), Barnden [1] (theory of mind, beliefs of others), Brooks et al. (interaction dynamics), Scassellati [21] (scaffolding for imitation, joint attention), and Kauppinen [11] (imitation and child language acquisition via figures of speech).
3 Constructive Biology

Constructive biology takes the point that one’s understanding should enable one, in principle, to build the systems of interest. For example, Barbara Webb has shown through building that a much simpler mechanism than expected, not involving functional decomposition or planning, is sufficient to account for much observed cricket phonotaxis behavior [27]. Valentino Braitenberg’s examples [4] of simple robots to whom human observers attribute such states as ‘fear’, ‘aggression’, ‘love’, etc., illustrate that the meaning of an interaction for an external observer can be quite different from that it has for the agent (in these cases, simple taxis). Constructive biology will inescapably lead to mappings that respect structural constraints and grounding of agents, to the use and manipulation of hierarchies, and to the need for a deeper understanding of them in relation to natural adaptive systems.
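How little machinery those attributions require can be seen in a toy simulation. The sketch below, which assumes a single point light source and two laterally placed sensors (the geometry and numbers are ours, not Braitenberg’s), wires the sensors to the wheels either on the same side, giving a vehicle observers describe as ‘fearful’, or crossed, giving one described as ‘aggressive’; in both cases the mechanism is nothing but simple taxis.

import math

def light_at(sensor_x, sensor_y, light_x, light_y):
    # Sensor excitation falls off with distance from the light source.
    return 1.0 / (1.0 + math.hypot(light_x - sensor_x, light_y - sensor_y))

def wheel_speeds(left_sensor, right_sensor, crossed):
    # Ipsilateral wiring ('fear'): the wheel nearer the light spins
    # faster, so the vehicle turns away and slows as it escapes.
    # Contralateral wiring ('aggression'): the vehicle turns toward
    # the light and speeds up as it approaches.
    if crossed:
        return right_sensor, left_sensor
    return left_sensor, right_sensor

# Light to the left: the 'fearful' vehicle veers right, away from it;
# the 'aggressive' vehicle veers left, toward it.
left = light_at(-1.0, 0.0, -5.0, 0.0)
right = light_at(+1.0, 0.0, -5.0, 0.0)
print(wheel_speeds(left, right, crossed=False))
print(wheel_speeds(left, right, crossed=True))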
The study of correspondence via the algebraic notion of homomorphism (full, partial or relational) provides an inroad for the precise study of correspondence between agents interacting with their environments or with each other. Preserving the structure of meaning channels for an agent coupled to its environment is required for the usefulness of, and determines the quality of, metaphors and mappings in the design, algebraic engineering, interaction dynamics, and constructive biology of situated agents.
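As a concrete instance of what such structure preservation demands, consider finite transition systems: a map h between state sets is a (full) homomorphism when h(delta1(s, a)) = delta2(h(s), a) for every state s and input a. The Python check below, with a deliberately tiny hypothetical example of ours, verifies the condition by exhaustion.

# Two transition systems as dictionaries from (state, input) to next state.
delta1 = {("s0", "a"): "s1", ("s1", "a"): "s0"}  # a two-state system
delta2 = {("t", "a"): "t"}                       # a one-state image
h = {"s0": "t", "s1": "t"}                       # candidate state map

def is_homomorphism(delta_src, delta_dst, h):
    # h is a homomorphism iff it commutes with the transition maps.
    return all(h[nxt] == delta_dst[(h[s], a)]
               for (s, a), nxt in delta_src.items())

print(is_homomorphism(delta1, delta2, h))  # True: structure is preserved

Partial and relational homomorphisms relax this condition to partially defined or set-valued maps, which is what makes such correspondences usable when the match between an agent and its model is inexact.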
4 Epilogue: Correspondences
Acknowledgements
References
22. Claude E. Shannon and Warren Weaver, The Mathematical Theory of Communication, University of Illinois Press, 1963.
23. Kazuko Shinohara, Conceptual Mappings from Spatial Motion to Time: Analysis of English and Japanese. In [16], 230–241, (this volume).
24. Georgi Stojanov, Embodiment as Metaphor: Metaphorizing-In the Environment. In [16], 88–101, (this volume).
25. Stephen Toulmin, From Clocks to Chaos: Humanizing the Mechanistic World-View. In Hermann Haken, Anders Karlqvist, and Uno Svedin, eds., The Machine as Metaphor and Tool, Springer-Verlag, 139–153, 1993.
26. Mark Turner, Forging Connections. In [16], 11–26, (this volume).
27. Barbara Webb, Using Robots to Model Animals: A Cricket Test, Robotics and Autonomous Systems, 16:117–134, 1995.
28. Ludwig Wittgenstein, The Blue and Brown Books, Harper & Brothers, 1958.
29. Ludwig Wittgenstein, Philosophical Investigations (Philosophische Untersuchungen), German with English translation by G. E. M. Anscombe, Basil Blackwell, Oxford, 1964; reprinted 3rd edition, 1968.
Author Index
Alty, J. L. 307
Barnden, J. A. 143
Beynon, M. 322
Bishop, M. 365
Breazeal, C. 52
Brooks, R. A. 52
Dautenhahn, K. 102, 365
Fenton-Kerr, T. 154
Goguen, J. 242
Hiraga, M. K. 27
Indurkhya, B. 292
Kauppinen, A. 196
Knott, R. P. 307
Marjanovic, M. 52
Nasuto, S. J. 365
Nehaniv, C. L. 1, 380
Nehmzow, U. 209
O'Neill-Brown, P. 165
Scassellati, B. 52, 176
Shinohara, K. 230
Stojanov, G. 88
Turner, M. 11
Veale, T. 37
Williamson, M. M. 52