
Modeling Social Common Sense for Seamless Human-Machine Teaming:

Inverting the “Intuitive Game Engine” with Probabilistic Programming

1 Executive Summary

1.1 Objectives
What is the problem?
A major bottleneck for seamless human-robot teaming is the inability of traditional robotics systems to
interact with humans in natural ways -- ways that people consider safe, intuitive, productive, and rewarding.
Even young children engage in productive social interactions with other children and adults -- simple cases
of "human-human teaming" -- from a very early age. From the time they begin to walk and speak their first
words, 14-month-old infants can recognize that someone else is trying to obtain an out-of-reach object and hand it to them (Warneken & Tomasello, Infancy, 2007), and 18-month-old infants perform a range of
cooperative, helping activities, such as opening a door for someone with their hands full, opening a box for
someone who doesn't know how to open it (Warneken & Tomasello, Science, 2006), or pointing out the
location of an object to someone searching for it (Liszkowski, Carpenter, Striano, & Tomasello, Journal of
Cognition and Development, 2006). In contrast, even this basic level of flexible collaboration lies beyond
the capabilities of current AI and robotic systems for "human-robot teaming". Robots lack the
commonsense psychology or “Theory of Mind” (Premack & Woodruff, Behavioral and Brain Sciences,
1978) that allows humans to determine when and how to work with others.

What are we going to do?


Our goal is to "reverse-engineer" the simple yet powerful collaborative teaming abilities that even young children possess intuitively and deploy automatically, without much conscious thought. We will do this by building AI systems that represent and interact with their physical and social environment as humans do from infancy -- systems that are able to infer the goals behind human movements and anticipate those movements, to jointly infer the beliefs and desires of human collaborators, to communicate their own intentions, to seamlessly and flexibly carry out helpful, collaborative actions with human teammates, and to do all of this in ways that humans can automatically and unconsciously grasp and trust.

To realize this vision of intelligent co-robot interaction, we propose to build models that use rich representations of the world similar to those in modern video game engines, which offer 3D representations of objects and their physical dynamics, as well as social representations of different characters, their traits, their relationships, and their ability to affect the state of the world. To
capture the flexibility of human reasoning, we will implement these "intuitive game engines" as
probabilistic programs, which allow predictive simulation of future actions and events, but crucially also
support inductive inferences about unobserved aspects of the physical and social world.
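
To make this concrete, here is a minimal sketch (in Python, with all names and parameters purely illustrative) of what we mean by a probabilistic program wrapped around a simulator: run forward, a toy one-dimensional "physics engine" predicts a sliding object's trajectory; inverted by weighting simulations against an observed trajectory, it yields a posterior over an unobserved physical property such as friction. Our actual models will use full 3D game-engine simulations rather than this toy dynamics.

    import numpy as np

    def simulate(friction, v0=2.0, dt=0.1, steps=20):
        """Toy 1D 'physics engine': an object sliding with friction-induced deceleration."""
        x, v, xs = 0.0, v0, []
        for _ in range(steps):
            v = max(v - friction * dt, 0.0)
            x += v * dt
            xs.append(x)
        return np.array(xs)

    def infer_friction(observed, candidates, noise=0.05):
        """Invert the simulator: weight each candidate friction value by how well
        its forward simulation matches the observed trajectory (Gaussian likelihood)."""
        log_w = np.array([-np.sum((simulate(f) - observed) ** 2) / (2 * noise ** 2)
                          for f in candidates])
        w = np.exp(log_w - log_w.max())
        return w / w.sum()

    candidates = np.linspace(0.2, 1.4, 7)
    observed = simulate(0.8) + np.random.normal(0, 0.05, 20)   # noisy observation of a trajectory
    posterior = infer_friction(observed, candidates)
    print(dict(zip(np.round(candidates, 2), np.round(posterior, 3))))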

We propose the following specific tasks: (1) developing a range of teaming scenarios, based on human-
human teaming from cognitive development; (2) designing objective measurements of teaming efficiency;
(3) developing a model that can infer the intention of a single agent and predict its future actions; (4)
developing models for planning joint teaming actions by using distributed planners.

1.2 How is it done today, and what are the limits of current practice?
Intriguingly, recent research suggests that representations and processes analogous to video game engines
and probabilistic programs are part of a complex of "core-knowledge" which emerges during infancy, and
underlies children's understanding of objects, physical scenes, and social agents (Carey & Spelke,
Philosophy of Science, 1996; Dehaene, Izard, Pica, & Spelke, Science, 2006; Spelke & Kinzler,
Developmental Science, 2007). Recently, researchers have begun to model core-knowledge
computationally using physical and social simulations expressed as probabilistic programs. These models
accurately predict quantitative human psychophysical responses across a range of ages and domains:
• 12-month-olds' physical expectations about the order and timing with which several moving balls will exit a semicircular enclosure (Téglás, Vul, Girotto, Gonzalez, Tenenbaum, & Bonatti, Science, 2011);
• 10-month-olds' social evaluations of agents as a function of their actions toward others, as well as the attributed mental states underlying those actions (Hamlin, Ullman, Tenenbaum, Goodman, & Baker, Developmental Science, 2013);
• 3- to 6-year-olds' joint attributions of belief and desire to others on the basis of their actions and situational context (Richardson, Baker, Tenenbaum, & Saxe, CogSci, 2012).
These studies provide many of the basic computational, experimental and conceptual building blocks for
our proposed research. However, previous modeling of children's reasoning in Cognitive Science has
typically been confined to simplified experimental domains (e.g., 2-dimensional environments) or else
relied on highly abstracted representations of the domain and situation.

In the field of robotics, a sophisticated mathematical framework for understanding and learning from
others’ observed behavior using “inverse optimal control” is gaining traction (Ng & Russell, ICML, 2000;
Ziebart, Bagnell & Dey, ICML, 2010). Recent work has applied similar ideas to generating “legible
motion” – actions that appear natural to human observers, and for which the goal of the action can be
rapidly and reliably inferred (Dragan & Srinivasa, Robotics: Science and Systems, 2013). Some of this
work is loosely inspired by developmental psychology, and by the cognitive science modeling cited above.
This demonstrates the potential of approaches based on reverse-engineering children’s core-knowledge to
scale to real robots operating in complex environments.

There is a long tradition of “developmental robotics”, connecting children’s physical and cognitive
capabilities to robotic engineering. Much of this research has focused on mechanisms such as “imitation
learning” (Breazeal & Scassellati, Trends in Cognitive Sciences, 2002; Demiris & Meltzoff, Infant and
Child Development, 2008) or “simulation theory” (Breazeal, Gray & Berlin, International Journal of
Robotics Research, 2009), inspired by classic empirical and theoretical work in psychology. Other
cognitive robotics approaches have employed cognitive architectures such as ACT-R, PolyScheme, or
SOAR (Trafton et al., Journal of Human-Robot Interaction, 2013; Hiatt, Harrison, & Trafton, IJCAI, 2011;
Bello & Cassimatis, CogSci, 2006) to provide robots with cognitive models of human psychology, in order
to reason about human teammates’ perspectives, knowledge, and actions. These approaches have several
limitations. First, they do not directly model the integrated core-knowledge that allows young infants to
perform seamless human-human teaming. Second, they are not grounded in the recent advances in robotic
sensing and (inverse) optimal control that we cite above. Because probabilistic programming with intuitive game engines provides a unified mapping between infant core-knowledge -- such as the intuitive physics engine and intuitive principles of rationality -- and modern robotic engineering approaches, we believe it will be the most promising route toward robotic systems with more advanced teaming capabilities than have previously been possible.

A primary limitation shared by all previous approaches is their lack of integration. Our own previous efforts to reverse-engineer the intuitive principles of core-knowledge each proposed a specific principle -- "the principles of rational action and belief", "the principle of sympathy/antipathy", "the intuitive physics engine" -- to make their predictions, and models based on these separate principles also relied on their own specialized inference algorithms. The intuitive principles we have proposed apply very generally within the social and physical domains, respectively, but integrated, core-knowledge-based systems will benefit from general principles that unify these various modules and mechanisms.

Our current proposal aims to unify the disparate core-knowledge principles enumerated above, using
modern video game engines cast as probabilistic programs. Modern video games provide a combination of
realistic physics engines, graphics engines, and "game AI"-based virtual agents – rich representations which
mirror infants’ and adults’ core cognition about the world and other agents. Probabilistic programming
provides a principled framework for flexible reasoning about these rich models, supporting predictive and
inductive reasoning that matches quantitative human judgments empirically.

1.3. What's new in your approach and why do you think it will be successful?
The foundation of our approach is a set of recent scientific discoveries about the "social common sense" that allows even young infants to interact with and learn from human adults with a deep understanding of others' mental states. We formalize "social common sense" by wrapping a game engine, as a core component, inside a general probabilistic programming framework. The planning program is a core component of the game engine: its inputs are representations of mental states, including beliefs, intentions, and utilities, and its output is a sequence of actions. By introducing uncertainty over the input mental states, the planning program becomes a probabilistic one. Running the model in the forward direction, the machine can generate its own rational actions and predict human actions; running it in the backward direction, in a Bayesian way, it can infer the input mental states that best explain observed human actions. Seamless human-machine teaming is therefore achieved by the closed-loop interaction of running the planning engine forward and backward.
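
The following minimal Python sketch illustrates this forward/backward pattern under strong simplifications (a small grid, two candidate goals, a noisily rational agent choosing among four moves); every name and parameter here is illustrative rather than part of our actual planner. The same action model is run forward to generate rational actions and inverted with Bayes' rule to recover a posterior over intentions from observed actions.

    import numpy as np

    GOALS = {"cup": (4, 0), "book": (0, 4)}
    MOVES = {"right": (1, 0), "up": (0, 1), "left": (-1, 0), "down": (0, -1)}

    def action_probs(pos, goal, beta=3.0):
        """Forward direction: a noisily rational agent prefers moves that reduce
        the remaining distance to its goal (softmax over negative distance)."""
        scores = np.array([-beta * (abs(goal[0] - (pos[0] + dx)) + abs(goal[1] - (pos[1] + dy)))
                           for dx, dy in MOVES.values()])
        p = np.exp(scores - scores.max())
        return dict(zip(MOVES, p / p.sum()))

    def infer_goal(observed_steps):
        """Backward direction: Bayesian posterior over goals given (position, move) pairs."""
        log_post = {g: 0.0 for g in GOALS}                     # uniform prior over goals
        for pos, move in observed_steps:
            for g, target in GOALS.items():
                log_post[g] += np.log(action_probs(pos, target)[move])
        v = np.array(list(log_post.values()))
        w = np.exp(v - v.max())
        return dict(zip(GOALS, w / w.sum()))

    # Two observed steps to the right already favor the "cup" goal.
    print(infer_goal([((0, 0), "right"), ((1, 0), "right")]))

In the full system, the hand-coded grid and move set would be replaced by the game engine's state space and the motion planners described in Section 2.1.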

The success of our approach is strongly supported by recent breakthroughs in five fields: (1) cognitive science studies revealing the developmental origins of human social common sense; (2) cognitive modeling of social common sense in 2D displays; (3) 3D sensing technology for tracking human actions in real scenes; (4) robotic motion-planning algorithms for generating rational actions; and (5) probabilistic programming for "inverting" complicated programs. We will further explain these breakthroughs and how they can be synthesized to construct powerful models for seamless human-machine teaming.

1.4 Who cares?


Seamless human-machine teaming will serve as a transformative technology applicable to a wide variety of
scenarios and environments. In fact, we believe that it may benefit virtually any scenario that involves
interactions between humans and intelligent systems. It can have a broad impact across DoD, industry and
academia.

Application to critical DoD tasks.


Future robotics systems will commonly operate around humans and other robots or vehicles. For example,
massive efforts are underway worldwide for developing highly capable robots that operate in everyday
home and office environments. Similarly, progress has accelerated dramatically in the development of fully
autonomous mobile robots for operation in urban environments, as witnessed by the DARPA Urban Challenge and other large DoD projects on unmanned systems. Arguably, the least mature part of these systems is the ability to anticipate the paths that other vehicles or pedestrians may take in the future. Here again, we believe that human-machine teaming technology will provide a quantum leap in capabilities.

Integrating with a wide range of consumer “smart” devices


While the connection between human-machine teaming and robotics is obvious, the type of “machine” can
be much more broadly defined, including various "smart devices" equipped with 3D sensors. These devices are rapidly penetrating our daily lives. The "Tango" project at Google integrates a 3D sensor into a
smart phone, making the phone a machine that can “see” human motion and the 3D environment
(https://www.google.com/atap/project-tango/). Similarly, 3D sensors have also been integrated with “smart
furniture”, with its own cloud-connected AI. Our model is a critical component for these smart devices to
understand human mental states and respond in socially intelligent and appropriate ways.

Synthesizing several academic fields


Our model will firmly establish the intrinsic connections among a wide range of academic fields, including
cognitive science, computer vision, robotics, and computer graphics. In fact, members of our team are
actively organizing several workshops at major conferences to spread our perspective. These workshops include the CVPR workshop on Vision Meets Cognition (2014, 2015, http://visionmeetscognition.org/fpic2015/) and the "Physical and Social Scene Understanding" workshop at CogSci 2015.

1.5 What are the risks and the payoffs?


Physics engines are mature, accurate, and fast. In contrast, planning engines for humanoid models are still a cutting-edge technique in robotics and computer graphics. Implementing such an engine is challenging. In addition, inverting such a complicated program via probabilistic programming is also challenging and computationally expensive. Our model requires robust 3D tracking of human motion, which may contain errors, especially when the human body is severely occluded; these errors may degrade the quality of our model's inference. As a result, our initial model will only work in limited contexts as a proof of concept, and it may not run in real time.

The payoffs are that machines with human-like social common sense can interact with humans in close
quarters, and conduct joint activities with humans. The machines will be able to anticipate human
movements, to communicate their own intentions, and to do all of this in ways that humans can
automatically and unconsciously grasp and trust. This will be necessary for humans to feel safe working
with these robots in an effective and rewarding manner.

1.6 How much will it cost?


We are proposing a budget of $650K for one year.

1.7 How long will it take?


As a demonstration of concept, the initial implementation of our model will take about 1-1.5 years. Scaling our model up to more complex tasks will take about 3-5 years.

1.8 What are the midterm and final "exams" to check for success? How will progress be
measured?
The success of our model can be tested with the social “helping” and “hindering” tasks (see below). By
observing such meaningful human interactions, the model should be able to infer humans’ mental states and
plan human-machine joint actions to maximize (helping) or minimize (hindering) human objectives. The
midterm exam can test the model’s ability to infer human intentions. The final exam requires the execution
of human-machine joint actions.

Progress can be measured using several criteria. (1) Human vs. Model performance. Our approach to
human-machine teaming is to reverse-engineer human-human teaming. To achieve this goal, it is necessary
to design tasks that can be performed by both humans and machines. A “human benchmark” can be
obtained by rigorous psychophysics. The machine performance can then be tested against this benchmark.
(2) Dimensions of mental states. Human mental states include belief, intention, and utility. We propose to
start by inferring each of these states individually, and then infer the entire mental state by modeling their
joint-distributions with probabilistic programming. (3) Human-human interactions can be recursive.
Ideally, a machine should be able to predict a human’s reaction to its action and then use that reasoning to
further optimize its action. Progress can be made by moving from non-recursive models to recursive
models.

2. Technical descriptions and key innovations

2.1 Scientific and Engineering foundations


Our approach is strongly supported by recent breakthroughs within several fields of research.
Revealing the developmental origin of human-human teaming
Revealing the developmental origin of human social common sense is one of the most exciting and fruitful
areas of cognitive science. In the past 10 years, it has been shown that young infants are capable of
representing others’ beliefs, even when they are inconsistent with their own beliefs (Kovács, et al., Science,
2010; Onishi & Baillargeon, Science, 2005). Infants can distinguish and evaluate different types of multi-agent interactions, including "helping" and "hindering" (Hamlin et al., Nature, 2008). They also
spontaneously help human adults by “correcting” the adults’ error, or removing “obstacles” (Warneken &
Tomasello, Science, 2006, see Figure 1). When interacting with human adults, the way they learn is largely
dependent on their inference of whether the adults are “intentionally” teaching them (Topál et al., Science,
2009). Infants’ ability to understand the intentions of human action is determined by their first-person
experience of generating similar actions (Skerry, Carey, & Spelke, PNAS, 2013).

We see a great opportunity to reverse-engineer these exciting scientific discoveries in a computational model of "social common sense", one that can enable a machine to perceive, interact with, and learn from human adults as infants do. Here we highlight one of these studies to illustrate its implications for modeling seamless human-machine teaming. Figure 1 depicts a sequence from the "helping" test of Warneken & Tomasello (2006). Figure 1A shows an adult who couldn't put a pile of books into a cabinet, as he couldn't free his hands to open the door; Figure 1B shows that, without any instruction, infants spontaneously realized that the adult needed help and opened the door for him. We propose that these types
of tests from developmental studies on young infants can be turned into intriguing challenges for seamless
human-machine teaming.

Figure 1. A "helping" test from Warneken & Tomasello (2006). An 18-month-old infant spontaneously helps an adult by opening the door of a cabinet.

Working models of human-machine teaming from 2D displays.


Parallel to developmental studies on human infants, our team members have made substantial progress in constructing computational models of children's and adults' Theory of Mind (Baker, Saxe, & Tenenbaum, Cognition, 2009; Baker, Saxe, & Tenenbaum, CogSci, 2011) (Figure 2). These models take as input
observations of other agents’ actions and situations, and use Bayesian inference over probabilistic programs
representing the agents’ situation- and mental-state-dependent planning to output inferences of the
unknown beliefs, desires, goals, situations, and social relationships that caused the agent’s actions, termed
“inverse planning” (Figure 2C). Across a series of studies, our research has demonstrated that people’s
inferences of others’ beliefs and desires (Baker et al., CogSci, 2011; Figure 2A), goals, interactions
(helping or hindering; Ullman et al., NIPS, 2009; Figure 2B), and even emotions can be captured
empirically with Bayesian inverse planning models.

The specific probabilistic programs used to capture agents’ planning processes are based on artificial
intelligence (AI) and economics models of rational agents acting under uncertainty, such as (partially
observable) Markov decision processes and Markov games. The notion of probabilistic programming over
rich “game AI”-based virtual agents naturally generalizes these models, bringing them to domains with
richer agents, actions, and physics. Emerging evidence suggests that integrating these principles can
provide a unified approach to modeling infant core-knowledge.
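
For concreteness, the sketch below shows the generic value-iteration planner at the heart of such a Markov decision process formulation; the three-state transition and reward structure is a placeholder that would be replaced by game-engine states and dynamics, and none of the names here are specific to our eventual implementation.

    import numpy as np

    def value_iteration(n_states, n_actions, T, R, gamma=0.95, tol=1e-6):
        """Generic MDP planner. T[a] is an (n_states x n_states) transition matrix
        for action a; R is an (n_states,) reward vector over states."""
        V = np.zeros(n_states)
        while True:
            Q = np.stack([R + gamma * T[a] @ V for a in range(n_actions)])
            V_new = Q.max(axis=0)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q.argmax(axis=0)          # state values and greedy policy
            V = V_new

    # Toy 3-state chain: action 0 stays put, action 1 moves toward the rewarded state.
    T = [np.eye(3),
         np.array([[0, 1, 0], [0, 0, 1], [0, 0, 1]], dtype=float)]
    R = np.array([0.0, 0.0, 1.0])
    values, policy = value_iteration(3, 2, T, R)
    print(values, policy)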


[Figure 2 graphic: panel A, people's vs. the model's desire and belief inferences for the food-truck scenario; panel B, people's vs. the model's help/hinder judgments across frames; panel C, schematic of the observer model linking the world state, the agent's mind (beliefs, desires, planning), physics, graphics/haptics, and actions.]

Figure 2. "Inverse-planning" based models of human belief, desire, and interactive goal inference. (A) Modeling adults' inferences of beliefs and desires, based on an agent's observed pathway through a simple environment. In this example, people and the model attribute a strong desire to the agent (for the (M)exican food-truck, versus the (L)ebanese or Korean food-trucks), and a strong false belief (that the (M)exican truck, rather than the (L)ebanese truck or (N)othing, was parked where the (L)ebanese truck is actually located). (B) Modeling adults' inferences of the social goals to "help or hinder". Here, a strong agent moves a heavy "boulder" out of a weak agent's way, and people infer that it is trying to help.

3D scene reconstruction for human-machine teaming


Our model of human-machine teaming requires 3D representations of human skeletal movements and the surrounding environment. Such representations cannot be constructed through 2D image classification algorithms alone. Fortunately, this challenge can be addressed by RGBD sensing technology, which captures 2D color images as well as 3D depth information as a point cloud. Human movements and the 3D environment can be reconstructed by combining the 2D and 3D information. 3D sensing is under rapid development and has been implemented in a wide range of commercial devices (such as the Microsoft Kinect sensor developed for the Xbox video game console). With this technology, it is convenient to represent the world in a way that can be easily matched to the outputs synthesized by a game engine, which also operates in 3D space.
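
As a simple illustration of the geometry involved (not a description of any particular sensor's API), the following sketch back-projects a depth image into a point cloud under a pinhole camera model; the intrinsics fx, fy, cx, cy are hypothetical values roughly in the range of consumer RGBD sensors, and real pipelines additionally require calibration and depth-to-color registration.

    import numpy as np

    def depth_to_point_cloud(depth, fx, fy, cx, cy):
        """Back-project a depth image (in meters) into an N x 3 point cloud in the
        camera frame, using a pinhole camera model."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        z = depth
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
        return points[points[:, 2] > 0]                 # drop pixels with no depth reading

    # Illustrative depth map (a flat wall 2 m away) and illustrative intrinsics.
    fake_depth = np.full((480, 640), 2.0)
    cloud = depth_to_point_cloud(fake_depth, fx=570.0, fy=570.0, cx=320.0, cy=240.0)
    print(cloud.shape)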

Rational robotic motion planners for human-machine teaming


We need a planning program to synthesize rational human actions. Humans' multi-joint movements can be represented as a motion trajectory in a high-dimensional space. In complex environments, rational planning of high-dimensional collision-free trajectories is extremely challenging. Fortunately, robotics research has made tremendous progress on this problem in the last few years. Several algorithms have been developed that can optimize a motion trajectory given a customized utility function. These algorithms include RRT* (Karaman et al., International Journal of Robotics Research, 2011), STOMP (Kalakrishnan et al., ICRA, 2011), CHOMP (Zucker et al., International Journal of Robotics Research, 2013), and TRAJOPT (Schulman et al., International Journal of Robotics Research, 2014). They greatly improve the quality of motion trajectories for controlling a robotic arm, making the trajectories short, smooth, collision-free, and energy-efficient. It is now reasonable to integrate these robotic motion-planning algorithms into a game engine to synthesize human actions.
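
The sketch below conveys the flavor of such cost-based trajectory optimization on a deliberately tiny problem; it is not an implementation of RRT*, STOMP, CHOMP, or TRAJOPT, and all weights, step sizes, and the single circular obstacle are illustrative assumptions. A straight-line 2D trajectory is refined by numerical gradient descent on a cost that trades off smoothness against obstacle clearance.

    import numpy as np

    def trajectory_cost(traj, obstacle, radius, w_smooth=1.0, w_obs=5.0):
        """Cost = sum of squared segment lengths (smoothness) plus a penalty for
        waypoints that come within `radius` of a circular obstacle."""
        smooth = np.sum(np.diff(traj, axis=0) ** 2)
        dist = np.linalg.norm(traj - obstacle, axis=1)
        obstacle_pen = np.sum(np.maximum(radius - dist, 0.0) ** 2)
        return w_smooth * smooth + w_obs * obstacle_pen

    def optimize(start, goal, obstacle, radius, n=20, iters=400, lr=0.005, eps=1e-4):
        """Refine a straight-line trajectory by numerical gradient descent on the
        cost above, keeping the two endpoints fixed."""
        traj = np.linspace(start, goal, n)
        for _ in range(iters):
            base = trajectory_cost(traj, obstacle, radius)
            grad = np.zeros_like(traj)
            for i in range(1, n - 1):                    # interior waypoints only
                for d in range(traj.shape[1]):
                    bumped = traj.copy()
                    bumped[i, d] += eps
                    grad[i, d] = (trajectory_cost(bumped, obstacle, radius) - base) / eps
            traj = traj - lr * grad
        return traj

    path = optimize(np.array([0.0, 0.0]), np.array([1.0, 0.0]),
                    obstacle=np.array([0.5, 0.0]), radius=0.2)
    print(np.round(path, 2))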

Probabilistic programming for human-machine teaming

Probabilistic programming languages in general allow one to express a wide range of different generative
models as programs, and to perform automatic inference on those models. Our group has successfully modeled computer vision with probabilistic programming (Kulkarni, Kohli, Tenenbaum & Mansinghka, CVPR,
2015). It has long been popular to think of vision conceptually as "inverse graphics": observed visual data
are the result of a causal graphics process that takes the scene (the world's geometric and object structure)
as input and produces images as output, and the goal of vision is to work backwards from the observed
output of this process to the most likely inputs (the scene most likely to have generated the observed
image). We have now been able to formulate this approach very generally using probabilistic programming
as running a "graphics engine" in reverse, wrapping a standard real-time graphics engine inside a domain-
specific probabilistic programming engine for scene perception, which we call "Picture".

What we would like to do here is analogous to (and generalizes the idea of) wrapping a graphics engine in a
domain-specific probabilistic programming language; now we would like to do that for the whole game
engine functionality, to include the physics simulation engine and the online planning engine, and
especially a physics-sensitive planner. We think this should provide the basis for endowing robots with
intuitive physics and intuitive psychology, and thereby the core of common sense at the heart of seamless
human-robot teaming.

Inverse Planning with the rationality principle


One important principle underlying human common sense is the "rationality principle", which asserts that an intelligent agent achieves its goal by maximizing the efficiency of its actions (Figure 2A). Inefficient actions are not interpreted as goal-directed. The underlying mechanism of this principle is further revealed by a recent study (Skerry et al., PNAS, 2013) showing that infants learn to interpret others' actions through their own experience producing goal-directed actions. Only infants who had successful reaching experience differentiated rational direct reaches from irrational indirect reaches (Figure 2B). This supports our proposal that the same planning engine is used both for generating actions and for inferring intentions. By using state-of-the-art robotic motion-planning techniques, we have implemented a planning engine that is capable of synthesizing the collision-free human reaching actions employed in developmental studies (Figure 2C).

Figure 2. Understanding of intentionality and the rationality principle. (A) Graphical representation of the
rationality principle; (B) Goal-directed and non-goal directed reaching motions generated by using the
same arm trajectory but changing the location of the barrier. (C) Synthesized environment-sensitive robotic
reaching trajectory.

2.2 Proposals
We propose to model human-machine teaming by integrating probabilistic programming from the Tenenbaum group and the optimal control methods developed by the Todorov group. We propose to start by inferring the intention of a single agent (2.2.1), which will allow us to integrate the two groups' systems. Building on this intention-inference module, we then describe how to explore multi-agent teaming with distributed optimal planners (2.2.2). For both the single-agent and multi-agent projects, we articulate the task, computational model, experimental manipulations, and objective measurements.

2.2.1 Inferring the intention of a single agent


Task:
In a fetching task, a human reaches for a randomly selected target on a table surface. Depending on the structure of the environment and the position of the human, she can reach the target by (a) just grasping; (b) standing up and grasping; or (c) standing up, moving along the table contour, and grasping. An observer will watch the human's action and infer her intention.

Computational Model:
The structure of our model is depicted in Figure 3: (a) a human actor's 3D skeletal movements are tracked using input from an RGBD sensor (Figure 3A, B) or motion sensors attached to body joints; (b) the tracked human skeleton is converted to a customized kinematic model implemented in the Robot Operating System (ROS, http://www.ros.org/, Figure 3C); (c) collision-free primitive motions (e.g., stand, walk, grasp) are generated by a near-optimal robotic planning algorithm, with a cost function tailored for human-like motions; (d) primitive actions are composed by a Hierarchical Task Network, which can be understood as a forward planning engine defined over a set of stochastic grammars (Figure 3D); (e) a long sequence of human actions can then be synthesized by rationally composing primitive motions for each possible goal (Figure 3E); (f) finally, the goal of an action can be inferred by comparing the observed human trajectory to the synthesized trajectories using the approximate Bayesian model we have described (Figure 3F).
In our system, human intentions are inferred by inverting a planning program that combines a hierarchical
task network (HTN) from AI (Nau et al., Journal of Artificial Intelligence Research, 2003) and state-of-the-
art motion planners. This illustrates our proposal that the same planning program is used for both
generating actions (running the program forward) and inferring intentions (running the program backward).
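
A minimal sketch of step (f), under strong simplifying assumptions: trajectories are short 2D point sequences, the planner that synthesizes a rational trajectory for each candidate goal is passed in as a plain function (here a straight-line stand-in for the HTN + motion-planner pipeline), and deviations between observed and synthesized trajectories are treated as Gaussian noise. All names and the noise level are illustrative.

    import numpy as np

    def goal_posterior(observed, goals, synthesize, sigma=0.1):
        """Score each candidate goal by how well its synthesized (rational) trajectory
        explains the observed one, then normalize (Bayes' rule, uniform prior)."""
        log_w = []
        for g in goals:
            planned = synthesize(g, len(observed))
            log_w.append(-np.sum((observed - planned) ** 2) / (2 * sigma ** 2))
        log_w = np.array(log_w)
        w = np.exp(log_w - log_w.max())
        return w / w.sum()

    def straight_line_planner(goal, n, start=np.zeros(2)):
        """Stand-in for the HTN + motion planner: move straight toward the goal."""
        return np.linspace(start, goal, n)

    goals = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
    observed = straight_line_planner(goals[0], 10) + np.random.normal(0, 0.02, (10, 2))
    print(goal_posterior(observed, goals, straight_line_planner))

In the full model, the stand-in planner would be replaced by the pipeline of steps (c)-(e) above.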

Figure 3. Schematic illustration of inferring human intentions by inverting a planning program.

Experimental Manipulations:
(1) Time course of inference.
Successful human-human teaming depends on rapid inference of others' intentions. We predict that a human observer can efficiently decode another human's intention even after observing just a fraction of her action. To evaluate the time course of intention inference, we can manipulate the percentage of the human trajectory presented to the observer, varying from 10% to 80%.
(2) Sources of social information. To identify the critical visual features for inferring human intention, we can use a divide-and-conquer approach. Different sources of social features (face, arm, and body) can be isolated using image-processing techniques. Since gaze is a particularly critical yet subtle source of social information, we can design a separate experiment to manipulate the visibility of different features around the eye. Figures 4-5 depict the isolation of different types of social information.

Figure 4. Isolating different types of social information in a human reaching task.

Figure 5. Isolating different types of social information in a gazing task.

Measurement:
In this task, we will primarily use psychophysical measurements, including reaction time and accuracy, to
evaluate the objective efficiency of intention inference.

2.2.2 Agent-Agent teaming


We propose the following set of experiments for agent-agent teaming.

Tasks:
(1) Fetch.
We can adapt the "fetch" task for agent-agent interaction. Agent A needs to fetch a target on a table surface. Possible targets are scattered around Agent B, and are relatively far away from Agent A. This setup makes teaming with Agent B desirable, as B can help A by handing the target over to him (B can also hinder A by removing the target from A's reach). To be an efficient helper, B needs to rapidly recognize A's intention, grasp the object, and plan a joint action for the handover with A. Figure 6 depicts the hindering and helping interactions, and the tracked human skeleton.

Figure 6. Human-human teaming with "hindering" and "helping" games. (A) Two actors compete to reach for the same target; (B) one actor helps the other actor reach for the target; (C) human movements captured by an RGBD sensor.

(2) Obstacle Removal.


In a cluttered warehouse, one typical "helping" behavior is for B to remove an obstacle that would block A's future actions. In many cases, an agent needs this type of assistance simply because his or her hands are occupied, which can make even simple actions (such as opening a door) very challenging (see Figure 7A). In the Warneken & Tomasello (2006) task (Figure 1), the infant helped the adult by opening the door of a cabinet. This can be viewed as a variant of the "obstacle removal" task. The Tenenbaum group has explored a similar "helping" task in a 2D grid world (Figure 7B). Here we propose to scale up this helping task to the real world by using the Todorov group's planning engine for manipulating physical objects with hands (Mordatch, Todorov, & Popović, 2012) (Figure 7C).

Figure 7. Obstacle removal as a type of helping. (A) Even opening a door can become very challenging when hands are occupied. (B) Modeling a "helping" task in a 2D grid world, from Ullman et al. (2009); here the big agent helps the small agent by moving the rock to a different location. (C) A physics-based humanoid planner for manipulating an object with hands (Mordatch et al., 2012).

(3) Coordinated Carrying.


It is typically more efficient or even necessary for two agents to manipulate large or heavy objects jointly (a
real-world example is shown in Figure 8A). The planning engine from Todorov’s group is capable of
synthesizing the actions of two agents with a centralized planner (Figure 8B). Human coordinated actions
can be modeled by replacing the centralized controller with distributed controllers, each of which is capable
of inferring other agents’ actions and planning accordingly.

(4) Coordinating Picking and Placing of Items Among Teammates with Varying Capabilities
A major source of synergy within human-robot teams is the differing capabilities that each member brings to a particular problem. For example, robots and humans may have different size, strength, dexterity, and risk tolerance, which can all affect the optimal allocation of tasks between teammates. To exemplify this, consider teams that must pick items from and place them on shelves of different heights. Suppose each teammate has a specific range of heights at which they can comfortably pick and place items. For teams in which either (1) both players know the goal, or (2) only one of them does, we can implement algorithms that achieve seamless coordination of pick-and-place tasks among the players, given their common knowledge of all teammates' capabilities. This will rely on sophisticated recursive reasoning between players to arrive at a joint plan of action.

Figure 8. Coordinated carrying. (A) Two humans carry a big box jointly. (B) Two simulated agents manipulate an object jointly using a central planner (Mordatch et al., 2012).

(5) Flexible mixture of tasks. Teaming in the real world rarely involves only a single type of task. For example, fetching a heavy object cannot be achieved with a hand-over, but requires Agent B to carry the target object together with A. While carrying this object, one agent may have to free a hand to remove
obstacles in their path. Humans can adaptively adjust their teaming strategy by using their physical and
social common sense. We also hope to capture this type of flexible teaming with a planning engine that can
provide optimal actions in different environments.

Experimental Manipulations:
To explore the above scenarios, we propose an experimental design that manipulates three orthogonal
variables: the type of agents, the type of teaming, and the transparency of intention. The product of the
three variables can cover a large range of social interactions.

(1) The type of Agents.


The above scenarios involve two agents that can be either humans or machines, yielding a rich set of agent-agent interactions. The first and last columns of Table 1 show the combinations of different types of agents. Human-human teaming can be tested in real scenes by tracking the human body with an RGBD sensor, or
placing motion sensors on human joints. Machine-Machine teaming can be tested in a virtual reality (VR)
environment with simulated robots. Human-machine teaming can also be explored in a VR environment in
which a human interacts with a simulated robot. These investigations can provide a solid foundation for
human-machine teaming in the real world, which can be investigated with a larger grant based on this
seedling.

Table 1: Types of agent-agent interaction


Agent B    | Teaming     | Agent A
Human      | Helping     | Human
Machine    | Hindering   | Machine
Human      | No-Teaming  | Machine
Machine    |             | Human

(2) The type of teaming


We can also manipulate the type of teaming. In our framework, different types of teaming can be compactly
defined by how Agent B’s utility relates to Agent A’s utility. Helping, as the core of teaming, is our primary
focus. In a Helping team, B maximizes A’s utility by planning its actions jointly with A. To reveal the
unique behavioral signatures of helping and the objective gains from it, we also need to explore a "No-Teaming" condition as a baseline, in which B's utility is independent of A's. As a result, A has to finish the
task alone without B’s assistance. In the “Hindering” condition, B’s objective is to minimize the utility of
A. This can be viewed as an “anti-teaming” condition, which has important implications for adversarial
scenarios.

(3) The transparency of intentions


We will also manipulate the “transparency” of an agent’s intention: whether Agent B knows Agent A’s
intention. In the “Transparent” condition, B knows A’s intention before the task starts; in the “Non-
Transparent” condition, B does not know A’s intention, and has to infer it on-the-fly. Contrasting these two
conditions can reveal the cost of real-time inference of intention.

In human-human teaming, we expect that the team performance of the “Non-Transparent” condition should
be only slightly worse than that of the “Transparent” condition, given cognitive science studies showing
humans’ remarkable ability to infer others’ intentions. The performance in the human-human “Non-
Transparent” condition can serve as a human benchmark for evaluating machine-machine teaming and
human-machine teaming.

In machine-machine teaming, the “Transparent” condition can be approximated by planning A and B’s joint
actions with a central planning engine. Such a planning engine has already been developed in the Todorov
group. The “Non-Transparent” condition, however, requires a distributed controller for each individual
agent. It requires each distributed planner to have a model of its teammates, and plan its action based on
this model’s inference of others’ intentions. Therefore, the efficiency of teaming can be evaluated by
contrasting the performance of a central planner with distributed planners. Similar to the human-human
case, we expect performance from distributed planners should be only slightly worse than that of the central
planner.

Measurements
In combination with the tasks and experimental designs, we propose a set of objective measurements of
teaming efficiency.

(1) Productivity
In all the tasks we have proposed, an agent has a job to accomplish. Therefore, a straightforward evaluation of teaming performance is simply how many jobs the agents can finish in a given period of time. A good team should significantly outperform teams in which agents work independently. In addition, inferring other team members' intentions should not worsen team performance significantly. Beyond group-level performance, our model should be able to capture trial-by-trial variance even within each teaming condition. Within the same task, the gain from teaming will vary from trial to trial due to the environmental configuration and the agents' positions. We hope to capture this trial-by-trial variance with our model. The Tenenbaum group has used similar trial-by-trial analyses to verify models of human cognition (Baker et al., 2009; Ullman et al., 2009).

(2) Energy Consumption


One major objective of an optimal planner is to minimize total energy consumption, which can serve as an objective measurement of team efficiency. Our assumption is that an efficient team should significantly reduce energy consumption, primarily because teaming allows the two agents to select their actions from a much larger action space than when they act alone. Energy consumption can be approximated by using a cost function defined over the agents' motion trajectories.
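
As one concrete, admittedly simplified way to compute this, the sketch below integrates squared joint velocities over a joint-angle trajectory, a standard effort proxy; the sampling rate, number of joints, and trajectories are illustrative assumptions rather than our committed energy model.

    import numpy as np

    def effort(joint_trajectory, dt=1.0 / 30.0):
        """Approximate energy as the time-integral of squared joint velocities.
        joint_trajectory: (T x n_joints) array of joint angles sampled every dt seconds."""
        velocities = np.diff(joint_trajectory, axis=0) / dt
        return float(np.sum(velocities ** 2) * dt)

    # A smooth reach "costs" less than a jerky reach covering the same joint angles.
    t = np.linspace(0.0, 1.0, 31)[:, None]
    smooth = t * np.array([1.0, 0.5, 0.2])               # three joints ramping up smoothly
    jerky = smooth + 0.05 * np.random.randn(*smooth.shape)
    print(effort(smooth), effort(jerky))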

(3) Third-Party Observation: What are they doing and how are they doing it?
Social common-sense can also be revealed by asking an observer to watch the interactions of other agents,
and then report their goals and the nature of their teaming (e.g., whether B is helping A; whether B knows
A’s intention or has to infer it). The observer’s performance at “perceiving” others’ social interactions can
be objectively evaluated by classic psychophysical metrics, including (a) accuracy: how accurately can the
observer decode the interactions of the team; and (b) reaction time: how fast can the observer make a
judgment. The observers include both humans and computational models. Machines with human-like social
common sense should be able to understand their “social environment” by observation.

3. Statement of Work

The Tenenbaum group proposes to create teaming scenarios, develop objective measurements, and
construct models of recursive theory-of-mind using probabilistic programming. Tao Gao, who is now a postdoctoral fellow in the Tenenbaum group, will join GE Research and continue to work closely with the Tenenbaum group.

Task 1: We will develop a range of teaming scenarios, based on human-human teaming from cognitive
development. These scenarios are anticipated to include collaborative tasks such as: fetching of objects,
obstacle removal, coordinated carrying, and coordinated activities among teammates with varying
capabilities, as well as flexible mixing of different tasks and scenarios. We will design objective
performance measurements, anticipated to include: the productivity of teams versus individuals, and the
productivity of teams in which the intention is commonly known versus known to a subset of teammates;
the energy consumption of the team, per unit of productivity, which can be approximated using a cost
function defined over teammates' states and actions; and third-party observation, asking external observers
to rate how effectively or naturally the team is performing the task.

Task 2: We will develop models for inferring the intent of 3D agents. These will take as input the
movements of a 3D humanoid skeleton over time, and a simplified 3D model of the environment, and
output inferences of the intention of the actor, and potentially other factors such as the actor's subjective
cost function, or the individual constraints they are subject to. These models will be based on extensions of
previous work in the Tenenbaum and Todorov labs on "inverse-planning" and "inverse optimal control". We
will develop objective performance measurements anticipated to include: a comparison between model
inferences and the ground-truth intentions on which the actions were based (where applicable); and a
comparison between model and third-party human inferences of intent (similar to previous work in the
Tenenbaum lab).

Task 3: We will develop models for planning joint teaming actions, given transparent or opaque intentions.
Joint action planning given transparent intentions will use centralized planners such as those developed by
the Todorov lab. Joint action planning with opaque intentions will require team members to jointly infer each other's knowledge and intentions in the loop, and plan joint coordination behaviors using recursive theory-of-mind reasoning. This kind of nested game-theoretic computation is intractable in general; thus we will explore simple approximations, such as truncating nested reasoning at a shallow level.
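
A minimal sketch of one such truncation, assuming a one-shot coordination game specified as a shared payoff matrix: a level-0 teammate model acts uniformly at random, and each higher level best-responds to the level below it. This is the standard level-k construction and is illustrative only; the payoff matrix and depth are placeholders, not our final approximation scheme.

    import numpy as np

    def best_response(payoff, other_policy):
        """Best response (a row index) to a distribution over the other's actions."""
        return int(np.argmax(payoff @ other_policy))

    def level_k_policy(payoff, k):
        """Truncated recursive reasoning: level 0 acts uniformly at random,
        and level k deterministically best-responds to a level k-1 teammate."""
        n = payoff.shape[0]
        policy = np.full(n, 1.0 / n)                     # level-0: uniform
        for _ in range(k):
            a = best_response(payoff, policy)
            policy = np.zeros(n)
            policy[a] = 1.0
        return policy

    # Shared payoff: coordinating on shelf 0 (value 3) beats coordinating on shelf 1 (value 2).
    payoff = np.array([[3.0, 0.0], [0.0, 2.0]])
    for k in range(3):
        print(k, level_k_policy(payoff, k))              # the policy stabilizes at a shallow depth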

The University of Washington team led by Dr. Todorov will extend optimal control methods developed in prior work to the context of multi-agent interaction and inverse optimal control:

Task 4: We will adapt our trajectory optimization methods to the problem of inverse optimal control. The
input will be a family of cost functions that depend on to-be-inferred parameters, together with observed
motions generated by another agent. These motions will be used to infer the parameters of the cost function
being optimized by the other agent. This will be done by minimizing the discrepancy between the observed
movements and the movements that are optimal for a given set of cost function parameters.
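
A minimal sketch of this fitting loop, under strong simplifying assumptions: movements are 1D reach profiles drawn from a small candidate family, the parameterized cost trades off effort against time spent away from the goal, the "optimal" movement for a parameter value is found by enumeration, and the parameter is recovered by grid search over the discrepancy with the observed movement. In practice the forward problem would be solved by the trajectory-optimization methods described earlier, and all names and values here are illustrative.

    import numpy as np

    def candidate_trajectories(goal=1.0, steps=20):
        """A small family of 1D reach profiles from 0 to `goal`, indexed by how
        front-loaded the movement is."""
        t = np.linspace(0.0, 1.0, steps)
        return [goal * t ** p for p in np.linspace(0.5, 2.0, 16)]

    def cost(traj, theta, dt=0.05):
        """Parameterized cost family: theta trades off effort (squared velocity)
        against time spent away from the final (goal) position."""
        v = np.diff(traj) / dt
        effort = np.sum(v ** 2) * dt
        lateness = np.sum((traj[-1] - traj) ** 2) * dt
        return theta * effort + (1.0 - theta) * lateness

    def optimal_movement(theta, candidates):
        """Forward problem: the movement that is optimal (within the family) for theta."""
        return min(candidates, key=lambda tr: cost(tr, theta))

    def infer_theta(observed, candidates, grid=np.linspace(0.05, 0.95, 19)):
        """Inverse optimal control: pick the cost parameter whose optimal movement
        has the smallest discrepancy from the observed movement."""
        return float(min(grid, key=lambda th: np.sum((observed - optimal_movement(th, candidates)) ** 2)))

    cands = candidate_trajectories()
    observed = optimal_movement(0.7, cands) + np.random.normal(0.0, 0.01, 20)
    print(infer_theta(observed, cands))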

Task 5: We will develop a decentralized algorithm for trajectory optimization involving two agents: main
actor (A) and helper (B). Each agent will form a plan for both agents, so as to anticipate what the other agent will do, but will only execute its own portion of the plan. The two agents will plan with respect to different cost functions. Agent A knows the true cost. Agent B either knows the true cost, or knows nothing – in which case it adopts a stay-out-of-the-way cost for itself and assumes that Agent A is acting randomly. Inverse optimal control (from Task 4) will be used by Agent B to infer the parameters of the true cost from observations of how Agent A is moving. At the same time, Agent A will infer whether B already knows the true cost.

Additional research to achieve project goals: Additional research could include inverse optimal control
over joint actions and collaborative behaviors; simplified models of teaming for cognitive and behavioral
experimentation.

References

Baker, C. L., Saxe, R. R., & Tenenbaum, J. B. (2011). Bayesian Theory of Mind: Modeling joint belief-desire attribution. Proceedings of the Thirty-Third Annual Conference of the Cognitive Science Society, 2469–2474.

Baker, C., Saxe, R., & Tenenbaum, J. (2009). Action understanding as inverse planning. Cognition, 113, 329–349. http://doi.org/10.1016/j.cognition.2009.07.005

Bello, P., & Cassimatis, N. (2006). Developmental accounts of theory-of-mind acquisition: Achieving clarity via computational cognitive modeling. Proceedings of the 28th Annual Conference of the Cognitive Science Society, 1014–1019.

Breazeal, C., & Scassellati, B. (2002). Robots that imitate humans. Trends in Cognitive Sciences.
http://doi.org/10.1016/S1364-6613(02)02016-8

Carey, S., & Spelke, E. (1996). Science and Core Knowledge. Philosophy of Science.
http://doi.org/10.1086/289971

Dehaene, S., Izard, V., Pica, P., & Spelke, E. (2006). Core knowledge of geometry in an Amazonian
indigene group. Science (New York, N.Y.), 311(5759), 381–384.
http://doi.org/10.1126/science.1121739

Demiris, Y., & Meltzoff, A. (2008). The robot in the crib: A developmental analysis of imitation skills in
infants and robots. Infant and Child Development, 17(1), 43–53. http://doi.org/10.1002/icd.543

Hamlin, J. K., Wynn, K., & Bloom, P. (2008). Social evaluation by preverbal infants. Pediatric Research,
63(3), 219. http://doi.org/10.1203/PDR.0b013e318168c6e5

Hiatt, L. M., Harrison, A. M., & Trafton, J. G. (2011). Accommodating human variability in human-robot
teams through theory of mind. IJCAI International Joint Conference on Artificial Intelligence, 2066–
2071. http://doi.org/10.5591/978-1-57735-516-8/IJCAI11-345

Kalakrishnan, M., Chitta, S., Theodorou, E., Pastor, P., & Schaal, S. (2011). STOMP: Stochastic trajectory
optimization for motion planning. Proceedings - IEEE International Conference on Robotics and
Automation, 4569–4574. http://doi.org/10.1109/ICRA.2011.5980280

Karaman, S., Walter, M. R., Perez, A., Frazzoli, E., & Teller, S. (2011). Anytime motion planning using the RRT*. In Proceedings - IEEE International Conference on Robotics and Automation (pp. 1478–1483). http://doi.org/10.1109/ICRA.2011.5980479

Kiley Hamlin, J., Ullman, T., Tenenbaum, J., Goodman, N., & Baker, C. (2013). The mentalistic basis of
core social cognition: Experiments in preverbal infants and a computational model. Developmental
Science, 16(2), 209–226. http://doi.org/10.1111/desc.12017

Kovács, Á. M., Téglás, E., & Endress, A. D. (2010). The social sense: susceptibility to others’ beliefs in
human infants and adults. Science (New York, N.Y.), 330(6012), 1830–1834.
http://doi.org/10.1126/science.1190792

Kulkarni, T. D., Kohli, P., Tenenbaum, J. B., & Mansinghka, V. (2015). Picture: A probabilistic programming language for scene perception. CVPR.

Liszkowski, U., Carpenter, M., Striano, T., & Tomasello, M. (2006). 12- and 18-Month-Olds Point to
Provide Information for Others. Journal of Cognition and Development.
http://doi.org/10.1207/s15327647jcd0702_2

Mordatch, I., Todorov, E., & Popović, Z. (2012). Discovery of complex behaviors through contact-invariant
optimization. ACM Transactions on Graphics. http://doi.org/10.1145/2185520.2335394

Nau, D., Au, T.-C., Ilghami, O., Kuter, U., Murdock, J. W., Wu, D., & Yaman, F. (2003). SHOP2: An HTN planning system. Journal of Artificial Intelligence Research, 20, 379–404.

Ng, A., & Russell, S. (2000). Algorithms for inverse reinforcement learning. Proceedings of the Seventeenth International Conference on Machine Learning, 663–670.

Onishi, K. H., & Baillargeon, R. (2005). Do 15-month-old infants understand false beliefs? Science, 308(5719), 255–258.

Premack, D., & Woodruff, G. (1978). Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences, 1(4), 515–526.

Richardson, H. L., Baker, C. L., Tenenbaum, J. B., & Saxe, R. R. (2012). The development of joint belief-desire inferences. Proceedings of the Thirty-Fourth Annual Conference of the Cognitive Science Society. Retrieved from http://palm.mindmodeling.org/cogsci2012/papers/0168/paper0168.pdf

Schulman, J., Duan, Y., Ho, J., Lee, A., Awwal, I., Bradlow, H., … Abbeel, P. (2014). Motion planning with sequential convex optimization and convex collision checking. The International Journal of Robotics Research. http://doi.org/10.1177/0278364914528132

Skerry, A. E., Carey, S. E., & Spelke, E. S. (2013). First-person action experience reveals sensitivity to
action efficiency in prereaching infants. Proceedings of the National Academy of Sciences of the
United States of America, 110(46), 18728–33. http://doi.org/10.1073/pnas.1312322110

Spelke, E. S., & Kinzler, K. D. (2007). Core knowledge. Developmental Science, 10(1), 89–96.
http://doi.org/10.1111/j.1467-7687.2007.00569.x

Téglás, E., Vul, E., Girotto, V., Gonzalez, M., Tenenbaum, J. B., & Bonatti, L. L. (2011). Pure reasoning in
12-month-old infants as probabilistic inference. Science (New York, N.Y.), 332(6033), 1054–1059.
http://doi.org/10.1126/science.1196404

Topál, J., Gergely, G., Erdohegyi, A., Csibra, G., & Miklósi, A. (2009). Differential sensitivity to human
communication in dogs, wolves, and human infants. Science (New York, N.Y.), 325(5945), 1269–
1272. http://doi.org/10.1126/science.1176960

Trafton, G., Hiatt, L., Harrison, A., Tamborello, F., Khemlani, S., & Schultz, A. (2013). ACT-R/E: An embodied cognitive architecture for human-robot interaction. Journal of Human-Robot Interaction, 2(1), 30–55. http://doi.org/10.5898/JHRI.2.1.Trafton

Ullman, T., Baker, C., Macindoe, O., Goodman, N. D., & Tenenbaum, J. B. (2009). Help or Hinder:
Bayesian Models of Social Goal Inference. NIPS. Retrieved from https://papers.nips.cc/paper/3747-
help-or-hinder-bayesian-models-of-social-goal-inference.pdf

Warneken, F., & Tomasello, M. (2006). Altruistic helping in human infants and young chimpanzees.
Science (New York, N.Y.), 311(5765), 1301–1303. http://doi.org/10.1126/science.1121448

Warneken, F., & Tomasello, M. (2007). Helping and Cooperation at 14 Months of Age. Infancy, 11(3), 271–
294. http://doi.org/10.1111/j.1532-7078.2007.tb00227.x

Ziebart, B. D., Bagnell, J. A., & Dey, A. K. (2010). Modeling interaction via the principle of maximum causal entropy. In Proceedings of the 27th International Conference on Machine Learning (ICML). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.165.6641&rep=rep1&type=pdf

Zucker, M., Ratliff, N., Dragan, A. D., Pivtoraiko, M., Klingensmith, M., Dellin, C. M., … Srinivasa, S. S. (2013). CHOMP: Covariant Hamiltonian optimization for motion planning. The International Journal of Robotics Research, 32, 1164–1193. http://doi.org/10.1177/0278364913488805
