Springer Tracts in Advanced Robotics 135
Pascal Meißner
Indoor Scene Recognition by 3-D Object Search
For Robot Programming by Demonstration
Springer Tracts in Advanced Robotics
Volume 135
Series Editors
Bruno Siciliano, Dipartimento di Ingegneria Elettrica e Tecnologie
dell’Informazione, Università degli Studi di Napoli Federico II, Napoli, Italy
Oussama Khatib, Artificial Intelligence Laboratory, Department of Computer
Science, Stanford University, Stanford, CA, USA
Advisory Editors
Nancy Amato, Computer Science & Engineering, Texas A&M University, College
Station, TX, USA
Oliver Brock, Fakultät IV, TU Berlin, Berlin, Germany
Herman Bruyninckx, KU Leuven, Heverlee, Belgium
Wolfram Burgard, Institute of Computer Science, University of Freiburg, Freiburg,
Baden-Württemberg, Germany
Raja Chatila, ISIR, Paris cedex 05, France
Francois Chaumette, IRISA/INRIA, Rennes, Ardennes, France
Wan Kyun Chung, Robotics Laboratory, Mechanical Engineering, POSTECH,
Pohang, Korea (Republic of)
Peter Corke, Science and Engineering Faculty, Queensland University of
Technology, Brisbane, QLD, Australia
Paolo Dario, LEM, Scuola Superiore Sant’Anna, Pisa, Italy
Alessandro De Luca, DIAGAR, Sapienza Università di Roma, Roma, Italy
Rüdiger Dillmann, Humanoids and Intelligence Systems Lab, KIT - Karlsruher
Institut für Technologie, Karlsruhe, Germany
Ken Goldberg, University of California, Berkeley, CA, USA
John Hollerbach, School of Computing, University of Utah, Salt Lake, UT, USA
Lydia E. Kavraki, Department of Computer Science, Rice University, Houston, TX,
USA
Vijay Kumar, School of Engineering and Applied Mechanics, University of
Pennsylvania, Philadelphia, PA, USA
Bradley J. Nelson, Institute of Robotics and Intelligent Systems, ETH Zurich,
Zürich, Switzerland
Frank Chongwoo Park, Mechanical Engineering Department, Seoul National
University, Seoul, Korea (Republic of)
S. E. Salcudean, The University of British Columbia, Vancouver, BC, Canada
Roland Siegwart, LEE J205, ETH Zürich, Institute of Robotics & Autonomous
Systems Lab, Zürich, Switzerland
Gaurav S. Sukhatme, Department of Computer Science, University of Southern
California, Los Angeles, CA, USA
The Springer Tracts in Advanced Robotics (STAR) publish new developments and
advances in the fields of robotics research, rapidly and informally but with a high
quality. The intent is to cover all the technical contents, applications, and
multidisciplinary aspects of robotics, embedded in the fields of Mechanical
Engineering, Computer Science, Electrical Engineering, Mechatronics, Control, and
Life Sciences, as well as the methodologies behind them. Within the scope of the
series are monographs, lecture notes, selected contributions from specialized
conferences and workshops, as well as selected PhD theses.
Special offer: For all clients with a print standing order we offer free access to the
electronic volumes of the Series published in the current year.
Indexed by DBLP, Compendex, EI-Compendex, SCOPUS, Zentralblatt Math,
Ulrich’s, MathSciNet, Current Mathematical Publications, Mathematical Reviews,
MetaPress and Springerlink.
Pascal Meißner
IAR-IPR
Karlsruhe Institute of Technology
Karlsruhe, Germany
Dissertation approved by the KIT Department of Informatics. Oral examination on July 6th,
2018 at Karlsruhe Institute of Technology (KIT)
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To Carlo Bourlet—Professor at CNAM,
Paris, France—, a role model in dedication
and determination.
Foreword by Rüdiger Dillmann
At the dawn of the century’s third decade, robotics is reaching an elevated level of
maturity and continues to benefit from the advances and innovations in its enabling
technologies. These are all contributing to an unprecedented effort to bring
robots into human environments in hospitals and homes, factories and schools; in the
field for robots fighting fires, making goods and products, picking fruits and
watering the farmland, saving time and lives. Robots today hold the promise for
making a considerable impact on a wide range of real-world applications from
industrial manufacturing to health care, transportation, and exploration of the deep
space and sea. Tomorrow, robots will become pervasive and touch upon many
aspects of modern life.
The Springer Tracts in Advanced Robotics (STAR) is devoted to bringing to the
research community the latest advances in the robotics field on the basis of their
significance and quality. Through a wide and timely dissemination of critical
research developments in robotics, our objective with this series is to promote more
exchanges and collaborations among the researchers in the community and
contribute to further advancements in this rapidly growing field.
The monograph by Pascal Meissner is based on the author’s doctoral thesis. It
focuses on Robot Programming by Demonstration (PbD) to enable humans to teach
robots real-world tasks through physical demonstrations. The concept of Implicit
Shape Model (ISM) trees is introduced to derive scene representation models in
terms of the spatial relations among the objects to be manipulated. Then, an
optimization algorithm for Active Scene Recognition (ASR) allows embedding
canonical scene recognition in a decision-making system to select best camera
views for 3-D object localization.
Rich in experiments in a setup mimicking a kitchen, the results demonstrate the
good performance of ISM trees as scene classifiers for a large number of object
arrangements. A very fine addition to the STAR series!
Preface
While it is the purpose of this thesis to convey the most important findings of my
Ph.D. research, I want to take this preface as an opportunity to report on the very
nature of doing doctoral studies as I got to know it. While some may argue that
finding the right institution and getting admitted there is the main challenge for a
graduate—from my point of view, the former is a matter of personality, while a
good recipe for the latter is to carry out one’s studies at a lab of one’s own choice as
continuously as possible—I think that being a researcher is a major challenge that
has little in common with succeeding in one’s university studies. In my
experience, many of my colleagues experienced disappointment and frustration
while being Ph.D. candidates, even though working under—in my view—good
conditions. I suggest that issues of this kind result from misconceptions of what it
actually means to do doctoral studies. As an attempt to clarify this at least for my
field in Germany, I want to draw an analogy between being a Ph.D. candidate and
an entrepreneur on the basis of Long et al. from 1983. More precisely, I propose
that Ph.D. students consider themselves as being entrepreneurs. According to Long
et al., a first defining aspect of entrepreneurship is self-employment. While the
colleagues at my lab and myself were employees in the public service, I still think
that this attribute applied to us, e.g. because we were continuously expected to
come up with new research challenges on our own. Far beyond our mere interest in
technology, it was essential to have the ambition to discover research questions as
well as to develop and present appropriate answers. In the sense of my entrepreneur
metaphor, we had to figure out promising business opportunities, to develop offers
and to sell them. In my opinion, the fact that research findings are mostly attributed
to individuals is closely linked to the self-employment in academia and is thus an
indication of it. For example, Nobel Prizes are to this day awarded to an individual
and not to collaborative achievements. The impact of findings from Ph.D. research
is commonly regarded as a good measure for the achievement they represent. If one
considers a publication which contains such findings as an offer and the authors as its
suppliers, the impact can be equated with the benefit Ph.D. students can strive for.
As entrepreneurs, Ph.D. candidates should, therefore, keep the actual purpose
[...]
for any Ph.D. candidate as soon as he starts supervising. We tried to optimize our
efforts in supervising with various concepts such as chaining fixed-length
appointments, undergraduates working together on greater problems and
experiments, undergraduates supervising each other, groupware-supported supervision or
the usage of development frameworks such as Scrum. What proved to be essential
to us was not only relying on the aforementioned mechanistic approaches but also
taking into account the specific traits of each individual undergraduate in order to
adapt their respective tasks, working conditions as well as our leadership style
during his stay at our lab.
Provided sufficient expertise as well as the toughness and perseverance to remain
focused on obtaining research findings—despite the numerous encountered
distractions and interruptions—I am convinced that anyone who can identify with
being a research entrepreneur can find his fulfillment in my field. To conclude, I
wish everyone a hopefully insightful and maybe even enjoyable read of this thesis.
This book is equivalent to the Ph.D. thesis I submitted under the title “Indoor
Scene Recognition by 3-D Object Search for Robot Programming by
Demonstration” to the KIT Department of Informatics. I defended this thesis at
Karlsruhe Institute of Technology (KIT) on July 6th, 2018. The source code for all
contributions of this approved thesis is freely available under
https://github.com/asr-ros.
My sincere gratitude goes to Dr. Stefan Gächter Toya, Prof. John K. Tsotsos,
Dr. Robert Eidenberger and Prof. Antonio Torralba for inspiring me with their
research. They laid the foundations for the contributions of my thesis.
I am very grateful to my advisor Prof. Rüdiger Dillmann for putting his trust in me
while I pursued my doctoral studies. I particularly thank him for supporting my vision
while granting me complete freedom in defining and implementing it. Moreover, I
would like to thank Prof. Michael Beetz for the interesting conversations we had about
my research problems. My special thanks go to Prof. Torsten Kröger for his
tremendous support towards the end of my doctoral studies.
My deepest gratitude goes to my mentor Dr. Sven R. Schmidt-Rohr. First as my
supervisor, then as a colleague, he provided decisive support in word and deed
throughout highs and lows. I also thank him for broadening my horizon in
unexpected directions with his compelling enthusiasm and strategic foresight. Many
thanks to Dr. Rainer Jäkel for his expert advice as well as for his friendly, calm and
consistently helpful manner. My thanks additionally go to Dr. Martin Lösch for
being such a committed leader of our research group at the beginning of my
doctoral studies.
My gratitude goes to my student co-workers Tobias Allgeyer,
Florian Aumann-Cleres, Jocelyn Borella, Souheil Dehmani, Benny Fuhry,
Nikolai Gaßner, Joachim Gehrung, Fabian Hanselmann, Heinrich Heizmann,
Florian Heller, Robin Hutmacher, David Kahles, Oliver Karrenbauer,
Daniel Kleinert, Felix Marek, Matthias Mayr, Jonas Mehlhaus, Sebastian Münzner,
Trung Nguyen, Reno Reckling, Ralf Schleicher, Patrick Schlosser, Patrick Stöckle,
Daniel Stroh, Jeremias Trautmann, Richard Weiss and Valerij Wittenbeck for
spending countless days and nights in my two labs and joining me in struggling with
both hard- and software.
Contents
1 Introduction
1.1 Motivation
1.1.1 Programming by Demonstration
1.1.2 Passive Scene Recognition
1.1.3 Active Scene Recognition
1.2 Thesis Statements
1.3 Thesis Contributions
1.4 Document Outline
References
2 Related Work
2.1 Scene Recognition
2.1.1 Convolutional Neural Networks and Image Databases
2.1.2 Applicability of Convolutional Neural Networks and Conclusion
2.2 Part-Based Object Recognition
2.2.1 Overview
2.2.2 Constellation Models
2.2.3 Implicit Shape Models
2.2.4 Pictorial Structures Models
2.2.5 Comparison and Conclusion
2.3 View Planning
2.3.1 Overview
2.3.2 Selected Approaches to Three-Dimensional Object Search
2.3.3 Comparison and Conclusion
References
From 2014 onwards, the European Commission decided to spend up to 700 million
euros [16] on research and innovation in the field of robotics over seven years. Apart
from this considerable funding on the public side, the private side of the robotics
community in Europe invests another 2.1 billion euros [16] together with other
parties. The objective of this overall public-private investment of 2.8 billion euros is
defined in the Strategic Research Agenda (SRA) for robotics in Europe. The SRA
document describes robots as a key technology to address long-term societal issues
“such as healthcare and demographic change, food security and sustainable
agriculture, smart and integrated transport and secure societies” [1, p. 7]. For future robots to
succeed in that respect, the SRA in particular requires them to have perception abilities
and decisional autonomy [1, p. 41]. For instance, a robot should be able to decide on
its own which action it performs next. In the real world, decisional autonomy highly
depends on perception, as the following definition of the term situation illustrates:
A situation is the entirety of circumstances which are to be considered for the selection of
an appropriate behavior pattern at a particular point of time [42].
The authors claim that a situation is derived from an underlying scene, which they
define as follows:
A scene describes a snapshot of the environment including the scenery and dynamic elements
as well as all actors’ and observers’ self-representations and the relationships among those
entities [42].
Fig. 1.1 1: Mobile robot observing a breakfast scene in our laboratory setup. It reasons about which
action (skill) [25] to apply, e.g. “Shall I add utensils?” or “Shall I bring cereals?”. 2: Different
configurations of the same utensils and dishes, each representing a different scene category [23]
1 “X” stands for any configuration in 2 in Fig. 1.1, from “Pause” to “Do not like”.
Since the SRA requires autonomy [37, p. 35] from future robots, they will have to
be able to learn models of scenes on their own and in a multitude of domains. The
subsequent requirements regarding the generality and flexibility of a suitable model
for scene recognition are directly related to this observation.
1. A uniform representation of spatial relations that is sufficiently generic to describe
each type of relation but still captures the details in its variations.
2. Freedom in choosing which pairs of objects within a scene category are
interconnected by spatial relations.
A mobile household robot, which we take as an example, is going to face both
missing objects and clutter when trying to match a learnt scene classifier to its
percepts. Hence follows another requirement for scene classification:
3. Robustness against missing objects and clutter.
Existing work on probabilistic methods for part-based object recognition suffers
from severe limitations with respect to all these requirements. As an alternative, we propose to
derive scene recognition from another method from part-based recognition that is
called Implicit Shape Models (ISMs)2 [30], a Generalized Hough Transform [8]
variant.
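To illustrate the voting idea behind Generalized Hough Transform variants such as
ISMs, the following minimal sketch lets every detected object cast votes for the
position of a common reference and then selects the bin with the most distinct voters.
It is a simplified, position-only illustration and not the method developed in this
thesis; the function names, the bin size and the toy offsets are assumptions made for
this example.

```python
from collections import defaultdict


def cast_votes(detections, learned_offsets, bin_size=0.1):
    """Accumulate votes for the reference position in a discretized 3-D grid.

    detections: list of (name, (x, y, z)) object position estimates.
    learned_offsets: dict mapping an object name to a list of (dx, dy, dz)
        offsets from the object to the reference (hypothetical training data).
    """
    accumulator = defaultdict(set)
    for name, (x, y, z) in detections:
        # Objects without learned offsets (clutter) simply cast no votes.
        for dx, dy, dz in learned_offsets.get(name, []):
            vote = (x + dx, y + dy, z + dz)
            key = tuple(round(c / bin_size) for c in vote)
            accumulator[key].add(name)  # each object counts once per bin
    return accumulator


def best_reference(accumulator, n_expected):
    """Return the bin with the most distinct voters and a naive confidence."""
    if not accumulator:
        return None, 0.0
    key, voters = max(accumulator.items(), key=lambda item: len(item[1]))
    return key, len(voters) / n_expected
```

In this sketch a missing object only lowers the confidence, and clutter does not vote
at all, which hints at why voting schemes tend to tolerate both.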
In the usual household, scenes will not be visible as a whole from a single point
of view. In order to nevertheless gather evidence about existing scenes, robots will
have to integrate successive estimates about the presence of objects while freely
traversing their environment. Thus, a scene category representation should be favored
that fulfills the following requirements:
4. Independence from the viewpoint from which a scene is perceived.
5. Low time consumption, since scene recognition is executed repeatedly during
evidence-gathering.
Requirement 4 is best met with a scene category model in which both object poses
and spatial relations are specified in six degrees of freedom (6-DoF). In the literature,
such three-dimensional models are usually limited to modeling object positions
without orientations. In addition, we expect scene category representations to consider
uncertainties in spatial relations. Since modeling uncertainties in 6-DoF with
parametric distributions is tedious, a non-parametric approach such as ISMs is more
appropriate for us. However, it is an open issue how to adapt their representation
and the algorithms operating on them in order to model scenes in full 6-DoF while
maintaining efficiency. The same holds for requirement 2, as ISMs in the field of
part-based object recognition are only able to represent relations between a single
so-called reference part and all other parts of the object instead of relating arbitrary
combinations of parts of an object [21, p. 70].
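As a hedged illustration of why 6-DoF relations support viewpoint independence, the
sketch below expresses the pose of one object in the frame of another and shows that
this relative pose does not change when both objects are observed from a different
viewpoint. Poses are assumed to be pairs of a position and a unit quaternion
(w, x, y, z); all function names are hypothetical and not taken from the thesis.

```python
def q_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


def q_conj(q):
    """Conjugate, which equals the inverse for unit quaternions."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def rotate(q, v):
    """Rotate the 3-D vector v by the unit quaternion q."""
    _, x, y, z = q_mul(q_mul(q, (0.0, *v)), q_conj(q))
    return (x, y, z)


def relative_pose(pose_a, pose_b):
    """Pose of b expressed in the frame of a (a 6-DoF spatial relation)."""
    (pa, qa), (pb, qb) = pose_a, pose_b
    diff = tuple(b - a for a, b in zip(pa, pb))
    return rotate(q_conj(qa), diff), q_mul(q_conj(qa), qb)


def transform(pose, world_q, world_t):
    """Apply a rigid transform, e.g. a change of viewpoint, to a pose."""
    p, q = pose
    moved = tuple(r + t for r, t in zip(rotate(world_q, p), world_t))
    return moved, q_mul(world_q, q)
```

Calling `relative_pose` on two poses before and after applying the same `transform`
to both yields the same result up to floating-point error.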
Household scenes like in Fig. 1.2 can, e.g., be described as a single global scene
or as a combination of local scenes of presumably different categories. Since local
2 Even though the inventors of the Implicit Shape Models present a probabilistic motivation for their
Fig. 1.2 Example configuration of objects—a scene—in our laboratory setup. It is composed of
subscenes. The objects of each subscene are surrounded by boxes which differ in color depending
on which subscene they belong to
are known in advance, either from prior knowledge [32] or by way of intermediate
objects [29].
On the contrary, autonomy [37, p. 35] demands from robots the ability to adapt
their knowledge to arbitrary environments. This holds in particular for object search.
Consequently, we require the following for this problem:
6. Hypotheses about the 6-DoF poses of searched objects should not be predefined
but rather predicted at runtime from estimates about present scenes.
7. A realistic model for how to search for objects with visual sensors. The choice of
sensor viewpoints during the search should consider three-dimensional space,
taking into account both sensor position and orientation. It should also precisely
model how the interdependence between sensor viewpoints and the 6-DoF poses
of the searched object affects visual perception.
In order to meet these requirements, we designed an optimization problem and
algorithm for object search and decided to combine them with our method for
scene recognition. This allows for guiding the subsequent search for missing objects
through information about partially recognized scenes. For the special case of the
place setting in 1 in Fig. 1.1, this procedure could be verbalized by a robot as
follows: “I have found a milk carton on the table that should belong to a scene of
the category “Cereals—Setting”. Let’s estimate where I should look for the
missing cereals box that belongs to this category, too.” In this thesis, we call such an
approach Active Scene Recognition (ASR). In cognitive science, this term [43] has
been coined for a process in which a human observer improves his capabilities in
visual perception by deliberately changing between points of view. This contrasts with
scene recognition in which the observer is immobile. In the computer science
literature, recognizing scenes and searching for objects are usually investigated as separate
problems in different research fields. Scene recognition is generally performed on a
single sensor-reading. Reference [43] refers to that as a passive approach. In order
to stress the difference between scene recognition with an immobile observer and
recognition with a mobile observer, we designate scene recognition without object
search as Passive Scene Recognition (PSR).
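The interplay of scene recognition and object search described above can be sketched
as a simple loop: predict where missing objects should be, pick the view that covers
most of these hypotheses, look there, and repeat. The following toy example is only a
sketch under strong assumptions. Positions stand in for 6-DoF poses, a view is
reduced to a sphere given by center and radius, and the callbacks `predict_pose` and
`detect` are hypothetical placeholders rather than interfaces from the thesis.

```python
import math


def count_covered(view_region, hypotheses):
    """Number of hypothesized positions inside a spherical view region."""
    center, radius = view_region
    return sum(math.dist(center, h) <= radius for h in hypotheses)


def active_scene_recognition(found, scene_objects, predict_pose,
                             candidate_views, detect):
    """Alternate between pose prediction and object search.

    found: dict name -> position of objects localized so far.
    scene_objects: names of all objects in the scene category.
    predict_pose(name, found): hypothesized position of a missing object.
    candidate_views: dict view_id -> (center, radius), a crude visibility model.
    detect(view_id): dict name -> position of objects seen from that view.
    """
    found = dict(found)  # do not modify the caller's dictionary
    visited = []
    while True:
        missing = [n for n in scene_objects if n not in found]
        remaining = [v for v in candidate_views if v not in visited]
        if not missing or not remaining:
            return found, visited
        hypotheses = [predict_pose(n, found) for n in missing]
        # Greedy choice: the view covering the most pose hypotheses.
        view = max(remaining,
                   key=lambda v: count_covered(candidate_views[v], hypotheses))
        visited.append(view)
        # Keep only detections that belong to the scene category.
        found.update({n: p for n, p in detect(view).items()
                      if n in scene_objects})
```

The greedy view choice is a design shortcut for brevity; requirement 7 asks for a far
more realistic model of sensor viewpoints than the spheres used here.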
1.1 Motivation
Fig. 1.3 Overview of the steps in the principal method for PbD of skills from user demonstration.
Derived from [14]. Figure labels: Human, Robot, Perception, Sensors, Actors, User demonstration:
how, User intention: why, Model-based Interpretation, Segmentation, Abstraction, Generalization,
Modeling + Simulation, Transfer, Simulation + Validation, Execution, Evaluation and Adaption,
User interaction, Background knowledge
by the user himself or by directly controlling the robot that is to be taught [14].
Our work is based on the first approach, transferring skills from humans to robots.
Demonstrations by users can be mapped to differing robot systems, in contrast to
demonstrations with a robot that are restricted to that target system [14]. The different
steps of this approach, shown in Fig. 1.3, are as follows: It is usual to demonstrate
different variations of the same skill, all annotated by the user. This allows for a
subsequent step, based on learning algorithms, to abstract from the concrete setup
(in which each demonstration takes place) to a generalized concept of the skill itself.
This conceptual representation encodes the goals which are the purpose of a learnt
skill and enables robots to adapt their skill knowledge to deviating situations. In that
respect, PbD differs from the related Imitation Learning, the perspective of which is
focused on reproducing and adapting demonstrated motions rather than abstracting
to a conceptual representation. Before the robot can execute such conceptual
knowledge, an additional step in PbD transfers the skill model to the kinematics of the
target system.
Assuming that an exemplary robot has four different skills from PbD at its disposal,
like in Fig. 1.1, the next problem is whether any of these skills should be used in the
presence of a specific scene, and if so, which one? Given the scene in Fig. 1.1,
the robot could opt for bringing cereals because, e.g., the present milk carton has
always been observed next to a cereals box in a scene category named “Cereals—
Setting”. Or it could choose to clear the table as a knife lies on top of the plate in
the middle of the place setting, which is usual in scene category “Setting—Clear
the Table”. Deciding which of these skills is applicable leads us back to perception
and decisional autonomy. The applicability of a skill can be formalized as a set of
preconditions [25] that have to be fulfilled. As the aforementioned example illustrates,
scenes or rather their presence is an important cue among those preconditions. Since
skills, generated by PbD, are expected to adapt to changing environmental conditions
during execution, this must also be expected of their preconditions and in particular of
the scene category models, to which some of these preconditions refer.
We designed the contributions of this thesis to seamlessly integrate into the
approach of Jaekel et al. [24] for PbD of manipulation skills. How both systems
are interrelated is visualized by the connected pair of arcs in Fig. 1.4. The learning
of skill models takes place in two steps which are visible in the upper arc: In the first
place, the demonstration of a user is generalized to a skill model. Then this model is
specialized to the target robot system. Just before the skill is executed on the target
system, its model references representations of those scene categories that are among
the preconditions for this skill. We can assume that preconditions for a skill model,
generated by PbD, should rather rely on scene category models that are specialized
in this skill, instead of scene category models which are trained independently of
the skill. Nevertheless, this does not exclude that the same category model can be
referenced by multiple skills. For example, tea drinkers would want a category model
named “Drinks—Setting” for breakfast to associate milk to tea rather than to coffee.
These user-specific preferences can, e.g., also hold for a specific order in which food
or silverware is arranged in a cupboard. This especially affects the spatial
relations between the objects in a scene. In order to create user-specific scene category
models in a non-expert-friendly manner, we adopt the principle of learning from user
demonstrations in our learning of scene category models. Learning such category
models and their usage in scene recognition is depicted in the lower arc in Fig. 1.4.
For each scene category that is a precondition to a specific manipulation skill, the
user presents a number of possible variations of object configurations in the course
of a demonstration. Every variation is recorded by sensors and interpreted by means
of visual perception. Based on estimates about the names and locations of the
concerned objects, a learning algorithm for scene classifiers first decides which objects
to connect pairwise by spatial relations, before the actual classifier is deduced from
the recorded estimates under consideration of which relations have been selected.
Once scene classifiers for all preconditions of a given skill are acquired, they can be
used for Active Scene Recognition: A robot that wants to apply a skill at runtime
first uses the set of related scene classifiers to check for the presence of the required
local scenes. More precisely, it tries to extract scene models from its percepts before
it starts to employ the learnt skill model. This is mainly achieved by repeatedly
Fig. 1.4 Manipulation skills from PbD have to be grounded in recognized scenes. Upper arc is
derived from [24, p. 8]. Figure labels: Programming by Demonstration of Manipulation Skills,
Grounding in scenes
We deduced our definition of scenes from that by Ulbrich et al. [42], presented at
the beginning of this chapter. Just like them, we regard scenes as snapshots of the
environment and not as processes with a start and end point in time. According to the
authors, scenes can on the highest level be subdivided into their elements and actors,
Fig. 1.5 Definition of those data structures as entity-relationship models [9] that are input and
output to scene recognition. Figure labels: Object 1 ... Object n (Name, 3-D-Position,
3-D-Orientation), Scene Recognition, Scene, Scene Reference, Element, Object, Models, relation with
also called observers. Representations of actors focus on their skills. Since dealing
with skill modeling can be outsourced to PbD of manipulation skills as developed by
Jaekel et al. [24], we deliberately leave them out in our scene definition. Furthermore,
we only consider those scene elements that are relevant to the successful execution of
a given skill. It is up to the human demonstrator to pick out the relevant elements in
the real world. Thus, the learning of scene category models takes place in a supervised
manner [15, p. 16]. For robots to use scene classifiers in locations—different from
the places where the demonstrations take place—we usually omit the scenery and
restrict ourselves to dynamic scene elements and their interrelations. If elements of
the scenery are indispensable to a skill, they can nevertheless be considered as well.
The input to our scene classifiers, carrying out scene recognition, is a set of object
models which is visible in Fig. 1.5 on the left. Object models are usually obtained
from estimates about those objects that are present in the environment of the robot.
As features, each object model includes a 6-DoF pose and a name tag. Using position
and orientation information in combination instead of limiting ourselves to positions
is particularly important when it comes to manipulation. For example, when a robot
wants to pour something into a cup, it does not only have to take into account that
the cup has to be within its reach but also that it is standing upright. The name tag
in turn grants access to object-specific information like training data for its visual
localization or surface models for its visualization.
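The cup example can be made concrete with a small sketch of an object model that holds
a name tag and a 6-DoF pose. It is only an illustration under assumptions: the class
and function names are hypothetical, the cup's local z-axis is assumed to point out of
its opening, and the reach and tilt limits are arbitrary values.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectModel:
    name: str           # name tag, key to object-specific data
    position: tuple     # (x, y, z) in the world frame
    orientation: tuple  # unit quaternion (w, x, y, z)


def up_axis_z(q):
    """World z-component of the object's local z-axis after rotation by q."""
    _, x, y, _ = q
    return 1.0 - 2.0 * (x * x + y * y)


def can_pour_into(cup, robot_base, reach=0.8, max_tilt_deg=10.0):
    """A cup qualifies only if it is within reach and standing upright."""
    within_reach = math.dist(cup.position, robot_base) <= reach
    upright = up_axis_z(cup.orientation) >= math.cos(math.radians(max_tilt_deg))
    return within_reach and upright
```

A position-only model could answer the reach question but not the upright question,
which is the point made in the paragraph above.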
For each scene that a scene classifier recognizes, a scene model such as on the
right in Fig. 1.5 is output. Those models consist of objects that are connected by
binary [18], spatial relations. The objects in a scene model are a subset of the input
to the classifier and adopt their features. Strictly speaking, spatial relations do not
connect the objects within a scene. Instead, all objects are connected to a common
scene reference. We define objects as being the elements of a scene. In contrast, the
reference is a placeholder for the scene itself. Its location is nearly identical to that
of one of the objects. The name of the reference is equivalent to the name of the
scene category, prescribed by the demonstrator when learning the scene category
model for the employed classifier. Moreover, the reference has a confidence value
that expresses the confidence of the classifier in the existence of the scene. This
confidence is derived from confidences about how well the relations within the considered
scene category are fulfilled. While the representation of each relation is encapsulated
in the employed scene category model, it is the scene classifier that compares this
knowledge to given object models in order to calculate confidences. Scene confi-
dences are tremendously important when it comes to integrating our approach with
manipulation skill execution. Their values decide whether the preconditions of a skill
are met.
Going back to the distinction between symbolic and subsymbolic approaches to
scene modeling, the question is which of the two approaches is better suited to calculating
scene models as defined in the previous section. Symbolic approaches make it possible
to abstract numerical data about scenes, like estimates from visual perception, into
natural language descriptions. In order to describe spatial configurations of objects in
a symbolic manner, as our scene model does, the concept of Qualitative Spatial Rela-
tions (QSR) [12] has been introduced. Mathematical definitions for QSRs such as “on
top of”, “left of”, “inside” have been developed, as well as corresponding computa-
tional models to decide about their presence in numerical data like images or point
clouds. QSRs allow for estimating qualitative information about isolated aspects of
scenes. Combining them among each other and with attributes of objects delivers
rich, language-based scene descriptions. Encoding them in probabilistic frameworks,
for instance, allows for classifying scenes [31]. Even though classification derives
information such as the type of a scene, this can only be done on a qualitative level.
Besides, the computational models for relations that are part of those descriptions
have to be designed and parameterized by expert users. This procedure is prone to
errors, since inappropriate design or parameterization can lead to models that either
overlook the decisive details that distinguish scenes altogether or are too coarse to capture
those details when the scenes are similar. For example, in 2 in Fig. 1.1, a
description like “Fork and knife touch each other” does not suffice to identify which
of the five scene categories is meant. Working in a quantitative manner, a classifier
may not provide as general scene descriptions as a symbolic approach, but can rely
on more informative data [19] when it comes to scene classification. Because the
calculations of our classifiers rely entirely on the subsymbolic information that
is encapsulated in scene category models, symbols such as the names of spatial relations
are irrelevant in our approach to scene recognition. The only employed symbols are
name tags for objects on the input side and for scenes on the output side.
According to [18], all binary spatial relations have in common that they describe
relative poses between pairs of objects. In Euclidean space, six parameters for trans-
lation and rotation are required to express the pose of a rigid body [41]. Each of the
various mathematical formalisms for translation and rotation [41] therefore provides
the most generic manner to characterize spatial relations, though it provides no means
to organize relations into categories. In contrast, there is no mathematical formalism
for QSRs that covers a sufficiently large number of types of spatial relations in order
to be able to realistically model scenes in real-world environments [12]. Instead, a
variety of concepts for different types of spatial relations [12] coexist, for example for
topological spatial relations [11]. Each concept just allows for distinguishing among
the subset of spatial relations that it represents. Since we require that scene category
models are learnt from demonstrations, a human demonstrator would be expected to
assign and parameterize mathematical definitions of spatial relations for each given
real-world scene. This in turn would require expert knowledge about QSRs, which
is contrary to the PbD paradigm. We therefore define a unified representation of spatial relations
by means of translations and rotations with no further abstraction, as asked in
requirement 1.
Fig. 1.6 All possible relation topologies for scene categories with three objects and in which all
objects are connected
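A relation that is expressed purely as a relative pose can be sketched as follows. This is an illustrative example under stated assumptions, not the representation used in this work: orientations are unit quaternions (w, x, y, z), which store the three rotational parameters in four numbers under a unit-norm constraint, and the function names `q_conj`, `q_mul`, `q_rotate` and `relative_pose` are hypothetical.

```python
import math


def q_conj(q):
    """Conjugate of a unit quaternion, i.e. the inverse rotation."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def q_mul(a, b):
    """Hamilton product of two quaternions (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


def q_rotate(q, v):
    """Rotate the 3-D vector v by the unit quaternion q."""
    _, x, y, z = q_mul(q_mul(q, (0.0, v[0], v[1], v[2])), q_conj(q))
    return (x, y, z)


def relative_pose(pos_a, rot_a, pos_b, rot_b):
    """Pose of object b expressed in the local frame of object a."""
    inv_a = q_conj(rot_a)
    offset = tuple(pb - pa for pa, pb in zip(pos_a, pos_b))
    return q_rotate(inv_a, offset), q_mul(inv_a, rot_b)


# The same fork/plate arrangement at two different places on a table:
# the second copy is turned by 90 degrees about the vertical axis.
s = math.sqrt(0.5)
identity, quarter_turn = (1.0, 0.0, 0.0, 0.0), (s, 0.0, 0.0, s)
rel_1 = relative_pose((1.0, 0.0, 0.0), identity, (1.0, 1.0, 0.0), identity)
rel_2 = relative_pose((0.0, 0.0, 0.0), quarter_turn, (-1.0, 0.0, 0.0), quarter_turn)
```

Although the absolute poses differ, `rel_1` and `rel_2` agree up to floating-point error, which illustrates why a relative pose characterizes a relation without any symbolic label such as "left of".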
The maximum number of binary relations that any scene category model for n
objects can represent corresponds to the number of edges $\frac{n \cdot (n-1)}{2}$ in a complete graph
[44]. In general, we visualize as a graph the specific combination of relations that
every scene category model represents. Undirected graphs, resulting from that kind
of visualization, are shown in Fig. 1.6. Their vertices stand for objects and their
edges for relations like {R1, R2, R3}. While 1 in Fig. 1.6 depicts a complete graph,
three additional combinations of fewer relations are shown on its right. We call these
combinations relation topologies. Figure 1.6 shows all relation topologies that are
possible for three objects.
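The count of relation topologies can be reproduced with a brute-force enumeration. The sketch below is illustrative only: it treats objects as distinguishable vertices numbered from 0, interprets a relation topology as any set of undirected edges that connects all objects, and uses the hypothetical function names `max_relations`, `is_connected` and `relation_topologies`.

```python
from itertools import combinations


def max_relations(n):
    """Number of edges in a complete graph on n objects: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def is_connected(n, edges):
    """True if the undirected graph on vertices 0..n-1 is connected."""
    if n == 0:
        return True
    adjacency = {v: set() for v in range(n)}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = {0}, [0]
    while stack:  # depth-first traversal starting at vertex 0
        for neighbour in adjacency[stack.pop()]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == n


def relation_topologies(n):
    """Enumerate every edge subset that connects all n objects."""
    all_edges = list(combinations(range(n), 2))
    return [subset
            for k in range(len(all_edges) + 1)
            for subset in combinations(all_edges, k)
            if is_connected(n, subset)]


topologies_for_three = relation_topologies(3)
```

For three objects this yields the complete graph plus the three two-edge combinations, four topologies in total. The enumeration visits every subset of the `max_relations(n)` possible edges, so it is only practical for small n.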
In order to verify if a configuration of objects is consistent with a scene cate-
gory, each of the represented relations has to be checked for being fulfilled by a
corresponding object pair in the configuration. Computational costs in scene recog-
nition increase disproportionally when raising the number of objects in the scene to
be recognized. This comes as a result of an equally disproportional increase of the
maximum number of modeled relations. Thus, relation topologies with few relations
should be favored. Not every relation that can be defined on a set of objects is equally
relevant to scene recognition. For example, in 1 in Fig. 1.1, the depicted place setting
contains two forks on the left of the plate. When we look at guidelines [45] for laying
place settings, two rules are commonly related to the relative poses of these three
objects. The first relates to the relative poses of the forks, the second to the relative pose
of the plate and the fork lying closest to it. Consequently, if we connect the fork in
the middle to both the plate and the other fork by spatial relations, there is no need to
relate the other fork to the plate as well. Efficiency is not the only issue to consider
when deciding which relation topology to use for creating a scene category model.
False positives [37, p. 770] may occur in scene recognition, depending on the relation
topology employed. This issue is discussed in more detail in Sect. 3.6.1. In order to
effectively optimize efficiency and accuracy of scene recognition by choosing suit-
able relation topologies, we ought to base our scene category model on a method
In Fig. 1.2, we gave an example for the cluttered indoor scenarios we address in
this thesis and interpret as combinations of several local scenes. A mobile robot is
expected to recognize such scenes, based on scene category models that are learnt
from demonstrations. A demonstration can, for example, be performed in front of an
observing robot, as indicated in the lower middle of Fig. 1.7. Since performing demonstrations
usually requires significant effort from the human demonstrator, reusability of
category models is of great importance. In order to address that, we introduced the
concept of local scenes in which portions of a vast global scene are independently
modeled by their own category models. The place setting on the table in Fig. 1.1
could be such a local scene. We call it “Setting—Ready for Breakfast”. On both the
left and the right of Fig. 1.7, “Setting—Ready for Breakfast” appears at two different
locations on the same table. Spatial relations are not the only means of defining a
category model. Instead, this can also be done with the help of absolute object poses.4
Symbols representing scene category models that could have been generated by both
approaches are visible in the upper middle of Fig. 1.7.
We assume that the mobile robot with the pivoting head of visual sensors that is
shown in Fig. 1.1 is exploring its environment by alternating scene recognition and
three-dimensional object search, i.e. by performing Active Scene Recognition (ASR).
When doing so with the help of spatial relations, the robot is going to check whether
the absolute object poses, acquired during ASR, comply with the spatial relations in
the employed category model. The absolute poses are not directly checked. Instead,
3 In the following, we use the term scene (category) model to designate the entire hierarchical model,
Dennis. I say, boys, we must have a kodak ready for the unveiling,
and catch the girls’ faces on the fly.
Helen. They deserve some kind of penance for their behaviour this
afternoon.
Amy. Yes, even in addition to our intended neglect when Lord
Ferrol arrives.
Helen. Oh, we can make it a capital joke, and if Lord Ferrol is only
nice we can have both the joke and a good time.
Amy. Well, I don’t care what Lord Ferrol is; I am going to use him
to punish—them.
Helen. Oh! Amy, why that significant pause? We all know how
them spells his name.
Rose (springing to her feet with a scream). Girls! Girls!!
Amy (startled). What’s the matter?
Rose (melodramatically). My Lords! My Lords! There are traitors
in the camp and treachery stalks rampant.
Curtain
ACT II
Helen (going round table and sitting, still holding cup). Not for
you.
Enter Rose with hot water pot. Men return to fireplace. Amy sits
easy-chair l. of tea-table.
Rose (rubbing teapot against Dennis’s hand as she passes). Hot
water.
Dennis (jumping and looking at his hand). Not the least doubt of
it.
Helen. Make the most of it, boys: it’s the last time our tea will be
sweet to you!
Dennis. Why is Helen like a “P. & O.” steamer?
Helen (indignantly). I’m not!
Steven. Because she’s steaming the tea?
Dennis. No.
Amy. Don’t keep us in suspense.
Steven. Because she’s full of tease.
George. You make me tired.
Steven. Is that why you sat down so often on the ice?
Helen. Isn’t that just like George,—sitting round, while the rest do
the work.
George. If you think there’s any particular pleasure in sitting in a
snowdrift, there’s one outside, right against the verandah.
Steven. That would never do at present. It might result in a cold,
and so destroy our little plan of winning the maiden affections of—
well, I won’t give him a name till I have seen him!
Helen. It is hard to put up with foreign titles, but as long as our
government will not protect that industry, the home product is so
rude, boorish, VULGAR, and YOUNG, that we cannot help—
Rose (interrupting). Listen! (Pause.) There’s the carriage.
Rose (rising and bringing hassock). Let me give you this hassock
—one is so uncomfortable in these deep chairs without one.
Lord F. Er, Thanks! You’re very kind.
Helen (tenderly). Lord Ferrol, will you tell me how you like your
tea?
Lord F. Strong, please, with plenty of cream and sugar.
Amy (admiringly). Ah, how nice it is to find a man who takes his
tea as it should be taken! (looking at men scornfully). It is really a
mental labor to pour tea for the average man.
Dennis. Average is a condition common to many; therefore we are
common. Yet somebody said the common people were never wrong.
Helen. Well, they may never be wrong, but they can be
uncommonly disagreeable!
Lord F. Yes, that’s very true. You know, at home we don’t have
much to do with that class, but out here you can’t keep away from
them.
Amy (turning to men). There! I hope you are properly crushed?
Lord F. (turning to Amy). Eh!
Amy (leaning over Lord F. tenderly). Oh, I wasn’t speaking to you,
dear Lord Ferrol!
Mrs. W. I fear that you have had some unpleasant experiences
here, from the way you speak.
Lord F. Rather. (Helen hands cup with winning smile.) Thanks,
awfully!
George. Perhaps Lord Ferrol will tell us some of them; we may be
able to free him from a wrong impression.
Lord F. The awful bore over here is, that every one tries to make
jokes. Now, a joke is very jolly after dinner, or when one goes to
“Punch” for it.
Steven. To what?
Lord F. To “Punch,” don’t you know,—the paper.
Steven. Oh! Excuse my denseness; I thought we were discussing
jokes.
Lord F. I beg pardon?
Amy. Don’t mind him, Lord Ferrol.
George. No, like “Punch,” he’s only trying to be humorous.
Lord F. Er, is that an American joke?
Dennis. I always thought Punch was a British joke!
Lord F. Er, then you Americans do think it funny?
George. Singularly!
Lord F. What I object to in this country is the way one’s inferiors
joke. It’s such bad form.
Rose (horrified). Surely they haven’t tried to joke you?
Lord F. Yes. Now to-day, coming up here, I took my luggage to the
station, and got my brasses, but forgot your direction that it must be
re-labelled at the Junction, so they wer’n’t put off there. I spoke to
the guard, and he was so vastly obliging in promising to have them
sent back that I gave him a deem.
Omnes. A what?
Lord F. A deem—your small coin that’s almost as much as our
sixpence, don’t you know.
Omnes. Oh, yes!
Lord F. Well, the fellow looked at it, and then he smiled, and said
loud enough for the whole car to hear: “My dear John Bull, don’t you
sling your wealth about in this prodigal way. You take it home, and
put it out at compound interest, and some day you’ll buy out Gould
or Rockefeller.”
Helen. How shockingly rude! What did you do?
Lord F. I told him if he didn’t behave himself, I’d give him in
charge. (Men all laugh.) Now, is that another of your American
jokes?
Dennis (aside). Oh! isn’t this rich?
Amy (aside to Lord F.). Oh, you are beautiful!
Lord F. (bewildered and starting). Thanks awfully,—if you really
mean it!
Steven (coming down to back of Lord F.’s chair). What did she
say, Lord Ferrol? You must take Miss Sherman with a grain of
allowance.
Amy. I’m not a pill, thank you.
Lord F. Why, who said you were?
Dennis. Only a homœopathic sugarplum.
Lord F. I don’t understand.
Steven (aside to Lord F.). Keep it up, old man. It’s superb!
Lord F. I beg pardon,—did you speak to me?
Steven (retreating to fireplace). Oh, no! only addressing vacancy.
Mrs. W. I hope, Lord Ferrol, that there has been enough pleasant
in your trip to make you forget what has been disagreeable.
Lord F. Er, quite so. The trip has been vastly enjoyable.
Rose. Where have you been?
Lord F. I landed in New York and spent the night there, but it was
such a bore that I went on to Niagara the next day. From there I
travelled through the Rockies, getting some jolly sport, and then
went to Florida.
Mrs. W. Why, you have seen a large part of our country; even more
than your father did. I remember his amazement at our autumn
foliage. He said it was the most surprising thing in the trip.
Amy. What did you think of it, Lord Ferrol?
Lord F. It struck me as rather gaudy.
Rose. Why, I had never thought of it, but perhaps it is a little vivid.
Dennis (aside to men). Oh, how I should like to kick him!
Steven (aside to Dennis). Hush! You forget that “Codlin’s your
friend—not Short.”
George. Didn’t you ever see a Venetian sunset?
Lord F. Oh, yes. Why do you ask?
George (sarcastically). I merely thought it might be open to the
same objection!
Lord F. It might—I don’t remember. I’ll look it up in my journal
when I get home, and see if it impressed me at the time.
Helen. Do you keep a journal? (Rises and sits on footstool at Lord
F.’s feet.) How delightful! (Beseechingly.) Oh, won’t you let me look
at what you have with you?
Rose. Please, Lord Ferrol!
Amy. Ah, do!
Lord F. It would bore you, I’m sure.
Dennis (aside). I don’t care if he isn’t a double-barrelled earl, I
should like to kick him all the same!
Helen. Lord Ferrol, you must let us hear some of it.
Rose. If you don’t we shall think you have said something
uncomplimentary of the American women.
Lord F. No, I assure you I have been quite delighted.
Amy. Then why won’t you let us see it?
Lord F. Er, I couldn’t, you know; but if you really are in earnest, I’ll
read you some extracts.
Omnes. Oh, do!
Lord F. I ought to explain that I started with the intention of
writing a book on America, so this (producing book) is not merely
what I did and saw, but desultory notes on the States.
Rose. How interesting!
Lord F. After your suggestion of what I have written of the
American women, I think it best to give you some of my notes on
them.
Mrs. W. By all means!
Lord F. (reading). “Reached Washington, the American capital,
and went direct to Mrs. ——. Cabman charged me sixteen shillings.
When I made a row, butler sent for my host, who, instead of calling a
constable, made me pay the fellow, by insisting on paying it himself.
Mr. —— is a Senator, and is seen very little about the house, from
which I infer the American men are not domestic—presumably,
because of their wild life—”
Mrs. W. (with anxiety). Their what?
Lord F. Their wild life,—spending so much of their time on the
plains, don’t you know.
Mrs. W. (relieved). Oh! Excuse my misapprehension.
Lord F. (reading). “The daughter is very pretty, which Mrs. ——
tells me is unusual in Washington society—as if I could be taken in
by such an obvious Dowager puff! (Men all point at Mrs. W. and
laugh. Mrs. W. shakes her finger reprovingly.) Miss —— says the
Boston girls are plain and thin, due to their living almost wholly on
fads, which are very unhealthy.” (Speaking.) I couldn’t find that
word in the dictionary.
Steven. Sort of intellectual chewing-gum, Lord Ferrol.
Dennis. Yes, and like gum, you never get beyond a certain point
with it. It’s very fatiguing to the jaw.
Lord F. (reading). “She says the New York girls are the best
dressed in the country, being hired by the dressmakers to wear
gowns, to make the girls of other cities envious, and that this is
where they get all the money they spend. Very remarkable!”
Helen. Something like sandwich men, evidently.
Lord F. (reading). “The Philadelphia girls, she says, are very fast,
but never for long at a time, because the men get sleepy and must
have afternoon naps.”
Amy. Did she tell you that insomnia is thought to make one very
distinguished there?
Lord F. (making note in book). Er, thanks, awfully. (Reading.)
“She says that the Baltimore girls are great beauties, and marry so
quickly that there is generally a scarcity. It is proposed to start a joint
stock company to colonise that city with the surplus from Boston,
and she thinks there ought to be lots of money in it! Another extreme
case of American dollar worship! The Western girls, she told me, are
all blizzards.” (Speaking.) I don’t think I could have mistaken the
word, for I made her spell it. Yet the American dictionary defines
blizzard as a great wind or snow storm.
George. That is it, Lord Ferrol. They talk so much that it gives the
effect of a wind storm.
Lord F. Ah! much obliged. (Reading.) “Went to eight receptions in
one afternoon, where I was introduced to a lot of people, and talked
to nobody. Dined out somewhere, but can’t remember the name.
Took in a Miss ——, a most charming and lovely—”
Dennis (interrupting). Ah, there!
Lord F. I beg pardon.
Rose. You must forgive his rude interruption, Lord Ferrol.
Lord F. Oh, certainly! You’re sure you’re not bored?
Omnes. By no means. Do go on.
Lord F. “A most charming and lovely girl from New York. She
thinks Miss —— characterised the cities rightly, except her own.
Asked me if I thought she was only a dressmaking advertisement? As
scarcely any of her dress was to be seen, I replied that as I couldn’t
look below the table, I was sure it was the last thing one would
accuse her of being. She blushed so violently that I had to tell her
that I had seen much worse dresses in London; but that didn’t please
her any better, and she talked to the man next her for the rest of the
evening. (All have difficulty in suppressing their laughter.) I met a
Boston girl afterwards who—”
[Bell rings.